
Succeeding with ServiceNow Discovery & ITOM

Having a successful discovery project is not easy; in fact, it is one of the most challenging areas of ITOM to “get right”. ServiceNow Discovery is one of the main tools for creating a reliable data layer that feeds other processes. Through discovery, organizations get their metadata, cloud/PaaS and infrastructure in order. So why do so many organizations fail in this endeavor? In this article we deep-dive into the most critical areas for succeeding with ServiceNow Discovery.


The three strategic pillars of discovery

Discovery typically comes as an exercise for the entire enterprise and leadership team. Based on our experience at Einar & Partners, we have chosen to categorize the different topics into three strategic pillars. Each pillar is of equal importance, yet many customers, enterprises and experts focus on just a few. As the old cliché goes, sometimes one can’t see the forest for the trees.

Our quick reference chart of the most critical areas to keep in mind

Organizational


Perhaps the most important pillar is the organizational one. Discovery is not so much a technical exercise as it is an organizational one. In our experience, over 80% of failed discovery projects fail due to neglect in this area.

Security Policies

Security and discovery go hand in hand. It is critically important to align with the security team at an early stage, especially if you are an organization with a lot of legacy IT. Exposing your entire infrastructure inventory in the cloud can be sensitive. Then there are also questions about credentials, encryption, security and access. Failure to involve the security team early might cause unpleasant discussions at best and a complete stop of the project at worst.

Political buy-in

Political buy-in does not necessarily mean the management team, although that’s also important. It’s more about finding champions within the organization who can act as diplomats. If you do not already use a discovery tool or have a CMDB, there’s a large probability that it will become a politically sensitive topic. Why, you might ask? Job security, based on our experience.

When a discovery tool is introduced, people fear for their relevance and role. Silos of data are often a thin veil over a firm’s internal boundaries. Different departments within a company, afraid of relinquishing power, are loath to share their data or change what they collect and how. Therefore, getting the political influencers on board and securing buy-in is extremely critical and one of the more difficult tasks.

Roles & Responsibilities

Expecting a successful discovery project? Then expect to allocate budget for some new roles and responsibilities. Failure to do so results in little to no accountability and frustration from co-workers who are suddenly expected to help without formal approval. Setting expectations with the organization, assigning the right roles and deciding who is in the driver’s seat is a must.

Processes


Having solid processes is critical to succeeding with discovery. After all, at the end of the day we are trying to coordinate potentially hundreds of data sources and stakeholders into one data lake. Not streamlining the processes for how to execute is dangerous. Thinking one can do things “ad hoc” as the need arises? A critical mistake too many organizations fall victim to.

Access & Credentials

Tightly aligned with the security policies and team, this is one of the more critical pieces. How will credentials for discovery be created, distributed and stored? When a new system or source is connected, its credentials must be as well. Following a rigorous process for handling discovery credentials is a critical puzzle piece.

Firewall changes

When working with discovery there is a need to allow access and open firewalls. Some organizations have a very fragmented network or strict segmentation. A process for how to maintain firewall openings suddenly becomes very important. The right ports, the right subnets and the right protocols must be documented and managed. Without one, you run the risk of constant errors, access issues and long lead times before discovery runs successfully.
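
As a minimal sketch of what “documented and managed” can look like in practice, the snippet below checks TCP reachability for a few example targets before a discovery schedule is launched. The hosts and ports are hypothetical placeholders; replace them with the openings documented for your own environment, and note that UDP-based protocols such as SNMP need a separate check.

```python
import socket

# Hypothetical targets and TCP ports - replace with your own documented
# firewall openings. UDP protocols (e.g. SNMP on 161) need a separate test.
TARGETS = {
    "10.0.10.5": [22, 443],   # e.g. SSH and HTTPS to a Linux host
    "10.0.20.8": [135, 445],  # e.g. WMI/SMB to a Windows host
}

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for host, ports in TARGETS.items():
        for port in ports:
            status = "open" if port_open(host, port) else "BLOCKED"
            print(f"{host}:{port} -> {status}")
```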

Rollout method

We’ve seen discovery projects complete in two months and we’ve seen them finish in two years. It all depends on the rollout method and how you plan around it. Discovery can be rolled out in many different ways, yet one common factor is the coordination between different teams and sign-off by CI owners. Choose the right rollout method and stick to the plan.

Technological


The last pillar of the discovery strategy is getting your house in order from a technological perspective. Neglecting this part might still result in a completed discovery project, but one where nobody uses or cares about the data.

Scoping CI Classes

One of the first and most crucial steps is to scope the appropriate CI classes. In other words, what do we want discovery to discover for us? This determines which key stakeholders to involve. Who investigates and inspects the data, and who uses it? Having a clear scope of CI classes to include, preferably in a stepped approach, is our recommendation.
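
As an illustration of a stepped scope, the structure below captures phases, CI classes and owners in one place. The class names follow ServiceNow’s cmdb_ci_* naming convention, but the phasing, owners and exact classes are hypothetical examples, not a recommendation for any specific environment.

```python
# Hypothetical phased CI class scope - adjust classes, owners and order to
# your own organization, and verify the class names in your own instance.
DISCOVERY_SCOPE = {
    "Phase 1": {
        "classes": ["cmdb_ci_win_server", "cmdb_ci_linux_server"],
        "owner": "Server operations",
    },
    "Phase 2": {
        "classes": ["cmdb_ci_vm_instance", "cmdb_ci_esx_server"],
        "owner": "Virtualization team",
    },
    "Phase 3": {
        "classes": ["cmdb_ci_appl", "cmdb_ci_db_instance"],
        "owner": "Application owners",
    },
}

for phase, scope in DISCOVERY_SCOPE.items():
    print(f'{phase} ({scope["owner"]}): {", ".join(scope["classes"])}')
```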

Subnets and sources

Once CI classes are scoped it’s time to dig deeper into the different network segments and sources. Where does the data reside and how do we access it? Where are the credentials stored and how can we optimize the discovery schedules? If you are a global company with data centers spread across the world, multiple clouds and local differences, this exercise tends to be the most time-consuming. Optimizing discovery for which networks and sources to target (and when) ensures stability and consistency.
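
A small planning sketch like the one below can make the “which networks, and when” question concrete. The subnets, sites, MID Servers and off-peak window are all hypothetical assumptions; the point is simply to keep the scheduling decisions explicit and reviewable.

```python
from datetime import time
from zoneinfo import ZoneInfo  # Python 3.9+, validates the timezone names

# Hypothetical subnets, sites and MID Servers - illustrative only.
SUBNETS = [
    {"cidr": "10.10.0.0/24", "site": "Frankfurt DC", "tz": "Europe/Berlin",   "mid": "mid-eu-01"},
    {"cidr": "10.20.0.0/24", "site": "Dallas DC",    "tz": "America/Chicago", "mid": "mid-us-01"},
]

# Running discovery outside local business hours is a common, but not universal, choice.
OFF_PEAK_START, OFF_PEAK_END = time(22, 0), time(5, 0)

for net in SUBNETS:
    tz = ZoneInfo(net["tz"])  # raises an error if the timezone name is misspelled
    print(f'{net["cidr"]} ({net["site"]}): run on {net["mid"]} '
          f'between {OFF_PEAK_START:%H:%M} and {OFF_PEAK_END:%H:%M} local time ({tz.key})')
```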

Sign off by CI owners

Different CI owners also have different requirements. Some owners might be concerned about the impact on the network, others about the impact on CPU and load, while a third team might be worried about data quality and individual attributes. In other words, it’s essential to have a governance process for the CMDB and to have CI owners inspect and sign off on discovery results. This way they also feel more connected to the project and are more likely to use the data.

Conclusion – making discovery successful


As we can see, there are many elements to a discovery rollout that need to be in place for success. The ones mentioned above are just a few, with many more puzzle pieces in the equation. More than anything, it is an exercise in politics, coordination and careful planning. Spending adequate time in the preparation phase is key to long-lasting discovery success. At Einar & Partners we recognize these elements and the sensitivity of each area. We therefore hope that our readers and clients will find this quick guide useful moving forward in their discovery adventures. And as always, we’re here to help.


Self-healing & AIOps – demystifying the hype

A hot topic within AIOps is without a doubt the promised land of self-healing, where an AIOps solution assists engineers and SREs with automatic actions. But just how effective is self-healing technology? Can it be relied upon, or is it merely a buzzword with little to no practical use? This introduction article from Einar & Partners covers the art of self-healing and what you can expect from it.


History and background

Typically, an engineer or SRE has a busy job. It is one of the more intense positions within a company: being available at uncommon hours and fighting outages while unhappy end-users wait for updates. How often have we heard the joke “never push to production on a Friday afternoon”, followed by a weekend of technical troubleshooting and pulling out hair in frustration? A horror scenario indeed, but an increasingly common one as organizations have to be extremely agile.

As the name indicates, an SRE – or site reliability engineer – has to be on call and fix issues as they arise while ensuring the business runs smoothly. Statistics indicate that an SRE spends, at best, 50% of their time fixing issues (as at Google), and at most organizations significantly more. But zooming in on that statistic, the question asked by big organizations like Google and Amazon is: how is that fixing time spent?

The answer is quite simple: most of the “troubleshooting” time is unfortunately spent on a concept called “TOIL”.

TOIL & DevOps – What is TOIL?

TOIL is the repetitive, the mundane, the tedious and unproductive work that an SRE has to execute daily. In other words, the tasks that can be automated and that create the most significant overhead in terms of time investment for an organization. Some examples include fetching log files, rebooting services, running scripts, finding information, running service checks, applying configurations or copying and pasting commands from a playbook.
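
To make this concrete, here is a minimal sketch that automates two of the toil examples above: fetching recent log lines and restarting a service. It assumes a Linux host with systemd, and the service name and log path are hypothetical placeholders.

```python
import subprocess
from pathlib import Path

def tail_log(path: str, lines: int = 50) -> str:
    """Return the last `lines` lines of a log file (simple approach for small files)."""
    content = Path(path).read_text(errors="replace").splitlines()
    return "\n".join(content[-lines:])

def restart_service(name: str) -> bool:
    """Restart a systemd service; returns True on success (Linux/systemd assumed)."""
    result = subprocess.run(["systemctl", "restart", name], capture_output=True, text=True)
    return result.returncode == 0

if __name__ == "__main__":
    # Hypothetical service and log path - adjust for your environment.
    print(tail_log("/var/log/myapp/app.log", lines=20))
    if not restart_service("myapp"):
        print("Restart failed - escalate to an engineer.")
```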

If an engineering department is not careful, too much TOIL can easily result in burnout due to its dull and repetitive nature. Simultaneously, while dealing with TOIL overload, engineers are expected to contribute to the development and code base of applications and services. This situation can easily create confusion about what an SRE is supposed to do: fight endless fires or contribute to design and optimization?

Why is TOIL bad?


  • Slows down innovation & progress
  • Reduction of quality due to manual work
  • Never-ending list of manual tasks that takes a long time to teach to new team members
  • Burnout
  • High OPEX due to low efficiency

“TOIL is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.”

Vivek Rau, Google

Best strategy for automatic remediation?


With the previous introduction in mind, how can organizations leverage modern solutions and technology to reduce TOIL? In a DevOps world where any given application may have hundreds of microservices, states to keep track of and endless dependencies, automation is vital.

A common misconception is that self-healing and automatic remediation will replace the in-depth troubleshooting that SREs and Ops perform. This is not the case, as fixing more complicated issues will require skilled engineers for the foreseeable future. Implementing auto-remediation has a different focus: automating the many small tasks rather than the few big ones.

Auto remediation and realistic use cases in AIOps

The philosophy of auto-remediation and self-healing is to shift away from the model where any given alert from an application starts with a human response. It flips the equation in favor of AIOps as the first point of contact rather than a person. Most applications and alerts have a set of standard steps to resolve a given issue. Sometimes the steps are simple, like restarting a service or gathering data. Other times the fix can be to change a configuration or to start a workflow (think of a decision tree).

On top of this, Ops teams and SREs have the normally expected tasks, like acknowledging alerts, categorizing issues, prioritizing incidents and updating tickets. Individually the tasks are very small, but put them together and you end up with most of the time spent just repeating the same steps. Over and over again. Self-healing aims to remove the element of repetition from the equation.
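
As a minimal sketch of this “runbook first, human second” idea, the snippet below maps alert types to standard remediation steps and escalates anything unknown to a person. The alert types, services and actions are hypothetical; in a real setup the actions would call your automation or orchestration tooling.

```python
# Hypothetical remediation actions - in practice these would trigger your
# automation tooling (scripts, workflows, orchestration or ITSM APIs).
def collect_diagnostics(alert):
    print(f"Collecting logs and metrics for {alert['service']} on {alert['host']}")

def restart_service(alert):
    print(f"Restarting {alert['service']} on {alert['host']}")

def clear_temp_files(alert):
    print(f"Clearing temporary files on {alert['host']}")

# Simple decision tree: alert type -> ordered list of standard steps.
RUNBOOKS = {
    "service_down":     [collect_diagnostics, restart_service],
    "disk_almost_full": [collect_diagnostics, clear_temp_files],
}

def handle_alert(alert):
    steps = RUNBOOKS.get(alert["type"])
    if steps is None:
        print(f"No runbook for '{alert['type']}' - escalating to the on-call engineer.")
        return
    for step in steps:
        step(alert)

handle_alert({"type": "service_down", "service": "checkout-api", "host": "web-03"})
handle_alert({"type": "certificate_expiring", "service": "sso", "host": "auth-01"})
```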

How organizations can save time (for real)

The data suggests that a significant portion of the time engineers spend is tied up in repetitive tasks. As such, AIOps and automatic remediation are about relieving the pressure from these types of tasks, so that SREs and Ops teams can focus on what really is essential: the troubleshooting and investigations where AI and automation fall short. The work only fit for the eyes and brain of a person.

Auto-remediation is therefore merely another tool in the toolbox of engineers. A good AIOps solution should analyze how issues were resolved historically, see what worked, and suggest appropriate actions to the engineers. With enough confidence (based on data), AIOps can start automatic workflows and trigger actions to assist engineers in their work.
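
A toy sketch of that confidence-driven approach is shown below: it ranks past remediations per alert type by success rate and only auto-triggers above a threshold, otherwise it merely suggests. The history, threshold and names are invented for illustration; a real AIOps platform would derive this from incident, change and alert data.

```python
from collections import defaultdict

# Hypothetical history of incidents: (alert type, remediation used, did it resolve?).
HISTORY = [
    ("service_down", "restart_service", True),
    ("service_down", "restart_service", True),
    ("service_down", "failover", False),
    ("service_down", "restart_service", True),
    ("disk_almost_full", "clear_temp_files", True),
    ("disk_almost_full", "clear_temp_files", False),
]

CONFIDENCE_THRESHOLD = 0.7   # only auto-trigger above this success rate
MIN_SAMPLES = 3              # and only with enough historical evidence

def best_remediation(alert_type):
    """Return (remediation, success_rate, attempts) for the best-scoring past fix."""
    stats = defaultdict(lambda: [0, 0])  # remediation -> [successes, attempts]
    for a_type, remediation, success in HISTORY:
        if a_type == alert_type:
            stats[remediation][1] += 1
            stats[remediation][0] += int(success)
    best = None
    for remediation, (ok, total) in stats.items():
        rate = ok / total
        if best is None or rate > best[1]:
            best = (remediation, rate, total)
    return best

for alert_type in ("service_down", "disk_almost_full"):
    remediation, rate, total = best_remediation(alert_type)
    action = "auto-trigger" if rate >= CONFIDENCE_THRESHOLD and total >= MIN_SAMPLES else "suggest only"
    print(f"{alert_type}: {remediation} (success rate {rate:.0%}, n={total}) -> {action}")
```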

This way, engineers are allowed to focus on the work which matters and free up headspace from the manual tasks. Ultimately this will enable organizations to lower operational expenditure and have a more innovative workforce. The time saved on automating can be re-invested in further automation, creating a positive feedback loop.

Risks and pitfalls with self-healing


Unfortunately, not everything is as picture-perfect as the hypothetical world that AIOps often suggests. To fully realize the value of automatic remediation, several pre-conditions must be met, such as:

  • Having core data connected to the AIOps platform. Without historical knowledge of how incidents were resolved, what solution worked, and the relation to infrastructure changes, AIOps will have a difficult time suggesting actions.
  • Connecting monitoring data. Having alerts and monitoring data feeding the AIOps system is crucial to reduce volume and correlate which remediation fits to what alert type.
  • Culture of automation. The cultural aspect of automation must not be forgotten. Allowing employees to dedicate time to create automation workflows that can be used by AIOps is crucial.

In the end, automatic remediation is about managing expectations of what it can and can’t do. We’re not quite at the stage yet where it replaces the role of an operator completely. What an organization should expect is for AIOps to help significantly with the workload and to reduce operational tasks.

Always keep in mind that “anything a human can do, a machine can also do.”

Moving beyond self-healing


So far we’ve covered the concept of TOIL and how it relates to self-healing. But what comes after automatic remediation? There are many paths a successful rollout of AIOps can take, but the holy grail (at least in the year 2021) lies in anomaly detection and proactive alerts. Ideally, SREs and operators should focus on proactive rather than reactive alerts, meaning that anomalies and deviant behavior can be detected early in logs, metrics and infrastructure through machine learning. Hopefully before a P1 ticket has been created.

Anomaly detection is not just a buzzword but one of the few areas where machine learning can be applied in the real world. Detecting outliers based on historical patterns is almost impossible for a human operator, as the sheer volume of metrics and alerts is simply too high. When SREs move from just reacting to alerts to proactively observing the state and behavior of applications and services, a technical wonder is in the making.
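
To illustrate the basic idea (real AIOps platforms use far more sophisticated models), here is a minimal sketch that flags outliers in a metric series against its own recent history using a rolling z-score. The latency values and thresholds are made up for the example.

```python
import statistics

def rolling_zscore_anomalies(values, window=30, threshold=3.0):
    """Flag points deviating more than `threshold` standard deviations
    from the mean of the preceding `window` samples."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mean = statistics.fmean(history)
        stdev = statistics.pstdev(history)
        if stdev > 0 and abs(values[i] - mean) / stdev > threshold:
            anomalies.append((i, values[i]))
    return anomalies

# Hypothetical latency series (ms) with one obvious spike at the end.
latency = [100 + (i % 5) for i in range(60)] + [450]
print(rolling_zscore_anomalies(latency))
```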

Getting to that stage is a maturity process just like anything else, a maturity process that more often than not starts with the organization and culture. If the mindset around how SREs spend their time does not change, and if TOIL is allowed to wreak havoc, the tools are of little importance at the end of the day.

Conclusion

Is automatic remediation a bit of hype? It depends.

Focusing on real-world use cases and managing expectations accordingly, one can quickly see that there is also truth to the story. Self-healing was never about replacing the complex and intricate nature of human troubleshooting abilities. It is about freeing up time to allow people to focus on what matters.

Starting to automate basic tasks such as gathering information and fetching data is a significant first step to self-healing. With connected monitoring systems and core data (like incidents, changes and problems) AIOps is allowed to form a better contextual awareness to automate remediation. Sometimes much better than what an operator ever could do on her own.

Final words


With the right investment, SRE teams’ costs can be significantly reduced and a culture of continuous improvement and automation is allowed to flourish. Spending time on reducing alert fatigue and TOIL will have resounding positive effects, both in terms of employee satisfaction and performance.

A happy SRE team is a team that is allowed to be innovative and creative. Creative brains are a valuable, limited resource. They shouldn’t be wasted on re-inventing the wheel when there are so many fascinating new problems waiting out there.

Wouldn’t you agree?

Podcast – modern architectures and CMDB

Alexander Ljungström, Strategic Advisor at Einar & Partners, was recently featured in the new “IT Experience Podcast” from ServiceNow. In the episode, Alexander discusses the role of modern architecture and how it shapes the CMDB. Topics such as DevOps, tags and containers are becoming more and more relevant when building a data layer for the modern enterprise.

Together with Usman Sindu, product director for ITOM at ServiceNow, Alexander gives some insight into the European market for AIOps and ITOM.

Are you interested in the philosophy behind a modern CMDB? Then contact Alexander Ljungström directly on alexander@einar.partners.