Analysis Archives - Einar & Partners

8 important ITOM updates in ServiceNow Rome

The release notes for the latest ServiceNow version are out in public, and this time we’re going all the way to Rome. The new release is packed with improvements, additions and developments in the ITOM parts of the platform. In this article by Einar & Partners we give you all the important highlights and news in ServiceNow ITOM for the Rome release.

Site Reliability Metrics for SREs and Ops

Site Reliability Operations, or SRO, is a product by ServiceNow created specifically for SRE teams that work heavily with microservices and site reliability engineering. In the Rome release, the SRO product is extended with Site Reliability Metrics. Engineers and operations teams that work with site reliability engineering can now see performance, error budgets, indicators and service level objectives, all within one workspace.
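To make the relationship between a service level objective and an error budget concrete, here is a minimal sketch of the arithmetic. The function names and the sample figures are our own assumptions for illustration; they are not taken from the SRO product.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Number of failed requests the SLO tolerates within the window."""
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        raise ValueError("the error budget is zero for these inputs")
    return 1.0 - failed_requests / budget


# Hypothetical month: 99.9% availability SLO, 1,000,000 requests, 250 failures.
allowed = error_budget(0.999, 1_000_000)
remaining = budget_remaining(0.999, 1_000_000, 250)
print(f"allowed failures: {allowed:.0f}, budget remaining: {remaining:.0%}")
```

With these made-up numbers, roughly a thousand failures are tolerated and about three quarters of the budget is still available.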

This is a welcome addition to the previously rather lightweight SRO application. It also demonstrates that ServiceNow is more serious than ever about positioning itself towards cutting-edge DevOps and containerized practices.

Containerized MID-servers

Speaking of containerized practices, the MID-server is now officially packaged as a Docker image and can be deployed as a container-based application. This has already existed for a while, although it was not officially supported by ServiceNow until now. In practice this means that MID-server capacity can be scaled very easily depending on load and anticipated activity (for example, discovery).

At the time of writing, IntegrationHub and Orchestration are not (officially) supported when using containerized MID-server Docker images.

Agent Client Collector – agent based discovery

The Agent Client Collector (ACC) is officially released in its full capacity in Rome, and is becoming quite a mature alternative to agentless discovery. Agent-based discovery in ServiceNow solves the long-standing challenge of having to provide credentials and open firewalls across the infrastructure. When using agent-based discovery, customers should still be aware that products such as Service Mapping are not yet supported.

Nonetheless, agent based discovery is perfectly suited for endpoints, laptops and infrastructure where agentless discovery is not permitted.

For more information about the ACC-V framework, check out our video below.

More sources for Health Log Analytics

Health Log Analytics, originating from the acquisition of Loom Systems roughly 2 years ago, has now matured into a strong core piece of the ServiceNow AIOps portfolio. In the latest release, the Health Log Analytics product supports a whole bunch of new sources for ingesting log data, such as:

Amazon CloudWatch
Amazon S3
Microsoft Azure Log Analytics
Microsoft Azure Event Hubs
Apache Kafka
REST API, for streaming your log data to the instance in JSON format

Event Management news

For Event Management it is now possible to integrate Grafana events out of the box and plug them directly into the Event Management engine in ServiceNow. Additionally, ServiceNow has added support for the EIF format (Event Integration Facility). It might sound obscure, but this format is the de facto standard for a lot of IBM products. With the closer relationship between ServiceNow and IBM, this will save a lot of headaches when integrating technologies such as IBM Tivoli and the corresponding monitoring agents.

Oracle Cloud Discovery – official support

Oracle Cloud can now be fully discovered by ServiceNow Cloud Discovery. Previously only available as an app in the ServiceNow app store, it is now fully integrated in the core part of Discovery. This means that cloud resources that customers have in Oracle Cloud can now be refreshed in real time and included in the CMDB with a very simple connection.

Kubernetes and cloud components in Service Maps

For organizations that are using tag-based service mapping, the latest release will make a big difference. Previously every resource that needed to show up in a tag-based service map also needed a tag. Although this principle makes sense from a logical standpoint, in containerized environments not every pod and component is tagged.

In the Rome release, Kubernetes and cloud components can now be included automatically based on their relationships. In other words, you just need to have the parent tagged in order to include its children.
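The principle of "tag the parent, include the children" can be sketched as a simple graph traversal. This is only an illustration of the idea, not how ServiceNow implements it; the function name, the relationship dictionary and the CI names are all hypothetical.

```python
from collections import deque


def service_map_members(tagged: set[str], children: dict[str, list[str]]) -> set[str]:
    """Return the tagged CIs plus everything reachable through child relationships."""
    members = set(tagged)
    queue = deque(tagged)
    while queue:
        ci = queue.popleft()
        for child in children.get(ci, []):
            if child not in members:  # guard against cycles and duplicates
                members.add(child)
                queue.append(child)
    return members


# Hypothetical data: only the cluster itself carries the application tag.
children = {
    "k8s-cluster": ["namespace-shop"],
    "namespace-shop": ["pod-cart", "pod-checkout"],
    "other-cluster": ["pod-unrelated"],
}
print(sorted(service_map_members({"k8s-cluster"}, children)))
```

The untagged namespace and pods end up in the map because they hang off the tagged cluster, while the unrelated cluster stays out.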

Tag Governance – a new application

Keeping track of tags in ServiceNow has already been possible for a few releases. But with the latest release, the capability has been lifted to entirely new levels. This is perhaps the most exciting product in the Rome release in our opinion.

With the new tag governance application, tags can be tracked, certified and kept up-to-date through workflows and rulesets. But perhaps more importantly, any tags that are found not to be compliant with the defined rules can be remediated. In other words, ServiceNow can correct tags in Azure, AWS and other cloud platforms.

A single pane of glass to keep track of tagging across multiple clouds and to apply smart workflow logic. Isn’t that what we love about this platform?
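As a rough sketch of what "check tags against rules, then remediate" means in practice, consider the following. The rules, default values and function names are assumptions made up for this example. A real remediation would call the cloud provider's tagging API, whereas this sketch only returns the corrected tag set.

```python
# Hypothetical governance rules, invented for this illustration.
REQUIRED_TAGS = {"owner", "environment"}
ALLOWED_VALUES = {"environment": {"prod", "test", "dev"}}
DEFAULTS = {"owner": "unassigned", "environment": "dev"}


def find_violations(tags: dict[str, str]) -> list[str]:
    """List missing required tags and tags whose value is not permitted."""
    violations = [f"missing:{key}" for key in sorted(REQUIRED_TAGS - set(tags))]
    for key, allowed in ALLOWED_VALUES.items():
        if key in tags and tags[key] not in allowed:
            violations.append(f"invalid:{key}")
    return violations


def remediate(tags: dict[str, str]) -> dict[str, str]:
    """Return a corrected copy; the input is left untouched."""
    fixed = dict(tags)
    for violation in find_violations(tags):
        _, key = violation.split(":", 1)
        fixed[key] = DEFAULTS[key]
    return fixed


resource_tags = {"environment": "production"}
print(find_violations(resource_tags))
print(remediate(resource_tags))
```

Falling back to a default value is only one possible policy. In many organizations a missing owner tag would rather open a task for a person than be filled in automatically.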

Summary and final thoughts

As we can see, each release continues to be packed with additions to the ITOM portfolio. It appears that ServiceNow is pushing with full force towards keeping control, visibility and compliance on cloud resources and modern architectures (containers, serverless).

Additionally, in the latest release we also see evidence of just how serious ServiceNow is about bridging into the space of observability and site reliability engineering. With the recent acquisition of Lightstep, a DevOps observability platform, ServiceNow is strategically positioning itself more and more towards the modern era of IT Operations.

Exciting times!

10 most significant ITOM news in Quebec

The Quebec version of ServiceNow was recently released to the general public and is available for upgrade to customers. For ITOM and AIOps enthusiasts in the industry, the new release is packed with exciting new additions and completely new product offerings from ServiceNow. It has been a long time since so many updates were added to ITOM and we’re very excited about the innovation it will bring. To get people up to speed, we’ve compiled this deep-dive article of the 10 most significant and essential ITOM updates in Quebec.

Loom becomes Health Log Analytics – machine learning for log-data

Approximately a year ago, ServiceNow acquired Loom Systems. The company produces a platform which can detect, analyze and act on anomalies in log data across the IT landscape. As we all know, today’s dynamic IT infrastructure generates huge amounts of log data. As a matter of fact, logs are the primary tool for SREs and engineers during root cause analysis and troubleshooting.

One year later, we can witness Loom Systems integrated for the first time as a native product in the ServiceNow ITOM platform. The new product is called “Health Log Analytics” and it ties directly into the ITOM Health part of ServiceNow (event management & machine learning).

This is a potential game-changer in the ServiceNow AIOps portfolio. Customers can connect to Elasticsearch, Splunk and many more tools to start ingesting log data into ServiceNow in real time. With the proprietary and powerful machine learning algorithms that the platform provides, ITOps teams can see anomalies, trends, and log-data patterns at their fingertips.

Traditional metrics are becoming more outdated, and with the explosion of DevOps and containerized environments, log data is more critical than ever before. We can already start seeing synergies between the Agent Client Collector for monitoring logs and Health Log Analytics.

For a quick overview of Health Log Analytics, see the video below by ServiceNow.

Kubernetes Discovery Improvements

Kubernetes and containerized environments are more and more important. In the latest Quebec release, customers have the ability to track the YAML files for Kubernetes configurations. By tracking the configuration files, you essentially audit the YAML setup for Kubernetes which is very powerful in troubleshooting scenarios. Additionally, customers who are relying on Istio service mesh can also discover the service mesh fully.

Site Reliability Operations – Track your microservices

Speaking of microservices, in the past months ServiceNow has deployed an excellent app to their app-store for registering and tracking microservices. Through the “Site Reliability Operations” free application, developers can easily register microservices in ServiceNow. Additionally, it has an API that can be hooked into CI/CD pipelines to keep microservices up-to-date. Integrated with Event Management and lifecycle workflows, the application is an excellent way to bridge DevOps into IT Operations.

Changes to licensing model (node counting)

In Quebec, PaaS-managed virtual machines and desktops are no longer counted towards the licensing cost. To quote ServiceNow:

“You can identify virtual machines (VMs) that are used as desktops (such as VMware VDI) or managed automatically by PaaS (such as AWS EC2 Container Service). You can exclude VMs from the Server Licensed Resource category.”
ServiceNow Documentation – Quebec Release

Machine Learning in Service Mapping

When running traffic-based service mapping, ServiceNow will automatically track TCP traffic occurrence and frequency and apply machine learning to the dataset. Over time the platform will learn what likely candidates should be included in Service Mapping, their role, their function and give “connection suggestions”. For companies that run application stacks with a lot of incoming and outgoing traffic, this is an excellent way to discover “shadow-dependencies”.

ServiceNow will try to categorize whether connections and CIs have one of the following functions:

Central: Connection used by the entire organization. For example, SSO.
Observer: Likely an application deployed in many places of the infrastructure. For example, monitoring agents.
Middleware: The connection is a middleware component that exchanges data between multiple services.
Internal: A connection only occurring for a particular application service.

Audit MID-server calls for increased security insights

MID-server calls, such as WMI, SSH or WinRM, are now audited in a structured way. Discovery administrators can now easily see which machines have recently received remote calls, along with the status, timestamp and trigger. This is a small but important feature that will be highly relevant for a lot of security teams.

Credential aliases – pinning credentials to discovery

Speaking of security, credential aliases can now be used in discovery. For readers who are unfamiliar with this concept, it used to exist only in orchestration, where you can “pin” credentials to activities. Now you can use this functionality for discovery, which is a big improvement from a security standpoint. Administrators and the security team can lock credentials on an even more granular level so that they apply only to specific discovery schedules.

Help the helpdesk gone

The good and old-school “Help the helpdesk”-script came into existence before the discovery product. It’s one of the oldest relics in the ITOM suite and has been a faithful companion for many, many releases. But like with many stories, all good things come to an end. The script, which was reasonably outdated by this point, has now been deprecated.

What will replace it? Most likely the agent client collector with agent-based discovery in the future.

IntegrationHub updates

For those who love Flow Designer, IntegrationHub has been updated with some serious goodies. If you’re an old-school workflow guru, you will remember the “scratchpad” variables. These are now also a feature in Flow Designer. Additionally, from an ITOM perspective, you can now write direct SQL queries through JDBC and transfer files through SFTP.

New Linux installer for MID-servers

The Linux installer for MID-servers has received an extensive upgrade in cosmetics and user-friendliness. The new installer guides users through the installation step by step and ensures that system requirements are met.

Summary

As we can see, the new Quebec release is absolutely packed with new innovative features, especially related to machine learning, algorithms and anomaly detection. We’re especially excited about the improvements in Service Mapping for traffic-based connections as well as the fantastic new Health Log Analytics addition. An exciting future and year ahead!

What do you think is the most exciting feature? Let us know in the comments below.

Self-healing & AIOps – demystifying the hype

A hot topic within AIOps is without a doubt the promised land of self-healing, where an AIOps solution assists engineers and SREs with automatic actions. But just how efficient is the technology of self-healing? Can it be relied upon or is it merely a buzzword with little to no practical use? This introductory article from Einar & Partners covers the art of self-healing and what you can expect of it.

History and background

Typically an engineer or SRE has a busy job. It is one of the more intense positions within a company, with the need to be available at uncommon hours and to fight outages while unhappy end-users wait for updates. How often have we heard the joke “never push to production on a Friday afternoon”, followed by a weekend of technical troubleshooting and pulling out hair in frustration? A horror scenario indeed, but more and more common when organizations have to be extremely agile.

As the name indicates, an SRE, or site reliability engineer, has to be on call and fix issues as they arise while ensuring the business runs smoothly. Statistics indicate that an SRE spends at best 50% of their time fixing issues (like at Google) and at most organizations significantly more. But zooming in on that statistic, the question asked by big organizations like Google and Amazon is: how is that fixing time spent?

The answer is quite simple: most of the “troubleshooting” time is unfortunately spent on a concept called “TOIL”.

TOIL & DevOps – What is TOIL?

Toil is the repetitive, the mundane, the tedious and unproductive work that an SRE has to execute daily. In other words, the tasks that can be automated and create the most significant overhead in terms of time-investment for an organization. Some examples include fetching log-files, rebooting services, running scripts, finding information, service checks, applying configurations or copying and pasting commands from a playbook.

Unless an engineering department is careful, too much TOIL can easily result in burnout due to its dull and repetitive nature. At the same time as engineers have to deal with TOIL overload, they are expected to contribute to the development and code-base of applications and services. This situation can easily create confusion about what an SRE is supposed to do. Fighting endless fires or contributing to design and optimization?

Why is TOIL bad?

Slows down innovation & progress
Reduction of quality due to manual work
Never-ending list of manual tasks that take a long time to teach new resources
Burnout
High OPEX due to low efficiency

“TOIL is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.”
Vivek Rau, Google

Best strategy for automatic remediation?

With the previous introduction in mind, how can organizations leverage modern solutions and technology to reduce TOIL? In a DevOps world where any given application may have hundreds of microservices, states to keep track of, and endless dependencies, automation is vital.

A common misconception is that self-healing and automatic remediation will replace the in-depth troubleshooting that SREs and Ops perform. This is not the case, as fixing more complicated issues will require skilled engineers for the foreseeable future. Implementing auto-remediation has a different focus and concentrates on automating the many small tasks rather than the few big ones.

Auto remediation and realistic use cases in AIOps

The philosophy of auto-remediation and self-healing is to move away from the model where any given alert from an application starts with a human response. It flips this equation in favor of AIOps as the first point of contact rather than a person. Most applications and alerts have a set of standard steps to resolve a given issue. Sometimes the steps are simple, like restarting a service or gathering data. Other times the fix can be to change a configuration or start a workflow (think of a decision tree).

On top of this, Ops teams and SREs have the normally expected tasks, like acknowledging alerts, categorizing issues, prioritizing incidents and updating tickets. Individually the tasks are very small, but put them together and you end up with most of the time spent just repeating the same steps. Over and over again. Self-healing aims to remove the element of repetition from the equation.
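The idea of automation as the first point of contact can be sketched as a lookup from alert type to a list of standard steps, with a person as the fallback. Everything in this sketch is hypothetical: the alert fields, the runbook mapping and the placeholder actions merely stand in for real integrations.

```python
from typing import Callable


def collect_logs(alert: dict) -> str:
    # Placeholder for a real action such as fetching log files.
    return f"collected logs for {alert['service']}"


def restart_service(alert: dict) -> str:
    # Placeholder for a real action such as restarting a service.
    return f"restarted {alert['service']}"


# Hypothetical mapping from alert type to its ordered standard steps.
RUNBOOKS: dict[str, list[Callable[[dict], str]]] = {
    "service_down": [collect_logs, restart_service],
    "disk_full": [collect_logs],
}


def handle_alert(alert: dict) -> list[str]:
    """Run the standard steps for known alert types, otherwise escalate."""
    steps = RUNBOOKS.get(alert["type"])
    if steps is None:
        return ["escalated to on-call engineer"]
    return [step(alert) for step in steps]


print(handle_alert({"type": "service_down", "service": "checkout"}))
print(handle_alert({"type": "memory_leak", "service": "checkout"}))
```

Note that the unknown alert type still ends up with a human, which is the point: automation takes the repetitive cases and people keep the unusual ones.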

How organizations can save time (for real)

The data suggests that a significant portion of the time that engineers spend is related to repetitive tasks. As such, AIOps and automatic remediation are about relieving the pressure of these types of tasks. That way SREs and Ops teams can focus on what really is essential, which is the troubleshooting and investigations where AI and automation fall short. The work only fit for the eyes and brain of a person.

Auto-remediation is therefore merely another tool in the toolbox of engineers. The right AIOps solution should analyze historical solutions of issues, see what worked, and suggest appropriate actions for the engineers. With enough confidence (based on data), AIOps can start automatic workflows and trigger actions to assist engineers in their work.
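One simple way to picture "confidence based on data" is to use the historical success rate of each action per alert type. The sketch below is our own simplification and not how any particular AIOps product works. The record format, the minimum sample size and the threshold are assumptions.

```python
from collections import Counter


def suggest_action(history, alert_type, min_samples=5):
    """history holds (alert_type, action, resolved) tuples from past incidents."""
    attempts, successes = Counter(), Counter()
    for past_type, action, resolved in history:
        if past_type == alert_type:
            attempts[action] += 1
            if resolved:
                successes[action] += 1
    # Ignore actions with too little history to say anything meaningful.
    rates = {a: successes[a] / n for a, n in attempts.items() if n >= min_samples}
    if not rates:
        return None
    best = max(rates, key=rates.get)
    return best, rates[best]


def decide(history, alert_type, auto_threshold=0.8):
    """Run automatically above the threshold, otherwise only suggest."""
    suggestion = suggest_action(history, alert_type)
    if suggestion is None:
        return "no suggestion"
    action, confidence = suggestion
    mode = "auto-run" if confidence >= auto_threshold else "suggest"
    return f"{mode}: {action} ({confidence:.0%})"


# Hypothetical history: restarting worked 9 times out of 10.
history = (
    [("service_down", "restart", True)] * 9
    + [("service_down", "restart", False)]
    + [("service_down", "reboot_host", True)] * 2
)
print(decide(history, "service_down"))
```

The rarely tried action is ignored despite its perfect record, since two samples are not enough to trust.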

This way, engineers are allowed to focus on the work which matters and free up headspace from the manual tasks. Ultimately this will enable organizations to lower operational expenditure and have a more innovative workforce. The time saved on automating can be re-invested in further automation, creating a positive feedback loop.

Risks and pitfalls with self-healing

Unfortunately, not everything is as picture-perfect as the hypothetical world that AIOps often suggests. To fully realize the value of automatic remediation, several pre-conditions must be met, such as:

Having core data connected to the AIOps platform. Without historical knowledge of how incidents were resolved, what solution worked, and the relation to infrastructure changes, AIOps will have a difficult time suggesting actions.
Connecting monitoring data. Having alerts and monitoring data feeding the AIOps system is crucial to reduce volume and correlate which remediation fits which alert type.
Culture of automation. The cultural aspect of automation must not be forgotten. Allowing employees to dedicate time to create automation workflows that can be used by AIOps is crucial.

In the end, automatic remediation is about managing expectations about what it can and can’t do. We’re not quite at the stage yet where it replaces the role of an operator completely. Yet what an organization should expect is for AIOps to help significantly with the workload and to reduce operational tasks.

Always keep in mind that “anything a human can do, a machine can also do.”

Moving beyond self-healing

So far we’ve covered the concept of TOIL and how it relates to self-healing. But what comes after automatic remediation? There are many paths a successful rollout of AIOps can take, but the holy grail (at least in the year 2021) will be anomaly detection and proactive alerts. Ideally, SREs and operators should focus on proactive alerts rather than reactive alerts, meaning that anomalies and deviant behaviors can be detected early in logs, metrics, and infrastructure through machine learning, hopefully before a P1 ticket has been created.

Anomaly detection is not just a buzzword but one of the few areas where machine learning can be applied in the real world. Detecting outliers based on historical patterns is an area almost impossible for a human operator to engage in, as the sheer volume of metrics & alerts is simply too high. When SREs move from just reacting to alerts to proactively observing the state and behavior of applications and services, a technical wonder is in the making.

Getting to that stage is a maturity process just like anything else. A maturity process which more often than not starts with the organization and culture. If the mindset around how SREs spend their time does not change, and if TOIL is allowed to wreak havoc, the tools are of little importance at the end of the day.
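To give a feel for what "detecting outliers based on historical patterns" means at its very simplest, here is a sketch using a z-score against a historical baseline. Real AIOps platforms use far more sophisticated models. The threshold of three standard deviations and the sample values are assumptions chosen for illustration.

```python
from statistics import mean, stdev


def find_anomalies(history: list[float], recent: list[float], threshold: float = 3.0) -> list[float]:
    """Flag recent values that deviate strongly from the historical baseline.

    history needs at least two values so a standard deviation can be computed.
    """
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any deviation at all is unusual.
        return [value for value in recent if value != baseline]
    return [value for value in recent if abs(value - baseline) / spread > threshold]


# Hypothetical response times in milliseconds.
history = [100, 102, 98, 101, 99, 100, 103, 97]
recent = [101, 99, 140]
print(find_anomalies(history, recent))
```

Only the value far outside the usual range is flagged. A person can do this for one metric, but not for thousands of them at once, which is where the machine earns its keep.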

Conclusion

Is automatic remediation a bit of hype? It depends.

Focusing on the real-world use cases and managing the expectations accordingly, one can quickly see that there is also truth to the story. Self-healing was never about replacing the complex and intrinsic nature of human troubleshooting abilities. It is about freeing up the time to allow people to focus on what matters.

Starting to automate basic tasks such as gathering information and fetching data is a significant first step to self-healing. With connected monitoring systems and core data (like incidents, changes and problems) AIOps is allowed to form a better contextual awareness to automate remediation. Sometimes much better than what an operator ever could do on her own.

Final words

With the right investment, SRE teams’ costs can be significantly reduced and a culture of continuous improvement and automation is allowed to flourish. Spending time on reducing alert fatigue and TOIL will have resounding positive effects, both in terms of employee satisfaction and performance.

A happy SRE team is a team that is allowed to be innovative and creative. Creative brains are a valuable, limited resource. They shouldn’t be wasted on re-inventing the wheel when there are so many fascinating new problems waiting out there.

Wouldn’t you agree?
