Quebec ServiceNow

The 10 most significant ITOM updates in Quebec

The Quebec version of ServiceNow was recently released to the general public and is available for upgrade to customers. For ITOM and AIOps enthusiasts in the industry, the new release is packed with exciting additions and completely new product offerings from ServiceNow. It has been a long time since so many updates were added to ITOM, and we’re very excited about the innovation they will bring. To get people up to speed, we’ve compiled this deep-dive article on the 10 most significant and essential ITOM updates in Quebec.

Loom becomes Health Log Analytics – machine learning for log-data

Approximately a year ago, ServiceNow acquired Loom Systems. The company produces a platform that can detect, analyze and act on anomalies in log data across the IT landscape. As we all know, today’s dynamic IT infrastructure generates huge amounts of log data. As a matter of fact, logs are the primary tool for SREs and engineers during root cause analysis and troubleshooting.

One year later, we can see Loom Systems integrated for the first time as a native product in the ServiceNow ITOM platform. The new product is called “Health Log Analytics” and it ties directly into the ITOM Health part of ServiceNow (event management & machine learning).

This is a potential game-changer in the ServiceNow AIOps portfolio. Customers can connect to Elasticsearch, Splunk and many more tools to start ingesting log data into ServiceNow in real time. With the powerful proprietary machine learning algorithms that the platform provides, ITOps teams have anomalies, trends and log-data patterns at their fingertips.

Traditional metrics are becoming increasingly outdated, and with the explosion of DevOps and containerized environments, log data is more critical than ever before. We can already start seeing synergies between the Agent Client Collector for monitoring logs and Health Log Analytics.


Kubernetes Discovery Improvements


Kubernetes and containerized environments are becoming more and more important. In the Quebec release, customers can track the YAML files for Kubernetes configurations. By tracking the configuration files, you essentially audit the YAML setup for Kubernetes, which is very powerful in troubleshooting scenarios. Additionally, customers who rely on the Istio service mesh can now discover it in full.
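To illustrate why tracked YAML files help during troubleshooting, here is a minimal Python sketch that compares two versions of a configuration file. It is not how ServiceNow implements the feature; the function name `diff_config`, the file name and the sample manifests are assumptions made up for this example, and only the standard library is used.

```python
import difflib


def diff_config(old_yaml: str, new_yaml: str, name: str = "deployment.yaml") -> list[str]:
    """Return the unified-diff lines between two versions of a config file."""
    return list(
        difflib.unified_diff(
            old_yaml.splitlines(),
            new_yaml.splitlines(),
            fromfile=f"{name} (previous)",
            tofile=f"{name} (current)",
            lineterm="",
        )
    )


# Hypothetical manifests: someone scaled the deployment down from 3 replicas to 1.
OLD = """\
apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 3
"""
NEW = """\
apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 1
"""

diff = diff_config(OLD, NEW)
# Skip the two file-header lines and keep only added or removed lines.
changed_lines = [line for line in diff[2:] if line.startswith(("+", "-"))]

for line in diff:
    print(line)
```

With an audit trail like this, an engineer investigating an outage can see at a glance which setting changed between two snapshots.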

Site Reliability Operations – Track your microservices


Speaking of microservices, in the past months ServiceNow has published an excellent app on its app store for registering and tracking microservices. Through the free “Site Reliability Operations” application, developers can easily register microservices in ServiceNow. Additionally, it has an API that can be hooked into CI/CD pipelines to keep microservices up to date. Integrated with Event Management and lifecycle workflows, the application is an excellent way to bridge DevOps into IT Operations.

Changes to licensing model (node counting)

In Quebec, PaaS-managed virtual machines and desktops are no longer counted towards the licensing cost. To quote ServiceNow:

“You can identify virtual machines (VMs) that are used as desktops (such as VMware VDI) or managed automatically by PaaS (such as AWS EC2 Container Service). You can exclude VMs from the Server Licensed Resource category.”

ServiceNow Documentation – Quebec Release

Machine Learning in Service Mapping


When running traffic-based service mapping, ServiceNow will automatically track TCP traffic occurrence and frequency and apply machine learning to the dataset. Over time the platform will learn what likely candidates should be included in Service Mapping, their role, their function and give “connection suggestions”. For companies that run application stacks with a lot of incoming and outgoing traffic, this is an excellent way to discover “shadow-dependencies”.

ServiceNow will try to categorize connections and CIs into one of the following functions:

  • Central: Connection used by the entire organization. For example, SSO.
  • Observer: Likely an application deployed in many places of the infrastructure. For example, monitoring agents.
  • Middleware: The connection is a middleware component that exchanges data between multiple services.
  • Internal: A connection only occurring for a particular application service.
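To make the four categories more tangible, the following Python sketch assigns them with simple hand-written rules. ServiceNow uses machine learning for this and its actual model is not public, so everything here is an assumption for illustration: the `ConnectionStats` fields, the `classify` function and the threshold values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnectionStats:
    name: str
    services_using: int  # distinct application services seen on this connection
    hosts_running: int   # hosts where the connecting process is deployed


def classify(
    conn: ConnectionStats,
    total_services: int,
    total_hosts: int,
    central_share: float = 0.8,   # assumed threshold, not a ServiceNow value
    observer_share: float = 0.5,  # assumed threshold, not a ServiceNow value
) -> str:
    """Rule-based stand-in for the ML categorization described above."""
    if total_hosts and conn.hosts_running / total_hosts >= observer_share:
        return "Observer"    # deployed almost everywhere, e.g. a monitoring agent
    if total_services and conn.services_using / total_services >= central_share:
        return "Central"     # used by (nearly) the entire organization, e.g. SSO
    if conn.services_using > 1:
        return "Middleware"  # shared between several services
    return "Internal"        # only seen for one application service


# Hypothetical environment with 10 application services and 100 hosts.
SAMPLES = [
    ConnectionStats("sso", services_using=10, hosts_running=2),
    ConnectionStats("monitoring-agent", services_using=10, hosts_running=95),
    ConnectionStats("message-queue", services_using=3, hosts_running=2),
    ConnectionStats("app-db", services_using=1, hosts_running=1),
]

roles = {c.name: classify(c, total_services=10, total_hosts=100) for c in SAMPLES}
print(roles)
```

A real model would learn such boundaries from observed TCP traffic over time rather than rely on fixed thresholds, which is exactly why the feature needs a learning period.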

Audit MID-server calls for increased security insights

MID-server calls, such as WMI, SSH or WinRM, are now audited in a structured way. Discovery administrators can now easily see which machines have recently received remote calls, along with the status, timestamp and trigger of each call. This is a small but important feature that will be highly relevant for a lot of security teams.

Credential aliases – pinning credentials to discovery

Speaking of security, credential aliases can now be used in discovery. For readers who are unfamiliar with this concept, it used to exist only in orchestration, where you can “pin” credentials to activities. Now you can use this functionality for discovery, which is a big improvement from a security standpoint. Administrators and the security team can lock credentials down at an even more granular level, so that they apply only to specific discovery schedules.

Help the helpdesk gone

The good old-school “Help the helpdesk” script came into existence before the discovery product. It’s one of the oldest relics in the ITOM suite and has been a faithful companion for many, many releases. But as with many stories, all good things come to an end. The script, which was rather outdated by this point, has now been deprecated.

What will replace it? Most likely the agent client collector with agent-based discovery in the future.

IntegrationHub updates

For those who love Flow Designer, IntegrationHub has been updated with some serious goodies. If you’re an old-school workflow guru, you will remember the “scratchpad” variables. These are now also a feature in Flow Designer. Additionally, from an ITOM perspective, you can now write direct SQL queries through JDBC and transfer files through SFTP.

New Linux installer for MID-servers

The Linux installer for MID-servers has received an extensive upgrade in cosmetics and user-friendliness. The new installer guides users through the installation in a user-friendly manner to ensure that system requirements are met.


Summary

As we can see, the new Quebec release is packed with innovative features, especially related to machine learning, algorithms and anomaly detection. We’re especially excited about the improvements in Service Mapping for traffic-based connections as well as the fantastic new Health Log Analytics addition. An exciting future and year ahead!

What do you think is the most exciting feature? Let us know in the comments below.

AIOps Self Healing

Self-healing & AIOps – demystifying the hype

A hot topic within AIOps is without a doubt the promised land of self-healing, where an AIOps solution assists engineers and SREs with automatic actions. But just how effective is self-healing technology? Can it be relied upon, or is it merely a buzzword with little to no practical use? This introductory article from Einar & Partners covers the art of self-healing and what you can expect of it.


History and background

Typically an engineer or SRE has a busy job. One of the more intense positions within a company is having to be available at uncommon hours and fighting outages with unhappy end-users waiting for updates. How often have we not heard the joke of “never push to production on a Friday afternoon” to be followed up by a weekend of technical troubleshooting and pulling out hair in frustration? A horror scenario indeed but more and more common when organizations have to be extremely agile.

As indicated by the name, an SRE, or site reliability engineer, has to be on call and fix issues as they arise while ensuring the business runs smoothly. Statistics indicate that an SRE spends at best 50% of their time fixing issues (like at Google), and at most organizations significantly more. But zooming in on those statistics, the question asked by big organizations like Google and Amazon is: how is that time spent?

The answer is quite simple: most of the “troubleshooting” time is unfortunately spent on a concept called “TOIL”.

TOIL & DevOps – What is TOIL?

Toil is the repetitive, the mundane, the tedious and unproductive work that an SRE has to execute daily. In other words, the tasks that can be automated and create the most significant overhead in terms of time-investment for an organization. Some examples include fetching log-files, rebooting services, running scripts, finding information, service checks, applying configurations or copying and pasting commands from a playbook.

Unless an engineering department is careful, too much TOIL can easily result in burnout due to its dull and repetitive nature. While engineers have to deal with TOIL overload, they are simultaneously expected to contribute to the development and code base of applications and services. This situation can easily create confusion about what an SRE is supposed to do: fight endless fires, or contribute to design and optimization?

Why is TOIL bad?


  • Slows down innovation & progress
  • Reduction of quality due to manual work
  • Never-ending list of manual tasks that take a long time to teach new hires
  • Burnout
  • High OPEX due to low efficiency

“TOIL is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value and that scales linearly as a service grows”.

Vivek Rau, Google

Best strategy for automatic remediation?


With the previous introduction in mind, how can organizations leverage modern solutions and technology to reduce TOIL? In a DevOps world where any given application may have hundreds of microservices, states to keep track of and endless dependencies, automation is vital.

A common misconception is that self-healing and automatic remediation will replace the in-depth troubleshooting that SREs and Ops perform. This is not the case, as fixing more complicated issues will require skilled engineers for the foreseeable future. Implementing auto-remediation has a different focus: it concentrates on automating the many small tasks rather than the few big ones.

Auto remediation and realistic use cases in AIOps

The philosophy of auto-remediation and self-healing is to move away from the model in which every alert from an application starts with a human response. It flips this equation in favor of AIOps as the first point of contact rather than a person. Most applications and alerts have a set of standard steps to resolve a given issue. Sometimes the steps are simple, like restarting a service or gathering data. Other times the fix can be to change a configuration or start a workflow (think of a decision tree).
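As a rough sketch of "automation as the first point of contact", the Python example below maps alert types to their standard steps and only escalates to a person when no runbook exists. The alert fields, the action functions and the `RUNBOOK` table are hypothetical placeholders; in a real setup each action would call your own tooling instead of returning a string.

```python
from typing import Callable

Action = Callable[[dict], str]


# Placeholder actions: real ones would call an orchestration tool or script.
def collect_logs(alert: dict) -> str:
    return f"collected logs for {alert['service']}"


def restart_service(alert: dict) -> str:
    return f"restarted {alert['service']}"


def escalate(alert: dict) -> str:
    return f"escalated {alert['type']} on {alert['service']} to on-call"


# Standard steps per alert type (hypothetical examples).
RUNBOOK: dict[str, list[Action]] = {
    "service_down": [collect_logs, restart_service],
    "disk_full": [collect_logs],
}


def handle_alert(alert: dict) -> list[str]:
    """Run the known steps for an alert, or hand over to a human."""
    steps = RUNBOOK.get(alert["type"])
    if steps is None:
        # No standard procedure known: this is where the engineer comes in.
        return [escalate(alert)]
    return [step(alert) for step in steps]


print(handle_alert({"type": "service_down", "service": "checkout"}))
print(handle_alert({"type": "memory_leak", "service": "checkout"}))
```

The design choice worth noting is the fallback: anything without a known procedure still lands with an engineer, which matches the point that automation targets the many small tasks, not the few complex ones.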

On top of this, Ops teams and SREs have the normal, expected tasks, like acknowledging alerts, categorizing issues, prioritizing incidents and updating tickets. Individually the tasks are very small, but put them together and most of the time ends up being spent repeating the same steps, over and over again. Self-healing aims to remove the element of repetition from the equation.

How organizations can save time (for real)

The data suggests that a significant portion of engineers’ time is spent on repetitive tasks. As such, AIOps and automatic remediation are about relieving the pressure of these types of tasks. That way SREs and Ops teams can focus on what really is essential: the troubleshooting and investigations where AI and automation fall short, the work fit only for the eyes and brain of a person.

Auto-remediation is therefore merely another tool in the engineer’s toolbox. A good AIOps solution should analyze how issues were resolved historically, see what worked, and suggest appropriate actions to the engineers. With enough confidence (based on data), AIOps can start automatic workflows and trigger actions to assist engineers in their work.
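The idea of "suggest first, automate once confident" can be sketched in a few lines of Python. This is a simplified illustration, not how any particular AIOps product works: the history format, the `recommend` function and the thresholds (`auto_threshold`, `min_samples`) are assumptions chosen for the example.

```python
from collections import defaultdict


def success_rates(history):
    """history: iterable of (alert_type, action, resolved) tuples."""
    counts = defaultdict(lambda: [0, 0])  # (alert_type, action) -> [successes, attempts]
    for alert_type, action, resolved in history:
        entry = counts[(alert_type, action)]
        entry[0] += int(resolved)
        entry[1] += 1
    return {key: (ok / n, n) for key, (ok, n) in counts.items()}


def recommend(alert_type, history, auto_threshold=0.9, min_samples=5):
    """Pick the historically best action and decide whether to run it automatically."""
    candidates = [
        (rate, n, action)
        for (atype, action), (rate, n) in success_rates(history).items()
        if atype == alert_type
    ]
    if not candidates:
        return None  # no historical knowledge: nothing to suggest
    rate, n, action = max(candidates)
    # Only automate when the action worked often enough, on enough past incidents.
    mode = "auto" if rate >= auto_threshold and n >= min_samples else "suggest"
    return {"action": action, "success_rate": rate, "samples": n, "mode": mode}


# Hypothetical incident history.
HISTORY = (
    [("service_down", "restart_service", True)] * 19
    + [("service_down", "restart_service", False)]
    + [("service_down", "rollback", True), ("service_down", "rollback", False)]
    + [("disk_full", "clean_tmp", True)] * 2
)

print(recommend("service_down", HISTORY))
print(recommend("disk_full", HISTORY))
```

In this sketch, restarting the service is run automatically because it worked in 19 of 20 past incidents, while cleaning temporary files is only suggested because two data points are too few to trust.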

This way, engineers are allowed to focus on the work which matters and free up headspace from the manual tasks. Ultimately this will enable organizations to lower operational expenditure and have a more innovative workforce. The time saved on automating can be re-invested in further automation, creating a positive feedback loop.

Risks and pitfalls with self-healing


Unfortunately, not everything is as picture-perfect as the hypothetical world that AIOps often suggests. To fully realize the value of automatic remediation, several pre-conditions must be met, such as:

  • Having core data connected to the AIOps platform. Without historical knowledge of how incidents were resolved, what solution worked, and the relation to infrastructure changes, AIOps will have a difficult time suggesting actions.
  • Connecting monitoring data. Having alerts and monitoring data feeding the AIOps system is crucial to reduce volume and to correlate which remediation fits which alert type.
  • Culture of automation. The cultural aspect of automation must not be forgotten. Allowing employees to dedicate time to create automation workflows that can be used by AIOps is crucial.

In the end, automatic remediation is about managing expectations about what it can and can’t do. We’re not quite at the stage yet where it replaces the role of an operator completely. What an organization should expect is for AIOps to help significantly with the workload and to reduce operational tasks.

Always keep in mind that “anything a human can do, a machine can also do.”

Moving beyond self-healing


So far we’ve covered the concept of TOIL and how it relates to self-healing. But what comes after automatic remediation? There are many paths a successful rollout of AIOps can take, but the holy grail (at least in the year 2021) will be in anomaly detection and proactive alerts. Ideally, SREs and operators should focus on proactive alerts rather than reactive ones, meaning that anomalies and deviant behaviors are detected early in logs, metrics and infrastructure through machine learning, hopefully before a P1 ticket has been created.

Anomaly detection is not just a buzzword but one of the few areas where machine learning can be applied in the real world. Detecting outliers based on historical patterns is an area almost impossible for a human operator to engage in, as the sheer volume of metrics and alerts is simply too high. When SREs move from just reacting to alerts to proactively observing the state and behavior of applications and services, a technical wonder is in the making.
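To show what "detecting outliers based on historical patterns" can mean in its simplest form, here is a small Python sketch that flags values deviating strongly from a rolling window of recent history (a z-score check). Real AIOps platforms use far more sophisticated models; the function name, window size, threshold and sample latencies below are assumptions for illustration only.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return indexes whose value deviates strongly from the preceding window."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            # Perfectly flat history: any change at all is unusual.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical response times in milliseconds with one obvious spike.
latencies = [100, 102, 98, 101, 99, 100, 103, 250, 101, 100]
print(find_anomalies(latencies))
```

Even this naive approach catches the spike at index 7 without anyone defining a static threshold for the metric, which hints at why the technique scales to volumes no human operator could watch.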

Getting to that stage is a maturity process just like anything else, and one which more often than not starts with the organization and culture. If the mindset around how SREs spend their time does not change, and if TOIL is allowed to wreak havoc, the tools are of little importance at the end of the day.

Conclusion

Is automatic remediation a bit of hype? It depends.

Focusing on the real-world use cases and managing expectations accordingly, one can quickly see that there is also truth to the story. Self-healing was never about replacing the complex and intricate nature of human troubleshooting abilities. It is about freeing up time to allow people to focus on what matters.

Starting to automate basic tasks such as gathering information and fetching data is a significant first step to self-healing. With connected monitoring systems and core data (like incidents, changes and problems) AIOps is allowed to form a better contextual awareness to automate remediation. Sometimes much better than what an operator ever could do on her own.

Final words


With the right investment, SRE teams’ costs can be significantly reduced and a culture of continuous improvement and automation is allowed to flourish. Spending time on reducing alert fatigue and TOIL will have resounding positive effects, both in terms of employee satisfaction and performance.

A happy SRE team is a team that is allowed to be innovative and creative. Creative brains are a valuable, limited resource. They shouldn’t be wasted on re-inventing the wheel when there are so many fascinating new problems waiting out there.

Wouldn’t you agree?