What is AIOps?
Artificial intelligence for IT operations (AIOps) is the term for the use of artificial intelligence, typically based on machine learning (ML), to cut through the noise in IT operations, specifically in incident handling and management. AIOps solutions commonly ingest pre-aggregated data from various tools across the IT management landscape and determine what is relevant, in order to focus the user's attention on the situations that matter.
While this sounds promising, and despite a proven track record of results, there are some caveats. I would like to discuss two of them in this blog and show an alternative approach.
Two categories of AIOps
Before diving into further details, I want to highlight that there are in fact two different categories of AIOps. Traditional AIOps approaches, as defined by Gartner, are known for just cutting through the noise, i.e. reducing alerts through correlation that has little to nothing to do with root-cause analysis. These approaches are slow and inaccurate, which limits their practical application. In comparison, the AIOps approach discussed in this article is built upon a radically different deterministic AI engine that yields precise, actionable results in real time. This is very powerful because it enables automated IT operations; the image below summarizes this nicely.
Traditional AIOps is Slow
The first downside to the ML approach is that it is slow. Such a solution first needs to collect a substantial amount of data to build the training dataset that an algorithm can learn from. Users then typically have the option of reinforcing the learning through ratings and similar feedback. This already indicates that a few weeks, if not months, easily pass before the system is honed far enough to be trusted with production monitoring of business-critical applications. And this is by no means only an initial problem. Modern applications undergo frequent change and their deployments are highly volatile, which implies an ever-changing dataset. Inevitably, this leads to one very important question about the efficiency of ML: can such an AI ever keep up with frequent changes and deployments?
Lost and rebuilt context
The second major concern I want to discuss is the data processing chain. AIOps solutions are stand-alone and built for vendor-agnostic data ingestion. Data sources typically include common infrastructure monitoring tools and second-generation APM solutions, among others. These tools acquire one or more types of raw data (metrics, logs, traces, events, code-level details, and so on) at various granularities, process them, and create alerts (a threshold or learned baseline was breached, a certain log pattern occurred, and so forth). Typically, only the alert, potentially with some metadata, is then handed over to the AIOps solution, meaning only the aggregated event is accessible to the ML without many additional details. The AI then learns similar recurring clusters of incoming events for later classification of new events, and with that it builds and rebuilds context (time- and metadata-based correlation), but it has no evidence of actual dependencies. There might be integrations that allow more data to be handed over (e.g. metrics), but that just adds another dataset and does not solve the problem of cause-and-effect certainty.
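To make the loss of context concrete, here is a minimal sketch contrasting what a source tool observes with the aggregated alert a stand-alone AIOps layer typically receives. All field names and values are hypothetical, chosen only to illustrate the idea:

```python
# Illustrative only: the raw observation a monitoring tool holds
# versus the aggregated alert it hands to a stand-alone AIOps layer.
raw_observation = {
    "entity": "search-service",
    "metric": "response_time_ms",
    "samples": [120, 130, 840, 900],   # full-granularity measurements
    "trace_ids": ["t-1", "t-2"],       # evidence of actual dependencies
    "caller": "frontend-tier",         # who depends on this service
}

# Aggregation into an alert discards most of that evidence:
alert = {
    "source": "apm-tool",
    "severity": "warning",
    "message": "response time baseline breached on search-service",
    "timestamp": 1700000000,
}

# Everything the downstream ML must now rebuild from time and metadata alone:
lost = set(raw_observation) - set(alert)
print(sorted(lost))  # -> ['caller', 'entity', 'metric', 'samples', 'trace_ids']
```

The point of the sketch is that dependency evidence (traces, callers) never crosses the tool boundary, so any correlation the AIOps layer performs afterwards is reconstruction, not observation.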
I’ve illustrated that on the right half of the image below. The alternative on the left will be discussed further down.
The four stages of data processing
That brings me to the four stages of data processing, which is another way of looking at the data processing chain. As indicated above, it starts with the collection of raw data and their aggregation into alerts. Further to the right, in the scope of AIOps, additional aggregation and analysis take place.
This approach is often chosen because of its flexibility. In many cases, introducing an ML-based AIOps product is considered a logical evolution in IT management tooling: it essentially introduces another layer that consolidates events from different solutions (and vendors) into one product and UI, with AI assistance to, as said above, reduce the alert flood and help focus on the critical ones.
The price for that flexibility is the potential loss of context when switching tools, context that the AI needs in order to take you a step further to automated root-cause analysis and, with that, towards fully automated ITOps.
Think of traditional AIOps solutions as an aid to the status quo: something that helps you today with catching up and getting better at managing incidents, better at being reactive. But getting ahead should be on your agenda for tomorrow.
One of the promises of AIOps is to automate IT operations, in other words to achieve autonomous operations. Obviously, that can only be truly achieved when the flow through the four stages happens without human intervention or assistance. While the first two stages and the last one have been solved to some extent, the toughest part, identifying the true root cause and then choosing the best remediation action based on that insight, has yet to be cracked in most cases. Let me elaborate on three thoughts around root-cause identification:
As indicated above, it is now time to discuss an alternative method of AI. In contrast to the machine learning discussed above, I would like to briefly explain the concept of fault tree analysis through the following example. Let's say an application is experiencing a slowdown of its search requests. The deviating metric, the response time, triggers the fault tree analysis, meaning we start our analysis with the monitored entity to which it belongs, in our case the application. This becomes the starting node in our tree. Next, we investigate all dependencies that this application has, for example third-party calls such as CDNs or image servers, or more complex requests to a backend or microservice-based application. All those dependent nodes are analyzed and investigated for anomalies.
If a node has been cleared, it forms a leaf; nodes showing anomalies are investigated further down their dependencies. Let's say we look at the web server that the application communicates with, and further at the front-end tier and the search service, and see that on all three nodes search requests are slower than usual. It is not always that simple to just follow the dependencies in one direction, though. Let's assume that on the OS hosting the search service another process is running, completely independently, that consumes a significant amount of CPU, causing a shortage that slows down the search service. This means that from the search service we would follow the dependency to its process, further to the host, and then back up to the other processes running on that host. This nicely shows vertical (service to process to host) and horizontal (service to service, or process to process) dependency analysis. This process continues until a root cause is identified, in our case a noisy neighbor, and at the other end of the tree we can assess the impact, e.g. how many users were affected by the problem. Another huge advantage of this approach is speed: it works without identifying training data, then training and honing a model.
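The walk described above can be sketched as a simple traversal over a dependency graph. This is a minimal illustration under assumed data, not an actual product implementation; the entity names and anomaly flags mirror the example in the text:

```python
from collections import deque

# Hypothetical dependency graph for the example above: vertical edges
# (service -> process -> host) and horizontal ones (host -> other processes).
DEPENDENCIES = {
    "app": ["webserver"],
    "webserver": ["frontend-tier"],
    "frontend-tier": ["search-service"],
    "search-service": ["search-process"],
    "search-process": ["host-1"],     # vertical: process -> host
    "host-1": ["batch-process"],      # horizontal: other processes on the host
    "batch-process": [],
}

# Entities currently showing an anomaly; in practice this would come from
# baselining each entity's own metrics.
ANOMALOUS = {
    "app": "search response time degraded",
    "webserver": "slow search requests",
    "frontend-tier": "slow search requests",
    "search-service": "slow search requests",
    "search-process": "slow request handling",
    "host-1": "CPU saturation",
    "batch-process": "CPU spike",
}

def fault_tree_analysis(start):
    """Walk the tree from the triggering entity: cleared nodes become
    leaves, anomalous nodes are explored further; the deepest anomalous
    entity is the technical root-cause candidate."""
    visited, frontier, deepest_anomalous = set(), deque([start]), start
    while frontier:
        node = frontier.popleft()
        if node in visited:
            continue
        visited.add(node)
        if node in ANOMALOUS:
            deepest_anomalous = node
            frontier.extend(DEPENDENCIES.get(node, []))
        # cleared nodes: no further descent
    return deepest_anomalous

print(fault_tree_analysis("app"))  # -> batch-process
```

Starting from the slow application, the walk ends at the CPU-hungry neighbor process, the technical root cause; the nodes traversed on the way form the impact side of the tree.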
The Significance of Topology Information
One very obvious fact to state is that fault tree analysis requires a topology model of your data center or application deployment. Otherwise, we would never be able to walk through the tree as explained above and find the root cause. One aspect of an ML-based AIOps solution is to create topology from the ingested data and metadata, which is needed to help a human with root-cause analysis. Deterministic AI based on fault tree analysis, in contrast, requires topology in the first place and has all observed raw data, such as metrics, logs, events, and traces, attached to an entity in that model. This is what is required for precise and repeatable root-cause identification.
Two Kinds of Root Cause
The final point I'd like to discuss is the two different kinds of root cause: the technical and the foundational root cause. The example above walks us through how the technical root cause is identified: a CPU spike of another process. The foundational root cause explains what led to that result, for example a deployment. To achieve automated foundational root-cause analysis, the AI needs to be capable of browsing through the history or changelog of the monitored entity identified as the technical root cause. And of course, this type of information needs to be available to the AI and therefore be part of the entity.
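Finding the foundational root cause can then be as simple as scanning the technical root-cause entity's change history for the most recent change before the anomaly began. A hedged sketch with invented changelog entries:

```python
# Hypothetical change history attached to monitored entities
# (times are arbitrary units for illustration).
changelog = [
    {"entity": "batch-process", "time": 100, "change": "deployment v2.3"},
    {"entity": "batch-process", "time": 480, "change": "config change: batch size x10"},
    {"entity": "search-service", "time": 300, "change": "deployment v1.9"},
]

def foundational_root_cause(entity, anomaly_time, history):
    """Return the most recent change on the technical root-cause entity
    that happened at or before the anomaly started, or None."""
    candidates = [c for c in history
                  if c["entity"] == entity and c["time"] <= anomaly_time]
    return max(candidates, key=lambda c: c["time"], default=None)

cause = foundational_root_cause("batch-process", anomaly_time=500,
                                history=changelog)
print(cause["change"])  # -> config change: batch size x10
```

In this toy run, the CPU spike (the technical root cause) traces back to a configuration change on the offending process, which is the foundational root cause.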
In this article I aimed to shed some light on the variety of approaches to AIOps and to classify them into two categories: the traditional, generic ML-based approach vs. the deterministic approach that allows for precise technical and foundational root-cause identification and impact analysis in real time. The second type of AI is commonly referred to as explainable AI. If you want to dig deeper, there are some further readings below, and if you prefer to see it in action, I invite you to watch our webinar, which includes a live demo. We are eager to hear from you if there are any unanswered questions or if you simply got curious, and of course you can always see for yourself and take a free trial.