Artificial intelligence for IT operations (AIOps) is an IT practice that uses machine learning (ML) and artificial intelligence (AI) to cut through the noise in IT operations, specifically incident management. But what is AIOps, exactly? And how can it support your organization?
What is AIOps?
Gartner defines AIOps as the combination of “big data and machine learning to automate IT operations processes, including event correlation, anomaly detection, and causality determination.” It’s an accurate description of most AIOps tools and platforms on the market, but it excludes key capabilities that are essential for a comprehensive, modern AIOps solution. Such a modern approach to AIOps serves the full software delivery life cycle and addresses the volume, velocity, and complexity of modern multicloud environments.
Most AIOps solutions ingest pre-aggregated data from various solutions across the IT management tooling landscape, including disparate observability tools, and determine what is relevant to focus the analyst’s attention. While this sounds promising and has been shown to be successful, there are key caveats to consider. Here, we’ll discuss the AIOps landscape as it stands today and present an alternative approach that truly integrates artificial intelligence into the DevOps process.
Two approaches to AIOps
Before diving into further details, let’s highlight the two overarching approaches to AIOps:
Traditional AIOps: Traditional AIOps approaches are designed to reduce alerts and utilize machine-learning models to deliver correlation-based dashboards. These systems are often difficult to scale because the underlying ML engine doesn’t provide continuous, real-time insight into the precise root cause of issues. They require extensive training, and analysts must spend valuable time filtering any false positives.
Modern AIOps: Modern AIOps solutions are built for dynamic clouds and software delivery life cycle (SDLC) automation because they combine full-stack observability with a deterministic AI engine that can yield precise, continuous, and actionable insights in real time. This contrasts with stochastic AIOps approaches that use probability models to infer the state of systems. Only deterministic AIOps technology enables fully automated cloud operations across the entire enterprise development lifecycle.
Why is AIOps needed?
Modern applications are built from hundreds or thousands of interdependent microservices scattered across many clouds, leading to incredibly complex software environments. This complexity leads to greater difficulty understanding the state of these systems, especially when something goes wrong. AIOps is often presented as a way to reduce the noise of countless alerts, but it can and should be more than that. A full-featured deterministic AIOps solution should lead to faster, higher-quality innovation, increased IT staff efficiency, and vastly improved business outcomes.
Humans simply cannot manually review and analyze the massive amount of data that a modern observability solution can process automatically. Any approach that adds more visualizations, dashboards, and slice-and-dice query tools is an unwieldy bandage over the problem rather than a solution to it. Disparate interfaces still require manual intervention and analysis, and in this way, traditional AIOps solutions have essentially become event monitoring tools.
Modern IT is built on and strives for more capable automation, and AI is critical to achieving these goals. Continuous integration and continuous delivery (CI/CD) processes provide smart pipelines for rolling out new features and services. Orchestration platforms such as Kubernetes are relieving operations teams from error-prone and mundane tasks related to keeping services up and running. When all these areas are as automated as possible, developers and operations teams can focus on innovation rather than performing endless administrative tasks.
What are the components of a modern AIOps solution?
A comprehensive, modern approach to AIOps is a unified platform that encompasses observability, AI, and analytics. This all-in-one approach is necessary to address the complexity of identifying problems in systems, analyzing the context and broader business impact of problems, and automating a response to software problems. The best solutions provide real-time, continuous insights into the state of systems and services that are critical to business operations, so businesses can continue to focus more on innovation and less on responding to inevitable problems with complex systems.
Traditional AIOps is limited in the types of inferences it can make because it depends on metrics, logs, and trace data without a model of how components of systems are structured. AIOps should instead leverage the ability of deterministic AI to fully map the topology of complex, distributed architectures to reach resolutions significantly faster.
Challenges of traditional AIOps
There are limitations to the value of what non-deterministic AIOps solutions can provide.
Traditional AIOps does not scale
With a machine-learning approach, traditional AIOps solutions must collect a substantial amount of data before they can create a dataset (i.e., training data) that the algorithm can then learn from. Administrators can reinforce learning through rating and other similar means, but it can take weeks or even months until this “AI” is calibrated well enough to deliver insights into business-critical applications in production.
This approach is hardly set-and-forget. Modern applications undergo frequent changes, and their deployments are highly volatile, which implies an ever-changing dataset and, in turn, repeated retraining. This method simply can’t keep pace with the frequent changes that occur within complex distributed applications.
Lost and rebuilt context
The second challenge with traditional AIOps centers around the data processing cycle. Traditional AIOps solutions are built for vendor-agnostic data ingestion. This means data sources typically come from disparate infrastructure monitoring tools and second-generation APM solutions. These sets of tools first acquire one or more types of raw data (metrics, logs, traces, events, code-level details, and so on) at different levels of granularity, then process them before finally creating alerts based on a predetermined rule (for example, a threshold, learned baseline, or certain log pattern).
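The two most common rule types mentioned above, a fixed threshold and a learned baseline, can be sketched as follows. This is a minimal illustration, not any vendor's implementation: the function names, the three-sigma cutoff, and the sample response times are all assumptions made for the example.

```python
from statistics import mean, stdev


def threshold_alert(value, limit):
    """Static rule: alert when the value exceeds a fixed limit."""
    return value > limit


def baseline_alert(history, value, num_sigmas=3.0):
    """Learned-baseline rule (assumed form): alert when the value deviates
    from the historical mean by more than num_sigmas standard deviations."""
    if len(history) < 2:
        return False  # not enough data to form a baseline
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) > num_sigmas * sigma


# Hypothetical response times in milliseconds.
response_times_ms = [120, 118, 125, 122, 119, 121, 124, 120]

print(threshold_alert(480, limit=300))         # True
print(baseline_alert(response_times_ms, 480))  # True
print(baseline_alert(response_times_ms, 123))  # False
```

Note that each rule reduces the raw samples to a single yes-or-no alert. If only that alert is forwarded to a downstream tool, the underlying measurements are no longer available to it.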
Typically, only the aggregated events are accessible to the ML engine, and they often exclude additional details. The AI then learns clusters of similar, recurring events so that it can classify new events later. With that data, it builds and rebuilds context (time- and metadata-based correlation) but has no evidence of actual dependencies. There might be integrations that allow for more data to be processed (such as metrics), but those simply add more datasets without solving the problem of cause-and-effect certainty.
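To make the limitation concrete, here is a small sketch of purely time-based correlation. The `Event` type, the 60-second window, and the sample events are hypothetical; real products use more elaborate clustering, but the principle of grouping by proximity rather than by known dependency is the same.

```python
from dataclasses import dataclass


@dataclass
class Event:
    timestamp: float  # seconds since some reference point
    source: str
    message: str


def correlate_by_time(events, window_seconds=60.0):
    """Group events that occur within window_seconds of the previous
    event in the same group. No dependency information is used."""
    groups = []
    for event in sorted(events, key=lambda e: e.timestamp):
        if groups and event.timestamp - groups[-1][-1].timestamp <= window_seconds:
            groups[-1].append(event)
        else:
            groups.append([event])
    return groups


events = [
    Event(0.0, "search-service", "response time high"),
    Event(20.0, "host-42", "CPU saturation"),
    Event(35.0, "billing-service", "disk almost full"),
    Event(900.0, "frontend", "error rate high"),
]

groups = correlate_by_time(events)
for group in groups:
    print([e.source for e in group])
# ['search-service', 'host-42', 'billing-service']
# ['frontend']
```

The billing-service event lands in the same group as the search slowdown only because it happened at about the same time. Nothing in the data says whether the two are actually related, which is the missing cause-and-effect certainty described above.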
The four stages of data processing
The four stages of data processing offer another way of looking at the different approaches teams take to achieve AIOps in their data processing chain. The four stages are: collect raw data, aggregate it for alerts, analyze the data, then execute an action plan.
Teams often follow this approach to achieving AIOps because of its apparent convenience:
- Start with a second-generation APM solution, which covers data collection and aggregation and prepares data for analysis (black arc).
- Introduce an ML-based AIOps product as a logical evolution in IT management tooling. This second solution picks up at data collection, aggregation and analysis, and prepares it for execution (grey arc).
This two-phase approach introduces another layer that helps to manage a lot of events from different solutions (and vendors) with ML assistance to reduce the alert flood and focus on the critical issues.
However, the price for that convenience is the potential loss of context when switching tools — it’s context that the ML needs to allow you to go a step further to achieve automated root-cause analysis and, eventually, fully automated CloudOps.
Think of traditional AIOps solutions as an aid to the status quo: they can help you catch up, better manage incidents, and react more quickly. But this approach breaks down at the scale and complexity of today’s modern multicloud environments.
Ultimately, AIOps should encompass all four stages of data processing in a single product and UI, including the execution phase, by enabling greater automation throughout your IT organization. This includes CloudOps, with a focus on incident management, DevOps for improved building and testing of applications, and SecOps for helping ensure applications are secured (purple arc). Only an approach that encompasses the entire data processing chain using deterministic AI and continuous automation can keep pace with the volume, velocity, and complexity of distributed microservices architectures.
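As a rough sketch, the four stages can be pictured as one chain in which each stage hands its full context to the next instead of exporting a stripped-down alert to a separate tool. Everything below is illustrative: the function names, the 500 ms limit, the sample data, and the "restart" action are assumptions, and the analysis step is a deliberately trivial placeholder.

```python
def collect():
    """Stage 1: gather raw measurements (hard-coded sample data here)."""
    return [
        {"entity": "search-service", "metric": "response_time_ms", "value": 950},
        {"entity": "search-service", "metric": "response_time_ms", "value": 990},
        {"entity": "cdn", "metric": "response_time_ms", "value": 40},
    ]


def aggregate(raw, limit_ms=500):
    """Stage 2: raise one alert per entity whose average exceeds limit_ms,
    keeping the raw samples attached as context."""
    by_entity = {}
    for sample in raw:
        by_entity.setdefault(sample["entity"], []).append(sample)
    alerts = []
    for entity, samples in by_entity.items():
        avg = sum(s["value"] for s in samples) / len(samples)
        if avg > limit_ms:
            alerts.append({"entity": entity, "avg_ms": avg, "samples": samples})
    return alerts


def analyze(alerts):
    """Stage 3: placeholder analysis that simply picks the worst alert."""
    if not alerts:
        return None
    return max(alerts, key=lambda a: a["avg_ms"])


def execute(finding):
    """Stage 4: choose a (hypothetical) remediation action."""
    if finding is None:
        return "no action"
    return f"restart {finding['entity']}"


print(execute(analyze(aggregate(collect()))))  # restart search-service
```

Because the alert produced in stage 2 still carries its raw samples, the analysis stage is free to look past the aggregate, which is the context that is typically lost when the stages are split across tools.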
AIOps use cases
Modern AIOps enables more comprehensive automation across the enterprise, including in CloudOps, DevOps, and SecOps. Let’s take a closer look at these use cases.
- CloudOps. CloudOps includes processes such as incident management and event management. AIOps reduces the time needed to resolve an incident by automating key steps in the incident response process, including identifying the root causes of an incident and automatically responding to address those causes. Logs are a valuable source of information, but often, that information is difficult to identify. AIOps brings techniques that can help identify events that require some response but would likely not be detected and acted upon manually.
- DevOps. DevOps can benefit from AIOps with support for more capable build-and-deploy pipelines. Issues in testing and deploying can be addressed automatically, which helps streamline CI/CD pipelines and increase innovation throughput. This increased automation, resilience, and efficiency helps DevOps teams speed up software delivery and accelerate the feedback loop so they can innovate faster and more confidently.
- SecOps. Applications are constantly being improved, revised, and updated with new features, but before that new code can be deployed, it needs to be tested and reviewed from a security perspective. SecOps is responsible for ensuring that applications are secure, and AIOps supports that with the ability to assess applications during development, delivery, and deployment. Anomalous behavior in a newly deployed application can easily escape the detection of humans, but AIOps systems complement SecOps engineers by identifying and reporting on potentially exploitable vulnerabilities.
The great promise of AIOps is to automate IT operations, or in other words, to achieve autonomous operations. That can only be truly achieved when the flow through the four stages can happen without human intervention or assistance. While the collection, aggregation, and execution stages of the data processing chain have been solved to some extent, the toughest part is the analysis stage: identifying the true root cause of an issue and then, based on the insight, choosing the best remediation action. This has yet to be cracked in most cases.
Cracking the analysis stage requires a different approach to AI.
An alternative to the machine-learning approach is deterministic AI, which is based on fault tree analysis. Here’s how it works.
Let’s say, for example, an application is experiencing a slowdown in receiving its search requests. The deviating metric is response time. It triggers the fault tree analysis, so you start your analysis with the monitored entity to which the metric belongs — the application, in this case. This is now the starting node in the tree.
Next, you investigate all dependencies this application has. For example, it may have third-party calls, such as CDNs, or more complex requests to a backend or microservice-based application. All those dependent nodes will be analyzed and investigated for anomalies. If a node has been cleared, it’ll form a leaf, and nodes showing anomalies will be further investigated down their dependencies.
From there, let’s say you follow the dependencies from the web server the application communicates with to the front-end tier and on to the search service. You then see that search requests are slower than usual on all of these nodes.
Now, it’s not that simple to just follow the dependencies in one direction. Let’s assume the OS hosting the search service is also running another process completely independently that consumes a significant amount of CPU, which causes a shortage and slows down the search service. From the search service, you would follow the dependency to the process and further to the host and then back up to other processes running on that host. This nicely shows vertical (service to process to host) and horizontal (service to service, or process to process) dependency analysis.
This process continues until the system identifies a root cause. In this case, it’s a noisy neighbor. At the other end of the tree, you can assess the impact (for example, how many users have been affected by that problem).
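The walk described above can be sketched as a depth-first traversal over a dependency map. This is a simplified illustration under stated assumptions, not the algorithm of any particular product: the entity names, the `topology` dictionary, and the set of anomalous entities are hypothetical, and a real engine would detect anomalies itself instead of receiving them as input.

```python
def find_root_causes(topology, anomalous, start):
    """Walk dependencies depth-first from the starting entity.

    Healthy entities are cleared and become leaves. An anomalous entity is
    reported as a root-cause candidate when none of its dependencies leads
    to a not-yet-visited anomalous entity.
    Returns (root_causes, visited) with visited in traversal order.
    """
    visited = []
    root_causes = []

    def visit(entity):
        visited.append(entity)
        if entity not in anomalous:
            return  # cleared: this node is a leaf
        found_deeper = False
        for dep in topology.get(entity, []):
            if dep in visited:
                continue  # already examined, avoids looping host <-> process
            if dep in anomalous:
                found_deeper = True
            visit(dep)
        if not found_deeper:
            root_causes.append(entity)

    visit(start)
    return root_causes, visited


# Hypothetical topology for the search slowdown example.
topology = {
    "web-app": ["cdn", "web-server"],               # horizontal
    "web-server": ["frontend-tier"],                # horizontal
    "frontend-tier": ["search-service"],            # horizontal
    "search-service": ["search-process"],           # vertical: service -> process
    "search-process": ["host-1"],                   # vertical: process -> host
    "host-1": ["search-process", "batch-process"],  # back up to processes on the host
    "batch-process": ["host-1"],
}
anomalous = {
    "web-app", "web-server", "frontend-tier", "search-service",
    "search-process", "host-1", "batch-process",
}

root_causes, visited = find_root_causes(topology, anomalous, "web-app")
print(root_causes)  # ['batch-process']
print(visited)
```

The CDN is visited, found healthy, and left as a leaf, while the anomalous chain is followed down to the host and back up to the unrelated process consuming CPU. The sketch only works because the dependency map exists, which is why topology information matters so much for this approach.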
A huge advantage of this approach is speed. It works without first having to identify training data and then train and tune a model.
The significance of topology information
Because it follows a logical fault tree, deterministic AI requires a topology model of your data center or application deployment. Otherwise, you would never be able to walk through the tree like this and find the root cause.
ML-based AIOps tools ingest data and metadata to offer correlational data and dashboards to conduct root-cause analysis. On the other hand, a deterministic AI approach based on fault-tree analysis leverages topology data and builds an entity model in real time by incorporating observed raw data, including metrics, logs, events, traces, and contextual information, such as user experience data. This entity modeling with contextual data is what enables deterministic AI to deliver precise and repeatable root-cause identification.
Two types of root cause
It’s also worth noting that there are two different types of root cause: the technical and the foundational. The earlier example explains how the system identifies the technical root cause: in this case, a CPU spike caused by another process. The foundational root cause explains what led to that spike: for example, a deployment.
To achieve automated foundational root-cause analysis, the AI needs to be capable of browsing through the history or changelog of the monitored entity that has been identified as the technical root cause. And, of course, this type of information needs to be available to the AI and, therefore, be part of the entity.
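One way to picture this is an entity that carries its own changelog, which the analysis can search once the technical root cause is known. The sketch below is an assumption-laden illustration: the `Entity` and `ChangeEvent` types, the one-hour lookback, and the "most recent change wins" heuristic are all hypothetical choices, not a documented method.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeEvent:
    timestamp: float  # seconds since some reference point
    kind: str         # for example "deployment" or "config-change"
    description: str


@dataclass
class Entity:
    name: str
    changelog: list = field(default_factory=list)


def find_foundational_cause(entity, anomaly_start, lookback_seconds=3600.0):
    """Return the most recent change recorded on the entity within
    lookback_seconds before anomaly_start, or None if there is none."""
    candidates = [
        change for change in entity.changelog
        if anomaly_start - lookback_seconds <= change.timestamp <= anomaly_start
    ]
    return max(candidates, key=lambda change: change.timestamp, default=None)


# Hypothetical entity identified as the technical root cause.
batch_process = Entity(
    name="batch-process",
    changelog=[
        ChangeEvent(1000.0, "config-change", "log level raised"),
        ChangeEvent(9500.0, "deployment", "new build rolled out"),
    ],
)

cause = find_foundational_cause(batch_process, anomaly_start=10000.0)
print(cause.kind if cause else "no recent change")  # deployment
```

If the changelog were kept in a separate system that the analysis cannot reach, this lookup would not be possible, which is why the change history needs to be part of the entity.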
Taking AIOps to the next level
Ever since Gartner coined the term AIOps for artificial intelligence in IT operations, the practice and its technologies have been maturing. Now, with modern multicloud environments, the answer to the question “What is AIOps?” must expand to include the full software delivery life cycle.
The traditional, ML-based approaches — which still rely heavily on human input — give us what are essentially event monitoring tools that cannot scale up to meet the demands of modern multicloud microservice-based apps.
However, a deterministic fault-tree approach to AI allows for precise technical and foundational root cause identification and impact analysis in real time. The result is more complete automation throughout the entire development and delivery pipeline, enabling DevOps staff to do what they do best: innovate and create new solutions to human problems — rather than simply keeping the lights on.