Root cause analysis

To identify the root cause of problems, Dynatrace doesn't depend solely on time correlation. It uses all the OneAgent-delivered context information such as topology, transaction and code-level information to identify events that share the root cause. Using all available context information, Dynatrace can pinpoint the root cause of problems in your application-delivery chain and dramatically reduce the alert spam for single incidences that originate from the same root cause.

Why time correlation alone isn't effective

Time correlation alone is ineffective in identifying the root cause of many performance problems. Consider, for example, a simple time correlation in which a Service A calls a Service B. The first event in this problem evolution sequence is a slowdown on Service B. The next event in the problem evolution sequence is a slowdown on Service A. In this case, time correlation would seem to work pretty well for indicating the root cause of the problem: the slowdown on Service B led sequentially to the slowdown on Service A. This is however a very simplistic problem.

What if the events in the problem evolution sequence are more nuanced and open to interpretation? What if, for example, Service A has a long history of performance problems? With such knowledge, it becomes impossible to say conclusively that the slowdown on Service A was caused by the slowdown on Service B. It may be that Service A is simply experiencing another in a history of performance issues. Such subtleties make time correlation alone ineffective in conclusively pinpointing the root cause of many performance problems.

A context-aware approach for the detection of interdependent events

After Davis (the Dynatrace AI causation engine) identifies a problem in one of your application's components, it uses all monitored transactions (PurePath) to identify interdependencies between the problem and other components that took place around the same time and within a dependent topology. Therefore, all vertical topological dependencies are automatically analyzed as well as the complete horizontal dependency tree.

The image below shows how Davis automatically analyzes all the vertical and horizontal topological dependencies for a given problem. In this example, an application exhibits abnormal behavior, but the underlying horizontal stack is not showing any incidents. The automatic analysis follows all the transactions that were monitored for that application and detects a dependency on Service 1, where Service 1 also exhibits abnormal behavior. In addition, all dependencies of Service 1 show abnormal behavior and are part of the root cause of the overall problem.

Automatic root-cause detection includes all the relevant vertical stacks as shown in the example and ranks all root-cause contributors to find out which one has the most negative impact. Dynatrace not only detects all the root-cause contributors but also offers drilldowns on a component level to analyze the root cause down to the code level, showing, for instance, failing methods within your service code or high GC activity on underlying Java processes.

correlation diagram

Problems are seldom one-time events; they usually appear in regular patterns and are often shown to be symptoms of larger issues within your environment. If any other entities that depend on the same components also experienced problems around the same time, then those entities will also be part of the problem's root-cause analysis.

When Davis detects an interdependency between a service problem and other monitored events, it shows you the details of this interdependency and the related root-cause analysis.

Drill down to code-level details of a detected root-cause component

On the problem overview page, you can click the component tile appearing within the Root cause section to navigate to the components infographics page. It shows you the relevant service, host, or process overview page in the context of the actual problem you're analyzing.

The example below presents a typical problem overview page that shows two root-cause contributors, one service called CheckDestination that degraded in response time and an underlying host that experiences a CPU saturation.

problem

Opening a component overview page within the context of a problem will give you specific navigational hints about the violating metrics or about the detected issues on the focused component. The image below shows the host entity page with a navigational hint to review the CPU metric.

problems-cpu

In case of a high CPU event on a host, you can further drill down to the list of consuming processes on that host to find out which processes are the main contributors.

problem - bad process

Visual resolution path

If several components of your infrastructure are affected, a Visual resolution path is included in the Root cause (see the example above). The visual resolution path is an overview of the part of your topology that has been affected by the problem. You can click the visual resolution path tile to see an enlarged view of the resolution path along with the Replay tab on the right (see image below).

The Replay tab enables you to follow the problem through its lifespan in detail by clicking the play arrow at the top. In the example below, you can see that the problem spiked between 8:00 and 9:00 o'clock.

  • The list of events under the diagram includes all events that occurred within the highlighted interval (2018-06-07 07:45 - 08:00). The events are grouped along the respective entities such as, in the example, www.easytravel.com and MicroJourneyService.
  • Click the arrow next to the name of an entity to display the entity overview page, where you can follow the navigational hints for further analysis.

visual resolution path

Highlights of Davis root-cause detection

Powerful features make Davis root-cause detection fast, smart, and precise, with an increased awareness of external data and events.

Fault-tree analysis

Other products use general-purpose machine learning to build a context model. This takes time for collection and learning. Also, general-purpose machine learning does not provide information about causal direction. You may get a correlation between two metrics, for example, but you have no idea how they are related and whether one causes the other.

Davis has no such delay in learning dependencies, because the Davis context model is built on known dependency information from Smartscape, OneAgent, and cloud integration. Using this information, Davis quickly conducts a fault-tree analysis to analyze millions of dependencies and arrive at the most probable root cause of your problem.

Metric- and event-based detection of abnormal component state

Davis automatically checks all component metrics for suspicious behavior. This involves a near-real-time analysis of thousands of topologically related metrics per component and even your own custom metrics. More than this, however, Davis root-cause analysis can detect the root cause of an abnormal component state even when there is no open event on the component.

Seamless integration of custom metrics within the Dynatrace AI process

You can integrate all kinds of metrics by writing custom plugins, JMX, or by using the Dynatrace REST API. Davis seamlessly analyzes your custom metrics along with all affected transactions. There's no need to define a threshold or to trigger an event for your custom metrics, as Davis automatically picks up metrics that display abnormal behavior.

Third-party event ingests

Davis seamlessly picks up any third-party event along the affected Smartscape topology.

Availability root cause

In many cases, the shutdown or restart of hosts or individual processes is the root cause of a detected problem. The availability root-cause section summarizes all relevant changes in availability within the grouped vertical stack.

Grouped root cause

The root-cause section of the problem details page displays up to three root-cause candidates, with candidates aggregated into groups of vertical topologies. This enables you to quickly review outliers within affected service instances or process clusters.

Use case

To learn more, walk through this root-cause analysis.