Root cause analysis

To identify the root cause of problems, Dynatrace doesn't rely on time correlation alone. Instead, it correlates events across time, processes, hosts, services, applications, and different monitoring perspectives. By correlating events across all monitoring perspectives, Dynatrace can pinpoint the root cause of problems in your application-delivery chain and thereby dramatically reduce the alert spam caused by individual incidents that originate from the same root cause.

Why is time correlation alone not effective?

Time correlation alone is ineffective in identifying the root cause of many performance problems. Consider, for example, a simple scenario in which 'Service A' calls 'Service B'. The first incident in the problem evolution sequence is a slowdown on 'Service B'. The next incident is a slowdown on 'Service A'. In this case, time correlation seems to work well for indicating the root cause of the problem: the slowdown on 'Service B' led to the subsequent slowdown on 'Service A'. This is, however, a very simplistic problem.

What if the incidents in the problem evolution sequence are more nuanced and open to interpretation? What if, for example, 'Service A' has a long history of performance problems? With such knowledge, it becomes impossible to say conclusively that the slowdown on 'Service A' was caused by the slowdown on 'Service B'. It may be that 'Service A' is simply experiencing another in a long series of performance issues. Such subtleties make time correlation alone ineffective in conclusively pinpointing the root cause of many performance problems.

Automatic correlation of all dependent topological evidences

Once Dynatrace identifies a problem in one of your application's components, it uses all monitored transactions (PurePath) to identify correlations between the problem and events on other components that occurred around the same time and within a dependent topology. As a result, all vertical topological dependencies are analyzed automatically, as is the complete horizontal dependency tree.
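The difference between time correlation alone and correlation within a dependent topology can be sketched in a few lines of Python. This is an illustrative sketch only, not Dynatrace's implementation: the `Event` class, the `DEPENDS_ON` map, the entity names, and the 15-minute window are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    entity: str
    kind: str
    timestamp: datetime


# Hypothetical topology: each entity maps to the entities it calls or runs on.
DEPENDS_ON = {
    "Application": {"Service A"},
    "Service A": {"Service B"},
    "Service B": {"Host 1"},
    "Service C": {"Host 2"},
}


def reachable(start, depends_on):
    """Return every entity that `start` depends on, directly or transitively."""
    seen = set()
    stack = [start]
    while stack:
        current = stack.pop()
        for dep in depends_on.get(current, ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


def correlated_events(problem, events, depends_on, window=timedelta(minutes=15)):
    """Keep events that are close in time AND sit on a dependency of the problem entity."""
    dependencies = reachable(problem.entity, depends_on)
    return [
        e for e in events
        if e.entity in dependencies
        and abs(e.timestamp - problem.timestamp) <= window
    ]


problem = Event("Application", "response time degradation", datetime(2018, 6, 7, 8, 0))
events = [
    Event("Service B", "slowdown", datetime(2018, 6, 7, 7, 55)),
    # Close in time, but the application does not depend on Service C.
    Event("Service C", "slowdown", datetime(2018, 6, 7, 7, 58)),
    # A dependency of the application, but outside the time window.
    Event("Host 1", "CPU saturation", datetime(2018, 6, 7, 6, 0)),
]
related = correlated_events(problem, events, DEPENDS_ON)
```

In this sketch, a filter based on time alone would also keep the 'Service C' event. Requiring a topological dependency as well removes it, so only the 'Service B' slowdown remains.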

The image below shows how Dynatrace automatically analyzes all the vertical and horizontal topological dependencies for a given problem. In this example, an application exhibits abnormal behavior, but the underlying horizontal stack is not showing any incidents. The automatic analysis follows all the transactions that were monitored for that application and detects a dependency on Service 1, which also exhibits abnormal behavior. In addition, all dependencies of Service 1 show abnormal behavior and are part of the root cause of the overall problem. The automatic root-cause detection includes all the relevant vertical stacks, as shown in the example, and ranks all root-cause contributors to determine which one has the most negative impact. Dynatrace not only detects all the root-cause contributors but also offers drill-downs on a component level so that you can analyze the root cause down to the code level, showing, for instance, failing methods within your service code or high GC activity on underlying Java processes.

correlation diagram
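The traversal and ranking described above can be approximated as a breadth-first walk over the call dependencies that collects abnormal entities and sorts them by impact. This is a simplified sketch under stated assumptions: the `CALLS` map, the `ABNORMAL` impact scores, 'Service 4', and the rule that a healthy entity ends its branch are all hypothetical and are not taken from Dynatrace's actual algorithm.

```python
from collections import deque

# Hypothetical horizontal dependencies (which entity calls which).
CALLS = {
    "Application": ["Service 1", "Service 4"],
    "Service 1": ["Service 2", "Service 3"],
    "Service 2": [],
    "Service 3": [],
    "Service 4": [],
}

# Hypothetical impact scores for entities currently showing abnormal behavior.
# Service 4 is healthy and therefore absent.
ABNORMAL = {"Application": 0.9, "Service 1": 0.6, "Service 2": 0.8, "Service 3": 0.3}


def root_cause_contributors(start, calls, abnormal):
    """Walk the dependencies of `start` and rank the abnormal ones by impact."""
    contributors = {}
    visited = set()
    queue = deque(calls.get(start, []))
    while queue:
        entity = queue.popleft()
        if entity in visited:
            continue
        visited.add(entity)
        if entity not in abnormal:
            # Assumption for this sketch: a healthy entity ends the branch.
            continue
        contributors[entity] = abnormal[entity]
        queue.extend(calls.get(entity, []))
    # Highest impact first.
    return sorted(contributors.items(), key=lambda item: item[1], reverse=True)


ranking = root_cause_contributors("Application", CALLS, ABNORMAL)
```

Here `ranking` lists 'Service 2' first because it carries the highest assumed impact score, followed by 'Service 1' and 'Service 3'. The healthy 'Service 4' is not a contributor.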

Problems are seldom one-time events; they usually appear in regular patterns and often turn out to be symptoms of larger issues within your environment. If any other entities that depend on the same components also experienced problems around the same time, those entities will also be part of the problem's root-cause analysis. When Dynatrace detects a correlation between a service problem and other monitored events, it shows you the details of the correlation and the related root-cause analysis.
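As a rough illustration of how entities that share a component might be pulled into the same analysis, the sketch below looks for other problem entities that depend on at least one of the same components. The dependency map and entity names are hypothetical, and the set of problem entities is assumed to have already been limited to the same time frame.

```python
# Hypothetical direct dependencies of each entity.
DEPENDS_ON = {
    "Service A": {"Database"},
    "Service B": {"Database"},
    "Service C": {"Cache"},
}

# Entities that reported problems around the same time (assumed pre-filtered by time).
PROBLEM_ENTITIES = {"Service A", "Service B", "Service C"}


def sharing_entities(problem_entity, depends_on, entities_with_problems):
    """Return other problem entities that depend on a component `problem_entity` also uses."""
    shared = depends_on.get(problem_entity, set())
    return sorted(
        other
        for other in entities_with_problems
        if other != problem_entity and depends_on.get(other, set()) & shared
    )


also_affected = sharing_entities("Service A", DEPENDS_ON, PROBLEM_ENTITIES)
```

In this example, 'Service B' joins the analysis of the 'Service A' problem because both depend on the same database. 'Service C' does not, even though it had a problem at the same time.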

Drill down to code-level details of a detected root-cause component

On the problem overview page, click the component tile in the Root cause section to navigate to the component's infographics page. You will see the relevant service, host, or process overview page in the context of the problem you're analyzing.

The example below presents a typical problem overview page that shows two root-cause contributors: a service called CheckDestination whose response time degraded, and an underlying host that is experiencing CPU saturation.

problem

Opening a component overview page within the context of a problem will give you specific navigational hints about the violating metrics or about the detected issues on the focused component. The image below shows the host entity page with a navigational hint to review the CPU metric.

problems-cpu

In case of a high CPU event on a host, you can further drill down to the list of consuming processes on that host to find out which processes are the main contributors.

problem - bad process

Visual resolution path

If several components of your infrastructure are affected, a Visual resolution path is included in the Root cause section (see the example above). The visual resolution path provides an overview of the part of your topology that has been affected by the problem. If you click the visual resolution path tile, you will be presented with an enlarged view of the resolution path along with the Replay tab on the right (see image below). This tab enables you to replay the problem lifespan in detail by clicking the play arrow at the top. In the example below, you can see that the problem spiked between 8:00 and 9:00. The list of events underneath the diagram includes all the events that occurred within the highlighted interval (i.e. 2018-06-07 07:45 - 08:00). The events are grouped by their respective entities. If you click the small arrow next to the name of an entity (e.g. next to MicroJourneyService), you open the entity overview page, where you can follow the navigational hints for further analysis.

visual resolution path
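The event list in the Replay tab amounts to selecting the events inside the highlighted interval and grouping them by entity. The sketch below shows that idea only; the tuple layout, the event descriptions, and 'Host 1' are hypothetical, while MicroJourneyService and the 07:45 - 08:00 interval come from the example above.

```python
from collections import defaultdict
from datetime import datetime


def events_in_interval(events, start, end):
    """Group the events whose timestamp falls in [start, end) by entity name."""
    grouped = defaultdict(list)
    for entity, description, timestamp in events:
        if start <= timestamp < end:
            grouped[entity].append(description)
    return dict(grouped)


# Hypothetical events as (entity, description, timestamp) tuples.
events = [
    ("MicroJourneyService", "response time degradation", datetime(2018, 6, 7, 7, 50)),
    ("MicroJourneyService", "failure rate increase", datetime(2018, 6, 7, 7, 58)),
    ("Host 1", "CPU saturation", datetime(2018, 6, 7, 8, 10)),
]

selected = events_in_interval(
    events,
    start=datetime(2018, 6, 7, 7, 45),
    end=datetime(2018, 6, 7, 8, 0),
)
```

With these sample events, `selected` contains both MicroJourneyService events, while the 'Host 1' event at 08:10 falls outside the highlighted interval and is left out.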

For further reading on root cause analysis, you can check the root cause analysis use case provided below: