How is the Dynatrace approach to problem detection unique?

Dynatrace offers a holistic approach to monitoring the health of your applications. It continuously monitors your application from multiple perspectives:

  • Applications (real users) - This is the experience that your customers have with your application: the response time and overall performance they see when they access it from a desktop or mobile browser.
  • Server-side services - These are the various services which collectively deliver your application to your customers. These include web requests, database requests, and communication between services.
  • Infrastructure - This level comprises the physical and/or virtual machines that serve your application to your customers. This level includes the servers, databases, hosts, and processes running in your environment.

How Dynatrace correlates incidents across stack layers

Dynatrace understands the dependencies between all the layers and components of your application stack. Dynatrace knows which services your application calls. Dynatrace also knows which processes on which hosts those services run in.

When a performance incident related to one of your application's services is detected, you are alerted via the Dynatrace mobile app (available for both iOS and Android devices) and Dynatrace creates a problem ID for the incident. If related application- or service-level incidents are detected in your environment, Dynatrace will consolidate those incidents into the same problem for tracking and root cause analysis.
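As a rough illustration of how a dependency model makes this consolidation possible, the following minimal sketch groups incidents into the same problem when one affected entity depends on the other. It is not Dynatrace's actual algorithm; the topology, the entity names, and the `reachable` and `consolidate` helpers are all hypothetical.

```python
# Hypothetical topology: each entity maps to the entities it depends on
# (application -> services -> processes -> hosts).
DEPENDS_ON = {
    "app:webshop": ["service:checkout"],
    "service:checkout": ["service:payment", "process:tomcat-1"],
    "service:payment": ["process:tomcat-2"],
    "process:tomcat-1": ["host:h1"],
    "process:tomcat-2": ["host:h2"],
    "app:blog": ["service:cms"],
    "service:cms": ["process:node-1"],
    "process:node-1": ["host:h3"],
}

def reachable(entity):
    """Return the entity plus everything it depends on, directly or transitively."""
    seen, stack = set(), [entity]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(DEPENDS_ON.get(current, []))
    return seen

def consolidate(incident_entities):
    """Greedy grouping: an incident joins a problem when its entity depends on,
    or is depended on by, an entity already in that problem."""
    problems = []
    for entity in incident_entities:
        for problem in problems:
            if any(entity in reachable(other) or other in reachable(entity)
                   for other in problem):
                problem.append(entity)
                break
        else:
            # No related problem found, so this incident opens a new one.
            problems.append([entity])
    return problems

problems = consolidate(["service:payment", "app:webshop", "service:cms"])
print(problems)
```

In this sketch the payment service and the webshop application land in one problem because the webshop depends on the payment service through the checkout service, while the unrelated CMS incident becomes a separate problem.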

Analyzing underlying dependencies

To understand the service-level or infrastructure-level incidents that have contributed to a problem, click the Visual resolution path graph on the problem detail page. The Visual resolution path shows you the dependencies between your application and the underlying services and infrastructure components that support it.

Each Visual resolution path page includes a Problem evolution viewer that you can use to replay the problem and see how it evolved over time. Here you can see in detail how your application's dependencies interacted and performed before and during the problem. You can see which failed service calls or infrastructure health issues led to the failure of other service calls and ultimately to the performance problem that affects your customers' experience.

You can also click Analyze root cause on any problem detail page to understand the underlying cause of the problem and begin problem resolution efforts.

For a detailed root cause analysis example, see Root cause analysis of infrastructure issues.

Why the slowest 10% of response times are important

While other APM tools focus on average response times, Dynatrace takes a different approach, one that focuses on the user experience of all your customers, not just those who experience good or average response times. Dynatrace places special emphasis on the slowest 10% of response times experienced by your customers. If you only know the average (median or mean) response time, which reflects what the majority of your customers experience, you'll miss a crucial point: some of your customers are experiencing unacceptable performance problems!

Consider a typical search service that performs some database calls. The response time of these database calls may vary greatly depending on whether the requests can be served from cache or must be pulled from the database. Median response time measurements are insufficient in such a scenario: although the majority of your customers (those whose database requests are served from the cache) experience acceptable response times, a portion of your customers (those whose requests are pulled from the database) experience unacceptable performance. Placing monitoring emphasis on the slowest 10% of response times exposes such issues.
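The search service scenario can be illustrated with a small calculation. This is a minimal sketch with made-up numbers (85 cached requests at 20 ms and 15 database requests at 900 ms) and a hypothetical nearest-rank `percentile` helper; it only demonstrates why a median hides the slow requests and is not how Dynatrace computes its metrics.

```python
import statistics

# Hypothetical response times in milliseconds: 85 requests served from
# cache (fast) and 15 requests that had to be pulled from the database (slow).
cache_hits = [20] * 85
cache_misses = [900] * 15
response_times = cache_hits + cache_misses

def percentile(values, pct):
    """Nearest-rank percentile: the smallest value that has at least
    pct percent of the samples at or below it."""
    ordered = sorted(values)
    rank = max(1, -(-len(ordered) * pct // 100))  # ceiling division
    return ordered[rank - 1]

median_ms = statistics.median(response_times)
p90_ms = percentile(response_times, 90)

print(f"median: {median_ms} ms")          # what the typical customer sees
print(f"90th percentile: {p90_ms} ms")    # where the slowest 10% begins
```

With these assumed numbers the median is 20 ms, which looks healthy, while the 90th percentile is 900 ms, which reveals the customers whose requests miss the cache.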

How effective is time correlation?

Time correlation alone is ineffective at identifying the root cause of many performance problems. For this reason, Dynatrace emphasizes sequences of events.

First, let's consider a simple time correlation example in which 'Service A' calls 'Service B'. The first incident in this problem evolution sequence is a slowdown on 'Service B'. The next incident in the sequence is a slowdown on 'Service A'. In this case, time correlation seems to work well for indicating the root cause of the problem: the slowdown on 'Service B' led to the slowdown on 'Service A'. This is, however, a very simplistic problem.

What if the incidents in the problem evolution sequence are more nuanced and open to interpretation? What if, for example, 'Service A' has a long history of performance problems? With such knowledge, it becomes impossible to say conclusively that the slowdown on 'Service A' was caused by the slowdown on 'Service B'. It may be that 'Service A' is simply experiencing another in a history of performance issues. Such subtleties make time correlation alone ineffective at conclusively pinpointing the root cause of many performance problems.
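The limitation can be made concrete with a minimal sketch of naive time correlation. The timestamps, the `past_slowdowns` history, and the `root_cause_by_time` helper are hypothetical, and the rule that any prior slowdown makes a conclusion doubtful is an assumption chosen purely for illustration.

```python
from datetime import datetime

# Hypothetical incidents: (entity, time the slowdown was detected).
incidents = [
    ("Service A", datetime(2024, 1, 1, 12, 5)),
    ("Service B", datetime(2024, 1, 1, 12, 0)),
]

# Hypothetical history: slowdowns each service had during the previous week.
past_slowdowns = {"Service A": 14, "Service B": 0}

def root_cause_by_time(incidents):
    """Naive time correlation: blame whichever incident was detected first."""
    return min(incidents, key=lambda incident: incident[1])[0]

suspect = root_cause_by_time(incidents)

# Services other than the suspect that regularly slow down on their own.
# For these, "the suspect slowed down first" is not conclusive evidence.
inconclusive = [name for name, _ in incidents
                if name != suspect and past_slowdowns.get(name, 0) > 0]

print("suspect by time order:", suspect)
print("cannot be attributed conclusively:", inconclusive)
```

Time ordering points at 'Service B' whether or not 'Service A' has a history of slowdowns, which is exactly why the timestamps alone cannot settle the question.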