How problems are detected and analyzed

Problems in Dynatrace represent anomalies, that is, deviations from normal behavior or state. Such an anomaly can be, for example, a slow service or a slow user login to an application. Whenever Dynatrace detects such an anomaly, it raises a specific problem event.

Note that a newly detected anomalous event in your environment won't necessarily result in the immediate raising of a new problem; related events are correlated first, so that each raised problem provides insight into the underlying root cause. To identify the root causes of problems, Dynatrace follows a context-aware approach that detects interdependent events across time, processes, hosts, services, applications, and both vertical and horizontal topological monitoring perspectives. Only such a context-aware approach makes it possible to pinpoint the true root causes of problems.

Problem detection

Dynatrace continuously measures incoming traffic against defined thresholds to determine when a detected slowdown or error-rate increase justifies the generation of a new problem event. Rapid response-time degradations in applications and services are evaluated over sliding 5-minute intervals, while slowly developing degradations are evaluated over 15-minute intervals.
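The window-based evaluation described above can be sketched as follows. This is an illustrative model only, not Dynatrace's actual detection algorithm; the median-based check and the threshold value are assumptions:

```python
from collections import deque
from statistics import median

class SlidingWindowDetector:
    """Illustrative sliding-window check (not Dynatrace's real algorithm).

    Samples are (timestamp_seconds, response_time_ms) pairs; the window
    length mirrors the 5-minute and 15-minute intervals described above.
    """

    def __init__(self, window_seconds, threshold_ms):
        self.window_seconds = window_seconds
        self.threshold_ms = threshold_ms
        self.samples = deque()

    def add_sample(self, ts, response_ms):
        self.samples.append((ts, response_ms))
        # Evict samples that have slid out of the window.
        while self.samples and ts - self.samples[0][0] > self.window_seconds:
            self.samples.popleft()

    def degraded(self):
        # Flag a degradation when the median response time inside the
        # window exceeds the reference threshold.
        if not self.samples:
            return False
        return median(r for _, r in self.samples) > self.threshold_ms

# Rapid degradations: 5-minute window (slow ones would use 15 minutes).
fast = SlidingWindowDetector(window_seconds=300, threshold_ms=500)
for t in range(0, 300, 10):
    fast.add_sample(t, 900 if t >= 100 else 200)
print(fast.degraded())  # → True
```

A production detector would also consider load (to avoid alerting on a handful of slow requests) and error rates, but the sliding-eviction pattern is the core idea.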

Understanding thresholds

Dynatrace utilizes two types of thresholds:

  • Automated baselines: Multidimensional baselining automatically detects individual reference values that adapt over time. Automated baseline reference values are used to cope with dynamic changes within your application or service response times, error rates, and load.
  • Built-in static thresholds: Dynatrace uses built-in static thresholds for all infrastructure events (for example, detecting high CPU, low disk space, or low memory).
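A static infrastructure check is conceptually simple; the sketch below illustrates the idea. The metric names and threshold values here are invented for illustration and are not Dynatrace's built-in defaults:

```python
# Illustrative static thresholds for infrastructure events.
# These values are assumptions, not Dynatrace's built-in defaults.
STATIC_THRESHOLDS = {
    "cpu_usage_percent": 95.0,     # high CPU when above
    "disk_free_percent": 3.0,      # low disk space when below
    "memory_free_percent": 5.0,    # low memory when below
}

def check_infrastructure(metrics):
    """Return the infrastructure events raised for a metric snapshot."""
    events = []
    if metrics["cpu_usage_percent"] > STATIC_THRESHOLDS["cpu_usage_percent"]:
        events.append("high CPU")
    if metrics["disk_free_percent"] < STATIC_THRESHOLDS["disk_free_percent"]:
        events.append("low disk space")
    if metrics["memory_free_percent"] < STATIC_THRESHOLDS["memory_free_percent"]:
        events.append("low memory")
    return events

print(check_infrastructure(
    {"cpu_usage_percent": 97.2,
     "disk_free_percent": 12.0,
     "memory_free_percent": 4.1}))  # → ['high CPU', 'low memory']
```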

The methodology used for raising events with automated baselining is completely different from that used for static thresholds. Static thresholds offer a simple and straightforward approach to defining baselines that works immediately without requiring a learning period. This approach, however, is by no means intelligent, because of the following shortcomings:

  • Too much manual effort is required for setting static thresholds for each service method or user action.
  • It can be challenging to set static thresholds for dynamic services.
  • They don't adapt to changing environments.

Dynatrace therefore applied AI to develop an intelligent, automated, multidimensional baselining method. As opposed to static thresholds, this approach works out of the box, without manual configuration of thresholds, and, most importantly, adapts automatically to changes in traffic patterns.
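The contrast with static thresholds can be made concrete with a toy adaptive baseline. Dynatrace's multidimensional baselining is far more sophisticated; this exponentially weighted moving average with a tolerance factor (both assumptions) only illustrates that the reference value follows observed traffic instead of staying fixed:

```python
class AdaptiveBaseline:
    """Toy auto-adapting reference value (not Dynatrace's method)."""

    def __init__(self, alpha=0.05, tolerance=1.5):
        self.alpha = alpha          # how quickly the baseline adapts
        self.tolerance = tolerance  # allowed multiple of the baseline
        self.baseline = None

    def observe(self, value):
        """Feed a new measurement; return True if it is anomalous."""
        if self.baseline is None:
            self.baseline = value
            return False
        anomalous = value > self.baseline * self.tolerance
        # Only normal traffic moves the baseline, so a spike does not
        # immediately become the new "normal".
        if not anomalous:
            self.baseline += self.alpha * (value - self.baseline)
        return anomalous

b = AdaptiveBaseline()
normal = [b.observe(v) for v in [100, 110, 95, 105]]  # all False
spike = b.observe(400)                                # True
```

As traffic gradually shifts (say, response times slowly rise as load grows), the baseline drifts with it, whereas a static threshold would eventually fire permanently or never.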

Note that Dynatrace allows you to adjust the sensitivity of problem detection, either by adapting the static thresholds or by changing the allowed deviation from the automated baselines.

Problem analysis

Once a problem is detected, you can analyze its consequences directly on the problem's overview page. Dynatrace offers both direct impact analysis and business impact analysis. The problem overview page also lets you analyze the root cause of a problem.

Root-cause analysis

To identify the root cause of problems, Dynatrace doesn't depend only on time correlation but follows a context-aware approach to detect interdependent events across time, processes, hosts, services, applications, and both vertical and horizontal topological monitoring perspectives.

The following scenario involves a problem that has as its root cause a performance incident in the infrastructure layer.

[Image: Problem life span]

  1. Dynatrace detects an infrastructure-level performance incident. A new problem is created for tracking purposes and a notification is sent out via the Dynatrace mobile app.

  2. After a few minutes the infrastructure problem leads to the appearance of a performance degradation problem in one of the application's services.

  3. Additional service-level performance degradation problems begin to appear. So what began as an isolated infrastructure-only problem has grown into a series of service-level problems that each have their root cause in the original incident in the infrastructure layer.

  4. Eventually the service-level problems begin to affect the user experience of your customers who are interacting with your application via desktop or mobile browsers. At this point in the problem life span you have an application problem with one root cause in the infrastructure layer and additional root causes in the service layer.

  5. Because Dynatrace understands all the dependencies in your environment, it correlates the performance degradation problem your customers are experiencing with the original performance problem in the infrastructure layer, thereby facilitating quick problem resolution.
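The dependency-driven correlation in the scenario above can be sketched with a small topology graph. The entity names and graph shape below are invented for illustration; the idea is that the root cause is the deepest affected entity, i.e. one whose own dependencies are all healthy:

```python
# Hypothetical topology: each entity lists the entities it depends on.
DEPENDS_ON = {
    "web-application": ["booking-service", "login-service"],
    "booking-service": ["host-1"],
    "login-service": ["host-1"],
    "host-1": [],
}

def root_causes(problem_entity, affected, graph=DEPENDS_ON):
    """Follow dependencies from the impacted entity and return the
    deepest affected entities (those with no affected dependencies)."""
    roots, seen = set(), set()
    stack = [problem_entity]
    while stack:
        entity = stack.pop()
        if entity in seen:
            continue
        seen.add(entity)
        affected_deps = [d for d in graph[entity] if d in affected]
        if affected_deps:
            stack.extend(affected_deps)   # descend toward the cause
        elif entity in affected:
            roots.add(entity)             # nothing below it is affected
    return roots

# All four entities show anomalous events, but only host-1 has no
# affected dependencies, so it is reported as the root cause.
print(root_causes("web-application",
                  affected={"web-application", "booking-service",
                            "login-service", "host-1"}))  # → {'host-1'}
```

This is why simple time correlation is insufficient: without the dependency graph, four simultaneous problems would look like four independent incidents.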

Problem alerting

Upon detecting an anomaly, Dynatrace can generate an alert to notify the responsible people that something is wrong. Dynatrace allows you to set up fine-grained alert-filtering rules based on the severity, customer impact, associated tags, and/or duration of detected problems. These rules define an alerting profile. Through alerting profiles, you can also set up filtered problem-notification integrations with third-party messaging systems like Slack, HipChat, and PagerDuty.
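A filtering rule of this kind can be sketched as a simple predicate. The field names and rule shape below are assumptions for illustration, not the actual alerting-profile schema:

```python
# Sketch of alert filtering in the spirit of alerting profiles.
# Field names and rule structure are illustrative assumptions.
def should_alert(problem, rule):
    """Return True if the problem matches the filtering rule."""
    if problem["severity"] not in rule["severities"]:
        return False
    if rule["required_tags"] and not rule["required_tags"] <= problem["tags"]:
        return False
    # Only alert once the problem has been open long enough.
    return problem["open_minutes"] >= rule["min_duration_minutes"]

rule = {
    "severities": {"AVAILABILITY", "ERROR"},
    "required_tags": {"team:checkout"},
    "min_duration_minutes": 15,
}
problem = {
    "severity": "ERROR",
    "tags": {"team:checkout", "env:production"},
    "open_minutes": 30,
}
print(should_alert(problem, rule))  # → True
```

Matching problems would then be forwarded to the integrations (Slack, PagerDuty, and so on) attached to that profile.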