Dynatrace continuously measures incoming traffic levels against defined thresholds to determine when a detected slowdown or error-rate increase justifies the generation of a new problem event. Rapidly increasing response-time degradations for applications and services are evaluated based on sliding 5-minute time intervals. Slowly evolving response-time degradations are evaluated based on sliding 15-minute time intervals.
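The sliding-interval evaluation above can be pictured as checking the most recent window of per-minute measurements. The following sketch is a hypothetical illustration (not Dynatrace code) of why a short window reacts quickly to a rapid degradation while a longer window smooths it out:

```python
from statistics import median

def sliding_windows(samples, window_sizes=(5, 15)):
    """Evaluate per-minute response-time samples over sliding windows.

    samples: list of (minute, response_time_ms) tuples, one per minute.
    Returns the median response time of the most recent window of each size.
    """
    results = {}
    for size in window_sizes:
        window = [rt for _, rt in samples[-size:]]  # last `size` minutes
        results[size] = median(window)
    return results

# A spike in the last 3 minutes dominates the 5-minute window
# but is still invisible to the 15-minute window.
series = [(m, 100) for m in range(12)] + [(m, 900) for m in range(12, 15)]
print(sliding_windows(series))  # {5: 900, 15: 100}
```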
Note that newly detected anomalous events in your environment won't necessarily result in the immediate raising of a new problem. Raised problems always provide insight into the underlying root cause. To identify the root causes of problems, Dynatrace analyzes the correlation of events across time, processes, hosts, services, applications, and both vertical and horizontal topological monitoring perspectives. Only by correlating events across time and all these monitoring perspectives can Dynatrace pinpoint the root causes of problems, and only then will you be alerted to a detected problem. For more details, see automatic correlation of dependent topological incidents.
Dynatrace utilizes three types of thresholds:
- Automated baselines: Multidimensional baselining automatically detects individual reference values that adapt over time. Automated baseline reference values are used to cope with dynamic changes within your application or service response times, error rates, and load.
- Built-in static thresholds: Dynatrace uses built-in static thresholds for all infrastructure events (for example, detecting high CPU, low disk space, or low memory).
- User-defined static thresholds: With customizable anomaly detection settings (available at Settings > Anomaly detection), you can override the default static thresholds for infrastructure events. You can also switch application and service anomaly detection from automated baselining to static thresholds, in which case your custom static thresholds replace the detected baseline thresholds for individual dimensions.
The methodology used for raising events with automated baselining is completely different from that used for static thresholds. The following sections provide details about both methods:
Dynatrace uses automated baselining to learn the typical reference values of application and service response times, error rates, and load.
With respect to response times, Dynatrace collects references for the median (above which are the slowest 50% of all callers) and the 90th percentile (the slowest 10% of all callers). A slowdown event is raised if the typical response times for either the median or the 90th percentile degrade.
Why the slowest 10% of response times are important
While other APM tools focus on average response times, Dynatrace takes a different approach—one that focuses on the user experience of all your customers, not just those who are experiencing good or average response times. Dynatrace places special emphasis on the 10% of slowest response times experienced by your customers. This is because if you only know the average (median or mean) response times experienced by the majority of your customers, you'll miss a crucial point: Some of your customers are experiencing unacceptable performance problems!
Consider a typical search service that performs some database calls. The response time of these database calls may vary greatly depending on whether or not the requests can be served from cache or if they must be pulled from the database. Median response time measurements in such a scenario are insufficient because although the majority of your customers (those having their database requests served from the cache) are experiencing acceptable response times, a portion of your customers (those having database requests pulled from the database) are experiencing unacceptable performance. Placing monitoring emphasis on the slowest 10% of your customers resolves such issues.
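The cache-versus-database scenario above can be reproduced numerically. In this illustrative sketch (hypothetical numbers, not Dynatrace code), 85% of requests are served from cache while 15% hit the database; the median looks healthy even though the 90th percentile exposes the slow tail:

```python
from statistics import median, quantiles

# 85 cached requests at ~20 ms, 15 database requests at ~1500 ms.
response_times = [20] * 85 + [1500] * 15

p50 = median(response_times)
p90 = quantiles(response_times, n=10)[-1]  # 90th percentile cut point

print(p50)  # 20   -> the "typical" customer looks fine
print(p90)  # 1500 -> but 10% of customers see unacceptable latency
```

This is why monitoring only the median (or mean) hides real user-facing problems, while the 90th percentile surfaces them.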
Application baselining calculates reference values for 4 different dimensions:
- User action: An application's individual user actions.
- Geolocation: Hierarchically organized list of geolocations where user sessions originate from. Geolocations are organized into continents, countries, regions, and cities.
- Browser: Hierarchically organized list of browser families, such as Firefox and Chrome. The topmost categories are the browser families. These are followed by the browser versions within each browser family.
- Operating system: Hierarchically organized list of operating systems, such as Windows and Linux. The topmost categories are the operating systems. These are followed by the individual OS versions.
Service baselining calculates a reference value for the Service method dimension:
- Service method: A service's individual service methods.
In the case of database services, the service method represents the individual SQL statements that are queried (for example, `call verify_location(?)` or `select booking0_.id from Booking booking0_ where booking0_.user_name<>?`).
A reference value is additionally calculated for the predefined service method groups static requests and dynamic requests. For database services, reference values are calculated for the predefined database service method groups.
Automated baselining attempts to figure out the best reference values for incoming application and service traffic. To do this, Dynatrace automatically generates a baseline cube for your actual incoming application and service traffic. This means that if your traffic comes mainly from New York, and most of your users use the Chrome browser, your baseline cube will contain the following reference values:
```
USA – New York – Chrome – reference response time: 2 s, error rate: 0%, load: 2 actions/min
```
If your application also receives traffic from Beijing, but with a completely different response time, the baseline cube will automatically adapt and thereafter contain the following reference values:
```
USA – New York – Chrome – reference response time: 2 s, error rate: 0%, load: 2 actions/min
China – Beijing – QQ Browser – reference response time: 4 s, error rate: 1%, load: 1 action/min
```
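Conceptually, the baseline cube is a map from dimension tuples to learned reference values. This minimal sketch is a hypothetical data structure (not Dynatrace internals) populated with the example entries from the text:

```python
# Baseline cube keyed by (country, city, browser) -> learned reference values.
# The entries mirror the examples above; units are illustrative.
baseline_cube = {
    ("USA", "New York", "Chrome"): {
        "response_time_s": 2.0, "error_rate": 0.00, "load_per_min": 2
    },
    ("China", "Beijing", "QQ Browser"): {
        "response_time_s": 4.0, "error_rate": 0.01, "load_per_min": 1
    },
}

def reference_for(country, city, browser):
    """Return the learned reference values for one traffic slice, if any."""
    return baseline_cube.get((country, city, browser))

ref = reference_for("China", "Beijing", "QQ Browser")
print(ref["response_time_s"])  # 4.0
```

As traffic from a new slice (for example, a new geolocation) arrives, a new entry with its own reference values is added, which is how the cube adapts automatically.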
The baseline cube is calculated two hours after your application or service is initially detected by Dynatrace OneAgent, so that Dynatrace can analyze two hours of actual traffic to calculate preliminary reference values and identify where your traffic comes from. Calculation of the reference cube is then repeated daily so that Dynatrace can continue to adapt to changes in your traffic.
To avoid over-alerting and reduce notification noise, the automatic anomaly-detection modes don't alert on applications and services with fluctuating availability that haven't run for at least 20% of a full week (7 days). Alerting on response-time degradations and error-rate increases begins once the baseline cube is ready and the application or service has run for at least 20% of a week.
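The two alerting preconditions described above can be expressed as a simple gate. This sketch is an assumption about the shape of the check, not Dynatrace code:

```python
WEEK_MINUTES = 7 * 24 * 60  # 10,080 minutes in a full week

def alerting_enabled(runtime_minutes, baseline_ready):
    """Alert only once the baseline cube exists and the application or
    service has run for at least 20% of a full week (2,016 minutes)."""
    return baseline_ready and runtime_minutes >= 0.2 * WEEK_MINUTES

print(alerting_enabled(1500, True))   # False: below the 2,016-minute mark
print(alerting_enabled(2500, True))   # True
```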
Dynatrace application traffic anomaly detection is based on the assumption that most business traffic follows predictable daily and weekly patterns. Dynatrace automatically learns each application's unique traffic pattern. Alerting on traffic spikes and drops begins after a learning period of one week because baselining requires a full week's worth of traffic to learn daily and weekly patterns.
Following the learning period, Dynatrace forecasts the next week's traffic and then compares the actual incoming application traffic with the prediction. If Dynatrace detects a deviation from forecasted traffic levels that falls outside of reasonable statistical variation, Dynatrace raises either an Unexpected low traffic or an Unexpected high traffic problem.
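A deviation check of this kind can be sketched as comparing observed traffic against the forecast within a statistical tolerance band. The band below (three standard deviations of past forecast residuals) is an illustrative assumption, not Dynatrace's actual model:

```python
from statistics import stdev

def traffic_event(history, forecast, actual, k=3.0):
    """Return a hypothetical traffic event when actual load deviates from
    the forecast by more than k standard deviations of past residuals.

    history: list of (forecast, actual) pairs from previous intervals.
    """
    residuals = [a - f for f, a in history]
    tolerance = k * stdev(residuals)
    deviation = actual - forecast
    if deviation < -tolerance:
        return "Unexpected low traffic"
    if deviation > tolerance:
        return "Unexpected high traffic"
    return None  # within reasonable statistical variation

past = [(100, 98), (100, 103), (100, 101), (100, 99), (100, 100)]
print(traffic_event(past, forecast=100, actual=40))   # Unexpected low traffic
print(traffic_event(past, forecast=100, actual=102))  # None
```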
Advantages and disadvantages of automated baselining
Advantages of automated baselining:
- Works out of the box without manual configuration of thresholds.
- No manual effort required to set specific thresholds for geolocations, browsers, etc.
- Adapts automatically to changes in traffic patterns.
Disadvantages of automated baselining:
- Requires a learning period during which Dynatrace learns normal traffic patterns.
Summary of automated baselining:
- Baselines are evaluated within 5-minute and 15-minute sliding time intervals.
- Reference values for response times, error rates, and load are detected automatically.
- Baselining combines 4 dimensions for applications and 1 dimension for services.
- Baseline cube calculation is initially performed 2 hours after your application or service is first detected by Dynatrace, and thereafter on a daily basis.
- Applications and services must run for at least 20% of a week before slowdown and error-rate alerts are raised.
- Applications must run for at least a full week before traffic spike and drop alerts are raised.
- Slowdown events are detected for both the median and the 90th percentile.
Dynatrace infrastructure monitoring is based on numerous built-in, predefined static thresholds. These thresholds relate to resource contention issues such as CPU saturation, high memory usage, and low disk space. You can change these default thresholds by navigating to Settings > Anomaly detection > Infrastructure.
For applications and services, you can disable automated baselining-based reference-value detection anytime and switch to user-defined static thresholds. If you set a static threshold for response time and error rate on an application or service level, events will be raised if the static threshold is breached. A slowdown event is raised if the static thresholds for either the median or the 90th percentile response times are breached.
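A static-threshold breach check for the median and 90th percentile can be sketched as follows. The threshold values here are hypothetical examples, not Dynatrace defaults:

```python
from statistics import median, quantiles

# Illustrative user-defined static thresholds, in milliseconds.
THRESHOLD_MEDIAN_MS = 250
THRESHOLD_P90_MS = 1000

def slowdown_event(response_times_ms):
    """Return True if either the median or the 90th percentile of the
    observed response times breaches its static threshold."""
    p50 = median(response_times_ms)
    p90 = quantiles(response_times_ms, n=10)[-1]  # 90th percentile
    return p50 > THRESHOLD_MEDIAN_MS or p90 > THRESHOLD_P90_MS

healthy = [100] * 90 + [400] * 10      # p50 = 100, p90 well under 1000
degraded = [100] * 80 + [2000] * 20    # slow tail breaches the p90 threshold
print(slowdown_event(healthy))   # False
print(slowdown_event(degraded))  # True
```

Unlike automated baselining, these fixed cutoffs apply regardless of how the traffic pattern changes over time.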
Host predefined static thresholds

| Hosts | Default static threshold |
|---|---|
| Memory usage event | |
| Java out of memory | |
Network predefined static thresholds

| Network | Default static threshold |
|---|---|
| Number of dropped packets | |
| TCP connectivity for process | |
Disk predefined static thresholds

| Disk | Default static threshold |
|---|---|
| Low disk space | |
| Slow-running disks | |
| Available inodes | |
Advantages and disadvantages of static thresholds
Advantages of static thresholds:
- Alerting begins immediately, with no learning period required.
Disadvantages of static thresholds:
- Considerable manual effort is required to set static thresholds for each service method or user action.
- It can be challenging to set static thresholds for dynamic services.
- They don't adapt to changing environments.
Summary of static thresholds:
- Infrastructure monitoring is built upon predefined static thresholds for numerous metrics.
- Alerting on static thresholds begins immediately, without a learning period.
- Events are raised for threshold breaches of the median and 90th percentile.