Metric events for alerting (legacy)
This page describes the UI approach for Dynatrace version 1.252 or earlier. If you're using a newer version, see the current page.
Dynatrace Davis® AI automatically analyzes abnormal situations within your IT infrastructure and attempts to identify any relevant impact and root cause. Davis relies on a wide spectrum of information sources, such as a transactional view of your services and applications, as well as all events raised on individual nodes within your Smartscape® topology.
There are two main sources for single events in Dynatrace:
- Metric-based events (events that are triggered by a series of measurements)
- Events that are independent of any metric (for example, process crashes, deployment changes, and VM motion events)
Custom metric events are configured in the global settings of your environment and are visible to all Dynatrace users in your environment.
Auto-adaptive baseline
Auto-adaptive baselines represent a dynamic approach to baselining in which the reference value for detecting anomalies changes over time. Their main advantage over a static threshold is that the reference value adapts dynamically, so you don't have to know the threshold upfront, and you don't have to manually adjust multiple static thresholds for metrics whose behavior changes over time.
When the configuration of a metric event includes multiple entities, each entity receives its own auto-adaptive baseline and each baseline is evaluated independently. For example, if the scope of an event includes five hosts, then Dynatrace will calculate and evaluate five independent baselines.
There's a limit of 100 metric event configurations with auto-adaptive baselines per environment, regardless of how many individual baselines each configuration has.
Let's look at an example where an adaptive baseline has an advantage over a statically defined threshold. The chart below shows a disk's measured write times in milliseconds. This is a volatile metric that spikes depending on the amount of write pressure the disk faces. If we defined a threshold for each disk in this IT system based on the initial data (the beginning of the chart), we'd set the static threshold at 20 milliseconds. However, the disk's usage later changes to a higher load, so a static threshold defined this way would produce many false-positive alerts. To avoid this, we'd have to define a new threshold and manually adapt the configuration.
Auto-adaptive baselining, however, automatically adapts reference thresholds daily based on the measurements of the previous seven days. So if a metric changes its behavior, the threshold adapts automatically.
Baseline calculation
The reference values for baseline calculation are the metric data of the last seven days. The per-minute measurements are used to calculate the 99th percentile of all measurements, which determines the baseline. The interquartile range between the 25th and 75th percentiles is then used as the signal fluctuation, which can be added to the baseline. By using the number of signal fluctuations (n × signal fluctuation) parameter, you can control how many times the signal fluctuation is added to the baseline to produce the actual threshold for alerting.
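To make the calculation concrete, here is a minimal sketch of the described logic in Python (the function name, the simulated data, and the n_fluctuations parameter are assumptions for illustration, not Dynatrace's implementation):

```python
import numpy as np

def adaptive_threshold(per_minute_values, n_fluctuations=1):
    """Sketch of the described baselining: baseline = 99th percentile of the
    last seven days of per-minute values, signal fluctuation = interquartile
    range (75th - 25th percentile), threshold = baseline + n * fluctuation."""
    values = np.asarray(per_minute_values, dtype=float)
    baseline = np.percentile(values, 99)
    fluctuation = np.percentile(values, 75) - np.percentile(values, 25)
    return baseline + n_fluctuations * fluctuation

# Example: seven days of simulated per-minute disk write times in milliseconds
rng = np.random.default_rng(0)
last_seven_days = rng.gamma(shape=2.0, scale=5.0, size=7 * 24 * 60)
print(adaptive_threshold(last_seven_days, n_fluctuations=2))
```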
Another important parameter for dynamic baselines is the sliding window that is used to compare current measurements against the calculated threshold. It defines how often the calculated threshold must be violated within a sliding window of time to raise an event (violations don't have to be successive). This approach helps to avoid alerting too aggressively on single violations. You can set the sliding window to a maximum of 60 minutes.
By default, any 3 minutes out of a sliding window of 5 minutes must violate your baseline-based threshold to raise an event. That is, an event is raised once there are 3 violating minutes within any 5-minute sliding window.
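The sliding-window evaluation can be sketched as follows (an illustrative sketch only; the function and parameter names are assumptions, not Dynatrace internals):

```python
from collections import deque

def violates_sliding_window(per_minute_values, threshold, violating=3, window=5):
    """Return True once at least `violating` of the last `window` one-minute
    values exceed the threshold (violations don't have to be successive)."""
    recent = deque(maxlen=window)
    for value in per_minute_values:
        recent.append(value > threshold)
        if sum(recent) >= violating:
            return True
    return False

# Two isolated spikes don't raise an event, but 3 violations within 5 minutes do
print(violates_sliding_window([10, 25, 10, 26, 10, 10], threshold=20))  # False
print(violates_sliding_window([10, 25, 26, 10, 27, 10], threshold=20))  # True
```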
Static threshold
A static threshold represents a hard limit a metric should not violate. The limit can be a single value or a range. As static thresholds don't change over time, they are an important monitoring tool, helping you to define critical boundaries of normal operation.
You need to choose between a static threshold and an adaptive baseline, depending on your use case.
For example, you can use a static threshold to set a limit for total memory usage by a well-known process. In this case, a static threshold is superior to an adaptive baseline: if memory consumption slowly grows over time, an adaptive threshold simply changes with it, raises no problems, and eventually hides a memory leak.
In the illustrations below, memory consumption steadily increases over 30 days. A statically defined threshold of 40 MB will catch the process's abnormal behavior, while an adaptive baseline will increase along with the metric value.
Apart from the threshold value, you can also specify how often the threshold must be violated within a sliding window of time to raise an event (violations don't have to be successive). It helps you to avoid alerting too aggressively on single threshold violations. You can set a sliding window of up to 60 minutes.
By default, any 3 minutes out of a sliding window of 5 minutes must violate your threshold to raise an event. That is, an event would require 3 violating minutes within any 5-minute sliding window.
Advanced metric query
There are two types of metric queries in metric events—basic and advanced. Basic queries just observe raw measurements streaming in. Advanced queries apply a transformation to the raw data. Dynatrace always displays the type of the query in the Alert preview section of the metric event configuration.
Examples of advanced queries include:
- Auto-adaptive baselines, because these require calculation of the baseline.
- Monotonic counters, because a mathematical operation on the raw counts is necessary to derive a continuous rate (see the sketch after this list).
- Missing data alerts, because the condition has to be proactively checked.
- Mathematical expressions.
- Dimension aggregations.
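For the monotonic counter case, the kind of transformation involved looks roughly like this (a generic Python sketch with made-up sample counts, not Dynatrace's implementation):

```python
def counts_to_rate(cumulative_counts, interval_seconds=60):
    """Convert a monotonically increasing counter into a per-second rate
    by differencing successive per-minute samples."""
    rates = []
    for previous, current in zip(cumulative_counts, cumulative_counts[1:]):
        rates.append((current - previous) / interval_seconds)
    return rates

# Example: per-minute samples of a cumulative request counter
print(counts_to_rate([1000, 1180, 1180, 1420]))  # [3.0, 0.0, 4.0]
```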
Environment limits
There is an overall limit of 10,000 metric event configurations per monitoring environment that can be divided into the following categories:
- Basic queries—there is no dimension limit for basic queries. For example, you can create an alert for 20,000 CPU cores in one metric event configuration. While there's no dimension limit, the throttling limit of 100 simultaneous alerts per configuration is used as a safeguard.
- Advanced queries—additional limits apply:
- 100,000 dimensions per environment
- 1,000 dimensions per metric event configuration
- 100 advanced query configurations per monitoring strategy. You can have 100 configurations with an auto-adaptive baseline and 100 configurations with custom thresholds.
Missing data alert
Dynatrace enables you to set an alert on missing data in a metric. If the alert is enabled, Dynatrace regularly checks whether the sliding window of the metric event contains any measurements. For example, if the sliding window requires 3 violating minutes out of any 5 minutes, Dynatrace triggers an alert when no data is received within a 3-minute period.
The missing data condition and the baseline/threshold condition are combined using OR logic.
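Conceptually, the combined per-minute condition behaves like this (a sketch only; None stands in for a minute without data, and the names are assumptions):

```python
def minute_violates(value, threshold, missing_data_alert=True):
    """A minute counts as violating if data is missing (when the missing data
    alert is enabled) OR if the received value breaches the threshold."""
    if value is None:
        return missing_data_alert
    return value > threshold

# Example: the third minute has no data and therefore counts as a violation
print([minute_violates(v, threshold=20) for v in [10, 25, None, 18]])
# [False, True, True, False]
```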
Event description placeholders
To provide details about the alert, use the {missing_data_samples} event description placeholder, which renders as the number of minutes in which no data was received.
Irregular or delayed data streams
We recommend that you disable missing data alerts for sparse data streams, where measurements aren't expected at regular intervals, as they would otherwise result in alert storms.
For expected late-incoming data (for example, cloud integration metrics with a 5-minute delay), use long sliding windows that cover the delay. For a 5-minute delay, use a sliding window of at least 10 minutes.
Limits
Enabling missing data alerting switches the configuration to an Advanced metric query that is subject to additional limits, as it requires proactive checks, even if no data is streaming in.
Scope of the event
The essential aspect of a custom metric event is the correctly configured metric to be monitored. Many Dynatrace metrics are composed of multiple dimensions. You can choose which dimensions to consider for the event. For example, you can select only user actions coming from iOS devices for your custom alert, based on the Action count metric.
You can further fine-tune an event by selecting monitored entities to which it applies. By default, an event applies to all entities providing the respective metric. Using a rule-based filter, you can organize the entities by host group, management zone, name, and tag. For example, for host-based metrics you can include only those hosts that have a certain tag assigned. The actual set of available criteria depends on the metric.
The alerting scope preview can display up to 100 entities that deliver the selected metric and match all specified filters.
If you set a threshold on more than 100 entities, the preview isn't available, and such a configuration can result in a considerable number of alerts.
Topology awareness
Topology awareness and context is one of the key themes of the Dynatrace observability platform. Data such as metrics, traces, events, and logs is not simply reported and stored within the platform; it includes references to the topology where the data originated. For example, with process metrics, each measurement comes with a reference to the associated hosts and processes. Davis AI uses this topological information to automatically perform root-cause detection and impact analysis for detected anomalies. The same applies to all custom metric events that are configured in a monitoring environment.
When an anomaly detection configuration raises an event, Dynatrace automatically identifies the most relevant entity to map the event to. If multiple entity references are detected, the most relevant one is automatically selected. For example, if a metric with a reference to both a host and a process leads to an event, the event is raised on the process.
Metric ingestion enables you to submit all types of metric measurements, regardless of the number of entities they relate to. The following scenarios exist:
Measurements aren't related to any entity
If you define a metric event on a non-topological metric, the resulting event will be raised on the monitoring environment itself, and not on a specific Smartscape entity.
Measurements are related to a single entity
If you define a metric event on a measurement that is related to a single entity, the resulting event will be raised on that entity.
Measurements are related to multiple entities
When multiple entities are specified for each measurement, Dynatrace selects the most appropriate entity on which it should raise the event. In the case of a host and a process, the measurement presumably relates to the process rather than the host, so the event is raised on the process.
Event severity
The severity of an event determines whether a problem is raised and whether Davis AI determines the root cause of the event.
| Severity | Problem raised | Davis analysis | Semantic |
|---|---|---|---|
| Availability | Yes | Yes | Reports any kind of severe component outage. |
| Error | Yes | Yes | Reports any kind of degradation of operational health due to errors. |
| Slowdown | Yes | Yes | Reports a slowdown of an IT component. |
| Resource | Yes | Yes | Reports a lack of resources or a resource-conflict situation. |
| Info | No | Yes | Reports any kind of interesting situation on a component, such as a deployment change. |
| Custom alert | Yes | No | Triggers an alert without causation analysis and without Davis AI involvement. |
For more information about built-in events and their severity levels, see Categories of events.
Event duration
In the configuration of a metric event, you specify how many one-minute slots must violate the threshold or baseline during a certain time period (the evaluation window). When this happens, Dynatrace raises an event.
The event remains open until the metric stays within the threshold or baseline for a certain number of one-minute slots within the same evaluation window, at which point Dynatrace closes the event. By default, the number of such de-alerting slots equals the size of the evaluation window. For example, if the size of the evaluation window is set to 5, the metric has to stay within the threshold or baseline for 5 consecutive one-minute time slots to close the event. You can modify the number of de-alerting slots via the Metric events API.
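The open/close behavior can be sketched as a small state machine (an illustrative sketch; the function and parameter names are assumptions, not the exact Dynatrace semantics):

```python
def event_states(violations, raise_after=3, window=5, dealert_slots=5):
    """Yield the event state (open/closed) per minute: the event opens once
    `raise_after` of the last `window` minutes violate, and closes after
    `dealert_slots` consecutive non-violating minutes."""
    open_event, calm_minutes, recent = False, 0, []
    for violated in violations:
        recent = (recent + [violated])[-window:]
        if not open_event and sum(recent) >= raise_after:
            open_event, calm_minutes = True, 0
        elif open_event:
            calm_minutes = 0 if violated else calm_minutes + 1
            if calm_minutes >= dealert_slots:
                open_event = False
        yield open_event

violations = [True, True, True, False, False, False, False, False, True]
print(list(event_states(violations)))
# [False, False, True, True, True, True, True, False, False]
```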
Metric selector in metric events
The metric selector is a powerful tool for querying your data. It provides two major capabilities:
- Metric transformations, for transforming the selected metric.
- Metric expressions, for combining one or more metrics into a different result by means of simple mathematics.
In this example, we want to detect anomalies on the combined incoming and outgoing network traffic by calculating the sum of all bytes received (builtin:host.net.bytesRx) and sent (builtin:host.net.bytesTx). The metric expression for that is:
((builtin:host.net."bytesTx":splitBy())+(builtin:host.net."bytesRx":splitBy()))
This expression evaluates to a single metric result that Davis will use to learn a baseline and to detect and alert on anomalies.
A metric selector can consist of thousands of individual metric measurements. It is important to understand the implications when configuring a selector that consists of measurements coming from thousands of individual sources. Dynatrace applies safety limits to anomaly detection in terms of the number of metric dimensions that can be observed within one monitoring environment to avoid any operational issues.
Combining metrics for anomaly detection
With the power of a metric expression, you can implement alerting with a top-down view of a situation rather than alerting on each individual component.
For example, you can observe log patterns across multiple hosts. By calculating the total count of observed log patterns across all relevant log files, Dynatrace can detect pattern anomalies on the accumulated log stream rather than on the individual counts per log file.
In the case of sparse counts across many entities (for example, an error count across multiple processes of the same type), aggregated top-down anomaly detection is much more resilient against false-positive alerts than detection on an individual error count per process.
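The following sketch illustrates why the aggregated view is more robust (the per-process error counts and the thresholds are made up for the example):

```python
# Per-minute error counts for three processes of the same type (made-up data)
per_process = {
    "proc-a": [0, 3, 0, 1, 0],
    "proc-b": [1, 0, 4, 0, 2],
    "proc-c": [0, 2, 0, 3, 0],
}

# Alerting per process on a low threshold fires on isolated, sparse spikes...
per_process_alerts = {
    name: [count > 2 for count in counts] for name, counts in per_process.items()
}

# ...while the aggregated count is smoother and only alerts on a real surge.
aggregated = [sum(counts) for counts in zip(*per_process.values())]
aggregated_alerts = [total > 8 for total in aggregated]

print(per_process_alerts)             # each process triggers on a single sparse spike
print(aggregated, aggregated_alerts)  # the aggregate stays below the threshold
```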
Topology mapping
Metric events based on a metric selector support topology awareness. The resulting mapping depends on the data granularity of the result.
Metric selectors that are split by an entity persist that mapping and are topology-aware. The events raised on such metrics are mapped to the original source.
When metric selectors result in a single aggregated series, with no clear entity and topology reference, the events raised on such metrics are mapped to the global monitoring environment.
Override topology mapping
You can override the automatic selection of the entity type the events are mapped to. Select only entity types that are referenced in the incoming metric measurements. If you select an entity type for which the metric doesn't contain the necessary dimension, the override is ignored.
To override the automatic entity type, in the metric event configuration, expand Advanced entity settings and select the required entity type.
Create a metric event
To create a metric event configuration
- In the Dynatrace menu, go to Settings > Anomaly Detection > Custom events for alerting and select Create custom event for alerting.
- Switch to the Build tab.
- Select the metric for your metric event. You can select the metric by the category it belongs to or by the exact metric name.
- Select a type of aggregation for the metric (where applicable).
- optional Select the dimensions to be considered by the event.
- optional Add rule-based entity filters.
- Define the monitoring strategy.
- Choose the strategy:
- Auto-adaptive baseline—Dynatrace calculates the threshold automatically and adapts it dynamically to your metric's behavior.
- Static threshold—threshold that doesn't change through time.
- Specify a sliding window for comparison. The sliding window defines how often the threshold (whether automatically calculated or manually specified) must be violated within a sliding window of time to raise an event (violations don't have to be successive). It helps you to avoid overly aggressive alerting on single violations. You can set a sliding window of up to 60 minutes.
- Depending on the selected strategy, specify:
- Auto-adaptive baseline—how many times the signal fluctuation is added to the baseline.
- Static threshold—the threshold value. Dynatrace suggests a value based on the previous data.
- Choose the missing data alert behavior. If the missing data alert is enabled, it is combined with the baseline/threshold condition using OR logic.
- Select the timeframe of the preview. You can preview alerts over 12 hours, one day, or seven days to evaluate how effective your configuration is.
- Select a title for your event. The title should be a short, easy-to-read string describing the situation, such as High network activity or CPU saturation.
- In the Event description section, create a meaningful event message. Event messages help you understand the nature of the event. You can use the following placeholders (an example message follows this procedure):
- {alert_condition}—the condition of the alert (above/below the threshold).
- {baseline}—the violated value of the baseline.
- {dims}—a list of all dimensions (and their values) of the metric that violated the threshold. You can also specify a particular dimension: {dims:dt.entity.<entity>}. To fetch the list of available dimensions for your metric, query it via the GET metric descriptor request.
- {entityname}—the name of the affected entity.
- {metricname}—the name of the metric that violated the threshold.
- {missing_data_samples}—the number of samples with missing data. Only available if the missing data alert is enabled.
- {severity}—the severity of the event.
- {threshold}—the violated value of the threshold.
- Select Create custom event for alerting to save your new event.
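As an illustration of the Event description step above, a message could combine several placeholders, for example: "{metricname} on {entityname} violated the configured threshold of {threshold} ({alert_condition}, severity: {severity})." This wording is only an example; how each placeholder renders depends on the triggering data.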
- In the Dynatrace menu, go to Settings > Anomaly Detection > Custom events for alerting and select Create custom event for alerting.
- Switch to the Code tab.
- Type in the required metric selector. For reference, see Metric selector transformation and Metric expressions.
- optional Under Advanced entity settings, define the entity type to which the raised events should be mapped.
- Define the monitoring strategy.
- Choose the strategy:
- Auto-adaptive baseline—Dynatrace calculates the threshold automatically and adapts it dynamically to your metric's behavior.
- Static threshold—threshold that doesn't change through time.
- Specify a sliding window for comparison. The sliding window defines how often the threshold (whether automatically calculated or manually specified) must be violated within a sliding window of time to raise an event (violations don't have to be successive). This helps you to avoid overly aggressive alerting on single violations. You can set a sliding window of up to 60 minutes.
- Depending on the selected strategy, specify:
- Auto-adaptive baseline—how many times the signal fluctuation is added to the baseline.
- Static threshold—the threshold value. Dynatrace suggests a value based on the previous data.
- Choose the missing data alert behavior. If the missing data alert is enabled, it is combined with the baseline/threshold condition using OR logic.
- Select the timeframe of the preview. You can preview alerts over 12 hours, one day, or seven days to evaluate how effective your configuration is.
- Select a title for your event. The title should be a short, easy-to-read string describing the situation, such as High network activity or CPU saturation.
. - In the Event description section, create a meaningful event message. Event messages help you understand the nature of the event. You can use the following placeholders:
- {alert_condition}—the condition of the alert (above/below the threshold).
- {baseline}—the violated value of the baseline.
- {dims}—a list of all dimensions (and their values) of the metric that violated the threshold. You can also specify a particular dimension: {dims:dt.entity.<entity>}. To fetch the list of available dimensions for your metric, query it via the GET metric descriptor request.
- {entityname}—the name of the affected entity.
- {metricname}—the name of the metric that violated the threshold.
- {missing_data_samples}—the number of samples with missing data. Only available if the missing data alert is enabled.
- {severity}—the severity of the event.
- {threshold}—the violated value of the threshold.
- Select Create custom event for alerting to save your new event.
Metric events API
The same metric events functionality is available through the Anomaly detection—metric events API. Using the API, you can list, update, create, and delete configurations.
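As a minimal sketch of using this API from a script (the endpoint path /api/config/v1/anomalyDetection/metricEvents, the token permissions, and the response shape are assumptions based on Configuration API v1 conventions; verify them against the API reference for your version):

```python
import requests

BASE_URL = "https://{your-environment-id}.live.dynatrace.com"  # assumed SaaS URL format
API_TOKEN = "dt0c01.XXXX"  # an API token with permission to read configurations

# List existing metric event configurations (assumed legacy Configuration API v1 endpoint)
response = requests.get(
    BASE_URL + "/api/config/v1/anomalyDetection/metricEvents",
    headers={"Authorization": "Api-Token " + API_TOKEN},
)
response.raise_for_status()

for config_stub in response.json().get("values", []):
    print(config_stub.get("id"), config_stub.get("name"))
```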