Auto-adaptive baselining for custom metric events

Auto-adaptive baselining is a dynamic approach to baselining in which the reference value for detecting anomalies changes over time. Its main advantage over a static threshold is that the reference value adapts automatically, so you don't have to know the threshold up front. You also don't have to manually adjust multiple static thresholds for metrics whose behavior changes over time.

You can create a maximum of 100 metric event configurations based on an adaptive baseline per environment.

Let's look at an example where an adaptive baseline has an advantage over a statically defined threshold. The chart below shows the measured write times of a disk, in milliseconds. This is a volatile metric that spikes depending on the write pressure the disk faces. If we were to define a threshold for each disk within our IT system based on the initial data (beginning of the chart), we'd set the static threshold at 20 milliseconds. Later, however, the disk comes under a higher load, and a static threshold defined this way would produce many false-positive alerts. To avoid these, we'd have to define a new threshold and manually adapt the configuration.

Static threshold

Auto-adaptive baselining, however, automatically adapts reference thresholds daily based on the measurements of the previous seven days. So if a metric changes its behavior, the threshold adapts automatically.

Auto-adaptive baselining

Baseline calculation

The reference values for the baseline calculation are the metric data of the last seven days. The per-minute measurements are used to calculate the 99th percentile, which serves as the baseline. The interquartile range (the difference between the 75th and 25th percentiles) is then used as the signal fluctuation, which can be added to the baseline. With the number of signal fluctuations (n × signal fluctuation) parameter, you control how many times the signal fluctuation is added to the baseline to produce the actual threshold for alerting.
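The calculation described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the product's implementation: the function names `percentile` and `adaptive_threshold`, the parameter `n_fluctuations`, and the linear-interpolation percentile method are all assumptions made for the example.

```python
from typing import Sequence


def percentile(values: Sequence[float], p: float) -> float:
    """Return the p-th percentile (0-100) using linear interpolation.

    The interpolation method is an assumption; the text above does not
    say how percentiles are interpolated.
    """
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * p / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def adaptive_threshold(minute_values: Sequence[float],
                       n_fluctuations: float = 1.0) -> float:
    """Baseline (99th percentile) plus n times the signal fluctuation.

    minute_values stands for the per-minute measurements of the
    reference period (seven days in the text above).
    """
    baseline = percentile(minute_values, 99)
    # Signal fluctuation: range between the 25th and 75th percentiles.
    fluctuation = percentile(minute_values, 75) - percentile(minute_values, 25)
    return baseline + n_fluctuations * fluctuation
```

For example, for the values 1 through 101 the baseline is 100 and the signal fluctuation is 50, so `adaptive_threshold(values, n_fluctuations=2)` yields a threshold of 200.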

Another important parameter for dynamic baselines is the sliding window that is used to compare current measurements against the calculated threshold. It defines how often the calculated threshold must be violated within a sliding window of time to raise an event (violations don't have to be successive). This approach helps to avoid alerting too aggressively on single violations. You can set the sliding window to a maximum of 60 minutes.

Auto-adaptive baseline settings

By default, any 3 minutes within a sliding window of 5 minutes must violate your baseline-based threshold to raise an event. In other words, an event is raised only when at least 3 one-minute measurements in any 5-minute sliding window violate the threshold.
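To illustrate how the sliding window evaluates violations, here is a minimal Python sketch. It is not the product's implementation: the function name `event_minutes` and its parameters are hypothetical, and it assumes that a violation means a per-minute value strictly above the threshold and that the first, partially filled windows are evaluated like full ones.

```python
from collections import deque
from typing import Iterable, List


def event_minutes(values: Iterable[float],
                  threshold: float,
                  violating_samples: int = 3,
                  window_size: int = 5) -> List[int]:
    """Return the minute indices at which an event would be raised.

    Defaults follow the text above: 3 violating minutes in any
    5-minute sliding window, with a window of at most 60 minutes.
    """
    if not 1 <= violating_samples <= window_size <= 60:
        raise ValueError("need 1 <= violating_samples <= window_size <= 60")
    # Keep only the violation flags of the most recent window_size minutes.
    window = deque(maxlen=window_size)
    raised = []
    for minute, value in enumerate(values):
        window.append(value > threshold)  # assumption: strict "greater than"
        if sum(window) >= violating_samples:
            raised.append(minute)
    return raised


# Violations at minutes 1, 3, and 5 are not successive, but all three
# fall into the 5-minute window that ends at minute 5.
samples = [10, 25, 10, 30, 10, 40, 10, 10, 10, 10]
print(event_minutes(samples, threshold=20))
```

Raising `violating_samples` or widening `window_size` in this sketch makes the check less sensitive to isolated spikes, which mirrors the purpose of the sliding window described above.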