Alert thresholds

Depending on your monitoring goals, alert thresholds may support operational goals or business goals:

  • Operational – Your expectations of how infrastructure and applications should perform; performance variations depending on geography or provider; performance measurements for transactions and steps.
  • Business – End users' experiences; performance of key steps in a transaction; object and page failures.

Thresholds should be based on actual historical performance, not on performance goals. For thresholds to be meaningful, the baseline performance must be relatively stable, or at least have predictable variations over a defined time period.

Response time thresholds

Threshold levels

Set performance thresholds that reflect end users' expectations. The thresholds should be based on current website performance, not future performance goals.

  • Warning – The end user's experience may be negatively affected, but a critical problem has not (yet) occurred. A Warning alert enables you to diagnose and resolve an issue before it becomes a threat to revenue or SLA.
  • Severe – Conditions are seriously degraded, and immediate action is required to restore normal performance.
  • Improved – Performance is improved enough that it no longer exceeds the Warning threshold.

Threshold types

You can configure static or dynamic thresholds for Response Time alerts, depending on the monitoring situation.

Static Threshold

A static threshold is a fixed value, for example, the number of seconds for a Response Time alert. The threshold is used for all locations, and remains the same until you reconfigure the alert.

In a static response time calculation, the Dynatrace Portal calculates the number of tests or the percentage of tests that exceed the defined thresholds for each rolling 60-minute time period.
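The rolling-window count described above can be sketched as follows. This is a minimal illustration, not the Dynatrace Portal's implementation: the function name `static_exceedance`, the `(timestamp, response_time)` tuple layout, and the sample values are all assumptions made for the example.

```python
from datetime import datetime, timedelta

def static_exceedance(tests, threshold_s, now, window=timedelta(minutes=60)):
    """Return (count, percent) of tests in the rolling window that exceed threshold_s.

    tests: iterable of (timestamp, response_time_seconds) pairs (hypothetical layout).
    """
    # Keep only the tests that ran inside the rolling window ending at `now`.
    recent = [rt for ts, rt in tests if now - window < ts <= now]
    if not recent:
        return 0, 0.0
    over = sum(1 for rt in recent if rt > threshold_s)
    return over, 100.0 * over / len(recent)

now = datetime(2024, 1, 1, 12, 0)
tests = [
    (now - timedelta(minutes=90), 9.0),  # outside the 60-minute window, ignored
    (now - timedelta(minutes=45), 2.0),
    (now - timedelta(minutes=30), 5.5),
    (now - timedelta(minutes=15), 6.0),
    (now - timedelta(minutes=5), 3.0),
]
count, percent = static_exceedance(tests, threshold_s=5.0, now=now)
```

With a 5-second threshold, two of the four tests in the window exceed it, so either the count or the percentage can be compared against the configured alert level.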

Use a static alert threshold, for example, to ensure that the website is meeting SLA targets, since the performance target is a constant value. A static threshold is also appropriate when your baseline performance data shows little variation over time and between test locations.

Dynamic Threshold

A dynamic threshold is calculated from the historical average for performance, for each test location. The threshold may be different for each location, because local conditions outside your control may affect performance. The threshold may also change over time as the average response time changes.

The Dynatrace Portal determines the status for each location (Backbone node, Mobile location, or Private Last Mile peer) by comparing the current response time average to the response time average over a longer period of time. You configure the length of time for the current average (you can choose a time range from 5 minutes through 24 hours) and the historical average (from 1 day through 7 days).

The alert status is based on the difference between the current average and the historical average. You can configure the Warning and Severe thresholds as an absolute number of seconds or as a percentage: an alert is triggered if the current average response time is X seconds longer than the historical average, or if the current average is X% higher than the historical average.
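As a rough sketch of this comparison for a single location, assuming the current and historical samples have already been collected: the function name, the "OK" label for the non-alerting state, and the use of `>=` as the comparison are assumptions for illustration, not documented portal behavior.

```python
from statistics import mean

def dynamic_status(current, historical, warning, severe, as_percent=False):
    """Classify one location by how far its current average exceeds the historical average.

    warning and severe are in seconds, or in percent when as_percent is True.
    """
    current_avg = mean(current)
    historical_avg = mean(historical)
    delta = current_avg - historical_avg
    if as_percent:
        if historical_avg == 0:
            raise ValueError("historical average must be non-zero for percentage thresholds")
        delta = 100.0 * delta / historical_avg
    # Assumption: reaching the threshold exactly counts as a breach.
    if delta >= severe:
        return "Severe"
    if delta >= warning:
        return "Warning"
    return "OK"  # hypothetical label for "no alert"

historical = [2.0, 2.0, 2.0, 2.0]  # longer-period samples, seconds
current = [3.0, 3.0]               # recent samples, seconds

absolute = dynamic_status(current, historical, warning=0.5, severe=1.5)
relative = dynamic_status(current, historical, warning=25, severe=40, as_percent=True)
```

Here the current average is 1 second (50%) above the historical average, so the absolute thresholds give a Warning while the percentage thresholds give a Severe status.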

Dynamic thresholds allow for normal fluctuations in performance from different geographic regions and providers, so you can monitor your website's performance relative to the end users' overall expectations for Internet performance. However, it's important to review the data regularly to make sure response times aren't slowly increasing without triggering an alert.

Average Threshold

In an average response time calculation, the portal triggers an alert when the average response time of tests executed in the past hour exceeds the defined thresholds. This threshold is available for Mobile tests only.

Calculating threshold values

One way to calculate threshold values is to use the standard deviation for your baseline data.

Generate an interactive chart, and configure it to show the standard deviation of your baseline data:

  1. Set Interval to your data collection period, up to the maximum of 1 week.
  2. Click Average Response Time by Test to display the chart configuration, and set Calculation to Standard Deviation.
  3. Click Update.

The data table below the chart lists the standard deviation for the time period. Use that value to set your alert thresholds:

  • Warning = baseline average + (1 x standard deviation)
  • Severe = baseline average + (2 x standard deviation)

You can also calculate the percentage of the standard deviation relative to the baseline average to use percentage threshold values instead of absolute values.
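The arithmetic above can be sketched in a few lines, assuming you have exported the baseline response times yourself. The function name and sample data are hypothetical, and the population standard deviation is used here as an assumption; the chart may compute a sample standard deviation instead.

```python
from statistics import mean, pstdev

def sd_thresholds(baseline):
    """Derive Warning/Severe thresholds from baseline response times (seconds)."""
    avg = mean(baseline)
    sd = pstdev(baseline)  # assumption: population standard deviation
    return {
        "warning_s": avg + sd,               # baseline average + 1 standard deviation
        "severe_s": avg + 2 * sd,            # baseline average + 2 standard deviations
        "warning_pct": 100.0 * sd / avg,     # same offsets as a percentage of the average
        "severe_pct": 100.0 * 2 * sd / avg,
    }

baseline = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # average 5 s, standard deviation 2 s
thresholds = sd_thresholds(baseline)
```

For this sample, the absolute thresholds are 7 seconds (Warning) and 9 seconds (Severe), and the equivalent percentage thresholds are 40% and 80% above the baseline average.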

Transaction failure thresholds

In Mobile and Private Last Mile tests, the threshold for a Transaction Failure alert is either a minimum availability percentage or a fixed number of test locations with failed test runs. Depending on the criterion selected in the test settings, the alert is triggered either when availability falls below the specified percentage or when the transaction fails on more than the specified number of Mobile locations or private peers.
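The two criteria can be sketched as below. This is only an illustration: the function name, the input layout, and the definition of availability as successful runs divided by total runs are assumptions, not the portal's documented calculation.

```python
def transaction_failure_alert(results, *, min_availability_pct=None, max_failed_locations=None):
    """Evaluate exactly one of the two criteria.

    results: dict mapping location name -> list of booleans, True = test run succeeded
             (hypothetical layout).
    """
    if (min_availability_pct is None) == (max_failed_locations is None):
        raise ValueError("choose exactly one criterion")

    if min_availability_pct is not None:
        # Assumption: availability = successful runs / all runs, across all locations.
        runs = [ok for location_runs in results.values() for ok in location_runs]
        availability = 100.0 * sum(runs) / len(runs)
        return availability < min_availability_pct

    # Count locations that had at least one failed test run.
    failed_locations = sum(1 for location_runs in results.values() if not all(location_runs))
    return failed_locations > max_failed_locations

results = {
    "location-a": [True, True, False, True],
    "location-b": [True, True, True, True],
    "location-c": [False, True, True, True],
}
by_availability = transaction_failure_alert(results, min_availability_pct=90)
by_locations = transaction_failure_alert(results, max_failed_locations=1)
```

In this sample, 10 of 12 runs succeeded (about 83% availability) and two locations had failures, so both criteria would trigger an alert.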

In Backbone tests, you configure the node threshold as either a percentage or a fixed number of nodes, and the number of consecutive errors. For example, an alert is triggered when the transaction fails on more than 8% of the nodes in more than 4 consecutive test runs.

Object failure thresholds

In Backbone tests, the Object Failure threshold is configured with the following settings:

  • Test threshold – Either a percentage or a fixed number of objects that fail to load correctly.
  • Node thresholds – Either a percentage or a fixed number of nodes where objects failed.
  • Number of consecutive errors

For example, an alert is triggered if more than 2% of the objects fail on more than 5 nodes, in more than 3 consecutive test runs.
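One way to read how the three settings combine is sketched below, using percentage for the test threshold and a fixed count for the node threshold as in the example. The function name, the data layout, and the interpretation that a run "breaches" when enough nodes exceed the test threshold are assumptions for illustration.

```python
def object_failure_alert(runs, object_pct, node_count, consecutive):
    """Return True if the alert condition is met at any point in the run history.

    runs: list of test runs, oldest first; each run maps
          node name -> (failed_objects, total_objects) (hypothetical layout).
    """
    streak = 0
    for run in runs:
        # Nodes where the share of failed objects exceeds the test threshold.
        bad_nodes = sum(
            1 for failed, total in run.values()
            if total and 100.0 * failed / total > object_pct
        )
        # A run counts toward the streak only if enough nodes breached.
        streak = streak + 1 if bad_nodes > node_count else 0
        if streak > consecutive:
            return True
    return False

# Six nodes, each with 3 of 100 objects failing (3% > 2%).
bad_run = {f"node{i}": (3, 100) for i in range(6)}
alert = object_failure_alert([bad_run] * 4, object_pct=2, node_count=5, consecutive=3)
```

Four breaching runs in a row exceed the limit of 3 consecutive errors, so the sketch reports an alert; three such runs would not.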

Byte limit thresholds

In Backbone tests, the Byte Limit threshold is configured with the following settings:

  • Test threshold – Set a lower limit and an upper limit, in bytes.
  • Node thresholds – Either a percentage or a fixed number of nodes where the total bytes downloaded is outside the test thresholds.
  • Number of consecutive errors

For example, an alert is triggered if the total bytes downloaded is less than the lower limit or greater than the upper limit, on more than 5% of the nodes in more than 5 consecutive test runs.
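The same kind of evaluation for byte limits might look like the following sketch, here with the node threshold expressed as a percentage. The function name, the byte values, and the list-per-run layout are hypothetical.

```python
def byte_limit_alert(runs, lower, upper, node_pct, consecutive):
    """Return True if the alert condition is met at any point in the run history.

    runs: list of test runs, oldest first; each run is a list with the
          total bytes downloaded on each node (hypothetical layout).
    """
    streak = 0
    for bytes_by_node in runs:
        # Nodes whose download size falls outside the [lower, upper] range.
        outside = sum(1 for b in bytes_by_node if b < lower or b > upper)
        pct = 100.0 * outside / len(bytes_by_node)
        streak = streak + 1 if pct > node_pct else 0
        if streak > consecutive:
            return True
    return False

# 20 nodes: 18 within limits, one too small, one too large (10% outside).
run = [500_000] * 18 + [10_000, 2_000_000]
alert = byte_limit_alert([run] * 6, lower=100_000, upper=1_000_000,
                         node_pct=5, consecutive=5)
```

Six consecutive runs with 10% of nodes outside the limits exceed both the 5% node threshold and the 5-run limit, so the sketch reports an alert.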