Getting started with alerting

Alerting is a crucial aspect of monitoring your web application. Alerts warn you when your synthetic tests show that the web application is not meeting performance standards, so you can respond quickly to fix the issue.

Alerting can support your Service Level Agreement (SLA) or Service Level Objective (SLO). An SLA or SLO usually states availability expectations: "The site will be available 99% of the time during business hours." It may also specify an acceptable response time: the website should take no longer than a specified number of seconds to load. Even if the website is technically available (will eventually load fully and correctly), if it takes a minute or longer for the first page to appear, visitors will become frustrated and may abandon the site.

An alert is triggered when the test performance meets the conditions configured in the alert, based on how the test performs for each test cycle (test frequency). You can configure the frequency for Backbone tests. Private Last Mile and Mobile tests have a fixed frequency. When a test returns to Good status and remains Good for one test cycle, an Improved alert is triggered so you know the issue that triggered the original alert is resolved.

Dynatrace Synthetic alerts do more than let you know when an issue occurs: through the Alert log, you can view a Root Cause Analysis that identifies probable causes and offers troubleshooting suggestions.

To be effective, alerts need to be carefully planned and configured. On the one hand, if alert thresholds are too low, the monitoring team will get alerts too frequently and sometimes for issues that don't require action. On the other hand, if the alert thresholds are too high, customers may experience poor performance before the monitoring team is aware there’s a problem.

To create effective alerts, plan your alerts in the context of your overall monitoring strategy:

  1. Identify the people who need information about the website's performance. The information they need will determine the alerts that need to be configured.
  2. Decide which alert types are needed, based on the information the alert recipients need: response time, test or object failure, or (download) byte limit.
  3. Create tests to support the alerting needs.
  4. Establish performance baselines that take into account normal performance fluctuations among locations, time periods, and your website's pages.
  5. Establish alert thresholds and sensitivity, to make sure alerts provide timely warning of problems but aren't "false alarms".

Identify alert recipients

You can configure alerts to send notifications to different people for different conditions. Identify the people who need to respond depending on the type of problem and its severity. For example:

  • Operations staff who would troubleshoot the problem
  • Managers responsible for ensuring performance goals
  • Partners that deliver third-party content that may be a factor in the issue that triggered the alert

By default, alerts are sent to the Alert log. You can create alert destinations to notify the appropriate people when an alert is triggered. Notifications can be sent to:

  • A mobile device, through the Dynatrace Synthetic Mobile app.
  • Specified email addresses – We recommend using an email group alias for alert recipients and managing any changes through your corporate email system.
  • A URL – Alerts can be posted to a public web page.
  • A ServiceNow instance
  • A Slack channel
  • A VictorOps account

Send sample alerts to alert destinations to make sure the destinations are defined correctly and notifications will be received.

Decide on the type of alert

Depending on the test type, you can configure alerts for:

  • Response Time for the entire test or for each step in a test.
  • Transaction Failure when the transaction fails to complete on a specified number or percentage of nodes or locations. Transaction Failure alerts warn you when Availability falls below the threshold.
  • Object Failure when a specified number or percentage of objects fail to load, even though the page load is otherwise successful.
  • Byte Limit when the total number of bytes downloaded during a successful test execution falls outside the specified limit.

For more details, see Types of Alerts.

Create tests with alerting in mind

Creating an effective alerting system depends on planning the tests for monitoring your system.

The type of alert each test type supports is one factor in deciding which tests you need. For details, see Types of Alerts.

Create synthetic tests that monitor the most critical steps or transactions for your web application. Every activity and scenario should reflect the end user's experience.

Deploy the tests to key geographic locations and service providers for your end users.

To monitor critical processes, provision Backbone tests so you can create alerts for individual steps, either in addition to or instead of alerts for the entire test.

Backbone, Private Last Mile, and Mobile tests all support alerting. Last Mile tests do not support alerting, because public Last Mile data is affected by too many variables for alerts to be useful.

Validate each step

Include validations in your tests to support transaction failure alerts.

Tests consist of actions against elements on a page. A validation checks whether a page loaded successfully by checking for an expected element on the page. You can create an alert that is triggered if validation fails.

Content validation helps ensure that errors are detected and accurately reported.

  • With validation, content that fails to load triggers a specific Content Match Failed error for the step that failed.
  • Without validation, you may get an unhelpful "User Script Error", or the page load may be reported as successful even though the page displays an error message and not the intended content.
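
The difference between the two cases can be illustrated with a minimal sketch. It is not the product's validation mechanism: the `validate_step` function, the `ContentMatchFailed` exception class, and the plain substring check are assumptions chosen only to show why checking for expected content produces a specific, step-level error.

```python
class ContentMatchFailed(Exception):
    """Hypothetical error raised when expected content is missing from a step."""


def validate_step(step_name: str, page_html: str, expected_text: str) -> None:
    # Stand-in for a validation: a plain substring check on the page source.
    if expected_text not in page_html:
        raise ContentMatchFailed(
            f"{step_name}: expected content {expected_text!r} not found"
        )


# The page loads, but it shows an error message instead of the intended content.
error_page = "<h1>Service unavailable</h1>"
try:
    validate_step("Login", error_page, "Welcome back")
except ContentMatchFailed as exc:
    print(f"Content Match Failed: {exc}")
```

Without the check, the error page above would count as a successful page load, because the page itself was delivered.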

Align test frequency with alerting requirements

How soon an alert notification is sent when a problem occurs depends on several configuration details for the test and the alert:

  • How often the test runs
  • The number of nodes the test runs on
  • The number of locations that must be in an error state to trigger the alert
  • The alert threshold type: static or dynamic

Make sure the tests run often enough that failures send timely alerts. For example, if your test only runs from one node once an hour, and your alert settings require three consecutive failed responses before a notification is sent, your customers could be experiencing a performance issue for three hours before your operations staff is alerted to the problem.

Calculate the estimated minimum time from event start to alert:

if N >= T
    then A = F/N * (T - 1)
    else A = NEVER

where

  • N = Number of locations (Backbone nodes, Private Last Mile peers, or Mobile locations)
  • T = Test threshold
  • F = Test frequency, in minutes
  • A = Estimated minimum time from event start to alert, in minutes
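
As a rough illustration, the calculation can be written as a small Python function. The function and parameter names are invented for this sketch, and it assumes that T counts the locations that must be in an error state before the alert is triggered. It returns `None` where the formula says NEVER.

```python
from typing import Optional


def min_minutes_to_alert(
    locations: int, threshold: int, frequency_minutes: float
) -> Optional[float]:
    """Estimated minimum minutes from event start to alert (A = F/N * (T - 1)).

    Returns None when the alert can never trigger (N < T).
    """
    if locations <= 0 or threshold <= 0:
        raise ValueError("locations and threshold must be positive")
    if locations < threshold:
        return None
    return frequency_minutes / locations * (threshold - 1)


# 10 locations, hourly test, threshold of 3 locations in error
print(min_minutes_to_alert(10, 3, 60))  # 12.0
# Only 2 locations can never satisfy a threshold of 3
print(min_minutes_to_alert(2, 3, 60))   # None
```

The estimate falls as locations are added or the frequency is raised, because the test runs are spread across the frequency interval and the threshold is reached sooner.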

Identify performance baselines

Before you configure alerts, collect enough test data to establish the baselines for normal performance. We recommend collecting data for two weeks to be able to see the pattern of performance.

Normal web traffic usually fluctuates according to time of day, day of the week, and even time of year. Within a website, different pages may have different levels and patterns of traffic. For example, normal response time may vary if traffic is heavier at certain times or on certain days, or when background tasks (such as data backups) use system resources. You may decide you need multiple tests that run at different times of day or days of the week, with different alert thresholds, to adjust to normal variations in traffic patterns and performance expectations.

Analyze the data from multiple perspectives to confirm the pattern of performance. In a custom dashboard chart, for example, you can:

  • Focus on response time or availability.
  • Calculate the average or standard deviation.
  • View the data by time, geography, node, or ISP.
  • Drill down to a Raw scatter chart.
  • View the data by step.
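
If you export test results for offline analysis, the same kind of summary can be sketched in a few lines of Python. The data layout (pairs of location name and response time in seconds), the location names, and the `baseline_by_location` function are assumptions made for illustration; they do not describe a Dynatrace export format.

```python
from statistics import mean, stdev
from typing import Dict, Iterable, List, Tuple


def baseline_by_location(
    samples: Iterable[Tuple[str, float]]
) -> Dict[str, Tuple[float, float]]:
    """Return {location: (average, standard deviation)} of response times."""
    grouped: Dict[str, List[float]] = {}
    for location, seconds in samples:
        grouped.setdefault(location, []).append(seconds)
    return {
        # stdev() needs at least two samples; report 0.0 for a single sample.
        location: (mean(values), stdev(values) if len(values) > 1 else 0.0)
        for location, values in grouped.items()
    }


# Made-up sample data: (location, response time in seconds)
samples = [
    ("Boston", 2.0), ("Boston", 4.0),
    ("London", 3.0), ("London", 3.0), ("London", 3.0),
]
for location, (average, deviation) in baseline_by_location(samples).items():
    print(f"{location}: average {average:.2f} s, std dev {deviation:.2f} s")
```

Two locations with the same average can have very different deviations, which is one reason to review the data by location before choosing thresholds.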

Define maintenance windows

Use maintenance windows to suspend testing or alerting during planned maintenance or outages, and to limit data collection to core business hours.

You can create different maintenance windows for each test, so the test runs only during the time period for which it's intended.

Besides defining recurring maintenance windows for regular "downtime", you can create a one-time maintenance window for a unique or infrequent event, such as taking the web application offline during a major upgrade.

For more information, see Maintenance Windows.

Establish alert thresholds and sensitivity

Set alert thresholds that reflect end users' expectations. The thresholds should be based on actual performance baselines, not performance goals.

Depending on the monitoring situation, you can configure static or dynamic thresholds. A static threshold is a fixed value, and is the same for all test locations. A dynamic threshold is calculated by the Dynatrace Portal based on the historical average for performance, for each test location. For more details, see Alert Thresholds.
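
The following sketch contrasts the two approaches. It is not the portal's calculation, which this page does not specify: the rule used here for the dynamic case (a percentage above each location's historical average), the function names, and the sample data are all assumptions.

```python
from typing import Dict, List


def static_breaches(latest: Dict[str, float], threshold_seconds: float) -> List[str]:
    """Locations whose latest response time exceeds one fixed threshold."""
    return sorted(loc for loc, seconds in latest.items() if seconds > threshold_seconds)


def dynamic_breaches(
    latest: Dict[str, float],
    history: Dict[str, List[float]],
    percent_over_average: float,
) -> List[str]:
    """Locations whose latest response time exceeds their own historical
    average by more than the given percentage (assumed rule)."""
    breaches = []
    for loc, seconds in latest.items():
        average = sum(history[loc]) / len(history[loc])
        if seconds > average * (1 + percent_over_average / 100):
            breaches.append(loc)
    return sorted(breaches)


# Made-up data: location_a is normally fast, location_b is normally slow.
history = {"location_a": [2.0, 2.0], "location_b": [6.0, 6.0]}
latest = {"location_a": 3.5, "location_b": 6.5}

print(static_breaches(latest, 5.0))           # ['location_b']
print(dynamic_breaches(latest, history, 50))  # ['location_a']
```

In this example the static threshold flags the location that is always slow, while the dynamic threshold flags the location that has slowed down relative to its own history.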

Besides determining thresholds, consider the importance of the feature being tested: Does a test failure indicate that end users would experience an annoying slowdown? Would they see that a nonessential page feature failed to load? Or would they be unable to complete a transaction or even to access your website?

The importance of the feature or transaction being tested determines the urgency of an alert:

  • Should an alert be sent as soon as a single test fails, or only when a certain number of tests fail from a certain percentage of nodes?
  • Should alerts be sent once or should reminders be sent until the issue is resolved? How much time should elapse between reminders?
  • Do alert recipients need to be notified when the condition has gone back to normal?

You can adjust the alert's sensitivity:

  • Reduce the thresholds – Lower thresholds will trigger alerts sooner. However, they may also create "false alarms".
  • Increase the test frequency, the number of locations, or both – Collecting more data means reaching the threshold more quickly when an alert is triggered only after a specified number of failures.

To avoid "false alarms" for Backbone tests, enable Retry on Error in Transaction Failure alerts. When a test run fails because of certain network or Internet issues, the test will automatically be re-run. If the second run also fails, the failure is reported. If the second run passes, the failed test run is discarded and does not affect the overall test performance data.
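
The retry behavior described above can be sketched as follows. The outcome labels, the set of retryable errors, and the `outcome_with_retry` function are hypothetical; this page does not list which network or Internet errors qualify for a retry.

```python
from typing import Callable

# Hypothetical label for the errors that qualify for an automatic re-run.
RETRYABLE_ERRORS = {"network_error"}


def outcome_with_retry(run_test: Callable[[], str]) -> str:
    """Run a test and return the outcome that should be reported."""
    first = run_test()
    if first == "pass" or first not in RETRYABLE_ERRORS:
        return first
    # Retryable failure: run once more. If the second run passes, the first
    # failure is discarded; if it fails, that failure is reported.
    return run_test()


# Simulated runs: a transient network error followed by a successful re-run.
attempts = iter(["network_error", "pass"])
print(outcome_with_retry(lambda: next(attempts)))  # pass
```

A failure that is not retryable is reported after the first run, so the retry does not delay alerts for genuine application errors.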

While you are collecting baseline data, send alerts to the Alert log, but don't configure other alert destinations. These early alerts will help you adjust the alert configuration so alert recipients get timely notification of significant events, and do not receive unnecessary notifications.

Review the pattern of alerts over time to make sure you're getting timely warning of problems without "false alarms" from alerts that are too sensitive.

For more information, see Alert Thresholds.