Preparing to set up alarms

Setting up alarms is the natural complement to monitoring your website. Effective alert emails enable you to take timely action when your website is down or its performance is sluggish. Alarms can support performance or availability Service Level Agreements (SLAs), e.g., "The site will be available 99% of the time during business hours."

To be effective, alarms need to be carefully planned and configured. On the one hand, if alert thresholds are too low, the monitoring team gets alerts too frequently, sometimes for issues that do not require action. On the other hand, if the alert thresholds are too high, customers may experience poor performance before the monitoring team is aware of a problem.

See Alarms for a discussion of basic alarm concepts.

The pages under the Alarms section cover these areas of the MyKeynote Alarms UI:

  • Summary: Shows both threshold and actual values for configured alarms; lists disabled alarms.
  • Create: Set up your alarms—choose measurements and set up availability and/or performance alerts.
  • Configure: Edit, suspend, or delete alarms.
  • Email Layouts: Specify the information and formatting you want in alert emails.
  • Maintenance Windows: Set aside time periods for site maintenance during which data points do not trigger alarms.
  • Baselines: Define a baseline time period in weeks, which allows alarms to be triggered by comparison with performance during the baseline period.
  • Alarm Log: See a list of all alarms that have been triggered recently.

To set up effective alert emails, plan your alarms in the context of your overall monitoring strategy:

  1. Identify the people who need information about the website's performance and availability. The information they need will determine the alerts that need to be configured.
  2. Decide on the alert type (performance vs. availability) and the specific type of performance monitoring required based on the information the alert recipients need, e.g., total measurement time, total user experience time, time to interactive page.
  3. Set up measurements to support alerting.
  4. Establish performance baselines that take into account normal performance fluctuations over different locations and time periods.
  5. Establish alert thresholds to reflect the importance of the transaction being monitored and provide timely warning of problems without generating false alerts.

Identify alert recipients

Identify the people who need to respond to alerts and the different types of information they would need, for example:

  • Operations staff who would troubleshoot a problem, and possibly, web programmers who can make design adjustments for components with slow load times
  • Managers responsible for ensuring performance goals
  • Partners who deliver third-party content that might be a factor in the issue that triggered the alert

You can set up different recipients for warning and critical alerts, and create email groups to manage these email lists. In addition to these recipients, you can set up escalation recipients for critical performance and availability alerts.

Decide on the type of alert

Based on the information relevant to different alert recipients, decide whether you want to set performance or availability alerts, or both.

For performance you can choose to be alerted on different user experience or network components, for example:

  • Total measurement time: The total time of all network traffic for an entire transaction or transaction page, as measured by the Keynote agent.
  • Total user experience time: The total time elapsed from when the browser started navigating to the page until the browser finished loading the page contents.
  • Time to interactive page: The time elapsed until the page becomes fully interactive for the user. This corresponds to when the browser finishes processing the onload event.
  • Count of bytes, elements, domains, cookies, etc. downloaded
  • Throughput (KB/sec)

Note

Check the alarm configuration page for a complete list of components. Available components vary depending on whether you are monitoring a desktop (TxP, ApP) or mobile (MWP) browser transaction.

For availability, you can choose to trigger alerts based on all errors (excluding some miscellaneous Keynote errors) or specific errors.

Set up measurements to support alarms

Your alarms are more effective when integrated with and fully accounted for in your monitoring plan.

Script planning

Alerts are more meaningful when you set up measurements that reflect the most important transactions and navigation patterns for your end users. Or you might want to focus on common transactions where you believe your users are abandoning your site due to performance problems (see Scripting Best Practices).

Validate each step

Include validations in your tests to support accurate error reporting (e.g., -99101 Error Text Found) in availability alerts.

Validation verifies that the script has loaded the correct page by checking for expected and/or error text on a page. Without validation, error reporting might not be as accurate or helpful, or the script might report a success even if it loads an unexpected page.
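The validation logic described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Keynote scripting syntax: the function name `validate_page`, its parameters, and the sample page bodies are all hypothetical.

```python
def validate_page(body, expected_text=None, error_texts=()):
    """Return (ok, reason) for a loaded page body.

    Hypothetical helper for illustration; not a Keynote scripting call.
    """
    # Error text on the page means the step failed, even if the request succeeded.
    for text in error_texts:
        if text in body:
            return False, f"error text found: {text!r}"
    # Missing expected text suggests the script loaded an unexpected page.
    if expected_text is not None and expected_text not in body:
        return False, f"expected text not found: {expected_text!r}"
    return True, "ok"


# Made-up page bodies for the example.
checkout_page = "<h1>Order summary</h1><p>Total: $42.00</p>"
outage_page = "<h1>Sorry, something went wrong</h1>"

print(validate_page(checkout_page, "Order summary", ["something went wrong"]))
print(validate_page(outage_page, "Order summary", ["something went wrong"]))
```

Without the two checks, both pages would count as a success as long as they loaded, which is the inaccurate reporting that validation is meant to prevent.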

Deployment and alerting requirements

Deploy measurements to key geographic locations and service providers for your end users.

You might want to run multiple measurements at different times of the day or days of the week, with different alert thresholds, to account for variations in traffic patterns.

If using dynamic thresholds based on baseline performance data, be sure to run your measurement on a representative enough sample of agents and locations in "test mode" so that initial baseline data is useful.

How soon an alert notification is sent when a problem occurs depends on several deployment and alert options:

  • How often the measurement runs
  • The number of agents the measurement runs on
  • The number of locations or agents that must be in an error state to trigger the alert
  • The alert threshold type (static or dynamic). For example, if your static threshold does not match current performance times, you might receive more or fewer alerts than expected.

Make sure that measurements run at a cadence that supports alert thresholds. For example, if your measurement only runs on one agent once an hour, and your alert settings require three data points in breach before a notification is sent, your customers could experience poor performance or downtime for three hours before your operations staff is alerted to the problem. (See Availability Best Practices for an example of how to calculate the number of data points required to trip an alert.)
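The arithmetic behind the example above can be checked with a short Python sketch. It rests on simplifying assumptions that are not part of the product: agents are evenly staggered, every data point taken after the problem starts is in breach, and the problem begins just after a data point was collected (the worst case). The function name and parameters are hypothetical.

```python
def worst_case_alert_delay_minutes(interval_minutes, agents, breaches_required):
    """Rough upper bound on the time between a problem starting and the alert.

    Assumes evenly staggered agents, so a new data point arrives every
    interval_minutes / agents minutes, and that every data point after the
    problem starts is in breach.
    """
    minutes_between_points = interval_minutes / agents
    return breaches_required * minutes_between_points


# One agent, hourly, three breaches required: the three-hour case from the text.
print(worst_case_alert_delay_minutes(60, 1, 3))   # 180.0

# Five agents, every 15 minutes, three breaches required.
print(worst_case_alert_delay_minutes(15, 5, 3))   # 9.0
```

Under these assumptions, running more often or on more agents shortens the delay in direct proportion, while requiring more breaches lengthens it.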

Plan for maintenance windows

In alarm configuration, you can set up maintenance windows for planned maintenance or outages during which alerts are suppressed, even if measurements cross thresholds. You can set up multiple maintenance windows per measurement and apply any maintenance window to more than one measurement.

Besides one-time or recurring maintenance windows, you can also suspend an alarm for up to 24 hours.
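As a rough illustration of how a maintenance window suppresses alerts, the following Python sketch checks a data point against a list of one-time windows before deciding whether to alert. It does not reflect how the product stores or evaluates windows. The function names, the window representation, and the sample values are assumptions, and recurring windows are not modeled.

```python
from datetime import datetime


def in_maintenance(timestamp, windows):
    """True if timestamp falls inside any (start, end) window; end is exclusive."""
    return any(start <= timestamp < end for start, end in windows)


def should_alert(timestamp, value, threshold, windows):
    """Alert only when the threshold is crossed outside every maintenance window."""
    return value > threshold and not in_maintenance(timestamp, windows)


# Hypothetical window: planned maintenance from 02:00 to 04:00 on 6 January 2024.
windows = [(datetime(2024, 1, 6, 2, 0), datetime(2024, 1, 6, 4, 0))]

print(should_alert(datetime(2024, 1, 6, 3, 0), 9.5, 5.0, windows))  # False: suppressed
print(should_alert(datetime(2024, 1, 6, 5, 0), 9.5, 5.0, windows))  # True: outside window
```

Because the same list of windows can be passed for any measurement, the sketch also mirrors the idea of applying one maintenance window to more than one measurement.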

Identify performance baselines

Before configuring alarms, run your measurements for a few days or even weeks so you can get a better sense of performance and availability expectations and set up baselines for TxP or ApP measurements.

Web traffic can fluctuate by time of day, day of week, and even time of year. Within a website, different pages might have different levels and patterns of traffic. In running your measurement, the more data you have to work with, the easier it is to identify anomalies and the true variability in your measurement's performance and availability. If you run your measurement in "test" mode over fewer agents/locations than you eventually plan to deploy, the data you see might not be representative enough for setting baselines and thresholds.

Establish alert thresholds and sensitivity

Set alert thresholds that reflect end users' expectations and the frequency at which you wish to be alerted. Thresholds, whether static or dynamic, should be based on actual performance and availability, not goals.

A static threshold does not vary with fluctuations in your website's performance or availability. A dynamic threshold looks back over a period of 4-6 weeks and sets expected performance as a multiple of the average performance over that time period.
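The difference between the two threshold types can be seen in a small Python sketch. The multiplier, the baseline samples, and the function name are invented for illustration. The sketch takes "a multiple of the average" literally and is not the product's actual calculation.

```python
from statistics import mean


def dynamic_threshold(baseline_samples, multiplier):
    """Threshold as a multiple of the average of the baseline samples."""
    return multiplier * mean(baseline_samples)


# Hypothetical page load times (seconds) collected over the baseline period.
baseline = [2.0, 2.5, 3.0, 2.5]

STATIC_THRESHOLD = 4.0                                # fixed; ignores the baseline
dynamic = dynamic_threshold(baseline, multiplier=1.5)  # 1.5 x 2.5 = 3.75

measurement = 3.9
print(measurement > STATIC_THRESHOLD)  # False: no alert with the static threshold
print(measurement > dynamic)           # True: alert with the dynamic threshold
```

If typical performance later slows to an average of 3.0 seconds, the dynamic threshold in this sketch rises to 4.5 while the static threshold stays at 4.0, which is why a static threshold needs periodic review.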

The importance of the feature or transaction being tested can determine the aggressiveness of your alert thresholds:

  • How many data points should be in breach of thresholds before an alert is sent? See Availability Best Practices for an example of how to calculate the number of data points required to trip an alert.
  • How often should alerts be sent until an issue is resolved?
  • Should there be an escalation path?
  • Do alert recipients need to be notified when the alarm state has gone back to OK (normal)?
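Two of these choices, the number of data points in breach and the return-to-OK notification, can be made concrete with a minimal Python sketch. It assumes breaches must be consecutive, which is one possible interpretation. Repeat notifications and escalation are left out, and all names are hypothetical.

```python
def alert_events(values, threshold, breaches_required, notify_on_ok=True):
    """Return (index, event) pairs for a series of measurement values.

    Assumes an alert fires after `breaches_required` consecutive breaches and,
    optionally, an OK notice is sent when a value returns below the threshold.
    """
    events = []
    consecutive = 0
    alarming = False
    for index, value in enumerate(values):
        if value > threshold:
            consecutive += 1
            if not alarming and consecutive >= breaches_required:
                alarming = True
                events.append((index, "ALERT"))
        else:
            consecutive = 0
            if alarming:
                alarming = False
                if notify_on_ok:
                    events.append((index, "OK"))
    return events


# Hypothetical load times in seconds; the single spike at index 1 is ignored.
samples = [2.1, 6.0, 2.3, 6.2, 6.5, 7.0, 6.8, 2.2]
print(alert_events(samples, threshold=5.0, breaches_required=3))
# [(5, 'ALERT'), (7, 'OK')]
```

With `breaches_required=1`, the isolated spike at index 1 would also raise an alert, which shows the trade-off between early warning and false alerts.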

Review the pattern of alerts over time to make sure you are getting timely warning of problems, but without false alerts. You can adjust alert sensitivity by:

  • Changing thresholds. For example, lower thresholds trigger alerts sooner, but they might also generate false alerts.
  • Changing measurement deployment (frequency, number of agents and/or locations). For example, collecting more data points means reaching thresholds more quickly when an alert is triggered after a specified number of breaches.

For more information, see Creating or Editing Alarm Settings.