Baseline and smart alerting explained

AppMon provides automatic, smart baselines. The smart alerting mechanism reduces the number of false-positive alerts, which can occur with traditional statistical approaches that look only at averages and at violations based on standard deviation.

A baseline is calculated for each of the following measures: response time, failure rate, and throughput.

Significant measurement

AppMon employs statistical methods to calculate expected application behavior from historical data and to compare current application behavior against the expected behavior.

Measurements for throughput, failure rate, and response time are regularly collected. Over time, the sample size increases to the point where a measurement gains statistical significance. Intermediate measurements are not statistically significant, but they still reflect the current behavior of the application.

The charts for throughput, failure rate, and response times show both significant and insignificant measurements.

The following screenshot shows that two significant measurements lead to a violation that also triggers a built-in incident. Settings for the Business Transaction and Business Transaction Splitting value allow you to adjust that default behavior.

Significant Measurement

Baseline calculations

Baselines are calculated based on different statistical approaches for response time, failure rate, and throughput.

Note

When you analyze chart data, take into account that if the number of data points becomes too high for the chart “real estate”, data is aggregated automatically. Keeping the one-minute resolution over a large time period would result in indistinguishable point clumps and excessive memory consumption.
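The automatic aggregation described in this note can be pictured with a minimal sketch. It assumes that one-minute data points are averaged into equally sized buckets once they exceed the number of points a chart can display. The function name `aggregate_for_chart`, the `max_points` parameter, and the use of the mean are illustrative assumptions, not AppMon's documented implementation.

```python
def aggregate_for_chart(points, max_points):
    """Reduce a series to at most max_points values by averaging buckets.

    Hypothetical helper: AppMon's actual aggregation function is not
    described in this document; the mean is used here as an assumption.
    """
    if max_points < 1:
        raise ValueError("max_points must be at least 1")
    if len(points) <= max_points:
        return list(points)  # chart has room for every data point
    # Ceiling division: smallest bucket size that fits into max_points buckets.
    bucket_size = -(-len(points) // max_points)
    aggregated = []
    for start in range(0, len(points), bucket_size):
        bucket = points[start:start + bucket_size]
        aggregated.append(sum(bucket) / len(bucket))
    return aggregated
```

For example, ten one-minute values shown on a chart with room for three points would be averaged in buckets of four.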

Response time

The Response Time depends on the Business Transaction selected in the Application Overview. For a Server-side Business Transaction, it is the response time for the Server-side PurePath. If it is a user-action-based Business Transaction, it is the response time of the whole user action.

AppMon does not use averages or standard deviation to calculate baselines for response time. Instead, baselines are:

  • Calculated for the 50th percentile (median) and the 90th percentile (slowest 10%) of the actual Business Transaction response time.
  • Updated every 5 minutes during the first day. After that, they are recalculated every day at midnight from the data of the previous 7 days (or from as many days as are available until AppMon has 7 days of data).

Violations are identified if at least two significant measurements are above the threshold.

As the baseline is calculated every day using the data of the past 7 days, it automatically adapts to changes in your application.
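As an illustration of a percentile-based baseline over a rolling window, the following sketch computes the median and the 90th percentile from the most recent seven days of response times. The function names, the list-of-days input format, and the nearest-rank percentile method are assumptions made for this example. The document does not state which percentile estimator AppMon uses.

```python
def nearest_rank_percentile(sorted_values, percent):
    """Nearest-rank percentile of an ascending, non-empty list (percent is an int 1..100)."""
    # Integer ceiling of percent * n / 100 avoids floating-point rounding.
    rank = max(1, (percent * len(sorted_values) + 99) // 100)
    return sorted_values[rank - 1]


def response_time_baseline(daily_samples, window_days=7):
    """Return (median, p90) over the last window_days days of samples.

    daily_samples is a hypothetical input: one list of response times
    per day, oldest day first. Fewer than window_days days are accepted,
    mirroring the initial phase when less history is available.
    """
    recent_days = daily_samples[-window_days:]
    samples = sorted(value for day in recent_days for value in day)
    if not samples:
        raise ValueError("no response time samples available")
    return (nearest_rank_percentile(samples, 50),
            nearest_rank_percentile(samples, 90))
```

Because only the last seven days enter the calculation, an outlier from eight days ago no longer influences the result, which is how a rolling window adapts to changes in the application.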

Factors that affect the response time baseline

Failure rate

The failure rate is calculated based on the detected errors that identify a transaction or user action as failed. For details, see Error Analysis.

The baseline for failure rate uses a binomial distribution.

Violation detection is based on significant measurements. For example:

  • One failed out of five requests is a 20% failure rate, but this may not be significant during a low-traffic time range such as late night.
  • 100 failed out of 1000 requests (a 10% failure rate) during a high-load time range is significant, so an alert is issued.

Significant failure rate
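The two failure rate examples above can be checked with a simple one-sided binomial test. This is a sketch under stated assumptions: the baseline failure rate of 5% and the significance level of 0.01 are illustrative values, and the function names are hypothetical. The document only states that the baseline uses a binomial distribution, not how AppMon applies it.

```python
import math


def binomial_tail(failures, requests, baseline_rate):
    """P(X >= failures) for X ~ Binomial(requests, baseline_rate).

    baseline_rate must lie strictly between 0 and 1. Terms are computed
    in log space so that large request counts do not overflow.
    """
    log_p = math.log(baseline_rate)
    log_q = math.log(1.0 - baseline_rate)
    total = 0.0
    for k in range(failures, requests + 1):
        log_pmf = (math.lgamma(requests + 1) - math.lgamma(k + 1)
                   - math.lgamma(requests - k + 1)
                   + k * log_p + (requests - k) * log_q)
        total += math.exp(log_pmf)
    return min(total, 1.0)


def is_significant(failures, requests, baseline_rate=0.05, alpha=0.01):
    """True if this many failures would be unlikely at the baseline rate.

    baseline_rate and alpha are assumed values chosen for illustration.
    """
    return binomial_tail(failures, requests, baseline_rate) < alpha
```

Under these assumptions, one failure in five requests has a tail probability of roughly 0.23, so it is not flagged, whereas 100 failures in 1000 requests is far outside what a 5% baseline would produce.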

Throughput

Throughput is the number of transactions that belong to the specific Business Transaction splitting value, such as the number of processed web requests for the URL http://myapp.com/search.jsf.

Business transactions with high throughput have significant measurements more frequently than business transactions with lower throughput. This is because a sample size large enough for statistical significance is reached within a shorter interval.

For the throughput baseline, AppMon calculates the expected range based on the historical data from the same time frame one week ago.

The initial phase of baselining, when AppMon does not have data from one week ago, uses different time frames, depending on how much data is available:

  • The same 15-minute interval 7 days ago
  • The same 15-minute interval 1 day ago
  • The same 15-minute interval 1 hour ago
  • The previous 15 minutes for the first hour when baselining started
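The fallback order in the list above can be sketched as a small selection function. It assumes that the reference interval is chosen purely from how long baselining has been running. The function name and the exact switch-over points are assumptions for illustration.

```python
from datetime import timedelta


def throughput_reference_offset(baselining_age):
    """Return how far back to look for the reference 15-minute interval.

    baselining_age is a hypothetical input: the time elapsed since
    baselining started. The switch-over thresholds are assumed to equal
    the offsets themselves.
    """
    if baselining_age >= timedelta(days=7):
        return timedelta(days=7)      # same interval one week ago
    if baselining_age >= timedelta(days=1):
        return timedelta(days=1)      # same interval one day ago
    if baselining_age >= timedelta(hours=1):
        return timedelta(hours=1)     # same interval one hour ago
    return timedelta(minutes=15)      # previous 15 minutes
```

For instance, after 30 hours of baselining this sketch would compare the current interval with the same interval one day earlier.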

Alerts are not issued for violations of the expected throughput range. You must check manually whether throughput is within the range observed for the same time frame one week ago.

Throughput baseline

Throughput influences the measurements for failure rate and response time. If throughput is high, there are more data points for calculating the failure rate and response time and therefore more significant measurements (lower statistical spread). For low throughput, the statistical spread is higher, so the failure rate and response time measurements become less significant.
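The effect of sample size on statistical spread can be made concrete with the standard error of an observed proportion, sqrt(p(1 - p) / n). The sketch below uses this textbook formula as an illustration only. It is not taken from AppMon, and the function name is hypothetical.

```python
import math


def failure_rate_standard_error(failed, total):
    """Standard error of the observed failure rate failed / total."""
    rate = failed / total
    return math.sqrt(rate * (1.0 - rate) / total)
```

With 1 failure in 5 requests the standard error is about 0.18, almost as large as the 20% rate itself. With 100 failures in 1000 requests it drops below 0.01, so the 10% rate is a much more reliable measurement.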

Violations and alerting

AppMon issues alerts for baseline violations for response time and failure rate.

By default, violations are detected when two or more significant measurements violate the baseline. A violation ends when at least one of the violating significant measurements falls below the baseline.
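The default rule can be sketched as a small state machine. This example assumes that the two violating measurements must be consecutive significant measurements, that insignificant measurements leave the state unchanged, and that a single significant measurement at or below the threshold ends the violation. These details, the function name, and the input format are assumptions, because the text above does not spell them out.

```python
def violation_states(measurements, threshold, min_violations=2):
    """Track whether a violation is active after each measurement.

    measurements is a hypothetical input: an iterable of
    (value, is_significant) pairs in chronological order.
    Returns one boolean per measurement.
    """
    active = False
    streak = 0  # consecutive significant measurements above the threshold
    states = []
    for value, significant in measurements:
        if significant:
            if value > threshold:
                streak += 1
                if streak >= min_violations:
                    active = True
            else:
                # A significant measurement back under the threshold
                # ends the violation and resets the count.
                streak = 0
                active = False
        states.append(active)
    return states
```

In this sketch, raising `min_violations` makes alerting less sensitive, which is the kind of adjustment the Business Transaction settings allow.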

These settings can be changed for Business Transactions and for splitting values.

You can set absolute and relative thresholds for violations in the Configure Business Transaction dialog box. You may need to scroll down to see the button. See Business Transaction configuration for more information on configuring Business Transactions.

Baseline configuration