Alert system

You can use the alert system to deal with problems reactively or proactively.

This topic is for background and reference.

Reactive model

In the reactive model of alert monitoring, you react to problems reported by your users (for example, website users).

In such a scenario, the CAS monitors a given website and the AMD continuously measures operation times for operations, transactions, and users. Using the gathered data, the report server displays details on charts so you can measure performance and troubleshoot problems.

When users report problems, you look at the reports and find, for example, that the problem is HTTP response time from a certain server. You then fix the problem: reboot the server, restart the process, or take other corrective action. In other words, you react to a problem that has already affected your users.

Proactive model

In the proactive model of alert monitoring, you detect problems before your users notice them.

To do this, you need two things:

  • Knowledge of how problems manifest themselves in your particular environment
  • A means of detecting such situations

For example, if long HTTP response time is the best early indicator of developing problems, you could display a chart showing the HTTP response time metric and take action if the value of the metric exceeds a certain threshold.

It is even better to automate the process and let the system inform you when the metric exceeds the threshold. This is exactly what the alert mechanism was designed to do. Ideally, the system could inform a designated operator about the problem and feed data into an alert management engine. The engine could then perform a corrective action such as restarting the offending server or process.

Thus, the alert mechanism enables you to move some of the responsibility and intelligence from a human operator (watching the charts) to the machine (acting on alerts).
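
The automated check described in this section can be illustrated with a minimal sketch. It is not the product's alert engine or API: the names Alert, check_threshold, notify_operator, and feed_engine, the metric name, and the 2000 ms threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Alert:
    """Hypothetical alert record; not a product data structure."""
    metric: str
    value: float
    threshold: float


def check_threshold(metric: str, value: float, threshold: float,
                    handlers: List[Callable[[Alert], None]]) -> bool:
    """Pass an Alert to every handler if the value exceeds the threshold."""
    if value <= threshold:
        return False
    alert = Alert(metric, value, threshold)
    for handler in handlers:
        handler(alert)
    return True


# Stand-ins for a designated operator and an alert management engine.
operator_inbox: List[str] = []
engine_queue: List[Alert] = []


def notify_operator(alert: Alert) -> None:
    operator_inbox.append(
        f"{alert.metric} = {alert.value} exceeds {alert.threshold}")


def feed_engine(alert: Alert) -> None:
    # A real engine might restart the offending server or process here.
    engine_queue.append(alert)


raised = check_threshold("http_response_time_ms", 2500.0, 2000.0,
                         [notify_operator, feed_engine])
```

Keeping the handlers separate from the check means the same condition can both inform a person and drive an automated corrective action.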

Defining and modifying alerts

For an alert to be raised, you need to specify the alert triggering conditions, which requires careful observation and knowledge of the system.

You need to ensure that:

  • You understand what you are trying to achieve.

  • You have gathered your requirements.

  • You know how problems in the monitored system manifest themselves.

  • You can translate your intentions into alert configuration. Alerts must detect error situations and nothing but error situations. In other words, failure notifications must be sent and corrective actions performed whenever they are needed, and only then.

When configuring alerts, first consider what the system would show if you were troubleshooting a failure in reactive mode. This could be, for example, slow operations, HTTP response time, SSL handshake errors, stopped pages, 5xx HTTP errors on the login URL, or some textual information that needs to be captured with application error recognition.

Then ask yourself which values, sustained over a given time, are still acceptable and which values mean a real problem. For example, 5 minutes of high server time might not signify a problem, but server time that stays high for more than 15 minutes might, particularly if after 30 minutes you also see 5xx HTTP errors. At that point you have to react. With this type of information, you can start looking for the right alerts to configure.
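
As a rough illustration of this reasoning, the following sketch classifies a metric by how long it has stayed above a threshold. The function names, the severity labels, the per-minute sample format, and the 800 ms server time threshold are assumptions for the example; only the 15-minute and 30-minute figures come from the text above.

```python
from typing import List, Tuple


def minutes_above(samples: List[Tuple[int, float]], threshold: float) -> int:
    """Minutes spanned by the trailing run of samples above the threshold.

    samples holds (minute, value) pairs sorted by minute.
    """
    run_start = None
    for minute, value in samples:
        if value > threshold:
            if run_start is None:
                run_start = minute
        else:
            run_start = None  # the run was interrupted, start over
    if run_start is None:
        return 0
    return samples[-1][0] - run_start


def classify(duration_min: int, http_5xx_count: int) -> str:
    """Map a sustained duration and 5xx error count to a severity label."""
    if duration_min >= 30 and http_5xx_count > 0:
        return "critical"
    if duration_min > 15:
        return "warning"
    return "ok"


# One sample per minute, server time in milliseconds (illustrative values).
short_spike = [(m, 900.0) for m in range(0, 6)]
sustained = [(m, 900.0) for m in range(0, 21)]
prolonged = [(m, 900.0) for m in range(0, 36)]

short_status = classify(minutes_above(short_spike, 800.0), 0)
sustained_status = classify(minutes_above(sustained, 800.0), 0)
prolonged_status = classify(minutes_above(prolonged, 800.0), 3)
```

Here the 5-minute spike is classified as "ok", the 20-minute run as "warning", and the 35-minute run accompanied by 5xx errors as "critical".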

It is not enough to detect alert conditions and then trigger and send alert notifications. You need a business process that ensures that the detected situation is fixed as soon as possible. In other words, generating many alerts from monitoring tools achieves little if you still react to problems only when users call to complain.

Usage scenarios for alerts

The alert system can satisfy various user requirements and operational scenarios, such as:

  • Notifying the recipient of both the beginning and the end of the alert condition. The user is notified when an alert condition is raised and also when the situation returns to normal.

  • Notifying the recipient only if a given condition lasts for a certain period of time, or if a given event is repeated several times. This enables the user to focus on real issues and not on insignificant or intermittent glitches.

  • Notifying the recipient several times at regular intervals throughout the duration of the problem.
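
The three scenarios above can be combined in one small state machine. This is a sketch under stated assumptions, not the product's implementation: the class AlertNotifier, its parameters min_consecutive and repeat_every, and the "start", "reminder", and "end" labels are hypothetical, and time is counted in evaluation cycles rather than in minutes.

```python
from typing import List, Optional


class AlertNotifier:
    """Decides which notification, if any, to send after each evaluation."""

    def __init__(self, min_consecutive: int = 1, repeat_every: int = 0):
        self.min_consecutive = min_consecutive  # cycles before raising
        self.repeat_every = repeat_every        # 0 disables reminders
        self.consecutive = 0
        self.since_last = 0
        self.active = False

    def evaluate(self, condition_met: bool) -> Optional[str]:
        if not condition_met:
            self.consecutive = 0
            if self.active:
                self.active = False
                return "end"        # the situation returned to normal
            return None
        self.consecutive += 1
        if not self.active:
            if self.consecutive >= self.min_consecutive:
                self.active = True
                self.since_last = 0
                return "start"      # the condition lasted long enough
            return None             # still possibly an intermittent glitch
        self.since_last += 1
        if self.repeat_every and self.since_last >= self.repeat_every:
            self.since_last = 0
            return "reminder"       # the problem is still ongoing
        return None


notifier = AlertNotifier(min_consecutive=3, repeat_every=2)
readings = [True, True, False, True, True, True, True, True, False]
events: List[Optional[str]] = [notifier.evaluate(r) for r in readings]
```

In this run the first two readings are treated as a glitch and produce no notification; the later run of five produces a "start", one "reminder", and finally an "end" when the condition clears.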