Highly available installations

This page describes how AppMon handles availability and the measures you can take to increase the availability of an AppMon deployment. It also describes high-availability strategies that protect AppMon against data loss due to hardware or software faults.

Availability is an important consideration in application deployment, especially in production environments. Measures to increase availability range from simple and inexpensive backup and disaster recovery strategies to sophisticated use of Collector fail-over, fail-over clusters, or both.

Note

Implementing a high-availability AppMon deployment is not required for all customers. In a typical QA environment and most production environments, it is acceptable to lose monitoring data in the rare case of a hardware fault. In addition, AppMon has a built-in watchdog mechanism that automatically brings the system up again after a software fault, and Agents can fail over to a redundant Collector in a designated group.

Availability of system under diagnosis versus AppMon availability

The availability of the AppMon solution and the availability of the System Under Diagnosis (SUD) are not intrinsically related, which means that an AppMon failure does not typically cause the SUD to fail and vice versa. This is an important benefit of loosely coupled Agents.

Availability of system components

The AppMon system architecture implements loose coupling between the components as shown in the preceding diagram. This guarantees that a failing Agent does not take down the AppMon Collector or the AppMon Server. Similarly, a failing AppMon Client does not cause an AppMon Server failure. All components automatically reconnect after a failure. Therefore, it is important to focus on the availability of individual AppMon components.

SUD and Agent

Because the Agent runs in the same process as the SUD, an Agent failure may cause the SUD to stop responding. Note that the Agent is the only AppMon component that can cause the SUD to stop responding.

To minimize this risk as much as possible, the following measures are taken:

  • The Agent is a very thin software layer. As much work as possible is done by the Collector and Server.
  • The Agent usually fails gracefully. For example, if the connection to the Collector / Server fails, the Agent simply skips application events or at worst fails to instrument the SUD. In either case, this does not cause failure of the SUD.
  • Significant testing is done by the AppMon QA Team to ensure the reliability of this component.

If you need high availability (HA), such as fail-over support (FO) for the SUD, you can take any measure supported by the SUD to ensure its health (beyond the scope of this page).

Summary: SUD and Agent Availability

Risk | Consequences | Result
Hardware Failure | SUD is unavailable | AppMon monitoring alerts you
Software Failure of SUD | SUD is unavailable; Agent may be unavailable | At the least, missing AppMon data alerts you
Software Failure of Agent | Agent is unavailable; SUD may be unavailable | Even in this one case, in which AppMon has an adverse effect on the SUD, missing AppMon data alerts you

AppMon Collector

The Collector collects data such as measurements and PurePath-related events from the Agents and sends this data to the Server. If a Collector fails due to a hardware or software failure, the Agents buffer data for a period ranging from a couple of seconds up to a minute, depending on load. As a result, no data is lost if the Collector is started again within this time.

You should use more than one Collector for Agents of the same type (Agent Group / tier) and configure Collector groups in a production environment.

If the Collector comes up again within a minute, the Agents automatically reconnect to the Collector, and the Collector reconnects to the Server.

If not, the Agents can fail over to a different Collector in the Collector group.
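The reconnect-or-fail-over decision described above can be sketched as follows. This is an illustrative model only, not AppMon code: the `Collector` class, the `choose_collector` function, and the fixed 60-second `buffer_window` are all hypothetical, and the real buffer window depends on load.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Collector:
    """Hypothetical stand-in for a Collector in a Collector group."""
    name: str
    reachable: bool  # stands in for the result of a real connection attempt


def choose_collector(
    current: Collector,
    group: Sequence[Collector],
    seconds_down: float,
    buffer_window: float = 60.0,  # assumed value; the real window depends on load
) -> Optional[Collector]:
    """Return the Collector to send data to, or None to keep buffering."""
    if current.reachable:
        return current  # the original Collector is back: reconnect to it
    if seconds_down < buffer_window:
        return None  # still inside the buffer window: keep buffering and retrying
    # Buffer window exceeded: fail over to the first reachable group member.
    for candidate in group:
        if candidate is not current and candidate.reachable:
            return candidate
    return None  # no Collector in the group is reachable
```

In this sketch, a group with only one Collector can never fail over, which is why the recommendation to run more than one Collector per Agent group matters.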

The Collector uses self-monitoring with an integrated software watchdog to detect fatal software problems. For issues such as out-of-memory conditions or hanging threads, the Collector process is restarted automatically. This feature increases availability whether or not you plan to use clustering techniques for high availability, because cluster software has only a very limited chance of detecting such problems.

The time necessary for the watchdog to detect software problems ranges from immediate (for example, out-of-memory) to a couple of minutes (hanging threads).
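The difference in detection time can be illustrated with a generic heartbeat watchdog. This is a minimal sketch of the general technique, not AppMon's actual implementation: the `Watchdog` class, its `timeout` parameter, and the `out_of_memory` flag are hypothetical. A fatal condition that is reported explicitly is acted on immediately, while a hang is only detected once a heartbeat has been missing for the whole timeout.

```python
import time
from typing import Callable


class Watchdog:
    """Flags a process for restart on an explicit fatal error or a missed heartbeat."""

    def __init__(self, timeout: float, clock: Callable[[], float] = time.monotonic):
        self.timeout = timeout  # seconds without a heartbeat before a hang is assumed
        self.clock = clock      # injectable clock, which makes the class testable
        self.last_beat = clock()

    def heartbeat(self) -> None:
        """Called periodically by the monitored process while it is healthy."""
        self.last_beat = self.clock()

    def needs_restart(self, out_of_memory: bool = False) -> bool:
        if out_of_memory:
            return True  # explicit fatal condition: detected immediately
        # A hang shows up only after the heartbeat has been silent for `timeout`.
        return self.clock() - self.last_beat > self.timeout
```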

Summary: Collector Availability

Risk | Consequences | Precautions
Collector hardware failure; Collector software failure | PurePaths and measurements from connected Agents and Collector plugins are missing for the period in which the Agents tried to reconnect to the Collector and could not buffer | Make the Collector highly available by creating Collector groups and providing headroom and redundancy in Collectors

AppMon Server

If the Server fails due to a hardware or software failure, the Collectors buffer data for a period of time, ranging from 30 seconds to a couple of minutes depending on load and Collector heap configuration. As a result, no PurePaths or measurements are lost if the Server is started again within this period.
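The dependence on load and heap size can be made concrete with simple arithmetic: the time a fixed-size buffer can bridge is its capacity divided by the incoming data rate. The function and the figures below are hypothetical and chosen only for illustration; they are not AppMon sizing guidance.

```python
def buffer_seconds(buffer_bytes: int, incoming_bytes_per_second: float) -> float:
    """Return how long a buffer of the given size can absorb data at the given rate."""
    if incoming_bytes_per_second <= 0:
        raise ValueError("incoming rate must be positive")
    return buffer_bytes / incoming_bytes_per_second


MIB = 1024 * 1024

# Assumed figures: 256 MiB of heap available for buffering.
light_load = buffer_seconds(256 * MIB, 2 * MIB)  # 128.0 seconds at 2 MiB/s
heavy_load = buffer_seconds(256 * MIB, 8 * MIB)  # 32.0 seconds at 8 MiB/s
```

With these assumed numbers, a fourfold increase in load shrinks the window from about two minutes to about half a minute.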

The Server uses self-monitoring with an integrated software watchdog to detect fatal software problems. For issues such as out-of-memory conditions or hanging threads, the Server process is restarted automatically. This feature increases availability whether or not you plan to use clustering techniques for high availability, because cluster software has only a very limited chance of detecting such problems.

The time necessary for the watchdog to detect software problems ranges from immediate (for example, out-of-memory) to a couple of minutes (hanging threads).

Summary: Server Availability

Risk | Consequences | Precautions
Server hardware failure; Server software failure | PurePaths and measurements from connected Collectors that could not be buffered are missing | Make the Server highly available

AppMon Frontend Server

You do not need to implement special availability precautions for the Frontend Server, but it plays an important role in the overall architecture. It frees the Server from having to provide analysis data to the Clients, which gives the Server more headroom for PurePath correlation and protects it from potentially harmful queries.

The AppMon Client is a much less critical component than the Agent, Collector and Server. It is not necessary to implement high availability for the Client.

Performance Warehouse

All Measures and Incidents are stored in the Performance Warehouse RDBMS. Therefore, its availability is extremely important. A backup and disaster recovery strategy is essential if high availability of the AppMon solution is a priority.

Note

The AppMon Server buffers data for up to one hour (memory permitting) if the Performance Warehouse is not available. After this period of time has elapsed, the data is removed from the buffer.
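The buffering behavior in this note, limited both by age and by available memory, can be modeled with a small time-expiring queue. This is a sketch under stated assumptions and not the Server's implementation: the `ExpiringBuffer` class is hypothetical, and `max_items` stands in for the "memory permitting" limit.

```python
from collections import deque
from typing import Deque, List, Tuple


class ExpiringBuffer:
    """Holds records until they are drained, too old, or pushed out by newer ones."""

    def __init__(self, max_age: float = 3600.0, max_items: int = 10_000):
        self.max_age = max_age  # seconds; one hour, as described in the note
        # A deque with maxlen silently drops the oldest record once it is full.
        self.items: Deque[Tuple[float, str]] = deque(maxlen=max_items)

    def _expire(self, now: float) -> None:
        # Remove records older than max_age from the front of the queue.
        while self.items and now - self.items[0][0] > self.max_age:
            self.items.popleft()

    def add(self, now: float, record: str) -> None:
        self._expire(now)
        self.items.append((now, record))

    def drain(self, now: float) -> List[str]:
        """Return all records still valid at `now`, for example when the database is back."""
        self._expire(now)
        records = [record for _, record in self.items]
        self.items.clear()
        return records
```

Timestamps are passed in explicitly so that the behavior is easy to reason about: a record added at second 10 is gone from a drain at second 3615, while one added at second 20 survives.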