Thanks to advances in application performance management tools, it has become easy to monitor applications that are deployed on hundreds of servers. But the more data you collect, the harder it is to visualize the health state so that a single dashboard shows you both the overall status and the problematic component.

Eugene Turetsky (Dynatrace) and Stephan Levesque (SSQ Financial Group) shared their solution for monitoring large IT infrastructures that contain several hundred components. These components support SSQ’s most critical applications, which run on a variety of technology stacks including WebLogic, Oracle databases, Ingres databases and WebSphere MQ. When Stephan showed me his SSQ dashboards, I knew I had to write a blog about this.

Update July 2016: Watch a LIVE Demo of these Dashboards on our Online Performance Clinic YouTube Channel: AppOps Health Check Dashboards

Stephan agreed to share these details with a larger audience, including eventually uploading the plugins that Eugene Turetsky designed, developed and built for this to our Dynatrace GitHub Organization. Now, check this out. All of these Dynatrace dashboards are designed for a wide audience, from senior management to the support engineering teams responsible for maintaining the health of specific components. The following screenshot shows one of SSQ’s dashboards, with applications arranged vertically and cluster, server and component health arranged horizontally. The names of the apps and servers are sanitized for privacy reasons:

Each dot represents the health status of a component, aggregated to a cluster or an individual server and aggregated on application level. If an App goes red or yellow, it’s easy to spot which component is causing it

Stephan and his colleagues read this dashboard from top left to bottom right. The big red dot in the top left means that at least one of the applications is unhealthy. Spotting which apps are unhealthy is easy: just look for red. On those application rows it is easy to find the red dots that tell them which component (Web Server, App Server, Message Queue, …) to focus their root cause analysis on.

Now, let’s look a little deeper into how Stephan calculates the health status of each individual component and how he aggregates the data, so that you can rebuild this for your own environment if you find it useful:

Health Status of Components

A component can be an Application Server, a Database, a Message Queue or a device such as a Load Balancer. Stephan uses Dynatrace to monitor each component and has one or more metrics for each component that tell him whether it is healthy or not. Here are some examples:

  • Application: Application status is Red if one or more clusters or individual unclustered components are Red. Application status is Yellow (degraded) if some of an application’s clustered components (i.e. nodes) are down but the surviving nodes in the cluster can manage the application load. Otherwise application status is Green.
  • WebLogic: If all clustered WebLogic components are down (i.e. cluster is down) then the status of WebLogic is Red. If some nodes in the cluster are down but surviving nodes can manage application load, status of WebLogic is Yellow. Otherwise status of WebLogic is Green.
  • Database: If all clustered database components are down (i.e. cluster is down) then status of database is Red. If some nodes in the cluster are down but surviving nodes can manage application load, status of database is Yellow. Otherwise status of database is Green.
  • MQ: If all clustered MQ components are down (i.e. cluster is down) then status of MQ is Red. If some nodes in the cluster are down but surviving nodes can manage application load, status of MQ is Yellow. Otherwise status of MQ is Green.
  • Dynatrace agents: The state, or availability, of the Dynatrace agents is also monitored. If a critical agent is unavailable, an alert will be triggered and a red dot will be shown.
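The Red/Yellow/Green rules above can be sketched in a few lines of Python. This is only an illustration, not the plugin code Eugene built: the names `Health`, `cluster_status`, `application_status` and the `min_nodes_for_load` parameter are hypothetical. The rules do not say what happens when surviving nodes cannot carry the load, so this sketch assumes that case counts as Red.

```python
from enum import IntEnum
from typing import Iterable, Sequence


class Health(IntEnum):
    """Ordered so that a larger value means a worse state."""
    GREEN = 0
    YELLOW = 1
    RED = 2


def cluster_status(nodes_up: Sequence[bool], min_nodes_for_load: int = 1) -> Health:
    """Status of one clustered component type (WebLogic, database, MQ, ...).

    nodes_up holds one flag per node. min_nodes_for_load is a hypothetical
    threshold for "surviving nodes can manage the application load".
    """
    if not nodes_up:
        raise ValueError("a cluster needs at least one node")
    up = sum(1 for node in nodes_up if node)
    if up == len(nodes_up):
        return Health.GREEN   # every node is running
    if up > 0 and up >= min_nodes_for_load:
        return Health.YELLOW  # degraded, but survivors carry the load
    return Health.RED         # cluster down, or (assumption) overloaded


def application_status(component_statuses: Iterable[Health]) -> Health:
    """Red if any component is Red, Yellow if any is degraded, else Green."""
    return max(component_statuses, default=Health.GREEN)
```

For example, `cluster_status([True, False])` yields Yellow, and an application whose components are Green and Yellow is reported as Yellow.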

Whether you use Dynatrace or other APM tools, make sure you capture both system metrics such as Availability, CPU and Memory and performance-relevant metrics such as Response Time, and combine these metrics into your health states.

Aggregating Performance Data from Component to Server to Application

Besides monitoring the health of each component individually, the dashboard also aggregates data “upwards.” Stephan calculates an overall health state per component type, e.g. the overall WebLogic health in the cluster is calculated based on the states of each individual WebLogic instance. The overall Application Health is then calculated from the application’s availability as well as the aggregated state of all supporting components. The final overall system health shows whether any application is currently suffering an issue. The following screenshot shows how this works in a simple example:

Health States get aggregated to Health Groups which eventually end up being aggregated to the Application and the Overall System Status
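As a rough sketch of this upward aggregation, the Python below rolls health group states up to an application and then to the overall system status. It is written under assumptions and is not SSQ’s actual implementation. All function names and the sample applications are hypothetical, "worst state wins" is an assumed aggregation rule, and an unavailable application is assumed to be Red.

```python
from enum import IntEnum
from typing import Iterable, Mapping


class Health(IntEnum):
    GREEN = 0
    YELLOW = 1
    RED = 2


def worst(states: Iterable[Health]) -> Health:
    """Assumed aggregation rule: the worst state wins."""
    return max(states, default=Health.GREEN)


def application_health(app_available: bool,
                       group_states: Mapping[str, Health]) -> Health:
    """Combine the application's own availability with its health groups."""
    if not app_available:
        return Health.RED  # assumption: an unavailable application is Red
    return worst(group_states.values())


def overall_health(app_states: Mapping[str, Health]) -> Health:
    """Overall system status across all applications."""
    return worst(app_states.values())


# Hypothetical sample data: two applications with their health groups.
apps = {
    "App A": application_health(True, {"WebLogic": Health.GREEN,
                                       "Database": Health.GREEN}),
    "App B": application_health(True, {"WebLogic": Health.YELLOW,
                                       "MQ": Health.RED}),
}
print(overall_health(apps).name)  # prints RED
```

Keeping the states in an ordered enum means each level of the hierarchy can reuse the same `worst` helper, which mirrors how one red dot on the dashboard propagates up to the big dot in the top left.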

Automatic Alerting and Status Push to SharePoint

Besides having this and other dashboards to look at, Stephan’s implementation also lets Dynatrace automatically send out alerts to actively notify the responsible people. For example, if an application turns red, the application owner gets an email; if a WebLogic cluster has an issue, the Ops folks responsible for that cluster get an email, and so on.
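The routing idea behind these alerts can be illustrated with a small Python sketch. It does not use any Dynatrace API and only resolves who should be notified. The `StatusChange` type, the `OWNERS` table, the names and the email addresses are all hypothetical, and alerting on every non-green state is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class StatusChange:
    kind: str    # "application" or "cluster"
    name: str    # hypothetical component name
    status: str  # "green", "yellow" or "red"


# Hypothetical ownership table: who is responsible for which component.
OWNERS: Dict[Tuple[str, str], List[str]] = {
    ("application", "App A"): ["app-a-owner@example.com"],
    ("cluster", "WebLogic Cluster 1"): ["weblogic-ops@example.com"],
}


def recipients_for(change: StatusChange,
                   owners: Dict[Tuple[str, str], List[str]] = OWNERS) -> List[str]:
    """Return the people to email for a status change (empty if none)."""
    if change.status == "green":
        return []  # healthy components do not trigger alerts
    return list(owners.get((change.kind, change.name), []))
```

With this table, `recipients_for(StatusChange("application", "App A", "red"))` returns the application owner, while a red WebLogic cluster is routed to the Ops address.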

Additionally, they show all incidents on an internal SharePoint page. They developed a custom SharePoint Web Part that pulls in the list of currently open Dynatrace Incidents. The following screenshot shows what that looks like:

A custom SharePoint Web Part pulls in the Dynatrace Incident information

Interested in More? Want to share your approach?

Thanks to Stephan and Eugene for sharing these ideas, plugins and dashboards, and thanks to everyone else who was involved.

If you are interested in the technical details of how this health monitoring for large IT infrastructures was built and what it takes to implement something similar in your environment, let me know. We will soon put the Dynatrace Plugins that were built for this on our Dynatrace GitHub Organization and describe on our Community Portal how to get this configured. If you don’t have Dynatrace yet but want to give it a try, register and download our Dynatrace Free Trial.

If you have your own way of monitoring large environments, let us know.