Week 1 – The Proactivity of Troubleshooting

Troubleshooting performance problems is very often – if not almost always – viewed as a reactive activity. Frankly, I have often seen it done as reactively as firefighting; effective troubleshooting, however, should build on a solid diagnostics process. If you handle troubleshooting like firefighting rather than basing it on solid diagnosis, that is inevitably a sign that you have not taken the right proactive measures.

The goal of troubleshooting is to resolve an immediate performance problem – ideally yesterday. Contrary to what some might expect, this does not start when the problem occurs: troubleshooting done right means having a defined process in place as part of your performance management activities. While we try to avoid this situation as often as possible, we have to accept that it is a normal part of our work. To plan properly in advance, we have to define what has to happen when, and what information is necessary to troubleshoot efficiently.

The metric used to measure the effectiveness of problem resolution is mean time to repair (MTTR). The beauty of this metric is that it is easy to measure: the time from when a problem occurs to when it is solved. However, the actual process behind it is much more complex. Let's look at the various steps of problem resolution.
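The mechanics of the metric itself really are that simple; a minimal sketch in Python, using made-up incident timestamps purely for illustration:

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (when the problem occurred, when it was resolved).
incidents = [
    (datetime(2010, 1, 4, 9, 15), datetime(2010, 1, 4, 11, 45)),   # 2.5 h incident
    (datetime(2010, 1, 11, 14, 0), datetime(2010, 1, 11, 14, 30)),  # 30 min incident
    (datetime(2010, 1, 20, 8, 0), datetime(2010, 1, 20, 12, 0)),    # 4 h incident
]

def mean_time_to_repair(incidents):
    """MTTR: average time from occurrence to resolution across all incidents."""
    total = sum((resolved - occurred for occurred, resolved in incidents), timedelta())
    return total / len(incidents)

print(mean_time_to_repair(incidents))  # -> 2:20:00
```

Everything that makes MTTR hard hides inside those two timestamps – detecting the problem early enough and resolving it fast enough – which is what the rest of this post is about.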

First, we have to know that we have a problem. This means we need adequate monitoring of our application as well as proper alerting and raising of incidents. Effective monitoring is therefore a prerequisite for effective problem resolution and must answer the following questions:

  • What has happened?
  • When did it happen?
  • Who is impacted?
  • What is the difference compared to before the problem?
  • Why did the problem happen?

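In practice this means an incident raised by monitoring should carry the answers to these questions as structured data rather than as a bare alarm. The sketch below shows one possible shape; the field names are my own and purely illustrative:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Incident:
    what_happened: str      # What has happened?
    when: datetime          # When did it happen?
    impacted: list          # Who is impacted?
    deviation: str          # What is the difference compared to before the problem?
    root_cause: str = "unknown"  # Why did it happen? Usually filled in later, during triage.

incident = Incident(
    what_happened="Checkout response time exceeded 3 s alerting threshold",
    when=datetime(2010, 1, 4, 9, 15),
    impacted=["EU web users", "order service"],
    deviation="95th percentile up from 800 ms to 3.4 s",
)
print(incident.root_cause)  # -> unknown
```

Note that the last question – the why – typically cannot be answered by monitoring alone; it is the output of the analysis step described next.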
The next step is problem analysis. This process step – also referred to as triage – aims at identifying the problem's root cause. Here, following a structured approach is the key to success. This is where "the rubber meets the road": if you cannot find the root cause quickly, your processes are not effective. Experience shows that this is also where many companies have the greatest optimization potential (to put a positive spin on it ;-)). Besides detailed technical knowledge about the application, the database, the network, or the operating system, the (immediate) availability of the required information is crucial. If you don't have this information, you have to start guessing.

In the problem resolution phase the problem gets fixed. This can range from a "simple" configuration change up to complex changes in the application. Choosing the optimal solution to a problem is a challenge in itself. More than the actual coding, it requires a lot of brainwork. Therefore very often the smartest people are put on it, as the solution not only has to be reliable but also developed as quickly as possible. This also means that scarce development resources are blocked from other work they need to do.

Regression analysis ideally happens in parallel to the resolution process. Each of us knows regression problems far too well. The goal of efficiently avoiding them was, and is, one of the key drivers for increased test automation. A central concept in regression analysis is having a baseline to measure against – well, besides the actual test cases, of course. If you have not collected such information, you cannot do any regression analysis. Ideally you have this information stored in some performance repository; otherwise you will first have to run tests to get a baseline, which slows down your resolution process.
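At its core, comparing a test run against a stored baseline can be as simple as the following sketch – the transaction names, values, and 15 % tolerance are made up for illustration:

```python
# Hypothetical baseline from a performance repository (transaction -> response time in ms).
baseline = {"login": 120.0, "search": 250.0, "checkout": 480.0}

# Results of the current test run for the same transactions.
current = {"login": 125.0, "search": 410.0, "checkout": 470.0}

def find_regressions(baseline, current, tolerance=0.15):
    """Flag every transaction that degraded by more than `tolerance` vs. the baseline."""
    return {
        name: (baseline[name], value)
        for name, value in current.items()
        if value > baseline[name] * (1 + tolerance)
    }

print(find_regressions(baseline, current))  # -> {'search': (250.0, 410.0)}
```

The hard part is not this comparison but having the repository filled with trustworthy baseline numbers before the incident hits.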

When development is finished, the application is tested in a final large-scale system test. I have heard rumors of projects where this does not happen… All these activities occur under massive time pressure and heavy management scrutiny. If you find new problems in this phase, you have to make sure they get fixed immediately. This means that part of this process step is having all required analysis data right at hand – having to re-run a test to get proper diagnostics data is one of the worst things that can happen to you.

Back in production, the application must be continuously monitored to ensure that the problem is really solved and does not happen again. Additionally, you have to verify that no other parts of the application are negatively impacted by the changes.

As we can see, problem resolution – or troubleshooting – processes are highly complex. The involvement of a number of departments makes proper information delivery vital for success. The first impression that these processes are purely reactive also proves wrong. The definition of a proper process and responsibilities, as well as the information to collect and the availability of the required infrastructure, must be managed beforehand – proactively.

This post is part of our 2010 Application Performance Almanach.

Alois is Chief Technology Strategist of Dynatrace. He is fanatical about monitoring, DevOps, and application performance. He has spent most of his professional career building monitoring tools and speeding up applications. He is a regular conference speaker, blogger, book author, and Sushi maniac.