When we set up our application performance monitoring tool to correctly notify us about unexpected performance degradation, we often know about problems before our users start calling support (see our previous post on proactive APM).
But what happens when we actually learn about a major performance outage?
Psychologists have identified two ways people falter under pressure. Some people choke: they revert to the over-cautious, slow procedures they used when first learning a task. Other people panic: their perception narrows dangerously to one potential cause of the problem and they lose sight of the bigger picture.
But when you are experienced enough, and have defined standard operating procedures to ensure proper resolution of the problem from the end-user perspective, you will neither choke nor panic. Just like one of our customers, whose story I detail below.
I Hate Mondays
As a rule of thumb, any updates to the services or infrastructure run by LaverniaNet, an international ISP from Lavernia, a small country in Eastern Europe (names changed for commercial reasons), were performed during low-traffic hours, i.e., on Saturday evening local time (Sunday in Australia and Saturday in the Americas). This time was no different. Nothing foretold the trouble that the Operations team was about to face on Monday when they returned to the office.
At roughly 9:00 on Monday morning, the alerts configured to monitor the LaverniaNet infrastructure went off, showing a growing number of slow pages. Figure 1 shows the Page Volume chart for the whole day, with red indicating the number of slow pages; the arrow marks when the alarm was raised.
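The kind of alert that fired here can be sketched as a simple threshold check over an interval of page measurements. This is a minimal illustration, not the vendor's actual alerting logic; the 8-second slow-page cutoff and the 5% alert ratio are assumed values for the example.

```python
from dataclasses import dataclass

@dataclass
class PageSample:
    load_time_s: float  # measured page load time in seconds

# Illustrative thresholds (not LaverniaNet's actual settings): a page is
# "slow" above 8 s, and an alarm fires when slow pages exceed 5% of volume.
SLOW_PAGE_THRESHOLD_S = 8.0
ALERT_RATIO = 0.05

def should_alert(samples):
    """Return True when the share of slow pages in an interval exceeds ALERT_RATIO."""
    if not samples:
        return False
    slow = sum(1 for s in samples if s.load_time_s > SLOW_PAGE_THRESHOLD_S)
    return slow / len(samples) > ALERT_RATIO
```

In a real APM deployment the thresholds would typically come from baselines learned per application and location rather than hard-coded constants.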
Step 1: Isolate Problem Domain and Impact
The Operations team did not think twice about what to do. They immediately invoked the “Major Incident” process and commenced Fault Domain Isolation analysis. By 9:30 they were able to determine that:
- Only one application was, in fact, impacted.
- It was not a network issue.
- About 1,000 users were affected.
The Operations team called a conference bridge with relevant technical teams, including those responsible for server, application, storage, and database administration.
Based on the gathered information, they were able to correlate end-user experience with database, storage, and server performance metrics (see Figure 2). They concluded that they were facing a serious meltdown of one of their crucial applications.
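Correlating end-user experience with infrastructure metrics usually means comparing time series bucketed over the same intervals. A minimal sketch, assuming we already have per-interval page load times and, say, database response times as plain lists:

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation between two equal-length per-interval metric series.

    Values near +1 suggest the infrastructure metric (ys) moves with the
    end-user metric (xs) and is worth investigating as a fault domain.
    """
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```

A correlation alone does not prove causation, of course; in the story above it served to narrow the fault domain before the conference bridge, not to assign blame.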
Step 2: Communicate with Users
The next important step was to talk to users. From its previous experience with managing incidents, LaverniaNet knew that its customers appreciated clear communication that the company was aware of the problem, was monitoring it, and was working on resolving the issue.
During the conference call at 11:00 the Operations team was able to correlate the status reported by users with the APM reports. Figure 3 shows the scale of the problem across different locations; the report uses baseline metrics for comparison with normal performance, and red indicates performance clearly worse than what can normally be expected.
Step 3: Apply Some Immediate Cure
The Operations team, together with relevant technical people, continued investigating the problem. They understood that the poor performance was partially caused by batch processes; those could be turned off for some time without much impact on the end users.
At 12:00 they stopped those batch processes and saw a gradual performance improvement as shown by the APM reports (see Figure 4). The response times were still quite high but improved on average by 4 seconds from the start of the incident.
This measure, however, neither restored the system to its normal state nor was it a final resolution. The team therefore kept updating management and users every 30 minutes on the current state and postponed any disruptive incident recovery actions until outside peak hours.
Step 4: Disruptive Incident Recovery
Once the load on the infrastructure dropped to a minimal level on Monday evening, the Operations team restarted the relevant servers. During another conference with its users, LaverniaNet received positive feedback on performance after the restart; users were no longer reporting performance problems. The team, however, checked the APM baseline statistics, which indicated that the situation had not actually improved (see Figure 5): they could expect the same problems the following day, even though users’ perception of the performance said otherwise.
Step 5: Rollback
After consulting with management, the team made the decision to roll back changes introduced in the last update during the weekend.
The Operations team used an APM report showing load distribution over time to determine the off-peak hours of a typical Monday, when the rollback could be performed (see Figure 6).
Figure 7 shows the page volume dropping to zero during the code rollback. We then see slow transaction performance, i.e., increased page load time, as the application servers reload their caches, after which performance stabilizes.
Step 6: Continue Monitoring
On Tuesday, near-normal performance resumed, apart from some cache refreshes for multi-language pages at the application server. Figure 8 shows that Average Transaction Time and Server Time are back to green, i.e., their deviation from the baseline is within the acceptable range.
Once the situation was back to normal, it was time for the development team to learn, based on the real-world data gathered in dynaTrace purePaths sessions, what had gone so wrong with the changes introduced in the recent update.
When dealing with incidents like the one described in this article, delays or incorrect actions have a significant business impact. There is no time to panic and lose the bigger picture by narrowing your analysis to only one potential cause of the problem. Nor is there time to choke, following tedious procedures and analyzing every bit of data reported by your APM tool.
Performance problems can, to some extent, be compared to a medical emergency. In emergency medicine, the concept of the ‘golden hour’ defines the period during which prompt treatment of a traumatic injury can increase the patient’s chances of survival. It is advisable, and not only in medical emergencies, to have Standard Operating Procedures (SOPs) defined in order to meet the golden hour ‘deadline’.
A proper SOP in APM ensures that we will neither choke while trying to determine the root cause of the problem nor panic and fixate on a single, most likely wrong, cause. The Operations team from LaverniaNet had well-defined “Major Incident” procedures to assist in Fault Domain Isolation analysis using APM tools like Compuware Dynatrace Data Center Real User Monitoring (DCRUM).
The team followed a few simple steps:
- Within the first hour: isolate problem domain and assess the impact of the current incident on the end users. Use both an APM tool and a good communication channel with the technical team.
- Communicate with end users that you are aware of the problem and its scope, and that the solution is underway.
- Look for some “low-hanging fruit” that could provide an immediate, if not ultimate, cure for the problem.
- Wait to perform disruptive incident recovery until the load is negligible and the customer impact minimal.
- If those measures do not help, roll back the changes that could have led to the problem.
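A runbook like the one above can even be kept in an executable form so that whoever is on call always knows the next action. The following is a hypothetical encoding of the steps as data; the step names and the one-hour budget for the first step come from the list above, while the overall structure is an illustration, not LaverniaNet’s actual SOP:

```python
from datetime import timedelta

# Illustrative runbook: (step name, time budget); None means the step
# waits for an external condition (e.g., negligible load) rather than a clock.
RUNBOOK = [
    ("isolate_domain_and_impact", timedelta(hours=1)),
    ("communicate_with_users", None),
    ("apply_immediate_mitigation", None),
    ("disruptive_recovery_off_peak", None),
    ("rollback_suspect_changes", None),
]

def next_step(completed):
    """Return the first runbook step not yet completed, or None when done."""
    for name, _budget in RUNBOOK:
        if name not in completed:
            return name
    return None
```

Keeping the runbook as data also makes it easy to log which step the team is on, which feeds directly into the 30-minute status updates mentioned earlier.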
From the LaverniaNet Operations team we can learn two simple rules:
- Keep the communication channel with the end users and the technical team ready and open the whole time.
- Keep monitoring the performance of your applications at all times: when it comes to application performance, you cannot manage what you cannot measure.
(This article is based on materials contributed by Pieter Jan Switten and Pieter Van Heck, drawing on original customer data. The screens presented are customized but deliver the same value as the out-of-the-box reports.)