I personally don’t like the term “War Room” for the firefighting that many software companies go through when systems go down or have problems. These war rooms typically play out the same way: key personnel (engineers, operations, business) are summoned into a room and stay there until the problem is solved. This was the case back in the Apollo 13 mission, and it is still the case today, as the famous Facebook war room from Dec 2012 shows:
What’s the problem with these pictures? There are a lot of people in the room who have no clue whether the problem at hand is something they can fix or are even responsible for. All of these people are summoned without anyone first figuring out who should actually look at the problem. Why is that? Because the collected “evidence” in the form of infrastructure monitoring data, log files, user complaints, etc., just shows symptoms but doesn’t tell us anything about the actual impact and root cause of the issues:
Looking at the previous image, it is hard to tell which people need to get in a room. Do we just need an Ops guy to restart the process that consumes all of the CPU? Or do we need an application expert to sift through log files? Do we need to contact our mobile solution provider because the actual problem is in the 3rd-party native mobile app? The typical MO is to simply call in everybody to figure out the root cause, which pulls critical resources from other important projects without even knowing whether these folks can actually help solve the problem. How can we change this? By asking the right questions first!
The 10 Real Questions to ask
You don’t need nice and shiny dashboards that show you an aggregated overview of Twitter statuses, infrastructure health or insights into slow application transactions. You need data to answer the following questions; whether it is presented in nice dashboards or in log files doesn’t really matter:
#1 Is an Individual User Complaining?
Is it “just” the CEO who complains about a problem with your newly deployed internal app because a report doesn’t work on his old IE6? Or is it “just” the end user in a remote location who still uses dial-up? Knowing whether a problem affects only a single user or a very small group of users is important for prioritization.
#2 Are “all” users impacted?
Even if no individual is really complaining, you still need to know when a large number of users are impacted, because it is critical that you fix any problem that affects a large share of your users.
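Questions #1 and #2 both come down to counting how many distinct users are affected. Here is a minimal sketch of that count, not tied to any particular monitoring tool. It assumes each request record carries a user id and a failure flag; the `Request` type, the `impact_scope` function, the labels and the 5% cut-off are all hypothetical choices for illustration:

```python
from dataclasses import dataclass


@dataclass
class Request:
    user_id: str   # hypothetical field: who issued the request
    failed: bool   # hypothetical field: True if the user saw an error


def impact_scope(requests, isolated_ratio=0.05):
    """Return (affected_users, total_users, label) for a batch of requests."""
    requests = list(requests)
    all_users = {r.user_id for r in requests}
    affected = {r.user_id for r in requests if r.failed}
    if not affected:
        return 0, len(all_users), "none"
    ratio = len(affected) / len(all_users)
    # A single user, or a small share of users, counts as isolated.
    # The 5% default is an assumed cut-off; pick one that fits your user base.
    if len(affected) == 1 or ratio <= isolated_ratio:
        label = "isolated"
    else:
        label = "widespread"
    return len(affected), len(all_users), label
```

An "isolated" result points to a closer look at that user's environment, while "widespread" is the case you need to know about even when nobody has complained yet.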
#3 Is the problem in the Application?
The next question, once we know whether users are impacted, is whether the problem is in the application. The answer allows us to call in the application experts, architects and developers if needed. Looking at the performance distribution gives us an overview of where our hotspots really are:
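If your monitoring data already gives you the time spent per tier, the distribution itself is simple arithmetic. The sketch below assumes a plain mapping from tier name to milliseconds; the `hotspot_breakdown` name and the tier names in the usage note are made up for illustration:

```python
def hotspot_breakdown(tier_ms):
    """Return (tier, percent_of_total) pairs, largest contributor first."""
    total = sum(tier_ms.values())
    if total <= 0:
        return []
    shares = [(tier, 100.0 * ms / total) for tier, ms in tier_ms.items()]
    return sorted(shares, key=lambda pair: pair[1], reverse=True)
```

Calling it with, say, `{"browser": 200, "web server": 100, "app server": 600, "database": 100}` puts the app server first, which tells you the application experts are the ones to call.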
#4 Is there a problem in the delivery chain?
Modern web applications rely on a long list of services along the delivery chain that lie outside of our own data center. That includes CDNs, 3rd-party services, ISPs and mobile networks. Knowing the status of these services and their impact on the end-user performance of our own application tells us whether to look into our own data center or to call up Akamai, Facebook & Co:
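One rough way to get at this is to split the download time of a page's resources into our own hosts and everybody else's. The following is a sketch under stated assumptions: the input is a list of `(url, duration_ms)` pairs from whatever end-user monitoring you have, and `time_by_party` and `own_domains` are hypothetical names. Summing durations ignores that browsers download in parallel, so treat the result as an indicator rather than an exact share of page load time:

```python
from urllib.parse import urlparse


def time_by_party(resources, own_domains):
    """Sum resource durations for first-party vs. third-party hosts."""
    totals = {"first_party": 0.0, "third_party": 0.0}
    for url, duration_ms in resources:
        host = urlparse(url).hostname or ""
        # A host is ours if it equals one of our domains or is a subdomain of it.
        is_own = any(host == d or host.endswith("." + d) for d in own_domains)
        totals["first_party" if is_own else "third_party"] += duration_ms
    return totals
```

If the third-party total dominates, the first call goes to the provider rather than to your own developers.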
#5 Is one uncritical transaction impacted?
When the error rate goes up, is it a critical transaction such as search that is affected? Or is it a rather uncritical one such as the contact page? Or is a bot causing lots of errors because it crawls through pages that do not exist or that require authentication, and with that skews the overall error rate?
#6 Are critical transactions impacted?
What if your critical transactions are impacted, such as the landing page, login, search, or entering a ticket in your support system? These are critical to you, your end users, or your colleagues who need the back office software for their daily tasks. If they are impacted you need to act fast, so it is important to monitor these transactions for failure rate as well as performance. Problems here deserve attention before problems in transactions that are not vital to your business, and you also know which subject matter experts to call:
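Questions #5 and #6 can be answered from the same data: the failure rate per transaction, with bot traffic filtered out so it doesn’t skew the numbers. This is a minimal sketch, assuming request records of the form `(transaction, user_agent, failed)`. The bot markers, the function names and the 5% threshold are illustrative assumptions, and real bot detection needs more than a substring match on the user agent:

```python
from collections import Counter

# Assumed markers; many crawlers identify themselves this way, but not all.
BOT_MARKERS = ("bot", "crawler", "spider")


def is_bot(user_agent):
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)


def failure_rates(requests):
    """Return {transaction: failure_rate}, ignoring requests made by bots."""
    totals, failures = Counter(), Counter()
    for transaction, user_agent, failed in requests:
        if is_bot(user_agent):
            continue
        totals[transaction] += 1
        if failed:
            failures[transaction] += 1
    return {t: failures[t] / totals[t] for t in totals}


def critical_alerts(rates, critical, threshold=0.05):
    """Return the critical transactions whose failure rate exceeds the threshold."""
    return sorted(t for t in critical if rates.get(t, 0.0) > threshold)
```

With `critical` set to something like `{"search", "login"}`, a failing contact page or a crawler hitting missing pages never pages anyone, while a failing search does.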
#7 Is the problem related to bad coding?
If application response time is getting slower, the first question is whether bad coding is the cause. Analyzing the performance hotspots down to the code level can tell you whether most of the time is spent in inefficient algorithms or is lost because coding and architectural best practices were not followed:
#8 Does the infrastructure cause an issue?
What if it is not the app itself, but the app is running low on resources provided by the infrastructure? What if the CPU required to run the garbage collector is not available because the already overutilized machine also runs lots of other services? In that case it is time to think about the infrastructure, either by distributing these applications and services better or by scaling the infrastructure:
#9 Is the AppServer the issue?
Depending on the AppServer you are using, you have multiple configuration options to optimize it for your environment. The question is whether the AppServer itself might be responsible for performance issues because of an incorrect setting or a corrupt deployment. Resource pool sizing (threads, database connections, …), security settings and logging options can all impact performance. If it turns out that the AppServer is the problem, contact your IBM, Oracle, Microsoft … specialist:
#10 Is the problem in the virtual machine?
Leveraging virtual compute power, whether from your locally running VM server farm or from one of the cloud providers, provides lots of flexibility. But it can also be the reason for performance problems if the virtual machines are not properly sized or are battling for resources with other virtual machines on the same host. Knowing the impact of virtualization on the application allows you to call in the VM experts and not the app developers to solve a problem:
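On a Linux guest, one signal for this kind of contention is CPU steal time: time during which the guest wanted to run but the hypervisor was serving something else. The sketch below is one possible check, not the only one. It assumes Linux and two samples of the aggregate `cpu` line from `/proc/stat`, where the eighth value is the steal counter; the function names are hypothetical:

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into its first eight counters.

    Order: user, nice, system, idle, iowait, irq, softirq, steal.
    The guest counters that follow are already included in user and nice,
    so they are left out to avoid counting them twice.
    """
    parts = line.split()
    if len(parts) < 9 or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line with a steal field")
    return [int(value) for value in parts[1:9]]


def steal_percent(before, after):
    """Percentage of CPU time stolen between two samples of the 'cpu' line."""
    first, second = parse_cpu_line(before), parse_cpu_line(after)
    deltas = [b - a for a, b in zip(first, second)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[7] / total
```

To use it, read the first line of `/proc/stat` twice a few seconds apart and pass both lines in. A steal percentage that stays high while the application is slow is a reason to talk to the VM experts before the app developers.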
Have an answer to these questions?
Now that you have an idea of the right questions to ask before you call a war room session, or before you accept a call into one, you can start focusing on preventing these sessions. Whether you are a developer, an architect or on the business side, make sure you have the real facts available so that you get through these situations as fast as possible by calling in the RIGHT people and giving them the RIGHT data to analyze.
Better than spending time in War Rooms, however, is to reduce the number of times these situations come up. If you want to learn more about this, check out some of the other blogs we recently wrote, such as Performance-focused DevOps or, in case you happen to be getting ready for the holiday shopping season, Verify Readiness in Test & Pre-Production.