In this post I discuss the seven most important rules for improving your application performance practices. These simple-to-follow practices will help you improve the way you deal with application performance. Besides eventually improving the performance of your applications, they will help you avoid the classic blame game that normally starts when something goes wrong.
Rule 1 – Understand the application
While this may sound trivial and the most obvious thing to do – reality paints a different picture. Most failing application performance management processes originate from people’s lack of understanding of their application. This does not mean that the people involved in the process lack the knowledge to solve the problem. Rather, they do not have enough data to understand what is really going on in the application itself. Their approach is pretty much: “I will try to make sense of whatever information I can get.” They see the lack of information as an immutable physical law like gravity instead of trying to get the information they really need. In my post on the Proactivity of Troubleshooting I described in detail which information is needed to understand application problems. Basically, you must be able to answer the following questions:
- What has happened?
- When did it happen?
- Who is impacted?
- What is the difference compared to before the problem?
- Why did the problem happen?
If you cannot answer these questions based on the data you have, then you need to get the data to answer them. Otherwise your activity will be a kind of “performance voodoo” rather than a solid engineering practice. You already think you know where the problem is? Ok, then get the data that proves your guess.
Rule 2 – Measure what Matters
This is the logical consequence of rule 1. You have to measure what is important for you and your management. Many people still think that the major goal of performance management is to resolve production problems. This would be equivalent to a CEO’s primary job being to save a company that is nearly bankrupt. Most of your work is to avoid problems.
Most people are pretty good at measuring technical aspects of their application like response time of web requests or connection pools. They feel like Captain Kirk (well, I would prefer Captain Archer) if they have operations dashboards that show a lot of these fancy metrics. However, when they talk to their management they realize that this is not necessarily the information management needs. In Is there a Business Case for Application Performance I discussed this problem in detail.
The concept introduced in that post is Business Transaction Management (BTM). BTM is a specialized discipline within application performance management which targets communicating performance aspects at a business level. Typical questions answered by BTM are:
- Which user was affected by a production problem?
- What are the effects of increasing traffic for transaction X by 10 percent?
- How do we have to change our infrastructure to serve our users better?
- Why did user “Sam” have a problem accessing his account this morning?
So basically this means that you have to relate your low-level metrics to the context of the application. If you only have measures helping with highly technical problems, you will fail in these higher-level activities important to management.
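As a rough illustration of relating low-level metrics to business context, here is a minimal Python sketch. It assumes each request measurement has already been tagged with the user and the URL; the `RequestSample` type, the `BUSINESS_TRANSACTIONS` mapping and the sample data are all hypothetical and only serve the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestSample:
    """One low-level measurement, tagged with its business context."""
    url: str
    user: str
    response_ms: float
    failed: bool


# Hypothetical mapping from technical entry points to business transactions.
BUSINESS_TRANSACTIONS = {
    "/account/login": "Login",
    "/cart/checkout": "Checkout",
}


def summarize(samples):
    """Group raw request samples by business transaction."""
    grouped = defaultdict(list)
    for s in samples:
        name = BUSINESS_TRANSACTIONS.get(s.url, "Other")
        grouped[name].append(s)
    report = {}
    for name, group in grouped.items():
        report[name] = {
            "requests": len(group),
            "avg_response_ms": mean(s.response_ms for s in group),
            # Users who saw at least one failed request in this transaction.
            "affected_users": sorted({s.user for s in group if s.failed}),
        }
    return report


# Made-up sample data for illustration only.
samples = [
    RequestSample("/account/login", "sam", 4200.0, True),
    RequestSample("/account/login", "kim", 180.0, False),
    RequestSample("/cart/checkout", "kim", 350.0, False),
]
report = summarize(samples)
print(report["Login"])
```

With the context attached, the same raw numbers can answer a question like "which user was affected?" instead of only "which URL was slow?".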
Rule 3 – Objectify Measurements
As all your activities are based on measurement, you had better get them right. The basic rule is that measurement results should be the same irrespective of who is measuring them. In case you think this is a no-brainer, try the following: Ask three different people in your organization how much host CPU is consumed by a certain transaction. Ideally, ask a developer, a tester and an operations guy. (If you run the experiment I would really be interested in the results – please post them in the comments below.)
So the important part here is to objectify the way you measure. This includes the measurement method as well as the tooling. You also have to define how to interpret the measurement. This becomes even more important if you have to work across teams. If you can’t agree on how to measure, how can you ever expect to compare results?
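To see why the measurement method has to be agreed on, consider this small Python sketch. It measures the same (hypothetical) transaction in two ways using the standard library: wall-clock time with `time.perf_counter` and process CPU time with `time.process_time`. The `measure` helper and the `mostly_waiting` function are assumptions made for illustration.

```python
import time


def measure(func, *args):
    """Run func once and report both wall-clock and process CPU time."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = func(*args)
    cpu_s = time.process_time() - cpu_start
    wall_s = time.perf_counter() - wall_start
    return {"result": result, "wall_s": wall_s, "cpu_s": cpu_s}


def mostly_waiting():
    """Hypothetical transaction that spends its time waiting, not computing."""
    time.sleep(0.05)
    return "done"


m = measure(mostly_waiting)
print(f"wall: {m['wall_s']:.3f}s  cpu: {m['cpu_s']:.3f}s")
```

Two people reporting "the time" for this transaction could quote very different numbers, and both would be right by their own definition. Writing the definition down, including the tool and the clock used, is what makes results comparable.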
Rule 4 – Define a Language
A prerequisite for talking with each other is talking in the same language. This means you have to establish a language that everybody from management to IT super geeks understands. No, I am not talking about English or whatever your native language is. Have you ever been at the doctor’s, getting an explanation of what’s wrong with you, and not been able to figure out whether you are pregnant or growing a third leg? So the important part about a common language is that both parties understand what the other one means.
In management we generally refer to this language as Key Performance Indicators (KPIs). KPIs provide a common means of communication across stakeholders. They do not provide the level of detail you need for your daily work, but they help in coordination and planning. If you now think you have done the job by defining your KPIs as CPU usage, memory consumption and network traffic, you have not yet got the point. That’s like a doctor telling you all your detailed blood values. Your KPI tells you whether you are healthy or not. You do not care about value x being 20 or 30. In fact you probably have no idea whether a higher value is better or not.
Which values you actually choose depends on your application, but they have to cover things like quality-of-service or provisioning information. BTM is a central concept for collecting information at this level. Your more detailed measures are then used to decide how to move your KPIs in the direction you need. If you tell your boss that you cannot serve more than 300 concurrent users and he asks you to serve 1000, you have to figure out how to do that. This will then require you to look at memory consumption or CPU usage.
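One widely used example of such a "healthy or not" indicator, which this post does not prescribe but which illustrates the idea, is the Apdex score. It condenses many response times into a single value between 0 and 1: samples at or below a target threshold T count as satisfied, samples up to 4T count half, and slower ones count as zero. A minimal Python sketch, with a made-up threshold and sample data:

```python
def apdex(response_times_s, threshold_s):
    """Apdex score: (satisfied + tolerating / 2) / total samples."""
    if not response_times_s:
        raise ValueError("need at least one sample")
    satisfied = sum(1 for t in response_times_s if t <= threshold_s)
    tolerating = sum(
        1 for t in response_times_s if threshold_s < t <= 4 * threshold_s
    )
    return (satisfied + tolerating / 2) / len(response_times_s)


# Hypothetical response times in seconds, with a 0.5 s target.
samples = [0.2, 0.4, 0.5, 1.5, 3.0]
score = apdex(samples, threshold_s=0.5)
print(f"Apdex: {score:.2f}")
```

Management can follow a single score like this without knowing anything about CPU or memory, while the detailed measures stay available for the people who have to move it.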
Rule 5 – Use a Map and a Compass
So what does this metaphor mean? Can you imagine navigating on the ocean solely with a compass? After a short time you will realize that knowing where north is does not really help if you do not know where you are in the first place. What does this mean for performance management? Performance management has to be a continuous activity; otherwise you get lost and cannot make the best use of your measurements. As pointed out in an earlier post (Top 10 Reports are not the final answer) many people think that top 10 reports are the ultimate answer to performance optimization: if I simply optimize the ten slowest parts, everything will be fine. While this approach will improve performance, it can easily become the wrong way to go. What if what is shown in the top ten report is not the reason the application is slower?
So you have to create your map first to know which direction to head in. In performance management this means continuously monitoring your performance. This enables you to understand trends and whether you are on the right track or have to move in another direction. This goes beyond just measuring the trends in response times. You also have to know why you are moving in the wrong direction (see rule 1).
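A very small Python sketch of what "knowing where you are" can look like in practice: compare the current measurements against a stored baseline and flag a regression. The 10 percent tolerance, the use of the median and the sample numbers are all assumptions chosen for illustration, not a recommendation.

```python
from statistics import median


def regressed(baseline_ms, current_ms, tolerance=0.10):
    """True if the current median exceeds the baseline median by more than tolerance."""
    return median(current_ms) > median(baseline_ms) * (1 + tolerance)


# Hypothetical response times (ms) for the same transaction.
last_week = [210, 190, 205, 220, 200]
today = [240, 250, 230, 260, 245]
print(regressed(last_week, today))
```

A check like this only tells you that you have drifted off course. Finding out why still needs the detailed data described in rule 1.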
When I was doing my open sea sailing certificate, I had to deal with exactly the same issue in navigation. First you need to find out whether you are in the right place or not. If you failed to navigate to the right coordinates, you have to find out why you ended up where you did and how to avoid this in the future by taking into account factors like tides or windward drift.
Rule 6 – Do the simple things first
This is generally good advice. However, most people do not follow it. As in most cases in life, the 80:20 rule also applies to performance management. You get 80 percent of the success by investing 20 percent of the effort. However, it seems to be a law of nature that people are much more attracted to getting the other 20 percent of success. Let me give you some examples.
People try to implement complex high-end caching systems before following simple performance best practices. While all these technologies are great from a performance and scalability point of view, they require massive effort, whereas an improved web caching strategy requires nearly zero implementation effort.
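To give a sense of how little code a basic web caching strategy needs, here is a Python sketch that picks a standard `Cache-Control` response header based on the requested path. The extension list, the one-day lifetime and the `cache_headers` helper are hypothetical; the right values depend on how your assets are deployed and versioned.

```python
import posixpath

# Hypothetical set of file types treated as cacheable static assets.
STATIC_EXTENSIONS = {".css", ".js", ".png", ".jpg", ".gif", ".woff2"}


def cache_headers(path):
    """Return caching headers for a URL path based on its file extension."""
    ext = posixpath.splitext(path)[1].lower()
    if ext in STATIC_EXTENSIONS:
        # Let browsers and proxies reuse static assets for one day.
        return {"Cache-Control": "public, max-age=86400"}
    # Everything else must be revalidated with the server before reuse.
    return {"Cache-Control": "no-cache"}


print(cache_headers("/static/site.css"))
print(cache_headers("/account/overview"))
```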
Another example is people deciding to start with performance management and trying to get everything fully automated, from their CI environment through testing to production. While this is a great goal and perfectly follows the Continuous APM idea, this endeavor is doomed to failure. You have to start with regular manual performance analysis first and then automate your processes step by step, as knowing what to measure is a prerequisite for automation (see rules 2 and 3).
Rule 7 – Every ship needs a captain
A point many people also miss when implementing performance management is to define responsibility. The general rule regarding responsibility is that if it is not clearly defined the result will be chaos. Either many people feel responsible leading to a complete mess or nobody feels responsible leading to … well nothing. The vital step to success is to define who is responsible for performance in your company. This does not mean that he has all the expertise to solve every problem. Most likely he will not and will require the help of other people in the organization. His job is to ensure that all necessary steps will be taken and the right people get involved at the right time.
Performance Management should be part of every company’s software processes. However, failing to follow some important rules will lead to frustration, waste of resources and finally failure. I’ve described the top rules for being successful which solve the major problems I have seen out in the wild. I do not claim this list to be complete, but it goes a long way. If you have important rules to add, feel free to post them as a comment.
If you want to read more articles like this, visit the dynaTrace 2010 Application Performance Almanac.