In this post I will address top ten reports and their usefulness in performance engineering. I regularly hear people say: “Can you show me the top ten database statements?” or something similar. Their approach to performance engineering is to look at the slowest or most time-consuming statements and then make them faster. While this is not necessarily bad, there are a number of considerations which mean it is not the optimal solution for every use case.
Which aggregation to choose …
First we have to ask ourselves what our top criteria will be. Depending on which metric we choose, we will get different results. Especially if these results drive our performance engineering effort, we should be sure they tell us what we want to know. For now we assume that our performance management solution provides all the metrics we want and that we are free to choose.
We could choose the average execution time. This will lead us to the statements or requests which are the slowest on average. Averages, however, might not be, and in many cases are not, the best metric to use. When a number of very high response times occur alongside many very low ones, the average can still look good although a noticeable share of requests is far too slow. So averages are not the best choice.
A much better approach is to use percentiles instead. A common choice is the 95th percentile, which is the time within which 95 percent of all requests have been processed. This number is much closer to a representative performance value.
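To make the difference concrete, here is a minimal Python sketch. The response times are made up for illustration, and the `percentile` helper is my own; it uses the simple nearest-rank definition, which is only one of several common ways to compute a percentile.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical response times in milliseconds:
# 90 fast requests and 10 very slow ones.
response_times_ms = [20] * 90 + [2000] * 10

mean_ms = statistics.mean(response_times_ms)
median_ms = percentile(response_times_ms, 50)
p95_ms = percentile(response_times_ms, 95)

print(f"mean:   {mean_ms} ms")
print(f"median: {median_ms} ms")
print(f"p95:    {p95_ms} ms")
```

With this data the mean matches neither the fast majority nor the slow minority, while the 95th percentile makes the two-second requests visible.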
Instead of percentiles, standard deviation is frequently used. Its usual interpretation assumes that values follow a normal distribution. Following a simple formula we can then calculate the standard deviation. It tells us that about 68 percent of all values lie within one standard deviation of the mean, about 95 percent lie within two standard deviations, and so forth.
Often the choice between using standard deviation and percentiles is made based on personal preference. Personally, I prefer using percentiles. First, because they do not assume a normal distribution of values, and second, because they are closer to common SLA definitions like “95 percent of all response times must be within xx milliseconds”.
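The normal distribution assumption matters in practice, because response times are often skewed. The following sketch uses made-up data and a hypothetical helper `share_within` to check how many samples actually fall within one and two standard deviations of the mean.

```python
import statistics

# Hypothetical, heavily skewed response times in milliseconds.
response_times_ms = [10] * 95 + [1000] * 5

mean_ms = statistics.mean(response_times_ms)
stdev_ms = statistics.pstdev(response_times_ms)  # population standard deviation


def share_within(values, mean, stdev, k):
    """Fraction of values that lie within k standard deviations of the mean."""
    lower, upper = mean - k * stdev, mean + k * stdev
    inside = sum(1 for v in values if lower <= v <= upper)
    return inside / len(values)


within_one = share_within(response_times_ms, mean_ms, stdev_ms, 1)
within_two = share_within(response_times_ms, mean_ms, stdev_ms, 2)

print(f"mean {mean_ms:.1f} ms, standard deviation {stdev_ms:.1f} ms")
print(f"within one standard deviation:  {within_one:.0%}")
print(f"within two standard deviations: {within_two:.0%}")
```

For this data set the band of one standard deviation already contains 95 percent of the values instead of the roughly 68 percent a normal distribution would suggest, and its lower bound is negative, which no response time can be.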
Sometimes, however, we might not be interested in the performance of a single statement or request but rather in its overall impact on our system. If for example we are suffering from massive CPU load, we want to identify the parts of the application which consume most of the CPU time. Here neither the average nor a percentile will be useful. Instead we will use the sum of all execution or CPU times.
So the first important choice to make is which metric and aggregation will be used for a top ten report. Based on the chosen metric your results and conclusions will be different.
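A small sketch shows how much the ranking depends on the aggregation. The statement names and execution times are invented, and `top` and `percentile` are hypothetical helpers rather than part of any monitoring product.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile (one of several common definitions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def top(samples, aggregate, n=10):
    """Return up to n statement names, ranked by the aggregated value."""
    ranked = sorted(samples, key=lambda name: aggregate(samples[name]), reverse=True)
    return ranked[:n]


# Hypothetical execution times in milliseconds per statement.
samples = {
    "SELECT order": [5] * 1000,                # fast, but executed very often
    "SELECT report": [900, 1100],              # slow, but rare
    "UPDATE stock": [10] * 90 + [1500] * 10,   # mostly fast, sometimes very slow
}

by_mean = top(samples, statistics.mean)
by_p95 = top(samples, lambda v: percentile(v, 95))
by_sum = top(samples, sum)

print("by average:        ", by_mean)
print("by 95th percentile:", by_p95)
print("by sum:            ", by_sum)
```

The same three statements come out in three different orders, so the "number one" problem depends entirely on the question we ask.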
Another aspect to consider for the top ten report is the granularity of the data. We already looked at the impact of volatile values on our metrics. In certain cases these volatile measurements are caused by inadequate data granularity. A granularity that is too coarse might mean that you end up comparing apples with oranges. You might even end up drawing the wrong conclusion from the monitoring data you get, or not be able to draw any conclusion at all.
Let’s look at an example. Your performance management solution might report execution times of a generic command-processing method ranging from milliseconds to several seconds. So what does this mean? In fact you can’t say. If all requests are similar the reason most likely is some resource congestion (synchronization, database, network, etc…). If the requests are different it might just be the nature of some requests to be faster than others.
In order to answer this question we need a finer granularity of measurement data. The approach we at Dynatrace have taken to address this problem is to enrich monitoring data with additional context information, for example method parameters or bind variables of database statements. This additional level of granularity enables performance analysts to identify problems more precisely.
So whatever type of performance management system you are using, make sure your data has the right granularity. If your data is too coarse-grained you will not be able to draw proper conclusions from it, or you will have to spend more time investigating in more directions than necessary.
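The effect of added context can be sketched in a few lines of Python. This is not how any particular product implements it; the method name, the command parameter values and the timings are all made up, and `spread_by` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical measurements: (method, command parameter, duration in ms).
measurements = [
    ("processCommand", "ping", 2),
    ("processCommand", "ping", 3),
    ("processCommand", "export", 4200),
    ("processCommand", "export", 3900),
    ("processCommand", "lookup", 15),
    ("processCommand", "lookup", 2500),
]


def spread_by(measurements, key):
    """Group durations by the given key and return (min, max) per group."""
    groups = defaultdict(list)
    for measurement in measurements:
        groups[key(measurement)].append(measurement[2])
    return {group: (min(times), max(times)) for group, times in groups.items()}


# Coarse view: method name only.
coarse = spread_by(measurements, key=lambda m: m[0])
# Finer view: method name plus the captured parameter.
fine = spread_by(measurements, key=lambda m: (m[0], m[1]))

print(coarse)
for group, (fastest, slowest) in fine.items():
    print(group, fastest, slowest)
```

In the coarse view we only see one method ranging from 2 ms to more than 4 seconds. In the finer view "ping" is consistently fast, "export" is consistently slow and therefore probably slow by nature, and only "lookup" shows the kind of spread between similar requests that hints at resource congestion.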
To top ten or not top ten …
… that’s the question. As the headline of this post indicates, top ten reports are not the final answer to all engineering efforts. So even if we assume we have chosen the right metric aggregation and granularity, they will not necessarily help us in our performance engineering efforts.
Top ten reports do a good job in helping us to find the parts of an application which consume the most resources. So for all resource-oriented problems they are a great help. We can use CPU time, network traffic or even object creation as the metric, and then find the points of the application where optimization will have the greatest impact. For these kinds of problems it is important to work with sum values only, as we are interested in global optimization.
What about using top ten reports to find the slowest web requests for example? Here we first have to decide what slow means. Ideally we choose the 95th percentile (or work with standard deviation if you prefer). The question is then whether we want to optimize them or not. There are several reasons why we might not want to optimize them:
- They are slow by their nature because, for example, they perform some complex processing.
- They only affect a very small number of the overall requests of the system. Optimization would consequently bring only little benefit for end users.
- Response times are within SLA requirements and there would be no real benefit from optimization.
So instead of looking at metrics alone, we are better off taking an impact-oriented approach here. Implicitly, this is the question people want to ask anyway: “… where to tune the application to make it faster for my end-users”.
When top ten reports are not the final answer
If we want to optimize our application for end-users, it is better to follow a transactional-optimization approach than a “top ten” approach. Instead of picking the slowest SQL statements or remote calls, we pick those having the most impact on end-user perceived performance. In order to be able to choose these optimization points, we have to answer a couple of questions:
- How many end users are affected by this optimization?
- What is the relative performance improvement of the affected transactions?
These questions however cannot be directly answered by top ten reports. If we just store “top ten” information, we will not be able to even retrieve this information. So instead of storing flat execution information we need to add more transactional information. This means we increase data granularity, by relating performance metrics to a transactional context.
For each transaction type we store distinct metrics for all relevant KPIs that interest us. This requires us to capture additional runtime information and perform additional processing. Dynatrace for example supports this out of the box based on the underlying PurePath technology. You will have to check how this is supported in your performance management solution or how it can be extended to provide this support.
Having this additional metric available – which I call transaction relevance – we can make additional optimization decisions. If we realize for example that our critical statement is only used in one percent of all transactions we might decide to not optimize it at all. At the same time we might not optimize a time-consuming statement if it only contributes a low percentage to the overall performance of the transactions.
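As a rough illustration of the idea, the following sketch computes for each statement how many transactions use it and how much it contributes to their response time. The transaction data, the field names and the `statement_report` helper are assumptions made for this example; they do not reflect the data model of any particular tool.

```python
from collections import defaultdict

# Hypothetical transactions with response time and per-statement time (ms).
transactions = [
    {"type": "checkout", "response_ms": 1000,
     "statements": {"SELECT cart": 600, "UPDATE stock": 50}},
    {"type": "checkout", "response_ms": 800,
     "statements": {"SELECT cart": 400}},
    {"type": "search", "response_ms": 200,
     "statements": {"SELECT product": 20}},
    {"type": "search", "response_ms": 300,
     "statements": {"SELECT product": 30}},
    {"type": "report", "response_ms": 5000,
     "statements": {"SELECT audit": 4500}},
]


def statement_report(transactions):
    """Per statement: share of transactions using it (relevance), share of
    their response time it accounts for (contribution), and where it is used."""
    used_in = defaultdict(int)
    statement_ms = defaultdict(int)
    transaction_ms = defaultdict(int)
    types = defaultdict(set)
    for txn in transactions:
        for statement, ms in txn["statements"].items():
            used_in[statement] += 1
            statement_ms[statement] += ms
            transaction_ms[statement] += txn["response_ms"]
            types[statement].add(txn["type"])
    return {
        statement: {
            "relevance": used_in[statement] / len(transactions),
            "contribution": statement_ms[statement] / transaction_ms[statement],
            "used_in": sorted(types[statement]),
        }
        for statement in used_in
    }


report = statement_report(transactions)
for statement, row in report.items():
    print(f"{statement:15} relevance {row['relevance']:.0%} "
          f"contribution {row['contribution']:.0%} used in {row['used_in']}")
```

In this made-up data "SELECT audit" dominates the response time of the transactions it appears in but is used in only one of five transactions, while "UPDATE stock" barely contributes at all. Neither fact is visible in a flat top ten list.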
As I like to include these reports as part of test results, I have built my own automated report generation based on dynaTrace’s automation interfaces. The report below shows an example of what I like to include in test reports. In this database report, I show for each database statement 1) in which transactions it was used, and 2) how much it contributes to the transaction response time. Following this approach I can easily detect both transactions with long-running statements that contribute heavily to transaction response times and fast statements that are executed frequently and therefore also contribute heavily to response times.
Wrapping up, we can say that top ten reports are a good source for optimizing resource problems and spotting hot spots in your applications. If your primary goal, however, is to optimize transaction performance with high end-user impact, a transactional contribution-based approach serves you better.
This post is part of our 2010 Application Performance Almanach.