In this third article of my Performance Almanac I discuss the role of overhead in performance management. As a performance management solution provider we’re frequently asked “How much overhead does your solution have?” This question is however a bit more complex to answer than just giving a single number.
When discussing this topic I have also realized that there are some dogmas on people’s minds which are not necessarily true. I collected the most common myths and truths about performance measurement overhead and will discuss them in this blog.
Truth: Performance Measurement Produces Overhead
As Werner Heisenberg already told us in his uncertainty principle, measuring a system always modifies the behavior of the system itself. The same is true for performance measurement. In order to collect our measurements we have to modify the system and execute additional code. This code execution consumes system resources which we perceive as overhead.
This means that we have to test overhead before using a performance management solution. We at Dynatrace recommend doing this as part of the application roll-out. During application roll-out overhead is tested and optimized in case it is too high. Additionally we have to plan for some measurement overhead. If we are already utilizing our systems to their limits, adding performance measurement will simply not be possible.
Myth: Measurement Overhead is a Simple Percentage Figure
Quite frequently we’re asked “How many percent overhead do you have?” So people think overhead is just a simple percentage figure. Overhead itself, however, is a bit more complex. First there is not “the overhead”. Overhead has a number of angles to look at. Most people comprehend overhead as response time overhead. This means the additional time transaction execution takes due to executed measurement code.
The answer is that this depends on the transaction execution time. Measurement time is more or less constant – depending on the actual implementation. This means that the longer a transaction takes the lower the relative overhead gets. So let’s say we have 20 milliseconds measurement overhead. For a 200 millisecond transaction this means 10 percent overhead. For a 2 second transaction this however only means one percent. So instead of looking at plain percentage numbers, you should rather look at real transaction response times.
However there is not only response time overhead. In some cases throughput is a more important measure than response time. While those two are related you will in some cases – such as a transaction processing system – be more interested in throughput. Additionally performance measurement affects other parameters of our system as well. Let’s look at the other types of overhead that we have to take care of:
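The arithmetic above can be written down as a small helper. This is only a sketch: the function name is made up for illustration, and it assumes the measurement cost per transaction is constant, as in the example.

```python
def relative_overhead_percent(measurement_ms: float, transaction_ms: float) -> float:
    """Return measurement time as a percentage of the plain transaction time."""
    if transaction_ms <= 0:
        raise ValueError("transaction_ms must be positive")
    return 100.0 * measurement_ms / transaction_ms


# The same 20 ms of measurement code weighs very differently per transaction.
for transaction_ms in (200, 2000):
    pct = relative_overhead_percent(20, transaction_ms)
    print(f"{transaction_ms} ms transaction: {pct:.1f}% overhead")
```

The absolute cost stays at 20 milliseconds in both cases; only the ratio changes, which is why a single percentage figure says little without the transaction time next to it.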
CPU Overhead represents the additional CPU usage caused by executing monitoring code. While there might be a relation between CPU and response time overhead, this need not be the case. I have seen cases where response times stay totally constant and we could only see an increase in CPU consumption.
Memory overhead is caused by the additional data that is stored within the application. Depending on the storage approach this can be only some megabytes or a huge amount of data. Different solutions follow different approaches here. Tools which are primarily targeting development use cases store measurement data in the application’s memory. Production-ready solutions immediately send information to an external server to keep overhead as low as possible. We at Dynatrace, for example, immediately send all collected data to our server to ensure minimal memory consumption.
The last overhead factor is network traffic. All solutions that follow a distributed approach will also utilize the network to send data from the application to where it is stored and processed. The actual network load and utilization – how much data is sent how often – depends on the implementation of the monitoring solution. State-of-the-art solutions avoid sending any metadata like method names after the application startup. This reduces network overhead to a minimum.
So overhead is more than a single percentage figure. In order to understand and tune overhead in your system you have to monitor a number of system characteristics as well as relate them to the specifics of the transactions you are monitoring.
Myth: Overhead is Application Independent
Most people believe that overhead metrics can be transferred from one application to another. This is not true. Overhead to a great extent depends on the behavior of the actual application.
In order to provide a bit more insight into this topic let us look at the two major approaches for measuring application performance. We distinguish sampling-based and event-based monitoring.
In the sampling-based approach the execution stack of all running threads is analyzed in a defined interval. The advantage of this approach is the overhead only depends on the sampling rate – the higher the sampling rate, the higher the overhead. So in this case overhead is independent of the actual applications.
While having predictable overhead, sampling comes with a number of measurement errors. These errors result from the fact that measurement is not related to the beginning and end of a method’s execution. We can only see whether a method was executing while taking a snapshot. Everything that happens between the snapshots is not visible. The longer the sampling interval, the less accurate the measurement gets. If a method is not executed while snapshots are taken it is invisible to the sampling approach. Additionally we cannot distinguish a method that was active only at the moments of two consecutive snapshots from a method that was also executing for the whole time between them.
While these measurement errors limit the usefulness for precise performance measurements, sampling is still very useful for detecting CPU-bound and global application problems even in high load environments.
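The blind spots of sampling can be made visible with a small simulation. The sketch below does not sample a real runtime. It assumes a hypothetical single-threaded timeline of method executions (the method names and times are invented) and checks which method is active at each sampling tick.

```python
from collections import Counter


def sample_timeline(timeline, interval_ms, duration_ms):
    """Estimate per-method time by checking which method is active at each tick."""
    hits = Counter()
    for tick in range(0, duration_ms, interval_ms):
        for name, start, end in timeline:
            if start <= tick < end:
                hits[name] += 1
                break
    # Each hit is credited with one full sampling interval.
    return {name: count * interval_ms for name, count in hits.items()}


# Hypothetical timeline: (method, start_ms, end_ms).
timeline = [
    ("load_order", 0, 42),
    ("validate", 42, 48),   # 6 ms, falls entirely between two snapshots
    ("persist", 48, 100),
]

estimate = sample_timeline(timeline, interval_ms=10, duration_ms=100)
print(estimate)  # {'load_order': 50, 'persist': 50}
```

With a 10 millisecond interval the 6 millisecond method never shows up, and the two longer methods are credited with 50 milliseconds each instead of their real 42 and 52. A shorter interval reduces the error at the price of more snapshots.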
In event-based measurement – unlike interval-based collection – measurement is triggered on entering and leaving of a method. This approach resolves the problems of sampling. As the start and end of method executions are measured we get precise timing information for every method.
Overhead in this case is however no longer totally predictable up-front. The actual overhead depends on the number of measurements. The higher the load on the system the higher the overhead as more measurement code is executed.
Event-based measurement is performed following two different approaches. Most runtime environments – like the Java VM or the .NET CLR – provide special profiling callbacks which are invoked anytime a method is entered or left. This approach proves to be rather inefficient. Therefore modern tracing approaches follow an alternative approach. They instrument code adding additional measurement code. Besides being faster, as optimizations of the runtime environment – like inlining – still can be performed, this approach enables selective instrumentation. Instead of instrumenting each and every method, only specific methods are instrumented. User-defined rules are used to specify which methods will be monitored and which won’t.
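To illustrate selective, rule-driven instrumentation, here is a minimal sketch in Python. It is an analogy only: real solutions for the Java VM or the .NET CLR rewrite code at load time, while this example uses a decorator. The rule list, the `instrument` decorator and the example methods are all hypothetical.

```python
import fnmatch
import functools
import time
from collections import defaultdict

# Hypothetical user-defined rules: only matching method names are measured.
INCLUDE_PATTERNS = ["process_*", "save_order"]

# Method name -> list of measured durations in seconds.
timings = defaultdict(list)


def instrument(func):
    """Add entry/exit timing if func matches a rule; otherwise return it unchanged."""
    if not any(fnmatch.fnmatch(func.__name__, p) for p in INCLUDE_PATTERNS):
        return func  # excluded methods run without any measurement code

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()  # "method entered" event
        try:
            return func(*args, **kwargs)
        finally:
            # "method left" event, recorded even if the method raises
            timings[func.__name__].append(time.perf_counter() - start)

    return wrapper


@instrument
def process_order(items):
    return sum(price(i) for i in items)


@instrument
def price(item):  # short and frequently called; no rule matches it
    return item * 2


total = process_order([1, 2, 3])
print(total, dict(timings))
```

Only `process_order` produces a measurement here. `price` is called three times but stays completely untouched, which is the point of selective instrumentation: the rules decide where measurement code exists at all.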
So in case we want to get predictable overhead without analyzing the application up-front we have to use a sampling-based approach. However, if we want to get precise measurements we have to invest some time to properly tune instrumentation rules to reach our overhead goals.
Truth: Overhead affects Application Behavior
As we already discussed performance measurement affects certain characteristics of our system, like response times, CPU consumption or memory usage. As we have seen proper definition of measurement rules helps us to avoid response and throughput problems as well as massive resource-consumption issues.
However there are other effects as well that impact the quality of the measurement data. If we do not carefully select what we measure, our measurement data will be inaccurate, leading us to wrong conclusions. This type of problem is also sometimes referred to as a Heisenbug – named after the scientist. Heisenbugs are problems which change their characteristics as soon as they are measured. These are probably the hardest things to find in an application. Typically we see two forms of them:
If we instrument the wrong parts of an application we might suddenly see phantom performance problems due to the introduced overhead. This typically happens when we measure the execution time of short-running methods which are executed very frequently. These issues however can be avoided if the methods causing the problem are excluded from measurement. In case your solution does not provide means to exclude these parts of application code it is up to your knowledge to decide whether the results make sense or not. Dynatrace for example will detect such methods automatically and suggest their exclusion.
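One simple way to spot candidates for exclusion is to estimate how much of a method's measured time is measurement code. The sketch below is a made-up heuristic and not how Dynatrace or any other product detects such methods. The per-call cost, the threshold and the sample statistics are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class MethodStats:
    name: str
    calls: int
    total_ms: float  # measured time, including the measurement code itself


def suggest_exclusions(stats, per_call_overhead_ms=0.002, max_overhead_share=0.2):
    """Return names of methods whose measured time is dominated by measurement cost."""
    suggestions = []
    for s in stats:
        if s.calls == 0 or s.total_ms <= 0:
            continue
        overhead_share = (s.calls * per_call_overhead_ms) / s.total_ms
        if overhead_share > max_overhead_share:
            suggestions.append(s.name)
    return suggestions


# Hypothetical measurement results from one test run.
stats = [
    MethodStats("getName", calls=1_000_000, total_ms=3_000.0),
    MethodStats("loadCustomer", calls=500, total_ms=12_000.0),
]

print(suggest_exclusions(stats))  # ['getName']
```

In this example roughly two thirds of the time reported for `getName` would be measurement cost, so its apparent slowness is a phantom. `loadCustomer` is called rarely and runs long, so measuring it costs next to nothing.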
While phantom problems show problems which do not really exist, there might also be the case that certain problems will not occur when you want to measure them. This is in most cases caused by changed timing behavior of the application. Good examples are synchronization problems, as they are very sensitive to execution times, which makes them really hard to track down. In Dynatrace we decided to provide a special form of instrumentation which allows one to selectively instrument methods involved in synchronization.
So while measurement affects application behavior there are ways to avoid or detect those problems easily. While tools provide good help here, it is still up to the performance analyst to decide whether the actual measurements make sense.
Myth: More Details Mean More Overhead
Many people think that more details always come with the penalty of higher overhead. While there is truth in this statement, it cannot be taken as generally true. Especially the assumption that an approach providing less detail will also come with less overhead does not hold. We have often heard the argument “If you are tracing every transaction, you must have higher overhead than a profiler providing statistical information”. Overhead, however, is much more affected by the architecture of the performance management solution than the granularity of the data.
Let’s look at the different granularity levels first. The lowest granularity level is to only store execution information on method level without hierarchy information. While this results in low memory overhead, all execution context is lost as well, making detailed problem analysis impossible. The next level of detail is to additionally store direct caller information. While this is a slight improvement we still miss important data. Hierarchical call trees provide more details; they show the aggregated call hierarchies of the executed application code. This provides more context; however, the actual invocation sequence as well as transactional context information like method parameters or user information is lost. Transaction tracing provides the finest level of granularity storing each transaction separately with all related context information.
While more details mean more data, this does not necessarily mean more overhead. The data initially collected in all the mentioned approaches is quite similar. For each method invocation we have to collect the execution time. How additional call stack information is collected depends on the actual implementation; however, there are very cheap methods to get this information.
The key to overhead, however, is how and where this information is stored and processed. The golden rule is the more processing occurs within the application runtime the higher the overhead. Some solutions keep the call tree information in memory for a whole measurement run or send it off for processing at regular intervals. Those solutions come with higher memory overhead and also CPU and execution time overhead. In order to support very high throughput with low overhead, “raw data” must be immediately sent to a central processing unit, where the processing and data reconstruction happens. We at Dynatrace follow this approach which ensures low overhead with very fine-grained details.
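The difference between these levels is easier to see when the same raw data is stored in three ways. The events, method names and transaction ids below are invented for illustration, and the direct-caller level is left out to keep the sketch short.

```python
from collections import defaultdict

# Hypothetical raw events: (transaction id, call path from the root, time in ms).
events = [
    ("tx1", ("handle", "query"), 30),
    ("tx1", ("handle", "render"), 5),
    ("tx2", ("handle", "query"), 200),
    ("tx2", ("report", "query"), 40),
]

flat = defaultdict(int)       # method level: no hierarchy information
call_tree = defaultdict(int)  # aggregated call paths: hierarchy, but no single transaction
traces = defaultdict(list)    # transaction tracing: everything kept per transaction

for tx, path, ms in events:
    flat[path[-1]] += ms
    call_tree[path] += ms
    traces[tx].append((path, ms))

print(flat["query"])                    # 270
print(call_tree[("handle", "query")])   # 230
print(traces["tx2"])                    # [(('handle', 'query'), 200), (('report', 'query'), 40)]
```

The flat view only says that `query` took 270 milliseconds in total. The call tree shows that most of it came in through `handle`. Only the per-transaction trace reveals that a single transaction, `tx2`, is responsible for the slow call.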
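The idea of shipping raw data instead of processing it inside the application can be sketched as follows. This is an illustration under assumptions and not the design of any specific product: the class name, the bounded queue and the drop-on-full policy are choices made for the example, and a plain list stands in for the external server.

```python
import queue
import threading


class RawEventSender:
    """Hands raw events to a background thread so application threads do no processing."""

    def __init__(self, transport, max_pending=10_000):
        self._queue = queue.Queue(maxsize=max_pending)  # bounded, which caps memory use
        self._transport = transport  # hypothetical callable, e.g. a network write
        self.dropped = 0
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def record(self, method_id, start_ns, end_ns):
        try:
            # Only ids and timestamps are queued; aggregation is left to the server.
            self._queue.put_nowait((method_id, start_ns, end_ns))
        except queue.Full:
            self.dropped += 1  # lose data rather than slow down the application

    def close(self):
        self._queue.put(None)  # sentinel tells the worker to stop
        self._worker.join()

    def _drain(self):
        while (event := self._queue.get()) is not None:
            self._transport(event)


received = []  # stands in for the external server
sender = RawEventSender(received.append)
sender.record(7, 1_000, 4_500)
sender.record(9, 4_500, 4_900)
sender.close()
print(received)
```

The application thread only pays for one non-blocking queue insert per measurement. Building call trees, resolving method ids to names and computing statistics all happen on the receiving side.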
So the actual data processing approach – as well as the architecture – impacts overhead much more than data granularity. However, for one specific solution, significantly more detail will also mean more overhead.
Myth: Overhead Tuning is Complex
So after reading all of this, you might now think that overhead tuning is a complex process. You might have even run into the problem that you have been struggling with a serious overhead issue.
While overhead tuning is still necessary in modern performance management solutions, it has become very easy as tools provide excellent support for this task. Solutions range from simply eliminating frequently called short running methods to more sophisticated approaches which use more contextual information to ensure a certain level of granularity as well as that only methods which are not related to performance problems get excluded. Below you see a screenshot of the automatic performance sensor configuration in Dynatrace which automatically detects which methods to include and exclude from measurement.
The basis for getting precise overhead metrics is to compare load test results with and without the performance management solution installed. These tests will provide you with detailed information on current overhead values. It is important, however, to run these load tests in a realistic scenario.
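Such a comparison can be as simple as the following sketch. The response times are invented sample values, and using the median is an assumption made for the example; percentiles or throughput figures can be compared in the same way.

```python
import statistics


def overhead_report(baseline_ms, instrumented_ms):
    """Compare median response times of two load test runs."""
    base = statistics.median(baseline_ms)
    inst = statistics.median(instrumented_ms)
    return {
        "baseline_median_ms": base,
        "instrumented_median_ms": inst,
        "absolute_ms": inst - base,
        "relative_percent": 100.0 * (inst - base) / base,
    }


# Hypothetical response times from the same load test, without and with monitoring.
baseline = [198, 201, 200, 205, 199]
instrumented = [209, 212, 210, 216, 208]

report = overhead_report(baseline, instrumented)
print(report)
```

For these sample values the report shows 10 milliseconds of absolute overhead, which is 5 percent of the 200 millisecond baseline. Reporting both numbers avoids the single-percentage trap discussed earlier.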
So tuning overhead with proper tool support is a straight-forward task with no magic involved. The better the tool support, the less detailed knowledge of the application is required.
Overhead of performance measurement is an issue performance analysts have to take care of. Modern application performance management solutions, however, provide excellent support in this task. As overhead depends on the actual application and its usage, it must be part of the roll-out and maintenance process of the application. Overhead depends much more on the actual measurement and processing approach than on the actual data granularity. If you want in-depth measurement at low cost, you will have to go for a professional solution, as it will provide more sophisticated means for data processing.
This post is part of our 2010 Application Performance Almanach.