Is sampling data as good as capturing all transactions in detail? At first glance, having all transactions is clearly better than having just some of them – but let’s dig a little deeper.

If we’re not capturing all transactions – how do we select those we do capture? Or, to be more precise, how do we decide which transactions are captured with full details and which with just high-level information?

Sampling

One way of selecting the transactions you follow in depth is to sample, meaning you randomly select a certain percentage of transactions. For example, you could choose to follow only every 50th transaction, resulting in a sample rate of 2%. While this reduces application overhead and the load on your monitoring solution, how can you be sure the sample really represents your system accurately? What if the slow request or error a user complained about is in the 98% you didn’t monitor?
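As a minimal sketch of what such a sampler boils down to – the rate and the function name below are illustrative, not any particular APM agent’s API:

```python
import random

# Illustrative only: a 2% sample rate, i.e. on average every 50th transaction
SAMPLE_RATE = 0.02

def should_capture_details() -> bool:
    """Randomly decide whether this transaction gets captured with full details."""
    return random.random() < SAMPLE_RATE
```

For 98 out of 100 transactions the answer is simply “no” – regardless of how interesting those transactions later turn out to be.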

Errors

So what if we additionally monitor those transactions that had errors? For web applications we can use HTTP error codes to detect failures (e.g. 404 or 500 errors). For other types of applications, warning/error log messages and exceptions are a good indicator. If your APM solution includes end-user monitoring, you could even add client-side JavaScript errors. But random sampling plus errors still doesn’t guarantee we get details for the slow transactions.
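A simplified decision function for error-based capture might look like the sketch below; the parameters are assumptions for illustration, not a real APM agent interface:

```python
from typing import Optional

def is_error_transaction(status_code: int,
                         exception: Optional[BaseException],
                         log_levels: list[str]) -> bool:
    """Flag a transaction for full-detail capture based on error indicators."""
    if status_code >= 400:                      # HTTP failures such as 404 or 500
        return True
    if exception is not None:                   # an exception was raised
        return True
    if any(level in ("WARNING", "ERROR") for level in log_levels):
        return True                             # warning/error log messages
    return False
```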

Slow Transactions

Since application performance is a key driver for any APM solution, picking the transactions with bad performance – meaning they are slow – is important. But what threshold defines “slow”? You could pick an arbitrary value, but which one? Especially for newly deployed applications or applications with varying load patterns, selecting such a value can be tricky.
You could use statistical measures to have the system baseline the slow threshold – e.g. the standard deviation. Such measures work best if the data follows a Gaussian distribution (also known as a normal distribution), which response times rarely do. Below is a snapshot of production traffic (around 6000 individual requests) showing a more commonly seen distribution pattern:

The statistical average (red) is 1050ms, the median or 50th percentile (orange) is between 800 and 850ms, and the standard deviation is 711ms

If your monitoring tool were to cover only transactions slower than the average plus 3 standard deviations, you would monitor just 1.8% of your traffic:

A common approach of looking at transactions slower than 3 times the standard deviation will only cover a very small percentage of transactions

Going to just twice the standard deviation would increase coverage to 4%, but you would still miss details on everything faster than 2480ms.
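You can reproduce this kind of coverage calculation yourself. The sketch below uses NumPy on a synthetic, right-skewed distribution – the numbers it prints only illustrate the mechanics; the figures above come from real production data:

```python
import numpy as np

# Synthetic right-skewed response times in ms – real traffic is rarely Gaussian
rng = np.random.default_rng(42)
response_times = rng.lognormal(mean=6.6, sigma=0.6, size=6000)

avg = response_times.mean()
std = response_times.std()
median = np.percentile(response_times, 50)
print(f"average: {avg:.0f}ms, median: {median:.0f}ms, stddev: {std:.0f}ms")

for k in (2, 3):
    threshold = avg + k * std
    coverage = (response_times > threshold).mean() * 100
    print(f"slower than average + {k}*stddev ({threshold:.0f}ms): {coverage:.1f}% of transactions")
```

With a skewed distribution, a “slow” bucket defined via standard deviations always ends up being a tiny tail – exactly the effect shown in the charts.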

Why are all Transactions Needed?

So why would I need all transactions? Why isn’t it enough to combine random sampling, errors, and slow transactions as the data set for APM?

After you have looked at the slowest transactions, identified the root cause, and put a developer to work on the fixes – what’s next? With an APM solution deployed, wouldn’t it be nice to, say, work on improving the median response time, so users overall get a faster, more responsive application?

What if your boss asks you for “the most bang for his buck”? He doesn’t care that 1.8% of your users are getting bad response times – he wants to make sure the large majority of users feel the impact of his performance-improvement efforts and spending. If we say the large majority is 90% of all requests, the data set you would need for your analysis looks like this (covering around 5500 transactions):

It is better to focus on and optimize those transactions that really impact the majority of end users
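A minimal sketch of how you could isolate that majority slice, assuming the response times of all transactions are available as a NumPy array:

```python
import numpy as np

def majority_slice(response_times: np.ndarray, coverage: float = 0.90) -> np.ndarray:
    """Return the fastest `coverage` share of all transactions –
    the slice the large majority of users actually experiences."""
    cutoff = np.percentile(response_times, coverage * 100)
    return response_times[response_times <= cutoff]
```

The catch, of course, is that this only works if the data for those 90% was captured in the first place.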

You could also use additional data points to select the transactions to analyze: excluding internal IPs (we all know that internal users often have different usage patterns than real end users), looking only at certain geographic units, only at certain types of requests, only at requests that hit a certain server, or… – a simple filter along those lines is sketched below.
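Assuming each captured transaction carries metadata such as client IP, geography, and request type (the attribute names and values below are hypothetical), such filters are easy to express:

```python
import ipaddress

INTERNAL_NETWORK = ipaddress.ip_network("10.0.0.0/8")   # example internal range

def keep_for_analysis(tx: dict) -> bool:
    """Keep only the transactions relevant for the current analysis."""
    if ipaddress.ip_address(tx["client_ip"]) in INTERNAL_NETWORK:
        return False                         # exclude internal users
    if tx["geo"] != "EMEA":                  # only one geographic unit
        return False
    if tx["request_type"] != "checkout":     # only one type of request
        return False
    return True
```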

The list could go on and on. Once you have analyzed the slowest requests, remember that APM is not just about the 1.8% – it means all users with all their requests.

Below is the production traffic from one of our customers – 30 minutes with over 760,000 transactions, all captured with full details:

Having all transactions gives you the real picture on your end user experience

Myth Busted?

We have seen time and again that capturing all transactions – in depth, all the time – is necessary to address technical as well as business questions. If you don’t, you might quickly find yourself looking for another APM solution once you have solved the most pressing issues and fixed the worst-performing transactions.

Read up on two examples where having all transactions was essential to making the right decision – to fix technical problems as well as to see how the business is impacted by performance:

If you want to test this yourself – sign up for the Dynatrace Free Trial and let us know what you think.