Most production monitoring systems I have seen have one major problem: there are too many JVMs, CLRs and hosts to monitor.

One of our bigger customers (a Fortune 500 company) mastered the challenge by concentrating on what really matters: the applications!

Ensure Health

The following dashboard is taken directly from the production environment of that customer:

High Level Transaction Health Dashboard that shows how many transactions perform badly

What it does is pretty simple. It shows the transaction load in their two data centers. The first two charts show the transaction load over different periods of time. The third shows the sum of the execution times of all those transactions. If the execution time goes up but the transaction count does not, they know they have a bottleneck to investigate further. The pie charts to the right show the same information in a collapsed form. The color coding indicates the “health” of the transactions: green ones have a response time below one second, while red ones are over 3 seconds. In case of an immediate problem the red area in the five-minute pie chart grows quickly and they know they have to investigate.
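To make that color coding concrete, here is a minimal sketch (not the customer's actual implementation) that buckets a transaction's response time using the thresholds quoted above: green below one second, red above three seconds, yellow in between.

```java
// Illustrative sketch only: classifies a response time into the health
// categories used on the dashboard. Thresholds are the ones quoted in the
// text; the class and enum names are invented for this example.
public final class TransactionHealth {

    public enum Health { SATISFACTORY, SLOW, VERY_SLOW }

    // Thresholds in milliseconds, taken from the description above.
    private static final long SLOW_MS = 1_000;
    private static final long VERY_SLOW_MS = 3_000;

    public static Health classify(long responseTimeMs) {
        if (responseTimeMs < SLOW_MS) return Health.SATISFACTORY;
        if (responseTimeMs <= VERY_SLOW_MS) return Health.SLOW;
        return Health.VERY_SLOW;
    }

    public static void main(String[] args) {
        System.out.println(classify(450));   // SATISFACTORY (green)
        System.out.println(classify(1_800)); // SLOW (yellow)
        System.out.println(classify(7_200)); // VERY_SLOW (red)
    }
}
```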

The interesting thing is that, instead of looking at the health of hosts or databases, the primary health indicators they use are their end-user and business transactions. If the number of yellow or red transactions increases, they start “troubleshooting”. The first lesson we learn from that is to measure health in terms that really matter to your business and end users. CPU and memory utilization do not matter to your users; response time and error rates do.

Define your Application

Once they detect a potential performance or health issue they first need to isolate the problematic application. This might sound simple, but they have hundreds of applications running in over 1000 JVMs in this environment. Each application spans several JVMs plus several C++ components, and each transaction in turn flows through a subset of all these processes. Identifying the responsible application is therefore important to them, and for that purpose they have defined another simple dashboard that shows which applications are responsible for the “red” transactions:

This dashboard shows which business transactions are the slowest and which are very slow most often

They are using dynaTrace business transaction technology to trace and identify all their transactions. This allows them to identify which specific business transactions are slow and which of them are slow most often. They actually show this on a big screen for all to see. So not only does operations have an easy time identifying the responsible team; most of the time that team already knows by the time they get contacted!

This is our second lesson learned: Know and measure your application(s) first! This means:

  • You define and measure performance at the unique entry point to the application/business transaction (see the sketch after this list)
  • You know or can dynamically identify the resources, services and JVMs used by that application and measure those
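Here is a hypothetical sketch of what the first point can look like at the code level: a plain servlet filter that times each request at its single entry point, keyed by request URI. In the customer's environment this is handled by the dynaTrace agent; the in-memory counters below exist only to keep the example self-contained.

```java
import javax.servlet.*;
import javax.servlet.http.HttpServletRequest;
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Measures response time at the entry point of each business transaction
// (here identified by its URI) instead of per host or per JVM.
public class EntryPointTimingFilter implements Filter {

    // Per-URI running totals; a real setup would export these to a monitoring backend.
    private static final Map<String, LongAdder> totalMs = new ConcurrentHashMap<>();
    private static final Map<String, LongAdder> count = new ConcurrentHashMap<>();

    @Override public void init(FilterConfig cfg) {}
    @Override public void destroy() {}

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        long start = System.nanoTime();
        try {
            chain.doFilter(req, res); // the actual business transaction
        } finally {
            long elapsed = (System.nanoTime() - start) / 1_000_000;
            String uri = ((HttpServletRequest) req).getRequestURI(); // assumes HTTP traffic
            totalMs.computeIfAbsent(uri, k -> new LongAdder()).add(elapsed);
            count.computeIfAbsent(uri, k -> new LongAdder()).increment();
        }
    }
}
```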

Measure your Application…and its dependencies

Once the performance problem or an error is identified the real fun begins, as they need to identify where the problem originates in the distributed system. To do that we need to apply the knowledge that we have about the application and measure the response time on all involved tiers. The problem might also lie between two tiers, in the database or with an external service you call. You should therefore not only measure the entry points but also the exit points of your services. In large environments, like the one in question, it is not possible to know all the dependencies upfront. Therefore we need the ability to automatically discover the tiers and resources that are actually used.

Show the transaction flow of a single business transaction type
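The exit-point measurement mentioned above can be as simple as timing the outbound call to a downstream service and comparing that client-side number with the server-side measurement of the same call. A minimal sketch, with a made-up downstream endpoint; in practice the tracing product captures these exit points automatically:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ExitPointTiming {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(
                // Hypothetical downstream service, for illustration only.
                URI.create("http://inventory.internal.example/api/stock"))
                .build();

        long start = System.nanoTime();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        long exitMs = (System.nanoTime() - start) / 1_000_000;

        // The client-side time includes network and queueing; comparing it with the
        // server-side measurement of the same call isolates problems between tiers.
        System.out.printf("exit point inventory-service: %d ms (status %d)%n",
                exitMs, response.statusCode());
    }
}
```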

At this point, we can isolate the fault domain down to the JVM or database level. The logical next step is to measure the things that impact the application on those JVMs/CLRs. That includes the resources we use and the third-party services we call. But in contrast to the usual utilization-based monitoring, we are interested in metrics that reflect the impact these resources have on our application. For example: instead of only monitoring the connection usage of a JDBC connection pool, it makes much more sense to look at the average wait duration and the number of threads waiting for a connection. These metrics represent the direct impact the resource pool has. The usage, on the other hand, explains why a thread is waiting; but 100% usage does not imply that a thread is actually waiting!

The downside of normal JMX-based monitoring of resource measures is that we still cannot directly relate their impact to a particular type of transaction or service. We can only do that if we measure the connection acquisition directly from within the service. This is similar to measuring the client side and server side of a service call. The same approach can be applied to the execution of database statements themselves. Our Fortune 500 company is doing exactly that and found that their worst-performing application executes the following statements quite regularly:

This shows that a statement that takes 7 seconds on average is executed regularly. Please excuse the poor quality of the picture.

While we should generally avoid looking at top 10 reports for analysis, in this case it is clear that these statements were at the core of their performance problem.
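Coming back to the connection-pool point above, here is a minimal sketch of measuring connection acquisition from within the service rather than relying on JMX pool usage. The wrapper is illustrative and does not implement the full javax.sql.DataSource interface; a production version (or an APM agent) would.

```java
import javax.sql.DataSource;
import java.sql.Connection;
import java.sql.SQLException;
import java.util.concurrent.atomic.AtomicLong;

// Times how long callers actually wait to obtain a connection -- the "impact"
// metric from the text -- instead of only watching pool usage from the outside.
public class TimedDataSource {

    private final DataSource delegate;
    private final AtomicLong acquisitions = new AtomicLong();
    private final AtomicLong totalWaitMs = new AtomicLong();

    public TimedDataSource(DataSource delegate) {
        this.delegate = delegate;
    }

    public Connection getConnection() throws SQLException {
        long start = System.nanoTime();
        Connection c = delegate.getConnection(); // may block while the pool is exhausted
        long waited = (System.nanoTime() - start) / 1_000_000;
        acquisitions.incrementAndGet();
        totalWaitMs.addAndGet(waited);
        return c;
    }

    // Average time a caller waited for a connection.
    public double averageWaitMs() {
        long n = acquisitions.get();
        return n == 0 ? 0.0 : (double) totalWaitMs.get() / n;
    }
}
```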

Finally, we also measure the CPU and memory usage of a JVM/CLR, but we again look at the application as the primary context: we measure the CPU usage of a specific application or type of transaction. It is important to remember that an application in the context of SOA is a logical entity and cannot be identified by its process or its class alone. It is the runtime context, e.g. the URI or the SOAP message, that defines the application. Therefore, in order to find the applications responsible for CPU consumption, we measure it on that level. Measuring memory on a transaction level is quite hard and maybe not worth the effort, but we can measure the impact that garbage collection has. The JVMTI interface informs us whenever a GC suspends the application threads. This can be directly related to the response time impact on the currently executing transactions or applications. Our customer uses such a technique to investigate the transactions that consume the most CPU time or are impacted most by garbage collection:

Execution time spent in fast, slow and very slow transactions compared with their respective volume

This dashboard shows them that, although most execution time is spent in the slow transactions, they only represent a tiny fraction of their overall transaction volume. This tells them that much of their CPU capacity is spent in a minority of their transactions. They use this as a starting point to go after the worst transactions. At the same time it shows them on a very high level how much time they spend in GC and if it has an impact. This again lets them concentrate on the important issues.
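For readers without a JVMTI-based agent, a rough approximation of the GC impact measurement described above is to subscribe to the notifications that the standard GarbageCollectorMXBeans emit; attributing the pause to the individual transactions that were running at that moment is the part a tracing product adds on top. A sketch:

```java
import com.sun.management.GarbageCollectionNotificationInfo;
import javax.management.NotificationEmitter;
import javax.management.openmbean.CompositeData;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcPauseListener {

    public static void install() {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            NotificationEmitter emitter = (NotificationEmitter) gc;
            emitter.addNotificationListener((notification, handback) -> {
                if (GarbageCollectionNotificationInfo.GARBAGE_COLLECTION_NOTIFICATION
                        .equals(notification.getType())) {
                    GarbageCollectionNotificationInfo info = GarbageCollectionNotificationInfo
                            .from((CompositeData) notification.getUserData());
                    // Duration of the collection in ms; for concurrent collectors this is
                    // only an approximation of the actual stop-the-world pause.
                    long pauseMs = info.getGcInfo().getDuration();
                    System.out.printf("GC %s paused the JVM for about %d ms%n",
                            info.getGcName(), pauseMs);
                }
            }, null, null);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        install();
        // Generate some garbage so a collection actually happens.
        for (int i = 0; i < 1_000_000; i++) { byte[] junk = new byte[1024]; }
        Thread.sleep(1_000);
    }
}
```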

All this gives them a fairly comprehensive, yet still manageable, picture of where the application spends time, waits and uses resources. The only thing that is left to do is to think about errors.

Monitoring for errors

As mentioned before, most error situations need to be put into the context of their trigger in order to make sense. As an example: if we get an exception telling us that a particular parameter for a web service is invalid, we need to know how that parameter came into being. In other words, we want to know which other service produced that parameter, or whether the user entered something wrong that should have been validated on the screen already. Our customer is doing the reverse, which also makes a lot of sense. Their problem is that clients call them to complain about poor performance or errors. When a client calls, they use a simple dashboard to look up the user/account and from there filter down to any errors that happened to that particular user. As errors are captured as part of the transaction, they can also identify the responsible business transaction and have the deep-dive transaction trace that the developer needs in order to fix it. That is already a big step towards a solution. For their more important clients they are actually working on monitoring proactively and calling them up in case of problems.

In short, when monitoring errors we need to know which application and which flow led to that error and which input parameters were given. If possible we would also like to have the stack traces of all involved JVMs/CLRs and the user that triggered it.
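What that error context could look like in code, shown here as a simple value type rather than anything product-specific; the field names and the "checkout" example are made up:

```java
import java.time.Instant;
import java.util.Map;

// Captures an error together with the context called for above: the business
// transaction, the user who triggered it, and the input parameters.
public record ErrorEvent(
        Instant when,
        String businessTransaction,        // e.g. "checkout"
        String user,                       // lets support filter errors by account
        Map<String, String> inputParameters,
        Throwable error) {

    public static ErrorEvent capture(String transaction, String user,
                                     Map<String, String> params, Throwable t) {
        return new ErrorEvent(Instant.now(), transaction, user, Map.copyOf(params), t);
    }

    public static void main(String[] args) {
        try {
            throw new IllegalArgumentException("invalid shipping address");
        } catch (IllegalArgumentException e) {
            ErrorEvent evt = capture("checkout", "customer-4711",
                    Map.of("addressId", "none"), e);
            System.out.println(evt.businessTransaction() + " failed for "
                    + evt.user() + ": " + evt.error().getMessage());
        }
    }
}
```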

Making sure an optimization works!

There is one other issue that you have in such a large environment: whenever you make a change, it might have a variety of effects. You want to make sure that none are negative and that performance actually improves. You can obviously do that in tests. You can also compare previously recorded performance data with the new data, but in such a large environment this can be quite a task, even if you automate it. Our customer came up with a very pragmatic way to do a quick check instead of going through the more comprehensive analysis right away. The fact is that all they really care about are the slow or very slow transactions, and not so much whether satisfactory performance got even better.

Transaction Load Performance Breakdown that shows that the outliers are indeed reduced after the fix

The chart shows the transaction load (number of transactions) on one of their data centers, color coded for satisfactory, slow and very slow response times (actually they are split into several more categories). We see the outliers at the top of the chart (the red portion of the bars). The dip in the chart represents the time during which they diverted traffic to the other data center to apply the necessary changes. After the load comes back, the outliers are significantly reduced. While this does not guarantee that the applied change is optimal in all cases, it tells them that overall it has the desired effect under full production load!

What about utilization metrics?

At this point you might ask if I have forgotten about utilization metrics like CPU usage, or if I simply don't see their uses. No, I have not forgotten them, and they do have uses. But they are less important than you might think. A utilization metric tells me whether a resource has reached capacity. In that regard it is very important for capacity planning, but as far as performance and stability go it only provides additional context. As an example: knowing that the CPU utilization is 99% does not tell me whether the application is stable or whether that fact has a negative impact on performance. It really doesn't! On the other hand, if I notice that an application is getting slower while none of the measured response time metrics (database, other services, connection pools) increase, and at the same time the machine that hosts the problematic service reaches 99% CPU utilization, we might indeed have hit a CPU problem. But to verify that I would additionally look at the load average which, similar to the number of threads waiting on a connection pool, represents the number of threads waiting for a CPU and thus indicates real impact.
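That load-average cross-check can be done with nothing more than the standard OperatingSystemMXBean. A sketch of the reasoning, with the threshold simplified to "load above core count":

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class CpuPressureCheck {
    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        int cores = os.getAvailableProcessors();
        double loadAvg = os.getSystemLoadAverage(); // 1-minute average; -1 if unsupported (e.g. Windows)

        if (loadAvg < 0) {
            System.out.println("Load average not available on this platform");
        } else if (loadAvg > cores) {
            // More runnable threads than CPUs: threads are queueing for a CPU,
            // which is real impact, unlike a high utilization number alone.
            System.out.printf("Possible CPU bottleneck: load %.2f on %d cores%n", loadAvg, cores);
        } else {
            System.out.printf("No CPU queueing: load %.2f on %d cores%n", loadAvg, cores);
        }
    }
}
```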

The value that operating-system-level utilization metrics provide keeps getting smaller. Virtualization and cloud technologies distort the measurement itself: because workloads run in a shared environment and can get more resources on demand, resources are neither finite nor dedicated, and resource utilization metrics become dubious. Application response time, on the other hand, is unaffected if measured correctly, and it is the best and most direct indicator of real performance!