Often performance management is still confused with performance troubleshooting. Others think that performance management in production is simply about system and JVM level monitoring and that they are already doing APM.
The first perception assumes that APM is about speeding up some arbitrary method performance and the second assumes that performance management is just about discovering that something is slow. Neither of these two is what we at Dynatrace would consider prime drivers for APM in production. So what does it mean to have APM in production and why do you do it?
The reason our customers need APM in their production systems is to understand the impact that end-to-end performance has on their end users and therefore their business. They use this information to optimize and fix their application in a way that has direct and measurable ROI. This might sound easy but in environments that include literally thousands of JVMs and millions of transactions per hour, nothing is easy unless you have the right approach!
Therefore real APM in production answers the questions and solves problems such as the following:
- How does performance affect the end users’ buying behavior or the revenue of my tenants?
- How is the performance of my search for a specific category?
- Which of my 100 JVMs, 30 C++ Business components and 3 databases is participating in my booking transaction and which of them is responsible for my problem?
- Enable Operations, Business and R&D to look at the same production performance data from their respective vantage points
- Enable R&D to analyze production level data without requiring access to the production system
Gain End-to-end Visibility
If you are not even aware of the problems your end users experience, you cannot fix them. Without knowing the effect that performance has on your users, you do not know how performance affects your business. Without knowing that, how do you decide if your performance is ok?
The primary metric on the end user level is the conversion rate. What End-to-End APM tells you is how application performance or non-performance impacts that rate. In other words, you can put a dollar number on response time and error rate!
Thus the first reason why you do APM in production is to understand the impact that performance and errors have on our users’ behavior.
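To make the idea of putting a dollar number on response time more concrete, here is a minimal Python sketch. The visit records, the 2-second threshold and the function names are all hypothetical; it only illustrates the kind of calculation involved and is not how any particular APM product computes it.

```python
from collections import defaultdict

# Hypothetical visit records: (page response time in seconds,
# order value in dollars, or 0.0 if the visit ended without a purchase).
visits = [
    (0.8, 40.0), (1.2, 0.0), (1.5, 25.0), (0.9, 0.0),
    (3.5, 0.0), (4.1, 0.0), (3.9, 30.0), (5.2, 0.0),
]


def conversion_by_bucket(visits, threshold=2.0):
    """Split visits into 'fast' and 'slow' and compute the conversion rate of each."""
    stats = defaultdict(lambda: {"visits": 0, "orders": 0, "revenue": 0.0})
    for response_time, order_value in visits:
        bucket = stats["fast" if response_time < threshold else "slow"]
        bucket["visits"] += 1
        if order_value > 0:
            bucket["orders"] += 1
            bucket["revenue"] += order_value
    for bucket in stats.values():
        bucket["conversion_rate"] = bucket["orders"] / bucket["visits"]
    return dict(stats)


def estimated_lost_revenue(report):
    """Crude estimate: what slow visits might have earned at the fast conversion rate."""
    fast, slow = report["fast"], report["slow"]
    average_order_value = fast["revenue"] / fast["orders"]
    missed_orders = (fast["conversion_rate"] - slow["conversion_rate"]) * slow["visits"]
    return missed_orders * average_order_value


report = conversion_by_bucket(visits)
lost = estimated_lost_revenue(report)
print(report["fast"]["conversion_rate"], report["slow"]["conversion_rate"], lost)
```

The estimate assumes that slow visitors would otherwise behave like fast ones, which is a simplification. Real data would need far more visits and more than two buckets before drawing conclusions.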
Once you know the impact that some slow request has on your business you want to zero in on the root cause, which can be anywhere in the web delivery chain. If your issue is on the browser side, the best thing to have is the exact click path of the affected users.
You can use this to figure out if the issue is in a specific server-side request, related to third-party requests or in the JavaScript code. Once you have the click path, plus some additional context information, a developer can easily use something like the AJAX Edition to analyze it.
If the issue is on the server side we need to isolate the root cause there. Many environments today encompass several hundred JVMs, CLRs and other components. They are big, distributed and heterogeneous. To isolate a root cause here you need to be able to extend the click path into the server itself.
But before we look at that, we should look at the other main driver of performance management – the business itself.
Create Focus – It’s the Business that matters
One problem with older forms of performance management has been the disconnect from the business. It simply has no meaning for the business whether average CPU on 100 servers is at 70% (or whatever else). It does not mean anything to say that JBoss xyz has a response time of 1 second on webpage abc. Is that good or bad? Why should I invest money to improve that? On top of this we don’t have one server but thousands, with thousands of different webpages and services all calling each other, so where should we start? How do we even know if we should do something?
The last question is actually crucial and is the second main reason why we do APM. We combine End User Monitoring with Business Transaction Management. We want to know the impact that performance has on our business, and as such we want to know if the business performance of our services is influenced by performance problems of our applications.
While End User Monitoring enables you to put a general dollar figure on your end user performance, business transactions go one step further. Let’s assume that the user can buy different products based on categories. If I have a performance issue I would want to know how it affects my best selling categories and would prioritize based on that. The different product categories trigger different services on the server side. This is important for performance management in itself as I would otherwise look at too much data and could not focus on what matters.
Business Transaction Management does not just label a specific Web Request with a name such as Booking, but really enables you to do performance management on a higher level. It is about knowing if and why the revenue of one tenant is affected by the response time of the booking transaction.
In this way Business Transactions create a twofold focus. They enable the business and management to set the right focus, which is always based on company success, revenue and ROI. At the same time Business Transactions enable the developer to exclude 90% of the noise from the investigation and immediately zero in on the real root cause. This is due to the additional context that Business Transactions bring. If only bookings via credit card are affected, then diagnostics should focus on only these and not on all booking transactions. This brings me to the actual diagnosing of performance issues in production.
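The credit card example can be sketched in a few lines of Python. The transaction records, field names and threshold below are hypothetical. The point is only that grouping by business context narrows the investigation, where grouping by request name alone would flag every booking.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical captured transactions with business context attached.
transactions = [
    {"name": "booking", "payment": "credit_card", "response_time": 4.2},
    {"name": "booking", "payment": "credit_card", "response_time": 3.8},
    {"name": "booking", "payment": "invoice", "response_time": 0.9},
    {"name": "booking", "payment": "invoice", "response_time": 1.1},
    {"name": "search", "payment": None, "response_time": 0.4},
]


def slow_groups(transactions, keys, threshold):
    """Group transactions by the given context keys and keep groups whose
    average response time (seconds) exceeds the threshold."""
    groups = defaultdict(list)
    for tx in transactions:
        groups[tuple(tx[key] for key in keys)].append(tx["response_time"])
    averages = {group: mean(times) for group, times in groups.items()}
    return {group: avg for group, avg in averages.items() if avg > threshold}


# Without business context every booking looks suspicious.
by_name = slow_groups(transactions, ["name"], threshold=2.0)
# With the payment type as context only credit card bookings stand out.
by_name_and_payment = slow_groups(transactions, ["name", "payment"], threshold=2.0)
print(sorted(by_name), sorted(by_name_and_payment))
```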
The Perfect Storm of Complexity
At Dynatrace we regularly see environments with several hundred or even over a thousand web servers, JVMs, CLRs and other components running as part of a single application environment. These environments are not homogeneous. They include native business components, integrations with, for example, Siebel or SAP, and of course the mainframe. These systems are here to stay and their impact on the complexity of today’s environments should not be underestimated. Mastering this complexity is another reason for APM.
Today’s systems serve huge user bases and in some cases need to process millions of transactions per hour. Ironically most APM solutions and approaches will simply break down in such an environment, but the value that the right APM approach brings here is vital. The way to master such an environment is to look at it from an application and transaction point of view.
SLA violations and errors need to be detected automatically, and the data needed to investigate them must be captured; otherwise we will never be able to fix them. The first step is to isolate the offending tier and find out if the problem is due to the host, the database, the JVM, the mainframe, a third-party service or the application itself.
Instead of seeing hundreds of servers and millions of data points we can immediately isolate the one or two components that are responsible for the issue. Issues happening here cannot be reproduced in a test setup. This has nothing to do with a lack of technical ability; we simply do not have the time to figure out which circumstances lead to a problem. So we need to ensure that all the data we need for later analysis is available all the time. This is another reason why we do APM. It gives us the ability to diagnose and understand real-world issues.
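As a rough illustration of isolating the offending tier, the following Python sketch assumes that every captured transaction carries a per-tier time breakdown. The record layout, tier names and SLA value are hypothetical and far simpler than what a real environment would produce.

```python
from collections import Counter

# Hypothetical captured transactions with time spent per tier, in seconds.
transactions = [
    {"id": "tx-1", "tiers": {"web": 0.1, "jvm": 0.3, "database": 0.2}},
    {"id": "tx-2", "tiers": {"web": 0.1, "jvm": 0.4, "database": 3.1}},
    {"id": "tx-3", "tiers": {"web": 0.2, "jvm": 0.3, "database": 2.7}},
]

SLA_SECONDS = 2.0  # assumed end-to-end response time target


def sla_violations(transactions, sla):
    """Return (transaction id, slowest tier) for every transaction over the SLA."""
    violations = []
    for tx in transactions:
        total = sum(tx["tiers"].values())
        if total > sla:
            slowest_tier = max(tx["tiers"], key=tx["tiers"].get)
            violations.append((tx["id"], slowest_tier))
    return violations


def offending_tiers(violations):
    """Count how often each tier was the slowest one in a violating transaction."""
    return Counter(tier for _, tier in violations)


violations = sla_violations(transactions, SLA_SECONDS)
ranking = offending_tiers(violations)
print(violations, ranking.most_common(1))
```

Picking the tier with the largest share of the time is only a first approximation. It does not explain why that tier is slow, which is where the captured transaction data comes in.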
Once we have identified the offending tier, we know whom to talk to and that brings me to my last point, collaboration.
Breaking the Language Barrier
Operations is looking at SLA violations and uptime of services, the business is looking at revenue statistics of sold products and R&D is thinking in terms of response time, CPU cycles and garbage collection. It is a fact that these three teams speak completely different languages. APM is about presenting the same data in those different languages and thus breaking the barrier.
Another thing is that as a developer you rarely get access to the production environment, so you have a hard time analyzing the issues. Reproducing issues in a test setup is often not possible either. Even if you do have access, most issues cannot be analyzed in real time. In order to effectively share the performance data with R&D we first need to capture and persist it. It is important to capture all transactions and not just a subset. Some think that you only need to capture slow transactions, but there are several problems with this. Either you need to define what is slow, or, if you have baselining, you will only get what is slower than before. The first is a lot of work and the second assumes that performance is fine right now. That is not good enough. In addition such an approach ignores the fact that concurrency exists. Concurrently running transactions impact each other in numerous ways and whoever diagnoses an issue at hand will need that additional context.
Once you have the data you need to share it with R&D, which most of the time means physically copying a scrubbed version of that data to the R&D team. While the scrubbed data must exclude things like credit card numbers, it must not lose its integrity. The developer needs to be able to look at exactly the same picture as operations. This enables better communication with operations while at the same time enabling deep-dive diagnostics.
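One way to scrub data without losing its integrity is to replace each sensitive value with a stable token, so that two records referring to the same card still line up after scrubbing. The Python sketch below is only an illustration under that assumption: the pattern, the salt and the token format are hypothetical, and a real scrubber would need stricter detection (for example a checksum test) to avoid masking other long numbers.

```python
import hashlib
import re

# Assumed pattern: 13 to 19 digits, optionally separated by spaces or dashes.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def scrub(text, salt="hypothetical-salt"):
    """Replace anything that looks like a card number with a stable token."""
    def to_token(match):
        digits = re.sub(r"\D", "", match.group())
        digest = hashlib.sha256((salt + digits).encode("utf-8")).hexdigest()
        return "CARD-" + digest[:8]
    return CARD_PATTERN.sub(to_token, text)


record_a = "booking failed for card 4111 1111 1111 1111 after 4.2s"
record_b = "retry with card 4111-1111-1111-1111 succeeded after 0.9s"

scrubbed_a = scrub(record_a)
scrubbed_b = scrub(record_b)
print(scrubbed_a)
print(scrubbed_b)
```

Both records end up with the same token, so a developer can still see that the failed booking and the retry belong together, while the timing information stays untouched.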
Once a fix has been supplied, operations needs to ensure that there are no negative side effects and will also want to verify that it has the desired positive effect. Modern APM solves this by automatically understanding the dynamic dependencies between applications and automatically monitoring new code for performance degradations.
Thus APM in production improves communication, speeds up deployment cycles and at the same time adds another layer of quality assurance. This is the final, but by no means the least important, reason we do APM.
The reason we do APM in production is not to fix a CPU hot spot, speed up a specific algorithm or improve garbage collection. Neither the business nor operations care about that. We do APM to understand the impact that the application’s performance has on our customers and thus our business. This enables us to effectively invest precious development time where it has the most impact and thus further the success of the company. APM truly serves the business of a company and its customers by bringing focus to the performance management discipline.
My recommendation: If you do APM in production, and you should, do it for the right reasons.