In my last post, I wrote about the value of IT/business collaboration and the importance of a common language – a shared definition of end-user experience, with user transaction response time as the one performance metric IT and business have in common. I also provided some background on exactly how we define response time, since that definition dictates the usefulness of the measurement. For the sake of brevity, I’ll summarize three common definitions here:
- Session-layer response time: request/response measurements specific to an application
- Application component response time: component-level performance unique to specific application functions
- User transaction response time: click to glass, or end-user experience
These definitions remain important as we consider the other key part of the organization that IT operations must collaborate with: the development team. You probably already do this, perhaps too frequently, and these encounters might remind you of using your high-school French to explain your indigestion to a harried pharmacy clerk in Paris. (I never claimed to be a master of analogies.)
Application Performance Triage and AA NPM: A Tenuous Partnership
Let’s set the stage a bit more. When there’s a performance problem with a critical application, our immediate goal is to restore service as quickly as possible. A triage approach begins with identifying the problem, progresses to isolating the fault domain, and leads to capturing enough diagnostic insight to prove the conclusion, facilitating efficient handoff to the appropriate domain team. Sound familiar? Not only is it something you practice frequently, it is the promise made by most AA NPM (and many other) solutions. But what constitutes appropriate diagnostic insight if the problem is related to application processing? (After all, they can’t all be network problems.)
The answer is something more than a session-layer response time measurement, that ubiquitous metric present in virtually all AA NPM solutions since the late 1990s. These very basic measurements summarize all request/response timings on a given TCP connection without differentiating between transaction types, unfortunately and unavoidably mixing 10-millisecond queries with 15-second reports. While the label – “application response time” – may sound promising, the measurements themselves are relatively unactionable in real life. Armed with this measurement, the information you can pass to the development team is only slightly more interesting than saying “my network tool says your app is slow.” (Or, to the French pharmacy clerk, “J’ai mangé quelque chose de mauvais – I ate something bad.”) The result? The now-proverbial war room. (I’ll poke at some war room promises in another blog.)
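To make the problem concrete, here is a minimal sketch (the transaction counts are hypothetical; the 10 ms and 15 s figures come from the example above) of why a blended session-layer measurement describes nothing useful:

```python
# Sketch: one TCP connection carries both fast queries (~10 ms)
# and heavy reports (~15 s). A session-layer metric averages them all.
# The 95/5 mix below is an invented illustration, not measured data.

query_times_ms = [10] * 95        # 95 quick lookups at ~10 ms each
report_times_ms = [15_000] * 5    # 5 heavy reports at ~15 s each

all_times = query_times_ms + report_times_ms
avg_ms = sum(all_times) / len(all_times)

print(f"Blended 'application response time': {avg_ms:.0f} ms")
```

The blended figure (about 760 ms here) describes neither transaction type: it overstates query latency by roughly 75x and understates report latency by roughly 20x, which is exactly why the number is so hard to act on.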
The missing ingredient is deep packet inspection (DPI) that decodes individual transactions from the application payload. For this DPI approach to be broadly effective, the analysis (decode) must be sophisticated and flexible enough to parse important transaction parameters. A contemporary example is a web application that uses a single URL for many different transactions, with the transaction parameters embedded in the request body rather than in the URL string itself. The decode must understand and act on this complexity, and not use the single URL as the identifier for all transactions.
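The single-URL case can be sketched as follows. Everything here is illustrative – real AA NPM decodes operate on reassembled packet payloads, not Python dicts, and the endpoint name, body format, and `action` field are assumptions for the example:

```python
import json

def identify_transaction(request: dict) -> str:
    """Name the transaction by inspecting the request body, not just the URL.

    Hypothetical sketch: '/api/gateway' stands in for a single endpoint
    that fronts many different transactions.
    """
    url = request["url"]
    if url != "/api/gateway":
        return url  # distinct URLs can serve as transaction names directly
    # Single-endpoint app: the real transaction type is in the POST body.
    body = json.loads(request.get("body", "{}"))
    return body.get("action", "unknown")

requests = [
    {"url": "/login"},
    {"url": "/api/gateway", "body": '{"action": "runMonthlyReport"}'},
    {"url": "/api/gateway", "body": '{"action": "lookupCustomer"}'},
]

for r in requests:
    print(identify_transaction(r))  # /login, runMonthlyReport, lookupCustomer
```

A decode that stopped at the URL would lump `runMonthlyReport` and `lookupCustomer` into one bucket – reintroducing the same blending problem the session-layer metric suffers from.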
Transaction Visibility is Just the Beginning
This application transaction or component measurement can be considered the lowest common denominator required for effective collaboration with the development team. It should provide enough diagnostic insight to accomplish two goals inherent in triage:
- Accept that there is a problem tied to an identifiable application/code function (even though the data comes from a network tool).
- Begin to investigate the application logic invoked by the transaction, along with its underlying dependencies.
The development team may still have to work to reproduce the problem (especially if it is intermittent) and trace code execution (this isn’t a Dynatrace PurePath, after all), but it achieves our stated goal of inter-team collaboration by defining the problem in language that the development team can relate to. (Ça va?)
If I were only concerned about highlighting the fundamentals of communicating with development teams, I might stop here. (In fact, in the extended AA NPM market, it’s quite common to stop here.) But there remains a critical missing link, one that significantly impairs your ability to leverage this application transaction insight.
Is There a Problem?
Without measuring end-user experience:
- You won’t know users are having a problem until they call you (unless something catastrophic happens).
- You will chase after problems that don’t affect users (because you’re monitoring dozens, or hundreds, of metrics and application components of varying impact).
- When a user does call to complain, you’ll have to interpret their definition of the problem into the corresponding application components to identify a starting point for triage.
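The first bullet above suggests the basic mechanic: compare each user transaction’s response time against a baseline so you hear about degradation before the phone rings. A minimal sketch, with entirely hypothetical baselines and threshold:

```python
# Minimal sketch (all numbers hypothetical): per-transaction baselining
# so a slowdown surfaces as an alert rather than a user complaint.

BASELINE_MS = {
    "login": 300,
    "searchOrders": 800,
    "runMonthlyReport": 15_000,
}
SLOWDOWN_FACTOR = 2.0  # alert when a transaction runs 2x its baseline

def breaches_baseline(transaction: str, response_ms: float) -> bool:
    """Return True if this user transaction exceeds its allowed slowdown."""
    baseline = BASELINE_MS.get(transaction)
    return baseline is not None and response_ms > baseline * SLOWDOWN_FACTOR

print(breaches_baseline("login", 250))           # within baseline -> False
print(breaches_baseline("searchOrders", 2_400))  # 3x baseline -> True
```

Note that the baseline is per transaction type – a 15-second monthly report is normal, while a 15-second login is an incident – which is only possible once transactions are decoded individually.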
The conclusion should be clear: you need to measure end-user experience. This is not a capability of most AA NPM solutions – in spite of marketing phrasing that might suggest otherwise. (Even Gartner, in its Magic Quadrant for NPMD, suggests that end-user experience can be described as “application availability, latency and quality from a network perspective.”) Instead, true end-user experience monitoring requires a deep understanding of complex application behaviors, algorithms and insights beyond the relatively simple DPI-based transaction decoding I’ve focused on here. (Think of the vocabulary you’d require to explain your symptoms and discuss pharmaceutical, homeopathic and holistic treatment options with our friendly French pharmacist.) I’ve recently been referring to this extension to AA NPM as “Transaction Centric NPM” – emphasizing the importance of applying both application and user definitions to transaction measurements. This visibility is a unique differentiating capability of Dynatrace Data Center RUM.