Transaction-Centric NPM: Enabling IT Operations/Development Collaboration

In my last post, I wrote about the value of IT/Business collaboration, and the importance of a common language, a common definition of end-user experience – user transaction response time – as the one performance metric both IT and business have in common. In it, I provided some background on the importance of understanding exactly how we define response time, since this definition dictates the usefulness of the measurement. For the sake of brevity, I’ll summarize three common definitions here:

  • Session-layer response time: request/response measurements specific to an application
  • Application component response time: component-level performance unique to specific application functions
  • User transaction response time: click to glass, or end-user experience

These definitions remain important as we consider the other key part of the organization that IT operations must collaborate with: the development team. This is something that you probably already do, perhaps too frequently, and these encounters might remind you of relying on your high-school French to explain your indigestion to a harried pharmacy clerk in Paris. (I never claimed to be a master of analogies.)

Application Performance Triage and AA NPM: A Tenuous Partnership

Let’s set the stage a bit more. When there’s a performance problem with a critical application, our immediate goal is to restore service as quickly as possible. A triage approach begins with identifying the problem, progresses to isolating the fault domain, and leads to capturing enough diagnostic insight to prove the conclusion, facilitating efficient handoff to the appropriate domain team. Sound familiar? Not only is it something you practice frequently, it is the promise made by most AA NPM (and many other) solutions. But what constitutes appropriate diagnostic insight if the problem is related to application processing? (After all, they can’t all be network problems.)

The answer is something more than a session-layer response time measurement, that ubiquitous metric present in virtually all AA NPM solutions since the late 1990s. These very basic measurements summarize all request/response timings on a given TCP connection without differentiating between transaction types, unfortunately and unavoidably mixing 10-millisecond queries with 15-second reports. While the label – “Application response time” – may sound promising, the measurements themselves are rarely actionable in real life. Armed with this measurement, the information you are able to pass to the development team is only slightly more interesting than saying “my network tool says your app is slow.” (Or, to the French pharmacy clerk, “J’ai mangé quelque chose de mauvais – I ate something bad.”) The result? The now-proverbial war room. (I’ll poke at some war room promises in another blog.)

Identifying Transactions

For probe-based solutions, deep packet inspection (DPI) approaches can add valuable insight to these rudimentary session-layer measurements by identifying the transaction that each measured request/response exchange represents. A web URI, a SOAP request and a SQL select statement are common transaction examples; each would consist of a single request message followed by a single reply message (and of course each message could be transported via multiple network packets). DPI can be used to extract the transaction identification from the HTTP header, the SOAP body, or the database-specific SQL syntax. Given this capability, unique transaction-specific measurements can be recorded, baselined, reported, assigned thresholds, and used to generate performance alerts. No longer is the measurement mixing a complex JavaServer Page that takes 6 seconds with a simple JavaScript request that takes 100 milliseconds, or a 5-millisecond SQL select statement with a 15-second stored procedure.
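To make the difference concrete, here is a minimal sketch of the two measurement styles. The transaction names and timings are invented for illustration (loosely echoing the figures above), and the two functions are hypothetical helpers, not part of any product.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical decoded exchanges: (transaction identifier, response time in seconds).
exchanges = [
    ("GET /report.jsp", 6.0),
    ("GET /report.jsp", 6.4),
    ("GET /js/menu.js", 0.1),
    ("GET /js/menu.js", 0.1),
    ("SQL SELECT orders", 0.005),
    ("SQL EXEC monthly_rollup", 15.0),
]

def session_layer_average(samples):
    """Session-layer view: one average across every exchange, whatever its type."""
    return mean(seconds for _, seconds in samples)

def per_transaction_average(samples):
    """Transaction view: one average per identified transaction."""
    grouped = defaultdict(list)
    for name, seconds in samples:
        grouped[name].append(seconds)
    return {name: mean(times) for name, times in grouped.items()}

overall = session_layer_average(exchanges)
by_transaction = per_transaction_average(exchanges)

print(f"session-layer average: {overall:.2f} s")
for name, avg in sorted(by_transaction.items()):
    print(f"{name}: {avg:.3f} s")
```

The single session-layer number describes none of the transactions well, while the per-transaction averages can each be baselined and compared against their own thresholds.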

In order for this DPI approach to be broadly effective, the analysis (decode) must be sophisticated and flexible enough to parse important transaction parameters. A contemporary example is a web application that uses a single URL for many different transactions, where the transaction parameters are not included in the URL string itself, but rather embedded in the request body. The decode must be able to understand and act on this complexity, and not use the single URL as the identifier for all transactions.
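As a rough sketch of what such a decode has to do, the function below derives a transaction identifier from the request body when the URL alone is ambiguous. The `action` parameter name, the `/app/dispatch` URL and the two supported content types are assumptions made for this example; a real application may carry its transaction identity in a different field or encoding.

```python
import json
from urllib.parse import parse_qs

def transaction_name(method, url, content_type, body):
    """Return a transaction identifier, using a body parameter when one is present.

    Assumes (for illustration only) that the application names its
    transaction in a parameter called "action".
    """
    action = None
    if method == "POST" and body:
        if content_type.startswith("application/x-www-form-urlencoded"):
            # parse_qs maps each key to a list of values; take the first.
            action = parse_qs(body).get("action", [None])[0]
        elif content_type.startswith("application/json"):
            try:
                payload = json.loads(body)
            except json.JSONDecodeError:
                payload = None
            if isinstance(payload, dict):
                action = payload.get("action")
    if action:
        return f"{method} {url} [action={action}]"
    # Fall back to the URL when no body parameter can be extracted.
    return f"{method} {url}"

form_tx = transaction_name(
    "POST", "/app/dispatch",
    "application/x-www-form-urlencoded", "action=openSession&user=42",
)
json_tx = transaction_name(
    "POST", "/app/dispatch", "application/json", '{"action": "runReport"}',
)
print(form_tx)
print(json_tx)
```

Both requests target the same URL, yet they end up with distinct identifiers and can be measured separately.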

Transaction Visibility is Just the Beginning

This application transaction or component measurement can be considered the lowest common denominator required for effective collaboration with the development team. It should provide enough diagnostic insight to accomplish two goals inherent in triage:

  • Accept that there is a problem related to an understandable application/code function (even though the data comes from a network tool).
  • Begin to investigate the application logic invoked by the transaction, along with underlying dependencies.

The development team may still have to work to reproduce the problem (especially if it is intermittent) and trace code execution (this isn’t a Dynatrace PurePath, after all), but it achieves our stated goal of inter-team collaboration by defining the problem in language that the development team can relate to. (Ça va?)

If I were only concerned about highlighting the fundamentals of communicating with development teams, I might stop here. (In fact, in the extended AA NPM market, it’s quite common to stop here.) But there remains a critical missing link, one that significantly impairs your ability to leverage this application transaction insight.

Is There a Problem?

Let’s say you’re monitoring a handful of web applications. (The same holds true if you’re monitoring other applications such as SAP, Oracle eBusiness Suite, Siebel, Microsoft Exchange, etc. I’ll use web as the most familiar.) If you’re monitoring the performance of web components and corresponding backend database queries, you’ve likely also got transaction-specific performance baselines. But how do you know when you have a problem? If the performance of the opensession.jsp changes from 3 to 4 seconds, is that a problem to share with the development team? (If I were to state that the performance changes from 3 to 13 seconds, the answer would be obvious; let’s avoid discussing the simple, catastrophic demo cases designed for vendor playbooks.) What happens if JavaScript file download performance degrades from 1.0 to 1.5 seconds? Should the network team take this one with high priority? How do you go about setting appropriate performance thresholds for thousands of application components?
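A simple baseline comparison shows why these questions are hard to answer from component measurements alone. This is only a sketch: the 25% tolerance is an arbitrary assumption, and both helper functions are hypothetical.

```python
def relative_change(baseline_s, observed_s):
    """Fractional change of an observed response time against its baseline."""
    return (observed_s - baseline_s) / baseline_s

def exceeds_threshold(baseline_s, observed_s, max_increase=0.25):
    """True if the slowdown is larger than the allowed fractional increase.

    max_increase=0.25 (25%) is an assumed tolerance, not a recommendation.
    """
    return relative_change(baseline_s, observed_s) > max_increase

jsp_change = relative_change(3.0, 4.0)  # opensession.jsp: 3 s to 4 s
js_change = relative_change(1.0, 1.5)   # JavaScript download: 1.0 s to 1.5 s

print(f"opensession.jsp slowed by {jsp_change:.0%}")
print(f"JavaScript download slowed by {js_change:.0%}")
```

Both components would trip a 25% threshold, but nothing in this arithmetic says whether any user noticed, which is exactly the gap the questions above point to.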

How can you tell if you have a problem affecting end users?

Like user transactions (click to glass) in many applications, a web page is composed of perhaps dozens of individual components. The user doesn’t complain about the performance of the opensession.jsp, but about the performance of loading a web page or waiting for a report. To further complicate the problem, performance at the browser is dependent not only on component response time, but also on the inter-component processing delays within the browser itself. An issue with a new JavaScript file may cause it to take a couple of extra seconds to run, delaying subsequent requests for page content, yet remain entirely invisible to component-level monitoring.
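The gap between component timings and what the user experiences can be sketched with a made-up page waterfall. The component names, offsets and durations below are assumptions, and the sketch assumes the requests run one after another rather than in parallel.

```python
# Hypothetical waterfall: (component, request start offset, response time), in seconds.
waterfall = [
    ("opensession.jsp", 0.0, 3.0),
    ("menu.js", 3.1, 0.2),
    # The next request starts about 2 s late while menu.js executes in the browser.
    ("content.json", 5.3, 0.4),
]

def page_load_time(components):
    """Time from the first request until the last response completes."""
    return max(start + duration for _, start, duration in components)

def total_component_time(components):
    """Sum of the response times that component-level monitoring reports."""
    return sum(duration for _, _, duration in components)

page_time = page_load_time(waterfall)
component_time = total_component_time(waterfall)
browser_delay = page_time - component_time

print(f"page load time:        {page_time:.1f} s")
print(f"sum of component time: {component_time:.1f} s")
print(f"unaccounted delay:     {browser_delay:.1f} s")
```

Every component here looks healthy on its own; the extra time only becomes visible when the page is measured as one user transaction.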

Without measuring end-user experience:

  • You won’t know users are having a problem until they call you (unless something catastrophic happens).
  • You will chase after problems that don’t affect users (because you’re monitoring dozens, or hundreds, of metrics and application components of varying impact).
  • When a user does call to complain, you’ll have to translate their description of the problem into the corresponding application components to identify a starting point for triage.

The conclusion should be clear: you need to measure end-user experience. This is not a capability of most AA NPM solutions – in spite of much marketing phrasing that might suggest otherwise. (Even Gartner, in their Magic Quadrant for NPMD, suggests that end-user experience can be described as “application availability, latency and quality from a network perspective.”) Instead, true end-user experience monitoring requires a deep understanding of complex application behaviors, along with algorithms and insights beyond the relatively simple DPI-based transaction decoding I’ve focused on here. (Think of the vocabulary you’d require to explain your symptoms and discuss pharmaceutical, homeopathic and holistic treatment options with our friendly French pharmacist.) I’ve recently been referring to this extension to AA NPM as “Transaction-Centric NPM” – emphasizing the importance of applying both application and user definitions to transaction measurements. This visibility is a unique differentiating capability of Dynatrace Data Center RUM.

Bonne journée!

Gary is a Subject Matter Expert in Network Performance Analytics at Dynatrace, responsible for DC RUM’s technical marketing programs. He is a co-inventor of multiple performance analysis features, and continues to champion the value of network performance analytics. He is the author of Network Application Performance Analysis (WalrusInk, 2014).