Application performance analytics: a network framework

So, why a discussion around Application Performance Analytics? Well, there’s a lot of buzz in this industry around the topic of performance analytics – an informal subset of IT operations analytics (ITOA) – as a solution to the growing mountains of monitoring data and the increasing complexity of application and network architectures.

At the same time, there exist many purpose-built performance analysis solutions. Many are domain-centric – server monitoring and network monitoring, for example – while some exhibit a key ITOA characteristic by incorporating and correlating data from multiple sources. Most perform some level of analysis to expose predefined insights.

Application performance analytics: viewed through a simple framework

In this blog, I’ll outline a simple analytics framework that illustrates how network and application metrics can be derived from a network probe (“wire data” to use the increasingly popular term), combined and analyzed to provide insights that are greater than the sum of the parts. (In fact, that is one of the core promises of ITOA.) I’ll conclude by pointing out some of the more advanced analysis capabilities this framework might need to become a viable modern-day solution.

First, a bit of a disclaimer. My initial intent was to write about the new release of Dynatrace Network Application Monitoring (NAM, previously called DC RUM); that was the assigned task. But I know if I started touting features – especially if I used ubiquitous (and usually meaningless) qualifiers such as “exciting, industry-leading, breakthrough, and best-of-breed,” you’d probably stop reading. Instead, I chose one of NAM’s focus areas – advanced analytics – as an opportunity to wax technical; you can consider the framework a simplification and abstraction of one of the multiple approaches that NAM uses for automated fault domain isolation (FDI).

To keep this blog relatively simple, I’ll use the example of a web page – although the framework would apply to any application that uses a request/reply paradigm; to apply more universally, I’ll switch terms slightly:

  • A transaction is the page load that the user experiences; its duration is the page load time
  • A hit is a component of the transaction – an image, stylesheet, JavaScript, JavaServer Page, etc.

Application performance analytics: foundation & key insights

The foundation: hit performance

Hit-level performance is the basic building block for the framework; it represents the smallest unit of measurement at the application layer, incorporating request and reply message flows as well as server processing delays. The measurement itself is quite straightforward; virtually any application-aware network performance monitoring (AA NPM) probe would provide this (it’s often referred to as session-layer response time), and the only decode requirement is to identify the TCP ports used. A hit begins with a client request message (PDU) that the client’s TCP stack segments into packets for transfer across the network; this is observed by the probe as the request flow. A hit concludes with the server’s reply message, which is similarly segmented into packets for transfer across the network in the reply flow.

Often, the probe will sit near the application server, not the client, so a small adjustment should be made to the elapsed hit time observed at the server: add ½ of the network round-trip time (RTT) to the beginning and to the end of the measurement to arrive at a more accurate estimate of the performance at the client node. (RTT can be estimated by examining the SYN, SYN/ACK, ACK handshakes – if they exist – or by more sophisticated ACK timing measurements.)
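As a minimal sketch of that adjustment, assuming a probe next to the server and timestamps in seconds: the class and function names below are hypothetical, and the RTT estimate uses the gap between the server's SYN/ACK and the client's ACK as seen at the probe, which is only one of the ways to approximate it.

```python
from dataclasses import dataclass


@dataclass
class HitObservation:
    """Timestamps (seconds) recorded by a probe sitting near the server."""
    first_request_packet: float  # first packet of the request flow
    last_reply_packet: float     # last packet of the reply flow


def estimate_rtt(syn_ack_time: float, ack_time: float) -> float:
    """Approximate probe-to-client RTT from a handshake seen at the probe.

    Assumption: the probe is close to the server, so the delay between the
    server's SYN/ACK and the client's ACK is dominated by the network path.
    """
    return ack_time - syn_ack_time


def client_hit_time(obs: HitObservation, rtt: float) -> float:
    """Estimate hit time at the client: observed time plus RTT/2 at each end."""
    observed = obs.last_reply_packet - obs.first_request_packet
    return rtt / 2 + observed + rtt / 2


# Example: 500 ms observed at the server, 40 ms RTT to the client.
rtt = estimate_rtt(syn_ack_time=10.000, ack_time=10.040)
hit = HitObservation(first_request_packet=12.0, last_reply_packet=12.5)
estimated = client_hit_time(hit, rtt)
```

The closer the probe sits to the client, the smaller the correction should be; a fuller version would scale each half by the probe's position on the path.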

Timing diagram of a hit measurement

Allocating delays to client, network and server categories starts by calculating the duration of the request and reply flows. At this coarse level, we have a very simple network/server breakdown, but we’ll need to apply additional analytics to make it useful. While it’s a relatively safe assumption that the delay between the last packet of the client request flow and the first packet of the server reply flow should be allocated to server time, it is not appropriate to assume that the duration of the request and reply flows should be allocated to network time. Instead, when a flow’s throughput is low, we should evaluate whether this is caused by the sender or the receiver before we blame the network. For example:

  • The receiver can limit throughput by advertising a TCP window size smaller than the MSS (frequently – and crudely – identified as Win0 events).
  • The sender may be the culprit if it can’t deliver packets to the network fast enough; we sometimes refer to this as “sender starved for data.”

We can consider the remaining flow duration as network time – still a pretty broad category. To further understand network time, we would want to evaluate for packet loss and retransmission as well as TCP receive window constraints in relationship to the bandwidth-delay product (BDP).
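The allocation described above can be sketched roughly as follows. This is an illustration under stated assumptions, not a description of any product's algorithm: the `Flow` class, its field names, and the idea that the probe has already measured how long each flow was sender-limited or receiver-limited are all hypothetical. The mapping of those stalls to client or server follows from flow direction (the client sends the request flow; the server sends the reply flow).

```python
from dataclasses import dataclass


@dataclass
class Flow:
    """One direction of a hit, with timestamps and stall times in seconds."""
    start: float
    end: float
    sender_limited: float = 0.0    # sender could not supply data fast enough
    receiver_limited: float = 0.0  # receiver advertised too small a window

    @property
    def network_time(self) -> float:
        """Flow duration not explained by sender or receiver stalls."""
        duration = self.end - self.start
        return max(0.0, duration - self.sender_limited - self.receiver_limited)


def allocate_delays(request: Flow, reply: Flow) -> dict:
    """Split a hit into client, network, and server time."""
    # Gap between last request packet and first reply packet: server time.
    server_processing = reply.start - request.end
    return {
        # Client is the request sender and the reply receiver.
        "client": request.sender_limited + reply.receiver_limited,
        "network": request.network_time + reply.network_time,
        # Server is the request receiver and the reply sender.
        "server": server_processing + request.receiver_limited + reply.sender_limited,
    }


request = Flow(start=0.0, end=0.1)
reply = Flow(start=0.4, end=1.0, sender_limited=0.2, receiver_limited=0.1)
breakdown = allocate_delays(request, reply)
```

In this example the three categories add up to the full one-second hit, so nothing is double-counted or left unassigned.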

A more sophisticated analysis would include tests for additional less-common behaviors such as Nagle, application windowing, and TCP slow start; you can read detailed discussions on these and other network-visible performance bottlenecks in the eBook Network Application Performance Analysis.
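The receive-window test against the bandwidth-delay product (BDP) mentioned earlier is simple arithmetic: the BDP is the path bandwidth multiplied by the RTT, and a window smaller than the BDP caps throughput at roughly one window per round trip. A small sketch follows; the function names are hypothetical, and the path bandwidth is assumed to be known or configured rather than measured.

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product in bytes for a path."""
    return bandwidth_bps * rtt_s / 8


def window_limited_throughput_bps(window_bytes: int, rtt_s: float) -> float:
    """Upper bound on throughput when at most one window is sent per RTT."""
    return window_bytes * 8 / rtt_s


def window_constrains_flow(window_bytes: int, bandwidth_bps: float, rtt_s: float) -> bool:
    """True if the advertised window is too small to fill the path."""
    return window_bytes < bdp_bytes(bandwidth_bps, rtt_s)


# Example: 100 Mbps path, 50 ms RTT, 65,535-byte window (no window scaling).
path_bdp = bdp_bytes(100e6, 0.050)
ceiling = window_limited_throughput_bps(65535, 0.050)
constrained = window_constrains_flow(65535, 100e6, 0.050)
```

Here the BDP is about 625,000 bytes, so a 65,535-byte window holds the flow to roughly 10.5 Mbps no matter how healthy the network is.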

Application insight: hit decodes

So far, I’ve described an analysis based primarily on packet timings, with some simple TCP header decoding. This might be enough if we’re only interested in analyzing delays for a single hit – for example, using a protocol analyzer. But for a monitoring solution, reporting at this level would be quite uninteresting; all hits for a given TCP session are treated equally.

A very uninteresting application performance report

Effective application insights will require decoding the hit – the request made by the client – so that measurements can be separated into hit-specific buckets. HTTP is a simple example; the hit is the URL or the filename requested by the client browser. Other application protocols might use an encoded id that maps to a more meaningful name. Look what happens to our report:

A deceptively interesting application performance report
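To make the bucketing concrete, here is a minimal sketch for the HTTP case. It assumes the probe hands over the raw bytes of each request along with its measured hit time; the function names and the `(undecoded)` bucket are hypothetical, and a real decoder would handle far more than a well-formed request line.

```python
from collections import defaultdict
from statistics import mean
from typing import Optional
from urllib.parse import urlsplit


def decode_http_hit(request_payload: bytes) -> Optional[str]:
    """Return the requested path from an HTTP/1.x request, or None."""
    request_line = request_payload.split(b"\r\n", 1)[0].decode("latin-1")
    parts = request_line.split(" ")
    if len(parts) != 3 or not parts[2].startswith("HTTP/"):
        return None
    # Drop the query string so variants of one page share a bucket.
    return urlsplit(parts[1]).path


def report_by_hit(hits):
    """Group (payload, hit_seconds) pairs into per-URL count and average."""
    buckets = defaultdict(list)
    for payload, seconds in hits:
        name = decode_http_hit(payload) or "(undecoded)"
        buckets[name].append(seconds)
    return {name: {"count": len(t), "avg": mean(t)} for name, t in buckets.items()}


observed = [
    (b"GET /index.jsp HTTP/1.1\r\nHost: example\r\n\r\n", 0.8),
    (b"GET /img/logo.png?v=3 HTTP/1.1\r\nHost: example\r\n\r\n", 0.1),
    (b"GET /index.jsp HTTP/1.1\r\nHost: example\r\n\r\n", 1.2),
]
report = report_by_hit(observed)
```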

At this point, we have what appears to be great value; I can identify and isolate delays for all hits. But let’s dig a little deeper by asking a couple of leading questions:

  • How should you define service quality? By hit performance? Or by end-user experience? Can you prove acceptable EUE with the report snippet above?
  • How much more value would this information have if you could associate each hit with a username? Or the inverse; without insight into the username, could you isolate the cause of a user’s complaint?

User insight: multi-hit EUE

The answer to the first question above should be EUE. (That’s why I called it a leading question.) To accomplish this, the framework needs to apply more sophisticated application-specific decode algorithms that group a series of hits into a single meaningful user transaction. And while not all applications employ multi-hit transactions, many (or most) do, including web pages, Oracle Forms, SAP, Microsoft Exchange, and database clients. Without this grouping, there is often no direct correlation between hit-level performance and EUE.

The answer to the second question is similarly obvious; the framework’s decode logic should know where to extract the username for different protocols. It will also need the user’s session – because often, the username may only appear in the session initiation, or only on some hits and not others.
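A small sketch of that session-based association, assuming each decoded hit is a dictionary with a `session` key and an optional `username` key (both names are hypothetical): the username is learned from whichever hits carry it and then applied to every hit in the same session.

```python
def attach_usernames(hits: list) -> list:
    """Copy the username found on any hit to all hits in the same session."""
    session_user = {}
    # First pass: learn the username for each session from hits that carry it.
    for hit in hits:
        if hit.get("username"):
            session_user.setdefault(hit["session"], hit["username"])
    # Second pass: label every hit; sessions with no username stay None.
    return [{**hit, "username": session_user.get(hit["session"])} for hit in hits]


decoded = [
    {"session": "s1", "url": "/report"},
    {"session": "s1", "url": "/login", "username": "alice"},
    {"session": "s2", "url": "/home"},
]
labeled = attach_usernames(decoded)
```

Doing this in two passes means hits seen before the login are labeled too, at the cost of waiting until the session's hits have been collected.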

Once we’ve grouped a series of hits into a single user transaction, we should extend the transaction’s delay allocation by allocating the time between the completion of one hit and the beginning of the next hit to client time, arriving at a measurement of EUE with a client/network/server (CNS) performance model suited for single-threaded transactions.
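For a single-threaded transaction, that roll-up might look like the sketch below. It assumes each hit already carries its own network and server allocation in seconds; the `Hit` class and `transaction_cns` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    start: float        # seconds
    end: float
    network: float      # network time already allocated to this hit
    server: float       # server time already allocated to this hit
    client: float = 0.0 # client time already allocated to this hit


def transaction_cns(hits):
    """Roll sequential hits up into EUE with a client/network/server split."""
    ordered = sorted(hits, key=lambda h: h.start)
    # Idle time between one hit ending and the next starting goes to the client.
    gaps = sum(max(0.0, nxt.start - prev.end)
               for prev, nxt in zip(ordered, ordered[1:]))
    return {
        "eue": ordered[-1].end - ordered[0].start,
        "client": gaps + sum(h.client for h in ordered),
        "network": sum(h.network for h in ordered),
        "server": sum(h.server for h in ordered),
    }


page = [
    Hit(start=0.0, end=0.5, network=0.2, server=0.3),
    Hit(start=0.7, end=1.0, network=0.1, server=0.2),
]
cns = transaction_cns(page)
```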

A multi-hit transaction representing EUE; delays between hits are associated with the client

A more sophisticated analysis would be appropriate for multi-threaded applications – primarily web pages. These permit parallel requests on separate TCP connections, and while the EUE measurement algorithms may not change, the CNS delay allocation may.
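One possible way to adapt the client-time allocation to parallel hits, offered as an assumption rather than an established method, is to merge overlapping hit intervals first and treat only the time when no hit is in flight as client time. The sketch below covers just that step; how to split overlapping network and server time between concurrent hits is a separate question it does not attempt to answer.

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) intervals into disjoint busy periods."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous busy period: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def client_idle_time(intervals):
    """Time within the transaction when no hit was in flight."""
    merged = merge_intervals(intervals)
    span = merged[-1][1] - merged[0][0]
    busy = sum(end - start for start, end in merged)
    return span - busy


# Two hits overlap on separate connections; a third starts after a pause.
parallel_hits = [(0.0, 0.5), (0.2, 0.6), (0.9, 1.2)]
busy_periods = merge_intervals(parallel_hits)
idle = client_idle_time(parallel_hits)
```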

Application performance analytics: paths to more insight

Additional metadata

Beyond performance timings and delay allocation, there are other insights to be gleaned from such a network-centric framework. We already discussed one of the most important – the username. Additional examples include application errors and browser or client version. And while both of these can provide statistics as standalone metrics, they become much more valuable when associated with specific users and transactions; otherwise, their troubleshooting value is severely limited.

Applying the framework

With the ability to track EUE for all users and for all transactions, we should now be able to realize significant value through some simple reporting; I’ll list six examples:

  • Understand current performance specific to a site, a link, a server, a transaction, an application
  • Examine meaningful trends or patterns in performance
  • Respond to problems when performance thresholds are approached or exceeded
  • Prioritize problems based on their severity and the number of affected users
  • Validate and diagnose individual user complaints using EUE as a common problem definition
  • Isolate client, network and server fault domains
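As one illustration of the threshold and prioritization items in the list above, the sketch below ranks transactions by how many distinct users exceeded an EUE threshold, breaking ties by the worst observed time. The input shape, the function name, and the single fixed threshold are all simplifying assumptions.

```python
from collections import defaultdict


def prioritize(measurements, threshold_s):
    """Rank transactions by affected users, then by worst EUE.

    measurements: iterable of (username, transaction_name, eue_seconds).
    Returns a list of (transaction_name, affected_user_count, worst_eue).
    """
    affected = defaultdict(set)
    worst = defaultdict(float)
    for user, transaction, eue in measurements:
        if eue > threshold_s:
            affected[transaction].add(user)
            worst[transaction] = max(worst[transaction], eue)
    ranked = sorted(affected,
                    key=lambda t: (len(affected[t]), worst[t]),
                    reverse=True)
    return [(t, len(affected[t]), worst[t]) for t in ranked]


samples = [
    ("alice", "/checkout", 6.0),
    ("bob", "/checkout", 5.5),
    ("alice", "/search", 9.0),
    ("carol", "/home", 1.0),
]
problems = prioritize(samples, threshold_s=4.0)
```

Counting distinct users rather than slow hits keeps one unlucky user with many retries from outranking a problem that touches many people.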

Framework extensions

While this brings us to the end of my simple framework abstraction, there will always be demands to support new environments, decode new protocols, and deliver new insights. Here are a few top-of-mind examples:

  • SSL decryption. As encrypted traffic becomes increasingly common, the ability to decrypt SSL traffic is critical to achieving probe-based EUE insights.
  • Request pipelining. Similar in effect to the parallel connections common in HTTP, pipelining places multiple outstanding requests on a single TCP connection, requiring more sophisticated request/response tracking on the network. HTTP/2 standardizes a related capability, multiplexing concurrent streams over one connection.
  • Asynchronous message queuing, where the sender and receiver interact with a message queue independently.
  • Multi-tier analysis. Correlation of EUE with middleware and backend transaction performance.

Application performance analytics: start at the end

It’s important, of course, to decide what we want the framework to accomplish – and to do this up front; these goals will dictate the analytic depth required. I’ve made an assumption that there are two primary objectives:

  1. Insight into EUE for a business perspective of IT service delivery – not solely an IT-centric view.
  2. Correlation of application and network metrics with metadata to speed troubleshooting.

On the other hand, you could also choose a simpler path, avoiding some analytic complexity; some commercial products fit this model by gathering raw wire data while asking the user to do the heavy lifting. With the advent of ITOA solutions, this might seem tempting – but the results may be both costly and disappointing. I’ll conclude with one more leading question: Do you believe it is appropriate to assume that there will be some magic big data analytics engine ready to consume, process, and correlate the raw data to deliver actionable answers? (Here’s a hint to my answer.)