One thing I learned – or more accurately, had reinforced – from the many comments on this blog series is that there are often subtle differences in the implementation of various TCP features and specifications; TCP slow-start and congestion avoidance are good examples, as is the retransmission of dropped packets (and even the Nagle algorithm). I have purposely tried to abstract the minutiae of the algorithms, describing instead their important indicators or “typical” characteristics, from which most permutations and anomalies should be recognizable. We can sometimes get caught up in precision (we have nanosecond timestamps!) when estimation is good enough, and not everyone cares to join the discussion on the relative merits of the Reno and Tahoe TCP implementations. (I personally might enjoy such a discussion, especially over a glass of wine.)
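For readers who want a concrete picture of those “typical” characteristics, here is a minimal sketch of textbook congestion window growth: roughly doubling per round trip during slow-start, then adding about one segment per round trip during congestion avoidance. It is deliberately simplified (no specific stack behaves exactly this way), and the MSS, threshold, and starting window are illustrative values of my own choosing.

```python
# A deliberately simplified model of "typical" TCP congestion window growth:
# exponential growth during slow-start, then roughly linear growth during
# congestion avoidance. No specific stack behaves exactly this way; the MSS,
# ssthresh, and initial window below are illustrative values only.

MSS = 1460            # assumed maximum segment size, in bytes
ssthresh = 16 * MSS   # assumed slow-start threshold

cwnd = 1 * MSS        # start with one segment (many modern stacks start larger)
for rtt in range(1, 11):
    if cwnd < ssthresh:
        cwnd *= 2     # slow-start: cwnd roughly doubles each round trip
    else:
        cwnd += MSS   # congestion avoidance: roughly one segment per round trip
    print(f"After RTT {rtt:2d}: cwnd ≈ {cwnd // MSS} segments")
```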

The approach I’ve used to introduce the framework and describe the analyses in the blog series is fundamentally tool-agnostic; some products make it (much) easier, and your familiarity with (or access to) a particular tool will generally dictate your starting point and define your frame of reference. I, of course, use Dynatrace’s Transaction Trace Analysis, a.k.a. ApplicationVantage, which is specifically designed to analyze application performance; its application decodes, network delay analysis, graphics and automated expertise make quick work of identifying and proving performance bottlenecks. (If this provides me an unfair advantage, I’m not complaining.) I also use Wireshark when I need better insight into packet-level details; in fact, Transaction Trace integrates with Wireshark for this purpose.
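To illustrate the kind of packet-level question I typically take to Wireshark, here is a rough Python sketch, using the scapy library (my own choice here, not part of the Transaction Trace integration), that flags likely retransmissions by watching for repeated sequence numbers carrying payload within a flow. Wireshark’s tcp.analysis.retransmission flag is more thorough; this just conveys the idea, and the trace file name is hypothetical.

```python
# Rough sketch: flag likely retransmissions in a capture by looking for
# repeated (flow, sequence number) pairs that carry payload. This only
# approximates what Wireshark's tcp.analysis.retransmission flag reports.
from scapy.all import rdpcap, IP, TCP  # requires scapy (pip install scapy)

def likely_retransmissions(path):
    seen = set()
    retrans = []
    for pkt in rdpcap(path):
        if not (pkt.haslayer(IP) and pkt.haslayer(TCP)):
            continue
        if len(pkt[TCP].payload) == 0:
            continue  # ignore pure ACKs and other empty segments
        key = (pkt[IP].src, pkt[IP].dst, pkt[TCP].sport, pkt[TCP].dport, pkt[TCP].seq)
        if key in seen:
            retrans.append(pkt)   # same flow and sequence number seen before
        else:
            seen.add(key)
    return retrans

# Example usage with a hypothetical filtered trace file:
# print(len(likely_retransmissions("filtered_trace.pcap")), "likely retransmissions")
```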

It can be helpful to categorize the analyses by summarizing the framework in tabular views. Realize that many of the behaviors and associated bottlenecks are interdependent, straddling two (or more) categories and complicating a simple “client/network/server” (CNS) breakdown. Nevertheless, it is still useful to place each of the 12 potential constraints into a “primary bucket,” following an extended CNS paradigm, as one way to visualize the framework.

First, a CNSAP breakdown:

Client
  • Client Processing
  • Receiver Flow Control (Window 0)
Network
  • Bandwidth
  • Congestion
  • Packet loss
Server
  • Server Processing
  • Starved for data
Application
  • Chattiness
  • Application windowing
Protocol (TCP)
  • TCP slow-start
  • Nagle algorithm
  • TCP Window (BDP)
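If you like to keep a taxonomy like this close to your analysis notes or scripts, the same breakdown can be expressed as a small lookup structure. This is simply a convenience sketch of my own; the bucket and constraint names mirror the table above and don’t come from any particular tool.

```python
# The CNSAP breakdown above as a small lookup structure, handy for tagging
# findings in analysis notes or scripts. Names simply mirror the table.
CNSAP = {
    "Client":         ["Client Processing", "Receiver Flow Control (Window 0)"],
    "Network":        ["Bandwidth", "Congestion", "Packet loss"],
    "Server":         ["Server Processing", "Starved for data"],
    "Application":    ["Chattiness", "Application windowing"],
    "Protocol (TCP)": ["TCP slow-start", "Nagle algorithm", "TCP Window (BDP)"],
}

def primary_bucket(constraint):
    """Return the primary CNSAP bucket for a named constraint, if any."""
    for bucket, constraints in CNSAP.items():
        if constraint.lower() in (c.lower() for c in constraints):
            return bucket
    return None

# Example: primary_bucket("Nagle algorithm") -> "Protocol (TCP)"
```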

Another helpful way to categorize performance problems is by their relationship, or their sensitivity, to the physical application delivery infrastructure; this physical infrastructure can be abstracted to processing nodes (client and server), network bandwidth, and network latency. By suggesting that an application is “sensitive” to a particular physical constraint, we mean that application performance is more likely to be impacted by fluctuations in that constraint, or by behaviors that closely interact with the constraint.

If your physical environment can be characterized by one of the conditions below, then your application is more likely to exhibit performance constraints related to the items listed under that condition:
Limited bandwidth
  • Bandwidth (serialization)
  • Congestion
  • Packet loss
High latency (RTT)
  • TCP window (BDP)
  • Application chattiness
  • Application windowing
  • TCP slow-start
  • Nagle
Limited node processing
  • Client processing
  • Server processing
  • Starved for data
  • TCP Window 0
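To make the latency sensitivity above a bit more concrete, here is a quick bandwidth-delay product check showing why a high-RTT path makes the TCP window a likely constraint. The link speed, round-trip time, and window size are illustrative values, not figures from the blog series.

```python
# Quick bandwidth-delay product (BDP) check with illustrative values.
link_bps = 10_000_000      # 10 Mbps path
rtt_seconds = 0.080        # 80 ms round-trip time

bdp_bytes = link_bps * rtt_seconds / 8
print(f"BDP = {bdp_bytes:,.0f} bytes")   # 100,000 bytes of data "in flight" to fill the path

# A window smaller than the BDP caps throughput at roughly window / RTT:
window_bytes = 65_535      # classic unscaled TCP receive window
max_bps = window_bytes * 8 / rtt_seconds
print(f"Max throughput with a 64 KB window ≈ {max_bps / 1e6:.1f} Mbps")  # ≈ 6.6 Mbps
```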

One final tip that technically belongs in each of the blog entries: look for consistency. It is always worthwhile to capture and evaluate multiple samples of a problem; this practice can help you avoid chasing red herrings and focus your efforts on consistent and meaningful performance constraints. On the other hand, if you are troubleshooting intermittent slowdowns, anomalies may be important; capturing a fleeting problem may in fact be the most difficult part of the assignment.

Like many of you, I have a keen interest in (and derive rather nerdy pleasure from) analyzing trace files; unlike many of you, my expertise is somewhat narrowly focused on performance analysis. Understanding that privacy concerns often prevent sharing trace files, I still make the conditional offer to look at a trace file or two, perhaps to provide a second opinion. The conditions? The trace should be a good (preferably filtered) representation of a performance concern specific to an end user (ideally across a WAN), and I offer no service-level guarantee, since I’ll be looking at these outside of normal business hours. You can reach me on Twitter (@gkaiser); if you don’t have a Twitter account and are interested, email me at gary.kaiser@dynatrace.com.

I’m also promising to deliver the blog content – with some additional supporting background and detail – in eBook format. (I figure that putting this in writing increases the pressure to deliver quickly.) I’ll post an update when it’s available; give me a few weeks.

Finally, let’s keep in mind the bigger picture, one that will ultimately impact the way we do our jobs. APM and AANPM solutions are becoming more sophisticated and more capable of detecting and diagnosing performance problems for you (and someday will begin to take corrective actions). Raw trace file analysis as a discipline will change (some of you may reminisce about the days of analyzing operating system crashes via hex dumps), replaced in part by expert systems that analyze packet streams in real time, or that can post-process trace files to identify performance constraints for you. To some extent these systems exist today and can reliably point out many common performance bottlenecks. That leaves the uncommon problems – the mischief caused by such gremlins as stack bugs, protocol spoofers, and application follies – for those of us who enjoy the challenges, and the wine.