Not too long ago I had an opportunity to work with a customer who was experiencing performance problems with their web-based HR application. Users at the headquarters location – about 30 milliseconds away from the data center – would occasionally experience page load times of 10 or 15 seconds – instead of the normal 2 or 3 seconds. Dynatrace Data Center Real User Monitoring (DC RUM) reported both a pattern – the problem occurred regularly, each night at around midnight – and a lack of a pattern – intermittently, the problem would occur during the workday. It was these workday disruptions that elevated the problem priority from an annoyance to one of more critical business impact.
DC RUM & Network App Performance Insight
DC RUM provided clear insight into the timing (when did the problem occur?), the user’s experience (how long did the pages actually take to load?), and the business impact (how many users were affected?). This is critical information; in this case, the users never actually complained (at least not to the help desk; what they said at the water cooler might be another story). Understanding the impact from the user perspective – page load time – and from the business perspective – how many users were affected – brought important context and priority to the IT team. DC RUM’s advanced performance analytics also automatically isolated the fault domain to the network, identifying packet loss as the underlying cause.
Squarely in the wheelhouse of the network team – and with the business owner now paying close attention – the next step was to identify the cause of the packet loss. We were able to exonerate one of the usual suspects – high link utilization – as utilization appeared quite reasonable, averaging about 40% during busy intervals. (I believe this was a 4Mbps link.)
Show Me the Data via DC RUM
Like many of you who are network professionals, my first instinct was (and remains) “let me see a trace.” These days, that can be just as easily done as said.
DC RUM supports both back-in-time trace retrieval (using a high-speed packet store on disk) and real-time contextual capture. In either case, the capture is launched from DC RUM’s Central Analysis Server (CAS), and the report context from which you launch the capture is applied as a filter. For example, you may have drilled into a view highlighting a particular user’s problem with the HR application, examining performance metrics from yesterday between 4:30 and 5:00 p.m. By launching a capture from that report level, the trace filter would use the IP address and time contexts to capture or retrieve only the pertinent packets.
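The capture mechanics are proprietary, but the contextual-filtering concept is simple to sketch. The record layout, addresses, and times below are illustrative assumptions for this sketch, not DC RUM internals:

```python
from datetime import datetime

# Illustrative packet records: (timestamp, src_ip, dst_ip).
# This layout is an assumption for the sketch, not DC RUM's storage format.
packets = [
    (datetime(2015, 6, 1, 16, 25), "10.1.1.5", "10.2.2.8"),
    (datetime(2015, 6, 1, 16, 45), "10.1.1.5", "10.2.2.8"),
    (datetime(2015, 6, 1, 16, 50), "10.1.1.9", "10.2.2.8"),
    (datetime(2015, 6, 1, 17, 10), "10.1.1.5", "10.2.2.8"),
]

def contextual_filter(packets, user_ip, start, end):
    """Keep only packets involving user_ip within [start, end)."""
    return [p for p in packets
            if start <= p[0] < end and user_ip in (p[1], p[2])]

# Apply the report's context: one user's IP, yesterday 4:30-5:00 p.m.
hit = contextual_filter(packets, "10.1.1.5",
                        datetime(2015, 6, 1, 16, 30),
                        datetime(2015, 6, 1, 17, 0))
print(len(hit))  # 1 packet matches both the user and the time window
```

The same two dimensions – address context and time context – are what the report-level drill-down supplies to the trace filter automatically.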
It’s in Our DNA
I used Dynatrace Network Analyzer (DNA) to continue the analysis, looking for proof in the form of explicit correlation and root cause. From a 14-second page load captured in the trace, I chose to examine a single representative (problematic) page component – a long-running Java server page. This thread (in DNA terms) took about 3 seconds to download its small (57KB) payload. DNA’s Transaction Expert report supported DC RUM’s fault domain with more transaction-specific details; if there were any doubts about the conclusion, these were quickly put to rest.
A Picture is Worth a Thousand Dropped Packets
I then went for a favorite visualization – the Bounce Diagram, which is essentially a graph of the underlying packet trace with network delays included. The problem jumped out quite clearly; even though the thread began with a new TCP connection (SYN, SYN/ACK, ACK), the typical TCP slow start pattern was missing. Instead, the server sent a quick burst of 25 packets before pausing for an acknowledgement. This was followed by a painfully slow sequence of packet retransmissions. In fact, of the initial burst of 25 packets, only the first 16 were delivered; the remaining nine were (presumably) discarded by the router and had to be retransmitted.
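That loss pattern is consistent with a simple drop-tail queue model: a burst arrives far faster than the WAN link can drain it, the router buffers what it can, and discards the rest. The queue capacity of 16 packets below is an inference from the trace, not a measured router setting:

```python
def drop_tail(burst_size, queue_capacity):
    """Model a burst arriving much faster than the link drains:
    the first queue_capacity packets are queued, the rest are discarded."""
    delivered = min(burst_size, queue_capacity)
    dropped = burst_size - delivered
    return delivered, dropped

# The trace showed a 25-packet burst with 16 delivered and 9 lost,
# consistent with a router buffer holding roughly 16 packets.
print(drop_tail(25, 16))  # (16, 9)
```

Slow start exists precisely to probe for this buffer limit gradually instead of slamming into it with a full burst.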
Without the benefit of TCP slow start – or, more accurately, with a very large initial congestion window (CWND) configured – we stand to lose one of TCP’s key congestion control benefits, and this was quite evidently the case here.
There’s another interesting behavior evident that contributed to the lengthy download; the server retransmitted only one packet at a time, waiting for each to be acknowledged before sending the next. Since a receiver commonly ACKs a lone packet only after the delayed ACK timeout expires (typically about 200 milliseconds), each of the nine retransmitted packets incurred this additional delay.
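A back-of-envelope calculation shows how those nine one-at-a-time retransmissions dominated the roughly 3-second component download. The 60-millisecond round-trip time comes from the trace; the 200-millisecond delayed-ACK timer is a typical value, and exact timers vary by TCP stack:

```python
RTT = 0.060          # round-trip time in seconds, from the trace
DELAYED_ACK = 0.200  # typical delayed-ACK timer in seconds (stack-dependent)
lost_packets = 9

# One packet per round trip, each ACK held up by the delayed-ACK timer:
recovery = lost_packets * (RTT + DELAYED_ACK)
print(f"{recovery:.2f} s")  # ≈ 2.34 s of the ~3 s component download
```

In other words, the loss recovery alone accounts for the bulk of the 3-second thread, leaving only a few hundred milliseconds for the actual data transfer.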
A Simple Problem with Multiple Solutions
Investigating the network topology a bit further, we found that there was an application delivery controller (ADC) appliance in front of the web server farm. Like many similar devices these days, it exposes TCP parameters that can be tuned for performance; sure enough, one of these optimization settings permits a very large initial CWND. My suggestion – to reduce the initial CWND towards a more “normal” value of 2 – was met with healthy skepticism, supported by a nice blog about Faster Web vs. TCP Slow-Start.
I did a little more investigation; the HR application was, in fact, doing a good job of reusing TCP connections, limiting the potential impact from implementing a more conservative TCP slow-start. And the relatively low WAN latency – 60 milliseconds round-trip – would limit the TCP turn delays incurred by slow-start. So even for a larger page component – one that might incur six or eight round-trips during slow-start’s exponential ramp stage – the impact of slow-start would be just a few hundred milliseconds. That barely perceptible performance penalty – incurred only at the start of a new TCP connection – would be more than offset by eliminating the extraordinary page load times and the frustration of inconsistent performance.
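Those round-trip estimates can be sketched. The 1460-byte MSS and initial congestion window of 2 segments are assumptions for illustration; the model ignores server think time and assumes the exponential ramp continues uninterrupted:

```python
import math

def slow_start_rtts(payload_bytes, mss=1460, initial_cwnd=2):
    """Round trips to deliver a payload during slow start's exponential
    ramp (cwnd doubles each RTT), ignoring server think time and loss."""
    segments = math.ceil(payload_bytes / mss)
    cwnd, sent, rtts = initial_cwnd, 0, 0
    while sent < segments:
        sent += cwnd   # send a full window this round trip
        cwnd *= 2      # exponential growth during slow start
        rtts += 1
    return rtts

RTT = 0.060  # seconds, from the trace
for size in (57_000, 500_000):
    n = slow_start_rtts(size)
    print(f"{size} bytes: {n} round trips ≈ {n * RTT * 1000:.0f} ms")
```

Under these assumptions, the 57KB component needs 5 round trips (about 300 ms at 60 ms RTT), and even a much larger 500KB component needs only 8 (about 480 ms) – consistent with the “few hundred milliseconds” estimate above.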
Once a problem is well-understood, there are often multiple approaches to addressing – or mitigating – its impact. For this case, ensuring persistent TCP connections and extending connection timeout values would contribute to longer-lived TCP connections, reducing the frequency of the problem. (The HR app was quite well-behaved in this category.) Increasing the router queue depth could reduce discards, as might setting a high-priority queue for the HR application traffic. Increasing bandwidth would limit the likelihood of packet discards from large initial bursts; note that the 40% link utilization value I mentioned earlier was an average over a relatively long timeframe of 5 minutes. Changing the client ACK frequency to 1 (acknowledging every packet, effectively eliminating any impact from the delayed ACK timer) wouldn’t solve the packet loss problem, but would certainly speed recovery from that loss.
When you have the luxury of knowing the characteristics of your network environment – bandwidth, latency, utilization – and control over the user device, performance can frequently be fine-tuned to deliver significant benefits. Often, default network device or server configurations assume the opposite – that you have no understanding or control over user devices, locations, or network connections – and define behavior that might be sub-optimal for a particular application’s users.
In a blog series from last year, I covered in much detail the important influences on application performance, focusing on how these are visible in network packet traces. That information – and more – is also available in the eBook Network Application Performance Analysis.