Not too long ago I had an opportunity to work with a customer who was experiencing performance problems with their web-based HR application. Users at the headquarters location – about 30 milliseconds away from the data center – would occasionally experience page load times of 10 or 15 seconds – instead of the normal 2 or 3 seconds. Dynatrace’s Network Application Monitoring (NAM) reported both a pattern – the problem occurred regularly, each night at around midnight – as well as a lack of a pattern – intermittently, the problem would occur during the workday. It was these workday disruptions that elevated the problem priority from an annoyance to one of more critical business impact.
Wire data application performance insight
NAM provided clear insight into the timing (when did the problem occur?), the user’s experience (how long did the pages actually take to load?), and the business impact (how many users are affected?). This is critical information; in this case, the users never actually complained (at least not to the help desk; what they said at the water cooler might be another story). Understanding the impact of the problem from the user perspective – page load time – and from the business perspective – how many users are affected – brought important context and priority to the IT team. NAM’s advanced performance analytics also automatically isolated the problem to the network, identifying packet loss as the network condition causing the problem.
Squarely in the wheelhouse of the network team – and with the business owner now paying close attention – the next step was to identify the cause of the packet loss. We were able to exonerate one of the usual suspects – high link utilization – as this appeared quite reasonable, averaging about 40% during busy intervals. (I believe this was a 4Mbps link.)
Show me the money
Like many of you who are network professionals, my first instinct was (and often remains) “let me see a trace.” These days, that can be just as easily done as said.
NAM supports both back-in-time trace retrieval (using a high-speed packet store on disk) as well as real-time contextual capture. For either case, the capture is launched from the NAM Server, with the report context used to launch the capture applied as a filter. For example, you may have drilled into a view highlighting a particular user’s problem with the HR application, examining performance from yesterday between 4:30 and 5:00 p.m. By launching a capture from that report level, the trace filter would use the IP address and time contexts to capture or retrieve only pertinent packets.
Using a protocol analyzer to continue the investigation, I chose to examine a single representative (problematic) page component – a long-running Java server page. This small (57KB) page component took about 3 seconds to download; there was a burst of packets followed by a series of retransmissions. Using a graphic representation of a packet trace that incorporates network delays, the problem jumped out quite clearly. Even though the page component download begins with a new TCP connection (SYN, SYN/ACK, ACK), the typical TCP slow start pattern was missing. Instead, the server sent a quick burst of 25 packets before pausing for acknowledgement. This was followed by a painfully slow sequence of packet retransmissions. In fact, from the initial burst of 25 packets, only the first 16 packets were delivered; the remaining nine were lost, presumably discarded by the router, and had to be retransmitted.
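The core of that analysis – spotting which segments in the burst were retransmitted – can be sketched in a few lines of Python. This is a simplification, assuming the trace has already been parsed into (timestamp, sequence number, payload length) tuples for a single TCP flow; a full analyzer such as Wireshark also tracks ACK state and timing before flagging a retransmission.

```python
def find_retransmissions(segments):
    """Flag segments whose starting sequence number was already sent.

    segments: list of (timestamp, seq, payload_len) tuples for one flow,
    in capture order. Returns the (timestamp, seq) of each repeat.
    This ignores partial overlaps and ACK context -- a deliberate
    simplification of what a real protocol analyzer checks.
    """
    seen = set()
    retransmissions = []
    for ts, seq, length in segments:
        if seq in seen:
            retransmissions.append((ts, seq))
        else:
            seen.add(seq)
    return retransmissions


# Hypothetical fragment of the flow described above: two original
# data segments, then the first one sent again 250 ms later.
trace = [
    (0.000, 1,    1460),
    (0.001, 1461, 1460),
    (0.250, 1,    1460),   # same seq as the first segment: retransmission
]
print(find_retransmissions(trace))
```

Running this against the real trace would surface the nine repeated sequence numbers from the initial 25-packet burst.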
Without the benefit of TCP slow start – or, more accurately, with a large initial congestion window (CWND) configured – we stand to lose one of TCP’s key flow control benefits, and this was quite evidently the case here.
There’s another interesting behavior evident that contributed to the lengthy download: the server only attempted to retransmit one packet at a time, waiting for each packet to be acknowledged before sending the next. Since a receiver commonly ACKs a single packet only after the delayed ACK timeout expires (typically about 200 milliseconds), each of the nine retransmitted packets incurred this additional delay.
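The arithmetic behind that observation is worth making explicit. With one retransmission in flight at a time, each waiting out a delayed-ACK interval, the loss-recovery penalty grows linearly with the number of lost packets – a rough back-of-the-envelope model, assuming the 200 ms timer dominates the much smaller transmission and propagation times:

```python
DELAYED_ACK_TIMEOUT = 0.200  # typical delayed-ACK timer, in seconds


def serial_retransmit_delay(lost_packets, ack_timeout=DELAYED_ACK_TIMEOUT):
    """Estimate recovery time when lost packets are retransmitted
    one at a time, each waiting a full delayed-ACK interval before
    the next can be sent."""
    return lost_packets * ack_timeout


# Nine lost packets, as observed in the trace:
print(serial_retransmit_delay(9))  # 1.8 seconds
```

Nine serially retransmitted packets at ~200 ms apiece account for roughly 1.8 of the 3 seconds spent downloading this one 57KB page component.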
Simple problem, multiple solutions
Investigating the network topology a bit further, we looked at the application delivery controller (ADC) appliance in front of the web server farm. Like many similar devices these days, it exposes TCP flow control parameters that can be tuned for performance; sure enough, one of these optimization settings permits a very large initial CWND. My suggestion – to reduce the initial CWND towards a more “normal” value of 2 – was met with healthy skepticism, supported by a nice blog about Faster Web vs. TCP Slow-Start.
I did a little more investigation; the HR application was, in fact, doing a good job of reusing TCP connections, limiting the potential impact from implementing a more conservative TCP slow-start. And the relatively low WAN latency – 60 milliseconds round-trip – would limit the TCP turn delays incurred by slow-start. So even for a much larger page component – one that might incur six or eight round-trips during slow-start’s exponential ramp stage – the impact of slow-start would be just a few hundred milliseconds. That imperceptible performance penalty – incurred only at the start of a new TCP connection – would be more than balanced by the lack of extraordinarily slow page load times and the frustration of inconsistent performance.
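That estimate is easy to verify. During slow start the congestion window (in segments) roughly doubles every round trip, so the number of round trips needed grows only logarithmically with transfer size. A minimal sketch, assuming a 1460-byte MSS and an initial window of 2 segments (the parameter values are illustrative, not taken from the customer’s configuration):

```python
def slow_start_rtts(total_bytes, mss=1460, initial_cwnd=2):
    """Count round trips to deliver total_bytes during TCP slow start,
    with the congestion window (in segments) doubling each RTT."""
    segments = -(-total_bytes // mss)  # ceiling division
    cwnd, sent, rtts = initial_cwnd, 0, 0
    while sent < segments:
        sent += cwnd   # one window's worth delivered this round trip
        cwnd *= 2      # exponential ramp
        rtts += 1
    return rtts


# The 57KB page component at 60 ms round-trip latency:
rtts = slow_start_rtts(57 * 1024)
print(rtts, rtts * 0.060)  # 5 round trips, about 0.3 seconds
```

Even this small component needs only five round trips – about 300 milliseconds at 60 ms RTT – and, with connection reuse, that cost is paid only once per TCP connection rather than once per page.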
Once a problem is well-understood, there are often multiple approaches to addressing – or mitigating – its impact. For this case, ensuring persistent TCP connections and extending connection timeout values would contribute to longer-lived TCP connections, reducing the frequency of the problem. (The HR app was already quite well-behaved in this category.) Increasing the router queue depth could reduce discards, as might setting a high-priority queue for the HR application traffic. Increasing bandwidth would limit the likelihood of packet discards from large initial bursts; note that the 40% link utilization value I mentioned earlier was an average over a relatively long timeframe of 5 minutes. Changing the client ACK frequency to 1 (acknowledging every packet, effectively eliminating any impact from the delayed ACK timer) wouldn’t solve the packet loss problem, but would certainly speed recovery from that loss.
When you have the luxury of knowing the characteristics of your network environment – bandwidth, latency, utilization – and control over the user device, performance can frequently be fine-tuned to deliver significant benefits. Often, default network device or server configurations assume the opposite – that you have no understanding or control over user devices, locations, or network connections – and define behavior that might be sub-optimal for a particular application’s users.