It was the best of load balancers, it was the worst of load balancers, it was the age of happy users, it was the age of frustrated users.
I get to see a variety of interesting network problems; sometimes these are first-hand, but more frequently now these are through our partner organization. Some are old hat; TCP window constraints on high latency networks remain at the top of that list. Others represent new twists on stupid network tricks, often resulting from external manipulation of TCP parameters for managing throughput (or shaping traffic). And occasionally – as in this example – there’s a bit of both.
Many thanks to Stefan Deml, co-founder and board member at amasol AG, Dynatrace’s Platinum Partner headquartered in Munich, Germany. Stefan and his team worked diligently and expertly with their customer to uncover – and fix – the elusive root cause of an ongoing performance complaint.
Users in North America connect to an application hosted in Germany. The app uses the SOAP protocol to request and deliver information. Users connect through a firewall and one of two Cisco ACE 30 load balancers to the first-tier WebLogic app servers.
When users connect through LB1, performance is good. When they connect through LB2, however, performance is quite poor. While the definition of “poor performance” varied depending on the type of transaction, the customer identified a 1.5MB test transaction that helped quantify the problem quite well: fast is 10 seconds, while slow is 60 seconds – or even longer.
Dynatrace DC RUM is used to monitor this customer’s application performance and user experience, alerting the IT team to the problem and quantifying the severity of user complaints. (When users complain that response time is measured in minutes rather than seconds, it’s helpful to have a solution that validates those claims with measured transaction response times.) DC RUM automatically isolated the problem to a network-related bottleneck, while proving that the network itself – as qualified by packet loss and congestion delay – was not to blame.
Time to dig a little deeper
I’ll use Dynatrace Network Analyzer – DNA, my protocol analyzer of choice – to examine the underlying behavior and identify the root cause of the problem, taking advantage of the luxury of having traces of both good and poor performing transactions. I’ll skip DNA’s top-down analysis (I’m assuming you don’t care to see yet another Client/Network/Server pie chart), and dive directly into annotated packet-level Bounce Diagrams to illustrate the problem.
(DNA’s Bounce Diagram is simply a graphic of a trace file; each packet is represented by an arrow color-coded according to packet size.)
First, the fast transaction instance:
For the fast transaction, most of the 10-second delay is allocated to server processing; the response download of 1.5MB takes about 1.7 seconds – about 7Mbps.
Here’s the same view of the slow transaction instance:
There are two distinct performance differences between the fast transaction – the baseline – and this slow transaction. First, a dramatic increase in client request time (from 175 msec. to 52 seconds!); second, a smaller but still significant increase in response download time, from 1.7 seconds to 7.7 seconds.
The MSB (most significant bottleneck)
Let’s first examine the most significant bottleneck in the slow transaction. The client SOAP request – only 3KB – takes about 52 seconds to transmit to the server, in 13 packets.
The packet trace shows the client sending very small packets, with gaps of about 5 seconds between. Examining the ACKs from LB2, we see that the TCP receive window size is unusually small; 254 bytes.
Such an unusually small window advertisement is generally a reliable indicator that TCP Window Scaling is active; without the SYN and SYN/ACK packets of the handshake, a protocol analyzer doesn’t know whether scaling is active, and is therefore unable to apply a scale factor to accurately interpret the window size field.
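To make the dependency on the handshake concrete, here is a minimal sketch of how an analyzer might pull the Window Scale shift count out of the TCP options carried in a SYN or SYN/ACK. The option layout (kind 3, length 3, one shift byte) comes from the TCP specification; the function name and the sample option bytes are my own illustrations, not taken from the customer trace.

```python
from typing import Optional


def find_window_scale(options: bytes) -> Optional[int]:
    """Return the Window Scale shift count from raw TCP options, or None if absent."""
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == 0:  # End of Option List
            break
        if kind == 1:  # No-Operation, a single padding byte
            i += 1
            continue
        if i + 1 >= len(options):
            break  # truncated option, stop parsing
        length = options[i + 1]
        if length < 2 or i + length > len(options):
            break  # malformed option, stop parsing
        if kind == 3 and length == 3:  # Window Scale option
            return options[i + 2]
        i += length
    return None


# Hypothetical SYN/ACK options: MSS 1460, NOP, Window Scale with shift 0
sample_options = bytes([2, 4, 0x05, 0xB4, 1, 3, 3, 0])
print(find_window_scale(sample_options))
```

A return value of None means the option was never offered; a return value of 0 means scaling was negotiated but this side will not scale its own advertisements.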
The customer did provide another trace that included the handshake, showing that the LB response to the client’s SYN does in fact include the Window Scaling option – with a scale factor of 0.
Odd? Not really; this simply means that LB2 will allow the client to scale its receive window, but doesn’t intend to scale its own. The initial (non-scaled) receive window advertised by the LB is 32768. (It’s interesting to note that given a scale factor of 7, a receive window value of 256 would equal 32768.)
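The arithmetic behind that parenthetical is just a left shift of the 16-bit window field by the negotiated shift count. A small sketch (the function name is mine) shows how the same field value reads very differently depending on which scale factor the receiver thinks is in effect:

```python
def effective_window(window_field: int, shift: int) -> int:
    """Receive window in bytes for a 16-bit window field and a scale shift count."""
    if not 0 <= window_field <= 0xFFFF:
        raise ValueError("window field must fit in 16 bits")
    if not 0 <= shift <= 14:
        raise ValueError("TCP limits the shift count to 14")
    return window_field << shift


print(effective_window(32768, 0))  # initial advertisement, no scaling: 32768
print(effective_window(256, 7))    # same window if a shift of 7 were applied: 32768
print(effective_window(254, 0))    # what the client must assume with shift 0: 254
print(effective_window(254, 7))    # what a shift of 7 would have meant: 32512
```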
Once a few packets have been exchanged on the connection, however, LB2 abruptly reduces its receive window from 32768 to 254 – even though the client has sent only a few hundred bytes. This is clearly not a result of the TCP socket’s buffer space filling up. Instead, it’s as if LB2 suddenly shifts to a non-zero scale factor (perhaps that factor of 7 I just suggested), even though it has already established a scale factor of zero.
Pop quiz: What to do with tiny windows?
Question: what should a TCP sender do when the peer TCP receive window falls below the MSS?
Answer: The sender should wait until the receiver’s window increases to at least the MSS.
In practice, this means the sender waits for the receiver to empty its buffer. Given a receiver that is slow to read data from its buffer – and therefore advertises a small window of less than the MSS – it would be silly for the sender to send tiny packets just to fill the remaining space. In fact, this undesirable behavior is called the silly window syndrome, avoided through algorithms built into TCP.
For this reason, protocol analyzers and network probes should treat the occurrence of small (<MSS) window advertisements the same as zero window events, as they have the same performance impact.
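As a rough illustration of that recommendation, a probe or analyzer could bucket each window advertisement as follows. This is a sketch only: the category names are mine, and the MSS of 1460 bytes is an assumed value typical of Ethernet paths, not one taken from this trace.

```python
def classify_window(advertised_bytes: int, mss: int) -> str:
    """Bucket a (scaled) receive window advertisement relative to the MSS."""
    if advertised_bytes == 0:
        return "zero"
    if advertised_bytes < mss:
        return "small"  # non-zero, but too small for a full-sized segment
    return "open"


def stalls_sender(advertised_bytes: int, mss: int) -> bool:
    """Treat small windows like zero windows: both hold back a well-behaved sender."""
    return classify_window(advertised_bytes, mss) != "open"


ASSUMED_MSS = 1460  # assumption, not measured
print(classify_window(254, ASSUMED_MSS), stalls_sender(254, ASSUMED_MSS))
print(classify_window(32768, ASSUMED_MSS), stalls_sender(32768, ASSUMED_MSS))
```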
When a receiver’s window is at zero for an extended period, a sender will typically send a window probe packet attempting to “wake up” the receiver. Of course, since the window is zero, no usable payload accompanies this window probe packet. In our example, the window is not zero, but the sender behavior is similar; the client waits about five seconds, then sends a small packet with just enough data (254 bytes) to fill the advertised window. The ACK is immediate (the LB’s ACK frequency is 1), but the advertised window remains abnormally small. We can conclude that the LB believes it is advertising a full 32KB buffer, although it is telling the client something much different.
After about 52 seconds, the 3K request reaches LB2, after which application processing occurs normally. It’s a good thing the request size wasn’t 30K!
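A back-of-the-envelope model shows why a larger request would have been so painful. The sketch below assumes the request is exactly 3 × 1024 bytes, that every packet carries exactly one 254-byte window, and that every gap is a uniform 5 seconds; none of those is exact in the real trace, so treat the result as an order-of-magnitude estimate rather than a reproduction of the measured 52 seconds.

```python
import math


def stalled_send_estimate(request_bytes: int, window_bytes: int, gap_seconds: float):
    """Estimate packet count and elapsed time when each packet fills a tiny window."""
    packets = math.ceil(request_bytes / window_bytes)
    # The first packet goes out immediately; each later one waits for a gap.
    elapsed = (packets - 1) * gap_seconds
    return packets, elapsed


print(stalled_send_estimate(3 * 1024, 254, 5.0))   # (13, 60.0)
print(stalled_send_estimate(30 * 1024, 254, 5.0))  # (121, 600.0)
```

Under these assumptions a 30K request would stall for roughly ten minutes.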
The NSB (next significant bottleneck)
As is quite common, there’s another tuning opportunity – the NSB. This is highlighted by DC RUM’s metric called Server Realized Bandwidth, or download rate. The fast transaction transfers 1.5MB in about 1.6 seconds (7.5Mbps), while the slow transaction takes about 8 seconds for the same payload (1.5Mbps).
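The download rates quoted here are simple payload-over-time arithmetic. A quick sketch, assuming 1.5MB means 1.5 million bytes (the article does not say whether the figure is decimal or binary):

```python
def throughput_mbps(payload_bytes: int, seconds: float) -> float:
    """Average transfer rate in megabits per second."""
    return payload_bytes * 8 / seconds / 1_000_000


PAYLOAD = 1_500_000  # assumption: 1.5MB taken as decimal megabytes

print(throughput_mbps(PAYLOAD, 1.6))  # fast transaction: 7.5
print(throughput_mbps(PAYLOAD, 8.0))  # slow transaction: 1.5
```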
Could this be receiver flow control, or a small configured receive TCP window? These would seem reasonable theories – except that we’re using the same client for the tests. A quick look at the receiver’s TCP window proves this is not the case, as it remains at 131,072 bytes (512 with a scale factor of 8).
DNA’s Timeplot can graph a sender’s TCP Payload in Transit; comparing this with the receiver’s advertised TCP window can quickly prove – or disprove – a TCP window constraint theory.
The maximum payload in transit for the slow transaction is about 32KB; given that the client’s receive window is much larger, we know that the client is not limiting throughput.
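For readers without DNA at hand, the same check can be approximated from any packet trace. The sketch below is not how DNA computes its Timeplot; it is a simplified stand-in that tracks the highest byte sent and the highest byte acknowledged, ignoring retransmissions, SACK, and sequence number wraparound. The event tuples and the 90% threshold are my own conventions.

```python
from typing import List, Sequence, Tuple


def payload_in_transit(events: Sequence[Tuple], initial_seq: int) -> List[int]:
    """Return unacknowledged bytes after each event in a one-direction transfer.

    Events are ("data", seq, length) from the sender or ("ack", ack_number)
    from the receiver, in capture order.
    """
    highest_sent = initial_seq
    highest_acked = initial_seq
    samples = []
    for event in events:
        if event[0] == "data":
            _, seq, length = event
            highest_sent = max(highest_sent, seq + length)
        elif event[0] == "ack":
            highest_acked = max(highest_acked, event[1])
        samples.append(highest_sent - highest_acked)
    return samples


def receiver_limited(samples: Sequence[int], receive_window: int) -> bool:
    """True if the peak payload in transit comes close to the advertised window."""
    return max(samples) >= 0.9 * receive_window  # 90% is an arbitrary threshold


# Hypothetical four-event trace with 1460-byte segments
trace = [("data", 1000, 1460), ("data", 2460, 1460), ("ack", 2460), ("data", 3920, 1460)]
print(payload_in_transit(trace, 1000))  # [1460, 2920, 1460, 2920]

# The article's numbers: a peak of about 32KB against a 131,072-byte window
print(receiver_limited([32768], 131072))  # False
```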
Let’s compare this with the fast transaction as it ramps up exponentially through TCP slow start:
It becomes clear that LB1 does not limit send throughput – bytes in flight – to 32KB, instead allowing the transfer to make more efficient use of the available bandwidth. We can conclude that some characteristic of LB2 is artificially limiting throughput.
Fixing the problems
For the MSB (most significant bottleneck), Cisco has identified a workaround (even if they might have slightly misstated the actual problem):
CSCud71628—HTTP performance across ACE is very bad. Packet captures show that ACE drops the TCP Window Size it advertises to the client to a very low value early in the connection and never recovers from this. Workaround: Disable the “tcp-options window-scale allow”.
For the NSB (next significant bottleneck), the LB configuration defaults to a TCP send buffer value of 32768 bytes. Changing the set tcp buffer-share parameter from the default 32768 to 262143 (the maximum permitted value) allowed LB2 throughput to match that of LB1.
Wait; do you see the contradiction here? If we disable TCP window scaling, that would limit the effective TCP window to 65535 bytes, limiting the download transfer rate to about 4Mbps (given the existing link’s 130ms round-trip delay).
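The ceiling comes from the fact that a sender can have at most one window of data outstanding per round trip. A short sketch of that calculation, using the 130ms round-trip delay quoted above and the three buffer sizes that appear in this story (the function name is mine):

```python
def window_limited_mbps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on throughput when one window is delivered per round trip."""
    return window_bytes * 8 / rtt_seconds / 1_000_000


RTT = 0.130  # round-trip delay from the article, in seconds

for label, window in [("32768 default buffer", 32768),
                      ("65535 without scaling", 65535),
                      ("262143 maximum buffer", 262143)]:
    print(f"{label}: {window_limited_mbps(window, RTT):.2f} Mbps")
```

The 32768-byte case works out to roughly 2Mbps, in the same range as the slow transaction’s measured download rate, while the unscaled maximum lands at roughly 4Mbps.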
But this was the spring of hope; it seems that changing the tcp buffer-share parameter also solved the window scaling problem, without having to disable that option. This suggests a less-than-obvious interaction between these parameters – but with happy users, we’ll take that bit of luck.
Is there more?
There are always additional NSBs; this is a tenet of performance tuning. We stop when the next bottleneck becomes insignificant (or when we have other problems to attend to). For this test transaction, the SOAP payload is rather large (1.5MB); while the payload is encrypted, it could still be compressed to reduce download time; a quick test using WinZip shows the potential for at least a 50% reduction.
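A similar quick test can be scripted instead of reaching for WinZip. The sketch below uses zlib’s DEFLATE as a stand-in; the sample payload is made up and far more repetitive than a real SOAP response, so the savings it reports say nothing about this customer’s data. Substitute bytes captured from the actual response to get a meaningful figure.

```python
import zlib


def compression_savings(payload: bytes, level: int = 6) -> float:
    """Fraction of bytes saved by DEFLATE compression (0.0 means no savings)."""
    if not payload:
        return 0.0
    compressed = zlib.compress(payload, level)
    return 1 - len(compressed) / len(payload)


# Hypothetical XML-like payload, for illustration only
sample_payload = b"<item><id>1</id><name>example</name></item>" * 1000
print(f"{compression_savings(sample_payload):.0%} smaller")
```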
While some of you will be quick to note that ACE has been discontinued, Cisco support for ACE will continue through January 2019.