In Part VII, we looked at the interaction between the TCP receive window and network latency. In Part VIII, we examine the performance implications of two application characteristics, chattiness and application windowing, as they relate to network latency.
A “chatty” application is characterized by a large number of remote requests and corresponding replies – application turns, or app turns in Transaction Trace terminology. These are also often referred to as network round-trips, especially in developer documentation. The negative performance impact of these application turns increases as path latency increases, making remote access a challenge for chatty applications.
Assuming an operation’s communications are serialized, each application turn incurs the round-trip path delay, including queuing delays from congestion that might occur at each hop. To estimate the impact of latency on a particular operation, simply multiply the round-trip path delay by the number of application turns. For example, given a link with 100 milliseconds of round-trip delay, an operation with 40 app turns will encounter 4 seconds of chattiness delay, over and above other delays such as bandwidth and server processing. (Parallelization, common in web browsers, will mitigate this impact somewhat; it affects the math, not the conclusion.) Note that chattiness isn’t inherently bad; only when coupled with network latency does it become a performance problem.
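The estimate above is simple enough to sketch in a few lines of Python. The function name and the `parallel` parameter are my own illustrative choices, not Transaction Trace terminology; `parallel` is a rough stand-in for the number of requests an application can keep outstanding at once, and it assumes the turns divide evenly across those parallel requests.

```python
import math


def chattiness_delay(app_turns: int, rtt_seconds: float, parallel: int = 1) -> float:
    """Estimate the delay attributable to application turns.

    parallel=1 models fully serialized communication. Larger values are a
    simplifying assumption: turns are spread evenly over that many
    concurrent requests.
    """
    if app_turns < 0 or rtt_seconds < 0 or parallel < 1:
        raise ValueError("app_turns and rtt_seconds must be >= 0, parallel >= 1")
    return math.ceil(app_turns / parallel) * rtt_seconds


if __name__ == "__main__":
    # The example from the text: 40 app turns on a 100 ms round-trip path.
    print(f"serialized: {chattiness_delay(40, 0.100):.1f} s")
    # Hypothetical: the same operation with 6 requests in parallel.
    print(f"6-way parallel: {chattiness_delay(40, 0.100, parallel=6):.1f} s")
```

Treat the result as a floor for the chattiness component only; bandwidth, server processing, and congestion delays come on top of it.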
Transaction Trace Illustration
Transaction Trace calculates application turns for any TCP-based communication, whether or not a Thread-level decode exists; this count of application turns can be viewed on many graphs and tables. It is also sometimes helpful to illustrate the “ping-pong” effect of chattiness using the Bounce Diagram. You can also use the Thread Analysis to focus the Bounce Diagram using a split window, selecting a single chatty thread for display.
Application-centric corrective actions include combining multiple small functions into a single large function; stored procedures and Java archives (.jar) are examples. Adding parallelization is another approach; it is inherent in web browsers, but often difficult or impossible to implement in other environments. Asynchronous requests – as in AJAX – may decouple the user’s perception of performance from network latency, as anticipatory requests execute in the background.
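To make the effect of parallelization concrete, here is a small simulation using Python's `asyncio`. It is only a sketch: `fake_request` is a hypothetical stand-in that sleeps for one round trip instead of talking to a real server, and it assumes the requests are independent of one another, which is exactly the condition that is often hard to meet outside a browser.

```python
import asyncio
import time


async def fake_request(rtt_seconds: float) -> None:
    """Stand-in for one request/reply exchange: waits one round trip."""
    await asyncio.sleep(rtt_seconds)


async def serialized(turns: int, rtt_seconds: float) -> float:
    """Issue each request only after the previous reply has arrived."""
    start = time.perf_counter()
    for _ in range(turns):
        await fake_request(rtt_seconds)
    return time.perf_counter() - start


async def parallel(turns: int, rtt_seconds: float) -> float:
    """Issue all requests at once and wait for every reply."""
    start = time.perf_counter()
    await asyncio.gather(*(fake_request(rtt_seconds) for _ in range(turns)))
    return time.perf_counter() - start


async def compare(turns: int = 10, rtt_seconds: float = 0.05) -> tuple[float, float]:
    """Return (serialized_elapsed, parallel_elapsed) in seconds."""
    return await serialized(turns, rtt_seconds), await parallel(turns, rtt_seconds)


if __name__ == "__main__":
    s, p = asyncio.run(compare())
    print(f"serialized: {s:.2f} s, parallel: {p:.2f} s")
```

The serialized run pays the round trip once per turn, while the parallel run pays it roughly once in total.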
Physical solutions focus on reducing the latency between the client and server application components. A common approach is to use a thin client solution, where the client logic executes in close physical proximity to the server; client screen updates and keystrokes are transferred across the WAN link.
In some cases, applications may be architected to write data to the network interface (the socket) in blocks, rather than as a stream. This is analogous to writing data to disk, or to memory; each block must be written successfully and acknowledged before the next block can be written. On the network, each block written will incur (at a minimum) the round-trip delay of the path, as the sender must wait for the last packet of each block to be acknowledged before continuing with the next block. Since the application relies on TCP acknowledgements instead of application-level acknowledgements, these blocks are not counted as application turns by Transaction Trace. We refer to this as application windowing, since the behavior – and its goal of reliable data delivery – is similar in many ways to TCP windowing. In the case of application windowing, the application’s write block size dictates the maximum bytes in flight, and the larger the bandwidth delay product (BDP), the more severe the performance penalty.
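A rough model shows why the penalty grows with the BDP. The sketch below assumes each block costs its serialization time on the link plus one full round trip, and ignores slow start, loss, and Delayed ACK effects; the function names and the 10 Mbps / 100 ms figures are illustrative assumptions, not values from a real trace.

```python
def bdp_bytes(link_bps: float, rtt_seconds: float) -> float:
    """Bandwidth delay product: bytes needed in flight to fill the path."""
    return link_bps * rtt_seconds / 8


def block_mode_throughput_bps(block_bytes: int, rtt_seconds: float, link_bps: float) -> float:
    """Approximate throughput ceiling with one block in flight at a time.

    Simplified model: time per block = serialization time + one round trip.
    """
    send_time = block_bytes * 8 / link_bps
    return block_bytes * 8 / (send_time + rtt_seconds)


if __name__ == "__main__":
    link, rtt = 10_000_000, 0.100  # assumed: 10 Mbps link, 100 ms round trip
    print(f"BDP: {bdp_bytes(link, rtt):,.0f} bytes")
    for block in (4096, 16384, 65536):
        mbps = block_mode_throughput_bps(block, rtt, link) / 1e6
        print(f"{block:>6}-byte blocks: about {mbps:.2f} Mbps")
```

Whenever the block size is well below the BDP, the link sits idle for most of each round trip, and the achievable throughput falls far short of the available bandwidth.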
Transaction Trace Illustration
Detecting and visualizing application windowing can be challenging; there are no application turns counted to tip you off, since the receiving node does not make explicit requests for each block of data. Observing the trace, we might first suspect a TCP window constraint, because the patterns are quite similar: the Bounce Diagram, for example, shows the sender waiting for TCP acknowledgements before transferring more data, just as it would under a TCP window constraint.
Considering the similarity to a TCP window constraint, we can use the same analysis approach. Graph the sender’s TCP Payload in Transit (i.e., bytes in flight) along with the receiver’s advertised TCP window. Remember that the congestion window (CWND), which controls how much data the sender may have in flight, will continue to increase until it reaches the limit imposed by the receiver’s TCP window size or until congestion is detected; with application windowing, the payload in transit is additionally capped by the application’s maximum write block size.
In the case of application windowing, the payload in transit will not reach the TCP window limit, but will instead level out at the application’s write (or read) block size. One hint is that the payload in transit value – the application’s block size – will be a multiple of 1024; TCP window sizes are usually multiples of the MSS (e.g., 1460*12 = 17520). Another hint (buried deeper in the evidence) is that the sending application will set the TCP Push (PSH) flag in the last packet of each block to flush the transmit buffer.
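The first hint can be turned into a quick sanity check. This is a heuristic sketch only: the function name and return labels are my own, it assumes you have already read the plateau value and the advertised window off the graph, and it will be inconclusive when the receiver's window itself happens to be a multiple of 1024.

```python
def classify_plateau(plateau_bytes: int, advertised_window: int, mss: int = 1460) -> str:
    """Guess why the payload in transit levels out.

    Returns one of "tcp-window", "application-window", or "inconclusive".
    Heuristic only; confirm with other evidence such as PSH flags.
    """
    if plateau_bytes <= 0 or advertised_window <= 0 or mss <= 0:
        raise ValueError("all arguments must be positive")
    if plateau_bytes >= advertised_window:
        # Bytes in flight reached the receiver's limit.
        return "tcp-window"
    if plateau_bytes % 1024 == 0 and plateau_bytes % mss != 0:
        # Below the TCP limit, and a "round" application-style size.
        return "application-window"
    return "inconclusive"


if __name__ == "__main__":
    print(classify_plateau(8192, 17520))   # plateau at 8K, window of 12 * 1460
    print(classify_plateau(17520, 17520))  # plateau at the advertised window
```

An "inconclusive" result simply means the block-size hint does not apply; congestion or other sender-side limits would need to be ruled out separately.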
The Delayed ACK Timer
Sometimes, the application’s choice of block size may result in an odd number of packets on the network. For example, a block size of 4096 will require 3 packets, assuming a common MSS (packet payload) of 1460. Similarly, a block size of 32K (32768) will require 23 packets. In other cases, a reconfigured MSS – often to accommodate a VPN – may also result in blocks that require an odd number of packets. For example, an 8K block – which would fit in an even 6 packets with a 1460 byte MSS – would require 7 packets if the MSS is reconfigured to 1260.
Now consider TCP’s common ACK timing, which we discussed in Part VI: a receiver will ACK every second packet, and acknowledge a single packet once the Delayed ACK timer expires. The Delayed ACK timer is typically 200 milliseconds. Therefore, a block size that results in an odd number of packets will incur both the network round-trip time and the Delayed ACK time. (This is not unlike the Nagle algorithm, also discussed in Part VI, where the algorithm requires previously sent data to be acknowledged before proceeding. In fact, Nagle can combine with application windowing, adding a penalty equal to one network round-trip plus two Delayed ACK timers – one for Nagle, one for the final packet in the block – for each block of data written.)
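The packet arithmetic is easy to get wrong by eye, so here is a short sketch that computes it. It assumes every packet except the last carries a full MSS, that the receiver ACKs every second packet, and that the Delayed ACK timer is 200 milliseconds; the function names are illustrative, and Nagle is not modeled.

```python
def packets_per_block(block_bytes: int, mss: int = 1460) -> int:
    """Number of TCP segments needed to carry one application block."""
    return -(-block_bytes // mss)  # ceiling division on integers


def per_block_wait(block_bytes: int, rtt_seconds: float, mss: int = 1460,
                   delayed_ack_seconds: float = 0.200) -> float:
    """Minimum wait per block: one round trip, plus the Delayed ACK timer
    when the block ends on an odd-numbered packet."""
    odd = packets_per_block(block_bytes, mss) % 2 == 1
    return rtt_seconds + (delayed_ack_seconds if odd else 0.0)


if __name__ == "__main__":
    for block, mss in ((4096, 1460), (8192, 1460), (8192, 1260)):
        n = packets_per_block(block, mss)
        wait_ms = per_block_wait(block, 0.050, mss) * 1000
        print(f"block={block} mss={mss}: {n} packets, about {wait_ms:.0f} ms per block")
```

With an assumed 50 ms round trip, an odd packet count makes the Delayed ACK timer, not the network, the dominant cost of each block.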
It should be noted that there may be legitimate reasons for choosing to write data across the network in block mode. Block mode allows the application to keep track of the successful delivery of data, instead of relying solely on TCP; as such, the application may offer a feature (based on a sync point) to recover from an interrupted transfer instead of restarting from the beginning. The application may also be able to multiplex data on a single TCP connection, offering important parallelization.
The application-specific corrective action for an application windowing constraint is to write in stream mode (vs. block mode). For a variety of reasons, this may not be feasible. Mitigating actions include using the largest possible block size (writing in 4K blocks will incur 16 times the windowing delay of writing in 64K blocks), and ensuring that each block results in an even number of packets. If blocks necessarily result in an odd number of packets – perhaps there are multiple MSS configurations that must be supported – then changing the TCP ACK frequency to 1 (sending an acknowledgement after every packet) will remove the Delayed ACK timer penalty; however, each block will still incur the round-trip link delay.
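The 16-to-1 ratio between 4K and 64K blocks follows directly from the number of blocks written. A minimal sketch, assuming one round trip of waiting per block and ignoring Delayed ACK and serialization time; the 1 MiB transfer size and 50 ms round trip are arbitrary illustrative values.

```python
def windowing_delay(total_bytes: int, block_bytes: int, rtt_seconds: float) -> float:
    """Total round-trip waiting for a transfer written one block at a time."""
    blocks = -(-total_bytes // block_bytes)  # ceiling division on integers
    return blocks * rtt_seconds


if __name__ == "__main__":
    total, rtt = 1 << 20, 0.050  # assumed: 1 MiB transfer, 50 ms round trip
    d4k = windowing_delay(total, 4 * 1024, rtt)
    d64k = windowing_delay(total, 64 * 1024, rtt)
    print(f"4K blocks: {d4k:.1f} s, 64K blocks: {d64k:.1f} s, ratio {d4k / d64k:.0f}x")
```

The ratio holds for any transfer that is a whole number of 64K blocks; the absolute delay scales with the round-trip time of the path.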
Do you use thin client solutions to address the remote performance limitations of chatty applications? Have you experienced application windowing issues?
Next up – a short conclusion, where I’ll attempt to address some questions and comments, and point to a few good reference sources. Feel free to comment below.