When we think of application performance problems that are network-related, we often immediately think of bandwidth and congestion as likely culprits; faster speeds and less traffic will solve everything, right? This is reminiscent of recent ISP wars: which is better, DSL or cable modems? Cable modem proponents touted the higher bandwidth, while DSL proponents warned of the dangers of sharing the network with your potentially bandwidth-hogging neighbors. In this blog entry, we’ll examine these two closely related constraints, beginning the series of performance analyses using the framework we introduced in Part I. I’ll use graphics from Compuware’s application-centric protocol analyzer – Transaction Trace – as illustrations.
We define bandwidth delay as the serialization delay encountered as bits are clocked out onto the network medium. Most important for performance analysis is what we refer to as the “bottleneck bandwidth” – the speed of the link at its slowest point – as this will be the primary influencer on the packet arrival rate at the destination. Each packet incurs the serialization delay dictated by the link speed; for example, at 4 Mbps, a 1500-byte packet takes approximately 3 milliseconds to serialize. Extending this bandwidth calculation to an entire operation is relatively straightforward: we observe (on the wire) the number of bytes sent or received, multiply by 8 bits per byte, then divide by the bottleneck link speed, remembering that asymmetric links may have different upstream and downstream speeds.
Bandwidth effect = [# bytes sent or received] x [8 bits] / [Bottleneck link speed]
For example, we can calculate the bandwidth effect for an operation that sends 100KB and receives 1024KB on a 2048Kbps link:
- Upstream effect: [100,000 x 8] / [2,048,000] ≈ 390 milliseconds
- Downstream effect: [1,024,000 x 8] / [2,048,000] = 4000 milliseconds
For better precision, you should account for frame header size differences between the packet capture medium – Ethernet, likely – and the WAN link; this difference might be as much as 8 or 10 bytes per packet.
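The bandwidth effect formula and the framing adjustment can be sketched in a few lines of Python. The function name and the `overhead_delta` parameter are illustrative (not a Transaction Trace feature); they simply apply a hypothetical per-packet framing correction of the kind described above:

```python
def bandwidth_effect_ms(payload_bytes, link_bps, packets=0, overhead_delta=0):
    """Serialization delay, in milliseconds, at the bottleneck link.

    overhead_delta is the per-packet framing difference (in bytes) between
    the capture medium (likely Ethernet) and the WAN link; pass the number
    of observed packets to apply it. Both are optional refinements.
    """
    total_bits = (payload_bytes + packets * overhead_delta) * 8
    return total_bits * 1000 / link_bps

# Worked example from the text: a 2048 kbps link
print(bandwidth_effect_ms(100_000, 2_048_000))    # ~390 ms upstream
print(bandwidth_effect_ms(1_024_000, 2_048_000))  # 4000 ms downstream
```

Remember to apply the appropriate link speed for each direction when the link is asymmetric.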
Bandwidth constraints impact only the data transfer periods within an operation – the request and reply flows. Each flow also incurs (at a minimum) additional delay due to network latency, as the first bit traverses the network from sender to receiver; TCP flow control or other factors may introduce further delays. (As an operation’s chattiness increases, its sensitivity to network latency increases and the overall impact of bandwidth tends to decrease, becoming overshadowed by latency.)
Transaction Trace Illustration: Bandwidth
One way to frame the question is “does the operation use all of the available bandwidth?” The simplest way to visualize this is to graph throughput in each direction, comparing uni-directional throughput with the link’s measured bandwidth. If the answer is yes, then the operation bottleneck is bandwidth; if the answer is no, then there is some other constraint limiting performance. (This doesn’t mean that bandwidth isn’t a significant, or even the dominant, constraint; it simply means that there are other factors that prevent the operation from reaching the bandwidth limitation. The formula we used to calculate the impact of bandwidth still applies as a definition of the contribution of bandwidth to the overall operation time.)
Networks are generally shared resources; when there are multiple connections on a link, TCP flow control will prevent a single flow from using all of the available bandwidth as it detects and adjusts for congestion. We will evaluate the impact of congestion next, but fundamentally, the diagnosis is the same; bandwidth constrains throughput.
Congestion occurs when data arrives at a network interface faster than the medium can service it; when this happens, packets must be placed in an output queue, waiting until earlier packets have been serviced. These queue delays add to the end-to-end network delay, with a potentially significant effect on both chatty and non-chatty operations. (Chatty operations will be impacted due to the increase in round-trip delay, while non-chatty operations may be impacted by TCP flow control and congestion avoidance algorithms.)
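The queue-growth behavior is easy to model. This toy FIFO sketch (the numbers are hypothetical, chosen to match the 4 Mbps / 1500-byte serialization example earlier) shows how packets arriving every 2 ms on a link that needs ~3 ms per packet accumulate steadily increasing queuing delay:

```python
# Toy FIFO output-queue model: packets arriving faster than the link can
# serialize them accumulate queuing delay.
def queue_delays(arrivals_ms, service_ms):
    delays, link_free_at = [], 0.0
    for t in arrivals_ms:
        start = max(t, link_free_at)        # wait for earlier packets to drain
        delays.append(start - t)            # queuing delay for this packet
        link_free_at = start + service_ms   # link busy while serializing
    return delays

# Arrivals every 2 ms, ~3 ms to serialize each packet: the queue builds
print(queue_delays([0, 2, 4, 6], service_ms=3.0))  # [0.0, 1.0, 2.0, 3.0]
```

Each successive packet waits one millisecond longer than the last; sustained over-subscription eventually fills the queue, leading to the packet drops discussed below.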
For a given flow, congestion initially reduces the rate of TCP slow-start’s ramp by slowing increases to the sender’s Congestion Window (CWND); it also adds to the delay component of the Bandwidth Delay Product (BDP), increasing the likelihood of exhausting the receiver’s TCP window. (We’ll discuss TCP slow-start as well as the BDP later in this series.)
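As a preview, the BDP itself is simple arithmetic. This sketch (link speed and RTT values are hypothetical) shows how congestion-inflated round-trip time raises the number of bytes that must be in flight to keep the link busy:

```python
def bdp_bytes(link_bps, rtt_ms):
    """Bandwidth Delay Product: bytes that must be in flight to fill the pipe."""
    return link_bps * (rtt_ms / 1000) / 8

# Hypothetical 2048 kbps link; queuing delay inflates RTT from 40 ms to 120 ms
for rtt_ms in (40, 120):
    print(rtt_ms, bdp_bytes(2_048_000, rtt_ms))  # 10,240 then 30,720 bytes
```

Tripling the round-trip time triples the BDP; if the BDP grows beyond the receiver’s advertised TCP window, throughput becomes window-limited rather than bandwidth-limited.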
As congestion becomes more severe, the queue in one of the path’s routers may become full. As packets arrive exceeding the queue’s storage capacity, some packets must be discarded. Routers employ various algorithms to determine which packets should be dropped, perhaps attempting to distribute congestion’s impact among multiple connections, or to more significantly impact lower-priority traffic. When TCP detects these dropped packets (by a triple-duplicate ACK, for example), congestion is the assumed cause. As we will discuss in more depth in an upcoming blog entry, packet loss causes the sending TCP to reduce its Congestion Window by 50%, after which slow-start begins to ramp up again in a relatively conservative congestion avoidance phase.
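The sender’s reaction to a triple-duplicate ACK can be sketched as follows. This is a simplified, Reno-style model for illustration only; real TCP stacks differ in their details, which we’ll cover in the upcoming entry:

```python
def on_triple_duplicate_ack(cwnd_bytes, mss_bytes):
    """Simplified Reno-style reaction to a triple-duplicate ACK: halve the
    congestion window (never below 2 MSS) and set the slow-start threshold
    to match, so growth resumes linearly in congestion avoidance."""
    ssthresh = max(cwnd_bytes // 2, 2 * mss_bytes)
    return ssthresh, ssthresh  # (new cwnd, new ssthresh)

print(on_triple_duplicate_ack(64_000, 1460))  # (32000, 32000)
```

After the window is halved, each subsequent round trip grows it by roughly one MSS, which is why recovery from loss is so much slower than the initial slow-start ramp.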
Transaction Trace Illustration: Congestion
We know that a network path has some minimum amount of delay, in theory based purely on distance and route processing; we define that as path latency. Any delay above this amount can be attributed to congestion. (While we generally consider congestion to be related to link utilization’s impact on router queues, it can also be introduced by processing delays; for example, a busy firewall may experience a delay in examining a packet, adding to end-to-end delay and to our definition – and measurement – of congestion.)
The most accurate method of measuring congestion from a packet trace is to capture at both client and server locations, then merge the two trace files together using Transaction Trace’s remote merge function. This approach assures accurate send and receive timestamps for every packet. We can then analyze transit time over the course of the operation; transit times above the minimum observed value (greater than the path latency) are presumed to be caused by congestion. We generally make the assumption that the minimum observed transit time in a merged task is equal to the “idle” path delay – in other words, at least one packet in the trace succeeded in traversing the network without encountering any significant congestion. This is a reasonable assumption, as the goal is not to calculate precisely the impact of congestion, but rather to prove that congestion is an important contributing bottleneck by estimating its effect.
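The transit-time analysis described above reduces to a few lines of arithmetic. This sketch assumes packets as `(send_ts, recv_ts)` pairs in seconds; the tuple format and sample values are illustrative, not Transaction Trace’s actual export format:

```python
# Estimate congestion from a merged trace: any transit time above the
# minimum observed value is attributed to congestion (queuing).
def congestion_estimate(packets):
    transits = [recv - send for send, recv in packets]
    baseline = min(transits)                    # assumed "idle" path latency
    excess = [t - baseline for t in transits]   # per-packet congestion delay
    return baseline, sum(excess), max(excess)

pkts = [(0.000, 0.042), (0.010, 0.055), (0.020, 0.095), (0.030, 0.072)]
baseline, total_excess, worst = congestion_estimate(pkts)
# baseline ~42 ms path latency; ~36 ms total and ~33 ms worst-case
# congestion delay across these four packets
```

As noted, the baseline is only as good as the assumption that at least one packet crossed the network uncongested; the goal is a defensible estimate, not a precise measurement.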
To illustrate congestion, use the Time Plot view to graph packet transit times, comparing the delta between minimum, average and maximum delays. You may find very short bursts of congestion affecting only a small handful of packets, or perhaps more consistent congestion that affects most of the packets for a flow or operation. The two Time Plot graphs below illustrate these conditions.
Corrective Actions: Bandwidth and Congestion Constraints
Addressing a pure bandwidth constraint is straightforward; the physical (i.e., infrastructure) solution is to increase bandwidth, while the logical (i.e., application) solution is to decrease the amount of data transferred. Data compression is a method for the latter that has been around for decades, and more recent WAN optimization approaches offer further options for data reduction. Caching, interface simplification, and thin client solutions may also provide relief.
Similarly, addressing congestion can be as simple as increasing bandwidth. Alternatively, you may take a more studied approach, identifying and classifying the traffic that contends for bandwidth. QoS policies may be used to mitigate the impact of congestion on time-sensitive applications, effectively allocating more bandwidth to important traffic by limiting the rate of less-critical traffic. And of course you may find cases of incorrectly routed (or forbidden) traffic in unexpected places.
How do you monitor, report and manage congestion in your network?
In an upcoming post in this 8-part series, we’ll look at the impact of packet loss, which is of course quite closely related to bandwidth and congestion constraints. But next, in Part III, we’ll discuss TCP slow-start, introduce the Congestion Window, and illustrate how these are used to control the sender’s transmission rate. Stay tuned and feel free to comment below.