In a recent blog post I discussed Enterprise web front-end performance tuning, extending the tuning recommendations to include characteristics of the enterprise data center/private cloud. I ended with a promise to nominate a case study for exactly what not to do – let’s call it front-end worst practices. I had an example in mind from the not-too-distant Compuware past, and managed to find it in my stash.

I’ll begin with a rough problem description: a web-based time and billing application was rolled out to users in Stockholm, the first of a half-dozen international targets after successful adoption in North America. Immediately, users began complaining about performance – at times so bad that the application seemed unavailable. The link speed was 2Mbps, and DC RUM reported a network round-trip time (RTT) of approximately 140 milliseconds.

Our Dynatrace DC RUM monitoring solution had a bit of a field day isolating the fault domain, pointing at page design, packet loss and latency as culprits. To drill down into further details, we captured a trace of a particularly slow page, one which took 37 seconds to load. (Reproducing the problem was not a challenge since most pages performed poorly.) Examining the trace file, Dynatrace Network Analyzer (DNA) reported the same three areas of interest: Network latency coupled with application turns (i.e., page design) and packet loss. (Interestingly enough, the user complaints about bandwidth proved inaccurate, as DNA showed the impact of bandwidth on page load time was under 4 seconds.)

DNA’s Expert Guide includes a workflow for web page analysis, guiding the user through a series of best practice tests modeled after the aforementioned Frontend Tuning blog. I’ll point out the conclusions here, with a few screenshots to illustrate key points.

Enterprise web front-end performance Worst Practice #1: Too many requests

We saw from the blog post that sensitivity to latency is caused primarily by design inefficiencies, largely dependent on the number of application turns or network round-trips. Our test page? 452 app turns! In theory, this could mean that the page includes almost that number of page components; in our case, there were “only” 114. Two problems jump out – one, that’s a lot of page components, so the page design could benefit from some serious tuning. But what about the other 338 app turns? Let’s continue through the workflow to point out other worst practices.
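To put those numbers in perspective, here’s a back-of-the-envelope sketch in Python (a minimal illustration using the figures above; the assumption that every turn is serialized is mine – browsers overlap turns across parallel connections – so treat the result as a ceiling rather than a prediction):

```python
# Back-of-the-envelope latency cost of application turns.
# Assumption: every turn is serialized and pays one full RTT; real browsers
# overlap turns across parallel connections, so this is a ceiling.
RTT_SECONDS = 0.140      # round-trip time reported by DC RUM
APP_TURNS = 452          # app turns counted for the test page
PAGE_COMPONENTS = 114    # components actually referenced by the page

latency_cost = APP_TURNS * RTT_SECONDS
turns_per_component = APP_TURNS / PAGE_COMPONENTS

print(f"Worst-case latency cost: {latency_cost:.1f} seconds")      # ~63 seconds
print(f"Average turns per component: {turns_per_component:.1f}")   # ~4.0
```

That average of roughly four turns per component is the thread we’ll pull on in the next few worst practices.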

Enterprise web front-end performance Worst Practice #2: NTLM authentication

I referred to NTLM as an “expensive” authentication approach because of the app turns it incurs to authenticate a TCP connection. On a new (unauthenticated) connection, the initial HTTP request is met with a 401 Unauthorized response, requesting NTLM be used to authenticate. (That’s one app turn.) The subsequent request from the client includes the NTLM authentication message; the server responds with another 401, this time including a challenge. (That’s the second app turn.) The client sends the request a third time, and (assuming the response to the challenge is correct) this time the request gets serviced by the web server (for the third app turn). Add a SYN/SYN-ACK/ACK handshake to set up the TCP connection and you have four app turns required to complete the GET request.

Three threads and 4 app turns; the NTLM Negotiate thread has 2 app turns since it includes the TCP handshake. Elapsed time for the exchange is approximately 670 milliseconds.
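For reference, here’s that same exchange modeled as a short Python sketch; the per-turn breakdown simply restates the description above, and the latency arithmetic assumes each turn costs one full RTT, which slightly understates the ~670 milliseconds observed in the trace:

```python
# Minimal model of the message flow on a brand-new (unauthenticated)
# connection. Each entry is one client/server round trip, i.e., one app turn.
ntlm_get_exchange = [
    ("SYN / SYN-ACK / ACK",                         "TCP three-way handshake"),
    ("GET -> 401 + WWW-Authenticate: NTLM",         "server demands NTLM"),
    ("GET + NTLM Type 1 -> 401 + Type 2 challenge", "server issues challenge"),
    ("GET + NTLM Type 3 -> 200 (or 304)",           "request finally serviced"),
]

RTT_SECONDS = 0.140  # per DC RUM
turns = len(ntlm_get_exchange)
print(f"App turns per fresh connection: {turns}")                 # 4
print(f"Pure latency cost: {turns * RTT_SECONDS * 1000:.0f} ms")  # ~560 ms
```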

Enterprise web front-end performance Worst Practice #3: Connection: close

Next, we notice many NTLM authentication exchanges – 114, to be exact. NTLM authenticates a TCP connection – are there really 114 TCP connections used to load this web page? In fact, yes; the server (in this example it was actually the proxy server) prevents persistent connections, instead forcing each connection to be closed (Connection: close) once its request has completed. All that work authenticating a TCP connection? Throw it away, let’s do it again. (This was a proxy configuration change, made long ago in an attempt to address a different problem – and never reversed.)
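Those two facts neatly account for the mystery turns from Worst Practice #1. A quick sanity check (the per-connection turn count comes from the NTLM walkthrough above; the arithmetic itself is just illustrative):

```python
# Rough accounting for the 452 app turns, assuming every one of the 114
# components rides its own short-lived, NTLM-authenticated connection.
COMPONENTS = 114
TURNS_PER_CONNECTION = 4   # TCP handshake plus the three HTTP exchanges

estimated_turns = COMPONENTS * TURNS_PER_CONNECTION
print(f"Estimated app turns: {estimated_turns}")   # 456, close to the 452 measured
```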

Enterprise web front-end performance Worst Practice #4: Inefficient browser caching

Using the If-Modified-Since HTTP header, the browser checks for content freshness by sending a conditional GET request to the web server; this request informs the server of the date and time stamp of the file in the browser cache. If the content is current, the server responds with a 304 Not Modified response, and the browser then uses the local copy.

The basic problem with this approach is that it requires a round-trip to the server to check for freshness; each inquiry incurs the link latency. A much more effective approach uses an Expires header that allows the browser to determine if the content is fresh without contacting the server. Compounding the problem for us, however, are the three worst practices we’ve already mentioned. Since persistent connections are not used, each request must begin by setting up a new TCP connection, and this connection must then be authenticated using NTLM – 114 times. For cached content (most of our test page), this means four app turns – only to learn that the local copy is to be used.
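To make the contrast concrete, here’s a minimal sketch of the two freshness strategies; the URL is hypothetical, and I’m using Python’s requests library purely for illustration:

```python
# Sketch: conditional GET versus Expires-based caching.
# The URL below is hypothetical; substitute any static asset.
import requests

url = "https://intranet.example.com/app/logo.png"

# Strategy used by this application: a conditional GET. The browser still
# pays a full round trip (here, plus a new connection and NTLM) just to
# hear "304 Not Modified".
resp = requests.get(url, headers={"If-Modified-Since": "Tue, 01 Mar 2011 08:00:00 GMT"})
print(resp.status_code)  # 304 when the cached copy is still fresh

# The alternative: have the server stamp static content with an Expires
# (or Cache-Control: max-age) header on the first download. Until that
# time passes, the browser can serve the local copy with zero round trips.
resp = requests.get(url)
print(resp.headers.get("Expires"))
print(resp.headers.get("Cache-Control"))
```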

Enterprise web front-end performance Worst Practice #5: High initial retransmission timeout

By default, a new TCP connection starts with a 3000 millisecond initial retransmission timer. As TCP observes actual round-trip times on the connection, this timer is reduced to a value more appropriate to the link’s characteristics. But at the beginning of a new connection, a single dropped packet – a SYN packet, an HTTP GET request, a small (e.g., 304) response – can add 3 seconds to page load time. There were a dozen or so retransmissions evident in the trace, which seems (is, actually) quite high. But the impact of these retransmissions is exacerbated by the lack of persistent connections; most of the dropped packets resulted in 3-second delays, since the TCP connections aren’t provided the opportunity for a long (and happy) life.

The Thread Analysis view shows a 3.3 second duration for an initial NTLM Negotiate exchange; this is because the GET request is dropped. Note the timestamps on the two GET request packets (3.9 and 6.9 seconds) – and the HTTP response (304 Not Modified). Four round-trips plus a 3-second retransmission timeout – just to determine that the local cached file can be used.
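Sketching out that arithmetic (the converged RTO is my guess at where a long-lived connection’s timer might settle after sampling the 140 ms RTT; the drop count is the “dozen or so” from the trace):

```python
# Illustrative cost of dropped packets under a 3-second initial RTO,
# compared with a timer that has converged on a long-lived connection.
INITIAL_RTO = 3.0      # seconds, before any RTT samples exist
CONVERGED_RTO = 0.3    # seconds; an assumed value after sampling a 140 ms RTT
DROPS = 12             # "a dozen or so" retransmissions seen in the trace

print(f"Cost of one drop on a fresh connection:    {INITIAL_RTO:.1f} s")
print(f"Cost of one drop on a seasoned connection: ~{CONVERGED_RTO:.1f} s")
# Upper bound for the trace's drops (delays can overlap other activity):
print(f"Worst case across {DROPS} drops on fresh connections: ~{DROPS * INITIAL_RTO:.0f} s")
```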

Enterprise web front-end performance Worst Practice #6: Unnecessary slow start

This, admittedly, might be a bit of a stretch, but I include it anyway to clarify a point I made in the Enterprise Tuning blog. I suggested using an aggressive slow start algorithm – given two conditions. First, of course, that your ADC provides that tuning option; this option is increasingly available. Second, that you have higher-speed links with bandwidth to spare, since bursting a larger flow of packets across a congested lower-speed link will likely make the problem worse. The link here – 2Mbps – does not fall into that latter category, so adjusting slow start is not the answer. But slow start does contribute to overall page load time, and its impact is significantly more pronounced since (again) our TCP connections don’t live very long. Remember, the default slow start algorithm begins with a congestion window (cwnd) of two packets.

This Bounce Diagram illustrates the transfer of 69KB from server to client; there are 6 TCP round-trips attributed to TCP slow start, adding approximately 1 second to component download time.
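Here’s a rough slow-start calculator that captures the flavor of that Bounce Diagram; the MSS, initial window and growth factors are assumptions, and the real count depends on delayed ACKs and the sender’s implementation:

```python
import math

# Rough slow-start calculator: round trips needed to push a payload through
# a connection whose congestion window starts at two segments.
def slow_start_round_trips(payload_bytes, mss=1460, initial_cwnd=2, growth=2.0):
    segments = math.ceil(payload_bytes / mss)
    cwnd, sent, rounds = initial_cwnd, 0, 0
    while sent < segments:
        sent += cwnd
        rounds += 1
        cwnd = int(cwnd * growth)
    return rounds

RTT = 0.140
for growth, label in [(2.0, "ideal doubling"), (1.5, "slowed by delayed ACKs")]:
    rounds = slow_start_round_trips(69 * 1024, growth=growth)
    print(f"{label}: {rounds} round trips, ~{rounds * RTT:.1f} s of latency")
# The 6 round trips (~1 s) in the Bounce Diagram falls between these two idealized cases.
```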

Summary

With two simple changes – adopting an Expires-based caching approach and using persistent connections – the page should load in about 5 seconds (even on a lower-speed, congested link). Attention to front-end design, tuning and deployment best practices can deliver significant benefits, but remember that these extend beyond the realm of the development team into operations as well.

Have a trace file to share?

Consider this an informal invitation to share a trace file with me for analysis. It should represent one transaction – a page load if it’s a web app, or generically, a click-to-glass user transaction. If it is unencrypted, thread-level (application) insight will be possible; otherwise, the analysis will simply focus on the network. You can email it to me, or point me to your Dropbox.

If the trace contains sensitive data, you might try a (free and simple) tool such as Wire Edit to mask IP addresses or content. In any case, I will not share or publish any screenshots without your permission.