DNS, TCP and Size: Application Performance Best Practices of Super Bowl Advertisers

This performance stuff is easy, especially if you follow best practices. Allow me to explain why preparation means everything before a big online event like the Super Bowl.

We analyzed the performance of advertisers’ web sites during the Super Bowl and posted the MVPs as well as the underperformers. This blog post highlights the best practices we can derive from those results when viewed through the eyes of Application Performance.

It doesn’t matter whether the slowness originates in the page structure, the network, the cloud, application logic, database queries, or any other area that compromises an otherwise fast, highly functional, scalable web site. What matters is that the application is tuned; once it is, your customers will usually reciprocate with the desired objective: higher conversion.

Looking at the right metrics, and using tools that provide them, feels like reading the other team’s playbook: it gives you a significant advantage. Performance specialists have been talking about the performance playbook for years. The formula is well known: fewer connections, fewer hosts, and fewer bytes transferred; combined and minified CSS and JavaScript; caching rules for static content; compression headers that are actually set; load tests to peak-plus traffic levels; analysis of method code for redundancies; and on and on. Let’s take a closer look at three best practices, how GoDaddy and others implemented them, and how you can follow them as well.

Best Practice #1: Keep it Slim

One Super Bowl advertiser that followed the performance formula was GoDaddy. Its strategy this year was agile across the board. Let’s define agile for web site deployment: the ability to direct visitors to specific web pages depending on the lead source (for example, ads, commercials, or social media) and on attributes of the incoming request, such as the visitor’s geography.

A quick and well-coordinated approach involved the deployment of a game time version of the godaddy.com landing page, which reduced the average response time by 60.2%, average bytes transferred by 11.8%, average # of objects by 30.0%, average # of connections by 60.6%, and average # of hosts by 69.2%. At the same time they maintained 100% availability. This is no small feat: uninterrupted uptime while swapping out the landing page is remarkable and speaks volumes about the importance of careful engineering.

Page Load Time was actually reduced during the peak time of the game by providing a special game version landing page
Page Load Comparison of Pre vs. During the Game

Agile web site deployment can be used continuously or for specific marketing events such as the Super Bowl. Either way, when considering this best practice you need to plan ahead as it takes a serious team effort to implement it right.

It’s important to know the performance formula doesn’t come from the ‘APM underground’. This is mainstream information but it requires an agile culture and a certain set of automation that allows fast switching between different versions of these sites for different purposes. Despite good tooling support in certain areas it is still a challenge for many organizations to implement agile web development with a focus on performance.

Why? Perhaps the performance veterans move on to bigger and better things and neglect to tell the newbies the basics. They leave the new guys to figure it out themselves, to fumble a few times before marketing makes demands. All the while, the ecommerce ad execs assume that the right tools are deployed and that the correct offensive and defensive performance best practices have been followed.

But who’s checking?

Best Practice #2: Monitor DNS

On Super Bowl Sunday, 53 advertisers shared the same #1 objective: to both entertain and generate interest. If successful, they would drive volumes of traffic to their web sites. Since advertisers spend millions of dollars on their ads, they expect campaigns that deliver the desired amount of interest and traffic.

Let’s look at some game time examples and learn where things could have been better:

Soda Stream was offline more than 15% of the time during the game. This was due to failed DNS lookups. Monitoring for page failures is basic stuff, and availability is even more important than response time because nothing happens on the web unless a host name is successfully resolved to an IP address. It’s like a car with no gas (no DNS) versus a car with gas that drives slowly. The former doesn’t even move.

DNS is a recursive process – in short, if the local DNS server does not have either an authoritative or cached response for the target, it then forwards the query to other DNS servers. This recursive process is followed until the domain is found, and the local DNS receives the query response and caches it for a period of time specified by the Time to Live (TTL). (RFC 1912)
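The caching behavior described above can be sketched as a toy model. This is a minimal illustration, not how any particular DNS server is implemented: the `TtlDnsCache` class, the `upstream` callable standing in for the recursive lookup, and the injectable `clock` are all hypothetical names introduced for this example.

```python
import time
from typing import Callable, Dict, Tuple


class TtlDnsCache:
    """Toy model of a caching resolver: answers are reused until their TTL expires."""

    def __init__(
        self,
        upstream: Callable[[str], Tuple[str, int]],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        # Hypothetical stand-in for the recursive lookup; returns (address, ttl_seconds).
        self._upstream = upstream
        # Injectable clock so the behavior can be exercised without sleeping.
        self._clock = clock
        # Maps host name -> (address, expiry timestamp).
        self._cache: Dict[str, Tuple[str, float]] = {}
        self.upstream_queries = 0

    def resolve(self, name: str) -> str:
        now = self._clock()
        cached = self._cache.get(name)
        if cached is not None and now < cached[1]:
            return cached[0]  # entry is still fresh: no upstream traffic
        # Entry is missing or stale: repeat the upstream lookup and cache the answer.
        address, ttl = self._upstream(name)
        self.upstream_queries += 1
        self._cache[name] = (address, now + ttl)
        return address
```

With a 300 second TTL, a query at second 299 is answered from the cache, while a query at second 300 triggers a fresh upstream lookup. The shorter the TTL, the more often that second case occurs.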

Here are some TTL facts – as of this writing, for the top 15 sites on the internet TTL values range from 30 to 3600 seconds. The average TTL is 919 seconds or 15.3 minutes – this includes Google, YouTube, Facebook, Yahoo & Amazon. In addition, each of these sites has likely deployed a DNS infrastructure (global load-balanced DNS) to support the mega-traffic volumes they service daily.

What could have gone wrong at Soda Stream? It’s easy to look up the TTL for any domain, including Soda Stream USA. As of this writing, the TTL for the sodastreamusa.com A record is 300 seconds (5 minutes). This means a DNS cache will consider the resource records for sodastreamusa.com stale after 5 minutes, and when a new query comes in for that domain the recursive process is initiated again. If a 300 second TTL was the setting during game time, and the DNS infrastructure could not adequately support a site expecting large traffic spikes such as those during the Super Bowl, the constant, unexpected load on that infrastructure could have caused DNS lookup failures.

DNS Problems lead to bad availability of the website during the game
Looking closer revealed that the actual DNS Lookup problems

One possible explanation is that sodastreamusa.com’s pre-game planning required fast DNS propagation for frequent site changes. They may have intentionally turned the TTL down temporarily but forgotten to turn it back up for the game. Even increasing it to ~15 minutes (the top 15 sites average) would have greatly reduced the load coming from new queries. Further, if load balancing and failover are part of their DNS infrastructure, clearly those were not functioning correctly either.
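To put rough numbers on the effect of the TTL, here is a back-of-the-envelope sketch. It assumes a single caching resolver that receives a steady stream of queries and honors the TTL exactly, and a four-hour game window; both are assumptions made for illustration, not measurements. The function name `upstream_lookups` is hypothetical.

```python
def upstream_lookups(window_seconds: int, ttl_seconds: int) -> int:
    """Lookups one busy caching resolver sends upstream during the window.

    The first query misses the cache, and every later miss happens once the
    previous answer has expired, so the count is ceil(window / ttl).
    """
    if ttl_seconds <= 0:
        raise ValueError("ttl_seconds must be positive")
    return (window_seconds + ttl_seconds - 1) // ttl_seconds


# Assumption for illustration: a four-hour broadcast window.
GAME_WINDOW_SECONDS = 4 * 60 * 60

if __name__ == "__main__":
    # 300 s is the TTL quoted above, 919 s the top 15 sites average, 3600 s the top of the range.
    for ttl in (300, 919, 3600):
        print(f"TTL {ttl:>4} s -> {upstream_lookups(GAME_WINDOW_SECONDS, ttl)} lookups per resolver")
```

Under these assumptions, a 300 second TTL produces 48 upstream lookups per resolver during the window, while 919 seconds produces 16. Multiply that by every resolver serving viewers and the difference in load on the authoritative servers adds up quickly.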

In the end, TTL is about striking a reasonable balance between fast propagation and taking advantage of DNS caching. These failed DNS lookups could likely have been prevented, and the Soda Stream team would have been aware of them had active monitoring been in place. In this case, the DNS ‘visibility gap’ cost Soda Stream a measurable number of unique visitors and wasted advertising dollars that went from fizzy to flat.
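A very small version of such an active DNS check might look like the sketch below. It is an illustration only, not a substitute for a monitoring product: `check_dns`, `DnsCheck` and `availability` are hypothetical names, and `socket.getaddrinfo` goes through the operating system’s resolver, so local caches can mask problems that real visitors would see.

```python
import socket
import time
from typing import Callable, List, NamedTuple, Optional


class DnsCheck(NamedTuple):
    host: str
    ok: bool
    elapsed_ms: float
    error: Optional[str]


def check_dns(host: str, resolver: Callable[..., list] = socket.getaddrinfo) -> DnsCheck:
    """Try to resolve the host once and record success, duration and any error."""
    start = time.perf_counter()
    try:
        resolver(host, 80)
        ok, error = True, None
    except socket.gaierror as exc:
        # Name resolution failed: this is the failure class described above.
        ok, error = False, str(exc)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return DnsCheck(host, ok, elapsed_ms, error)


def availability(results: List[DnsCheck]) -> float:
    """Percentage of checks in which the lookup succeeded."""
    if not results:
        return 0.0
    return 100.0 * sum(1 for r in results if r.ok) / len(results)
```

Run on a schedule from several locations, a check like this would surface a sustained drop in lookup availability within minutes instead of after the game.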

Best Practice #3: Monitor TCP Socket Connections

Availability issues occurred for Kia during game time as well. But unlike Soda Stream, Kia’s site failed moments after its ad ran.

What failed for Kia?

TCP socket timeouts hit visitors to http://www.kia.com/us/en/vehicle/k900/2015/experience. Connections were dropped after 60 seconds because no response had been received from the remote server on a connection that was already established.
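The distinction between failing to connect and timing out on an established connection can be sketched with a small probe. This is a simplified illustration under stated assumptions: `probe_http` is a hypothetical helper, it speaks plain HTTP only, and the 60 second default simply mirrors the timeout mentioned above.

```python
import socket


def probe_http(
    host: str,
    port: int = 80,
    path: str = "/",
    connect_timeout: float = 5.0,
    read_timeout: float = 60.0,
) -> str:
    """Classify one request as ok, connect_failed, read_timeout, and so on."""
    try:
        # Phase 1: establish the TCP connection.
        sock = socket.create_connection((host, port), timeout=connect_timeout)
    except OSError:
        return "connect_failed"
    with sock:
        # Phase 2: the connection exists; now wait for the server to answer.
        sock.settimeout(read_timeout)
        request = f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
        try:
            sock.sendall(request.encode("ascii"))
            first_bytes = sock.recv(1024)
        except socket.timeout:
            # Connected, but no response arrived within read_timeout.
            return "read_timeout"
        except OSError:
            return "connection_error"
    return "ok" if first_bytes else "closed_without_response"
```

A `read_timeout` result corresponds to the failure described here: the server accepted the connection but never got around to answering it.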

Availability dropped right after the Ad aired
The root cause was that requests to their landing page timed out

Monitoring and load testing for TCP socket connections helps ensure that the web servers are deployed with enough worker threads, that the processing per request is reduced so those worker threads are freed faster for other incoming requests, or that some of these requests are “pushed” to a CDN partner. Could this have been avoided with proper monitoring and load testing of the front-end web tier to the expected traffic volumes? Yes, we believe so.
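As a rough sizing aid, Little’s law says the average number of requests in flight equals the arrival rate multiplied by the average time each request occupies a worker. The sketch below applies it with hypothetical figures; the traffic numbers are made up for illustration and are not Kia’s, and `workers_needed` is a name introduced here.

```python
def workers_needed(requests_per_second: int, service_time_ms: int) -> int:
    """Average number of busy workers by Little's law, rounded up.

    in_flight = arrival_rate * time_per_request
    """
    if requests_per_second < 0 or service_time_ms < 0:
        raise ValueError("inputs must be non-negative")
    busy_worker_ms = requests_per_second * service_time_ms
    return (busy_worker_ms + 999) // 1000  # integer ceiling of busy_worker_ms / 1000


if __name__ == "__main__":
    # Hypothetical spike of 2000 requests/s right after an ad airs.
    print(workers_needed(2000, 250))  # each request holds a worker for 250 ms
    print(workers_needed(2000, 100))  # same load after trimming processing to 100 ms
```

With these made-up inputs, the tier needs about 500 busy workers at 250 ms per request but only about 200 at 100 ms, which is why reducing per-request processing frees capacity as effectively as adding threads. This is an average; a real deployment needs headroom on top of it.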

Following these best practices will get you further along in completing the task at hand. APM itself is a journey, one that should include the right tools, processes and culture for gaining performance insight into fault domains and quickly pinpointing the root cause of performance issues before users are impacted.

In our next post, we will use our Super Bowl analysis to look at more lessons learned when best practices are not applied.

Greg Speckhart is a Senior APM Solution Consultant with Compuware APM. Greg has been working in the Application Performance space for several years helping enterprises to better understand and improve application performance.