Many blog posts have been written about the Dyn DNS DDoS attack on October 21st. Dyn itself was very open about the problem and kept updating its customers and the users impacted by the attack through its blog and Twitter feed. My colleague David Jones (aka Jonesy) blogged about how our Dynatrace Synthetic monitoring picked up the problem for our customers. Here is a screenshot showing how, for instance, product downloads were impacted from different locations around the globe:

Our synthetic tests to validate key functionality such as product downloads alerted us about the DNS outage

Impact on our Backend Services

Those users fortunate enough to make it to our website then encountered another issue. Our web applications (website, community, eservices) rely on third-party web services called from our backend implementation. For instance, we rely on a service from Visual Compliance (www.visualcompliance.com) that we call when users log in to our services.

We received an early warning signal from our Synthetic Tests that execute the login sequence on a regular basis – but we also saw how our real users struggled with the problem. The following screenshot shows a list of users who encountered the login problem that Friday and were obviously frustrated (#2). The bottom list in that screenshot shows that these users tried to log in (#3), failed, and were redirected back to the login page to try again:

Dynatrace UEM shows us every single user and all their actions that led to frustration during the DNS outage

The login attempt for the user from Medford, MA (shown in the screenshot above) took about 153s before it came back with an error and redirected the user back to the login form. A short while later this user tried again, with the same outcome!

The root cause for this was also the DNS outage but, as mentioned at the beginning of this post, the failure happened in our backend implementation. Our authentication service implementation tries to call the external service but, because of the DNS problem, our backend code couldn't resolve the name eim.visualcompliance.com. It appears we are using the standard DNS lookup timeout of 100s rather than a lower custom value. As a result, every login request was blocked for 100s until the Java libraries threw an UnknownHostException. This unhandled exception ultimately led to an HTTP 500 being sent back to the browser:
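One way to avoid that kind of stall is to put a hard cap on how long a lookup may block instead of relying on the platform default. Below is a minimal sketch of that idea, assuming a plain java.net lookup somewhere in the call path; the class name FailFastResolver, the helper method and the 10s budget are all illustrative, not our actual implementation:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative sketch: cap DNS resolution time instead of letting the
// platform default block the calling thread for ~100s.
public class FailFastResolver {

    private static final ExecutorService DNS_POOL = Executors.newCachedThreadPool();

    public static InetAddress resolveOrFail(String host, long timeoutSeconds)
            throws UnknownHostException {
        Future<InetAddress> lookup = DNS_POOL.submit(() -> InetAddress.getByName(host));
        try {
            // Fail fast: give up if the name can't be resolved within the budget.
            return lookup.get(timeoutSeconds, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            lookup.cancel(true);
            throw new UnknownHostException(
                    "DNS lookup for " + host + " did not complete within " + timeoutSeconds + "s");
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new UnknownHostException("Interrupted while resolving " + host);
        } catch (ExecutionException e) {
            throw new UnknownHostException("DNS lookup for " + host + " failed: " + e.getCause());
        }
    }
}
```

With a guard like this in front of the external call (plus a caught, user-friendly error), a DNS outage turns into a quick, handled failure instead of a 100s stall ending in an HTTP 500.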

Dynatrace PurePath showed us exactly where our backend system tried to call that external service. The exception thrown not only makes it to the log file (as seen in the PurePath) but also leads to an HTTP 500 response.

Another interesting finding (#5 in the screenshot) that might not be very obvious is this: because all incoming login requests were blocked for 100s (until running into the default DNS timeout), the Java app server soon ran out of available outgoing connections to the external service. Remember: every app server has connection pools for worker threads as well as for outgoing connections. If all outgoing connections are stuck waiting for DNS resolution, then at some point new incoming requests simply have to WAIT until a connection becomes available, and that may take up to 100s. In Dynatrace we can observe this by looking at the Wait time we capture for every single PurePath. This is the time a request must wait for a resource, such as a thread or a connection.
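A related mitigation is to bound how long a request may wait to borrow an outgoing connection from the pool in the first place. As a sketch, here is how that could look with Apache HttpClient 4.x, which is an assumption about the client library; all pool sizes and timeouts are example values, not our production settings:

```java
import org.apache.http.client.config.RequestConfig;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

// Illustrative sketch with Apache HttpClient 4.x (assumed client library).
public class ExternalServiceClientFactory {

    public static CloseableHttpClient create() {
        RequestConfig timeouts = RequestConfig.custom()
                .setConnectionRequestTimeout(2_000) // max wait to borrow a pooled connection
                .setConnectTimeout(5_000)           // max wait to establish the TCP connection
                .setSocketTimeout(10_000)           // max wait for response data
                .build();

        return HttpClients.custom()
                .setDefaultRequestConfig(timeouts)
                .setMaxConnPerRoute(50)   // outgoing connections per target host
                .setMaxConnTotal(100)     // outgoing connections overall
                .build();
    }
}
```

With a bounded connection-request timeout, requests that can't get a connection fail within a couple of seconds instead of queuing up behind lookups that are doomed to time out anyway.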

And, because we are all impatient, our data also showed us that many of our users didn't wait until the site returned an error. They just kept hitting F5 (browser refresh), making the problem even worse: their initial request was still blocked until it ran into the 100s timeout, while each new request had to wait for yet another free connection to become available.

All of this is visible in our Dynatrace AppMon & UEM data because we capture every single user and every single click. So we really KNOW exactly how impatient our users are!

Lessons learned!

While the DNS outage had a major impact on end users who couldn't reach certain websites, it also caused major issues in applications that tried to call external services and failed. It was a good wake-up call to revisit the error handling in our application code. A default DNS lookup timeout of 100s is too high in my opinion. If you can't resolve a name within 10s you probably won't resolve it at all, so it's better to fail fast than to block scarce resources such as threads or connections.
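For the DNS lookup timeout itself, part of that window typically comes from the OS resolver rather than the JVM. On Linux/glibc it can be tightened in resolv.conf; the snippet below is purely illustrative (placeholder nameserver addresses, aggressive retry settings), not a one-size-fits-all recommendation:

```
# /etc/resolv.conf (Linux/glibc example with illustrative values)
nameserver 10.0.0.2
nameserver 10.0.0.3
# Give up on an unresponsive nameserver after ~2s, retry at most twice
options timeout:2 attempts:2
```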

If you want to give Dynatrace a try on your own systems check out our free trial offering for both Dynatrace AppMon & UEM, as well as our SaaS-based monitoring solution. It requires only a few clicks, and we show you where you can optimize your implementation to become more robust against attacks like the October 21 event.

I want to give a special shout-out to my colleague Ahmad Awadallah who, despite fighting multiple fires during that hectic Friday, was kind enough to send me these screenshots and PurePaths so that I could write up the story. Thanks Ahmad! Keep on keeping these servers healthy!