DNS Incident, Friday, October 21, 2016

Update: 5:30pm EST, November 9, 2016

Andi Grabner provides some great insight into how DNS issues harm applications. This is a must-read.

Analyzing DNS Outage Impact on 3rd party Applications

Update: 9:30am EST, October 26, 2016

The dust is still settling from Friday’s DDoS attack. Cogeco Peer 1 provided an interesting infographic highlighting some of the factors that businesses need to consider when trying to understand the cost of a DDoS attack. We are still looking at the overall cost and impact of the issue. Below is a real example of what Dynatrace SaaS was showing a customer on Friday the 21st. Because applications can be very complex (third parties) and highly distributed (remote hosting), even an issue localized to the Northeast US can impact users across the country (and, as we saw, across the globe).

DTsaas

Here is a recap of what you can do to be better prepared for a DDoS attack by using a digital performance platform.

  • Be proactive. Aggressively manage your site with synthetic monitoring. Let the software robots make sure that your site can be reached from end user locations and notify you the moment something is amiss. The sooner you know of the issue, the sooner you can deal with it.
  • Have a DNS failover strategy. Relying on a single DNS provider is a recipe for disaster. You should have relationships with multiple vendors to allow you to switch DNS routing as soon as an issue shows up.
  • Don’t replace synthetics with real user monitoring. Real user monitoring will only tell you about the absence of traffic in a situation like this. Synthetics will provide more detail as to what is being seen from those end user locations. Remember that while you need synthetics, they cannot replace real user monitoring, because real user monitoring data will tell you how much the issue is costing your business. You need both!

Update: 2:15pm EST, October 24, 2016

What a weekend. The team here at Dynatrace has been working hard to analyze the business impact of the DDoS. Aaron Rudger, Director of Product Marketing at Dynatrace, analyzed a “representative sample” of sites we monitor to quantify the business impact of this event. In this case, it’s 177 site tests in the Keynote Benchmarks for the News, Social Media and Hospitality industries. The view below shows how Dynatrace uses Synthetic Purelytics (detailed object-level analysis) to understand the issue.

purelytics

Of those sites, 77 were impacted today by the DNS system disruption, or 43%. The disruption reduced the availability of these sites to 89%, meaning about 1 out of 10 people would have been turned away. Now let’s plug some dollars into this issue. US online retail sales in 2015 were $342B (US Dept of Commerce), or $936M/day. US online advertising revenue in 2015 was $60B (PwC), or $163M/day. This is a crude extrapolation, but we could say this disruption potentially impacted up to $110M in online revenue and sales.
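The extrapolation above can be reproduced with a few lines of arithmetic. This is only a sketch of one plausible reading: it assumes the $110M figure comes from applying the "1 out of 10 turned away" rate to a full day of combined retail and advertising revenue, which the analysis does not state explicitly. The annual figures are the ones quoted above, and the daily values land close to those quoted.

```python
DAYS_PER_YEAR = 365

# Annual figures quoted in the text.
retail_sales_2015 = 342e9  # US online retail sales (US Dept of Commerce)
ad_revenue_2015 = 60e9     # US online advertising revenue (PwC)

retail_per_day = retail_sales_2015 / DAYS_PER_YEAR
ads_per_day = ad_revenue_2015 / DAYS_PER_YEAR
combined_per_day = retail_per_day + ads_per_day

# Assumption: roughly 1 in 10 visitors turned away for a whole day,
# applied to all online revenue (hence "crude" and "up to").
turned_away_fraction = 0.10
potential_impact = combined_per_day * turned_away_fraction

# Share of the 177 benchmark site tests that were impacted.
impacted_share = 77 / 177

print(f"retail per day:   ${retail_per_day / 1e6:,.0f}M")
print(f"ads per day:      ${ads_per_day / 1e6:,.0f}M")
print(f"potential impact: ${potential_impact / 1e6:,.0f}M")
print(f"sites impacted:   {impacted_share:.1%}")
```

Because the rate is applied to the entire day and the entire market, the result is best read as an upper bound rather than an estimate.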

Below is an example of Dynatrace User Experience data from an online retailer. In this example we are comparing the traffic from last Friday to the previous Friday. Here we can see that there were 100 fewer conversions for the same time frame. This shows (based on real user traffic) that this site saw 84% of the business it would normally expect for that time frame.

DDoS_BI

Many businesses were impacted on Friday. Do you know how much the DDoS attack cost your business?

Update: 11:00pm EST

Reports of this issue impacting regions outside the US are coming in. Dave Anderson, VP, EMEA and APAC Marketing for Dynatrace, highlighted some issues being seen in Australia. Here is some additional detail as to why Australia is being impacted by a US issue. In this example I looked at Australian retailer David Jones (great name for a company). You can see that we are monitoring from two locations in Australia on the Dynatrace Synthetic monitoring network. This is where Dynatrace makes the requests from (just like end users in Australia).

au1

Above in the bottom right of the image you can see where Dynatrace is monitoring from Melbourne and Sydney.

When we switch to the Host View (below), we see where the responses are being returned from. You can see that even for a test being made in Australia against an Australian company, many of the responses come from locations in the US and Europe. This is because pages are complex and use third parties like analytics tools, social media, ad providers, etc.

au2

For example, David Jones (Australian retailer… not me) uses Coremetrics, which actually resolves to US-based servers on the East Coast. So when the US East Coast has a DNS issue, it impacts everyone.

au3

It took me a minute to let that sink in: essentially, a DDoS attack on US-based DNS providers will impact the entire globe. The internet is more interconnected than most businesses realize. Not such a small world.

Update: 9:35pm EST

Two-and-a-half hours since the last wave, DNS Health appears to have normalized for now.

DNS_waves_2

Update: 5:15pm EST

CNN is reporting that the DNS attack is coming in waves. The Dynatrace Synthetic Network is showing that the last wave is abating.

DNS_waves

What’s happening?

Numerous reports today indicated a localized East Coast performance event was occurring. Reports pointed to an attack on Dyn. Dyn provides DNS services. DNS (Domain Name System) can be thought of as a phone book for the internet. Domain names (e.g. www.mysite.com) get mapped to IP addresses (e.g. 128.122.1.101). This is one of the basic technologies that make the internet work, and if it fails a lot of people get very upset. For the most part DNS services are very fast and robust. Many services like Dyn and CDNs (content delivery networks) use DNS very dynamically to route traffic to remote points of presence (for distributing load closer to end users in the case of a CDN) as well as to manage DNS for companies’ websites.
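To make the phone book analogy concrete, here is a minimal sketch of a timed DNS lookup using Python's standard library. The function name `timed_lookup` and its `resolver` parameter are made up for this illustration; the default resolver is the real `socket.gethostbyname`, which asks the operating system's configured DNS resolver for an IPv4 address.

```python
import socket
import time


def timed_lookup(hostname, resolver=socket.gethostbyname):
    """Resolve hostname and return (ip_address, elapsed_seconds).

    `resolver` defaults to socket.gethostbyname. It is a parameter only
    so the lookup can be swapped out or faked in a test; a failed real
    lookup raises an OSError subclass, which is left to the caller.
    """
    start = time.perf_counter()
    ip_address = resolver(hostname)
    elapsed = time.perf_counter() - start
    return ip_address, elapsed
```

Calling `timed_lookup("www.mysite.com")` (the placeholder name from the text) on a healthy network would normally return in milliseconds. During an event like this one, the elapsed time is the number to watch.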

Our first indication of an issue was seen in the Dynatrace Internet Health Report, with some latency and dropped packets coming from Level 3 (a major network and CDN provider). This dashboard is available to the public at http://internethealthreport.com/ and shows peering relationships between major backbone network providers (these are the companies that move the lion’s share of traffic across the US).

HealthReport

Other non-Dynatrace tools also indicated an issue was underway. Downdetector.com showed a Level 3 issue occurring in the Northeast US.

level3


Dynatrace operates the largest network of monitoring agents in the world with dedicated hardware in major US cities and thousands of software agents running from end user machines in every state.  The Dynatrace Synthetic Network Internet Health Map shows the issue impacting locations across the US East Coast.

healthmap_dns

Further investigation shows that the issue was appearing as very long DNS lookup times. Typically these DNS requests complete in milliseconds; however, time banding indicates that these DNS requests are timing out. When this occurs, there are typically retries after 2 seconds elapse, so the lookup times cluster in bands around multiples of 2 seconds. This time banding indicates an issue with DNS health.
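The time banding described above can be sketched as a simple classification of lookup times. This is an illustration only, not how Dynatrace computes it: the 2-second retry interval comes from the text, while the tolerance value and the function names are assumptions.

```python
RETRY_INTERVAL_S = 2.0   # retry delay mentioned in the text
BAND_TOLERANCE_S = 0.25  # assumed: how far past a multiple of 2 s still counts


def retry_band(lookup_seconds):
    """Return the number of timeouts a lookup time suggests, or 0 if none.

    A lookup that took roughly N * RETRY_INTERVAL_S plus a normal
    millisecond-scale lookup probably timed out N times before succeeding.
    """
    band = int(lookup_seconds // RETRY_INTERVAL_S)
    overshoot = lookup_seconds - band * RETRY_INTERVAL_S
    if band >= 1 and overshoot <= BAND_TOLERANCE_S:
        return band
    return 0


def band_counts(samples):
    """Count how many lookup times fall into each retry band."""
    counts = {}
    for seconds in samples:
        band = retry_band(seconds)
        counts[band] = counts.get(band, 0) + 1
    return counts
```

If a large share of measurements falls into band 1 or higher instead of band 0, the slowness is likely caused by timeouts and retries rather than ordinary network latency.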

DNS_benchmark

Here we can see the DNS Time increasing dramatically in certain cities and specific networks.

DNS_by_city

OK, So What? DNS times are slowing down.

Aside from the news reports and issues accessing social media, what is the big deal? From what Dynatrace can see, this problem is not isolated to social media sites, because most other sites use social media components on their web pages. In some cases, they utilize services from those social media sites (like APIs and JavaScript frameworks), which can impact the performance of their own sites.

For example, here is today’s traffic from a major US retailer.  If you look at the Visits for today compared to the same time last week you can see that the site traffic is down considerably.  When traffic is down conversion count also goes down (this means the amount of revenue being generated online decreases).

UEM

The team here at Dynatrace loves real user data, and our customers also love getting real user data. However, in an event like this, when a DNS issue prevents traffic from reaching your site, real user data only indicates the absence of traffic and not the cause. That’s why Dynatrace provides the world’s most sophisticated Synthetic Monitoring Network, which proactively tests sites from end user locations for issues like this.

What can businesses do to prevent this?

  • Be proactive. Aggressively manage your site with synthetic monitoring. Let the software robots make sure that your site can be reached from end user locations and notify you the moment something is amiss. The sooner you know of the issue, the sooner you can deal with it.
  • Have a DNS failover strategy. Relying on a single DNS provider is a recipe for disaster. You should have relationships with multiple vendors to allow you to switch DNS routing as soon as an issue shows up.
  • Don’t replace synthetics with real user monitoring. Real user monitoring will only tell you about the absence of traffic in a situation like this. Synthetics will provide more detail as to what is being seen from those end user locations. Remember that while you need synthetics, they cannot replace real user monitoring, because real user monitoring data will tell you how much the issue is costing your business. You need both!
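The first two recommendations fit together: a synthetic check tells you when a DNS provider is unhealthy, and a failover strategy decides where to go next. The sketch below shows that decision in its simplest form. Everything in it is hypothetical (the function name, the provider names in the usage note, and the 1-second threshold), and actually switching providers depends on your registrar and vendors, which this does not cover.

```python
def first_healthy_provider(providers, check, max_seconds=1.0):
    """Return the first provider whose synthetic check passes, else None.

    providers:   ordered list of provider names, primary first.
    check:       callable taking a provider name and returning the DNS
                 lookup time in seconds, or raising OSError on failure.
    max_seconds: assumed threshold above which a provider counts as unhealthy.
    """
    for name in providers:
        try:
            elapsed = check(name)
        except OSError:
            continue  # lookup failed outright; try the next provider
        if elapsed <= max_seconds:
            return name
    return None
```

For example, with providers `["primary-dns", "backup-dns"]` and a check that reports a 4-second lookup for the first and 30 ms for the second, the function returns `"backup-dns"`. Running such a check on a schedule from several locations is what turns it into synthetic monitoring.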

Shawn White, our Vice President of Digital Experience & Cloud Operations, said it best… “It’s ONLY with synthetic monitoring that one can quickly detect and diagnose DNS issues, both at an Internet-scale and for your individual website.”

The team here at Dynatrace will continue to monitor the situation and provide updates.