Welcome to the Show of CDN Monitoring: Act 2 – How and How Not to Monitor CDNs

In my first blog post, Act 1 – The What and Why, I talked about the benefits and some of the risks of using a Content Delivery Network (CDN). Today I will clear up some common misunderstandings about how to monitor CDNs and explain the right monitoring strategy.

Which monitoring options do you have?

To manage any complex system you need quality data as quickly as possible. All enterprise CDN solutions offer some level of insight into the performance they deliver. This insight is essentially a high-level aggregation of the relevant log file data. It tells you, for example, how many requests have been received, how much data has been sent out, what status codes were returned and how fast the servers responded.
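To make that kind of aggregation concrete, here is a minimal Python sketch. The `LogRecord` fields and the `summarize` function are hypothetical names chosen for illustration and do not reflect any particular vendor's log format or reporting API.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean


@dataclass
class LogRecord:
    """One delivered request (hypothetical, simplified log line)."""
    status: int          # HTTP status code returned
    bytes_sent: int      # response size in bytes
    response_ms: float   # server response time in milliseconds


def summarize(records):
    """Roll individual log records up into high-level totals."""
    records = list(records)
    if not records:
        return {"requests": 0, "bytes_sent": 0,
                "status_codes": {}, "avg_response_ms": None}
    return {
        "requests": len(records),
        "bytes_sent": sum(r.bytes_sent for r in records),
        "status_codes": dict(Counter(r.status for r in records)),
        "avg_response_ms": mean(r.response_ms for r in records),
    }


if __name__ == "__main__":
    sample = [
        LogRecord(200, 1000, 10.0),
        LogRecord(200, 500, 30.0),
        LogRecord(404, 100, 20.0),
    ]
    print(summarize(sample))
```

Note how the roll-up keeps totals and an average but drops which object, which PoP and which user were involved. That is the "not detailed enough" problem in miniature.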

However, this data has two major problems:

  • It is not detailed enough
  • It’s provided by the vendor you want to monitor

So instead of relying only on this information, you need to add your own monitoring strategy. There are two different flavors of such solutions.

Synthetic Monitoring of CDNs

Synthetic monitoring is traditionally the typical choice for availability or SLA monitoring, especially when you are dealing with a system you don't control yourself, like a CDN. Synthetic monitoring networks have been on the market for many years, offering excellent performance data measured from outside your own data center.

Most of these services rely on a limited number of locations from which the monitoring tests are executed. These backbone locations should be located in high-quality data centers with a tier 1 network connection, offering a very stable environment. This enables you to detect any relevant operational variations in the performance results, which is why they are often used for high-frequency availability and QoS monitoring. These results are much more realistic and accurate than anything you could deploy yourself.

Real User Monitoring of CDNs

A completely different approach is to look at what is happening within the browser of every visitor.

The difficulty with this approach lies in getting the relevant information from all your end users. A typical JavaScript-based real user monitoring solution happily collects information like the W3C navigation timings or details like the onload/unload events. These data points, however, do not include any information on the CDN performance of specific resources such as images or JavaScript files. Some solutions also allow the definition of custom events, which must be manually instrumented for a given application, or they use the W3C resource timings. Either way, you end up with a very tedious and thus costly customization, or you only get visibility into the few current browsers that support the newer W3C timings.

While JavaScript-based RUM gives you great insight into the overall situation and health of your application, it cannot deliver deeper details like network-level issues and is not at home in pre-production and testing environments.

How Not to Monitor CDNs

In almost all companies I encounter, synthetic tests are being used to manage performance and availability SLAs. Typically this is done with a number of backbone-quality agents, treating the CDN objects the same way as any other object.

Yet backbone monitoring fails quite badly when it comes to monitoring highly distributed systems like CDNs, something which has rightly been pointed out in the performance community for quite some time now.

The reason for this is quite simple. A good CDN service automatically optimizes performance by ensuring content is cached on the edge, optimizing routes and speeding up DNS resolution. The first time a piece of content is requested in a specific region, the supposedly closest server or Point of Presence (PoP) with the best route is found, but the CDN cache is typically still empty. The second time, the PoP and route should in most cases still be the best ones, but now the content is cached and thus everything is much faster.

Backbone locations do not move. They request the same content over and over again, in most cases hitting the same CDN PoP every time. That PoP therefore has a filled cache, and in addition it is very often located extremely close to the backbone agent, if not within the same tier 1 data center. For example, I often see response times of under 1 or 2 ms for a piece of content in backbone tests.

And since the backbone keeps hitting the same PoP over and over again all you get is a very good picture of how well that single CDN server is responding, but you stay blind to the rest of the network.
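The cold-versus-warm cache effect can be illustrated with a toy model. This is only a sketch: `EdgePoP`, `run_backbone_test` and the latency values (2 ms at the edge, 120 ms to the origin) are assumptions made up for illustration, not measurements.

```python
class EdgePoP:
    """Toy model of a single CDN PoP with an initially empty cache.

    The default latencies are made-up illustration values.
    """

    def __init__(self, name, edge_ms=2.0, origin_ms=120.0):
        self.name = name
        self.edge_ms = edge_ms        # time to answer from the edge cache
        self.origin_ms = origin_ms    # extra time to fetch from the origin
        self._cache = set()

    def fetch(self, url):
        """Return the simulated response time in ms for one request."""
        if url in self._cache:
            return self.edge_ms       # warm cache: served from the edge
        self._cache.add(url)          # cold cache: fetch once, then keep it
        return self.edge_ms + self.origin_ms


def run_backbone_test(pop, url, runs):
    """A fixed agent requesting the same object from the same PoP."""
    return [pop.fetch(url) for _ in range(runs)]


if __name__ == "__main__":
    pop = EdgePoP("pop-next-to-backbone-agent")
    print(run_backbone_test(pop, "/img/logo.png", 5))
```

In this model, only the very first request pays the origin penalty. Every later run of the same test reports the warm-cache time, which says little about what a first-time visitor in another region would experience.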

Monitoring from few locations

Ok, so all I have to do is to add a few more backbone locations into the mix and I should be good, right?

Not quite. Let's take a look at how the number of backbone monitoring locations relates to the number of CDN PoPs that actually get hit.

Over a period of 60 hours the distribution of CDN PoPs monitored from 1 backbone location (Munich) looks like this:

Chart showing the rate at which different PoPs are monitored

The chart shows the distribution between CDN PoPs, each one having its own color. Apart from two short blips, all the requests are being served from one CDN PoP.

What about adding more backbone locations to the mix?

2 backbone locations (Munich & Frankfurt) typically hit 2 CDN PoPs:

Monitoring CDN from 2 locations

Adding a 3rd backbone in Los Angeles:

Monitoring CDN from 3 locations

You get the drift.

Except for a few occasional hits on other machines, each monitoring location hits exactly one CDN PoP.
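You can run the same check against your own synthetic test results. The sketch below assumes that each test result has already been reduced to a `(location, pop_id)` pair. How the PoP is identified depends on the CDN and is not covered here, and the function names and sample values are hypothetical.

```python
from collections import Counter, defaultdict


def pops_per_location(samples):
    """Count how often each PoP served each monitoring location.

    `samples` is an iterable of (location, pop_id) pairs.
    """
    hits = defaultdict(Counter)
    for location, pop_id in samples:
        hits[location][pop_id] += 1
    return hits


def dominant_share(pop_counts):
    """Return the most frequently hit PoP and its share of all requests."""
    total = sum(pop_counts.values())
    pop_id, count = pop_counts.most_common(1)[0]
    return pop_id, count / total


if __name__ == "__main__":
    # Made-up sample: two locations, occasional blips to another PoP.
    samples = ([("Munich", "pop-a")] * 98 + [("Munich", "pop-b")] * 2
               + [("Frankfurt", "pop-c")] * 100)
    hits = pops_per_location(samples)
    for location, counts in hits.items():
        pop_id, share = dominant_share(counts)
        print(f"{location}: {len(counts)} PoP(s) seen, "
              f"{pop_id} served {share:.0%} of requests")
```

If the dominant share is close to 100% for every location, the number of PoPs you actually observe is roughly the number of locations you test from.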

Since all enterprise CDN services have hundreds or thousands of PoPs broadly distributed at the edge, you would be basing your decisions on extremely limited results.

Synthetic monitoring of CDNs is essential but using only a few locations is not the way to do it!

I am always surprised and shocked when I hear things like “We have a major problem. 25% of our CDN requests are really slow” based on data collected by 3 or 4 locations. Such an analysis is fundamentally wrong!
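A quick back-of-the-envelope calculation shows why. The sketch below assumes that each of four locations hits its own PoP and that the CDN runs 1,000 PoPs. Both figures and the function names are hypothetical, so substitute the numbers for your own CDN.

```python
def pop_coverage(observed_pops, total_pops):
    """Fraction of the CDN's PoPs that the tests actually reached."""
    if total_pops <= 0:
        raise ValueError("total_pops must be positive")
    return len(set(observed_pops)) / total_pops


def slow_share(slow_pops, observed_pops):
    """Fraction of the *observed* PoPs that were slow."""
    observed = set(observed_pops)
    if not observed:
        raise ValueError("no PoPs observed")
    return len(set(slow_pops) & observed) / len(observed)


if __name__ == "__main__":
    observed = ["pop-a", "pop-b", "pop-c", "pop-d"]  # one PoP per location
    slow = ["pop-c"]                                 # one location is slow
    total = 1000                                     # assumed network size

    print(f"slow share of observed PoPs: {slow_share(slow, observed):.0%}")
    print(f"share of the network observed: {pop_coverage(observed, total):.1%}")
```

Under these assumptions, the alarming "25%" describes a single PoP out of the four that were tested, while more than 99% of the network was never measured at all.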

Some companies have also started to use Real User Monitoring solutions to check on their CDN benefits. However, due to the limitations described above, they only get visibility into the performance of the HTML root objects and not into all the other resources served by CDNs. You only need to look at the ratio between the number of HTML documents and all the other resources (CSS, JavaScript, images, etc.) that the CDN delivers for your applications to understand the danger of such limited visibility.

Real User Monitoring of CDNs is a must but only looking at HTML resources is dangerous.

How to Monitor CDNs

Quite simple:

  • combine the strengths of synthetic and real user monitoring tools
  • account for the special nature of CDNs as an outsourced and widely distributed system

Use synthetic for CDN monitoring

As I mentioned, synthetic tools offer great value when it comes to monitoring. But a distributed system like a CDN needs to be monitored from distributed locations.

So what if you could turn the end user locations into monitoring locations?

What if, instead of 5, 10 or 100 locations in your specific region or across the globe, you could use 500, 1,000, 10,000 or 100,000?

Doing that should give you a much, much better picture of reality.

Monitoring CDN from many locations

So what does the picture look like when using this approach?

Monitoring CDN from 100s of locations

Instead of just a few PoPs, the synthetic tests run from distributed agents hit over 100 of them!

I will cover some of the key findings this approach uncovers in my follow-up blog post.

Use RUM for CDN monitoring

If many locations are obviously the only way to monitor CDNs then the best way surely would be to use all of my real users who access the site anyway from all over the place, right? Absolutely!

However, what you need is visibility into CDN and 3rd party performance at the object level, from all your real users, without having to custom-code a whole library!

Let's take a look at an application which is mainly delivered via a CDN and includes a number of additional 3rd party resources. Information like that shown below should be collected by a RUM solution without you having to spend time customizing JavaScript code. It separates the performance of all the resources (images, JavaScript, CSS, etc.) into host-specific buckets. With that information you can identify the impact of any slowdown. And not only that: since the hosts that serve content to your users are identified automatically, you get a complete list of 3rd party services. Think of how often the question of which tools are being used has come up in your organization. In most cases the effort just to compile such a list is quite high, let alone figuring out the performance impact they all might have.

With Real User Monitoring auto detected CDN hosts
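The host bucketing itself is simple once the per-resource data is available. The following Python sketch assumes the resource entries have already been collected in the browser and sent to a backend as dictionaries with a `name` (the resource URL) and a `duration` in milliseconds, mirroring two fields of the W3C Resource Timing entries. The function name and sample hosts are hypothetical.

```python
from collections import defaultdict
from statistics import median
from urllib.parse import urlparse


def bucket_by_host(entries):
    """Group per-resource timings by the host that served them."""
    buckets = defaultdict(list)
    for entry in entries:
        host = urlparse(entry["name"]).hostname or "(unknown)"
        buckets[host].append(entry["duration"])
    return {
        host: {
            "count": len(durations),
            "median_ms": median(durations),
            "max_ms": max(durations),
        }
        for host, durations in buckets.items()
    }


if __name__ == "__main__":
    # Made-up sample entries for one page view.
    entries = [
        {"name": "https://cdn.example.com/css/site.css", "duration": 10},
        {"name": "https://cdn.example.com/js/app.js", "duration": 20},
        {"name": "https://cdn.example.com/img/hero.jpg", "duration": 300},
        {"name": "https://widgets.example.net/share.js", "duration": 50},
    ]
    for host, stats in bucket_by_host(entries).items():
        print(host, stats)
```

The keys of the result double as the automatically discovered list of CDN and 3rd party hosts, and the per-host statistics show where a slowdown originates.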

In summary, the five CDN-specific monitoring rules are:

  • Get object-level visibility.
  • A distributed system needs to be monitored by a distributed system.
  • Make sure you know what your CDN is doing everywhere your end users are.
  • Use as many synthetic locations as possible and check how many CDN PoPs you are actually hitting.
  • Use RUM for your CDN to ensure you don't need to guess the total impact on your business.

In my next blog post (Act 3: Things going wrong) I will share some key findings that we see very often and that you need to be aware of.