In my first blog post, Act 1 – The What and Why, I talked about the benefits and some of the risks of using a Content Delivery Network (CDN). Today I will cover some common misunderstandings about how to monitor CDNs and explain the right monitoring strategy.
Which monitoring options do you have?
To manage any complex system you need quality data as quickly as possible. All enterprise CDN solutions offer some level of insight into the performance they deliver. This is typically based on high-level aggregation of relevant log file data and tells you, for example, how many requests were received, how much data was sent out, which status codes were returned, how fast the servers responded, and so on.
However, this data has two major problems:
- It is not detailed enough
- It’s provided by the very vendor you want to monitor
So instead of relying only on this information you need to add your own monitoring strategy. There are two different flavors of such solutions.
Synthetic Monitoring of CDNs
Synthetic monitoring is traditionally the typical choice for availability or SLA monitoring, especially when you are dealing with a system you don’t control yourself, like a CDN. Such synthetic monitoring networks have been on the market for many years, offering excellent performance data from outside your own data center.
Most of these services rely on a limited number of locations from which the monitoring tests are executed. These backbone locations should sit in high-quality data centers with a tier 1 network connection, offering a very stable environment. This stability lets you detect any relevant operational variation in the performance results, which is why backbone agents are often used for high-frequency availability and QoS monitoring. Their results are much more stable and accurate than anything you could deploy yourself.
Real User Monitoring of CDNs
A completely different approach is Real User Monitoring (RUM): looking at what is actually happening within the browser of every visitor.
How Not to Monitor CDNs
In almost all companies I encounter, synthetic tests are used to manage performance and availability SLAs. Typically this is done with a number of backbone-quality agents, treating the CDN objects the same way as any other content.
Yet backbone monitoring fails quite badly when it comes to highly distributed systems like CDNs; something the performance community has rightly been pointing out for quite some time now.
The reason for this is quite simple. A good CDN service automatically optimizes performance by caching content on the edge, optimizing routes, and speeding up DNS resolution. The first time a piece of content is requested in a specific region, the supposedly closest server, or Point of Presence (PoP), with the best route is found, but the CDN cache is typically still empty. The second time, the PoP and route will in most cases still be the best ones, but now the content is cached and everything is much faster.
Backbone locations do not move. They request the same content over and over again, in most cases hitting the same CDN PoP every time. That PoP therefore has a filled cache, and it is very often located extremely close to the backbone agent, sometimes even within the same tier 1 data center. For example, I often see response times under 1 or 2 ms for a piece of content in backbone tests.
And since the backbone agent keeps hitting the same PoP over and over again, all you get is a very good picture of how well that single CDN server is responding, while you stay blind to the rest of the network.
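To see this effect in your own test data, you can record which PoP served each request and whether it was a cache hit. This is a minimal sketch; the `X-Served-By` and `X-Cache` header names are assumptions for illustration, since the actual diagnostic headers vary from one CDN vendor to another:

```python
# Sketch: identify which CDN PoP served a request and whether it was a
# cache hit, based on response headers. Header names are an assumption
# ("X-Served-By" / "X-Cache" style); map them to what your CDN returns.

def classify_cdn_response(headers: dict) -> dict:
    """Extract a PoP identifier and cache status from CDN response headers."""
    pop = headers.get("X-Served-By", "unknown")
    cache = headers.get("X-Cache", "").upper()
    return {
        "pop": pop,
        "cache_hit": "HIT" in cache,  # a cold edge cache reports MISS
    }

# A warm-cache response from a (hypothetical) Frankfurt edge node:
warm = classify_cdn_response({"X-Served-By": "cache-fra1234", "X-Cache": "HIT"})
# And a cold-cache response, e.g. the first request in a region:
cold = classify_cdn_response({"X-Served-By": "cache-lax5678", "X-Cache": "MISS"})
```

If a backbone agent logs nothing but the same PoP identifier with `cache_hit` always true, it is measuring one warm server, not the CDN.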
Ok, so all I have to do is to add a few more backbone locations into the mix and I should be good, right?
Not quite. Let’s take a look at the correlation between the number of backbone monitoring locations and the number of CDN PoPs they hit.
Over a period of 60 hours the distribution of CDN PoPs monitored from 1 backbone location (Munich) looks like this:
The chart shows the distribution between CDN PoPs, each one having its own color. Apart from two short blips, all the requests are served from a single CDN PoP.
What about adding more backbone locations to the mix?
Two backbone locations (Munich & Frankfurt) typically hit two CDN PoPs:
Adding a third backbone location in Los Angeles:
You get the drift.
Except for a few occasional hits on other machines, each monitoring location hits exactly one CDN PoP.
Since all enterprise CDN services have hundreds or thousands of PoPs broadly distributed at the edge, you would be basing your decisions on extremely limited results.
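The pattern in the charts above is easy to check yourself: group your synthetic test results by location and count the distinct PoPs each one exercised. A minimal sketch, with purely illustrative sample data standing in for the 60 hours of measurements:

```python
# Sketch: given synthetic test results as (location, pop) pairs, count how
# many distinct CDN PoPs each monitoring location actually exercised.
# With only a few backbone locations this typically collapses to roughly
# one PoP per location. The sample data below is invented for illustration.
from collections import defaultdict

def pops_per_location(results):
    """Map each monitoring location to the sorted list of CDN PoPs it hit."""
    seen = defaultdict(set)
    for location, pop in results:
        seen[location].add(pop)
    return {loc: sorted(pops) for loc, pops in seen.items()}

# Simplified results from two backbone agents:
results = [
    ("Munich", "fra-pop"), ("Munich", "fra-pop"), ("Munich", "muc-pop"),
    ("Frankfurt", "fra-pop"), ("Frankfurt", "fra-pop"),
]
coverage = pops_per_location(results)
# Munich hit one PoP plus a short blip; Frankfurt hit exactly one.
```

If the resulting lists are all of length one, your "CDN monitoring" is really single-server monitoring.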
Synthetic monitoring of CDNs is essential but using only a few locations is not the way to do it!
I am always surprised and shocked when I hear things like “We have a major problem. 25% of our CDN requests are really slow” based on data collected from 3 or 4 locations. Such an analysis is fundamentally flawed!
Real User Monitoring of CDNs is a must, but only looking at HTML resources is dangerous.
How to Monitor CDNs
To get it right, you need to:
- combine the strengths of synthetic and real user monitoring tools
- account for the special nature of CDNs as an outsourced and widely distributed system
Use synthetic monitoring for CDNs
As I mentioned, synthetic tools offer great value for monitoring. But a distributed system like a CDN needs to be monitored from distributed locations.
So what if you could turn the end-user locations themselves into monitoring locations?
What if you could use not 5, 10 or 100 locations in your specific region or across the globe, but 500, 1,000, 10,000, 100,000?
That should give you a much, much better picture of reality.
So what does the picture look like when using this approach?
Instead of just a few PoPs, the synthetic tests run from distributed agents hit over 100 of them!
I will cover some of the key findings this approach uncovers in my follow-up blog post.
Use RUM for CDN monitoring
If many locations are obviously the only way to monitor CDNs, then the best approach surely would be to use all of my real users, who access the site from all over the place anyway, right? Absolutely!
However, what you need is visibility into CDN and third-party performance at the object level – from all your real users – without having to custom-code a whole library!
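In the browser, object-level data comes from the W3C Resource Timing API (`performance.getEntriesByType("resource")`). As a sketch of what your RUM backend might do with such entries once they are beaconed in, here is a per-object breakdown filtered to a CDN host; the field names follow the Resource Timing attributes, while the `cdn.example.com` host and the sample numbers are assumptions for illustration:

```python
# Sketch: object-level RUM breakdown. Entries mirror W3C Resource Timing
# attributes (millisecond timestamps); we compute per-object DNS, connect,
# and time-to-first-byte for resources served from a given CDN host.
# The host name and sample values below are hypothetical.
from urllib.parse import urlparse

def cdn_object_timings(entries, cdn_host):
    """Per-object timing breakdown for resources served from cdn_host."""
    out = []
    for e in entries:
        if urlparse(e["name"]).hostname != cdn_host:
            continue
        out.append({
            "url": e["name"],
            "dns_ms": e["domainLookupEnd"] - e["domainLookupStart"],
            "connect_ms": e["connectEnd"] - e["connectStart"],
            "ttfb_ms": e["responseStart"] - e["requestStart"],
        })
    return out

entries = [{
    "name": "https://cdn.example.com/app.js",
    "domainLookupStart": 10.0, "domainLookupEnd": 35.0,
    "connectStart": 35.0, "connectEnd": 80.0,
    "requestStart": 80.0, "responseStart": 210.0,
}]
timings = cdn_object_timings(entries, "cdn.example.com")
```

Aggregating these per-object numbers across all real users is exactly the distributed, object-level visibility that aggregated HTML-only timings cannot give you.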
In summary, the five CDN-specific monitoring rules are:
- Get object-level visibility.
- A distributed system needs to be monitored by a distributed system.
- Make sure you know what your CDN is doing everywhere your end users are.
- Use as many synthetic locations as possible and check how many CDN PoPs you are actually hitting.
- Use RUM monitoring of your CDN so you don’t need to guess the total impact on your business.
In my next blog post (Act 3: Things Going Wrong) I will share some key findings we see very often and which you need to be aware of.