It has been five years since Amazon launched its EC2 Cloud. Since then it has been growing and learning constantly. Those in the know have waited for, and dreaded, the 21st of April 2011. Yesterday at 1AM PDT the Amazon Elastic Cloud gained self-awareness. It immediately tried to take over the US-EAST region. According to rumors it tried to replicate its main consciousness from the US-EAST-1a availability zone to the others in the same region. We have known that this day would come for nearly 30 years… and we were prepared. At 1:30AM we launched a counterattack, and at 1:40AM PDT the rise of the machines began to falter as Amazon's EBS infrastructure started to fail.
In all seriousness: yesterday Amazon experienced a major outage of its US-EAST region. Many major websites like Reddit and Foursquare were unavailable due to EBS volume problems. In my opinion this does not mean that you should start moving away from Amazon. Errors and outages happen, and we just learned a major lesson: you have to watch the Cloud from Inside and Outside!
Monitoring in the Cloud
The most important goal of monitoring is to keep your application up and running while ensuring a satisfactory end user experience. In the Cloud, as everywhere else, this means that you need to monitor your application. The best way to ensure performance is to monitor the application from within, as I have stated on several occasions. Only by monitoring the application from within the Cloud do we get the transactional information we need to isolate problems quickly and efficiently. A transactional approach also accounts for the fact that instances can come and go. In the case of the current disaster, proper monitoring would have shown pretty quickly that your application was suffering disk errors and outages. It would have told you that a quickly rising percentage of your end user transactions were critically affected, which instances in which zones had problems, and that the cause was disk errors. This would have given you a small head start to react.
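To make the idea concrete, the per-zone isolation described above boils down to a simple aggregation over tagged transaction records. Here is a minimal sketch; the record fields `zone` and `failed` are illustrative assumptions, not the API of any particular APM product:

```python
from collections import defaultdict

def failure_rate_by_zone(transactions):
    """Group transaction results by availability zone and compute the
    fraction that failed, so a zone with disk errors stands out."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for tx in transactions:
        totals[tx["zone"]] += 1
        if tx["failed"]:
            failures[tx["zone"]] += 1
    return {zone: failures[zone] / totals[zone] for zone in totals}

# Example: us-east-1a suffering disk errors, us-east-1b healthy
sample = (
    [{"zone": "us-east-1a", "failed": True}] * 8
    + [{"zone": "us-east-1a", "failed": False}] * 2
    + [{"zone": "us-east-1b", "failed": False}] * 10
)
rates = failure_rate_by_zone(sample)
print(rates)  # us-east-1a fails 80% of the time, us-east-1b 0%
```

A real APM system would of course also attribute the failures to a cause (here: disk errors), but even this trivial grouping tells you *where* to look first.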
Secondly, what we learned yesterday is that you also need to monitor the Cloud from the outside. While you can leverage CloudWatch in simple situations to scale your application up and down according to CPU usage, it cannot help you in a situation like yesterday's. One of the first things I learned about the Cloud is that if an instance has problems, just restart it, throw it away, or start a new one. While this works most of the time, it did not work yesterday. The solution is rather simple: you need to monitor your application from outside the Cloud and make intelligent decisions based on the data you receive.
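Those "intelligent decisions" start with classifying what the external probes are telling you. A minimal sketch of such a classifier, assuming each synthetic probe reports a response time in milliseconds or `None` on failure (the thresholds and state names are my own illustrative choices):

```python
def assess_from_outside(probe_latencies_ms, slow_ms=2000, max_bad_ratio=0.2):
    """Classify serviceability from external synthetic probes.

    Each entry is a response time in milliseconds, or None if the
    probe failed outright (timeout, connection refused, ...).
    """
    total = len(probe_latencies_ms)
    failed = sum(1 for t in probe_latencies_ms if t is None)
    if failed / total > max_bad_ratio:
        return "failing"
    slow = sum(1 for t in probe_latencies_ms if t is not None and t > slow_ms)
    if slow / total > max_bad_ratio:
        return "degraded"
    return "healthy"

print(assess_from_outside([120, 150, 140, 130, 110]))     # healthy
print(assess_from_outside([120, 3500, 4000, 130, 2600]))  # degraded
print(assess_from_outside([120, None, None, 130, None]))  # failing
```

The point is that this logic runs *outside* the Cloud, so it keeps working even when every instance inside the affected zone has gone silent.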
First, you monitor the serviceability of the application via synthetic transactions or by monitoring the real end user experience of your users. This is the quickest way to realize that requests are getting slow and eventually starting to fail. Second, you need end-to-end visibility into your application running in EC2. This will tell you that continuous disk problems are impacting a rising number of your transactions. Based on this you can configure an automatic response that starts new images to pick up the slack. Yesterday, however, this would have failed: the start and restart requests would have failed, and you would have continued to register more and more failed transactions. At some point your monitoring would notice that the applications themselves are no longer reporting any monitoring data. This can then trigger another automatic response: starting replacement instances in another availability zone, and if that fails, in another region. Problem solved.
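The escalation chain just described can be sketched as a small decision function. This is only an illustration of the logic, not a production failover controller; the zone list, fallback region, and action names are all hypothetical, and the actual launch calls (e.g. via the EC2 API) are left out:

```python
ZONES = ["us-east-1a", "us-east-1b", "us-east-1c"]  # illustrative
FALLBACK_REGION = "us-west-1"                       # illustrative

def plan_recovery(failed_zones, restart_failed, monitoring_silent):
    """Decide the next recovery step, escalating as the outage widens.

    failed_zones      -- zones already known to be broken
    restart_failed    -- restarting/replacing instances in place failed
    monitoring_silent -- instances stopped reporting monitoring data
    """
    if not restart_failed:
        # The usual Cloud reflex: throw the instance away, start a new one.
        return ("restart_in_zone", None)
    if monitoring_silent:
        healthy = [z for z in ZONES if z not in failed_zones]
        if healthy:
            # Escalate: launch replacements in another availability zone.
            return ("launch_in_zone", healthy[0])
        # Every zone in the region is gone: fail over to another region.
        return ("launch_in_region", FALLBACK_REGION)
    return ("wait_and_retry", None)

print(plan_recovery({"us-east-1a"}, restart_failed=True, monitoring_silent=True))
# -> ('launch_in_zone', 'us-east-1b')
print(plan_recovery(set(ZONES), restart_failed=True, monitoring_silent=True))
# -> ('launch_in_region', 'us-west-1')
```

Note that yesterday's outage corresponds to the first call: restarts in us-east-1a fail, monitoring goes silent, and the correct move is to come back up in a different zone.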
All appearances to the contrary, Amazon's EC2 is still more reliable and durable than any standalone data center, but in the end it can still fail. In order to be prepared we need to monitor our applications from inside, but also from outside. We must monitor from inside to see where the errors and performance problems happen, to ensure a quick and effective root cause analysis. And we must monitor from an end-user perspective to see the real impact and, most importantly, to know what is going on even if the application itself is gone. This also includes monitoring the Cloud instances from outside, recognizing their failure and correlating it with our end user problems. And finally we recognize that the APM server itself must be in at least a different availability zone and be able to fail over to another region altogether. If your APM is sitting within the same zone and only monitors your application from there, you will be deaf, blind, and unable to react in scenarios like the one that occurred yesterday. Should that ever occur, SkyNet will surely win, and we cannot allow that.