Recently, one of our customers, let's call them PointInFact, had a very typical problem. After deploying a new version of their software, some user requests degraded horribly: requests that should have taken half a second took up to a minute. Interestingly, the PointInFact team runs a multi-tenant SaaS solution in the AWS Cloud and relies heavily on cloud services. This reliance makes user experience management and fault-domain isolation very challenging.
Back Story: Application Running in the AWS Cloud
PointInFact runs a SaaS service. Internally this results in a multi-tenant setup where each customer has their own instance of the application they subscribe to. All of these applications and services are hosted in Amazon's EC2 Cloud, where the team dynamically creates new application environments and offloads some functionality to AWS by using the provided services. As a SaaS business, customer satisfaction is very important to them; as a consequence, they monitor all applications and services centrally, from both an end-user and a server-side perspective, with Dynatrace.
After one deployment, the APM solution informed the operations team that user experience was degrading. A look at the geographical distribution showed that this was not a localized phenomenon but a worldwide one.
Notice all the red circles in the above screenshot; each one indicates frustrated users. One particularly interesting fact in this dashboard is that the average web page response time (upper right corner) remains stable and well below the one-second mark. This means that the system in general is still running fine and not in a general meltdown. However, it also shows why relying on averages alone is a bad idea for monitoring and why server-side response times by themselves are not enough. Your end users are at the edge, around the world, and not sitting next to one of Amazon's data centers!
The next thing the operations team did was look at the application flow. They were hoping for something big to jump out immediately, but nothing much out of the ordinary showed up.
This is not really surprising; they were looking at an application flow overview of about half a million transactions: the averaging effect in full force.
The interesting takeaway, however, was that although user experience suffered across the board, it could not be attributed to a general meltdown of the environment. It was time to look at specific transaction types and their baselines.
In the dashboard above, the marked and highlighted upper right chart shows that one particular service call in the application was off the charts! The dashboard also shows that, at the same time, the CPU (lower left corner) of one of their servers was exhausted. Were the two events related, even though they occurred on different hosts? A detailed look at the offending request type revealed something very interesting.
The highlighted chart in the middle shows the CPU distribution of the offending service calls. CPU was spiking, and the root cause could be attributed to XML processing and subsequent XSL transformations, as indicated by the yellow and blue bars, which represent XML and XSL processing respectively. This was the reason for the CPU exhaustion noticed earlier.
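To put that finding in context, here is a minimal sketch of the kind of work involved, assuming a standard Java stack with the JDK's built-in XSLT support (the post does not show the actual code, and the class and method names here are hypothetical). Parsing an XML document and running an XSL transformation is CPU-bound work: cheap when done once per document, expensive when repeated for every request.

```java
import java.io.StringWriter;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Illustrative only: rendering an XML document by applying an XSL stylesheet
// via javax.xml.transform. This is the kind of processing that, repeated for
// every request instead of once per document, can exhaust the application
// tier's CPU.
public class DocumentRenderer {

    private final Templates stylesheet; // compiled stylesheet, reusable across requests

    public DocumentRenderer(StreamSource xsltSource) throws TransformerConfigurationException {
        this.stylesheet = TransformerFactory.newInstance().newTemplates(xsltSource);
    }

    public String render(StreamSource xmlDocument) throws TransformerException {
        StringWriter out = new StringWriter();
        stylesheet.newTransformer().transform(xmlDocument, new StreamResult(out));
        return out.toString();
    }
}
```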
However, having determined the likely cause of the slowdown, the PointInFact team took a step back and asked which users and documents were impacted by it.
This was very important for two reasons. First, it allowed them to be proactive with their own users who experienced slowdowns. Second, it further isolated the real problem area.
The Performance Bottleneck That Should Not Be
Now that the trigger for the slowdown was revealed, the performance team looked into the root cause. When looking at the Transaction Flow for the impacted business transactions, two things stood out.
One can see that most of the response time is spent in the document request service (lower right corner). In addition, they knew from the previous dashboard that the application tier consumed a lot of CPU in XML/XSLT processing. The conclusion for the performance team was clear: caching was not working!
To understand this, we need to know that the document requests and subsequent transformations should only happen once per document; after that, all follow-up requests should take the result from the cache. PointInFact leverages the memcached-compliant AWS ElastiCache for this purpose. What the analysis revealed was that the same document was being transformed many times; hence, caching was not working!
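For illustration, here is a minimal sketch of that cache-aside pattern against a memcached-compatible endpoint such as ElastiCache. It uses the spymemcached client purely as an example; the post does not name the client library the team actually used, and the endpoint, TTL, and transformDocument helper are hypothetical.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

// Sketch of the intended cache-aside flow: look the rendered document up in
// the cache first, and only transform (and store) it on a cache miss.
public class DocumentCache {

    private static final int DOCUMENT_TTL_SECONDS = 3600; // hypothetical TTL

    private final MemcachedClient cache;

    public DocumentCache(String elastiCacheEndpoint) throws IOException {
        // ElastiCache's memcached-compatible endpoint, default port 11211
        this.cache = new MemcachedClient(new InetSocketAddress(elastiCacheEndpoint, 11211));
    }

    public String getRenderedDocument(String documentKey) {
        String cached = (String) cache.get(documentKey);
        if (cached != null) {
            return cached;                                 // follow-up requests should land here
        }
        String rendered = transformDocument(documentKey);  // expensive XML/XSLT work
        cache.set(documentKey, DOCUMENT_TTL_SECONDS, rendered); // store the result for the next request
        return rendered;
    }

    private String transformDocument(String documentKey) {
        // placeholder for the real document fetch and XSL transformation
        return "<html>rendered " + documentKey + "</html>";
    }
}
```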
The obvious conclusion was that there was a problem with ElastiCache. As this was a third-party component, the customer needed more information before approaching Amazon with support requests. Thanks to their APM strategy, they had sufficient insight into their usage of ElastiCache in production. This turned out to be very good because, without that insight, opening a support ticket for ElastiCache would not only have been time-consuming, it would also have been futile, as we shall see!
Do or Do Not Cache, There Is No Try…
In an attempt to get more information about the caching problem, the customer identified the real root cause. While each of the offending document requests was doing a cache lookup upfront, none of them put the result in the cache afterwards! There was no problem with the cache; it simply was not being used!
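In code terms, and regardless of where in the stack the omission originated, the observed behavior was as if the store step from the earlier cache-aside sketch never ran. A hypothetical, simplified variant of that sketch's lookup method illustrates the effect: every request misses and pays the full transformation cost again.

```java
// Hypothetical illustration of the observed behavior: the lookup runs on
// every request, but no corresponding store ever reaches the cache, so each
// request for the same document misses and repeats the expensive work.
public String getRenderedDocument(String documentKey) {
    String cached = (String) cache.get(documentKey);
    if (cached != null) {
        return cached;                                  // rarely, if ever, reached
    }
    String rendered = transformDocument(documentKey);   // expensive XML/XSLT work
    // missing in effect: cache.set(documentKey, DOCUMENT_TTL_SECONDS, rendered);
    return rendered;
}
```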
At this point, the development team started looking into it at the code level and identified a problem with the cache client library they were using. That cache client was also a third-party component, but now they had something tangible to share with its maintainers. Long story short, the bug was fixed upstream, and one deployment later the issue was resolved to everybody's satisfaction.
To me, this story shows the true power of a continuous, holistic APM approach in a cloud-based environment.
- The customer was able to identify a problem in production that had a big end-user impact, even though the average transaction was still considered fast.
- The operations team could identify exactly which users were impacted and be proactive in their customer support.
- More importantly, the R&D team was able to identify the real root cause in one third-party component while avoiding a lengthy, and ultimately futile, back and forth with another third-party vendor (Amazon ElastiCache).
Finally, PointInFact was able to track down the root cause in sufficient depth to provide the responsible third party with a fix proposal, giving them a faster turnaround on a permanent solution. And all of this in a public, globally distributed, multi-tenant cloud application!