Thinking cloud means thinking application-centric. Organizations use public cloud offerings because they want to focus on their revenue-generating applications rather than on managing infrastructure. From that alone we can already see why Application Performance Management (APM) matters in public cloud environments.

Significance of APM in public clouds—let’s focus on our application and measure its performance directly rather than estimate its performance based on infrastructure metrics!

If we focus on our application, its performance is essential. Even more important is how that performance is perceived by our users. To gain this insight, we need to monitor our application from our users’ perspective; that’s where User Experience Management (UEM) comes into play.

In my previous post I used the time cockpit example to discuss how APM and UEM are cornerstones of successful public cloud deployments. Now I want to share some thoughts on the paradigms that I consider requirements for modern APM solutions.

A quick summary: it’s not enough to just look at the performance of our application’s infrastructure or simply at how it’s delivered in the browser. Let’s understand why an application is not performing to our expectations. Is it our design? Our implementation? The infrastructure? Or how our users interact with it?

Paradigm #1: It’s Deep

To answer these questions, deep insight into our application across all tiers is key. Statistical averages of response times or a list of the 100 slowest database statements might be helpful, but they are not sufficient for optimization at the application level. We need rich context information; ideally, we capture all transactions, all the time. Why is this important?

All transactions: Most likely, slow transactions are not just slow; they execute different code, e.g. data-specific processing or error and exception handling. If we only monitor the average, we lose the ability to detect and optimize those slow transactions. Likewise, optimizing slow database statements themselves is only one part of the game. In the end we want to optimize our application and not isolated hotspots, which means we also want to optimize database usage, e.g. by implementing data caching on the client side or adding indices to support the top ten queries.

Always on: Likewise, after-the-fact recording, i.e. switching on deep data capturing only after we have identified a problem, is not enough. Always-on deep-dive capturing allows us to analyze what happened prior to the error that grabbed our attention and to learn how our application behaved before the incident.
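
To make the first point concrete, here is a minimal sketch in plain Python (synthetic numbers, not tied to any particular APM product) showing how an average hides a small cohort of transactions that take a different, much slower code path:

```python
import random
import statistics

random.seed(42)

# Simulated response times in ms: 95% of requests take the normal code path,
# 5% hit a different, much slower path (e.g. error or exception handling).
# All numbers are synthetic.
normal = [random.gauss(120, 15) for _ in range(950)]
slow = [random.gauss(900, 200) for _ in range(50)]
all_requests = normal + slow

def percentile(values, p):
    """Nearest-rank percentile of a list of values."""
    ordered = sorted(values)
    index = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[index]

print(f"mean : {statistics.mean(all_requests):7.1f} ms")   # looks harmless
print(f"95th : {percentile(all_requests, 95):7.1f} ms")    # the slow cohort appears
print(f"99th : {percentile(all_requests, 99):7.1f} ms")
```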

Therefore, the ideal solution provides code-level visibility for all transactions in our system. This helps us identify the underlying reason for slow transactions and which other transactions (even fast ones) impact them. It also enables us to investigate transactions that are fast from an end-user perspective but still degrade the overall user experience.
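
As a rough illustration of what code-level visibility means, the following sketch records per-method timings under a transaction ID. It is a toy example with hypothetical function names; it is not how PurePath or any other product is actually implemented.

```python
import threading
import time
import uuid
from collections import defaultdict
from functools import wraps

# Per-thread transaction context plus a store of captured call timings.
_context = threading.local()
_timings = defaultdict(list)

def traced(func):
    """Record the duration of every call under the current transaction ID."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            tx_id = getattr(_context, "tx_id", "untracked")
            _timings[tx_id].append((func.__name__, time.perf_counter() - start))
    return wrapper

@traced
def query_database():
    time.sleep(0.05)          # stand-in for a call to another tier

@traced
def handle_request():
    _context.tx_id = str(uuid.uuid4())   # a new transaction starts here
    query_database()

handle_request()
for tx_id, calls in _timings.items():
    print(tx_id, [(name, f"{duration * 1000:.1f} ms") for name, duration in calls])
```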

The PurePath follows one transaction—in this case a web request—on code level across all tiers.

Paradigm #2: It’s Meaningful

Having this detailed information for every transaction speeds up root-cause identification. However, to get the big picture of our application’s performance, proper aggregation is crucial, as it would be too time-consuming to analyze thousands or millions of potentially problematic transactions. Hence, APM solutions commonly apply the concept of Business Transactions (be aware of vendor-specific definitions, though): similar transactions that fulfill a business-critical service to end users, e.g. logins, purchases, add-to-cart, and searches, are grouped together. This also ensures that we monitor our application from the same perspective as our users, rather than looking at abstract infrastructure metrics such as CPU or memory utilization.
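
A simple way to picture this aggregation: classify each request into a Business Transaction and aggregate its measurements per group. The sketch below uses hypothetical URL patterns and hard-coded rules purely for illustration; real APM solutions let you configure this classification.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical request log entries: (URL path, response time in ms).
requests = [
    ("/orders/checkout", 310), ("/search?q=hotel", 95),
    ("/login", 140), ("/search?q=berlin", 120), ("/login", 980),
]

RULES = {
    "Login": lambda path: path.startswith("/login"),
    "Search": lambda path: path.startswith("/search"),
    "Checkout": lambda path: path.startswith("/orders/checkout"),
}

def classify(path):
    """Map a single request to the Business Transaction it belongs to."""
    for name, matches in RULES.items():
        if matches(path):
            return name
    return "Other"

grouped = defaultdict(list)
for path, response_time in requests:
    grouped[classify(path)].append(response_time)

for name, times in grouped.items():
    print(f"{name:8s} count={len(times)} avg={mean(times):6.1f} ms max={max(times)} ms")
```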

Business Transactions are ordered by significance; here, easyTravel Logins is ranked top due to its response-time baseline breach. Due to the low load of only three requests per minute, it takes some time to gather enough statistical significance to identify the baseline violation and avoid false alerting.

Ideally, these Business Transactions grab your attention when abnormal behavior occurs in, for example, throughput, error rate, or response time. A graphical representation and indication of such a baseline breach is very helpful. Also, since these metrics behave differently, it makes sense to use different methods for calculating their baselines.
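
As one example of such a baseline calculation, the following sketch flags a response-time sample that deviates strongly from recent history; it also reflects the point from the screenshot above that a minimum number of samples is needed before alerting with statistical significance. The three-sigma rule used here is just one of several possible methods:

```python
from statistics import mean, stdev

def breaches_baseline(history, current, sigmas=3.0):
    """Flag `current` if it deviates more than `sigmas` standard deviations from
    the mean of the recent history. Just one possible baseline method; e.g.
    error rates are often better judged against a fixed threshold, and
    response times against percentiles of past behaviour."""
    if len(history) < 20:      # too few samples for statistical significance
        return False
    baseline, spread = mean(history), stdev(history)
    return abs(current - baseline) > sigmas * spread

response_times = [110, 118, 105, 122, 115] * 5    # 25 samples of normal behaviour
print(breaches_baseline(response_times, 130))     # False: within normal variation
print(breaches_baseline(response_times, 450))     # True: likely a baseline breach
```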

Paradigm #3: Clarity vs. Cloudy

The concept of Business Transactions is also helpful when monitoring our cloud vendor’s performance. I ran a test against the Windows Azure Storage services with storage accounts located in data centers spread around the globe. The screenshot below shows an overview; you can read about the details of this test here.
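
Conceptually, such a test boils down to periodically timing the same storage calls from different regions. The sketch below approximates this with plain HTTP GETs against placeholder blob URLs (account, container, and blob names are hypothetical, not the ones from the actual test, which was measured by the APM solution itself):

```python
import time
import urllib.request

# Placeholder, publicly readable blob URLs per region.
ENDPOINTS = {
    "Europe": "https://myaccount-eu.blob.core.windows.net/public/probe.txt",
    "US":     "https://myaccount-us.blob.core.windows.net/public/probe.txt",
    "Asia":   "https://myaccount-asia.blob.core.windows.net/public/probe.txt",
}

def measure(url, timeout=10):
    """Time a single GET request and return its duration in milliseconds."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()
    return (time.perf_counter() - start) * 1000

for region, url in ENDPOINTS.items():
    try:
        print(f"{region:7s} {measure(url):7.1f} ms")
    except OSError as error:
        print(f"{region:7s} failed: {error}")
```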

The response time for BLOB calls to the European storage account shows baseline violations for about 20 minutes every two hours, whereas the Queue and Table Storage Services remain stable

Paradigm #4: Built-In User Experience Management

Let’s step back and view our application’s performance from our users’ point of view. Obviously, our users don’t directly experience server-side or infrastructure metrics. That’s why this perspective is very valuable for Application Performance Management.

As a particular example, we are also interested in how our CDNs are performing. Therefore, modern APM solutions must include a UEM approach that provides visibility into CDN performance and other third-party content such as social media widgets. The screenshot below shows a CDN node of widgetserver.com, but this would work just as well with Windows Azure CDN nodes or Amazon’s CloudFront. Other examples can be found in the previous post.
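
In a real deployment this data comes from the end user’s browser (for example via the W3C Resource Timing API) and is reported back to the APM solution. As a rough sketch of the aggregation step only, assuming hypothetical beacon data per third-party resource:

```python
from collections import defaultdict
from statistics import mean
from urllib.parse import urlparse

# Hypothetical beacon data as a browser-side UEM agent might report it:
# (resource URL, load time in ms, failed flag). Numbers are invented.
beacons = [
    ("http://cdn.widgetserver.com/widget.js", 265, False),
    ("http://cdn.widgetserver.com/widget.css", 290, False),
    ("http://connect.facebook.net/en_US/all.js", 410, False),
    ("http://cdn.widgetserver.com/widget.js", 276, False),
]

per_domain = defaultdict(lambda: {"times": [], "failures": 0})
for url, load_time, failed in beacons:
    stats = per_domain[urlparse(url).hostname]
    stats["times"].append(load_time)
    stats["failures"] += 1 if failed else 0

for domain, stats in per_domain.items():
    print(f"{domain:28s} avg={mean(stats['times']):6.1f} ms "
          f"failed={stats['failures']}/{len(stats['times'])}")
```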

The average response time of cdn.widgetserver.com is 277ms; none of the requests are failing

Paradigm #5: Versatility

Finally, state-of-the-art APM solutions automatically discover your application’s flow across all tiers, in any environment, virtualized or not. This is significant, since we will face many hybrid environments in the near future: organizations will migrate their applications to the cloud one by one, some applications will never be migrated, and there are scenarios where it makes sense to use several clouds within one enterprise or even within one application (read more about that here).
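
The core idea that makes this possible is transaction correlation across tiers, regardless of where each tier runs. The sketch below illustrates the mechanism with two in-process functions standing in for services in different clouds and a header name invented purely for illustration; real APM agents inject and read such correlation data automatically, without code changes:

```python
import uuid

# Every outgoing call carries a transaction ID, so the receiving tier, in
# whichever cloud it runs, can attach its own measurements to the same
# transaction. The header name below is illustrative, not a real standard.
HEADER = "X-Example-Transaction-Id"

def frontend_handles_request(incoming_headers):
    # Reuse an existing transaction ID or start a new transaction.
    tx_id = incoming_headers.get(HEADER, str(uuid.uuid4()))
    print(f"[frontend, Windows Azure] tx={tx_id}")
    backend_handles_request({HEADER: tx_id})       # e.g. an HTTP call into EC2

def backend_handles_request(incoming_headers):
    tx_id = incoming_headers[HEADER]
    print(f"[backend,  Amazon EC2]    tx={tx_id}")

frontend_handles_request({})   # a new user request without an existing ID
```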

We can follow the application that is deployed across the Windows Azure and EC2 clouds through all tiers seamlessly, independent of the cloud vendor or technology

Summary

The cloud story illustrates the importance of APM for our business-critical applications very well: it’s all about the application. We don’t want to sporadically firefight isolated hotspots within our application; instead, we want to view and optimize our application as a whole, including our end users’ perspective.