Cloud infrastructure monitoring checklist: Are you covered?

Published August 24, 2017. Updated November 9, 2022.

Katalin Haverinen-Varga

Recently I got a glimpse into the cloud migration story of one of our newest customers, and into how they covered their cloud infrastructure monitoring needs. They happen to be one of the biggest industrial companies worldwide.

The company recently implemented a cloud-first initiative. Accordingly, they migrated their previously outsourced, business-critical enterprise applications into an internally managed AWS environment with a “lift-and-shift” approach. Their new environment, made up of hundreds of hosts and very diverse technologies in the AWS cloud, triggered the need for:

What we see here at Dynatrace is that demand for elastic, fully scalable cloud infrastructure keeps growing as cloud-native apps become the standard for companies of all sizes that want to create better customer experiences.

No wonder that early this year IDC forecasted that:

spending on off-premises cloud IT infrastructure will experience a five-year compound annual growth rate (CAGR) of 14.2%, reaching $48.1 billion in 2020.

But what we also experience is that once in the cloud, these companies quickly realize that in this new environment their traditional infrastructure monitoring approach no longer works.

A few questions to consider

Are you implementing a cloud-based infrastructure for your business-critical apps, like the company in my introduction? Whether it’s on AWS, Azure, Google Cloud, OpenStack or CloudFoundry, you might want to consider the following questions before you start monitoring it with a bunch of different tools.

1. How easy is the solution to implement, configure, and maintain?

With the increasingly complex environments of today’s applications, ease of implementation and ease of use become more than just nice-to-haves — they are essential.

Traditional monitoring solutions require too much manual instrumentation and configuration – a reason why most companies today are only monitoring 5 to 10 percent of their applications.

My recommendation: look out for a monitoring tool that has already embraced the power of automation. This means auto-discovery of your cloud environment, auto-baselining, or even automatic root cause analysis.

2. Does it provide real-time insights into the health of your cloud resources?

Whether you choose to run a public, private, or hybrid cloud, virtualize your datacenter, or simply deploy your applications to CloudFoundry, you should expect every monitoring solution to give you a complete, real-time picture of the health of your entire cloud-based architecture.

Do you have your containers under control? What about your load balancers? And what about your hypervisor dynamics? A cloud infrastructure has so many moving parts that it is difficult to identify the underlying cause of aberrant system behavior.

Choose a cloud monitoring solution that has been built from the ground up with dynamic environments in mind. Such a solution can eliminate blind spots and keep up with every change in a dynamic environment.

3. Does it provide full-stack application performance monitoring, or only firefighting capabilities at infrastructure level?

Even though a solid cloud infrastructure is the backbone of any successful business, at the end of the day it’s all about the applications. And if they fail, users can be cruel.

Your applications may span many technology tiers and components, from the cloud through the back-end data center to the mainframe. To get a full-stack view of all your applications, you will need the ability to monitor from different perspectives:

Digital Experience Analytics
Application Performance Management
Cloud and Infrastructure Monitoring

If you care about your apps, I recommend that you choose a unified monitoring tool that provides a holistic view of not only your cloud infrastructure, but also of your applications running on it.

4. How fast does it let you find the root cause of an issue?

What’s one of the biggest obstacles plaguing your IT teams? If it’s alert overload, you are not alone.

Companies still often use different monitoring tools to look at datacenters, hosts, processes and services. When any of these components fails or slows down, it can trigger a chain reaction of hundreds of other failures, leaving IT teams drowning in a sea of alerts. Tools with a traditional alerting approach leave you with countless metrics and charts, and then it’s up to you to correlate those metrics to determine what is really happening.

The solution? Using a tool that gives you causation instead of correlation.

If a monitoring tool captures every transaction all the time and tags every remoting call, it gives the performance engineer causation-based data, and with it the confidence and hard facts about what is causing system problems. Being able to point the Dev team directly to the root cause is priceless when time, money and your business reputation are on the line.
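To make the tagging idea a bit more concrete, here is a minimal sketch in Python. It is not how Dynatrace or any other product implements tracing; the `Span`, `Tracer` and `root_causes` names and the service names are hypothetical, made up for this illustration. Each call records a span tagged with the transaction's ID and the ID of the call that triggered it, so a chain of failures can be followed down to the deepest failing call instead of being guessed from metrics that merely move together.

```python
import itertools
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    trace_id: str             # tag shared by every call in one transaction
    span_id: int              # unique ID of this call
    parent_id: Optional[int]  # ID of the call that triggered this one
    service: str
    failed: bool = False


class Tracer:
    """Toy tracer: records one tagged span per call (illustrative only)."""

    def __init__(self) -> None:
        self.spans: List[Span] = []
        self._ids = itertools.count(1)

    def call(self, trace_id: str, service: str,
             parent: Optional[Span] = None, failed: bool = False) -> Span:
        parent_id = parent.span_id if parent else None
        span = Span(trace_id, next(self._ids), parent_id, service, failed)
        self.spans.append(span)
        return span


def root_causes(spans: List[Span], trace_id: str) -> List[str]:
    """Return services whose span failed without any failed child call.

    A failed span that has a failed child is treated as a follow-on
    failure; the deepest failure in each call chain is the candidate
    root cause.
    """
    in_trace = [s for s in spans if s.trace_id == trace_id]
    parents_of_failures = {s.parent_id for s in in_trace if s.failed}
    return [s.service for s in in_trace
            if s.failed and s.span_id not in parents_of_failures]


# One transaction: three alerts fire, but only one call is the origin.
tracer = Tracer()
web = tracer.call("t1", "web-frontend", failed=True)
orders = tracer.call("t1", "order-service", parent=web, failed=True)
db = tracer.call("t1", "orders-db", parent=orders, failed=True)
cache = tracer.call("t1", "cache", parent=web)

print(root_causes(tracer.spans, "t1"))  # ['orders-db']
```

Three components report failures here, yet the parent tags show that only `orders-db` failed on its own; the other two failed because a call they depended on failed.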

5. How does the solution handle performance baselining for ultra-dynamic environments?

Setting up performance baselines is another tricky part of cloud infrastructure monitoring. With traditional APM it can involve a lot of time-consuming and potentially error-prone manual effort, especially because most traditional APM tools rely on averages and transaction samples to determine normal performance.

Averages are ineffective because they mask underlying issues by “flattening” performance spikes and dips. Sampling lets performance issues slip through the cracks—creating false negatives.

If you want to effectively baseline your cloud infrastructure’s performance, look for a tool that uses percentiles based on 100% gap-free data. Looking at percentiles (median and slowest 10%) tells you what’s really going on: how most users are actually experiencing your application and site.
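A small worked example shows why the average misleads. The numbers below are invented for illustration (85 requests answered in 100 ms and 15 in 2,000 ms), and the `percentile` helper is a simple nearest-rank implementation written for this sketch, not the method of any particular tool.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical response times in milliseconds for 100 requests.
latencies_ms = [100] * 85 + [2000] * 15

mean_ms = statistics.mean(latencies_ms)
median_ms = percentile(latencies_ms, 50)
p90_ms = percentile(latencies_ms, 90)

print(f"mean:   {mean_ms:.0f} ms")  # mean:   385 ms
print(f"median: {median_ms} ms")    # median: 100 ms
print(f"p90:    {p90_ms} ms")       # p90:    2000 ms
```

In this data set no request actually took 385 ms. The median shows that most users saw 100 ms, and the 90th percentile exposes the slow tail that the average flattens.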

6. Does it offer built-in log monitoring, or does it need an additional tool?

Remember the company I presented in my introduction? One of their key requirements was that log management and log analytics are built-in features. Quite understandably: being able to monitor application performance and analyze related process log files using the same tool helps their DevOps, Development, and QA teams to perform their jobs quickly and efficiently.

If log analytics is also an important part of your monitoring process, choose a solution that has this feature built in. Having direct access to all log content related to your mission-critical processes extends your monitoring reach well beyond traditional APM data sources.

7. Will the monitoring solution scale with your business needs?

The last, but by no means least important, feature you should look for in a monitoring tool is its ability to scale with your business.

Modern cloud environments run thousands of nodes with hundreds of technologies, distributed across datacenters around the globe. You can keep deploying more and more monitoring tools for each silo to ensure the system limits are not reached, but soon questions like these will come up:

How far will this scale?
How long until I’ll need a newer, faster, or bigger one?

Picking a monitoring solution that gives you real-time insights into your cloud components is important, but ensuring that it will not crash and burn as you expand your environment is crucial. Look for a tool that was built with large application environments in mind and can therefore scale as your environment grows.

“The value that you get from Dynatrace is almost instant”

Watch the video below to see how Dynatrace helps Citrix reduce cloud resources, as well as time spent on troubleshooting issues.

Wrapping it up

Today’s digital businesses are under more pressure than ever to do things faster, smarter, and more effectively. This is doubly true for companies that run customer-facing applications. Whether they win or lose on the battlefield of customer experience depends largely on their technology. And the trend shows that the winners have already implemented a digital transformation strategy – which may well include a cloud-first initiative and a migration to cloud-based infrastructure.

However, being there is not enough. The complex architecture and the countless moving parts of a cloud ecosystem require modern monitoring capabilities. Why monitor a modern cloud architecture with a bunch of different, outdated tools? That would detract from the very benefits for which you migrated to the cloud in the first place.

This is what the company described in the intro realized – and, as of today, they are developing new cloud-native applications on their own, deploying these into their cloud environments, and monitoring them happily with Dynatrace.