AWS observability: AWS monitoring best practices for resiliency

Visibility into system activity and behavior has become increasingly critical given organizations’ widespread use of Amazon Web Services (AWS) and other cloud platforms.

With AWS, applications may be distributed horizontally across worker nodes, and microservices may run in Kubernetes clusters that interact with AWS managed services or in serverless functions. These resources generate vast amounts of data in various locations, including containers, which can be virtual and ephemeral, thus more difficult to monitor. These challenges make AWS observability a key practice for building and monitoring cloud-native applications.

Let’s take a closer look at what observability in dynamic AWS environments means, why it’s so important, and some AWS monitoring best practices.

What is AWS observability, and why does it matter?

Like general observability, AWS observability is the capacity to measure the current state of your AWS environment based on the data it generates, including its logs, metrics, and traces.

Because it comprises a matrix of cloud services spanning multiple environments, AWS, like any multicloud setup, can be more difficult to manage and monitor than traditional on-premises infrastructure. To cope with this complexity, IT pros need a clear understanding of what’s happening, the context it’s happening in, and what is affected. With dependable, contextual observability data, teams can develop data-driven service-level agreements (SLAs) and service-level objectives (SLOs) to make their AWS infrastructure more reliable and resilient.

AWS: A service for everything

AWS provides a suite of technologies and serverless tools for running modern applications in the cloud. Here are a few of the most popular.

  • Amazon EC2. EC2 is Amazon’s infrastructure-as-a-service (IaaS) compute platform designed to handle any workload at scale. With EC2, Amazon manages the underlying compute, storage, and networking infrastructure, along with the virtualization layer, and leaves the rest for you to manage: the OS, middleware, runtime environment, data, and applications. EC2 is ideally suited for large workloads with constant traffic.
  • AWS Lambda. Lambda is Amazon’s event-driven, functions-as-a-service (FaaS) compute service that runs code when triggered for application and back-end services. AWS Lambda makes it easy to design, run, and maintain application systems without having to provision or manage infrastructure.
  • Amazon Fargate. Fargate is an AWS serverless compute environment for containers. It manages the underlying infrastructure that hosts distributed container-based applications, which frees developers to focus on innovating and developing applications. Fargate is tailored to run containers with smaller workloads and occasional on-demand usage.
  • Amazon EKS. EKS is Amazon’s managed containers-as-a-service (CaaS) for Kubernetes-based applications running in the AWS cloud or on-premises. EKS integrates with AWS Fargate using controllers that run on the managed Amazon EKS control plane, which governs container orchestration and scheduling.
  • Amazon CloudWatch. Amazon’s AWS monitoring and observability service, CloudWatch, monitors applications, resource usage, and system-wide performance for AWS-hosted environments. But if you are using non-AWS tooling or need more breadth, depth, and analysis for your multicloud environment beyond AWS and CloudWatch, you need a different approach.
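To make the FaaS model concrete, here is a minimal sketch of an AWS Lambda handler in Python. Lambda invokes a function like this with the triggering event (for example, an API Gateway request) and an invocation context; the event shape and field names below are hypothetical, chosen only for illustration.

```python
import json

def handler(event, context):
    """Minimal Lambda handler: reads a field from the triggering event
    and returns an API Gateway-style response.

    `event` is the payload from the trigger; `context` describes the
    invocation (request ID, remaining time, etc.) and is unused here.
    """
    name = event.get("name", "world")  # hypothetical event field
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }
```

Because the handler has no infrastructure to provision, it can be unit-tested locally by calling it directly with a sample event, which is one reason Lambda functions are easy to design and maintain.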

Serverless technologies can reduce management complexity. But like any other tool used in production, it’s critical to understand how these technologies interact with the broader technology stack. If a user encounters an error page on a website, for example, it’s vital to trace the behavior to the original source of failure.

While AWS provides the foundation for running serverless workloads and coordinated tools for monitoring AWS-related workloads, it lacks comprehensive instrumentation for observability across the multicloud stack. As a result, various application performance and security problems can go unnoticed absent sufficient monitoring.

AWS monitoring best practices

To gain insight into these problems, software engineers typically deploy application instrumentation frameworks that provide insight into applications and code. These frameworks can include breakpoint debuggers and logging instrumentation, or manual processes such as reading log files. The manual approach is usually effective only in smaller environments where applications are limited in scope. Here are some best practices for maintaining AWS observability in larger, multicloud environments.
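As a lightweight illustration of the logging instrumentation mentioned above, the sketch below wraps a function so each call’s duration is logged. The function names are hypothetical; in larger environments this role is played by tracing agents rather than hand-written decorators.

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)

def instrumented(fn):
    """Log the wall-clock duration of each call to `fn`."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            log.info("%s took %.2f ms", fn.__name__, elapsed_ms)
    return wrapper

@instrumented
def lookup_order(order_id):
    # Hypothetical placeholder for a real downstream service call.
    return {"order_id": order_id, "status": "shipped"}
```

This hand-rolled approach works for a handful of functions, but it does not scale to distributed services, which is exactly why the automated practices below matter.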

  1. Make full use of CloudWatch data. As part of your monitoring plan, use CloudWatch to collect monitoring data from all parts of your AWS environment so you can debug any failures. This approach should also include using multiple investigative tools — both AWS native and open source — to deliver a comprehensive view of activity in a multicloud environment. While this provides greater scalability than on-site instrumentation, it also introduces complexity. Multiple tools require teams to collect, curate, and coordinate data sources across disparate environments.
  2. Automate monitoring tasks. Given the sheer number of AWS services and connections to outside technologies, teams now need observability and monitoring tools capable of doing more using automation. With data from sources such as Kubernetes and user experience data, teams need the ability to automatically detect the “unknown unknowns.” Such unknowns include glitches that haven’t yet been identified, can’t be discovered via dashboards, and don’t lend themselves to quick and easy remediation.
  3. Create a special plan for EKS and Fargate monitoring. EKS leaves users with much of the responsibility for detecting and replacing failed nodes, applying security updates, and upgrading Kubernetes versions. If you have multiple EKS clusters, you should set up automation to handle manual tasks and quickly detect bottlenecks or failures. To gain optimum observability of EKS on Fargate, take an all-in-one approach that tightly integrates with AWS and your other Kubernetes environments. The goal is to gain automatic observability into Kubernetes clusters, nodes, and pods combined with analytics tools, such as application metrics, distributed tracing, and real user monitoring.
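To show what pulling CloudWatch data programmatically can look like, the sketch below builds the parameters for a CloudWatch GetMetricData request for EC2 CPU utilization. Constructing the query separately from the API call keeps it testable without AWS credentials; the instance ID shown in the comment is a placeholder.

```python
from datetime import datetime, timedelta, timezone

def cpu_metric_query(instance_id, minutes=60, period=300):
    """Build keyword arguments for CloudWatch's get_metric_data call,
    requesting average EC2 CPUUtilization over the last `minutes`,
    bucketed into `period`-second intervals.
    """
    now = datetime.now(timezone.utc)
    return {
        "StartTime": now - timedelta(minutes=minutes),
        "EndTime": now,
        "MetricDataQueries": [{
            "Id": "cpu",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/EC2",
                    "MetricName": "CPUUtilization",
                    "Dimensions": [
                        {"Name": "InstanceId", "Value": instance_id},
                    ],
                },
                "Period": period,
                "Stat": "Average",
            },
        }],
    }

# With AWS credentials configured, the query would run as:
#   import boto3
#   cw = boto3.client("cloudwatch")
#   result = cw.get_metric_data(**cpu_metric_query("i-0123456789abcdef0"))
```

Queries like this are the raw material for the automated anomaly detection and EKS/Fargate monitoring described above: once metric collection is scripted, it can feed alerting and automation instead of manual dashboard review.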

Automated and intelligent: The Dynatrace approach to AWS observability

Powered by AI and automation at its core, Dynatrace turns your application data and log analytics into actionable insights and automatable SLOs.

As a long-standing AWS Advanced Technology Partner, Dynatrace integrates closely with AWS services with no code changes. Through auto-instrumentation, Dynatrace provides seamless end-to-end distributed tracing for AWS Lambda functions. Using OneAgent with Dynatrace Operator, Dynatrace combines observability for EKS clusters, nodes, and pods on AWS Fargate with distributed tracing, application metrics, and real user monitoring. Dynatrace ingests CloudWatch metrics and, as a launch partner for Amazon CloudWatch Metric Streams, can provide full observability of AWS services with a fast and direct push of metric data from the source to Dynatrace.

To learn more about how Dynatrace manages AWS observability, join us for an on-demand demo, AWS Observability with Serverless.
