
Monitoring Kubernetes Infrastructure for Day 2 operations

Kubernetes has taken over the container management world and beyond, becoming what some call the operating system of the cloud, the new Linux. Eventually, like Linux, Kubernetes will become a mainstream commodity, taken for granted, fading into the background as an intricate piece of the fabric of the cloud.

In the meantime, not only is the Kubernetes platform hot, it is also non-trivial to deploy (Day 1) and manage (Day 2 ops). One of the promises of container orchestration platforms is to make it easier for developers to accelerate the deployment of their applications without having to worry about scalability and infrastructure dependencies. But there is a price to pay: increased complexity and burden for the operations teams in charge of managing the platform. Even though Kubernetes is heavily automated and self-healing, it is not something you flip on and let run on its own. A failure in one of the cluster components can bring down all the applications running on it. For those still skeptical, take a look at the Kubernetes Failure Stories.

Monitoring Kubernetes is an important aspect of Day 2 operations and is often perceived as a significant challenge. Let’s look at some of the Day 2 operations use cases.

Resource utilization management

One of the principles behind the design of Kubernetes is to optimize the utilization of compute resources by the workload. This does not happen magically: platform admins must ensure there are enough resources on the cluster to deploy new apps and to absorb demand surges or node failures, while avoiding impact on the existing workload.

By default, containers run with unbounded limits. A container (or a pod) running on a node may eat up all the available CPU or memory and affect all other pods on the node, degrading performance (or worse) and preventing any new workload from being scheduled on the node. If nodes run out of resources, Kubernetes may start killing pods or throttling applications. A container with inefficient code might affect critical workloads and practically make the whole node unusable, or worse, because of replication, impact the whole cluster. If you don't properly control the resource utilization of your cluster, you will unavoidably run into problems as you add more workload.

To prevent those kinds of issues, Kubernetes gives platform operators two mechanisms: defining Resource Quotas and enforcing Requests and Limits. Requests are the minimum amount of resources a container needs to run. Limits, on the other hand, are the maximum resources a container is allowed to use; once a limit is reached, Kubernetes takes an action that depends on the type of resource. Requests and limits are also used by Kubernetes to assign each pod a QoS (Quality of Service) class. The pod's QoS class influences how it is scheduled to a node and its eviction priority (which pods get evicted first) when a node is overcommitted and running out of resources.
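
To make this concrete, here is a minimal sketch of how requests and limits are declared per container in a pod spec; the names and values are purely illustrative, not recommendations:

    apiVersion: v1
    kind: Pod
    metadata:
      name: example-app              # hypothetical pod name
    spec:
      containers:
        - name: app
          image: example/app:1.0     # hypothetical image
          resources:
            requests:
              cpu: "250m"            # minimum guaranteed: a quarter of a CPU core
              memory: "256Mi"
            limits:
              cpu: "500m"            # beyond this, the container is throttled
              memory: "512Mi"        # beyond this, the container is OOMKilled

Because requests and limits are set but not equal, this pod gets the "Burstable" QoS class; setting requests equal to limits for every container would make it "Guaranteed", and omitting them entirely would make it "BestEffort", the first to be evicted under resource pressure.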

CPU is a compressible resource: you can always allocate fewer or shorter CPU time slices to a process. So once the limit is reached, Kubernetes will not terminate the container; instead, the container is throttled, i.e. it runs slower, potentially impacting performance (and the associated application).

Memory is of a different nature: unlike CPU, it is not compressible. Once a container has allocated memory, it cannot be reclaimed unless it is released by the container itself. So, to protect the node and the other pods running on it, a container that allocates memory beyond its limit will be terminated (referred to as "OOMKilled", or Out Of Memory killed).

This gives you a picture of how important proper resource utilization management is to Kubernetes operations. You want to make sure no rogue deployment can bring down nodes and affect business-critical workloads. You also want to guarantee those workloads will get scheduled when they are deployed and not get evicted in favor of less critical pods. Of course, you might think: Kubernetes has auto-scaling capabilities, so why should I bother about resources? Well, it is also the ops team's job to understand the costs of running the platform, optimize utilization, and keep those costs from spiraling out of control.

Typically, the platform admins create the logical boundaries in which workloads run, known as Namespaces. They can then define ResourceQuotas for each namespace, which specify how much CPU and memory the pods in the namespace can request. With ResourceQuotas in place, development teams must specify the resource requests and limits for their pods, otherwise they won't be scheduled. This enforces the resource utilization policies, protecting the cluster and its workloads. How do you find the right quota, and what should be used as a CPU or memory request and limit? That's another example where monitoring is of tremendous help: it provides a picture of current resource consumption and helps continuously fine-tune those settings.
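
As a sketch, a ResourceQuota for a hypothetical namespace might look like the following (the namespace name and the figures are illustrative):

    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: team-a-quota             # hypothetical name
      namespace: team-a              # hypothetical namespace
    spec:
      hard:
        requests.cpu: "4"            # sum of CPU requests across all pods in the namespace
        requests.memory: 8Gi         # sum of memory requests
        limits.cpu: "8"              # sum of CPU limits
        limits.memory: 16Gi          # sum of memory limits
        pods: "20"                   # optional cap on the number of pods

With such a quota in place, any pod that does not declare CPU and memory requests and limits is rejected at admission time; a LimitRange can be used to inject default values so that deployments are not blocked.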

Node and workload health

Resource utilization is one aspect that will affect your cluster and workload health. But of course, there are many others.

In order to understand what is going on, you also need visibility into the lifecycle of your k8s objects (pods, services, nodes) and the associated events.

You want to be able to understand:

  • Which nodes and pods are unhealthy, and why.
  • Why the current number of pod replicas does not match the desired state.
  • How many times containers are restarted by Kubernetes, and why.
  • Which pods could not be scheduled to a node, and which ones have been evicted from a node.
  • When Kubernetes performs autoscaling (up or down), and on what.

For that, on top of the resource usage metrics, you need to monitor cluster events and object state metrics.

Kubernetes events are a type of object providing context on what's happening inside a cluster. Examples include a node running out of resources, a pod being OOM killed, a container image that cannot be pulled from the registry, pod liveness or readiness probe failures, scheduler and replication activities, or a pod being evicted from a node.

Event objects are not typical log events; they are produced by the API Server and are not included in the logs produced by the cluster. But they are highly valuable for platform operations, as they are typically one of the first things you look at when something goes wrong with the workload. Since there is no long-term history retention and no built-in mechanism to forward them to external storage, you want to make sure your monitoring solution continuously collects them and stores them long-term for post-mortem analysis.
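
To give an idea of what your monitoring solution needs to collect, here is roughly what an Event object returned by the API Server looks like; the namespace, pod name, and message are hypothetical:

    apiVersion: v1
    kind: Event
    type: Warning                    # Normal or Warning
    reason: FailedScheduling         # short, machine-readable cause
    message: "0/3 nodes are available: 3 Insufficient memory."
    count: 4                         # how many times this event has occurred
    involvedObject:                  # the object the event relates to
      kind: Pod
      namespace: team-a
      name: example-app-0
    source:
      component: default-scheduler   # the component that emitted the event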

Many of the events related to pods will trigger a change of state in the object lifecycle, during which many kinds of issues can be encountered. A healthy pod should be in the "Ready" state. During its lifecycle, it will go through different states, but if it gets stuck in one of them, that is generally the indication of a problem. For example, a pod might be stuck in a "Pending" state if it cannot be scheduled on any node because there aren't sufficient resources available. If it is scheduled but the container images cannot be pulled from the registry, it will get stuck in a "Waiting" state. When a pod crashes for some reason, it gets restarted. If it crashes again and again, it will eventually end up in a "CrashLoopBackOff" state, and you will need to look at the chain of events to understand why. It might be a problem with the application code, a liveness probe that is misconfigured, or something else.
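
To illustrate the liveness probe case, here is a sketch of a probe definition (the endpoint, port, and timings are hypothetical). If initialDelaySeconds is shorter than the application's actual startup time, the kubelet keeps killing the container before it ever becomes healthy, and the pod ends up in CrashLoopBackOff:

    apiVersion: v1
    kind: Pod
    metadata:
      name: example-api              # hypothetical pod name
    spec:
      containers:
        - name: api
          image: example/api:1.0     # hypothetical image
          livenessProbe:
            httpGet:
              path: /healthz         # assumed health endpoint
              port: 8080
            initialDelaySeconds: 5   # too short for a slow-starting app
            periodSeconds: 10
            failureThreshold: 3      # after 3 consecutive failures the container is restarted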

As you can see, it is important to keep track of and monitor object states in order to understand where issues with your workload might come from.

Monitoring in the Kubernetes world

Kubernetes is a complex beast, and monitoring such an environment requires a different set of capabilities than more classic stacks. Hence it is not surprising to see that, according to the 2019 CNCF Survey, complexity (38%) and monitoring (32%) rank among the top 5 challenges in using, deploying, and managing containers.

In your monitoring strategy, you need to consider a comprehensive, single-pane-of-glass approach:

  • Monitoring application workloads: because of the dynamic, complex, polyglot, and distributed nature of the workload, automation is a must-have: automatic discovery of services and dependencies, automatic instrumentation and distributed tracing, automatic deep-dive visibility. Having to rely on manual configuration is a recipe for failure.
  • Monitoring the Kubernetes platform itself: platform issues can produce cascading failures with catastrophic consequences, so it is not only important to monitor the application workload, it is also crucial to have insights into the Kubernetes cluster's health and resource utilization.
  • Monitoring the infrastructure: no matter how many layers of abstraction Kubernetes and containers provide, they still run on infrastructure, virtual and physical. It is important to understand the impact the infrastructure can have on the platform and the applications it runs. You might not manage that infrastructure yourself, especially if you run Kubernetes in a public cloud, but you still need some level of visibility into the resource usage patterns.

With Kubernetes, the sheer complexity and dynamic nature of the environment, combined with the massive amount of data, make it impossible for humans to analyze everything; hence monitoring needs to be driven by AI.

Further your Kubernetes knowledge