Alert on common Kubernetes/OpenShift issues

Dynatrace version 1.254+

ActiveGate version 1.253+

To alert on common Kubernetes/OpenShift platform issues, follow the instructions below.

Configure

You can configure alerts for common Kubernetes/OpenShift issues at three scopes; a hedged API-based sketch of the same configuration follows this list.

Tenant-wide settings

  • Settings apply to all clusters, nodes, namespaces, or workloads in the Kubernetes/OpenShift tenant.
  • To configure settings, go to Settings > Anomaly detection and select any page under the Kubernetes section.

Example: [image: Kubernetes anomaly detection settings, tenant scope]

Cluster settings

  • Settings apply to a selected cluster, or to nodes, namespaces, and workloads from a selected cluster.
  • To configure settings, go to the settings of a selected Kubernetes cluster and select any page under Anomaly detection.

Example: [image: Kubernetes anomaly detection settings, cluster scope]

Namespace settings

  • Settings apply to selected namespaces or workloads.
  • To configure settings, go to the settings of a selected namespace and select any page under Anomaly detection.

Example: [image: Kubernetes anomaly detection settings, namespace scope]
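
In addition to the UI paths above, anomaly detection settings can also be managed through the Dynatrace Settings 2.0 API (POST /api/v2/settings/objects). The sketch below is illustrative only: the environment URL and token are placeholders, and the schema ID and the structure of the value object are assumptions; list the available schemas via GET /api/v2/settings/schemas in your environment to get the exact names.

```python
# Minimal sketch (not the documented workflow): writing a Kubernetes anomaly
# detection setting via the Settings 2.0 API instead of the UI.
import requests

DT_ENV = "https://<your-environment>.live.dynatrace.com"  # placeholder
API_TOKEN = "<API token with settings.write scope>"       # placeholder

payload = [{
    # Assumed schema ID for workload anomaly detection; verify against
    # GET /api/v2/settings/schemas before use.
    "schemaId": "builtin:anomaly-detection.kubernetes.workload",
    # "environment" applies tenant-wide; a Kubernetes cluster or namespace
    # entity ID (e.g. "KUBERNETES_CLUSTER-...") narrows the scope instead.
    "scope": "environment",
    # Hypothetical value structure mirroring the "Detect container restarts"
    # alert: enable it and set a restart threshold.
    "value": {
        "containerRestarts": {
            "enabled": True,
            "configuration": {"threshold": 1, "samplePeriodInMinutes": 3,
                              "observationPeriodInMinutes": 5},
        }
    },
}]

resp = requests.post(
    f"{DT_ENV}/api/v2/settings/objects",
    headers={"Authorization": f"Api-Token {API_TOKEN}"},
    json=payload,
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # one result entry per submitted settings object
```

The same request shape works for the cluster, node, and namespace scopes; only the schemaId, scope, and value fields would change.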

View alerts

You can view alerts in the following places:

  • On the Problems page.

    Example problem:

    [image: Kubernetes alert as shown on the Problems page]

  • In the Events section of a cluster details page.

    Example event:

    [image: Kubernetes alert in the Events section of a cluster details page]

    Note: Select the event to navigate to Data explorer for more information about the metric that generated the event.
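
The metric behind an event can also be retrieved programmatically: the Metrics API v2 (GET /api/v2/metrics/query) accepts the metric expressions listed in the tables below as metric selectors. The following sketch uses placeholder credentials and the cluster readiness metric from the cluster alerts table.

```python
# Sketch: query the metric behind a Kubernetes alert via the Metrics API v2,
# using one of the metric expressions listed in the tables below.
import requests

DT_ENV = "https://<your-environment>.live.dynatrace.com"  # placeholder
API_TOKEN = "<API token with metrics.read scope>"         # placeholder

# Metric expression of "Detect cluster readiness issues".
selector = "builtin:kubernetes.cluster.readyz:splitBy():sum"

resp = requests.get(
    f"{DT_ENV}/api/v2/metrics/query",
    headers={"Authorization": f"Api-Token {API_TOKEN}"},
    params={"metricSelector": selector, "from": "now-2h", "resolution": "5m"},
    timeout=30,
)
resp.raise_for_status()
for series in resp.json()["result"][0]["data"]:
    print(series["timestamps"][-1], series["values"][-1])  # latest datapoint
```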

Available alerts

See below for a list of available alerts.

Cluster alerts

| Alert name | Dynatrace version | Problem title | Problem description | Calculation | Metric expression |
| --- | --- | --- | --- | --- | --- |
| Detect cluster readiness issues | 1.254 | Cluster not ready | Readyz endpoint indicates that this cluster is not ready. | Cluster readyz metric | builtin:kubernetes.cluster.readyz:splitBy():sum |
| Detect cluster CPU-request saturation | 1.254 | CPU-request saturation on cluster | CPU-request saturation exceeds the specified threshold. | Node CPU requests / Node CPU allocatable | builtin:kubernetes.node.requests_cpu:splitBy():sum/builtin:kubernetes.node.cpu_allocatable:splitBy():sum*100.0 |
| Detect cluster memory-request saturation | 1.254 | Memory-request saturation on cluster | Memory-request saturation exceeds the specified threshold. | Node memory requests / Node memory allocatable | builtin:kubernetes.node.requests_memory:splitBy():sum/builtin:kubernetes.node.memory_allocatable:splitBy():sum*100.0 |
| Detect cluster pod-saturation | 1.258 | Pod saturation on cluster | Cluster pod-saturation exceeds the specified threshold. | Sum of ready pods / Sum of allocatable pods | (builtin:kubernetes.node.pods:filter(and(eq(pod_condition,Ready))):splitBy():sum/builtin:kubernetes.node.pods_allocatable:splitBy():sum):default(0.0)*100.0 |
| Detect monitoring issues | 1.258 | Monitoring not available | Dynatrace API monitoring is not available. | | (no metric expression) |
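
The saturation alerts above all follow the same pattern: a requests metric is divided by the corresponding allocatable metric and multiplied by 100, and the result is compared with the threshold configured in the anomaly detection settings. A small sketch of that arithmetic, with made-up sample values:

```python
# Sketch of the saturation pattern used by the cluster and node alerts:
# saturation % = sum(requests) / sum(allocatable) * 100.0
def saturation_percent(requests_sum: float, allocatable_sum: float) -> float:
    """Mirrors e.g. builtin:kubernetes.node.requests_cpu:splitBy():sum
    / builtin:kubernetes.node.cpu_allocatable:splitBy():sum * 100.0."""
    return requests_sum / allocatable_sum * 100.0

# Made-up sample values (CPU millicores summed over all nodes of the cluster).
cpu_requests = 13_500.0
cpu_allocatable = 16_000.0
threshold_percent = 80.0  # the "specified threshold" from the settings

value = saturation_percent(cpu_requests, cpu_allocatable)
print(f"CPU-request saturation: {value:.1f}%")
if value > threshold_percent:
    print("Above threshold: 'CPU-request saturation on cluster' would be raised")
```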

Node alerts

| Alert name | Dynatrace version | Problem title | Problem description | Calculation | Metric expression |
| --- | --- | --- | --- | --- | --- |
| Detect node readiness issues | 1.254 | Node not ready | Node is not ready. | Node condition metric filtered by 'not ready' | builtin:kubernetes.node.conditions:filter(and(eq(node_condition,Ready),ne(condition_status,True))):splitBy(dt.kubernetes.node.system_uuid):sum |
| Detect problematic node conditions | 1.264 | Problematic node condition | Node has a problematic condition ('MemoryPressure', 'DiskPressure', 'PIDPressure', 'OutOfDisk' or 'NetworkUnavailable'). | Nodes condition metric | builtin:kubernetes.node.conditions:filter(and(or(eq(node_condition,DiskPressure),eq(node_condition,MemoryPressure),eq(node_condition,PIDPressure),eq(node_condition,OutOfDisk),eq(node_condition,NetworkUnavailable)),eq(condition_status,True))):splitBy(dt.kubernetes.node.system_uuid):sum |
| Detect node CPU-request saturation | 1.254 | CPU-request saturation on node | CPU-request saturation exceeds the specified threshold. | Sum of node CPU requests / Sum of node CPU allocatable | builtin:kubernetes.node.requests_cpu:splitBy(dt.kubernetes.node.system_uuid):sum/builtin:kubernetes.node.cpu_allocatable:splitBy(dt.kubernetes.node.system_uuid):sum*100.0 |
| Detect node memory-request saturation | 1.254 | Memory-request saturation on node | Memory-request saturation exceeds the specified threshold. | Sum of node memory requests / Sum of node memory allocatable | builtin:kubernetes.node.requests_memory:splitBy(dt.kubernetes.node.system_uuid):sum/builtin:kubernetes.node.memory_allocatable:splitBy(dt.kubernetes.node.system_uuid):sum*100.0 |
| Detect node pod-saturation | 1.254 | Pod saturation on node | Pod saturation exceeds the specified threshold. | Sum of running pods on node / Node pod limit | builtin:kubernetes.node.pods:filter(and(eq(pod_phase,Running))):splitBy(dt.kubernetes.node.system_uuid):sum/builtin:kubernetes.node.pods_allocatable:splitBy(dt.kubernetes.node.system_uuid):sum*100.0 |

Namespace alerts

| Alert name | Dynatrace version | Problem title | Problem description | Calculation | Metric expression |
| --- | --- | --- | --- | --- | --- |
| Detect namespace CPU-request quota saturation | 1.254 | CPU-request quota saturation | CPU-request quota saturation exceeds the specified threshold. | Sum of resource quota CPU used / Sum of resource quota CPU requests | builtin:kubernetes.resourcequota.requests_cpu_used:splitBy(k8s.namespace.name):sum/builtin:kubernetes.resourcequota.requests_cpu:splitBy(k8s.namespace.name):sum*100.0 |
| Detect namespace CPU-limit quota saturation | 1.254 | CPU-limit quota saturation | CPU-limit quota saturation exceeds the specified threshold. | Sum of resource quota CPU used / Sum of resource quota CPU limits | builtin:kubernetes.resourcequota.limits_cpu_used:splitBy(k8s.namespace.name):sum/builtin:kubernetes.resourcequota.limits_cpu:splitBy(k8s.namespace.name):sum*100.0 |
| Detect namespace memory-request quota saturation | 1.254 | Memory-request quota saturation | Memory-request quota saturation exceeds the specified threshold. | Sum of resource quota memory used / Sum of resource quota memory requests | builtin:kubernetes.resourcequota.requests_memory_used:splitBy(k8s.namespace.name):sum/builtin:kubernetes.resourcequota.requests_memory:splitBy(k8s.namespace.name):sum*100.0 |
| Detect namespace memory-limit quota saturation | 1.254 | Memory-limit quota saturation | Memory-limit quota saturation exceeds the specified threshold. | Sum of resource quota memory used / Sum of resource quota memory limits | builtin:kubernetes.resourcequota.limits_memory_used:splitBy(k8s.namespace.name):sum/builtin:kubernetes.resourcequota.limits_memory:splitBy(k8s.namespace.name):sum*100.0 |
| Detect namespace pod quota saturation | 1.254 | Pod quota saturation | Pod quota saturation exceeds the specified threshold. | Sum of resource quota pods used / Sum of resource quota pods limit | builtin:kubernetes.resourcequota.pods_used:splitBy(k8s.namespace.name):sum/builtin:kubernetes.resourcequota.pods:splitBy(k8s.namespace.name):sum*100.0 |

Workload alerts

| Alert name | Dynatrace version | Problem title | Problem description | Calculation | Metric expression |
| --- | --- | --- | --- | --- | --- |
| Detect container restarts | 1.254 | Container restarts | Observed container restarts exceed the specified threshold. | Container restarts metric | builtin:kubernetes.container.restarts:splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum:default(0.0) |
| Detect stuck deployments | 1.260 | Deployment stuck | Deployment is stuck and therefore is no longer progressing. | Workload condition metric filtered by 'not progressing' | builtin:kubernetes.workload.conditions:filter(and(eq(workload_condition,Progressing),eq(condition_status,False))):splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum |
| Detect pods stuck in pending | 1.254 | Pods stuck in pending | Workload has pending pods. | Pods metric filtered by phase 'Pending' | builtin:kubernetes.pods:filter(and(eq(pod_phase,Pending))):splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum |
| Detect pods stuck in terminating | 1.260 | Pods stuck in terminating | Workload has pods stuck in terminating. | Pods metric filtered by status 'Terminating' | builtin:kubernetes.pods:filter(and(eq(pod_status,Terminating))):splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum |
| Detect workloads without ready pods | 1.254 | No pod ready | Workload does not have any ready pods. | Sum of non-failed pods - Sum of non-failed and non-ready pods | builtin:kubernetes.pods:filter(and(ne(pod_phase,Failed),ne(pod_phase,Succeeded),ne(k8s.workload.kind,job),ne(k8s.workload.kind,cronjob))):splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum-builtin:kubernetes.pods:filter(and(ne(pod_phase,Failed),ne(pod_phase,Succeeded),ne(k8s.workload.kind,job),ne(k8s.workload.kind,cronjob),ne(pod_condition,Ready))):splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum:default(0.0) |
| Detect workloads with non-ready pods | 1.258 | Not all pods ready | Workload has pods that are not ready. | Sum of non-failed pods - Sum of non-failed and non-ready pods | builtin:kubernetes.pods:filter(and(ne(pod_phase,Failed),ne(pod_phase,Succeeded),ne(k8s.workload.kind,job),ne(k8s.workload.kind,cronjob),ne(pod_status,Terminating))):splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum-builtin:kubernetes.pods:filter(and(ne(pod_phase,Failed),ne(pod_phase,Succeeded),ne(k8s.workload.kind,job),ne(k8s.workload.kind,cronjob),eq(pod_condition,Ready),ne(pod_status,Terminating))):splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum:default(0.0) |
| Detect memory usage saturation | 1.264 | Memory usage close to limits | The memory usage (working set memory) exceeds the threshold in terms of the defined memory limit. | Sum of workload working set memory / Sum of workload memory limits | (builtin:kubernetes.workload.memory_working_set:splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum/builtin:kubernetes.workload.limits_memory:splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum):default(0.0)*100.0 |
| Detect CPU usage saturation | 1.264 | CPU usage close to limits | The CPU usage exceeds the threshold in terms of the defined CPU limit. | Sum of workload CPU usage / Sum of workload CPU limits | (builtin:kubernetes.workload.cpu_usage:splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum/builtin:kubernetes.workload.limits_cpu:splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum):default(0.0)*100.0 |
| Detect high CPU throttling | 1.264 | High CPU throttling | The CPU throttling to usage ratio exceeds the specified threshold. | Sum of workload CPU throttled / Sum of workload CPU usage | (builtin:kubernetes.workload.cpu_throttled:splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum/builtin:kubernetes.workload.cpu_usage:splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum):default(0.0)*100.0 |
| Detect out-of-memory kills | 1.268 | Out-of-memory kills | Out-of-memory kills have been observed for pods of this workload. | Out-of-memory kills metric | builtin:kubernetes.container.oom_kills:splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum:default(0.0) |
| Detect job failure events | 1.268 | Job failure event | Events with reason 'BackoffLimitExceeded', 'DeadlineExceeded', or 'PodFailurePolicy' have been detected. | Event metric filtered by reason and workload kind | builtin:kubernetes.events:filter(and(or(eq(k8s.event.reason,BackoffLimitExceeded),eq(k8s.event.reason,DeadlineExceeded),eq(k8s.event.reason,PodFailurePolicy)),or(eq(k8s.workload.kind,job),eq(k8s.workload.kind,cronjob)))):splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum:default(0.0) |
| Detect pod backoff events | 1.268 | Backoff event | Events with reason 'BackOff' have been detected for pods of this workload. Check for pods with status 'ImagePullBackOff' or 'CrashLoopBackOff'. | Event metric filtered by reason | builtin:kubernetes.events:filter(and(eq(k8s.event.reason,BackOff))):splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum:default(0.0) |
| Detect pod eviction events | 1.268 | Pod eviction event | Events with reason 'Evicted' have been detected for pods of this workload. | Event metric filtered by reason | builtin:kubernetes.events:filter(and(eq(k8s.event.reason,Evicted))):splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum:default(0.0) |
| Detect pod preemption events | 1.268 | Preemption event | Events with reasons 'Preempted' or 'Preempting' have been detected for pods of this workload. | Event metric filtered by reason | builtin:kubernetes.events:filter(or(eq(k8s.event.reason,Preempted),eq(k8s.event.reason,Preempting))):splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum:default(0.0) |

Persistent volume claim alerts

| Alert name | Dynatrace version | Problem title | Problem description | Calculation | Metric expression |
| --- | --- | --- | --- | --- | --- |
| Detect low disk space (MB) | 1.262 | Kubernetes PVC: Low disk space | Available disk space for a persistent volume claim is below the threshold. | Kubelet volume stats available bytes metric | kubelet_volume_stats_available_bytes:splitBy(k8s.namespace.name,persistentvolumeclaim):avg |
| Detect low disk space (%) | 1.262 | Kubernetes PVC: Low disk space % | Available disk space for a persistent volume claim is below the threshold. | Volume stats available bytes / Volume stats capacity bytes | kubelet_volume_stats_available_bytes:splitBy(k8s.namespace.name,persistentvolumeclaim):avg/kubelet_volume_stats_capacity_bytes:splitBy(k8s.namespace.name,persistentvolumeclaim):avg*100.0 |

Related topics
  • Set up Dynatrace on Kubernetes/OpenShift

    Ways to deploy and configure Dynatrace on Kubernetes/OpenShift