Alert on common Kubernetes/OpenShift issues
Dynatrace version 1.254+
ActiveGate version 1.253+
To alert on common Kubernetes/OpenShift platform issues, follow the instructions below.
Configure
There are two ways to configure alerts for common Kubernetes/OpenShift issues.

- Environment-wide settings
  - Settings apply to all clusters, nodes, namespaces, or workloads in the Kubernetes/OpenShift tenant.
  - To configure these settings, go to Settings > Anomaly detection and select any page under the Kubernetes section.
- Cluster-specific settings
  - Settings apply to a selected cluster, or to nodes, namespaces, and workloads of that cluster.
  - To configure these settings, go to the settings of the selected Kubernetes cluster and select any page under Anomaly detection.
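In addition to the UI paths above, Dynatrace anomaly detection settings can generally also be managed through the Settings API 2.0 (/api/v2/settings/objects). The snippet below is a minimal sketch of that approach, assuming an access token with the settings.write scope; the schema ID and the fields inside value are illustrative assumptions, so retrieve the exact schema for your environment version (GET /api/v2/settings/schemas) before relying on it.

```python
import requests

BASE_URL = "https://{your-environment-id}.live.dynatrace.com"  # Managed: https://{domain}/e/{environment-id}
API_TOKEN = "dt0c01.XXXX"  # token with the settings.write scope

# The schema ID and the fields inside "value" are assumptions for illustration.
# List the real anomaly-detection schemas and their field definitions first, e.g.
#   GET /api/v2/settings/schemas
payload = [
    {
        "schemaId": "builtin:anomaly-detection.kubernetes.cluster",  # assumed schema ID
        # "environment" applies the setting tenant-wide; use a KUBERNETES_CLUSTER
        # entity ID (e.g. "KUBERNETES_CLUSTER-1234567890ABCDEF") to target one cluster.
        "scope": "environment",
        "value": {
            # Hypothetical field names; copy the actual structure from the schema definition.
            "cpuRequestsSaturation": {
                "enabled": True,
                "configuration": {
                    "threshold": 90,
                    "samplePeriodInMinutes": 15,
                    "observationPeriodInMinutes": 20,
                },
            }
        },
    }
]

resp = requests.post(
    f"{BASE_URL}/api/v2/settings/objects",
    headers={"Authorization": f"Api-Token {API_TOKEN}"},
    json=payload,
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # one creation result per submitted settings object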
View alerts
You can view alerts in the following places:

- On the Problems page.
- In the Events section of a cluster details page.

Note: Select an event to navigate to Data explorer for more information about the metric that generated the event.
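If you prefer to check for these alerts programmatically, the Problems API v2 returns the same problems shown on the Problems page. The following is a minimal sketch, assuming a token with the problems.read scope; the selector values are examples and should be adjusted to the entity types you alert on.

```python
import requests

BASE_URL = "https://{your-environment-id}.live.dynatrace.com"
API_TOKEN = "dt0c01.XXXX"  # token with the problems.read scope

# Open problems from the last 2 hours, narrowed to Kubernetes cluster entities.
# The selector values are examples; use e.g. CLOUD_APPLICATION to cover workload problems.
params = {
    "from": "now-2h",
    "problemSelector": 'status("open")',
    "entitySelector": 'type("KUBERNETES_CLUSTER")',
}

resp = requests.get(
    f"{BASE_URL}/api/v2/problems",
    headers={"Authorization": f"Api-Token {API_TOKEN}"},
    params=params,
    timeout=30,
)
resp.raise_for_status()
for problem in resp.json().get("problems", []):
    print(problem["displayId"], problem["title"])
```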
Available alerts
See below for a list of available alerts.
Cluster alerts
Alert name | Dynatrace version | Problem title | Problem description | Calculation | Metric expression |
---|---|---|---|---|---|
Detect cluster readiness issues | 1.254 | Cluster not ready | Readyz endpoint indicates that this cluster is not ready. | Cluster readyz metric | builtin:kubernetes.cluster.readyz:splitBy():sum |
Detect cluster CPU-request saturation | 1.254 | CPU-request saturation on cluster | CPU-request saturation exceeds the specified threshold. | Node CPU requests / Node CPU allocatable | builtin:kubernetes.node.requests_cpu:splitBy():sum/builtin:kubernetes.node.cpu_allocatable:splitBy():sum*100.0 |
Detect cluster memory-request saturation | 1.254 | Memory-request saturation on cluster | Memory-request saturation exceeds the specified threshold. | Node memory requests / Node memory allocatable | builtin:kubernetes.node.requests_memory:splitBy():sum/builtin:kubernetes.node.memory_allocatable:splitBy():sum*100.0 |
Detect cluster pod-saturation | 1.258 | Pod saturation on cluster | Cluster pod-saturation exceeds the specified threshold. | Sum of ready pods / Sum of allocatable pods | (builtin:kubernetes.node.pods:filter(and(eq(pod_condition,Ready))):splitBy():sum/builtin:kubernetes.node.pods_allocatable:splitBy():sum):default(0.0)*100.0 |
Detect monitoring issues | 1.258 | Monitoring not available | Dynatrace API monitoring is not available. | (no calculation) | (no metric expression) |
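Each metric expression in these tables can be evaluated on its own, for example in Data explorer (code mode) or through the Metrics API v2, which is useful for checking what value an alert would compare against its threshold. Below is a minimal sketch for the cluster CPU-request saturation expression, assuming a token with the metrics.read scope; the same pattern applies to the expressions in the node, namespace, workload, and persistent volume claim tables that follow. Note that the alerts evaluate these expressions per entity, whereas a query without an entity filter aggregates across all monitored clusters in the environment.

```python
import requests

BASE_URL = "https://{your-environment-id}.live.dynatrace.com"
API_TOKEN = "dt0c01.XXXX"  # token with the metrics.read scope

# Expression taken from the "Detect cluster CPU-request saturation" row above.
metric_selector = (
    "builtin:kubernetes.node.requests_cpu:splitBy():sum"
    "/builtin:kubernetes.node.cpu_allocatable:splitBy():sum*100.0"
)

resp = requests.get(
    f"{BASE_URL}/api/v2/metrics/query",
    headers={"Authorization": f"Api-Token {API_TOKEN}"},
    params={"metricSelector": metric_selector, "from": "now-1h", "resolution": "5m"},
    timeout=30,
)
resp.raise_for_status()
for result in resp.json().get("result", []):
    for series in result.get("data", []):
        print(series["timestamps"][-1], series["values"][-1])  # latest saturation in percent
```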
Node alerts
Alert name | Dynatrace version | Problem title | Problem description | Calculation | Metric expression |
---|---|---|---|---|---|
Detect node readiness issues | 1.254 | Node not ready | Node is not ready. | Node condition metric filtered by 'not ready' | builtin:kubernetes.node.conditions:filter(and(eq(node_condition,Ready),ne(condition_status,True))):splitBy(dt.kubernetes.node.system_uuid):sum |
Detect problematic node conditions | 1.264 | Problematic node condition | Node has a problematic condition ('MemoryPressure', 'DiskPressure', 'PIDPressure', 'OutOfDisk' or 'NetworkUnavailable'). | Nodes condition metric | builtin:kubernetes.node.conditions:filter(and(or(eq(node_condition,DiskPressure),eq(node_condition,MemoryPressure),eq(node_condition,PIDPressure),eq(node_condition,OutOfDisk),eq(node_condition,NetworkUnavailable)),eq(condition_status,True))):splitBy(dt.kubernetes.node.system_uuid):sum |
Detect node CPU-request saturation | 1.254 | CPU-request saturation on node | CPU-request saturation exceeds the specified threshold. | Sum of node CPU requests / Sum of node CPU allocatable | builtin:kubernetes.node.requests_cpu:splitBy(dt.kubernetes.node.system_uuid):sum/builtin:kubernetes.node.cpu_allocatable:splitBy(dt.kubernetes.node.system_uuid):sum*100.0 |
Detect node memory-request saturation | 1.254 | Memory-request saturation on node | Memory-request saturation exceeds the specified threshold. | Sum of node memory requests / Sum of node memory allocatable | builtin:kubernetes.node.requests_memory:splitBy(dt.kubernetes.node.system_uuid):sum/builtin:kubernetes.node.memory_allocatable:splitBy(dt.kubernetes.node.system_uuid):sum*100.0 |
Detect node pod-saturation | 1.254 | Pod saturation on node | Pod saturation exceeds the specified threshold. | Sum of running pods on node / Node pod limit | builtin:kubernetes.node.pods:filter(and(eq(pod_phase,Running))):splitBy(dt.kubernetes.node.system_uuid):sum/builtin:kubernetes.node.pods_allocatable:splitBy(dt.kubernetes.node.system_uuid):sum*100.0 |
Namespace alerts
Alert name | Dynatrace version | Problem title | Problem description | Calculation | Metric expression |
---|---|---|---|---|---|
Detect namespace CPU-request quota saturation | 1.254 | CPU-request quota saturation | CPU-request quota saturation exceeds the specified threshold. | Sum of resource quota CPU used / Sum of resource quota CPU requests | builtin:kubernetes.resourcequota.requests_cpu_used:splitBy(k8s.namespace.name):sum/builtin:kubernetes.resourcequota.requests_cpu:splitBy(k8s.namespace.name):sum*100.0 |
Detect namespace CPU-limit quota saturation | 1.254 | CPU-limit quota saturation | CPU-limit quota saturation exceeds the specified threshold. | Sum of resource quota CPU used / Sum of resource quota CPU limits | builtin:kubernetes.resourcequota.limits_cpu_used:splitBy(k8s.namespace.name):sum/builtin:kubernetes.resourcequota.limits_cpu:splitBy(k8s.namespace.name):sum*100.0 |
Detect namespace memory-request quota saturation | 1.254 | Memory-request quota saturation | Memory-request quota saturation exceeds the specified threshold. | Sum of resource quota memory used / Sum of resource quota memory requests | builtin:kubernetes.resourcequota.requests_memory_used:splitBy(k8s.namespace.name):sum/builtin:kubernetes.resourcequota.requests_memory:splitBy(k8s.namespace.name):sum*100.0 |
Detect namespace memory-limit quota saturation | 1.254 | Memory-limit quota saturation | Memory-limit quota saturation exceeds the specified threshold. | Sum of resource quota memory used / Sum of resource quota memory limits | builtin:kubernetes.resourcequota.limits_memory_used:splitBy(k8s.namespace.name):sum/builtin:kubernetes.resourcequota.limits_memory:splitBy(k8s.namespace.name):sum*100.0 |
Detect namespace pod quota saturation | 1.254 | Pod quota saturation | Pod quota saturation exceeds the specified threshold. | Sum of resource quota pods used / Sum of resource quota pods limit | builtin:kubernetes.resourcequota.pods_used:splitBy(k8s.namespace.name):sum/builtin:kubernetes.resourcequota.pods:splitBy(k8s.namespace.name):sum*100.0 |
Workload alerts
Alert name | Dynatrace version | Problem title | Problem description | Calculation | Metric expression |
---|---|---|---|---|---|
Detect container restarts | 1.254 | Container restarts | Observed container restarts exceed the specified threshold. | Container restarts metric | builtin:kubernetes.container.restarts:splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum:default(0.0) |
Detect stuck deployments | 1.260 | Deployment stuck | Deployment is stuck and therefore is no longer progressing. | Workload condition metric filtered by 'not progressing' | builtin:kubernetes.workload.conditions:filter(and(eq(workload_condition,Progressing),eq(condition_status,False))):splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum |
Detect pods stuck in pending | 1.254 | Pods stuck in pending | Workload has pending pods. | Pods metric filtered by phase 'Pending' | builtin:kubernetes.pods:filter(and(eq(pod_phase,Pending))):splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum |
Detect pods stuck in terminating | 1.260 | Pods stuck in terminating | Workload has pods stuck in terminating. | Pods metric filtered by status 'Terminating' | builtin:kubernetes.pods:filter(and(eq(pod_status,Terminating))):splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum |
Detect workloads without ready pods | 1.254 | No pod ready | Workload does not have any ready pods. | Sum of non-failed pods - Sum of non-failed and non-ready pods | builtin:kubernetes.pods:filter(and(ne(pod_phase,Failed),ne(pod_phase,Succeeded),ne(k8s.workload.kind,job),ne(k8s.workload.kind,cronjob))):splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum-builtin:kubernetes.pods:filter(and(ne(pod_phase,Failed),ne(pod_phase,Succeeded),ne(k8s.workload.kind,job),ne(k8s.workload.kind,cronjob),ne(pod_condition,Ready))):splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum:default(0.0) |
Detect workloads with non-ready pods | 1.258 | Not all pods ready | Workload has pods that are not ready. | Sum of non-failed, non-terminating pods - Sum of ready, non-terminating pods | builtin:kubernetes.pods:filter(and(ne(pod_phase,Failed),ne(pod_phase,Succeeded),ne(k8s.workload.kind,job),ne(k8s.workload.kind,cronjob),ne(pod_status,Terminating))):splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum-builtin:kubernetes.pods:filter(and(ne(pod_phase,Failed),ne(pod_phase,Succeeded),ne(k8s.workload.kind,job),ne(k8s.workload.kind,cronjob),eq(pod_condition,Ready),ne(pod_status,Terminating))):splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum:default(0.0) |
Detect memory usage saturation | 1.264 | Memory usage close to limits | The memory usage (working set memory) exceeds the threshold in terms of the defined memory limit. | Sum of workload working set memory / Sum of workload memory limits | (builtin:kubernetes.workload.memory_working_set:splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum/builtin:kubernetes.workload.limits_memory:splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum):default(0.0)*100.0 |
Detect CPU usage saturation | 1.264 | CPU usage close to limits | The CPU usage exceeds the threshold in terms of the defined CPU limit. | Sum of workload CPU usage / Sum of workload CPU limits | (builtin:kubernetes.workload.cpu_usage:splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum/builtin:kubernetes.workload.limits_cpu:splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum):default(0.0)*100.0 |
Detect high CPU throttling | 1.264 | High CPU throttling | The CPU throttling to usage ratio exceeds the specified threshold. | Sum of workload CPU throttled / Sum of workload CPU usage | (builtin:kubernetes.workload.cpu_throttled:splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum/builtin:kubernetes.workload.cpu_usage:splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum):default(0.0)*100.0 |
Detect out-of-memory kills | 1.268 | Out-of-memory kills | Out-of-memory kills have been observed for pods of this workload. | Out-of-memory kills metric | builtin:kubernetes.container.oom_kills:splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum:default(0.0) |
Detect job failure events | 1.268 | Job failure event | Events with reason 'BackoffLimitExceeded', 'DeadlineExceeded', or 'PodFailurePolicy' have been detected. | Event metric filtered by reason and workload kind | builtin:kubernetes.events:filter(and(or(eq(k8s.event.reason,BackoffLimitExceeded),eq(k8s.event.reason,DeadlineExceeded),eq(k8s.event.reason,PodFailurePolicy)),or(eq(k8s.workload.kind,job),eq(k8s.workload.kind,cronjob)))):splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum:default(0.0) |
Detect pod backoff events | 1.268 | Backoff event | Events with reason 'BackOff' have been detected for pods of this workload. Check for pods with status 'ImagePullBackOff' or 'CrashLoopBackOff'. | Event metric filtered by reason | builtin:kubernetes.events:filter(and(eq(k8s.event.reason,BackOff))):splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum:default(0.0) |
Detect pod eviction events | 1.268 | Pod eviction event | Events with reason 'Evicted' have been detected for pods of this workload. | Event metric filtered by reason | builtin:kubernetes.events:filter(and(eq(k8s.event.reason,Evicted))):splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum:default(0.0) |
Detect pod preemption events | 1.268 | Preemption event | Events with reasons 'Preempted' or 'Preempting' have been detected for pods of this workload. | Event metric filtered by reason | builtin:kubernetes.events:filter(or(eq(k8s.event.reason,Preempted),eq(k8s.event.reason,Preempting))):splitBy(k8s.namespace.name,k8s.workload.kind,k8s.workload.name):sum:default(0.0) |
Persistent volume claim alerts
Alert name | Dynatrace version | Problem title | Problem description | Calculation | Metric expression |
---|---|---|---|---|---|
Detect low disk space (MB) | 1.262 | Kubernetes PVC: Low disk space | Available disk space for a persistent volume claim is below the threshold. | Kubelet volume stats available bytes metric | kubelet_volume_stats_available_bytes:splitBy(k8s.namespace.name,persistentvolumeclaim):avg |
Detect low disk space (%) | 1.262 | Kubernetes PVC: Low disk space % | Available disk space for a persistent volume claim is below the threshold. | Volume stats available bytes / Volume stats capacity bytes | kubelet_volume_stats_available_bytes:splitBy(k8s.namespace.name,persistentvolumeclaim):avg/kubelet_volume_stats_capacity_bytes:splitBy(k8s.namespace.name,persistentvolumeclaim):avg*100.0 |