Measures for host health
Individual Host pages show problem history, event history, and related processes for each host. To assess health, the following performance metrics are captured for each host and presented on each Host overview page:
- Disk (storage health)
- NIC (network health)
- Network services (currently, DNS performance)
CPU usage is the primary measurement used to calculate CPU health. This is the percentage of time that a CPU is busy processing data (i.e., when it's not idle). This percentage is computed for all available CPU cores and scaled to a range of
The same calculation method is used for both total CPU usage of a system and CPU usage of a specific process group. This means that a process group that's composed of a single-threaded process on a 4-core system will reach maximum CPU usage at 25% (4 x 25% = 100%).
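The scaling described above can be illustrated with a minimal sketch. It assumes that usage is the plain average of per-core busy fractions; the function name and input format are hypothetical and are not part of any Dynatrace API.

```python
def scaled_cpu_usage(busy_fraction_per_core):
    """Average per-core busy fractions (0.0 to 1.0) into a single percentage.

    Assumption: every core carries equal weight in the average.
    """
    if not busy_fraction_per_core:
        raise ValueError("at least one core is required")
    return 100.0 * sum(busy_fraction_per_core) / len(busy_fraction_per_core)


# A single-threaded process saturating one core of a 4-core host.
print(scaled_cpu_usage([1.0, 0.0, 0.0, 0.0]))  # 25.0
```

Under this assumption, four such processes, each saturating its own core, would bring the host to 100%.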
CPU usage measurements are captured every 10 seconds. The average CPU consumption of each 10-second interval is used to calculate total CPU usage. Because Dynatrace averages CPU consumption across 10-second intervals, momentary fluctuations that occur within an interval may be flattened out, but the average CPU consumption over each 10-second period is accurate.
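To see how interval averaging flattens short spikes, consider the following sketch. It assumes one hypothetical sample per second purely for illustration; the actual sampling that happens inside each 10-second interval is not described here.

```python
from statistics import mean


def interval_averages(samples, interval=10):
    """Average consecutive fixed-size windows of CPU usage samples."""
    return [
        mean(samples[i:i + interval])
        for i in range(0, len(samples), interval)
    ]


# Nine quiet seconds at 10% followed by a one-second spike to 100%.
samples = [10.0] * 9 + [100.0]
print(interval_averages(samples))  # [19.0]
```

The one-second spike to 100% is no longer visible in the result, yet 19% is the correct average for the interval.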
Virtualized hosts show additional measurements related to CPU performance. These values are important to overall virtual machine health.
CPU Ready time
Percentage of time that the virtual machine was ready, but not scheduled to run on the physical CPU.
Note: CPU Ready time should remain below 10%. A CPU Ready time measurement of over 10% indicates that your virtual machines are competing for available resources and that the virtual machine is unable to execute all of its tasks. Such contention can lead to a drop in application performance. For more information, see How does virtual machine migration affect performance?
The CPU consumption as reported by OneAgent from the perspective of AWS itself, rather than from within the host.
Physical CPU
The amount of actively used virtual CPU as a percentage of total available CPU. This is the host view of CPU usage, not the guest operating system view. It is the average CPU utilization over all available virtual CPUs on the virtual machine. For example, if a virtual machine with one virtual CPU is running on a host that has four physical CPUs and the CPU usage is 100%, then you know that the virtual machine is utilizing 100% of one physical CPU's available resources.
AIX hosts show additional measurements related to CPU performance. These values are important to overall virtual machine health.
Configurable setting in the LPAR definition where you specify the default minimum number of the physical CPU cores that will be assigned to this LPAR. This metric reflects the setting value and not the actual consumption. The consumption may differ depending on the availability of the physical CPU.
Percent of your configured entitlement that was used. Considering that your LPAR might receive more CPU than you've configured in the entitlement, the value can reach more than 100%.
Physical CPU consumed
Number of physical CPU cores consumed.
Average number of runnable kernel threads over the sampling period. This is the number of threads that are waiting to run or already running.
A stable workload typically has fewer than five threads running. A rapid increase in the number of threads running may indicate an application problem. Threads competing for the CPU resource are scheduled in round-robin fashion. The number of runnable threads could exceed 100 if every thread executes for a complete or partial time segment.
Average number of kernel threads that are put in the wait queue through the sampling period. Kernel threads are added to the wait queue when they are scheduled for execution and are waiting for one of their process pages to be requested.
I/O message wait
Average number of threads waiting for I/O messages from raw devices. Raw devices are the devices that are directly attached to the system.
I/O direct, buffered
Number of threads per second that are waiting for the file system direct I/O event to occur and number of processes that are asleep waiting for buffered I/O. The value of this metric may differ from the real value because it represents the sum of only two of the six elements that make up the real value.
Number of physical CPU cores assigned to a logical partition.
Number of processors derived by applying simultaneous multithreading (SMT) technology to each virtual processor. For example, 2 virtual processors, each with 4 SMT threads, provides 8 logical CPUs.
Simultaneous multithreading (SMT)
Number of independent threads of execution (to better utilize processor resources).
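Two of the AIX measurements above lend themselves to simple arithmetic. The sketch below assumes that the entitlement percentage is physical CPU consumed divided by the configured entitlement, which is an interpretation of the descriptions above rather than a documented formula. The function names are hypothetical.

```python
def entitlement_used_percent(physical_consumed, entitlement):
    """Share of the configured entitlement that was consumed.

    Assumption: both values are expressed in physical CPU cores.
    The result can exceed 100 when the LPAR receives more CPU
    than its configured entitlement.
    """
    if entitlement <= 0:
        raise ValueError("entitlement must be positive")
    return 100.0 * physical_consumed / entitlement


def logical_cpus(virtual_processors, smt_threads):
    """Logical CPUs derived by applying SMT to each virtual processor."""
    return virtual_processors * smt_threads


print(logical_cpus(2, 4))                  # 8
print(entitlement_used_percent(1.5, 1.0))  # 150.0
```

The first call reproduces the example above of 2 virtual processors with 4 SMT threads each. The second shows how a value above 100% can arise.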
Host pages include two memory-related metrics for your hosts: Memory used and Page faults. Both measurements, along with other factors, are used to correlate and calculate host high memory incidents.
Percentage of total RAM used by processes. RAM used by system caches and buffers isn't included in this metric. Dynatrace calculates memory usage as:
memory_used = total_memory_size - (free_memory + active_memory + inactive_memory + reclaimable_memory)
Number of major page faults per second. Major page faults involve loading a page from disk, thereby adding disk latency to the interrupted program’s execution.
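The Memory used formula can be sketched as follows. The parameter names mirror the terms of the formula. Converting the result to a percentage of total memory is an assumption based on the metric's description, and the sample values are invented.

```python
def memory_used_percent(total_memory_size, free_memory, active_memory,
                        inactive_memory, reclaimable_memory):
    """Apply the memory_used formula and express it as a percentage.

    All inputs must use the same unit (for example, MiB).
    """
    if total_memory_size <= 0:
        raise ValueError("total_memory_size must be positive")
    memory_used = total_memory_size - (
        free_memory + active_memory + inactive_memory + reclaimable_memory
    )
    return 100.0 * memory_used / total_memory_size


# Hypothetical 16 GiB host, values in MiB.
print(memory_used_percent(16384, 2048, 4096, 2048, 0))  # 50.0
```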
Virtualized hosts show additional measurements related to virtual machine memory usage. These metrics, along with other measurements, are used to detect memory saturation incidents.
The rate of memory compression or decompression. Virtual machine management platforms use memory compression to reduce memory usage. Memory compression saves memory but requires additional CPU cycles. Content that had been previously compressed must be decompressed before it can be used by a virtual machine.
Rate at which memory is swapped in from disk to active memory and swapped out from active memory to disk.
Disk health includes:
The total number of bytes read and written to disk per second.
I/O (input/output) operations per second. Operations are counted after operations addressing adjacent disk sectors are merged.
Time from I/O request submission to I/O request completion. The average delay of disk read and write operations in milliseconds. This metric is used to detect host slow disk incidents.
Disk space usage
The amount of disk space that's been used.
Amount of time the disk has been idle.
NIC health includes:
The average rate at which data was transmitted during the interval.
The number of received and sent packets over the host network interface during the interval.
The assessment of the number of dropped packets and errors.
Percentage of properly established TCP connections compared to TCP connections that were refused or timed out.
Note: The Connectivity measure can be used as an indicator of whether there's network traffic on a host. However, 0% connectivity doesn't necessarily indicate a problem with a host. Assuming no TCP errors are present, it may simply mean that no users have attempted to connect to the host process during the selected timeframe.
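The Connectivity percentage and the caveat in the note can be illustrated with a small sketch. The exact formula isn't given above, so this assumes connectivity is the share of established connections among all connection attempts. The function name and the counts are hypothetical.

```python
def connectivity_percent(established, refused, timed_out):
    """Share of TCP connection attempts that were properly established.

    Assumption: with no attempts at all, the result is 0.0, which
    mirrors the note that 0% can simply mean there was no traffic.
    """
    attempts = established + refused + timed_out
    if attempts == 0:
        return 0.0
    return 100.0 * established / attempts


print(connectivity_percent(95, 3, 2))  # 95.0
print(connectivity_percent(0, 0, 0))   # 0.0 (no traffic, no errors)
```

The second call shows why a 0% reading should be checked against the refused and timed-out counts before it is treated as a problem.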
OneAgent version 1.201+ Dynatrace version 1.203+
Dynatrace constantly and automatically tracks DNS requests with zero additional configuration. The Davis AI causation engine automatically detects and analyzes anomalies, such as underperforming DNS communication or a misconfigured DNS server, and provides you with all the relevant details instantly when such issues impact your applications or services. You can also use all the metrics to define custom events that you want to be alerted on.
All the DNS-related metrics are available on each host overview page on the Network services tile. The metrics are organized into DNS queries and DNS errors tabs.
The chart presents the following metrics:
DNS query time
DNS query response time. The average response time is also added to the tab title. Slower response times can be a sign of a stressed DNS server or network communication issues. In the case of an underperforming, unreachable, or unresponsive DNS server, you may also notice a significant increase in reported
DNS query count
The number of DNS queries. A high number of queries together with a high number of ServFail(2) errors may indicate a DDoS attack based on producing a large volume of DNS queries to non-existent or invalid domains.
DNS orphan response count
The number of DNS responses without a request. This may include responses to requests that already timed out.
The chart presents the percentage of DNS errors in relation to all the DNS queries, excluding orphaned responses and timeouts. If available, the error name contains the RCODE in brackets.
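As a rough illustration of the error chart, the sketch below computes each error's share of all DNS queries. It assumes that timeouts and orphaned responses have already been left out of the input. The function name and the counts are hypothetical, and the labels follow the "name(RCODE)" pattern described above.

```python
def dns_error_percentages(query_count, errors_by_rcode):
    """Percentage of DNS queries that ended with each error code.

    Assumption: errors_by_rcode contains only RCODE errors, with
    timeouts and orphaned responses already excluded.
    """
    if query_count == 0:
        return {name: 0.0 for name in errors_by_rcode}
    return {
        name: 100.0 * count / query_count
        for name, count in errors_by_rcode.items()
    }


print(dns_error_percentages(200, {"ServFail(2)": 10, "NXDomain(3)": 4}))
# {'ServFail(2)': 5.0, 'NXDomain(3)': 2.0}
```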
If a host runs containers, you can analyze the health of individual containers. Select View containers in the Processes and Containers section of the Host overview page. The Containers page lists all the containers running on the host and displays average CPU and Memory metrics.
Click the container name to access the Container overview page that gives you a more detailed view on the container health. In addition to the CPU and Memory metrics displayed over time, you can also analyze out of memory kills if they were detected.
Similarly to the Host overview page, the Container overview page lists problems and events detected for the container, including the container-dedicated Out of memory kill event.
Enable container metrics
To collect Kubernetes, non-Kubernetes Docker, and Cloud Foundry container metrics, you must enable Cloud application and workload detection in Settings > Processes and containers > Process group detection > Cloud application and workload detection.
Note: The pause containers aren't reported for Kubernetes and OpenShift.
You can view CPU and memory related metrics in the Container overview page. For details on this set of metrics, see Containers/CPU.
- The Throttled time and Memory cache metrics are not measured for Windows-based containers.
- OOM kill events are reported only for Linux-based containers, as they're not supported on Windows.