Measures for host health

Individual Host pages show problem history, event history, and related processes for each host. To assess health, the following performance metrics are captured for each host and presented on each Host overview page:

  • CPU
  • Memory
  • Disk (storage health)
  • NIC (network health)

CPU health

CPU usage is the primary measurement used to calculate CPU health. This is the percentage of time that a CPU is busy processing data (i.e., when it's not idle). This percentage is computed for all available CPU cores and scaled to a range of 0–100%.

The same calculation method is used for both the total CPU usage of a system and the CPU usage of a specific process group. This means that a process group composed of a single-threaded process on a 4-core system reaches its maximum CPU usage at 25%, because each fully busy core contributes 25% (4 x 25% = 100%).
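The scaling described above can be illustrated with a minimal sketch. The function name normalized_cpu_usage and its parameters are hypothetical and chosen for illustration only; this is not Dynatrace's implementation.

```python
def normalized_cpu_usage(busy_seconds_per_core, interval_seconds):
    """Return CPU usage in percent, scaled to 0-100 across all cores.

    busy_seconds_per_core: busy (non-idle) time of each core in the interval.
    interval_seconds: length of the measurement interval in seconds.
    """
    cores = len(busy_seconds_per_core)
    if cores == 0 or interval_seconds <= 0:
        raise ValueError("need at least one core and a positive interval")
    # Total capacity is the time all cores together could have been busy.
    total_capacity = cores * interval_seconds
    return 100.0 * sum(busy_seconds_per_core) / total_capacity


# A single-threaded process that keeps one of four cores fully busy.
single_threaded = normalized_cpu_usage([10.0, 0.0, 0.0, 0.0], 10.0)
# All four cores fully busy.
saturated = normalized_cpu_usage([10.0, 10.0, 10.0, 10.0], 10.0)
print(single_threaded, saturated)
```

With one of four cores fully busy, the result is 25%; only when all four cores are busy does the value reach 100%.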

A high CPU usage measurement results in a CPU saturation "resource event" and the generation of a problem.

CPU usage measurements are captured every 10 seconds. The average CPU consumption of each 10-second interval is used to calculate total CPU usage. Because Dynatrace averages CPU consumption across 10-second intervals, momentary fluctuations that occur within an interval may be flattened out. The average CPU consumption over each 10-second interval is nevertheless accurate.
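The flattening effect of interval averaging can be shown with a small sketch. The per-second sample values and the helper interval_averages are made up for illustration; the source states only that consumption is averaged over 10-second intervals, not how raw samples are collected.

```python
from statistics import mean


def interval_averages(samples, samples_per_interval):
    """Average consecutive groups of samples, one group per reporting interval."""
    if samples_per_interval <= 0:
        raise ValueError("samples_per_interval must be positive")
    return [
        mean(samples[i:i + samples_per_interval])
        for i in range(0, len(samples), samples_per_interval)
    ]


# Hypothetical per-second CPU usage (%) over 20 seconds.
# The first interval contains a one-second spike to 100%.
per_second = [10.0] * 4 + [100.0] + [10.0] * 5 + [20.0] * 10
averages = interval_averages(per_second, 10)
print(max(per_second), averages)
```

The one-second spike to 100% raises the first interval's average only to 19%, so the spike itself is not visible in the reported value, while the average for the interval remains correct.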

Memory health

Host pages include two memory-related metrics for your hosts: Memory used and Page faults. Both measurements, along with other factors, are used to correlate and calculate host high-memory incidents.

  • Memory used
    Percentage of total RAM used by processes. RAM used by system caches and buffers isn't included in this metric. Dynatrace calculates memory usage as:
    memory_used = total_memory_size - (free_memory + active_memory + inactive_memory + reclaimable_memory)

  • Page faults
    Number of major page faults per second. Major page faults involve loading a page from disk, thereby adding disk latency to the interrupted program’s execution.
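The Memory used formula above can be written as a short sketch. The function name, the choice of bytes as the unit, and the example values are assumptions for illustration; the terms follow the formula exactly as stated in this section.

```python
def memory_used_percent(total, free, active, inactive, reclaimable):
    """Return memory used as a percentage of total RAM.

    All arguments are sizes in bytes and map one-to-one to the terms
    of the memory_used formula in this section.
    """
    if total <= 0:
        raise ValueError("total memory must be positive")
    memory_used = total - (free + active + inactive + reclaimable)
    return 100.0 * memory_used / total


GIB = 1024 ** 3
# Hypothetical host with 16 GiB of RAM.
used_percent = memory_used_percent(
    total=16 * GIB,
    free=4 * GIB,
    active=2 * GIB,
    inactive=1 * GIB,
    reclaimable=1 * GIB,
)
print(used_percent)
```

In this example, 8 GiB of the 16 GiB are subtracted, so Memory used is 50%.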

Disk health

Disk health includes:

  • Throughput
    The total number of bytes read from and written to disk per second.

  • IOPS
    I/O (input/output) operations per second. Operations are counted after operations addressing adjacent disk sectors are merged.

  • Disk latency
    The average time from I/O request submission to I/O request completion for disk read and write operations, in milliseconds. This metric is used to detect host slow disk incidents.

  • Disk space usage
    The amount of disk space that's been used.

  • Idle time
    Amount of time the disk has been idle.
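As a rough illustration of how throughput, IOPS, and latency relate to each other, the following sketch derives them from two snapshots of cumulative disk counters. The DiskCounters type, its fields, and the derivation are assumptions made for this example; the source does not describe how Dynatrace collects these values.

```python
from dataclasses import dataclass


@dataclass
class DiskCounters:
    """Hypothetical cumulative counters read at one point in time."""
    bytes_read: int
    bytes_written: int
    ops_completed: int   # counted after adjacent requests are merged
    io_time_ms: float    # summed submission-to-completion time of all requests


def disk_rates(prev, curr, interval_seconds):
    """Derive per-interval disk metrics from two counter snapshots."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    ops = curr.ops_completed - prev.ops_completed
    bytes_moved = (curr.bytes_read - prev.bytes_read) + (
        curr.bytes_written - prev.bytes_written
    )
    return {
        "throughput_bytes_per_s": bytes_moved / interval_seconds,
        "iops": ops / interval_seconds,
        # Average latency per operation; 0.0 when the disk was idle.
        "latency_ms": (curr.io_time_ms - prev.io_time_ms) / ops if ops else 0.0,
    }


before = DiskCounters(0, 0, 0, 0.0)
after = DiskCounters(40_000_000, 10_000_000, 2000, 5000.0)
rates = disk_rates(before, after, interval_seconds=10)
print(rates)
```

Here 50 MB moved in 2,000 operations over 10 seconds gives a throughput of 5 MB/s, 200 IOPS, and an average latency of 2.5 ms per operation.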

NIC health

NIC health includes:

  • Traffic
    The average rate at which data was transmitted during the interval.

  • Packets
    The number of received and sent packets over the host network interface during the interval.

  • Quality
    The assessment of the number of dropped packets and errors.

  • Connectivity
    Percentage of properly established TCP connections compared to TCP connections that were refused or timed out.
    Note: The Connectivity measure can be used as an indicator of whether there's network traffic on a host. However, 0% connectivity doesn't necessarily indicate a problem with the host. Assuming no TCP errors are present, it may simply mean that no users attempted to connect to the host process during the selected timeframe.
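The Connectivity measure can be sketched as a simple ratio. The function name and the counter inputs are hypothetical, and returning 0% when no connection attempts occurred is an assumption that mirrors the note above rather than documented behavior.

```python
def connectivity_percent(established, refused, timed_out):
    """Return the share of TCP connection attempts that were established."""
    attempts = established + refused + timed_out
    if attempts == 0:
        # Assumption: with no connection attempts, report 0% rather than fail.
        return 0.0
    return 100.0 * established / attempts


healthy = connectivity_percent(established=95, refused=3, timed_out=2)
no_traffic = connectivity_percent(established=0, refused=0, timed_out=0)
print(healthy, no_traffic)
```

The second call shows why 0% needs context: it results from the absence of connection attempts, not from refused or timed-out connections.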

Container health

If a host runs containers, you can analyze the health of individual containers. Select View containers in the Processes and Containers section of the Host overview page. The Containers page lists all the containers running on the host and displays average CPU and Memory metrics.

Select the container name to open the Container overview page, which gives you a more detailed view of container health. In addition to the CPU and Memory metrics displayed over time, you can also analyze out-of-memory kills if any were detected.

Like the Host overview page, the Container overview page lists problems and events detected for the container, including the container-specific Out of memory kill event.

Container metrics

You can analyze the following CPU- and memory-related metrics on the Container overview page.

| Metric name | Description | Unit |
|---|---|---|
| CPU limit | CPU limit per container in millicores. | MilliCores |
| CPU logical cores | Number of logical cores of the system. | Count |
| CPU shares | Number of CPU shares allocated per container. | Count |
| CPU throttled time | CPU throttled time per container in nanoseconds per minute. | NanoSecondPerMinute |
| CPU usage % of limit | CPU usage per container in percent relative to a CPU limit. If no CPU limit is set, the number of logical cores is used. | Percent |
| CPU usage system time | Used system time per container in nanoseconds per minute. | NanoSecondPerMinute |
| CPU usage time | Sum of used system and user time per container in nanoseconds per minute. | NanoSecondPerMinute |
| CPU usage user time | Used user time per container in nanoseconds per minute. | NanoSecondPerMinute |
| CPU usage mCores | CPU usage per container in millicores. | MilliCores |
| Memory cache | Current memory page cache per container in bytes. | Byte |
| Memory limit | Memory limit per container in bytes. Not set if no limit is available. | Byte |
| Memory limit % | Memory limit per container in percent relative to total physical memory. Not set if no limit is available. | Percent |
| Out of memory kills | Number of container kills due to out of memory. | Count |
| Memory physical total | Total physical memory available on the system in bytes. | Byte |
| Memory resident set | Current resident size (RSS + Swap) per container in bytes. | Byte |
| Memory usage % of limit | Memory usage per container in percent relative to limit. If no limit is set, total physical memory is used. | Percent |
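The relationships between several of these metrics can be sketched as follows. The function names are hypothetical, and the conversion from nanoseconds per minute to millicores is plain unit arithmetic (one fully used core equals 60 x 10^9 ns of CPU time per minute, or 1,000 millicores), not a documented Dynatrace formula. The fallbacks follow the descriptions above: logical cores when no CPU limit is set, and total physical memory when no memory limit is set.

```python
NANOS_PER_MINUTE = 60 * 1_000_000_000


def usage_millicores(cpu_usage_ns_per_minute):
    """Convert CPU usage time (ns per minute) to millicores."""
    return 1000.0 * cpu_usage_ns_per_minute / NANOS_PER_MINUTE


def cpu_usage_percent_of_limit(usage_mcores, limit_mcores, logical_cores):
    """CPU usage relative to the limit, or to all logical cores if no limit is set."""
    effective_limit = limit_mcores if limit_mcores else logical_cores * 1000
    return 100.0 * usage_mcores / effective_limit


def memory_usage_percent_of_limit(usage_bytes, limit_bytes, physical_total_bytes):
    """Memory usage relative to the limit, or to physical memory if no limit is set."""
    effective_limit = limit_bytes if limit_bytes else physical_total_bytes
    return 100.0 * usage_bytes / effective_limit


MIB = 1024 ** 2
GIB = 1024 ** 3

# Hypothetical container that used 15 s of CPU time during one minute.
mcores = usage_millicores(15_000_000_000)
cpu_with_limit = cpu_usage_percent_of_limit(mcores, limit_mcores=500, logical_cores=4)
cpu_without_limit = cpu_usage_percent_of_limit(mcores, limit_mcores=None, logical_cores=4)
mem_with_limit = memory_usage_percent_of_limit(512 * MIB, 1 * GIB, 8 * GIB)
mem_without_limit = memory_usage_percent_of_limit(512 * MIB, None, 8 * GIB)
print(mcores, cpu_with_limit, cpu_without_limit, mem_with_limit, mem_without_limit)
```

The same container usage yields very different percentages depending on whether a limit is set, which is worth keeping in mind when comparing containers with and without limits.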

Kubernetes and Cloud Foundry container data

You must enable Cloud application and workload detection to collect Kubernetes and Cloud Foundry container data. From the navigation menu, select Settings > Processes and containers > Process group detection, expand Cloud application and workload detection, and turn on Enable cloud application and workload detection.

Note: For Kubernetes and OpenShift, pause containers are not reported.