Measures for host health

Individual Host pages show problem history, event history, and related processes for each host. To assess health, the following performance metrics are captured for each host and presented on each Host overview page:

  • CPU
  • Memory
  • Disk (storage health)
  • NIC (network health)
  • Network services (currently, DNS performance)


CPU health

CPU usage is the primary measurement used to calculate CPU health. This is the percentage of time that a CPU is busy processing data (i.e., when it's not idle). This percentage is computed for all available CPU cores and scaled to a range of 0–100%.

The same calculation method is used for both the total CPU usage of a system and the CPU usage of a specific process group. This means that a process group composed of one single-threaded process on a 4-core system reaches its maximum CPU usage at 25%, because each core accounts for a quarter of the total (4 x 25% = 100%).
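The scaling described above can be sketched as a small calculation. This is only an illustration of the arithmetic, not Dynatrace code; the function name `max_cpu_usage_percent` and its parameters are hypothetical.

```python
def max_cpu_usage_percent(busy_threads: int, cores: int) -> float:
    """Upper bound on CPU usage (0-100%) for a workload that can keep
    at most `busy_threads` cores busy on a host with `cores` cores.

    Hypothetical helper: illustrates the 0-100% scaling only.
    """
    if cores <= 0:
        raise ValueError("cores must be positive")
    if busy_threads < 0:
        raise ValueError("busy_threads must not be negative")
    # A thread can occupy at most one core at a time, so the number of
    # busy cores is capped by the core count.
    busy_cores = min(busy_threads, cores)
    return busy_cores * 100.0 / cores


# A single-threaded process on a 4-core host tops out at a quarter of the total.
print(max_cpu_usage_percent(1, 4))  # 25.0
```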

A high CPU usage measurement results in a CPU saturation "resource event" and the generation of a problem.

CPU usage measurements are captured every 10 seconds. The average CPU consumption of each 10-second interval is used to calculate total CPU usage. Because Dynatrace averages CPU consumption across 10-second intervals, momentary fluctuations that occur within an interval may be flattened out, but the average CPU consumption reported for each 10-second interval is accurate.
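To see why short spikes are flattened, consider the following sketch. It assumes one hypothetical CPU sample per second, which is an illustration choice and not a statement about how OneAgent samples internally; `interval_averages` is a made-up helper name.

```python
from statistics import mean


def interval_averages(samples, interval=10):
    """Average consecutive groups of `interval` samples.

    Hypothetical helper: `samples` holds one CPU usage value (0-100)
    per second, which is an assumption made for this illustration.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    return [
        mean(samples[start:start + interval])
        for start in range(0, len(samples), interval)
    ]


# Nine seconds at 10% followed by a one-second spike to 100%.
samples = [10.0] * 9 + [100.0]
print(interval_averages(samples))  # [19.0]
```

The one-second spike to 100% is no longer visible as a peak, but the 19% average for the interval is still correct.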

Memory health

Host pages include two memory-related metrics for your hosts: Memory used and Page faults. Both measurements, along with other factors, are used to correlate and calculate host high-memory incidents.

  • Memory used
    Percentage of total RAM used by processes. RAM used by system caches and buffers isn't included in this metric. Dynatrace calculates memory usage as:
    memory_used = total_memory_size - (free_memory + active_memory + inactive_memory + reclaimable_memory)

  • Page faults
    Number of major page faults per second. Major page faults involve loading a page from disk, thereby adding disk latency to the interrupted program’s execution.
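The Memory used formula above can be expressed as a short function. This is a minimal sketch: the parameter names mirror the terms in the formula, all values are assumed to be in the same unit (for example, bytes), and the function name `memory_used_percent` is hypothetical.

```python
def memory_used_percent(
    total_memory_size: int,
    free_memory: int,
    active_memory: int,
    inactive_memory: int,
    reclaimable_memory: int,
) -> float:
    """Memory used as a percentage of total RAM.

    Hypothetical helper that applies the formula from the text; all
    arguments are assumed to share one unit, such as bytes.
    """
    if total_memory_size <= 0:
        raise ValueError("total_memory_size must be positive")
    memory_used = total_memory_size - (
        free_memory + active_memory + inactive_memory + reclaimable_memory
    )
    return memory_used * 100.0 / total_memory_size


# Made-up numbers: 16,000 units total, 8,000 of them subtracted.
print(memory_used_percent(16_000, 2_000, 3_000, 2_000, 1_000))  # 50.0
```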

Disk health

Disk health includes:

  • Throughput
    The total number of bytes read and written to disk per second.

  • IOPS
    I/O (input/output) operations per second. Operations are counted after operations addressing adjacent disk sectors are merged.

  • Disk latency
    The average time, in milliseconds, from I/O request submission to I/O request completion for disk read and write operations. This metric is used to detect host slow disk incidents.

  • Disk space usage
    The amount of disk space that's been used.

  • Idle time
    Amount of time the disk has been idle.
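As an illustration of how throughput, IOPS, and disk latency relate to each other, the sketch below derives them from two snapshots of cumulative disk counters. The `DiskCounters` type, its fields, and `disk_rates` are hypothetical names; the text doesn't describe how OneAgent collects these values, so treat the approach as an assumption.

```python
from dataclasses import dataclass


@dataclass
class DiskCounters:
    """Hypothetical cumulative counters read at one point in time."""
    bytes_read: int
    bytes_written: int
    operations: int      # completed I/O operations (after merging)
    io_time_ms: float    # total time spent waiting on those operations


def disk_rates(prev: DiskCounters, curr: DiskCounters, elapsed_s: float) -> dict:
    """Derive per-interval disk metrics from two counter snapshots."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    ops = curr.operations - prev.operations
    total_bytes = (curr.bytes_read - prev.bytes_read) + (
        curr.bytes_written - prev.bytes_written
    )
    io_ms = curr.io_time_ms - prev.io_time_ms
    return {
        "throughput_bytes_per_s": total_bytes / elapsed_s,
        "iops": ops / elapsed_s,
        # Average latency per operation; 0.0 when the disk was idle.
        "latency_ms": io_ms / ops if ops else 0.0,
    }


prev = DiskCounters(0, 0, 0, 0.0)
curr = DiskCounters(1_000_000, 500_000, 300, 600.0)
print(disk_rates(prev, curr, elapsed_s=10.0))
```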

NIC health

NIC health includes:

  • Traffic
    The average rate at which data was transmitted during the interval.

  • Packets
    The number of received and sent packets over the host network interface during the interval.

  • Quality
    The assessment of the number of dropped packets and errors.

  • Connectivity
    Percentage of properly established TCP connections compared to TCP connections that were refused or timed out.
    Note: The Connectivity measure can be used as an indicator of whether there's network traffic on a host. However, 0% connectivity doesn't necessarily indicate a problem with a host. Assuming no TCP errors are present, it may simply mean that no users have attempted to connect to the host process during the selected timeframe.
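The Connectivity measure and the note about 0% can be sketched as follows. The exact formula isn't given in the text, so this assumes connectivity is the share of established connections among all connection attempts; `connectivity_percent` and its parameters are hypothetical.

```python
def connectivity_percent(established: int, refused: int, timed_out: int) -> float:
    """Share of TCP connection attempts that were properly established.

    Hypothetical helper. Assumption: attempts are the sum of established,
    refused, and timed-out connections.
    """
    attempts = established + refused + timed_out
    if attempts == 0:
        # No connection attempts in the timeframe: reported as 0%,
        # which doesn't by itself indicate a problem.
        return 0.0
    return established * 100.0 / attempts


print(connectivity_percent(90, 5, 5))  # 90.0
print(connectivity_percent(0, 0, 0))   # 0.0, with no refused or timed-out connections
```

Because 0% can mean either "all attempts failed" or "no attempts at all", it helps to look at the refused and timed-out counts alongside the percentage.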

Network services

OneAgent version 1.201+, Dynatrace version 1.203+

Dynatrace constantly and automatically tracks DNS requests with zero additional configuration. The Davis AI causation engine automatically detects and analyzes anomalies, such as underperforming DNS communication or a misconfigured DNS server, and provides you with all the relevant details instantly when such issues impact your applications or services. You can also use all the metrics to define custom events that you want to be alerted on.

All the DNS-related metrics are available on each host overview page on the Network services tile. The metrics are organized into DNS queries and DNS errors tabs.

DNS queries

The chart presents the following metrics:

DNS query time
DNS query response time. The average response time is also added to the tab title. Slower response times can be a sign of a stressed DNS server or network communication issues. In the case of an underperforming, unreachable, or unresponsive DNS server, you may also notice a significant increase in reported Timeout and ServFail(2) errors.

DNS query count
The number of DNS queries. A high number of queries together with a high number of NXDomain(3) and ServFail(2) errors may indicate a DDoS attack based on producing a large volume of DNS queries to non-existent or invalid domains.

DNS orphan response count
The number of DNS responses without a request. This may include responses to requests that already timed out.

DNS errors

The chart presents the percentage of DNS errors in relation to all the DNS queries, excluding orphaned responses and timeouts. If available, the error name contains the RCODE in brackets.
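A rough sketch of the error percentage calculation follows. It assumes that timeouts are left out of the error counts and that each remaining error is expressed relative to the total number of DNS queries; this is one reading of the description, not a confirmed formula. The function name `dns_error_percentages` and the input shape are hypothetical.

```python
def dns_error_percentages(query_count: int, error_counts: dict) -> dict:
    """Percentage of DNS queries that ended in each error.

    Hypothetical helper. `error_counts` maps an error name (with the
    RCODE in brackets where available) to its count. Timeouts are
    skipped, which is an assumption based on the description above.
    """
    if query_count <= 0:
        return {}
    return {
        name: count * 100.0 / query_count
        for name, count in error_counts.items()
        if name != "Timeout"
    }


# Made-up counts for illustration.
errors = {"NXDomain(3)": 10, "ServFail(2)": 4, "Timeout": 6}
print(dns_error_percentages(200, errors))
# {'NXDomain(3)': 5.0, 'ServFail(2)': 2.0}
```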

Container health

If a host runs containers, you can analyze the health of individual containers. Select View containers in the Processes and Containers section of the Host overview page. The Containers page lists all the containers running on the host and displays average CPU and Memory metrics.


Select the container name to open the Container overview page, which gives you a more detailed view of container health. In addition to the CPU and Memory metrics displayed over time, you can also analyze out-of-memory kills if any were detected.

Like the Host overview page, the Container overview page lists problems and events detected for the container, including the container-specific Out of memory kill event.

Enable container metrics

To collect Kubernetes, non-Kubernetes Docker, and Cloud Foundry container metrics, you must enable Cloud application and workload detection in Settings > Processes and containers > Process group detection > Cloud application and workload detection.

Note: Pause containers aren't reported for Kubernetes and OpenShift.

Container metrics

You can view CPU- and memory-related metrics on the Container overview page. For details on this set of metrics, see Containers/CPU.

Windows-based containers

  • The Throttled time and Memory cache metrics are not measured for Windows-based containers.
  • OOM kill events are reported only for Linux-based containers, as they're not supported on Windows.