Host monitoring (classic page)
Starting with Dynatrace version 1.236, the host overview page has been completely redesigned.
- To switch to the new host overview page, just turn on Try it out at the top of the host overview page. You can switch back to the old design if you decide that you prefer it.
- The documentation below describes the old design.
In the Dynatrace menu, go to Hosts to view the Hosts page—a list of all the hosts detected in your environment.
The Hosts page lists all the machines (both physical and virtual) in your environment that have OneAgent installed on them. Click a host to go to that host's dedicated Host page, where you can view all available metrics for the host.
The following image shows the Hosts page in tile viewing mode. You can use the list in the upper-right corner to select a specific metric to analyze (CPU, Memory, Disk latency, or Network traffic) across all hosts. Hosts that require attention appear in red and display at the top of the tile. For example, this image shows a host that has exceeded the available disk space threshold.
The following image shows the Hosts page in table viewing mode. Click any column header to sort the list by that criterion. For example, the list in the image is sorted by CPU usage.
What's included on individual host pages?
Each Host page details the health of the hardware resources that the selected host relies on. Click one of the four health statistics (CPU, Memory, Disk, or NIC) to view details of the metrics that contribute to each measurement.
The following image shows network interface traffic on the Host page.
Problems
The Problems tile lets you see active and closed problems related to the selected host. Dynatrace also reports recurring problem patterns as frequent issues in this tile, but alerts are only sent out if the severity of the recurring pattern increases.
Click a problem in the list to see the root cause of the problem and understand the impact it has on your services.
Availability
In addition to performance metrics and charts, you can track host availability on the Availability tile of the Host page. Availability represents the percentage of time that the host was online and responsive to requests. Dynatrace detects and shows operating system shutdowns (including reboots) and periods when a host is offline (for example, if the host is down unexpectedly). Unmonitored indicates periods of time when monitoring is turned off.
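As a rough illustration of what an availability percentage expresses, the following sketch computes the share of a timeframe that a host spent online. The `availability_percent` helper, the state labels, and the choice to divide by the full timeframe are assumptions made for this example. They are not a description of how Dynatrace computes the tile.

```python
def availability_percent(periods):
    """Return the percentage of the timeframe spent in the 'online' state.

    periods: list of (state, duration_seconds) tuples. The state labels
    ('online', 'shutdown', 'offline', 'unmonitored') are hypothetical.
    """
    total = sum(duration for _, duration in periods)
    if total == 0:
        return 0.0
    online = sum(duration for state, duration in periods if state == "online")
    return 100.0 * online / total


# One hour: 54 minutes online, 3 minutes rebooting, 3 minutes offline.
periods = [("online", 3240), ("shutdown", 180), ("offline", 180)]
print(availability_percent(periods))  # 90.0
```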
Processes and containers
Lists processes and containers on the selected host.
- Select View containers to drill down to a list of containers and view container details.
- Select View processes to drill down to a list of processes, from which you can drill down to details on a single process.
Events
The Events tile charts the distribution of events over time, including service deployments, process crash details, and memory dumps. Expand the tile to list events.
Logs
Lists logs that the selected host writes. Click an entry to view the log.
Digging deep into performance factors
On each Host page you'll find one or more buttons directing you to pages that show you the details of the specific components that contribute to the selected host health statistic (Contributing processes, Contributing disks, or Contributing network interfaces). This image shows the memory usage link to the list of component processes that contribute to the memory usage health statistic.
This image shows the component processes that contribute to this health statistic.
Click a specific component process to view its properties and the performance metrics that Dynatrace captured for that component. Both the process tile on the Processes page and the specific process overview page alert you to any processes that require restart.
The process tile also shows information for the host's Docker containers and lets you drill down to the Docker details page to see CPU, Memory, Traffic, and Throttling details for containers in each image.
The following image shows Docker image traffic details for a selected host on the Containers detail page.
Learn more about how processes are tracked
CPU health
CPU usage is the primary measurement used to calculate CPU health. This is the percentage of time that a CPU is busy processing data (i.e., when it's not idle). This percentage is computed for all available CPU cores and scaled to a range of 0–100%.
The same calculation method is used for both total CPU usage of a system and CPU usage of a specific process group. This means that a process group that's composed of a single-threaded process on a 4-core system will reach maximum CPU usage at 25% (4 x 25% = 100%).
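The scaling described above can be sketched as follows. This is a minimal illustration under the assumption that per-core busy time is available for the interval; the function name and inputs are hypothetical and not part of any Dynatrace API.

```python
def normalized_cpu_usage(busy_seconds_per_core, interval_seconds):
    """Scale busy time across all cores to a 0-100% range.

    busy_seconds_per_core: busy time of each core during the interval.
    interval_seconds: length of the measurement interval.
    """
    core_count = len(busy_seconds_per_core)
    if core_count == 0 or interval_seconds <= 0:
        raise ValueError("need at least one core and a positive interval")
    total_busy = sum(busy_seconds_per_core)
    return 100.0 * total_busy / (core_count * interval_seconds)


# A single-threaded process that saturates one core of a 4-core host
# for a full 10-second interval tops out at a quarter of the host.
print(normalized_cpu_usage([10.0, 0.0, 0.0, 0.0], 10.0))  # 25.0
```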
A high CPU usage measurement results in a CPU saturation "resource event" and the generation of a problem.
CPU usage measurements are captured every 10 seconds. The average CPU consumption of each 10-second interval is used to calculate total CPU usage. Because Dynatrace averages CPU consumption across 10-second intervals, momentary fluctuations that occur within an interval may be flattened out, but the average CPU consumption over each 10-second period is accurate.
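To see how averaging flattens short spikes, consider the following sketch. The per-second samples are invented for illustration; the text above does not say how consumption is sampled within an interval.

```python
def interval_average(samples):
    """Average CPU usage (in percent) over one measurement interval."""
    return sum(samples) / len(samples)


# Hypothetical per-second CPU usage within one 10-second interval:
# mostly 10%, with a 2-second spike to 100%.
per_second = [10, 10, 10, 100, 100, 10, 10, 10, 10, 10]
print(max(per_second))               # 100
print(interval_average(per_second))  # 28.0
```

The reported value for the interval is 28%, which is an accurate average even though the brief peak at 100% is no longer visible.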
Memory health
Host pages include two memory-related metrics for your hosts: Memory used and Page faults. Both measurements, along with other factors, are used to correlate and calculate host high memory incidents.
- Memory used: Percentage of total RAM used by processes. RAM used by system caches and buffers isn't included in this metric. Dynatrace calculates memory usage as:
  memory_used = total_memory_size - (free_memory + active_memory + inactive_memory + reclaimable_memory)
- Page faults: Number of hard page faults per second. Hard page faults involve loading a page from disk, thereby adding disk latency to the interrupted program’s execution.
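The memory formula above can be transcribed directly into code. This is only a sketch: the parameter names mirror the terms of the formula, the sample values are made up, and where each value comes from on a given operating system is not covered here.

```python
def memory_used_percent(total, free, active, inactive, reclaimable):
    """Apply the memory_used formula and express it as a share of total RAM.

    All arguments must use the same unit (for example, MiB).
    """
    if total <= 0:
        raise ValueError("total memory must be positive")
    memory_used = total - (free + active + inactive + reclaimable)
    return 100.0 * memory_used / total


# Hypothetical host with 16 GiB of RAM (values in MiB).
print(memory_used_percent(total=16384, free=2048, active=4096,
                          inactive=2048, reclaimable=1024))  # 43.75
```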
Disk health
Disk health includes:
- Throughput: The total number of bytes read and written to disk per second.
- IOPS: I/O (input/output) operations per second. Operations are counted after operations addressing adjacent disk sectors are merged.
- Disk latency: Time from I/O request submission to I/O request completion. The average delay of disk read and write operations in milliseconds. This metric is used to detect host slow disk incidents.
- Disk space usage: The amount of disk space that's been used.
- Idle time: Amount of time the disk has been idle.
NIC health
NIC health includes:
- Traffic: The average rate at which data was transmitted during the interval.
- Packets: The number of received and sent packets over the host network interface during the interval.
- Quality: The assessment of the number of dropped packets and errors.
- Connectivity: Percentage of properly established TCP connections compared to TCP connections that were refused or timed out.
Note: The Connectivity measure can be used as an indicator of whether there's network traffic on a host. However, 0% connectivity doesn't necessarily indicate a problem with the host. Assuming no TCP errors are present, it may simply mean that no users attempted to connect to the host process during the selected timeframe.
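The following sketch shows one way to read the Connectivity percentage, including the case from the note where no connections were attempted. The function and its inputs are hypothetical; returning 0% for an idle host is an assumption chosen to match the note, not a confirmed implementation detail.

```python
def connectivity_percent(established, refused, timed_out):
    """Share of TCP connection attempts that were properly established."""
    attempts = established + refused + timed_out
    if attempts == 0:
        # Assumption: with no attempts at all, the value is 0%,
        # which signals an idle host rather than a failing one.
        return 0.0
    return 100.0 * established / attempts


print(connectivity_percent(established=95, refused=3, timed_out=2))  # 95.0
print(connectivity_percent(established=0, refused=0, timed_out=0))   # 0.0
```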
Network services
OneAgent version 1.201+ Dynatrace version 1.203+
Dynatrace constantly and automatically tracks DNS requests with zero additional configuration. The Davis AI causation engine automatically detects and analyzes anomalies, such as underperforming DNS communication or a misconfigured DNS server, and provides you with all the relevant details instantly when such issues impact your applications or services. You can also use all the metrics to define custom events that you want to be alerted on.
All the DNS-related metrics are available on each host overview page on the Network services tile. The metrics are organized into DNS queries and DNS errors tabs.
DNS queries
The chart presents the following metrics:
DNS query time
DNS query response time. The average response time is also added to the tab title. Slower response times can be a sign of a stressed DNS server or network communication issues. In the case of an underperforming, unreachable, or unresponsive DNS server, you may also notice a significant increase in reported Timeout and ServFail(2) errors.
DNS query count
The number of DNS queries. A high number of queries together with a high number of NXDomain(3) and ServFail(2) errors may indicate a DDoS attack based on producing a large volume of DNS queries to non-existent or invalid domains.
DNS orphan response count
The number of DNS responses without a request. This may include responses to requests that already timed out.
DNS errors
The chart presents the percentage of DNS errors in relation to all the DNS queries, excluding orphaned responses and timeouts. If available, the error name contains the RCODE in brackets.
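As a hedged sketch of the error percentage described above, the following helper expresses each error type as a share of all DNS queries. The function, the input layout, and the sample counts are assumptions for illustration. The caller is expected to leave timeouts and orphaned responses out of the error counts, which is one reading of the exclusion mentioned in the text.

```python
def dns_error_percentages(error_counts, total_queries):
    """Return each DNS error type as a percentage of all DNS queries.

    error_counts: mapping of error name (with RCODE in brackets, if known)
    to the number of occurrences, without timeouts or orphaned responses.
    """
    if total_queries <= 0:
        return {name: 0.0 for name in error_counts}
    return {
        name: 100.0 * count / total_queries
        for name, count in error_counts.items()
    }


# Hypothetical counts for a timeframe with 1,000 DNS queries.
errors = {"NXDomain(3)": 30, "ServFail(2)": 10}
print(dns_error_percentages(errors, total_queries=1000))
# {'NXDomain(3)': 3.0, 'ServFail(2)': 1.0}
```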
Container health
If a host runs containers, you can analyze the health of individual containers. Select View containers in the Processes and Containers section of the Host overview page. The Containers page lists all the containers running on the host and displays average CPU and Memory metrics.
Click the container name to access the Container overview page, which gives you a more detailed view of the container's health. In addition to the CPU and Memory metrics displayed over time, you can also analyze out-of-memory kills if any were detected.
As on the Host overview page, the Container overview page lists problems and events detected for the container, including the container-specific Out of memory kill event.
Enable container metrics
To collect Kubernetes, non-Kubernetes Docker, and Cloud Foundry container metrics, you must enable Cloud application and workload detection in Settings > Processes and containers > Process group detection > Cloud application and workload detection.
Note: Pause containers aren't reported for Kubernetes and OpenShift.
Container metrics
You can view CPU-related and memory-related metrics on the Container overview page. For details on this set of metrics, see Containers/CPU.
Windows-based containers
- The Throttled time and Memory cache metrics are not measured for Windows-based containers.
- OOM kill events are reported only for Linux-based containers, as they're not supported on Windows.