Metrics on Grail examples
Metrics on Grail enable you to pinpoint and retrieve any metric data with the help of Dynatrace Query Language. After reviewing the fundamentals of DQL queries, use the examples on this page to start getting answers from your metrics.
Example 1: Average CPU usage across all hosts
In this example, you'll query the average CPU usage across all monitored hosts in your environment.
OneAgent collects CPU measurements from its host machine. These metrics are accessible through metric keys beginning with dt.host.cpu.
Observing the aggregate CPU usage across all hosts can help you visually confirm how your infrastructure responds to and recovers from usage spikes or slow, imperceptible growth trends over time.
timeseries usage=avg(dt.host.cpu.usage)
Example 2: Average CPU usage by host, limit to top 3 hosts
In this example, you get every monitored host's average CPU usage and focus on the three hosts with the highest usage.
OneAgent collects CPU measurements from its host machine. These metrics are accessible through metric keys beginning with dt.host.cpu.
Charting individual hosts' CPU usage helps to visualize normal and outlier usage. By focusing on the three hosts with highest CPU usage, you can begin investigating under-provisioned applications. Likewise, focusing on hosts with the lowest CPU usage may reveal over-provisioning and lead to cost-saving opportunities.
- Query the data.
timeseries usage=avg(dt.host.cpu.usage), by:{dt.entity.host} | sort arrayAvg(usage) desc | limit 3
- Simplify results.
A table can be easier to read than a line chart in some situations. Let's query data that works best with table output by focusing on the columns we most care about: dt.entity.host and usage.
timeseries usage=avg(dt.host.cpu.usage), by:{dt.entity.host} | sort arrayAvg(usage) desc | limit 3 | fields dt.entity.host, usage=arrayAvg(usage)
This is essentially the same query as above, removing the series and keeping only the series aggregation.
You can refer to the DQL documentation for a list of available arrayXXX functions. If you're familiar with metric expressions, you'll find these functions similar to the :fold transformation.
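Example 2 notes that hosts with the lowest CPU usage may reveal over-provisioning. The following is a minimal sketch of that variation, not a query from the original examples. It sorts ascending instead of descending and uses arrayMax, another arrayXXX function, to show each host's peak alongside its average. The field names avg_usage and peak_usage are chosen here purely for illustration.

```dql
// Sketch: the three least-used hosts, one row per host
timeseries usage=avg(dt.host.cpu.usage), by:{dt.entity.host}
// Ascending order puts the lowest average usage first
| sort arrayAvg(usage) asc
| limit 3
// Collapse each series to two single values: its average and its peak
| fields dt.entity.host, avg_usage=arrayAvg(usage), peak_usage=arrayMax(usage)
```

A host with a low average but a high peak may still need its capacity, so the two columns are worth reading together before drawing conclusions.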
Example 3: Average CPU usage for a specific host
In this example, you'll learn how to filter results from the timeseries command.
You can use a filter parameter to filter hosts directly. It accepts the same values as the filter command.
As host.name is a common field, you can filter on host.name without a lookup command.
timeseries usage=avg(dt.host.cpu.usage),
  by:{dt.entity.host},
  filter:{host.name=="dw0sdwk00012U"}
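Because the filter parameter takes the same kind of condition as the filter command, conditions can presumably be combined with logical operators. The sketch below assumes that and matches two hosts at once. The second host name, dw0sdwk00013U, is a hypothetical placeholder and does not come from the original example.

```dql
// Sketch: match either of two hosts by name.
// "dw0sdwk00013U" is a made-up host name used only for illustration.
timeseries usage=avg(dt.host.cpu.usage),
  by:{dt.entity.host},
  filter:{host.name == "dw0sdwk00012U" or host.name == "dw0sdwk00013U"}
```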
Example 4: Average CPU usage for tagged hosts
This example uses an in condition to filter results to hosts tagged with a responsible team in your organization.
By using the in operator with a subquery, you can filter on host attributes such as tags or ipAddress.
Using the timeseries filter parameter is a good way to improve query performance.
timeseries usage=avg(dt.host.cpu.usage),
  by:{dt.entity.host},
  filter:{
    dt.entity.host in [
      fetch dt.entity.host
      | filter matchesValue(tags, "team:Dorado")
      | fields id
    ]
  }
Example 5: Number of hosts sending CPU usage data
In this example, you'll learn how to chain timeseries with summarize. You'll first query hosts sending CPU usage data, and then count the number of hosts in the result.
Other DQL commands can also be chained with timeseries as demonstrated in previous examples, but unlike those examples, summarize further aggregates the dataset returned by timeseries. You'll find this two-step aggregation helpful as your questions become more complex and nuanced.
timeseries usage=avg(dt.host.cpu.usage), by:{dt.entity.host}
| summarize count()
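The two-step aggregation can carry more than a count. As a minimal sketch, the query below first reduces each host's series to one number with arrayAvg and then summarizes across hosts. The field names host_avg, host_count, fleet_avg, and busiest are illustrative and not part of the original example.

```dql
timeseries usage=avg(dt.host.cpu.usage), by:{dt.entity.host}
// Step 1: reduce each host's series to a single average value
| fieldsAdd host_avg=arrayAvg(usage)
// Step 2: aggregate those per-host values across all hosts
| summarize host_count=count(), fleet_avg=avg(host_avg), busiest=max(host_avg)
```

Note that fleet_avg here is an average of per-host averages, so it can differ slightly from the single series returned in Example 1.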
Example 6: Total disk throughput for a given host across all disks
In this example, you'll learn how to focus on the relevant context by using the by parameter.
Like a pivot table, the by parameter aggregates timeseries across many dimensions into a single host context. This example also uses in to query a specific host as shown previously.
OneAgent ingests host metrics from many contexts, but you rarely need all the information at once. Aggregating timeseries with by helps you to focus on the context that matters for your questions.
timeseries bytes_written=sum(dt.host.disk.bytes_written),
  by:{dt.entity.host},
  filter:{
    dt.entity.host in [
      fetch dt.entity.host
      | filter matchesValue(ipAddress, "10.128.0.106")
      | fields id
    ]
  }
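To see what the by parameter collapses, it can help to add a dimension back. The sketch below assumes that the disk metric carries a dt.entity.disk dimension. That dimension name is an assumption and is not confirmed by the example above. If it holds, the result contains one series per disk instead of one per host.

```dql
// Sketch: same host filter, but split by disk as well as host.
// dt.entity.disk is an assumed dimension name; verify it in your environment.
timeseries bytes_written=sum(dt.host.disk.bytes_written),
  by:{dt.entity.host, dt.entity.disk},
  filter:{
    dt.entity.host in [
      fetch dt.entity.host
      | filter matchesValue(ipAddress, "10.128.0.106")
      | fields id
    ]
  }
```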
Example 7: Top hosts by bytes read with corresponding bytes written
In this example, you'll enrich a single result with context from another metric.
Even when focused on disk read operations, the corresponding disk writes can provide helpful context.
timeseries by:{dt.entity.host}, {
    bytes_read=sum(dt.host.disk.bytes_read),
    bytes_written=sum(dt.host.disk.bytes_written)
  }
| sort arrayAvg(bytes_read) desc
| limit 3
| fields
    dt.entity.host,
    bytes_read=arrayAvg(bytes_read),
    bytes_written=arrayAvg(bytes_written)
Example 8: Total network traffic by host
In this example, you'll calculate total network traffic on your hosts.
Dynatrace collects network traffic in two metrics, bytes_rx and bytes_tx. You'll calculate the total traffic by aggregating the series into single measurements and summing measurements to create traffic_gb.
This example highlights one of the improvements of DQL. With metric selectors, you can calculate traffic_gb, but you'll lose the bytes_rx and bytes_tx data used as inputs. With DQL, these fields remain in the query output.
timeseries by:{dt.entity.host}, {
    bytes_rx = sum(dt.host.net.nic.bytes_rx),
    bytes_tx = sum(dt.host.net.nic.bytes_tx)
  }
| fieldsAdd bytes_rx = arraySum(bytes_rx)
| fieldsAdd bytes_tx = arraySum(bytes_tx)
| fieldsAdd traffic_gb = 1e-9 * (bytes_rx + bytes_tx)
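The query above reduces each series to a single total. If you would rather chart total traffic over time, a possible variation keeps the result as a series by adding the two metrics element by element with the [] operator that Example 9 uses. This is a sketch under that assumption, and the field name traffic_bytes is illustrative.

```dql
timeseries by:{dt.entity.host}, {
    bytes_rx = sum(dt.host.net.nic.bytes_rx),
    bytes_tx = sum(dt.host.net.nic.bytes_tx)
  }
// [] applies the addition to each pair of data points,
// so traffic_bytes remains a series that can be drawn as a line chart
| fieldsAdd traffic_bytes = bytes_rx[] + bytes_tx[]
```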
Example 9: Available CPU by Kubernetes Node
In this example, you'll calculate the available CPU on all nodes of your hypothetical "openfeature" cluster.
To return a timeseries instead of a single value, we use the [] operator to take the difference of individual timeseries values. The result is another timeseries that you can visualize with a line chart.
Tracking the available CPU is important for efficient resource utilization and for avoiding resource contention. A timeseries visualized with a line chart is one way to show how the available CPU changes over time.
timeseries {
    cpu_allocatable=min(dt.kubernetes.node.cpu_allocatable),
    requests_cpu=max(dt.kubernetes.node.requests_cpu)
  },
  by:{dt.entity.kubernetes_node},
  filter:{k8s.cluster.name == "openfeature"}
| fieldsAdd result = cpu_allocatable[] - requests_cpu[]
| fieldsRemove cpu_allocatable, requests_cpu
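Absolute values are hard to compare across nodes of different sizes. As a sketch, the same [] arithmetic can express available CPU as a percentage of allocatable CPU, following the pattern of the failure-rate calculation in Example 14. The field name available_pct is illustrative.

```dql
timeseries {
    cpu_allocatable=min(dt.kubernetes.node.cpu_allocatable),
    requests_cpu=max(dt.kubernetes.node.requests_cpu)
  },
  by:{dt.entity.kubernetes_node},
  filter:{k8s.cluster.name == "openfeature"}
// Share of allocatable CPU that is not yet requested, per data point
| fieldsAdd available_pct = 100 * (cpu_allocatable[] - requests_cpu[]) / cpu_allocatable[]
| fieldsRemove cpu_allocatable, requests_cpu
```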
Example 10: Filter relevant hosts by state
In this example, you'll learn how to use an in operator to focus on running hosts.
An unfiltered timeseries query can take unnecessary time in large environments with thousands of hosts. By applying the filter to the timeseries command directly, your queries may become faster by ignoring unwanted data.
timeseries usage=avg(dt.host.cpu.usage),
  by:{dt.entity.host},
  filter:{
    dt.entity.host in [
      fetch dt.entity.host
      | filter state == "RUNNING"
      | fields id
    ]
  }
Example 11: Average process memory usage by the responsible dev team
In this example, you'll use two in comparisons to filter results to hosts tagged with a responsible team in your organization.
The process runs on a host, and the host has a tag with the team information on which you need to filter. First, fetch process-host relationships, followed by host-tag relationships, and then add the tag filter necessary for the query.
You'll find nested in comparisons are often necessary when working with complex relationships.
timeseries usage=avg(dt.process.memory.usage),
  by:{dt.entity.process_group_instance},
  filter:{
    dt.entity.process_group_instance in [
      fetch dt.entity.process_group_instance
      | filter belongs_to[dt.entity.host] in [
          fetch dt.entity.host
          | filter matchesValue(tags, "team:Dorado")
          | fields id
        ]
      | fields id
    ]
  }
Example 12: Average host CPU usage by host size
In this example, you'll learn how to use a lookup command to analyze host CPU usage by host size.
OneAgent collects local context from its host: information such as how many CPUs are installed and how much memory it has. You can add this information to your query with a lookup command.
Host-level information can sometimes be too fine-grained and difficult to interpret. In these situations, a well-chosen lookup can help you explore and analyze how individual hosts contribute to broader trends.
timeseries usage=avg(dt.host.cpu.usage), by:{dt.entity.host}
| fieldsAdd usage=arrayAvg(usage)
| lookup [
    fetch dt.entity.host
    | fieldsAdd cpuCores
  ],
  sourceField:dt.entity.host,
  lookupField:id,
  fields:{cpuCores}
| summarize avg(usage), count_hosts=count(), by:{cpuCores}
When should I use in or lookup?
- Use an in comparison to filter your data. This is more efficient than filtering with a lookup.
- Use lookup when you want to group by a field that's not returned by the base timeseries command, such as a host ipAddress.
An in is usually faster than a lookup. Choosing correctly will lead to smoother interactions in your dashboards, notebooks, and other Platform applications.
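To make the distinction concrete, here is a sketch of the filtering case. Example 12 groups by cpuCores, so it needs a lookup. If you only want to restrict the result to larger hosts, an in comparison should be enough. The threshold of 8 cores is an arbitrary value chosen for illustration, and the query assumes cpuCores can be used in a subquery filter the same way state and tags are used in earlier examples.

```dql
// Sketch: keep only hosts with at least 8 CPU cores (threshold is illustrative).
// No lookup is needed because cpuCores is used only to filter, not to group.
timeseries usage=avg(dt.host.cpu.usage),
  by:{dt.entity.host},
  filter:{
    dt.entity.host in [
      fetch dt.entity.host
      | filter cpuCores >= 8
      | fields id
    ]
  }
```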
Example 13: Query multiple CPU usage metrics with a single query
In this example, you'll learn how to use the append command to return multiple CPU metrics with a single query.
Combining queries into one command can be useful for comparing measurements from different contexts, as they will be charted together.
As you query many metrics from a single host and perform no arithmetic, the append command here is preferred to querying multiple metrics with a single timeseries command. The append command is a comparatively more flexible option, as it doesn't require equivalent by or filter arguments, for example. Additionally, chaining append is more efficient from a DQL perspective.
timeseries idle=avg(dt.host.cpu.idle),
  by:dt.entity.host,
  filter: dt.entity.host == "HOST-EFAB6D2FE7274823"
| append [
    timeseries system=avg(dt.host.cpu.system),
      by:dt.entity.host,
      filter: dt.entity.host == "HOST-EFAB6D2FE7274823"
  ]
| append [
    timeseries user=avg(dt.host.cpu.user),
      by:dt.entity.host,
      filter: dt.entity.host == "HOST-EFAB6D2FE7274823"
  ]
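Example 13 prefers append because it performs no arithmetic. For contrast, the sketch below shows the case where a single timeseries command is the better fit: when you want to combine metrics arithmetically, they need to sit in the same record. The field name busy is illustrative, and the query follows the multi-metric pattern from Examples 7 and 9.

```dql
// Sketch: query two CPU metrics in one timeseries command so they can be added.
timeseries {
    system=avg(dt.host.cpu.system),
    user=avg(dt.host.cpu.user)
  },
  by:{dt.entity.host},
  filter:{dt.entity.host == "HOST-EFAB6D2FE7274823"}
// Both series are fields of the same record, so [] can add them point by point
| fieldsAdd busy = system[] + user[]
```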
Example 14: Connection failure rate by host
In this example, you'll apply what you've learned from previous examples to calculate the failure rate and find hosts running processes with many failed connections.
This example uses the default parameter to control for the case where there are no failures. It inserts a 0 value anywhere data is missing.
Failure-rate calculations are common and critical for monitoring service-level objectives. Spotting persistent or recurring high failure rates in testing environments could indicate a deployment problem before the application reaches production.
timeseries {
    new = sum(dt.process.network.sessions.new),
    {reset = sum(dt.process.network.sessions.reset), default:0},
    {timeout = sum(dt.process.network.sessions.timeout), default:0}
  },
  by:{dt.entity.host}
| fieldsAdd result = 100 * (reset[] + timeout[]) / new[]
| filter arrayAvg(result) > 0
| sort arrayAvg(result) desc