
Metrics on Grail examples


Metrics on Grail let you pinpoint and retrieve any metric data with Dynatrace Query Language (DQL). After reviewing the fundamentals of DQL queries, use the examples on this page to start getting answers from your metrics.

Example 1: Average CPU usage across all hosts

In this example, you'll query the average CPU usage across all monitored hosts in your environment.

OneAgent collects CPU measurements from its host machine. These metrics are accessible through metric keys beginning with dt.host.cpu.

Observing the aggregate CPU usage across all hosts can help you visually confirm how your infrastructure responds to and recovers from usage spikes or slow, imperceptible growth trends over time.

dql
timeseries usage=avg(dt.host.cpu.usage)
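To make the aggregation concrete, here is a minimal Python sketch of what avg produces when no by parameter is given: one series whose value in each time interval is the mean across all hosts. The host IDs and values are hypothetical, and the sketch assumes every host reports exactly one value per interval. It models the idea, not how Grail computes averages internally.

```python
# Hypothetical per-host CPU usage (%), one value per time interval.
# Assumption: every host reports exactly one value in every interval.
per_host = {
    "HOST-A": [20.0, 40.0, 60.0],
    "HOST-B": [10.0, 20.0, 30.0],
}

def avg_across_hosts(series_by_host):
    """Return one series whose value at each interval is the mean over all hosts."""
    columns = zip(*series_by_host.values())  # regroup values by interval
    return [sum(col) / len(col) for col in columns]

usage = avg_across_hosts(per_host)
print(usage)  # [15.0, 30.0, 45.0]
```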

Example 2: Average CPU usage by host, limit to top 3 hosts

In this example, you'll query each monitored host's average CPU usage and focus on the three hosts with the highest usage.

OneAgent collects CPU measurements from its host machine. These metrics are accessible through metric keys beginning with dt.host.cpu.

Charting individual hosts' CPU usage helps you visualize normal and outlier usage. By focusing on the three hosts with the highest CPU usage, you can begin investigating under-provisioned applications. Likewise, focusing on hosts with the lowest CPU usage may reveal over-provisioning and lead to cost-saving opportunities.

  1. Query the data.

    dql
    timeseries usage=avg(dt.host.cpu.usage), by:{dt.entity.host} | sort arrayAvg(usage) desc | limit 3
  2. Simplify results.

    A table can be easier to read than a line chart in some situations. To get output that suits a table, keep only the columns that matter most: dt.entity.host and usage.

    dql
    timeseries usage=avg(dt.host.cpu.usage), by:{dt.entity.host} | sort arrayAvg(usage) desc | limit 3 | fields dt.entity.host, usage=arrayAvg(usage)

    This is the same query as in step 1, except that the final fields command replaces each host's usage series with its average (arrayAvg), so every host is reduced to a single value.

    For a list of available array functions, such as arrayAvg and arraySum, refer to the DQL documentation. If you're familiar with metric expressions, you'll find these functions similar to the :fold transformation.
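The two steps above can be modeled in a few lines of Python. This is a sketch on hypothetical records, not Dynatrace code: each record stands for one row of timeseries output, with usage holding one value per interval, and statistics.mean plays the role of arrayAvg.

```python
from statistics import mean

# Hypothetical timeseries output: one record per host, "usage" holds one value per interval.
records = [
    {"dt.entity.host": "HOST-A", "usage": [10.0, 20.0]},
    {"dt.entity.host": "HOST-B", "usage": [80.0, 90.0]},
    {"dt.entity.host": "HOST-C", "usage": [50.0, 70.0]},
    {"dt.entity.host": "HOST-D", "usage": [30.0, 30.0]},
]

# Step 1: sort arrayAvg(usage) desc | limit 3
top3 = sorted(records, key=lambda r: mean(r["usage"]), reverse=True)[:3]

# Step 2: fields dt.entity.host, usage=arrayAvg(usage)
table = [{"dt.entity.host": r["dt.entity.host"], "usage": mean(r["usage"])} for r in top3]

for row in table:
    print(row)
```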

Example 3: Average CPU usage for a specific host

In this example, you'll learn how to filter results from the timeseries command.

You can use the timeseries command's filter parameter to filter hosts directly. It accepts the same expressions as the filter command.

Because host.name is a common field, you can filter on it without a lookup command.

dql
timeseries usage=avg(dt.host.cpu.usage), by:{dt.entity.host}, filter:{host.name=="dw0sdwk00012U"}
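A rough Python model of a filter parameter applied inside timeseries follows: data points whose host.name doesn't match are discarded before the per-host, per-interval averages are computed. The data points and the entity IDs HOST-1 and HOST-2 are hypothetical, and this sketch simply omits intervals without data rather than representing them as gaps.

```python
from collections import defaultdict

# Hypothetical raw data points: (interval index, host entity ID, host.name, CPU usage %).
points = [
    (0, "HOST-1", "dw0sdwk00012U", 40.0),
    (0, "HOST-2", "other-host", 90.0),
    (1, "HOST-1", "dw0sdwk00012U", 60.0),
    (1, "HOST-2", "other-host", 10.0),
]

def timeseries_avg(points, host_name):
    """Keep only points matching host_name, then average per host and interval."""
    buckets = defaultdict(lambda: defaultdict(list))
    for interval, host_id, name, value in points:
        if name != host_name:  # the filter is applied before aggregation
            continue
        buckets[host_id][interval].append(value)
    return {
        host_id: [sum(v) / len(v) for _, v in sorted(by_interval.items())]
        for host_id, by_interval in buckets.items()
    }

result = timeseries_avg(points, "dw0sdwk00012U")
print(result)  # {'HOST-1': [40.0, 60.0]}
```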

Example 4: Average CPU usage for tagged hosts

This example uses an in condition to filter results to hosts tagged with a responsible team in your organization.

By using the in operator with a subquery, you can filter on host attributes such as tags or ipAddress.

Using the timeseries filter parameter is a good way to improve query performance.

dql
timeseries usage=avg(dt.host.cpu.usage), by:{dt.entity.host}, filter:{ dt.entity.host in [ fetch dt.entity.host | filter matchesValue(tags, "team:Dorado") | fields id ] }
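The in subquery boils down to set membership: the subquery yields a list of host IDs, and the filter keeps only series whose dt.entity.host is in that list. The Python sketch below models this on hypothetical hosts and series. It treats matchesValue as an exact, case-sensitive comparison against each tag, which is a simplification of the real function.

```python
# Hypothetical entity records, standing in for `fetch dt.entity.host`.
hosts = [
    {"id": "HOST-1", "tags": ["team:Dorado", "env:prod"]},
    {"id": "HOST-2", "tags": ["team:Other"]},
    {"id": "HOST-3", "tags": ["team:Dorado"]},
]

# Hypothetical per-host series, standing in for the timeseries output.
series = {"HOST-1": [30.0, 35.0], "HOST-2": [70.0, 75.0], "HOST-3": [50.0, 55.0]}

# Subquery: filter matchesValue(tags, "team:Dorado") | fields id
# (simplified here to an exact match against each tag)
tagged_ids = {h["id"] for h in hosts if "team:Dorado" in h["tags"]}

# Outer filter: dt.entity.host in [ ...subquery... ]
filtered = {host: values for host, values in series.items() if host in tagged_ids}
print(filtered)
```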

Example 5: Number of hosts sending CPU usage data

In this example, you'll learn how to chain timeseries with summarize. You'll first query hosts sending CPU usage data, and then count the number of hosts in the result.

As the previous examples show, other DQL commands can also be chained with timeseries. Unlike those commands, however, summarize further aggregates the dataset returned by timeseries. You'll find this two-step aggregation helpful as your questions become more complex and nuanced.

dql
timeseries usage=avg(dt.host.cpu.usage), by:{dt.entity.host} | summarize count()
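The following Python sketch, using hypothetical data points, shows why the two steps differ: the first aggregation yields one record per host, and count() then counts those records rather than the raw data points.

```python
from collections import defaultdict

# Hypothetical raw data points: (host entity ID, CPU usage %).
points = [("HOST-1", 10.0), ("HOST-1", 20.0), ("HOST-2", 50.0), ("HOST-3", 30.0), ("HOST-3", 40.0)]

# Step 1 (timeseries ..., by:{dt.entity.host}): one record per host.
per_host = defaultdict(list)
for host, value in points:
    per_host[host].append(value)
records = [{"dt.entity.host": h, "usage": v} for h, v in per_host.items()]

# Step 2 (summarize count()): aggregate the records themselves, not the raw points.
host_count = len(records)
print(host_count)  # 3
```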

Example 6: Total disk throughput for a given host across all disks

In this example, you'll learn how to focus on the relevant context by using the by parameter.

Like a pivot table, the by parameter aggregates timeseries that span many dimensions (here, the individual disks of each host) into a single host context. This example also uses in to query a specific host, as in Example 4.

OneAgent ingests host metrics from many contexts, but you rarely need all the information at once. Aggregating timeseries with by helps you to focus on the context that matters for your questions.

dql
timeseries bytes_written=sum(dt.host.disk.bytes_written), by:{dt.entity.host}, filter:{ dt.entity.host in [ fetch dt.entity.host | filter matchesValue(ipAddress, "10.128.0.106") | fields id ] }
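As a sketch of what by:{dt.entity.host} does to a per-disk metric, the Python below adds the series of all disks on the same host interval by interval, leaving one series per host. The host and disk IDs are hypothetical, and the sketch assumes all series cover the same intervals with no gaps.

```python
# Hypothetical per-disk series: (host ID, disk ID) -> bytes written per interval.
per_disk = {
    ("HOST-1", "DISK-A"): [100, 200, 300],
    ("HOST-1", "DISK-B"): [50, 50, 50],
    ("HOST-2", "DISK-C"): [10, 20, 30],
}

def sum_by_host(series):
    """Drop the disk dimension by summing each host's disk series interval by interval."""
    totals = {}
    for (host, _disk), values in series.items():
        if host in totals:
            totals[host] = [a + b for a, b in zip(totals[host], values)]
        else:
            totals[host] = list(values)
    return totals

bytes_written = sum_by_host(per_disk)
print(bytes_written)  # {'HOST-1': [150, 250, 350], 'HOST-2': [10, 20, 30]}
```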

Example 7: Top hosts by bytes read with corresponding bytes written

In this example, you'll enrich a single result with context from another metric.

Even when focused on disk read operations, the corresponding disk writes can provide helpful context.

dql
timeseries by:{dt.entity.host}, { bytes_read=sum(dt.host.disk.bytes_read), bytes_written=sum(dt.host.disk.bytes_written) } | sort arrayAvg(bytes_read) desc | limit 3 | fields dt.entity.host, bytes_read=arrayAvg(bytes_read), bytes_written=arrayAvg(bytes_written)
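Because both metrics come from one timeseries command, each host's bytes_read and bytes_written series sit in the same record, so sorting by one keeps the other aligned. The Python sketch below, on hypothetical records, illustrates this; statistics.mean stands in for arrayAvg.

```python
from statistics import mean

# Hypothetical timeseries output: both metrics share one record per host.
records = [
    {"dt.entity.host": "HOST-1", "bytes_read": [100, 300], "bytes_written": [10, 30]},
    {"dt.entity.host": "HOST-2", "bytes_read": [900, 700], "bytes_written": [5, 15]},
    {"dt.entity.host": "HOST-3", "bytes_read": [400, 600], "bytes_written": [80, 120]},
    {"dt.entity.host": "HOST-4", "bytes_read": [50, 50], "bytes_written": [500, 700]},
]

# sort arrayAvg(bytes_read) desc | limit 3
top = sorted(records, key=lambda r: mean(r["bytes_read"]), reverse=True)[:3]

# fields dt.entity.host, bytes_read=arrayAvg(bytes_read), bytes_written=arrayAvg(bytes_written)
table = [
    {
        "dt.entity.host": r["dt.entity.host"],
        "bytes_read": mean(r["bytes_read"]),
        "bytes_written": mean(r["bytes_written"]),
    }
    for r in top
]

for row in table:
    print(row)
```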

Example 8: Total network traffic by host

In this example, you'll calculate total network traffic on your hosts.

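Dynatrace collects network traffic in two metrics, dt.host.net.nic.bytes_rx (received) and dt.host.net.nic.bytes_tx (transmitted). You'll calculate the total traffic by reducing each series to a single total with arraySum, then adding the two totals and multiplying by 1e-9 to convert bytes to gigabytes, which creates traffic_gb.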

This example highlights one advantage of DQL over metric selectors. With metric selectors, you can calculate traffic_gb, but you lose the bytes_rx and bytes_tx data used as inputs. With DQL, these fields remain in the query output.

dql
timeseries by:{dt.entity.host}, { bytes_rx = sum(dt.host.net.nic.bytes_rx), bytes_tx = sum(dt.host.net.nic.bytes_tx) } | fieldsAdd bytes_rx = arraySum(bytes_rx) | fieldsAdd bytes_tx = arraySum(bytes_tx) | fieldsAdd traffic_gb = 1e-9 * (bytes_rx + bytes_tx)
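The arithmetic behind traffic_gb, sketched in Python for a single hypothetical host: each series is summed (the role of arraySum), the two totals are added, and the result is scaled by 1e-9. The inputs bytes_rx and bytes_tx stay in the output row alongside traffic_gb, mirroring the DQL output.

```python
# Hypothetical per-interval byte counts for one host.
bytes_rx_series = [2_000_000_000, 1_500_000_000, 500_000_000]
bytes_tx_series = [1_000_000_000, 500_000_000, 500_000_000]

# fieldsAdd bytes_rx = arraySum(bytes_rx), and likewise for bytes_tx
bytes_rx = sum(bytes_rx_series)  # 4_000_000_000
bytes_tx = sum(bytes_tx_series)  # 2_000_000_000

# fieldsAdd traffic_gb = 1e-9 * (bytes_rx + bytes_tx)
traffic_gb = 1e-9 * (bytes_rx + bytes_tx)

# The inputs remain available next to the derived value.
row = {"bytes_rx": bytes_rx, "bytes_tx": bytes_tx, "traffic_gb": traffic_gb}
print(row)
```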

Example 9: Available CPU by Kubernetes node

In this example, you'll calculate the available CPU on all nodes of your hypothetical "openfeature" cluster.

To return a timeseries instead of a single value, the query uses the [] operator, which subtracts requests_cpu from cpu_allocatable value by value, one time interval at a time. The result is another timeseries that you can visualize with a line chart.

Knowing the available CPU is essential for efficient resource utilization and for avoiding resource contention. A line chart of this timeseries shows how the available CPU changes over time.

dql
timeseries { cpu_allocatable=min(dt.kubernetes.node.cpu_allocatable), requests_cpu=max(dt.kubernetes.node.requests_cpu) }, by:{dt.entity.kubernetes_node}, filter:{k8s.cluster.name == "openfeature"} | fieldsAdd result = cpu_allocatable[] - requests_cpu[] | fieldsRemove cpu_allocatable, requests_cpu
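In Python terms, the [] operator behaves like an element-wise operation over two aligned lists, which is why the result is still a series. This sketch uses hypothetical values for one node and assumes both series have a value in every interval.

```python
# Hypothetical per-interval values for one Kubernetes node (CPU cores).
cpu_allocatable = [4.0, 4.0, 4.0]
requests_cpu = [1.5, 2.5, 3.0]

# cpu_allocatable[] - requests_cpu[]: subtract interval by interval,
# producing another series rather than a single number.
result = [a - r for a, r in zip(cpu_allocatable, requests_cpu)]
print(result)  # [2.5, 1.5, 1.0]
```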

Example 10: Filter relevant hosts by state

In this example, you'll learn how to use an in operator to focus on running hosts.

An unfiltered timeseries query can take unnecessarily long in large environments with thousands of hosts. Applying the filter directly in the timeseries command can make your queries faster, because unwanted data is ignored.

dql
timeseries usage=avg(dt.host.cpu.usage), by:{dt.entity.host}, filter:{ dt.entity.host in [ fetch dt.entity.host | filter state == "RUNNING" | fields id ] }

Example 11: Average process memory usage by the responsible dev team

In this example, you'll use two nested in comparisons to filter results to processes running on hosts tagged with a responsible team in your organization.

Each process runs on a host, and the host carries the team tag you need to filter on. The inner subquery fetches the IDs of hosts tagged team:Dorado. The outer subquery then fetches the process group instances whose belongs_to[dt.entity.host] relationship points to one of those hosts, and the timeseries filter keeps only those processes.

Nested in comparisons are often necessary when working with complex relationships.

dql
timeseries usage=avg(dt.process.memory.usage), by:{dt.entity.process_group_instance}, filter:{ dt.entity.process_group_instance in [ fetch dt.entity.process_group_instance | filter belongs_to[dt.entity.host] in [ fetch dt.entity.host | filter matchesValue(tags, "team:Dorado") | fields id ] | fields id ] }
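The nesting can be modeled in Python as two set-membership steps, innermost first. The entity data is hypothetical, and the belongs_to_host key is a simplified, made-up stand-in for the belongs_to[dt.entity.host] relationship.

```python
# Hypothetical entity data standing in for the two fetch subqueries.
hosts = [
    {"id": "HOST-1", "tags": ["team:Dorado"]},
    {"id": "HOST-2", "tags": ["team:Other"]},
]
processes = [
    {"id": "PGI-1", "belongs_to_host": "HOST-1"},  # hypothetical simplified relationship field
    {"id": "PGI-2", "belongs_to_host": "HOST-2"},
    {"id": "PGI-3", "belongs_to_host": "HOST-1"},
]
# Hypothetical memory usage series per process group instance.
series = {"PGI-1": [200.0], "PGI-2": [300.0], "PGI-3": [150.0]}

# Inner subquery: IDs of hosts tagged team:Dorado.
dorado_hosts = {h["id"] for h in hosts if "team:Dorado" in h["tags"]}

# Outer subquery: IDs of processes whose host is in the inner result.
dorado_processes = {p["id"] for p in processes if p["belongs_to_host"] in dorado_hosts}

# timeseries filter: keep only the series of those processes.
filtered = {pgi: v for pgi, v in series.items() if pgi in dorado_processes}
print(filtered)
```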

Example 12: Average host CPU usage by host size

In this example, you'll learn how to use a lookup command to analyze host CPU usage by host size.

OneAgent collects local context from its host: information such as how many CPUs are installed and how much memory it has. You can add this information to your query with a lookup command.

Host-level information can sometimes be too fine-grained and difficult to interpret. In these situations, a well-chosen lookup can help you explore and analyze how individual hosts contribute to broader trends.

dql
timeseries usage=avg(dt.host.cpu.usage), by:{dt.entity.host} | fieldsAdd usage=arrayAvg(usage) | lookup [ fetch dt.entity.host | fieldsAdd cpuCores ], sourceField:dt.entity.host, lookupField:id, fields:{cpuCores} | summarize avg(usage), count_hosts=count(), by:{cpuCores}
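A Python sketch of the lookup-then-summarize pattern, using hypothetical hosts and core counts: the lookup acts like a join on dt.entity.host = id that copies cpuCores into each row, and summarize then groups rows by cpuCores. Keeping unmatched rows with cpuCores set to None is a choice made for this sketch.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-host average usage (after fieldsAdd usage=arrayAvg(usage)).
usage_by_host = [
    {"dt.entity.host": "HOST-1", "usage": 20.0},
    {"dt.entity.host": "HOST-2", "usage": 40.0},
    {"dt.entity.host": "HOST-3", "usage": 90.0},
]
# Hypothetical host entities, standing in for `fetch dt.entity.host | fieldsAdd cpuCores`.
host_entities = {"HOST-1": {"cpuCores": 4}, "HOST-2": {"cpuCores": 4}, "HOST-3": {"cpuCores": 16}}

# lookup ..., sourceField:dt.entity.host, lookupField:id, fields:{cpuCores}
for row in usage_by_host:
    entity = host_entities.get(row["dt.entity.host"])
    row["cpuCores"] = entity["cpuCores"] if entity else None

# summarize avg(usage), count_hosts=count(), by:{cpuCores}
groups = defaultdict(list)
for row in usage_by_host:
    groups[row["cpuCores"]].append(row["usage"])
summary = {cores: {"avg(usage)": mean(u), "count_hosts": len(u)} for cores, u in groups.items()}
print(summary)
```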

When should I use in or lookup?

  • Use an in comparison to filter your data. An in is usually faster than a lookup, so it's the better choice when you only need to filter.

  • Use lookup when you want to group by a field that's not returned by the base timeseries command, such as a host's ipAddress.

Choosing correctly leads to smoother interactions in your dashboards, notebooks, and other Platform applications.

Example 13: Query multiple CPU usage metrics with a single query

In this example, you'll learn how to use the append command to return multiple CPU metrics with a single query.

Combining several queries into one is useful for comparing measurements from different contexts, because the results are charted together.

Because you query several metrics from a single host and perform no arithmetic on them, the append command is preferable here to querying multiple metrics with a single timeseries command. append is also more flexible: for example, the appended queries don't need identical by or filter arguments. Additionally, chaining append is more efficient from a DQL perspective.

dql
timeseries idle=avg(dt.host.cpu.idle), by:dt.entity.host, filter: dt.entity.host == "HOST-EFAB6D2FE7274823" | append [ timeseries system=avg(dt.host.cpu.system), by:dt.entity.host, filter: dt.entity.host == "HOST-EFAB6D2FE7274823" ] | append [ timeseries user=avg(dt.host.cpu.user), by:dt.entity.host, filter: dt.entity.host == "HOST-EFAB6D2FE7274823" ]
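Unlike a join, append stacks the records of each subquery below the existing ones. The Python sketch below, with hypothetical values for the host in the query, models this as list concatenation, so each CPU metric stays in its own record.

```python
# Hypothetical results of the three timeseries queries for one host.
host = "HOST-EFAB6D2FE7274823"
idle = [{"dt.entity.host": host, "idle": [70.0, 72.0]}]
system = [{"dt.entity.host": host, "system": [10.0, 9.0]}]
user = [{"dt.entity.host": host, "user": [20.0, 19.0]}]

# append stacks records from each subquery below the existing records;
# it does not merge them, so each metric keeps its own record.
combined = idle + system + user
for record in combined:
    print(record)
```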

Example 14: Connection failure rate by host

In this example, you'll apply what you've learned from previous examples to calculate the failure rate and find hosts running processes with many failed connections.

This example uses the default parameter to handle intervals with no failures: it inserts a 0 wherever reset or timeout data is missing, so missing values don't interrupt the failure-rate calculation.

Failure-rate calculations are common and critical for monitoring service-level objectives. Spotting persistent or recurring high failure rates in testing environments could indicate a deployment problem before the application reaches production.

dql
timeseries { new = sum(dt.process.network.sessions.new), {reset = sum(dt.process.network.sessions.reset), default:0}, {timeout = sum(dt.process.network.sessions.timeout), default:0} }, by:{dt.entity.host} | fieldsAdd result = 100 * (reset[] + timeout[]) / new[] | filter arrayAvg(result) > 0 | sort arrayAvg(result) desc
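Here is a Python sketch of the failure-rate arithmetic for one hypothetical host, with None marking missing values. The helper with_default mimics default:0 for reset and timeout. Returning None for intervals with zero new sessions is an assumption of this sketch, not documented DQL behavior.

```python
# Hypothetical per-interval session counts for one host; None marks a missing value.
new = [100, 200, 50, 0]
reset_raw = [5, None, 2, None]
timeout_raw = [None, 10, None, None]

def with_default(series, default=0):
    """Mimic default:0 by replacing missing values with the default."""
    return [default if v is None else v for v in series]

reset = with_default(reset_raw)
timeout = with_default(timeout_raw)

def failure_rate(new, reset, timeout):
    """100 * (reset + timeout) / new, interval by interval.

    Assumption of this sketch: an interval with no new sessions yields None
    instead of dividing by zero.
    """
    return [None if n == 0 else 100 * (r + t) / n for n, r, t in zip(new, reset, timeout)]

result = failure_rate(new, reset, timeout)
print(result)  # [5.0, 5.0, 4.0, None]
```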