Adaptive traffic management for Dynatrace Managed
PurePath® distributed traces are end-to-end transactions captured by OneAgent. Each minute, a statistically relevant number of end-to-end distributed traces is captured within each monitored process. Each trace contains code-level and business insights derived from service-level calls to multiple tiers. Because each trace is captured fully and end-to-end, second- and third-level tiers often capture more total service calls than entry-point processes.
When the volume of transactions is high, capturing all traces can increase network bandwidth demands. OneAgent provides a built-in limiter to manage such cases. Each process monitored by OneAgent is allowed to start only a given number of distributed traces per minute. Once the quota is reached, the monitored traffic is used in the most effective way possible via the intelligent mechanism of adaptive traffic management.
How is adaptive traffic management different from other sampling mechanisms?
In typical applications, requests are not evenly distributed. Instead, traffic is a combination of a large number of unique URLs, a medium number of important requests, and a few kinds of requests that make up the majority of the traffic (for example, image requests or status checks).
With adaptive traffic management, OneAgent first calculates a list of top requests starting each minute, from which it then captures:
- Most traces of unique and rare requests.
- A significant but lower volume of highly frequent requests.
In this way, OneAgent reduces the data sent to your environment, ensuring that the number of captured traces stays within the host-unit limits of your Dynatrace agreement. Because the sampling is not random, all important data is captured while maintaining a statistically valid sample set.
The following table represents a top-request calculation example, along with the respective capture rates.
|Request|Number of requests processed by the application|Capture factor|Captured distributed traces|
…50 other URIs
In this example, slightly more than 1,000 requests/min are captured by OneAgent, according to the configured target number of requests. Depending on the capture factor, URIs are captured every time (URIs C, D, and the 50 other URIs) or only 50% of the time (URIs A and B). In the latter case, requests are still traced end-to-end by OneAgent more than 600 times/minute.
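The relationship between request frequency and capture factor can be sketched in a few lines of Python. This is only an illustration, not OneAgent's actual algorithm: the function name `capture_factors`, the fair-share allocation strategy, and the request counts in the usage example are all hypothetical. The idea is that rare requests are captured every time, while the most frequent requests share whatever is left of the per-minute target.

```python
def capture_factors(counts: dict[str, int], target: int) -> dict[str, float]:
    """Return a capture factor (0.0 to 1.0) per request type so that the
    expected number of captured traces does not exceed `target`.

    Hypothetical fair-share allocation, not OneAgent's real implementation.
    """
    # Below the target, every request can be captured.
    if sum(counts.values()) <= target:
        return {uri: 1.0 for uri in counts}

    remaining_budget = float(target)
    remaining = sorted(counts.items(), key=lambda item: item[1])  # rarest first
    factors: dict[str, float] = {}

    while remaining:
        fair_share = remaining_budget / len(remaining)
        uri, count = remaining[0]
        if count <= fair_share:
            # Rare request: capture all of it and release the unused budget.
            factors[uri] = 1.0
            remaining_budget -= count
            remaining.pop(0)
        else:
            # All requests left are frequent: each gets the same absolute share,
            # which translates into a capture factor below 1.0.
            for frequent_uri, frequent_count in remaining:
                factors[frequent_uri] = fair_share / frequent_count
            break
    return factors


# Made-up counts per minute (not the values from the table above).
example = capture_factors({"A": 1000, "B": 1000, "C": 100, "D": 100}, target=1200)
```

With these made-up counts, the two rare request types keep a capture factor of 1.0 and the two frequent ones end up at 0.5, so the expected number of captured traces equals the target of 1,200.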
You can see the effect of adaptive traffic management in the distributed trace list. If OneAgent is sampling and not all requests are captured, the captured traces indicate that similar requests were not captured by showing the following message in the distributed trace list:
[amount] more like this
Using adaptive traffic management to reduce the volume of processed data saves network bandwidth and, in Dynatrace Managed environments, the CPU, memory, network, and storage resources that would otherwise be required to process and store the additional data.
Quota per process
In Dynatrace Managed, the quota of new distributed traces/min that each process can send to Dynatrace is 1,000. Because traffic management depends on your application architecture, network traffic is limited for high-volume entry points (such as a load balancer or NGINX) and spikes might occur.
Adaptive capture control
You can manage the quota of new entry-point distributed traces captured per minute in one of two ways:
- You can reduce the environment quota, and, thereby, the percentage of monitored incoming traffic.
- You can increase the environment quota up to 100,000 to ensure higher fidelity.
This effectively instructs OneAgent to capture all requests, even rare ones, within high-volume environments.
Setting this value too high can cause a resource shortage and increase hardware expenditures.
For each process or process group
Note that environment administrators can additionally modify the environment quota.
Go to Settings > Server-side service monitoring > Deep monitoring.
Adjusting this setting can help you in specific cases, for example, if a Dynatrace Managed environment for load testing is consuming too many network, disk, and CPU resources, and you'd rather use those resources for production monitoring. Adjustments to this setting are accounted for transparently in all analyses and do not affect metrics or service analysis features, with the exception of the distributed traces list.
Monitor adaptive traffic usage and thresholds
You can use the preset dashboard OneAgent Traces - Adaptive traffic management to track usage and thresholds of adaptive traffic management. Metrics and charts provide insights into:
- Full-service calls per host unit
- Captured full-service calls
- OneAgent capture rate
Adaptive load reduction
Adaptive load reduction is a dynamic mechanism that targets environments with a high volume of traffic compared to their assigned host units. Because Dynatrace Managed environments can process a limited number of service calls per minute (depending on the node CPU amount and memory availability), this is particularly helpful for managing sporadic spikes in the volume of processed distributed traces.
When the number of service calls that an environment can process is exceeded, adaptive load reduction is triggered:
New incoming distributed traces are skipped at random, gradually reducing the number of processed distributed traces.
Note that service calls of full distributed traces already in progress are not targeted.
The number of skipped distributed traces is taken into account to ensure stable statistical validity for all metrics, charts, baselining, and events.
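As a rough illustration of the behavior described above, the following Python sketch skips new traces at random, adjusts the skip rate gradually, and keeps a count of skipped traces so that totals can still reflect real traffic. All names (`next_keep_probability`, `reduce_load`, `total_request_count`), the step size of 0.1, and the load and capacity figures are assumptions made for this example; the actual mechanism inside Dynatrace Managed is not documented here and may work differently.

```python
import random


def next_keep_probability(current: float, load: int, capacity: int,
                          step: float = 0.1) -> float:
    """Lower the keep probability by one step while overloaded,
    otherwise recover by one step (hypothetical gradual adjustment)."""
    if load > capacity:
        return max(0.0, current - step)
    return min(1.0, current + step)


def reduce_load(new_traces: list, keep_probability: float,
                rng: random.Random) -> tuple[list, int]:
    """Randomly skip new traces. Returns the kept traces and the number skipped.
    Traces already in progress are assumed to be handled elsewhere."""
    kept = [trace for trace in new_traces if rng.random() < keep_probability]
    return kept, len(new_traces) - len(kept)


def total_request_count(kept_count: int, skipped_count: int) -> int:
    """Requests actually seen: processed traces plus skipped ones, so that
    request totals are not understated by the reduction."""
    return kept_count + skipped_count


rng = random.Random(42)  # seeded only to make the example repeatable
keep = next_keep_probability(1.0, load=1200, capacity=1000)
kept, skipped = reduce_load(list(range(1200)), keep, rng)
print(len(kept), skipped, total_request_count(len(kept), skipped))
```

Because the skipped count is tracked alongside the kept traces, the reported total stays at 1,200 no matter how many traces were randomly skipped in that interval.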
You are informed about the reduction of processed data by:
- An alert message in the Dynatrace web UI:
Server [amount] activated adaptive load reduction
- A message in the distributed trace list:
[amount] more like this
Adaptive load reduction safeguards your Dynatrace environment from sporadic traffic spikes.
While occasional activation (for example, to cover spikes) will not harm the fidelity of your monitoring data, consistent use for intervals of 15 minutes or longer can impact the accuracy of your monitoring data and metrics because not all data is processed.
If your environment experiences frequent overloads, we recommend exploring long-term solutions:
- Adding hardware and a new Dynatrace Managed cluster node to provide your Dynatrace Managed cluster with the necessary resources to process the additional data.
- Adjusting OneAgent settings to reduce the incoming traffic.
These options should be considered whenever statistical accuracy of data capture is insufficient.
If your Dynatrace Managed cluster is undersized or if a specific request you're interested in comes from a high-volume tier (more than 1,000 requests/min), Dynatrace might not be able to capture the request.
You can increase the volume available for important requests by reducing the amount of traffic related to unimportant requests. To exclude unimportant requests from capture, use web request attributes and URL exclusion rules.
You can also increase the quota of captured distributed traces.
The short answer is, not at all.
The shaping of traffic is accounted for transparently and done in a way that ensures statistical validity while capturing rare requests with high probability. All charts show the total real number of requests that your application processes, as does all ad-hoc analysis you might perform. Dynatrace AI is not impacted by this, nor is alerting. You will not see a difference in charts or service call analysis data unless you're looking at a single distributed trace. The only place where this traffic shaping is visible is in the distributed traces list, which displays a message like
[number of traces] more like this.
- Full-service call
Server-side call that starts a distributed trace, a service call at a deep-monitored tier, or a custom service call.
All requests for web request services and web services (except for external ones), RMI services, messaging services and custom services are full-service calls.
External calls (such as database calls, external web requests, or generally any opaque service call) are not full-service calls, and so aren't counted against your traffic limit.
The minimum number of full-service calls per minute in a given environment is 5,000 (the equivalent of 20 host units). Each process can start between 50 and 50,000 full-service calls per minute.
- Active host units
Host units currently in use and connected to the environment (not the host units assigned to the environment).
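The limits quoted in these definitions can be expressed as simple arithmetic. The sketch below assumes that the environment limit scales linearly with active host units at 250 full-service calls per minute per host unit. That rate is inferred only from the statement that 5,000 calls per minute is the equivalent of 20 host units, so treat it as an assumption rather than a documented formula. The function and constant names are hypothetical.

```python
MIN_ENVIRONMENT_CALLS_PER_MIN = 5_000   # stated minimum per environment
# Assumption: linear scaling, inferred from "5,000 = equivalent of 20 host units".
CALLS_PER_HOST_UNIT = MIN_ENVIRONMENT_CALLS_PER_MIN // 20  # 250

MIN_PROCESS_CALLS_PER_MIN = 50          # stated lower bound per process
MAX_PROCESS_CALLS_PER_MIN = 50_000      # stated upper bound per process


def environment_limit(active_host_units: float) -> int:
    """Full-service calls per minute for an environment, never below the minimum."""
    scaled = int(active_host_units * CALLS_PER_HOST_UNIT)
    return max(MIN_ENVIRONMENT_CALLS_PER_MIN, scaled)


def clamp_process_quota(requested: int) -> int:
    """Keep a requested per-process quota within the stated 50 to 50,000 range."""
    return max(MIN_PROCESS_CALLS_PER_MIN, min(MAX_PROCESS_CALLS_PER_MIN, requested))
```

Under this assumption, an environment with 10 active host units would still get the 5,000 minimum, while one with 40 active host units would get 10,000 full-service calls per minute.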