Adaptive traffic management and control

A core feature of Dynatrace is the distributed tracing capability known as PurePath. A single PurePath is a full end-to-end distributed trace. In contrast to other tracing technologies, Dynatrace PurePaths are automatically captured by OneAgent. In addition to service-level traces and response times, PurePaths also provide deep code-level insights, which enable method-hotspot analysis, request attributes, request and database analysis, and detailed error analysis. All of this is provided automatically out-of-the-box with Dynatrace PurePath distributed tracing. PurePath is also one of the core ingredients that enables the Dynatrace AI to perform automatic baselining and root cause analysis. In support of this, Dynatrace distributed tracing provides the highest level of data granularity and fidelity on the market.

Dynatrace high-fidelity distributed tracing

Dynatrace OneAgent traces each distributed transaction end-to-end. OneAgent captures a higher number of distributed traces than any other tracing technology on the market. With a standard setup, each Dynatrace OneAgent captures at least 1,000 new end-to-end distributed traces (PurePaths) within each monitored process every minute (1,000/min/process). Each PurePath is an end-to-end distributed trace that contains code-level and business insight. Because every trace is captured in full, second- and third-level tiers often capture many more total traces than entry-point processes.

Due to this high fidelity, in most cases Dynatrace captures every single distributed transaction. In high-volume cases, however, a single entry-point tier can process many more requests than this. Capturing all traces all the time in such situations can lead to rather high network bandwidth demands. To manage this, Dynatrace OneAgent provides a built-in limiter. Each OneAgent-monitored process is allowed to start 1,000 distributed traces per minute. This limit can be thought of as a quota.

Each OneAgent-monitored process captures a limited number of new PurePaths every minute

Once the PurePath quota is reached, OneAgent applies an intelligent mechanism to make use of the monitored traffic in the most effective way possible. This approach is called "adaptive traffic management."

Adaptive traffic management for Dynatrace OneAgent

Adaptive traffic management caps the number of new distributed traces that each OneAgent-monitored process starts each minute, thereby limiting the amount of data that's sent. At the same time, it ensures that all important traces are fully captured and that a statistically valid set of traces is maintained for the more frequent but less important requests.

To achieve this, OneAgent calculates a list of the top requests that are started each minute. Typical applications don't have an even distribution of requests. Rather, there are a few kinds of requests that make up a majority of the traffic (for example, image requests or status checks), a medium number of important requests, and a large number of unique URLs. Based on the list of top requests, OneAgent captures traffic in such a way that requests that have the highest volume are captured less frequently (thereby avoiding capture of "more of the same") while every single unique or rare request is captured.

The following table represents such a top-request calculation example, along with the respective capture rates.

Request            Requests processed   Capture factor   Captured PurePaths
URI A              900                  1/2              450
URI B              440                  1/2              220
URI C              250                  1                250
URI D              60                   1                60
50 random others   100                  1                100
Total              1,750                                 1,080

In this example, OneAgent captures 1,080 requests/min, slightly above the target of 1,000 requests/min. URIs C and D and the 50 random others are captured every single time, while A and B are captured 50% of the time. Even so, OneAgent still traces A and B end-to-end a combined 670 times/minute.
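The arithmetic behind the table can be reproduced with a few lines of Python. This is only an illustrative sketch: the names `TOP_REQUESTS` and `captured_per_minute` are made up for this example, and the capture factors are taken from the table as given rather than derived, because the way OneAgent actually chooses them isn't described here.

```python
from fractions import Fraction

# (request, requests processed per minute, capture factor), copied from the table.
# The capture factors are inputs here; how OneAgent picks them is not modeled.
TOP_REQUESTS = [
    ("URI A", 900, Fraction(1, 2)),
    ("URI B", 440, Fraction(1, 2)),
    ("URI C", 250, Fraction(1)),
    ("URI D", 60, Fraction(1)),
    ("50 random others", 100, Fraction(1)),
]


def captured_per_minute(rows):
    """Return the expected number of captured PurePaths per request type."""
    return {name: int(count * factor) for name, count, factor in rows}


captured = captured_per_minute(TOP_REQUESTS)
total_processed = sum(count for _name, count, _factor in TOP_REQUESTS)
total_captured = sum(captured.values())
sampled_a_and_b = captured["URI A"] + captured["URI B"]

print(f"processed: {total_processed}, captured: {total_captured}")
print(f"URI A and URI B are still traced {sampled_a_and_b} times per minute")
```

Only the two highest-volume requests are sampled; everything else keeps a capture factor of 1, which is why rare requests are never dropped in this example.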

In almost all cases you won't notice this behavior. Dynatrace retains the knowledge of the capture rate and calculates the response time, throughput, and error rate accordingly. All charts and service analysis show extrapolated data based on the original capture rate. The Dynatrace AI still has more than enough data (more than any other APM solution) to provide you with actionable insights and detailed root-cause analysis of detected problems. You can see this behavior when an individual PurePath in the PurePath list says "3 more like this." This indicates that the request was captured once, but that three other requests just like it were processed by the monitored application without being captured as PurePaths.
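One way to picture this extrapolation is to treat every captured PurePath as standing in for itself plus the requests it represents. The sketch below is an assumption-laden illustration, not Dynatrace's implementation: the `CapturedTrace` type, its `more_like_this` field, and the `extrapolate` function are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass
class CapturedTrace:
    response_ms: float
    failed: bool
    more_like_this: int  # hypothetical field: similar requests that were not captured


def extrapolate(traces):
    """Estimate service metrics by weighting each captured trace by the
    number of requests it represents (itself plus the uncaptured ones)."""
    weights = [1 + t.more_like_this for t in traces]
    total = sum(weights)
    errors = sum(w for w, t in zip(weights, traces) if t.failed)
    weighted_ms = sum(w * t.response_ms for w, t in zip(weights, traces))
    return {
        "throughput": total,
        "error_rate": errors / total,
        "mean_response_ms": weighted_ms / total,
    }


traces = [
    CapturedTrace(response_ms=120.0, failed=False, more_like_this=3),
    CapturedTrace(response_ms=80.0, failed=True, more_like_this=0),
    CapturedTrace(response_ms=200.0, failed=False, more_like_this=1),
]
stats = extrapolate(traces)
print(stats)
```

Here three captured PurePaths account for seven requests, so charts built on these estimates would report a throughput of seven rather than three.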

With this approach, adaptive traffic management saves you a lot of network bandwidth and, in the case of Dynatrace Managed, precious CPU, memory, network, and storage resources that would otherwise be required to process and store this data.

Adaptive load reduction in Dynatrace Managed

Due to the ease of OneAgent deployment, instrumenting new applications, hosts, or even large additional environments is no problem. Customers often onboard hundreds of hosts and applications a day to Dynatrace. This, of course, vastly increases the number of PurePaths that are processed by a Dynatrace Managed cluster.

Sometimes the initial sizing considerations for Dynatrace Managed nodes and clusters are inadequate to support such volume; a Dynatrace Managed cluster may lack the necessary hardware to process all the additional incoming data. To protect the health and integrity of your monitoring environment in such situations, Dynatrace Managed leverages adaptive load reduction on the incoming traces to ensure that monitoring remains stable while a statistically valid set of requests is captured for analysis.

Each Dynatrace Managed node can process a certain number of service calls per minute (a PurePath or distributed trace is made up of many service calls). The number of calls that can be processed depends on the number of CPUs and the amount of memory available to a node. Once this limit is breached, adaptive load reduction is engaged.

New PurePaths coming in from environments with the highest traffic in relation to their assigned host units (i.e., traffic/host unit) are targeted first for load reduction. Dynatrace Managed skips full PurePath processing in these environments. This happens in a random fashion and reduces the number of PurePaths that are processed in a staged fashion. At the same time, the statistical validity of all metrics, charts, baselining, and events is retained because Dynatrace knows the number of PurePaths that have been skipped. This is fully transparent to you, as Dynatrace raises an event and displays a message in the cluster UI.
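The selection logic described above can be sketched roughly as follows. Everything concrete in this example is an assumption made for illustration: the function names, the 10% reduction step, and the sample numbers are hypothetical, and the real cluster logic is not documented here.

```python
import random


def plan_load_reduction(environments, node_capacity, step_percent=10):
    """environments maps name -> (service_calls_per_min, host_units).
    Returns the percentage of PurePaths each environment keeps processing."""
    keep = {name: 100 for name in environments}

    def calls_kept(name):
        calls, _host_units = environments[name]
        return calls * keep[name] / 100

    def traffic_per_host_unit(name):
        _calls, host_units = environments[name]
        return calls_kept(name) / host_units

    # Reduce in stages, always targeting the environment with the highest
    # traffic per host unit, until the node is back within its capacity.
    while sum(calls_kept(n) for n in environments) > node_capacity:
        candidates = [n for n in environments if keep[n] > step_percent]
        if not candidates:
            break
        target = max(candidates, key=traffic_per_host_unit)
        keep[target] -= step_percent
    return keep


def should_process(keep_percent, rng=random):
    """Randomly decide whether a single incoming PurePath is fully processed."""
    return rng.random() * 100 < keep_percent


environments = {
    "prod": (60_000, 100),
    "loadtest": (90_000, 20),
    "dev": (5_000, 10),
}
plan = plan_load_reduction(environments, node_capacity=120_000)
print(plan)
```

With these sample numbers, only the load-test environment is reduced, because its traffic per host unit is far higher than that of the other two environments.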

Adaptive load reduction engaged on a Dynatrace Managed cluster due to a resource shortage

As with OneAgent traffic management, the reduction in processed data is accounted for transparently. This safeguards your cluster from spikes in traffic. In such situations the fact that not all data is processed has no negative impact on your monitoring. Dynatrace AI isn't impacted at all, nor is alerting. All service-based chart data is transparently adjusted (no change is visible) and all analysis views account for this. Unless you're looking at a single PurePath, you won't see a difference in charts or service call analysis data. The only place where this is visible is the PurePath list, which displays a message like "x more like this."

Only those environments that have a high volume of traffic compared to their assigned host units are targeted. All other environments remain unaffected.

If adaptive load reduction is engaged only occasionally, to cover spikes, you don't need to do anything. However, this shouldn't be an ongoing solution. While OneAgent adaptive traffic management does actively shape traffic, the feature described here exists only as a safeguard.

If adaptive load reduction is engaged on a consistent basis, you have a decision to make. You can add more hardware and a new Managed cluster node to provide your Dynatrace Managed cluster with the necessary resources to process the additional data. Or you can choose to reduce the incoming traffic by adjusting the traffic management settings for the environment's OneAgents.

Adaptive capture control

In Dynatrace Managed you can define the target number of newly monitored entry-point PurePaths captured per process/minute.

Change the maximum PurePath fidelity of an environment in Dynatrace Managed

The default number of newly monitored entry-point PurePaths captured per process/minute is 1,000, which is already a high number. Changing this setting may be useful in some situations. You might have a load-test environment that consumes too much network, disk, and CPU on your Dynatrace Managed cluster, and you'd rather use those resources for production monitoring. You can choose to reduce the PurePath fidelity of the environment and thereby reduce the percentage of monitored incoming traffic. In other words, the tests may still produce enough data granularity even after the number of traces is reduced to 500 or 100 per minute.
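To get a rough feel for how the target relates to the percentage of monitored traffic, you can compare it against the entry-point request rate of a process. The helper below, `estimated_capture_share`, is a hypothetical name and a deliberate simplification: it assumes capture is spread evenly, whereas adaptive traffic management favors rare requests and can end up slightly above the target.

```python
def estimated_capture_share(entry_requests_per_min, target_per_min=1000):
    """Approximate share of entry-point requests that are captured as PurePaths.

    Assumes an even spread across requests, which the adaptive mechanism
    does not guarantee; treat the result as a ballpark figure only.
    """
    if entry_requests_per_min <= 0:
        return 1.0
    return min(1.0, target_per_min / entry_requests_per_min)


# A load-test process that handles 5,000 entry-point requests per minute:
for target in (1000, 500, 100):
    share = estimated_capture_share(5000, target)
    print(f"target {target}/min -> roughly {share:.0%} of requests captured")
```

Processes that receive fewer requests than the target are captured in full, so lowering the target only affects the high-volume processes of an environment.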

By reducing the number of captured PurePaths, you aren't changing any metrics or any service analysis features. Outside of the PurePath list itself, this change is accounted for transparently in all analysis.

Alternatively, you can choose to increase the number of entry-point PurePaths captured per process/minute in your most important environments so as to get even higher fidelity than the default. The upper limit is 100,000 new end-to-end requests/process/min. This effectively instructs OneAgent to capture all requests. This is useful when you need to ensure the capture of even rare requests within high volume environments.

Be careful not to set this value too high, as it can push your Dynatrace Managed cluster into a resource shortage and force you to add more hardware. You can use this setting to effectively ensure the capture of every single transaction, at the price of additional hardware.

Adaptive capture control for process groups

In certain situations, you might want to reduce the number of PurePaths captured for specific processes or process groups. To accommodate this, you can lower the number of captured transactions for certain process groups within an environment.

Adaptive capture control for specific process groups

As you can see, this feature additionally allows an environment administrator to reduce the number of PurePaths across an entire environment. This is useful in reducing the amount of network traffic produced by OneAgents environment-wide.

FAQs

How does adaptive traffic management affect charts, baselining, and alerting?

The short answer is, not at all.

The shaping of traffic is accounted for transparently and done in a way that ensures statistical validity while capturing rare requests with high probability. All charts show the total real number of requests that your application processes, as does all ad-hoc analysis you may perform. Dynatrace AI isn't impacted by this. The only place where this traffic shaping is visible is in the PurePath list, which displays a message like "x more like this."

Why can't I find my request?

If your Dynatrace Managed cluster is undersized, or if a specific request you're interested in comes from a high-volume tier (more than 1,000/min), Dynatrace may not capture the request. OneAgent does its best to capture all unique requests, but there are limits to what can be done. We'll continue to work on improving this mechanism. You can reduce the amount of produced traffic by excluding unimportant requests from capture; this is done with web request attributes and URL exclusion rules. This approach leaves more volume available for important requests.

In Dynatrace Managed you can also increase the number of captured PurePaths, if this is important to you.