Dynatrace high fidelity distributed tracing

One of the core features of Dynatrace is its distributed tracing capability, called PurePath. A PurePath is a full end-to-end distributed trace. In contrast to other tracing technologies, a Dynatrace PurePath is captured automatically by the OneAgent and includes not only service-level traces and response times but also deep code-level insight. This enables method hotspots, request attributes, request and database analysis, and detailed error analysis, all of it fully automatic. It is also one of the core ingredients that enable the automatic baselining and root cause analysis of the Dynatrace AI. To facilitate this, Dynatrace's distributed tracing capability has the highest data granularity and fidelity on the market.

High fidelity tracing

The Dynatrace OneAgent traces each distributed transaction it captures end to end. It also captures the highest number of distributed traces in the market. In its standard setup a Dynatrace OneAgent will capture at least 1000 new end-to-end distributed traces (aka PurePaths) every minute in each monitored process (1000/min/process). Because each trace is captured in full, second- or third-level tiers will often capture many more traces in total than the entry-point process.

Due to this high fidelity, Dynatrace will capture every single distributed transaction in most cases. In rare cases a single entry-point tier processes many more requests than this, and capturing all traces would lead to an unreasonable amount of network traffic on the monitored hosts. To manage this, the Dynatrace OneAgent has a built-in limit. Each OneAgent-monitored process is allowed to start a certain number of new distributed traces every minute (1000 by default). This can be thought of as filling up a quota.

Each OneAgent monitored process will capture a certain amount of new PurePaths every minute

Once that quota is full, the OneAgent will apply an intelligent mechanism to use the available traffic quota in the best possible way. This is called adaptive traffic management.
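The quota idea can be pictured with a small counter that resets every minute. This is only an illustrative sketch: the class name MinuteQuota, the method try_start and the fixed-window reset are assumptions made for the example and do not describe the OneAgent's actual implementation. The default of 1000 is taken from the text above.

```python
from dataclasses import dataclass


@dataclass
class MinuteQuota:
    """Hypothetical per-process quota of new traces per one-minute window."""

    limit: int = 1000          # default from the text: 1000 new traces/min/process
    window_start: float = 0.0  # start of the current window, in seconds
    used: int = 0              # traces started in the current window

    def try_start(self, now: float) -> bool:
        """Return True if a new trace still fits into this minute's quota."""
        if now - self.window_start >= 60.0:
            # A new minute begins: reset the counter.
            self.window_start = now
            self.used = 0
        if self.used < self.limit:
            self.used += 1
            return True
        # Quota is full. In this sketch the trace is simply refused; the text
        # describes that the real agent shapes the traffic instead.
        return False


quota = MinuteQuota(limit=3)
decisions = [quota.try_start(t) for t in (0.0, 1.0, 2.0, 3.0)]
```

With a limit of 3, the first three calls within the same minute succeed and the fourth does not; a call in the following minute succeeds again.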

Dynatrace OneAgent - Adaptive traffic management

Adaptive traffic management ensures that a OneAgent always captures at least a certain number of new distributed traces each minute while limiting the amount of data sent. At the same time, it ensures that all important traces are fully captured while maintaining a statistically valid set of the more frequent but less important requests.

To achieve this, the OneAgent calculates a list of the top PurePaths started in every minute. Typical applications do not have an even distribution of requests. Instead, a few requests make up a large majority of the traffic (e.g. image requests or status checks), followed by a medium number of important requests and a large number of unique URLs.

Based on this top request list, the OneAgent shapes the traffic in such a way that the requests with the highest volume are captured less often (avoiding capturing more of the same) while attempting to capture every single unique or rare request.

The following table represents such a top request calculation with their respective capture rates.

Request             Requests processed by the application   Capture factor   Captured PurePaths
URI A               900                                     1/2              450
URI B               440                                     1/2              220
URI C               250                                     1                250
URI D               60                                      1                60
50 random others    100                                     1                100
Total               1750                                                     1080

In this situation the OneAgent would capture slightly more than 1000 requests/min, which is the target in this example. URIs C and D and the 50 other requests are captured every single time, while A and B are captured 50% of the time. Even so, OneAgent still traces A and B end to end more than 600 times per minute combined.

In almost all cases you will not notice this behavior. The Dynatrace OneAgent retains the knowledge about the capture rate and calculates response time, throughput and error rate accordingly. Any chart or service analysis will show extrapolated data based on the original capture rate. The Dynatrace AI has more than enough data (more than any other APM solution) to give you the right answers and root causes for your problems.
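The arithmetic of the example can be checked with a few lines of Python. The rows are copied from the table; the variable names are chosen for this illustration only.

```python
from fractions import Fraction

# Rows from the table: (request, requests processed per minute, capture factor)
rows = [
    ("URI A", 900, Fraction(1, 2)),
    ("URI B", 440, Fraction(1, 2)),
    ("URI C", 250, Fraction(1)),
    ("URI D", 60, Fraction(1)),
    ("50 random others", 100, Fraction(1)),
]

# Captured PurePaths per row = processed requests * capture factor.
captured = {name: int(count * factor) for name, count, factor in rows}

total_processed = sum(count for _, count, _ in rows)
total_captured = sum(captured.values())
sampled_captured = captured["URI A"] + captured["URI B"]

print(total_processed, total_captured, sampled_captured)  # 1750 1080 670
```

The rows add up to 1750 processed requests and 1080 captured PurePaths per minute, of which 670 belong to the two sampled URIs A and B.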
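One common way to extrapolate from sampled data is to let every captured trace stand for 1 / capture factor real requests. The sketch below illustrates that idea only; the type CapturedTrace, the function extrapolate and the weighting scheme are assumptions for this example, not the formulas Dynatrace uses internally.

```python
from dataclasses import dataclass


@dataclass
class CapturedTrace:
    response_time_ms: float
    failed: bool
    capture_factor: float  # fraction of such requests that was captured, e.g. 0.5


def extrapolate(traces: list[CapturedTrace]) -> dict[str, float]:
    """Estimate throughput, mean response time and error rate of all requests."""
    if not traces:
        raise ValueError("at least one captured trace is required")
    # Each captured trace represents 1 / capture_factor real requests.
    weights = [1.0 / t.capture_factor for t in traces]
    total = sum(weights)
    mean_rt = sum(w * t.response_time_ms for w, t in zip(weights, traces)) / total
    failed = sum(w for w, t in zip(weights, traces) if t.failed)
    return {
        "throughput": total,
        "mean_response_time_ms": mean_rt,
        "error_rate": failed / total,
    }


sample = [
    CapturedTrace(100.0, False, 0.5),  # stands for 2 requests
    CapturedTrace(200.0, True, 0.5),   # stands for 2 requests
    CapturedTrace(400.0, False, 1.0),  # stands for 1 request
]
estimate = extrapolate(sample)
```

For this sample the three captured traces are extrapolated to 5 requests with a mean response time of 200 ms and an error rate of 0.4.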

You will see this when you look at a single PurePath in the PurePath list which will say "2 more like this" indicating that this request was captured once but there were two like this processed by the monitored application.

This behavior will save you a lot of network bandwidth.

Dynatrace Managed - Adaptive traffic management

Due to the ease of deployment of the OneAgent, adding new applications, hosts or even huge additional environments has become the new normal. Customers often onboard hundreds of hosts and apps a day onto Dynatrace. This of course vastly increases the number of PurePaths being processed by the Dynatrace Managed cluster.

Sometimes the initial sizing of the Dynatrace Managed nodes or cluster did not anticipate this growth, and the existing Dynatrace Managed cluster lacks the hardware needed to process all the additional incoming data. It is then most important to keep the Dynatrace Managed environment healthy and ensure that your monitoring is always available.

To protect the health and integrity of your monitoring environment Dynatrace Managed will engage Adaptive traffic management on the incoming traces.

Each Dynatrace Managed node can process a certain number of Service calls per minute (a PurePath or distributed trace is made up of many service calls). How many it can process depends on the number of CPUs and the amount of memory available to the node. Once that limit is breached, Adaptive traffic management will engage on the Dynatrace Managed nodes.

New PurePaths coming in from the environments with the most traffic in relation to their assigned host units (i.e. traffic/host unit) will be targeted for traffic management. Dynatrace Managed will skip the processing of full PurePaths in those environments. This happens in a random fashion and reduces the number of PurePaths being processed in stages. At the same time the statistical validity of all metrics, charts, baselining and events is retained, because Dynatrace knows the number of PurePaths it skips. This is fully transparent to the customer, as Dynatrace will raise an event and display it in the cluster UI.
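The selection criterion and the compensation for skipped PurePaths can be sketched as follows. The environment names, the numbers, and the functions rank_by_traffic_per_host_unit and estimate_total are hypothetical and serve only to illustrate the traffic/host unit ratio and the scaling idea; the actual cluster logic is not described in this text.

```python
def rank_by_traffic_per_host_unit(environments: dict[str, tuple[int, int]]) -> list[str]:
    """Order environments by service calls per host unit, highest first.

    `environments` maps a name to (service calls per minute, assigned host units).
    Host units are assumed to be greater than zero.
    """
    return sorted(
        environments,
        key=lambda name: environments[name][0] / environments[name][1],
        reverse=True,
    )


def estimate_total(processed: int, keep_probability: float) -> float:
    """Scale the processed count back up by the known keep probability."""
    return processed / keep_probability


# Hypothetical environments: name -> (service calls/min, host units)
envs = {
    "prod": (600_000, 400),      # 1500 calls per host unit
    "loadtest": (300_000, 50),   # 6000 calls per host unit
    "dev": (20_000, 40),         # 500 calls per host unit
}
ranking = rank_by_traffic_per_host_unit(envs)
```

In this example the load test environment would be targeted first, although production has more traffic in absolute terms. If 75% of its PurePaths are still processed, 750 processed PurePaths stand for an estimated 1000.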

Adaptive traffic management has engaged on a Dynatrace managed cluster due to a resource shortage

As with OneAgent traffic management, the reduction in processed data is accounted for transparently. The fact that not all data is being processed should have no negative impact on the monitoring. Neither the AI nor alerting is impacted. All service-based chart data is adjusted transparently (no change will be visible) and all analysis views account for this as well. Unless you are looking at a single PurePath, you will not see a difference in chart or service call analysis data. The one place where this is visible is the PurePath list, where you will see a message saying “x more like this”.

Only those environments that have a high volume of traffic compared to their assigned host units will be targeted. All other environments remain unaffected.

If this happens occasionally, you don't need to do anything. However, this should not be a constant situation. While the Adaptive Traffic Management on the OneAgent does actively shape the traffic, the feature described here exists as a safeguard only.

You now have a choice to make. You can add more hardware in the form of a new Dynatrace Managed cluster node. By doing that you give the Dynatrace Managed cluster the resources it needs to process the additional data.

Or you can choose to reduce the incoming traffic by adjusting the traffic management of an environment's OneAgents.

Adaptive capture control

In Dynatrace Managed you can define the number of newly monitored entry-point PurePaths captured per process per minute.

Change the maximum PurePath fidelity of an Environment in Dynatrace Managed

The default is 1000/process/minute, which is a high value in and of itself. Changing it can be useful for several reasons. You might have a load test environment that consumes too much network, disk and CPU on your Dynatrace Managed cluster, and you would rather use those resources for your production environment. You can then reduce the PurePath fidelity of that environment and thus reduce the incoming monitoring traffic. In other words, those tests might still produce enough data granularity even when you reduce the number of traces per minute to 500 or 100.

Importantly, by reducing the number of captured PurePaths you are not changing any metrics or any of the service analysis features. Outside of the PurePath list itself, this change is accounted for transparently and all analysis takes it into account.

Alternatively, you can increase the number to get an even higher fidelity than the default for your most important environments. The upper limit is 100,000 new end-to-end requests/process/min. For all intents and purposes this tells the OneAgent to capture all requests. This is useful if you need to ensure the capture of even rare requests in high-volume environments.

You should be careful with setting this value too high, because it could force the Dynatrace Managed cluster into a resource shortage situation or force you to add much more hardware.

How to change the Adaptive capture control limit for specific process groups

In certain situations, you might only want to reduce the amount of PurePaths captured for specific processes or process groups. To accommodate this, you can lower the number of captured transactions for a specific environment or certain process groups.

How to change the Adaptive capture control limit for specific process groups

As you can see, this also allows an environment admin to reduce the number of PurePaths for the whole environment. This is useful to reduce the network traffic produced by the OneAgents in this environment.

FAQ

How does "Dynatrace OneAgent - Adaptive traffic management" affect charts, baselining, and alerting?

The short answer is, not at all.

The shaping of the traffic is accounted for transparently and done in a way that ensures statistical validity while capturing rare requests with higher probability. All charts will still show the real total number of requests that your application processes, and so will any ad-hoc analysis. The AI is not impacted at all. The one place where this is visible is the PurePath list, where you will see a message saying “x more like this”.

Why can't I find my request?

This can happen. If your Dynatrace Managed cluster is undersized, or if the request in question comes from a high-volume tier (more than 1000 requests/min), Dynatrace might not capture certain requests. The OneAgent does its best to capture all unique requests, but there are of course limits, and we are always working on improving this mechanism. You can reduce the traffic produced by excluding unimportant requests from being captured (via the web request URL exclusion rules). This leaves more volume available for your important requests. In Dynatrace Managed you can also increase the number of PurePaths being captured if this is important to you.