Adaptive traffic management and control
A core feature of Dynatrace is the PurePath® distributed tracing capability. A single PurePath is a full end-to-end distributed trace. In contrast to other tracing technologies, Dynatrace PurePath traces are automatically captured by OneAgent. In addition to service-level traces and response times, PurePath traces also provide deep code-level insights, which enable method-hotspot analysis, request attributes, request and database analysis, and detailed error analysis. All of this is provided automatically, out of the box, with Dynatrace PurePath distributed tracing. PurePath tracing is also one of the core ingredients that enables Davis, the Dynatrace AI, to perform automatic baselining and root cause analysis. In support of this, Dynatrace distributed tracing provides the highest level of data granularity and fidelity on the market.
Dynatrace high-fidelity distributed tracing
Dynatrace OneAgent traces each distributed transaction end to end. OneAgent captures a higher number of distributed traces than any other tracing technology on the market. With a standard setup, each Dynatrace OneAgent captures at least 1,000 new end-to-end distributed traces (that is, PurePath traces) within each monitored process every minute (1,000/min per process). Each end-to-end PurePath distributed trace contains code-level and business insights. Due to the full capture for each trace, second- and third-level tiers often capture many more total traces than entry-point processes.
Because of this high fidelity, in most cases Dynatrace captures every single distributed transaction. In high-volume cases, however, a single entry-point tier can process many more requests than this. Capturing all traces all the time in such situations can lead to high network bandwidth demands. To manage this, Dynatrace OneAgent provides a built-in limiter. Each OneAgent-monitored process is allowed to start 1,000 distributed traces per minute. This limit can be thought of as a quota.
Once the PurePath quota is reached, OneAgent applies an intelligent mechanism which makes use of the monitored traffic in the most effective way possible. This approach is called "adaptive traffic management."
Adaptive traffic management for Dynatrace OneAgent
Adaptive traffic management ensures that each OneAgent captures only up to a maximum number of new distributed traces each minute, thereby limiting the amount of data that's sent. At the same time, it ensures that all important traces are fully captured and that a statistically valid set of traces is maintained for the more frequent but less important requests.
To achieve this, OneAgent calculates a list of the top requests that are started each minute. Typical applications don't have an even distribution of requests. Rather, there are a few kinds of requests that make up a majority of the traffic (for example, image requests or status checks), a medium number of important requests, and a large number of unique URLs. Based on the list of top requests, OneAgent captures traffic in such a way that requests that have the highest volume are captured less frequently (thereby avoiding capture of "more of the same") while every unique or rare request is captured.
The following table represents such a top-request calculation example, along with the respective capture rates.
|Request|Number of requests processed by the application|Capture factor|Captured PurePath traces|
|50 other URIs|100|1|100|
In this example, OneAgent will capture a bit more than 1,000 requests/min, which is the target request number in this example. URIs C, D, and 50 other URIs are captured each time, while A and B are captured only 50% of the time. Yet, OneAgent still traces these requests end-to-end over 600 times/minute.
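The top-request idea described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the actual OneAgent algorithm: the function name `plan_capture_factors`, the "water-filling" strategy (rare request types are captured in full, and the remaining quota is shared equally among the high-volume types), and the traffic numbers are all hypothetical.

```python
from typing import Dict


def plan_capture_factors(counts: Dict[str, int], quota: int) -> Dict[str, float]:
    """Return a capture factor per request type (1.0 means capture every request).

    Hypothetical water-filling scheme, NOT the actual OneAgent algorithm.
    """
    if quota <= 0:
        raise ValueError("quota must be positive")
    if sum(counts.values()) <= quota:
        # Everything fits into the quota, so nothing needs to be thinned out.
        return {name: 1.0 for name in counts}

    remaining = float(quota)
    cap = 0.0
    # Visit request types from rarest to most frequent.
    by_volume = sorted(counts.items(), key=lambda item: item[1])
    for index, (_, count) in enumerate(by_volume):
        fair_share = remaining / (len(by_volume) - index)
        if count <= fair_share:
            remaining -= count  # rare request type: capture all of it
        else:
            cap = fair_share    # every remaining type is limited to this cap
            break
    return {name: max(1.0, count / cap) for name, count in counts.items()}


# Made-up traffic for one minute; the quota of 1,000 mirrors the default in the text.
traffic = {"A": 1000, "B": 600, "C": 100, "D": 50, "other": 100}
factors = plan_capture_factors(traffic, quota=1000)
captured = {name: traffic[name] / factors[name] for name in traffic}
```

In this sketch a capture factor of 2 means that one out of every two requests of that type is traced. The low-volume types keep a factor of 1, so every one of them is captured, while the high-volume types absorb the whole reduction.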
In almost all cases you won't notice this behavior. Dynatrace retains the knowledge of the capture rate and calculates the response time, throughput, and error rate accordingly. All charts and service analysis show extrapolated data based on the original capture rate. The Dynatrace AI still has more than enough data (more than any other APM solution) to provide you with actionable insights and detailed root cause analysis of detected problems. You can see this behavior when you look at an individual PurePath trace in the PurePath list that says "3 more like this." This indicates that a request was captured once while there were three other requests just like it that were processed by the monitored application but were not included in the analysis.
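To make the extrapolation concrete, here is a minimal sketch of how weighted statistics could be derived from captured traces. It assumes each captured trace carries a weight equal to the number of real requests it stands for (a trace marked "3 more like this" would have a weight of 4). The `CapturedTrace` type and the `extrapolate` function are hypothetical names used for illustration and are not part of any Dynatrace API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CapturedTrace:
    response_time_ms: float
    failed: bool
    weight: float  # real requests represented by this trace (assumed to be known)


def extrapolate(traces: List[CapturedTrace]) -> Dict[str, float]:
    """Estimate request count, mean response time, and error rate from weighted traces."""
    total = sum(t.weight for t in traces)
    if total == 0:
        return {"requests": 0.0, "mean_response_time_ms": 0.0, "error_rate": 0.0}
    weighted_time = sum(t.response_time_ms * t.weight for t in traces)
    failed = sum(t.weight for t in traces if t.failed)
    return {
        "requests": total,
        "mean_response_time_ms": weighted_time / total,
        "error_rate": failed / total,
    }


# One trace standing for four requests ("3 more like this") plus one rare failing request.
stats = extrapolate([
    CapturedTrace(response_time_ms=100.0, failed=False, weight=4),
    CapturedTrace(response_time_ms=300.0, failed=True, weight=1),
])
```

Because every trace is weighted by the requests it represents, the estimated throughput reflects all five requests even though only two traces were captured.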
With this approach, adaptive traffic management saves you a lot of network bandwidth, and, in the case of Dynatrace Managed, precious CPU, memory, network and storage resources that would otherwise be required to process and store this data.
Capture limits in Dynatrace SaaS
Dynatrace version 1.232+
For Dynatrace SaaS environments, the capture limit of each OneAgent is not bound to a static threshold value. Instead, the capture limit is calculated based on the number of host units per OneAgent-monitored host and the number of incoming service calls. The maximum number of service calls is then distributed among all monitored services on each OneAgent-monitored host. This approach not only prevents services with high loads from consuming your available volume, it also scales automatically with your future requirements as your monitored environment grows or spikes. Especially in highly dynamic environments where services come and go, it's essential to rely on smart algorithms to ensure that your monitoring needs are met now and in the future.
As an example, if a OneAgent sends a high number of distributed traces and Dynatrace processes more than 250 service calls per host unit per minute, the OneAgent capture limit is decreased to match the service call limit. Alternatively, if Dynatrace processes fewer than 250 service calls per host unit per minute, then the OneAgent capture limit is increased until the OneAgent captures all distributed traces or the maximum number of 250 complete service calls per host unit per minute is reached. Note that only service calls with a server-side are counted. This means that database calls and external web request calls do not contribute to the limit.
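The feedback loop in this example can be sketched as follows. Only the figure of 250 service calls per host unit per minute comes from the text. The function names, the proportional decrease, and the 10% step used for increases are assumptions made for illustration and do not describe how Dynatrace actually adjusts the limit.

```python
SERVICE_CALLS_PER_HOST_UNIT = 250  # per minute, as stated in the text


def allowed_service_calls(host_units: float) -> float:
    """Service-call budget per minute for a host of the given size."""
    return SERVICE_CALLS_PER_HOST_UNIT * host_units


def adjust_capture_limit(current_limit: int, processed_calls: int, host_units: float) -> int:
    """Return the next capture limit (hypothetical adjustment rule)."""
    budget = allowed_service_calls(host_units)
    if processed_calls > budget:
        # Over budget: shrink the limit in proportion to the overshoot.
        return max(1, int(current_limit * budget / processed_calls))
    if processed_calls < budget:
        # Under budget: raise the limit by an assumed step of 10%.
        return current_limit + max(1, current_limit // 10)
    return current_limit


# A host with 10 host units has a budget of 2,500 service calls per minute.
lowered = adjust_capture_limit(current_limit=1000, processed_calls=5000, host_units=10)
raised = adjust_capture_limit(current_limit=1000, processed_calls=1000, host_units=10)
```

Calling the function once per minute would move the limit toward the budget from either side. In practice the increase would also stop once the OneAgent already captures all traces, which this sketch does not model.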
Monitor Adaptive Traffic Management
Dynatrace version 1.232+
To keep track of actual usage and thresholds of Adaptive Traffic Management, Dynatrace provides dedicated self-monitoring metrics.
|The number of service calls processed by the OneAgent|
|The maximum allowed number of service calls (with server-side) received per minute by the cluster|
You can integrate these self-monitoring metrics into your own custom self-monitoring dashboard, or you can create a dedicated dashboard like the one shown below to show historical and current capture rate.
The necessary metric expressions used to build this dashboard are listed below:
|Received service calls|
|Processed service calls|
|Maximum allowed service calls|
Adaptive load reduction in Dynatrace Managed
Due to the ease of OneAgent deployment, instrumenting new applications, hosts, or even large additional environments is no problem. Customers often onboard hundreds of hosts and applications a day to Dynatrace. This, of course, vastly increases the number of PurePath traces that are processed by a Dynatrace Managed cluster.
Sometimes the initial sizing considerations for Dynatrace Managed nodes and clusters are inadequate to support such volume; a Dynatrace Managed cluster may lack the necessary hardware to process all the additional incoming data. To protect the health and integrity of your monitoring environment in such situations, Dynatrace Managed leverages adaptive load reduction on the incoming traces to ensure that monitoring remains stable while a statistically valid set of requests is captured for analysis.
Each Dynatrace Managed node can process a certain number of service calls per minute (a PurePath distributed trace is made up of many service calls). The number of calls that can be processed depends on the number of CPUs and the amount of memory available to a node. Once this limit is breached, adaptive load reduction is engaged.
New PurePath traces coming in from environments with the highest traffic in relation to their assigned host units (i.e., traffic/host unit) are targeted first for load reduction. Dynatrace Managed skips full PurePath processing in these environments. This happens in a random fashion and reduces the number of PurePath traces that are processed in a staged fashion. At the same time, the statistical validity of all metrics, charts, baselining, and events is retained because Dynatrace knows the number of PurePath traces that have been skipped. This is fully transparent to you, as Dynatrace raises an event and displays a message in the cluster UI.
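The staged, ratio-based targeting described above can be sketched as follows. This is a simplified model under stated assumptions, not the Dynatrace Managed implementation: the 10% stage size, the 10% floor, the function names, and the environment figures are all hypothetical.

```python
import random
from typing import Dict, Tuple


def plan_load_reduction(
    envs: Dict[str, Tuple[int, float]],  # name -> (service calls/min, host units)
    capacity: int,
    step_pct: int = 10,   # assumed size of one reduction stage
    floor_pct: int = 10,  # assumed lower bound on the share that is still processed
) -> Dict[str, int]:
    """Return the percentage of traces to keep processing per environment."""
    keep = {name: 100 for name in envs}

    def load() -> float:
        return sum(calls * keep[name] / 100 for name, (calls, _) in envs.items())

    while load() > capacity:
        candidates = [name for name in envs if keep[name] > floor_pct]
        if not candidates:
            break  # nothing left to reduce
        # Target the environment with the highest remaining traffic per host unit.
        target = max(candidates, key=lambda n: envs[n][0] * keep[n] / envs[n][1])
        keep[target] = max(floor_pct, keep[target] - step_pct)
    return keep


def should_process(keep_pct: int, rng: random.Random) -> bool:
    """Randomly decide whether an incoming trace gets full processing."""
    return rng.random() * 100 < keep_pct


environments = {"prod": (8000, 10.0), "loadtest": (12000, 4.0), "dev": (500, 2.0)}
plan = plan_load_reduction(environments, capacity=15000)
```

With these made-up numbers, only the load-test environment is reduced, because its traffic per host unit is far higher than that of the others. Since the kept percentage is known, each processed trace can be weighted by 100 divided by that percentage, which keeps the derived metrics statistically valid.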
As with OneAgent traffic management, the reduction in processed data is accounted for transparently. This will successfully safeguard your cluster from spikes in traffic. In such situations the fact that not all data is processed has no negative impact on your monitoring. Dynatrace AI isn't impacted at all, nor is alerting. All service-based chart data is transparently adjusted (no change is visible) and all analysis views account for this. Unless you're looking at a single PurePath, you won't see a difference in charts or service call analysis data. The only place where this is visible is the PurePath list, which displays a message like "x more like this."
Only those environments that have a high volume of traffic compared to their assigned host units are targeted. All other environments remain unaffected.
If adaptive load reduction is engaged only occasionally, to cover spikes, you don't need to do anything. However, this shouldn't be an ongoing solution. While OneAgent adaptive traffic management does actively shape traffic, the feature described here exists only as a safeguard.
If adaptive load reduction is engaged on a consistent basis, you have a decision to make. You can add more hardware and a new Dynatrace Managed cluster node to provide your Dynatrace Managed cluster with the necessary resources to process the additional data. Or you can choose to reduce the incoming traffic by adjusting the traffic management settings for the environment's OneAgents.
Adaptive capture control
In Dynatrace Managed you can define the target number of newly monitored entry-point PurePath traces captured per process/minute.
The default number of newly monitored entry-point PurePath traces captured per process/minute is 1,000, which is a high number already. Changing this setting might be useful in some situations. You might have a load-test environment that consumes too many network, disk, and CPU resources on your Dynatrace Managed cluster and you'd rather use those resources for production monitoring. You can choose to reduce the PurePath fidelity of the environment and thereby reduce the percentage of monitored incoming traffic. In other words, the tests might produce enough data granularity even after reducing the number of traces down to 500 or 100 per minute.
By reducing the number of captured PurePath traces, you aren't changing any metrics or any service analysis features. Outside of the PurePath list itself, this change is accounted for transparently and is taken into account in all analyses.
Alternatively, you can choose to increase the number of entry-point PurePath traces captured per process/minute in your most important environments so as to get even higher fidelity than the default. The upper limit is 100,000 new end-to-end requests/process/min. This effectively instructs OneAgent to capture all requests. This is useful when you need to ensure the capture of even rare requests within high volume environments.
Be careful with setting this value too high as it can push your Dynatrace Managed cluster into a resource shortage situation and force you to add more hardware. You can use this setting to effectively ensure the capture of every single transaction at the cost of increased hardware expenditures.
Adaptive capture control for process groups
In certain situations, you might want to reduce the number of PurePath traces captured for specific processes or process groups. To accommodate this, you can lower the number of captured traces for certain process groups within a specific environment.
Go to Settings > Server-side service monitoring > Deep monitoring.
This feature additionally allows an environment administrator to reduce the number of PurePath traces across an entire environment. This is useful in reducing the amount of network traffic produced by OneAgents environment-wide.
Frequently asked questions
How does adaptive traffic management affect charts, baselining, and alerting?
The short answer is, not at all.
The shaping of traffic is accounted for transparently and done in a way that ensures statistical validity while capturing rare requests with high probability. All charts show the total real number of requests that your application processes, as does all ad-hoc analysis you may perform. Dynatrace AI isn't impacted by this. The only place where this traffic shaping is visible is in the PurePath list, which displays a message like "x more like this."
Why can't I find my request?
If your Dynatrace Managed cluster is undersized or if a specific request you're interested in comes from a high-volume tier (more than 1,000/min), Dynatrace may not capture the request. OneAgent does its best to capture all unique requests, but there are limits to what can be done. We'll continue to work on improving this mechanism. You can reduce the amount of produced traffic by excluding unimportant requests from capture; this is done with web request attributes and URL exclusion rules. This approach leaves more volume available for important requests.
If this is important to you, you can also increase the number of captured PurePath traces in Dynatrace Managed.