Adaptive traffic management and control
A core feature of Dynatrace is the PurePath® distributed tracing capability. A single PurePath distributed trace is a full end-to-end distributed trace. In contrast to other tracing technologies, distributed traces in Dynatrace are automatically captured by OneAgent. In addition to service-level traces and response times, distributed traces also provide deep code-level insights, which enable method-hotspot analysis, request attributes, request and database analysis, and detailed error analysis. All of this is provided automatically, out of the box, with Dynatrace distributed tracing. Tracing powered by PurePath® technology is also one of the core ingredients that enables Davis, the Dynatrace AI, to perform automatic baselining and root cause analysis. In support of this, Dynatrace distributed tracing provides the highest level of data granularity and fidelity on the market.
Dynatrace high-fidelity distributed tracing
Dynatrace OneAgent traces each distributed transaction end to end. OneAgent captures a higher number of distributed traces than any other tracing technology on the market. With a standard setup, each Dynatrace OneAgent captures a certain number of new end-to-end distributed traces within each monitored process every minute. Each end-to-end distributed trace contains code-level and business insights. A distributed trace can contain many service calls to many tiers. Because each trace is captured in full, end to end, second- and third-level tiers often capture more total service calls than entry-point processes.
Because of its high-fidelity tracing, Dynatrace captures every single distributed transaction in the majority of cases. In high-volume cases, however, a single entry-point tier can process more requests than the standard per-process capture rate. Capturing all traces all the time in such situations can lead to increased network bandwidth demands. To manage this, Dynatrace OneAgent provides a built-in limiter: each OneAgent-monitored process is allowed to start a given number of distributed traces per minute. This limit can be described as a quota.
Once the distributed traces quota is reached, OneAgent applies an intelligent mechanism which makes use of the monitored traffic in the most effective way possible. This approach is called "adaptive traffic management."
SaaS vs Managed - Adaptive traffic management version 1 vs version 2
Version 1 - Dynatrace Managed
In version 1 of adaptive traffic management, each process is allowed to start a fixed number of 1,000 new distributed traces per minute. This version is currently the default and is active for most Dynatrace Managed clusters. This works well but has a downside: if your architecture has a high-volume entry point (load balancer, Nginx, …) that exceeds this limit, your traffic is limited there. In addition, the traffic sent to Dynatrace can be spiky, depending on your application's traffic.
Version 2 - Dynatrace SaaS
In version 2 of adaptive traffic management, each process's limit is automatically adapted and tuned so that, overall, an allowed volume of service calls per minute is sent to Dynatrace, independent of your application's architecture. The allowed volume per minute is based on the actively used host units in a given environment: 250 full service calls × active host units. The allowed volume scales with your license, guaranteeing fair value to all our customers. As a practical example, a moderate environment that consists of 50 hosts with 32 GB each (100 host units) would process up to 25,000 full service calls per minute.
The advantages of this system are:
- High-volume entry points can send more traces because the traffic is managed based on the overall volume that is sent to the Dynatrace environment.
- Low-volume applications share their unused transaction volume with high-volume applications that need it.
- Infrastructure hosts send no traces but increase the overall allowed volume.
- Network traffic sent to Dynatrace is less spiky because the traffic is managed based on the overall volume.
This version is currently active in all Dynatrace SaaS environments.
FAQ
- What is a full service call?
A full service call is a server-side call that either starts a distributed trace, starts a service call at a deep-monitored tier, or is a custom service call. External calls such as database calls, external web requests, or generally any opaque service call are not full service calls and are not counted against your traffic limit. In Dynatrace terms, this means that all requests for WebRequest and WebService (except for external ones), RMI service, Messaging service, and Custom services are counted.
- What is an active host unit?
The system looks at the host units currently in use and connected to an environment, not the host units assigned to an environment.
- What is the minimum number of full service calls/min in a given environment?
The system always processes at least 5,000 full service calls/min in a given environment, the equivalent of 20 host units.
- What is the minimum and maximum number of traces that a process can produce?
The system automatically adapts the OneAgent limit to between 50 and 50,000 traces per minute in order to meet the overall traffic volume reaching the Dynatrace system.
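The version 2 numbers quoted above can be combined into a small worked calculation. The following is only a sketch of the arithmetic, not the actual Dynatrace implementation; the function and constant names are hypothetical, and host units are taken as a given input.

```python
# Values taken from the text above; all names are illustrative assumptions.
FULL_SERVICE_CALLS_PER_HOST_UNIT = 250   # allowed calls/min per active host unit
MIN_ENVIRONMENT_QUOTA = 5_000            # floor, the equivalent of 20 host units
MIN_TRACES_PER_PROCESS = 50              # lower bound of the per-process limit
MAX_TRACES_PER_PROCESS = 50_000          # upper bound of the per-process limit


def environment_quota(active_host_units: float) -> int:
    """Allowed full service calls per minute for one environment."""
    scaled = int(FULL_SERVICE_CALLS_PER_HOST_UNIT * active_host_units)
    return max(MIN_ENVIRONMENT_QUOTA, scaled)


def clamp_process_limit(traces_per_minute: int) -> int:
    """Keep a per-process trace limit inside the documented 50 to 50,000 range."""
    return max(MIN_TRACES_PER_PROCESS, min(MAX_TRACES_PER_PROCESS, traces_per_minute))


print(environment_quota(100))        # 25000 (50 hosts with 32 GB each)
print(environment_quota(8))          # 5000 (the minimum applies)
print(clamp_process_limit(120_000))  # 50000
```

How the overall quota is distributed across individual processes is not described here, so the sketch only shows the bounds that any per-process limit stays within.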
Adaptive traffic management on the Dynatrace OneAgent
Adaptive traffic management ensures that each OneAgent captures only a limited number of new distributed traces each minute, thereby limiting the amount of data that's sent. At the same time, it ensures that all important traces are fully captured and that a statistically valid set of traces is maintained for the more frequent but less important requests.
To achieve this, OneAgent calculates a list of the top requests that are started each minute. Typical applications don't have an even distribution of requests. Rather, there are a few kinds of requests that make up a majority of the traffic (for example, image requests or status checks), a medium number of important requests, and a large number of unique URLs. Based on the list of top requests, OneAgent captures traffic in such a way that requests that have the highest volume are captured less frequently (thereby avoiding capture of "more of the same") while every unique or rare request is captured.
The following table represents such a top-request calculation example, along with the respective capture rates.
Request | Number of requests processed by the application | Capture factor | Captured distributed traces |
---|---|---|---|
URI A | 900 | 1/2 | 450 |
URI B | 440 | 1/2 | 220 |
URI C | 250 | 1 | 250 |
URI D | 60 | 1 | 60 |
50 other URIs | 100 | 1 | 100 |
Total: | 1750 | | 1080 |
In this example, the configured target is 1,000 requests/min, and OneAgent captures slightly more than that (1,080). URI C, URI D, and the 50 other URIs are captured every time, while URIs A and B are captured only 50% of the time. Even so, OneAgent still traces requests A and B end to end more than 600 times per minute.
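The captured counts in the table follow directly from multiplying each request count by its capture factor. The snippet below reproduces that arithmetic as a minimal sketch; the data structure and function name are illustrative assumptions, and it does not show how OneAgent chooses the capture factors.

```python
from fractions import Fraction

# (request, requests processed per minute, capture factor) as listed in the table
top_requests = [
    ("URI A", 900, Fraction(1, 2)),
    ("URI B", 440, Fraction(1, 2)),
    ("URI C", 250, Fraction(1)),
    ("URI D", 60, Fraction(1)),
    ("50 other URIs", 100, Fraction(1)),
]


def captured_traces(requests):
    """Return the number of captured traces per request type."""
    return {name: int(count * factor) for name, count, factor in requests}


captured = captured_traces(top_requests)
total_captured = sum(captured.values())
# Traces still captured for the two request types that are only sampled
sampled_captured = captured["URI A"] + captured["URI B"]

print(total_captured)    # 1080
print(sampled_captured)  # 670
```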
In almost all cases you won't notice this behavior. Dynatrace retains the knowledge of the capture rate and calculates the response time, throughput, and error rate accordingly. All charts and service analysis show extrapolated data based on the original capture rate. The Dynatrace AI still has more than enough data (more than any other APM solution) to provide you with actionable insights and detailed root cause analysis of detected problems. You can see this behavior when you look at an individual distributed trace in the distributed traces list that says "3 more like this". This indicates that a request was captured once while there were three other requests just like it that were processed by the monitored application but were not included in the analysis.
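To illustrate how a known capture rate allows metrics to be extrapolated, the sketch below weights each captured trace by the number of requests it stands for (itself plus its "more like this" count). This is a hypothetical model of the idea only; the class, field names, and weighting scheme are assumptions, and the calculation Dynatrace actually performs is not described here.

```python
from dataclasses import dataclass


@dataclass
class CapturedTrace:
    request: str
    response_time_ms: float
    failed: bool
    more_like_this: int  # uncaptured requests that this trace represents


def extrapolate(traces):
    """Estimate throughput, average response time and error rate."""
    total = sum(1 + t.more_like_this for t in traces)
    if total == 0:
        return {"throughput": 0, "avg_response_time_ms": 0.0, "error_rate": 0.0}
    weighted_time = sum(t.response_time_ms * (1 + t.more_like_this) for t in traces)
    failures = sum(1 + t.more_like_this for t in traces if t.failed)
    return {
        "throughput": total,
        "avg_response_time_ms": weighted_time / total,
        "error_rate": failures / total,
    }


traces = [
    CapturedTrace("URI A", 100.0, False, 3),  # shown as "3 more like this"
    CapturedTrace("URI C", 300.0, True, 0),   # captured every time
]
result = extrapolate(traces)
print(result["throughput"])            # 5
print(result["avg_response_time_ms"])  # 140.0
print(result["error_rate"])            # 0.2
```

Two captured traces thus account for five processed requests, which is why charts can show the real request volume even though only some requests were traced.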
With this approach, adaptive traffic management saves you a lot of network bandwidth, and, in the case of Dynatrace Managed, precious CPU, memory, network and storage resources that would otherwise be required to process and store this data.
Monitoring Adaptive Traffic Management
Dynatrace version 1.232+
To keep track of the actual usage and thresholds of adaptive traffic management, Dynatrace provides dedicated self-monitoring metrics.
You can integrate these self-monitoring metrics into your own custom self-monitoring dashboard, or you can create a dedicated dashboard that shows the historical and current capture rate.
The metric expressions needed to build such a dashboard are listed below:
Value | Metric expression |
---|---|
Full service calls received by cluster | dsfm:server.service_calls.received:splitBy():sum:auto:rate(1m) |
Full service calls processed by OneAgent | dsfm:oneagent.service_calls.processed:splitBy():sum:auto:rate(1m) |
Maximum allowed full service calls/min | dsfm:server.service_calls.maximum_allowed_per_minute:splitBy():avg:auto:auto |
OneAgent capture rate | (dsfm:server.service_calls.received:splitBy():sum:auto)/ (dsfm:oneagent.service_calls.processed:splitBy():sum:auto)*(100) |
The expressions above use the following metrics:
Self-monitoring metric | Description |
---|---|
dsfm:oneagent.service_calls.processed | The number of full service calls processed by the OneAgent |
dsfm:server.service_calls.received | The number of full service calls received by the Dynatrace cluster |
dsfm:server.service_calls.maximum_allowed_per_minute | The maximum allowed number of full service calls per minute based on your license |
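The capture rate expression divides the calls received by the cluster by the calls processed by OneAgent. The same arithmetic can be sketched on plain numbers, for example on values exported from the metrics above. The function names and sample values are illustrative assumptions, and the quota usage figure is an extra derived value that is not one of the listed expressions.

```python
def capture_rate_percent(received: float, processed: float) -> float:
    """Share of full service calls seen by OneAgent that reached the cluster."""
    if processed == 0:
        # Assumption: with no traffic at all, nothing was dropped.
        return 100.0
    return received * 100 / processed


def quota_usage_percent(received_per_minute: float, maximum_allowed_per_minute: float) -> float:
    """How much of the allowed volume per minute is currently in use."""
    if maximum_allowed_per_minute <= 0:
        raise ValueError("maximum_allowed_per_minute must be positive")
    return received_per_minute * 100 / maximum_allowed_per_minute


# Hypothetical one-minute sample values
print(capture_rate_percent(received=18_000, processed=24_000))  # 75.0
print(quota_usage_percent(18_000, 25_000))                      # 72.0
```

A capture rate well below 100% means OneAgent is actively reducing traffic to stay within the allowed volume.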
Adaptive load reduction in Dynatrace Managed
Due to the ease of OneAgent deployment, instrumenting new applications, hosts, or even large additional environments is no problem. Customers often onboard hundreds of hosts and applications a day to Dynatrace. This, of course, vastly increases the number of distributed traces that are processed by a Dynatrace Managed cluster.
Sometimes the initial sizing considerations for Dynatrace Managed nodes and clusters are inadequate to support such volume; a Dynatrace Managed cluster may lack the necessary hardware to process all the additional incoming data. To protect the health and integrity of your monitoring environment in such situations, Dynatrace Managed leverages adaptive load reduction on the incoming traces to ensure that monitoring remains stable while a statistically valid set of requests is captured for analysis.
Each Dynatrace Managed node can process a certain number of service calls per minute (a distributed trace is made up of many service calls). The number of calls that can be processed depends on the number of CPUs and the amount of memory available to a node. Once this limit is breached, adaptive load reduction is engaged.
New distributed traces coming in from environments with the highest traffic in relation to their assigned host units (i.e., traffic/host unit) are targeted first for load reduction. Dynatrace Managed skips full distributed trace processing in these environments. This happens in a random fashion and reduces the number of distributed traces that are processed in a staged fashion. At the same time, the statistical validity of all metrics, charts, baselining, and events is retained because Dynatrace knows the number of distributed traces that have been skipped. This is fully transparent to you, as Dynatrace raises an event and displays a message in the cluster UI.
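The targeting rule described above (highest traffic per assigned host unit first) amounts to a simple ranking. The sketch below shows only that ordering, under assumed names and sample numbers; it is not the Dynatrace Managed implementation and does not model the staged, random skipping itself.

```python
from dataclasses import dataclass


@dataclass
class Environment:
    name: str
    service_calls_per_minute: int
    assigned_host_units: float


def traffic_per_host_unit(env: Environment) -> float:
    """Traffic relative to assigned host units; no host units ranks highest."""
    if env.assigned_host_units <= 0:
        return float("inf")
    return env.service_calls_per_minute / env.assigned_host_units


def load_reduction_order(environments):
    """Environments sorted so the first entry is targeted first."""
    return sorted(environments, key=traffic_per_host_unit, reverse=True)


# Hypothetical environments on one cluster
environments = [
    Environment("production", 200_000, 400),  # 500 calls per host unit
    Environment("load-test", 90_000, 50),     # 1800 calls per host unit
    Environment("staging", 10_000, 100),      # 100 calls per host unit
]
order = [env.name for env in load_reduction_order(environments)]
print(order)  # ['load-test', 'production', 'staging']
```

Note that the environment with the most absolute traffic is not necessarily targeted first; the ratio to assigned host units decides.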
As with OneAgent traffic management, the reduction in processed data is accounted for transparently. This will successfully safeguard your cluster from spikes in traffic. If such situations are occasional, the fact that not all data is processed has no negative impact on your monitoring. Dynatrace AI isn't impacted at all, nor is alerting. All service-based chart data is transparently adjusted (no visible change), and all analysis views account for this. You won't see a difference in charts or service call analysis data unless you're looking at a single PurePath. The only place where this is visible is the distributed traces list, which displays a message like "x more like this".
Only those environments that have a high volume of traffic compared to their assigned host units are targeted. All other environments remain unaffected.
If adaptive load reduction (ALR) is engaged only occasionally, to cover spikes, no action is needed, but it shouldn't be treated as a lasting solution. While OneAgent adaptive traffic management actively shapes traffic, the feature described here exists only as a safeguard. A sporadic ALR incident, for example one that occurs once a week for 15 minutes, doesn't necessarily indicate that the cluster is overloaded. It might be caused by an unusual load spike or a temporary network interruption.
If adaptive load reduction is engaged on a consistent basis for longer periods (lasting more than 15 minutes), and especially when ALR is active all the time, the fact that not all data is processed might undermine your monitoring, metrics, and general data quality. To prevent such a scenario, choose one of two available options. You can add more hardware and a new Dynatrace Managed cluster node to provide your Dynatrace Managed cluster with the necessary resources to process the additional data. Or you can reduce the incoming traffic by adjusting the traffic management settings for the environment's OneAgents. You might also consider these options whenever the statistical correctness of data capture under adaptive traffic management is not sufficient for you and you need higher data accuracy.
Adaptive capture control
In Dynatrace Managed you can define the target number of newly monitored entry-point distributed traces captured per process/minute.
The default number of newly monitored entry-point distributed traces captured per process/minute is 1,000, which is a high number already. Changing this setting might be useful in some situations. You might have a load-test environment that consumes too many network, disk, and CPU resources on your Dynatrace Managed cluster and you'd rather use the environment for production monitoring. You can choose to reduce the distributed trace fidelity of the environment and thereby reduce the percentage of monitored incoming traffic. In other words, the tests might produce enough data granularity even after reducing the number of traces down to 500 or 100 per minute.
By reducing the number of captured distributed traces, you aren't changing any metrics or any service analysis features. Outside of the distributed traces list itself, this change is accounted for transparently and is taken into account in all analyses.
Alternatively, you can choose to increase the number of entry-point distributed traces captured per process/minute in your most important environments so as to get even higher fidelity than the default. The upper limit is 100,000 new end-to-end requests/process/min. This effectively instructs OneAgent to capture all requests. This is useful when you need to ensure the capture of even rare requests within high volume environments.
Be careful with setting this value too high as it can push your Dynatrace Managed cluster into a resource shortage situation and force you to add more hardware. You can use this setting to effectively ensure the capture of every single transaction at the cost of increased hardware expenditures.
Adaptive capture control for process groups
In certain situations, you might want to reduce the number of distributed traces captured for specific processes or process groups. To accommodate this, you can lower the number of captured transactions for certain process groups within a specific environment.
Go to Settings > Server-side service monitoring > Deep monitoring.
As you can see, this feature additionally allows an environment administrator to reduce the number of distributed traces across an entire environment. This is useful in reducing the amount of network traffic produced by OneAgents environment-wide.
Frequently asked questions
How does adaptive traffic management affect charts, baselining, and alerting?
The short answer is, not at all.
The shaping of traffic is accounted for transparently and done in a way that ensures statistical validity while capturing rare requests with high probability. All charts show the total real number of requests that your application processes, as does all ad-hoc analysis you may perform. Dynatrace AI isn't impacted by this. The only place where this traffic shaping is visible is in the distributed traces list, which displays a message like "x more like this".
Why can't I find my request?
If your Dynatrace Managed cluster is undersized or if a specific request you're interested in comes from a high-volume tier (more than 1,000/min), Dynatrace may not capture the request. OneAgent does its best to capture all unique requests, but there are limits to what can be done. We'll continue to work on improving this mechanism. You can reduce the amount of produced traffic by excluding unimportant requests from capture; this is done with web request attributes and URL exclusion rules. This approach leaves more volume available for important requests.
If this is important to you, you can also increase the number of captured distributed traces in Dynatrace Managed.