Managed failover mechanism

Dynatrace Managed allows for high-availability deployments across single or multiple data centers that consist of multiple, equally important nodes that run the same services.

To achieve the best failover deployments, we recommend the following:

Redundancy
Plan to deploy a minimum of three nodes per cluster. In such clusters, data is automatically replicated across the nodes, so there are typically two replicas in addition to the primary shard.

Avoid split-brain sync problems
While two-node clusters are technically possible, we don't recommend them. Our storage systems are consensus-based and require a majority for data consistency, so a two-node cluster is vulnerable to "split-brain" and should be treated as a temporary state when migrating to three or more nodes. Running two nodes may cause availability problems or data inconsistencies, because the nodes can end up as two separate data sets (effectively two single-node clusters) that overlap but don't communicate or synchronize their data with each other.

The entire configuration of the Dynatrace cluster and its environments (including all events, user sessions, and metrics) is stored on each node, so Dynatrace remains fully functional after losing nodes:
- A cluster with three nodes can survive the loss of one node
- A cluster with five or more nodes can survive the loss of up to two nodes
The latency between nodes should be around 10 ms or less.
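
The node-loss rules above can be written as a small lookup. This is a minimal sketch, not a Dynatrace API: the function names are hypothetical, and the treatment of four-node clusters (same tolerance as three nodes) is a conservative assumption, since only the three-node and five-or-more cases are stated above.

```python
def tolerated_node_losses(node_count: int) -> int:
    """Return how many node failures a cluster of this size can absorb.

    Encodes the rules listed above; the four-node case is an assumption.
    """
    if node_count < 1:
        raise ValueError("a cluster needs at least one node")
    if node_count >= 5:
        return 2  # five or more nodes: up to two losses
    if node_count >= 3:
        return 1  # three nodes: one loss (four nodes assumed the same)
    return 0  # one or two nodes: no loss can be absorbed safely


def stays_operational(node_count: int, failed_nodes: int) -> bool:
    """Check whether the cluster is still within its failure tolerance."""
    return 0 <= failed_nodes <= tolerated_node_losses(node_count)
```

For example, `stays_operational(5, 2)` holds, while `stays_operational(3, 2)` does not.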

Log Monitoring event data is replicated in the Elasticsearch store to achieve high availability and optimize storage cost. As a result, if a node goes down, Dynatrace has a backup copy stored on another node. However, the failure of two nodes makes some log events unavailable. If the failed nodes come back up, the data becomes available again; otherwise, it is lost.

Raw transaction data (call stacks, database statements, code-level visibility, etc.) isn't replicated across nodes. Instead, it's evenly distributed across all nodes. As a result, in the event of a node failure, Dynatrace can accurately estimate the missing data. This is possible because this data is typically short-lived, and the high volume of raw data that Dynatrace collects ensures that each node still has a large enough data set even if a node isn't available for some time.

If you plan to achieve regional fault tolerance (where all cluster nodes in one location can fail), distribute cluster nodes across separate physical locations following one of the options below:
- Two locations can only be implemented with High availability for multi-data centers.
- Three low-latency locations can be implemented with Rack aware managed deployment, with High availability for multi-data centers (combining two low-latency locations into one Dynatrace data center), or by distributing no more than two cluster nodes in each low-latency location (up to six cluster nodes in total).
- Six locations can be implemented with High availability for multi-data centers and Rack aware managed deployment in each Dynatrace data center.
The replication factor of three ensures that each location has all the metric and event data.
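
For the plain-distribution option (no more than two nodes per low-latency location), a placement can be sanity-checked by comparing the largest location against the number of node losses the cluster tolerates. The sketch below is illustrative only: the function names and location labels are hypothetical, it does not model Rack aware managed deployment or High availability for multi-data centers, and it assumes a four-node cluster tolerates one loss, like a three-node cluster.

```python
from typing import Mapping


def tolerated_node_losses(node_count: int) -> int:
    """Node losses a cluster can absorb (four-node case is an assumption)."""
    if node_count >= 5:
        return 2
    if node_count >= 3:
        return 1
    return 0


def survives_location_loss(nodes_per_location: Mapping[str, int]) -> bool:
    """Return True if losing any single location stays within tolerance."""
    if not nodes_per_location:
        raise ValueError("at least one location is required")
    total_nodes = sum(nodes_per_location.values())
    # Worst case: the location holding the most nodes fails entirely.
    worst_case_loss = max(nodes_per_location.values())
    return worst_case_loss <= tolerated_node_losses(total_nodes)
```

Under these assumptions, two nodes in each of three locations passes the check, while three nodes in each of two locations does not, which is consistent with two-location setups requiring High availability for multi-data centers.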

For Dynatrace Managed installations that are deployed across globally distributed data centers (with latency higher than 10 ms), you need Premium High Availability, which provides fully automatic failover capabilities in cases where an entire data center experiences an outage. This extends the existing high availability capabilities of Dynatrace Managed to provide geographic redundancy for globally distributed enterprises that need to run critically important services in a turnkey manner without depending on external replication or load balancing solutions.
See High availability for multi-data centers.

Hardware
To prevent loss of monitoring data, deploy each node on a separate physical disk. To minimize performance loss, deploy nodes on systems with the same hardware characteristics. In the event of a hardware failure, only the data on the failed machine is affected; there is no monitoring data loss because all nodes replicate the monitoring data. Performance loss is minimized because all nodes operate on the same type of hardware with an evenly distributed workload.

Processing capacity
Build your cluster with additional capacity and possible node failure in mind. Clusters that operate at 100% of their processing capacity have no spare capacity to compensate for a lost node and are thus susceptible to dropping data in the event of a node failure. Deployments planned for node failure should have a processing capacity one-third higher than their typical utilization.
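
The one-third rule is simple arithmetic. The following sketch uses hypothetical helper names and accepts load in any consistent unit; it only restates the recommendation above and is not a sizing tool.

```python
def recommended_capacity(typical_load: float) -> float:
    """Capacity that is one-third higher than the typical utilization."""
    if typical_load < 0:
        raise ValueError("typical_load must be non-negative")
    # One-third higher than typical load means 4/3 of it.
    return typical_load * 4 / 3


def has_failover_headroom(capacity: float, typical_load: float) -> bool:
    """Check whether a cluster's capacity meets the one-third rule."""
    return capacity >= recommended_capacity(typical_load)
```

For instance, a cluster whose typical load is 90 units needs a capacity of 120 units to meet the recommendation, so a cluster sized for 100 units falls short.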

If a node fails, the NGINX instance that load-balances the system automatically redirects all OneAgent traffic to the remaining working nodes. No user action is needed other than replacing the failed node.