High availability for multi-data centers

Dynatrace Premium High Availability (Premium HA) is a self-contained out-of-the-box solution that provides near-zero downtime and allows monitoring to continue without data loss in failover scenarios. This solution provides cost savings in terms of compute and storage allocations by eliminating the need for separate stand-by DR hosts and the associated infrastructure to store and transfer backup data.

While the computing capacity available to the cluster is positively affected by the additional nodes in the peered data center (DC), the impact is non-linear. For capacity planning, the nodes in the additional DC should be considered as redundant rather than as expanded capacity. This is because the additional DC will have a copy of all the Cassandra and Elasticsearch data from the initial DC.

The maximum number of nodes supported for Dynatrace High Availability clusters is 30 (15 nodes per DC).

The effective minimum is 6 nodes (3 nodes per DC) because of the High Availability 1000+ host license requirement and other technical reasons. Both DCs within a cluster should be symmetrically sized.
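The sizing rules above can be expressed as a quick pre-flight check. This is a minimal sketch, not a Dynatrace tool: the function name validate_topology and the constant names are hypothetical, and the limits are taken directly from the text (3 to 15 nodes per DC, both DCs the same size).

```python
from typing import List

# Limits quoted from the text above; the names are illustrative only.
MIN_NODES_PER_DC = 3    # effective minimum: 6 nodes, 3 per DC
MAX_NODES_PER_DC = 15   # maximum: 30 nodes, 15 per DC


def validate_topology(nodes_dc1: int, nodes_dc2: int) -> List[str]:
    """Return a list of sizing problems; an empty list means the plan fits the limits."""
    problems = []
    for name, count in (("DC1", nodes_dc1), ("DC2", nodes_dc2)):
        if count < MIN_NODES_PER_DC:
            problems.append(f"{name}: {count} nodes is below the minimum of {MIN_NODES_PER_DC}")
        if count > MAX_NODES_PER_DC:
            problems.append(f"{name}: {count} nodes is above the maximum of {MAX_NODES_PER_DC}")
    if nodes_dc1 != nodes_dc2:
        problems.append("DCs are not symmetrically sized")
    return problems


if __name__ == "__main__":
    for plan in [(3, 3), (5, 4), (16, 16)]:
        print(plan, validate_topology(*plan) or "OK")
```

Because the second DC holds a full copy of the data, a plan that passes this check should still be sized as if only one DC's nodes were carrying the load.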

How to remedy segmented clusters

If one part of a cluster loses its connection to another part, this doesn't necessarily mean that the disconnected part is unavailable. The problem may simply be a connectivity issue. You need to determine which part of the cluster will act as the surviving one. Short network disconnections between data centers (up to 3 hours) are repaired automatically. To avoid data inconsistency during longer outages, we recommend shutting down the server service on all nodes in the affected data center. You can start the services again once network connectivity is stable.

To handle the situation when one part of the cluster is unavailable, Dynatrace Mission Control tracks the health of all nodes and automatically designates one part of the cluster as primary (surviving). During the recovery, this designation is used to determine how to re-sync all parts of the cluster. This means that Dynatrace High Availability isn't supported for completely off-line Dynatrace Managed Clusters.

Data sharding and replication

Using virtual racks, Dynatrace High Availability stores three copies of all configuration data, metrics, and user sessions in each DC. This provides optimal performance and reliability in failover scenarios.

Raw transaction data (such as PurePaths, call stacks, and database statements) is distributed randomly across all DCs so that a statistically representative data set is always available on each DC.

Data is synchronized asynchronously between DCs. This eliminates the 10-ms latency requirement that otherwise applies to cross-DC clusters. Data synchronization is engineered to minimize bandwidth consumption between DCs and to prevent data loss in case of a DC outage.

During outages of less than three hours, Dynatrace High Availability automatically and transparently re-synchronizes the data across DCs. For outages of up to three days, the Dynatrace Mission Control team triggers the necessary repair and synchronization jobs. If an outage lasts longer than three days, the malfunctioning portion of the cluster must be reinstalled.
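The recovery tiers described above can be summarized as a small decision function. This is an illustrative sketch only: the function name recovery_path and the returned labels are hypothetical, and the handling of an outage of exactly three hours (treated here as needing Mission Control) is an assumption, since the text says "less than three hours" for the automatic case.

```python
from datetime import timedelta

# Thresholds quoted from the text; the constant names are illustrative.
AUTO_RESYNC_LIMIT = timedelta(hours=3)      # below this: automatic re-synchronization
ASSISTED_REPAIR_LIMIT = timedelta(days=3)   # up to this: Mission Control repair jobs


def recovery_path(outage: timedelta) -> str:
    """Map an outage duration to the recovery tier described in the text."""
    if outage < timedelta(0):
        raise ValueError("outage duration cannot be negative")
    if outage < AUTO_RESYNC_LIMIT:
        return "automatic-resync"
    if outage <= ASSISTED_REPAIR_LIMIT:
        return "mission-control-repair"
    return "reinstall"


if __name__ == "__main__":
    for duration in (timedelta(minutes=45), timedelta(hours=20), timedelta(days=5)):
        print(duration, "->", recovery_path(duration))
```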

Telemetry data routing

You can use network zones to control the flow of telemetry data to the cluster nodes in the various DCs. While Dynatrace High Availability implements various optimizations to reduce cross-DC traffic, we recommend, for the sake of data redundancy, that you allow ActiveGates to send data to both DCs. OneAgents and ActiveGates can be configured to prefer certain network zones while preserving their ability to fail over to another part of the cluster in case of a DC outage. Load balancers can also be used for this purpose.

For active-passive deployments of applications, we recommend that you not disable ActiveGates in the passive portions of the deployment. This keeps all parts of the Dynatrace infrastructure in play in case of a DR scenario and enables failover without reconfiguration or rediscovery.

Technical requirements

Dynatrace High Availability requires an OS that supports cgroups version 1.0 and systemd version 219 or later (for example, RHEL/CentOS 7+).
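The systemd part of this requirement can be checked on a host before installation. The sketch below is an assumption-laden helper, not part of the Dynatrace installer: the function names are hypothetical, and it relies on the first line of "systemctl --version" starting with "systemd" followed by the version number. It does not check cgroups support, which needs to be verified separately.

```python
import re
import subprocess
from typing import Optional

MIN_SYSTEMD_VERSION = 219  # minimum version quoted in the text


def parse_systemd_version(version_output: str) -> Optional[int]:
    """Extract the major version from `systemctl --version` output, or None if absent."""
    match = re.match(r"\s*systemd\s+(\d+)", version_output)
    return int(match.group(1)) if match else None


def systemd_requirement_met(version_output: str) -> bool:
    """True if the reported systemd version is at least the required minimum."""
    version = parse_systemd_version(version_output)
    return version is not None and version >= MIN_SYSTEMD_VERSION


def read_systemd_version_output() -> str:
    """Run `systemctl --version` on the local host and return its raw output."""
    result = subprocess.run(
        ["systemctl", "--version"], capture_output=True, text=True, check=True
    )
    return result.stdout


if __name__ == "__main__":
    output = read_systemd_version_output()
    print("systemd version:", parse_systemd_version(output))
    print("requirement met:", systemd_requirement_met(output))
```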

The various nodes will continue to communicate with each other over the usual ports as described earlier. The bottom line is that the ports that need to be open between nodes in a single DC are the same ports that need to be open within the cluster if the cluster spans two DCs.

The connections between nodes in different DCs need to be encrypted. Dynatrace does not create or install the certificates required for this; you need to do that manually. Round-trip network latency of up to 100 ms is supported. Bandwidth consumption depends on a variety of factors. For more information, please consult the ESA team.
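To get a rough sense of whether a link between DCs stays within the 100 ms round-trip limit, TCP connect time can serve as an approximation, since a connect completes after roughly one round trip. This is a sketch under stated assumptions: the function names, the use of the median, and the host and port in the usage example are all hypothetical, and connect time also includes any DNS lookup, so a dedicated network measurement tool will give more reliable numbers.

```python
import socket
import statistics
import time
from typing import Iterable

MAX_RTT_MS = 100.0  # supported round-trip latency quoted in the text


def tcp_connect_ms(host: str, port: int, timeout: float = 2.0) -> float:
    """Time one TCP connect to host:port in milliseconds (approximates one round trip)."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000.0


def within_latency_limit(samples_ms: Iterable[float], limit_ms: float = MAX_RTT_MS) -> bool:
    """True if the median of the samples is at or below the limit (median is an assumption)."""
    return statistics.median(samples_ms) <= limit_ms


if __name__ == "__main__":
    # Hypothetical peer node address and port; replace with a real node in the other DC.
    samples = [tcp_connect_ms("node1.dc2.example.com", 443) for _ in range(5)]
    print("samples (ms):", [round(s, 1) for s in samples])
    print("within limit:", within_latency_limit(samples))
```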

It is possible to migrate a single-DC cluster (or a DC-agnostic cross-DC cluster) to a dual-DC High Availability cluster. For more information, please consult the ESA team.

Such a deployment requires a Premium High Availability license. See Calculate Dynatrace monitoring consumption.