High availability deployment

Release dependent information

The following information has been tested for DC RUM May 2017 (17.0.x) deployments only.

High availability configuration is not a supported feature for DC RUM May 2017 deployments as a whole, but you can create a high availability setup using existing farm/cluster features and creating a failover setup for components that do not have the automatic failover capability.

High availability deployment

The preferred high availability setup is to use the report server failover feature in a farm, deploy a secondary (disabled) instance of the RUM Console component, create a list of RUM Console failover addresses and propagate it to all report servers, and set up replication of the RUM Console database to the Failover Zone.

The following terminology is used:

  • Availability Report Server (CAS or ADS) - Standard report server installation (CAS or ADS) located in the Availability Zone and operating as Nodes in a DC RUM cluster.
  • Availability RUM Console - Standard installation of the RUM Console located in the Availability Zone and integrated with LDAP if LDAP is used for user management.
  • Availability Database - Availability RUM Console database located in the Availability Zone (typically on the same machine as the RUM Console).
  • Failover Report Server (CAS or ADS) - Standard report server installation (CAS or ADS) located in the Failover Zone and operating as failover Nodes in a DC RUM cluster.
  • Failover RUM Console - Standard installation of the RUM Console located in the Failover Zone and integrated with LDAP if LDAP is used for user management.
  • Failover Database - Failover RUM Console database located in the Failover Zone (typically on the same machine as RUM Console).
  • security.yaml - A YAML file located in the <installation directory>/config/ folder of every report server. This file lists the Failover RUM Console addresses that are used instead of the Availability RUM Console in the event of an Availability RUM Console failure.

Report server types

The following setup shows the CAS report server. Depending on your deployment scenario, each CAS may have an accompanying ADS or you may be using ADS servers instead of CAS. The high availability setup applies to both CAS and ADS types of report servers and to the combination of both.

Based on illustration

The following procedure is based on the example illustrated above. Your deployment may vary in the number of report servers, clusters, and RUM Consoles.

To build a high availability setup:

Configure high availability

1. Disable failover RUM Console

List the running services on the Failover RUM Console machine and disable the RUM Console service (Dynatrace RUM Console).

DC RUM services are monitored by a watchdog service that ensures all essential DC RUM services are running. If you simply stop the Failover RUM Console service, the watchdog service will automatically restart it. Make sure that the Failover RUM Console service (Dynatrace RUM Console) is actually disabled.
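
The difference between stopping and disabling the service can be illustrated with a toy model. The sketch below is not the DC RUM watchdog: the Service class, its fields, and watchdog_pass are hypothetical names, and the model assumes only the behavior stated here, namely that the watchdog restarts stopped services and that a disabled service stays down.

```python
from dataclasses import dataclass


@dataclass
class Service:
    name: str
    running: bool = True
    disabled: bool = False   # hypothetical flag: service start type set to disabled


def watchdog_pass(services):
    """Restart every stopped service that is not disabled; return the restarted names."""
    restarted = []
    for service in services:
        if not service.running and not service.disabled:
            service.running = True
            restarted.append(service.name)
    return restarted


console = Service("Dynatrace RUM Console")

# Case 1: the service is only stopped, so the watchdog brings it back.
console.running = False
print(watchdog_pass([console]))   # ['Dynatrace RUM Console']

# Case 2: the service is stopped and disabled, so the watchdog leaves it alone.
console.running = False
console.disabled = True
print(watchdog_pass([console]))   # []
```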

2. Set up database replication

Set up database replication of the Availability RUM Console to the failover duplicate.
Use official database mechanisms to configure the replication.

Be aware that database replication is one-directional. Make sure you configure it to replicate from the Availability Database to the Failover Database.

3. Create the failover cluster

In the Availability RUM Console, add all installed CAS report servers to the devices list and set up a farm with a cluster where:

  • One CAS report server in the Availability Zone is a primary node in the primary cluster.
  • One CAS report server in the Failover Zone is a failover node for the primary node in the primary cluster.
  • One CAS report server in the Availability Zone is a node in the primary cluster.
  • One CAS report server in the Failover Zone is a failover node for a node in the primary cluster.

Your Availability Zone RUM Console farm setup should look like this:

4. Publish your farm configuration.
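
The node pairing described in step 3 can be expressed as a small consistency check before publishing. This is a sketch only: the Node class, the host names, and unprotected_nodes are hypothetical and are not part of the RUM Console. The check assumes that every Availability Zone report server should have a failover node in the Failover Zone.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    zone: str                            # "availability" or "failover"
    is_primary: bool = False             # True for the primary node of the cluster
    failover_for: Optional[str] = None   # name of the node this one backs up


def unprotected_nodes(farm):
    """Return names of Availability Zone nodes that have no failover node."""
    covered = {node.failover_for for node in farm
               if node.zone == "failover" and node.failover_for}
    return [node.name for node in farm
            if node.zone == "availability" and node.name not in covered]


# Hypothetical four-server layout matching the bullets in step 3.
farm = [
    Node("cas-avail-1", "availability", is_primary=True),
    Node("cas-fail-1", "failover", failover_for="cas-avail-1"),
    Node("cas-avail-2", "availability"),
    Node("cas-fail-2", "failover", failover_for="cas-avail-2"),
]

print(unprotected_nodes(farm))        # []
print(unprotected_nodes(farm[:3]))    # ['cas-avail-2']
```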

5. Configure LDAP

Configure LDAP integration on both the Availability RUM Console and the Failover RUM Console.

6. Create a failover list for RUM Console

Edit the security.yaml file located in the <installation directory>/config/ folder of every report server and append the Failover RUM Console addresses.

First entry is important

Make sure that the first entry on the list is the Availability RUM Console address. That is the default RUM Console that should be used by the report servers containing this file.

The list may consist of more than one Failover RUM Console address. Use the following format to create the Failover RUM Console list:

URL: https://availconsole.lab.org:4183
URL: https://failoverconsole1.lab.org:4183
URL: https://failoverconsole2.lab.org:4183

The order in which the RUM Consoles are listed is the order in which the report server attempts to communicate with them when a preceding RUM Console fails. In the listed example, if the Availability RUM Console (availconsole.lab.org) stops responding, the report server attempts to use the next RUM Console listed (failoverconsole1.lab.org). If failoverconsole1.lab.org stops responding, the report server attempts to use the next RUM Console listed (failoverconsole2.lab.org).
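
The lookup order described above can be sketched in Python. The functions parse_console_list and pick_console are hypothetical names, and the is_responding callback stands in for whatever health check the report server actually performs, which this document does not describe.

```python
from typing import Callable, List, Optional


def parse_console_list(text: str) -> List[str]:
    """Return the console addresses in file order, one per 'URL: ...' line."""
    addresses = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("URL:"):
            addresses.append(line[len("URL:"):].strip())
    return addresses


def pick_console(addresses: List[str],
                 is_responding: Callable[[str], bool]) -> Optional[str]:
    """Return the first address, in list order, that responds; None if none do."""
    for address in addresses:
        if is_responding(address):
            return address
    return None


EXAMPLE = """\
URL: https://availconsole.lab.org:4183
URL: https://failoverconsole1.lab.org:4183
URL: https://failoverconsole2.lab.org:4183
"""

consoles = parse_console_list(EXAMPLE)

# Simulate an Availability RUM Console outage with a stand-in health check.
down = {"https://availconsole.lab.org:4183"}
print(pick_console(consoles, lambda url: url not in down))
# https://failoverconsole1.lab.org:4183
```

Because the list is walked from the top, placing the Availability RUM Console first makes it the default whenever it responds.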

7. Restart CAS or ADS service

The security.yaml file with its RUM Console list is read by the report server on startup. Any changes to the security.yaml file require a restart of the report server service.
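
Because the file is read only at startup, a mistake in the list surfaces only after the restart. A pre-restart check along the following lines may help. It is a sketch under assumptions: check_console_list is a hypothetical helper, the file is assumed to contain only URL: lines in the format shown in step 6, and the file content is passed in as text rather than read from a specific path.

```python
def check_console_list(text, availability_url):
    """Return a list of problems found in the console list (empty if none)."""
    urls = [line.split(":", 1)[1].strip()
            for line in text.splitlines()
            if line.strip().startswith("URL:")]
    problems = []
    if not urls:
        problems.append("no RUM Console addresses found")
    elif urls[0] != availability_url:
        # The first entry is the default console used by the report server.
        problems.append("first entry is %s, expected %s" % (urls[0], availability_url))
    if len(urls) == 1:
        problems.append("no Failover RUM Console addresses listed")
    if len(set(urls)) != len(urls):
        problems.append("duplicate addresses in the list")
    return problems


AVAILABILITY = "https://availconsole.lab.org:4183"

GOOD = """\
URL: https://availconsole.lab.org:4183
URL: https://failoverconsole1.lab.org:4183
"""

SWAPPED = """\
URL: https://failoverconsole1.lab.org:4183
URL: https://availconsole.lab.org:4183
"""

print(check_console_list(GOOD, AVAILABILITY))      # []
print(check_console_list(SWAPPED, AVAILABILITY))   # one problem: wrong first entry
```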

High availability traits and limitations

A DC RUM high availability setup requires certain synchronization actions to be performed manually. Because the Availability RUM Console database is replicated periodically, the only action required during the failover switch is to start the disabled Failover RUM Console service.

  • Farm Status Indicators
    The DC RUM deployment passes through several synchronization cycles. Error messages and misleading statuses will occur during and after the switch to the Failover Zone (especially statuses relating to a DC RUM farm). Dismiss these error messages while operating in the Failover Zone.

Switching to failover zone

Complete availability zone failure

To quickly switch to the Failover Zone, enable the Failover RUM Console service (Dynatrace RUM Console).

While this quick switch will keep your DC RUM deployment operational, you will have to manually switch the report server roles within the cluster to be able to configure the report servers again.
See Swap failover with primary.

Swap order
  1. All Primary Nodes of the Primary cluster (CAS or ADS).
  2. All Primary Nodes of other clusters (if multiple clusters are in use).
  3. All Nodes within the Primary cluster.
  4. All Nodes within other clusters (if multiple clusters are in use).

Publish the configuration for the swap to take effect.
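
The swap order above amounts to sorting report servers by two properties: whether a server is a Primary Node, and whether it belongs to the Primary cluster. The following is a minimal sketch of that ordering. ReportServer, swap_rank, swap_order, and the server names are hypothetical; the swap itself is still performed in the RUM Console.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportServer:
    name: str
    in_primary_cluster: bool   # True for the Primary cluster, False for other clusters
    is_primary_node: bool      # True for the Primary Node of its cluster


def swap_rank(server):
    """Map a server to its position (1 to 4) in the documented swap order."""
    if server.is_primary_node:
        return 1 if server.in_primary_cluster else 2
    return 3 if server.in_primary_cluster else 4


def swap_order(servers):
    """Return servers sorted into swap order; ties keep their input order."""
    return sorted(servers, key=swap_rank)


servers = [
    ReportServer("cas-b2", in_primary_cluster=False, is_primary_node=False),
    ReportServer("cas-a2", in_primary_cluster=True, is_primary_node=False),
    ReportServer("cas-b1", in_primary_cluster=False, is_primary_node=True),
    ReportServer("cas-a1", in_primary_cluster=True, is_primary_node=True),
]

print([server.name for server in swap_order(servers)])
# ['cas-a1', 'cas-b1', 'cas-a2', 'cas-b2']
```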

Statuses and error messages

Until the DC RUM deployment passes through several synchronization cycles, error messages and misleading statuses may occur (especially statuses related to a DC RUM farm). Dismiss these error messages while operating in the Failover Zone.

Partial availability zone failure

  • Availability Report Server failure - No action required
    The failure of any of the report servers (Nodes within the cluster) in the Availability Zone will trigger error messages visible on the Failover report servers (failover nodes within a cluster).
    No action is required. The failover report server within the cluster will automatically take over the functions of the parent nodes.
    See Deployment with failover nodes.
    While this automatic switch will keep your cluster operational, you will have to manually switch the report server roles within the cluster to be able to configure the report servers again.
    See Swap failover with primary.
  • Availability RUM Console failure - Enable the Failover RUM Console
    The RUM Console information and configuration are stored in its database. Replicating the RUM Console database from the Availability Zone to the Failover Zone ensures that the Failover RUM Console contains the same information and configuration. To recover from an Availability RUM Console failure, enable the RUM Console service (Dynatrace RUM Console) in the Failover Zone and begin using the Failover RUM Console as your DC RUM configuration tool.

Switching back to availability zone

Several service restarts are required for the DC RUM deployment to become fully operational again.

The start sequence is critical for a successful transition to the Availability Zone. While you operate in the Failover Zone, the Failover Database is populated with data and the downed Availability Database is not, so the two databases are no longer synchronized. Simply switching back to the Availability Zone could lose changes made while operating in the Failover Zone. For example:

  • Added or removed devices
  • Changes made to DPN configuration or device connection parameters
  • Changes made to technical hierarchy
  • Changes made to users (added, deleted or modified users)
  • Names of the traces recorded while operating in the Failover Zone

If you are certain that no changes listed above have been made while operating in the Failover Zone, or you accept the loss of these changes, you may skip step 1 in the process of switching to the Availability Zone.

If you are unsure whether such changes have been made, you must first copy the Failover Database to the Availability Database to prevent the loss of configuration changes.

To switch back to the Availability Zone:

  1. Copy the data generated while operating in the Failover Zone to the Availability Database.
    There are several methods to synchronize the database (copy only the differences or copy the entire database). Select the method that best meets your needs.
  2. Stop and disable the Failover RUM Console service (Dynatrace RUM Console).
  3. Start the Availability Report Servers.
  4. Start the Availability RUM Console.
  5. Make sure the Availability Database to Failover Database replication is operating properly.
  6. Using the Availability RUM Console, switch the report server roles in the cluster. Maintain the same swap order that you used when switching to the Failover Zone:
    1. All Primary Nodes of the Primary cluster (CAS or ADS).
    2. All Primary Nodes of other clusters (if multiple clusters are in use).
    3. All Nodes within the primary cluster.
    4. All Nodes within other clusters (if multiple clusters are in use).

See Swap failover with primary.
Publish the farm configuration with the restored report server roles.
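
The switch-back sequence, including the optional database copy, can be summarized as an ordered checklist. The sketch below only restates the steps from this section; switch_back_steps and its changes_possible parameter are hypothetical names, and nothing here performs the actions themselves.

```python
def switch_back_steps(changes_possible):
    """Return the switch-back steps in order.

    changes_possible: True if configuration changes may have been made while
    operating in the Failover Zone (or if you are unsure).
    """
    steps = []
    if changes_possible:
        # Step 1 may be skipped only when no changes were made or their loss is accepted.
        steps.append("Copy the Failover Database to the Availability Database")
    steps += [
        "Stop and disable the Failover RUM Console service",
        "Start the Availability Report Servers",
        "Start the Availability RUM Console",
        "Verify Availability Database to Failover Database replication",
        "Swap report server roles in the documented swap order",
        "Publish the farm configuration",
    ]
    return steps


for number, step in enumerate(switch_back_steps(changes_possible=True), start=1):
    print(number, step)
```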

Statuses and error messages

Until the DC RUM deployment passes through several synchronization cycles, error messages and misleading statuses may occur (especially statuses related to a DC RUM farm). Dismiss these error messages once you are operating in the Availability Zone again.

Upgrading high availability deployment

This high availability deployment is limited to DC RUM May 2017 service pack 1 releases. To update to the latest service pack release, execute the standard DC RUM upgrade procedure with additional steps that cover the database replication and the failover components.

To upgrade the High Availability setup:

  1. Stop the database replication.

  2. Apply standard DC RUM upgrade procedures. Consider your farm deployment when performing the upgrade.

  3. Upgrade the failover components individually.

    Disable the Failover RUM Console service

    The upgrade enables and starts the Failover RUM Console service (Dynatrace RUM Console). Make sure that you disable it again after the upgrade.

  4. Start the database replication.