Build automated self-healing systems with xMatters and Dynatrace (Part 2 of 3)

Welcome back to the blog series in which we share how you can easily solve three common problem scenarios by using Dynatrace and xMatters Flow Designer. In Part 1 we explored how DevOps teams can prevent a process crash from taking down services across an organization in five easy steps. In this post, we’ll look at how you can set up an automated response to full disk errors and thereby prevent your services from going down.

Use case #2: Prevent a full disk error from taking down services across an organization

  • Step 1 – Let Dynatrace analyze your infrastructure health in real time.

The Dynatrace all-in-one software intelligence platform gives your team real-time visibility into your underlying infrastructure, whether it runs on bare metal, VMware, OpenStack, AWS, Azure, or a hybrid solution. As soon as Dynatrace detects a disk-health issue (in this case, Low disk space), the Dynatrace AI causation engine provides automated root cause analysis that shows you all related performance errors, as well as which applications and services have been affected by the issue. This gives you the full picture of the problem and its impact across your full stack, from host level to the end user.

Dynatrace problem notification: Low disk space

  • Step 2 – xMatters passes Dynatrace data to alerts that provide actionable responses.

This is where xMatters Flow Designer comes into play, by automating remediation steps at the touch of a button. When the Dynatrace problem notification about this Low disk space problem is displayed, xMatters triggers an alert based on a workflow that was created previously in Flow Designer.

In this alert, xMatters includes all the important incident information from Dynatrace, so there’s no need for you to visit additional system dashboards. Based on this contextual data, on-call responders are prompted with their preconfigured response options, each of which kicks off a workflow across systems (based on the severity of the issue). They can simply select the right response and launch a workflow that restores disk health while simultaneously documenting the issue in their chat and service desk tools, which can reduce mean time to respond by up to 90%.
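To make the idea of an alert that carries Dynatrace data plus response options more concrete, here is a minimal Python sketch. It is not the xMatters implementation: the notification keys (`problemId`, `title`, `severity`, `impactedEntity`, `problemUrl`), the `Alert` class, and the response option names are all assumptions made for illustration. In practice the payload fields and response options are whatever you configure in Dynatrace and Flow Designer.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical response options per severity level.
# Real options are configured in xMatters, not hard-coded like this.
RESPONSE_OPTIONS: Dict[str, List[str]] = {
    "RESOURCE_CONTENTION": ["Acknowledge", "Run disk cleanup", "Escalate"],
    "AVAILABILITY": ["Acknowledge", "Escalate"],
}
DEFAULT_OPTIONS = ["Acknowledge", "Escalate"]


@dataclass
class Alert:
    """Illustrative alert record: incident context plus response options."""
    problem_id: str
    title: str
    severity: str
    impacted_entity: str
    problem_url: str
    response_options: List[str] = field(default_factory=list)


def build_alert(notification: Dict[str, str]) -> Alert:
    """Turn a (hypothetical) problem notification payload into an alert."""
    severity = notification.get("severity", "UNKNOWN").upper()
    return Alert(
        problem_id=notification["problemId"],
        title=notification["title"],
        severity=severity,
        impacted_entity=notification.get("impactedEntity", "unknown"),
        problem_url=notification.get("problemUrl", ""),
        # Copy the list so callers cannot mutate the shared defaults.
        response_options=list(RESPONSE_OPTIONS.get(severity, DEFAULT_OPTIONS)),
    )


if __name__ == "__main__":
    sample = {
        "problemId": "P-2041",
        "title": "Low disk space",
        "severity": "resource_contention",
        "impactedEntity": "host-01",
        "problemUrl": "https://example.invalid/problems/P-2041",
    }
    alert = build_alert(sample)
    print(alert.title, alert.response_options)
```

The point of the sketch is the shape of the data: the responder sees the problem context and a short list of actions in one place, and the list of actions depends on the severity reported with the problem.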

Dynatrace data in an xMatters alert (left), with actionable responses (right)

  • Step 3 – xMatters creates and updates Jira issues with incident information from Dynatrace.

In the meantime, Flow Designer also triggers the creation of a ticket in Jira (or your incident management system of choice) and automatically updates the ticket with the incident data provided by Dynatrace.
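As a rough illustration of what "updating the ticket with incident data" can look like, the sketch below builds a request body in the general shape that Jira's REST create-issue endpoint accepts (a `fields` object with `project`, `issuetype`, `summary`, and `description`). The function name, the `OPS` project key, the `Incident` issue type, the labels, and the problem dictionary keys are assumptions for this example. The sketch only builds the payload and does not send it, and it is not how the xMatters Jira step is implemented.

```python
import json
from typing import Dict


def build_jira_issue(problem: Dict[str, str], project_key: str = "OPS") -> dict:
    """Build a create-issue request body from (hypothetical) problem data.

    project_key and the "Incident" issue type are placeholders; use the
    values that exist in your own Jira project.
    """
    description = "\n".join([
        f"Problem: {problem['title']}",
        f"Severity: {problem.get('severity', 'unknown')}",
        f"Impacted entity: {problem.get('impactedEntity', 'unknown')}",
        f"Details: {problem.get('problemUrl', 'n/a')}",
    ])
    return {
        "fields": {
            "project": {"key": project_key},
            "issuetype": {"name": "Incident"},  # assumed issue type name
            "summary": f"[Dynatrace {problem['problemId']}] {problem['title']}",
            "description": description,
            "labels": ["dynatrace", "xmatters"],  # illustrative labels
        }
    }


if __name__ == "__main__":
    body = build_jira_issue({
        "problemId": "P-2041",
        "title": "Low disk space",
        "severity": "RESOURCE_CONTENTION",
        "impactedEntity": "host-01",
        "problemUrl": "https://example.invalid/problems/P-2041",
    })
    print(json.dumps(body, indent=2))
```

Putting the problem ID in the summary makes it easy to find the ticket again when a later workflow step needs to update it.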

xMatters creates and updates Jira issues

  • Step 4 – xMatters creates a dedicated Slack channel

Similarly, the Flow Designer workflow automatically creates a channel in your preferred chat tool (Slack, Microsoft Teams, or other). The respective chat bots include the Dynatrace incident information in the chats. Here you can reference on-call schedules and invite the right team members to join the conversation. Once the incident is resolved, your chat transcript is automatically attached to the respective service desk ticket (for example, Jira) to give you and your team a complete picture of what happened.
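One small detail when channels are created automatically: the channel name has to be derived from the incident in a predictable way. The sketch below shows one possible approach, assuming Slack-style naming constraints (lowercase letters, digits, hyphens, and underscores, at most 80 characters). The `inc-` prefix and the function name are hypothetical, and the workflow's own naming scheme may differ.

```python
import re

MAX_CHANNEL_NAME_LENGTH = 80  # assumed limit, matching Slack's documented maximum


def incident_channel_name(problem_id: str, title: str) -> str:
    """Derive a chat channel name from a problem ID and title."""
    raw = f"inc-{problem_id}-{title}".lower()
    # Replace every run of disallowed characters with a single hyphen.
    slug = re.sub(r"[^a-z0-9_-]+", "-", raw)
    # Collapse repeated hyphens and trim them from both ends.
    slug = re.sub(r"-{2,}", "-", slug).strip("-")
    # Truncate to the limit without leaving a trailing hyphen.
    return slug[:MAX_CHANNEL_NAME_LENGTH].rstrip("-")


if __name__ == "__main__":
    print(incident_channel_name("P-2041", "Low disk space on host-01"))
```

A deterministic name also means that later steps, such as attaching the chat transcript to the ticket, can locate the channel from the problem ID alone.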

xMatters creates a dedicated Slack channel

  • Step 5 – xMatters triggers a runbook in Ansible to fix the low disk space

As a last step, xMatters triggers a runbook in Ansible to push the fix for the low disk space. In this case, a team member determines that the best course of action is to delete temporary files from disk to free up space and return the disk to proper health. In the background, all integrated systems (Dynatrace, Jira Service Desk, and Slack) are updated with details of the action taken and the newly improved performance data.
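The runbook itself lives in Ansible and is not shown in this post. Purely to illustrate the kind of cleanup it might perform, here is a Python stand-in that removes files older than a cutoff from a directory and reports how many bytes were reclaimed. The function name, the age-based policy, and the dry-run default are assumptions for this sketch, not a description of the actual runbook.

```python
import time
from pathlib import Path
from typing import List, Optional, Tuple


def delete_stale_files(
    directory: str,
    max_age_seconds: float,
    dry_run: bool = True,
    now: Optional[float] = None,
) -> Tuple[List[Path], int]:
    """Find (and optionally delete) regular files older than max_age_seconds.

    Returns the matching paths and the total bytes they occupy.
    With dry_run=True nothing is deleted, so the result can be reviewed first.
    """
    now = time.time() if now is None else now
    matched: List[Path] = []
    freed = 0
    # Materialize the listing first so deletions do not disturb the walk.
    for path in list(Path(directory).rglob("*")):
        # Skip directories and symlinks; only plain files are candidates.
        if path.is_symlink() or not path.is_file():
            continue
        stat = path.stat()
        if now - stat.st_mtime < max_age_seconds:
            continue
        matched.append(path)
        freed += stat.st_size
        if not dry_run:
            path.unlink()
    return matched, freed


if __name__ == "__main__":
    import tempfile

    # Demonstrate against a throwaway directory rather than a real temp path.
    with tempfile.TemporaryDirectory() as scratch:
        (Path(scratch) / "example.tmp").write_bytes(b"x" * 1024)
        files, size = delete_stale_files(scratch, max_age_seconds=0, dry_run=True)
        print(f"{len(files)} file(s), {size} bytes would be reclaimed")
```

Defaulting to a dry run is a deliberate choice here: an automated remediation that deletes files is safer when the first pass only reports what it would remove.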

Depending on the type of the issue, xMatters launches workflows across your systems to start the automated self-healing process.

Wrap up

A full disk can impact performance, sometimes even causing critical processes such as backups to fail. However, because Dynatrace can pinpoint symptoms of impending full disk events before they strike, you can take proper action and automate workflows across systems to address these issues quickly. Dynatrace and xMatters are your first line of defense against threats to your disk health.

For more details on these steps, see Self-Healing DevOps with xMatters and Dynatrace: Full Disk Prevention on the xMatters blog. For more information on Dynatrace and xMatters, please visit our technology partner page.
