Build automated self-healing systems with xMatters and Dynatrace (Part 3 of 3)

Welcome back to the blog series in which we show how you can easily solve three common problem scenarios by using Dynatrace and xMatters Flow Designer. Here’s what we discussed so far:

In Part 3 we will investigate how you can set up automated self-healing across two production environments.

Blue/green deployment for releasing software faster, safer.

Driven by the need for faster time to market and tighter, faster feedback loops, businesses today are leveraging CI/CD (Continuous Integration and Continuous Delivery) as well as CO (Continuous Operations). In doing so, they automate build processes to speed up delivery and minimize human involvement to prevent errors.

Blue/green deployment is one of several deployment strategies. In this method, two identical production environments run in parallel. One is the live production environment that receives all user traffic (let’s say the “blue” one); the other (“green”) is a clone of it, but idle. Both use the same database back end and app configuration. The new version of the application is deployed to the green environment and tested for functionality and performance. Once the tests pass, application traffic is routed from blue to green, and green becomes the new production environment.
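
To make the mechanics concrete, here is a minimal sketch of the switch in Python. The `Environment` and `BlueGreenRouter` classes are hypothetical names made up for this illustration; they are not part of Dynatrace, keptn, or xMatters, and a real setup would switch traffic at a load balancer or service mesh rather than in application code.

```python
from dataclasses import dataclass


@dataclass
class Environment:
    name: str      # "blue" or "green"
    version: str   # application version currently deployed there


@dataclass
class BlueGreenRouter:
    live: Environment   # receives all user traffic
    idle: Environment   # identical clone, no traffic

    def deploy(self, version: str) -> None:
        # A new version is always deployed to the idle environment first.
        self.idle.version = version

    def switch(self) -> None:
        # Route traffic to the other environment by swapping the two roles.
        self.live, self.idle = self.idle, self.live

    def rollback(self) -> None:
        # Rolling back is the same swap: the previous version is still
        # deployed in the environment that has just gone idle.
        self.switch()


router = BlueGreenRouter(
    live=Environment("blue", "1.0"),
    idle=Environment("green", "1.0"),
)
router.deploy("1.1")   # green now runs 1.1 while blue still serves users
router.switch()        # tests passed: green becomes production
router.rollback()      # a problem shows up: traffic returns to version 1.0
```

Because the two roles swap with every release, the rollback target is simply whichever environment still runs the previous version.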

If you have not yet seen Dynatrace monitor a modern CD/CO environment, this is how we capture blue/green data:

Dynatrace capturing blue/green environment data

Below are the response times of the blue and green environments (shown here in yellow and magenta), which show how traffic is switched between them.

Response time for blue/green environment traffic

Use case #3: Automate self-healing across two production environments with xMatters and Dynatrace

In this use case we will rely on the xMatters and keptn integration. We will use xMatters’ Flow Designer to create toolchains that automate remediation, and Dynatrace’s keptn to automate continuous operations.

  • Step 1 — The Dynatrace Davis AI-engine identifies the root cause

Let’s assume your team has just pushed new code and it passed pre-deployment testing. Great! But now you realize it doesn’t work well with the other services running.

Davis®, Dynatrace’s deterministic AI engine, is here to give you the exact answers you need to resolve operational challenges in the cloud automatically. It detects the root cause without the need for thresholds or baselines. But wait, there is more: Davis doesn’t just tell you exactly what happened; it also understands how everything is connected, including the relationships and interdependencies between each layer, component, and bit of code in your application environment.

Dynatrace Davis in action
  • Step 2 — Keptn captures version data and triggers an xMatters workflow

With Dynatrace’s keptn, you can take immediate action and revert to the parallel environment. Keptn orchestrates Continuous Deployment as well as Continuous (or Automated) Operations. As soon as Dynatrace detects a problem, it pushes the problem details to keptn, which enriches the data with current deployment details and triggers an xMatters workflow. This allows the on-call developer to take swift action and fix the problem.
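
As a rough illustration of this "enrich, then trigger" step, the sketch below merges problem details with deployment details and prepares the HTTP POST that would start a workflow. Everything specific here is an assumption: the field names, the `enrich_problem` and `build_workflow_request` helpers, and the webhook URL are placeholders, not the actual Dynatrace, keptn, or xMatters payload formats. The request is only built, never sent.

```python
import json
import urllib.request

# Hypothetical placeholder; a real URL would come from your own workflow setup.
XMATTERS_WEBHOOK_URL = "https://example.invalid/xmatters/inbound-webhook"


def enrich_problem(problem: dict, deployment: dict) -> dict:
    """Merge monitoring details with deployment details into one payload."""
    return {
        "problem_id": problem["id"],
        "title": problem["title"],
        "impacted_service": problem["service"],
        "severity": problem["severity"],
        # Deployment context added by the orchestration layer.
        "deployed_version": deployment["version"],
        "previous_version": deployment["previous_version"],
        "environment": deployment["environment"],
    }


def build_workflow_request(payload: dict) -> urllib.request.Request:
    """Prepare (but do not send) the POST that would trigger the workflow."""
    return urllib.request.Request(
        XMATTERS_WEBHOOK_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Made-up sample data for illustration only.
problem = {"id": "P-123", "title": "Failure rate increase",
           "service": "checkout", "severity": "high"}
deployment = {"version": "1.1", "previous_version": "1.0",
              "environment": "green"}

request = build_workflow_request(enrich_problem(problem, deployment))
```

Sending the request would be a single `urllib.request.urlopen(request)` call once a real endpoint is configured. The point of the enrichment is that the person who gets alerted sees both what broke and which version to roll back to.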

  • Step 3 — xMatters alerts all the relevant resources

Now it’s up to xMatters to alert the relevant resources. The alert comes with the full context of the issue, including the errors caused, the impacted systems, and the level of severity, so your team can quickly investigate the situation and determine the best response. With the push of a button, they can launch a remediation workflow across the entire toolchain to automate the rollback to the green environment through keptn. Simultaneously, xMatters pushes all the critical information into the proper channels, such as Slack, Jira Service Desk, and Dynatrace.

Keptn triggers an xMatters workflow; xMatters alerts relevant parties
  • Step 4 — xMatters creates a Jira Service Desk ticket & triggers rollback

At this point you will want a ticket about the incident in your service desk. No worries! xMatters automatically creates one in each of your service desk tools, adds all the relevant incident information, and keeps it updated so that it can be easily referenced in team post-mortems. xMatters also triggers the rollback at the same time, so your team doesn’t have to choose between taking immediate remediation action and starting full timeline documentation in your service desk.
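
The "both at once" idea can be sketched with two tasks that run in parallel. The `create_ticket` and `trigger_rollback` functions below are stand-ins invented for this example; in a real workflow they would be Flow Designer steps calling your service desk and keptn, and nothing here reflects their actual APIs.

```python
from concurrent.futures import ThreadPoolExecutor


def create_ticket(incident: dict) -> str:
    # Placeholder for a service desk call; returns a made-up ticket key.
    return f"TICKET-{incident['problem_id']}"


def trigger_rollback(incident: dict) -> str:
    # Placeholder for asking the orchestration layer to roll back.
    return f"rollback to {incident['previous_version']} requested"


def remediate(incident: dict) -> dict:
    """Start documentation and remediation at the same time."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        ticket = pool.submit(create_ticket, incident)
        rollback = pool.submit(trigger_rollback, incident)
        # result() blocks until each task finishes and re-raises any error.
        return {"ticket": ticket.result(), "rollback": rollback.result()}


outcome = remediate({"problem_id": "P-123", "previous_version": "1.0"})
```

Neither task waits for the other, which mirrors the promise in this step: documentation never delays the fix, and the fix never delays documentation.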

xMatters creates a Jira Service Desk ticket & triggers rollback
  • Step 5 — xMatters Slackbot pulls the on-call database

Usually different services are owned by different teams and developers. Therefore, in the case of a release error, different teams must be involved in the resolution. xMatters makes this easy, too, by spinning up a dedicated Slack channel. Here you can select the relevant teams to invite to the channel. It also pushes all the incident information into this channel, so those who join get immediate context. As they collaborate in chat, your team can also use the xMatters bot to update the related service desk ticket. Once the issue is resolved, the full chat transcript from this channel will also be automatically added to the service desk ticket, giving you even more detail for your post-mortem exercise.
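
Choosing whom to invite boils down to a lookup from impacted services to their owners. Here is a small sketch, assuming a hypothetical `SERVICE_OWNERS` mapping with made-up service and team names; in practice this information would come from the on-call data in xMatters.

```python
# Hypothetical ownership table: service name -> owning team.
SERVICE_OWNERS = {
    "checkout": "payments-team",
    "cart": "payments-team",
    "search": "discovery-team",
}


def teams_to_invite(impacted_services: list[str]) -> list[str]:
    """Return each owning team once, in the order first encountered."""
    owners = (
        SERVICE_OWNERS[service]
        for service in impacted_services
        if service in SERVICE_OWNERS  # skip services with no known owner
    )
    # dict.fromkeys drops duplicates while keeping insertion order.
    return list(dict.fromkeys(owners))


invited = teams_to_invite(["checkout", "cart", "search", "unknown-service"])
```

In this example the payments team owns two of the impacted services but is invited only once.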

xMatters Slackbot pulls the on-call database
  • Step 6 — Flow Designer rolls back through keptn

Thanks to the xMatters and keptn integration, keptn rolls back to the previous, stable version in the green environment within moments of the xMatters remediation workflow starting. The issue is thus fixed just minutes after detection. With the incident resolved, the related chat channels and service desk issues are automatically closed.

Flow Designer rolls back through keptn
  • Step 7 — Reporting, cross-referencing similar incidents

Once the release is rolled back to the stable environment, your team might want to look at what happened and how to prevent similar incidents in the future.

Because xMatters automatically updated your incident management systems (from chat to service desk to monitoring) with the incident information and the steps taken to resolve it, your teams have all the documentation at hand during the post-mortem meeting. Furthermore, because xMatters retains historical incident data, you can quickly cross-reference similar incidents and identify recurring patterns to better prevent future incidents.
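
Cross-referencing can be as simple as counting how often the same combination of service and root cause shows up. The sketch below uses made-up records and field names; it only illustrates the idea and says nothing about how xMatters stores or queries its incident history.

```python
from collections import Counter

# Made-up historical records; real data would come from your incident history.
incidents = [
    {"service": "checkout", "root_cause": "bad deployment"},
    {"service": "checkout", "root_cause": "bad deployment"},
    {"service": "search", "root_cause": "disk full"},
    {"service": "checkout", "root_cause": "database timeout"},
]

# Count how often each (service, root cause) pair has occurred.
patterns = Counter((i["service"], i["root_cause"]) for i in incidents)

# Pairs seen more than once are candidates for a permanent fix.
recurring = [pair for pair, count in patterns.most_common() if count > 1]
```

Here the only recurring pattern is a bad deployment hitting the checkout service, which would be a natural topic for the post-mortem.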

And if this is not enough: xMatters’ user-level analytics show which team members broke their personal record for mean time to respond.

Wrap up

Driven by the need for faster time to market and tighter, faster feedback loops, businesses today are adopting agile development methods to deliver applications faster and react quickly to customers’ needs. The blue/green deployment method is a fail-safe process that lets developers roll back to the previous, stable version of any release. Today, when customers are more demanding and more impatient than ever, being able to do so within moments is crucial.

With xMatters, Dynatrace, and keptn, you can empower your team to execute a rollback with the push of a button, while automatically collecting critical incident information and bringing together all relevant resources to solve the issue. Thus, your team can safely, quickly and frequently release great new features.

Depending on the type of the issue, xMatters launches workflows across your systems to start the automated self-healing process.

To learn more about this, see Self-Healing DevOps Part III: Automated Blue/Green Deployment Remediation on the xMatters blog. For more information on Dynatrace and xMatters, please visit our technology partner page.