
Automate prioritization of quality improvements with Dynatrace SLO violation prediction and problem analysis

Does either of the following situations sound familiar to you? You have plans to enjoy an upcoming vacation, but production problems in your area of responsibility prevent you from taking time away. Or you’ve been asked to ensure the customer experience of a production system that you’re unfamiliar with, and you’re now responsible for a system whose underlying service structure you don’t know.

How can you know where best to invest time and money into quality improvements in such situations? Which known issues have the most significant impact on your customer experience and business success? Without automatic problem prioritization, you might easily misallocate your resources to low-priority issues.

This blog post explains how you can effectively use your business-critical metrics as service level objectives (SLOs). The problems that Dynatrace identifies in your systems are automatically linked to critical SLOs and their related error budget and burndown rate.

The error budget and burn rate provide site reliability engineers (SREs) with the information they need to take action before their end users are affected. Such information dramatically improves your chances of avoiding war-room meetings with other stakeholders.

Error budgets and burndown rates

Dynatrace alerts on high error budget burn rates, which predict when an error budget will be depleted if no action is taken. The analysis of detected problems includes root cause analysis so that you can remediate problems quickly and ensure that your SLO targets are met. Even while the SLO status of a critical metric is okay (displayed in green) or at a warning level (indicated in yellow), the error budget might be burning down quickly. In such situations, problems detected by Dynatrace Davis® AI show the identified root cause along with a call to action to mitigate the SLO violation before it affects your users or your error budget. This way, you can stop the consumption of your error budget and focus on the right problem at the right time.

Figure 1: The SLO dashboard tiles provide all the information you need: The red arrows show the error budget trend going down, and the red warning icons indicate that Davis AI has detected problems that impact these SLOs. Select these warning icons to view the related problem descriptions, which include root cause analysis and call-to-action details that you can use to fix SLO-impacting problems.

Error budgets as a tool for prioritizing investments

An error budget can be understood as a metric value that equates to an acceptable rate of technical errors that can occur in a system before the errors affect end user experience. In essence, error budgets tell you when investments into quality improvements are worth the effort.
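
As a rough illustration of the arithmetic behind this idea, the following sketch assumes a request-based SLO with a target success percentage. The function names and sample numbers are hypothetical and are not part of any Dynatrace API.

```python
def allowed_failures(target_percent: float, total_events: int) -> float:
    """Number of failed events the error budget tolerates in the SLO window."""
    return (100.0 - target_percent) / 100.0 * total_events


def remaining_budget_fraction(target_percent: float, total_events: int,
                              failed_events: int) -> float:
    """Share of the error budget still unspent (negative once it is overspent)."""
    budget = allowed_failures(target_percent, total_events)
    if budget == 0:
        # No budget at all (100% target or no traffic): any failure overspends it.
        return 1.0 if failed_events == 0 else float("-inf")
    return 1.0 - failed_events / budget


if __name__ == "__main__":
    # Hypothetical numbers: 99.5% target, 1,000,000 requests, 2,000 failures.
    print(allowed_failures(99.5, 1_000_000))
    print(remaining_budget_fraction(99.5, 1_000_000, 2_000))
```

A negative remaining fraction means the SLO target has been missed for the evaluated window; a value close to zero signals that further failures will soon breach it.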

To succeed, organizations must put their customers’ needs front and center. By utilizing an error budget, SREs can measure where customer satisfaction is at risk due to a high burn rate.

Dynatrace supports SREs in their need to:

  • see which SLO error budgets are burning down to effectively prioritize work on problems that impact those SLOs.
  • see the trend of SLO status/error budgets to predict if and when an SLO will exhaust its error budget.
  • receive alerts for high burn rates so that remediation efforts can be planned.
  • get support finding the root cause of a high error budget burn rate so that the problem can be fixed before the error budget is depleted.
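
The trend and alerting needs listed above can be sketched with the burn rate definition commonly used in SRE practice: the observed error rate divided by the error rate that the SLO target allows. Whether Dynatrace computes its burn rate in exactly this way is an assumption here, and the function names and the alert threshold are hypothetical.

```python
from typing import Optional


def burn_rate(observed_error_rate: float, target_percent: float) -> float:
    """Observed error rate divided by the error rate the SLO target allows."""
    allowed_error_rate = (100.0 - target_percent) / 100.0
    if allowed_error_rate <= 0:
        raise ValueError("a 100% target leaves no error budget to burn")
    return observed_error_rate / allowed_error_rate


def hours_until_depleted(remaining_fraction: float, rate: float,
                         window_hours: float) -> Optional[float]:
    """Projected hours until the budget reaches zero, or None if nothing burns.

    At a burn rate of 1, a full budget lasts exactly one SLO window.
    """
    if rate <= 0:
        return None
    return max(remaining_fraction, 0.0) * window_hours / rate


def needs_alert(rate: float, threshold: float = 2.0) -> bool:
    """True when the burn rate exceeds the (hypothetical) alert threshold."""
    return rate > threshold


if __name__ == "__main__":
    # Hypothetical SLO: 99% target, 4% of requests currently failing,
    # half of the budget left in a 30-day (720-hour) window.
    rate = burn_rate(0.04, 99.0)
    print(rate, hours_until_depleted(0.5, rate, 720.0), needs_alert(rate))
```

Projecting the remaining hours from the current burn rate is a simple linear extrapolation; it assumes that the error rate stays constant.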

Different approaches to investment prioritization

It’s vital to distinguish between reactive work that’s based on alerting (“fire fighting”) and proactive planning for investments into quality and automation. Both approaches require prioritization. The reactive approach aims for a low mean time to repair (MTTR, one of the DORA metrics) for newly discovered problems that impact SLOs and cause a high error budget burn rate. With this reactive, alerting-based approach, SREs must identify and implement solutions that mitigate depleting error budgets. Conversely, the proactive approach relies on intelligent prioritization of investments into quality improvements and automation to better ensure that SLOs are met. In many cases, multiple problems contribute to a depleted error budget, and SREs must manually investigate all related problems to prioritize investments into quality.

Dynatrace provides solutions for both the proactive approach and the reactive approach to prioritizing quality investments.

The reactive approach

In reactive, fire-fighting-style prioritization based on identified root causes, Dynatrace Davis® AI identifies problems and shows you the number of potentially impacted SLOs. You can link directly to the impacted SLOs from the problem page (Figure 2 below). This makes it easy to prioritize work on one problem over another. SREs can set up alerts for high error budget burn rates so that they can react quickly to impacted SLOs before those error budgets are depleted. Dynatrace Davis AI presents the root cause analysis for each detected problem so that SREs can define action items that reduce error budget burn rates and avoid SLO breaches with minimal mean time to repair.
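
To make the prioritization idea concrete, here is a minimal sketch that orders open problems by the number of SLOs they impact. The `Problem` class, its fields, and the sample data are hypothetical stand-ins for whatever problem records you export; they do not mirror a Dynatrace data model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Problem:
    """Hypothetical problem record with the names of the SLOs it impacts."""
    title: str
    impacted_slos: List[str] = field(default_factory=list)


def rank_by_slo_impact(problems: List[Problem]) -> List[Problem]:
    """Order problems so that those touching the most SLOs come first."""
    return sorted(problems, key=lambda p: len(p.impacted_slos), reverse=True)


if __name__ == "__main__":
    open_problems = [
        Problem("Slow disk on host A", ["checkout-latency"]),
        Problem("Failure rate increase", ["checkout-availability",
                                          "login-availability",
                                          "search-availability"]),
        Problem("Unused certificate expiring"),
    ]
    for problem in rank_by_slo_impact(open_problems):
        print(len(problem.impacted_slos), problem.title)
```

Counting impacted SLOs is only one possible ranking key; weighting each SLO by its business criticality or remaining error budget would be a natural refinement.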

Figure 2: The Problems view shows the count of affected SLOs and crosslinks to those SLOs.

The proactive approach

With the proactive approach to investment prioritization for quality improvements and automation, the SLO overview shows the error budget burn rates and depleted error budgets of all affected SLOs (Figure 3 below), along with a count of all problems that affect each SLO. With this information, SREs gain an overview of all problems that contribute to each depleted error budget. This insight is invaluable for planning future quality improvements and implementing automated problem remediation.
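
The per-SLO view described here amounts to inverting a problem-to-SLO relationship. The following sketch assumes you have such a mapping available as a plain dictionary; the function name and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def problems_per_slo(problem_to_slos: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Invert a problem -> SLOs mapping into an SLO -> problems mapping."""
    by_slo: Dict[str, List[str]] = defaultdict(list)
    for problem, slos in problem_to_slos.items():
        for slo in slos:
            by_slo[slo].append(problem)
    return dict(by_slo)


if __name__ == "__main__":
    history = {
        "P-1 failure rate increase": ["checkout-availability"],
        "P-2 database saturation": ["checkout-availability", "search-latency"],
        "P-3 memory leak": ["checkout-availability"],
    }
    for slo, problems in problems_per_slo(history).items():
        print(slo, len(problems), problems)
```

Grouping the problem history by SLO makes recurring contributors to a depleted error budget visible, which is the input needed for planning quality improvements.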

Figure 3: This SLO overview shows an error budget burn rate icon and enables you to create alerts for high error budget burn rates.

While the status of the first SLO shown in Figure 3 is still okay, its error budget is already being consumed. This is why SREs need to receive burn rate alerts before error budgets are depleted. The second SLO is already in a bad state, so in addition to fire fighting, investigating the seven related problems will help the SRE understand the history of previous problems and their common root causes. The SRE can then determine exactly where quality improvements or remediation automation is needed most.

What’s next?

Find out how easy it is to set up your first SLOs and then automate and scale SLO practices in your organization.

Explore how fully automated remediation of problems can help to keep your SLOs in good shape. Utilizing on-demand synthetic tests and release validation in Dynatrace provides you with continuous assurance of the status of your SLOs—all with a single solution.

Dynatrace is happy to provide you with a demo or proof of concept for Cloud Automation. We also offer a free Dynatrace trial if you want to get started directly!