Common SLO pitfalls and how to avoid them

Today, online services are expected to deliver near-100% uptime. This demand puts growing pressure on DevOps teams to maintain the performance and reliability of critical business applications. Defining service-level objectives (SLOs), along with service-level agreements (SLAs) and service-level indicators (SLIs), is a great way for teams to measure software performance and keep it within error budgets. But there are SLO pitfalls. When creating your SLOs, it’s important to avoid the common mistakes below, which can cause more headaches for your DevOps teams.

SLO pitfalls

Pitfall 1: SLOs not aligned with your business goals

One common pitfall is creating an SLO that is not aligned with your business goals or a service-level agreement (SLA). This can create an unnecessary distraction and steal time away from critical tasks. For example, suppose the IT team of a bank sets an SLO of 99.9% service availability with <50ms latency over a trailing 30-day period for an application that has no revenue impact. Setting a stringent SLO for an application that’s not business-critical can lead to wasted time and resources when it comes to remediating issues or performing tasks to ensure uptime.

If an SLO is not tied back to a key business objective or an external SLA, it is best to reconsider or recalibrate the objective. The best investment is in managing SLOs for customer-facing, revenue-generating, high-visibility applications. For example, repeated violations of the availability SLO for the bank’s check deposit application would frustrate customers and could affect revenue.

Pitfall 2: SLOs with no ownership or accountability

When an SLO is violated, who do you call? Who owns it? SLOs created by upper management without buy-in from relevant development, operations, and SRE stakeholders can lead to finger-pointing, blaming, and chaotic war rooms when violations occur. A broken SLO with no owner can take longer to remediate and is more likely to recur than an SLO with an owner and a well-defined remediation process.

To avoid orphaned SLOs, ensure that key stakeholders collaborate closely when an SLO is created and that every SLO is vetted, viable, and agreed upon. Establish the relevant service-level indicators (SLIs) that need to be monitored, the process for remediating any issues, the tools required, and the timeframes for resolution. You should discuss and agree upon all of these points before your team adopts an SLO.
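One way to keep those agreements from getting lost is to store them with the SLO itself. The following is a minimal sketch, not a schema from any particular tool: the `SloRecord` class, its field names, and the `unresolved_items` helper are all hypothetical, chosen only to mirror the items listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloRecord:
    """One SLO plus the agreements made when it was created (hypothetical fields)."""
    name: str
    sli: str                  # indicator that is monitored
    target_percent: float     # for example 99.9
    owner: str                # team that is called when the SLO is violated
    remediation_runbook: str  # where the agreed remediation process lives
    resolution_hours: int     # agreed timeframe for resolution


def unresolved_items(slo: SloRecord) -> list[str]:
    """Return the agreement fields that are still blank or invalid."""
    problems = []
    for field_name in ("sli", "owner", "remediation_runbook"):
        if not getattr(slo, field_name).strip():
            problems.append(field_name)
    if not 0 < slo.target_percent < 100:
        problems.append("target_percent")
    if slo.resolution_hours <= 0:
        problems.append("resolution_hours")
    return problems
```

A team could run a check like this in review and refuse to adopt any SLO for which the returned list is not empty, which makes an orphaned SLO visible before it is violated.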

Pitfall 3: Using SLOs reactively vs. proactively

Commonly, teams create SLOs simply because others in the industry are doing so, or because SLOs are considered a best practice. But many fail to understand the business objective each SLO is tied to. In these organizations, IT teams may not pay attention to the SLOs until violations happen, after which individual owners scramble to resolve them. This is reactive in nature and erodes the value SLOs bring to an organization in maintaining the health, reliability, and resiliency of an application. Being reactive also does nothing to prevent similar violations from recurring, and it takes critical time away from your developers.

To avoid this, start the SLO discussion early in the design process. Push for SLO evaluation to be incorporated into the CI/CD pipeline and not just in production. Ensure error budgets are set up and tracked with alerting and root cause analysis, so development teams can understand and triage issues before they become problems and cause violations.
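Tracking an error budget comes down to simple arithmetic on good and bad events. The sketch below is illustrative only: the function names, the event-count inputs, and the 25% warning level are assumptions, not values taken from any monitoring product.

```python
def error_budget_remaining(target: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent.

    target is the SLO as a fraction, for example 0.999 for 99.9%.
    The result drops below zero once the SLO itself is violated.
    """
    if not 0 < target < 1:
        raise ValueError("target must be strictly between 0 and 1")
    if total_events <= 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_bad = (1 - target) * total_events
    return 1 - bad_events / allowed_bad


def should_alert(remaining: float, warn_below: float = 0.25) -> bool:
    """Alert while some budget is left, rather than after it is gone.

    warn_below is a hypothetical warning level (25% of the budget).
    """
    return remaining < warn_below
```

With a 99.9% target and 1,000,000 requests, 1,000 failed requests are allowed. After 400 failures, roughly 60% of the budget remains and no alert fires; after 900 failures, roughly 10% remains and the alert fires before the SLO is actually violated.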

Pitfall 4: SLO thresholds that are too high or too low

One of the most common SLO pitfalls is overpromising by setting SLO targets too high or underdelivering by setting SLO targets too low. SLOs are important for evaluating how successful your team is at delivering what has been agreed upon, either in the customer-facing SLA or the internally agreed-upon business objective. If you set SLOs so they are in constant violation or in constant compliance, then they become meaningless and do not help you understand the health of your application.

Let’s take service availability for example. According to Google G-Suite researchers, a good availability metric should be meaningful (captures user experience), proportional (change in the metric should be proportional to the change in user-perceived availability), and actionable (insight into why the metric is low or high).

A good rule of thumb is this: your success in meeting SLOs should correlate with customer and user experience, and violations should represent deteriorating service. For example, setting an SLO with a service availability of 89% can be problematic, because the 11% of downtime it permits can affect a significant set of users. Meanwhile, DevOps teams would not receive any alerts or worry about customer impact, because the service would still be within its SLO threshold.
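To see what an availability target permits in practice, it helps to convert the percentage into time. This is a small sketch under an assumption the text does not state: a 30-day evaluation window. The function name is illustrative.

```python
def allowed_downtime_hours(availability_percent: float, window_days: int = 30) -> float:
    """Hours of downtime an availability target tolerates over the window."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    window_hours = window_days * 24
    return (100 - availability_percent) / 100 * window_hours


# Compare the lenient target from the example with a stricter one.
for target in (89.0, 99.9):
    hours = allowed_downtime_hours(target)
    print(f"{target}% over 30 days allows about {hours:.2f} hours of downtime")
```

Under that assumption, an 89% target tolerates about 79 hours of downtime in a 30-day window (more than three full days), while a 99.9% target tolerates about 43 minutes.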

To set meaningful thresholds, work with your relevant stakeholders to establish SLOs that are achievable but that also matter for the user experience. Review them with their owners to calibrate the SLIs that best capture the specific use case. Tailoring SLOs in this way ensures that the resources you spend on meeting SLOs are used efficiently, drive customer value, and help developers improve their QA and resolution processes.

Pitfall 5: Manual evaluation of SLOs through dashboards and spreadsheets

Developing dashboards and spreadsheets to track SLO performance can be extremely useful for organizing and visualizing your SLOs and SLIs. However, another of the common SLO pitfalls is that many organizations assemble these metrics manually using disparate tools, which can take time from innovation. Simply performing eyeball analytics by looking at multiple dashboards slows down the quality evaluation process and introduces a higher risk of failures.

Continuous and automated release validation is the answer. The ability to automatically evaluate test results, leverage key SLIs from your monitoring tools, and calculate quality scores that can automate the go/no-go decision at every stage of the lifecycle is critical in reducing human error and scaling the QA process. The power to automatically stop bad code in its tracks through an intelligent, data-driven approach is significant for development teams that are constantly constrained by manual processes, yet asked to deliver higher quality software at speed.
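The idea of turning SLI results into a quality score and a go/no-go decision can be sketched in a few lines. This is a simplified illustration, not how Dynatrace or any other platform computes its scores: the `SliCheck` class, the weighting scheme, and the 90-point pass score are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SliCheck:
    """One SLI measurement compared against its agreed threshold."""
    name: str
    value: float
    threshold: float
    higher_is_better: bool = True  # False for metrics such as latency
    weight: float = 1.0

    def passed(self) -> bool:
        if self.higher_is_better:
            return self.value >= self.threshold
        return self.value <= self.threshold


def quality_score(checks: list[SliCheck]) -> float:
    """Weighted share of checks that passed, from 0 to 100."""
    total = sum(c.weight for c in checks)
    if total <= 0:
        raise ValueError("at least one check with positive weight is required")
    earned = sum(c.weight for c in checks if c.passed())
    return 100 * earned / total


def release_decision(checks: list[SliCheck], pass_score: float = 90.0) -> str:
    """Return "go" when the score reaches pass_score (a hypothetical cutoff)."""
    return "go" if quality_score(checks) >= pass_score else "no-go"
```

For example, a build that meets its availability threshold but measures 62 ms against a 50 ms latency threshold scores 50 and is stopped, without anyone having to compare dashboards by eye. A pipeline stage would call `release_decision` with values pulled from the monitoring tool and fail the stage on "no-go".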

An automatic and intelligent approach to creating and monitoring SLOs

Avoiding SLO pitfalls and meeting the challenges of creating SLOs can be frustrating, especially with today’s complex IT processes. However, with adequate planning and high collaboration between Biz, Dev, Ops, and Security teams, stakeholders can be better prepared to establish SLOs that ensure you’re delivering software that’s reliable, resilient, and meets customer expectations.

An observability platform like Dynatrace provides all the SLIs you need to build and calibrate effective SLOs. Dynatrace has SLOs natively built into the platform and can automate the evaluation process to enable continuous release validation. Leveraging a platform like Dynatrace is a great boon for modern IT teams that are resource-constrained but looking to be nimble and agile. When implemented successfully, SLOs can provide numerous benefits to your business, including reducing expensive and time-consuming service outages, eliminating silos, and increasing collaboration. Explore how Dynatrace can help you create and monitor SLOs, track error budgets, and predict violations/problems before they occur, so teams can prevent failures and downtime.

To get started with SLOs in Dynatrace, download the free trial today.

