What are SLOs? Here's a guide to service-level objectives, how they work, and how they help DevOps teams automate and deliver better software.
As organizations adopt microservices-based architecture, service-level objectives (SLOs) have become a vital way for teams to set specific, measurable targets that ensure users are receiving agreed-upon service levels. SLOs, together with service-level indicators (SLIs), deliver the performance promised in service-level agreements (SLAs) and other business-level objectives (BLOs) while staying within error budgets.
But what are SLOs? And why have SLOs and SLIs become so important as teams automate processes to consistently meet SLAs and error budgets? To get a better handle on this, let’s start with some definitions.
What are SLAs, SLOs, SLIs, and error budgets?
SLOs are best understood as part of a framework for tracking service levels that also includes service-level agreements (SLAs), service-level indicators (SLIs), and error budgets.
What are SLAs?
SLAs, or service-level agreements, are contracts signed between a vendor and a customer that guarantee a certain measurable level of service. They are often drawn up with specific financial consequences if the vendor fails to provide the guaranteed service. SLAs are usually composed of many individual SLOs to help formalize the details of what is being promised. For example, an SLA between a web hosting provider and a customer can guarantee 99.95% uptime for all of a company's web services over a year.
What are SLOs?
As defined by Gartner, service-level objectives are an agreed-upon target within an SLA that must be achieved for each activity, function, and process to provide the best opportunity for customer success. In layman’s terms, SLOs represent the performance or health of a service. These can include business metrics, such as conversion rates, uptime, and availability; service metrics, such as application performance; or technical metrics, such as dependencies on third-party services, underlying CPU, and the cost of running a service. For example, if the SLA for a website is 99.95% uptime, its corresponding SLO could be 99.95% availability of the login services. Organizations commonly use SLOs in production environments to ensure released code stays within error budgets.
What are error budgets?
Error budgets are an allowance for a certain amount of failure or technical debt within an SLO. For example, if your SLO guarantees 99.5% availability of a website over a year, your error budget is 0.5%. Error budgets allow development teams to make informed decisions about how to balance new development against operations and polishing existing software. Properly set and defined SLOs should have error budgets that give developers space to innovate without impacting operations.
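To make the arithmetic concrete, here is a minimal Python sketch that derives an error budget from an SLO target. The function name `error_budget` and the 365-day year are assumptions made for illustration; they are not part of any particular tool.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day measurement window


def error_budget(slo_target_pct: float, window_minutes: float) -> tuple[float, float]:
    """Return the error budget as (percentage, minutes of allowed failure)."""
    if not 0 < slo_target_pct <= 100:
        raise ValueError("SLO target must be greater than 0 and at most 100")
    # The budget is whatever the SLO target leaves over from 100%.
    budget_pct = 100.0 - slo_target_pct
    budget_minutes = window_minutes * budget_pct / 100.0
    return budget_pct, budget_minutes


budget_pct, budget_minutes = error_budget(99.5, MINUTES_PER_YEAR)
print(f"{budget_pct:.2f}% of the year, or {budget_minutes:.0f} minutes")
```

For a 99.5% target, the budget is 100% minus 99.5%, or 0.5%, which over a 365-day year is 2,628 minutes (about 43.8 hours) of allowed unavailability.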
What are SLIs?
SLIs provide the actual metrics and measurements that indicate whether you are meeting your SLO. Most SLIs are measured in percentages to express the service level delivered. For example, if your SLO is to deliver 99.5% availability, the actual measurement may be 99.8%, which means you’re meeting your agreements and you have happy customers. To gain an understanding of long-term trends, you can visually represent SLIs in a histogram that shows actual performance in the overall context of your SLOs.
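As a rough illustration of how an SLI is compared with an SLO, the sketch below computes an availability SLI as the percentage of good events out of all events. The function names and the request counts are hypothetical and chosen only to reproduce the 99.8% versus 99.5% example.

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """Availability SLI: percentage of events that were served successfully."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return 100.0 * good_events / total_events


def meets_slo(sli_pct: float, slo_target_pct: float) -> bool:
    """An SLO is met when the measured SLI is at or above the target."""
    return sli_pct >= slo_target_pct


# Hypothetical counts: 99,800 successful requests out of 100,000.
sli = availability_sli(good_events=99_800, total_events=100_000)
print(f"SLI: {sli:.1f}%, SLO met: {meets_slo(sli, 99.5)}")
```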
To learn more about how Dynatrace does SLOs, check out the on-demand performance clinic, Getting started with SLOs in Dynatrace.
Why are SLOs important?
In short, service-level objectives ensure reliability. Generally, SLOs are important because they:
- Improve software quality. SLOs help teams define an acceptable level of downtime for a service or a particular issue. SLOs can shine light on issues that fall short of a full-blown incident, but also don’t fully meet expectations. Achieving 100% reliability isn’t always realistic, so using SLOs can help you figure out the balance between innovating (which could result in downtime) and delivering (which ensures users are happy).
- Help with decision making. SLOs can be a great way for DevOps and infrastructure teams to use data and performance expectations to make decisions, such as whether to release, and where engineers should focus their time.
- Promote automation. Stable, well-calibrated SLOs pave the way for teams to automate more processes and testing throughout the software delivery life cycle (SDLC). With reliable SLOs, you can set up automation to monitor and measure SLIs and set alerts if certain indicators are trending toward violation. This consistency enables teams to calibrate performance during development and detect issues before SLOs are actually violated.
- Avoid downtime. Software inevitably breaks. SLOs allow DevOps teams to predict problems before they occur, and especially before they impact customers. By shifting production-level SLOs left into development, you can design apps to meet production SLOs and increase resilience and reliability long before there is actual downtime. This trains your teams to be proactive in maintaining software quality and saves you money by avoiding downtime.
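One common way to detect that an indicator is trending toward violation is to track how fast the error budget is being consumed, often called the burn rate. The sketch below is an assumption about how such a check could look, not a description of any specific product; the function names and the alert threshold of 1.0 are illustrative.

```python
def burn_rate(bad_events: int, total_events: int, slo_target_pct: float) -> float:
    """Ratio of the observed error rate to the error rate the SLO allows.

    A value above 1.0 means the error budget would run out before the
    SLO window ends if the current error rate continued.
    """
    if not 0 < slo_target_pct < 100:
        raise ValueError("SLO target must be strictly between 0 and 100")
    if total_events == 0:
        return 0.0  # no traffic, so no budget was consumed
    allowed_error_ratio = (100.0 - slo_target_pct) / 100.0
    observed_error_ratio = bad_events / total_events
    return observed_error_ratio / allowed_error_ratio


def should_alert(rate: float, threshold: float = 1.0) -> bool:
    """Alert when the budget is burning faster than the chosen threshold."""
    return rate > threshold


# Hypothetical sample: 120 failed requests out of 10,000 against a 99.5% SLO.
rate = burn_rate(bad_events=120, total_events=10_000, slo_target_pct=99.5)
print(f"burn rate: {rate:.1f}, alert: {should_alert(rate)}")
```

In this sample the observed error rate is 1.2% against an allowed 0.5%, so the budget is burning about 2.4 times faster than the SLO permits and the check raises an alert well before the SLO itself is violated.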
How SLOs work
Cloud-native software and its supporting tools and infrastructure generate a diversity of metrics and data points every second that indicate a system’s state and performance. Service-level objectives define or support a set of higher-level business goals, which you can measure by leveraging the data and insights from observability tools.
The goal of SLOs is to deliver more reliable, resilient, and responsive services that meet or exceed user expectations. Reliability and responsiveness are often measured in nines on the way to 100%. For example, an objective for system availability can be:
- 90% – one 9
- 99% – two 9s
- 99.9% – three 9s
- 99.99% – four 9s
- 99.999% – five 9s
Each additional nine usually involves greater cost and complexity to achieve. Users may require a certain level of responsiveness, beyond which they can no longer detect a difference. Setting SLOs is part science and part art, striking a balance between statistical perfection and realistic goals.
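To show what each level of nines means in practice, this short sketch converts an availability target into the downtime it allows per year. It assumes a 365-day year, and the function name is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year


def allowed_downtime_minutes(availability_pct: float) -> float:
    """Minutes of downtime per year that an availability target permits."""
    return MINUTES_PER_YEAR * (100.0 - availability_pct) / 100.0


for target in (90.0, 99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% allows about {minutes:,.1f} minutes of downtime per year")
```

The allowance shrinks by a factor of ten with each nine, from roughly 36.5 days per year at 90% to a little over five minutes per year at 99.999%.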
You can set SLOs based on individual indicators, such as batch throughput, request latency, and failures per second. You can also create SLOs based on aggregate indicators, for example, the application performance index (Apdex), an industry standard that scores user satisfaction by comparing measured response times against a target threshold.
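As an example of an aggregate indicator, the sketch below computes an Apdex score using the standard formula: samples at or below a threshold T count as satisfied, samples between T and 4T count as tolerating and are weighted by half, and anything slower counts as frustrated. The sample response times and the 0.5-second threshold are made up for illustration.

```python
def apdex(response_times_s: list[float], threshold_s: float) -> float:
    """Apdex score in [0, 1] for response times measured in seconds."""
    if not response_times_s:
        raise ValueError("at least one sample is required")
    satisfied = sum(1 for t in response_times_s if t <= threshold_s)
    tolerating = sum(
        1 for t in response_times_s if threshold_s < t <= 4 * threshold_s
    )
    # Frustrated samples (slower than 4T) contribute nothing to the score.
    return (satisfied + tolerating / 2) / len(response_times_s)


# Hypothetical samples: three satisfied, two tolerating, one frustrated.
samples = [0.2, 0.4, 0.5, 0.9, 1.8, 2.5]
score = apdex(samples, threshold_s=0.5)
print(f"Apdex: {score:.2f}")
```

An SLO on this indicator could then be phrased as a minimum score, for instance keeping Apdex above a chosen value over a rolling window.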
Gathering and analyzing metrics over time will help you determine the overall effectiveness of your SLOs so you can tune them as your processes mature and improve. These trends also help you adjust business objectives and SLAs.
SLO best practices
Service-level objectives define what good service means over a specific duration of time based on the measurements of SLIs. Here are some best practices to help you achieve the goals set out in your SLOs:
- Less is more. It is important to define SLOs that support the SLA or business objective. Defining too many SLOs that don’t support a broader goal means extra work without any meaningful output.
- Don’t over-promise and underdeliver SLO targets. An SLO should accurately represent service health and performance. If you intentionally set low SLO targets to avoid violations, you will not be able to make informed product decisions, because the SLOs will not provide an accurate picture of how resources and time should be spent. Similarly, setting unrealistically high SLO targets will increase the cost and effort required for minimal incremental gains.
- Get business alignment. To ensure technical teams and business stakeholders are working toward the same expectations, they should not only agree on the SLO targets but also ensure the right people understand the SLOs. If engineers cannot deliver on the SLO targets, the organization risks failing to comply with its SLAs to customers.
- Prioritize SLOs for certain customers. Paying customers with stringent availability requirements may require a higher SLO baseline than freemium users, so prioritize accordingly to make the best use of resources.
- Be adaptable. SLOs are living, breathing commitments, and you need to adjust them at times to fit your teams’ and customers’ needs. If a team is growing faster than your processes can keep up with, it may be time to adjust your SLOs. A user base that has grown exponentially can also warrant an adjustment to your target SLOs.
- Automate SLO evaluation. Dashboards and manual metric collection sheets make remediation processes slow and don’t provide root cause analysis. Ensure your solution not only collects relevant SLIs and evaluates SLOs automatically, but also takes it one step further by automatically alerting you before an SLO is violated and providing all the context you need to address an issue before it becomes a problem.
- Use SLOs beyond production, across the full SDLC. Leveraging SLOs for production workloads is merely the first step. To truly take advantage of SLOs, incorporate them across the delivery pipeline in areas such as release decision making, automating blue-green or canary deployments, rollbacks or remediation, software quality evaluation, chaos testing, ChatOps, and so on.
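As one possible illustration of using SLOs for release decisions, the sketch below blocks a release when too little error budget remains. The function names and the 25% minimum are assumptions chosen for the example; a real pipeline would pull the SLI from its monitoring tool and pick a threshold that suits the team.

```python
def remaining_budget_fraction(sli_pct: float, slo_target_pct: float) -> float:
    """Fraction of the error budget still unspent, clamped to [0, 1]."""
    budget = 100.0 - slo_target_pct
    if budget <= 0:
        raise ValueError("an SLO target of 100% leaves no error budget")
    spent = max(0.0, 100.0 - sli_pct)
    return max(0.0, 1.0 - spent / budget)


def release_allowed(
    sli_pct: float, slo_target_pct: float, min_remaining: float = 0.25
) -> bool:
    """Gate a release on having at least min_remaining of the budget left."""
    return remaining_budget_fraction(sli_pct, slo_target_pct) >= min_remaining


# Hypothetical measurements against a 99.5% SLO.
print(release_allowed(sli_pct=99.8, slo_target_pct=99.5))   # plenty of budget left
print(release_allowed(sli_pct=99.55, slo_target_pct=99.5))  # budget nearly spent
```

With a 99.5% target, a measured 99.8% leaves about 60% of the budget and passes the gate, while 99.55% leaves about 10% and fails it.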
It’s important to consider SLOs as an ongoing process and commitment to deliver optimal performance. IT workloads and end-user expectations are continually changing. An SLO designed for the workload requirements right now may not be equally valid for future performance requirements.
Keep SLOs simple, few, and realistic. Avoid absolute targets that are unachievable. You can also set an internal SLO that is stricter than the target agreed with end users, so that it acts as a safety margin for delivering on the agreed target.
Easily create and manage SLOs with Dynatrace
As more organizations adopt microservices, creating measurable SLOs is becoming more important to consistently deliver reliable, resilient, and responsive software that meets agreed-upon service levels. SLOs also help teams assess release risk and make decisions.
Microservices architecture means there are far more apps, tools, and cloud-based infrastructure components that influence an application’s performance and availability. This makes developing effective SLOs more challenging. Dynatrace makes it easy to create and manage SLOs with out-of-the-box SLO templates and guidance for setting up SLOs with the right metrics, combined with automatic, AI-powered analytics and root-cause problem detection.
SLOs also set the stage for automating processes so you can speed up issue discovery and remediation before customers are impacted.
Ready for a deeper look at how to use SLOs for automation? Join us for the on-demand performance clinic, Automating SLOs as code–from Ops to Dev with Dynatrace.