
Site reliability done right: 5 SRE best practices that deliver on business objectives

Site reliability engineering has emerged as a critical discipline for organizations seeking the benefits of digital transformation.

Keeping pace with modern digital transformation requires ensuring that applications are responsive, resilient, and always available amid increased complexity. As a result, site reliability has emerged as a critical success metric for many organizations.

Site reliability engineering (SRE) has become a critical discipline in recent years as the world has shifted in favor of web-based interactions. Mobile retail e-commerce spending in the U.S. surpassed $387 billion in 2022, more than double the figure of three years earlier. The volume of travel spending booked online is expected to reach nearly $1.5 trillion by 2027, up from $800 billion in 2021. With so many of their transactions occurring online, customers are becoming more demanding, expecting websites and applications to always perform perfectly. One recent report found that 32% of customers would leave their favorite brand after just one bad experience. Website load times have also been found to correlate directly with conversion rates.

This shift is leading more organizations to hire site reliability engineers to guarantee the reliability and resiliency of their services. But the transition to SRE maturity is not always easy.

How site reliability engineering affects organizations’ bottom line

SRE applies the disciplines of software engineering to infrastructure management, both on-premises and in the cloud. The practice uses continuous monitoring and high levels of automation in close collaboration with agile development teams to ensure applications are highly available and perform without friction.

As more interactions move to the web, IT infrastructure performance has a dramatic impact on organizations’ bottom-line business goals. Uptime Institute’s 2022 Outage Analysis report found that over 60% of system outages resulted in at least $100,000 in total losses, up from 39% in 2019. More than one in seven outages cost more than $1 million.

Maintaining reliable uptime and consistent service quality has become more complex as organizations expand their computing footprints across multiple data centers and in the cloud. Microservices-based architectures and software containers enable organizations to deploy and modify applications with unprecedented speed. However, cloud complexity has made software delivery challenging. There are now many more applications, tools, and infrastructure variables that impact an application’s performance and availability.

Understanding the interactions between these factors heavily influences decisions about whether and when to promote a new release into production. That’s why good communication between SREs and DevOps teams is important. By automating and accelerating the service-level objective (SLO) validation process and quickly reacting to regressions in service-level indicators (SLIs), SREs can speed up software delivery and innovation.

Understanding the goal of “five-nines” availability

The guiding principle of SRE has long been “five-nines” availability, meaning systems are operative 99.999% of the time. As organizations distribute workloads among a greater number of cloud environments, that goal has become harder to attain because more variables are involved in the computing equation. The growing amount of data processed at the network edge, where failures are more difficult to prevent, magnifies complexity.
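To make the five-nines target concrete, the following minimal sketch converts an availability percentage into a yearly downtime budget. The function name and the 365-day year are assumptions made for illustration only.

```python
# Downtime budget implied by an availability target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_pct, period_minutes=MINUTES_PER_YEAR):
    """Return the minutes of downtime permitted by an availability target."""
    return period_minutes * (1 - availability_pct / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% availability -> {minutes:.2f} minutes of downtime per year")
```

Under these assumptions, five-nines leaves roughly 5.26 minutes of downtime per year, compared with about 525.6 minutes at 99.9%.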

Visibility and automation are two of the most important SRE tools. The Dynatrace 2022 Global CIO Report found that 71% of top IT executives say the explosion of data produced by cloud-native technology stacks is beyond human ability to manage, and more than three-quarters say their IT environment changes once every minute or less. The takeaway is clear: IT environments are now too complex to manage without automation and AI. Without these capabilities, achieving five-nines availability will become close to impossible.

Aligning site reliability goals with business objectives

Because of this, SRE best practices align objectives with business outcomes. The following three metrics are commonly used to measure success:

  • Service-level agreements (SLAs). These metrics are the product of an agreement between the service provider and customer that certain measurable levels of service will be delivered.
  • Service-level objectives (SLOs). These metrics are the factors and service levels that must be achieved for each activity, function, and process to deliver on the SLA. These can include business metrics, such as conversion rates, as well as technical measures like underlying CPU availability. They’re typically expressed as percentages, such as 99.5% availability.
  • Service-level indicators (SLIs). At the lowest level, SLIs provide a view of service availability, latency, performance, and capacity across systems.
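One way to picture how these metrics relate is to compute an SLI from raw measurements and compare it with an SLO target. The sketch below is illustrative only: the `Slo` class, the function names, and the request counts are hypothetical, and the 99.5% target reuses the example figure above.

```python
from dataclasses import dataclass


@dataclass
class Slo:
    name: str
    target_pct: float  # e.g. 99.5 means 99.5% of requests must succeed


def availability_sli(good_requests, total_requests):
    """SLI: percentage of requests that were served successfully."""
    if total_requests == 0:
        return 100.0  # no traffic in the window, so nothing failed
    return 100.0 * good_requests / total_requests


def meets_slo(sli_pct, slo):
    """True when the measured SLI is at or above the SLO target."""
    return sli_pct >= slo.target_pct


checkout_slo = Slo(name="checkout availability", target_pct=99.5)
measured = availability_sli(good_requests=99_620, total_requests=100_000)
print(f"{checkout_slo.name}: SLI {measured:.2f}%, SLO met: {meets_slo(measured, checkout_slo)}")
```

The SLA sits above this check: it is the contractual commitment that one or more SLOs like this one are meant to protect.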

5 SRE best practices

Let’s break down SRE best practices into the following five major steps:

1. Start looking for signals

Begin by monitoring the “four golden signals” that were originally outlined in Google’s SRE handbook:

  • Latency: the time it takes to serve a request
  • Traffic: the total number of requests across the network
  • Errors: the number of requests that fail
  • Saturation: the load on the network and servers
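As a rough illustration, the four signals can be summarized from a window of request records. This sketch makes several assumptions that are not in the handbook's definition: errors are counted as HTTP 5xx responses, latency is reported as a nearest-rank 95th percentile, and saturation is passed in as a utilization fraction measured elsewhere. All names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # time taken to serve the request
    status: int         # HTTP status code returned


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(requests, window_seconds, utilization):
    """Summarize the four golden signals for one observation window."""
    failed = sum(1 for r in requests if r.status >= 500)
    durations = [r.duration_ms for r in requests]
    return {
        "latency_p95_ms": percentile(durations, 95) if durations else None,
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": failed / len(requests) if requests else 0.0,
        "saturation": utilization,
    }


sample = [Request(100, 200), Request(200, 200), Request(300, 500), Request(400, 200)]
print(golden_signals(sample, window_seconds=2, utilization=0.5))
```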

2. Identify KPIs

Next, create a list of the key performance indicators (KPIs) that are important to the business. These may include technical metrics such as response times to search query referrals, page load times, and error message frequency. They may also include business metrics influenced by performance, such as shopping cart abandonment and page views per customer.

3. Establish SLOs

Drawing on the KPIs, you can now create a list of SLOs. Remember that less is more. SLOs should directly relate to the SLA or business objective. Defining too many SLOs creates more work without a clear business impact.

Make SLOs realistic. If they’re intentionally set low to avoid SLA violations, they won’t provide an accurate picture of how systems are impacting user experience. If set too high, they can drive higher costs and effort for little incremental gain.
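The cost of an overly ambitious target can be seen by translating an SLO into an error budget, a common SRE way of expressing how many failures a target tolerates. The sketch below assumes a request-based SLO and a hypothetical volume of one million requests per period.

```python
def error_budget(slo_target_pct, total_requests):
    """Number of requests that may fail while the SLO is still met."""
    # round() guards against floating-point results such as 999.9999...
    return round(total_requests * (100 - slo_target_pct) / 100)


def budget_remaining(slo_target_pct, total_requests, failed_requests):
    """Failures still tolerated in this period (negative means the SLO is missed)."""
    return error_budget(slo_target_pct, total_requests) - failed_requests


for target in (99.5, 99.9, 99.99):
    print(f"SLO {target}% -> {error_budget(target, 1_000_000)} failed requests allowed")
```

Each step up in the target shrinks the tolerated failures sharply (5,000, then 1,000, then 100 in this example), which is why a higher SLO tends to demand more engineering effort for a smaller visible gain.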

4. Identify key stakeholders

Once you’ve ensured SLAs and SLOs are realistic, begin recruiting stakeholders. For example, a panel of customers may occasionally provide feedback on service quality and performance. Business stakeholders can understand how SLOs relate to business results using historical data trends. Everyone needs to agree on SLO targets; otherwise, the organization will risk failing to deliver on its SLAs.

5. Automate workflows wherever possible

Automation is critical to achieving the business agility that digital transformation demands. A powerful observability solution collects relevant SLIs and evaluates SLOs automatically. It can also automatically generate alerts before an SLO violation occurs and even automatically repair many problems. Automation also enables tools to move into developers’ hands so they can make decisions about deploying code without needing to involve operations teams.
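Alerting before a violation is often implemented by tracking how fast failures are accumulating relative to what the SLO allows, sometimes called the burn rate. The following is a minimal sketch of that idea, not the behavior of any specific observability product; the function names and the alert threshold of 2.0 are assumptions chosen for illustration.

```python
def burn_rate(failed, total, slo_target_pct):
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means failures arrive exactly as fast as the SLO permits;
    values above 1.0 mean the SLO will be violated if the trend continues.
    """
    allowed_error_rate = (100 - slo_target_pct) / 100
    observed_error_rate = failed / total if total else 0.0
    return observed_error_rate / allowed_error_rate


def should_alert(failed, total, slo_target_pct, threshold=2.0):
    """Alert when failures accumulate at `threshold` times the allowed rate."""
    return burn_rate(failed, total, slo_target_pct) >= threshold


# 300 failures in 10,000 requests against a 99.5% SLO burns budget about 6x too fast.
print(should_alert(failed=300, total=10_000, slo_target_pct=99.5))
```

In practice the threshold and the observation window would be tuned so that alerts fire early enough to act on, but not on every brief spike.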

Automating SRE best practices with Site Reliability Guardian

As part of the early 2023 rollout of its new AppEngine low-code toolset for creating custom, compliant, and intelligent data-driven applications, Dynatrace introduced Site Reliability Guardian (SRG) for automated change impact analysis.

AppEngine is a unique solution that consolidates observability, security, and business data with full context and dependency mapping to simplify the creation of intelligent applications and integrations. It enables teams, for the first time, to leverage causal AI for customized applications that address specific use case requirements.

SRG enriches the Dynatrace platform’s value for SREs by automating change impact analysis. It detects regressions and deviations from previously observed behavior across metrics such as latency, traffic, error rates, saturation, security coverage, vulnerability risk levels, and memory consumption.

SRG also enriches the Dynatrace platform’s value to DevOps teams, who can automate release validation in pre-production environments to ensure only high-quality, highly secure software moves to production. Should the SRG detect any violations of key objectives or metrics, the CI/CD pipeline tool can halt the delivery of the artifact.
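The general shape of such a release gate can be sketched independently of any product. The example below is a generic illustration and does not use or represent the SRG API; the `Objective` class, metric names, and thresholds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Objective:
    metric: str
    max_allowed: float  # the release fails if the measured value exceeds this


def validate_release(metrics, objectives):
    """Return a list of violation messages; an empty list means the gate passes."""
    violations = []
    for objective in objectives:
        value = metrics.get(objective.metric)
        if value is None:
            # Missing data is treated as a failure rather than silently passing.
            violations.append(f"{objective.metric}: no data")
        elif value > objective.max_allowed:
            violations.append(f"{objective.metric}: {value} exceeds {objective.max_allowed}")
    return violations


objectives = [Objective("latency_p95_ms", 300.0), Objective("error_rate", 0.01)]
candidate = {"latency_p95_ms": 340.0, "error_rate": 0.004}

violations = validate_release(candidate, objectives)
release_allowed = not violations
print("release allowed" if release_allowed else f"release halted: {violations}")
```

A pipeline step could run a check like this after deploying to a pre-production environment and return a non-zero exit code when `violations` is non-empty, which most CI/CD tools interpret as a signal to stop the delivery.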

For each critical service or application, SRG can automatically monitor for golden signals, validate SLOs, and probe for security vulnerabilities before and after deployment or configuration changes. It can also send targeted notifications to the service and application owners as needed. The result is safer, more secure releases for DevOps teams and less overhead for SREs.

Keeping the focus on site reliability

SRE is a critical discipline for organizations to master as they navigate the transition to digital business powered by data-driven decisions. Tight integration with business objectives and automation now makes it possible for organizations to proactively monitor their digital presence to ensure the highest levels of availability, responsiveness, and customer experience.

To learn more about the challenges SREs are facing and how organizations are tackling them, download the free State of SRE Report.