What is site reliability engineering? And what do site reliability engineers do?

As more organizations adopt cloud-based computing and the demand for digital services increases, site reliability engineering (SRE) practices have become essential. These practices help organizations meet service level agreements (SLAs) for availability, performance, user experience, and business KPIs.

But what exactly is SRE, and what do site reliability engineers do?

What is site reliability engineering?

Site reliability engineering (SRE) is the practice of applying software engineering principles to operations and infrastructure processes to help organizations create highly reliable and scalable software systems. As a discipline, SRE focuses on improving software system reliability across key categories including availability, performance, latency, efficiency, capacity, and incident response. Those who perform the tasks involved are known as site reliability engineers.

The term “site reliability engineering” was coined in 2003 by Google VP of Engineering Ben Sloss, who famously noted on his LinkedIn profile, “If Google ever stops working, it’s my fault.” According to Google, “SRE is what you get when you treat operations as a software problem.”

Although every organization and software system is unique, it’s important to understand the fundamentals of SRE — as well as the skills and mindset of its engineers — as you think about how to optimize the reliability and overall quality of your software.

Five things to know about site reliability engineering

1. SRE focuses on automation

A major goal of SRE is to reduce duplication or redundancy of effort as much as possible. SRE teams focus on automating manual tasks, such as provisioning access and infrastructure and setting up accounts, and on building self-service tools. That frees development teams to focus on delivering features and operations teams to focus on managing infrastructure.

Automating processes is even more critical as organizations speed up delivery of new features into production. On one hand, speed comes from DevOps teams that leverage automation to increase continuous integration and continuous delivery (CI/CD). On the other hand, the move to microservice architectures and the adoption of cloud-native technology, containers, Kubernetes, and serverless architectures offer even more ways to deliver smaller changes faster. These methods increase efficiency and speed, but also demand consistent, repeatable processes that reduce risk and provide feedback loops for measuring operations, so teams can identify areas for improvement.

2. SRE bridges the gap between Dev and Ops

Everything the organization does in the value stream process should answer the question “how do we ensure this runs in production reliably?” SREs drive resiliency-based engineering. They can become mentors and ensure that resiliency is a top priority for both developers and operations.

Applying the DevOps mindset and skills to software reliability helps reduce silos between development and operations teams by sharing responsibility for detecting reliability and performance issues early in the development life cycle. Collaboration between developers, operations, and product owners enables site reliability engineers to define and meet uptime and availability targets.

3. SRE drives a “shift-left” mindset

SRE is a constantly evolving discipline, presenting opportunities to build methods, policies, and processes into the delivery pipeline that allow applications to “auto-remediate” or users to solve their own problems. A shift-left mindset means SREs can embed reliability principles from Dev to Ops, baking reliability and resiliency into each process, app, and code change to improve the quality of software that goes to production.

Here are some ways SRE helps to drive a “shift-left” mindset:

  • Develop quality gates based on production-level service level objectives (SLOs) to detect issues earlier in the development cycle.
  • Automate build testing and validation using service level indicators (SLIs) and SLOs.
  • Influence architectural decisions during initial design stages to ensure resiliency and scale at the outset of software development.
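As a rough illustration of the first two points, here is a minimal sketch of an SLO-based quality gate in Python. Everything in it is an assumption made for the example: the `Slo` class, the `evaluate_quality_gate` function, the metric names, and the thresholds are hypothetical and do not come from any specific tool. The idea is only that a build's measured SLIs are compared with production-level SLOs, and the pipeline stops when an objective is missed.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A hypothetical objective: the measured SLI must stay within a bound."""
    name: str
    threshold: float
    higher_is_better: bool  # True for availability, False for latency

    def is_met(self, measured: float) -> bool:
        if self.higher_is_better:
            return measured >= self.threshold
        return measured <= self.threshold


def evaluate_quality_gate(slos: list[Slo], slis: dict[str, float]) -> list[str]:
    """Return the names of the SLOs a build violates; an empty list means pass."""
    violations = []
    for slo in slos:
        measured = slis.get(slo.name)
        # A missing measurement counts as a violation so gaps are not ignored.
        if measured is None or not slo.is_met(measured):
            violations.append(slo.name)
    return violations


# Illustrative values only; real thresholds would come from production SLOs.
SLOS = [
    Slo("availability_pct", 99.9, higher_is_better=True),
    Slo("p95_latency_ms", 300.0, higher_is_better=False),
]

if __name__ == "__main__":
    build_slis = {"availability_pct": 99.95, "p95_latency_ms": 420.0}
    failed = evaluate_quality_gate(SLOS, build_slis)
    print("gate passed" if not failed else f"gate failed: {failed}")
```

In a pipeline, a non-empty result would typically fail the stage, so a latency regression is caught before the change reaches production.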

The goal is to take early, proactive steps to ensure quality and reliability are built-in from the beginning. SRE can influence processes more broadly and expand to coordinating testing across the enterprise in support of CI/CD practices.

To learn more about how Dynatrace enables SRE with “shift-left SLIs,” join us for the on-demand performance clinic Automated SRE-driven performance engineering with Dynatrace.

4. SRE builds services and tools to help operations and support

Traditionally, a major goal of operations teams is to improve uptime. This single-dimensional approach looks for the coveted “five nines” of uptime, or 99.999%, which translates to just over five minutes of downtime per year.
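The arithmetic behind that figure is easy to check. The short Python sketch below assumes a 365-day year and uses a hypothetical helper name, `allowed_downtime_minutes`; for 99.999% it works out to roughly 5.26 minutes per year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(uptime_pct: float) -> float:
    """Minutes per year a service can be down and still meet the uptime target."""
    return MINUTES_PER_YEAR * (1 - uptime_pct / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% uptime -> {minutes:.2f} minutes of downtime per year")
```

Each additional nine cuts the allowance by a factor of ten, which is part of why chasing uptime alone gets expensive quickly.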

But the higher frequency of change in distributed cloud-native environments requires a multi-dimensional approach.

The goal of SRE is to enable higher change rates while maintaining resiliency and that coveted 99.999% uptime. In multicloud environments, resiliency is measured across multiple key metrics such as performance, user experience, responsiveness, conversion rates, and so on. To accomplish their goal, SRE teams need to build and implement services that improve operations and facilitate the release process across all these areas. This can be anything from adjusting monitoring and alerting to making code changes in production. Site reliability engineers often build custom tooling from scratch to meet specific needs in the software delivery or incident management workflow.

Adopting an SRE approach also requires standardizing the technologies and tools teams use. Standardization makes it easier to manage operations and reduces the burden of managing incompatible technologies, which gives teams more time to collaborate and innovate.

5. SRE requires a cultural change

Because SRE is a practice, it requires a change in how teams across multiple disciplines communicate, solve problems, and implement solutions. To adopt a successful SRE culture, organizations must adopt new approaches to managing risk. They must also adapt governance processes and invest in hiring and educating a collaborative workforce that is versed in both engineering and operations and that learns and adapts quickly.

Organizations can then integrate these skilled engineers at key points in the DevOps life cycle. In development and testing teams, SRE specialists develop automation that helps developers test early and often without impeding agile delivery schedules. At a system level, SRE specialists develop tooling that coordinates releases and launches, evaluates system architecture readiness, and meets system-wide SLOs. At a governance level, SRE specialists help to define and oversee enterprise architecture, establish best practices, and select tools and resources that support company-wide site reliability.

What does a site reliability engineer do?

To get an expert’s view on what site reliability engineers do, I asked our DevOps Activist, Andi Grabner.

“Site reliability engineers use good practices around software engineering to provide resilient infrastructure and resilient services to their organizations and the people that actually deliver new applications,” he explains. He also notes that SREs often come from traditional operations roles, such as systems engineers who keep systems up and running. “Site reliability engineers ensure systems stay reliable, resilient, and available,” he adds.

Typical expectations for SREs

Typically, SREs are tasked with ensuring that the speed of delivery doesn’t result in security, service, or solution interruption. But as Grabner notes, “Expectations are a little different for every company. There’s no golden rule. Many are responsible for monitoring and observability and maintaining systems — and providing automation to spin up required environments.”

Grabner highlights the role of SREs in providing the frameworks and platforms for service and application deployment. “When things go wrong, SREs often take on the role of first-line defenders if there’s an alert,” he says. “In a great organization, they don’t do it alone — they work constantly with and within individual application teams to deal with apps that are under fire.”

Perhaps the most important role of SREs is architecting resiliency. “You can’t buy resiliency as-a-service,” Grabner observes. “You have to architect for it by building systems that are resilient by design.” This architectural approach recently helped Dynatrace to withstand an AWS outage in Germany. Automatic delivery, resiliency, and auto-remediation helped to ensure critical systems weren’t affected.

What makes a great SRE?

Great SREs are risk takers, tinkerers, and innovators. They figure out what it takes to scale a system from 100 users to 100,000 users to 1,000,000 users while maintaining uptime and resiliency. They are systems thinkers who consider how decisions made in development affect production environments, and how the needs of production systems can influence design.

This requires constant testing, accepting failure, and adapting, while automating repeatable processes along the way. Successful SREs bring a resiliency and adaptation mindset to every situation.

Grabner highlights the need for SREs to learn from their mistakes. “Some companies run ‘chaos days’ to deal with worst-case scenarios to understand what could happen and how to deal with it,” he says.

Automation is another marker of SRE success. “People excel in this role when they try to automate all the tasks that can and have to be automated,” says Grabner. “This frees them up to deliver true innovation.” He notes that while anyone can innovate under the right circumstances, teams are often held back by the “toil” of manual and repetitive tasks. “Your goal is to automate yourself out of your current role and into your next role.”

Finally, Grabner made it clear that SREs can’t operate in isolation. “You need to let people educate themselves with new technologies and practices,” he says. “Show the world. Don’t keep secrets — be open and share your own learnings, along with learning from others. There are a lot of great conferences out there — it’s worth getting inspired by what others do and inspiring others with what you do.”

DevOps vs. SRE

Where DevOps teams focus on streamlining change, SREs help ensure these changes don’t increase overall failure rates. In effect, they’re two sides of the same coin: DevOps automates speed, while SRE automates reliability. “It’s a balance between speed and safety,” Grabner says.

He sees DevOps processes as moving left to right along the development life cycle, using automation to speed up new capabilities that are typically measured by deployment frequency and lead time for changes. In comparison, SRE moves right to left using production-level requirements in development, with a focus on limiting failure rates and reducing the time required to restore service. “SRE is about making sure that even though there is a lot of change, these changes don’t break things.”
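To make those four measures concrete, here is a minimal Python sketch that derives them from a list of deployment records. The `Deployment` record, its fields, and the `delivery_metrics` function are hypothetical names invented for this example; a real team would pull this data from its own delivery and incident tooling, and may define each measure somewhat differently.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class Deployment:
    """A hypothetical record of one change going to production."""
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None  # set only for failed deployments


def mean(deltas: list[timedelta]) -> timedelta:
    """Average a list of durations; an empty list yields a zero duration."""
    return sum(deltas, timedelta()) / len(deltas) if deltas else timedelta()


def delivery_metrics(deployments: list[Deployment], period_days: int) -> dict:
    """Compute speed measures (DevOps side) and stability measures (SRE side)."""
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    restore_times = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deployments_per_day": len(deployments) / period_days,
        "mean_lead_time": mean(lead_times),
        "change_failure_rate": len(failures) / len(deployments) if deployments else 0.0,
        "mean_time_to_restore": mean(restore_times),
    }


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 9, 0)  # arbitrary illustrative timestamp
    sample = [
        Deployment(t0, t0 + timedelta(hours=2)),
        Deployment(t0, t0 + timedelta(hours=4), failed=True,
                   restored_at=t0 + timedelta(hours=5)),
    ]
    for name, value in delivery_metrics(sample, period_days=1).items():
        print(f"{name}: {value}")
```

The first two values describe how fast change flows from left to right; the last two describe how well production holds up under that change.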

Grabner sees SRE and DevOps overlapping when it comes to SLOs. “SLOs are all about supporting business goals,” he says. “Companies may need systems at 99% reliability. They may want to increase their user base or improve the end-user experience.” Satisfying these goals is the role of DevOps. “But underneath these goals are technical goals that are specific to your objectives,” he says. “They contribute to business success with the right features at the right time and help you cope with change.” Delivering on these goals is the job of SRE staff. As a result, “SLOs are a great way to bring DevOps and SREs together.”

Solving for site reliability

Site reliability isn’t and never will be a “solved problem.” New services and applications, combined with evolving enterprise demands, mean there’s always work for SRE teams, and there’s always room for improvement.

When it comes to improving SRE impact, Grabner’s advice bears repeating: be open and share your own learnings, learn from others and from your own mistakes, and give people room to educate themselves on new technologies and practices. Above all, he says, “don’t see it as a silo.”

Looking for a solution to up-level your SRE practices? Dynatrace can help. Providing automatic and intelligent observability for even the most complex distributed cloud environments, the Dynatrace Software Intelligence Platform empowers SRE and DevOps teams to identify problems before they occur. Driven by continuous automation with AI at its core, Dynatrace delivers precise root-cause answers to site reliability issues at every step of the software development lifecycle. From early development in pre-production environments through delivery and operations in production environments, Dynatrace helps SRE teams improve reliability, availability, and latency, and mitigate the business impact of service outages and slowdowns.


You can also join us for the on-demand virtual clinic Automate Deployment and Site Reliability with Bots, ChatOps and Dynatrace on how Dynatrace can assist your organization in automating deployment and site reliability.
