What is site reliability engineering? All you need to know

What is site reliability engineering?

Site reliability engineering (SRE) is the practice of applying software engineering principles to operations and infrastructure processes to help organizations create highly reliable and scalable software systems. As a discipline, SRE focuses on improving software system reliability across key categories including availability, performance, latency, efficiency, capacity, and incident response.

The term “site reliability engineering” was coined in 2003 by Google VP of Engineering Ben Sloss, who famously noted on his LinkedIn profile that “if Google ever stops working, it’s my fault.” According to Google, “SRE is what you get when you treat operations as a software problem.”

As more organizations expand services via the cloud and demand for digital services increases, SRE practices are essential both for meeting uptime service-level agreements and for supporting the continuous integration/continuous delivery (CI/CD) demands of DevOps and DevSecOps teams.

Although every organization and software system is unique, it’s worth understanding some fundamentals of SRE as you think about how to apply it to your own situation.

SRE bridges the gap between Dev and Ops teams

SRE applies DevOps principles to developing systems and software that help increase site reliability and performance. Applying a DevOps mindset and skills to software reliability helps reduce silos between development and operations by sharing responsibility for detecting reliability and performance issues early in the development lifecycle. Collaboration between developers, operations, and product owners enables site reliability engineers to define and meet uptime and availability targets.

SRE focuses on automation

A major goal of SRE is to reduce duplicated and redundant effort as much as possible. SRE teams automate manual tasks, such as provisioning access and infrastructure and setting up accounts, and they build self-service tools, so developers can focus on delivering features and operations teams can focus on managing infrastructure.

This focus on automating processes is even more critical as organizations adopt more cloud-native technologies, including containers, Kubernetes, and serverless applications. DevOps teams must constantly adapt by using agile methodologies and rapid delivery models, such as CI/CD. These methods increase efficiency and speed, but they also demand consistent, repeatable processes that reduce risk and provide feedback loops for measuring operations, so teams can identify areas for improvement.

SRE drives a “shift left” mindset

Site reliability engineering is a constantly evolving discipline, presenting opportunities to build methods, policies, and processes into the delivery pipeline that allow applications to “auto-remediate,” or that let users solve their own problems. Shifting left with an SRE approach means that reliability is built into each process, app, and code change.

Here are some of the ways SRE can help drive a “shift-left” mindset:

  • Automating build testing and validation using service-level indicators (SLIs) and service-level objectives (SLOs)
  • Monitoring SLOs and testing them in pre-production with intelligent quality gates to detect issues earlier in the development cycle
  • Deploying closed-loop remediation (continuous testing and remediation) to fix problems in pre-production before software is released to production
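
As a rough illustration of the quality-gate idea in the list above, the following Python sketch compares measured SLIs against SLO thresholds and reports any violations. The names here (Slo, evaluate_gate, availability_pct, p95_latency_ms) and the threshold values are hypothetical and chosen only for this example; they are not taken from any particular tool or pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Hypothetical objective: the SLI named `sli` must satisfy `threshold`."""
    sli: str
    threshold: float
    higher_is_better: bool  # True for availability, False for latency


def evaluate_gate(measured: dict[str, float], slos: list[Slo]) -> list[str]:
    """Return violation messages; an empty list means the gate passes."""
    violations = []
    for slo in slos:
        value = measured.get(slo.sli)
        if value is None:
            # A missing measurement is treated as a failure, not a pass.
            violations.append(f"{slo.sli}: no measurement")
            continue
        if slo.higher_is_better:
            ok = value >= slo.threshold
        else:
            ok = value <= slo.threshold
        if not ok:
            violations.append(
                f"{slo.sli}: measured {value}, objective {slo.threshold}"
            )
    return violations


# Example values are invented for illustration.
slos = [
    Slo("availability_pct", 99.9, higher_is_better=True),
    Slo("p95_latency_ms", 300.0, higher_is_better=False),
]
measured = {"availability_pct": 99.95, "p95_latency_ms": 412.0}

violations = evaluate_gate(measured, slos)
print("gate passed" if not violations else f"gate failed: {violations}")
```

In a pre-production pipeline, a non-empty result like this one (the latency objective is missed) would typically block promotion of the build until the issue is fixed.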

The goal is to take early, proactive steps to ensure quality and reliability are built in from the beginning. SRE can influence processes more broadly and expand to coordinating testing across the enterprise in support of continuous software integration and delivery.
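
The closed-loop remediation idea can be pictured as a check, fix, re-check cycle. The Python sketch below is a minimal version under simplifying assumptions: the function remediate, its parameters, and the simulated service are all hypothetical, and a real setup would plug in actual health checks and remediation actions.

```python
from typing import Callable


def remediate(
    is_healthy: Callable[[], bool],
    fix: Callable[[], None],
    max_attempts: int = 3,
) -> str:
    """Apply `fix` until `is_healthy` passes or attempts run out.

    Returns "healthy" if no action was needed, "remediated" if a fix
    worked, or "escalate" if the problem persists and needs a human.
    """
    if is_healthy():
        return "healthy"
    for _ in range(max_attempts):
        fix()
        if is_healthy():  # re-test after every fix to close the loop
            return "remediated"
    return "escalate"


# Simulated service (illustration only): healthy after two restarts.
state = {"restarts": 0}


def check() -> bool:
    return state["restarts"] >= 2


def restart() -> None:
    state["restarts"] += 1


outcome = remediate(check, restart)
print(outcome)
```

Capping the number of attempts keeps the loop from retrying forever; anything that still fails after the cap is handed to a person rather than retried silently.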

SRE builds services and tools to help operations and support

One of the goals of SRE is to improve uptime. Ideally, companies are looking for the coveted “five nines” of uptime, or 99.999%, which translates to just over five minutes of downtime per year. Compare that with four nines, or 99.99%, which equates to about 52 minutes of downtime per year.
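
The arithmetic behind these figures is easy to check. The short Python sketch below converts an availability target into allowed downtime per year; the function name is made up for this example, and it assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def allowed_downtime_minutes(availability_pct: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


for target in (99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.2f} minutes/year")
```

Each additional nine cuts the allowance by a factor of ten: roughly 525.6 minutes at three nines, 52.56 at four, and 5.26 at five.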

To accomplish this goal, SRE teams build and implement services that improve operations and facilitate the release process. This can be anything from adjusting monitoring and alerting to making code changes in production. Site reliability engineers often build custom tooling from scratch to meet specific needs in the software delivery or incident management workflow.

Adopting an SRE approach also requires that teams standardize the technologies and tools they use. Standardization makes it easier to manage operations, and reduces the burden of managing incompatible technologies, which frees up teams to collaborate and innovate.

SRE requires a cultural change

Because SRE is a practice, it requires a change in how teams across multiple disciplines communicate, solve problems, and implement solutions. Building a successful SRE culture requires organizations to adopt new approaches to managing risk. It also means adapting governance processes and investing in hiring and educating a collaborative workforce that is versed in both engineering and operations and that learns and adapts quickly.

Organizations can then integrate these skilled engineers at key points in the DevOps lifecycle. On development and testing teams, SRE specialists develop automation that helps developers test early and often without impeding agile delivery schedules. At a system level, SRE specialists develop tooling that coordinates releases and launches, evaluates system architecture readiness, and helps meet system-wide SLOs. At a governance level, SRE specialists help define and oversee enterprise architecture, establish best practices, and select tools and resources that support companywide site reliability.

Solving for SR

Site reliability isn’t, and will never be, a “solved problem.” New services and applications, combined with evolving enterprise demands, mean there’s always work for SRE teams and always room for improvement.

Dynatrace can help. Built for automatic and intelligent observability of even the most complex distributed cloud environments, the Dynatrace Software Intelligence Platform empowers SRE and DevOps teams to identify problems as — or even before — they occur. Driven by continuous automation with AI at its core, Dynatrace delivers precise root-cause answers to site reliability issues at every step of the software cycle. From early development in pre-production environments, through delivery and operations in production environments, Dynatrace helps SRE teams improve reliability, availability, and latency, and mitigate the business impact of service outages and slowdowns.

For more about this ongoing conversation, see A guide to event-driven SRE-inspired DevOps.


You can also join us for the on-demand virtual clinic Automate Deployment and Site Reliability with Bots, ChatOps and Dynatrace on how Dynatrace can assist your organization in automating deployment and site reliability.
