
Five-nines availability: Always-on infrastructure delivers system availability during the holidays’ peak loads

During the holiday shopping season, IT teams hover over performance dashboards hoping their systems will deliver five-nines availability, or 99.999% uptime, under peak loads. But is five-nines uptime even realistic? Can users tell the difference?

For retail organizations, peak traffic can be a mixed blessing. While high-volume traffic often boosts sales, it can also compromise uptime. The ideal state of system uptime at peak loads is known as “five-nines availability,” and IT teams prepare for the season hoping to deliver five nines, or even four nines, of availability.

Five-nines availability has long been the goal of site reliability engineers (SREs) who aim to provide system availability that is “always on.” But as more organizations adopt cloud-native technologies and distribute workloads among multicloud environments, that goal seems harder to attain. In fact, a recent survey found that many cloud applications are vulnerable to outages, despite growing confidence in cloud platforms and services.

But is five-nines availability attainable? How can IT teams deliver system availability under peak loads that will satisfy customers without breaking the bank?

Five-nines availability: The ultimate benchmark of system availability

Site reliability engineering teams often measure system availability as a percentage, in pursuit of 100% uptime. Each additional nine cuts the allowed downtime by a factor of ten. Many IT operations build their service-level agreements (SLAs) around uptime percentages measured in nines, each corresponding to a cumulative downtime per year.

System availability       Downtime per year
90% (one nine)            36.53 days
99% (two nines)           3.65 days
99.9% (three nines)       8.77 hours
99.99% (four nines)       52.60 minutes
99.999% (five nines)      5.26 minutes
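
The downtime figures follow directly from the availability percentage. As a minimal sketch, assuming an average year of 365.25 days (the function and constant names are illustrative and not taken from any particular tool):

```python
# Minutes in an average year; 365.25 days is an assumption that
# accounts for leap years.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the downtime allowed per year, in minutes."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * MINUTES_PER_YEAR


# Print the downtime allowance for one to five nines.
for percent in (90, 99, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(percent)
    print(f"{percent}% -> {minutes:.2f} minutes per year")
```

Dividing the result by 60 or by 1,440 gives the same allowance in hours or days.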

While even five minutes of downtime per year might sound excessive to perfectionists, each additional nine brings increasing cost and complexity. For organizations running their own on-premises infrastructure, these costs can be prohibitive.

Cloud service providers, such as Amazon Web Services (AWS), can offer infrastructure with five-nines availability by deploying in multiple availability zones and replicating data between regions. But most organizations use some combination of hybrid and multicloud environments. And even if a cloud platform offers five nines, applications running on that cloud platform often don’t.
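
One way to see why an application can fall short of its platform's availability is to multiply the availabilities of everything a request depends on. The sketch below assumes that component failures are independent and that a request needs every component to be up. Both are simplifications, and the component names and figures are hypothetical.

```python
from math import prod


def serial_availability(component_availabilities):
    """Availability of a request path that needs every component up.

    Assumes independent failures, so the individual availabilities
    (expressed as fractions, e.g. 0.999) simply multiply.
    """
    return prod(component_availabilities)


# Hypothetical request path: the figures are illustrative only.
request_path = {
    "cloud platform": 0.99999,
    "load balancer": 0.9999,
    "application service": 0.999,
    "database": 0.9995,
}

overall = serial_availability(request_path.values())
print(f"End-to-end availability: {overall:.5%}")
```

Under these assumptions, the end-to-end figure is always lower than that of the weakest component, even though the platform itself offers five nines.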

Complicating the situation further, increasingly connected services are pushing more data processing to the edge. Gartner estimates that less than half of enterprise-generated data is now created and processed in data centers or the cloud. Instead, to speed up response times, applications are now processing most data at the network’s perimeter, closest to the data’s origin.

With so many variables in modern application delivery, organizations need an always-on infrastructure to deliver continuous system availability, even under peak loads. They also need a way to track all the services running on their distributed architectures, from multicloud environments to the edge.

What is always-on infrastructure?

Always-on infrastructure refers to IT services and environments that enterprises operate and manage in a way that delivers uninterrupted service. An always-on infrastructure provides the foundation for five-nines availability.

Traditionally, teams achieve this high level of uptime using a combination of high-capacity hardware, system redundancy, and failover models. Now, the rise of cloud computing is making high-availability infrastructure easier to achieve, but also more complicated to manage, especially for organizations that hastily added digital services during the pandemic.
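
The effect of redundancy can be approximated with a simple model. This is a sketch under strong assumptions: replicas fail independently, failover is instantaneous and never fails itself, and one healthy replica is enough to serve traffic. Real systems rarely satisfy all three, so treat the result as an upper bound. The function names are illustrative.

```python
def redundant_availability(replica_availability: float, replicas: int) -> float:
    """Availability of a group in which any one healthy replica suffices.

    The group is down only when every replica is down at once.
    """
    return 1 - (1 - replica_availability) ** replicas


def replicas_needed(replica_availability: float, target: float) -> int:
    """Smallest replica count whose modeled availability meets the target."""
    if not (0 < replica_availability < 1 and 0 < target < 1):
        raise ValueError("availabilities must be fractions between 0 and 1")
    count = 1
    while redundant_availability(replica_availability, count) < target:
        count += 1
    return count


# Hypothetical example: replicas that are each available 99% of the time,
# with a five-nines target for the group.
print(replicas_needed(0.99, 0.99999))
```

The model shows why redundancy is the traditional route to high availability, and also why each extra nine adds hardware and management cost.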

Why always-on infrastructure is critical for system availability during peak loads

The nightmare scenario for online retailers during peak periods, such as Black Friday and Cyber Monday, is a system crash. Equally damaging, and stressful for staff, are slow checkouts or dropped transactions that leave customers waiting in long lines or, worse, abandoning their carts and leaving for a competitor. One report found that 32% of customers would walk away from their favorite brand after just one bad experience.

However, the complexity of modern cloud environments makes it increasingly difficult to stay on top of the services running on these diverse networks. “Overall outage rates are partly the result of…digital infrastructure,” says Andy Lawrence of the Uptime Institute, “and the complexity operators face as they transition to hybrid, distributed architectures.”

Finding the root cause of a slowdown or outage in multicloud environments is challenging enough during normal times. But under peak loads, the ability to pinpoint problems and quickly remediate them can mean the difference between a profitable season and lost revenue or brand damage. Ideally, teams need the ability to anticipate problems and fix conditions before an issue develops.

How to create an always-on infrastructure for five-nines availability

Achieving five-nines availability and delivering great user experiences under peak loads requires an integrated approach that spans all digital touchpoints. Such an approach depends on observability that extends throughout a multicloud infrastructure, all the way to the edge, and enables a real-time response.

  1. Gather observability data from all digital touchpoints. Include metrics, event logs, distributed traces, metadata, user experience data, and telemetry data from open source technologies and cloud platforms. With broad observability data, you can understand what’s happening at every endpoint and service in a multicloud computing environment.
  2. Establish service-level objectives (SLOs). SLOs define the performance margins you need to achieve, such as five-nines uptime. SLOs enable ITOps teams to detect problems as they’re developing before they result in an outage. By using SLOs for early problem detection, you can increase system availability, resilience, and reliability. As a result, your teams can proactively respond to issues before they cause disruption or downtime.
  3. Integrate infrastructure monitoring on a single AIOps platform. Because modern IT architectures consist of traditional and cloud-native technologies, organizations also use specialized monitoring tools tailored for specific uses. But disparate tools also bring disparate points of view. By integrating point solutions into a single platform in context, teams across the organization can work from the same data and gain new insights.
  4. Apply AI for real-time root-cause analysis. A key requirement for teams operating systems under peak load is detecting root-cause issues in real time. Using traditional monitoring tools, IT teams may not even know there is a problem until customers complain of unacceptable delays and incomplete transactions. A platform-based approach that uses AI can automatically pinpoint root causes in real time so teams can respond to issues they missed using traditional methods.
  5. Automate IT operations. With monitoring data consolidated into a single AI-enabled analytics platform, teams can automate operations and incident response. Such a solution, with extensive application programming interfaces (APIs) and integrations to enterprise resource planning (ERP) systems, makes it possible to detect what was previously undetectable and initiate automatic remediation.
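
To illustrate how an SLO (step 2) can surface a developing problem before it becomes an outage, the sketch below computes an error budget and a burn rate from request counts. It is a generic illustration, not the method of any specific platform. The function names, the request counts, and the alert threshold are all assumptions chosen for the example.

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent).

    slo_target is a fraction such as 0.999; the budget is the share of
    requests that may fail without breaching the SLO.
    """
    allowed_failures = (1 - slo_target) * total
    return 1 - failed / allowed_failures


def burn_rate(slo_target: float, window_total: int, window_failed: int) -> float:
    """How fast a recent window consumes the budget.

    A value of 1.0 means the budget would last exactly the SLO period;
    higher values mean it runs out sooner.
    """
    if window_total == 0:
        return 0.0
    observed_error_rate = window_failed / window_total
    return observed_error_rate / (1 - slo_target)


# Hypothetical alert threshold; tune it to your own SLO period and windows.
ALERT_BURN_RATE = 10.0

# Hypothetical last-hour figures: 100,000 requests, 500 of them failed.
rate = burn_rate(0.999, window_total=100_000, window_failed=500)
if rate >= ALERT_BURN_RATE:
    print(f"Page on-call: burn rate {rate:.1f}")
else:
    print(f"Within tolerance: burn rate {rate:.1f}")
```

Alerting on the burn rate rather than on individual failures lets a team react while budget remains, which is the early-detection behavior that SLOs are meant to provide.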

Five-nines availability: An aspirational goal

Although five-nines availability is a coveted goal within the IT industry, for most, it is unrealistic and cost-prohibitive. Moreover, most users can’t detect millisecond delays. As a result, organizations usually strive for more realistic system availability goals that still meet user expectations. Nonetheless, teams can continually move system availability goals toward perfection using an AIOps platform approach.

To learn more about how the Dynatrace Software Intelligence platform helps deliver continuous system availability under peak loads, join us for the on-demand observability clinic, Leverage Davis AI to analyze your system before things break.