A guide to event-driven SRE-inspired DevOps

“This is a mouthful of buzzwords” is how I started my recent presentations at the Online Kubernetes Meetup as well as the DevOps Fusion 2020 Online Conference when explaining the three big challenges we are trying to solve with Keptn – our CNCF Open Source project:

  • Automate build validation through SLI/SLO-based Quality Gates
  • Break monolithic pipelines into event-driven Delivery Choreography
  • Embrace event-driven auto-remediation with an SLO-based safety net

You can watch the recording of my Online Kubernetes Meetup talk on YouTube and the recording of DevOps Fusion on their website, and you can access the slides on my Slideshare.

Since the talk, I made a few adjustments based on feedback and plan on giving an updated version at upcoming events including our own DynatraceGo! in October 2020 where I’ll specifically focus on how Keptn integrates and automates on top of Dynatrace Davis. It’s a free virtual event so I hope you join me.

In this blog, I want to dig a bit into a core capability of Keptn which is used across all use cases: SLI/SLO-based Evaluations for Quality Gates as well as Auto-Remediation.

Shifting-left SRE to automate Quality Gates

If you’re not familiar with Site Reliability Engineering (SRE) and the concepts of Service Level Indicators (SLIs), Service Level Objectives (SLOs) and Service Level Agreements (SLAs), I recommend watching the YouTube video from Google engineers called “SLIs, SLOs, SLAs, oh my! (class SRE implements DevOps)”.

An SLI is just a metric, an SLO is just a threshold you expect your SLI to stay within, and an SLA is just the business contract on top of an SLO. It’s great that this discipline and its ideas have gotten a lot of attention lately, most likely because of their catchy names, good marketing, and some really great success stories. This is also the reason why Keptn leverages the same ideas and terminology. But instead of using them just for production reporting and alerting, Keptn uses these concepts for automated build validation. It’s a shift-left of SLIs & SLOs to automate the evaluation of each build based on indicators that will be very important later in production. The following animation shows this “Shifting-Left SRE” approach that Keptn implements:

Shifting-Left SRE means leveraging SLIs & SLOs not only in production but also as part of continuous delivery to validate every commit

While this concept of metrics-based quality gates is not new (think about the Unbreakable Pipeline or what we did 10 years ago with AppMon’s Test Automation feature), Keptn implements SLIs/SLOs as a core capability that powers its core use cases, such as automated quality gates in delivery, performance engineering as a self-service, or auto-remediation validation.

Thanks to its event-driven architecture, Keptn can pull SLIs (=metrics) from different data sources and validate them against the SLOs. Currently supported data sources are Prometheus, Dynatrace, and Neoload, with others such as Wavefront in the works.
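To make the idea of validating SLIs against SLOs more concrete, here is a minimal Python sketch. It is not Keptn's implementation: the names (Objective, evaluate), the weighting scheme (full weight for a pass, half for a warning) and the 90% / 75% score thresholds are assumptions chosen for illustration, and the SLI values are passed in by hand rather than pulled from a data source.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Objective:
    """One SLO: pass/warning criteria for a single SLI (hypothetical model)."""
    sli: str
    pass_if: Callable[[float], bool]
    warn_if: Optional[Callable[[float], bool]] = None
    weight: int = 1


def evaluate(
    objectives: List[Objective],
    sli_values: Dict[str, float],
    pass_score: float = 90.0,   # assumed overall pass threshold in percent
    warn_score: float = 75.0,   # assumed overall warning threshold in percent
) -> Tuple[str, float]:
    """Score each SLI against its objective and return (result, score)."""
    total = sum(o.weight for o in objectives)
    earned = 0.0
    for o in objectives:
        value = sli_values[o.sli]
        if o.pass_if(value):
            earned += o.weight          # full credit for a pass
        elif o.warn_if is not None and o.warn_if(value):
            earned += o.weight / 2      # half credit for a warning
    score = 100.0 * earned / total
    if score >= pass_score:
        return "pass", score
    if score >= warn_score:
        return "warning", score
    return "fail", score


# Example with made-up numbers: a slow-ish build with a healthy error rate.
objectives = [
    Objective("response_time_p95", lambda v: v < 600, lambda v: v < 800),
    Objective("error_rate", lambda v: v < 1.0, weight=2),
]
result, score = evaluate(objectives, {"response_time_p95": 650, "error_rate": 0.2})
print(result, round(score, 2))  # warning 83.33
```

The point of the sketch is that the quality gate decision is driven by data and thresholds only, so the same objectives can be reused for any build and any data source that can deliver the SLI values.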

As I personally helped drive the latest innovation of the Dynatrace SLI Provider, let me show you how easy we made it for Dynatrace users to define SLIs & SLOs:

Visual SLIs & SLOs through Dynatrace dashboards

First, let me take a moment to thank Klaus Enzenhofer, Technical Product Manager at Dynatrace, who challenged me to make SLI & SLO definition simpler so that it is more appealing to a non-deep technical user base.

In our recent Performance Clinic, “Automate Business Level Objective Monitoring with Dynatrace & Keptn”, I was able to live demo the latest version of the Dynatrace SLI Provider for Keptn, which now accepts a Dynatrace dashboard as input, as you can see in the following example!

Keptn SLIs and SLOs for Dynatrace can now be defined as a Dynatrace dashboard, making it easy for everyone to define their quality criteria

As you can see – it’s as easy as creating a dashboard in Dynatrace, putting the SLIs that are important to you on it, and defining the thresholds via the tile titles. That’s it. The next time Keptn must evaluate SLIs/SLOs for quality gates, for performance analysis after a performance test, or during auto-remediation, it will pull the metrics from that dashboard.
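As a rough illustration of how thresholds carried in a tile title could be turned into SLO criteria, here is a small Python sketch. The title format used here (a display name followed by semicolon-separated key=value pairs such as sli, pass and warning) is an assumption made for this example, and parse_tile_title is a hypothetical helper; please check the readme of the Dynatrace SLI Service for the syntax it actually accepts.

```python
from typing import Dict, List, Optional, Union

TileSettings = Dict[str, Union[Optional[str], List[str]]]


def parse_tile_title(title: str) -> TileSettings:
    """Split an (assumed) 'Name;key=value;key=value' tile title into its parts."""
    parts = [p.strip() for p in title.split(";")]
    settings: Dict[str, str] = {}
    for part in parts[1:]:
        if "=" not in part:
            continue  # ignore anything that is not a key=value pair
        # Split on the first '=' only, so criteria like '<=800' stay intact.
        key, value = part.split("=", 1)
        settings[key.strip()] = value.strip()

    def criteria(key: str) -> List[str]:
        # Several comma-separated criteria are allowed per key in this sketch.
        raw = settings.get(key, "")
        return [c.strip() for c in raw.split(",") if c.strip()]

    return {
        "display_name": parts[0],
        "sli": settings.get("sli"),
        "pass": criteria("pass"),
        "warning": criteria("warning"),
    }


tile = parse_tile_title("Response time (P95);sli=rt_p95;pass=<600;warning=<=800")
print(tile)
```

Keeping the criteria next to the chart they belong to is what makes the dashboard approach approachable: whoever owns the dashboard also owns the thresholds.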

This approach allows every team to easily define their own dashboard for their own SLIs/SLOs. You can also start creating dashboard templates to onboard applications of a certain type or technology stack faster.

For those of you who know Klaus, he also challenged me to extend this concept to include business-relevant metrics such as conversion rates, end-user experience, page load times, and bounce rates. To make Klaus happy, the Dynatrace SLI Provider now also supports any real user metrics, calculated metrics, as well as USQL (User Session Query Language) queries:

You can add any real user metric or USQL query to your Dynatrace dashboard for SLI/SLO evaluation

If you want to learn more about this capability, check out the readme on the Dynatrace SLI Service GitHub page.

Level up your existing Continuous Delivery

The benefit of Keptn providing these capabilities is that every other tool you use in your DevOps toolchain can leverage it by integrating Keptn through the Keptn API or by extending Keptn with a Keptn Service.

We have several examples of users using Keptn SLI/SLO-based Quality Gates in Jenkins, Azure DevOps, XebiaLabs (now digital.ai), GitLab, or other delivery tools. A common scenario is what Christian Heckelmann from ERT is doing with their GitLab pipeline. He reduced the complexity of doing build validation in his GitLab pipeline by simply letting Keptn do the job:

Keptn can easily level up your existing delivery, for example by automating build approvals through SLI/SLO-based quality gates
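On the pipeline side, letting Keptn do the job can boil down to a single decision step. The Python sketch below assumes the pipeline has already obtained an evaluation result of "pass", "warning" or "fail" (how that result is fetched, for example through the Keptn API, is deliberately left out here), and gate_exit_code is a hypothetical helper that maps the result to a process exit code.

```python
def gate_exit_code(result: str, allow_warning: bool = True) -> int:
    """Map an evaluation result to an exit code: 0 promotes the build, 1 stops it."""
    if result == "pass":
        return 0
    if result == "warning" and allow_warning:
        return 0  # assumption: warnings do not block unless the team says so
    return 1      # "fail" and any unknown result stop the pipeline


# In a pipeline job this could be wired up roughly like this:
#   import sys
#   sys.exit(gate_exit_code(evaluation_result))
print(gate_exit_code("pass"), gate_exit_code("warning", allow_warning=False))
```

Treating unknown results as a failure is a deliberate choice in this sketch: a quality gate that cannot be evaluated should not silently approve a build.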

Christian is also working on leveraging Keptn for performance test orchestration, which will allow him to move those test steps out of the delivery pipeline as well. For that, Keptn already offers integrations with JMeter and Neoload, with other testing tool integrations in the works. After that, I think Christian is looking into Auto-Remediation with Keptn to ensure healthy environments all the way into production!

Let’s drive this vision forward, together!

There’s much more shown in the meet-up presentation, so make sure you watch it!

I also encourage you to join our Keptn community, star us on GitHub, join our Keptn Slack and test drive Keptn by walking through our Online Keptn Tutorials. Together we can drive even more innovation.
