Key takeaways from o11yfest 2021 – SRE, Observability, and OpenTelemetry

Published May 26, 2021 6 min read

Giulia Di Pietro

From May 17 to May 18, 2021, the Open-Source Engineering team at Dynatrace attended the virtual observability conference, o11yfest. The conference aims to increase the “awareness in OpenTelemetry and other relevant projects and techniques related to increasing visibility, transparency, and traceability of mission-critical data across software teams”.

Dynatrace has been actively taking part in the development of open standards for several years, and they are a key part of our observability platform. We were therefore keen to attend o11yfest to learn about the latest trends, use cases, and developments in this space.

The two-day conference offered two tracks:

One dedicated to the business value of OpenTelemetry observability, where we got an insight into use cases from Skyscanner, HBE Digital, and Logz.io, to name a few.
The second focused on the OTel community, with more technical talks by representatives from companies like AWS, Elastic, and more.

This blog dives into what we learned by listening to experts in the field and the key takeaways we gathered from the talks.

SRE is becoming a key component of observability

A recurring theme was Site Reliability Engineering (SRE). In the keynote by Christina Yakomin and Steve Prazenica from Vanguard, the presenters recounted their journey from a monolith with alert-based incident reporting and no positive health signals to an observable microservice architecture. Alerting was centralized and generalized, which caused two problems. On one side, it led to “alert fatigue”, meaning too many alerts sent to too many people. On the other side, the alerts gave little insight into the underlying issues, because they tended to be based on signals that are easy to collect, such as CPU, memory, and response codes.

To improve their alerting system, the team decided to move to SLO/SLI-based alerts, which require deeper knowledge of the application but are more helpful than generalized metric alerts.
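To make the idea of SLO/SLI-based alerting more concrete, here is a minimal sketch of an availability SLI and an error-budget burn-rate check. It is not what Vanguard presented. The `Window` type, the 99.9% target, and the burn-rate threshold of 10 are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts observed over one alerting window (hypothetical shape)."""
    total: int
    failed: int


def availability_sli(w: Window) -> float:
    """Fraction of requests that succeeded; 1.0 when there was no traffic."""
    if w.total == 0:
        return 1.0
    return (w.total - w.failed) / w.total


def burn_rate(w: Window, slo_target: float) -> float:
    """How fast the error budget is being consumed in this window.

    1.0 means the budget would last exactly the SLO period;
    values above 1.0 mean it would run out early.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    error_budget = 1.0 - slo_target
    return (1.0 - availability_sli(w)) / error_budget


def should_alert(w: Window, slo_target: float = 0.999, threshold: float = 10.0) -> bool:
    """Alert only when the budget burns much faster than planned (assumed threshold)."""
    return burn_rate(w, slo_target) >= threshold
```

With these assumed values, a window of 10,000 requests with 5 failures burns the budget at roughly half the planned rate and stays quiet, while 200 failures burns it about 20 times too fast and triggers an alert. The alert is tied to what users experience rather than to CPU or memory.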

Jonah Kowall, CTO of Logz.io, also gave a talk on the topic of “Managing applications SLAs using Traces and Metrics”. Based on the SLAs agreed with customers, you can define measurable metrics and traces that your tools can monitor. A great example of this is Dynatrace, which has been doing SLA monitoring for years and which Jonah also showed in his talk.

[Image: A screenshot of Jonah’s talk with a picture of Dynatrace’s SLA capability.]

Trace-based sampling can help you save storage costs

Juraci Paixão Kröhling from Red Hat talked about how trace-based sampling is key to understanding whether you need to store certain traces or not. This can help you save money in storage costs in the long run. It’s tempting to lean towards storing all traces for all transactions because more data is better for analysis, but if you are not analyzing it, then you are paying for nothing. Juraci then spoke about the pros and cons of different types of sampling: head-based, tail-based, stateless collector, adaptive, and remote sampling.

A key takeaway from this talk is how important it is to be aware of the different sampling strategies and know which one makes sense for your application in a particular overload situation. There is no one-size-fits-all solution.
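The difference between head-based and tail-based sampling can be sketched in a few lines. This is a toy illustration, not the OpenTelemetry sampler API. The `Span` type, the hashing scheme, and the 500 ms "slow" threshold are assumptions.

```python
import hashlib
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Span:
    """Simplified span (hypothetical shape, not the OpenTelemetry data model)."""
    trace_id: str
    duration_ms: float
    is_error: bool = False


def head_sample(trace_id: str, ratio: float) -> bool:
    """Head-based: decide when the trace starts, from the trace ID alone.

    Hashing the ID makes the decision deterministic, so every service
    that sees the same trace ID reaches the same verdict.
    """
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < ratio


def tail_sample(spans: Sequence[Span], slow_ms: float = 500.0,
                baseline_ratio: float = 0.0) -> bool:
    """Tail-based: decide after the whole trace has arrived.

    Traces containing an error or a slow span are always kept; the rest
    fall back to a baseline ratio.
    """
    if any(s.is_error or s.duration_ms >= slow_ms for s in spans):
        return True
    return bool(spans) and head_sample(spans[0].trace_id, baseline_ratio)
```

The trade-off shows up directly in the signatures. `head_sample` needs only a trace ID, so it is cheap and stateless, but it cannot know whether the trace will turn out to be interesting. `tail_sample` can keep every failing or slow trace, at the cost of buffering all spans until the trace is complete.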

This resonated with our team. We solved this problem years ago and handle sampling strategies transparently for our customers. Yet, with third-party formats like OpenTelemetry, the topic has again received a lot of attention from our research team, because it’s key to sample and store traces efficiently without missing important events.

Seek support from vendors on your o11y journey

David Lucia recounted the success story of implementing observability at Simplebet in partnership with Lightstep. He is a proponent of bringing in an experienced commercial vendor, where possible, to implement such a project. OSS tools are good, but you will likely want the rich features and support provided by vendor solutions.

We second that. While data collection might get commoditized over time, storing the data efficiently and providing answers from that data is something commercial vendors have spent years innovating on. DIY solutions cannot match this, and they still require constant maintenance, which takes time and focus away from an organization’s core business. The consequence is high TCO and low ROI, which should clearly be avoided.

AWS aims to support the streamlining of observability

In her keynote, Jaana Dogan from AWS said that observability needs simplification because there are too many agents, too little correlation, too few standards, and too many products to support.

However, she mentioned that AWS already uses parts of OpenTelemetry, for example the semantic conventions, the data model, and the collector. Using OTel is important for the relationship between AWS and observability vendors (like Dynatrace), because it enables those vendors. Customers often use more than one vendor, and the aim is to keep friction low when pushing telemetry.

When Dynatrace started contributing to OpenTelemetry in 2019, we hoped that cloud vendors would take the opportunity to standardize how they collect and, more importantly, emit telemetry data. Initiatives like the one Jaana is driving validate this strategy, and we even teamed up with AWS to add our OpenTelemetry exporters to the AWS Distro for OpenTelemetry.

OTel is designed to ensure flexibility and stability

One of the co-founders of OpenTelemetry, Ted Young, gave a very interesting talk outlining the reasons why OTel is designed the way it is. It should be stable, flexible, and upgradeable, and for this reason it’s split into two components: the API and the SDK. The API contains all the tools needed to instrument a library. The SDK is what the operator uses to gather and process telemetry data.

The great thing about this is that companies that don’t support OTLP natively can provide their own pluggable exporters and distro packages. So, the design of OTel allows for flexibility. For example, the SDK can always be updated to the latest version without breaking a company’s chosen instrumentation.
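The API/SDK split can be illustrated with a small toy model. This is deliberately not the real OpenTelemetry API. Every name here (`Tracer`, `NoOpTracer`, `SdkTracer`, `ListExporter`, `get_tracer`, `set_tracer`) is hypothetical and only mirrors the shape of the design described in the talk.

```python
from typing import List, Protocol


# --- "API": the only thing an instrumented library depends on ---
class Tracer(Protocol):
    def record(self, name: str) -> None: ...


class NoOpTracer:
    """Default implementation: instrumentation is harmless with no SDK installed."""

    def record(self, name: str) -> None:
        pass


_tracer: Tracer = NoOpTracer()


def get_tracer() -> Tracer:
    return _tracer


def set_tracer(tracer: Tracer) -> None:
    global _tracer
    _tracer = tracer


# --- Library code: instrumented once, against the API only ---
def handle_request(path: str) -> str:
    get_tracer().record(f"GET {path}")
    return "ok"


# --- "SDK": chosen and configured by the operator ---
class ListExporter:
    """Stand-in for a pluggable exporter; a vendor could supply its own."""

    def __init__(self) -> None:
        self.exported: List[str] = []

    def export(self, name: str) -> None:
        self.exported.append(name)


class SdkTracer:
    def __init__(self, exporter: ListExporter) -> None:
        self._exporter = exporter

    def record(self, name: str) -> None:
        self._exporter.export(name)
```

In this sketch, `handle_request` never changes. The operator installs an SDK with `set_tracer(SdkTracer(ListExporter()))`, and swapping the exporter or upgrading the SDK side leaves the library’s instrumentation untouched.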

eBPF is predicted to become a canon for telemetry

AWS engineers predict that eBPF will eventually become the canonical way of collecting out-of-the-box telemetry. However, for widespread usage, its tooling needs to become simpler. AWS currently uses eBPF in security and ops, and the company’s engineers are planning to develop an Event data type in OpenTelemetry to export eBPF events. eBPF will add a new layer to OpenTelemetry, and this is an exciting development we are following closely.

What’s next?

OpenTelemetry is a key ingredient for cloud-native observability and it’s great to see how much traction it gets.

Just collecting more data is not enough. At the end of the day, it’s all about automation and providing actionable answers. This is what sets Dynatrace apart as a Leader in application performance monitoring.

Relying on manual instrumentation alone would be a step back, so our goal is to make OpenTelemetry a first-class citizen of platforms, languages, and libraries.

A dedicated team of engineers at Dynatrace is contributing to OpenTelemetry to make this vision come true.

It was great to participate in o11yfest and to learn more about what’s happening in this space. We are curious about where OTel is heading, and we look forward to taking part in next year’s observability event.