Open Observability – Part 1: Distributed tracing and observability

Are you getting started with distributed tracing and observability, or planning to use open-source instrumentation? This two-part blog post series will get you started by explaining the key concepts of observability and distributed tracing.

I lead a dedicated Dynatrace engineering team that has contributed to the OpenTelemetry project since its inception. My team holds seats on the OpenTelemetry governance and technical committees and maintains the project’s JavaScript agent. I also serve as co-chair of the W3C Distributed Tracing Working Group, which works on standards like W3C Trace Context.

As a strong supporter of open source and open standards, I’m aware that the wide availability of standards, open-source tools, and some newly coined terms are causing a lot of confusion.

Following is a list of questions that I hear frequently; I will answer these questions in this blog post. In a follow-up blog post to be published later, I will delve deeper into OpenTelemetry, its use cases, and how to use it with Dynatrace.

  • What is distributed tracing?
  • What does W3C Trace Context do?
  • How is monitoring different from observability?
  • Why is everyone talking about OpenTelemetry?

Let’s kick this off with distributed tracing, an important topic that is very close to our hearts at Dynatrace.

Distributed tracing

Distributed tracing describes the act of following a transaction through all participating applications (tiers) and sub-systems, such as databases.

Distributed computing didn’t start with the rise of microservices. Service-oriented architectures (SOA) became popular as early as the 2000s, and operations teams discovered the need to understand how transactions traverse all tiers and how each tier contributes to execution time and latency.

In the mid-2000s, Google developed Dapper, its internal distributed tracing system, and later published the Dapper paper describing its techniques. The paper introduced the terms ‘trace’ for a transaction and ‘span’ for an operation within a trace. Depending on the scope of the collected data, a span can represent a single operation or everything that happens within a participating tier.

Around the same time, Dynatrace released PurePath Version 1, which allowed for distributed tracing that provided not only cross-tier timings but also code-level details.
A Dynatrace PurePath consists of subpaths, a similar concept to that of traces and spans.

Figure 1: Traces and spans in distributed tracing

Today, distributed tracing is state-of-the-art, and most performance monitoring solutions support at least a flavor of it.

W3C Trace Context

We now know that a set of spans forms a trace. But how is this relationship represented? All systems that support distributed tracing pass a set of identifiers, the trace context, along with the transaction. For HTTP, this means that at least a trace ID is injected into the headers of outbound requests and extracted from the headers of inbound requests.
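To make this mechanism concrete, here is a minimal sketch in plain Python of how a tracing agent might inject a trace ID into outbound HTTP headers and extract it from inbound ones. The header name is hypothetical, chosen purely for illustration:

```python
import secrets

# Hypothetical header name; real systems each used their own (see below)
TRACE_HEADER = "X-Example-Trace-Id"

def inject_trace_context(headers, trace_id=None):
    """Add a trace ID to outbound request headers, creating one if absent."""
    headers[TRACE_HEADER] = trace_id or secrets.token_hex(16)
    return headers

def extract_trace_context(headers):
    """Read the trace ID from inbound request headers, if present."""
    return headers.get(TRACE_HEADER)

# An upstream service starts a trace...
outbound = inject_trace_context({"Content-Type": "application/json"})

# ...and the downstream service picks it up, joining the same trace.
trace_id = extract_trace_context(outbound)
```

As long as every hop forwards the header, all spans can be stitched together under the same trace ID.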

For a long time, each vendor used their own format for this trace ID, which meant that proxies and middleware had to be configured specifically for each trace ID format they wanted to support.

Also, if different monitoring/observability tools were used within a service infrastructure, incompatible trace ID formats would break any trace that passed through a system using another format.

The diagram below shows both cases: a passive middleware that doesn’t propagate the trace ID and a service that uses another monitoring vendor. In both cases, the result is a broken trace.

Figure 2: Broken path examples with incompatible middleware ID formats and different monitoring systems.

This lack of interoperability was the reason why Dynatrace was a founding member of the W3C Distributed Tracing Working Group with the goal of standardizing how trace context is represented throughout the industry.

Today, W3C Trace Context is the standard format used by projects like OpenTelemetry, and cloud providers are expected to adopt this standard over time, providing a vendor-neutral way to propagate trace IDs through their services. Of course, Dynatrace supports W3C Trace Context as well. The diagram below is updated to show the trace ID with W3C Trace Context. In this case, the trace is preserved with no broken link.

Figure 3: End-to-end trace with W3C Trace Context

Please note: W3C Trace Context propagation must be supported by all participating systems. It does not provide a way to trace through arbitrary systems.

Also, as you can see in the diagram above, if a system only propagates trace context without an agent reporting spans or if spans are created by another monitoring vendor, the trace will not break but will also not contain spans for each of the respective tiers.
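On the wire, W3C Trace Context is carried primarily in the `traceparent` HTTP header, which encodes a version, a 16-byte trace ID, an 8-byte parent (span) ID, and trace flags as dash-separated lowercase hex fields. Here is a small Python sketch of generating and parsing such a header:

```python
import re
import secrets

# version(2 hex) - trace-id(32 hex) - parent-id(16 hex) - flags(2 hex)
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)

def make_traceparent(trace_id=None, parent_id=None, sampled=True):
    """Build a version-00 traceparent header value."""
    trace_id = trace_id or secrets.token_hex(16)   # 16 bytes -> 32 hex chars
    parent_id = parent_id or secrets.token_hex(8)  # 8 bytes -> 16 hex chars
    flags = "01" if sampled else "00"              # bit 0 = sampled
    return f"00-{trace_id}-{parent_id}-{flags}"

def parse_traceparent(value):
    """Split a traceparent header into its fields; return None if malformed."""
    m = TRACEPARENT_RE.match(value)
    if not m:
        return None
    version, trace_id, parent_id, flags = m.groups()
    return {"version": version, "trace_id": trace_id,
            "parent_id": parent_id, "flags": flags}
```

Each tier keeps the trace ID, generates a new parent ID for the spans it creates, and forwards the result, so all participating systems agree on which trace a request belongs to.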

Observability vs. monitoring

You may have noticed recently that nearly everyone is talking about “observability” while the term “monitoring” is frowned upon. You may have even heard that this or that vendor “does just monitoring” while other vendors provide true and pure observability.

Let’s look beyond the marketing buzz and start by defining what observability means.

Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.

Applied to the monitoring space, this means that the more telemetry data an application emits, the more observable it is.

Unknown unknowns

In conjunction with the term observability, you will often hear about “unknown unknowns.” These are problems that you are unaware of when adding instrumentation to your code. When a system is observable, it is more likely to emit the right data needed to find the root cause of a detected problem by analyzing its outputs.

Let me give you an example. Let’s say that your telemetry data shows that 8% of transactions fail when a user clicks the “Buy now” button in your shop.

Capturing this failure rate is the bare minimum of observability. The problem you are facing is an unknown unknown, as you didn’t anticipate it when instrumenting your application. The failure rate alone tells you that something is wrong, but not what is happening.

You can now increase the observability of your system by starting to collect all the arguments or request parameters involved when an order is processed.

With that data at hand, you can start to compare how failing transactions differ from successful ones. If you look at the data the right way, you will discover that, in this example, all failing transactions use the British pound as their currency. Drilling deeper, you will then find that currency conversion for the pound is broken.
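The kind of comparison described above can start out as simple as counting failures per attribute. A toy sketch in Python, using made-up transaction records:

```python
from collections import Counter

# Hypothetical transaction records collected as telemetry
transactions = [
    {"status": "failed", "currency": "GBP"},
    {"status": "ok",     "currency": "EUR"},
    {"status": "failed", "currency": "GBP"},
    {"status": "ok",     "currency": "USD"},
    {"status": "ok",     "currency": "EUR"},
]

# Count which currencies appear in failing vs. successful transactions
failed = Counter(t["currency"] for t in transactions if t["status"] == "failed")
ok = Counter(t["currency"] for t in transactions if t["status"] == "ok")

# Currencies that appear only in failures are strong suspects --
# here, every failure involves GBP while no successful transaction does.
suspects = set(failed) - set(ok)
```

In a real system this analysis runs over millions of transactions and many attributes at once, but the principle is the same: richer telemetry makes such correlations possible in the first place.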

The more data you collect, the more observable your system becomes, and the more likely you are to track down the root causes of such problems.

How observability differs from monitoring

Much has been said about observability being the next big thing and that some systems only provide monitoring but not observability, but is this really valid in 2021?

The overall fallacy in this assumption becomes obvious when you look at the definitions of the terms.

“Monitoring” comes from the verb “to monitor” and describes the act of looking at systems. The same can be said for “observing.” “Observability,” however, describes how observable a system is.

This means that you can monitor the outputs of a software system, but the fidelity of that data is determined by the system’s observability. If we look at the industry today, tools like Nagios may fall into the bucket of pure monitoring. All agent-based products add observability to applications by instrumenting them so that they emit telemetry data (and vendors let you analyze this data for unknown unknowns). That said, of all vendors, Dynatrace adds the most observability value by providing actionable answers on top of data, not just more data on glass.

OpenTelemetry

For many years, distributed tracing has been state of the art, used primarily by operations teams to identify problems in distributed systems.

For a long time, developers didn’t need distributed tracing as the applications they were responsible for were still monoliths and they could generate traces with local tracing tools that often came with their IDE.

This changed with the rise of microservices. Suddenly, developers were working on a set of services that together provided a specific capability, and tracing a single service with their IDE was no longer enough.

Developers needed distributed tracing as well, so vendors started building tools to solve this problem for them.

That’s how developer-centric observability tools like OpenCensus and OpenTracing emerged.

In 2019, the OpenCensus and OpenTracing projects merged into what we now know as OpenTelemetry. Today, OpenTelemetry is the second most popular CNCF project after Kubernetes.

Figure 4: OpenCensus and OpenTracing became OpenTelemetry

In a nutshell, OpenTelemetry provides

  • an agent that can be deployed into applications,
  • an API that can be used to manually report telemetry data from the application,
  • and some auto-instrumentation for well-known libraries and APIs.

OpenTelemetry aims to support three so-called observability signals, namely:

  • metrics
  • traces
  • and logs

At this point, only the tracing specification is stable; the metrics and logs specifications are expected to follow later this year and in 2022, respectively.

While OpenTelemetry does not provide analytics, a backend, or a UI, it defines a format and provides exporters to send the collected telemetry data to third-party systems like Dynatrace.

For a primer on OpenTelemetry, see “What is OpenTelemetry? Everything you wanted to know” and “How OpenTelemetry can improve Observability and Monitoring”, written by my peer at Dynatrace, Wayne Segar.

Dynatrace provides automatic observability

To make your system observable with Dynatrace, all you need to do is install OneAgent.

OneAgent will automatically instrument your applications, which will then start sending telemetry data to Dynatrace. Our Davis AI will instantly start analyzing this data for anomalies, and if a problem occurs, it will take you directly to the root cause (see Figure 5) and provide you with all the data needed to solve the problem.

Figure 5: Dynatrace root cause detail

Sometimes it makes sense to increase observability by adding application-specific information to your telemetry data. In the second post of this series, I will show you how to use OpenTelemetry to do this.

What’s next

In the second part of this series, I will cover the current state of OpenTelemetry, provide examples of how you can start using it today, and discuss possible risks. I will also cover how you can use Dynatrace in conjunction with OpenTelemetry to get actionable answers from it.

Also, if you’re interested in taking a deeper look at Advanced Observability and how it can benefit your organization, take a look at the Dynatrace eBook.

If you want to see Advanced Observability in action, take the Dynatrace 14-day free trial for a spin.
