OpenTelemetry enables automated operations management at scale

The OpenTelemetry project was created to address the growing need for AIOps as infrastructure complexity grows. Here's how it provides actionable answers.

The OpenTelemetry project was created to address the growing need for artificial intelligence-enabled IT operations — or AIOps — as organizations broaden their technology horizons beyond on-premises infrastructure and into multiple clouds.

OpenTelemetry provides a set of vendor-agnostic application programming interfaces (APIs) that create a common way to instrument applications and collect data from logs and traces across a wide variety of frameworks and languages. Gartner has estimated that 70% of new cloud-native application monitoring will use open source instrumentation by 2025.

The arrival of the OpenTelemetry initiative is timely, as development teams are increasingly becoming active in monitoring and observability efforts to accelerate release times and simplify management. Development teams like to use their favorite tools, but instrumenting every single one can quickly become impractical.

During the Dynatrace Perform 2022 session “Get actionable answers at scale from OpenTelemetry,” Dynatrace product manager Arlindo Lima and W.W. Grainger tech lead Jaspreet Sethi joined Jay Livens, Director of Product Marketing at Dynatrace, to explore how to get answers at scale from OpenTelemetry. Here’s what we learned.

Unified standard

The Cloud Native Computing Foundation has been promoting OpenTelemetry as a way to unify data collection. It uses standardized application programming interfaces that a wide variety of vendors and user organizations can support. More than 20 leading cloud and operations analytics vendors have added support to their products, including Dynatrace, which is one of the top contributors to the project.

OpenTelemetry was purposely conceived to complement, not compete with, existing analytical tools. It is not an analytics engine, and it doesn’t capture deeper observability signals such as CPU profiling, thread analysis, or memory allocation profiling.

“The decision not to have a back end in OpenTelemetry was intentional, so people wouldn’t have to worry about whether the back-end choice would limit their ability to use it,” said Arlindo Lima, Senior Product Manager at Dynatrace.

Manual or automated deployment

There are two ways to use OpenTelemetry: manually or using automation.

Users can add the API calls to their code manually to define exactly what should be measured and monitored continuously, for maintenance purposes, after the code is deployed. The reference implementation works with C++, .NET, Erlang/Elixir, Go, Java, PHP, Python, Ruby, Rust, and Swift, with support for additional languages to come.

The other option is automatic instrumentation. Here the API library is referenced from the application code, and OpenTelemetry makes assumptions about what needs to be measured.

“With auto instrumentation, you have less control over which operations are monitored, but it’s faster to implement and a good starting point when you don’t have insights into an application,” said Lima. “Start with auto instrumentation and, if you don’t get enough insights, move to manual instrumentation.”
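The trade-off Lima describes can be sketched in a few lines of Python. This is a toy stand-in written only for illustration and is not the OpenTelemetry API: the names `span`, `auto_instrument`, `FINISHED_SPANS`, `load_cart`, and `checkout` are all hypothetical. The wrapped function is recorded with nothing but its name, while the manually instrumented one carries the attributes its developer chose.

```python
import functools
import time
from contextlib import contextmanager

# Toy stand-in for a tracer backend; NOT the OpenTelemetry API.
FINISHED_SPANS = []  # every finished span is appended here as a dict


@contextmanager
def span(name, attributes=None):
    """Record one unit of work with a name, attributes, and a duration."""
    record = {"name": name, "attributes": dict(attributes or {})}
    start = time.perf_counter()
    try:
        yield record  # the caller may add attributes while the span is open
    finally:
        record["duration_s"] = time.perf_counter() - start
        FINISHED_SPANS.append(record)


def auto_instrument(func):
    """Auto style: wrap a function and take the span name from the function."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        with span(func.__name__):
            return func(*args, **kwargs)
    return wrapper


@auto_instrument
def load_cart(user_id):
    # The wrapper only knows that this function ran and how long it took.
    return ["drill", "gloves"]


def checkout(user_id):
    # Manual style: the developer picks the span name and the attributes
    # that matter for troubleshooting.
    with span("checkout", {"user.tier": "gold"}) as current:
        items = load_cart(user_id)
        current["attributes"]["cart.size"] = len(items)
        return len(items)
```

Calling `checkout(42)` leaves two entries in `FINISHED_SPANS`: a bare `load_cart` span followed by a `checkout` span that carries `user.tier` and `cart.size`. A real setup would use the OpenTelemetry SDK or an agent for the chosen language instead of these helpers.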


Extended visibility

Dynatrace’s observability platform is an example of how OpenTelemetry metrics can be enhanced for better visibility. Dynatrace can import metrics from OpenTelemetry and other open source tools — such as OpenTracing, Prometheus, and StatsD. Then, it can combine them with additional monitoring data specific to Dynatrace. This includes CPU activity, profiling, thread analysis, and network profiling. It then continually monitors an environment as containers are started and shut down, giving developers and administrators fine-grained observability. Dynatrace’s causation-based AI analysis enables AIOps by pinpointing areas that merit attention. This reduces the total volume of data that needs to be monitored.

“You can analyze all this data with Dynatrace AI intelligence bubbling up insights and problems, so your teams can react faster and more efficiently to incidents,” Lima said.

Taming complexity at W.W. Grainger

At industrial supply giant W.W. Grainger, observability enables the IT organization to give developers a standard set of tools to smooth the transition to “shift-left” DevOps. The company runs 350 active services. Jaspreet Sethi, tech lead at W.W. Grainger, explained, “We have many distributed systems and many product domains, each having their own priorities and issues. A large number of applications makes coordination across the organization difficult. This is only exacerbated by modernization and our move to the cloud.”

W.W. Grainger created a developer portal with application starter kits for monitoring. “Developers have free, instrumented software ready to go on day one with everything included out of the box,” Sethi said. They can run security scans, discover APIs, and track API consumption without extensive setup or configuration.

“The standardized tools allow teams to understand and debug issues as they arise, and the correlation capability in Dynatrace makes it easy to understand those events,” he said.

Getting to OpenTelemetry

Lima suggested a six-stage approach to adopting OpenTelemetry:

  1. Choose a vendor that is an active contributor to open source in general and the OpenTelemetry project in particular.
  2. Coordinate development, operations, and site reliability engineering (SRE) teams to identify the best use cases. These should focus on the data types that OpenTelemetry supports.
  3. Start small. Standards are still evolving, and some are considered experimental. “Tracing is a good place to start,” Lima said. “It’s considered mature, but the algorithms in most frameworks are considered experimental, metrics are still not yet stable, and logs are even further out.”
  4. Establish company-wide conventions so that teams use similar attribute names for easier filtering and troubleshooting.
  5. Don’t forget about data privacy. Developers shouldn’t need access to sensitive data to troubleshoot problems. Operations and SRE teams can use tools such as Dynatrace to control what type of data is captured.
  6. Aim high. “You want to make the goal problem discovery and resolution,” Lima said. Complement the signals you capture with OpenTelemetry with real user monitoring, synthetic transaction monitoring, auto-discovery of new nodes, and the ability to suppress false positives.
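Steps 4 and 5 above can be illustrated with a small helper that runs before telemetry leaves a service. This is a minimal sketch under stated assumptions: the alias table, the list of sensitive keys, and the function name `normalize_attributes` are hypothetical and would come from a team's own conventions. It renames known aliases to one agreed attribute name and masks values that should not be visible during troubleshooting.

```python
# Hypothetical company convention: canonical attribute name -> accepted aliases.
CANONICAL_NAMES = {
    "service.name": {"service", "svc", "serviceName"},
    "customer.id": {"customerId", "cust_id"},
}

# Hypothetical list of attributes that must not be exported in clear text.
SENSITIVE_KEYS = {"customer.email", "payment.card_number"}

# Reverse lookup built once: alias -> canonical name.
ALIAS_TO_CANONICAL = {
    alias: canonical
    for canonical, aliases in CANONICAL_NAMES.items()
    for alias in aliases
}


def normalize_attributes(attributes):
    """Return a copy with canonical attribute names and sensitive values masked."""
    cleaned = {}
    for key, value in attributes.items():
        key = ALIAS_TO_CANONICAL.get(key, key)  # unknown keys pass through as-is
        if key in SENSITIVE_KEYS:
            value = "[REDACTED]"
        cleaned[key] = value
    return cleaned
```

With these tables, `{"svc": "cart", "customer.email": "a@b.c"}` becomes `{"service.name": "cart", "customer.email": "[REDACTED]"}`, so every team can filter on the same attribute name and sensitive values stay out of the collected data.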

To learn more about OpenTelemetry efforts, watch the entire Perform 2022 session.
