“How do I structure my observability team?” is one of the most common questions folks leading software teams ask me. My advice: Don’t create a centralized “observability team” that’s responsible for all the observability within an organization.
Observability shouldn’t exist as a silo. It touches many parts of an organization, from development to production, and should be treated as a team sport.
As we know, our systems can only be considered observable if they emit telemetry. No data means that we can’t understand what is happening in our systems. Fortunately, the OpenTelemetry® (OTel) ecosystem from the Cloud Native Computing Foundation (CNCF) has become the de facto standard for instrumenting, generating, collecting, and exporting telemetry data.
What does this mean for observability adoption in an organization? Let’s dig in.
Observability is everyone’s responsibility
Reliability can’t happen without observability. Observability must be looked at holistically. It is not the sole responsibility of any one team or individual. Everyone has an important part to play, and to a certain extent, the parts weave into each other.
Instrumenting code
There are two types of OpenTelemetry instrumentation: code-based and zero-code.
Code-based instrumentation should be done by application developers, and not by an “observability team.” Developers know their applications best. Asking someone else to instrument your application is like asking someone else to write your code comments. Please never do that.
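To make this concrete, here’s a minimal sketch of code-based instrumentation in Python using the OpenTelemetry SDK. The service, span, and attribute names are illustrative, and the console exporter is just for local development:

```python
# pip install opentelemetry-api opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Print spans to stdout for local development; swap in an OTLP exporter
# to send data to a Collector or backend.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(ConsoleSpanExporter())
)

tracer = trace.get_tracer("checkout-service")  # illustrative service name

def process_order(order_id: str) -> None:
    # Developers wrap meaningful units of work in spans and record the
    # attributes they know matter for debugging.
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        ...  # business logic goes here
```

Notice that the developer decides what counts as a meaningful unit of work and which attributes are worth recording. That context is exactly what an outside team can’t provide.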
Zero-code instrumentation usually involves a shim or bytecode instrumentation wrapper around your code. If you’re a developer writing code in a language that supports OpenTelemetry auto-instrumentation, you should understand how to implement both zero-code and code-based instrumentation. In doing so, you can use the instrumentation to troubleshoot your own code.
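By contrast, here’s roughly what zero-code instrumentation looks like for a Python service, where `app.py` stands in for your entry point:

```bash
# Install the distro and pull in instrumentation libraries for the
# packages the app already uses.
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install

# Run the app under the wrapper; no changes to app.py required.
opentelemetry-instrument python app.py
```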
In some environments, zero-code instrumentation may be managed by the OTel Operator. If this is the case, the responsibility often falls to SRE or platform engineering teams. Even in those cases, developers should understand, at least at a high level, how zero-code instrumentation is configured with the OTel Operator.
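As a rough sketch of what that looks like: the platform team defines an Instrumentation resource in the cluster, and workloads opt in. The resource name and endpoint here are illustrative:

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: default-instrumentation
spec:
  exporter:
    endpoint: http://otel-collector:4317  # illustrative Collector address
```

Workloads then opt in with a pod annotation, such as `instrumentation.opentelemetry.io/inject-python: "true"` for a Python service.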
Managing observability infrastructure
Observability infrastructure still needs to be managed, whether you’re using a SaaS vendor (e.g. Dynatrace) or an open source stack. If you’re using OpenTelemetry, chances are you’re managing at least one OTel Collector, and perhaps many. If you’re running your applications on Kubernetes, you’ll likely deploy and manage Collectors within the cluster as well. In most organizations, this responsibility falls under platform engineering or SRE teams, and these teams are essential to robust, reliable software delivery in large, complex environments.
That said, developers should still understand how the OpenTelemetry Collector is configured. It’s true that, outside of production, you don’t need to go through a Collector to send OTel data to an observability backend. However, the Collector offers some nice things that sending directly from the application doesn’t (e.g. batching data, masking sensitive data, and automatic retries), and I still highly recommend using it, even in development.
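To make that concrete, here’s a minimal, illustrative Collector configuration demonstrating all three: batching, masking a sensitive attribute, and automatic retries on export. The endpoint and attribute key are placeholders:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:                # batch telemetry before export
  attributes:
    actions:
      - key: user.email # placeholder: strip a sensitive attribute
        action: delete

exporters:
  otlp:
    endpoint: backend.example.com:4317  # placeholder backend address
    retry_on_failure:
      enabled: true     # retry failed exports automatically

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [attributes, batch]
      exporters: [otlp]
```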
Making CI/CD pipelines observable
DevOps engineers can’t escape observability either, because guess what? We can make CI/CD pipelines observable too. While CI/CD pipelines may not be a production environment that external users interact with, they most certainly are a production environment that internal users interact with (i.e. software engineers, platform engineers, and SREs).
CI/CD pipelines are defined by code, and like it or not, that code can still fail. Making our application code observable helps us make sense of things when they fail in production. So, it stands to reason that having pipeline observability can help us understand what’s going on when CI/CD pipelines fail.
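Until purpose-built tooling matures (more on that in a moment), one pragmatic approach is to emit spans from the pipeline yourself. Here’s a minimal Python sketch that wraps a pipeline step in a span; the step name, command, attribute key, and Collector endpoint are all placeholders:

```python
# pip install opentelemetry-sdk opentelemetry-exporter-otlp
import subprocess

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Send spans to a Collector; the endpoint is a placeholder.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("ci-pipeline")

def run_step(name: str, command: list[str]) -> int:
    # Each pipeline step becomes a span, so failed builds show up in the
    # same backend as your production traces.
    with tracer.start_as_current_span(name) as span:
        result = subprocess.run(command)
        span.set_attribute("ci.step.exit_code", result.returncode)
        return result.returncode

run_step("unit-tests", ["pytest", "tests/"])  # placeholder step
```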
There’s been some great buzz around CI/CD pipeline observability, especially now that there’s an official OTel CI/CD Special Interest Group (SIG). The SIG will give our favorite CI/CD tools a shared language for pipeline observability, creating a foundation for them to support OpenTelemetry in this context.
We’re not there yet, which means that right now we must stitch a few tools together to achieve CI/CD observability. Fortunately, things are moving nicely in this space, and if you haven’t considered CI/CD pipeline observability in your organization before, now’s the time to start thinking about it. To learn more about what’s happening with OTel CI/CD observability, check out the #otel-cicd channel on CNCF Slack.
Troubleshooting
The beauty of observability is that once you instrument your code, you put the ability to troubleshoot in the hands of many. Consider the ripple effect when developers instrument their code:
- Developers: Instrumentation allows developers to debug their code as they’re writing it.
- QA testers: Instrumentation allows testers to troubleshoot failed tests and file more detailed bug reports. If testers can’t track down an issue, that signals missing instrumentation that developers need to add to their code. This turns observability into a quality gate.
- SREs: Instrumentation allows SREs to troubleshoot production issues, gain insight into system performance, and ensure overall system reliability.
Ensuring adherence to observability practices
Remember how I advised against creating an “observability team” responsible for all observability within an organization? I still stand by that. That said, I do believe that organizations should have an observability team responsible for enterprise-wide observability oversight and advocacy: a team that defines and disseminates observability standards and practices within the organization. This team would need to stay up to date on the latest observability practices, vendor offerings, and the OpenTelemetry ecosystem, not just as an observer but as a project contributor, while also encouraging developers, platform engineers, and SREs to contribute.
This “observability practices team” can’t, however, exist on an island. First, it needs to be aligned with leadership to ensure that everyone is on the same page when it comes to observability. It also needs support from individual practitioners, which means working with developers, SREs, platform engineers, QA testers, and DevOps engineers to ensure that the practices and standards it comes up with make sense.
If observability is to be a team sport, it needs coordination and guidance. There should be guardrails in place to ensure standard tooling, standard practices, and enforcement of those practices. Practices and standards include things like standard Collector configurations and standard attributes emitted to your chosen observability backend(s).
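For example, a practices team might publish a small helper that stamps every service with the organization’s standard resource attributes. This is a sketch; the attribute keys and values are hypothetical:

```python
# pip install opentelemetry-sdk
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

def standard_resource(service_name: str, team: str, environment: str) -> Resource:
    # One shared definition of "standard attributes" that every team
    # reuses, instead of each service inventing its own keys.
    return Resource.create({
        "service.name": service_name,
        "deployment.environment": environment,  # e.g. "dev", "staging", "prod"
        "org.team": team,                       # hypothetical org-specific attribute
    })

provider = TracerProvider(resource=standard_resource("checkout", "payments", "prod"))
```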
Standardizing tooling is important because I’ve seen far too many “tool jungles” in organizations, where each team or department has its own tooling and practices. The result is too much redundancy and overlap, and it’s a recipe for disaster.
In addition, the observability practices team should not instrument developers’ code, nor should it manage infrastructure. It’s there to work with those other groups and to make sure that things are done right.
Final thoughts
Observability weaves its way into many aspects of an organization. It’s not just a developer concern. It’s not just an SRE concern. It’s not just a QA concern. And it’s certainly not the concern of a single “observability team”: treating it that way downplays its importance, strips away our collective responsibility for observability, and dilutes its promise. The only way to make observability work is to ensure that the teams playing this team sport don’t operate in silos.