Erik Landsness, NOC Director at Beachbody, recently presented at our Perform User Conference. In his session he discussed his team’s mission to build a “zero-dashboard NOC”. In essence, Beachbody is embracing IT automation and letting machines do “the boring stuff” 😊. The key is that each bit of automation frees up time for even more automation, so teams level up their value to the organization beyond watching dashboards (and have more fun, too).
A Traditional NOC
Erik started at Beachbody about 18 months ago leading a small team of 3 people. Because of their size, and the company’s super-fast growth ($M to $B in a few short years), they were constantly trying to stretch beyond their limits and come up with better ways to work smarter, not harder.
When Erik first walked into the NOC, he saw a sea of red screens, yet the room was calm, with no panic. The alert noise was so overwhelming that it had neutralized concern, and red was just the feature of the day. The team was running the classic tools and drowning in alerts. In fact, Erik had 60,000 emails in his inbox on Day 1, more than any human could process. There was no obvious way to tell what was just smoke and where there was actual fire.
The early days at Beachbody, many tools, and drowning in a sea of red
Here is what was going on in Erik’s NOC (and maybe in yours, too):
- There was a whole lot of tooling going on and a tremendous amount of time was spent combing through data to find problems, locate sources, and investigate issues.
- Monitoring was myopic in the sense that it was focusing on things like CPU, memory, disk I/O, utilization, etc.
- The dashboards were pretty – and if a picture was worth a thousand words, things in the NOC were running okay.
- There was an overload of alert noise that made it impossible to clearly understand exactly what to focus on and for what reasons.
- There was little or no time available for creating or designing preventative processes.
These activities were time-consuming, sapped resources, and involved a lot of guesswork because the environment was riddled with blind spots. Plus, because of Beachbody’s fast growth, a lot of contractors came and went, building things and then leaving. Erik’s team didn’t even know what the total environment looked like; collectively, this was a maintenance nightmare.
Moving to a Zero-Dashboard NOC
With this as a base, and in the spirit of working smarter, not harder, Erik’s team defined four initiatives that characterized what they’d need to move toward a modernized ‘zero-dashboard’ NOC.
The team would need to:
- Be more virtual and agile – the days of sitting in front of dashboards would be a thing of the past, replaced with automation-driven self-discovery processes.
- Create an SRE function and level-up expertise so that responsibilities could be shared across the technology teams.
- Leverage AI to move from individual sensor alerts to full problem alerts; alerts that include automatic detection, root cause, and action-oriented detail for clear visibility and speed.
- Simplify the tool base and better integrate it with contextual monitoring data to automate actions, increase efficiencies and lower risk.
These initiatives would augment expertise and make the NOC more optimized, accurate, responsive and smart. And as crazy-busy as a NOC can be, the advice is to start small, but start! Automation frees up time and resources, which creates more time for automation … and, with the help of the right tools and technology, the path to zero-dashboards builds its own momentum.
Start with small, incremental improvements that have an impact quickly; this frees up more time for more automation, and momentum builds!
Today Erik’s team is successfully executing automated improvements. For example, the team now pushes everything into Slack and has created a single alerting channel where the NOC resides and receives alerts. With a very simple Slackbot (which they named Halbot 🙂), they have automated away the problem of teams not following the change process. Now anyone can call a change in Slack, which notifies the NOC with a timestamp; when the change is called complete, the NOC is updated again. This little bit of automation now also integrates with GitHub. It deploys, grabs data from monitoring, sees and acknowledges alerts, and creates ServiceNow tickets. It does a lot more, but this highlights how just a small piece of automation can free up a patch of time for doing other things, like writing code.
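To make the change-calling flow concrete, here is a minimal sketch of how a bot like this might track changes. It is not Halbot's actual code: the `ChangeBot` class, the `change start` / `change complete` command syntax, and the message format are all assumptions made for illustration, and the Slack wiring itself is left out so the logic can run on its own.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Dict, List, Optional

USAGE = "usage: change start <summary> | change complete <id>"


@dataclass
class Change:
    change_id: int
    owner: str
    summary: str
    started_at: datetime
    completed_at: Optional[datetime] = None


class ChangeBot:
    """Hypothetical change tracker: records changes and builds NOC updates."""

    def __init__(self, clock: Optional[Callable[[], datetime]] = None):
        # The clock is injectable so timestamps can be fixed in tests.
        self._clock = clock or (lambda: datetime.now(timezone.utc))
        self._next_id = 1
        self.changes: Dict[int, Change] = {}
        self.noc_feed: List[str] = []  # stands in for the NOC alerting channel

    def handle(self, user: str, text: str) -> str:
        """Process one chat message and return the bot's reply."""
        words = text.strip().split(maxsplit=2)
        if len(words) < 3 or words[0].lower() != "change":
            return USAGE
        action, arg = words[1].lower(), words[2]
        now = self._clock()

        if action == "start":
            change = Change(self._next_id, user, arg, now)
            self.changes[change.change_id] = change
            self._next_id += 1
            msg = (f"[{now.isoformat()}] change #{change.change_id} "
                   f"started by {user}: {arg}")
        elif action == "complete":
            change = self.changes.get(int(arg)) if arg.isdigit() else None
            if change is None or change.completed_at is not None:
                return f"no open change with id {arg}"
            change.completed_at = now
            msg = (f"[{now.isoformat()}] change #{change.change_id} "
                   f"completed by {user}: {change.summary}")
        else:
            return USAGE

        # Every accepted command produces a timestamped update for the NOC.
        self.noc_feed.append(msg)
        return msg
```

In a real deployment the `handle` method would be called from a Slack event handler and `noc_feed` would be replaced by a post to the NOC channel; keeping the state machine separate from the chat integration makes it easy to test.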
Dynatrace detects a problem and Erik’s team is notified via Slack; on the right is the evolution of the problem as seen through Dynatrace’s AI-enabled data.
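As a rough illustration of the notification step, the sketch below turns a detected problem into a message for a Slack incoming webhook. The keys of the `problem` dictionary (`title`, `state`, `impacted_users`, `root_cause`) are hypothetical and are not Dynatrace's actual notification schema; the only external detail relied on is that Slack incoming webhooks accept a JSON body with a `text` field.

```python
import json
from urllib import request


def build_slack_payload(problem: dict) -> dict:
    """Format a problem (hypothetical field names) as a Slack webhook payload."""
    lines = [
        f"*{problem['title']}* ({problem['state']})",
        f"Impacted users: {problem.get('impacted_users', 'unknown')}",
        f"Root cause: {problem.get('root_cause', 'not yet identified')}",
    ]
    return {"text": "\n".join(lines)}


def post_to_slack(webhook_url: str, payload: dict) -> int:
    """POST the payload to a Slack incoming webhook and return the HTTP status."""
    req = request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=10) as resp:
        return resp.status
```

Splitting the formatting from the HTTP call means the message content can be checked without a live webhook; `post_to_slack` would only be called with the webhook URL configured for the NOC channel.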
Erik is executing his vision in part on the power of Dynatrace and Dynatrace AI-powered data. Here are some additional Dynatrace capabilities that are helping, or will soon help, Beachbody’s vision for a zero-dashboard NOC:
- Beachbody needed a full understanding of their environment. With Dynatrace this could not be easier. Erik installed Dynatrace OneAgent and in minutes, his entire application stack was auto-discovered / auto-instrumented, end-to-end. (This single agent covers all technologies and processes, including containers).
“One of the things I love about Dynatrace is the OneAgent. You will be blown away by how amazing it is … install the agent and get complete visibility immediately”. – Erik Landsness, Perform User Conference
- Beachbody was looking to leverage artificial intelligence (AI) for automated, improved, more proactive monitoring. Dynatrace’s AI-powered technology automatically detects and prioritizes problems, and also provides action-oriented detail, e.g., how many users were impacted, what the root cause is, and what is most important. Guessing and alert overload are a thing of the past.
- The Dynatrace REST API will help Beachbody enable automation by integrating and enriching data from existing tools and feeding this into the AI engine. This creates more context, which can empower operations to create more automation, which in turn frees up time for more automation, and eventually for more interesting roles in the NOC.
These capabilities reinforce what Erik and his team are working toward in their zero-dashboard journey. Although they are not there yet, they can already see time freeing up that will allow them to take on more of an advisory role, one that helps technology teams understand what to look for and how to better monitor applications. With several projects in the works, they are also looking at building smarter, or self-healing, actions using AI and the increased contextual data it will provide. This is where the power of automation can play a big role by mercifully removing things like 2:00 a.m. escalation calls.
Beachbody no longer looks at TV dashboards for the sake of dashboards (although they still may use them for specific problems) and the days of getting 20 people on a bridge call are becoming fewer as they continue to automate IT and move closer to a true zero-dashboard NOC.
Here are video highlights of Erik’s presentation, “Beachbody’s Smarter Operations with Zero-Dashboard Monitoring – Let Machines Do the Boring Stuff”. And if you’d like to learn more about Dynatrace, you might start by watching “What is Dynatrace and How to Get Started”, a video by our performance activist, Andi Grabner. Just follow along by signing up for our SaaS-based Free Trial, or request an On-Premise Trial. It’s easy.