Modernizing Hybrid-Cloud Operations with Dynatrace

Many organizations are moving different types of workloads to the cloud, and they do it for many reasons: modernizing operations through automation, on-demand elasticity, or cost savings, to name a few.

I recently got to work with Jacek Jaworski, Development Manager at StepStone, on a presentation we gave together at AWS Summit in Berlin, Germany. Our session was titled “Re-Fitting a Monolith for Hybrid-Cloud Continuous Delivery” and will be the source of several blog posts, webinars and conference presentations. Jacek used the following architectural diagram (below, on the right) to explain their hybrid cloud model spanning on-premise and AWS workload deployments. The Dynatrace technology overview (left) shows how Dynatrace OneAgent provides full coverage and eliminates the need for many point monitoring tools:

Common hybrid cloud architecture spanning on-premise and public cloud with a variety of services and technologies in use.

Leverage DAVIS deterministic AI to guarantee smooth operations of migrated Microsoft workloads

In this blog, I want to highlight one specific use case Jacek presented around modernizing operations after migrating Microsoft-related workloads from their on-premise data center to the AWS cloud. While AWS CloudWatch provides cloud resource metrics and events, the StepStone team decided to leverage Dynatrace DAVIS and its deterministic AI for automated problem and root cause detection. Why? Because alerting on metric thresholds or point events doesn’t scale!

Jacek explained well how they leverage AIOps, picking as his example a recent issue on a Windows host running SQL Server that they had moved to AWS. This migrated host impacted SLAs on several of their critical services. He started that section of our presentation with the following screenshot of a Dynatrace dashboard. “While dashboards that show metrics and events are helpful to answer certain questions for different stakeholders – what really matters for our modern IT Operations team is whether we have an actual problem that impacts our SLAs”, Jacek said.

Lots of “good to know” information, but what really matters for a modern operations team are the Dynatrace DAVIS AI-detected problems

In the above screenshot we see that Dynatrace DAVIS is currently tracking an open problem impacting their SLAs. Because StepStone has set up the Problem Notification Integration with PagerDuty, the Incident Response Team automatically gets a PagerDuty notification. StepStone also uses PagerDuty’s integration with Slack, which means their teams collaborate through a Slack channel and get notifications like the one shown in the following screenshot:

Dynatrace DAVIS notifies PagerDuty, which connects the right teams and pushes the relevant information to a Slack channel.
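
StepStone relies on the ready-made integration here, so nothing needs to be coded. Purely as an illustration of what such a hand-off involves, the following minimal sketch maps a problem notification to an event body in the shape of the PagerDuty Events API v2. The input field names (`problem_id`, `state`, `title`, `impacted_entity`, `url`) and the fixed `critical` severity are assumptions made for this example, not the actual payload of either product's integration.

```python
from typing import Any


def to_pagerduty_event(problem: dict[str, Any], routing_key: str) -> dict[str, Any]:
    """Map a problem notification to a PagerDuty Events API v2 event body.

    The keys read from `problem` are hypothetical; a real notification
    payload depends on how the integration is configured.
    """
    event: dict[str, Any] = {
        "routing_key": routing_key,
        # Reusing the problem ID as dedup key keeps all updates for one
        # problem inside a single PagerDuty incident.
        "dedup_key": problem["problem_id"],
    }
    if problem["state"] == "RESOLVED":
        # A resolve event only needs the routing key and the dedup key.
        event["event_action"] = "resolve"
        return event

    event["event_action"] = "trigger"
    event["payload"] = {
        "summary": problem["title"],
        "source": problem["impacted_entity"],
        "severity": "critical",  # assumption: every SLA-impacting problem pages
    }
    # The link is what takes responders straight to the problem details.
    event["links"] = [{"href": problem["url"], "text": "Problem details"}]
    return event
```

The resulting dictionary would be sent as JSON to the Events API endpoint; the sketch deliberately stops before the network call so it stays free of credentials and side effects.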

Their incident response workflow automatically sets up a BlueJeans meeting where the whole team can collaborate. The Dynatrace link takes all of them to the problem details:

Dynatrace DAVIS detected the impact of the problem and identified the root cause as a slow disk on a Windows host running on AWS

Root cause is just a click away! It appears to be a slow disk on a Windows host running on AWS. All AWS metadata (tags, region, availability zone, host name …) is automatically pulled in as well through the Dynatrace AWS Integration:

Dynatrace OneAgent automatically pulls in all data from this host with details on the actual issue with this disk

Dynatrace DAVIS: The difference from metric- and event-based analysis

We often get questions around this, such as:

  • How is this different from just looking at AWS CloudWatch data?
  • Why do I need Dynatrace for this? What’s the real value-add?

Good questions!

And here is my answer: Taking the example of StepStone again, Dynatrace only notifies the team if the slow disk impacts availability, performance or end user behavior of any of the services that depend on this resource. It doesn’t alert on a slow disk when the slowdown doesn’t affect anything that matters!

OneAgent makes this possible. OneAgent goes beyond simply monitoring system metrics: it covers your full application stack end-to-end (host, processes, containers, services, applications & end users) and helps you understand all dependencies between hosts & services thanks to Dynatrace’s Smartscape & PurePath technology. Dynatrace additionally pulls in AWS CloudWatch data (and any other external metrics & events) to give the AI an even broader view of your entire hybrid cloud infrastructure!
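
To make the idea of impact-based alerting more concrete, here is a deliberately simplified sketch. It is not how DAVIS works internally; it only illustrates the principle that a resource anomaly becomes a notification when a degraded service actually depends on that resource. The dependency map and all entity names (`job-search-service`, `archive-host`, and so on) are hypothetical.

```python
# Hypothetical dependency map: each entity lists what it directly depends on.
DEPENDS_ON = {
    "job-search-service": ["sql-server"],
    "sql-server": ["windows-host"],
    "reporting-batch": ["archive-host"],
}


def dependents_of(entity, depends_on):
    """Return every entity that depends on `entity`, directly or transitively."""
    found = set()
    stack = [entity]
    while stack:
        current = stack.pop()
        for parent, children in depends_on.items():
            if current in children and parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def should_notify(resource, degraded_services, depends_on):
    """Notify only if a degraded service actually depends on the anomalous resource."""
    return bool(dependents_of(resource, depends_on) & set(degraded_services))


# A slow disk under a degraded service warrants a notification ...
print(should_notify("windows-host", {"job-search-service"}, DEPENDS_ON))
# ... while a slow disk nobody is suffering from does not.
print(should_notify("archive-host", set(), DEPENDS_ON))
```

A plain threshold alert would fire in both cases; gating on the dependency map is what keeps the second one quiet.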

They say that “a picture is worth a thousand words”. I think that an animated picture does an even better job! We captured the Dynatrace Problem Evolution for StepStone’s slow disk problem. The dependency graph in the middle shows the impacted Windows host at the bottom, the SQL Server running on it, and all the services across the hybrid cloud architecture that depend on this database. On the right you see a list of events that Dynatrace DAVIS automatically correlates to this problem, e.g. the slow disk, slow SQL statements as a result, a slow service call, and impacted SLAs at the top. All of this ends up in a single problem, creating a single alert.

All information in a single view to fix the root cause: The slow disk on that EC2 Windows host has a ripple effect on many other services.
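
The step from many events to one problem can be sketched in a few lines. Again, this is a toy model under stated assumptions and not the DAVIS algorithm: it treats an affected entity as a root cause when it does not depend on any other affected entity, and it attaches every event on a dependent entity to that root. Entity names, event texts and the dependency edges are made up for illustration.

```python
# Hypothetical topology and events, loosely modeled on the slow disk example.
DEPENDS_ON = {
    "frontend": ["listing-service"],
    "listing-service": ["sql-server"],
    "sql-server": ["windows-host"],
}

EVENTS = [
    ("frontend", "SLA impacted"),
    ("listing-service", "slow service call"),
    ("sql-server", "slow SQL statements"),
    ("windows-host", "slow disk"),
]


def reaches(entity, target, depends_on):
    """True if `entity` depends on `target`, directly or transitively."""
    stack = list(depends_on.get(entity, []))
    seen = set()
    while stack:
        current = stack.pop()
        if current == target:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(depends_on.get(current, []))
    return False


def correlate(events, depends_on):
    """Fold events on dependent entities into one problem per root cause."""
    affected = {entity for entity, _ in events}
    # A root is an affected entity that depends on no other affected entity.
    roots = sorted(
        entity for entity in affected
        if not any(reaches(entity, other, depends_on)
                   for other in affected if other != entity)
    )
    problems = []
    for root in roots:
        related = [(entity, what) for entity, what in events
                   if entity == root or reaches(entity, root, depends_on)]
        problems.append({"root_cause": root, "events": related})
    return problems


for problem in correlate(EVENTS, DEPENDS_ON):
    print(problem["root_cause"], len(problem["events"]))
```

With the four events above, the sketch yields one problem rooted at the Windows host instead of four separate alerts; an event on an unrelated host would open a second, independent problem.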

The additional value-add for enterprise cloud architects

While the Dynatrace AIOps approach enables modern operations teams to reduce the noise of metric- or event-based monitoring and focus on the root cause of actual problems, it also provides a benefit to enterprise architects.

Besides the visualization in the Problem Evolution, the Dynatrace ServiceFlow of those impacted transactions tells us a lot about how susceptible the current architecture is to issues like a slow disk or latency. Leveraging this data in architectural reviews allows enterprise architects to build more fault tolerance into their end-to-end transaction flows:

ServiceFlow shows hotspots in the overall architecture by highlighting services that are less tolerant to impacts such as the slow disk.

More to come …

I will be writing another blog post about how StepStone re-fitted their existing monolithic architecture into the current hybrid-cloud model. I will show you step-by-step how they use monitoring data to optimize end-to-end performance and how they plan to integrate monitoring data into their continuous delivery pipelines to provide faster feedback to engineers and architects (Shift-Left).
