Shift-Left with Dynatrace: Automating Performance Feedback with AWS, Mesos and Containers

I don’t think there is a correct or one-size-fits-all definition of Shift-Left, nor is there one for DevOps, Cloud-Native or any of these other heavily used terms. What I do know is that there are experts in our industry trying to make it easier for engineers to get faster self-service quality, performance and scalability feedback about their proposed code changes.

In a recent chat with one of our customers in the financial sector, I was told how they transformed their core financial services using a modern cloud-native architecture running on AWS. That transformation also changed their development practices, their delivery pipeline and how they provide feedback to engineering and business earlier in the development cycle!

To give you one example: Instead of executing performance tests at the end of the Kanban time-box or at the end of the release cycle, performance tests now run for every build deployment, plus daily scheduled runs against the latest code changes of each microservice. This gives both the infrastructure and development teams more and faster feedback about the impact of code or configuration changes on performance, scalability and resource consumption.

In a remote desktop session, their performance engineering team lead showed me their Dynatrace environment. I took a couple of screenshots and notes and hope that this will inspire others to redefine performance engineering with the power of the cloud, containers and Dynatrace AI. To respect their request for anonymity, I have blurred out sensitive information in the screenshots!

#1: Eight Feature Based Environments powered by Terraform, EC2, Mesos, Docker & OneAgent

Let’s start by explaining what type of environment the performance engineering team provides to the engineering teams to get faster performance feedback. They run a Dynatrace Managed installation in their AWS VPC, with 10 different Dynatrace Tenants to separate the different pre-production environments they are monitoring (Feature1, Feature2, … Feature n, Performance, etc.). If you want Dynatrace Managed instead of a SaaS instance on AWS, check out the recently announced Dynatrace Managed Cluster on AWS Quick Start.

On the infrastructure side their environments leverage Terraform and Ansible to deploy EC2 instances on which they run Mesos orchestration to host their Spring Boot Java services in Docker containers!

Here are some of the benefits they get out of the Dynatrace cloud and container monitoring capabilities:

Benefit #1: Dynatrace allows easy management of containers and images!

One challenge in very dynamic container environments is keeping track of all the currently deployed containers, where they run and how many resources they consume. Making sure that the containers that should be running have enough resources and can reach all the services they depend on is another challenge in highly dynamic environments. Dynatrace provides that type of visibility out-of-the-box! We call this “No-Touch Container Monitoring” as you don’t have to change anything in your container to get this visibility. This works for Docker, ECS, Fargate or any other container flavor:

Dynatrace provides full container visibility by either being deployed on the Docker host or running as a Docker container itself.

Benefit #2: Dynatrace provides automatic monitoring of container resource consumption

For a Docker host, Dynatrace provides visibility into the actual containers deployed on that host and shows how many resources these containers consume right now and over time:

Easy overview of all containers overall, grouped by image or by host. All these metrics are also automatically fed into the Dynatrace AI for automatic anomaly detection.

Benefit #3: Dynatrace leverages CloudWatch for Metrics and for Tags

Our customer has also set up the Dynatrace AWS CloudWatch integration, which not only automatically pulls in key CloudWatch metrics of AWS Services such as S3, EBS, RDS, DynamoDB, Lambda, ELB … but also pulls in Tags from EC2 instances, which get automatically applied to the monitored hosts where a Dynatrace OneAgent is installed. This makes managing all performance data much easier because dashboards, filters and notifications can be set up using the same tags as already defined on AWS.

Dynatrace automatically monitors RDS, is aware of frequent CPU saturation problems, sees backup events and applies the AWS Tags to this monitored entity

And here is the Dynatrace AWS Overview showing how many AWS services and resources the teams are using in their Performance-Dev Environment:

Most of this data is pulled from AWS CloudWatch and gets combined with data captured from Dynatrace OneAgent, Plugins or pushed via the Dynatrace REST API.

TIP: If you want to learn more about basic AWS monitoring, I suggest you walk through my 101 AWS Monitoring GitHub Tutorial or watch my 101 AWS Monitoring Performance Clinic on YouTube. There I explain the integration with CloudWatch, how to deploy OneAgents on EC2 and how to monitor applications deployed on Beanstalk, ECS or Lambda!

#2: Ensuring Healthy Cloud Infrastructure with help of Dynatrace AI

While having more monitoring data available is great, thanks to the Dynatrace OneAgent and the AWS CloudWatch integration, it doesn’t mean that the team can spend more time analyzing more data points. This is where the Dynatrace automatic baselining, anomaly and root cause detection helps.

I asked them if they had an example of how the Dynatrace AI helps their infrastructure team keep a stable and healthy environment for the services that the development teams run on top.

They opened the Dynatrace Problem View, clicked on the Infrastructure filter and then walked me through the following screenshots. They showed me how Dynatrace detected a network connectivity issue of several haproxy instances running in Docker containers orchestrated by Mesos across several EC2 instances.

Benefit #1: Dynatrace’s automatic anomaly detection across all monitored entities and metrics

The Dynatrace OneAgent monitors every single container and all processes running in these containers. OneAgent automatically detects technologies and services such as haproxy, message queues, web- or application servers, databases, …

The Dynatrace anomaly detection understands which metrics are important for each type of service and reports an anomaly if a metric shows problematic or unusual behavior. Thanks to this auto-detection capability, DevOps teams can react much quicker to infrastructure-related problems before they start impacting the services that run on them:

Automatic detection of Connectivity issues on this HAProxy that runs in a container on Mesos on EC2

Benefit #2: Dynatrace Automatic Dependency and Impact Detection

Thanks to Smartscape, the infrastructure team not only knows which infrastructure components or critical services are currently in an unhealthy state. Smartscape also shows all the dependent services that haproxy connects to. Based on that information, it is easier to understand the potential impact on higher-level services, applications or even end users. This also helps to prioritize remediation actions, whether executed manually or automated.

Smartscape visualizes where the problematic HAProxy actually “lives” and which other services are depending on it

TIP: If you want to learn more about how Dynatrace helps your IT Service and Operations (ITSM), check out the information around our ServiceNow Integration or watch the YouTube tutorial on how you can integrate Dynatrace with any CMDB. If you want to see other examples of detected infrastructure problems, check out my recent blog on AI In Action: RabbitMQ, Cassandra and JVM Memory.

#3: Shift-Left Performance Feedback with the help of Dynatrace AI

What runs on top of this dynamic cloud & container infrastructure? I am sure you guessed the answer: Services that their development teams are trying to get performance feedback on. The primary type of service they implement with Spring Boot exposes REST APIs for their B2B offerings. Some of these APIs have well-defined SLAs, which is why their team decided to define several custom thresholds for the different REST endpoints.

When developers make code changes, those get automatically deployed with the next scheduled build and get automatically tested. LoadRunner and JMeter are used to generate the load against their various REST APIs. These tests typically run for a little over one hour, after which developers proactively turn to the Dynatrace dashboards to analyze how their code was performing, where the hotspots are and where there might be scalability issues.

Benefit #1: Dynatrace AI reduces time for analyzing performance test results

While Dynatrace gives them access to all data through the dashboards, the team started to see the benefit in time savings when using the problems and root causes automatically detected by the Dynatrace AI. The time saved can be better spent on building new features instead of manually analyzing the same metrics, log files, stack traces or CPU samples every time a test executes.

The following screenshot shows an automatically detected problem that happened during one of the automated load test runs. There was a 47% slowdown of a specific REST API endpoint caused by a CPU spike on that EC2 Linux machine, where Mesos hosts the Tomcat process in a container:

Dynatrace automates all the manual work a performance engineer would do, highlighting slowdowns on individual endpoints and surfacing the potential root cause.

Benefit #2: Dynatrace Pro-Active Alerting of Dev Teams

Thanks to the host, process and service tagging capabilities of Dynatrace, each service is tagged with the name of the team responsible. In case Dynatrace detects a problem, the team automatically gets notified thanks to the Alerting Profile feature in Dynatrace. Alerting Profiles allow sending problem notifications ONLY to the teams whose services are affected by a detected problem. The notification (email, JIRA, Slack, Mattermost, Flowdock) also gets sent out immediately when the problem is detected and not only at the end of the test. This shortens the feedback loop in case a code change has an obvious issue, which Dynatrace can detect within minutes.
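To make the tag-based routing idea concrete, here is a minimal Python sketch. It is not how Alerting Profiles are implemented or configured in Dynatrace; the problem records, the `team` key and the channel names are hypothetical and only illustrate the principle that a notification goes to the owning team and to nobody else.

```python
from collections import defaultdict


def route_problems(problems, team_channels):
    """Group problem titles by the notification channel of the owning team.

    problems: list of dicts with a "title" and a "team" tag (hypothetical shape).
    team_channels: mapping of team tag to notification channel (hypothetical).
    """
    routed = defaultdict(list)
    for problem in problems:
        channel = team_channels.get(problem.get("team"))
        if channel is None:
            # No team tag or no channel configured: nobody gets notified.
            continue
        routed[channel].append(problem["title"])
    return dict(routed)


if __name__ == "__main__":
    detected = [
        {"title": "Response time degradation", "team": "payments"},
        {"title": "CPU saturation", "team": "infra"},
    ]
    channels = {"payments": "#payments-alerts", "infra": "#infra-alerts"}
    for channel, titles in route_problems(detected, channels).items():
        print(channel, titles)
```

The point of the sketch is the filter: a problem without an owner tag is dropped rather than broadcast, which is why consistent tagging of every service matters.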

Dynatrace can notify teams immediately when the problem is detected. This shortens feedback loop time.

Benefit #3: Dynatrace speeds up fault domain isolation between Infrastructure and Application

The CPU spike on that host probably makes you wonder: is it a problem with the infrastructure, the container or the actual app? Clicking into the root cause box in the problem ticket shows all the data captured from that EC2 Linux machine. We clearly see that there are a lot of other processes and containers running on that same box, all competing for CPU, memory, disk and the network. It is very interesting to learn that the same box also runs Swagger (on Node.js), Kafka (two Jetty instances) and Filebeat (Go) besides our Tomcat that hosts the service under test:

The Dynatrace OneAgent gives full visibility into every process and container running on this EC2 Linux machine

A click on “Consuming processes” gives us a detailed CPU breakdown of all processes & containers on that machine – clearly highlighting that the cause is our Apache Tomcat process:

Dynatrace gives us key resource metrics for every process over time. Easy to spot that it was indeed Tomcat consuming all that CPU!

Benefit #4: Dynatrace detects code-level problems in custom and 3rd party code

Now that we know the problem lies within Tomcat, it is easy to find out what caused it. Dynatrace provides several hotspot detection options such as the response time analysis. Code execution is the clear “winner” in this case, followed by two SQL statements:

Response time analysis highlights the top hotspots down to SQL statements, queue access, service calls or method execution.

As the SLA for their REST API endpoints is 200ms, it is interesting to learn why most of this time is consumed in code execution. More interesting is WHERE in the code the time is spent. Clicking on “Code execution” in the response time analysis infographic brings up the method hotspots view with a detailed breakdown. We have a winner: Hibernate!

44.9% of the total code execution time is spent retrieving data from the database through Hibernate.

The above screenshot shows the hotspots across all the requests that exceeded the 200ms SLA. This is very useful and makes it easier to analyze and fix hotspots that are impacting many transactions and not just individual requests.

Thanks to the Dynatrace PurePath technology, every single transaction is also available for inspection. Seeing the PurePath allows engineers to better understand the sequence of code execution, which is very useful in distributed or asynchronous transactions.

Dynatrace PurePath gives you full insight into every single end-to-end transaction. Very useful for engineers to understand where time is spent!

Tip: I get a lot of questions from users who integrate Dynatrace with their load testing tools. Make sure to check out my blog on Load Testing Redefined or watch my Load Testing YouTube tutorial. If you want to learn more about the diagnostics options in Dynatrace, check out Basic Diagnostics or Advanced Diagnostics with Dynatrace.

What else can be done with Dynatrace?

While the performance engineering team at our customer already leverages Dynatrace to speed up continuous feedback cycles to engineering as well as the infrastructure team, there is more that can be done. Here are some additional ideas:

#1: Use the Dynatrace REST API to automate “sanity checks”:

  • How many containers are running vs. how many should run?
  • Do we run too many containers on a single host?
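Both sanity checks come down to simple counting once the container inventory is available. The following Python sketch assumes the inventory has already been fetched (for example through the Dynatrace REST API) and flattened into `(image, host)` tuples; that data shape, the function name and the `max_per_host` limit are assumptions for illustration, not part of any Dynatrace API.

```python
from collections import Counter


def container_sanity_check(running, expected, max_per_host):
    """Compare the running containers against the desired state.

    running: list of (image, host) tuples, one per running container.
    expected: mapping of image name to the number of containers that should run.
    max_per_host: hypothetical upper limit of containers per host.
    Returns (count_mismatches, crowded_hosts).
    """
    per_image = Counter(image for image, _ in running)
    per_host = Counter(host for _, host in running)

    # image -> (actually running, expected), only where the two differ
    count_mismatches = {
        image: (per_image.get(image, 0), want)
        for image, want in expected.items()
        if per_image.get(image, 0) != want
    }
    # host -> number of containers, only where the limit is exceeded
    crowded_hosts = {host: n for host, n in per_host.items() if n > max_per_host}
    return count_mismatches, crowded_hosts


if __name__ == "__main__":
    running = [("orders", "h1"), ("orders", "h1"), ("billing", "h1"), ("billing", "h2")]
    print(container_sanity_check(running, {"orders": 3, "billing": 2}, max_per_host=2))
```

Run on a schedule or as a pipeline step, a non-empty result from either check is a signal to look at the environment before trusting the next load test.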

#2: Use the Dynatrace REST API to automate “deployment optimizations”:

  • Re-distribute top resource consuming containers to avoid resource constraints
  • Co-locate containers that are “tightly coupled”. This prevents network contention issues
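As a rough illustration of the first idea, re-distributing the top consumers can be approximated with a greedy placement: sort containers by CPU usage and always put the next one on the currently least-loaded host. This is a generic scheduling heuristic, not something Dynatrace or Mesos does for you; the container names and CPU numbers are made up, and co-locating tightly coupled containers is deliberately left out of this sketch.

```python
import heapq


def spread_by_cpu(container_cpu, hosts):
    """Greedy placement: biggest CPU consumers first, each on the least-loaded host.

    container_cpu: mapping of container name to observed CPU usage (any unit).
    hosts: list of host names to distribute the containers across.
    Returns a mapping of container name to host.
    """
    # Min-heap of (accumulated load, host) so the least-loaded host pops first.
    heap = [(0.0, host) for host in hosts]
    heapq.heapify(heap)

    placement = {}
    # Sort by descending CPU, then by name to keep the result deterministic.
    for name, cpu in sorted(container_cpu.items(), key=lambda kv: (-kv[1], kv[0])):
        load, host = heapq.heappop(heap)
        placement[name] = host
        heapq.heappush(heap, (load + cpu, host))
    return placement


if __name__ == "__main__":
    usage = {"tomcat": 70, "kafka": 30, "swagger": 10, "filebeat": 5}
    print(spread_by_cpu(usage, ["h1", "h2"]))
```

In the example, the heavy Tomcat container ends up alone on one host while the three lighter containers share the other, which is the kind of separation that would have avoided the CPU contention described earlier.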

#3: Use the Dynatrace REST API to automate “remediation actions”:

#4: Integrate Dynatrace tighter into your CI/CD Pipeline
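One way to tighten the pipeline integration is a quality gate that fails the build when an endpoint breaks its SLA or slows down noticeably compared to the previous build. The Python sketch below assumes the response times per endpoint have already been pulled from monitoring into plain dictionaries; the function name, the data shape and the 10% default tolerance are assumptions for illustration only.

```python
def quality_gate(results_ms, sla_ms, baseline_ms=None, max_slowdown=0.10):
    """Return a list of violations; an empty list means the build passes.

    results_ms: endpoint -> response time of the current build in milliseconds.
    sla_ms: endpoint -> SLA threshold in milliseconds.
    baseline_ms: optional endpoint -> response time of the previous build.
    max_slowdown: tolerated relative slowdown against the baseline (0.10 = 10%).
    """
    baseline_ms = baseline_ms or {}
    violations = []
    for endpoint, observed in sorted(results_ms.items()):
        limit = sla_ms.get(endpoint)
        if limit is not None and observed > limit:
            violations.append(
                f"{endpoint}: {observed:.0f} ms exceeds SLA of {limit:.0f} ms"
            )
        previous = baseline_ms.get(endpoint)
        if previous and observed > previous * (1 + max_slowdown):
            slowdown = (observed / previous - 1) * 100
            violations.append(f"{endpoint}: {slowdown:.0f}% slower than previous build")
    return violations


if __name__ == "__main__":
    import sys

    problems = quality_gate(
        {"/quote": 294, "/trade": 150},
        sla_ms={"/quote": 200, "/trade": 200},
        baseline_ms={"/quote": 200, "/trade": 145},
    )
    for line in problems:
        print(line)
    # A non-zero exit code lets the CI server mark the build as failed.
    sys.exit(1 if problems else 0)
```

Returning the violations as text rather than a plain pass/fail flag keeps the feedback actionable: the developer sees which endpoint regressed without opening a dashboard first.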

If you have any further questions let me know. Happy to give you more insights into how Shift-Left can be done with Dynatrace and happy to share your own implementation. Just let me know!
