Dynatrace AI In Action: Rogue Python Script Impacting Atlassian DevOps Tools

Addteq, your DevOps experts and a partner of both Atlassian and Dynatrace, is building tighter DevOps use case integrations between Dynatrace and the Atlassian tool suite. Addteq also educates our joint customers on how our tools can optimize their end-to-end delivery processes.

Not only are they building these integrations and educating you on DevOps best practices, but they also use Dynatrace to monitor their own internal DevOps toolchain such as JIRA, Confluence, Bitbucket and Bamboo, to name a few!

On a recent status call with Himanshu Chhetri (CTO) and Sukhbir Dhillon (CEO), they mentioned how well the Dynatrace AI worked for them after deploying the Dynatrace OneAgent on their internal servers. Dynatrace automatically identified the root cause of a slowdown in their JIRA and Confluence instances, even before their developers were heavily impacted by the problem. This speaks to the proactive nature of Dynatrace.

#1 – The Dynatrace Problem Ticket

Every time Dynatrace detects an anomaly in your environment it creates a problem ticket. Thanks to the Dynatrace JIRA Integration, which is currently being extended by Addteq, the problem ticket automatically created a JIRA ticket in Addteq’s JIRA instance. This ticket triggered their own internal problem resolution workflow.
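To make the ticket hand-off more concrete, here is a minimal sketch of how a problem notification could be mapped to the body of a JIRA create-issue request. This is not Addteq's integration or the Dynatrace JIRA Integration itself. The shape of the `problem` dictionary, the `OPS` project key, the `Bug` issue type and the function name are all assumptions made for illustration; only the general `fields` layout follows what JIRA's REST create-issue endpoint expects.

```python
def problem_to_jira_issue(problem, project_key="OPS"):
    """Build a JIRA create-issue body from a problem notification.

    The keys read from `problem` are hypothetical; adapt them to whatever
    your notification payload really contains.
    """
    impacted = ", ".join(problem["impacted_entities"])
    return {
        "fields": {
            "project": {"key": project_key},   # assumed project key
            "issuetype": {"name": "Bug"},      # assumed issue type
            "summary": f"[Dynatrace] Problem {problem['id']}: {problem['title']}",
            "description": f"State: {problem['state']}\nImpacted: {impacted}",
        }
    }


# Made-up sample notification, loosely modelled on the problem described here.
sample_problem = {
    "id": "P-0001",
    "title": "Failure rate increase",
    "state": "OPEN",
    "impacted_entities": ["JIRA", "Confluence", "TokenResource"],
}

issue = problem_to_jira_issue(sample_problem)
print(issue["fields"]["summary"])
```

Keeping the mapping in a pure function like this makes it easy to test the ticket content separately from the HTTP call that would actually create the issue.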

The Dynatrace Problem ticket indicated that six services were impacted, including JIRA and Confluence as well as some shared services such as the TokenService, which experienced a very high failure rate:

Dynatrace automatically detected an unusually high failure rate on the TokenResource service.

#2 – Problem Evolution

One of the features that gets people excited about the Dynatrace AI is that Dynatrace correlates all events that happen on all dependent components into a single problem ticket. Instead of having to look at events from your log monitoring, infrastructure monitoring, application performance monitoring and end user monitoring tools, you get all this information in a single spot: Dynatrace!

The one view in the Dynatrace Web UI that shows all these events along a timeline is the Problem evolution view, which gives us a time-lapse option to “replay” the chain of events. Here is the problem evolution for their problem:

Dynatrace Problem Evolution showing that the problem went through two problem phases.

#3 – Automatic Deployment Detection

If you look at the distribution of events (top right), you can see that this problem went through two phases. Each phase shows a spike of events coming through: the first shortly after midnight, the second shortly before 2 AM.
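The idea of "phases" can be illustrated with a few lines of Python. The sketch below groups event timestamps into bursts whenever the gap between two consecutive events exceeds a threshold. It is only an illustration of the concept, not how the Dynatrace AI works internally; the timestamps, the date and the 30-minute gap are made-up assumptions.

```python
from datetime import datetime, timedelta


def split_into_phases(timestamps, max_gap=timedelta(minutes=30)):
    """Group timestamps into phases; a gap larger than max_gap starts a new phase."""
    phases = []
    for ts in sorted(timestamps):
        if phases and ts - phases[-1][-1] <= max_gap:
            phases[-1].append(ts)   # close enough to the previous event
        else:
            phases.append([ts])     # first event, or gap too large: new phase
    return phases


# Hypothetical event times: one burst after midnight, one shortly before 2 AM.
events = [
    datetime(2018, 1, 1, 0, 5),
    datetime(2018, 1, 1, 0, 7),
    datetime(2018, 1, 1, 0, 12),
    datetime(2018, 1, 1, 1, 50),
    datetime(2018, 1, 1, 1, 52),
]

phases = split_into_phases(events)
for number, phase in enumerate(phases, start=1):
    print(f"phase {number}: {len(phase)} events, starting {phase[0]:%H:%M}")
```

With the sample data above, the events fall into two phases, mirroring the two spikes in the event distribution.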

The first bulk of events is related to a restart and a redeploy of the jira-install service. Turns out that this is “normal.” Well, kind of normal. Digging through the automatically detected deployment events shows us that every time they restart that service, we see high CPU and error log messages, resulting in some of the failures we can observe in the other dependent services:

Dynatrace shows us that every time jira-install gets restarted we see some impact on the other dependent services.

#4 – Python process gone ROGUE!

The second bulk of events is related to the real problem. Turns out that the server addteq-crowd, a Linux machine hosting Atlassian Crowd, runs out of CPU. Crowd is the single sign-on service used by all other Atlassian tools such as JIRA and Confluence. If this service is impacted, everyone else is impacted too.

Dynatrace detected the CPU saturation issue on that Linux machine and also created an event that was correlated to our problem ticket.

Looking closer at this Linux machine shows us that it is not Crowd itself, which runs in the Tomcat container, that uses all the CPU. Turns out it is a Python-based app called duplicity, which is used for file and directory backups:

Dynatrace OneAgent automatically monitors all processes – including the ones consuming all the CPU!
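If you had to answer the same question by hand on a single Linux host, you could rank processes by CPU from the output of `ps -eo pid,pcpu,comm`. The sketch below parses such output and sorts it. The sample text, PIDs and percentages are invented for illustration, and the function name is hypothetical; OneAgent collects this data automatically and does not need a script like this.

```python
# Made-up sample in the column layout produced by `ps -eo pid,pcpu,comm`.
SAMPLE_PS_OUTPUT = """\
  PID %CPU COMMAND
 1201  3.1 java
 2210 96.4 python
  845  0.2 sshd
"""


def top_cpu_processes(ps_output, limit=3):
    """Return (command, pid, cpu_percent) tuples, highest CPU first."""
    rows = []
    for line in ps_output.strip().splitlines()[1:]:   # skip the header row
        pid, cpu, command = line.split(None, 2)
        rows.append((float(cpu), int(pid), command))
    rows.sort(reverse=True)                           # highest CPU first
    return [(command, pid, cpu) for cpu, pid, command in rows[:limit]]


for command, pid, cpu in top_cpu_processes(SAMPLE_PS_OUTPUT):
    print(f"{command:<10} pid={pid:<6} cpu={cpu:.1f}%")
```

In this invented sample the Java process (think Tomcat hosting Crowd) is nearly idle while a Python process takes almost all the CPU, which is the pattern described above.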

Duplicity runs on every one of Addteq’s hosts but only runs into high CPU on addteq-crowd. This can easily be seen by looking at the Dynatrace Process Group overview for Duplicity, which shows the resource consumption of all instances of Duplicity across all hosts where it runs:

Dynatrace automatically monitors every process instance across all hosts, with easily accessible charts in the Process Group Details view.
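The comparison across hosts boils down to spotting the one instance that behaves very differently from its peers. Here is a small sketch of that idea, assuming you already have one CPU reading per host for the same process group. The host names other than addteq-crowd, all CPU values, and the `factor` and `floor` parameters are hypothetical and chosen purely for illustration.

```python
from statistics import median


def find_cpu_outliers(cpu_by_host, factor=5.0, floor=10.0):
    """Return hosts whose CPU usage is far above the typical (median) value.

    A host counts as an outlier when its CPU exceeds both `factor` times the
    median and an absolute `floor`, so tiny differences between idle hosts
    are not reported.
    """
    typical = median(cpu_by_host.values())
    threshold = max(typical * factor, floor)
    return sorted(host for host, cpu in cpu_by_host.items() if cpu > threshold)


# Made-up CPU percentages for the duplicity process group on each host.
duplicity_cpu = {
    "addteq-crowd": 95.0,
    "host-a": 2.0,
    "host-b": 1.5,
    "host-c": 3.0,
}

outliers = find_cpu_outliers(duplicity_cpu)
print(outliers)
```

Using the median rather than the mean keeps the single runaway instance from dragging the baseline up and hiding itself.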

Tip: Process Group Detection is a key capability in Dynatrace. The automatic detection works extremely well but can always be customized to your special needs. To learn more check out Mike Kopp’s blog on Enhanced Process Group Detection.

Actions based on Problem Detection

Our friends from Addteq weren’t aware that Duplicity had an issue on a single machine and didn’t know the actual impact it had on Atlassian Crowd, which in turn impacted all the other services. Because Dynatrace automatically analyzes all this data and is aware of all these dependencies, it can identify problems that we would normally not even have thought of.

It’s great to see a partner like Addteq not only building integrations and thereby extending the Dynatrace ecosystem, but also “walking the talk”: they actively use Dynatrace to make sure their systems run optimally and their employees are not impacted by any Python scripts gone rogue! 😊

If you want to try Dynatrace yourself, simply sign up for our Dynatrace SaaS Trial. If you want to learn more about how to optimize JIRA and Confluence in particular, read my blog on “Optimizing Atlassian JIRA and Confluence Productivity with Dynatrace”.

Andreas Grabner has 20+ years of experience as a software developer, tester and architect and is an advocate for high-performing cloud scale applications. He is a regular contributor to the DevOps community, a frequent speaker at technology conferences and regularly publishes articles on blog.dynatrace.com. You can follow him on Twitter: @grabnerandi
