Dynatrace for Developers: Fixing memory leaks in our cloud-native services on k8s

Dynatrace enables our customers to monitor and optimize their cloud infrastructure and applications through the Dynatrace Software Intelligence Platform. On top of that, our CNCF open source project, Keptn, brings cloud automation use cases around progressive delivery and operations to our users, enabling Dynatrace customers to establish a NoOps culture like the one we have in Dynatrace engineering.

A big part of our success at Dynatrace is that we use Dynatrace® across the software lifecycle of our own software projects. That spans the platform, our website, our support platform, our business backends, and our open source projects such as Keptn.

Today’s story is about how the Keptn development team is using Dynatrace during development and load-testing. We want to share how Dynatrace helped us identify and fix memory leaks in one of the most central and critical components within Keptn: our event broker.

It happened in June 2020. We were in the process of developing a new feature and wanted to make sure it could handle the expected load behavior. For that reason, we started a simple load-test scenario where we flooded our event-based system with 100 cloud-events per minute. At some point we noticed this error coming up in our load-test logs:

Error: Send * was unsuccessful. Post http://event-broker.keptn.svc.cluster.local: dial tcp *:*: connect: connection refused
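
For readers who want to reproduce a similar scenario, here is a minimal sketch of a paced load generator. It is not the load-test tooling the Keptn team used. The broker URL is taken from the error message above, while the event type, source, and payload are hypothetical placeholders. The `send` and `sleep` parameters exist only so the pacing logic can be exercised without a live cluster.

```python
import json
import time
import urllib.request
import uuid

# Assumption: URL copied from the error message above; adjust for your cluster.
BROKER_URL = "http://event-broker.keptn.svc.cluster.local"


def build_cloud_event(event_type="com.example.loadtest", source="load-test"):
    """Return a minimal CloudEvents 1.0 envelope (type and source are hypothetical)."""
    return {
        "specversion": "1.0",
        "id": str(uuid.uuid4()),
        "type": event_type,
        "source": source,
        "datacontenttype": "application/json",
        "data": {"message": "synthetic load-test event"},
    }


def post_event(event, url=BROKER_URL, timeout=5.0):
    """POST one event as structured-mode JSON and return the HTTP status."""
    request = urllib.request.Request(
        url,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/cloudevents+json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def run_load_test(events_per_minute=100, total_events=100,
                  send=post_event, sleep=time.sleep):
    """Send events at a fixed pace and return the number of failed sends."""
    interval = 60.0 / events_per_minute
    failures = 0
    for _ in range(total_events):
        try:
            send(build_cloud_event())
        except OSError as error:  # covers URLError and "connection refused"
            failures += 1
            print(f"Send was unsuccessful: {error}")
        sleep(interval)
    return failures
```

Calling `run_load_test()` with the defaults sends 100 events over roughly one minute. A crashed broker shows up as a growing failure count, much like the error above.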

Hurray, we broke it! Now let’s fix it. We started looking at Kubernetes where we saw that the event-broker pod kept crashing. From there we went into the logs of the event-broker container but couldn’t find any indication of any obvious error. We were kind of in the dark on this one!

Houston, we have a problem!

Our container logs didn’t contain any valuable root-cause information, and digging through the events in our Kubernetes cluster was not an efficient option either. We would eventually have found the information we were looking for, but the event log is unfiltered and searching it would have cost a lot of time.

Luckily for us, our load-test environment has Dynatrace’s OneAgent installed, which automatically monitors our complete Keptn deployment on k8s. We had also enabled Dynatrace’s Kubernetes monitoring via the documented ActiveGate approach. Using the Dynatrace Kubernetes dashboard, we were able to see that several out-of-memory (OOM) events were happening in our cluster, coincidentally around the time we started the load-test at 08:00:

Dynatrace gives us an automated overview of all k8s events happening in our namespaces

Okay, so this sounds like a classic memory leak, and the event log shown in Dynatrace confirmed that it was happening in the event-broker process:

Dynatrace shows us all relevant information about k8s events such as out of memory killing
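
Without a monitoring dashboard, the same signal can be recovered from pod status: Kubernetes records `OOMKilled` as the termination reason in a container's `lastState`. The following is a minimal sketch of that check, not something the team describes doing. The sample document is invented for illustration, including the pod name and timestamp, and is only shaped like the output of `kubectl get pods -n keptn -o json`.

```python
import json


def find_oom_killed(pod_list):
    """Return (pod, container, finished_at) for containers last killed for OOM."""
    hits = []
    for pod in pod_list.get("items", []):
        pod_name = pod.get("metadata", {}).get("name", "<unknown>")
        for status in pod.get("status", {}).get("containerStatuses", []):
            terminated = status.get("lastState", {}).get("terminated")
            if terminated and terminated.get("reason") == "OOMKilled":
                hits.append(
                    (pod_name, status.get("name"), terminated.get("finishedAt"))
                )
    return hits


# Hypothetical sample data; names and timestamps are made up for this sketch.
sample = json.loads("""
{
  "items": [
    {
      "metadata": {"name": "event-broker-example"},
      "status": {
        "containerStatuses": [
          {
            "name": "event-broker",
            "restartCount": 3,
            "lastState": {
              "terminated": {
                "reason": "OOMKilled",
                "exitCode": 137,
                "finishedAt": "2020-01-01T00:00:00Z"
              }
            }
          }
        ]
      }
    },
    {
      "metadata": {"name": "healthy-example"},
      "status": {
        "containerStatuses": [
          {"name": "app", "restartCount": 0, "lastState": {}}
        ]
      }
    }
  ]
}
""")

for pod_name, container, finished_at in find_oom_killed(sample):
    print(f"{pod_name}/{container} was OOMKilled at {finished_at}")
```

This only reports the most recent termination per container, so it complements rather than replaces an event timeline like the one shown in the dashboard.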

Can we fix it? Yes, we can!

To fix the memory leak, we leveraged the information provided in Dynatrace and correlated it with the code of our event broker. Thanks to the simplicity of our microservice architecture, we were able to quickly identify that our MongoDB connection handling was the root cause of the memory leak. We filed a pull request on GitHub, merged and backported the fix, and deployed the change to our Kubernetes cluster. Then we ran the load-tests again. Voilà, no more error messages. The problem was solved within hours of detecting it.
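
The post does not show the actual change, so the following is only an illustration of one common way connection handling can leak: opening a new client for every event and never closing it, compared with sharing one long-lived client. `FakeMongoClient` is a stand-in written for this sketch, not a class from any real MongoDB driver, and `EventStore` is a hypothetical name.

```python
class FakeMongoClient:
    """Stand-in for a database client; NOT a real MongoDB driver class."""

    open_clients = 0  # clients that were opened but not yet closed

    def __init__(self):
        FakeMongoClient.open_clients += 1
        self.closed = False

    def insert(self, event):
        if self.closed:
            raise RuntimeError("client is closed")

    def close(self):
        if not self.closed:
            self.closed = True
            FakeMongoClient.open_clients -= 1


def handle_event_leaky(event):
    """Leaky pattern: a new client per event that is never closed."""
    client = FakeMongoClient()
    client.insert(event)


class EventStore:
    """Safer pattern: one long-lived client shared by all events."""

    def __init__(self):
        self._client = FakeMongoClient()

    def handle_event(self, event):
        self._client.insert(event)

    def close(self):
        self._client.close()


def simulate(events=1000):
    """Return the open-client count after each pattern handles `events` events."""
    FakeMongoClient.open_clients = 0
    for i in range(events):
        handle_event_leaky({"id": i})
    leaky = FakeMongoClient.open_clients

    FakeMongoClient.open_clients = 0
    store = EventStore()
    for i in range(events):
        store.handle_event({"id": i})
    shared = FakeMongoClient.open_clients
    store.close()
    return leaky, shared


print(simulate())  # prints (1000, 1)
```

With a real driver, each unclosed client would hold sockets and buffers, so memory would grow with the number of events handled. That growth pattern matches a process that is OOM-killed under sustained load.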

But did we really solve the problem, or did we miss something?

We adapted our load-test to run with even higher load and used Dynatrace to confirm that the OOM events were gone. First win! However, to be 100% sure we had fixed the memory leak for good, we had to dig deeper and investigate the event broker process and its critical health metrics provided by the Dynatrace OneAgent.

If you look at the screenshot below you can see that we deployed the new version at around 15:00. The climbing number of total requests reflects that Dynatrace is truly monitoring each individual simulated request from our load-test.

What was great for us in the second load-test run was that our memory leak fix didn’t just keep the memory footprint at around 20 MB; our event-broker also had no problems handling a much higher load than before.

Our fix resulted in higher throughput with less resource consumption. And of course: no more OOMs

Conclusion: Dynatrace is always on for us developers

Thanks to using Dynatrace as part of our day-to-day development work, we were able to fix a problematic memory leak, which not only resulted in a more stable event broker service but also improved the resource efficiency of that service. As a team, we’re now much more sensitive to OOM events, and Dynatrace helps us identify those early in development and during our load-tests, rather than in production. If you want to learn more about Dynatrace and Keptn, or even want to join us in our efforts, check out our Keptn GitHub repo or join us on Slack for a conversation.
