Thanks to our friends from Prep Sportswear for letting me share their memory leak detection story with you. It is a story about “fighting technical debt” in software that matured over the years, with the original developers no longer on board to optimize or fix their code.
Check out their online store and browse through their pages. Especially cool are the product detail pages, where they use some really nice server-side rendering to show you the final customized product you are about to buy. After reading this blog you will also understand how much memory your actions consume on their ASP.NET servers and why that is not always a good thing 🙂
Overall I was impressed when they showed me the level of service they deliver to their end users. They have found some interesting ways to monitor End User Experience, especially when it comes to monitoring Bot traffic (this will take a blog of its own). Richard Dominguez from Prep Sportswear and I were also impressed when we found the reason for the following very interesting memory pattern that, until now, forced them to prematurely recycle their AppPools every 6 hours to avoid an inevitable OutOfMemory crash:
Richard, a “Developer in Operations”, took several memory dumps along the incline of memory in that 6-hour period. He used Dynatrace and its Leak Detection Heap Dump, which captures all objects on the heap including object references as well as the values of simple types (String, Int, …). Let me tell you which hotspots we found that were OK and which ones were causing the memory leak!
Hotspot #1: Product Catalog Cache
The hotspot view is always a great way to start as it highlights the largest objects on the heap. We drilled into the Object and CachedProductCatalog entries and concluded that it is fine to have these objects on the heap, as they are all used to cache product information. We also double-checked, by comparing multiple heap dumps, that there is a mechanism to “expire” objects from the cache so that it does not grow endlessly:
Clicking on the Object from the Hotspot brings us to the actual Object Reference Tree showing us how these objects relate to each other, how many objects are actually on the heap and why they are not cleared by the Garbage Collector:
As I mentioned in the beginning: this was not the memory leak we were hunting as we could see these objects being collected by the GC when comparing multiple heap dumps.
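To make the "compare multiple heap dumps" step a bit more concrete, here is a minimal sketch in Python. It is not how Dynatrace works internally and it is not code from Prep Sportswear (their application runs on ASP.NET). It assumes each dump has been reduced to a simple mapping of type name to live instance count. That snapshot format, the name SuspectType and all the numbers are made up for illustration. A cache that expires entries will go up and down between dumps, while a leak tends to rise in every single one.

```python
from typing import Dict, List

# Hypothetical snapshot format: type name -> number of live instances.
Snapshot = Dict[str, int]


def steadily_growing(snapshots: List[Snapshot]) -> List[str]:
    """Return type names whose instance count rises in every consecutive dump."""
    if len(snapshots) < 2:
        return []
    all_types = set().union(*snapshots)  # every type seen in any dump
    suspects = []
    for type_name in all_types:
        counts = [snap.get(type_name, 0) for snap in snapshots]
        if all(later > earlier for earlier, later in zip(counts, counts[1:])):
            suspects.append(type_name)
    return sorted(suspects)


# Invented counts for illustration only, not measurements from this story.
dumps = [
    {"CachedProductCatalog": 1200, "SuspectType": 5000},
    {"CachedProductCatalog": 1150, "SuspectType": 9000},
    {"CachedProductCatalog": 1210, "SuspectType": 14000},
]

print(steadily_growing(dumps))  # prints ['SuspectType']
```

The product catalog count dips in the second dump, so it is not flagged. That is the same reasoning we applied by eye when we decided the cache was healthy.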
Hotspot #2: 300MB Duplicated String of Stack Traces
The big “A-Ha” moment came when we looked at the Duplicated Strings. Turns out that more than 300MB are consumed by Strings holding very similar stack trace information:
Every single String instance can be explored, including the full content it holds. The stack trace stored in these strings already revealed where the objects were created: in the constructor of their BlockStream class.
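If you want a feel for what a "duplicated strings" view computes, the following Python sketch groups identical string values and estimates how much memory the extra copies occupy. It is a simplification under stated assumptions: it only counts exact duplicates (the strings we saw were "very similar", not necessarily identical), it estimates size from the UTF-16 payload only (ignoring per-object overhead), and the sample values are placeholders rather than real heap content.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def duplicate_string_report(strings: Iterable[str]) -> List[Tuple[str, int, int]]:
    """Return (value, copies, wasted_bytes) for each duplicated value, largest waste first."""
    counts = Counter(strings)
    report = []
    for value, copies in counts.items():
        if copies > 1:
            # Approximate payload size as UTF-16 (2 bytes per char for this sample).
            payload_bytes = len(value.encode("utf-16-le"))
            # Every copy beyond the first one is redundant.
            report.append((value, copies, payload_bytes * (copies - 1)))
    report.sort(key=lambda row: row[2], reverse=True)
    return report


# Placeholder values standing in for strings found on the heap.
heap_strings = ["trace-1"] * 3 + ["unique"] + ["ab"] * 2

for value, copies, wasted in duplicate_string_report(heap_strings):
    print(value, copies, wasted)
# prints:
# trace-1 3 28
# ab 2 4
```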
Memory Leak: BlockStream objects are referenced in global Object
Knowing that these Strings hold 300MB of heap space “hostage” for keeping stack trace information was one thing. Now it was time to figure out why they are not cleaned up by the Garbage Collector. Looking at the Object Reference Tree of these String objects shows that the stack traces are not only captured in the constructor of the BlockStream class but are also stored in a member variable. All BlockStream objects are then referenced and cached in a global XtplCache which never seems to be cleared, and that is the memory leak.
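The pattern described above can be reproduced in a few lines. The sketch below is a hypothetical Python reconstruction, not the actual BlockStream or XtplCache code (which is .NET and which I am not showing here). Only the two names come from the heap dump; the fields, the handle_request helper and the request loop are assumptions made for illustration.

```python
import traceback
from typing import Dict

# Global cache: entries are added on every request and never removed.
XTPL_CACHE: Dict[int, "BlockStream"] = {}


class BlockStream:
    def __init__(self, stream_id: int) -> None:
        self.stream_id = stream_id
        # A full stack trace is captured on every construction and kept in a
        # member variable, so it lives exactly as long as the object does.
        self.creation_trace = "".join(traceback.format_stack())


def handle_request(request_id: int) -> BlockStream:
    """Hypothetical request handler that creates and caches a BlockStream."""
    stream = BlockStream(request_id)
    XTPL_CACHE[request_id] = stream  # reachable from a global: never collectable
    return stream


def retained_trace_chars() -> int:
    """Total number of stack trace characters kept alive by the cache."""
    return sum(len(stream.creation_trace) for stream in XTPL_CACHE.values())


for request_id in range(1000):
    handle_request(request_id)

print(len(XTPL_CACHE), retained_trace_chars() > 0)  # prints 1000 True
```

Because the cache is reachable from a global and each cached object holds its own trace string, the Garbage Collector has no way to reclaim either. The retained memory grows with every request until the process is recycled.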
Final Findings: Why did we end up here? How do we get out of here?
In talking with Richard and his colleagues we found out that most of the original developers were no longer with the company. It therefore remains a mystery why the developers decided to create a full stack trace every time a BlockStream object is created and to keep it in memory. Maybe it was for some custom logging? The fact is that this implementation causes a lot of memory overhead and forced them into premature recycling of their servers.
Their goal is to get this code cleaned up. Now that they know where the problem is – down to the line of code – it will be (fairly) easy for the current engineering team to fix this issue even though it is a fix on code that they “inherited”.
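I do not know which fix the team will choose, so take the following only as one possible direction, sketched in Python rather than in their .NET code. It assumes two changes: the stack trace is only captured when a debug switch is on, and the cache is bounded with least-recently-used eviction. The switch CAPTURE_CREATION_TRACE, the limit MAX_CACHED_STREAMS and the BoundedCache class are all hypothetical names.

```python
import traceback
from collections import OrderedDict
from typing import Optional

CAPTURE_CREATION_TRACE = False  # hypothetical debug switch, off in production
MAX_CACHED_STREAMS = 100        # hypothetical upper bound for the cache


class BlockStream:
    def __init__(self, stream_id: int) -> None:
        self.stream_id = stream_id
        # Only pay for the stack trace when somebody is actually debugging.
        self.creation_trace: Optional[str] = (
            "".join(traceback.format_stack()) if CAPTURE_CREATION_TRACE else None
        )


class BoundedCache:
    """Cache that evicts the least recently used entry once it is full."""

    def __init__(self, max_entries: int) -> None:
        self._max_entries = max_entries
        self._entries: "OrderedDict[int, BlockStream]" = OrderedDict()

    def put(self, key: int, stream: BlockStream) -> None:
        self._entries[key] = stream
        self._entries.move_to_end(key)  # mark as most recently used
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # drop the oldest entry

    def get(self, key: int) -> Optional[BlockStream]:
        stream = self._entries.get(key)
        if stream is not None:
            self._entries.move_to_end(key)
        return stream

    def __len__(self) -> int:
        return len(self._entries)


cache = BoundedCache(MAX_CACHED_STREAMS)
for request_id in range(1000):
    cache.put(request_id, BlockStream(request_id))

print(len(cache))  # prints 100
```

Either change alone would already shrink the footprint. Together they remove both the per-object overhead and the unbounded growth.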
Thanks again for this great story and for allowing us to share it with the public. I am sure there are many similar stories out there of code bases that grew over time, with original developers no longer at the company and new teams trying to keep the system running. These types of diagnostic steps help engineering teams keep track of and manage what the industry calls “Technical Debt”. It is an exercise that the team at Prep Sportswear does constantly to improve their system performance, so that many folks can buy their cool sportswear without any bad user experiences.
If you want to do this type of analysis on your own, feel free to download the Dynatrace Free Trial and follow some of my YouTube Video Tutorials on how to diagnose performance, architectural and scalability issues.