As a performance architect I get called into various production performance issues. One of our recent production issues involved our Tibco Business Events (BE) service constantly violating our Service Level Agreements (SLAs) after running 10-15 hours since the last restart. If we kept the services running longer, we would see them crash with an “out of memory” exception. This is the typical sign of a classic memory leak!
In this blog I’ll walk through the steps we took to analyze this issue in our production environment, the tools we used, some background on Java garbage collection, and how the problem was resolved. I hope you find this useful!
Tibco Memory Leak in our Production Environment
Performance engineering is the science of discovering problem areas in applications under varying but realistic load conditions. It is not always easy to simulate real traffic and find all problems before going live. It is therefore advisable to determine how to analyze performance problems not only in test but also in the real production environment. Having the right tools installed in production allows us to analyze issues and find root causes that are hard to simulate in testing.
The following diagram visualizes our service layout. Tibco BE is used as an inventory availability rule engine, Tibco Business Works (BW) as a data access service, and Tibco Active Space as an in-memory data grid.
In order to avoid any SLA breaches, the operations team implemented a very typical, yet not very elegant, damage-prevention measure: restart the services every 10-11 hours, before they became too slow and eventually crashed with an out-of-memory exception.
My team was tasked with finding the root cause of the growing memory consumption, which was believed to also slow down the application before the actual crash happened. Our goal was to fix the memory issue once and for all.
Memory Monitoring with Dynatrace AppMon
Our tool of choice was Dynatrace AppMon, which we already used for live performance monitoring of all our services in production. Let me walk you through the steps in Dynatrace with which we identified the memory leak and its root cause. In case you want to try it on your own, I suggest you:
- Obtain your own Dynatrace Personal License
- Watch the Dynatrace YouTube Tutorials
- Read up on Java Memory Management
Step 1: Basic Memory Metrics Monitoring
In the Dynatrace Client you start with Monitoring > Infrastructure and then select your host server. We narrowed down the problem by analyzing the key Java process and memory metrics which are automatically monitored by Dynatrace. Dynatrace pulls these metrics natively through the JVMTI interface. Here is what we found:
- No available free memory in Old Generation Heap Space
- As a result, major garbage collections (GC) constantly kick in to try to free memory for process execution
- High GC cycle times resulting in runtime suspensions
- High suspension times resulting in slow request processing
- The slowdown resulted in missed SLAs and errors thrown by the API Gateway (no response from BE)
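The symptoms above can also be cross-checked from inside the JVM itself. Here is a minimal sketch using the standard java.lang.management MXBeans; this is a generic JVM API (not Dynatrace’s JVMTI integration), and the class name and output format are my own:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class GcMetrics {
    public static void main(String[] args) {
        // Per-pool usage: look for the Old Generation pool running near its max.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage u = pool.getUsage();
            System.out.printf("%-35s used=%d MB max=%d MB%n",
                    pool.getName(), u.getUsed() >> 20, u.getMax() >> 20);
        }
        // Cumulative GC counts and times: a steadily climbing major-GC time
        // is the "high GC cycle time" symptom described above.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%-35s count=%d time=%d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```

Polling these two beans periodically and plotting the results gives a rough, tool-free approximation of the process health charts shown here.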
Step 2: Seeing the Ripple Effect on the API Gateway
Having Dynatrace installed on all our services allows us to fully leverage Dynatrace PurePath Technology. The following screenshot of a PurePath shows how the API Gateway service tries to reach Tibco BE. Tibco BE, however, is struggling with garbage-collection suspension, which triggers the internal SLA and causes the API Gateway to throw an error back to the caller as well as log some error messages. Having this end-to-end view available makes it very easy to understand service-to-service interactions, how SLAs and timeouts are handled, and what the ripple effect of a bad service is on the initial caller!
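The gateway-side behavior, an internal SLA timeout that converts a hung downstream call into an error for the caller, can be sketched generically. The 2-second SLA, the method names, and the error text below are illustrative stand-ins, not our actual gateway configuration:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class SlaGateway {
    // Illustrative downstream call: it blocks as if Tibco BE were stuck
    // in a long garbage-collection suspension.
    static String callBusinessEvents() {
        try {
            Thread.sleep(5_000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "availability-response";
    }

    // Gateway-side handling: enforce the internal SLA and translate a
    // breach into an error response for the caller.
    static String handleRequest() {
        return CompletableFuture.supplyAsync(SlaGateway::callBusinessEvents)
                .orTimeout(2, TimeUnit.SECONDS) // illustrative 2-second SLA
                .exceptionally(ex -> {
                    Throwable cause = (ex.getCause() != null) ? ex.getCause() : ex;
                    return (cause instanceof TimeoutException)
                            ? "ERROR: no response from BE"
                            : "ERROR: " + cause;
                })
                .join();
    }

    public static void main(String[] args) {
        System.out.println(handleRequest()); // prints "ERROR: no response from BE"
    }
}
```

The point of the sketch is the asymmetry a PurePath makes visible: the caller fails fast at its SLA while the suspended downstream service is still busy, so the error surfaces far away from its root cause.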
Step 3: Observing the “Precautionary” Restarts
The automatic restarts that were put in place to prevent a service crash can be easily observed by looking at the same Dynatrace Process Health Dashboard over a longer time period. We can see the restarts happening every 10-11 hours:
Memory Leak Detection with Dynatrace
After looking at all the memory metrics to observe current memory behavior and the impact of high memory usage, garbage collection, and the automated restarts, it is time to find the root cause of the growing Old Generation heap space.
Memory Snapshots in Dynatrace
Dynatrace automatically captures so-called Total Memory Dumps during out-of-memory exceptions, so we were able to see all the objects that were in memory when one of these crashes happened in the past:
A good practice is to focus on custom or framework classes, e.g. be.gen.xx, instead of Java runtime classes such as String. Why? Because these custom classes most likely reference lots of those Strings and are therefore to blame for the high memory consumption.
Dynatrace Trending Memory Snapshot
Besides the Total Memory Dumps, it is recommended to take so-called Trending Memory Dumps over a longer time period. These trending dumps give insight into which objects are really growing over time and are the main reason for a growing heap. In our case most of the objects in the be.gen package are growing:
Dynatrace Reference Tree Analysis
Knowing which classes are growing allows us to go back to the Total Memory Dump to see who is holding references to these objects and thereby blocking them from being garbage collected:
We see the same pattern with other be.gen instances as well, as shown in the following screenshot where we walked the reference tree for a single be.gen instance all the way to the Tibco DefaultDistributedCacheBasedStore:
Analyzing the Tibco Cache Object Creation Behavior
Based on the findings in the reference tree, we checked how we use the DefaultDistributedCacheBasedStore. It turns out we create a ConcurrentHashMap for each be.gen object and keep adding it to the DefaultDistributedCacheBasedStore through the CacheConceptHandler. These ConcurrentHashMap objects are never garbage collected and therefore fill up the heap.
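The pattern we found can be reduced to a short sketch. The class and method names below are stand-ins for the generated be.gen classes and the Tibco store, not the real Tibco API; the point is that a long-lived store accumulates a strong reference to every per-object map, so clearing a map’s entries never makes the map itself collectible:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

public class LeakSketch {
    // Stand-in for the long-lived cache store: it holds a strong
    // reference to every map ever registered with it.
    static final List<ConcurrentHashMap<String, Object>> STORE = new ArrayList<>();

    // Stand-in for handling one be.gen concept: a fresh map is created
    // and registered, but never removed from the store afterwards.
    static void handleConcept(String id) {
        ConcurrentHashMap<String, Object> map = new ConcurrentHashMap<>();
        map.put(id, new Object());
        STORE.add(map);
        map.clear(); // clears the ENTRIES, but the map object itself stays reachable
    }

    public static void main(String[] args) {
        for (int i = 0; i < 100_000; i++) {
            handleConcept("concept-" + i);
        }
        // All 100,000 maps are still strongly referenced from the store:
        // the GC can reclaim their entries, but never the maps themselves,
        // so the heap only ever grows between restarts.
        System.out.println("maps retained: " + STORE.size());
    }
}
```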
To identify where our code creates these objects, we used the Dynatrace Memory Sensor Rule feature for our be.gen.xx classes. This feature allows us to selectively monitor the creation and lifetime of these objects on the heap. The following screenshot shows the output of a so-called Selective Memory Dump taken 1 hour 55 minutes after a JVM restart:
After 1 hour 55 minutes we see 1.3 million objects on the heap. None of the objects created in that timeframe were ever collected by the GC, which explains our memory leak.
Based on the information collected from our production environment, we did code walk-throughs with the developers and found many places where a ConcurrentHashMap reference was not set to null. We also opened support tickets with Tibco Professional Services to understand why concept.clear was not cleaning up objects even though our developers expected it to. Tibco Support suggested deleting the unused cache concept instances and assigning the HashMap references to null even after calling the HashMap.clear API of Tibco BE.
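In sketch form, again with stand-in names rather than the actual Tibco API, the fix amounts to removing the map from the long-lived store and dropping the local reference, instead of relying on clear() alone:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

public class FixSketch {
    // Stand-in for the long-lived cache store.
    static final List<ConcurrentHashMap<String, Object>> STORE = new ArrayList<>();

    static void handleConceptFixed(String id) {
        ConcurrentHashMap<String, Object> map = new ConcurrentHashMap<>();
        map.put(id, new Object());
        STORE.add(map);
        try {
            // ... work with the map ...
        } finally {
            map.clear();       // drop the entries
            STORE.remove(map); // drop the store's strong reference
            map = null;        // drop the local reference, mirroring the support advice
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 100_000; i++) {
            handleConceptFixed("concept-" + i);
        }
        // Nothing accumulates: every map becomes unreachable and collectible.
        System.out.println("maps retained: " + STORE.size()); // prints "maps retained: 0"
    }
}
```

Nulling a local reference is usually redundant in Java once the variable goes out of scope; the decisive step is removing the map from the store that outlives the request.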
After the code changes were functionally tested, we decided to apply the fix on one of the production servers and take that server out of the automatic JVM recycle process we had in place.
We waited 48 hours to see whether the server remained healthy or still suffered from the growing memory. We were delighted to watch on the Dynatrace Process Health Dashboard how our cached objects were correctly promoted to the Old Generation over the course of several hours. When the Old Generation reached a critical point, the GC kicked in and correctly cleared out the memory without running into an out-of-memory exception: