We work with many performance engineers who have Tibco Business Works (BW) in the mix of technologies they are responsible for.
This particular story comes from A. Alam – a performance engineer responsible for a large enterprise application that uses Tibco to connect its different system components. Alam and his team identified two unrelated memory leaks in their Tibco installation using Dynatrace Application Monitoring. One of the leaks was caused by a well-known problem in Tibco’s internal message handling. The other was caused by configuration and implementation issues. If you want to follow Alam’s steps, simply download the 30 Day Free Trial of Dynatrace.
Issue #1: Out of Memory caused by leaking Tibco Business Event (BE)
The following screenshot is from Alam’s Tibco BW production system and shows the main performance counters every performance engineer of a Java- or .NET-based application needs to look at: heap size by memory generation (Young and Old in Java; Gen 0–2 and the Large Object Heap in .NET), time spent in Garbage Collection, and number of GC activities per heap space of the Tibco BW JVM:
Tip: If you want to read up on Java Memory Management check out the Memory Chapter of the Java Enterprise Online Performance Book. If you want to learn about how to use Dynatrace to analyze memory problems check out my How To Quickly Find and Analyze Memory Leaks blog.
The screenshot above also shows a very typical memory allocation pattern: (1) new objects are allocated in the Young space and (2) promoted by Garbage Collector (GC) activity from Eden to Survivor and later into the Old Generation. Once the Old Generation is full (3), the GC tries to clean objects from that space (4). If it cannot free enough memory, an OutOfMemoryError is thrown, telling the application that memory cannot be allocated for newly instantiated objects. If this situation doesn’t get resolved, the application will crash (5).
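The heap and GC counters described above can also be read directly from inside any JVM via the standard java.lang.management MXBeans. Here is a minimal sketch – note that the pool names you see depend on the garbage collector in use (e.g. “PS Old Gen” vs. “G1 Old Gen”), so treat them as examples:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

// Minimal sketch: reading per-generation heap usage and GC activity
// counters – the same counters shown on the dashboard – from the JVM itself.
public class HeapCounters {

    // Sum of used bytes across all heap memory pools (Young + Old, etc.).
    public static long usedHeapBytes() {
        long used = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP) {
                used += pool.getUsage().getUsed();
            }
        }
        return used;
    }

    public static void main(String[] args) {
        // One line per memory pool: name, type, used and max bytes.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage u = pool.getUsage();
            System.out.printf("%-25s %-8s used=%,d max=%,d%n",
                    pool.getName(), pool.getType(), u.getUsed(), u.getMax());
        }
        // One line per collector: number of collections and time spent in GC.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("GC %-25s collections=%d time=%dms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```

A monitoring tool like Dynatrace collects these same counters continuously; the sketch is just to show where they come from.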
Dynatrace automatically captures a full memory dump when a JVM throws an OutOfMemoryError. This allowed Alam to do a post-mortem analysis showing him which objects were on the heap when the crash happened. To his and the software engineers’ surprise, it turned out that Tibco was still holding on to their custom Business Event objects for every single message it had processed. This should normally not happen, as these objects are no longer needed once Tibco has processed them. The following screenshot shows the list of classes (obfuscated for security reasons), the instance count, and the memory allocated for these objects on the heap. If you try this yourself using Dynatrace, have a look at the Total Memory Dashlet:
Root Cause & Fix: It turned out that this was a configuration and implementation issue which the engineers could easily fix once they saw that these objects were kept in memory.
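This kind of implementation issue is common in message-driven code: processed events get registered in a long-lived collection (a cache, an audit list, a listener map) and are never removed. The following sketch uses entirely hypothetical class and method names – not Tibco APIs – to illustrate the leak pattern and the fix:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the kind of issue that keeps processed
// event objects reachable. Names are illustrative, not Tibco APIs.
public class EventProcessor {

    // LEAK: a static, ever-growing list keeps every processed event
    // strongly reachable, so the GC can never reclaim any of them.
    private static final List<Object> processedEvents = new ArrayList<>();

    public static void processLeaky(Object event) {
        // ... handle the event ...
        processedEvents.add(event); // reference retained forever
    }

    // FIX: drop all references once the event is fully handled
    // (or use a bounded or weak-reference cache if history is needed).
    public static void processFixed(Object event) {
        // ... handle the event ...
        // no long-lived reference kept; the event becomes collectable
    }

    public static int retainedCount() {
        return processedEvents.size();
    }
}
```

In a heap dump, the leaky variant shows up exactly as in Alam’s case: one retained instance per processed message, with the collection as the GC root path.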
Issue #2: Known Memory Leak in Tibco’s Internal Message Processing
Alam is not only monitoring the production environment. He is also responsible for their large-scale load tests in pre-production. Prior to a new release they run long-running, high-transaction-volume load tests that they call “Endurance Tests”. The goal is to verify that Tibco can process ~300,000 messages per hour – roughly 83 messages per second, or about 21.6 million messages in total – for 72 hours straight without a problem.
The following screenshot shows the same Tibco BW Process Health Dashboard they use in production monitoring, this time during an endurance test run over the weekend. It showed a memory allocation pattern similar to the first problem we discussed: within the first 24 hours, the Old Generation filled up. After another 12 hours the system crashed with an out-of-memory error, and the load test had to be aborted after 36 hours:
Tip: During a load test Alam created so-called “Selective Memory Dumps” every hour. These are lightweight dumps containing the number of object instances per class, which makes it easier to identify problematic leaking objects over time. These dumps can be scheduled in Dynatrace to be created automatically.
After the system crashed, Alam had several of these Selective Memory Dumps as well as the final full memory dump (including all object references) available. The first step is to compare two (or more) selective dumps to understand which classes are actually growing over time. In their case they found com.tibco.tibrv.TibrvMsg and java.lang.ref.Finalizer as the two problematic candidates:
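The comparison step itself is simple set arithmetic over the per-class instance counts. A minimal sketch, where the maps stand in for the counts that a tool such as Dynatrace shows in its dump-comparison view:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the dump-comparison step: given instance counts per class
// from two selective dumps taken at different times, report which
// classes grew and by how much.
public class DumpDiff {

    public static Map<String, Long> growingClasses(Map<String, Long> earlier,
                                                   Map<String, Long> later) {
        Map<String, Long> growth = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : later.entrySet()) {
            long before = earlier.getOrDefault(e.getKey(), 0L);
            long delta = e.getValue() - before;
            if (delta > 0) {
                growth.put(e.getKey(), delta); // only classes that grew
            }
        }
        return growth;
    }
}
```

Sorting the result by delta immediately surfaces candidates like TibrvMsg, while stable classes drop out of the list.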
Tip: When comparing dumps, don’t focus on classes such as String or char[] – these are typically referenced by the classes that are the real problem and grow at the same rate. It is better to focus on your own custom classes or classes of frameworks such as Tibco.
Knowing that the growing classes were TibrvMsg and Finalizer, they focused on these classes when analyzing the full memory dump taken when the system crashed. As it was not their own code, they also showed this “Reference Tree” to a Tibco architect, who pointed out that there is a well-known memory leak in the native message implementation of tibrvnative.jar that can lead to exactly this situation:
Solution: With the help of the Tibco architect they replaced the problematic native implementation with the Java implementation offered by tibrvj.jar.
Take Away: Alam’s Best Practices
Tibco is a key component in many enterprises. Like any other system that relies on a runtime such as Java or .NET, it is important to keep an eye on potential memory leaks, because relying on a Garbage Collector is no insurance against them. To be safe, do your homework:
- Set up monitors for the key memory performance counters
- Monitor them in pre-production as well as production
- Educate developers on memory management best practices
- Take lightweight memory dumps from time to time to identify bad trends