As a performance architect I conduct many performance tests against Java Enterprise apps, analyzing performance, scalability and stability issues of the application under load. One of our recent tests was against an application that kept crashing with “Out of Memory Exceptions” whenever we ran our JMeter-based Load Tests. The first thought in such a case is usually: this is probably a memory leak!
In this blog I’ll walk through the steps we took to analyze the issue, explain which tools we used, provide some background on Java Garbage Collection, and show how the problem was resolved. Note: it wasn’t a Memory Leak but a Logging mechanism that allocated too much memory, crashing our Java Virtual Machines (JVMs).
The Test Environment
Java performance testing is the science of discovering problem areas in an application under varying load conditions. That said, selecting the right tools is key to performance load test success.
For this scenario our Application Under Test (AUT) was a Java-based application hosted in Amazon AWS. My team was tasked with analyzing some serious memory issues. To help identify the problem we used JMeter 2.12 to generate synthetic load, Dynatrace AppMon for Performance Monitoring, as well as CloudWatch!
Out-of-Memory Exceptions: Observed Application Issues
The application was observed to crash within the first few minutes of steady state load test execution. To better understand the issue from a higher level we executed the following tasks:
- Gradually decreasing the load: Application still continued to crash even at 1 transaction per second (TPS)
- Increasing the number of servers: Being hosted in AWS gave us the opportunity to auto scale new machines during the test; however that only delayed the environment crash
- Upgrading server configuration: This defeats the purpose of the test, which is to achieve the best performance with optimal hardware usage
A peek into Dynatrace revealed high ‘Response Time’ (time taken for the application to successfully process incoming HTTP Requests) and ‘System Utilization’ (CPU, Memory, Disk and Network on the host machines), both of which far exceeded our target SLA (Service Level Agreement) values.
A deeper analysis of each EC2 server instance highlighted that within a few minutes of test execution, the ‘Used Memory’ (Dynatrace captures Memory Measures exposed via JMX) showed a steep increase until it reached the maximum allocated heap size (4GB). At this point GC (Garbage Collection) was triggered, and the JVM ultimately shut down with an Out Of Memory (OOM) Exception. As a result, CPU utilization (time spent in GC) and Network utilization (crash information being logged) increased significantly.
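As a quick sanity check outside of a monitoring tool, the same ‘Used Memory’ measure can be read from the JDK’s own platform MXBeans. A minimal sketch (class and method names here are mine, not part of the original setup):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class HeapMonitor {
    // Returns the current used heap in bytes — the same value that
    // JMX-based tools like Dynatrace sample for "Used Memory".
    public static long usedHeapBytes() {
        MemoryMXBean memoryBean = ManagementFactory.getMemoryMXBean();
        return memoryBean.getHeapMemoryUsage().getUsed();
    }

    public static void main(String[] args) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        // Note: getMax() may return -1 if the maximum is undefined for this JVM
        System.out.printf("Used: %d MB, Committed: %d MB, Max: %d MB%n",
                heap.getUsed() >> 20, heap.getCommitted() >> 20, heap.getMax() >> 20);
    }
}
```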
We needed to gain a better understanding of the memory usage patterns of the application, and to determine why the Garbage Collector caused high CPU use but, ultimately, couldn’t clean up enough memory, resulting in the OOM Exception. When analyzing memory-related problems it is always useful to have at least a basic understanding of how memory management and Garbage Collection actually work.
Brief Intro to Java Garbage Collector
The Java Heap Space is typically categorized into “Young” and “Old” generations. Depending on the type of JVM and Memory Settings used, you might see different Heaps and Sizes in your application. Let’s discuss a standard scenario where you have a Young and an Old generation.
Newly created objects are allocated in the “Young” generation; those that are no longer referenced are cleared when a minor Garbage Collection (GC) takes place.
Objects that survive a minor GC are promoted to the “Old” generation; Old generation GC happens less frequently and is referred to as ‘major GC’. Frequent and long major GC cycles indicate a possible memory leak.
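To see these generations on your own JVM, the JDK’s MemoryPoolMXBean exposes the individual heap pools directly. A minimal sketch (the exact pool names depend on the collector in use, e.g. “G1 Eden Space” vs. “PS Old Gen”):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;

public class GenerationInspector {
    public static void main(String[] args) {
        // Heap pools typically include Eden and Survivor spaces (Young generation)
        // and one Old generation pool; names vary by garbage collector.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP) {
                System.out.printf("%s: used %d KB%n",
                        pool.getName(), pool.getUsage().getUsed() >> 10);
            }
        }
    }
}
```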
For a more detailed overview and explanation of Java Memory Management, Garbage Collection and Tuning check out the Online Java Performance Book from Dynatrace.
Triaging Out-of-Memory Exceptions with Dynatrace
Dynatrace automatically provides a range of basic and advanced memory metrics. Under Monitoring > Infrastructure > host server we narrowed down the problem by examining the following metrics:
- No available free memory
- As a result, major GC constantly kicks in to make memory available for process execution
- High GC cycle time
At the same time, the Young generation was being squeezed: it started at 2500 MB but shrank as the Old Generation grew. This also caused continuous minor GC.
This leads to high CPU utilization and an increased number of threads.
It became evident that memory allocation for some objects was not appropriate: objects were being promoted directly to the Old Generation, hinting at a potential memory leak on these servers. This is a common conclusion when you see your Old Generation constantly growing without ever being cleaned up by a GC. Also check out the online book section “The different type of memory leaks and how to analyze them” for more on this topic.
To analyze the problem we decided to run the same transaction over a longer period of time using our JMeter scripts, while taking Java Heap Dumps with Dynatrace’s Memory Diagnostics feature. Before we started we also took one Heap Dump to determine the baseline memory usage.
The “compare” feature in Dynatrace for Total Memory snapshots helped identify that the MessageFormatMessage class from log4j consumed 2.12GB of heap space. These should be short-lived, per-transaction objects that the GC cleans up immediately.
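If you don’t have a memory diagnostics tool at hand, a comparable heap snapshot can be taken programmatically with the HotSpot-specific HotSpotDiagnosticMXBean (available on HotSpot/OpenJDK JVMs). A minimal sketch — class and file names are illustrative:

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;

public class HeapDumper {
    // Writes an .hprof heap dump readable by common heap analyzers.
    // live=true dumps only reachable objects (a GC runs first).
    // Note: recent JDKs require the file name to end with ".hprof",
    // and the target file must not already exist.
    public static void dumpHeap(String filePath, boolean live) throws IOException {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        bean.dumpHeap(filePath, live);
    }

    public static void main(String[] args) throws Exception {
        String path = "baseline-" + System.currentTimeMillis() + ".hprof";
        dumpHeap(path, true);
        System.out.println("Heap dump written to " + path);
    }
}
```

Taking one dump before the test and one during steady-state load lets you diff object counts, which is essentially what the Dynatrace “compare” feature automated for us here.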
Out-of-Memory Exceptions: Root Cause & Recommendation
It turned out that the current log strategy in the application was generating about 100MB worth of log entries for EVERY single transaction, with the log entries being kept in memory throughout the execution of the transaction! The in-memory representation of these logging objects was several times larger than the actual log output. This means that every executed transaction actually had to allocate about 2GB on the heap. The GC couldn’t clean that memory while the transaction was still executing and, as a result, the JVM crashed with an OOM Exception.
With this new insight on the current logging strategy we proposed several changes:
- Set log level = ERROR and log only what is absolutely critical
- Review the log entries that must be logged and ensure that only important ones are printed
- If log statements are also meant to trace transactions and the response time of each hop, we suggest logging only the entry and exit points of each service call. A custom correlation ID and the Dynatrace x-dynatrace header can be logged in each service along with other information, for example, response time and timestamp. Logging the x-dynatrace header can be useful if you are planning to link your log management system with Dynatrace
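The allocation-avoidance idea behind these recommendations can be sketched with the JDK’s built-in logging: java.util.logging’s Supplier overload defers message construction until the level is actually enabled, so disabled DEBUG/FINE statements allocate nothing on the heap (log4j’s parameterized logging serves the same purpose). The service method, correlation ID, and header values below are hypothetical:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class ServiceCallLogger {
    private static final Logger LOG = Logger.getLogger(ServiceCallLogger.class.getName());

    // Hypothetical service call logging only entry and exit points.
    // The lambda (Supplier<String>) builds the message string ONLY if
    // FINE is enabled — no per-transaction message objects pile up otherwise.
    public static String handleRequest(String correlationId, String dtHeader) {
        long start = System.nanoTime();
        LOG.log(Level.FINE, () -> "ENTER handleRequest correlationId=" + correlationId
                + " x-dynatrace=" + dtHeader);

        String result = "ok"; // ... actual business logic would go here ...

        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        LOG.log(Level.FINE, () -> "EXIT handleRequest correlationId=" + correlationId
                + " responseTimeMs=" + elapsedMs);
        return result;
    }

    public static void main(String[] args) {
        LOG.setLevel(Level.SEVERE); // log only what is absolutely critical, per the recommendation
        System.out.println(handleRequest("txn-42", "header-placeholder"));
    }
}
```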
If you are also dealing with Java memory issues, and are not sure if you are actually having a memory leak or only bad memory allocation patterns, I suggest you get your own free copy of Dynatrace AppMon. You can register for the 30-Day Free Trial which will automatically convert to a lifetime free Personal License (allowing you to analyze your local apps forever). There are also several YouTube Tutorials by Andreas Grabner. Check out the one for Memory Diagnostics with Dynatrace where I walk you through the steps of instrumenting your Java app, and analyzing the different types of memory issues from which your app may suffer.