If you are following this blog, you will have come across multiple posts that point out the correlation between Garbage Collection (GC) and Java performance. Moreover, there are numerous guides on how Garbage Collection works or how to configure the Garbage Collection policy on your Java Virtual Machine (JVM).

Having read the above posts, you can quickly come to the conclusion that misconfigured GC parameters can be catastrophic. But where do you start? What happens when you spin up your first JVM? What are the basic parameters that can truly hurt your application’s Java performance?

I wanted to present the following case from one of my clients to point out that, most of the time, we get caught up in the details and forget to check the basics. This is a classic example.

Garbage Collection Problem

My client runs on IBM JVMs and had been experiencing performance issues for a long time. More specifically, they were accustomed to getting complaints from users about slow performance during high-load periods. From a technical perspective, the System Engineers were documenting high CPU week after week. Their plan of attack was simply to throw more resources at the problem and push through the peak-load intervals.

Garbage Collection “Lightbulb Moment”

After a while, their environment grew to the point where buying more CPU and memory every time they had a performance problem became cost prohibitive. This is where Dynatrace application monitoring came in; my client had always suspected that their performance problems originated in their application’s JVMs, but they had no visibility into those JVMs.

After installing Dynatrace Java Agents, we started looking at the slowest Response Time PurePaths, sorted in descending order. We quickly identified that Suspension Time (time the application spends suspended while the JVM runs GC) and CPU dominated the percentage breakdown.

Figure 1 – PurePath Breakdown

Going through the data, Dynatrace clearly pointed out that GC was the problem.

Figure 2 – Transaction Flow

The critical moment came when they faced extremely high load the Monday after a holiday weekend. All servers were red and their environment froze. After a closer look at the “Show Application Process” details, I couldn’t believe what I was seeing!

Figure 3 – Host Information

I could see that there were frequent GC invocations, but I couldn’t understand why they were taking so long, and then it hit me! The memory heap graph showed no Minor GC at all. Every GC invocation was a Major GC (or “Stop the World”) collection.

Figure 4 – Application Process

Basically, their application was pausing every single minute for 20 to 30 seconds in order to clean up old-generation objects. Since there was no Minor GC to take care of the young, frequently allocated objects, every application thread had to be suspended for the Major GC to do the cleanup. As a consequence, their application was unable to process user requests, and their CPU utilization was predominantly spent on GC.
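To make the generational idea concrete, here is a minimal, hypothetical allocation sketch (my own illustration, not code from the client’s application). Most of the objects it creates die almost immediately, while only a small fraction survives; a generational policy reclaims the short-lived majority with cheap, frequent nursery collections, whereas a non-generational policy leaves that same garbage waiting for full stop-the-world passes over the entire heap.

import java.util.ArrayList;
import java.util.List;

// Hypothetical allocation pattern, typical of request-processing code:
// lots of short-lived objects, a few long-lived ones. Run it with
// -verbose:gc to watch how the collector behaves under each policy.
public class AllocationChurn {
    public static void main(String[] args) {
        List<byte[]> longLived = new ArrayList<>();

        for (int i = 0; i < 5_000_000; i++) {
            // The vast majority of allocations become garbage right away;
            // a generational collector reclaims these with short nursery
            // (Minor GC) cycles.
            byte[] shortLived = new byte[16 * 1024];

            // A small fraction survives long enough to be tenured; only
            // these should ever need the more expensive global collection.
            if (i % 10_000 == 0) {
                longLived.add(shortLived);
            }
            // Keep the retained set bounded so the demo does not exhaust the heap.
            if (longLived.size() > 500) {
                longLived.remove(0);
            }
        }
        System.out.println("Done; retained " + longLived.size() + " buffers.");
    }
}

Running something like this with -verbose:gc enabled makes the contrast easy to spot in the GC log: many short nursery cycles under a generational policy versus nothing but global collections without one.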

Garbage Collection Fix

As it turns out, my client had been gradually spinning up JVMs since 2010. They would start a new JVM with the default settings and just set the heap limits. At the time, IBM JVMs defaulted to a non-generational policy, Optimal Throughput (-Xgcpolicy:optthruput). The default only changed to the Generational policy sometime in 2011.

As soon as we identified the issue, I advised the client to reconfigure half of their environment to use a Generational GC approach. My goal was to compare the two different approaches.

All we did was add the following parameter to their JVM arguments: -Xgcpolicy:gencon
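If you want to confirm after a restart that the new policy actually took effect, the standard java.lang.management API offers one quick, low-risk way to check (a generic sketch of my own, not part of the client’s deployment). Collector names vary by vendor and policy, but on an IBM JVM running gencon you should see separate nursery and global collectors instead of a single global one.

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Prints the garbage collectors the running JVM has registered,
// along with how often they have run and how long they have taken.
public class GcPolicyCheck {
    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("Collector: %-20s collections: %-8d time: %d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}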

Results

We restarted the JVMs and BOOM! The results were immediate. GC Copy (or Minor GC) showed up on our “Application Process” dashboard. Minor GC runs were short and fast, and Suspension Time was down to zero.

Figure 5 – Application Process (Before & After)

The difference between the servers configured with the generational policy and those still on the non-generational policy was more than obvious.

Figure 6 – Suspension Time (generational vs non-generational JVMs)

Finally, Response Time under similar load was vastly different. Before the changes, Response Time was higher and clearly correlated with the load patterns. After the changes, Response Time was more stable and the spikes were eliminated.

Figure 7 – Web Requests vs Response Time (Before & After)

Conclusion

After this discovery, my client was eager to check the basic settings of the other applications running on IBM JVMs in their environment. They recognized that understanding the impact of GC was key to their application’s performance. I advised them to look at several additional parameters affecting GC on their JVMs.

Figure 8 – JVM Memory allocation parameters*

* http://javahonk.com/how-many-types-memory-areas-allocated-by-jvm/
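One quick way to see those memory areas on a live JVM, before touching any tuning flags, is the standard MemoryPoolMXBean API. The sketch below is a generic illustration of my own (not the client’s tooling); it lists each heap and non-heap pool with its current usage so you can sanity-check heap and nursery sizing.

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Lists every memory pool (heap and non-heap) with its current usage.
public class MemoryAreas {
    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            long max = usage.getMax(); // -1 means the pool has no defined maximum
            System.out.printf("%-15s %-30s used: %,d KB  committed: %,d KB  max: %s%n",
                    pool.getType(), pool.getName(),
                    usage.getUsed() / 1024, usage.getCommitted() / 1024,
                    max < 0 ? "undefined" : (max / 1024) + " KB");
        }
    }
}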

I hope this post helps you understand the implications of using default settings on IBM JVMs. This client story clearly demonstrates that they can be very dangerous for your application’s health.

That said, when was the last time you took a closer look at your application’s basic settings? If you think it is time to verify your application’s memory behavior and GC impact, you might find this YouTube Tutorial on Memory Diagnostics for Java & .NET beneficial. If you are ready to take action, you can download a Dynatrace Personal License (free forever for your local apps) to get started!