The Spring Framework is great because it removes a lot of the legwork developers would otherwise need to do to get a new application up and running. Instead of spending time re-inventing the wheel, it is generally easy and convenient to use frameworks for common tasks such as caching, database access, or data binding with UI elements. Blindly “trusting” a framework without looking “under the hood” is, however, not a good idea. It’s like blindly trusting a used car dealer without checking the engine, brakes, or tires before taking the car on a long road trip.
This story focuses on a large bank that built its core e-Banking application on WebSphere and the popular open source Spring Framework. At the beginning of every month the bank faces an unusually high number of online customers checking whether their paychecks have already arrived. For ten consecutive months, most logins between 10 AM and noon failed due to timeouts. That’s frustrating for the banking customers and bad for the image of the bank. The bank decided to use an APM (Application Performance Management) solution to understand its user experience and why customers were frustrated with their logins.
In this blog we look at the symptoms, the analysis process, and how the team fixed the problem in their deployed Spring Framework. Since the fix they have had no further outages and have in fact increased the number of online customers by 79% thanks to the improved end-user experience.
Observation #1: Running out of Worker Threads
To monitor the health of its JVMs, the bank uses several dashboards that track memory, network, CPU, I/O, and thread metrics. The following dashboard shows the operations team which JVMs are currently running low on available worker threads. The maximum number of threads is 250. These “traffic lights” turn red when a JVM uses more than 150 – an early warning signal for the team, especially in a situation where most of the JVMs show up in RED:
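The traffic-light logic described above can be sketched with plain JMX. This is only an illustrative approximation – WebSphere exposes its actual worker-thread-pool metrics through its own MBeans, and the thresholds (250 max, 150 warning) are the ones quoted in this post:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class ThreadPoolAlert {
    // Thresholds taken from the dashboard described above
    static final int MAX_THREADS = 250;
    static final int WARN_THRESHOLD = 150;

    // Map a live-thread count to the dashboard "traffic light" color
    static String status(int liveThreads) {
        return liveThreads > WARN_THRESHOLD ? "RED" : "GREEN";
    }

    public static void main(String[] args) {
        // JVM-wide thread count via the standard platform MBean;
        // a real dashboard would query the app server's pool instead
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        int live = threads.getThreadCount();
        System.out.printf("live threads: %d/%d -> %s%n",
                live, MAX_THREADS, status(live));
    }
}
```

Polling such a value once a minute and coloring each JVM accordingly is enough to reproduce the early-warning behavior described here.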
The following chart shows that thread usage starts climbing at around 10 AM and reaches a critical stage at about 10:10 AM, when most of the JVMs are already maxing out their available worker threads:
Observation #2: Threads Waiting on Class Loader
What are these threads doing? Analyzing their state quickly showed that most of them (242 out of 250) were waiting on the WebSphere CompoundClassLoader, as all of them were trying to load additional classes. Due to the high number of threads competing for that shared resource – the class loader – most threads got stuck waiting:
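The contention pattern is easy to reproduce in a minimal sketch: `ClassLoader.loadClass()` synchronizes on the loader (older, non-parallel-capable loaders such as the CompoundClassLoader of that era lock the whole loader), so concurrent lookups serialize behind each other. The class name below is hypothetical:

```java
public class LoaderContention {
    public static void main(String[] args) throws Exception {
        // One shared loader, many worker threads - the same shape as
        // 242 WebSphere threads queuing on the CompoundClassLoader
        ClassLoader shared = LoaderContention.class.getClassLoader();
        Runnable lookup = () -> {
            try {
                shared.loadClass("com.example.DoesNotExist"); // hypothetical missing class
            } catch (ClassNotFoundException expected) {
                // every call pays the full, serialized search cost
            }
        };
        Thread[] workers = new Thread[8];
        for (int i = 0; i < workers.length; i++) workers[i] = new Thread(lookup);
        for (Thread t : workers) t.start();
        for (Thread t : workers) t.join();
        System.out.println("all workers finished");
    }
}
```

A thread dump taken while this runs shows most threads BLOCKED on the loader’s monitor, which is exactly the picture the bank saw.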
Observation #3: Most Time Spent in Class Loading Even Under Normal Load
Looking at the response time contribution of the actual web requests executed by these end users shows the same picture: 80-90% of the response time is spent in Java class loading. Interestingly, this doesn’t just happen during peak load but also during “normal” load:
Observation #4: Class Loader Tries to Load Non-Existing Classes
So – is all this class loading necessary? Looking at the actual transactions shows that for EVERY web request the app server tries to load a class that does not exist, leading to a flood of ClassNotFoundExceptions. Because this class can never be loaded successfully, yet the app server keeps trying to load it for every request, we have found the root cause of the problem. This is true for fast and slow transactions alike, which highlights the importance of seeing this level of detail for every transaction in your system: the fast transactions also hold on to the scarce class loader resource and therefore impact other transactions.
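Why does this hurt so much? The JVM does not cache negative lookup results, so a class that can never be found is searched for from scratch on every request. A small sketch of that repeated cost, using a hypothetical class name:

```java
public class RepeatedLookupCost {
    // Count failed lookups of a class that will never exist. Because
    // failed lookups are not cached, every attempt repeats the full
    // (synchronized) classpath search.
    static int failedLookups(int attempts) {
        int failures = 0;
        for (int i = 0; i < attempts; i++) {
            try {
                Class.forName("com.example.MissingBeanInfo"); // hypothetical name
            } catch (ClassNotFoundException e) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        int failures = failedLookups(1_000);
        long millis = (System.nanoTime() - start) / 1_000_000;
        System.out.println(failures + " failed lookups in " + millis + " ms");
    }
}
```

On an app server with a long classpath and a contended loader, each of these futile searches is far more expensive than in this toy setup, and they happen once per bean per request.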
The following screenshot shows a PurePath for one of these requests on their system. The ability of the PurePath technology to capture every single transaction, including which component wanted to load these classes and why the load failed, was critical to identifying the root cause:
Root Cause: Newer Version of Spring Framework checks existence of BeanInfo Classes
After some investigation it was discovered that a recent upgrade to a newer version of the popular Spring Framework introduced new behavior: the framework now always tries to load a BeanInfo class for each bean. As this check is performed for every bean in every request, it causes all of these ClassNotFoundExceptions. Further investigation also showed that this behavior had already been reported to Spring back in January 2012 as SPR-9014.
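The mechanism behind this is the standard JavaBeans `Introspector`: by default it first searches the classpath for an explicit `<ClassName>BeanInfo` companion class, and when none exists that failed lookup surfaces as a ClassNotFoundException. A minimal sketch with a hypothetical bean (the `IGNORE_ALL_BEANINFO` flag shown here is also the mechanism behind the `spring.beaninfo.ignore` switch that Spring later introduced in response to SPR-9014):

```java
import java.beans.BeanInfo;
import java.beans.Introspector;

public class BeanInfoLookupDemo {
    // Hypothetical bean, standing in for the bank's application beans
    public static class AccountBean {
        private String owner;
        public String getOwner() { return owner; }
        public void setOwner(String owner) { this.owner = owner; }
    }

    public static void main(String[] args) throws Exception {
        // Default path: the Introspector first looks for a class named
        // "AccountBeanBeanInfo" on the classpath; that lookup fails here
        // and is the source of the per-request ClassNotFoundExceptions.
        BeanInfo info = Introspector.getBeanInfo(AccountBean.class);
        System.out.println(info.getPropertyDescriptors().length + " properties");

        // IGNORE_ALL_BEANINFO skips the BeanInfo classpath search entirely
        // and falls back to plain reflection over getters/setters.
        BeanInfo fast = Introspector.getBeanInfo(AccountBean.class,
                Introspector.IGNORE_ALL_BEANINFO);
        System.out.println(fast.getPropertyDescriptors().length + " properties");
    }
}
```

In later Spring versions this can be controlled globally, e.g. with `-Dspring.beaninfo.ignore=true`; at the time, the bank’s pragmatic fix was the rollback described below.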
Result: No Outages Anymore and Boost in Logins by 79%
Rolling back to the previous version of Spring not only got rid of the performance impact of heavy class loading. Since the change the bank has had:
- No outages since that deployment
- An improved reputation, which resulted in 79% more logins to their system
- Lower operational costs due to saved CPU cycles
Fixing all of these problems easily justified the decision to use Dynatrace. The team not only solved this problem but also other critical ones, such as issues in their data access layer causing duplicate key exceptions, and misconfigured beans producing a high number of internal exceptions that impacted overall transaction performance in a similar way, as explained in The Performance Impact of Exceptions.
Thanks to Ibrahim Mohammed – Enablement Service Engineer – for discovering this behavior and helping both our customer and our community be aware of these issues and how to prevent them.