Does your Ops team care about the number of Exceptions thrown in the application – do they even monitor this number? Does your Test Team report the list of Exceptions thrown during a load test to engineering or are they just sending those that end up in a logfile? Is development interested in the Exceptions that are thrown within frameworks while executing their unit tests? Why should they care? Is there a real impact on performance that comes of a couple of exceptions?
Two years ago Alois Reitbauer wrote a nice article about The Cost of an Exception, which is typically hard to evaluate. After a recent deployment of a new version we saw that 30% of the CPU on our Application Server was consumed by creating Exception objects – these were Exceptions that never made it to a logfile – so nobody really cared until we identified it being a performance impact on the infrastructure and to the end user. The root cause is simple – but also not that easy to find if you don’t look at all Exceptions thrown and not just those that bubble up to the end user or as SEVERE messages into log files.
The big lesson learned was that Exceptions can have a severe impact on resource utilization as well as end user performance. After this discovery Ops, Test and Dev are now watching out for high Exception creation in order to ensure that code changes, configuration changes or deployment mistakes are detected before they impact the end user.
Symptom: High CPU Utilization on an Application Server
During a recent production load test that we ran against an updated version of our community site we noticed that the CPU was behaving differently on our Application Server compared to the previous tests. We ran this test outside of regular business hours in order to not impact the regular users on the production system. We expected that CPU utilization increased with increased load – but – comparing it to a previous production load test this was much higher than expected. The following screenshot shows the Process Health Dashboard of our Java Application Server (Tomcat) where the CPU displayed the unexpected behavior:
Root Cause: CPU Hotspots reveals Exception Handling as Main Performance Problem
The next step was to identify the hotspots in the application causing the high CPU utilization. The following screenshot shows the top CPU-consuming methods in a 5-minute interval on our application server just when CPU utilization began crossing the 60% mark. 96 seconds (s) out of the 300s (5 minutes) were consumed by fillInStackTrace() which was called every time an Exception object was created:
fillInStackTrace() was called from the Throwable constructor. That means that every exception that gets created ends up calling this method which turns out to be our hotspot. We also see that 79% of the time it is the MissingResourceException which gets thrown when one of the i18n utility classes try to get text from the deployed resource bundles.
Sheer Volume is the Problem – Not the Individual Exception
As with a lot of things – it is not a single Exception that consumes CPU – but – it is the sum of all Exceptions. How much did it take to consume 30% CPU? In our case about 182,000 Exceptions in 5 minutes!
An Obsolete Plugin Is The Root Cause
We quickly identified problem by looking at the PureStack and PurePath information, the log files, and with help from the great support team at Atlassian. It turned out that after the upgraded to a newer version of our Confluence instance we forgot to upgrade one of the plugins that we actually no longer use. The old version of the plugin caused these Exceptions when Confluence iterated through the different resource packages. As we actually don’t use the problematic plugin no end user would have complained about broken functionality. The only way this problem manifested itself was unusual high CPU consumption that – under heavy load – impacts all users on the system.
Lessons Learned for Dev, Test and Ops
Knowing that Exceptions can be a performance impact means that we need to make sure we prevent too many Exceptions from being thrown. All teams involved in the application lifecycle can do their part to make sure that the problem won’t occur – or – if it does happens – will be addressed proactively. Here is how:
- Operations: They now monitor and alert on unusual behavior in the number of Exceptions thrown in Production. This catches problems that are introduced with configuration changes or deployments of new code that hasn’t thoroughly been tested. It also detects deployment issues such as missing files that also results in similar Exceptions
- Testing: They look at the number of Exceptions just as they did after running this test. Comparing it with previous tests allows them to identify any regression that was introduced.
- Development: We do develop our own plugins and extensions to Confluence. This story taught us that during development we also need to make sure that our custom plugins don’t access any APIs that cause internal exceptions. We also automated that through tests executed in our Continuous Integration. Executing these tests also captures the number of exceptions and lets the build fail in case we observe untypical behavior.