Don’t Trust Your Log Files: How and Why to Monitor ALL Exceptions

I would say that only one in a million exceptions thrown in an application actually makes it to a log file, unless you run your application in verbose logging mode. Do you agree? No? Here is why I think so: most exceptions are handled by your code or by the frameworks your app uses. Here is a chart from an enterprise application showing that about 4000x more custom application exception objects are thrown than important log messages are written:

4000 times more exceptions than log messages: a high number of exception objects are created that never end up in a log file. Can they be ignored? What’s their impact?

So why worry about these exceptions that nobody cares to write to a log file? Two reasons:

  1. They are typically thrown for a good reason and therefore indicate a problem, e.g., configuration issues in frameworks or runtime problems
  2. Every Exception object is a potential performance problem because the JVM needs to allocate memory, capture the stack trace and dispose of the object soon after
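
To make the second reason concrete, here is a minimal Java sketch. The class and method names are hypothetical and not taken from the application discussed here. It creates an exception at two different call depths and prints how many stack frames each one captured. The deeper the call chain, the more frames the JVM has to record for every single exception object, whether or not anyone ever logs it.

```java
class StackTraceCostDemo {

    // Recurse down to the requested depth, then create (not throw) an exception there.
    // The stack trace is captured when the exception object is constructed.
    static Exception createAtDepth(int depth) {
        if (depth > 1) {
            return createAtDepth(depth - 1);
        }
        return new Exception("created at the bottom of the call chain");
    }

    public static void main(String[] args) {
        int shallowFrames = createAtDepth(1).getStackTrace().length;
        int deepFrames = createAtDepth(200).getStackTrace().length;

        System.out.println("frames captured at depth 1:   " + shallowFrames);
        System.out.println("frames captured at depth 200: " + deepFrames);
    }
}
```

The exact frame counts depend on how the program is launched, so the sketch only compares the two values instead of assuming specific numbers.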

Reason #1: Configuration Problems

The following shows a transaction where the method getImagePath makes a web service call to a backend server using HttpClient. getImagePath uses an HTTP endpoint URL. The web service, however, only supports HTTPS (SSL), so the call fails with an SSLException. getImagePath retries 3 times until it gives up and just returns a default value to the caller. No log entry is written and no exception is thrown to the caller, so everything seems OK to the outside world, even though there is a severe impact on the end user, who waits longer than necessary for an image that he doesn’t get to see:

Exceptions are highlighting configuration problems (wrong URL) but the calling method is not doing anything with that information
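
The behavior described above can be sketched in a few lines of Java. This is not the original application code: the ImageService interface, the default path and the URLs are hypothetical stand-ins, and the HttpClient call is replaced by a lambda so that the example runs without a network. The swallowed counter exists only to make the hidden exceptions visible in the demo.

```java
import java.io.IOException;
import javax.net.ssl.SSLException;

class SilentRetryDemo {

    // Hypothetical stand-in for the web service call made through HttpClient.
    interface ImageService {
        String fetchImagePath(String endpointUrl) throws IOException;
    }

    static final String DEFAULT_IMAGE_PATH = "/images/default.png"; // assumed default value
    static final int MAX_ATTEMPTS = 3;

    // Demo-only counter for exceptions that nobody ever sees.
    static int swallowed = 0;

    static String getImagePath(ImageService service, String endpointUrl) {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                return service.fetchImagePath(endpointUrl);
            } catch (IOException e) {
                swallowed++; // no log entry, no rethrow: the problem stays invisible
            }
        }
        return DEFAULT_IMAGE_PATH; // the caller cannot tell that anything went wrong
    }

    public static void main(String[] args) {
        // Simulated backend that only accepts HTTPS.
        ImageService httpsOnly = url -> {
            if (!url.startsWith("https://")) {
                throw new SSLException("plain HTTP rejected");
            }
            return "/images/user-42.png";
        };

        String path = getImagePath(httpsOnly, "http://backend.example/images");
        System.out.println(path + " after " + swallowed + " swallowed exceptions");
    }
}
```

Running the sketch prints the default path together with a count of 3 swallowed exceptions, which is exactly the information that never reaches a log file in the scenario above.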

Key Takeaways:

  • End Users: This code is executed for every user who makes this request, and none of them will get the correct image path. Additionally, the user is waiting on it for several seconds. We all know what users will do if they have to wait too long.
  • Business: If your app delivers dynamic user-specific content, e.g., recommendations for that user, you need to ensure that no configuration problem causes your app to deliver incorrect content. As a business owner you want to get alerted when a problem in the app causes incorrect responses to your users.
  • Operations: When users complain, there is no documented evidence of a problem (nothing in a log file). Make sure to monitor outgoing web requests and the status of these calls, as this helps you identify requests that start failing or stop delivering what they are supposed to deliver.
  • Developers: Everything probably worked well when they tested this web service in their own environment, where they used a dummy or mocked web service endpoint. Make sure to add logging for these situations and let Operations know how to configure these endpoints.
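
As one possible way to follow the advice for developers, the sketch below keeps the retry-then-default behavior but writes a single warning that names the endpoint once all attempts have failed. It uses java.util.logging from the JDK. The ImageService interface, the logger name and the default path are assumptions made for illustration, not the original code.

```java
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.net.ssl.SSLException;

class LoggedRetryDemo {

    // Hypothetical stand-in for the web service call made through HttpClient.
    interface ImageService {
        String fetchImagePath(String endpointUrl) throws IOException;
    }

    static final Logger LOG = Logger.getLogger("LoggedRetryDemo");
    static final String DEFAULT_IMAGE_PATH = "/images/default.png"; // assumed default value
    static final int MAX_ATTEMPTS = 3;

    static String getImagePath(ImageService service, String endpointUrl) {
        IOException lastFailure = null;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                return service.fetchImagePath(endpointUrl);
            } catch (IOException e) {
                lastFailure = e; // remember the cause instead of dropping it
            }
        }
        // One log entry per failed request, naming the endpoint so that
        // Operations can check how it is configured.
        LOG.log(Level.WARNING,
                "Image lookup failed after " + MAX_ATTEMPTS + " attempts for endpoint "
                        + endpointUrl + "; returning default image path",
                lastFailure);
        return DEFAULT_IMAGE_PATH;
    }

    public static void main(String[] args) {
        ImageService httpsOnly = url -> {
            if (!url.startsWith("https://")) {
                throw new SSLException("plain HTTP rejected");
            }
            return "/images/user-42.png";
        };
        System.out.println(getImagePath(httpsOnly, "http://backend.example/images"));
    }
}
```

Logging once per request rather than once per attempt keeps the log readable while still leaving documented evidence when users complain.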

Reason #2: Performance Impact

Not too long ago I blogged about a system that was pounded by 180k exceptions within 5 minutes. None of them ever made it to a log file, but they consumed the entire server CPU. Used for what? For getting stack trace information, creating exception objects and having the Garbage Collector dispose of them later. The following screenshot was taken for the same getImagePath example from above, but it now shows the global impact of these 3 exceptions on the overall application.

If you consider that there are hundreds or even thousands of users trying to get this image path, these 3 exceptions multiply to become thousands of exceptions. That is a problem when you don’t know that these exceptions actually happen and that they impact the end user:

You need to look at all Exception objects to understand their performance impact on your application
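
The multiplication is easy to make visible with a small counter. The following Java sketch is an illustration under stated assumptions: the ExceptionCounter class is hypothetical, and the figure of 1000 requests is an assumed value, while the 3 exceptions per request come from the getImagePath example. Each catch block would call record, and the totals per exception type could then be exported to whatever monitoring you use.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import javax.net.ssl.SSLException;

class ExceptionCounter {

    // One thread-safe counter per exception class name.
    private final Map<String, LongAdder> countsByType = new ConcurrentHashMap<>();

    // Call this from catch blocks, including those that handle the exception silently.
    void record(Throwable t) {
        countsByType.computeIfAbsent(t.getClass().getName(), k -> new LongAdder()).increment();
    }

    long count(Class<? extends Throwable> type) {
        LongAdder adder = countsByType.get(type.getName());
        return adder == null ? 0L : adder.sum();
    }

    long total() {
        return countsByType.values().stream().mapToLong(LongAdder::sum).sum();
    }

    public static void main(String[] args) {
        ExceptionCounter counter = new ExceptionCounter();
        int requests = 1000;        // assumed number of users requesting the image path
        int failuresPerRequest = 3; // three failed attempts per request, as in the example

        for (int r = 0; r < requests; r++) {
            for (int a = 0; a < failuresPerRequest; a++) {
                counter.record(new SSLException("plain HTTP rejected"));
            }
        }
        System.out.println("SSLException count: " + counter.count(SSLException.class));
    }
}
```

With these assumed numbers the sketch reports 3000 SSLExceptions, none of which would show up in a log file in the original scenario.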

Key Takeaways:

  • Operations: Make sure you monitor the number of exceptions thrown in your application and also their performance impact. A single exception may cost almost nothing, but the cost of a large number of exceptions can easily add up.
  • Developers: Don’t ignore exceptions or use exceptions prematurely for code flow control. Be aware of the performance impact and the need to log critical exceptions.
  • Business: Make sure that features that are important to the business can be monitored. Talk with developers about logging information in case of problems, and with operations about monitoring these problems, so that you don’t have to wait for complaining users.
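
The point about code flow control can be illustrated with a small, hypothetical Java example that is not taken from the application discussed here. Both methods turn a string into a number and fall back to a default, but only the first one creates an exception object for every piece of expected bad input.

```java
class FlowControlDemo {

    // Exception-driven flow control: every non-numeric input creates a
    // NumberFormatException, including its stack trace, just to reach the fallback.
    static int parseWithException(String s, int fallback) {
        try {
            return Integer.parseInt(s);
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    // Check-first variant: expected bad input is rejected without creating an exception.
    // The check is deliberately conservative: an optional minus sign and 1 to 9 digits,
    // which always fits into an int. Anything else falls back to the default.
    static int parseWithCheck(String s, int fallback) {
        if (s == null || !s.matches("-?\\d{1,9}")) {
            return fallback;
        }
        return Integer.parseInt(s);
    }

    public static void main(String[] args) {
        System.out.println(parseWithException("abc", -1));
        System.out.println(parseWithCheck("abc", -1));
        System.out.println(parseWithCheck("42", -1));
    }
}
```

Whether the check-first variant is worth it depends on how often bad input actually occurs, which is one more reason to count exceptions before deciding.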

Why you haven’t done this already

Most monitoring tools don’t look at these types of things as they seem unimportant. The other reason is that many monitoring tools only identify these patterns when they clearly impact the end user. In this case an individual request might not be impacted, but the overall system is, and you can see this when you have visibility into every request, every exception that gets created and every log message written.

To sum it up in one sentence: Don’t wait until it is too late and don’t accept the excuse that this is not important – because it clearly is!

Andreas Grabner has 20+ years of experience as a software developer, tester and architect and is an advocate for high-performing cloud scale applications. He is a regular contributor to the DevOps community, a frequent speaker at technology conferences and regularly publishes articles. You can follow him on Twitter: @grabnerandi