In my recent article, “Software Quality Metrics for your Continuous Delivery Pipeline – Part III – Logging”, I wrote about the good and the not-so-good parts of logging and concluded that logging usually fails to deliver what it is so often mistakenly used for: analyzing application failures in production. In response to the heated debates on reddit.com/r/devops and reddit.com/r/programming, I want to demonstrate the wealth of out-of-the-box insights you can obtain from a single urgent, albeit unspecific, log message if you are equipped with the magic ingredient: full transaction context:
- “How many users are affected and who are they?”
- “Which tiers are affected by which errors and what is the root cause?”
When All You Have Is a Lousy Log Message
Does this story sound familiar? It is a Friday afternoon and you have just received the release artifacts from the development team, belatedly, and they need to go live by Monday morning. After spending the night and another day in operations getting the release into production on time, you discover on Monday that all you have to show for it is a lousy log message:
08:55:26 SEVERE com.company.product.login.LoginLogic – LoginException occurred when processing Login transaction
While this scenario hopefully does not reflect your everyday experience, it illustrates an important aspect of the life of development and operations: working as an operator involves monitoring the production environment and helping troubleshoot application failures mainly with the help of log messages – things that developers have baked into their code. While certainly not all log messages are as poor as this one, getting to the bottom of a production failure is often a tedious endeavor (see this comment on reddit by RecklessKelly, who sometimes needs weeks to reach his “Eureka moment”) – if it is possible at all.
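To make the problem concrete, here is a minimal sketch using java.util.logging of the difference between the context-free message above and the same event enriched with transaction context. The txId and user fields are hypothetical names chosen for illustration – any correlation data your framework propagates would serve the same purpose:

```java
import java.util.logging.Logger;

public class ContextualLogDemo {

    private static final Logger LOG =
            Logger.getLogger("com.company.product.login.LoginLogic");

    // Appends transaction context to a plain message. The txId and user
    // fields are illustrative -- a session id or request id would do as well.
    static String withContext(String message, String txId, String user) {
        return String.format("%s [txId=%s, user=%s]", message, txId, user);
    }

    public static void main(String[] args) {
        // The "lousy" variant -- all an operator gets is class and exception type:
        LOG.severe("LoginException occurred when processing Login transaction");

        // The same event, now traceable to a single transaction and user:
        LOG.severe(withContext(
                "LoginException occurred when processing Login transaction",
                "tx-4711", "dominik"));
    }
}
```

Even this cheap enrichment turns an anonymous failure into one an operator can correlate with a particular user and request.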
Why There Is No Such Thing as 100% Error-Free Code
Production failures can become a major pain for your business, with long-term effects: they will not only make your visitors buy elsewhere but, depending on the level of frustration, your customers may choose to stay with your competitors instead of giving you another chance.
As we all know, we simply cannot eliminate application failures in production entirely. Agile methodologies such as Extreme Programming or Scrum aim to build quality into our processes; still, there is no such thing as a 100% error-free application. “We need to write more tests!” you may argue, and I would agree: disciplines such as TDD and ATDD should be an integral part of your software development process since, applied correctly, they help you produce better code and fewer bugs. Still, it is simply impossible to test every corner of your application for all possible combinations of input parameters and application state: we can only ever run a limited subset of all possible test scenarios. The common goal of developers and test automation engineers must therefore be a testing strategy that lets them deliver code of sufficient quality. Consequently, there is always a chance that something goes wrong, and, as a serious business, you will want to be prepared for the unpredictable and have as much control over it as possible:
Without further ado, let’s examine some precious out-of-the-box insights you could obtain if you are equipped with full transaction context and are able to capture all transactions.
Why is this important? Because it enables you to see the contributions of input parameters, processes, infrastructure and users whenever a failure occurs, to solve problems faster, and to feed the captured information – unexpected input parameters, for example – back into your testing strategy.
Initial Situation: Aggregated Log Messages
Instead of crawling through a bunch of possibly distributed log files to determine the count of particular log messages, we may, first of all, want this done automatically for us as the messages occur. This gives a good overview of the respective message frequencies and facilitates prioritization:
What we see here (analysis view based on our PurePath technology) is that there have been 104 occurrences of the same log message in the application. We could also observe other captured event data, such as the severity level and the name of the logger instance (usually the name of the class that created the logger).
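The counting step itself is straightforward to sketch. Assuming log lines in the format of the sample above (time, severity level, logger name, message) – the parsing pattern is an assumption based on that one line, not a general log format – a frequency table per distinct (severity, logger, message) triple could look like this:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogAggregator {

    // Matches lines like:
    // 08:55:26 SEVERE com.company.product.login.LoginLogic – LoginException ...
    private static final Pattern LINE = Pattern.compile(
            "^\\d{2}:\\d{2}:\\d{2} (\\w+) (\\S+) [–-] (.+)$");

    // Counts occurrences per (severity, logger, message) triple,
    // preserving first-seen order for readable output.
    static Map<String, Integer> aggregate(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line);
            if (m.matches()) {
                String key = m.group(1) + " " + m.group(2) + " – " + m.group(3);
                counts.merge(key, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

A monitoring product does this continuously and across tiers, but the principle is the same: collapse identical messages into one row with a count, so the 104 occurrences show up as a single line to prioritize.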
Question #1: How many users are affected and who are they?
Having the full transactional context and not just the log message allows us to figure out which critical Business Transactions of our application are impacted. From the dashboard above we can observe that “Logins” and “Logins by Username” have failed: we see that 61 users attempted the 104 logins and who these users were by their username.
Question #2: Which tiers are affected by which errors?
The next step is to analyze what other types of errors happened. In this Errors view we see all errors, such as HTTP 500, or unhandled exceptions, that occurred in the application along those execution paths that contained our template log message. It is also interesting to observe which application tiers were affected by errors:
If we were dealing with a performance related issue, we might as well want to know about the health of some affected host belonging to one of our application tiers:
Question #3: What is the root cause of this failure?
To get to know more about the root cause, it helps to understand which actions a particular user or a group of users took that led to a failure:
This visit-centric representation helps us understand what caused a single user to experience the login problem – user “dominik”, say, who visited our site from Surrey, British Columbia, Canada from 10:34 until 10:35 on June 16, 2014. During his visit he performed a couple of user actions, of which the action “click on ‘Login’” on the page “orange.jsf” failed.
By drilling further down, we can investigate the method call graph that was executed by the servers on behalf of the failed user action:
Here, we could observe that there were failed method invocations in the “/services/AuthenticationService” servlet. We would also see any relevant method arguments, executed SQL statements, exceptions and log messages, as well as performance metrics for each call.
In this article, I have demonstrated what a useful strategy for analyzing application failures in production could look like: one that delivers insights out of the box and comes with the magic ingredient: full transaction context for all transactions.
However, there is even more to it: the right tool can help your teams embrace DevOps. Dynatrace (available as a Free Trial), for example, allows you to record the monitored session to a file that you can share among your teams. This mechanism not only secures evidence, it also enables collaboration: production failures are tracked by operations and communicated to and tackled by development teams. In parallel, test automation engineers can refine their testing and deployment strategies to prevent the same failure from happening again.
Feel free to share your thoughts with me.