Does the following situation sound familiar? From one minute to the other, your production servers grind to a halt, terse emails are complemented by equally hectic phone calls, and the first order of business is to get back up and running. After the dust settles, you’re usually left with a pile of log files and the assignment of figuring out what happened, why it happened, and what to do to keep it from happening again.
A common first step is trying to reproduce what has gone wrong. More often than not, this consumes a considerable amount of time that would be better spent on actually fixing the problem. In this first blog post of a series, I will present you a Step-by-Step Guide to Diagnose Stuck Transactions within minutes and show how a modern APM Solution helps to pinpoint common production problems, without spending hours on reproducing them at first.
The Problem: Response Time Increases
One of our customers deployed a new version of their prime web application, and for the first couple of days, everything seemed fine. One evening, their operations team was alerted about significantly increasing response time, and upon further investigation, they recognized that a Stuck Transaction was blocking their application. While restarting the server solved the problem from an operations perspective it was not a long term solution as currently processing transactions got canceled and users were thrown off the system.
Their APM solution automatically captures thread dumps in case of too long running or stuck transactions. These dumps assist developers with diagnosing the issue. Let’s take this example and walk through one of the problems often seen in a production environment: Diagnosing Stuck Transactions and Identifying the Root Cause.
Step 1: Identify problematic JVM/CLR
To identify the correct thread dump to analyze, it’s important to know which server was affected by the stuck transaction, and at what time that happened.
When drilling to the actual transactions that flow through that application server and focusing on the timeframe just before the server got restarted, I noticed a timed-out PurePath that spent 100% in sync time. This means that all it did was wait for one or more monitors, which makes it a likely culprit for our stuck transaction.
Step 2: Identify blocked threads
Having identified the affected application server, we can browse the list of available thread dumps and open the corresponding one.
When looking at the thread dump, threads can be grouped by their state to get a quick overview on how many threads the application executed when the dump was performed and how many of them were actually blocked. In this case, we are looking at four threads as the following screenshot shows:
Step 3: Identify root cause
Already the first one of these threads is the thread from our timed-out PurePath we looked at in Step 1 – identifiable by its ID. We also get the most important piece of information we need to identify the root cause: Which object is this thread waiting on, and who owns it?
Usually, the thread owning the object we are waiting for is highlighted in red. Given the number of threads in this dump, it’s unlikely to spot it this way, but search for the ID of the owning thread giving us the following information:
In this case, the method com.vaadin.ui.Table.unregisterPropertiesAndComponents is working on the same instance of ConfigurableReportsApplication that AbstractApplicationPortlet.handleRequest is waiting for. Having identified the thread, we have the full stack trace at hand in the lower pane, and can track down how this situation occurred.
Trying to reproduce such concurrency-related problems on a developer machine typically requires a major effort. If your APM solution fully supports collaboration between teams, all analysis steps described above can be performed in your local environment, without direct access to production servers. Using an APM solution like Compuware dynaTrace, it only took a couple of minutes to answer the crucial questions named above: we know exactly what happened and how it happened, and are thus also able to modify our application in a way to avoid this situation in the future.