Have you seen this error message before “java.sql.Exception: ORA-00060: deadlock detected while waiting for resource”?
This is caused when parallel updates require locks on either rows or tables in your database. I recently ran into this exception on an instance of an IBM eCommerce Server. The first thought was that there are simply too many people hitting the same functionality that updates Sales Tax Summary information – which was showing up in the call stack of the exception:
The logical conclusion would be to blame this on too many folks accessing this functionality or outdated table statistics causing update statements to run too long causing others to run into that lock. It turned out to be caused by something that wasn’t that obvious and wouldn’t have shown up in any Exception stack traces or log files. A misconfigured timeout setting on the load balancer caused a re-execute of the original incoming web request. While the first app server was still updating the table and holding the lock – as it had a longer timeout specified as the load balancer – the second app server tried to do the same thing causing that exception.
In this blog I’ll show you the steps necessary to analyze the symptoms (timeouts and client errors) and to identify and fix the root cause of the problem.
Step #1: Identifying Who and What is Impacted
Identifying failing results is easy – start by looking at HTTP Response Codes (Web Server Access Log), Severe Log Messages (App Server Logs) or problematic Exception objects (either in log files or other monitoring tools). In our case I identified the SQL Lock Exception, a corresponding severe log message and the resulting HTTP 500 and also traced it back to the individual users and their actions that caused these issues.
This impacts our Business!
Now we know that this problem impacts a critical feature in our app: Users can’t add items to their cart.
Step #2: Understanding the Transaction Flow
Before drilling deeper I typically get an overview of the flow of the transaction from the browser all the way back to the database. This high-level view lets me understand which application components are involved and how they are interconnected. The transaction flow in this case highlights some interesting issues with the “Add Item to Cart” click. It appears to execute more than 33k SQL Statements for this single user interaction causing 45% time executed just in Oracle:
This is an Architectural Problem
Getting the full end-to-end execution path for a single user interaction (Add Item to Cart) and seeing how it “branches” out makes it really obvious how these individual problems (too many SQL, High Execution Time, …) end up impacting end users. Just spotting the individual hotspots without having this connection would make it harder to understand the real root cause.
Step 3: Analyzing Database Hotspots
Starting from the back we can look at these 33k SQL Executions. Are these different statements or is the same statement called several thousand times due to bad coding or incorrectly configured data access frameworks? The database hotspots reveal that it is mainly the latter – only a few statements make the majority of these calls:
It’s Not the Database – It’s the App
Analyzing database activities from the database alone wouldn’t show that problem because the individual executions of these statements are actually fast. Knowing that they come from a single end user interaction moves the hotspot back to the application. That’s where the real problem is. The solution is to reduce the number of SELECT and UPDATE statements executed whether it is through optimizing the SQL or caching data is up to the development team.
Step 4: Analyzing Load Balancer Behavior
Now that we know these requests spend a lot of time in executing database queries it also explains the very long execution time. What we don’t yet know is why we see two outgoing requests from IHS (which acts as a Load Balancer) for one incoming request. A closer look at the timeline of this request shows exactly what happens when:
- The Load Balancer (IHS) forwards the request to server 01
- After 60 seconds it assumes the request is gone and re-execute it to server 02
- After 120 seconds IHS returns an HTTP 500 error to the end user
- After 199s and 282s these two requests actually come back successfully
Ops Needs to Sync with Dev
When Ops is responsible for the infrastructure and components (like IHS which does the load balancing) and they don’t sync up with the Application Server Team, situations like this can happen where a component that should prevent problems can actually cause extra problems.
That’s why it is important that these teams work together to deploy an optimally configured end-to-end system.
To sum up the key lessons learned in this story:
- Application Team: make sure you understand what your application is actually doing on the database. You most likely use frameworks that abstract the data access layer. Make sure to optimize for your application and not run with default settings. Learn more on Bad DB Access Patterns
- Database Admins: Don’t just rely on your own database analysis tools. Get together with the Application guys to figure out if individual SQLs are slow or whether they are just calling too many statements. That eliminates the guess work. Learn more about Database Performance Issues
- Operations: Make sure to discuss your deployment settings with the application team. Only then you can ensure that all the cool components you have in your infrastructure will support the hosted applications. Learn more about Proper Server Settings.
- Business: Don’t wait until your users are complaining. Become proactive and get monitoring and alerting based on technical evidence such as bad HTTP responses, database errors and so on. Learn more about Business Performance Optimization.