An eCommerce site that crashes seven times during the Christmas season, and stays down for up to five hours each time, loses a lot of money and suffers reputation damage. It happened to one of our customers before we started working with them. They shared their story and what they learned at our annual perform conference earlier this month. Among the several reasons that led to these crashes, I want to share more details on one that I also see often with other websites: Load Balancers set to Round-Robin instead of Least-Busy can easily lead to App Server crashes caused by heap memory exhaustion. Let’s dig into the details and see how to identify these problems and how to avoid them.
The Symptom: Crashing Tomcat Instances
The website is deployed on 6 Tomcats with 3 front-end Apache Web Servers. During peak load hours, individual Tomcat instances started showing growing response times and a growing number of requests in the Tomcat processing queue. After a while these instances crashed with out-of-memory exceptions, which also brought down the rest of the site because the remaining servers could no longer handle the load. The following image shows the actual flow of transactions through the system, highlighting unevenly distributed response times in the Application Servers and functional errors being reported on all tiers (red colored server icon):
Once the App Server started rejecting incoming connections, we can observe the first ripple effect of errors. We can see a very high number of Exceptions in the Database Layer, Exceptions thrown between Application Tiers, and the Web App responding with HTTP 500s:
The Root Cause: Inefficient Database Statements and Connection Pool Usage
The exceptions caught in the Database Layer (JDBC) were already a very good hint at the root cause of this problem. A closer look at the Exceptions shows that the connection pools are exhausted, which causes problems in the different components of the application:
Looking at the performance breakdown by application layer reveals how much performance impact connection pooling has on the overall transaction response time:
The size of the pool was not the only problem. Several very inefficient database statements took a long time to execute for some of the application’s business transactions, which caused the Application Server to hold on to each connection for longer than normal. Because the load balancer was configured with Round-Robin, an App Server that was already busy still received its share of new requests. Eventually, just by the random nature of incoming requests, one App Server received several requests that executed these inefficient database calls. Once the connection pool was exhausted, the application started throwing exceptions, which ultimately also led to a crash of the JVM. Once the first App Server had crashed, it didn’t take long for the other App Servers to go down as well.
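To make the pile-up effect concrete, here is a minimal toy simulation in Python. It is only a sketch under stated assumptions, not the customer's setup: the function name `peak_in_flight`, the tick-based clock, and the workload (one request per tick, every sixth request slow) are all made up for illustration. The fixed pattern is a worst case that stands in for the random clustering of slow requests described above.

```python
def peak_in_flight(policy, durations, n_servers=6):
    """Return the highest number of in-flight requests seen on any one server.

    One request arrives per tick; durations[i] is how many ticks request i
    keeps its server (and, in the real system, a DB connection) busy.
    """
    in_flight = [[] for _ in range(n_servers)]  # finish times per server
    peak = 0
    for now, duration in enumerate(durations):
        # Drop requests that have finished by this tick.
        in_flight = [[t for t in server if t > now] for server in in_flight]
        if policy == "round_robin":
            # Next server in turn, regardless of how busy it is.
            target = now % n_servers
        elif policy == "least_busy":
            # Server with the fewest requests currently in flight.
            target = min(range(n_servers), key=lambda s: len(in_flight[s]))
        else:
            raise ValueError(f"unknown policy: {policy}")
        in_flight[target].append(now + duration)
        peak = max(peak, len(in_flight[target]))
    return peak


# Hypothetical workload: every sixth request is slow (60 ticks), the rest fast (2 ticks).
durations = [60 if i % 6 == 0 else 2 for i in range(120)]
rr_peak = peak_in_flight("round_robin", durations)
lb_peak = peak_in_flight("least_busy", durations)
print("round-robin peak:", rr_peak)
print("least-busy peak:", lb_peak)
```

With this workload, Round-Robin keeps sending every slow request to the same server, which ends up with 10 requests in flight, while Least-Busy never puts more than 2 on any server. Real traffic is less regular, but the mechanism is the same: Round-Robin ignores how loaded a server already is.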
The Solution: Optimizing App and Load Balancer
The problem was fixed by looking at the slowest database statements and optimizing them for performance, for example by adding indexes on the database or making the SQL statements more efficient. They also adjusted the pool size to accommodate the expected load during peak hours.
In addition, they changed the Load Balancer setting from Round-Robin to Least-Busy, which was the LB vendor’s preferred setting; this configuration had simply been forgotten in the production environment.
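As a rough sanity check on pool sizing, Little's Law says the average number of connections in use is roughly the request rate multiplied by the time each request holds a connection. The sketch below applies that rule; the function name `estimate_pool_size`, the headroom factor, and all numbers are illustrative assumptions, not the customer's figures.

```python
import math


def estimate_pool_size(requests_per_second, hold_seconds, headroom=1.5):
    """Estimate connections needed per app server using Little's Law.

    Average connections in use = arrival rate * time each one is held.
    The headroom factor (an assumed safety margin) covers bursts.
    """
    return math.ceil(requests_per_second * hold_seconds * headroom)


# Hypothetical numbers: 40 requests/s per server.
print(estimate_pool_size(40, 0.25))  # statements holding a connection for 0.25 s
print(estimate_pool_size(40, 2.0))   # slow statements holding it for 2 s
```

With these made-up inputs the estimate goes from 15 to 120 connections. This is why tuning the slow statements matters at least as much as enlarging the pool: the longer a statement holds a connection, the more connections the same traffic needs.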
The Result: No Site Downtime Since
Since they made the changes to the application and the Load Balancer, the site has never gone down. The next holiday season is now coming up, and they are ready for the seasonal spikes. Even though they are confident that everything will work without problems, they learned their lesson and are approaching performance proactively through proper load testing.
Next Steps: Proactive Performance Management
The lesson learned was that these problems could have been found before the holiday shopping season through proper load testing. They had done load testing before but never encountered this problem, for two reasons:
- they didn’t test with expected peak volumes for long enough sessions, and
- they didn’t use a tool that simulated real variations in customer behavior (too few scripts, and the scripts were too simple) and that actually exercised their highly interactive web site.
Their strategy for proactive performance management is to:
- Perform Load Tests on the production system during low-traffic hours (2AM-6AM), accepting the risk of minor sales losses in case of a crash rather than major sales losses during the holiday shopping season.
- Multiply the hourly load test volume by 2.5 since their actual peaks are 10 hours long.
- Use a Load Testing Service that uses real browsers in different locations around the US.
- Use an APM Solution that identifies problems within the application while running the load test.
If you want to read more on common performance problems that are not found prior to moving to production, check out my recent series of blog posts: Supersized Content, Deployment Mistakes, and Excessive Logging.