To be ready for the Christmas season, online retailers typically bring their shops into shape right before Black Friday. Together with Cyber Monday, these are the most important days of the retailer’s year.
Stilnest.com (@Stilnest) is a publishing house for designer jewelry that runs its online shop on Magento. While the team at Stilnest did a good job preparing their environment, the interest in their products – and therefore the traffic on their site – was much higher than expected. The shop even went down after a YouTube star released a new video showing off her new jewelry line powered by Stilnest.
In this classic “War Room” situation I worked together with Stilnest to find the root cause(s). The good news is that we worked out a solution and brought the shop online again – just in time for Black Friday. This blog post summarizes the technical details of what went wrong in their case. I hope you find it useful and that it prompts you to check your own environment – the holiday shopping season is not over yet:
- Fixed environmental settings: processes per CPU core
- Optimized the number of database statements per request
- Sped up the database connection
- Optimized 3rd party modules (Magento marketplace)
A quick overview of their environment
The shop runs in AWS. A load balancer distributes the incoming requests to different servers, organized in an auto scaling group. Each server runs one instance of Nginx and three instances of PHP-FPM. Another server running Varnish is used for caching, while the PHP processes connect to a MySQL database, hosted by AWS.
What they did to prepare for Black Friday!
The first thing you want to confirm is that the servers running Nginx and PHP are not overloaded. The general rule for Nginx is one worker process per CPU core. In our setup we use EC2 instances with 4 cores and start 1 Nginx worker per instance; the other 3 cores are reserved for PHP. PHP itself has no limitation on CPU cores – you can run as many child processes as your memory allows. Check the memory consumption of your application and set the number of child processes accordingly!
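This sizing translates into a few configuration lines; here is a minimal sketch, assuming a 4-core instance and roughly 60 MB of memory per PHP child (both numbers are illustrative, not Stilnest’s actual values):

```
# nginx.conf — one worker process for the core reserved for Nginx:
worker_processes 1;

# php-fpm pool config (e.g. www.conf) — child count is bounded by
# memory, not cores: ~6 GB free / ~60 MB per child ≈ 100 children
# (illustrative numbers; measure your own application first)
pm = static
pm.max_children = 100
```

With `pm = static` the pool always runs the full number of children; if your traffic is spikier, `pm = dynamic` with matching min/max spare servers is the usual alternative.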
Then ensure your auto scaling group in EC2 is configured properly! Several metrics, such as CPU and memory consumption, work well for this purpose. But there are more relevant metrics that you should consider:
- page load time
- request rates
- concurrent users
- user experience
APM tools like Dynatrace provide this data in a convenient form, so it can be fed directly into the EC2 auto scaling criteria.
These charts demonstrate the scaled environment under different load conditions:
- web requests over time
- server instances at low load
- auto scaled server instances under peak load
But be prepared for a large number of instances! In our example all PHP processes connect to the same MySQL database! Make sure your database server is sized properly!
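A quick back-of-the-envelope calculation shows why database sizing matters here. The instance and pool counts below follow the setup described above; the children-per-pool figure is a hypothetical example:

```python
# Estimate of concurrent MySQL connections under auto scaling.
# instances and fpm_pools follow the setup in this post;
# children_per_pool is a hypothetical illustrative value.
instances = 50          # auto scaled EC2 instances at peak load
fpm_pools = 3           # PHP-FPM instances per server
children_per_pool = 20  # pm.max_children per pool (assumption)

max_connections = instances * fpm_pools * children_per_pool
print(max_connections)  # prints 3000
```

Every one of those children can hold a database connection at the same time, so MySQL’s `max_connections` (and the instance behind it) must be sized for the scaled-out fleet, not for a single server.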
Tip: to avoid another possible performance bottleneck, use IP addresses rather than hostnames to connect your application to the database! This ensures there are no additional wait times caused by DNS lookups!
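The DNS overhead this tip avoids is easy to measure yourself. A short sketch (Python here for illustration – in the PHP application the same lookup happens implicitly whenever PDO is given a hostname):

```python
import socket
import time

def resolve_time(hostname: str) -> float:
    """Return the seconds spent resolving hostname to an IP address."""
    start = time.perf_counter()
    socket.gethostbyname(hostname)
    return time.perf_counter() - start

# A hostname costs a lookup on every new connection (unless cached);
# connecting straight to an IP address skips this step entirely.
print(f"localhost resolved in {resolve_time('localhost') * 1000:.3f} ms")
```

On a healthy resolver this is small, but under load – or with a slow or flaky DNS server – it adds up across thousands of new connections.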
As a final step load tests were performed. Under high request rates the environment scaled up to more than 50 instances. Everything seemed to be ready for the big season.
The webshop was running fine. Four days before Black Friday the video was published on YouTube, where a new jewelry line was presented: necklaces with zodiac signs, produced by Stilnest, that could be ordered via their webshop. The response to the announcement was much greater than expected and, as a result, the webshop crashed.
As we had the application monitored with Dynatrace, it was rather easy to start investigations. Using the Performance Hotspot Dashlet in Dynatrace we determined that the major contributor to the response time was the database.
performance hotspot: database
Further drill down revealed more about the issue:
- very long execution time of the PDO class’ constructor
- very long execution time of several SQL statements
- in some cases a huge amount of DB statements per webrequest
transaction flow for a single webrequest with far too many calls to the database
very slow executions of several database statements
caller breakdown showed the originator of the slow statements
The caller breakdown pointed to a third-party module from the Magento marketplace that is intended to optimize performance, but does so at the cost of a large number of database calls. It turned out to be unsuitable for large traffic volumes: the assumption was that the high load it put on the database caused both the slow response times and the connection delays. The performance gain on the frontend was less than 100 milliseconds, against a performance loss of several hundred milliseconds – and even seconds – in the backend. Therefore, the only appropriate solution was to simply remove this “performance module”.
The result was a remarkable performance gain, and the load tests no longer crashed the application. I have no idea why a module that’s intended for performance optimization is designed to use that many database calls. That might work for smaller applications, but not for a high-traffic online shop. In high-traffic environments we typically try to avoid database calls by implementing appropriate caching mechanisms, rather than creating additional load on the database!
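The caching point can be illustrated with a minimal sketch (Python for brevity; the query function, its result, and the product IDs are hypothetical stand-ins for real database calls):

```python
from functools import lru_cache

CALLS = {"db": 0}  # counts how often the "database" is actually hit

@lru_cache(maxsize=1024)
def product_price(product_id: int) -> float:
    """Stand-in for an expensive database query, cached per product_id."""
    CALLS["db"] += 1
    return 19.99  # hypothetical query result

# 10,000 requests for the same product hit the "database" exactly once;
# without the cache, each request would trigger its own query.
for _ in range(10_000):
    product_price(42)
print(CALLS["db"])  # prints 1
```

In a real Magento shop this role is played by layers like Varnish (already in this setup), Magento’s block cache, or an object cache such as Redis – the principle is the same: answer repeated reads from memory instead of the database.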
Be careful when using 3rd party modules! Test them properly, especially under high load!!!
Stilnest.com is ready for Cyber Monday!!!
Additional jewelry promotions like this are planned for the next couple of days. Stilnest’s online shop was ready for Black Friday and the rest of the holiday shopping season.
Be sure to set up and test your application properly! Make sure the servers are sized for the expected load. Load tests are essential to confirm your environment will not crash during the high season, especially on Black Friday and Cyber Monday. Don’t only consider certain tiers of your application. Instead, monitor everything end-to-end, from the user actions in the browser down to the database. An easy way to analyze your full stack is to use the Dynatrace Free Trial and Dynatrace Personal Edition. Request your free license, download the software and start monitoring! For PHP-based apps, make sure to check out the Dynatrace PHP Video Tutorial.