It’s Black Friday in the US! For me, located in Europe, it’s actually already early Saturday. I spent the past hours troubleshooting eCommerce sites that went down during their Black Friday peak sales periods. My friends at hybris kept me in the loop on their effort to keep their customers’ commerce sites up and running.

I thought I’d share my experience analyzing the root causes on four sites powered by hybris. None of these problems took me longer than 15 minutes to analyze. Most of the time I spent writing emails and jumping on WebEx calls with nervous clients to the east and to the west. I did calls as far away as Kazakhstan and Toronto, with me sitting in the middle in Austria.

The really surprising thing for all four situations? It would have been so easy to avoid the downtime of the shops!

If only….

#1 Someone would have looked at the number of DB calls … per transaction!

This is a very old and very common anti-pattern: TOO MANY DATABASE CALLS. It is simply not OK if a single page in your eCommerce shop makes 3,200 DB calls every time someone accesses it! This WILL kill your database on Black Friday! And if that page happens to be your sales promotion page, say goodbye to your orders and your boosted revenue!

A simple architectural check can avoid this: Too many Database Calls

Want to learn more? Read Identifying bad database access patterns in your pipeline – fully automated!
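To make the anti-pattern concrete, here is a minimal JDBC sketch. It is not taken from hybris code: the product table and the two loader methods are hypothetical, but the pattern is the one I keep seeing. One query per product turns a single page view into thousands of DB round trips; a single batched query does the same work in one call.

```java
// Hypothetical sketch: the classic N+1 pattern vs. a single batched query.
// Table and method names are illustrative, not from the hybris code base.
import java.sql.*;
import java.util.*;

public class PriceLoader {

    // Anti-pattern: one DB round trip per product on the page.
    // 3,200 products on a promotion page = 3,200 DB calls per page view.
    static Map<String, Double> loadPricesOneByOne(Connection con, List<String> skus) throws SQLException {
        Map<String, Double> prices = new HashMap<>();
        for (String sku : skus) {
            try (PreparedStatement ps = con.prepareStatement(
                    "SELECT price FROM product WHERE sku = ?")) {
                ps.setString(1, sku);
                try (ResultSet rs = ps.executeQuery()) {
                    if (rs.next()) prices.put(sku, rs.getDouble(1));
                }
            }
        }
        return prices;
    }

    // Better: one round trip for the whole page, regardless of product count.
    static Map<String, Double> loadPricesInOneCall(Connection con, List<String> skus) throws SQLException {
        String placeholders = String.join(",", Collections.nCopies(skus.size(), "?"));
        Map<String, Double> prices = new HashMap<>();
        try (PreparedStatement ps = con.prepareStatement(
                "SELECT sku, price FROM product WHERE sku IN (" + placeholders + ")")) {
            for (int i = 0; i < skus.size(); i++) ps.setString(i + 1, skus.get(i));
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) prices.put(rs.getString(1), rs.getDouble(2));
            }
        }
        return prices;
    }
}
```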

#2 Someone would have checked the external service usage … and optimized it to the max!

External service calls are always somewhat expensive. Of course they are: they are external, which means network connections have to be established, response and round-trip times have to be factored in, and you can’t do much about them in the first place. So make sure you use them in a smart way: make as few calls as possible, make smart calls, and transfer only the data you need. AND: when you see that an external service call eats up 50% of your transaction’s response time during quiet periods, you can be certain it will be the same, and very likely worse, during peak times when you have three times the traffic and three times the calls to the external service!

External Services are critical, likely more sensitive to high load
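Here is one way to keep such calls cheap, offered only as a hedged sketch: the ExternalRatingClient interface and its fetchRating method are made-up placeholders, and the 60-second TTL is just an example value. The idea is simply to avoid paying the external round trip on every single page view during peak traffic.

```java
// Hypothetical sketch: cap the cost of an external call by caching results for a short TTL
// and requesting only the data the page actually needs.
// ExternalRatingClient and fetchRating are made-up placeholders, not a real vendor API.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class RatingCache {

    interface ExternalRatingClient {
        double fetchRating(String productId); // remote call: slow and sensitive to load
    }

    private static final long TTL_MILLIS = 60_000; // serve the cached value for 60 seconds

    private record Entry(double rating, long loadedAt) {}

    private final Map<String, Entry> cache = new ConcurrentHashMap<>();
    private final ExternalRatingClient client;

    public RatingCache(ExternalRatingClient client) {
        this.client = client;
    }

    public double getRating(String productId) {
        long now = System.currentTimeMillis();
        Entry e = cache.get(productId);
        if (e != null && now - e.loadedAt < TTL_MILLIS) {
            return e.rating; // no external round trip for most page views during the peak
        }
        double fresh = client.fetchRating(productId); // one remote call per product per TTL window
        cache.put(productId, new Entry(fresh, now));
        return fresh;
    }
}
```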

#3 Someone would have re-thought and reviewed past optimizations

Of course it’s not easy to plan ahead for every little aspect of the high-load times. Assume you have optimized your environment to the max; for example, you found the perfect Java garbage collection settings after a load test. But then you found some other issues and made changes to the environment: adding servers, removing tiers, deploying applications, and fixing code. Wouldn’t it be wise to spend a thought on those past optimizations that were perfect, but only for a slightly different situation? A “no, we already tuned that to the optimum, we won’t touch it” attitude can lead you and your environment right into a ditch!
By raising the fixed GC trigger level in the example below, we could avoid long, last-ditch GC runs and make use of the increased memory, while cleanup happened faster.

Sometimes it makes sense to rethink previous adjustments after other changes have been made
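If you want to check whether old GC tuning still fits after environment changes, a quick snapshot via the standard JMX beans is often enough. This is a minimal sketch using only the JDK’s management API, with no assumptions about your specific collector or settings; compare the numbers before and after a change instead of trusting yesterday’s “optimum”.

```java
// Minimal sketch: re-measure GC behaviour after every environment change instead of
// trusting settings that were tuned for an older setup. Uses only standard JDK APIs.
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class GcSnapshot {
    public static void main(String[] args) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.printf("Heap used/max: %d / %d MB%n",
                heap.getUsed() / (1024 * 1024), heap.getMax() / (1024 * 1024));

        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // If collection counts or total pause time jump after adding servers or memory,
            // the old "optimal" GC settings are due for a review.
            System.out.printf("%s: %d collections, %d ms total%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```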

#4 Someone would have suspended the usual eCommerce business processes

This is not an obvious one, but it is nevertheless important! If your eCommerce business process normally allows updating product catalogs and prices at run-time, anytime while customers are shopping online, it’s probably a good idea to suspend that process on the highest-load day of the year.
As a supermarket owner, would you clear out your shelves and replace products while the crowd is trying to get stuff into their baskets? Or would you do that after hours, when there is no one in the store and the interference with buying customers is minimal?
Changing catalogs, prices, etc. in an online store usually invalidates things like caches, resulting in higher miss rates and more load on the database, or wherever the new, current items have to be fetched from!

One little change can have a big impact somewhere else
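A tiny sketch of that effect, with purely illustrative names (CatalogCache, onCatalogSync and loadFromDatabase are not hybris APIs): the moment a catalog or price sync clears a cache, every following request becomes a miss that has to be rebuilt from the database, exactly when the database can least afford it.

```java
// Hypothetical sketch of why a catalog update during peak traffic hurts:
// every update invalidates cached entries, and the next requests all fall through to the DB.
// CatalogCache and loadFromDatabase are illustrative names, not hybris APIs.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CatalogCache {

    private final Map<String, String> renderedPages = new ConcurrentHashMap<>();

    public String getPage(String productId) {
        // Cache hit: cheap. Cache miss: expensive DB lookup plus page rendering.
        return renderedPages.computeIfAbsent(productId, this::loadFromDatabase);
    }

    public void onCatalogSync() {
        // A price or catalog sync wipes the cache; under Black Friday traffic this means
        // thousands of simultaneous misses hammering the database at the worst possible time.
        renderedPages.clear();
    }

    private String loadFromDatabase(String productId) {
        // placeholder for the real, expensive lookup and rendering work
        return "rendered page for " + productId;
    }
}
```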

Conclusion

Writing this article took almost longer than finding the above issues during my Black Friday night. The first two issues are a good example of architecture validation that can happen well ahead of any production impact; they could have been identified on the developer’s desk using a free Dynatrace Personal license. The latter two are good practices: re-think changes in complex environments and re-think your processes in unusual situations.

Unfortunately, for two of the four, the APM insight came too late and their Black Friday wasn’t as good as it could have been. Maybe Cyber Monday will be better for their eCommerce business! The other two were agile enough to make changes, and their day was saved!