I recently wrote a blog post about bots causing unwanted traffic on our servers. Right after publishing it, I was notified about yet another “interesting” and unusual load pattern on our download page, which customers use to download the latest product versions and updates:
If you see load behavior like this, you typically assume that a new product version or an agent update was just released and that many people are downloading it like crazy. Unfortunately, that was not the case. The spike in traffic was caused by an implementation issue between our authentication service and our download role-check logic. It sent one customer's browser into an endless redirect loop between the authentication and download pages, which caused several thousand HTTP requests per minute.
Spotting the “Single Browser Gone Wild”
The first thing I wanted to know was which users were currently downloading our software. We use Dynatrace UEM (want to evaluate on your app? start here!), which tracks every action of every single visitor on our pages. The interesting finding was that there weren’t large numbers of users hitting the download page. Instead, a single visitor caused the traffic spike. The following shows the Dynatrace Visits dashlet highlighting the one user from North America, using Firefox 31, who requested the same Single Sign-On page more than 5000 times in a couple of minutes:
Root Cause: Incorrect Handling of User Roles
Looking at the first User Action showed me that the user was correctly redirected to the Single Sign-On page that we have in our system. He entered his username and password and hit next. Then I explored the PurePaths of the following User Actions to find out what happened next. It turned out that the user, who had logged on successfully (the username is captured as part of the PurePath), didn’t have any of the internal user roles assigned that we use to manage privileges such as downloading software or opening a support ticket.
After the login page redirected back to the download page, that page redirected back to login because the user was missing the download role privilege. The browser then automatically reposted the login page, which started the endless redirect loop between the login page and the download page:
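The loop described above can be modeled in a few lines. This is only a minimal sketch for illustration, not our actual implementation: the page paths, the `User` class, the `fixed` flag and the `/access-denied` page are all assumptions made up for this example, and the real fix the team shipped may look quite different.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple

# Hypothetical page paths, chosen only for this illustration.
LOGIN = "/login"
DOWNLOAD = "/download"
ACCESS_DENIED = "/access-denied"


@dataclass
class User:
    """An already authenticated user with a (possibly empty) set of roles."""
    name: str
    roles: Set[str] = field(default_factory=set)


def next_page(page: str, user: User, fixed: bool) -> Optional[str]:
    """Return the redirect target for `page`, or None if the page is served."""
    if page == DOWNLOAD:
        if "download" in user.roles:
            return None  # the user may download, so the page is served
        # Buggy variant: a missing role is treated like a missing login.
        # Assumed fix: send the user to a terminal "access denied" page.
        return ACCESS_DENIED if fixed else LOGIN
    if page == LOGIN:
        return DOWNLOAD  # the user is logged in, so login bounces back
    return None  # any other page (e.g. access denied) is terminal


def follow_redirects(user: User, fixed: bool,
                     max_hops: int = 20) -> Tuple[List[str], bool]:
    """Follow redirects starting at the download page.

    Returns the visited pages and a flag telling whether the chain was
    still redirecting after `max_hops` hops (an arbitrary simulation cap).
    """
    page = DOWNLOAD
    path = [page]
    for _ in range(max_hops):
        target = next_page(page, user, fixed)
        if target is None:
            return path, False
        page = target
        path.append(page)
    return path, True


if __name__ == "__main__":
    roleless = User("customer-without-roles")
    print(follow_redirects(roleless, fixed=False))  # keeps bouncing
    print(follow_redirects(roleless, fixed=True))   # ends at access denied
```

The point of the sketch is that an authenticated user without the required role needs a terminal outcome. As long as "not allowed" is handled the same way as "not logged in", the two pages keep handing the browser back and forth.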
Life Saver for Continuous Deployment: Having Every Request of Every Visitor Available
I know that many performance experts out there believe that looking at high-level metrics is enough to track end-user problems. I agree to a certain extent, as you can find certain problems by looking at high-level or aggregated metrics, but there are cases like this one where having all the data available is a life saver.
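To make the difference between aggregated and per-visitor data concrete, here is a small sketch. It does not use any Dynatrace API. The request records, the visitor IDs and the threshold are made-up values for illustration, with numbers chosen to resemble the incident (one visitor with 5000 requests next to ordinary traffic).

```python
from collections import Counter
from typing import Iterable, List, Tuple

Request = Tuple[str, str]  # (visitor id, requested URL), both hypothetical


def noisy_visitors(requests: Iterable[Request],
                   threshold: int) -> List[Tuple[str, int]]:
    """Return (visitor, request count) pairs at or above `threshold`,
    ordered from the busiest visitor down."""
    counts = Counter(visitor for visitor, _url in requests)
    return [(v, n) for v, n in counts.most_common() if n >= threshold]


# Made-up sample: one browser stuck in a loop plus 40 regular visitors.
sample = [("visitor-1", "/login")] * 5000
sample += [(f"visitor-{i}", "/download")
           for i in range(2, 42) for _ in range(3)]

if __name__ == "__main__":
    print("total requests:", len(sample))       # the aggregate view
    print("suspects:", noisy_visitors(sample, threshold=1000))
```

The total request count alone looks like a busy release day. Only the per-visitor breakdown shows that almost all of the load comes from a single browser.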
In my case, I sent an email to the responsible team, including the exported PurePaths shown in this blog post. They went ahead and provided a fix within minutes. Traffic went back down to normal before any other customer with a similar problem could overload our servers.
It turned out that the problematic code was introduced by a recent deployment. No automated test covered the download scenario for a user without a role assignment, as this case was not anticipated. Even though we have good test coverage, it is good to know that we have a safety net in production that gives us the level of detail we need to fix problems that slip through our automated testing as fast as possible.
I am sure we are not the only ones with stories like this. So feel free to comment and tell us about your bad deployments and how you deal with them.
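A test for the missing scenario could look roughly like the following. This is a sketch under assumptions, not our real test suite: `handle_download` is a hypothetical stand-in for the download page's role check, and answering with a 403 status is just one plausible way to handle a logged-in user without roles.

```python
from typing import FrozenSet, Optional, Tuple


def handle_download(authenticated: bool,
                    roles: FrozenSet[str]) -> Tuple[int, Optional[str]]:
    """Hypothetical role check: returns (HTTP status, redirect target)."""
    if not authenticated:
        return 302, "/login"  # anonymous users must log in first
    if "download" not in roles:
        return 403, None      # logged in but not allowed: no redirect
    return 200, None          # logged in and allowed: serve the page


def test_user_without_roles_is_not_sent_back_to_login():
    # The scenario that was not covered: valid login, no roles at all.
    status, location = handle_download(True, frozenset())
    assert status == 403
    assert location is None


def test_anonymous_user_is_redirected_to_login():
    status, location = handle_download(False, frozenset())
    assert (status, location) == (302, "/login")


def test_user_with_download_role_gets_the_page():
    status, location = handle_download(True, frozenset({"download"}))
    assert (status, location) == (200, None)
```

The functions follow the naming convention that test runners such as pytest pick up automatically. The first test is the important one, because it pins down that a role-less user never receives a redirect back to the login page.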