Have you ever deployed a change to production and thought “All went well – Systems are operating as expected!” but then you had to deal with users complaining that they keep running into errors?
We recently moved some of our systems between two of our data centers – even moving some components to the public cloud. Everything was well prepared, system monitoring was set up and everyone gave the thumbs up to execute the move. Immediately after the move, our Operations dashboards continued to show green. Soon thereafter I received a complaint from a colleague who reported that he couldn’t use one of the migrated services anymore because the authentication web service seemed to fail. The questions we asked ourselves were:
- Impact: Was this a problem related to his account only or did it impact more users?
- Root Cause: What is the root cause and how was this problem introduced?
- Alerting: Why don’t our Ops monitoring dashboards show any failed web service calls?
It turned out that the problem:
- Was caused by the deployment of an outdated configuration file
- Only impacted employees whose accounts were handled by a different authentication backend service
- Didn’t show up in Ops dashboards because the SOAP framework we use always returns HTTP 200 and transports any success/failure information in the response body, which doesn’t show up in any web server log file
In this blog post I give you a little more insight into how we triaged the problem, along with some best practices we derived from the incident to level up technical implementations and production monitoring. Only if you monitor all your system components and correlate the results with deployment tasks will you be able to deploy with more confidence without disrupting your business.
Bad Monitoring: When Your End Users become your Alerting System
When I got a note from a colleague that he could no longer use Dynatrace AJAX Edition to analyze the performance of a particular web site, I launched my own copy to verify this behavior. It failed with my credentials too, which proved that it was not a local problem on my colleague’s machine.
Asking our Ops Team that manages and monitors these web services resulted in the following response:
“We do not see any errors on the Web Server nor do we have any reported availability problems on our authentication service. It’s all green on our infrastructure dashboards as can be seen on the following screenshot:”
Web Server Log Monitoring is NOT ENOUGH
As mentioned in the initial paragraph, it turned out that our SOAP Framework always returns HTTP 200 with the actual error in the response body. This is not an uncommon “Best (or worst) Practice”, as you can see for instance in the following discussion on GitHub.
The problem with that approach, though, is that “traditional” operations monitoring based on web server log files will not detect any of these “logical/business” problems. As you don’t want to wait until your users start complaining, it’s time to level up your monitoring approach. How can this be done? Those developing and those monitoring the system need to sit down together and figure out how to monitor the usage of these services, and they need to talk with Business to decide which level of detail to report and alert on.
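To make the gap concrete, here is a minimal sketch of a check that looks inside the response body instead of trusting the HTTP status code. It assumes the service answers with a standard SOAP 1.1 or 1.2 envelope. The sample payloads, the `AuthenticateResponse` element, the fault text and the function name `classify_soap_response` are invented for illustration and are not taken from our actual service.

```python
import xml.etree.ElementTree as ET

# Namespace URIs of the standard SOAP envelopes.
SOAP_ENVELOPE_NAMESPACES = (
    "http://schemas.xmlsoap.org/soap/envelope/",  # SOAP 1.1
    "http://www.w3.org/2003/05/soap-envelope",    # SOAP 1.2
)


def classify_soap_response(http_status, body):
    """Return 'http_error', 'unparseable', 'soap_fault' or 'ok'."""
    if http_status >= 400:
        return "http_error"  # the only case a web server log would reveal
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return "unparseable"
    # A fault lives at Envelope/Body/Fault, qualified by the envelope namespace.
    for ns in SOAP_ENVELOPE_NAMESPACES:
        if root.find(f"{{{ns}}}Body/{{{ns}}}Fault") is not None:
            return "soap_fault"
    return "ok"


# Hypothetical payloads, both delivered with HTTP 200.
FAULT_BODY = """<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <soap:Fault>
      <faultcode>soap:Server</faultcode>
      <faultstring>Authentication backend unavailable</faultstring>
    </soap:Fault>
  </soap:Body>
</soap:Envelope>"""

OK_BODY = """<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <AuthenticateResponse><result>true</result></AuthenticateResponse>
  </soap:Body>
</soap:Envelope>"""

print(classify_soap_response(200, FAULT_BODY))  # soap_fault
print(classify_soap_response(200, OK_BODY))     # ok
```

Both sample responses would appear as a plain 200 in an access log. Only a check at this level, whether it runs in a synthetic monitor, in the client, or in an APM tool, can tell them apart.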
How can you find out if your current monitoring approach works? Start by looking more closely at problems that your users report but that never trigger an automatic alert. Then talk with your engineers to see whether they use frameworks like the one mentioned here.
Bad Deployment: Triaging the technical Problem
To identify the actual root cause of this problem I pulled the Dynatrace PurePath of my failed authentication request as shown in the following screenshot. If you don’t have Dynatrace you may have some detailed application traces or log files that you can look at. I can find the PurePath for my authentication request by simply using my local IP or even by the username I passed to the request as this context information is automatically captured. As Dynatrace always captures all transactions end-to-end, whether they are fast/slow/failed/successful I am sure the data is always there. You can see here how easy it is to find the root cause of the problem and why it couldn’t be picked up by monitoring web server log files:
Looking at the screenshot above, we can see how our authentication web service is actually implemented (to the big surprise of the engineers when I showed them). When the call comes in, it makes up to 3 different internal web service calls:
- 1st Call: First it checks whether the user’s session is still authenticated
- 2nd Call: If that is not the case it checks a JIRA user directory that contains all “Free” customer accounts
- 3rd Call: If the user can’t be found there either, it queries our corporate Active Directory (AD) via an LDAP Proxy to check whether the user is an employee
The first successful web service result marks a successful login and returns a positive result to the AJAX Edition.
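The three-step lookup can be sketched as a simple fallback chain. This is an illustration of the pattern only, not our service’s actual code: the backend functions, their names and the `AuthResult` type are hypothetical, and the sketch assumes that an unreachable backend raises `ConnectionError`. The point is that each attempt is recorded, so “rejected” can be told apart from “backend unreachable”.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class AuthResult:
    authenticated: bool
    backend: str = ""  # name of the backend that accepted the login
    attempts: List[Tuple[str, str]] = field(default_factory=list)  # (backend, outcome)


def authenticate(username, password, backends):
    """Try each (name, check) pair in order; the first success wins."""
    attempts = []
    for name, check in backends:
        try:
            accepted = check(username, password)
        except ConnectionError as exc:
            # Record the technical failure instead of hiding it behind "login failed".
            attempts.append((name, f"error: {exc}"))
            continue
        attempts.append((name, "accepted" if accepted else "rejected"))
        if accepted:
            return AuthResult(True, name, attempts)
    return AuthResult(False, "", attempts)


# Hypothetical stand-ins for the three internal web service calls.
def check_session(username, password):
    return False  # no valid session any more


def check_jira(username, password):
    return False  # account not found in the JIRA user directory


def check_ldap(username, password):
    raise ConnectionError("LDAP proxy unreachable")


BACKENDS = [("session", check_session), ("jira", check_jira), ("ldap", check_ldap)]

result = authenticate("employee", "secret", BACKENDS)
print(result.authenticated)  # False
for backend, outcome in result.attempts:
    print(backend, outcome)
```

If the per-backend outcomes are exported as metrics or log fields, a dashboard can show that the last step fails for technical reasons rather than because users mistyped their password.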
Root Cause: Outdated Configuration File Deployed
In the PurePath screenshot above we see that my user account (being an employee) failed to authenticate in the first two web service calls (I didn’t have a valid user session anymore and my account was not found in JIRA). The third call, which checks against AD, failed because a connection to the LDAP Proxy couldn’t be established. This seemed to be the technical reason why the authentication failed. My first guess was that this was a side effect of a recent data center migration in which we moved some of our services from one data center to another. As it turned out, this was only partially correct.
I also showed the PurePath to our system architect. His response: “Wait – we shouldn’t be calling the LDAP Proxy any longer as we migrated ALL our user accounts to JIRA.” Now that was an interesting observation! It was only possible because we had the full end-to-end transaction available, showing us what was going on internally.
It turned out that when we moved services, some outdated configuration elements were used, causing our authentication service to act as if we had never migrated our users. That’s why the web service still made the call into LDAP, which failed because the LDAP Proxy was no longer available.
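A simple post-deployment check can catch this kind of drift before users do. The following is a minimal sketch, assuming the configuration can be loaded into a flat key/value dictionary. The keys `auth.backends` and `ldap.proxy.url` and their values are made up for illustration and do not reflect our real configuration format.

```python
def find_config_drift(expected, deployed):
    """Return human-readable differences between two flat config dicts."""
    problems = []
    for key in sorted(set(expected) | set(deployed)):
        if key not in deployed:
            problems.append(f"missing key: {key}")
        elif key not in expected:
            problems.append(f"unexpected key: {key}")
        elif expected[key] != deployed[key]:
            problems.append(
                f"{key}: expected {expected[key]!r}, deployed {deployed[key]!r}"
            )
    return problems


# Hypothetical configs: the deployed file still references the retired LDAP step.
EXPECTED = {"auth.backends": "session,jira"}
DEPLOYED = {
    "auth.backends": "session,jira,ldap",
    "ldap.proxy.url": "ldap://ldap-proxy.example.internal",
}

for problem in find_config_drift(EXPECTED, DEPLOYED):
    print(problem)
# auth.backends: expected 'session,jira', deployed 'session,jira,ldap'
# unexpected key: ldap.proxy.url
```

Run as part of the deployment pipeline, a non-empty result could fail the rollout or at least be attached to the deployment event, so that later errors can be correlated with it.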
Lessons Learned for Devs, Ops and Everybody Else!
As seen in our example, all stakeholders can learn a lesson:
- Developers: Make sure that the frameworks you use not only provide the functionality you need but can also be monitored in a production environment. This also means following standard practices for basic things like error reporting. Whether this is for Web Services or other frameworks, make sure you pick the right tools for your job and think about others that need to test and monitor your applications.
- System Architects: Constantly monitor your live system to validate that it really works the way you designed it. When moving services between physical or virtual locations, make sure that everything is still working as expected. Are all configuration files correctly deployed and updated? Can services still call each other? Work with Operations to ensure dependent systems can connect.
- Operations: Make sure you understand all dependencies between services and the required configuration elements. Discuss with engineering how to monitor the successful deployment and operation of your web services. Is it enough to monitor the entry points or do you also need to monitor backend systems? Is it enough to monitor web server logs or do you need to extend monitoring to application metrics as well?
- Business: If your business relies on these web services, make sure you also get relevant dashboards that show general health and usage statistics such as Number of Failed vs Successful Requests. If possible, get a detailed breakdown of why calls are failing, e.g.: are users simply entering wrong credentials (then you may have a usability issue) or is there some other issue (then you need to ping your Ops and Apps team)?
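The failed vs. successful breakdown mentioned for Business can be sketched in a few lines. This assumes each request has already been classified with an outcome label. The labels and the sample data below are invented for illustration, not measurements from our systems.

```python
from collections import Counter


def summarize_logins(outcomes):
    """Aggregate per-request outcome labels into dashboard-style numbers."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    failed = total - counts.get("success", 0)
    return {
        "total": total,
        "failed": failed,
        "failure_rate": failed / total if total else 0.0,
        "failures_by_reason": {
            reason: n for reason, n in counts.items() if reason != "success"
        },
    }


# Made-up sample of classified authentication requests.
SAMPLE = [
    "success",
    "success",
    "wrong_credentials",
    "backend_unreachable",
    "backend_unreachable",
    "success",
]

summary = summarize_logins(SAMPLE)
print(summary["failure_rate"])        # 0.5
print(summary["failures_by_reason"])  # {'wrong_credentials': 1, 'backend_unreachable': 2}
```

Separating the reasons matters: a rise in `wrong_credentials` points to a usability issue, while `backend_unreachable` is something to alert Ops and the application team about.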
Many corporations (including ours) try to embrace the ideas of Continuous Delivery. But it takes a lot of effort, and there will always be problems along the way that automated deployment checks won’t catch. DevOps also requires Continuous Improvement – and this is what we are doing. I hope that sharing these stories will help you avoid problems others have already run into. Also – feel free to share your stories. Let us know how you solved the performance and deployment challenges in your organization.