Following an interruption in our SSO service in early January, here's an update on what we're doing to prevent a future disruption.
Two weeks ago, we experienced a service interruption with our SSO service. As our CTO Bernd Greifeneder shared in his blog post, what was a relatively small software release resulted in a large number of our customers being unable to access their Dynatrace environments. While alerting and integrations were not impacted, many of our customers who rely upon Dynatrace to get their jobs done were understandably frustrated and want to know what we’re doing to ensure the same incident does not occur again.
I would like to update you on the steps we have taken to date, what we’re still planning to do, and when we hope to have these steps completed. As you’re likely aware, we have a very agile software development process, one that allows us to introduce major functionality into production every two weeks and hotfixes whenever they are necessary. Here’s the update:
- Improve architectural design to eliminate SSO bottleneck risk [Completed]
- Security and access are critical aspects of our architecture, and as such, there are many areas we’re looking to improve. These include improving API traffic management and caching mechanisms to reduce server and network load, optimizing database queries, and adding compute resources, to name a few. While some of these are already done, such as adding compute, others require more development and testing. We anticipate these being completed by the end of Q1.
- Improve SSO deployment automation to enable faster rollbacks [Completed]
- We’ve updated our automated deployment stacks to ensure consistent versioning across all our environments, and we’ve enhanced our auto-scaling mechanisms to speed up recovery, should this ever be necessary again. (Hopefully never.) This has been completed.
- Evaluate improvements of throttling and caching strategy [Completed]
- As mentioned above, by introducing enhanced traffic management with our SSO service along with endpoint caching, we can reduce the impact of this type of issue or prevent it from occurring in the future. This is in active development and testing now, and we anticipate this being completed by the end of Q1.
- Remove dependency between SSO backend services to ensure auto-remediation [Completed]
- The dependency that caused our automated scaling and recovery mechanisms to fail, requiring us to redeploy our SSO services from scratch and prolonging this incident, has now been removed. This has been completed.
- Remove overly tight couplings between Dynatrace services that depend on SSO [Completed]
- Given that SSO is at the core of user and service authentication, we need to improve how dependent services handle exceptions when the SSO service is not available, rather than allowing an application to fail. We’re actively developing and testing new ways of handling this situation and expect to have this completed by the end of Q1.
- Increase the load testing scenarios with more corner cases for broader execution [Completed]
- We continue to profile real-world user activity and traffic patterns to incorporate into our automated CI/CD pipeline. We expect to incrementally add to this area over the next couple of weeks, with this being fully completed by the end of Q1.
- Accelerate the Dynatrace Status portal update delivery speed and review our accessibility to Tech Support in such cases [In Progress]
- It took us 30 minutes to publish the initial notification to our Dynatrace Status Page. Our goal is to make this faster and more automated. We’ve already updated several internal communication procedures to ensure we provide faster notifications. Longer term, we’re looking to redesign the status page to make it easier to find service status and to include automated notifications whenever the status changes. We’re still scoping, gathering requirements, and collecting customer feedback for a new status page.
- Additional step – Increased accessibility of Dynatrace support team.
- Finally, we also recognize that this issue limited the ability of many of you to contact our support team. We’re exploring additional channels for reaching our support team should there be similar incidents in the future. These include email, a phone number (or click-to-talk), and even social media such as Twitter. We don’t have a timeline for these yet, but as we make progress in these areas, we will be sure to let you know.
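
For readers curious what endpoint caching combined with graceful handling of an unavailable SSO service can look like in practice, here is a minimal Python sketch of the general idea from the list above. It is an illustration under stated assumptions only, not the actual Dynatrace implementation: `CachedTokenValidator`, `SSOUnavailable`, the `validate_remote` callback, and the TTL and grace-period values are all hypothetical names and numbers chosen for the example.

```python
import time
from typing import Callable, Dict, Tuple


class SSOUnavailable(Exception):
    """Raised by the (hypothetical) SSO client when the service cannot be reached."""


class CachedTokenValidator:
    """Validates tokens through SSO, caching results and degrading gracefully."""

    def __init__(
        self,
        validate_remote: Callable[[str], bool],  # hypothetical call to the SSO service
        ttl_seconds: float = 60.0,               # how long a cached answer counts as fresh
        grace_seconds: float = 900.0,            # how long a stale answer may be reused in an outage
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._validate_remote = validate_remote
        self._ttl = ttl_seconds
        self._grace = grace_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[bool, float]] = {}  # token -> (result, checked_at)

    def is_valid(self, token: str) -> bool:
        now = self._clock()
        cached = self._cache.get(token)

        # Fresh cache hit: answer locally and spare the SSO service a request.
        if cached is not None and now - cached[1] < self._ttl:
            return cached[0]

        try:
            result = self._validate_remote(token)
        except SSOUnavailable:
            # SSO is down: reuse a stale answer if it is still inside the grace window.
            if cached is not None and now - cached[1] < self._grace:
                return cached[0]
            # Nothing usable is cached, so fail closed instead of crashing the caller.
            return False

        self._cache[token] = (result, now)
        return result
```

In this sketch, a caller would construct one validator per service and call `is_valid(token)` on each request. Failing closed for tokens that were never seen is a deliberate design choice here: an outage should not grant access that was never verified, while users with a recently verified session can keep working until the grace window expires.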
It is my hope that this level of transparency helps demonstrate the partnership we strive for with each of our customers. Building a world where software works perfectly may seem impossible, but I believe that by applying the principle of continuous learning to mistakes when they happen (and when it comes to software, they will happen), we can get closer to seeing this vision become a reality.