On Tuesday, January 3, 2023, Dynatrace experienced a disruption of our SSO service. Here's an update on the cause and our plan to prevent a future disruption.
On Tuesday, January 3, 2023, at 15:26 UTC, we experienced an interruption of Dynatrace’s Single Sign On (SSO) service, preventing our customers from logging into their Dynatrace Software as a Service (SaaS) environments and other Dynatrace portals.
Our customers’ monitoring data collection was not affected, apart from a few customers with a larger dependency on SSO authorization, whom we identified and informed.
Dynatrace production monitoring immediately notified our development and support teams. Because actions beyond normal auto-remediation were required, R&D specialists began remediation right away.
The SSO service disruption occurred because a new implementation of one of the Account Settings screens used the SSO API inefficiently, which placed excessive load on the underlying SSO infrastructure. Normally, our orchestration and overload prevention mechanisms would be able to handle such situations. However, a recently introduced, unintended dependency between the deployments of two SSO services caused automatic scaling and recovery mechanisms to fail and severely complicated the rollback. In the end, we had to redeploy the SSO services from scratch, with the dependency between the two services removed. The services are fully operational again.
While reinstating access, we found that our communication on the Dynatrace Status portal was too slow, which delayed information updates to our customers. Improving this is an important action item for our team.
We’re fully aware of how much our customers and partners depend on our platform to monitor business-critical applications in their environment and how much stress and pain this incident created. This is why we continue striving daily to deliver the best observability platform in the market.
To prevent such a significant service disruption from happening again, we are taking several immediate and mid-term actions in addition to the existing rigorous automated testing process:
- Improve architectural design to eliminate SSO bottleneck risk;
- Improve SSO deployment automation to enable faster rollbacks;
- Evaluate improvements of throttling and caching strategy;
- Remove dependency between SSO backend services to ensure auto-remediation;
- Remove overly tight couplings between Dynatrace services that depend on SSO;
- Increase the load testing scenarios with more corner cases for broader execution;
- Accelerate the delivery of updates on the Dynatrace Status portal and review how accessible our Tech Support is to customers in such cases.
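To make the throttling and caching item above more concrete, here is a minimal sketch of how a caller of an authentication API could limit the load it generates. It is an illustration only, not Dynatrace's implementation: the names `TokenBucket`, `CachedThrottledClient`, and `fetch` are hypothetical, and the rate, capacity, and TTL values are assumptions chosen for the example.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TokenBucket:
    """Allow roughly `rate` calls per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start with a full bucket
        self.updated = clock()

    def try_acquire(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = now - self.updated
        self.tokens = min(float(self.capacity), self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class CachedThrottledClient:
    """Wrap a backend call (hypothetical `fetch`) with a TTL cache and a throttle."""

    def __init__(self, fetch: Callable[[str], Any], bucket: TokenBucket,
                 ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.fetch = fetch
        self.bucket = bucket
        self.ttl = ttl_seconds
        self.clock = clock
        self._cache: Dict[str, Tuple[float, Any]] = {}  # key -> (stored_at, value)

    def get(self, key: str) -> Any:
        now = self.clock()
        entry = self._cache.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]  # fresh cache hit: no backend call at all
        if not self.bucket.try_acquire():
            if entry is not None:
                return entry[1]  # over budget: serve stale data instead of adding load
            raise RuntimeError("throttled and no cached value available")
        value = self.fetch(key)
        self._cache[key] = (now, value)
        return value
```

In this sketch, a screen that requests the same data repeatedly reaches the backend at most once per TTL window per key, and a burst of distinct requests is capped by the bucket. Serving a stale value when the budget is exhausted is one possible design choice; whether stale data is acceptable depends on what is being cached.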
We thank our customers and partners for their prompt communication with our team and for their understanding while we worked hard to restore the SSO service. If you have additional questions or concerns, please do not hesitate to reach out to us.
Author’s note: Shawn White has shared an update on our progress in addressing these areas in this blog post.