At Dynatrace, we help the world’s most important applications work. Working with IT and digital business teams the world over, we are never short of horror stories, some too scary to reveal the people or brands involved. So this Halloween, I thought it best to share, anonymously, just a few frightening tales to remind you all – you are not alone in facing digital complexity.
Let me share a few of my favourites below, but please feel free to leave some anonymous comments.
An advanced personalisation feature for a major global brand fails at midnight
(shared by Brian “the vampire” Chandler)
It’s midnight. The marketing team is asleep, the media agency has the advertising ready to go to publishers first thing in the morning, emails are set to hit the customer base, and instore promotions are ready, not just across the country, but across the world.
This is a major website feature release which would give the end users the ability to digitally customise their product. At the time this was a revolutionary digital experience.
At midnight, just prior to the launch, the ghosts of Halloween wreak havoc.
The IT team picks up:
- The daily peak image request experiences 4x the load during final load tests. This can’t be? The server was required to deliver nearly 50 images per user to use this feature, which was bypassing the cache. A multitude of problems arose, but the simplest summary was that the application code was effectively DDoS-attacking the system.
- Once the team resolved the above, they were ready for launch. Except the hosting provider decided at this very moment that it would be a good time to switch their PRODUCTION dedicated premium disk storage onto shared non-premium VM storage. You don’t have to be in IT to understand that a switch from PREMIUM to shared non-premium is going to be a bad thing. In fact it was a really bad thing, with the media server basically rendered useless in its ability to deliver the images to the end user.
- The impact of the above? Response times were through the roof, or the images didn’t load at all, the launch was delayed, and the ghosts of Halloween ruled supreme.
But like all fighting IT teams – they dug in and launched successfully the next day.
12,000 people cannot log in to the mobile app – social media in meltdown
(“Sick” Nick Ross)
If you have 12,000 paying customers who can’t log in, that is a cause for some concern. When one of the leading UK brands had poor app store reviews, they turned to Dynatrace.
POS systems in every store go down.
(Jeppe “grave digger” Lindberg)
This one isn’t anonymous. In fact it was a feature presentation at Perform in 2017. The leading retailer in Denmark, COOP, prepared to launch a mobile application that would allow customers to accumulate loyalty points, and pay, all through the app.
Except on launch day all the POS systems went down. Crisis is somewhat of an understatement. As panic ensued, Dynatrace’s AI engine detected that the root cause of the problem was a memory allocation issue in their cloud environment. The team calmly allocated increased memory and the POS systems came back online.
An email offer that is simply too good to refuse
(Dave “Dracula” Anderson)
Last year I was sent an offer from one of the leading brands in Australia. The offer was pure marketing genius.
“Are they really going to give that away for this?” “I’m in ….click.”
Unfortunately, the majority of Australia thought exactly the same as I did and clicked – at the same time. Except when we went to the landing page we got….nothing.
I quickly spun up some synthetic tests:
- For over an hour the site was near on useless to the end user.
- Anyone in marketing will tell you the first hour is when you get the most engagement.
Great offer. Poor execution. Staggering emails was the most appropriate solution for such a campaign.
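For the curious, here is a minimal sketch of what staggering a campaign could look like. It is only an illustration: the function name `stagger_schedule`, the batch size and the interval are all assumptions of mine, not how that brand (or any particular email platform) does it.

```python
def stagger_schedule(recipients, batch_size, interval_minutes):
    """Split a recipient list into waves.

    Returns a list of (delay_minutes, batch) pairs, where delay_minutes is
    how long after the campaign start that batch should be sent.
    All names and parameters here are hypothetical, for illustration only.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if interval_minutes < 0:
        raise ValueError("interval_minutes must not be negative")

    waves = []
    for start in range(0, len(recipients), batch_size):
        wave_index = start // batch_size
        delay = wave_index * interval_minutes
        waves.append((delay, recipients[start:start + batch_size]))
    return waves


if __name__ == "__main__":
    # Example: five recipients, two per wave, ten minutes between waves.
    customers = ["a@example.com", "b@example.com", "c@example.com",
                 "d@example.com", "e@example.com"]
    for delay, batch in stagger_schedule(customers, 2, 10):
        print(f"send at +{delay} min: {batch}")
```

The idea is simply that the landing page sees several smaller waves of clicks rather than one giant spike; the right batch size depends on what the site has been load tested to handle.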
Yep, even Dynatrace has ghosts in the system
(Andi “The Slayer” Grabner)
Is this yet another example of a marketing team and IT team causing chaos? Yes.
- Marketing – our data suggests plain text emails perform better than HTML emails. Let’s switch our free trial post email to plain text.
- IT – reluctantly switches. Except the token that needs to be sent to the customer is wiped from the email.
- Result – 3 days of emails being sent without licences.
Did Dynatrace pick this up? Yes. Did anyone check Dynatrace to see the impact of the change? No.
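A cheap safeguard for this kind of template change is a check that refuses to render an email when the token placeholder has gone missing. The sketch below is an assumption-laden illustration, not our actual mailer: the placeholder `{licence_token}` and the function `render_trial_email` are hypothetical names.

```python
# Hypothetical placeholder that the template must contain.
TOKEN_PLACEHOLDER = "{licence_token}"


def render_trial_email(template, token):
    """Insert the licence token into the template, or fail loudly.

    Raises ValueError if the template has lost its placeholder or the
    token is empty, so a bad template is caught before anything is sent.
    """
    if TOKEN_PLACEHOLDER not in template:
        raise ValueError("template is missing the licence token placeholder")
    if not token:
        raise ValueError("licence token is empty")
    return template.replace(TOKEN_PLACEHOLDER, token)


if __name__ == "__main__":
    good = "Welcome! Your licence token is {licence_token}."
    print(render_trial_email(good, "ABC-123"))

    broken = "Welcome! Your licence token is attached."
    try:
        render_trial_email(broken, "ABC-123")
    except ValueError as err:
        print(f"blocked: {err}")
```

Run as part of a pre-send test, a check like this would turn three days of silent failures into one failed build.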
How can you master digital complexity
These only just scratch the surface of some of the scariest stories that IT departments faced this year. But a lesson can be learned from these terrifying tales: today’s hyper-scale and hyper-dynamic ecosystems are becoming far too complex for IT teams to monitor on their own. As long as enterprises continue to try to anticipate every potential scary IT situation without the help of a monitoring platform, these stories will continue to happen.
So, what can companies do to make sure they’re preventing these Halloween horror stories all year round? Here are three tips to consider when looking for a monitoring solution for your business.
Scaling for the Cloud
The cloud allows enterprises to innovate and scale faster than ever before. I sat with one international ‘on-premise hosted’ retailer on Black Friday and watched helplessly as their systems crashed under the load of extreme traffic surges. The volume was much bigger than they had anticipated and they simply couldn’t scale their systems to cope. They are now quickly moving to the cloud to avoid a repeat.
But, finding the time to innovate can become increasingly difficult as organizations spend much of their IT efforts dealing with digital performance problems resulting from complex multi-cloud ecosystems. To better handle these problems, businesses are finding that they need a purpose-built cloud monitoring solution that can integrate, auto-detect and scale in these environments, giving their IT teams a view of everything from the end user to the infrastructure in an easy and automated way. Leveraging tools such as artificial intelligence allows all the “heavy lifting” to be automated, so teams can proactively identify problems and pinpoint the underlying root cause before a customer is impacted.
Holistic Digital Experience
While businesses can take every precaution needed to prevent IT issues, sometimes they are unavoidable. And, when a problem hits, a company needs to be able to quickly assess how it is affecting users and, ultimately, the company’s bottom line. This allows them to better prioritize and escalate issues that are having the biggest impact. Access to a holistic view of the entire digital experience provides enterprises with the information needed to fix the issue, and inform customers with a meaningful explanation of how the issue occurred and how it’s been resolved.
Consider your DevOps
In the continuous delivery world we live in, teams must be able to quickly react to failed deployments before users are ever hit. With complete performance monitoring, DevOps can proactively receive alerts on availability issues before pushing code into production, and react more quickly to any issues identified when running in production. As a result, your DevOps team can confidently deliver new releases and rapidly troubleshoot any problem that occurs to quickly fix the issue before a user is even aware of it.
Have your own performance horror story? Share it in the comments below!
Last but not least
C’mon, where is the fun without sharing this story? The scene is DevOps Days this year. The event is not Halloween, or even Halloween themed, but one of our competitors thought it was appropriate to dress as a T-Rex and proclaim that, because they are huge now, having been bought by one of the big IT firms, they will hire lots of people, and you should work there.
I don’t get it, but I did have a genuine laugh.
Thanks to Frances Ward, Brian Chandler, Andi Grabner, Nick Ross and Jeppe Lindberg for your contributions to this post. Happy Halloween.