The recent outbreak of SARS-CoV-2, the virus that causes COVID-19, is having a significant impact on everybody’s life. One impact is a move towards digital services at an unprecedented scale. Fortunately, there are proven strategies for handling situations like this. We want to share some best practices that help you get through the current situation.
One of the impacts of the COVID-19 pandemic is a move towards digital services at an unprecedented scale. Some businesses are attempting to replace lost revenue streams through a shift to online activity. Other organizations are scrambling to support significant growth in online users. All of this puts a lot of pressure on IT systems and applications.
While most government agencies and commercial enterprises have digital services in place, the current volume of usage — including traffic to critical employment, health and retail/eCommerce services — has reached levels that many organizations have never seen before or tested against.
Organizations need to prepare for both expected and unexpected demand, not only for the services that their customers and users rely on today but for the services being developed for tomorrow.
There are proven strategies for handling this. In this article, I will share some of the best practices to help you understand and survive the current situation, as well as future-proof your applications and infrastructure for similar situations that might occur in the months and years to come.
Step 1: Understand Traffic Patterns and Potential Spikes; Remove Team Silos
The impact of traffic spikes is illustrated by the load that eCommerce websites typically see during Black Friday. A massive rush of users over a very short time period causes systems to slow down and then potentially return errors. Retail sites are usually well-prepared for these spikes, as they know when they are coming. Other sites, including eLearning services, also experience seasonal or time-of-day-based usage patterns that they can prepare for (see figure 1).
However, the situation many websites recently experienced at the onset of COVID-19 was unprecedented — large, sudden traffic bursts with no clear pattern or knowledge of when the next burst would happen. For example, traffic spikes in government employment portals sometimes resulted from COVID-related news announcements (figure 2).
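One way to tell an expected seasonal peak from an unexpected burst is to compare current traffic against a baseline for the same hour of day. The following is a minimal sketch in Python, assuming you already collect request counts per hour. The function names, the sample numbers and the threshold factor of 2.0 are illustrative assumptions, not values from any specific monitoring product.

```python
from statistics import mean

def hourly_baseline(history):
    """Average request count per hour of day.

    history: list of (hour_of_day, request_count) samples from past days.
    """
    by_hour = {}
    for hour, count in history:
        by_hour.setdefault(hour, []).append(count)
    return {hour: mean(counts) for hour, counts in by_hour.items()}

def is_unexpected_spike(baseline, hour, current_count, factor=2.0):
    """True when traffic exceeds the usual level for this hour by `factor`."""
    expected = baseline.get(hour)
    if expected is None:
        return False  # no history for this hour, so nothing to compare against
    return current_count > factor * expected

# Hypothetical sample data: two past days, observed at 09:00 and 20:00.
history = [(9, 1000), (9, 1200), (20, 3000), (20, 3400)]
baseline = hourly_baseline(history)

print(is_unexpected_spike(baseline, 20, 3500))  # usual evening peak -> False
print(is_unexpected_spike(baseline, 9, 5000))   # far above the morning norm -> True
```

A known evening peak is not flagged, while a burst at an hour that is normally quiet is. That is the kind of signal business and technical teams can review together.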
The best way for organizations to get ahead of the curve, so they aren’t caught off-guard by sudden and unexpected spikes in online activity, is to rethink the structure of their IT teams. Learning from the past and incorporating data about future events helps to ensure your team is not hit by a surprise like this again. Achieving this requires business and technical teams in an organization to be in lockstep: communicating, aligning and preparing for what might happen. We refer to this as a BizDevOps strategy.
To support BizDevOps, organizations must create tighter collaboration among teams. They must establish an integrated communications approach centered on end-user data. As needed, teams should mandate daily meetings or standups to review what happened the day before, plan for the current day and look ahead to the days that follow. With everyone looking at the same data, it’s easier to work towards a common goal. And with everyone on the same page, organizations are ready to act quickly when surprises happen — like, say, a sudden rush of online activity prompted by a pandemic.
Step 2: Understand What to Get Ready for
Just seeing or predicting a spike in traffic will not solve the problem. Your systems will still be overloaded, and problems will continue to impact your users. The next step is to understand when your system is going to break.
As the dashboard example in figure 3 illustrates, historical usage patterns will not serve this purpose. This dashboard reflects traffic to the Austrian Economic Chamber website in late March 2020, starting when people began to submit requests for government financial support related to COVID-19. The site managers knew this event was coming, but they were not prepared for the massive amount of traffic that followed.
So how do you know what to prepare for?
As no situation from the digital era compares to the current pandemic, this might be hard to assess. Often, the best strategy is to conduct a stress test. To do this, hit your infrastructure with a steadily increasing amount of traffic until response times degrade or errors start to appear.
Ideally, you’ll have a dedicated environment for this. If that isn’t the case, then test against your live system when the impact on users is minimal.
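The ramp-up logic of such a stress test can be sketched as follows. This is a simplified illustration, not a load-testing tool. The `measure` callback stands in for whatever load generator and monitoring you use. The limits (500 ms latency, 1% errors) and the `simulated_service` function are assumptions made for the example.

```python
def find_breaking_point(measure, start=100, step=100, max_load=5000,
                        max_latency_ms=500.0, max_error_rate=0.01):
    """Raise the load step by step and return the first load level that
    violates the latency or error-rate limit, or None if none does.

    measure(load) must return (latency_ms, error_rate) observed at that load.
    """
    load = start
    while load <= max_load:
        latency_ms, error_rate = measure(load)
        if latency_ms > max_latency_ms or error_rate > max_error_rate:
            return load
        load += step
    return None

def simulated_service(load):
    """Stand-in for a real system: latency grows slowly up to 1,000
    requests per second, then sharply beyond that."""
    if load <= 1000:
        return 100.0 + load * 0.1, 0.0
    return 200.0 + (load - 1000) * 2.0, 0.0

print(find_breaking_point(simulated_service))  # -> 1200
```

In a real test, `measure` would drive traffic at the given rate for long enough to collect stable response-time and error figures before moving to the next level.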
Once you know the limits of what your system can handle, you’re ready for the next step.
Step 3: Understand Why Your Systems Break
Simply knowing when your systems break won’t help you deliver better service to your users. Depending on the type of system you are running, the fix may not be obvious. It might be as simple as adding new servers, or as complicated as changing specific application behavior, like switching from dynamic to static content or disabling certain functionalities under high load.
Once you start to hit the breaking point identified by your stress test (step 2), utilize automated root-cause analysis (see figure 4) to identify which components broke and exactly why they broke. This will enable your teams to find ways to fix the problems.
Step 4: Validate Fixes in Real Time
Once you’ve understood the root cause of problems, you’ll need your teams to work on fixes that can be deployed rapidly into your live environment. As such, it’s imperative to monitor the impact of these fixes on the overall health of your system in real time. Sometimes, fixing things in one place causes problems somewhere else, which might mean you need to either roll back the fix or repair whatever new problem it has caused.
It’s important, though, to fix one problem at a time and validate the impact of each fix. Teams should fix small, atomic problems and combine these fixes as needed, rather than attempting to fix complex processes involving many interdependencies.
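The apply, validate and roll-back loop described above might look like this in outline. It is a sketch under simplifying assumptions: the `system` dictionary stands in for live monitoring data, all metrics are "lower is better", and the fix functions are hypothetical. A real implementation would also wait for metrics to settle after each change.

```python
def apply_with_validation(apply_fix, roll_back, read_health, tolerance=0.05):
    """Apply one fix, compare health before and after, and roll back
    if any metric got worse by more than `tolerance` (5% by default).

    read_health() returns a dict of metric name -> value, lower is better.
    Returns True if the fix is kept, False if it was rolled back.
    """
    before = read_health()
    apply_fix()
    after = read_health()
    regressed = [name for name, value in after.items()
                 if name in before and value > before[name] * (1 + tolerance)]
    if regressed:
        roll_back()
        return False
    return True

# Hypothetical system state and two small, atomic fixes.
system = {"response_ms": 800.0, "error_rate": 0.01}

def enable_cache():
    system["response_ms"] = 300.0

def disable_cache():
    system["response_ms"] = 800.0

def shrink_timeouts():
    system["error_rate"] = 0.05  # this "fix" makes errors worse

def restore_timeouts():
    system["error_rate"] = 0.01

kept_cache = apply_with_validation(enable_cache, disable_cache, lambda: dict(system))
kept_timeouts = apply_with_validation(shrink_timeouts, restore_timeouts, lambda: dict(system))
print(kept_cache, kept_timeouts, system)
```

The first fix is kept because response time improves. The second is rolled back because the error rate rises, which leaves the system in the state reached after the first fix.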
Step 5: Automate the Fix and Make It Repeatable
After completing step 4, you will have a list of small fixes that can be applied separately. You might be tempted to write down how to execute them. Documentation is good. A repeatable implementation is, however, much better.
Making fixes available as scripts has several benefits, especially if you need to react quickly to unexpected situations. First, anybody can run them instantly without having to learn any specifics. Second, running an automated script is much faster than performing the steps by hand. Third, it is much less error-prone.
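A scripted fix is most useful when it is safe to run more than once. The sketch below shows that idea for a scale-out action. The `cluster` dictionary is a placeholder for a real infrastructure API, and the function name and server counts are made up for illustration.

```python
def scale_out(cluster, target_servers):
    """Idempotent fix: bring the cluster up to target_servers, never below.

    cluster is a plain dict standing in for a real infrastructure API.
    Returns a short message describing what was done.
    """
    current = cluster["servers"]
    if current >= target_servers:
        return f"no change: already at {current} servers"
    cluster["servers"] = target_servers
    return f"scaled from {current} to {target_servers} servers"

cluster = {"servers": 4}
print(scale_out(cluster, 8))  # scaled from 4 to 8 servers
print(scale_out(cluster, 8))  # no change: already at 8 servers
```

Because the script checks the current state first, whoever runs it does not need to know whether a colleague has already applied the same fix.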
Step 6: Automate the Workflow
Up to now, all steps have required someone to actively drive the process. While this might work fine during business hours, it does not prepare you for unexpected events outside of usual business hours.
Luckily, at this point, you have all the ingredients you need to automate the process and move towards what we call NoOps. Essentially, this means that no manual operations tasks are needed to perform well-defined operational steps.
In this step, we are linking root-cause analysis to the proper remediation action (see figure 5). Having this in place allows actions to be triggered automatically as needed.
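Conceptually, this link is a lookup from a detected root cause to the scripted fix that addresses it. Here is a minimal sketch. The root-cause labels, the fix functions and the `notify` callback are all hypothetical and do not reflect any particular product's API.

```python
def add_servers(system):
    system["servers"] += 2

def serve_static_content(system):
    system["static_mode"] = True

# Hypothetical root-cause labels mapped to the scripted fix for each.
REMEDIATIONS = {
    "cpu_saturation": add_servers,
    "slow_dynamic_rendering": serve_static_content,
}

def remediate(root_cause, system, notify):
    """Run the fix registered for root_cause, or notify a human if none exists."""
    action = REMEDIATIONS.get(root_cause)
    if action is None:
        notify(f"no automated fix for '{root_cause}', manual action needed")
        return False
    action(system)
    return True

system = {"servers": 4, "static_mode": False}
alerts = []
remediate("cpu_saturation", system, alerts.append)
remediate("disk_failure", system, alerts.append)
print(system, alerts)
```

Known causes are handled without anyone being paged, while anything without a registered fix is still escalated to a person.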
You might wonder why you should not simply always have your systems at maximum capacity, or with all remediation features and steps enabled.
First, running infrastructure costs money. As shown above (figure 5), many sites experience only very short capacity shortages during traffic spikes, usually lasting between half an hour and one hour. Running at full capacity all the time would result in unreasonably high costs.
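A quick back-of-the-envelope calculation makes the difference concrete. All figures below are invented for illustration: 4 servers for normal load, 20 servers at peak, one one-hour spike per day, and a price of 0.50 per server-hour.

```python
def monthly_cost(servers, hours_per_day, price_per_server_hour, days=30):
    """Cost of running `servers` for `hours_per_day` hours on each of `days` days."""
    return servers * hours_per_day * days * price_per_server_hour

# Always provisioned for the peak: 20 servers around the clock.
always_full = monthly_cost(20, 24, 0.50)

# Baseline of 4 servers around the clock, plus 16 extra for a one-hour spike per day.
on_demand = monthly_cost(4, 24, 0.50) + monthly_cost(16, 1, 0.50)

print(always_full, on_demand)  # 7200.0 1680.0
```

With these assumed numbers, permanent peak capacity costs more than four times as much as scaling up only while the spike lasts.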
Second, mitigation actions do not come free. Some mitigation actions will have an impact on your business. Let’s look at news/media sites, for example (figure 6). Third-party services, such as advertisements that are served up to the news site, could be one reason why the system slows down during times of peak traffic. A mitigation action could be to remove these components temporarily from the site.
Obviously, this would result in a loss of advertising revenue. It is also possible that removing the third-party advertising components doesn’t improve the user experience at all. Even if it does, the advertising revenue might instead help to pay for scaling up infrastructure, which could improve the user experience even more.
Help to Get Started
We understand that organizations that traditionally did not have to deal with traffic surges affecting user experience might not be prepared to implement all these measures. At Dynatrace, we are helping by providing a few free services to support your organization’s response to COVID-19. A great way to begin is with our free trial.