A few days back I was called into a war room with the hosted services group of our partner hybris. They were facing issues with the eCommerce site of one of their customers, a large UK-based luxury clothing retailer. The situation was quite critical. Even though load tests had been conducted and everything appeared to be optimal, the eCommerce site encountered issues during a recent customer promotion, which required the implementation of customer waiting rooms. Waiting rooms are a thing of the past, and as a customer you definitely don’t want to end up on a “Sorry, please wait” page while shopping online. With Black Friday, Cyber Monday and the holiday shopping season pending, there was a great deal of pressure to resolve the issue.

The non-technical complexity

Technical architectures today can be quite complex, and will become more complex in the future. There are frontend servers, databases, backend services, 3rd party services like payment providers and search engines, CDNs, 3rd party content and whatnot. But sometimes there is another level of complexity — one that’s not technical — the number of different parties involved. In this case there are a total of four parties involved:

  • The Customer: The one feeling most of the pain. When the site goes down they lose money and suffer a damaged reputation, usually after investing in a new, modern, fancy eCommerce store for mobile and desktop users alike that now doesn’t work!
  • The Implementation Partner: Hired to implement the new site. In a well-coordinated project they implemented the customer’s wishes, bringing in the necessary expertise, and “making the impossible possible”. The project required resources from both the customer and implementation partner, but the project was delivered on time.
  • The Hosting Provider: Responsible for providing the resources and platform where the new site would run. They defined the sizing to meet the customer’s business requirements, provided application and database servers, and configured backup, disaster recovery, and high availability. The “go-live” was coordinated with the other parties and everything went according to plan.
  • The Platform Provider: This is hybris, the vendor of the eCommerce platform. They know the system inside and out, work according to best practices and sizing guidelines, and are skilled at identifying problems and fixing bugs.

Did I say four parties?! Oh, wait! After a brief examination of the architecture it turned out there were two more: an external search engine provider and an external payment provider. Although the customer environment only talks to these through technical interfaces such as web service calls, they probably also have an SLA in place, and if it is violated their customer support and services teams must react, adding two more parties to the mix! But let’s leave that aside for the moment.

The Blame Game

If you put yourself in the customer role (maybe you are) and you see your eCommerce site going down at the busiest and most business-relevant time, who would you turn to first? To your implementation partner, the one that delivered the project on time with defined handover criteria and a successful go-live? Or would you speak with your hosting provider (internal or external), who is responsible for keeping your site up and running? In my experience the hosting provider is the first one to be blamed, because: “It’s their servers that went down and couldn’t handle the business load!” and “It’s their environment that is slow!” All fingers point to the hosting provider: “You broke it, you fix it!” And so the blame game starts going around the table, with everybody unhappy while both time and money are rapidly consumed!

Phase #1: more resources

The hosting provider often jumps immediately to throwing more resources at the problem. Sometimes it works, often at a significant cost to the customer, and sometimes it doesn’t. But what if adding more resources doesn’t help, or just delays the problem until the next traffic spike, when a marketing blast drives more users to the site?

Phase #2: call in the experts

If adding more resources doesn’t work, that’s when everyone finds themselves in a war-room scenario, bringing all the experts to the table, where every domain expert presents their view and attempts to align with the others in the room. The scenario typically involves Webserver, Application Server and Database professionals, as well as implementation and platform provider experts. They all throw in their knowledge and analysis data and hope the final blame doesn’t fall on them! At this point the finger pointing can get dirty, as the cost of delays and the emotions mount. We’ve all seen war room participants divide along “political” lines, with discussions collapsing into something that is not problem-focused and individuals simply defending their position to avoid taking the blame or responsibility. But sometimes it gets worse.

Phase #3: rinse and repeat

This phase is particularly costly in terms of time and money. Basically the war room described above is repeated over hours, days, even weeks. Money is effectively burned and the customer gets frustrated. Everybody is unhappy, and often the situation is further complicated by inviting more experts to participate. The war room expands, and the days and nights get longer as they contribute their insights (sometimes) and opinions (typically). And the whole time users are getting increasingly frustrated! In this phase the non-technical complexity and the communication between all parties explode! As more people attempt to contribute to a solution, the coordination between groups becomes more complex and more likely to collapse once again. At this point a war-room “manager” is assigned to establish clear lines of communication, and a “board” is created for escalation and decisions. Unfortunately, the wrong decisions will still be made for various reasons. So how do you get out of this loop? Either repeat until you find the solution, which can be tedious and cost-intensive, or change the approach!

Turn on the X-Ray

Rarely does a year pass when I am NOT a part of one or more of these war room scenarios, usually just before or during the holiday season. Most of the time I join under high pressure, with a need for immediate resolution, typically during Phase #2 or #3. More late nights, more pressure, more emails, all as an external expert. Being that external expert has its benefits:

  • I usually do not feel the pain: I’m not losing money, I’m not getting blamed for a non-working environment, or something that happened in the past. I’m just another expert at the table of experts.
  • I usually don’t care what has been done and who did it. “Tabula Rasa” is the rule of the day! I can start from a clean slate.
  • I don’t care (too much) about the relationships and potential tension between the other parties and their particular interests. It’s in my interest to solve the problem as quickly and precisely as possible.

There are also disadvantages, but I’ll leave those out :-). So what’s my top priority in these situations?

Create Visibility!

That’s what I do by turning on Dynatrace. I remove the “might”, “maybe” and “if” from the discussion. I work with the facts. I want to see everything and the relationships between all the elements in the form of end-to-end visibility. From the user side to the database and the code, I want to pinpoint, not finger-point! So, let’s get back to the big UK retailer mentioned at the start of this blog. Here is what we did:

  1. Get the environment Dynatrace’d: The hybris managed services team took care of this for Webservers, Appserver, User Experience Management, and everything else.
  2. Get some privacy (and your preferred hot beverage): Especially when you are in Phase #2 or #3 there is lots of noise (figuratively and literally). Take yourself out of the room and the situation for some time while working on the problem.
  3. Unbiased and focused analysis: I listen to the problem description and what has occurred in the past, but usually only if it’s already very specific (“Our users can’t do an order checkout”, “Our search doesn’t work”). Then I break the application apart step by step, documenting all my findings.
  4. Present findings, outlining work items which are assigned to the individuals most capable of addressing them.
  5. Verify all changes and the impact of each.

Focused Analysis

I work most often with eCommerce offerings using the hybris platform. Together with hybris experts I’ve created a hybris monitoring fastpack which helps me to obtain a quick overview of the environment, allowing me to focus on very specific areas such as certain pages and background jobs. I do not need someone to walk me through the typical user journey; I simply examine production data to know where I need to focus. That single dashboard usually defines my next steps.

Going forward, almost every analysis is filtered by one of these page types. I can identify exactly which services are being used when a product category page is opened, which database statements are called, and which code is executed. The next step often brings me very close to the needed results. Two analysis screens allow me to locate potential bottlenecks: Response Time Hotspots and Transaction Flow (filtered by a page type).

The rest of the analysis is why you need the aforementioned privacy: focusing on and obtaining an unbiased understanding of the environment, writing down the findings, and correlating facts (not guesses). Finally, I add recommendations to the documented findings. Remember, not everyone presented with these results will immediately understand the implications, because it’s not necessarily within their area of expertise. Usually some “translation” of the terminology is required. This can still be difficult, but having the facts at hand makes it much easier than guessing!

Facts from the field

These are some real-world numbers from my past engagements:
  • Fastest Time to Root Cause (after Dynatrace Installation): 5 minutes
  • Biggest Time and Resource Saving: a 2-week ongoing war room with 30 people vs. 1 day of analysis with 3 people until a solution was found
  • Largest Group of Parties involved: 6 companies with 30-40 people overall
  • Best cost-item on the war-room: professional catering (3 meals/day) for the whole team

Conclusion

War rooms can be tough. With the increased number of parties involved in problem hunting, communication between teams can be an obstacle to timely resolution. Information is lost, guesswork is added, and often it’s more trial-and-error than fact-based decision making. Creating visibility by turning on an “APM X-Ray” helps remove the subjective element from the discussion, replace finger pointing with pinpointing, and accelerate problem resolution. Implementing a proactive application performance management process early in the lifecycle might avoid war rooms completely!