Just last week a senior Hybris consultant shared the story of a customer engagement on which he was working. This customer had problems, serious problems! We are talking about response times far beyond the most liberal acceptable standard! They were unable to solve the issue in their eCommerce platform – specifically Hybris. Although the eCommerce project was delivered by a System Integrator/Implementation Partner, the vendor still gets involved when things go really wrong! After all, the vendor knows best, right?

So when he started working with this customer his first question was:

Do you have an APM Tool in Place?

Why? Imagine you are an expert on a highly customizable platform that has been adapted to the customer’s needs. Within a very short time you are expected to get a complete overview of a mostly unknown environment in order to solve a pressing issue. So you need information, accurate information, the best information available. Just facts, no rumors or hearsay! It’s like when your child gets hurt at the playground. You take them to the hospital, and one of the first things a doctor does is take an X-ray to get a clear image of the injury – perhaps a broken bone.

“Yes, we do have an APM solution!” the customer replied. “Good,” the expert consultant said. “Let’s take a look at this problem in your staging environment.” Customer: “It’s only monitoring our production environment… and we already tried using it to solve the problem.” Expert consultant (confidently): “Oh, OK, then let’s work on production data to investigate the problem.” Then he asked for access.

Looking at production data provides the benefit of using “the real data and the real problem” for investigation, not a replica produced by a “close to real” test. Don’t get me wrong, I’m not saying that performance analysis and diagnosis have to happen in production, but often it’s the quicker way to resolve, well, production problems! Preventing these problems from ever hitting production by first using APM best practices is a whole other topic. More on that later.

Soon after, he logged into a “Dynamic” monitoring solution featuring nice dashboards and alerts that blinked whenever average response times were violated. He saw an overview of the environment and even identified one specific business-relevant transaction that was extremely slow. What he saw confirmed the issue the customer was complaining about. The problem was obvious, but the solution wasn’t. He needed details. He found database statements that were executed often, but each one ran fast enough and seemed fine. So, what was making the transaction take so long? A deep investigation of the transaction executions would be needed.

Can you export this live data…?

“…so I can take it to our lab for deeper investigation?” the consultant asked. The day had been long and he wanted to analyze the data offline, on the commute home to his family. A 45-minute ride should be enough to find the root cause, and he would be in time for family dinner!

“Export production performance data for offline analysis… how would that be possible?” a dynamic and genuinely surprised system administrator asked. “I don’t think that’s even possible!” And it wasn’t possible with their monitoring tool. So the Hybris expert stayed a bit longer and missed his train home, but eventually gave up investigating the problem for the day. Fortunately he was home in time for dinner, and his wife wasn’t angry. Peace at home, but none for the customer, who had to live another day with the persisting problem.

The next day he went back to the customer, eager to solve the issue. It hadn’t gone away overnight, and the analysis was still hindered by missing facts – facts that their APM solution couldn’t identify and report.

Careful! “code-level detail” in one APM tool might mean something different in another solution!

I need more Visibility!

That morning the consultant decided to try a different approach. Since he couldn’t get the depth of visibility he needed from the monitoring tool in place, he asked the customer if he could install his preferred APM solution, as it would provide deeper visibility into the production environment. Using the Dynatrace Free Trial with out-of-the-box diagnosis for Hybris, he connected to the environment. Now he had the deep visibility he needed to find the root cause. It took him less than an hour to identify the issue, and a fix was suggested and implemented shortly thereafter.

What was different? Why did one APM solution fail and another succeed so quickly? It was certainly not a lack of knowledge on the part of the Hybris services experts. They see a wide range of customer problems every day, they work with many different tools, and they know their business inside and out.

Even though a doctor can diagnose your child’s broken arm by simply looking at it, he still takes an X-ray to determine whether there is an additional injury or potential complication, so that he has sufficient information to provide the best treatment! Unfortunately, the APM tool used by the customer in this situation didn’t provide the depth and detail required to identify the root cause of their problem.

The Difference: APM Tool vs. APM Solution

So, why did the consultant succeed almost immediately with one solution while the other failed? There are multiple factors that contributed to his success:

  1. Exact End-To-End Visibility: Meaning a high level of detail captured for the execution of every single transaction. To give you an idea, “exact” here means code execution at the method level: measured, non-aggregated timings such as CPU Time, Execution Time, Total Time, I/O, Sync and Wait time. Further, he needed access to method arguments and return values, as they might be critical during analysis. External calls, database executions, CDN content, errors and exceptions were additional details required.
  2. Full coverage, no sampling, every transaction: It doesn’t matter if you are replaying a single issue by manually clicking through, or if you are looking at data in production. You need details, and for any investigation you want them for every transaction, regardless of whether it is executed only once or many times. Aggregations and samples aren’t enough!
  3. Ability to share and analyze offline: Eventually the consultant needed to share the observed behavior with architects and developers. Not as a summary written up with screenshots or log file snapshots. He needed to share all the details of the execution tree of the failing transactions with another expert, in case he had missed something!
  4. Same solution, same knowledge: The likelihood of miscommunication and misunderstanding is reduced significantly if all teams involved in troubleshooting get the same insights by using the same tools. If production operations sees the exact same data as testers and developers, represented and analyzed with the same tools, they speak the same language. There is no more “…but my tool measures this and yours shows something different…”, and the communication cycle gets significantly shorter!
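To make the point about aggregation concrete, here is a minimal sketch in Python. It is not taken from any APM product: the record layout, the transaction names and the timings are all made up for illustration. It shows how an average over database statements can look perfectly healthy while a single transaction is very slow simply because it runs the same fast statement thousands of times, which is one way a slow transaction can hide behind statements that all seem fine.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical trace records: (transaction_id, statement, duration_ms).
# All names and numbers are invented for this illustration.
def build_records():
    records = []
    # 99 ordinary transactions, each running one 5 ms query.
    for tx in range(99):
        records.append((f"tx-{tx}", "SELECT product", 5.0))
    # One problem transaction running the same fast query 2000 times.
    for _ in range(2000):
        records.append(("tx-slow", "SELECT product", 5.0))
    return records

def average_statement_time(records):
    """Aggregated view: the average duration over all statements."""
    return mean(duration for _, _, duration in records)

def total_time_per_transaction(records):
    """Per-transaction view: sum the statement time inside each transaction."""
    totals = defaultdict(float)
    for tx, _, duration in records:
        totals[tx] += duration
    return dict(totals)

def slowest_transaction(records):
    """Return (transaction_id, total_ms) for the most expensive transaction."""
    totals = total_time_per_transaction(records)
    tx = max(totals, key=totals.get)
    return tx, totals[tx]

if __name__ == "__main__":
    records = build_records()
    print("average statement time (ms):", average_statement_time(records))
    print("slowest transaction:", slowest_transaction(records))
```

In this made-up data set every statement takes 5 ms, so a dashboard built on statement averages stays green. Only the per-transaction totals reveal that one transaction spends 10 seconds in the database, and that view is only available if every transaction is captured rather than sampled or averaged away.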

So when you select an APM solution, choose one that covers monitoring and also fits the needs of testers, developers, operations, system integrators, implementation partners, and the platform vendor. A solution that tells you that a transaction is slow but doesn’t tell you exactly why, where, and who needs to fix it is almost worthless.

Charting is not enough! It’s about the data that lies behind these charts and the data that will help you find the root cause.

Choose Wisely – because you will fail at some point

Choose a solution that supports the full application lifecycle! Your application went through development or customization and testing before it finally went live in production. A production problem will likely travel the same route in reverse: from production to a replay in test, until eventually a developer fixes it. Then the fix takes a similar route back to production. Because problems are part of the lifecycle, the solutions used to solve them should also be part of the lifecycle.

Choose a solution that the people in your lifecycle know how to use! Include your organisation and those with whom it works: partners and vendors. Trust their selection and recommendation because they are the ones you will likely call on to resolve the really tricky problems. Also, follow their best practices and recommendations. For example, Hybris provides well-documented best practices for Application Performance Monitoring.

Finally, accept the fact that there will be large and small problems. There is no such thing as “happily ever after” for a platform where changes are made continuously. You will fail or someone else will, and it will impact you in some way. After all, we are only human.

Conclusion

  • There is a difference between an APM Tool and an APM Solution. When you choose one, make sure you take into account not only the needs of the production team, but also those of the people with whom you will be working and the type of data they will require. Choose a solution that helps everyone. Don’t settle for a simple feature-list or keyword comparison! Compare the solutions by putting them to the test.
  • Trust your platform vendor and your partners. They have extensive and in-depth experience with situations like yours, and they know how to address them quickly. Copy their best practices and leverage their expertise to attain the same level of competence.