For over three years now I have been working closely with hybris. During this time we have delivered a lot of value to customers by optimizing custom hybris configurations at a critical point: the initial deployment of their sites. Understanding the factors that shape user experience before a site launches is critical both for a successful launch and for ongoing digital performance, brand value and sales conversions.
Because of this, together we developed a basic configuration for monitoring a production hybris environment with Dynatrace. In this post I will provide 10 key health checks to perform prior to the launch of a hybris site, or anytime during the lifecycle of a site.
For this post, I am using the Dynatrace Application Monitoring tool. The specific configuration I am using is available as the hybris fastpack for Dynatrace. Not only is Dynatrace the standard tool hybris uses to monitor customers’ environments in their managed services offering, but it’s also part of their toolbox when hybris professional services conducts a performance review or a pre go-live check for customers.
A performance review engagement is usually organized as a project: it has a timeline, it requires resources, time and materials, and it often runs under time constraints, which can be tough when pressure to launch a new site is high or there are urgent problems to address. For efficiency, it's essential to standardize tools and automate repeatable tasks. This is where the hybris fastpack for Dynatrace comes into play.
Below, I’ve distilled the performance review down to 10 key metrics to look for when validating your site’s performance. Stay tuned for part 2 of this blog, where I’ll walk through how these health checks come together in a real-world example.
Check #1: Prioritize Efforts by Page Class Performance
It is important to understand where performance improvements will pay off most. We typically focus on pages that are called very often or have sub-optimal response times. A checkout page that is really slow also deserves extra attention, as you don’t want to lose customers in the final step of their shopping journey.
This can be achieved with a tool that detects which page controllers are defined and used in the monitored environment and shows which pages are called, how often, and with what average response time. In the dashboard below, we also see a tabular summary of page metrics, including error rates for different timeframes (6hrs, 24hrs, 7 days), broken down by individual page.
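As a simple illustration of this prioritization idea, ranking page classes by total time consumed (call count times average response time) surfaces both frequent-but-fast and rare-but-slow pages. The page names and numbers below are hypothetical, not taken from any dashboard:

```python
# Rank page classes by total time consumed so both high-traffic pages
# and slow low-traffic pages surface at the top of the worklist.
pages = {
    "ProductPage": (50000, 0.4),   # (calls, avg response time in seconds)
    "CheckoutPage": (2000, 3.2),
    "HomePage": (80000, 0.2),
}
ranked = sorted(pages.items(), key=lambda kv: kv[1][0] * kv[1][1], reverse=True)
for name, (calls, avg) in ranked:
    print(f"{name}: {calls * avg:.0f}s total")
```

With these made-up numbers the product page tops the list despite its good average response time, simply because it is called so often.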
Check #2: Webrequest Performance
A monitoring solution like Dynatrace can help you check the sizing of a hybris environment. The first chart shows page impressions over time, which should match the general hybris sizing guidelines. The middle chart shows the average webrequest response time across all requests, and the lower chart presents an application-layer breakdown: the overall time consumed by transactions, split across the different application code and library components. If any specific layer contributes heavily, you’ll immediately know where to dig deeper. For example, if the JDBC layer is the main contributor, poor performance is likely database related; if it’s customized Java components, you can dig deeper there.
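Conceptually, a layer breakdown boils down to summing transaction time per component and flagging the dominant contributor. A minimal sketch, with hypothetical layer names and timings:

```python
from collections import defaultdict

# (layer, milliseconds) pairs, as an APM tool might attribute time
# spent inside transactions; the data here is illustrative only.
spans = [("JDBC", 120), ("Servlet", 30), ("JDBC", 200), ("CustomJava", 40)]

by_layer = defaultdict(int)
for layer, ms in spans:
    by_layer[layer] += ms

top = max(by_layer, key=by_layer.get)
print(top, by_layer[top])  # JDBC 320 -> dig into database usage first
```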
Check #3: Request Balancing
In this view we see whether traffic to the web and application servers is balanced correctly. It also provides a quick view of the number of application and webserver worker threads as well as CPU usage.
Check #4: Webrequest Distribution
When monitoring a production environment, this dashboard immediately shows the distribution of fast and slow requests. The charts display, on a logarithmic axis, the number of requests per response-time category (red: >5s, orange: 3-5s, yellow: 1-3s, green: <1s). If an event or change impacts response time, the shift is immediately visible.
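The categorization itself is just threshold bucketing; a sketch using the same thresholds as the chart colors (the sample timings are made up):

```python
from collections import Counter

def bucket(seconds):
    # Thresholds mirror the dashboard's color categories.
    if seconds < 1:
        return "green (<1s)"
    if seconds < 3:
        return "yellow (1-3s)"
    if seconds < 5:
        return "orange (3-5s)"
    return "red (>5s)"

response_times = [0.4, 0.8, 1.2, 2.9, 3.5, 6.1, 0.2]
distribution = Counter(bucket(t) for t in response_times)
print(distribution)
```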
Check #5: Sessions Counters
This dashboard wraps up the different session types and concurrent users (when using User Experience Management). It’s important to know that it is not only users who create sessions: running cron jobs create JALO sessions too. With a bit of customization, the charts display not only end-user sessions but also backend and cockpit sessions.
Check #6: Database Performance
Quite often the database is behind performance issues. In most cases it’s not the database itself but how it’s being used. That’s why we created this dashboard, which lists the overall database execution count (number of statements) and the time spent in database calls. However, the most important metric on this dashboard is the average number of database calls per transaction! This is where I usually focus, because a number that is too high immediately reveals an architectural problem. Database call overhead and communication latency add up when too many calls happen per transaction, which can easily lead to connection pool issues and congestion.
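The metric itself is a plain average, and it's worth remembering that a single chatty transaction can dominate it. A small sketch with hypothetical per-transaction call counts:

```python
# Hypothetical per-transaction database call counts, e.g. exported
# from an APM tool. The average is the red-flag metric; the maximum
# points at the worst offender.
def avg_db_calls(call_counts):
    return sum(call_counts) / len(call_counts)

calls = [3, 5, 4, 120, 6]  # one transaction issues 120 calls
print(avg_db_calls(calls))  # -> 27.6
print(max(calls))           # -> 120
```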
Check #7: Java Garbage Collection
This is another dashboard to keep a close eye on, especially the Garbage Collection Impact on Transactions chart. The graph needs a little explanation: garbage collection is normal, it has to happen, and having the right garbage collection settings for your JVMs is key. But how do you know your GC settings are right? Ideally, garbage collection should neither affect nor slow down users’ transactions. The Garbage Collection Impact Rate is the ratio between transaction time without garbage collection and total transaction time. If the ratio is 1 (or 100%), no garbage collection impacts transactions; if it drops to 0.7 (70%) or less, a lot of time is being spent on garbage collection and transactions are paused for too long, hurting the user’s experience!
So I always look for a value close to 90% or higher, even during load tests.
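My reading of that ratio (total transaction time minus GC suspension time, over total time) can be sketched as follows; the 50 ms suspension figure is purely illustrative:

```python
def gc_impact_rate(total_time_ms, gc_suspension_ms):
    # Fraction of transaction time NOT spent suspended by garbage
    # collection: 1.0 means zero GC impact on the transaction.
    return (total_time_ms - gc_suspension_ms) / total_time_ms

# A 1000 ms transaction with 50 ms of GC suspension scores 0.95 (95%),
# which would pass the "close to 90% or higher" rule of thumb above.
print(gc_impact_rate(1000, 50))  # -> 0.95
```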
Check #8: Cron Jobs Dashboard
This dashboard lets you identify which CronJobs are running, when, and where. It helps align execution plans with the load on the environments and makes it easy to spot issues with schedules or executions.
Check #9: SOLR Request Performance
If the environment uses SOLR as its search provider, this dashboard shows the response times of the SOLR engine and the overall number of calls. I rarely find issues with SOLR itself. The most common problem pattern is, similar to the database patterns above, too many unnecessary calls to the search provider per transaction. Ideally there should be only one call to the search backend per transaction, but I have seen instances with 50+ calls. Of course this adds massive latency to transactions, leading to bad page response times. The lower chart, which shows the number of SOLR calls per transaction, should look as boring as in this screenshot: a flat, constant, small number.
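A quick sanity check matching the one-call-per-transaction guidance is to flag any transaction that exceeds it; the transaction IDs and counts below are made up:

```python
# SOLR calls per transaction, keyed by a hypothetical transaction ID.
transactions = {"t1": 1, "t2": 1, "t3": 52, "t4": 0}

# Anything above one search call per transaction deserves a look.
offenders = {t: n for t, n in transactions.items() if n > 1}
print(offenders)  # -> {'t3': 52}
```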
Check #10: Understand End User’s Experience
The best-performing backend servers do not guarantee the best client-side performance. Third-party content and client-side errors, among other things, can impact end-user experience. That’s why I also look at the End User Performance dashboard. I focus on the overall user experience index (Apdex based), which should be close to 1, as well as the User Action Breakdown, which tells me where the majority of time is spent from the user’s perspective.
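For reference, the standard Apdex formula is (satisfied + tolerating/2) / total, where "satisfied" samples complete within a target threshold T and "tolerating" ones within 4T. A sketch with a hypothetical 0.5-second target:

```python
def apdex(samples, t=0.5):
    # Standard Apdex: satisfied counts fully, tolerating counts half,
    # frustrated (above 4T) counts zero.
    satisfied = sum(1 for s in samples if s <= t)
    tolerating = sum(1 for s in samples if t < s <= 4 * t)
    return (satisfied + tolerating / 2) / len(samples)

# Two satisfied, two tolerating, one frustrated sample -> (2 + 1) / 5
print(apdex([0.3, 0.4, 0.9, 1.5, 5.0]))  # -> 0.6
```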
If the site relies on third-party content (e.g. marketing trackers, Google Analytics, etc.), it’s always worth keeping an eye on its load time as well. In the past I’ve seen sites with more than 20 different tracking tools, added over time at the request of different departments; just including all these (mostly overlapping) tools negatively impacted the user experience of the site.
Use an Overall View to Track Performance Analysis
It doesn’t matter whether we use the above dashboards for production monitoring or during an extended load test. They help you identify where you stand and support you in visualizing and narrowing down potential performance bottlenecks. However, they do not solve problems (that would be nice, wouldn’t it?). Rather, they point us in the right direction and alert us when something goes bad.
Having a standardized configuration and a proven set of monitoring dashboards for a widely used eCommerce platform makes getting started with Application Performance Management easy. Bringing the experience from many other environments, and from the eCommerce platform vendor itself, into a well-defined best-practice implementation saves time and money. And when used for firefighting and urgent troubleshooting, it can save you precious time under pressure.
In the next post I will walk through a real-world scenario to show you how using the metrics described above in conjunction with each other can help you to find issues and optimize your hybris deployment.