Our dynaTrace Community Portal is our gateway to our users. Especially with the rapidly growing number of worldwide users of our free Dynatrace AJAX Edition, it is necessary to keep track of how well our pages perform from around the globe to satisfy our “performance hungry” community 🙂
There are two important questions we want to get answered:
- How Fast or Slow are pages from different regions in the world?
- YSlow? In case of a slow page we need to figure out whether the problem is caused by slow network connections, static content delivery or dynamic application code
Let me guide you through the questions you have to answer in order to run a monitoring project from the cloud – and let me show you one actual implementation that we use to monitor our community portal pages.
Question 1: What do we monitor?
In an ideal world we would monitor all possible use-case scenarios and click paths that a user can take on our web site. Unless you only have a bunch of static pages, however, this is unrealistic. Therefore it is important to figure out which pages and which use cases matter most. For every web site the performance of the landing page is important. The speed of this first impression defines whether the user starts with a positive or negative experience, and is therefore likely to continue or leave for a different site.
If you run an online shop, your most important pages are probably those that show the products you are selling (e.g.: a special offer page showing the currently discounted units or the individual products in their categories). The use cases for an online shop would be a product search and the complete process of purchasing an item with all the steps this includes, e.g.: putting the item into the shopping cart, verifying billing and shipping information and the final credit card transaction.
Answer: In the case of our Dynatrace Community Portal the important pages are the start page that contains the latest News and links to other resource areas, the individual Topic Sections, Downloads, Online Documentation and Educational Services. As for important use cases: content that is not publicly available requires the user to log in. The first use case therefore is accessing content that requires a login. Another use case is commenting on pages. We allow our community to share their thoughts and feedback as comments in certain download or discussion areas. As it is important for us to get feedback from our community, we need to make sure that commenting on a page is fast.
Question 2: From where do we want to monitor?
This question can be answered by another question: where are your current users located and from where do you expect new users to come? If you have a web site for a local grocery store in the US you are probably not interested in how fast the page is from Asia. If you run a global marketplace for goods it is in your interest that everybody around the world has an equally fast experience.
Answer: In case you don’t know where your users come from it is rather easy to find out. You can start by analyzing your web server log files and looking up the regions of the incoming requests. Most web hosting services provide this feature and allow you to create nice geographical reports. Another option is to use the data from analytics services such as Google Analytics.
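As a rough illustration of the log-file approach, here is a minimal Python sketch that counts requests per region. It assumes the access log uses the Common Log Format, where the client address is the first field. The prefix-to-region table and the sample log lines are made up for the example; a real project would look up addresses in a GeoIP database instead.

```python
import re
from collections import Counter

# The client address is the first field of a Common Log Format line.
CLIENT_RE = re.compile(r"^(\S+)\s")

# Hypothetical prefix table for illustration only; a real project would
# use a GeoIP database instead.
REGION_BY_PREFIX = {
    "192.0.2.": "Europe",
    "198.51.100.": "North America",
    "203.0.113.": "Asia",
}


def region_of(ip):
    """Return the region for an address, or "Unknown" if no prefix matches."""
    for prefix, region in REGION_BY_PREFIX.items():
        if ip.startswith(prefix):
            return region
    return "Unknown"


def requests_per_region(log_lines):
    """Count requests per region across an iterable of access-log lines."""
    counts = Counter()
    for line in log_lines:
        match = CLIENT_RE.match(line)
        if match:
            counts[region_of(match.group(1))] += 1
    return counts


# Made-up sample lines, not taken from a real server.
sample_log = [
    '203.0.113.7 - - [01/Jan/2024:10:00:01 +0000] "GET / HTTP/1.1" 200 5120',
    '198.51.100.23 - - [01/Jan/2024:10:00:02 +0000] "GET / HTTP/1.1" 200 5120',
    '203.0.113.9 - - [01/Jan/2024:10:00:05 +0000] "GET / HTTP/1.1" 200 5120',
    '10.0.0.1 - - [01/Jan/2024:10:00:07 +0000] "GET / HTTP/1.1" 200 5120',
]

for region, count in requests_per_region(sample_log).most_common():
    print(region, count)
```

The regions with the highest counts are the first candidates for monitoring locations.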
Question 3: How frequently do we monitor?
The frequency depends on how quickly you want to learn about a problem and how likely it is that you catch all your problems. If you monitor your page once a day, your web site may be unreachable for almost 24 hours before you realize it. It can also happen that you only have problems during a certain period of the day, and if you do not monitor during this period you will never be aware of them. Monitoring too often may overwhelm you with data and, depending on the monitoring service you use, might also get a bit expensive.
Answer: Ideally you want to monitor your site as frequently as possible. To keep the cost of a monitoring service in check and avoid being overwhelmed by data, you can opt to check your important pages every 1-5 minutes and your important use cases every 5-15 minutes. This allows you to react almost immediately when your web site is unavailable or is just slow for your users. It also allows you to react to problems with your main use case scenarios within an acceptable time frame.
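To get a feeling for the trade-off between data volume and detection time, the arithmetic can be sketched as follows. The number of monitoring locations (four) is an assumption for illustration, and the worst-case delay simply assumes that a failure starts right after a check has finished.

```python
MINUTES_PER_DAY = 24 * 60


def checks_per_day(interval_minutes, locations):
    """Number of monitor executions per day across all locations."""
    return (MINUTES_PER_DAY // interval_minutes) * locations


def worst_case_detection_minutes(interval_minutes):
    """A failure that starts right after a check is only seen by the next one."""
    return interval_minutes


LOCATIONS = 4  # assumed number of monitoring locations

for label, interval in (("important pages", 5), ("important use cases", 15)):
    print(
        label,
        checks_per_day(interval, LOCATIONS),
        worst_case_detection_minutes(interval),
    )
```

Shorter intervals shrink the detection delay but multiply the number of results you have to store, pay for and look at.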
Question 4: What information do we need to identify the root cause of a problem?
This is probably the most important question to answer: What information do YOU need in order to fix a problem that has been identified by your monitor as fast as possible to get the RED light back to GREEN?
Answers: First we need to know whether this problem is regional or global. If you execute your monitors from different locations and only one has a problem, it is very likely a network or content-delivery issue for that region. Cross-checking the monitoring results from the different regions helps to answer this question.
In case the slow elements are static content (images, style sheets, scripts), you can use Content Delivery Networks (CDN) to bring these resources closer to the end user instead of serving them from your own web servers out of your central data center. Also double-check the cache settings on these objects to make sure they are not requested more often than necessary. I wrote several blog posts about this problem on sites such as the Frankfurt Airport.
In case the content is dynamically generated by your application servers, you need to get the information that your software architects and/or engineers can use to identify the root cause of this particular failed transaction as fast as possible. Everybody has their own approach to what data to collect. Some use extensive logging and log analyzers to digest the logs generated at the time when the problem happened. Some just analyze the performance metrics from their web and app servers to figure out where the problem could be. Others use application performance management solutions that collect in-depth transactional traces that can be pulled out for the engineers to analyze and to get to the root cause of the problem.
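The regional-versus-global cross-check from the first answer can be sketched in a few lines of Python. The function name, the threshold and the response times below are made up for illustration; they are not results from our monitors.

```python
def classify_problem(response_times, threshold_seconds):
    """Classify monitor results as "ok", "regional" or "global".

    response_times maps a monitoring location to its measured response
    time in seconds. A location counts as slow when it exceeds the threshold.
    """
    slow = sorted(
        location
        for location, seconds in response_times.items()
        if seconds > threshold_seconds
    )
    if not slow:
        return "ok", slow
    if len(slow) == len(response_times):
        # Every location is slow: look at the server or the content itself.
        return "global", slow
    # Only some locations are slow: suspect network or content delivery there.
    return "regional", slow


# Made-up measurements for illustration.
measurements = {"Chicago": 2.1, "London": 2.4, "Sydney": 9.8}
print(classify_problem(measurements, threshold_seconds=5.0))
```

A "regional" result points towards the network or a CDN for that region, while a "global" result is a reason to look at the application server first.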
Show Case with Webmetrics and Dynatrace
Alright – let’s get back to my own monitoring project. I’ve identified which pages I want to monitor. I know that we have a global user base so I want to monitor these pages from around the globe. I want to monitor the important pages on a 5 minute interval and the important use cases every 15 minutes. And – lucky me – I have Dynatrace running on our Application Server that hosts the Dynatrace Community Portal. We use Confluence as Content Management Software and customized it to our needs. We added our own macros and plugins. Our web designers did a great job in making the site look really appealing, and some of my colleagues and I are generating a lot of content to share with our community.
Setting up my monitors
I am using Webmetrics Services from Neustar to implement my monitoring project. The monitoring service executes monitors from The Cloud – meaning – the monitors can be placed on different monitoring agents around the globe and all results are accessible through a single monitoring dashboard.
I’ve created a new Application Monitor and recorded the monitoring scenario that clicks through my important pages. In order to leverage an interface of Dynatrace that links monitoring transactions with their respective names and location to a dynaTrace PurePath that is captured on the application server, I augment the monitoring script with the additional Dynatrace HTTP Header for the individual monitoring requests:
In the Monitoring Settings I can specify how often and from which location the monitor should be executed. As you can see from the script above I coded the monitoring agent location in my header – in this case New Orleans. I can now go on and create several of these monitors and configure them to execute from the locations I want, e.g.: Chicago, London, Sydney … Passing the location via the Dynatrace Header is not a necessity – but it allows me to also identify the incoming requests on the Application Server by monitoring location.
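To illustrate the idea outside of the Webmetrics script, here is a minimal Python sketch that tags a request with a transaction name and a monitoring location. The header name `X-dynaTrace` and the `NA` (name) and `GR` (region) keys are my assumptions about the header format, so check them against the Dynatrace documentation for your version. The URL is a placeholder, and the request is only built, not sent.

```python
import urllib.request

# Assumed header name; verify against the Dynatrace documentation.
DYNATRACE_HEADER = "X-dynaTrace"


def dynatrace_header_value(transaction_name, location):
    """Build the header value from a transaction name and a location.

    The NA and GR keys are assumptions about the header format.
    """
    return f"NA={transaction_name};GR={location}"


def tagged_request(url, transaction_name, location):
    """Return a request object that carries the tagging header."""
    value = dynatrace_header_value(transaction_name, location)
    return urllib.request.Request(url, headers={DYNATRACE_HEADER: value})


# Placeholder URL for illustration; nothing is sent over the network here.
request = tagged_request("https://community.example.com/", "Home Page", "New Orleans")
print(request.header_items())
```

Creating one such request per monitored page and per location gives every monitoring request a name and an origin that can be recognized on the server side.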
Monitoring Dashboard with Root Cause Analysis
I created a Dynatrace Dashboard that includes both the Webmetrics Online Dashboard as well as the Dynatrace PurePath-related data. This eliminates an extra dashboard that I would normally need to look at and it brings the data from the monitoring service and the application performance data nicely together. The following screenshot shows this dashboard where we see that Webmetrics reports really slow page response times for my monitored transactions. On the bottom I see the actual monitoring transactions that make it all the way to the back-end Application Server. We see the number of monitoring transactions (this allows me to see how often the monitors are really executed) and the response time of these requests from the application-server point of view:
The interesting observation here is that Webmetrics reports an average execution time of 20s. Dynatrace on the Application Server tells us that we have one transaction (Monitoring Transaction Home Page) being constantly slow with about 10s on average. We also see that the same transaction had a huge spike on April 25th.
Background Information: the Dynatrace HTTP Header allows us to see the same transaction name (Home Page) on the server side captured data as used in the monitoring script.
The conclusions here are that
1) we must have a content delivery problem as pages for the end user take 20s whereas the server “only” takes 10s and
2) our Home Page transaction has a severe performance problem on the server.
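The arithmetic behind the first conclusion is simple enough to write down. This is only a rough split: it assumes that the end-user average and the server-side average describe the same requests, which the two dashboards do not guarantee.

```python
def split_response_time(end_user_seconds, server_seconds):
    """Split an end-user response time into server time and everything else.

    "delivery" covers whatever happens outside the application server,
    such as network transfer and downloading of static resources.
    """
    delivery = end_user_seconds - server_seconds
    return {
        "server": server_seconds,
        "delivery": delivery,
        "delivery_share": delivery / end_user_seconds,
    }


# Averages reported above: 20s seen by Webmetrics, about 10s on the server.
print(split_response_time(20, 10))
```

With these numbers roughly half of the time is spent outside the application server, which is why both questions below are worth asking.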
What is the problem with Content Delivery?
Webmetrics allows me to get details on every single resource that was downloaded for monitors that exceeded my limits. Turning on the detailed log level shows me individual download times of individual elements:
What is the problem on the Server Side?
Using the Dynatrace HTTP Header Integration between the monitoring script and Dynatrace on the server side not only allows Dynatrace to use the same transaction names. It also allows the monitoring service to capture the PurePath ID that Dynatrace captured for every single monitoring request. With that we can look up the PurePath ID for a failed monitor and then look up this particular PurePath in Dynatrace to figure out what the problem was. The following screenshot shows not only the one PurePath in question (the one we identified with Neustar) but also some others to the same URL. I was interested in seeing if these are consistently slow or whether we just deal with some outliers:
Monitoring web sites has gotten easier due to service providers such as Webmetrics by Neustar. Before you start your monitoring project make sure to answer those questions that I’ve asked myself before I started monitoring our Dynatrace Community Web Site. As always I hope to get some feedback from you about your Best Practices on this topic – I am sure some of you deal with big applications and big monitoring scenarios and it would be great to get some additional insight. There is also a link to a recently recorded webinar with Zappos. They also use monitoring and load testing services from the cloud and give their insight into their day-to-day operations.