Week 2 – The many faces of end-user experience monitoring

Inspired by a comment from Wim Leers on one of our other posts on web performance, I decided to change plans and write about end-user experience monitoring this week. If you google end-user experience monitoring tools, you will find a number of different approaches.

End-user experience, as I think we all agree, is the performance perceived by the actual user at a specific point in time. This sounds simple, but in reality it is not easy to ascertain. End-user monitoring is often referred to as the last mile in monitoring (which somehow reminds me of the movie with Tom Hanks), mainly because it is complex and difficult to measure. In fact, the closer you get to your users, the harder it gets. You suddenly have to monitor thousands and thousands of users, which means you can only collect a limited amount of data per user; otherwise you would not be able to process the resulting volume of information. Furthermore, this information has to be sent over the wire, and sending lots of performance data from your end users’ browsers won’t make them happy either.

Data Collection

Before looking at the different approaches, let’s first discuss what information we really want to track.

Rather than starting from the data, let’s discuss which questions we want end-user monitoring to answer. These questions will lead us to the required data anyway. First, we need monitoring information about the general user experience:

  • How long did it take to load the page?
  • Were there any problems on the page?
  • How long did certain actions take on a page?

Secondly, in case of problems we want to get additional information which helps us to understand and diagnose them such as:

  • What was the reason for a slow download – the network or the server?
  • What did the user do that caused a problem?
  • What browser, operating system and connection were used?

The first question can be answered by tracking resource load times. Here we want to measure the following metrics:

  • Time to First Byte – When the first byte of the web page was received.
  • Time to First Visual – When the page content was visible for the first time. This is the first time a drawing operation occurred.
  • Time to On Load – When the onLoad event of the page was executed.
  • Time to Page Ready – When all initial content is ready and JavaScript execution can safely start.

The network times should be split up into wait time (the delay until a browser connection is available), DNS lookup time, transfer time and server time. This is especially useful in diagnosing network-related problems. Further, we want to see HTTP headers to find improper caching configuration or HTTPS-related connection problems.
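To make the split into wait, DNS, server and transfer time a bit more concrete, here is a minimal Python sketch that derives the four phases from raw timestamps. The field names and the sample values are assumptions made for this example and are not taken from any particular tool. It also assumes the request is sent right after the DNS lookup, so the "server" phase includes connection setup and network latency.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class RequestTimestamps:
    """Timestamps for one request, in ms since the start of navigation.

    All field names are hypothetical and exist only for this sketch.
    """
    queued: int               # browser decided to load the resource
    connection_acquired: int  # a browser connection became available
    dns_done: int             # DNS lookup finished
    first_byte: int           # first byte of the response arrived
    last_byte: int            # last byte of the response arrived


def network_breakdown(t: RequestTimestamps) -> Dict[str, int]:
    """Split the total request time into the four phases named above.

    Assumption: the request is sent right after the DNS lookup, so
    "server" also contains connection setup and network latency.
    """
    return {
        "wait": t.connection_acquired - t.queued,
        "dns": t.dns_done - t.connection_acquired,
        "server": t.first_byte - t.dns_done,
        "transfer": t.last_byte - t.first_byte,
    }


# Made-up sample values, purely for illustration.
sample = RequestTimestamps(queued=0, connection_acquired=12, dns_done=40,
                           first_byte=190, last_byte=260)
phases = network_breakdown(sample)
print(phases)
```

Because the phases are consecutive, they add up to the total request time. A large "wait" value points at connection limits in the browser, while a large "server" value points at the back end or the network path to it.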

So far, this is information we need for any type of web page. For Web 2.0 applications we need much more. We are additionally interested in the time it takes to execute end-user events like the click of a button, the round-trip time for AJAX requests, rendering times and JavaScript errors. When talking about AJAX requests, I mean not only XHR requests but also requests made using dynamically generated script blocks.

Having an understanding of the data we want to collect, let’s now look at the various technologies which are used to collect end-user experience metrics.

Synthetic Transactions

Synthetic transactions are based on the concept of emulating real users. They use pre-recorded scripts that define end-user behavior and are executed at defined intervals. Some providers of these tools execute the scripts from more than one hundred locations worldwide. We at Dynatrace provide a script-based monitoring plug-in for the same purpose. While synthetic transactions do not monitor the perception of real end users, they constantly monitor the performance of specific pre-defined transactions. The transactions, however, have to be “read-only”, as they would otherwise kick off, for example, real purchase processes. This limits their use to a subset of your business-critical transactions. If you create a kind of dummy user in your application whose transactions are filtered out later in the process, you might be able to monitor more transactions.

The key point here is that synthetic transaction monitoring is not about the performance perceived by real users of your application. Rather, it acts as a reference measurement that helps to show performance degradation, detect networking problems or provide notifications in case of errors.

There are also numerous SaaS providers offering this service. Technically, there are two different approaches to generating requests: some solutions replay recorded HTTP traffic patterns, others drive real browser instances. The clear advantage of the second approach is that it does not need to emulate browser behavior. Especially for Web 2.0 applications, which rely heavily on JavaScript and AJAX communication, only a browser-based approach is feasible.
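The HTTP-replay flavor of a synthetic transaction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Step` and `StepResult` types, the `run_script` function and the `allow_writes` flag (standing in for the dummy-user case) are hypothetical names, and the actual HTTP client is passed in as a `send` function so the sketch stays independent of any library.

```python
import time
from dataclasses import dataclass
from typing import Callable, List

READ_ONLY_METHODS = {"GET", "HEAD"}


@dataclass
class Step:
    name: str
    method: str
    url: str


@dataclass
class StepResult:
    name: str
    ok: bool
    duration_ms: float


def run_script(steps: List[Step],
               send: Callable[[str, str], int],
               clock: Callable[[], float] = time.monotonic,
               allow_writes: bool = False) -> List[StepResult]:
    """Execute a pre-recorded script once and time every step.

    `send(method, url)` performs the request and returns the HTTP status.
    Unless `allow_writes` is set (for example for a dummy user), the whole
    script is rejected before any request is sent if a step is not read-only.
    """
    if not allow_writes:
        for step in steps:
            if step.method.upper() not in READ_ONLY_METHODS:
                raise ValueError(f"step {step.name!r} is not read-only")

    results = []
    for step in steps:
        started = clock()
        status = send(step.method, step.url)
        duration_ms = (clock() - started) * 1000.0
        results.append(StepResult(step.name, 200 <= status < 400, duration_ms))
    return results


# Usage with a fake transport and a fake clock, so nothing hits the network.
ticks = iter([0.0, 0.25, 1.0, 1.5])
script = [Step("home", "GET", "https://example.test/"),
          Step("search", "GET", "https://example.test/search?q=shoes")]
results = run_script(script, send=lambda method, url: 200,
                     clock=lambda: next(ticks))
```

A scheduler would call `run_script` at a fixed interval from each monitoring location and compare the durations against a baseline. A browser-driven variant would replace `send` with calls into a real browser instance.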

Overview Synthetic Monitoring Architecture

Network Sniffing

Network sniffing is the second category of end-user monitoring tools. Unlike synthetic transactions, these tools rely on real end-user traffic. Special appliances monitor all network traffic sent between clients and web servers. These appliances, however, sit within your own network, meaning they are farther away from your end users. On the other hand, you can use them to monitor real end-user traffic. They analyze this traffic in real time, verify response times against SLAs and also check content for correctness. These monitoring solutions also provide the actual HTML seen by your end users and enable you to follow their click paths.
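The SLA verification step can be illustrated with a small Python sketch that pairs captured requests and responses per connection and reports the slow ones. The `Packet` record and the `sla_violations` function are hypothetical names for this example. Real appliances reassemble TCP streams and parse HTTP; here we assume that has already happened and that each connection has at most one outstanding request.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass
class Packet:
    """One already-decoded HTTP message seen on the wire (hypothetical shape)."""
    timestamp_ms: int
    connection: str  # for example "client-ip:port"
    kind: str        # "request" or "response"
    url: str = ""    # only set on requests


def sla_violations(packets: Iterable[Packet],
                   sla_ms: int) -> List[Tuple[str, int]]:
    """Return (url, response_time_ms) for every exchange slower than the SLA.

    Assumption: at most one outstanding request per connection.
    """
    pending: Dict[str, Packet] = {}
    violations = []
    for packet in sorted(packets, key=lambda p: p.timestamp_ms):
        if packet.kind == "request":
            pending[packet.connection] = packet
        elif packet.kind == "response" and packet.connection in pending:
            request = pending.pop(packet.connection)
            elapsed = packet.timestamp_ms - request.timestamp_ms
            if elapsed > sla_ms:
                violations.append((request.url, elapsed))
    return violations


# Made-up capture: two clients, one of them waits too long for the cart page.
capture = [
    Packet(1000, "10.0.0.5:50311", "request", "/home"),
    Packet(1010, "10.0.0.9:41822", "request", "/cart"),
    Packet(1180, "10.0.0.5:50311", "response"),
    Packet(3510, "10.0.0.9:41822", "response"),
]
slow = sla_violations(capture, sla_ms=2000)
```

Note that these times are measured inside your own network, so they exclude the latency between the appliance and the end user.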

The disadvantage of this approach is that no browser-level metrics are collected. Problems caused by massive DOM access, excessive JavaScript execution or rendering problems cannot be found at this level.

Overview Network Sniffing Architecture

Instrumentation at Browser Level

The closest place to the end user is the browser. Therefore, collecting metrics at the browser level is the most accurate way to monitor end-user perceived performance. Monitoring at the browser level is achieved by injecting JavaScript monitoring code into the page. The easiest way to do this is header and footer injection: small portions of JavaScript code are injected at the beginning and the end of a page. This code collects data for certain browser timings like first byte received, page completed or onLoad. Additional instrumentation can be used to collect more metrics by injecting additional JavaScript measurement code. An example is the Google Logging API for Speed Tracer, which uses the console logging API of WebKit.

How to efficiently get this instrumentation into the JavaScript code is still an open issue, however. Ideally this is done automatically when the code is loaded. This means every web request must be modified in real time, or JavaScript resources must be pre-processed. The alternative approach is to add logging calls to the code itself. This kind of source-level instrumentation requires developers to add these calls.
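As a rough sketch of automatic header and footer injection, the following Python function rewrites an HTML response on the server or in a proxy before it is delivered. The snippet contents, the `window.__monStart` variable and the `/monitor.js` path are made up for this example. A production filter would also have to deal with compressed responses, streaming and malformed markup.

```python
import re

# Hypothetical snippets: the header records a start time as early as possible,
# the footer loads the script that collects and reports the timings.
HEADER_SNIPPET = "<script>window.__monStart = new Date().getTime();</script>"
FOOTER_SNIPPET = '<script src="/monitor.js"></script>'

_HEAD_OPEN = re.compile(r"<head\b[^>]*>", re.IGNORECASE)
_BODY_CLOSE = re.compile(r"</body\s*>", re.IGNORECASE)


def inject_monitoring(html: str) -> str:
    """Insert the header snippet right after <head> and the footer snippet
    right before the last </body>. Pages without a <head> element are
    returned unchanged."""
    head = _HEAD_OPEN.search(html)
    if head is None:
        return html
    html = html[:head.end()] + HEADER_SNIPPET + html[head.end():]

    closing_tags = list(_BODY_CLOSE.finditer(html))
    if closing_tags:
        position = closing_tags[-1].start()
        html = html[:position] + FOOTER_SNIPPET + html[position:]
    else:
        # No closing body tag: append the footer at the very end.
        html += FOOTER_SNIPPET
    return html


page = "<html><head><title>Shop</title></head><body><p>Hello</p></body></html>"
instrumented = inject_monitoring(page)
```

The application code stays untouched, which is the main attraction of this approach compared to source-level instrumentation.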

The challenge is then to deliver these results back to a centralized monitoring server. Episodes by Steve Souders suggests the use of beacons. Beacons are small web requests with piggybacked monitoring information. Alternatively, XHR requests can be used to communicate with a monitoring server. Both approaches, however, use browser connections that are also required by the application itself. Communication should therefore be kept to a minimum, with small payloads only.
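To give an idea of how small such a beacon payload can be, here is a Python sketch of one possible encoding, together with the matching decoder for the monitoring server. In the browser the URL would of course be built in JavaScript and requested as a tiny image. The endpoint URL, the short parameter names and both function names are assumptions for this example and do not reflect the format used by Episodes or any other tool.

```python
from typing import Dict, Tuple
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical short parameter names to keep the request small.
SHORT_KEYS = {
    "first_byte": "fb",
    "first_visual": "fv",
    "on_load": "ol",
    "page_ready": "pr",
}


def build_beacon_url(endpoint: str, page: str,
                     metrics_ms: Dict[str, float]) -> str:
    """Encode page timings (in ms) as query parameters of one small request."""
    params = [("p", page)]
    for name, short in SHORT_KEYS.items():
        if name in metrics_ms:
            # Whole milliseconds are precise enough and shorter on the wire.
            params.append((short, str(round(metrics_ms[name]))))
    return endpoint + "?" + urlencode(params)


def parse_beacon_url(url: str) -> Tuple[str, Dict[str, int]]:
    """Server-side counterpart: recover the page and its timings."""
    query = parse_qs(urlsplit(url).query)
    long_names = {short: name for name, short in SHORT_KEYS.items()}
    metrics = {long_names[key]: int(values[0])
               for key, values in query.items() if key in long_names}
    return query["p"][0], metrics


beacon = build_beacon_url("https://monitor.example.test/beacon.gif", "/cart",
                          {"first_byte": 180.4, "on_load": 1240.6})
page_name, timings = parse_beacon_url(beacon)
```

Sending one beacon per page view, rather than one per measurement, keeps the number of extra connections low.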

On the server side, the challenge is to process this potentially huge amount of data. Thousands and thousands of clients sending small packets still lead to huge amounts of data. The specific requirements of Web 2.0 applications impose another challenge here. Plain page timing metrics might not be enough for Google Mail and other highly interactive single-page applications. Monitoring the response times of XHR requests is essential to understand the communication behavior of the application. The ZK framework’s performance monitor, for example, offers such capabilities.
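One common way to keep the server-side data volume under control is to aggregate measurements instead of storing every sample. The Python sketch below keeps one fixed-size histogram per page, so memory use depends on the number of pages rather than the number of users. The bucket boundaries and the class name are arbitrary choices for this example, and the percentile it returns is only the upper bound of the matching bucket, not an exact value.

```python
from collections import defaultdict
from typing import DefaultDict, List

# Arbitrary bucket boundaries in milliseconds, chosen for this sketch.
BUCKET_UPPER_BOUNDS_MS: List[int] = [100, 250, 500, 1000, 2000, 5000]


class LoadTimeHistogram:
    """Fixed-size histogram: memory use is independent of the sample count."""

    def __init__(self) -> None:
        # One counter per bucket plus an overflow bucket at the end.
        self.counts = [0] * (len(BUCKET_UPPER_BOUNDS_MS) + 1)
        self.total = 0

    def add(self, value_ms: float) -> None:
        for index, bound in enumerate(BUCKET_UPPER_BOUNDS_MS):
            if value_ms <= bound:
                self.counts[index] += 1
                break
        else:
            self.counts[-1] += 1
        self.total += 1

    def percentile_upper_bound(self, fraction: float) -> float:
        """Upper bound (ms) of the bucket that contains the given fraction
        of samples; infinity if it falls into the overflow bucket."""
        if self.total == 0:
            raise ValueError("no samples recorded")
        threshold = fraction * self.total
        seen = 0
        for index, count in enumerate(self.counts):
            seen += count
            if seen >= threshold:
                if index < len(BUCKET_UPPER_BOUNDS_MS):
                    return float(BUCKET_UPPER_BOUNDS_MS[index])
                break
        return float("inf")


# One histogram per page, created on first use.
per_page: DefaultDict[str, LoadTimeHistogram] = defaultdict(LoadTimeHistogram)
for load_time_ms in [80, 120, 300, 900, 4000]:
    per_page["/home"].add(load_time_ms)

median_bound = per_page["/home"].percentile_upper_bound(0.5)
p90_bound = per_page["/home"].percentile_upper_bound(0.9)
```

The same structure can be kept per XHR endpoint to track the response times of AJAX requests in single-page applications.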

While we get many more metrics at this level, we still miss important details. We will not get any rendering information, as this cannot be queried at the JavaScript level. Tracing network requests is also a challenge at this level, as we have to inject code everywhere resources get loaded. Analyzing whether a resource was loaded from the cache is not possible either, as this information is not available at this level.

Browser Plug-Ins and Extensions

If we want to get even more information, there is no way around using a browser plug-in or extension. There is Firebug for Firefox, Speed Tracer for Chrome and Dynatrace AJAX Edition for Internet Explorer. The major disadvantage from an end-user monitoring point of view is that the user has to install the plug-in first. While this is no issue in development or test environments, it is an issue in production. Apart from that, these tools provide far more information than is needed for end-user monitoring at a large scale. The major advantage, however, is that you get the most detail and can overcome the limitations of JavaScript injection in browsers.

Currently these tools are mostly used in test and development environments. In some cases they can also be used for troubleshooting end-user problems, if the user agrees to install the plug-in. They are, however, of great value for optimizing end-user experience up front. The detailed metrics of these tools allow optimizing the end-user performance for specific browsers. Below you can see a screenshot of a Dynatrace AJAX Edition timeline view showing rendering, JavaScript and download behavior of the browser as well as a detailed trace of JavaScript execution.

Browser Diagnosis Showing Detailed Metrics on Rendering, Download, etc and a Detailed JavaScript Trace

The Role of Server-Side Monitoring

Although this post is about end-user experience, I want to include a few words on server-side monitoring as well. One could argue that this approach is the furthest away from the end user. This is true; however, application performance monitoring at the server side also provides insight into end-user performance. Very often server-side monitoring is combined with end-user monitoring. The image below shows an example of an integrated monitoring view combining active monitoring with server-side metrics. Problems that have their root cause on the server side can only be diagnosed efficiently using server-side monitoring and diagnosis data.

Integrated End-User and Client Side Monitoring using Synthetic Transactions

What will the future bring us?

This is an interesting and difficult question to answer. It is best answered by looking at the challenges we are facing. The first challenge is the tool challenge. Currently it is not possible to get all the information you need from a single tool, because no single technology can collect all of it. JavaScript injection is easy to roll out; however, it has limitations regarding the data it can provide. Browser plug-ins enable the deepest insight into browser behavior, but they require explicit deployment, which makes rolling them out a challenge. Network sniffers, while being able to capture and correlate the total traffic of a user, have no insight into the browser. Synthetic transactions basically serve a slightly different purpose.

The future will first be about integrated tool chains that combine information from many sources in a seamless way. Monitoring at the browser level will be done using more and more sophisticated approaches. Hopefully browser developers will open up the functionality they currently only provide via plug-ins and also allow the development of standardized extensions based on a unified API. A first step towards such a unified interface is the Web Timing Working Draft.

This post is part of our 2010 Application Performance Almanach.

Alois is Chief Technology Strategist of Dynatrace. He is a fanatic about monitoring, DevOps and application performance. He has spent most of his professional career building monitoring tools and speeding up applications. He is a regular conference speaker, blogger, book author and Sushi maniac.