In recent testimony to the House Energy and Commerce Committee, Health and Human Services Secretary Kathleen Sebelius admitted that HHS failed to perform enough testing to ensure a working system and revealed that parts of the website did not get “loaded” until the third week of September, one week before its launch. “We did not adequately do end-to-end testing,” Sebelius said.
As a performance professional for more than 20 years, I’ve seen mission-critical projects succeed and fail. All of the successful projects incorporated a performance and scalability testing plan that starts in development, flows into progressively heavier load testing, and rolls into production on day one and every day thereafter. What steps should have been followed to create a successful launch of the HealthCare.gov site? What are the steps and best practices that forward-thinking IT organizations execute to ensure success? And what do you do if you, like the folks responsible for the government HealthCare site, are in firefighting mode – trying to rescue a failed launch?
There are a number of best practices that should be followed to implement a proper performance engineering process. I will focus on one of the most important from a Healthcare.gov perspective – effective load testing.
Key activities for an effective load test are:
- Generate load representative of real users
- Follow user transactions and isolate bottlenecks
- Communicate findings with development & operations
- Compare test results
Let’s look at the first two of these four load testing activities in greater detail:
1. Generating Load
Automated load generation tools differ greatly in the methods used to generate load. The most popular tools generate load from inside the firewall, with test machines sending traffic over the local area network to the system under test (SUT). This approach is very beneficial in the early stages of component and integration testing, but it falls short of providing a complete performance picture because it does not account for key variables outside the firewall, including network delays, DNS lookup, firewall rules, load balancing, CDNs, and other third-party components.
The most accurate method of generating end-user load is through a geographically dispersed load testing network that drives load from two sources:
- Cloud Data Centers – Large volumes of load can be generated by harnessing the resources of commercially available cloud data centers. Select locations that represent the geographies of your actual users.
- Last Mile – Last mile endpoints are machines connected to the internet through local ISPs at various bandwidths. Used as testing agents during a load test, they provide the most accurate end-user measurement available. While datacenter load is needed for high volume and repeatability, last mile load provides a much more realistic view of user experience.
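One way to turn "select locations that represent the geographies of your actual users" into a test plan is to split the virtual users across load sources in proportion to real traffic. The sketch below is only an illustration: the location names and weights are hypothetical, and a commercial load testing tool would normally handle this allocation for you.

```python
def allocate_virtual_users(total_users, weights):
    """Split total_users across load sources in proportion to integer weights.

    Uses the largest-remainder method so the parts always add up to total_users.
    """
    total_weight = sum(weights.values())
    allocation, remainders = {}, {}
    for source, weight in weights.items():
        allocation[source], remainders[source] = divmod(total_users * weight, total_weight)
    # Hand out the users lost to integer division, largest remainder first.
    leftover = total_users - sum(allocation.values())
    for source in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        allocation[source] += 1
    return allocation


# Hypothetical traffic mix (percent of real users), not measured data.
traffic_mix = {
    "cloud:us-east": 40,
    "cloud:us-central": 25,
    "cloud:us-west": 30,
    "last-mile": 5,
}
plan = allocate_virtual_users(1000, traffic_mix)
print(plan)
```

Keeping the last mile share small reflects the trade-off described above: datacenter agents supply volume and repeatability, while a modest number of last mile agents supplies the realistic end-user measurement.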
Why is external load testing important? We’ve conducted synthetic monitoring of the healthcare.gov site from several datacenters and last mile peers across the U.S. over the past several days to test for availability and performance. These single-user synthetic transactions are a great way to baseline the performance of critical pages and transactions.
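To make the idea of a single-user synthetic baseline concrete, here is a minimal sketch. It is not the monitoring product used for the measurements in this post: the function names are my own, and the sample values are placeholders rather than measured data.

```python
import time
from statistics import mean


def time_transaction(fetch):
    """Return the wall-clock seconds taken by one call to fetch()."""
    start = time.perf_counter()
    fetch()
    return time.perf_counter() - start


def baseline(samples_by_source):
    """Average the response-time samples (in seconds) collected from each source."""
    return {
        source: round(mean(samples), 2)
        for source, samples in samples_by_source.items()
    }


# Placeholder samples in seconds, NOT real measurements.
samples = {
    "backbone": [1.0, 2.0, 3.0],
    "last_mile": [4.0, 6.0, 8.0],
}
print(baseline(samples))
```

In a real monitor, `fetch` would be a callable that loads the page under test from the agent's location (for example, one that calls `urllib.request.urlopen(url).read()`), and the samples would be collected on a schedule from each datacenter and last mile peer.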
Figure 1 shows that the average response time of the healthcare.gov home page as measured from datacenters in 10 cities was 3.4 seconds.
Figure 1- Average Backbone Response Time
Similar monitoring from the last mile (Figure 2) shows an average response time of 11.4 seconds.
Figure 2- Average Lastmile Response Time
This monitoring data shows an 8-second difference between response times as measured from the datacenter versus the last mile. If we were to measure this transaction from inside the firewall, we might see a response time of 1.5 seconds. So what does this have to do with load testing?
Consider the impact on the infrastructure of a transaction that takes 1.5 seconds versus 3 seconds versus 11 seconds to execute. The longer session has performance implications for every tier of the infrastructure, because queues, connections, processes, memory allocations, pools, etc. stay open longer. Now extrapolate this impact across the thousands of users that access the site concurrently, and it’s easy to see how resource usage and user concurrency can climb dramatically under a realistic load compared with a load generated inside the firewall. Internal load testing alone can give a false sense of security that an external load test with last mile users would dispel.
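A rough way to quantify this is Little's Law: the average number of requests in flight equals the arrival rate multiplied by the average time each request spends in the system. The sketch below applies it to the three response times discussed above. The arrival rate is a hypothetical figure chosen for illustration, and the 1.5-second value is the assumed inside-the-firewall time, not a measurement.

```python
def in_flight_requests(arrivals_per_second, response_time_seconds):
    """Little's Law: average requests in the system = arrival rate x time in system."""
    return arrivals_per_second * response_time_seconds


ARRIVALS_PER_SECOND = 1000  # hypothetical arrival rate, for illustration only

scenarios = [
    ("inside the firewall (assumed)", 1.5),
    ("datacenter backbone", 3.4),
    ("last mile", 11.4),
]
for label, seconds in scenarios:
    in_flight = in_flight_requests(ARRIVALS_PER_SECOND, seconds)
    print(f"{label}: about {in_flight:.0f} requests in flight")
```

This is only a first-order estimate. Once connection pools, thread pools, or queues fill up, response times themselves start to grow, so the real effect on a saturated system can be considerably worse than the simple product suggests.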
External load testing is essential to pre-launch testing of customer-facing web applications because it:
- Applies the most realistic load on the infrastructure
- Shows user experience based on geography
- Exercises any geo-based technologies (CDN, load balancing)
- Details the impact of 3rd parties (often geo based) on performance
- Exercises mobile infrastructure
2. Isolate Bottlenecks
The generation of realistic load can be quite painless with the advances in load testing technologies as described above. The purpose of generating load, however, is to isolate the root cause of performance issues as reported by the load testing tool. Isolation is only possible with 1) detailed monitoring of the system under test and 2) integration between the load testing tool and the monitoring solution.
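One common way to integrate the two is to have the load generator tag every request so the monitoring side can tie a slow transaction back to the test run, script step, and virtual user that produced it. The sketch below shows the idea only. The header names are hypothetical, not those of any particular product, and the monitoring solution would have to be configured to capture them.

```python
import uuid


def correlation_headers(test_run_id, transaction, virtual_user):
    """Build HTTP headers that identify which test step issued a request.

    The header names are made up for this example.
    """
    return {
        "X-LoadTest-Run": test_run_id,
        "X-LoadTest-Transaction": transaction,
        "X-LoadTest-VirtualUser": str(virtual_user),
    }


run_id = uuid.uuid4().hex  # one identifier per test execution
headers = correlation_headers(run_id, "destination_search", 17)
print(headers)
```

The resulting dictionary can be attached to each request the load script sends, for example through the `headers` argument of `urllib.request.Request`, so that server-side traces can be filtered by run and transaction name.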
Since we don’t have access to the systems hosting the healthcare.gov site, we’ll use a travel booking application to illustrate.
Figure 3- Load Test Performance Report
Figure 4- Destination Search Waterfall Chart
We can further drill into the object in question to obtain detailed information about the execution path of this object and a host of other performance data, as shown in Figure 5 below.
Figure 5- Destination Search PurePath Diagram
A strong integration of performance monitoring with web load testing removes the guesswork from bottleneck analysis and immediately points to the root cause of performance problems. These are the types of processes and tools required to quickly understand the source of performance problems in complex web applications.
In my next post, I’ll discuss the final steps in the performance testing process – Communicating with Development/Operations and Comparing Test Results. For more information, you can also download a detailed white paper entitled “Graduating from Load Testing to Performance Engineering.”