Best Practices from Zappos to deliver WOW Performance

Zappos – the leading eCommerce site for shoes and apparel – recently talked about their best practices for delivering WOW Performance to their customers. Zappos re-architected their website, moving from Perl to Enterprise Java, as explosive business growth drove the need to scale and exposed performance problems in the old architecture.

Performance is key to business success for every eCommerce site. Zappos picked an Application Performance Management solution that enabled them to deliver their #1 Core Value to their customers: “Deliver WOW through Service”.

Why Zappos Needed to Re-Architect

Zappos’ eCommerce site has exploded in popularity over the years. The website serves millions of visitors daily and processes between 60,000 and 65,000 purchases every day. Their #1 Core Value is “Deliver WOW through Service” – their success is clearly powered by their customers, and customers who get great service keep coming back to shop.

The original platform was built in Perl and started to show performance problems as the business – and the number of online transactions – grew. Due to the lack of analysis tools and structured performance testing, it was not clear where the performance problems actually were or why the implementation didn’t scale as required.

The decision was made to re-architect the application using Enterprise Java in order to cope with the demands of scale and high performance.

Zappos Performance Environment

The application has been re-architected to run on 3 tiers. Each tier is hosted by 2 JVMs – 6 JVMs in total. Once a build passes the functional requirements it moves on to the performance lab. In-house load-testing tools and the load-testing services from SOASTA allow them to test thousands of transactions per second on a build-to-build basis.
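
The build-to-build approach can be sketched in a few lines of Java. The harness below is a simplified, hypothetical stand-in for their in-house tools and SOASTA’s service; the important idea is that it keeps a latency sample for every single transaction, so later builds can be compared transaction by transaction instead of via a single average:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Simplified sketch of a build-to-build load-test harness: it keeps every
// individual response time rather than only an overall average, so results
// can later be compared per transaction across builds.
public class LoadTestHarness {

    // Stand-in for issuing one transaction; returns elapsed milliseconds.
    static long timeOneTransaction() {
        long start = System.nanoTime();
        simulateRequest();
        return (System.nanoTime() - start) / 1_000_000;
    }

    static void simulateRequest() {
        // Placeholder for the real request (e.g. an HTTP call to the app under test).
        try {
            Thread.sleep(ThreadLocalRandom.current().nextInt(1, 5));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Collect one latency sample per transaction for a given build.
    static List<Long> runLoadTest(int transactions) {
        List<Long> latencies = new ArrayList<>();
        for (int i = 0; i < transactions; i++) {
            latencies.add(timeOneTransaction());
        }
        return latencies;
    }

    public static void main(String[] args) {
        List<Long> results = runLoadTest(20);
        System.out.printf("ran %d transactions, max latency %d ms%n",
                results.size(),
                results.stream().mapToLong(Long::longValue).max().orElse(0));
    }
}
```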

The load testing delivered response times of individual pages as well as transaction throughput. These results were recorded and compared from build to build. The following image shows the results per individual transaction:

Response Time Results from Load Testing

It was assumed that having response times would be enough to manage application performance. However – when a build suddenly showed a performance or throughput decrease, everybody scratched their heads, because these numbers alone gave no indication of the actual root cause. Capturing CPU, memory, network and I/O activity in the system helped to identify a problematic JVM – but it didn’t help to identify the problematic code or the code change that led to the issue.

The lack of visibility into the system, and the amount of time spent finding the actual problems, caused Zappos to look for an Application Performance Management solution that would give insight into the application while under heavy load.

Requirements on an Application Performance Management Solution

The requirements by Zappos for an APM solution were to

  1. provide insight into the application down to the method level
  2. follow each of their distributed transactions across all 3 tiers
  3. run under heavy load with less than 5% CPU overhead
  4. report values per individual transaction rather than just min/max/avg aggregates
  5. include contextual data such as method arguments, database access, remoting calls and exceptions
  6. integrate with their internal and external load-testing services
  7. allow easy hand-off to developers and offline analysis
Requirements for an Application Performance Management Solution

Zappos invited several vendors of Application Performance Management solutions. You can listen to the webinar to hear more details about the selection process and the pros and cons of the individual vendors. In the end, Zappos selected dynaTrace Test Center Edition for server-side performance management, in combination with the free Dynatrace AJAX Edition for their JavaScript/AJAX components, to perform full End-To-End Performance Management of their eCommerce site and “Deliver WOW Performance”.

Continuous APM in Practice @ Zappos

During the initial POC phase – which was done in their performance environment – several performance issues were identified and fixed in the first test runs; for example, identifying a flawed caching strategy boosted cache performance by 12x.

Today – with Dynatrace deployed in their performance lab – every build that enters the lab and is tested with the internal tools or with the testing services from SOASTA is performance-managed by Dynatrace. Every build needs to get “dynaTrace Certified” before it is passed on to the next stage. In case of performance regressions from one build to the next, the guesswork of the past is over: Dynatrace identifies the problematic transactions (PurePaths) and compares them to the transactions from the previous build to highlight the differences. Zappos makes heavy use of the PurePath Comparison feature:

Identifying regressions per individual transaction
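
A drastically simplified sketch of the idea behind such a build-to-build comparison (this is not Dynatrace’s actual diff algorithm, just an illustration): given the per-method execution times captured for the same transaction in two builds, report methods that appeared, disappeared, or slowed down beyond a threshold.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of comparing one transaction's per-method timings
// between two builds: structural differences (new/removed methods) and
// timing regressions above a threshold are reported.
public class BuildComparison {

    static Map<String, String> compare(Map<String, Long> oldBuild,
                                       Map<String, Long> newBuild,
                                       long thresholdMs) {
        Map<String, String> report = new HashMap<>();
        for (Map.Entry<String, Long> e : newBuild.entrySet()) {
            Long before = oldBuild.get(e.getKey());
            if (before == null) {
                report.put(e.getKey(), "new method");              // structural difference
            } else if (e.getValue() - before > thresholdMs) {
                report.put(e.getKey(), "slower by " + (e.getValue() - before) + " ms");
            }
        }
        for (String method : oldBuild.keySet()) {
            if (!newBuild.containsKey(method)) {
                report.put(method, "removed method");               // structural difference
            }
        }
        return report;
    }

    public static void main(String[] args) {
        // Hypothetical method names and timings for two consecutive builds.
        Map<String, Long> build41 = Map.of("CartServlet.doGet", 12L, "PriceCache.lookup", 3L);
        Map<String, Long> build42 = Map.of("CartServlet.doGet", 48L, "PriceCache.rebuild", 30L);
        compare(build41, build42, 10).forEach((m, msg) -> System.out.println(m + ": " + msg));
    }
}
```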

The comparison shows the structural differences (which methods are new or removed, or called more or less frequently) as well as the timing differences (which methods take longer or shorter to execute). Not only does this difference data give great input to developers – every single transaction that is captured also includes additional contextual information such as SQL statements, bind variables, method arguments, return values and exceptions. As Zappos runs on a multi-tier environment, it is essential for them to see the full transaction spanning all their tiers. Dynatrace’s PurePath technology is able to follow transactions across runtime boundaries, tracing remoting calls via Web Services, RMI, .NET Remoting, WCF or Messaging from one runtime to the next. The developer can then look at the full PurePath:

Individual Transaction including contextual Information
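
The cross-tier correlation that PurePath performs automatically through instrumentation can be illustrated with a hand-rolled sketch: the calling tier tags each outgoing request with a correlation id, and the receiving tier records its work under the same id, so all records belonging to one end-to-end transaction can be stitched together later. The header name below is a made-up assumption, and the in-process call stands in for a real remoting call:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Conceptual sketch of following a transaction across tiers via a
// correlation id. A real APM agent does this transparently; here the
// "remote call" is simulated by an ordinary method call carrying headers.
public class CrossTierTrace {

    // Tier 1 (e.g. zeta01): start a transaction and call the next tier.
    static String frontendTier() {
        String correlationId = UUID.randomUUID().toString();
        Map<String, String> headers = new HashMap<>();
        headers.put("X-Correlation-Id", correlationId);   // hypothetical header name
        backendTier(headers);                              // stands in for a remote call
        return correlationId;
    }

    // Tier 2 (e.g. zeta02): a servlet-like handler picks the id back up.
    static String backendTier(Map<String, String> headers) {
        String correlationId = headers.get("X-Correlation-Id");
        // Any timing or context captured here is recorded under the same id.
        System.out.println("[" + correlationId + "] handled request on backend tier");
        return correlationId;
    }

    public static void main(String[] args) {
        System.out.println("transaction id: " + frontendTier());
    }
}
```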

In the above screenshot we see a PurePath that made a synchronous HTTP call from machine zeta01 to zeta02, where the request was handled by a servlet. The HotSpot visualization on the top right, as well as the color coding of the methods in the PurePath tree, indicates which methods contribute the most to this individual transaction. Additional context information such as servlet attributes and execution times can be analyzed to support problem diagnosis and resolution.

Full End-to-End Tracing – The App is more than what happens on the server

Zappos also jumped on the Web 2.0 bandwagon and delivers a better end-user experience by making use of client-side JavaScript/AJAX. This, however, means that performance management must also look at the browser, providing a full End-To-End view of the application. Using the free Dynatrace AJAX Edition, which brings the PurePath technology to the browser, Zappos is able to analyze performance problems in their JavaScript code (slow code and inefficient DOM manipulation), problems with AJAX (too many asynchronous calls), and the problems this new architecture causes on the server (more requests triggered by AJAX).

The following image shows the browser-side analysis with Dynatrace AJAX Edition and the integration with Dynatrace on the server side, which allows Zappos to drill into the actual server-side transaction for each individual network/XmlHttpRequest (XHR):

End-To-End Performance Analysis - from Browser to the Backend

The analysis on the browser side helped Zappos to optimize their JavaScript performance and reduce network roundtrips, resulting in a faster and more interactive end-user experience: delivering WOW Performance.

Best Practices for WOW Performance

Zappos derived several best practices while implementing Continuous Application Performance Management and while running it in their performance lab on a build-to-build basis.

  • You can’t start testing too soon
  • Stop the guesswork -> Give the developers the actionable evidence they need
  • Averages and Aggregates aren’t enough -> You need to see all transactions to find the outliers
  • Don’t settle for simple response time metrics -> Get as granular as possible – down to the method level
  • Find the exceptions!
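
The point about averages and aggregates can be demonstrated with a few lines of Java: two builds can report an identical average response time while one of them hides a severe outlier that only per-transaction data (or a high percentile) reveals. The sample values below are made up for illustration:

```java
import java.util.Arrays;

// Demonstrates why averages hide outliers: both data sets average 100 ms,
// but only per-transaction data (here summarized by a p99) exposes the
// slow transaction in the second set.
public class OutlierDemo {

    static double average(long[] samples) {
        return Arrays.stream(samples).average().orElse(0);
    }

    // Nearest-rank percentile: smallest value such that pct% of samples are at or below it.
    static long percentile(long[] samples, double pct) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(pct / 100.0 * sorted.length);
        return sorted[Math.max(rank - 1, 0)];
    }

    public static void main(String[] args) {
        long[] steady  = {100, 100, 100, 100, 100};  // avg 100 ms, no outliers
        long[] outlier = {40, 40, 40, 40, 340};      // avg 100 ms, one 340 ms outlier
        System.out.printf("steady:  avg=%.0f p99=%d%n", average(steady),  percentile(steady, 99));
        System.out.printf("outlier: avg=%.0f p99=%d%n", average(outlier), percentile(outlier, 99));
    }
}
```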

Check out the full webinar where Kevin and Ryan talk about all their challenges and success in detail.

Andreas Grabner has 20+ years of experience as a software developer, tester and architect and is an advocate for high-performing, cloud-scale applications. He is a regular contributor to the DevOps community, a frequent speaker at technology conferences, and regularly publishes articles. You can follow him on Twitter: @grabnerandi