The Art of DevOps Part III – Staging Grounds

In this four-part blog series, I present DevOps best practices through a metaphor inspired by the famous ancient Chinese military treatise “The Art of War”, traditionally dated to around the 5th century BC. It is worth remembering that Sun Tzu, like me, considered war a necessary evil that must be avoided whenever possible. What are we fighting for here? Ultimately, we’re fighting to deliver the absolute best services and features to our customers as quickly as we possibly can, and to eliminate the “War Room” scenario we are all so familiar with.

When we last left off, the troops had deployed the build from the Islands of Development to the Staging Grounds. The environments have been outfitted with extensive application monitoring tools and prepared for testing that exposes the battle readiness of the build. Ultimately each build must be exhaustively tested to help ferret out issues that regularly infiltrate user scenarios, UI, high concurrent use, performance, security, disaster recovery, processes, infrastructure and any other function that affects successful operations.


Uncovering issues in the build is the primary objective in the Staging Grounds. Risk analysis must be performed during requirements gathering at the start of each sprint, assessing how each addition or change in the sprint will affect the overall operation. Because not all changes are created equal, the level of risk each change poses will determine:

  • What should be tested
  • The priority in which to build the automated test cases
  • How comprehensive the testing should be around the change
  • Data that may be needed to exercise the identified risks
  • Configuration change effects
  • Number of possibilities to consider when testing the build. This may include but is not limited to:
    • Geographical placement of the tests
    • Consumers such as:
      • Desktop browsers
      • Native Mobile Applications
      • Thick clients
      • Browser versions
    • Business Continuity & Disaster Recovery Scenarios
    • Login users, roles and rights
    • Workflow paths
    • Operating Systems
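As a rough illustration of how the list above can turn into a prioritized set of test runs, here is a minimal Python sketch. The dimension values and the risk weights are hypothetical placeholders; in practice they would come out of the sprint's risk analysis. It enumerates every combination and orders the combinations so the riskiest ones are automated and executed first.

```python
from itertools import product

# Hypothetical dimensions drawn from the list above; the values are illustrative.
DIMENSIONS = {
    "consumer": ["desktop-browser", "native-mobile", "thick-client"],
    "region": ["us-east", "eu-west"],
    "role": ["admin", "standard-user"],
}

# Hypothetical risk weights (1 = low, 3 = high) assigned during risk analysis.
RISK = {
    "desktop-browser": 2,
    "native-mobile": 3,
    "thick-client": 1,
    "us-east": 1,
    "eu-west": 2,
    "admin": 3,
    "standard-user": 1,
}


def build_test_matrix(dimensions, risk):
    """Return every combination of dimension values, highest total risk first."""
    names = list(dimensions)
    combos = [dict(zip(names, values)) for values in product(*dimensions.values())]
    # sorted() is stable, so combinations with equal risk keep enumeration order.
    return sorted(combos, key=lambda c: -sum(risk.get(v, 0) for v in c.values()))


matrix = build_test_matrix(DIMENSIONS, RISK)
for combo in matrix[:3]:
    print(combo)
```

Even three small dimensions already produce twelve combinations, which is why the level of risk, rather than completeness, should decide where the automation effort goes.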


Constructing automated tests can be challenging, time consuming and expensive. While automation in testing adds to the defenses, it also introduces new complexities. Automated test cases must be maintained and updated; therefore, each one should be scrutinized for its overall value and return on investment. Automation in testing acts as a supplemental guard to manual testing and will usually evolve over time. A solid feedback loop between operations and development is crucial to improving the automated tests within the delivery pipeline. Ideally the automation should decrease the time it takes to identify an issue and free the troops to focus manual test cases on more obscure and valuable scenarios. As a best practice, all identified test cases should be tracked and managed in a test case manager such as Zephyr, Silk Central or TestRail, to name just a few. Moreover, the system that is chosen must integrate with the requirements tracking tool. This maintains traceability by linking the test cases with the requirements throughout the lifecycle. For example, Zephyr is a test case manager that runs as an add-on inside Jira.
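To make the traceability idea concrete, here is a minimal Python sketch. It does not use the API of Zephyr, Jira or any other tool; the `TrackedCase` class, the `untested_requirements` function and the identifiers such as `REQ-1` are all hypothetical. It only shows the kind of question a linked test case manager lets you answer: which requirements have no test case at all?

```python
from dataclasses import dataclass, field


@dataclass
class TrackedCase:
    """A test case linked to the requirements it exercises (hypothetical model)."""
    case_id: str
    requirement_ids: list = field(default_factory=list)
    automated: bool = False


def untested_requirements(requirement_ids, cases):
    """Return the requirements that no tracked test case is linked to."""
    covered = {rid for case in cases for rid in case.requirement_ids}
    return [rid for rid in requirement_ids if rid not in covered]


requirements = ["REQ-1", "REQ-2", "REQ-3"]
cases = [
    TrackedCase("TC-1", ["REQ-1"], automated=True),
    TrackedCase("TC-2", ["REQ-1", "REQ-3"]),
]
gaps = untested_requirements(requirements, cases)
print(gaps)
```

The same links can be walked in the other direction, so that a failed test case immediately identifies the requirements that are at risk.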

As for the staging environments, the infrastructure should imitate the infrastructure found in the Operational Battlefield as closely as possible. This ensures that all testing performed is relevant and that test results accurately predict what to expect in operations. Where it is effective, troops should also establish new virtual machines from baseline images and apply version-controlled configuration templates, executed by the same configuration management tools found in the Islands of Development. Several types of performance tests are then performed in the Staging Grounds.

A host of software options is available for our arsenal, affording many of the automation functions mentioned above. A few commonly embraced tools are Selenium, LoadRunner and Visual Studio Test. Larger and far more comprehensive testing for web sites can be executed using platform services such as Dynatrace Synthetic Monitoring, a global network of machines that provides most of the testing variations reviewed earlier. Batteries of tests can be authored in this SaaS environment and directed to hit the Staging Grounds with a vengeance. A global simulation emulates a vast number of end users located across the world, using a diverse set of desktop configurations and mobile devices. Traffic can be ramped up to introduce high volumes of concurrent users and observe accurate real-world effects on the staged build. Combining these automated testing techniques with a highly focused set of manual tests offers the optimal atmosphere for uncovering issues.
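The idea of ramping up concurrent users can be sketched locally in a few lines of Python. This is not how Dynatrace Synthetic Monitoring or LoadRunner work internally; it is a minimal illustration using only the standard library. The `fake_request` function is a hypothetical stand-in for a real call against the staged build, and the step sizes are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request(user_id):
    """Stand-in for a real request to the staged build (hypothetical)."""
    start = time.perf_counter()
    time.sleep(0.001)  # simulate a short server round trip
    return time.perf_counter() - start


def run_ramp(request_fn, steps):
    """Run request_fn at each concurrency level in steps; return per-step stats."""
    report = []
    for users in steps:
        # One worker thread per simulated user at this step.
        with ThreadPoolExecutor(max_workers=users) as pool:
            durations = list(pool.map(request_fn, range(users)))
        report.append({
            "users": users,
            "requests": len(durations),
            "max_seconds": max(durations),
        })
    return report


report = run_ramp(fake_request, steps=[5, 10, 20])
for step in report:
    print(step)
```

Watching how the slowest response changes from step to step is a simple way to see where the staged build starts to strain under concurrent use.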

The second objective is to quickly pinpoint where problems lie and so lower the MTTR (Mean Time to Resolution). This capability was introduced when the Staging Grounds were armed with the same advanced performance management tools that the troops implemented in the Islands of Development, and it affords us a major strategic advantage. Using tools like Dynatrace Application Monitoring, troops gain insight well beyond the typical high-level pass/fail test case reports. A comprehensive performance management tool reveals the end-to-end call stacks associated with the failed test cases, along with the monitored processes and machine states at the time of the issue. Problems are exposed quickly, showing the problem tier, host, process, component and class, down to method or event level. This detail supports build-to-build comparisons, providing troops with deltas at either a macro or a very granular level. Positive or negative fluctuations in key performance indicators can be reviewed at the test case level, e.g. UpdateAccount. Comprehensive solutions like this put KPIs on our radar along with degradation or improvement analysis against prior builds and accumulated baselines, all of which can be visualized per test case. Significant measurements may include but are not limited to:

  • Response times and durations (including background asynchronous timings)
  • Overall database counts as well as individual SELECT, UPDATE, DELETE counts
  • Error Counts
  • Exception Counts
  • CPU timings
  • Significant baseline deviations
  • Selected method execution timings
  • SLA violations
  • Remote call counts
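A build-to-build comparison over measurements like these can be sketched as follows. This is a minimal Python illustration, not the method of any particular monitoring product. The metric names, the sample values and the 10% tolerance are assumptions, and every metric is assumed to be "lower is better".

```python
def compare_builds(baseline, candidate, tolerance=0.10):
    """Return {(test_case, metric): relative_increase} for regressed metrics.

    Assumes lower values are better for every metric. A metric that rises
    from a zero baseline is reported with an increase of infinity.
    """
    regressions = {}
    for test_case, metrics in candidate.items():
        for name, value in metrics.items():
            base = baseline.get(test_case, {}).get(name)
            if base is None:
                continue  # no baseline recorded for this metric yet
            if base == 0:
                if value > 0:
                    regressions[(test_case, name)] = float("inf")
                continue
            change = (value - base) / base
            if change > tolerance:
                regressions[(test_case, name)] = round(change, 3)
    return regressions


# Hypothetical per-test-case KPIs for the prior build and the staged build.
baseline = {"UpdateAccount": {"response_ms": 200, "sql_selects": 10, "exceptions": 0}}
candidate = {"UpdateAccount": {"response_ms": 260, "sql_selects": 10, "exceptions": 2}}

regressions = compare_builds(baseline, candidate)
print(regressions)
```

Keeping the result keyed by test case and metric makes it easy to route each regression to the troops responsible for that part of the build.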

The third major objective here is to form masterful views and communication processes that accurately reflect the current state and provide broad situational awareness. This is key to eliminating the need for War Rooms.

The first line of defense is automatic alerting that signals dangers in the build, immediately points out where problems are, aborts build promotions, automatically adds change requests to the team's work queue and much more. It’s this type of information, integrated with automatic processes, that forms a superior line of defense and shortens the time it takes to promote changes into the Operational Battlefield.
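A promotion gate of this kind can be sketched in a few lines of Python. The alert fields (`severity`, `test_case`, `message`), the severity names and the `GateResult` type are hypothetical; a real pipeline would take them from its monitoring and ticketing tools.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    promote: bool          # True when the build may move on to operations
    change_requests: list  # one entry per blocking alert, for the work queue


def evaluate_gate(alerts, blocking_severities=("critical", "high")):
    """Decide whether a build may be promoted, given a list of alert dicts."""
    blocking = [a for a in alerts if a["severity"] in blocking_severities]
    requests = [f"Fix {a['test_case']}: {a['message']}" for a in blocking]
    return GateResult(promote=not blocking, change_requests=requests)


alerts = [
    {"severity": "high", "test_case": "UpdateAccount", "message": "SLA violation"},
    {"severity": "low", "test_case": "Login", "message": "minor baseline deviation"},
]
result = evaluate_gate(alerts)
print(result)
```

Because the gate returns both the decision and the change requests, aborting the promotion and filling the work queue happen in the same automated step.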

A second line of defense is constructing an accurate picture of the build state so that its health is fully transparent to every level of command. This is accomplished by creating consistent dashboards that offer clear-cut windows into the environment. Everyone throughout the lifecycle should be able to communicate using these shared dashboards, viewing and analyzing the same metrics.

The third is the ability to isolate an issue down to a specific test case, metric or even an end-to-end enterprise call stack, and to package up that intelligence and send it directly to the development troops responsible for analysis or change. This is what separates a modern method of reducing MTTR from an antiquated or deficient one.

Implementing all of these competencies will greatly increase confidence in securing a safe build and promoting it into the Operational Battlefield. So until next time, we leave the Staging Grounds, deploy to operations and I will meet you again on the frontlines.

Brett Hofer is as passionate about DevOps as he is about music and art. Specializing in delivering complex mission critical software under methodologies such as Agile, Lean and Waterfall (to name a few), his success at managing and delivering projects with complex technical and political challenges is almost legendary. More than twenty years of broad software/IT experience—from product designer and solution architect to senior management—has given him a unique 360° perspective on IT that has earned the respect of customers and peers alike. Tweet him at @brett_solarch