In this 4-part blog series, I am presenting DevOps best practices using a metaphor inspired by the famous 6th century BC Chinese manuscript “The Art of War”. It is worth remembering that Sun Tzu, just like me, considered war a necessary evil that must be avoided whenever possible. What are we fighting for here? Ultimately, we’re fighting to deliver the absolute best services and features to our customers as quickly as we possibly can, and to eliminate the “War Room” scenario we are all so familiar with.
Preparation, preparation, preparation. Everything up to this point has been preparation for operational use. This is where the world of users converges on our environments, exercising everything that’s been created and potentially pushing our assets to unexpected extremes. One of the main DevOps strategies is to expect problems by establishing laser-accurate detection and forging rapid resolution techniques throughout the entire environment. Therefore, the operational battlefield must possess a comprehensive monitoring strategy.
Victory in this battle is measured by how happy our consumers and business commanders are with the results. However, even with all of our efforts, the operational battlefield is typically laden with problems and issues. When problems do arise, it is how quickly the troops are able to identify, prioritize, communicate and resolve them that determines victory.
Situational Awareness and Identification
To identify risks and issues, the troops must be completely aware of all aspects of the battlefield. This means they must be aware not only of real-time situations but also of upcoming situational changes, both internal and external. Newly identified challenges found here should end up in the pipeline, ultimately forming the feedback loop.
Optimal situational awareness is accomplished through a standardized platform of information, a virtualized command center of sorts. It may be viewed centrally or distributed. It becomes key to eliminating the traditional War Room and may well be one of the most underrated and overlooked collections of resources in the arsenal. An optimal information center tracks all vital infrastructure and software changes in the pipeline, major business events and the metrics to be expected, as well as what is actively occurring in the field. Navigating the sea of information that is generated requires all-hands participation, meaning the operations and development troops always work closely together in every environment to monitor and manage risks.
Because so much can be going on at one time, dashboard and monitoring tactics should be carefully planned out and treated as a critical part of the overall development process. The more visibility troops have into the current operational goals and challenges, the more opportunities they have to solve them together. As in the other environments, the current health state of the production environment must be clearly depicted. Ideally, health-state dashboards should be designed and constructed early in the lifecycle. Designing dashboard stories and scenarios will in turn aid in determining instrumentation and data gathering strategies.
It’s very typical for the information to be spread across multiple tools. However, every attempt should be made to consolidate the tools while ensuring statuses and metrics remain consistent between roles and environments. For example, the monitoring tools used to depict state and health should be the same across development, stage and production. Below is a list of intelligence that should be readily accessible through mash-ups and/or well-organized shortcuts:
- Active User & User Experience Metrics
  - Active Sessions
  - Abandon Rates
  - Number of Logins
  - User or Tenant frustration levels
  - User Experience vs Competitors User Experience
- Hardware and Infrastructure Health
  - Active Infrastructure State
  - Process Health
  - Disaster Recovery State
  - Errors and Alerts
- Application and Transactional Activity
  - Baselines & Deviations
  - Errors and Alerts
  - Synthetic Traffic Reports and Status
- Delivery Pipeline Statuses
  - Recent changes
  - Upcoming changes & fix list
- Business Related Metrics & Health
  - Conversion Rates if Applicable
  - Business & High Traffic Events
- Risks and Issues Board
  - Security Details & Risks
- Scheduled Outages & Dependency Board
  - 3rd Party Dependency Health
  - Enterprise Service statuses
  - External Systems & Web Services
- Resource Links Board
  - Contact numbers
  - Project Plans & Documents
  - Up to date Design Documents
  - Procedure Documents
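To make the idea of a clearly depicted health state a bit more concrete, here is a minimal Python sketch of how individual signals from the list above could be rolled up into one status per board. The names `Health`, `Signal`, `roll_up` and `overall` are my own illustration, not part of any particular monitoring product, and the "worst status wins" rule is just one reasonable assumption.

```python
from dataclasses import dataclass
from enum import IntEnum


class Health(IntEnum):
    """Ordered so that a larger value means a worse state."""
    OK = 0
    WARNING = 1
    CRITICAL = 2


@dataclass(frozen=True)
class Signal:
    board: str      # hypothetical grouping, e.g. "Hardware and Infrastructure Health"
    name: str       # e.g. "Process Health"
    health: Health


def roll_up(signals):
    """Return the worst health seen on each board (worst status wins)."""
    worst = {}
    for signal in signals:
        current = worst.get(signal.board, Health.OK)
        worst[signal.board] = max(current, signal.health)
    return worst


def overall(signals):
    """Worst health across every board; OK when there are no signals."""
    return max((signal.health for signal in signals), default=Health.OK)
```

Feeding the same roll-up from development, stage and production is one way to keep statuses consistent between environments, since every role reads the same three states.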
For public web site battles, identifying issues through synthetic traffic ensures that operations maintains insight into challenges that might otherwise go unreported. Solutions such as Dynatrace Synthetic Monitoring and Keynote play important roles in that campaign.
Finally, even operational dashboard building should be agile. Since it’s unrealistic to think that all dashboards will be prepared for operations ahead of time, the tools should offer a wealth of ad-hoc capabilities that enable troops to quickly build insight into the environment and later incorporate that information into the ever-evolving command center views.
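For readers who have never seen a synthetic check, the core idea can be sketched in a few lines of Python using only the standard library. This is a rough illustration and not how the commercial products work internally: the `probe` function, the `ProbeResult` type and the 2-second threshold are all assumptions of mine.

```python
import time
import urllib.request
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProbeResult:
    url: str
    ok: bool
    status: Optional[int]
    seconds: float
    error: Optional[str] = None


def http_fetch(url, timeout):
    """Default fetcher: issue a GET and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def probe(url, fetch=http_fetch, timeout=5.0, max_seconds=2.0):
    """Run one synthetic request and judge it on status and elapsed time.

    `fetch` is injectable so the check can be exercised without a network.
    """
    start = time.monotonic()
    try:
        status = fetch(url, timeout)
    except Exception as exc:  # any failure (timeout, DNS, HTTP error) is a failed probe
        return ProbeResult(url, False, None, time.monotonic() - start, str(exc))
    elapsed = time.monotonic() - start
    ok = 200 <= status < 300 and elapsed <= max_seconds
    return ProbeResult(url, ok, status, elapsed)
```

Run on a schedule against the pages users care about most, the results of a check like this would feed the "Synthetic Traffic Reports and Status" item on the board.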
Pick your battles. As the age-old saying goes, “You can win a battle but lose the war.” Make sure the troops are directed to the key challenges: users and the business. Operations can have a big influence on these challenges. For example, Trooper Joe has been diligently picking off two query problems over the last few days, but those problems affect less than a quarter of a percent of the users weekly. When development and operations work closely together, proper prioritization greatly increases the value of each sprint and of the changes within the delivery pipeline. It’s imperative that high-value assets are consistently delivered to the battlefield.
Communication is absolutely key to winning. Moreover, communication must happen throughout the cycle, using the same tools. If the delivery pipeline is maintained in one tool, then that same tool should be viewed and collaborated on by all members. The same goes for the monitoring tools. Not only will the metrics be consistent throughout, but training on the tools and sharing information become natural, reusable and uniform.
Now, it’s one thing to have the data simply presented on a display and another thing to make the information actionable. Optimally, troops should be able to identify a problem on a board and then drill down into further details to immediately begin root cause analysis or further action. Therefore, effective tools will not only allow deep-dive analysis but also present various perspectives on the same issue. Also, common communication practices and procedures must be established for handling each class of issue. For example:
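The Trooper Joe situation is easy to spot once the impact is put into numbers. Below is a small Python sketch of ranking issues by the share of weekly users they touch. The `Issue` fields, the business-critical flag and the sample figures in the usage note are hypothetical; a real backlog would draw them from the monitoring tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    title: str
    users_affected: int              # distinct users hit during the week
    business_critical: bool = False  # hypothetical flag set by the business


def impact_share(issue, weekly_users):
    """Fraction of weekly users affected by the issue."""
    if weekly_users <= 0:
        raise ValueError("weekly_users must be positive")
    return issue.users_affected / weekly_users


def prioritize(issues, weekly_users):
    """Business-critical issues first, then largest share of users affected."""
    return sorted(
        issues,
        key=lambda issue: (not issue.business_critical,
                           -impact_share(issue, weekly_users)),
    )
```

With 40,000 weekly users, a query problem hitting 90 of them (under a quarter of a percent) would sort well below a checkout error hitting 6,000, which is exactly the conversation development and operations should be having before the next sprint.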
- Server is down: identify the server name and data center, then contact…
- Application A is experiencing high response times: drill down and try to determine specifics, such as whether it happens only when requests hit the fourth node in the application server cluster, then contact…
- “Verify Credit Card” calls are failing some of the time: powerful monitoring tools will enable troops to find the recorded transactional call stacks that fail and package them up for analysis.
Ultimately, problems get to the proper troops faster, creating a greater likelihood of resolution. Again, this is another key benefit of Continuous Delivery and DevOps: find and resolve issues quickly, because they are going to happen.
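As an illustration of the "fourth node" drill-down mentioned above, here is a hedged Python sketch that groups response times by node and flags any node whose median is far above the rest of the cluster. The function names and the factor of 3 are assumptions for the example; real monitoring tools do this with far richer data.

```python
from collections import defaultdict
from statistics import median


def median_by_node(samples):
    """samples: iterable of (node_name, response_seconds) pairs."""
    grouped = defaultdict(list)
    for node, seconds in samples:
        grouped[node].append(seconds)
    return {node: median(values) for node, values in grouped.items()}


def slow_nodes(samples, factor=3.0):
    """Nodes whose median response time exceeds `factor` times the
    median of all per-node medians (an assumed, simple outlier rule)."""
    per_node = median_by_node(samples)
    if not per_node:
        return []
    baseline = median(per_node.values())
    return sorted(node for node, value in per_node.items()
                  if value > factor * baseline)
```

If the only name that comes back is the fourth node, the troops know whom to contact and what to tell them, rather than reporting that "Application A is slow".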
It’s been a pleasure writing this series and disclosing some of the major insights and observations I’ve collected over the years. I’ve enjoyed sharing some of the key points of DevOps, which has incidentally become the cultural destination for many organizations. Hopefully, you’ve gained valuable intel during this series to aid in fighting the battles more effectively while embracing diplomatic collaborations between development and operations. So, until next time, stay strong, push onwards and upwards. Peace out!