I was honored to do a DevOps Handbook Lessons Learned webinar with the DevOps “Godfather” Gene Kim earlier this month. In preparation for it I not only started reading his new DevOps Handbook, but also revisited the main messages of his previous DevOps book – The Phoenix Project. Gene (and his co-authors) talk about the Three Ways, which represent a maturity path for organizations: from a rigid, slow and manual value chain toward an automated, high-quality continuous delivery model with feedback loops everywhere in the value delivery chain, which ultimately encourages a culture of innovation and experimentation.
The Phoenix Project talks a lot about removing process- or people-related bottlenecks to achieve “The First Way” – which is all about increasing the flow of work through the pipeline:
- If it takes eight weeks to get a new physical server, think about Infrastructure as Code and automatically deploying into virtual or cloud environments
- If it takes three people to sign off paperwork to purchase a $50 item, think about trusting employees and removing that sign-off process for items under a certain dollar amount
- If your NOC team only meets once a week for change requests that are manually executed the following weekend, think about how to automate deployments and rollbacks to increase velocity
The new DevOps Handbook sheds more light on the “best practices” that allowed organizations such as Google, Etsy, Verizon or Capital One to increase the flow of the software they push through the Continuous Delivery Pipeline. In our webinar I used the following slide to highlight the steps I consider most important to ensure a constant flow of high-quality value. It starts with removing your existing code and architectural bottlenecks, which you have to do before even thinking about modernizing with concepts such as “Monolith to Microservices”, “Migration to Containers or PaaS” or “Shift-Left Performance”:
Fix your Bottlenecks! Or can’t you, because fear holds you back?
Alright Andi! Let’s fix some bottlenecks then. Let’s schedule time in the next Sprint and just get it done so we can finally start with our DevOps Transformation!
Sounds easy – right? It could be! But unfortunately many engineering teams are overwhelmed by this task because they are faced with a reality that looks like this:
- They inherited code from engineers who left the company
- The code is poorly written, poorly documented and depends on third-party libraries that haven’t been maintained or updated in a while
- There are not enough tests to ensure code changes don’t break anything
- There is not enough time for code analysis, nor are the engineers who wrote that code still available
- They do not know which code actually runs and is used in production, which would be helpful for prioritizing
- Not all of them have architectural skills or know how to use performance and scalability tools to identify and fix hotspots and bottlenecks
This situation leads to “fear of change” instead of “Continuous Innovation and Optimization”. Gene Kim presented the same situation in our webinar. He told the story of the Google Web Service Team back in 2005. This team had no automated testing in place for code that was rather complex but key to Google’s success. Developers were scared to push changes to google.com because they feared they would break the search engine. They changed their engineering culture and demanded automated tests that ensured code changes wouldn’t break key functionality in production. With that in place they could start removing bottlenecks with the confidence that they were not opening new ones or breaking the software! The Google Web Service Team has since matured into one of the fastest-moving teams in the organization!
What else besides Tests does it take?
I am a big proponent of automated testing. I also started my career in that space: as a tester working on a testing tool. That is why I beg you to start creating tests that you can execute on a regular basis.
But tests alone don’t get rid of your bottlenecks. Tests will ensure that you didn’t break anything. The question is: where do you start fixing if you still have no insight into 100% of the code you are responsible for? Where are the bottlenecks right now? How do the individual modules, classes and functions depend on each other? Does the code behave differently in my local dev, test or production environment? Might we have data- or environment-driven problems I cannot find on my local workstation? Will I just keep breaking tests now because I still don’t know what I am doing?
Let me take the fear from you by showing you how I’ve been helping hundreds of engineers over the last few years to identify their bottlenecks. And thanks to their willingness to share their data and code with me, I wrote a series of blog posts – especially one that explains how to identify the most common problem patterns that lead to bottlenecks by looking at metrics.
The Manual Way of Finding Code Bottlenecks
If you look at my most common problem patterns, I always start by looking at metrics such as # of SQL Statements executed, # of Web Service Calls made, # of Threads Utilized, # of Tiers Involved, etc.
If you have tools that can give you this data then great – that’s a good start. I see many tech-savvy folks using open source or dev tools to get this data for a single application tier or container. So make sure to capture these metrics while running your tests.
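If you have no tooling at all, even a hand-rolled counter around your data access layer surfaces these metrics. A minimal sketch – `InstrumentedDb` and `FakeDb` are hypothetical names standing in for your real driver – that makes an N+1 query pattern visible as a number:

```python
from collections import Counter

class InstrumentedDb:
    """Hypothetical wrapper that counts every query issued via the real db layer."""
    def __init__(self, real_db):
        self.real_db = real_db
        self.metrics = Counter()

    def execute(self, sql, *params):
        self.metrics["sql_statements"] += 1      # the metric we care about
        return self.real_db.execute(sql, *params)

class FakeDb:
    """Stand-in for a real database driver in this sketch."""
    def execute(self, sql, *params):
        return []

db = InstrumentedDb(FakeDb())

# a classic N+1 pattern: one query per user instead of one query for all users
for user_id in range(25):
    db.execute("SELECT * FROM orders WHERE user_id = ?", user_id)

print(db.metrics["sql_statements"])  # 25 statements for a single page load
```

Seeing “25 SQL statements for one request” in a test run is often all it takes to spot the pattern before it ever reaches production.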
While running your load test against your system you will crank up the load until you see your app no longer responding fast enough. You will most likely see CPU, memory, threading or I/O related bottlenecks. I wrote several blog posts about using Key Performance Metrics for Load Testing.
The challenge now is to answer questions like: Who is consuming all that memory? Why are so many threads blocked? Why are so many threads in sync? Who is making all these SQL queries? Why do I have millions of log entries?
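Answering those questions starts with per-request measurements rather than averages. A minimal sketch – `handle_request` is a hypothetical stand-in for your system under test – of cranking up concurrency and watching the 95th-percentile response time:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def handle_request(n):
    """Hypothetical request handler standing in for the system under test."""
    time.sleep(0.001)
    return n * 2

def run_load(concurrency, requests):
    """Fire `requests` calls with `concurrency` workers and return the p95 latency."""
    latencies = []
    def timed(n):
        start = time.perf_counter()
        handle_request(n)
        latencies.append(time.perf_counter() - start)  # list.append is thread-safe
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(timed, range(requests)))
    latencies.sort()
    return latencies[int(0.95 * len(latencies))]

# crank up the load step by step and watch how the p95 response time develops
for concurrency in (1, 5, 25):
    print(concurrency, round(run_load(concurrency, 50), 4))
```

When the p95 curve bends upward as concurrency grows, that is your cue to ask the who/why questions above – which is where single-number metrics stop and tracing has to start.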
Most teams typically resort to analyzing log files or manually correlating application metrics captured in the different tiers to figure out why there is a problem. Some also use simpler APM tools that give details for those transactions the tools believe are slow and problematic. These tools often correlate data based on timestamps only – which makes it almost impossible to really analyze most of these problem patterns, because they require a true end-to-end transactional trace view. And because a memory leak or an exhausted connection pool is most often caused by a logically unrelated transaction, you also need every piece of data captured all the time. Otherwise I, too, would be afraid of making code changes based on half-facts and guesswork.
If you don’t want to spend too much time manually correlating and analyzing data, and guessing about what happened in those transactions you didn’t get to see, then I suggest you keep reading to learn why I believe you need 100% True End-to-End Data.
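The core idea behind end-to-end transactional tracing can be sketched in a few lines – this illustrates the concept only, not the PurePath implementation: propagate one transaction ID through every tier, so captured data is stitched together by ID instead of by timestamp guessing. All tier and operation names here are hypothetical:

```python
import uuid

TRACE_LOG = []  # in a real system each tier would ship its records to a collector

def record(trace_id, tier, operation):
    TRACE_LOG.append({"trace": trace_id, "tier": tier, "op": operation})

def web_tier(request):
    # reuse an incoming trace id, or start a new transaction
    trace_id = request.get("trace_id") or str(uuid.uuid4())
    record(trace_id, "web", "GET /search")
    service_tier(trace_id)      # the id travels with every downstream call
    return trace_id

def service_tier(trace_id):
    record(trace_id, "service", "searchProducts")
    database_tier(trace_id)

def database_tier(trace_id):
    record(trace_id, "db", "SELECT * FROM products")

trace_id = web_tier({})
# every tier of this one transaction can now be correlated exactly, by id
transaction = [e for e in TRACE_LOG if e["trace"] == trace_id]
print([e["tier"] for e in transaction])  # ['web', 'service', 'db']
```

In real systems the ID rides along in an HTTP header or message property; the principle of stitching by ID rather than by timestamp is what makes the trace trustworthy.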
Fixing Bottlenecks without Fear requires 100% True End-to-End Data
I want to start with a big THANK YOU. All of this was only possible thanks to the hundreds of Dynatrace AppMon & UEM users who leveraged my Share Your PurePath program, and to our Dynatrace engineers who worked hard on automating the manual analytics that I described in Automatic Problem Detection with Dynatrace. It is based on our truly unique 100% End-to-End Transactional Tracing PurePath Technology. These two capabilities are paramount for successful removal of bottlenecks – and here is why:
Unless you have a single-tier monolith application, the reality is that you have multiple tiers/components/services involved when executing a request. Monitoring tools that show you hotspots on a single tier are a good start – but they can lead you to waste time and effort fixing a problem in an area that is actually not your problem!
Why? Because trying to optimize a single tier without knowing the role it plays in the end-to-end value stream is not the right way to “remove your bottleneck”. Just as explained by Klaus in “Saving MIPS”, where it turned out that a change in a Java component caused duplicated calls into the Mainframe. Trying to optimize the Mainframe code alone would not have solved this problem.
This is why you need to understand the full end-to-end flow of transactions: from your End Users (Mobile, Desktop, External REST Users) into your “New Stack” (Node.js, Docker, PaaS, Spring Boot, …) all the way through your “Enterprise Stack” (WebSphere, Mainframe, Oracle).
As explained above, bottlenecks are typically caused by bad coding or architectural mistakes. While bottlenecks typically manifest themselves through failures or high response times, the root cause is most often not where the problem is observed.
For instance: bad coding of a background job that consumes all connections to your database, or transfers too much data over the network, will impact your critical end user transactions. If you just get the details on those slow transactions you don’t really know the root cause. Trying to fix the transactions that show the symptom is, again, a waste of time – and will not solve the actual problem.
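This background-job scenario is easy to reproduce in miniature. A toy sketch (all names hypothetical) of a connection pool exhausted by an unrelated job, causing the user-facing transaction to fail even though its own code is perfectly fine:

```python
import threading

class ConnectionPool:
    """Toy fixed-size pool guarded by a lock."""
    def __init__(self, size):
        self.available = size
        self.lock = threading.Lock()

    def acquire(self):
        with self.lock:
            if self.available == 0:
                raise RuntimeError("pool exhausted")
            self.available -= 1

    def release(self):
        with self.lock:
            self.available += 1

pool = ConnectionPool(size=5)

# a badly coded background job grabs every connection and never releases them
for _ in range(5):
    pool.acquire()

# the user-facing transaction now fails, yet its own code is not the root cause
try:
    pool.acquire()
    result = "ok"
except RuntimeError as err:
    result = str(err)
print(result)  # pool exhausted
```

Any monitoring that only shows you the failing user transaction points at the symptom; only tracing every transaction, including the background job, exposes who actually drained the pool.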
Bottlenecks also don’t always show up in your pre-prod or dev environment, as these environments have different load scenarios than production.
To really identify the root cause of bottlenecks it is therefore necessary to have 100% coverage of all transactions. Only this reveals the real root cause and shows you how fragile and monolithic your application code really is. Because if an independent feature can impact your critical feature, you definitely have too many dependent components!
Automated Bottleneck Detection with Dynatrace
Instead of manually collecting and correlating data from different tools that only give you a partial view, simply follow my YouTube tutorials, blog posts or the step-by-step guides on our AppMon Trial community website. Once you have it installed and it captures data from your application, you can go ahead and analyze the bottlenecks that Dynatrace automatically identified. The beauty of it is: it also detects bottlenecks that have not yet impacted anybody because you haven’t put the application under load or into production. The “bad code” or “bad architecture” is, however, already in your application. Thanks to the 100% End-to-End PurePath we can detect these patterns even in your local test environment. Here is a screenshot from one of my Share Your PurePath users who, instead of using my service to help him, simply used the new automatic pattern detection feature.
This feature not only puts me out of a job, it also enables everyone to do the work that I’ve done over the last couple of years! It automates work that should be automated, so that engineers can focus on really fixing these bottlenecks and have more time to start innovating! Welcome to the first step in your DevOps Transformation Journey.