When something goes wrong, who’s to blame? In this post I take a closer look at who is responsible when applications have performance problems. Interestingly, the question of who is responsible is very often treated as at least as important as how to solve the problem. So that we do not have to search for the responsible person every time, I aim to find a general rule for who is responsible for performance problems.
Let’s start by looking at all the people who are involved in the software delivery process and investigate whether they are responsible for a performance problem.
The Developer
Obviously the developer must be responsible, as he has written the code that is not performing well. OK – problem solved. But wait, this cannot be true. First of all, developers write well-performing code. They are very good at what they do. Also, all their testing showed that everything was working fine, so something must have gone wrong later in the lifecycle. Furthermore, if it was a developer error, it was not in this developer’s code but in another developer’s code. Our developer has always had this gut feeling that the service he was calling was somehow slow. He would have built that application differently anyway – it was just this architect guy who decided it had to be that way.
The Architect
So it is the architect who is responsible. He designed this well-performing architecture that supposedly scales infinitely. However, it obviously does not. The reason cannot be the architecture, because it is undoubtedly great. It uses all the hyped technologies and the coolest framework for just about everything. If he had coded the application himself, everything would be perfect. However, he just did not have the time to write the code along with everything else. As always, the developers did not implement the architecture correctly. He printed all those cool diagrams (even in color) and those devs still screwed it up. Obviously the head of R&D cannot get his people to do the simplest thing on earth – build the application as it was designed.
The Head of R&D
The head of R&D screwed it up. He had everything he needed but he did not get the job done. He could not manage his developers well enough for them to get their job done. We even sent him to a Scrum seminar – a pretty expensive one – and this did not help either. We shortened release cycles, got more flexible with requirements, and he still cannot get the job done. However, he does not feel responsible at all. He shipped a new iteration release every week. The testing guys needed weeks to provide feedback, based on a codebase that was anything but up to date. The feedback that “tests failed at 300 users” was not terribly helpful either. In the end he had to task three developers with adding debug output and holding the testers’ hands. Ultimately his guys did the job better themselves. Who needs testers if the hard work is ultimately done by development?
The Tester
Obviously the tester is responsible. How could we not see this from the beginning? He should test whether the application performs as expected. If he had tested properly, there would not be any problems. Why do we give the testers all these expensive testing tools and hardware if the outcome is just something like “we have a problem”? Ask the tester, however, and he does not feel responsible at all. He wrote all the test scripts and ran the tests. He does not know the code, so how should he know what went wrong? If the developers know exactly what kind of information they need, why didn’t they simply put the proper tracing into the code? He would then simply send back the logs and they would fix it – life should be so simple. Instead they come over and block the testing infrastructure to find the problem. In the end it is the tester who gets blamed when he cannot finish testing in time. Well, this is pretty tough if some developers occupy your infrastructure and you cannot do your job. Ultimately everything worked out anyway – all the tests ran perfectly – so something must have gone wrong in production.
The Operations Guy
Why didn’t we search for the responsible party closest to where the problem occurs? Everyone else got it right in the end, and then operations is unable to run the application. All they have to do is keep everything up and running. We even wrote a whole run book for them. Where is the problem? Well, that’s hard to know; and when there is a problem and you need information, they behave as if handing over logs and other data were a matter of national security. If we just got access to that production system, we would fix it right away. Talking to the operations guy paints a different picture. All he is doing is running the application as he was told. If the development team gives operations the wrong information, that is not their fault. They try to keep the service available by all means possible. Even more important, they have not written the code, and they are damn sure that this is an application problem and not an infrastructure problem. And yes … they love to send around log files, change debug levels, and restart the server every couple of hours. That’s why they started to work in IT and this is also their ultimate destiny. So if the developers would just …
… wait! It seems like we are getting back to where we started. In the end everybody is and is not responsible at the same time. Nobody did anything wrong, yet everyone has reasons to blame everyone else for why things did not work out.
What is the conclusion? Well, following the famous German writer Bertolt Brecht, there is no conclusion. It is up to the audience – you – to make up your mind. However, as always, finding the responsible person has never solved any problem. In my next post I will look at the process of bringing an application into production and at where performance management is done incorrectly, leading to production problems.
If you want to read more articles like this, visit the dynaTrace 2010 Application Performance Almanac.