A few years back I was the manager of an internal service provider. My job was to deliver services to pharmaceutical business units within my company. We bought services externally as well as creating our own, and sold standards-based functionality internally on a no-profit basis. My main responsibility was Intel-based servers, and the bulk of the services were file and print.

Every month we would have a meeting with all business units where we would account for the services delivered the previous month. This was a few years ago, so I had to deliver 98% availability during business hours, which equated to a maximum of 4 hours of downtime per month.
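
For the curious, the arithmetic behind that downtime budget can be sketched like this. The figure of 200 business hours per month is my assumption here, back-calculated from the 98% target and the 4-hour limit, and the function name is purely illustrative.

```python
def downtime_budget_hours(availability_target: float, business_hours: float) -> float:
    """Return the downtime allowed while still meeting the availability target."""
    return (1.0 - availability_target) * business_hours


# Assumption: roughly 200 business hours in a month
# (for example 20 working days of 10 hours each).
budget = downtime_budget_hours(0.98, 200)
print(f"{budget:.1f} hours of downtime allowed")
```

With those assumed inputs the budget comes out at 4.0 hours, which matches the limit I was held to.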

On Christmas Eve a RAID subsystem decided to convert to what I call “coffee grinder mode” and all the disks began shining an ironically festive bright red light. We had prepared for this, though, so we started swapping in new disks and soon had some volumes back. Then we started the restore from our DAT tapes, only to discover that they were corrupt and held only fragments of recoverable material.

Panic set in, and a consultant and I started inserting tape after tape from our archive while the tape drive whined, spun and stuttered. After a couple of days of missed Christmas dinners and less-than-happy families, we finally had something that could be interpreted as a file system, and we could exhale and open our Christmas gifts, happy to have done our jobs.

Showdown

Shortly thereafter we held our monthly meeting with all business unit representatives as well as my CIO, CTO and CFO. The tension in the room was almost palpable, as everyone knew what we had been doing while they had been celebrating Christmas. Now it was “merely” a question of how bad the damage was, how long the service had been offline and how severe the “punishment” would be.

So I put my charts up on the projector, with one bar for each server the customers were using, and every bar in the chart was well above 98%.

[Chart: availability per server, every bar above 98%]
Look how great we were doing! 😉

I think it was 3 seconds before my main counterpart, the system owner of the biggest business unit (a short, elderly and generally very happy lady), rose from her chair, ears glowing as red as Rudolph’s nose, pointing at me and shouting:

“You are lying!”

Up until then, I had never experienced such an unpleasant moment in my career, and it took a few seconds before I could compose myself and begin to explain.

In short, I had (with the customers’ consent) PINGed all my servers every minute to show that they were available and ready to serve. That was my first mistake: no user does their daily job with PING. PING, as we all know, is based on ICMP, and I have yet to come across an application that uses ICMP for any end-user activity. Secondly, the disaster only hit the RAID volumes that held the users’ files. The system drive itself hadn’t suffered, so the OS kept running the whole time while we were feeding first disk drives into the RAID and then DAT tapes into the tape unit. Because the system drive and the OS were alive the whole time, even though the user drives were gone, every PING was promptly answered and recorded as “Service Available”, which strictly speaking could be seen as the “lie” I was accused of.

Lessons learned

Once the dust settled, this lady taught me a lesson that I have lived by since:
If you don’t know what your users are experiencing, you know next to nothing.

After this showdown, I had to implement a system that measured and mimicked what the users were actually doing, and that was accessing files over the network, not PINGing!
In the end, the customer was actually quite happy to pay the extra charge for implementing this, as it meant they would get a closer understanding of all their users’ needs and performance, as well as a closer dialogue with us, their service provider.
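
To give an idea of what such a user-level check can look like, here is a minimal sketch in Python. It is not the system we built back then. The function names, the probe file naming and the idea of pointing it at a mounted share are all assumptions made for illustration. The probe writes a small file, reads it back and deletes it, which is much closer to what a file and print user does than an ICMP echo.

```python
import contextlib
import os
import tempfile
import time
import uuid


def probe_file_service(directory: str) -> tuple[bool, float]:
    """Write, read back and delete a small file in `directory`.

    Returns (ok, seconds). Any OS-level failure counts as "unavailable".
    Note: the read may be served from a cache, so this checks the file
    service end to end rather than the physical disks.
    """
    payload = uuid.uuid4().hex.encode("ascii")
    # Hypothetical naming scheme for the temporary probe file.
    path = os.path.join(directory, f".probe-{uuid.uuid4().hex}.tmp")
    start = time.monotonic()
    try:
        with open(path, "wb") as handle:
            handle.write(payload)
        with open(path, "rb") as handle:
            ok = handle.read() == payload
    except OSError:
        ok = False
    finally:
        # Clean up even when the read-back failed; ignore a missing file.
        with contextlib.suppress(OSError):
            os.remove(path)
    return ok, time.monotonic() - start


def availability(results: list[bool]) -> float:
    """Percentage of probes that succeeded (0.0 when there are no samples)."""
    return 100.0 * sum(results) / len(results) if results else 0.0


if __name__ == "__main__":
    # Stand-in for a mounted user share; a real check would use the share path.
    with tempfile.TemporaryDirectory() as share:
        ok, seconds = probe_file_service(share)
        print(f"file service ok={ok} in {seconds:.3f}s")
```

Run once a minute against the volumes the users actually work on, a probe like this would have reported failures for as long as the user volumes were gone, even though the server kept answering PING the whole time.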

We as a service provider also started to act differently: we asked what the business needed instead of telling them what they were getting.

The moral of this story is that if you don’t know what your users are experiencing – you are lying!