For the train companies of the United Kingdom, today was a tough one. How’s this for a headline (which we all gazed at during our morning coffee break):

If you haven’t heard, you can read all about how the poor train companies of the UK copped a battering from commuters on social and traditional media over a ticket machine malfunction. I feel for the commuters, but my sympathy today lies with the train companies. Well, except for the ticket collectors who obviously didn’t get the memo and handed out fines to those who boarded without a ticket!

Coverage:

Here’s why it’s so hard to be consistently perfect

Purchasing a single ticket is actually a hyper-complex transaction from an IT point of view.

The complexity comes from all of the following – and this isn’t even the full list (a rough sketch in code follows below):

  1. the front-end software the customer uses to attempt the purchase
  2. the third-party payment gateway that processes the payment
  3. the integration between the machine (or device), the software and the gateway
  4. the security certificate required to make a secure transaction
  5. the credit check application
  6. the hosting environment in which all of this runs
  7. the interconnections between transactions crossing different hosting environments, from the end user and the train station through to the back-end applications.
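
To make that concrete, here’s a minimal sketch in Python – with made-up component names, not any real ticketing system’s API – of how one purchase threads through a few of those layers. A failure in any single call fails the whole transaction at the machine.

```python
"""Illustrative only: hypothetical component names standing in for the
layers listed above."""
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PurchaseResult:
    success: bool
    failed_component: Optional[str] = None


# Stubs standing in for the real systems in the chain.
def front_end_validates_fare() -> None: ...
def tls_certificate_is_valid() -> None: ...
def payment_gateway_charges_card() -> None: ...
def credit_check_passes() -> None: ...
def backend_records_sale() -> None: ...


STEPS: list[tuple[str, Callable[[], None]]] = [
    ("front-end software", front_end_validates_fare),
    ("security certificate", tls_certificate_is_valid),
    ("payment gateway", payment_gateway_charges_card),
    ("credit check application", credit_check_passes),
    ("hosting environment", backend_records_sale),
]


def purchase_single_ticket() -> PurchaseResult:
    for component, step in STEPS:
        try:
            step()  # each call crosses a different system boundary
        except Exception:
            # Any one broken link fails the whole purchase at the machine.
            return PurchaseResult(False, failed_component=component)
    return PurchaseResult(True)
```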

Yet all the customer cares about is their experience at the machine – it needs to be perfect, or if there’s a problem, it needs to be resolved in seconds…so they can board their train on time.

Not like this:

So what happened today in the UK?

We may never find out for sure what happened, but from our experience monitoring millions, if not billions, of transactions a day, there are three common areas where problems arise. When it comes to IT complexity, rapid release cycles and digital experience, problems typically centre on:

Human error – Oops did I do that? 

Software needs updating. Software updates are mostly written by humans, and when multiple humans are working together, it’s not uncommon for mistakes to be made. Even if you have the most stringent pre-production testing, issues can still arise once you push to production because you can never accurately replicate what software will do in the wild.

As our champion devops guru Andreas Grabner always preaches in his talks – #failfast. If the issue relates to a change that was made, roll it back, fast.
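
As a hedged illustration of that #failfast idea – the health endpoint, commands and tooling below are hypothetical, not a prescription for any particular deployment pipeline – a release can be gated by a smoke check that triggers an immediate rollback:

```python
"""A minimal fail-fast sketch: deploy, run a smoke check, roll back at once
if the check fails. URL and commands are placeholders."""
import subprocess
import urllib.request

HEALTH_URL = "https://ticketing.example.com/health"  # hypothetical endpoint


def release_is_healthy(timeout: float = 5.0) -> bool:
    """Return True only if the post-deploy smoke check responds with 200."""
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False


def deploy_and_guard(deploy_cmd: list[str], rollback_cmd: list[str]) -> None:
    subprocess.run(deploy_cmd, check=True)
    if not release_is_healthy():
        # Fail fast: if the change broke something, roll it back immediately.
        subprocess.run(rollback_cmd, check=True)


# Example usage (commands are placeholders):
# deploy_and_guard(["./deploy.sh", "v2.3.1"], ["./deploy.sh", "v2.3.0"])
```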

In this case, with today’s outage, I doubt it was a software update in the core operating system. I’d expect it was a third-party failure, which incidentally might have had its own update. But more on this in point 3.

Security

Not one to speculate on, but obviously when a software failure causes mass disruption to people, it would be fairly normal to assume that some sort of planned security attack was behind it. But again, I doubt it.

Delivery chain failure

The most likely cause of the train machine failure is simply a failure somewhere in the digital delivery chain. Considering that a single transaction today runs across 82 different technologies – devices, networks, third-party software applications, hosting environments and operating systems – it doesn’t take much for a single failure to cause a complete outage. Understanding where that failure is, so that you can quickly resolve it, is critical. Referencing what I said in point 1, it’s probable that a simple update to any one of these 82 technologies caused a break in the chain. Or maybe one of these third parties had an outage of their own.

And that’s where AI comes in.

This is why you need AI-powered application monitoring, with the ability to see the entire transaction across every single one of these technologies. And not just across the transaction: you also need the ability to go deep, from the end-point machine to the host infrastructure, the line of code, and the interconnections between all the services and processes. It’s the only way you can identify the root cause of the problem – in minutes, not hours or days.
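
As a greatly simplified sketch of that idea – the span fields, services and hosts below are illustrative, not any monitoring product’s real data model – finding the root cause amounts to walking the failing transaction end to end and surfacing the deepest failing hop:

```python
"""Toy root-cause walk over spans collected across the delivery chain."""
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    service: str              # e.g. "ticket-machine-ui", "payment-gateway"
    host: str
    depth: int                # how deep in the call chain this hop sits
    error: Optional[str] = None


def likely_root_cause(trace: list[Span]) -> Optional[Span]:
    """The deepest failing span is the best root-cause candidate."""
    failing = [s for s in trace if s.error]
    return max(failing, key=lambda s: s.depth, default=None)


# Hypothetical trace for one failed purchase.
trace = [
    Span("ticket-machine-ui", "kiosk-042", depth=0, error="purchase failed"),
    Span("booking-api", "api-host-7", depth=1, error="upstream timeout"),
    Span("payment-gateway", "gw-host-3", depth=2, error="TLS handshake failed"),
    Span("fare-database", "db-host-1", depth=2),
]

cause = likely_root_cause(trace)
if cause:
    print(f"Root cause candidate: {cause.service} on {cause.host}: {cause.error}")
```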

The days of eyeballing charts and having war-room discussions with IT teams are definitely over. Software rules our lives, and it simply cannot fail. Otherwise, digital businesses face a day like this on social media:

What if the machine fixed the machine?

With the ability to see the immediate root cause of a problem, it’s not improbable for the machine to learn how to course-correct itself – in the same way that, when servers are overloaded, a load balancer can direct traffic to an under-utilised host. So if you can detect an issue in the delivery chain, the machine can go about self-correcting with an alternative path. If the payment gateway fails, for instance, it could automatically redirect to a new hosted payment gateway. Our chief technical strategist Alois Reitbauer demoed just this scenario (ok, a simpler version) at Perform 2017. So it’s not that far off.
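
Here’s a toy sketch of that self-correcting path, assuming a hypothetical primary and fallback payment gateway (the URLs are made up): if the primary charge fails, the machine re-routes the transaction through the alternative gateway instead of giving up.

```python
"""Toy failover sketch: try the primary payment gateway, fall back to an
alternative one if it fails. Gateway URLs are invented for illustration."""
import urllib.error
import urllib.request

GATEWAYS = [
    "https://primary-payments.example.com/charge",   # hypothetical primary
    "https://fallback-payments.example.com/charge",  # hypothetical fallback
]


def charge(url: str, payload: bytes, timeout: float = 3.0) -> bool:
    """Attempt a single charge; report failure instead of raising."""
    try:
        req = urllib.request.Request(url, data=payload, method="POST")
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False


def charge_with_failover(payload: bytes) -> bool:
    for gateway in GATEWAYS:
        # Detect the broken link in the chain and re-route around it.
        if charge(gateway, payload):
            return True
    return False  # every path failed; surface the outage instead of hiding it
```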