British Airways' IT System Crash

It's just the tip of the iceberg for companies that have underestimated the complexity of today's technical systems.

By Mike Johnson, 1/2/2018

Experts are predicting that BA’s recent IT system failure may cost the airline in excess of £100 million. IT managers at companies around the globe that run complex IT systems will no doubt be taking note of the BA crash and will want to ensure that a similar failure does not hit their own infrastructure.

BA’s IT system had apparently crashed six times since it was released. In any good engineer’s mind, this would set alarm bells ringing: repeated system failures indicate that designing for Reliability, Availability and Maintainability (RAM) was not particularly high on the priority list during the system’s design and development phase. Why was it not a high priority? Quite possibly because the complexity of the whole system was so overwhelming that, during development, the best engineers were working daily just to overcome the difficulties of interfacing multiple sub-systems from numerous sources, running on different platforms, with hugely varying interdependencies.
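To make the "A" in RAM concrete, here is a minimal sketch of the standard steady-state availability calculation, and of what a second, redundant site buys you. The function names and all of the figures are hypothetical and chosen purely for illustration; they are not BA's numbers. The redundancy formula also assumes that the two sites fail independently, which is exactly the assumption a shared power event would break.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time a system is up.

    mtbf_hours: mean time between failures
    mttr_hours: mean time to repair
    """
    return mtbf_hours / (mtbf_hours + mttr_hours)


def redundant_availability(single: float, copies: int) -> float:
    """Availability of `copies` units where any one unit is sufficient.

    Assumes the units fail independently (no common-cause failures).
    """
    return 1.0 - (1.0 - single) ** copies


# Hypothetical figures, for illustration only.
single_site = availability(mtbf_hours=2000.0, mttr_hours=8.0)
site_pair = redundant_availability(single_site, copies=2)

print(f"single site:      {single_site:.5f}")
print(f"independent pair: {site_pair:.7f}")
```

The point of the exercise is that the improvement from a second site only materialises if the two sites really are independent, and that is a design decision that has to be made early.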

BA is still investigating the crash, but early sources suggest that responsibility for the failure lay with the powering of data centres and the outsourcing of IT services to India. The loss of power to a data centre, whether through a power surge (as has been suggested) or otherwise, is a reasonably likely event. If this was what brought down the entire IT system, then the blame should be allocated to the design of a system with inadequate provisions for responding to such a predictable failure, rather than to the power loss itself. How the event could be linked to outsourcing to India is rather harder to fathom. Updates to the system may have become more time-consuming due to the increase in organisational complexity, but reliability is an inherent property of the in-service system and is unlikely to be linked to that organisational change.

What’s remarkable is that the system didn’t degrade gracefully; the entire system crashed. This raises questions about the modularity of such a system. Is it simply the case that there’s a monolith in one data centre being backed up by another monolith in an adjacent data centre? If so, then it can also be concluded that maintenance and future modifications of this system will be far from ideal.
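As a rough illustration of what failing gracefully means in a modular system, the sketch below isolates each sub-system so that one failure is reported without taking the others down. The service names (`check_in`, `baggage`, `booking`) and the helper `run_with_degradation` are hypothetical; this is not a description of BA's architecture, only of the general idea.

```python
from typing import Callable, Dict


def run_with_degradation(services: Dict[str, Callable[[], str]]) -> Dict[str, str]:
    """Call each service in isolation; a failure in one does not stop the rest."""
    results: Dict[str, str] = {}
    for name, call in services.items():
        try:
            results[name] = call()
        except Exception as exc:
            # Record the failure and carry on with the remaining services.
            results[name] = f"unavailable ({exc})"
    return results


# Hypothetical sub-systems, for illustration only.
def check_in() -> str:
    return "ok"


def baggage() -> str:
    raise RuntimeError("power loss in data centre")


def booking() -> str:
    return "ok"


status = run_with_degradation(
    {"check_in": check_in, "baggage": baggage, "booking": booking}
)
for name, state in status.items():
    print(f"{name}: {state}")
```

In a monolith there is no such boundary to catch the failure at, which is why the whole system goes down together.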

In addition to the £100 million+ lost due to this IT failure, BA will no doubt need to spend considerable sums on getting to the root cause of the problem and preventing it from happening again. Of course, the system of interest is already so heavily constrained that it will be very difficult (and expensive) to implement an effective solution at the in-service stage of the lifecycle. Perhaps only an extra million, spent during the design and development phase, could have averted this whole issue.

I’ve taught systems engineering for over five years. An observation I made early on was that, no matter how much you emphasise the importance of minimising complexity in a system development, when it comes to making actual design trade-offs in the workshop, the decisions so often fall in favour of the more complex solutions!

Why is complexity so often underestimated by the human race? One factor is that human beings are inherently optimistic when it comes to engineering: often, if a technical risk can’t be identified, it is assumed that there are no risks! Other times, technical risks are simply not prioritised over other project risks. It is also a lot easier to learn through experience, by making mistakes and then reacting to them. Of course, as educational as this is, it gets very expensive, and that expense can be avoided if failures are predicted and mitigated.

There is hope, though. As the World Economic Forum’s recent report on the future of jobs identifies, complex problem-solving is increasingly becoming a core skill: “With regard to the overall scale of demand for various skills in 2020, more than one third (36%) of all jobs across all industries are expected by our respondents to require complex problem-solving as one of their core skills…” (see http://reports.weforum.org/future-of-jobs-2016/skills-stability/ for full details).

The complexity of systems in today’s world is increasing to levels where the success of competing products will be determined by an enterprise’s ability to apply complex problem-solving to its technical developments.

A simple lesson for all future complex system developments is to adopt a mantra: key requirements, such as those relating to RAM, cannot be bolted on after the system of interest is in service. They need to be prioritised early in the design phase to have any significant impact on in-service RAM performance. In addition, complexity should never be underestimated. As Dave Snowden eloquently points out, “it’s far more beneficial to assume that a problem is complex and be proved wrong than to assume the problem is simple and for the opposite to be true!”
