Like many others, we are trying to wrap our heads around the recent British Airways outage, an event so far-reaching and arguably avoidable that it’s difficult to believe such a thing can happen — yet it did. While our aim is not to criticize BA, this event provides some good lessons for everyone. It’s a reminder that bad things can happen, even to a good organization. You need to be aware of the risks to your own technology and business and defend against them before they harm your business and your customers.
As a rough estimate, BA will suffer direct losses of US$20 million to $25 million (75,000 passengers at an average revenue per passenger of about $300).[i] Three days of missed bookings amount to a potential additional $105 million loss, to say nothing of the reputational damage and other indirect losses. It might take the airline a few quarters to recover fully. Public memory is short, and the beleaguered traveler is forgiving, but a three-day no-show is extreme. BA execs will get to the root-cause analysis soon, but the event (and historical failures at airlines in general) provides a bonanza of lessons for execs everywhere who want to better equip their organizations to handle such exigencies.
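For readers who want to check the arithmetic, here is a minimal sketch that simply restates the figures quoted above. The variable and function names are illustrative, and summing the two numbers into a single total is our own simplification, not a figure from BA.

```python
# Back-of-the-envelope restatement of the estimate in the text.
# All inputs are the rough figures quoted above, not audited numbers.

AFFECTED_PASSENGERS = 75_000
AVG_REVENUE_PER_PASSENGER_USD = 300      # approximate, per the text
MISSED_BOOKINGS_USD = 105_000_000        # three days of missed bookings, per the text


def direct_loss(passengers: int, revenue_per_passenger: int) -> int:
    """Direct revenue lost on passengers who could not fly."""
    return passengers * revenue_per_passenger


direct = direct_loss(AFFECTED_PASSENGERS, AVG_REVENUE_PER_PASSENGER_USD)
total = direct + MISSED_BOOKINGS_USD     # simplification: just adds the two figures

print(f"Direct loss:     ${direct / 1e6:.1f} million")
print(f"Missed bookings: ${MISSED_BOOKINGS_USD / 1e6:.1f} million")
print(f"Combined:        ${total / 1e6:.1f} million")
```

The direct figure lands at $22.5 million, the midpoint of the $20 million to $25 million range, and none of this accounts for reputational or other indirect losses.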
Here’s what you should do:
1. Perform a comprehensive risk assessment and business impact analysis. Assessing risk is the first step in risk mitigation and recovery planning. A comprehensive risk assessment and business impact analysis is key, because many single points of failure will fly under the radar unless risk assessment professionals examine them critically. Were BA’s electrical systems effectively a single point of failure? That’s hard to digest. Clients often ask me, “When does the risk assessment end?” It doesn’t; it’s an ongoing exercise that’s simply part of doing business. Executives must keep abreast of the latest risks facing their business. Other areas of the business (e.g., finance, HR) are relentless about risk assessment and analysis. As every business is now a digital business, tech leaders must be even more on board than those sister organizations.
2. Identify faults, isolate them, and contain them before they spread like wildfire. We live in a world with a mesh of services so tangled that a single failure in one vulnerable system spreads from system to system, causing a cascading wave of failures. The BA case is particularly nasty, but it’s not unique. Faults must be caught early, before they spread. Good architecture and design are the real answers, but fault identification is critical to isolation and containment. Whether it’s technology infrastructure, networks, or electrical systems, I always recommend that clients develop their own chaos monkey[ii] for at least their key infrastructure, if not all of it. It does not surprise me that airlines, which continue to use archaic, brittle technologies, are this failure-prone.
3. Automate recovery. The British Airways episode just confirms that firms of all sizes continue to suffer from a lack of automated recovery procedures. While electrical systems are portrayed as the culprits, the ultimate effect was that technology services were unavailable. It’s always puzzling why important companies like airlines are not able to swing disaster recovery sites into action. Remember the marketers’ cliché: Services are built and delivered at the click of a button. Reality suggests otherwise, as it did with BA. The failure indicates that either DR services existed only as a checkbox capability (or did not exist at all!) or the recovery plans were not tested well enough for executives to be confident that they could be instantiated. For a company the size of BA in a critical infrastructure industry, failover to alternate systems must be automatic and transparent.
4. Build redundancies. This should be obvious, but apparently it needs repeating. The bottom line for me: When a company becomes completely dependent on technology, resiliency should be near, if not at, the top of the CIO’s priority list. Building strong, resilient infrastructure costs big bucks. Management doesn’t want to burden the company’s budgets with investments in building redundancy — but they’re inevitably shocked by the whopping damage disruption causes. Direct passenger compensation alone will exceed BA’s predicted cost savings from not implementing enough redundancies.
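To make points 2 through 4 concrete, here is a minimal sketch of a chaos-monkey-style check. It is a toy simulation, not any real tool: the `Instance` and `Group` classes, the group names, and the idea of modeling a power feed as a "group" are all hypothetical. It knocks out one random instance per group and reports which groups have no healthy instance left to take over.

```python
import random
from dataclasses import dataclass, field
from typing import List


@dataclass
class Instance:
    name: str
    healthy: bool = True


@dataclass
class Group:
    """A set of interchangeable instances that back one service (hypothetical model)."""
    name: str
    instances: List[Instance] = field(default_factory=list)

    def is_available(self) -> bool:
        # The group survives as long as at least one instance is healthy.
        return any(instance.healthy for instance in self.instances)


def inject_failure(group: Group, rng: random.Random) -> Instance:
    """Mark one randomly chosen healthy instance in the group as failed."""
    victim = rng.choice([i for i in group.instances if i.healthy])
    victim.healthy = False
    return victim


def find_single_points_of_failure(groups: List[Group], rng: random.Random) -> List[str]:
    """Fail one instance per group; return the groups that go down as a result."""
    exposed = []
    for group in groups:
        victim = inject_failure(group, rng)
        if not group.is_available():
            exposed.append(group.name)
        victim.healthy = True  # restore before testing the next group
    return exposed


# Hypothetical inventory: one redundant service, one without a backup.
inventory = [
    Group("booking", [Instance("booking-a"), Instance("booking-b")]),
    Group("power-feed", [Instance("ups-1")]),
]
exposed_groups = find_single_points_of_failure(inventory, random.Random(0))
print("Groups with no redundancy:", exposed_groups)
```

In a real environment the failure injection would act on live systems and the availability check would be an actual health probe; the point of the sketch is only that the test is cheap to automate and can run continuously.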
Building the capabilities listed above is tough but not impossible. All it takes is rigor and discipline, much like a fitness program. Improvement is hard, because the inertia of established processes and incumbent technology is difficult to overcome. It’s painful initially, but it gets easier with time. I understand that building a resilient, dependable technology organization takes a lot; hence, we are currently researching the best practices and latest technologies that will help firms like yours do just that. Be on the lookout for our upcoming report on the topic: Design For Dependability.
If you have any input or questions, I would be glad to hear from you.
[i] British Airways financial results for the year ending December 31, 2016.
[ii] “Chaos monkey” is a service that identifies groups of systems and randomly injects failures by terminating one or more of the systems in a group.