How Complexity Spilled The Oil
The Gulf oil spill of April 2010 was an unprecedented disaster. The National Oil Spill Commission’s report summary shows that this could have been prevented with the use of better technology. For example, while the Commission agrees that the monitoring systems used on the platform provided the right data, it points out that the solution used relied on engineers to make sense of that data and correlate the right elements to detect anomalies. “More sophisticated, automated alarms and algorithms” could have been used to create meaningful alerts and maybe prevent the explosion. The Commission’s report shows that the reporting systems used have not kept pace with the increased complexity of drilling platforms. Another conclusion is even more disturbing, as it points out that these deficiencies are not uncommon and that other drilling platforms in the Gulf of Mexico face similar challenges.
If we substitute “drilling platform” with “data center,” this sounds awfully familiar. How many IT organizations are relying on relatively simple data collection coming from point monitoring such as network, server, or application while trying to manage the performance and availability of increasingly complex applications? IT operations engineers sift through mountains of data from different sources trying to make sense of what is happening and usually fall short of finding meaningful alerts. The consequences may not be as dire as the Gulf oil spill, but they can still translate into lost productivity and revenue.
The fact that many IT operations have not (yet) faced a meltdown is not a valid counterargument: There is, for example, a good reason to purchase hurricane insurance when one lives in Florida, even though destructive storms are not that common. Like the weather, there are so many variables at play in today’s business services that mere humans can’t be expected to make sense of them all.
If the challenge is real, finding the right solution may not be easy. IT operations have acquired solutions from diverse vendors, mostly as a reaction to perceived issues and uncertainties. Because the data collected comes from diverse sources, it first needs to be “normalized”: The raw data from a monitoring collector must be run through a normalization algorithm to: 1) convert it into a form that can be compared with other data types, and 2) place it in an actual context to determine its dependencies. An example of normalization is to consider a data value in a “period context”: At a given time of the day, on a given day of the year, is the value collected within x% of its “normal” value?
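To make the “period context” idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not how any particular product works: the “normal” value is assumed to be the plain average of past readings taken at the same hour of the day, and the function names, the sample history, and the 10% tolerance are all hypothetical.

```python
from statistics import mean


def baseline_for_period(history, hour):
    """Return the average of past readings taken at the given hour.

    `history` is a list of (hour_of_day, value) pairs. Using the mean as
    the "normal" value is an assumption made for this sketch.
    """
    samples = [value for (h, value) in history if h == hour]
    if not samples:
        return None  # nothing was ever recorded for this period
    return mean(samples)


def within_normal(value, baseline, tolerance_pct):
    """Return True if `value` is within `tolerance_pct` percent of `baseline`."""
    if baseline is None or baseline == 0:
        # No usable baseline: report the reading as abnormal so that
        # someone looks at it rather than silently accepting it.
        return False
    deviation_pct = abs(value - baseline) / abs(baseline) * 100
    return deviation_pct <= tolerance_pct


# Hypothetical readings: (hour of day, metric value)
history = [(9, 100.0), (9, 110.0), (9, 90.0), (14, 300.0), (14, 320.0)]

baseline_9am = baseline_for_period(history, 9)
print(within_normal(105.0, baseline_9am, 10))  # 5% off the 9 a.m. norm
print(within_normal(130.0, baseline_9am, 10))  # 30% off the 9 a.m. norm
```

The same reading of 130 would be alarming at 9 a.m. and unremarkably low at 2 p.m., which is exactly why the raw number means little without its period context.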
There are several solutions on the market that provide normalization and statistical analysis for improving alerts. But for these to be effective, we must also remember that every element of the infrastructure and application has to be instrumented and provide data. Another disaster, the Three Mile Island nuclear power plant failure, can be directly traced to incomplete infrastructure monitoring that led to an incorrect conclusion about the root cause of the problem.
Monitoring is useless if it is not: 1) covering all potential points of failure, and 2) using normalization and statistical analysis to make sense of the data. As the Oil Spill Commission points out, you can’t expect a person to spend hours in front of a screen and detect minute variations that are the warning signs of impending disaster.