Offshore Oil Rig Accident Lessons Spill into IT
As the environmental disaster resulting from the British Petroleum (BP) oil spill unfolds in the Gulf of Mexico, I am reminded of common problems in the information security industry. Granted, the scale and potential impact of the oil accident are thankfully hard to duplicate in the world of technology: thousands of barrels of crude spew each day from a hole in the ocean floor a mile below the surface, endangering sensitive coastline ecosystems from Texas to the Florida panhandle. A bad patch release or a data breach can cost a company millions of dollars and engineers their jobs, but in most cases, lives are not at risk. A lack of “circuit breakers” regulating stock exchange trading may lead to the evaporation of close to a trillion dollars in minutes, but trading can be halted and plugs can be pulled. Not so with this oil spill, where despite BP’s best efforts to contain the disaster, extensive and probably long-lasting damage to the Gulf Coast and losses to the fishing and tourism industries appear unavoidable.
Reading over Congressional testimony and media analysis of the spill, I see several mistakes that have analogues in the world of technology. They are mistakes that highly trained, intelligent people, who should know better, still make. They should look familiar to us in the high-tech world. Here are five:
1. Just because you can doesn’t mean you should.
In fast-paced, competitive technical fields, engineers may feel pressure to produce novel solutions on tight schedules. The Deepwater Horizon disaster demonstrates the extent to which global demand for oil has pushed energy companies to drill in increasingly dangerous, difficult environments, using ever more advanced and unproven technologies. It is not news that problems in a system increase dramatically with the complexity of the system. Moreover, an inability to fully test a complex system before deploying it, whether because of time, physical, or financial limitations, is probably a recipe for trouble. The principle of Occam’s Razor, under which the simplest solution is generally the best one, may be a useful rule of thumb when designing solutions, particularly for mission-critical systems.
2. Two is one and one is none.
This is a common-sense saying that a wilderness survival enthusiast once told me. Bring more than one knife, and more water than you think you will need. Know more than one way out of your hotel, and have more than one route to get to work. The oil rig designers are probably now regretting that they did not build more robust redundancy into their blowout prevention systems. IT professionals are unlikely to regret backing up data more frequently than absolutely necessary and storing it in more than one physical location. When creating business resiliency plans, it is common sense to have more than one solution to the most likely problems, and for critical systems, to have key tools on hand and ready to deploy.
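For readers who want to see what "two is one and one is none" might look like in a backup script, here is a minimal sketch in Python. It is an illustration only, not a complete backup tool: the function names `sha256_of` and `backup_to_all` are hypothetical, and the destination directories are assumed to be mount points for storage in separate physical locations. The sketch refuses to run with a single destination, verifies every copy against a checksum of the original, and keeps going if one location fails.

```python
from __future__ import annotations

import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_to_all(source: Path, destinations: list[Path]) -> list[Path]:
    """Copy source into every destination directory and verify each copy.

    Hypothetical helper for illustration. Raises ValueError when fewer
    than two destinations are given, and RuntimeError when fewer than
    two copies pass verification.
    """
    if len(destinations) < 2:
        raise ValueError("need at least two backup destinations")
    expected = sha256_of(source)
    verified = []
    for directory in destinations:
        try:
            directory.mkdir(parents=True, exist_ok=True)
            copy = Path(shutil.copy2(source, directory / source.name))
            # A copy only counts once its checksum matches the original.
            if sha256_of(copy) == expected:
                verified.append(copy)
        except OSError:
            # One failed location must not stop the remaining copies.
            continue
    if len(verified) < 2:
        raise RuntimeError("fewer than two verified copies")
    return verified
```

The point of the design is that a single surviving copy is treated as a failure rather than a success, so the operator hears about lost redundancy before it is needed.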
3. Short cuts can get you into deep water.
Reports are emerging that suggest corners were cut on safety on the oil rig in the lead-up to the accident. The blowout preventer had been modified, emergency cut-off valves had leaky hydraulics, and at least one had a dead battery, according to several reports. Engineers on tight deadlines may be tempted to take short cuts, bend rules, or downplay known problems in order to get the job done. The enormous expense and productivity loss involved in taking down working systems for crisis testing may give planners reason to delay or rationalize. This may be particularly true when the economy is uncertain and workers feel insecure about their jobs. In the case of the oil rig disaster, technicians apparently ignored conflicting pipe pressure test results, which indicated a problem. Peer review, objective oversight, and other time-honored best practices may be helpful in avoiding these traps.
4. Accident containment can keep a bad problem from becoming a disaster.
In complex operations, mistakes will be made and accidents will happen. In fact, entire schools of theory are built around so-called system accidents. In the case of the oil rig disaster, engineers had taken basic safety precautions, drawn up disaster plans, and installed backup systems, but when the low-probability, high-impact scenario arrived and the wound started gushing blood, there was no tourniquet on hand. Risk models relied on the blowout preventer functioning effectively. A month later, the well is still gushing oil into the Gulf of Mexico.
5. All those security precautions are there for a reason.
Experience shows that, in sophisticated systems, various elements interact with each other in complex and often unpredictable ways. Sadly, many of the safety devices we see every day, such as smoke detectors, seat belts, and brake lights, were only standardized after hard experience proved their necessity. As a dangerous bubble of methane burped from the ocean bottom toward BP’s oil rig, a succession of safety devices, including a blowout preventer, cement plugs, and a huge wall of mud, each failed in turn to hold it back. In a situation where a cascading system failure brings down an operation, there may be plenty of time to blame regulators for lack of oversight after the fact, but the anvil will fall hardest on the person with his hand on the switch.