The complexity required for robustness often goes against robustness

In the past few months we have seen major outages from United Airlines, the NYSE, and the Wall Street Journal. With almost 5,000 flights grounded and the NYSE halting trading, the cost of failure is high. When bad things happen, IT personnel everywhere look to increase fault tolerance by adding redundancy mechanisms or protocols. Unfortunately, the complexity that comes with these additional layers often brings compromises of its own.

The last thing your boss wants to hear is, “The network is down!” Obviously it’s your job to prevent that from happening, but at what cost? Many of us enjoy twisting those nerd knobs, but that tends only to foster an environment with unique problems. I too fear the current trend of adding layer after layer of network duct tape to add robustness, or worse, to try to fix shortcomings in applications. NAT, PBR, GRE, VXLAN, OTV, LISP, SDN… where does it end!?

The greater the complexity of failover, the greater the risk of failure. We often forget the lessons of our mentors, but keeping the network as simple as possible has always been best practice. As Dijkstra said, “Simplicity is a great virtue but it requires hard work to achieve it and education to appreciate it. And to make matters worse: complexity sells better.” This is a fundamental design principle that is often overlooked by enthusiastic network engineers or, even worse, by sales or marketing engineers who are trained to sell ideas that only work in PowerPoint. When planning out your latest and greatest network design, each and every knob that you tweak puts you further into uncharted territory. While it may work for now, you’ll be the only one running anything close to those features in unison. And when, not if, you have to call TAC, they have to understand the fundamental design of the network BEFORE they can troubleshoot it. Validated designs exist for a reason.
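
To put rough numbers on that first sentence, here is a minimal back-of-the-envelope sketch in Python. It is not a real availability model: the function name, the parameters, and every figure below are hypothetical, and it assumes failures are independent. It treats a redundant pair as "the primary works, or the primary fails and failover succeeds and the backup works", then multiplies by the availability of the failover machinery itself, since that machinery sits in series with everything else.

```python
def pair_availability(device: float, failover_success: float, layer: float) -> float:
    """Estimate availability of a redundant pair behind a failover mechanism.

    device:           availability of each device (0..1), assumed identical
    failover_success: probability that failover actually works when needed
    layer:            availability of the added failover machinery itself,
                      modeled as a component in series with the pair
    All values are illustrative assumptions; failures are assumed independent.
    """
    # Primary is up, OR primary is down AND failover works AND backup is up.
    pair = device + (1 - device) * failover_success * device
    # The failover layer is in series: if it breaks, the whole service breaks.
    return layer * pair


single_device = 0.99  # hypothetical baseline with no redundancy at all

ideal = pair_availability(0.99, failover_success=1.0, layer=1.0)
realistic = pair_availability(0.99, failover_success=0.9, layer=0.99)

print(f"single device:               {single_device:.5f}")
print(f"pair, flawless failover:     {ideal:.5f}")
print(f"pair, fragile failover layer: {realistic:.5f}")
```

With these made-up inputs, the pair with a fragile failover layer ends up slightly less available than the single device it was meant to protect. The exact figures do not matter; the point is that the added layer only helps if it is more reliable than the thing it is protecting.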

At this point I would encourage you to read up on a couple of infamous network outages, including Beth Israel Deaconess Medical Center in Boston, whose spanning tree problem took the network down for four days, and the story of how the IT Division of the Food and Agriculture Organization of the United Nations recently had a rather serious, but brief, four-hour outage…

While both of these outages were simple in nature, the complexity of the growing network was key in causing the failure. A lack of continuous design, periodic review, and, most importantly, failover testing inherently nurtures failure.