Cisco Blogs

Why #FAIL isn’t Failure – Lessons from the Cloud

April 21, 2011

If you were paying attention to the Intertubes or Twitterverse today, you probably heard about an issue at one of the well-known Cloud Computing providers. Needless to say, fingers were being pointed left and right, and all the “experts” came out to share their 20/20 hindsight on the causes (still unknown) and how the outage could have been avoided.

I purposefully avoided any comments about these events because sometimes in life systems go down. If you’ve been in the technology industry long enough, and actually worked in support or operations, you know that even the best designs can have issues. And I’m not ashamed to say that I’ve been the cause of some (temporary) issues with large customer systems. When it happens, it’s not a good day for anyone involved – the operators, their customers, the fat-finger typer or wrong-cable puller, etc.

What dawned on me throughout the day was how many people were labeling this #FAIL. This is the Internet’s new meme anytime something goes slightly differently than planned.

  • Food is slightly overcooked – #FAIL.
  • Plane flight took off a few minutes late – #FAIL.
  • Software upgrade took three clicks instead of two – #FAIL.

This got me thinking about the best advice I’ve ever gotten from a mentor, early in my career. We were talking about the characteristics of the company’s management team and the mentor told me, “The thing that they all share is that they were HUGE failures at one point in their career. And that experience is what’s made them better and the leaders they are today.” Yes, that’s right, back in the day people regarded failure as part of the experiences of life: an activity that, if dealt with properly, could create a learning experience and an opportunity to get better.

But for many people today, it was all about #FAIL and the need to create new fears for people considering (or already deployed on) Cloud Computing. It’s unfortunate because I’d be willing to bet that anyone affected by today’s issues will have learned an incredible lesson and their systems will be better going forward. In fact, one of the greatest characteristics of Cloud Computing (public or private) is the ability to Fail Fast.

When I talk to customers about Cloud Computing, one of the first things we typically talk about is the reality that they will probably (eventually) deliver IT services for their business from a variety of Cloud sources – Private, Public, Commodity, Community. What we try to help them understand is that the principles they consider for their internal systems – Availability, Security, Mobility – also need to be considered for external systems. There aren’t any shortcuts, but there are some ways to apply best practices from across the community (for public and private usage) and to actually get better at failing faster.

The headlines for the next couple of days may look back at today’s events and try to highlight the #FAIL, but strong leaders and smart engineers will find ways to turn those failures into learning for the next steps in the Journey.

P.S. – Kudos to the other Cloud Computing providers in the industry for taking the high-road and not slinging mud at this difficult situation. That showed class. I’m glad to see there was a brotherhood of people that recognize that operating the world’s largest systems is not easy.

To learn more about Cisco’s Cloud Solutions, please visit the Cloud homepage.



  1. Great post Brian. It’s always the lessons learned that are the most important aspect of any service provider outage. In this case it’s the full incident report that will be most insightful.

    How was the issue detected, and how many resources were available at the time? How quickly was the root cause discovered? What actions were taken, on what timeline, and what communication lines were available to customers during the outage? What steps are being taken to prevent the issue from occurring again? How will the network and systems be improved to offer customers the ability to survive a similar outage?

    The service provider affected today needs to publish more guidance to customers on how to architect environments and applications to be more resilient.