We’ve come a long way as an industry in the last 10 years. As I travel to #KubeCon in Austin, I’m reflecting on what has changed.
10 years ago, I ran an independent research group called the IT Process Institute, and was lead researcher on a study designed to identify change, configuration, and release best practices. I had the privilege of working with @realgenekim and personally interviewing IT ops teams from a dozen companies recognized for their exemplary results. From the interviews, we created hypotheses about which practices enabled the highest levels of performance. We then collected data from 250 companies to test whether those practices correlated with higher performance across a broad industry sample.
Back Then – People were Breaking Things
Change was the biggest cause of system failures. Applications were hard-wired to their environments. Systems were reaching a point of complexity where a single person didn’t have the knowledge to understand the impact of a simple change. And people were responsible for making changes – changes made by people up and down the stack often had unintended consequences. As a result, we used change advisory boards, a forward schedule of change, release engineers, and a CMDB to help document dependencies. Change management was a major ITIL process implemented to help gain control. Controls made sure people followed processes, and helped reduce the chaos of managing brittle, finicky systems.
The general approach to a successful code release was to test changes in a pre-production environment that was “sufficiently similar” to production, in order to verify that changes worked before rollout.
Changes to deployed systems – in response to a change request or a service-impacting incident – often left production systems in an unknown state. That created additional service quality and security/compliance risk. As a result, we IT professionals collectively shot ourselves in the foot over and over again.
Pinnacle of the Slow and Careful Era
As an example of exemplary practice, consider one organization where the whole IT org’s bonus was tied to downtime (think of the IT group that ran a US stock exchange):
- Rollouts – including environment and application changes – were documented in a runbook. They practiced and timed the rollouts in a pre-production environment. They knew what should happen, and how long it should take.
- Rollbacks – were documented in a runbook, practiced, and timed.
- Scheduled changes – during nightly maintenance windows. If the rollout wasn’t successful by a pre-set time, they would trigger rollback. A task that didn’t match the runbook also triggered rollback.
- Devs were banned from production – but there was a “break glass” process where developers could fix production in an emergency. And someone from Ops literally looked over their shoulder and wrote down everything they did.
A key question of that time was how much money to spend on building and maintaining a redundant, underutilized, “sufficiently similar” pre-production environment in which to pre-test changes and ensure success.
Digital Eats “Slow and Careful” for Lunch
The “slow and careful” era had an inherent conflict built in. Everyone knew that slowing down improved results. A careful and cautious approach improved uptime, security, and compliance for complex systems. However, that approach turned out to be wholly inadequate. As Marc Andreessen observed, software is eating the world. The Lean Startup with its minimum viable products, and new digital business models (Uber, Airbnb), all relied on getting new products and features into users’ hands faster, not slower.
Looking back at my interview notes from 10 years ago, I asked everyone, “What metrics do you use to measure success?” Everyone measured uptime and change success rate. Nobody measured frequency of change, or the time between a change request and a completed change.
Along Comes Kubernetes
At the same time I was conducting this research, Google was building Borg, the first unified container management system. Their second iteration was called Omega. Both remain proprietary. Their third version of this system is called Kubernetes, and they launched it as an open source project to share their new and powerful way of doing things, and to help drive usage of their infrastructure-as-a-service offering, Google Cloud Platform.
Kubernetes is a container orchestration system. But more importantly, Kubernetes codifies a new way of doing things that wasn’t even aspirational in the “slow and careful” era. Kubernetes changes how you build, deploy, and manage applications – it is “built for purpose” to meet the needs of the digital era.
Velocity is the New Metric of Choice
In the digital era, feature velocity replaces uptime and change success rate as the defining operational metric.
Slow and careful IT – with its focus on uptime – doesn’t support digital business models that need new features to attract users. Fast and careless dev – which produces unusable or unavailable applications – drives users away.
Velocity, as a measure, combines the two. It measures the number of features you can ship while maintaining quality of service. Kubernetes and its ecosystem tools give you what you need to move quickly while maintaining quality.
@kelseyhightower, Brendan Burns, and Joe Beda explain in “Kubernetes: Up and Running” that there are three core concepts baked into Kubernetes that enable velocity. Based on my look back, they represent a 180-degree shift from the best practices of the slow and careful era.
- Immutability – Once an artifact is created, it is not changed by users. The antipattern: changing something in a running container, or in an application deployed via a container. It is better to create a new container and redeploy than for a human to make a change to a deployed system. This supports a blue/green release process. There is no runbook rollback. There is no “break glass” process for people making changes to deployed systems.
- Declarative configuration – Kubernetes objects define the desired state of the system, and Kubernetes makes sure the actual state matches the desired state. There is no runbook with a documented series of steps to take. Configuration does not need to be executed to be understood; its impact is declared.
- Self-healing – Kubernetes includes a controller-manager that continuously takes action to make sure the current state matches the desired state. People don’t repair systems (i.e., make changes) via mitigation steps performed in response to an alert or change request. Kubernetes consistently and repeatedly takes action to ensure the current state matches the desired state.
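The declarative, self-healing pattern described above can be sketched as a simple reconciliation loop. This is a toy illustration in Python, not Kubernetes’ actual controller-manager code; the names `reconcile`, `desired`, and `current` are invented for the sketch:

```python
# Toy reconciliation loop: declarative config + self-healing in miniature.
# (Illustration only -- not how Kubernetes is actually implemented.)

def reconcile(desired_state: dict, current_state: dict) -> list:
    """Compare desired vs. current replica counts and emit corrective actions."""
    actions = []
    for name, want in desired_state.items():
        have = current_state.get(name, 0)
        if have < want:
            actions.append(("start", name, want - have))  # heal: too few running
        elif have > want:
            actions.append(("stop", name, have - want))   # heal: too many running
    return actions

# The desired state is declared, not written as a series of steps to execute.
desired = {"web": 3, "worker": 2}

# Suppose a "web" instance crashed and an extra "worker" is running.
current = {"web": 2, "worker": 3}

print(reconcile(desired, current))
# -> [('start', 'web', 1), ('stop', 'worker', 1)]
```

Run in a loop, this converges the system back to the declared state after any disturbance – no runbook, and no human deciding which steps to take.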
Runbooks are replaced by immutability and declarative configuration. Self-healing replaces “break glass” production repair processes.
I believe Kubernetes is more than a container orchestrator. Velocity enabled by Kubernetes represents a new IT operating model for how applications are built and managed.
I’m excited to see what’s up at KubeCon this year.
Stop by Booth P18 to see how Cisco participates in the Kubernetes community, and offers powerful network and management solutions to help you deploy production grade Kubernetes in your organization.
Kubernetes stands on the shoulders of giants, so to speak. Some key stepping stones and enablers that make Kubernetes possible and popular now include:
- DevOps – culture shift and automation tools that implement the idea that you can speed up AND increase service quality.
- Virtualization – VMs abstract applications from infrastructure.
- Infrastructure as code – configuration tools that help maintain desired state.
- Cloud computing – infrastructure services for rent, called via API.
- Software Defined Datacenter – compute, storage, and network via API in on-premises infrastructure.
- Containers – immutable images that bundle an application and all of its dependencies.