Devops Means “No, you cannot operate my cloud”
One of the things I believe strongly is that modern SaaS software development, in both its practices and the code it produces, is significantly different from traditional premises-based enterprise software development. Yet I find that for people who have never built and operated a modern SaaS platform, these differences are difficult to grasp. Let me replay for you a conversation I’ve had many times.
Jonathan: “We’ve built this awesome new Cisco Spark cloud platform, which powers the Cisco Spark app. We do continuous delivery, pushing new updates every day. Our engineers operate the platform – a.k.a. devops – and they track a bunch of metrics on quality and engagement that they use every day to make improvements in the code.”
Customer/Partner: “That sounds great! I’ve got a question though – do you have a packaged version that I can operate on premises?”
The answer is – of course not.
When I tell customers/partners this, they are surprised. The reason is NOT that we don’t want their money (trust me, that’s not it), or that we have some policy or strategic reason not to do it. The reason is that it’s technically infeasible, and attempting it would destroy many of the benefits that we’ve built for our customers in the first place.
The reason ultimately comes down to the very nature of what devops is. One of the key principles of devops is, “the developers operate the software.” This is much, much more than just “our developers are the ones that get paged when there is a problem.” What it means is that our operational processes and our software development processes are intertwined. Our developers operate by MODIFYING OUR CODE. This means that the only way someone else can operate our code is if we also ship them our developers. Which, of course, neither we nor they really want.
To illustrate this, here are a few concrete examples:
Yes, our developers are on call.
This means when a particular service fails (our cloud is composed of dozens of coarse-grained microservices), the developers who own that service are notified. It is their job to fix it. Fixing isn’t just a matter of rebooting the server. It requires going through logs, looking at code traces, and then making changes to the code. These changes could be for the purposes of diagnosing the problem. Or they could be a fix. But either way, acting on alerts requires MODIFYING OUR CODE.
This is in contrast to premises software, where, in order for someone else to operate it, an engineer must prepare extensive documentation to explain what the code does. In addition, the engineer must build in a set of configuration and provisioning hooks to allow the operations team to make changes, and these must be documented too. As a consequence, the set of remedies that the operations team can take is much more limited – adjusting configuration, rebooting servers, etc. Some problems can be fixed this way, but not all. The extensive work required for this documentation and training can only be done when updates to the code are released infrequently.
Consequently, the devops approach means the engineers can spend their time coding and fixing, rather than preparing documentation. Furthermore, everything moves faster, and the engineer has much wider latitude to fix problems – because they can MODIFY CODE to both diagnose and fix, rather than relying on a limited set of configuration and provisioning adjustments. This produces much higher quality. Because of all of this, our code cannot just be thrown over the wall to another company to operate, as it lacks these hooks and documentation – and would be worse off if it had them.
Part of devops is not just responding to alerts, but tracking metrics that measure the experience of users in production.
As an example, we have a metric in Cisco Spark to track how many milliseconds it takes between when a user taps on a room in the room list, until the room display renders on the phone. A bunch of processing happens on the phone (and sometimes a cloud query) to render that page. Our developers keep an eye on this metric, and they make changes to the clients and to our cloud services to constantly improve it. Consequently, acting on our metrics and improving them requires MODIFYING OUR CODE.
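To make the idea of such a metric concrete, here is a minimal sketch of how tap-to-render timings might be collected and summarized. Everything in it is an assumption for illustration: the `LatencyMetric` class, the metric name, and the nearest-rank percentile are hypothetical and are not how Cisco Spark actually records this number.

```python
import math
from typing import List


class LatencyMetric:
    """Hypothetical collector for a latency metric measured in milliseconds."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.samples_ms: List[float] = []

    def record(self, elapsed_ms: float) -> None:
        """Store one measurement, e.g. time from tap to rendered room."""
        if elapsed_ms < 0:
            raise ValueError("elapsed time cannot be negative")
        self.samples_ms.append(elapsed_ms)

    def percentile(self, p: float) -> float:
        """Return the p-th percentile using the nearest-rank method."""
        if not self.samples_ms:
            raise ValueError("no samples recorded")
        if not 0 <= p <= 100:
            raise ValueError("p must be between 0 and 100")
        ordered = sorted(self.samples_ms)
        # Nearest rank: smallest value with at least p% of samples at or below it.
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]


# Illustrative samples only; these are not real production numbers.
room_open = LatencyMetric("room_open_ms")
for sample in (120, 80, 200, 95, 150):
    room_open.record(sample)

median_ms = room_open.percentile(50)
p95_ms = room_open.percentile(95)
```

A team watching a number like `p95_ms` day over day would see whether a code change made the slowest room opens better or worse, which is the feedback loop described above.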
Software upgrades in cloud work differently than on prem.
On prem, the vendor ships you a new version, and then you use a maintenance window to install and upgrade it. A modern SaaS product does continuous delivery, which means the software in the cloud is upgraded at least once a day, and typically more often. There are many parts of this upgrade process, but one of the interesting parts is how a new feature rolls out.
Typically, as a feature nears completion, it’s turned on for a small set of users. This is done using the concept of a “feature toggle,” which is a flag that is stored in the cloud and indicates what features are turned on for which users. Our server and client code access these toggles to figure out what features to manifest. Our cloud software has a microservice that manages these feature toggles, and allows us to link them with users, organizations, and even Spark rooms.
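As a rough illustration of what a toggle lookup could look like, here is a small in-memory sketch. The `ToggleStore` class, its method names, and the feature name are all hypothetical; the real service is a microservice with its own storage and API, which this does not attempt to reproduce.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class ToggleStore:
    """Hypothetical in-memory stand-in for a feature toggle service."""

    # Maps a feature name to the scoped IDs it is enabled for,
    # e.g. "user:alice", "org:acme", "room:42".
    enabled: Dict[str, Set[str]] = field(default_factory=dict)

    def enable(self, feature: str, scope: str, scope_id: str) -> None:
        """Turn a feature on for one user, organization, or room."""
        if scope not in ("user", "org", "room"):
            raise ValueError(f"unknown scope: {scope}")
        self.enabled.setdefault(feature, set()).add(f"{scope}:{scope_id}")

    def is_enabled(
        self,
        feature: str,
        *,
        user: str,
        org: Optional[str] = None,
        room: Optional[str] = None,
    ) -> bool:
        """A feature is on if any of the caller's scopes has it enabled."""
        targets = self.enabled.get(feature, set())
        candidates = {f"user:{user}"}
        if org is not None:
            candidates.add(f"org:{org}")
        if room is not None:
            candidates.add(f"room:{room}")
        return not targets.isdisjoint(candidates)


store = ToggleStore()
store.enable("new-room-view", "org", "acme")

if store.is_enabled("new-room-view", user="alice", org="acme"):
    view = "new"
else:
    view = "classic"
```

The point of the sketch is only that both client and server code branch on the answer to `is_enabled`, so the same build can show different features to different users.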
We’ll initially turn on the feature for a small core team, and then watch metrics, logs, and bug reports, improving the software based on that data. As the feature matures, we roll it out to increasing sets of users, again modifying the code based on actual production usage, to make sure the feature actually works. Many problems only show up in actual production, when real-world usage and use cases reveal issues that are not caught in the perfect world of automated tests. Finally, once it looks like it’s baked enough, we enable the toggle for everyone and it shows up for all to use. In this model, the way we develop and mature the feature requires operating it while we develop it.
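One common way to widen a rollout in stages is to hash each user into a stable bucket and compare it against a percentage that is raised over time. The sketch below assumes that technique; it is not a description of how our own rollout works, and the function names, feature name, and user IDs are hypothetical.

```python
import hashlib
from typing import FrozenSet


def bucket(feature: str, user_id: str) -> int:
    """Map a (feature, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def in_rollout(
    feature: str,
    user_id: str,
    percent: int,
    core_team: FrozenSet[str] = frozenset(),
) -> bool:
    """Core team members always get the feature; others by percentage."""
    if user_id in core_team:
        return True
    return bucket(feature, user_id) < percent


core = frozenset({"dev-1", "dev-2"})
users = [f"user-{n}" for n in range(1000)]

# Stage 1: core team only. Later stages raise the percentage.
stage_1 = [u for u in users if in_rollout("new-room-view", u, 0, core)]
stage_2 = [u for u in users if in_rollout("new-room-view", u, 10, core)]
stage_3 = [u for u in users if in_rollout("new-room-view", u, 100, core)]
```

Because the bucket depends only on the feature and user, anyone who has the feature at 10 percent keeps it at 50 percent, so users do not flip back and forth as the rollout widens.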
Our software is optimized for us to operate efficiently, not for others to install cost effectively.
Our software is designed as a large number of small “microservices” – each of which does a very concrete set of things. These microservices do not stand alone. They each rely on many of the other services to do their core job. For example:
- Our room service – which stores the list of rooms, participants in the room, and list of activities in the room – does not know how to restart itself if it fails. We rely on CloudFoundry for that – which we’ve deployed and operate.
- The room service doesn’t have its own logs – it ships logs to a logging service, which is itself a cluster of many services and associated databases.
- The room service doesn’t have its own database either – it relies on a shared Cassandra instance, used by it and many other services.
Indeed, our cloud has many database technologies deployed. All of this is great when there is only one copy of all these that we deploy and operate. But, for someone else to deploy and operate, it amounts to quite a bit of complexity which would become overly costly.
SaaS software is built differently because it relies on an economy of scale. Because it operates as a single instance (that can span multiple data centers), the incremental costs of a new service or component are small, compared to the reliability and velocity benefits, which are large. This gets inverted when someone else runs your code. When that happens, the economies of scale are reduced and the costs of a new service component go up. This pushes the software towards consolidation – just “one box please.” It also pushes you to dramatically slow velocity, since it is impractical to receive monthly, let alone daily, updates of software from a vendor in this way. As a result, our SaaS code would be cost inefficient for others to run, and would lose all of the velocity benefits we have built.
Modern Devops = Better Software
These are just some of many, many examples that demonstrate how a modern devops practice intertwines development and operations such that only the developers of the code can operate it. These practices ultimately produce better software, with higher quality, better reliability, rapid innovation cycles, and a better user experience than can be developed with more traditional non-cloud practices. And those benefits are ones that customers, partners, and end users need.
So, if I ever tell you “no, you cannot get a packaged version of my cloud for you to operate,” it’s not because I don’t like you. It’s because to deliver to you, Mr. Customer, the benefits you demand from my software, there is no other way but for us to run it. Because running it and coding it are the same thing.