Building Resiliency Guardrails to Isolate Crashes in Cisco Products

At Heathrow Airport outside of London, more than 600 flights were disrupted or cancelled, and 42,000 pieces of luggage were temporarily lost. In Washington, D.C., a computer operated by the National Security Agency was offline for three days. In Panama, two dozen patients died after accidentally receiving an overdose of gamma radiation to treat their cancer. Ariane 5, a $7 billion rocket built by the European Space Agency to carry satellites into orbit, exploded less than a minute into its maiden voyage.

What do all of these events have in common? Software bugs and crashes.

With 190 million lines of code, Cisco IOS XE, like any other large software stack, can never be crash-proof. But the software engineering team within Cisco Enterprise Networking has developed techniques to dramatically limit the impact of software crashes. Those techniques, written into IOS XE code, add tremendous resilience to every Cisco enterprise networking device.

From Monolithic OS to Resilient Modular Software Stack

When Cisco IOS was first developed, it was a monolithic operating system. Any fault in any module, including upgrades to different versions, could cause the software to crash. It could then take minutes, hours, or even longer to restart Cisco routers and switches.

Moving from IOS to Cisco IOS XE, Cisco developers strived to make sure that the user experience was the same while adding techniques to improve the fault isolation of processes running within the system. As a complete networking software stack running on a Linux kernel, IOS XE was designed with separate fault domains so that a fault in one part of the system did not take the rest of the system down. This is demonstrated in systems with separate line cards and forwarding engines such as the Cisco ASR 1000 Series Aggregation Services Routers and the Cisco 8000 Series Customer Edge Routers. The line cards, route processors, and forwarding processors can be reloaded and upgraded independently without an entire system reload. Today, if a Cisco product running IOS XE suffers a crash, the system does not go down because the faults are isolated to specific domains.

In the latest version of IOS XE, software resiliency is being increased by shrinking the fault domain to a single process. This is achieved with a process runtime architecture that uses three software techniques: work units, transactions, and persistence.

Work Units Limit the Scope of Faults

With IOS XE, in the event of a crash or a version upgrade, processes continue operating as if the restart didn’t occur. One key foundation is that all processes in the system are designed to operate on discrete, independent work units. A work unit has a definite start and end; examples include processing a packet, a socket operation, a timer, or an inter-process communication (IPC) event. Crashes (software-forced reloads) are confined to the work unit in which they occur. For example, if a process receives a faulty IPC message, perhaps because the sender has gone down and is no longer valid, and the software can’t handle that error condition, the crash is limited to that work unit and operation. The process can fail only at that point in time, within a very small boundary.

There are a couple of ways of implementing multitasking in the system: preemptive threading and user-level cooperative multitasking. Preemptive threading delegates scheduling to the kernel and relieves the programmer of the burden of thinking about thread runtime. On the downside, because it is true concurrency at the hardware level, it makes guaranteeing data correctness very difficult. IOS XE takes the cooperative multitasking approach, using a user-level library that gives the process multitasking capabilities without adding hardware-level concurrency. This allows the user events to form a natural work unit for the process to operate on.
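To make the idea concrete, here is a minimal sketch of a cooperative, run-to-completion event loop in Python. It is not IOS XE code: the names `WorkUnit`, `CooperativeScheduler`, and `demo` are hypothetical, and a Python exception stands in for a crash. In a real system the process would restart and pick up with the next work unit rather than catch an exception.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable


@dataclass
class WorkUnit:
    """One discrete event: a packet, timer, socket operation, or IPC message."""
    kind: str
    payload: object
    handler: Callable[[object], None]


class CooperativeScheduler:
    """Runs work units one at a time, each to completion (no threads, no preemption)."""

    def __init__(self):
        self.queue = deque()
        self.completed = []
        self.faulted = []

    def post(self, unit):
        self.queue.append(unit)

    def run(self):
        while self.queue:
            unit = self.queue.popleft()
            try:
                unit.handler(unit.payload)
                self.completed.append(unit.kind)
            except Exception as exc:
                # Assumption: an exception models a crash. The fault stays
                # inside this one work unit and the loop moves on.
                self.faulted.append((unit.kind, repr(exc)))


def demo():
    table = {}

    def on_packet(p):
        table[p["dst"]] = p["port"]

    def on_timer(_):
        table["ticks"] = table.get("ticks", 0) + 1

    def on_ipc(p):
        if p["sender"] is None:
            raise ValueError("IPC from a sender that no longer exists")

    sched = CooperativeScheduler()
    sched.post(WorkUnit("packet", {"dst": "10.0.0.1", "port": 3}, on_packet))
    sched.post(WorkUnit("ipc", {"sender": None}, on_ipc))
    sched.post(WorkUnit("timer", None, on_timer))
    sched.run()
    return sched, table
```

Because only one handler runs at a time, no locks are needed around `table`, and the faulty IPC event does not prevent the timer event queued behind it from being processed.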

Persistence

To support process restartability, IOS XE processes keep their state in a runtime database that provides persistence. Persistence is the property of state that outlives the process that created it. Since the goal is to let a process disappear and reappear, the process cannot rely on data stored in the BSS (block started by symbol) or data segments unless that data is recreated on process startup. The databases are memory-mapped files that provide persistence across reloads, along with IPC based on message queue IPC (MQIPC).
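The general technique of keeping state in a memory-mapped file can be sketched with Python's standard `mmap` module. This is only an illustration under simple assumptions: the `PersistentCounter` class, the single 64-bit counter layout, and the file name are hypothetical, and closing and reopening the object stands in for a process restart.

```python
import mmap
import os
import struct
import tempfile

RECORD = struct.Struct("<Q")  # hypothetical layout: one unsigned 64-bit counter


class PersistentCounter:
    """A counter whose state lives in a memory-mapped file, not on the heap."""

    def __init__(self, path):
        fresh = not os.path.exists(path) or os.path.getsize(path) < RECORD.size
        self._file = open(path, "w+b" if fresh else "r+b")
        if fresh:
            # First start: create the backing file with an initial value of 0.
            self._file.write(RECORD.pack(0))
            self._file.flush()
        self._map = mmap.mmap(self._file.fileno(), RECORD.size)

    @property
    def value(self):
        return RECORD.unpack_from(self._map, 0)[0]

    def increment(self):
        RECORD.pack_into(self._map, 0, self.value + 1)

    def close(self):
        self._map.flush()
        self._map.close()
        self._file.close()


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "counter.db")

        first = PersistentCounter(path)
        first.increment()
        first.increment()
        first.close()  # the "process" goes away

        second = PersistentCounter(path)  # the "restarted process"
        value = second.value
        second.close()
        return value
```

The second instance never recomputes the counter; it simply maps the same file and finds the state where the first instance left it.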

Transactions

From a data perspective, when a process restarts, it needs to restart from a known good state. Process state in the heap and global data must be consistent. This is achieved by bracketing the work units in software transactions. The infrastructure that provides this functionality is an in-memory runtime database provided by The Description Language (TDL) infrastructure. Before the database and objects within it are modified, a transaction is started. The TDL database objects are modified in their original place but all changes to the database are recorded in an undo log.

Upon successful processing of an event, the undo log is discarded, but if there is a fault or restart, the last in-flight transaction is aborted automatically and the undo log is played back, reverting all changes to the original state. Data is transformed in an atomic fashion, so everything that would change the global state is part of each transaction. With this fail-safe mechanism, either all the data is processed or none of it is. This means that the data is always in a consistent state.
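The undo-log approach described above can be sketched in a few lines of Python. This is not the TDL API: `UndoLogDB`, `run_work_unit`, and `add_route` are hypothetical names, the database is a plain dictionary, and an exception stands in for a fault. A real implementation would also need to keep the undo log itself in persistent storage so that it survives the crash.

```python
class UndoLogDB:
    """In-place updates with an undo log that is replayed on abort."""

    _MISSING = object()  # marks keys that did not exist before the transaction

    def __init__(self):
        self.data = {}
        self._undo = None  # None means no transaction is in flight

    def begin(self):
        if self._undo is not None:
            raise RuntimeError("a transaction is already in flight")
        self._undo = []

    def set(self, key, value):
        if self._undo is None:
            raise RuntimeError("modifications require a transaction")
        # Record the old value first, then modify the object in place.
        self._undo.append((key, self.data.get(key, self._MISSING)))
        self.data[key] = value

    def commit(self):
        self._undo = None  # success: discard the undo log

    def abort(self):
        # Play the undo log back in reverse to restore the original state.
        for key, old in reversed(self._undo):
            if old is self._MISSING:
                self.data.pop(key, None)
            else:
                self.data[key] = old
        self._undo = None


def run_work_unit(db, handler, event):
    """Bracket one work unit in a transaction; return True if it committed."""
    db.begin()
    try:
        handler(db, event)
    except Exception:
        db.abort()
        return False
    db.commit()
    return True


def add_route(db, event):
    # Two related updates that must succeed or fail together.
    db.set("route_count", db.data.get("route_count", 0) + 1)
    if event["next_hop"] is None:
        raise ValueError("route has no next hop")
    db.set(("route", event["prefix"]), event["next_hop"])


def demo():
    db = UndoLogDB()
    ok = run_work_unit(db, add_route, {"prefix": "10.0.0.0/8", "next_hop": "192.0.2.1"})
    bad = run_work_unit(db, add_route, {"prefix": "172.16.0.0/12", "next_hop": None})
    return db, ok, bad
```

In the failing work unit, `route_count` has already been bumped to 2 when the fault occurs; the abort rolls it back to 1, so the counter and the route entries never disagree.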

From a software perspective, that’s unique. Developers are used to atomic operations at the hardware level: if two threads each increment a counter, the value should have increased by two once both are done. But from a software design perspective, you want an entire set of memory operations (e.g., incrementing counters, updating data structures) to be applied as a single unit, either all of them or none.

The bottom line is that transactions leave the process data always in a known good state.

Work units, persistence, and transactions allow Cisco gear to simply restart individual processes when glitches or software upgrades would otherwise threaten a system-wide restart or crash. Separating IOS XE code from data, making it a process under Linux running in Cisco’s compiler solution, enables data integrity at the work unit transaction level. It also provides much higher performance for failures and restarts with access to x86 processors. Because they are not tied to hardware, these software resiliency techniques work in diverse environments with little involvement with IOS XE.

This is the new meaning of resilience in an enterprise networking software stack. It’s now table stakes thanks to the creative and hard work of Cisco developers.
