Data Center High Availability Redefined
The recent large-scale outage at Amazon Web Services (AWS) knocked a large number of websites offline, along with applications, security cameras, and IoT devices. Cloud outages like this one have an impact on a global scale. With the rapid adoption of the cloud over the last few years, data centers are expected to be fully functional 24/7, 365 days a year, with close to zero downtime.
To keep downtime to a minimum, IT organizations invest heavily in resilient network designs and in maintenance technologies that preserve availability. Cisco has come a long way in evolving its software upgrade mechanisms, with In Service Software Upgrade (ISSU) and the protocol extensions that facilitate ISSU for the Data Center Network Operating System (NX-OS). Key features in this area include separation of the data plane and control plane, support for process restartability, the ability to patch software, and support for non-stop forwarding.
ISSU is a comprehensive and transparent software upgrade capability that extends Cisco's high availability innovations for minimizing planned downtime in data center networks. Customers have asked for ISSU for many years, because it lets them update to newer software versions without taking the network element offline. This benefits network administrators and operators by improving both serviceability and the availability of network resources.
Even though ISSU has significant advantages, achieving consistent and predictable behavior within a bounded time requires sophisticated, precise orchestration. The complexity is amplified by the need to upgrade with zero packet loss in the data plane and minimal control plane downtime, so that the upgrade does not cause network-wide disruption.
Cisco NX-OS has made great strides over the years in providing ISSU support for data center network deployments. The entire spectrum is covered, from the various form factors of the modular chassis with dual supervisor cards to Top of Rack (ToR) switches with a single supervisor card. In the past, platforms with dual supervisor cards offered both zero packet loss in the data plane and minimal control plane downtime. ISSU on ToRs, however, provided only zero packet loss in the data plane; control plane downtime ranged from roughly 50 to 90 seconds.
Starting with the October 2016 release, NX-OS supports minimal control plane downtime even on ToRs. This has been made possible by running NX-OS in Linux containers. Running NX-OS in containers and using that for ISSU on ToRs is an innovative approach that solves a real customer problem. Additional benefits of container-based ISSU on NX-OS include:
- The entire upgrade process is accomplished with a single command that is consistent across all NX-OS platforms.
- Control plane downtime is bounded and independent of the configuration and scale.
- Network elements do not need to be upgraded differently depending on their role in the network.
- Multiple nodes can be upgraded in parallel, which saves considerable time when upgrading the entire network.
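The time savings from parallel upgrades in the list above can be illustrated with a rough sketch. Everything here is an assumption made for illustration: the node names, the per-node duration, the batch size, and the `upgrade_node` placeholder, which stands in for whatever mechanism actually triggers the upgrade and does not call any real NX-OS command or API. Because the control plane downtime is described as bounded, the sketch treats the per-node upgrade time as a fixed value.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def batches_needed(node_count: int, parallelism: int) -> int:
    """Number of upgrade rounds when `parallelism` nodes are upgraded at once."""
    return math.ceil(node_count / parallelism)


def estimated_total_minutes(node_count: int, minutes_per_node: float,
                            parallelism: int) -> float:
    """Rough wall-clock estimate, assuming every node takes the same bounded time."""
    return batches_needed(node_count, parallelism) * minutes_per_node


def upgrade_node(name: str) -> str:
    """Hypothetical placeholder: a real version would trigger the upgrade
    on the switch and wait for it to finish."""
    return f"{name}: upgraded"


def upgrade_in_parallel(nodes: list[str], parallelism: int) -> list[str]:
    """Run the placeholder upgrade on up to `parallelism` nodes at a time.
    Results come back in the same order as `nodes`."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return list(pool.map(upgrade_node, nodes))


if __name__ == "__main__":
    # Illustrative numbers only: 40 switches, 15 minutes each.
    print(estimated_total_minutes(40, 15.0, parallelism=1))  # one at a time
    print(estimated_total_minutes(40, 15.0, parallelism=8))  # eight at a time
    print(upgrade_in_parallel(["tor-1", "tor-2", "tor-3"], parallelism=2))
```

With these assumed numbers, upgrading eight nodes at a time reduces the estimate from 600 minutes to 75. In practice the safe degree of parallelism depends on the redundancy built into the network design.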
Container-based ISSU has been shipping on the Nexus 9000 ToR platforms since the 7.0(3)I5(1) NX-OS release. Feedback on this feature has been overwhelmingly positive, especially from customers at the recent Customer Advisory Board and Cisco Live Berlin 2017 events. Among others, eBay and BMW have been actively engaged in this endeavor since its inception. See the following short video for a brief overview of the container-based ISSU functionality on the Nexus 9000 ToR switches.