Part 1 of the 3-part High Availability Series
High availability (HA) networks continue to function even when some components fail. A variety of features in Cisco IOS XE Software provide hardware and software redundancy that contributes to five-nines (99.999%) uptime, which translates to no more than 5.26 minutes of downtime per year. That’s the kind of reliability that Cisco customers have come to expect. Thousands of Cisco engineers in offices throughout the world make it possible.
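The 5.26-minute figure follows directly from the availability percentage. Here is a quick check in Python, assuming an average year of 365.25 days; the function name is purely illustrative.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year, leap days included


def max_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for a given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction


for label, pct in [("three nines", 99.9), ("four nines", 99.99), ("five nines", 99.999)]:
    print(f"{label} ({pct}%): {max_downtime_minutes(pct):.2f} minutes per year")
```

Each additional nine cuts the yearly downtime budget by a factor of ten, which is why five nines leaves only about five minutes.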
This is the first in a series of three blogs that describe significant features in Cisco IOS XE that contribute to HA in the enterprise.
Cisco Stack Manager is a platform-independent discovery protocol that provides failover from active to standby switches in case the active switch experiences a failure. Available on Cisco Catalyst 9000 series switches, it enables a switch to discover peer nodes, verify their authenticity, raise alarms in case of a mismatch, allocate a unique switch number during discovery, and assign an HA role (e.g., active, standby, and member in one type of configuration). In case of failover, switchover, or a reload of the active switch card, the standby switch takes over.
After Stack Manager assigns roles to the switches (e.g., Active, Standby, Member), the Cisco IOS XE redundancy framework enables the control plane protocols to synchronize configuration data to the standby node. Standby protocols remain in a hot state so the standby switch can become active in case of a failure.
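To make the numbering and role assignment concrete, here is a minimal sketch of how discovered switches might be ranked and given roles. The election rule used here (highest priority wins, lowest MAC address breaks ties), the promotion of a member to standby after failover, and all class and function names are assumptions for illustration; this post does not describe Stack Manager's actual algorithm.

```python
from dataclasses import dataclass


@dataclass
class Switch:
    mac: str          # hypothetical identifier, used here as a tie-breaker
    priority: int     # hypothetical election priority (higher is preferred)
    number: int = 0   # unique switch number allocated during discovery
    role: str = ""    # "active", "standby", or "member"


def assign_roles(discovered: list[Switch]) -> list[Switch]:
    """Allocate unique switch numbers and HA roles to the discovered peers."""
    # Assumed ordering: highest priority first, lowest MAC breaks ties.
    ranked = sorted(discovered, key=lambda s: (-s.priority, s.mac))
    for index, switch in enumerate(ranked):
        switch.number = index + 1
        switch.role = "active" if index == 0 else "standby" if index == 1 else "member"
    return ranked


def fail_over(stack: list[Switch]) -> list[Switch]:
    """Drop the failed active switch and let the hot standby take over."""
    survivors = [s for s in stack if s.role != "active"]
    for s in survivors:
        if s.role == "standby":
            s.role = "active"
    # Assumption: one remaining member becomes the new hot standby.
    for s in survivors:
        if s.role == "member":
            s.role = "standby"
            break
    return survivors


stack = assign_roles([Switch("00:aa", 1), Switch("00:bb", 15), Switch("00:cc", 1)])
for s in stack:
    print(s.number, s.mac, s.role)
```

Because the standby is kept in a hot state, the takeover in `fail_over` is only a role change; no configuration has to be rebuilt first.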
Stack Manager works in three different HA configurations, which will be described in an upcoming blog:
- Switch connected via stack cable to up to eight nodes
- Switch connected via StackWise Virtual Link to up to two nodes
- Dedicated HA interface for wireless devices like controllers
Cluster Manager is an adaptation of Stack Manager for use with Cisco Next Gen StackWise® Virtual Link, which provides the ability to virtualize two connected switches into a single virtual switch. Cluster Manager enables the same standby/active failover features provided by Stack Manager, with the added ability to provide HA across an entire data center environment using Next Gen StackWise Virtual Link. Virtualization eliminates the need to physically stack switches on top of each other. Soon, Cluster Manager will be able to support HA in switch clusters across different geographically dispersed locations.
Redundancy Management Interface
The Stack Manager solution connects up to eight switches in a ring. In configurations using StackWise Virtual Link and in wireless deployments, however, there is only a single interface between two nodes: one active, one standby. Two technologies were therefore created to handle split-brain HA scenarios in these configurations: Redundancy Management Interface (RMI) and Dual Active Detection (DAD).
RMI adds another interface to wireless controllers so that if one interface falters or fails, the other will take over to handle HA, first determining if it is an actual failure or just a momentary glitch. If it is an actual failure, RMI provides the redundant connection to ensure that if the active switch goes down, the standby takes over.
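The idea of separating a momentary glitch from a real failure can be sketched with two keepalive counters, one per link. This is an illustration under stated assumptions, not the RMI implementation: the threshold of three missed keepalives, the class, and the function names are all hypothetical.

```python
from dataclasses import dataclass

MISSED_KEEPALIVE_LIMIT = 3  # assumed threshold; not a documented value


@dataclass
class LinkHealth:
    missed_keepalives: int = 0

    def record(self, keepalive_received: bool) -> None:
        # A received keepalive clears the counter, so brief glitches are forgiven.
        self.missed_keepalives = 0 if keepalive_received else self.missed_keepalives + 1

    @property
    def down(self) -> bool:
        return self.missed_keepalives >= MISSED_KEEPALIVE_LIMIT


def standby_decision(ha_link: LinkHealth, rmi_link: LinkHealth) -> str:
    """Decide what the standby does based on both links to the active node."""
    if ha_link.down and rmi_link.down:
        return "take over as active"
    if ha_link.down:
        return "remain standby; peer still reachable over RMI"
    return "remain standby"
```

With a single link, the standby cannot tell a dead peer from a dead cable. The second path is what lets it avoid taking over while the active node is still running.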
Dual Active Detection
For deployments using StackWise Virtual Link, the Dual Active Detection (DAD) process is activated if the connection between the active and standby switches is lost or if one switch fails over to the second. DAD queries the node manager for the existence of the lost peer. If the peer is available, DAD sends a recovery handshake. Once the handshake is completed, if the lost connection was due to a momentary glitch, the standby switch goes into recovery mode. If the switch is experiencing a failure, the other switch goes into recovery mode and assumes the active role.
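The DAD decision flow described above can be written down as a small function. This is a simplified sketch: the function name, its boolean inputs, and the returned strings are hypothetical, and the branch for a peer that cannot be found is an assumption, since that case is not covered here.

```python
def resolve_lost_link(peer_found: bool, handshake_completed: bool, momentary_glitch: bool) -> str:
    """Return the outcome of dual active detection after the link is lost."""
    if not peer_found:
        # Assumption: with no peer to negotiate with, the survivor carries on as active.
        return "peer not found: surviving switch runs as active"
    if not handshake_completed:
        # The recovery handshake has been sent but has not finished yet.
        return "waiting for recovery handshake"
    if momentary_glitch:
        return "standby switch enters recovery mode"
    return "other switch enters recovery mode and assumes the active role"


print(resolve_lost_link(peer_found=True, handshake_completed=True, momentary_glitch=True))
```

The point of the handshake is that both nodes agree on the outcome, so the two switches never end up active at the same time.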
Operational Data Manager
All processes in the active switch update the database, which maintains the device’s state. Since the standby doesn’t communicate with the outside world, it relies on Operational Data Manager (ODM) to update its database when the active switch sends changes. ODM uses Replication Manager to trigger the sync of all data from the active to the standby switch. Each update first goes to the database and then out to the processes in the hot standby switch.
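The update path (active process, active database, replication, standby database, standby processes) can be modeled in a few lines. This is a toy sketch, not the ODM or Replication Manager API: `StateDatabase`, the listener mechanism, and the key names are all invented for illustration.

```python
from typing import Callable


class StateDatabase:
    """Hypothetical key-value store standing in for a switch's state database."""

    def __init__(self) -> None:
        self.records: dict[str, str] = {}
        self.listeners: list[Callable[[str, str], None]] = []

    def write(self, key: str, value: str) -> None:
        # Store the update first, then fan it out to anything listening.
        self.records[key] = value
        for listener in self.listeners:
            listener(key, value)


class ReplicationManager:
    """Forwards every write on the active database to the standby database."""

    def __init__(self, active_db: StateDatabase, standby_db: StateDatabase) -> None:
        self.standby_db = standby_db
        active_db.listeners.append(self.replicate)

    def replicate(self, key: str, value: str) -> None:
        self.standby_db.write(key, value)


active_db, standby_db = StateDatabase(), StateDatabase()
standby_process_state: dict[str, str] = {}


def standby_process(key: str, value: str) -> None:
    # A process on the hot standby learns of the change from its local database.
    standby_process_state[key] = value


standby_db.listeners.append(standby_process)
ReplicationManager(active_db, standby_db)

active_db.write("interface/1", "up")  # an active-side process updates state
```

After the single write on the active side, both the standby database and the standby process hold the same value, which is what allows the standby to take over without relearning state.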
Symmetric Early Stacking Authentication
Symmetric Early Stacking Authentication (SESA) imposes authentication when one Catalyst 9000 series switch interacts with another and encrypts and decrypts all the remote inter-process communication between them to guard against hacking attempts. It works alongside standard stacking, StackWise Virtual Link, and wireless HA solutions and is Federal Information Processing Standards (FIPS) compliant.
Extended Fast Software Upgrade
In the past, reloading software on Cisco platforms could take 6 to 7 minutes. Now, with Extended Fast Software Upgrade (xFSU), the process is reduced to 30 seconds or less. This fast reload feature for Catalyst 9300 series switches decreases downtime during reload (the hardware is never powered off and traffic keeps flowing) while maintaining the control plane in an operational state during the reload process.
Graceful Insertion and Removal
Network admins may wish to remove a network device from the network to perform troubleshooting or upgrade operations. To remove one device and replace it with another, the Graceful Insertion and Removal (GIR) function notifies the protocols on both devices that a maintenance window is under way and that they should not go down. When the platform undergoing maintenance comes back online, it goes immediately into production without having to recreate the sessions it missed, minimizing traffic disruption both at the time of removal from the network and during insertion back into the network.
Another area that contributes to HA is hot patching. Cisco issues micro images containing only the code necessary for a critical bug or security fix. Customers can install them on devices in a fraction of a second using hot patching without any network disruption. Hot patching doesn’t result in a device reload and the fix takes effect immediately. Because the patches are small, they are easy to distribute. Because their content is limited, customers can have much higher confidence in installing these micro patches in their production network without going through the complete validation process. The Cisco IOS XE hot patching feature is a toolchain of integrated technology and is expected to provide hitless defect fixes by default.
With the in-service software upgrade (ISSU) feature, Cisco customers using Cisco IOS XE products with HA functionality, including both routing and switching platforms, can avoid disruptions from image upgrades. ISSU orchestrates the upgrade on standby and active processors one after the other and then switches between them in the control plane so that there is zero effective downtime and zero traffic loss. The Cisco IOS XE software stack can perform any-to-any ISSU between releases, and the development team has an elaborate feature development, testing, and governance process to ensure this happens without failures. Cisco defines policies for a smooth ISSU experience based on platform and release combinations.
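The ordering that makes ISSU hitless (upgrade the standby, switch over, then upgrade the former active) can be sketched as follows. This models the sequence only. The `Processor` class, the function name, and the release labels are hypothetical, and real ISSU also involves state synchronization and compatibility checks that are not shown.

```python
from dataclasses import dataclass


@dataclass
class Processor:
    name: str
    version: str
    role: str  # "active" or "standby"


def issu_upgrade(first: Processor, second: Processor, new_version: str) -> list[str]:
    """Upgrade both processors one at a time so that one is always active."""
    steps = []
    active, standby = (first, second) if first.role == "active" else (second, first)

    # Step 1: upgrade the standby while the active keeps running.
    standby.version = new_version
    steps.append(f"upgraded standby {standby.name} to {new_version}")

    # Step 2: switch over so the upgraded processor becomes active.
    active.role, standby.role = "standby", "active"
    steps.append(f"switched over: {standby.name} is now active")

    # Step 3: upgrade the former active, which is now the standby.
    active.version = new_version
    steps.append(f"upgraded standby {active.name} to {new_version}")
    return steps


rp0 = Processor("RP0", "release-A", "active")
rp1 = Processor("RP1", "release-A", "standby")
for step in issu_upgrade(rp0, rp1, "release-B"):
    print(step)
```

At every step exactly one processor holds the active role, which is the property that keeps the control plane available throughout the upgrade.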
An Ongoing Quest for High Availability
Handling failover at the device level seems straightforward, with automatic features guiding active, standby, and sometimes member switches that are all waiting in line. (For Cisco ASR 1000 routers, active and standby route processors also provide failover and HA, much like Catalyst 9000 series switches.) But for Cisco engineers working on Cisco IOS XE solutions, HA is an ongoing, complex challenge, with vulnerabilities addressed by the many solutions above.