Posting this blog on behalf of Babi Seal, Senior Manager, Product Management, INSBU, and Lukas Krattiger, Principal Engineer, INSBU
This is the second blog in a two-part series that highlights novel Virtual Extensible LAN (VXLAN)-related features that are now shipping in the latest software release of the Nexus 9000 platform. In the previous blog, we briefly described three key features: Tenant Routed Multicast (TRM), Centralized Route Leaking for EVPN (Ethernet VPN), and Policy-Based Routing support with VXLAN. In this blog, we will look at the capabilities of the VXLAN EVPN Multi-Site feature.
Overview
VXLAN EVPN Multi-Site marks an important milestone in the journey of overlays. The original VXLAN flood-and-learn mechanism relied on data-plane learning. This approach was replaced with an enhanced, control-plane-based mechanism in early 2015, when BGP EVPN became the control plane of choice for VXLAN overlays. With this addition, support for integrated Layer-2/3 services, multi-tenancy, optimal one-hop forwarding, and workload mobility was introduced, making EVPN-enabled VXLAN a more scalable and efficient solution.
VXLAN EVPN Multi-Site continues the evolutionary path toward building even more efficient VXLAN-based overlay deployments. It brings back proven principles of hierarchical network design and fault containment, while preserving network-control boundaries when building scalable overlays.
Pre-EVPN Multi-Site: Multi-Pod and Multi-Fabric
The need for interconnecting data centers is as old as the notion of data centers themselves. This was no different when VXLAN was introduced. With VXLAN's capability to build Layer-2 networks on top of Layer-3 networks, we achieved simplicity and transport independence, but unfortunately left many network-design principles out of the overlay.
Even in the pre-VXLAN EVPN days, we still managed to build well-structured and hierarchical topologies such as Fat-Tree, Clos, and Leaf/Spine. VXLAN overlays flattened this hierarchy by creating end-to-end encapsulations from leaf to leaf through the Multi-Pod design, in which the data plane was shared across pods while a separate overlay control-plane instance was kept per pod. Alternative approaches preserved the hierarchy but required the introduction of additional Data Center Interconnect (DCI) technology to interconnect distinct VXLAN overlay domains, resulting in a Multi-Fabric design.
The challenge with Multi-Pod was the use of a single overlay domain (end-to-end encapsulation), which created challenges with scale, fate sharing, and operational restrictions. While Multi-Fabric provided improvements by isolating both the control and the data plane using hierarchical topologies, it imposed additional considerations on users, who had to select from a mish-mash of different DCI technologies to extend and interconnect the overlay domains, resulting in greater operational complexity.
Introducing VXLAN EVPN Multi-Site
VXLAN EVPN Multi-Site is an open solution that extends the capability of VXLAN EVPN to provide hierarchical multi-site connectivity and allows Layer-2 and Layer-3 services to be stretched beyond a single overlay domain. The improvement over Multi-Pod/Multi-Fabric designs is significant: VXLAN EVPN is still used to carry traffic between sites, but policies can now be applied at the border devices that also serve as the 'gateway' to the other sites. These border devices, called Border Gateways (BGWs), terminate, mask, and interconnect multiple overlay domains, fabrics, or sites. The approach chosen in VXLAN EVPN Multi-Site preserves the network-control boundary for traffic enforcement and failure containment with the simplicity of an integrated Layer-2 and Layer-3 extension.
The BGW is the core component of EVPN Multi-Site and simplifies the deployment of the overall solution. In existing VXLAN EVPN fabrics, the BGW role can be introduced by simply converting an existing border node or by adding a new leaf during the fabric lifecycle.
With EVPN Multi-Site, the control and data plane within a given fabric stay unchanged. Only when traffic needs to leave the existing fabric to reach an endpoint in a remote fabric does the BGW perform its function of terminating and re-originating the VXLAN tunnels. The question is: how?
In EVPN Multi-Site, we define each fabric ('site') as its own BGP Autonomous System. We leverage External BGP's next-hop behavior, in which the next hop for reaching a remote endpoint points to the closest BGW. To ensure resiliency and load distribution, up to four BGWs can operate with the same "personality", requiring no control-plane changes whenever a failure scenario isolates or degrades one of the available BGWs. The personality encompasses sharing the same site ID and the same virtual IP address within a given site, thereby making these devices part of a BGW cluster. Additional functions perform interface-state tracking to assist in rapid and efficient detection of failure scenarios, thereby preventing an impaired BGW from remaining in the cluster.
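To make the BGW-cluster idea more concrete, here is a minimal Python sketch. It is purely conceptual (not NX-OS code, and the names and interface identifiers are invented for illustration): BGWs in a site share a site ID and a virtual IP, and a BGW whose tracked DCI- or fabric-facing links are all down is treated as impaired and drops out of the active cluster.

```python
# Conceptual sketch of a BGW cluster (not NX-OS code): BGWs in a site share a
# site ID and a virtual IP; a BGW whose tracked DCI/fabric links are all down
# is considered impaired and no longer participates in the cluster.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BorderGateway:
    name: str
    site_id: int
    virtual_ip: str                       # shared "personality" of the cluster
    dci_links_up: Dict[str, bool] = field(default_factory=dict)
    fabric_links_up: Dict[str, bool] = field(default_factory=dict)

    def is_healthy(self) -> bool:
        # Interface-state tracking: a BGW needs at least one DCI-facing and
        # one fabric-facing interface up to keep forwarding for the site.
        return any(self.dci_links_up.values()) and any(self.fabric_links_up.values())


def active_cluster(bgws: List[BorderGateway]) -> List[BorderGateway]:
    """Return the BGWs that currently form the active cluster (up to four per site)."""
    healthy = [bgw for bgw in bgws if bgw.is_healthy()]
    return healthy[:4]


if __name__ == "__main__":
    bgws = [
        BorderGateway("BGW1", 1, "10.1.1.100", {"Eth1/1": True}, {"Eth1/49": True}),
        BorderGateway("BGW2", 1, "10.1.1.100", {"Eth1/1": False}, {"Eth1/49": True}),
    ]
    print([bgw.name for bgw in active_cluster(bgws)])  # ['BGW1'] - BGW2 lost its DCI link
```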
The two important steps that allow EVPN Multi-Site to achieve its overall behavior are:
- How a BGP EVPN advertisement appears remotely: When a BGP EVPN Route-Type 2 (MAC/IP) or Route-Type 5 (IP Prefix) advertisement arrives from a remote site (remote AS), the BGW takes this information and re-advertises it into its local site (local AS) with its own IP address as the next hop.
- How a leaf performs data-plane operations for traffic leaving the local site: As a result of these BGP EVPN advertisements into the local site (local AS), all site-local leafs see the BGW as the only next hop to reach the remote-site prefixes (both MAC and IP). Whenever these destinations need to be reached, the leaf performs the VXLAN encapsulation toward the BGW of the local site. A conceptual sketch of both steps follows this list.
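The following minimal Python sketch illustrates both steps. It is purely conceptual, not the BGP EVPN code path; the prefixes, addresses, and AS numbers are hypothetical. A route learned from a remote AS is re-advertised into the local site with the BGW's own virtual IP as the next hop, so a local leaf encapsulates traffic for remote destinations toward the BGW rather than toward the remote VTEP.

```python
# Conceptual sketch (not actual BGP/EVPN code): a BGW re-advertises routes
# learned from a remote site into its local site, rewriting the next hop to
# its own virtual IP, so local leafs tunnel remote-bound traffic to the BGW.
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class EvpnRoute:
    route_type: int      # 2 = MAC/IP, 5 = IP prefix
    prefix: str
    next_hop: str
    asn: int             # AS the route was advertised from


def readvertise_into_local_site(route: EvpnRoute, bgw_vip: str, local_asn: int) -> EvpnRoute:
    """Rewrite the next hop to the BGW virtual IP when the route crosses the site boundary."""
    return replace(route, next_hop=bgw_vip, asn=local_asn)


def leaf_encap_target(route: EvpnRoute) -> str:
    """A local leaf simply encapsulates toward whatever next hop the route carries."""
    return route.next_hop


if __name__ == "__main__":
    remote = EvpnRoute(route_type=5, prefix="10.2.0.0/24",
                       next_hop="192.168.2.11", asn=65002)   # remote leaf VTEP
    local = readvertise_into_local_site(remote, bgw_vip="10.1.1.100", local_asn=65001)
    print(leaf_encap_target(local))  # 10.1.1.100 -> the local BGW, not the remote VTEP
```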
What this means is that if there are N=10 sites with M=256 leafs (VTEPs) each, the number of VTEPs each leaf needs to know about is significantly reduced with an EVPN Multi-Site deployment, as the rough calculation below illustrates:
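As a back-of-the-envelope illustration (assuming each site's BGW cluster is reachable through a single virtual IP, so a leaf only needs to track its local site's VTEPs plus that one address):

```python
# Rough back-of-the-envelope calculation. Assumption: each site's BGW cluster
# is represented by a single virtual IP toward the local leafs.
N = 10    # number of sites
M = 256   # leafs (VTEPs) per site

# Flat, single overlay domain: every leaf must know every other leaf end to end.
vteps_without_multisite = N * M - 1      # 2559 remote VTEPs per leaf

# With Multi-Site: only the local leafs plus the local BGW virtual IP.
vteps_with_multisite = (M - 1) + 1       # 256 entries per leaf

print(vteps_without_multisite, vteps_with_multisite)
```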
Another useful feature that EVPN Multi-Site offers is rate limiting across the three BUM classes: Broadcast, Unknown Unicast, and Multicast. Rate limiting, or even disabling some of these classes, becomes paramount, especially given the Layer-2 extension requirement present in many intra-data-center and data center interconnect use cases.
While limiting BUM traffic is important, distributing the BUM-handling load is even more critical in a world of scale-out architectures. In EVPN Multi-Site, we achieve this through a per-VNI Designated Forwarder (DF) election. Across all the BGWs deployed within a site with seamless Layer-2 extension, each BGW performs BUM forwarding for a different set of VNIs (VXLAN Network Identifiers). This way, potential hotspots are avoided and traffic is distributed more efficiently.
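As an illustration only (this does not reproduce the actual NX-OS election mechanism), the idea of spreading BUM forwarding across the BGWs of a site can be sketched as a deterministic per-VNI assignment:

```python
# Conceptual sketch of per-VNI Designated Forwarder (DF) election: each VNI is
# deterministically mapped to one BGW in the site so BUM-forwarding load is
# spread across the cluster. Illustrative only, not the NX-OS algorithm.
from typing import Dict, List


def elect_designated_forwarders(vnis: List[int], bgws: List[str]) -> Dict[int, str]:
    """Assign each VNI to exactly one BGW; sorting makes the election deterministic."""
    ordered = sorted(bgws)
    return {vni: ordered[vni % len(ordered)] for vni in vnis}


if __name__ == "__main__":
    assignment = elect_designated_forwarders(
        vnis=[30000, 30001, 30002, 30003],
        bgws=["BGW1", "BGW2"],
    )
    for vni, df in sorted(assignment.items()):
        print(f"VNI {vni}: DF is {df}")
```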
Summary
Innovations such as the Cisco CloudScale ASICs available in the Nexus 9000-EX and -FX series provide many advanced capabilities for VXLAN overlays, such as VXLAN EVPN Multi-Site, that are not as widely available on other switching platforms. Cisco is developing comprehensive deployment guides that will go in depth on all of the topics we have introduced in this two-part blog series. Stay tuned.
Resources
Since the release of NX-OS 7.0(3)I7(1) for the Nexus 9000 platform, various resources have been published on EVPN Multi-Site. The primary resources are listed below:
Build Hierarchical Fabrics with VXLAN EVPN Multi-Site White Paper https://www.cisco.com/c/dam/en/us/products/collateral/switches/nexus-9000-series-switches/at-a-glance-c45-739422.pdf
Configuration Guide for VXLAN EVPN Multi-Site https://www.cisco.com/c/en/us/td/docs/switches/datacenter/nexus9000/sw/7-x/vxlan/configuration/guide/b_Cisco_Nexus_9000_Series_NX-OS_VXLAN_Configuration_Guide_7x/b_Cisco_Nexus_9000_Series_NX-OS_VXLAN_Configuration_Guide_7x_chapter_01100.html
BRKDCN-2035 – VXLAN BGP EVPN based Multi-POD, Multi-Fabric and Multi-Site (2017 Las Vegas) https://www.ciscolive.com/online/connect/sessionDetail.ww?SESSION_ID=95611
Building Data Centers with VXLAN BGP EVPN: A Cisco NX-OS Perspective, by Lukas Krattiger, Shyam Kapadia, and David Jansen; published March 31, 2017 by Cisco Press. http://www.ciscopress.com/store/building-data-centers-with-vxlan-bgp-evpn-a-cisco-nx-9781587144677
Are there any plans to introduce this feature to the 77k platform for F3 or M3 cards?
Or is it just for particular ASICs on the 9k?
The EVPN Multi-Site feature is based on innovation we brought into the Cisco CloudScale ASIC that is part of the Cisco Nexus 9000 Series of Switches.
We are evaluating the applicability of the EVPN Multi-Site feature against other platforms like the Cisco Nexus 7000/7700 with M3-based line-cards. Today, we already have “VXLAN BGP EVPN to OTV Interoperability” available as part of NX-OS 8.2 (https://www.cisco.com/c/en/us/td/docs/switches/datacenter/nexus7000/sw/vxlan/config/cisco_nexus7000_vxlan_config_guide_8x/cisco_nexus7000_vxlan_config_guide_8x_chapter_01001.html), which means Layer-2 technology stitching capability is possible. Layer-2 stitching capability will not be present on the Cisco Nexus 7000/7700 F3-based line-cards.
So, if I understand correctly, "VXLAN BGP EVPN to OTV Interoperability" is just a simplification of the legacy OTV-on-a-stick configuration approach.
But the new approach introduces more caveats than advantages at the moment.
In the end, it is much more efficient to buy a pair of new 9k switches just for the border-leaf role and use all these great features like EVPN Multi-Site, etc.
Excellent write-up! Also know that ACI supports multi-site in ACI version 3.0 and higher.
Yes. https://www.cisco.com/c/en/us/td/docs/switches/datacenter/aci/apic/sw/kb/b_Cisco_APIC_and_Cisco_ACI_Multi-Site.html#id_53442