VXLAN Deep Dive, Part 2: Looking at the Options
Hey folks, this is the second of three posts looking a little more closely at VXLAN. If you missed the first post, you can find it here. In this installment we are going to look at some of the other options out there. Two of the most common questions we see are “why do I need yet another protocol?” and “can I now get rid of X?” This post should help you answer both. So, let’s dig in…
3 Comparison with other technologies
3.1 Overlay Transport Virtualization (OTV)
If one were to look carefully at the encapsulation format of VXLAN, one would notice that it is actually a subset of the IPv4 OTV encapsulation in draft-hasmit-otv-03, except that the Overlay ID field is unused (made reserved) and the well-known destination UDP port has not yet been allocated by IANA (but will be different).
If one were to look even closer, one would notice that OTV is itself a subset of the IPv4 LISP encapsulation, carrying an Ethernet payload instead of an IP payload.
Using a common (overlapping) encapsulation for all these technologies simplifies the design of hardware forwarding devices and prevents reinvention for its own sake.
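To make the shared encapsulation layout concrete, here is a minimal sketch in Python of building the 8-byte VXLAN shim in front of an inner Ethernet frame. Note that at the time of writing the well-known destination UDP port was still unassigned; the value 4789 used below is the port IANA later allocated, shown only as a placeholder. The function names are illustrative, not from any real library.

```python
import struct

VXLAN_UDP_PORT = 4789  # later assigned by IANA; unallocated when the draft was written

def vxlan_header(vni):
    """Build the 8-byte VXLAN header: flags byte (I bit set), then the 24-bit VNI.

    The Overlay ID field inherited from the OTV encapsulation is reserved
    (zero) in VXLAN, which is why only the flags and VNI carry information.
    """
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    flags = 0x08                 # I flag: a valid VNI is present
    word1 = flags << 24          # flags byte + 24 reserved bits
    word2 = vni << 8             # 24-bit VNI + 8 reserved bits
    return struct.pack("!II", word1, word2)

def encapsulate(vni, inner_ethernet_frame):
    # The outer IP and UDP headers would be added by the VTEP's IP stack;
    # this shows only the VXLAN shim prepended to the inner L2 frame.
    return vxlan_header(vni) + inner_ethernet_frame
```

The same layout, with the reserved fields repurposed, would describe the OTV and LISP headers, which is exactly the hardware-sharing point made above.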
Given that the packet on the wire is very similar between VXLAN and OTV, what is different? OTV was designed to solve a different problem. OTV is meant to be deployed on aggregation devices (the ones at the top of a structured hierarchy of 802.1Q switches) to interconnect all (up to 4094) VLANs in one hierarchy with others, either in the same or in another datacenter, creating a single stretched 4K VLAN domain. It is optimized to operate over the capital-I Internet as a Data Center Interconnect. Cisco’s latest version can interconnect datacenters without relying on IP multicast, which is not always available across the Internet. It prevents flooding of unknown destinations across the Internet by advertising MAC address reachability using routing protocol extensions (namely IS-IS). OTV devices peer with one another using IS-IS, and only a limited number of them are expected to do so, because of where they are placed: at a layer 2 aggregation point. Within a given layer 2 domain below this aggregation point, there are still only 4K VLANs available, so OTV does not create more layer 2 network segments; instead it extends the existing ones over the Internet.
Since VXLAN is designed to be run within a single administrative domain (e.g. a datacenter), and not across the Internet, it is free to use Any Source Multicast (ASM) (a.k.a. (*,G) forwarding) to flood unknown unicasts. Since a VXLAN VTEP may be running in every host in a datacenter, it must scale to numbers far beyond what IS-IS was designed to scale to.
Note that OTV can be complementary to VXLAN as a Data Center Interconnect. This is helpful in two ways. For one, the entire world is not poised to replace VLANs with VXLANs any time soon. All physical networking equipment supports VLANs, and the first implementations of VXLAN will be only in virtual access switches (the ones Virtual Machines connect to), which means that only VMs can connect to VXLANs. If a VM wants to talk with a physical device such as a physical server, layer 3 switch, router, physical network appliance, or even a VM running on a hypervisor without a VXLAN-enabled access switch, it must use a VLAN. So, if you have a VM that wants to talk with something out on the Internet, it must go through a router, and that router will communicate with the VM over a VLAN. Given that some VMs will still need to connect to VLANs, VLANs will still exist, and if layer 2 adjacency is desired across datacenters, OTV works well to interconnect them. The layer 2 extension provided by OTV can be used not just to interconnect VLANs (with the VMs and physical devices connected to them), but also by VTEPs. Since VTEPs require ASM forwarding, and this may not be available across the Internet, OTV can be used to extend the transport VLAN(s) used by the VTEPs between multiple datacenters.
Why did VXLANs use a MAC-in-UDP encapsulation instead of MAC-in-GRE? The easy answer is to say, for the same reasons OTV and LISP use UDP instead of GRE. The reality of the world is that the vast majority (if not all) switches and routers do not parse deeply into GRE packets for applying policies related to load distribution (Port Channel and ECMP load spreading) and security (ACLs).
Let’s start with load distribution. Port Channels (or Cisco’s Virtual Port Channels) are used to aggregate the bandwidth of multiple physical links into one logical link. This technology is used both at access ports and on inter-switch trunks. Switches using Cisco’s FabricPath can get even greater cross-sectional bandwidth by combining Port Channels with ECMP forwarding, but only if the switches can identify flows (this is to prevent out-of-order delivery, which can kill L4 performance). If one of today’s switches were to try to distribute flows between two VTEPs that used a GRE encapsulation, all the traffic would be polarized onto only one link within these Port Channels. Why? Because the physical switches only see two IP endpoints communicating, and cannot parse the GRE header to identify the individual flows from each VM. Fortunately, these same switches all support parsing of UDP all the way to the UDP source and destination port numbers. By configuring the switches to use the hash of source IP/dest IP/L4 protocol/source L4 port/dest L4 port (typically referred to as a 5-tuple), they can spread each UDP flow out to a different link of a Port Channel or ECMP route. While VXLAN does use a well-known destination UDP port, the source UDP port can be any value. A smart VTEP can spread all of the VMs’ 5-tuple flows over many source UDP ports. This allows the intermediate switches to spread the multiple flows (even between the same two VMs!) out over all the available links in the physical network. This is an important feature for data center network design. Note that this does not apply only to layer 2 switches: since VXLAN traffic is IP and can cross routers, it applies to ECMP IP routing in the core as well.
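The source-port trick described above can be sketched in a few lines of Python. A VTEP hashes the inner frame’s 5-tuple and folds the result into the outer UDP source port, so that intermediate switches hashing the outer 5-tuple spread distinct inner flows across Port Channel members and ECMP paths, while packets of any single flow stay on one path (preserving ordering). The port range and function names are illustrative assumptions, not from any specification.

```python
import hashlib

# Assumption: the VTEP draws source ports from the ephemeral range.
UDP_SRC_MIN, UDP_SRC_MAX = 49152, 65535

def vtep_source_port(inner_five_tuple):
    """Derive the outer UDP source port from a hash of the inner flow's 5-tuple.

    Packets of the same inner flow always get the same source port (no
    reordering), while different inner flows, even between the same two
    VMs, land on different ports and therefore different physical links.
    """
    src_ip, dst_ip, proto, sport, dport = inner_five_tuple
    key = f"{src_ip}|{dst_ip}|{proto}|{sport}|{dport}".encode()
    h = int.from_bytes(hashlib.sha256(key).digest()[:4], "big")
    return UDP_SRC_MIN + h % (UDP_SRC_MAX - UDP_SRC_MIN + 1)
```

A GRE-based scheme putting the same entropy in the GRE Key achieves nothing unless every switch and router on the path parses the key, which is the point the next paragraph makes.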
Note that MAC-in-GRE based schemes can perform a similar trick by creating flow-based entropy within a sub-portion of the GRE Key (as opposed to the source UDP port), but it is a moot point unless all the switches and routers along the path can parse the GRE Key field and use it to generate a hash for Port Channel / ECMP load distribution.
Next comes security. As soon as you start carrying your layer 2 traffic over IP routers, you open yourself up to packet injection onto a layer 2 segment from anywhere there is IP access, unless you use firewalls and/or ACLs to protect the VXLAN traffic. Similar to the load balancing issue above, if GRE is used, firewalls and layer 3 switches and routers with ACLs will typically not parse deeply enough into the GRE header to differentiate one type of tunneled traffic from another. This means all GRE would need to be blocked indiscriminately. Since VXLAN uses UDP with a well-known destination port, firewalls and switch/router ACLs can be tailored to block only VXLAN traffic.
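The ACL distinction comes down to which header fields a filter can match on. A minimal sketch, assuming the port value 4789 that IANA later assigned (the draft left the port unallocated): a VXLAN packet is identifiable by its UDP destination port, while GRE traffic only exposes its IP protocol number, forcing an all-or-nothing block.

```python
VXLAN_UDP_PORT = 4789   # assumption: the port IANA later assigned; TBD in the draft
IPPROTO_UDP = 17
IPPROTO_GRE = 47

def acl_matches_vxlan(ip_proto, dst_port):
    # VXLAN traffic can be singled out by its well-known UDP destination
    # port, so an ACL can drop it without touching other UDP traffic.
    return ip_proto == IPPROTO_UDP and dst_port == VXLAN_UDP_PORT

def acl_matches_gre(ip_proto):
    # GRE offers no equivalent field visible to typical ACL hardware:
    # matching IP protocol 47 blocks every GRE tunnel indiscriminately.
    return ip_proto == IPPROTO_GRE
```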
Note that one downside to any encapsulation approach, whether based on UDP or GRE, is that when hypervisor software adds an encapsulation, today’s NICs and/or NIC drivers have no mechanism to be informed of its presence for performing NIC hardware offloads. It would be a performance benefit for either encapsulation method if NIC vendors updated their NICs and/or NIC drivers, and if hypervisor vendors allowed access to these capabilities. Given that NIC vendors (Intel, Broadcom and Emulex) have given public support to both VXLAN and GRE based encapsulations, I can only guess that support for both schemes will be forthcoming.
Locator/ID Separation Protocol (LISP) is a technology that allows end systems to keep their IP address (ID) even as they move to a different subnet within the Internet (Location). It breaks the ID/Location dependency that exists in the Internet today by creating dynamic tunnels between routers (Ingress and Egress Tunnel Routers). Ingress Tunnel Routers (ITRs) tunnel packets to Egress Tunnel Routers (ETRs) by looking up the mapping of an end system’s IP address (ID) to its adjacent ETR IP address (Locator) in the LISP mapping system.
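The ITR/ETR mechanics just described reduce to a mapping lookup plus an encapsulation step. Here is a minimal sketch in which a plain dict stands in for the LISP mapping system; the names, addresses, and functions are illustrative, not from any real LISP implementation.

```python
# EID (end system ID, its stable IP address) -> RLOC (locator: the IP of
# the ETR currently in front of it). A dict stands in for the real,
# distributed LISP mapping system.
MAPPING_SYSTEM = {
    "10.1.1.50": "192.0.2.1",
    "10.1.1.51": "198.51.100.7",
}

def itr_forward(dst_eid, packet):
    """ITR step: resolve the destination EID to an RLOC and tunnel to it."""
    rloc = MAPPING_SYSTEM.get(dst_eid)
    if rloc is None:
        return None                # a real ITR would query the mapping system
    # Encapsulate: the outer header is addressed to the ETR's locator.
    return {"outer_dst": rloc, "inner": packet}

def on_end_system_move(eid, new_rloc):
    # Mobility: only the EID -> RLOC mapping changes. The end system
    # keeps its IP address (its ID), which is the whole point of LISP.
    MAPPING_SYSTEM[eid] = new_rloc
```

Note how a move updates only the mapping entry, so subsequent packets take the shortest path to the new location rather than trombone through the old one.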
LISP provides true end system mobility while maintaining shortest path routing of packets to the end system. With traditional IP routing, an end station’s IP address must match the subnet it is connected to. While VXLAN can extend a layer 2 segment (and therefore the subnet it is congruent with) across hosts that are physically connected to different subnets, when a VM on a particular host needs to communicate out through a physical router via a VLAN, the VM’s IP address must match the subnet of that VLAN, unless the router supports LISP.
If a VXLAN is extended across a router boundary, and the IP Gateway for the VXLAN’s congruent subnet is a VM on the other side of the router, traffic will flow from the originating VM’s server, across the IP network to the IP Gateway VM residing on another host, and then back into the physical IP network via a VLAN. This phenomenon is sometimes referred to as “traffic tromboning” (alluding to the curved shape of a trombone). Thus, while VXLAN supports VMs moving across hosts connected to different layer 2 domains (and therefore subnets), it doesn’t provide the direct path routing of traffic that LISP does.
VMware has an existing proprietary equivalent of VXLAN, deployed today with vCloud Director, called vCloud Director Network Isolation (vCDNI). vCDNI uses a MAC-in-MAC encapsulation. Cisco and VMware, along with others in the hypervisor and networking industry, have worked together on a common industry standard to replace vCDNI, namely VXLAN. VXLAN has been designed to overcome the two shortcomings of the vCDNI MAC-in-MAC encapsulation: load distribution, and the limited span of a layer 2 segment.
The first is the same issue that GRE has with load distribution across layer 2 switch Port Channels (and ECMP for FabricPath). The second is that because the outer encapsulation is a layer 2 frame (not an IP packet), all network nodes (hypervisors, in the case of vCDNI) MUST be connected to the same VLAN. This limits the flexibility of placing VMs within your datacenter if you have any routers interconnecting your server pods, unless you use a layer 2 extension technology such as OTV to do it.
So, a couple of points to wrap things up. Hopefully, this gives you a better understanding of why VXLAN was chosen over the existing options. Beyond that, I hope it is clear that while VXLAN is immensely useful, it is not magical: it relies on a well-built, well-operating L2/L3 infrastructure, so other technologies and protocols are going to come into play in the real world. As usual, post questions to either of these blog entries and we will get them answered as best we can.
[Updated 10 Nov 11]