Yes, I am still talking about VXLAN (or rather, you folks are still talking about VXLAN), so I thought it worthwhile to dig deeper into the topic, since there is so much interest out there. There also seem to be a fair number of misconceptions around VXLAN, so let's see what we can do to clear things up.
This time around, I have some partners in crime for the discussion:
Larry Kreeger is currently a Principal Engineer at Cisco Systems’ SAVTG working on Nexus 1000V architecture. Larry has a wide ranging background in networking accumulated from over 25 years of experience in developing networking products. His recent focus is data center networking, especially as it relates to data center virtualization.
Ajit Sanzgiri has worked on various networking technologies at Cisco and other bay area networking companies over the last 16 years. His interests include hardware based switching and routing solutions, Ethernet and wireless LANs and virtual networking. Currently he works on the Nexus1000v and related network virtualization products.
So, Larry and Ajit have put together this VXLAN primer--it's fairly dense stuff, so we are breaking it into three posts. In this initial post, we'll cover the basics--why VXLANs, and what VXLAN is. I know I've covered this to some degree already, but Larry and Ajit are going to dig a little deeper, which will hopefully help clear up the lingering questions and misconceptions. In the next post, we'll discuss how VXLAN compares with the other tools in your networking arsenal, and, in the final post, we'll cover more of the common questions we are seeing.
1 Why VXLANs ?
VLANs have been used in networking infrastructures for many years now to solve different problems. They can be used to enforce L2 isolation, as policy enforcement points, and as routing interface identifiers. Network services like firewalls have used them in novel ways for traffic steering purposes.
Support for VLANs is now available in most operating systems, NICs, network equipment (e.g. switches, routers, firewalls) and also in most virtualization solutions. As virtualized data centers proliferate and grow, some shortcomings of VLAN technology are beginning to make themselves felt. Cloud providers need some extensions to the basic VLAN mechanism if these shortcomings are to be overcome.
The first is the VLAN namespace itself. 802.1Q specifies a VLAN ID to be 12 bits, which restricts the number of VLANs in a single switched L2 domain to 4096 at best. (Usually some VLAN IDs are reserved for ‘well-known’ uses, which restricts the range further.) Cloud provider environments require accommodating different tenants in the same underlying physical infrastructure. Each tenant may in turn create multiple L2/L3 networks within their own slice of the virtualized data center. This drives the need for a greater number of L2 networks.
The second issue has to do with the operational model for deploying VLANs. Although VTP exists as a protocol for creating, disseminating and deleting VLANs, as well as for pruning them to their optimal extent, most networks disable it. That means some sort of manual coordination is required among the network admin, the cloud admin and the tenant admin to transport VLANs over existing switches. Any proposed extension to VLANs must figure out a way to avoid such coordination. To be more precise, adding each new L2 network must not require incremental config changes in the transport infrastructure.
Third, VLANs today are too restrictive for virtual data centers in terms of physical constraints of distance and deployment. The new standard should ideally be free (at least ‘freer’) of these constraints. This would allow data centers more flexibility in distributing workloads, for instance, across L3 boundaries.
Finally, any proposed extension to the VLAN mechanism should not necessarily require a wholesale replacement of existing network gear. The reason for this should be self-evident.
VXLAN is the proposed technology to support these requirements.
2 What are VXLANs ?
2.1 What’s in a name?
As the name VXLANs (Virtual eXtensible LANs) implies, the technology is meant to provide the same services to connected Ethernet end systems that VLANs do today, but in a more extensible manner. Compared to VLANs, VXLANs are extensible with regard to scale, and extensible with regard to the reach of their deployment.
As mentioned, the 802.1Q VLAN Identifier space is only 12 bits. The VXLAN Identifier space is 24 bits. Doubling the number of bits multiplies the size of the ID space by 4096, yielding over 16 million unique identifiers. This should provide sufficient room for expansion for years to come.
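The arithmetic behind those numbers is worth making explicit--this is pure bit math, nothing VXLAN-specific:

```python
# 802.1Q carries a 12-bit VLAN ID; VXLAN carries a 24-bit VXLAN Network Identifier (VNI).
vlan_ids = 2 ** 12          # 4,096 possible VLAN IDs
vxlan_ids = 2 ** 24         # 16,777,216 possible VNIs

# Doubling the bit width does not double the space -- it multiplies it by 2^12.
growth_factor = vxlan_ids // vlan_ids
print(vlan_ids, vxlan_ids, growth_factor)   # 4096 16777216 4096
```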
VXLANs use Internet Protocol (both unicast and multicast) as the transport medium. The ubiquity of IP networks and equipment allows the end to end reach of a VXLAN segment to be extended far beyond the typical reach of VLANs using 802.1Q today. There is no denying that there are other technologies that can extend the reach of VLANs (Cisco FabricPath/TRILL is just one), but none are as ubiquitously deployed as IP.
2.2 Protocol Design Considerations
When it comes to networking, not every problem can be solved with the same tool. Specialized tools are optimized for specific environments (e.g. WAN, MAN, Campus, Datacenter). In designing the operation of VXLANs, the following deployment environment characteristics were considered. These characteristics are based on large datacenters hosting highly virtualized workloads providing Infrastructure as a Service offerings.
- Highly distributed systems. VXLANs should work in an environment where there could be many thousands of networking nodes (and many more end systems connected to them). The protocol should work without requiring a centralized control point or a hierarchy of protocols.
- Many highly distributed segments with sparse connectivity. Each VXLAN segment could be highly distributed among the networking nodes. Also, with so many segments, the number of end systems connected to any one segment is expected to be relatively low, and therefore the percentage of networking nodes participating in any one segment would also be low.
- Highly dynamic end systems. End systems connected to VXLANs can be very dynamic, both in terms of creation/deletion/power-on/off and in terms of mobility across the network nodes.
- Work with existing, widely deployed network equipment. This translates into Ethernet switches and IP routers.
- Network infrastructure administered by a single administrative domain. This is consistent with operation within a datacenter, and not across the internet.
- Low network node overhead / simple implementation. With the requirement to support very large numbers of network nodes, the resource requirements on each node should not be intensive in terms of either memory footprint or processing cycles. This also means consideration for hardware offload.
2.3 How does it work?
The VXLAN draft defines the VXLAN Tunnel End Point (VTEP) which contains all the functionality needed to provide Ethernet layer 2 services to connected end systems. VTEPs are intended to be at the edge of the network, typically connecting an access switch (virtual or physical) to an IP transport network. It is expected that the VTEP functionality would be built into the access switch, but it is logically separate from the access switch. The figure below depicts the relative placement of the VTEP function.
Each end system connected to the same access switch communicates through the access switch. The access switch acts as any learning bridge does, by flooding out its ports when it doesn’t know the destination MAC, or sending out a single port when it has learned which direction leads to the end station, as determined by source MAC learning. Broadcast traffic is sent out all ports. Further, the access switch can support multiple “bridge domains”, which are typically identified as VLANs with an associated VLAN ID that is carried in the 802.1Q header on trunk ports. In the case of a VXLAN enabled switch, the bridge domain would instead be associated with a VXLAN ID.
Each VTEP function has two interfaces. One is a bridge domain trunk port to the access switch, and the other is an IP interface to the IP network. The VTEP behaves as an IP host on the IP network. It is configured with an IP address based on the subnet its IP interface is connected to. The VTEP uses this IP interface to exchange IP packets carrying the encapsulated Ethernet frames with other VTEPs. A VTEP also acts as an IP host by using the Internet Group Management Protocol (IGMP) to join IP multicast groups.
In addition to a VXLAN ID to be carried over the IP interface between VTEPs, each VXLAN is associated with an IP multicast group. The IP multicast group is used as a communication bus between the VTEPs to carry broadcast, multicast and unknown unicast frames to every VTEP participating in the VXLAN at a given moment in time. This is illustrated in the figure below.
The VTEP function also works the same way as a learning bridge, in that if it doesn’t know where a given destination MAC is, it floods the frame, but it performs this flooding function by sending the frame to the VXLAN’s associated multicast group. Learning is similar, except instead of learning the source interface associated with a frame’s source MAC, it learns the encapsulating source IP address. Once it has learned this MAC to remote IP association, frames can be encapsulated within a unicast IP packet directly to the destination VTEP.
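A minimal sketch of that flood-and-learn behavior may help. The class and method names below are illustrative, not from the draft; the key point is that the forwarding table maps a remote MAC to the VTEP IP it was learned from, and anything unknown is sent to the VXLAN's multicast group:

```python
# Illustrative sketch of VTEP flood-and-learn forwarding (hypothetical names).
class Vtep:
    def __init__(self, vni, mcast_group):
        self.vni = vni
        self.mcast_group = mcast_group   # flood target for this VXLAN segment
        self.mac_to_vtep_ip = {}         # learned: inner src MAC -> outer src IP

    def learn(self, inner_src_mac, outer_src_ip):
        # On decapsulating a frame, remember which remote VTEP the MAC sits behind.
        self.mac_to_vtep_ip[inner_src_mac] = outer_src_ip

    def outer_destination(self, inner_dst_mac):
        # Known unicast goes straight to the learned VTEP; unknown unicast,
        # broadcast and multicast flood to the VXLAN's multicast group.
        return self.mac_to_vtep_ip.get(inner_dst_mac, self.mcast_group)

vtep = Vtep(vni=5000, mcast_group="239.1.1.1")
vtep.learn("00:11:22:33:44:55", "10.0.0.2")
print(vtep.outer_destination("00:11:22:33:44:55"))   # 10.0.0.2 (learned, unicast)
print(vtep.outer_destination("ff:ff:ff:ff:ff:ff"))   # 239.1.1.1 (flooded)
```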
The initial use case for VXLAN enabled access switches is access switches connected to end systems that are Virtual Machines (VMs). These switches are typically tightly integrated with the hypervisor. One benefit of this tight integration is that the virtual access switch knows exactly when a VM connects to or disconnects from the switch, and what VXLAN the VM is connected to. Using this information, the VTEP can decide when to join or leave a VXLAN’s multicast group. When the first VM connects to a given VXLAN, the VTEP can join the multicast group and start receiving broadcasts/multicasts/floods over that group. Similarly, when the last VM connected to a VXLAN disconnects, the VTEP can use IGMP to leave the multicast group and stop receiving traffic for the VXLAN which has no local receivers.
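The join-on-first / leave-on-last logic above amounts to reference counting per VXLAN. A sketch, with hypothetical names (a real VTEP would send IGMP joins/leaves where this code updates a set):

```python
# Illustrative reference counting for a VTEP's multicast group membership.
class VtepMembership:
    def __init__(self):
        self.vm_count = {}   # VXLAN ID -> number of locally connected VMs
        self.joined = set()  # multicast groups this VTEP has IGMP-joined

    def vm_connect(self, vni, group):
        self.vm_count[vni] = self.vm_count.get(vni, 0) + 1
        if self.vm_count[vni] == 1:
            self.joined.add(group)       # first local VM: join the group

    def vm_disconnect(self, vni, group):
        self.vm_count[vni] -= 1
        if self.vm_count[vni] == 0:
            self.joined.discard(group)   # last local VM: leave the group
```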
Note that because the potential number of VXLANs (16M!) could exceed the amount of multicast state supported by the IP network, multiple VXLANs could potentially map to the same IP multicast group. While this could result in VXLAN traffic being sent needlessly to a VTEP that has no end systems connected to that VXLAN, inter-VXLAN traffic isolation is still maintained. The same VXLAN ID is carried in multicast encapsulated packets as is carried in unicast encapsulated packets. It is not the IP network’s job to keep the traffic to the end systems isolated, but the VTEP’s. Only the VTEP inserts and interprets/removes the VXLAN header within the IP/UDP payload. The IP network simply sees IP packets carrying UDP traffic with a well-known destination UDP port.
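Since the VXLAN header is what carries that isolation, it is worth seeing how small it is. The draft defines an 8-byte header inside the UDP payload: a flags byte whose "I" bit marks a valid VNI, reserved fields, and the 24-bit VXLAN Network Identifier. The sketch below encodes and decodes that layout; treat the details as illustrative of the draft, not a normative implementation:

```python
import struct

# VXLAN header per the draft: Flags(1 byte) + Reserved(3) + VNI(3) + Reserved(1).
I_FLAG = 0x08  # "I" bit set: the VNI field is valid

def encode_vxlan_header(vni):
    # The 24-bit VNI occupies the top 3 bytes of the final 32-bit word;
    # the low byte is reserved (zero).
    return struct.pack("!B3xI", I_FLAG, vni << 8)

def decode_vni(header):
    flags, word = struct.unpack("!B3xI", header)
    assert flags & I_FLAG, "VNI not marked valid"
    return word >> 8

hdr = encode_vxlan_header(5000)
print(len(hdr), decode_vni(hdr))   # 8 5000
```

A VTEP receiving a multicast-encapsulated packet checks this VNI and delivers the inner frame only to local end systems on that VXLAN, which is why sharing a multicast group across VXLANs wastes bandwidth but never leaks traffic.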
So, that was the first installment--if you have questions, post them as comments and we’ll get back to you.
[Updated 11 Nov 11]
Subsequent parts of this post can be found here and here.