SC’11 Cisco booth demo: Open MPI over Linux VFIO

November 14, 2011 - 3 Comments

Linux VFIO (Virtual Function IO) is an emerging technology that allows direct access to PCI devices from userspace.  Although primarily designed as a hypervisor-bypass technology for virtualization use cases, it can also be used in an HPC context.

Think of it this way: hypervisor bypass is somewhat similar to operating system (OS) bypass.  And OS bypass is a characteristic sought in many HPC low-latency networks these days.

Drop by the Cisco SC’11 booth (#1317), where we’ll be showing a technology preview demo of Open MPI utilizing Linux VFIO over the Cisco “Palo” family of first-generation hardware-virtualized NICs (specifically, the P81E PCI form factor).  VFIO + hardware-virtualized NICs enable benefits such as:

  • Low half-round-trip (HRT) ping-pong latencies over Ethernet via direct access to L2 from userspace (4.88us)
  • Hardware steering of inbound and outbound traffic to individual MPI processes

Let’s dive into these technologies a bit and explain how they benefit MPI.

The Cisco Palo NICs are incredibly cool for multiple reasons; the HPC-relevant reasons include:

  • Palo can present itself as up to 128 “virtual” PCI devices to the server
  • The switching to these 128 devices is done in hardware (not software!)

If you pair the concept of hardware-virtualized NICs with VFIO, not only can you access each of these virtual NICs from Linux userspace (e.g., from MPI processes), you can also give each MPI process a unique L2 address and let the hardware handle inbound and outbound steering, flow control, buffering, etc.

In the Cisco SC’11 booth, we’re showing a development demo of Open MPI utilizing technology built upon Linux VFIO to do exactly that.

Specifically: each Open MPI process has direct access to read and write L2 Ethernet frames from Linux userspace, offloading all the checksums, routing, etc. to the hardware.

This is essentially OS bypass.

Not only does this tremendously cut down on latency (by avoiding the entire TCP and/or UDP stacks), it also offloads the steering of traffic to individual MPI processes onto the hardware.

Stop by the Cisco booth to see an early development version of this Open MPI port in action.

Finally, note that Palo is only Cisco’s first-generation hardware virtualized NIC.  Stay tuned for even better performance with our second-generation NIC…



  1. Hi Jeff,

    Do we really need hardware virtualized NICs? Could we just bypass the OS with multiple ethernet NICs, or is there something I’m missing?

    — Diego

    • Yes, you certainly could use any old NIC, but you will need a userspace mapping of that NIC to get things like its send, receive, and completion queues. With such a mapping, however, you could even conceivably use a single NIC with multiple different EtherTypes to differentiate between different MPI processes.

      The strategy I used was with Cisco’s Palo virtualized NIC, where I could give each MPI process a unique, virtual NIC. This provided (at least) two benefits:

      1. Each MPI process got a unique L2 MAC address, meaning that hardware demultiplexed incoming traffic (vs. software).
      2. With the Palo NIC, each virtual NIC has its own hardware resources (ring buffer, etc.), effectively meaning that one MPI process won’t cause head-of-line blocking in other MPI processes.

      So while a VFIO-enabled strategy should work with any NIC (assuming you have a userspace mapping, as mentioned above), the Palo NIC offers several benefits due to its virtualization hardware. Which is a little weird (virtualization benefitting HPC instead of the other way around), but there you go. 🙂

  2. I should probably clarify: 4.88us is the “native” latency (without MPI). The BTL that I wrote took about a week and had *zero* tuning applied (meaning: I got it working Friday afternoon and left it at that); I was getting about 5.17us NetPIPE MPI half-round-trip latency for a 1-byte message.