High Performance Computing Networking

Durga C., long-time listener, first-time caller, sent me a few interesting questions that I thought I’d share with everyone.  Here’s his first question:

  1. What is the role of the hardware in an RDMA transaction?  In other words, why does one need special hardware (e.g., InfiniBand, iWARP, RoCE, etc.) to do RDMA as opposed to a “normal” Ethernet NIC?

This one question is surprisingly complex.  Let’s dive in…

1. What’s the role of hardware in RDMA transactions?

There are two typical concepts involved: offloading the work from the main CPU and operating system bypass.  Let’s talk about offloading here, and defer discussion of operating system bypass to question 2.

There are many types of NICs available today — let’s consider a spectrum of capabilities with “dumb” NICs on one end and “smart” NICs on the other end.

Dumb NICs are basically bridges between software and the network physical layer.  These NICs support very few commands (e.g., send a frame, receive a frame), and rely on software (usually in the form of a low-level operating system driver) to do all the heavy lifting such as making protocol decisions, creating wire-format frames, interpreting the meaning of incoming frames, etc.

Smart NICs provide much more intelligence.  They typically have microprocessors and can apply some amount of logic to traffic flowing through the NIC (both incoming and outgoing), such as VLAN processing or other network-related operations.

Smart NICs typically also offer some form of offloading work from the main CPU.  This means that they offer high-level commands such as “send message X via TCP connection Y”, and handle the entire transaction in hardware — the main CPU is not involved.
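To make the contrast concrete, here’s a small, entirely hypothetical C sketch of the two models.  None of these structures or functions correspond to any real NIC or driver API — the stubs just print what each side would be responsible for — but the shape of the interaction is the point: one model keeps the CPU in the loop for every frame, the other hands off a single high-level command.

```c
/* Toy illustration of "dumb" vs. "smart" (offload-capable) NIC interfaces.
 * All names here are invented for illustration; no real NIC API is implied. */
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FRAME_PAYLOAD 1460   /* roughly one Ethernet frame's worth of payload */

/* Dumb NIC: accepts one pre-built frame at a time; segmentation, headers,
 * and retransmission are all the main CPU's problem. */
static void dumb_nic_send_frame(const void *frame, size_t len)
{
    (void)frame;
    printf("  [dumb NIC] transmitted one %zu-byte frame\n", len);
}

/* Smart / offload NIC: accepts one high-level command and its firmware
 * handles the rest of the transaction without the main CPU. */
struct nic_send_cmd {
    uint32_t    connection_id;   /* which connection / queue the send uses */
    const void *message;         /* the whole application-level message    */
    size_t      message_len;     /* may span many frames                   */
};

static void smart_nic_post(const struct nic_send_cmd *cmd)
{
    printf("  [smart NIC] firmware now owns the %zu-byte send on connection %u\n",
           cmd->message_len, (unsigned)cmd->connection_id);
}

int main(void)
{
    static char message[8 * 1024];           /* an 8 KB application message */
    memset(message, 'x', sizeof(message));

    puts("Dumb NIC: the CPU builds and submits every frame itself:");
    for (size_t off = 0; off < sizeof(message); off += FRAME_PAYLOAD) {
        size_t chunk = sizeof(message) - off;
        if (chunk > FRAME_PAYLOAD)
            chunk = FRAME_PAYLOAD;
        /* ...plus protocol headers, ACK tracking, retransmits, etc. ... */
        dumb_nic_send_frame(message + off, chunk);
    }

    puts("Smart NIC: one high-level command, then the CPU is free:");
    struct nic_send_cmd cmd = { .connection_id = 42,
                                .message       = message,
                                .message_len   = sizeof(message) };
    smart_nic_post(&cmd);
    return 0;
}
```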

As another example, NICs that implement OpenFabrics-style RDMA typically do a couple of offload-ish kinds of things (in addition to OS bypass, which I’ll discuss later).

While InfiniBand, iWARP, and RoCE NICs each effect offloading in slightly different ways, they all expose high-level commands in firmware that allow relatively thin software layers to initiate complex message-passing operations that are wholly enacted in the NIC hardware.

The main idea is that once a high-level command is issued, an offload-capable NIC’s firmware takes over and the main CPU is no longer necessary for the rest of that transaction.

The whole point of offloading is basically to allow the NIC to handle network-ish stuff, thereby freeing up the CPU to do non-network-ish stuff (i.e., whatever your application needs).

This is frequently referred to as overlapping computation (on the main CPU) with communication (on the NIC).  Since CPUs are tremendously faster than NICs / network speeds, you can do a lot of computation on the main CPU while waiting for a network action to complete.

Sidenote: communication / computation overlap is exactly why MPI supports non-blocking message passing.  An MPI application can start a non-blocking send, for example, then go off and do non-network things (e.g., computing FFTs). Presumably, the MPI implementation has offloaded the “send” command to a “smart” NIC, and the NIC progresses the network operation while the application is doing other things.
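Here’s a minimal sketch of that pattern in C.  The arithmetic loop is just a stand-in for whatever real computation (FFTs, etc.) your application would overlap with the send; everything else is standard MPI.

```c
/* Minimal sketch of communication/computation overlap via a
 * non-blocking MPI send.  Run with at least 2 processes. */
#include <mpi.h>
#include <stdio.h>

#define N (1 << 20)

static double buf[N];

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {
        if (rank == 0)
            fprintf(stderr, "Run with at least 2 processes\n");
        MPI_Finalize();
        return 1;
    }

    if (rank == 0) {
        MPI_Request req;
        /* Start the send; with an offload-capable NIC, the NIC can
         * progress this transfer while the CPU does other work. */
        MPI_Isend(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);

        /* Overlap: do non-network work (stand-in for computing FFTs). */
        double sum = 0.0;
        for (int i = 0; i < N; ++i)
            sum += buf[i] * 0.5;

        /* Only now do we wait for the network operation to complete. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        printf("rank 0: send complete (dummy result = %f)\n", sum);
    } else if (rank == 1) {
        MPI_Recv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
```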

The catch is that you need a “smart” enough NIC (such as an offload-capable NIC) to be able to do these kinds of things. Modern OpenFabrics-style NICs are in this category; they have enough hardware support to asynchronously progress high-level commands while freeing up the CPU for other tasks.

Stay tuned for my next entry where I’ll discuss two more related questions from Durga.
