Durga C., long-time listener, first-time caller, sent me a few interesting questions that I thought I’d share with everyone. Here’s his first question:
- What is the role of the hardware in an RDMA transaction? In other words, why does one need special hardware (e.g., InfiniBand, iWARP, or RoCE) to do RDMA, as opposed to a “normal” Ethernet NIC?
This one question is surprisingly complex. Let’s dive in…
1. What’s the role of hardware in RDMA transactions?
There are two typical concepts involved: offloading the work from the main CPU and operating system bypass. Let’s talk about offloading here, and defer discussion of operating system bypass to question 2.
There are many types of NICs available today — let’s consider a spectrum of capabilities with “dumb” NICs on one end and “smart” NICs on the other end.
Dumb NICs are basically bridges between software and the network physical layer. These NICs support very few commands (e.g., send a frame, receive a frame), and rely on software (usually in the form of a low-level operating system driver) to do all the heavy lifting such as making protocol decisions, creating wire-format frames, interpreting the meaning of incoming frames, etc.
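To make "software does the heavy lifting" a bit more concrete, here is a minimal sketch of the kind of work a driver does for a dumb NIC: building a wire-format Ethernet II frame by hand. The function name `build_frame` and its constants are my own illustration, not any real driver's API, and the sketch assumes the trailing frame check sequence is appended elsewhere.

```python
import struct

ETH_MIN_PAYLOAD = 46    # shorter payloads are zero-padded to this length
ETH_MAX_PAYLOAD = 1500  # standard (non-jumbo) Ethernet payload limit


def build_frame(dst_mac: bytes, src_mac: bytes, ethertype: int, payload: bytes) -> bytes:
    """Build an Ethernet II frame (header + padded payload) in software.

    Hypothetical helper for illustration. The frame check sequence is
    deliberately left out; this sketch assumes it is added elsewhere.
    """
    if len(dst_mac) != 6 or len(src_mac) != 6:
        raise ValueError("MAC addresses must be exactly 6 bytes")
    if len(payload) > ETH_MAX_PAYLOAD:
        raise ValueError("payload too large for a single frame")
    # Header: destination MAC, source MAC, 16-bit EtherType (network byte order).
    header = struct.pack("!6s6sH", dst_mac, src_mac, ethertype)
    # Pad short payloads with zero bytes up to the Ethernet minimum.
    padded = payload.ljust(ETH_MIN_PAYLOAD, b"\x00")
    return header + padded
```

Every outgoing message costs main-CPU cycles for steps like these, which is exactly the work that smarter NICs take over.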
Smart NICs provide much more intelligence. They typically have microprocessors and can apply some amount of logic to traffic flowing through the NIC (both incoming and outgoing), such as VLAN processing or other network-related operations.
Smart NICs typically also offer some form of offloading work from the main CPU. This means that they offer high-level commands such as “send message X via TCP connection Y”, and handle the entire transaction in hardware — the main CPU is not involved.
As another example, NICs that implement OpenFabrics-style RDMA typically do two offload-ish kinds of things (in addition to OS bypass, which I’ll discuss later):
- Accept high-level commands, such as “put buffer X at address Y on remote node Z.” The hardware takes buffer X, fragments it (if necessary), frames the fragments, and sends them to the peer. The peer NIC hardware takes the incoming frames, sees that they’re part of an RDMA transaction, and puts the incoming data in the buffer starting at address Y. Similarly, other high-level commands can effect protocol decisions and actions in peer NIC hardware.
- Provide asynchronous, “background” progress in ongoing network operations. In the above “put” example, if the message was 1MB in length, it may take some time to retrieve the entire message from main RAM, fragment it, frame it, and send it (and it’s usually done in a pipelined fashion). The main CPU is not involved in this process — the processor(s) on the NIC itself handle all of this.
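The "put" flow described above can be sketched as a small software simulation. To be clear, this is only a toy model of the idea, not how any particular NIC or the OpenFabrics API actually works: the `Fragment` type, the function names, and the use of a `bytearray` as the remote node's memory are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass
class Fragment:
    """One wire fragment of an RDMA-style put (hypothetical format)."""
    remote_offset: int  # where the payload lands, relative to address Y
    payload: bytes


def fragment_put(buffer: bytes, mtu: int) -> Iterator[Fragment]:
    """Sender side: split buffer X into fragments of at most mtu bytes."""
    if mtu <= 0:
        raise ValueError("mtu must be positive")
    for offset in range(0, len(buffer), mtu):
        yield Fragment(offset, buffer[offset:offset + mtu])


def place_fragment(memory: bytearray, base_address: int, frag: Fragment) -> None:
    """Receiver side: copy a fragment's payload to address Y + offset."""
    start = base_address + frag.remote_offset
    end = start + len(frag.payload)
    if start < 0 or end > len(memory):
        raise ValueError("fragment would write outside remote memory")
    memory[start:end] = frag.payload


def rdma_put(buffer: bytes, memory: bytearray, base_address: int, mtu: int = 4096) -> int:
    """Simulate the whole put; returns the number of fragments sent."""
    count = 0
    for frag in fragment_put(buffer, mtu):
        place_fragment(memory, base_address, frag)
        count += 1
    return count
```

For example, `rdma_put(b"abcdefghij", remote, 8, mtu=4)` sends three fragments and leaves the ten bytes at offsets 8 through 17 of `remote`. On a real offload-capable NIC, both halves of this loop run on the NIC hardware rather than on the main CPUs.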
While InfiniBand, iWARP, and RoCE NICs each have slightly different ways of effecting offloading, they all expose high-level commands in firmware that allow relatively thin software layers to perform complex message-passing tasks that are carried out entirely in the NIC hardware.
The main idea is that once a high-level command is issued, an offload-capable NIC’s firmware takes over and the main CPU is no longer necessary for the rest of that transaction.
The whole point of offloading is basically to allow the NIC to handle network-ish stuff, thereby freeing up the CPU to do non-network-ish stuff (i.e., whatever your application needs).
This is frequently referred to as overlapping computation (on the main CPU) with communication (on the NIC). Since CPUs are tremendously faster than NICs and networks, you can do a lot of computation on the main CPU while waiting for a network action to complete.
Sidenote: communication / computation overlap is exactly why MPI supports non-blocking message passing. An MPI application can start a non-blocking send, for example, then go off and do non-network things (e.g., computing FFTs). Presumably, the MPI implementation has offloaded the “send” command to a “smart” NIC, and the NIC is progressing the network operation while the application is doing other things.
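Here is a rough sketch of that overlap pattern. It is an analogy only: a background thread stands in for the offload-capable NIC, and the names `nic_send`, `compute`, and `send_and_compute` are made up for this example. In a real MPI program, the submit step would correspond to something like `MPI_Isend` and the final wait to `MPI_Wait`.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Tuple


def nic_send(message: bytes, link_bytes_per_sec: float) -> int:
    """Stand-in for the NIC: 'transmit' by sleeping for the wire time."""
    time.sleep(len(message) / link_bytes_per_sec)
    return len(message)


def compute(n: int) -> int:
    """Stand-in for application work (e.g., computing an FFT)."""
    return sum(i * i for i in range(n))


def send_and_compute(message: bytes, n: int) -> Tuple[int, int]:
    """Start a send, compute while it progresses, then wait for completion."""
    with ThreadPoolExecutor(max_workers=1) as nic:
        # Hand the send to the "NIC" and return immediately (non-blocking).
        pending = nic.submit(nic_send, message, 10_000_000.0)
        # The main thread is free to do non-network work in the meantime.
        result = compute(n)
        # Block only when the application actually needs the send to be done.
        sent = pending.result()
    return sent, result
```

The point of the pattern is that the call to `compute` does not wait for the send to finish. How much overlap you actually get in a real application depends on whether the NIC can progress the send without help from the main CPU.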
The catch is that you need a “smart” enough NIC (such as an offload-capable NIC) to be able to do these kinds of things. Modern OpenFabrics-style NICs are in this category; they have enough hardware support to asynchronously progress high-level commands while freeing up the CPU for other tasks.
Stay tuned for my next entry where I’ll discuss two more related questions from Durga.