In my prior blog entry, I answered the first of Durga C.’s questions to me. Here are all three of his questions:
- What is the role of the hardware in an RDMA transaction? In other words, why does one need special hardware (e.g., InfiniBand, iWARP, or RoCE NICs) to do RDMA, as opposed to a “normal” Ethernet NIC? (see prior blog entry)
- Further, can you explain why pure software solutions (e.g., Open-MX) are better than nothing when you don’t have hardware support?
- Also, what is the difference between “RDMA” and “RMA”?
Let’s explore the last two of those questions.
2. Why are software-only solutions worthwhile?
Software-only solutions can still be worthwhile, even if you have a “dumb” NIC (i.e., one that isn’t capable of providing high-level network firmware commands).
Taking a step back, HPC applications typically seek to reduce network-communication-induced overhead. There are two typical ways to do this:
- Eliminate as much network protocol as possible (both in terms of processing cycles and bytes sent/received).
- Hand off processing to NIC hardware as soon as possible.
Open-MX is a pure software solution in the Linux kernel. It works with any Ethernet NIC. It accomplishes the first of the above goals by bypassing the entire TCP stack: it simply sends and receives raw L2 Ethernet frames. The whole TCP protocol overhead — both in terms of CPU cycles and bytes occupied by IP and TCP headers on the wire — is avoided, potentially resulting in significantly lower latency than TCP-based messaging.
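To put rough numbers on the per-frame byte savings, here’s a quick back-of-the-envelope calculation using the minimum IPv4 and TCP header sizes (real headers can be larger when options are present):

```python
# Per-frame protocol overhead, in bytes (minimum header sizes; options add more).
ETHERNET_HEADER = 14  # dest MAC (6) + src MAC (6) + EtherType (2)
IPV4_HEADER = 20      # minimum IPv4 header, no options
TCP_HEADER = 20       # minimum TCP header, no options

tcp_overhead = ETHERNET_HEADER + IPV4_HEADER + TCP_HEADER
raw_l2_overhead = ETHERNET_HEADER  # Open-MX-style raw Ethernet framing

print(f"TCP/IP over Ethernet: {tcp_overhead} bytes of headers per frame")
print(f"Raw L2 Ethernet:      {raw_l2_overhead} bytes of headers per frame")
print(f"Saved per frame:      {tcp_overhead - raw_l2_overhead} bytes")
```

Forty bytes per frame may sound small, but for the short messages common in HPC traffic, it is a meaningful fraction of each frame — and that’s before counting the CPU cycles of TCP/IP protocol processing itself.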
Open-MX fails at the second goal, however — it is not an operating-system-bypass solution. Specifically, userspace applications still have to trap down into the kernel to send or receive via the Open-MX software stack. Modern Linux kernel traps are fairly fast, but they are still additional overhead that an OS-bypass solution does not incur.
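A quick (and admittedly unscientific) microbenchmark gives a feel for that extra cost: compare the per-call time of a trivial system call against a trivial userspace operation. The absolute numbers are machine-dependent; the point is only that every trap into the kernel costs something that a pure userspace path does not pay.

```python
import os
import timeit

# Rough, machine-dependent comparison of a trivial kernel trap
# (writing one byte to /dev/null) versus a trivial userspace
# operation (appending one byte to a bytearray).
fd = os.open(os.devnull, os.O_WRONLY)
buf = bytearray()

N = 100_000
trap_time = timeit.timeit(lambda: os.write(fd, b"x"), number=N) / N
user_time = timeit.timeit(lambda: buf.append(0), number=N) / N

print(f"syscall (os.write to /dev/null): {trap_time * 1e9:.0f} ns/call")
print(f"userspace (bytearray.append):    {user_time * 1e9:.0f} ns/call")

os.close(fd)
```

Both operations include Python interpreter overhead, so this understates the relative difference — but it shows that even the cheapest syscall is not free, and Open-MX pays that toll on every send and receive.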
Sidenote: Open-MX sacrifices a bit of performance in the name of portability, but note that this was a deliberate design choice.
Additionally, Open-MX generally has no hardware support for protocol handling and message progression. For example, the Open-MX kernel driver must be invoked every time a frame arrives so that it can decide what to do with that frame. Even though the kernel can execute the Open-MX driver asynchronously (with respect to userspace applications), the driver still has to share CPU cycles with those applications.
Conversely, many high-speed networks provide a way to bypass the operating system when talking to the network hardware. For example, OpenFabrics-style NICs generally provide access to the hardware’s send queue (SQ), receive queue (RQ), and completion queue (CQ) directly from userspace. So does Cisco’s Ethernet Virtual Interface Card (VIC), which I demo’ed at SC’11. Such NICs usually also have support for asynchronously progressing message passing operations in hardware.
This allows userspace applications to directly communicate with the NIC hardware without trapping into the kernel and/or executing kernel-level software drivers. Long / ongoing network operations can progress down in the NIC hardware and not incur any main CPU usage.
All this being said, however, software-only solutions (e.g., Open-MX) are well worth using when all you have is commodity hardware. Open-MX is a great performance equalizer for commodity Ethernet NICs: you won’t get nearly the same performance as you would with a “smart” NIC with offload and OS-bypass capabilities, but you will definitely get better performance than with TCP-based messaging.
3. What’s the difference between “RDMA” and “RMA”?
RMA = “Remote Memory Access”. It’s a general moniker, encompassing many different flavors of getting data from or putting data to remote memory. It usually has the connotation of direct data placement, such as getting data from or putting data to a specific address in a peer’s address space.
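To make “direct data placement” concrete, here is a minimal local analogy (this is not actual RMA — Python’s `multiprocessing.shared_memory` stands in for a memory region a peer has exposed): a “put” writes bytes at a specific offset in the target region, and a “get” reads from a specific offset, with no matching receive or send call on the other side.

```python
from multiprocessing import shared_memory

# A "window" of memory that a peer process could also attach to by name.
# This stands in for the remote memory region an RMA peer exposes.
window = shared_memory.SharedMemory(create=True, size=64)

# "Put": write data directly at a chosen offset in the target window.
# Note there is no matching receive operation on the other side.
payload = b"hello"
offset = 16
window.buf[offset:offset + len(payload)] = payload

# "Get": read data directly from a chosen offset.
data = bytes(window.buf[offset:offset + len(payload)])
print(data)

window.close()
window.unlink()
```

The defining property illustrated here is one-sidedness: the initiator names the target location itself, rather than relying on the peer to post a receive that describes where incoming data should land.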
RDMA = “Remote Direct Memory Access”. This acronym has generally become synonymous with OpenFabrics/verbs-style RMA.