If not RDMA, then what?
In prior blog posts, I talked about some of the challenges that are associated with implementing MPI over RMA- or RDMA-based networks. The natural question then becomes, “What’s the alternative?”
There are at least two general classes of alternatives:
- General purpose networks (e.g., Ethernet — perhaps using TCP/IP or even UDP)
- Special purpose networks (i.e., built specifically for MPI)
Neither of these covers shared memory, but let’s save shared memory as an MPI transport for a future post.
General purpose networks such as Ethernet were among the first targets for MPI back in the mid-’90s. TCP and UDP were the obvious choices, but let’s just discuss TCP here for brevity.
It’s not too difficult to map MPI’s 2-sided semantics onto TCP’s 2-sided semantics. Sure, there are some “gotchas” to deal with, such as lazy connections (e.g., opening a TCP socket to a peer the first time you MPI_SEND to it), the fact that TCP doesn’t have message-based semantics, NAT and other private networking issues, etc. But even with all that, TCP matches MPI’s semantics fairly well. MPI effectively hides a bunch of the underlying TCP communication gorp from the application (sidenote: try to use the word “gorp” in a conversation today) and tries to let the application focus on, well, the application.
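To make the “TCP doesn’t have message-based semantics” point concrete: TCP delivers an undifferentiated byte stream, so an MPI-over-TCP transport has to frame each message itself, typically by prepending a small header carrying the payload length plus the MPI envelope. Here’s a minimal sketch of that idea; the header layout, field names, and the `frame_encode`/`frame_decode` helpers are all illustrative assumptions, not any real MPI implementation’s wire format:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical wire header an MPI-over-TCP transport might prepend to
 * every message so the receiver can recover message boundaries from
 * the raw TCP byte stream.  Layout is illustrative only. */
struct frame_header {
    uint32_t payload_len; /* bytes of user data that follow */
    uint32_t tag;         /* MPI tag */
    uint32_t context_id;  /* identifies the communicator */
};

/* Serialize header + payload into out; returns total bytes written.
 * (A real transport would write these bytes to the socket.) */
static size_t frame_encode(const struct frame_header *hdr,
                           const void *payload, uint8_t *out)
{
    memcpy(out, hdr, sizeof(*hdr));
    memcpy(out + sizeof(*hdr), payload, hdr->payload_len);
    return sizeof(*hdr) + hdr->payload_len;
}

/* Parse one frame back out of a contiguous buffer; returns bytes
 * consumed, so the caller knows where the next frame starts. */
static size_t frame_decode(const uint8_t *in, struct frame_header *hdr,
                           void *payload)
{
    memcpy(hdr, in, sizeof(*hdr));
    memcpy(payload, in + sizeof(*hdr), hdr->payload_len);
    return sizeof(*hdr) + hdr->payload_len;
}
```

The length field is what restores message boundaries: the receiver reads a fixed-size header, then knows exactly how many payload bytes belong to that message before the next one begins.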
There are other networks that were specifically built for MPI, meaning that they present user-level software APIs that map very well to many MPI implementations. Networks that immediately jump to mind are most well-known by their network APIs: MX, Portals, PSM, Tports, and Elan. Such networks typically have unique hardware features and corresponding firmware abstractions that are specifically designed to be used as the lower layer to an MPI implementation.
For example, if there is no shared memory (e.g., on some Cray models), the NIC can do MPI communicator+tag matching in hardware. This is actually a fairly complex task, and is in the critical performance path — MPI matching has a direct impact on short message latency. Implementing it in hardware can definitely be a performance win.
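To give a flavor of what that matching task involves (whether done in hardware or software): an incoming message’s envelope is compared against a queue of posted receives on (communicator, source, tag), where source and tag may be wildcards but the communicator never is, and MPI requires the first matching post in order to win. This is a simplified sketch; the struct, the `ANY_SOURCE`/`ANY_TAG` constants, and the function names are my own illustrative stand-ins, not any implementation’s actual internals:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative wildcard values; the real MPI_ANY_SOURCE and
 * MPI_ANY_TAG constants are implementation-defined. */
#define ANY_SOURCE (-1)
#define ANY_TAG    (-1)

/* One posted receive waiting in the matching queue. */
struct posted_recv {
    int context_id; /* identifies the communicator */
    int source;     /* peer rank, or ANY_SOURCE */
    int tag;        /* message tag, or ANY_TAG */
};

/* Does an incoming (context, source, tag) envelope match this posted
 * receive?  Communicators never wildcard; source and tag may. */
static int envelope_matches(const struct posted_recv *r,
                            int context_id, int source, int tag)
{
    return r->context_id == context_id &&
           (r->source == ANY_SOURCE || r->source == source) &&
           (r->tag == ANY_TAG || r->tag == tag);
}

/* Scan the posted-receive queue in posting order; MPI semantics
 * require the *first* matching post to win.  Returns its index, or
 * -1 if nothing matches (the message would then be parked on an
 * unexpected-message queue until a matching receive is posted). */
static int match_first(const struct posted_recv *q, size_t n,
                       int context_id, int source, int tag)
{
    for (size_t i = 0; i < n; i++)
        if (envelope_matches(&q[i], context_id, source, tag))
            return (int)i;
    return -1;
}
```

Even this toy version hints at why matching sits in the critical path: every arriving short message potentially walks this queue, so doing it in NIC hardware takes the walk off the host CPU entirely.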
Other examples include supporting some level of MPI collective operations directly in the NIC. Such functionality can definitely be a performance win — the progression of a collective operation between multiple processes can asynchronously advance without any involvement of the main CPU. This can create a tremendous speedup compared to software-based collective algorithm implementations.
PSM is a little different from the others, however. The underlying PSM NIC is (intentionally) rock stupid; the PSM middleware is very smart. Instead of hardware asynchronous progress, PSM has a custom kernel driver that provides all the smarts and asynchronicity (use “asynchronicity” in a sentence today, too).
To be clear, having a network API that presents MPI-friendly abstractions to an MPI implementation is not sufficient: the network hardware and/or firmware must natively support such MPI concepts. PSM blurs this definition a little, but without hardware/firmware support, IMHO, it’s just the same old network, dressed up with middleware under MPI that hides the complexity.