Shared memory as an MPI transport (part 2)
In my last post, I discussed the rationale for using shared memory as a transport between MPI processes on the same server as opposed to using traditional network API loopback mechanisms.
The two big reasons are:
- Short message latency is lower with shared memory.
- Using shared memory frees OS and network hardware resources (including the PCI bus) for off-node communication.
Let’s discuss common ways in which MPI implementations use shared memory as a transport.
First, let’s talk about point-to-point communication: direct messages between a pair of peer MPI processes. Here are a few ways that MPI implementations have used shared memory as a transport for point-to-point IPC.
Copy in / copy out (CICO). CICO techniques date all the way back to the MPI-1 era (mid-1990s). At startup, a chunk of shared memory is set up between two processes, and directional queues are set up inside it. The sending process copies a message from its own memory space into the shared memory queue. The receiving process notices the message and copies it out of the shared memory queue into its own memory space.
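To make the copy-in / copy-out steps concrete, here is a minimal sketch of one directional queue. Everything in it is an assumption for illustration, not how any particular MPI implementation does it: the `CicoQueue` class, the slot layout, and the `SLOT_SIZE` / `NUM_SLOTS` values are all hypothetical. It uses an anonymous shared mapping that a forked child inherits, and a plain "full" flag per slot. Real implementations are written in C and use atomic operations and memory barriers where this sketch simply writes the flag last.

```python
import mmap
import os
import struct

SLOT_SIZE = 4096                 # payload bytes per slot (illustrative value)
NUM_SLOTS = 4                    # slots in the one-directional queue
SLOT_STRIDE = 8 + SLOT_SIZE      # 4-byte "full" flag + 4-byte length + payload


class CicoQueue:
    """One end of a one-directional queue in shared memory (hypothetical helper)."""

    def __init__(self, shm):
        self.shm = shm
        self.cursor = 0          # private to each process; both walk the slots in order

    def try_send(self, payload):
        """Copy in. Returns False if the next slot has not been drained yet."""
        if len(payload) > SLOT_SIZE:
            raise ValueError("payload does not fit in one slot")
        base = self.cursor * SLOT_STRIDE
        (full,) = struct.unpack_from("=I", self.shm, base)
        if full:
            return False
        self.shm[base + 8:base + 8 + len(payload)] = payload   # the copy in
        struct.pack_into("=I", self.shm, base + 4, len(payload))
        struct.pack_into("=I", self.shm, base, 1)              # publish the slot last
        self.cursor = (self.cursor + 1) % NUM_SLOTS
        return True

    def try_recv(self):
        """Copy out. Returns None if no message is waiting."""
        base = self.cursor * SLOT_STRIDE
        (full,) = struct.unpack_from("=I", self.shm, base)
        if not full:
            return None
        (length,) = struct.unpack_from("=I", self.shm, base + 4)
        data = self.shm[base + 8:base + 8 + length]            # the copy out (slicing copies)
        struct.pack_into("=I", self.shm, base, 0)              # hand the slot back
        self.cursor = (self.cursor + 1) % NUM_SLOTS
        return data


def demo(count=10):
    """The child process sends `count` messages; the parent receives them."""
    shm = mmap.mmap(-1, NUM_SLOTS * SLOT_STRIDE)   # anonymous mapping, shared across fork()
    pid = os.fork()
    if pid == 0:
        sender = CicoQueue(shm)
        for i in range(count):
            while not sender.try_send(b"message %d" % i):
                pass                               # spin until the receiver frees a slot
        os._exit(0)
    receiver = CicoQueue(shm)
    received = []
    while len(received) < count:
        data = receiver.try_recv()
        if data is not None:
            received.append(data)
    os.waitpid(pid, 0)
    return received
```

Calling `demo()` on a Unix-like system returns the messages in the order the child sent them. Note that every payload byte is copied twice: once into the slot by the sender and once out of it by the receiver.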
By definition, the shared memory between the processes is limited in size; it’s typically much smaller than the memory available to each MPI process. As a direct consequence, large MPI messages need to be segmented and copied through shared memory a chunk at a time. The receiver must drain a segment from shared memory before the sender can reuse that space for a later segment.
Such segmenting is typically done in a pipelined fashion so that the sender can be copying in one segment while the receiver is copying out the previous one. Allowing the sender and receiver to overlap their copies dramatically improves the efficiency of large message transfers.
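Here is a rough sketch of that segmenting, assuming a ring of just two segment-sized slots with `head` and `tail` counters kept in the shared region. The names (`make_ring`, `copy_in`, `copy_out`, `transfer`) and the tiny `SEG_SIZE` are hypothetical, and I assume the receiver already knows the total message length (in practice it would arrive in a header). To keep the example deterministic, one process plays both roles in turn; with two real processes, the sender would be filling one slot while the receiver drains the other.

```python
import mmap
import struct

SEG_SIZE = 8       # deliberately tiny so even short messages get segmented
NUM_SLOTS = 2      # sender can fill one slot while the receiver drains the other
CTRL = struct.Struct("=QQ")   # head = segments copied in, tail = segments copied out


def make_ring():
    """Anonymous shared mapping: control words followed by the segment slots."""
    return mmap.mmap(-1, CTRL.size + NUM_SLOTS * SEG_SIZE)


def _slot_offset(index):
    return CTRL.size + (index % NUM_SLOTS) * SEG_SIZE


def copy_in(shm, message, sent):
    """Sender step: copy in the next segment if a slot is free; return bytes sent so far."""
    head, tail = CTRL.unpack_from(shm, 0)
    if sent >= len(message) or head - tail == NUM_SLOTS:
        return sent                                # nothing left, or ring is full
    segment = message[sent:sent + SEG_SIZE]
    off = _slot_offset(head)
    shm[off:off + len(segment)] = segment
    struct.pack_into("=Q", shm, 0, head + 1)       # only the sender advances head
    return sent + len(segment)


def copy_out(shm, out, total):
    """Receiver step: drain one segment, if any, into the bytearray `out`."""
    head, tail = CTRL.unpack_from(shm, 0)
    if tail == head:
        return                                     # ring is empty
    n = min(SEG_SIZE, total - len(out))            # the last segment may be short
    off = _slot_offset(tail)
    out.extend(shm[off:off + n])
    struct.pack_into("=Q", shm, 8, tail + 1)       # only the receiver advances tail


def transfer(message):
    """Drive both sides in turn; two processes would run these steps concurrently."""
    shm = make_ring()
    out = bytearray()
    sent = 0
    while len(out) < len(message):
        sent = copy_in(shm, message, sent)
        sent = copy_in(shm, message, sent)         # sender may run one segment ahead
        copy_out(shm, out, len(message))
    return bytes(out)
```

With more than one slot, the sender never has to wait for the receiver to finish a segment before starting the next copy, which is where the overlap comes from. If the receiver stops draining, the sender stalls after `NUM_SLOTS` segments.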
Direct memory mapping. Some operating systems allow one process to map or directly access the memory of a peer process. In that case, an intermediate shared memory chunk may not be necessary: process X can copy directly into the receive buffer in process Y.
This technique can get a little tricky if the virtual memory addressing isn’t the same between the two processes, but it can definitely be more efficient than CICO because only one memory copy is required.
An interesting side note is that on modern server architectures, memory bandwidth and memory copying speeds are so high that a well-tuned pipelined CICO method can be nearly as efficient as a single memory copy from direct mapping. The real win for direct memory mapping is that it uses half as much memory bandwidth as CICO, because each byte is copied once instead of twice. As such, the performance win for direct mapping may not become blatantly obvious until you have lots of MPI processes all sending lots of data through shared memory simultaneously.
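A quick back-of-the-envelope calculation shows why the difference only bites at scale. This sketch assumes each copy reads every byte once and writes it once, and it ignores cache effects entirely; the function names and the 16-sender, 64 MiB scenario are made up for illustration.

```python
def memory_traffic(message_bytes, num_copies):
    """Bytes crossing the memory system: each copy reads and writes every byte once."""
    return 2 * message_bytes * num_copies


def node_traffic(num_senders, message_bytes, num_copies):
    """Total traffic when several processes on one node send at the same time."""
    return num_senders * memory_traffic(message_bytes, num_copies)


MIB = 1024 * 1024

# Hypothetical node: 16 processes each sending one 64 MiB message (1 GiB of payload).
cico_bytes = node_traffic(16, 64 * MIB, num_copies=2)     # copy in + copy out: 4 GiB
direct_bytes = node_traffic(16, 64 * MIB, num_copies=1)   # single direct copy: 2 GiB
```

One process moving 4 GiB instead of 2 GiB may not notice. Sixteen processes competing for the same memory controllers are much more likely to.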
Asynchronous transfer. Sometimes, additional hardware or software can effect asynchronous transfer of large messages “in the background” — even when the user application is not in an MPI function. Various experiments over the years have tried using hardware such as Intel’s I/OAT chipset, InfiniBand HCAs, and even kernel software threads to keep large transfers progressing without involvement from the MPI application.
However, remember that there is no such thing as a free lunch: co-opting existing hardware meant for other purposes to do same-node asynchronous transfer comes at some cost. For example, using InfiniBand HCAs for asynchronous transfer means that the message will need to traverse the PCI bus, perhaps even twice. As such, whether these techniques actually provide a consistent performance benefit depends heavily on the MPI application’s communication patterns and the architecture of the underlying hardware.
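The software flavor of this idea can be sketched with a helper thread that performs the copy while the application keeps computing. This is only an analogy for the kernel-thread experiments mentioned above: the `AsyncCopier` class and its methods are hypothetical, and the helper thread is itself a cost, since it competes with the application for a CPU core.

```python
import queue
import threading


class AsyncCopier:
    """Background thread that copies buffers on the caller's behalf (hypothetical helper)."""

    def __init__(self):
        self._work = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            item = self._work.get()
            if item is None:             # shutdown sentinel
                return
            src, dst, done = item
            dst[:len(src)] = src         # the actual copy happens off the caller's thread
            done.set()

    def start_copy(self, src, dst):
        """Queue a copy and return immediately with an Event to wait on later."""
        if len(dst) < len(src):
            raise ValueError("destination buffer is too small")
        done = threading.Event()
        self._work.put((src, dst, done))
        return done

    def shutdown(self):
        self._work.put(None)
        self._thread.join()


copier = AsyncCopier()
source = b"payload " * 1024
destination = bytearray(len(source))
request = copier.start_copy(source, destination)   # returns right away
# ... the application would do unrelated work here ...
request.wait()                                      # block only when the data is needed
copier.shutdown()
```

The shape is the same as a nonblocking send followed by a wait: start the transfer, go do something else, and check for completion later.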
Dedicated, asynchronous memory copying hardware is really required to make this idea work. Such hardware has existed in proprietary parallel architectures for a long time, and has proven quite successful. Think of such setups as having a dedicated communications co-processor (such as OpenFabrics-based RDMA NICs) that just happens to use shared memory as its transport.
That’s enough for one blog post… and let’s not forget that we didn’t even talk about using shared memory as a transport for MPI’s collective communication! That’s a whole topic unto itself; there’s no need to be bound by the same abstractions as point-to-point messages when optimizing collective communication implementations.