MPI is a great transport-agnostic inter-process communication (IPC) mechanism. Regardless of where the peer process is that you’re trying to communicate with, MPI shuttles messages back and forth with it.
Most people think of MPI as communicating across a traditional network: Ethernet, InfiniBand, …etc. But let’s not forget that MPI is also used between processes on the same server.
A loopback network interface could be used to communicate between them; this would present a nice abstraction to the MPI implementation — all peer processes are connected via the same networking interface (TCP sockets, OpenFabrics verbs, …etc.).
But network loopback interfaces are typically not optimized for communicating between processes on the same server (a.k.a. “loopback” communication). For example, short message latency between MPI processes — a not-unreasonable metric to measure an MPI implementation’s efficiency — may be higher than it could be with a different transport layer.
Shared memory is a fast, efficient mechanism that can be used for IPC between processes on the same server. Let’s examine the rationale for using shared memory and how it is typically used as a message transport layer.
Using shared memory can be lighter-weight than traditional network driver loopback mechanisms for multiple reasons:
- Lower-layer network stacks are typically not optimized for on-server loopback communications. Such network stacks are usually tuned for off-server communication. For example, OpenFabrics-based devices can utilize operating system (OS) bypass techniques to communicate with remote peers.
- When using on-server loopbacks, the OS must be involved since process boundaries are crossed. Hence, benefits such as OS-bypass are lost, and the whole mechanism slows down. See the network loopback transport diagram (right); notice how traffic must traverse through the OS to get to the peer process.
- Shared memory usually only requires direct OS involvement during the setup phase. Once shared memory is setup, it effectively becomes an OS-bypass mechanism — processes simply copy to and from the shared region. See the shared memory transport diagram; notice that no OS traversals are necessary.
- When utilizing a network stack, additional layers of wire-level protocol may be added and then stripped that simply aren’t necessary when communicating between two pre-established MPI processes. For example, using a TCP loopback interface, the sending side may add TCP, IP, and L2 protocol information to the message which the receiver must strip off. With shared memory, MPI completely controls exactly what protocol bytes are transferred, and can avoid that unnecessary overhead.
All this boils down to the fact that byte transfers through shared memory will likely be faster through shared memory than through a lower-layer network stack loopback mechanism — especially for short messages. The effective performance difference for large messages may be a wash.
Of course, these are generalities — there are cases where shared memory transports do require dipping into the OS. So even though shared memory as a transport is usually faster than a network loopback, perhaps the biggest reason to use shared memory is simply to avoid using OS network resources.
Specifically: using shared memory as a transport effectively frees up the network OS driver resources, the NIC processor(s), and the paths between them (e.g., the PCI bus). Remember that with increasing core counts inside servers, effects such as congestion and resource contention inside the server become a very big deal. Avoiding these kinds of hotspots is a Good Thing.
Ok, great. So how do MPI implementations typically use shared memory as a transport?
We’ll look at that in my next blog entry.