MPI tradeoffs: space vs. time
@brockpalen asked me a question on Twitter:
@jsquyres [can you discuss] common #MPI implementation assumptions made for performance and/or resource constraints?
Good question. MPI implementations are full of trade-offs between performance and resource consumption. Let’s discuss a few easy ones.
Eager RDMA: This is an optimization that came out of the realization that RDMA operations on certain networks (e.g., iWARP, IBoE [a.k.a. RoCE], and InfiniBand) are typically faster than send/receive operations. The idea is simple: use RDMA to send short, unexpected messages instead of send/receive semantics. Essentially, this means that a short message will just magically show up in the receiver’s memory without the receiving process doing anything — the RDMA transfer is completely handled in hardware.
The receiver can aggressively poll a specific location in memory to find out that a new message has arrived. Polling in this manner can notice the new message much faster than, say, an interrupt or some other OS-induced mechanism.
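To make the polling idea concrete, here is a minimal sketch in Python. It is only a simulation: `remote_write` stands in for the RDMA write that the network hardware would perform, and the slot layout (payload followed by a one-byte flag), `SLOT_SIZE`, and all of the function names are assumptions made up for illustration, not taken from any MPI implementation. A real implementation also has to rely on the network's ordering guarantees so that the flag only becomes visible after the payload has landed.

```python
SLOT_SIZE = 64            # hypothetical payload capacity of one slot, in bytes
FLAG_OFFSET = SLOT_SIZE   # one extra byte after the payload acts as the arrival flag


def make_slot():
    """Allocate one receive slot: payload area plus a trailing flag byte."""
    return bytearray(SLOT_SIZE + 1)


def remote_write(slot, payload):
    """Stand-in for the sender's RDMA write: payload first, flag last."""
    if not 0 < len(payload) <= SLOT_SIZE:
        raise ValueError("payload must be 1..SLOT_SIZE bytes")
    slot[:len(payload)] = payload
    slot[FLAG_OFFSET] = len(payload)  # a nonzero length marks "message arrived"


def poll_slot(slot):
    """Check the flag once; return the message if present, else None."""
    length = slot[FLAG_OFFSET]
    if length == 0:
        return None                   # nothing has arrived yet
    message = bytes(slot[:length])
    slot[FLAG_OFFSET] = 0             # mark the slot as free for reuse
    return message


slot = make_slot()
first_poll = poll_slot(slot)          # None: no message has been written yet
remote_write(slot, b"hello")
second_poll = poll_slot(slot)         # the payload that was just written
```

Note that `poll_slot` never blocks and never asks the OS for anything: it just reads memory. That is where the latency win comes from, and also why the receiver needs one such slot (or a queue of them) per sender.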
However, to make this optimization work, every sender must have a dedicated chunk of memory at the receiver in which to place incoming messages (think about it). And typically you want the receiver to be able to queue up multiple short messages from each sender, so for N processes with a receive queue depth of M and a maximum message size of K bytes, each MPI process consumes ((N-1) x M x K) bytes just for possible incoming short messages. Across the whole job, that adds up to (N x (N-1) x M x K) bytes.
Clearly, this does not scale as the number of processes increases — far too much memory is consumed (i.e., memory that cannot be used by the application).
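A quick back-of-the-envelope calculation shows how fast this grows. The sketch below counts one queue of M slots of K bytes for each of a process's N-1 peers. The queue depth of 16 and the 4 KiB maximum message size are made-up values for illustration, not defaults from any particular MPI implementation.

```python
def eager_rdma_bytes_per_process(n_procs, queue_depth, max_msg_bytes):
    """Receive-side memory one process reserves: one queue per peer."""
    return (n_procs - 1) * queue_depth * max_msg_bytes


def eager_rdma_bytes_job_total(n_procs, queue_depth, max_msg_bytes):
    """Memory reserved across the whole job (every process does the same)."""
    return n_procs * eager_rdma_bytes_per_process(
        n_procs, queue_depth, max_msg_bytes)


QUEUE_DEPTH = 16      # hypothetical M
MAX_MSG_BYTES = 4096  # hypothetical K

for n in (64, 1024, 65536):
    per_proc = eager_rdma_bytes_per_process(n, QUEUE_DEPTH, MAX_MSG_BYTES)
    print(f"N={n:6d}: {per_proc / 2**20:10.1f} MiB per process")
```

With these assumed values, a 1,024-process job reserves 67,043,328 bytes (about 64 MiB) in every process, and a 65,536-process job reserves 4,294,901,760 bytes (about 4 GiB) in every process, before the application has allocated anything.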
But eager RDMA does reduce latency, so MPI implementations still like to use this optimization. A few ideas can be used to make it workable, even as the number of processes grows large, including:
- A receiver only sets up memory for eager RDMA for a given peer the first time that peer sends to it. Many MPI codes use “nearest neighbor” communication patterns, meaning that even though there may be a bajillion processes in the overall MPI job, each one only ever communicates with a few others.
- A receiver limits the number of peers who are allowed to utilize eager RDMA.
Hence, some peers (but not all) get the benefits of eager RDMA. This limits the amount of memory set up for eager RDMA.
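The two ideas above combine naturally: allocate on first contact, and stop allocating once a cap is reached. Here is a rough sketch of that bookkeeping. The class name `EagerRdmaTable`, its methods, and the parameter values are all hypothetical, and a real implementation would also have to register the memory with the network and tell the sender where to write, which is left out here.

```python
class EagerRdmaTable:
    """Tracks which peers have been granted an eager RDMA receive buffer."""

    def __init__(self, max_peers, queue_depth, max_msg_bytes):
        self.max_peers = max_peers
        self.buffer_bytes = queue_depth * max_msg_bytes
        self.buffers = {}  # peer rank -> bytearray, filled in lazily

    def buffer_for(self, peer):
        """Return the peer's eager buffer, allocating it on first contact.

        Returns None once the cap is reached; that peer then stays on
        ordinary send/receive.
        """
        if peer in self.buffers:
            return self.buffers[peer]
        if len(self.buffers) >= self.max_peers:
            return None
        self.buffers[peer] = bytearray(self.buffer_bytes)
        return self.buffers[peer]

    def bytes_in_use(self):
        """Memory currently reserved for eager RDMA in this process."""
        return len(self.buffers) * self.buffer_bytes


# Hypothetical settings: at most 2 eager peers, 4 slots of 8 bytes each.
table = EagerRdmaTable(max_peers=2, queue_depth=4, max_msg_bytes=8)
table.buffer_for(3)            # first message from rank 3: buffer allocated
table.buffer_for(5)            # first message from rank 5: buffer allocated
fallback = table.buffer_for(9) # cap reached: rank 9 gets None
```

The memory cost is now bounded by `max_peers` rather than by the size of the job, and a nearest-neighbor code whose processes each talk to only a few peers never gets anywhere near the cap.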
In my next entry, I’ll discuss another space vs. time optimization: shared receive queues.