“Eager Limits”, part 2
Open MPI actually has multiple different protocols for sending messages — not just eager / rendezvous.
Our protocols were originally founded on the ideas described in this paper. Many things have changed since that 2004 paper, but some of the core ideas are still the same.
The picture to the right shows how Open MPI divides an MPI message up into segments and sends them in three phases. Open MPI’s specific definition of the “eager limit” is the maximum payload size that is sent with MPI match information to the receiver as the first part of the transfer. If the entire message fits in the eager limit, no further transfers are needed and no CTS (clear-to-send) is required.
If the message is longer than the eager limit, the receiver will eventually send back a CTS and the sender will proceed to transfer the rest of the message. When an RDMA-enabled network transport is used (e.g., OpenFabrics-based networks such as iWARP and InfiniBand), Open MPI starts “registering” the bulk of the large message (i.e., the phase 3 segments) with the OpenFabrics network stack. Since this registration is slow, the phase 2 segments are copied to pre-registered buffers and sent immediately. This helps hide the initial latency of registering the phase 3 segments.
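As a rough illustration of the three phases described above, here is a minimal sketch that splits a message length into phase 1 (eager), phase 2 (copied through pre-registered buffers), and phase 3 (registered and sent via RDMA) byte counts. The function names and the phase2_limit parameter are hypothetical and exist only for this example; they are not Open MPI identifiers, and the sizes used are not Open MPI defaults.

```python
def split_message(msg_len, eager_limit, phase2_limit):
    """Return (phase1, phase2, phase3) byte counts for one message.

    eager_limit and phase2_limit are illustrative values chosen by the
    caller; they are assumptions, not real Open MPI defaults.
    """
    # Phase 1: payload sent eagerly along with the MPI match information.
    phase1 = min(msg_len, eager_limit)
    remaining = msg_len - phase1
    # Phase 2: copied into pre-registered buffers and sent immediately
    # after the CTS, while phase 3 memory is still being registered.
    phase2 = min(remaining, phase2_limit)
    # Phase 3: the bulk of the message, registered with the network stack.
    phase3 = remaining - phase2
    return phase1, phase2, phase3


def needs_cts(msg_len, eager_limit):
    """A CTS is only needed when the message exceeds the eager limit."""
    return msg_len > eager_limit


if __name__ == "__main__":
    for length in (1000, 100000):
        print(length, split_message(length, 4096, 32768), needs_cts(length, 4096))
```

With these made-up limits, a 1,000-byte message is sent entirely in phase 1 and never waits for a CTS, while a 100,000-byte message is spread across all three phases.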
Finally, phase 2 and phase 3 are fully pipelined so that Open MPI doesn’t have to wait, for example, for the entire phase 3 section of the message to be registered (registration time is directly proportional to buffer size). Instead, Open MPI can send each chunk of the phase 3 section as soon as it is registered.
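The pipelining idea can be sketched as walking the phase 3 region in fixed-size chunks, where each chunk would be registered and then handed to the network before the next one is touched. This is only a sketch: the generator name and the chunk size are hypothetical, and the actual registration and send steps are left as a comment because they depend on the network stack.

```python
def pipeline_chunks(offset, length, chunk_size):
    """Yield (offset, nbytes) pairs covering a region in chunk_size pieces.

    In a real pipeline, each yielded chunk would be registered and then
    sent right away, rather than waiting for the whole region.
    """
    end = offset + length
    while offset < end:
        nbytes = min(chunk_size, end - offset)
        yield offset, nbytes
        offset += nbytes


if __name__ == "__main__":
    # Hypothetical 10-byte region split into 4-byte chunks.
    for chunk_offset, chunk_len in pipeline_chunks(0, 10, 4):
        # Placeholder for "register this chunk, then send it".
        print(chunk_offset, chunk_len)
```

Because registration time grows with buffer size, the first chunk is ready to send after roughly one chunk's worth of registration time instead of the time for the whole region.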
By default, Open MPI leaves user memory registered so that if the same buffer is used to send or receive again in the future, no pipelining is needed — the entire phase 2 and 3 segments can be transferred in a single offloaded RDMA transfer after the CTS is received.
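To make the effect of leaving memory registered more concrete, here is a toy model of a registration cache. It assumes a buffer is identified by an exact (address, length) pair, which is a simplification made for this example; the class and function names are hypothetical and do not correspond to Open MPI internals.

```python
class RegistrationCache:
    """Toy cache remembering which (address, length) buffers are registered."""

    def __init__(self):
        self._registered = set()

    def is_registered(self, addr, length):
        return (addr, length) in self._registered

    def register(self, addr, length):
        # Stand-in for the slow registration call into the network stack.
        self._registered.add((addr, length))


def plan_transfer(cache, addr, length):
    """Pick how to move the post-eager part of a message.

    Returns "single_rdma" if the buffer is already registered, otherwise
    registers it and returns "pipelined".
    """
    if cache.is_registered(addr, length):
        return "single_rdma"
    cache.register(addr, length)
    return "pipelined"


if __name__ == "__main__":
    cache = RegistrationCache()
    print(plan_transfer(cache, 0x1000, 1 << 20))  # first use of the buffer
    print(plan_transfer(cache, 0x1000, 1 << 20))  # same buffer reused
```

The first send from a buffer pays the registration cost and uses the pipelined path; a later send from the same buffer finds it in the cache and can use a single RDMA transfer after the CTS.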
See this FAQ entry on the Open MPI web site for a fuller description of the pipelined protocols and the run-time parameters that can be used to modify their behavior.
Note, however, that Open MPI’s “eager limit” definition changes for each network type that it supports. Indeed, some network transports (e.g., the Portals, MX, and PSM network transports) hide their own eager limit in their network API stack, so Open MPI has no visibility into what the value is.