One of the nice features of MPI is that its applications don’t have to worry about connection management. There’s no concept of “open a connection to peer X” — in MPI, you just send or receive from peer X.
This is somewhat similar to connectionless network transports such as UDP, where you just send to a target address without explicitly creating a connection to that address. But MPI differs from those transports in one key way: MPI's transport is reliable, meaning that whatever you send is guaranteed to get there.
All this magic happens under the covers of the MPI API. It means that in some environments, MPI must manage connections for you, and also must guarantee reliable delivery.
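To make that concrete, here's a minimal sketch of what this looks like from the application's point of view. Note that there is no "connect to rank 1" call anywhere; the MPI library handles whatever connection setup and reliability the underlying network needs:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 42;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Just send to peer 1 -- no connection management in sight */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Just receive from peer 0 */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("Rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```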
In the mid-1990s, Ethernet-based MPI implementations typically opened TCP sockets between all peers during MPI_INIT. For the 32- or 64-process jobs that were the norm back then, that was fine.
As usage has scaled up, however, "typical" MPI jobs have gotten larger. Jobs spanning hundreds of MPI processes are now common, making the "open all possible connections during MPI_INIT" strategy infeasible.
Think of it this way: "all possible connections" turns into an O(N^2) proposition, with resource consumption to match. Consider a 512-process job (with 32-core commodity servers, that's only 16 machines, which is not unrealistic for common jobs today). If every MPI process opened a TCP socket to each of its 511 peers, that would be 512 × 511 = 261,632 open sockets across the job (130,816 distinct TCP connections).
Not only would this create a “connection storm” on the network, it would also consume 511 file descriptors in each process, or 16,352 file descriptors on each server.
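Here's the back-of-the-envelope arithmetic spelled out, using the same assumed job size (512 processes, 32 processes per server) as the example above:

```c
#include <stdio.h>

int main(void)
{
    long np = 512;            /* total MPI processes in the job */
    long per_server = 32;     /* MPI processes per server       */

    long peers = np - 1;                        /* peers per process: 511      */
    long fds_per_process = peers;               /* one fd per peer: 511        */
    long fds_per_server = per_server * peers;   /* 32 * 511 = 16,352           */
    long total_connections = np * peers / 2;    /* 512 * 511 / 2 = 130,816     */

    printf("fds per process: %ld\n", fds_per_process);
    printf("fds per server:  %ld\n", fds_per_server);
    printf("total TCP connections in the job: %ld\n", total_connections);
    return 0;
}
```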
This is why most (if not all) modern MPI implementations using connection-based network transports utilize a “lazy connection” management scheme. Specifically: the first time an MPI process sends to a peer, a connection is opened between them.
This unfortunately takes quite a bit of code and logic to make work, because the peer MPI process may not yet have a listening agent ready to accept the connection. Typically, the sending process initiates some sort of non-blocking connection request and then periodically polls for completion. The sender may also need to re-initiate the request if the receiver never responds and the initial attempt times out.
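Here's a rough sketch of the sender-side pattern over TCP sockets. The helper names are hypothetical, and real MPI progress engines do considerably more (retries, timeouts, and resolving accept-side races), but it shows the non-blocking-connect-then-poll idea:

```c
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Start connecting to a peer without blocking; returns the socket fd,
 * or -1 on immediate failure.  The connect is left "in flight". */
static int start_connect(const struct sockaddr_in *peer_addr)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    fcntl(fd, F_SETFL, O_NONBLOCK);                 /* non-blocking connect */
    int rc = connect(fd, (const struct sockaddr *) peer_addr,
                     sizeof(*peer_addr));
    if (rc < 0 && errno != EINPROGRESS) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Called later from the progress loop: has the connect finished?
 * Returns 1 = connected, 0 = still pending, -1 = failed (caller may
 * re-initiate the connection attempt). */
static int check_connect(int fd)
{
    struct pollfd pfd = { .fd = fd, .events = POLLOUT };
    if (poll(&pfd, 1, 0) <= 0) {
        return 0;                                   /* not writable yet */
    }
    int err = 0;
    socklen_t len = sizeof(err);
    getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len);
    return (err == 0) ? 1 : -1;
}
```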
Studies have shown over the years that many typical MPI applications exhibit "nearest neighbor" communication patterns, meaning that each MPI process really only communicates with a small number of its peers. Hence, even in a 512-process job, a typical process may only use MPI to communicate with 5 or 6 of its peers.
This observation, paired with a lazy connection scheme, dramatically reduces network-based resource consumption.
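As an illustration of the nearest-neighbor idea, here's a sketch of a stencil-style exchange on a 2D process grid (the 2D layout is just an assumed example, not the only pattern). Each process only ever talks to at most 4 peers, so a lazy-connection MPI opens at most 4 connections per process, regardless of how large the job is:

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int dims[2] = {0, 0}, periods[2] = {0, 0}, nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Dims_create(nprocs, 2, dims);      /* factor nprocs into a 2D grid */

    MPI_Comm grid;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &grid);

    int up, down, left, right;
    MPI_Cart_shift(grid, 0, 1, &up, &down);
    MPI_Cart_shift(grid, 1, 1, &left, &right);

    double halo_out = 1.0, halo_in = 0.0;
    /* Exchange with each neighbor; MPI_PROC_NULL neighbors at the grid
     * edges are no-ops, so boundary processes talk to even fewer peers. */
    MPI_Sendrecv(&halo_out, 1, MPI_DOUBLE, down, 0,
                 &halo_in,  1, MPI_DOUBLE, up,   0,
                 grid, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&halo_out, 1, MPI_DOUBLE, right, 1,
                 &halo_in,  1, MPI_DOUBLE, left,  1,
                 grid, MPI_STATUS_IGNORE);

    MPI_Comm_free(&grid);
    MPI_Finalize();
    return 0;
}
```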