Cisco Blogs

Connection Management

September 3, 2011

One of the nice features of MPI is that its applications don’t have to worry about connection management.  There’s no concept of “open a connection to peer X” — in MPI, you just send or receive from peer X.

This is somewhat similar to connectionless network transports such as UDP, where you just send to a target address without explicitly creating a connection to that address.  But it differs from those transports in a key way: MPI’s transport is reliable, meaning that whatever you send is guaranteed to get there.

All this magic happens under the covers of the MPI API.  It means that in some environments, MPI must manage connections for you, and also must guarantee reliable delivery.
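To make that concrete, here is a minimal sketch in plain MPI C (compile with mpicc, run with mpirun -np 2).  Note that neither rank ever calls anything like “connect”; the library does whatever transport-level setup is needed under the covers.

```c
/* Minimal illustration: MPI has no "open a connection to peer X" call.
 * Rank 0 simply sends to rank 1; the MPI library transparently handles
 * any underlying connection setup and guarantees reliable delivery. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* No prior connection established -- just send to peer 1. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Likewise, just receive from peer 0. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```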

In the mid-1990s, Ethernet-based MPI implementations typically opened TCP sockets between all peers during MPI_INIT.  For the 32- or 64-process jobs that were the norm back then, that was fine.

As usage has scaled up, however, “typical” MPI jobs have gotten larger.  Jobs that span hundreds of MPI processes are common, making the “open all possible connections during MPI_INIT” strategy infeasible.

Think of it this way: “all possible connections” turns into an O(N^2) operation with corresponding resources.  Consider a 512-process job (with 32-core commodity servers, that’s only 16 machines, which is not unrealistic for common jobs today).  If every MPI process opened a TCP socket to each of its 511 peers, that would be 512 × 511 / 2 = 130,816 sockets across the job (each socket is shared by the two processes it connects).

Not only would this create a “connection storm” on the network, it would also consume 511 file descriptors in each process; at 32 processes per server, that’s 16,352 file descriptors on each server.

Yowza!  🙁

This is why most (if not all) modern MPI implementations using connection-based network transports utilize a “lazy connection” management scheme.  Specifically: the first time an MPI process sends to a peer, a connection is opened between them.

This unfortunately takes quite a bit of code and logic to get right, because the peer MPI process may not yet have a listening agent ready to accept the connection.  Typically, the sending process initiates some sort of non-blocking connection request and then periodically polls for completion.  The sender may also need to re-initiate the request if the receiver never responded and the initial attempt timed out.

Studies have shown over the years that many typical MPI applications have “nearest neighbor” communication patterns, meaning that each MPI process really only communicates with a small number of its peers.  Hence, even in a 512-process job, a typical process may only use MPI to communicate with 5 or 6 of its peers.

This observation, paired with a lazy connection scheme, dramatically reduces network-based resource consumption.
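An illustrative sketch of such a pattern (compile with mpicc, run with mpirun): each rank on a ring exchanges data only with its left and right neighbors, so with lazy connections each process opens just two connections regardless of job size.

```c
/* Nearest-neighbor sketch: each rank talks only to its two ring
 * neighbors, so only two lazy connections per process are ever
 * established, no matter how many processes are in the job. */
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int left  = (rank - 1 + size) % size;
    int right = (rank + 1) % size;
    int send = rank, recv = -1;

    /* Only two peers are ever contacted -> only two connections. */
    MPI_Sendrecv(&send, 1, MPI_INT, right, 0,
                 &recv, 1, MPI_INT, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}
```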




  1. In the 512-process job, let’s say that every process will use MPI to communicate with 511 of its peers. Are there other tricks, aside from the “lazy connection scheme” (which I believe will not be effective in this case), to avoid the “connection storm”?

    Is this problem specific to the TCP/IP byte transfer layer, or is it also present in openib (with queue pairs)?

    Thanks !

    • 1. Is the problem specific to TCP? No, it’s common to all connection-based network transports (e.g., including OpenFabrics queue pairs).

      2. How to make it better? You’re right that lazy connections only take you so far. A few things can make it better:

      – if the application triggers a connection storm via an MPI collective across MPI_COMM_WORLD, the MPI implementation may well be staggering the connections anyway, because it may not necessarily have every process talking to every other process (e.g., it may use some kind of algorithm that needs a less-than-fully-connected scheme).
      – if the application itself is doing point-to-point communications with all of its peers, it simply might not be scalable to begin with; perhaps it should be refactored to run at larger scale.
      – some MPI implementations have a “pre-connect” option that establishes all point-to-point connections during MPI_INIT. A scalable algorithm is typically used to set up such all-to-all connections.
      – some network transports have options to bundle and multiplex underlying connections such that the network will only have 1 connection between servers (for example), regardless of how many peers are on each server. This can greatly decrease the amount of resources used and the resulting “connection” storm.

  2. All of my experience is with connectionless interconnects (okay, just Blue Gene) so if you’ll permit a naive question, what would happen if you created a statically connected mesh and forwarded messages whenever direct connections were not available? Might there be some cases where this works better than dynamic connections?

    • I know little/nothing about Blue Gene, but yes, the concept of forwarding messages is certainly viable. It would even allow static setup during MPI_INIT (which is what I think you were implying), because presumably there would be very few connections actually created — so who cares if you don’t end up using them?

      The drawback, of course, is that you might have to rely on software forwarding. That forwarding consumes CPU and memory on intermediate nodes, taking resources away from the MPI processes running there. How much it costs is likely to be entirely application-dependent.