Cisco Blogs

“Give me 4 255-sided die and I’ll get you some IPs”

September 29, 2010

Have you ever wondered how an MPI implementation picks network paths and allocates resources?  It’s a pretty complicated (set of) issue(s), actually.

An MPI implementation must tread the fine line between performance and resource consumption.  If the implementation chooses poorly, it risks poor performance and/or the wrath of the user.  If the implementation chooses well, users won’t notice at all — they silently enjoy good performance.

It’s a thankless job, but someone’s got to do it.  🙂

A canonical example of the “resources vs. performance” issue is “pinned” buffers for OS-bypass networks such as iWARP (cough cough and another network that rhymes with “shminfiniband” cough cough).  Such networks have multiple possible sending modes, particularly as applied to short messages:

  • RDMA: where data just magically appears in the receiver’s memory without any software intervention at the receiver
  • Regular: where data must propagate through the receiver’s network software stack before it shows up in the receiver’s usable memory

The RDMA mode may be faster because it’s (nearly) all hardware.  Yay!  But there are downsides.  Boo!

  • Each receiver must have a dedicated buffer (or set of buffers if you want to enable more than one simultaneous message in flight) for each sender.
  • The receiver must regularly poll each of the dedicated buffers (or at least one of them per sender) to know if new messages have arrived.

Neither of these is scalable: if a receiver posts N buffers of size M bytes for each of P peer processes, the total size used is (M*N*P), and the number of buffers to poll grows with P as well.  Consider what happens as P grows large.  That’s a lot of memory used by the network stack that your application can’t use.
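To put rough numbers on that product, here is a minimal Python sketch.  The buffer count, buffer size, and peer counts are made-up illustrative values, not the defaults of any particular MPI implementation.

```python
def rdma_buffer_bytes(n_buffers: int, buf_size: int, n_peers: int) -> int:
    """Total bytes one receiver dedicates: N buffers of M bytes for each of P peers."""
    return buf_size * n_buffers * n_peers


# Hypothetical values: 8 buffers of 4 KiB for every peer.
N, M = 8, 4096
for P in (16, 1024, 65536):
    mib = rdma_buffer_bytes(N, M, P) / 2**20
    print(f"P={P:>6}: {mib:,.1f} MiB of dedicated buffers per process")
```

With these assumed values each peer costs 32 KiB, so 65,536 peers cost 2 GiB in every single process, before the application has allocated anything.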


Most MPI implementations have taken the approach of allocating such “fast path” RDMA short message buffers for a limited number of “friend” processes.  A “friend” may be a peer MPI process that has sent more than some threshold number of short messages in a specified period of time, for example.  All other peer MPI processes must use the “slow” regular mode of sending.  This assumes that short messages will usually be exchanged between a small number of peer MPI processes.
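One way to picture a “friend” policy is the Python sketch below.  It is only an illustration under stated assumptions: the class name, the sliding-window rule, and the threshold, window, and max_friends parameters are all hypothetical, and real implementations will differ in the details.

```python
from collections import defaultdict, deque


class FriendTracker:
    """Hypothetical policy: a peer becomes a "friend" (gets fast-path buffers)
    once it sends more than `threshold` short messages within `window` seconds,
    as long as fewer than `max_friends` friends exist already."""

    def __init__(self, threshold: int, window: float, max_friends: int):
        self.threshold = threshold
        self.window = window
        self.max_friends = max_friends
        self.friends = set()
        self._arrivals = defaultdict(deque)  # peer rank -> recent timestamps

    def record_short_message(self, peer: int, now: float) -> bool:
        """Record one short message; return True if `peer` is a friend afterwards."""
        if peer in self.friends:
            return True
        times = self._arrivals[peer]
        times.append(now)
        # Forget arrivals that have fallen out of the time window.
        while times and now - times[0] > self.window:
            times.popleft()
        if len(times) > self.threshold and len(self.friends) < self.max_friends:
            self.friends.add(peer)  # a real MPI would set up RDMA buffers here
            del self._arrivals[peer]
            return True
        return False  # this peer stays on the regular ("slow") path


tracker = FriendTracker(threshold=3, window=1.0, max_friends=2)
for i in range(5):
    tracker.record_short_message(peer=7, now=0.1 * i)  # chatty neighbor
tracker.record_short_message(peer=42, now=0.5)         # occasional sender
print(sorted(tracker.friends))  # prints [7]
```

The cap on the number of friends is what bounds the memory: the P in the buffer total is replaced by a small constant.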

If you view an MPI job as an N-dimensional grid of processes, and if only adjacent MPI processes exchange short messages, this assumption holds up well.  Yay!  If not, the assumption breaks.  Boo!
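A small Python sketch shows why the grid case is so friendly.  It assumes “adjacent” means one step along a single axis in a non-periodic grid (no diagonals, no wraparound); that definition and the function name are my assumptions for illustration.

```python
def grid_neighbors(coords: tuple, dims: tuple) -> list:
    """Return the coordinates one step away from `coords` along each axis of a
    non-periodic grid whose extents are given by `dims`."""
    neighbors = []
    for axis, extent in enumerate(dims):
        for step in (-1, 1):
            c = coords[axis] + step
            if 0 <= c < extent:  # stay inside the grid
                neighbors.append(coords[:axis] + (c,) + coords[axis + 1:])
    return neighbors


dims = (32, 32, 32)  # 32,768 processes in a 3-dimensional grid
print(len(grid_neighbors((5, 5, 5), dims)))  # interior process: prints 6
print(len(grid_neighbors((0, 0, 0), dims)))  # corner process: prints 3
```

Under that assumption a process in a D-dimensional grid has at most 2*D neighbors no matter how many processes are in the job, so a handful of “friend” slots covers all of its short-message traffic.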

Another example of performance-vs-resources is unexpected messages.  What happens if an MPI process sends a message to a peer who has not posted a corresponding MPI receive yet — and therefore there is no buffer to receive the message into?  If the receiver always immediately accepts the entire message into a temporary buffer, it could hide a great deal of latency when it finally posts the corresponding receive (depending on the size of the message, when exactly the receiver posts the receive relative to the send, etc.).  But buffering all these “unexpected” messages could consume a lot of memory on the receiver.  And possibly even network resources (depending on the network type).

Most MPI implementations implement an “eager” receive for short unexpected messages, but may drop incoming messages after a certain threshold, just so that they don’t fill up the receiver’s memory with messages that it doesn’t want yet.  Long incoming messages use a rendezvous protocol: the bulk of the message is not transferred until a matching receive has been posted.
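The decision described above can be sketched as a toy Python model.  The class, the two limits, and the returned strings are hypothetical; they only mirror the eager / rendezvous / threshold logic and are not taken from any real MPI implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Receiver:
    """Toy model of one receiver's protocol choice; all limits are hypothetical."""
    eager_limit: int = 4096      # bytes; anything longer uses rendezvous
    unexpected_cap: int = 16384  # max bytes of unexpected data to buffer
    posted: set = field(default_factory=set)  # tags with a receive already posted
    unexpected_bytes: int = 0

    def on_incoming(self, tag: int, size: int) -> str:
        if size > self.eager_limit:
            # Only a small header arrives now; the bulk waits for a matching receive.
            return "rendezvous"
        if tag in self.posted:
            self.posted.discard(tag)
            return "eager, delivered to posted receive"
        if self.unexpected_bytes + size > self.unexpected_cap:
            return "refused, unexpected buffer full"
        self.unexpected_bytes += size
        return "eager, buffered as unexpected"


r = Receiver()
r.posted.add(1)                              # the app pre-posted a receive for tag 1
print(r.on_incoming(tag=1, size=100))        # matched: no temporary buffer needed
print(r.on_incoming(tag=3, size=1_000_000))  # long message: rendezvous
for _ in range(5):
    print(r.on_incoming(tag=2, size=4096))   # four are buffered, the fifth is refused
```

Note that the message with a pre-posted receive costs the receiver no temporary memory at all.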

These are two relatively straightforward issues, but there are many more.

The moral of this story is twofold:

  1. Write well-behaved apps that don’t have an excess of unexpected messages (or any, if you can help it).
  2. Check out the tunable parameters of your MPI implementation and set them for what your application needs.  Without setting them, the MPI implementation is purely guessing at what your application needs.  If you tell the MPI implementation what your application needs, you might see (substantial) improvements in performance with lower resource consumption.  That’s a good thing!

(BTW: I can’t claim to have come up with the quote that is the title of this entry; I found it here)

