Cisco Blogs


August 16, 2009

I find that there are generally two types of MPI application programmers:

  1. Those that only use standard (“blocking”) mode sends and receives
  2. Those that use non-blocking sends and receives

The topic of whether an MPI application should use only simple standard mode sends and receives or dive into the somewhat-more-complex non-blocking modes of communication comes up fairly often (it just came up again on the Open MPI users’ mailing list the other day). It’s always a challenge for programmers who are new to MPI to figure out which model they should use. Recently, we came across a user who chose a third solution: use MPI_SENDRECV.

The “MPI_SEND may block” issue is covered in many MPI internet tutorials, mailing lists, chat forums, and bulletin boards. Heck, it’s even explicitly described in the MPI-2.1 standard itself (see example 3.10 in section 3.5). So I won’t rehash all of that here.

The user that we came across recently could not easily re-order the standard mode sends and receives in his application because of the abstraction layers, complex distributed messaging patterns, etc. For similar reasons, the user did not want to delve into the complexity of non-blocking sends and receives. The user therefore opted for a fun solution: use MPI_SENDRECV.

Excellent! We love users who are adventurous and actually read up on the different options that are available. The user was relying on the following sentences from MPI-2.1 section 3.10:

If blocking sends and receives are used for such a shift, then one needs to order the sends and receives correctly (for example, even processes send, then receive, odd processes receive first, then send) so as to prevent cyclic dependencies that may lead to deadlock. When a send-receive operation is used, the communication subsystem takes care of these issues.
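The ordering rule in that passage can be checked with a small toy model. The sketch below is plain C, not MPI: it assumes the worst case the standard allows, namely that a blocking send cannot complete until the matching receive has been posted, and it simulates a ring shift in which every rank sends to its right neighbor and receives from its left. The names `ring_shift_completes`, `op_at`, and `MAX_RANKS` are made up for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_RANKS 64   /* arbitrary limit for this toy model */

typedef enum { OP_SEND, OP_RECV } Op;

/* Which operation a rank performs at step 0 or step 1 of the shift. */
static Op op_at(int rank, int step, bool parity_ordering)
{
    /* Without parity ordering every rank sends first. With it, even
       ranks send first and odd ranks receive first. */
    bool send_first = parity_ordering ? (rank % 2 == 0) : true;
    if (step == 0)
        return send_first ? OP_SEND : OP_RECV;
    return send_first ? OP_RECV : OP_SEND;
}

/* Toy model (assumption: sends never buffer). Returns true if every
   rank finishes both its send and its receive, false if the ring
   gets stuck. */
bool ring_shift_completes(int nranks, bool parity_ordering)
{
    int step[MAX_RANKS] = { 0 };   /* operations completed per rank */
    assert(nranks >= 2 && nranks <= MAX_RANKS);

    bool progress = true;
    while (progress) {
        progress = false;
        for (int r = 0; r < nranks; r++) {
            int right = (r + 1) % nranks;
            /* A send from r completes only while its right neighbor
               is currently sitting in its receive. */
            if (step[r] < 2 && step[right] < 2 &&
                op_at(r, step[r], parity_ordering) == OP_SEND &&
                op_at(right, step[right], parity_ordering) == OP_RECV) {
                step[r]++;
                step[right]++;
                progress = true;
            }
        }
    }

    for (int r = 0; r < nranks; r++)
        if (step[r] != 2)
            return false;
    return true;
}
```

In this model, `ring_shift_completes(4, false)` gets stuck because every rank is waiting in its send, while `ring_shift_completes(4, true)` finishes. A real MPI implementation may buffer small messages and let the unordered version through, which is exactly why the problem can stay hidden.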

While the user had done a valiant job of converting all of his code to use MPI_SENDRECV, he was confused about why it worked under MPI implementation X and deadlocked under MPI implementation Y.

The trap that this user had fallen into was thinking that MPI_SENDRECV should fix all deadlocks. If there is a cycle of dependencies within a single set of MPI_SENDRECV invocations across all involved processes, then send-receive is, indeed, guaranteed to prevent the deadlock. Unfortunately, it is possible to construct a scenario where MPI_SENDRECV may deadlock: one where the cycle of dependencies spans more than one MPI_SENDRECV call per process. The key is the following text, slightly further down in MPI-2.1 section 3.10:

The semantics of a send-receive operation is what would be obtained if the caller forked two concurrent threads, one to execute the send, and one to execute the receive, followed by a join of these two threads.

Therefore, MPI_SENDRECV suffers from exactly the same “MPI_SEND may block” issue as regular standard mode sends. To be clear: if an MPI_SENDRECV cannot complete until its send completes, and the matching receive has not yet been posted (for example, because it is the receive side of a subsequent call to MPI_SENDRECV), deadlock may occur.

For example, assuming two MPI processes and no other concurrent sending or receiving, the following may deadlock:

    if (0 == rank) {
        MPI_Sendrecv(buf1, size, MPI_CHAR, 1, 1111,
                     buf2, size, MPI_CHAR, 1, 2222, comm, &status);
        MPI_Sendrecv(buf3, size, MPI_CHAR, 1, 3333,
                     buf4, size, MPI_CHAR, 1, 4444, comm, &status);
    } else if (1 == rank) {
        MPI_Sendrecv(buf1, size, MPI_CHAR, 0, 4444,
                     buf2, size, MPI_CHAR, 0, 1111, comm, &status);
        MPI_Sendrecv(buf3, size, MPI_CHAR, 0, 2222,
                     buf4, size, MPI_CHAR, 0, 3333, comm, &status);
    }
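One way to see why the fragment above can finish on one implementation and hang on another is a toy model of its tag matching. The sketch below is plain C, not MPI, and it rests on an assumption: a send either completes immediately because the message is buffered ("eager", as many implementations do for small messages), or cannot complete until the peer has posted the matching receive ("rendezvous", common for large messages). The names `SendrecvCall`, `sendrecv_sequence_completes`, `blog_example_completes`, and `MAX_CALLS` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_CALLS 8   /* arbitrary limit for this toy model */

/* One MPI_Sendrecv call between two ranks, reduced to its two tags. */
typedef struct {
    int send_tag;
    int recv_tag;
} SendrecvCall;

/* Toy model, not MPI. Each rank runs its calls in order and only the
   current call's send and receive are active. Returns true if both
   ranks finish every call, false if they get stuck. */
bool sendrecv_sequence_completes(const SendrecvCall *rank0,
                                 const SendrecvCall *rank1,
                                 int ncalls, bool eager)
{
    const SendrecvCall *calls[2] = { rank0, rank1 };
    int idx[2] = { 0, 0 };                 /* current call per rank */
    bool send_done[2] = { false, false };
    bool recv_done[2] = { false, false };
    int buffered[2][MAX_CALLS];            /* eagerly delivered tags */
    int nbuffered[2] = { 0, 0 };

    assert(ncalls >= 0 && ncalls <= MAX_CALLS);

    bool progress = true;
    while (progress) {
        progress = false;
        for (int r = 0; r < 2; r++) {
            int peer = 1 - r;
            if (idx[r] >= ncalls)
                continue;
            const SendrecvCall *mine = &calls[r][idx[r]];

            if (!send_done[r]) {
                if (eager) {
                    /* Message is buffered at the peer; send completes. */
                    buffered[peer][nbuffered[peer]++] = mine->send_tag;
                    send_done[r] = true;
                    progress = true;
                } else if (idx[peer] < ncalls && !recv_done[peer] &&
                           calls[peer][idx[peer]].recv_tag == mine->send_tag) {
                    /* Rendezvous: needs the peer's matching receive
                       to be posted right now. */
                    send_done[r] = true;
                    recv_done[peer] = true;
                    progress = true;
                }
            }
            if (!recv_done[r]) {
                /* Look for a buffered message with the wanted tag. */
                for (int i = 0; i < nbuffered[r]; i++) {
                    if (buffered[r][i] == mine->recv_tag) {
                        buffered[r][i] = buffered[r][--nbuffered[r]];
                        recv_done[r] = true;
                        progress = true;
                        break;
                    }
                }
            }
            if (send_done[r] && recv_done[r]) {
                /* Both halves finished: move on to the next call. */
                idx[r]++;
                send_done[r] = recv_done[r] = false;
                progress = true;
            }
        }
    }
    return idx[0] == ncalls && idx[1] == ncalls;
}

/* The tag pattern from the two-process example above. */
bool blog_example_completes(bool eager)
{
    const SendrecvCall rank0[] = { { 1111, 2222 }, { 3333, 4444 } };
    const SendrecvCall rank1[] = { { 4444, 1111 }, { 2222, 3333 } };
    return sendrecv_sequence_completes(rank0, rank1, 2, eager);
}
```

In this model `blog_example_completes(true)` finishes and `blog_example_completes(false)` gets stuck: rank 0 waits for tag 2222, which rank 1 only sends in its second call, while rank 1 waits for its tag 4444 send to be matched by rank 0's second call.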

Try it in your favorite MPI implementation, varying the value of “size.” You may see it complete in some cases, but deadlock in others. Both behaviors are correct as defined by the MPI standard.

Note that if this code is converted to use non-blocking calls, it will never deadlock:

    MPI_Request req[4];

    /* Each rank sends and receives with the same buffers and tags as
       in the MPI_Sendrecv version above. */
    if (0 == rank) {
        MPI_Isend(buf1, size, MPI_CHAR, 1, 1111, comm, &req[0]);
        MPI_Isend(buf3, size, MPI_CHAR, 1, 3333, comm, &req[1]);
        MPI_Irecv(buf2, size, MPI_CHAR, 1, 2222, comm, &req[2]);
        MPI_Irecv(buf4, size, MPI_CHAR, 1, 4444, comm, &req[3]);
    } else if (1 == rank) {
        MPI_Isend(buf1, size, MPI_CHAR, 0, 4444, comm, &req[0]);
        MPI_Isend(buf3, size, MPI_CHAR, 0, 2222, comm, &req[1]);
        MPI_Irecv(buf2, size, MPI_CHAR, 0, 1111, comm, &req[2]);
        MPI_Irecv(buf4, size, MPI_CHAR, 0, 3333, comm, &req[3]);
    }
    /* All four requests are progressed together here. */
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

In the above example, all the sends and receives are started, and then MPI gets to progress all of them simultaneously in MPI_WAITALL, rather than just individual pairs of them, as in MPI_SENDRECV. Hence, no deadlock.

While the ordering of the individual calls to MPI_ISEND and MPI_IRECV does not matter for correctness (because they’re all progressed in MPI_WAITALL, regardless of the initial posting order), the ordering may matter for efficiency. But that’s a matter for a future blog entry…
