I recently received an interesting email from Heiko Bauke about a new C++-based message-passing library that he is working on called MPL.
His library aims to be simple to use while exploiting the features available in modern C++ compilers; it is not a simple 1:1 mapping of the C bindings into C++ like the MPI 2.x C++ bindings were.
Heiko provides three blog entries about MPL:
In MPL, you see the obvious “comm.rank()” and “comm.size()” object abstractions that you would expect, but also more interesting use of templates, the STL, and even anonymous lambda functions. For example:
comm_world.reduce(mpl::min<pair_t>(), 0, v.data(), result.data(), layout);
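For a flavor of the API, here is a minimal sketch of a complete MPL program with a reduction, modeled on the examples in Heiko’s blog entries (the header path, mpl::environment::comm_world(), and the reduce overloads are taken from those examples and may differ in the current MPL sources):

#include <cstdlib>
#include <iostream>
#include <mpl/mpl.hpp>  // assumed header location

int main() {
  // MPL manages the MPI environment itself; no explicit MPI_Init / MPI_Finalize.
  const mpl::communicator &comm_world = mpl::environment::comm_world();
  double x = static_cast<double>(comm_world.rank());
  if (comm_world.rank() == 0) {
    double sum = 0;
    // reduce with a functor object, analogous to the mpl::min<pair_t>() call above
    comm_world.reduce(mpl::plus<double>(), 0, x, sum);
    std::cout << "sum of ranks across " << comm_world.size()
              << " processes: " << sum << '\n';
  } else {
    comm_world.reduce(mpl::plus<double>(), 0, x);
  }
  return EXIT_SUCCESS;
}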
Fun stuff!
MPL is a great example of layering the power of a native language on top of underlying MPI functionality. That is, let app developers have the full power of their language, and let the library provide the plumbing to get down to the native / system-level abstractions.
Any plans to provide a way to compose non-blocking MPI operations (akin to std::future’s .then()/when_all()/when_any())?
Right now one has to use a separate thread to achieve this, and thus non-blocking operations are not even useful.
I wouldn’t say that the non-blocking collectives are useless; they simply aren’t optimal for the use case you have in mind. They are certainly well-suited for other use cases. I’ll ask Heiko to see if he can answer your question here.
The “challenge” is having two MPI processes do the following in a completely non-blocking way (I use non-blocking send and recv for exposition, but a similar example using MPI_Iprobe could be constructed):
Process A:
non-blocking sends size of array
non-blocking send array
Process B:
non-blocking receive size of array
non-blocking allocate buffer
non-blocking receive array on buffer
In C++ pseudo-code:
Process A:
std::vector<int> data {1, 2, 3, 4};
auto request
    = nb_send(process_b, tag_0, data.size())
          .then([&]() {  // schedules a continuation:
              nb_send(process_b, tag_1, data.data(), data.size());
          });
// application code follows {
//   not a single call to MPI_Test in here:
//   the second send is scheduled as soon as the first
//   one completes
// }
request.get();  // blocks until both sends complete
Process B:
std::size_t size;
std::vector<int> buffer;
auto request
    = nb_recv(process_a, tag_0, &size)
          .then([&]() {  // schedules a continuation:
              // allocate memory for the buffer before the second recv
              buffer.resize(size);
              nb_recv(process_a, tag_1, buffer.data(), size);
          });
// application code {
//   not a single call to MPI_Test in here:
//   as soon as the first recv completes, the continuation
//   runs asynchronously, memory is allocated,
//   and the second recv is scheduled
// }
request.get();  // blocks until both recvs complete
The main net win is that one fires all the communication logic at the beginning and blocks at the very end. That is, the unrelated application code in the middle doesn’t need to, e.g., call a function “every now and then” just to advance the continuations.
I don’t see how to do this without spawning a different thread that continuously loops over the requests calling MPI_Test to advance to the next continuation.
If one has to spawn a different thread anyway, the advantage of non-blocking MPI calls is still there, but it is less important (they allow cool things like having a pool of requests + continuations).
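For concreteness, here is a minimal sketch of such a helper thread, assuming MPI was initialized with MPI_THREAD_MULTIPLE so that MPI_Test may be called from a second thread (the progress_engine/pending_op names and the std::function-based continuations are purely illustrative, not part of MPL or of any MPI implementation):

#include <mpi.h>
#include <atomic>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

struct pending_op {
  MPI_Request request;
  std::function<void()> continuation;  // runs once the request completes
};

class progress_engine {
public:
  // register a request plus the continuation to run when it completes
  void add(MPI_Request req, std::function<void()> cont) {
    std::lock_guard<std::mutex> lock(mtx_);
    ops_.push_back({req, std::move(cont)});
  }

  ~progress_engine() {
    done_ = true;
    worker_.join();
  }

private:
  void loop() {
    while (!done_) {
      std::vector<std::function<void()>> ready;
      {
        std::lock_guard<std::mutex> lock(mtx_);
        for (auto it = ops_.begin(); it != ops_.end();) {
          int flag = 0;
          MPI_Test(&it->request, &flag, MPI_STATUS_IGNORE);
          if (flag) {
            ready.push_back(std::move(it->continuation));
            it = ops_.erase(it);
          } else {
            ++it;
          }
        }
      }
      // Run completed continuations outside the lock so they can safely
      // call add() themselves, e.g. resize a buffer and post the next recv.
      for (auto &cont : ready)
        cont();
      std::this_thread::yield();  // crude polling; a real engine would back off
    }
  }

  std::vector<pending_op> ops_;
  std::mutex mtx_;
  std::atomic<bool> done_{false};
  std::thread worker_{[this] { loop(); }};  // members above are constructed first
};

The pool of pending_op entries is essentially the “pool of requests + continuations” mentioned above; the cost is the extra polling thread that the original question was trying to avoid.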
P.S.: One could have just sent the whole array with a single send, but the receiving end would still need to repeatedly call MPI_Iprobe before executing the continuation (I chose the two sends + two receives for exposition purposes).
It seems that I screwed up the formatting completely. You can read my previous comment properly formatted here:
https://ideone.com/kqhllQ
Yes, the Cisco blog system doesn’t let you format the replies nicely — thanks for posting the code on ideone; that made it significantly easier to read.
A few points…
1. You’re using significantly more advanced C++ than I can comment on; we’ll have to wait for Heiko to reply to your specific question.
2. FWIW, there are also MPI_Mprobe and MPI_Improbe, which at least make the alternate Probe/Iprobe method “safe” (see the sketch after this list).
3. Asynchronous progress is a hot topic these days. The *intent* is that applications do not need to call an MPI function just to get progress on ongoing non-blocking communications, but implementations aren’t fully there yet.
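To illustrate point 2, here is a minimal sketch of the matched-probe pattern using plain MPI calls (the int payload and the helper function name are illustrative). Unlike MPI_Iprobe followed by MPI_Recv, MPI_Improbe atomically removes the matched message from the matching queue, so no other thread can intercept it between the probe and the receive:

#include <mpi.h>
#include <vector>

// Poll once for a message of ints from any source; if one has been matched,
// allocate exactly the right amount of space and receive it.
void try_receive_one(MPI_Comm comm) {
  int flag = 0;
  MPI_Message msg;
  MPI_Status status;
  MPI_Improbe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &flag, &msg, &status);
  if (flag) {
    int count = 0;
    MPI_Get_count(&status, MPI_INT, &count);
    std::vector<int> buffer(count);  // sized from the matched message
    MPI_Mrecv(buffer.data(), count, MPI_INT, &msg, MPI_STATUS_IGNORE);
    // ... hand the buffer off to the continuation ...
  }
}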
The Message Passing Library (MPL) is based on MPI. All message passing is realized by MPI. MPL is basically a lightweight convenience wrapper for MPI. At the moment it is not intended to provide new message passing functionality that goes beyond the capabilities of the MPI standard. I think one has to be very careful not to introduce possible sources of unwanted side effects when combining MPI functions into more advanced message passing patterns, in particular when it comes to non-blocking operations.
Composing non-blocking MPI operations sounds interesting. However, I doubt that this would give significant performance advantages as long as non-blocking MPI communication makes progress only when MPI_Test* or MPI_Wait* is called. In my opinion, such fancy stuff as you have in mind only becomes interesting when MPI implementations allow threading at the level of MPI_THREAD_MULTIPLE.
FWIW, MPI implementations are supposed to make progress on non-blocking communications (collective and otherwise) even while the application is not inside the MPI library (e.g., in MPI_Test and friends).
That being said, it’s still early days for non-blocking collective implementations; other than some hardware-offload approaches, there has been limited availability of implementations that are truly asynchronous and do not require dipping into the MPI library. Give it some time.
Heiko:
With respect to “when MPI implementations allow threading on the level of MPI_THREAD_MULTIPLE,” many implementations support this. It is just that Open MPI does not. MPICH and its derivatives – MVAPICH2, Intel MPI, Cray MPI, etc. – have supported MPI_THREAD_MULTIPLE for many years. I use it all the time.
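Requesting that level is just the standard MPI_Init_thread call; here is a minimal, implementation-agnostic sketch of asking for it and checking what the library actually grants:

#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
  int provided = MPI_THREAD_SINGLE;
  // Ask for full multi-threaded support; the library reports what it grants.
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
  if (provided < MPI_THREAD_MULTIPLE)
    std::fprintf(stderr, "MPI_THREAD_MULTIPLE not available (provided=%d)\n",
                 provided);
  MPI_Finalize();
  return 0;
}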
Granted, there may be performance issues [http://www.mcs.anl.gov/~balaji/pubs/2015/ppopp/ppopp15.mpi_threads.pdf] but they are not intractable (e.g. http://dx.doi.org/10.1109/IPDPS.2012.73 and http://sc15.supercomputing.org/schedule/event_detail?evid=pap145).
Please try MPICH or a derivative thereof next time you want full support for MPI-3 and send Jeff Squyres a note once a week to inquire about the status of https://github.com/open-mpi/ompi/issues/157 🙂
Best,
Jeff
MPL on MPI… Fun stuff!
thanks for posting!