Top 10 reasons why buffered sends are evil
I gave a brief explanation in a comment reply, but the subject is enough to warrant its own blog entry.
So here it is — my top 10 reasons why MPI_BSEND (and its two variants) are evil:
- Buffered sends generally force an extra copy of the outgoing message (i.e., a copy from the application’s buffer to internal MPI storage). Note that I said “generally” — an MPI implementation doesn’t have to copy. But the MPI standard says “Thus, if a send is executed and no matching receive is posted, then MPI must buffer the outgoing message…” Ouch. Most implementations just always copy the message and then start processing the send.
- The copy from the application buffer to internal storage not only consumes resources, it also takes time. In a poor implementation, the entire copy may have to complete before any of the message is sent, so the copy and the transmission are not overlapped at all.
- The application must allocate the internal buffer space and attach it via MPI_BUFFER_ATTACH. This doesn’t allow the MPI implementation to allocate “special” memory which may be optimized for the underlying network transport used to reach a specific peer MPI process (e.g., pinned memory, or device-specific memory).
- One of the arguments for buffered sends is that MPI_BUFFER_ATTACH allows the application to control how much buffering space is used. However, many MPI implementations have other, more precise / fine-grained mechanisms to control how much internal buffering is used. Granted, these mechanisms are not portable between MPI implementations, but if you’re worried about buffering space, it’s worth reading a few man pages.
- Another problem with MPI_BUFFER_ATTACH is that you can’t extend (or reduce) how much memory is attached while buffered send operations are ongoing. You can only detach it and then re-attach a different buffer (presumably one with more or less memory than the original) after all buffered send operations complete.
- What is the point of MPI_IBSEND? It’s non-blocking, so why wouldn’t you just call a normal MPI_ISEND? MPI_IBSEND gives you local completion semantics, but it potentially forces a memory copy — i.e., more overhead/less performance — so why use it?
- MPI is commonly used in environments with lots of memory. I believe that 2GB/core is pretty typical these days. But keep in mind that COTS servers used to make HPC clusters these days are also increasing cores, cache, and RAM. 16GB/core may become common before you know it. And if your RAM is so large (by today’s standards), why buffer? Or, put differently: if your data is large, then you probably don’t want to use 2x of it just to buffer outgoing messages.
- When using any flavor of a buffered send, MPI defines the completion semantics to be local, meaning that you have no indication of when the message is actually transmitted. This is a somewhat weak argument because it’s (more or less) the same lack of guarantee that normal MPI_SEND provides.
- Since the application has to attach the internal storage that MPI_BSEND uses, it’s statically attached and doesn’t decrease due to lack of use. If, instead, an MPI implementation buffers at its own discretion, it can release unused buffer space back to the application.
- MPI-3.0 removed the restriction that prohibited reading from a buffer that is being used in an ongoing send operation (most MPI implementations didn’t care about that restriction, anyway). Hence, you can start an MPI_ISEND on a buffer and then continue to use it for read-only operations before the ISEND completes. There’s no need to get that buffer back ASAP just so that you can keep reading it in your application while the communication continues in the background.
In short, it is much better to let the MPI implementation decide to use buffering if it wants to. Indeed, most MPI implementations have good segmentation and pipelining engines to efficiently overlap the copy of a large message with the processing of its send. But these code paths are optimized for specific scenarios, so it is almost always better to let the MPI implementation choose when to use them (vs. having the application choose).
Indeed, even if your MPI is good at pipelining the copy to internal storage, it still might have to pipeline again to send from that internal storage. So you still pay an overhead as compared to just sending straight from your source buffer.
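As a rough illustration of that double handling, here is a toy C model. It is not how any real MPI implementation works: the "network" is just a memcpy with a byte counter, and every name in it (toy_link, wire_put, send_direct, send_staged) is made up for this sketch. All it shows is that a segmented, staged send touches every byte twice, while a segmented direct send touches it once.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a network link to one peer. */
typedef struct {
    unsigned char *dst;    /* receiver-side memory */
    size_t bytes_on_wire;  /* bytes handed to the "network" */
    size_t bytes_staged;   /* bytes copied into internal storage first */
} toy_link;

/* "Transmit" one segment: copy it to the receiver and count it. */
static void wire_put(toy_link *link, size_t offset,
                     const unsigned char *src, size_t len)
{
    memcpy(link->dst + offset, src, len);
    link->bytes_on_wire += len;
}

/* Direct path: each segment goes straight from the application buffer. */
void send_direct(toy_link *link, const unsigned char *app,
                 size_t len, size_t seg)
{
    assert(seg > 0);
    for (size_t off = 0; off < len; off += seg) {
        size_t n = (len - off < seg) ? len - off : seg;
        wire_put(link, off, app + off, n);
    }
}

/* Staged path: each segment is first copied into a staging area of at
 * least seg bytes, and only then sent from there. */
void send_staged(toy_link *link, const unsigned char *app, size_t len,
                 unsigned char *staging, size_t seg)
{
    assert(seg > 0);
    for (size_t off = 0; off < len; off += seg) {
        size_t n = (len - off < seg) ? len - off : seg;
        memcpy(staging, app + off, n);   /* the extra copy */
        link->bytes_staged += n;
        wire_put(link, off, staging, n);
    }
}
```

In this model, a message of len bytes always ends up with bytes_on_wire equal to len, but the staged path also accumulates bytes_staged equal to len. A real pipelining engine can overlap the two steps, but it cannot make the first copy free.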
It all boils down to: by (essentially) forcing the MPI implementation to buffer a message, you may be forcing sub-optimal behavior and potentially additional consumption of resources.