
Top 10 reasons why buffered sends are evil

February 13, 2012 at 5:00 am PST

I made an offhand remark in my last entry about how MPI buffered sends are evil.  In a comment on that entry, @brockpalen asked me why.

I gave a brief explanation in a comment reply, but the subject is enough to warrant its own blog entry.

So here it is — my top 10 reasons why MPI_BSEND (and its two variants) are evil:

  1. Buffered sends generally force an extra copy of the outgoing message (i.e., a copy from the application’s buffer to internal MPI storage).  Note that I said “generally” — an MPI implementation doesn’t have to copy.  But the MPI standard says “Thus, if a send is executed and no matching receive is posted, then MPI must buffer the outgoing message…”  Ouch.  Most implementations just always copy the message and then start processing the send.
  2. The copy from the application buffer to internal storage not only consumes resources, it also takes time.  In a poor implementation, the entire copy may complete before the message is sent.
  3. The application must allocate the internal buffer space and attach it via MPI_BUFFER_ATTACH (see the sketch after this list).  This doesn’t allow the MPI implementation to allocate “special” memory which may be optimized for the underlying network transport used to reach a specific peer MPI process (e.g., pinned memory, or device-specific memory).
  4. One of the arguments for buffered sends is that MPI_BUFFER_ATTACH allows the application to control how much buffering space is used.  However, many MPI implementations have other, more precise / fine-grained mechanisms to control how much internal buffering is used.  Granted, these mechanisms are not portable between MPI implementations, but if you’re worried about buffering space, it’s worth reading a few man pages.
  5. Another problem with MPI_BUFFER_ATTACH is that you can’t extend (or reduce) how much memory is attached while buffered send operations are ongoing.  You can only detach it and then re-attach a different buffer (presumably one with more or less memory than the original) after all buffered send operations complete.
  6. What is the point of MPI_IBSEND?  It’s non-blocking, so why wouldn’t you just call a normal MPI_ISEND?  MPI_IBSEND gives you local completion semantics, but it potentially forces a memory copy — i.e., more overhead/less performance — so why use it?
  7. MPI is commonly used in environments with lots of memory.  I believe that 2GB/core is pretty typical these days.  But keep in mind that COTS servers used to make HPC clusters these days are also increasing cores, cache, and RAM.  16GB/core may become common before you know it.  And if your RAM is so large (by today’s standards), why buffer?  Or, put differently: if your data is large, then you probably don’t want to use 2x of it just to buffer outgoing messages.
  8. When using any flavor of a buffered send, MPI defines the completion semantics to be local, meaning that you have no indication of when the message is actually transmitted.  This is a somewhat weak argument because it’s (more or less) the same lack of guarantee that normal MPI_SEND provides.
  9. Since the application has to attach the internal storage that MPI_BSEND uses, it’s statically attached and doesn’t decrease due to lack of use.  If, instead, an MPI implementation buffers at its own discretion, it can release unused buffer space back to the application.
  10. MPI-3.0 has removed the restriction that disallowed reading from a buffer that is being used in an ongoing send operation (most MPI implementations didn’t care about that restriction, anyway).  Hence, you can start an MPI_ISEND on a buffer and then continue to use it for read-only operations before the ISEND completes.  There’s no need to buffer just to get that buffer back ASAP; you can keep using it in your application while the communication continues in the background.
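
For the curious, here is a minimal C sketch (my own illustration, not from any particular application; the message size, peer ranks, and tags are placeholders) of the bookkeeping that points 3 through 5 describe, next to the plain non-blocking alternative:

    #include <mpi.h>
    #include <stdlib.h>

    /* Sketch only: contrasts application-managed buffering with letting MPI decide. */
    void bsend_vs_isend(double *data, int n, int peer, MPI_Comm comm)
    {
        /* Buffered path: the application must size and attach MPI's internal storage. */
        int pack_size;
        MPI_Pack_size(n, MPI_DOUBLE, comm, &pack_size);
        int attach_size = pack_size + MPI_BSEND_OVERHEAD;
        char *attached = malloc(attach_size);
        MPI_Buffer_attach(attached, attach_size);

        MPI_Bsend(data, n, MPI_DOUBLE, peer, 0, comm);   /* typically copies data */

        /* The attached buffer can only be reclaimed after all buffered sends complete. */
        char *old_buf;
        int old_size;
        MPI_Buffer_detach(&old_buf, &old_size);
        free(old_buf);

        /* Alternative: let the implementation decide whether/how to buffer. */
        MPI_Request req;
        MPI_Isend(data, n, MPI_DOUBLE, peer, 1, comm, &req);
        /* ... overlap useful work here ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }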

In short, it is much better to let the MPI implementation decide whether to use buffering.  Indeed, most MPI implementations have good segmentation and pipelining engines that efficiently overlap copying a large message with processing its send.  But these code paths are optimized for specific scenarios; it is almost always better to let the MPI implementation choose when to use them (vs. having the application choose).

Indeed, even if your MPI is good at pipelining the copy to internal storage, it still might have to pipeline again to send from that internal storage.  So you still pay an overhead as compared to just sending straight from your source buffer.

It all boils down to: by (essentially) forcing the MPI implementation to buffer a message, you may be forcing sub-optimal behavior and potentially additional consumption of resources.



6 Comments.


  1. Thanks, that was a nice summary! I always wondered why I would want to use buffered sends, when I essentially already have a buffer (the memory block I want to send). Double-buffering for memory-bound jobs just doesn’t seem right :) Especially since I need twice the memory already (XGB sending out, XGB receiving).

    But I honestly missed the part that until the (I)send completes, the application is not supposed to access the source buffer at all. I always thought that the application (obviously) must not write anything into the buffer that is being sent, but read access would be fine. Would this
    double buf[10];
    MPI_Isend(buf, 10, MPI_DOUBLE, 1, …);
    MPI_Isend(buf, 10, MPI_DOUBLE, 2, …);
    be illegal MPI (ignoring that one would probably bcast)? Amazing.

    I am not up to date on ‘const’, but if MPI-3 loosens the constraint from ‘no access’ to ‘only read access’, I would assume the function signature has a ‘const’ now? :)

    Completely unrelated, but do you know if ‘restrict’ is used in MPI-3 (e.g. in Sendrecv), or if it is discussed at all?

    Thanks for blogging about MPI, I always read the posts with interest (and do learn new tricks once in a while)!


    • February 15, 2012 at 7:24 am

      Yes, MPI-3 has loosened the restriction on accessing buffers associated with incomplete send requests: you can now treat them in a read-only manner and all is fine. Truthfully, most MPI implementations didn’t care and allowed this behavior anyway. But technically (i.e., according to the MPI spec), it was incorrect.

      And yes, MPI-3 has added “const” to its C bindings where relevant. No “restrict”, though — that hasn’t been brought up in the Forum.
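
      A minimal sketch of the pattern from the question above, with the elided arguments filled in as placeholders (tag 0, MPI_COMM_WORLD, explicit requests); under the MPI-3 read-only rule, this is legal:

        /* Sketch only: tags, communicator, and request handling are placeholders. */
        double buf[10];
        MPI_Request reqs[2];
        /* ... fill buf ... */
        MPI_Isend(buf, 10, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(buf, 10, MPI_DOUBLE, 2, 0, MPI_COMM_WORLD, &reqs[1]);
        /* under MPI-3, buf may be read (but not written) here */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);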


  2. Okay, after carefully reading the MPI specs again, I indeed did find the spot that prohibits read access to send buffers (though I wouldn’t claim that I understand the rationale alluding to performance reasons and a non-cache-coherent DMA engine). I mean, if cache coherence is an argument, then wouldn’t the application need to do pretty bizarre checks to make sure it doesn’t become an illegal MPI program?

    What would happen if I have two non-blocking sends that operate on the same source array, but one is sending the odd elements, the other the even ones? Both sends would access disjoint memory regions, true, but they are interleaved and that would certainly do something funny to caching.

    Hmm, okay, I guess I am just happy that in practice it wasn’t an issue and with MPI-3 it is also officially correct to not worry about that :)

    Thanks for the const update. I guess ‘restrict’ isn’t much of an issue anyway (performance-wise); it would just expose, in the API, the requirement that two buffers are not aliased.


    • February 15, 2012 at 10:05 am

      I think “restrict” gets hairy for exactly the case you cite — with MPI datatypes, you can have overlapping “buffers” (e.g., pages), but still have mutually-exclusive read patterns.
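
      As a rough illustration (mine; the counts, tags, and peer ranks are made up), the odd/even case above can be written with a strided datatype so that both pending sends read disjoint but interleaved elements of the same array:

        /* Sketch only: two outstanding sends over the same array, touching
           mutually exclusive elements via a strided datatype. */
        double buf[10];
        MPI_Datatype every_other;
        MPI_Type_vector(5, 1, 2, MPI_DOUBLE, &every_other);  /* 5 blocks of 1, stride 2 */
        MPI_Type_commit(&every_other);

        MPI_Request reqs[2];
        MPI_Isend(buf,     1, every_other, 1, 0, MPI_COMM_WORLD, &reqs[0]);  /* even indices */
        MPI_Isend(buf + 1, 1, every_other, 2, 0, MPI_COMM_WORLD, &reqs[1]);  /* odd indices */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        MPI_Type_free(&every_other);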


  3. I’m late to the party, but I object to point 7 (“16GB/core may become common before you know it”). This is exactly the opposite of the actual trend: while memory per compute node (and usually per socket) may be increasing, memory per core is definitely going down. Memory (and its accompanying bandwidth) is becoming the most expensive and power-hungry component of HPC systems. Since flops and cores are becoming so inexpensive and the marginal benefit is high for some applications, we get lots of cores (even if “most” applications are limited by memory bandwidth and network latency).

    However, avoiding unnecessary copies is still good in this environment. Also, if the attached buffer resides in a bad NUMA region (e.g. on a memory bus of a distant socket), the copy could be more expensive than putting the bytes on the network.

    Note that using one MPI process per node (with threading inside) may alleviate the memory pressure, but then the caller has to deal with threading their application using inept programming models that are library-unfriendly and have poor control of memory affinity, leading to inconsistent and often poor performance for memory-bound tasks. /rant


  4. March 13, 2012 at 8:07 am

    Jed: Fair enough points.

    But I still think my statement is correct: “But keep in mind that COTS servers used to make HPC clusters these days are also increasing cores, cache, and RAM. 16GB/core may become common before you know it.”

    Cisco makes servers that can hold 1TB of RAM — with 64 hyperthreads, that’s 16GB/core. Other companies make large-RAM-capable machines, too. Whether HPC environments choose to use this much RAM or not is a different issue — so perhaps I was being a bit too coarse-grained in my remarks.

    However, I have seen US DOE labs buy N-core machines (where N>=16) simply so that they could get more RAM. They would run MPI jobs with half the cores in the machine, but with all the RAM, simply because they had “big data” kinds of computations to run. Big Data apps are real. With intelligent MPI non-blocking communication, you can do things like fill half your memory with active data, use the processors to chomp on that data for a while, and have some non-blocking communication filling in the other half of your RAM for the next iterative round of data chomping (while you’re presumably evicting the first set of data and filling it with a 3rd set). And so on (see the sketch below).

    That’s more of where I was going; sorry if I didn’t explain that well.
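
    For illustration, here is a hedged sketch of that kind of pipeline; process_chunk(), the source rank, and the two buffer halves are hypothetical stand-ins, not anything from a real code:

      /* Sketch of the fill-one-half-while-chomping-the-other pattern. */
      void process_chunk(double *data, int n);   /* hypothetical application kernel */

      void pipeline(double *half_a, double *half_b, int chunk_len,
                    int src, int niters, MPI_Comm comm)
      {
          double *work = half_a, *incoming = half_b;
          for (int iter = 0; iter < niters; iter++) {
              MPI_Request req;
              /* start filling the idle half in the background */
              MPI_Irecv(incoming, chunk_len, MPI_DOUBLE, src, iter, comm, &req);

              process_chunk(work, chunk_len);    /* chomp on the active half */

              MPI_Wait(&req, MPI_STATUS_IGNORE); /* the next chunk has arrived */
              double *tmp = work; work = incoming; incoming = tmp;  /* swap halves */
          }
      }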
