It’s the eternal question: should I send lots and lots of small messages, or should I glump multiple small messages into a single, bigger message?
Unfortunately, the answer is: it depends. There are a lot of factors in play.
The purist answer is, “it doesn’t matter.” MPI does not distinguish between “small” and “large” messages. Indeed, many may be surprised to learn that even though you may hear a lot about MPI’s “eager limit,” the MPI specification defines no such term. The “eager limit” is an artifact of an implementation technique that is common to many MPI implementations, but it is not part of the standard itself.
From an MPI standard perspective, sending one message is pretty much the same as sending another.
But in reality, it can be quite different. Here are a few factors that come into play:
- MPI implementations need to send your data across a network (which may even be shared memory) between peer MPI processes. A certain amount of metadata must be sent with your MPI message, including information such as the communicator on which the message was sent, the message tag, the sending process’s ID, and possibly network transport information such as a MAC address, IP address, or OpenFabrics LID. For efficient communication, you may need to minimize the ratio of metadata to message data in a single network frame.
- Some networks will pad frames to a minimum size. For example, if the MPI metadata plus your message data is 32 bytes, the operating system, NIC driver, NIC hardware, and/or network switch may pad the message to a minimum of 64 bytes, just so that hardware ASICs can perform more efficiently by guaranteeing a minimum size for all messages. This padding may hurt the latency of an individual small message, but it can make hardware transfers more efficient overall, for example at larger message sizes.
- The network fabric itself has a raw data rate which serves as an upper bound on how many small messages can be sent per second. For example, QDR IB’s 32 Gb/s data rate means that it can send, at most, about 62.5M 64-byte frames per second (32 Gb/s divided by 512 bits per frame). I arbitrarily picked a 64-byte frame size, which includes the OpenFabrics header size, MPI metadata, and MPI message size; your frame size may vary, depending on these factors.
- The small message injection rate is typically the maximum rate at which your NIC can inject small messages onto the network. The higher the quality of the NIC (and switch), the closer this rate gets to the maximum theoretical value cited in the previous bullet. But remember: in “short” messages, the ratio of metadata to message data is typically high, so even if the injection rate is high, overall network efficiency may end up being low.
- Today’s CPU and RAM technologies are getting more and more impressive. You need to compare the speed of memory copies vs. the small message injection rate of your network. For example, there is likely a message size under which it is more efficient to copy N MPI message data buffers into a single, larger buffer and send just 1 MPI message (vs. sending N MPI messages). Memory copies are highly optimized these days, but can be affected by all kinds of factors such as process and memory locality. You’ll need to ensure that your memory copy source and destination are local, that the copying process is pinned, and you’ll need to know when (or if) the MPI implementation is already aggregating messages for you, etc.
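The frame-rate bound and the metadata-to-data ratio from the bullets above come down to some back-of-the-envelope arithmetic. Here is a minimal sketch in plain C; it is not taken from any MPI implementation, and the function names are made up purely for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Upper bound on how many frames of `frame_bytes` bytes a link with the
 * given data rate (in bits per second) can carry each second. Real NICs
 * and switches will land somewhere below this number. */
double max_frames_per_second(double data_rate_bits_per_s, uint32_t frame_bytes)
{
    assert(frame_bytes > 0);
    return data_rate_bits_per_s / (8.0 * (double)frame_bytes);
}

/* Fraction of the bytes on the wire that are application payload, given a
 * hypothetical per-frame overhead (headers plus MPI metadata). */
double payload_fraction(uint32_t payload_bytes, uint32_t overhead_bytes)
{
    assert(payload_bytes + overhead_bytes > 0);
    return (double)payload_bytes /
           ((double)payload_bytes + (double)overhead_bytes);
}
```

For instance, `max_frames_per_second(32e9, 64)` works out to 62,500,000, since 32 Gb/s divided by 512 bits per frame is 62.5M. If you assume (again, arbitrarily) that 56 of those 64 bytes are overhead, `payload_fraction(8, 56)` is 0.125: only one eighth of what goes over the wire is your data.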
These are some factors off the top of my head. And, honestly, I could probably spend a whole blog post on each of them — I’ve omitted many details in each of the above bullets.
Re-reading the above text, I see that it seems to imply that you should find the message size where it is more efficient to pack data into a single buffer than to send individual messages. I don’t mean to imply that. Many hardware platforms and MPI implementations optimize for small messages, but not necessarily to the same degree. Although this runs against the central tenet of portability of MPI applications, you will likely need to experiment with your MPI implementation and hardware to find the balance that achieves the best performance with your application.
For example, the following trivial code should quite likely be optimized to send a single message:
MPI_Send(&i, 1, MPI_INT, peer, tag, comm);
MPI_Send(&j, 1, MPI_INT, peer, tag, comm);
But real applications are rarely this simple, and the optimizations may be more difficult to spot.
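For the trivial two-integer case, the combining step might look something like the sketch below. It is plain C so that it stands on its own without an MPI installation. `pack_ints` is a hypothetical helper name, not part of MPI, and it only does the copy; the actual send is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy `count` separately stored int values into one contiguous buffer so
 * that they can be handed to a single send call instead of `count` separate
 * ones. `packed` must have room for at least `count` ints.
 * Returns the number of bytes written into `packed`. */
size_t pack_ints(const int *const *values, size_t count, int *packed)
{
    assert(values != NULL && packed != NULL);
    for (size_t k = 0; k < count; ++k) {
        /* This per-value copy is the cost you are trading against the
         * cost of injecting one more small message. */
        memcpy(&packed[k], values[k], sizeof(int));
    }
    return count * sizeof(int);
}
```

With `const int *srcs[2] = { &i, &j };` and `int buf[2];`, calling `pack_ints(srcs, 2, buf)` fills `buf`, and the two sends collapse into one `MPI_Send(buf, 2, MPI_INT, peer, tag, comm);`. The receiver then has to post a single matching receive for two ints rather than two receives for one int each.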
Bottom line: dig in to your code and conduct a few trivial experiments; see if you can combine messages to get higher MPI throughput.