New things in MPI-3: MPI_Count
The count parameter exists in many MPI API functions: MPI_SEND, MPI_RECV, MPI_TYPE_CREATE_STRUCT, etc. In conjunction with the datatype parameter, the count parameter is often used to effectively represent the size of a message. As a concrete example, the language-neutral prototype for MPI_SEND is:
MPI_SEND(buf, count, datatype, dest, tag, comm)
The buf parameter specifies where the message is in the sender’s memory, and the count and datatype arguments indicate its layout (and therefore size).
Since MPI-1, the count parameter has been an integer (int in C, INTEGER in Fortran). This meant that the largest count you could express in a single function call was 2^31 - 1 (assuming the usual 32-bit int), or about 2 billion. Since MPI-1 was introduced in 1994, machines (particularly commodity machines used in parallel computing environments) have grown. 2 billion began to seem like a fairly arbitrary, and sometimes distasteful, limitation.
The MPI Forum just recently passed ticket #265, formally introducing the MPI_Count datatype to alleviate the 2B limitation.
Really? 2 billion is limiting? C’mon — it’s two BILLION. With a “b”. Do people really need to do more than that?
The short answer is: yes. Fab Tillier described the justification for counts larger than 2B in a prior post on this blog. 32-bit is so, like, 2002, dude.
Two obvious workarounds are possible to get around the “2 billion limitation”:
1. If an application wants to send 8 billion elements to a peer, just send multiple messages. Relatively speaking, the overhead difference between sending a single 8GB message and sending four 2GB messages is negligible.
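2. Use MPI_TYPE_CONTIGUOUS to create a datatype composed of 2^30 (about 1 billion) MPI_BYTEs, then MPI_SEND 8 of those. The count passed to MPI_TYPE_CONTIGUOUS is itself an int, so it has to stay below 2^31. Perhaps something like this (other possibilities exist, too):
MPI_Type_contiguous(1073741824, MPI_BYTE, &one_gb_type); /* 2^30 bytes; 2^31 would not fit in an int */
MPI_Type_commit(&one_gb_type); /* a datatype must be committed before it is used to communicate */
MPI_Send(buf, 8, one_gb_type, …);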
However, these two workarounds don’t solve all problems. The most-often cited case is that MPI-IO-based libraries are interested in knowing when bytes are written to disk, not types (e.g., check the MPI_Status output from an MPI_FILE_IWRITE call). One reason is that disks don’t care a whit about MPI datatypes; disks may write a number of bytes that cannot be described by an integer number of MPI datatypes when the size of those datatypes is larger than 1 byte.
For example: a disk may write 17 bytes. That would be 4.25 MPI_INTs (assuming sizeof(int) == 4).
HDF, for example, needs to know when an entire 8GB message has been written to disk. More specifically, it wants to know when 8,589,934,592 bytes have been written, or know exactly how many of them (so far) have been written. You can’t express that through multiple function calls, and you can’t use composed datatypes.
Getting MPI_Count into MPI-3 was a very long and unexpectedly complicated process. The Forum discussed many different proposals before finally settling on the minimalist approach of ticket #265:
- Define the MPI_Count type
- Create four new datatype manipulation functions that use the MPI_Count type for the count parameter (instead of an integer)
- Define what happens when a count is too large to be expressed in an integer OUT parameter
Specifically, the Forum did not simply change all the integer count parameters in all existing MPI function prototypes. That would have been a backwards compatibility nightmare.
These four new functions can be used to solve the HDF problem, for example. An application can MPI_FILE_IWRITE an 8GB message. HDF can call MPI_GET_ELEMENTS_X with the resulting MPI_Status and the MPI_BYTE datatype to know exactly how many bytes were written.