Can we count on MPI to handle large datasets?

April 22, 2011 - 2 Comments

(today’s entry is guest-written by Fab Tillier, Microsoft MPI engineer extraordinaire)

When you send data in MPI, you specify how many items of a particular datatype you want to send in your call to an MPI send routine.  Likewise, when you read data from a file, you specify how many datatype elements to read.

This “how many” value is referred to in MPI as a count parameter, and all of MPI’s functions define count parameters as integers: int in C, INTEGER in Fortran.  This definition often limits users to 2^31 elements (i.e., roughly two billion elements) because int and INTEGER default to 32 bits on many of today’s platforms.

That may sound pretty big, but consider that a 2^31-byte file is not really that large by today’s standards, especially in HPC, where datasets can sometimes be terabytes in size.  Reading a ~2 gigabyte file can take (far) less than a second.

While you can read two billion bytes (2,147,483,647, to be precise) from a file in a single call to MPI_FILE_READ_AT, you can’t always do the same if you want to read (2^31 + 2) bytes. Ouch.
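To make the limit concrete, here is a minimal sketch in plain C (no MPI required). It assumes a platform where int is 32 bits, and both helper names are hypothetical; they are not part of MPI.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not part of MPI: true when a byte count can be
 * passed through an `int` count parameter without overflowing. */
bool fits_in_int_count(int64_t nbytes)
{
    return nbytes >= 0 && nbytes <= INT_MAX;
}

/* Hypothetical helper: narrow a 64-bit count to int, tripping an
 * assertion (in debug builds) instead of silently wrapping around. */
int checked_int_count(int64_t nbytes)
{
    assert(fits_in_int_count(nbytes));
    return (int)nbytes;
}
```

With a 32-bit int, fits_in_int_count(2147483647) is true, while fits_in_int_count((INT64_C(1) << 31) + 2) is false: that second count simply cannot be expressed in the count parameter.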

To be clear: you can often construct MPI derived datatypes such that you can send / read / etc. more than 2^31 bytes in a single MPI function call, but there are also many cases where MPI datatypes do not fix the problem.
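As a rough illustration of the datatype approach, the arithmetic might look like the sketch below. It is plain C, and the struct and function names are hypothetical. The idea is that a large byte count is expressed as some number of fixed-size blocks (in a real program the block type could be built with something like MPI_Type_contiguous), and whatever does not divide evenly is left over.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical result type: how a large byte count maps onto a
 * block-sized derived datatype. */
typedef struct {
    int blocks;     /* count to pass together with the block datatype */
    int remainder;  /* leftover bytes that need separate handling     */
} block_split;

/* Hypothetical helper: express `total` bytes as whole blocks of
 * `block_bytes` bytes each, plus a remainder smaller than one block. */
block_split split_into_blocks(int64_t total, int block_bytes)
{
    assert(total >= 0 && block_bytes > 0);
    assert(total / block_bytes <= INT_MAX);  /* block count must fit in int */

    block_split s;
    s.blocks = (int)(total / block_bytes);
    s.remainder = (int)(total % block_bytes);
    return s;
}
```

For (2^31 + 2) bytes and 1 MiB blocks, this gives 2048 blocks with 2 bytes left over. That leftover is one example of where the datatype trick gets awkward: it needs a second call or a more elaborate datatype.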

If you’re thinking, “Wow! 32-bit is so, like, 2002, dude,” — you’re right.

Applications that process data in such large quantities are relatively rare today.  I’ll bet that such large-data applications will become more prevalent over the next decade, though.

We know that users want their MPI implementation to hide platform limitations under the MPI calls so that they can focus on the data they need to process in a portable manner. Why shouldn’t limitations on count values be hidden, too?

Enter MPI_Count, a new type being proposed in the MPI Forum for the MPI 3.0 standardization effort.  MPI_Count is defined to be as big as anything that MPI can address: it must be at least as big as an integer, an MPI_AINT (MPI’s version of size_t), and MPI_OFFSET (a type to express file offsets, which broke the 32-bit barrier quite a while ago).

With MPI_Count and some associated new MPI API functions, users can express and manipulate large datasets without having to implement 31-bit chunking themselves. The MPI library — as it should — will do any necessary splitting for them.  At the same time, the existing MPI functions will remain unchanged, preserving backward compatibility.
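For reference, the manual chunking that users write today looks roughly like the sketch below. It is plain C with hypothetical names; the callback stands in for whatever MPI call takes an int count, so this is not how any particular MPI library does its splitting.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical callback standing in for an MPI call with an int count.
 * Returns 0 on success, nonzero on error. */
typedef int (*chunk_fn)(char *buf, int count, void *ctx);

/* Walk a buffer of `total` bytes in pieces of at most `max_chunk` bytes,
 * handing each piece to `fn`. Stops at the first nonzero return code. */
int for_each_chunk(char *buf, int64_t total, int max_chunk,
                   chunk_fn fn, void *ctx)
{
    assert(total >= 0 && max_chunk > 0);

    int64_t done = 0;
    while (done < total) {
        int64_t left = total - done;
        int n = left > max_chunk ? max_chunk : (int)left;
        int rc = fn(buf + done, n, ctx);
        if (rc != 0)
            return rc;
        done += n;
    }
    return 0;
}

/* Example callback: only records how many calls and bytes it saw. */
typedef struct {
    int     calls;
    int64_t bytes;
} chunk_stats;

int record_chunk(char *buf, int count, void *ctx)
{
    (void)buf;  /* the data itself is not inspected in this example */
    chunk_stats *stats = ctx;
    stats->calls++;
    stats->bytes += count;
    return 0;
}
```

Calling for_each_chunk(buf, 10, 4, record_chunk, &stats) on a 10-byte buffer makes three calls of 4, 4, and 2 bytes. In real code max_chunk would typically be INT_MAX (or a smaller, friendlier size), and the callback would wrap the actual send or read.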

Just to be clear: the MPI_Count proposal is still only that — a proposal.  It looks like it will pass and become part of MPI-3.0, but nothing is definite until it is actually voted in, which will take several more months.



  1. Sweet. It is pretty annoying at the moment to have to do the chunking manually (it doesn’t occur frequently, but when it does it is really painful). So it will be really nice to have MPI do that for me in a smarter way than my ‘grmpf-don’t-bother-me-with-this-stupid-integer-overflow’-coding 🙂