GPU: HPC Friend or Foe?
General-purpose computing with GPUs looks like a great concept on paper. Indeed, SC’08 was dominated by GPUs; it was impossible not to be (technically) impressed with some of the results that were being cited and shown on the exhibit floor. But despite that, GPGPUs have failed to become a “must have” HPC technology over the past year. Last week’s announcements from NVIDIA look really great for the HPC crowd (aside from some embarrassing PR blunders). They seem to address many of the shortcomings of using prior-generation GPUs in an HPC environment: more memory, more cores, ECC memory, better and cheaper memory management, etc. Will GPUs become the new hotness in HPC?
The obvious question here is “Why is Jeff discussing GPUs on an MPI blog?”
At first blush, MPI and GPUs are orthogonal. MPI is about message passing, after all, which has nothing to do with computation, right? Specifically: most people use MPI to pass messages around their computation. They compute, compute, compute, and then send some messages to peer processes (for example, to exchange updated edge data). Easy peasy.
However, there are a small number of places in MPI that can benefit from faster computation. The most obvious is a small class of MPI’s collective communications: reductions. A reduction takes data that is distributed among a group of processes and computationally combines it into a single answer. A simple example is taking a single value from each MPI process and multiplying them together into a single result value. This is a combination of both message passing and computation.
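To make the multiplication example concrete, here is a minimal sketch in plain Python. It is a single-process simulation, not real MPI: the list `values_by_rank` is a made-up stand-in for the one value that each MPI process would contribute, and the list index plays the role of the process rank.

```python
from functools import reduce
import operator

# One value per simulated MPI process; the list index plays the role of the rank.
# These numbers are arbitrary illustrations, not data from a real run.
values_by_rank = [2, 3, 4, 5]

# Combine every contribution with multiplication, as a product reduction would.
product = reduce(operator.mul, values_by_rank, 1)

print(product)  # 120
```

In a real MPI program, the message passing that collects the values and the multiplication that combines them would both happen inside a single reduction call.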
MPI has several built-in reduction operations, such as addition, multiplication, minimum, maximum, etc. MPI also allows users to define their own operators for application-specific computations. For small amounts of data and/or small numbers of processes (like my simple example above), having MPI perform a reduction is probably no more efficient than the application manually doing the communication and computation itself. But where MPI reductions really shine is where there is a large amount of data involved. MPI implementers put a lot of time, effort, and research into making distributed reductions as efficient (and fast!) as possible.
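The difference between built-in and user-defined operators can be sketched in the same simulated style. Everything here is hypothetical scaffolding: `simulated_reduce` is not an MPI function, and `max_magnitude` is an invented example of an application-specific operator. One thing worth knowing is that MPI assumes user-defined operators are associative, which this one is.

```python
from functools import reduce

def simulated_reduce(contributions, op):
    """Combine one equal-length buffer per simulated rank, element by element."""
    return reduce(
        lambda acc, buf: [op(x, y) for x, y in zip(acc, buf)],
        contributions,
    )

def max_magnitude(x, y):
    """Hypothetical user-defined operator: keep the value with the larger absolute value."""
    return x if abs(x) >= abs(y) else y

# Three simulated ranks, each contributing a buffer of three arbitrary values.
buffers = [[1, -7, 3], [4, 2, 9], [-5, 6, -8]]

total = simulated_reduce(buffers, lambda x, y: x + y)         # like a built-in sum
smallest = simulated_reduce(buffers, min)                     # like a built-in minimum
largest_magnitude = simulated_reduce(buffers, max_magnitude)  # user-defined operator

print(total, smallest, largest_magnitude)
```

The operator is applied element by element across the buffers, which is why a large buffer means a large amount of independent arithmetic.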
And this is where GPUs come in. GPUs could be a natural target for MPI implementations to use when performing large reductions. One simplistic way of looking at a global addition reduction would be to gather all the data in one process and then loop over the data to add it all together, perhaps like this:
for i = 0 ... large_reduction_size
    result(i) = a(i) + b(i)
But instead of this loop, the MPI implementation could use a GPU to do the addition. With a large enough input data set, a GPU could deliver a nice speedup compared to the simplistic loop.
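As a rough illustration of the gather-then-add approach with more than two operands, here is a pure-Python simulation. The names `add_buffers` and `gather_then_reduce` are my own inventions for this sketch; no real MPI or GPU call is involved. The `combine` parameter marks the spot where an implementation could, in principle, swap the CPU loop for a GPU kernel.

```python
def add_buffers(a, b):
    """The loop from the text: result(i) = a(i) + b(i), run on the CPU."""
    return [x + y for x, y in zip(a, b)]

def gather_then_reduce(buffers_by_rank, combine=add_buffers):
    """Simulate gathering every rank's buffer at rank 0 and combining them there."""
    gathered = list(buffers_by_rank)  # stands in for the gather to the root process
    result = gathered[0]
    for buf in gathered[1:]:
        # Each call is one full pass over the data; a GPU-backed combine could go here.
        result = combine(result, buf)
    return result

# Four simulated ranks; rank r contributes [r, r + 1, r + 2, r + 3].
buffers_by_rank = [[rank + i for i in range(4)] for rank in range(4)]
result = gather_then_reduce(buffers_by_rank)

print(result)  # [6, 10, 14, 18]
```

Note that all of the arithmetic lands on one process here, which is exactly what makes a single fast compute device attractive.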
This is actually a game-changer in terms of distributed MPI reduction algorithms. Contrary to what I said above, today’s MPI reduction algorithms typically perform distributed-style computations. For example, in a tree-based algorithm, each interior process sums the data from its children and then sends the result to its parent. When done, the tree root will contain the global summation result. Easy peasy.
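The tree-based idea can be simulated in a few lines. This sketch uses a binomial tree, which is only one possible tree shape and is my assumption rather than a claim about any particular MPI implementation; `tree_reduce` is a hypothetical helper, and "sending" is modeled by moving a buffer between dictionary entries.

```python
def tree_reduce(buffers_by_rank, combine):
    """Simulate a binomial-tree reduction; return (result at rank 0, rounds used)."""
    partial = {rank: list(buf) for rank, buf in enumerate(buffers_by_rank)}
    size = len(buffers_by_rank)
    rounds = 0
    step = 1
    while step < size:
        # In each round, every rank that is a multiple of 2 * step acts as a
        # parent and receives the partial result held by rank parent + step.
        for parent in range(0, size, 2 * step):
            child = parent + step
            if child < size:
                partial[parent] = combine(partial[parent], partial.pop(child))
        step *= 2
        rounds += 1
    return partial[0], rounds

def add_buffers(a, b):
    """Element-wise addition of two equal-length buffers."""
    return [x + y for x, y in zip(a, b)]

# Eight simulated ranks; rank r contributes [r, 2 * r].
buffers_by_rank = [[rank, 2 * rank] for rank in range(8)]
tree_result, tree_rounds = tree_reduce(buffers_by_rank, add_buffers)

print(tree_result, tree_rounds)  # [28, 56] 3
```

With eight ranks the work finishes in three rounds, and the additions within a round happen on different processes, so no single process ever touches all of the data.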
GPUs, however, are most efficient when operating on large sets of data. Hence, it might be better to gather all the operand data to one MPI process that has a GPU (or perhaps to a small number of such processes) and have it perform the bulk of the computation. The challenge will be to balance network bandwidth and congestion against the computational power of a single GPU. It’ll be even more challenging if only some nodes in the compute cluster have GPUs, or if there are fewer GPUs than CPU cores on a single compute node. Such configurations raise all kinds of heterogeneous distribution and allocation problems, which can be quite sticky to solve in an optimal fashion at run-time.
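One way to get a feel for this balancing act is a toy cost model. The sketch below is purely illustrative: it uses a simple latency-plus-bandwidth model, the function names are hypothetical, and every parameter value is a made-up placeholder rather than a measurement. It also ignores congestion and the cost of moving data between host and GPU memory, both of which would matter in practice.

```python
def tree_cost(p, n, alpha, beta, gamma_cpu):
    """Estimated time for a tree reduction over p processes and n elements.

    Assumes ceil(log2(p)) rounds; each round sends n elements and then
    combines them on a CPU. alpha = per-message latency, beta = transfer
    time per element, gamma_cpu = CPU time per element combined.
    """
    rounds = (p - 1).bit_length()  # exact ceil(log2(p)) for p >= 1
    return rounds * (alpha + n * beta + n * gamma_cpu)

def gather_cost(p, n, alpha, beta, gamma_gpu):
    """Estimated time to gather p - 1 buffers at one root and combine them on a GPU.

    Assumes the root receives the buffers one after another.
    """
    return (p - 1) * (alpha + n * beta) + (p - 1) * n * gamma_gpu

# Placeholder parameters (seconds), chosen only to exercise the model.
ALPHA, BETA, GAMMA_CPU, GAMMA_GPU = 1e-6, 1e-9, 1e-9, 1e-10
N = 1_000_000

for p in (2, 64):
    tree = tree_cost(p, N, ALPHA, BETA, GAMMA_CPU)
    gather = gather_cost(p, N, ALPHA, BETA, GAMMA_GPU)
    print(p, "tree" if tree < gather else "gather")
```

Under these placeholder numbers, gathering to the GPU wins for two processes and the tree wins for 64, because the root’s network link becomes the bottleneck as the process count grows. Real systems would need measured parameters and a far richer model.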
Consider this an open call to researchers to start tackling these topics!