Cisco Blogs

Modern GPU Integration in MPI

February 8, 2013 - 1 Comment

Today’s guest post is from Rolf vandeVaart, a Senior CUDA Engineer with NVIDIA.

GPUs are becoming quite popular as accelerators in High Performance Computing clusters. For example, check out Titan; a recent entry into the Top 500 list from Oak Ridge Laboratories. Titan has 18,688 nodes (299,008 CPU cores) coupled with 18,688 NVIDIA Tesla K20 GPUs.

To help ease the programming burden working with GPU memory in MPI applications, support has been added to several MPI libraries such that the MPI library can directly send and receive the GPU buffers without the user having to stage them in host memory first. This has sometimes been referred to as “CUDA-aware MPI.”

Here is some psuedo code that shows the difference. This is typical application code that uses a “regular” (non-CUDA-aware) MPI library:

cudaMalloc(&gpuPtr, DATA_SIZE);
cudaMallocHost(&cpuPtr, DATA_SIZE);
/* Special CUDA mojo to launch computation kernel on GPU */
kernel<<grid, block>>(gpuPtr);
cudaMemcpy(cpuPtr, gpuPtr, DATA_SIZE, cudaMemcpyDefault);
MPI_Send(cpuPtr, NUM_ELEMENTS, MPI_DOUBLE, dest, tag, comm);

But a CUDA-aware MPI can hide the extra steps for you:

cudaMalloc(&gpuPtr, DATA_SIZE);
kernel<<grid, block>>(gpuPtr);
MPI_Send(gpuPtr, NUM_ELEMENTS, MPI_DOUBLE, dest, tag, comm);

The MPI library is doing a few things behind the scenes. First, CUDA supports the ability to determine if a buffer is a GPU buffer or a host buffer via this API function:

cuPointerGetAttribute(&memType, CU_POINTER_ATTRIBUTE_MEMORY_TYPE, ptr);

If memType is


, the MPI library can initiate a copy to get the memory from the device before handing off the resulting buffer to the network API. While this CUDA query function needs to be invoked for every send and receive buffer, pains were taken to optimize this function and make its overhead minimal.

Note, too, that this is quite analogous to what happens with OpenFabrics-based networks: in every call to MPI_Send, the buffer must be looked up to see if it is already registered with the OpenFabrics network stack. Hence, this is not really a new concept.

Once inside the MPI library, there are basically two ways that the GPU buffers can be moved. First, they can be staged through internal host buffers in the MPI library. In this case, after the GPU data is copied into the internal host buffers, MPI just utilizes its existing protocols to send the data. Additionally, the internal host buffers can be registered with the CUDA system via


. With CUDA-registered memory, the data can be copied asynchronously by a DMA unit in the GPU. The MPI library than just polls every now and then to determine when the data is ready to be sent via standard host buffer protocols.

Alternatively, the MPI library can take advantage of GPU-to-GPU data movement capabilities where available. For example, within a single node, CUDA has a set of APIs that allows copying data directly between two GPUs without passing through host memory. Even better, the DMA units on the GPUs progress such copies asynchronously and without involvement of the main CPU. These APIs are called CUDA Interprocess Communication (IPC) functions; more details can be found here.

These types of CUDA support are appearing in more and more MPI implementations.

Here are some links to FAQs and papers that talk about this feature:

And if you are attending CTC 2013 March 18-21, 2013 in Santa Clara, there are some talks scheduled on this topic:

In an effort to keep conversations fresh, Cisco Blogs closes comments after 60 days. Please visit the Cisco Blogs hub page for the latest content.


  1. Here’s a reply from Fab Tillier, Microsoft HPC MPI developer:

    You state that GPU buffers are similar to RDMA buffers. I wholehearted disagree. The former is a conscious choice by the user, while the latter is an internal implementation issue in the MPI library. A user of MPI should never know that they’re using RDMA under the covers, and memory registration is a burden that the MPI implementation takes on in exchange for better performance.

    Said another way, a program using a GPU *knows* it’s using a GPU, and usually exactly which GPU vendor to boot. Such a program could just as easily call

    MPI_Send(gpuPtr, NUM_ELEMENTS, OMPI_GPU_DOUBLE, dest, tag, comm);

    You might argue that an implementation-specific datatype handle makes the program no longer portable, and I’ll counter that to allow using a GPU pointer in the first place requires knowledge that the MPI library supports GPU pointers, and loses portability just the same, but with potentially more difficult error manifestation (imagine treating a GPU pointer as a host pointer!)

    Anyway, I think the end results will be better if the application is honest with the MPI library that it is using GPU buffers, rather than trying to deduce it from the pointer value. The portability issue could be resolved by the MPI Forum via standardizing the GPU datatypes, of defining a way of tagging a datatype’s buffer type.

    Some other optimizations are GPUDirect RDMA ( where the HCA could do peer-to-peer data transfers directly to/from GPU memory. This is independent of how the MPI library discovers that a buffer is a GPU buffer (whether by querying CUDA, or being told by the user).