Today we feature a deep-dive guest post from Torsten Hoefler, the Performance Modeling and Simulation lead of the Blue Waters project at NCSA, and Pavan Balaji, a computer scientist in the Mathematics and Computer Science (MCS) Division at Argonne National Laboratory (ANL) and a fellow of the Computation Institute at the University of Chicago.
Despite MPI’s vast success in bringing portable message passing to scientists on a wide variety of platforms, MPI has been labeled as a communication model that only supports “two-sided” and “global” communication. The MPI-1 standard, which was released in 1994, provided functionality for performing two-sided and group or collective communication. The MPI-2 standard, released in 1997, added support for one-sided communication or remote memory access (RMA) capabilities, among other things. However, users have been slow to adopt these capabilities for a number of reasons, chiefly: (1) the model was too strict for several application behavior patterns, and (2) several features were missing from the MPI-2 RMA standard. Bonachea and Duell put together a more-or-less comprehensive list of areas where MPI-2 RMA falls behind. A number of alternative programming models, including Global Arrays, UPC, and CAF, have gained popularity by filling this gap.
That’s where MPI-3 comes in.
The MPI Forum has been working for the past four years to put together the third major version of MPI, MPI-3. An informal overview of the new features in MPI-3 can be found here. An important part of MPI-3 deals with improved semantics and features for MPI RMA, to address some of these criticisms and make MPI RMA a portable runtime system that can provide high-performance and feature-rich one-sided communication support to applications. That is not to say that MPI-3 will replace all other programming models that provide one-sided communication. But it will certainly offer a competitive library interface for many parallel programs. In addition, the portability, performance, and vendor support that MPI enjoys could allow it to form a portable runtime layer for many of these high-level languages and libraries, providing a base for rapid development of newer languages and libraries while remaining compatible with existing MPI applications.
Several MPI-3 RMA features have been adopted from other programming models, including those mentioned above. The goal of MPI has always been to standardize and generalize the capabilities that have been shown to be useful for applications, while maintaining portability across various platforms and the look-and-feel of MPI. In this blog post, we will talk about some of the improvements that are being added in MPI-3 RMA.
MPI-3 RMA in a Nutshell
The basic premise of MPI RMA revolves around the concept that any allocated memory is private to the MPI process by default, and can be exposed to other processes as a “public memory region” as needed. To do this, MPI provides the notion of windows. An MPI process can declare a segment of its memory to be part of a window, allowing other processes to access this memory segment using one-sided operations such as PUT, GET, ACCUMULATE, and others. Several synchronization primitives let processes control when data written by one-sided operations becomes visible to other processes.
MPI-3 RMA offers two new window allocation functions: a collective version that can be used to allocate window memory for fast access, and a dynamic version which exposes no memory but allows the user to “register” remotely-accessible memory locally and dynamically at each process.
Two memory models are provided to allow the implementation to take advantage of cache-coherency if it is supported by the underlying system architecture. This enables high-performance implementations and easier programming on many modern computing systems.
New atomic operations, such as fetch-and-accumulate and compare-and-swap, bring well-known shared-memory semantics to MPI and enable the implementation of lock-free algorithms in distributed memory.
New synchronization functions are also provided to offer a way to explicitly finish outstanding accesses to a target and notify completion at the source process. This also allows separating local and remote completion, enabling faster reuse of local buffers. RMA data-movement operations can now also return requests, which enables finer-grained completion semantics instead of bulk (per target) completion.
Addressing Some Shortcomings of MPI-2.0
Collective window creation and dynamically exposing buffers: One of the criticisms of Bonachea and Duell was that exposing a memory segment to be publicly accessible was a collective operation in MPI-2 RMA. Therefore, if a process had to dynamically decide what memory it wanted to expose publicly, it had to synchronize with all other processes in the corresponding communicator. This is cumbersome and inefficient. MPI-3 RMA provides a new method for dynamically exposing buffers for other processes to access, called MPI_WIN_CREATE_DYNAMIC. The idea of this new functionality is to provide an empty window initially, and allow each process to dynamically attach and detach memory regions as needed. Thus, if a process needs to expose a memory region for one other process to access, it no longer needs to synchronize with all other processes in the communicator.
Conflicting GET/PUT communication: Another criticism of Bonachea and Duell was that it was erroneous for multiple PUT/GET operations to access the same memory location in the same synchronization epoch. That is, if the application performed a PUT operation on a region more than once in the same synchronization epoch, the MPI implementation was allowed to raise an error, after which the behavior of MPI was undefined (e.g., it could abort). This was problematic in a number of ways. First, if MPI RMA were a compiler target for high-level languages, the language would have to guarantee that multiple PUT/GET operations cannot access the same memory region. This is hard for the compiler to track. Second, some applications, such as those represented by the HPCC random access benchmark, intentionally perform multiple PUT/GET operations to the same memory region. The algorithms in these applications rely on the fact that, even if some regions in memory are corrupted, the overall correctness of the application is not affected. In such cases, it is acceptable to the application if MPI corrupts the memory regions that have multiple PUT/GET operations on them, but it is not acceptable for the MPI implementation to raise an error. In MPI-3 RMA, such accesses are now valid, but the resulting value of the data is undefined.
Local and remote updates: Bonachea and Duell also noted that MPI-2 did not allow users to update a window’s memory through remote PUT/GET operations and local stores concurrently. This restriction simplified the implementation of the MPI-2 memory model significantly, even though it can be very unnatural and cumbersome for users. MPI-3 again defines a middle ground where such accesses are not generally erroneous but the outcome is undefined. A similar limitation was that overlapping windows could be created but were hard to use because concurrent communications led to erroneous results. MPI-3 again allows such accesses without any guarantees about the outcome. However, in the MPI-3 model, the user would instead create a single dynamic window that spans virtually the whole address space and register/deregister memory dynamically when it needs to be exposed to other processes.
Passive target: The restriction that passive target RMA operations can only lock (and access) a single target process per epoch has also been lifted. Now, multiple target processes can be locked and accessed concurrently, and MPI-3 also provides a function to lock all target processes from a single process.
Comparison to Other One Sided Models
Comparison to ARMCI: While it is an area of active research, a high-performance implementation of the ARMCI API currently requires hardware features such as cache coherence. In ARMCI, while a process is accessing its local data using load/store instructions, another process is allowed to write to another part of this process's memory using a PUT operation. This is not a concern for cache-coherent architectures, but on non-cache-coherent architectures that do not track byte-level accesses, this might result in data being overwritten. MPI-2 RMA was strict in this respect and did not allow such accesses. MPI-3 RMA provides the ability for users to take advantage of cache-coherent domains, using a query function to find out whether such behavior is allowed.
Comparison to the Co-Array-enabled Fortran 2008: MPI-3 offers similar memory access and consistency semantics, so it should be possible to implement Fortran 2008 with MPI-3 as a runtime. Fortran 2008 offers additional features for the multi-dimensional specification and distribution of arrays (matrices), which can be a clear advantage over using pure MPI-3 semantics. Otherwise, MPI-3 RMA will integrate with Fortran 2008 and offer alternative methods to access remote data. The new Fortran bindings in MPI-3 will enhance this functionality further by adding type safety and “safe” nonblocking semantics.
Comparison to UPC: MPI offers a similar one-dimensional partitioning of global arrays (windows in MPI). The memory consistency and update semantics are similar to UPC by default (in-order updates for accumulates) but can also be relaxed for faster implementations on out-of-order networks. MPI-3 RMA also offers good support for multi-level parallelism and modular programming with the well-known concepts of windows and communicators.
This summary was written by Pavan Balaji and Torsten Hoefler. The design of the RMA in MPI-3.0 was a great team effort; the most active individuals were (alphabetically): Pavan Balaji, Brian Barrett, Ron Brightwell, James Dinan, Bill Gropp, Jeff Hammond, Torsten Hoefler, Rajeev Thakur, and Keith Underwood.