Taking MPI Process Affinity to the Next Level
Process affinity is a hot topic. With commodity servers getting more and more complex internally (think: NUMA and NUNA), placing and binding individual MPI processes to specific processor, cache, and memory resources can have a significant effect on delivered application performance.
MPI implementations have long offered options for laying out MPI processes across the resources allocated to a job. Such options typically include round-robin schemes by core or by server node. Additionally, MPI processes can be bound to individual processor cores (or even entire sockets).
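For example, with the v1.6-era option names, invocations look something like the sketch below (exact option names vary across versions and implementations, so check your mpirun(1) man page; ./my_app is just a stand-in for your application):

    # Map 16 processes round-robin by node, binding each one to a core
    mpirun -np 16 --bynode --bind-to-core ./my_app

    # Or pack them by core and bind each process to an entire socket
    mpirun -np 16 --bycore --bind-to-socket ./my_app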
Today caps a long-standing effort among Josh Hursey, Terry Dontje, Ralph Castain, and me (all developers in the Open MPI community) to revamp the processor affinity system in Open MPI.
The first implementation of the Location Aware Mapping Algorithm (LAMA) for process mapping, binding, and ordering has been committed to the Open MPI SVN trunk. LAMA provides a whole new level of processor affinity control to the end user.
Our guiding principle for LAMA was that as modern computing platforms continue to evolve, we can’t anticipate the changing affinity requirements for MPI applications. Rather than just providing a few stock mapping/binding patterns, we decided to offer a generalized regular pattern engine (i.e., LAMA) to allow application developers to do what they need.
The SVN commit message actually contains quite a bit of information about how to use LAMA (pardon all the typos; I forgot to run the spell checker before committing — #$%@#%$!!).
This revamp has been a long time coming; we appreciate the patience of the greater community. I'd also like to personally thank all the developers who made this happen.
A good chunk of the processor affinity revamp hit the Open MPI SVN trunk a while ago, and is already included in Open MPI’s upcoming v1.7 series. Ralph did the majority of the work to update our base handling of mpirun’s command-line mapping and binding options, and hooked those into Open MPI’s back-end run-time system. He also deeply integrated hwloc into the run-time system so that we could take advantage of its detailed topology detection and flexible binding mechanisms.
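As a rough illustration of what those revamped options look like in the v1.7 series (again, a sketch; see mpirun's help output for the full syntax):

    # Map processes round-robin by socket, bind each one to a single core,
    # and print the resulting bindings at startup
    mpirun -np 16 --map-by socket --bind-to core --report-bindings ./my_app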
The rest of us collaborated on developing a new algorithm for generalized round-robin mapping, binding, and ordering. Taking a lot of inspiration from both BlueGene's network specification methodology and many, many conversations with users (including some on this blog), we developed both the LAMA algorithm and the command-line specification for invoking it (which was a surprisingly difficult task).
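To give a flavor of that specification, here's a sketch along the lines of the examples in the commit message (the MCA parameter names and token letters may still shift before release, so treat the details as illustrative):

    # Select the LAMA mapper, then:
    #   map:   iterate round-robin across nodes, then sockets, cores,
    #          and hardware threads
    #   bind:  bind each process to 1 core
    #   mppr:  allow at most 1 process per core ("max processes per resource")
    #   order: assign MPI_COMM_WORLD ranks in "natural" (mapping) order
    mpirun -np 16 --mca rmaps lama \
        --mca rmaps_lama_map nsch \
        --mca rmaps_lama_bind 1c \
        --mca rmaps_lama_mppr 1:c \
        --mca rmaps_lama_order n \
        ./my_app

The key idea is that the mapping string is an ordered list of hardware levels to iterate over, which is what makes LAMA a general pattern engine rather than a fixed menu of canned layouts.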
I’ll probably devote a future blog entry or two to examples of how to use LAMA. It’s very powerful.
LAMA has probably already missed the boat for Open MPI v1.7.0. I’m pretty sure we can get it into v1.7.1, though.
We hope that LAMA will be genuinely useful to application developers with non-trivial affinity needs. Let us know what you think.