Cisco Blogs

Taking MPI Process Affinity to the Next Level

August 31, 2012 - 4 Comments

Process affinity is a hot topic.  With commodity servers getting more and more complex internally (think: NUMA and NUNA), placing and binding individual MPI processes to specific processor, cache, and memory resources is becoming quite important in terms of delivered application performance.

MPI implementations have long offered options for laying out MPI processes across the resources allocated for the job.  Such options typically included round-robin schemes by core or by server node.  Additionally,  MPI processes can be bound to individual processor cores (and even sockets).
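For concreteness, here is roughly what those pre-LAMA layout options look like on the mpirun command line. These are Open MPI 1.6-era option names; `my_mpi_app` is a placeholder, and you should consult mpirun(1) for the exact spellings in your version:

```shell
# Round-robin processes across nodes instead of filling each node's cores first:
mpirun -np 8 --bynode ./my_mpi_app

# Lay out by core (the default) and bind each process to the core it lands on:
mpirun -np 8 --bycore --bind-to-core ./my_mpi_app
```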

Today caps a long-standing effort by Josh Hursey, Terry Dontje, Ralph Castain, and me (all developers in the Open MPI community) to revamp the processor affinity system in Open MPI.

The first implementation of the Location Aware Mapping Algorithm (LAMA) for process mapping, binding, and ordering has been committed to the Open MPI SVN trunk.  LAMA provides a whole new level of processor affinity control to the end user.

Our guiding principle for LAMA was that as modern computing platforms continue to evolve, we can’t anticipate the changing affinity requirements for MPI applications.  Rather than just providing a few stock mapping/binding patterns, we decided to offer a generalized regular pattern engine (i.e., LAMA) to allow application developers to do what they need.
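To give a flavor of that generalized pattern engine, here is a hedged sketch of what invoking LAMA looks like. The MCA parameter names below are taken from memory of the commit message and may shift before release; run `ompi_info` against your build for the authoritative list:

```shell
# Map round-robin by core, then socket, then node, binding each process
# to a single core (token order in rmaps_lama_map defines the pattern):
mpirun -np 16 \
    --mca rmaps lama \
    --mca rmaps_lama_map csn \
    --mca rmaps_lama_bind 1c \
    ./my_mpi_app

# "Max processes per resource": allow at most one process per socket:
mpirun -np 4 \
    --mca rmaps lama \
    --mca rmaps_lama_mppr 1:s \
    ./my_mpi_app
```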

The SVN commit message actually contains quite a bit of information about how to use LAMA (pardon all the typos; I forgot to run the spell checker before committing — #$%@#%$!!).

This revamp has been a long time coming; we appreciate the patience of the greater community. I’d also like to personally thank all the developers who made this happen.

A good chunk of the processor affinity revamp hit the Open MPI SVN trunk a while ago, and is already included in Open MPI’s upcoming v1.7 series. Ralph did the majority of the work to update our base handling of mpirun’s command-line mapping and binding options, and hooked those into Open MPI’s back-end run-time system.  He also deeply integrated hwloc into the run-time system so that we could take advantage of its detailed topology detection and flexible binding mechanisms.

The rest of us collaborated on developing a new algorithm for generalized round-robin mapping, binding, and ordering.  Taking a lot of inspiration from both BlueGene’s network specification methodology and many, many conversations with users (including some on this blog), we developed both the LAMA algorithm and command-line specification for how to invoke it (which was a surprisingly difficult task).

I’ll probably devote a future blog entry or two to some examples of how to use LAMA. It’s very powerful.

LAMA has probably already missed the boat for Open MPI v1.7.0.  I’m pretty sure we can get it into v1.7.1, though.

We hope that LAMA will be genuinely useful to application developers with non-trivial affinity needs. Let us know what you think.


UPDATE: See Part 1 and Part 2 of my detailed description of all the new affinity options in Open MPI.



  1. Yes and no.

    The topology information from hwloc gives us single-server topology information. It doesn’t give us network fabric topology information. 🙁 Hence, we don’t (yet) use this information for cartesian/graph MPI communicator topologies.

    That being said, network topology continues to be a hot topic, so stay tuned…

  2. Presumably, the topology information provided by hwloc could be used to reorder process ranks in MPI_Cart_create, MPI_Graph_create, and MPI_Dist_graph_create to better match a program’s communication patterns to the underlying hardware. Now that hwloc is integrated into OpenMPI’s run-time system, is there any chance of such a capability showing up in the near future?

  3. Fair point.

    I think what we were trying to say was that the BG methodology wasn’t sufficient for our NUMA based systems. So we had to extend/morph it. But it definitely was a great starting point / source of inspiration for us; standing on the shoulders of giants and all that.

  4. In your paper you say:

    “The MPI implementation on IBM BlueGene systems allow applications to map with respect to their position in the three-dimensional torus network… Unfortunately, this mapping pattern does not account for internal server topologies that might affect application performance such as sockets and cache levels. A mapping file can be specified for irregular patterns not well supported via command line options.”

    It is worth noting that Blue Gene systems are – by design – always single-socket SMPs with uniform access to shared caches and therefore this design is fully general for the hardware in question. Every torus partition is electrically isolated and identical to every other one of the same dimensions. If Blue Gene had NUMA problems like x86 supercomputers, I assure you there would be an API to express that.