In my last post, I mentioned that we just finished a complete revamp of the Open MPI process affinity system, and provided only a few details as to what we actually did.
I did link to an SVN commit message, but I’ll wager that few readers, if any, actually read it.
Much of what is in the Open MPI v1.6.x series is the same as what Ralph Castain described in a prior blog post. I’ll describe below what we changed for the v1.7 series.
If you haven’t done so already, go read the prior blog post about affinity in the v1.5 (and v1.6) series. Really. It has pictures. This is a surprisingly complicated topic; it behooves you to get a basis for what we’re talking about.
In v1.7, we provide two main levels of setting affinity:
- Simple: building upon the existing v1.5/v1.6 affinity options
- Expert: a whole new set of options that provide lots of flexibility for a variety of regular binding patterns
We anticipate that the vast majority of users will be well-served by the “Simple” commands. But some power users, particularly those exploring new environments and hardware architectures, may need more control. For those users, the “Expert” commands should provide the flexibility they need.
The new affinity system is centered around regular round-robin types of algorithms, and is built upon three distinct phases:
- Mapping: the selection of where to put individual processes
- Binding: the (optional) chaining of individual processes to one or more hardware threads
- Ordering: the assignment of MPI_COMM_WORLD ranks to processes
Mapping: v1.7 now has a “--map-by <level>” option to mpirun. <level> describes which hardware element to iterate over when placing MPI processes round-robin. For example, “--map-by core” means to place one process on a core before advancing to the next core. “--map-by socket” means to place one process on a socket before advancing to the next socket. And so on.
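To make the round-robin idea concrete, here is a minimal sketch in Python. It is not Open MPI's implementation. The 2-socket, 2-core, 2-hardware-thread topology and the names `TOPOLOGY`, `LEVEL_DEPTH`, `elements_at`, and `map_by` are all made up for illustration, and the sketch ignores details such as slot counts and oversubscription.

```python
from itertools import cycle

# Hypothetical topology (an assumption for this sketch): 2 sockets,
# 2 cores per socket, 2 hardware threads per core. Each hardware thread
# is identified by a (socket, core, hwthread) tuple.
TOPOLOGY = [(s, c, t) for s in range(2) for c in range(2) for t in range(2)]

# How many leading fields of the tuple identify an element at each level.
LEVEL_DEPTH = {"socket": 1, "core": 2, "hwthread": 3}


def elements_at(level):
    """Return the distinct hardware elements at `level`, in topology order."""
    depth = LEVEL_DEPTH[level]
    seen = []
    for hwt in TOPOLOGY:
        key = hwt[:depth]
        if key not in seen:
            seen.append(key)
    return seen


def map_by(level, nprocs):
    """Place nprocs processes round-robin: one per element, then wrap around."""
    elements = cycle(elements_at(level))
    return [next(elements) for _ in range(nprocs)]


if __name__ == "__main__":
    print(map_by("socket", 4))  # processes alternate between the two sockets
    print(map_by("core", 4))    # one process lands on each of the four cores
```

With this toy topology, mapping by socket alternates processes between the two sockets, while mapping by core fills each core once before wrapping.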
Here are all the valid --map-by levels:
This generalized “--map-by” option replaces the v1.5/v1.6 --bynode and --byslot options (which correspond to “--map-by node” and “--map-by slot”, respectively).
Be warned: The older --by* options will disappear in a future version of Open MPI.
Binding: v1.7 also has a “--bind-to <width>” mpirun option. <width> describes how many hardware threads an individual process should be bound to, and can use all the same values as for map-by’s <level> (listed above). For example, “--bind-to core” means to bind each process to all the hardware threads in a single processor core. “--bind-to l1cache” means to bind the process to all the hardware threads that are under a single L1 cache. And so on.
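Here is a small Python sketch of what a binding width means. Again, this is an illustration under assumptions, not Open MPI code: the topology is a made-up 2-socket, 2-core, 2-hardware-thread machine, and `TOPOLOGY`, `WIDTH_DEPTH`, and `bind_to` are hypothetical names.

```python
# Hypothetical topology (an assumption for this sketch): 2 sockets,
# 2 cores per socket, 2 hardware threads per core, each identified by
# a (socket, core, hwthread) tuple.
TOPOLOGY = [(s, c, t) for s in range(2) for c in range(2) for t in range(2)]

# How many leading fields of the tuple identify an element at each width.
WIDTH_DEPTH = {"socket": 1, "core": 2, "hwthread": 3}


def bind_to(width, location):
    """Return the hardware threads that a process may run on.

    `location` is the full (socket, core, hwthread) tuple where the
    process was mapped; the result is every hardware thread that shares
    the enclosing element of the requested width.
    """
    depth = WIDTH_DEPTH[width]
    return [hwt for hwt in TOPOLOGY if hwt[:depth] == location[:depth]]


if __name__ == "__main__":
    # Binding to a core yields both hardware threads of that core.
    print(bind_to("core", (0, 1, 0)))
    # Binding to a socket yields all four hardware threads of that socket.
    print(bind_to("socket", (1, 0, 0)))
```

The wider the binding, the more hardware threads the OS is allowed to schedule the process on.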
As indicated above, binding is optional. But note that without binding, mapping is effectively reduced to “counting how many processes to launch on a single server.” If Open MPI doesn’t bind the processes to any specific resources, the OS may move MPI processes anywhere within the confines of the server on which they were launched.
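One way to see the difference for yourself is to have each launched process print the set of CPUs that the OS will let it run on. The sketch below uses Python's `os.sched_getaffinity`, which is only available on some platforms (notably Linux); the `current_affinity` helper name is mine, and the logical CPU ids it reports are the OS's own numbering.

```python
import os


def current_affinity():
    """Return the set of logical CPU ids this process may run on, or None.

    os.sched_getaffinity is not available on every platform, so fall
    back to None where it is missing.
    """
    if hasattr(os, "sched_getaffinity"):
        return os.sched_getaffinity(0)  # 0 means "the calling process"
    return None


if __name__ == "__main__":
    print(f"pid {os.getpid()} may run on CPUs: {current_affinity()}")
```

If you launch a script like this under mpirun, I would expect an unbound process to report every CPU on the server, and a bound process to report only the hardware threads it was bound to.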
The generalized “--bind-to” option replaces the v1.5/v1.6 --bind-to-core and --bind-to-socket options.
Be warned: these older --bind-to-* options will disappear in a future version of Open MPI.
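If you have job scripts that still use the older options, the translation is mechanical. Here is a hypothetical Python helper (`modernize` and `REPLACEMENTS` are names I made up) that rewrites an mpirun argument list. The two mapping entries come straight from the correspondence described above; the two binding entries are my reading of how the old options line up with “--bind-to core” and “--bind-to socket”.

```python
# Old v1.5/v1.6 option -> equivalent v1.7 option plus its argument.
# The --bind-to-* entries are an assumed correspondence (see the text above).
REPLACEMENTS = {
    "--bynode": ["--map-by", "node"],
    "--byslot": ["--map-by", "slot"],
    "--bind-to-core": ["--bind-to", "core"],
    "--bind-to-socket": ["--bind-to", "socket"],
}


def modernize(argv):
    """Return a copy of argv with the older options replaced by new ones."""
    out = []
    for arg in argv:
        # Arguments that are not in the table pass through unchanged.
        out.extend(REPLACEMENTS.get(arg, [arg]))
    return out


if __name__ == "__main__":
    old = ["mpirun", "--bynode", "--bind-to-core", "-np", "4", "./a.out"]
    print(" ".join(modernize(old)))
```

Everything else on the command line passes through untouched, so the helper only changes the affinity options.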
Ordering: Once all the processes have been mapped (and optionally bound), Open MPI assigns ranks in MPI_COMM_WORLD to them. In the Simple mode, Open MPI behaves as if all available hardware threads were laid out in a linear fashion, and then overlays them with the mapped MPI processes. Open MPI then assigns MPI_COMM_WORLD ranks from 0 to (N-1) in a left-to-right fashion.
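Here is one way to read that left-to-right rule, sketched in Python. This is my simplified interpretation, not the actual Open MPI code: each process is represented by a made-up name and the index of the first hardware thread it overlays in the linear layout, and `assign_ranks` is a hypothetical helper.

```python
def assign_ranks(placements):
    """Assign MPI_COMM_WORLD ranks in left-to-right order.

    `placements` is a list of (process_name, hwthread_index) pairs in
    mapping order, where hwthread_index is the position of the process
    in the linear layout of hardware threads. Returns a dict that maps
    each process name to its rank.
    """
    # sorted() is stable, so processes that overlay the same position
    # keep their mapping order relative to each other.
    ordered = sorted(placements, key=lambda p: p[1])
    return {name: rank for rank, (name, _) in enumerate(ordered)}


if __name__ == "__main__":
    # Four processes mapped round-robin across two sockets of four
    # hardware threads each (positions 0-3 and 4-7).
    placements = [("A", 0), ("B", 4), ("C", 2), ("D", 6)]
    print(assign_ranks(placements))
```

In this example, processes A and C sit on the first socket and so receive ranks 0 and 1, while B and D on the second socket receive ranks 2 and 3, even though B was mapped before C.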
The Simple options of --map-by and --bind-to provide a surprising amount of flexibility, and can probably handle most users’ affinity needs.