In my last post, I described the Simple mode of Open MPI v1.7’s process affinity system.
The Simple mode is actually quite flexible, and we anticipate that it will meet most users’ needs. However, some users will need more flexibility. That’s what the Expert mode is for.
Before jumping into the Expert mode, though, let me describe two more features of the revamped v1.7 affinity system.
First, Open MPI v1.7 uses hwloc to probe the topologies of the nodes on which it is launching processes. These server topologies are gathered (and intelligently de-duplicated) and used to feed the mapping algorithm. Hence, not only do you not need to specify compute node topologies ahead of time (e.g., in a config file), but the mapping will also be performed correctly even if you are launching an MPI job across a heterogeneous set of servers.
Second, even as far back as v1.6.0, Open MPI improved the “--report-bindings” mpirun option. When paired with any of Open MPI’s affinity functionality, this option prints an ASCII prettyprint representation showing where each process was bound. In the v1.6 series, only sockets and cores are shown; in the v1.7 series, hardware threads are shown as well.
Now let’s jump into v1.7’s Expert mode. Expert mode is governed by four MCA parameters:
- rmaps_lama_map: a sequence of characters describing how to lay out processes
- rmaps_lama_bind: a sequence of characters describing the resources to bind to each process
- rmaps_lama_mppr: a sequence of characters describing the maximum number of processes to allow per resource
- rmaps_lama_ordering: how to order the ranks in MPI_COMM_WORLD
Each parameter is described below.
rmaps_lama_map: This MCA parameter corresponds to the mapping phase, but is more flexible than the “--map-by” Simple option. It is a string of tokens, each corresponding to one of the hardware levels accepted by “--map-by” (case is significant):
- h: Hardware thread
- c: Processor core
- s: Processor socket
- L1: L1 cache
- L2: L2 cache
- L3: L3 cache
- N: NUMA node
- b: Processor board
- n: Server node
These tokens, when strung together, represent the order in which hardware resources are round-robin traversed when mapping.
For example, consider the string “sNbnch”. Open MPI’s Location Aware Mapping Algorithm (LAMA) takes the first token, “s”, and iterates over all the sockets to place processes. When LAMA runs out of sockets, it then gets the next token, which is “N” (NUMA node). However, a NUMA node is “bigger” than a socket, and so all of those are exhausted, too. Similarly for “b” and “n”.
LAMA finally gets to “c”, and can map the next process to the next (second) core in the first socket. And then map the next process to the second core in the second socket. And so on (until all the cores in all the sockets are exhausted, or all processes have been mapped — whichever comes first).
Hence, as mentioned above, the map string represents the order in which hardware resources are round-robin traversed when mapping. Any regular round-robin pattern can be expressed by stringing the tokens together in arbitrary orders. For example, “--map-by core” exactly corresponds to “csNbnh”, and “--map-by socket” exactly corresponds to “sNbnch”.
Note that L1, L2, and L3 are “variable” levels: where the caches fall relative to cores and sockets depends on the specific hardware topology.
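One way to picture this traversal: the map string acts like a mixed-radix counter whose least-significant digit is the first token. Here is a toy Python sketch of that idea. To be clear, this is not Open MPI’s actual LAMA implementation; the function name and the topology dictionary are invented purely for illustration:

```python
from itertools import product

def lama_map(tokens, sizes, nprocs):
    """Toy LAMA-style mapper (illustrative only, not Open MPI code).

    tokens: hardware levels from fastest- to slowest-varying, e.g. "sc"
    sizes:  how many of each level exist, e.g. {"s": 2, "c": 4}
    """
    slots = []
    # itertools.product varies its *last* argument fastest, so feed it
    # the levels in reverse, then flip each combination back around.
    for combo in product(*(range(sizes[t]) for t in reversed(tokens))):
        slots.append(tuple(reversed(combo)))
        if len(slots) == nprocs:
            break
    return slots  # one tuple of per-level indices per process

# "--map-by socket" on 2 sockets x 4 cores: the socket index varies fastest.
print(lama_map("sc", {"s": 2, "c": 4}, 4))
# -> [(0, 0), (1, 0), (0, 1), (1, 1)]   i.e. (socket, core) pairs
```

Swapping the token order to "cs" fills all the cores of socket 0 before touching socket 1, which is exactly the “--map-by core” pattern.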
rmaps_lama_bind: Similar to mapping, this MCA parameter corresponds to the binding phase, but is more flexible than the “--bind-to” Simple option. It uses the same tokens to represent levels as the rmaps_lama_map MCA parameter, but pairs them with an integer.
For example, a binding string of “2c” means binding each process to two processor cores. A binding string of “1s” means binding each process to one processor socket. And so on.
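As a sketch, such a binding string decomposes into an integer count plus a level token. This is illustrative Python only, not Open MPI’s parser; the function name is invented:

```python
import re

# The level tokens from the mapping section above.
_LEVELS = r"h|c|s|L1|L2|L3|N|b|n"

def parse_bind(spec):
    """Split a binding string like "2c" into (count, level). Toy code."""
    m = re.fullmatch(r"(\d+)(" + _LEVELS + r")", spec)
    if not m:
        raise ValueError(f"bad binding spec: {spec!r}")
    return int(m.group(1)), m.group(2)

print(parse_bind("2c"))   # -> (2, 'c')
print(parse_bind("1L2"))  # -> (1, 'L2')
```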
rmaps_lama_mppr: By default, Open MPI will not oversubscribe resources. But what is the exact definition of “oversubscribe”? The Max Processes Per Resource (MPPR) is a user-definable specification of the oversubscription conditions. Like binding, it uses the same tokens as mapping, and pairs them with an integer. However, the format of the MPPR string is intentionally slightly different from the binding string (to emphasize their different meanings).
This is best shown through an example MPPR string: “1:c” — it is pronounced as “at most one process per core”. And “2:s” would be pronounced “at most two processes per socket”.
The MPPR defaults to “1:c”, but can be a comma-delimited list of specifications for advanced resource consumption scenarios. For example, “1:s,2:n” means to only allow one process per processor socket, and a max of two processes per server. Hence, even if a server had four processor sockets, only two MPI processes would be allowed on that server. This could be useful for MPI processes that consume large amounts of memory, for example.
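The MPPR check amounts to comparing per-resource process counts against the parsed limits. The following is a toy Python model of that logic (the function names and the usage dictionary are invented for illustration; this is not Open MPI code):

```python
def parse_mppr(spec):
    """Parse a comma-delimited MPPR string such as "1:s,2:n". Toy code."""
    limits = {}
    for part in spec.split(","):
        count, level = part.split(":")
        limits[level] = int(count)
    return limits

def placement_ok(limits, usage):
    """usage maps a level token to the process count on each instance of
    that resource, e.g. {"s": [1, 1], "n": [2]} for one process on each
    of two sockets, two processes total on the node."""
    return all(n <= limits.get(level, float("inf"))
               for level, counts in usage.items()
               for n in counts)

limits = parse_mppr("1:s,2:n")
print(placement_ok(limits, {"s": [1, 1], "n": [2]}))  # True: within both caps
print(placement_ok(limits, {"s": [2, 0], "n": [2]}))  # False: 2 per socket
```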
rmaps_lama_ordering: This parameter can take one of two values: sequential or natural. It also corresponds to the “--order” mpirun option (I didn’t mention this in the Simple options because most users won’t need to change ordering).
Sequential is the default; Open MPI behaves as if all available hardware threads were laid out in a linear fashion, and then overlays them with the mapped MPI processes. Open MPI then assigns MPI_COMM_WORLD ranks from 0 to (N-1) in a left-to-right fashion.
Natural ordering follows the mapping order. For example, consider a server node with two processor sockets, each containing four cores. The command line “mpirun -np 8 --bind-to core --map-by socket --order n a.out” would result in MCW ranks that look like this: [0 2 4 6] [1 3 5 7].
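The difference between the two orderings can be sketched in a few lines of Python. This is a toy model of the example above (two sockets, four cores each, mapped by socket), not Open MPI’s implementation:

```python
# 2 sockets x 4 cores, 8 processes, mapped "--map-by socket".
nsockets, ncores, nprocs = 2, 4, 8

# "--map-by socket": the socket index varies fastest in mapping order.
mapped = [(p % nsockets, p // nsockets) for p in range(nprocs)]  # (socket, core)

# Natural ordering: MCW rank == position in mapping order.
natural = {slot: rank for rank, slot in enumerate(mapped)}

# Sequential ordering: number the occupied slots left to right.
sequential = {slot: rank for rank, slot in enumerate(sorted(mapped))}

def per_socket(order):
    """Lay the ranks out socket by socket for display."""
    return [[order[(s, c)] for c in range(ncores)] for s in range(nsockets)]

print(per_socket(natural))     # [[0, 2, 4, 6], [1, 3, 5, 7]]
print(per_socket(sequential))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

The natural result matches the [0 2 4 6] [1 3 5 7] layout shown above; the sequential default would instead number the same physical placement left to right.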
There you have it — this is the new affinity system in Open MPI. Use it to go forth and explore, compute, and be productive!