Today we feature a guest post from Torsten Hoefler, the Performance Modeling and Simulation lead of the Blue Waters project at NCSA, and Adjunct Assistant Professor at the Computer Science department at the University of Illinois at Urbana-Champaign (UIUC).
I’m sure everybody has heard about network topologies, such as 2D or 3D tori, fat trees, Kautz networks, and Clos networks. It can be argued that even a multi-core node (if run in “MPI everywhere” mode) forms a separate, hierarchical network. And you have probably also wondered how to map your communication onto such network topologies in a portable way.
MPI has offered support for such optimized mappings since the old days of MPI-1. The process topology functionality is probably one of the most overlooked useful features of MPI. We have to admit that it had some issues and was clumsy to use, but it was finally fixed in MPI-2.2.
But what are process topologies about? Well, as the name says, a user can tell MPI about the logical communication topology (the flow of messages between processes) of his MPI job. The user can then query the information he put into MPI, which by itself doesn’t seem all that useful. The most useful feature, however, is that he can request an optimized mapping of the running processes to the ranks in the newly created topology. The library simply returns a new numbering of ranks in the created communicator, one that optimizes the mapping of the logical topology onto the underlying system architecture.
It is now clear that this feature is only useful for relatively static topologies. But how many applications do you know that communicate in very regular structures with nearest neighbors? Many applications simply use n-dimensional grids for communication. More irregular applications still use a topology that is based on the structure of the input system (e.g., the sparsity structure of the stiffness matrix). Even adaptive mesh refinement often runs multiple iterations on a given mesh to amortize the creation costs.
So it seems that application performance could be improved with better mappings. But how does this work in practice? MPI allows the user to create so-called “topological communicators”, communicators that have the user’s process topology attached. MPI-1 offers two mechanisms to create such communicators: a scalable mechanism for n-dimensional Cartesian topologies and a non-scalable mechanism for arbitrary (graph) topologies. The MPI-1 graph constructor requires each process to specify and store the full graph. MPI-2.2 introduced a new scalable constructor, the distributed graph topology, which allows the graph to be specified and stored in a distributed fashion. We discuss and document the new graph topology interfaces for end users in “The Scalable Process Topology Interface of MPI 2.2“.
Now the user side seems pretty clear: graphs can be specified easily and portably. However, the question remains: does it help in practice, and how well is it supported? Well, first I have to disappoint you: general topology mapping (even mere renumbering) is NP-hard. This is proved in “Generic Topology Mapping Strategies for Large-scale Parallel Architectures“. However, that doesn’t mean that the simple Cartesian mappings cannot be improved! Several implementations, for example IBM’s Blue Gene/P MPI, support optimized mappings for Cartesian topologies. The number of MPI libraries that support mapping of arbitrary graph topologies is, unfortunately, negligible, because mapping is hard and the benefits are unclear. In “Generic Topology Mapping Strategies for Large-scale Parallel Architectures” we propose two mapping algorithms and show that they perform well in practice (reducing congestion by up to 80%). We hope that such strategies are adopted by library implementors to make the users’ lives easier!
It is also a chicken-and-egg problem: no library will optimize the mapping until applications use it, and no application will use it until it provides benefits. However, the application side is very simple, and libraries that automatically create graph topologies from the output of parallel graph partitioners already exist. Thus, programmers should consider using the topology interface to create an incentive (or pressure 😉 ) for MPI implementors to support this functionality. One should make sure, however, to use the Cartesian or distributed graph constructors in order to guarantee scaling to large numbers of processes.
One more thing: MPI-3 is on its way, and a proposal exists to extend process topologies with communication capabilities. The proposal introduces new “collective” operations that act on process topologies, so that each process communicates with its local neighbors. Such a pattern is typical for halo exchanges in many codes. The collective and declarative specification allows the MPI implementation to compute optimized messaging schedules and to relay messages through other processes, just as it has done for collective operations since MPI-1. Initial experimental results and a detailed description of the envisioned functionality can be found in “Sparse Collective Operations for MPI“.
That’s it for now! Remember: process topologies are a beautiful, portable abstraction for achieving optimized topology mapping in MPI-2.2 and optimized communication for arbitrary communication patterns in MPI-3.