The anatomy of MPI implementation optimizations
MPI implementations are large, complex beasts. By definition, they span many different layers ranging from the user-level implementation all the way down to sending 1s and 0s across a physical network medium.
However, not all MPI implementations actually “own” the code at every level. Consider: a TCP-based MPI implementation only “owns” the user-level middleware. It cannot see or change anything in the TCP stack (or below). Such an implementation is limited to optimizations at the user space level.
That being said, there certainly are many optimizations possible at the user level. In fact, user space is probably where the largest number of optimizations are typically possible. Indeed, nothing can save your MPI_BCAST performance if you’re using a lousy broadcast algorithm.
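To put a rough number on the "lousy broadcast algorithm" point, here is a small sketch that compares two textbook broadcast schedules: a linear one, where the root sends to every other rank itself, and a binomial tree, where every rank that already holds the data forwards it. It is a toy model and not any MPI implementation's actual code. The function names are made up for illustration, and it assumes each rank can send at most one message per step and that every message costs the same.

```python
def linear_bcast_schedule(nprocs, root=0):
    """Root sends to every other rank itself, one message per step."""
    return [[(root, dst)] for dst in range(nprocs) if dst != root]


def binomial_bcast_schedule(nprocs, root=0):
    """Each step, every rank that already holds the data forwards it once."""
    schedule = []
    have = 1  # how many ranks (numbered relative to the root) hold the data
    while have < nprocs:
        step = []
        for rel_src in range(have):
            rel_dst = rel_src + have
            if rel_dst < nprocs:
                # Translate root-relative ranks back to real ranks.
                step.append(((rel_src + root) % nprocs,
                             (rel_dst + root) % nprocs))
        schedule.append(step)
        have *= 2
    return schedule


def check_schedule(schedule, nprocs, root=0):
    """Replay a schedule: senders must hold the data; all ranks must end with it."""
    holders = {root}
    for step in schedule:
        senders = [src for src, _ in step]
        assert all(src in holders for src in senders)
        assert len(set(senders)) == len(senders)  # one send per rank per step
        holders.update(dst for _, dst in step)
    return holders == set(range(nprocs))


for n in (8, 64, 1024):
    linear = linear_bcast_schedule(n)
    tree = binomial_bcast_schedule(n)
    assert check_schedule(linear, n) and check_schedule(tree, n)
    print(n, "ranks:", len(linear), "linear steps vs.", len(tree), "tree steps")
```

Both schedules send the same total number of messages (one per non-root rank); the tree simply spreads the sending across ranks, so the number of sequential steps grows with log2 of the process count instead of linearly. Real implementations choose among several algorithms based on message size and process count, which this sketch ignores.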
However, lower-layer optimizations are just as important, and can deliver many things that simply cannot be effected from user space.
There are three typical areas for optimization below user space code:
- Operating system (e.g., the network / NIC driver)
- Firmware
- Hardware
Let’s talk about these in order.
OS / driver: Having good OS support (usually in the form of some kind of NIC driver) can provide asynchronous message passing progress from the perspective of the user space MPI application. It can receive messages and progress large sends “behind the scenes”, perform MPI communicator+tag matching, dispatch messages between multiple different MPI processes on the same server, etc. Using a “smart” OS driver, by definition, means that CPU cycles are taken away from the MPI application, but it still can be an overall performance win from the perspective of asynchronous progress.
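For a feel of what "communicator+tag matching" involves, here is a minimal sketch of the usual two-queue bookkeeping: a list of posted receives and a list of "unexpected" messages that arrived before a matching receive was posted. It is a simplified model only. The `Matcher` class, its method names, and the wildcard values are hypothetical stand-ins rather than any real driver's or MPI library's API, and a real implementation would also deal with buffers, message sizes, and concurrency.

```python
ANY_SOURCE = -1  # illustrative stand-in for MPI_ANY_SOURCE
ANY_TAG = -1     # illustrative stand-in for MPI_ANY_TAG


def _matches(r_comm, r_src, r_tag, m_comm, m_src, m_tag):
    """A receive matches a message on communicator, source, and tag."""
    return (r_comm == m_comm
            and r_src in (ANY_SOURCE, m_src)
            and r_tag in (ANY_TAG, m_tag))


class Matcher:
    def __init__(self):
        self.posted = []      # (req, comm, src, tag), oldest first
        self.unexpected = []  # (comm, src, tag, payload), oldest first
        self.completed = {}   # req -> payload
        self._next_req = 0

    def post_recv(self, comm, src=ANY_SOURCE, tag=ANY_TAG):
        """Post a receive; complete it at once if a queued message matches."""
        req = self._next_req
        self._next_req += 1
        for i, (m_comm, m_src, m_tag, payload) in enumerate(self.unexpected):
            if _matches(comm, src, tag, m_comm, m_src, m_tag):
                del self.unexpected[i]
                self.completed[req] = payload
                return req
        self.posted.append((req, comm, src, tag))
        return req

    def incoming(self, comm, src, tag, payload):
        """Handle an arriving message: deliver it to the oldest matching receive."""
        for i, (req, r_comm, r_src, r_tag) in enumerate(self.posted):
            if _matches(r_comm, r_src, r_tag, comm, src, tag):
                del self.posted[i]
                self.completed[req] = payload
                return
        self.unexpected.append((comm, src, tag, payload))


m = Matcher()
r0 = m.post_recv(comm=0, src=1, tag=7)
m.incoming(comm=0, src=1, tag=9, payload=b"early")  # no match: queued as unexpected
m.incoming(comm=0, src=1, tag=7, payload=b"hello")  # matches r0
r1 = m.post_recv(comm=0)                            # wildcards pick up the queued message
```

If a driver runs this kind of logic on the application's behalf, `incoming` can execute while the application is busy computing, which is the asynchronous progress described above.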
Firmware: If a network interface supports some level of programmability, some MPI tasks can be offloaded to the NIC’s processor. The main CPU can then go off and compute FFTs (or whatever the MPI application is doing besides communicating). Such tasks usually have to be limited in size and scope; there’s only so much you can fit in the limited resources of a NIC, after all. But several vendors have had great success with accelerating small-sized MPI collective operations such as barriers, broadcasts, and reductions.
Hardware: Firmware typically exposes hardware features in MPI-specific ways (software is typically used if the hardware’s features must be adapted to MPI). As mentioned above, NIC offload is a great feature that MPI implementations can utilize. Quality of service support, timing precision, multiple queue buffer zones, and tagged message dispatch are other features that can be exposed via firmware and utilized by MPI implementations. Hardware acceleration of such features is typically faster than corresponding software implementations, but must be evaluated in terms of resource consumption trade-offs.
Hardware is usually the slowest level to be developed. Firmware, OS drivers, and user space middleware typically follow the hardware.
Not many networks are custom-built for MPI. But the ones that are tend to have the most potential for optimization and efficiency.
That being said, today’s networking interfaces (even Ethernet interfaces) are subject to many different messaging requirements. MPI isn’t the only high-performance networking abstraction in town. Consider other well-financed markets: networked databases, web and email server farms, financial trading platforms, virtual machine server farms, … and so on. The list is long.
As a result, NICs are getting much more complex; they support many different types of hardware and firmware optimizations that also apply to MPI.
This is going to be fun. 🙂