Short message latency and NUMA effects
Here’s another subtle NUMA effect that a well-tuned MPI implementation can hide from you: intelligently distributing traffic between multiple network interfaces.
Yeah, yeah, most MPI implementations have had so-called “multi-rail” support for a long time (i.e., using multiple network interfaces for MPI traffic). But there’s more to it than that.
For the purposes of this blog entry, assume that you have one network uplink per NUMA locality in your compute servers.
Cisco’s upcoming ultra-low latency MPI transport in Open MPI, for example, examines each compute server at MPI job startup. It makes many setup decisions based on what it finds; let’s describe two of these decisions in detail (sidenote: I’m using Cisco’s “usNIC” as an example because I’m not aware of other transports making these same kinds of setup decisions)…
1. Limit short message NUMA distance
Traditional multi-rail support in MPI implementations round-robins message fragments between all available network interfaces. This allows MPI to split very large messages across multiple network links, effectively multiplying available bandwidth.
The fact that some of the network interfaces may be NUMA-remote is irrelevant for large messages. By definition, the latency of large messages is already high, such that the additional latency required to traverse inter-processor links (such as Intel’s QPI network) is negligible compared to the overall transit time incurred by the message.
But for short messages, the usual argument for round-robin schemes is not about bandwidth; it’s about increasing message rates. Consider, however, what happens if the round-robin set includes NUMA-remote interfaces. In this case, traversing the inter-processor links to reach those remote interfaces can either artificially decrease the expected message rate improvements, or even outright cause an overall performance loss (e.g., due to NUMA/NUNA congestion).
For these reasons, by default, Cisco’s usNIC Open MPI transport will only use NUMA-near network interfaces for short messages. Large messages will, of course, be striped across all available network interfaces to get the expected bandwidth multiplication.
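To make that policy concrete, here is a minimal Python sketch of the selection logic described above. It is not the usNIC transport's actual code: the `Interface` type, the function names, the interface names `usnic_0`/`usnic_1`, the fallback when no NUMA-near interface exists, and the `<=` comparison against the threshold are all assumptions made purely for illustration.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass(frozen=True)
class Interface:
    # Hypothetical description of one network interface.
    name: str
    numa_node: int


def eligible_interfaces(interfaces, process_numa_node, msg_len, short_threshold):
    """Return the interfaces a message of msg_len bytes may be sent over."""
    if msg_len <= short_threshold:
        # Short message: stay on NUMA-near interfaces only.
        near = [i for i in interfaces if i.numa_node == process_numa_node]
        # Assumption: if nothing is NUMA-near, fall back to every interface.
        return near or list(interfaces)
    # Large message: every interface is fair game for striping.
    return list(interfaces)


def stripe(msg_len, fragment_size, interfaces):
    """Round-robin fragments over interfaces; returns (name, length) pairs."""
    plan = []
    next_interface = cycle(interfaces)
    offset = 0
    while offset < msg_len:
        length = min(fragment_size, msg_len - offset)
        plan.append((next(next_interface).name, length))
        offset += length
    return plan
```

With one interface per NUMA node, a process on node 1 would get only `usnic_1` back from `eligible_interfaces()` for a short message, while a large message would get both interfaces and could then be handed to `stripe()`.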
2. Dynamic short message threshold
Cisco’s usNIC transport is a bit different from other Open MPI transports. Rather than having an upper-layer engine do fragmenting, it handles fragmenting and network ACKing internally. We can therefore draw a fine line between what we want the upper-layer engine to do and what we handle down in the transport layer, such as dynamically determining the length of a “short” message:
- When there is one usNIC interface available, short = 150K
- When there is more than one usNIC interface available, short = 50K
Skipping the complicated details, determining the length of a “short” message dynamically lets us optimize how our transport layer interacts with the upper-layer MPI engine when there is only a single usNIC interface. For example, with only one usNIC interface, a longer “short” definition means that the upper-layer engine fragments messages less, which results in fewer engine-to-transport-layer traversals.
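As a rough illustration of the rule above, the following Python sketch picks the threshold from the interface count and then counts how many pieces an upper layer would hand down. The function names are hypothetical, “K” is assumed to mean 1024 bytes, and the idea that the upper layer splits a message into threshold-sized pieces is a deliberately simplified model, not a description of Open MPI's internals.

```python
KIB = 1024  # Assumption: the "K" in 150K / 50K means 1024 bytes.


def short_message_threshold(num_usnic_interfaces):
    """Return the "short" message length for a given number of interfaces."""
    if num_usnic_interfaces < 1:
        raise ValueError("need at least one usNIC interface")
    if num_usnic_interfaces == 1:
        return 150 * KIB
    return 50 * KIB


def upper_layer_pieces(msg_len, threshold):
    """Simplified model: the upper layer hands down threshold-sized pieces."""
    return -(-msg_len // threshold)  # ceiling division


# A 300 KiB message in this model:
single = upper_layer_pieces(300 * KIB, short_message_threshold(1))  # 2 pieces
multi = upper_layer_pieces(300 * KIB, short_message_threshold(2))   # 6 pieces
```

In this toy model, the single-interface case crosses the engine-to-transport boundary a third as often for the same message, which is the effect the longer “short” definition is after.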
In summary, these are two complicated implementation details that a well-tuned MPI implementation hides from you.
MPI is good for you!