Everything old is new again — NUMA is back!
With NUMA going mainstream, high performance software — MPI applications and otherwise — might need to be re-tuned to maintain their current performance levels.
A less-acknowledged aspect of HPC systems is the multiple levels of networks that data traverses on its way from MPI process A to MPI process B. This heterogeneous, multi-level network is going to become more important (again) to your applications’ overall performance, especially as the number of cores per compute server increases.
That is, it’s not going to only be about the bandwidth and latency of your “Ethermyriband” network. It’s also going to be about the network (or networks!) inside each compute server.
A Cisco colleague of mine (hi Ted!) previously coined a term that is quite apropos for what HPC applications now need to target: it’s no longer just about NUMA — NUMA effects are only one of the networks involved.
Think bigger: the issue is really about Non-Uniform Network Access (NUNA).
Let’s back up a few steps…
Many types of parallel algorithms perform best when they are implemented to be aware of data locality. For example, they take special care to fully operate on the data within a single cache line before forcing a cache miss to get the next. The best data-driven applications may even exploit this effect for multiple levels of locality (e.g., L1 through L3 and off-node).
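To make the cache-line point concrete, here is a minimal sketch of a toy model: a "cache" that holds exactly one line, fed by two traversal orders over the same 2D array. The 64-byte line size, the 8-byte element size, and all function names are assumptions chosen for illustration; real caches hold many lines and prefetch, so treat the counts as a caricature of the effect rather than a prediction.

```python
LINE_BYTES = 64  # assumed cache-line size
ELEM_BYTES = 8   # assumed element size (one double)


def line_fetches(addresses, line_bytes=LINE_BYTES):
    """Count line fetches for a toy cache that holds exactly one line."""
    fetches = 0
    current_line = None
    for addr in addresses:
        line = addr // line_bytes
        if line != current_line:
            # The access falls outside the line we hold: fetch a new one.
            fetches += 1
            current_line = line
    return fetches


def row_major(rows, cols):
    """Byte addresses when walking a row-major array one row at a time."""
    for r in range(rows):
        for c in range(cols):
            yield (r * cols + c) * ELEM_BYTES


def column_major(rows, cols):
    """Byte addresses when walking the same array one column at a time."""
    for c in range(cols):
        for r in range(rows):
            yield (r * cols + c) * ELEM_BYTES


rows, cols = 64, 64
print("row-major fetches:   ", line_fetches(row_major(rows, cols)))     # 512
print("column-major fetches:", line_fetches(column_major(rows, cols)))  # 4096
```

Walking along rows uses all eight elements of a line before moving on; walking down columns touches a different line on every access, so the same work costs eight times as many fetches in this model.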
With UMA, there was only one memory bus; it was relatively easy to model and understand how it affected overall system performance. The PCI bus was used to communicate with other compute nodes (via various flavors of network interface cards); mixing PCI/network performance characteristics into the model made it quite a bit more complex.
Despite the complexity, decent models and heuristics are readily available; many applications have tuned themselves for such architectures. But remember the (recently re-learned) lessons of 8-core processors: UMA doesn’t scale for memory- and/or IO-intensive applications.
NUMA effects were already important on AMD-based compute servers, and were critically important on large NUMA SMP systems from a few years ago (e.g., the SGI Altix). But now that Intel has finally moved into the NUMA camp, one can argue that NUMA is now both mainstream and the likely architecture target for many future HPC compute systems. As such, the memory and IO subsystem networks must also be taken into account for effects such as latency, bandwidth, congestion, etc.
To be clear: the same kinds of network effects that software needed to optimize for outside the server will now also need to be optimized for inside the server.
Sure, current NUMA networks are super fast and have lovely low latency. But they can suffer from both congestion and bottlenecks, just like any other network. This effect may be magnified as the number of ever-faster, data-hungry cores increases (e.g., will QPI/HyperTransport have to route?).
Consider: four Intel Nehalem cores on a socket can be quite demanding on the memory subsystem just to keep the instructions and data flowing. Now add MPI message passing on top of these requirements (potentially aided by fast hardware offload): sending additional data across the memory subsystem to/from PCI devices, to/from non-local memory, and so on. When will the NUMA network become saturated?
This is (yet) another curveball being thrown at parallel programmers.
Doug Eadline, Distinguished Cluster Monkey and column author for Linux Magazine, has opined on many similar software complexity issues in his columns. His recent Linux Magazine column entitled “The Core-Diameter” discusses the maximum physical diameter of an HPC cluster based on the cabling requirements for the external network.
But what about the internal network? Shouldn’t that be a factor in the core diameter as well?
My point: don’t plan only for the homogeneous Ethermyriband network between your compute servers. Instead, application developers need to take a holistic approach: think about the overall heterogeneous (Ethermyriband + QPI/HyperTransport + PCI* + …?) network between processing cores. In a simplistic sense, you can think of it as multiple levels of latency between the different network types.
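One simplistic way to picture those levels is to label each core with a (node, socket, core) tuple and ask which level of network a message between two cores has to cross. The sketch below does just that. The three-level split, the names, and especially the relative weights are made-up placeholders for illustration, not measurements of any real system.

```python
from enum import IntEnum


class Level(IntEnum):
    SAME_SOCKET = 0   # stays within one socket
    CROSS_SOCKET = 1  # crosses the in-server interconnect (e.g., QPI/HyperTransport)
    OFF_NODE = 2      # leaves the server: PCI plus the external network


# Placeholder relative weights. These are NOT measured latencies; they only
# encode the assumed ordering same-socket < cross-socket < off-node.
RELATIVE_COST = {
    Level.SAME_SOCKET: 1,
    Level.CROSS_SOCKET: 2,
    Level.OFF_NODE: 20,
}


def level(a, b):
    """Classify the path between two cores given as (node, socket, core)."""
    if a[0] != b[0]:
        return Level.OFF_NODE
    if a[1] != b[1]:
        return Level.CROSS_SOCKET
    return Level.SAME_SOCKET


def relative_cost(a, b):
    """Look up the placeholder weight for the path between two cores."""
    return RELATIVE_COST[level(a, b)]


me = (0, 0, 0)
for peer in [(0, 0, 3), (0, 1, 0), (1, 0, 0)]:
    print(peer, level(me, peer).name, relative_cost(me, peer))
```

A real model would need measured numbers per level, and probably more levels (shared caches, multiple hops between sockets, which socket the network card hangs off of).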
Applications need to be aware that non-local references cause network traffic — regardless of whether that network happens to be inside the server or both inside and outside the server. For example, many MPI applications are currently tuned for inter-node communications — minimizing MPI message traffic between nodes as much as possible. These applications may (will?) need additional tuning to ensure that intra-node communication patterns do not cause internal network congestion or other types of bottlenecks, such as resource starvation issues.
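As a rough illustration of why placement matters, the sketch below counts how many messages in a nearest-neighbor ring stay on a socket, cross sockets, or leave the node under two hypothetical rank placements. The machine shape (2 nodes, 2 sockets per node, 4 cores per socket), the placement functions, and the ring pattern are all assumptions for the example; it does not call any MPI library.

```python
from collections import Counter

# Hypothetical machine shape, chosen only for illustration.
NODES = 2
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 4
TOTAL_RANKS = NODES * SOCKETS_PER_NODE * CORES_PER_SOCKET


def by_core(rank):
    """Fill one socket, then the next socket, then the next node."""
    node, rest = divmod(rank, SOCKETS_PER_NODE * CORES_PER_SOCKET)
    socket = rest // CORES_PER_SOCKET
    return node, socket


def by_node(rank):
    """Deal ranks round-robin across nodes, then across sockets."""
    node = rank % NODES
    socket = (rank // NODES) % SOCKETS_PER_NODE
    return node, socket


def ring_traffic(placement, nranks=TOTAL_RANKS):
    """Count ring messages (rank -> rank + 1) by the network level they cross."""
    counts = Counter()
    for rank in range(nranks):
        src = placement(rank)
        dst = placement((rank + 1) % nranks)
        if src[0] != dst[0]:
            counts["off-node"] += 1
        elif src[1] != dst[1]:
            counts["cross-socket"] += 1
        else:
            counts["same-socket"] += 1
    return counts


print("by core:", dict(ring_traffic(by_core)))
print("by node:", dict(ring_traffic(by_node)))
```

In this toy setup, filling by core keeps 12 of the 16 ring messages on a single socket, while dealing ranks round-robin across nodes pushes all 16 onto the external network. The same kind of counting, applied one level down, shows which placements load the in-server interconnect.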
NUMA: that’s so 2004-esque. NUNA: that’s the new challenge.