How many network links do you have for MPI traffic?
If you’re a bargain basement HPC user, you might well scoff at the idea of having more than one network interface for your MPI traffic.
“I’ve got (insert your favorite high bandwidth network name here)! That’s plenty to serve all my cores! Why would I need more than that?”
I can think of (at least) three reasons off the top of my head.
I’ll disclaim this whole blog entry by outright admitting that I’m a vendor with an obvious bias for selling more hardware. But bear with me; there is an actual engineering issue here.
Here’s three reasons for more network resources in a server:
- Processors are getting faster
- Core counts are rising
- NUMA effects and congestion within a single server
Think of it this way: MPI applications tend to be bursty with communication. They compute for a while, and then they communicate.
Since processors are getting faster, the length of computation time between communications can be decreasing. As a direct result, that same MPI application you’ve been running for years is now communicating more frequently, simply because it’s now running on faster processors.
Add to that the fact that you now have more and more MPI processes in a single server. Remember when four MPI processes per server seemed like a lot? 16 MPI processes per server is now commonplace. And that number is increasing.
And then add to that the fact that MPI applications have been adapted over the years to assume the availability of high-bandwidth networks. “That same MPI application you’ve been running for years” isn’t really the same — you’ve upgraded it over time to newer versions that are network-hungry.
Consider this inequality in the context of MPI processes running on a single server:
num_MPI_processes * network_resources_per_MPI_process ?=
Are the applications running in your HPC clusters on the left or right hand side of that inequality? Note that the inequality refers to overall network resources — not just bandwidth. This includes queue depths, completion queue separation, ingress routing capability, etc.
And then add in another complication: NUMA effects. If you’ve only got one network uplink from your fat server, it’s likely NUMA-local to some of your MPI processes and NUMA-remote from other MPI processes on that server.
Remember that all MPI traffic from that remote NUMA node will need to traverse inter-processor links before it can hit the PCI bus to get to the network interface used for MPI. On Intel E5-2690-based machines (“Sandy Bridge”), traversing QPI links can add anywhere from hundreds of nanoseconds to a microsecond of short message half-roundtrip latency, for example. And we haven’t even mentioned the congestion/NUNA effects inside the server, which can further degrade performance.
My point is that you need to take a hard look at the applications you run in your HPC clusters and see if you’re artificially capping your performance by:
- Not having enough network resources (bandwidth is the easiest to discuss, but others exist, too!) on each server for the total number of MPI processes on that server
- Not distributing network resources among each NUMA locality in each server