It seems like we’ve gotten a rash of “how do I set up my new cluster for MPI?” questions on the Open MPI mailing list recently.

I take this as a Very Good Thing, actually — it means more and more people are tinkering with and discovering the power of parallel computing, HPC, and MPI.

In previous posts, I have discussed requirements for installing MPI and instructions for how to build MPI applications.

In this entry, I’ll talk about network considerations for your HPC cluster.

There are a million things to think about; I’ll only talk about some of the MPI-based issues.

As with the prior blog entries, I’ll cite specific Open MPI examples (because I’m an Open MPI developer) but the same concepts also apply to other MPI implementations.

Network design

You may have a single, simple network connecting your HPC servers (which is usually some form of Ethernet — either 1G, 10G, or 40G).

Or you may have (at least) two networks connected to each compute server, typically split as:

  1. One network for ssh and administrative traffic. This network is almost always Ethernet.
  2. Another for MPI traffic (and possibly filesystem/IO traffic).

If you don’t have much money, it’s common to have a single Ethernet network.

If you have a little more funding, it’s common to have a simple/cheap Ethernet network for ssh/admin traffic and a second high-performance network for MPI.

If you have lots of funding and bandwidth-hungry MPI applications, you can attach multiple ports of a high-performance network to each server and let MPI (and/or your filesystem) use one or more of those ports.

Depending on how much filesystem IO you need, you may want to have your MPI and filesystem share access to the high-performance networks. Keep in mind that sharing a network can hurt the performance of both.

This will really depend on the MPI and filesystem IO traffic patterns for the applications that you run. For example, the apps you run may use MPI and the filesystem at different times, so having them share a single network will be fine. Or your apps may heavily use both MPI and the filesystem at the same time, which could lead to congestion and delay.

You may also have multiple high-performance networks. Perhaps one network for MPI and the other for filesystem IO. Or perhaps both networks can be used by MPI (most high-quality MPI implementations can stripe across multiple network interfaces). Again, it depends on the bandwidth and latency requirements of the MPI applications that you intend to run.

What type of network should I buy?

I’m a vendor with an obvious bias, so take my advice with a grain of salt here. But this is a common question, so I’ll try to be fair and balanced in my answer.

There are five typical types of networks/network adapters that people buy for commodity server-based HPC clusters these days (as of May 2014):

  1. Plain vanilla Ethernet adapters with no special HPC acceleration, typically 1G or 10G.
  2. Cisco 10G or 40G UCS Ethernet VIC adapters.
  3. iWARP 10G RNIC Ethernet adapters.
  4. Mellanox 10G or 40G RoCE Ethernet adapters.
  5. InfiniBand QDR or FDR adapters.

Each of these options has its own benefits and tradeoffs, most of which are hotly debated by various marketing teams.

One thing that is safe to say is that plain vanilla Ethernet adapters (with no special HPC acceleration) will limit you to using the lowest-performance option, comparatively speaking: TCP. TCP’s latency will be far higher than that of any of the other options, which, depending on your MPI applications, may hurt overall performance.

The other options all offer significantly lower latency and higher performance than TCP over non-accelerated Ethernet adapters.

Which network(s) should my MPI traffic use?

In most cases, if you have a high-performance network, your MPI will likely auto-detect it and automatically prefer that network for MPI traffic.
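If you are curious what your Open MPI installation can choose from, one way to look is to list its BTL components (one of the plugin types Open MPI uses for network transports). The sketch below assumes the ompi_info tool that ships with Open MPI is in your PATH. The helper name list_btl_components is made up for illustration, and the "MCA btl:" text it searches for is an assumption about ompi_info’s output format that may vary between versions.

```shell
# list_btl_components: print the BTL components that ompi_info reports.
# Hypothetical helper; assumes ompi_info is in PATH and returns non-zero if not.
list_btl_components() {
    if ! command -v ompi_info >/dev/null 2>&1; then
        echo "ompi_info not found in PATH" >&2
        return 1
    fi
    # Component lines are assumed to contain the text "MCA btl:".
    ompi_info | grep 'MCA btl:'
}

# Usage (on a node with Open MPI installed):
#   list_btl_components
```

Seeing a component in this list only means it was built into your installation; whether it is actually used at run time still depends on the hardware that Open MPI finds on each node.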

However, if you have no HPC-accelerated network and you have multiple Ethernet networks (e.g., a 1G and a 10G network), you might need to tell your MPI implementation which one to use.

If there are no high-performance networks available, Open MPI, for example, will default to using all TCP networks. If you want Open MPI to use only eth2, for example (because that’s the network you want MPI traffic to use), you can use the btl_tcp_if_include MCA parameter:

shell$ mpirun --mca btl_tcp_if_include eth2 ...
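If you would rather not repeat the option on every command line, Open MPI also reads MCA parameters from environment variables named OMPI_MCA_ followed by the parameter name. A minimal sketch, assuming a Bourne-style shell and that eth2 is the interface you want (my_mpi_app is a hypothetical application name):

```shell
# Same effect as "--mca btl_tcp_if_include eth2" on the mpirun command line.
export OMPI_MCA_btl_tcp_if_include=eth2

# Later mpirun invocations from this shell pick up the setting, e.g.:
#   mpirun ./my_mpi_app    # my_mpi_app is a hypothetical application name
```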

We’ve seen some cases where the ethX mappings are different across cluster nodes, so you can also specify which network to use via CIDR notation:

shell$ mpirun --mca btl_tcp_if_include 10.10.0.0/16 ...

This form will always use the 10.10.x.y network, regardless of which ethX interface it uses on each machine.
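To check which interface actually sits on that network on a given node, something like the following may help. It is only a sketch: it assumes a Linux node with the iproute2 ip tool, iface_for_prefix is a made-up helper name, and the match is a plain string prefix on the address (such as "10.10."), not a real CIDR calculation.

```shell
# iface_for_prefix PREFIX: read "ip -o -4 addr show" output on stdin and print
# the name of each interface whose IPv4 address starts with PREFIX.
# Hypothetical helper; PREFIX is matched as a plain string, not as a CIDR block.
iface_for_prefix() {
    awk -v prefix="$1" '$3 == "inet" && index($4, prefix) == 1 { print $2 }'
}

# Usage on a compute node (requires the iproute2 "ip" tool):
#   ip -o -4 addr show | iface_for_prefix 10.10.
```

Running this on a few nodes (over ssh, for example) shows whether the 10.10.x.y network lands on the same ethX name everywhere, which tells you whether the interface-name form or the CIDR form is the safer choice.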



Authors

Jeff Squyres

The MPI Guy

UCS Platform Software