As an HPC old-timer, I’m used to thinking of HPC networks as large layer-2 (L2) subnets.  All HPC traffic (e.g., MPI traffic) is therefore designed to stay within a single L2 subnet.

The next layer up — L3 — is the “network” layer in the OSI model; it adds more abstractions than are available at L2.  For example, IP switching and routing occur at L3.  Indeed, L3-based networks can be composed of multiple subnets.

I’ve come to appreciate that, especially with modern high-speed networking gear, there is no reason to limit HPC networks to L2.

For example, today’s high-speed switches incur zero penalty for switching in L3 vs. switching in L2.  Just to make this point totally clear: modern hardware can make switching decisions about L3 packets at exactly the same speed as it can make switching decisions about L2 packets.

Don’t get me wrong — I’m not advocating using the general-purpose OS networking stack for UDP and TCP.  We’ve known for a long time that bypassing the OS networking stack is important to HPC application performance.

Nor am I advocating running HPC applications across a WAN.  Lots of research has shown that WAN-spanning HPC is… at best, different from HPC within a single datacenter (perhaps this can be a topic for a future blog entry…).

What I’m specifically talking about is getting all the speed and latency benefits of bypassing the OS while also utilizing L3 within a single datacenter.

On Ethernet-based networks, this doesn’t necessarily mean doing full-fledged TCP — that path leads to madness.  TCP, even hardware-accelerated TCP, is quite a bit more than what HPC (MPI) applications really need, and can actually incur performance penalties.

Instead, all you really need are IP headers (although also including UDP headers makes the traffic more familiar to many networking tools and administrators).  The IP layer enables many L3 features; switching and routing around a datacenter is probably the easiest to cite.
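To make concrete just how little header overhead this really is, here’s a minimal sketch in Python of building a raw IPv4+UDP header pair (field layouts per RFC 791 and RFC 768, checksum per RFC 1071).  The function names are my own illustration — this is not any product’s API, just the 28 bytes of framing we’re talking about:

```python
import socket
import struct

def ip_checksum(data: bytes) -> int:
    """Internet checksum (RFC 1071): one's-complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def build_ipv4_udp(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
                   payload: bytes) -> bytes:
    """Build a minimal IPv4 header (no options) + UDP header + payload."""
    src = socket.inet_aton(src_ip)
    dst = socket.inet_aton(dst_ip)
    udp_len = 8 + len(payload)
    # UDP header: src port, dst port, length, checksum (0 = none; legal in IPv4)
    udp = struct.pack("!HHHH", src_port, dst_port, udp_len, 0)
    total_len = 20 + udp_len
    # IPv4 header: version/IHL, TOS, total length, ID, flags/frag offset,
    # TTL, protocol (17 = UDP), header checksum (placeholder), src, dst
    hdr = struct.pack("!BBHHHBBH4s4s",
                      0x45, 0, total_len, 0, 0, 64, 17, 0, src, dst)
    csum = ip_checksum(hdr)
    hdr = hdr[:10] + struct.pack("!H", csum) + hdr[12:]
    return hdr + udp + payload
```

That’s 20 bytes of IPv4 plus 8 bytes of UDP per packet — everything a datacenter switch needs to make an L3 forwarding decision.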

Other Ethernet-based L3 benefits include:

  • L3 networks can scale inherently larger than L2 networks
  • Using multiple IP subnets effectively eliminates ARP storms seen in large L2 networks
  • No need to trunk physically disparate datacenter resources into a single, large L2 subnet
  • Equal-cost multi-path (ECMP) routing
  • L3 traffic is well known and understood by network tools and administrators

The main point here is that we should eliminate the “L3? That’s a terrible idea!” knee-jerk reaction.  L3 has lots of desirable features, and is now able to play well with HPC/MPI applications.

Disclaimer: yeah, I’m obviously referring to my own usNIC product here (which is an Ethernet-based OS-bypass product).  But my point is larger than that: we’ve all been used to thinking that L2 was the only way to get high performance.  It’s not.



Jeff Squyres

The MPI Guy

UCS Platform Software