
In my last entry, I gave a vehicles-driving-in-a-city analogy for network traffic.

Let’s tie that analogy back to HPC and MPI.

The “managed chaos” of tribillions of individual point-to-point network flows generally works because each vehicle follows a common set of basic driving rules.

By its very nature, this traffic resists specific predictions about its behavior. Similarly, it can be difficult to optimize: to minimize congestion and accidents, maximize flow rates and throughput, and so on.

Sure, chaos theory and complicated mathematical models can make some general predictions about group behavior. But making specific predictions about the behavior of any individual packet can be quite difficult.

But when you look down at that city and observe all the traffic flowing around, there are a few cases where there’s less chaos than you might initially assume.

1. Switches and routers effect policies that regulate traffic. In a managed network, certain types of traffic are made to behave in a specific way. VLANs, for example, separate traffic into distinct groups of sources and destinations. They only allow crossing to a different VLAN at specific, well-defined points.

In the city traffic analogy, think of VLANs as different modes of transportation; say, trains and cars. They have different traffic paths, and passengers can only transfer between them at specific, well-defined points.

Quality of service policies can be applied to traffic, too. Consider that trains and cars move at different speeds and may not even be affected by each other. As another example: motorcycles tend to go between other types of vehicles, and can even move up to the front of the line at stoplights.

2. Long-standing traffic patterns. Let’s say you’re downloading a very large file. This is obviously not just one giant network packet; it’s an ordered flow of packets from a specific source to a specific destination. In the city traffic analogy, think of it as a convoy of vehicles.

By observing the entire convoy, you can make some predictions about the behavior of the individual vehicles in it. You can know that they’ll turn left at this particular intersection. You can know that they’ll all be carrying payloads of (roughly) size X. You can know exactly where they’re going.

When switches and routers see convoys like that, they can effect policy decisions to optimize both this individual convoy and the other traffic flowing both around and through it.
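To make the convoy idea a bit more concrete, here is a minimal sketch of grouping individual packets into flows by their source/destination pair and summarizing each one. The packet tuples, host names, and the `summarize_flows` helper are all made up for illustration; real switches track flows with much richer keys and do so in hardware.

```python
from collections import defaultdict

def summarize_flows(packets):
    """Group (src, dst, payload_size) packets into convoys keyed by (src, dst)."""
    flows = defaultdict(list)
    for src, dst, size in packets:
        flows[(src, dst)].append(size)
    # For each convoy, report how many vehicles it has and their typical payload.
    return {
        key: {"packets": len(sizes), "mean_size": sum(sizes) / len(sizes)}
        for key, sizes in flows.items()
    }

# Hypothetical traffic: one long-ish download from A to B, one stray packet from C to D.
packets = [("A", "B", 1500), ("A", "B", 1500), ("C", "D", 64), ("A", "B", 1400)]
summary = summarize_flows(packets)
for flow, stats in summary.items():
    print(flow, stats)
```

Once packets are grouped like this, the "they all go to the same place and carry roughly the same payload" observations fall out of the per-flow summary rather than from inspecting any single packet.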

3. HPC/MPI traffic has distributed structure. MPI traffic tends to generate similar convoys between multiple source/destination pairs that tend to overlap in time.

Consider when your application calls MPI_BCAST on a large message. For simplicity, let’s say that your MPI implementation picks some kind of tree-based distribution algorithm to effect this broadcast.

First, it will start a convoy from the tree root to the root’s first child. Then it will start another convoy from the root to its second child. And so on.

Shortly thereafter, the first child will start its own convoys to its children. Similarly, the root’s second child will start its own convoys. And so on.

Remember: this is a broadcast of a large message. So it’s quite possible that the convoys will be quite long — the root may still be sending out individual convoy vehicles while its descendants are sending out individual convoy vehicles.

As a direct consequence, this MPI traffic is highly structured for a lengthy amount of time.
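The overlap described above can be sketched with a toy model. This is not how any particular MPI implementation performs its broadcast; it assumes a binary tree over the ranks, a message cut into equal segments, one segment crossing a link per time step, and a parent starting the convoy to its second child one step after the first. The function names are hypothetical.

```python
from collections import deque

def tree_children(rank, size):
    """Children of `rank` in a binary tree laid out over ranks 0..size-1."""
    return [c for c in (2 * rank + 1, 2 * rank + 2) if c < size]

def convoy_schedule(size, num_segments):
    """Map each (parent, child) link to the (first, last) time step of its convoy.

    Toy model: one segment crosses a link per step, and a parent starts the
    convoy to its i-th child i steps after the convoy to its first child.
    """
    arrival = {0: 0}  # step at which each rank holds the first segment
    schedule = {}
    pending = deque([0])
    while pending:
        parent = pending.popleft()
        for i, child in enumerate(tree_children(parent, size)):
            start = arrival[parent] + 1 + i
            schedule[(parent, child)] = (start, start + num_segments - 1)
            arrival[child] = start
            pending.append(child)
    return schedule

def active_links(schedule, step):
    """Links that are carrying a segment during the given time step."""
    return sorted(
        link for link, (first, last) in schedule.items() if first <= step <= last
    )

schedule = convoy_schedule(size=7, num_segments=4)
for step in range(1, 8):
    print(step, active_links(schedule, step))
```

With 7 ranks and 4 segments, every one of the six tree links is busy at step 4: the root is still sending while its children are already forwarding. Make the message longer (more segments) and that fully overlapped window grows with it.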

Not only can individual switches and routers detect this kind of behavior and optimize for it, but MPI implementations themselves can also make good decisions about this long-lasting traffic pattern. For example, MPI may exploit the particular topology of the network on which it is operating by choosing an algorithm that causes no congestion between the traffic in the broadcast.

Here’s where MPI is different from the union of kazillions of individual point-to-point streams: all the participants in the broadcast can collectively, yet in a distributed fashion, choose the routes that their convoys will take in a manner that will not cause traffic jams. In short: they can choose different roads for each individual convoy route (or perhaps choose roads that have large enough capacity to handle multiple convoys simultaneously).
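As a rough illustration of "choose different roads for each convoy," here is a small greedy sketch. It assumes a made-up two-spine network where every convoy has two candidate paths, and it picks, for each convoy in turn, the path whose busiest link is currently the least loaded. The link names and the `choose_routes` helper are hypothetical, and the loop is centralized for simplicity, whereas the participants described above make their choices in a distributed fashion.

```python
from collections import Counter

def choose_routes(candidate_paths):
    """Greedily assign each convoy a path, avoiding already-busy links.

    candidate_paths maps a convoy (src, dst) to a list of paths, where each
    path is a tuple of link names. Returns (chosen paths, per-link load).
    """
    load = Counter()
    chosen = {}
    for convoy, paths in candidate_paths.items():
        # Prefer the path whose most heavily used link carries the fewest convoys.
        best = min(paths, key=lambda path: max(load[link] for link in path))
        chosen[convoy] = best
        for link in best:
            load[link] += 1
    return chosen, load

# Hypothetical topology: the root sits under leaf0; its two children sit under
# leaf1 and leaf2; either spine switch can carry either convoy.
candidates = {
    (0, 1): [("leaf0-spineA", "spineA-leaf1"), ("leaf0-spineB", "spineB-leaf1")],
    (0, 2): [("leaf0-spineA", "spineA-leaf2"), ("leaf0-spineB", "spineB-leaf2")],
}
chosen, load = choose_routes(candidates)
for convoy, path in chosen.items():
    print(convoy, "->", path)
```

In this tiny example the two convoys leaving the root end up on different spines, so no link carries more than one of them. A greedy pass like this is only a heuristic; it says nothing about what a real MPI library or fabric manager would compute.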

Indeed, this effect has driven a lot of network design for HPC systems over the last several decades.

Pretty cool, right?

Sidenote: The managed chaos of MPI traffic is one of the reasons I chose the city-traffic-at-night graphic logo for this blog.



Authors

Jeff Squyres

The MPI Guy

UCS Platform Software