I’ve written about network traffic before (see this post and this post). It’s the subject of endless blog posts, help forums, and instructional guides across the internet.

In a High Performance Computing (HPC) context, some fascinating aspects of network traffic are fairly different from other types of network traffic.

But before we can talk about that, you need to appreciate that switching/routing of network traffic frequently has a lot in common with distributed computing.

For example, switching/routing decisions are commonly made at a local level: any given switch usually only knows the next hop that a packet should take. In a simplistic case, the switch looks at an ingress packet’s final destination, and then throws the packet out some egress port in the direction of that destination.

Individual switches and routers generally don’t have much more knowledge than that.
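
To make the next-hop idea concrete, here’s a minimal sketch in Python. Everything in it is an assumption made up for illustration: the three-switch topology, the names `sw1` through `sw3`, and the hosts. Real switches use far more elaborate lookup structures, but the point is the same: each switch consults only its own table.

```python
# Hypothetical topology: hostA - sw1 - sw2 - sw3 - hostC, with hostB hanging off sw2.
# Each switch has its own table mapping a final destination to a next hop.
FORWARDING_TABLES = {
    "sw1": {"hostA": "hostA", "hostB": "sw2", "hostC": "sw2"},
    "sw2": {"hostA": "sw1", "hostB": "hostB", "hostC": "sw3"},
    "sw3": {"hostA": "sw2", "hostB": "sw2", "hostC": "hostC"},
}


def next_hop(switch, destination):
    """Local decision: this switch looks only at its own table."""
    return FORWARDING_TABLES[switch][destination]


def route(first_switch, destination):
    """Chain the per-switch decisions together until the packet is delivered."""
    path = [first_switch]
    current = first_switch
    while current != destination:
        current = next_hop(current, destination)
        path.append(current)
    return path


print(route("sw1", "hostC"))  # ['sw1', 'sw2', 'sw3', 'hostC']
```

No switch in this sketch knows the whole path. The path only emerges when the individual next-hop decisions are strung together.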

Think of it this way: there’s a bazillion switches and routers across the greater internet that make a gabillion individual, de-centralized decisions that all somehow work to make your packets get to the right destination.

That’s essentially distributed computing: lots of individual, de-centralized decisions working towards a common goal.

Here’s MPI’s analog to that: many MPI implementations use this same kind of distributed computing decision-making pattern.

For example: your application invokes MPI_BCAST across all N MPI processes.  What algorithm will the MPI implementation use to effect that broadcast? There are many choices available. Some of them are good in X scenarios. Others are good in Y topologies. And still others are wonderful for Z architectures.

Which algorithm should be used right now?

In some cases, MPI implementations may actually exchange some network messages to agree on which broadcast algorithm will be used. These messages may even be exchanged during communicator creation, so that the cost of agreeing is paid once rather than on each individual invocation of MPI_BCAST.
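
Here’s a rough simulation of that agree-by-exchanging-messages approach. It is not how any particular MPI implementation does it. The "does my hardware support multicast" flag, the algorithm names, and the helper functions are all hypothetical, and a plain Python list stands in for the set of MPI processes.

```python
def simulated_allreduce_and(local_flags):
    """Stand-in for a logical-AND allreduce: every rank gets the same combined value."""
    combined = all(local_flags)
    return [combined for _ in local_flags]


def create_communicator(local_multicast_ok):
    """Runs once, at communicator creation.

    local_multicast_ok[i] is what rank i discovered about its own hardware.
    Returns the broadcast algorithm that each rank caches for later calls.
    """
    agreed = simulated_allreduce_and(local_multicast_ok)
    return ["hardware_multicast" if ok else "binomial_tree" for ok in agreed]


# Rank 2 lacks multicast support, so every rank falls back to the same algorithm.
cached_choice = create_communicator([True, True, False, True])
print(cached_choice)
```

Because the choice is cached when the communicator is created, later broadcasts in this sketch would not need to exchange anything to know which algorithm to use.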

But in other cases, MPI implementations may make the determination locally in each MPI process, simply by examining the same data.

For example, if the selection of broadcast algorithm is solely determined by the number of processes involved, then each MPI process can simply check that data point locally without needing to consult any of its peers. It is therefore trivial for each MPI process to independently come to exactly the same conclusion about which broadcast algorithm to use.
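
A minimal sketch of that purely local decision might look like the following. The thresholds and the algorithm names are assumptions invented for illustration; real implementations typically weigh more inputs (message size, for one). The loop simply simulates each rank running the same function on the same input.

```python
def choose_bcast_algorithm(num_processes):
    """Deterministic choice based only on data that every rank already has.

    The cutoffs below are hypothetical, not taken from any MPI implementation.
    """
    if num_processes <= 4:
        return "linear"
    if num_processes <= 64:
        return "binomial_tree"
    return "scatter_allgather"


def simulate_ranks(num_processes):
    """Each simulated rank decides on its own, with no communication."""
    return [choose_bcast_algorithm(num_processes) for _rank in range(num_processes)]


choices = simulate_ranks(16)
assert len(set(choices)) == 1  # every rank reached the same conclusion
print(choices[0])  # binomial_tree
```

The trick only works because the function is deterministic and every rank feeds it identical input. If one rank saw a different process count, the ranks would disagree and the broadcast would likely hang or misbehave.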

This is a fairly simplistic example, but the same distributed computing concept is used to drive a lot of independent-but-identical decisions in MPI implementations, such as (but not limited to):

  • process affinity decisions
  • buffer size decisions
  • network egress decisions
  • network protocol decisions
  • congestion avoidance decisions
  • …etc.

How ’bout them apples… MPI uses distributed computing… Mind. Blown.



Authors

Jeff Squyres

The MPI Guy

UCS Platform Software