Multi / many / mucho cores
I’ve briefly mentioned before the idea of dedicating some cores for MPI communication tasks (remember: the idea of using dedicated communication co-processors isn’t new). I thought I’d explore this in a bit more detail in today’s entry.
Two networking vendors (I can’t say the vendor names or networking technologies here because they’re competitors, but let’s just say that the technology rhymes with “schminfiniband”) recently announced products that utilize communication processing offload for MPI collective communications. Interestingly enough, they use different approaches. Let’s look at both.
Vendor A added some smarts in their network interface card on the host. They added some hooks that essentially allow the user-level MPI middleware to program a simple state machine down in the NIC. For example, when doing a broadcast, MPI can tell the NIC, “when you receive message X, automatically forward it to peers A and B.” This allows the hardware to propagate a broadcast tree without involving the software, or even touching local cache, busses, etc.
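To make the "program a state machine into the NIC" idea concrete, here is a minimal sketch in plain Python. It is a simulation only, not any vendor's API: the function names, the binary-tree layout rooted at rank 0, and the rule table are all assumptions made for illustration. Each simulated NIC holds a rule of the form "when the broadcast arrives, forward it to these peers," and the message propagates without any host software running on the intermediate ranks.

```python
from collections import deque

def build_forwarding_rules(num_ranks):
    """Hypothetical per-NIC rules: rank r forwards to ranks 2r+1 and 2r+2.

    This lays the ranks out as a binary tree rooted at rank 0.
    """
    return {
        rank: [child for child in (2 * rank + 1, 2 * rank + 2) if child < num_ranks]
        for rank in range(num_ranks)
    }

def simulate_broadcast(rules):
    """Propagate one message from rank 0 using only the NIC rules.

    Returns how many times the message arrived at each rank. The root's
    own injection into its NIC is counted as one arrival.
    """
    arrivals = {rank: 0 for rank in rules}
    arrivals[0] = 1
    pending = deque([0])  # NICs that hold the message and still have to forward it
    while pending:
        nic = pending.popleft()
        for peer in rules[nic]:
            arrivals[peer] += 1
            pending.append(peer)
    return arrivals

rules = build_forwarding_rules(7)
arrivals = simulate_broadcast(rules)
print(rules[0], arrivals)
```

With seven ranks, the root's NIC forwards to ranks 1 and 2, those forward to 3 through 6, and every rank sees the message exactly once. The only host-side action in this model is the root handing the message to its own NIC.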
Vendor B added some smarts into their network switches. They provide a user-level library that middleware like MPI can call to send instructions to co-processors in the switch. MPI can use these capabilities to set up topology-aware broadcasts, etc. Hence, an MPI broadcast root doesn’t send the message to its local peers; it sends the message to the switch co-processor. The switch then does the individual sends to all peer MPI processes that happen to be on the same switch, and also takes care of sending the message to the co-processor(s) on the switch(es) of any remote peer MPI processes. In this way, traffic through the network core is minimized because communication with remote switch co-processors uses a topology-smart algorithm (e.g., the message only crosses a given network link once).
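A rough way to see the "crosses a given link only once" effect is to count link crossings in a toy model. The sketch below assumes a two-level topology where every leaf switch has a single uplink to one core switch, and it assumes the core replicates the message toward each remote leaf. Both of those, along with the function names and the host and switch labels, are assumptions for illustration and not a description of the actual product.

```python
from collections import Counter

def _switch_of(hosts_by_switch):
    """Map each host name to the leaf switch it is attached to."""
    return {host: sw for sw, hosts in hosts_by_switch.items() for host in hosts}

def unicast_link_load(root, hosts_by_switch):
    """Root sends one copy per peer; count crossings of each leaf uplink."""
    switch_of = _switch_of(hosts_by_switch)
    load = Counter()
    for host, switch in switch_of.items():
        if host == root or switch == switch_of[root]:
            continue  # local peers never touch an uplink
        load[switch_of[root]] += 1  # out through the root's leaf uplink
        load[switch] += 1           # in through the peer's leaf uplink
    return load

def offloaded_link_load(root, hosts_by_switch):
    """Switch co-processors replicate; each uplink carries the message once."""
    switch_of = _switch_of(hosts_by_switch)
    remote = [sw for sw, hosts in hosts_by_switch.items()
              if sw != switch_of[root] and hosts]
    load = Counter()
    if remote:
        load[switch_of[root]] += 1  # one copy leaves the root's switch
        for switch in remote:
            load[switch] += 1       # one copy enters each remote switch
    return load

# Three leaf switches with four hosts each; the broadcast root is h0 on s0.
topology = {
    "s0": ["h0", "h1", "h2", "h3"],
    "s1": ["h4", "h5", "h6", "h7"],
    "s2": ["h8", "h9", "h10", "h11"],
}
unicast = unicast_link_load("h0", topology)
offloaded = offloaded_link_load("h0", topology)
print(dict(unicast), dict(offloaded))
```

In this toy case the root's uplink carries eight copies when the root sends to every remote peer itself, and one copy when the switches do the fan-out. Real fabrics have more levels and more routing choices, so the numbers only show the shape of the saving.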
Both approaches are clever in their own right, and both have their own advantages and disadvantages.
But both approaches use communication-dedicated processors to handle the processing and propagation of MPI collective operations, without peppering the main MPI application with interrupts, thrashing caches, or doing anything else that disturbs the MPI app’s optimized performance.
Consider: back in LAM/MPI days we had a mode where a separate software daemon (the “lamd”) handled all MPI communications for the MPI processes over socket-based communications. Surprisingly enough, there was a class of real-world MPI applications that got a huge performance boost out of this approach — the benefit of asynchronous progress outweighed even the fact that there was no hardware offload capability.
It seems obvious to me that having lots of cores in a single server will provide more opportunities for exactly this kind of asynchronous progress optimization. If you’ve got 64 or 128 cores in a box, dedicate a few of them to proxy all communication handling. Sure, you give up some computation power, but the communications “win” that you get may greatly offset that loss.
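Here is a minimal sketch of the proxy idea, in the spirit of the lamd mode described above but much smaller. It uses a Python thread and a queue as a stand-in for a dedicated communication core. The class name CommProxy, the isend method, and the deliver callback are all hypothetical; no real MPI library is involved, and pinning the proxy to its own core is not shown.

```python
import queue
import threading

class CommProxy:
    """Hypothetical communication proxy: one thread drains an outbox queue."""

    def __init__(self, deliver):
        self._outbox = queue.Queue()
        self._deliver = deliver  # stand-in for the real network send
        self._thread = threading.Thread(target=self._progress_loop, daemon=True)
        self._thread.start()

    def isend(self, dest, payload):
        """Hand the message to the proxy and return to computing right away."""
        self._outbox.put((dest, payload))

    def _progress_loop(self):
        # Runs on the proxy thread: all communication progress happens here.
        while True:
            item = self._outbox.get()
            if item is None:  # shutdown marker
                break
            dest, payload = item
            self._deliver(dest, payload)

    def close(self):
        """Flush everything already queued, then stop the proxy thread."""
        self._outbox.put(None)
        self._thread.join()

delivered = []
proxy = CommProxy(lambda dest, payload: delivered.append((dest, payload)))

total = 0
for step in range(5):
    proxy.isend(dest=step % 2, payload=f"halo-{step}")
    total += sum(i * i for i in range(1000))  # the "application" keeps computing

proxy.close()
print(len(delivered), total)
```

The application thread only pays for a queue insertion per message; everything else happens on the proxy. In CPython the two threads still share the interpreter lock, so this shows the structure of the approach and says nothing about its performance.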