The main sentiments expressed in the article seem quite reasonable:
- MCAPI plays better in the embedded space than MPI (that’s what MCAPI was designed for, after all). Simply put: MPI is too feature-rich (read: big) for embedded environments, which reflects the different design goals of the two APIs.
- MCAPI + MPI might be a useful combination. The article cites a few examples of using MCAPI to wrap MPI messages. Indeed, I agree that MCAPI seems like it may be a useful transport in some environments.
One thing that puzzled me about the article, however, is that it states that MPI is terrible at moving messages around within a single server.
Huh. That’s news to me…
To be clear: there has actually been quite a bit of research into making MPI highly efficient on multicore systems. Open MPI’s shared memory transport, for example, has evolved through multiple generations of research (and is just about to see another update). MPICH2’s Nemesis shared memory transport has been the subject of many academic papers.
Both provide excellent performance for moving MPI messages between processes on the same server.
Is MCAPI that much more efficient at moving messages between memory spaces than optimized MPI implementations? We MPI implementors always have more optimization work to do, but I’m unaware (perhaps ignorant?) of what MCAPI could be doing that is fundamentally different from typical MPI implementations.
That being said, I might well be misunderstanding the authors’ intent. The examples at the end of the article seem to imply that they may be referring to using MCAPI to communicate between server cores and accelerators. I can certainly see a few cases where using MCAPI as an MPI transport might be useful:
- exchanging messages with an MPI process on an accelerator. Running MPI on an accelerator hasn’t panned out well in research so far, although people are still thinking about the problems involved. For the moment, using plain MCAPI to send to an accelerator might be better.
- exchanging messages with an MPI process on a different virtual machine on the same physical server. …but I don’t know if that would be allowed by the hypervisor. Hmm. (you may scoff at the idea of running more than one — or even one! — HPC-oriented virtual machine on a server, but core counts keep rising…)
- exchanging messages with an MPI process on an FPGA or other specialized hardware computational resource (e.g., connected via PCI, QPI, HT, or some other “fast” connection network).
And who knows? Maybe MCAPI implementations are faster than MPI shared memory transports. Perhaps it would be useful to have a performance shootout between traditional MPI shared memory implementations and MPI-over-MCAPI.
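For what it’s worth, the usual way to run such a shootout is a ping-pong test: one side sends a message, the other echoes it straight back, and half the average round-trip time is reported as the one-way latency. Here is a minimal sketch of that measurement loop in Python. To be clear about the assumptions: it uses neither MPI nor MCAPI; a local socket pair stands in for the transport, the echo peer runs in a thread only to keep the sketch self-contained, and the names `recv_exact`, `echo_peer`, and `ping_pong_latency` are made up for illustration. A real comparison would put the peer in a separate process and swap in each library’s own send and receive calls.

```python
import socket
import threading
import time


def recv_exact(sock, nbytes):
    """Read exactly nbytes from sock (stream sockets may return short reads)."""
    chunks = []
    remaining = nbytes
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed the connection early")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def echo_peer(sock, nbytes, count):
    """The 'pong' side: receive each message and send it straight back."""
    try:
        for _ in range(count):
            sock.sendall(recv_exact(sock, nbytes))
    except OSError:
        pass  # the 'ping' side went away early; nothing left to echo


def ping_pong_latency(nbytes, iterations=1000, warmup=100):
    """Return the mean one-way latency, in seconds, for nbytes-sized messages."""
    if nbytes < 1:
        raise ValueError("nbytes must be at least 1")
    ping, pong = socket.socketpair()  # stand-in transport (an assumption)
    peer = threading.Thread(target=echo_peer, args=(pong, nbytes, warmup + iterations))
    peer.start()
    payload = b"x" * nbytes
    try:
        # Untimed warm-up round trips, so start-up costs are not measured.
        for _ in range(warmup):
            ping.sendall(payload)
            recv_exact(ping, nbytes)
        start = time.perf_counter()
        for _ in range(iterations):
            ping.sendall(payload)
            recv_exact(ping, nbytes)
        elapsed = time.perf_counter() - start
    finally:
        ping.close()  # also unblocks the peer if we failed part-way through
        peer.join()
        pong.close()
    # Each iteration is one full round trip, i.e. two one-way messages.
    return elapsed / (2 * iterations)


if __name__ == "__main__":
    for size in (1, 1024, 65536):
        print(f"{size:6d} bytes: {ping_pong_latency(size) * 1e6:8.2f} us one-way")
```

The absolute numbers from this sketch say nothing about MPI or MCAPI; the point is only that both sides of a shootout should be timed with the same loop, the same message sizes, and the same warm-up.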
MCAPI people: we should talk.