Cisco Blogs


December 9, 2011 - 9 Comments

From @softtalkblog, I was recently directed to an article about the Multicore Communication API (MCAPI) and MPI.  Interesting stuff.

The main sentiments expressed in the article seem quite reasonable:

  1. MCAPI plays better in the embedded space than MPI (that’s what MCAPI was designed for, after all).  Simply put: MPI is too feature-rich (read: big) for embedded environments, reflecting the different design goals of MCAPI vs. MPI.
  2. MCAPI + MPI might be a useful combination.  The article cites a few examples of using MCAPI to wrap MPI messages.  Indeed, I agree that MCAPI seems like it may be a useful transport in some environments.

One thing that puzzled me about the article, however, is that it states that MPI is terrible at moving messages around within a single server.

Huh.  That’s news to me…

To be clear: there has actually been quite a bit of research into making MPI highly efficient on multicore systems.  Open MPI’s shared memory transport, for example, has evolved through multiple generations of research (and is just about to see another update).  MPICH2’s Nemesis shared memory transport has been the subject of many academic papers.

Both provide excellent performance for moving MPI messages between processes on the same server.

Is MCAPI that much more efficient at moving messages between memory spaces than optimized MPI implementations?  We MPI implementors always have more optimization work to do, but I’m unaware (perhaps ignorant?) of what it could be doing that is fundamentally different from typical MPI implementations.

That being said, I might well be misunderstanding the authors’ intent.  The examples at the end of the article seem to imply that they may be referring to using MCAPI to communicate between server cores and accelerators.  I can certainly see a few cases where using MCAPI as an MPI transport might be useful:

  • exchanging messages with an MPI process on an accelerator.  The idea of running MPI on an accelerator hasn’t panned out well yet in current research circles, but there are people still thinking about the problems involved.  For the moment, using plain MCAPI to send to an accelerator might be better.
  • exchanging messages with an MPI process on a different virtual machine on the same physical server.  …but I don’t know if that would be allowed by the hypervisor.  Hmm.  (you may scoff at the idea of running more than one — or even one! — HPC-oriented virtual machine on a server, but core counts keep rising…)
  • exchanging messages with an MPI process on an FPGA or other specialized hardware computational resource (e.g., connected via PCI, QPI, HT, or some other “fast” connection network).

And who knows?  Maybe MCAPI implementations are faster than MPI shared memory transports. Perhaps it would be useful to have a performance shootout between traditional MPI shared memory implementations and MPI-over-MCAPI.
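For what it's worth, such a shootout would probably start with a simple ping-pong micro-benchmark: bounce a small message between two processes on the same server and time the round trips. Below is a minimal sketch of that measurement loop. It is not MPI and not MCAPI; it assumes a POSIX system and uses Python's `multiprocessing` pipes purely as a stand-in transport, and the names `echo` and `ping_pong` are made up for illustration. The absolute numbers it prints say nothing about either library.

```python
import multiprocessing as mp
import time

# Assumption: POSIX system; the "fork" start method keeps the sketch self-contained.
CTX = mp.get_context("fork")


def echo(conn, rounds):
    """Child process: send every received message straight back."""
    for _ in range(rounds):
        conn.send_bytes(conn.recv_bytes())
    conn.close()


def ping_pong(size=8, rounds=1000):
    """Return (average half round-trip time in seconds, last echoed payload)."""
    parent_end, child_end = CTX.Pipe()
    worker = CTX.Process(target=echo, args=(child_end, rounds))
    worker.start()

    payload = b"x" * size
    reply = b""
    start = time.perf_counter()
    for _ in range(rounds):
        parent_end.send_bytes(payload)   # ping
        reply = parent_end.recv_bytes()  # pong
    elapsed = time.perf_counter() - start

    worker.join()
    parent_end.close()
    # Half of the average round trip approximates the one-way latency.
    return elapsed / (2 * rounds), reply


if __name__ == "__main__":
    latency, _ = ping_pong()
    print(f"average one-way latency: {latency * 1e6:.1f} us")
```

A real comparison would run the same loop over each transport under test, sweep the message size, and pin the two processes to known cores.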

MCAPI people: we should talk.



  1. I find it interesting that the whole premise of the article seems to be that MPI applications fundamentally use the dynamic process model. I’m not sure a lot of MPI applications could handle a “user’s inadvertent unplugging of a physical network cable”. In my experience, MPI apps expect the cluster they run on to be effectively static for the duration of the job, else they fail.

    The article seems to boil down to “because MPI can run across the WAN, it is slow”, without acknowledging that the MPI libraries know and take advantage of the fastest interconnect available between any given processes.

    • I just had a phone chat with one of the authors of the article, Sven Brehmer at Polycore Software. I’ll blog about the call in a little bit.

  2. “Milliseconds matter. MCAPI can therefore be quick and responsive in a way that MPI can’t be.”

    I find the insinuation that MPI has worse than millisecond latency to be rather hilarious.

  3. Actually, I take that back: it looks like both MPICH2 and OpenMPI now do this via knem. Perhaps Jeff could confirm?

    • Correct. See also Jeff Hammond’s comments (and to confirm: XPMEM support is being developed in Open MPI as well).

  4. Nobody has in-kernel userspace-to-userspace memory copy working again yet, do they? Without this, you have to use shared-memory copy-in/copy-out buffers, which halves the bandwidth.

    At Quadrics we had two features here. First, we could remap the whole of the BSS and heap allocators to shared memory, so you could just memcpy() to and from a remote address space. Second, we had a modified kernel ptrace API that you could use to get the kernel to do a direct userspace-to-userspace copy into a remote process’s address space.

    • Ashley,

      You might look at and related work on Nemesis in MPICH2 by INRIA and Argonne.

      See also XPMEM, which is developed by some folks associated with OpenMPI.

      On Blue Gene/P, MPI can exploit the static TLB map to directly access memory in other processes with no overhead, but this exists because of the unique properties of CNK, e.g. the bijective mapping of virtual and physical addresses.

    • Microsoft Windows has had support for user-space to user-space copy for over a decade (since Windows 2000), allowing processes to move data in either direction via the ReadProcessMemory and WriteProcessMemory APIs.

      Microsoft MPI takes advantage of this, though I don’t know if any other MPI libraries on Windows do (they really should).
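The copy-in/copy-out scheme that comment 4 describes can be sketched as follows. This is an illustration under stated assumptions, not how any MPI library implements its shared memory transport: it assumes a POSIX system, uses an anonymous shared `mmap` as the bounce buffer, and the names `receiver` and `send_via_bounce_buffer` are hypothetical. The point is that every byte is copied twice (sender to shared buffer, then shared buffer to receiver), whereas a kernel-assisted mechanism of the kind discussed in the replies copies it once.

```python
import hashlib
import mmap
import multiprocessing as mp

# Assumption: POSIX system; with "fork" the child inherits the shared mapping.
CTX = mp.get_context("fork")


def receiver(shared, size, conn):
    """Receiver process: wait for the message, then copy it out."""
    conn.recv()                    # block until the sender signals "ready"
    private = shared[:size]        # copy 2 (copy-out): shared buffer -> private bytes
    conn.send(hashlib.sha256(private).hexdigest())
    conn.close()


def send_via_bounce_buffer(message):
    """Send a non-empty message through a shared buffer; return the receiver's digest."""
    size = len(message)
    shared = mmap.mmap(-1, size)   # anonymous mapping, shared with forked children
    parent_end, child_end = CTX.Pipe()
    proc = CTX.Process(target=receiver, args=(shared, size, child_end))
    proc.start()

    shared[:size] = message        # copy 1 (copy-in): private bytes -> shared buffer
    parent_end.send(size)          # tell the receiver the message is ready
    digest = parent_end.recv()     # digest of what the receiver copied out

    proc.join()
    parent_end.close()
    shared.close()
    return digest


if __name__ == "__main__":
    msg = bytes(range(256)) * 4096  # 1 MiB test message
    assert send_via_bounce_buffer(msg) == hashlib.sha256(msg).hexdigest()
    print("receiver saw the same bytes after two copies")
```

The message is created after the receiver has been forked, so the only way the receiver can see it is through the shared buffer.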