
More on Memory Affinity

March 18, 2011 at 7:35 am PST

There was a great comment chain on my prior post (“Unexpected Linux Memory Migration”) which brought out a number of good points.  Let me clarify a few things from my post:

  • My comments were definitely about HPC types of applications, which are admittedly a small subset of applications that run on Linux.  It is probably a fair statement to say that the OS’s treatment of memory affinity will be just fine for most (non-HPC) applications.
  • Note, however, that Microsoft Windows and Solaris do retain memory affinity information when pages are swapped out.  When the pages are swapped back in, if they were bound to a specific locality before swapping, they are restored to that same locality.  This is why I was a bit surprised by Linux’s behavior.
  • More specifically, Microsoft Windows and Solaris seem to treat memory locality as a binding decision — Linux treats it as a hint.
  • Many (most?) HPC applications are designed not to cause paging.  However, at least some do.  A side point of this blog is that HPC is becoming commoditized — not everyone is out at the bleeding edge (meaning: some people willingly violate the “do not page” HPC mantra and are willing to give up a little performance in exchange for the other benefits that swapping provides).

To be clear, Open MPI has a few cases with very specific memory affinity needs that almost certainly fall outside the realm of just about any OS’s default memory placement scheme.  My point is that other applications may have similar requirements, particularly as core counts go up and communication between threads / processes on different cores becomes more common.
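
For readers who have not done explicit placement before, here is a minimal sketch of what “very specific memory affinity needs” can look like in practice, using the libnuma API.  This is purely illustrative (it is not how Open MPI implements its internals), and it assumes libnuma is installed and that NUMA node 0 exists:

    /* Illustrative only: explicitly place a buffer on a specific NUMA node
       with libnuma.  Compile with: gcc place.c -lnuma */
    #include <numa.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "libnuma is not available on this system\n");
            return 1;
        }

        size_t len = 1 << 20;   /* 1 MB buffer */
        int node = 0;           /* target NUMA node (assumed to exist) */

        /* Ask the kernel to place this allocation on the given node */
        void *buf = numa_alloc_onnode(len, node);
        if (NULL == buf) {
            perror("numa_alloc_onnode");
            return 1;
        }

        memset(buf, 0, len);    /* touch the pages so they are actually allocated */
        printf("Buffer placed on NUMA node %d\n", node);

        numa_free(buf, len);
        return 0;
    }

The point of this post (and the one below it) is that, on Linux, this kind of placement behaves more like a hint than a binding once paging gets involved.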

Unexpected Linux memory migration

March 4, 2011 at 7:20 am PST

I learned something disturbing earlier this week: if you bind memory in Linux to a particular NUMA location and that memory is later paged out, it loses that binding when it is paged back in.

Yowza!

Core counts are going up, and server memory networks are getting more complex; we’re effectively increasing the NUMA-ness of memory.  The specific placement of your data in memory is becoming (much) more important; it’s all about location, Location, LOCATION!

But unless you are very, very careful, your data may not be in the location that you think it is — even if you thought you had bound it to a specific NUMA node.
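
If you want to verify where your pages actually live (as opposed to where you asked for them to live), the kernel’s move_pages() call can be used in query mode.  Here is a minimal sketch; it is illustrative only and assumes libnuma is installed (move_pages() is declared in <numaif.h> and linked via -lnuma):

    /* Illustrative only: query which NUMA node a page currently resides on.
       Compile with: gcc whereis.c -lnuma */
    #include <numaif.h>     /* move_pages() */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        long page_size = sysconf(_SC_PAGESIZE);
        void *buf = malloc(page_size);
        memset(buf, 0, page_size);      /* touch the page so it is resident */

        void *pages[1] = { buf };
        int status[1]  = { -1 };

        /* With a NULL "nodes" argument, move_pages() moves nothing; it just
           reports the node each page currently lives on in "status". */
        if (move_pages(0 /* this process */, 1, pages, NULL, status, 0) != 0) {
            perror("move_pages");
            return 1;
        }
        printf("Page currently resides on NUMA node %d\n", status[0]);

        free(buf);
        return 0;
    }

One heavy-handed workaround is to mlock() the buffers you really care about so they can never be paged out in the first place; in effect, that is what the “do not page” HPC mantra amounts to.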

Making MPI survive process failures

February 27, 2011 at 12:00 pm PST

Arguably, one of the biggest weaknesses of MPI is its lack of resilience — most (if not all) MPI implementations will kill an entire MPI job if any individual process dies.  Contrast this with TCP sockets, for example: if the process on one side of a socket suddenly goes away, the peer is simply left with a stale socket; it can detect the failure and keep running.

This lack of resilience is not entirely the fault of MPI implementations; the MPI standard itself lacks some critical definitions about behavior when one or more processes die.
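
To make the gap concrete: MPI-2.2 already lets you install an error handler so that errors are returned instead of aborting the job, but it says very little about what state the library is in after a peer process dies.  A minimal sketch of that existing mechanism (illustrative only; behavior after a failure is implementation-dependent):

    /* Illustrative only: ask MPI to return error codes instead of aborting.
       Even with this, MPI-2.2 does not define what you may legally do with
       the library after a peer process has died. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* The default handler is MPI_ERRORS_ARE_FATAL: any error aborts
           the entire job.  Switch to MPI_ERRORS_RETURN instead. */
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* If a peer has died, this call may now return an error code rather
           than killing us -- but the standard does not guarantee that
           MPI_COMM_WORLD is still usable afterwards. */
        int buf = rank;
        int rc = MPI_Bcast(&buf, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (MPI_SUCCESS != rc) {
            fprintf(stderr, "Rank %d: broadcast failed with error %d\n", rank, rc);
        }

        MPI_Finalize();
        return 0;
    }

Defining what an application can rely on at that point is precisely the question the working group is tackling.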

I talked to Joshua Hursey, Postdoctoral Research Associate at Oak Ridge National Laboratory and a leading member of the MPI Forum’s Fault Tolerance Working Group, to find out what is being done to make MPI more resilient.

MPI Programming Mistakes

February 25, 2011 at 7:30 am PST

I’ve seen many users make lots of different kinds of MPI programming mistakes.

Some are common newbie mistakes.  Others are common intermediate-level mistakes.  Still others are incredibly subtle mistakes buried deep in application logic that took sophisticated debugging tools to figure out (race conditions, memory overflows, etc.).

In 2007, I wrote a pair of magazine columns listing 10 common MPI programming mistakes (see this PDF for part 1 and this PDF for part 2).  Indeed, we still see users asking about some of these mistakes on the Open MPI users’ mailing list.
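
As a concrete illustration of the genre (a generic example, not necessarily one of the ten from those columns): a very common newbie mistake is assuming that MPI_Send always buffers the message.  Two processes that both send before posting their receives can deadlock as soon as the messages are too large for the implementation’s eager protocol:

    /* Classic mistake: both ranks call MPI_Send first and assume it buffers.
       For large messages, both sends can block waiting for a receive that
       never gets posted, and the program deadlocks.  Run with exactly two
       ranks; this is an illustrative sketch of the bug, not a fix. */
    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    #define N (4 * 1024 * 1024)   /* large enough to exceed typical eager limits */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        int peer = (0 == rank) ? 1 : 0;

        double *sendbuf = malloc(N * sizeof(double));
        double *recvbuf = malloc(N * sizeof(double));
        memset(sendbuf, 0, N * sizeof(double));

        /* WRONG: both ranks send first.  MPI_Sendrecv or nonblocking
           MPI_Isend / MPI_Irecv would avoid the deadlock. */
        MPI_Send(sendbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
        MPI_Recv(recvbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }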

What mistakes do you see your users making with MPI?  How can we — the MPI community — better educate users to avoid these kinds of common mistakes?  Post your thoughts in the comments.

MPI Forum Roundup

February 11, 2011 at 9:00 am PST

We just finished up another MPI Forum meeting earlier this week, hosted at the Cisco node 0 facility in San Jose, CA.  A lot of the working groups are making tangible progress and bringing their work back to the full Forum for review and discussion.  Sometimes the working group reports are accepted and moved forward towards standardization; other times, the full Forum provides feedback and guidance, and then sends the working group back to committee to keep hashing out details.  This is pretty typical stuff for a standards body.

This week, we had a first vote (out of two total) on the MPI_MPROBE proposal.  It passed the vote, and will likely pass its next vote in March, meaning that it will become part of the MPI 3.0 draft standard.

MPI_MPROBE closes an important race condition vulnerability.
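
The race it closes shows up most clearly in multithreaded MPI programs: with plain MPI_Probe, the message you just probed can be matched by another thread’s receive before your own MPI_Recv runs.  MPI_MPROBE instead hands back a handle to the specific message, which only a matching MPI_MRECV can consume.  A minimal sketch of the proposed calls (illustrative only, based on the draft interface at the time):

    /* Illustrative sketch of MPI_Mprobe / MPI_Mrecv.  The probed message is
       "claimed" via the MPI_Message handle, so no other thread's receive can
       match it between the probe and the receive. */
    #include <mpi.h>
    #include <stdlib.h>

    void receive_one(MPI_Comm comm)
    {
        MPI_Message msg;
        MPI_Status  status;

        /* Probe for (and claim) one matching message */
        MPI_Mprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &msg, &status);

        /* Size the buffer from the probe, then receive exactly that message */
        int count;
        MPI_Get_count(&status, MPI_BYTE, &count);
        char *buf = malloc(count);
        MPI_Mrecv(buf, count, MPI_BYTE, &msg, MPI_STATUS_IGNORE);

        /* ... process buf ... */
        free(buf);
    }

With the old MPI_Probe / MPI_Recv pairing, that allocate-a-buffer-sized-by-the-probe idiom is simply not thread safe.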
