Cisco Blogs

Unexpected Linux memory migration

March 4, 2011 - 20 Comments

I learned something disturbing earlier this week: if you bind memory in Linux to a particular NUMA location and that memory is later paged out, the binding is lost when the memory is paged back in.


Core counts are going up, and server memory networks are getting more complex; we’re effectively increasing the NUMA-ness of memory.  The specific placement of your data in memory is becoming (much) more important; it’s all about location, Location, LOCATION!

But unless you are very, very careful, your data may not be in the location that you think it is — even if you thought you had bound it to a specific NUMA node.

Here’s a very high-level, skipping-many-details description of how memory affinity works in Linux: a memory affinity policy is associated with each thread in a process.  New memory allocations obey the policy of the allocating thread.  However, this policy can be overridden on demand: memory can be allocated and/or associated with a different memory affinity policy (e.g., placing memory on a specific, remote NUMA node).

But here’s the disturbing part: when Linux swaps a page out, it does not save the memory’s original NUMA location or policy.

There are several reasons why this is Bad:

  • When the page is swapped back in, it will be placed according to the current policy of the touching thread.  This may not be the same thread that set the page’s policy before it was swapped out — meaning that the page could be placed according to a different policy.  Bluntly: the page may be swapped in to a different NUMA node than where it originally resided.
  • Linux aggressively swaps in the one page that it needs and the next N adjacent pages from swap space — even if the adjacent pages are from different processes (!).  All of these pages will be placed according to the memory affinity policy of the touching thread.  To be clear: the touching thread may not be in the same process as the adjacent pages being faulted in.  This is another reason why memory may move to an unexpected NUMA node.
  • Swap out/in is transparent to applications; there is no notification if a memory access causes memory to be paged out or in — and potentially moved to a new NUMA node.
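The readahead behavior in the second bullet is tunable.  As a hedged illustration (assuming a Linux /proc filesystem), the sysctl vm.page-cluster controls the window: the kernel swaps in up to 2^page-cluster consecutive pages at a time, so setting it to 0 limits each swap-in to the single faulting page:

```python
# Sketch: inspect Linux's swap readahead window.  The kernel faults in
# up to 2**vm.page-cluster pages per swap-in; 0 means one page at a time.
def swap_readahead_pages(path="/proc/sys/vm/page-cluster"):
    """Return the number of pages Linux reads per swap-in, or None
    if the sysctl is unavailable (e.g., not running on Linux)."""
    try:
        with open(path) as f:
            return 2 ** int(f.read().strip())
    except (OSError, ValueError):
        return None
```

Lowering vm.page-cluster reduces the chance that unrelated adjacent pages get faulted in under the wrong thread's policy, at the cost of less efficient swap I/O.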


To be fair, there are good reasons why Linux works this way.  But these strategies can still be disastrous if your application has specific memory affinity needs.

Many HPC applications already use Hardware Locality (hwloc) (or something like it) and take care to keep their datasets small enough to fit within physical memory.  Some HPC sites even disable swap altogether, thereby avoiding this NUMA nightmare.

But the point remains that if you want to be absolutely positively sure that your memory is bound to the location where you want it to be, you may need to do the following:

  1. Allocate some memory
  2. Bind it to the location that you want
  3. “Pin” the memory (so that it will never be paged out — e.g., with mlock(2))
  4. Touch each page in your memory allocation

Note that allocating the memory doesn’t actually associate physical pages with it.  Touching the memory does.  Hence, you can safely set two attributes on the allocation (binding it to the desired NUMA location and pinning it) before physical pages are actually associated with it.
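Putting the four steps together, here is a minimal sketch in Python (using ctypes purely for illustration).  The mlock(2) call is real; the NUMA-binding step is left as a comment because it would require libnuma or hwloc, and the helper name alloc_pin_touch is made up:

```python
import ctypes
import ctypes.util
import mmap

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

def alloc_pin_touch(length):
    """Hedged sketch of the allocate -> bind -> pin -> touch sequence."""
    # 1. Allocate: an anonymous mapping reserves address space, but no
    #    physical pages are placed yet.
    buf = mmap.mmap(-1, length)
    addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))

    # 2. Bind: call mbind(2) or hwloc_set_area_membind() on
    #    (addr, length) here; omitted so this sketch runs without
    #    libnuma/hwloc installed.

    # 3. Pin: mlock() prevents the range from ever being paged out.
    #    This can fail (ENOMEM/EPERM) if RLIMIT_MEMLOCK is too low.
    if libc.mlock(ctypes.c_void_p(addr), ctypes.c_size_t(length)) != 0:
        pass  # an application with strict affinity needs should abort here

    # 4. Touch: writing one byte per page forces physical pages to be
    #    allocated under the policy set in step 2.
    for off in range(0, length, mmap.PAGESIZE):
        buf[off] = 0
    return buf
```

The ordering matters: because steps 2 and 3 happen before step 4, the physical pages come into existence already bound and already pinned.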

Brice Goglin also suggested that applications with less-stringent locality needs can call hwloc_set_area_membind() with HWLOC_MEMBIND_MIGRATE from time to time to move pages back to where they belong (if they have moved).

Finally, be aware that Linux’s /etc/security/limits.conf can affect how much memory a given user process is allowed to pin (“lock”).  Most HPC environments with OpenFabrics-based networks already set the lock limits to “unlimited”, so this shouldn’t be an issue.  But if you run into these limits, Google around for information about the scope of limits.conf, how to set the limits properly, etc.
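For reference, a typical limits.conf snippet looks like the following (the wildcard domain shown here varies by site; many restrict it to a specific group):

```
# /etc/security/limits.conf: allow unlimited locked ("pinned") memory
*    soft    memlock    unlimited
*    hard    memlock    unlimited
```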

In summary: most HPC applications already try very hard to avoid swapping because disk I/O tends to be a performance killer.  NUMA HPC architectures are now providing even more reasons to avoid swapping.

(Many thanks to Brice for proofreading this entry and providing some additional technical information before it was published.)



  1. First, I agree, memory allocation policy absolutely should be maintained at the memory map level… not associated with the transient physical allocation in a dynamically paged system.

    On the other hand, I’m not sure I understand why you could not simply turn off swap (or not even mount any in the first place) for applications like this.

    And it’s certainly possible to implement the UNIX API without demand paging (proof by construction: Thompson, Kernighan, Ritchie, 1970), so an HPC platform that isn’t based on Linux and still provides a completely compatible API shouldn’t be at all controversial.

  2. Jeff, I’m not quite sure what you were expecting to happen with paging. numa allocation hints are just that: allocation hints. so wouldn’t you expect pageins to be treated as the new physical page allocations that they are? if your app has a well-defined core/node affinity, and hasn’t changed its memory policy, doesn’t everything work as you expect?

    regarding pageins, did you set the readahead to 0?

    as for the linux bashing in comments: linux won, get over it. if you don’t like what linux currently does, where’s your patch?

    • I guess I was surprised by the behavior because of two reasons:

      1. The mbind(2) and set_mempolicy(2) pages don’t say anything about the policies being hints. I had therefore believed (apparently incorrectly) that the policies were binding (pardon the pun) — particularly when one of the policies has “STRICT” in its name and uses strong language about what it means.
      2. Until you mentioned it, it would not have occurred to me that swapping in would be considered a new allocation. Sure, maybe that’s how it’s implemented on the back end, but I’m just a dumb user here; I didn’t think that a back-end implementation artifact would affect the policy that I previously set and had no upcall/notification when it had changed.

      Perhaps I’m just a naive userspace guy, but I said that I wanted memory X to be bound to location Y; why should that change if the memory gets paged out?

      FWIW, memory binding policies don’t change on Solaris or Windows if the memory gets paged out. When the memory is paged back in, the OS tries very hard to make it obey the original memory binding — only placing it elsewhere if it absolutely cannot place it where the original memory binding policy specified.

      As for the “should we not use Linux?” comments; I believe that those comments are offered in the spirit of “Hmm.. that’s interesting to think about…” (including a few examples of how others have done it). I know just about everyone who has commented here; we all use and rely on Linux heavily every day. That doesn’t mean that Linux doesn’t have some warts that are worth talking about; potentially even resulting in the creation of a patch by a Linux kernel expert (which I clearly am not!).

  3. @Kyle You’re right, BGP CNK does have VA but they map trivially to PA, so I tend to ignore the distinction. Clearly, without VA, fragmentation could be an issue. I’m not sure if it is an issue in CNK right now or not. I’ll write a synthetic test to see what happens.

    @David Yes, “Linux must die” is rather over-the-top, but that is my general state-of-mind (Jeff S. can certainly vouch for this). I’m just trying to throw stones at the conventional wisdom that Linux is the only way to fly in HPC. The HPC community is pretty smart as far as computer users go, and we should not bind ourselves forever to an OS that was created by a grad student in 1991.

    Kyle and I have both pointed out examples of attempts to use something other than Linux in HPC. BGL and BGP CNK are generally considered successful, BGP more so because it had more Linux-like features, including dlopen(). Most people I know think Catamount was a failure, but I don’t think the overarching design of Catamount was the reason. If Cray had done a better job of socializing the fact that Catamount was not Linux for a reason, it might have been easier to get people outside of Sandia to accept it. It seems there were implementation issues in many features of the Cray XT3, but I can’t speak with any authority on what they were.

    What has not been said, but I think is important, is that any alternative to Linux in HPC should be generally POSIX-compliant. BGP CNK has this property, as it is derived from BSD. There are very few examples of properly designed HPC codes failing to run on BGP because of this. It is well-known that what BGP CNK prohibits are generally bad ideas in HPC anyway, e.g. oversubscription. I refuse to accept any argument that fork() and exec*() are good ideas in HPC codes.

  4. I think David’s post gives a good explanation of the source of the problem.

    On our HPC systems we require users to specify how much RAM per core they want (it defaults to 1GB), and they have to request more if they need it. That limit is enforced by setting RLIMIT_AS on the job’s processes.

    The scheduler won’t allocate jobs to a node for which memory is not available.
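The RLIMIT_AS enforcement described in the comment above can be sketched as follows (a hedged illustration, not the site’s actual scheduler code; the helper run_with_as_limit and its fork-based structure are made up):

```python
import os
import resource

def run_with_as_limit(limit_bytes, fn):
    """Run fn() in a forked child whose address space is capped at
    limit_bytes via RLIMIT_AS.  Returns the child's exit code:
    0 on success, 1 if the workload hit MemoryError."""
    pid = os.fork()
    if pid == 0:
        # Never try to raise the hard limit; only lower the cap.
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        if hard != resource.RLIM_INFINITY:
            limit_bytes = min(limit_bytes, hard)
        resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))
        try:
            fn()
            os._exit(0)
        except MemoryError:
            os._exit(1)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)
```

Because the limit is set in the child after fork(), the parent (the scheduler) is unaffected, and any allocation in the job that would exceed the cap fails immediately rather than pushing the node into swap.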

  5. Getting back to your original post, Jeff, I’m going to suggest it’s a bit more alarmist than need be and that, really, you are talking about a positive. Very few people use memory binding and they probably should.

    At best, users use process affinity and assume (the default) “preferred” memory placement will do the right thing. But in the face of physical memory (partially) full of page cache (often residual from the last job to finish), preferred page allocations will “go off-node” often enough to give annoyingly variable job performance. That’s what originally got us into investigating adding memory binding to MPI. Even with swapin_readahead occasionally messing with binding placement, it’s still usually miles ahead of preferred placement. And, as you say, how many sites cause their jobs to page anyway? Yes, we do as a matter of scheduling policy, but at a lot of sites paging would only be caused by the user, and in that case NUMA placement is probably a relatively minor concern.

    So I guess I would be pointing out the shortcomings of the usual preferred NUMA node approach and promoting the large win that binding can provide (with the small caveat that it’s still not perfect).

    • Fair enough; I tried to soften my text by stating things like “There are good reasons why Linux works this way…” and qualify that my comments were about applications that have specific memory affinity needs. But perhaps that wasn’t strong enough.

      For example, Open MPI really does need shared memory buffers to reside on specific NUMA nodes, or performance will plummet (relatively speaking). Paging out — such as suspending and resuming a job, as in your case — can be disastrous to performance. The problem only gets worse as core counts keep going up, potentially enabling sites to start allocating multiple jobs to individual compute nodes.

      That being said, I’m planning a followup blog entry about this. I’ll include some stronger clarifications.

  6. I’m not sure things are totally hopeless with the Linux VM and NUMA. There are a lot of smart kernel developers – we just need to engage them and convince them there is an issue worth resolving. Imagine a kernel config option to partition swap based on the source NUMA node of swapped out pages. Then swapin_readahead will do the right thing at least in the context of most MPI jobs.

    BTW, I think swap is one of the most useful OS features available to us for managing our HPC system in a smart way. We could replace it by reserving half our memory for suspended jobs but I doubt users would like that idea.

  7. @Kyle NWChem has its own stack (in the true sense of the data structure i.e. push=alloc and pop=free) memory allocator, so fragmentation is not an issue.

    In general quantum chemistry does not require frequent malloc+free, and generally they are stack-like such that fragmentation shouldn’t be a huge issue.

    I don’t see how fragmentation matters on BG/P anyways. There is no NUMA so fragmentation would only show up at the granularity of a cache line.

    • Fragmentation would matter in terms of being unable to allocate large blocks of memory because you don’t have enough free contiguous space.

      In any event, as I understand it, the BlueGene CNK provides an offset-based virtual addressing system.

  8. @Kyle At Argonne, we run Python-driven C-MPI programs on Blue Gene/P all the time. The only real problem with Python is dynamic loading, which causes file system problems at scale. If the OS is the problem, it is by design.

  9. @Jeff – so the codes you work with do not do much allocation/deallocation at runtime? Without virtual memory, I would think you’d have a lot of fragmentation issues otherwise.

  10. @Scott When constrained to the HPC context, I do not see a lot of value in virtual memory either. I have never once used BGP and thought, “man, I really wish I had an abstraction layer between pointers and physical addresses”. I might use VM to index remote memory on some machines, but this is really just a cheap hack to make legacy code work. A good user-space library can do remote access much more efficiently using its own index system.

  11. I would agree that this tends to point to an argument of “Don’t Use Linux for HPC!”, but that is an old saw. OSes that are custom-designed for HPC, like Catamount or its open-source successor Kitten, neatly avoid all of these problems. However, many consumers of HPC cycles simply demand Linux without a whole lot of thought behind it. Whether it’s because they want their “high-performance” Python scripts to work, or whether they just don’t want to cross-compile, it’s a constant thorn in the side of OS guys who keep tearing their hair out going “but we already SOLVED that problem a *decade* ago!”

  12. Be sure not to confuse Virtual Memory with paging. They aren’t unrelated, but virtual memory has much more purpose than just swapping pages to secondary storage.

    • Yes, fair enough — thanks for keeping us honest.

      Swapping is the main evil here, but Jeff Hammond’s point about disabling one level of virtual memory is still an interesting point to consider — *specifically in the context of HPC*. For general purpose computing, that’s probably not a good idea.

  13. No idea. My ALCF operations friend tells me removing VM from Linux is impossible. We need another OS option. Maybe IBM should productize CNK for x86. I bet it would make IB systems much more productive. Just RDMA away. No pages means no registration.

    • Mmmm. Interesting idea.

      Memory can still move between processes, though. But yes, if one layer of the virtual-to-physical memory mapping (i.e., Linux[-like] virtual memory) is removed, that could simplify things in the kernel and/or the MPI implementation. Ridding ourselves of the bane of registered memory would be a Very, Very Good Thing IMNSHO!

  14. I have begun to argue against the use of Linux in HPC for reasons like this. There is absolutely no reason for virtual memory or pages in HPC. Security is irrelevant because nodes are allocated per user and paging is catastrophic for performance.

    Basically, everyone should be like Blue Gene/P, and that’s not just because I help operate one.

    • What do you think should be used in HPC instead of Linux? Or did you mean “…argue against virtual memory in HPC…”?