Unexpected Linux memory migration
I learned something disturbing earlier this week: if you allocate memory in Linux on a particular NUMA node and that memory is later paged out, it will lose its NUMA binding when it is paged back in.
Core counts are going up, and server memory networks are getting more complex; we’re effectively increasing the NUMA-ness of memory. The specific placement of your data in memory is becoming (much) more important; it’s all about location, Location, LOCATION!
But unless you are very, very careful, your data may not be in the location that you think it is — even if you thought you had bound it to a specific NUMA node.
Here’s a very high-level, skipping-many-details description of how memory affinity works in Linux: a memory affinity policy is associated with each thread in a process. New memory allocations obey the policy of the allocating thread. However, this policy can be overridden on demand: memory can be allocated and/or associated with a different memory affinity policy (e.g., placing memory on a specific, remote NUMA node).
But here’s the disturbing part: when Linux swaps a page out, it does not save the memory’s original NUMA location or policy.
There are several reasons why this is Bad:
- When the page is swapped back in, it will be placed according to the current policy of the touching thread. This may not be the same thread that set the page’s policy before it was swapped out — meaning that the page could be placed according to a different policy. Bluntly: the page may be swapped in to a different NUMA node than where it originally resided.
- Linux aggressively swaps in the one page that it needs and the next N adjacent pages from swap space — even if the adjacent pages are from different processes (!). All of these pages will be placed according to the memory affinity policy of the touching thread. To be clear: the touching thread may not be in the same process as the adjacent pages being faulted in. This is another reason why memory may move to an unexpected NUMA node.
- Swap out/in is transparent to applications; there is no notification if a memory access causes memory to be paged out or in — and potentially moved to a new NUMA node.
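Since there is no notification, about the only way to notice a move is to ask the kernel where a page currently lives. Here is a minimal sketch of that check, assuming Linux and the move_pages(2) system call (invoked through syscall(2) so that no libnuma headers are needed). The helper name node_of_address() is made up for this example, and the call may simply fail in sandboxed or non-NUMA environments.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: report the NUMA node that currently backs the page
 * containing `addr`. Returns the node number (>= 0), or a negative value if
 * the query failed or the page is not currently resident.
 */
static int node_of_address(const void *addr)
{
    long page_size = sysconf(_SC_PAGESIZE);
    if (page_size <= 0)
        return -1;

    /* move_pages(2) expects page-aligned addresses, so round down. */
    uintptr_t aligned = (uintptr_t)addr & ~((uintptr_t)page_size - 1);
    void *pages[1] = { (void *)aligned };
    int status[1] = { -1 };

    /* pid 0 = the calling process; nodes == NULL = query only, move nothing. */
    long rc = syscall(SYS_move_pages, 0L, 1UL, pages, NULL, status, 0L);
    if (rc != 0)
        return -1;

    /* Either the node number or a negative errno value for this page. */
    return status[0];
}
```

An application that cares could call this on a few representative pages from time to time and compare the answer against the node it originally asked for.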
To be fair, there are good reasons why Linux works this way. But these strategies can still be disastrous if your application has specific memory affinity needs.
Many HPC applications already use Hardware Locality (hwloc) (or something like it) to make sure that their datasets are small enough to fit within physical memory. Some HPC sites even disable swap altogether, thereby avoiding this NUMA nightmare.
But the point remains that if you want to be absolutely positively sure that your memory is bound to the location where you want it to be, you may need to do the following:
- Allocate some memory
- Bind it to the location that you want
- “Pin” the memory (so that it will never be paged out — e.g., with mlock(2))
- Touch each page in your memory allocation
Note that allocating the memory doesn’t actually associate pages with it. Touching the memory does. Hence, you can safely set two attributes on the allocation (binding it to the desired NUMA location and pinning it) before pages are actually associated with it.
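The four steps above might look roughly like the following in C. This is a sketch under several assumptions, not a definitive recipe: it uses the raw Linux mbind(2) system call via syscall(2) (hwloc or libnuma would be the more portable choice), the struct pinned_buf type and alloc_bound_pinned() name are made up for this example, and it only records whether the bind and lock steps succeeded rather than treating their failure as fatal.

```c
#define _GNU_SOURCE
#include <linux/mempolicy.h>   /* MPOL_BIND */
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical result type for this sketch. */
struct pinned_buf {
    void  *addr;    /* NULL if the allocation itself failed */
    size_t len;     /* length, rounded up to whole pages */
    int    bound;   /* 1 if mbind(2) succeeded */
    int    locked;  /* 1 if mlock(2) succeeded */
};

static struct pinned_buf alloc_bound_pinned(size_t len, int node)
{
    struct pinned_buf b = { NULL, 0, 0, 0 };
    long ps = sysconf(_SC_PAGESIZE);
    if (ps <= 0 || len == 0)
        return b;
    size_t page = (size_t)ps;
    len = (len + page - 1) / page * page;

    /* Step 1: allocate address space. */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return b;
    b.addr = p;
    b.len = len;

    /* Step 2: bind the range to one NUMA node (a single-word node mask). */
    if (node >= 0 && (size_t)node < 8 * sizeof(unsigned long)) {
        unsigned long mask = 1UL << node;
        if (syscall(SYS_mbind, p, (unsigned long)len, (long)MPOL_BIND, &mask,
                    (unsigned long)(8 * sizeof(mask)), 0UL) == 0)
            b.bound = 1;
    }

    /* Step 3: pin the range so that it cannot be paged out. */
    if (mlock(p, len) == 0)
        b.locked = 1;

    /* Step 4: touch every page by writing one byte to each. */
    volatile unsigned char *c = p;
    for (size_t off = 0; off < len; off += page)
        c[off] = 0;

    return b;
}
```

A caller would check the bound and locked fields before relying on the placement, and release the buffer with munmap(b.addr, b.len) when done.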
Brice Goglin also suggested that applications with less-stringent locality needs can call hwloc_set_area_membind() with HWLOC_MEMBIND_MIGRATE from time to time to move pages back to where they belong (if they have moved).
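For readers without hwloc at hand, the same "move it back" idea can be sketched with the raw Linux mbind(2) system call and its MPOL_MF_MOVE flag. To be clear, this is a Linux-native analogue that I am substituting for illustration, not hwloc's API; the migrate_back() name is hypothetical, and the address passed in is assumed to be page-aligned.

```c
#define _GNU_SOURCE
#include <linux/mempolicy.h>   /* MPOL_BIND, MPOL_MF_MOVE */
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: re-apply a bind policy to [addr, addr + len) and ask
 * the kernel to move pages that currently sit on other nodes.
 * `addr` must be page-aligned. Returns 0 on success, -1 on failure.
 */
static int migrate_back(void *addr, size_t len, int node)
{
    /* This sketch only handles nodes that fit in a single-word mask. */
    if (node < 0 || (size_t)node >= 8 * sizeof(unsigned long))
        return -1;

    unsigned long mask = 1UL << node;
    long rc = syscall(SYS_mbind, addr, (unsigned long)len, (long)MPOL_BIND,
                      &mask, (unsigned long)(8 * sizeof(mask)),
                      (unsigned long)MPOL_MF_MOVE);
    return rc == 0 ? 0 : -1;
}
```

Calling something like this periodically is cheap if nothing has moved, but it is only a best-effort repair; it does not prevent the pages from wandering again.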
Finally, be aware that Linux’s /etc/security/limits.conf can have an effect on how much memory a given user process is allowed to pin (“lock”). Most HPC environments that have OpenFabrics-based networks already have the lock limits set at “unlimited”, so it shouldn’t be an issue. But if you run up against these limits, Google around to find information about the scope of limits.conf, how to set the limits properly, etc.
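A program can check its own lock limit before trying to pin anything. The sketch below uses getrlimit(2) with RLIMIT_MEMLOCK; the memlock_allows() name is made up for this example, and it deliberately ignores memory that the process has already locked and any privileges that would bypass the limit.

```c
#include <stddef.h>
#include <sys/resource.h>

/*
 * Hypothetical helper: returns 1 if `bytes` fits within the soft
 * RLIMIT_MEMLOCK limit, 0 if it does not, and -1 if the limit could not
 * be read. Memory that is already locked is not accounted for here.
 */
static int memlock_allows(size_t bytes)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
        return -1;
    if (rl.rlim_cur == RLIM_INFINITY)
        return 1;
    return (rlim_t)bytes <= rl.rlim_cur ? 1 : 0;
}
```

If this returns 0 for the size you intend to pin, mlock(2) is likely to fail, and the limit needs to be raised by an administrator.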
In summary: most HPC applications already try very hard to avoid swapping because the disk I/O tends to be a performance killer. NUMA HPC architectures are now providing even more reasons to avoid swapping.
(Many thanks to Brice for proofing this entry and providing some additional technical information before it was published.)