<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Unexpected Linux memory migration</title>
	<atom:link href="http://blogs.cisco.com/performance/unexpected-linux-memory-migration/feed/" rel="self" type="application/rss+xml" />
	<link>http://blogs.cisco.com/performance/unexpected-linux-memory-migration/</link>
	<description></description>
	<lastBuildDate>Wed, 22 May 2013 00:09:23 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	
	<item>
		<title>By: Peter da Silva</title>
		<link>http://blogs.cisco.com/performance/unexpected-linux-memory-migration/#comment-154287</link>
		<dc:creator>Peter da Silva</dc:creator>
		<pubDate>Wed, 18 May 2011 12:53:41 +0000</pubDate>
		<guid isPermaLink="false">http://blogs.cisco.com/?p=21231#comment-154287</guid>
		<description><![CDATA[First, I agree, memory allocation policy absolutely should be maintained at the memory map level... not associated with the transient physical allocation in a dynamically paged system.

On the other hand, I&#039;m not sure I understand why you could not simply turn off swap (or not even mount any in the first place) for applications like this.

And it&#039;s certainly possible to implement the UNIX API without demand paging (proof by construction, Thompson, Kernighan, Ritchie, 1970), so an HPC platform that isn&#039;t based on Linux and still provides a completely compatible API shouldn&#039;t be at all controversial.]]></description>
		<content:encoded><![CDATA[<p>First, I agree, memory allocation policy absolutely should be maintained at the memory map level&#8230; not associated with the transient physical allocation in a dynamically paged system.</p>
<p>On the other hand, I&#8217;m not sure I understand why you could not simply turn off swap (or not even mount any in the first place) for applications like this.</p>
<p>And it&#8217;s certainly possible to implement the UNIX API without demand paging (proof by construction, Thompson, Kernighan, Ritchie, 1970), so an HPC platform that isn&#8217;t based on Linux and still provides a completely compatible API shouldn&#8217;t be at all controversial.
</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jeff Squyres</title>
		<link>http://blogs.cisco.com/performance/unexpected-linux-memory-migration/#comment-132496</link>
		<dc:creator>Jeff Squyres</dc:creator>
		<pubDate>Sun, 13 Mar 2011 13:25:45 +0000</pubDate>
		<guid isPermaLink="false">http://blogs.cisco.com/?p=21231#comment-132496</guid>
		<description><![CDATA[I guess I was surprised by the behavior for two reasons:
&lt;ol&gt;
&lt;li&gt; The mbind(2) and set_mempolicy(2) pages don&#039;t say anything about the policies being hints.  I had therefore believed (apparently incorrectly) that the policies were binding (pardon the pun) -- particularly when one of the policies has &quot;STRICT&quot; in its name and uses strong language about what it means.&lt;/li&gt;
&lt;li&gt; Until you mentioned it, it would not have occurred to me that swapping in would be considered a new allocation.  Sure, maybe that&#039;s how it&#039;s implemented on the back end, but I&#039;m just a dumb user here; I didn&#039;t think that a back-end implementation artifact would affect the policy that I previously set and had no upcall/notification when it had changed.&lt;/li&gt;
&lt;/ol&gt;

Perhaps I&#039;m just a naive userspace guy, but I said that I wanted memory X to be bound to location Y; why should that change if the memory gets paged out?  

FWIW, memory binding policies &lt;em&gt;don&#039;t&lt;/em&gt; change on Solaris or Windows if the memory gets paged out.  When the memory is paged back in, the OS tries very hard to make it obey the original memory binding -- only placing it elsewhere if it absolutely cannot place it where the original memory binding policy specified.

As for the &quot;should we not use Linux?&quot; comments; I believe that those comments are offered in the spirit of &quot;Hmm.. that&#039;s interesting to think about...&quot; (including a few examples of how others have done it).  I know just about everyone who has commented here; we all use and rely on Linux heavily every day.  That doesn&#039;t mean that Linux doesn&#039;t have some warts that are worth talking about; potentially even resulting in the creation of a patch by a Linux kernel expert (which I clearly am not!).]]></description>
		<content:encoded><![CDATA[<p>I guess I was surprised by the behavior for two reasons:</p>
<ol>
<li> The mbind(2) and set_mempolicy(2) pages don&#8217;t say anything about the policies being hints.  I had therefore believed (apparently incorrectly) that the policies were binding (pardon the pun) &#8212; particularly when one of the policies has &#8220;STRICT&#8221; in its name and uses strong language about what it means.</li>
<li> Until you mentioned it, it would not have occurred to me that swapping in would be considered a new allocation.  Sure, maybe that&#8217;s how it&#8217;s implemented on the back end, but I&#8217;m just a dumb user here; I didn&#8217;t think that a back-end implementation artifact would affect the policy that I previously set and had no upcall/notification when it had changed.
</li>
</ol>
<p>Perhaps I&#8217;m just a naive userspace guy, but I said that I wanted memory X to be bound to location Y; why should that change if the memory gets paged out?  </p>
<p>FWIW, memory binding policies <em>don&#8217;t</em> change on Solaris or Windows if the memory gets paged out.  When the memory is paged back in, the OS tries very hard to make it obey the original memory binding &#8212; only placing it elsewhere if it absolutely cannot place it where the original memory binding policy specified.</p>
<p>As for the &#8220;should we not use Linux?&#8221; comments; I believe that those comments are offered in the spirit of &#8220;Hmm.. that&#8217;s interesting to think about&#8230;&#8221; (including a few examples of how others have done it).  I know just about everyone who has commented here; we all use and rely on Linux heavily every day.  That doesn&#8217;t mean that Linux doesn&#8217;t have some warts that are worth talking about; potentially even resulting in the creation of a patch by a Linux kernel expert (which I clearly am not!).
</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Mark Hahn</title>
		<link>http://blogs.cisco.com/performance/unexpected-linux-memory-migration/#comment-131217</link>
		<dc:creator>Mark Hahn</dc:creator>
		<pubDate>Fri, 11 Mar 2011 21:34:20 +0000</pubDate>
		<guid isPermaLink="false">http://blogs.cisco.com/?p=21231#comment-131217</guid>
		<description><![CDATA[Jeff, I&#039;m not quite sure what you were expecting to happen with paging.  numa allocation hints are just that: allocation hints.  so wouldn&#039;t you expect pageins to be treated as the new physical page allocations that they are?  if your app has a well-defined core/node affinity, and hasn&#039;t changed its memory policy, doesn&#039;t everything work as you expect? 

regarding pageins, did you set vm.page-cluster to 0?

as for the linux bashing in comments: linux won, get over it.  if you don&#039;t like what linux currently does, where&#039;s your patch?]]></description>
		<content:encoded><![CDATA[<p>Jeff, I&#8217;m not quite sure what you were expecting to happen with paging.  numa allocation hints are just that: allocation hints.  so wouldn&#8217;t you expect pageins to be treated as the new physical page allocations that they are?  if your app has a well-defined core/node affinity, and hasn&#8217;t changed its memory policy, doesn&#8217;t everything work as you expect? </p>
<p>regarding pageins, did you set vm.page-cluster to 0?</p>
<p>as for the linux bashing in comments: linux won, get over it.  if you don&#8217;t like what linux currently does, where&#8217;s your patch?
</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jeff</title>
		<link>http://blogs.cisco.com/performance/unexpected-linux-memory-migration/#comment-129632</link>
		<dc:creator>Jeff</dc:creator>
		<pubDate>Wed, 09 Mar 2011 14:47:17 +0000</pubDate>
		<guid isPermaLink="false">http://blogs.cisco.com/?p=21231#comment-129632</guid>
		<description><![CDATA[@Kyle You&#039;re right, BGP CNK does have VA but they map trivially to PA, so I tend to ignore the distinction.  Clearly, without VA, fragmentation could be an issue.  I&#039;m not sure if it is an issue in CNK right now or not.  I&#039;ll write a synthetic test to see what happens.

@David Yes, &quot;Linux must die&quot; is rather over-the-top, but that is my general state-of-mind (Jeff S. can certainly vouch for this).  I&#039;m just trying to throw stones at the conventional wisdom that Linux is the only way to fly in HPC.  The HPC community is pretty smart as far as computer users go, and we should not bind ourselves forever to an OS that was created by a grad student in 1991.

Kyle and I have both pointed out examples of attempts to use something other than Linux in HPC.  BGL and BGP CNK are generally considered successful, BGP more so because it had more Linux-like features, including dlopen().  Most people I know think Catamount was a failure, but I don&#039;t think the overarching design of Catamount was the reason.  If Cray had done a better job of socializing the fact that Catamount was not Linux for a reason, it might have been easier to get people outside of Sandia to accept it.  It seems there were implementation issues in many features of the Cray XT3, but I can&#039;t speak with any authority on what they were.

What has not been said, but I think is important, is that any alternative to Linux in HPC should be generally POSIX-compliant.  BGP CNK has this property, as it is derived from BSD.  There are very few examples of properly designed HPC codes failing to run on BGP because of this.  It is well-known that what BGP CNK prohibits are generally bad ideas in HPC anyways, e.g. oversubscription.  I refuse to accept any argument that fork() and exec*() are good ideas in HPC codes.]]></description>
		<content:encoded><![CDATA[<p>@Kyle You&#8217;re right, BGP CNK does have VA but they map trivially to PA, so I tend to ignore the distinction.  Clearly, without VA, fragmentation could be an issue.  I&#8217;m not sure if it is an issue in CNK right now or not.  I&#8217;ll write a synthetic test to see what happens.</p>
<p>@David Yes, &#8220;Linux must die&#8221; is rather over-the-top, but that is my general state-of-mind (Jeff S. can certainly vouch for this).  I&#8217;m just trying to throw stones at the conventional wisdom that Linux is the only way to fly in HPC.  The HPC community is pretty smart as far as computer users go, and we should not bind ourselves forever to an OS that was created by a grad student in 1991.</p>
<p>Kyle and I have both pointed out examples of attempts to use something other than Linux in HPC.  BGL and BGP CNK are generally considered successful, BGP more so because it had more Linux-like features, including dlopen().  Most people I know think Catamount was a failure, but I don&#8217;t think the overarching design of Catamount was the reason.  If Cray had done a better job of socializing the fact that Catamount was not Linux for a reason, it might have been easier to get people outside of Sandia to accept it.  It seems there were implementation issues in many features of the Cray XT3, but I can&#8217;t speak with any authority on what they were.</p>
<p>What has not been said, but I think is important, is that any alternative to Linux in HPC should be generally POSIX-compliant.  BGP CNK has this property, as it is derived from BSD.  There are very few examples of properly designed HPC codes failing to run on BGP because of this.  It is well-known that what BGP CNK prohibits are generally bad ideas in HPC anyways, e.g. oversubscription.  I refuse to accept any argument that fork() and exec*() are good ideas in HPC codes.
</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jeff Squyres</title>
		<link>http://blogs.cisco.com/performance/unexpected-linux-memory-migration/#comment-129629</link>
		<dc:creator>Jeff Squyres</dc:creator>
		<pubDate>Wed, 09 Mar 2011 14:41:42 +0000</pubDate>
		<guid isPermaLink="false">http://blogs.cisco.com/?p=21231#comment-129629</guid>
		<description><![CDATA[Fair enough; I tried to soften my text by stating things like &quot;There are good reasons why Linux works this way...&quot; and qualify that my comments were about applications that have specific memory affinity needs.  But perhaps that wasn&#039;t strong enough.

For example, Open MPI really does need shared memory buffers to reside on specific NUMA nodes, or performance will plummet (relatively speaking).  Paging out -- such as suspending and resuming a job, as in your case -- can be disastrous to performance.  The problem only gets worse as core counts keep going up, potentially enabling sites to start allocating multiple jobs to individual compute nodes.

That being said, I&#039;m planning a followup blog entry about this.  I&#039;ll include some stronger clarifications.]]></description>
		<content:encoded><![CDATA[<p>Fair enough; I tried to soften my text by stating things like &#8220;There are good reasons why Linux works this way&#8230;&#8221; and qualify that my comments were about applications that have specific memory affinity needs.  But perhaps that wasn&#8217;t strong enough.</p>
<p>For example, Open MPI really does need shared memory buffers to reside on specific NUMA nodes, or performance will plummet (relatively speaking).  Paging out &#8212; such as suspending and resuming a job, as in your case &#8212; can be disastrous to performance.  The problem only gets worse as core counts keep going up, potentially enabling sites to start allocating multiple jobs to individual compute nodes.</p>
<p>That being said, I&#8217;m planning a followup blog entry about this.  I&#8217;ll include some stronger clarifications.
</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Chris Samuel</title>
		<link>http://blogs.cisco.com/performance/unexpected-linux-memory-migration/#comment-129197</link>
		<dc:creator>Chris Samuel</dc:creator>
		<pubDate>Wed, 09 Mar 2011 01:15:24 +0000</pubDate>
		<guid isPermaLink="false">http://blogs.cisco.com/?p=21231#comment-129197</guid>
		<description><![CDATA[I think David&#039;s post here gives a good explanation of the source of the problem:

http://www.open-mpi.org/community/lists/hwloc-devel/2011/02/2012.php

On our HPC systems we require users to specify how much RAM per core they want (defaults to 1GB) and they have to request more if they need it.  That limit is enforced by setting RLIMIT_AS for its child processes.

The scheduler won&#039;t allocate jobs to a node for which memory is not available.]]></description>
		<content:encoded><![CDATA[<p>I think David&#8217;s post here gives a good explanation of the source of the problem:</p>
<p><a href="http://www.open-mpi.org/community/lists/hwloc-devel/2011/02/2012.php" rel="nofollow">http://www.open-mpi.org/community/lists/hwloc-devel/2011/02/2012.php</a></p>
<p>On our HPC systems we require users to specify how much RAM per core they want (defaults to 1GB) and they have to request more if they need it.  That limit is enforced by setting RLIMIT_AS for its child processes.</p>
<p>The scheduler won&#8217;t allocate jobs to a node for which memory is not available.
</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: David Singleton</title>
		<link>http://blogs.cisco.com/performance/unexpected-linux-memory-migration/#comment-129118</link>
		<dc:creator>David Singleton</dc:creator>
		<pubDate>Tue, 08 Mar 2011 22:59:32 +0000</pubDate>
		<guid isPermaLink="false">http://blogs.cisco.com/?p=21231#comment-129118</guid>
		<description><![CDATA[Getting back to your original post, Jeff, I&#039;m going to suggest it&#039;s a bit more alarmist than need be and that, really, you are talking about a positive.  Very few people use memory binding and they probably should.

At best, users use process affinity and assume (the default) &quot;preferred&quot; memory placement will do the right thing.  But in the face of physical memory (partially) full of page cache (often residual from the last job to finish), preferred page allocations will &quot;go off-node&quot; often enough to give annoyingly variable job performance.  That&#039;s what originally got us into investigating adding memory binding to MPI.  Even with swapin_readahead occasionally messing with binding placement, it&#039;s still usually miles ahead of preferred placement.  And, as you say, how many sites cause their jobs to page anyway?  Yes, we do as a matter of scheduling policy but at a lot of sites paging would only be caused by the user and in that case, NUMA placement is probably a relatively minor concern.

So I guess I would be pointing out the shortcomings of the usual preferred NUMA node approach and promoting the large win that binding can provide (with the small caveat that it&#039;s still not perfect).]]></description>
		<content:encoded><![CDATA[<p>Getting back to your original post, Jeff, I&#8217;m going to suggest it&#8217;s a bit more alarmist than need be and that, really, you are talking about a positive.  Very few people use memory binding and they probably should.</p>
<p>At best, users use process affinity and assume (the default) &#8220;preferred&#8221; memory placement will do the right thing.  But in the face of physical memory (partially) full of page cache (often residual from the last job to finish), preferred page allocations will &#8220;go off-node&#8221; often enough to give annoyingly variable job performance.  That&#8217;s what originally got us into investigating adding memory binding to MPI.  Even with swapin_readahead occasionally messing with binding placement, it&#8217;s still usually miles ahead of preferred placement.  And, as you say, how many sites cause their jobs to page anyway?  Yes, we do as a matter of scheduling policy but at a lot of sites paging would only be caused by the user and in that case, NUMA placement is probably a relatively minor concern.</p>
<p>So I guess I would be pointing out the shortcomings of the usual preferred NUMA node approach and promoting the large win that binding can provide (with the small caveat that it&#8217;s still not perfect).
</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Kyle Wheeler</title>
		<link>http://blogs.cisco.com/performance/unexpected-linux-memory-migration/#comment-129094</link>
		<dc:creator>Kyle Wheeler</dc:creator>
		<pubDate>Tue, 08 Mar 2011 22:00:39 +0000</pubDate>
		<guid isPermaLink="false">http://blogs.cisco.com/?p=21231#comment-129094</guid>
		<description><![CDATA[Fragmentation would matter in terms of being unable to allocate large blocks of memory because you don&#039;t have enough free contiguous space.

In any event, as I understand it, the BlueGene CNK provides an offset-based virtual addressing system.]]></description>
		<content:encoded><![CDATA[<p>Fragmentation would matter in terms of being unable to allocate large blocks of memory because you don&#8217;t have enough free contiguous space.</p>
<p>In any event, as I understand it, the BlueGene CNK provides an offset-based virtual addressing system.
</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: David Singleton</title>
		<link>http://blogs.cisco.com/performance/unexpected-linux-memory-migration/#comment-129091</link>
		<dc:creator>David Singleton</dc:creator>
		<pubDate>Tue, 08 Mar 2011 21:57:43 +0000</pubDate>
		<guid isPermaLink="false">http://blogs.cisco.com/?p=21231#comment-129091</guid>
		<description><![CDATA[I&#039;m not sure things are totally hopeless with the Linux VM and NUMA.  There are a lot of smart kernel developers - we just need to engage them and convince them there is an issue worth resolving.  Imagine a kernel config option to partition swap based on the source NUMA node of swapped out pages.  Then swapin_readahead will do the right thing at least in the context of most MPI jobs.

BTW, I think swap is one of the most useful OS features available to us for managing our HPC system in a smart way.  We could replace it by reserving half our memory for suspended jobs but I doubt users would like that idea.]]></description>
		<content:encoded><![CDATA[<p>I&#8217;m not sure things are totally hopeless with the Linux VM and NUMA.  There are a lot of smart kernel developers &#8211; we just need to engage them and convince them there is an issue worth resolving.  Imagine a kernel config option to partition swap based on the source NUMA node of swapped out pages.  Then swapin_readahead will do the right thing at least in the context of most MPI jobs.</p>
<p>BTW, I think swap is one of the most useful OS features available to us for managing our HPC system in a smart way.  We could replace it by reserving half our memory for suspended jobs but I doubt users would like that idea.
</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jeff</title>
		<link>http://blogs.cisco.com/performance/unexpected-linux-memory-migration/#comment-129085</link>
		<dc:creator>Jeff</dc:creator>
		<pubDate>Tue, 08 Mar 2011 21:40:06 +0000</pubDate>
		<guid isPermaLink="false">http://blogs.cisco.com/?p=21231#comment-129085</guid>
		<description><![CDATA[@Kyle NWChem has its own stack (in the true sense of the data structure i.e. push=alloc and pop=free) memory allocator, so fragmentation is not an issue.

In general quantum chemistry does not require frequent malloc+free, and generally they are stack-like such that fragmentation shouldn&#039;t be a huge issue.

I don&#039;t see how fragmentation matters on BG/P anyways.  There is no NUMA so fragmentation would only show up at the granularity of a cache line.]]></description>
		<content:encoded><![CDATA[<p>@Kyle NWChem has its own stack (in the true sense of the data structure i.e. push=alloc and pop=free) memory allocator, so fragmentation is not an issue.</p>
<p>In general quantum chemistry does not require frequent malloc+free, and generally they are stack-like such that fragmentation shouldn&#8217;t be a huge issue.</p>
<p>I don&#8217;t see how fragmentation matters on BG/P anyways.  There is no NUMA so fragmentation would only show up at the granularity of a cache line.
</p>
]]></content:encoded>
	</item>
</channel>
</rss>
