<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Message size: big or small?</title>
	<atom:link href="http://blogs.cisco.com/performance/message-size-big-or-small/feed/" rel="self" type="application/rss+xml" />
	<link>http://blogs.cisco.com/performance/message-size-big-or-small/</link>
	<description></description>
	<lastBuildDate>Sun, 19 May 2013 12:19:51 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	
	<item>
		<title>By: Jeff Squyres</title>
		<link>http://blogs.cisco.com/performance/message-size-big-or-small/#comment-698172</link>
		<dc:creator>Jeff Squyres</dc:creator>
		<pubDate>Tue, 29 Jan 2013 17:14:25 +0000</pubDate>
		<guid isPermaLink="false">http://blogs.cisco.com/?p=99146#comment-698172</guid>
		<description><![CDATA[Correct -- in that case, I&#039;d say that the app developer probably should have coalesced the 4 messages into 1.

But just to clarify your point: hardware send queues are typically fairly deep, capable of holding thousands of pending sends.  So the coalescing code in Open MPI probably won&#039;t be triggered by just 4 sends.  It&#039;ll typically be triggered by ping-pong benchmarks (yet another reason benchmarks are not good indicators of real performance!) and other sending-many-thousands-of-sends-at-a-time types of codes.]]></description>
		<content:encoded><![CDATA[<p>Correct &#8212; in that case, I&#8217;d say that the app developer probably should have coalesced the 4 messages into 1.</p>
<p>But just to clarify your point: hardware send queues are typically fairly deep, capable of holding thousands of pending sends.  So the coalescing code in Open MPI probably won&#8217;t be triggered by just 4 sends.  It&#8217;ll typically be triggered by ping-pong benchmarks (yet another reason benchmarks are not good indicators of real performance!) and other sending-many-thousands-of-sends-at-a-time types of codes.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: motonacciu (First Name)</title>
		<link>http://blogs.cisco.com/performance/message-size-big-or-small/#comment-698169</link>
		<dc:creator>motonacciu (First Name)</dc:creator>
		<pubDate>Tue, 29 Jan 2013 16:22:29 +0000</pubDate>
		<guid isPermaLink="false">http://blogs.cisco.com/?p=99146#comment-698169</guid>
		<description><![CDATA[Yes it makes sense now! 

For example let&#039;s say you have this sequence of 4 messages:
MPI_Send(&amp;i, 1, MPI_INT, peer, tag, comm);
MPI_Send(&amp;j, 1, MPI_INT, peer, tag, comm);
MPI_Send(&amp;i, 1, MPI_INT, peer, tag, comm);
MPI_Send(&amp;j, 1, MPI_INT, peer, tag, comm);

The way you explained the implementation makes me think that most likely the first message is dispatched immediately (assuming that the send queue only accepts 1 message) and the next 3 will be coalesced together and sent out.

Therefore this runtime solution does not *completely* solve the problem: if the developer had coalesced the sends in the source code, performance would probably be better (also given all the considerations you discussed in the post).

thanks again.]]></description>
		<content:encoded><![CDATA[<p>Yes it makes sense now! </p>
<p>For example let&#8217;s say you have this sequence of 4 messages:<br />
MPI_Send(&amp;i, 1, MPI_INT, peer, tag, comm);<br />
MPI_Send(&amp;j, 1, MPI_INT, peer, tag, comm);<br />
MPI_Send(&amp;i, 1, MPI_INT, peer, tag, comm);<br />
MPI_Send(&amp;j, 1, MPI_INT, peer, tag, comm);</p>
<p>The way you explained the implementation makes me think that most likely the first message is dispatched immediately (assuming that the send queue only accepts 1 message) and the next 3 will be coalesced together and sent out.</p>
<p>Therefore this runtime solution does not *completely* solve the problem: if the developer had coalesced the sends in the source code, performance would probably be better (also given all the considerations you discussed in the post).</p>
<p>thanks again.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jeff Squyres</title>
		<link>http://blogs.cisco.com/performance/message-size-big-or-small/#comment-698167</link>
		<dc:creator>Jeff Squyres</dc:creator>
		<pubDate>Tue, 29 Jan 2013 15:44:12 +0000</pubDate>
		<guid isPermaLink="false">http://blogs.cisco.com/?p=99146#comment-698167</guid>
		<description><![CDATA[Open MPI coalesces messages only when it is stalled from sending.

For example, when the hardware send queue is full, but the application is still sending more new (short) messages.  If successive new messages in this case are of the same MPI signature (e.g., to the same receiver on the same CID with the same tag), then Open MPI will coalesce the messages together.  

Hence, when the send queue eventually opens up, there will be fewer entries placed on the queue than MPI messages that are actually sent.

Make sense?

This was a particularly ugly feature to implement and debug.  But I&#039;d be surprised if other OpenFabrics-based MPI implementations don&#039;t do similar things (i.e., coalesce only when otherwise blocked from sending).]]></description>
		<content:encoded><![CDATA[<p>Open MPI coalesces messages only when it is stalled from sending.</p>
<p>For example, when the hardware send queue is full, but the application is still sending more new (short) messages.  If successive new messages in this case are of the same MPI signature (e.g., to the same receiver on the same CID with the same tag), then Open MPI will coalesce the messages together.  </p>
<p>Hence, when the send queue eventually opens up, there will be fewer entries placed on the queue than MPI messages that are actually sent.</p>
<p>Make sense?</p>
<p>This was a particularly ugly feature to implement and debug.  But I&#8217;d be surprised if other OpenFabrics-based MPI implementations don&#8217;t do similar things (i.e., coalesce only when otherwise blocked from sending).</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: motonacciu (First Name)</title>
		<link>http://blogs.cisco.com/performance/message-size-big-or-small/#comment-698166</link>
		<dc:creator>motonacciu (First Name)</dc:creator>
		<pubDate>Tue, 29 Jan 2013 15:27:12 +0000</pubDate>
		<guid isPermaLink="false">http://blogs.cisco.com/?p=99146#comment-698166</guid>
		<description><![CDATA[Nice post, this is an interesting topic. 

Since you are an expert, do you know whether MPI libraries are able to automatically coalesce messages? Some time ago I came across a parameter of Open MPI&#039;s MCA framework called &quot;btl_openib_use_message_coalescing&quot; which should do that. However, I have trouble figuring out how such a thing is concretely implemented. To me, there should be a kind of timer that waits &quot;enough time&quot; for several messages to be in the send buffer and then dispatches them all together to the receiver... but this would have a very bad impact on latency... and I don&#039;t think this would even be allowed in MPI.

So the question remains!]]></description>
		<content:encoded><![CDATA[<p>Nice post, this is an interesting topic. </p>
<p>Since you are an expert, do you know whether MPI libraries are able to automatically coalesce messages? Some time ago I came across a parameter of Open MPI&#8217;s MCA framework called &#8220;btl_openib_use_message_coalescing&#8221; which should do that. However, I have trouble figuring out how such a thing is concretely implemented. To me, there should be a kind of timer that waits &#8220;enough time&#8221; for several messages to be in the send buffer and then dispatches them all together to the receiver&#8230; but this would have a very bad impact on latency&#8230; and I don&#8217;t think this would even be allowed in MPI.</p>
<p>So the question remains!</p>
]]></content:encoded>
	</item>
</channel>
</rss>
