Cisco Blogs

Cisco Blog > High Performance Computing Networking

Registered Memory (RMA / RDMA) and MPI implementations

In a prior blog post, I talked about RMA (and RDMA) networks, and what they mean to MPI implementations.  In this post, I’ll talk about one of the consequences of RMA networks: registered memory.

Registered memory is something that most HPC administrators and users have at least heard of, but may not fully understand.

Let me clarify it for you: registered memory is both a curse and a blessing.

It’s more of the former than the latter, if you ask me, but MPI implementations need to use (and track) registered memory to get high performance on today’s high-performance networking API stacks.
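The "track" part is where much of the pain lives: because registering memory is expensive, MPI implementations typically keep a cache of registered regions and re-use them on later sends and receives. Here's a toy sketch of that idea in Python; all the names and the cache policy are my own illustration, not any real MPI implementation's code:

```python
# Toy registration cache: real MPI implementations cache pinned
# (registered) memory regions because registration is expensive.
# Names and policy here are illustrative only.

class RegistrationCache:
    def __init__(self):
        self._regions = {}      # (addr, length) -> use count
        self.registrations = 0  # how many "expensive" registrations occurred

    def register(self, addr, length):
        """Return a registration handle, pinning only on a cache miss."""
        key = (addr, length)
        if key not in self._regions:
            # In a real stack, this is where the driver would pin the
            # pages with the NIC (e.g., a memory-registration verb call).
            self._regions[key] = 0
            self.registrations += 1
        self._regions[key] += 1
        return key

    def deregister(self, key):
        """Drop one use, but keep the region pinned for future re-use."""
        self._regions[key] -= 1

cache = RegistrationCache()
r1 = cache.register(0x1000, 4096)
cache.deregister(r1)
r2 = cache.register(0x1000, 4096)  # cache hit: no second registration
print(cache.registrations)         # -> 1
```

The re-use on the second `register()` call is the whole point: the cost of pinning is paid once, and subsequent transfers from the same buffer are cheap. (The flip side, of course, is that the implementation must notice when the application frees or remaps that memory.)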


“RDMA” — what does it mean to MPI applications?

RDMA stands for Remote Direct Memory Access.  The acronym is typically associated with OpenFabrics networks such as iWARP, IBoE (a.k.a. RoCE), and InfiniBand.  But “RDMA” is really just today’s flavor du jour of a more general concept: RMA (remote memory access), or directly reading and writing a peer’s memory space.

RMA implementations (including RDMA-based networks, such as OpenFabrics) typically include one or more of the following technologies:

  1. Operating system bypass: userspace applications communicate directly with the network hardware.
  2. Hardware offload: network activity is driven by the NIC, not the main CPU.
  3. Hardware or software notification: a signal when messages finish sending or are received.
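The net effect of these technologies is one-sided semantics: the initiator reads or writes a peer's memory without the peer's CPU participating in that particular transfer. Here's a toy illustration of put/get semantics in Python, with a plain local buffer standing in for the remote memory window (no real network involved; the names are mine, not any real API's):

```python
# Toy model of RMA "put" and "get": the target exposes a window of
# memory, and the initiator reads/writes it directly -- the target's
# code does not participate in the transfer itself.

class Window:
    """A region of a peer's memory exposed for remote access."""
    def __init__(self, size):
        self.buf = bytearray(size)

def rma_put(window, offset, data):
    """One-sided write into the peer's exposed memory."""
    window.buf[offset:offset + len(data)] = data

def rma_get(window, offset, length):
    """One-sided read from the peer's exposed memory."""
    return bytes(window.buf[offset:offset + length])

peer = Window(64)             # memory exposed by the "remote" process
rma_put(peer, 0, b"hello")    # initiator writes; peer's CPU is idle
print(rma_get(peer, 0, 5))    # -> b'hello'
```

Notice that nothing in the "peer" runs during the transfer; that is exactly what hardware offload and OS bypass buy you on a real RDMA network.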

How are these technologies typically used in MPI implementations?


hwloc article published in Linux Pro Magazine

Brice, Samuel, and I got the crazy idea to write a magazine article about hwloc to expand its reach to people outside the HPC community. We wrote something up and submitted it to Linux Pro Magazine — and they accepted it!

I just got my copy in the mail — it’s published in the July issue: “Lessons in Locality: hwloc.”

[Picture of the first page of the article]


MPI 2.2’s Scalable Process Topologies and Topology Mapping in practice

Today we feature a guest post from Torsten Hoefler, the Performance Modeling and Simulation lead of the Blue Waters project at NCSA, and Adjunct Assistant Professor in the Computer Science department at the University of Illinois at Urbana-Champaign (UIUC).

I’m sure everybody has heard of network topologies such as 2D or 3D tori, fat-trees, Kautz networks, and Clos networks. It can be argued that even multi-core nodes (if run in “MPI everywhere” mode) form a separate “hierarchical network”. And you have probably also wondered how to map your communication onto such network topologies in a portable way.

MPI has offered support for such optimized mappings since the old days of MPI-1. The process topology functionality is probably one of the most overlooked useful features of MPI. We have to admit that it had some issues and was clumsy to use, but it was finally fixed in MPI-2.2. :-)
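For the curious, the basic mapping that MPI's Cartesian topology routines compute can be sketched in a few lines. This Python snippet mimics what MPI_Cart_coords and MPI_Cart_rank do for a grid laid out in row-major order (an illustration only; a real implementation can additionally reorder ranks to match the physical network, which is the interesting part):

```python
def cart_coords(rank, dims):
    """Row-major rank -> grid coordinates, as MPI_Cart_coords computes
    for a communicator created with MPI_Cart_create."""
    coords = []
    for d in reversed(dims):
        coords.append(rank % d)
        rank //= d
    return list(reversed(coords))

def cart_rank(coords, dims):
    """Inverse mapping: grid coordinates -> row-major rank."""
    rank = 0
    for c, d in zip(coords, dims):
        rank = rank * d + c
    return rank

dims = [3, 4]                    # a 3x4 grid of 12 processes
print(cart_coords(7, dims))      # -> [1, 3]
print(cart_rank([1, 3], dims))   # -> 7
```

With these mappings in hand, neighbor communication becomes portable: a process asks for the rank at coordinates (x±1, y±1) instead of hard-coding which rank lives where.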


MPI run-time at large scale

With the news that Open MPI is being used on the K supercomputer (i.e., the #1 machine on the June 2011 Top500 list), another colleague of mine, Ralph Castain — who focuses on the run-time system in Open MPI — pointed out that K has over 80,000 processors (over 640K cores!).  That’s ginormous.

He was musing to me that it would be fascinating to see some of K’s run-time data for what most people don’t consider too interesting / sexy: MPI job launch performance.

For example, another public use of Open MPI is on Los Alamos National Lab’s RoadRunner, which has 3,000+ nodes at 4 processes per node (remember RoadRunner?  It was #1 for a while, too).

It’s worth noting that Open MPI starts up full-scale jobs on RoadRunner — meaning that all processes complete MPI_INIT — in less than 1 minute.
