Collaborate to Innovate

November 11, 2010 at 2:23 pm PST

Doug Eadline recently talked about how community is tremendously important to HPC.  Two words: he’s right.  The HPC ecosystem is all about working together to advance the state of the art.  No single group, university, or company could do it alone.

As Cisco’s representative to the MPI Forum and the Open MPI software projects, I often work with teams of researchers and developers.  Sometimes all the people are in one physical place and the process of sharing ideas and dividing work is easy. But it’s much more common for me to participate in geographically scattered groups of people.  And there’s no doubt about it: collaboration across distances is just hard.  You just can’t beat having a bunch of engineers in the same room with a whiteboard when trying to figure out a complex topic. But we don’t always get that opportunity.

So how do you take a disparate group of people and make them productive?

X petaflops, where X>1

October 29, 2010 at 4:49 am PST

Lotsa news coming out in the ramp-up to SC.  Probably the biggest is that China is now the proud owner of the 2.5-petaflop computing monster named “Tianhe-1A”.

Congratulations to all involved!  2.5 petaflops is an enormous achievement.

Just to put this in perspective, there are only three other (publicly disclosed) machines in the world right now that have reached a petaflop: the Oak Ridge US Department of Energy (DoE) “Jaguar” machine hit 1.7 petaflops, China’s “Nebulae” hit 1.3 petaflops, and the Los Alamos US DoE “Roadrunner” machine hit 1.0 petaflops.

Sockets, cores, and hyperthreads… oh my!

October 15, 2010 at 5:00 am PST

Core counts are going up.  Cisco’s C460 rack-mount server series, for example, can have up to 32 Nehalem EX cores.  As a direct result, we may well be returning to the era of running more than one MPI process per server.  This has long been true in “big iron” parallel resources, but commodity Linux HPC clusters have tended towards the one-MPI-process-per-server model in recent history.

Because of this trend, I have an open-ended question for MPI users and cluster administrators: how do you want to bind MPI processes to processors?  For example: what kinds of binding patterns do you want?  How many hyperthreads / cores / sockets do you want each process to bind to?  How do you want to specify what process binds where?  What level of granularity of control do you want / need?  (…and so on)
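To make the question concrete, here is a minimal Python sketch of two binding patterns that often come up: packing consecutive ranks onto the cores of one socket before moving to the next, and scattering ranks round-robin across sockets.  The function names and the (socket, core) tuples are hypothetical, invented purely for illustration; this is not how Open MPI or any other MPI implementation exposes binding.

```python
def bind_by_core(num_procs, sockets, cores_per_socket):
    """Pack ranks: fill every core of socket 0, then socket 1, and so on.

    Returns a dict mapping MPI rank -> (socket index, core index).
    Hypothetical helper for illustration only.
    """
    cores = [(s, c) for s in range(sockets) for c in range(cores_per_socket)]
    if num_procs > len(cores):
        raise ValueError("more processes than cores")
    return {rank: cores[rank] for rank in range(num_procs)}


def bind_by_socket(num_procs, sockets, cores_per_socket):
    """Scatter ranks: round-robin across sockets, one core at a time.

    Returns a dict mapping MPI rank -> (socket index, core index).
    Hypothetical helper for illustration only.
    """
    if num_procs > sockets * cores_per_socket:
        raise ValueError("more processes than cores")
    return {rank: (rank % sockets, rank // sockets) for rank in range(num_procs)}


if __name__ == "__main__":
    # Assumed example topology: 2 sockets with 4 cores each, 4 MPI processes.
    print("by core:  ", bind_by_core(4, sockets=2, cores_per_socket=4))
    print("by socket:", bind_by_socket(4, sockets=2, cores_per_socket=4))
```

With four processes on this assumed two-socket machine, the packed pattern keeps all ranks on socket 0 (neighboring ranks can share a cache), while the scattered pattern alternates sockets (each rank gets more memory bandwidth to itself).  Which one is "right" depends on the application, which is exactly why the question is open-ended.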

We are finding that every user we ask seems to have slightly different answers.  What do you think?  Let me know in the comments, below.

“Give me 4 255-sided die and I’ll get you some IPs”

September 29, 2010 at 12:00 pm PST

Have you ever wondered how an MPI implementation picks network paths and allocates resources?  It’s a pretty complicated (set of) issue(s), actually.

An MPI implementation must tread the fine line between performance and resource consumption.  If the implementation chooses poorly, it risks poor performance and/or the wrath of the user.  If the implementation chooses well, users won’t notice at all — they silently enjoy good performance.

It’s a thankless job, but someone’s got to do it.  :-)

hwloc 1.0 released!

May 18, 2010 at 12:00 pm PST

At long last, we have released a stable, production-quality version of Hardware Locality (hwloc).  Yay!

If you’ve missed all my prior discussions about hwloc, hwloc provides command line tools and a C API to obtain the hierarchical map of key computing elements, such as: NUMA memory nodes, shared caches, processor sockets, processor cores, and processing units (logical processors or “threads”). hwloc also gathers various attributes such as cache and memory information, and is portable across a variety of different operating systems and platforms.

In an increasingly NUMA (and NUNA!) world, hwloc is a valuable tool for high performance.

