High Performance Computing Networking

A recent exchange on the Open MPI users’ list turned up a minor bug in our code base.  The bug had to do with how Open MPI reported a settings value through our configuration querying tool (“ompi_info”).

The code using the configuration value in question was doing the Right Things, but the tool was effectively reporting the wrong value.  This led to some confusion on the mailing list, resulting in a bug fix being pushed upstream and the user concluding, “Trust, but verify.”

Very true!

As applied to HPC: You need to trust your hardware and software vendors, but verify both that your system is working the way you expect it to and that your applications are getting the performance they should.
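As a rough illustration of that kind of check, here is a minimal Python sketch. It assumes you have two sets of numbers in hand: what a tool reports and what you actually observed or measured. The function names, the setting names, and the 10% tolerance are all hypothetical choices made for this example; none of them come from Open MPI or any other real tool.

```python
def find_mismatches(reported, observed):
    """Return {name: (reported_value, observed_value)} for settings that differ.

    `reported` is what a query tool claims; `observed` is what you saw the
    system actually use. Both are plain dicts (hypothetical inputs).
    """
    mismatches = {}
    for name in sorted(set(reported) | set(observed)):
        r = reported.get(name)  # None if the tool did not report this setting
        o = observed.get(name)  # None if we never observed this setting
        if r != o:
            mismatches[name] = (r, o)
    return mismatches


def within_tolerance(measured, expected, rel_tol=0.10):
    """True if `measured` is within `rel_tol` (as a fraction) of `expected`."""
    return abs(measured - expected) <= rel_tol * abs(expected)


if __name__ == "__main__":
    # Hypothetical values, purely for illustration.
    reported = {"example_eager_limit": 4096, "example_buffer_size": 131072}
    observed = {"example_eager_limit": 65536, "example_buffer_size": 131072}
    print(find_mismatches(reported, observed))

    # Did the application get roughly the bandwidth we expected?
    print(within_tolerance(measured=9.2, expected=10.0))
```

The point is less the code than the habit: write down what you expect, compare it against what you actually see, and look into any gap.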

MPI implementations, just like any network stack, are large, complex pieces of software.  We try very hard to deliver perfect software, but even though we don’t like to admit it, bugs happen.  The bug revealed through this exchange with the user was pretty minor, but it did cause some confusion.

But sometimes the most nefarious bugs occur when users exercise our carefully developed software in ways that we didn’t anticipate, in environments that we didn’t expect.  In these cases, you might unintentionally run across an untested code path, or trigger an unexpected combination of inputs.

The moral of the story here is: trust, but verify.  Whether something goes wrong or everything appears to be going Right, verify, verify, verify.

It’s good science.

2 Comments.


  1. _Especially_ when everything is going right! That’s the most dangerous time of all.

    • I’m too embarrassed (or proud?) to admit how many times I thought I had some awesome new piece of research where all the results looked great, but then upon the 3rd or 4th examination, found that there was a mistake that made my result worthless.

      My grad school advisor beat into me: verify, verify, verify. Then verify again.
