Cisco Blogs

Open Resilient Cluster Manager (ORCM)

December 7, 2009

This past weekend, Cisco announced a new open source effort launched under the Open MPI project umbrella: the Open Resilient Cluster Manager (or “OpenRCM”, or, my personal favorite, “ORCM”.  Say it 10 times fast!).

The Open MPI community is pleased to announce the establishment of a new subproject built upon the Open MPI code base. Using work initially contributed by Cisco Systems, the Open Resilient Cluster Manager is an open source project released under the Open MPI [BSD] license focused on development of an “always on” resource manager for systems spanning the range from embedded to very large clusters.

The ORCM web site neatly lays out the project goals:

  • Maintain operation of running applications in the face of single or multiple failures of any given process within that application.
  • Proactively detect incipient failures (hardware and/or software) and respond appropriately to maintain overall system operation.
  • Support both MPI and non-MPI applications.
  • Provide a research platform for exploring new concepts and methods in resilient systems.

“That’s great,” you say.  “But why on earth do we need yet another cluster resource manager?”

Also stolen directly from the ORCM project page:

Several features distinguish OpenRCM from other common resource managers, including (but not limited to):

  • Full utilization of component architecture methods to provide a platform for research and production code to coexist and be tested in actual production environments.
  • A focus on fault prediction, integration with embedded state-of-health sensors, and proactive response to both hardware and software faults.
  • Support for dynamic resource addition/subtraction from running multi-node applications, allowing for “on-the-fly” removal and replacement of nodes without stopping applications.
  • Built-in communications library for resilient applications that automatically maintains communications in the presence of failed processes.
  • An architecture designed to support platforms ranging from small embedded multi-processor systems to large-scale high-performance computing clusters.

The tie-in to the Open MPI project here is that ORCM is built upon parts of the Open MPI code base.  A little-known fact is that the Open MPI code base is segregated into three distinct layers.  Starting from the bottom:

  1. Open Portable Access Layer (OPAL): operating system glue, linked lists, and other “utility” code.
  2. Open MPI Runtime Environment (ORTE): all the abstractions necessary to map, launch, monitor, and kill parallel jobs, I/O redirection, etc.
  3. Open MPI layer: the MPI API and all its supporting logic.

ORCM is built upon OPAL and ORTE: it is effectively a different top-level personality for the code base, sitting where the MPI layer normally would.  In short, there’s oodles of good stuff in OPAL and ORTE, so we decided to use it for another project.  Woot!
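If the “personality” idea sounds abstract, here’s a toy sketch of it in Python. It is emphatically not the real OPAL, ORTE, or ORCM code (those are written in C). Every class and method name below is made up for illustration, and the “relaunch anything that isn’t running” policy is just an assumption to keep the example short. The only point is that two different top layers can sit on the same runtime and utility layers.

```python
"""Toy model of a layered code base with two top-level "personalities".

All names here are hypothetical; none of this is the real OPAL/ORTE/ORCM API.
"""


class UtilityLayer:
    """Stands in for the bottom layer: small, general-purpose helpers."""

    def log(self, message):
        return f"[util] {message}"


class RuntimeLayer:
    """Stands in for the middle layer: launches, tracks, and kills jobs."""

    def __init__(self, util):
        self.util = util
        self.jobs = {}  # job name -> "running" or "killed"

    def launch(self, name):
        self.jobs[name] = "running"
        return self.util.log(f"launched {name}")

    def kill(self, name):
        self.jobs[name] = "killed"
        return self.util.log(f"killed {name}")


class MpiPersonality:
    """One possible top layer: simply starts a parallel job."""

    def __init__(self, runtime):
        self.runtime = runtime

    def run(self, name):
        return self.runtime.launch(name)


class ManagerPersonality:
    """A different top layer on the same runtime: restarts dead jobs."""

    def __init__(self, runtime):
        self.runtime = runtime

    def recover(self, name):
        # Assumed policy: relaunch any job that is not currently running.
        if self.runtime.jobs.get(name) != "running":
            return self.runtime.launch(name)
        return None  # nothing to do
```

Both personalities are handed the same `RuntimeLayer` object, so neither one has to reimplement launching or bookkeeping. That reuse is the whole appeal of building a second project on the lower layers.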

Please keep in mind that while the initial ORCM code base is functional, a production release has not yet been made.  Heck, we’re not even generating nightly tarballs yet.  But we thought we’d open the doors to a wider community, thereby allowing interested parties to check out the developer’s trunk and get involved in the project (see the project web site for details).  Announcements regarding eventual production releases (expected to commence in early 2010) will be made on the Open MPI announcements mailing list.
