Open MPI and the MPI-3 MPI_T interface
The MPI_T interface is a standardized interface designed for MPI tools, but can be used by regular MPI application programs, too.
Specifically, MPI_T provides programmatic access to two types of MPI implementation data:
- Control variables: used to control the behavior of an MPI implementation
- Performance variables: provide access to internal MPI implementation performance metrics
Open MPI has long had an “MCA parameter” system, which allows users to tweak Open MPI’s behavior through mpirun CLI options, environment variables, and text files.
Nathan Hjelm, from Los Alamos National Laboratory, recently did the bulk of the work both to revamp the internals of Open MPI’s MCA parameter system and to tie it directly to the MPI_T control variable interface. As a result of this work, MPI_T control variables are MCA parameters (and vice versa).
(Note that all of the MPI_T control variable work described in this blog entry will be included in the upcoming Open MPI 1.7.3 release.)
One of the features we’re very excited about is the ability to assign an MPI_T-defined “level” to each control variable.
Specifically, Open MPI has a bazillion control variables (a.k.a., MCA parameters). This is both a curse and a blessing: it’s a blessing because power users can tweak just about any aspect of Open MPI’s behavior. It’s a curse because the sheer number of knobs to turn is bewildering to new users.
With Nathan’s new implementation, we can assign an MPI_T “level” to each control variable indicating for whom the variable is targeted. There are three main levels:
- End user: A user who is simply running MPI applications. We interpret this level to mean variables that are required for correctness of an Open MPI job. For example, variables controlling the selection of which network interfaces to use for MPI communications.
- Application tuner: Those who want to tweak Open MPI’s performance. These variables control most aspects of Open MPI’s resource usage, algorithmic patterns, protocol choices, etc.
- MPI developer: Those who are actually developing Open MPI itself. Such variables are mostly for debugging, and are likely only useful to the Open MPI developer community.
Each of these three main levels has three sub-levels (basic, detailed, and all), giving nine levels in total. For example, Open MPI intentionally puts very, very few parameters in the End user/Basic category so that new users will a) likely see only the control variables that they need for correctness, and b) not be frustrated with needing to sort through a bazillion total control variables to find the ones they want.
This past week, we started limiting the output of the ompi_info command: it now shows only End user/Basic parameters by default. For example, you now see only a few control variables for any given network transport. TCP has just two:
$ ompi_info --param btl tcp
    MCA btl: parameter "btl_tcp_if_include" (current value: "", data source: default value, level: 1 user/basic)
             Comma-delimited list of devices and/or CIDR notation of networks to use for MPI communication (e.g., "eth0,192.168.0.0/16"). Mutually exclusive with btl_tcp_if_exclude.
    MCA btl: parameter "btl_tcp_if_exclude" (current value: "127.0.0.1/8,sppp", data source: default value, level: 1 user/basic)
             Comma-delimited list of devices and/or CIDR notation of networks to NOT use for MPI communication -- all devices not matching these specifications will be used (e.g., "eth0,192.168.0.0/16"). If set to a non-default value, it is mutually exclusive with btl_tcp_if_include.
If you want to see many more control variables, use the --level option. In this example, we ask for App tuner/All:
$ ompi_info --param btl tcp --level 6
    MCA btl: parameter "btl_tcp_if_include" (current value: "", data source:
    [...and 20 more TCP-related control variables...]
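To make the numbering concrete, here is a minimal C sketch of how a level number such as the "1" in "level: 1 user/basic" or the "6" passed to --level could be decoded into its main level and sub-level. It assumes the nine levels are numbered 1 through 9 in the order user, tuner, dev, with basic, detail, all inside each (mirroring the MPI_T verbosity constants such as MPI_T_VERBOSITY_USER_DETAIL). It also assumes that asking for level N shows every variable whose level is at most N, which is consistent with the two ompi_info runs above. All type and function names here are hypothetical; none of them are part of Open MPI or MPI_T.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical types for illustration only; not Open MPI or MPI_T API. */
enum audience { AUDIENCE_USER = 0, AUDIENCE_TUNER = 1, AUDIENCE_DEV = 2 };
enum detail   { DETAIL_BASIC = 0, DETAIL_DETAIL = 1, DETAIL_ALL = 2 };

/* Assumption: valid levels are 1..9, as in "level: 1 user/basic". */
int level_is_valid(int level)
{
    return level >= 1 && level <= 9;
}

/* Main level: 1-3 -> user, 4-6 -> tuner, 7-9 -> dev. */
enum audience level_audience(int level)
{
    assert(level_is_valid(level));
    return (enum audience)((level - 1) / 3);
}

/* Sub-level within the main level: basic, detail, all. */
enum detail level_detail(int level)
{
    assert(level_is_valid(level));
    return (enum detail)((level - 1) % 3);
}

/* Format a level the way the output above shows it, e.g. "1 user/basic".
 * Returns the value returned by snprintf. */
int level_describe(int level, char *buf, size_t len)
{
    static const char *aud[] = { "user", "tuner", "dev" };
    static const char *det[] = { "basic", "detail", "all" };

    assert(buf != NULL);
    return snprintf(buf, len, "%d %s/%s", level,
                    aud[level_audience(level)], det[level_detail(level)]);
}

/* Assumed filter rule: a variable is shown when its level does not
 * exceed the requested level. */
int level_is_shown(int var_level, int requested_level)
{
    return var_level <= requested_level;
}
```

Under these assumptions, level 6 decodes to tuner/all, and a level 4 variable would be hidden by default (requested level 1) but shown with --level 6.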
Open MPI also contains support for an older performance metric introspection system called PERUSE. Unfortunately, no other MPI implementation implemented PERUSE, so the PERUSE effort died.
That being said, as part of our MPI_T work, Nathan is working on updating / converting / revamping our PERUSE-based performance metrics to the new MPI_T performance variables.
This work is ongoing, but is looking very promising.