Higher Level MPI Features

A parallel program may not provide much benefit if only the computations are done in parallel. Disk I/O and message passing may also need to be parallelized in order to avoid bottlenecks that limit performance to serial speed.

Parallel Message Passing

Suppose we have 100 processes in an MPI job that all need the same data in order to begin their calculations. We could simply loop through them and send the data as follows:

/* Root process (rank 0) sends the same buffer to each of the other
   99 ranks, one at a time */
for (rank = 1; rank < 100; ++rank)
{
    if ( MPI_Send(data, len, MPI_CHAR, rank, TAG, MPI_COMM_WORLD) !=
         MPI_SUCCESS )
    {
        ...     /* Handle the send error */
    }
}

The problem is that this is a serial operation that sends one message at a time, leaving many pathways through the network switch idle. A typical network switch used in a cluster can transmit many messages between disjoint node pairs at the same time. For example, node 1 can send a message to node 5 at the same time node 4 sends a message to node 10.

If it ends up taking longer to distribute data to the processes than it does to do the computations on that data, then it's time to look for a different strategy.

A simple strategy to better utilize the network hardware might work as follows (a code sketch appears after the list):

  1. Root process transmits data to process 1.
  2. Root process and process 1 transmit to processes 2 and 3 at the same time.
  3. Root process and processes 1, 2, and 3 can all transmit to processes 4, 5, 6, and 7 at the same time.
  4. ...and so on.
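
Below is a minimal sketch of this doubling strategy in C, assuming the data originates at rank 0 and reusing the same point-to-point MPI_Send() and MPI_Recv() calls shown above. The function name tree_broadcast() and the TAG value are chosen here purely for illustration; a real program would use MPI_Bcast() instead, as discussed below.

#include <mpi.h>

#define TAG     0

/* In round k, every rank below 2^k forwards the data to the rank 2^k
   above it, so the number of ranks holding the data doubles each round */
void    tree_broadcast(char *data, int len, int rank, int nprocs)
{
    int         stride;
    MPI_Status  status;

    for (stride = 1; stride < nprocs; stride *= 2)
    {
        if ( rank < stride )
        {
            /* This rank already has the data: forward it if the
               destination rank exists */
            if ( rank + stride < nprocs )
                MPI_Send(data, len, MPI_CHAR, rank + stride, TAG,
                         MPI_COMM_WORLD);
        }
        else if ( rank < stride * 2 )
        {
            /* This rank receives the data during this round */
            MPI_Recv(data, len, MPI_CHAR, rank - stride, TAG,
                     MPI_COMM_WORLD, &status);
        }
    }
}

With 100 processes, this distributes the data in ceil(log2(100)) = 7 rounds rather than 99 sequential sends.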

If the job uses a large number of processes, the limits of the network switch may be reached, but that's OK. A saturated switch simply transmits the data as fast as it can. The only real drawback is increased response times for other processes sharing the switch, which is one reason clusters use a dedicated network for message passing.

While this broadcast strategy is simple in concept, it can be tricky to program. Real-world strategies take into account different network architectures in an attempt to optimize throughput for specific hardware.

Fortunately, MPI offers a number of high-level routines such as MPI_Bcast(), MPI_Scatter(), and MPI_Gather(), which will provide good performance on most hardware, and certainly better than the serial loop above.
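
For example, the entire serial loop above can be replaced by a single collective call, assuming rank 0 is the root that holds the data. Every process in the communicator makes the same call:

/* All ranks call MPI_Bcast().  Rank 0 (the root) supplies the buffer
   contents; every other rank receives the data into its own buffer.
   The MPI library chooses an efficient distribution pattern. */
if ( MPI_Bcast(data, len, MPI_CHAR, 0, MPI_COMM_WORLD) != MPI_SUCCESS )
{
    ...     /* Handle the error */
}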

There may be cases where using routines designed for specific network switches can offer significantly better performance. However, this may mean sacrificing portability.

A big part of becoming a proficient MPI programmer is simply learning what MPI has to offer and choosing the right routines for your needs.

Self-test
  1. What is the easiest way to ensure reasonably good performance from your MPI programs?
  2. What problems are associated with overloading a cluster's dedicated network?
  3. Write an MPI program that solves a linear system of equations using an open source distributed parallel linear systems library.