A Simple MPI Program

MPI programming is complex and not necessary for all cluster users. Many cluster users will only run HTC jobs, which do not require communication between processes. The goal of this example is to provide a very simple introduction to MPI programming for those who are interested. We assume some familiarity with C or Fortran.

Users interested in pursuing MPI programming are encouraged to consult a book on the subject. Many good resources are cited on the HPC website at http://www4.uwm.edu/hpc/related_resources/.

C version:

/*
 *  Program description:
 *
 *  MPI Example
 *
 *  A typical MPI job will start N processes, all running the same
 *  program.  This is known as the Single-Program Multiple-Data (SPMD)
 *  model.
 *
 *  This program demonstrates how the various processes
 *  distinguish themselves from the rest using MPI library functions.
 *
 *  Return values of all MPI functions are checked as a matter of good
 *  practice.  Note that by default, MPI will abort the job rather than
 *  return if an error is encountered.  This behavior can be changed,
 *  however, so including error checks for every MPI call is a good idea.
 */

/*
 *  Modification history:
 *  Date        Name        Modification
 *  2011-08-24  Jason Bacon Begin
 */

#include <stdio.h>      /* fputs(), printf(), etc. */
#include <stdlib.h>     /* exit() */
#include <sysexits.h>   /* EX_ exit constants */
#include <sys/param.h>  /* MAXHOSTNAMELEN */
#include <unistd.h>     /* gethostname() */
#include <mpi.h>        /* MPI functions and constants */

#define ROOT_RANK       0
#define MESSAGE_MAX_LEN 128
#define TAG             0

int     main(int argc, char *argv[])
{
    int             total_processes;
    int             my_rank;
    int             rank;
    MPI_Status      status;
    char            message[MESSAGE_MAX_LEN + 1];
    char            hostname[MAXHOSTNAMELEN + 1];
    
    /* Get name of node running this process */
    if ( gethostname(hostname, MAXHOSTNAMELEN) != 0 )
    {
        fputs("gethostname() failed.\n", stderr);
        exit(EX_OSERR);
    }
        
    /* Initialize data for the MPI functions */
    if ( MPI_Init(&argc, &argv) != MPI_SUCCESS )
    {
        fputs("MPI_Init failed.\n", stderr);
        exit(EX_UNAVAILABLE);
    }
    
    /* Find out how many processes are in this MPI job */
    if ( MPI_Comm_size(MPI_COMM_WORLD, &total_processes) != MPI_SUCCESS )
    {
        fputs("MPI_Comm_size failed.\n", stderr);
        exit(EX_UNAVAILABLE);
    }
    
    /*
 *  Each process within the job has a unique integer "rank".
     *  This is how each process determines its role within the job.
     */
    if ( MPI_Comm_rank(MPI_COMM_WORLD, &my_rank) != MPI_SUCCESS )
    {
        fputs("MPI_Comm_rank failed.\n", stderr);
        exit(EX_UNAVAILABLE);
    }

    /*
     *  For this job, the process with rank 0 will assume the role of
     *  the "root" process, which will run different code than the
     *  other processes.
     */
    if (my_rank == ROOT_RANK)
    {
        printf("We have %d processors\n", total_processes);
        
        /* Send a message to all non-root processes */
        for (rank = 1; rank < total_processes; ++rank)
        {
            snprintf(message, MESSAGE_MAX_LEN, "Process %d, where are you? ", rank);
            if ( MPI_Send(message, MESSAGE_MAX_LEN, MPI_CHAR, rank, TAG,
                        MPI_COMM_WORLD) != MPI_SUCCESS )
            {
                fputs("MPI_Comm_rank failed.\n", stderr);
                exit(EX_UNAVAILABLE);
            }
        }
        
        /* Read the response from all non-root processes */
        for (rank = 1; rank < total_processes; ++rank)
        {
            if ( MPI_Recv(message, MESSAGE_MAX_LEN, MPI_CHAR, rank, TAG,
                        MPI_COMM_WORLD, &status) != MPI_SUCCESS )
            {
                fputs("MPI_Comm_rank failed.\n", stderr);
                exit(EX_UNAVAILABLE);
            }
            printf("%s\n", message);
        }
    }
    else
    {
        /* Wait for message from root process */
        if ( MPI_Recv(message, MESSAGE_MAX_LEN, MPI_CHAR, ROOT_RANK,
                TAG, MPI_COMM_WORLD, &status) != MPI_SUCCESS )
        {
            fputs("MPI_Comm_rank failed.\n", stderr);
            exit(EX_UNAVAILABLE);
        }
        printf("Process %d received message: %s\n", my_rank, message);
        
        /* Send response */
        snprintf(message, MESSAGE_MAX_LEN, "Process %d is on %s", my_rank, hostname);
        if ( MPI_Send(message, MESSAGE_MAX_LEN, MPI_CHAR, ROOT_RANK, TAG, MPI_COMM_WORLD) != MPI_SUCCESS )
        {
            fputs("MPI_Comm_rank failed.\n", stderr);
            exit(EX_UNAVAILABLE);
        }
    }

    /*
     *  All MPI processes must call MPI_Finalize() to cleanly shut down
     *  the MPI environment before they exit.
     */
    
    if ( MPI_Finalize() != MPI_SUCCESS )
    {
        fputs("MPI_Finalize failed.\n", stderr);
        exit(EX_UNAVAILABLE);
    }
    return EX_OK;
}
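
To try the C version by hand, where interactive MPI runs are permitted, compile it with the mpicc wrapper and launch it with mpirun. This is a minimal sketch: the file name mpi-hello.c and the process count of 4 are placeholders, and the exact commands may vary with the MPI implementation installed on your cluster.

mpicc -o mpi-hello mpi-hello.c
mpirun -np 4 ./mpi-hello

The root process prints the process count and then one response per worker. Lines printed by the worker processes themselves may appear interleaved in any order, since each process writes to the terminal independently.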

Fortran version:

!-----------------------------------------------------------------------
!   Program description:
! 
!   MPI Example
! 
!   A typical MPI job will start N processes, all running the same
!   program.  This is known as the Single-Program Multiple-Data (SPMD)
!   model.
! 
!   This program demonstrates how the various processes
!   distinguish themselves from the rest using MPI library functions.
!-----------------------------------------------------------------------

!-----------------------------------------------------------------------
!   Modification history:
!   Date        Name        Modification
!   2011-08-24  Jason Bacon Begin
!-----------------------------------------------------------------------

module constants
    ! Global Constants
    double precision, parameter :: &
        PI = 3.1415926535897932d0, &
        E = 2.7182818284590452d0, &
        TOLERANCE = 0.00000000001d0, &  ! For numerical methods
        AVOGADRO = 6.0221415d23         ! Not known to more digits than this
    integer, parameter :: &
        MESSAGE_MAX_LEN = 128, &
        HOSTNAME_MAX_LEN = 128, &
        TAG = 0, &
        ROOT_RANK = 0
end module constants

! Main program body
program mpi_hello
    use constants           ! Constants defined above
    use ISO_FORTRAN_ENV     ! INPUT_UNIT, OUTPUT_UNIT, ERROR_UNIT, etc.
    
    ! Disable implicit declarations (i-n rule)
    implicit none
    
    include 'mpif.h'        ! MPI constants
    
    ! Variable definitions
    character(MESSAGE_MAX_LEN) :: message       ! MPI message buffer
    character(HOSTNAME_MAX_LEN) :: hostname     ! Name of node
    integer :: total_processes, my_rank, rank, count, ierr, &
               message_length = MESSAGE_MAX_LEN
    integer :: status(MPI_STATUS_SIZE)
    
    ! Get name of node running this process
    call hostnm(hostname)
    
    ! Initialize data for the MPI functions
    call mpi_init(ierr)
    if ( ierr /= MPI_SUCCESS ) stop 'mpi_init failed.'
    
    ! Find out how many processes are in this MPI job
    call mpi_comm_size(MPI_COMM_WORLD, total_processes, ierr)
    if ( ierr /= MPI_SUCCESS ) stop 'mpi_comm_size failed.'

    ! Each process within the job has a unique integer "rank".
    ! This is how each process determines its role within the job.
    call mpi_comm_rank(MPI_COMM_WORLD, my_rank, ierr)
    if ( ierr /= MPI_SUCCESS ) stop 'mpi_comm_rank failed.'
    
    ! For this job, the process with rank ROOT_RANK will assume the role of
    ! the "root" process, which will run different code than the
    ! other processes.
    if ( my_rank == ROOT_RANK ) then
        ! Only root process runs this clause
        
        ! Do this in root so it only prints once
        print '(a,i0,a)', 'We have ', total_processes, ' processes.'
        
        ! Debug code
        ! print '(a,a,a)', 'Root processing running on ', trim(hostname), '.'
        
        do rank = 1, total_processes-1
            write (message, '(a,i0,a)') 'Process ', rank, ', where are you?'
            
            ! Debug code
            ! print '(a,a,a,i0,a)', 'Sending ', trim(message), ' to process ', &
            !    rank, '.'
            
            ! Sending the full padded string is wasteful, but receiving a
            ! message of unknown length with mpi_recv() is more complicated.
            ! A smarter program would save network bandwidth and time by
            ! using len_trim(message) instead of MESSAGE_MAX_LEN.
            call mpi_send(message, MESSAGE_MAX_LEN, MPI_CHARACTER, rank, &
                TAG, MPI_COMM_WORLD, ierr)
            if ( ierr /= MPI_SUCCESS ) stop 'mpi_send failed.'
        enddo
        
        do count = 1, total_processes-1
            ! Accept messages from the worker processes in any rank order
            ! by using MPI_ANY_SOURCE for the rank in the recv call
            call mpi_recv(message, MESSAGE_MAX_LEN, MPI_CHARACTER, &
                MPI_ANY_SOURCE, MPI_ANY_TAG, &
                MPI_COMM_WORLD, status, ierr)
            if ( ierr /= MPI_SUCCESS ) stop 'mpi_recv failed.'
            print *, 'Root received response: ', trim(message)
        enddo
    else
        ! All worker processes run this section
        
        ! Debug code
        ! print '(a,i0,a,a, a)', 'Process ', my_rank, ' running on ', &
        !    trim(hostname), '.'
        
        ! Wait for message from root
        call mpi_recv(message, MESSAGE_MAX_LEN, MPI_CHARACTER, ROOT_RANK, &
            TAG, MPI_COMM_WORLD, status, ierr)
        if ( ierr /= MPI_SUCCESS ) stop 'mpi_recv failed.'
        print '(a,i0,a,a)', 'Process ', my_rank, ' received: ', trim(message)
        
        ! Respond to message from root
        write (message, '(a,i0,a,a,a)') 'Process ', my_rank,' is on ', &
            trim(hostname), '.'
        call mpi_send(message, len(message), MPI_CHARACTER, ROOT_RANK, &
            TAG, MPI_COMM_WORLD, ierr)
        if ( ierr /= MPI_SUCCESS ) stop 'mpi_send failed.'
    endif

    ! All MPI processes must call mpi_finalize() to cleanly shut down
    ! the MPI environment before they exit.
    call mpi_finalize(ierr)
    if ( ierr /= MPI_SUCCESS ) stop 'mpi_finalize failed.'
end program
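
The Fortran version builds the same way, using the Fortran compiler wrapper, which is commonly named mpif90 or mpifort depending on the MPI installation; the source file name below is a placeholder. Note that hostnm() is a compiler extension (gfortran provides it, for example), not part of standard Fortran.

mpif90 -o mpi-hello mpi-hello.f90
mpirun -np 4 ./mpi-hello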

In a scheduled environment, MPI jobs are submitted much like batch serial jobs. The scheduler is informed of the resource requirements (cores, memory), but it does not dispatch the MPI processes itself. Instead, it dispatches a single mpirun command, and mpirun then launches the processes on the list of nodes provided by the scheduler.

SLURM submit script:

#!/bin/sh -e

# Number of cores
#SBATCH --ntasks=8

mpirun ./mpi-hello

Programs should be compiled in the same environment in which they will run, i.e. on a compute node under the scheduler. This ensures that they find the same tools and libraries at run time as they did at compile time. The best way to achieve this is to compile using a submit script:

SLURM build script:

#!/bin/sh -e

# Number of cores
#SBATCH --ntasks=1

mpicc -o mpi-hello mpi-hello.c
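
Assuming the build and run scripts above are saved as mpi-hello-build.sbatch and mpi-hello-run.sbatch (the names are arbitrary), a typical SLURM workflow might look like the following sketch. By default, each job's output is written to slurm-<jobid>.out in the submission directory.

sbatch mpi-hello-build.sbatch   # compile on a compute node
squeue -u $USER                 # wait until the build job finishes
sbatch mpi-hello-run.sbatch     # run the MPI job on 8 cores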

PBS submit script:

#!/bin/sh

# Job name
#PBS -N MPI-Hello

# Number of cores
#PBS -l procs=8

##########################################################################
# Shell commands
##########################################################################

# Torque starts jobs in the home directory, so we must manually
# cd to the working directory where the mpi-hello binary is located.
cd $PBS_O_WORKDIR
mpirun ./mpi-hello
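
Under PBS/Torque, the script is submitted with qsub and monitored with qstat; the script file name here is a placeholder.

qsub mpi-hello.pbs
qstat -u $USER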

LSF submit script:

#!/usr/bin/env bash

# Job name
#BSUB -J MPI-Hello

# Number of cores
#BSUB -n 8

##########################################################################
# Shell commands
##########################################################################

# LSF requires the use of a wrapper script rather than invoking mpirun directly
openmpi_wrapper ./mpi-hello
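
On an LSF cluster, the script is submitted by redirecting it to bsub, so that the #BSUB directives are read, and monitored with bjobs. Again, the script file name is a placeholder.

bsub < mpi-hello.lsf
bjobs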

Self-test
  1. Write an MPI version of calcpi.c from Chapter 34, Programming for HTC. For simplicity, use a constant for the number of random points generated and use the process rank for the srandom() seed.
  2. Is MPI the best solution for estimating PI? Why or why not?
  3. Write an MPI matrix addition routine. The root process should distribute a row of each source matrix to each of the worker processes until all the rows are added.
  4. Is this a good fit for MPI? Why or why not?