OpenMP, short for Open Multiprocessing, is a set of parallel programming extensions embedded in many Fortran compilers since 1997, and C/C++ compilers since 1998. OpenMP allows the programmer to utilize multiple cores within a single computer, and in some cases, using extensions, multiple computers.
OpenMP allows for finer-grained parallelism than embarrassingly parallel computing or distributed models such as MPI.
The latter two paradigms utilize multiple processes, potentially running on different computers, which require a great deal of overhead to create. Hence, individual processes need to be long-running so that the overhead of starting them doesn't use a significant portion of the total run time.
OpenMP, on the other hand, spawns lightweight threads within a single process, which requires very little overhead. Hence, OpenMP can effectively parallelize very short-running sections of code such as individual loops within a program.
OpenMP can benefit code that might only take a fraction of a second to run in serial, whereas HTC and MPI are only good for parallelizing processes that take far longer (usually minutes or hours).
The main limitation of OpenMP is that it only scales up to the number of cores available on a single computer (or other tightly-coupled architecture). If you want to achieve greater speedup, you'll need to look for ways to perform coarser scale parallelism with HTC or MPI.
OpenMP is implemented as a set of preprocessor directives, header files, and library functions.
To use OpenMP, a C or C++ program must include the header omp.h (#include <omp.h>) and use #pragma omp directives to indicate which code segments are to be parallelized and how. Fortran programs must contain the line use omp_lib and use specially formatted comments beginning with !$omp.
To compile an OpenMP program with gcc, clang, or gfortran, we must use the -fopenmp flag. Note that on FreeBSD, cc is clang, while on Linux, cc is gcc. Hence, cc and c++ should work with the default compiler on both systems.
mypc: gcc -fopenmp myprog.c
mypc: g++ -fopenmp myprog.cc
mypc: clang -fopenmp myprog.c
mypc: clang++ -fopenmp myprog.cc
mypc: cc -fopenmp myprog.c
mypc: c++ -fopenmp myprog.cc
mypc: gfortran -fopenmp myprog.f90
At the time of this writing, clang does not include a Fortran compiler. Clang is intended to be binary compatible with GCC, however, so we can use gfortran alongside clang and clang++ as long as the compiler versions are compatible.
After an OpenMP program has split into multiple threads, each thread can identify itself by calling omp_get_thread_num(), which returns a value between 0 and N-1 for a program running N threads.
Multithreaded programs can be task parallel (running different code for each thread) or data parallel (running the same code on different data for each thread).
Using a basic omp parallel pragma, we can execute the same code on multiple cores (more or less) simultaneously.
Simple parallel program:
/***************************************************************************
 *  Description:
 *      OpenMP parallel common code example
 *
 *  Arguments:
 *      None
 *
 *  Returns:
 *      Standard exit codes (see sysexits.h)
 *
 *  History:
 *      Date        Name        Modification
 *      2012-06-27  Jason Bacon Begin
 ***************************************************************************/

#include <stdio.h>
#include <sysexits.h>
#include <omp.h>

int     main(int argc, char *argv[])

{
    /* Execute the same code simultaneously on multiple cores */
#pragma omp parallel
    printf("Hello from thread %d!\n", omp_get_thread_num());

    return EX_OK;
}
!-----------------------------------------------------------------------
!   Program description:
!       OpenMP parallel common code example
!-----------------------------------------------------------------------

!-----------------------------------------------------------------------
!   Modification history:
!   Date        Name        Modification
!   2012-06-29  Jason Bacon Created
!-----------------------------------------------------------------------

! Main program body
program openmp_hello
    use omp_lib

    ! Disable implicit declarations (i-n rule)
    implicit none

    ! Note: No space allowed between ! and $
    !$omp parallel
    print *, 'Hello from thread ', omp_get_thread_num()
    !$omp end parallel
end program
We can control the number of threads in an OpenMP block using the num_threads() clause or by setting OMP_NUM_THREADS in the environment.
/***************************************************************************
 *  Description:
 *      OpenMP parallel common code example
 *
 *  Arguments:
 *      None
 *
 *  Returns:
 *      Standard exit codes (see sysexits.h)
 *
 *  History:
 *      Date        Name        Modification
 *      2012-06-27  Jason Bacon Begin
 ***************************************************************************/

#include <stdio.h>
#include <sysexits.h>
#include <omp.h>

int     main(int argc, char *argv[])

{
    /* Execute the same code simultaneously on multiple cores */
#pragma omp parallel num_threads(2)
    printf("Hello from thread %d!\n", omp_get_thread_num());

    return EX_OK;
}
Otherwise, OpenMP will match the number of threads to the number of cores present on the system.
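For example, assuming a Bourne-family shell and a compiled executable named myprog (a placeholder name), we can choose the thread count at launch time without recompiling:

mypc: OMP_NUM_THREADS=4 ./myprog

Under csh or tcsh, the equivalent is setenv OMP_NUM_THREADS 4, followed by ./myprog.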
One of the most commonly used features of OpenMP is loop parallelization.
If a loop can be structured in such a way that each iteration is independent of previous iterations, then in theory, all of the iterations can be executed at the same time.
Of course, the number of iterations which can execute at once is limited by the number of cores available. For example, if a parallelizable loop iterates 100 times, but the computer running it only has four cores, then only four iterations will run at once. Still, this will speed up the program by nearly a factor of four. A small amount of overhead is incurred when splitting the execution path into multiple threads and again when merging them back together after the parallel segment.
Example of a parallel loop with OpenMP:
/***************************************************************************
 *  Description:
 *      OpenMP parallel loop example
 *
 *  Arguments:
 *      None
 *
 *  Returns:
 *      Standard exit codes (see sysexits.h)
 *
 *  History:
 *      Date        Name        Modification
 *      2011-10-06  Jason Bacon Begin
 ***************************************************************************/

#include <stdio.h>
#include <omp.h>
#include <sysexits.h>

int     main(int argc, char *argv[])

{
    int     c;

#pragma omp parallel for
    for (c = 0; c < 8; ++c)
    {
        printf("Hello from thread %d, nthreads %d, c = %d\n",
            omp_get_thread_num(), omp_get_num_threads(), c);
    }
    return EX_OK;
}
!-----------------------------------------------------------------------
!   Program description:
!       OpenMP parallel loop example
!-----------------------------------------------------------------------

!-----------------------------------------------------------------------
!   Modification history:
!   Date        Name        Modification
!   2012-06-29  Jason Bacon Created
!-----------------------------------------------------------------------

! Main program body
program openmp_hello
    use omp_lib

    ! Disable implicit declarations (i-n rule)
    implicit none

    integer :: c

    ! Note: No space allowed between ! and $
    !$omp parallel do
    do c = 1, 8
        print *, 'Hello from thread ', omp_get_thread_num(), &
            ' nthreads ', omp_get_num_threads(), ' c = ', c
    enddo
end program
When several iterations of a loop are run in parallel, we cannot predict the order in which they will complete. While they are in theory executed at the "same time", there really is no such thing as exactly the same time. Even if each iteration runs the exact same number of instructions, they may finish at slightly different times, and the order is unpredictable.
Hence, when using OpenMP, you must take special care to ensure that it does not matter to your program which iterations complete first.
Performing output to a common stream such as the standard output in a parallelized section of code is generally a bad idea, except for debugging purposes.
If anything within a loop depends on results computed during previous iterations, then the loop simply cannot be parallelized as-is. It may be possible to redesign the loop so that each iteration is independent of the rest, but there are some cases where computations must be done serially.
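As a hypothetical illustration (not from the examples above), a running-sum loop has a loop-carried dependency: each iteration reads a result written by the previous one, so a simple parallel for directive cannot be applied to it.

/***************************************************************************
 *  Description:
 *      Sketch of a loop that cannot be parallelized as-is: each
 *      iteration depends on the result of the previous iteration
 ***************************************************************************/

#include <stdio.h>
#include <sysexits.h>

#define N   10

int     main(int argc, char *argv[])

{
    int     c, values[N], running_sum[N];

    for (c = 0; c < N; ++c)
        values[c] = c + 1;

    /* Loop-carried dependency: iteration c needs the result of c-1 */
    running_sum[0] = values[0];
    for (c = 1; c < N; ++c)
        running_sum[c] = running_sum[c-1] + values[c];

    printf("Running sum of 1 through %d is %d\n", N, running_sum[N-1]);
    return EX_OK;
}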
When an OpenMP process splits into multiple threads, we may want some variables to exist independently in each thread and others to be shared by all threads.
For example, if a parallelized loop computes the sum of a list of numbers, the sum variable must be shared.
On the other hand, any variable that must contain a different value for each iteration (such as the OMP thread number) should be private.
When multiple threads modify a shared variable, there is a danger that the modifications could overlap, resulting in incorrect results.
For example, when assigning a new value to sum in the example below, a typical CPU must actually perform multiple steps. On many CPUs, the sequence of machine operations would be:

1. Load the current value of sum from memory into a CPU register.
2. Add c to the register.
3. Store the register contents back to sum's location in memory.

This is commonly called a read-modify-write sequence.
/***************************************************************************
 *  Description:
 *      OpenMP private and shared variables example
 *      Run time is around 4 seconds for MAX = 200000000 on an i5
 ***************************************************************************/

#include <stdio.h>
#include <sysexits.h>

#define MAX 200000000

int     main(int argc, char *argv[])

{
    unsigned long   sum;

    sum = 0;

    /* Compute the sum of the integers 1 through MAX */
#pragma omp parallel for shared(sum)
    for (unsigned long c = 1; c <= MAX; ++c)
    {
#pragma omp atomic
        sum += c;
    }
    printf("%lu\n", sum);
    return EX_OK;
}
!-----------------------------------------------------------------------
!   Program description:
!       OpenMP private and shared variables example
!-----------------------------------------------------------------------

! Main program body
program openmp_hello
    use omp_lib

    ! Disable implicit declarations (i-n rule)
    implicit none

    integer*8 :: c, sum, max

    sum = 0
    max = 200000000

    ! Note: No space allowed between ! and $
    ! The parallel do directive must immediately precede the do loop
    !$omp parallel do shared(sum)
    do c = 1, max
        !$omp atomic
        sum = sum + c
    enddo

    print *, sum
end program
Suppose there are two threads, numbered 0 and 1.
Also suppose that thread 0 begins the read-modify-write sequence when sum = 1 and c = 2. When thread 0 finishes the sequence, sum should contain 1 + 2 = 3. We will assume that this actually happens as expected.
Now suppose thread 1 has a value c = 3 and begins the read-modify-write sequence immediately after thread 0 finishes writing 3 to sum. Thread 1 will then add sum=3 + c=3 and store the correct value of 6 in sum.
However, if thread 1 begins the sequence before thread 0 stores the 3 into sum, the following will occur:

1. Thread 0 loads sum = 1 into a register and computes 1 + 2 = 3.
2. Thread 1 also loads the stale value sum = 1 and computes 1 + 3 = 4.
3. Threads 0 and 1 each store their results, in either order, leaving sum = 3 or sum = 4.

Either way, one thread's contribution to the sum is lost.
We cannot predict exactly when each thread will perform each of its operations. This is determined by the operating system based on many variables beyond our control that affect CPU scheduling. After both threads 0 and 1 have completed, sum could contain 3, 4, or 6. This is called a race condition, because the result depends on which thread finishes each operation first.
Unfortunately, compilers cannot reliably detect race conditions, so the programmer must take responsibility for pointing them out by marking statements or blocks as critical or atomic.
A statement or block marked critical can only be entered (begun) by one thread at a time. Once a thread has begun executing a critical section, no other thread can begin executing it until the current thread has completed it.
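As a minimal sketch, assuming the same sum variable and MAX constant as in the shared-sum example above, the loop could protect the update with a critical section instead of atomic:

/***************************************************************************
 *  Description:
 *      Sketch: protecting a shared sum with a critical section
 ***************************************************************************/

#include <stdio.h>
#include <sysexits.h>

#define MAX 200000000

int     main(int argc, char *argv[])

{
    unsigned long   sum = 0;

    /* Only one thread at a time may execute the statement below */
#pragma omp parallel for shared(sum)
    for (unsigned long c = 1; c <= MAX; ++c)
    {
#pragma omp critical
        sum += c;
    }
    printf("%lu\n", sum);
    return EX_OK;
}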
In an atomic section, only memory update portions of the code are considered critical by the compiler. This portion would include the read-modify-write sequence for sum in the example above. Reading the value of c, on the other hand, would not be critical, because c is not shared by multiple threads. Each thread has a private copy of c with a different value than other threads. Using atomic is usually sufficient, and sometimes faster, since it may allow the compiler to parallelize parts of the machine code that cannot be separated from the critical parts at the source code level.
The program above has an atomic section that is executed 200,000,000 times. The inability to execute these sequences in parallel can have a major impact on performance. We can get around the atomic section bottleneck by computing a private partial sum for each thread. Updating this partial sum need not be done atomically, since it is not shared by multiple threads. This replaces 200,000,000 atomic operations with operations that can be done in parallel. We must then atomically add the private sums to compute the overall sum, but the number of atomic additions here is only the number of threads, which is very small (e.g. probably 4 on a 4-core computer), so the performance hit is negligible. The end result in this example is a program that runs more than twice as fast.
/***************************************************************************
 *  Description:
 *      OpenMP private and shared variables example
 *      Run time is 1.9 seconds for MAX = 200000000 on an i5
 *      This is about twice as fast as the same program using a shared
 *      sum in the main loop
 ***************************************************************************/

#include <stdio.h>
#include <sysexits.h>
#include <omp.h>

#define MAX_CORES   256     // More than any current computer
#define MAX         200000000

int     main(int argc, char *argv[])

{
    unsigned long   sum, private_sums[MAX_CORES];

    for (int c = 0; c < MAX_CORES; ++c)
        private_sums[c] = 0;

    /*
     *  Here we use an array of sums, one for each thread, so that
     *  adding to the sum need not be atomic.  Hence, there is no
     *  bottleneck and we achieve much better speedup than when using
     *  a shared sum variable, which must be updated atomically.
     */
#pragma omp parallel for
    for (unsigned long c = 1; c <= MAX; ++c)
    {
        private_sums[omp_get_thread_num()] += c;
    }

    /*
     *  Now we atomically add the private sums to find the total sum.
     *  The number of atomic operations here is only the number of
     *  threads, typically much smaller than MAX, so the bottleneck
     *  has minimal impact on run time.
     */
    sum = 0;
#pragma omp parallel for shared(sum)
    for (int c = 0; c < omp_get_num_threads(); ++c)
    {
#pragma omp atomic
        sum += private_sums[c];
    }

    printf("sum = %lu\n", sum);
    return EX_OK;
}
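OpenMP can also manage the private partial sums for us. The standard reduction clause gives each thread a private copy of sum, initialized to zero, and adds the private copies together when the loop ends, much like the hand-coded private_sums array above. A minimal sketch:

/***************************************************************************
 *  Description:
 *      Sketch: the same sum computed with a reduction clause, which
 *      automates the private partial sums shown above
 ***************************************************************************/

#include <stdio.h>
#include <sysexits.h>

#define MAX 200000000

int     main(int argc, char *argv[])

{
    unsigned long   sum = 0;

    /*
     *  reduction(+:sum) gives each thread a private sum initialized
     *  to 0 and adds the private copies together at the end of the loop
     */
#pragma omp parallel for reduction(+:sum)
    for (unsigned long c = 1; c <= MAX; ++c)
        sum += c;

    printf("sum = %lu\n", sum);
    return EX_OK;
}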