OpenMP, short for Open Multiprocessing, is a set of parallel programming extensions embedded in many Fortran compilers since 1997, and C/C++ compilers since 1998. OpenMP allows the programmer to utilize multiple cores within a single computer, and in some cases, using extensions, multiple computers.
OpenMP allows for finer-grained parallelism than embarrassingly parallel computing or distributed models such as MPI.
The latter two paradigms utilize multiple processes, potentially running on different computers, which require a great deal of overhead to create. Hence, individual processes need to be long-running so that the overhead of starting them doesn't use a significant portion of the total run time.
OpenMP, on the other hand, spawns lightweight threads within a single process, which requires very little overhead. Hence, OpenMP can effectively parallelize very short-running sections of code such as individual loops within a program.
OpenMP can benefit code that might only take a fraction of a second to run in serial, whereas HTC and MPI are only good for parallelizing processes that take far longer (usually minutes or hours).
The main limitation of OpenMP is that it only scales up to the number of cores available on a single computer (or other tightly-coupled architecture). If you want to achieve greater speedup, you'll need to look for ways to perform coarser scale parallelism with HTC or MPI.
OpenMP is implemented as a set of preprocessor directives, header files, and library functions.
To use OpenMP, a C or C++ program must include the header omp.h (#include <omp.h>) and use #pragma omp directives to indicate which code segments are to be parallelized and how. Fortran programs must contain the line use omp_lib and use specially formatted comments beginning with !$omp.
To compile an OpenMP program with gcc, clang, or gfortran, we must use the -fopenmp flag. Note that on FreeBSD, cc is clang, while on Linux, cc is gcc. Hence, cc and c++ should work with the default compiler on both systems.
mypc: gcc -fopenmp myprog.c
mypc: g++ -fopenmp myprog.cc
mypc: clang -fopenmp myprog.c
mypc: clang++ -fopenmp myprog.cc
mypc: cc -fopenmp myprog.c
mypc: c++ -fopenmp myprog.cc
mypc: gfortran -fopenmp myprog.f90
At the time of this writing, clang does not include a Fortran compiler. Clang is intended to be binary compatible with GCC, however, so we can use gfortran alongside clang and clang++ as long as the compiler versions are compatible.
After an OpenMP program has split into multiple threads, each thread can identify itself by calling omp_get_thread_num(), which returns a value between 0 and N-1 for a program running N threads.
Multithreaded programs can be task parallel (running different code for each thread) or data parallel (running the same code on different data for each thread).
Using a basic omp parallel pragma, we can execute the same code on multiple cores (more or less) simultaneously.
Simple parallel program:
/***************************************************************************
 *  Description:
 *      OpenMP parallel common code example
 *
 *  Arguments:
 *      None
 *
 *  Returns:
 *      Standard exit codes (see sysexits.h)
 *
 *  History:
 *      Date        Name        Modification
 *      2012-06-27  Jason Bacon Begin
 ***************************************************************************/

#include <stdio.h>
#include <sysexits.h>
#include <omp.h>

int     main(int argc, char *argv[])

{
    /* Execute the same code simultaneously on multiple cores */
#pragma omp parallel
    printf("Hello from thread %d!\n", omp_get_thread_num());

    return EX_OK;
}
!-----------------------------------------------------------------------
!   Program description:
!       OpenMP parallel common code example
!-----------------------------------------------------------------------

!-----------------------------------------------------------------------
!   Modification history:
!   Date        Name        Modification
!   2012-06-29  Jason Bacon Created
!-----------------------------------------------------------------------

! Main program body
program openmp_hello
    use omp_lib

    ! Disable implicit declarations (i-n rule)
    implicit none

    ! Note: No space allowed between ! and $
    !$omp parallel
    print *, 'Hello from thread ', omp_get_thread_num()
    !$omp end parallel
end program
We can control the number of threads in an OpenMP block using the num_threads() clause or by setting OMP_NUM_THREADS in the environment.
/***************************************************************************
 *  Description:
 *      OpenMP parallel common code example
 *
 *  Arguments:
 *      None
 *
 *  Returns:
 *      Standard exit codes (see sysexits.h)
 *
 *  History:
 *      Date        Name        Modification
 *      2012-06-27  Jason Bacon Begin
 ***************************************************************************/

#include <stdio.h>
#include <sysexits.h>
#include <omp.h>

int     main(int argc, char *argv[])

{
    /* Execute the same code simultaneously on multiple cores */
#pragma omp parallel num_threads(2)
    printf("Hello from thread %d!\n", omp_get_thread_num());

    return EX_OK;
}
Otherwise, OpenMP will match the number of threads to the number of cores present on the system.
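For example, assuming a Bourne-family shell and a compiled executable named myprog (a placeholder name), we can choose the thread count at launch time without recompiling:

mypc: OMP_NUM_THREADS=4 ./myprog

Under csh or tcsh, the equivalent is setenv OMP_NUM_THREADS 4, followed by ./myprog.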
One of the most commonly used features of OpenMP is loop parallelization.
If a loop can be structured in such a way that each iteration is independent of previous iterations, then in theory, all of the iterations can be executed at the same time.
Of course, the number of iterations which can execute at once is limited by the number of cores available. For example, if a parallelizable loop iterates 100 times, but the computer running it only has four cores, then only four iterations will run at once. Still, this will speed up the program by nearly a factor of four. A small amount of overhead is incurred when splitting the execution path into multiple threads and again when merging them back together after the parallel segment.
Example of a parallel loop with OpenMP:
/***************************************************************************
 *  Description:
 *      OpenMP parallel loop example
 *
 *  Arguments:
 *      None
 *
 *  Returns:
 *      Standard exit codes (see sysexits.h)
 *
 *  History:
 *      Date        Name        Modification
 *      2011-10-06  Jason Bacon Begin
 ***************************************************************************/

#include <stdio.h>
#include <omp.h>
#include <sysexits.h>

int     main(int argc, char *argv[])

{
    int     c;

#pragma omp parallel for
    for (c = 0; c < 8; ++c)
    {
        printf("Hello from thread %d, nthreads %d, c = %d\n",
            omp_get_thread_num(), omp_get_num_threads(), c);
    }
    return EX_OK;
}
!-----------------------------------------------------------------------
!   Program description:
!       OpenMP parallel loop example
!-----------------------------------------------------------------------

!-----------------------------------------------------------------------
!   Modification history:
!   Date        Name        Modification
!   2012-06-29  Jason Bacon Created
!-----------------------------------------------------------------------

! Main program body
program openmp_hello
    use omp_lib

    ! Disable implicit declarations (i-n rule)
    implicit none

    integer :: c

    ! Note: No space allowed between ! and $
    !$omp parallel do
    do c = 1, 8
        print *, 'Hello from thread ', omp_get_thread_num(), &
            ' nthreads ', omp_get_num_threads(), ' c = ', c
    enddo
end program
When several iterations of a loop are run in parallel, we cannot predict the order in which they will complete. While they are in theory executed at the "same time", there really is no such thing as exactly the same time. Even if each iteration runs the exact same number of instructions, they may finish at slightly different times, and the order is unpredictable.
Hence, when using OpenMP, you must take special care to ensure that it does not matter to your program which iterations complete first.
Performing output to a common stream such as the standard output in a parallelized section of code is generally a bad idea, except for debugging purposes.
If anything within a loop depends on results computed during previous iterations, then the loop simply cannot be parallelized as-is. It may be possible to redesign the loop so that each iteration is independent of the rest, but there are some cases where computations must be done serially.
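As a hypothetical illustration (not from the examples above), a running-sum loop has a loop-carried dependency: each iteration reads a result written by the previous one, so a simple parallel for directive cannot be applied to it.

/***************************************************************************
 *  Description:
 *      Sketch of a loop that cannot be parallelized as-is: each
 *      iteration depends on the result of the previous iteration
 ***************************************************************************/

#include <stdio.h>
#include <sysexits.h>

#define N   10

int     main(int argc, char *argv[])

{
    int     c, values[N], running_sum[N];

    for (c = 0; c < N; ++c)
        values[c] = c + 1;

    /* Loop-carried dependency: iteration c needs the result of c-1 */
    running_sum[0] = values[0];
    for (c = 1; c < N; ++c)
        running_sum[c] = running_sum[c-1] + values[c];

    printf("Running sum of 1 through %d is %d\n", N, running_sum[N-1]);
    return EX_OK;
}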
When an OpenMP process splits into multiple threads, we may want some variables to exist independently in each thread and others to be shared by all threads.
For example, if a parallelized loop computes the sum of a list of numbers, the sum variable must be shared.
On the other hand, any variable that must contain a different value for each iteration (such as the OMP thread number) should be private.
When multiple threads modify a shared variable, there is a danger that the modifications could overlap, resulting in incorrect results.
For example, when assigning a new value to sum in the example below, a typical CPU must actually perform multiple steps. On many CPUs, the sequence of machine operations would be:

1. Load the current value of sum from memory into a CPU register.
2. Add c to the register.
3. Store the register contents back to sum's location in memory.

This is commonly called a read-modify-write sequence.
/***************************************************************************
 *  Description:
 *      OpenMP private and shared variables example
 *      Run time is around 4 seconds for MAX = 200000000 on an i5
 ***************************************************************************/

#include <stdio.h>
#include <sysexits.h>

#define MAX 200000000

int     main(int argc, char *argv[])

{
    unsigned long   sum;

    sum = 0;

    /* Compute the sum of the integers 1 through MAX */
#pragma omp parallel for shared(sum)
    for (unsigned long c = 1; c <= MAX; ++c)
    {
#pragma omp atomic
        sum += c;
    }
    printf("%lu\n", sum);
    return EX_OK;
}
!-----------------------------------------------------------------------
!   Program description:
!       OpenMP private and shared variables example
!-----------------------------------------------------------------------

! Main program body
program openmp_hello
    use omp_lib

    ! Disable implicit declarations (i-n rule)
    implicit none

    integer*8 :: c, sum, max

    sum = 0
    max = 200000000

    ! Note: No space allowed between ! and $
    ! The parallel do directive must immediately precede the do loop
    !$omp parallel do shared(sum)
    do c = 1, max
        !$omp atomic
        sum = sum + c
    enddo

    print *, sum
end program
Suppose there are two threads, numbered 0 and 1.
Also suppose that thread 0 begins the read-modify-write sequence when sum = 1 and c = 2. When thread 0 finishes the sequence, sum should contain 1 + 2 = 3. We will assume that this actually happens as expected.
Now suppose thread 1 has a value c = 3 and begins the read-modify-write sequence immediately after thread 0 finishes writing 3 to sum. Thread 1 will then add sum=3 + c=3 and store the correct value of 6 in sum.
However, if thread 1 begins the sequence before thread 0 stores the 3 into sum, the following will occur:

1. Thread 0 loads sum = 1 into a register and computes 1 + 2 = 3.
2. Thread 1 also loads the stale value sum = 1 and computes 1 + 3 = 4.
3. Threads 0 and 1 each store their results, in either order, leaving sum = 3 or sum = 4.

Either way, one thread's contribution to the sum is lost.
We cannot predict exactly when each thread will perform each of its operations. This is determined by the operating system based on many variables beyond our control that affect CPU scheduling. After both threads 0 and 1 have completed, sum could contain 3, 4, or 6. This is called a race condition, because the result depends on which thread finishes each operation first.
Unfortunately, compilers cannot reliably detect race conditions, so the programmer must take responsibility for pointing them out by marking statements or blocks as critical or atomic.
A statement or block marked critical can only be entered (begun) by one thread at a time. Once a thread has begun executing a critical section, no other thread can begin executing it until the current thread has completed it.
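As a minimal sketch, assuming the same sum variable and MAX constant as in the shared-sum example above, the loop could protect the update with a critical section instead of atomic:

/***************************************************************************
 *  Description:
 *      Sketch: protecting a shared sum with a critical section
 ***************************************************************************/

#include <stdio.h>
#include <sysexits.h>

#define MAX 200000000

int     main(int argc, char *argv[])

{
    unsigned long   sum = 0;

    /* Only one thread at a time may execute the statement below */
#pragma omp parallel for shared(sum)
    for (unsigned long c = 1; c <= MAX; ++c)
    {
#pragma omp critical
        sum += c;
    }
    printf("%lu\n", sum);
    return EX_OK;
}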
In an atomic section, only memory update portions of the code are considered critical by the compiler. This portion would include the read-modify-write sequence for sum in the example above. Reading the value of c, on the other hand, would not be critical, because c is not shared by multiple threads. Each thread has a private copy of c with a different value than other threads. Using atomic is usually sufficient, and sometimes faster, since it may allow the compiler to parallelize parts of the machine code that cannot be separated from the critical parts at the source code level.
The program above has an atomic section that is executed 200,000,000 times. The inability to execute these sequences in parallel can have a major impact on performance. We can get around the atomic section bottleneck by computing a private partial sum for each thread. Updating this partial sum need not be done atomically, since it is not shared by multiple threads. This replaces 200,000,000 atomic operations with operations that can be done in parallel. We must then atomically add the private sums to compute the overall sum, but the number of atomic additions here is only the number of threads, which is very small (e.g. probably 4 on a 4-core computer), so the performance hit is negligible. The end result in this example is a program that runs more than twice as fast.
/***************************************************************************
 *  Description:
 *      OpenMP private and shared variables example
 *      Run time is 1.9 seconds for MAX = 200000000 on an i5
 *      This is about twice as fast as the same program using a shared
 *      sum in the main loop
 ***************************************************************************/

#include <stdio.h>
#include <sysexits.h>
#include <omp.h>

#define MAX_CORES   256     // More than any current computer
#define MAX         200000000

int     main(int argc, char *argv[])

{
    unsigned long   sum, private_sums[MAX_CORES];

    for (int c = 0; c < MAX_CORES; ++c)
        private_sums[c] = 0;

    /*
     *  Here we use an array of sums, one for each thread, so that
     *  adding to the sum need not be atomic.  Hence, there is no
     *  bottleneck and we achieve much better speedup than when using
     *  a shared sum variable, which must be updated atomically.
     */
#pragma omp parallel for
    for (unsigned long c = 1; c <= MAX; ++c)
    {
        private_sums[omp_get_thread_num()] += c;
    }

    /*
     *  Now we atomically add the private sums to find the total sum.
     *  The number of atomic operations here is only the number of
     *  threads, typically much smaller than MAX, so the bottleneck
     *  has minimal impact on run time.
     */
    sum = 0;
#pragma omp parallel for shared(sum)
    for (int c = 0; c < omp_get_num_threads(); ++c)
    {
#pragma omp atomic
        sum += private_sums[c];
    }

    printf("sum = %lu\n", sum);
    return EX_OK;
}
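OpenMP can also manage the private partial sums for us. The standard reduction clause gives each thread a private copy of sum, initialized to zero, and adds the private copies together when the loop ends, much like the hand-coded private_sums array above. A minimal sketch:

/***************************************************************************
 *  Description:
 *      Sketch: the same sum computed with a reduction clause, which
 *      automates the private partial sums shown above
 ***************************************************************************/

#include <stdio.h>
#include <sysexits.h>

#define MAX 200000000

int     main(int argc, char *argv[])

{
    unsigned long   sum = 0;

    /*
     *  reduction(+:sum) gives each thread a private sum initialized
     *  to 0 and adds the private copies together at the end of the loop
     */
#pragma omp parallel for reduction(+:sum)
    for (unsigned long c = 1; c <= MAX; ++c)
        sum += c;

    printf("sum = %lu\n", sum);
    return EX_OK;
}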