Many problems can be decomposed for parallel execution in more than one way.
If a problem can be decomposed into independent parts that run in an embarrassingly parallel fashion, that is usually the best route, since it is the easiest to program and the most scalable.
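As a rough illustration of this style of decomposition, consider the minimal C sketch below. The program and its work loop are hypothetical: each invocation is given a chunk index and the total number of chunks on the command line, processes only its own slice of the work, and never communicates with the other copies, so the copies can be launched as entirely separate jobs.

    /* Sketch of an embarrassingly parallel decomposition: each copy of the
       program handles one independent slice of the work.
       Hypothetical usage:  ./chunk 0 4   ./chunk 1 4   ...   ./chunk 3 4  */
    #include <stdio.h>
    #include <stdlib.h>

    #define TOTAL_ITEMS 1000000

    int main(int argc, char *argv[])
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s chunk_index num_chunks\n", argv[0]);
            return 1;
        }
        long chunk  = strtol(argv[1], NULL, 10);
        long chunks = strtol(argv[2], NULL, 10);

        /* Each copy works only on its own contiguous slice of the items. */
        long begin = chunk * TOTAL_ITEMS / chunks;
        long end   = (chunk + 1) * TOTAL_ITEMS / chunks;

        double partial = 0.0;
        for (long i = begin; i < end; i++)
            partial += (double)i * i;   /* stand-in for the real per-item work */

        printf("chunk %ld processed items %ld..%ld, partial result %f\n",
               chunk, begin, end - 1, partial);
        return 0;
    }

Because the copies are independent, they can be started by hand, by a shell loop, or by a batch scheduler, and adding more of them requires no changes to the code.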
If the processes within a parallel job must communicate during execution, there are several options, including MPI, shared-memory parallelism, and specialized hardware architectures (supercomputers) such as Single Instruction Multiple Data (SIMD) machines.
Which architecture and programming model will provide the best performance depends on the algorithms your software needs to use.
One advantage of MPI, however, is portability. An MPI program can run on virtually any architecture with multiple processors: the multiple cores of a single PC, the nodes of a cluster, and in some cases (if its communication needs are light) even a grid.
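As a small sketch of what that portability looks like, the following minimal MPI program simply has each process report its rank. The same source compiles and runs whether the processes are cores in one PC or nodes spread across a cluster.

    /* Minimal MPI sketch: each process reports its rank and the total
       number of processes in the job. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        printf("Process %d of %d reporting\n", rank, size);

        MPI_Finalize();
        return 0;
    }

Typically this would be compiled with a wrapper such as mpicc and launched with mpiexec or mpirun; the number of processes and the machines they run on are chosen at launch time, not in the code.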
Shared-memory parallelism, discussed in the section called “Shared Memory Parallel Programming with OpenMP”, might provide better performance when using multiple cores within a single PC. This is not a general rule, however; again, it depends on the algorithms being used. In addition, shared-memory parallelism does not scale well, because the cores contend for the same shared memory banks and other hardware resources. A PC with 48 cores may not provide the performance boost you were hoping for.
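For comparison, a minimal OpenMP sketch is shown below; it assumes a compiler with OpenMP support (for example, gcc -fopenmp). The threads all access the array directly, which is both the convenience of shared memory and the source of the contention described above.

    /* Minimal OpenMP sketch: the loop iterations are divided among the
       threads of a single machine, which all share the array a. */
    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    int main(void)
    {
        static double a[N];
        double sum = 0.0;

        /* The reduction clause gives each thread a private partial sum
           and combines the partial sums when the loop finishes. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++) {
            a[i] = (double)i;
            sum += a[i];
        }

        printf("sum = %f using up to %d threads\n", sum, omp_get_max_threads());
        return 0;
    }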
Since parallel programming is a complex and time-consuming endeavor, using a system such as MPI that will allow the code to run on the widest possible variety of architectures has clear advantages. Even if shared-memory parallelism offers better performance, it may still be preferable to use MPI when you consider the value of programmer time and the ability to utilize more than one computer. The performance gains of a shared-memory program using a small number of cores may not be important enough to warrant the extra programming time and effort.