Process Distribution

Another thing we need to consider on modern hardware is that most cluster nodes have multiple cores, so messages are often passed between two processes running on the same compute node. Such messages do not pass through the cluster network: there is no reason for them to leave the compute node and come back. Instead, the sending process simply places them in a shared memory buffer on the node, from which the recipient process reads them.
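
You can check this placement from inside an MPI program. The sketch below assumes the mpi4py package and a standard MPI launcher such as mpirun or srun; it reports which node each rank runs on and how many ranks share that node.

    # placement_report.py - a minimal sketch (assumes mpi4py and an MPI
    # launcher such as mpirun or srun); prints where each rank runs and
    # how many ranks share its node.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    node = MPI.Get_processor_name()

    # Ranks on the same node end up in the same shared-memory communicator,
    # so its size is the number of processes placed on this node.
    node_comm = comm.Split_type(MPI.COMM_TYPE_SHARED)
    ranks_on_node = node_comm.Get_size()

    report = comm.gather((rank, node, ranks_on_node), root=0)
    if rank == 0:
        for r, n, k in sorted(report):
            print(f"rank {r:3d} runs on {n} ({k} ranks on that node)")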

In general, local connections like these are faster than even the fastest network. However, memory bandwidth is a shared resource, and it can become a bottleneck if too many processes on the same node are passing many messages.

For some applications, you may find that you get better performance by limiting the number of processes per node, thereby balancing the message-passing load between local memory and the network. Jobs that perform a lot of disk I/O may also benefit from using fewer processes per node. Most schedulers allow you to specify not just how many processes to use, but also how many to place on each node.
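
As an illustration only, the sketch below prints SLURM-style launch lines for a few per-node limits. The srun flags are one common way to express this request; ./my_app is a placeholder for your own executable.

    # launch_plan.py - a sketch only: prints SLURM-style launch lines for a
    # few per-node limits (srun flags assumed; "./my_app" is a placeholder).
    import math

    def launch_command(total_procs: int, procs_per_node: int) -> str:
        # Number of nodes needed to hold total_procs at this per-node limit.
        nodes = math.ceil(total_procs / procs_per_node)
        return (f"srun --nodes={nodes} --ntasks={total_procs} "
                f"--ntasks-per-node={procs_per_node} ./my_app")

    for ppn in (1, 4, 8):
        print(launch_command(32, ppn))

Running this prints the three launch lines; with 32 processes in total, they would request 32, 8, and 4 nodes respectively.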

The best way to determine the optimal distribution of processes is to experiment. You might try 1, 4, and 8 processes per node and compare the run times. If 8 per node turns out to be the slowest, the optimum probably lies lower, so try 2 or 3. If 4 per node turns out to be the fastest, also try 5 or 6. Results will vary between clusters, so you'll have to repeat the experiment on each one. If you plan to run many jobs over the course of weeks or months, spending a day or two finding the optimal distribution could lead to substantial time savings overall.
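
Such an experiment is easy to script. The sketch below again assumes a SLURM cluster with srun on the PATH and uses ./my_app as a placeholder executable; it times the same job at each of the three candidate distributions.

    # ppn_experiment.py - a sketch of the timing experiment (assumes a SLURM
    # cluster with srun available; "./my_app" is a placeholder executable).
    import subprocess
    import time

    TOTAL_PROCS = 32

    for ppn in (1, 4, 8):            # candidate processes-per-node values
        cmd = ["srun", f"--ntasks={TOTAL_PROCS}",
               f"--ntasks-per-node={ppn}", "./my_app"]
        start = time.perf_counter()
        subprocess.run(cmd, check=True)
        elapsed = time.perf_counter() - start
        print(f"{ppn:2d} processes per node: {elapsed:.1f} s")

Run the script from inside an allocation (obtained with sbatch or salloc) that is large enough for the widest spread, here 32 nodes at 1 process per node.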

Self-test
  1. When is limiting the number of MPI processes per node likely to improve performance, and why?