Modern clusters consist of nodes with many cores. This raises the question of how processes should be distributed across nodes.
If independent processes such as serial jobs and job arrays are spread out across as many nodes as possible, there may be less resource contention within each node, leading to better overall performance. Spreading jobs out this way can also be a necessity. For example, if each process in a job array requires 20 gigabytes of RAM, and each node has 24 gigabytes available, then we can't run more than one of these processes per node.
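In LSF, you can make such a memory requirement explicit so the scheduler reserves it and never packs two of these processes onto one node. A minimal sketch for the 20-gigabyte example above (the unit of the mem value is typically megabytes, but it depends on your site's LSF_UNIT_FOR_LIMITS setting):

#BSUB -R "rusage[mem=20000]"

With this reservation in place, a 24-gigabyte node can accommodate only one such process at a time.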
On the other hand, shared memory jobs require their processes to be on the same node, and multicore jobs (such as MPI jobs) that use a lot of interprocess communication will perform better when their processes share a node. (Communication between any two processes is generally faster if they are on the same node.)
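In LSF, for example, a shared memory or communication-heavy job can require that all of its slots come from a single host using a span requirement. A sketch for a hypothetical 16-process job (the core count is illustrative):

#BSUB -n 16
#BSUB -R "span[hosts=1]"

The job will then wait for a node with 16 free cores rather than having its processes scattered across hosts.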
When large numbers of serial and job array processes are spread out across the cluster, shared memory jobs may be prevented from starting because no individual node has enough free cores, even though there are plenty of free cores across the cluster as a whole. Multicore jobs may see degraded performance in this situation, since their processes are forced onto different nodes and must communicate over the network.
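You can check whether your cluster's free cores are concentrated or fragmented with the bhosts command, which reports per-host slot usage:

bhosts

In its output, MAX is a host's total number of job slots and NJOBS is, roughly, the number of slots occupied by dispatched jobs.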
Default scheduler policies tend to favor spreading out serial and array jobs across many nodes, and clumping multicore jobs on the same node.
Users running serial and array jobs can take an extra step toward being good citizens by overriding the default scheduler policies. If you know that your serial or array jobs won't suffer from resource contention, you can tell the scheduler to dispatch as many of them as possible to heavily loaded nodes with a simple bsub resource specifier:
#BSUB -R "order[-ut]"
This leaves more nodes with large numbers of free cores, which helps users running shared memory and multicore jobs.
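Putting it all together, a complete submission script for a serial job array might look like the following sketch (the job name, array size, and program are placeholders):

#!/bin/bash
#BSUB -J "myarray[1-100]"
#BSUB -o output.%J.%I
#BSUB -R "order[-ut]"

./my_program $LSB_JOBINDEX

Here ut is each candidate host's CPU utilization, and the leading minus sign reverses the sort so that the most heavily loaded hosts are considered first; the script is submitted with bsub < script.sh.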