Using SLURM

SLURM Jargon

Before we begin, you should know a few common terms that will pop up throughout this chapter.

A CPU or core is a processor capable of running a program.

A node is a distinct computer within a cluster. See the section called “Clusters and HPC” for a description of node types.

A partition is a group of nodes designated in the SLURM configuration as a pool from which to allocate resources.

A process is a process in the usual Unix sense: a single running instance of a program. A parallel job on a cluster consists of multiple processes running at the same time, while a serial job consists of only one process.

A job is all of the processes dispatched by SLURM under the same job ID. Processes within a job on a cluster may be running on the same node or different nodes.

A task is one or more processes within a job that are distinct from the rest of the processes. All processes within a given task must run on the same node. For example, a job could consist of 4 tasks running on different nodes, each consisting of 8 threads using shared-memory parallelism. This job therefore uses 32 cores and a maximum of 4 different nodes. Two tasks could run on the same node if the node has at least 16 cores.
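
In SLURM terms, the example job above (4 tasks of 8 cores each) could be requested with directives along the following lines; these flags are covered later in this chapter:

#SBATCH --ntasks=4
#SBATCH --cpus-per-task=8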

Cluster Status

Before scheduling any jobs through SLURM, it is often useful to check the status of the nodes. Knowing how many cores are available may influence your decision on how many cores to request for your next job.

For example, if only 50 cores are available at the moment, and your job requires 200 cores, the job may have to wait in the queue until 200 cores (with sufficient associated memory) are free. Your job may finish sooner if you reduce the number of cores to 50 or less so that it can start right away, even though it will take longer to run. We can't always predict when CPU cores will become available, so it's often a guessing game. However, some experienced users who know their software may have a pretty good idea when their jobs will finish.

The sinfo command shows information about the nodes and partitions in the cluster. This can be used to determine the total number of cores in the cluster, cores in use, etc.

shell-prompt: sinfo
PARTITION         AVAIL  TIMELIMIT  NODES  STATE NODELIST
default-partitio*    up   infinite      2   idle compute-[001-002]

shell-prompt: sinfo --long
Wed Nov 27 13:34:20 2013
PARTITION         AVAIL  TIMELIMIT   JOB_SIZE ROOT SHARE     GROUPS  NODES       STATE NODELIST
default-partitio*    up   infinite 1-infinite   no    NO        all      2        idle compute-[001-002]

shell-prompt: sinfo -N
NODELIST           NODES         PARTITION STATE 
compute-[001-002]      2 default-partitio* idle  
            

As a convenience, SPCM clusters provide a script called slurm-node-info, which displays specifications on each compute node:

shell-prompt: slurm-node-info
HOSTNAMES CPUS MEMORY
compute-001 12 31751
compute-002 12 31751
compute-003 12 31751
compute-004 12 31751
compute-005 12 31751
compute-006 12 31751
compute-007 12 31751
            
Job Status

You can check on the status of running jobs using squeue.

shell-prompt: squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                64 default-p bench.sl    bacon  R       0:02      2 compute-[001-002]
            

The JOBID column shows the numeric job ID. The ST column shows the current status of the job. The most common status flags are 'PD' for pending (waiting to start) and 'R' for running.

The squeue --long flag requests more detailed information.

The squeue command has many flags for controlling what it reports. Run man squeue for full details.
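
For example, to list only your own jobs, use the --user flag (the username shown is taken from the examples in this chapter):

shell-prompt: squeue --user=bacon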

As a convenience, SPCM clusters provide a script called slurm-cluster-load, which uses squeue and sinfo to display a quick summary of current jobs and overall load. In the CPUS(A/I/O/T) lines shown by these tools, the four numbers are the allocated, idle, other (e.g. down or offline), and total core counts:

shell-prompt: slurm-cluster-load 
 JOBID USER     ST CPUS NODES SHARED NAME     TIME        EXEC_HOST
186110 chen59   R  16   2   no  surf     1-08:46:19  compute-2-[14,16]
186112 chen59   R  16   2   no  surf     1-08:46:19  compute-2-33,compute-3-02
186113 chen59   R  16   2   no  surf     1-08:46:19  compute-3-[07,10]
187146 albertof R  1    1   yes bash     1-03:08:12  compute-2-04
187639 albertof R  1    1   yes submitSR 52:03       compute-4-02
187640 albertof R  1    1   yes submitSR 52:03       compute-4-02
187642 albertof R  1    1   yes submitSR 52:03       compute-4-02
187867 qium     R  8    1   no  ph/IC1x1 46:11       compute-1-03
187868 qium     R  8    1   no  ph/IC1x2 46:11       compute-1-04

CPUS(A/I/O/T)
785/264/87/1136

Load: 69%
            

The slurm-user-cores command displays only the number of cores currently in use by each user and the cluster usage summary:

shell-prompt: slurm-user-cores 
Username    Running Pending 
bahumda2    15      0       
bkhazaei    300     0       
sheikhz2    640     0       
yous        1       0       

batch: 88%
CPUS(A/I/O/T)
955/109/16/1080

128g: 6%
CPUS(A/I/O/T)
1/15/0/16
            
Using top

Using the output from squeue, we can see which compute nodes are being used by a job:

shell-prompt: squeue
 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  1017     batch     bash   oleary  R    2:44:00      1 compute-001
  1031     batch ARM97-pa  roberts  R    1:50:55      2 compute-[001-002]
  1032     batch   sbatch     joea  R    1:35:19      1 compute-002
  1034     batch   sbatch     joea  R    1:05:59      1 compute-002
  1035     batch   sbatch     joea  R      50:08      1 compute-003
  1036     batch   sbatch     joea  R      47:13      1 compute-004
  1041     batch     bash   oleary  R      36:41      1 compute-001
  1042     batch     bash  roberts  R       0:09      1 compute-001
            

From the above output, we can see that job 1031 is running on compute-001 and compute-002.

We can then examine the processes on any of those nodes using a remotely executed top command:

shell-prompt: ssh -t compute-001 top

top - 13:55:55 up 12 days, 24 min,  1 user,  load average: 5.00, 5.00, 4.96
Tasks: 248 total,   6 running, 242 sleeping,   0 stopped,   0 zombie
Cpu(s): 62.5%us,  0.1%sy,  0.0%ni, 37.3%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  24594804k total, 15388204k used,  9206600k free,   104540k buffers
Swap: 33554424k total,    10640k used, 33543784k free, 14486568k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
30144 roberts   20   0  251m  69m  17m R 100.4  0.3 113:29.04 SAM_ADV_MPDATA_   
30145 roberts   20   0  251m  69m  17m R 100.4  0.3 113:28.16 SAM_ADV_MPDATA_   
30143 roberts   20   0  251m  69m  17m R 100.1  0.3 113:16.87 SAM_ADV_MPDATA_   
30146 roberts   20   0  251m  69m  17m R 100.1  0.3 113:29.24 SAM_ADV_MPDATA_   
30147 roberts   20   0  251m  69m  17m R 100.1  0.3 113:29.82 SAM_ADV_MPDATA_   
  965 bacon     20   0 15168 1352  944 R  0.3  0.0   0:00.01 top                
    1 root      20   0 19356 1220 1004 S  0.0  0.0   0:00.96 init               
            

Note

The -t flag is important here, since it tells ssh to open a connection with full terminal control, which is needed by top to update your terminal screen.

The column of interest is under "RES".

We can see from the top command above that the processes owned by job 1031 are using about 69 mebibytes of memory each. Watch this value for a while as it will change as the job runs.

Take the highest value you see in the RES column, add about 10%, and use this with --mem-per-cpu to set a reasonable memory limit.

#SBATCH --mem-per-cpu=76
            

SPCM clusters provide some additional convenience tools to save a little typing when running top.

The node-top command is equivalent to running top on a compute node via ssh as shown above, without typing the ssh command or the full host name:

shell-prompt: node-top 001
            

The job-top command takes a SLURM job ID and runs top on all of the nodes used by the job. The job-top command below is equivalent to running node-top 001 followed by node-top 002:

shell-prompt: squeue
 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  1017     batch     bash   oleary  R    2:44:00      1 compute-001
  1031     batch ARM97-pa  roberts  R    1:50:55      2 compute-[001-002]
  1032     batch   sbatch     joea  R    1:35:19      1 compute-002
  1034     batch   sbatch     joea  R    1:05:59      1 compute-002
  1035     batch   sbatch     joea  R      50:08      1 compute-003
  1036     batch   sbatch     joea  R      47:13      1 compute-004
  1041     batch     bash   oleary  R      36:41      1 compute-001
  1042     batch     bash  roberts  R       0:09      1 compute-001
shell-prompt: job-top 1031
            
Job Submission

The purpose of this section is to provide the reader with a quick start in job scheduling using the most common tools. The full details of job submission are beyond the scope of this document. For more information, see the SLURM website, http://slurm.schedmd.com/, and the man pages for the individual SLURM commands.

man sbatch
            
Submission Scripts

Submitting jobs involves specifying a number of job parameters such as the number of cores, the job name (which is used by other SLURM commands), the name(s) of the output file(s), etc.

In order to document all of this information and make it easy to resubmit the same job, this information is usually incorporated into a submission script. Using a script saves you a lot of typing when you want to rerun the same job, and also fully documents the job parameters.

A submission script is an ordinary shell script, with some directives inserted to provide information to the scheduler. For SLURM, the directives are specially formatted shell comments beginning with "#SBATCH".

Note

The ONLY difference between an sbatch submission script and a script that you would run on any Unix laptop or workstation is the #SBATCH directives.

You can develop and test a script to run your analyses or models on any Unix system. To use it on a SLURM cluster, you need only add the appropriate #SBATCH directives and possibly alter some command arguments to enable parallel execution.

Caution

There cannot be any Unix commands above #SBATCH directives. SLURM will ignore any #SBATCH directives below the first Unix command.
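
For example, in the following sketch the second directive is silently ignored, because it appears after the first Unix command:

#!/usr/bin/env bash

# This directive is interpreted by sbatch
#SBATCH --job-name=demo

hostname    # First Unix command

# Too late: sbatch ignores directives below the first Unix command
#SBATCH --mem-per-cpu=100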

Submission scripts are submitted with the sbatch command. The script may contain #SBATCH options to define the job, regular Unix commands, and srun or other commands to run programs in parallel.

Note

The script submitted by sbatch is executed on one core, regardless of how many cores are allocated for the job. The commands within the submission script are responsible for dispatching multiple processes for parallel jobs. This strategy differs from other popular schedulers like Torque (PBS) and LSF, where the submission script is run in parallel on all cores.

Suppose we have the following text in a file called hostname.sbatch:

#!/usr/bin/env bash

# A SLURM directive
#SBATCH --job-name=hostname

# A command to be executed on the scheduled node.
# Prints the host name of the node running this script.
hostname

The script is submitted to the SLURM scheduler as a command line argument to the sbatch command:

                shell-prompt: sbatch hostname.sbatch
                

The SLURM scheduler finds a free core on a compute node, reserves it, and then runs hostname.sbatch on that compute node.

Recall from Chapter 4, Unix Shell Scripting that everything in a shell script from a '#' character to the end of the line is considered a comment by the shell, and ignored.

However, comments beginning with #SBATCH, while ignored by the shell, are interpreted as directives by sbatch.

The directives within the script provide command line flags to sbatch. For instance, the line

                #SBATCH --mem-per-cpu=10
                

causes sbatch to behave as if you had typed

                shell-prompt: sbatch --mem-per-cpu=10 hostname.sbatch
                

By putting these comments in the script, you eliminate the need to remember them and retype them every time you run the job. It's generally best to put all sbatch flags in the script rather than type any of them on the command line, so that you have an exact record of how the job was started. This will help you determine what went wrong if there are problems, and allow you to reproduce the results at a later date.

Note

If you want to disable a #SBATCH comment, you can just add another '#' rather than delete it. This will allow you to easily enable it again later as well as maintain a record of options you used previously.
                ##SBATCH This line is ignored by sbatch
                #SBATCH This line is interpreted by sbatch
                

Practice Break

#!/usr/bin/env bash

# A SLURM directive
#SBATCH --job-name=hostname

# A command to be executed on the scheduled node.
# Prints the host name of the node running this script.
hostname

Type in the hostname.sbatch script shown above and submit it to the scheduler using sbatch. Then check the status with squeue and view the output and error files.

Common Flags
#SBATCH --output=standard-output-file
#SBATCH --error=standard-error-file
#SBATCH --nodes=min[-max]
#SBATCH --ntasks=tasks
#SBATCH --array=list
#SBATCH --cpus-per-task=N
#SBATCH --exclusive
#SBATCH --mem=MB
#SBATCH --mem-per-cpu=MB
#SBATCH --partition=name
                

The --output and --error flags control the names of the files to which the standard output and standard error of the processes are redirected. If omitted, both standard output and standard error are written to slurm-JOBID.out.
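
For example, the following directives (the file names are arbitrary) send output and errors to separate files; %j in a file name is replaced by the job ID:

#SBATCH --output=myjob-%j.out
#SBATCH --error=myjob-%j.err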

Note

Commands in a submission script should not use output redirection (>) or error redirection (2>, >&), since these would conflict with --output and --error.

The --ntasks flag indicates how many tasks the job will run, usually for multicore jobs.

The --nodes flag is used to specify how many nodes (not tasks) to allocate for the job. We can use this to control how many tasks are run on each node. For example, if a job consists of 20 I/O-intensive processes, we would not want many of them running on the same node and competing for the local disk. In this case, we can specify --nodes=20 --ntasks=20 to force each process to a separate node.
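
The 20-process scenario just described could be requested with the following directives:

#SBATCH --nodes=20
#SBATCH --ntasks=20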

Note

The sbatch command with --nodes or --ntasks will not cause multiple processes to run. The sbatch command simply runs the script on one node. To run multiple tasks, you must use srun, mpirun, or some other dispatcher within the script. This differs from many other schedulers such as LSF and Torque.

The --cpus-per-task=N flag indicates that we need N CPUs on the same node. Using --ntasks=4 --cpus-per-task=3 indicates 4 tasks using 3 cores each, in effect requesting 12 cores, allocated so that each task's 3 cores are on the same node.

The --exclusive flag indicates that the job cannot share nodes with other jobs. This is typically used for shared-memory parallel jobs to maximize the number of cores available to the job. It may also be used for jobs with high memory requirements, although it is better to simply specify the memory requirements using --mem or --mem-per-cpu.

The --mem=MB flag indicates the amount of memory needed on each node used by the job, in megabytes.

The --mem-per-cpu=MB flag indicates the amount of memory needed by each process within a job, in megabytes.

The --partition=name flag indicates the partition (set of nodes) on which the job should run. Run sinfo to see a list of available partitions.
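
Putting several of these flags together, the top of a submission script might look like the following sketch (the job name, values, and partition name are arbitrary examples):

#!/usr/bin/env bash

#SBATCH --job-name=myjob
#SBATCH --ntasks=8
#SBATCH --mem-per-cpu=500
#SBATCH --partition=batch

# Commands for the job follow here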

SLURM Resource Requirements

When using a cluster, it is important to develop a feel for the resources required by your jobs, and inform the scheduler as accurately as possible what will be needed in terms of CPU time, memory, etc.

If a user does not specify a given resource requirement, the scheduler uses default limits. Default limits are set low, so that users are encouraged to provide an estimate of the required resources for all non-trivial jobs. This protects other users from being blocked by long-running jobs that reserve more memory and other resources than they actually need.

The --mem=MB flag indicates that the job requires MB megabytes of memory per node.

The --mem-per-cpu=MB flag indicates that the job requires MB megabytes per core.

Note

It is very important to specify memory requirements accurately in all jobs. Memory use is generally easy to predict from previous runs by monitoring the processes within the job using top or ps. Failing to do so could block other jobs from running, even though the resources they require are actually available.
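
For example, to check the resident memory (RSS, in kibibytes) of your processes on a compute node, a command along these lines can be used (column names vary slightly between ps implementations):

shell-prompt: ssh compute-001 ps -u bacon -o pid,rss,comm
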
Batch Serial Jobs

A batch serial submission script need contain only optional flags such as the job name and output file, plus one or more commands.

                #!/usr/bin/env bash
                
                hostname
                
Interactive Jobs

Interactive jobs involve running programs under the scheduler on compute nodes, where the user can interact directly with the process. I.e., output is displayed on their terminal instead of being redirected to files (as controlled by the sbatch --output and --error flags) and input can be taken from the keyboard.

The use of interactive jobs on a cluster is uncommon, but sometimes useful.

For example, an interactive shell environment might be useful when porting a new program to the cluster. This can be used to do basic compilation and testing in the compute node environment before submitting larger jobs. Programs should not be compiled or tested on the head node and users should not ssh directly to a compute node for this purpose, as this would circumvent the scheduler, causing their compilations, etc. to overload the node.

Caution

A cluster is not a good place to do code development and testing. Code should be fully developed and tested on a separate development system such as a laptop or workstation before being run on a cluster, where buggy programs waste valuable shared resources and may cause problems for other users.

On SPCM clusters, a convenience script called slurm-shell is provided to make it easy to start an interactive shell under the scheduler. It requires the number of cores and the amount of memory to allocate as arguments. The memory specification is in mebibytes by default, but can be followed by a 'g' to indicate gibibytes:

shell-prompt: slurm-shell 1 500
==========================================================================
Note:

slurm-shell is provided solely as a convenience for typical users.

If you would like an interactive shell with additional memory, more
than 1 CPU, etc., you can use salloc and srun directly.  Run

    more /usr/local/bin/slurm-shell

for a basic example and

    man salloc
    man srun

for full details.
==========================================================================

 1:05PM  up 22 days, 2 hrs, 0 users, load averages: 0.00, 0.00, 0.00
To repeat the last command in the C shell, type "!!".
        -- Dru <genesis@istar.ca>
FreeBSD compute-001. bacon ~ 401: 
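
If slurm-shell is not available, an interactive shell can usually be started directly with srun; a minimal sketch, with resource flags analogous to the slurm-shell arguments above (exact flags depend on your cluster's configuration):

shell-prompt: srun --ntasks=1 --mem=500 --pty /bin/sh -i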
                
Batch Parallel Jobs (Job Arrays)

A job array is a set of independent processes all started by a single job submission. The entire job array can be treated as a single job by SLURM commands such as squeue and scancel.

A batch parallel submission script looks almost exactly like a batch serial script. There are a couple of ways to run a batch parallel job. It is possible to run a job array using --ntasks and an srun command in the sbatch script. However, the --array flag is far more convenient for most purposes.

The --array flag is followed by an index specification which allows the user to specify any set of task IDs for each job in the array. The specification can use '-' to specify a range of task IDs, commas to specify an arbitrary list, or both. For example, to run a job array with task IDs ranging from 1 to 5, we would use the following:

Note

Some SLURM clusters use an operating system feature known as task affinity to improve performance. The feature binds each process to a specific core, so that any data in the cache RAM of that core does not have to be flushed and reloaded into the cache of a different core. This can dramatically improve performance for certain memory-intensive processes, though it won't make any noticeable difference for many jobs.

In order to activate task affinity, you must use srun or another command to launch the processes.

#!/usr/bin/env bash

#SBATCH  --array=1-5

# OK, but will not use task affinity
hostname
                
#!/usr/bin/env bash

#SBATCH  --array=1-5

# Activates task affinity, so each hostname process will stick to one core
srun hostname
                

If we wanted task IDs of 1, 2, 3, 10, 11, and 12, we could use the following:

#!/usr/bin/env bash

#SBATCH  --array=1-3,10-12

srun hostname
                

The reasons for selecting specific task IDs are discussed in the section called “Environment Variables”.

Note

Each task in an array job is treated as a separate job by SLURM. Hence, --mem and --mem-per-cpu mean the same thing in an array job. Both refer to the memory required by one task.

Another advantage of using --array is that we don't need all the cores available at once in order for the job to run. For example, if we submit a job using --array=1-200 while there are only 50 cores free on the cluster, it will begin running as many processes as possible immediately and run more as more cores become free. In contrast, a job started with --ntasks will remain in a pending state until there are 200 cores free.

Furthermore, we can explicitly limit the number of array tasks running at once with a simple addition to the flag. Suppose we want to run 10,000 processes, but be nice to other cluster users and use only 100 cores at a time. All we need to do is use the following:

#SBATCH --array=1-10000%100
                

Caution

Do not confuse this with --array=1-10000:4, which simply increments the task ID by 4 instead of 1, producing task IDs 1, 5, 9, ..., and does nothing to limit how many tasks run at once!

Finally, we can run arrays of multithreaded jobs by adding --cpus-per-task. For example, the BWA sequence aligner allows the user to specify multiple threads using the -t flag:

#!/usr/bin/env bash

#SBATCH --array=1-100 --nodes=1 --cpus-per-task=8

# Each array task aligns its own input file using 8 threads
srun bwa mem -t $SLURM_CPUS_PER_TASK reference.fa \
    input-$SLURM_ARRAY_TASK_ID.fa > output-$SLURM_ARRAY_TASK_ID.sam
                

Practice Break

Copy your hostname.sbatch to hostname-parallel.sbatch, modify it to run 5 processes, and submit it to the scheduler using sbatch. Then check the status with squeue and view the output and error files.

Multi-core Jobs

A multi-core job is any job that runs a parallel program. This means that it uses multiple processes that communicate and cooperate with each other in some way. There is no such communication or cooperation in batch parallel (also known as embarrassingly parallel) jobs, which use SLURM's --array flag.

Cores are allocated for multi-core jobs using the --ntasks flag.

Using --ntasks ONLY tells the SLURM scheduler how many cores to reserve. It does NOT tell your parallel program how many cores or which compute nodes to use. It is the responsibility of the user to ensure that the command(s) in their script utilize the correct number of cores and the correct nodes within the cluster.

Some systems, such as OpenMPI (discussed in the section called “Message Passing Interface (MPI)”), will automatically detect this information from the SLURM environment and dispatch the correct number of processes to each allocated node.

If you are not using OpenMPI, you may need to specify the number of cores and/or which nodes to use in the command(s) that run your computation. Many commands have a flag argument such as -n or -np for this purpose.
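
For example, with a launcher that accepts -np, the core count reserved by SLURM can be passed along as shown below (the program name is illustrative; OpenMPI under SLURM usually detects this automatically):

mpirun -np $SLURM_NTASKS ./my-mpi-program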

OpenMP (not to be confused with OpenMPI) is often used to run multiple threads on the same node (shared memory parallelism, as discussed in the section called “Shared Memory and Multi-Core Technology” and the section called “Shared Memory Parallel Programming with OpenMP”).

OpenMP software will look for the OMP_NUM_THREADS environment variable to indicate how many cores to use. It is the user's responsibility to ensure that OMP_NUM_THREADS is set when using OpenMP software.

When running OpenMP programs under SLURM, we can ensure that the right number of cores is used by specifying the --ntasks flag and assigning the value of SLURM_NTASKS to OMP_NUM_THREADS.

#!/bin/sh -e

#SBATCH --ntasks=4

OMP_NUM_THREADS=$SLURM_NTASKS
export OMP_NUM_THREADS

OpenMP-based-program arg1 arg2 ...
MPI Multi-core Jobs

Scheduling MPI jobs in SLURM is much like scheduling batch parallel jobs.

MPI programs cannot be executed directly from the command line as we do with normal programs and scripts. Instead, we must use the mpirun or srun command to start up MPI programs.

Note

Whether you should use mpirun or srun depends on how your SLURM scheduler is configured. See slurm.conf or talk to your systems manager to find out more.
                mpirun [mpirun flags] mpi-program [mpi-program arguments]
                

Caution

Like any other command used on a cluster or grid, mpirun must not be executed directly from the command line, but instead must be used in a scheduler submission script.

For MPI and other multicore jobs, we use the sbatch --ntasks flag to indicate how many cores we want to use. Unlike --array, sbatch with --ntasks runs the sbatch script on only one node, and it is up to the commands in the script (such as mpirun or srun) to dispatch the parallel processes to all the allocated cores.

Commands like mpirun and srun can retrieve information about which cores have been allocated from the environment handed down by their parent process, the SLURM scheduler. The details about SLURM environment variables are discussed in the section called “Environment Variables”.

#!/bin/sh

#SBATCH --ntasks=2

PATH=${PATH}:/usr/local/mpi/openmpi/bin
export PATH

mpirun ./mpi_bench

When running MPI jobs, it is often desirable to have as many processes as possible running on the same node. Message passing is generally faster between processes on the same node than between processes on different nodes, because messages passed within the same node need not cross the network. If you have a very fast network such as Infiniband or a low-latency Ethernet, the difference may be marginal, but on more ordinary networks such as gigabit Ethernet, the difference can be enormous.

SLURM by default will place as many processes as possible on the same node. We can also use --exclusive to ensure that "leftover" cores on a partially busy node will not be used for our MPI jobs. This may improve communication performance, but may also delay the start of the job until enough nodes are completely empty.

Environment Variables

SLURM sets a number of environment variables when a job is started. These variables can be used in the submission script and within other scripts or programs executed as part of the job.

shell-prompt: srun printenv | grep SLURM
SLURM_SRUN_COMM_PORT=27253
SLURM_TASKS_PER_NODE=1
SLURM_NODELIST=compute-001
SLURM_JOB_NUM_NODES=1
SLURM_NNODES=1
SLURM_STEPID=0
SLURM_STEP_ID=0
SLURM_JOBID=81
SLURM_JOB_ID=81
SLURM_DISTRIBUTION=cyclic
SLURM_NPROCS=1
SLURM_NTASKS=1
SLURM_JOB_CPUS_PER_NODE=1
SLURM_JOB_NAME=/usr/bin/printenv
SLURM_SUBMIT_HOST=finch.cs.uwm.edu
SLURM_SUBMIT_DIR=/usr/home/bacon
SLURM_PRIO_PROCESS=0
SLURM_STEP_NODELIST=compute-001
SLURM_STEP_NUM_NODES=1
SLURM_STEP_NUM_TASKS=1
SLURM_STEP_TASKS_PER_NODE=1
SLURM_STEP_LAUNCHER_PORT=27253
SLURM_SRUN_COMM_HOST=192.168.0.2
SLURM_TOPOLOGY_ADDR=compute-001
SLURM_TOPOLOGY_ADDR_PATTERN=node
SLURM_CPUS_ON_NODE=1
SLURM_TASK_PID=5945
SLURM_NODEID=0
SLURM_PROCID=0
SLURM_LOCALID=0
SLURM_LAUNCH_NODE_IPADDR=192.168.0.2
SLURM_GTIDS=0
SLURM_CHECKPOINT_IMAGE_DIR=/usr/home/bacon
SLURMD_NODENAME=compute-001
                

The SLURM_JOB_NAME variable can be useful for generating output file names within a script or program, among other things.
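
For example, a submission script might derive an output file name from the job name; a sketch (the program and file names are illustrative):

#!/usr/bin/env bash

#SBATCH --job-name=assembly

# Name the results file after the job
./myprog > ${SLURM_JOB_NAME}-results.txt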

When submitting a job array using --array, the submission script is run once for each task in the array. To distinguish between the tasks, we examine SLURM_ARRAY_TASK_ID. This variable is set to a different value for each task, taken from the specification given with --array.

For example, suppose we have 100 input files named input-1.txt through input-100.txt. We could use the following script to process them:

                #!/usr/bin/env bash
                
                #SBATCH --array=1-100
                
                ./myprog < input-$SLURM_ARRAY_TASK_ID.txt \
                    > output-$SLURM_ARRAY_TASK_ID.txt
                

Suppose our input files are not numbered sequentially, but according to some other criteria, such as a set of prime numbers. In this case, we would simply change the specification in --array:

                #!/usr/bin/env bash
                
                #SBATCH --array=2,3,5,7,11,13
                
                ./myprog < input-$SLURM_ARRAY_TASK_ID.txt \
                    > output-$SLURM_ARRAY_TASK_ID.txt
                
Terminating a Job

If you determine that a job is not behaving properly (by reviewing partial output, for example), you can terminate it using scancel, which takes a job ID as a command line argument.

shell-prompt: sbatch bench.sbatch 
Submitted batch job 90
shell-prompt: squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                90 default-p bench.sl    bacon  R       0:03      2 compute-[001-002]
shell-prompt: scancel 90
shell-prompt: squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            
Terminating Stray Processes

Occasionally, a SLURM job may fail and leave processes running on the compute nodes. These are called stray processes, since they are no longer under the control of the scheduler.

Do not confuse stray processes with zombie processes. A zombie process is a process that has terminated but has not been reaped. I.e., it is still listed by the ps command because its parent process has not yet become aware that it has terminated. Zombie processes are finished and do not consume any resources, so we need not worry about them.

Stray processes are easy to detect on nodes where you have no jobs running. If squeue or slurm-cluster-load do not show any of your jobs using a particular node, but top or node-top show processes under your name, then you have some strays. Simply ssh to that compute node and terminate them with the standard Unix kill or killall commands.

If you have jobs running on the same node as your strays, then detecting the strays may be difficult. If the strays are running a different program than your legitimate job processes, then they will be easy to spot. If they are running the same program as your legitimate job processes, then they will likely have a different run time than your legitimate processes. Be very careful to ensure that you kill the strays and not your active job in this situation.

Linux login.mortimer bacon ~ 414: squeue -u bacon
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

Linux login.mortimer bacon ~ 415: node-top 003
top - 09:33:55 up 2 days, 10:44,  1 user,  load average: 16.02, 15.99, 14.58
Tasks: 510 total,  17 running, 493 sleeping,   0 stopped,   0 zombie
Cpu(s):  2.3%us,  0.0%sy,  0.0%ni, 97.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  49508900k total,  2770060k used, 46738840k free,    62868k buffers
Swap: 33554428k total,        0k used, 33554428k free,   278916k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
 7530 bacon     20   0  468m 112m 4376 R 81.5  0.2  37:55.11 mpi-bench          
 7531 bacon     20   0  468m 112m 4524 R 81.5  0.2  37:55.44 mpi-bench          
 7532 bacon     20   0  468m 112m 4552 R 81.5  0.2  37:55.27 mpi-bench          
 7533 bacon     20   0  468m 112m 4552 R 81.5  0.2  37:55.83 mpi-bench          
 7537 bacon     20   0  468m 112m 4556 R 81.5  0.2  37:55.71 mpi-bench          
 7548 bacon     20   0  468m 112m 4556 R 81.5  0.2  37:55.51 mpi-bench          
 7550 bacon     20   0  468m 112m 4552 R 81.5  0.2  37:55.87 mpi-bench          
 7552 bacon     20   0  468m 112m 4532 R 81.5  0.2  37:55.82 mpi-bench          
 7554 bacon     20   0  468m 112m 4400 R 81.5  0.2  37:55.83 mpi-bench          
 7534 bacon     20   0  468m 112m 4548 R 79.9  0.2  37:55.78 mpi-bench          
 7535 bacon     20   0  468m 112m 4552 R 79.9  0.2  37:55.84 mpi-bench          
 7536 bacon     20   0  468m 112m 4552 R 79.9  0.2  37:55.87 mpi-bench          
 7542 bacon     20   0  468m 112m 4552 R 79.9  0.2  37:55.86 mpi-bench          
 7544 bacon     20   0  468m 112m 4552 R 79.9  0.2  37:55.84 mpi-bench          
 7546 bacon     20   0  468m 112m 4544 R 79.9  0.2  37:55.88 mpi-bench          
 7529 bacon     20   0  468m 110m 4408 R 78.3  0.2  37:41.46 mpi-bench          
    4 root      20   0     0    0    0 S  1.6  0.0   0:00.04 ksoftirqd/0        
 7527 bacon     20   0  146m 4068 2428 S  1.6  0.0   0:11.39 mpirun             
 7809 bacon     20   0 15280 1536  892 R  1.6  0.0   0:00.07 top                
    1 root      20   0 19356 1536 1240 S  0.0  0.0   0:02.03 init               
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.03 kthreadd           
Connection to compute-003 closed.

Linux login.mortimer bacon ~ 416: ssh compute-003 kill 7530

Linux login.mortimer bacon ~ 417: ssh compute-003 killall mpi-bench
            

If the normal kill command doesn't work, you can use kill -9 to do a sure kill. This sends SIGKILL, which cannot be caught or ignored by the process.

Viewing Output of Active Jobs

Unlike many other schedulers, SLURM makes output files available for viewing while the job is running, so it does not require a "peek" command. The default output file, or the files specified with --output and --error, are updated continuously while a job runs and can be viewed with standard Unix commands such as more.
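
For example, to follow the output of a running job as it is written (the job ID in the file name is illustrative):

shell-prompt: tail -f slurm-1031.out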

Checking Job Stats with sacct

The sacct command is used to view accounting statistics on completed jobs. With no command-line arguments, sacct prints a summary of your past jobs:

shell-prompt: sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
70           build.slu+ default-p+     (null)          1  COMPLETED      0:0 
71           build.slu+ default-p+     (null)          1  COMPLETED      0:0 
72           bench.slu+ default-p+     (null)         24  COMPLETED      0:0 
...
184            hostname default-p+     (null)         40  COMPLETED      0:0 
184.0          hostname                (null)         40  COMPLETED      0:0 
185          bench-fre+ default-p+     (null)         48  CANCELLED      0:0 
185.0             orted                (null)          3     FAILED      1:0 
186          env.sbatch default-p+     (null)          1  COMPLETED      0:0 

For detailed information on a particular job, we can use the -j flag to specify a job id and the -o flag to specify which information to display:

shell-prompt: sacct -o alloccpus,nodelist,elapsed,cputime -j 117
 AllocCPUS        NodeList    Elapsed    CPUTime 
---------- --------------- ---------- ---------- 
        12     compute-001   00:00:29   00:05:48 
            

On SPCM clusters, the above command is provided in a convenience script called slurm-job-stats:

shell-prompt: slurm-job-stats 117
 AllocCPUS        NodeList    Elapsed    CPUTime 
---------- --------------- ---------- ---------- 
        12     compute-001   00:00:29   00:05:48 
            

For more information on sacct, run "man sacct" or view the online SLURM documentation.

Checking Status of Running Jobs with scontrol

Detailed information on currently running jobs can be displayed using scontrol.

shell-prompt: scontrol show jobid 10209
JobId=10209 JobName=bench-freebsd.sbatch
   UserId=bacon(4000) GroupId=bacon(4000)
   Priority=4294901512 Nice=0 Account=(null) QOS=(null)
   JobState=FAILED Reason=NonZeroExitCode Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=213:0
   RunTime=00:00:02 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2015-09-09T11:25:31 EligibleTime=2015-09-09T11:25:31
   StartTime=2015-09-09T11:25:32 EndTime=2015-09-09T11:25:34
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=default-partition AllocNode:Sid=login:7345
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=compute-[003-006]
   BatchHost=compute-003
   NumNodes=4 NumCPUs=48 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=1024M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/share1/Data/bacon/Facil/Software/Src/Bench/MPI/bench-freebsd.sbatch
      WorkDir=/share1/Data/bacon/Facil/Software/Src/Bench/MPI
   StdErr=/share1/Data/bacon/Facil/Software/Src/Bench/MPI/slurm-10209.out
   StdIn=/dev/null
   StdOut=/share1/Data/bacon/Facil/Software/Src/Bench/MPI/slurm-10209.out
            

On SPCM clusters, the above command is provided in a convenience script called slurm-job-status:

shell-prompt: squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             20537     batch bench.sb    bacon  R       0:02      1 compute-001

shell-prompt: slurm-job-status 20537
JobId=20537 JobName=bench.sbatch
   UserId=bacon(4000) GroupId=bacon(4000)
   Priority=4294900736 Nice=0 Account=(null) QOS=(null)
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:12 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2016-07-18T11:38:31 EligibleTime=2016-07-18T11:38:31
   StartTime=2016-07-18T11:38:31 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=batch AllocNode:Sid=login:7485
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=compute-001
   BatchHost=compute-001
   NumNodes=1 NumCPUs=12 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=12,mem=3072,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=256M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/share1/Data/bacon/Testing/mpi-bench/trunk/bench.sbatch freebsd
   WorkDir=/share1/Data/bacon/Testing/mpi-bench/trunk
   StdErr=/share1/Data/bacon/Testing/mpi-bench/trunk/slurm-20537.out
   StdIn=/dev/null
   StdOut=/share1/Data/bacon/Testing/mpi-bench/trunk/slurm-20537.out
   Power= SICP=0
            
Job Sequences

If you need to submit a series of jobs in sequence, where one job begins after another has completed, the simplest approach is to simply submit job N+1 from the sbatch script for job N.

It's important to make sure that the current job completed successfully before submitting the next, to avoid wasting resources. It is up to you to determine the best way to verify that a job was successful. Examples might include grepping the log file for some string indicating success, or making the job create a marker file using the touch command after a successful run. If the command used in your job returns a Unix-style exit status (0 for success, non-zero on error), then you can simply use the shell's exit-on-error feature to make your script exit when any command fails. Below is a template for scripts that might run a series of jobs.

#!/bin/sh

#SBATCH job-parameters
set -e  # Set exit-on-error

job-command

# This script will exit here if job-command failed

sbatch job2.sbatch  # Executed only if job-command succeeded
            
Self-test
  1. What is the SLURM command for showing the current state of all nodes in a cluster?
  2. What is the SLURM command to show the currently running jobs on a cluster?
  3. Write and submit a batch-serial SLURM script called list-etc.sbatch that prints the host name of the compute node on which it runs and a long listing of the /etc directory on that node.

    The script should store the output of the commands in list-etc.stdout and error messages in list-etc.stderr in the directory from which the script was submitted.

    The job should appear in squeue listings under the name "list-etc".

    Quickly check the status of your job after submitting it.

  4. Copy your list-etc.sbatch script to list-etc-parallel.sbatch, and modify it so that it runs the hostname and ls commands on 10 cores instead of just one.

    The job should produce a separate output file for each process named list-etc-parallel.o<jobid>-<arrayid> and a separate error file for each process named list-etc-parallel.e<jobid>-<arrayid>.

    Quickly check the status of your job after submitting it.

  5. What is the SLURM command for terminating a job with job-id 3545?
  6. What is the SLURM command for viewing the terminal output of a job with job-id 3545 while it is still running?
  7. What is the SLURM command for showing detailed job information about the job with job-id 3254?