Before we begin, you should know a few common terms that will pop up throughout this chapter.
A CPU or core is a processor capable of running a program.
A node is a distinct computer within a cluster. See the section called “Clusters and HPC” for a description of node types.
A partition is a group of nodes designated in the SLURM configuration as a pool from which to allocate resources.
A process is the execution of a program, as defined by Unix. A parallel job on a cluster consists of multiple processes running at the same time. A serial job consists of only one process.
A job is all of the processes dispatched by SLURM under the same job ID. Processes within a job on a cluster may be running on the same node or different nodes.
A task is one or more processes within a job that are distinct from the rest of the processes. All processes within a given task must run on the same node. For example, a job could consist of 4 tasks running on different nodes, each consisting of 8 threads using shared-memory parallelism. This job therefore uses 32 cores and a maximum of 4 different nodes. Two tasks could run on the same node if the node has at least 16 cores.
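As a hypothetical illustration, the example job above could be requested with directives like the following (how each task actually uses its 8 cores is up to the program):

#SBATCH --ntasks=4
#SBATCH --cpus-per-task=8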
Before scheduling any jobs through SLURM, it is often useful to check the status of the nodes. Knowing how many cores are available may influence your decision on how many cores to request for your next job.
For example, if only 50 cores are available at the moment, and your job requires 200 cores, the job may have to wait in the queue until 200 cores (with sufficient associated memory) are free. Your job may finish sooner if you reduce the number of cores to 50 or less so that it can start right away, even though it will take longer to run. We can't always predict when CPU cores will become available, so it's often a guessing game. However, some experienced users who know their software may have a pretty good idea when their jobs will finish.
The sinfo command shows information about the nodes and partitions in the cluster. This can be used to determine the total number of cores in the cluster, cores in use, etc.
shell-prompt: sinfo
PARTITION         AVAIL  TIMELIMIT  NODES  STATE  NODELIST
default-partitio* up     infinite   2      idle   compute-[001-002]

shell-prompt: sinfo --long
Wed Nov 27 13:34:20 2013
PARTITION         AVAIL  TIMELIMIT  JOB_SIZE    ROOT  SHARE  GROUPS  NODES  STATE  NODELIST
default-partitio* up     infinite   1-infinite  no    NO     all     2      idle   compute-[001-002]

shell-prompt: sinfo -N
NODELIST           NODES  PARTITION          STATE
compute-[001-002]  2      default-partitio*  idle
As a convenience, SPCM clusters provide a script called slurm-node-info, which displays specifications on each compute node:
shell-prompt: slurm-node-info
HOSTNAMES    CPUS  MEMORY
compute-001  12    31751
compute-002  12    31751
compute-003  12    31751
compute-004  12    31751
compute-005  12    31751
compute-006  12    31751
compute-007  12    31751
You can check on the status of running jobs using squeue.
shell-prompt: squeue
  JOBID PARTITION     NAME     USER ST     TIME  NODES NODELIST(REASON)
     64 default-p bench.sl    bacon  R     0:02      2 compute-[001-002]
The JOBID column shows the numeric job ID. The ST column shows the current status of the job. The most common status flags are 'PD' for pending (waiting to start) and 'R' for running.
The squeue --long flag requests more detailed information.
The squeue command has many flags for controlling what it reports. Run man squeue for full details.
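For example, to list only your own jobs in the more detailed long format, you might combine the --user and --long flags (substituting your own username):

shell-prompt: squeue --user bacon --long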
As a convenience, SPCM clusters provide a script called slurm-cluster-load, which uses squeue and sinfo to display a quick summary on current jobs and overall load:
shell-prompt: slurm-cluster-load
JOBID   USER      ST  CPUS  NODES  SHARED  NAME      TIME        EXEC_HOST
186110  chen59    R   16    2      no      surf      1-08:46:19  compute-2-[14,16]
186112  chen59    R   16    2      no      surf      1-08:46:19  compute-2-33,compute-3-02
186113  chen59    R   16    2      no      surf      1-08:46:19  compute-3-[07,10]
187146  albertof  R   1     1      yes     bash      1-03:08:12  compute-2-04
187639  albertof  R   1     1      yes     submitSR  52:03       compute-4-02
187640  albertof  R   1     1      yes     submitSR  52:03       compute-4-02
187642  albertof  R   1     1      yes     submitSR  52:03       compute-4-02
187867  qium      R   8     1      no      ph/IC1x1  46:11       compute-1-03
187868  qium      R   8     1      no      ph/IC1x2  46:11       compute-1-04

CPUS(A/I/O/T) 785/264/87/1136
Load: 69%
The slurm-user-cores command displays only the number of cores currently in use by each user and the cluster usage summary:
shell-prompt: slurm-user-cores
Username  Running  Pending
bahumda2  15       0
bkhazaei  300      0
sheikhz2  640      0
yous      1        0

batch: 88%  CPUS(A/I/O/T) 955/109/16/1080
128g:   6%  CPUS(A/I/O/T) 1/15/0/16
Using the output from squeue, we can see which compute nodes are being used by a job:
shell-prompt: squeue
  JOBID PARTITION     NAME     USER ST     TIME  NODES NODELIST(REASON)
   1017     batch     bash   oleary  R  2:44:00      1 compute-001
   1031     batch ARM97-pa  roberts  R  1:50:55      2 compute-[001-002]
   1032     batch   sbatch     joea  R  1:35:19      1 compute-002
   1034     batch   sbatch     joea  R  1:05:59      1 compute-002
   1035     batch   sbatch     joea  R    50:08      1 compute-003
   1036     batch   sbatch     joea  R    47:13      1 compute-004
   1041     batch     bash   oleary  R    36:41      1 compute-001
   1042     batch     bash  roberts  R     0:09      1 compute-001
From the above output, we can see that job 1031 is running on compute-001 and compute-002.
We can then examine the processes on any of those nodes using a remotely executed top command:
shell-prompt: ssh -t compute-001 top

top - 13:55:55 up 12 days, 24 min,  1 user,  load average: 5.00, 5.00, 4.96
Tasks: 248 total,   6 running, 242 sleeping,   0 stopped,   0 zombie
Cpu(s): 62.5%us,  0.1%sy,  0.0%ni, 37.3%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  24594804k total, 15388204k used,  9206600k free,   104540k buffers
Swap: 33554424k total,    10640k used, 33543784k free, 14486568k cached

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
30144 roberts   20   0  251m  69m  17m R 100.4  0.3 113:29.04 SAM_ADV_MPDATA_
30145 roberts   20   0  251m  69m  17m R 100.4  0.3 113:28.16 SAM_ADV_MPDATA_
30143 roberts   20   0  251m  69m  17m R 100.1  0.3 113:16.87 SAM_ADV_MPDATA_
30146 roberts   20   0  251m  69m  17m R 100.1  0.3 113:29.24 SAM_ADV_MPDATA_
30147 roberts   20   0  251m  69m  17m R 100.1  0.3 113:29.82 SAM_ADV_MPDATA_
  965 bacon     20   0 15168 1352  944 R   0.3  0.0   0:00.01 top
    1 root      20   0 19356 1220 1004 S   0.0  0.0   0:00.96 init
The -t flag is important here: it tells ssh to open a connection with full terminal control, which top needs in order to update your terminal screen.
The column of interest is under "RES".
We can see from the top command above that the processes owned by job 1031 are using about 69 mebibytes of memory each. Watch this value for a while as it will change as the job runs.
Take the highest value you see in the RES column, add about 10%, and use this with --mem-per-cpu to set a reasonable memory limit.
#SBATCH --mem-per-cpu=76
SPCM clusters provide some additional convenience tools to save a little typing when running top.
The node-top command is equivalent to running top on a compute node via ssh as shown above, without typing the ssh command or the full host name:
shell-prompt: node-top 001
The job-top command takes a SLURM job ID and runs top on all of the nodes used by the job. The job-top command below is equivalent to running node-top 001 followed by node-top 002:
shell-prompt: squeue
  JOBID PARTITION     NAME     USER ST     TIME  NODES NODELIST(REASON)
   1017     batch     bash   oleary  R  2:44:00      1 compute-001
   1031     batch ARM97-pa  roberts  R  1:50:55      2 compute-[001-002]
   1032     batch   sbatch     joea  R  1:35:19      1 compute-002
   1034     batch   sbatch     joea  R  1:05:59      1 compute-002
   1035     batch   sbatch     joea  R    50:08      1 compute-003
   1036     batch   sbatch     joea  R    47:13      1 compute-004
   1041     batch     bash   oleary  R    36:41      1 compute-001
   1042     batch     bash  roberts  R     0:09      1 compute-001

shell-prompt: job-top 1031
The purpose of this section is to provide the reader with a quick start in job scheduling using the most common tools. The full details of job submission are beyond the scope of this document. For more information, see the SLURM website, http://slurm.schedmd.com/, and the man pages for individual SLURM commands.
man sbatch
Submitting jobs involves specifying a number of job parameters such as the number of cores, the job name (which is used by other SLURM commands), the name(s) of the output file(s), etc.
In order to document all of this information and make it easy to resubmit the same job, the parameters are usually incorporated into a submission script. Using a script saves a lot of typing when you want to resubmit the same job, and also fully documents the job parameters.
A submission script is an ordinary shell script, with some directives inserted to provide information to the scheduler. For SLURM, the directives are specially formatted shell comments beginning with "#SBATCH".
The ONLY difference between an sbatch submission script and a script that you would run on any Unix laptop or workstation is the #SBATCH directives.
You can develop and test a script to run your analyses or models on any Unix system. To use it on a SLURM cluster, you need only add the appropriate #SBATCH directives and possibly alter some command arguments to enable parallel execution.
Submission scripts are submitted with the sbatch command. The script may contain #SBATCH options to define the job, regular Unix commands, and srun or other commands to run programs in parallel.
Suppose we have the following text in a file called hostname.sbatch:
#!/usr/bin/env bash

# A SLURM directive
#SBATCH --job-name=hostname

# A command to be executed on the scheduled node.
# Prints the host name of the node running this script.
hostname
The script is submitted to the SLURM scheduler as a command line argument to the sbatch command:
shell-prompt: sbatch hostname.sbatch
The SLURM scheduler finds a free core on a compute node, reserves it, and then remotely executes hostname.sbatch on that compute node.
Recall from Chapter 4, Unix Shell Scripting that everything in a shell script from a '#' character to the end of the line is considered a comment by the shell, and ignored.
However, comments beginning with #SBATCH, while ignored by the shell, are interpreted as directives by sbatch.
The directives within the script provide command line flags to sbatch. For instance, the line
#SBATCH --mem-per-cpu=10
causes sbatch to behave as if you had typed
shell-prompt: sbatch --mem-per-cpu=10 hostname.sbatch
By putting these comments in the script, you eliminate the need to remember them and retype them every time you run the job. It's generally best to put all sbatch flags in the script rather than type any of them on the command line, so that you have an exact record of how the job was started. This will help you determine what went wrong if there are problems, and allow you to reproduce the results at a later date.
##SBATCH This line is ignored by sbatch
#SBATCH This line is interpreted by sbatch
Type in the hostname.sbatch script shown above and submit it to the scheduler using sbatch. Then check the status with squeue and view the output and error files.
#SBATCH --output=standard-output-file
#SBATCH --error=standard-error-file
#SBATCH --nodes=min[-max]
#SBATCH --ntasks=tasks
#SBATCH --array=list
#SBATCH --cpus-per-task=N
#SBATCH --exclusive
#SBATCH --mem=MB
#SBATCH --mem-per-cpu=MB
#SBATCH --partition=name
The --output and --error flags control the names of the files to which the standard output and standard error of the processes are redirected. If omitted, both standard output and standard error are written to slurm-JOBID.out.
The --ntasks flag indicates how many tasks the job will run, usually for multicore jobs.
The --nodes flag specifies how many nodes (not tasks) to allocate for the job. We can use this to control how many tasks are run on each node. For example, if a job consists of 20 I/O-intensive processes, we would not want many of them running on the same node and competing for the local disk. In this case, we can specify --nodes=20 --ntasks=20 to force each process onto a separate node.
The --cpus-per-task=N flag indicates that we need N cpus on the same node. Using --ntasks=4 --cpus-per-task=3 will indicate 4 tasks using 3 cores each, in effect requesting 12 cores in groups of 3 per node.
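Written as directives in a submission script, that request would look like this:

#SBATCH --ntasks=4
#SBATCH --cpus-per-task=3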
The --exclusive flag indicates that the job cannot share nodes with other jobs. This is typically used for shared-memory parallel jobs to maximize the number of cores available to the job. It may also be used for jobs with high memory requirements, although it is better to simply specify the memory requirements using --mem or --mem-per-cpu.
The --mem=MB flag indicates the amount of memory needed on each node used by the job, in megabytes.
The --mem-per-cpu=MB flag indicates the amount of memory needed by each process within a job, in megabytes.
The --partition=name flag indicates the partition (set of nodes) on which the job should run. Simply run sinfo to see a list of available partitions.
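To illustrate how several of these flags fit together, here is a hypothetical submission script skeleton. The job name, file names, partition, and program are placeholders, not from a real job; adjust them for your own cluster and software:

#!/usr/bin/env bash

#SBATCH --job-name=example
#SBATCH --output=example.stdout
#SBATCH --error=example.stderr
#SBATCH --partition=batch
#SBATCH --nodes=20
#SBATCH --ntasks=20

srun ./io-intensive-program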
When using a cluster, it is important to develop a feel for the resources required by your jobs, and inform the scheduler as accurately as possible what will be needed in terms of CPU time, memory, etc.
If a user does not specify a given resource requirement, the scheduler uses default limits. Default limits are deliberately set low to encourage users to provide an estimate of required resources for all non-trivial jobs. This protects other users from being blocked by long-running jobs for which the scheduler would otherwise reserve far more memory and other resources than they actually need.
The --mem=MB flag indicates that the job requires MB megabytes of memory per node.
The --mem-per-cpu=MB flag indicates that the job requires MB megabytes per core.
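For example (a hypothetical request), the directives below would allow each of the 8 cores up to 500 megabytes, or 4,000 megabytes in total across the job:

#SBATCH --ntasks=8
#SBATCH --mem-per-cpu=500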
A batch serial submission script need contain only optional flags, such as the job name and output file names, and one or more commands.
#!/usr/bin/env bash

hostname
Interactive jobs involve running programs under the scheduler on compute nodes, where the user can interact directly with the process. I.e., output is displayed on their terminal instead of being redirected to files (as controlled by the sbatch --output and --error flags) and input can be taken from the keyboard.
The use of interactive jobs on a cluster is uncommon, but sometimes useful.
For example, an interactive shell environment might be useful when porting a new program to the cluster. This can be used to do basic compilation and testing in the compute node environment before submitting larger jobs. Programs should not be compiled or tested on the head node and users should not ssh directly to a compute node for this purpose, as this would circumvent the scheduler, causing their compilations, etc. to overload the node.
On SPCM clusters, a convenience script called slurm-shell is provided to make it easy to start an interactive shell under the scheduler. It requires the number of cores and the amount of memory to allocate as arguments. The memory specification is in mebibytes by default, but can be followed by a 'g' to indicate gibibytes:
shell-prompt: slurm-shell 1 500
==========================================================================
Note: slurm-shell is provided solely as a convenience for typical users.

If you would like an interactive shell with additional memory, more than
1 CPU, etc., you can use salloc and srun directly.  Run

    more /usr/local/bin/slurm-shell

for a basic example and

    man salloc
    man srun

for full details.
==========================================================================
 1:05PM  up 22 days,  2 hrs, 0 users, load averages: 0.00, 0.00, 0.00
To repeat the last command in the C shell, type "!!".
                -- Dru <genesis@istar.ca>
FreeBSD compute-001. bacon ~ 401:
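As the note above suggests, salloc and srun can also be used directly to obtain an interactive shell. A minimal sketch, which requests one core and 500 mebibytes and then starts a shell on the allocated node (flags may need adjusting for your cluster's configuration):

shell-prompt: salloc --ntasks=1 --mem-per-cpu=500 srun --pty bash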
A job array is a set of independent processes all started by a single job submission. The entire job array can be treated as a single job by SLURM commands such as squeue and scancel.
A batch parallel submission script looks almost exactly like a batch serial script. There are a couple of ways to run a batch parallel job. It is possible to run a job array using --ntasks and an srun command in the sbatch script. However, the --array flag is far more convenient for most purposes.
The --array flag is followed by an index specification which allows the user to specify any set of task IDs for each job in the array. The specification can use '-' to specify a range of task IDs, commas to specify an arbitrary list, or both. For example, to run a job array with task IDs ranging from 1 to 5, we would use the following:
Some SLURM clusters use an operating system feature known as task affinity to improve performance. The feature binds each process to a specific core, so that any data in the cache RAM of that core does not have to be flushed and reloaded into the cache of a different core. This can dramatically improve performance for certain memory-intensive processes, though it won't make any noticeable difference for many jobs.
In order to activate task affinity, you must use srun or another command to launch the processes.
#!/usr/bin/env bash

#SBATCH --array=1-5

# OK, but will not use task affinity
hostname
#!/usr/bin/env bash

#SBATCH --array=1-5

# Activates task affinity, so each hostname process will stick to one core
srun hostname
If we wanted task IDs of 1, 2, 3, 10, 11, and 12, we could use the following:
#!/usr/bin/env bash

#SBATCH --array=1-3,10-12

srun hostname
The reasons for selecting specific task IDs are discussed in the section called “Environment Variables”.
Another advantage of using --array is that we don't need all the cores available at once in order for the job to run. For example, if we submit a job using --array=1-200 while there are only 50 cores free on the cluster, it will begin running as many processes as possible immediately and run more as more cores become free. In contrast, a job started with --ntasks will remain in a pending state until there are 200 cores free.
Furthermore, we can explicitly limit the number of array jobs run at once with a simple addition to the flag. Suppose we want to run 10,000 processes, but be nice to other cluster users and only use 100 cores at a time. All we need to do is use the following:
#SBATCH --array=1-10000%100
Finally, we can run arrays of multithreaded jobs by adding --cpus-per-task. For example, the BWA sequence aligner allows the user to specify multiple threads using the -t flag:
#SBATCH --array=1-100 --nodes=1 --cpus-per-task=8

srun bwa mem -t $SLURM_CPUS_PER_TASK reference.fa input.fa > output.sam
Copy your hostname.sbatch to hostname-parallel.sbatch, modify it to run 5 processes, and submit it to the scheduler using sbatch. Then check the status with squeue and view the output and error files.
A multi-core job is any job that runs a parallel program. This means that it uses multiple processes that communicate and cooperate with each other in some way. There is no such communication or cooperation in batch parallel (also known as embarrassingly parallel) jobs, which use SLURM's --array flag.
Cores are allocated for multi-core jobs using the --ntasks flag.
Using --ntasks ONLY tells the SLURM scheduler how many cores to reserve. It does NOT tell your parallel program how many cores or which compute nodes to use. It is the responsibility of the user to ensure that the command(s) in their script utilize the correct number of cores and the correct nodes within the cluster.
Some systems, such as OpenMPI (discussed in the section called “Message Passing Interface (MPI)”), will automatically detect this information from the SLURM environment and dispatch the correct number of processes to each allocated node.
If you are not using OpenMPI, you may need to specify the number of cores and/or which nodes to use in the command(s) that run your computation. Many commands have a flag argument such as -n or -np for this purpose.
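For example, a hypothetical program that takes its process count through an -n flag might be launched like this (the program name, flag, and input file are placeholders; check your software's documentation for the actual interface):

#!/usr/bin/env bash

#SBATCH --ntasks=8

# Pass the number of allocated cores to the program's own process-count flag
./my-parallel-program -n $SLURM_NTASKS input-file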
OpenMP (not to be confused with OpenMPI) is often used to run multiple threads on the same node (shared memory parallelism, as discussed in the section called “Shared Memory and Multi-Core Technology” and the section called “Shared Memory Parallel Programming with OpenMP”).
OpenMP software will look for the OMP_NUM_THREADS environment variable to indicate how many cores to use. It is the user's responsibility to ensure that OMP_NUM_THREADS is set when using OpenMP software.
When running OpenMP programs under SLURM, we can ensure that the right number of cores is used by using the --ntasks flag and assigning the value of SLURM_NTASKS to OMP_NUM_THREADS.
#!/bin/sh -e

#SBATCH --ntasks=4

OMP_NUM_THREADS=$SLURM_NTASKS
export OMP_NUM_THREADS
OpenMP-based-program arg1 arg2 ...
Scheduling MPI jobs in SLURM is much like scheduling batch parallel jobs.
MPI programs cannot be executed directly from the command line as we do with normal programs and scripts. Instead, we must use the mpirun or srun command to start up MPI programs.
mpirun [mpirun flags] mpi-program [mpi-program arguments]
For MPI and other multicore jobs, we use the sbatch --ntasks flag to indicate how many cores we want to use. Unlike --array, sbatch with --ntasks runs the sbatch script on only one node, and it is up to the commands in the script (such as mpirun or srun) to dispatch the parallel processes to all the allocated cores.
Commands like mpirun and srun can retrieve information about which cores have been allocated from the environment handed down by their parent process, the SLURM scheduler. The details about SLURM environment variables are discussed in the section called “Environment Variables”.
#!/bin/sh

#SBATCH --ntasks=2

PATH=${PATH}:/usr/local/mpi/openmpi/bin
export PATH
mpirun ./mpi_bench
When running MPI jobs, it is often desirable to have as many processes as possible running on the same node. Message passing is generally faster between processes on the same node than between processes on different nodes, because messages passed within the same node need not cross the network. If you have a very fast network such as Infiniband or a low-latency Ethernet, the difference may be marginal, but on more ordinary networks such as gigabit Ethernet, the difference can be enormous.
SLURM by default will place as many processes as possible on the same node. We can also use --exclusive to ensure that "leftover" cores on a partially busy node will not be used for our MPI jobs. This may improve communication performance, but may also delay the start of the job until enough nodes are completely empty.
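A hypothetical MPI submission script using --exclusive, based on the mpi_bench example above, might look like this:

#!/bin/sh

#SBATCH --ntasks=24
#SBATCH --exclusive

mpirun ./mpi_bench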
SLURM sets a number of environment variables when a job is started. These variables can be used in the submission script and within other scripts or programs executed as part of the job.
shell-prompt: srun printenv | grep SLURM
SLURM_SRUN_COMM_PORT=27253
SLURM_TASKS_PER_NODE=1
SLURM_NODELIST=compute-001
SLURM_JOB_NUM_NODES=1
SLURM_NNODES=1
SLURM_STEPID=0
SLURM_STEP_ID=0
SLURM_JOBID=81
SLURM_JOB_ID=81
SLURM_DISTRIBUTION=cyclic
SLURM_NPROCS=1
SLURM_NTASKS=1
SLURM_JOB_CPUS_PER_NODE=1
SLURM_JOB_NAME=/usr/bin/printenv
SLURM_SUBMIT_HOST=finch.cs.uwm.edu
SLURM_SUBMIT_DIR=/usr/home/bacon
SLURM_PRIO_PROCESS=0
SLURM_STEP_NODELIST=compute-001
SLURM_STEP_NUM_NODES=1
SLURM_STEP_NUM_TASKS=1
SLURM_STEP_TASKS_PER_NODE=1
SLURM_STEP_LAUNCHER_PORT=27253
SLURM_SRUN_COMM_HOST=192.168.0.2
SLURM_TOPOLOGY_ADDR=compute-001
SLURM_TOPOLOGY_ADDR_PATTERN=node
SLURM_CPUS_ON_NODE=1
SLURM_TASK_PID=5945
SLURM_NODEID=0
SLURM_PROCID=0
SLURM_LOCALID=0
SLURM_LAUNCH_NODE_IPADDR=192.168.0.2
SLURM_GTIDS=0
SLURM_CHECKPOINT_IMAGE_DIR=/usr/home/bacon
SLURMD_NODENAME=compute-001
The SLURM_JOB_NAME variable can be useful for generating output file names within a program, among other things.
When submitting a job array using --array, the submission script is executed once for each task in the array. To distinguish between the tasks in this type of job array, we examine SLURM_ARRAY_TASK_ID. This variable is set to a different value for each task, drawn from the specification given with --array.
For example, suppose we have 100 input files named input-1.txt through input-100.txt. We could use the following script to process them:
#!/usr/bin/env bash

#SBATCH --array=1-100

./myprog < input-$SLURM_ARRAY_TASK_ID.txt \
    > output-$SLURM_ARRAY_TASK_ID.txt
Suppose our input files are not numbered sequentially, but according to some other criteria, such as a set of prime numbers. In this case, we would simply change the specification in --array:
#!/usr/bin/env bash

#SBATCH --array=2,3,5,7,11,13

./myprog < input-$SLURM_ARRAY_TASK_ID.txt \
    > output-$SLURM_ARRAY_TASK_ID.txt
If you determine that a job is not behaving properly (by reviewing partial output, for example), you can terminate it using scancel, which takes a job ID as a command line argument.
shell-prompt: sbatch bench.sbatch
Submitted batch job 90
shell-prompt: squeue
  JOBID PARTITION     NAME     USER ST     TIME  NODES NODELIST(REASON)
     90 default-p bench.sl    bacon  R     0:03      2 compute-[001-002]
shell-prompt: scancel 90
shell-prompt: squeue
  JOBID PARTITION     NAME     USER ST     TIME  NODES NODELIST(REASON)
Occasionally, a SLURM job may fail and leave processes running on the compute nodes. These are called stray processes, since they are no longer under the control of the scheduler.
Do not confuse stray processes with zombie processes. A zombie process is a process that has terminated but has not been reaped; i.e., it is still listed by the ps command because its parent process has not yet acknowledged its termination. Zombie processes are finished and do not consume any resources, so we need not worry about them.
Stray processes are easy to detect on nodes where you have no jobs running. If squeue or slurm-cluster-load do not show any of your jobs using a particular node, but top or node-top show processes under your name, then you have some strays. Simply ssh to that compute node and terminate them with the standard Unix kill or killall commands.
If you have jobs running on the same node as your strays, then detecting the strays may be difficult. If the strays are running a different program than your legitimate job processes, then they will be easy to spot. If they are running the same program as your legitimate job processes, then they will likely have a different run time than your legitimate processes. Be very careful to ensure that you kill the strays and not your active job in this situation.
Linux login.mortimer bacon ~ 414: squeue -u bacon
  JOBID PARTITION     NAME     USER ST     TIME  NODES NODELIST(REASON)

Linux login.mortimer bacon ~ 415: node-top 003

top - 09:33:55 up 2 days, 10:44,  1 user,  load average: 16.02, 15.99, 14.58
Tasks: 510 total,  17 running, 493 sleeping,   0 stopped,   0 zombie
Cpu(s):  2.3%us,  0.0%sy,  0.0%ni, 97.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  49508900k total,  2770060k used, 46738840k free,    62868k buffers
Swap: 33554428k total,        0k used, 33554428k free,   278916k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 7530 bacon     20   0  468m 112m 4376 R 81.5  0.2  37:55.11 mpi-bench
 7531 bacon     20   0  468m 112m 4524 R 81.5  0.2  37:55.44 mpi-bench
 7532 bacon     20   0  468m 112m 4552 R 81.5  0.2  37:55.27 mpi-bench
 7533 bacon     20   0  468m 112m 4552 R 81.5  0.2  37:55.83 mpi-bench
 7537 bacon     20   0  468m 112m 4556 R 81.5  0.2  37:55.71 mpi-bench
 7548 bacon     20   0  468m 112m 4556 R 81.5  0.2  37:55.51 mpi-bench
 7550 bacon     20   0  468m 112m 4552 R 81.5  0.2  37:55.87 mpi-bench
 7552 bacon     20   0  468m 112m 4532 R 81.5  0.2  37:55.82 mpi-bench
 7554 bacon     20   0  468m 112m 4400 R 81.5  0.2  37:55.83 mpi-bench
 7534 bacon     20   0  468m 112m 4548 R 79.9  0.2  37:55.78 mpi-bench
 7535 bacon     20   0  468m 112m 4552 R 79.9  0.2  37:55.84 mpi-bench
 7536 bacon     20   0  468m 112m 4552 R 79.9  0.2  37:55.87 mpi-bench
 7542 bacon     20   0  468m 112m 4552 R 79.9  0.2  37:55.86 mpi-bench
 7544 bacon     20   0  468m 112m 4552 R 79.9  0.2  37:55.84 mpi-bench
 7546 bacon     20   0  468m 112m 4544 R 79.9  0.2  37:55.88 mpi-bench
 7529 bacon     20   0  468m 110m 4408 R 78.3  0.2  37:41.46 mpi-bench
    4 root      20   0     0    0    0 S  1.6  0.0   0:00.04 ksoftirqd/0
 7527 bacon     20   0  146m 4068 2428 S  1.6  0.0   0:11.39 mpirun
 7809 bacon     20   0 15280 1536  892 R  1.6  0.0   0:00.07 top
    1 root      20   0 19356 1536 1240 S  0.0  0.0   0:02.03 init
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.03 kthreadd

Connection to compute-003 closed.

Linux login.mortimer bacon ~ 416: ssh compute-003 kill 7530
Linux login.mortimer bacon ~ 417: ssh compute-003 killall mpi-bench
If the normal kill command doesn't work, you can use kill -9 to do a sure kill.
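For example, using the node and process ID from the session above:

shell-prompt: ssh compute-003 kill -9 7530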
Unlike many other schedulers, SLURM makes output files available for viewing while the job is running, so it does not require a "peek" command. The default output file, or the files specified by --output and --error, are updated regularly while a job is running and can be viewed with standard Unix commands such as more.
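For example, you might follow a job's output as it is written (substituting your actual job ID in the file name; tail -f keeps printing new lines as they are appended):

shell-prompt: tail -f slurm-JOBID.out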
The sacct command is used to view accounting statistics on completed jobs. With no command-line arguments, sacct prints a summary of your past jobs:
shell-prompt: sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
70           build.slu+ default-p+     (null)          1  COMPLETED      0:0
71           build.slu+ default-p+     (null)          1  COMPLETED      0:0
72           bench.slu+ default-p+     (null)         24  COMPLETED      0:0
...
184            hostname default-p+     (null)         40  COMPLETED      0:0
184.0          hostname                (null)         40  COMPLETED      0:0
185          bench-fre+ default-p+     (null)         48  CANCELLED      0:0
185.0             orted                (null)          3     FAILED      1:0
186          env.sbatch default-p+     (null)          1  COMPLETED      0:0
For detailed information on a particular job, we can use the -j flag to specify a job id and the -o flag to specify which information to display:
shell-prompt: sacct -o alloccpus,nodelist,elapsed,cputime -j 117
 AllocCPUS        NodeList    Elapsed    CPUTime
---------- --------------- ---------- ----------
        12     compute-001   00:00:29   00:05:48
On SPCM clusters, the above command is provided in a convenience script called slurm-job-stats:
shell-prompt: slurm-job-stats 117
 AllocCPUS        NodeList    Elapsed    CPUTime
---------- --------------- ---------- ----------
        12     compute-001   00:00:29   00:05:48
For more information on sacct, run "man sacct" or view the online SLURM documentation.
Detailed information on currently running jobs can be displayed using scontrol.
shell-prompt: scontrol show jobid 10209
JobId=10209 JobName=bench-freebsd.sbatch
   UserId=bacon(4000) GroupId=bacon(4000)
   Priority=4294901512 Nice=0 Account=(null) QOS=(null)
   JobState=FAILED Reason=NonZeroExitCode Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=213:0
   RunTime=00:00:02 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2015-09-09T11:25:31 EligibleTime=2015-09-09T11:25:31
   StartTime=2015-09-09T11:25:32 EndTime=2015-09-09T11:25:34
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=default-partition AllocNode:Sid=login:7345
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=compute-[003-006] BatchHost=compute-003
   NumNodes=4 NumCPUs=48 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=1024M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/share1/Data/bacon/Facil/Software/Src/Bench/MPI/bench-freebsd.sbatch
   WorkDir=/share1/Data/bacon/Facil/Software/Src/Bench/MPI
   StdErr=/share1/Data/bacon/Facil/Software/Src/Bench/MPI/slurm-10209.out
   StdIn=/dev/null
   StdOut=/share1/Data/bacon/Facil/Software/Src/Bench/MPI/slurm-10209.out
On SPCM clusters, the above command is provided in a convenience script called slurm-job-status:
shell-prompt: squeue
  JOBID PARTITION     NAME     USER ST     TIME  NODES NODELIST(REASON)
  20537     batch bench.sb    bacon  R     0:02      1 compute-001

shell-prompt: slurm-job-status 20537
JobId=20537 JobName=bench.sbatch
   UserId=bacon(4000) GroupId=bacon(4000)
   Priority=4294900736 Nice=0 Account=(null) QOS=(null)
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:12 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2016-07-18T11:38:31 EligibleTime=2016-07-18T11:38:31
   StartTime=2016-07-18T11:38:31 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=batch AllocNode:Sid=login:7485
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=compute-001 BatchHost=compute-001
   NumNodes=1 NumCPUs=12 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=12,mem=3072,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=256M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/share1/Data/bacon/Testing/mpi-bench/trunk/bench.sbatch freebsd
   WorkDir=/share1/Data/bacon/Testing/mpi-bench/trunk
   StdErr=/share1/Data/bacon/Testing/mpi-bench/trunk/slurm-20537.out
   StdIn=/dev/null
   StdOut=/share1/Data/bacon/Testing/mpi-bench/trunk/slurm-20537.out
   Power= SICP=0
If you need to submit a series of jobs in sequence, where one job begins after another has completed, the simplest approach is to simply submit job N+1 from the sbatch script for job N.
It's important to make sure that the current job completed successfully before submitting the next, to avoid wasting resources. It is up to you to determine the best way to verify that a job was successful. Examples might include grepping the log file for some string indicating success, or making the job create a marker file using the touch command after a successful run. If the command used in your job returns a Unix-style exit status (0 for success, non-zero on error), then you can simply use the shell's exit-on-error feature to make your script exit when any command fails. Below is a template for scripts that might run a series of jobs.
#!/bin/sh

#SBATCH job-parameters

set -e              # Set exit-on-error

job-command         # This script will exit here if job-command failed
sbatch job2.sbatch  # Executed only if job-command succeeded
Write and submit a batch-serial SLURM script called list-etc.sbatch that prints the host name of the compute node on which it runs and a long listing of the /etc directory on that node. The script should store the output of the commands in list-etc.stdout and error messages in list-etc.stderr in the directory from which the script was submitted. The job should appear in squeue listings under the name "list-etc".
Quickly check the status of your job after submitting it.
Copy your list-etc.sbatch script to list-etc-parallel.sbatch, and modify it so that it runs the hostname and ls commands on 10 cores instead of just one. The job should produce a separate output file for each process, named list-etc-parallel.o<jobid>-<arrayid>, and a separate error file for each process, named list-etc-parallel.e<jobid>-<arrayid>.
Quickly check the status of your job after submitting it.