Before reading this chapter, you should be familiar with basic Unix concepts (Chapter 3, Using Unix), the Unix shell (the section called “Command Line Interfaces (CLIs): Unix Shells”), redirection (the section called “Redirection and Pipes”), shell scripting (Chapter 4, Unix Shell Scripting), and job scheduling (Chapter 7, Job Scheduling).
For complete information, see the official TORQUE documentation at http://www.adaptivecomputing.com/resources/docs/torque/3-0-2/index.php.
The goal of this section is to provide a quick introduction to the PBS scheduler. Specifically, this section covers the current mainstream implementation of PBS, known as TORQUE.
Before scheduling any jobs through PBS, it is often useful to check the status of the nodes. Knowing how many cores are available may influence your decision on how many cores to request for your next job.
For example, if only 50 cores are available at the moment, and your job requires 200 cores, the job will have to wait in the queue until 200 cores are free. You may end up getting your results sooner if you reduce the number of cores to 50 or less so that the job can begin right away.
The pbsnodes command shows information about the nodes in the cluster. This can be used to determine the total number of cores in the cluster, cores in use, etc.
FreeBSD peregrine bacon ~/Facil 408: pbsnodes | more
compute-001
     state = job-exclusive
     np = 12
     ntype = cluster
     jobs = 0/1173[20].peregrine.hpc.uwm.edu, 1/1173[20].peregrine.hpc.uwm.edu,
         2/1173[20].peregrine.hpc.uwm.edu, 3/1173[20].peregrine.hpc.uwm.edu,
         4/1173[20].peregrine.hpc.uwm.edu, 5/1173[20].peregrine.hpc.uwm.edu,
         6/1173[20].peregrine.hpc.uwm.edu, 7/1173[20].peregrine.hpc.uwm.edu,
         8/1173[20].peregrine.hpc.uwm.edu, 9/1173[20].peregrine.hpc.uwm.edu,
         10/1173[20].peregrine.hpc.uwm.edu, 11/1173[20].peregrine.hpc.uwm.edu
     status = rectime=1331737185,varattr=,jobs=1173[20].peregrine.hpc.uwm.edu,
         state=free,netload=? 15201,gres=,loadave=2.00,ncpus=12,
         physmem=32515040kb,availmem=? 15201,totmem=? 15201,idletime=6636140,
         nusers=3,nsessions=3,sessions= 72397 13348 1005,
         uname=FreeBSD compute-001.local 8.2-RELEASE-p3 FreeBSD 8.2-RELEASE-p3
         #0: Tue Sep 27 18:45:57 UTC 2011
         root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC amd64,
         opsys=freebsd5
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0

[text removed for brevity]

compute-08
     state = free
     np = 12
     ntype = cluster
     status = rectime=1331737271,varattr=,jobs=,state=free,netload=? 15201,
         gres=,loadave=0.00,ncpus=12,physmem=32515040kb,availmem=? 15201,
         totmem=? 15201,idletime=6642605,nusers=2,nsessions=2,
         sessions= 18723 1034,
         uname=FreeBSD compute-08.local 8.2-RELEASE-p3 FreeBSD 8.2-RELEASE-p3
         #0: Tue Sep 27 18:45:57 UTC 2011
         root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC amd64,
         opsys=freebsd5
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0
As a convenience, a script called cluster-load is installed on Peregrine. This script extracts information from pbsnodes and displays the current cluster load in an abbreviated, easy-to-read format.
FreeBSD peregrine bacon ~/Facil 406: cluster-load

Nodes in use:   5
Cores in use:   60
Total cores:    96
Free cores:     36
Load:           62%
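The cluster-load script is a local convenience tool, not part of TORQUE. As a rough sketch of how such a script might work, the following tallies the np and jobs fields of pbsnodes output with awk; the actual script installed on Peregrine may be implemented differently.

#!/bin/sh

# Rough sketch of a cluster-load-style script (for illustration only; the
# real script on Peregrine may differ).  Tallies cores from pbsnodes output.
pbsnodes | awk '
    /^ *np = /   { total += $3 }                # cores on this node
    /^ *jobs = / { used += NF - 2; nodes++ }    # one field per core/job entry
    END {
        printf "Nodes in use: %d\n", nodes
        printf "Cores in use: %d\n", used
        printf "Total cores:  %d\n", total
        printf "Free cores:   %d\n", total - used
        if (total > 0)
            printf "Load:         %d%%\n", used * 100 / total
    }'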
In addition to the command line tools used to control and monitor jobs, there are also web-based tools for monitoring clusters and grids in a more visual manner.
The Ganglia Resource Monitor is a web-based monitoring tool that provides statistics about a collection of computers on a network.
The status of the student cluster, Peregrine, can be viewed at http://www.peregrine.hpc.uwm.edu/ganglia/.
The status of the faculty research cluster, Mortimer, can be viewed at http://www.mortimer.hpc.uwm.edu/ganglia/.
You can check on the status of your running jobs using qstat.
peregrine: qstat
Job id           Name             User   Time Use S Queue
---------------- ---------------- ------ -------- - -----
52[].peregrine   ...allel-example bacon         0 C batch
The Job id column shows the numeric job ID. Brackets ([]) after the ID indicate a job array. The S column shows the current status of the job. The most common status flags are 'Q' for queued (waiting to start), 'R' for running, and 'C' for completed.
The scheduler retains job status information for a short time after job completion. The amount of time is configured by the systems manager, so it will vary from site to site.
The qstat -f flag requests more detailed information. Since it produces a lot of output, it is typically piped through more and/or used with a specific job ID:
peregrine: qstat -f 2151 | more
Job Id: 2151.peregrine.hpc.uwm.edu
    Job_Name = list-etc.pbs
    Job_Owner = bacon@peregrine.hpc.uwm.edu
    resources_used.cput = 00:00:00
    resources_used.mem = 0kb
    resources_used.vmem = 0kb
    resources_used.walltime = 00:00:00
    job_state = C
    queue = batch
    server = peregrine.hpc.uwm.edu
    Checkpoint = u
    ctime = Mon Jun 4 14:44:39 2012
    Error_Path = peregrine.hpc.uwm.edu:/home/bacon/Computer-Training/Common/li
        st-etc.stderr
    exec_host = compute-03/0
    exec_port = 15003
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Mon Jun 4 14:44:39 2012
    Output_Path = peregrine.hpc.uwm.edu:/home/bacon/Computer-Training/Common/l
        ist-etc.stdout
--More--(byte 720)
The qstat command has many flags for controlling what it reports. Run man qstat for full details.
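For example, a few forms that are often handy (these are standard TORQUE qstat flags, shown here with hypothetical arguments):

peregrine: qstat -u bacon    # show only jobs belonging to user bacon
peregrine: qstat -n 2151     # also list the nodes allocated to job 2151
peregrine: qstat -q          # summarize the queues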
Using the output from qstat -f, we can see which compute nodes are being used by a job.
We can then examine the processes on a given node using a remotely executed top command:
peregrine: ssh -t compute-003 top
The -t flag is important here, since it tells ssh to open a connection with full terminal control, which is needed by top to update your terminal screen.
On Peregrine, a convenience script is provided to save typing:
peregrine: topnode 003
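Like cluster-load, topnode is a local add-on rather than part of PBS. A wrapper along the following lines is all that is needed (a sketch only; the real script may differ):

#!/bin/sh

# Sketch of a topnode-style wrapper: run top on the named compute node.
# The compute-NNN naming convention is the one used on Peregrine.
ssh -t compute-$1 top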
The purpose of this section is to provide the reader a quick start in job scheduling using the most common and useful tools. The full details of job submission are beyond the scope of this document.
Submitting jobs involves specifying a number of job parameters such as the number of cores, the job name (which is displayed by qstat), the name(s) of the output file(s), etc.
In order to record all of this information and make it easy to resubmit the same job, it is usually incorporated into a submission script. Using a script saves you a lot of typing when you want to resubmit the same job, and also fully documents the job parameters.
A submission script is an ordinary shell script, with some directives inserted to provide information to the scheduler. For PBS, the directives are specially formatted shell comments beginning with "#PBS".
Suppose we have the following text in a file called hostname.pbs:
#!/usr/bin/env bash

# A PBS directive
#PBS -N hostname

# A command to be executed on the scheduled node(s)
# Prints the host name of the node running this script.
hostname
The script is submitted to the PBS scheduler as a command line argument to the qsub command:
peregrine: qsub hostname.pbs
The qsub command, which is part of the PBS scheduler, finds a free core on a compute node, reserves it, and then runs the script on the compute node using ssh or some other remote execution command.
Comments beginning with #PBS are interpreted as directives by qsub and as any other comment by the shell. Recall that the shell ignores anything on a line after a '#'.
The command(s) in the script are dispatched to the node(s) containing the core(s) allocated by PBS, using ssh, rsh or any other remote shell PBS is configured to use.
The directives within the script provide command line flags to qsub. For instance, the line
#PBS -N hostname
causes qsub to behave as if you had typed
peregrine: qsub -N hostname hostname.pbs
By putting these comments in the script, you eliminate the need to remember them and retype them every time you run the job. It's generally best to put all qsub flags in the script rather than type any of them on the command line, so that you have an exact record of how the job was started. This will help you determine what went wrong if there are problems, and allow you to reproduce the results at a later time without worrying about whether you did something different.
The script itself can be any valid Unix script, using the shell of your choice. Since all Unix shells interpret the '#' as the beginning of a comment, the #PBS lines will be interpreted only by qsub, and ignored by the shell.
##PBS This line is ignored by qsub
#PBS This line is interpreted by qsub
It's a good idea to use a modern shell such as bash, ksh, or tcsh, simply because they are more user-friendly than sh or csh.
Type in the hostname.pbs script shown above and submit it to the scheduler using qsub. Then check the status with qstat and view the output and error files.
The most commonly used qsub flags are summarized below:

#PBS -N job-name
#PBS -o standard-output-file    (default = <job-name>.o<job-id>)
#PBS -e standard-error-file     (default = <job-name>.e<job-id>)
#PBS -l resource-requirements
#PBS -M email-address
#PBS -t first-last
The -N flag gives the job a name which will appear in the output of qstat. Choosing a good name makes it easier to keep tabs on your running jobs.
The -o and -e flags control the names of the files to which the standard output and standard error of the processes are redirected. If omitted, a default name is generated using the job name and job ID.
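For example, directives like the following (the filenames here are arbitrary) collect the output and error streams in fixed, predictable files rather than the defaults:

#PBS -N sort-genome
#PBS -o sort-genome.stdout
#PBS -e sort-genome.stderr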
The -l flag is used to specify resource requirements for the job, which are discussed in the section called “PBS Resource Requirements”.
The -t flag indicates the starting and ending subscripts for a job array. This is discussed in the section called “Batch Parallel Jobs (Job Arrays)”.
When using a cluster, it is important to develop a feel for the resources required by your jobs, and inform the scheduler as accurately as possible what will be needed in terms of CPU time, memory, etc.
This allows the scheduler to maximize the utilization of precious resources and thereby provide the best possible run times for all users.
If a user does not specify a given resource requirement, the scheduler uses default limits. Default limits are deliberately set low, which encourages users to provide an estimate of required resources for all non-trivial jobs. This protects other users: if the defaults were generous, a long-running job that actually needs far less memory and other resources would still cause the scheduler to reserve more than necessary, blocking other jobs.
PBS resource requirements are specified with the qsub -l flag.
#PBS -l procs=count
#PBS -l nodes=node-count:ppn=procs-per-node
#PBS -l cput=seconds
#PBS -l cput=hours:minutes:seconds
#PBS -l vmem=size[kb|mb|gb|tb]
#PBS -l pvmem=size[kb|mb|gb|tb]
The procs resource indicates the number of processors (cores) used by the job. This resource must be specified for parallel jobs such as MPI jobs, which are discussed in the section called “MPI (Multi-core) Jobs”. The count cores are allocated by the scheduler according to its own policy configuration. Some cores may be on the same node, depending on the current load distribution.
The nodes and ppn resources allow the user to better control the distribution of cores allocated to a job. For example, if you want 8 cores all on the same node, you could use -l nodes=1:ppn=8. If you want 20 cores, each on a different node, you could use -l nodes=20:ppn=1. The best distribution depends on the nature of both your own job and other jobs running at the time. Spreading a job across more nodes will generally reduce memory and disk contention between processes within the job, but also increase communication cost between processes in an MPI job.
Note that specifying a ppn value equal to the number of cores in a node ensures exclusive access to that node while the job is running. This is especially useful for jobs with high memory or disk requirements, where we want to avoid contending with other users' jobs.
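For example, on a cluster whose nodes have 12 cores each (as in the pbsnodes output shown earlier), a requirement like the following reserves two entire nodes:

# Request 2 whole 12-core nodes; no other jobs can run on these nodes
# while this job is running.  (Assumes 12-core nodes.)
#PBS -l nodes=2:ppn=12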
The cput resource limits the total CPU time used by the job. Most jobs should specify a value somewhat higher than the expected run time, simply to prevent program bugs from consuming excessive cluster resources. As noted in the example above, CPU time may be specified either as a number of seconds, or in the format HH:MM:SS.
Memory requirements can be specified either as the total memory for the job (vmem) or as memory per process within the job (pvmem). The pvmem flag is generally more convenient, since it does not have to be changed when procs, nodes, or ppn is changed.
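Putting these together, a hypothetical job that runs 8 processes, each needing at most 2 gigabytes of virtual memory and expected to finish well within 24 hours of total CPU time, might specify:

#PBS -l procs=8
#PBS -l pvmem=2gb
#PBS -l cput=24:00:00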
A batch serial submission script need only have optional PBS flags such as job name, output file, etc. and one or more commands.
#!/usr/bin/env bash

#PBS -N hostname

hostname
For simple serial jobs, a convenience script called qsubw is provided. This script submits a job, waits for the job to complete, and then immediately displays the output files. This type of operation is convenient for compiling programs and performing other simple tasks that must be scheduled, but for which we intend to wait for completion and immediately perform the next step.
Note that qsubw overrides the -N, -o, and -e flags: any #PBS line containing one of them is ignored entirely, including other flags on the same line. Hence, these flags should not be combined on the same line with other flags in any script run through qsubw.
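Like cluster-load and topnode, qsubw is a local convenience script rather than part of TORQUE. A simplified sketch of the idea is shown below; it assumes the default <job-name>.o<id> / <job-name>.e<id> output naming, whereas the real qsubw uses its own temporary-file scheme (note the .tmp names in the example that follows).

#!/bin/sh

# Sketch of a qsubw-style wrapper: submit a script, wait for the job to
# leave the queued (Q) and running (R) states, then display its output.
# Assumes default output file naming; the real qsubw differs.
script=$1
jobid=$(qsub "$script") || exit 1
echo "Job ID = $jobid"

# Poll qstat until the job is no longer queued or running
while qstat "$jobid" 2> /dev/null | awk 'NR > 2 { print $5 }' | grep -q '[QR]'; do
    sleep 5
done

cat *.o${jobid%%.*} *.e${jobid%%.*} 2> /dev/null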
FreeBSD peregrine bacon ~/Data/Testing/RLA 503: qsubw compile.pbs
Job ID = 298
exec_host = compute-02/0

compile.pbs.tmp.e298:
RLA.cpp: In function 'double Random()':
RLA.cpp:271: warning: integer overflow in expression
RLA.cpp: In function 'void SelectBidAction()':
RLA.cpp:307: warning: integer overflow in expression
FreeBSD peregrine bacon ~/Data/Testing/RLA 504: qsub RLA.pbs
A job array is a set of independent processes all started by a single job submission. The entire job array can be treated as a single job by PBS commands such as qstat and qdel, but individual processes are also jobs in and of themselves, and can therefore be manipulated individually via PBS commands.
A batch parallel submission script looks almost exactly like a batch serial script, but requires just one additional flag:
#!/usr/bin/env bash

#PBS -N hostname-parallel
#PBS -t 1-5

hostname
The -t flag is followed by a list of integer array IDs. The specification of array IDs can be fairly sophisticated, but is usually a simple range.
The example above requests a job consisting of 5 identical processes, all running under the job name hostname-parallel. Each process within the job produces a separate output file with the array ID as a suffix. For example, process 2 produces an output file named hostname-parallel.o51-2 (51 is the job ID, and 2 is the array ID within the job).
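After the example job finishes, we would therefore expect to see output files along these lines (assuming the same job ID of 51):

peregrine: ls hostname-parallel.o*
hostname-parallel.o51-1  hostname-parallel.o51-3  hostname-parallel.o51-5
hostname-parallel.o51-2  hostname-parallel.o51-4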
-t N does not work as advertised in the official TORQUE documentation, which states that it is the same as -t 0-N. In reality, it creates a single job with subscript N.
Copy your hostname.pbs to hostname-parallel.pbs, modify it to run 5 processes, and submit it to the scheduler using qsub. Then check the status with qstat and view the output and error files.
Scheduling MPI jobs is actually much like scheduling batch serial jobs. This may not seem intuitive at first, but once you understand how MPI works, it makes more sense.
MPI programs cannot be executed directly from the command line as we do with normal programs and scripts. Instead, we must use the mpirun command to start up MPI programs.
mpirun [mpirun flags] mpi-program [mpi-program arguments]
Hence, unlike batch parallel jobs, the scheduler does not directly dispatch all of the processes in an MPI job. Instead, the scheduler dispatches a single mpirun command, and the MPI system takes care of dispatching and managing all of the MPI processes that comprise the job.
Since the scheduler is only dispatching one process, but the MPI job may dispatch others, we must add one more item to the submit script to inform the scheduler how many processes MPI will dispatch. In PBS, this is done using a resource specification, which consists of a -l flag followed by one or more resource names and values.
#!/bin/sh

#PBS -N MPI-Example

# Use 48 cores for this job
#PBS -l procs=48

mpirun mpi-program
When running MPI jobs, it is often desirable to have as many processes as possible running on the same node. Message passing is generally faster between processes on the same node than between processes on different nodes, because messages passed within the same node need not cross the network. If you have a very fast network such as Infiniband or 10 gigabit Ethernet, the difference may be marginal, but on more ordinary networks such as gigabit Ethernet, the difference can be enormous.
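For example, assuming 12-core nodes and an MPI installation that is integrated with TORQUE (so that mpirun launches one process per allocated core), the following packs a 24-process job onto just two nodes, keeping most message traffic off the network:

#!/bin/sh

#PBS -N MPI-Packed

# 24 processes packed onto two 12-core nodes (assumes 12-core nodes)
#PBS -l nodes=2:ppn=12

mpirun mpi-program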
PBS sets a number of environment variables when a job is started. These variables can be used in the submission script and within other scripts or programs executed as part of the job.
One of the most important is the PBS_O_WORKDIR variable. By default, PBS runs processes on the compute nodes with the home directory as the current working directory. Most jobs, however, are run from a project directory under or outside the home directory that is shared by all the nodes in the cluster. Usually, the processes on the compute nodes should all run in the same project directory. In order to ensure this, we can either use the -d flag, or add this command to the script before the other command(s):
cd $PBS_O_WORKDIR
PBS_O_WORKDIR is set by the scheduler to the directory from which the job is submitted. Note that this directory must be shared by the submit node and all compute nodes via the same pathname: the scheduler takes the pathname from the submit node and then attempts to cd to it on the compute node(s). Using this variable is more convenient than using the -d flag, since we would have to specify a hard-coded path following -d. If we move the project directory, scripts using -d would have to be modified, while those using cd $PBS_O_WORKDIR will work from any starting directory.
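For comparison, the hard-coded equivalent using -d would look something like this (the path is only an example):

# Must be edited if the project directory ever moves:
#PBS -d /home/bacon/my-project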
The PBS_JOBNAME variable can be useful for generating output filenames within a program, among other things.
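A minimal illustration (myprog is just a placeholder program):

# Derive the output file name from the job name given with -N
./myprog > $PBS_JOBNAME-results.txt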
The PBS_ARRAYID variable is especially useful in job arrays, where each job in the array must use a different input file and/or generate a different output file. This scenario is especially common in Monte Carlo experiments and parameter sweeps, where many instances of the same program are run using a variety of inputs.
#!/usr/bin/env bash

#PBS -t 1-100

cd $PBS_O_WORKDIR
./myprog input-$PBS_ARRAYID.txt
Below is a sample submit script with typical options and extensive comments. This template is also available as a macro in APE (Another Programmer's Editor) and in /share1/Examples on Peregrine.
#!/bin/sh

##########################################################################
#   Torque (PBS) job submission script template
#
#   "#PBS" denotes PBS command-line flags
#   "##PBS" is ignored by torque/PBS
##########################################################################

##########################################################################
# Job name that will be displayed by qstat, used in output filenames, etc.

#PBS -N pbs-template

##########################################################################
# Job arrays run the same program independently on multiple cores.
# Each process is treated as a separate job by PBS.
# Each job has the name pbs-template[index], where index is
# one of the integer values following -t.  The entire array can
# also be treated as a single job with the name "pbs-template[]".

##PBS -t 1-10        # 10 jobs with consecutive indexes
##PBS -t 2,4,5,6     # Explicitly list arbitrary indexes

#################################################
# Specifying cores and distribution across nodes

# Arbitrary cores: the most flexible method.  Use this for all jobs with
# no extraordinary requirements (e.g. high memory/process).  It gives the
# scheduler the maximum flexibility in dispatching the job, which could
# allow it to start sooner.

##PBS -l procs=6

# Specific number of cores/node.  Use this for high-memory processes, or
# any other time there is a reason to distribute processes across multiple
# nodes.
#
# For multicore jobs (e.g. MPI), this requirement is applied once to the
# entire job.  To spread an MPI job out so that there is only one process
# per node, use:
#
#   nodes=8:ppn=1
#
# For job arrays, it is applied to each job in the array individually.
#
# To spread a job array across multiple nodes, so that there is only one
# process per node, use:
#
#   nodes=1:ppn=N
#
# where N is the total number of cores on a node.  This reserves an entire
# node for each job in the array, i.e. other jobs will not be able to use
# any cores on that node.  Useful for high-memory jobs that need all the
# memory on each node.

##PBS -l nodes=3:ppn=1

##########################################################################
# Specifying virtual memory requirements for each process within the job.
# This should be done for all jobs in order to maximize utilization of
# cluster resources.  The scheduler will assume a small memory limit
# unless told otherwise.

##PBS -l pvmem=250mb

##########################################################################
# CPU time and wall time.  These should be specified for all jobs.
# Estimate how long your job should take, and specify 1.5 to 2 times
# as much CPU and wall time as a limit.  This is only to prevent
# programs with bugs from occupying resources longer than they should.
#
# cput refers to total CPU time used by all processes in the job.
# walltime is actual elapsed time.
# Hence, cput ~= walltime * # of processes

##PBS -l cput=seconds or [[HH:]MM:]SS
##PBS -l walltime=seconds or [[HH:]MM:]SS

##########################################################################
# Environment variables set by the scheduler.  Use these in the
# commands below to set parameters for array jobs, control the names
# of output files, etc.
#
# PBS_O_HOST    the name of the host where the qsub command was run
# PBS_JOBID     the job identifier assigned to the job by the batch system
# PBS_JOBNAME   the job name supplied by the user
# PBS_O_WORKDIR the absolute path of the current working directory of the
#               qsub command
# PBS_NODEFILE  name of file containing compute nodes

##########################################################################
# Shell commands
##########################################################################

# Torque starts from the home directory on each node by default, so we
# must manually cd to the working directory to ensure that output files
# end up in the project directory, etc.

cd $PBS_O_WORKDIR

# Optional preparation example:
# Remove old files, alter PATH, etc. before starting the main process

# rm output*.txt
# PATH=/usr/local/mpi/openmpi/bin:${PATH}
# export PATH

# MPI job example:

# mpirun ./pbs-template

# Serial or array job example:

# ./pbs-template -o output-$PBS_JOBID.txt
If you determine that a job is not behaving properly (by reviewing partial output, for example), you can terminate it using qdel, which takes a job ID as a command line argument.
peregrine: qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
53.peregrine              MPI-Benchmark    bacon                  0 R batch
peregrine: qdel 53
peregrine: qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
53.peregrine              MPI-Benchmark    bacon           00:04:20 C batch
As mentioned in the section called “Job Submission”, the output sent by processes to standard output and standard error is redirected to files named by the -o and -e flags.
However, these files are not created until the job ends. Output is stored in temporary files until then.
A convenience script called qpeek is provided for viewing the output of a job while it is still stored in the temporary files. The qpeek script is not part of PBS, but is provided as an add-on. It takes a single job ID as an argument.
FreeBSD peregrine bacon ~ 499: qsub solveit.pbs
297.peregrine.hpc.uwm.edu
FreeBSD peregrine bacon ~ 500: qpeek 297
Computing the solution...
1
2
3
4
5
FreeBSD peregrine bacon ~ 501:
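Internally, a qpeek-style script only needs to locate the node running the job and display TORQUE's spooled output file for it. The sketch below makes several assumptions (a single-node job, and the common default spool directory /var/spool/torque/spool); the actual qpeek add-on is more robust.

#!/bin/sh

# Sketch of a qpeek-style script.  Assumes a single exec_host and the
# default TORQUE spool directory; the real add-on handles more cases.
jobid=$1
node=$(qstat -f "$jobid" | sed -n 's/.*exec_host = \([^/]*\).*/\1/p')
ssh "$node" "cat /var/spool/torque/spool/${jobid}*.OU"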
Write and submit a batch-serial PBS script called list-etc.pbs that prints the host name of the compute node on which it runs and a long listing of the /etc directory on that node. The script should store the output of the commands in list-etc.stdout and error messages in list-etc.stderr, in the directory from which the script was submitted.
The job should appear in qstat listings under the name "list-etc".
Quickly check the status of your job after submitting it.
Copy your list-etc.pbs script to list-etc-parallel.pbs, and modify it so that it runs the hostname and ls commands on 10 cores instead of just one. The job should produce a separate output file for each process named list-etc-parallel.o<jobid>-<arrayid> and a separate error file for each process named list-etc-parallel.e<jobid>-<arrayid>.
Quickly check the status of your job after submitting it.