Before reading this chapter, you should be familiar with basic Unix concepts (Chapter 3, Using Unix), the Unix shell (the section called “Command Line Interfaces (CLIs): Unix Shells”), redirection (the section called “Redirection and Pipes”), and shell scripting (Chapter 4, Unix Shell Scripting).
LSF, which stands for Load Sharing Facility, is part of a commercial cluster management suite called Platform Cluster Management. The LSF scheduler allows you to schedule jobs for immediate queuing or to run at a later time. It also allows you to monitor the progress of your jobs and manipulate them after they have started. This document describes the basics of using LSF, and how to get more detailed information from the official LSF documentation.
Jobs are submitted to the LSF scheduler using the bsub command. The bsub command finds available cores within the cluster and executes your job(s) on the selected cores. If the resources required by your job are not immediately available, the scheduler will hold your job in a queue until they become available.
You can provide the Unix command you want to schedule as part of the bsub command, but the preferred method of using bsub involves creating a simple shell script. Several examples are given below. Writing shell scripts is covered in Chapter 4, Unix Shell Scripting.
[user@hd1 ~]$ bsub -J hostname_example -o hostname_output%J.txt hostname
The bsub command above finds an available core and dispatches the
Unix command hostname to that core. The output of the command
(the host name of the node containing the core) is appended to
hostname_output%J.txt, where %J is replaced
by the job number.
The -o flag in the command tells bsub that any output the hostname command tries to send to the terminal should be appended to the file named after the flag. The job is given the name hostname_example by the -J flag. This job name can be used by other LSF commands to refer to the job while it is running. A few of these commands are described below.
The job in the example above can also be submitted using a shell script.
When used with the scheduler, scripts document exactly how the bsub command is invoked, as well as the exact sequence of Unix commands to be executed on the cluster. This allows you to re-execute the job in exactly the same manner without having to remember the details of the command. It also allows you to perform preparation steps within the script before the primary command, like removing old files, going to a specific directory, etc.
To script the example above, enter the appropriate Unix commands into a text file, e.g. hostname.bsub, using your favorite text editor.
[user@hd1 ~]$ nano hostname.bsub
Once you're in the editor, enter the following text, and then save and exit:
Example 11.1. Simple LSF Batch Script
#!/usr/bin/env bash

#BSUB -J hostname_example -o hostname_output%J.txt
#BSUB -v 1000000

hostname
Then submit the job by running:
[user@hd1 ~]$ bsub < hostname.bsub
Note that the script is fed to bsub using input redirection.
The first line of the script:

#!/usr/bin/env bash

indicates that this script should be executed by bash (Bourne Again shell).
The second and third lines:
#BSUB -J hostname_example -o hostname_output%J.txt
#BSUB -v 1000000
specify command-line flags to be used with the bsub command. Note that they're the same flags we entered on the Unix command line when running bsub without a script. You can place any number of command-line flags on each #BSUB line. It makes no difference to bsub, so it's strictly a matter of readability and personal taste.
Recall that to the shell, anything following a '#' is a comment. Lines that begin with '#BSUB' are specially formatted shell comments that are recognized by bsub but ignored by the shell running the script.
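You can verify the shell's side of this behavior without the scheduler: running a script directly causes every #BSUB line to be skipped as an ordinary comment. A minimal sketch:

```shell
#!/bin/sh

# To the shell, the next line is just a comment and is never executed.
# Only bsub reads #BSUB lines, and only when the script is submitted.
#BSUB -J hostname_example

echo "only this line runs"
```

Running this script prints a single line; the #BSUB directive has no effect until the script is fed to bsub.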
The beginning of the line must be exactly '#BSUB' in order to be recognized by bsub. Hence, we can disable an option without removing it from the script by simply adding another '#':
##BSUB -o hostname_output%J.txt
The new command-line flag here, -v 1000000, limits the memory use of the job to 1,000,000 kilobytes, or about 1 gigabyte.
All lines in the script that are not comments are interpreted by bash as Unix commands, and are executed on the core(s) selected by the scheduler.
Scheduled jobs fall under one of several classifications. The differences between these classifications are mainly conceptual and the differences in the commands for submitting them via bsub can be subtle.
The example above is what we call a batch serial job. Batch serial
jobs run on a single core, and any output they send to the standard
output (normally the terminal) is redirected to the file specified with -o filename in bsub.
Batch jobs do not display output to the terminal window, and cannot receive input from the keyboard.
If you need to see the output of your job on the screen during execution, or provide input from the keyboard, then you need an interactive job.
From the user's point of view, an interactive job runs as if you had simply entered the command at the Unix prompt instead of running it under the scheduler. The important distinction is that with a scheduled interactive job, the scheduler decides which core to run it on. This prevents multiple users from running interactive jobs on the same node and potentially swamping the resources of that node.
From the scheduler's point of view, an interactive job is almost the same as a batch serial, except that the output is not redirected to a file, and interactive jobs can receive input from the keyboard.
To run an interactive job in bsub, simply add the -I flag. The -I flag alone allows the job to send output back from the compute node to your screen. It does not, however, allow certain interactive features required by editors and other full-screen programs. The -Ip flag creates a pseudo-terminal, which enables terminal features required for most programs. The -Is flag enables shell mode support in addition to creating a pseudo-terminal. -Ip should be sufficient for most interactive jobs.
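As a concrete illustration, a full-screen program such as top could be scheduled with a pseudo-terminal (the program here is just an example):

```
[user@hd1 ~]$ bsub -Ip top
```

A plain -I would suffice for a program that only prints ordinary output, e.g. bsub -I hostname.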
Note that the
-o flag cannot be used with interactive jobs,
since output is sent to the terminal screen, not to a file.
The -K flag causes bsub to wait until the job completes.
This provides a very simple mechanism to run a series of jobs, where one job requires the output of another.
It can also be used in place of interactive jobs for tasks such as program compilation, if you want to save the output for future reference.
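For example, -K can be used to ensure that a compilation finishes before a job that runs the resulting program is submitted (the build command and script name here are hypothetical):

```
[user@hd1 ~]$ bsub -K -o build_output%J.txt make
[user@hd1 ~]$ bsub < run_analysis.bsub
```

Because the first bsub does not return until the build job completes, the second submission cannot start before the program exists.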
Batch serial and interactive jobs don't really make good use of a cluster, since they only run your job on a single core. In order to utilize the real power of a cluster, we need to run multiple processes in parallel (at the same time).
If your computations can be decomposed into N completely independent tasks, then you may be able to use a batch parallel job to reduce the run time by nearly a factor of N. This is the simplest form of parallelism, and it maximizes the performance gain on a cluster. This type of parallel computing is often referred to as embarrassingly parallel, loosely coupled, high throughput computing (HTC), or grid computing. It is considered distinct from tightly coupled or high performance computing (HPC), where the processes that make up a parallel job communicate and cooperate with each other to complete a task.
In LSF, batch parallel jobs are distinguished from batch serial jobs in a rather subtle way; by simply appending a job index specification to the job name:
Example 11.3. Batch Parallel Script
#!/usr/bin/env bash

#BSUB -J parallel[1-10] -o parallel_output%J.txt

printf "Job $LSB_JOBINDEX running on `hostname`\n" > output_$LSB_JOBINDEX.txt
This script instructs the scheduler to allocate 10 cores in the cluster and start a job consisting of 10 simultaneous processes, named parallel[1], parallel[2], ... parallel[10] within the LSF scheduler. The commands in the script are executed on all 10 cores at (approximately) the same time.
The environment variable $LSB_JOBINDEX is created by the scheduler,
and assigned a different value for each process within the job.
Specifically, it will be a number between 1 and 10, since these
are the subscripts specified with the
-J flag. This allows the
script, as well as commands executed by the script, to distinguish
themselves from other processes. This can be useful when you want
all the processes to read different input files or store output in
separate output files.
Note that the output of the printf command above is redirected to a different file for each job index. Normally, printf displays its output in the terminal window, but this example uses a shell feature called output redirection to send it to a file instead. When the shell sees a “>” in the command, it takes the string after the “>” as a filename, and causes the command to send its output to that file instead of the terminal screen.
After all the processes complete, you will have a series of output files called output_1.txt, output_2.txt, and so on. Although each process in the job runs on a different core, all of the cores have direct access to the same files and directories, so all of these files will be in the same place when the job finishes.
Each file will contain the host name on which the process ran. Note that since each node in the cluster has multiple cores, some of the files may contain the same host name. Remember that the scheduler allocates cores, not nodes.
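The effect of $LSB_JOBINDEX on the output files can be sketched without a cluster: the loop below stands in for the scheduler, assigning each index in turn the way LSF would assign one to each process (sequentially here, rather than in parallel):

```shell
#!/bin/sh

# Stand-in for the LSF scheduler: set LSB_JOBINDEX to each value in
# turn, as LSF would for each process in a job named parallel[1-3].
for LSB_JOBINDEX in 1 2 3; do
    printf "Job $LSB_JOBINDEX running on `hostname`\n" > output_$LSB_JOBINDEX.txt
done

# Each index wrote its own file:
cat output_2.txt
```

Since every iteration runs on the same machine here, all three files contain the same host name; on the cluster, the host names may differ.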
As mentioned earlier, if you can decompose your computations into a set of completely independent parallel processes that have no need to talk to each other, then a batch parallel job is probably the best solution.
When the processes within a parallel job must communicate extensively, special programming is required to make the processes exchange information.
MPI (Message Passing Interface) is the de facto standard library and API (Application Programming Interface) for such tightly-coupled distributed programming. MPI can be used with general-purpose languages such as C, Fortran, and C++ to implement complex parallel programs requiring extensive communication between processes. Parallel programming can be very complex, but MPI makes the implementation about as easy as it can be.
There are multiple implementations of the MPI standard, several of which are installed on the cluster. The OpenMPI implementation is the newest, most complete, and is becoming the standard implementation. However, some specific applications still depend on other MPI implementations, so many clusters have multiple implementations installed. If you are building and running your own MPI applications, you may need to select a default implementation using the mpi-selector-menu command. A default selection of openmpi_gcc_qlc is usually configured for new users, but you can change it at any time, and as often as you like.
All MPI jobs are started by using the mpirun command, which is part of the MPI installation. The mpirun command dispatches the MPI program to all the nodes that you specify with the command, and sets up the communication interface that allows them to pass messages to each other while running.
To facilitate the use of mpirun under the LSF scheduler, LSF provides wrapper commands for each MPI implementation. For example, to use OpenMPI, the command is openmpi_wrapper. These wrappers allow the scheduler to pass information such as the list of allocated cores down to mpirun.
Example 11.4. LSF and OpenMPI
#!/usr/bin/env bash

#BSUB -o matmult_output%J.txt -n 10

openmpi_wrapper matmult
In this script, matmult is an MPI program, i.e. a program using the MPI library functions to pass messages between multiple cooperating matmult processes.
The scheduler executes openmpi_wrapper on a single core, and openmpi_wrapper in turn executes matmult on 10 cores.
Note that MPI jobs in LSF are essentially batch serial jobs where the command is openmpi_wrapper or one of the other wrappers. The scheduler only allocates multiple cores, and executes the provided command on one of them. From there, MPI takes over.
Although openmpi_wrapper executes on only one core, the MPI program
as a whole requires multiple cores, and these cores must be
allocated by the scheduler. The bsub flag
-n 10 instructs
the LSF scheduler to allocate 10 cores for this job, even though
the scheduler will only dispatch openmpi_wrapper to one of them.
The openmpi_wrapper command then dispatches matmult to all of
the cores provided by the scheduler, including the one on which
openmpi_wrapper is running.
Note that you do not need a cluster to develop and run MPI applications. You can develop and test MPI programs on your PC provided that you have a compiler and an MPI package installed. The speedup provided by MPI will be limited by the number of cores your PC has, but the program should work basically the same way on your PC as it does on a cluster, except that you will probably use mpirun directly on your PC rather than submit the job to a scheduler.
After submitting jobs with bsub, you can monitor their status with the bjobs command.
[user@hd1 ~]$ bjobs
By default, bjobs limits the amount of information displayed to
fit an 80-column terminal. To ensure complete information is printed
for each job, add the
-w (wide output) flag.
[user@hd1 ~]$ bjobs -w
By default, the bjobs command will list jobs you have running, waiting to run, or suspended. It displays basic information such as the numeric job ID, job name, nodes in use, and so on.
If you would like to see what other users are running on the cluster, run
[user@hd1 ~]$ bjobs -w -u all
There are many options for listing a subset of your jobs, other users' jobs, etc. Run man bjobs for a quick reference.
When you run a batch job, the output file may not be available until the job finishes. You can view the output generated so far by an unfinished job using the bpeek command:
[user@hd1 ~]$ bpeek numeric-job-id
[user@hd1 ~]$ bpeek -J job-name
The job-name above is the same name you specified with the -J flag in bsub. The numeric job ID can be found by running bjobs.
Using the output from bjobs, we can see which compute nodes are being used by a job.
We can then examine the processes on a given node using a remotely executed top command:
[user@hd1 ~]$ ssh -t compute-1-03 top
The -t flag is important here, since it tells ssh to open a connection with full terminal control, which is needed by top to update your terminal screen.
A job can be terminated using the bkill command:
[user@hd1 ~]$ bkill numeric-job-id
[user@hd1 ~]$ bkill -J job-name
Again, the numeric job id can be found by running bjobs, and the
job-name is what you specified with the
-J option in bsub.
LSF has the ability to notify you by email when a job begins or ends. All notification emails are sent to your PantherLINK account.
To receive an email notification when your job ends, add the following to your bsub script:

#BSUB -N

To receive an email notification when your job begins, add the following to your bsub script:

#BSUB -B
The -B flag is only useful when the load in the cluster is too high to accommodate your job at the time it is submitted.
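Putting the notification flags in context, a submission script requesting mail at both dispatch and completion might look like this (the job name and command are placeholders):

```
#!/usr/bin/env bash

#BSUB -J notify_example -o notify_output%J.txt
#BSUB -B
#BSUB -N

./long_analysis
```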