Before reading this chapter, you should be familiar with basic Unix concepts (Chapter 3, Using Unix), the Unix shell (the section called “Command Line Interfaces (CLIs): Unix Shells”), redirection (the section called “Redirection and Pipes”), and shell scripting (Chapter 4, Unix Shell Scripting).
LSF, which stands for Load Sharing Facility, is part of a commercial cluster management suite called Platform Cluster Management. The LSF scheduler allows you to schedule jobs for immediate queuing or to run at a later time. It also allows you to monitor the progress of your jobs and manipulate them after they have started. This document describes the basics of using LSF, and how to get more detailed information from the official LSF documentation.
Jobs are submitted to the LSF scheduler using the bsub command. The bsub command finds available cores within the cluster and executes your job(s) on the selected cores. If the resources required by your job are not immediately available, the scheduler will hold your job in a queue until they become available.
You can provide the Unix command you want to schedule as part of the bsub command, but the preferred method of using bsub involves creating a simple shell script. Several examples are given below. Writing shell scripts is covered in Chapter 4, Unix Shell Scripting.
[user@hd1 ~]$ bsub -J hostname_example -o hostname_output%J.txt hostname
The bsub command above finds an available core and dispatches the
Unix command hostname to that core. The output of the command
(the host name of the node containing the core) is appended to
hostname_output%J.txt, where %J is replaced
by the job number.
The -o flag in the command tells bsub that any output the hostname command tries to send to the terminal should be appended to the file named after the flag. The job is given the name hostname_example by the -J flag. This job name can be used by other LSF commands to refer to the job while it is running. A few of these commands are described below.
The job in the example above can also be submitted using a shell script.
When used with the scheduler, scripts document exactly how the bsub command is invoked, as well as the exact sequence of Unix commands to be executed on the cluster. This allows you to re-execute the job in exactly the same manner without having to remember the details of the command. It also allows you to perform preparation steps within the script before the primary command, like removing old files, going to a specific directory, etc.
To script the example above, enter the appropriate Unix commands into a text file, e.g. hostname.bsub, using your favorite text editor.
[user@hd1 ~]$ nano hostname.bsub
Once you're in the editor, enter the following text, and then save and exit:
Example 11.1. Simple LSF Batch Script
#!/usr/bin/env bash

#BSUB -J hostname_example -o hostname_output%J.txt
#BSUB -v 1000000

hostname
Then submit the job by running:
[user@hd1 ~]$ bsub < hostname.bsub
Note that the script is fed to bsub using input redirection.
The first line of the script:

#!/usr/bin/env bash

indicates that this script should be executed by bash (Bourne Again shell).
The second and third lines:
#BSUB -J hostname_example -o hostname_output%J.txt
#BSUB -v 1000000
specify command-line flags to be used with the bsub command. Note that they're the same flags we entered on the Unix command line when running bsub without a script. You can place any number of command-line flags on each #BSUB line. It makes no difference to bsub, so it's strictly a matter of readability and personal taste.
Recall that to the shell, anything following a '#' is a comment. Lines that begin with '#BSUB' are specially formatted shell comments that are recognized by bsub but ignored by the shell running the script.
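You can verify the shell's side of this behavior without the scheduler: running a script directly causes every #BSUB line to be skipped as an ordinary comment. A minimal sketch:

```shell
#!/bin/sh

# To the shell, the next line is just a comment and is never executed.
# Only bsub reads #BSUB lines, and only when the script is submitted.
#BSUB -J hostname_example

echo "only this line runs"
```

Running this script prints a single line; the #BSUB directive has no effect until the script is fed to bsub.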
The beginning of the line must be exactly '#BSUB' in order to be recognized by bsub. Hence, we can disable an option without removing it from the script by simply adding another '#':
##BSUB -o hostname_output%J.txt
The new command-line flag here, -v 1000000, limits the memory use of the job to 1,000,000 kilobytes, or about 1 gigabyte.
All lines in the script that are not comments are interpreted by bash as Unix commands, and are executed on the core(s) selected by the scheduler.
Scheduled jobs fall under one of several classifications. The differences between these classifications are mainly conceptual and the differences in the commands for submitting them via bsub can be subtle.
The example above is what we call a batch serial job. Batch serial
jobs run on a single core, and any output they send to the standard
output (normally the terminal) is redirected to the file specified with -o filename in bsub.
Batch jobs do not display output to the terminal window, and cannot receive input from the keyboard.
If you need to see the output of your job on the screen during execution, or provide input from the keyboard, then you need an interactive job.
From the user's point of view, an interactive job runs as if you had simply entered the command at the Unix prompt instead of running it under the scheduler. The important distinction is that with a scheduled interactive job, the scheduler decides which core to run it on. This prevents multiple users from running interactive jobs on the same node and potentially swamping the resources of that node.
From the scheduler's point of view, an interactive job is almost the same as a batch serial, except that the output is not redirected to a file, and interactive jobs can receive input from the keyboard.
To run an interactive job in bsub, simply add the -I flag. The -I flag alone allows the job to send output back from the compute node to your screen. It does not, however, allow certain interactive features required by editors and other full-screen programs. The -Ip flag creates a pseudo-terminal, which enables terminal features required for most programs. The -Is flag enables shell mode support in addition to creating a pseudo-terminal. -Ip should be sufficient for most interactive jobs.
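As a concrete illustration, a full-screen program such as top could be scheduled with a pseudo-terminal (the program here is just an example):

```
[user@hd1 ~]$ bsub -Ip top
```

A plain -I would suffice for a program that only prints ordinary output, e.g. bsub -I hostname.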
Note that the
-o flag cannot be used with interactive jobs,
since output is sent to the terminal screen, not to a file.
The -K flag causes bsub to wait until the job completes.
This provides a very simple mechanism to run a series of jobs, where one job requires the output of another.
It can also be used in place of interactive jobs for tasks such as program compilation, if you want to save the output for future reference.
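For example, -K can be used to ensure that a compilation finishes before a job that runs the resulting program is submitted (the build command and script name here are hypothetical):

```
[user@hd1 ~]$ bsub -K -o build_output%J.txt make
[user@hd1 ~]$ bsub < run_analysis.bsub
```

Because the first bsub does not return until the build job completes, the second submission cannot start before the program exists.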
Batch serial and interactive jobs don't really make good use of a cluster, since they only run your job on a single core. In order to utilize the real power of a cluster, we need to run multiple processes in parallel (at the same time).
If your computations can be decomposed into N completely independent tasks, then you may be able to use a batch parallel job to reduce the run time by nearly a factor of N. This is the simplest form of parallelism, and it maximizes the performance gain on a cluster. This type of parallel computing is often referred to as embarrassingly parallel, loosely coupled, high throughput computing (HTC), or grid computing. It is considered distinct from tightly coupled or high performance computing (HPC), where the processes that make up a parallel job communicate and cooperate with each other to complete a task.
In LSF, batch parallel jobs are distinguished from batch serial jobs in a rather subtle way; by simply appending a job index specification to the job name:
Example 11.3. Batch Parallel Script
#!/usr/bin/env bash

#BSUB -J parallel[1-10] -o parallel_output%J.txt

printf "Job $LSB_JOBINDEX running on `hostname`\n" > output_$LSB_JOBINDEX.txt
This script instructs the scheduler to allocate 10 cores in the cluster and start a job consisting of 10 simultaneous processes, named parallel[1], parallel[2], ... parallel[10] within the LSF scheduler. The commands in the script are executed on all 10 cores at (approximately) the same time.
The environment variable $LSB_JOBINDEX is created by the scheduler,
and assigned a different value for each process within the job.
Specifically, it will be a number between 1 and 10, since these
are the subscripts specified with the
-J flag. This allows the
script, as well as commands executed by the script, to distinguish
themselves from other processes. This can be useful when you want
all the processes to read different input files or store output in
separate output files.
Note that the output of the printf command above is redirected to a different file for each job index. Normally, printf displays its output in the terminal window, but this example uses a shell feature called output redirection to send it to a file instead. When the shell sees a “>” in the command, it takes the string after the “>” as a filename, and causes the command to send its output to that file instead of the terminal screen.
After all the processes complete, you will have a series of output files called output_1.txt, output_2.txt, and so on. Although each process in the job runs on a different core, all of the cores have direct access to the same files and directories, so all of these files will be in the same place when the job finishes.
Each file will contain the host name on which the process ran. Note that since each node in the cluster has multiple cores, some of the files may contain the same host name. Remember that the scheduler allocates cores, not nodes.
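The effect of $LSB_JOBINDEX on the output files can be sketched without a cluster: the loop below stands in for the scheduler, assigning each index in turn the way LSF would assign one to each process (sequentially here, rather than in parallel):

```shell
#!/bin/sh

# Stand-in for the LSF scheduler: set LSB_JOBINDEX to each value in
# turn, as LSF would for each process in a job named parallel[1-3].
for LSB_JOBINDEX in 1 2 3; do
    printf "Job $LSB_JOBINDEX running on `hostname`\n" > output_$LSB_JOBINDEX.txt
done

# Each index wrote its own file:
cat output_2.txt
```

Since every iteration runs on the same machine here, all three files contain the same host name; on the cluster, the host names may differ.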
As mentioned earlier, if you can decompose your computations into a set of completely independent parallel processes that have no need to talk to each other, then a batch parallel job is probably the best solution.
When the processes within a parallel job must communicate extensively, special programming is required to make the processes exchange information.
MPI (Message Passing Interface) is the de facto standard library and API (Application Programming Interface) for such tightly-coupled distributed programming. MPI can be used with general-purpose languages such as C, Fortran, and C++ to implement complex parallel programs requiring extensive communication between processes. Parallel programming can be very complex, but MPI makes the implementation about as easy as it can be.
There are multiple implementations of the MPI standard, several of which are installed on the cluster. The OpenMPI implementation is the newest, most complete, and is becoming the standard implementation. However, some specific applications still depend on other MPI implementations, so many clusters have multiple implementations installed. If you are building and running your own MPI applications, you may need to select a default implementation using the mpi-selector-menu command. A default selection of openmpi_gcc_qlc is usually configured for new users, but you can change it at any time, and as often as you like.
All MPI jobs are started by using the mpirun command, which is part of the MPI installation. The mpirun command dispatches the MPI program to all the nodes that you specify with the command, and sets up the communication interface that allows them to pass messages to each other while running.
To facilitate the use of mpirun under the LSF scheduler, LSF provides wrapper commands for each MPI implementation. For example, to use OpenMPI, the command is openmpi_wrapper. These wrappers allow the scheduler to pass information such as the list of allocated cores down to mpirun.
Example 11.4. LSF and OpenMPI
#!/usr/bin/env bash

#BSUB -o matmult_output%J.txt -n 10

openmpi_wrapper matmult
In this script, matmult is an MPI program, i.e. a program using the MPI library functions to pass messages between multiple cooperating matmult processes.
The scheduler executes openmpi_wrapper on a single core, and openmpi_wrapper in turn executes matmult on 10 cores.
Note that MPI jobs in LSF are essentially batch serial jobs where the command is openmpi_wrapper or one of the other wrappers. The scheduler only allocates multiple cores, and executes the provided command on one of them. From there, MPI takes over.
Although openmpi_wrapper executes on only one core, the MPI program
as a whole requires multiple cores, and these cores must be
allocated by the scheduler. The bsub flag
-n 10 instructs
the LSF scheduler to allocate 10 cores for this job, even though
the scheduler will only dispatch openmpi_wrapper to one of them.
The openmpi_wrapper command then dispatches matmult to all of
the cores provided by the scheduler, including the one on which
openmpi_wrapper is running.
Note that you do not need a cluster to develop and run MPI applications. You can develop and test MPI programs on your PC provided that you have a compiler and an MPI package installed. The speedup provided by MPI will be limited by the number of cores your PC has, but the program should work basically the same way on your PC as it does on a cluster, except that you will probably use mpirun directly on your PC rather than submit the job to a scheduler.
After submitting jobs with bsub, you can monitor their status with the bjobs command.
[user@hd1 ~]$ bjobs
By default, bjobs limits the amount of information displayed to
fit an 80-column terminal. To ensure complete information is printed
for each job, add the
-w (wide output) flag.
[user@hd1 ~]$ bjobs -w
By default, the bjobs command will list jobs you have running, waiting to run, or suspended. It displays basic information such as the numeric job ID, job name, nodes in use, and so on.
If you would like to see what other users are running on the cluster, run
[user@hd1 ~]$ bjobs -w -u all
There are many options for listing a subset of your jobs, other users' jobs, etc. Run man bjobs for a quick reference.
When you run a batch job, the output file may not be available until the job finishes. You can view the output generated so far by an unfinished job using the bpeek command:
[user@hd1 ~]$ bpeek numeric-job-id
[user@hd1 ~]$ bpeek -J job-name
The job-name above is the same name you specified with the -J flag in bsub. The numeric job ID can be found by running bjobs.
Using the output from bjobs, we can see which compute nodes are being used by a job.
We can then examine the processes on a given node using a remotely executed top command:
[user@hd1 ~]$ ssh -t compute-1-03 top
The -t flag is important here, since it tells ssh to open a connection with full terminal control, which is needed by top to update your terminal screen.
A job can be terminated using the bkill command:
[user@hd1 ~]$ bkill numeric-job-id
[user@hd1 ~]$ bkill -J job-name
Again, the numeric job id can be found by running bjobs, and the
job-name is what you specified with the
-J option in bsub.
LSF has the ability to notify you by email when a job begins or ends. All notification emails are sent to your PantherLINK account.
To receive an email notification when your job ends, add the following to your bsub script:

#BSUB -N

To receive an email notification when your job begins, add the following to your bsub script:

#BSUB -B
The -B flag is only useful when the load in the cluster is too high to accommodate your job at the time it is submitted.
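Putting the notification flags in context, a submission script requesting mail at both dispatch and completion might look like this (the job name and command are placeholders):

```
#!/usr/bin/env bash

#BSUB -J notify_example -o notify_output%J.txt
#BSUB -B
#BSUB -N

./long_analysis
```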