Chapter 12. Job Scheduling with LSF

Table of Contents

12.1. The LSF Scheduler
Submitting Jobs
Job Types
Job Control and Monitoring
12.2. Good Neighbor Scheduling Strategies
Job Distribution
12.3. More Information

Before You Begin

Before reading this chapter, you should be familiar with basic Unix concepts (Chapter 3, Using Unix), the Unix shell (the section called “Command Line Interfaces (CLIs): Unix Shells”), redirection (the section called “Redirection and Pipes”), and shell scripting (Chapter 4, Unix Shell Scripting).

The LSF Scheduler

LSF, which stands for Load Sharing Facility, is part of a commercial cluster management suite called Platform Cluster Management. The LSF scheduler allows you to schedule jobs for immediate queuing or to run at a later time. It also allows you to monitor the progress of your jobs and manipulate them after they have started. This chapter describes the basics of using LSF and explains how to get more detailed information from the official LSF documentation.

Submitting Jobs

Jobs are submitted to the LSF scheduler using the bsub command. The bsub command finds available cores within the cluster and executes your job(s) on the selected cores. If the resources required by your job are not immediately available, the scheduler will hold your job in a queue until they become available.

You can provide the Unix command you want to schedule as part of the bsub command, but the preferred method of using bsub involves creating a simple shell script. Several examples are given below. Writing shell scripts is covered in Chapter 4, Unix Shell Scripting.

Using bsub without a script
                [user@hd1 ~]$ bsub -J hostname_example -o hostname_output%J.txt hostname
                

The bsub command above finds an available core and dispatches the Unix command hostname to that core. The output of the command (the host name of the node containing the core) is appended to hostname_output%J.txt, where %J is replaced by the job number. The -o flag in the command tells bsub that any output the hostname command tries to send to the terminal should be appended to the filename following -o instead.

The job is given the name hostname_example via the -J flag. This job name can be used by other LSF commands to refer to the job while it is running. A few of these commands are described below.

Using a Submission Script

The job in the example above can also be submitted using a shell script.

When used with the scheduler, scripts document exactly how the bsub command is invoked, as well as the exact sequence of Unix commands to be executed on the cluster. This allows you to re-execute the job in exactly the same manner without having to remember the details of the command. It also allows you to perform preparation steps within the script before the primary command, such as removing old files or changing to a specific directory.

To script the example above, enter the appropriate Unix commands into a text file, e.g. hostname.bsub, using your favorite text editor.

                [user@hd1 ~]$ nano hostname.bsub
                

Once you're in the editor, enter the following text, and then save and exit:

Example 12.1. Simple LSF Batch Script

                #!/usr/bin/env bash
            
                #BSUB -J hostname_example -o hostname_output%J.txt
                #BSUB -v 1000000
                
                hostname
                

Then submit the job by running:

                [user@hd1 ~]$ bsub < hostname.bsub
                

Note that the script is fed to bsub using input redirection.

The first line of the script:

                #!/usr/bin/env bash
                

indicates that this script should be executed by bash (Bourne Again shell).

The second and third lines:

                #BSUB -J hostname_example -o hostname_output%J.txt
                #BSUB -v 1000000
                

specify command-line flags to be used with the bsub command. Note that they're the same flags we entered on the Unix command line when running bsub without a script. You can place any number of command-line flags on each #BSUB line. It makes no difference to bsub, so it's strictly a matter of readability and personal taste.
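For example, the same options could just as well be split across separate #BSUB lines:

                #BSUB -J hostname_example
                #BSUB -o hostname_output%J.txt
                #BSUB -v 1000000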

Recall that to the shell, anything following a '#' is a comment. Lines that begin with '#BSUB' are specially formatted shell comments that are recognized by bsub but ignored by the shell running the script.

The beginning of the line must be exactly '#BSUB' in order to be recognized by bsub. Hence, we can disable an option without removing it from the script by simply adding another '#':

                ##BSUB -o hostname
                

The new command-line flag here, -v 1000000, limits the virtual memory use of the job to 1,000,000 kilobytes, or about 1 gigabyte.

Caution

All jobs should use memory limits like this to prevent them from accidentally overloading the cluster. Programs that use too much memory can cause compute nodes to crash, which may kill other users' jobs as well as your own. It's possible to cause another user to lose several weeks' worth of work simply by miscalculating how much memory your job requires.

All lines in the script that are not comments are interpreted by bash as Unix commands, and are executed on the core(s) selected by the scheduler.
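As a sketch of the preparation steps mentioned earlier (the directory, file, and program names here are hypothetical), a script might change to the appropriate directory and remove stale output before running the primary command:

                #!/usr/bin/env bash

                #BSUB -J prep_example -o prep_output%J.txt
                #BSUB -v 1000000

                # Preparation steps: go to the project directory and remove old output
                cd Project1
                rm -f old_output.txt

                # Primary command
                ./run-analysis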

Job Types

Scheduled jobs fall under one of several classifications. The differences between these classifications are mainly conceptual; the differences in the commands for submitting them via bsub can be subtle.

Batch Serial Jobs

The example above is what we call a batch serial job. Batch serial jobs run on a single core, and any output they send to the standard output (normally the terminal) is redirected to the file named using -o filename with bsub.

Batch jobs do not display output to the terminal window, and cannot receive input from the keyboard.

Interactive Jobs

If you need to see the output of your job on the screen during execution, or provide input from the keyboard, then you need an interactive job.

From the user's point of view, an interactive job runs as if you had simply entered the command at the Unix prompt instead of running it under the scheduler. The important distinction is that with a scheduled interactive job, the scheduler decides which core to run it on. This prevents multiple users from running interactive jobs on the same node and potentially swamping the resources of that node.

From the scheduler's point of view, an interactive job is almost the same as a batch serial, except that the output is not redirected to a file, and interactive jobs can receive input from the keyboard.

To run an interactive job in bsub, simply add -I (capital i), -Ip, or -Is. The -I flag alone allows the job to send output back from the compute node to your screen. It does not, however, allow certain interactive features required by editors and other full-screen programs. The -Ip flag creates a pseudo-terminal, which enables terminal features required for most programs. The -Is flag enables shell mode support in addition to creating a pseudo-terminal. -Ip should be sufficient for most purposes.

Example 12.2. Batch Interactive Script

                #!/usr/bin/env bash
            
                #BSUB -J hostname_example -Ip
                
                hostname
                

Note that the -o flag cannot be used with interactive jobs, since output is sent to the terminal screen, not to a file.
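Interactive jobs can also be submitted directly from the command line. For example, one common use (a sketch, not the only way) is to request an interactive shell on a compute node chosen by the scheduler:

                [user@hd1 ~]$ bsub -Is bash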

Waiting for Jobs to Complete

The bsub -K flag causes bsub to wait until the job completes.

This provides a very simple mechanism to run a series of jobs, where one job requires the output of another.

It can also be used in place of interactive jobs for tasks such as program compilation, if you want to save the output for future reference.
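For example, a sketch of a script that uses -K to run two jobs in sequence, where the second job needs the output of the first (the program names are hypothetical):

                #!/usr/bin/env bash

                # Submit the first job and wait for it to complete
                bsub -K -J stage1 -o stage1_output%J.txt ./generate-data

                # This job is not submitted until the first one has finished
                bsub -K -J stage2 -o stage2_output%J.txt ./analyze-data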

Batch Parallel Jobs

Batch serial and interactive jobs don't really make good use of a cluster, since they only run your job on a single core. In order to utilize the real power of a cluster, we need to run multiple processes in parallel (at the same time).

If your computations can be decomposed into N completely independent tasks, then you may be able to use a batch parallel job to reduce the run time by nearly a factor of N. This is the simplest form of parallelism, and it maximizes the performance gain on a cluster. This type of parallel computing is often referred to as embarrassingly parallel, loosely coupled, high throughput computing (HTC), or grid computing. It is considered distinct from tightly coupled or high performance computing (HPC), where the processes that make up a parallel job communicate and cooperate with each other to complete a task.

In LSF, batch parallel jobs are distinguished from batch serial jobs in a rather subtle way: by simply appending a job index specification to the job name:

Example 12.3. Batch Parallel Script

                #!/usr/bin/env bash
            
                #BSUB -J parallel[1-10] -o parallel_output%J.txt
                printf "Job $LSB_JOBINDEX running on `hostname`\n" > output_$LSB_JOBINDEX.txt
                

This script instructs the scheduler to allocate 10 cores in the cluster and start a job consisting of 10 simultaneous processes, named parallel[1], parallel[2], ... parallel[10] within the LSF scheduler. The commands in the script are then executed on all 10 cores at (approximately) the same time.

The environment variable $LSB_JOBINDEX is created by the scheduler, and assigned a different value for each process within the job. Specifically, it will be a number between 1 and 10, since these are the subscripts specified with the -J flag. This allows the script, as well as commands executed by the script, to distinguish themselves from other processes. This can be useful when you want all the processes to read different input files or store output in separate output files.

Note that the output of the printf command above is redirected to a different file for each job index. Normally, printf displays its output in the terminal window, but this example uses a shell feature called output redirection to send it to a file instead. When the shell sees a > in the command, it takes the string after the > as a filename, and causes the command to send its output to that file instead of the terminal screen.

After all the processes complete, you will have a series of output files called output_1.txt, output_2.txt, and so on. Although each process in the job runs on a different core, all of the cores have direct access to the same files and directories, so all of these files will be in the same place when the job finishes.

Each file will contain the host name on which the process ran. Note that since each node in the cluster has multiple cores, some of the files may contain the same host name. Remember that the scheduler allocates cores, not nodes.
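As a sketch of using the job index to select input as well as output files (the program and input files here are hypothetical), each process can read its own input file and write a matching output file:

                #!/usr/bin/env bash

                #BSUB -J render[1-10] -o render_output%J.txt

                # Process 1 reads input_1.txt and writes output_1.txt,
                # process 2 reads input_2.txt and writes output_2.txt, etc.
                ./process-input < input_$LSB_JOBINDEX.txt > output_$LSB_JOBINDEX.txt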

MPI Jobs

As mentioned earlier, if you can decompose your computations into a set of completely independent parallel processes that have no need to talk to each other, then a batch parallel job is probably the best solution.

When the processes within a parallel job must communicate extensively, special programming is required to make the processes exchange information.

MPI (Message Passing Interface) is the de facto standard library and API (Application Programming Interface) for such tightly coupled distributed programming. MPI can be used with general-purpose languages such as C, Fortran, and C++ to implement complex parallel programs requiring extensive communication between processes. Parallel programming can be very complex, but MPI makes the implementation about as easy as it can be.

There are multiple implementations of the MPI standard, several of which are installed on the cluster. The OpenMPI implementation is the newest and most complete, and is becoming the standard implementation. However, some specific applications still depend on other MPI implementations, so many clusters have multiple implementations installed. If you are building and running your own MPI applications, you may need to select a default implementation using the mpi-selector-menu command. A default selection of openmpi_gcc_qlc is usually configured for new users, but you can change it at any time, and as often as you like.
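To view or change your default implementation, simply run the menu from the command line:

                [user@hd1 ~]$ mpi-selector-menu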

All MPI jobs are started using the mpirun command, which is part of the MPI installation. The mpirun command dispatches the MPI program to all the nodes that you specify with the command, and sets up the communication interface that allows the processes to pass messages to each other while running.

Caution

On a cluster shared by many users, it is impractical to use mpirun directly, since it requires the user to specify which cores to use, and cores should be selected by the scheduler instead. All MPI jobs should use the wrappers described below instead of using mpirun directly.

To facilitate the use of mpirun under the LSF scheduler, LSF provides wrapper commands for each MPI implementation. For example, to use OpenMPI, the command is openmpi_wrapper. These wrappers allow the scheduler to pass information such as the list of allocated cores down to mpirun.

Example 12.4. LSF and OpenMPI

                #!/usr/bin/env bash
            
                #BSUB -o matmult_output%J.txt -n 10
                openmpi_wrapper matmult
                

In this script, matmult is an MPI program, i.e. a program using the MPI library functions to pass messages between multiple cooperating matmult processes.

The scheduler executes openmpi_wrapper on a single core, and openmpi_wrapper in turn executes matmult on 10 cores.

Note that MPI jobs in LSF are essentially batch serial jobs where the command is openmpi_wrapper or one of the other wrappers. The scheduler only allocates multiple cores, and executes the provided command on one of them. From there, MPI takes over.

Although openmpi_wrapper executes on only one core, the MPI program as a whole requires multiple cores, and these cores must be allocated by the scheduler. The bsub flag -n 10 instructs the LSF scheduler to allocate 10 cores for this job, even though the scheduler will only dispatch openmpi_wrapper to one of them. The openmpi_wrapper command then dispatches matmult to all of the cores provided by the scheduler, including the one on which openmpi_wrapper is running.

Note that you do not need a cluster to develop and run MPI applications. You can develop and test MPI programs on your PC provided that you have a compiler and an MPI package installed. The speedup provided by MPI will be limited by the number of cores your PC has, but the program should work basically the same way on your PC as it does on a cluster, except that you will probably use mpirun directly on your PC rather than submit the job to a scheduler.
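For example, on a four-core PC you might run the matmult program from the earlier example directly with mpirun (assuming it has been compiled against your local MPI installation):

                mpirun -n 4 ./matmult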

Job Control and Monitoring
Listing jobs

After submitting jobs with bsub, you can monitor their status with the bjobs command.

                [user@hd1 ~]$ bjobs
                

By default, bjobs limits the amount of information displayed to fit an 80-column terminal. To ensure complete information is printed for each job, add the -w (wide output) flag.

                [user@hd1 ~]$ bjobs -w
                

By default, the bjobs command will list jobs you have running, waiting to run, or suspended. It displays basic information such as the numeric job ID, job name, nodes in use, and so on.

If you would like to see what other users are running on the cluster, run

                [user@hd1 ~]$ bjobs -w -u all
                

There are many options for listing a subset of your jobs, other users' jobs, etc. Run man bjobs for a quick reference.

Checking Progress

When you run a batch job, the output file may not be available until the job finishes. You can view the output generated so far by an unfinished job using the bpeek command:

                [user@hd1 ~]$ bpeek numeric-job-id
                

or

                [user@hd1 ~]$ bpeek -J job-name
                

The job-name above is the same name you specified with the -J flag in bsub. The numeric job id can be found by running bjobs.
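For example, to check on the hostname_example job submitted earlier:

                [user@hd1 ~]$ bpeek -J hostname_example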

Using top

Using the output from bjobs, we can see which compute nodes are being used by a job.

We can then examine the processes on a given node using a remotely executed top command:

                [user@hd1 ~]$ ssh -t compute-1-03 top
                

Note

The -t flag is important here, since it tells ssh to open a connection with full terminal control, which is needed by top to update your terminal screen.
Terminating Jobs

A job can be terminated using the bkill command:

                [user@hd1 ~]$ bkill numeric-job-id
                

or

                [user@hd1 ~]$ bkill -J job-name
                

Again, the numeric job id can be found by running bjobs, and the job-name is what you specified with the -J option in bsub.
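For example, to terminate the hostname_example job submitted earlier:

                [user@hd1 ~]$ bkill -J hostname_example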

Email Notification

LSF has the ability to notify you by email when a job begins or ends. All notification emails are sent to your PantherLINK account.

To receive an email notification when your job ends, add the following to your bsub script:

                #BSUB -N
                

To receive an email notification when your job begins, add the following to your bsub script:

                #BSUB -B
                

The -B flag is only useful when the load on the cluster is too high to accommodate your job at the time it is submitted.
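For example, a sketch combining both notification flags with the earlier hostname script, so that you receive email both when the job begins and when it ends:

                #!/usr/bin/env bash

                #BSUB -J hostname_example -o hostname_output%J.txt
                #BSUB -B -N

                hostname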