Scheduler Basics

Purpose of a Scheduler

Unix systems can divide CPU cores and memory resources among multiple processes. Memory is divided up physically, while CPU time is divided temporally, with time slices of about 1/100 to 1/1000 of a second given to each process in a mostly round-robin fashion. This type of CPU sharing is known as time slicing or preemptive multitasking, and each switch of a core from one process to another is called a context switch. If you run ps -ax, ps -ef, or top on any Unix system, you will see that there are far more processes running than there are cores in the system.
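
You can see this for yourself by comparing the process count to the core count. The commands below are standard, though the exact output varies from system to system:

    # Count the processes currently running (the first output line is a header)
    ps -ax | wc -l

    # Show the number of cores: Linux
    nproc

    # Show the number of cores: BSD or macOS
    sysctl -n hw.ncpu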

Sharing time on individual cores works well for programs that are idle much of the time, such as word processors and web browsers, which spend most of their time waiting for user input. It can create the illusion that multiple processes are actually running at the same time, since the time it takes to respond to user input may be a small fraction of a second for any one of the processes sharing the CPU.

However, this sort of multitasking does not work well for intensive computational processes that use the CPU constantly for hours, days, or weeks.

There is overhead involved in switching a core between one process and another, so the total time to finish all processes would actually be lower if we ran one process to completion before starting the next. The more frequently a core switches between processes, the more overhead it incurs, and the longer it will take to complete all processes. More frequent context switching sacrifices overall throughput for better response times. Just how much overhead is incurred depends on how much the processes contend with each other for memory and disk resources. The difference could be marginal, or it could be huge.

There is also the possibility that there simply won't be enough memory available for multiple processes to run at the same time.

Hence, most systems that are used to run long, CPU-intensive processes employ a scheduler to limit the number of processes using the system at once. Typically, the scheduler will allow at most one process per core to be running at any given moment. Most schedulers also have features to track and balance memory use.

When you run a job under a scheduler, the scheduler locates nodes with sufficient cores, memory, and any other resources your job requires. It then marks those resources as "occupied" and dispatches your processes to the nodes containing those resources. When your job completes, the resources are again marked as "available" so that they can be allocated to the next job.
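
For example, using SLURM (shown only for illustration; other schedulers provide equivalent commands), the submission cycle looks roughly like this, with a hypothetical script name:

    # Submit a job script; the scheduler finds and reserves the
    # requested resources, then dispatches the job
    sbatch myjob.sbatch

    # List queued and running jobs, showing which resources are occupied
    squeue

    # Show the state of each node (allocated, idle, etc.)
    sinfo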

Caution

Out of courtesy to other users on a shared cluster or grid, you should never run anything that requires significant CPU time, memory, or disk I/O on the head node or any other node that users log into directly. Doing so can overload the computer, causing it to become sluggish or unresponsive for other users. It is generally OK to perform trivial tasks there, such as editing programs and scripts, processing small files, etc. Everything else should be scheduled to run on compute nodes.

After submitting a job to the scheduler, you can usually log out of the head/submit node without affecting the job. Hence, you can submit a long-running job, leave, and check the results at a later time.

Caution

When running jobs in a scheduled environment, all jobs must be submitted through the scheduler. The scheduler does not know what resources are being consumed by processes it did not start, so it cannot guarantee the performance and stability of the system if anything is run outside it.

Another advantage of schedulers is their ability to queue jobs. If you submit a job at a time when insufficient resources are available, you can walk away knowing that it will begin executing as soon as the necessary resources become available.

Common Schedulers
HTCondor

HTCondor is a sophisticated scheduler designed primarily to utilize idle time on grids of existing personal computers across an organization. Many college campuses use HTCondor grids to provide inexpensive parallel computing resources to their users.

Load Sharing Facility (LSF)

Load Sharing Facility (LSF) is a proprietary scheduler primarily used on large Linux clusters.

Lightweight, Portable Job Scheduler (LPJS)

Lightweight, Portable Job Scheduler (LPJS) is a resource manager and scheduler specifically designed to be small, completely portable to any Unix-like platform, and easy to set up and use.

It is suitable for anything from a small cluster or grid utilizing a mix of existing computers running a variety of operating systems, including BSD, Linux, and macOS, to substantial clusters running on dedicated hardware.

Portable Batch System (PBS)

Portable Batch System is an open source scheduler used on clusters of all sizes.

The most popular modern implementation of the PBS scheduler is Torque. There is also an open source add-on scheduler for Torque called Maui, which provides enhanced scheduling features, as well as a more sophisticated commercial version of Maui called Moab.

Simple Linux Utility for Resource Management (SLURM)

SLURM is a relatively new open source resource manager and scheduler. It was originally developed on Linux, but runs on other Unix-compatible systems as well. SLURM is distinguished from other schedulers by its high throughput, scalability, modularity, and fault tolerance.

Sun Grid Engine (SGE)

Sun Grid Engine is an open source scheduler originally developed as a proprietary product at Sun Microsystems for their Solaris Unix system. Since it was open sourced, it has become fairly popular on other Unix variants such as BSD and Linux. It is used on clusters of all sizes.

Oracle Grid Engine is the continuation of Sun Grid Engine after Oracle Corporation purchased Sun Microsystems. There are also derivatives, including Son of Grid Engine and SOME Grid Engine.

Job Types
Batch Serial

A batch serial job allocates one core and executes a single process on it. All output that would appear on the terminal screen is redirected to a file. You do not need to remain logged into the submit node while a batch job is running, since it does not need access to your terminal.

This is not parallel computing, but clusters and grids may be used this way just to utilize software that users may not have on their own computers. Batch serial jobs are also often used for pre-processing (prep work done before a parallel job) and post-processing (wrap-up work done after a parallel job) that cannot be parallelized.
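
A minimal batch serial script under SLURM might look like the following sketch (the script contents and program name are hypothetical; other schedulers use analogous directives such as #PBS or #BSUB):

    #!/bin/sh
    #SBATCH --ntasks=1              # one core, one process
    #SBATCH --output=analyze.out    # terminal output is redirected here

    # This command runs on the compute node the scheduler selects
    ./analyze input.txt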

Interactive

An interactive job is like a batch serial job, in that a single core is usually used. However, terminal output is not redirected to a file, and the process can receive input from the keyboard. You cannot log out of the submit node while an interactive job is running. Interactive jobs are rarely used, but can be useful for short-running tasks such as program compilation.
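
Under SLURM, for instance, an interactive session on a compute node can be requested with srun (a standard SLURM command; the compile command is just a hypothetical example):

    # Request one core and start an interactive shell on a compute node
    srun --ntasks=1 --pty /bin/sh

    # Commands typed now run on the compute node, e.g. compiling a program
    cc -O2 -o myprog myprog.c

    # Log out of the compute node, ending the job
    exit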

Batch Parallel (Job Arrays)

A batch parallel job runs the same program simultaneously on multiple cores, each using different inputs. The processes are generally independent of each other while running, i.e. they do not communicate with each other. Batch parallel jobs are often referred to as embarrassingly parallel, since they are so easy to implement.
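
Under SLURM, for example, a batch parallel job is submitted as a job array. The sketch below assumes hypothetical input files named input-1.txt through input-100.txt:

    #!/bin/sh
    #SBATCH --array=1-100       # run 100 independent instances of this script
    #SBATCH --ntasks=1          # each instance gets one core

    # SLURM sets SLURM_ARRAY_TASK_ID to 1, 2, ..., 100, one value per instance
    ./analyze input-$SLURM_ARRAY_TASK_ID.txt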

Multicore

A multicore job covers all other types of jobs. Typically, the scheduler is asked to allocate multiple cores, but it dispatches only a single master process to one of them, much like a batch serial job. It is then up to the master process to launch and manage processes on the other allocated cores. Processes within a multicore job usually communicate and cooperate with each other during execution. Hence, this is the most complex type of distributed parallel programming.

Most multicore jobs use the Message Passing Interface (MPI) system, which facilitates the creation and communication of cooperative distributed parallel jobs. MPI is discussed in detail in Chapter 35, Programming for HPC.
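
A typical MPI job script under SLURM looks like the sketch below. The program name is hypothetical, and the exact launch command varies with the MPI implementation; some sites use srun in place of mpirun.

    #!/bin/sh
    #SBATCH --ntasks=16     # allocate 16 cores, possibly spanning nodes

    # The scheduler dispatches only this script; mpirun then launches
    # and manages the 16 cooperating processes on the allocated cores
    mpirun ./mpi-program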

Checkpointing

The longer a program runs, the more likely it is to experience a problem such as a power outage or hardware failure before it completes.

Checkpointing is the process of periodically saving the partial results and/or current state of a process as it executes. If the exact state of a process can be saved, then it can, in theory, be easily resumed from where it left off after being interrupted. If only partial results can be saved, it may be harder to resume from where you left off, but at least there is a chance that you won't have to start over from the beginning.

If starting over would be a major inconvenience, then checkpointing is always a good idea. How to go about checkpointing depends heavily on the code you're running. Simple batch serial and batch parallel jobs can be checkpointed more easily than shared memory or MPI jobs. Tools exist that allow you to checkpoint simple jobs without any additional coding. More complex jobs may require adding code so that the program checkpoints itself.
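
The shell sketch below illustrates the simplest form of self-checkpointing for a batch serial job: each partial result is saved to a file, and a restarted job skips work that is already done. The program and file names are hypothetical.

    #!/bin/sh
    # Process 1000 inputs, saving each result as a checkpoint.
    # If the job is interrupted and resubmitted, finished inputs are skipped.
    i=1
    while [ $i -le 1000 ]; do
        if [ ! -f result-$i.txt ]; then
            ./analyze input-$i.txt > result-$i.txt
        fi
        i=$((i + 1))
    done

A more careful version would write each result to a temporary file and rename it only when complete, so that a partially written result is not mistaken for a finished one.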

Consult the current scheduler documentation or talk to your facilitator for more information.

Self-test
  1. What kind of problems would occur if users of a cluster or grid used remote execution directly to run jobs?
  2. Why is time-sharing of cores not a good idea in a cluster or grid environment? What types of processes work well with time-sharing?
  3. What does a job scheduler do? How does it solve the problem of multiple jobs competing for resources?
  4. Describe several popular job schedulers and where they are typically used.
  5. Describe the four common categories of jobs that run on clusters and grids.
  6. What is checkpointing and what are some of its limitations?