This chapter is based on content developed by Dr. Lars Olson, Marquette University.
Before reading this chapter, you should be familiar with basic Unix concepts (Chapter 3, Using Unix), the Unix shell (the section called “Command Line Interfaces (CLIs): Unix Shells”), redirection (the section called “Redirection and Pipes”), shell scripting (Chapter 4, Unix Shell Scripting), and job scheduling (Chapter 7, Job Scheduling).
For complete information, see the official HTCondor documentation at http://research.cs.wisc.edu/htcondor/.
The goal of this section is to provide a quick introduction to the HTCondor scheduler. For more detailed information, consult the HTCondor man pages and the HTCondor website.
HTCondor is very different from other common resource managers because it serves different goals.
HTCondor is intended for use on "distributively owned" resources, i.e. PCs that are owned by people other than the HTCondor users and used primarily for purposes other than parallel computing. Most other resource managers are designed for use on dedicated resources, such as a cluster of computers built specifically for parallel computation.
As such, HTCondor is designed to "borrow" resources from their owners while the owners are not otherwise using them. The PC owner has the right to "evict" (terminate) an HTCondor process at any time and without warning. In fact, if an owner logs into a PC while an HTCondor process is running, the process is generally evicted and restarted elsewhere.
HTCondor can be configured so that processes are merely suspended or even allowed to continue running, but this requires permission from the owner of the PC. Since the PCs on an HTCondor pool may be owned by many different people, it's best to assume that at least some of the PCs are configured to evict HTCondor processes when a local user is active.
A computer used by HTCondor should therefore not be expected to remain available for long periods of time. Plan for some of the processes running under HTCondor to be terminated and restarted on a different machine.
The longer an HTCondor process runs, the less likely it is to complete. As a rule of thumb, an HTCondor process should ideally run no more than a few hours. This is usually easy to achieve. Most HTCondor users are running embarrassingly parallel jobs that simply divide up input data or parameter space among an arbitrary number of independent processes. If such a user has 1,000 hours' worth of computation, they can just as easily do it with 10 jobs running for 100 hours each, 100 jobs running for 10 hours each, or 500 jobs running for 2 hours each.
Don't make your processes too short, though. Each process entails some scheduling overhead, and may take up to a minute or so to start. We want our jobs to spend much more time running than sitting in a queue in order to maximize throughput. There is little benefit to breaking computation into jobs much shorter than an hour and scheduling overhead will become significant if you do.
One of the advantages of HTCondor is that the number of available cores is usually irrelevant to how you divide up your job. HTCondor will simply run as many processes as it can at any given moment and start others as cores become available. For example, even if you only have 200 cores available, you can run a job consisting of 500 processes. HTCondor will run up to 200 at a time until all 500 are finished. 500 processes running for 2 hours each is usually preferable to 100 processes of 10 hours each, since this reduces the risk of individual processes being evicted from a PC.
Before scheduling any jobs through HTCondor, it is often useful to check the status of the hosts.
The condor_status command shows the status of all the execute hosts in the pool.
FreeBSD peregrine bacon ~ 402: condor_status | more

Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

povbEEFF73C040D8.l LINUX      INTEL  Backfill  Idle     0.050  1968  0+00:00:03
povbEEFF73C04485.l LINUX      INTEL  Backfill  Idle     0.080  1968  0+00:00:04
povbEEFF73C046A1.l LINUX      INTEL  Backfill  Idle     0.020  1968  0+00:00:04
povbEEFF73C048DB.l LINUX      INTEL  Backfill  Idle     0.000  1968  0+00:00:04
povbEEFF73C04B9F.l LINUX      INTEL  Backfill  Idle     0.130  1968  0+00:00:04
povbEEFF73C04BF1.l LINUX      INTEL  Backfill  Idle     0.110  1968  0+00:00:04
povbEEFF73C0516E.l LINUX      INTEL  Backfill  Idle     0.050  1968  0+00:00:04
povbEEFF73C061C5.l LINUX      INTEL  Backfill  Idle     0.070  1968  0+00:00:04
povbEEFF73C06F85.l LINUX      INTEL  Backfill  Idle     0.130  1968  0+00:00:04
povbEEFF73C0873B.l LINUX      INTEL  Backfill  Idle     0.150  1968  0+00:00:04
povbEEFF73C08BA6.l LINUX      INTEL  Backfill  Idle     0.020  1968  0+00:00:04
povbEEFF73C08E57.l LINUX      INTEL  Backfill  Idle     0.070  1968  0+00:00:04
povbEEFFA0A4F157.l LINUX      INTEL  Owner     Idle     0.290  1009  0+09:32:43
povbEEFFA0AA6F42.l LINUX      INTEL  Backfill  Idle     0.090  1009  0+00:00:04
povbEEFFA0AA717C.l LINUX      INTEL  Backfill  Idle     0.370  1009  0+00:00:04
povbEEFFA0AD6A4C.l LINUX      INTEL  Backfill  Idle     0.100  1009  0+00:00:04
povbEEFFA0B8E890.l LINUX      INTEL  Backfill  Idle     0.080  1009  0+00:00:04
povbEEFFA0DDD5C2.l LINUX      INTEL  Backfill  Idle     0.140  1009  0+00:00:03
povbEEFFA0DDD60F.l LINUX      INTEL  Backfill  Idle     0.200  1009  0+00:00:05
povbEEFFA0DDD652.l LINUX      INTEL  Backfill  Idle     0.200  1009  0+00:00:03
--More--(byte 1682)
The condor_status command shows only resources available in the local pool. If flocking is configured, it will be possible for jobs to utilize resources in other pools, but those resources are not shown by condor_status unless specifically requested.
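If you know the host name of another pool's central manager, you can usually point condor_status at that pool explicitly with the -pool flag. The host name below is only a placeholder; substitute the central manager of the pool you are interested in:

shell-prompt: condor_status -pool condor.other-site.edu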
The purpose of this section is to provide the reader a quick start in job scheduling using the most common and useful tools. The full details of job submission are beyond the scope of this document.
Submitting jobs involves specifying a number of job cluster parameters, such as the executable to run, the number of processes, the name(s) of the output file(s), and the resource requirements.
In order to record all of this information and make it easy to resubmit the same job cluster, it is incorporated into a submit description file. Using a submit description file saves you a lot of typing when you want to resubmit the same job cluster, and it also fully documents the job cluster parameters.
An HTCondor submit description file is a file containing a number of HTCondor variable assignments and commands that indicate the resource requirements of the program to be run, the number of instances to run in parallel, the name of the executable, etc.
Unlike the submission scripts of other schedulers such as LSF, PBS, and SLURM, an HTCondor submit description file is not a Unix shell script and hence is not executed on the compute hosts. It is a special type of file specific to HTCondor.
Recall that in most other schedulers, the job cluster parameters are embedded in the script as specially formatted comments beginning with "#SBATCH", "#PBS", or something similar. In HTCondor, the job cluster parameters and the script or program to be run on the compute hosts are kept in separate files.
HTCondor pools are often heterogeneous, i.e. the hosts in the HTCondor pool may run different operating systems, have different amounts of RAM, etc. For this reason, resource requirements are a must in many HTCondor description files.
Furthermore, HTCondor pools typically do not have any shared storage available to all hosts, so the executable, the input files, and the output files are often transferred to and from the execute hosts. Hence, the names of these files must be specified in the submit description file.
Suppose we have the following text in a file called hostname.condor:
############################################################################
#
#   hostname.condor
#
# Condor Universe
#
#   standard:
#       Defaults to transfer executables and files.
#       Use when you are running your own script or program.
#
#   vanilla:
#   grid:
#       Explicitly enable file transfer mechanisms with
#       'transfer_executable', etc.
#       Use when you are using your own files and some installed on the
#       execute hosts.
#
#   parallel:
#       Explicitly enable file transfer mechanism.  Used for MPI jobs.

universe = vanilla

# Macros (variables) to use in this description file
# This is our own custom macro.  It has no special meaning to HTCondor.
Process_count = 5

############################################################################
# Specify the executable filename.  This can be a binary file or a script.
# HTCondor does not search $PATH!  A relative pathname here refers to an
# executable in the current working directory.  To run a standard Unix command
# use an absolute pathname, or set executable to a script.

executable = /bin/hostname

# Output from executable running on the execute hosts
# $(Process) is a predefined macro containing a different integer value for
# each process, ranging from 0 to Process_count-1.
output = hostname.out-$(Process)
error = hostname.err-$(Process)

# Log file contains status information from HTCondor
log = hostname.log

############################################################################
# Condor assumes job requirements from the host submitting job.
# IT DOES NOT DEFAULT TO ACCEPTING ANY ARCH OR OPSYS!!!

# Requirements to match any FreeBSD or Linux host, 32 or 64 bit processors.
requirements = ((target.arch == "INTEL") || (target.arch == "X86_64")) && \
               ((target.opsys == "FREEBSD") || (target.opsys == "LINUX"))

# Memory requirements in mebibytes
request_memory = 10

# Executable is a standard Unix command, already on every execute host
transfer_executable = false

# Specify how many jobs you would like to submit to the queue.
queue $(Process_count)
The description file is submitted to the HTCondor scheduler as a command line argument to the condor_submit command:
shell-prompt: condor_submit hostname.condor
This will run /bin/hostname on each of 5 execute hosts.
FreeBSD login.peregrine bacon ~ 540: cat hostname.out-*
FBVM.10-4-60-177.meadows
FBVM.10-4-60-231.meadows
FBVM.10-4-60-227.meadows
FBVM.10-4-60-219.meadows
FBVM.10-4-60-231.meadows
The condor_submit command, which is part of the HTCondor scheduler, finds a free core on an execute host, reserves it, transfers the executable and input files from the submit host to the execute host if necessary, and then runs the executable on the execute host.
The executable named in the submit description file is dispatched to the host(s) containing the core(s) allocated by HTCondor and started there by HTCondor's own daemons.
Since an HTCondor submit description file specifies a single executable, we must create a script in addition to the submit description file in order to run multiple commands in sequence.
#!/bin/sh -e

# hostname.sh

hostname
uname
pwd
ls
We must then alter the submit description file so that our script is the executable, and it is transferred to the execute host:
############################################################################
#
#   hostname-script.condor
#
# Condor Universe
#
#   standard:
#       Defaults to transfer executables and files.
#       Use when you are running your own script or program.
#
#   vanilla:
#   grid:
#       Explicitly enable file transfer mechanisms with
#       'transfer_executable', etc.
#       Use when you are using your own files and some installed on the
#       execute hosts.
#
#   parallel:
#       Explicitly enable file transfer mechanism.  Used for MPI jobs.

universe = vanilla

# Macros (variables) to use in this description file
# This is our own custom macro.  It has no special meaning to HTCondor.
Process_count = 5

############################################################################
# Specify the executable filename.  This can be a binary file or a script.
# HTCondor does not search $PATH!  A relative pathname here refers to an
# executable in the current working directory.  To run a standard Unix command
# use an absolute pathname, or set executable to a script.

executable = hostname.sh

# Output from executable running on the execute hosts
# $(Process) is a predefined macro containing a different integer value for
# each process, ranging from 0 to Process_count-1.
output = hostname.out-$(Process)
error = hostname.err-$(Process)

# Log file contains status information from HTCondor
log = hostname.log

############################################################################
# Condor assumes job requirements from the host submitting job.
# IT DOES NOT DEFAULT TO ACCEPTING ANY ARCH OR OPSYS!!!

# Requirements to match any FreeBSD or Linux host, 32 or 64 bit processors.
requirements = ((target.arch == "INTEL") || (target.arch == "X86_64")) && \
               ((target.opsys == "FREEBSD") || (target.opsys == "LINUX"))

# Memory requirements in mebibytes
request_memory = 10

# Executable is our own script
transfer_executable = true

# Specify how many jobs you would like to submit to the queue.
queue $(Process_count)
The output of one job will appear as follows:
FreeBSD login.peregrine bacon ~ 542: cat hostname.out-0
FBVM.10-4-60-177.meadows
FreeBSD
/htcondor/Data/execute/dir_7104
_condor_stderr
_condor_stdout
condor_exec.exe
We can write a simple script to remove output and log files:
#!/bin/sh -e

# hostname-cleanup.sh

rm -f hostname.out-* hostname.err-* hostname.log
Since many HTCondor pools are heterogeneous, you must make sure that the executable is portable, or that you specify resource requirements to ensure that it will only be dispatched to hosts on which it will work.
Using Bourne shell scripts (not bash, ksh, csh, or tcsh) will maximize the portability of a shell script, as discussed in Chapter 4, Unix Shell Scripting. If you must use a different shell, be sure to use a portable shebang line (e.g. #!/usr/bin/env bash, not #!/bin/bash).
As HTCondor grids may be heterogeneous (running a variety of operating systems and CPU architectures), we may need more than one binary (compiled) file in order to utilize all available hosts. Maintaining multiple binaries and transferring the right one to each host can be tedious and error-prone.
As long as the source code is portable to all execute hosts, we can avoid this issue by compiling the program on each execute host as part of the job.
Below is a submit description file that demonstrates how to use an executable script to compile and run a program on each execute host. Note that the program source code is sent to the execute host as an input file, while the script that compiles it is the executable.
############################################################################
#
#   hostname-c.condor
#
# Condor Universe
#
#   standard:
#       Defaults to transfer executables and files.
#       Use when you are running your own script or program.
#
#   vanilla:
#   grid:
#       Explicitly enable file transfer mechanisms with
#       'transfer_executable', etc.
#       Use when you are using your own files and some installed on the
#       execute hosts.
#
#   parallel:
#       Explicitly enable file transfer mechanism.  Used for MPI jobs.

universe = vanilla

# Macros (variables) to use in this description file
# This is our own custom macro.  It has no special meaning to HTCondor.
Process_count = 5

############################################################################
# Specify the executable filename.  This can be a binary file or a script.
# HTCondor does not search $PATH!  A relative pathname here refers to an
# executable in the current working directory.  To run a standard Unix command
# use an absolute pathname, or set executable to a script.

executable = hostname-c.sh

# Output from executable running on the execute hosts
# $(Process) is a predefined macro containing a different integer value for
# each process, ranging from 0 to Process_count-1.
output = hostname.out-$(Process)
error = hostname.err-$(Process)

# Log file contains status information from HTCondor
log = hostname.log

############################################################################
# Condor assumes job requirements from the host submitting job.
# IT DOES NOT DEFAULT TO ACCEPTING ANY ARCH OR OPSYS!!!

# Requirements to match any FreeBSD or Linux host, 32 or 64 bit processors.
requirements = ((target.arch == "INTEL") || (target.arch == "X86_64")) && \
               ((target.opsys == "FREEBSD") || (target.opsys == "LINUX"))

# Memory requirements in mebibytes
request_memory = 50

# Executable is our own script
transfer_executable = true

# Send the source code as an input file for the executable
transfer_input_files = hostname.c

# Specify how many jobs you would like to submit to the queue.
queue $(Process_count)
The executable script is below. It uses set -x so that we can see the commands executed by the script on the execute hosts.
#!/bin/sh -e

# hostname-c.sh

# HTCondor environment contains a minimal PATH, so help it find cc
PATH=/bin:/usr/bin
export PATH

# Echo commands
set -x

pwd
cc -o hostname hostname.c
./hostname
Sample output from one of the jobs:
FreeBSD login.peregrine bacon ~ 402: cat hostname.out-0
/htcondor/Data/execute/dir_7347
Hello from FBVM.10-4-60-177.meadows!

FreeBSD login.peregrine bacon ~ 403: cat hostname.err-0
+ pwd
+ cc -o hostname hostname.c
+ ./hostname
When using a grid, it is important to develop a feel for the resources required by your jobs, and inform the scheduler as accurately as possible what will be needed in terms of CPU time, memory, etc.
You may not know this the first time you run a given job, but after examining the log files from one or more runs, you will have a pretty good idea.
This allows the scheduler to maximize the utilization of precious resources and thereby provide the best possible run times for all users.
HTCondor resource requirements are specified with the requirements variable. There are many parameters available to specify the operating system, memory requirements, etc. Users should try to match requirements as closely as possible to the actual requirements of the job cluster.
For example, if your job requires only 300 megabytes of RAM, then by specifying this, you can encourage HTCondor to save the hosts with several gigabytes of RAM for the jobs that really need them.
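For instance, a submit description file for such a job might contain entries like the following. The value 300 is only an illustrative figure; base the real value on what your log files show the job actually using:

# Memory requirements in mebibytes
request_memory = 300

# Only dispatch to 64-bit Linux hosts
requirements = (target.arch == "X86_64") && (target.opsys == "LINUX")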
A batch serial submit description file need only specify basic information such as the executable name, input and output files, and basic resource requirements.
universe = vanilla

executable = hostname.sh

output = $(Process).stdout
error = $(Process).stderr
log = hostname.log

transfer_executable = yes
should_transfer_files = yes
when_to_transfer_output = on_exit

queue 1
A job cluster is a set of independent processes all started by a single job submission.
A batch parallel submit description file looks almost exactly like a batch serial file, but indicates a process count greater than 1 following the queue command:
universe = vanilla

executable = hostname.sh

output = $(Process).stdout
error = $(Process).stderr
log = hostname.log

transfer_executable = yes
should_transfer_files = yes
when_to_transfer_output = on_exit

queue 10
Most pools are not designed to run MPI jobs. MPI jobs often require a fast, dedicated network to accommodate extensive message passing. A pool implemented on lab or office PCs is not suitable for this: a typical local area network in a lab is not fast enough to offer good performance, and MPI traffic may in fact overload it, causing problems for other users.
HTCondor does provide facilities for running MPI jobs, but they are usually only useful where HTCondor is employed as the scheduler for a cluster.
Below is a sample submit description file with typical options and extensive comments. This file is also available in /share1/Examples on Peregrine.
############################################################################
# Sample HTCondor submit description file.
#
# Use \ to continue an entry on the next line.
#
# You can query your jobs by command:
#   condor_q

############################################################################
# Choose the universe in which your program will run.
# Available options are
#
#   - standard:
#       Defaults to transfer executables and files.
#       Use when you are running your own script or program.
#
#   - vanilla:
#   - grid:
#       Explicitly enable file transfer mechanisms with
#       'transfer_executable', etc.
#       Use when you are using your own files and some installed on the
#       execute hosts.
#
#   - parallel:
#       Explicitly enable file transfer mechanism.  Used for MPI jobs.

universe = vanilla

# Macros (variables) to use in this submit description file
Points = 1000000000
Process_count = 10

############################################################################
# Specify the executable filename.  This can be a binary file or a script.
# NOTE: The POVB execute hosts currently support 32-bit executables only.
# If compiling a program on the execute hosts, this script should compile
# and run the program.
#
# In template.sh, be sure to give the executable a different
# name for each process, since multiple processes could be on the same host.
# E.g. cc -O -o prog.$(Process) prog.c

executable = template.sh

# Command-line arguments for executing template.sh
# arguments =

############################################################################
# Set environment variables for use by the executable on the execute hosts.
# Enclose the entire environment string in quotes.
# A variable assignment is var=value (no space around =).
# Separate variable assignments with whitespace.

environment = "Process=$(Process) Process_count=$(Process_count) Points=$(Points)"

############################################################################
# Where the standard output and standard error from executables go.
# $(Process) is current job ID.
#
# If running template under both PBS and HTCondor, use same output
# names here as in template-run.pbs so that we can use the same
# script to tally all the outputs from any run.

output = template.out-$(Process)
error = template.err-$(Process)

############################################################################
# Logs for the job, produced by HTCondor.  This contains output from
# HTCondor, not from the executable.

log = template.log

############################################################################
# Custom job requirements
# HTCondor assumes job requirements from the host submitting job.
# IT DOES NOT DEFAULT TO ACCEPTING ANY ARCH OR OPSYS!!!
#
# For example, if the job is submitted from peregrine, target.arch is
# "X86_64" and target.opsys is "FREEBSD8", which do not match
# POVB execute hosts.
#
# You can query if your submitting host is accepted by command:
#   condor_q -analyze

# Memory requirements in mebibytes
request_memory = 50

# Requirements for a binary compiled on CentOS 4 (POVB hosts):
# requirements = (target.arch == "INTEL") && (target.opsys == "LINUX")

# Requirements for a Unix shell script or Unix program compiled on the
# execute host:
requirements = ((target.arch == "INTEL") || (target.arch == "X86_64")) && \
               ((target.opsys == "FREEBSD") || (target.opsys == "LINUX"))

# Requirements for a job utilizing software installed via FreeBSD ports:
# requirements = ((target.arch == "INTEL") || (target.arch == "X86_64")) && \
#                (target.opsys == "FREEBSD")

# Match specific compute host names
# requirements = regexp("compute-s.*.meadows", Machine)

############################################################################
# Explicitly enable executable transfer mechanism for vanilla universe.

# true | false
transfer_executable = true

# yes | no | if_needed
should_transfer_files = if_needed

# All files to be transferred to the execute hosts in addition to the
# executable.  If compiling on the execute hosts, list the source file(s)
# here, and put the compile command in the executable script.
transfer_input_files = template.c

# All files to be transferred back from the execute hosts in addition to
# those listed in "output" and "error".
# transfer_output_files = file1,file2,...

# on_exit | on_exit_or_evict
when_to_transfer_output = on_exit

############################################################################
# Specify how many jobs you would like to submit to the queue.
queue $(Process_count)
While your jobs are running, you can check on their status using condor_q.
FreeBSD peregrine bacon ~ 533: condor_q

-- Submitter: peregrine.hpc.uwm.edu : <129.89.25.224:37668> : peregrine.hpc.uwm.edu
 ID      OWNER          SUBMITTED     RUN_TIME ST PRI SIZE CMD
  63.0   bacon          6/5  13:45  0+00:00:00 I  0   0.0  hostname.sh
  64.0   bacon          6/5  13:46  0+00:00:00 I  0   0.0  hostname.sh

2 jobs; 2 idle, 0 running, 0 held
The ST column indicates job status (R = running, I = idle, C = completed). Run man condor_q for full details.
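If a job sits idle for longer than expected, condor_q -analyze will report why it is not being matched with any execute hosts. For example, using one of the job IDs from the listing above:

shell-prompt: condor_q -analyze 64.0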
If you determine that a job is not behaving properly (by reviewing partial output, for example), you can terminate it using condor_rm, which takes a job ID or job cluster ID as a command line argument.
To terminate a single job within a cluster, use the job ID form cluster-id.job-index:
FreeBSD peregrine bacon ~ 534: condor_rm 63.0
Job 63.0 marked for removal

FreeBSD peregrine bacon ~ 535: condor_q

-- Submitter: peregrine.hpc.uwm.edu : <129.89.25.224:37668> : peregrine.hpc.uwm.edu
 ID      OWNER          SUBMITTED     RUN_TIME ST PRI SIZE CMD
  64.0   bacon          6/5  13:46  0+00:00:00 I  0   0.0  hostname.sh

1 jobs; 1 idle, 0 running, 0 held
To terminate an entire job cluster, just provide the job cluster ID:
FreeBSD peregrine bacon ~ 534: condor_rm 63
If you need to submit a series of jobs in sequence, where one job begins only after another has completed, you can do so from a shell script that uses condor_wait to block until the previous job cluster has finished, as shown in the template below.
It's important to make sure that the current job completed successfully before submitting the next, to avoid wasting resources. It is up to you to determine the best way to verify that a job was successful. Examples might include grepping the log file for some string indicating success, or making the job create a marker file using the touch command after a successful run. If the command used in your job returns a Unix-style exit status (0 for success, non-zero on error), then you can simply use the shell's exit-on-error feature to make your script exit when any command fails. Below is a template for scripts that might run a series of jobs.
#!/bin/sh

condor_submit job1.condor
condor_wait job1-log-file

# Verify that job1 completed successfully, by the method of your choice
if ! test-to-indicate-job1-succeeded; then
    exit 1
fi

condor_submit job2.condor
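As a concrete sketch of the marker-file approach mentioned above, suppose (hypothetically) that job1's executable script ends with "touch job1.success" and that job1.condor sets log = job1.log. The script below would then refuse to submit job2 unless job1 finished cleanly; adapt the file names to your own jobs.

#!/bin/sh

# Hypothetical sketch: job1's script runs "touch job1.success" as its last
# command, so the marker file exists only if job1 ran to completion.

condor_submit job1.condor

# Wait for all jobs recorded in job1's log file to finish
condor_wait job1.log

if [ ! -e job1.success ]; then
    echo "job1 did not complete successfully; not submitting job2."
    exit 1
fi

condor_submit job2.condor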
Write and submit a batch-serial HTCondor submit description file called list-etc.condor that prints the host name of the execute host on which it runs and a long listing of the /etc directory on that host. The job should store the output of the commands in list-etc.stdout and error messages in list-etc.stderr in the directory from which the job was submitted. Quickly check the status of your job after submitting it.
Copy your list-etc.condor file to list-etc-parallel.condor, and modify it so that it runs the hostname and ls commands on 10 cores instead of just one. The job should produce a separate output file for each process, named list-etc-parallel.o<jobid>, and a separate error file for each process, named list-etc-parallel.e<jobid>. Quickly check the status of your job after submitting it.