Chapter 10. Job Scheduling with HTCondor

Table of Contents

10.1. The HTCondor Resource Manager
A Unique Approach to Resource Management
HTCondor Terminology
Pool Status
Job Submission
Job Status
Terminating a Job or Cluster
Job Sequences
Self-test
10.2. Local Customizations

This chapter is based on content developed by Dr. Lars Olson, Marquette University.

Before You Begin

Before reading this chapter, you should be familiar with basic Unix concepts (Chapter 3, Using Unix), the Unix shell (the section called “Command Line Interfaces (CLIs): Unix Shells”), redirection (the section called “Redirection and Pipes”), shell scripting (Chapter 4, Unix Shell Scripting), and job scheduling (Chapter 7, Job Scheduling).

For complete information, see the official HTCondor documentation at http://research.cs.wisc.edu/htcondor/.

The HTCondor Resource Manager

The goal of this section is to provide a quick introduction to the HTCondor scheduler. For more detailed information, consult the HTCondor man pages and the HTCondor website.

A Unique Approach to Resource Management

HTCondor is very different from other common resource managers because it serves different goals.

HTCondor is intended for use on "distributively owned" resources, i.e. PCs that are owned by people other than the HTCondor users and used primarily for purposes other than parallel computing. Most other resource managers are designed for use on dedicated resources, such as a cluster of computers built specifically for parallel computation.

As such, HTCondor is designed to "borrow" resources from their owners while the owners are not otherwise using them. The PC owner has the right to "evict" (terminate) an HTCondor process at any time and without warning. In fact, if an owner logs into a PC while an HTCondor process is running, the process is generally evicted and restarted elsewhere.

HTCondor can be configured so that processes are merely suspended or even allowed to continue running, but this requires permission from the owner of the PC. Since the PCs on an HTCondor pool may be owned by many different people, it's best to assume that at least some of the PCs are configured to evict HTCondor processes when a local user is active.

A computer used by HTCondor should therefore not be expected to remain available for long periods of time. Plan for some of the processes running under HTCondor to be terminated and restarted on a different machine.

The longer an HTCondor process runs, the less likely it is to complete. As a rule of thumb, an HTCondor process should ideally run no more than a few hours. This is usually easy to achieve. Most HTCondor users are running embarrassingly parallel jobs that simply divide up input data or parameter space among an arbitrary number of independent processes. If such a user has 1,000 hours worth of computation, they can just as easily do it with 10 jobs running for 100 hours each, 100 jobs running for 10 hours each, or 500 jobs running for 2 hours each.

Don't make your processes too short, though. Each process entails some scheduling overhead, and may take up to a minute or so to start. We want our jobs to spend much more time running than sitting in a queue in order to maximize throughput. There is little benefit to breaking computation into jobs much shorter than an hour and scheduling overhead will become significant if you do.

One of the advantages of HTCondor is that the number of available cores is usually irrelevant to how you divide up your job. HTCondor will simply run as many processes as it can at any given moment and start others as cores become available. For example, even if you only have 200 cores available, you can run a job consisting of 500 processes. HTCondor will run up to 200 at a time until all 500 are finished. 500 processes running for 2 hours each is usually preferable to 100 processes of 10 hours each, since this reduces the risk of individual processes being evicted from a PC.

HTCondor Terminology
  • An execute host is a computer available for use by HTCondor for running a user's processes.
  • A job is usually one process running on an execute host managed by HTCondor.
  • A cluster (not to be confused with a hardware cluster) is the group of jobs from a single condor_submit, all of which share the same numeric ID.
  • A submit host is a computer from which HTCondor jobs can be submitted.
  • A central manager is a computer that manages a specific pool of computers, which may include execute hosts, submit hosts, and possibly other types of hosts. Every HTCondor pool has its own central manager.
  • A grid is a group of computers available for parallel computing. It consists of one or more pools, each with its own central manager.
  • Flocking is a system that allows jobs from one pool that has run out of resources to migrate to another pool, much like a flock of birds migrating when they run out of food.
Pool Status

Before scheduling any jobs through HTCondor, it is often useful to check the status of the hosts.

The condor_status command shows the status of all the execute hosts in the pool.

FreeBSD peregrine bacon ~ 402: condor_status | more

Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

povbEEFF73C040D8.l LINUX      INTEL  Backfill  Idle     0.050  1968  0+00:00:03
povbEEFF73C04485.l LINUX      INTEL  Backfill  Idle     0.080  1968  0+00:00:04
povbEEFF73C046A1.l LINUX      INTEL  Backfill  Idle     0.020  1968  0+00:00:04
povbEEFF73C048DB.l LINUX      INTEL  Backfill  Idle     0.000  1968  0+00:00:04
povbEEFF73C04B9F.l LINUX      INTEL  Backfill  Idle     0.130  1968  0+00:00:04
povbEEFF73C04BF1.l LINUX      INTEL  Backfill  Idle     0.110  1968  0+00:00:04
povbEEFF73C0516E.l LINUX      INTEL  Backfill  Idle     0.050  1968  0+00:00:04
povbEEFF73C061C5.l LINUX      INTEL  Backfill  Idle     0.070  1968  0+00:00:04
povbEEFF73C06F85.l LINUX      INTEL  Backfill  Idle     0.130  1968  0+00:00:04
povbEEFF73C0873B.l LINUX      INTEL  Backfill  Idle     0.150  1968  0+00:00:04
povbEEFF73C08BA6.l LINUX      INTEL  Backfill  Idle     0.020  1968  0+00:00:04
povbEEFF73C08E57.l LINUX      INTEL  Backfill  Idle     0.070  1968  0+00:00:04
povbEEFFA0A4F157.l LINUX      INTEL  Owner     Idle     0.290  1009  0+09:32:43
povbEEFFA0AA6F42.l LINUX      INTEL  Backfill  Idle     0.090  1009  0+00:00:04
povbEEFFA0AA717C.l LINUX      INTEL  Backfill  Idle     0.370  1009  0+00:00:04
povbEEFFA0AD6A4C.l LINUX      INTEL  Backfill  Idle     0.100  1009  0+00:00:04
povbEEFFA0B8E890.l LINUX      INTEL  Backfill  Idle     0.080  1009  0+00:00:04
povbEEFFA0DDD5C2.l LINUX      INTEL  Backfill  Idle     0.140  1009  0+00:00:03
povbEEFFA0DDD60F.l LINUX      INTEL  Backfill  Idle     0.200  1009  0+00:00:05
povbEEFFA0DDD652.l LINUX      INTEL  Backfill  Idle     0.200  1009  0+00:00:03
--More--(byte 1682)

The condor_status command shows only resources available in the local pool. If flocking is configured, it will be possible for jobs to utilize resources in other pools, but those resources are not shown by condor_status unless specifically requested.
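If flocking is configured and you know the host name of another pool's central manager, you can point condor_status at that pool to see its resources. A minimal sketch, assuming the stock condor_status options and a hypothetical central manager name:

shell-prompt: condor_status -pool cm.other-pool.example.edu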

Job Submission

The purpose of this section is to provide the reader a quick start in job scheduling using the most common and useful tools. The full details of job submission are beyond the scope of this document.

Submit Description Files

Submitting jobs involves specifying a number of job cluster parameters such as the executable to run (which is displayed by condor_q), the name(s) of the output file(s), the resource requirements, etc.

In order to record all of this information and make it easy to resubmit the same job cluster, this information is incorporated into a submit description file. Using a submit description file saves you a lot of typing when you want to resubmit the same job cluster, and also fully documents the job cluster parameters.

An HTCondor submit description file is a file containing a number of HTCondor variable assignments and commands that indicate the resource requirements of the program to be run, the number of instances to run in parallel, the name of the executable, etc.

Note

Unlike the submission scripts of other schedulers such as LSF, PBS, and SLURM, a condor submit description file is not a Unix shell script and hence is not executed on the compute hosts. It is a special type of file specific to HTCondor.

Recall that in most other schedulers, the job cluster parameters are embedded in the script as specially formatted comments beginning with "#SBATCH", "#PBS", or something similar. In HTCondor, the job cluster parameters and the script or program to be run on the compute hosts are kept in separate files.

HTCondor pools are often heterogeneous, i.e. the hosts in the HTCondor pool may run different operating systems, have different amounts of RAM, etc. For this reason, resource requirements are a must in many HTCondor description files.

Furthermore, HTCondor pools typically do not have any shared storage available to all hosts, so the executable, the input files, and the output files are often transferred to and from the execute hosts. Hence, the names of these files must be specified in the submit description file.
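For example, a submit description file that ships an input file to the execute host and brings output back when the job exits might contain lines like the following sketch (the file name mydata.txt is hypothetical; the same commands appear in the template later in this section):

# Transfer the executable named by "executable" to the execute host
transfer_executable = true

# Transfer input and output files to and from the execute host
should_transfer_files = yes
transfer_input_files = mydata.txt
when_to_transfer_output = on_exit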

Suppose we have the following text in a file called hostname.condor:

############################################################################
#
# hostname.condor
#
# Condor Universe
#
# standard:
#      Defaults to transfer executables and files.
#      Use when you are running your own script or program.
#   
# vanilla:
# grid:
#      Explicitly enable file transfer mechanisms with
#      'transfer_executable', etc.
#      Use when you are using your own files and some installed on the
#      execute hosts.
#
# parallel:
#      Explicitly enable file transfer mechanism. Used for MPI jobs.

universe = vanilla

# Macros (variables) to use in this description file
# This is our own custom macro.  It has no special meaning to HTCondor.
Process_count = 5

############################################################################
# Specify the executable filename.  This can be a binary file or a script.
# HTCondor does not search $PATH!  A relative pathname here refers to an
# executable in the current working directory.  To run a standard Unix command
# use an absolute pathname, or set executable to a script.
executable = /bin/hostname

# Output from executable running on the execute hosts
# $(Process) is a predefined macro containing a different integer value for
# each process, ranging from 0 to Process_count-1.
output = hostname.out-$(Process)
error = hostname.err-$(Process)

# Log file contains status information from HTCondor
log = hostname.log 

############################################################################
# Condor assumes job requirements from the host submitting job. 
# IT DOES NOT DEFAULT TO ACCEPTING ANY ARCH OR OPSYS!!!

# Requirements to match any FreeBSD or Linux host, 32 or 64 bit processors.
requirements = ((target.arch == "INTEL") || (target.arch == "X86_64")) && \
               ((target.opsys == "FREEBSD") || (target.opsys == "LINUX"))

# Memory requirements in mebibytes
request_memory = 10

# Executable is a standard Unix command, already on every execute host
transfer_executable = false

# Specify how many jobs you would like to submit to the queue.
queue $(Process_count)

The description file is submitted to the HTCondor scheduler as a command line argument to the condor_submit command:

shell-prompt: condor_submit hostname.condor
                

This will run /bin/hostname as 5 separate jobs, which may or may not land on 5 distinct execute hosts.

FreeBSD login.peregrine bacon ~ 540: cat hostname.out-*
FBVM.10-4-60-177.meadows
FBVM.10-4-60-231.meadows
FBVM.10-4-60-227.meadows
FBVM.10-4-60-219.meadows
FBVM.10-4-60-231.meadows
                

The condor_submit command, which is part of the HTCondor scheduler, finds a free core on an execute host, reserves it, transfers the executable and input files from the submit host to the execute host if necessary, and then runs the executable on the execute host.

The executable named in the submit description file is dispatched to the host(s) containing the core(s) allocated by HTCondor, using HTCondor's own file transfer and remote execution mechanisms.

Since an HTCondor submit description file specifies a single executable, we must create a script in addition to the submit description file in order to run multiple commands in sequence.

#!/bin/sh -e

# hostname.sh

hostname
uname
pwd
ls

We must then alter the submit description file so that our script is the executable, and it is transferred to the execute host:

############################################################################
#
# hostname-script.condor
#
# Condor Universe
#
# standard:
#      Defaults to transfer executables and files.
#      Use when you are running your own script or program.
#   
# vanilla:
# grid:
#      Explicitly enable file transfer mechanisms with
#      'transfer_executable', etc.
#      Use when you are using your own files and some installed on the
#      execute hosts.
#
# parallel:
#      Explicitly enable file transfer mechanism. Used for MPI jobs.

universe = vanilla

# Macros (variables) to use in this description file
# This is our own custom macro.  It has no special meaning to HTCondor.
Process_count = 5

############################################################################
# Specify the executable filename.  This can be a binary file or a script.
# HTCondor does not search $PATH!  A relative pathname here refers to an
# executable in the current working directory.  To run a standard Unix command
# use an absolute pathname, or set executable to a script.
executable = hostname.sh

# Output from executable running on the execute hosts
# $(Process) is a predefined macro containing a different integer value for
# each process, ranging from 0 to Process_count-1.
output = hostname.out-$(Process)
error = hostname.err-$(Process)

# Log file contains status information from HTCondor
log = hostname.log 

############################################################################
# Condor assumes job requirements from the host submitting job. 
# IT DOES NOT DEFAULT TO ACCEPTING ANY ARCH OR OPSYS!!!

# Requirements to match any FreeBSD or Linux host, 32 or 64 bit processors.
requirements = ((target.arch == "INTEL") || (target.arch == "X86_64")) && \
               ((target.opsys == "FREEBSD") || (target.opsys == "LINUX"))

# Memory requirements in mebibytes
request_memory = 10

# Executable is our own script
transfer_executable = true

# Specify how many jobs you would like to submit to the queue.
queue $(Process_count)

The output of one job will appear as follows:

FreeBSD login.peregrine bacon ~ 542: cat hostname.out-0 
FBVM.10-4-60-177.meadows
FreeBSD
/htcondor/Data/execute/dir_7104
_condor_stderr
_condor_stdout
condor_exec.exe
                

We can write a simple script to remove output and log files:

#!/bin/sh -e

# hostname-cleanup.sh

rm -f hostname.out-* hostname.err-* hostname.log

Note

Since many HTCondor pools are heterogeneous, you must make sure that the executable is portable, or that you specify resource requirements to ensure that it will only be dispatched to hosts on which it will work.

Using Bourne shell scripts (not bash, ksh, csh, or tcsh) will maximize the portability of a shell script, as discussed in Chapter 4, Unix Shell Scripting. If you must use a different shell, be sure to use a portable shebang line (e.g. #!/usr/bin/env bash, not #!/bin/bash).

Running Your Own Compiled Programs

As HTCondor grids may be heterogeneous (running a variety of operating systems and CPU architectures), we may need more than one binary (compiled) file in order to utilize all available hosts. Maintaining multiple binaries and transferring the right one to each host can be tedious and error-prone.

As long as the source code is portable to all execute hosts, we can avoid this issue by compiling the program on each execute host as part of the job.

Below is a submit description file that demonstrates how to use an executable script to compile and run a program on each execute host. Note that the program source code is sent to the execute host as an input file, while the script that compiles it is the executable.

############################################################################
#
# hostname-c.condor
#
# Condor Universe
#
# standard:
#      Defaults to transfer executables and files.
#      Use when you are running your own script or program.
#   
# vanilla:
# grid:
#      Explicitly enable file transfer mechanisms with
#      'transfer_executable', etc.
#      Use when you are using your own files and some installed on the
#      execute hosts.
#
# parallel:
#      Explicitly enable file transfer mechanism. Used for MPI jobs.

universe = vanilla

# Macros (variables) to use in this description file
# This is our own custom macro.  It has no special meaning to HTCondor.
Process_count = 5

############################################################################
# Specify the executable filename.  This can be a binary file or a script.
# HTCondor does not search $PATH!  A relative pathname here refers to an
# executable in the current working directory.  To run a standard Unix command
# use an absolute pathname, or set executable to a script.
executable = hostname-c.sh

# Output from executable running on the execute hosts
# $(Process) is a predefined macro containing a different integer value for
# each process, ranging from 0 to Process_count-1.
output = hostname.out-$(Process)
error = hostname.err-$(Process)

# Log file contains status information from HTCondor
log = hostname.log 

############################################################################
# Condor assumes job requirements from the host submitting job. 
# IT DOES NOT DEFAULT TO ACCEPTING ANY ARCH OR OPSYS!!!

# Requirements to match any FreeBSD or Linux host, 32 or 64 bit processors.
requirements = ((target.arch == "INTEL") || (target.arch == "X86_64")) && \
               ((target.opsys == "FREEBSD") || (target.opsys == "LINUX"))

# Memory requirements in mebibytes
request_memory = 50

# Executable is our own script
transfer_executable = true

# Send the source code as an input file for the executable
transfer_input_files = hostname.c

# Specify how many jobs you would like to submit to the queue.
queue $(Process_count)

The executable script is below. We enable command echoing with set -x in this example so that we can see, in the standard error output, the commands executed by the script on the execute hosts.

#!/bin/sh -e

# hostname-c.sh

# HTCondor environment contains a minimal PATH, so help it find cc
PATH=/bin:/usr/bin
export PATH

# Echo commands
set -x

pwd
cc -o hostname hostname.c
./hostname

Sample output from one of the jobs:

FreeBSD login.peregrine bacon ~ 402: cat hostname.out-0
/htcondor/Data/execute/dir_7347
Hello from FBVM.10-4-60-177.meadows!

FreeBSD login.peregrine bacon ~ 403: cat hostname.err-0
+ pwd
+ cc -o hostname hostname.c
+ ./hostname
                
Common Features
  • Anything from a '#' character to the end of a line is considered a comment, and is ignored by HTCondor.
  • universe: Categorizes jobs in order to set reasonable defaults for different job types and thus reduce the number of explicit settings in the submit description file.
  • executable: Name of the program or script to run on the execute host(s). This is often transferred from the submit host to the execute hosts, or even compiled on the execute hosts at the start of the job.
  • output: Name of a file to which the standard output (normal terminal output) is redirected. Jobs have no connection to the terminal while they run under HTCondor, so the output must be saved to a file for later viewing.
  • error: Name of a file to which the standard error (error messages normally sent to the terminal) is redirected.
  • input: Name of a file from which to redirect the input that the program would otherwise expect from the standard input (normally the keyboard).
  • log: Name of a file to which HTCondor saves its own informative messages about the job cluster.
  • requirements: Used to describe the requirements for running the program, such as operating system, memory needs, CPU type, etc.
  • transfer_executable = true|false: Indicates whether the executable specified with the executable variable must be transferred to the execute host before the job executes.
  • should_transfer_files = yes|no|if_needed: Indicates whether input and output files must be transferred to/from the execute host. Normally yes unless a shared file system is available.
  • when_to_transfer_output: Indicates when output files should be transferred from the execute host back to the submit host. Normally use "on_exit".
  • transfer_input_files = list: Comma-separated list of input files to transfer to the execute host.
  • transfer_output_files = list: Comma-separated list of output files to transfer back from the execute host.
  • queue [count]: Command to submit the job cluster to one or more execute hosts. If no count is given, it defaults to 1.
HTCondor Resource Requirements

When using a grid, it is important to develop a feel for the resources required by your jobs, and to inform the scheduler as accurately as possible of what will be needed in terms of CPU time, memory, etc.

You may not know this the first time you run a given job, but after examining the log files from one or more runs, you will have a pretty good idea.

This allows the scheduler to maximize the utilization of precious resources and thereby provide the best possible run times for all users.

HTCondor resource requirements are specified with the requirements variable. There are many parameters available to specify the operating system, memory requirements, etc. Users should try to match requirements as closely as possible to the actual requirements of the job cluster.

For example, if your job requires only 300 megabytes of RAM, then by specifying this, you can encourage HTCondor to save the hosts with several gigabytes of RAM for the jobs that really need them.
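For instance, a job known to need about 300 megabytes of RAM and a single core might state that explicitly. A minimal sketch (request_cpus is assumed to be supported by your pool's HTCondor version):

# Request only what the job actually needs, so that larger hosts remain
# available for the jobs that really need them
request_memory = 300
request_cpus = 1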

Batch Serial Jobs

A batch serial submit description file need only specify basic information such as the executable name, input and output files, and basic resource requirements.

universe = vanilla 
executable = hostname.sh
output = $(Process).stdout 
error = $(Process).stderr 
log = hostname.log 
transfer_executable = yes
should_transfer_files = yes 
when_to_transfer_output = on_exit 

queue 1
                
Batch Parallel Jobs (Job Clusters)

A job cluster is a set of independent processes all started by a single job submission.

A batch parallel submit description file looks almost exactly like a batch serial file, but indicates a process count greater than 1 following the queue command:

universe = vanilla 
executable = hostname.sh
output = $(Process).stdout 
error = $(Process).stderr 
log = hostname.log 
transfer_executable = yes
should_transfer_files = yes 
when_to_transfer_output = on_exit 

queue 10
                
MPI (Multi-core) Jobs

Most pools are not designed to run MPI jobs. MPI jobs often require a fast dedicated network to accommodate extensive message passing. A pool implemented on lab or office PCs is not suitable for this: a typical local area network in a lab is not fast enough to offer good performance, and MPI traffic may in fact overload it, causing problems for other users.

HTCondor does provide facilities for running MPI jobs, but they are usually only useful where HTCondor is employed as the scheduler for a cluster.
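For reference, HTCondor's MPI support uses the parallel universe, and the number of hosts is requested with machine_count rather than a count on the queue command. The sketch below only outlines the idea; the wrapper script name mpi-job.sh is hypothetical, and a working MPI job also requires site-specific setup on the execute hosts:

universe = parallel
executable = mpi-job.sh
machine_count = 8
output = mpi-job.out
error = mpi-job.err
log = mpi-job.log

queue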

A Submit Description File Template

Below is a sample submit description file with typical options and extensive comments. This file is also available in /share1/Examples on Peregrine.

############################################################################
# Sample HTCondor submit description file.
#
# Use \ to continue an entry on the next line.
#
# You can query your jobs by command:
# condor_q 

############################################################################
# Choose which universe you want your program to run in.
# Available options are 
#
# - standard:
#      Defaults to transfer executables and files.
#      Use when you are running your own script or program.
#   
# - vanilla:
# - grid:
#      Explicitly enable file transfer mechanisms with
#      'transfer_executable', etc.
#      Use when you are using your own files and some installed on the
#      execute hosts.
#
# - parallel:
#      Explicitly enable file transfer mechanism. Used for MPI jobs.

universe = vanilla 

# Macros (variables) to use in this submit description file
Points = 1000000000
Process_count = 10

############################################################################
# Specify the executable filename.  This can be a binary file or a script.
# NOTE: The POVB execute hosts currently support 32-bit executables only.
# If compiling a program on the execute hosts, this script should compile
# and run the program.
#
# In template.sh, be sure to give the executable a different
# name for each process, since multiple processes could be on the same host.
# E.g. cc -O -o prog.$Process prog.c

executable = template.sh

# Command-line arguments for executing template.sh
# arguments = 

############################################################################
# Set environment variables for use by the executable on the execute hosts.
# Enclose the entire environment string in quotes.
# A variable assignment is var=value (no space around =).
# Separate variable assignments with whitespace.

environment = "Process=$(Process) Process_count=$(Process_count) Points=$(Points)"

############################################################################
# Where the standard output and standard error from executables go.
# $(Process) is the process number within the job cluster.

# If running template under both PBS and HTCondor, use same output
# names here as in template-run.pbs so that we can use the same
# script to tally all the outputs from any run.

output = template.out-$(Process)
error = template.err-$(Process)

############################################################################
# Logs for the job, produced by HTCondor.  This contains output from
# HTCondor, not from the executable.

log = template.log 

############################################################################
# Custom job requirements
# HTCondor assumes job requirements from the host submitting job. 
# IT DOES NOT DEFAULT TO ACCEPTING ANY ARCH OR OPSYS!!!
# For example, if the job is submitted from peregrine, target.arch is
# "X86_64" and target.opsys is "FREEBSD8", which do not match
# POVB execute hosts.
#
# You can query if your submitting host is accepted by command:
#  condor_q -analyze

# Memory requirements in mebibytes
request_memory = 50

# Requirements for a binary compiled on CentOS 4 (POVB hosts):
# requirements = (target.arch == "INTEL") && (target.opsys == "LINUX")

# Requirements for a Unix shell script or Unix program compiled on the
# execute host:
requirements = ((target.arch == "INTEL") || (target.arch == "X86_64")) && \
               ((target.opsys == "FREEBSD") || (target.opsys == "LINUX"))

# Requirements for a job utilizing software installed via FreeBSD ports:
# requirements = ((target.arch == "INTEL") || (target.arch == "X86_64")) && \
#    (target.opsys == "FREEBSD")

# Match specific compute host names
# requirements = regexp("compute-s.*.meadows", Machine)

############################################################################
# Explicitly enable executable transfer mechanism for vanilla universe.

# true | false
transfer_executable = true

# yes | no | if_needed
should_transfer_files = if_needed

# All files to be transferred to the execute hosts in addition to the
# executable.  If compiling on the execute hosts, list the source file(s)
# here, and put the compile command in the executable script.
transfer_input_files = template.c

# All files to be transferred back from the execute hosts in addition to
# those listed in "output" and "error".
# transfer_output_files = file1,file2,...

# on_exit | on_exit_or_evict
when_to_transfer_output = on_exit 

############################################################################
# Specify how many jobs you would like to submit to the queue.

queue $(Process_count)

Job Status

While your jobs are running, you can check on their status using condor_q.

FreeBSD peregrine bacon ~ 533: condor_q

-- Submitter: peregrine.hpc.uwm.edu : <129.89.25.224:37668> : peregrine.hpc.uwm.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
  63.0   bacon           6/5  13:45   0+00:00:00 I  0   0.0  hostname.sh             
  64.0   bacon           6/5  13:46   0+00:00:00 I  0   0.0  hostname.sh             

2 jobs; 2 idle, 0 running, 0 held
            

The ST column indicates job status (R = running, I = idle, C = completed). Run man condor_q for full details.
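If a job sits idle longer than expected, condor_q can also try to explain why it is not being matched with any execute host. A brief example, assuming the stock condor_q options and the job ID from the listing above:

shell-prompt: condor_q -analyze 64.0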

Terminating a Job or Cluster

If you determine that a job is not behaving properly (by reviewing partial output, for example), you can terminate it using condor_rm, which takes a job ID or job cluster ID as a command line argument.

To terminate a single job within a cluster, use the job ID form cluster-id.job-index:

FreeBSD peregrine bacon ~ 534: condor_rm 63.0
Job 63.0 marked for removal
FreeBSD peregrine bacon ~ 535: condor_q


-- Submitter: peregrine.hpc.uwm.edu : <129.89.25.224:37668> : peregrine.hpc.uwm.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
  64.0   bacon           6/5  13:46   0+00:00:00 I  0   0.0  hostname.sh             

1 jobs; 1 idle, 0 running, 0 held
            

To terminate an entire job cluster, just provide the job cluster ID:

FreeBSD peregrine bacon ~ 534: condor_rm 63
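
condor_rm can also remove jobs in bulk. For example, assuming the stock condor_rm options, giving your own user name should remove all of your jobs at once:

shell-prompt: condor_rm bacon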
            
Job Sequences

If you need to submit a series of jobs in sequence, where one job begins only after another has completed, you can write a script that submits each job and uses condor_wait to wait for it to finish before submitting the next.

It's important to make sure that the current job completed successfully before submitting the next, to avoid wasting resources. It is up to you to determine the best way to verify that a job was successful. Examples might include grepping the log file for some string indicating success, or making the job create a marker file using the touch command after a successful run. If the command used in your job returns a Unix-style exit status (0 for success, non-zero on error), then you can simply use the shell's exit-on-error feature to make your script exit when any command fails. Below is a template for scripts that might run a series of jobs.

#!/bin/sh

# Submit the first job and wait for it to finish.  The argument to
# condor_wait is the file named by the "log" command in job1.condor.
condor_submit job1.condor
condor_wait job1-log-file

# Verify that job1 completed successfully, by the method of your choice
if ! test-to-indicate-job1-succeeded; then
    exit 1
fi

condor_submit job2.condor
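
As a more concrete sketch of the marker file approach mentioned above, suppose job1's executable runs touch job1-success just before exiting successfully, that job1-success is listed in transfer_output_files, and that job1.condor names job1.log as its log file (all of these names are hypothetical). The submission script could then be:

#!/bin/sh -e

# Remove any stale marker left over from a previous run
rm -f job1-success

condor_submit job1.condor

# Block until every job recorded in job1's log file has finished
condor_wait job1.log

# Submit job2 only if job1 left its success marker behind
if [ ! -e job1-success ]; then
    printf "job1 did not complete successfully.\n" >&2
    exit 1
fi

condor_submit job2.condor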
            
Self-test
  1. What command is used to check the status of an HTCondor pool?
  2. What kind of file is used to describe an HTCondor job? Is this file a shell script?
  3. Does HTCondor support MPI jobs? Explain.
  4. How can you check the status of your HTCondor jobs?
  5. How can you terminate a running HTCondor job?
  6. Write and submit a batch-serial HTCondor submit description file called list-etc.condor that prints the host name of the execute host on which it runs and a long listing of the /etc directory on that host.

    The script should store the output of the commands in list-etc.stdout and error messages in list-etc.stderr in the directory from which the script was submitted.

    Quickly check the status of your job after submitting it.

  7. Copy your list-etc.condor file to list-etc-parallel.condor, and modify it so that it runs the hostname and ls commands on 10 cores instead of just one.

    The job should produce a separate output file for each process named list-etc-parallel.o<jobid> and a separate error file for each process named list-etc-parallel.e<jobid>.

    Quickly check the status of your job after submitting it.