Using LPJS

Using LPJS
Prev	Chapter 8. Job Scheduling with LPJS	Next

LPJS Jargon

In order to use any new system, a few definitions must be familiar. Below are some terms used throughout this document that are necessary for understanding LPJS. This chapter assumes that the reader is familiar with the material in Chapter 6, Parallel Computing and Chapter 7, Job Scheduling, which cover the general concepts of HPC and HTC.

A node is a single computer in the cluster or grid.
A job is the execution of a program under the LPJS scheduler. Each job is assigned a unique integer job ID. A job is analogous to a Unix process, but not the same. Job IDs are not Unix process IDs, and a job may entail more than one Unix process, if it is a parallel program. There are three kinds of LPJS jobs:
- A serial job runs a single Unix process.
- A shared memory parallel job runs multiple cooperating Unix processes or threads on the same node. They most commonly use pthreads (POSIX threads) or OpenMP (not to be confused with OpenMPI), but any shared memory parallel programming API is possible.
- A distributed parallel job runs multiple cooperating Unix processes, possibly on different nodes. These mostly commonly use MPI (Message Passing Interface), a set of libraries and tools for creating parallel programs. There are multiple implementations of MPI, including MPICH, OpenMPI, etc. The processes that make up an MPI job can be on the same node, or on different nodes. A typical MPI job runs multiple processes on each of multiple nodes.
A submission refers to all jobs created by one lpjs submit command. A submission is not an entity in the LPJS scheduler, but only a concept used in this document. The only unit of work managed by LPJS is the job.

Cluster Status

The lpjs nodes command lists available compute nodes and available resources on each node. The following shows a Unix workstation (barracuda) and a Mac Mini (tarpon) being used as compute nodes on a local home network:

shell-prompt: lpjs nodes
Hostname             State    Procs Used PhysMiB    Used OS        Arch     
barracuda.acadix.biz Up           4    0   16350       0 FreeBSD   amd64    
tarpon.acadix.biz    Up           8    0    8192       0 Darwin    arm64    

Total                Up          12    0   24542       0 -         -        
Total                Down         0    0       0       0 -         -

The following shows a small cluster consisting of dedicated Dell PowerEdge servers:

shell-prompt: lpjs nodes
Hostname             State    Procs Used PhysMiB    Used OS        Arch     
compute-001.albacore Up          16    0   65476       0 FreeBSD   amd64    
compute-002.albacore Up          16    0   65477       0 FreeBSD   amd64    
compute-003.albacore Up          16    0   65477       0 FreeBSD   amd64    
compute-004.albacore Up          16    0   65477       0 FreeBSD   amd64    
compute-005.albacore Up          16    0  131012       0 FreeBSD   amd64    
compute-006.albacore Up          16    0  131012       0 FreeBSD   amd64    

Total                Up          96    0  523931       0 -         -        
Total                Down         0    0       0       0 -         -

Job Status

The lpjs jobs command shows currently pending (waiting to start) and running jobs. Below, we see an RNA-Seq adapter trimming job utilizing our workstation and Mac Mini to trim six files at once.

shell-prompt: lpjs jobs 
Legend: P = processor  J = job  N = node

Pending

    JobID  IDX Jobs P/J P/N MiB/P User Compute-node Script
      169    7   18   2   2    10 bacon TBD 04-trim.lpjs
      170    8   18   2   2    10 bacon TBD 04-trim.lpjs
      171    9   18   2   2    10 bacon TBD 04-trim.lpjs
      172   10   18   2   2    10 bacon TBD 04-trim.lpjs
      173   11   18   2   2    10 bacon TBD 04-trim.lpjs
      174   12   18   2   2    10 bacon TBD 04-trim.lpjs
      175   13   18   2   2    10 bacon TBD 04-trim.lpjs
      176   14   18   2   2    10 bacon TBD 04-trim.lpjs
      177   15   18   2   2    10 bacon TBD 04-trim.lpjs
      178   16   18   2   2    10 bacon TBD 04-trim.lpjs
      179   17   18   2   2    10 bacon TBD 04-trim.lpjs
      180   18   18   2   2    10 bacon TBD 04-trim.lpjs

Running

    JobID  IDX Jobs P/J P/N MiB/P User Compute-node Script
      163    1   18   2   2    10 bacon barracuda.acadix.biz 04-trim.lpjs
      164    2   18   2   2    10 bacon barracuda.acadix.biz 04-trim.lpjs
      165    3   18   2   2    10 bacon tarpon.acadix.biz 04-trim.lpjs
      166    4   18   2   2    10 bacon tarpon.acadix.biz 04-trim.lpjs
      167    5   18   2   2    10 bacon tarpon.acadix.biz 04-trim.lpjs
      168    6   18   2   2    10 bacon tarpon.acadix.biz 04-trim.lpjs

Below, we see an RNA-Seq adapter trimming job utilizing our cluster to trim all eighteen of our files simultaneously. This should get the job done much faster than the two computers in the previous example.

shell-prompt: lpjs jobs     
Legend: P = processor  J = job  N = node

Pending

    JobID  IDX Jobs P/J P/N MiB/P User Compute-node Script

Running

    JobID  IDX Jobs P/J P/N MiB/P User Compute-node Script
       19    1   18   3   3    50 bacon compute-001.albacore 04-trim.lpjs
       20    2   18   3   3    50 bacon compute-001.albacore 04-trim.lpjs
       21    3   18   3   3    50 bacon compute-001.albacore 04-trim.lpjs
       22    4   18   3   3    50 bacon compute-001.albacore 04-trim.lpjs
       23    5   18   3   3    50 bacon compute-001.albacore 04-trim.lpjs
       24    6   18   3   3    50 bacon compute-002.albacore 04-trim.lpjs
       25    7   18   3   3    50 bacon compute-002.albacore 04-trim.lpjs
       26    8   18   3   3    50 bacon compute-002.albacore 04-trim.lpjs
       27    9   18   3   3    50 bacon compute-002.albacore 04-trim.lpjs
       28   10   18   3   3    50 bacon compute-002.albacore 04-trim.lpjs
       29   11   18   3   3    50 bacon compute-003.albacore 04-trim.lpjs
       30   12   18   3   3    50 bacon compute-003.albacore 04-trim.lpjs
       31   13   18   3   3    50 bacon compute-003.albacore 04-trim.lpjs
       32   14   18   3   3    50 bacon compute-003.albacore 04-trim.lpjs
       33   15   18   3   3    50 bacon compute-003.albacore 04-trim.lpjs
       34   16   18   3   3    50 bacon compute-004.albacore 04-trim.lpjs
       35   17   18   3   3    50 bacon compute-004.albacore 04-trim.lpjs
       36   18   18   3   3    50 bacon compute-004.albacore 04-trim.lpjs

Using top

It is important to know how many processors and how much memory are being used by the Unix processes that make up our jobs. For this, we use standard Unix process monitoring tools, such as top. In order to do so, we must know which compute nodes are being used by the job. This is shown by the lpjs jobs command described above. We must also have the ability to run commands manually on the compute nodes. This can often be done using ssh. Use the ssh -t flag to enable full terminal control, which is required by top.

shell-prompt: ssh -t compute-001 top

last pid: 11680;  load averages:  7.12,  6.36,  3.54  up 0+01:57:38    19:22:18
75 processes:  10 running, 65 sleeping
CPU:  3.5% user,  0.0% nice,  0.5% system,  0.1% interrupt, 95.9% idle
Mem: 128M Active, 1320M Inact, 4416M Wired, 1572M Buf, 57G Free
ARC: 1024M Total, 56M MFU, 766M MRU, 188M Anon, 10M Header, 2422K Other
     748M Compressed, 976M Uncompressed, 1.31:1 Ratio
Swap: 5120M Total, 5120M Free

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
11590 bacon         1  68    0    13M  2664K piperd   6   4:10  52.98% fastq-tr
11631 bacon         1 111    0    13M  2664K CPU8     8   4:07  49.07% fastq-tr
11640 bacon         1 109    0    13M  2656K CPU7     7   4:10  48.49% fastq-tr
11605 bacon         1 110    0    13M  2664K CPU6     6   4:06  47.17% fastq-tr
11591 bacon         1 108    0    13M  2660K CPU9     9   4:08  43.16% fastq-tr
11617 bacon         1  47    0    14M  3916K piperd  11   2:08  26.76% gzip
11607 bacon         1  45    0    14M  3904K piperd   0   2:06  26.46% gzip
11646 bacon         1  44    0    14M  3908K piperd   5   2:09  24.27% gzip
11644 bacon         1  45    0    14M  3900K CPU4     4   2:07  23.49% gzip
11647 bacon         1  48    0    14M  3892K piperd  14   2:11  23.39% gzip
11618 bacon         1  45    0    14M  3908K CPU10   10   2:05  23.29% gzip
11635 bacon         1  47    0    14M  3916K piperd  10   2:06  23.00% gzip
11642 bacon         1  47    0    14M  3912K piperd  15   2:05  23.00% gzip
11634 bacon         1  42    0    14M  3916K piperd   2   2:10  21.09% gzip
11610 bacon         1  40    0    14M  3908K CPU5     5   2:05  20.65% gzip
11608 bacon         1  36    0    25M    11M zio->i   0   1:15  15.67% xzcat
11592 bacon         1  32    0    25M    11M select   7   1:14  15.58% xzcat
11643 bacon         1  34    0    25M    11M select   0   1:16  14.26% xzcat

Note

We see in the top output above that our fastq-trim processes are only utilizing about 50% of each processor. This is because running 18 of them at the same time overwhelms the disk system in this small cluster, so the processes spend a lot of time waiting for I/O instead of using the processor. The fastq-trim program is very fast, and the input and output files are very large. It would be better in this case to limit the submission to fewer jobs, so that the processors are fully utilized. This will leave more processors available for other jobs, which might not need to compete for the disk.

TBD: Document top-job when LPJS is integrated with SPCM.

Job Submission

Submission Scripts

All jobs under LPJS are described by a script that contains some special directives to describe the LPJS job parameters. Otherwise, it is an ordinary script, which is run on each compute node selected for the submission.

An LPJS job script can be written in any scripting language that sees "#lpjs" as a comment.

Note

As LPJS is a cross-platform job scheduler, it is strongly recommended that scripts be written in a portable language and without any operating system specific features. The easiest solution is to use POSIX Bourne shell, which is supported by all Unix-like systems, and is described in Chapter 4, Unix Shell Scripting.

#!/bin/sh -e

#lpjs jobs 10
#lpjs processors-per-job 1
#lpjs threads-per-process 1
#lpjs pmem-per-processor 100MiB
#lpjs path /usr/local/bin:/usr/bin:/bin

my-program my-arguments

It may be tempting to use a more advanced shell language, but doing so may be problematic on some clusters or grids, as some compute nodes may not have the shell installed, or it may behave differently under different operating systems due to different versions of the shell, or differences between the operating systems. Features of advanced shells are rarely useful in HPC/HTC anyway, since the scripts tend to be short and simple. POSIX Bourne shell is more than adequate for most jobs.

LPJS job parameters are specified in the script using a line that begins with "#lpjs". To the scripting language, this is a comment, so it is ignored when the script is run on a compute node. The lpjs submit command, however, extracts these lines from the script and uses the information to set the jobs resource requirements and other parameters.

Caution

LPJS will terminate running jobs that exceed the resource allocations specified by these parameters.

The parameters listed below are required by LPJS. Job submissions will fail if any of them are missing from the script.

#lpjs jobs: Number of jobs to run in one submission. Each job is assigned a different job ID and an array index beginning with 1.
A submission with jobs > 1 is known as a job array.
#lpjs processors-per-job: Number of processors to allocate for each job in the submission. A processor is whatever the operating system defines it to be. In most situations, it is a logical processor, which may be affected by SMT (simultaneous multithreading, known as hyper-threading on Intel processors). With SMT, a physical CPU core is treated as two or more logical processors by most operating systems. The same core can execute more than one machine instruction at the same time, as long as the instructions don't contend for the same CPU components. SMT is usually disabled on HPC clusters and HTC grids, due to its limited performance benefits and increased contention for memory and other resources.
In general terms, a processor is what the operating system uses to run a process, so processors >= processes that are actually running at any given moment. Other processes must wait in a queue until their next turn at using a processor.
#lpjs threads-per-process: Minimum number of processors that must be on the same node. For shared memory multiprocessing, where all processes or threads must be on the same node, this must be equal to processors-per-job.
```
#lpjs processors-per-job 8
#lpjs threads-per-process 8
                    
```
LPJS allows you to use "processors-per-job" as a value here, so you don't need to remember to change both parameters in the future.
```
#lpjs processors-per-job 8
#lpjs threads-per-process processors-per-job
                    
```
Distributed parallel programs, such as MPI programs, may use threads-per-process < processors-per-job. Setting threads-per-process to 1 will allow the most flexible scheduling of available processors for an MPI job, which may allow it to start sooner.
#lpjs pmem-per-processor: Physical memory (RAM) per process. This refers to the actual about of RAM (electronic memory) used by a process. All Unix-like (POSIX) operating systems use virtual memory, where the most active part of a process is in physical memory and less active parts may be swapped out to disk or other slower storage.
It is important to set this parameter correctly, as overallocating physical memory can cause serious problems:
- If pmem-per-processor is set too high, then your job is hoarding memory resources that it isn't using, which may prevent other jobs from running.
- If pmem-per-processor is set too low, then the compute node will be oversubscribed. This will cause processes to run slowly or crash, and may even cause the node to crash.
The only way to determine the correct pmem-per-processor is by doing sample runs of your program with the same inputs used by the job, and observing the memory use using tools such as top. If possible, this should be done on a workstation rather than on a cluster or grid, so that these test runs don't negatively impact other jobs.
Then set pmem-per-processor slightly larger than the observed maximum (maybe 10% to 20%).
Note that top may show both virtual and physical memory use, in different columns. FreeBSD top shows virtual memory under the "SIZE" header and physical memory under "RES". Physical memory is also know as resident memory in Unix, i.e. the portion of the program that resides in real memory rather than swap. Linux topshows virtual memory under "VIRT" and physical memory under "RES".

There are additional, optional parameters as well.

#lpjs path: Sets the environment variable PATH on the compute node before running the script.
Note that this is not quite the same as setting PATH in the script, since #lpjs path sets it before the script begins running, and before pull-command is executed. Hence, this could affect which version of rsync is used for pull-command.
#lpjs push-command: File transfer command for transferring output files from a compute node to the submit node after a job completes.
Note
This is not used if the compute nodes have direct access to the submit directory, using NFS or some other network file sharing protocol.
```
#lpjs push-command rsync -av %w/Outputs/ %h:%s/Results/Outputs
                    
```
The "%w" represents the temporary working directory on the compute node.
The "%s" represents the hostname of the submit node.
The "%s" represents the directory from which the job was submitted on the submit node.
You can choose any file transfer command you like, sending the output to some other server or directory.
Note
The only requirement is that all compute nodes have the ability to run the transfer command without the need to enter a password. Typically, this is done by installing the SSH keys of all compute nodes in the authorized_keys file on the destination. See the SSH documentation for details. The auto-ssh-authorize script, installed as a dependency of LPJS by all package managers, can assist with this. The SPCM cluster manager will automatically set up SSH keys for all cluster nodes.
Examples:
```
#lpjs push-command scp -r %w %s:/jobs/data/username/Results
#lpjs push-command scp -r %w myworkstation.my.domain:/jobs/data/username/Results
                    
```
#lpjs pull-command: Like push-command, but pulls files to the compute node before the job script executes.
```
#lpjs pull-command rsync -r --copy-links %h:%s/Inputs/ %w/Inputs
                    
```
#lpjs log-dir: Sets the parent directory for job logs. Each job creates a subdirectory under this directory containing the chaperone log, a copy of the job script created at the time of submission, and the stdout and stderr output from the script.
The log directory name may not contain whitespace.
The default is <working-directory>/LPJS-logs/script-name/Job-jobid.
Example:
```
#lpjs log-dir Logs/04-trim
                    
```

Common Flags

TBD

LPJS Resource Requirements

LPJS will not dispatch a job until enough processors and memory are available.

Note that the number of processors and memory for a submission is irrelevant. All jobs, even from the same submission, are scheduled independently. LPJS will dispatch as many jobs as possible from a given submission, and the rest will wait in the queue until sufficient resources become available.

Memory requirements must specify units, which can be MB (megabytes, 10^6 bytes), MiB (mebibytes, 2^20 bytes), GB (gigabytes, 10^9 bytes), or GiB (gibibytes, 2^30 bytes).

Other schedulers have default units, which may seem like a convenience, but this often leads to confusion and errors in practice.

Environment Variables

LPJS sets a number of environment variables, which, like all environment variables, are inherited by child processes. In this way, information about the job is passed from LPJS to the programs run by your job scripts.

LPJS_HOME_DIR: The home directory of the user on the compute node. May be different on different nodes.
LPJS_ARRAY_INDEX: The 1-base array index of each job in a job array submission.
LPJS_JOB_COUNT: The number of jobs in the job array submission.
LPJS_PROCS_PER_JOB: The value set in the script by "#lpjs processors-per-job".
LPJS_MIN_PROCS_PER_NODE: The value set in the script by "#lpjs threads-per-process".
LPJS_PMEM_PER_PROC: The value set in the script by "#lpjs pmem-per-processor".
LPJS_USER_NAME: The username of the user running the job.
LPJS_PRIMARY_GROUP_NAME: The primary group name of the user running the job.
LPJS_SUBMIT_HOST: The FQDN (fully qualified domain name / host name) from which the job was submitted.
LPJS_SUBMIT_DIRECTORY: The absolute pathname of the directory on the submit node, from which the job was submitted.
LPJS_SCRIPT_NAME: The filename of the batch script being run by the job.
LPJS_COMPUTE_NODE: The compute node running the job (same as $(hostname) or `hostname`).
LPJS_JOB_LOG_DIR: The path of the directory containing job terminal output, relative to LPJS_SUBMIT_DIRECTORY. Defaults to LPJS-logs/script-name.
LPJS_PUSH_COMMAND: The command used to transfer the temporary current working directory from compute nodes that do not have direct access to the submit directory on the submit node. Defaults to "rsync -av %w/ %h:%s", where %w is the temporary working directory on the compute node, %h is the submit host, and %s is the submit directory on the submit node. Below is output from a job that runs "printenv | grep LPJS_" in the batch script:

Note: LPJS_PMEM_PER_PROC is shown in MiB.

#!/bin/sh -e

#lpjs jobs 1
#lpjs processors-per-job 3
#lpjs threads-per-process processors-per-job
#lpjs pmem-per-processor 50MiB

# Print all environment variables and filter for those containing LPJS_
printenv | grep LPJS_

LPJS_JOB_COUNT=1
LPJS_COMPUTE_NODE=TBD
LPJS_PUSH_COMMAND=rsync -av %w/ %h:%s
LPJS_PMEM_PER_PROC=9
LPJS_MIN_PROCS_PER_NODE=1
LPJS_PRIMARY_GROUP_NAME=bacon
LPJS_SUBMIT_HOST=moray.acadix.biz
LPJS_JOB_LOG_DIR=LPJS-logs/env
LPJS_USER_NAME=bacon
LPJS_SUBMIT_DIRECTORY=/home/bacon
LPJS_JOB_ID=1983
LPJS_SCRIPT_NAME=env.lpjs
LPJS_PROCS_PER_JOB=1
LPJS_ARRAY_INDEX=1
LPJS_HOME_DIR=/home/bacon

Shell scripts can use these variables directly, as described in Chapter 4, Unix Shell Scripting.

Batch Serial Submissions

A batch serial submission has both jobs and processors-per-job parameters both set to 1.

#!/bin/sh -e

#lpjs jobs 1
#lpjs processors-per-job 1
#lpjs threads-per-process 1
#lpjs pmem-per-processor 100MiB

my-serial-program arguments

Batch Parallel Submissions

A batch parallel submission, or job array, has a jobs parameter > 1. Individual jobs may be serial (1 process) or parallel (multiple processes).

Input Files the Easy Way

When submitting a job array, it's helpful to have input and output filenames that contain integer indexes, like 1, 2, 3, etc.

shell-prompt: ls
input-1.fastq.xz   input-15.fastq.xz  input-20.fastq.xz  input-8.fastq.xz
input-10.fastq.xz  input-16.fastq.xz  input-3.fastq.xz   input-9.fastq.xz
input-11.fastq.xz  input-17.fastq.xz  input-4.fastq.xz
input-12.fastq.xz  input-18.fastq.xz  input-5.fastq.xz
input-13.fastq.xz  input-19.fastq.xz  input-6.fastq.xz
input-14.fastq.xz  input-2.fastq.xz   input-7.fastq.xz

Note that the listing above does not show the filenames in numeric order. This is because they are sorted lexically (more or less alphabetically) rather than numerically. Lexically, "10" is less than "9", because they are treated as strings, not numbers, and "1" comes before "9", like "A" comes before "B". This is a quirk with ls and and with many programming/scripting constructs.

#!/bin/sh -e

for file in *.xz; do
    echo $file
done

input-1.fastq.xz
input-10.fastq.xz
input-2.fastq.xz
input-3.fastq.xz
input-4.fastq.xz
input-5.fastq.xz
input-6.fastq.xz
input-7.fastq.xz
input-8.fastq.xz
input-9.fastq.xz

Sometimes this is preferable, but it's a problem if you need to process things in numeric order. You will know when you actually start to work with your input files. This will happen any time the indexes contain a variable number of digits. It can be solved by simply left-padding the numbers with 0s, so that 0s so that all the numbers have the same number of digits. This makes the lexical and numeric orders the same.

shell-prompt: ls
input-001.fastq.xz  input-007.fastq.xz  input-013.fastq.xz  input-019.fastq.xz
input-002.fastq.xz  input-008.fastq.xz  input-014.fastq.xz  input-020.fastq.xz
input-003.fastq.xz  input-009.fastq.xz  input-015.fastq.xz
input-004.fastq.xz  input-010.fastq.xz  input-016.fastq.xz
input-005.fastq.xz  input-011.fastq.xz  input-017.fastq.xz
input-006.fastq.xz  input-012.fastq.xz  input-018.fastq.xz

Running the same script as above with the new filenames:

input-001.fastq.xz
input-002.fastq.xz
input-003.fastq.xz
input-004.fastq.xz
input-005.fastq.xz
input-006.fastq.xz
input-007.fastq.xz
input-008.fastq.xz
input-009.fastq.xz
input-010.fastq.xz

If raw input filenames are cryptic, as they often are, you can simplify things by creating symbolic links with names that are both easier for people to read and easier for scripts and other programs to parse.

Consider the following FASTQ RNA sequence files, downloaded from the SRA (Sequence Read Archive) and NCBI. Note how the numbers embedded in the filenames appear somewhat sequential, with mostly increments of 7, but not always.

shell-prompt:
ERR458493.fastq.gz  ERR458528.fastq.gz  ERR458563.fastq.gz  ERR458906.fastq.gz
ERR458500.fastq.gz  ERR458535.fastq.gz  ERR458878.fastq.gz  ERR458913.fastq.gz
ERR458507.fastq.gz  ERR458542.fastq.gz  ERR458885.fastq.gz  ERR458920.fastq.gz
ERR458514.fastq.gz  ERR458549.fastq.gz  ERR458892.fastq.gz  ERR458927.fastq.gz
ERR458521.fastq.gz  ERR458556.fastq.gz  ERR458899.fastq.gz  ERR458934.fastq.gz

Writing a script to utilize these numbers would be a calamity. Instead, we can generate new names, without losing the old ones, using symbolic links:

#!/bin/sh -e

index=1
for file in *.fastq.gz; do
    prefixed_index=$(printf "%03d" $index)
    ln -s $file input-$prefixed_index.fastq.gz
    index=$(($index + 1))
done

shell-prompt: ls
ERR458493.fastq.gz  ERR458563.fastq.gz  input-001.fastq.gz@ input-011.fastq.gz@
ERR458500.fastq.gz  ERR458878.fastq.gz  input-002.fastq.gz@ input-012.fastq.gz@
ERR458507.fastq.gz  ERR458885.fastq.gz  input-003.fastq.gz@ input-013.fastq.gz@
ERR458514.fastq.gz  ERR458892.fastq.gz  input-004.fastq.gz@ input-014.fastq.gz@
ERR458521.fastq.gz  ERR458899.fastq.gz  input-005.fastq.gz@ input-015.fastq.gz@
ERR458528.fastq.gz  ERR458906.fastq.gz  input-006.fastq.gz@ input-016.fastq.gz@
ERR458535.fastq.gz  ERR458913.fastq.gz  input-007.fastq.gz@ input-017.fastq.gz@
ERR458542.fastq.gz  ERR458920.fastq.gz  input-008.fastq.gz@ input-018.fastq.gz@
ERR458549.fastq.gz  ERR458927.fastq.gz  input-009.fastq.gz@ input-019.fastq.gz@
ERR458556.fastq.gz  ERR458934.fastq.gz  input-010.fastq.gz@ input-020.fastq.gz@

Now that we have links with rational filenames, specifying an input file in a job array is easy:

#!/bin/sh -e

#lpjs jobs 20
#lpjs processors-per-job 1
#lpjs threads-per-process 1
#lpjs pmem-per-processor 50MiB

# Add leading 0s to index provided by LPJS
index=$(printf "%03d" $LPJS_ARRAY_INDEX)

myprog --input input-$index.fastq.gz --output output-$index.fastq.zst

Multiprocessing Jobs

A multiprocessing job is a job with the processors-per-job parameter > 1. This could be shared memory (all processes or threads on the same node, threads-per-process = processors-per-job), or distributed parallel (processes may be spread across more than one node, threads-per-process <= processors-per-job).

#!/bin/sh -e

#################################################
# A submission of 10 shared memory parallel jobs
# Each job requires 5 processors
# LPJS will run as many of the 10 jobs as possible at the same time

# Array of 10 multiprocessing jobs
#lpjs jobs 10

# 5 processors per job
#lpjs processors-per-job 5

# All processors must be on the same node
#lpjs threads-per-process processors-per-job

#lpjs pmem-per-processor 100MiB

# Make sure the program uses only 5 processors (do not oversubscribe the processors)
OMP_NUM_THREADS=5
export OMP_NUM_THREADS

my-openmp-program arguments

MPI Multiprocessing Jobs

TBD

The submission usually runs a single MPI job, but it is also possible to submit an array of MPI jobs.

Terminating a Job

The lpjs cancel command takes one or more job IDs or ranges of job IDs. To specify a range, separate two job IDs with a '-', and only a '-'. The range must be a single Unix shell argument, so it cannot contain any whitespace.

shell-prompt: lpjs jobs
Legend: P = processor  J = job  N = node

Pending

    JobID  IDX Jobs P/J P/N MiB/P User Compute-node Script
      169    7   18   2   2    10 bacon TBD 04-trim.lpjs
      170    8   18   2   2    10 bacon TBD 04-trim.lpjs
      171    9   18   2   2    10 bacon TBD 04-trim.lpjs
      172   10   18   2   2    10 bacon TBD 04-trim.lpjs
      173   11   18   2   2    10 bacon TBD 04-trim.lpjs
      174   12   18   2   2    10 bacon TBD 04-trim.lpjs
      175   13   18   2   2    10 bacon TBD 04-trim.lpjs
      176   14   18   2   2    10 bacon TBD 04-trim.lpjs
      177   15   18   2   2    10 bacon TBD 04-trim.lpjs
      178   16   18   2   2    10 bacon TBD 04-trim.lpjs
      179   17   18   2   2    10 bacon TBD 04-trim.lpjs
      180   18   18   2   2    10 bacon TBD 04-trim.lpjs

Running

    JobID  IDX Jobs P/J P/N MiB/P User Compute-node Script
      163    1   18   2   2    10 bacon barracuda.acadix.biz 04-trim.lpjs
      164    2   18   2   2    10 bacon barracuda.acadix.biz 04-trim.lpjs
      165    3   18   2   2    10 bacon tarpon.acadix.biz 04-trim.lpjs
      166    4   18   2   2    10 bacon tarpon.acadix.biz 04-trim.lpjs
      167    5   18   2   2    10 bacon tarpon.acadix.biz 04-trim.lpjs
      168    6   18   2   2    10 bacon tarpon.acadix.biz 04-trim.lpjs
      
# Cancel one job
shell-prompt: lpjs cancel 163

# Cancel all pending and running jobs
shell-prompt: lpjs cancel 163-180

# Cancel all jobs running on tarpon
shell-prompt: lpjs cancel 165-168

# Cancel a random sample of jobs for no good reason other than demonstration
shell-prompt: lpjs cancel 163-165 177 179-180

Terminating Stray Processes

The lpjs cancel command does a fairly thorough job hunting down and terminating all processes run by a job script. There may be circumstances where some processes are missed, however. The only way to kill such processes is by first identifying them using top, ps, or similar commands, and then manually terminating them using kill.

It can be tricky to identify which processes are part of an active job and which are strays. LPJS chaperone creates a process group for this purpose. Running ps -jxw will show processes, along with their process group (PGID). Processes that are not in the same group as any chaperone process are strays.

File Sharing

Clusters normally have one or more file servers, so that jobs can run in a directory that is directly accessible from all nodes. This is the ideal situation, as input files are directly available to jobs, and output files from jobs can be written to their final location without needing to transfer them.

Note

At present, it appears to be impractical to use macOS for compute nodes with data on a file server. macOS has a security feature that prevents programs from accessing most directories unless the user explicitly grants permission via the graphical interface. In order for LPJS to access file servers as required for normal operation, the program lpjs_compd must be granted full disk access via System Settings, Privacy and Security. Otherwise, you may see "operation not permitted" errors in the log when trying to access NFS shares.

The major problem is that this is not a one-time setting. Each time LPJS is updated, full disk access is revoked, and the user must enable it via the graphical interface again.

Grids normally do not have file servers. In this case, it will be necessary for all nodes to have the ability to pull files from and push files to somewhere. Typically, this somewhere would be the submit node, or a server accessible for file transfers from the submit node and all compute nodes.

LPJS does not provide file transfer tools. There are numerous highly-evolved, general-purpose file transfer tools already available, so it is left to the systems manager and user to decide which one(s) to use. We recommend using rsync if possible, as it is highly portable and reliable, and minimizes the amount of data transferred when repeating a transfer.

Note

All compute nodes must be able to perform a passwordless file transfers to the designated server, i.e. pulling files to, or pushing files from a compute node does not prompt the user for a password. This is generally accomplished by installing ssh keys on the submit node, which can be done by running auto-ssh-authorize submit-host from every compute node, as every user who will run jobs.

The lpjs submit command creates a marker file in the working directory on the submit host, named "lpjs-submit-host-name-shared-fs-marker" (replace "submit-host-name" with the FQDN of your submit node). If this file is not accessible to the compute node, then LPJS will take the necessary steps to create the temporary working directory and transfer it back to the submit node after the script terminates.

If the working directory (the directory from which the job is submitted on the submit node) is not accessible to the compute nodes (e.g. using NFS), then the user's script is responsible for downloading any required input files. Below is an example from Test/fastq-trim.lpjs in the LPJS Github repository.

Note

Note that we used the --copy-links option with rsync, so that it copies files pointed to by symbolic links, rather than just recreating the symbolic link on the compute node. You must understand each situation and decide whether this is necessary.

# Marker file is created by "lpjs submit" so we can detect shared filesystems.
# If this file does not exist on the compute nodes, then the compute nodes
# must pull (download) the input files.
marker=lpjs-$LPJS_SUBMIT_HOST-shared-fs-marker
if [ ! -e $marker ]; then
    printf "$marker does not exist.  Using rsync to transfer files.\n"
    set -x
    printf "Fetching $LPJS_SUBMIT_HOST:$LPJS_WORKING_DIRECTORY/$infile\n"
    # Use --copy-links if a file on the submit node might be a symbolic
    # link pointing to something that it not also being pulled here
    rsync --copy-links ${LPJS_SUBMIT_HOST}:$LPJS_WORKING_DIRECTORY/$infile .
    set +x
else
    printf "$marker found.  No need to transfer files.\n"
fi

LPJS will, by default, transfer the contents of the temporary working directory back to the working directory on the submit node, using rsync -av temp-working-dir/ submit-host:working-dir. The "working-dir" above is the directory from which the job was submitted, and "temp-working-dir" is a job-specific temporary directory created by LPJS on the compute node. Following this transfer, the working directory on the submit node should contain the same output file as it would using a shared filesystem. Users can override the transfer command. command. See the Research Computing User Guide for details.

# If we downloaded the input file, remove it now to avoid wasting time
# transferring it back.  By default, LPJS transfers the entire temporary
# working directory to the submit node using rsync.
if [ ! -e $marker ]; then
    rm -f $infile
fi

Viewing Output of Active Jobs

If your system uses NFS or some other networked file sharing service, so that the compute nodes have direct access to the same files as the submit host, then we can see output files grow while the job is running. There may be a slight delay due to NFS buffering, so the output files will not show everything that has been written by a different node at a given moment.

If compute nodes do not have direct access via NFS or similar, then a temporary working directory is created on the compute node. You can only monitor progress if you can log into that compute node and view files in the temporary working directory.

Viewing Job Statistics

TBD: Not yet implemented

Job Sequences

TBD: Not yet implemented

Self-test

TBD