Scripting an Analysis Pipeline

What's an Analysis Pipeline?

An analysis pipeline is simply a sequence of processing steps. Since the steps are basically the same for a given type of analysis, we can automate the pipeline using a scripting language for the reasons we discussed at the beginning of this chapter: to save time and avoid mistakes.

A large percentage of scientific research analyses require multiple steps, so analysis pipelines are very common in practice.

Where do Pipelines Come From?

It has been said that for every PhD thesis, there is a pipeline. There are many preexisting pipelines available for a wide variety of tasks. Many such pipelines were developed by researchers for a specific project and then generalized to be useful for other projects or other researchers.

Unfortunately, most such pipelines are not well designed or rigorously tested, so they don't work well for analyses that differ significantly from the one for which they were originally designed.

Another problem is that most of them are not maintained over the long term. Developers set out with good intentions to help other researchers, but once their project is done and they move on to new things, they find that they don't have time to maintain old pipelines anymore. Also, new tools are constantly evolving, so old pipelines quickly become obsolete unless they are aggressively updated.

Note

The best use of most existing pipelines is as an example to follow when developing a similar one.

Finally, many pipelines are integrated into a specific system with a graphical user interface (GUI) or web interface, and therefore cannot be used on a generic computer or HPC cluster (unless the entire system is installed and configured, which is often difficult or impossible).

For these reasons, every researcher should know how to develop their own pipelines. Relying on the charity of your competitors for publishing space and grant money will not lead to long-term success in research.

This is especially true for long-term studies. If you become dependent on a preexisting pipeline early on, and it is not maintained by its developers for the duration of your study, then the completion of your study will prove very difficult.

Implementing Your Own Pipeline

A pipeline can be implemented in any programming language. Since most pipelines involve simply running a series of programs with the appropriate command-line arguments, a Unix shell script is a very suitable choice in most cases. In some cases, it may be possible to use Unix shell pipes to perform multiple steps at the same time. This will depend on a number of things:

  • Do the processing programs use standard input and standard output? If not, then connecting them with pipes will not be possible.
  • What are the resource requirements of each step? Do you have enough memory to run multiple steps at the same time?
  • Do you need to save the intermediate files generated by some of the steps? If so, then either don't use a Unix shell pipe, or use the tee command to dump output to a file and pipe it to the next command at the same time.

    shell-prompt: step1 < input1 | tee output1 | step2 > output2
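
As a concrete illustration of the tee approach (the commands and file names here are just examples, not part of any particular pipeline), the following decompresses a file, saves an uncompressed copy via tee, and counts the lines, all in one pass:

    shell-prompt: gunzip -c sample.fastq.gz | tee sample.fastq | wc -l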
                    
Example Genomics Pipelines
Genomic Differential Analysis

One of the most common types of bioinformatics analysis is differential gene expression using RNA-Seq (ribonucleic acid sequencing) data. In this type of analysis, RNA molecules are extracted from tissue samples under two different conditions, such as two time points during development, or wild-type (normal) and mutant individuals. The amount of RNA transcribed from some genes will differ across the two conditions. From this we can infer that expression of those genes is somehow related to the conditions being compared.

The extracted RNA is sequenced, producing files containing one sequence for each RNA molecule (in theory), such as "ACUGGCAAUCGGAAAUA...". The differential analysis pipeline has the following high-level stages:

  1. Clean up the sequences (remove artificial sequences added by the sequencing process)

  2. Check the quality of the sequence data

  3. Map the RNA sequences to a genome, so we know which gene produced each fragment

  4. Count the fragments produced by each gene (not as simple as it sounds)

  5. Compare the counts for each gene across the two conditions

If the counts for a given gene differ significantly between one condition and another, then we conclude that this gene is regulated (turned on or off) in response to the condition. A complete pipeline to perform this analysis on an HPC cluster is available at https://github.com/auerlab/cnc-emdiff. A similar pipeline that will run fairly quickly on a single computer is available at https://github.com/auerlab/fasda/tree/main/Test. The latter is a set of test scripts for FASDA, a program that performs the comparison of counts for each gene.
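
Below is a minimal sketch of these five stages for a single sample, assuming FASTQ input and a prebuilt genome index. The tool choices (cutadapt, fastqc, hisat2, samtools, featureCounts) and all file names are illustrative assumptions, not necessarily the tools used by the pipelines above; see those repositories for complete, tested implementations.

#!/bin/sh -e

# Illustrative RNA-Seq skeleton only: tool and file names are assumptions

# 1. Remove adapter sequences added by the sequencing process
#    (AGATCGGAAGAG is a common Illumina adapter prefix; use the one
#    for your sequencing kit)
cutadapt -a AGATCGGAAGAG -o trimmed-sample1.fastq sample1.fastq

# 2. Check the quality of the trimmed reads
fastqc trimmed-sample1.fastq

# 3. Map reads to the genome using a splice-aware aligner
hisat2 -x genome-index -U trimmed-sample1.fastq -S sample1.sam

# 4. Sort the alignments and count the fragments produced by each gene
samtools sort -o sample1.bam sample1.sam
featureCounts -a genes.gtf -o counts-sample1.txt sample1.bam

# 5. Compare counts across conditions, e.g. with FASDA
#    (see the FASDA documentation for usage)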

Metagenomics Example

Below is a simple shell script implementation of the AmrPlusPlus pipeline, which, according to their website, is used to "characterize the content and relative abundance of sequences of interest from the DNA of a given sample or set of samples".

People can use this pipeline by uploading their data to the developer's website, or by running it in their own Docker container or on their own Galaxy server.

Note

At the time of this writing, November 2022, the AmrPlusPlus pipeline source on GitHub has not been updated for over five years. This is typical of scientific pipelines, for the reasons stated above.

In reality, the analysis is performed by the following command-line tools, which are developed by other parties and freely available:

  • Trimmomatic
  • BWA
  • Samtools
  • SNPFinder
  • ResistomeAnalyzer
  • RarefactionAnalyzer

The role of AmrPlusPlus is to coordinate the operation of these tools. AmrPlusPlus is itself a script.

If you don't want to be dependent on their web service, a Galaxy server, or their Docker containers, if you would like greater control over and understanding of the analysis pipeline, or if you want to use newer versions of tools such as samtools, you can easily write your own script to run the above commands.

Also, when developing our own pipeline, we can substitute other tools that perform the same function, such as Cutadapt in place of Trimmomatic, or Bowtie in place of BWA for alignment.

All of these tools are designed for "short-read" DNA sequences (on the order of 100 base pairs per fragment). When we take control of the process rather than relying on someone else's pipeline, we open up the possibility of developing an analogous pipeline using a different set of tools for "long-read" sequences (on the order of 1,000 base pairs per fragment).

For our purposes, we install the above commands via FreeBSD ports and/or pkgsrc (on CentOS and Mac OS X). Then we just write a Unix shell script to implement the pipeline for our data.
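
For example, on FreeBSD, something like the following would install several of the tools. The package names are assumptions and may differ; search the ports tree or pkgsrc for the exact names on your system:

    shell-prompt: pkg install cutadapt bwa samtools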

Note that this is a real pipeline used for research at the UWM School of Freshwater Sciences.

It is not important that you understand the genomics analysis in this example. Simply look at how the script uses loops and other scripting constructs to see how the material you just learned can be used in actual research. That is, don't worry about what cutadapt and bwa are doing with the data; just see how they are run within the pipeline script, using redirection, command-line arguments, etc. Also read the comments within the script for a deeper understanding of what the conditionals and loops are doing.

#!/bin/sh -e

# Get gene fraction threshold from user
printf "Resistome threshold? "
read threshold

############################################################################
# 1. Enumerate input files

raw_files="SRR*.fastq"

############################################################################
# 2.  Quality control: Remove adapter sequences from raw data

for file in $raw_files; do
    output_file=trimmed-$file
    # If the output file already exists, assume cutadapt was already run
    # successfully.  Remove trimmed-* before running this script to force
    # cutadapt to run again.
    if [ ! -e $output_file ]; then
        cutadapt $file > $output_file
    else
        printf "$raw already processed by cutadapt.\n"
    fi
done

############################################################################
# 3. If sequences are from a host organism, remove host DNA

# Index resistance gene database
if [ ! -e megares_database_v1.01.fasta.ann ]; then
    bwa index megares_database_v1.01.fasta
fi

############################################################################
# 4. Align to target database with bwa mem.

for file in $raw_files; do
    # Output is an aligned SAM file.  Add an aligned- prefix and
    # replace the .fastq suffix with .sam
    output_file=aligned-${file%.fastq}.sam
    if [ ! -e $output_file ]; then
        printf "\nRunning bwa-mem on $file...\n"
        bwa mem megares_database_v1.01.fasta trimmed-$file > $output_file
    else
        printf "$file already processed by bwa mem\n"
    fi
done

############################################################################
# 5. Resistome analysis.

aligned_files=aligned-*.sam
for file in $aligned_files; do
    if [ ! -e ${file%.sam}group.tsv ]; then
        printf "\nRunning resistome analysis on $file...\n"
        resistome -ref_fp megares_database_v1.01.fasta -sam_fp $file \
            -annot_fp megares_annotations_v1.01.csv \
            -gene_fp ${file%.sam}gene.tsv \
            -group_fp ${file%.sam}group.tsv \
            -class_fp ${file%.sam}class.tsv \
            -mech_fp ${file%.sam}mech.tsv \
            -t $threshold
    else
        printf "$file already processed by resistome.\n"
    fi
done

############################################################################
# 6. Rarefaction analysis (not yet implemented)

I generally write a companion script for every analysis script to remove output files and allow a fresh start for the next attempt. Many programs will detect that output already exists and either refuse to overwrite it or behave differently in other ways.

#!/bin/sh -e

rm -f trimmed-* aligned-* aligned-*.tsv megares*.fasta.*
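
Assuming the two scripts are saved as pipeline.sh and clean.sh (hypothetical names), starting the analysis over from scratch is then as simple as:

    shell-prompt: ./clean.sh && ./pipeline.sh
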
Practice

Note

Be sure to thoroughly review the instructions in Section 2, “Practice Problem Instructions” before doing the practice problems below.
  1. What is an analysis pipeline?

  2. Where do most analysis pipelines come from?

  3. How useful are most pipelines to future projects?

  4. What is the best way to use most existing pipelines?

  5. What languages can be used to implement an analysis pipeline? Which are likely to be convenient?

  6. What is a good way to restart a pipeline from the beginning?