Power Tools for Data Processing


Congratulations on reaching the holy land of Unix data processing. It has often been said that if you know Unix well, you may never need to write a program. The tools provided by Unix often contain all the functionality you need to process your data. They are like a box of Legos from which we can construct a machine to perform almost any data analysis imaginable from the Unix shell.

Most of these tools function as filters, so they can be incorporated into pipelines. Most also accept filenames as command-line arguments for simpler use cases.

In this section, we'll introduce some of the most powerful tools that are heavily used by researchers to process data files. This will certainly reduce, if not eliminate, the need to write your own programs for many projects. This is only an introduction to make you aware of the available tools and the power they can give you.

For more detailed information, consult the man pages and other sources. Some tools, such as awk and sed, have entire books written about them, in case you want to explore in-depth.

However, do not set out to learn as much as you can about these tools. Set out to learn as much as you need. The ability to show off your vast knowledge is not the ability to achieve. Knowledge is not wisdom. Wisdom is doing. Learn what you need to accomplish today's goals as elegantly as possible, and then do it. You will learn more from this doing than from any amount of studying. You will develop problem solving skills and instincts, which are far more valuable than encyclopedic knowledge.

Never stop wondering if there might be an even more elegant solution. Albert Einstein was once asked what was his goal in life. His response: "To simplify." Use the tools presented here to simplify your research and, by extension, your life. With this approach, you can achieve great things without great effort and spend your time savoring the wonders and mysteries of your work rather than memorizing facts that might come in handy one day.


Grep

Grep shows lines in one or more text streams that match a given regular expression (RE). It is an acronym for Global Regular Expression Print (or Pattern or Parser if you prefer).

shell-prompt: grep expression [file ...]

The expression is often a simple string, but can represent RE patterns as described in detail by man re_format on FreeBSD. There are also numerous web pages describing REs.

Using simple strings or REs, we can search any file stream for lines containing information of interest. By knowing how to construct REs that represent the information you seek, you can easily identify patterns in your data.

REs resemble globbing patterns, but they are not the same. For example, '*' by itself in a globbing pattern means any sequence of 0 or more characters. In an RE, '*' means 0 or more of the preceding character. '*' in globbing is expressed as '.*' in an RE. Some of the most common RE patterns are shown in Table 3.12, “RE Patterns”.

Table 3.12. RE Patterns

.        Any single character
*        0 or more of the preceding character
+        1 or more of the preceding character
[]       One character in the set or range of the enclosed characters (same as globbing)
^        Beginning of the line
$        End of the line
.*       0 or more of any character
[a-z]*   0 or more lower-case letters
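To see a few of these patterns in action, we can apply grep to a small sample file. The file name and contents below are hypothetical, chosen only to illustrate the patterns:

```shell
# Create a small sample file
printf 'color\ncolour\ncolouur\ncat\n' > sample.txt

# 'u*' matches 0 or more 'u' characters, so all three colo... lines match
grep 'colou*r' sample.txt

# '^c.t$' anchors both ends: 'c', any ONE character, then 't' as a whole line
grep '^c.t$' sample.txt
# cat
```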

Consider the following C program:

#include <stdio.h>
#include <sysexits.h>
#include <math.h>

int     main(int argc, char *argv[])

{
    printf("The square root of the # 2 is %f.\n", sqrt(2.0));
    printf("The natural log of the # 2 is %f.\n", log(2.0));
    return EX_OK;
}
The command below shows all lines containing a call to the printf() function. We use quotes around the string because the shell will try to interpret the '(' without them.

shell-prompt: grep 'printf(' prog1.c
    printf("The square root of the # 2 is %f.\n", sqrt(2.0));
    printf("The natural log of the # 2 is %f.\n", log(2.0));

We might also wish to show all lines containing any function call in prog1.c. Since we are looking for any function name rather than one particular name, we cannot use a simple string and must construct a regular expression. Variable and function names begin with a letter or underscore and may contain any number of letters, underscores, or digits after that. So our RE must require a letter or underscore for the first character and then accept zero or more letters, digits, or underscores after that. We will also require an argument list (anything between () is good enough for our purposes) and a semicolon to terminate the statement.

shell-prompt: grep '[a-zA-Z_][a-zA-Z0-9_]*(.*);' prog1.c
    printf("The square root of the # 2 is %f.\n", sqrt(2.0));
    printf("The natural log of the # 2 is %f.\n", log(2.0));

The following shows lines that have a '#' in the first column, which represents a preprocessor directive in C or C++:

shell-prompt: grep '^#' prog1.c
#include <stdio.h>
#include <sysexits.h>
#include <math.h>

Without the '^' we match a '#' anywhere in the line:

shell-prompt: grep '#' prog1.c
#include <stdio.h>
#include <sysexits.h>
#include <math.h>
    printf("The square root of the # 2 is %f.\n", sqrt(2.0));
    printf("The natural log of the # 2 is %f.\n", log(2.0));


Since REs share many special characters with globbing patterns, we must enclose the RE in quotes to prevent the shell from treating it as a globbing pattern.


If we want to match a special character such as '.' or '*' literally, we must escape it (precede it with a '\'). For example, to locate method calls in a Java program, which have the form object.method(arguments);, we could use the following:
shell-prompt: grep '[a-zA-Z_][a-zA-Z0-9_]*\.[a-zA-Z_][a-zA-Z0-9_]*\(.*\);' prog1.java

As an example of searching data files, rather than program code, suppose we would like to find all the lines containing contractions in a text file. This would consist of some letters, followed by an apostrophe, followed by more letters. Since the apostrophe is the same character as the single quotes we might use to enclose the RE, we either need to escape it (with a '\') or use double quotes to enclose the RE.

shell-prompt: grep '[a-zA-Z][a-zA-Z]*\'[a-zA-Z][a-zA-Z]*' file.txt
shell-prompt: grep "[a-zA-Z][a-zA-Z]*'[a-zA-Z][a-zA-Z]*" file.txt

Another example would be searching for DNA sequences in a genome. We might use this to locate adapters, artificial sequences added to the ends of DNA fragments for the sequencing process, in our sequence data. Sequences are usually stored one per line in a text file in FASTA format. A common adapter sequence is "CTGTCTCTTATA".


We can speed up processing by using grep --fixed-strings or fgrep instead of a regular grep. This uses a more efficient simple string comparison instead of the more complex regular expression matching.
shell-prompt: fgrep CTGTCTCTTATA file.fasta

DNA sequences sometimes have variations, such as single nucleotide polymorphisms, or SNPs, where one nucleotide varies in different individuals. Suppose the sequence we're looking for might have either a C or a G in the 5th position. We can use an RE to accommodate this:

shell-prompt: grep 'CTGT[CG]TCTTATA' file.fasta

It can be hard to spot the matched pattern within grep's output. To solve this problem, we can colorize matched patterns using the --color flag, as shown in Figure 3.6, “Colorized grep output”.

Figure 3.6. Colorized grep output


There is an extended version of regular expressions that is not supported by the normal grep command. Extended REs include things like alternative strings, which are separated by a '|'. For example, we might want to search for either of two adapter sequences. To enable extended REs, we use egrep or grep --extended-regexp.

shell-prompt: egrep 'CTGTCTCTTATA|AGATCGGAAGAG' file.fasta

Extended REs also support the '+' modifier to indicate 1 or more of the previous character, e.g. '[a-z]+' is shorthand for '[a-z][a-z]*'.
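For example, '^[0-9]+$' matches only lines that consist entirely of one or more digits (the sample input here is hypothetical):

```shell
# Only the all-digit lines survive the filter
printf 'abc\n42\n7x\n1000\n' | egrep '^[0-9]+$'
# 42
# 1000
```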

The grep family of commands is very often used to filter data in pipelines. If no file name argument is provided, grep reads from the standard input, like most Unix commands.

The -l, --files-with-matches flag tells grep to merely report the names of files that contain a match. This is often used to generate a list of file names for use with another command.
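A sketch of this idea, using hypothetical file names: list only the C files that call printf(), then pass just those files to another command:

```shell
# Create two small hypothetical source files
echo 'printf("hi");' > a.c
echo 'int x;' > b.c

# -l prints only the names of files containing a match
grep -l 'printf(' a.c b.c
# a.c

# Feed just the matching file names to another command
grep -l 'printf(' a.c b.c | xargs wc -l
```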

Example 3.19. Practice Break

shell-prompt: ls /usr/bin | grep '^z'


Awk

AWK, an acronym for Aho, Weinberger, and Kernighan (the original developers of the program), is an extremely powerful tool for processing tabular data. Like grep, it supports RE matching, but unlike grep, it can process individual columns, called fields, in the data. It also includes a flexible scripting language that closely resembles the C language, so we can perform highly sophisticated processing of whole lines or individual fields.

Awk can be used to automate many of the same tasks that researchers often perform manually in a spreadsheet program such as LibreOffice Calc or MS Excel.

There are multiple implementations of awk. The most common are "the one true awk", which evolved from the original awk code and is used on many BSD systems, and gawk, the GNU project implementation used on most Linux systems. Mawk is an independent implementation that tends to outperform the others. It is available in most package managers. Awka is an awk-to-C translator that can convert most awk scripts to C for maximum performance.

Fields by default are separated by white space, i.e. space or tab characters. However, awk allows us to specify any set of separators using an RE following the -F flag or embedded in the script, so we can process tab-separated (.tsv) files, comma-separated (.csv) files, or any other data that can be broken down into columns.
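For example, given a small hypothetical CSV file, -F ',' tells awk to split each line on commas:

```shell
# A hypothetical comma-separated data file
printf 'name,age\nAlice,30\nBob,25\n' > people.csv

# Print the second field of every line
awk -F ',' '{ print $2 }' people.csv
# age
# 30
# 25
```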

An awk script consists of one or more lines containing a pattern and an action. The action is enclosed in curly braces, like a C code block.

pattern { action }

The pattern is used to select lines from the input, usually using a relational expression such as those found in an if statement. The action determines what to do when a line is selected. If no pattern is given, the action is applied to every line of input. If no action is given, the default is to print the line.

In both the pattern and the action, we can refer to the entire line as $0. $1 is the first field: all text up to but not including the first separator. $2 is the second field: all text between the first and second separators. And so on...

It is very common to use awk "one-liners" on the command-line, without actually creating an awk script file. In this case, the awk script is the first argument to awk, usually enclosed in quotes to allow for white space and special characters. The second argument is the input file to be processed by the script.

For example, the file /etc/passwd contains colon-separated fields including the username ($1), user ID ($3), primary group ID ($4), full name ($5), home directory ($6), and the user's shell program ($7). To see a list of full names for every line, we could use the following simple command, which has no pattern (so it processes every line) and an action of printing the fifth field:

shell-prompt: awk -F : '{ print $5 }' /etc/passwd
Jason Bacon
D-BUS Daemon User
TCG Software Stack user
Avahi Daemon User

To see a list of usernames and shells:

shell-prompt: awk -F : '{ print $1, $7 }' /etc/passwd
bacon /bin/tcsh
messagebus /usr/sbin/nologin
_tss /usr/sbin/nologin
avahi /usr/sbin/nologin

Many data files used in research computing are tabular, with one of the most popular formats being TSV (tab-separated value) files. The General Feature Format, or GFF file, is a TSV file format for describing features of a genome. The first field contains the sequence ID (such as a chromosome number) on which the feature resides. The third field contains the feature type, such as "gene" or "exon". The fourth and fifth fields contain the starting and ending positions within the sequence. The ninth field contains "attributes", such as the globally unique feature ID and possibly the feature name and other information, separated by semicolons. If we just want to see the locations, attributes, and names of all the genes in a genome, we could use the following:

shell-prompt:  awk '$3 == "gene" { print $1, $4, $5, $9 }' file.gff3
1 3073253 3074322 ID=gene:ENSMUSG00000102693;Name=4933401J01Rik
1 3205901 3671498 ID=gene:ENSMUSG00000051951;Name=Xkr4

Awk uses largely the same comparison operators as C and similar languages. One additional awk operator that is often useful is ~, which means "contains".

# Locate all features whose type contains "RNA".  In a typical GFF3 file,
# this could include mRNA, miRNA, ncRNA, etc.
shell-prompt:  awk '$3 ~ "RNA" { print $1, $4, $5, $9 }' file.gff3

Suppose we want to extract specific attributes from the semicolon-separated attributes field, such as the gene ID and gene name, as well as count the number of genes in the input. This will require a few more awk features.

The gene ID is always the first attribute in the field, assuming the feature is a gene. Not every gene has a name, so we will need to scan the attributes for this information. Awk makes this easy. We can break the attributes field into an array of strings using the split() function. We can then use a loop to search the attributes for one beginning with "Name=".

To count the genes in the input, we need to initialize a count variable before we begin processing the file, increment it for each gene found, and print it after processing is finished. For this we can use the special patterns BEGIN and END, which allow us to run an action before and after processing the input.
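As a minimal sketch of BEGIN and END, separate from the gene example, the one-liner below initializes a counter before reading any input, increments it for every line, and reports the total after the input is exhausted:

```shell
printf 'a\nb\nc\n' | awk 'BEGIN { count = 0 } { ++count } END { print "lines =", count }'
# lines = 3
```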

We will use the C-like printf() function to format the output. The basic print statement always adds a newline, so it does not allow us to print part of a line and finish it with a subsequent print statement.

Since this is a multiline script, we will save it in a file called gene-info.awk and run it using the -f flag, which tells awk to get the script from a file rather than the command-line.

shell-prompt: awk -f gene-info.awk file.gff3


Awk can be finicky about the placement of curly braces. To avoid problems, always place the opening brace ({) for an action on the same line as the pattern.
BEGIN {
    gene_count = 0;
}

$3 == "gene" {
    # Separate attributes into an array
    split($9, attributes, ";");
    # Print location and feature ID
    printf("%s %s %s %s", $1, $4, $5, attributes[1]);
    # Look for a name attribute and print it if it exists
    # With the for-in loop, c gets the SUBSCRIPT of each element in the
    # attributes array
    for ( c in attributes )
        # See if first 5 characters of the attribute are "Name="
        if ( substr(attributes[c], 1, 5) == "Name=" )
            printf(" %s", attributes[c]);
    # Terminate the output line
    printf("\n");
    # Count this gene
    ++gene_count;
}

END {
    printf("Genes found = %d\n", gene_count);
}

As we can see, we can do some fairly sophisticated data processing with a very short awk script. There is very little that awk cannot do conveniently with tabular data. If a particular task seems like it will be difficult to do with awk, don't give up too easily. Chances are, with a little thought and effort, you can come up with an elegant awk script to get the job done.

That said, there are always other options for processing tabular data. Perl is a scripting language especially well suited to text processing, with its powerful RE handling capabilities and numerous features. Python has also become popular for such tasks in recent years.

Awk is highly efficient, and processing steps performed with it are rarely a bottleneck in an analysis pipeline. If you do need better performance than awk provides, there are C libraries that can be used to easily parse tabular data, such as libxtend. Libxtend includes a set of DSV (delimiter-separated-value) processing functions that make it easy to read fields from files in formats like TSV, CSV, etc. Once you have read a line or an individual field using libxtend's DSV functions, you now have the full power and performance of C at your disposal to process it in minimal time.

Full coverage of awk's capabilities is far beyond the scope of this text. Readers are encouraged to explore it further via the awk man page and one of the many books available on the language.

Example 3.20. Practice Break

shell-prompt: awk -F : '{ print $1 }' /etc/passwd
shell-prompt: awk -F : '$1 == "root" { print $0 }' /etc/passwd


Cut

The cut command is used to select columns from a file, either by byte position, by character position, or, like awk, by delimiter-separated fields. Note that characters in the modern world may be more than one byte, so bytes and characters are distinguished here.

To extract columns by byte or character position, we use the -b or -c option followed by a list of positions. The list is comma-separated and may contain individual positions or ranges denoted with a '-'. For example, to extract character positions 1 through 10 and 21 through 26 from every line of file.txt, we could use the following:

shell-prompt: cut -c 1-10,21-26 file.txt

For delimiter-separated columns, we use -d to indicate the delimiter. The default is a tab character alone, not just any white space. The -w flag tells cut to accept any white space (tab or space) as the delimiter. The -f flag is then used to indicate the fields to extract, much like -c is used for character positions. Output is separated by the same delimiter as the input.

For example, to extract the username, userid, groupid, and full name (fields 1, 3, 4, and 5) from /etc/passwd, we could use the following:

shell-prompt: cut -d : -f 1,3-5 /etc/passwd
ganglia:102:102:Ganglia User
nagios:181:181:Nagios pseudo-user
webcamd:145:145:Webcamd user

The above is equivalent to the following awk command:

shell-prompt: awk -F : '{ printf("%s:%s:%s:%s\n", $1, $3, $4, $5); }' /etc/passwd

Example 3.21. Practice Break

shell-prompt: cut -d : -f 1,3-5 /etc/passwd


Sed

The sed command is a stream editor. It makes changes to a file stream with no interaction from the user. It is probably most often used to make simple text substitutions, though it can also do insertions and deletions of lines and parts of lines, even selecting lines by number or based on pattern matching much like grep and awk. A basic substitution command takes the following format:

sed -e 's|pattern|replacement|g' input-file

Pattern is any regular expression, like those used in grep or awk. Replacement can be a fixed string, but also takes some special characters, such as &, which represents the string matched by pattern. It can also be empty if you simply want to remove occurrences of pattern from the text.

The characters enclosing pattern and replacement are arbitrary. The '|' character is often used because it stands out among most other characters. If either pattern or replacement contains a '|', simply use a different separator, such as '/'. The 'g' following the final separator means "global". Without it, sed replaces only the first occurrence of pattern in each line. With it, all matches are replaced.
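The effect of 'g' is easy to see on a line containing multiple matches:

```shell
# Without 'g', only the first 'a' on the line is replaced
echo 'banana' | sed -e 's|a|o|'
# bonana

# With 'g', every 'a' is replaced
echo 'banana' | sed -e 's|a|o|g'
# bonono
```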

shell-prompt: cat fox.txt
The quick brown fox jumped over the lazy dog.
shell-prompt: sed -e 's|fox|worm|g' fox.txt
The quick brown worm jumped over the lazy dog.
shell-prompt: sed -e 's/brown //g' -e 's|fox|&y worm|g' fox.txt
The quick foxy worm jumped over the lazy dog.

Adding the -E flag causes sed to support extended regular expressions.

By default, sed sends output to the standard output stream. The -i flag tells sed to edit the file in-place, i.e. replace the original file with the edited text. This flag should be followed by a filename extension, such as ".bak". The original file will then be saved to filename.bak, so that you can reverse the changes if you make a mistake. The extension can be an empty string, e.g. '' if you are sure you don't need a backup of the original.


There is a rare portability issue with sed. GNU sed requires that the extension be nestled against the -i:

shell-prompt: sed -i.bak -e 's|pattern|replacement|g' file.txt

Some other implementations require a space between the -i and the extension, which is more orthodox among Unix commands:

shell-prompt: sed -i .bak -e 's|pattern|replacement|g' file.txt

FreeBSD's sed accepts either form. You must be aware of this in order to ensure that scripts using sed are portable. The safest approach is not to use the -i flag, but simply save the output to a temporary file and then move it:

shell-prompt: sed -e 's|pattern|replacement|g' file.txt > file.txt.tmp
shell-prompt: mv file.txt.tmp file.txt

This way, it won't matter which implementation of sed is present when someone runs your script.

Sed is a powerful and complex tool that is beyond the scope of this text. Readers are encouraged to consult books and other documentation to explore further.

Example 3.22. Practice Break

shell-prompt: printf "The quick brown fox jumped over the lazy dog.\n" > fox.txt
shell-prompt: cat fox.txt
shell-prompt: sed -e 's|fox|worm|g' fox.txt
shell-prompt: sed -e 's/brown //g' -e 's|fox|&y worm|g' fox.txt


Sort

The sort command sorts text data line by line according to one or more keys. A key indicates a field (usually a column separated by white space or some other delimiter) and the type of comparison, such as lexical (like alphabetical, but including non-letters) or numeric.

If no keys are specified, sort compares entire lines lexically. The --key flag followed by a field number restricts the comparison to that field. Fields are numbered starting with 1. This can be used in conjunction with the --field-separator flag to specify a separator other than the default white space. The --numeric-sort flag must be used to perform integer comparison rather than lexical, and the --general-numeric-sort flag to compare real numbers.

shell-prompt: cat ages.txt
Bob Vila        23
Joe Piscopo     27
Al Gore         19
Ingrid Bergman  26
Mohammad Ali    22
Ram Das         9
Joe Montana     25

shell-prompt: sort ages.txt 
Al Gore         19
Bob Vila        23
Ingrid Bergman  26
Joe Montana     25
Joe Piscopo     27
Mohammad Ali    22
Ram Das         9

shell-prompt: sort --key 2 ages.txt
Mohammad Ali    22
Ingrid Bergman  26
Ram Das         9
Al Gore         19
Joe Montana     25
Joe Piscopo     27
Bob Vila        23

shell-prompt: sort --key 3 --numeric-sort ages.txt
Ram Das         9
Al Gore         19
Mohammad Ali    22
Bob Vila        23
Joe Montana     25
Ingrid Bergman  26
Joe Piscopo     27

The sort command can process files of any size, regardless of available memory. If a file is too large to fit in memory, it is broken into smaller pieces, which are sorted separately and saved to temporary files. The sorted temporary files are then merged.

The uniq command, which removes adjacent lines that are identical, is often used after sorting to remove redundancy from data. Note that the sort command also has a --unique flag, but it does not behave the same as the uniq command. The --unique flag compares keys, while the uniq command compares entire lines.
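A quick sketch of sorting followed by uniq, including the -c flag, which prefixes each output line with the number of occurrences:

```shell
# A hypothetical file with duplicate lines, not yet adjacent
printf 'b\na\nb\na\n' > letters.txt

# uniq removes only ADJACENT duplicates, so sort first
sort letters.txt | uniq
# a
# b

# -c prefixes each line with its count
sort letters.txt | uniq -c
```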

Example 3.23. Practice Break

Using your favorite text editor, enter a few names from the example above into a file called ages.txt.

shell-prompt: cat ages.txt
shell-prompt: sort ages.txt 
shell-prompt: sort --key 2 ages.txt
shell-prompt: sort --key 3 --numeric-sort ages.txt
shell-prompt: du -sm * | sort -n    # Determine biggest directories


Tr

The tr (translate) command is a simple tool for performing character conversions and deletions in a text stream. A few examples are shown below. See the tr man page for details.

We can use it to convert individual characters in a text stream. In this case, it takes two string arguments. Characters in the Nth position in the first string are replaced by characters in the Nth position in the second string:

shell-prompt: cat fox.txt
The quick brown fox   jumped over the lazy dog.
shell-prompt: tr 'xl' 'gh' < fox.txt
The quick brown fog   jumped over the hazy dog.

There is limited support for character sets enclosed in square brackets [], similar to regular expressions, including predefined sets such as [:lower:] and [:upper:]:

shell-prompt: tr '[:lower:]' '[:upper:]' < fox.txt

We can use it to "squeeze" repeated characters down to one in a text stream. This is useful for compressing white space:

shell-prompt: tr -s ' ' < fox.txt
The quick brown fox jumped over the lazy dog.

The tr command does not support doing multiple conversions in the same command, but we can use it as a filter:

shell-prompt: tr '[:lower:]' '[:upper:]' < fox.txt | tr -s ' '

There is some overlap between the capabilities of tr, sed, awk, and other tools. Which one you choose for a given task is a matter of convenience.

Example 3.24. Practice Break

shell-prompt: printf "The quick brown fox   jumped over the lazy dog.\n" > fox.txt
shell-prompt: cat fox.txt
shell-prompt: tr '[:lower:]' '[:upper:]' < fox.txt | tr -s ' '


Find

The find command is a powerful tool for not only locating path names in a directory tree, but also for taking any desired action when a path name is found.

Unlike popular search utilities in macOS and Windows, and the Unix locate command, find does not use a previously constructed index of the file system, but searches the file system in its current state. Indexed search utilities very quickly produce results from a recent snapshot of the file system, which is rebuilt periodically by a scheduled job. This is much faster than an exhaustive search, but will miss files that were added since the last index build. The find command will take longer to search a large directory tree, but also guarantees accurate results.

The basic format of a find command is as follows:

shell-prompt: find top-directory search-criteria [optional-action \;]

The search-criteria can be any attribute of a file or other path name. To match by name, we use -name followed by a globbing pattern, in quotes to prevent the shell from expanding it before passing it to find. To search for files owned by a particular user or group, we can use -user or -group. We can also search for files with certain permissions, a minimum or maximum age, and many other criteria. The man page provides all of these details.
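A few of these criteria in action, using a small hypothetical directory tree created just for the demonstration:

```shell
# Set up a small hypothetical tree
mkdir -p Data/sub
touch Data/a.log Data/sub/b.log Data/c.txt

# Files whose names end in .log, anywhere under Data
find Data -name '*.log'

# Regular files modified within the last 7 days
find Data -type f -mtime -7

# Directories only
find Data -type d
```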

The default action is to print the relative path name of each match. For example, to list all the configuration files under /etc, we could use the following:

shell-prompt: find /etc -name '*.conf'

We can run any Unix command in response to each match using the -exec flag, followed by the command and a ';' or '+'. The ';' must be escaped or quoted to prevent the shell from using it as a command separator and treating everything after it as a new command, separate from the find command. The name of the matched path is represented by '{}'.

shell-prompt: find /etc -name '*.conf' -exec ls -l '{}' \;

With a ';' terminating the command, the command is executed immediately after each match. This may be necessary in some situations, but it entails a great deal of overhead from running the same command many times. Replacing the ';' with a '+' tells find to accumulate as many path names as possible and pass them all to one invocation of the command. This means the command could receive thousands of path names as arguments and will be executed far fewer times.

shell-prompt: find /etc -name '*.conf' -exec ls -l '{}' +

There are also some predefined actions we can use instead of spelling out a -exec, such as -print, which is the default action, and -ls, which is equivalent to -exec ls -l '{}' +. The -print action is useful for showing path names being processed by another action:

shell-prompt: find Data -name '*.bak' -print -exec rm '{}' +

Sometimes we may want to execute more than one command for each path matched. Rather than construct a complex and messy -exec, we may prefer to write a shell script containing the commands and run the script using -exec. Scripting is covered in Chapter 4, Unix Shell Scripting.

Example 3.25. Practice Break

shell-prompt: find /etc -name '*.conf' -exec ls -l '{}' +


Xargs

As stated earlier, most Unix commands that accept a file name as an argument will accept any number of file names. When processing 100 files with the same program, it is usually more efficient to run one process with 100 file name arguments than to run 100 processes with one argument each.

However, there is a limit to how long Unix commands can be. When processing many thousands of files, it may not be possible to run a single command with all of the filenames as arguments. The xargs command solves this problem by reading a list of file names from the standard input (which has no limit) and feeding them to another command as arguments, providing as many arguments as possible to each process created.

The arguments processed by xargs do not have to be file names, but usually are. The main trick is generating the list of files. Suppose we want to change all occurrences of "fox" to "toad" in the files input*.txt in the CWD. Our first thought might be a simple command:

shell-prompt: sed -i '' -e 's|fox|toad|g' input*.txt

If there are too many files matching "input*.txt", we will get an error such as "Argument list too long". One might think to solve this problem using xargs as follows:

shell-prompt: ls input*.txt | xargs sed -i '' -e 's|fox|toad|g'

However, this won't work either, because the shell hits the same argument list limit for the ls command as it does for the sed command.

The find command can come to the rescue:

shell-prompt: find . -name 'input*.txt' | xargs sed -i '' -e 's|fox|toad|g'

Since the shell is not trying to expand 'input*.txt' to an argument list, but instead passing the literal string 'input*.txt' to find, there is no limit on how many file names it can match. The find command is sophisticated enough to work around the limits of argument lists.

The find command above will send relative path names of every file with a name matching 'input*.txt' in and under the CWD. If we don't want to process files in subdirectories of CWD, we can limit the depth of the find command to one directory level:

shell-prompt: find . -maxdepth 1 -name 'input*.txt' \
              | xargs sed -i '' -e 's|fox|toad|g'


The xargs command places the arguments read from the standard input after any arguments included with the command. So the commands run by xargs will have the form

sed -i '' -e 's|fox|toad|g' input1.txt input2.txt input3.txt ...

Some xargs implementations have an option for placing the arguments from the standard input before the fixed arguments, but this is still limited. There may be cases where we want the arguments intermingled. The most portable and flexible solution to this is writing a simple script that takes all the arguments from xargs last, and constructs the appropriate command with the arguments in the correct order. Scripting is covered in Chapter 4, Unix Shell Scripting.

Most xargs implementations also support running multiple processes at the same time. This provides a convenient way to utilize multiple cores to parallelize processing. If you have a computer with 16 cores and speeding up your analysis by a factor of nearly 16 is good enough, then this can be a very valuable alternative to using an HPC cluster. If you need access to hundreds of cores to get your work done in a reasonable time, then a cluster is a better option.

shell-prompt: find . -name '*.txt' \
              | xargs -P 8 sed -i '' -e 's|fox|toad|g'

A value of 0 following -P tells xargs to detect the number of available cores and use all of them. Some, but not all, xargs implementations support --max-procs in place of -P. While using long options is more readable, it is not portable in this instance.

There is a more sophisticated open source program called GNU parallel that can run commands in parallel in a similar way, but with more flexibility. It can be installed via most package managers. See the section called “GNU Parallel” for an introduction.

Example 3.26. Practice Break

shell-prompt: find /etc -name '*.conf' | xargs ls -l


The bc (binary calculator) command is an unlimited range and precision calculator with a scripting language very similar to C. When invoked with -l or --mathlib, it includes numerous additional functions including l(x) (natural log), e(x) (exponential), s(x) (sine), c(x) (cosine), and a(x) (arctangent). There are numerous standard functions available even without --mathlib. See the man page for a full list.

By default, bc prints the result of each expression evaluated followed by a newline. There is also a print statement that does not print a newline. This allows a line of output to be constructed from multiple expressions, the last of which includes a literal "\n".

shell-prompt: bc --mathlib

print sqrt(2), "\n"
1.41421356237309504880

5 * x^2 + 2 * x + 1
1

Variables such as x above default to 0 until they are assigned a value, which is why the polynomial evaluates to 1.


Bc is especially useful for quick computations where extreme range or precision is required, and for checking the results from more traditional languages that lack such range and precision. For example, consider the computation of factorials. N factorial, denoted N!, is the product of all integers from one to N. The factorial function grows so quickly that 21! exceeds the range of a 64-bit unsigned integer, the largest integer type supported by most CPUs and most common languages. The C program and output below demonstrate the limitations of 64-bit integers.

#include <stdio.h>
#include <sysexits.h>

int     main(int argc, char *argv[])

{
    unsigned long   c, fact = 1;

    for (c = 1; c <= 25; ++c)
    {
        fact *= c;
        printf("%lu! = %lu\n", c, fact);
    }
    return EX_OK;
}
1! = 1
2! = 2
3! = 6
4! = 24
5! = 120
6! = 720
7! = 5040
8! = 40320
9! = 362880
10! = 3628800
11! = 39916800
12! = 479001600
13! = 6227020800
14! = 87178291200
15! = 1307674368000
16! = 20922789888000
17! = 355687428096000
18! = 6402373705728000
19! = 121645100408832000
20! = 2432902008176640000
21! = 14197454024290336768      This does not equal 20! * 21
22! = 17196083355034583040
23! = 8128291617894825984
24! = 10611558092380307456
25! = 7034535277573963776

At 21!, an integer overflow occurs. In the limited integer systems used by computers, adding 1 to the largest possible value produces a result of 0. The integer number sets used by computers are called modular number systems and are actually circular. The limitations of computer number systems are covered in Chapter 15, Data Representation.

In contrast, bc can compute factorials of any size, limited only by the amount of memory needed to store all the digits. It is, of course, much slower than C, both because it is an interpreted language and because it performs multiple precision arithmetic, which requires multiple machine instructions for every math operation. However, it is more than fast enough for many purposes and the easiest way to do math that is beyond the capabilities of common languages.

The bc script below demonstrates the superior range of bc. The first line (#!/usr/bin/bc -l) tells the Unix shell how to run the script, so we can run it by simply typing its name, such as ./fact.bc. This will be covered in Chapter 4, Unix Shell Scripting. For now, create the script using nano fact.bc and run it with bc < fact.bc.

#!/usr/bin/bc -l

fact = 1;
for (c = 1; c <= 100; ++c)
{
    fact *= c;
    print c, "!= ", fact, "\n";
}
1!= 1
2!= 2
3!= 6
4!= 24
5!= 120
6!= 720
7!= 5040
8!= 40320
9!= 362880
10!= 3628800
11!= 39916800
12!= 479001600
13!= 6227020800
14!= 87178291200
15!= 1307674368000
16!= 20922789888000
17!= 355687428096000
18!= 6402373705728000
19!= 121645100408832000
20!= 2432902008176640000
21!= 51090942171709440000
22!= 1124000727777607680000
23!= 25852016738884976640000
24!= 620448401733239439360000
25!= 15511210043330985984000000

[ Output removed for brevity ]

100!= 93326215443944152681699238856266700490715968264381621468592963\

Someone with a little knowledge of computer number systems might think that we can get around the range problem in general purpose languages like C by using floating point rather than integers. This will not work, however. While a 64-bit floating point number has a much greater range than a 64-bit integer (up to about 10^308, vs about 10^19 for integers), floating point actually has less precision. It sacrifices some precision in order to achieve the greater range. The modified C code and output below show that the double (64-bit floating point) type in C only gets us to 22!, and round-off error corrupts 23! and beyond.

#include <stdio.h>
#include <sysexits.h>

int     main(int argc, char *argv[])

{
    double  c, fact = 1;

    for (c = 1; c <= 25; ++c)
    {
        fact *= c;
        printf("%0.0f! = %0.0f\n", c, fact);
    }
    return EX_OK;
}
1! = 1
2! = 2
3! = 6
4! = 24
5! = 120
6! = 720
7! = 5040
8! = 40320
9! = 362880
10! = 3628800
11! = 39916800
12! = 479001600
13! = 6227020800
14! = 87178291200
15! = 1307674368000
16! = 20922789888000
17! = 355687428096000
18! = 6402373705728000
19! = 121645100408832000
20! = 2432902008176640000
21! = 51090942171709440000
22! = 1124000727777607680000
23! = 25852016738884978212864
24! = 620448401733239409999872
25! = 15511210043330986055303168

Example 3.27. Practice Break

shell-prompt: printf "sqrt(31.67)\nquit\n" | bc -l


The tar command, short for TApe Archive, is a tool for combining multiple files into one. Recall that Unix incorporates the idea of device independence, where an input/output device is treated exactly like an ordinary file. Originally, tar was meant to write the archive to a tape device, such as /dev/tape. This was a way to create backups for important files on removable tapes in case of a disk failure or other mishap.

Thanks to device independence, we can substitute any other device or ordinary file for /dev/tape. In modern times, backups are more often done over high-speed networks to sophisticated backup systems and tar is more often used to create tarballs, ordinary files containing archives for sharing whole directories. Most open source software is downloaded as a single tarball and unpacked on the local system.

The basic command template for creating a tarball is as follows:

shell-prompt: tar -cvf archive.tar path [path ...]

Archiving files this way has several potential advantages. It saves disk space: files can only allocate whole blocks and almost never have a size that is an exact multiple of the block size, so each file leaves on average half a disk block unused. Replacing many files with one archive also reduces the size of the directory containing the files. Finally, processing many small files (moving them, transferring them over a network, etc.) takes longer than processing one large file, since there is overhead in opening each file.
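The block overhead is easy to observe: ls -l reports the logical size of a file, while du reports the disk space actually allocated. The exact du figure depends on the file system's block size; 4 KiB is typical.

```shell
# Create a 1-byte file and compare its logical size with its
# allocated size on disk.
printf x > tiny
ls -l tiny      # logical size: 1 byte
du -k tiny      # allocated size: typically 4 KiB or more
```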

The -c flag means "Create". The -v means "Verbose" (echo each file name as it is added). The -f means "File name". If not provided, the default is the first tape device in /dev. The "path" arguments name files or directories to archive.

We can specify any number of files and directories, but the file name of the archive must come immediately after the -f flag.


The tar command is one of the commands that predate the convention of using a '-' to indicate flags. Hence, you may see examples on the web such as:

tar cvf file.tar directory

To unpack a tarball, we use the -x flag, which means "eXtract".

shell-prompt: tar -xvf archive.tar

We can list the contents of a tarball using -t.

shell-prompt: tar -tf archive.tar

Example 3.28. Practice Break

shell-prompt: cd
shell-prompt: mkdir Tempdir
shell-prompt: touch Tempdir/temp1
shell-prompt: touch Tempdir/temp2
shell-prompt: tar -cvf tempdir.tar Tempdir
shell-prompt: tar -tf tempdir.tar
shell-prompt: rm -rf Tempdir
shell-prompt: tar -xvf tempdir.tar
shell-prompt: ls Tempdir

Gzip, bzip2, xz

The gzip (GNU zip), bzip2 (Burrows-Wheeler zip), and xz (LZMA zip) commands compress files in order to save disk space. In the most basic use, we run the command with a single file argument:

shell-prompt: gzip file
shell-prompt: bzip2 file
shell-prompt: xz file

This will produce a compressed output file with a ".gz", ".bz2", or ".xz" extension. The original file is automatically removed after the compressed file is successfully created.
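If you want to keep the original file, recent versions of all three tools accept a -k (keep) flag, though very old implementations may not support it:

```shell
# -k preserves the original file alongside the compressed copy.
printf 'hello\n' > file.txt
gzip -k file.txt
ls file.txt file.txt.gz
```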

The compressed files can be decompressed using companion commands to restore the original file. Compression is lossless (unlike JPEG), so the restored file will be identical to the original.

shell-prompt: gunzip file.gz
shell-prompt: bunzip2 file.bz2
shell-prompt: unxz file.xz

All three commands can be used as filters to directly compress output from another program:

shell-prompt: myanalysis | gzip > output.gz
shell-prompt: myanalysis | bzip2 > output.bz2
shell-prompt: myanalysis | xz > output.xz

Likewise, the decompression tools can send decompressed output to another program via a pipe. They also include analogs to the cat command for better readability:

shell-prompt: gunzip -c output.gz | more

shell-prompt: bunzip2 -c output.bz2 | more
shell-prompt: bzcat output.bz2 | more

shell-prompt: unxz -c output.xz | more
shell-prompt: xzcat output.xz | more


For historical reasons, the portable command for viewing gzipped files is zcat, not gzcat. However, as of this writing, zcat on macOS looks for a ".Z" extension (from the outdated compress command), and only gzcat works with ".gz" files. Hence, gunzip -c is the most portable approach.
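A quick round trip shows the portable form in action:

```shell
# Compress a line of text, then recover it with gunzip -c, which
# behaves the same on Linux, macOS, and the BSDs.
printf 'The quick brown fox\n' | gzip > fox.gz
gunzip -c fox.gz
# Output: The quick brown fox
```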

The choice between them is a matter of speed vs compression ratio. Gzip is generally the fastest, but achieves the least compression. Xz produces the best compression, but at a high cost in CPU time. Bzip2 produces intermediate compression and is also CPU-intensive. All three compression tools allow the user to control the compression level in order to trade speed for compression. Lower values use less CPU time but do not compress as well.

shell-prompt: myanalysis | xz -3 > output.xz

If a program produces high-volume output (more than a few megabytes per second), some compression tools may not be able to keep up. You may want to use gzip and/or lower the compression level in these cases.

When archiving data for long-term storage, on the other hand, you will generally want the best possible compression and should not be too concerned about how long it takes. There are numerous websites containing benchmark data comparing the run time and compression of these tools with various compression levels. Such data will not be included in this guide as it is dated: it will change as the tools are continually improved.

Decompression is generally much faster than compression. While xz with medium to high compression levels requires a great deal of CPU time, unxz can decompress the data very quickly. Hence, if files need only be compressed once, but read many times, xz may be a good choice.

All three tools are integrated with tar in order to produce compressed tarballs. This can be done with a pipe by specifying "-" as the filename following -f, or using -z, --gzip, --gunzip, -j, --bzip2, --bunzip2, or -J, --xz with the tar command. The conventional file name extensions are ".tar.gz" or ".tgz" for gzip, ".tar.bz2" or ".tbz" for bzip2, and ".tar.xz" or ".txz" for xz.

shell-prompt: tar -cvf - Tempdir | gzip > tempdir.tgz
shell-prompt: tar -zcvf tempdir.tgz Tempdir

shell-prompt: tar -cvf - Tempdir | bzip2 > tempdir.tbz
shell-prompt: tar -jcvf tempdir.tbz Tempdir

shell-prompt: tar -cvf - Tempdir | xz > tempdir.txz
shell-prompt: tar -Jcvf tempdir.txz Tempdir
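As an aside, the tar implementations on current Linux, macOS, and BSD systems (GNU tar and bsdtar) detect the compression format automatically when extracting, so the compression flag can usually be omitted with -x (creating an archive still requires it):

```shell
# Extraction without -z/-j/-J: modern tar sniffs the compression
# format from the archive itself.
tar -xvf tempdir.txz
```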

Example 3.29. Practice Break

shell-prompt: cat | xz > test.xz
Type in some text, then press Ctrl+d.
shell-prompt: xzcat test.xz

shell-prompt: tar -Jcvf tempdir.txz Tempdir

Zip, unzip

Zip is both an archiver and compression tool in one. It was originally developed by Phil Katz, co-founder of PKZIP, Inc. in Milwaukee, WI in 1989, for MS-DOS. The zip format has become the standard for many other Windows-based archive tools. The compression algorithms have evolved significantly since the original PKZIP.

The zip and unzip commands are open source tools for creating and extracting .zip files. They are primarily for interoperability with Windows file archives and far less popular than tarballs compressed with gzip, bzip2, and xz.


The time command runs another command under its supervision and measures wall time, user time, and system time. Wall time, also known as real time, is the time that elapses in the real world while the program runs. The term was coined at a time when most people had clocks on their walls, rather than relying on a smart phone. User time is the time spent using a core. If a program uses only one core (logical CPU), user time is less than wall time. If it uses more than one core, user time can exceed wall time. System time is the time spent by the operating system performing tasks on behalf of the process. Hence, total CPU time is user time + system time.

The time command is used by simply prefixing any other Unix command with "time ". Some shells have an internal time command, which presents output in a different format than the external time command normally found in /usr/bin. The T shell internal time command also reports percent of CPU time used. Low CPU utilization generally indicates that the process was I/O-bound, i.e. it spent a lot of time waiting for disk or other input/output transactions and therefore was not utilizing the CPU. Also reported are memory use in kibibytes, a count of I/O operations, and page faults (where memory blocks are swapped to or from disk due to memory being full).

shell-prompt: time find /usr/local/lib > /dev/null
0.055u 0.094s 0:00.15 93.3%     43+179k 0+0io 0pf+0w

shell-prompt: /usr/bin/time find /usr/local/lib > /dev/null
        0.14 real         0.04 user         0.09 sys

Reported times will vary, usually by a fraction of a second, due to the limited precision of measurement and other factors. They are usually fairly consistent for programs that use at least a few seconds of CPU time.

Example 3.30. Practice Break

shell-prompt: time find /usr/local/lib > /dev/null


The top command displays real-time information about currently running processes, sorted in order of resource use. It does not show information about all processes, but only the top resource users. Snapshots are reported every two seconds by default.

At the top of the screen is a summary of the system state, including the load average (the average number of processes ready to run, a rough indicator of how busy the cores are), total processes running and sleeping (waiting for input/output), and a summary of memory (RAM and swap) use. Swap is an area of disk used to extend the amount of memory apparent to processes. Processes see the virtual memory size, which is RAM (electronic memory) + swap.

Below the system summary is information about the most active and resource-intensive processes currently running. Columns in the example below are summarized in Table 3.13, “Column headers of top command”.

Table 3.13. Column headers of top command

PID        The process ID
USERNAME   User owning the process
THR        Number of threads (cores used)
PRI        CPU scheduling priority
NICE       Nice value: limits scheduling priority
SIZE       Virtual memory allocated
RES        Resident memory: actual RAM (not swap) used
STATE      State of the process at the moment of the last snapshot, such as running (using a core), waiting for I/O, select (waiting on any of multiple devices), pipdwt (writing to a pipe), nanslp (sleeping for nanoseconds), etc.
C          Last core on which it ran
TIME       CPU time accumulated so far
WCPU       Weighted CPU % currently in use
COMMAND    Command executed, usually truncated

Different operating systems will display slightly different information. There are many command-line flags to alter behavior, and behavior can be adjusted while running. Press 'h' for a help menu to see the options for altering output.

last pid: 70340;  load averages:  0.67,  0.34,  0.35; up 3+03:11:57  08:57:32
61 processes:  3 running, 58 sleeping
CPU: 40.6% user,  0.0% nice,  2.2% system,  0.0% interrupt, 57.2% idle
Mem: 145M Active, 1871M Inact, 166M Laundry, 1210M Wired, 648M Buf, 4247M Free
Swap: 3852M Total, 3852M Free

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
70338 bacon         1  79    0    13M  2160K CPU2     2   0:03  72.65% fastq-tr
70340 bacon         1  79    0    13M  3056K CPU1     1   0:03  72.15% gzip
70339 bacon         1  44    0    13M  2856K pipdwt   3   0:01  28.68% gunzip
69958 bacon         3  20    0   237M    92M select   2   0:02   0.54% coreterm
 9690 root          5  20    0   144M    80M select   0   5:08   0.23% Xorg
 9719 bacon         4  20    0   340M   132M select   1   3:42   0.12% lumina-d
70332 bacon         1  20    0    14M  3668K CPU0     0   0:00   0.05% top
 1644 root          1  20    0    13M  1656K select   0   0:58   0.01% powerd
27489 root         14 -44   r8    20M  7576K cuse-s   1   0:01   0.00% webcamd
 9756 bacon         1  20    0    51M    24M select   0   0:03   0.00% python3.
 1666 root          1  20    0    13M  1748K select   3   4:29   0.00% moused
 9716 bacon         1  20    0    27M    11M select   1   0:13   0.00% fluxbox
 1756 root          1  20    0    18M  3400K select   0   0:04   0.00% sendmail
 1315 root          1  20    0    11M  1020K select   2   0:02   0.00% devd
 9744 bacon         3  20    0   153M    48M select   2   0:02   0.00% python3.
 1495 root          1  20    0    13M  2100K select   1   0:02   0.00% syslogd
 1641 root          1  20    0    13M  1984K wait     1   0:01   0.00% sh
24775 bacon         4  20    0    34M  7220K select   3   0:01   0.00% at-spi2-
 9711 bacon         3  20    0    94M    21M select   2   0:01   0.00% start-lu
 1725 root          1  20    0    13M  1992K nanslp   3   0:01   0.00% cron
 1615 messagebus    1  20    0    14M  2860K select   0   0:01   0.00% dbus-dae
 1639 ntpd          1  20    0    21M  3308K select   3   0:01   0.00% ntpd

Example 3.31. Practice Break

Run top, press 'h' to see the help screen, and press 'n' followed by '5' to make the screen less noisy.


The iostat command displays information about disk activity and possibly other status information, depending on the flags used. Unfortunately, iostat is one of the rare commands that is not well-standardized across Unix systems. Check the man page on your system for details on all the flags. Here we show basic use for monitoring disk activity similarly to how we monitor CPU and memory use with top.

Low CPU utilization in top often indicates that a process is I/O-bound (e.g. spending a great deal of time waiting for disk operations). Processes go to sleep and do not use the CPU while waiting for disk and other input/output. To help verify this, we can check the STATE column in top as well. If it shows a state such as "wait", "select", or "pipe", then the process is waiting for I/O. Lastly, we can use iostat to see exactly how busy the disks are. This tells us nothing about a specific process, but we can generally deduce which processes are causing high disk activity.

The FreeBSD iostat offers concise output on a single line including the rates of tty (terminal) and disk throughput, and some CPU stats similar to top. We can request an update every N seconds by specifying -w N or simply N. The header is kindly reprinted when it is scrolled off the terminal.

FreeBSD shell-prompt: iostat 1
       tty            ada0              cd0            pass0             cpu
 tin  tout KB/t  tps  MB/s  KB/t  tps  MB/s  KB/t  tps  MB/s  us ni sy in id
   4   583 47.0    5   0.2   0.0    0   0.0   0.0    0   0.0   5  0  1  0 95
   1   537 1024   18  18.0   0.0    0   0.0   0.0    0   0.0  45  0  1  0 54

   0   733  988   18  17.4   0.0    0   0.0   0.0    0   0.0  42  0  2  0 56
   0   295 1024   18  18.0   0.0    0   0.0   0.0    0   0.0  42  0  2  0 55
       tty            ada0              cd0            pass0             cpu
 tin  tout KB/t  tps  MB/s  KB/t  tps  MB/s  KB/t  tps  MB/s  us ni sy in id
   0   300  927   21  19.0   0.0    0   0.0   0.0    0   0.0  45  0  1  0 54
   0   457  536   35  18.3   0.0    0   0.0   0.0    0   0.0  44  0  2  0 54

Apple's iostat is derived from FreeBSD's and has a similar output format and behavior.

macOS shell-prompt: iostat 1
              disk0       cpu    load average
    KB/t  tps  MB/s  us sy id   1m   5m   15m
   13.46    3  0.03   7  5 88  1.15 1.03 1.01
   11.97  289  3.38  40 13 47  1.15 1.03 1.01
    4.00    1  0.00   6  3 91  1.15 1.03 1.01
    0.00    0  0.00   0  2 98  1.15 1.03 1.01
    0.00    0  0.00   1  2 97  1.14 1.03 1.01
    4.25  145  0.60   9  6 85  1.14 1.03 1.01

The Linux iostat has significantly different options and output format. In addition, it may not be present on all Linux systems by default. On RHEL (Redhat Enterprise Linux), for example, we must install the sysstat package using the yum package manager. The output format contains multiple lines for each snapshot, but presents similar information.

RHEL shell-prompt: yum install -y sysstat
RHEL shell-prompt: iostat 1
Linux 4.18.0-372.26.1.el8_6.x86_64 (alma8.localdomain) 10/16/2022  _x86_64_(4 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.14    0.00    0.30    0.10    0.00   99.46

Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda              11.57       250.38        34.24     443223      60607
scd0              0.01         0.00         0.00          1          0
dm-0             11.53       221.50        33.04     392110      58492
dm-1              0.06         1.25         0.00       2220          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.25    0.00    0.00   99.75

Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               0.00         0.00         0.00          0          0
scd0              0.00         0.00         0.00          0          0
dm-0              0.00         0.00         0.00          0          0
dm-1              0.00         0.00         0.00          0          0

GNU Parallel

GNU Parallel is a sophisticated open source tool for running multiple processes simultaneously. Users who do not have access to an HPC cluster for running large parallel jobs can at least utilize all the cores on their laptop or workstation using GNU parallel. GNU Parallel can be installed in seconds using most package managers.

In its simplest form, GNU parallel can be used as a drop-in replacement for xargs:

shell-prompt: find . -name 'input-*.txt' | xargs analyze
shell-prompt: find . -name 'input-*.txt' | parallel analyze

However, GNU parallel has many options for more sophisticated execution. The numerous use cases and syntax of GNU parallel are beyond the scope of this guide; many web tutorials and even entire books cover it in detail. If GNU parallel is properly installed on your system (i.e. via a package manager), you can begin by running man parallel_tutorial.


The GNU parallel tutorial, at the time of this writing, contains some examples of overcomplicating simple tasks, such as the following:

# The tutorial recommends the following to generate sample input files:
shell-prompt: perl -e 'printf "A_B_C_"' > abc_-file
shell-prompt: perl -e 'for(1..1000000){print "$_\n"}' > num1000000

# In reality, perl serves no purpose in either case.
# We can use the standard printf and seq commands instead:
shell-prompt: printf "A_B_C_" > abc_-file
shell-prompt: seq 1 1000000 > num1000000


Be sure to thoroughly review the instructions in Section 2, “Practice Problem Instructions” before doing the practice problems below.
  1. What is a regular expression? Is it the same as a globbing pattern?

  2. Show a Unix command that shows lines in analysis.c containing hard-coded floating point constants.

  3. How can we speed up grep searches when searching for a fixed string rather than an RE pattern?

  4. How can we use extended REs with grep?

  5. How can we make the matched pattern visible in the grep output?

  6. Describe two major differences between grep and awk.

  7. How does awk compare to spreadsheet programs like LibreOffice Calc and MS Excel?

  8. The /etc/group file contains colon-separated lines in the form groupname:password:groupid:members. Show an awk command that will print the groupid and members of the group "root".

  9. A GFF3 file contains tab-separated lines in the form "seqid source feature-type start end score strand phase attributes". The first attribute for an exon feature is the parent sequence ID. Write an awk script that reports the seqid, start, end, strand, and parent for each feature of type "exon". It should also report the number of exons and the number of genes. To test your script, download Mus_musculus.GRCm39.107.chromosome.1.gff3.gz from ensembl.org and then do the following:

    gunzip Mus_musculus.GRCm39.107.chromosome.1.gff3.gz
    awk -f your-script.awk Mus_musculus.GRCm39.107.chromosome.1.gff3
  10. Show a cut command roughly equivalent to the following awk command, which processes a tab-separated GFF3 file.

    awk '{ print $1, $3, $4, $5 }' file.gff3
  11. Show a sed command that replaces all occurrences of "wolf" with "werewolf" in the file halloween-list.txt.

  12. Show a command to sort the following data by height. Show a separate command to sort by weight. The data are in params.txt.

    ID  Height  Weight
    1   34      10
    2   40      14
    3   29      9
    4   28      11
  13. Show a Unix command that reads the file fox.txt, replaces the word "fox" with "toad" and converts all lower case letters to upper case, and stores the output in big-toad.txt.

  14. Show a Unix command that lists and removes all the files whose names end in '.o' in and under ~/Programs.

  15. Why is the xargs command necessary?

  16. Show a Unix command that removes all the files with names ending in ".tmp" only in the CWD, assuming that there are too many of them to provide as arguments to one command. The user should not be prompted for each delete. ( Check the rm man page if needed. )

  17. Show a Unix command that processes all the files named 'input*' in the CWD, using as many cores as possible, through a command such as the following:

    analyze --limit 5 input1 input2
  18. What is the most portable and flexible way to use xargs when the arguments it provides to the command must precede some of the fixed arguments?

  19. What is the major advantage of the bc calculator over common programming languages?

  20. Show a bc expression that prints the value of e, the base of the natural logarithm.

  21. Write a bc script that prints the following. Create the script with nano sqrt.bc and run it with bc -l < sqrt.bc.

    sqrt(1) = 1.00000000000000000000
    sqrt(2) = 1.41421356237309504880
    sqrt(3) = 1.73205080756887729352
    sqrt(4) = 2.00000000000000000000
    sqrt(5) = 2.23606797749978969640
    sqrt(6) = 2.44948974278317809819
    sqrt(7) = 2.64575131106459059050
    sqrt(8) = 2.82842712474619009760
    sqrt(9) = 3.00000000000000000000
    sqrt(10) = 3.16227766016837933199
  22. What are some advantages of archiving files in a tarball?

  23. Show a Unix command that creates a tarball called research.tar containing all the files in the directory ./Research.

  24. Show a Unix command that saves the output of find /etc to a compressed text file called find-output.txt.bz2.

  25. Show a Unix command for viewing the contents of the compressed text file output.txt.gz, one page at a time.

  26. Show a Unix command that creates a tarball called research.txz containing all the files in the directory ./Research.

  27. What are zip and unzip primarily used for on Unix systems?

  28. Show a Unix command that reports the CPU time used by the command awk -f script.awk input.tsv.

  29. Show a Unix command that will help us determine which processes are using the most CPU time or memory.

  30. How can we find out how to adjust the behavior of top while it is running?

  31. What kind of output from top might suggest that a process is I/O-bound? Why?

  32. Show a Unix command that continuously monitors total disk activity on a Unix system.

  33. How can users who do not have access to an HPC cluster run things in parallel, in a more sophisticated way than possible with standard Unix tools such as xargs?