1.14. Power Tools for Data Processing

1.14. Power Tools for Data Processing
Prev	Chapter 1. Using Unix	Next

1.14.1. Introduction

Congratulations on reaching the holy land of Unix data processing. It has often been said that if you know Unix well, you may never need to write a program. The tools provided by Unix often contain all the functionality you need to process your data. They are like a box of Legos from which we can construct a machine to perform almost any data analysis imaginable from the Unix shell.

Most of these tools function as filters, so they can be incorporated into pipelines. Most also accept filenames as command-line arguments for simpler use cases.

In this section, we'll introduce some of the most powerful tools that are heavily used by researchers to process data files. This will certainly reduce, if not eliminate, the need to write your own programs for many projects. This is only an introduction to make you aware of the available tools and the power they can give you.

For more detailed information, consult the man pages and other sources. Some tools, such as awk and sed, have entire books written about them, in case you want to explore in-depth.

However, do not set out to learn as much as you can about these tools. Set out to learn as juch as you need. The ability to show off your vast knowledge is not the ability to achieve. Knowledge is not wisdom. Cleverness is not wisdom. Wisdom is doing. Learn what you need to accomplish today's goals as elegantly as possible, and then do it. You will learn more from this doing than from any amount of studying. You will develop problem solving skills and instincts, which are far more valuable than encyclopedic knowledge.

Never stop wondering if there might be an even more elegant solution. Albert Einstein was once asked what was his goal in life. His response: "To simplify." Use the tools presented here to simplify your research and by extension, your life. With this approach can achieve great things without great effort and spend your time savoring the wonders and mysteries of your work rather than memorizing facts that might come in handy one day.

1.14.2. grep

grep shows lines in one or more text streams that match a given regular expression (RE). It is an acronym for Global Regular Expression Print (or Pattern or Parser if you prefer).

shell-prompt: grep expression [file ...]

The expression is often a simple string, but can represent RE patterns as described in detail by man re_format on FreeBSD. There are also numerous web pages describing REs.

Using simple strings or REs, we can search any file stream for lines containing information of interest. By knowing how to construct REs that represent the information you seek, you can easily identify patterns in your data.

REs resemble globbing patterns, but they are not the same. For example, '*' by itself in a globbing pattern means any sequence of 0 or more characters. In an RE, '*' means 0 or more of the preceding character. '*' in globbing is expressed as '.*' in an RE. Some of the most common RE patterns are shown in Table 1.12, “RE Patterns”.

Table 1.12. RE Patterns

Pattern	Meaning
.	Any single character
*	0 or more of the preceding character
+	1 or more of the preceding character
[]	One character in the set or range of the enclosed characters (same as globbing)
^	Beginning of the line
$	End of the line
.*	0 or more of any character
.+	1 or more of any character
[a-z]*	0 or more lower-case letters

The command below shows all lines containing a call to the printf() function in prog1.c.

shell-prompt: grep printf prog1.c

We might also wish to show all lines containing variable names in prog1.c. Since we are looking for any variable name rather than one particular variable, we cannot use a simple string and must construct a regular expression. Variable names begin with a letter or underscore and may contain any number of letters, underscores, or digits after that. So our RE must require a letter or underscore for the first character and then accept zero or more letters, digits, or underscores after that.

shell-prompt: grep '[a-zA-Z_][a-zA-Z0-9_]*' prog1.c

The following shows lines that have a '#' in the first column:

shell-prompt: grep '^#' prog1.c

Note

Since REs share many special characters with globbing patterns, we must enclose the RE in quotes to prevent the shell from treating it as a globbing pattern.

Note

If we want to literally match a special character such as '.' or '*' literally, we must escape it (preceded it with a '\'). For example, to locate method calls in a Java program, which have the form object.method(arguments);, we could use the following:

shell-prompt: grep '[a-zA-Z_][a-zA-Z0-9_]*\.[a-zA-Z]*\(.*\);' prog1.java

As an example of searching data files, rather than program code, suppose we would like to find all the lines containing contractions in text file. This would consist of some letters, followed by an apostrophe, followed by more letters. Since the apostrophe is the same character as the single quotes we might use to enclose the RE, we either need to escape it (with a '\') or use double quotes to enclose the RE.

shell-prompt: grep '[a-zA-Z]+\'[a-zA-Z]+'
shell-prompt: grep "[a-zA-Z]+'[a-zA-Z]+"

Another example would be searching for DNA sequences in a genome. We might use this to locate adapters, artificial sequences added to the ends of DNA fragments for the sequencing process, in our sequence data. Sequences are usually stored one per line in a text file in FASTA format. A common adapter sequence is "CTGTCTCTTATA".

Note

We can speed up processing by using grep --fixed-strings or fgrep instead of a regular grep. This uses a more efficient simple string comparison instead of the more complex regular expression matching.

shell-prompt: fgrep CTGTCTCTTATA file.fasta
GCGGCCAACACCTTGCCTGTATTGGCATCCATGATGAAATGGGCGTAACCCTGTCTCTTATACACATCTCCGAG
AAAGGCCTGTATGATAAGTTGGCAAATTTCCTCAAGATTGTTTACTTGATACACCTGTCTCTTATACACATCTC
GACCGAGGCACTCGCCGCGCTTGAGCTCGAGATCGATGCCGTCGACCTGTCTCTTATACACATCTCCGAGCCCA
AAAAAATCCCTCCGAAGCATTGTAGGTTTCCATGCTGTCTCTTATACACATCTCCGAGCCCACGAGACTCCTGA

It's hard to see the pattern we were looking for in this output. To solve this problem, we can colorize any matched patterns using the --color flag as shown in Figure 1.3, “Colorized grep output”.

Figure 1.3. Colorized grep output

There is an extended version of regular expressions that is not supported by the normal grep command. Extended REs include things like alternative strings, which are separated by a '|'. For example, we might want to search for either of two adapter sequences. To enable extended REs, we use egrep or grep --extended-regexp.

shell-prompt: egrep 'CTGTCTCTTATA|AGATCGGAAGAG' file.fasta

1.14.3. awk

AWK, an acronym for Aho, Weinberger, and Kernighan (the original developers of the program), is an extremely powerful tool for processing tabular data. Like grep, it supports RE matching, but unlike grep, it can process individual columns, called fields, in the data. It also includes a flexible scripting language that closely resembles the C language, so we can perform highly sophisticated processing of whole lines or individual fields.

Awk can be used to automate many of the same tasks that researchers often perform manually in a spreadsheet program such as LibreOffice Calc or MS Excel.

There are multiple implementations of awk. The most common are "The one true awk", evolved from the original awk code and used on many BSD systems. Gawk, the GNU project implementation, is used on most Linux systems. Mawk is an independent implementation that tends to outperform the others. It is available in most package managers. Awka is an awk-to-C translator that can convert most awk scripts to C for maximize performance.

Fields by default are separated by white space, i.e. space or tab characters. However, awk allows us to specify any set of separators using an RE following the -F flag or embedded in the script, so we can process tab-separated (.tsv) files, comma-separated (.csv) files, or any other data that can be broken down into columns.

An awk script consists of one or more lines containing a pattern and an action. The action is enclosed in curly braces, like a C code block.

pattern { action }

The pattern is used to select lines from the input, usually using a relational expression such as those found in an if statement. The action determines what to do when a line is selected. If no pattern is given, the action is applied to every line of input. If no action is given, the default is to print the line.

In both the pattern and the action, we can refer to the entire line as $0. $1 is the first field: all text up to but not including the first separator. $2 is the second field: all text between the first and second separators. And so on...

It is very common to use awk "one-liners" on the command-line, without actually creating an awk script file. In this case, the awk script is the first argument to awk, usually enclosed in quotes to allow for white space and special characters. The second argument is the input file to be processed by the script.

For example, the file /etc/passwd contains colon-separated fields including the username ($1), user ID ($3), primary group ID ($4), full name ($5), home directory ($6), and the user's shell program ($7). To see a list of full names for every line, we could use the following simple command, which has no pattern (so it processes every line) and an action of printing the fifth field:

shell-prompt: awk -F : '{ print $5 }' /etc/passwd
Jason Bacon
D-BUS Daemon User
TCG Software Stack user
Avahi Daemon User
...

To see a list of usernames and shells:

shell-prompt: awk -F : '{ print $1, $6 }' /etc/passwd
bacon /bin/tcsh
messagebus /usr/sbin/nologin
_tss /usr/sbin/nologin
avahi /usr/sbin/nologin
...

Many data files used in research computing are tabular, with one of the most popular formats being TSV (tab-separated value) files. The General Feature Format, or GFF file is a TSV file format for describing features of a genome. The first field contains the sequence ID (such as a chromosome number) on which the feature resides. The third field contains the feature type, such as "gene" or "exon". The fourth and fifth fields contain the starting and ending positions withing the sequence. The ninth field contains "attributes", such as the globally unique feature ID and possibly the feature name and other information, separated by semicolons. If we just want to see the locations and attributes of all the genes in a genome and their names, we could use the following:

shell-prompt:  awk '$3 == "gene" { print $1, $4, $5, $9 }' file.gff3
1 3073253 3074322 ID=gene:ENSMUSG00000102693;Name=4933401J01Rik
1 3205901 3671498 ID=gene:ENSMUSG00000051951;Name=Xkr4
...

Suppose we want to extract specific attributes from the semicolon-separated attributes field, such as the gene ID and gene name, as well as count the number of genes in the input. This will require a few more awk features.

The gene ID is always the first attribute in the field, assuming the feature is a gene. Not every gene has a name, so we will need to scan the attributes for this information. Awk makes this easy. We can break the attributes field into an array of strings using the split() function. We can then use a loop to search the attributes for one beginning with "Name=".

To count the genes in the input, we need to initialize a count variable before we begin processing the file, increment it for each gene found, and print it after processing is finished. For this we can use the special patterns BEGIN and END, which allow us to run an action before and after processing the input.

We will use the C-like printf() function to format the output. The basic print statement always adds a newline, so it does not allow us to print part of a line and finish it with an subsequent print statement.

Since this is a multiline script, we will save it in a file called gene-info.awk and run it using the -f flag, which tells awk to get the script from a file rather than the command-line.

shell-prompt: awk -f gene-info.awk file.gff3

Caution

Awk can be finicky about the placement of curly braces. To avoid problems, always place the opening brace ({) for an action on the same line as the pattern.

BEGIN {
    gene_count = 0;
}

$3 == "gene" {
    # Separate attributes into an array
    split($9, attributes, ";");
    
    # Print location and feature ID
    printf("%s %s %s %s", $1, $4, $5, attributes[1]);
    
    # Look for a name attribute and print it if it exists
    # With the for-in loop, c gets the SUBSCRIPT of each element in the
    # attributes array
    for ( c in attributes )
    {
        # See if first 5 characters of the attribute are "Name="
        if ( substr(attributes[c], 1, 5) == "Name=" )
            printf(" %s", attributes[c]);
    }
    
    # Terminate the output line
    printf("\n");
    
    # Count this gene
    ++gene_count;
}

END {
    printf("\nGenes found = %d\n", gene_count);
}

As we can see, we can do some fairly sophisticated data processing with a very short awk script. There is very little that awk cannot do conveniently with tabular data. If a particular task seems like it will be difficult to do with awk, don't give up too easily. Chances are, with a little thought and effort, you can come up with an elegant awk script to get the job done.

That said, there are always other options for processing tabular data. Perl is a scripting language especially well suited to text processing, with its powerful RE handling capabilities and numerous features. Python has also become popular for such tasks in recent years.

Awk is highly efficient, and processing steps performed with it are rarely a bottleneck in an analysis pipeline. If you do need better performance than awk provides, there are C libraries that can be used to easily parse tabular data, such as libxtend. Libxtend includes a set of DSV (delimiter-separated-value) processing functions that make it easy to read fields from files in formats like TSV, CSV, etc. Once you have read a line or an individual field using libxtend's DSV functions, you now have the full power and performance of C at your disposal to process it in minimal time.

Full coverage of awk's capabilities is far beyond the scope of this text. Readers are encouraged to explore it further via the awk man page and one of the many books available on the language.

1.14.4. cut

The cut command is used to select columns from a file, either by byte position, character position, or like awk, delimiter-separated columns. Note that characters in the modern world may be more than one byte, so bytes and characters are distinguished here.

To extract columns by byte or character position, we use the -b or -c option followed by a list of positions. The list is comma-separated and may contain individual positions or ranges denoted with a '-'. For example, to extract character positions 1 through 10 and 21 through 26 from every line of file.txt, we could use the following:

shell-prompt: cut -c 1-10,21-26 file.txt

For delimiter-separated columns, we use -d to indicate the delimiter. The default is a tab character alone, not just any white space. The -w flag tells cut to accept any white space (tab or space) as the delimiter. The -f is then used to indicate the fields to extract, much like -c is used for character positions. Output is separated by the same delimiter as the input.

For example, to extract the username, userid, groupid, and full name (fields 1, 3, 4, and 5) from /etc/passwd, we could use the following:

shell-prompt: cut -d : -f 1,3-5 /etc/passwd
...
ganglia:102:102:Ganglia User
nagios:181:181:Nagios pseudo-user
webcamd:145:145:Webcamd user

The above is equivalent to the following awk command:

shell-prompt: awk -F : '{ printf("%s:%s:%s:%s\n", $1, $3, $4, $5); }' /etc/passwd

1.14.5. sed

The sed command is a stream editor. It makes changes to a file stream with no interaction from the user. It is probably most often used to make simple text substitutions, though it can also do insertions and deletions of lines and parts of lines, even selecting lines by number or based on pattern matching much like grep and awk. A basic substitution command takes the following format:

sed -e 's|pattern|replacement|g' input-file

Pattern is any regular expression, like those used in grep or awk. Replacement can be a fixed string, but also takes some special characters, such as &, which represents the string matched by pattern. It can also be empty if you simply want to remove occurrences of pattern from the text.

The characters enclosing pattern and replacement are arbitrary. The '|' character is often used because it stands out among most other characters. If either pattern or replacement contains a '|', simply use a different separator, such as '/'. The 'g' after the pattern means "global". Without it, sed will only replace the first occurrence of pattern in each line. With it, all matches are replaced.

shell-prompt: cat fox.txt
The quick brown fox jumped over the lazy dog.
shell-prompt: sed -e 's|fox|worm|g' fox.txt
The quick brown worm jumped over the lazy dog.
shell-prompt: sed -e 's/brown //g' -e 's|fox|&y worm|g' fox.txt
The quick foxy worm jumped over the lazy dog.

Using -E in place of -e causes sed to support extended regular expressions.

By default, sed sends output to the standard output stream. The -i flag tells sed to edit the file in-place, i.e. replace the original file with the edited text. This flag should be followed by a filename extension, such as ".bak". The original file will then be saved to filename.bak, so that you can reverse the changes if you make a mistake. The extension can be an empty string, e.g. '' if you are sure you don't need a backup of the original.

Caution

There is a rare portability issue with sed. GNU sed requires that the extension be nestled against the -i:

shell-prompt: sed -i.bak -e 's|pattern|replacement|g' file.txt

Some other implementations require a space between the -i and the extension, which is more orthodox among Unix commands:

shell-prompt: sed -i .bak -e 's|pattern|replacement|g' file.txt

FreeBSD's sed accepts either form. You must be aware of this in order to ensure that scripts using sed are portable. The safest approach is not to use the -i flag, but simply save the output to a temporary file and then move it:

shell-prompt: sed -e 's|pattern|replacement|g' file.txt > file.txt.tmp
shell-prompt: mv file.txt.tmp file.txt

This way, it won't matter which implementation of sed is present when someone runs your script.

Sed is a powerful and complex tool that is beyond the scope of this text. Readers are encouraged to consult books and other documentation to explore further.

1.14.6. sort

The sort command sorts text data line by line according to one or more keys. A key indicates a field (usually a column separated by white space or some other delimiter) and the type of comparison, such as lexical (like alphabetical, but including non-letters) or numeric.

If no keys are specified, sort compares entire lines lexically. The --key followed by a field number restricts comparison to that field. Fields are numbered starting with 1. This can be used in conjunction with the --field-separator flag to specify a separator other than the default white space. The --numeric-sort flag must be used to perform integer comparison rather than lexical. The --general-numeric-sort flag must be used to compare real numbers.

Shell-prompt: cat ages.txt
Bob Vila        23
Joe Piscopo     27
Al Gore         19
Ingrid Bergman  26
Mohammad Ali    22
Ram Das         9
Joe Montana     25

Shell-prompt: sort ages.txt 
Al Gore         19
Bob Vila        23
Ingrid Bergman  26
Joe Montana     25
Joe Piscopo     27
Mohammad Ali    22
Ram Das         9

Shell-prompt: sort --key 2 ages.txt
Mohammad Ali    22
Ingrid Bergman  26
Ram Das         9
Al Gore         19
Joe Montana     25
Joe Piscopo     27
Bob Vila        23

Shell-prompt: sort --key 3 --numeric-sort ages.txt
Ram Das         9
Al Gore         19
Mohammad Ali    22
Bob Vila        23
Joe Montana     25
Ingrid Bergman  26
Joe Piscopo     27

The sort command can process files of any size, regardless of available memory. If a file is too large to fit in memory, it is broken into smaller pieces, which are sorted separately and saved to temporary files. The sorted temporary files are then merged.

The uniq command, which removes adjacent lines that are identical, is often used after sorting to remove redundancy from data. Note that the sort command also has a --unique flag, but it does not behave the same as the uniq command. The --unique flag compares keys, while the uniq command compares entire lines.

1.14.7. tr

The tr (translate) command is a simple tool for performing character conversions and deletions in a text stream. A few examples are shown below. See the tr man page for details.

We can use it to convert individual characters in a text stream. In this case, it takes two string arguments. Characters in the Nth position in the first string are replaced by characters in the Nth position in the second string:

shell-prompt: cat fox.txt
The quick brown fox   jumped over the lazy dog.
shell-prompt: tr 'xl' 'gh' < fox.txt
The quick brown fog   jumped over the hazy dog.

There is limited support for character sets enclosed in square brackets [], similar to regular expressions, including predefined sets such as [:lower:] and [:upper:]:

shell-prompt: tr '[:lower:]' '[:upper:]' < fox.txt
THE QUICK BROWN FOX   JUMPED OVER THE LAZY DOG.

We can use it to "squeeze" repeated characters down to one in a text stream. This is useful for compressing white space:

shell-prompt: tr -s ' ' < fox.txt
The quick brown fox jumped over the lazy dog.

The tr command does not support doing multiple conversions in the same command, but we can use it as a filter:

shell-prompt: tr '[:lower:]' '[:upper:]' < fox.txt | tr -s ' '
THE QUICK BROWN FOX JUMPED OVER THE LAZY DOG.

There is some overlap between the capabilities of tr, sed, awk, and other tools. Which one you choose for a given task is a matter of convenience.

1.14.8. find

The find command is a powerful tool for not only locating path names in a directory tree, but also for taking any desired action when a path name is found.

Unlike popular search utilities in macOS, Windows, and the Unix locate command. find does not use a previously constructed index of the file system, but searches the file system in its current state. Indexed search utilities very quickly produce results from a recent snapshot of the filesystem, which is rebuilt periodically by a scheduled job. This is much faster than an exhaustive search, but will miss files that were added since the last index build. The find command will take longer to search a large directory tree, but also guarantees accurate results.

The basic format of a find command is as follows:

shell-prompt: find top-directory search-criteria [optional-action \;]

The search-criteria can be any attribute of a file or other path name. To match by name, we use -name followed by a globbing pattern, in quotes to prevent the shell from expanding it before passing it to find. To search for files owned by a particular user or group, we can use -user or -group. We can also search for files with certain permissions, a minimum or maximum age, and many other criteria. The man page provides all of these details.

The default action is to print the relative path name of each match. For example, to list all the configuration files under /etc, we could use the following:

shell-prompt: find /etc -name '*.conf'

We can run any Unix command in response to each match using the -exec flag followed the command and a ';' or '+'. The ';' must be escaped or quoted to prevent the shell from using it as a command separator and treating everything after it as a new command, separate from the find command. The name of the matched path is represented by '{}'.

shell-prompt: find /etc -name '*.conf' -exec ls -l '{}' \;

With a ';' terminating the command, the command is executed immediately after each match. This may be necessary in some situations, but it entails a great deal of overhead from running the same command many times. Replacing the ';' with a '+' tells find to accumulate as many path names as possible and pass them all to one invocation of the command. This means the command could receive thousands of path names as arguments and will be executed far fewer times.

shell-prompt: find /etc -name '*.conf' -exec ls -l '{}' +

There are also some predefined actions we can use instead of spelling out a -exec, such as -print, which is the default action, and -ls, which is equivalent to -exec ls -l '{}' +. The -print action is useful for showing path names being processed by another action:

shell-prompt: find Data -name '*.bak' -print -exec rm '{}' +

Sometimes we may want to execute more than one command for each path matched. Rather than construct a complex and messy -exec, we may prefer to write a shell script containing the commands and run the script using -exec. Scripting is covered in Chapter 2, Unix Shell Scripting.

1.14.9. xargs

As stated earlier, most Unix commands that accept a file name as an argument will accept any number of file names. When processing 100 files with the same program, it is usually more efficient to run one process with 100 file name arguments than to run 100 processes with one argument each.

However, there is a limit to how long Unix commands can be. When processing many thousands of files, it may not be possible to run a single command with all of the filenames as arguments. The xargs command solves this problem by reading a list of file names from the standard input (which has no limit) and feeding them to another command asarguments, providing as many arguments as possible to each process created.

The arguments processed by xargs do not have to be file names, but usually are. The main trick generating the list of files. Suppose we want to change all occurrences of "fox" to "toad" in the files input*.txt in the CWD. Our first thought might be a simple command:

shell-prompt: sed -i '' -e 's|fox|toad|g' input*.txt

If there are too many files matching "input*.txt", we will get an error such as "Argument list too long". One might think to solve this problem using xargs as follows:

shell-prompt: ls *.txt | xargs sed -i '' -e 's|fox|toad|g'

However, this won't work either, because the shell hits the same argument list limit for the ls command as it does for the sed command.

The find command can come to the rescue:

shell-prompt: find . -name '*.txt' | xargs sed -i '' -e 's|fox|toad|g'

Since the shell is not trying to expand '*.txt' to an argument list, but instead passing the literal string '*.txt' to find, there is no limit on how many file names it can match. The find command is sophisticated enough to work around the limits of argument lists.

The find command above will send relative path names of every file with a name matching 'input*.txt' in and under the CWD. If we don't want to process files in subdirectories of CWD, we can limit the depth of the find command to one directory level:

shell-prompt: find . -maxdepth 1 -name '*.txt' \
              | xargs sed -i '' -e 's|fox|toad|g'

Note

The xargs command places the arguments read from the standard input after any arguments included with the command. So the commands run by xargs will have the form

sed -i '' -e 's|fox|toad|g' input1.txt input2.txt input3.txt ...

Some xargs implementations have an option for placing the arguments from the standard input before the fixed arguments, but this is still limited. There may be cases where we want the arguments intermingled. The most portable and flexible solution to this is writing a simple script that takes all the arguments from xargs last, and constructs the appropriate command with the arguments in the correct order. Scripting is covered in Chapter 2, Unix Shell Scripting.

Most xargs implementations also support running multiple processes at the same time. This provides a convenient way to utilize multiple cores to parallelize processing. If you have a computer with 16 cores and speeding up your analysis by a factor of nearly 16 is good enough, then this can be a very valuable alternative to using an HPC cluster. If you need access to hundreds of cores to get your work done in a reasonable time, then a cluster is a better option.

shell-prompt: find . -name '*.txt' \
              | xargs --max-procs 8 sed -i '' -e 's|fox|toad|g'

A value of 0 following --max-procs tells xargs to detect the number of available cores and use all of them.

There is a more sophisticated FOSS program called GNU parallel that can run commands in parallel in a similar way, but with more flexibility. It can be installed via most package managers.

1.14.10. bc

The bc (binary calculator) command is an unlimited range and precision calculator with a scripting language very similar to C. When invoked with -l or --mathlib, it includes numerous additional functions including l(x) (natural log), e(x) (exponential), s(x) (sine), c(x) (cosine), and a(x) (arctangent). There are numerous standard functions available even without --mathlib. See the man page for a full list.

By default, bc prints the result of each expression evaluated followed by a newline. There is also a print statement that does not print a newline. This allows a line of output to be constructed from multiple expressions, the last of which includes a literal "\n".

shell-prompt: bc --mathlib
sqrt(2)
1.41421356237309504880
print sqrt(2), "\n"
1.41421356237309504880
l(10)
2.30258509299404568401
x=10
5 * x^2 + 2 * x + 1
521
quit

Bc is especially useful for quick computations where extreme range or precision is required, and for checking the results from more traditional languages that lack such range and precision. For example, consider the computation of factorials. N factorial, denoted N!, is the product of all integers from one to N. The factorial function grows so quickly that 21! exceeds the range of a 64-bit unsigned integer, the largest integer value supported by most CPUS and most common languages. The C program and output below demonstrate the limitations of 64-bit integers.

#include <stdio.h>
#include <sysexits.h>

int     main(int argc,char *argv[])

{
    unsigned long   c, fact = 1;
    
    for (c = 1; c <= 22; ++c)
    {
        fact *= c;
        printf("%lu! = %lu\n", c, fact);
    }
    return EX_OK;
}

1! = 1
2! = 2
3! = 6
4! = 24
5! = 120
6! = 720
7! = 5040
8! = 40320
9! = 362880
10! = 3628800
11! = 39916800
12! = 479001600
13! = 6227020800
14! = 87178291200
15! = 1307674368000
16! = 20922789888000
17! = 355687428096000
18! = 6402373705728000
19! = 121645100408832000
20! = 2432902008176640000
21! = 14197454024290336768      This does not equal 20! * 21
22! = 17196083355034583040
23! = 8128291617894825984
24! = 10611558092380307456
25! = 7034535277573963776

At 21!, an integer overflow occurs. In the limited integer systems used by computers, adding 1 to the largest possible value produces a result of 0. The system is actually circular, which is why 21! above is wrong and 23! is actually smaller than 22!. The limitations of computer number systems are covered in ???.

In constrast, bc can compute factorials of any size, limited only by the amount of memory needed to store the value. It is, of course, much slower than C, both because it is an interpreted language and because it performs multiple precision arithmetic, which requires multiple machine instructions for every math operation. However, it is more than fast enough for many purposes and the easiest way to do math that is beyond the capabilities of common languages.

The bc script below demonstrates the the superior range of bc. The first line (#!/usr/bin/bc -l) tells the Unix shell how to run the script, so we can run it by simply typing its name, such as ./fact.bc. This will be covered in Chapter 2, Unix Shell Scripting. For now, create the script using nano fact.bc and run it with bc < fact.bc.

#!/usr/bin/bc -l

fact = 1;
for (c = 1; c <= 100; ++c)
{
    fact *= c;
    print c, "!= ", fact, "\n";
}
quit

1!= 1
2!= 2
3!= 6
4!= 24
5!= 120
6!= 720
7!= 5040
8!= 40320
9!= 362880
10!= 3628800
11!= 39916800
12!= 479001600
13!= 6227020800
14!= 87178291200
15!= 1307674368000
16!= 20922789888000
17!= 355687428096000
18!= 6402373705728000
19!= 121645100408832000
20!= 2432902008176640000
21!= 51090942171709440000
22!= 1124000727777607680000
23!= 25852016738884976640000
24!= 620448401733239439360000
25!= 15511210043330985984000000

[ Output removed for brevity ]

100!= 93326215443944152681699238856266700490715968264381621468592963\
89521759999322991560894146397615651828625369792082722375825118521091\
6864000000000000000000000000

Someone with a little knowledge of computer number systems might think that we can get around the range problem in general purpose languages like C by using floating point rather than integers. This will not work, however. While a 64-bit floating point number has a much greater range than a 64-bit integer (up to 10³⁰⁸ vs 10¹⁹ for integers), floating point actually has less precision. It sacrifices some precision in order to achieve the greater range. The modified C code and output below show that the double (64-bit floating point) type in C only gets us to 22!, and round-off error corrupts 23! and beyond.

#include <stdio.h>
#include <sysexits.h>

int     main(int argc,char *argv[])

{
    double  c, fact = 1;
    
    for (c = 1; c <= 25; ++c)
    {
        fact *= c;
        printf("%0.0f! = %0.0f\n", c, fact);
    }
    return EX_OK;
}

1! = 1
2! = 2
3! = 6
4! = 24
5! = 120
6! = 720
7! = 5040
8! = 40320
9! = 362880
10! = 3628800
11! = 39916800
12! = 479001600
13! = 6227020800
14! = 87178291200
15! = 1307674368000
16! = 20922789888000
17! = 355687428096000
18! = 6402373705728000
19! = 121645100408832000
20! = 2432902008176640000
21! = 51090942171709440000
22! = 1124000727777607680000
23! = 25852016738884978212864
24! = 620448401733239409999872
25! = 15511210043330986055303168

1.14.11. tar

TBD

1.14.12. gzip, bzip2, xz

TBD xz can be a bottleneck in pipelines when used with default options. Try lowering the compression until it is able to keep up with other processing steps (-4, -3, -2). It will still likely provide better compression than gzip.

1.14.13. zip, unzip

TBD

1.14.14. time

TBD

1.14.15. top

TBD

1.14.16. iostat

TBD

1.14.17. netstat

TBD

1.14.18. iftop

TBD

1.14.19. curl, fetch, wget

TBD

1.14.20. Practice

Instructions

Make sure you are using the latest version of this document.
Carefully read one section of this document and casually read other material (such as corresponding sections in a textbook, if one exists) if needed.
Try to answer the questions from that section. If you do not remember the answer, review the section to find it.
Write the answer in your own words. Do not copy and paste. Verbalizing answers in your own words helps your memory and understanding. Copying does not, and demonstrates a lack of interest in learning.
Check the answer key to make sure your answer is correct and complete.
DO NOT LOOK AT THE ANSWER KEY BEFORE ANSWERING QUESTIONS TO THE VERY BEST OF YOUR ABILITY. In doing so, you would only cheat yourself out of an opportunity to learn and prepare for the quizzes and exams.

Important notes:

Show all your work. This will improve your understanding and ensure full credit for the homework.
The practice problems are designed to make you think about the topic, starting from basic concepts and progressing through real problem solving.
Try to verify your own results. In the working world, no one will be checking your work. It will be entirely up to you to ensure that it is done right the first time.
Start as early as possible to get your mind chewing on the questions, and do a little at a time. Using this approach, many answers will come to you seemingly without effort, while you're showering, walking the dog, etc.

What is a regular expression? Is it the same as a globbing pattern?
How can we show lines in analysis.c containing hard-coded floating point constants?
How can we speed up grep searches when searching for a fixed string rather than an RE pattern?
How can we use extended REs with grep?
How can we make the matched pattern visible in the grep output?
Describe two major differences between grep and awk.
How does awk compare to spreadsheet programs like LibreOffice Calc and MS Excel?
The /etc/group file contains colon-separated lines in the form groupname:password:groupid:members. Show an awk command that will print the groupid and members of the group "root".
A GFF3 file contains tab-separated lines in the form "seqid source feature-type start end score strand phase attributes". The first attribute for an exon feature is the parent sequence ID. Write an awk script that reports the seqid, start, end, strand, and parent for each feature of type "exon". It should also report the number of exons and the number of genes. To test your script, download Mus_musculus.GRCm39.107.chromosome.1.gff3.gz from ensembl.org and then do the following:
```
gunzip Mus_musculus.GRCm39.107.chromosome.1.gff3.gz
awk -f your-script.awk Mus_musculus.GRCm39.107.chromosome.1.gff3
        
```
Show a cut command roughly equivalent to the following awk command, which processes a tab-separated GFF3 file.
```
awk '{ print $1, $3, $4, $5 }' file.gff3
        
```
Show a sed command that replaces all occurrences of "wolf" with "werewolf" in the file halloween-list.txt.
Show a command to sort the following data by height. Show a separate command to sort by weight. The data are in params.txt.
```
ID  Height  Weight
1   34      10
2   40      14
3   29      9
4   28      11
        
```
Show a Unix command that replaces the word "fox" with "toad" and converts all lower case letters to upper case in the file fox.txt. Output should be stored in big-toad.txt.
Show a Unix command that lists and removes all the files whose names end in '.o' in and under ~/Programs.
Why is the xargs command necessary?
Show a Unix command that removes all the files with names ending in ".tmp" only in the CWD, assuming that there are too many of them to provide as arguments to one command. The user should not be prompted for each delete. ( Check the rm man page if needed. )
Show a Unix command that processes all the files named 'input*' in the CWD, using as many cores as possible, through a command such as the following:
```
analyze --limit 5 input1 input2
        
```
What is the most portable and flexible way to use xargs when the arguments it provides to the command must precede some of the fixed arguments?
What is the major advantage of the bc calculator over common programming languages?
Show a bc expression that prints the value of the natural number, e.

Write a bc script that prints the following. Create the script with nano sqrt.bc and run it with bc < sqrt.bc.

sqrt(1) = 1.00000000000000000000
sqrt(2) = 1.41421356237309504880
sqrt(3) = 1.73205080756887729352
sqrt(4) = 2.00000000000000000000
sqrt(5) = 2.23606797749978969640
sqrt(6) = 2.44948974278317809819
sqrt(7) = 2.64575131106459059050
sqrt(8) = 2.82842712474619009760
sqrt(9) = 3.00000000000000000000
sqrt(10) = 3.16227766016837933199

Prev	Up	Next
1.13. Redirection and Pipes	Home	1.15. File Transfer