Congratulations on reaching the holy land of Unix data processing. It has often been said that if you know Unix well, you may never need to write a program. The tools provided by Unix often contain all the functionality you need to process your data. They are like a box of Legos from which we can construct a machine to perform almost any data analysis imaginable from the Unix shell.
Most of these tools function as filters, so they can be incorporated into pipelines. Most also accept filenames as command-line arguments for simpler use cases.
In this section, we'll introduce some of the most powerful tools that are heavily used by researchers to process data files. This will certainly reduce, if not eliminate, the need to write your own programs for many projects. This is only an introduction to make you aware of the available tools and the power they can give you.
For more detailed information, consult the man pages and other sources. Some tools, such as awk and sed, have entire books written about them, in case you want to explore in-depth.
However, do not set out to learn as much as you can about these tools. Set out to learn as much as you need. The ability to show off your vast knowledge is not the ability to achieve. Knowledge is not wisdom. Cleverness is not wisdom. Wisdom is doing. Learn what you need to accomplish today's goals as elegantly as possible, and then do it. You will learn more from this doing than from any amount of studying. You will develop problem-solving skills and instincts, which are far more valuable than encyclopedic knowledge.
Never stop wondering if there might be an even more elegant solution. Albert Einstein was once asked what his goal in life was. His response: "To simplify." Use the tools presented here to simplify your research and, by extension, your life. With this approach, you can achieve great things without great effort and spend your time savoring the wonders and mysteries of your work rather than memorizing facts that might come in handy one day.
grep shows lines in one or more text streams that match a given regular expression (RE). It is an acronym for Global Regular Expression Print (or Pattern or Parser if you prefer).
shell-prompt: grep expression [file ...]
The expression is often a simple string, but can represent RE patterns as described in detail by man re_format on FreeBSD. There are also numerous web pages describing REs.
Using simple strings or REs, we can search any file stream for lines containing information of interest. By knowing how to construct REs that represent the information you seek, you can easily identify patterns in your data.
REs resemble globbing patterns, but they are not the same. For example, '*' by itself in a globbing pattern means any sequence of 0 or more characters. In an RE, '*' means 0 or more of the preceding character. '*' in globbing is expressed as '.*' in an RE. Some of the most common RE patterns are shown in Table 1.12, “RE Patterns”.
Table 1.12. RE Patterns
Pattern    Meaning
.          Any single character
*          0 or more of the preceding character
+          1 or more of the preceding character (extended REs on some systems)
[]         One character in the set or range of the enclosed characters (same as globbing)
^          Beginning of the line
$          End of the line
.*         0 or more of any character
.+         1 or more of any character
[a-z]*     0 or more lower-case letters
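As a quick illustration combining several of these patterns, the hypothetical command below (data.txt is just a placeholder name) shows lines that consist entirely of digits. The '^' and '$' anchors ensure that nothing else appears on the line:
shell-prompt: grep '^[0-9][0-9]*$' data.txt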
The command below shows all lines containing a call to the printf() function in prog1.c.
shell-prompt: grep printf prog1.c
We might also wish to show all lines containing variable names in prog1.c. Since we are looking for any variable name rather than one particular variable, we cannot use a simple string and must construct a regular expression. Variable names begin with a letter or underscore and may contain any number of letters, underscores, or digits after that. So our RE must require a letter or underscore for the first character and then accept zero or more letters, digits, or underscores after that.
shell-prompt: grep '[a-zA-Z_][a-zA-Z0-9_]*' prog1.c
The following shows lines that have a '#' in the first column:
shell-prompt: grep '^#' prog1.c
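The '$' anchor works the same way at the end of the line. As a small sketch using the same prog1.c, the following shows lines ending with a semicolon:
shell-prompt: grep ';$' prog1.c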
The following shows lines in prog1.java that appear to contain a method call, i.e. an identifier, a '.', a method name, and a statement ending with ';':
shell-prompt: grep '[a-zA-Z_][a-zA-Z0-9_]*\.[a-zA-Z]*\(.*\);' prog1.java
As an example of searching data files, rather than program code, suppose we would like to find all the lines containing contractions in a text file. A contraction consists of some letters, followed by an apostrophe, followed by more letters. Since the apostrophe is the same character as the single quotes we might use to enclose the RE, we either enclose the RE in double quotes, or end the single-quoted string, insert an escaped apostrophe (\'), and resume quoting. Note also that '+' is an extended RE operator on many systems, so we add the -E flag here (extended REs are discussed below).
shell-prompt: grep -E "[a-zA-Z]+'[a-zA-Z]+"
shell-prompt: grep -E '[a-zA-Z]+'\''[a-zA-Z]+'
Another example would be searching for DNA sequences in a genome. We might use this to locate adapters, artificial sequences added to the ends of DNA fragments for the sequencing process, in our sequence data. Sequences are usually stored one per line in a text file in FASTA format. A common adapter sequence is "CTGTCTCTTATA". Since this is a fixed string rather than an RE pattern, we can use fgrep, which searches for literal strings and is generally faster than grep.
shell-prompt: fgrep CTGTCTCTTATA file.fasta
GCGGCCAACACCTTGCCTGTATTGGCATCCATGATGAAATGGGCGTAACCCTGTCTCTTATACACATCTCCGAG
AAAGGCCTGTATGATAAGTTGGCAAATTTCCTCAAGATTGTTTACTTGATACACCTGTCTCTTATACACATCTC
GACCGAGGCACTCGCCGCGCTTGAGCTCGAGATCGATGCCGTCGACCTGTCTCTTATACACATCTCCGAGCCCA
AAAAAATCCCTCCGAAGCATTGTAGGTTTCCATGCTGTCTCTTATACACATCTCCGAGCCCACGAGACTCCTGA
It's hard to see the pattern we were looking for in this output. To solve this problem, we can colorize matched patterns using the --color flag, as shown in Figure 1.3, “Colorized grep output”.
There is an extended version of regular expressions that is not supported by the normal grep command. Extended REs include things like alternative strings, which are separated by a '|'. For example, we might want to search for either of two adapter sequences. To enable extended REs, we use egrep or grep --extended-regexp.
shell-prompt: egrep 'CTGTCTCTTATA|AGATCGGAAGAG' file.fasta
AWK, an acronym for Aho, Weinberger, and Kernighan (the original developers of the program), is an extremely powerful tool for processing tabular data. Like grep, it supports RE matching, but unlike grep, it can process individual columns, called fields, in the data. It also includes a flexible scripting language that closely resembles the C language, so we can perform highly sophisticated processing of whole lines or individual fields.
Awk can be used to automate many of the same tasks that researchers often perform manually in a spreadsheet program such as LibreOffice Calc or MS Excel.
There are multiple implementations of awk. Among the most common are "the one true awk", evolved from the original awk code and used on many BSD systems, and gawk, the GNU project implementation, used on most Linux systems. Mawk is an independent implementation that tends to outperform the others; it is available in most package managers. Awka is an awk-to-C translator that can convert most awk scripts to C for maximum performance.
Fields by default are separated by white space, i.e. space or tab characters. However, awk allows us to specify any set of separators using an RE following the -F flag or embedded in the script, so we can process tab-separated (.tsv) files, comma-separated (.csv) files, or any other data that can be broken down into columns.
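As a quick preview (the script syntax is explained below), the hedged example that follows prints the second column of a hypothetical comma-separated file named results.csv:
shell-prompt: awk -F , '{ print $2 }' results.csv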
An awk script consists of one or more lines containing a pattern and an action. The action is enclosed in curly braces, like a C code block.
pattern { action }
The pattern is used to select lines from the input, usually using a relational expression such as those found in an if statement. The action determines what to do when a line is selected. If no pattern is given, the action is applied to every line of input. If no action is given, the default is to print the line.
In both the pattern and the action, we can refer to the entire line as $0. $1 is the first field: all text up to but not including the first separator. $2 is the second field: all text between the first and second separators. And so on...
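Because the default action is to print the selected line, a pattern alone behaves much like grep. As a sketch, with the file name and threshold chosen only for illustration, the command below prints every line of file.gff3 whose fourth field is greater than 1000000:
shell-prompt: awk '$4 > 1000000' file.gff3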
It is very common to use awk "one-liners" on the command-line, without actually creating an awk script file. In this case, the awk script is the first argument to awk, usually enclosed in quotes to allow for white space and special characters. The second argument is the input file to be processed by the script.
For example, the file /etc/passwd contains colon-separated fields including the username ($1), user ID ($3), primary group ID ($4), full name ($5), home directory ($6), and the user's shell program ($7). To see the full name from every line, we could use the following simple command, which has no pattern (so it processes every line) and an action of printing the fifth field:
shell-prompt: awk -F : '{ print $5 }' /etc/passwd
Jason Bacon
D-BUS Daemon User
TCG Software Stack user
Avahi Daemon User
...
To see a list of usernames and shells:
shell-prompt: awk -F : '{ print $1, $7 }' /etc/passwd
bacon /bin/tcsh
messagebus /usr/sbin/nologin
_tss /usr/sbin/nologin
avahi /usr/sbin/nologin
...
Many data files used in research computing are tabular, with one of the most popular formats being TSV (tab-separated value) files. The General Feature Format, or GFF, file is a TSV format for describing features of a genome. The first field contains the sequence ID (such as a chromosome number) on which the feature resides. The third field contains the feature type, such as "gene" or "exon". The fourth and fifth fields contain the starting and ending positions within the sequence. The ninth field contains "attributes", such as the globally unique feature ID and possibly the feature name and other information, separated by semicolons. If we just want to see the location and attributes of every gene in a genome, we could use the following:
shell-prompt: awk '$3 == "gene" { print $1, $4, $5, $9 }' file.gff3
1 3073253 3074322 ID=gene:ENSMUSG00000102693;Name=4933401J01Rik
1 3205901 3671498 ID=gene:ENSMUSG00000051951;Name=Xkr4
...
Suppose we want to extract specific attributes from the semicolon-separated attributes field, such as the gene ID and gene name, as well as count the number of genes in the input. This will require a few more awk features.
The gene ID is always the first attribute in the field, assuming the feature is a gene. Not every gene has a name, so we will need to scan the attributes for this information. Awk makes this easy. We can break the attributes field into an array of strings using the split() function, and then use a loop to search the attributes for one beginning with "Name=".
To count the genes in the input, we need to initialize a count variable before we begin processing the file, increment it for each gene found, and print it after processing is finished. For this we can use the special patterns BEGIN and END, which allow us to run an action before and after processing the input.
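As a minimal sketch of BEGIN and END (the file name and variable name are just for illustration), the one-liner below counts the lines in file.txt:
shell-prompt: awk 'BEGIN { count = 0 } { ++count } END { print count, "lines" }' file.txt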
We will use the C-like printf() function to format the output. The basic print statement always adds a newline, so it does not allow us to print part of a line and finish it with a subsequent print statement.
Since this is a multi-line script, we will save it in a file called gene-info.awk and run it using the -f flag, which tells awk to get the script from a file rather than the command line.
shell-prompt: awk -f gene-info.awk file.gff3
BEGIN {
    gene_count = 0;
}

$3 == "gene" {
    # Separate attributes into an array
    split($9, attributes, ";");

    # Print location and feature ID
    printf("%s %s %s %s", $1, $4, $5, attributes[1]);

    # Look for a name attribute and print it if it exists
    # With the for-in loop, c gets the SUBSCRIPT of each element in the
    # attributes array
    for ( c in attributes )
    {
        # See if first 5 characters of the attribute are "Name="
        if ( substr(attributes[c], 1, 5) == "Name=" )
            printf(" %s", attributes[c]);
    }

    # Terminate the output line
    printf("\n");

    # Count this gene
    ++gene_count;
}

END {
    printf("\nGenes found = %d\n", gene_count);
}
As we can see, we can do some fairly sophisticated data processing with a very short awk script. There is very little that awk cannot do conveniently with tabular data. If a particular task seems like it will be difficult to do with awk, don't give up too easily. Chances are, with a little thought and effort, you can come up with an elegant awk script to get the job done.
That said, there are always other options for processing tabular data. Perl is a scripting language especially well suited to text processing, with its powerful RE handling capabilities and numerous features. Python has also become popular for such tasks in recent years.
Awk is highly efficient, and processing steps performed with it are rarely a bottleneck in an analysis pipeline. If you do need better performance than awk provides, there are C libraries that can be used to easily parse tabular data, such as libxtend. Libxtend includes a set of DSV (delimiter-separated-value) processing functions that make it easy to read fields from files in formats like TSV, CSV, etc. Once you have read a line or an individual field using libxtend's DSV functions, you now have the full power and performance of C at your disposal to process it in minimal time.
Full coverage of awk's capabilities is far beyond the scope of this text. Readers are encouraged to explore it further via the awk man page and one of the many books available on the language.
The cut command selects columns from a file by byte position, by character position, or, like awk, by delimiter-separated fields. Note that in the modern world a character may occupy more than one byte, so bytes and characters are distinguished here.
To extract columns by byte or character position, we use the -b or -c option followed by a list of positions. The list is comma-separated and may contain individual positions or ranges denoted with a '-'. For example, to extract character positions 1 through 10 and 21 through 26 from every line of file.txt, we could use the following:
shell-prompt: cut -c 1-10,21-26 file.txt
For delimiter-separated columns, we use -d to indicate the delimiter. The default is a tab character alone, not just any white space. The -w flag tells cut to accept any white space (tab or space) as the delimiter. The -f flag is then used to indicate the fields to extract, much like -c is used for character positions. Output is separated by the same delimiter as the input.
For example, to extract the username, userid, groupid, and full name (fields 1, 3, 4, and 5) from /etc/passwd, we could use the following:
shell-prompt: cut -d : -f 1,3-5 /etc/passwd
...
ganglia:102:102:Ganglia User
nagios:181:181:Nagios pseudo-user
webcamd:145:145:Webcamd user
The above is equivalent to the following awk command:
shell-prompt: awk -F : '{ printf("%s:%s:%s:%s\n", $1, $3, $4, $5); }' /etc/passwd
The sed command is a stream editor. It makes changes to a file stream with no interaction from the user. It is probably most often used to make simple text substitutions, though it can also do insertions and deletions of lines and parts of lines, even selecting lines by number or based on pattern matching much like grep and awk. A basic substitution command takes the following format:
sed -e 's|pattern|replacement|g' input-file
Pattern is any regular expression, like those used in grep or awk. Replacement can be a fixed string, but also takes some special characters, such as &, which represents the string matched by pattern. It can also be empty if you simply want to remove occurrences of pattern from the text.
The characters enclosing pattern and replacement are arbitrary. The '|' character is often used because it stands out among most other characters. If either pattern or replacement contains a '|', simply use a different separator, such as '/'. The 'g' after the pattern means "global". Without it, sed will only replace the first occurrence of pattern in each line. With it, all matches are replaced.
shell-prompt: cat fox.txt
The quick brown fox jumped over the lazy dog.
shell-prompt: sed -e 's|fox|worm|g' fox.txt
The quick brown worm jumped over the lazy dog.
shell-prompt: sed -e 's/brown //g' -e 's|fox|&y worm|g' fox.txt
The quick foxy worm jumped over the lazy dog.
Using -E in place of -e causes sed to support extended regular expressions.
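For instance, the '+' operator (one or more of the preceding character) is an extended RE feature. The hedged example below, using the fox.txt file from above, replaces each run of one or more spaces with a dash:
shell-prompt: sed -E -e 's| +|-|g' fox.txt
The-quick-brown-fox-jumped-over-the-lazy-dog.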
By default, sed sends output to the standard output stream. The -i flag tells sed to edit the file in-place, i.e. replace the original file with the edited text. This flag should be followed by a filename extension, such as ".bak". The original file will then be saved to filename.bak, so that you can reverse the changes if you make a mistake. The extension can be an empty string, e.g. '', if you are sure you don't need a backup of the original.
There is a rare portability issue with sed. GNU sed requires that the extension be nestled against the -i:
shell-prompt: sed -i.bak -e 's|pattern|replacement|g' file.txt
Some other implementations require a space between the -i and the extension, which is more orthodox among Unix commands:
shell-prompt: sed -i .bak -e 's|pattern|replacement|g' file.txt
FreeBSD's sed accepts either form. You must be aware of this in order to ensure that scripts using sed are portable. The safest approach is not to use the -i flag, but simply save the output to a temporary file and then move it:
shell-prompt: sed -e 's|pattern|replacement|g' file.txt > file.txt.tmp
shell-prompt: mv file.txt.tmp file.txt
This way, it won't matter which implementation of sed is present when someone runs your script.
Sed is a powerful and complex tool that is beyond the scope of this text. Readers are encouraged to consult books and other documentation to explore further.
The sort command sorts text data line by line according to one or more keys. A key indicates a field (usually a column separated by white space or some other delimiter) and the type of comparison, such as lexical (like alphabetical, but including non-letters) or numeric.
If no keys are specified, sort compares entire lines lexically.
The --key flag followed by a field number restricts comparison to that field. Fields are numbered starting with 1. This can be used in conjunction with the --field-separator flag to specify a separator other than the default white space. The --numeric-sort flag must be used to perform integer comparison rather than lexical comparison, and the --general-numeric-sort flag must be used to compare real numbers.
shell-prompt: cat ages.txt
Bob Vila        23
Joe Piscopo     27
Al Gore         19
Ingrid Bergman  26
Mohammad Ali    22
Ram Das         9
Joe Montana     25
shell-prompt: sort ages.txt
Al Gore         19
Bob Vila        23
Ingrid Bergman  26
Joe Montana     25
Joe Piscopo     27
Mohammad Ali    22
Ram Das         9
shell-prompt: sort --key 2 ages.txt
Mohammad Ali    22
Ingrid Bergman  26
Ram Das         9
Al Gore         19
Joe Montana     25
Joe Piscopo     27
Bob Vila        23
shell-prompt: sort --key 3 --numeric-sort ages.txt
Ram Das         9
Al Gore         19
Mohammad Ali    22
Bob Vila        23
Joe Montana     25
Ingrid Bergman  26
Joe Piscopo     27
The sort command can process files of any size, regardless of available memory. If a file is too large to fit in memory, it is broken into smaller pieces, which are sorted separately and saved to temporary files. The sorted temporary files are then merged.
The uniq command, which removes adjacent lines that are identical, is often used after sorting to remove redundancy from data. Note that the sort command also has a --unique flag, but it does not behave the same as the uniq command. The --unique flag compares keys, while the uniq command compares entire lines.
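As a simple sketch, assuming a hypothetical file colors.txt with one color name per line (possibly repeated), the pipeline below lists each distinct color once. Adding the -c flag to uniq would prefix each line with the number of times it occurred:
shell-prompt: sort colors.txt | uniq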
The tr (translate) command is a simple tool for performing character conversions and deletions in a text stream. A few examples are shown below. See the tr man page for details.
We can use it to convert individual characters in a text stream. In this case, it takes two string arguments. Characters in the Nth position in the first string are replaced by characters in the Nth position in the second string:
shell-prompt: cat fox.txt
The quick brown fox jumped over the lazy dog.
shell-prompt: tr 'xl' 'gh' < fox.txt
The quick brown fog jumped over the hazy dog.
There is limited support for character sets enclosed in square brackets [], similar to regular expressions, including predefined sets such as [:lower:] and [:upper:]:
shell-prompt: tr '[:lower:]' '[:upper:]' < fox.txt
THE QUICK BROWN FOX JUMPED OVER THE LAZY DOG.
We can use it to "squeeze" repeated characters down to one in a text stream. This is useful for compressing white space:
shell-prompt: tr -s ' ' < fox.txt
The quick brown fox jumped over the lazy dog.
The tr command does not support doing multiple conversions in the same command, but we can use it as a filter:
shell-prompt: tr '[:lower:]' '[:upper:]' < fox.txt | tr -s ' '
THE QUICK BROWN FOX JUMPED OVER THE LAZY DOG.
There is some overlap between the capabilities of tr, sed, awk, and other tools. Which one you choose for a given task is a matter of convenience.
The find command is a powerful tool for not only locating path names in a directory tree, but also for taking any desired action when a path name is found.
Unlike popular search utilities in macOS and Windows, and the Unix locate command, find does not use a previously constructed index of the file system, but searches the file system in its current state. Indexed search utilities very quickly produce results from a recent snapshot of the file system, which is rebuilt periodically by a scheduled job. This is much faster than an exhaustive search, but will miss files that were added since the last index build. The find command will take longer to search a large directory tree, but also guarantees accurate results.
The basic format of a find command is as follows:
shell-prompt: find top-directory search-criteria [optional-action \;]
The search-criteria can be any attribute of a file or other path name. To match by name, we use -name followed by a globbing pattern, in quotes to prevent the shell from expanding it before passing it to find. To search for files owned by a particular user or group, we can use -user or -group. We can also search for files with certain permissions, a minimum or maximum age, and many other criteria. The man page provides all of these details.
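For example, the hypothetical command below (the username is just for illustration) lists files under the CWD owned by the user joe that have not been modified in more than 30 days:
shell-prompt: find . -user joe -mtime +30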
The default action is to print the relative path name of each match. For example, to list all the configuration files under /etc, we could use the following:
shell-prompt: find /etc -name '*.conf'
We can run any Unix command in response to each match using the -exec flag, followed by the command and a ';' or '+'. The ';' must be escaped or quoted to prevent the shell from using it as a command separator and treating everything after it as a new command, separate from the find command. The name of the matched path is represented by '{}'.
shell-prompt: find /etc -name '*.conf' -exec ls -l '{}' \;
With a ';' terminating the command, the command is executed immediately after each match. This may be necessary in some situations, but it entails a great deal of overhead from running the same command many times. Replacing the ';' with a '+' tells find to accumulate as many path names as possible and pass them all to one invocation of the command. This means the command could receive thousands of path names as arguments and will be executed far fewer times.
shell-prompt: find /etc -name '*.conf' -exec ls -l '{}' +
There are also some predefined actions we can use instead of spelling out a -exec, such as -print, which is the default action, and -ls, which is equivalent to -exec ls -l '{}' +. The -print action is useful for showing path names being processed by another action:
shell-prompt: find Data -name '*.bak' -print -exec rm '{}' +
Sometimes we may want to execute more than one command for each path matched. Rather than construct a complex and messy -exec, we may prefer to write a shell script containing the commands and run the script using -exec. Scripting is covered in Chapter 2, Unix Shell Scripting.
As stated earlier, most Unix commands that accept a file name as an argument will accept any number of file names. When processing 100 files with the same program, it is usually more efficient to run one process with 100 file name arguments than to run 100 processes with one argument each.
However, there is a limit to how long Unix commands can be. When processing many thousands of files, it may not be possible to run a single command with all of the filenames as arguments. The xargs command solves this problem by reading a list of file names from the standard input (which has no limit) and feeding them to another command as arguments, providing as many arguments as possible to each process created.
The arguments processed by xargs do not have to be file names, but usually are. The main trick is generating the list of files. Suppose we want to change all occurrences of "fox" to "toad" in the files input*.txt in the CWD. Our first thought might be a simple command:
shell-prompt: sed -i '' -e 's|fox|toad|g' input*.txt
If there are too many files matching "input*.txt", we will get an error such as "Argument list too long". One might think to solve this problem using xargs as follows:
shell-prompt: ls *.txt | xargs sed -i '' -e 's|fox|toad|g'
However, this won't work either, because the shell hits the same argument list limit for the ls command as it does for the sed command.
The find command can come to the rescue:
shell-prompt: find . -name '*.txt' | xargs sed -i '' -e 's|fox|toad|g'
Since the shell is not trying to expand '*.txt' to an argument list, but instead passing the literal string '*.txt' to find, there is no limit on how many file names it can match. The find command is sophisticated enough to work around the limits of argument lists.
The find command above will send the relative path names of every file with a name matching '*.txt' in and under the CWD to xargs. If we don't want to process files in subdirectories of the CWD, we can limit the depth of the find command to one directory level:
shell-prompt: find . -maxdepth 1 -name '*.txt' \
    | xargs sed -i '' -e 's|fox|toad|g'
The xargs command places the arguments read from the standard input after any arguments included with the command. So the commands run by xargs will have the form
sed -i '' -e 's|fox|toad|g' input1.txt input2.txt input3.txt ...
Some xargs implementations have an option for placing the arguments from the standard input before the fixed arguments, but this is still limited. There may be cases where we want the arguments intermingled. The most portable and flexible solution to this is writing a simple script that takes all the arguments from xargs last, and constructs the appropriate command with the arguments in the correct order. Scripting is covered in Chapter 2, Unix Shell Scripting.
Most xargs implementations also support running multiple processes at the same time. This provides a convenient way to utilize multiple cores to parallelize processing. If you have a computer with 16 cores and speeding up your analysis by a factor of nearly 16 is good enough, then this can be a very valuable alternative to using an HPC cluster. If you need access to hundreds of cores to get your work done in a reasonable time, then a cluster is a better option.
shell-prompt: find . -name '*.txt' \
    | xargs --max-procs 8 sed -i '' -e 's|fox|toad|g'
A value of 0 following --max-procs tells xargs to detect the number of available cores and use all of them.
There is a more sophisticated FOSS program called GNU parallel that can run commands in parallel in a similar way, but with more flexibility. It can be installed via most package managers.
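A minimal sketch of its use, paralleling the xargs example above and assuming GNU parallel is installed, might look like the following. The '{}' is replaced by each file name read from the standard input, and by default one job is run per core:
shell-prompt: find . -name '*.txt' | parallel sed -i '' -e 's|fox|toad|g' {}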
The bc (binary calculator) command is an unlimited range and precision calculator with a scripting language very similar to C. When invoked with -l or --mathlib, it includes numerous additional functions, including l(x) (natural log), e(x) (exponential), s(x) (sine), c(x) (cosine), and a(x) (arctangent). There are numerous standard functions available even without --mathlib. See the man page for a full list.
By default, bc prints the result of each expression evaluated followed by a newline. There is also a print statement that does not print a newline. This allows a line of output to be constructed from multiple expressions, the last of which includes a literal "\n".
shell-prompt: bc --mathlib
sqrt(2)
1.41421356237309504880
print sqrt(2), "\n"
1.41421356237309504880
l(10)
2.30258509299404568401
x=10
5 * x^2 + 2 * x + 1
521
quit
Bc is especially useful for quick computations where extreme range or precision is required, and for checking the results from more traditional languages that lack such range and precision. For example, consider the computation of factorials. N factorial, denoted N!, is the product of all integers from one to N. The factorial function grows so quickly that 21! exceeds the range of a 64-bit unsigned integer, the largest integer type supported by most CPUs and most common languages. The C program and output below demonstrate the limitations of 64-bit integers.
#include <stdio.h>
#include <sysexits.h>

int     main(int argc, char *argv[])

{
    unsigned long   c, fact = 1;

    for (c = 1; c <= 25; ++c)
    {
        fact *= c;
        printf("%lu! = %lu\n", c, fact);
    }
    return EX_OK;
}
1! = 1
2! = 2
3! = 6
4! = 24
5! = 120
6! = 720
7! = 5040
8! = 40320
9! = 362880
10! = 3628800
11! = 39916800
12! = 479001600
13! = 6227020800
14! = 87178291200
15! = 1307674368000
16! = 20922789888000
17! = 355687428096000
18! = 6402373705728000
19! = 121645100408832000
20! = 2432902008176640000
21! = 14197454024290336768    This does not equal 20! * 21
22! = 17196083355034583040
23! = 8128291617894825984
24! = 10611558092380307456
25! = 7034535277573963776
At 21!, an integer overflow occurs. In the limited integer systems used by computers, adding 1 to the largest possible value produces a result of 0. The system is actually circular, which is why 21! above is wrong and 23! is actually smaller than 22!. The limitations of computer number systems are covered in ???.
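We can check this arithmetic for 21! above: the true value is 51090942171709440000, and 51090942171709440000 mod 2^64 = 51090942171709440000 - 2 * 18446744073709551616 = 14197454024290336768, which is exactly the value printed by the C program.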
In contrast, bc can compute factorials of any size, limited only by the amount of memory needed to store the value. It is, of course, much slower than C, both because it is an interpreted language and because it performs multiple precision arithmetic, which requires multiple machine instructions for every math operation. However, it is more than fast enough for many purposes and the easiest way to do math that is beyond the capabilities of common languages.
The bc script below demonstrates the superior range of bc. The first line (#!/usr/bin/bc -l) tells the Unix shell how to run the script, so we can run it by simply typing its name, such as ./fact.bc. This will be covered in Chapter 2, Unix Shell Scripting. For now, create the script using nano fact.bc and run it with bc < fact.bc.
#!/usr/bin/bc -l

fact = 1;
for (c = 1; c <= 100; ++c)
{
    fact *= c;
    print c, "!= ", fact, "\n";
}
quit
1!= 1
2!= 2
3!= 6
4!= 24
5!= 120
6!= 720
7!= 5040
8!= 40320
9!= 362880
10!= 3628800
11!= 39916800
12!= 479001600
13!= 6227020800
14!= 87178291200
15!= 1307674368000
16!= 20922789888000
17!= 355687428096000
18!= 6402373705728000
19!= 121645100408832000
20!= 2432902008176640000
21!= 51090942171709440000
22!= 1124000727777607680000
23!= 25852016738884976640000
24!= 620448401733239439360000
25!= 15511210043330985984000000
[ Output removed for brevity ]
100!= 93326215443944152681699238856266700490715968264381621468592963\
89521759999322991560894146397615651828625369792082722375825118521091\
6864000000000000000000000000
Someone with a little knowledge of computer number systems might think that we can get around the range problem in general purpose languages like C by using floating point rather than integers. This will not work, however. While a 64-bit floating point number has a much greater range than a 64-bit integer (up to about 10^308 vs about 10^19 for integers), floating point actually has less precision. It sacrifices some precision in order to achieve the greater range. The modified C code and output below show that the double (64-bit floating point) type in C only gets us to 22!, and round-off error corrupts 23! and beyond.
#include <stdio.h>
#include <sysexits.h>

int     main(int argc, char *argv[])

{
    double  c, fact = 1;

    for (c = 1; c <= 25; ++c)
    {
        fact *= c;
        printf("%0.0f! = %0.0f\n", c, fact);
    }
    return EX_OK;
}
1! = 1
2! = 2
3! = 6
4! = 24
5! = 120
6! = 720
7! = 5040
8! = 40320
9! = 362880
10! = 3628800
11! = 39916800
12! = 479001600
13! = 6227020800
14! = 87178291200
15! = 1307674368000
16! = 20922789888000
17! = 355687428096000
18! = 6402373705728000
19! = 121645100408832000
20! = 2432902008176640000
21! = 51090942171709440000
22! = 1124000727777607680000
23! = 25852016738884978212864
24! = 620448401733239409999872
25! = 15511210043330986055303168
xz can be a bottleneck in pipelines when used with default options. Try lowering the compression level until xz is able to keep up with the other processing steps (-4, -3, -2). It will still likely provide better compression than gzip.
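A hedged sketch of the idea, with a placeholder command name standing in for a real analysis step:
shell-prompt: analysis-step < input.tsv | xz -2 > results.tsv.xz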
Make sure you are using the latest version of this document.
Carefully read one section of this document and casually read other material (such as corresponding sections in a textbook, if one exists) if needed.
Try to answer the questions from that section. If you do not remember the answer, review the section to find it.
Write the answer in your own words. Do not copy and paste. Verbalizing answers in your own words helps your memory and understanding. Copying does not, and demonstrates a lack of interest in learning.
Check the answer key to make sure your answer is correct and complete.
DO NOT LOOK AT THE ANSWER KEY BEFORE ANSWERING QUESTIONS TO THE VERY BEST OF YOUR ABILITY. In doing so, you would only cheat yourself out of an opportunity to learn and prepare for the quizzes and exams.
Important notes:
Show all your work. This will improve your understanding and ensure full credit for the homework.
The practice problems are designed to make you think about the topic, starting from basic concepts and progressing through real problem solving.
Try to verify your own results. In the working world, no one will be checking your work. It will be entirely up to you to ensure that it is done right the first time.
Start as early as possible to get your mind chewing on the questions, and do a little at a time. Using this approach, many answers will come to you seemingly without effort, while you're showering, walking the dog, etc.
What is a regular expression? Is it the same as a globbing pattern?
How can we show lines in analysis.c containing hard-coded floating point constants?
How can we speed up grep searches when searching for a fixed string rather than an RE pattern?
How can we use extended REs with grep?
How can we make the matched pattern visible in the grep output?
Describe two major differences between grep and awk.
How does awk compare to spreadsheet programs like LibreOffice Calc and MS Excel?
The /etc/group file contains colon-separated lines in the form groupname:password:groupid:members. Show an awk command that will print the groupid and members of the group "root".
A GFF3 file contains tab-separated lines in the form "seqid source feature-type start end score strand phase attributes". The first attribute for an exon feature is the parent sequence ID. Write an awk script that reports the seqid, start, end, strand, and parent for each feature of type "exon". It should also report the number of exons and the number of genes. To test your script, download Mus_musculus.GRCm39.107.chromosome.1.gff3.gz from ensembl.org and then do the following:
gunzip Mus_musculus.GRCm39.107.chromosome.1.gff3.gz
awk -f your-script.awk Mus_musculus.GRCm39.107.chromosome.1.gff3
Show a cut command roughly equivalent to the following awk command, which processes a tab-separated GFF3 file.
awk '{ print $1, $3, $4, $5 }' file.gff3
Show a sed command that replaces all occurrences of "wolf" with "werewolf" in the file halloween-list.txt.
Show a command to sort the following data by height. Show a separate command to sort by weight. The data are in params.txt.
ID  Height  Weight
1   34      10
2   40      14
3   29      9
4   28      11
Show a Unix command that replaces the word "fox" with "toad" and converts all lower case letters to upper case in the file fox.txt. Output should be stored in big-toad.txt.
Show a Unix command that lists and removes all the files whose names end in '.o' in and under ~/Programs.
Why is the xargs command necessary?
Show a Unix command that removes all the files with names ending in ".tmp" only in the CWD, assuming that there are too many of them to provide as arguments to one command. The user should not be prompted for each delete. ( Check the rm man page if needed. )
Show a Unix command that processes all the files named 'input*' in the CWD, using as many cores as possible, through a command such as the following:
analyze --limit 5 input1 input2
What is the most portable and flexible way to use xargs when the arguments it provides to the command must precede some of the fixed arguments?
What is the major advantage of the bc calculator over common programming languages?
Show a bc expression that prints the value of the natural number, e.
Write a bc script that prints the following. Create the script with nano sqrt.bc and run it with bc < sqrt.bc.
sqrt(1) = 1.00000000000000000000
sqrt(2) = 1.41421356237309504880
sqrt(3) = 1.73205080756887729352
sqrt(4) = 2.00000000000000000000
sqrt(5) = 2.23606797749978969640
sqrt(6) = 2.44948974278317809819
sqrt(7) = 2.64575131106459059050
sqrt(8) = 2.82842712474619009760
sqrt(9) = 3.00000000000000000000
sqrt(10) = 3.16227766016837933199