Congratulations on reaching the holy land of Unix data processing. It has often been said that if you know Unix well, you may never need to write a program. The tools provided by Unix often contain all the functionality you need to process your data. They are like a box of Legos from which we can construct a machine to perform almost any data analysis imaginable from the Unix shell.
Most of these tools function as filters, so they can be incorporated into pipelines. Most also accept filenames as command-line arguments for simpler use cases.
In this section, we'll introduce some of the most powerful tools that are heavily used by researchers to process data files. This will certainly reduce, if not eliminate, the need to write your own programs for many projects. This is only an introduction to make you aware of the available tools and the power they can give you.
For more detailed information, consult the man pages and other sources. Some tools, such as awk and sed, have entire books written about them, in case you want to explore in-depth.
However, do not set out to learn as much as you can about these tools. Set out to learn as much as you need. The ability to show off your vast knowledge is not the ability to achieve. Knowledge is not wisdom. Wisdom is doing. Learn what you need to accomplish today's goals as elegantly as possible, and then do it. You will learn more from this doing than from any amount of studying. You will develop problem solving skills and instincts, which are far more valuable than encyclopedic knowledge.
Never stop wondering whether there might be an even more elegant solution. Albert Einstein was once asked what his goal in life was. His response: "To simplify." Use the tools presented here to simplify your research and, by extension, your life. With this approach you can achieve great things without great effort and spend your time savoring the wonders and mysteries of your work rather than memorizing facts that might come in handy one day.
Grep shows lines in one or more text streams that match a given regular expression (RE). It is an acronym for Global Regular Expression Print (or Pattern or Parser if you prefer).
shell-prompt: grep expression [file ...]
The expression is often a simple string, but can represent RE patterns as described in detail by man re_format on FreeBSD. There are also numerous web pages describing REs.
Using simple strings or REs, we can search any file stream for lines containing information of interest. By knowing how to construct REs that represent the information you seek, you can easily identify patterns in your data.
REs resemble globbing patterns, but they are not the same. For example, '*' by itself in a globbing pattern means any sequence of 0 or more characters. In an RE, '*' means 0 or more of the preceding character. '*' in globbing is expressed as '.*' in an RE. Some of the most common RE patterns are shown in Table 3.12, “RE Patterns”.
Table 3.12. RE Patterns
Pattern | Meaning
--------|--------
.       | Any single character
*       | 0 or more of the preceding character
+       | 1 or more of the preceding character
[]      | One character in the set or range of the enclosed characters (same as globbing)
^       | Beginning of the line
$       | End of the line
.*      | 0 or more of any character
[a-z]*  | 0 or more lower-case letters
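To see the difference concretely, compare globbing with the equivalent REs. The following pairs are a sketch assuming a directory containing C source files; the first command in each pair uses a globbing pattern, the second an RE:

shell-prompt: ls *.c                    # Globbing: names ending in ".c"
shell-prompt: ls | grep '\.c$'          # RE: lines ending in ".c"

shell-prompt: ls prog*.c                # Globbing
shell-prompt: ls | grep '^prog.*\.c$'   # Equivalent RE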
Consider the following C program:
#include <stdio.h>
#include <sysexits.h>
#include <math.h>

int     main(int argc, char *argv[])

{
    puts("Hello!");
    printf("The square root of the # 2 is %f.\n", sqrt(2.0));
    printf("The natural log of the # 2 is %f.\n", log(2.0));
    return EX_OK;
}
The command below shows all lines containing a call to the printf() function. We use quotes around the string because the shell will try to interpret the '(' without them.
shell-prompt: grep 'printf(' prog1.c
    printf("The square root of the # 2 is %f.\n", sqrt(2.0));
    printf("The natural log of the # 2 is %f.\n", log(2.0));
We might also wish to show all lines containing any function call in prog1.c. Since we are looking for any function name rather than one particular name, we cannot use a simple string and must construct a regular expression. Variable and function names begin with a letter or underscore and may contain any number of letters, underscores, or digits after that. So our RE must require a letter or underscore for the first character and then accept zero or more letters, digits, or underscores after that. We will also require an argument list (anything between () is good enough for our purposes) and a semicolon to terminate the statement.
shell-prompt: grep '[a-zA-Z_][a-zA-Z0-9_]*(.*);' prog1.c
    puts("Hello!");
    printf("The square root of the # 2 is %f.\n", sqrt(2.0));
    printf("The natural log of the # 2 is %f.\n", log(2.0));
The following shows lines that have a '#' in the first column, which represents a preprocessor directive in C or C++:
shell-prompt: grep '^#' prog1.c
#include <stdio.h>
#include <sysexits.h>
#include <math.h>
Without the '^' we match a '#' anywhere in the line:
shell-prompt: grep '#' prog1.c
#include <stdio.h>
#include <sysexits.h>
#include <math.h>
    printf("The square root of the # 2 is %f.\n", sqrt(2.0));
    printf("The natural log of the # 2 is %f.\n", log(2.0));
A similar RE matches explicit method calls of the form object.method(arguments); in a Java program. Note that the '.' between the object and method names must be escaped to match a literal period:

shell-prompt: grep '[a-zA-Z_][a-zA-Z0-9_]*\.[a-zA-Z_][a-zA-Z0-9_]*(.*);' prog1.java
As an example of searching data files, rather than program code, suppose we would like to find all the lines containing contractions in a text file. A contraction consists of some letters, followed by an apostrophe, followed by more letters. Since the apostrophe is the same character as the single quotes we might use to enclose the RE, we either need to escape it (by closing the single quotes, adding \', and reopening them) or use double quotes to enclose the RE.
shell-prompt: grep '[a-zA-Z][a-zA-Z]*'\''[a-zA-Z][a-zA-Z]*'
shell-prompt: grep "[a-zA-Z][a-zA-Z]*'[a-zA-Z][a-zA-Z]*"
Another example would be searching for DNA sequences in a genome. We might use this to locate adapters, artificial sequences added to the ends of DNA fragments for the sequencing process, in our sequence data. Sequences are usually stored one per line in a text file in FASTA format. A common adapter sequence is "CTGTCTCTTATA". Since this is a fixed string rather than an RE pattern, we can use fgrep (or grep -F), which searches for literal strings and can be faster than RE matching.
shell-prompt: fgrep CTGTCTCTTATA file.fasta
GCGGCCAACACCTTGCCTGTATTGGCATCCATGATGAAATGGGCGTAACCCTGTCTCTTATACACATCTCCGAG
AAAGGCCTGTATGATAAGTTGGCAAATTTCCTCAAGATTGTTTACTTGATACACCTGTCTCTTATACACATCTC
GACCGAGGCACTCGCCGCGCTTGAGCTCGAGATCGATGCCGTCGACCTGTCTCTTATACACATCTCCGAGCCCA
AAAAAATCCCTCCGAAGCATTGTAGGTTTCCATGCTGTCTCTTATACACATCTCCGAGCCCACGAGACTCCTGA
DNA sequences sometimes have variations, such as single nucleotide polymorphisms, or SNPs, where one nucleotide varies in different individuals. Suppose the sequence we're looking for might have either a C or a G in the 5th position. We can use an RE to accommodate this:
shell-prompt: grep CTGT[CG]TCTTATA file.fasta
It's hard to see the pattern we were looking for in this output. To solve this problem, we can colorize matched patterns using the --color flag, as shown in Figure 3.6, “Colorized grep output”.
There is an extended version of regular expressions that is not supported by the normal grep command. Extended REs include features like alternative strings, which are separated by a '|'. For example, we might want to search for either of two adapter sequences. To enable extended REs, we use egrep or grep -E (long option --extended-regexp).
shell-prompt: egrep 'CTGTCTCTTATA|AGATCGGAAGAG' file.fasta
Extended REs also support the '+' modifier to indicate 1 or more of the previous character, e.g. '[a-z]+' is shorthand for '[a-z][a-z]*'.
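For example, assuming a hypothetical file data.txt, the following shows only the lines that consist entirely of one or more digits:

shell-prompt: egrep '^[0-9]+$' data.txt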
The grep family of commands is very often used to filter streams in pipelines. If no file name argument is provided, these commands read from the standard input, like most Unix commands.
The -l (--files-with-matches) flag tells grep to merely report the names of files that contain a match. This is often used to generate a list of file names for use with another command.
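For example, assuming the CWD contains C source files, the following first lists the names of the files that call printf(), then pages through just those files:

shell-prompt: grep -l 'printf(' *.c
shell-prompt: more `grep -l 'printf(' *.c`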
AWK, an acronym for Aho, Weinberger, and Kernighan (the original developers of the program), is an extremely powerful tool for processing tabular data. Like grep, it supports RE matching, but unlike grep, it can process individual columns, called fields, in the data. It also includes a flexible scripting language that closely resembles the C language, so we can perform highly sophisticated processing of whole lines or individual fields.
Awk can be used to automate many of the same tasks that researchers often perform manually in a spreadsheet program such as LibreOffice Calc or MS Excel.
There are multiple implementations of awk. The most common are "the one true awk", which evolved from the original awk code and is used on many BSD systems; gawk, the GNU project implementation used on most Linux systems; and mawk, an independent implementation that tends to outperform the others and is available in most package managers. Awka is an awk-to-C translator that can convert most awk scripts to C for maximum performance.
Fields by default are separated by white space, i.e. space or tab characters. However, awk allows us to specify any set of separators using an RE following the -F flag or embedded in the script, so we can process tab-separated (.tsv) files, comma-separated (.csv) files, or any other data that can be broken down into columns.
An awk script consists of one or more lines containing a pattern and an action. The action is enclosed in curly braces, like a C code block.
pattern { action }
The pattern is used to select lines from the input, usually using a relational expression such as those found in an if statement. The action determines what to do when a line is selected. If no pattern is given, the action is applied to every line of input. If no action is given, the default is to print the line.
In both the pattern and the action, we can refer to the entire line as $0. $1 is the first field: all text up to but not including the first separator. $2 is the second field: all text between the first and second separators. And so on...
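For example, assuming a hypothetical white space-separated file data.txt with a numeric second field, the commands below illustrate a missing pattern and a missing action, respectively:

shell-prompt: awk '{ print $1 }' data.txt   # No pattern: action applied to every line
shell-prompt: awk '$2 > 100' data.txt       # No action: matching lines printed whole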
It is very common to use awk "one-liners" on the command-line, without actually creating an awk script file. In this case, the awk script is the first argument to awk, usually enclosed in quotes to allow for white space and special characters. The second argument is the input file to be processed by the script.
For example, the file /etc/passwd contains colon-separated fields including the username ($1), user ID ($3), primary group ID ($4), full name ($5), home directory ($6), and the user's shell program ($7). To see the full name on every line, we could use the following simple command, which has no pattern (so it processes every line) and an action of printing the fifth field:
shell-prompt: awk -F : '{ print $5 }' /etc/passwd
Jason Bacon
D-BUS Daemon User
TCG Software Stack user
Avahi Daemon User
...
To see a list of usernames and shells:
shell-prompt: awk -F : '{ print $1, $7 }' /etc/passwd
bacon /bin/tcsh
messagebus /usr/sbin/nologin
_tss /usr/sbin/nologin
avahi /usr/sbin/nologin
...
Many data files used in research computing are tabular, with one of the most popular formats being TSV (tab-separated value) files. The General Feature Format, or GFF, file is a TSV file format for describing features of a genome. The first field contains the sequence ID (such as a chromosome number) on which the feature resides. The third field contains the feature type, such as "gene" or "exon". The fourth and fifth fields contain the starting and ending positions within the sequence. The ninth field contains "attributes", such as the globally unique feature ID and possibly the feature name and other information, separated by semicolons. If we just want to see the locations, IDs, and names of all the genes in a genome, we could use the following:
shell-prompt: awk '$3 == "gene" { print $1, $4, $5, $9 }' file.gff3
1 3073253 3074322 ID=gene:ENSMUSG00000102693;Name=4933401J01Rik
1 3205901 3671498 ID=gene:ENSMUSG00000051951;Name=Xkr4
...
Awk uses largely the same comparison operators as C and similar languages. One additional awk operator that is often useful is ~, which means "contains".
# Locate all features whose type contains "RNA".  In a typical GFF3 file,
# this could include mRNA, miRNA, ncRNA, etc.
shell-prompt: awk '$3 ~ "RNA" { print $1, $4, $5, $9 }' file.gff3
1 3073253 3074322 ID=gene:ENSMUSG00000102693;Name=4933401J01Rik
1 3205901 3671498 ID=gene:ENSMUSG00000051951;Name=Xkr4
...
Suppose we want to extract specific attributes from the semicolon-separated attributes field, such as the gene ID and gene name, as well as count the number of genes in the input. This will require a few more awk features.
The gene ID is always the first attribute in the field, assuming the feature is a gene. Not every gene has a name, so we will need to scan the attributes for this information. Awk makes this easy. We can break the attributes field into an array of strings using the split() function. We can then use a loop to search the attributes for one beginning with "Name=".
To count the genes in the input, we need to initialize a count variable before we begin processing the file, increment it for each gene found, and print it after processing is finished. For this we can use the special patterns BEGIN and END, which allow us to run an action before and after processing the input.
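As a minimal sketch of BEGIN and END, the following one-liner counts the lines in any text file:

shell-prompt: awk 'BEGIN { n = 0 } { ++n } END { print n, "lines" }' file.txt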
We will use the C-like printf() function to format the output. The basic print statement always adds a newline, so it does not allow us to print part of a line and finish it with a subsequent print statement.
Since this is a multiline script, we will save it in a file called gene-info.awk and run it using the -f flag, which tells awk to get the script from a file rather than the command-line.
shell-prompt: awk -f gene-info.awk file.gff3
BEGIN {
    gene_count = 0;
}

$3 == "gene" {
    # Separate attributes into an array
    split($9, attributes, ";");

    # Print location and feature ID
    printf("%s %s %s %s", $1, $4, $5, attributes[1]);

    # Look for a name attribute and print it if it exists
    # With the for-in loop, c gets the SUBSCRIPT of each element in the
    # attributes array
    for ( c in attributes )
    {
        # See if first 5 characters of the attribute are "Name="
        if ( substr(attributes[c], 1, 5) == "Name=" )
            printf(" %s", attributes[c]);
    }

    # Terminate the output line
    printf("\n");

    # Count this gene
    ++gene_count;
}

END {
    printf("\nGenes found = %d\n", gene_count);
}
As we can see, we can do some fairly sophisticated data processing with a very short awk script. There is very little that awk cannot do conveniently with tabular data. If a particular task seems like it will be difficult to do with awk, don't give up too easily. Chances are, with a little thought and effort, you can come up with an elegant awk script to get the job done.
That said, there are always other options for processing tabular data. Perl is a scripting language especially well suited to text processing, with its powerful RE handling capabilities and numerous features. Python has also become popular for such tasks in recent years.
Awk is highly efficient, and processing steps performed with it are rarely a bottleneck in an analysis pipeline. If you do need better performance than awk provides, there are C libraries that can be used to easily parse tabular data, such as libxtend. Libxtend includes a set of DSV (delimiter-separated-value) processing functions that make it easy to read fields from files in formats like TSV and CSV. Once you have read a line or an individual field using libxtend's DSV functions, you have the full power and performance of C at your disposal to process it in minimal time.
Full coverage of awk's capabilities is far beyond the scope of this text. Readers are encouraged to explore it further via the awk man page and one of the many books available on the language.
Example 3.20. Practice Break
shell-prompt: awk -F : '{ print $1 }' /etc/passwd
shell-prompt: awk -F : '$1 == "root" { print $0 }' /etc/passwd
The cut command selects columns from a file, either by byte position, by character position, or, like awk, by delimiter-separated fields. Note that characters in the modern world may be more than one byte, so bytes and characters are distinguished here.
To extract columns by byte or character position, we use the -b or -c option followed by a list of positions. The list is comma-separated and may contain individual positions or ranges denoted with a '-'. For example, to extract character positions 1 through 10 and 21 through 26 from every line of file.txt, we could use the following:

shell-prompt: cut -c 1-10,21-26 file.txt
For delimiter-separated columns, we use -d to indicate the delimiter. The default is a tab character alone, not just any white space. The -w flag tells cut to accept any white space (tab or space) as the delimiter. The -f flag is then used to indicate the fields to extract, much like -c is used for character positions. Output is separated by the same delimiter as the input.
For example, to extract the username, userid, groupid, and full name (fields 1, 3, 4, and 5) from /etc/passwd, we could use the following:
shell-prompt: cut -d : -f 1,3-5 /etc/passwd
...
ganglia:102:102:Ganglia User
nagios:181:181:Nagios pseudo-user
webcamd:145:145:Webcamd user
The above is equivalent to the following awk command:
shell-prompt: awk -F : '{ printf("%s:%s:%s:%s\n", $1, $3, $4, $5); }' /etc/passwd
The sed command is a stream editor. It makes changes to a file stream with no interaction from the user. It is probably most often used to make simple text substitutions, though it can also do insertions and deletions of lines and parts of lines, even selecting lines by number or based on pattern matching much like grep and awk. A basic substitution command takes the following format:
sed -e 's|pattern|replacement|g' input-file
Pattern is any regular expression, like those used in grep or awk. Replacement can be a fixed string, but also takes some special characters, such as &, which represents the string matched by pattern. It can also be empty if you simply want to remove occurrences of pattern from the text.
The characters enclosing pattern and replacement are arbitrary. The '|' character is often used because it stands out among most other characters. If either pattern or replacement contains a '|', simply use a different separator, such as '/'. The 'g' after the pattern means "global". Without it, sed will only replace the first occurrence of pattern in each line. With it, all matches are replaced.
shell-prompt: cat fox.txt
The quick brown fox jumped over the lazy dog.
shell-prompt: sed -e 's|fox|worm|g' fox.txt
The quick brown worm jumped over the lazy dog.
shell-prompt: sed -e 's/brown //g' -e 's|fox|&y worm|g' fox.txt
The quick foxy worm jumped over the lazy dog.
Using -E in place of -e causes sed to support extended regular expressions.
By default, sed sends output to the standard output stream. The -i flag tells sed to edit the file in-place, i.e. replace the original file with the edited text. This flag should be followed by a filename extension, such as ".bak". The original file will then be saved to filename.bak, so that you can reverse the changes if you make a mistake. The extension can be an empty string ('') if you are sure you don't need a backup of the original.
There is a rare portability issue with sed. GNU sed requires that the extension be nestled against the -i:
shell-prompt: sed -i.bak -e 's|pattern|replacement|g' file.txt
Some other implementations require a space between the -i and the extension, which is more orthodox among Unix commands:
shell-prompt: sed -i .bak -e 's|pattern|replacement|g' file.txt
FreeBSD's sed accepts either form. You must be aware of this in order to ensure that scripts using sed are portable. The safest approach is not to use the -i flag, but simply save the output to a temporary file and then move it:
shell-prompt: sed -e 's|pattern|replacement|g' file.txt > file.txt.tmp
shell-prompt: mv file.txt.tmp file.txt
This way, it won't matter which implementation of sed is present when someone runs your script.
Sed is a powerful and complex tool whose full capabilities are beyond the scope of this text. Readers are encouraged to consult books and other documentation to explore further.
Example 3.22. Practice Break
shell-prompt: printf "The quick brown fox jumped over the lazy dog.\n" > fox.txt
shell-prompt: cat fox.txt
shell-prompt: sed -e 's|fox|worm|g' fox.txt
shell-prompt: sed -e 's/brown //g' -e 's|fox|&y worm|g' fox.txt
The sort command sorts text data line by line according to one or more keys. A key indicates a field (usually a column separated by white space or some other delimiter) and the type of comparison, such as lexical (like alphabetical, but including non-letters) or numeric.
If no keys are specified, sort compares entire lines lexically.
The --key flag followed by a field number restricts comparison to that field. Fields are numbered starting with 1. This can be used in conjunction with the --field-separator flag to specify a separator other than the default white space. The --numeric-sort flag must be used to perform integer comparison rather than lexical. The --general-numeric-sort flag must be used to compare real numbers.
shell-prompt: cat ages.txt
Bob Vila 23
Joe Piscopo 27
Al Gore 19
Ingrid Bergman 26
Mohammad Ali 22
Ram Das 9
Joe Montana 25

shell-prompt: sort ages.txt
Al Gore 19
Bob Vila 23
Ingrid Bergman 26
Joe Montana 25
Joe Piscopo 27
Mohammad Ali 22
Ram Das 9

shell-prompt: sort --key 2 ages.txt
Mohammad Ali 22
Ingrid Bergman 26
Ram Das 9
Al Gore 19
Joe Montana 25
Joe Piscopo 27
Bob Vila 23

shell-prompt: sort --key 3 --numeric-sort ages.txt
Ram Das 9
Al Gore 19
Mohammad Ali 22
Bob Vila 23
Joe Montana 25
Ingrid Bergman 26
Joe Piscopo 27
The sort command can process files of any size, regardless of available memory. If a file is too large to fit in memory, it is broken into smaller pieces, which are sorted separately and saved to temporary files. The sorted temporary files are then merged.
The uniq command, which removes adjacent lines that are identical, is often used after sorting to remove redundancy from data. Note that the sort command also has a --unique flag, but it does not behave the same as the uniq command. The --unique flag compares keys, while the uniq command compares entire lines.
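The difference matters when lines share a key but differ elsewhere. Assuming the ages.txt file above, the following commands sketch the distinction: the first removes only fully identical lines, while the second keeps just one line for each distinct value of the first field:

shell-prompt: sort ages.txt | uniq              # Compares entire lines
shell-prompt: sort --key 1,1 --unique ages.txt  # Compares only field 1
shell-prompt: sort ages.txt | uniq -c           # Count occurrences of each line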
Example 3.23. Practice Break
Using your favorite text editor, enter a few names from the example above into a file called ages.txt.
shell-prompt: cat ages.txt
shell-prompt: sort ages.txt
shell-prompt: sort --key 2 ages.txt
shell-prompt: sort --key 3 --numeric-sort ages.txt
shell-prompt: du -sm * | sort -n    # Determine biggest directories
The tr (translate) command is a simple tool for performing character conversions and deletions in a text stream. A few examples are shown below. See the tr man page for details.
We can use it to convert individual characters in a text stream. In this case, it takes two string arguments. Characters in the Nth position in the first string are replaced by characters in the Nth position in the second string:
shell-prompt: cat fox.txt
The quick brown fox jumped over the lazy dog.
shell-prompt: tr 'xl' 'gh' < fox.txt
The quick brown fog jumped over the hazy dog.
There is limited support for character sets enclosed in square brackets [], similar to regular expressions, including predefined sets such as [:lower:] and [:upper:]:
shell-prompt: tr '[:lower:]' '[:upper:]' < fox.txt
THE QUICK BROWN FOX JUMPED OVER THE LAZY DOG.
We can use it to "squeeze" repeated characters down to one in a text stream. This is useful for compressing white space:
shell-prompt: tr -s ' ' < fox.txt
The quick brown fox jumped over the lazy dog.
The tr command does not support doing multiple conversions in the same command, but we can use it as a filter:
shell-prompt: tr '[:lower:]' '[:upper:]' < fox.txt | tr -s ' '
THE QUICK BROWN FOX JUMPED OVER THE LAZY DOG.
There is some overlap between the capabilities of tr, sed, awk, and other tools. Which one you choose for a given task is a matter of convenience.
Example 3.24. Practice Break
shell-prompt: printf "The quick brown fox jumped over the lazy dog.\n" > fox.txt
shell-prompt: cat fox.txt
shell-prompt: tr '[:lower:]' '[:upper:]' < fox.txt | tr -s ' '
The find command is a powerful tool for not only locating path names in a directory tree, but also for taking any desired action when a path name is found.
Unlike popular search utilities in macOS and Windows, and the Unix locate command, find does not use a previously constructed index of the file system, but searches the file system in its current state. Indexed search utilities very quickly produce results from a recent snapshot of the file system, which is rebuilt periodically by a scheduled job. This is much faster than an exhaustive search, but will miss files that were added since the last index build. The find command will take longer to search a large directory tree, but also guarantees accurate results.
The basic format of a find command is as follows:
shell-prompt: find top-directory search-criteria [optional-action \;]
The search-criteria can be any attribute of a file or other path name. To match by name, we use -name followed by a globbing pattern, in quotes to prevent the shell from expanding it before passing it to find. To search for files owned by a particular user or group, we can use -user or -group. We can also search for files with certain permissions, a minimum or maximum age, and many other criteria. The man page provides all of these details.
The default action is to print the relative path name of each match. For example, to list all the configuration files under /etc, we could use the following:
shell-prompt: find /etc -name '*.conf'
We can run any Unix command in response to each match using the -exec flag followed by the command and a ';' or '+'. The ';' must be escaped or quoted to prevent the shell from using it as a command separator and treating everything after it as a new command, separate from the find command. The name of the matched path is represented by '{}'.
shell-prompt: find /etc -name '*.conf' -exec ls -l '{}' \;
With a ';' terminating the command, the command is executed immediately after each match. This may be necessary in some situations, but it entails a great deal of overhead from running the same command many times. Replacing the ';' with a '+' tells find to accumulate as many path names as possible and pass them all to one invocation of the command. This means the command could receive thousands of path names as arguments and will be executed far fewer times.
shell-prompt: find /etc -name '*.conf' -exec ls -l '{}' +
There are also some predefined actions we can use instead of spelling out an -exec, such as -print, which is the default action, and -ls, which is equivalent to -exec ls -l '{}' +. The -print action is useful for showing path names being processed by another action:
shell-prompt: find Data -name '*.bak' -print -exec rm '{}' +
Sometimes we may want to execute more than one command for each path matched. Rather than construct a complex and messy -exec, we may prefer to write a shell script containing the commands and run the script using -exec. Scripting is covered in Chapter 4, Unix Shell Scripting.
As stated earlier, most Unix commands that accept a file name as an argument will accept any number of file names. When processing 100 files with the same program, it is usually more efficient to run one process with 100 file name arguments than to run 100 processes with one argument each.
However, there is a limit to how long Unix commands can be. When processing many thousands of files, it may not be possible to run a single command with all of the filenames as arguments. The xargs command solves this problem by reading a list of file names from the standard input (which has no limit) and feeding them to another command as arguments, providing as many arguments as possible to each process created.
The arguments processed by xargs do not have to be file names, but usually are. The main trick is generating the list of file names. Suppose we want to change all occurrences of "fox" to "toad" in the files input*.txt in the CWD. Our first thought might be a simple command:
shell-prompt: sed -i '' -e 's|fox|toad|g' input*.txt
If there are too many files matching "input*.txt", we will get an error such as "Argument list too long". One might think to solve this problem using xargs as follows:
shell-prompt: ls input*.txt | xargs sed -i '' -e 's|fox|toad|g'
However, this won't work either, because the shell hits the same argument list limit for the ls command as it does for the sed command.
The find command can come to the rescue:
shell-prompt: find . -name 'input*.txt' | xargs sed -i '' -e 's|fox|toad|g'
Since the shell is not trying to expand 'input*.txt' to an argument list, but instead passing the literal string 'input*.txt' to find, there is no limit on how many file names can be matched. The find command is sophisticated enough to work around the limits of argument lists.
The find command above sends to xargs the relative path names of every file matching 'input*.txt' in and under the CWD. If we don't want to process files in subdirectories of the CWD, we can limit the search to one directory level:
shell-prompt: find . -maxdepth 1 -name 'input*.txt' \
    | xargs sed -i '' -e 's|fox|toad|g'
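Alternatively, we can avoid xargs entirely by using find's own -exec action with a '+' terminator, as described earlier, which also works around argument list limits:

shell-prompt: find . -name 'input*.txt' -exec sed -i '' -e 's|fox|toad|g' '{}' +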
The xargs command places the arguments read from the standard input after any arguments included with the command. So the commands run by xargs will have the form
sed -i '' -e 's|fox|toad|g' input1.txt input2.txt input3.txt ...
Some xargs implementations have an option for placing the arguments from the standard input before the fixed arguments, but this is still limited. There may be cases where we want the arguments intermingled. The most portable and flexible solution to this is writing a simple script that takes all the arguments from xargs last, and constructs the appropriate command with the arguments in the correct order. Scripting is covered in Chapter 4, Unix Shell Scripting.
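A minimal sketch of such a wrapper, assuming a hypothetical analyze command that requires a fixed --output argument after the input file names (scripting details are covered in Chapter 4):

#!/bin/sh -e

# run-analyze.sh: file names arriving from xargs become "$@",
# which we place ahead of the fixed trailing arguments
exec analyze "$@" --output results.txt

shell-prompt: find . -name 'input*.txt' | xargs ./run-analyze.sh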
Most xargs implementations also support running multiple processes at the same time. This provides a convenient way to utilize multiple cores to parallelize processing. If you have a computer with 16 cores and speeding up your analysis by a factor of nearly 16 is good enough, then this can be a very valuable alternative to using an HPC cluster. If you need access to hundreds of cores to get your work done in a reasonable time, then a cluster is a better option.
shell-prompt: find . -name '*.txt' \
    | xargs -P 8 sed -i '' -e 's|fox|toad|g'
A value of 0 following -P tells xargs to detect the number of available cores and use all of them. Some, but not all, xargs implementations support --max-procs in place of -P. While using long options is more readable, it is not portable in this instance.
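For example, assuming many compressible log files, the following sketch compresses them using all available cores. Note that a per-process argument limit such as -n is needed here, since xargs might otherwise pass every file name to a single process, leaving nothing to run in parallel:

shell-prompt: find . -name '*.log' | xargs -n 4 -P 0 gzip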
There is a more sophisticated open source program called GNU parallel that can run commands in parallel in a similar way, but with more flexibility. It can be installed via most package managers. See the section called “GNU Parallel” for an introduction.
The bc (binary calculator) command is an unlimited range and precision calculator with a scripting language very similar to C. When invoked with -l or --mathlib, it includes numerous additional functions, including l(x) (natural log), e(x) (exponential), s(x) (sine), c(x) (cosine), and a(x) (arctangent). There are numerous standard functions available even without --mathlib. See the man page for a full list.
By default, bc prints the result of each expression evaluated followed by a newline. There is also a print statement that does not print a newline. This allows a line of output to be constructed from multiple expressions, the last of which includes a literal "\n".
shell-prompt: bc --mathlib
sqrt(2)
1.41421356237309504880
print sqrt(2), "\n"
1.41421356237309504880
e(1)
2.71828182845904523536
x=10
5 * x^2 + 2 * x + 1
521
quit
Bc is especially useful for quick computations where extreme range or precision is required, and for checking the results from more traditional languages that lack such range and precision. For example, consider the computation of factorials. N factorial, denoted N!, is the product of all integers from one to N. The factorial function grows so quickly that 21! exceeds the range of a 64-bit unsigned integer, the largest integer type supported by most CPUs and most common languages. The C program and output below demonstrate the limitations of 64-bit integers.
#include <stdio.h>
#include <sysexits.h>

int     main(int argc, char *argv[])

{
    unsigned long   c, fact = 1;

    for (c = 1; c <= 25; ++c)
    {
        fact *= c;
        printf("%lu! = %lu\n", c, fact);
    }
    return EX_OK;
}
1! = 1
2! = 2
3! = 6
4! = 24
5! = 120
6! = 720
7! = 5040
8! = 40320
9! = 362880
10! = 3628800
11! = 39916800
12! = 479001600
13! = 6227020800
14! = 87178291200
15! = 1307674368000
16! = 20922789888000
17! = 355687428096000
18! = 6402373705728000
19! = 121645100408832000
20! = 2432902008176640000
21! = 14197454024290336768    This does not equal 20! * 21
22! = 17196083355034583040
23! = 8128291617894825984
24! = 10611558092380307456
25! = 7034535277573963776
At 21!, an integer overflow occurs. In the limited integer systems used by computers, adding 1 to the largest possible value produces a result of 0. The integer number sets used by computers are called modular number systems and are actually circular. The limitations of computer number systems are covered in Chapter 15, Data Representation.
In contrast, bc can compute factorials of any size, limited only by the amount of memory needed to store all the digits. It is, of course, much slower than C, both because it is an interpreted language and because it performs multiple precision arithmetic, which requires multiple machine instructions for every math operation. However, it is more than fast enough for many purposes and the easiest way to do math that is beyond the capabilities of common languages.
The bc script below demonstrates the superior range of bc. The first line (#!/usr/bin/bc -l) tells the Unix shell how to run the script, so we can run it by simply typing its name, such as ./fact.bc. This will be covered in Chapter 4, Unix Shell Scripting. For now, create the script using nano fact.bc and run it with bc < fact.bc.
#!/usr/bin/bc -l

fact = 1;
for (c = 1; c <= 100; ++c)
{
    fact *= c;
    print c, "!= ", fact, "\n";
}
quit
1!= 1
2!= 2
3!= 6
4!= 24
5!= 120
6!= 720
7!= 5040
8!= 40320
9!= 362880
10!= 3628800
11!= 39916800
12!= 479001600
13!= 6227020800
14!= 87178291200
15!= 1307674368000
16!= 20922789888000
17!= 355687428096000
18!= 6402373705728000
19!= 121645100408832000
20!= 2432902008176640000
21!= 51090942171709440000
22!= 1124000727777607680000
23!= 25852016738884976640000
24!= 620448401733239439360000
25!= 15511210043330985984000000
[ Output removed for brevity ]
100!= 93326215443944152681699238856266700490715968264381621468592963\
89521759999322991560894146397615651828625369792082722375825118521091\
6864000000000000000000000000
Someone with a little knowledge of computer number systems might think that we can get around the range problem in general purpose languages like C by using floating point rather than integers. This will not work, however. While a 64-bit floating point number has a much greater range than a 64-bit integer (up to about 10^308, vs about 10^19 for integers), floating point actually has less precision. It sacrifices some precision in order to achieve the greater range. The modified C code and output below show that the double (64-bit floating point) type in C only gets us to 22!, and round-off error corrupts 23! and beyond.
#include <stdio.h>
#include <sysexits.h>

int     main(int argc, char *argv[])

{
    double  c, fact = 1;

    for (c = 1; c <= 25; ++c)
    {
        fact *= c;
        printf("%0.0f! = %0.0f\n", c, fact);
    }
    return EX_OK;
}
1! = 1
2! = 2
3! = 6
4! = 24
5! = 120
6! = 720
7! = 5040
8! = 40320
9! = 362880
10! = 3628800
11! = 39916800
12! = 479001600
13! = 6227020800
14! = 87178291200
15! = 1307674368000
16! = 20922789888000
17! = 355687428096000
18! = 6402373705728000
19! = 121645100408832000
20! = 2432902008176640000
21! = 51090942171709440000
22! = 1124000727777607680000
23! = 25852016738884978212864
24! = 620448401733239409999872
25! = 15511210043330986055303168
The tar command, short for TApe ARchive, is a tool for combining multiple files into one. Recall that Unix incorporates the idea of device independence, where an input/output device is treated exactly like an ordinary file. Originally, tar was meant to write the archive to a tape device, such as /dev/tape. This was a way to create backups of important files on removable tapes in case of a disk failure or other mishap. Thanks to device independence, we can substitute any other device or ordinary file for /dev/tape. In modern times, backups are more often done over high-speed networks to sophisticated backup systems, and tar is more often used to create tarballs, ordinary files containing archives for sharing whole directories. Most open source software is downloaded as a single tarball and unpacked on the local system.
The basic command template for creating a tarball is as follows:
shell-prompt: tar -cvf archive.tar path [path ...]
Archiving files this way has many potential advantages. It saves disk space, since each file has on average 1/2 of a disk block unused. Files can only allocate whole blocks and almost never have a size that is an exact multiple of the block size. Replacing many files with one archive reduces the size of the directory containing the files. Processing many small files (moving, transferring to another computer over a network, etc.) takes longer than processing one large file, since there is overhead for opening each file.
The -c flag means "Create". The -v means "Verbose" (echo each file name as it is added). The -f means "File name". If not provided, the default is the first tape device in /dev. The "path" arguments name files or directories to archive. We can specify any number of files and directories, but the file name of the archive must come immediately after the -f flag.
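For example, assuming a directory named Research and a file named notes.txt in the CWD, the following archives both:

shell-prompt: tar -cvf backup.tar Research notes.txt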
The tar command is one of the commands that predates the convention of using a '-' to indicate flags. Hence, you may see examples on the web such as:
tar cvf file.tar directory
To unpack a tarball, we use the -x flag, which means "eXtract".
shell-prompt: tar -xvf archive.tar
We can list the contents of a tarball using -t.
shell-prompt: tar -tf archive.tar
Example 3.28. Practice Break
shell-prompt: cd
shell-prompt: mkdir Tempdir
shell-prompt: touch Tempdir/temp1
shell-prompt: touch Tempdir/temp2
shell-prompt: tar -cvf tempdir.tar Tempdir
shell-prompt: tar -tf tempdir.tar
shell-prompt: rm -rf Tempdir
shell-prompt: tar -xvf tempdir.tar
shell-prompt: ls Tempdir
The gzip (GNU zip), bzip2 (Burrows-Wheeler zip), and xz (LZMA zip) commands compress files in order to save disk space. In the most basic use, we run the command with a single file argument:
shell-prompt: gzip file
shell-prompt: bzip2 file
shell-prompt: xz file
This will produce a compressed output file with a ".gz", ".bz2", or ".xz" extension. The original file is automatically removed after the compressed file is successfully created.
The compressed files can be decompressed using companion commands to restore the original file. Compression is lossless (unlike JPEG), so the restored file will be identical to the original.
shell-prompt: gunzip file.gz
shell-prompt: bunzip2 file.bz2
shell-prompt: unxz file.xz
All three commands can be used as filters to directly compress output from another program:
shell-prompt: myanalysis | gzip > output.gz
shell-prompt: myanalysis | bzip2 > output.bz2
shell-prompt: myanalysis | xz > output.xz
Likewise, the decompression tools can send decompressed output to another program via a pipe. They also include analogs to the cat command for better readability:
shell-prompt: gunzip -c output.gz | more
shell-prompt: bunzip2 -c output.bz2 | more
shell-prompt: bzcat output.bz2 | more
shell-prompt: unxz -c output.xz | more
shell-prompt: xzcat output.xz | more
For historical reasons, the portable command for viewing gzipped files is zcat, not gzcat. However, as of this writing, zcat on macOS looks for a ".Z" extension (from the outdated compress command), and only gzcat works with ".gz" files. Hence, gunzip -c is the most portable approach.
The choice between them is a matter of speed vs compression ratio. Gzip is generally the fastest, but achieves the least compression. Xz produces the best compression, but at a high cost in CPU time. Bzip2 produces intermediate compression and is also CPU-intensive. All three compression tools allow the user to control the compression level in order to trade speed for compression. Lower values use less CPU time, but do not compress as well.
shell-prompt: myanalysis | xz -3 > output.xz
If a program produces high-volume output (more than a few megabytes per second), some compression tools may not be able to keep up. You may want to use gzip and/or lower the compression level in these cases.
When archiving data for long-term storage, on the other hand, you will generally want the best possible compression and should not be too concerned about how long it takes. There are numerous websites containing benchmark data comparing the run time and compression of these tools with various compression levels. Such data will not be included in this guide as it is dated: it will change as the tools are continually improved.
Decompression is generally much faster than compression. While xz with medium to high compression levels requires a great deal of CPU time, unxz can decompress the data very quickly. Hence, if files need only be compressed once, but read many times, xz may be a good choice.
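As a sketch, assuming a tarball destined for long-term storage and a high-volume output stream, we might choose opposite ends of the compression level scale:

shell-prompt: xz -9 archive.tar                 # Best compression, slowest
shell-prompt: myanalysis | gzip -1 > output.gz  # Fastest, least compression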
All three tools are integrated with tar in order to produce compressed tarballs. This can be done with a pipe by specifying "-" as the filename following -f, or by using -z (--gzip, --gunzip), -j (--bzip2, --bunzip2), or -J (--xz) with the tar command. The conventional file name extensions are ".tar.gz" or ".tgz" for gzip, ".tar.bz2" or ".tbz" for bzip2, and ".tar.xz" or ".txz" for xz.
shell-prompt: tar -cvf - Tempdir | gzip > tempdir.tgz
shell-prompt: tar -zcvf tempdir.tgz Tempdir
shell-prompt: tar -cvf - Tempdir | bzip2 > tempdir.tbz
shell-prompt: tar -jcvf tempdir.tbz Tempdir
shell-prompt: tar -cvf - Tempdir | xz > tempdir.txz
shell-prompt: tar -Jcvf tempdir.txz Tempdir
Example 3.29. Practice Break
shell-prompt: cat | xz > test.xz
Type in some text, then press Ctrl+d.
shell-prompt: xzcat test.xz
shell-prompt: tar -Jcvf tempdir.txz Tempdir
Zip is both an archiver and a compression tool in one. It was originally developed for MS-DOS in 1989 by Phil Katz, co-founder of PKWARE, Inc. in Milwaukee, WI. The zip format has become the standard for many other Windows-based archive tools. The compression algorithms have evolved significantly since the original PKZIP.
The zip and unzip commands are open source tools for creating and extracting .zip files. They are primarily for interoperability with Windows file archives and far less popular than tarballs compressed with gzip, bzip2, and xz.
The time command runs another command under its supervision and measures wall time, user time, and system time. Wall time, also known as real time, is the time that elapses in the real world while a program is running. The term was coined at a time when most people had clocks on their walls, rather than relying on a smart phone. User time is the time spent using a core. If a program uses only one core (logical CPU), user time is at most equal to wall time. If it uses more than one core, user time can exceed wall time. System time is the time spent by the operating system performing tasks on behalf of the process. Hence, total CPU time is user time + system time.
The time command is used by simply prefixing any other Unix command with "time ". Some shells have an internal time command, which presents output in a different format than the external time command normally found in /usr/bin. The T shell internal time command also reports percent of CPU time used. Low CPU utilization generally indicates that the process was I/O-bound, i.e. it spent a lot of time waiting for disk or other input/output transactions and therefore was not utilizing the CPU. Also reported are memory use in kibibytes, a count of I/O operations, and page faults (where memory blocks are swapped to or from disk due to memory being full).
shell-prompt: time find /usr/local/lib > /dev/null
0.055u 0.094s 0:00.15 93.3%     43+179k 0+0io 0pf+0w

shell-prompt: /usr/bin/time find /usr/local/lib > /dev/null
        0.14 real         0.04 user         0.09 sys
Reported times will vary, usually by a fraction of a second, due to the limited precision of measurement and other factors. Results are usually fairly consistent for programs that use at least a few seconds of CPU time.
The top command displays real-time information about currently running processes, sorted in order of resource use. It does not show information about all processes, but only the top resource users. Snapshots are reported every two seconds by default.
At the top of the screen is a summary of the system state, including load averages (the average number of processes ready to run; values near the number of cores indicate full CPU utilization), total processes running and sleeping (waiting for input/output), and a summary of memory (RAM and swap) use. Swap is an area of disk used to extend the amount of memory apparent to processes. Processes see the virtual memory size, which is RAM (electronic memory) + swap.
Below the system summary is information about the most active and resource-intensive processes currently running. Columns in the example below are summarized in Table 3.13, “Column headers of top command”.
Table 3.13. Column headers of top command
Tag      | Meaning
---------|--------
PID      | The process ID
USERNAME | User owning the process
THR      | Number of threads (cores used)
PRI      | CPU scheduling priority
NICE     | Nice value: limits scheduling priority
SIZE     | Virtual memory allocated
RES      | Resident memory: actual RAM (not swap) used
STATE    | State of the process at the moment of the last snapshot, such as running (using a core), waiting for I/O, select (waiting on any of multiple devices), pipdwt (writing to a pipe), nanslp (sleeping for nanoseconds), etc.
C        | Last core on which it ran
TIME     | CPU time accumulated so far
WCPU     | Weighted CPU % currently in use
COMMAND  | Command executed, usually truncated
Different operating systems will display slightly different information. There are many command-line flags to alter behavior, and behavior can be adjusted while running. Press 'h' for a help menu to see the options for altering output.
last pid: 70340;  load averages:  0.67,  0.34,  0.35;  up 3+03:11:57  08:57:32
61 processes: 3 running, 58 sleeping
CPU: 40.6% user,  0.0% nice,  2.2% system,  0.0% interrupt, 57.2% idle
Mem: 145M Active, 1871M Inact, 166M Laundry, 1210M Wired, 648M Buf, 4247M Free
Swap: 3852M Total, 3852M Free

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
70338 bacon         1  79    0    13M  2160K CPU2     2   0:03  72.65% fastq-tr
70340 bacon         1  79    0    13M  3056K CPU1     1   0:03  72.15% gzip
70339 bacon         1  44    0    13M  2856K pipdwt   3   0:01  28.68% gunzip
69958 bacon         3  20    0   237M    92M select   2   0:02   0.54% coreterm
 9690 root          5  20    0   144M    80M select   0   5:08   0.23% Xorg
 9719 bacon         4  20    0   340M   132M select   1   3:42   0.12% lumina-d
70332 bacon         1  20    0    14M  3668K CPU0     0   0:00   0.05% top
 1644 root          1  20    0    13M  1656K select   0   0:58   0.01% powerd
27489 root         14 -44   r8    20M  7576K cuse-s   1   0:01   0.00% webcamd
 9756 bacon         1  20    0    51M    24M select   0   0:03   0.00% python3.
 1666 root          1  20    0    13M  1748K select   3   4:29   0.00% moused
 9716 bacon         1  20    0    27M    11M select   1   0:13   0.00% fluxbox
 1756 root          1  20    0    18M  3400K select   0   0:04   0.00% sendmail
 1315 root          1  20    0    11M  1020K select   2   0:02   0.00% devd
 9744 bacon         3  20    0   153M    48M select   2   0:02   0.00% python3.
 1495 root          1  20    0    13M  2100K select   1   0:02   0.00% syslogd
 1641 root          1  20    0    13M  1984K wait     1   0:01   0.00% sh
24775 bacon         4  20    0    34M  7220K select   3   0:01   0.00% at-spi2-
 9711 bacon         3  20    0    94M    21M select   2   0:01   0.00% start-lu
 1725 root          1  20    0    13M  1992K nanslp   3   0:01   0.00% cron
 1615 messagebus    1  20    0    14M  2860K select   0   0:01   0.00% dbus-dae
 1639 ntpd          1  20    0    21M  3308K select   3   0:01   0.00% ntpd
Example 3.31. Practice Break
Run top, press 'h' to see the help screen, and press 'n' followed by '5' to make the screen less noisy.
The iostat command displays information about disk activity and possibly other status information, depending on the flags used. Unfortunately, iostat is one of the rare commands that is not well-standardized across Unix systems. Check the man page on your system for details on all the flags. Here we show basic use for monitoring disk activity similarly to how we monitor CPU and memory use with top.
Low CPU utilization in top often indicates that a process is I/O-bound (e.g. spending a great deal of time waiting for disk operations). Processes go to sleep and do not use the CPU while waiting for disk and other input/output. To help verify this, we can check the STATE column in top as well. If it shows a state such as "wait", "select", or "pipe", then the process is waiting for I/O. Lastly, we can use iostat to see exactly how busy the disks are. This tells us nothing about a specific process, but we can generally deduce which processes are causing high disk activity.
The FreeBSD iostat offers concise output on a single line, including the rates of tty (terminal) and disk throughput, and some CPU stats similar to top. We can request an update every N seconds by specifying -w N or simply N. The header is kindly reprinted when it is scrolled off the terminal.
FreeBSD shell-prompt: iostat 1
       tty            ada0             cd0            pass0             cpu
 tin  tout KB/t  tps  MB/s  KB/t  tps  MB/s  KB/t  tps  MB/s  us ni sy in id
   4   583 47.0    5   0.2   0.0    0   0.0   0.0    0   0.0   5  0  1  0 95
   1   537 1024   18  18.0   0.0    0   0.0   0.0    0   0.0  45  0  1  0 54
[snip]
   0   733  988   18  17.4   0.0    0   0.0   0.0    0   0.0  42  0  2  0 56
   0   295 1024   18  18.0   0.0    0   0.0   0.0    0   0.0  42  0  2  0 55
       tty            ada0             cd0            pass0             cpu
 tin  tout KB/t  tps  MB/s  KB/t  tps  MB/s  KB/t  tps  MB/s  us ni sy in id
   0   300  927   21  19.0   0.0    0   0.0   0.0    0   0.0  45  0  1  0 54
   0   457  536   35  18.3   0.0    0   0.0   0.0    0   0.0  44  0  2  0 54
Apple's iostat is derived from FreeBSD's and has a similar output format and behavior.
macOS shell-prompt: iostat 1
              disk0               cpu    load average
    KB/t  tps  MB/s  us sy id   1m   5m   15m
   13.46    3  0.03   7  5 88  1.15 1.03 1.01
   11.97  289  3.38  40 13 47  1.15 1.03 1.01
    4.00    1  0.00   6  3 91  1.15 1.03 1.01
    0.00    0  0.00   0  2 98  1.15 1.03 1.01
    0.00    0  0.00   1  2 97  1.14 1.03 1.01
    4.25  145  0.60   9  6 85  1.14 1.03 1.01
The Linux iostat has significantly different options and output format. In addition, it may not be present on all Linux systems by default. On RHEL (Red Hat Enterprise Linux), for example, we must install the sysstat package using the yum package manager. The output contains multiple lines for each snapshot, but presents similar information.
RHEL shell-prompt: yum install -y sysstat
RHEL shell-prompt: iostat 1
Linux 4.18.0-372.26.1.el8_6.x86_64 (alma8.localdomain)  10/16/2022  _x86_64_  (4 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.14    0.00    0.30    0.10    0.00   99.46

Device      tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda       11.57       250.38        34.24     443223      60607
scd0       0.01         0.00         0.00          1          0
dm-0      11.53       221.50        33.04     392110      58492
dm-1       0.06         1.25         0.00       2220          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.25    0.00    0.00   99.75

Device      tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda        0.00         0.00         0.00          0          0
scd0       0.00         0.00         0.00          0          0
dm-0       0.00         0.00         0.00          0          0
dm-1       0.00         0.00         0.00          0          0
GNU Parallel is a sophisticated open source tool for running multiple processes simultaneously. Users who do not have access to an HPC cluster for running large parallel jobs can at least utilize all the cores on their laptop or workstation using GNU parallel. GNU Parallel can be installed in seconds using most package managers.
In its simplest form, GNU parallel can be used as a drop-in replacement for xargs:
shell-prompt: find . -name 'input-*.txt' | xargs analyze
shell-prompt: find . -name 'input-*.txt' | parallel analyze
However, GNU parallel has many options for more sophisticated execution. Its numerous use cases and syntax are beyond the scope of this guide; there are many web tutorials and even books about GNU Parallel. If GNU parallel is properly installed on your system (i.e. via a package manager), you can begin by running man parallel_tutorial.
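As a small taste, GNU parallel can also take its arguments directly from the command line using the ::: separator, running one job per argument:

shell-prompt: parallel gzip ::: *.txt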
The GNU parallel tutorial, at the time of this writing, contains some examples of overcomplicating simple tasks, such as the following:
# The tutorial recommends the following to generate sample input files:
shell-prompt: perl -e 'printf "A_B_C_"' > abc_-file
shell-prompt: perl -e 'for(1..1000000){print "$_\n"}' > num1000000

# In reality, perl serves no purpose in either case.
# We can just use the POSIX printf command:
shell-prompt: printf "A_B_C_" > abc_-file
shell-prompt: printf "%s\n" `seq 1 1000000` > num1000000
What is a regular expression? Is it the same as a globbing pattern?
Show a Unix command that shows lines in analysis.c containing hard-coded floating point constants.
How can we speed up grep searches when searching for a fixed string rather than an RE pattern?
How can we use extended REs with grep?
How can we make the matched pattern visible in the grep output?
Describe two major differences between grep and awk.
How does awk compare to spreadsheet programs like LibreOffice Calc and MS Excel?
The /etc/group file contains colon-separated lines in the form groupname:password:groupid:members. Show an awk command that will print the groupid and members of the group "root".
A GFF3 file contains tab-separated lines in the form "seqid source feature-type start end score strand phase attributes". The first attribute for an exon feature is the parent sequence ID. Write an awk script that reports the seqid, start, end, strand, and parent for each feature of type "exon". It should also report the number of exons and the number of genes. To test your script, download Mus_musculus.GRCm39.107.chromosome.1.gff3.gz from ensembl.org and then do the following:
gunzip Mus_musculus.GRCm39.107.chromosome.1.gff3.gz
awk -f your-script.awk Mus_musculus.GRCm39.107.chromosome.1.gff3
Show a cut command roughly equivalent to the following awk command, which processes a tab-separated GFF3 file.
awk '{ print $1, $3, $4, $5 }' file.gff3
Show a sed command that replaces all occurrences of "wolf" with "werewolf" in the file halloween-list.txt.
Show a command to sort the following data by height. Show a separate command to sort by weight. The data are in params.txt.
ID Height Weight
1  34     10
2  40     14
3  29     9
4  28     11
Show a Unix command that reads the file fox.txt, replaces the word "fox" with "toad" and converts all lower case letters to upper case, and stores the output in big-toad.txt.
Show a Unix command that lists and removes all the files whose names end in '.o' in and under ~/Programs.
Why is the xargs command necessary?
Show a Unix command that removes all the files with names ending in ".tmp" only in the CWD, assuming that there are too many of them to provide as arguments to one command. The user should not be prompted for each delete. ( Check the rm man page if needed. )
Show a Unix command that processes all the files named 'input*' in the CWD, using as many cores as possible, through a command such as the following:
analyze --limit 5 input1 input2
What is the most portable and flexible way to use xargs when the arguments it provides to the command must precede some of the fixed arguments?
What is the major advantage of the bc calculator over common programming languages?
Show a bc expression that prints the value of the natural number, e.
Write a bc script that prints the following. Create the script with nano sqrt.bc and run it with bc -l < sqrt.bc.
sqrt(1) = 1.00000000000000000000
sqrt(2) = 1.41421356237309504880
sqrt(3) = 1.73205080756887729352
sqrt(4) = 2.00000000000000000000
sqrt(5) = 2.23606797749978969640
sqrt(6) = 2.44948974278317809819
sqrt(7) = 2.64575131106459059050
sqrt(8) = 2.82842712474619009760
sqrt(9) = 3.00000000000000000000
sqrt(10) = 3.16227766016837933199
What are some advantages of archiving files in a tarball?
Show a Unix command that creates a tarball called research.tar containing all the files in the directory ./Research.
Show a Unix command that saves the output of find /etc to a compressed text file called find-output.txt.bz2.
Show a Unix command for viewing the contents of the compressed text file output.txt.gz, one page at a time.
Show a Unix command that creates a tarball called research.txz containing all the files in the directory ./Research.
What are zip and unzip primarily used for on Unix systems?
Show a Unix command that reports the CPU time used by the command awk -f script.awk input.tsv.
Show a Unix command that will help us determine which processes are using the most CPU time or memory.
How can we find out how to adjust the behavior of top while it is running?
What kind of output from top might suggest that a process is I/O-bound? Why?
Show a Unix command that continuously monitors total disk activity on a Unix system.
How can users who do not have access to an HPC cluster run things in parallel, in a more sophisticated way than possible with standard Unix tools such as xargs?