2.24. Common Unix Tools Used in Scripts

It is often said that most Unix users don't need to write programs. The standard Unix commands contain all the functionality that a typical user needs, so they need only learn how to use the commands and write simple scripts to utilize them.

The sections below introduce some of the popular tools with the sole intention of raising awareness. The details of these tools would fill a separate book by themselves, so we will focus on simple, common examples here.

2.24.1. Grep

The grep command, whose name comes from the ed editor command g/re/p (globally search for a regular expression and print), is a powerful tool for searching the content of text files.

Regular expressions are a standardized syntax for specifying patterns of text. They are similar to the globbing patterns discussed in Section 1.7.5, “Globbing (File Specifications)”, but the details are quite different. Also, while globbing patterns are meant to match file names, regular expressions are meant to match strings in any context.

Some of the more common regular expression features are shown in Table 2.9, “Common Regular Expression Symbols”.

Table 2.9. Common Regular Expression Symbols

Token           Matches
.               Any character
[list]          Any single character in list
[first-last]    Any single character between first and last, in the order
                they appear in the character set in use. This may be
                affected by locale settings.
*               Zero or more of the preceding token
+               One or more of the preceding token (extended regular
                expressions only; see the discussion of grep -E below)

Note

To match a special character such as '.' or '[' literally, precede it with a '\'.

On BSD systems, a POSIX regular expression reference is available via man re_format.

On Linux systems, a similar document is available via man 7 regex.
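
To preview how these tokens are used in practice, here is a simple sketch using grep (the file name is hypothetical):

# Match lines containing either "gray" or "grey"
grep 'gr[ae]y' document.txt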

Regular expression pattern matching is supported in most programming languages and tools. At the shell level, patterns are typically matched using the grep command.

In short, grep searches a text file for patterns specified as arguments and prints matching lines.

grep pattern file-spec
            

Note

Patterns passed to grep should usually be hard-quoted to prevent the shell from interpreting them as globbing patterns or other shell features.
# Show lines in Bourne shell scripts containing the string "printf"
grep printf *.sh

# Show lines in C programs containing strings that qualify as variable names
grep '[A-Za-z_][A-Za-z_0-9]*' *.c

# Show lines in C programs containing decimal integers
# ('+' requires extended regular expressions, hence -E; see below)
grep -E '[0-9]+' *.c

# Show lines in C programs containing real numbers
grep -E '[0-9]*\.[0-9]+' *.c
            

By default, the grep command matches basic (traditional) regular expressions, in order to maintain backward compatibility with older scripts.

To enable the newer extended regular expressions, use grep -E or egrep.

To disable the use of regular expressions and treat each pattern as a fixed string, use grep -F or fgrep. This is sometimes useful for better performance or to eliminate the need for '\' before special characters.
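
For instance, a minimal sketch (the file name is arbitrary):

# With -F, '.' is an ordinary character, so this matches only "a.out"
grep -F 'a.out' Makefile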

2.24.2. Stream Editors

Stream editors are a class of programs that take input from one stream, often standard input, modify it in some way, and send the output to another stream, often standard output.

The sed (Stream EDitor) command is among the most commonly used stream editing programs. Sed has a variety of capabilities for performing almost any kind of change you can imagine. Most often, though, it is used to replace text matching a regular expression with something else. Our introduction here focuses on this feature and leaves the rest for tutorials dedicated to sed.

The basic syntax of a sed command for replacing text is as follows:

sed -e 's|pattern|replacement|g'
            

The -e flag indicates that the next argument is an editing command, such as the substitution above. By default, sed matches traditional regular expressions; to use the more modern extended regular expressions, use -E, as with grep.

The 's' is the 'substitute' command. Other commands, not discussed here, include 'd' (delete) and 'i' (insert).

The '|' is a separator. You can use any character as the separator as long as all three separators are the same. This allows any character to appear in the pattern or replacement text. Just use a separator that is not in either. The most popular separators are '|' and '/', since they usually stand out next to typical patterns.
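
For instance, choosing '|' as the separator allows '/' characters in path names to appear unescaped (file names here are hypothetical):

# Replace one directory prefix with another in a list of paths
sed -e 's|/usr/local|/opt|g' paths.txt > new-paths.txt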

The pattern is a regular expression, just as we would use with grep. Again, special characters that we want to match literally must be escaped (preceded by a '\').

The replacement text is not a regular expression, but may contain some special characters specific to sed. The most common is '&', which represents the current string matching the pattern. This feature makes it easy to add text to strings matching a pattern, even when the matched strings differ from one another.

The 'g' means perform a global replacement. If omitted, only the first match on each line is replaced.

# Get snooty
sed -e 's|Bob|Robert|g' file.txt > modified-file.txt

# Convert integer constants to long constants in a C program
# ('+' again requires extended regular expressions, hence -E)
sed -E -e 's|[0-9]+|&L|g' prog.c > prog-long.c
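
As a contrast, a sketch with the 'g' flag omitted (input file hypothetical), replacing only the first match on each line:

# Replace only the first comma on each line with a semicolon
sed -e 's|,|;|' data.csv > data-semi.csv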
            

The tr (translate) command is a simpler stream editing tool. It is typically used to replace or delete individual characters from a stream.

# Capitalize all occurrences of 'a', 'b', and 'c'
# (tr reads only from the standard input, so input files are redirected)
tr 'abc' 'ABC' < file.txt > file-caps.txt

# Delete all digits from a file
tr -d '0123456789' < file.txt > file-digitless.txt
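
The tr command also accepts character ranges much like those in regular expressions. A common use, sketched below, is case conversion:

# Convert an entire file to upper case
tr 'a-z' 'A-Z' < file.txt > file-upper.txt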
            

2.24.3. Tabular Data Tools

Unix systems provide standard tools for working with tabular data (text data organized in columns).

The cut command is a simple tool for extracting a portion of each line of a text stream. The user can specify byte, character, or field positions to be extracted.

# Extract the 3rd and 4th characters from every line
cut -c 3-4 file.txt > chopped-file.txt

# Extract the first of several columns separated by white space
# (-w is a BSD extension; it is not available in all cut implementations)
cut -w -f 1 results.txt > results-col1.txt
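
For delimiter-separated data, the portable -d flag sets the field separator character. For example, fields in /etc/passwd are separated by ':':

# Extract login names (the first field) from /etc/passwd
cut -d : -f 1 /etc/passwd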
            

The awk command is an extremely sophisticated tool for manipulating tabular data. It is essentially a non-interactive spreadsheet, capable of performing just about any kind of modification or computation.

Awk includes a scripting language that looks very much like C, with many extensions for easily processing textual data.

Entire books are available on awk, so we will focus on just a few basic examples.

Awk is generally invoked in one of two ways. For very simple awk operations (typically 1-line scripts), we can provide the awk script itself as a command-line argument, usually hard-quoted:

awk [-F field-separator] 'script' file-spec
            

For more complex, multi-line scripts, it may prove easier to place the awk script in a separate file and refer to it in the command:

awk [-F field-separator] -f script.awk file-spec
            

Input is separated into fields by white space by default, but we can specify any field separator we like using the -F flag. The field separator can also be changed within the awk script by assigning the special variable FS.
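
A minimal sketch of the FS approach:

# Equivalent to awk -F :, setting the field separator within the script
awk 'BEGIN { FS = ":" } { print $1 }' /etc/passwd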

Statements within the awk script consist of a pattern and an action.

Patterns may be relational expressions, such as a comparison between a given field (column) and a constant or another field. In this case, the action is invoked only on lines matching the pattern.

If the pattern is omitted, the action is performed on every line of input.

The special patterns BEGIN and END are used to perform actions before the first line is processed and after the last line is processed.

The action is essentially a C-like function. If omitted, the default action is to print every line matching the pattern. (Hence, awk can behave much like grep.)
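
For instance, a pattern with no action (the file name is hypothetical):

# Print all lines containing "printf", much like grep
awk '/printf/' prog.c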

Example 1: A simple awk command

# Print password entries for users with uid >= 1000
shell-prompt: awk -F : '$3 >= 1000 { print $0 }' /etc/passwd
nobody:*:65534:65534:Unprivileged user:/nonexistent:/usr/sbin/nologin
joe:*:4000:4000:Joe User:/home/joe:/bin/tcsh
            

Example 2: A separate awk script

# Initialize variables
BEGIN {
    sum1 = sum2 = 0.0;
}

# Add column data to sum for each line
{
    print $1, $2
    sum1 += $1;
    sum2 += $2;
}

# Output sums after all lines are processed
END {
    printf("Sum of column 1 = %f\n", sum1);
    printf("Sum of column 2 = %f\n", sum2);
}

shell-prompt: cat twocol.txt
4.3     -2.1
5.5     9.0
-7.3    4.6

shell-prompt: awk -f ./sum.awk twocol.txt 
4.3 -2.1
5.5 9.0
-7.3 4.6
Sum of column 1 = 2.500000
Sum of column 2 = 11.500000
            

2.24.4. Sort/Uniq

The sort command is a highly efficient, general-purpose stream sorting tool. It sorts the input stream line-by-line, optionally prioritizing the sort by one or more columns.

shell-prompt: cat names.txt
Kelso Bob
Cox Perry
Dorian John
Turk Christopher
Reid Elliot
Espinosa Carla

# Sort by entire line
shell-prompt: sort names.txt
Cox Perry
Dorian John
Espinosa Carla
Kelso Bob
Reid Elliot
Turk Christopher

# Sort by second column
shell-prompt: sort -k 2 names.txt
Kelso Bob
Espinosa Carla
Turk Christopher
Reid Elliot
Dorian John
Cox Perry

shell-prompt: cat numbers.txt
45
-12
32
16
7
-12

# Sort sorts lexically by default
shell-prompt: sort numbers.txt
-12
-12
16
32
45
7

# Sort numerically
shell-prompt: sort -n numbers.txt
-12
-12
7
16
32
45
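
The flags above can be combined. For example, using the twocol.txt file from the awk example earlier:

# Sort numerically by the second column
shell-prompt: sort -n -k 2 twocol.txt
4.3     -2.1
-7.3    4.6
5.5     9.0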
            

The uniq command eliminates adjacent duplicate lines from the input stream.

shell-prompt: uniq numbers.txt
45
-12
32
16
7
-12

shell-prompt: sort numbers.txt | uniq
-12
16
32
45
7
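
The standard -c flag makes uniq prepend a count of occurrences to each line (the exact spacing of the counts may vary between implementations):

shell-prompt: sort numbers.txt | uniq -c
   2 -12
   1 16
   1 32
   1 45
   1 7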
            

2.24.5. Perl, Python, and other Scripting Languages

All of the commands described above are specified by the POSIX standard and included with every Unix-compatible operating system.

A wide variety of tasks can be accomplished without writing anything more than a shell script utilizing commands like these.

Nevertheless, some Unix users have felt that there is a niche for tools more powerful than shell scripts and standard Unix commands, but more convenient than general-purpose languages like C, Java, etc.

As a result, a class of scripting languages has evolved that is somewhat closer to general-purpose languages. Among the most popular are TCL, Perl, PHP, Python, Ruby, and Lua.

These are interpreted languages, so performance is much slower than a compiled language such as C. However, they are self-contained, using built-in features or library functions instead of relying on external commands such as awk and sed. As a result, many would argue the they are more suitable for writing sophisticated scripts that would lie somewhere between shell scripts and general programs.