1.18. More Shell Tools

1.18.1. Redirection and Pipes

Device Independence

Many operating systems that came before Unix treated each input or output device differently. Each time a new device became available, programs would have to be modified in order to access it. This is intuitive, since the devices all look different and perform different functions.

The Unix designers realized that this is actually unnecessary and a waste of programming effort, so they employed the concept of device independence. Unix device independence works by treating virtually every input and output device exactly like an ordinary file. All input and output, whether to/from a file on a disk, a keyboard, a mouse, a scanner, or a printer, is simply a stream of bytes to be input or output by a program.

Most I/O devices are actually accessible as a device file in /dev. For example, the primary CD-ROM might be /dev/cd0, and the main disk might be /dev/ad0.

Data are often recovered from corrupted file systems or accidentally deleted files by reading the raw disk partition as a file using standard Unix commands such as grep!

shell-prompt: grep string /dev/ad0s1f
		

To see the raw input from a mouse as it is being moved, one could use the following command:

shell-prompt: hexdump /dev/mouse
		

cat /dev/mouse would also work, but the binary data stream would appear as garbage on the terminal screen.
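As a minimal sketch of device independence, the commands below dump eight bytes from an ordinary file and then from the device file /dev/zero using exactly the same tools. The scratch file and the variable names are invented for this illustration, and od is used only because it is available on virtually every Unix system.

```shell
# Hypothetical scratch file used only for this illustration
tmpfile=$(mktemp)
printf 'ABCDEFGH' > "$tmpfile"

# Dump 8 bytes from an ordinary file ...
file_dump=$(dd if="$tmpfile" bs=8 count=1 2> /dev/null | od -An -tx1 | tr -d '[:space:]')

# ... and 8 bytes from a device file, using exactly the same commands
dev_dump=$(dd if=/dev/zero bs=8 count=1 2> /dev/null | od -An -tx1 | tr -d '[:space:]')

echo "file:   $file_dump"
echo "device: $dev_dump"
rm -f "$tmpfile"
```

Neither dd nor od needs to know whether its input is a file on disk or a device.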

Some years ago while mentoring my son's robotics team, as part of a side project, I reverse-engineered a USB game pad so I could control a Lego robot via Bluetooth from a laptop. Thanks to device independence, no special software was needed to figure out the game pad's communication protocol.

After plugging the game pad into my FreeBSD laptop, the dmesg command showed the name of the new device file created under /dev.

ugen1.2: <vendor 0x046d product 0xc216> at usbus1
uhid0 on uhub3
uhid0: <vendor 0x046d product 0xc216, class 0/0, rev 1.10/3.00, addr 2> on usbus1
		

One can then view the input from the game pad using hexdump. It was easy to see that moving the right joystick up resulted in lower numbers in the 3rd and 7th columns, while moving down increased the values. Center position sends a value around 8000 (hexadecimal), fully up is around 0, fully down is ffff. Analogous results were seen for the other joystick and left or right motion, as well as the various buttons. It was then relatively easy to write a small program to read the joystick position from the game pad and send commands over Bluetooth to the robot, adjusting motor speeds accordingly. Sending commands over Bluetooth is also done with the same functions as writing to a file.

FreeBSD manatee.acadix  bacon ~ 410: hexdump /dev/uhid0
0000000 807f 7d80 0008 fc04 807f 7b80 0008 fc04
0000010 807f 7780 0008 fc04 807f 6780 0008 fc04
0000020 807f 5080 0008 fc04 807f 3080 0008 fc04
0000030 807f 0d80 0008 fc04 807f 0080 0008 fc04
0000060 807f 005e 0008 fc04 807f 005d 0008 fc04
0000070 807f 0060 0008 fc04 807f 0063 0008 fc04
0000080 807f 006c 0008 fc04 807f 0075 0008 fc04
0000090 807f 0476 0008 fc04 807f 1978 0008 fc04
00000a0 807f 4078 0008 fc04 807f 8c7f 0008 fc04
00000b0 807f 807f 0008 fc04 807f 7f7f 0008 fc04
00000c0 807f 827f 0008 fc04 807f 847f 0008 fc04
00000d0 807f 897f 0008 fc04 807f 967f 0008 fc04
00000e0 807f a77f 0008 fc04 807f be80 0008 fc04
00000f0 807f d980 0008 fc04 807f f780 0008 fc04
0000100 807f ff80 0008 fc04 807f ff83 0008 fc04
0000110 807f ff8f 0008 fc04 807f ff93 0008 fc04
		

It's interesting to note that the hexdump command first appeared in 4.3 BSD, years before USB debuted and more than a decade before USB game pads existed. I could have just as easily used the od (octal dump) command, which was part of the original AT&T Unix 1 in the early 1970s. The developers could not possibly have imagined that these programs would one day be used this way. They were intended for looking at binary files and possibly input from devices of the time, but because of device independence, these commands would never need to be altered in order to work with new devices connected to a Unix system. The ability to use software without modification on devices invented decades later is the mark of intelligent software engineering.

Redirection

Since I/O devices and files are so interchangeable, Unix shells provide a facility called redirection to easily interchange them for any command without the command even knowing it.

Redirection depends on the notion of a file stream. You can think of a file stream as a hose connecting a program to a particular file or device. Redirection simply disconnects the hose from the default file or device and connects it to another one chosen by the shell user.

Every Unix process has three standard streams that are open from the moment the process is born. The standard streams are normally connected to the terminal, as shown in Table 1.9, “Standard Streams”.

Table 1.9. Standard Streams

Stream           Purpose              Default Connection
Standard Input   User input           Terminal keyboard
Standard Output  Normal output        Terminal screen
Standard Error   Errors and warnings  Terminal screen

Redirection in the shell allows any or all of the three standard streams to be disconnected from the terminal and connected to a file or other I/O device. It uses operators within the commands to indicate which stream(s) to redirect and where. The basic redirection operators are shown in Table 1.10, “Redirection Operators”.

Table 1.10. Redirection Operators

Operator  Shells         Redirection type
<         All            Standard Input
>         All            Standard Output (overwrite)
>>        All            Standard Output (append)
2>        Bourne-based   Standard Error (overwrite)
2>>       Bourne-based   Standard Error (append)
>&        C shell-based  Standard Output and Standard Error (overwrite)
>>&       C shell-based  Standard Output and Standard Error (append)

Note

Memory trick: The arrow in a redirection operator points in the direction of data flow.

Caution

Using output redirection (>, 2>, or >&) in a command will normally overwrite (clobber) the file that you're redirecting to, even if the command itself fails.

Be very careful not to use output redirection accidentally. This most commonly occurs when a careless user meant to use input redirection, but pressed the wrong key.

The moment you press Enter after typing a command containing "> filename", filename will be erased! Remember that the shell performs redirection, not the command, so filename is clobbered before the command even begins running.

If noclobber is set for the shell, output redirection to a file that already exists will result in an error. The noclobber option can be overridden by appending a ! to the redirection operator in C shell derivatives or a | in Bourne shell derivatives. For example, >! can be used to force overwriting a file in csh or tcsh, and >| can be used in sh, ksh, or bash.
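The noclobber behavior can be sketched as follows for Bourne-shell derivatives, where set -C turns the option on. The scratch directory, file name, and variable names are hypothetical and exist only for this illustration.

```shell
# Scratch directory and file names are hypothetical
workdir=$(mktemp -d)
echo "original" > "$workdir/data.txt"

set -C    # enable noclobber (same as: set -o noclobber)

# A plain > is now refused because data.txt already exists.
# The subshell keeps the shell's error message off the screen.
if (echo "replacement" > "$workdir/data.txt") 2> /dev/null; then
    plain_redirect="overwrote the file"
else
    plain_redirect="refused"
fi
after_plain=$(cat "$workdir/data.txt")

# >| overrides noclobber for this one redirection
echo "forced" >| "$workdir/data.txt"
after_forced=$(cat "$workdir/data.txt")

set +C    # restore the default behavior
echo "plain > was $plain_redirect; after >| the file contains: $after_forced"
rm -r "$workdir"
```

In csh or tcsh the corresponding steps would be set noclobber and the >! operator.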

shell-prompt: ls > listing.txt         # Overwrite with listing of .
shell-prompt: ls /etc >> listing.txt   # Append listing of /etc
		

Note that redirection is performed by the shell, not the program. In the examples above, the ls command sends its output to the standard output. It is unaware that the standard output has been redirected to the file listing.txt.

Put another way, listing.txt is not an argument to the ls command. The redirection is handled by the shell, and ls runs as if it had been typed simply as:

shell-prompt: ls
		

More often than not, we want to redirect both normal output and error messages to the same place. This is why C shell and its derivatives use a combined operator that redirects both at once. The same effect can be achieved with Bourne-shell derivatives using another operator that redirects one stream to another stream. In particular, we redirect the standard output (stream 1) to a file (or device) and at the same time redirect the standard error (stream 2) to stream 1.

shell-prompt: find / -name '*.c' > list.txt 2>&1
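To make the two output streams visible, here is a small sketch for Bourne-shell derivatives. The emit function is a hypothetical stand-in for any program that writes to both the standard output and the standard error, and the file names are likewise invented.

```shell
workdir=$(mktemp -d)

# Hypothetical helper that writes one line to each output stream
emit() {
    echo "normal output"        # goes to the standard output (stream 1)
    echo "error message" >&2    # goes to the standard error (stream 2)
}

# Separate files for the two streams
emit > "$workdir/out.txt" 2> "$workdir/err.txt"

# Both streams in one file: stream 1 goes to the file,
# then stream 2 is sent wherever stream 1 is going
emit > "$workdir/both.txt" 2>&1

cat "$workdir/both.txt"
```

The order matters: 2>&1 must follow the redirection of stream 1, because it copies the destination that stream 1 has at that moment.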
		

If a program takes input from the standard input, we can redirect input from a file as follows:

shell-prompt: command < input-file
		

For example, consider the "bc" (binary calculator) command, an arbitrary-precision calculator which inputs numerical expressions from the standard input and writes the results to the standard output:

shell-prompt: bc
3.14159265359 * 4.2 ^ 2 + sqrt(30)
60.29203070318
quit


In the example above, the user entered "3.14159265359 * 4.2 ^ 2 + sqrt(30)" and "quit" and the bc program output "60.29203070318". We can place the input shown above in a file using any text editor, such as nano or vi, or by any other means:

shell-prompt: cat > bc-input.txt
3.14159265359 * 4.2 ^ 2 + sqrt(30)
quit
(Type Ctrl+d to signal the end of input to the cat command)
shell-prompt: more bc-input.txt
3.14159265359 * 4.2 ^ 2 + sqrt(30)
quit
		

Now that we have the input in a file, we can feed it to the bc command using input redirection instead of retyping it on the keyboard:

shell-prompt: bc < bc-input.txt 
60.29203070318
		

Special Files in /dev

Although it may seem a little confusing and circular, the standard streams themselves are represented as device files on Unix systems. This allows us to redirect one stream to another without modifying a program, by appending the stream to one of the device files /dev/stdout or /dev/stderr. For example, if a program sends output to the standard output and we want to send it instead to the standard error, we could do the following:

		printf "Oops!" >> /dev/stderr
		

If we would like to discard output sent to the standard output or standard error, we can redirect it to /dev/null. For example, to see only error messages (standard error) from myprog, we could do the following:

		./myprog > /dev/null
		

To see only normal output and not error messages, assuming Bourne shell:

		./myprog 2> /dev/null
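The effect of /dev/null can be checked with a short sketch for Bourne-shell derivatives. The emit function below is a hypothetical stand-in for ./myprog, and the variable names are invented for the illustration.

```shell
# Hypothetical stand-in for ./myprog: one line on each output stream
emit() {
    echo "normal output"
    echo "error message" >&2
}

# Keep only normal output by discarding the standard error
output_only=$(emit 2> /dev/null)

# Keep only error messages by discarding the standard output.
# The braces let 2>&1 hand the surviving error text to $( ).
errors_only=$( { emit > /dev/null; } 2>&1 )

echo "kept from stdout: $output_only"
echo "kept from stderr: $errors_only"
```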
		

The device /dev/zero is a readable file that produces a stream of zero bytes.

The device /dev/random is a readable file that produces a stream of random bytes.

Pipes

Quite often, we may want to use the output of one program as input to another. Such a thing could be done using redirection, as shown below:

shell-prompt: sort names.txt > sorted-names.txt
shell-prompt: uniq < sorted-names.txt > unique-names.txt
		

The same task can be accomplished in one command using a pipe. A pipe redirects one of the standard streams, just as redirection does, but to another process instead of to a file or device. In other words, we can use a pipe to send the standard output and/or standard error of one process directly to the standard input of another process.

Example 1.4. Simple Pipe

The command below uses a pipe to redirect the standard output of the sort command directly to the standard input of the uniq command.

shell-prompt: sort names.txt | uniq > uniq-names.txt
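One way to convince yourself that the pipe and the intermediate file give the same result is to run both and compare. The sketch below uses a small, made-up names.txt in a scratch directory; all file names other than those in the text are hypothetical.

```shell
workdir=$(mktemp -d)
printf 'Bob\nAlice\nBob\nCarl\nAlice\n' > "$workdir/names.txt"

# Two steps with an intermediate file
sort "$workdir/names.txt" > "$workdir/sorted-names.txt"
uniq < "$workdir/sorted-names.txt" > "$workdir/two-step.txt"

# One step with a pipe and no intermediate file
sort "$workdir/names.txt" | uniq > "$workdir/piped.txt"

# cmp -s exits with status 0 when the two files are identical
if cmp -s "$workdir/two-step.txt" "$workdir/piped.txt"; then
    comparison="identical"
else
    comparison="different"
fi
echo "The two results are $comparison"
```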
		    

Since a pipe runs multiple commands in the same shell, it is necessary to understand the concept of foreground and background processes, which are covered in detail in Section 1.19, “Process Control”.

Multiple processes can output to a terminal at the same time, although the results would obviously be chaos in most cases.

Only one process can receive input from the keyboard, however.

The foreground process running under a given shell process is defined as the process that receives the input from the standard input device (usually the keyboard). This is the only difference between a foreground process and a background process.

When running a pipeline command, the last process in the pipeline is the foreground process. All others run in the background, i.e. do not use the standard input device inherited from the shell process. Hence, when we run:

shell-prompt: find /etc | more
		

It is the more command that receives input from the keyboard. The more command has its standard input redirected from the standard output of find, and the standard input of the find command is effectively disabled.

The more command is somewhat special: Since its standard input is redirected from the pipe, it opens another stream to connect to the keyboard so that the user can interact with it, pressing the space bar for another screen, etc.

For piping stderr, the notation is similar to that used for redirection:

Table 1.11. Pipe Operators

Operator  Shells               Pipe stream(s)
|         All                  Standard Output to Standard Input
|&        C shell family       Standard Output and Standard Error to Standard Input
2>&1 |    Bourne shell family  Standard Output and Standard Error to Standard Input

The entire chain of commands connected by pipes is known as a pipeline.

This is such a common practice that Unix has defined the term filter to apply to programs that can be used in this way. A filter is any command that can receive input from the standard input and send output to the standard output. Many Unix commands are designed to accept file names as arguments, but also to use the standard input and/or standard output if no filename arguments are provided.

Example 1.5. Filters

The more command is commonly used as a filter. It can read a file whose name is provided as an argument, but will use the standard input if no argument is provided. Hence, the following two commands have the same effect:

shell-prompt: more names.txt
shell-prompt: more < names.txt
		    

The only difference between these two commands is that in the first, the more command receives names.txt as a command line argument, opens the file itself (creating a new file stream), and reads from the new stream (not the standard input stream). In the second instance, the shell opens the file and connects the standard input stream of the more command to it.
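The same distinction can be observed with wc, another filter, which is used here only because its output reveals whether it was given a file name. This is a sketch with a made-up names.txt in a scratch directory.

```shell
workdir=$(mktemp -d)
printf 'Bob\nAlice\nBob\n' > "$workdir/names.txt"

# wc opens the file itself, so it knows the name and prints it
with_arg=$(wc -l "$workdir/names.txt")

# The shell opens the file; wc sees only its standard input
with_redirect=$(wc -l < "$workdir/names.txt")

echo "argument:    $with_arg"
echo "redirection: $with_redirect"
```

With redirection, wc prints only the count, since it never learns where its standard input came from.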

Using the filtering capability of more, we can paginate the output of any command:

shell-prompt: ls | more
shell-prompt: find . -name '*.c' | more
shell-prompt: sort names.txt | more
		    

We can string any number of commands together using pipes. For example, the following pipeline sorts the names in names.txt, removes duplicates, filters out all names not beginning with 'B', and shows the first 100 results one page at a time.

shell-prompt: sort names.txt | uniq | grep '^B' | head -n 100 | more
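To see what each stage contributes, the sketch below runs the same pipeline (without the interactive more at the end) on a small, invented list of names.

```shell
workdir=$(mktemp -d)
printf 'Bob\nAlice\nBarb\nBob\nCarl\nBarb\n' > "$workdir/names.txt"

# sort groups duplicates together, uniq removes them,
# grep keeps names starting with B, head limits the count
result=$(sort "$workdir/names.txt" | uniq | grep '^B' | head -n 100)

echo "$result"
```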
		

One more useful tool worth mentioning is the tee command. The tee command is a simple program that reads from its standard input and writes to both the standard output and to one or more files whose names are provided on the command line. This allows you to view the output of a program on the screen and redirect it to a file at the same time.

shell-prompt: ls | tee listing.txt
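A quick way to confirm that tee delivers the same data to both destinations is to capture what would have gone to the screen and compare it with the file. The input text and variable names below are made up for the illustration.

```shell
workdir=$(mktemp -d)

# Capture what tee writes to its standard output ...
on_screen=$(printf 'one\ntwo\n' | tee "$workdir/listing.txt")

# ... and read back what tee wrote to the file
in_file=$(cat "$workdir/listing.txt")

echo "screen copy: $on_screen"
echo "file copy:   $in_file"
```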
		

Recall that Bourne-shell derivatives do not have combined operators for redirecting standard output and standard error at the same time. Instead, we redirect the standard output to a file or device, and redirect the standard error to the standard output using 2>&1.

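We can use the same technique with a pipe, but there is one more condition: A redirection applies only to the command it is attached to, so the 2>&1 must come before the pipe, where it affects the command whose error messages we want to capture. Placed after the pipe, it would redirect the standard error of tee instead.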

shell-prompt: ls | tee listing.txt 2>&1    # Won't work
shell-prompt: ls 2>&1 | tee listing.txt    # Will work
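The difference between the two placements can be sketched with a hypothetical emit function that, unlike a successful ls, always writes to both output streams. File and variable names are invented for the illustration.

```shell
workdir=$(mktemp -d)

# Hypothetical helper that writes one line to each output stream
emit() {
    echo "normal output"
    echo "error message" >&2
}

# 2>&1 after the pipe applies to tee, so emit's error message
# bypasses the pipe and never reaches the file
emit | tee "$workdir/after-pipe.txt" 2>&1

# 2>&1 before the pipe applies to emit, so both lines reach tee
emit 2>&1 | tee "$workdir/before-pipe.txt"

lines_after=$(wc -l < "$workdir/after-pipe.txt")
lines_before=$(wc -l < "$workdir/before-pipe.txt")
echo "2>&1 after the pipe saved:" $lines_after "line(s)"
echo "2>&1 before the pipe saved:" $lines_before "line(s)"
```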
		

The yes command produces a stream of y's followed by newlines. It is meant to be piped into a program that prompts for y's or n's in response to yes/no questions, so that the program will receive a yes answer to all of its prompts and run without user input.

    yes | ./myprog
		

In cases where the response isn't always "yes" we can feed a program any sequence of responses using redirection or pipes. Be sure to add a newline (\n) after each response to simulate pressing the Enter key:

./myprog < responses.txt
printf "y\nn\ny\n" | ./myprog
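The sketch below shows both approaches feeding the same responses to a program. The ask_twice function is a hypothetical stand-in for ./myprog that simply reads two answers from the standard input.

```shell
workdir=$(mktemp -d)

# Hypothetical stand-in for ./myprog: reads two yes/no answers
ask_twice() {
    read -r first
    read -r second
    echo "answers: $first $second"
}

# Responses supplied through a pipe
from_pipe=$(printf 'y\nn\n' | ask_twice)

# The same responses supplied through input redirection
printf 'y\nn\n' > "$workdir/responses.txt"
from_file=$(ask_twice < "$workdir/responses.txt")

echo "$from_pipe"
echo "$from_file"
```

The program cannot tell the difference between these sources and a user typing at the keyboard.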
		

Misusing Pipes

Users who don't understand Unix and processes very well often fall into bad habits that can potentially be very costly. There are far too many such habits to cover here (one could write a separate 1,000-page volume called "Favorite Bad Habits of Unix Users").

As a less painful alternative, we'll explore one common habit in detail and try to help you understand how to assess your methods so you can then check others for potential problems. Our featured habit of the day is the use of the cat command at the head of a pipeline:

shell-prompt: cat names.txt | sort | uniq > outfile
		

So what's the alternative, what's wrong with using cat this way, what's the big deal, why do people do it, and how do we know it's a problem?

  1. The alternative:

    Most commands used downstream of cat in situations like this (e.g. sort, grep, more, etc.) are capable of reading a file directly if given the filename as an argument:

    shell-prompt: sort names.txt | uniq > outfile
    		    

    Even if they don't take a filename argument, we can always use simple redirection instead of a pipe:

    shell-prompt: sort < names.txt | uniq > outfile
    		    
  2. The problem:

    • Using cat this way just adds overhead in exchange for no benefit. Pipes are helpful when you have to perform multiple processing steps in sequence. However, the cat command doesn't do any processing at all. It just reads the file and copies it to the first pipe.

      In doing so, here's what happens:

      1. The cat command reads blocks from the file into its standard input buffer.
      2. It then copies the standard input buffer, one character at a time, to its standard output buffer, without processing the data in any way. It just senselessly moves data from one memory buffer to another.
      3. The standard output buffer is copied to the pipe, which is yet another memory buffer.
      4. Characters in the pipe buffer are copied to the standard input buffer of the next command (e.g. sort).
      5. The sort can finally begin processing the data.

      This is like pouring a drink into a glass, then moving it to a second glass using an eye dropper, then pouring it into a third glass and finally a fourth glass before actually drinking it.

      It's much simpler and less wasteful for the sort command to read directly from the file.

    • Using a pipe this way also prevents the downstream command from optimizing disk access. A program such as sort might use a larger input buffer size to reduce the number of disk reads. Reading fewer, larger blocks from disk can prevent the latency of each disk operation from adding up, thereby reducing run time. This is not possible when reading from a pipe, which is a fixed-size memory buffer.

  3. What's the big deal?

    Usually, this is no real problem at all. Wasting a few seconds or minutes on your laptop won't harm anyone. However, sometimes mistakes like this one are incorporated into HPC cluster jobs using hundreds of cores for weeks at a time. In that case, it could increase run time by several days, delaying the work of other users as well as your own. Not to mention, the wasted electricity could cost the organization hundreds of dollars.

  4. The cause:

    By far the most common response I get when asking people about this sort of thing is: "[Shrug] I copied this from an example on the web. Didn't really think about it."

    Occasionally, someone might think that this speeds up processing by splitting the task into two processes, hence utilizing multiple cores, one running cat to handle the disk input and another dedicated to sort or whatever command is downstream. This only helps if the commands use enough CPU time to benefit from more than one core, or if it saves you from having to write an intermediate file. Neither of these factors is in play in this situation.

    One might also think it helps by overlapping disk input and CPU processing, i.e. cat can read the next block of data while sort is processing the current one. This may have worked a long time ago using slow disks and unsophisticated operating systems, but it only backfires on modern hardware running modern Unix systems.

    In reality, this strategy only increases the amount of CPU time used, and almost always increases run time.

  5. Detection:

    Detecting performance issues is pretty easy. The most common tool is the time command.

    shell-prompt: time fgrep GGTAGGTGAGGGGCGCCTCTAGATCGGAAGAGCACACGTCTGAACTCCAGTCA test.vcf > /dev/null
    2.539u 6.348s 0:09.86 89.9% 92+173k 35519+0io 0pf+0w
    		    

    We have to be careful when using time with a pipeline, however. Depending on the shell and the time command used (some shells have an internal implementation), it may not work as expected. We can ensure proper function by wrapping the pipeline in a separate shell process, which is then timed:

    shell-prompt: time sh -c "cat test.vcf | fgrep GGTAGGTGAGGGGCGCCTCTAGATCGGAAGAGCACACGTCTGAACTCCAGTCA > /dev/null"
    2.873u 17.008s 0:13.68 145.2%   33+155k 33317+0io 0pf+0w
    		    

    Table 1.12, “Run times of pipes with cat” compares the run times (wall time) and CPU time of the direct fgrep and piped fgrep shown above on three different operating systems.

    All runs were performed on otherwise idle systems. Several trials were run to ensure reliable results. Times from the first read of test.vcf were discarded, since subsequent runs benefit from disk buffering (file contents still in memory from the previous read). The wall time varied significantly on the CentOS system, with the piped command running in less wall time for a small fraction of the trials. The times shown in the table are typical. Times for FreeBSD and MacOS were fairly consistent.

    Note that there is a large variability between platforms which should not be taken too seriously. These tests were not run on identical hardware, so they do not tell us anything conclusive about relative operating system performance.

    We can also collect other data using tools such as top to monitor CPU and memory use and iostat to monitor disk activity. These commands are covered in more detail in Section 1.14.15, “top” and Section 1.14.16, “iostat”.

Table 1.12. Run times of pipes with cat

System specs           Pipe wall (s)  No pipe wall (s)  Pipe CPU (s)  No pipe CPU (s)
CentOS 7 i7 2.8GHz     33.43          29.50             13.59         8.45
FreeBSD Phenom 3.2GHz  13.01          8.90              18.76         8.43
MacBook i5 2.7GHz      81.09          81.35             84.02         81.20
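Whatever the performance difference, the three forms discussed under "Misusing Pipes" produce the same output, so nothing is lost by dropping cat. The sketch below checks this on a small, invented names.txt; the variable names are hypothetical.

```shell
workdir=$(mktemp -d)
printf 'Bob\nAlice\nBob\n' > "$workdir/names.txt"

# cat at the head of the pipeline
with_cat=$(cat "$workdir/names.txt" | sort | uniq)

# File name passed directly to sort
with_arg=$(sort "$workdir/names.txt" | uniq)

# Input redirection instead of a leading pipe
with_redirect=$(sort < "$workdir/names.txt" | uniq)

echo "$with_arg"
```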

1.18.2. Subshells

Commands placed between parentheses are executed in a new child shell process rather than the shell process that received the commands as input.

This can be useful if you want a command to run in a different directory or with altered environment variables, without affecting the current shell process.

shell-prompt: (cd /etc; ls)
	    

Since the commands above are executed in a new shell process, the shell process that printed "shell-prompt: " will not have its current working directory changed. This command has the same net effect as the following:

shell-prompt: pushd /etc
shell-prompt: ls
shell-prompt: popd
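The isolation provided by a subshell can be checked directly. In this sketch, GREETING is a hypothetical variable used only to show that changes made inside the parentheses do not leak out.

```shell
start_dir=$(pwd)
GREETING="hello"

# Everything inside the parentheses runs in a child shell process
(
    cd /
    GREETING="changed in subshell"
    echo "inside:  $(pwd) / $GREETING"
)

# The parent shell's directory and variables are untouched
end_dir=$(pwd)
echo "outside: $end_dir / $GREETING"
```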
	    

1.18.3. Self-test

  1. What does device independence mean?
  2. Show a Unix command that could be used to view the data stream sent by a mouse represented as /dev/mse0.
  3. Name and describe the three standard streams available to all Unix processes.
  4. Show the simplest Unix command to accomplish each of the following:
    1. Save a list of all files in /etc to the file list.txt.
    2. Compile prog.c under bash using gcc, saving error messages to errors.txt and normal screen output to output.txt.
    3. Compile prog.c under tcsh using gcc, saving both error messages and normal screen output to output.txt.
    4. Compile prog.c under tcsh using gcc, saving both error messages and normal screen output to output.txt. Overwrite output.txt even if noclobber is set.
    5. Run the program ./prog1, causing it to use the file input.txt as the standard input instead of the keyboard.
    6. Compile prog.c under tcsh using gcc, saving both error messages and normal screen output to output.txt and sending them to the screen at the same time.
  5. Which program in a pipeline runs as the foreground process?
  6. How many programs can be included in a single pipeline?
  7. What is a filter program?
  8. Show a Unix command that will edit all the C program files in the subdirectory Programs, using the vi editor.
  9. Show a Unix command that runs the command "make" in the directory "./src" without changing the current working directory of the current shell process.