Redirection and Pipes

Device Independence

Many operating systems that came before Unix treated each input or output device differently. Each time a new device became available, programs would have to be modified in order to access it. This is intuitive, since the devices all look different and perform different functions.

The Unix designers realized that this is actually unnecessary and a waste of programming effort, so they developed the concept of device independence. What this means is that Unix treats virtually every input and output device exactly like an ordinary file. All input and output, whether to/from a file on a disk, a keyboard, a mouse, a scanner, or a printer, is simply a stream of bytes to be input or output using the same tools.

Most I/O devices are actually accessible as a device file in /dev. For example, the primary CD-ROM might be /dev/cd0, the main disk /dev/ad0, the keyboard /dev/kbd0, and the mouse /dev/sysmouse.

A user with sufficient permissions can view input coming from these devices using the same Unix commands we use to view a file:

shell-prompt: cat /dev/kbd0
shell-prompt: more /dev/cd0
            

In fact, data are often recovered from corrupted file systems or accidentally deleted files by searching the raw disk partition as a file using standard Unix commands such as grep!

shell-prompt: grep string /dev/ad0s1f
            

A keyboard sends text data, so /dev/kbd0 is like a text file. Many other devices send binary data, so using cat to view them would output gibberish. To see the raw input from a mouse as it is being moved, we could instead use hexdump, which displays the bytes of input as numbers rather than characters:

shell-prompt: hexdump /dev/sysmouse
            

Some years ago while mentoring my son's robotics team, as part of a side project, I reverse-engineered a USB game pad so I could control a Lego robot via Bluetooth from a laptop. Thanks to device-independence, no special software was needed to figure out the game pad's communication protocol.

When I plugged the game pad into my FreeBSD laptop, the system created a new UHID (USB Human Interface Device) entry under /dev. The dmesg command shows the name of the new device file:

ugen1.2: <vendor 0x046d product 0xc216> at usbus1
uhid0 on uhub3
uhid0: <vendor 0x046d product 0xc216, class 0/0, rev 1.10/3.00, addr 2> on usbus1
            

One can then view the input from the game pad using hexdump:

FreeBSD manatee.acadix  bacon ~ 410: hexdump /dev/uhid0
0000000 807f 7d80 0008 fc04 807f 7b80 0008 fc04
0000010 807f 7780 0008 fc04 807f 6780 0008 fc04
0000020 807f 5080 0008 fc04 807f 3080 0008 fc04
0000030 807f 0d80 0008 fc04 807f 0080 0008 fc04
0000060 807f 005e 0008 fc04 807f 005d 0008 fc04
0000070 807f 0060 0008 fc04 807f 0063 0008 fc04
0000080 807f 006c 0008 fc04 807f 0075 0008 fc04
0000090 807f 0476 0008 fc04 807f 1978 0008 fc04
00000a0 807f 4078 0008 fc04 807f 8c7f 0008 fc04
00000b0 807f 807f 0008 fc04 807f 7f7f 0008 fc04
00000c0 807f 827f 0008 fc04 807f 847f 0008 fc04
00000d0 807f 897f 0008 fc04 807f 967f 0008 fc04
00000e0 807f a77f 0008 fc04 807f be80 0008 fc04
00000f0 807f d980 0008 fc04 807f f780 0008 fc04
0000100 807f ff80 0008 fc04 807f ff83 0008 fc04
0000110 807f ff8f 0008 fc04 807f ff93 0008 fc04
            

To understand these numbers, we need to know a little about hexadecimal, base 16. This is covered in detail in Chapter 15, Data Representation. In short, it works the same as decimal, but we multiply by powers of 16 rather than 10, and digits go up to 15 rather than 9. Digits for 10 through 15 are A, B, C, D, E, and F. The largest possible 4-digit number is therefore FFFF_16. 8000_16 is in the middle of the range.

0000_16 =  0 * 16^3 +  0 * 16^2 +  0 * 16^1 +  0 * 16^0 = 0_10
8000_16 =  8 * 16^3 +  0 * 16^2 +  0 * 16^1 +  0 * 16^0 = 32,768_10
FFFF_16 = 15 * 16^3 + 15 * 16^2 + 15 * 16^1 + 15 * 16^0 = 65,535_10
            
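A quick way to check such conversions is the shell's own printf, which accepts C-style hexadecimal constants with a 0x prefix (a small sketch, not part of the original session):

```shell
# printf's %d interprets a 0x-prefixed argument as hexadecimal.
printf '%d\n' 0x0000   # prints 0
printf '%d\n' 0x8000   # prints 32768
printf '%d\n' 0xFFFF   # prints 65535
```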

It was easy to see that moving the right joystick up resulted in lower numbers in the 3rd and 7th columns, while moving down increased the values. Center position sends a value around 8000 (hexadecimal), fully up is around 0, fully down is ffff.

It was then easy to write a small program to read the joystick position from the game pad (by simply opening /dev/uhid0 like any other file) and send commands over Bluetooth to the robot, adjusting motor speeds accordingly. The Bluetooth interface is simply treated as an output file.

Redirection

Since I/O devices and files are interchangeable, Unix shells can provide a facility called redirection to easily interchange them for any process without the process even knowing it.

Redirection depends on the notion of a file stream. You can think of a file stream as a hose connecting a program to a particular file or device, as shown in Figure 3.3, “File streams”. Redirection simply disconnects the hose from the default file or device (such as the keyboard or terminal screen) and connects it to another file or device chosen by the user.

Figure 3.3. File streams

File streams

Every Unix process has three standard streams that are open from the moment the process is born. The standard streams for a shell process are normally connected to the terminal, as shown in Table 3.9, “Standard Streams” and Figure 3.4, “Standard streams”.

Table 3.9. Standard Streams

Stream            Purpose               Default Connection
Standard Input    User input            Terminal keyboard
Standard Output   Normal output         Terminal screen
Standard Error    Errors and warnings   Terminal screen

Figure 3.4. Standard streams

Standard streams

Redirection in the shell allows any or all of the three standard streams to be disconnected from the terminal and connected to a file or other I/O device. It uses special operator characters within the commands to indicate which stream(s) to redirect and where. The basic redirection operators for common shells are shown in Table 3.10, “Redirection Operators”.

Table 3.10. Redirection Operators

Operator   Shells          Redirection type
<          All             Standard Input
>          All             Standard Output (overwrite)
>>         All             Standard Output (append)
2>         Bourne-based    Standard Error (overwrite)
2>>        Bourne-based    Standard Error (append)
>&         C shell-based   Standard Output and Standard Error (overwrite)
>>&        C shell-based   Standard Output and Standard Error (append)

Note

Memory trick: A redirection operator is an arrow that points in the direction of data flow.
shell-prompt: ls > listing.txt         # Overwrite with listing of .
shell-prompt: ls /etc >> listing.txt   # Append listing of /etc
            

In the examples above, the ls process sends its output to listing.txt instead of the terminal, as shown in Figure 3.5, “Redirecting standard output”.

Figure 3.5. Redirecting standard output

Redirecting standard output

However, the filename listing.txt is not an argument to the ls process. The ls process never even knows about this output file. The redirection is handled by the shell and the shell removes "> listing.txt" and ">> listing.txt" from these commands before executing them. So, the first ls receives no arguments, and the second receives only /etc. Most programs have no idea whether their output is going to a file, a terminal, or some other device. They don't need to know and they don't care.

Caution

Using output redirection (>, 2>, or >&) in a command will normally overwrite (clobber) the file that you're redirecting to, even if the command itself fails. Be very careful not to use output redirection accidentally. This most commonly occurs when a careless user meant to use input redirection, but pressed the wrong key.

The moment you press Enter after typing a command containing "> filename", filename will be erased! Remember that the shell performs redirection, not the command, so filename is clobbered before the command is even executed.

If noclobber is set for the shell, output redirection to a file that already exists will result in an error. The noclobber option can be overridden by appending a ! to the redirection operator in C shell derivatives or a | in Bourne shell derivatives. For example, >! can be used to force overwriting a file in csh or tcsh, and >| can be used in sh, ksh, or bash.

More often than not, we want to redirect both normal output and error messages to the same place. This is why C shell and its derivatives use a combined operator that redirects both at once.

shell-prompt: find /etc >& all-output.txt
            

The same effect can be achieved with Bourne-shell derivatives using another operator that redirects one stream to another stream. In particular, we redirect the standard output (stream 1) to a file (or device) and at the same time redirect the standard error (stream 2) to stream 1.

shell-prompt: find /etc > all-output.txt 2>&1
            
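Because Bourne-family shells process redirections from left to right, the order of > file and 2>&1 matters. A minimal sketch, using sh -c as a stand-in for any command that writes to both streams:

```shell
# Correct order: stdout is attached to the file first, then stderr
# is duplicated onto it, so both lines land in both.txt.
sh -c 'echo output; echo error >&2' > both.txt 2>&1

# Wrong order: stderr is duplicated onto the terminal's stdout
# *before* stdout is redirected, so "error" still reaches the
# screen and only "output" lands in the file.
sh -c 'echo output; echo error >&2' 2>&1 > only-stdout.txt

rm both.txt only-stdout.txt
```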

In Bourne family shells, we can separately redirect the standard output with > and the standard error with 2>:

shell-prompt: find /etc > list.txt 2> errors.txt
            

If we want to separate standard output and standard error in a C shell or T shell session, we can use a subshell under which the find command redirects only the standard output. The output from the subshell process will then contain only the standard error left over from find, which we can redirect with >&:

shell-prompt: (find /etc > list.txt) >& errors.txt
            

If a program takes input from the standard input, we can redirect input from a file as follows:

shell-prompt: command < input-file
            

For example, the "bc" (binary calculator) command is an arbitrary-precision calculator that inputs numerical expressions from the standard input and writes the results to the standard output. It's a good idea to use the --mathlib flag with bc for more complete functionality.

shell-prompt: bc --mathlib
3.14159265359 * 4.2 ^ 2 + sqrt(30)
60.89491998437926113456
quit
            

In the example above, the user entered "3.14159265359 * 4.2 ^ 2 + sqrt(30)" and "quit", and the bc program output "60.89491998437926113456". We could instead place the input shown above in a file using any text editor, such as nano or vi, or even using cat with keyboard input and output redirection as a primitive editor:

shell-prompt: cat > bc-input.txt
3.14159265359 * 4.2 ^ 2 + sqrt(30)
quit
(Type Ctrl+d to signal the end of input to the cat process)
shell-prompt: cat bc-input.txt
3.14159265359 * 4.2 ^ 2 + sqrt(30)
quit
            

Now that we have the input in a file, we can feed it to the bc process using input redirection instead of retyping it on the keyboard:

shell-prompt: bc --mathlib < bc-input.txt 
60.89491998437926113456
            
Special Files in /dev

The standard streams themselves are represented as device files on Unix systems. This allows us to redirect one stream to another without modifying a program, by appending the stream to one of the device files /dev/stdout or /dev/stderr. For example, if a program sends output to the standard output and we want to send it instead to the standard error, we could do something like the following:

shell-prompt: printf "Oops!" >> /dev/stderr
            
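In Bourne-family shells, the same effect is more often written with the stream-duplication operator >&2, which sends a command's standard output to wherever the standard error currently points (a sketch; the message text is just an example):

```shell
# Duplicate this command's stdout onto stderr (Bourne family).
printf 'Oops!\n' >&2

# Typical use: keep warnings out of output that may be redirected.
printf 'warning: no input files\n' >&2
```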

If we would like to simply discard output sent to the standard output or standard error, we can redirect it to /dev/null. For example, to see only error messages (standard error) from myprog, we could do the following:

shell-prompt: ./myprog > /dev/null
            

To see only normal output and not error messages, assuming Bourne shell family:

shell-prompt: ./myprog 2> /dev/null
            

In C shell family:

shell-prompt: ( find /etc > output.txt ) >& /dev/null ; cat output.txt
            

The device /dev/zero is a readable file that produces a stream of zero bytes.
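This makes /dev/zero handy for creating files of a known size filled with zero bytes, such as blank disk images. A sketch using dd (the file name zeros.img is arbitrary; a plain byte count avoids the GNU/BSD block-size suffix differences of 1M vs 1m):

```shell
# Create a 1 MiB file of zero bytes: 1024 blocks of 1024 bytes each.
dd if=/dev/zero of=zeros.img bs=1024 count=1024

# Verify the size (should be 1048576 bytes).
ls -l zeros.img
rm zeros.img
```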

The device /dev/random is a readable file that produces a stream of random bytes. We can use the dd command, a bit copy program, to copy a fixed number of bytes from one file to another. We specify the input file with "if=", output with "of=", block size with "bs=", and the number of blocks with "count=". Total data copied will be block-size * count.

shell-prompt: dd if=/dev/random of=random-data bs=1000000 count=10
            

Note

The block size indicates the size of the memory buffer used to store each chunk of the file. Make it large enough to keep the number of disk reads/writes low, but not so large that it will use a significant portion of available memory. A block size of a gigabyte may stress the system's memory resources, and you won't see much improvement in speed using block sizes more than several kibibytes.
Pipes

Very often, we want to use the output of one program as input to another. Such a thing could be done using redirection, as shown below:

shell-prompt: ls > listing.txt
shell-prompt: more listing.txt
            

The same task can be accomplished in one command using a pipe. A pipe redirects one of the standard streams, just as redirection does, but to or from another process instead of a file or device. In other words, we can use a pipe to send the standard output and/or standard error of one process directly to the standard input of another process.

A pipe is constructed by placing the pipe operator (|) between two commands. The whole chain of commands connected by pipes is called a pipeline.

Example 3.17. Simple Pipe

The command below uses a pipe to redirect the standard output of an ls process directly to the standard input of a more process.

shell-prompt: ls | more
                

Since a pipe runs multiple processes in the same shell, it is necessary to understand the concept of foreground and background processes, which are covered in detail in the section called “Process Control”.

Multiple processes can output to a terminal at the same time, although the results would obviously be chaos in most cases.

In contrast, only one process at a time can receive input from the keyboard. It would be a remarkable coincidence if the same input made sense to two different programs.

The foreground process running under a given shell process is defined as the process that receives the input from the terminal. This is the only difference between a foreground process and a background process.

When running a pipeline command, the last command in the pipeline becomes the foreground process. All others run in the background, i.e. do not use the standard input device inherited from the shell process. Hence, when we run:

shell-prompt: ls | more
            

It is the more command that receives input from the keyboard. The more command has its standard input redirected from the standard output of ls, and the standard input of the ls command is effectively disabled.

Note

The more command is somewhat special: Since its standard input is used to receive input from the pipe, it opens another stream to connect to the keyboard so that it can still get user input, such as pressing the space bar for another screen, etc.

This is such a common practice that Unix has defined the term filter to apply to programs that can be used in this way. A filter is any command that can receive input from the standard input and send output to the standard output. Many Unix commands are designed to accept a file name as an argument, but to use the standard input and/or standard output if no filename arguments are provided.

Example 3.18. Filters

The more command is commonly used as a filter. It can read a file whose name is provided as an argument, but will use the standard input if no argument is provided. Hence, the following two commands have the same effect:

shell-prompt: more names.txt
shell-prompt: more < names.txt
                

The only difference between these two commands is that in the first, the more process receives names.txt as a command line argument, opens the file itself (creating a new file stream), and reads from the new stream (not the standard input stream). In the second instance, the shell process opens names.txt and connects the standard input stream of the more process to it. The more process then uses another stream to read user input from the keyboard.

Using the filtering capability of more, we can paginate the output of any command:

shell-prompt: ls | more
shell-prompt: find . -name '*.c' | more
shell-prompt: sort names.txt | more
                

We can string any number of commands together using pipes. The only limitations are imposed by the memory requirements of the processes in the pipeline. For example, the following pipeline sorts the names in names.txt, removes duplicates, filters out all names not beginning with 'B', and shows the first 100 results one page at a time.

shell-prompt: sort names.txt | uniq | grep '^B' | head -n 100 | more
            
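As an aside, sort's POSIX -u option removes duplicates itself, folding the uniq stage into sort. A small sketch, with a hypothetical names file created inline for illustration:

```shell
# Create a small sample file (stand-in for names.txt).
printf 'Bob\nAnn\nBob\nBill\n' > names.txt

# sort -u replaces the "sort | uniq" pair; prints Bill then Bob.
sort -u names.txt | grep '^B' | head -n 100

rm names.txt
```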

To see lines 101 through 200 of a file output.txt:

shell-prompt: head -n 200 output.txt | tail -n 100
            
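sed can extract the same range in a single process; the extra 200q command makes sed quit at line 200 rather than scanning the rest of the file. A sketch with a generated stand-in for output.txt:

```shell
# Build a 300-line sample file standing in for output.txt.
seq 300 > output.txt

# Print only lines 101-200; "200q" quits after line 200.
sed -n '101,200p; 200q' output.txt

rm output.txt
```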

One more useful tool worth mentioning is the tee command. The tee command is a simple program that reads from its standard input and writes to both the standard output and to one or more files whose names are provided on the command line. This allows you to view the output of a program on the screen and save it to a file at the same time.

shell-prompt: ls | tee listing.txt
            
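Note that tee truncates its output files by default, just like >. The POSIX -a option makes it append instead, analogous to >>:

```shell
ls | tee listing.txt            # overwrite listing.txt
ls /etc | tee -a listing.txt    # append the /etc listing to it
rm listing.txt
```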

Recall that Bourne-shell derivatives do not have combined operators for redirecting standard output and standard error at the same time. Instead, we redirect the standard output to a file or device, and redirect the standard error to the standard output using 2>&1.

We can use the same technique with a pipe, but there is one more condition: For technical reasons, the 2>&1 must come before the pipe.

shell-prompt: ls | tee listing.txt 2>&1    # Won't work
shell-prompt: ls 2>&1 | tee listing.txt    # Will work
            

The yes command (much like Jim Carrey in "Yes Man") produces a stream of y's followed by newlines. It is meant to be piped into a program that prompts for y's or n's in response to yes/no questions, so that the program will receive a yes answer to all of its prompts and run without user input.

shell-prompt: yes | ./myprog
            

The yes command can actually print any response we want, via a command line argument. To answer 'n' to every prompt, we could do the following:

shell-prompt: yes n | ./myprog
            

In cases where the response isn't always the same, we can feed a program an arbitrary sequence of responses using redirection or pipes. Be sure to add a newline (\n) after each response to simulate pressing the Enter key:

shell-prompt: printf "y\nn\ny\n" | ./myprog
            

Or, to save the responses to a file for repeated use:

shell-prompt: printf "y\nn\ny\n" > responses.txt
shell-prompt: ./myprog < responses.txt
            
Misusing Pipes

Aside

It's important to learn from the mistakes of others, because we don't have time to make them all ourselves.

Users who don't fully understand Unix and processes often fall into bad habits that can potentially be costly. There are far too many such habits to cover here: One could write a separate 1,000-page volume called "Favorite Bad Habits of Unix Users". As a less painful alternative, we'll explore one common bad habit in detail and try to help you understand how to spot others. Our feature habit of the day is the use of the cat command at the head of a pipeline:

shell-prompt: cat names.txt | sort | uniq > outfile
            

So what's the alternative, what's wrong with using cat this way, what's the big deal, why do people do it, and how do we know it's a problem?

  1. The alternative:

    Most commands used downstream of cat in situations like this (e.g. sort, grep, more, etc.) are capable of reading a file directly if given the filename as an argument:

    shell-prompt: sort names.txt | uniq > outfile
                    

    Even if they don't take a filename argument, we can always use simple redirection instead of a pipe:

    shell-prompt: sort < names.txt | uniq > outfile
                    
  2. The problem:

    • Using cat this way just adds overhead in exchange for no benefit. Pipes are helpful when you have to perform multiple processing steps in sequence. By running multiple processes at the same time instead of one after the other, we can improve resource utilization. For example, while sort is waiting for disk input, uniq can use the CPU. Better yet, on a computer with multiple cores, the processes can utilize two cores at the same time.

      However, the cat command doesn't do any processing at all. It just reads the file and feeds the bytes into the first pipe.

      In using cat this way, here's what happens:

      1. The cat command reads blocks from the file into a file input buffer.
      2. It then copies the input buffer, one byte at a time, to its standard output buffer, without processing the data in any way. It just senselessly moves data (through a proverbial straw) from one memory buffer to another.
      3. When the standard output buffer is full, it is copied to the pipe, which is yet another memory buffer.
      4. Characters in the pipe buffer are copied to the standard input buffer of the next command (e.g. sort).
      5. The sort can finally begin processing the data.

      This is like pouring a drink into a glass, then moving it to a second glass using an eye dropper, then pouring it into a third glass and finally a fourth glass before actually drinking it.

      It's much simpler and less wasteful for the sort command to read directly from the file.

    • Using a pipe this way also prevents the downstream command from optimizing disk access. A program such as sort might use a larger input buffer size to reduce the number of disk reads. Reading fewer, larger blocks from disk can keep the latency incurred for each disk operation from adding up, thereby reducing run time. This is not possible when reading from a pipe, which is a fixed-size memory buffer.

  3. What's the big deal?

    Usually, this is not much of a problem. Wasting a few seconds or minutes on your laptop won't hurt anyone. However, sometimes mistakes like this one are incorporated into HPC cluster jobs using hundreds of cores for weeks at a time. In that case, it could increase run time by several days, delaying the work of other users who have jobs waiting in the queue, as well as your own. Not to mention, the wasted electricity could cost the organization hundreds of dollars and create additional pollution.

  4. Why do people do things like this?

    By far the most common response I get when asking people about this sort of thing is: "[Shrug] I copied this from an example on the web. Didn't really think about it."

    Occasionally, someone might think they are being clever by doing this. They believe that this speeds up processing by splitting the task into two processes, hence utilizing multiple cores, one running cat to handle the disk input and another dedicated to sort or whatever command is downstream. However, this strategy only helps if both processes are CPU-bound, i.e. they spend more time using the CPU than performing input and output. This is not the case for the cat command.

    One might also think it helps by overlapping disk input and CPU processing, i.e. cat can read the next block of data while sort is processing the current one. This may have worked a long time ago using slow disks and unsophisticated operating systems, but it only backfires with modern disks and modern Unix systems that have sophisticated disk buffering.

    In reality, this strategy only increases the amount of CPU time used, and almost always increases run time.

  5. Detection:

    Detecting performance issues is pretty easy. The most common tool is the time command.

    shell-prompt: time fgrep GGTAGGTGAGGGGCGCCTCTAGATCGGAAGAGCACACGTCTGAACTCCAGTCA test.vcf > /dev/null
    2.539u 6.348s 0:09.86 89.9% 92+173k 35519+0io 0pf+0w
                    

    We have to be careful when using time with a pipeline, however. Depending on the shell and the time command used (some shells have an internal implementation), it may not work as expected. We can ensure proper function by wrapping the pipeline in a separate shell process, which is then timed:

    shell-prompt: time sh -c "cat test.vcf | fgrep GGTAGGTGAGGGGCGCCTCTAGATCGGAAGAGCACACGTCTGAACTCCAGTCA > /dev/null"
    2.873u 17.008s 0:13.68 145.2%   33+155k 33317+0io 0pf+0w
                    

    Table 3.11, “Run times of pipes with cat” compares the run times (wall time) and CPU time of the direct fgrep and piped fgrep shown above on three different operating systems.

    All runs were performed on an otherwise idle system. Several trials were run to ensure reliable results. Times from the first read of test.vcf were discarded, since subsequent runs benefit from disk buffering (file contents still in memory from the previous read). The wall time varied significantly on the CentOS system, with the piped command running in less wall time for a small fraction of the trials. The times shown in the table are typical. Times for FreeBSD and MacOS were fairly consistent.

    Note that there is significant variability between platforms that should not be taken too seriously. These tests were not run on identical hardware, so they do not tell us anything about relative operating system performance.

    We can also collect other data using tools such as top to monitor CPU and memory use and iostat to monitor disk activity. These commands are covered in more detail in the section called “Top” and the section called “Iostat”.

Table 3.11. Run times of pipes with cat

System specs            Pipe wall time   No pipe wall time   Pipe CPU time   No pipe CPU time
CentOS 7 i7 2.8GHz      33.43            29.50               13.59           8.45
FreeBSD Phenom 3.2GHz   13.01            8.90                18.76           8.43
MacBook i5 2.7GHz       81.09            81.35               84.02           81.20

Practice

Note

Be sure to thoroughly review the instructions in Section 2, “Practice Problem Instructions” before doing the practice problems below.
  1. How does device independence simplify life for Unix users? Give an example.

  2. Show an example Unix command that displays the input from a mouse as it is being moved or clicked.

  3. What are the standard streams associated with every Unix process? To what file or device are they connected by default?

  4. Show a Unix command that saves the output of ls -l to a file called long-list.txt.

  5. Show a Unix command that appends the output of ls -l /etc to a file called long-list.txt.

  6. Show a Unix command that discards the normal output of ls -l /etc and shows the error messages on the terminal screen.

  7. Show a Bourne shell command that saves the output of ls -al /etc to output.txt and any error messages to errors.txt.

  8. Show a C shell command that saves the output and errors of ls -al /etc to all-output.txt.

  9. How does more list.txt differ from more < list.txt?

  10. Show a Unix command that creates a 1 gigabyte file called new-image filled with 0 bytes.

  11. What are two major advantages of pipes over redirecting to a file and then reading it?

  12. Show a Unix command that lists all the files in and under /etc, sorts them, and paginates the output.

  13. What is a foreground process?

  14. Which program in the following pipeline runs in the foreground?

    shell-prompt: find /etc | sort | more
            
  15. What is a filter program?

  16. What is the maximum number of commands allowed in a Unix pipeline?

  17. Show a Unix command that prints a long listing of /usr/local/bin to the terminal and at the same time saves it to the file local-bin.txt.

  18. Do the same as above, but include any error messages in the file as well. Show the command for both C shell and Bourne shell.

  19. Is it a good idea to feed files into a pipe using cat, rather than have the next command read them directly? Why or why not?

    Example: Which command below is more efficient?
    cat file.txt | sort | uniq
    sort file.txt | uniq