The Unix File System

Unix Files

A Unix file is simply a sequence of bytes (8-bit values) stored on a disk and given a unique name. The bytes in a file may be printable characters (letters, digits, and punctuation symbols), invisible control characters (which cause a printer or terminal to perform actions such as backspacing or scrolling), part of a number (a typical integer or floating point number occupies 4 or 8 bytes), or other non-character, non-numeric data.

This is how Unix sees all files. It takes no interest whatsoever in the meaning of the bytes within a file. The meaning of the content is determined solely by the programs using the file.
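
To make the "sequence of bytes" idea concrete, the following minimal sketch stores four bytes in a scratch file and then examines them with the standard wc and od commands. The variable names are only illustrative, and the scratch file name is whatever mktemp happens to generate.

```shell
# Store four bytes in a scratch file (the name is chosen by mktemp).
scratch=$(mktemp)
printf 'Hi!\n' > "$scratch"

# wc -c counts bytes; tr strips the padding that some systems add.
byte_count=$(wc -c < "$scratch" | tr -d ' ')
echo "The file holds $byte_count bytes"

# od -c dumps every byte, showing control characters as escapes
# such as \n for the newline at the end of the line.
od -An -c "$scratch"

rm -f "$scratch"
```

Nothing in the file itself says that these bytes are text. Only the programs reading it (here, od with the -c flag) choose to treat them as characters.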

Text vs Binary Files

Files are often classified as either text or binary files. All of the bytes in a text file are interpreted as ASCII/ISO characters by the programs that read or write the file, while binary files may contain both character and non-character data.

Again, Unix does not make a distinction between text and binary files. This is left to the programs that use the files.

Example 3.8. Practice Break

Try the following commands:

shell-prompt: cat /etc/hosts
                    

What do you see? The /etc/hosts file is a text file, and cat is used here to echo (concatenate) it to the terminal output.

Now try the following:

shell-prompt: cat /bin/ls
                    

What do you see? The file /bin/ls is not a text file. It contains binary program code, not characters. The cat command does not examine the content of the file. It simply sends each byte to your terminal, exactly as it would for a text file. The terminal tries to interpret each byte as an ASCII/ISO character and display it on the screen. Since the file does not contain a sequence of characters, it appears as nonsense on your terminal. Some of the bytes sent to the terminal may even knock it out of whack, causing it to behave strangely. If this happens, run the reset command to restore your terminal to its default state.


Unix vs. Windows Text Files

While it is the program that interprets the contents of a file, there are some conventions regarding text file format that all Unix programs follow, so that they can all manipulate the same files. Unfortunately, Windows programs follow different conventions. Unix programs assume that text files terminate each line with a control character known as a line feed (also known as a newline or NL for short), which is character number 10 in the standard ASCII/ISO character sets. Windows programs use both a carriage return or CR (character number 13) and NL.

Text files created on Windows will contain both a CR and NL at the end of each line. Text files created on Unix will have only an NL. This can cause problems for programs on either Unix or Windows. Hence, it is not a good idea to use a Windows editor to write code for Unix systems or vice-versa.

The dos2unix and unix2dos commands can be used to clean up files that have been transferred between Unix and Windows. These programs convert text files between the Windows and Unix standards. If you've edited a text file on a non-Unix system, and are now using it on a Unix system, you can clean it up by running:

shell-prompt: dos2unix filename
                

The dos2unix and unix2dos commands are not standard with most Unix systems, but they are free programs that can easily be added via most package managers.
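
Where dos2unix is not installed, the standard tr command can serve as a rough substitute. The sketch below builds a small Windows-style file, strips the carriage returns, and compares the sizes. The file names windows.txt and unix.txt are made up for the example. Note also that tr -d '\r' removes every CR in the file, not only those at the ends of lines, which is usually but not always what we want.

```shell
workdir=$(mktemp -d)

# Two lines, each terminated by CR + NL as a Windows editor would write them.
printf 'line one\r\nline two\r\n' > "$workdir/windows.txt"

# Delete every carriage return, leaving Unix-style NL line endings.
tr -d '\r' < "$workdir/windows.txt" > "$workdir/unix.txt"

win_size=$(wc -c < "$workdir/windows.txt" | tr -d ' ')
unix_size=$(wc -c < "$workdir/unix.txt" | tr -d ' ')
echo "Windows-style: $win_size bytes, Unix-style: $unix_size bytes"

# od -c makes the \r \n pairs visible.
od -An -c "$workdir/windows.txt"

rm -r "$workdir"
```

The Unix-style copy is two bytes smaller, one for each CR removed.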

Caution

Note that dos2unix and unix2dos should only be used on text files. They should never be used on binary files, since the contents of a binary file are not meant to be interpreted as characters such as line feeds and carriage returns.

File system Organization
Basic Concepts

A Unix file system contains files and directories. A file is like a document, and a directory is like a folder that contains documents and/or other directories. The terms "directory" and "folder" are interchangeable, but "directory" is the standard term used in Unix.

Directories are so called because they serve the same purpose as the directory you might find in the lobby of an office building: They are listings that keep track of what files and other directories are called and where they are located on the disk.

Note

Unix file systems use case-sensitive file and directory names. I.e., Temp is not the same as temp, and both can coexist in the same directory.

macOS is the only mainstream Unix system that violates this convention. The standard macOS file system is case-preserving, but not case-sensitive. This means that if you call a file Temp, it will remember that the T is capital, but it can also be referred to as temp, tEmp, etc. Only one of these files can exist in a given directory at any one time.

A Unix file system can be visualized as a tree, with each file and directory contained within another directory. Figure 3.2, “Sample of a Unix File system” shows a small portion of a typical Unix file system. On a real Unix system, there are usually thousands of files and directories. Directories are shown in green and files are in yellow.

Figure 3.2. Sample of a Unix File system


Unix uses a forward slash (/) to separate directory and file names while Windows uses a backslash (\).

The one directory that is not contained within any other is known as the root directory, whose name under Unix is /. There is exactly one root directory on every Unix system. Windows systems, on the other hand, have a root directory for each disk partition such as C:\ and D:\.

The Cygwin compatibility layer works around the separate drive letters of Windows by unifying them under a common parent directory called /cygdrive. Hence, for Unix commands run under Cygwin, /cygdrive/c is equivalent to c:\, /cygdrive/d is equivalent to d:\, and so on. This allows Cygwin users to do things like search multiple Windows drive letters with a single command starting in /cygdrive.

Unix file system trees are fairly standardized, but most have some variation. For instance, all Unix systems have a /bin and a /usr/bin, which contain standard Unix commands. Not all of them have /home or /usr/local. Many Linux systems install commands from add-on packages into /usr/bin, mixing them with the standard Unix commands that are essential to the basic functioning of the system. Other systems such as most BSDs keep them separated in /usr/local/bin or /usr/pkg/bin.

The root directory is the parent of /bin and /home and an ancestor of all other files and directories.

The /bin and /home directories are subdirectories, or children of /. Likewise, /home/joe and /home/sue are subdirectories of /home, and grandchildren of /.

All of the files in and under /home comprise a subtree of /home.

The children of a directory, all of their children, and so on, are known as descendants of the directory. All files and directories on a Unix system, except /, are descendants of /.

Each user has a home directory, which can be arbitrarily assigned, but is generally a child of /home on many Unix systems or of /Users on macOS. Most or all of a user's files and subdirectories are found under their home directory. In the example above, /home/joe is the home directory for user joe, and /home/sue is the home directory for user sue.

In some situations, a home directory can be referred to as ~ or ~user. For example, user joe can refer to his home directory as ~, ~/, or ~joe, while he can only refer to sue's home directory as ~sue.

Absolute Path Names

The absolute path name, also known as full path name, of a file or directory denotes the complete path from / (the root directory) to the file or directory of interest. It is the path we would "walk" from the root directory (/) to the file or directory of interest. For example, the absolute path name of Sue's .cshrc file is /home/sue/.cshrc, and the absolute path name of the ape command is /usr/local/bin/ape. To walk the directory tree, we would start in / and progress from there:

Start in /
Go to    /usr
Go to    /usr/local
Go to    /usr/local/bin
End at   /usr/local/bin/ape
                

An absolute path name identifies a file or directory unambiguously: it always refers to the same location in the file system, no matter where it is used.

Note

An absolute path name always begins with '/' or a '~', noting that '~' is shorthand for a path that begins with a '/' such as /home/joe or /Users/joe.

Example 3.9. Practice Break

Try the following commands:

shell-prompt: ls
shell-prompt: ls /etc
shell-prompt: cat /etc/hosts
shell-prompt: ls ~
                    

Current Working Directory

Every Unix process has an attribute called the current working directory, or CWD. This is the directory that the process is currently "in". When you first log into a Unix system, the shell process's CWD is set to your home directory.

Note

It is important to understand that the CWD is a property of each process, not of a user or a program.

The pwd (print working directory) command prints the CWD of the shell process. The cd (change directory) command changes the CWD of the shell process. Running cd with no arguments sets the CWD to your home directory, much like clicking your heels together three times to get back to Kansas. Running cd - changes the CWD to its previous value.

Example 3.10. Practice Break

Try the following commands:

shell-prompt: pwd
shell-prompt: cd /
shell-prompt: pwd
shell-prompt: cd
shell-prompt: pwd
shell-prompt: cd -
shell-prompt: pwd
shell-prompt: cd -
shell-prompt: pwd
                    

Many commands, such as ls, use the CWD as a default if you don't provide a directory name on the command line. For example, if the CWD is /home/joe, then the following commands are the same:

shell-prompt: ls
shell-prompt: ls /home/joe
shell-prompt: ls ~joe
                
Relative Path Names

Whereas an absolute path name denotes the path from / to a file or directory, the relative path name denotes the path from the CWD to a file or directory.

Any path name that does not begin with a '/' or '~' is interpreted as a relative path name. The absolute path name is then derived by appending the relative path name to the CWD. For example, if the CWD is /etc, then the relative path name hosts refers to the absolute path name /etc/hosts, and the relative path name of /etc/ssh/ssh_config is ssh/ssh_config.

absolute path name = CWD + "/" + relative path name
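
The rule above can be sketched as a small shell function. The name to_absolute is made up for this illustration and is not a standard command. It checks only for a leading '/', since the shell has normally expanded any leading '~' before a command sees its arguments, and it does not special-case a CWD of /.

```shell
# to_absolute is a hypothetical helper, not a standard Unix command.
to_absolute() {
    case "$1" in
        /*) printf '%s\n' "$1" ;;             # already an absolute path name
        *)  printf '%s/%s\n' "$PWD" "$1" ;;   # CWD + "/" + relative path name
    esac
}

cd /etc
abs_hosts=$(to_absolute hosts)
abs_config=$(to_absolute ssh/ssh_config)
abs_same=$(to_absolute /etc/hosts)

echo "$abs_hosts"     # /etc/hosts
echo "$abs_config"    # /etc/ssh/ssh_config
echo "$abs_same"      # /etc/hosts
```

The function only manipulates the names. It does not check whether the files actually exist.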

Note

Since the CWD is a property of each process, a relative path name is not the same for all processes. Relative path names for the same file may be different for different processes, or for the same process before and after it changes its CWD. For example, the relative path name bin means /bin when the CWD is / and /usr/bin when the CWD is /usr.

Note

Relative path names are handled at the lowest level of the operating system, by the Unix kernel. This means that they can be used anywhere: in shell commands, in C or Fortran programs, etc.

When you run a program from the shell, the new process inherits the CWD from the shell. Hence, you can use relative path names as arguments in any Unix command, and they will use the CWD inherited from the shell process. For example, the two cat commands below have the same effect.

shell-prompt: cd /etc        # Set shell's CWD to /etc
shell-prompt: cat hosts      # Inherits CWD from shell, so hosts = /etc/hosts
shell-prompt: cat /etc/hosts # Same effect as above
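
The inheritance can also be observed directly. In the sketch below, which assumes only that /usr exists, each sh -c command starts a new process. The first reports the CWD it inherited, and the second changes its own CWD without affecting the shell that started it. The variable names are illustrative.

```shell
cd /usr

inherited=$(sh -c 'pwd')        # the new sh process starts in the shell's CWD
child=$(sh -c 'cd / && pwd')    # this child changes only its own CWD
parent=$(pwd)                   # the shell's CWD is unaffected

echo "inherited: $inherited"    # /usr
echo "child:     $child"        # /
echo "parent:    $parent"       # /usr
```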
                

Wasting Time

The cd command is one of the most overused Unix commands. Many people use it where it is completely unnecessary and actually results in significantly more typing than needed. Don't use cd if it is actually more work than using an absolute path name as an argument. For example, consider the sequence of commands:

shell-prompt: cd /etc
shell-prompt: more hosts
shell-prompt: cd
                    

The same effect could have been achieved much more easily using the following single command:

shell-prompt: more /etc/hosts
                    

Note

In almost all cases, absolute path names and relative path names are interchangeable. You can use either type of path name as a command line argument, or within a program written in almost any language.

Example 3.11. Practice Break

Try to predict the results of the following commands before running them:

shell-prompt: cd
shell-prompt: pwd
shell-prompt: cd /etc
shell-prompt: pwd
shell-prompt: cat hosts
shell-prompt: cat /etc/hosts
shell-prompt: cd
shell-prompt: pwd
shell-prompt: cat hosts
                    

Why does the last command result in an error?


Avoid Absolute Path Names

The relative path name is potentially much shorter than the equivalent absolute path name. Using relative path names also makes code more portable.

Suppose you have a project contained in the directory /Users/joe/Thesis on your Mac. Now suppose you want to work on the same project on an HPC cluster, where there is no /Users directory, and you have to store it in /share1/joe/Thesis.

The absolute path name of every file and directory under Thesis will be different on the cluster than it is on your Mac. This can cause major problems if you were using absolute path names in your scripts, programs, and makefiles. Statements like the following will have to be changed in order to run the program on a different computer.

infile = fopen("/Users/joe/Thesis/Inputs/input1.txt", "r");
                
sort /Users/joe/Thesis/Inputs/names.txt
                

Note

No program should ever have to be altered just to make it run on a different computer. Changes like these are a source of regressions (new program bugs).

While the absolute path names change when you move the Thesis directory, the path names relative to the Thesis directory remain the same. For this reason, absolute path names should be avoided.

The statements below will work on any computer as long as the program or script is running with Thesis as the CWD. It does not matter where the Thesis directory is located, so long as the Inputs directory is its child.

infile = fopen("Inputs/input1.txt", "r");
                
sort Inputs/names.txt
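
As a rough demonstration, the sketch below sets up the same Thesis layout under two unrelated scratch directories, standing in for the Mac and the cluster, and runs the identical relative command in each. The function name run_in_project and the contents of names.txt are invented for the example.

```shell
# run_in_project is a hypothetical helper for this demonstration.
# $1 = the directory that contains Thesis on a given computer
run_in_project() {
    mkdir -p "$1/Thesis/Inputs"
    printf 'sue\njoe\n' > "$1/Thesis/Inputs/names.txt"
    # The same relative path name works wherever Thesis is located,
    # as long as Thesis is the CWD.
    ( cd "$1/Thesis" && sort Inputs/names.txt )
}

mac_like=$(mktemp -d)        # stands in for /Users/joe
cluster_like=$(mktemp -d)    # stands in for /share1/joe

out_mac=$(run_in_project "$mac_like")
out_cluster=$(run_in_project "$cluster_like")

[ "$out_mac" = "$out_cluster" ] && echo "Same result in both locations"

rm -r "$mac_like" "$cluster_like"
```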
                
Special Directory Names

In addition to absolute path names and relative path names, there are a few special symbols for directories that are commonly referenced:

Table 3.5. Special Directory Symbols

Symbol    Refers to
.         The current working directory
..        The parent of the current working directory
~         Your home directory
~user     user's home directory

The '.' notation for CWD is useful for copying files to CWD and other commands that require a target directory name.

shell-prompt: cp /etc/hosts .
                

It is also useful if a mishap occurs, leading to the creation of a file whose name begins with a special character such as '-' or '~'. If we have a file called "-file.txt", we cannot remove it with rm -file.txt, since the rm command will think the '-' indicates a flag argument. To get around this, we simply need to make the argument not begin with a '-'. We can use either the absolute path name of the file, e.g. /home/joe/-file.txt, or the relative path name ./-file.txt, since ./path refers to exactly the same file as path.
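
A short sketch of the ./ technique, run in a scratch directory so that nothing important is at risk. The variable names are illustrative.

```shell
workdir=$(mktemp -d)
cd "$workdir"

touch ./-file.txt    # create the awkwardly named file
ls -A                # shows -file.txt

# "rm -file.txt" would be parsed as a set of flags and fail.
# Prefixing ./ makes the argument an ordinary path name.
rm ./-file.txt

remaining=$(ls -A | wc -l | tr -d ' ')
echo "Files left in the scratch directory: $remaining"

cd /
rm -r "$workdir"
```

Most commands, including rm, also accept a lone -- argument to mark the end of the flags, so rm -- -file.txt is another way to do the same thing.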

The ".." notation refers to the parent of the CWD and allows for relative path names that are not under the CWD. For example, if the CWD is /home/joe, then the relative path of /home/sue/.cshrc is ../sue/.cshrc and the relative path name of /etc/hosts is ../../etc/hosts. We can "walk" a relative path such as ../../etc/hosts just as we walk an absolute path:

Start at /home/joe      (.)
Go to    /home          (..)
Go to    /              (../..)
Go to    /etc           (../../etc)
End at   /etc/hosts     (../../etc/hosts)
                

Note that /home/joe/../sue/.cshrc (/home/joe + / + ../sue/.cshrc) is a valid absolute path name, but it can be shortened to /home/sue/.cshrc. We can always remove a ../ along with the path component to the left of it, such as joe/../. Likewise, /home/joe/../../etc/hosts can be reduced to /home/../etc/hosts and further to /etc/hosts.
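
The shell performs this kind of reduction itself when we cd to a path containing "..". The sketch below recreates a miniature home/joe and home/sue under a scratch directory, which is an assumption made so that the example runs on any system, and shows that pwd reports the shortened path. It assumes no symbolic links are involved, since links can make the textual shortcut and the real directory structure disagree.

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/home/joe" "$workdir/home/sue"

# joe/.. cancels out, leaving .../home/sue
cd "$workdir/home/joe/../sue"
reduced=$(pwd)
echo "$reduced"

cd /
rm -r "$workdir"
```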

Example 3.12. Practice Break

Try the following commands and see what they do:

shell-prompt: cd
shell-prompt: pwd
shell-prompt: ls
shell-prompt: ls ~
shell-prompt: ls .
shell-prompt: mkdir Data Scripts
shell-prompt: cp /etc/hosts .
shell-prompt: mv hosts Data
shell-prompt: ls Data
shell-prompt: ls ./Data
shell-prompt: cd Data
shell-prompt: cd ../Scripts
shell-prompt: ls ..
shell-prompt: ls ../Data
shell-prompt: more ../Data/hosts
shell-prompt: rm ../Data/hosts
shell-prompt: ls ~/Data
shell-prompt: ls /bin
shell-prompt: cd ..
shell-prompt: pwd
                    

Ownership and Permissions
Overview

Every file and directory on a Unix system has inherent access control features based on a simple system:

  • Every file and directory belongs to an individual user and to a group of users.

  • There are 3 types of permissions which are controlled separately from each other:

    • Read
    • Write (modify)
    • Execute (e.g. run a file if it's a program)
  • Read, write, and execute permissions can be granted or denied separately for each of the following:

    • The individual who owns the file (user)
    • The group that owns the file (group)
    • All other users on the system, a hypothetical group known as "world" (other)

Execute permissions on a file mean that the file can be executed as a script or a program by typing its name. It does not mean that the file actually contains a script or a program: It is up to the owner of the file to set the execute permissions appropriately for each file.

Execute permissions on a directory mean that permitted users can cd into it and access the files within it. Read permission on a directory only allows users to list the names it contains. To access a file within the directory, or for their processes to make it the CWD, they need execute permission.

Unix systems implement these access controls using 9 on/off switches (bits) associated with each file: one each for read, write, and execute, for each of user, group, and other.

Viewing Permissions

If you do a long listing of a file or directory, you will see the ownership and permissions:

shell-prompt: ls -l
drwx------   2 joe    users      512 Aug  7 07:52 Desktop/
drwxr-x---  39 joe    users     1536 Aug  9 22:21 Documents/
drwxr-xr-x   2 joe    users      512 Aug  9 22:25 Downloads/
-rw-r--r--   1 joe    users    82118 Aug  2 09:47 bootcamp.pdf
                

The leftmost column shows the type of object and the permissions for each user category.

A '-' in the leftmost character means a regular file, 'd' means a directory, 'l' means a symbolic link, etc. Running man ls will reveal all the codes.

The next three characters are, in order, read, write and execute permissions for the owner (joe).

The next three after that are permissions for members of the owning group (users).

The next three are permissions for world (other).

A '-' in a permission bit column means that the permission is denied for that user or set of users and an 'r', 'w', or 'x' means that read, write, or execute is permitted.

The next three columns show the number of links (different path names for the same file), the individual owner, and the group owner of the file or directory. The remaining columns show the size, the date and time it was last modified, and the name. In addition to the 'd' in the first column, directory names are followed by a '/' if ls is so configured.

You can see above that Joe's Desktop directory is readable, writable, and executable for Joe, and completely inaccessible to everyone else.

Joe's Documents directory is readable, writable and executable for Joe, and readable and executable for members of the group "users". Users not in the group "users" cannot access the Documents directory at all.

Joe's Downloads directory is readable and executable to anyone who can log into the system.

The file bootcamp.pdf is readable by group and world, but only writable by Joe. It is not executable by anyone, which makes sense because a PDF file is not a program.
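
The layout of the leftmost column can be taken apart mechanically. The sketch below creates an empty stand-in for bootcamp.pdf in a scratch directory, gives it the same permissions as in the listing above using the octal form of chmod, and then cuts the first ten characters of the ls -l output into the four fields just described. The variable names are illustrative.

```shell
workdir=$(mktemp -d)
touch "$workdir/bootcamp.pdf"
chmod 644 "$workdir/bootcamp.pdf"    # rw-r--r--, as in the listing above

# The first 10 characters of ls -l: 1 type character + 9 permission bits
mode=$(ls -l "$workdir/bootcamp.pdf" | cut -c1-10)

file_type=$(printf '%s\n' "$mode" | cut -c1)       # '-' = regular file
owner_perms=$(printf '%s\n' "$mode" | cut -c2-4)   # permissions for the owner
group_perms=$(printf '%s\n' "$mode" | cut -c5-7)   # permissions for the group
world_perms=$(printf '%s\n' "$mode" | cut -c8-10)  # permissions for world

echo "type=$file_type owner=$owner_perms group=$group_perms world=$world_perms"

rm -r "$workdir"
```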

Setting Permissions

Users cannot change individual ownership on a file, since this would allow them to subvert disk quotas and do other malicious acts by placing their files under someone else's name. Only the superuser (the system administrator) can change the individual ownership of a file or directory.

Every user has a primary group and may also be a member of supplementary groups. Users can change the group ownership of a file to any group that they belong to using the chgrp command, which requires a group name as the first argument and one or more path names following the group:

shell-prompt: chgrp group path [path ...]
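
To see which groups are available to you, the standard id command can be used. The sketch below lists them and then sets the group of a scratch file to your primary group, which is the one group every user is certain to belong to. The file name results.txt is made up, and reading the group from the fourth column of ls -l assumes a group name without spaces.

```shell
id -gn    # your primary group
id -Gn    # every group you belong to, primary group included

workdir=$(mktemp -d)
touch "$workdir/results.txt"

my_group=$(id -gn)
chgrp "$my_group" "$workdir/results.txt"

# The 4th column of ls -l is the group that owns the file.
file_group=$(ls -l "$workdir/results.txt" | awk '{print $4}')
echo "results.txt belongs to group $file_group"

rm -r "$workdir"
```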
                

All sharing of files on Unix systems is done by controlling group ownership and file permissions.

File permissions are changed using the chmod command:

shell-prompt: chmod permission-specification path [path ...]
                

The permission specification has a symbolic form, and a raw form, which is an octal number.

The symbolic form consists of one or more of the three user categories 'u' (user/owner), 'g' (group), and 'o' (other/world) followed by a '+' (grant) or '-' (revoke), and finally one or more of the three permissions 'r', 'w', or 'x'.

To add read and execute (cd) permissions for group and world on the Documents directory:

shell-prompt: chmod go+rx Documents
                

Sometimes it is impossible to express the changes we want to make in one simple specification. In that case, we can use a compound specification, two or more basic specs separated by commas. Remember that white space indicates the end of an argument, so we cannot have any white space next to the comma.

To revoke all permissions for world on the Documents directory and grant read permission for the group:

shell-prompt: chmod o-rwx,g+r Documents
                

Disable write permission for everyone, including the owner, on bootcamp.pdf. This can be used to prevent the owner from accidentally overwriting an important file, and it makes rm ask for confirmation before deleting it (unless the -f flag is used).

shell-prompt: chmod ugo-w bootcamp.pdf
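
The effect of the last two specifications can be checked on scratch copies. This sketch assumes starting permissions of rwxr-xr-x for the directory and rw-r--r-- for the file, set here with the octal form of chmod, and then applies the same symbolic specifications shown above. The variable names are illustrative.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/Documents"
touch "$workdir/bootcamp.pdf"
chmod 755 "$workdir/Documents"       # assumed starting point: rwxr-xr-x
chmod 644 "$workdir/bootcamp.pdf"    # assumed starting point: rw-r--r--

chmod o-rwx,g+r "$workdir/Documents"   # compound spec, no white space
chmod ugo-w "$workdir/bootcamp.pdf"    # write-protect for everyone

dir_mode=$(ls -ld "$workdir/Documents" | cut -c1-10)
file_mode=$(ls -l "$workdir/bootcamp.pdf" | cut -c1-10)
echo "Documents:    $dir_mode"     # drwxr-x---
echo "bootcamp.pdf: $file_mode"    # -r--r--r--

rm -rf "$workdir"
```

Note that the symbolic form only changes the bits it names. The group's existing execute permission on Documents is left alone, because the specification mentions only 'r' for the group.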
                

Run man chmod for additional information.

The raw form for permissions uses a 3-digit octal number to represent the 9 permission bits. This is a quick and convenient method for computer nerds who can do octal/binary conversions in their head.

shell-prompt: chmod 644 bootcamp.pdf   # 644 = 110100100 = rw-r--r--
shell-prompt: chmod 750 Documents      # 750 = 111101000 = rwxr-x---
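
For those who do not do octal conversions in their head, each digit is simply the sum of 4 for read, 2 for write, and 1 for execute. The sketch below builds the 750 used above from those values and applies it to a scratch directory. The variable names are illustrative.

```shell
r=4; w=2; x=1            # value of each permission bit within one digit

owner=$((r + w + x))     # rwx -> 7
group=$((r + x))         # r-x -> 5
world=0                  # --- -> 0
octal_mode="$owner$group$world"
echo "octal mode: $octal_mode"     # 750

workdir=$(mktemp -d)
mkdir "$workdir/Documents"
chmod "$octal_mode" "$workdir/Documents"

listed=$(ls -ld "$workdir/Documents" | cut -c1-10)
echo "ls shows:   $listed"         # drwxr-x---

rm -r "$workdir"
```

Unlike the symbolic form, the octal form always sets all 9 bits at once, regardless of what they were before.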
                

Caution

NEVER make any file or directory world-writable. Doing so allows any other user to modify it, which is a serious security risk. A malicious user could use this to install a Trojan Horse program under your name, for example.

By default, new files you create are owned by you and your primary group. If you are a member of more than one group and wish to share a directory with one of your supplementary groups, it may also be helpful to set a special flag on the directory so that new files created in it will have the same group as the directory, rather than your primary group. Then you won't have to remember to chgrp every new file you create.

shell-prompt: chmod g+s Shared-research
                

Example 3.13. Practice Break

Try the following commands, and try to predict the output of each ls before you run it.

shell-prompt: touch testfile
shell-prompt: ls -l
shell-prompt: chmod go-rwx testfile
shell-prompt: ls -l
shell-prompt: chmod o+rw testfile
shell-prompt: ls -l
shell-prompt: chmod g+rwx testfile
shell-prompt: ls -l
shell-prompt: rm testfile
                    

Now create testfile again and set permissions on it so that it is readable, writable, and executable by you, only readable by the group, and inaccessible to everyone else.


Practice

Note

Be sure to thoroughly review the instructions in Section 2, “Practice Problem Instructions” before doing the practice problems below.

  1. What is a file from the viewpoint of Unix?

  2. What is the difference between a text file and a binary file?

  3. What will happen if you echo a binary file to your terminal?

  4. What is the difference between Windows and Unix text files?

  5. How can we convert text files between the Unix and Windows standards?

  6. What is a directory?

  7. What does it mean that Unix filenames are case-sensitive?

  8. What is a root directory?

  9. How many root directories does a Unix system have? How many does Windows have?

  10. What is contained in the /bin and /usr/bin directories?

  11. What is a subdirectory?

  12. What is a home directory?

  13. What is an absolute path name and how do we recognize one?

  14. What is the absolute path name of Sue's asg01.c in the tree diagram in this section?

  15. Of what is the CWD a property?

  16. Show a Unix command that prints the CWD of a shell process.

  17. Show a Unix command that sets the CWD of a shell process to /tmp.

  18. Show a Unix command that sets the CWD of a shell process to our home directory.

  19. What is a relative path name and how do we recognize one?

  20. Is a relative path name unique? Prove your answer with an example.

  21. How does Unix determine the absolute path name from a relative path name?

  22. If the CWD of a process is /usr/local, what is the absolute path name of "bin/ape"?

  23. If the CWD of a process is /usr/local, what is the relative path name of /usr/local/lib/libxtend.a?

  24. If the CWD of a process is /usr/local, what is the relative path name of /usr/bin?

  25. If the CWD of a process is /usr/local, what is the relative path name of /etc/motd?

  26. Where does a new process get its initial CWD?

  27. Why should we avoid using absolute path names in programs and scripts?

  28. Show a Unix command that lists the contents of the parent directory of CWD.

  29. If the CWD of a process is /home/bob/Programs, what is the relative path name of /home/bob/Data/input1.txt?

  30. How do we remove a file called "~sue" in the CWD?

  31. What are the three user categories that can be granted permissions on a file or directory?

  32. What does it mean to set execute permission on a file? On a directory?

  33. Given the following ls -l output, who can do what to bootcamp.pdf?

    -rw-r-----   1 joe    users    82118 Aug  2 09:47 bootcamp.pdf
            
  34. How would we allow users who are not in the owning group to read bootcamp.pdf?

  35. How would we allow members of the group to read and execute the program "simulation" and at the same time revoke all access to other users?

  36. Show a Unix command that makes the directory "MyScripts" world writable.

  37. Show a Unix command that changes the group ownership of the directory "Research" to the group "smithlab".

  38. Assuming your primary group is "joe", show a Unix command that configures the directory Research from the previous question so that new files you create in it will be owned by "smithlab" instead of "joe".