File Transfer

Many users will need to transfer data to or from remote servers. For example, we often want to analyze publicly available data hosted on a web server. Users of a shared research computer or HPC cluster running Unix may also need to transfer files from their computer to the Unix machine, run research programs, and finally transfer results back to their computer. There are many software tools available to accomplish this. Some of the convenient standard tools are described below.

Downloading Files with Curl, Fetch, and Wget

The curl, fetch, and wget commands are open source command-line tools for downloading files from a remote server. Most often, they serve the same purpose as a web browser. However, they allow us to automate the downloading of files when we know the URL (Uniform Resource Locator), also known as the web address. This is especially useful when we script an analysis that requires many files retrieved from one or more websites. Scripting is covered in Chapter 4, Unix Shell Scripting.

The URL begins with a protocol indicator, such as "https:" or "ftp:". This is followed by the server name (such as those reported by the hostname command), and finally a path name on the remote server.
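
The three parts of a URL described above can be pulled apart with ordinary shell parameter expansion. The following is a minimal sketch, not part of any download tool. The variable names (url, protocol, server, path, filename) are chosen here purely for illustration.

```shell
# Split a URL into protocol, server name, and path using POSIX parameter expansion.
url='http://ftp.ensembl.org/pub/release-107/gff3/homo_sapiens/Homo_sapiens.GRCh38.107.chromosome.1.gff3.gz'

protocol="${url%%://*}"     # Everything before the first "://"
rest="${url#*://}"          # Everything after the first "://"
server="${rest%%/*}"        # Up to, but not including, the first "/"
path="/${rest#*/}"          # The remainder, with the leading "/" restored
filename="${path##*/}"      # Last component of the path

printf 'Protocol: %s\nServer:   %s\nPath:     %s\nFile:     %s\n' \
    "$protocol" "$server" "$path" "$filename"
```

The file name extracted in the last step is the name that a download tool would use when saving the file under its remote name.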

Curl is included in the default installation of some GNU/Linux operating systems and is easily installed via package managers on most other systems. Unlike fetch and wget, it sends the downloaded data to the standard output by default. To save the downloaded file using the same name as on the remote system, we need to add the -O (capital O) flag:

shell-prompt: curl -O http://ftp.ensembl.org/pub/release-107/gff3/homo_sapiens/Homo_sapiens.GRCh38.107.chromosome.1.gff3.gz
            

Fetch is a somewhat simpler FreeBSD-specific tool included in the base system. The FreeBSD ports system is heavily dependent on fetch for automated downloading of files from various developer websites. Relying on the more complex and independently developed curl or wget would be riskier. When writing scripts that download files, it is generally better to use curl or wget so that the script will be portable to other systems that may not offer FreeBSD's fetch as a package. Fetch is mentioned here mainly as a fall-back option for researchers using FreeBSD.

shell-prompt: fetch http://ftp.ensembl.org/pub/release-107/gff3/homo_sapiens/Homo_sapiens.GRCh38.107.chromosome.1.gff3.gz
            

Wget is fairly comparable to curl in its interface and capabilities. It is also included in the base install of some GNU/Linux operating systems and easily installed via most package managers on other systems.

shell-prompt: wget http://ftp.ensembl.org/pub/release-107/gff3/homo_sapiens/Homo_sapiens.GRCh38.107.chromosome.1.gff3.gz
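
A script that must run on several systems could check which of the three tools is installed before downloading. The following is one possible sketch. The function names pick_downloader and download are made up for this example, and the order of preference (curl, then wget, then fetch) is an assumption rather than a recommendation from any of the tools.

```shell
# Print the first download command found on this system.
# Returns non-zero if none of curl, wget, or fetch is installed.
pick_downloader() {
    if command -v curl > /dev/null 2>&1; then
        echo "curl -O"      # -O saves the file under its remote name
    elif command -v wget > /dev/null 2>&1; then
        echo "wget"
    elif command -v fetch > /dev/null 2>&1; then
        echo "fetch"
    else
        return 1
    fi
}

# Download one URL into the current directory using whichever tool is available.
download() {
    downloader=$(pick_downloader) || {
        echo "download: curl, wget, or fetch is required" >&2
        return 1
    }
    # Intentionally unquoted so that "curl -O" splits into a command and a flag
    $downloader "$1"
}

# Example call (commented out so that nothing is downloaded by accident):
# download http://ftp.ensembl.org/pub/release-107/gff3/homo_sapiens/Homo_sapiens.GRCh38.107.chromosome.1.gff3.gz
```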
            

Example 3.32. Practice Break

Run any or all of the sample commands shown above.


Pushing and Pulling Files with SFTP and Rsync

SFTP (SSH File Transfer Protocol, often expanded as Secure File Transfer Protocol) is often used to remotely log into another machine over a network for the purpose of transferring files to or from it. The sftp command is modeled on the older ftp command, which should no longer be used, since it does not use encryption. SFTP itself runs over an encrypted ssh connection. Not all remote Unix systems have SFTP enabled.

SFTP provides a shell-like environment that allows us to list files and directories, cd into subdirectories, push (send, upload) files using put and pull (receive, download) files using get. It does not allow us to run programs on the remote system.

shell-prompt: sftp joe@unixdev1.ceas.uwm.edu
password: (Nothing is echoed when the password is typed)
Connected to unixdev1.ceas.uwm.edu.
sftp> ls
Data                      My Programs               Pictures                  
Qemu                      R                         STRESS                    
sftp> cd Data
sftp> ls
CNC-EMDiff   IRC          
sftp> cd CNC-EMDiff/
sftp> ls
ATAC-Seq        Combined        Common          Misc            README.md       
RNA-Seq         Raw             adapter-stats   backup.sh       todo            
sftp> get backup.sh
Fetching /usr/home/bacon/Data/CNC-EMDiff/backup.sh to backup.sh
backup.sh                                     100%  166     3.4KB/s   00:00    
sftp> exit
            

There are also graphical programs that use the SFTP protocol, such as FileZilla. The vanilla sftp command and tools like FileZilla are convenient for small, simple transfers.

The scp command can be used to transfer files to or from any host that accepts ssh connections. This is a simple command with limited capabilities.
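
As a rough illustration of scp, the sketch below wraps a push and a pull in two small shell functions. The function names push_file and pull_file are hypothetical, and the login, host, and Data directory are simply the ones used in the other examples in this section. Replace them with your own.

```shell
# Hypothetical login and host; replace with your own.
remote='joeuser@unixdev1.ceas.uwm.edu'

# Push: copy a local file into Data under the remote home directory.
push_file() {
    scp "$1" "$remote:Data/"
}

# Pull: copy a file from Data on the remote host into the current directory.
pull_file() {
    scp "$remote:Data/$1" .
}

# Example calls (commented out, since they require access to the remote host):
# push_file backup.sh
# pull_file backup.sh
```

As with rsync, the argument containing "hostname:" is the remote side of the transfer.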

For larger or more sophisticated transfers between Unix systems (including macOS and Cygwin), the recommended transfer tool is rsync. The rsync command is a simple but intelligent tool that makes it easy to synchronize two directories on the same machine or on different machines across a network. Rsync is free software and part of the base installation of many Unix systems, including macOS. On Cygwin, you can easily add the rsync package using the Cygwin Setup utility. Rsync has some major advantages over other file transfer programs:

  • Unlike GUI tools, it can be scripted to automate file transfers as part of an analysis. Scripting is covered in Chapter 4, Unix Shell Scripting.

  • If you have transferred a directory before, and only want to synchronize the destination with the latest changes, rsync will automatically determine the differences between the two copies and only transfer what is necessary. When conducting research that generates large amounts of data, this can save an enormous amount of time.

  • If a transfer fails for any reason (which is fairly common for large transfers due to network hiccups, etc.), the inherent ability to determine the differences between two copies allows rsync to resume from where it left off. Simply run the exact same rsync command again, and the transfer will resume.
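
Because rerunning the same rsync command picks up where the last attempt stopped, a script can simply repeat the command until it succeeds. The following is a minimal sketch of that idea. The retry function and the RETRY_DELAY variable are inventions of this example, not features of rsync.

```shell
# Run a command until it succeeds, up to max_tries attempts.
# Usage: retry max_tries command [arguments ...]
# RETRY_DELAY (seconds between attempts) defaults to 30 if unset.
retry() {
    max_tries=$1
    shift
    try=1
    while ! "$@"; do
        if [ "$try" -ge "$max_tries" ]; then
            echo "retry: giving up after $try attempts" >&2
            return 1
        fi
        echo "retry: attempt $try failed, trying again" >&2
        sleep "${RETRY_DELAY:-30}"
        try=$((try + 1))
    done
}

# Example call (commented out, since it requires access to the remote host):
# retry 5 rsync -av --delete Project joeuser@unixdev1.ceas.uwm.edu:Data
```

Pausing between attempts gives a brief network outage a chance to clear before the next try.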

Rsync can push (send, upload) files from the local machine to a remote machine, or pull (retrieve, download) files from a remote machine to the local machine. The command syntax is basically the same in both cases. It's just a matter of how you specify the source and destination for the transfer.

The rsync command has many options, but the most typical usage is to create an exact copy of a directory on a remote system. The general rsync command to push a new directory or just changes to another host would be:

shell-prompt: rsync -av --delete source-path [username@]hostname:[destination-path]
            

Example 3.33. Pushing data with rsync

The following command synchronizes the directory Project from the local machine to ~joeuser/Data/Project on unixdev1.ceas.uwm.edu:

shell-prompt: rsync -av --delete Project joeuser@unixdev1.ceas.uwm.edu:Data
                

The general syntax for pulling files from another host is:

shell-prompt: rsync -av --delete [username@]hostname:[source-path] destination-path
            

Example 3.34. Pulling data with rsync

The following command synchronizes the directory ~joeuser/Data/Project on unixdev1.ceas.uwm.edu to ./Project on the local machine:

shell-prompt: rsync -av --delete joeuser@unixdev1.ceas.uwm.edu:Data/Project .
                

Note that the only difference between a push and a pull is which argument contains "[username@]hostname:".

This syntax, using a single colon (:) following the host name, tells rsync to use an ssh tunnel. This means that ssh is used to establish a secure connection, and rsync uses that connection to transfer files. Hence, all traffic, including username and password, is encrypted. Rsync can use other connection protocols, but ssh is the most common.

If you omit "username@" from the source or destination, rsync will try to log into the remote system with your username on the local system.

If you omit "destination-path" in an rsync push command, rsync will place the source directory under your home directory on the remote host. If you omit "source-path" in a pull command, rsync will use your home directory on the remote host as the source.

The command-line flags used above have the following meanings:

-a, --archive
Use archive mode, equivalent to -rlptgoD. Archive mode copies all subdirectories recursively and preserves as many file attributes as possible, such as ownership, permissions, etc.
-v, --verbose
Verbose copy: Display names of files and directories as they are copied.
--delete
Delete files and directories from the destination that do not exist in the source. Without --delete, rsync will add and replace files in the destination, but never remove anything. Omitting --delete is a good strategy when using rsync to create backups of important files, since a file accidentally removed from the source will survive in the backup.
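
The effect of --delete can be observed safely with a purely local copy in a scratch directory. The sketch below assumes rsync is installed. The directory names Source and Backup and the file names are made up for the demonstration. It also uses the -n (--dry-run) flag, which reports what rsync would do without changing anything.

```shell
# Work in a scratch directory so that nothing important can be deleted.
scratch=$(mktemp -d)
cd "$scratch" || exit 1
mkdir Source
touch Source/keep.txt Source/old.txt

if command -v rsync > /dev/null 2>&1; then
    rsync -a Source Backup              # Initial copy, creates Backup/Source
    rm Source/old.txt                   # Remove a file from the source only

    rsync -a Source Backup              # Without --delete, old.txt survives in the copy
    ls Backup/Source

    rsync -avn --delete Source Backup   # Dry run: list what would be deleted
    rsync -a --delete Source Backup     # Now old.txt is removed from the copy as well
    ls Backup/Source
else
    echo "rsync is not installed on this system" >&2
fi
```

Previewing a command with -n before running it for real is one way to guard against deleting more than intended.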

Caution

Note that a trailing / on source-path affects where rsync stores the files on the destination system. Without a trailing /, rsync will create a directory called source-path under destination-path on the destination host.

With a trailing / on source-path, destination-path is assumed to be the directory that will replace source-path on the destination host. This feature is a somewhat cryptic method of allowing you to change the name of the directory during the transfer. It matches the behavior of cp -R on BSD-based systems such as FreeBSD and macOS, but not GNU cp, which does not treat a trailing / this way.

Note also that the trailing / only affects the command when applied to source-path. A trailing / on destination-path has no effect.

The command below creates an identical copy of the directory Model in ~/Data/Model on unixdev1.ceas.uwm.edu. The resulting directory is the same regardless of whether the destination directory existed before the command or not.

shell-prompt: rsync -av --delete Model joeuser@unixdev1.ceas.uwm.edu:Data
            

The command below dumps the contents of the local Model directly into ~/Data on unixdev1, and deletes everything else in the Data directory! In other words, it makes the destination directory ~/Data identical to the local directory Model.

Caution

Carelessness with rsync can be very dangerous!

shell-prompt: rsync -av --delete Model/ joeuser@unixdev1.ceas.uwm.edu:Data
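
The difference a trailing / makes can be explored without any risk by copying locally inside a scratch directory. This is only a sketch, assuming rsync is installed. The destination names Data1 and Data2 and the file names are invented for the demonstration.

```shell
# Work in a scratch directory so that no real data is touched.
scratch=$(mktemp -d)
cd "$scratch" || exit 1
mkdir Model
touch Model/params.txt Model/results.txt

if command -v rsync > /dev/null 2>&1; then
    # No trailing slash: the directory itself is copied, giving Data1/Model/params.txt
    rsync -a Model Data1

    # Trailing slash: only the contents are copied, giving Data2/params.txt
    rsync -a Model/ Data2

    # List both results side by side for comparison
    find Data1 Data2 | sort
else
    echo "rsync is not installed on this system" >&2
fi
```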
            

Note that if using globbing to specify files to pull from the remote system, any globbing patterns must be protected from expansion by the local shell by escaping them or enclosing them in quotes. We want the pattern expanded on the remote system, not the local system:

shell-prompt: rsync -av --delete joeuser@unixdev1.ceas.uwm.edu:Data/Study\* .
shell-prompt: rsync -av --delete 'joeuser@unixdev1.ceas.uwm.edu:Data/Study*' .
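
What the local shell does with a quoted or unquoted pattern can be seen by counting the arguments a command receives. The sketch below uses a made-up helper function, count_args, and two empty local files. It does not involve rsync or a remote host.

```shell
# Create two files matching the pattern in a scratch directory.
scratch=$(mktemp -d)
cd "$scratch" || exit 1
touch Study1 Study2

# Print the number of arguments received.
count_args() { echo "$#"; }

count_args Study*       # Unquoted: the local shell expands the pattern to two file names
count_args 'Study*'     # Quoted: one argument, the literal pattern
count_args Study\*      # Escaped: also one argument, the literal pattern
```

In the rsync commands above, an unquoted pattern would be expanded locally only if it happened to match a local path. Bourne-family shells typically pass an unmatched pattern through unchanged, while some shells, such as zsh and the csh family, report an error instead. Quoting or escaping avoids depending on either behavior.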
            

Example 3.35. Practice Break

If you have access to a remote Unix system, run the following commands, replacing "unixdev1.ceas.uwm.edu" with "your-username@your-remote-hostname".

shell-prompt: mkdir -p Temp
shell-prompt: touch Temp/temp1.txt Temp/temp2.txt
shell-prompt: rsync -av Temp unixdev1.ceas.uwm.edu:
shell-prompt: ssh unixdev1.ceas.uwm.edu ls Temp
shell-prompt: rm Temp/temp2.txt
shell-prompt: rsync -av Temp unixdev1.ceas.uwm.edu:
shell-prompt: ssh unixdev1.ceas.uwm.edu ls Temp
                

Rsync can also be used to copy files locally, though creating multiple copies of a file on the same computer is generally senseless. If you don't have access to a remote Unix system, you can use the commands below to practice rsync.

shell-prompt: mkdir -p Temp
shell-prompt: touch Temp/temp1.txt Temp/temp2.txt
shell-prompt: rsync -av Temp Temp2
shell-prompt: ls Temp2
shell-prompt: rm Temp/temp2.txt
shell-prompt: rsync -av Temp Temp2
shell-prompt: ls Temp2
shell-prompt: rm -r Temp2
                

Practice

Note

Be sure to thoroughly review the instructions in Section 2, “Practice Problem Instructions” before doing the practice problems below.

  1. What are three commands we can use in place of a web browser to download files? Which should we generally use in scripts that need to be portable? Why?

  2. Name three Unix commands that can be used to transfer files between two systems.

  3. Describe three advantages of rsync over other file transfer tools.

  4. What is the meaning of a trailing '/' on the source directory?

  5. Show an rsync command that makes the directory ~/Data/Study1 on unixdev1.ceas.uwm.edu identical to MyStudy on the local machine.

  6. Show an rsync command that makes the local directory MyStudy identical to ~/Data/Study1 on unixdev1.ceas.uwm.edu.