Many users will need to transfer data to or from remote servers. For example, we often want to analyze publicly available data hosted on a web server. Users of a shared research computer or HPC cluster running Unix may also need to transfer files from their computer to the Unix machine, run research programs, and finally transfer results back to their computer. There are many software tools available to accomplish this. Some of the convenient standard tools are described below.
The curl, fetch, and wget commands are open source command-line tools for downloading files from a remote server. Most often, they serve the same purpose as a web browser. However, they allow us to automate the downloading of files when we know the URL (Uniform Resource Locator), also known as the web address. This is especially useful when we script an analysis that requires many files retrieved from one or more websites. Scripting is covered in Chapter 4, Unix Shell Scripting.
The URL begins with a protocol indicator, such as "https:" or "ftp:". This is followed by the server name (such as those reported by the hostname command), and finally a path name on the remote server.
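The three parts of a URL can be separated with nothing more than shell parameter expansion. The following is a minimal sketch, assuming a URL of the simple form protocol://server/path (no port number, username, or query string); the variable names are chosen only for illustration.

```shell
#!/bin/sh
# Split a URL into protocol, server name, and path using only
# POSIX parameter expansion (no external commands).
# Assumes the simple form protocol://server/path.
url='http://ftp.ensembl.org/pub/release-107/gff3/homo_sapiens/Homo_sapiens.GRCh38.107.chromosome.1.gff3.gz'

protocol="${url%%://*}"     # everything before the first "://"
rest="${url#*://}"          # everything after the first "://"
server="${rest%%/*}"        # up to, but not including, the first "/"
path="/${rest#*/}"          # the remainder, with its leading "/" restored
filename="${url##*/}"       # the last component of the path

printf 'protocol: %s\n' "$protocol"
printf 'server:   %s\n' "$server"
printf 'path:     %s\n' "$path"
printf 'filename: %s\n' "$filename"
```

The filename extracted this way is the name that curl -O, fetch, and wget use for the saved file by default.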
Curl is included in the default installation of some GNU/Linux operating systems and is easily installed via package managers on most other systems. Unlike other tools, it sends output to the standard output by default. To save the downloaded file using the same name as on the remote system, we need to add the -O (capital O) flag:
shell-prompt: curl -O http://ftp.ensembl.org/pub/release-107/gff3/homo_sapiens/Homo_sapiens.GRCh38.107.chromosome.1.gff3.gz
Fetch is a somewhat simpler FreeBSD-specific tool included in the base system. The FreeBSD ports system is heavily dependent on fetch for automated downloading of files from various developer websites. Relying on the more complex and independently developed curl or wget would be riskier. When writing scripts that download files, it is generally better to use curl or wget so that the script will be portable to other systems that may not offer FreeBSD's fetch as a package. Fetch is mentioned here mainly as a fall-back option for researchers using FreeBSD.
shell-prompt: fetch http://ftp.ensembl.org/pub/release-107/gff3/homo_sapiens/Homo_sapiens.GRCh38.107.chromosome.1.gff3.gz
Wget is fairly comparable to curl in its interface and capabilities. It is also included in the base install of some GNU/Linux operating systems and easily installed via most package managers on other systems.
shell-prompt: wget http://ftp.ensembl.org/pub/release-107/gff3/homo_sapiens/Homo_sapiens.GRCh38.107.chromosome.1.gff3.gz
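A script that must run on several systems can check which of these tools is installed before downloading anything. The sketch below is one possible approach, not a standard recipe. The function names pick_downloader and download_chromosomes are hypothetical, and the URLs for chromosomes other than 1 are an assumption extrapolated from the single URL shown above. By default the script only prints the commands it would run; set DRY_RUN=0 to perform the downloads.

```shell
#!/bin/sh
# Print the download command to use, preferring curl, then wget,
# then fetch.  Returns non-zero if none of the three is installed.
pick_downloader() {
    if command -v curl > /dev/null 2>&1; then
        echo 'curl -O'
    elif command -v wget > /dev/null 2>&1; then
        echo 'wget'
    elif command -v fetch > /dev/null 2>&1; then
        echo 'fetch'
    else
        return 1
    fi
}

# Assumption: every chromosome file follows the naming pattern of the
# chromosome 1 file used in the examples above.
base='http://ftp.ensembl.org/pub/release-107/gff3/homo_sapiens'

# Download (or, by default, just print the command for) each
# chromosome named on the command line.
download_chromosomes() {
    downloader=$(pick_downloader) || return 1
    for chr in "$@"; do
        url="$base/Homo_sapiens.GRCh38.107.chromosome.$chr.gff3.gz"
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "$downloader $url"
        else
            # $downloader is deliberately unquoted so that
            # "curl -O" splits into a command and its flag.
            $downloader "$url"
        fi
    done
}

download_chromosomes 1 2 3 || echo 'Install curl or wget first.' >&2
```

Printing the commands first makes it easy to verify the generated URLs before committing to a long series of downloads.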
SFTP (Secure File Transfer Protocol) is often used to connect to another machine over a network for the purpose of transferring files to or from it. It runs over an ssh connection and serves as a secure replacement for the older ftp, which should no longer be used, since it does not use encryption. Not all remote Unix systems have SFTP enabled.
SFTP provides a shell-like environment that allows us to list files and directories, cd into subdirectories, push (send, upload) files using put and pull (receive, download) files using get. It does not allow us to run programs on the remote system.
shell-prompt: sftp joe@unixdev1.ceas.uwm.edu
password: (Nothing is echoed when the password is typed)
Connected to unixdev1.ceas.uwm.edu.
sftp> ls
Data My Programs Pictures Qemu R STRESS
sftp> cd Data
sftp> ls
CNC-EMDiff IRC
sftp> cd CNC-EMDiff/
sftp> ls
ATAC-Seq Combined Common Misc README.md RNA-Seq Raw adapter-stats backup.sh todo
sftp> get backup.sh
Fetching /usr/home/bacon/Data/CNC-EMDiff/backup.sh to backup.sh
backup.sh 100% 166 3.4KB/s 00:00
sftp> exit
There are also graphical programs that use the SFTP protocol, such as FileZilla. The vanilla sftp command and tools like FileZilla are convenient for small, simple transfers.
The scp command can be used to transfer files to any host that accepts ssh connections. This is a simple command with limited capabilities.
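The basic forms of the scp command are sketched below. The host and username are the ones used elsewhere in this section, while the file and directory names are hypothetical. Each command is prefixed with echo, so running the script only prints the commands; remove the echo to perform the transfers.

```shell
#!/bin/sh
# Sketch of typical scp usage.  The leading "echo" makes this script
# print each command instead of running it.
# File and directory names are hypothetical.
remote='joeuser@unixdev1.ceas.uwm.edu'

# Push: copy a local file into Data under the remote home directory.
echo scp results.txt "$remote:Data/"

# Pull: copy a remote file into the current local directory.
echo scp "$remote:Data/results.txt" .

# -r copies a directory and everything under it.
echo scp -r Project "$remote:Data/"
```

As with rsync, the argument containing "hostname:" determines the direction of the transfer.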
For larger and more sophisticated transfers between Unix systems (including macOS and Cygwin), the recommended transfer tool is rsync. The rsync command is a simple but intelligent tool that makes it easy to synchronize two directories on the same machine or on different machines across a network. Rsync is free software and part of the base installation of many Unix systems including macOS. On Cygwin, you can easily add the rsync package using the Cygwin Setup utility. Rsync has some major advantages over other file transfer programs:
Unlike GUI tools, it can be scripted to automate file transfers as part of an analysis. Scripting is covered in Chapter 4, Unix Shell Scripting.
Rsync can push (send, upload) files from the local machine to a remote machine, or pull (retrieve, download) files from a remote machine to the local machine. The command syntax is basically the same in both cases. It's just a matter of how you specify the source and destination for the transfer.
The rsync command has many options, but the most typical usage is to create an exact copy of a directory on a remote system. The general rsync command to push a new directory or just changes to another host would be:
shell-prompt: rsync -av --delete source-path [username@]hostname:[destination-path]
Example 3.33. Pushing data with rsync
The following command synchronizes the directory Project from the local machine to ~joeuser/Data/Project on unixdev1.ceas.uwm.edu:
shell-prompt: rsync -av --delete Project joeuser@unixdev1.ceas.uwm.edu:Data
The general syntax for pulling files from another host is:
shell-prompt: rsync -av --delete [username@]hostname:[source-path] destination-path
Example 3.34. Pulling data with rsync
The following command synchronizes the directory ~joeuser/Data/Project on unixdev1.ceas.uwm.edu to ./Project on the local machine:
shell-prompt: rsync -av --delete joeuser@unixdev1.ceas.uwm.edu:Data/Project .
Note that the only difference between a push and a pull is which argument contains "[user@]hostname:".
This syntax, using a single colon (:) following the host name, tells rsync to use an ssh tunnel. This means that ssh is used to establish a secure connection, and rsync uses that connection to transfer files. Hence, all traffic, including username and password, is encrypted. Rsync can use other connection protocols, but ssh is the most common.
If you omit "username@" from the source or destination, rsync will try to log into the remote system with your username on the local system.
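If you omit "destination-path" in an rsync push command, rsync will place the source directory under your home directory on the remote host. If you omit "source-path" in a pull command, rsync will use your home directory on the remote host as the source. A relative path on the remote host is interpreted relative to your home directory there.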
The command-line flags used above have the following meanings:
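-a: Use archive mode, which is equivalent to -rlptgoD. Archive mode copies all subdirectories recursively and preserves as many file attributes as possible, such as ownership, permissions, etc.
-v: Use verbose mode, in which rsync reports the files it transfers.
--delete: Delete files and directories from the destination that are not present in the source. Without --delete, rsync will add and replace files in the destination, but never remove anything. Omitting --delete is a good strategy when using rsync to create backups of important files.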
Note that a trailing “/” on source-path affects where rsync stores the files on the destination system. Without a trailing “/”, rsync will create a directory called “source-path” under “destination-path” on the destination host.
With a trailing “/” on source-path, rsync copies the contents of source-path directly into destination-path, so destination-path itself becomes the copy of source-path on the destination host. This feature is a somewhat cryptic method of allowing you to change the name of the directory during the transfer. It is compatible with the behavior of cp -R on BSD-based systems, though not with that of GNU cp.
Note also that the trailing “/” only affects the command when applied to source-path. A trailing “/” on destination-path has no effect.
The command below creates an identical copy of the directory Model in ~/Data/Model on unixdev1.ceas.uwm.edu. The resulting directory is the same regardless of whether the destination directory existed before the command or not.
shell-prompt: rsync -av --delete Model joeuser@unixdev1.ceas.uwm.edu:Data
The command below dumps the contents of the local Model directly into ~/Data on unixdev1, and deletes everything else in the Data directory! In other words, it makes the destination directory ~/Data identical to the local directory Model.
Carelessness with rsync can be very dangerous!
shell-prompt: rsync -av --delete Model/ joeuser@unixdev1.ceas.uwm.edu:Data
Note that if using globbing to specify files to pull from the remote system, any globbing patterns must be protected from expansion by the local shell by escaping them or enclosing them in quotes. We want the pattern expanded on the remote system, not the local system:
shell-prompt: rsync -av --delete joeuser@unixdev1.ceas.uwm.edu:Data/Study\* .
shell-prompt: rsync -av --delete 'joeuser@unixdev1.ceas.uwm.edu:Data/Study*' .
Example 3.35. Practice Break
If you have access to a remote Unix system, run the following commands, replacing "unixdev1.ceas.uwm.edu" with "your-username@your-remote-hostname".
shell-prompt: mkdir -p Temp
shell-prompt: touch Temp/temp1.txt Temp/temp2.txt
shell-prompt: rsync -av Temp unixdev1.ceas.uwm.edu:
shell-prompt: ssh unixdev1.ceas.uwm.edu ls Temp
shell-prompt: rm Temp/temp2.txt
shell-prompt: rsync -av Temp unixdev1.ceas.uwm.edu:
shell-prompt: ssh unixdev1.ceas.uwm.edu ls Temp
Rsync can also be used to copy files locally, though creating multiple copies of a file on the same computer is generally senseless. If you don't have access to a remote Unix system, you can use the commands below to practice rsync.
shell-prompt: mkdir -p Temp
shell-prompt: touch Temp/temp1.txt Temp/temp2.txt
shell-prompt: rsync -av Temp Temp2
shell-prompt: ls Temp2/Temp
shell-prompt: rm Temp/temp2.txt
shell-prompt: rsync -av Temp Temp2
shell-prompt: ls Temp2/Temp
shell-prompt: rm -r Temp2
What are three commands we can use in place of a web browser to download files? Which should we generally use in scripts that need to be portable? Why?
Name three Unix commands that can be used to transfer files between two systems.
Describe three advantages of rsync over other file transfer tools.
What is the meaning of a trailing '/' on the source directory?
Show an rsync command that makes the directory ~/Data/Study1 on unixdev1.ceas.uwm.edu identical to MyStudy on the local machine.
Show an rsync command that makes the local directory MyStudy identical to ~/Data/Study1 on unixdev1.ceas.uwm.edu.