Data Transfer

Storage is not the only problem associated with big data. Transferring the data is also a challenge, especially over great distances and across different computer platforms.

This can be particularly problematic for small organizations that do not have a high-bandwidth Internet connection. While the Internet backbone may provide plenty of speed to transfer your research data in a reasonable amount of time, the connection from the Internet into your building may be a severe bottleneck. This is known as the last-mile problem.

One potential solution to this problem is to avoid transferring the data in the first place. Some organizations offer web-based tools to allow users elsewhere to perform common analyses on their data without first downloading it. For example, if you want to search for a DNA sequence in the genomes of many organisms, you can do so on the NCBI BLAST website. This means uploading a short DNA sequence to the NCBI server rather than downloading many gigabytes of genome data.

Another potential solution is to perform a more selective transfer. Determining exactly which parts of the data to transfer can involve a lot of manual labor, but it can save many hours, or even days, of transfer time.
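
Part of that selection work can sometimes be scripted. The following is a minimal sketch, assuming the remote site publishes a listing of file paths and sizes. The manifest contents, the select_files helper, and the glob pattern are all hypothetical and only illustrate the idea of filtering before transferring.

```python
from fnmatch import fnmatch

# Hypothetical manifest: (relative path, size in bytes) for each remote file.
manifest = [
    ("run01/raw/reads.fastq", 40_000_000_000),
    ("run01/summary/stats.csv", 2_000_000),
    ("run02/raw/reads.fastq", 38_000_000_000),
    ("run02/summary/stats.csv", 3_000_000),
]


def select_files(manifest, patterns):
    """Return the manifest entries whose path matches any of the glob patterns."""
    return [
        (path, size)
        for path, size in manifest
        if any(fnmatch(path, pattern) for pattern in patterns)
    ]


# Keep only the small summary tables and skip the raw data.
selected = select_files(manifest, ["*/summary/*.csv"])
total_bytes = sum(size for _, size in manifest)
selected_bytes = sum(size for _, size in selected)
print(f"Transferring {selected_bytes:,} of {total_bytes:,} bytes")
```

In this made-up listing, the summary files amount to a few megabytes out of tens of gigabytes, so the filtered transfer is a tiny fraction of the full one.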

Sometimes the problem is not bandwidth, but user interface. The most common type of data transfer uses ordinary tools like a web browser or FTP client. These methods are collectively known as "data schlepping". Data schlepping requires the user to work with a variety of tools to transfer data to and from various sites. It also often suffers from failures due to dropped network connections, power outages, and other issues that are likely to interrupt a long transfer. Some tools, such as rsync, allow an interrupted transfer to continue from where it left off. However, not all sites offer rsync service.
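
The idea behind resuming a transfer can be sketched with ordinary files. The example below is an illustration only, not how rsync itself is implemented: the hypothetical resume_copy function treats the size of the partial destination file as the offset at which to continue. It assumes the source has not changed since the interruption and that the partial file is an intact prefix of it. A real tool would verify both.

```python
import os


def resume_copy(src_path, dst_path, chunk_size=1024 * 1024):
    """Copy src to dst, continuing from however many bytes dst already holds.

    Returns the number of bytes copied during this call.
    """
    # The size of the partial destination file is the offset to resume from.
    offset = os.path.getsize(dst_path) if os.path.exists(dst_path) else 0
    copied = 0
    with open(src_path, "rb") as src, open(dst_path, "ab") as dst:
        src.seek(offset)  # skip the part that already arrived
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)  # append mode adds to the end of the partial file
            copied += len(chunk)
    return copied
```

Calling resume_copy again after an interruption copies only the missing tail, and calling it on a complete file copies nothing. Over a network, the same offset is what an HTTP client would send in a Range request header, provided the server supports range requests.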

Globus Transfer is an example of a web-based alternative for data transfer that has built-in capabilities for dealing with connection issues, login credentials, and many other data transfer issues. It also overcomes bottlenecks associated with long-distance file transfers. Downloading data with a web browser or curl over thousands of miles typically results in transfer speeds of 1 or 2 megabytes per second. Globus can often transfer over such distances at 50 megabytes per second. The downside is that Globus and comparable high-speed transfer tools are often commercial or subscription-based, and may require a license and expertise to install and configure.
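
To put those rates in perspective, the short calculation below estimates how long a transfer would take at each speed. The 1-terabyte dataset size is a hypothetical figure chosen for illustration, and the transfer_hours helper is not part of any tool. It uses decimal units (1 GB = 1000 MB) and ignores protocol overhead and retries.

```python
def transfer_hours(size_gb, rate_mb_per_s):
    """Estimated hours to move size_gb gigabytes at rate_mb_per_s megabytes per second."""
    seconds = size_gb * 1000 / rate_mb_per_s  # 1 GB = 1000 MB (decimal units)
    return seconds / 3600


dataset_gb = 1000  # hypothetical 1 TB dataset
for rate in (1, 2, 50):
    print(f"{rate:>2} MB/s: {transfer_hours(dataset_gb, rate):6.1f} hours")
```

Under these assumptions, the dataset takes about 278 hours (more than 11 days) at 1 megabyte per second, but under 6 hours at 50 megabytes per second.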

Data transfer tools are evolving rapidly in response to the growing needs presented by big data. Users should make it a habit to explore new options and reevaluate existing ones.