Storage Logistics

Planning well for data management means considering many issues. A few of them are discussed in the sections that follow.

Data Format

When storing data only for yourself, you might not give this issue much thought. However, data management includes not only preservation, but also dissemination. If others will have access to your data, the data must be in a format that is easy for them to read.

Many areas of science have developed standard data formats to help researchers and software applications interoperate. Research needs are too diverse to cover all of the standard data formats here. The goal of this section is only to raise awareness and encourage researchers to explore available standards before digging themselves into a hole.

Caution

Changing the format of large amounts of your data at a later time could be a very frustrating and costly process. It is highly advisable to decide on a standard data format before your research progresses too far.

The best way to explore data formats is by talking to others in your specific field and studying options via the Internet. This will help you develop a sense of what the emerging standards are in your niche.

Lifespan

Another very important question to ask is how long the data should be preserved. This will impact the cost of data management, although not as much as you might think, assuming that storage costs continue to decline over the long term.

Generally, the harder it is to regenerate the data, the longer it should be preserved. Data that are easy to recreate may actually cost more to store than to regenerate when they are needed again.
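The trade-off between storing and regenerating data can be estimated with a few lines of arithmetic. The following is a minimal sketch in Python, not a cost model from any provider: the function names, the price per terabyte, and the yearly price decline are all hypothetical inputs that you would replace with your own estimates.

```python
def total_storage_cost(size_tb, years, price_per_tb_year, annual_decline):
    """Sum the yearly storage bills, assuming the price per TB falls
    by a fixed fraction (annual_decline) every year. All inputs are
    illustrative assumptions, not real prices."""
    return sum(
        size_tb * price_per_tb_year * (1 - annual_decline) ** year
        for year in range(years)
    )


def cheaper_to_store(size_tb, years, price_per_tb_year, annual_decline,
                     regeneration_cost):
    """Return True if keeping the data for the whole period costs less
    than recreating it once."""
    stored = total_storage_cost(size_tb, years, price_per_tb_year,
                                annual_decline)
    return stored < regeneration_cost
```

Calling cheaper_to_store() with your own dataset size, retention period, and an honest estimate of what it would cost to rerun the experiment or simulation gives a rough first answer. It ignores labor, migration between storage systems, and the possibility that the data cannot be regenerated at all.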

Security

If the data contain confidential information, such as personal health information (PHI) or financial records, it may be necessary to restrict and track access to them. Regulations on PHI data are strict and somewhat complex, so they should be explored before making any data management plans.

Safety

Data safety refers to protection against data loss. If you're using a service provider to store your data, preventing loss will generally be the provider's responsibility. Providers typically maintain backups of the data they store and provide a written guarantee about its availability.

If you are storing the data yourself, you'll need to think about how to back it up and where. Backups should always be stored far from the original data in order to protect against fire, theft, and other physical disasters. Backups in the same room are not safe at all. Backups in the same building are somewhat safer, while backups in a separate building or distant location are best.
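A backup is only useful if it actually matches the original, so it is worth checking copies from time to time. The sketch below shows one way to do that in Python by comparing SHA-256 checksums of every file in two directory trees. The function names and the directory layout are assumptions made for illustration, and the script assumes the backup location is mounted or otherwise reachable as an ordinary directory.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks
    so that large data files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_backup_problems(original_dir, backup_dir):
    """List (relative_path, problem) pairs for files that are missing
    from the backup or whose contents differ from the original."""
    original_dir, backup_dir = Path(original_dir), Path(backup_dir)
    problems = []
    for src in sorted(original_dir.rglob("*")):
        if not src.is_file():
            continue  # skip directories
        rel = src.relative_to(original_dir)
        dst = backup_dir / rel
        if not dst.is_file():
            problems.append((rel.as_posix(), "missing"))
        elif sha256_of(src) != sha256_of(dst):
            problems.append((rel.as_posix(), "differs"))
    return problems
```

An empty list means every original file has an identical copy in the backup. The check does not report extra files that exist only in the backup, and reading every file twice can be slow for very large datasets.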

Funding

Paying for long-term data storage is a complex issue. Depending on the cost involved, it may or may not be possible to pay for it from a one-time grant allocation. Some institutions may provide data storage services, but in most cases, researchers will have to make their own arrangements, such as buying hardware or renting storage space from a commercial service.

Storage Providers

A number of universities, government organizations, and private companies provide long-term data storage.

NIH's GenBank is publicly funded and stores genetic sequence data at no charge to researchers, for the benefit of future medical and other biological research. Commercial services offer very low-cost options, provided you do not need high-speed upload or download. The cost increases along with the desired transfer speed. The best approach to selecting a provider is to investigate current service offerings and talk to colleagues who have been down this road.

Managing Your Own Storage

If you must manage your own backup or archival storage for reasons of privacy, funding, etc., there are cost-effective and reliable ways to do it.

The worst option is a USB thumb drive or other external disk that plugs directly into your computer. Such devices are easily damaged, lost, disconnected, turned off, or stolen. Accidental disconnections or power loss can lead to damaged file systems and lost files.

When using such a device, you are also limited to the file systems supported by every computer you plug it into. For example, if you format it with a file system native to BSD, Linux, or macOS, a Windows PC generally won't be able to read it without additional software.

A much safer option that's nearly as cost-effective is a networked file server with a built-in RAID (redundant array of inexpensive disks). Even with limited system administration skills, you can build a file server using an inexpensive PC with two or more disks and a specialized storage OS such as TrueNAS or XigmaNAS. These FOSS (Free and Open Source Software) products use the advanced ZFS file system to provide redundancy in case of a disk failure, as well as data compression, encryption, snapshots, etc.

They are extremely easy to install and manage through a simple graphical interface. If you're prepared to spend a little more, you can also purchase a preconfigured TrueNAS box with commercial support.
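Redundancy is paid for with disk space, so it helps to estimate usable capacity before buying disks. The sketch below, in Python, approximates the usable space and the number of disk failures tolerated by a single ZFS mirror or RAID-Z group. The function name is hypothetical, and the figures ignore file system overhead, so treat the results as upper bounds rather than exact numbers.

```python
def usable_capacity_tb(layout, disk_sizes_tb):
    """Return (approximate usable TB, disk failures tolerated) for one
    group of disks. Capacity is limited by the smallest disk in the
    group; file system overhead is ignored."""
    n = len(disk_sizes_tb)
    smallest = min(disk_sizes_tb)
    parity = {"raidz1": 1, "raidz2": 2, "raidz3": 3}

    if layout == "mirror":
        if n < 2:
            raise ValueError("a mirror needs at least two disks")
        # Every disk holds a full copy, so only one disk's worth is usable.
        return smallest, n - 1

    if layout in parity:
        p = parity[layout]
        if n <= p:
            raise ValueError(f"{layout} needs more than {p} disks")
        # p disks' worth of space is used for parity information.
        return (n - p) * smallest, p

    raise ValueError(f"unknown layout: {layout}")
```

For instance, usable_capacity_tb("mirror", [4, 4]) describes the simplest two-disk server mentioned above: half of the raw space is usable, and the data survive the loss of either disk.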

You can house the file server anywhere, but preferably in a secure location with battery-backed power, such as a data center. If you don't have access to a data center, then choose the most secure location you can and purchase a small UPS (uninterruptible power supply) along with the PC to protect it from brief power outages.

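TrueNAS and XigmaNAS support common network file-sharing protocols such as NFS (Network File System, common on Unix), SMB/CIFS (Windows file sharing), and AFP (Apple Filing Protocol, Apple's older file-sharing protocol), so files on the server can be accessed simultaneously from any computer on the network. Protocol support varies by product and release, so check the current documentation.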