Tuning NFS for Performance

Network File System (NFS) is the standard used by most Unix systems for direct remote access to files. It serves roughly the same purpose as Windows SMB/CIFS and Apple's AFP.

There is a popular myth that NFS is inherently slow. The myth has some basis: with default settings, an NFS server will often deliver throughput well below what its disks and network can sustain.

However, with a little tuning, NFS can easily saturate (approach the performance limits of) most networks and typical RAIDs.

The key to NFS performance is pipelining. Accessing a file over a network adds steps to each disk transaction, but the additional delay can be mitigated with proper tuning.

Local Disk Transactions

When accessing a local disk, a read transaction involves copying data from the disk to a memory buffer, and then reading the data from memory. A write transaction involves writing the data to a memory buffer, and then copying it to disk.

The read pipeline would appear as follows:

| Copy from disk | Read into program |

If we perform these two steps in order for every read transaction, we will not keep the disk busy while the program reads from the memory buffer, so we will not see the best possible performance. One step of the pipeline is always idle while the other is active.

Many systems attempt to speed up disk reads by overlapping the two steps. If we know (or can guess with a good success rate) which block on the disk is to be read next, we can begin copying it to another buffer while the program reads the current buffer. This is known as double-buffering, and can more than double read speed in ideal situations. If we keep the read requests coming as fast as the disk can serve them, we will likely end up reading an entire track from the disk in one revolution. If there is a delay between one disk read and the next request, we may have to wait for the next revolution of the disk before the desired block passes under the heads again, which can reduce throughput severalfold.
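The effect of overlapping the two steps can be sketched with a simple timing model. This is only an illustration: the per-block costs are hypothetical, and the model ignores the rotational delay described above, which is why real gains can exceed the factor of two it predicts.

```python
def serial_time(blocks, disk_ms, read_ms):
    """Total time when each block is copied from disk, then read, with no overlap."""
    return blocks * (disk_ms + read_ms)


def double_buffered_time(blocks, disk_ms, read_ms):
    """Total time when the next block is copied while the current one is read."""
    if blocks == 0:
        return 0
    # The first copy and the last read cannot overlap with anything;
    # every step in between is paced by the slower of the two stages.
    return disk_ms + (blocks - 1) * max(disk_ms, read_ms) + read_ms


if __name__ == "__main__":
    n, disk_ms, read_ms = 1000, 4, 4   # hypothetical per-block costs
    print("no overlap:     ", serial_time(n, disk_ms, read_ms), "ms")
    print("double-buffered:", double_buffered_time(n, disk_ms, read_ms), "ms")
```

When the two stages take about the same time, the overlapped total approaches half of the serial total. When one stage dominates, the gain shrinks, because the slower stage sets the pace.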

In the case of a write, the program writing the data need not wait for it to be copied to the disk. It can proceed with other work as soon as the data has been written to the memory buffer. Hence, small infrequent writes can be performed faster than reads from the program's perspective, since the operating system can copy the memory buffers to disk while the program does other work. Long, sustained writes will be about the same speed as reads, since we can only buffer so much data before waiting for the write to disk to complete. Double-buffering can also be useful on write operations if the program is outputting data as fast as the target disk can store it.
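The difference between short and sustained writes can be illustrated with a small model of a bounded write buffer. The function name, the buffer count, and the timings are all hypothetical, and the model assumes a buffer becomes free only when its block has reached the disk.

```python
def program_write_time(blocks, buffer_slots, mem_ms, disk_ms):
    """Time until the program has handed its last block to a memory buffer."""
    enqueued = []   # time each block finishes being copied into a buffer
    flushed = []    # time each block finishes being written to disk
    for i in range(blocks):
        start = enqueued[i - 1] if i > 0 else 0
        if i >= buffer_slots:
            # All buffers are full: wait until the oldest one reaches disk.
            start = max(start, flushed[i - buffer_slots])
        enqueued.append(start + mem_ms)
        # The disk writes blocks in order, one at a time.
        disk_start = max(enqueued[i], flushed[i - 1] if i > 0 else 0)
        flushed.append(disk_start + disk_ms)
    return enqueued[-1] if enqueued else 0


if __name__ == "__main__":
    # Hypothetical costs: 1 ms to copy a block to memory, 10 ms to write it to disk.
    print("short write (4 blocks):  ", program_write_time(4, 8, 1, 10), "ms")
    print("long write (100 blocks): ", program_write_time(100, 8, 1, 10), "ms")
```

A write that fits in the buffers costs the program only the memory copies. Once the buffers fill, the program is paced by the disk, which matches the observation that long, sustained writes run at about disk speed.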

NFS Disk Transactions

When accessing a remote file, we add in network transactions to both read and write.

For read:

  1. Send a request to the remote machine for the read operation
  2. Remote machine copies data from disk to a memory buffer
  3. Data is transferred from remote memory buffer to local memory buffer over the network
  4. Program reads data from local memory buffer

For write:

  1. Program writes data to memory buffer
  2. Data is transferred from local memory buffer to remote memory buffer over the network
  3. Remote machine copies data from memory buffer to disk

Note that like a local disk write, the program does not need to wait for an NFS write operation to complete, as long as the local and remote systems can buffer the data in memory. The program can continue on and do other work while the local and remote operating systems complete the write transaction.

On the read side, the network introduces significant latency in the form of two network transactions. The first makes the request for the data and the second transfers the data after the remote server has retrieved it from disk.

| Remote request | Copy from disk | Transmit over network | Read into program |

Hence, the program must wait much longer for an individual NFS read transaction than it does for a local disk read. Again, if we know at least some of the time which disk block will be read next, we can read ahead before the current transaction is finished and keep the pipeline fuller. When the program wants the next block of data, it may already be waiting in the remote or local memory buffers. For NFS, we often go beyond double buffering and use 3 or 4 read buffers. Under ideal conditions, such as reading a large file sequentially, NFS read throughput can approach the speed of the disk and/or network.
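A rough way to see why several read buffers help is to estimate throughput from the number of block reads kept in flight. The sketch below is a simplified latency model, not a description of how any NFS client schedules requests; the function name and all timings are hypothetical.

```python
def nfs_read_throughput(block_kib, outstanding, request_ms, disk_ms,
                        transmit_ms, bottleneck_mib_s):
    """Estimate MiB/s when `outstanding` block reads are kept in flight."""
    # Time for one block to travel through request, disk copy, and transmit.
    round_trip_s = (request_ms + disk_ms + transmit_ms) / 1000.0
    block_mib = block_kib / 1024.0
    # With k reads in flight, k blocks arrive per round trip ...
    latency_bound = outstanding * block_mib / round_trip_s
    # ... until the slowest device (disk or network) becomes the limit.
    return min(latency_bound, bottleneck_mib_s)


if __name__ == "__main__":
    for k in (1, 2, 4, 8):
        rate = nfs_read_throughput(block_kib=64, outstanding=k,
                                   request_ms=0.5, disk_ms=1.0,
                                   transmit_ms=1.0, bottleneck_mib_s=110.0)
        print(f"{k} read(s) in flight: {rate:6.1f} MiB/s")
```

In this model throughput grows linearly with the read-ahead depth until the disk or network limit is reached, after which more buffers add nothing.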

Since disk transactions take a long time and do not utilize the CPU much, using multiple processes to queue NFS requests on the server can greatly improve performance, especially on a busy server with multiple clients. While one NFS server process is waiting for a disk transaction, others can receive requests or transfer memory buffers over the network.
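The benefit of multiple server processes can be demonstrated with a small thread-pool analogy. This is not how nfsd is implemented; handle_request is a hypothetical stand-in that simply sleeps to imitate waiting on a disk.

```python
import time
from concurrent.futures import ThreadPoolExecutor

DISK_WAIT_S = 0.02  # hypothetical time one request spends waiting on disk


def handle_request(request_id):
    """Stand-in for one request: the worker mostly waits for the disk."""
    time.sleep(DISK_WAIT_S)
    return request_id


def serve(requests, workers):
    """Serve all requests with `workers` threads; return elapsed seconds."""
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() blocks until every request has been handled.
        list(pool.map(handle_request, requests))
    return time.perf_counter() - started


if __name__ == "__main__":
    for workers in (1, 4, 16):
        elapsed = serve(range(16), workers)
        print(f"{workers:2d} worker(s): {elapsed:.3f} s for 16 requests")
```

Because each worker spends most of its time waiting rather than computing, adding workers lets the waits overlap, and total time drops until some other resource becomes the limit.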

Benchmark Results

Below are some benchmark results from a small HPC cluster running FreeBSD 8.3-RELEASE. The benchmark was run on a compute node accessing an NFS server over gigabit Ethernet. The server storage is a 4-disk RAID 5 on a Dell PERC H710 RAID adapter (LSI RAID chip set).

Similar performance was achieved using a 3-disk RAID-Z array on the same system, suggesting that ZFS software RAID would outperform the PERC given the same number of disks.

Raw RAID Performance

   31.95 GiB write      64.00 KiB blocks    172504.00 ms       189.67 MiB/s
        1024 seek       64.00 KiB blocks        15.24 ms         4.27 MiB/s
   31.95 GiB read       64.00 KiB blocks    145778.00 ms       224.44 MiB/s
   31.95 GiB rewrite    64.00 KiB blocks    152434.00 ms       214.64 MiB/s

NFS RAID Access Performance

All NFS tests were run with the "noatime" flag set, to eliminate disk transactions for setting access time every time a file is read. This measure did not show significant performance benefit in these tests.

The gigabit network is capable of transferring around 100 megabytes per second in either direction, while the RAID is capable of more than twice that speed. Hence, the network is the limiting factor here.
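The figure of around 100 megabytes per second can be checked with a little arithmetic. The sketch below computes an upper bound on TCP payload throughput over gigabit Ethernet, assuming standard Ethernet framing, IPv4, and TCP headers without options. It does not account for RPC or NFS overhead, so real NFS throughput will be lower. The function name is made up for this example.

```python
LINE_RATE_BITS = 1_000_000_000   # gigabit Ethernet, bits per second
ETHERNET_OVERHEAD = 38           # preamble 8 + header 14 + FCS 4 + inter-frame gap 12
IP_TCP_HEADERS = 40              # 20-byte IPv4 header + 20-byte TCP header, no options


def tcp_payload_mib_s(mtu=1500):
    """Upper bound on TCP payload throughput in MiB/s for a given MTU."""
    payload = mtu - IP_TCP_HEADERS        # useful bytes per frame
    on_wire = mtu + ETHERNET_OVERHEAD     # bytes of line time per frame
    bytes_per_s = LINE_RATE_BITS / 8 * payload / on_wire
    return bytes_per_s / 2**20


if __name__ == "__main__":
    print(f"MTU 1500: {tcp_payload_mib_s(1500):.1f} MiB/s")
```

With the default MTU of 1500 the bound works out to roughly 113 MiB/s, so the tuned NFS results reported below sit reasonably close to what the wire can carry.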

With default NFS settings + noatime, throughput is well below the capabilities of the gigabit network:

   63.95 GiB write      64.00 KiB blocks   1082697.00 ms        60.49 MiB/s
        1024 seek       64.00 KiB blocks        27.80 ms         2.29 MiB/s
   63.95 GiB read       64.00 KiB blocks    979652.00 ms        66.85 MiB/s
   63.95 GiB rewrite    64.00 KiB blocks    809254.00 ms        80.92 MiB/s

With some minor adjustments to the NFS server and client settings, we were able to nearly saturate the gigabit Ethernet interface with NFS traffic in both directions.

   63.95 GiB write      64.00 KiB blocks    686593.00 ms        95.38 MiB/s
        1024 seek       64.00 KiB blocks        15.56 ms         4.27 MiB/s
   63.95 GiB read       64.00 KiB blocks    687974.00 ms        95.19 MiB/s
   63.95 GiB rewrite    64.00 KiB blocks    903218.00 ms        72.50 MiB/s

These settings incurred a small penalty for rewrite performance, which is not a concern in HPC, where overwriting files is uncommon.
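To put numbers on the change, the throughput column can be recomputed from the sizes and times in the two NFS tables above and the tuned results compared with the defaults. The elapsed times are copied from those tables. Recomputed rates may differ from the table in the last digit because the 63.95 GiB size is itself rounded.

```python
def mib_per_s(gib, elapsed_ms):
    """Convert a transfer size in GiB and a time in ms to MiB/s."""
    return gib * 1024 / (elapsed_ms / 1000)


# (operation, elapsed ms with default settings, elapsed ms after tuning)
# Each run moved 63.95 GiB in 64 KiB blocks.
RUNS = [
    ("write", 1082697.0, 686593.0),
    ("read", 979652.0, 687974.0),
    ("rewrite", 809254.0, 903218.0),
]

for op, default_ms, tuned_ms in RUNS:
    before = mib_per_s(63.95, default_ms)
    after = mib_per_s(63.95, tuned_ms)
    print(f"{op:8s} {before:7.2f} -> {after:7.2f} MiB/s  ({after / before:.2f}x)")
```

Writes and reads improve by roughly one and a half times, while rewrites slow down by about a tenth.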

On the server side, we increased the number of threads to 16 in /etc/rc.conf:

nfs_server_enable="YES"
nfs_server_flags="-t -n 16"

On the client side, we experimented with the read and write block sizes to find an optimal, stable value, and increased read-ahead to 4 in /etc/fstab:

peregrine:/share1 /share1 nfs rw,intr,noatime,rsize=8192,wsize=8192,readahead=4 0 0

Throughput was also examined in real time by running "iostat 1" on the server. Block sizes larger than 8192 bytes produced slightly higher burst speeds but made throughput unstable: iostat showed disk I/O above 107 MB/sec for short periods, with periodic drops to well below this rate. The larger the block size, the deeper the drops. The end result was lower average throughput for long writes than with a block size of 8192. Smaller block sizes lowered the maximum throughput.

Setting readahead to 3 produced almost the same read throughput as readahead=4.

The locations and exact syntax of configuration files on other Unix-like systems will differ, but the concepts are the same. You will want to experiment with the number of NFS server threads, the block size, and the read-ahead setting on any system to see what works best with your hardware and applications. In addition, there may be kernel parameters that can be tuned (using sysctl, for instance) and network parameters such as MTU (Maximum Transmission Unit) that could help squeeze a bit more performance out of NFS.
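When experimenting with these settings, it helps to have a repeatable measurement. The following is a minimal sketch of a sequential read test, not the benchmark tool used for the results above. The function name is hypothetical, and the test file should be much larger than client memory (or the share remounted between runs) so that cached data does not inflate the result.

```python
import time


def read_throughput_mib_s(path, block_size=64 * 1024):
    """Read `path` sequentially in `block_size` chunks and return MiB/s."""
    total = 0
    started = time.perf_counter()
    # buffering=0 bypasses Python's own buffer, so each read() call
    # goes to the operating system with the requested size.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - started
    return total / 2**20 / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    import sys
    # Usage: python readtest.py /share1/some-large-file
    print(f"{read_throughput_mib_s(sys.argv[1]):.2f} MiB/s")
```

Running the same test after each change to the mount options or server thread count, and changing only one setting at a time, makes it easier to tell which adjustment actually helped.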