No matter what operating system you use, you are going to have problems.
What you need to decide is what kinds of problems you can live with.
System crashes are the worst kind of problem for scientific computing, where analyses and simulations may take days, weeks, or even months to run. If a system crash occurs when a job has been running for a month, someone's research may be delayed by a month (unless their software uses checkpointing, allowing it to be resumed from where it left off).
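To make the idea of checkpointing concrete, here is a minimal sketch in Python. It is only an illustration, not how any particular research code does it: the file name, the `run` function, and the stand-in "work" inside the loop are all hypothetical. The state is saved periodically to a file, and a restarted job reloads it instead of beginning again at step 0.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh starting state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0.0}


def save_checkpoint(path, state):
    """Write the state to a temporary file, then rename it into place.

    The rename is atomic, so a crash during the write cannot leave a
    half-written checkpoint under the real name.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def run(path, n_steps, every=1000):
    """Run (or resume) a long computation, checkpointing every `every` steps."""
    state = load_checkpoint(path)
    for step in range(state["step"], n_steps):
        state["total"] += step * 0.5  # stand-in for the real work (assumption)
        state["step"] = step + 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

If the machine crashes partway through, calling `run` again with the same path loses at most the work done since the last saved checkpoint, rather than the whole month.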
Reliability must be considered a major factor when assessing the performance of a system. Long-term throughput (work completed per unit time) is heavily impacted by system outages that cause jobs to be restarted.
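A rough model shows how quickly restarts eat into throughput. The sketch below assumes, purely for illustration, that crashes arrive at random with a constant average rate (exponentially distributed time between crashes), that a crashed job restarts from scratch with no checkpointing, and that rebooting takes no time. Under those assumptions, the expected wall-clock time to finish a job needing D days of uninterrupted running on a system with mean uptime M is M(e^(D/M) - 1). The function names are my own.

```python
import math


def expected_completion_days(job_days, mean_uptime_days):
    """Expected wall-clock days to finish a job that restarts from scratch
    after every crash.

    Assumes exponentially distributed time between crashes and zero
    reboot time (both are simplifying assumptions).
    """
    return mean_uptime_days * math.expm1(job_days / mean_uptime_days)


def effective_throughput(job_days, mean_uptime_days):
    """Fraction of ideal throughput actually achieved under the same model."""
    return job_days / expected_completion_days(job_days, mean_uptime_days)


if __name__ == "__main__":
    for uptime in (45, 365):
        days = expected_completion_days(30, uptime)
        print(f"mean uptime {uptime} days: 30-day job takes about {days:.1f} days")
```

Under this model, a 30-day job on a system averaging 45 days of uptime takes roughly 43 days on average, while on a system averaging a year of uptime it takes about 31 days.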
It doesn't really matter why a system needs to be rebooted. It could be due to system freezes, panics (kernel detecting unrecoverable errors), or security updates so critical that they cannot wait. Systems that need to be rebooted frequently for any of these reasons should be considered less reliable.
Uptime, the time a system runs between reboots, should be monitored to determine reliability. The average uptime for popular operating systems varies from days to months.
System crashes are also the worst for IT staff who manage many machines. Suppose you manage 30 machines running an operating system that offers an average uptime of a month or two. This means you have to deal with a system crash every day or two on average (unless you reboot machines for other reasons in the interim).
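The arithmetic behind that estimate can be sketched in a few lines of Python. The function name is made up for illustration, and the calculation assumes each machine crashes independently, on average once per mean uptime.

```python
def days_between_fleet_crashes(machines, mean_uptime_days):
    """Average number of days between crashes somewhere in the fleet.

    Each machine crashes about once every mean_uptime_days, so the fleet
    as a whole sees machines / mean_uptime_days crashes per day.
    """
    crashes_per_day = machines / mean_uptime_days
    return 1.0 / crashes_per_day


if __name__ == "__main__":
    for uptime in (30, 60):
        gap = days_between_fleet_crashes(30, uptime)
        print(f"30 machines, mean uptime {uptime} days: one crash every {gap:.1f} days")
```

With 30 machines, a mean uptime of 30 days works out to one crash per day, and a mean uptime of 60 days to one crash every two days.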
This is exactly the situation I experienced while supporting fMRI research using cutting-edge Linux distributions, such as Red Hat (not Red Hat Enterprise, but the original Red Hat, which evolved into Fedora), Mandrake, Caldera, and SUSE (again, the original, not SUSE Enterprise).
Some of our Linux workstations would run for months without a problem while others were crashing every week. NFS servers running several different distributions would consistently freeze under heavy load. Systems would freeze for a few minutes at a time while writing DVD-RAMs. These were pristine installations with no invasive modifications. It wasn't anything we did to the systems; it was just the nature of these cutting-edge distributions.
This is a fairly common issue. Some research groups resort to scheduled reboots in order to maximize the likely uptime from the moment an analysis is started. The HTCondor scheduler has an option to reboot a compute host after a job finishes for similar reasons.
This is in no way a criticism of cutting-edge Linux distributions. They play an important role in the Unix ecosystem, namely as a platform for testing innovations. We need lots of people using new software systems in order to shake out most of the bugs and make them enterprise-ready, and cutting-edge Linux distributions serve this purpose very well. Many people want to try out the latest features and don't need a system that can run for months without a reboot. In fact, most of them probably upgrade and reboot their systems every week or so, and as a result, rarely experience a system crash.
However, no operating system is the best at everything, and cutting-edge Linux distributions are not the best at providing stability. Some glitches should be expected from anything on the cutting edge.
For the average user maintaining one or two systems for personal use or development, the stability of a cutting-edge Linux system is generally more than adequate.
For scientists running simulations that take months or IT staff managing many systems, the occasional instability can be a major nuisance.
One solution is to run an Enterprise Linux distribution, such as Red Hat Enterprise, described in Section 4.3, “RHEL/CentOS Linux”, or SUSE Enterprise.
Another is to run a different Unix variant, such as FreeBSD, described in Section 4.4, “FreeBSD”. This is the route we chose in our fMRI research labs, and it solved almost all of our stability issues. FreeBSD has always been extremely reliable and secure. System crashes are extremely rare. Almost every system crash I've experienced has been traced to a hardware problem or a configuration error on my part. Critical security updates, in my experience, occur less frequently than on other systems such as Windows and Linux. If you're looking for a "set and forget" operating system to make your sysadmin duties easy, FreeBSD is a great option.
In addition to choosing an operating system that focuses on reliability, you may want to invest in a UPS and a RAID to protect against power outages and disk failures. If you're really worried, some systems also offer fault-tolerant RAM configurations, using some RAM chips for redundancy, akin to RAIDs.