Chapter 14. Software Development

In this chapter, we introduce some high-level concepts behind high-performance scientific programming, to lay a foundation for the programming instruction that follows. Before we can learn to write high-quality software, we first must understand what makes software high-quality.

Goals of a Top-notch Scientific Software Developer

Maximize Portability

If possible, use an open language such as C, C++, Fortran, Octave, Perl, Python, or R, rather than proprietary languages such as MATLAB or SPSS, which only run on a few specific operating systems and CPUs (central processing units, the part of the computer that a program directly controls).

If possible, write code following Unix standards, so people will be able to run your programs on virtually any operating system including BSD, Linux, macOS, and even Windows with Cygwin or WSL. If you code specifically for Windows, or build with Apple's proprietary Xcode environment, your programs will run only on that platform.

As discussed in Chapter 3, Using Unix, most research computing is done on Unix-compatible operating systems. Every operating system you are likely to use, with the exception of Microsoft Windows, is Unix-compatible. There are many commercial Unix systems used in corporate data centers, but researchers are most likely to be running macOS or a free Unix system such as one of the many BSD or Linux distributions.

Windows users can run Unix software using a compatibility layer such as Cygwin (the section called “Cygwin: Try This First”), or by running a Unix system in a virtual machine, using a tool such as WSL or VirtualBox (Chapter 41, Running Multiple Operating Systems). Preconfigured virtual machine installations are available for many Unix systems, so obtaining a Unix environment on your Windows PC is easy.

Code developed in proprietary Windows development environments such as Visual Studio is usually difficult to port to Unix systems. This is often a major problem for researchers who discover down the road that their PC is not fast enough and want to run their code on high-power Unix servers or clusters. The only option for them is to heavily modify or rewrite the code so that it is Unix-compatible.

Although Apple's macOS is Unix-compatible, the Xcode development environment is proprietary, and projects developed in Xcode are not portable to other Unix systems. If you develop using an Xcode project, you will need to maintain a separate build system for other Unix platforms. Alternatively, you can develop under macOS using a single, portable build system such as a simple Makefile that will work in all Unix environments (including Cygwin).

For most researchers, it makes no difference which operating system they use to develop and test their code. Unix systems are highly compatible with each other and most of the features of any one of them are available in the others. Each system does have some of its own special features, but most of them are not relevant to most scientific software, so it is generally easy to write code that is portable among all Unix systems.

WORF: Write Once, Run Forever

Unfortunately, much of the software used in science consists of disposable scripts that don't perform well, are not robust (they don't handle bad input), and are reinvented many times over by other researchers doing similar work. This will always be part of the field, as most scientists lack the time, the skills, or both to do otherwise.
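To make "robust" concrete, here is a minimal sketch in C of input handling that rejects bad input instead of silently accepting it. The function name parse_long() is hypothetical, invented for this illustration. It relies only on the standard C strtol() function and errno, and it is one possible approach, not a prescribed one.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper: convert text to a long, accepting only a complete,
 * in-range base-10 integer (leading whitespace and a sign are allowed,
 * as in strtol() itself).
 * Returns 0 and stores the result in *value on success, -1 on bad input.
 */
int parse_long(const char *text, long *value)
{
    char *end;
    long result;

    /* A NULL output pointer is a programming error, not bad user input */
    assert(value != NULL);

    if (text == NULL || *text == '\0')
        return -1;              /* Nothing to parse */

    errno = 0;
    result = strtol(text, &end, 10);

    if (end == text)
        return -1;              /* No digits were found */
    if (*end != '\0')
        return -1;              /* Trailing characters, e.g. "12abc" */
    if (errno == ERANGE)
        return -1;              /* Value does not fit in a long */

    *value = result;
    return 0;
}
```

A caller would check the return value and report an error, for example when parse_long(argv[1], &count) returns -1, rather than continuing with a meaningless number.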

What the community needs in order to move forward is high-quality application programs that will serve people for many years. If a few more scientists decide to develop such applications, it will bring about a huge reduction in duplicated effort for a long time to come.

If you choose to contribute to the collection of quality software, you'll need to consider carefully how to develop it. You may not have much time to maintain and update the code a few years from now, so write code that will require as little maintenance as possible.

Code in a stable language (one that is not changing), such as C or POSIX Bourne shell. Test on multiple platforms, such as BSD, Cygwin, Linux, and macOS. This is easy to do by running free, open source systems in virtual machines. The fact that a program works on your development platform does not mean it is free of bugs. It probably contains many bugs that simply are not visible, since they don't cause incorrect output or program crashes there. Testing on another platform will almost always reveal bugs that you were not aware of.
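As one illustration of a bug that stays invisible on the development platform, consider integer sizes. The C type long is 64 bits wide on typical 64-bit BSD, Linux, and macOS systems, but only 32 bits wide on 32-bit systems and under the native 64-bit Windows ABI, so a total accumulated in a long can overflow on some platforms and not on others. The sketch below avoids the problem with the fixed-width types from the standard header stdint.h. The function name total_records() and the scenario of summing record counts are assumptions made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical example: add up per-file record counts.
 * The accumulator is int64_t rather than long, so it is 64 bits wide
 * on every platform that provides the type.
 */
int64_t total_records(const int32_t counts[], size_t n)
{
    int64_t total = 0;
    size_t i;

    /* An empty list is fine, but a NULL array with n > 0 is a caller bug */
    assert(counts != NULL || n == 0);

    for (i = 0; i < n; ++i)
        total += counts[i];
    return total;
}
```

To print the result portably, the PRId64 macro from inttypes.h can be used in the printf() format string instead of assuming that %ld matches the type.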

Most scientists don't understand that writing software is like adopting a puppy. It's not something you do and then forget about. It's a 10- to 20-year commitment to maintenance and support. Almost all software will need to be upgraded periodically to work with new operating systems, compilers, interpreters, libraries, and other programs. Bugs will continue to be discovered and fixed over the entire life of the software.

Abandonware is software that is still available, but no longer maintained. This is the state of most scientific software: the vast majority of it is abandoned within a few years of publication. Much of it is written as someone's thesis, and the author has either no time or no interest in maintaining it after graduation.

Paperware is a special category of abandonware, written for the sake of publishing a paper, and then immediately abandoned. Many researchers work in a publish-or-perish environment, where they must continually get papers published in respected journals in order to advance, or even maintain, their careers. It is rarely feasible for people in this environment to maintain software after the associated paper is published, since they must focus on publishing the next study.

Abandoned software may still be useful, but more often than not, it cannot be installed on newer systems because it was not written for portability. For example, it may have been written in Python 2 and be incompatible with Python 3. Notable exceptions are programs written in POSIX Bourne shell and C, languages that have changed very little in recent years and are unlikely to change much in the future. Software written in these languages will likely continue to function indefinitely with little or no modification.

Complex build systems are also problematic in that they change over time, often breaking compatibility with older versions. A simple makefile is more likely to work for people 10 years from now than a complex CMake setup. Writing makefiles is covered in Chapter 22, Building with Make.

Minimize CPU and Memory Requirements

Many scientists do not have easy access to an HPC cluster or even a powerful workstation. Being able to run your program on a laptop may be the difference between success and failure for you and other researchers. Running highly inefficient programs on an HPC cluster is unwise and unethical. It wastes resources that are actually needed for other work, doing things that could be done on much less expensive hardware. The fastest languages by far, as we will see in the section called “Language Selection”, are C, C++, and Fortran.
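One common way to keep memory requirements low is to process data as a stream rather than loading it all into memory first. The sketch below, offered as an illustration under that assumption, computes the mean of a sequence of numbers using a fixed amount of memory regardless of how many values are read. The names running_mean_t, running_mean_init(), running_mean_add(), and mean_from_stream() are hypothetical, and the input is assumed to be whitespace-separated numbers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical accumulator: holds only a count and the mean so far */
typedef struct
{
    size_t count;
    double mean;
} running_mean_t;

void running_mean_init(running_mean_t *rm)
{
    assert(rm != NULL);
    rm->count = 0;
    rm->mean = 0.0;
}

/* Fold one more value into the mean without storing earlier values */
void running_mean_add(running_mean_t *rm, double x)
{
    assert(rm != NULL);
    rm->count += 1;
    rm->mean += (x - rm->mean) / (double)rm->count;
}

/*
 * Read whitespace-separated numbers from a stream until end of file or
 * the first token that is not a number. Stores the mean in *mean
 * (0.0 if nothing was read) and returns the number of values read.
 */
size_t mean_from_stream(FILE *in, double *mean)
{
    running_mean_t rm;
    double x;

    assert(in != NULL && mean != NULL);
    running_mean_init(&rm);
    while (fscanf(in, "%lf", &x) == 1)
        running_mean_add(&rm, x);
    *mean = rm.mean;
    return rm.count;
}
```

Calling mean_from_stream(stdin, &mean) lets the same program handle a ten-line file or a multi-gigabyte file on a laptop, since memory use does not grow with the input.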

Minimize Deployment Effort

Make the program and its build system package-friendly, and discourage the use of caveman installs. If the program is easy to add to existing package managers, then users won't waste their time struggling with caveman installs and won't waste your time with problem reports and help requests. More on this in Chapter 22, Building with Make and Chapter 32, Software Project Management.

Practice

Note

Be sure to thoroughly review the instructions in Section 2, “Practice Problem Instructions” before doing the practice problems below.

How do we maximize portability of the software we write?
Why is it important to minimize the future maintenance requirements of the free software we write?
How do we minimize future maintenance of our software?
What is abandonware? How common is it in scientific computing?
What are the benefits of minimizing CPU requirements?
How do we help users avoid problems with deploying our software?