Best Practices in Parallel Computing

Parallelize as a Last Resort

While there will always be a need for parallel computing, the availability of parallel computing resources may tempt people to use them as a substitute for writing efficient code.

There is virtually no optimal software in existence. Most software at any given moment can be made to run faster, and a significant percentage of it can be made to run orders of magnitude faster. Most performance issues can and should therefore be resolved by optimizing the software first.

Improving software is a more responsible and intelligent way to resolve performance issues wherever it's possible. It will allow effective use of the software on a much wider variety of hardware, possibly including ordinary desktop and laptop machines. It is much better for everyone if they can run a program on a laptop or desktop system rather than use a cluster or grid.

It is also the more ethical way to resolve issues where users are running on shared resources. Using tens of thousands of dollars' worth of computer equipment to make inefficient software run in a reasonable amount of time is wasteful and foolish, and may delay the work of others who need those resources for more legitimate uses.

I once had a computer science student who worked as a consultant. His firm was hired to design and install a faster computer for a business whose nightly data processing had grown to the point where it wasn't finishing before the next business day started. He politely asked if he could have a look at the home-grown software that was performing the overnight processing. In about an hour, he found the bottleneck in the software and made some adjustments that reduced the processing time from 14 hours to about 10 minutes.

I've personally experienced numerous cases where researchers were considering buying a faster computer or using a cluster to speed up their work. In many cases, I was able to help them make their programs run orders of magnitude faster and eliminate the need for more hardware. In one case, in about 20 minutes of examining a Matlab script, I identified the location of a bottleneck that the researcher then easily solved, making the script run about 1,000 times as fast. He was extremely grateful that he could now complete his work on his PC, and other cluster users did not suffer from this needless competition for resources.

Before you consider the use of parallel computing, make sure you've done everything you can to optimize your software, by choosing efficient algorithms, using compiled languages for the time-consuming parts of your code, and eliminating wasteful code.
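As a small illustration of "choosing efficient algorithms", the following Python sketch times two versions of the same computation. It is not taken from any of the cases described above. The function names and the test data are hypothetical and chosen only for this example. The slow version searches a list for every item, while the fast version builds a set once and does constant-time lookups.

```python
import time


def count_shared_slow(a, b):
    """Count items of a that also appear in b by scanning the list b each time.

    Cost grows roughly as len(a) * len(b).
    """
    return sum(1 for x in a if x in b)


def count_shared_fast(a, b):
    """Same result, but b is converted to a set once before the loop.

    Cost grows roughly as len(a) + len(b).
    """
    b_set = set(b)
    return sum(1 for x in a if x in b_set)


def timed(func, *args):
    """Return (result, elapsed_seconds) for one call of func(*args)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Hypothetical input: even numbers and multiples of 3 below 6000.
    a = list(range(0, 6000, 2))
    b = list(range(0, 6000, 3))

    slow_result, slow_time = timed(count_shared_slow, a, b)
    fast_result, fast_time = timed(count_shared_fast, a, b)

    # An optimization is only useful if the answer is unchanged.
    assert slow_result == fast_result
    print(f"list scan:  {slow_time:.4f} s")
    print(f"set lookup: {fast_time:.4f} s")
```

The gap between the two timings widens as the inputs grow, which is the kind of improvement no amount of added hardware is likely to match.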

Software performance is discussed in greater detail in Part III, “High Performance Programming”.

Make it Quick

Parallel computing jobs should be designed to finish within a few hours, if possible. Jobs that run for weeks or months have a lower probability of completing successfully. The longer individual processes run, the higher the risk of being interrupted by power failures, hardware issues, security updates, etc. In the case of HTCondor grids, jobs that run for more than a few hours run a high risk of being evicted from desktop machines that are taken over by a local user.

Shorter running jobs also give the scheduler the ability to balance the load fairly among many users.

In the case of HTC (embarrassingly parallel computing), we can usually break up the work into as many pieces as we like. For example, we may have 3,000 core-hours worth of computation that we can run as a hundred 30-hour processes, or a thousand 3-hour processes, simply by dividing up the input into smaller chunks. We do not want to divide it to the point where each job takes only a few minutes, however. Doing so increases overhead and reduces throughput.
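The arithmetic in the example above can be sketched in a few lines of Python. This is only an illustration: the helper names plan_jobs and split_inputs are hypothetical, and it assumes the work divides evenly among the inputs, which real workloads may not.

```python
import math


def plan_jobs(total_core_hours, target_hours_per_job):
    """Return (number_of_jobs, hours_per_job) for work divided evenly.

    Assumes the run time of each job is proportional to its share of the input.
    """
    if total_core_hours <= 0 or target_hours_per_job <= 0:
        raise ValueError("both arguments must be positive")
    n_jobs = math.ceil(total_core_hours / target_hours_per_job)
    return n_jobs, total_core_hours / n_jobs


def split_inputs(items, n_chunks):
    """Split items into n_chunks contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_chunks)
    chunks = []
    start = 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one additional item.
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


if __name__ == "__main__":
    print(plan_jobs(3000, 30))   # a hundred 30-hour jobs
    print(plan_jobs(3000, 3))    # a thousand 3-hour jobs
    print(split_inputs(list(range(10)), 3))
```

Choosing a target of a few hours per job, rather than a few minutes, keeps the per-job scheduling overhead small relative to the useful work.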

Monitor Your Jobs

Never submit a job and ignore it. Malfunctioning jobs waste expensive resources that could be utilized by others and may actually cause compute nodes to crash in extreme cases. It is every user's responsibility to ensure that their jobs are functioning properly and not causing problems for other users.

Development and Testing Servers

A well-organized software development operation consists of up to five separate environments (tiers) for developing, testing, and ultimately using new software.

  1. Development
  2. Testing
  3. Quality assurance
  4. Staging
  5. Production

A detailed description of the five tiers can be found in the Wikipedia article on Deployment Environments. Typically only organizations with a large software development staff will employ all five tiers.

Even in the smallest environments, though, a clear distinction is made between development/testing servers and production servers. Clusters and grids are meant to be production environments. A cluster or grid is not a good place to develop or test new programs, for multiple reasons:

  • All jobs run on a cluster or grid must go through a scheduler, which makes the development and testing process more cumbersome.
  • We generally don't want test runs of unfinished or unfamiliar code competing with production jobs. Bugs in the code or mistakes in using it will often have unforeseen impact on the system, which may harm important production jobs by consuming too much memory or even crashing compute nodes.
  • A cluster or grid is not necessary for testing code correctness, even for parallel programs. Most testing of any program, including parallel programs, can be done on a single computer, even with a single core. All we need is a multitasking operating system that can run multiple processes at the same time. We do not need parallel hardware resources. The only testing that requires parallel hardware resources is measuring speedup.
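To illustrate the last point, here is a minimal Python sketch of a correctness test that runs several worker processes on whatever single machine is at hand, even one with a single core. The problem (a sum of squares), the worker program, and the function names are all hypothetical and chosen only for this example. The test compares the combined result of the concurrent processes against a plain serial computation.

```python
import subprocess
import sys

# Program run by each child process: sums i*i for i in [start, stop)
# and prints the partial result on standard output.
WORKER = (
    "import sys; "
    "a, b = int(sys.argv[1]), int(sys.argv[2]); "
    "print(sum(i * i for i in range(a, b)))"
)


def serial_sum_of_squares(n):
    """Reference answer computed in a single process."""
    return sum(i * i for i in range(n))


def parallel_sum_of_squares(n, n_procs):
    """Sum i*i for i in [0, n) using n_procs concurrently running processes."""
    # Split [0, n) into n_procs contiguous ranges.
    bounds = [n * k // n_procs for k in range(n_procs + 1)]

    # Start every worker before waiting on any of them, so the operating
    # system multitasks among them regardless of how many cores exist.
    procs = [
        subprocess.Popen(
            [sys.executable, "-c", WORKER, str(lo), str(hi)],
            stdout=subprocess.PIPE,
            text=True,
        )
        for lo, hi in zip(bounds, bounds[1:])
    ]

    total = 0
    for proc in procs:
        out, _ = proc.communicate()
        if proc.returncode != 0:
            raise RuntimeError("a worker process failed")
        total += int(out)
    return total


if __name__ == "__main__":
    n = 100000
    assert parallel_sum_of_squares(n, 4) == serial_sum_of_squares(n)
    print("parallel and serial results agree")
```

A test like this says nothing about speedup, which still has to be measured on real parallel hardware.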

These principles apply whether you are developing your own code or learning to use a program written by others. Code should be developed and tested on servers completely separate from the scheduled production environment of a cluster or grid. This is easily done using a separate server with the same operating system and software installed as a compute node on the production cluster or grid.

Think of a development server as a compute node where you are free from the need to schedule jobs and from worries about impacting the important work of other users. Here you can quickly and easily make changes to your program and run small test cases. Once you are confident that the code is working properly, you can move up to the next available tier for additional testing or production runs.

Practice

Note

Be sure to thoroughly review the instructions in Section 2, “Practice Problem Instructions”, before doing the practice problems below.

  1. When should we make the decision to use parallel computing? Why?

  2. How long should parallel jobs run ideally? Why?

  3. Why is it important to actively monitor jobs running on a cluster or grid?

  4. What are some problems associated with doing code development on a cluster or grid?

  5. What do we need to test the correctness of a parallel program?