Before reading this chapter, you should be familiar with basic Unix concepts (Chapter 3, Using Unix), the Unix shell (the section called “Command Line Interfaces (CLIs): Unix Shells”), redirection (the section called “Redirection and Pipes”), and shell scripting (Chapter 4, Unix Shell Scripting).
BEFORE you start submitting jobs to a cluster or grid, you MUST know how to monitor and control them.
Jobs that go rogue can cause problems for other users, so you need to know how to watch over them to ensure they're not using more resources than expected.
If a job does go astray, you need to know how to terminate it quickly, before it impacts other users.
If a job is submitted and the resources required to run it are not available, the job waits in a queue until the resources become available. These pending jobs are, for the most part, started in the order they were submitted. Hence, if one user submits too many jobs at once, other users' jobs may end up waiting in the queue for a very long time.
As a general rule, if you are already using a significant share of the total cluster resources and there are jobs pending, you should not submit new jobs until other users' jobs have a chance to run.
If there are a lot of idle resources on the cluster, it's generally better to utilize them by submitting some "extra" jobs than to let them remain idle. The ideal state for a cluster is nearly 100% utilized, but without jobs pending for a long time.
However, keep in mind that other users may need resources after you've submitted. Hence, if you're using a lot of resources, please check the cluster load once per hour. If pending jobs appear after you've submitted, and none of your jobs are near completion, please kill some of them to allow other users a fair share of cluster resources.
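The etiquette above can be summarized as a simple decision rule. The following is only a minimal sketch: the function name submit_advice is hypothetical, the 25% default is an assumed reading of "a significant share", and the three numbers passed in would have to come from whatever status commands your scheduler provides. It does not call any real scheduler.

```shell
#!/bin/sh
# Hypothetical helper illustrating the submission etiquette described above.
# Nothing here queries a real scheduler; the caller supplies the numbers.
#
# Usage: submit_advice MY_CORES TOTAL_CORES PENDING_JOBS [SHARE_PERCENT]
#   MY_CORES       cores currently used by your own jobs
#   TOTAL_CORES    total cores in the cluster (assumed to be > 0)
#   PENDING_JOBS   number of other users' jobs waiting in the queue
#   SHARE_PERCENT  assumed threshold for "a significant share" (default 25)
submit_advice() {
    my_cores=$1
    total_cores=$2
    pending_jobs=$3
    share_limit=${4:-25}

    # Integer percentage of the cluster occupied by your jobs
    my_share=$(( my_cores * 100 / total_cores ))

    if [ "$pending_jobs" -gt 0 ] && [ "$my_share" -ge "$share_limit" ]; then
        # Others are waiting and you already hold a large share
        echo "hold"
    else
        # Either nothing is pending or your share is small
        echo "submit"
    fi
}
```

For example, `submit_advice 50 100 3` prints "hold", while `submit_advice 50 100 0` prints "submit". A "hold" result is also a cue to consider killing some of your own jobs if none of them are near completion.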