SPCM HPC Cluster Management

Overview

The SPCM cluster manager is a set of scripts for managing a simple HPC (High Performance Computing) cluster.

It is the only portable cluster management suite we are aware of and is designed to be easily adapted to most POSIX platforms.

It automates the process of configuring a head node, compute nodes, file servers, and visualization nodes for a high performance computing cluster, and managing configuration and software after installation.

Screen shot of Ganglia on a small cluster built with SPCM:

[Ganglia Screen Shot]

SPCM automates the setup of a cluster using the SLURM scheduler and the Ganglia web-based network monitoring suite. It also helps synchronize system files on the compute nodes, manage user accounts, and manage software on compute nodes.
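
For reference, a compute partition in SLURM is described by a few lines in slurm.conf. The snippet below is a minimal sketch with hypothetical host and node names, not the file SPCM generates; it only illustrates the format.

    # Minimal slurm.conf sketch (hypothetical host and node names).
    ClusterName=mycluster
    SlurmctldHost=head0
    NodeName=compute-[001-004] CPUs=16 RealMemory=64000 State=UNKNOWN
    PartitionName=batch Nodes=compute-[001-004] Default=YES MaxTime=INFINITE State=UP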

The design philosophy centers on simplicity and minimizing interdependence of the cluster nodes. Each compute node contains a fully independent operating system installation and critical software installations on its own local storage. Compared with clusters that utilize shared storage more extensively, this strategy increases the initial cluster setup time slightly in exchange for simpler management, less "noise" on the local network, fewer single points of failure, and fewer bottlenecks.

Core design principles:

  • Speed and simplicity: Our efforts focus on basic functionality, robustness, and fast, easy setup and management. No fluff to make it look fancy.
  • Portability: We aim to make the system portable to any POSIX operating system. Ultimately, we plan to support heterogeneous clusters, where different nodes can transparently run different operating systems while being managed with the exact same commands. There is currently limited support for this: we have tested FreeBSD file servers and visualization nodes in predominantly CentOS clusters. Conversely, a CentOS compute node could be integrated into a FreeBSD cluster for running CUDA programs. (In theory, CUDA could be run on FreeBSD using Linux compatibility, but using a CentOS node is a simpler solution.)

    The pkgsrc portable package manager plays a major role in making this possible by managing both system and scientific software on any POSIX platform. Many system-dependent differences are encapsulated in the auto-admin tools, on which SPCM depends, so the SPCM scripts can remain cleaner and more generic. Much work remains to be done in this area, but it will remain a core principle, and steady improvement should be expected.

  • Non-interference with core operating system: Unlike some other cluster management systems, SPCM does not depend on hacks to the base operating system, and all OS updates can be applied using the standard tools that come with the OS (through the SPCM interface to avoid issues). Critical security update just released for your OS? Install it immediately without worrying about breaking your cluster. You don't have to wait for us to port it to a modified OS image.
  • Never require a cluster shutdown: All system updates can be applied to a live cluster. Compute nodes are set to a draining state and updated as they become idle, which ensures that all new jobs land on updated nodes (see the sketch below).
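
The drain-then-update cycle can be expressed with standard SLURM and OS commands. A minimal sketch for a single CentOS node, assuming the hypothetical node name node001 and SSH access from the head node (SPCM wraps this cycle in its own interface):

    # Mark the node as draining so no new jobs land on it.
    scontrol update NodeName=node001 State=DRAIN Reason="OS update"

    # Once the node is idle, apply vendor updates with the stock tool
    # (yum on CentOS; freebsd-update and pkg on FreeBSD).
    ssh node001 'yum -y update && shutdown -r now'

    # After the node reboots, return it to service.
    scontrol update NodeName=node001 State=RESUME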

Implementation of this design is facilitated by leveraging the extensive systems management tools provided by the FreeBSD base system, including the ports system, which automates the installation of nearly every mainstream open source application. The pkgsrc package manager is used on other platforms, and Yum is used for the most basic system services and for commercial software support on RHEL/CentOS.
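
As an illustration, installing the same open source tool, here rsync, looks like this on each platform (package names can differ slightly between systems):

    # FreeBSD: binary package from the ports collection
    pkg install rsync

    # RHEL/CentOS: vendor package for basic system services
    yum install rsync

    # Any POSIX platform with pkgsrc bootstrapped: build from source
    cd /usr/pkgsrc/net/rsync && bmake install clean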

The SPCM tools are written entirely in POSIX Bourne shell, using standard Unix tools to configure the system and ports/packages for all software management. SPCM's only dependency is auto-admin.
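
In practice, the portability strategy amounts to small shell functions that branch on the platform. The function below is a hypothetical sketch in the same spirit, not code from SPCM or auto-admin:

    #!/bin/sh

    # Hypothetical sketch: encapsulate one platform difference behind
    # a single function, in the style described above.
    install_package()
    {
        case $(uname) in
        FreeBSD)
            pkg install -y "$1"
            ;;
        Linux)
            # yum on RHEL/CentOS; a fuller version would also detect
            # the distribution and fall back to pkgsrc where appropriate.
            yum install -y "$1"
            ;;
        *)
            printf 'Unsupported platform: %s\n' "$(uname)" >&2
            return 1
            ;;
        esac
    }

    install_package rsync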

[cluster diagram] In many clusters, the head node is multi-homed (has two network interfaces) and serves as the gateway for the cluster. SPCM supports this configuration, but be aware that it complicates the setup of the head node as well as the configuration of services running on it, including the scheduler and the Ganglia resource monitor.
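
If you do use a multi-homed head node, the gateway function itself is easy to enable; the complexity lies in binding cluster services to the correct interface. A minimal sketch for turning on packet forwarding (run as root; interface assignments are site-specific):

    # FreeBSD head node: forward packets between the two interfaces.
    sysrc gateway_enable=YES
    sysctl net.inet.ip.forwarding=1

    # CentOS head node: equivalent setting, made persistent.
    sysctl -w net.ipv4.ip_forward=1
    echo 'net.ipv4.ip_forward = 1' >> /etc/sysctl.d/99-gateway.conf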

The recommended hardware configuration uses a single network interface on all nodes, including the head node, and a separate router. Many network switches have built-in routing capability. If you're using a simple switch without routing capability for your cluster, you can use an inexpensive hardware router, or quickly and cheaply build a sophisticated firewall router from any PC with two network adapters and pfSense or OPNsense.

Be sure that the hardware router or PC has gigabit Ethernet on both the WAN and LAN ports. Many older routers designed for home use, and many low-end PCs, have only 100 Mbit/s Ethernet.
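
You can verify link speed from the command line before committing hardware. Assuming the interface names em0 (FreeBSD) and eth0 (Linux):

    # FreeBSD: look for "media: Ethernet autoselect (1000baseT ...)"
    ifconfig em0 | grep media

    # Linux: look for "Speed: 1000Mb/s"
    ethtool eth0 | grep Speed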

Currently Supported Platforms: RHEL/CentOS and FreeBSD

Red Hat Enterprise Linux and its free twin, CentOS, are the de facto standard operating systems for HPC (High Performance Computing) clusters. They are very stable, have strong support for HPC system software such as InfiniBand drivers and parallel file systems, and are the only GNU/Linux platforms officially supported by many commercial software vendors.

The main disadvantages of enterprise Linux platforms (compared to FreeBSD or community Linux distributions such as Debian and Gentoo) are the outdated base installations and the limited, outdated collection of packages available in the Yum repositories. (Stability and long-term binary compatibility in enterprise Linux systems are maintained by running older, time-tested, and heavily patched versions of system software.)

We've had great success using pkgsrc to manage more up-to-date open source software on RHEL/CentOS. The pkgsrc system is well-supported on Linux, offers far more packages than Yum, and can install a complete set of packages that are almost completely independent from the base Linux installation. This allows the base system (including RPMs from Yum) to be updated without breaking software installed by pkgsrc.
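
Bootstrapping pkgsrc on a CentOS node takes only a few commands. A sketch, assuming the conventional /usr/pkgsrc tree and the default /usr/pkg prefix (the download URL shown is one of several distribution methods; check the pkgsrc site for current locations):

    # Fetch and unpack the pkgsrc tree.
    cd /usr
    curl -O https://cdn.netbsd.org/pub/pkgsrc/current/pkgsrc.tar.gz
    tar -xzf pkgsrc.tar.gz

    # Bootstrap the pkgsrc tools into the default /usr/pkg prefix.
    cd /usr/pkgsrc/bootstrap
    ./bootstrap

    # Build and install a package independent of the base system's RPMs.
    cd /usr/pkgsrc/net/rsync
    /usr/pkg/bin/bmake install clean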

FreeBSD's unparalleled stability, near-optimal efficiency, and easy software management via ports and packages make it an ideal platform for high performance computing (HPC) clusters. There is no better platform for running huge computational jobs that may require weeks or months of uninterrupted uptime. FreeBSD is the only operating system we've found that offers enterprise stability combined with top-tier development tools and software management (FreeBSD ports).

FreeBSD is the basis of many products used in HPC, including FreeNAS, Isilon, NetApp, OPNsense, Panasas, and pfSense.

Many FreeBSD HPC clusters are in use today, serving science, engineering, and other disciplines. FreeBSD is a supported platform on Amazon's EC2 virtual machine service. It is also a little-known fact that the special effects for the movie "The Matrix" were rendered on a FreeBSD cluster.

FreeBSD can run most Linux binaries natively (with better performance than Linux in some cases) using its CentOS-compatible Linux compatibility module. This module is *NOT* an emulation layer. It simply adds Linux system calls to the FreeBSD kernel so that it can run Linux binaries directly. Hence, there is no performance penalty. The only cost is a small amount of memory and disk space used to house the module and the Linux software.
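
Enabling the compatibility layer on a FreeBSD node takes a few commands; a minimal sketch, run as root:

    # Load the 64-bit Linux system-call module now and at every boot.
    kldload linux64
    sysrc linux_enable=YES

    # Install a CentOS-derived userland for Linux binaries to link against.
    pkg install linux_base-c7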

To get started:

  1. Do a basic RHEL/CentOS minimal or FreeBSD installation on your head node.
  2. Download and run the cluster-bootstrap script.
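
For example, on a FreeBSD head node the second step might look like the following; the URL is a placeholder, so substitute the actual location given on the SPCM download page (on CentOS, use curl -O in place of fetch):

    # Placeholder URL: use the address from the SPCM download page.
    fetch https://example.org/spcm/cluster-bootstrap
    sh ./cluster-bootstrap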