To efficiently manage the environment and provide users and applications with a consistent way of using the resources, we have deployed widely used, robust operating systems and resource managers. The Red Hat distribution of Linux is installed on each node to provide a secure operating environment supported by all major research applications. Most resources are provisioned to user applications through Slurm, a powerful workload scheduler with which users request resources such as compute cores and memory for specified periods of time. The Analytics clusters are managed by Apache Hadoop, a resource manager tailored to data-intensive applications. Storage services are provided through Lustre, a powerful parallel file system, and a general network file service based on NFS. The computer hardware environment is built on industry-standard, widely adopted technologies.
We have standardized on Red Hat Enterprise Linux to provide a stable platform on which many vendors have certified their software to run. Although a commercial company, Red Hat creates, maintains, and contributes to many free software projects and has also acquired several proprietary software packages. Red Hat releases its source code mostly under the GNU GPL, while holding the copyright within a single commercial entity and selling more permissive licenses.
Slurm is an open-source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained. As a cluster workload manager, Slurm has three essential functions. First, it allocates exclusive and non-exclusive access to resources (compute nodes) to users for a duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work. Slurm development is led by SchedMD.
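In practice, a user describes the resources they need in a batch script and hands it to Slurm. The sketch below is an illustrative example, not a site-specific template; the job name, resource amounts, and time limit are placeholder values, and the options available vary by cluster configuration:

```shell
#!/bin/bash
# Hypothetical example job script; all resource values are placeholders.
#SBATCH --job-name=example       # name shown in the queue
#SBATCH --nodes=1                # number of compute nodes
#SBATCH --ntasks=4               # number of tasks (cores) to allocate
#SBATCH --mem=8G                 # memory for the whole job
#SBATCH --time=01:00:00          # wall-clock time limit (HH:MM:SS)

# Commands below run on the allocated node(s) once the job starts.
echo "Running on $(hostname) with ${SLURM_NTASKS:-unknown} tasks"
```

The script would be submitted with `sbatch`, after which `squeue` shows it in the pending-work queue that Slurm arbitrates, and `scancel` removes it.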
We believe in using Free/Open-Source Software (F/OSS) in our environment whenever possible.
Apache Hadoop is a Java-based software framework developed and maintained by the Apache Software Foundation that supports data-intensive distributed applications under a free license. It is designed to scale up from a single server to thousands of machines, each offering local computation and storage. Hadoop was inspired by Google’s MapReduce and the Google File System (GFS).
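The MapReduce model that inspired Hadoop can be sketched in miniature. The snippet below is a plain-Python word count for illustration only (it uses no Hadoop APIs): a map phase emits key–value pairs, a shuffle groups values by key, and a reduce phase combines each group, which is the work Hadoop distributes across many machines.

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for each word in one line of input.
    for word in line.split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: combine the values for one key (here, sum the counts).
    return (key, sum(values))

lines = ["the quick brown fox", "the lazy dog"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
# counts["the"] == 2; every other word appears once
```

In Hadoop proper, each phase runs in parallel on the nodes holding the data, with intermediate results stored in HDFS.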
We have standardized on Cloudera's Distribution of Hadoop (CDH) for our clusters, which provides the following Hadoop services: HDFS, HBase, Hive, Hue, Impala, Oozie, Spark2, Sqoop 2, and YARN (with MapReduce2).
Hardware & Storage
Our Research Computing clusters are primarily Intel Xeon-based Dell servers, but we also have some AMD EPYC-based compute nodes. We have a mix of models and generations, but our primary compute nodes are Intel Xeon-based PowerEdge R630s and R640s and AMD EPYC-based PowerEdge R6525s. We offer compute nodes with different capabilities, so if you need large memory nodes or GPU nodes, we've got you covered. Our GPU nodes provide a mix of NVIDIA cards: GTX-1080ti, Titan V, and Titan RTX, as well as Tesla V100S and A100 Tensor Core GPUs. Our large memory nodes range from 1.5TB to 4TB of RAM in a single system.
We have a high-speed Mellanox 100Gb/s EDR InfiniBand fabric in one data center and a 200Gb/s HDR InfiniBand fabric in our other data center. The fabrics are connected via dual redundant Mellanox Technologies MetroX-2 long-haul IB switches. Our Lustre file system is served over our IB fabric to provide high throughput for I/O-intensive compute jobs.
For a more detailed overview of the types of systems that make up each cluster, please check out our Research Clusters and Educational Clusters pages. Research Computing provides an extensive set of applications and codes for use by our researchers on the cluster.