Awesome HPC 
High Performance Computing tools and resources for engineers and
administrators.
High
Performance Computing (HPC) most generally refers to the practice of
aggregating computing power in a way that delivers much higher
performance than one could get out of a typical desktop computer or
workstation in order to solve large problems in science, engineering, or
business.
Provisioning
- Grendel - Bare Metal
Provisioning system for HPC Linux clusters (Source Code)
GPL-3.
- xCAT - xCAT is a toolkit for
deployment and administration of clusters of all sizes (Source Code)
EPL-1.0.
- Warewulf - Warewulf is a
stateless and diskless container operating system provisioning system
for large clusters of bare metal and/or virtual systems (Source Code)
BSD-3.
- Rocks - A Linux
distribution for developing Linux clusters
other.
- Cobbler - Cobbler is a
Linux installation server that allows for rapid setup of network
installation environments (Source Code)
GPL-2.0.
- Base
Command Manager - Base Command Manager allows administrators to
quickly build and manage heterogeneous clusters
Proprietary.
- Scyld ClusterWare - Scyld ClusterWare is based on the continuing
evolution of Beowulf clusters, which were first developed at NASA in the
1990s
Proprietary.
- BlueBanquise - BlueBanquise
is an open source cluster deployment and management stack built on
Python and Ansible (Source Code)
MIT.
Workload Managers
- Slurm - A
free and open source job scheduler (Source Code)
OSS.
- LSF - A
job scheduler and workload management software developed by IBM
Proprietary.
- Moab -
Moab is a workload manager and job scheduler
other.
- Torque - Torque
is a workload manager and job scheduler
other.
- OpenLava -
OpenLava is a workload manager and job scheduler
other.
- UGE/SGE -
Univa Grid Engine is a workload management engine for HPC
Proprietary.
- Volcano - Volcano is a batch
system built on Kubernetes
Apache-2.0.
- Maui - Maui is a workload
manager and job scheduler
other.
- Kube
Batch - A batch scheduler for Kubernetes for high-performance
workloads, e.g. AI/ML, big data, and HPC
Apache-2.0.
- OpenPBS - OpenPBS® software
optimizes job scheduling and workload management in high-performance
computing (HPC) environments (Source Code)
other.
Pipelines
- Nextflow - Data-driven
computational pipelines
Apache-2.0.
- Cromwell -
Scientific workflow engine designed for simplicity & scalability (Source Code)
BSD-3.
- Pegasus - A configurable
system for mapping and executing scientific workflows over a wide range
of computational infrastructure (Source
Code)
Apache-2.0.
Applications
- Spack - A flexible package manager
that supports multiple versions, configurations, platforms, and
compilers (Source Code)
other.
- EasyBuild - EasyBuild - building
software with ease (Source Code)
GPL-2.
Compilers
- Nvidia -
NVIDIA HPC compiler suite for Fortran, C/C++ with OpenACC
Proprietary.
- Portland Group - The
Portland Group (PGI) Fortran and C/C++ compilers, now integrated
into the NVIDIA HPC SDK
Proprietary.
- Intel
- The Intel compiler suite offers many language compilers for use in the
HPC space
Proprietary.
- Cray - A
suite of compilers designed and optimized to target the AMD Interlagos
instruction set
Proprietary.
- GNU - The GNU Compiler Collection
is a suite of compilers targeting many languages (Source Code)
GPL-3.
- LLVM - The LLVM project is a
collection of modular compilers and toolchains (Source Code)
OSS.
MPI
- OpenMPI - OpenMPI is an open
source implementation of the MPI-3.1 standard (Source Code)
BSD.
- MPICH - MPICH is a
high-performance and widely portable implementation of the MPI-3.1
standard (Source Code)
other.
- MVAPICH - MVAPICH
is an open source implementation of the MPI-3.1 standard developed by
Ohio State University
BSD.
- Intel-MPI
- Intel-MPI is Intel’s MPI-3.1 implementation included in their compiler
suite
other.
Parallel Computing
- ArrayFire - A
general purpose tensor library that simplifies the process of software
development for parallel architectures
other.
- OpenMP - OpenMP is an
application programming interface that supports multi-platform
shared-memory multiprocessing programming
other.
Benchmarking
- OSU
Benchmarks - A collection of benchmarking tools for MPI developed by
Ohio State University
other.
- Intel
MPI Benchmarks - A set of benchmarks developed by Intel for use with
Intel MPI
other.
- HPCC Systems - HPCC Systems
(High Performance Computing Cluster) is an open source, massively
parallel-processing computing platform for big data processing and
analytics (Source Code)
other.
- LINPACK - LINPACK is a
set of efficient Fortran subroutines for solving linear systems, whose
benchmarks are useful for measuring HPC performance
other.
- IOzone - IOzone is a
filesystem benchmark tool
OSS.
- IOR -
IOR (Interleaved or Random) is a benchmarking tool for testing parallel
filesystems
other.
- MDtest -
MDtest is an MPI-based application for evaluating the metadata
performance of a file system
other.
- FIO
- Flexible I/O (fio) is an advanced disk benchmark that supports many
I/O engines, including the Linux kernel’s native asynchronous I/O (libaio) (Source Code)
GPL-2.
- elbencho - A
distributed storage benchmark for files, objects & blocks with
support for GPUs
GPL-3.
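The LINPACK entry above refers to benchmarks built on solving a dense linear system Ax = b. The following is a minimal pure-Python sketch of that kind of measurement, not HPL or any LINPACK routine: it solves the system by Gaussian elimination with partial pivoting (the operations of an LU factorization plus the triangular solves), checks the residual, and converts the runtime into a rate using the operation count 2/3·n³ + 3/2·n² that HPL uses when reporting Gflop/s. The function names and the random test matrix are assumptions made for illustration.

```python
import random
import time


def lu_solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    a is a list of n rows (lists of floats) and b a list of n floats.
    Both are copied, so the caller's data is left unchanged.
    """
    n = len(a)
    m = [row[:] for row in a]
    x = list(b)
    for k in range(n):
        # Partial pivoting: bring the row with the largest |entry| in column k up.
        p = max(range(k, n), key=lambda i: abs(m[i][k]))
        if m[p][k] == 0.0:
            raise ValueError("matrix is singular")
        m[k], m[p] = m[p], m[k]
        x[k], x[p] = x[p], x[k]
        # Eliminate the entries below the pivot, updating the right-hand side too.
        for i in range(k + 1, n):
            f = m[i][k] / m[k][k]
            for j in range(k, n):
                m[i][j] -= f * m[k][j]
            x[i] -= f * x[k]
    # Back substitution on the resulting upper-triangular system.
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (x[i] - s) / m[i][i]
    return x


def hpl_flops(n):
    """Operation count HPL uses to turn a runtime into a rate: 2/3 n^3 + 3/2 n^2."""
    return 2.0 / 3.0 * n**3 + 1.5 * n**2


def run(n=100, seed=1):
    """Time one solve of a random n x n system; return (Gflop/s, max residual)."""
    rng = random.Random(seed)
    a = [[rng.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(n)]
    b = [rng.uniform(-0.5, 0.5) for _ in range(n)]
    t0 = time.perf_counter()
    x = lu_solve(a, b)
    # Guard against a zero interval on very coarse clocks.
    elapsed = max(time.perf_counter() - t0, 1e-9)
    # Largest component of A x - b; it should be tiny for a correct solve.
    resid = max(abs(sum(a[i][j] * x[j] for j in range(n)) - b[i]) for i in range(n))
    return hpl_flops(n) / elapsed / 1e9, resid


if __name__ == "__main__":
    rate, resid = run()
    print(f"~{rate:.3f} Gflop/s, max residual {resid:.2e}")
```

In pure Python the reported rate will be far below what an optimized, distributed implementation achieves; the sketch is only meant to show how a solve time, an operation count, and a residual check combine into a benchmark result.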
Miscellaneous
- OpenOnDemand - Open OnDemand
helps computational researchers and students efficiently utilize remote
computing resources by making them easy to access from any device (Source Code)
MIT.
- Open XDMoD - Open XDMoD is an
open source tool to facilitate the management of high performance
computing resources (Source
Code)
LGPL-3.
- ColdFront
- ColdFront is an open source resource allocation system designed to
provide a central portal for administration, reporting, and measuring
scientific impact of HPC resources (Source Code)
GPL-3.
- Pavilion2 - Pavilion
is a Python 3 (3.6+) based framework for running and analyzing tests
targeting HPC systems (Source
Code)
other.
- ReFrame
- A powerful Python framework for writing and running portable
regression tests and benchmarks for HPC systems. (Source Code)
BSD-3.
- OLCF Test
Harness - The OLCF Test Harness (OTH) helps automate the testing of
applications, tools, and other system software (Source Code)
other.
- GoSlmailer -
Goslmailer is a drop-in notification delivery solution for Slurm that
can deliver to Slack, Mattermost, Microsoft Teams, and more.
- TotalView -
TotalView is a debugging tool for HPC applications
Proprietary.
- Tau -
TAU Performance System® is a portable profiling and tracing toolkit for
performance analysis of parallel programs written in Fortran, C, C++,
UPC, Java, Python
other.
- Valgrind - Valgrind is an
instrumentation framework for dynamic analysis tools, including Memcheck for detecting memory leaks (Source Code)
GPL-2.
- Paraver - Paraver is a
very flexible data browser that is part of the CEPBA-Tools toolkit
other.
- PAPI - Performance
Application Programming Interface (PAPI) provides a portable interface to hardware performance counters
(Source Code)
other.
Parallel Shells
Containers
- Apptainer - Apptainer is an open
source container system (Source Code)
BSD.
- Charliecloud -
Charliecloud provides user-defined software stacks (UDSS) for
high-performance computing (HPC) centers (Source Code)
Apache-2.0.
- Docker - Docker is a set of
platform as a service products that use OS-level virtualization to
deliver software in packages called containers
other.
- uDocker - A basic
user tool to execute simple docker containers in batch or interactive
systems without root privileges (Source Code)
Apache-2.0.
- Shifter
- Shifter brings Linux containers to HPC (Source Code)
other.
- HPC
Container Maker - HPC Container Maker is an open source tool to make
it easier to generate container specification files.
Apache-2.0.
- Sarus - An
OCI-compatible container engine for HPC
BSD.
- Singularity HPC
- Singularity Registry HPC (shpc) allows you to install containers as
modules (Source
Code)
MPL-2.0.
Environment Management
- Lmod - Lmod: An
Environment Module System based on Lua, Reads TCL Modules, Supports a
Software Hierarchy (Source
Code)
other.
- Environment
Modules - Environment Modules: provides dynamic modification of a
user’s environment (Source
Code)
GPL-2.
- Anaconda - Anaconda is a
Python and R distribution for use in computational science
other.
- Mamba - Mamba
is a reimplementation of the conda package manager in C++ (Source Code)
BSD.
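Lmod and Environment Modules both work by editing path-style environment variables such as PATH and LD_LIBRARY_PATH when a module is loaded, and reversing those edits when it is unloaded. The Python sketch below mimics only that core idea. `prepend_path` and `remove_path` are named after the corresponding modulefile commands, but these helpers, `EXAMPLE_MODULE`, and the /opt/apps/example/1.0 directory are hypothetical, and the code does not reproduce either tool's actual behavior.

```python
import os

SEP = os.pathsep  # ":" on Linux, where HPC module systems normally run


def prepend_path(env, var, directory):
    """Return a copy of env with directory placed first in the path list var."""
    parts = [p for p in env.get(var, "").split(SEP) if p and p != directory]
    new_env = dict(env)
    new_env[var] = SEP.join([directory] + parts)
    return new_env


def remove_path(env, var, directory):
    """Return a copy of env with directory removed from the path list var."""
    parts = [p for p in env.get(var, "").split(SEP) if p and p != directory]
    new_env = dict(env)
    if parts:
        new_env[var] = SEP.join(parts)
    else:
        # Drop the variable entirely once nothing is left in it.
        new_env.pop(var, None)
    return new_env


# Hypothetical module: the path edits to apply when it is loaded.
EXAMPLE_MODULE = [
    ("PATH", "/opt/apps/example/1.0/bin"),
    ("LD_LIBRARY_PATH", "/opt/apps/example/1.0/lib"),
]


def load(env, module):
    """Apply every path edit of a module, like a module load."""
    for var, directory in module:
        env = prepend_path(env, var, directory)
    return env


def unload(env, module):
    """Reverse every path edit of a module, like a module unload."""
    for var, directory in module:
        env = remove_path(env, var, directory)
    return env
```

This toy `unload` simply removes each directory, even one that was already on the path before the module was loaded; real module systems track that state, along with conflicts and dependencies between modules.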
Visualization
- VisIt -
VisIt - Visualization and Data Analysis for Mesh-based Scientific Data
(Source Code)
BSD-3.
- ParaView - ParaView is an
open-source, multi-platform data analysis and visualization application
based on Visualization Toolkit (VTK) (Source Code)
BSD-3.
Parallel Filesystems
- GPFS
- GPFS is a high-performance clustered file system software developed by
IBM
Proprietary.
- Quobyte
- A high performance filesystem
Proprietary.
- Ceph - Ceph is a distributed
object, block, and file storage platform (Source Code)
other.
- Weka - A file system designed for
HPC
Proprietary.
- Lustre/Exascaler - Lustre is
an open-source, distributed parallel file system software platform
designed for scalability, high-performance, and high-availability (Source Code)
other.
- BeeGFS - BeeGFS is a
hardware-independent POSIX parallel file system developed with a strong
focus on performance and designed for ease of use, simple installation,
and management
Proprietary.
- OrangeFS - OrangeFS is a next
generation parallel file system for Linux clusters (Source Code)
other.
- MooseFS - Moose File System is an
open-source, POSIX-compliant distributed file system developed by Core
Technology (Source
Code)
GPL-2.0.
Programming Languages
- Julia - Julia is a high-level,
high-performance dynamic language for technical computing
MIT.
- Futhark - Futhark is a
purely functional data-parallel programming language in the ML family
ISC.
- Chapel - Chapel is a
programming language designed for productive parallel computing at scale
Apache-2.0.
Monitoring
Prometheus-Based
- Slurm
Exporter - Prometheus exporter for performance metrics from Slurm
GPL-3.0.
- Slurm Exporter
- Slurm Exporter for Prometheus using the REST API
GPL-3.0.
- Infiniband
Exporter - The InfiniBand exporter collects counters from InfiniBand
switches and HCAs
Apache-2.0.
- Cgroup
Exporter - Produces metrics from cgroups
Apache-2.0.
- Cgroup
Exporter - A Prometheus exporter for cgroup-level metrics
unknown.
- GPFS
Exporter - The GPFS exporter collects metrics from the GPFS
filesystem
Apache-2.0.
- Lustre
Exporter - Prometheus exporter for use with the Lustre parallel
filesystem
GPL-3.0.
- DCGM Exporter
- NVIDIA GPU metrics exporter for Prometheus leveraging DCGM
Apache-2.0.
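The exporters above share one pattern: each collects metrics from a source (Slurm, cgroups, GPFS, Lustre, GPUs) and exposes them over HTTP in the Prometheus text exposition format, which the Prometheus server scrapes periodically. The sketch below renders a gauge in that format using only the standard library. The metric name `slurm_jobs`, its `state` label, and the job counts are hypothetical and are not taken from any exporter listed here.

```python
def escape_label_value(value):
    """Escape backslash, newline, and double quote, as the text format requires."""
    return value.replace("\\", "\\\\").replace("\n", "\\n").replace('"', '\\"')


def render_gauge(name, help_text, samples):
    """Render one gauge metric family in the Prometheus text exposition format.

    samples is a list of (labels, value) pairs, where labels is a dict of
    label names to string values (possibly empty).
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        if labels:
            body = ",".join(
                f'{k}="{escape_label_value(v)}"' for k, v in sorted(labels.items())
            )
            lines.append(f"{name}{{{body}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical job counts, e.g. as a Slurm exporter might gather them.
counts = {"running": 42, "pending": 7}
text = render_gauge(
    "slurm_jobs",
    "Number of Slurm jobs by state (hypothetical example metric).",
    [({"state": state}, n) for state, n in counts.items()],
)
print(text, end="")
```

A real exporter serves this text at an HTTP endpoint (conventionally `/metrics`) with the content type `text/plain; version=0.0.4`; client libraries such as the official Python `prometheus_client` handle the formatting and serving for you.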
Journals
Podcasts
- This week in
HPC - Each week, Intersect360 Research CEO Addison Snell and HPCwire
editor Tiffany Trader dissect the week’s top HPC stories.
- Exascale Computing
Project - ECP’s Let’s Talk Exascale podcast goes behind the scenes
to chat with some of the people who are bringing a capable and
sustainable exascale computing ecosystem to fruition.
- @HPCpodcast - Join
Shahin Khan and Doug Black as they discuss supercomputing technologies
and the applications, markets, and policies that shape them.
Blogs
- HPCWire - Since 1987 covering
the fastest computers in the world and the people who run them.
- InsideHPC - insideHPC is a
global publication recognized for its comprehensive and insightful
coverage of the HPC-AI community, linking vendors, end-users and HPC
strategists.
- The Next
Platform - Offers in-depth coverage of high-end computing at large
enterprises, supercomputing centers, hyperscale data centers, and public
clouds.
- The Register
HPC - The Register is a leading and trusted global online enterprise
technology news publication, reaching roughly 40 million readers
worldwide.
- HPC at Dell - High-Performance
Computing knowledge base articles from Dell.
Conferences
- PEARC - Practice &
Experience in Advanced Research Computing.
- Supercomputing (SC) - The
International Conference for High Performance Computing, Networking,
Storage, and Analysis.
- International Supercomputing Conference
(ISC) - ISC High Performance, the international conference for high
performance computing, held annually in Germany.
- CCGrid - IEEE/ACM
International Symposium on Cluster, Cloud and Internet Computing.
- IEEE-HPEC - IEEE High
Performance Embedded Computing.
- Hot Chips - Semiconductor
industry’s leading conference on high-performance microprocessors and
related circuits.
- Hot Interconnects - IEEE conference
on software architectures and implementations for interconnection
networks of all scales.
- ESSA -
Workshop on Extreme-Scale Storage and Analysis.
- IEEE-IPDPS - IEEE International
Parallel & Distributed Processing Symposium.
- ESPM2 Workshop
- International Workshop on Extreme Scale Programming Models and
Middleware.
- LCI
Workshops - The Linux Clusters Institute (LCI) provides
education and advanced technical training for the deployment and use of
computing clusters to the high performance computing community
worldwide.
- HPC Carpentry -
Teaching basic skills for high-performance computing.
Websites
- Top500 - The TOP500 project ranks
and details the 500 most powerful non-distributed computer systems in
the world.
User Groups
- MVAPICH User Group (MUG) - The
MUG conference provides an open forum for all attendees (users, system
administrators, researchers, engineers, and students) to discuss and
share their knowledge on using MVAPICH libraries.
- Slurm -
The annual Slurm user group meeting.
Contributing
Contributing guidelines can be found in contributing.md.