HPC Carpentry part 2 - High Performance Computing: Cheatsheets for Queuing System Quick Reference

Key Points

Day one follow up
Working on a remote HPC system
  • HPC systems typically provide login nodes and a set of worker nodes.

  • The resources found on independent (worker) nodes can vary in volume and type (amount of RAM, processor architecture, availability of network mounted file systems, etc.).

  • Files saved on one node are available on all nodes, because home and scratch directories usually live on a shared file system.

Scheduling jobs
  • The scheduler handles how compute resources are shared between users.

  • Everything you do should be run through the scheduler.

  • A job is just a shell script.

  • If in doubt, request more resources than you will need.
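
A job really is just a shell script with scheduler directives at the top. A minimal sketch, assuming a Slurm scheduler; the partition name, limits, and script name are placeholders your site will differ on:

```shell
#!/bin/bash
#SBATCH --job-name=example      # name shown in the queue
#SBATCH --partition=standard    # placeholder partition name
#SBATCH --time=00:10:00         # wall-clock limit (hh:mm:ss)
#SBATCH --ntasks=1              # a single task on one core
#SBATCH --mem=1G                # requested memory

# The body is an ordinary shell script.
echo "Running on $(hostname)"
```

Submit it with sbatch example-job.sh, watch it with squeue -u $USER, and cancel it with scancel followed by the job ID.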

Accessing software
  • Load software with module load softwareName

  • Unload a single package with module unload softwareName; unload everything at once with module purge

  • The module system handles software versioning and package conflicts for you automatically.
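
A typical session with the module system (Lmod or Environment Modules); the package name gcc is just an example of what might be installed on your cluster:

```shell
module avail            # list software available on this cluster
module load gcc         # load a package (a default version is chosen)
module list             # show what is currently loaded
module unload gcc       # unload that one package
module purge            # unload everything at once
```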

Transferring files
  • wget downloads a file from the internet.

  • scp transfers files to and from your computer.

  • You can use an SFTP client like FileZilla to transfer files through a GUI.
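
Sketches of the two command-line transfer methods; the URL, username, hostname, and paths below are all placeholders:

```shell
# Download a file from the internet onto the cluster:
wget https://example.org/dataset.tar.gz

# Copy a local file to the cluster (run from your own computer):
scp dataset.tar.gz yourUsername@cluster.example.org:/home/yourUsername/

# Copy a results file from the cluster back to the current local directory:
scp yourUsername@cluster.example.org:/home/yourUsername/results.txt .
```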

Using resources effectively
  • The smaller your job, the faster it will schedule.

  • For a constant problem size, most parallel codes will eventually stop getting faster with more processes.

  • If you have a serial code, it will still only run on one core even if you ask for multiple cores in your submission script.
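
To check whether a finished job actually used what it asked for, Slurm's sacct accounting tool can help (the job ID below is a placeholder; field availability depends on your site's accounting setup):

```shell
# Replace 12345 with your own job ID.
sacct -j 12345 --format=JobID,Elapsed,MaxRSS,NCPUS,State
```

Comparing MaxRSS and Elapsed against what you requested tells you whether to shrink the next submission.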

Scheduling multiple similar jobs
  • Job arrays allow you to launch many jobs that are each assigned a different index value, using just one submission script.
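
A sketch of a Slurm job array, with placeholder names and limits; each task reads its own index from the SLURM_ARRAY_TASK_ID environment variable:

```shell
#!/bin/bash
#SBATCH --job-name=array-example
#SBATCH --array=1-5             # launch five tasks, indexed 1..5
#SBATCH --time=00:05:00

# Each task sees a different value of SLURM_ARRAY_TASK_ID,
# so it can select its own input file.
echo "Processing input-${SLURM_ARRAY_TASK_ID}.dat"
```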

Using shared resources responsibly
  • Clusters are designed to handle large numbers of users.

  • The shared resources to be careful with are memory and CPUs on the login nodes, and file-system I/O.

  • Your data on the system is your responsibility.

  • Plan and test large data transfers.

  • It is often best to convert many files to a single archive file before transferring.

  • Don’t run stuff on the login node.

  • Really don’t run stuff on the login node.

  • When in doubt, talk to the cluster support team first.
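
The archiving advice above can be sketched with tar; the file and directory names are illustrative:

```shell
# Bundle many small files into one compressed archive before transferring.
mkdir -p results
touch results/run1.dat results/run2.dat results/run3.dat
tar -czf results.tar.gz results/   # create the compressed archive

# Inspect the archive's contents without extracting it.
tar -tzf results.tar.gz
```

Transferring one archive avoids the per-file overhead that makes moving thousands of small files slow.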

Cheatsheets for Queuing System Quick Reference

Glossary

The following list captures terms that need to be added to this glossary. This is a great way to contribute.

Accelerator
to be defined
Beowulf cluster
to be defined
Central processing unit
to be defined
Cloud computing
to be defined
Cluster
a collection of computers configured to enable collaboration on a common task by means of purposefully configured hardware (e.g., networking) and software (e.g. workload management).
Distributed memory
to be defined
Grid computing
to be defined
High availability computing
to be defined
High performance computing
to be defined
Interconnect
to be defined
Node
to be defined
Parallel
to be defined
Serial
to be defined
Server
to be defined
Shared memory
to be defined
Slurm
to be defined
Supercomputer
… “a major scientific instrument” …
Workstation
to be defined
Grid Engine
to be defined
Parallel File System
to be defined