This page provides a general overview of the UBELIX cluster and describes the components you will interact with. Subsequent pages in UBELIX 101 cover the topics you need to get up and running with UBELIX: how to log in to the cluster, how to move files between the cluster and your local workstation, and how to submit your first job. If you have never worked on a shared cluster or never used Slurm before, this page and the following pages are your starting point.
Some remarks about Linux:
UBELIX is a Linux cluster, meaning that all nodes in the cluster run Linux (CentOS). We assume that you have at least basic knowledge of how to get things done on a Linux operating system, including using a text editor, manipulating files, and writing simple Bash scripts. While we offer tutorials on selected topics, we do not provide a crash course in Linux; instead, we provide links to useful resources.
UBELIX (University of Bern Linux Cluster) is an HPC cluster consisting of about 266 compute nodes (4'408 cores) and a software-defined storage infrastructure providing ~580 TB of net disk storage. Compute nodes, front-end servers, and the storage are interconnected through a high-speed InfiniBand network. The front-end servers also provide the link to the outside world. UBELIX is used for scientific research by various institutes and research groups in chemistry, biology, physics, astronomy, computer science, geography, medical radiology, and other fields, as well as by students working on their theses.
UBELIX System Overview
Login Server (a.k.a. Front-End Server or Submit Server)
A user connects to the cluster by logging in to the submit host via SSH. You can use this host for light to medium tasks, e.g. editing files or compiling programs. Resource-demanding, high-performance tasks must be submitted to the batch-queueing system as jobs and will ultimately run on one or more compute nodes. Even long-running compile tasks are better submitted as jobs to a compute node than run on the submit host.
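Connecting and transferring files typically looks like the following sketch; the hostname `submit.example.unibe.ch` and the username are placeholders, not the actual values — consult the login documentation for those:

```shell
# Log in to the submit host (hostname and username are placeholders;
# replace them with the values from the login documentation).
ssh username@submit.example.unibe.ch

# Copy a file from your local workstation to your cluster home directory:
scp input.dat username@submit.example.unibe.ch:~/
```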
On UBELIX we use the open-source batch-queueing system Slurm for executing jobs on a pool of cooperating compute nodes. Slurm manages the distributed resources provided by the compute nodes and is responsible for accepting, scheduling, dispatching, and managing the remote and distributed execution of sequential, parallel or interactive user jobs.
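As a first illustration of how Slurm accepts work, a minimal batch script might look like the sketch below; the job name, resource values, and the command run are illustrative assumptions, not UBELIX defaults:

```shell
# Write a minimal Slurm batch script (all values are illustrative placeholders).
cat > hello_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello        # job name shown in the queue
#SBATCH --ntasks=1              # run a single task
#SBATCH --cpus-per-task=1      # one core for that task
#SBATCH --mem-per-cpu=1G        # memory per allocated core
#SBATCH --time=00:10:00         # wall-clock limit (HH:MM:SS)

echo "Hello from $(hostname)"
EOF

# On the cluster you would submit it with:
#   sbatch hello_job.sh
```

Slurm reads the `#SBATCH` lines as submission options, then executes the script body on the allocated compute node.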
Compute nodes are where user jobs ultimately execute. After you submit a job to the cluster, the scheduler checks the job's resource requirements and dispatches it to one or more compute nodes that can fulfill those requirements at that time. The following table lists the hardware details of the different compute node classes:
|Class|#Nodes|CPU Type|#Cores|RAM|Local Scratch|
|---|---|---|---|---|---|
|anodes|144|Intel Xeon CPU E5-2630 v4 @ 2.0GHz|20|120GB|850GB|
| | |Intel Xeon CPU X5550 @ 2.67GHz| | | |
| | |Intel Xeon CPU E5620 @ 2.40GHz| | | |
| | |Intel Xeon CPU X5680 @ 3.33GHz| | | |
| | |Intel Xeon CPU X5650 @ 2.67GHz| | | |
| | |Intel Xeon CPU E5649 @ 2.53GHz| | | |
|hnodes[01-42]|42|Intel Xeon CPU E5-2665 0 @ 2.40GHz|16|78GB|250GB|
|hnodes[43-49]|7|Intel Xeon CPU E5-2695 v2 @ 2.40GHz|24|94GB|500GB|
|jnodes|21|Intel Xeon CPU E5-2665 0 @ 2.40GHz|16|252GB|500GB|
|knodes|36|Intel Xeon CPU E5-2650 v2 @ 2.60GHz|16|125GB|850GB|
|knlnodes|4|Intel Xeon Phi CPU 7210 @ 1.30GHz|64|108GB + 16GB on CPU| |
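Once logged in, you can query the live state of the nodes and partitions yourself; the commands below are standard Slurm tools (the node name `anode001` is a hypothetical example):

```shell
# Overview of partitions: time limit, node count, cores, and memory per node.
sinfo -o "%P %l %D %c %m"

# Detailed hardware and state information for a single node
# ("anode001" is a hypothetical node name; pick one from sinfo's output).
scontrol show node anode001
```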
Cluster Partitions (Queues)
A partition is a container for a class of jobs. You choose a partition depending on your job's requirements. UBELIX provides the partitions shown in the following table:
|Partition name|Max runtime (wall clock)|Max memory per node|Max cores/threads per node (shared-memory jobs)|GPU|Compute Nodes|
|---|---|---|---|---|---|
| |96h|252 GB|20 cores × 1 thread per core = 20 "Slurm CPUs"|-|anodes, enodes, fnodes, hnodes[01-34],|
|empi|24h|125 GB|20 cores × 1 thread per core = 20 "Slurm CPUs"|-|anodes|
|long 1)|360h|94 GB|24 cores × 1 thread per core = 24 "Slurm CPUs"|-|hnodes[43-49]|
|gpu 2)|24h|256 GB|24 cores × 1 thread per core = 24 "Slurm CPUs"|48x nVidia GTX 1080 Ti + 6x nVidia Tesla P100| |
|phi 3)|24h|108 GB|64 cores × 4 threads per core = 256 "Slurm CPUs"|-|knlnodes[01-04]|
1) Due to the limited resources and the potentially long job runtime, access to the long partition must be requested explicitly once.
2) The gpu partition is access-restricted. If you need GPU resources, you have to request access to this partition by writing an email to email@example.com.
3) The phi partition is currently open to all users. Use it only with code that can benefit from the architecture.
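To target one of these partitions, you name it at submission time. A sketch for a GPU job follows; the GPU count, time limit, and script contents are illustrative assumptions, and access to the gpu partition must have been granted first:

```shell
# Batch script targeting the gpu partition (illustrative values).
cat > gpu_job.sh <<'EOF'
#!/bin/bash
#SBATCH --partition=gpu         # partition from the table above
#SBATCH --gres=gpu:1            # request one GPU on the node
#SBATCH --time=01:00:00         # must stay within the partition's 24h limit

nvidia-smi                      # show the GPU assigned to the job
EOF

# Submit on the cluster with:
#   sbatch gpu_job.sh
```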
A modular, software-defined storage system (IBM Spectrum Scale) provides a shared, parallel file system that is mounted on all front-end servers and compute nodes. UBELIX also provides a limited amount of storage space on the Campus Storage. The different storage locations are summarized in the table below. For more information about the storage infrastructure, see here.
1) Default: 3TB/user, 15TB/group
2) Default: 50GB/user
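To see how much of your quota your data occupies, generic Linux tools suffice; note that Spectrum Scale installations often also provide their own quota-reporting commands, which are cluster-specific:

```shell
# Total size of your home directory (may take a while for many files).
du -sh "$HOME"

# Free space on the file system that holds your home directory.
df -h "$HOME"
```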