UBELIX

Description

This page contains all the information you need to successfully submit GPU jobs on Ubelix.

Important Information on GPU Usage

Code that runs on the CPU will not magically make use of GPUs just because you submit the job to the 'gpu' partition! You have to explicitly adapt your code to run on the GPU. Also, code that runs on a GPU will not necessarily run faster than it does on the CPU: GPUs are poorly suited to tasks that are not highly parallelizable. You must therefore understand the characteristics of your job, and submit to the 'gpu' partition only jobs that can actually benefit from GPUs.

Privileged vs. Regular Users

Ubelix distinguishes two categories of GPU users: privileged and regular. Privileged users are users who have invested money in GPUs. Jobs of privileged users can preempt running jobs of regular users on a certain number of GPUs. A preempted job is automatically requeued, unless it was submitted with the option --no-requeue, in which case it is canceled. A requeued job can start on different resources.

This behavior is enforced by job QoSs: whether a job is privileged depends on the QoS it was submitted with. Regular users always submit their jobs with the unprivileged QoS 'job_gpu', while privileged users submit their jobs by default with the privileged QoS 'job_gpu_<name_of_head>'. Privileged users can also submit jobs with the unprivileged QoS. A privileged job will preempt a running unprivileged job when both of the following criteria are met:

  • There are no free GPU resources of the requested GPU type available.
  • The QoS of the privileged user has not yet reached the maximum number of GPUs that may be used with this QoS.

If an unprivileged job needs to be preempted to make resources available for a privileged job, Slurm will always preempt the youngest running job in the partition.

Because an unprivileged job can be preempted at any time, it is important that you checkpoint your jobs. This allows you to resubmit the job and continue execution from the last saved checkpoint.
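
The checkpoint-and-resume pattern described above can be sketched in shell as follows. This is only a minimal illustration, not Ubelix-specific tooling: the file name checkpoint.txt, the step counter, and the function names load_step, save_step, and run_steps are hypothetical stand-ins for whatever state your application actually needs to save.

```shell
#!/bin/bash
# Minimal checkpoint/resume sketch. All names below are hypothetical.
CHECKPOINT_FILE="${CHECKPOINT_FILE:-checkpoint.txt}"

# Print the last completed step, or 0 if no checkpoint exists yet.
load_step() {
    if [ -f "$CHECKPOINT_FILE" ]; then
        cat "$CHECKPOINT_FILE"
    else
        echo 0
    fi
}

# Record a completed step. The value is written to a temporary file and then
# renamed, so a preemption during the write cannot leave a truncated checkpoint.
save_step() {
    echo "$1" > "${CHECKPOINT_FILE}.tmp"
    mv "${CHECKPOINT_FILE}.tmp" "$CHECKPOINT_FILE"
}

# Run all steps up to and including step $1, resuming after the last checkpoint.
run_steps() {
    local last_step=$1
    local step
    step=$(load_step)
    while [ "$step" -lt "$last_step" ]; do
        step=$((step + 1))
        # ... the real work for this step would go here ...
        save_step "$step"
    done
}
```

With this structure, a call such as `run_steps 100` in the job script continues from the saved step, whether the job was requeued by Slurm or resubmitted by hand.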

Access to the 'gpu' Partition

While the 'gpu' partition is open to everybody, regular users must explicitly request access to this partition before they can submit jobs. You have to request access only once. To do so, write an email to grid-support@id.unibe.ch and briefly describe your application.

GPU Type

Ubelix currently features two types of GPUs:

  • 48x Nvidia GeForce GTX 1080 Ti
  • 4x Nvidia Tesla P100

You must request a GPU type using the --gres option:

    --gres=gpu:1080ti:<number_of_gpus>

or

    --gres=gpu:teslap100:<number_of_gpus>
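
As a small illustration of the two forms above, the following shell helper assembles the --gres value and rejects unknown GPU types. The function name gres_option is hypothetical; only the type names 1080ti and teslap100 come from this page.

```shell
#!/bin/bash
# Build a --gres option for one of the GPU types listed on this page.
# gres_option is a hypothetical helper, not part of Slurm or Ubelix.
gres_option() {
    local gpu_type=$1
    local count=${2:-1}   # default to a single GPU
    case "$gpu_type" in
        1080ti|teslap100) ;;
        *) echo "unknown GPU type: $gpu_type" >&2; return 1 ;;
    esac
    case "$count" in
        ''|*[!0-9]*|0) echo "invalid GPU count: $count" >&2; return 1 ;;
    esac
    echo "--gres=gpu:${gpu_type}:${count}"
}
```

It could then be used on the command line, for example as `sbatch --partition=gpu "$(gres_option 1080ti 2)" job.sh`, where job.sh is a placeholder for your own job script.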

Job Submission

Use the following options to submit a job to the 'gpu' partition using the default job QoS:

    #SBATCH --partition=gpu
    #SBATCH --gres=gpu:<type>:<number_of_gpus>
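
Put together, a complete job script might look like the sketch below. The job name, the time limit, and the program ./my_gpu_program are hypothetical placeholders, and --job-name and --time are standard sbatch options that this page does not otherwise discuss. The script also assumes that Slurm exports CUDA_VISIBLE_DEVICES for jobs that request GPUs via --gres; this is the usual behavior, but it depends on the cluster configuration.

```shell
#!/bin/bash
# Hypothetical job name and time limit:
#SBATCH --job-name=gpu_example
#SBATCH --time=01:00:00
# Options from this page:
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1080ti:1

# Count the GPUs in a comma-separated device list such as "0,1".
# Assumption: Slurm sets CUDA_VISIBLE_DEVICES for --gres=gpu jobs.
visible_gpu_count() {
    local devices=${1:-}
    if [ -z "$devices" ]; then
        echo 0
        return
    fi
    local IFS=,
    set -- $devices   # split the list on commas into positional parameters
    echo $#
}

main() {
    local n
    n=$(visible_gpu_count "${CUDA_VISIBLE_DEVICES:-}")
    if [ "$n" -eq 0 ]; then
        echo "No GPU visible to this job, aborting." >&2
        return 1
    fi
    echo "Running on $n GPU(s)"
    srun ./my_gpu_program   # hypothetical GPU-enabled program
}

# Run main only inside a Slurm job, so the file can be sourced safely elsewhere.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    main
fi
```

The early check makes a misconfigured job fail quickly instead of silently running a GPU program without a GPU.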

Privileged users only: Use the following options to submit a job using the unprivileged QoS:

    #SBATCH --partition=gpu
    #SBATCH --qos=job_gpu
    #SBATCH --gres=gpu:<type>:<number_of_gpus>

Use the following option to ensure that the job, if preempted, won't be requeued but canceled instead:

    #SBATCH --no-requeue

Further Information

CUDA: https://developer.nvidia.com/cuda-zone

CUDA C/C++ Basics: http://www.nvidia.com/docs/IO/116711/sc11-cuda-c-basics.pdf

Nvidia GeForce GTX 1080 Ti: https://www.nvidia.com/en-us/geforce/products/10series/geforce-gtx-1080-ti/

Nvidia Tesla P100: http://www.nvidia.com/object/tesla-p100.html
