UBELIX


Job Submission


Description

This page describes the job submission process with Slurm.

It is important to collect error/output messages either by writing such information to the default location or by specifying specific locations using the --error/--output options. Do not redirect the error/output stream to /dev/null unless you know what you are doing. Error and output messages are the starting point for investigating a job failure.
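
For example, a job script might collect output and error messages in dedicated files. This is only a sketch; the directory and file names are placeholders, and the directories must exist before the job starts:

    # %j is replaced with the job ID; adjust the (placeholder) paths to your setup
    #SBATCH --output=/home/group/user/jobs/myjob_%j.out
    #SBATCH --error=/home/group/user/jobs/myjob_%j.err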

To improve backfilling performance, and hence maximize job throughput, it is crucial to submit array jobs (collections of similar jobs) instead of submitting the same job repeatedly.
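
As a sketch (job.sh and the index range are placeholders), a single array submission replaces a whole series of identical submissions:

    # Instead of submitting the same script many times in a loop ...
    # for i in $(seq 1 100); do sbatch job.sh; done
    # ... submit one array job with 100 tasks:
    sbatch --array=1-100 job.sh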


Resource Allocation

Every job submission starts with a resource allocation (nodes, cores, memory). An allocation is valid for a specific amount of time and can be created using the salloc, sbatch or srun commands. Whereas salloc and sbatch only create resource allocations, srun launches parallel tasks within such a resource allocation, or implicitly creates an allocation if not started within one. The usual procedure is to combine resource requests and task execution (job steps) in a single batch script (job script) and then submit the script using the sbatch command.
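
For example, srun can be invoked directly (outside an existing allocation) to create an implicit one-task allocation; the resource values below are merely illustrative:

    # Implicitly allocates one task for 5 minutes, runs hostname, then frees the allocation
    srun --ntasks=1 --time=00:05:00 --mem-per-cpu=1G hostname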

Most command options support a short form as well as a long form (e.g. -u <username> and --user=<username>). Because some options only support the long form, we will consistently use the long form throughout this documentation.

Some options have default values if not specified: The --time option has partition-specific default values (see scontrol show partition <partname>). The --mem-per-cpu option has a global default value of 2048MB.
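
To inspect these defaults, query the partition configuration, for example for the default partition (the exact fields shown depend on the Slurm configuration):

    # DefaultTime and MaxTime list the default and maximum runtime of the partition
    scontrol show partition all | grep -i time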

The default partition is the all partition. To select another partition one must use the --partition option, e.g. --partition=long.

Important information for investors regarding account selection

Investors have two different accounts for accounting purposes. The investor account (increased privileges) is used automatically when using the empi partition (--partition=empi). To use another partition, the user must explicitly select the group account (e.g. --account=dcb). To display your accounts, use: sacctmgr show assoc where user=<username> format=user,account%20,partition.
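
For convenience, the same command as a copy-and-paste block (replace <username> with your user name):

    # Lists all account/partition associations of the given user
    sacctmgr show assoc where user=<username> format=user,account%20,partition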

sbatch

The sbatch command is used to submit a job script for later execution. It is the most common way to submit a job to the cluster due to its reusability. Slurm options are usually embedded in the job script on lines prefixed with the '#SBATCH' directive. Options specified on the command line override the corresponding options embedded in the job script.

Syntax
		
    sbatch [options] script [args...] 
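
For example, an option given on the command line overrides the corresponding #SBATCH directive in the script (job.sh is a placeholder for your own job script):

    # The command-line value takes precedence over an embedded '#SBATCH --time=...' line
    sbatch --time=02:00:00 job.sh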

Job Script

Usually a job script consists of two parts. The first part is optional but highly recommended:

  • Slurm-specific options used by the scheduler to manage the resources (e.g. memory) and configure the job environment
  • Job-specific shell commands

The job script acts as a wrapper for your actual job. Command-line options can still be used to overwrite embedded options.

Although you can specify all Slurm options on the command line, we encourage you, for clarity and reusability, to embed Slurm options in the job script.

You can find a template for a job script under /gpfs/software/workshop/slurm_template.sh. Copy the template to your home directory and adapt it to your needs:
cp /gpfs/software/workshop/slurm_template.sh $HOME

Options

--mail-user
    Description: Mail address to contact the job owner. Must specify a valid email address!
    Example: --mail-user=foo.bar@baz.unibe.ch

--mail-type
    Description: When to notify the job owner: none, all, begin, end, fail, requeue, array_tasks
    Example: --mail-type=end,fail

--account
    Description: Which account to charge. Regular users don't need to specify this option. For users with enhanced privileges on the empi partition, see "Requesting an Account" below.
    Default: The user's default account.

--job-name
    Description: Specify a job name
    Example: --job-name="Simple Matlab"

--time
    Description: Expected runtime of the job. Format: dd-hh:mm:ss
    Example: --time=12:00:00 or --time=2-06:00:00
    Default: Partition-specific, see scontrol show partition <partname>

--mem-per-cpu
    Description: Minimum memory required per allocated CPU in megabytes. Different units can be specified using the suffix [K|M|G]
    Example: --mem-per-cpu=2G
    Default: 2048 MB

--tmp
    Description: Amount of local disk space that must be available on the compute node(s). The local scratch space for the job is referenced by the variable TMPDIR. Default units are megabytes. Different units can be specified using the suffix [K|M|G|T].
    Example: --tmp=8G or --tmp=2048

--ntasks
    Description: Number of tasks (processes). Used for MPI jobs that may run distributed over multiple compute nodes
    Example: --ntasks=4
    Default: 1, or chosen to match --nodes and --ntasks-per-node if specified

--nodes
    Description: Request a certain number of nodes
    Example: --nodes=2
    Default: 1, or chosen to match --ntasks and --ntasks-per-node if specified

--ntasks-per-node
    Description: Specifies how many tasks will run on each allocated node. Meant to be used with --nodes. If used together with --ntasks, the --ntasks option takes precedence and --ntasks-per-node is treated as a maximum count of tasks per node.
    Example: --ntasks-per-node=2

--cpus-per-task
    Description: Number of CPUs per task (threads). Used for shared memory jobs that run locally on a single compute node
    Example: --cpus-per-task=4
    Default: 1

--array
    Description: Submit an array job. Use "%" to specify the maximum number of tasks allowed to run concurrently.
    Example: --array=1,4,16-32:4 or --array=1-100%20

--workdir
    Description: Set the current working directory. All relative paths used in the job script are relative to this directory
    Default: The directory from where the sbatch command was executed

--output
    Description: Redirect standard output. All directories specified in the path must exist before the job starts!
    Default: stderr and stdout are connected to the same file slurm-%j.out, where '%j' is replaced with the job allocation number

--error
    Description: Redirect standard error. All directories specified in the path must exist before the job starts!
    Default: stderr and stdout are connected to the same file slurm-%j.out, where '%j' is replaced with the job allocation number

--partition
    Description: Request a specific partition. The "all" partition is the default; any other partition must be requested with the --partition option!
    Example: --partition=long
    Default: all

--dependency
    Description: Defer the start of this job until the specified dependencies have been satisfied. See man sbatch for a description of all valid dependency types
    Example: --dependency=afterany:11908

--hold
    Description: Submit the job in hold state. The job is not allowed to run until explicitly released

--immediate
    Description: Only submit the job if all requested resources are immediately available

--exclusive
    Description: Use the compute node(s) exclusively, i.e. do not share nodes with other jobs. CAUTION: Only use this option if you are an experienced user and really understand the implications of this feature. Used improperly, this option can lead to a massive waste of computational resources

--constraint
    Description: Request nodes with certain features. This option allows you to request a homogeneous pool of nodes for your MPI job
    Example: --constraint=ivy (all, long partition), --constraint=sandy (all partition only), --constraint=broadwell (empi partition only)

--parsable
    Description: Print the job id only
    Default: Without this option, sbatch prints "Submitted batch job <jobid>"

--test-only
    Description: Validate the batch script and return the estimated start time considering the current cluster state
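
The --parsable and --dependency options can be combined to chain jobs from the shell. The following sketch assumes two hypothetical job scripts, preprocess.sh and analysis.sh:

    # Capture the job ID of the first job
    jobid=$(sbatch --parsable preprocess.sh)
    # Submit the second job; it starts only after the first job has terminated
    sbatch --dependency=afterany:${jobid} analysis.sh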

Example

job.sh
		
	#!/bin/bash
#SBATCH --mail-user=foo.bar@baz.unibe.ch
#SBATCH --mail-type=end,fail
#SBATCH --job-name="Example Job"
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --time=00:10:00
#SBATCH --mem-per-cpu=1G
 
# Your code below this line
srun --mpi=pmi2 mpi_hello_world

Submit the job script:

		
    sbatch job.sh

See the Job Examples section below for more examples.

salloc

The salloc command is used to allocate resources (e.g. nodes), possibly with a set of constraints (e.g. number of processors per node), for later utilization. It is typically used to allocate resources and spawn a shell, in which the srun command is then used to launch parallel tasks.

Syntax
		
     salloc [options] [<command> [args...]]

Example

		
    bash$ salloc -N 2 sh
salloc: Granted job allocation 247
sh$ module load openmpi/1.10.2-gcc
sh$ srun --mpi=pmi2 mpi_hello_world
Hello, World.  I am 1 of 2 running on knlnode03.ubelix.unibe.ch
Hello, World.  I am 0 of 2 running on knlnode02.ubelix.unibe.ch
sh$ exit
salloc: Relinquishing job allocation 247

srun

The srun command creates job steps. One or multiple srun invocations are usually used from within an existing resource allocation. Thereby, a job step can utilize all resources allocated to the job, or utilize only a subset of the resource allocation. Multiple job steps can run sequentially in the order defined in the batch script or run in parallel, but can together never utilize more resources than provided by the allocation.

Syntax
		
    srun [options] executable [args...] 

Example

		
    #!/bin/bash
#SBATCH --mail-user=foo.bar@baz.unibe.ch
#SBATCH --mail-type=end,fail
#SBATCH --job-name="Example Job"
#SBATCH --nodes=2
#SBATCH --time=01:00:00
 
# Your code below this line
# Create some directories
export OUTFOO=/home/group/user/results/foo; mkdir -p $OUTFOO
export OUTBAR=/home/group/user/results/bar; mkdir -p $OUTBAR
# Each job step utilizes only one node. Run both job steps concurrently (&)
srun --nodes=1 foo > $OUTFOO/foo.out &
srun --nodes=1 bar > $OUTBAR/bar.out &
# Wait until all job steps have finished
wait

The job script itself is regarded as the first job step and is executed sequentially on the first compute node in the job allocation. Because only executables launched with the srun command run as (potentially parallel) tasks, the export, mkdir and wait commands are executed only once.

To illustrate the remarks, consider the following simplified example:

		
    #!/bin/bash
#SBATCH --mail-user=foo.bar@baz.unibe.ch
#SBATCH --mail-type=end
#SBATCH --job-name="Example Job"
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --time=00:01:00

# Your code below this line
hostname
Output
		
    knode01.ubelix.unibe.ch

Although the allocation contains 16 tasks distributed evenly over 2 nodes, the hostname command is executed only once on the first compute node in the allocation. The next example behaves differently:

		
    #!/bin/bash
#SBATCH --mail-user=foo.bar@baz.unibe.ch
#SBATCH --mail-type=end
#SBATCH --job-name="Example Job"
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --time=00:01:00

# Your code below this line
srun hostname
Output
		
    knode01.ubelix.unibe.ch
knode02.ubelix.unibe.ch
knode01.ubelix.unibe.ch
knode01.ubelix.unibe.ch
knode02.ubelix.unibe.ch
knode02.ubelix.unibe.ch
knode01.ubelix.unibe.ch
knode02.ubelix.unibe.ch

When using srun to execute hostname, the command is executed 8 times; 4 times on the first node and 4 times on the second node of the allocation. In this example, srun inherits all Slurm options specified for the sbatch command.

Requesting a Partition (Queue)

The default partition is the 'all' partition. If you do not explicitly request a partition, your job will run in the default partition. To request a different partition, you must use the --partition option:

		
    #SBATCH --partition=long

See here for a list of all available partitions.
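
You can also list the partitions and their time limits directly on the login node (the columns shown depend on the Slurm configuration):

    # Summarized view of all partitions, their time limits and node counts
    sinfo -s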

Requesting an Account

Accounts are used for accounting purposes. Every user has a default account that is used unless a different account is specified using the --account option. Regular users only have a single account and thus cannot request a different account. The default account for regular users is named after their group (e.g. dcb):

		
    $ sacctmgr show user foo
      User   Def Acct     Admin
---------- ---------- ---------
       foo        bar      None

The remaining information provided in this section applies only to users with enhanced privileges on the empi partition.

Users with enhanced privileges have an additional account for the empi partition. This additional account is set as their default account, which means they don't have to specify an account when submitting to the empi partition (--partition=empi), but must specify their "group account" (--account=<group>) when submitting to any other partition (e.g. all). If a wrong account/partition combination is requested, you will see the following error message:

		
    sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified

Example

Here are some examples of how to use the --account and --partition options.

Regular users:

		
    Submit to the "all" partition:
No options required!
 
Submit to any other partition:
--partition=<partname> e.g. --partition=empi
 

Users with enhanced privileges on the empi partition:

		
    Submit to the "all" partition:
--account=<grpname> e.g. --account=dcb
 
Submit to the "empi" partition:
--partition=empi
 
Submit to any other partition:
--account=<grpname> e.g. --account=dcb
--partition=<partname> e.g. --partition=long
 

Parallel Jobs

A parallel job either runs on multiple CPU cores on a single compute node, or on multiple CPU cores distributed over multiple compute nodes. With Slurm you can request tasks, and CPUs per task. A task corresponds to a process that may be made up of multiple threads (CPUs per task). Different tasks of a job allocation may run on different compute nodes, while all threads that belong to a certain process execute on the same node. For shared memory jobs (SMP, parallel jobs that run on a single compute node) one would request a single task and a certain number of CPUs for that task:

		
    #SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16

This is equivalent to:

		
    #SBATCH --cpus-per-task=16

For MPI jobs (parallel jobs that may be distributed over multiple compute nodes) one would request a certain number of tasks, and a certain number of CPUs per task, depending on the desired degree of distribution of the job over different compute nodes. If it does not matter if the job runs on a single compute node or is distributed over several compute nodes, one can simply request a certain number of tasks:

		
    #SBATCH --ntasks=16

In this case the distribution of tasks to different nodes is not easily predictable. The scheduler tries to place as many tasks as possible on the same compute node, then proceeds with the next compute node, and so forth. This means that your job may run on a single node, or run distributed over a certain number of compute nodes.

You can explicitly specify the number of nodes:

		
    #SBATCH --nodes=2
#SBATCH --ntasks=16

In this case the 16 tasks will be distributed over 2 compute nodes. The number of tasks running on a node depends on the available resources and the load on the node, but in general, Slurm tries to place as many tasks as possible on the same node.

You can explicitly state how many tasks must run on each node:

		
    #SBATCH --nodes=2
#SBATCH --ntasks-per-node=8

If the --ntasks-per-node option is used with the --ntasks option, the --ntasks option will take precedence and the --ntasks-per-node option will be treated as a maximum count of tasks per node.

The requested node, task, and CPU resources must match! For example, you cannot request one node (--nodes=1) and more tasks (--ntasks) than there are CPU cores available on a single node in the partition. In such a case you will see the error message: sbatch: error: Batch job submission failed: Requested node configuration is not available.
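
To check how many CPU cores and how much memory the nodes of each partition provide, and hence what can sensibly be requested per node, sinfo can be queried with a custom format (one possible choice shown below):

    # Print partition name, CPUs per node and memory per node (MB)
    sinfo -o "%P %c %m"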

Open MPI

Open MPI was compiled with Slurm support, which means that you do not have to specify the number of processes and the execution hosts using the -np and the -hostfile options. Slurm will automatically provide this information to mpirun based on the allocated tasks (--ntasks):

		
    #!/bin/bash
#SBATCH --mail-user=foo.bar@baz.unibe.ch
(...)
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=16
 
module load openmpi/1.10.2-gcc
mpirun <options> <binary>

Environment Variables

Slurm sets various environment variables available in the context of the job script. Some are set based on the requested resources for the job.

SLURM_JOB_NAME
    Set by: --job-name
    Description: Name of the job.

SLURM_ARRAY_JOB_ID
    Set by: --array
    Description: ID of the job (the job array's master job ID).

SLURM_ARRAY_TASK_ID
    Set by: --array
    Description: ID (index) of the current array task.

SLURM_ARRAY_TASK_MAX
    Set by: --array
    Description: Job array's maximum ID (index) number.

SLURM_ARRAY_TASK_MIN
    Set by: --array
    Description: Job array's minimum ID (index) number.

SLURM_ARRAY_TASK_STEP
    Set by: --array
    Description: Job array's index step size.

SLURM_NTASKS
    Set by: --ntasks
    Description: Number of tasks requested (same as -n, --ntasks).

SLURM_NTASKS_PER_NODE
    Set by: --ntasks-per-node
    Description: Number of tasks requested per node. Only set if the --ntasks-per-node option is specified.

SLURM_CPUS_PER_TASK
    Set by: --cpus-per-task
    Description: Number of CPUs requested per task. Only set if the --cpus-per-task option is specified.

TMPDIR
    Description: References the local scratch disk space for the job on the compute node.
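
The array-related variables are typically used to select per-task input and output. The following sketch is a hypothetical array job: the input files, the my_binary program and the result paths are placeholders, and the job stages its work in the local scratch referenced by TMPDIR:

    #!/bin/bash
    #SBATCH --mail-user=foo.bar@baz.unibe.ch
    #SBATCH --mail-type=end,fail
    #SBATCH --job-name="Array Job"
    #SBATCH --time=01:00:00
    #SBATCH --tmp=2G
    #SBATCH --array=1-100%20

    # Your code below this line
    # Select the input file that belongs to this array task (placeholder paths)
    INPUT=$HOME/data/input_${SLURM_ARRAY_TASK_ID}.dat
    # Work in the job-local scratch directory
    cd $TMPDIR
    $HOME/bin/my_binary $INPUT > result_${SLURM_ARRAY_TASK_ID}.out
    # Copy the result back to permanent storage before the job ends
    mkdir -p $HOME/results
    cp result_${SLURM_ARRAY_TASK_ID}.out $HOME/results/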

Job Examples

Sequential Job

Running a Single Job Step

		
    #!/bin/bash
#SBATCH --mail-user=foo.bar@baz.unibe.ch
#SBATCH --mail-type=end,fail
#SBATCH --job-name="Serial Job"
#SBATCH --time=04:00:00
 
# Your code below this line
echo "I'm on host:"
hostname
echo "Environment variables:"
env

Parallel Jobs

Shared Memory Jobs (e.g. OpenMP)

SMP parallelization is based upon dynamically created threads (fork and join) that share memory on a single node. The key request is "--cpus-per-task". To run N threads in parallel, we request N CPUs on the node (--cpus-per-task=N). OpenMP is not Slurm-aware; you need to set "export OMP_NUM_THREADS=..." in your submission script! OMP_NUM_THREADS (the maximum number of threads spawned by your program) must correspond to the number of cores requested. As an example, consider the following job script:

		
    #!/bin/bash
#SBATCH --mail-user=foo.bar@baz.unibe.ch
#SBATCH --mail-type=end,fail
#SBATCH --job-name="SMP Job"
#SBATCH --mem-per-cpu=2G
#SBATCH --cpus-per-task=16
#SBATCH --time=04:00:00
 
# Your code below this line
# set OMP_NUM_THREADS to the number of --cpus-per-task that we requested
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./my_binary

For optimal resource management, notably to prevent oversubscribing the compute node, setting the correct number of threads is crucial. The assignment OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK will ensure that your program does not spawn more threads than requested.

MPI Jobs (e.g. Open MPI)

MPI parallelization is based upon processes (local or distributed) that communicate by passing messages. Since they don't rely on shared memory, these processes can be distributed among several compute nodes.
Use the option --ntasks to request a certain number of tasks (processes) that can be distributed over multiple nodes: 

		
    #!/bin/bash
#SBATCH --mail-user=foo.bar@baz.unibe.ch
#SBATCH --mail-type=end
#SBATCH --job-name="MPI Job"
#SBATCH --mem-per-cpu=2G
#SBATCH --ntasks=8
#SBATCH --time=04:00:00
  
# Your code below this line
# First set the environment for using Open MPI
module load openmpi/1.10.2-gcc
srun --mpi=pmi2 ./my_binary # or, mpirun ./my_binary

On the 'empi' partition you must use all CPUs provided by a node (20 CPUs). For example, to run an Open MPI job on 80 CPUs:

		
    #!/bin/bash
#SBATCH --mail-user=foo.bar@baz.unibe.ch
#SBATCH --mail-type=end,fail
#SBATCH --job-name="MPI Job"
#SBATCH --mem-per-cpu=2G
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=20
#SBATCH --time=12:00:00
  
# Your code below this line
# First set the environment for using Open MPI
module load openmpi/1.10.2-gcc
srun --mpi=pmi2 ./my_binary # or, mpirun ./my_binary

Related pages: