This page describes the job submission process with Slurm.
Every job submission starts with a resource allocation (nodes, cores, memory). An allocation is valid for a specific amount of time, and can be created using the salloc, sbatch or srun commands. Whereas salloc and sbatch only create resource allocations, srun launches parallel tasks within such a resource allocation, or implicitly creates an allocation if not started within one. The usual procedure is to combine resource requests and task execution (job steps) in a single batch script (job script) and then submit the script using the sbatch command.
The sbatch command is used to submit a job script for later execution. It is the most common way to submit a job to the cluster due to its reusability. Slurm options are usually embedded in the job script as lines prefixed with the '#SBATCH' directive. Slurm options specified on the command line override the corresponding options embedded in the job script.
sbatch [options] script [args...]
Usually a job script consists of two parts. The first part is optional but highly recommended:
- Slurm-specific options used by the scheduler to manage the resources (e.g. memory) and configure the job environment
- Job-specific shell commands
The job script acts as a wrapper for your actual job. Command-line options can still be used to override embedded options.
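For example, an embedded option can be overridden at submission time (the script name job.sh is just a placeholder):

# job.sh contains '#SBATCH --time=01:00:00'; the command-line value takes precedence
sbatch --time=02:00:00 job.sh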
| Option | Description | Example | Default |
| --- | --- | --- | --- |
| --mail-user | Mail address to contact the job owner. Must be a valid email address | --mail-user=firstname.lastname@example.org | |
| --mail-type | When to notify the job owner: none, all, begin, end, fail, requeue, array_tasks | --mail-type=end,fail | |
| --account | Which account to charge. Regular users don't need to specify this option. For users with enhanced privileges on the empi partition, see here | | The user's default account |
| --job-name | Specify a job name | --job-name="Simple Matlab" | |
| --time | Expected runtime of the job. Format: dd-hh:mm:ss | --time=12:00:00 | Partition-specific, see scontrol show partition <partname> |
| --mem-per-cpu | Minimum memory required per allocated CPU in megabytes. Different units can be specified using one of the suffixes K, M or G | --mem-per-cpu=2G | 2048 MB |
| --tmp | Amount of local disk space that must be available on the compute node(s). The local scratch space for the job is referenced by the variable TMPDIR. Default units are megabytes. Different units can be specified using one of the suffixes K, M, G or T | --tmp=8G | |
| --ntasks | Number of tasks (processes). Used for MPI jobs that may run distributed over multiple compute nodes | --ntasks=4 | 1, or to match --nodes and --ntasks-per-node if specified |
| --nodes | Request a certain number of nodes | --nodes=2 | 1, or to match --ntasks and --ntasks-per-node if specified |
| --ntasks-per-node | Specifies how many tasks will run on each allocated node. Meant to be used with --nodes. If used together with --ntasks, the --ntasks option takes precedence and --ntasks-per-node is treated as a maximum count of tasks per node | --ntasks-per-node=8 | |
| --cpus-per-task | Number of CPUs per task (threads). Used for shared memory jobs that run locally on a single compute node | --cpus-per-task=4 | 1 |
| --array | Submit an array job. Use "%" to specify the maximum number of array tasks allowed to run concurrently | | |
| --workdir | Set the working directory of the job. All relative paths used in the job script are relative to this directory | | The directory from where the sbatch command was executed |
| --output | Redirect standard output. All directories specified in the path must exist before the job starts! | | By default stderr and stdout are connected to the same file slurm-%j.out, where '%j' is replaced with the job allocation number |
| --error | Redirect standard error. All directories specified in the path must exist before the job starts! | | By default stderr and stdout are connected to the same file slurm-%j.out, where '%j' is replaced with the job allocation number |
| --partition | The "all" partition is the default partition. A different partition must be requested with the --partition option! | --partition=long | Default partition: all |
| --dependency | Defer the start of this job until the specified dependencies have been satisfied. See man sbatch for a description of all valid dependency types, and the job-chaining example below this table | --dependency=afterany:11908 | |
| --hold | Submit the job in a held state. The job is not allowed to run until it is explicitly released | | |
| --immediate | Only submit the job if all requested resources are immediately available | | |
| --exclusive | Use the compute node(s) exclusively, i.e. do not share nodes with other jobs. CAUTION: Only use this option if you are an experienced user and really understand the implications of this feature. If used improperly, this option can lead to a massive waste of computational resources | | |
| --constraint | Request nodes with certain features. This option allows you to request a homogeneous pool of nodes for your MPI job | --constraint=ivy (all, long partition) | |
| --parsable | Print the job ID only | | Default output: "Submitted batch job <jobid>" |
| --test-only | Validate the batch script and return the estimated start time considering the current cluster state | | |
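As a small illustration of the --parsable and --dependency options, two jobs can be chained so that the second one only starts after the first one has terminated (the script names preprocess.sh and analyze.sh are placeholders):

# Submit the first job and capture its job ID (--parsable prints only the ID)
jobid=$(sbatch --parsable preprocess.sh)
# Start the second job only after the first one has terminated (for any reason)
sbatch --dependency=afterany:$jobid analyze.sh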
#!/bin/bash
#SBATCH --mail-user=email@example.com
#SBATCH --mail-type=end,fail
#SBATCH --job-name="Example Job"
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --time=00:10:00
#SBATCH --mem-per-cpu=1G
# Your code below this line
srun --mpi=pmi2 mpi_hello_world
Submit the job-script:
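Assuming the job script above was saved as job.sh (a placeholder name):

sbatch job.sh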
The salloc command is used to allocate resources (e.g. nodes), possibly with a set of constraints (e.g. number of processors per node) for later utilization. It is typically used to allocate resources and spawn a shell, in which the srun command is then used to launch parallel tasks.
salloc [options] [<command> [args...]]
bash$ salloc -N 2 sh
salloc: Granted job allocation 247
sh$ module load openmpi/1.10.2-gcc
sh$ srun --mpi=pmi2 mpi_hello_world
Hello, World. I am 1 of 2 running on knlnode03.ubelix.unibe.ch
Hello, World. I am 0 of 2 running on knlnode02.ubelix.unibe.ch
sh$ exit
salloc: Relinquishing job allocation 247
The srun command creates job steps. One or multiple srun invocations are usually used from within an existing resource allocation. A job step can utilize all resources allocated to the job, or only a subset of them. Multiple job steps can run sequentially in the order defined in the batch script, or run in parallel, but together they can never utilize more resources than provided by the allocation.
srun [options] executable [args...]
#!/bin/bash
#SBATCH --mail-user=firstname.lastname@example.org
#SBATCH --mail-type=end,fail
#SBATCH --job-name="Example Job"
#SBATCH --nodes=2
#SBATCH --time=01:00:00
# Your code below this line
# Create some directories for the results
export OUTFOO=/home/group/user/results/foo; mkdir -p $OUTFOO
export OUTBAR=/home/group/user/results/bar; mkdir -p $OUTBAR
# Each job step utilizes only one node. Run both job steps concurrently (&)
srun --nodes=1 foo > $OUTFOO/foo.out &
srun --nodes=1 bar > $OUTBAR/bar.out &
# Wait until all job steps have finished
wait
To illustrate these remarks, consider the following simplified example:
#!/bin/bash
#SBATCH --mail-user=email@example.com
#SBATCH --mail-type=end
#SBATCH --job-name="Example Job"
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --time=00:01:00
# Your code below this line
hostname
Although the allocation contains 16 tasks distributed evenly over 2 nodes, the hostname command is executed only once on the first compute node in the allocation. The next example behaves differently:
#!/bin/bash
#SBATCH --mail-user=firstname.lastname@example.org
#SBATCH --mail-type=end
#SBATCH --job-name="Example Job"
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --time=00:01:00
# Your code below this line
srun hostname
knode01.ubelix.unibe.ch
knode02.ubelix.unibe.ch
knode01.ubelix.unibe.ch
knode01.ubelix.unibe.ch
knode02.ubelix.unibe.ch
knode02.ubelix.unibe.ch
knode01.ubelix.unibe.ch
knode02.ubelix.unibe.ch
When srun is used to execute hostname, the command is executed 8 times: 4 times on the first node and 4 times on the second node of the allocation. In this example, srun inherits all Slurm options specified for the sbatch command.
Requesting a Partition (Queue)
The default partition is the 'all' partition. If you do not explicitly request a partition, your job will run in the default partition. To request a different partition, you must use the --partition option:
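For example, to submit a job to the long partition, add the following directive to your job script:

#SBATCH --partition=long

or pass the option on the command line, e.g. sbatch --partition=long job.sh (job.sh being a placeholder script name).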
See here for a list of all available partitions.
Requesting an Account
Accounts are used for accounting purposes. Every user has a default account that is used unless a different account is specified with the --account option. Regular users only have a single account and thus cannot request a different one. The default account for regular users is named after their group (e.g. dcb):
$ sacctmgr show user foo
      User   Def Acct     Admin
---------- ---------- ---------
       foo        bar      None
Users with enhanced privileges have an additional account for the empi partition. This additional account is set as their default account, which means they don't have to specify an account when submitting to the empi partition (--partition=empi), but must specify their "group account" (--account=<group>) when submitting to any other partition (e.g. all). If a wrong account/partition combination is requested, you will get the following error message:
sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
Here are some examples of how to use the --account and --partition options.
Regular users:
- Submit to the "all" partition: no options required
- Submit to any other partition: --partition=<partname>, e.g. --partition=empi
Users with enhanced privileges on the empi partition:
- Submit to the "all" partition: --account=<grpname>, e.g. --account=dcb
- Submit to the "empi" partition: --partition=empi
- Submit to any other partition: --account=<grpname> --partition=<partname>, e.g. --account=dcb --partition=long
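For illustration, assuming a user with enhanced privileges whose group account is dcb (an example group name) and a job script named job.sh (placeholder):

# Submit to the empi partition (uses the default empi account)
sbatch --partition=empi job.sh
# Submit to the long partition, charging the group account
sbatch --account=dcb --partition=long job.sh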
A parallel job either runs on multiple CPU cores on a single compute node, or on multiple CPU cores distributed over multiple compute nodes. With Slurm you can request tasks, and CPUs per task. A task corresponds to a process that may be made up of multiple threads (CPUs per task). Different tasks of a job allocation may run on different compute nodes, while all threads that belong to a certain process execute on the same node. For shared memory jobs (SMP, parallel jobs that run on a single compute node) one would request a single task and a certain number of CPUs for that task:
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
This is equivalent to:
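Since both --nodes and --ntasks default to 1 (see the option table above), the same request can be written more compactly as:

#SBATCH --cpus-per-task=16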
For MPI jobs (parallel jobs that may be distributed over multiple compute nodes) one would request a certain number of tasks, and a certain number of CPUs per task, depending on the desired degree of distribution of the job over different compute nodes. If it does not matter if the job runs on a single compute node or is distributed over several compute nodes, one can simply request a certain number of tasks:
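For example (16 tasks is an arbitrary choice):

#SBATCH --ntasks=16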
In this case the distribution of tasks to different nodes is not easily predictable. The scheduler tries to place as many tasks as possible on the same compute node, then proceeds with the next compute node, and so forth. This means that your job may run on a single node, or distributed over a certain number of compute nodes.
You can explicitly specify the number of nodes:
#SBATCH --nodes=2
#SBATCH --ntasks=16
In this case the 16 tasks will be distributed over 2 compute nodes. The number of tasks running on a node depends on the available resources and the load on the node, but in general, Slurm tries to place as many tasks as possible on the same node.
You can explicitly state how many tasks must run on each node:
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
Open MPI was compiled with Slurm support, which means that you do not have to specify the number of processes and the execution hosts using the -np and the -hostfile options. Slurm will automatically provide this information to mpirun based on the allocated tasks (--ntasks):
#!/bin/bash
#SBATCH --mail-user=email@example.com
(...)
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=16

module load openmpi/1.10.2-gcc
mpirun <options> <binary>
Slurm sets various environment variables available in the context of the job script. Some are set based on the requested resources for the job.
| Environment Variable | Set By Option | Description |
| --- | --- | --- |
| SLURM_JOB_NAME | --job-name | Name of the job |
| SLURM_JOB_ID | | ID of your job |
| SLURM_ARRAY_TASK_ID | --array | ID of the current array task |
| SLURM_ARRAY_TASK_MAX | --array | Job array's maximum ID (index) number |
| SLURM_ARRAY_TASK_MIN | --array | Job array's minimum ID (index) number |
| SLURM_ARRAY_TASK_STEP | --array | Job array's index step size |
| SLURM_NTASKS | --ntasks | Same as -n, --ntasks |
| SLURM_NTASKS_PER_NODE | --ntasks-per-node | Number of tasks requested per node. Only set if the --ntasks-per-node option is specified |
| SLURM_CPUS_PER_TASK | --cpus-per-task | Number of CPUs requested per task. Only set if the --cpus-per-task option is specified |
| TMPDIR | | References the local scratch (disk) space for the job |
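These variables can be used directly in a job script. A minimal sketch that only prints some of them (the job name and resource values are arbitrary):

#!/bin/bash
#SBATCH --job-name="Env Example"
#SBATCH --cpus-per-task=4
#SBATCH --tmp=4G
# Your code below this line
echo "Job $SLURM_JOB_NAME (ID $SLURM_JOB_ID) uses $SLURM_CPUS_PER_TASK CPUs per task"
echo "Local scratch space for this job: $TMPDIR"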
Running a Single Job Step
#!/bin/bash
#SBATCH --mail-user=firstname.lastname@example.org
#SBATCH --mail-type=end,fail
#SBATCH --job-name="Serial Job"
#SBATCH --time=04:00:00
# Your code below this line
echo "I'm on host:"
hostname
echo "Environment variables:"
env
Shared Memory Jobs (e.g. OpenMP)
SMP parallelization is based on dynamically created threads (fork and join) that share memory on a single node. The key request is --cpus-per-task: to run N threads in parallel, request N CPUs on the node (--cpus-per-task=N). OpenMP is not Slurm-aware, so you need to set OMP_NUM_THREADS in your submission script, where OMP_NUM_THREADS (the maximum number of threads spawned by your program) must correspond to the number of CPUs requested. As an example, consider the following job script:
#!/bin/bash
#SBATCH --mail-user=email@example.com
#SBATCH --mail-type=end,fail
#SBATCH --job-name="SMP Job"
#SBATCH --mem-per-cpu=2G
#SBATCH --cpus-per-task=16
#SBATCH --time=04:00:00
# Your code below this line
# Set OMP_NUM_THREADS to the number of CPUs per task that we requested
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./my_binary
MPI Jobs (e.g. Open MPI)
MPI parallelization is based on processes (local or distributed) that communicate by passing messages. Since they don't rely on shared memory, these processes can be distributed over several compute nodes.
Use the option --ntasks to request a certain number of tasks (processes) that can be distributed over multiple nodes:
#!/bin/bash
#SBATCH --mail-user=firstname.lastname@example.org
#SBATCH --mail-type=end
#SBATCH --job-name="MPI Job"
#SBATCH --mem-per-cpu=2G
#SBATCH --ntasks=8
#SBATCH --time=04:00:00
# Your code below this line
# First set the environment for using Open MPI
module load openmpi/1.10.2-gcc
srun --mpi=pmi2 ./my_binary    # or: mpirun ./my_binary
On the 'empi' partition you must use all CPUs provided by a node (20 CPUs). For example, to run an Open MPI job on 80 CPUs:
#!/bin/bash
#SBATCH --mail-user=email@example.com
#SBATCH --mail-type=end,fail
#SBATCH --job-name="MPI Job"
#SBATCH --mem-per-cpu=2G
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=20
#SBATCH --time=12:00:00
# Your code below this line
# First set the environment for using Open MPI
module load openmpi/1.10.2-gcc
srun --mpi=pmi2 ./my_binary    # or: mpirun ./my_binary