Slurm user guide


At UPPMAX we use Slurm as our batch system, to allow fair and efficient use of the systems. Please make use of the jobstats tool to see how efficiently your jobs run. If you request more core hours, that efficiency will help determine whether you are granted them.

Nearly all of the compute power of the UPPMAX clusters is found in the compute nodes, and Slurm is the tool for utilizing that power.

You can use the UPPMAX login nodes interactively, e.g. to quickly test your algorithms or explore the computing environment, but the real potential lies in giving the UPPMAX clusters bigger chunks of work, packaged into batch scripts.

Also see our disk storage guide for information on how you should use storage.

Slurm commands

The Slurm system is accessed using the following commands:

  • interactive - Start an interactive session
  • sbatch - Submit and run a batch job script
  • srun - Typically used inside batch job scripts for running parallel jobs (See examples further down)
  • scancel - Cancel one or more of your jobs.
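
A minimal, hedged session using these commands might look like this (the project name, script name, and job id are only examples/placeholders):

# Start a one-core interactive session for one hour
interactive -A p2010999 -n 1 -t 1:00:00

# Submit a batch job script on one core, then cancel it by job id if it is no longer needed
sbatch -A p2010999 -p core -n 1 -t 1:00:00 my_job_script_file.sh
scancel 123456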

Specifying job parameters

Whether you use the UPPMAX clusters interactively or in batch mode, you always have to specify a few things, such as the number of cores needed, the running time, etc. These things can be specified in two ways:

  • Either as flags sent to the different Slurm commands (sbatch, srun, the interactive command, etc.), like so:

    sbatch -A sens2012999 -p core -n 1 -t 12:00:00 -J some_job_name my_job_script_file.sh
  • or, when using the sbatch command, it can be specified inside the job script file itself, by using special "SBATCH" comments, for example:
    #!/bin/bash -l
     
    #SBATCH -A sens2012999
    #SBATCH -p core
    #SBATCH -n 1
    #SBATCH -t 12:00:00
    #SBATCH -J some_job_name
     
    ... the actual job script code ...
    If you do this, you only need to submit the script, without any flags:
    sbatch my_job_script_file.sh

Required job parameters

These are the things you typically need to specify for each job (required parameters may differ slightly, depending on which other parameters are set):

  • Which project should be accounted for the running time (Format: -A [project name])
    • For example, if your project is named NAISS 2017/1-334 you specify -A naiss2017-1-334
    • You can find your current projects (and other projects that you have run jobs in) with the program projinfo.
  • What partition to choose (Format: -p [partition])
    • Partitions are a way to tell what type of job you are submitting, e.g. if it needs to reserve a whole node, or part of a node.
    • If you need anywhere from one core to a full node's worth of cores, specify "-p core". This is especially important if you might adjust the job's core usage to something less than a full node. Whenever -p node is specified, an entire node is used, no matter how many cores are requested with -n [no_of_cores]. For example, some bioinformatics tools show minimal performance gains beyond 8-10 cores per job; in that case, specify "-p core -n 8" to ensure that only 8 cores (less than a single node) are allocated for the job. (More about this later.)
  • If you specified the "node" partition above and want to run on less than the number of cores per node (for example, running something in only one process per node) you have to give the number of nodes (Format: -N [no of nodes]) in addition to the number of cores. Please note that your account is still charged for using core hours on all nodes.
  • How many cores you will need (Format: -n [no_of_cores]).
    • The most atomic compute element to specify is -n 1, i.e. one core.
    • When using the "node" partition, remember that the different clusters at UPPMAX has different number of cores per node, so you will need to multiply the number of nodes you have specified, to get the correct number of cores. An example for Tintin, specifying 2 nodes, and thus 32 (2 * 16) cpus, would be -n 32. See the table below for number of cores per cluster.
  • How long you want to reserve those nodes/cores (Format: -t d-hh:mm:ss).
    • The specification is in days, hours, minutes and (not very usefully) seconds. A three-day time limit is given as -t 3-00:00:00, twenty minutes as -t 20:00, and three hours as -t 3:00:00.
    • A long time limit increases the chance that your computation finishes within the allotted time, while a shorter limit may sometimes let your job start sooner.
    • Your project is accounted for the time the job actually runs, which is not necessarily as long as your time limit. If your job exceeds the time limit, it is automatically cancelled.

For test runs shorter than 15 min, add the --qos=short specification, which gives you a high priority. This is not for production jobs, and you are limited to 15 minutes on a maximum of four nodes, as well as a maximum of two such jobs simultaneously.
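
For example, a 15-minute, single-core test job (the script name is a placeholder) could be submitted like this:

sbatch -A p2010999 -p core -n 1 -t 15:00 --qos=short my_test_script.sh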

Monitoring jobs

To see the status of your jobs, you can run commands like:

  • jobinfo
  • jobinfo -u your_account_name
  • squeue

Please also see our page about how the job priority and queue works.

Cancelling jobs

For various reasons, you might want to terminate your running jobs or remove your waiting jobs from the queue. The command is "scancel" and you can read its documentation with the command "man scancel". The most straightforward usage is to run

scancel 123456 123457

to kill two of your jobs, by giving their job number. The command

scancel -i -u your_account_name

kills all your jobs, but asks, for each job, whether you really want to terminate it.

scancel -u your_account_name --state=pending

terminates all your waiting jobs, while

scancel -u your_account_name -n firsttest -t running

kills all your running jobs that are named "firsttest".

Details about the "core" and "node" partitions

For example, a normal Rackham node contains 128 GB of RAM and twenty compute cores. An equal share of RAM for each core would mean that each core gets at most 6.4 GB of RAM. This simple calculation gives one of the limits mentioned below for a "core" job.

Number of cores and amount of memory per node at the different resources at UPPMAX.

Cluster name   RAM per node [GB]   Cores per node   RAM per core [GB]   Flag to use the node
bianca         112                 16               7
bianca         256                                                      -C mem256GB
bianca         512                                                      -C mem512GB
snowy          128                 16               8
snowy          256                                                      -C mem256GB
snowy          512                                                      -C mem512GB
miarka         384                 48               8
miarka         2048                                                     -p fat -C 1TB
miarka         4096                                                     -p fat -C 4TB
rackham        128                 20               6.4
rackham        256                 16               16                  -C mem256GB
rackham        1024                                                     -C mem1TB

You need to choose between running a "core" job or a "node" job. A "core" job must keep within certain limits, to be able to run together with other "core" jobs on a shared node. A job that cannot keep within those limits must run as a "node" job.

You tell Slurm that you need a "node" job with the flag "-p node". (If you forget to tell Slurm, you are by default choosing to run a "core" job.)

A "core" job:

  • Will use a part of the resources on a node, from a 1/16 share to a 16/16 share on Bianca, and from a 1/20 share to a 20/20 share on Rackham.

  • Must not demand "-N", "--nodes", "--mem", or "--exclusive".

  • Must not demand to run on a fat node (see below, for an explanation of "fat"), or a devel node. 

  • Must not use more than the number of GB of RAM per core for the cluster, as specified in the table above. For example, on Rackham: if a job needs half of the node's RAM, i.e. 64 GB, you must also reserve at least half of the cores on the node, i.e. ten cores, with the "-n 10" flag (see the sketch below).
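
    A minimal sketch of that Rackham case (the program name is a placeholder) could look like this:

    #!/bin/bash -l
    #SBATCH -A p2010999
    #SBATCH -p core
    #SBATCH -n 10            # ten cores on Rackham give access to roughly 10 * 6.4 = 64 GB of RAM
    #SBATCH -t 12:00:00
    my_memory_hungry_program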

A "core" job is accounted on your project as one "core hour" (sometimes also named as a "CPU hour") for each wallclock hour per core that it runs. On the other hand, a "node" job is accounted on your project as having used all the cores on all your reserved nodes for each wallclock hour that it runs. Your project will be charged for all the cores you have asked for, even if you don't use all the resources (i.e cores) that you have reserved.

Specifications for a job on a single, full node

This will book an entire node. Only use this if you need a lot of memory or are running multithreaded applications.

If you want to run a single application on your node, you specify:

#SBATCH -p node -n 1

This application can use all the memory of your node all by itself. If you have a threaded application or an OpenMP application, you normally use the same specification.

If you want to run e.g. four copies of the same program in parallel, you specify

#SBATCH -p node -n 4

to inform Slurm about this. Slurm will then know that you want to run four tasks on the node. Some tools, like mpirun and srun, ask Slurm for this information and behave differently depending on the specified number of tasks. Most programs and tools do not ask Slurm for this information and thus behave the same regardless of how many tasks you specify.

By default, mpirun and srun start as many copies of your specified command or program as the number of specified tasks. If you do not want them to go for the default behaviour, you can give them flags to specify, among other things, how many copies you want them to start.
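
As a minimal sketch, assuming a program named my_program (a placeholder), a job running four copies of it on one node could look like this; srun starts one copy per requested task by default:

#!/bin/bash -l
#SBATCH -A p2010999
#SBATCH -p node -n 4
#SBATCH -t 1:00:00
# srun starts four copies of my_program, one per task requested above
srun my_program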

Specifying more tasks than the number of cores per node is in most cases a bad idea. For the same reason, if you run a threaded or OpenMP application, you normally do not want it to start so many parallel threads that the total number of threads exceeds the number of cores on the node.

For example, since a node on Rackham has 20 cores, don't start more than 20 tasks or run more than 20 threads. (This may depend on your program.)

Running a non-parallel job as a "node" job might mean that you "pay" for more than you get out of the arrangement. In that case, you are welcome to get in touch with the UPPMAX staff to discuss the best way to run your application. A common solution is to pack between two and as many tasks as there are cores per node (e.g. 16 on Snowy) into one "node" job, writing something like this in your job script:

your_application --infile infile1 --outfile outfile1 & 
your_application --infile infile2 --outfile outfile2 & 
your_application --infile infile3 --outfile outfile3 & 
your_application --infile infile4 --outfile outfile4 & 
your_application --infile infile5 --outfile outfile5 & 
your_application --infile infile6 --outfile outfile6 & 
your_application --infile infile7 --outfile outfile7 & 
your_application --infile infile8 --outfile outfile8 & 
your_application --infile infile9 --outfile outfile9 & 
your_application --infile infile10 --outfile outfile10 & 
your_application --infile infile11 --outfile outfile11 & 
your_application --infile infile12 --outfile outfile12 & 
your_application --infile infile13 --outfile outfile13 & 
your_application --infile infile14 --outfile outfile14 & 
your_application --infile infile15 --outfile outfile15 & 
your_application --infile infile16 --outfile outfile16 & 
wait

This example highlights a few details needed for task packing:

  • Normally, each task needs individual input and output files.

  • Each application call needs an "&" written at the end of the line, to make it start at the same time as the other application calls.

  • The "wait" command at the bottom tells the job script to wait until all the tasks have run to their normal finish. Otherwise the job and thus also the tasks will terminate prematurely.

You may request a node with more RAM by adding a flag like "-C mem256GB" to your job submission line, thus making sure that you get 256 GB of RAM on each node in your job. Please note the limited number of nodes with more memory in the table above. Asking for more memory may mean a longer time in the queue for your job.
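
For example, a node job asking for a 256 GB node (the script name is a placeholder) could be submitted like this:

sbatch -A p2010999 -p node -C mem256GB -t 12:00:00 my_job_script_file.sh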

The squeue command can give you a lot of information, using different command options. Some of these options are used within the

jobinfo

command, which tells you about running jobs, gives you some statistics about UPPMAX node status, and lists all waiting jobs, sorted by job priority. The jobinfo command has many option flags, most of them the same as for squeue. One of the most useful is "-u your_user_account_name", to show only your own jobs.

The jobinfo command will give an estimate of the starting time for jobs at the top of the queue. You can also use the squeue command with the "--start" option.
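
For example, to list your own queued jobs with their estimated start times:

squeue -u your_account_name --start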

Specifications for a multi-node job

If you want to run computations that take more than one node, but that can be run in parts as core jobs and/or single-node jobs, you should probably split them up into several such jobs and not read any further about multi-node jobs.

If you want to run e.g. 40 copies of your program on Rackham with 20 cores per node, e.g. for an OpenMPI program, you normally specify

#SBATCH -p node -n 40

You will be assigned two nodes, and your job will run twenty copies of your program on each of them. OpenMPI interacts with Slurm to distribute your program copies over the allocated nodes when the mpirun tool is called within your job script. The script would look something like

#! /bin/bash -l 
# 
#SBATCH -p node -n 40 -t 7-00:00:00 
#SBATCH -A p2010999 -J elixir_B 
 
module load gcc/7.1.0 openmpi/2.1.1 
mpirun elixir B_gamma.txt

if your application is named "elixir" and is compiled with the gcc compiler. mpirun will read from the Slurm environment that it must start the "elixir" program 40 times, i.e. twenty times on each of two nodes.
Note: If you are using Intel MPI, you must use the command srun, as recommended by Intel. If you are switching from mpirun to srun or the other way around, please check the man pages for differences in flags, e.g. srun -n versus mpirun -np.
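
A hedged sketch of the corresponding Intel MPI job lines (the exact module names may differ; check what is installed with "module avail"):

module load intel intelmpi
srun -n 40 elixir B_gamma.txt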

It is often advantageous to bind processes to cores, especially for very wide jobs. You can see more information about process binding by typing "mpirun -help". To bind each process to a core, in system order, do:

mpirun -bind-to-core elixir B_gamma.txt

If you want to be sure to use only nodes with 128 GB of memory on Rackham, you specify

#SBATCH -p node -n 40 -C mem128GB

The main reason for not using a fat node in your job is that there is a shortage of fat nodes, and someone else might need one for a job that cannot run on a 128 GB node.

If your memory requirements are high, you may want to run your 40 copies distributed over more nodes like in

#SBATCH -p node -N 10 -n 40 -C mem128GB

making your job run with four copies of your program on each of ten nodes. If you use OpenMPI, mpirun will automatically distribute the 40 instances of your program over ten nodes.

We now show an example where you want to run a program "smart_aleck" that uses OpenMP within a node, but MPI (here OpenMPI) between nodes. We want to use ten nodes, i.e. 200 cores. Because a single copy of the program knows how to utilize all twenty cores within a node, we start one copy (one task) of the program on each node, so we specify the Slurm flags "-N 10 -n 10" to get the wanted distribution of tasks. OpenMP reads the number of threads to start from the environment variable OMP_NUM_THREADS, so we set that variable to the number of cores in a node and let mpirun propagate it to all nodes. The job script might look like this:

#! /bin/bash -l 
# 
#SBATCH -p node -N 10 -n 10 
#SBATCH -t 7-00:00:00 
#SBATCH -A p2010999 -J another_test 
 
module load gcc/7.1.0 openmpi/2.1.1 
export OMP_NUM_THREADS=20
mpirun smart_aleck

We get all 200 cores to work, with one OpenMP thread on each. If we later find that twenty threads together use more memory than is available on the node, we need to find a fatter node, or we can try lowering the value of OMP_NUM_THREADS to reduce the number of threads.
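
Following that suggestion, a hedged variant of the script above keeps one task per node but halves the thread count, leaving half of the cores idle in exchange for a lower memory footprint per node; only the changed lines are shown:

#SBATCH -p node -N 10 -n 10
export OMP_NUM_THREADS=10   # ten threads per node instead of twenty
mpirun smart_aleck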

Short test runs in the "devel" partition

"devel" is an abbreviation for "development".

The "--qos=short" flag allows short jobs, up to 15 minutes in length and up to four nodes in width, to be submitted with a high priority, as described earlier.

If you need to run a somewhat longer test job, up to one hour in length and up to two nodes in width, you may use the "devel" partition. A small number of nodes have been set aside from the "node" partition for this purpose.

This partition is meant only for small experiments and test runs, not for production jobs. Like "--qos=short", the "devel" partition makes it easier to develop programs and run small tests on a crowded system.

Here is a compilation of facts about the development partition:

  • Jobs are submitted like "node" jobs, but with "-p devel" instead of "-p node" (see the example after this list).
  • The maximum timelimit for the job is 60 minutes and the maximum node count is two.
  • You must not have more than one "devel" job in the batch system simultaneously, regardless of whether they are running or queued. If you submit more by mistake, they will probably all be automatically cancelled.
  • To get information about the current status of the "devel" partition, you can use the command
sinfo -p devel
  • The interactive command is allowed to use the "devel" partition to start short jobs. You normally do not need to tell the interactive command which partition to use, as it can make the choice automatically.
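
For example, a one-hour test job on one full devel node on Rackham (the script name is a placeholder) could be submitted like this:

sbatch -A p2010999 -p devel -n 20 -t 1:00:00 my_test_script.sh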

Difference between devel partition and devcore partition

Sometimes it is too expensive to pay for a full node if you only need one core or a few. We have therefore configured a partition named "devcore". It covers the same physical nodes as the "devel" partition, but you can ask for a single core or multiple cores, as in the "core" partition.

Some examples:

- "-p devcore -n 8" asks for eight cores and the proportional amount of RAM
- "-p devcore -n 1" on Rackham gives you one core and 6.4 GB of RAM
- "-p devcore -n 10" on Rackham gives you ten cores and 64 GB of RAM
- "-p devel -n 20" on Rackham gives you all cores and 128 GB of RAM
- "-p devcore -n 1" on Snowy gives you one core and 8 GB of RAM
- "-p devcore -n 8" on Snowy gives you eight cores and 64 GB of RAM
- "-p devel -n 16" on Snowy gives you all cores and 128 GB of RAM

So, what is the difference on Snowy between "-p devcore -n 16" and "-p devel -n 16"?

None at all! In both cases, you ask for all cores on the node and all RAM on the node.

Project accounting

When you specify "-p node", you allocate full nodes, each with 16 or 20 cores at UPPMAX, so you are accounted a number of "core hours" (sometimes called "CPU hours") equal to the number of hours your job ran, multiplied by the number of cores per node, times the number of nodes you allocated.

On the other hand, if you do not specify "-p node" and keep within the limits mentioned above, you are accounted only for the cores you requested, for each hour that your job ran.

To get an overview of how much of your project allocation has been used, please use the projinfo command. Use the command

projinfo -h

to get details on usage. With no flags given,

projinfo

will tell you your usage in all your projects during the last 30 days.

To get your usage in project p2010999 during the current year, please use one of the commands

projinfo -y p2010999

or

projinfo -s january p2010999

The projinfo command extracts information from a system log of all finished jobs, and also includes information from the batch system on currently running jobs.

Finished jobs

In order to see information about finished jobs, use the command

finishedjobinfo

The command gives you, apart from the timings of the job, the amount of memory your job used. If your job was cancelled, it might be because it used more memory than it was allowed to.

Use the -h flag to see a list of flags and options for the command.


Details about memory usage

Historical information can first of all be found by issuing the command "finishedjobinfo -j [job id]", which prints the maximum memory used by your job. If you want more detail, we also save some memory information every five minutes for the job, in the file /sw/share/slurm/[cluster_name]/uppmax_jobstats/[node_name]/[job_id]. Note that this information is only stored for 30 days.
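
For example, for a hypothetical job 1234567 that ran on Rackham node r100 (both identifiers are made up for illustration):

finishedjobinfo -j 1234567
less /sw/share/slurm/rackham/uppmax_jobstats/r100/1234567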

You can also ask for an e-mail containing the log, when you submit your job with sbatch or start an "interactive" session, by adding a "-C usage_mail" flag to your command. Two examples:

sbatch -A testproj -p core -n 5 -C usage_mail batchscript1
interactive -A testproj -p node -n 1 -C "fat&usage_mail"

As you can see, you have to be careful with the syntax when asking for two features, like "fat" and "usage_mail", at the same time. If you exceed the RAM that you asked for, you will probably get an automatic e-mail anyway.

Discovering job resource usage with jobstats

If you want to be able to see even more details about how your jobs have used the requested resources, then please check out our guide about how to use our jobstats scripts.

Example 1: Interactive job on one core

interactive -n 1 -t 2:00:00 -A p2010999

You will automatically get an interactive session with the command interpreter "bash" on one compute core.

Example 2: Interactive job on four nodes

[lka@rackham1 ~]$ interactive -p node -n 80 -t 15:00 -A p2010999 --qos=short
[lka@r83 ~]$ # How to see on what nodes I am running
[lka@r83 ~]$ srun hostname -s|sort -u
r83
r84
r85
r86
[lka@r83 ~]$ # Create the same local directory on all four nodes
[lka@r83 ~]$ srun -N 4 -n 4 mkdir /scratch/$SLURM_JOB_ID/indata
[lka@r83 ~]$ # Copy indata for my_program to the local directories
[lka@r83 ~]$ srun -N 4 -n 4 cp -r ~/glob/indata/* /scratch/$SLURM_JOB_ID/indata
[lka@r83 ~]$ #
[lka@r83 ~]$ cd ~/glob/testprogram
[lka@r83 testprogram]$ module load intel openmpi
Loaded openMPI 1.4, compiled with intel11.1 (found in /opt/openmpi/1.4intel11.1/)
[lka@r83 testprogram]$ mpirun -v my_program
.
.
.

We needed to add -p node, because we used more than one node. If you use at most four nodes and fifteen minutes, you may specify --qos=short, which gives you a tremendously higher queue priority.

The srun -N 4 -n 4 construction is very useful, when you want to run a command once on each of your nodes. You need to know how many nodes you have asked for; e.g., for eight nodes you will need an srun -N 8 -n 8 construction.

Example 3: Using interactive command, to let you run X applications

If you need to run X applications, please use the interactive command. The easiest usage example is

interactive -A p2010999

which sets some default values to give you the highest queue priority allowed, e.g. by using "--qos=short". If you do not like the defaults, you can add most of the options that the sbatch command allows. You will get a shell prompt from the screen command (the command "man screen" gives more information); if you have tried screen and do not like it, you can escape from it with an "exec xterm" command. For more information about the interactive command, try

interactive -h

You may run one interactive command with a high queue priority at a time, up to a time limit of 12 hours, regardless of your simultaneous use of "--qos=short" jobs.

Example 4: A small batch script, with a Job name

An example of a small batch script requesting 8 cores in the "core" partition, i.e. less than a full node. We also show how to give your job a name (the "-J" flag):

#!/bin/bash -l
#SBATCH -A b2010999
#SBATCH -p core -n 8
#SBATCH -t 1:00:00
#SBATCH -J mapping
module load bioinfo-tools bwa/0.7.15 samtools/1.5
cd /proj/b2010999/nobackup/mapping
bwa mem -t 6 -p ref.fa r.fq.gz | samtools sort -@2 -O BAM -o o.bam -

An example of a small batch script to run on 4 nodes, communicating among the nodes using MPI:

#!/bin/bash -l
#SBATCH -A p2010999
#SBATCH -p node -n 64
#SBATCH -t 1-20:00:00
#SBATCH -J test42
module load intel openmpi
cd ~/glob/testprogram
mpirun my_program


If you name the batch script file "script-v4", you submit the script with a bash command like

sbatch script-v4

Last modified: 2023-01-10