
 

Simple Linux Utility for Resource Management (SLURM)

 

SLURM Links

  • SBATCH submit a batch script to the scheduler
  • SQUEUE view information about jobs in the slurm scheduling queue
  • SCONTROL view and modify configuration and state of jobs
  • SCANCEL used to signal or cancel jobs, job arrays or job steps
  • SINFO view information about nodes and partitions
  • SSHARE view information about fairshare
  • SPRIO view information about job priority

Introduction

Balena uses SLURM (Simple Linux Utility for Resource Management) for its resource management and scheduling. It is one of the most popular job scheduling systems available, used on about 40 percent of the largest computers in the world (the Top500), including Tianhe-2, which tops the list.

 

Balena SLURM configuration

Partition/Queues

The compute nodes that are part of the scheduler are divided into the following SLURM partitions/queues. Users should select the appropriate partition based on the job requirement.

Partition Name    Nodes     Characteristics
batch             158       Default partition - jobs that do not request a partition are submitted into this partition
batch-64gb        88        Nodes with 64GB RAM (DDR3 1866 MHz), using dual-ranked DIMMs
batch-128gb       80        Nodes with 128GB RAM (DDR3 1866 MHz), using single-ranked DIMMs
batch-512gb       2         Nodes with 512GB RAM (DDR3 1333 MHz)
batch-acc         22        Nodes with accelerators - GPUs (K20x and P100)/MIC/NVMe
batch-micnative   4         MIC cards for native mode
batch-all         179       All Ivybridge compute nodes (except 512GB nodes and MIC native)
batch-sky         16        Skylake compute nodes (192GB DDR4 2666 MHz)
batch-devel       4         Nodes with 64GB RAM (DDR3 1866 MHz)
itd               4         Nodes for Interactive Test and Development - 2 nodes with a GPU and 2 nodes with a Xeon Phi
itd-sky           1         Skylake node for Interactive Test and Development
teaching          variable  Nodes with 64GB RAM (DDR3 1866 MHz) - this partition is dedicated to academic courses that run on Balena

Project accounts

Users can view the project accounts to which they have access using the sshare command.

[user123@balena-01 ~]$ sshare
             Account         User Raw Shares Norm Shares   Raw Usage Effectv Usage  FairShare 
-------------------- ------------ ---------- ----------- ----------- ------------- ---------- 
free                      user123     parent    0.100000        6466      0.657390   0.010497 
prj-cc001                 user123     parent    0.900000        6032      0.342610   0.768076

Project account limitations

A default walltime of 10 minutes is applied to jobs submitted to any partition (queue) if the user does not specify the '--time=<>' option in the job script.

Account Type  Maximum Walltime  Max Nodes  Max CPU Cores                             Max CPU Time    Starting Priority
teaching      15 minutes        4          64                                        NA              dedicated
free          6 hours           16         256                                       384 core-hours  0
premium       5 days            32         512 (default); >512 cores by arrangement  NA              +24 hours

Free account restrictions

Job size

The free account is restricted to a maximum CPU time of 384 core-hours (23040 core-minutes) per job.

Example job sizes that can run within this limit include:

Walltime    Nodes  CPU cores  Total CPU time (mins)  Total node time (mins)
6 hours     4      64         23040                  1440
3 hours     8      128        23040                  1440
90 minutes  16     256        23040                  1440

 

Running cpumins

A user is limited to 115,200 running cpumins for jobs submitted from the free account; this is equivalent to having five concurrent jobs, each using 4 nodes, running for 6 hours. For example, if a user submits six jobs each requesting 4 nodes for 6 hours, only 5 of those jobs can run concurrently; the sixth has to wait for sufficient running cpumins to become available before SLURM will give it an allocation.
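The 115,200 figure follows from simple arithmetic, assuming 16 CPU cores per node (the Ivybridge node configuration):

```shell
# one 4-node, 6-hour job: 4 nodes x 16 cores x 360 minutes
echo $(( 4 * 16 * 360 ))       # 23040 running cpumins

# five such concurrent jobs reach the free-account limit
echo $(( 5 * 4 * 16 * 360 ))   # 115200 running cpumins
```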

Teaching account restrictions

The teaching accounts (e.g. cm30225) can use only the teaching partition and are not able to run jobs in any other partition.

Job priority

The sprio command can be used to view a job's priority and the components making up that priority. The priority of a job is determined by the sum of two components:

  1. Fairshare
    • Details of fairshare can be found with the sshare command. 
    • The function for calculating the weighted fairshare component is: F = 1,000,000 * 2^(-E / N / D), where E = effective usage, N = normalised share, and D = dampening factor (100)
    • For premium project accounts, the normalised (N) share levels are determined by the project's (total budget for HPC / project duration).
    • Users under the free account all have an equal share of 1. 

  2. QOS
    • Jobs submitted to a premium project account will have an additional 1,000,000 priority units
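As a rough check, the fairshare formula can be evaluated with awk using the free-account figures from the sshare example above (E = 0.657390, N = 0.100000); this is an illustrative calculation, not part of SLURM itself:

```shell
# F = 1,000,000 * 2^(-E / N / D), with dampening factor D = 100
awk 'BEGIN { E = 0.657390; N = 0.100000; D = 100
             printf "%.0f\n", 1000000 * 2 ^ (-E / N / D) }'
```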

Usage decay

A decay half-life of 2 days is applied for all users. This decays a user's raw usage by half every 2 days, decreasing the effective usage (E) and so increasing the user's fairshare.
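The decay can be sketched as decayed usage = raw usage * 2^(-days / half-life); with a 2-day half-life, an illustrative raw usage of 6400 decays to 1600 after 4 days (two half-lives):

```shell
# decayed usage = raw * 2^(-days / half_life)
awk 'BEGIN { raw = 6400; days = 4; half = 2
             printf "%d\n", raw * 2 ^ (-days / half) }'   # prints 1600
```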

 

SLURM Basics

A simple job script

SLURM Example Job Script: example.slm
#!/bin/bash
 
# set the account to be used for the job
#SBATCH --account=free
 
# set name of job 
#SBATCH --job-name=myjob
#SBATCH --output=myjob.out
#SBATCH --error=myjob.err
 
# set the number of nodes and partition
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --partition=batch
 
# set max wallclock time
#SBATCH --time=04:00:00
 
# mail alert at end of execution
#SBATCH --mail-type=END
 
# send mail to this address
#SBATCH --mail-user=user123@bath.ac.uk
 
# Load all the dependent modules
module purge # clear all modules from the environment
module load slurm
module load intel/compilers
module load intel/mpi
module load intel/mkl

# run the application
./myparallelscript

Submit a job

sbatch <job-script>
[user123@balena-01 ~]$ sbatch example.slm
Submitted batch job 11
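When submitting from a script it is often handier to capture the job ID directly; sbatch's --parsable option prints only the job ID rather than the full "Submitted batch job" message:

```shell
# capture the job ID for later use with squeue, scontrol or scancel
jobid=$(sbatch --parsable example.slm)
echo "submitted job ${jobid}"
```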

List jobs

View information about jobs located in the SLURM scheduling queue.

squeue
[user123@balena-01 ~]$ squeue 
     JOBID       NAME       USER    ACCOUNT  PARTITION    ST NODES  CPUS  MIN_MEMORY           START_TIME     TIME_LEFT  PRIORITY NODELIST(REASON)
        11      myjob      user123     free      batch     R     1    16         62K  2015-04-30T12:43:37       3:59:57         9 node-sw-081

Get job details

scontrol show job <job-id>
[user123@balena-01 ~]$ scontrol show job 11
JobId=11 Name=myjob
   UserId=user123(123) GroupId=balena_cc(10305)
   Priority=9 Nice=0 Account=free QOS=free WCKey=*
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:08 TimeLimit=04:00:00 TimeMin=N/A
   SubmitTime=2015-04-30T12:43:37 EligibleTime=2015-04-30T12:43:37
   StartTime=2015-04-30T12:43:37 EndTime=2015-04-30T16:43:37
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=batch AllocNode:Sid=balena-02:11458
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=node-sw-081
   BatchHost=node-sw-081
   NumNodes=1 NumCPUs=16 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=16:0:*:* CoreSpec=0
   MinCPUsNode=16 MinMemoryNode=62G MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/n/user123/testing/example.slm
   WorkDir=/home/n/user123/testing
   StdErr=/home/n/user123/testing/myjob.err
   StdIn=/dev/null
   StdOut=/home/n/user123/testing/myjob.out

Kill a job

scancel <job-id>
$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                16     batch    hello    rtm25  R       0:03      1 slurm-compute-02
                15     batch    hello    rtm25  R       0:06      1 slurm-compute-01
$ scancel 15
$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                16     batch    hello    rtm25  R       0:11      1 slurm-compute-02

Hold a job

scontrol hold <job-id>
$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                19     batch    hello    rtm25 PD       0:00      1 (Resources)
                20     batch    hello    rtm25 PD       0:00      1 (Priority)
                21     batch    hello    rtm25 PD       0:00      1 (Priority)
                18     batch    hello    rtm25  R       0:03      1 slurm-compute-02
                17     batch    hello    rtm25  R       0:05      1 slurm-compute-01
$ scontrol hold 20
$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                19     batch    hello    rtm25 PD       0:00      1 (Resources)
                21     batch    hello    rtm25 PD       0:00      1 (Priority)
                20     batch    hello    rtm25 PD       0:00      1 (JobHeldUser)
                18     batch    hello    rtm25  R       0:13      1 slurm-compute-02
                17     batch    hello    rtm25  R       0:15      1 slurm-compute-01

Release a job

scontrol release <job-id>
$ scontrol release 20
$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                20     batch    hello    rtm25 PD       0:00      1 (Resources)
                21     batch    hello    rtm25  R       0:10      1 slurm-compute-02
                19     batch    hello    rtm25  R       0:11      1 slurm-compute-01

List partitions

sinfo
[user123@balena-01 ~]$ sinfo
PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
batch*         up   infinite    168   idle node-sw-[001-168]
batch-acc      up   infinite     11   idle node-as-agpu-01,node-as-ngpu-[01-06],node-dw-ngpu-[001-004]
batch-all      up   infinite    179   idle node-as-agpu-01,node-as-ngpu-[01-06],node-dw-ngpu-[001-004],node-sw-[001-168]
batch-512gb    up   infinite      2   idle node-sw-fat-[01-02]
batch-64gb     up   infinite     88   idle node-sw-[081-168]
batch-128gb    up   infinite     80   idle node-sw-[001-080]
itd            up   infinite      2   idle itd-ngpu-[01-02]

Submit a dependent job

This job can begin execution after any previously launched jobs sharing the same job name and user have terminated.

sbatch -d singleton <job-script>
$ sbatch hello.slurm 
Submitted batch job 22
$ sbatch --dependency singleton hello.slurm
Submitted batch job 23
$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                23     batch    hello    rtm25 PD       0:00      1 (Dependency)
                22     batch    hello    rtm25  R       0:07      1 slurm-compute-01
$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                23     batch    hello    rtm25  R       0:02      1 slurm-compute-01
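Dependencies can also reference specific job IDs. Combined with --parsable, this allows simple pipelines in which a second job starts only if the first finishes successfully (afterok); the script names below are placeholders:

```shell
# submit the first stage and capture its job ID
jid1=$(sbatch --parsable stage1.slurm)

# stage2 starts only if stage1 completes with exit code 0
sbatch --dependency=afterok:${jid1} stage2.slurm
```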

Quick Reference

User commands

User commands
SLURM
Job submission
sbatch [script_file] 
Queue list
squeue
Queue list (by user)
squeue -u [user_name]
Job deletion
scancel [job_id]
Job information
scontrol show job [job_id]
Job hold
scontrol hold [job_id]
Job release
scontrol release [job_id]
Node list
sinfo --Nodes --long
Cluster status
sinfo or squeue
GUI
sview (graphical user interface to view and modify SLURM state)

Job environment

Environment
Description
$SLURM_ARRAY_TASK_ID
Job array ID (index) number
$SLURM_ARRAY_JOB_ID
Job array's master job ID number
$SLURM_JOB_ID 
The ID of the job allocation
$SLURM_JOB_DEPENDENCY
Set to value of the --dependency option
$SLURM_JOB_NAME
Name of the job
$SLURM_JOB_NODELIST
List of nodes allocated to the job
$SLURM_JOB_NUM_NODES
Total number of nodes in the job's resource allocation
$SLURM_JOB_PARTITION
Name of the partition in which the job is running
$SLURM_MEM_PER_NODE
Memory requested per node
$SLURM_NODEID
ID of the nodes allocated
$SLURM_NTASKS
Number of tasks requested. Same as -n, --ntasks. To be used with mpirun, e.g. mpirun -np $SLURM_NTASKS binary
$SLURM_NTASKS_PER_NODE
Number of tasks requested per node. Only set if the --ntasks-per-node option is specified
$SLURM_PROCID
The MPI rank (or relative process ID) of the current process
$SLURM_RESTART_COUNT
If the job has been restarted due to system failure or has been explicitly requeued, this is set to the number of times the job has been restarted
$SLURM_SUBMIT_DIR
The directory from which sbatch was invoked
$SLURM_SUBMIT_HOST
The hostname of the computer from which sbatch was invoked
$SLURM_TASKS_PER_NODE
Number of tasks to be initiated on each node. Values are comma separated and in the same order as $SLURM_JOB_NODELIST
$SLURM_TOPOLOGY_ADDR
Set to the names of the network switches that may be involved in the job's communications, from the system's top-level switch down to the leaf switch, ending with the node name
$SLURM_TOPOLOGY_ADDR_PATTERN
Set to the component types listed in $SLURM_TOPOLOGY_ADDR. Each component is identified as either "switch" or "node", with a period separating each hardware component type
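A minimal job script that echoes a few of these variables is a quick way to see what SLURM sets for an allocation (the account and partition follow the earlier examples):

```shell
#!/bin/bash
#SBATCH --account=free
#SBATCH --partition=batch
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --time=00:05:00

# print a few of the environment variables SLURM sets for this allocation
echo "Job ID:     $SLURM_JOB_ID"
echo "Node list:  $SLURM_JOB_NODELIST"
echo "Tasks:      $SLURM_NTASKS"
echo "Partition:  $SLURM_JOB_PARTITION"
echo "Submit dir: $SLURM_SUBMIT_DIR"
```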

Job specification

Job specification
SLURM
Script directive
#SBATCH
Account to charge
--account=[account]
Begin Time
--begin=YYYY-MM-DD[HH:MM[:SS]]
Combine stdout/stderr
(use --output without the --error)
Copy Environment
--export=[ALL|NONE|variable]
CPU Count
--ntasks [count]
CPUs Per Task
--cpus-per-task=[count]
Email Address
--mail-user=[address]
Event Notification
--mail-type=[events] eg. BEGIN, END, FAIL, REQUEUE, and ALL (any state change)
Generic Resources
--gres=[resource_spec] eg. gpu:4 or mic:4
Node features
--constraint=[feature] eg. k20x, s10k and 5110p
Job Arrays
--array=[array_spec] 
Job Dependency
--dependency=[state:job_id]
Job host preference
--nodelist=[nodes] AND/OR --exclude=[nodes]
Job Name
--job-name=[name]
Job Restart
--requeue OR --no-requeue
Licenses
--licenses=[license_spec]
Memory Size
--mem=[mem][M][G][T] 
Node Count
--nodes=[min[-max]]
Quality of Service
--qos=[name]
Queue
--partition=[queue]
Resource Sharing
--exclusive OR --shared
Standard Error File
--error=[file_name]
Standard Output File
--output=[file_name]
Tasks Per Node
--ntasks-per-node=[count]
Wall Clock Limit
--time=[min] OR  [days-hh:mm:ss]
Working Directory
--workdir=[dir_name]

 

Requesting Accelerator Nodes

The `sinfo --partition=batch-acc --format="%10P %.5D %.4c %.8m %7G %8f %N"` command will reveal additional information about the different features available on the compute nodes, e.g. accelerator cards. These specific resources can be requested in sbatch scripts using the --gres and --constraint options.

output from `sinfo --partition=batch-acc --all`
# sinfo --partition=batch-acc --format="%10P %.5D %.4c %.7m %7G %8f %N"


PARTITION  NODES CPUS  MEMORY GRES    AVAIL_FE NODELIST
batch-acc      1   16   64508 gpu:4   p100     node-as-ngpu-005
batch-acc      2   16   64508 gpu:1   p100     node-as-ngpu-[006-007]
batch-acc      1   16   64508 mic:4   5110p,mi node-as-phi-001
batch-acc      5   16  129105 mic:1   5110p,mi node-dw-phi-[001-005]
batch-acc      2   16  129105 mic:2   5110p,mi node-dw-phi-[006-007]
batch-acc      2   16  64508+ (null)  nvme     node-nvme-[001-002]
batch-acc      4   16   64508 gpu:4   k20x     node-as-ngpu-[001-004]
batch-acc      3   16  129105 gpu:1   k20x     node-dw-ngpu-[001-003]
batch-acc      2   16  129105 gpu:2   k20x     node-dw-ngpu-[004-005]
output from `sinfo --partition=batch-micnative --all`
# sinfo --partition=batch-micnative --format="%15P %.5D %.4c %.8m %7G %15f %N"

PARTITION       NODES CPUS   MEMORY GRES    AVAIL_FEATURES  NODELIST
batch-micnative     4    1     7697 (null)  5110p,miccard   node-as-phi-002-mic[0-3]

NVIDIA GPU Nodes

Balena has two different types of NVIDIA GPU resources available - K20x and P100. Use the --constraint parameter within SLURM to choose specific resources.

K20x GPUs

sbatch script for NVIDIA K20x GPU Nodes
## batch-acc partition contains all the accelerator nodes
#SBATCH --partition=batch-acc

## Requesting NVIDIA k20x nodes with n cards
#SBATCH --constraint=k20x
#SBATCH --gres=gpu:n

P100 GPUs

sbatch script for NVIDIA P100 GPU Nodes
## batch-acc partition contains all the accelerator nodes
#SBATCH --partition=batch-acc

## Requesting NVIDIA P100 nodes with n cards
#SBATCH --constraint=p100
#SBATCH --gres=gpu:n
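Putting the pieces together, a complete script requesting a single P100 card could look like the sketch below; the application name and the cuda module name are assumptions, so check `module avail` for the exact module on Balena:

```shell
#!/bin/bash
#SBATCH --account=free
#SBATCH --job-name=gpu-test
#SBATCH --output=gpu-test.out
#SBATCH --partition=batch-acc
#SBATCH --constraint=p100
#SBATCH --gres=gpu:1
#SBATCH --nodes=1
#SBATCH --time=01:00:00

module purge
module load slurm
module load cuda        # assumed module name - check `module avail`

./mygpuapp              # placeholder for your own binary
```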

NVMe

sbatch script for NVMe Nodes
## batch-acc partition contains all the accelerator nodes
#SBATCH --partition=batch-acc

## Requesting NVMe nodes 
#SBATCH --constraint=nvme

Accessing the NVMe filesystem

Each node has about 2TB of NVMe storage (NVMe SSD DC P3600)

$ df -h /nvme
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p1  1.9T   33M  1.9T   1% /nvme

You can read and write data using the path below:

# File and directory names should be prefixed with /nvme/
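A common pattern inside a job script is to stage data onto the NVMe drive, work there, then copy results back and clean up; the paths and application name below are illustrative:

```shell
# per-job scratch directory on the node-local NVMe drive
SCRATCH=/nvme/${USER}/${SLURM_JOB_ID}
mkdir -p "$SCRATCH"

# stage in, run, stage out (illustrative paths)
cp "$HOME/input.dat" "$SCRATCH/"
cd "$SCRATCH"
./myapp input.dat > output.dat
cp output.dat "$HOME/results/"

# NVMe space is local to the node and not persistent - clean up
cd "$HOME"
rm -rf "$SCRATCH"
```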

 

Xeon Phi Nodes

Offload

sbatch script for Offload Intel Xeon Phi Nodes
## batch-acc partition contains all the accelerator nodes
#SBATCH --partition=batch-acc

## Requesting Xeon Phi (MIC) Nodes with n cards (max 4)
#SBATCH --gres=mic:n

Native

sbatch script for Native Intel Xeon Phi Nodes
## batch-micnative partition contains all the accelerator cards
#SBATCH --partition=batch-micnative

## Requesting n Xeon Phi (MIC) Nodes (max 4 nodes)
#SBATCH --nodes=n
#SBATCH --constraint=5110p

Batch development partition

A batch-devel partition is available for users to test their SLURM job scripts. All users have access to it, and jobs in this partition are limited as follows:

Account type  Maximum Walltime       Max Nodes  Max CPU cores  Max jobs per user (at a time)
ALL           15 minutes (00:15:00)  4          64             1

Users can access this partition by specifying the following in their SLURM job script

#SBATCH --partition=batch-devel
#SBATCH --qos=devel

Interactive jobs (Test and Development nodes)

By default all interactive jobs are submitted to the ITD partition using the free account (maximum walltime of 6 hours). The resources of this partition are used in SHARED mode, which means all users allocated to a particular node have equal access to all of its resources (CPU, MEM, GPU, MIC). Each user is limited to one interactive job on the ITD partition.

Nodes in the ITD partition

# sinfo --partition=itd --format="%10P %.5D %.4c %.7m %15f %N"

PARTITION  NODES CPUS  MEMORY AVAIL_FEATURES  NODELIST
itd            1   16  129150 p100,ivybridge  itd-ngpu-02
itd            1   16  129150 5110p,ivybridge itd-phi-01
itd            1   16  129150 k20x,ivybridge  itd-ngpu-01

Nodes in ITD-sky partition

# sinfo --partition=itd-sky --format="%10P %.5D %.4c %.7m %15f %N"


PARTITION  NODES CPUS  MEMORY AVAIL_FEATURES  NODELIST
itd-sky        1   24  193220 skylake         itd-sky-01
$ sinteractive 

For interactive sessions using a specific resource, use the --constraint option to select the node type required. The itd partition is configured with nodes having either 1 GPU (NVIDIA K20x or P100) or 1 MIC (Xeon Phi 5110p).

Access the K20x node

$ sinteractive --constraint=k20x

Access the P100 node

$ sinteractive --constraint=p100

Access the MIC (5110p) node

$ sinteractive --constraint=5110p

Access the Skylake node

$ sinteractive --constraint=skylake

Exclusive interactive access

For an EXCLUSIVE interactive session, use a specific partition depending on your node requirement

$ sinteractive --time=00:20:00 --partition=batch-acc --constraint=k20x

Job accounting information

sacct - report accounting information by individual job and job step

$ sacct --job=<jobid>
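The --format option selects which accounting fields sacct reports; for example (field names from sacct's standard field list):

```shell
# elapsed time, final state, exit code and memory high-water mark per job step
sacct --job=<jobid> --format=JobID,JobName,Partition,Elapsed,State,ExitCode,MaxRSS
```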

Run tasks before job times out

SLURM provides an option to send a signal to your job before it times out.

--signal=[B:]<sig_num>[@<sig_time>]
    When a job is within sig_time seconds of its end time, send it the signal sig_num. Due to the resolution of event handling by Slurm, the signal may be sent up to 60 seconds earlier than specified.
    sig_num may either be a signal number or name (e.g. "10" or "USR1"). sig_time must have an integer value between 0 and 65535. By default, no signal is sent before the job’s end time. If a sig_num
    is specified without any sig_time, the default time will be 60 seconds. Use the "B:" option to signal only the batch shell, none of the other processes will be signaled. By default all job steps
    will be signaled, but not the batch shell itself.


Example job-script that uses signal
#!/bin/bash

#SBATCH --partition=batch
#SBATCH --nodes=1
#SBATCH --time=10


#SBATCH --signal=B:USR1@60

pre_timeout_task() {
    # Perform some task before the job finishes
    echo "from timeout task" `date`
    sleep 10
    echo "timeout task done"
}
trap 'pre_timeout_task' USR1


# Use "&" after your application and "wait" - this is important
my_application  &
wait

