Simple Linux Utility for Resource Management (SLURM)
Balena uses SLURM (Simple Linux Utility for Resource Management) for resource management and job scheduling. It is one of the most popular job scheduling systems available and is used on about 40 percent of the largest computers in the world (Top500), including Tianhe-2, which tops the list.
The compute nodes managed by the scheduler are divided into the following SLURM partitions/queues. Users should select the appropriate partition based on their job requirements (an example of selecting a partition follows the table).
Partition Name | Nodes | Characteristics |
---|---|---|
batch | 158 | Default partition - jobs that do not request a partition are submitted into this partition |
batch-64gb | 88 | Nodes with 64GB RAM (DDR3 1866 MHz), using dual ranked DIMMS |
batch-128gb | 80 | Nodes with 128GB RAM (DDR3 1866 MHz), using single ranked DIMMS |
batch-512gb | 2 | Nodes with 512GB RAM (DDR3 1333 MHz) |
batch-acc | 22 | Nodes with accelerators - GPUs(K20x and P100)/MIC/NVMe |
batch-micnative | 4 | MIC Cards for native mode |
batch-all | 179 | All Ivybridge compute nodes (except 512GB nodes, and MIC native) |
batch-sky | 16 | Skylake compute nodes (192GB DDR4 2666MHz) |
batch-devel | 4 | Nodes with 64GB RAM (DDR3 1866 MHz) |
itd | 4 | Nodes for Interactive Test and Development - 2 nodes with a GPU and 2 nodes with a Xeon Phi |
itd-sky | 1 | Skylake node for Interactive Test and Development |
teaching | variable | Nodes with 64GB RAM (DDR3 1866 MHz) - this partition is dedicated for the use of academic courses that run on Balena |
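As a minimal sketch (the script name and body are hypothetical), a job can be directed to a particular partition either with a directive in the job script:

```
#!/bin/bash
## request the Skylake partition instead of the default "batch" partition
#SBATCH --partition=batch-sky
#SBATCH --nodes=1
#SBATCH --time=01:00:00

./myparallelscript
```

or at submission time, e.g. `sbatch --partition=batch-sky myjob.slm`.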
Users can view the project accounts to which they have access using the `sshare` command.
```
[user123@balena-01 ~]$ sshare
             Account       User  Raw Shares  Norm Shares   Raw Usage  Effectv Usage  FairShare
-------------------- ---------- ----------- ------------ ----------- -------------- ----------
free                    user123      parent     0.100000        6466       0.657390   0.010497
prj-cc001               user123      parent     0.900000        6032       0.342610   0.768076
```
Account Type | Maximum Walltime | Max nodes | Max CPU cores | Max CPU time | Starting Priority |
---|---|---|---|---|---|
teaching | 15 minutes | 4 | 64 | NA | dedicated |
free | 6 hours | 16 | 256 | 384 core-hours | 0 |
premium | 5 days | 32 | 512 (default) >512 cores | NA | +24 hours |
The free account is restricted to a maximum CPU time of 384 core-hours (23040 core-minutes) per job.
Example job sizes that can run within this limit include:
Walltime | Nodes | CPU cores | Total CPU time (mins) | Total node time (mins) |
---|---|---|---|---|
6 hours | 4 | 64 | 23040 | 1440 |
3 hours | 8 | 128 | 23040 | 1440 |
90 minutes | 16 | 256 | 23040 | 1440 |
A user is limited to 115,200 running CPU-minutes for jobs submitted from the free account; this is equivalent to five concurrent jobs, each using 4 nodes and running for 6 hours. For example, if a user submits six jobs each requesting 4 nodes for 6 hours, only 5 of those jobs can run concurrently and the sixth will have to wait for sufficient running CPU-minutes to become available before SLURM gives it an allocation.
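As a sketch of how to stay within this limit (the application name is hypothetical), the request below corresponds to the 8-node row of the table above, i.e. 128 cores x 180 minutes = 23,040 core-minutes:

```
#!/bin/bash
#SBATCH --account=free
#SBATCH --partition=batch
#SBATCH --nodes=8                # 8 nodes x 16 cores = 128 cores
#SBATCH --ntasks-per-node=16
#SBATCH --time=03:00:00          # 128 cores x 180 min = 23040 core-minutes (384 core-hours)

./myparallelscript
```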
The teaching accounts (e.g. cm30225) will only be able to use the teaching partition and will not be able to run jobs in any other partition.
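A minimal sketch of the corresponding directives in a job script (cm30225 is the course-code example above; substitute your own teaching account):

```
#SBATCH --account=cm30225        # teaching account (course code)
#SBATCH --partition=teaching     # teaching accounts can only use this partition
```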
The `sprio` command can be used to view a job's priority and the components making up that priority; the priority of a job is determined by the sum of two components. A user's fairshare can be inspected with the `sshare` command. A decay half-life of 2 days is applied for all users: this halves a user's raw usage every 2 days, which decreases the effective usage (E) and in turn increases the user's fairshare.
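For example (a sketch; `<jobid>` is a placeholder for a pending job's ID):

```
$ sprio -u user123       # priority components for user123's pending jobs
$ sprio -j <jobid>       # priority components for a specific pending job
```

An example job script that submits a 2-node, 32-task MPI job to the batch partition from the free account is shown below.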
```
#!/bin/bash

# set the account to be used for the job
#SBATCH --account=free

# set name of job and output/error files
#SBATCH --job-name=myjob
#SBATCH --output=myjob.out
#SBATCH --error=myjob.err

# set the number of nodes, tasks per node and partition
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --partition=batch

# set max wallclock time
#SBATCH --time=04:00:00

# mail alert at end of execution
#SBATCH --mail-type=END

# send mail to this address
#SBATCH --mail-user=user123@bath.ac.uk

# Load all the dependent modules
module purge                 # clear all modules from the environment
module load slurm
module load intel/compilers
module load intel/mpi
module load intel/mkl

# run the application
./myparallelscript
```
Submit the job script with `sbatch`:

```
[user123@balena-01 ~]$ sbatch example.slm
Submitted batch job 11
```
View information about jobs located in the SLURM scheduling queue with `squeue`:
```
[user123@balena-01 ~]$ squeue
JOBID  NAME   USER     ACCOUNT  PARTITION  ST  NODES  CPUS  MIN_MEMORY  START_TIME           TIME_LEFT  PRIORITY  NODELIST(REASON)
11     myjob  user123  free     batch      R   1      16    62K         2015-04-30T12:43:37  3:59:57    9         node-sw-081
```
Detailed information about an individual job can be displayed with `scontrol show job`:

```
[user123@balena-01 ~]$ scontrol show job 11
JobId=11 Name=myjob
   UserId=user123(123) GroupId=balena_cc(10305)
   Priority=9 Nice=0 Account=free QOS=free WCKey=*
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:08 TimeLimit=04:00:00 TimeMin=N/A
   SubmitTime=2015-04-30T12:43:37 EligibleTime=2015-04-30T12:43:37
   StartTime=2015-04-30T12:43:37 EndTime=2015-04-30T16:43:37
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=batch AllocNode:Sid=balena-02:11458
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=node-sw-081
   BatchHost=node-sw-081
   NumNodes=1 NumCPUs=16 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=16:0:*:* CoreSpec=0
   MinCPUsNode=16 MinMemoryNode=62G MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/n/user123/testing/example.slm
   WorkDir=/home/n/user123/testing
   StdErr=/home/n/user123/testing/myjob.err
   StdIn=/dev/null
   StdOut=/home/n/user123/testing/myjob.out
```
A running or queued job can be cancelled with `scancel`:

```
$ squeue
  JOBID PARTITION  NAME   USER ST  TIME  NODES NODELIST(REASON)
     16     batch  hello rtm25  R  0:03      1 slurm-compute-02
     15     batch  hello rtm25  R  0:06      1 slurm-compute-01
$ scancel 15
$ squeue
  JOBID PARTITION  NAME   USER ST  TIME  NODES NODELIST(REASON)
     16     batch  hello rtm25  R  0:11      1 slurm-compute-02
```
A pending job can be placed on hold with `scontrol hold`:

```
$ squeue
  JOBID PARTITION  NAME   USER ST  TIME  NODES NODELIST(REASON)
     19     batch  hello rtm25 PD  0:00      1 (Resources)
     20     batch  hello rtm25 PD  0:00      1 (Priority)
     21     batch  hello rtm25 PD  0:00      1 (Priority)
     18     batch  hello rtm25  R  0:03      1 slurm-compute-02
     17     batch  hello rtm25  R  0:05      1 slurm-compute-01
$ scontrol hold 20
$ squeue
  JOBID PARTITION  NAME   USER ST  TIME  NODES NODELIST(REASON)
     19     batch  hello rtm25 PD  0:00      1 (Resources)
     21     batch  hello rtm25 PD  0:00      1 (Priority)
     20     batch  hello rtm25 PD  0:00      1 (JobHeldUser)
     18     batch  hello rtm25  R  0:13      1 slurm-compute-02
     17     batch  hello rtm25  R  0:15      1 slurm-compute-01
```
A held job is released again with `scontrol release`:

```
$ scontrol release 20
$ squeue
  JOBID PARTITION  NAME   USER ST  TIME  NODES NODELIST(REASON)
     20     batch  hello rtm25 PD  0:00      1 (Resources)
     21     batch  hello rtm25  R  0:10      1 slurm-compute-02
     19     batch  hello rtm25  R  0:11      1 slurm-compute-01
```
The `sinfo` command shows the state of the partitions and nodes:

```
[user123@balena-01 ~]$ sinfo
PARTITION    AVAIL  TIMELIMIT  NODES  STATE  NODELIST
batch*          up   infinite    168   idle  node-sw-[001-168]
batch-acc       up   infinite     11   idle  node-as-agpu-01,node-as-ngpu-[01-06],node-dw-ngpu-[001-004]
batch-all       up   infinite    179   idle  node-as-agpu-01,node-as-ngpu-[01-06],node-dw-ngpu-[001-004],node-sw-[001-168]
batch-512gb     up   infinite      2   idle  node-sw-fat-[01-02]
batch-64gb      up   infinite     88   idle  node-sw-[081-168]
batch-128gb     up   infinite     80   idle  node-sw-[001-080]
itd             up   infinite      2   idle  itd-ngpu-[01-02]
```
A job submitted with `--dependency singleton` can begin execution only after any previously launched jobs sharing the same job name and user have terminated:
```
$ sbatch hello.slurm
Submitted batch job 22
$ sbatch --dependency singleton hello.slurm
Submitted batch job 23
$ squeue
  JOBID PARTITION  NAME   USER ST  TIME  NODES NODELIST(REASON)
     23     batch  hello rtm25 PD  0:00      1 (Dependency)
     22     batch  hello rtm25  R  0:07      1 slurm-compute-01
$ squeue
  JOBID PARTITION  NAME   USER ST  TIME  NODES NODELIST(REASON)
     23     batch  hello rtm25  R  0:02      1 slurm-compute-01
```
User commands | SLURM |
---|---|
Job submission | sbatch [script_file] |
Queue list | squeue |
Queue list (by user) | squeue -u [user_name] |
Job deletion | scancel [job_id] |
Job information | scontrol show job [job_id] |
Job hold | scontrol hold [job_id] |
Job release | scontrol release [job_id] |
Node list | sinfo --Nodes --long |
Cluster status | sinfo or squeue |
GUI | sview (graphical user interface to view and modify SLURM state) |
Environment | Description |
---|---|
$SLURM_ARRAY_TASK_ID | Job array ID (index) number |
$SLURM_ARRAY_JOB_ID | Job array's master job ID number |
$SLURM_JOB_ID | The ID of the job allocation |
$SLURM_JOB_DEPENDENCY | Set to value of the --dependency option |
$SLURM_JOB_NAME | Name of the job |
$SLURM_JOB_NODELIST | List of nodes allocated to the job |
$SLURM_JOB_NUM_NODES | Total number of nodes in the job's resource allocation |
$SLURM_JOB_PARTITION | Name of the partition in which the job is running |
$SLURM_MEM_PER_NODE | Memory requested per node |
$SLURM_NODEID | ID of the nodes allocated |
$SLURM_NTASKS | Number of tasks requested. Same as -n, --ntasks. To be used with mpirun, e.g. mpirun -np $SLURM_NTASKS binary |
$SLURM_NTASKS_PER_NODE | Number of tasks requested per node. Only set if the --ntasks-per-node option is specified |
$SLURM_PROCID | The MPI rank (or relative process ID) of the current process |
$SLURM_RESTART_COUNT | If the job has been restarted due to system failure or has been explicitly requeued, this will be set to the number of times the job has been restarted |
$SLURM_SUBMIT_DIR | The directory from which sbatch was invoked |
$SLURM_SUBMIT_HOST | The hostname of the computer from which sbatch was invoked |
$SLURM_TASKS_PER_NODE | Number of tasks to be initiated on each node. Values are comma separated and in the same order as $SLURM_JOB_NODELIST |
$SLURM_TOPOLOGY_ADDR | The value will be set to the names of the network switches which may be involved in the job's communications, from the system's top-level switch down to the leaf switch, ending with the node name |
$SLURM_TOPOLOGY_ADDR_PATTERN | The value will be set to the component types listed in $SLURM_TOPOLOGY_ADDR. Each component will be identified as either "switch" or "node". A period is used to separate each hardware component type |
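As a brief sketch of how these variables are typically used inside a job script (the executable name is hypothetical):

```
# run from the directory the job was submitted from
cd $SLURM_SUBMIT_DIR

echo "Running on nodes: $SLURM_JOB_NODELIST"

# launch one MPI process per requested task
mpirun -np $SLURM_NTASKS ./my_mpi_binary
```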
Job specification | SLURM |
---|---|
Script directive | #SBATCH |
Account to charge | --account=[account] |
Begin Time | --begin=YYYY-MM-DD[HH:MM[:SS]] |
Combine stdout/stderr | (use --output without the --error) |
Copy Environment | --export=[ALL|NONE|variable] |
CPU Count | --ntasks [count] |
CPUs Per Task | --cpus-per-task=[count] |
Email Address | --mail-user=[address] |
Event Notification | --mail-type=[events] eg. BEGIN, END, FAIL, REQUEUE, and ALL (any state change) |
Generic Resources | --gres=[resource_spec] eg. gpu:4 or mic:4 |
Node features | --constraint=[feature] eg. k20x, s10k and 5110p |
Job Arrays | --array=[array_spec] |
Job Dependency | --depend=[state:job_id] |
Job host preference | --nodelist=[nodes] AND/OR --exclude=[nodes] |
Job Name | --job-name=[name] |
Job Restart | --requeue OR --no-requeue |
Licenses | --licenses=[license_spec] |
Memory Size | --mem=[mem][M][G][T] |
Node Count | --nodes=[min[-max]] |
Quality of Service | --qos=[name] |
Queue | --partition=[queue] |
Resource Sharing | --exclusive OR --shared |
Standard Error File | --error=[file_name] |
Standard Output File | --output=[file_name] |
Tasks Per Node | --ntasks-per-node=[count] |
Wall Clock Limit | --time=[min] OR [days-hh:mm:ss] |
Working Directory | --workdir=[dir_name] |
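A minimal sketch combining several of these options into a job array, where each array task picks its own input file via $SLURM_ARRAY_TASK_ID (the input file naming and application are hypothetical):

```
#!/bin/bash
#SBATCH --account=free
#SBATCH --partition=batch
#SBATCH --job-name=array-example
#SBATCH --array=1-10               # 10 array tasks, indices 1..10
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:30:00
#SBATCH --output=array-%A_%a.out   # %A = array job ID, %a = array index

# each array task processes its own input file
./myparallelscript input_${SLURM_ARRAY_TASK_ID}.dat
```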
The `sinfo --partition=batch-acc --format="%10P %.5D %.4c %.8m %7G %8f %N"` command will reveal additional information about the different features available on the compute nodes, e.g. accelerator cards. These specific resources can be requested in sbatch scripts using the `--gres` and `--constraint` options.
```
# sinfo --partition=batch-acc --format="%10P %.5D %.4c %.7m %7G %8f %N"
PARTITION  NODES CPUS  MEMORY GRES    AVAIL_FE NODELIST
batch-acc      1   16   64508 gpu:4   p100     node-as-ngpu-005
batch-acc      2   16   64508 gpu:1   p100     node-as-ngpu-[006-007]
batch-acc      1   16   64508 mic:4   5110p,mi node-as-phi-001
batch-acc      5   16  129105 mic:1   5110p,mi node-dw-phi-[001-005]
batch-acc      2   16  129105 mic:2   5110p,mi node-dw-phi-[006-007]
batch-acc      2   16  64508+ (null)  nvme     node-nvme-[001-002]
batch-acc      4   16   64508 gpu:4   k20x     node-as-ngpu-[001-004]
batch-acc      3   16  129105 gpu:1   k20x     node-dw-ngpu-[001-003]
batch-acc      2   16  129105 gpu:2   k20x     node-dw-ngpu-[004-005]
```
```
# sinfo --partition=batch-micnative --format="%15P %.5D %.4c %.8m %7G %15f %N"
PARTITION       NODES CPUS MEMORY GRES   AVAIL_FEATURES NODELIST
batch-micnative     4    1   7697 (null) 5110p,miccard  node-as-phi-002-mic[0-3]
```
Balena has two different types of Nvidia GPU resources available - K20x and P100. Please use the --constraint parameter within SLURM to choose specific resources.
```
## batch-acc partition contains all the accelerator nodes
#SBATCH --partition=batch-acc

## Requesting NVIDIA K20x nodes with n cards
#SBATCH --constraint=k20x
#SBATCH --gres=gpu:n
```
```
## batch-acc partition contains all the accelerator nodes
#SBATCH --partition=batch-acc

## Requesting NVIDIA P100 nodes with n cards
#SBATCH --constraint=p100
#SBATCH --gres=gpu:n
```
```
## batch-acc partition contains all the accelerator nodes
#SBATCH --partition=batch-acc

## Requesting NVMe nodes
#SBATCH --constraint=nvme
```
Each of these nodes has about 2TB of NVMe storage (NVMe SSD DC P3600):
```
$ df -h /nvme
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p1  1.9T   33M  1.9T   1% /nvme
```
You can read and write data using this path; file and directory names should be prefixed with `/nvme/`.
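A minimal sketch of staging data on the node-local NVMe drive inside a job script (the file names and application are hypothetical):

```
# stage input data onto the fast node-local NVMe storage
cp $HOME/input.dat /nvme/input.dat

# run against the NVMe copy
./my_io_heavy_app /nvme/input.dat /nvme/output.dat

# copy results back to permanent storage before the job ends
cp /nvme/output.dat $HOME/output.dat
```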
```
## batch-acc partition contains all the accelerator nodes
#SBATCH --partition=batch-acc

## Requesting Xeon Phi (MIC) nodes with n cards (max 4)
#SBATCH --gres=mic:n
```
```
## batch-micnative partition contains all the accelerator cards
#SBATCH --partition=batch-micnative

## Requesting n Xeon Phi (MIC) nodes (max 4 nodes)
#SBATCH --nodes=n
#SBATCH --constraint=5110p
```
A batch-devel partition is available for users to test their SLURM job scripts. All users have access to it, and jobs submitted to this partition are limited as follows:
Account type | Maximum Walltime | Max Nodes | Max CPU cores | Max jobs per user (at a time) |
---|---|---|---|---|
ALL | 15 minutes (00:15:00) | 4 | 64 | 1 |
Users can access this partition by specifying the following in their SLURM job script:

```
#SBATCH --partition=batch-devel
#SBATCH --qos=devel
```
By default, all interactive jobs are submitted to the ITD partition using the free account (maximum walltime of 6 hours). The resources of this partition are used in SHARED mode, which means all users allocated to a particular node have equal access to all of its resources (CPU, MEM, GPU, MIC). Each user is limited to one interactive job on the ITD partition.
```
# sinfo --partition=itd --format="%10P %.5D %.4c %.7m %15f %N"
PARTITION  NODES CPUS  MEMORY AVAIL_FEATURES  NODELIST
itd            1   16  129150 p100,ivybridge  itd-ngpu-02
itd            1   16  129150 5110p,ivybridge itd-phi-01
itd            1   16  129150 k20x,ivybridge  itd-ngpu-01
```
```
# sinfo --partition=itd-sky --format="%10P %.5D %.4c %.7m %15f %N"
PARTITION  NODES CPUS  MEMORY AVAIL_FEATURES NODELIST
itd-sky        1   24  193220 skylake        itd-sky-01
```
To start an interactive session, run:
$ sinteractive
For interactive sessions using a specific resource, use the `--constraint` option, as in the examples below, to specify the type of resource required. The itd partition is configured with nodes having either 1 GPU (NVIDIA K20x or P100) or 1 MIC (Xeon Phi 5110P):
$ sinteractive --constraint=k20x
$ sinteractive --constraint=p100
$ sinteractive --constraint=5110p
$ sinteractive --constraint=skylake
For an EXCLUSIVE interactive session, use a specific partition depending on your node requirements:
$ sinteractive --time=00:20:00 --partition=batch-acc --constraint=k20x
The `sacct` command reports accounting information by individual job and job step:
$ sacct --job=<jobid>
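For example (a sketch; the field selection shown is just one reasonable choice), specific fields can be requested with the `--format` option:
$ sacct --job=<jobid> --format=JobID,JobName,Partition,AllocCPUS,Elapsed,State,ExitCode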
SLURM provides an option to send a signal to your job before it times out:
--signal=[B:]<sig_num>[@<sig_time>]
When a job is within sig_time seconds of its end time, send it the signal sig_num. Due to the resolution of event handling by SLURM, the signal may be sent up to 60 seconds earlier than specified. sig_num may either be a signal number or name (e.g. "10" or "USR1"). sig_time must have an integer value between 0 and 65535. By default, no signal is sent before the job's end time. If a sig_num is specified without any sig_time, the default time will be 60 seconds. Use the "B:" option to signal only the batch shell; none of the other processes will be signaled. By default all job steps will be signaled, but not the batch shell itself.
An example job script that traps the signal and runs a task before the job times out:

```
#!/bin/bash
#SBATCH --partition=batch
#SBATCH --nodes=1
#SBATCH --time=10
#SBATCH --signal=B:USR1@60

pre_timeout_task() {
    # Perform some task before the job finishes
    echo "from timeout task" `date`
    sleep 10
    echo "timeout task done"
}

trap 'pre_timeout_task' USR1

# Use "&" after your application and "wait" - this is important
my_application &
wait
```