Simple Linux Utility for Resource Management (SLURM)
Balena uses SLURM (Simple Linux Utility for Resource Management) for resource management and job scheduling. It is one of the most popular job scheduling systems available and is used on about 40 percent of the largest computers in the world (the Top500 list), including Tianhe-2, which sits at the top of that list.
The compute nodes managed by the scheduler are divided into the following SLURM partitions/queues. Users should select the appropriate partition based on their job requirements.
|Nodes||Description|
|158||Default partition - jobs that do not request a partition are submitted to this partition|
|88||Nodes with 64GB RAM (DDR3 1866 MHz), using dual-ranked DIMMs|
|80||Nodes with 128GB RAM (DDR3 1866 MHz), using single-ranked DIMMs|
|2||Nodes with 512GB RAM (DDR3 1333 MHz)|
|22||Nodes with accelerators - GPUs (K20x and P100)/MIC/NVMe|
|4||MIC cards for native mode|
|179||All Ivybridge compute nodes (except 512GB nodes, and MIC native)|
|16||Skylake compute nodes (192GB DDR4 2666 MHz)|
|4||Nodes with 64GB RAM (DDR3 1866 MHz)|
|4||Nodes for Interactive Test and Development - 2 nodes with a GPU and 2 nodes with a Xeon Phi|
|1||Skylake node for Interactive Test and Development|
Nodes with 64GB RAM (DDR3 1866 MHz) - this partition is dedicated to the academic courses that run on Balena.
Users can view the project accounts to which they have access.
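One way to do this with the standard SLURM accounting tools (the exact command or wrapper recommended on Balena may differ):

sacctmgr show associations user=$USER format=account,partition,qos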
|Account type||Max walltime||Max nodes||Max CPU cores||Max CPU time||Priority|
|free||6 hours||16||256||384 core-hours||0|
The free account is restricted to a maximum CPU time of 384 core-hours (23040 core-minutes) per job.
Example job sizes that can run within this limit include:
|Walltime||Nodes||CPU cores||Total CPU time (mins)||Total node time (mins)|
A user is limited to 115,200 running CPU-minutes for jobs submitted from the free account - this is equivalent to five concurrent jobs, each using 4 nodes and running for 6 hours. For example, if a user submits six jobs each requesting 4 nodes for 6 hours, only five of those jobs can run concurrently; the sixth has to wait until sufficient running CPU-minutes become available before SLURM will grant it an allocation.
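As a sanity check on these numbers (assuming 16 CPU cores per node, consistent with the 16-node / 256-core limit above): one 4-node, 6-hour job uses 4 x 16 x 360 = 23,040 CPU-minutes, and five such jobs running at once use 5 x 23,040 = 115,200 CPU-minutes, exactly the limit.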
The teaching accounts (e.g. cm30225) can only use the teaching partition and will not be able to run jobs in any other partition.
The sprio command can be used to view a job's priority and the components making up that priority. The priority of a job is determined by the sum of two components:
A decay half-life of 2 days is applied for all users. This decays a user's raw usage to half every 2 days, thus decreasing the effective usage (E) and thereby increasing the user's fairshare.
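For example, to inspect the priority of a pending job and the components contributing to it (standard sprio usage; the priority weights configured on Balena may differ):

sprio -l -j [job_id]   # long format, single job
sprio -l -u [user_name]   # all pending jobs for a user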
View information about jobs located in the SLURM scheduling queue.
This job can begin execution after any previously launched jobs sharing the same job name and user have terminated.
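This behaviour corresponds to SLURM's singleton dependency type. A sketch of how it is typically requested:

sbatch --dependency=singleton --job-name=[job_name] [job_script]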
|Queue list (by user)||squeue -u [user_name]|
scontrol show job [job_id]
scontrol hold [job_id]
scontrol release [job_id]
sinfo --Node --long
sinfo or squeue
sview   # graphical user interface to view and modify SLURM state
|$SLURM_ARRAY_TASK_ID||Job array ID (index) number|
|$SLURM_ARRAY_JOB_ID||Job array's master job ID number|
|$SLURM_JOB_ID||The ID of the job allocation|
|$SLURM_JOB_DEPENDENCY||Set to value of the --dependency option|
|$SLURM_JOB_NAME||Name of the job|
|$SLURM_JOB_NODELIST||List of nodes allocated to the job|
|$SLURM_JOB_NUM_NODES||Total number of nodes in the job's resource allocation|
|$SLURM_JOB_PARTITION||Name of the partition in which the job is running|
|$SLURM_MEM_PER_NODE||Memory requested per node|
|$SLURM_NODEID||ID of the nodes allocated|
|$SLURM_NTASKS||Number of tasks requested. Same as -n, --ntasks. To be used with mpirun, e.g. mpirun -np $SLURM_NTASKS binary|
|$SLURM_NTASKS_PER_NODE||Number of tasks requested per node. Only set if the --ntasks-per-node option is specified|
|$SLURM_PROCID||The MPI rank (or relative process ID) of the current process|
|$SLURM_RESTART_COUNT||If the job has been restarted due to system failure or has been explicitly requeued, this will be set to the number of times the job has been restarted|
|$SLURM_SUBMIT_DIR||The directory from which sbatch was invoked|
|$SLURM_SUBMIT_HOST||The hostname of the computer from which sbatch was invoked|
|$SLURM_TASKS_PER_NODE||Number of tasks to be initiated on each node. Values are comma separated and in the same order as $SLURM_JOB_NODELIST|
|$SLURM_TOPOLOGY_ADDR||The value will be set to the names of the network switches which may be involved in the job's communications, from the system's top level switch down to the leaf switch, and ending with the node name|
|$SLURM_TOPOLOGY_ADDR_PATTERN||The value will be set to the component types listed in $SLURM_TOPOLOGY_ADDR. Each component will be identified as either "switch" or "node". A period is used to separate each hardware component type|
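A brief sketch of how these variables are typically used inside a job script (they are only defined at run time, once SLURM has made the allocation):

echo "Job $SLURM_JOB_ID running in partition $SLURM_JOB_PARTITION"
echo "Allocated $SLURM_JOB_NUM_NODES node(s): $SLURM_JOB_NODELIST"
cd "$SLURM_SUBMIT_DIR"   # return to the directory the job was submitted from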
|Description||SLURM (sbatch) option|
|Account to charge||--account=[account]|
|Combine stdout/stderr||(use --output without the --error)|
|CPUs Per Task||--cpus-per-task=[count]|
|Event notification||--mail-type=[events] e.g. BEGIN, END, FAIL, REQUEUE, and ALL (any state change)|
|Generic resources||--gres=[resource_spec] e.g. gpu:4 or mic:4|
|Node features (constraint)||--constraint=[feature] e.g. k20x, s10k and 5110p|
|Job host preference||--nodelist=[nodes] AND/OR --exclude=[nodes]|
|Job restart||--requeue OR --no-requeue|
|Quality of Service||--qos=[qos_name]|
|Node exclusivity||--exclusive OR --shared|
|Standard Error File||--error=[file_name]|
|Standard Output File||--output=[file_name]|
|Tasks Per Node||--ntasks-per-node=[count]|
|Wall Clock Limit||--time=[min] OR [days-hh:mm:ss]|
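Putting a few of these directives together, a minimal job script sketch might look like the following (the account, partition, job name, and binary are placeholders; 16 tasks per node is an assumption consistent with the core/node limits quoted above):

#!/bin/bash
#SBATCH --account=[account]
#SBATCH --job-name=[job_name]
#SBATCH --partition=[partition]
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --time=01:00:00
#SBATCH --output=[job_name].%j.out
#SBATCH --mail-type=END,FAIL

# Launch one MPI rank per requested task
mpirun -np $SLURM_NTASKS ./[binary]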
The `sinfo --partition=batch-acc --format="%10P %.5D %.4c %.8m %7G %8f %N"` command will reveal additional information about the different features available on the compute nodes, e.g. accelerator cards. These specific resources can be requested in sbatch scripts using the --gres and --constraint options described above.
Balena has two different types of Nvidia GPU resources available - K20x and P100. Please use the --constraint parameter within SLURM to choose a specific GPU type.
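For example, to request a single K20x GPU on the accelerator partition (a sketch; adjust the partition and feature names to match the sinfo output above):

#SBATCH --partition=batch-acc
#SBATCH --gres=gpu:1
#SBATCH --constraint=k20x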
Each node has about 2TB of NVMe storage (NVMe SSD DC P3600).
You can read and write data using the path below.
A batch-devel partition is available for users to test their SLURM job scripts. All users have access to it, and jobs submitted to this partition are limited as follows:
|Account type||Maximum Walltime||Max Nodes||Max CPU cores||Max jobs per user (at a time)|
|ALL||15 minutes (00:15:00)||4||64||1|
Users can access this partition by specifying the following in their SLURM job script:
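For example:

#SBATCH --partition=batch-devel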
By default, all interactive jobs are submitted to the ITD partition using the free account (maximum walltime of 6 hours). The resources of this partition are used in SHARED mode, which means all users allocated to a particular node have equal access to all of its resources (CPU, memory, GPU, MIC). Each user is limited to one interactive job on the ITD partition.
For interactive sessions using a specific resource, use the --gres option to specify the resource and the quantity required. The itd partition is configured with nodes having either 1 GPU (Nvidia K20x or Nvidia P100) or 1 MIC (Xeon Phi 5110p).
For an EXCLUSIVE interactive session, use a specific partition depending on your node requirement.
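A sketch of requesting such sessions with stock SLURM commands (Balena may also provide its own wrapper for interactive jobs):

srun --partition=itd --gres=gpu:1 --time=01:00:00 --pty bash   # shared ITD node with one GPU
srun --partition=[partition] --nodes=1 --exclusive --pty bash   # exclusive node in a chosen partition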
SLURM provides an option to send a signal to your job before it times out:
When a job is within sig_time seconds of its end time, send it the signal sig_num. Due to the resolution of event handling by SLURM, the signal may be sent up to 60 seconds earlier than specified. sig_num may either be a signal number or name (e.g. "10" or "USR1"). sig_time must have an integer value between 0 and 65535. By default, no signal is sent before the job's end time. If a sig_num is specified without any sig_time, the default time will be 60 seconds. Use the "B:" option to signal only the batch shell; none of the other processes will be signaled. By default all job steps will be signaled, but not the batch shell itself.
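A minimal sketch of using this in a job script (the signal choice, timing, and checkpoint action are illustrative): SLURM is asked to send USR1 to the batch shell 300 seconds before the walltime limit, and the script traps it so it can checkpoint or clean up before being killed.

#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --signal=B:USR1@300

# React to the early-warning signal
trap 'echo "Approaching walltime - checkpointing"; touch checkpoint.flag' USR1

# Run the workload in the background and wait, so the trap can fire promptly
./[binary] &
wait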