
Scheduling

Scheduling computational jobs

You may need to run many jobs, or the same job repeatedly. SLURM allows you to schedule these jobs and allocate resources to them.

UFRC uses the Simple Linux Utility for Resource Management (SLURM) to allocate resources and schedule jobs. Users create SLURM job scripts to submit jobs to the system. These scripts can, and should, be modified to control several aspects of your job, such as resource allocation, email notifications, and the output destination.
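As a sketch of what such a script might look like (the job name, output file, software module, and resource requests below are placeholders; adjust them to your own work):

#!/bin/bash
#SBATCH --job-name=my_job          # job name shown by squeue
#SBATCH --output=my_job_%j.log     # standard output and error log (%j expands to the job ID)
#SBATCH --mail-type=END,FAIL       # email notifications when the job ends or fails
#SBATCH --mail-user=<your_email>   # address for notifications
#SBATCH --ntasks=1                 # run a single task
#SBATCH --cpus-per-task=4          # cores allocated to that task
#SBATCH --mem=14gb                 # total memory for the job
#SBATCH --time=01:00:00            # wall-time limit (hh:mm:ss)

module load python                 # load whatever software your job needs
python my_script.py                # the actual work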

To submit a job script from one of the login nodes accessed via hpg.rc.ufl.edu, use the following command:

sbatch <your_job_script>

To check the status of submitted jobs, use the following command:

squeue -u <username>

See SLURM Commands for additional useful commands.

Managing Cores and Memory

See Account and QOS limits under SLURM for the main documentation on efficient management of computational resources and an extensive explanation of QOS and SLURM account use.

The amount of resources within an investment is calculated in NCU (Normalized Computing Units), which correspond to 1 CPU core and about 3.5GB of memory for each NCU purchased. CPUs (cores) and RAM are allocated to jobs independently as requested by your job script.

  • Your group's investment can run out of cores (SLURM shows QOSGrpCpuLimit as the reason a job is pending) or memory (SLURM shows QOSGrpMemLimit), depending on the resources already in use by running jobs.

The majority of HiPerGator nodes have the same ratio of about 4 GB of RAM per core, which, after accounting for the operating system and system services, leaves about 3.5 GB usable for jobs; hence the ratio of 1 core and 3.5GB RAM per NCU.
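As a hypothetical illustration of how the two pools are drawn down independently (the group size and job request are invented for the example):

# A group that purchased 100 NCUs can run jobs totalling:
#   100 x 1      = 100 cores
#   100 x 3.5 GB = 350 GB of RAM
# A single job requesting 20 cores and 175 GB of RAM consumes 20% of the
# group's cores but 50% of its memory, so additional jobs may pend with
# QOSGrpMemLimit well before QOSGrpCpuLimit is reached.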

Most HiPerGator nodes have 128 CPU cores and 1000GB of RAM. The bigmem nodes go up to 4TB of available memory. See Available Node Features for the exact data on resources available on all types of nodes on HiPerGator.

You must specify both the number of cores and the amount of RAM needed in the job script. Request memory with either --mem (total job memory) or --mem-per-cpu (per-core memory); otherwise, the job is assigned the default of 600 MB of memory.
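For example, the directives below are a sketch of two equivalent ways to request 4 cores and 14 GB of memory (the --ntasks/--cpus-per-task split is one common way to request cores; adapt it to your application):

#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=14gb            # total memory for the whole job

# or, equivalently:

#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=3500mb  # memory per allocated core (4 x 3.5 GB = 14 GB)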

If your job needs more memory than a standard node provides, you will need to run on the bigmem nodes, which offer up to 4 TB of RAM.
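A sketch of a large-memory request, assuming the large-memory partition is named bigmem (check Available Node Features for the exact partition names and limits):

#SBATCH --partition=bigmem    # assumed name of the large-memory partition
#SBATCH --mem=2000gb          # example: request 2 TB of RAM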

Monitoring Your Workloads

You can view your currently running and pending workloads with the squeue command, e.g., $ squeue -u <username>.

Open OnDemand also offers a way to monitor jobs via the Jobs menu in the upper toolbar of your dashboard, which shows your running, pending, and recently completed jobs. Select Jobs -> Active Jobs from the upper dashboard menu.

We provide a number of helpful commands in the ufrc environment module. It is loaded by default at login, but you can reload it at any time with the following command: $ module load ufrc.

Examples of SLURM- and HiPerGator-specific commands provided by the ufrc environment module:

slurmInfo           # displays resource usage for your group
slurmInfo -p        # displays resource usage per partition
showQos             # displays your available QoS
home_quota          # displays your /home quota
blue_quota          # displays your /blue quota
orange_quota        # displays your /orange quota
sacct               # displays job ID and state of your recent workloads
nodeInfo            # displays partitions by node types, showing total RAM and other features
sinfo -p partition  # displays the status of nodes in a partition
jobhtop             # displays resource usage for jobs
jobnvtop            # displays resource usage for GPU jobs
which python        # displays the path to the Python executable provided by the environment modules you have loaded
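For example, to check your group's per-partition usage and review recent jobs in more detail (the start date and format fields below are only examples; sacct is standard SLURM rather than part of the ufrc module):

module load ufrc
slurmInfo -p
sacct --starttime=2024-05-01 --format=JobID,JobName,Partition,State,Elapsed,MaxRSS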