GPU Access¶

GPUs Per Group¶

Normalized Graphics Processor Units (NGUs) include all of the infrastructure (memory, network, rack space, cooling) necessary for GPU-accelerated computation. At present, each NGU is equivalent to 1 GPU; however, newer GPUs such as the B200s may require more than 1 NGU per GPU in the future.

To use GPU resources, HiPerGator groups need an active NGU investment. To check whether your group has GPUs allocated and available, load the "ufrc" module and run slurmInfo -g group_name.
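A minimal sketch of that check is shown here. It assumes the `module` command is available in your shell, and `my_group` is a placeholder group name. The `command -v` guards are only there so the snippet degrades gracefully when run outside HiPerGator.

```shell
#!/bin/bash
# Placeholder: replace with your HiPerGator group name.
GROUP_NAME="my_group"

check_gpu_allocation() {
    group="$1"
    # slurmInfo is provided by the "ufrc" module on HiPerGator.
    if command -v module >/dev/null 2>&1; then
        module load ufrc
    fi
    if command -v slurmInfo >/dev/null 2>&1; then
        slurmInfo -g "$group"
    else
        echo "slurmInfo not found: run this on a HiPerGator login node"
    fi
}

check_gpu_allocation "$GROUP_NAME"
```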

Researchers can add NGUs to their allocations by filling out the Purchase Form or requesting a Trial Allocation.

Requesting GPUs from the Scheduler¶

To request a GPU in a scheduled job or session, first decide on the type of GPU to request. There is an important difference between L4 GPUs and B200 GPUs. L4 GPUs are integrated into the main HPG4 nodes, which are hybrid compute nodes available for both CPU-only and GPU jobs. B200 GPUs are only available in a separate partition and should only be used for jobs requiring extreme performance and VRAM (GPU memory), such as large model training.

You must request at least one CPU core per GPU or the job will be rejected by the scheduler.

L4 GPUs: specify the GPU quantity with --gpus=NUMBER in a job script or OOD session form, alongside the same or a larger number of CPU cores. No other specifications are necessary.
B200 GPUs: specify --partition=hpg-b200 in addition to the GPU quantity, i.e. both --partition=hpg-b200 and either --gpus=NUMBER or --gres=gpu:NUMBER are required.
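The two rules above can be sketched as a small shell helper. `gpu_request_flags` is a hypothetical function written for illustration, not a HiPerGator tool, and expressing the core count with `--cpus-per-task` is an assumption (other ways of requesting cores exist). It only prints options and does not submit anything.

```shell
#!/bin/bash
# Hypothetical helper: print sbatch options for a GPU request.
# Usage: gpu_request_flags TYPE GPUS CPUS   (TYPE is "l4" or "b200")
gpu_request_flags() {
    gpu_type="$1"
    gpus="$2"
    cpus="$3"

    # At least one CPU core per GPU is required by the scheduler.
    if [ "$cpus" -lt "$gpus" ]; then
        echo "error: need at least $gpus CPU cores for $gpus GPUs" >&2
        return 1
    fi

    case "$gpu_type" in
        # L4: the GPU count is enough; the partition is chosen automatically.
        l4)   echo "--gpus=$gpus --cpus-per-task=$cpus" ;;
        # B200: the partition must be named explicitly.
        b200) echo "--partition=hpg-b200 --gpus=$gpus --cpus-per-task=$cpus" ;;
        *)    echo "error: unknown GPU type '$gpu_type'" >&2; return 1 ;;
    esac
}

gpu_request_flags l4 1 1
gpu_request_flags b200 2 4
```

The printed options could then be passed on the command line, for example `sbatch $(gpu_request_flags b200 2 4) job.sh`, where `job.sh` is a placeholder script name.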

Request format¶

--gpus=NUMBER

The above translates to e.g.

#SBATCH --gpus=1

for one GPU and so on.
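Put together, a minimal L4 job script might look like the following sketch. The job name, memory, and time values are illustrative assumptions, and `report_gpus` is a hypothetical helper. Slurm typically sets `CUDA_VISIBLE_DEVICES` for jobs that were allocated GPUs, which is what the script reports.

```shell
#!/bin/bash
#SBATCH --job-name=gpu-check       # hypothetical job name
#SBATCH --gpus=1                   # one GPU
#SBATCH --cpus-per-task=1          # at least one CPU core per GPU
#SBATCH --mem=8G                   # illustrative value
#SBATCH --time=00:10:00            # illustrative value

# Report which GPUs the job can see.
report_gpus() {
    echo "GPUs visible: ${CUDA_VISIBLE_DEVICES:-none}"
    if command -v nvidia-smi >/dev/null 2>&1; then
        # -L lists each GPU on its own line.
        nvidia-smi -L || echo "nvidia-smi could not query a GPU"
    fi
}

report_gpus
```

Saved as, say, `gpu_check.sh` (a placeholder name), it would be submitted with `sbatch gpu_check.sh`.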

GRES format can also be used both in job scripts and in OOD session forms.

--gres=gpu:NUMBER

or

gpu:NUMBER

in the GRES field of an OOD form.

Partitions¶

Partitions with GPUs:

hpg-b200: NVIDIA DGX B200 SuperPod.
hwgui: Hardware accelerated GPU partition for visualization applications with NVIDIA L4 cards.
hpg-turin: Regular HPG4 nodes integrated with 3 NVIDIA L4 cards per node. This partition does NOT need to be specified in a job script. It will be selected automatically if a GPU is requested.

GPU Hardware on HiPerGator¶

We have the following types of NVIDIA GPU nodes available:

Node Specifications	NVIDIA L4	NVIDIA B200
Host Quantity	200	31
Host Architecture	AMD TURIN	Intel Xeon 8570
Host Memory	753 GB	2 TB
Host Interconnect	ConnectX-7 IB	ConnectX-7 IB
CPU cores per Host	96	112
CPU cores per Socket	96	56
GPUs per Host	3	8
CPU cores per GPU	4 (reserved per GPU)	14
Memory per GPU	24 GB	180 GB
Slurm partition	hpg-turin,hwgui,gpu	hpg-b200
Slurm Feature	l4	b200
GRES GPU type	l4	b200
Technical Ref	Specifications	Specifications

Legend:

Slurm features are used in '--constraint' job specifications
GRES types are used in '--gres=gpu:TYPE:NUMBER' job specifications

For a list of additional node features, see the Available Node Features page.
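To see which features and GRES types the scheduler reports for the GPU partitions, something like the sketch below can be run on a login node. `show_gpu_nodes` is a hypothetical helper name, and the guard only exists so the snippet does not fail where Slurm is not installed.

```shell
#!/bin/bash
# List partition, node features, and GRES for the GPU partitions.
show_gpu_nodes() {
    if command -v sinfo >/dev/null 2>&1; then
        # %P = partition, %f = node features, %G = generic resources (GRES)
        sinfo --partition=hpg-turin,hwgui,hpg-b200 --format="%P %f %G"
    else
        echo "sinfo not found: run this on a HiPerGator login node"
    fi
}

show_gpu_nodes
```

The values in the features column are what `--constraint` accepts, and the type shown in the GRES column is what goes into `--gres=gpu:TYPE:NUMBER`.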

Open On Demand Access¶

Note

Interactive Open OnDemand jobs in the GPU partition are limited to 12 hours. Computational GPU jobs are limited to 14 days. Each GPU job requires at least one CPU core.

To access GPUs using Open OnDemand, you need to set the partition and a GRES (generic resource) with the number and (optionally) type of GPU.

Compiling CUDA Enabled Programs¶

The most direct way to develop a custom GPU-accelerated algorithm is CUDA programming; please refer to the NVIDIA CUDA Toolkit page. The current CUDA environment is cuda/12. Alternatively, GPU algorithms can be programmed in C++ or with Python packages such as Numba and PyCUDA.
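As a rough illustration of the compile step, the sketch below writes a small vector-addition program to a scratch directory, loads the cuda/12 module if `module` is available, and compiles it with `nvcc`. The file name `vec_add.cu` and the kernel are made up for this example, and it is meant to be run in a GPU session.

```shell
#!/bin/bash
# Write a tiny CUDA program (vector addition) to a scratch directory.
workdir="$(mktemp -d)"
cat > "$workdir/vec_add.cu" <<'EOF'
#include <cstdio>

__global__ void add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1024;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }
    add<<<(n + 255) / 256, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();
    printf("c[0] = %.1f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
EOF

# "module" only exists on the cluster, so guard the call.
if command -v module >/dev/null 2>&1; then
    module load cuda/12
fi

# Compile and run only when the CUDA compiler is present.
if command -v nvcc >/dev/null 2>&1; then
    nvcc -o "$workdir/vec_add" "$workdir/vec_add.cu" && "$workdir/vec_add"
else
    echo "nvcc not found: load the CUDA module in a GPU session first"
fi
```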

Conda Environments with GPU¶

To make sure your code will run on GPUs, install a recent cudatoolkit package that works with the NVIDIA drivers on HPG (currently 12.x, but older versions are still supported) alongside the pytorch or tensorflow package(s). See the RC-provided tensorflow or pytorch installs for examples if needed. Mamba can detect whether a GPU is present in the environment, so the easiest approach is to run the mamba install command in a GPU session.
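After the install, one way to confirm that a PyTorch environment can actually see a GPU is a quick check like the sketch below. It assumes the environment is already activated inside a GPU session, and `check_torch_gpu` is a hypothetical helper name. A TensorFlow environment would need a different check.

```shell
#!/bin/bash
# Run inside a GPU session with the conda environment activated.
check_torch_gpu() {
    if ! command -v python >/dev/null 2>&1; then
        echo "python not found: activate your environment first"
        return 0
    fi
    python - <<'EOF'
try:
    import torch
except ImportError:
    print("PyTorch is not installed in this environment")
else:
    # True only if a usable GPU and a matching driver are visible.
    print("CUDA available:", torch.cuda.is_available())
    # CUDA version PyTorch was built against (None for CPU-only builds).
    print("Built against CUDA:", torch.version.cuda)
EOF
}

check_torch_gpu
```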

You can also visit Conda for more information.

Slurm and GPU Use¶

View instructions for using GPUs and scheduling GPU jobs with SLURM at Slurm and GPU Use.

Hardware Accelerated GUI¶

GPUs in these servers are used to accelerate rendering for graphical applications. These servers are in the SLURM "hwgui" partition.

There are several preset applications available in the Open OnDemand drop-down list (e.g. Freeview, Unreal Engine). You can run additional GUI applications by starting a Console or HiPerGator Desktop session, loading the application module and running the application.

To do this:

Select the 'hwgui' partition for an Open OnDemand Console or HiPerGator Desktop Application. See Open OnDemand for details on using OOD.
Once connected to the session, open a terminal, load the appropriate environment module, and launch the application.

Apptainer Containers¶

Apptainer commands require the --nv flag to mount NVIDIA drivers, libraries, and devices. Because the container's Inter-Process Communication (IPC) namespace is isolated, multi-GPU jobs also need the GPU node's shared-memory path mounted in the container to take advantage of the available shared memory. Use the following flags to run Apptainer containers on GPU nodes:

--nv (all gpu jobs)

--ipc=host or --bind /dev/shm (multi-gpu B200 node jobs)
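As a sketch of how these flags combine, the helper below prints (rather than runs) an `apptainer exec` command line. `apptainer_gpu_cmd` and the image name `my_container.sif` are hypothetical, `nvidia-smi` stands in for whatever program the container should run, and only the `--bind /dev/shm` variant of the multi-GPU flag is shown.

```shell
#!/bin/bash
# Hypothetical image path; replace with your own .sif file.
IMAGE="my_container.sif"

# Usage: apptainer_gpu_cmd IMAGE MULTI_GPU   (MULTI_GPU is "yes" or "no")
apptainer_gpu_cmd() {
    image="$1"
    multi_gpu="$2"
    # --nv is needed for every GPU job.
    flags="--nv"
    # Multi-GPU B200 jobs also mount the host's shared memory.
    if [ "$multi_gpu" = "yes" ]; then
        flags="$flags --bind /dev/shm"
    fi
    echo "apptainer exec $flags $image nvidia-smi"
}

apptainer_gpu_cmd "$IMAGE" no
apptainer_gpu_cmd "$IMAGE" yes
```

Printing the command first makes it easy to review before pasting it into a job script.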