Idle GPU Policy and Guidance

To maintain efficiency and resource availability for all users, UFIT Research Computing policy prohibits allocating GPUs and then leaving them idle (see paragraph 2 of the Scheduler/Job policy).

Automated systems are now in place to terminate jobs whose B200 GPU utilization over an hour has been 0%. While we recognize this may require some effort to adjust workflows, the B200 GPUs are simply too valuable and scarce to sit idle for extended periods.

FAQ

I have a workflow that needs a GPU for part of the time, but has other parts that do not use the GPU. How can I run this on HiPerGator given this policy?

We suggest splitting such workflows into a chain of separate jobs, with GPU processing in one job and CPU-only processing in another, so that each job starts only after the previous one in the chain finishes. This can be automated using the --dependency option of Slurm's sbatch command.
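As a rough illustration of such a chain, the Python sketch below submits a list of job scripts in order and makes each one depend on the job before it. It relies on two standard sbatch options, --parsable (print only the job ID) and --dependency=afterok:<jobid> (start only if the earlier job completed successfully). The function names and the script names in the usage note are hypothetical, and a real workflow may want a different dependency type such as afterany.

```python
import subprocess
from typing import Callable, List, Optional


def build_sbatch_command(script: str, depends_on: Optional[str] = None) -> List[str]:
    """Return the sbatch argument list for one job in the chain."""
    cmd = ["sbatch", "--parsable"]  # --parsable makes sbatch print just the job ID
    if depends_on is not None:
        # afterok: start only if the earlier job finished with exit code 0
        cmd.append(f"--dependency=afterok:{depends_on}")
    cmd.append(script)
    return cmd


def run_sbatch(cmd: List[str]) -> str:
    """Submit a job with sbatch and return the job ID it prints."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    # With --parsable the output is "jobid" or "jobid;cluster"
    return result.stdout.strip().split(";")[0]


def submit_chain(
    scripts: List[str],
    submit: Callable[[List[str]], str] = run_sbatch,
) -> List[str]:
    """Submit scripts in order, each depending on the one before it."""
    job_ids: List[str] = []
    previous: Optional[str] = None
    for script in scripts:
        previous = submit(build_sbatch_command(script, previous))
        job_ids.append(previous)
    return job_ids
```

For example, submit_chain(["prep_cpu.sh", "train_gpu.sh", "post_cpu.sh"]) would queue all three (hypothetical) scripts at once, but only the middle job would request a GPU, and it would hold that GPU only while the GPU step is actually running.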

I work interactively. How can I prevent my job from being canceled?

We suggest that interactive users cancel jobs/sessions when they are not working and request a new job/session when they return.

How can I get help managing GPU use better?

Please open a support request, and we will gladly assist you.

Does this policy apply to RTX 6000 Pro or L4 GPUs?

Yes, though at this time, we are not canceling jobs on those GPUs.

How can I monitor GPU utilization?

The jobnvtop command can be used (see here).

How does this apply to multi-GPU jobs?

If any GPU allocated to a job is idle, the job will be terminated.
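To spot an idle GPU in a multi-GPU job before the automated system does, a quick self-check can help. The sketch below is only an illustration, not the mechanism Research Computing uses: it takes a single sample from nvidia-smi (using its standard --query-gpu and --format options) and lists the GPUs reporting 0% utilization, whereas the policy looks at utilization over an hour. It assumes it is run on the compute node inside the job, where nvidia-smi is available; the function names are hypothetical.

```python
import subprocess
from typing import Dict, List

# Ask nvidia-smi for "index, utilization" rows with no header or units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_utilization(csv_text: str) -> Dict[int, int]:
    """Parse rows such as '0, 87' into {gpu_index: utilization_percent}."""
    usage: Dict[int, int] = {}
    for line in csv_text.strip().splitlines():
        index, util = (field.strip() for field in line.split(","))
        if not util.isdigit():
            continue  # skip GPUs that do not report a numeric utilization
        usage[int(index)] = int(util)
    return usage


def idle_gpus(usage: Dict[int, int]) -> List[int]:
    """Return the indices of GPUs whose sampled utilization is 0%."""
    return sorted(index for index, util in usage.items() if util == 0)


def sample_utilization() -> Dict[int, int]:
    """Take one utilization sample from nvidia-smi on the current node."""
    out = subprocess.run(QUERY, check=True, capture_output=True, text=True).stdout
    return parse_utilization(out)
```

Calling idle_gpus(sample_utilization()) from within a job returns the indices of GPUs that were idle at that instant. A GPU that shows up repeatedly across samples suggests the job is requesting more GPUs than it uses.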