New User Training
This page mirrors and expands upon the content provided in the HiPerGator User Training in Canvas.
HiPerGator User Training Registration
The HiPerGator User Training is offered through UF’s Professional and Workforce Development site.
REGISTER HERE: https://go.ufl.edu/hpg-training.
The course is free.
Full registration details
After navigating to https://go.ufl.edu/hpg-training:
- Click the + to expand the only section of the course, and click “Add to Cart.”
- In your cart, click “Checkout”
- When signing in, users with a GatorLink account should sign in with their GatorLink. Federated users may select from the other options to sign in.
- After completing the checkout process (no payment will be required), users will get two confirmation emails from Matt Gitzendanner (magitz@ufl.edu).
For UF users, the course should be available in your Canvas dashboard. All users can also follow the information below:
- Follow this link: https://elearning.ufl.edu/
- Click the blue "Log in to continuing education" button
- Log in with the same account you used to register.
A more in-depth guide on logging in: https://pwd.aa.ufl.edu/wp-content/uploads/2023/11/AccessingYourCourse-copy.pdf
Taking the Course is Required
While this page mirrors the content, to get credit for taking the training, you must complete the training and pass the final quiz in the Canvas course.
Training Objectives¶
- Recognize the role of UFIT Research Computing.
- Understand Research Computing's investment model for resource allocation.
- Understand different mechanisms to access HiPerGator.
- Describe the appropriate use of the login servers and how to request resources for work beyond those limits.
- Understand the primary resources tracked by the SLURM scheduler.
- Understand how to log in to HiPerGator using an SSH client.
- Understand several common mistakes HiPerGator users make and how to avoid them.
Module 1: Introduction to Research Computing and HiPerGator¶
HiPerGator¶
- About 70,000 cores
- Hundreds of GPUs
- 10 Petabytes of storage
- The HiPerGator AI cluster has:
  - 1,120 NVIDIA A100 GPUs
  - 17,000 AMD EPYC Rome cores
HiPerGator Updates in 2025
As announced on December 13, 2024, the University of Florida Board of Trustees approved investing $24 million to acquire a more advanced version of the HiPerGator supercomputer. This exciting upgrade will include deploying a 63-node NVIDIA DGX B200 SuperPOD, replacing the A100 servers that made up the original HiPerGator AI. To prepare for the upgrade, 60 of the 140 DGX A100 nodes were returned before the holiday break. That means we currently have only about 640 A100 GPUs instead of the 1,120 we have had since 2020.
UFIT Research Computing staff are finalizing the setup of temporary, cloud-based services on the NVIDIA DGX Cloud to provide GPU access to smaller jobs and teaching resources during the transition. This will allow the remaining A100s to be prioritized for research projects until the B200s are available. More details will be announced as they become available.
Components for the DGX SuperPOD will start arriving later this month, and we expect some early access will be available around April. As the new SuperPOD comes online, the remaining A100s will be returned.
In addition to the NVIDIA DGX B200 SuperPOD, the funds approved by the Board of Trustees include the purchase of HiPerGator 4 with 19,200 CPU cores and 600 NVIDIA L4 GPUs, which support both visualization and computation workflows, and a new 11.88 PB, all-flash, Blue storage system. UFIT Research Computing staff will be working hard to take delivery, install, test, and deploy these systems as quickly and efficiently as possible.
Growing HiPerGator will undoubtedly have some bumps, but we are confident that the new resources made possible by this investment will continue to provide world-class research infrastructure for the University of Florida.
The Cluster History page has updated information on the current and historical hardware at Research Computing.
Summary
HiPerGator is a large, high-performance compute cluster capable of tackling some of the largest computational challenges, but users need to understand how to responsibly and efficiently use the resources.
Investor Supported¶
HiPerGator is heavily subsidized by the university, but researchers must invest to gain access to resources. Research Computing sells three main products:
- Compute: NCUs (Normalized Compute Units)
  - 1 CPU core and 7.8 GB of RAM (as of Jan 2021)
- Storage:
  - Blue: High-performance storage for most data during analyses. Filesystem: /blue
  - Orange: Intended for long-term storage, raw data, and archival use. Not intended for regular job I/O. Filesystem: /orange
- GPUs: NGUs (Normalized Graphics Units)
  - Sold in units of GPU cards
  - NCU investment is required to make use of GPU(s)
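As a rough illustration of the NCU definition above, the sketch below converts a number of NCUs into the cores and memory they represent. It assumes the allocation scales linearly at 1 core and 7.8 GB of RAM per NCU; the function name `ncu_resources` and the example value of 10 NCUs are hypothetical, not part of any HiPerGator tool.

```shell
# Hypothetical helper: convert an NCU count into cores and RAM,
# assuming linear scaling at 1 core and 7.8 GB of RAM per NCU.
ncu_resources() {
    awk -v n="$1" 'BEGIN { printf "cores=%d mem_gb=%.1f\n", n, n * 7.8 }'
}

ncu_resources 10   # prints: cores=10 mem_gb=78.0
```

Use the `slurmInfo` command on HiPerGator for your group's actual investment and limits.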
Investments fall into two categories: hardware investments, which last for a fixed 5-year term and carry no IDC (indirect costs), and service investments, which have flexible terms of 3 months or longer but do carry IDC.
- HiPerGator Price sheets
- Submit a purchase request for:
  - Hardware (5 years)
  - Services (3 months or longer)
- Full explanation of services offered by Research Computing
Module 2: Accessing HiPerGator and Running Jobs¶
Cluster Components¶
Accessing HiPerGator¶
- Connecting to HiPerGator:
- SSH from MacOS
- SSH from Windows
- JupyterHub
- Galaxy
- Open on Demand
Proper use of Login Nodes¶
- Generally speaking, interactive work other than managing jobs and data is discouraged on the login nodes.
- Login nodes are intended for file and job management, and short-duration testing and development. See more information here
Login server acceptable use limits
- No more than 16 cores
- No longer than 10 minutes (wall time)
- No more than 64 GB of RAM
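The limits above can be expressed as a quick self-check before running something on a login node. This is only a sketch: `fits_login_limits` is a hypothetical helper, it takes whole numbers (cores, minutes of wall time, GB of RAM), and it does not measure anything itself.

```shell
# Hypothetical helper: returns success (0) only when a planned task
# stays within the login-server limits listed above.
fits_login_limits() {
    cores="$1"; minutes="$2"; mem_gb="$3"
    [ "$cores" -le 16 ] && [ "$minutes" -le 10 ] && [ "$mem_gb" -le 64 ]
}

if fits_login_limits 4 5 8; then
    echo "OK for a login node"
else
    echo "Request scheduled resources instead"
fi
```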
Resources for Scheduling a Job¶
For work beyond what is acceptable on the login servers, you can request resources on development servers or GPU servers, use JupyterHub, Galaxy, or graphical user interface servers via Open on Demand, or submit batch jobs. All of these services work with the scheduler to allocate your requested resources so that your computations run efficiently and do not impact other users.
Scheduling a Job¶
- Understand the resources that your analysis will use:
  - CPUs: Can your job use multiple CPU cores? Does it scale?
  - Memory: How much RAM will it use? Requesting more will not make your job run faster!
  - GPUs: Does your application use GPUs?
  - Time: How long will it run?
- Request those resources:
  - Sample SLURM Scripts
  - Watch the HiPerGator: SLURM Submission Scripts training video. This video is approximately 35 minutes and includes a demonstration.
  - Watch the HiPerGator: SLURM Submission Scripts for MPI Jobs training video. This video is approximately 25 minutes and includes a demonstration.
  - Open on Demand, JupyterHub and Galaxy all have other mechanisms to request resources, as SLURM needs this information to schedule your job.
- Submit the job:
  - Either use the `sbatch JOB_SCRIPT` command or submit through one of the interfaces
  - Once your job is submitted, SLURM will check that there are resources available in your group and schedule the job to run
- Run:
  - SLURM will work through the queue and run your job
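The steps above can be sketched as a minimal submit script. The job name, resource values, and log file name are placeholder assumptions chosen for illustration; size them to what your analysis actually needs, and see the Sample SLURM Scripts for site-recommended templates.

```shell
#!/bin/bash
#SBATCH --job-name=example_job      # hypothetical job name
#SBATCH --ntasks=1                  # one task (process)
#SBATCH --cpus-per-task=4           # CPU cores for that task
#SBATCH --mem=8G                    # memory for the job
#SBATCH --time=01:00:00             # wall-time limit (HH:MM:SS)
#SBATCH --output=example_job_%j.log # log file; %j expands to the job ID

# SLURM sets SLURM_CPUS_PER_TASK inside a job; default to 1 elsewhere.
THREADS="${SLURM_CPUS_PER_TASK:-1}"

echo "Running on $(uname -n) with ${THREADS} CPU core(s)"
date
```

Saved as, for example, `example_job.sh` (a hypothetical file name), the script would be submitted with `sbatch example_job.sh`.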
Locations for Storage¶
The storage systems are reviewed on the storage page.
Note
In the examples below, the text in all-caps (e.g. `USER`) indicates example text for user-specific information (e.g. `/home/albertgator`).
- Home Storage: `/home/USER`
  - Each user has 40 GB of space
  - Good for scripts, code and compiled applications
  - Do not use for job input/output
  - Snapshots are available for seven days
- Blue Storage: `/blue/GROUP`
  - Our highest-performance filesystem
  - All input/output from jobs should go here
- Orange Storage: `/orange/GROUP`
  - Slower than /blue
  - Not intended for large I/O for jobs
  - Primarily for archival purposes
- Red Storage: `/red/GROUP`
  - All-flash parallel filesystem primarily for HiPerGator AI
  - Space allocated based on need
  - Scratch filesystem with data regularly deleted
Backup and Quotas¶
- The Data Protection page describes the tape backup options and costs.
- Orange and Blue storage quotas are at the group level and based on investment.
  - See the price sheets
  - Submit a purchase request
- The `blue_quota` and `orange_quota` commands will show your group's current quota and use.
- The `home_quota` command will show your home directory quota and use.
Directories¶
Directory Automounting
- Directories on the Orange and Blue filesystems are automounted--they are only added when accessed.
- Your group's directory may not appear until you access it.
- Do not remove files necessary for your account to function normally, such as the ~/.ssh and /home directories.
As you can see in the gif above, an `ls` in `/blue` does not show the group directory for `ufhpc`. After a `cd` into the directory and back out, it does show. This demonstrates how the directory is automounted on access. Of course, it would be easier to just go directly without stopping in `/blue` first.
Remember
- Directories may not show up in an `ls` of `/blue` or `/orange`
- If you `cd` to `/blue` and type `ls`, you will likely not see your group directory. You also cannot tab-complete the path to your group's directory. However, if you add the name of the group directory (e.g. `cd /blue/<group>/`), the directory becomes available and tab-completion functions.
- Of course, there is no need to change directories one step at a time: `cd /blue/<group>` will get there in one command.
- You may need to type the path in SFTP clients or Globus for your group directory to appear
- You cannot always use tab completion in the shell
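A minimal sketch of the behavior described above, using the `ufhpc` group from the example as a stand-in for your own group. The variable names are hypothetical, and the fallback branch is there only so the snippet does not fail on a machine where `/blue` does not exist.

```shell
GROUP="ufhpc"                  # example group from the text; substitute your own
GROUP_DIR="/blue/${GROUP}"

# Go straight to the full path: accessing it triggers the automount,
# even if the directory was not visible in a listing of /blue.
if cd "${GROUP_DIR}" 2>/dev/null; then
    echo "Now in $(pwd)"
else
    echo "${GROUP_DIR} is not available on this machine"
fi
```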
Environment Modules System¶
- HiPerGator uses the Lmod Environment Modules System to hide application installation complexity and make it easy to use the installed applications.
- For applications, compilers or interpreters, load the corresponding module
- See a list of all Installed Applications
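A typical Lmod session might look like the sketch below. The module name `python` is an assumption used for illustration; check the list of Installed Applications for the names actually available. The guard only keeps the snippet from failing where the `module` command is not defined.

```shell
APP="python"   # hypothetical module name; check Installed Applications

if command -v module >/dev/null 2>&1; then
    module spider "$APP"   # search for available versions of the module
    module load "$APP"     # add the application to your environment
    module list            # show the modules currently loaded
else
    echo "Lmod 'module' command not found; run this on HiPerGator"
fi
```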
Module 3: Common Mistakes and Getting Support¶
Common Mistakes¶
- Running resource-intensive applications on the login nodes
  - Submit a batch job: see SLURM Scheduler
  - Request a development session for testing and development
- Using IDE SSH connections
  - VSCode, Spyder, and PyCharm all have SSH extensions. These will connect you to a login server, not a job running on HiPerGator.
  - Running applications using these often violates the acceptable use of login nodes
  - Use the VSCode Remote Tunnel extension as documented here.
- Writing to `/home` or `/orange` during batch job execution
  - Use `/blue/GROUP` for job input/output. See practical storage use.
- Wasting resources
  - Understand CPU and memory needs of your application
  - Over-requesting resources generally does not make your application run faster--it prevents other users from accessing resources.
- Blindly copying scripts from colleagues
  - Make sure you understand what borrowed scripts do
  - Many users copy previous lab mate scripts, but do not understand the details
  - This often leads to wasted resources
- Misunderstanding the group investment limits and the burst QOS
  - Each group has specific limits
  - Burst jobs are run as idle resources are available
  - The `slurmInfo` command can show your group's investment and current use. See qos limits for more information.
- When using our Python environment modules, attempting to install new Python packages may not work because incompatible packages may get installed into the `~/.local` folder and result in errors at run time. If you need to install packages, create a personal or project-specific Conda environment or request the addition of new packages in existing environment modules via the RC Support System.
See the FAQ page for these and more hints, answers, and potential pitfalls that you may want to avoid.
How to get Help¶
See the Get Help page for more details. In general,
- Submit support requests via the UFRC Support System
- For problems with running jobs, provide:
  - JobID number(s)
  - Filesystem path(s) to job scripts and job logs
  - As much detailed information as you can about your problem
- For requests to install an application, provide:
  - Name of application
  - URL to download the application
Additional information for students using the cluster for courses¶
- The instructor will provide a list of students enrolled in the course.
- Course group names will be in the format `pre1234`, with the 3-letter department prefix and 4-digit course code.
  - All sections of a course will typically use the same group name, so the name may be different from the specific course you are enrolled in.
  - In the documentation below, substitute the generic `pre1234` with your particular group name.
- Students who do not have a HiPerGator account will have one created for them.
- Students who already have a HiPerGator account will be added to the course group (as a secondary group).
  - You can create a folder in the class's `/blue/pre1234` folder with your GatorLink username.
  - To use the resources of the class rather than your primary group, use the `--account` and `--qos` flags in the submit script, in the `sbatch` command, or in the boxes in the Open on Demand interface.
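As a sketch of how the `--account` and `--qos` flags might be passed on the command line: the script name `my_job.sh` is hypothetical, and the QOS value is an assumption that the course QOS shares the course group's name; confirm the correct value with your instructor. The snippet only builds and prints the command so it can be inspected before running it on HiPerGator.

```shell
COURSE_GROUP="pre1234"   # generic course group from the text
JOB_SCRIPT="my_job.sh"   # hypothetical submit script

# Assumption: the course QOS has the same name as the course group.
SUBMIT_CMD="sbatch --account=${COURSE_GROUP} --qos=${COURSE_GROUP} ${JOB_SCRIPT}"
echo "${SUBMIT_CMD}"

# Equivalent lines inside the submit script itself:
#   #SBATCH --account=pre1234
#   #SBATCH --qos=pre1234
```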
JupyterHub Primary Group Only
JupyterHub can only use your primary group's resources and cannot be used for accessing secondary group resources. To use Jupyter using your secondary group, please use Open on Demand.
- Using your account implies agreeing to the Acceptable Use Policy.
- Students understand that no restricted data should be used on HiPerGator.
- Classes are typically allocated 32 cores, 112 GB of RAM, and 2 TB of storage.
  - Instructors should keep this in mind when designing exercises and assignments.
  - Students should understand that these are shared resources: use them efficiently, share them fairly and know that if everyone waits until the last minute, there may not be enough resources to run all jobs.
  - All storage should be used for research and coursework only.
- Accounts created for the class and the contents of the class `/blue/pre1234` folder will be deleted at the end of the semester. Please copy anything you want to keep off of the cluster before the end of the semester.
- Students should consult with their professor or TA rather than opening a support request.
  - Only the professor or TA should open support requests if needed.