This guide walks you through setting up an environment for the class project / competition on the HPC. We show you how to train a simple GCN on a small dataset, and you can use this code as a template for your own custom models.
We assume you're on the education_gpu partition and have confirmed the presence of a GPU via nvidia-smi.
uv is an ultra-fast package manager. It is similar to conda, in that it helps you isolate and manage your packages and dependencies while sorting out versioning issues, but it's much lighter. Yale's HPC doesn't come with uv pre-installed, so you need to install it with the official install script:
curl -LsSf https://astral.sh/uv/install.sh | sh
Open a terminal instance to set up your environment:
uv venv my_env
source my_env/bin/activate
uv pip install torch==2.5.0 torchvision --index-url "https://download.pytorch.org/whl/cu121"
uv pip install torch-geometric
uv pip install \
"https://data.pyg.org/whl/torch-2.5.0+cu121/torch_scatter-2.1.2+pt25cu121-cp312-cp312-linux_x86_64.whl" \
"https://data.pyg.org/whl/torch-2.5.0+cu121/torch_cluster-1.6.3+pt25cu121-cp312-cp312-linux_x86_64.whl"
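The cp312 tag in those wheel filenames means they are built for CPython 3.12, so the pinned wheels only install cleanly if your environment's interpreter matches. A quick check, run inside the activated environment:

```shell
# The cp312 tag in the wheel names above requires CPython 3.12.
# Confirm the environment's interpreter version matches:
python3 --version
```

If the major.minor version differs, pick the matching wheels from data.pyg.org instead.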
uv pip install ogb jupyter matplotlib ipykernel

Now, move to a new notebook. We're going to ensure the notebook can access your new environment.
!my_env/bin/python -m ipykernel install --user --name my_env --display-name "my_env (uv)"

This registers the environment permanently to your account; you only need to do this once.
- Refresh the page
- Go to Kernel → Change Kernel
- Select "my_env (uv)" from the dropdown
- Click Kernel → Restart Kernel
Run this to confirm you're using the right environment:
import sys
print(sys.executable)
# Expected: .../my_env/bin/python

Then verify all packages are working:
import torch, torch_geometric, torch_scatter, torch_cluster
print(torch.__version__, torch_geometric.__version__, torch.cuda.is_available())
# Expected: 2.5.0+cu121 2.7.0 True

- Steps 1 and 2 only need to be done once; after that, "my_env (uv)" will always appear in the kernel dropdown for any new notebook.
- If "my_env (uv)" doesn't appear in the dropdown after refreshing, re-run Step 2.
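With the environment verified, here's the kind of simple GCN training loop mentioned at the start, which you can adapt as a template. This is a minimal sketch that implements the symmetrically normalized GCN propagation directly in plain PyTorch on a tiny synthetic graph, so it runs without downloading anything; for the project you'd swap in torch_geometric's GCNConv layers and a real dataset (e.g. from ogb, which you installed above).

```python
import torch
import torch.nn.functional as F

def gcn_norm(edge_index, num_nodes):
    # Dense D^{-1/2} (A + I) D^{-1/2}; fine for small graphs
    # (torch_geometric's GCNConv does the sparse equivalent).
    A = torch.zeros(num_nodes, num_nodes)
    A[edge_index[0], edge_index[1]] = 1.0
    A = ((A + A.t()) > 0).float() + torch.eye(num_nodes)  # symmetrize, add self-loops
    d_inv_sqrt = A.sum(dim=1).pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * A * d_inv_sqrt.unsqueeze(0)

class GCN(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim, num_classes):
        super().__init__()
        self.lin1 = torch.nn.Linear(in_dim, hidden_dim)
        self.lin2 = torch.nn.Linear(hidden_dim, num_classes)

    def forward(self, x, A_norm):
        x = F.relu(A_norm @ self.lin1(x))   # propagate, then transform
        return A_norm @ self.lin2(x)

# Tiny synthetic graph: 6 nodes, 4 features, 2 classes.
torch.manual_seed(0)
edge_index = torch.tensor([[0, 1, 2, 3, 4], [1, 2, 3, 4, 5]])
x = torch.randn(6, 4)
y = torch.tensor([0, 0, 0, 1, 1, 1])
A_norm = gcn_norm(edge_index, num_nodes=6)

model = GCN(in_dim=4, hidden_dim=8, num_classes=2)
opt = torch.optim.Adam(model.parameters(), lr=0.01)
for epoch in range(100):
    opt.zero_grad()
    loss = F.cross_entropy(model(x, A_norm), y)
    loss.backward()
    opt.step()
print(f"final training loss: {loss.item():.4f}")
```

On a GPU node you'd move the model and tensors to cuda with .to(device), exactly as with any other PyTorch model.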
Suppose you have a Python script, say my_demo_script.py, that you want to run on a GPU for a long duration. Notebooks won't cut it; you'll need to submit a SLURM job instead. Exit any instance you launched on the education_gpu partition and create an instance on the education partition instead.
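Such a script might look like the following minimal sketch (the filename my_demo_script.py matches the job file below; in practice you'd replace the body with your real training code). It just confirms which node the job landed on and exercises the GPU if one is visible:

```python
# my_demo_script.py: a minimal sanity-check job (a sketch; replace with
# your real training code). Reports where the job is running and does a
# small computation on the GPU when one is available.
import socket

import torch

print(f"Host: {socket.gethostname()}")
print(f"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.randn(1024, 1024, device=device)
y = x @ x  # a matmul big enough to register in nvidia-smi
print(f"Matmul on {device}: result shape {tuple(y.shape)}")
```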
You'll need a bash file (ending in .sh) with the job details. These files have header text that tells the cluster how many GPUs you're requesting, how long your job should run, and where to store any logging or error files. This is followed by environment initialization and the Python script you want to run.
#!/bin/bash
#SBATCH --job-name=hpc_demo
#SBATCH --output=hpc_demo-%j.log
#SBATCH --error=hpc_demo-%j.err
#SBATCH --time=00:10:00
#SBATCH --ntasks=1
#SBATCH --partition=education_gpu
#SBATCH --gres=gpu:1
cd ... # where your project is
# activate the env
source my_env/bin/activate
# run the program
python my_demo_script.py

Save the training script and job file, then submit:
sbatch job.sh

Monitor your job:
squeue -u $USER

You'll get a table with relevant information about your job (the ST column is the job state: R means running, PD pending):
JOBID PARTITION NAME USER ACCOUNT ST SUBMIT_TIME TIME TIME_LIMIT NODES CPUS MIN_MEMORY TRES_PER_N FEATURES PRIORITY NODELIST(REASON)
5119655 education ood-jupy cpsc4830 cpsc4830 R 2026-02-22T19:12 8:32 1:00:00 1 1 5G N/A (null) 0.00001806 a1132u05n01
5119757 education hpc_demo cpsc4830 cpsc4830 PD 2026-02-22T19:20 0:00 10:00 1 1 5G gres/gpu:1 (null) 0.00001803 (Resources)
Once the job is done, you can view the logging and error files it created (hpc_demo-<jobid>.log and hpc_demo-<jobid>.err, per the #SBATCH lines above, where %j expands to the job ID). If your run fails or crashes, the error file will contain the traceback you need to debug. Make the relevant fixes and submit another job. To cancel a queued or running job early, use scancel <jobid>.
Remember, you have to be on the education partition to request a GPU on the education_gpu partition. If you're on the latter and submit a job via sbatch, it's akin to requesting two GPUs, which the current HPC setup doesn't allow for fairness reasons.