@rish-16
Last active February 23, 2026 00:23
CPSC 4830 (S26) HPC Setup Demo

This guide walks you through setting up an environment for the class project / competition on the HPC. We show you how to train a simple GCN on a small dataset, and you can use this code as a template for your own custom models.
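To build some intuition for the model you'll train, here is a dependency-free sketch of what a single GCN layer computes: H' = ReLU(D̂^(-1/2) Â D̂^(-1/2) H W) with Â = A + I, the standard update from Kipf & Welling. In the actual project you'd use `torch_geometric.nn.GCNConv`; this is just the math written out in plain Python.

```python
def gcn_layer(adj, feats, weight):
    """One GCN propagation step on dense Python lists (no torch needed)."""
    n = len(adj)
    # add self-loops: A_hat = A + I
    a_hat = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    # symmetric normalisation: D^{-1/2} A_hat D^{-1/2}
    norm = [[a_hat[i][j] / (deg[i] ** 0.5 * deg[j] ** 0.5) for j in range(n)]
            for i in range(n)]
    # aggregate neighbour features: norm @ feats
    agg = [[sum(norm[i][k] * feats[k][f] for k in range(n))
            for f in range(len(feats[0]))] for i in range(n)]
    # linear transform + ReLU: ReLU(agg @ W)
    return [[max(0.0, sum(agg[i][f] * weight[f][o] for f in range(len(weight))))
             for o in range(len(weight[0]))] for i in range(n)]

# two nodes joined by one edge, scalar features, identity weight:
# each node ends up averaging its own and its neighbour's feature (≈ 2.0 each)
print(gcn_layer([[0, 1], [1, 0]], [[1.0], [3.0]], [[1.0]]))
```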

We assume you're on the education_gpu partition and have confirmed the presence of a GPU via nvidia-smi.


Step 0: Install uv

uv is an ultra-fast package manager. Like conda, it isolates and manages your packages and dependencies while sorting out versioning issues, but it's much lighter-weight. Yale's cluster doesn't come with uv pre-installed, so you need to install it yourself:

curl -LsSf https://astral.sh/uv/install.sh | sh

Step 1: Create the Environment and Install Packages

Open a terminal instance to set up your environment:

uv venv my_env
source my_env/bin/activate
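If you later want to confirm from inside Python that the interpreter you're running really belongs to the venv, the stdlib exposes this directly (inside a venv, `sys.prefix` differs from `sys.base_prefix`):

```python
import sys

def in_virtualenv() -> bool:
    # inside a venv, sys.prefix points at the env directory while
    # sys.base_prefix still points at the base interpreter
    return sys.prefix != sys.base_prefix

print(in_virtualenv())  # True when run from an activated venv
```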

Then install the minimal libraries:

uv pip install torch==2.5.0 torchvision --index-url "https://download.pytorch.org/whl/cu121"
uv pip install torch-geometric
uv pip install \
  "https://data.pyg.org/whl/torch-2.5.0+cu121/torch_scatter-2.1.2+pt25cu121-cp312-cp312-linux_x86_64.whl" \
  "https://data.pyg.org/whl/torch-2.5.0+cu121/torch_cluster-1.6.3+pt25cu121-cp312-cp312-linux_x86_64.whl"
uv pip install ogb jupyter matplotlib ipykernel

Now, move to a new notebook. We're going to ensure the notebook can access your new environment.

Step 2: Register the Kernel

!my_env/bin/python -m ipykernel install --user --name my_env --display-name "my_env (uv)"

This registers the environment permanently to your account — you only need to do this once.


Step 3: Select the Kernel

  1. Refresh the page
  2. Go to Kernel → Change Kernel
  3. Select "my_env (uv)" from the dropdown
  4. Click Kernel → Restart Kernel

Step 4: Verify

Run this to confirm you're using the right environment:

import sys
print(sys.executable)
# Expected: .../my_env/bin/python

Then verify all packages are working:

import torch, torch_geometric, torch_scatter, torch_cluster
print(torch.__version__, torch_geometric.__version__, torch.cuda.is_available())
# Expected: 2.5.0+cu121 2.7.0 True

Notes

  • Steps 1 and 2 only need to be done once — after that, "my_env (uv)" will always appear in the kernel dropdown for any new notebook.
  • If "my_env (uv)" doesn't appear in the dropdown after refreshing, re-run Step 2.

Running via SLURM

Suppose you have a Python script you want to run on a GPU for a long duration. Notebooks won't cut it; you'll need to submit a SLURM job instead. Exit any instance you launched on the education_gpu partition and create an instance on the education partition instead.

You'll need a bash file (ending in .sh) with the job details. These files start with #SBATCH header directives that tell the cluster how many GPUs you're requesting, how long your job should run, and where to store any logging or error files. This is followed by environment initialization and the Python script you want to run.

#!/bin/bash
#SBATCH --job-name=hpc_demo
#SBATCH --output=hpc_demo-%j.log
#SBATCH --error=hpc_demo-%j.err
#SBATCH --time=00:10:00
#SBATCH --ntasks=1
#SBATCH --partition=education_gpu
#SBATCH --gres=gpu:1

cd ... # where your project is

# activate the env
source my_env/bin/activate

# run the program
python my_demo_script.py
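The job script above runs my_demo_script.py. As a sanity check, a hypothetical first version of that script might just log the job metadata before any real training code — SLURM_JOB_ID and SLURM_JOB_NODELIST are standard variables SLURM exports into every job's environment:

```python
import os
import socket

def slurm_context(env=os.environ):
    # SLURM exports job metadata into the environment of every job step
    return {
        "job_id": env.get("SLURM_JOB_ID"),
        "nodelist": env.get("SLURM_JOB_NODELIST"),
        "hostname": socket.gethostname(),
    }

if __name__ == "__main__":
    ctx = slurm_context()
    print(f"running job {ctx['job_id']} on {ctx['hostname']} ({ctx['nodelist']})")
    # ... training code goes here ...
```

Whatever this prints lands in hpc_demo-%j.log, which is a quick way to confirm the job actually reached your code.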

Save the training script and job file, then submit:

sbatch job.sh

Monitor your job:

squeue -u $USER

You'll get a table with relevant information about your job:

            JOBID PARTITION     NAME     USER   ACCOUNT ST      SUBMIT_TIME       TIME TIME_LIMIT NODES  CPUS MIN_MEMORY TRES_PER_N   FEATURES   PRIORITY NODELIST(REASON)
           5119655 education ood-jupy cpsc4830  cpsc4830  R 2026-02-22T19:12       8:32    1:00:00     1     1         5G        N/A     (null) 0.00001806 a1132u05n01
           5119757 education hpc_demo cpsc4830  cpsc4830 PD 2026-02-22T19:20       0:00      10:00     1     1         5G gres/gpu:1     (null) 0.00001803 (Resources)
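If you'd rather poll your jobs from Python than re-run squeue by hand, a minimal sketch is to shell out with an explicit format string — `-h` drops the header, `%i` is the job id, and `%t` is the compact state code (R, PD, ...):

```python
import subprocess

def parse_states(text):
    # each line is "<jobid> <state-code>", e.g. "5119757 PD"
    for line in text.strip().splitlines():
        job_id, state = line.split()
        yield job_id, state

def job_states(user):
    # ask squeue for "<jobid> <state>" pairs with no header row
    out = subprocess.run(
        ["squeue", "-u", user, "-h", "-o", "%i %t"],
        capture_output=True, text=True, check=True,
    ).stdout
    return dict(parse_states(out))
```

For the queue shown above, job_states would return something like {"5119655": "R", "5119757": "PD"}.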

Once the job is done, you can view the log and error files it created. If your run fails or crashes, the error file will contain the information you need to debug. Make the relevant fixes and submit another job.

Remember, you have to be on the education partition to submit a job that requests a GPU on the education_gpu partition. If you're on the latter and submit a job via sbatch, it's akin to requesting two GPUs at once, which the current HPC setup doesn't allow for fairness reasons.
