This guide walks you through setting up an environment for the class project / competition on the HPC. We show you how to train a simple GCN on a small dataset, and you can use this code as a template for your own custom models.
We assume you're on the education_gpu partition and have confirmed the presence of a GPU via nvidia-smi.
uv is an ultra-fast package manager. It is similar to conda, in that it helps you isolate and manage your packages and dependencies while sorting out versioning issues, but it's much lighter. Yale's HPC doesn't come with uv pre-installed, so you need to install it with the official install script:
curl -LsSf https://astral.sh/uv/install.sh | sh
Open a terminal instance to set up your environment:
uv venv my_env
source my_env/bin/activate
uv pip install torch==2.5.0 torchvision --index-url "https://download.pytorch.org/whl/cu121"
uv pip install torch-geometric
uv pip install \
"https://data.pyg.org/whl/torch-2.5.0+cu121/torch_scatter-2.1.2+pt25cu121-cp312-cp312-linux_x86_64.whl" \
"https://data.pyg.org/whl/torch-2.5.0+cu121/torch_cluster-1.6.3+pt25cu121-cp312-cp312-linux_x86_64.whl"
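The cp312 tag in those wheel filenames means they are built for CPython 3.12, so the pinned wheels only install cleanly if your environment's interpreter matches. A quick check, run inside the activated environment:

```shell
# The cp312 tag in the wheel names above requires CPython 3.12.
# Confirm the environment's interpreter version matches:
python3 --version
```

If the major.minor version differs, pick the matching wheels from data.pyg.org instead.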
uv pip install ogb jupyter matplotlib ipykernel

Now, move to a new notebook. We're going to ensure the notebook can access your new environment.
!my_env/bin/python -m ipykernel install --user --name my_env --display-name "my_env (uv)"

This registers the environment permanently to your account; you only need to do this once.
- Refresh the page
- Go to Kernel → Change Kernel
- Select "my_env (uv)" from the dropdown
- Click Kernel → Restart Kernel
Run this to confirm you're using the right environment:
import sys
print(sys.executable)
# Expected: .../my_env/bin/python

Then verify all packages are working:
import torch, torch_geometric, torch_scatter, torch_cluster
print(torch.__version__, torch_geometric.__version__, torch.cuda.is_available())
# Expected: 2.5.0+cu121 2.7.0 True

- Steps 1 and 2 only need to be done once; after that, "my_env (uv)" will always appear in the kernel dropdown for any new notebook.
- If "my_env (uv)" doesn't appear in the dropdown after refreshing, re-run Step 2.
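With the environment verified, here's the kind of simple GCN training loop mentioned at the start, which you can adapt as a template. This is a minimal sketch that implements the symmetrically normalized GCN propagation directly in plain PyTorch on a tiny synthetic graph, so it runs without downloading anything; for the project you'd swap in torch_geometric's GCNConv layers and a real dataset (e.g. from ogb, which you installed above).

```python
import torch
import torch.nn.functional as F

def gcn_norm(edge_index, num_nodes):
    # Dense D^{-1/2} (A + I) D^{-1/2}; fine for small graphs
    # (torch_geometric's GCNConv does the sparse equivalent).
    A = torch.zeros(num_nodes, num_nodes)
    A[edge_index[0], edge_index[1]] = 1.0
    A = ((A + A.t()) > 0).float() + torch.eye(num_nodes)  # symmetrize, add self-loops
    d_inv_sqrt = A.sum(dim=1).pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * A * d_inv_sqrt.unsqueeze(0)

class GCN(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim, num_classes):
        super().__init__()
        self.lin1 = torch.nn.Linear(in_dim, hidden_dim)
        self.lin2 = torch.nn.Linear(hidden_dim, num_classes)

    def forward(self, x, A_norm):
        x = F.relu(A_norm @ self.lin1(x))   # propagate, then transform
        return A_norm @ self.lin2(x)

# Tiny synthetic graph: 6 nodes, 4 features, 2 classes.
torch.manual_seed(0)
edge_index = torch.tensor([[0, 1, 2, 3, 4], [1, 2, 3, 4, 5]])
x = torch.randn(6, 4)
y = torch.tensor([0, 0, 0, 1, 1, 1])
A_norm = gcn_norm(edge_index, num_nodes=6)

model = GCN(in_dim=4, hidden_dim=8, num_classes=2)
opt = torch.optim.Adam(model.parameters(), lr=0.01)
for epoch in range(100):
    opt.zero_grad()
    loss = F.cross_entropy(model(x, A_norm), y)
    loss.backward()
    opt.step()
print(f"final training loss: {loss.item():.4f}")
```

On a GPU node you'd move the model and tensors to cuda with .to(device), exactly as with any other PyTorch model.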
Suppose you have a Python script, say my_demo_script.py, that you want to run on a GPU for a long duration. Notebooks won't cut it; you'll need to submit a SLURM job instead. Exit any instance you launched on the education_gpu partition and create an instance on the education partition instead.
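Such a script might look like the following minimal sketch (the filename my_demo_script.py matches the job file below; in practice you'd replace the body with your real training code). It just confirms which node the job landed on and exercises the GPU if one is visible:

```python
# my_demo_script.py: a minimal sanity-check job (a sketch; replace with
# your real training code). Reports where the job is running and does a
# small computation on the GPU when one is available.
import socket

import torch

print(f"Host: {socket.gethostname()}")
print(f"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.randn(1024, 1024, device=device)
y = x @ x  # a matmul big enough to register in nvidia-smi
print(f"Matmul on {device}: result shape {tuple(y.shape)}")
```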
You'll need a bash file (ending in .sh) with the job details. These files have header text that tells the cluster how many GPUs you're requesting, how long your job should run, and where to store any logging or error files. This is followed by environment initialization and the Python script you want to run.
#!/bin/bash
#SBATCH --job-name=hpc_demo
#SBATCH --output=hpc_demo-%j.log
#SBATCH --error=hpc_demo-%j.err
#SBATCH --time=00:10:00
#SBATCH --ntasks=1
#SBATCH --partition=education_gpu
#SBATCH --gres=gpu:1
cd ... # where your project is
# activate the env
source my_env/bin/activate
# run the program
python my_demo_script.py

Save the training script and job file, then submit:
sbatch job.sh

Monitor your job:
squeue -u $USER

You'll get a table with relevant information about your job (the ST column is the job state: R means running, PD pending):
JOBID PARTITION NAME USER ACCOUNT ST SUBMIT_TIME TIME TIME_LIMIT NODES CPUS MIN_MEMORY TRES_PER_N FEATURES PRIORITY NODELIST(REASON)
5119655 education ood-jupy cpsc4830 cpsc4830 R 2026-02-22T19:12 8:32 1:00:00 1 1 5G N/A (null) 0.00001806 a1132u05n01
5119757 education hpc_demo cpsc4830 cpsc4830 PD 2026-02-22T19:20 0:00 10:00 1 1 5G gres/gpu:1 (null) 0.00001803 (Resources)
Once the job is done, you can view the logging and error files it created (hpc_demo-<jobid>.log and hpc_demo-<jobid>.err, per the #SBATCH lines above, where %j expands to the job ID). If your run fails or crashes, the error file will contain the traceback you need to debug. Make the relevant fixes and submit another job. To cancel a queued or running job early, use scancel <jobid>.
Remember, you have to be on the education partition to request a GPU on the education_gpu partition. If you're on the latter and submit a job via sbatch, it's akin to requesting two GPUs, which the current HPC setup doesn't allow for fairness reasons.