Resources on GPU architecture
- A YouTube channel (found via George Hotz's (geohot) videos)
- Documentation on CUDA architecture from Modal's GPU Glossary:
- https://modal.com/
- Full glossary with useful GPU-specific terminology
- What is a CUDA Device Architecture? | GPU Glossary
- From George Hotz | how do GPUs work? (noob) + paper reading (not noob) | tinycorp.myshopify.com - YouTube
- Warp scheduler: https://modal.com/gpu-glossary/device-hardware/warp-scheduler
- Streaming Multiprocessors: https://modal.com/gpu-glossary/device-hardware/streaming-multiprocessor
- CUDA "refresher" technical blogs from NVIDIA:
- Video on CUDA and GPU (focused on C++)
- Matrix multiplication on GPU (a naive kernel sketch follows this list)
- Mini Project: How to program a GPU? | CUDA C/C++ - YouTube
- 2678x Faster with CUDA C: Simple Matrix Multiplication on a GPU | Episode 1: Introduction to GPGPU - YouTube
- 2678x Faster Matrix Multiplication with a GPU
- [[GPU Knowledge#^da17d3]]
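- A minimal sketch of the naive kernel these videos start from (illustrative names, not the exact code from the videos): one thread computes one element of C = A * B, so every thread re-reads rows of A and columns of B from global memory. Square N x N row-major matrices assumed.

```cuda
// Naive GEMM sketch: one thread per output element of C = A * B.
// N x N row-major matrices assumed; all names are illustrative.
__global__ void naiveMatmul(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k)
            acc += A[row * N + k] * B[k * N + col];  // every read hits global memory
        C[row * N + col] = acc;
    }
}
// Example launch: dim3 block(16, 16), grid((N + 15) / 16, (N + 15) / 16);
// naiveMatmul<<<grid, block>>>(A, B, C, N);
```

- The redundant global-memory traffic in the inner loop is exactly what the tiled versions below eliminate.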
- Video tutorial with animations showing how tiled matrix multiplication works (tiled GEMM / WMMA):
- https://www.youtube.com/watch?v=Q3GgbfGTnVc
- This video contains resources from: https://0mean1sigma.com/
- 0Mean1Sigma, a great series on GPGPU programming: ^38d8d8
- https://0mean1sigma.com/tag/gpgpu-programming/
- Introduction to Tensor Cores Programming: https://0mean1sigma.com/tgemm/
- contains details on tiles, threads, kernels, and WMMA (a minimal WMMA sketch follows this list)
- Memory Coalescing and Tiled Matrix Multiplication: https://0mean1sigma.com/chapter-4-memory-coalescing-and-tiled-matrix-multiplication/
- Github repo: https://github.com/tgautam03/tGeMM/tree/master
- 2678x Faster Matrix Multiplication with a GPU: https://0mean1sigma.com/2678x-faster-how-gpus-supercharge-matrix-multiplication/
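- A minimal WMMA sketch in the spirit of the tensor-core post above (not its exact code; all names illustrative): one 32-thread warp per block computes a 16x16 tile of C with half-precision inputs and float accumulation. Assumes M, N, K are multiples of 16, row-major layouts, and a tensor-core-capable GPU (sm_70 or newer).

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// WMMA sketch: each block is a single warp that owns one 16x16 tile of C.
__global__ void wmmaGemm(const half* A, const half* B, float* C, int M, int N, int K) {
    int tileRow = blockIdx.y * 16;  // top-left corner of this warp's C tile
    int tileCol = blockIdx.x * 16;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;
    wmma::fill_fragment(cFrag, 0.0f);

    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(aFrag, A + tileRow * K + k, K);  // 16x16 tile of A
        wmma::load_matrix_sync(bFrag, B + k * N + tileCol, N);  // 16x16 tile of B
        wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);             // tensor-core multiply-accumulate
    }
    wmma::store_matrix_sync(C + tileRow * N + tileCol, cFrag, N, wmma::mem_row_major);
}
// Example launch: dim3 grid(N / 16, M / 16); wmmaGemm<<<grid, 32>>>(A, B, C, M, N, K);
```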
- Tiled matmul (tiled GEMM) implementation in CUDA (see the shared-memory tiling sketch after this list):
- Simon Oz YouTube Channel:
- Explains GPU architecture: https://www.youtube.com/watch?v=Zrbw0zajhJM
- Tiling With Shared Memory | GPU Programming | Episode 7 : https://www.youtube.com/watch?v=ccHyFnEZt7M
- More videos:
- ![[Pasted image 20251203121620.png]]
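- A minimal shared-memory tiling sketch along the lines of these tutorials (illustrative, not the exact code from the videos or repos above): each block stages TILE x TILE sub-matrices of A and B in shared memory, so each global value is loaded once per block instead of once per thread.

```cuda
#define TILE 16  // must match the block dimensions in the launch below

// Tiled GEMM sketch: C = A * B for N x N row-major matrices.
__global__ void tiledMatmul(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Cooperative load of one tile of A and one of B (zero-pad out of range).
        As[threadIdx.y][threadIdx.x] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();  // tile must be fully loaded before anyone reads it

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];  // reads hit shared memory
        __syncthreads();  // everyone done with the tile before it is overwritten
    }
    if (row < N && col < N) C[row * N + col] = acc;
}
// Example launch: dim3 block(TILE, TILE), grid((N + TILE - 1) / TILE, (N + TILE - 1) / TILE);
```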
- How to scale your model: https://jax-ml.github.io/scaling-book/
- GitHub repo with detailed calculations
- https://github.com/jax-ml/scaling-book
- Horace He blog post:
- Making Deep Learning Go Brrrr From First Principles
- https://horace.io/brrr_intro.html
- GPU memory bandwidth explained:
- https://www.digitalocean.com/community/tutorials/gpu-memory-bandwidth
- GPU memory bandwidth refers to the rate at which data can be transferred between the GPU and its memory (VRAM). It is measured in gigabytes per second (GB/s) and plays a critical role in handling large datasets, real-time rendering, and AI/ML workloads. Higher bandwidth allows for faster data movement, improving overall performance.
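- As a rough illustration, effective bandwidth can be measured by timing a copy kernel and dividing the bytes moved by the elapsed time (a self-contained sketch with an arbitrary buffer size; compare the printed figure against the GPU's spec-sheet bandwidth):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Copy kernel: each element is read once and written once (2 * bytes of traffic).
__global__ void copyKernel(const float* in, float* out, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main() {
    const size_t n = 1 << 26;  // 64M floats = 256 MB per buffer (arbitrary choice)
    const size_t bytes = n * sizeof(float);
    float *dIn, *dOut;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);

    int threads = 256;
    int blocks = (int)((n + threads - 1) / threads);
    copyKernel<<<blocks, threads>>>(dIn, dOut, n);  // warm-up launch

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    copyKernel<<<blocks, threads>>>(dIn, dOut, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("effective bandwidth: %.1f GB/s\n", 2.0 * bytes / (ms * 1e-3) / 1e9);

    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}
```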
- Parallel Computing CUDA C:
- https://github.com/CisMine/Parallel-Computing-Cuda-C
- https://github.com/CisMine/Parallel-Computing-Cuda-C/blob/main/Chapter06/README.md?ref=0mean1sigma.com
- Taken from [[GPU Architecture#^38d8d8]]: ![[Pasted image 20251203103307.png]]
- CUDA learning guide: in addition to the code examples, this repository provides a curated list of resources, including books, tutorials, online courses, and research papers, to further enhance your understanding of parallel computing and CUDA C programming. These resources help you delve deeper into the subject and explore advanced topics and techniques.
- NVIDIA Best Practices Guide (2023)
- NVIDIA Programming Guide (2023)
- GPU Programming (2021, YouTube)
- CUDA Programming (2023, YouTube)
- Programming Massively Parallel Processors (2022, YouTube)
- CUDA Training Series (2022-2023, YouTube)
- Programming Heterogeneous Computing Systems with GPUs (2023, YouTube)
- CUDA thread indexing cheatsheet
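- The standard patterns such cheatsheets cover, as a quick sketch (illustrative names): flattening block and thread coordinates into a global index, plus the grid-stride idiom for arrays larger than the grid.

```cuda
// 1D grid of 1D blocks, with a grid-stride loop for arbitrarily large n.
__global__ void scale1D(float* data, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;  // global thread index
    size_t stride = (size_t)gridDim.x * blockDim.x;            // total threads in grid
    for (; i < n; i += stride)
        data[i] *= 2.0f;
}

// 2D grid of 2D blocks, e.g. for images; row-major flattening.
__global__ void zero2D(float* img, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row
    if (x < width && y < height)
        img[y * width + x] = 0.0f;
}
```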
- The repo also links a great course on GPU architecture (an ETH Zürich course from 2023):
- HetSys Course: Lecture 4: GPU Memory Hierarchy (Spring 2023)
- https://www.youtube.com/watch?v=ZQKMZIP3Fzg&list=PL5Q2soXY2Zi-qSKahS4ofaEwYl7_qp9mw&index=4
- Explains the architecture of the H100 GPU:
- ![[Pasted image 20251203104234.png]]
- Full in-depth architecture:
- H100 detailed architecture
- Source: https://modal-cdn.com/gpu-glossary/gtc22-whitepaper-hopper.pdf
- ![[Pasted image 20251203115758.png]]
- AMD Instruction set for RDNA4 operations:
- https://docs.amd.com/v/u/en-US/rdna4-instruction-set-architecture
- Example:
- ![[Pasted image 20251203120416.png]]
- AMD Technical reports are all available at:
- Further reading (these annotations are pasted from the scaling book's own recommendations):
- TPU Deep Dive: a wonderful in-depth look at the TPU architecture in the spirit of this book.
- Making Deep Learning Go Brrrr From First Principles: a more GPU and PyTorch-focused tutorial on LLM rooflines and performance engineering.
- Writing TPU Kernels with Pallas: increasingly, TPU programming involves writing custom kernels in Pallas. This series discusses how to write kernels and many lower level TPU details that aren’t mentioned here.
- How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog: while GPU and CUDA specific, this is an excellent blog post showing how to optimize a matmul kernel in CUDA. This might be a good deep dive into how TPUs and GPUs are different.
- Simon Boehm's worklog (the author is now at Anthropic): https://siboehm.com/articles/22/CUDA-MMM
- Key results:
- ![[Pasted image 20251203120912.png]]
- Full repo with GEMM implementations: https://github.com/siboehm/SGEMM_CUDA
- Distributed arrays and automatic parallelization: this is a really nice guide to parallelism APIs in JAX and is a good way to learn how to actually implement some of the ideas we’ve discussed here.
- Rafi Witten’s High Performance LLMs 2024 Class: our former colleague Rafi gave a great course on TPU performance engineering and the slides are all on GitHub. This covers a bunch of things in more depth than we do here.
- [2211.05102] Efficiently Scaling Transformer Inference: a detailed paper on the mathematics of Transformer inference. This is the inspiration for a lot of this document.
- Huggingface Ultra-Scale Playbook: something of a GPU analog to this book, this talks more at depth about how PyTorch implements parallelism techniques and memory-saving techniques during training.
- Transformer Inference Arithmetic: a blog with many of the same ideas as this book and some excellent illustrations.
- Stanford CS336 Slides and Videos: a fantastic Stanford course covering many details of LLM training and serving, with some useful exercises. Assignments 1 and 2 are particularly relevant.
- Stas Bekman’s ML Engineering Handbook: a highly practical guide to ML infrastructure, covering topics not addressed in this book like how to negotiate with cloud providers, cluster management, and empirical measurements of GPU throughput.