Resources on GPU architecture
- A YouTube channel (found via George Hotz's (geohot) videos)
- Documentation on CUDA architecture from Modal's GPU Glossary:
- https://modal.com/
- Full glossary with useful GPU-specific terminology
- What is a CUDA Device Architecture? | GPU Glossary
- From George Hotz | how do GPUs work? (noob) + paper reading (not noob) | tinycorp.myshopify.com - YouTube
- Warp scheduler: https://modal.com/gpu-glossary/device-hardware/warp-scheduler
- Streaming Multiprocessors: https://modal.com/gpu-glossary/device-hardware/streaming-multiprocessor
- CUDA "refresher" technical blogs from NVIDIA:
- Video on CUDA and GPU (focused on C++)
- Matrix multiplication on GPU (a naive kernel sketch follows this list)
- Mini Project: How to program a GPU? | CUDA C/C++ - YouTube
- 2678x Faster with CUDA C: Simple Matrix Multiplication on a GPU | Episode 1: Introduction to GPGPU - YouTube
- 2678x Faster Matrix Multiplication with a GPU
- [[GPU Knowledge#^da17d3]]
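- A minimal sketch of the naive kernel these videos start from (illustrative names, not the exact code from the videos): one thread computes one element of C = A * B, so every thread re-reads rows of A and columns of B from global memory. Square N x N row-major matrices assumed.

```cuda
// Naive GEMM sketch: one thread per output element of C = A * B.
// N x N row-major matrices assumed; all names are illustrative.
__global__ void naiveMatmul(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k)
            acc += A[row * N + k] * B[k * N + col];  // every read hits global memory
        C[row * N + col] = acc;
    }
}
// Example launch: dim3 block(16, 16), grid((N + 15) / 16, (N + 15) / 16);
// naiveMatmul<<<grid, block>>>(A, B, C, N);
```

- The redundant global-memory traffic in the inner loop is exactly what the tiled versions below eliminate.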
- Video tutorial with animations showing how tiled matrix multiplication works (tiled GEMM / WMMA):
- https://www.youtube.com/watch?v=Q3GgbfGTnVc
- This video contains resources from: https://0mean1sigma.com/
- 0Mean1Sigma, a great series on GPGPU programming: ^38d8d8
- https://0mean1sigma.com/tag/gpgpu-programming/
- Introduction to Tensor Cores Programming: https://0mean1sigma.com/tgemm/
- contains details on tiles, threads, kernels, and WMMA (a minimal WMMA sketch follows this list)
- Memory Coalescing and Tiled Matrix Multiplication: https://0mean1sigma.com/chapter-4-memory-coalescing-and-tiled-matrix-multiplication/
- Github repo: https://github.com/tgautam03/tGeMM/tree/master
- 2678x Faster Matrix Multiplication with a GPU: https://0mean1sigma.com/2678x-faster-how-gpus-supercharge-matrix-multiplication/
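- A minimal WMMA sketch in the spirit of the tensor-core post above (not its exact code; all names illustrative): one 32-thread warp per block computes a 16x16 tile of C with half-precision inputs and float accumulation. Assumes M, N, K are multiples of 16, row-major layouts, and a tensor-core-capable GPU (sm_70 or newer).

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// WMMA sketch: each block is a single warp that owns one 16x16 tile of C.
__global__ void wmmaGemm(const half* A, const half* B, float* C, int M, int N, int K) {
    int tileRow = blockIdx.y * 16;  // top-left corner of this warp's C tile
    int tileCol = blockIdx.x * 16;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;
    wmma::fill_fragment(cFrag, 0.0f);

    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(aFrag, A + tileRow * K + k, K);  // 16x16 tile of A
        wmma::load_matrix_sync(bFrag, B + k * N + tileCol, N);  // 16x16 tile of B
        wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);             // tensor-core multiply-accumulate
    }
    wmma::store_matrix_sync(C + tileRow * N + tileCol, cFrag, N, wmma::mem_row_major);
}
// Example launch: dim3 grid(N / 16, M / 16); wmmaGemm<<<grid, 32>>>(A, B, C, M, N, K);
```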
- Tiled matmul (tiled GEMM) implementation in CUDA (see the shared-memory tiling sketch after this list):
- Simon Oz YouTube Channel:
- Explains GPU architecture: https://www.youtube.com/watch?v=Zrbw0zajhJM
- Tiling With Shared Memory | GPU Programming | Episode 7 : https://www.youtube.com/watch?v=ccHyFnEZt7M
- More videos:
- ![[Pasted image 20251203121620.png]]
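- A minimal shared-memory tiling sketch along the lines of these tutorials (illustrative, not the exact code from the videos or repos above): each block stages TILE x TILE sub-matrices of A and B in shared memory, so each global value is loaded once per block instead of once per thread.

```cuda
#define TILE 16  // must match the block dimensions in the launch below

// Tiled GEMM sketch: C = A * B for N x N row-major matrices.
__global__ void tiledMatmul(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Cooperative load of one tile of A and one of B (zero-pad out of range).
        As[threadIdx.y][threadIdx.x] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();  // tile must be fully loaded before anyone reads it

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];  // reads hit shared memory
        __syncthreads();  // everyone done with the tile before it is overwritten
    }
    if (row < N && col < N) C[row * N + col] = acc;
}
// Example launch: dim3 block(TILE, TILE), grid((N + TILE - 1) / TILE, (N + TILE - 1) / TILE);
```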
- How to scale your model: https://jax-ml.github.io/scaling-book/
- GitHub repo with detailed calculations
- https://github.com/jax-ml/scaling-book
- Horace He blog post:
- Making Deep Learning Go Brrrr From First Principles
- https://horace.io/brrr_intro.html
- GPU memory bandwidth explained:
- https://www.digitalocean.com/community/tutorials/gpu-memory-bandwidth
- GPU memory bandwidth refers to the rate at which data can be transferred between the GPU and its memory (VRAM). It is measured in gigabytes per second (GB/s) and plays a critical role in handling large datasets, real-time rendering, and AI/ML workloads. Higher bandwidth allows for faster data movement, improving overall performance.
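- As a rough illustration, effective bandwidth can be measured by timing a copy kernel and dividing the bytes moved by the elapsed time (a self-contained sketch with an arbitrary buffer size; compare the printed figure against the GPU's spec-sheet bandwidth):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Copy kernel: each element is read once and written once (2 * bytes of traffic).
__global__ void copyKernel(const float* in, float* out, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main() {
    const size_t n = 1 << 26;  // 64M floats = 256 MB per buffer (arbitrary choice)
    const size_t bytes = n * sizeof(float);
    float *dIn, *dOut;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);

    int threads = 256;
    int blocks = (int)((n + threads - 1) / threads);
    copyKernel<<<blocks, threads>>>(dIn, dOut, n);  // warm-up launch

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    copyKernel<<<blocks, threads>>>(dIn, dOut, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("effective bandwidth: %.1f GB/s\n", 2.0 * bytes / (ms * 1e-3) / 1e9);

    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}
```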
- Parallel Computing CUDA C:
- https://github.com/CisMine/Parallel-Computing-Cuda-C
- https://github.com/CisMine/Parallel-Computing-Cuda-C/blob/main/Chapter06/README.md?ref=0mean1sigma.com
- Taken from [[GPU Architecture#^38d8d8]]: ![[Pasted image 20251203103307.png]]
- CUDA learning guide: in addition to the code examples, this repository provides a curated list of resources, including books, tutorials, online courses, and research papers, to further enhance your understanding of parallel computing and CUDA C programming. These resources help you delve deeper into the subject and explore advanced topics and techniques.
- NVIDIA Best Practices Guide (2023)
- NVIDIA Programming Guide (2023)
- GPU Programming (2021, YouTube)
- CUDA Programming (2023, YouTube)
- Programming Massively Parallel Processors (2022, YouTube)
- CUDA Training Series (2022-2023, YouTube)
- Programming Heterogeneous Computing Systems with GPUs (2023, YouTube)
- CUDA thread indexing cheatsheet
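- The standard patterns such cheatsheets cover, as a quick sketch (illustrative names): flattening block and thread coordinates into a global index, plus the grid-stride idiom for arrays larger than the grid.

```cuda
// 1D grid of 1D blocks, with a grid-stride loop for arbitrarily large n.
__global__ void scale1D(float* data, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;  // global thread index
    size_t stride = (size_t)gridDim.x * blockDim.x;            // total threads in grid
    for (; i < n; i += stride)
        data[i] *= 2.0f;
}

// 2D grid of 2D blocks, e.g. for images; row-major flattening.
__global__ void zero2D(float* img, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row
    if (x < width && y < height)
        img[y * width + x] = 0.0f;
}
```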
- The repo also links a great course on GPU architecture (an ETH Zürich course from 2023):
- HetSys Course: Lecture 4: GPU Memory Hierarchy (Spring 2023)
- https://www.youtube.com/watch?v=ZQKMZIP3Fzg&list=PL5Q2soXY2Zi-qSKahS4ofaEwYl7_qp9mw&index=4
- Explains the architecture of the H100 GPU:
- ![[Pasted image 20251203104234.png]]
- Full in-depth architecture:
- H100 detailed architecture
- Source: https://modal-cdn.com/gpu-glossary/gtc22-whitepaper-hopper.pdf
- ![[Pasted image 20251203115758.png]]
- AMD Instruction set for RDNA4 operations:
- https://docs.amd.com/v/u/en-US/rdna4-instruction-set-architecture
- Example:
- ![[Pasted image 20251203120416.png]]
- AMD Technical reports are all available at:
- Further reading (these annotations are pasted from the scaling book's own recommendations):
- TPU Deep Dive: a wonderful in-depth look at the TPU architecture in the spirit of this book.
- Making Deep Learning Go Brrrr From First Principles: a more GPU and PyTorch-focused tutorial on LLM rooflines and performance engineering.
- Writing TPU Kernels with Pallas: increasingly, TPU programming involves writing custom kernels in Pallas. This series discusses how to write kernels and many lower level TPU details that aren’t mentioned here.
- How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog: while GPU and CUDA specific, this is an excellent blog post showing how to optimize a matmul kernel in CUDA. This might be a good deep dive into how TPUs and GPUs are different.
- Simon Boehm's worklog (the author is now at Anthropic): https://siboehm.com/articles/22/CUDA-MMM
- Key results:
- ![[Pasted image 20251203120912.png]]
- Full repo with GEMM implementations: https://github.com/siboehm/SGEMM_CUDA
- Distributed arrays and automatic parallelization: this is a really nice guide to parallelism APIs in JAX and is a good way to learn how to actually implement some of the ideas we’ve discussed here.
- Rafi Witten’s High Performance LLMs 2024 Class: our former colleague Rafi gave a great course on TPU performance engineering and the slides are all on GitHub. This covers a bunch of things in more depth than we do here.
- [2211.05102] Efficiently Scaling Transformer Inference: a detailed paper on the mathematics of Transformer inference. This is the inspiration for a lot of this document.
- Huggingface Ultra-Scale Playbook: something of a GPU analog to this book, this talks more at depth about how PyTorch implements parallelism techniques and memory-saving techniques during training.
- Transformer Inference Arithmetic: a blog with many of the same ideas as this book and some excellent illustrations.
- Stanford CS336 Slides and Videos: a fantastic Stanford course covering many details of LLM training and serving, with some useful exercises. Assignments 1 and 2 are particularly relevant.
- Stas Bekman’s ML Engineering Handbook: a highly practical guide to ML infrastructure, covering topics not addressed in this book like how to negotiate with cloud providers, cluster management, and empirical measurements of GPU throughput.