GPU Talk from Jane Street: Performance Deep Dive

https://www.youtube.com/watch?v=pHqcHzxx6I8

CPU vs. GPU Performance

  • A 2048 x 2048 matrix multiplication takes about 28 ms on 1 CPU core.
  • The same operation on 1 GPU takes roughly 209 microseconds ($\mu s$).
    • Question: This is a massive speedup! What specific CPU and GPU were they using for this comparison?
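
Since the hardware isn't specified, here is a rough sketch of how one might reproduce the comparison in PyTorch (my own timing harness, not the talk's; absolute numbers depend entirely on the CPU and GPU used):

    import time
    import torch

    a = torch.randn(2048, 2048)
    b = torch.randn(2048, 2048)

    # CPU timing on a single core.
    torch.set_num_threads(1)
    t0 = time.perf_counter()
    a @ b
    cpu_ms = (time.perf_counter() - t0) * 1e3

    # GPU timing with CUDA events; kernel launches are asynchronous,
    # so we must synchronize before reading the elapsed time.
    a_gpu, b_gpu = a.cuda(), b.cuda()
    a_gpu @ b_gpu  # warm-up launch
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    a_gpu @ b_gpu
    end.record()
    torch.cuda.synchronize()
    gpu_us = start.elapsed_time(end) * 1e3  # elapsed_time returns milliseconds

    print(f"CPU: {cpu_ms:.1f} ms, GPU: {gpu_us:.0f} µs")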

Understanding Kernel Launches and Synchronization

The key takeaway here is avoiding synchronization between the CPU and GPU to keep the pipeline flowing.

  • Kernel Launches: The process starts with a synchronous point on the CPU: $$\text{Launch kernel from CPU} \longrightarrow \text{Driver} \longrightarrow \text{Operation on GPU}$$ However, the overall workload should be run from an asynchronous pipeline to mask this latency.

  • Common Synchronization Pitfalls (Sync Points): These are things that force the CPU to wait for the GPU to finish, killing performance. They often involve the CPU needing a result that only the GPU has completed.

    • Tensor Coercions: Changing the data type or layout of a tensor.
    • memcpy and .to('cuda'): A transfer from pageable host memory is a sync point, because the driver must first page-lock (pin) the memory (make it non-swappable) so the GPU can safely read it via DMA. Transfers from pre-pinned memory can run asynchronously; a sketch follows this list.
  • Condition Checks as a Sync Point (Graph Break Example):

    Imagine the GPU is running an operation that computes a maximum value (max_val) for a tensor. If the next operation's execution path depends on that result, it causes a sync point:

    import torch

    tensor = torch.randn(1024, 1024, device='cuda')  # example tensor on the GPU
    max_val = tensor.max()    # GPU computes the max value (kernel enqueued asynchronously)
    if max_val.item() > 100:  # CPU checks the result, forcing a wait for the GPU
        ...  # execute one path
    else:
        ...  # execute another path

    The line max_val.item() forces the CPU to wait for the GPU to finish the tensor.max() kernel and transfer the result back, thus breaking the asynchronous pipeline.

  • Strategies to Maximize Async Work:

    • Do as much work as possible on the GPU!
    • Asynchronous Checks: Move checks for stability or convergence off the main thread.
    • DMA (Direct Memory Access): The GPU can perform DMA (asynchronously) if the memory is pinned (page-locked) and not deallocated.
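
A minimal sketch of the pinned-memory / deferred-synchronization pattern described in this list (PyTorch assumed; shapes and names are illustrative):

    import torch

    # Allocate the host tensor in pinned (page-locked) memory so the GPU
    # can pull it via DMA and the copy can overlap with other work.
    host_batch = torch.randn(2048, 2048, pin_memory=True)

    # With a pinned source, non_blocking=True makes the transfer
    # asynchronous instead of a CPU/GPU sync point.
    device_batch = host_batch.to('cuda', non_blocking=True)

    # Keep the pipeline full: enqueue more GPU work without waiting.
    result = device_batch @ device_batch
    norm = result.norm()  # still on the GPU, no sync yet

    # Synchronize and read values back only when the CPU actually needs
    # them, e.g. for an occasional stability/convergence check.
    torch.cuda.synchronize()
    print(norm.item())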

Kernel Efficiency and Bottlenecks

The goal is to keep the GPU's Streaming Multiprocessors (SMs) busy and fed with data.

  • SM Architecture (Example: H100): The H100 has 132 SMs. We need to ensure:

    • Work Assignment: Work should be explicitly assigned to each SM to spread the load.
    • No Shared Computation: Computation should generally not be shared across SMs, to avoid contention.
  • Memory Hierarchy and Use: The hierarchy runs from per-SM shared memory $\longrightarrow$ L2 cache $\longrightarrow$ global memory (fastest and smallest to slowest and largest).

    • Question: The biggest question for optimizing memory is: what can I fit into shared memory? (It's the fastest memory available per SM.)
  • General Principle: Always give the GPU sufficient work to keep it working (i.e., hide latency).
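
As a quick sanity check on the figures above, PyTorch exposes the SM count and memory of the current device (a small sketch; device index 0 is assumed):

    import torch

    props = torch.cuda.get_device_properties(0)
    print(props.name)                    # device name, e.g. an H100 variant
    print(props.multi_processor_count)   # number of SMs (132 on an H100 SXM)
    print(props.total_memory // 2**30)   # global memory in GiB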

| Bottleneck | Solution(s) | Notes |
| --- | --- | --- |
| Compute bottleneck | Use Tensor Cores | Essential for modern ML (especially transformer-based models); Tensor Cores are specialized for matrix multiplication. |
| Memory bottleneck | Kernel fusion | Amortizes the cost of memory transfers/loads by combining multiple operations into a single kernel. |
| Kernel launch overhead | CUDA Graphs | Record the sequence of kernel launches and their parameters (e.g., launch configs) once, then pay the launch cost only once. Kernel fusion also helps by reducing the total number of kernels launched. |
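
A minimal sketch of the CUDA Graphs row above, using PyTorch's torch.cuda.CUDAGraph capture API; the model and shapes are illustrative:

    import torch

    model = torch.nn.Linear(2048, 2048).cuda()
    static_input = torch.randn(64, 2048, device='cuda')

    # Warm up on a side stream so one-time setup work is not captured.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            model(static_input)
    torch.cuda.current_stream().wait_stream(s)

    # Record the sequence of kernel launches and their parameters once...
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_output = model(static_input)

    # ...then replay it, paying the launch overhead once per replay
    # instead of once per kernel.
    static_input.copy_(torch.randn(64, 2048, device='cuda'))
    graph.replay()  # static_output now holds the result for the new input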

Kernel Fusion and torch.compile

Kernel fusion is a powerful optimization that merges multiple small, memory-bound operations into a single kernel launch.

  • How torch.compile helps:

    1. It traces the inputs and operations on those inputs.
    2. It looks for optimization opportunities by benchmarking and analyzing the graph.
    3. It then automatically builds the fused kernel (usually via an underlying compiler such as Triton or an in-house tool); a sketch appears at the end of this section.
  • Beware of Graph Breaks:

    • A graph break is a point where the compiler cannot merge operations.
    • This often happens when an operation requires reading data back to the CPU, or when control flow depends on a runtime value (like the max_val example above).
    • Crucially: Synchronizations in the graph lead to graph breaks.
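
A minimal sketch of fusion via torch.compile (the function and shapes are illustrative, not from the talk):

    import torch

    def scale_shift_relu(x):
        # Three memory-bound, element-wise ops that the compiler can fuse
        # into a single kernel, reading x from global memory only once.
        return torch.relu(x * 2.0 + 1.0)

    compiled = torch.compile(scale_shift_relu)

    x = torch.randn(4096, 4096, device='cuda')
    y = compiled(x)  # first call traces and builds the fused kernel
    y = compiled(x)  # subsequent calls reuse it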