@sophiawisdom
Created August 27, 2024 07:30
* Many different layers of hierarchy -- 128 SMs/GPU, 4 SMSPs/SM, 32 threads/lanes per SMSP
* Coordination is cheaper the lower you go in the hierarchy, but you can scale more the higher you go, so the key is to think about how to trade these off
* Lots of registers
* Many functional units, each can operate independently
* Combine these: you can do super-hyperthreading, which gets you great utilization of your functional units at the cost of latency
* Every (vector) operation happens in units of 32
* Doing the same thing (vectorization, loss of control) only matters at the scale of 32 lanes -- it's free to have different cores doing different things
* Memory coalescing is an important optimization (see the sketch below)
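
A minimal CUDA sketch of the coalescing point; the kernel names and the stride-32 access pattern are illustrative assumptions, not from the notes:

```cuda
#include <cuda_runtime.h>

// Coalesced: lane k of each warp reads word k of a contiguous range, so a
// warp's 32 4-byte loads are serviced by as few as one 128-byte transaction.
__global__ void copy_coalesced(const float* __restrict__ in,
                               float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: adjacent lanes read addresses 32 floats (128 bytes) apart, so each
// lane pulls its own cache line and 124 of every 128 bytes moved are wasted.
__global__ void copy_strided(const float* __restrict__ in,
                             float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(size_t)i * 32 % n];
}
```

Both kernels execute the same instruction count; the strided one is typically several times slower because it moves roughly 32x the bytes for the same useful data.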
density, not locality -- what matters is how many useful bytes you get out of each memory transaction, not CPU-style reuse of recently touched data
explanation of nvlink vs rdma
explanation of pcie
in general you have more control than on a CPU (explicit yield, explicit prefetch, SASS scheduling)
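
One concrete form of that control: an explicit L2 prefetch via inline PTX. `prefetch.global.L2` is a real PTX instruction; the wrapper and the grid-stride kernel around it are a hypothetical sketch:

```cuda
__device__ __forceinline__ void prefetch_l2(const void* p) {
    // Hint the line containing p into L2 without blocking the warp.
    asm volatile("prefetch.global.L2 [%0];" :: "l"(p));
}

// Grid-stride sum where each iteration prefetches the data for the next one.
__global__ void sum_with_prefetch(const float* __restrict__ in,
                                  float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int step = gridDim.x * blockDim.x;
    float acc = 0.f;
    for (int j = i; j < n; j += step) {
        if (j + step < n) prefetch_l2(&in[j + step]);
        acc += in[j];
    }
    out[i] = acc;  // out has one slot per thread
}
```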
predication, not branching
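
What that looks like from CUDA: short conditionals compile to select/predicated instructions rather than actual branches, so the warp never diverges. A minimal sketch (the kernel name is ours):

```cuda
__global__ void relu_predicated(const float* __restrict__ in,
                                float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = in[i];
    // nvcc emits a select (SEL) or predicated instruction here, not a branch:
    // every lane executes the same instruction stream regardless of the condition.
    out[i] = (x > 0.f) ? x : 0.f;
}
```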
in exchange for only being able to execute 32-wide instructions, those instructions are only x% slower than cpu 1-wide instructions (for fp32 fma)
32-wide instructions + 128 byte cacheline -> why it’s so hard to extract value from sparsity
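
A back-of-the-envelope illustration of why those two numbers hurt sparsity, using a hypothetical gather kernel:

```cuda
// Worst case: each of a warp's 32 lanes hits a different 128-byte line, so
// 32 * 128 = 4096 bytes move from DRAM to deliver 32 * 4 = 128 useful bytes,
// ~3% efficiency. A sparse algorithm must save a lot of FLOPs to win that back.
__global__ void gather(const float* __restrict__ vals,
                       const int* __restrict__ idx,
                       float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = vals[idx[i]];  // scattered idx defeats coalescing
}
```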
something for the start: GPU programming heavily depends on having a good model of the machine. If you have an ok model of the machine and can design your algorithms with this in mind, they will run much faster