Created August 27, 2024 07:30
* Many different layers of hierarchy: 128 SMs per GPU, 4 SMSPs per SM, 32 threads/lanes per SMSP
* Coordination is cheaper the lower you go in the hierarchy, but you can scale further the higher you go, so a key question is how to trade these off
* Lots of registers
* Many functional units, each of which can operate independently
* Combine these and you can do super-hyperthreading, which gets you great utilization of your functional units at the cost of latency
* Every (vector) operation happens in units of 32
* Doing the same thing / vectorization / loss of control matters, but only at the scale of 32 lanes; it's free to have different cores doing different things
* Memory coalescing is an important optimization
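A minimal sketch of what coalescing means, assuming the usual NVIDIA figures (32-thread warps, 128-byte memory transactions, 4-byte fp32 elements); the function name is mine, not from any API:

```python
# Count the 128-byte memory transactions generated by one 32-thread warp
# under different access strides.

CACHELINE = 128  # bytes per memory transaction
WARP = 32        # threads per warp
ELEM = 4         # bytes per fp32 element

def transactions(stride_elems):
    """Number of distinct 128B lines touched when thread i loads element i*stride."""
    lines = {(i * stride_elems * ELEM) // CACHELINE for i in range(WARP)}
    return len(lines)

print(transactions(1))   # coalesced: 32 threads * 4B = one 128B line -> 1
print(transactions(32))  # strided: every thread lands on its own line -> 32
```

The same 32 loads cost anywhere from 1 to 32 transactions depending purely on the access pattern, which is why coalescing is worth designing for.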
* Density, not locality
* Explanation of NVLink vs. RDMA
* Explanation of PCIe
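To make the interconnect comparison concrete, here are rough, generation-dependent bandwidth figures (H100-era ballpark numbers; these are my assumptions, not from the notes):

```python
# Approximate peak bandwidths, GB/s (H100-era; exact numbers vary by generation)
links = {
    "PCIe 5.0 x16 (per direction)": 64,
    "NVLink 4 (aggregate per GPU)": 900,
    "HBM3 (on-device memory)": 3350,
}
for name, gbps in links.items():
    print(f"{name:30s} ~{gbps:5d} GB/s")
```

The roughly order-of-magnitude gap at each level is what makes the choice of interconnect, and where data lives, a first-order design decision.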
* In general you have more control (explicit yield, explicit prefetch, SASS scheduling)
* Predication, not branching
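A sketch of what predication looks like from the warp's point of view: every lane steps through both sides of the branch, and a per-lane predicate mask selects which result is kept (the function and predicate here are made up for illustration):

```python
# Simulate predicated execution across one 32-lane warp: both sides of the
# "branch" run for every lane; the predicate only gates the writeback.
WARP = 32

def predicated(xs):
    assert len(xs) == WARP
    pred = [x % 2 == 0 for x in xs]      # per-lane predicate
    then_side = [x // 2 for x in xs]     # every lane computes the then-side...
    else_side = [3 * x + 1 for x in xs]  # ...and the else-side
    return [t if p else e for p, t, e in zip(pred, then_side, else_side)]

print(predicated(list(range(WARP))))
```

The cost of a divergent branch is therefore the sum of both paths, which is why control flow only matters at the 32-lane scale the notes describe.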
* In exchange for only being able to execute 32-wide instructions, those instructions are only x% slower than CPU 1-wide instructions (for fp32 FMA)
* 32-wide instructions + 128-byte cachelines → why it's so hard to extract value from sparsity
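A back-of-the-envelope sketch of the sparsity point, assuming 128-byte transactions and 4-byte elements: fetching one scattered nonzero still moves a whole cacheline, so effective bandwidth scales with density:

```python
# With 128B transactions, a load moves 32 fp32 values whether you need them
# or not, so the useful fraction of bandwidth is set by per-line density.
CACHELINE, ELEM = 128, 4

def useful_fraction(nnz_per_line):
    """Fraction of each fetched 128B line that holds values you actually need."""
    return nnz_per_line * ELEM / CACHELINE

print(useful_fraction(32))  # dense line: 1.0
print(useful_fraction(1))   # one scattered nonzero per line: 0.03125
```

At one nonzero per line you pay full memory traffic for ~3% useful data, so unstructured sparsity has to be very high before it beats the dense path.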
* Something for the start: GPU programming depends heavily on having a good model of the machine. If you have even an OK model of the machine and design your algorithms with it in mind, they will run much faster.