@sophiawisdom
Created August 27, 2024 07:30
* Many different layers of hierarchy -- 128 SMs/GPU, 4 SMSPs/SM, 32 threads/lanes per SMSP
* Coordination is cheaper the lower you go in the hierarchy, but you can scale more the higher you go, so the key is to think about how to trade these off
* Lots of registers
* Many functional units, each can operate independently
* Combine these: you can do super-hyperthreading, which gets you great utilization of your functional units at the cost of latency
* Every (vector) operation happens in units of 32
* Doing the same thing (vectorization, loss of control) only matters at the scale of 32 lanes -- it's free to have different cores doing different things
* Memory coalescing is an important optimization (see the sketch below)
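
A minimal CUDA sketch of the coalescing point; the kernel names and the stride-32 access pattern are illustrative assumptions, not from the notes:

```cuda
#include <cuda_runtime.h>

// Coalesced: lane k of each warp reads word k of a contiguous range, so a
// warp's 32 4-byte loads are serviced by as few as one 128-byte transaction.
__global__ void copy_coalesced(const float* __restrict__ in,
                               float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: adjacent lanes read addresses 32 floats (128 bytes) apart, so each
// lane pulls its own cache line and 124 of every 128 bytes moved are wasted.
__global__ void copy_strided(const float* __restrict__ in,
                             float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(size_t)i * 32 % n];
}
```

Both kernels execute the same instruction count; the strided one is typically several times slower because it moves roughly 32x the bytes for the same useful data.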
density, not locality -- what matters is how many useful bytes you get out of each memory transaction, not CPU-style reuse of recently touched data
explanation of nvlink vs rdma
explanation of pcie
in general you have more control than on a CPU (explicit yield, explicit prefetch, SASS scheduling)
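
One concrete form of that control: an explicit L2 prefetch via inline PTX. `prefetch.global.L2` is a real PTX instruction; the wrapper and the grid-stride kernel around it are a hypothetical sketch:

```cuda
__device__ __forceinline__ void prefetch_l2(const void* p) {
    // Hint the line containing p into L2 without blocking the warp.
    asm volatile("prefetch.global.L2 [%0];" :: "l"(p));
}

// Grid-stride sum where each iteration prefetches the data for the next one.
__global__ void sum_with_prefetch(const float* __restrict__ in,
                                  float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int step = gridDim.x * blockDim.x;
    float acc = 0.f;
    for (int j = i; j < n; j += step) {
        if (j + step < n) prefetch_l2(&in[j + step]);
        acc += in[j];
    }
    out[i] = acc;  // out has one slot per thread
}
```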
predication, not branching
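
What that looks like from CUDA: short conditionals compile to select/predicated instructions rather than actual branches, so the warp never diverges. A minimal sketch (the kernel name is ours):

```cuda
__global__ void relu_predicated(const float* __restrict__ in,
                                float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = in[i];
    // nvcc emits a select (SEL) or predicated instruction here, not a branch:
    // every lane executes the same instruction stream regardless of the condition.
    out[i] = (x > 0.f) ? x : 0.f;
}
```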
in exchange for only being able to execute 32-wide instructions, those instructions are only x% slower than cpu 1-wide instructions (for fp32 fma)
32-wide instructions + 128 byte cacheline -> why it’s so hard to extract value from sparsity
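
A back-of-the-envelope illustration of why those two numbers hurt sparsity, using a hypothetical gather kernel:

```cuda
// Worst case: each of a warp's 32 lanes hits a different 128-byte line, so
// 32 * 128 = 4096 bytes move from DRAM to deliver 32 * 4 = 128 useful bytes,
// ~3% efficiency. A sparse algorithm must save a lot of FLOPs to win that back.
__global__ void gather(const float* __restrict__ vals,
                       const int* __restrict__ idx,
                       float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = vals[idx[i]];  // scattered idx defeats coalescing
}
```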
something for the start: GPU programming heavily depends on having a good model of the machine. If you have an ok model of the machine and can design your algorithms with this in mind, they will run much faster