A100 - spec 312 TFLOPS peak (BF16/FP16 with tensor cores)
40GB or 80GB HBM, 40MB L2 cache
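Back-of-envelope (my own arithmetic, not from the spec sheet): how many fp16 params fit in HBM if you count weights alone, ignoring optimizer state, gradients, and activations:

```python
def max_fp16_params(hbm_gb):
    """fp16 weights only: 2 bytes per parameter. Ignores optimizer
    state, gradients, and activations, so real capacity is far lower."""
    return hbm_gb * 1e9 / 2

for gb in (40, 80):
    print(f"{gb}GB HBM -> ~{max_fp16_params(gb) / 1e9:.0f}B fp16 params (weights alone)")
```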
-
ZeRO-3: should read the DeepSpeed (ZeRO) paper; looks like they used model parallelism (Megatron) as the baseline, with batch size 2?
-
BLOOM Megatron-DeepSpeed blog: https://huggingface.co/blog/bloom-megatron-deepspeed
-
activations
- 12 (input/proj/attention/nonlin) x hidden_dim x local_batch x seq_length x transformer_layers x 2 (bytes per activation, fp16)
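The activation formula above as code. The 12x factor and the 2-byte fp16 activation size are taken from the note; whether 12 is the right constant depends on the exact architecture, so treat it as a rough estimate:

```python
def activation_bytes(hidden_dim, local_batch, seq_len, n_layers,
                     factor=12, bytes_per_act=2):
    """Rough activation memory: factor ~12 covers input/proj/attention/
    nonlinearity activations per layer; 2 bytes assumes fp16."""
    return factor * hidden_dim * local_batch * seq_len * n_layers * bytes_per_act

# small config: b=2, s=512, h=1024, 24 layers
print(activation_bytes(1024, 2, 512, 24) / 1e9, "GB")
```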
-
params
- transformer_layers x 12 x hidden_dim^2 (MLP: two h x 4h projections = 8 h^2, attention Q/K/V/out = 4 h^2) x 2 bytes (fp16)? or x 4 (fp32)?
activations_per_layer / params_per_layer == local_batch x seq_len / hidden_dim == 2 x 512 / 1024 == 1???? seems high?
- only difference for big models is that hidden_dim is larger?
- batch size 12, hidden_dim 8192 for the 172B model on 400 GPUs, Table 9: https://arxiv.org/pdf/1910.02054
hmm, do I need to compare per layer? or assume FSDP?
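The param count and the activations/params ratio worked through in code. The 12 h^2 split (8 h^2 MLP + 4 h^2 attention) is from the note above; the neat part is that the 12, the layer count, and the bytes-per-element all cancel in the ratio:

```python
def param_count(hidden_dim, n_layers):
    """~12 h^2 params per transformer layer:
    MLP: two h x 4h projections = 8 h^2
    attention: Q, K, V, output projections = 4 h^2"""
    return 12 * hidden_dim * hidden_dim * n_layers

def act_over_param_ratio(local_batch, seq_len, hidden_dim):
    """activations / params per layer: the 12x factor, n_layers, and
    bytes-per-element cancel, leaving b * s / h."""
    return local_batch * seq_len / hidden_dim

print(act_over_param_ratio(2, 512, 1024))  # b=2, s=512, h=1024 -> 1.0
```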
- they discuss a bit here about design options https://github.com/bigscience-workshop/bigscience/blob/master/train/tr11-176B-ml/chronicles-prequel.md
FSDP forward pass:
  for layer_i in layers:
    all-gather full weights for layer_i
    forward pass for layer_i
    discard full weights for layer_i
FSDP backward pass:
  for layer_i in reversed(layers):
    all-gather full weights for layer_i
    backward pass for layer_i
    discard full weights for layer_i
    reduce-scatter gradients for layer_i
simple FSDP implementation with torch.compile: https://github.com/facebookresearch/capi/blob/main/fsdp.py
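A toy single-process sketch of the loop above (my own illustration, not the capi code): params are sharded across simulated ranks ZeRO-3 style, all-gathered per layer for compute, then freed, with gradients reduce-scattered back to shards:

```python
import numpy as np

WORLD = 4  # simulated ranks

def shard(full):
    """Split a flat parameter vector into WORLD equal shards."""
    return np.split(full, WORLD)

def all_gather(shards):
    """Temporarily materialize the full parameter on every rank."""
    return np.concatenate(shards)

def reduce_scatter(per_rank_grads):
    """Sum gradients across ranks; each rank keeps only its shard."""
    summed = np.sum(per_rank_grads, axis=0)
    return np.split(summed, WORLD)

# Two "layers", each a flat param vector sharded across ranks.
layers = [shard(np.ones(8) * (i + 1)) for i in range(2)]

# Forward: gather full weights per layer, use them, discard.
acts = []
for shards in layers:
    w = all_gather(shards)   # materialize full layer
    acts.append(w.sum())     # stand-in for the real forward
    del w                    # discard full weights

# Backward (reverse layer order): gather again, then reduce-scatter grads.
grad_shards = []
for shards in reversed(layers):
    w = all_gather(shards)
    per_rank = [np.ones_like(w) for _ in range(WORLD)]  # fake local grads
    grad_shards.append(reduce_scatter(per_rank))
    del w
```

The point of the pattern: peak memory holds one full layer plus all the shards, rather than the whole model, which is where the ZeRO-3 savings come from.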
-
TIL: LeCun init is N(0, 1/fan_in)? and then you don't need the 1/sqrt(d) scaling in scaled dot-product attention?
- Xavier/Glorot: std dev = sqrt(2 / (fan_in + fan_out))
- hmm
- https://claude.ai/chat/748db8c8-d94f-4b21-b11f-1754eeef2a39
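The two init schemes side by side (standard formulas; the scaled-dot-product claim above is the note's hypothesis and isn't tested here). Note they coincide when fan_in == fan_out, as in a square h x h projection:

```python
import math

def lecun_std(fan_in):
    # LeCun init: N(0, 1/fan_in) -> std = 1/sqrt(fan_in)
    return math.sqrt(1.0 / fan_in)

def xavier_std(fan_in, fan_out):
    # Xavier/Glorot: std = sqrt(2 / (fan_in + fan_out))
    return math.sqrt(2.0 / (fan_in + fan_out))

h = 1024
print(lecun_std(h))      # 1/32 = 0.03125
print(xavier_std(h, h))  # identical when fan_in == fan_out
```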
-
fused einsum might be a thing
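What such a fusion would target, sketched unfused in numpy: attention scores as a single einsum contraction. A fused kernel would combine this with the softmax/value contraction instead of materializing the full scores tensor (shapes here are made up for illustration):

```python
import numpy as np

b, heads, s, d = 2, 4, 16, 32
rng = np.random.default_rng(0)
q = rng.standard_normal((b, heads, s, d))
k = rng.standard_normal((b, heads, s, d))

# QK^T over the head dim as one einsum; unfused, this writes a
# (b, heads, s, s) scores tensor to memory.
scores = np.einsum("bhqd,bhkd->bhqk", q, k) / np.sqrt(d)
print(scores.shape)
```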