A100 - spec 312 TFLOPS peak (BF16/FP16 with tensor cores)
40GB or 80GB HBM, 40MB L2 cache
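Back-of-envelope (my own arithmetic, not from the spec sheet): how many fp16 params fit in HBM if you count weights alone, ignoring optimizer state, gradients, and activations:

```python
def max_fp16_params(hbm_gb):
    """fp16 weights only: 2 bytes per parameter. Ignores optimizer
    state, gradients, and activations, so real capacity is far lower."""
    return hbm_gb * 1e9 / 2

for gb in (40, 80):
    print(f"{gb}GB HBM -> ~{max_fp16_params(gb) / 1e9:.0f}B fp16 params (weights alone)")
```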
-
ZeRO-3: should read the DeepSpeed (ZeRO) paper; looks like they used model parallelism (Megatron) as the baseline, with batch size 2?
-
BLOOM Megatron-DeepSpeed blog: https://huggingface.co/blog/bloom-megatron-deepspeed
-
activations
- 12 (input/proj/attention/nonlin) x hidden_dim x local_batch x seq_length x transformer_layers x 2 (bytes per activation, fp16)
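The activation formula above as code. The 12x factor and the 2-byte fp16 activation size are taken from the note; whether 12 is the right constant depends on the exact architecture, so treat it as a rough estimate:

```python
def activation_bytes(hidden_dim, local_batch, seq_len, n_layers,
                     factor=12, bytes_per_act=2):
    """Rough activation memory: factor ~12 covers input/proj/attention/
    nonlinearity activations per layer; 2 bytes assumes fp16."""
    return factor * hidden_dim * local_batch * seq_len * n_layers * bytes_per_act

# small config: b=2, s=512, h=1024, 24 layers
print(activation_bytes(1024, 2, 512, 24) / 1e9, "GB")
```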
-
params
- transformer_layers x 12 x hidden_dim^2 (MLP: two h x 4h projections = 8 h^2, attention Q/K/V/out = 4 h^2) x 2 bytes (fp16)? or x 4 (fp32)?
activations_per_layer / params_per_layer == local_batch x seq_len / hidden_dim == 2 x 512 / 1024 == 1???? seems high?
- only difference for big models is that hidden_dim is larger?
- batch size 12, hidden_dim 8192 for the 172B model on 400 GPUs, Table 9: https://arxiv.org/pdf/1910.02054
hmm, do I need to compare per layer? or assume FSDP?
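The param count and the activations/params ratio worked through in code. The 12 h^2 split (8 h^2 MLP + 4 h^2 attention) is from the note above; the neat part is that the 12, the layer count, and the bytes-per-element all cancel in the ratio:

```python
def param_count(hidden_dim, n_layers):
    """~12 h^2 params per transformer layer:
    MLP: two h x 4h projections = 8 h^2
    attention: Q, K, V, output projections = 4 h^2"""
    return 12 * hidden_dim * hidden_dim * n_layers

def act_over_param_ratio(local_batch, seq_len, hidden_dim):
    """activations / params per layer: the 12x factor, n_layers, and
    bytes-per-element cancel, leaving b * s / h."""
    return local_batch * seq_len / hidden_dim

print(act_over_param_ratio(2, 512, 1024))  # b=2, s=512, h=1024 -> 1.0
```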
- they discuss a bit here about design options https://github.com/bigscience-workshop/bigscience/blob/master/train/tr11-176B-ml/chronicles-prequel.md
FSDP forward pass:
  for layer_i in layers:
    all-gather full weights for layer_i
    forward pass for layer_i
    discard full weights for layer_i
FSDP backward pass:
  for layer_i in reversed(layers):
    all-gather full weights for layer_i
    backward pass for layer_i
    discard full weights for layer_i
    reduce-scatter gradients for layer_i
simple FSDP implementation with torch.compile: https://github.com/facebookresearch/capi/blob/main/fsdp.py
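A toy single-process sketch of the loop above (my own illustration, not the capi code): params are sharded across simulated ranks ZeRO-3 style, all-gathered per layer for compute, then freed, with gradients reduce-scattered back to shards:

```python
import numpy as np

WORLD = 4  # simulated ranks

def shard(full):
    """Split a flat parameter vector into WORLD equal shards."""
    return np.split(full, WORLD)

def all_gather(shards):
    """Temporarily materialize the full parameter on every rank."""
    return np.concatenate(shards)

def reduce_scatter(per_rank_grads):
    """Sum gradients across ranks; each rank keeps only its shard."""
    summed = np.sum(per_rank_grads, axis=0)
    return np.split(summed, WORLD)

# Two "layers", each a flat param vector sharded across ranks.
layers = [shard(np.ones(8) * (i + 1)) for i in range(2)]

# Forward: gather full weights per layer, use them, discard.
acts = []
for shards in layers:
    w = all_gather(shards)   # materialize full layer
    acts.append(w.sum())     # stand-in for the real forward
    del w                    # discard full weights

# Backward (reverse layer order): gather again, then reduce-scatter grads.
grad_shards = []
for shards in reversed(layers):
    w = all_gather(shards)
    per_rank = [np.ones_like(w) for _ in range(WORLD)]  # fake local grads
    grad_shards.append(reduce_scatter(per_rank))
    del w
```

The point of the pattern: peak memory holds one full layer plus all the shards, rather than the whole model, which is where the ZeRO-3 savings come from.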
-
TIL: LeCun init is N(0, 1/fan_in)? and then you don't need the 1/sqrt(d) scaling in scaled dot-product attention?
- Xavier/Glorot: std dev = sqrt(2 / (fan_in + fan_out))
- hmm
- https://claude.ai/chat/748db8c8-d94f-4b21-b11f-1754eeef2a39
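The two init schemes side by side (standard formulas; the scaled-dot-product claim above is the note's hypothesis and isn't tested here). Note they coincide when fan_in == fan_out, as in a square h x h projection:

```python
import math

def lecun_std(fan_in):
    # LeCun init: N(0, 1/fan_in) -> std = 1/sqrt(fan_in)
    return math.sqrt(1.0 / fan_in)

def xavier_std(fan_in, fan_out):
    # Xavier/Glorot: std = sqrt(2 / (fan_in + fan_out))
    return math.sqrt(2.0 / (fan_in + fan_out))

h = 1024
print(lecun_std(h))      # 1/32 = 0.03125
print(xavier_std(h, h))  # identical when fan_in == fan_out
```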
-
fused einsum might be a thing
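What such a fusion would target, sketched unfused in numpy: attention scores as a single einsum contraction. A fused kernel would combine this with the softmax/value contraction instead of materializing the full scores tensor (shapes here are made up for illustration):

```python
import numpy as np

b, heads, s, d = 2, 4, 16, 32
rng = np.random.default_rng(0)
q = rng.standard_normal((b, heads, s, d))
k = rng.standard_normal((b, heads, s, d))

# QK^T over the head dim as one einsum; unfused, this writes a
# (b, heads, s, s) scores tensor to memory.
scores = np.einsum("bhqd,bhkd->bhqk", q, k) / np.sqrt(d)
print(scores.shape)
```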