Use this dictionary whenever referring to LLM hyperparameters, shapes, and efficiency metrics across this project. It is a unified superset covering gpt‑oss, qwen3, qwen3‑moe, llama3, llama4, gemma3, and seed‑oss. When a feature is unused, assign the trivial setting (e.g., dense models: $E{=}e{=}1$, $N_{L,\text{moe}}{=}0$).
Unit key: “[-]” dimensionless; “[#]” count; “[features]” channel width; “[tokens]” token length; “[parameters]” parameter count; “[bits]” and “[bytes]”/“[bytes/elt]” storage; “[FLOPs]” operations; “[FLOP/s]”, “[bytes/s]”, “[tokens/s]” rates; “[s]” seconds.
- $N_L$ — [#] Transformer block (layer) count.
- $d$ — [features] Model hidden size (residual width).
- $h_q,\,h_{kv}$ — [#] Query heads and KV heads (GQA/MQA).
- $g_{\text{GQA}}{=}\frac{h_q}{h_{kv}}$ — [-] GQA grouping factor.
- $d_h$ — [features] Per‑head dim; usually $d = h_q d_h$.
- $d_q,\,d_k,\,d_v$ — [features] Optional per‑projection head dims when unequal.
- $p_{\text{attn}}$ — [-] Attention dropout probability.
- $\alpha_{\text{attn}}{=}\frac{1}{\sqrt{d_k}}$ — [-] Attention scaling factor.
- $W_Q\in\mathbb{R}^{d\times(h_q d_h)}$, $W_K\in\mathbb{R}^{d\times(h_{kv} d_h)}$, $W_V\in\mathbb{R}^{d\times(h_{kv} d_h)}$, $W_O\in\mathbb{R}^{(h_q d_h)\times d}$ — [parameters] Attention projection matrices.
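
A minimal sketch, in Python, of how these shapes and the GQA grouping relate; the dimensions are illustrative assumptions (llama3‑8B‑like), and the helper name `attn_shapes` is hypothetical.

```python
# Sketch: attention projection shapes and the GQA grouping factor.
# The numbers used below are illustrative assumptions, not quoted from any config.

def attn_shapes(d: int, h_q: int, h_kv: int, d_h: int):
    assert d == h_q * d_h, "expected d = h_q * d_h (otherwise declare d_q, d_k, d_v)"
    g_gqa = h_q // h_kv                      # g_GQA = h_q / h_kv
    shapes = {
        "W_Q": (d, h_q * d_h),
        "W_K": (d, h_kv * d_h),
        "W_V": (d, h_kv * d_h),
        "W_O": (h_q * d_h, d),
    }
    params = sum(rows * cols for rows, cols in shapes.values())
    return g_gqa, shapes, params

# Assumed llama3-8B-like dims: d=4096, h_q=32, h_kv=8, d_h=128.
print(attn_shapes(4096, 32, 8, 128))
```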
- $L_{\max}$ — [tokens] Maximum supported/trained context length.
- $L$ — [tokens] Cached context length at the current decode step.
- $S$ — [tokens] Sliding‑window/local attention span.
- $k_{\text{sinks}}$ — [tokens] Count of sink/pinned tokens.
- $d_{\text{rope}}$ — [features] Rotated channels for RoPE.
- $f_{\text{rope}}{=}\frac{d_{\text{rope}}}{d}$ — [-] RoPE channel fraction.
- $\theta_{\text{base}}$ — [-] RoPE base.
- $s_{\text{rope}}$ — [-] RoPE scaling factor.
- $s_{\text{ntk}}$ — [-] NTK‑aware scaling factor.
- $\mathcal{P}\in\{\text{RoPE},\text{ALiBi},\text{LE},\dots\}$ — [-] Positional scheme tag.
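
A minimal sketch of RoPE angle computation under the common convention of inverse frequencies $\theta_{\text{base}}^{-2i/d_{\text{rope}}}$ with linear position scaling by $s_{\text{rope}}$; the exact scaling scheme varies by family, so treat this as an assumption.

```python
import math

# Sketch of RoPE inverse frequencies with simple linear position scaling.
# theta_base, d_rope, s_rope follow this dictionary; the pos / s_rope rescaling
# is the common "linear" scheme, assumed here for illustration.

def rope_inv_freqs(theta_base: float, d_rope: int):
    return [theta_base ** (-2.0 * i / d_rope) for i in range(d_rope // 2)]

def rope_angles(pos: int, theta_base: float, d_rope: int, s_rope: float = 1.0):
    # Rotation angle for each rotated channel pair at token position `pos`.
    return [(pos / s_rope) * f for f in rope_inv_freqs(theta_base, d_rope)]

print(rope_angles(pos=128, theta_base=10000.0, d_rope=128, s_rope=1.0)[:4])
```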
- $d_{\text{ff}}$ — [features] MLP intermediate width.
- $r_{\text{ff}}{=}\frac{d_{\text{ff}}}{d}$ — [-] MLP expansion ratio.
- $g_{\text{up}}$ — [#] Up‑projection branches ($1$ = GeLU MLP, $2$ = SwiGLU/GeGLU).
- $f_{\text{act}}$ — [-] Activation (e.g., GeLU, SwiGLU).
- $p_{\text{mlp}}$ — [-] MLP dropout probability.
- $W_{\text{up}}^{(i)}\in\mathbb{R}^{d\times d_{\text{ff}}}$ for $i{=}1..g_{\text{up}}$, $W_{\text{down}}\in\mathbb{R}^{d_{\text{ff}}\times d}$ — [parameters] MLP projections.
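
A quick sanity check of the per‑layer MLP parameter count $g_{\text{up}}\,d\,d_{\text{ff}} + d_{\text{ff}}\,d$; the widths used are illustrative (llama3‑8B‑like), not quoted from a config.

```python
# Per-layer MLP parameters: g_up branches of (d x d_ff) plus one (d_ff x d) down-projection.
def mlp_params_per_layer(d: int, d_ff: int, g_up: int) -> int:
    return g_up * d * d_ff + d_ff * d

# SwiGLU example with assumed widths d=4096, d_ff=14336, g_up=2.
print(mlp_params_per_layer(4096, 14336, 2))   # ~176M parameters per layer
```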
- $\mathsf{Norm}\in\{\text{RMSNorm},\text{LayerNorm}\}$ — [-] Norm type.
- $\epsilon_{\text{norm}}$ — [-] Norm epsilon.
- $\mathbb{1}_{\text{prenorm}}$ — [-] 1 if pre‑norm, else 0.
- $s_{\text{res}}$ — [-] Residual scaling factor.
- $p_{\text{res}}$ — [-] Residual dropout probability.
- $V$ — [tokens] Vocabulary size.
- $d_{\text{emb}}$ — [features] Token embedding width.
- $\mathbb{1}_{\text{tie}}$ — [-] 1 if embeddings and LM head are tied.
- $n_{\text{special}}$ — [tokens] Count of special tokens.
- $W_{E}\in\mathbb{R}^{V\times d_{\text{emb}}}$ — [parameters] Token embedding matrix.
- $W_{\text{LM}}\in\mathbb{R}^{d\times V}$ — [parameters] LM head matrix.
- $E,\,e$ — [#] Total experts and active experts per token (top‑$e$).
- $N_{L,\text{moe}}$ — [#] Number of MoE layers.
- $d_{\text{moe}}$ — [features] Expert MLP width.
- $r_{\text{moe}}{=}\frac{d_{\text{moe}}}{d}$ — [-] Expert expansion ratio.
- $k_{\text{top}}$ — [#] Router top‑$k$ (usually $e$).
- $C_{\text{cap}}$ — [-] Capacity factor per expert.
- $p_{\text{drop,moe}}$ — [-] Token drop probability on overflow.
- $\tau_{\text{router}}$ — [-] Router temperature.
- $\lambda_{\text{load}}$ — [-] Load‑balancing loss weight.
- $\mathbb{1}_{\text{shared}}$ — [-] 1 if a shared/global expert is present.
- $W_{\text{up}}^{(e)}\in\mathbb{R}^{d\times d_{\text{moe}}}$, $W_{\text{down}}^{(e)}\in\mathbb{R}^{d_{\text{moe}}\times d}$ — [parameters] Per‑expert MLP projections.
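
A minimal router sketch (softmax with temperature $\tau_{\text{router}}$, then top‑$k$ selection and gate renormalization); this illustrates a common routing scheme as an assumption, not any specific model's router, and it omits the capacity and load‑balancing machinery governed by $C_{\text{cap}}$ and $\lambda_{\text{load}}$.

```python
import math

# Sketch: pick top-k experts per token from router logits, with temperature tau.
def route_token(logits, k_top: int, tau: float = 1.0):
    # Softmax over the E expert logits at temperature tau_router.
    scaled = [x / tau for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    probs = [x / z for x in exps]
    # Top-k expert indices and their renormalized gate weights.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k_top]
    gate_sum = sum(probs[i] for i in top)
    return [(i, probs[i] / gate_sum) for i in top]

print(route_token([0.2, 1.5, -0.3, 0.9], k_top=2, tau=1.0))
```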
- $b$ — [bytes/elt] Bytes per element for a dtype (e.g., BF16 → 2).
- $b_w,\,b_a,\,b_{kv}$ — [bytes/elt] Bytes per element for weights, activations, KV cache.
- $b_m,\,b_v$ — [bytes/elt] Bytes per element for Adam first/second moments.
- $q_w,\,q_a,\,q_{kv}$ — [bits] Quantization bit‑widths for weights, activations, KV cache.
- $g_q$ — [#] Quantization group size.
- $\mathbb{1}_{\text{sym}}$ — [-] 1 if symmetric quantization.
- $\mathbb{1}_{\text{zp}}$ — [-] 1 if zero‑points are used.
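
A sketch of effective bytes per quantized element; the overhead model (one 16‑bit scale per group of $g_q$ elements, plus a $q_w$‑bit zero‑point when $\mathbb{1}_{\text{zp}}{=}1$) is a common convention assumed here for illustration, not a fixed rule.

```python
# Sketch: effective bytes per quantized weight element, assuming a 16-bit scale
# per group of g_q elements and, if zero-points are used, one q_w-bit zero-point
# per group. These overhead conventions are illustrative assumptions.
def effective_bytes_per_elt(q_w: int, g_q: int, use_zero_point: bool) -> float:
    payload_bits = q_w
    overhead_bits = 16 + (q_w if use_zero_point else 0)   # per group
    return (payload_bits + overhead_bits / g_q) / 8.0

# 4-bit weights, group size 128, asymmetric (with zero-points):
print(effective_bytes_per_elt(q_w=4, g_q=128, use_zero_point=True))  # ~0.52 bytes/elt
```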
- $B_{\text{seq}}$ — [#] Sequences per microbatch (per device).
- $L_{\text{train}}$ — [tokens] Training sequence length.
- $A$ — [#] Gradient accumulation steps.
- $n_{\text{GPU}}$ — [#] Number of GPUs.
- $\mathcal{B}_{\text{tok}}{=}B_{\text{seq}}\,L_{\text{train}}\,A\,n_{\text{GPU}}$ — [tokens/step] Global tokens per optimizer step.
- $S_{\text{steps}}$ — [#] Optimizer steps.
- $T_{\text{train}}{=}\mathcal{B}_{\text{tok}}\,S_{\text{steps}}$ — [tokens] Total pretraining tokens.
- $\eta$ — [-] Base/peak learning rate.
- $\beta_1,\,\beta_2,\,\epsilon_{\text{adam}}$ — [-] Adam/AdamW hyperparameters.
- $\lambda$ — [-] Weight decay.
- $S_{\text{warm}}$ — [#] Warmup steps.
- $c_{\text{grad}}$ — [-] Gradient‑norm clip threshold.
- $p_{\text{label}}$ — [-] Label smoothing probability.
- $\mathcal{L}$ — [-] Training objective (e.g., cross‑entropy).
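
A small worked example of $\mathcal{B}_{\text{tok}}$ and $T_{\text{train}}$; all numbers are assumed for illustration.

```python
# Sketch: global batch size in tokens and total pretraining tokens.
def tokens_per_step(b_seq: int, l_train: int, accum: int, n_gpu: int) -> int:
    return b_seq * l_train * accum * n_gpu        # B_tok

def total_training_tokens(b_tok: int, s_steps: int) -> int:
    return b_tok * s_steps                        # T_train

b_tok = tokens_per_step(b_seq=2, l_train=8192, accum=8, n_gpu=64)
print(b_tok)                                          # 8,388,608 tokens per optimizer step
print(total_training_tokens(b_tok, s_steps=250_000))  # ~2.1e12 tokens
```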
- $D_p,\,T_p,\,P_p,\,S_p,\,E_p$ — [#] Data, tensor, pipeline, sequence, expert parallel degrees.
- $F_{\text{peak}}$ — [FLOP/s] GPU peak tensor throughput (BF16/FP16).
- $\text{BW}_{\text{HBM}}$ — [bytes/s] On‑device HBM bandwidth.
- $\text{BW}_{\text{NVLink}}$ — [bytes/s] Node‑local interconnect bandwidth.
- $\text{BW}_{\text{NIC}}$ — [bytes/s] Cross‑node network bandwidth.
- $\eta_{\text{compute}},\,\eta_{\text{bw}}$ — [-] Achieved compute and bandwidth utilizations.
- $P$ — [parameters] Total parameter count.
- $P_{\text{tok}}$ — [parameters] Embedding + LM head params (if tied, count once).
- $P_{\text{attn,per\_layer}} \approx d\,(h_q d_h + 2h_{kv} d_h) + d\,(h_q d_h)$ — [parameters] Attention per layer.
- $P_{\text{mlp,per\_layer}} \approx g_{\text{up}}\,d\,d_{\text{ff}} + d_{\text{ff}}\,d$ — [parameters] MLP per layer.
- $P_{\text{dense}} \approx N_L\,(P_{\text{attn,per\_layer}}+P_{\text{mlp,per\_layer}}) + P_{\text{tok}}$ — [parameters] Dense total.
- $P_{\text{expert}} \approx g_{\text{up}}\,d\,d_{\text{moe}} + d_{\text{moe}}\,d$ — [parameters] Per‑expert params.
- $P_{\text{moe\_all}} \approx N_{L,\text{moe}}\cdot E\cdot P_{\text{expert}}$ — [parameters] All experts across MoE layers.
- $P_{\text{active}} \approx P_{\text{dense}} + N_{L,\text{moe}}\cdot e\cdot P_{\text{expert}}$ — [parameters] Active params per token.
- $P_{\text{kv\_state}} = 2\,N_L\,h_{kv}\,d_h$ — [features/token] KV features stored per token (across layers).
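
A sketch that evaluates the approximations above; the dims are llama3‑8B‑like assumptions, and norm/bias parameters are ignored.

```python
# Sketch: approximate parameter counts from the formulas above.
# Illustrative dims only; norm and bias parameters are ignored (they are << 1%).
def p_attn_per_layer(d, h_q, h_kv, d_h):
    return d * (h_q * d_h + 2 * h_kv * d_h) + d * (h_q * d_h)

def p_mlp_per_layer(d, d_ff, g_up):
    return g_up * d * d_ff + d_ff * d

def p_dense(n_layers, d, h_q, h_kv, d_h, d_ff, g_up, vocab, tied=True):
    p_tok = vocab * d * (1 if tied else 2)        # embedding (+ LM head if untied)
    return n_layers * (p_attn_per_layer(d, h_q, h_kv, d_h)
                       + p_mlp_per_layer(d, d_ff, g_up)) + p_tok

# Assumed llama3-8B-like shape: 32 layers, d=4096, 32/8 heads, d_h=128,
# d_ff=14336, SwiGLU (g_up=2), V=128256, untied LM head.
print(p_dense(32, 4096, 32, 8, 128, 14336, 2, 128256, tied=False) / 1e9)  # ~8.0 B
```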
- $M_{\text{param}} = P\,b_w$ — [bytes] Resident model weights (per replica).
- $M_{\text{opt}} \approx P\,(b_w + b_m + b_v)$ — [bytes] Adam optimizer states.
- $M_{\text{grad}} = P\,b_w$ — [bytes] Gradient memory (if not sharded).
- $M_{\text{act}} \approx \kappa\,B_{\text{seq}}\,L_{\text{train}}\,d\,b_a$ — [bytes] Activations (training); $\kappa$ depends on checkpointing/attention.
- $B$ — [#] Inference batch size (concurrent sequences).
- $M_{\text{KV}} = B\,L\,N_L\,(2\,h_{kv}\,d_h)\,b_{kv}$ — [bytes] KV cache at decode.
- $M_{\text{KV,win}} = B\,\min(L,S)\,N_L\,(2\,h_{kv}\,d_h)\,b_{kv}$ — [bytes] KV cache with sliding window.
- $M_{\text{embed}} = V\,d_{\text{emb}}\,b_w$ — [bytes] Token embeddings (LM head adds more if untied).
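
A sketch of the KV‑cache formula, with an optional sliding window; the batch, context, and dtype choices are assumptions for illustration.

```python
# Sketch: KV-cache size at decode time, with and without a sliding window.
def kv_cache_bytes(batch, ctx_len, n_layers, h_kv, d_h, bytes_per_elt, window=None):
    span = ctx_len if window is None else min(ctx_len, window)
    return batch * span * n_layers * (2 * h_kv * d_h) * bytes_per_elt

# Assumed llama3-8B-like cache: 32 layers, 8 KV heads, d_h=128, BF16 (2 bytes).
gib = kv_cache_bytes(batch=8, ctx_len=32_768, n_layers=32, h_kv=8, d_h=128,
                     bytes_per_elt=2) / 2**30
print(f"{gib:.1f} GiB")   # 32.0 GiB of KV cache for 8 x 32k-token sequences
```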
- $F_{\text{proj}} \approx 2\,d\,(h_q d_h + 2h_{kv} d_h) + 2\,(h_q d_h)\,d$ — [FLOPs] QKV + output projections.
- $F_{\text{attn}}(L) \approx 4\,L\,h_q\,d_h$ — [FLOPs] $QK^\top$ and $AV$.
- $F_{\text{mlp}} \approx 2\,(g_{\text{up}}+1)\,d\,d_{\text{ff}}$ — [FLOPs] MLP GEMMs.
- $F_{\text{router}} \approx 2\,d\,E$ — [FLOPs] Router (MoE layer).
- $F_{\text{decode}}(L) \approx N_L\left(F_{\text{proj}} + F_{\text{attn}}(L) + F_{\text{mlp}}\right)$ — [FLOPs] Dense stack (add $N_{L,\text{moe}}\,F_{\text{router}}$ if MoE).
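
A sketch of per‑token decode FLOPs from the formulas above (dense stack); the dims and context length are illustrative assumptions.

```python
# Sketch: per-token decode FLOPs, F_decode(L), for a dense stack.
def decode_flops(ctx_len, n_layers, d, h_q, h_kv, d_h, d_ff, g_up):
    f_proj = 2 * d * (h_q * d_h + 2 * h_kv * d_h) + 2 * (h_q * d_h) * d
    f_attn = 4 * ctx_len * h_q * d_h
    f_mlp = 2 * (g_up + 1) * d * d_ff
    return n_layers * (f_proj + f_attn + f_mlp)

# Assumed llama3-8B-like dims at an 8k-token cached context:
print(decode_flops(8192, 32, 4096, 32, 8, 128, 14336, 2) / 1e9)  # ~18 GFLOPs/token
```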
- $F_{\text{prefill}}(T_{\text{ctx}}) \sim O\!\left(N_L\,T_{\text{ctx}}\,d^2 + N_L\,T_{\text{ctx}}^2\,h_q d_h\right)$ — [FLOPs] Quadratic attention term dominates without windowing.
- $I \equiv \frac{\text{FLOPs}}{\text{bytes moved}}$ — [FLOPs/byte] Roofline intensity.
- $\mathcal{T}_{\text{tok/s}}$ — [tokens/s] Generation throughput.
- $t_{\text{P50}},\,t_{\text{P95}}$ — [s] Latency percentiles (first‑token/per‑token).
- $F_{\text{achieved}}{=}\eta_{\text{compute}}\,F_{\text{peak}}$ — [FLOP/s] Effective compute.
- $\text{BW}_{\text{achieved}}{=}\eta_{\text{bw}}\,\text{BW}_{\text{HBM}}$ — [bytes/s] Effective bandwidth.
- $\rho_{\text{comm}}$ — [-] Communication fraction of step time.
- $\phi_{\text{cache}}{=}\frac{M_{\text{KV}}}{\text{HBM capacity}}$ — [-] KV cache fraction of HBM.
- $\chi \equiv \frac{F_{\text{attn}}(L)}{F_{\text{mlp}}}$ — [-] Attention/MLP compute ratio.
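
A back‑of‑envelope roofline bound on single‑stream decode throughput, assuming bytes moved per token is roughly active weights plus KV read, with H100‑class ceilings used as placeholder hardware numbers; this is a sketch, not a benchmark.

```python
# Sketch: roofline-style bound on single-stream decode throughput.
# Bytes moved per token ~ active weights + KV-cache read; FLOPs from F_decode.
# A back-of-envelope model under stated assumptions, not a measured result.
def decode_tok_per_s_bound(flops_per_tok, bytes_per_tok,
                           f_peak, bw_hbm, eta_compute=0.5, eta_bw=0.8):
    t_compute = flops_per_tok / (eta_compute * f_peak)   # compute-bound time per token
    t_memory = bytes_per_tok / (eta_bw * bw_hbm)         # bandwidth-bound time per token
    return 1.0 / max(t_compute, t_memory)

# Assumed: ~18 GFLOPs/token, ~17 GB moved/token (BF16 weights + KV),
# H100-class peak ~989e12 FLOP/s BF16 and ~3.35e12 bytes/s HBM.
print(decode_tok_per_s_bound(18e9, 17e9, 989e12, 3.35e12))  # ~158 tok/s, memory-bound
```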
- $T$ — [-] Softmax temperature.
- $k$ — [#] Top‑$k$ cutoff.
- $p$ — [-] Nucleus (top‑$p$) mass.
- $\rho$ — [-] Repetition penalty factor.
- $\pi_{\text{presence}},\,\phi_{\text{frequency}}$ — [-] Presence/frequency penalty strengths.
- $b_{\text{beam}}$ — [#] Beam width.
- $L_{\text{gen}}$ — [tokens] Target/generated tokens per request.
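
A pure‑Python sketch of temperature, top‑$k$, and top‑$p$ filtering over next‑token logits; it is illustrative, not any library's sampling API.

```python
import math, random

# Sketch: temperature + top-k + top-p (nucleus) filtering, then sampling.
def sample_next(logits, temperature=1.0, top_k=0, top_p=1.0):
    scaled = [x / max(temperature, 1e-8) for x in logits]
    m = max(scaled)
    weights = [math.exp(x - m) for x in scaled]
    z = sum(weights)
    items = sorted(((w / z, i) for i, w in enumerate(weights)), reverse=True)
    if top_k > 0:
        items = items[:top_k]                 # keep the k highest-probability tokens
    if top_p < 1.0:                           # keep the smallest prefix with mass >= p
        kept, mass = [], 0.0
        for prob, idx in items:
            kept.append((prob, idx))
            mass += prob
            if mass >= top_p:
                break
        items = kept
    total = sum(prob for prob, _ in items)
    r = random.random() * total
    for prob, idx in items:
        r -= prob
        if r <= 0:
            return idx
    return items[-1][1]

print(sample_next([2.0, 1.0, 0.5, -1.0], temperature=0.8, top_k=3, top_p=0.9))
```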
- $g_{\text{GQA}}=\frac{h_q}{h_{kv}}$; $r_{\text{ff}}=\frac{d_{\text{ff}}}{d}$; $r_{\text{moe}}=\frac{d_{\text{moe}}}{d}$ — [-] Ratios.
- $P_{\text{active}} \approx P_{\text{dense}} + N_{L,\text{moe}}\cdot e\cdot P_{\text{expert}}$ — [parameters] Active params per token (MoE).
- $K\!V_{\text{feat}} = 2\,h_{kv}\,d_h$ — [features/token/layer] KV features per token per layer.
- $M_{\text{KV}} = B\,L\,N_L\,(2\,h_{kv}\,d_h)\,b_{kv}$; $M_{\text{KV,win}} = B\,\min(L,S)\,N_L\,(2\,h_{kv}\,d_h)\,b_{kv}$ — [bytes] KV memory.
- If $h_{kv}{=}h_q$ and $h_q d_h{=}d$: $P_{\text{attn,per\_layer}}\!\approx\!4d^2$, $F_{\text{proj}}\!\approx\!8d^2$, $F_{\text{attn}}(L)\!\approx\!4Ld$, $F_{\text{mlp}}\!\approx\!2(g_{\text{up}}+1)\,d\,d_{\text{ff}}$ — [parameters]/[FLOPs].
- $T_{\text{train}}=\mathcal{B}_{\text{tok}}\,S_{\text{steps}}$ — [tokens] Training tokens.
- $\text{Perf} \le \min\!\big(F_{\text{achieved}},\; I\,\text{BW}_{\text{achieved}}\big)$ — [FLOP/s] Roofline bound.
- Dense families (llama3/4, gemma3, seed‑oss): $E{=}e{=}1$, $N_{L,\text{moe}}{=}0$; specify $h_q, h_{kv}$ (GQA if $h_{kv}{<}h_q$), $\mathcal{P}$, $L_{\max}$, $r_{\text{ff}}$, $g_{\text{up}}$, $\mathsf{Norm}$, $\epsilon_{\text{norm}}$, $(b_w, b_a, b_{kv})$.
- qwen3 (dense/long‑context options): as dense, with explicit $(s_{\text{rope}}, s_{\text{ntk}})$ and optional window $S$.
- qwen3‑moe, gpt‑oss (MoE): additionally specify $E$, $e$, $N_{L,\text{moe}}$, $d_{\text{moe}}$ (or $r_{\text{moe}}$), and router knobs $(k_{\text{top}}, C_{\text{cap}}, p_{\text{drop,moe}}, \tau_{\text{router}}, \lambda_{\text{load}}, \mathbb{1}_{\text{shared}})$; a sketch of both layouts follows this list.
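
A sketch of how these family settings might be recorded as plain dictionaries; the field names and values are illustrative assumptions, not official configs.

```python
# Illustrative dense vs MoE parameter records using this dictionary's symbols.
# Values are placeholders, not quoted from any released config.
dense_family = {
    "N_L": 32, "d": 4096, "h_q": 32, "h_kv": 8, "d_h": 128,   # GQA: h_kv < h_q
    "P_scheme": "RoPE", "L_max": 131_072,
    "r_ff": 3.5, "g_up": 2, "Norm": "RMSNorm", "eps_norm": 1e-5,
    "bytes": {"b_w": 2, "b_a": 2, "b_kv": 2},                 # BF16 everywhere
    "E": 1, "e": 1, "N_L_moe": 0,                             # trivial MoE settings
}

moe_family = {
    **dense_family,
    "E": 64, "e": 4, "N_L_moe": 32, "r_moe": 0.5,
    "router": {"k_top": 4, "C_cap": 1.25, "p_drop_moe": 0.0,
               "tau_router": 1.0, "lambda_load": 0.01, "shared_expert": False},
}
print(moe_family["E"], moe_family["e"])
```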
- Ensure $d = h_q\,d_h$ (or declare $d_q, d_k, d_v$).
- Always report $L_{\max}$ and the current $L$ for decode metrics.
- With GQA/MQA, include $g_{\text{GQA}}$ and $h_{kv}$ in KV/FLOP formulas.
- For MoE, report both $P$ and $P_{\text{active}}$.
- Pair utilizations with their ceilings: $(\eta_{\text{compute}}, F_{\text{peak}})$ and $(\eta_{\text{bw}}, \text{BW}_{\text{HBM}})$.
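
A lightweight consistency check applying this checklist; the field names follow the illustrative record format sketched above, and the helper `check_record` is hypothetical.

```python
# Sketch: sanity-check a parameter record against the checklist above.
# Field names follow the illustrative record format assumed earlier on this page.
def check_record(cfg: dict) -> list:
    issues = []
    if cfg["d"] != cfg["h_q"] * cfg["d_h"] and "d_q" not in cfg:
        issues.append("d != h_q * d_h and no per-projection head dims declared")
    if cfg["h_q"] % cfg["h_kv"] != 0:
        issues.append("h_q not divisible by h_kv (GQA grouping ill-defined)")
    if cfg.get("N_L_moe", 0) > 0 and cfg.get("e", 1) > cfg.get("E", 1):
        issues.append("active experts e exceeds total experts E")
    return issues

print(check_record({"d": 4096, "h_q": 32, "h_kv": 8, "d_h": 128}))  # -> []
```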