LLM Symbol Dictionary

Use this dictionary whenever referring to LLM hyperparameters, shapes, and efficiency metrics across this project. It is a unified superset covering gpt‑oss, qwen3, qwen3‑moe, llama3, llama4, gemma3, and seed‑oss. When a feature is unused, assign the trivial setting (e.g., dense models: $E{=}e{=}1$, $N_{L,\text{moe}}{=}0$; MQA: $h_{kv}{=}1$; tied embeddings: $\mathbb{1}_{\text{tie}}{=}1$).

Unit key: “[-]” dimensionless; “[#]” count; “[features]” channel width; “[tokens]” token length; “[parameters]” parameter count; “[bits]”, “[bytes]”, “[bytes/elt]” storage; “[FLOPs]”, “[FLOP/s]”, “[bytes/s]”, “[tokens/s]”, “[s]” compute, rates, and time.

1) Core transformer topology

  • $N_L$ — [#] Transformer block (layer) count.
  • $d$ — [features] Model hidden size (residual width).
  • $h_q,\,h_{kv}$ — [#] Query heads and KV heads (GQA/MQA).
  • $g_{\text{GQA}}{=}\frac{h_q}{h_{kv}}$ — [-] GQA grouping factor.
  • $d_h$ — [features] Per‑head dim; usually $d=h_q d_h$.
  • $d_q,\,d_k,\,d_v$ — [features] Optional per‑projection head dims when unequal.
  • $p_{\text{attn}}$ — [-] Attention dropout probability.
  • $\alpha_{\text{attn}}{=}\frac{1}{\sqrt{d_k}}$ — [-] Attention scaling factor.

Attention projection shapes (per layer)

  • $W_Q\in\mathbb{R}^{d\times(h_q d_h)}$, $W_K\in\mathbb{R}^{d\times(h_{kv} d_h)}$, $W_V\in\mathbb{R}^{d\times(h_{kv} d_h)}$, $W_O\in\mathbb{R}^{(h_q d_h)\times d}$ — [parameters] Projection matrices.
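As a quick check on these shapes, here is a minimal Python sketch (the concrete values for $d$, $h_q$, $h_{kv}$, $d_h$ are illustrative assumptions, not taken from any specific model) counting per-layer attention parameters under GQA:

```python
# Per-layer attention parameter count under GQA, following the projection
# shapes above. The config values are illustrative, not from a real model.
d, h_q, h_kv, d_h = 4096, 32, 8, 128      # assumes d = h_q * d_h

g_gqa = h_q // h_kv                        # g_GQA = h_q / h_kv
p_q = d * (h_q * d_h)                      # W_Q : d x (h_q * d_h)
p_k = d * (h_kv * d_h)                     # W_K : d x (h_kv * d_h)
p_v = d * (h_kv * d_h)                     # W_V : d x (h_kv * d_h)
p_o = (h_q * d_h) * d                      # W_O : (h_q * d_h) x d
p_attn_per_layer = p_q + p_k + p_v + p_o

print(f"g_GQA = {g_gqa}, P_attn_per_layer = {p_attn_per_layer:,}")
```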

2) Positional encoding & attention span

  • $L_{\max}$ — [tokens] Maximum supported/trained context length.
  • $L$ — [tokens] Cached context length at current decode step.
  • $S$ — [tokens] Sliding‑window/local attention span.
  • $k_{\text{sinks}}$ — [tokens] Count of sink/pinned tokens.
  • $d_{\text{rope}}$ — [features] Rotated channels for RoPE.
  • $f_{\text{rope}}{=}\frac{d_{\text{rope}}}{d}$ — [-] RoPE channel fraction.
  • $\theta_{\text{base}}$ — [-] RoPE base.
  • $s_{\text{rope}}$ — [-] RoPE scaling factor.
  • $s_{\text{ntk}}$ — [-] NTK‑aware scaling factor.
  • $\mathcal{P}\in\{\text{RoPE},\text{ALiBi},\text{LE},\dots\}$ — [-] Positional scheme tag.
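For the RoPE symbols, a minimal sketch of how $\theta_{\text{base}}$, $d_{\text{rope}}$, and a linear scaling factor $s_{\text{rope}}$ interact, using the standard inverse-frequency construction. The values are illustrative, and dividing positions by $s_{\text{rope}}$ (position interpolation) is one common convention, not the only one:

```python
# Standard RoPE inverse frequencies over the rotated channels d_rope,
# with optional linear position scaling s_rope. Values are illustrative.
d_rope, theta_base, s_rope = 128, 10_000.0, 1.0

inv_freq = [theta_base ** (-2 * i / d_rope) for i in range(d_rope // 2)]

def rope_angles(position: int) -> list[float]:
    """Rotation angles for one token position (position scaled by 1 / s_rope)."""
    return [(position / s_rope) * f for f in inv_freq]

print([round(a, 4) for a in rope_angles(42)[:4]])  # first few angles at position 42
```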

3) MLP block (dense)

  • $d_{\text{ff}}$ — [features] MLP intermediate width.
  • $r_{\text{ff}}{=}\frac{d_{\text{ff}}}{d}$ — [-] MLP expansion ratio.
  • $g_{\text{up}}$ — [#] Up‑projection branches ($1$=GeLU MLP, $2$=SwiGLU/GeGLU).
  • $f_{\text{act}}$ — [-] Activation (e.g., GeLU, SwiGLU).
  • $p_{\text{mlp}}$ — [-] MLP dropout probability.

MLP projection shapes (per layer)

  • $W_{\text{up}}^{(i)}\in\mathbb{R}^{d\times d_{\text{ff}}}$ for $i{=}1..g_{\text{up}}$, $W_{\text{down}}\in\mathbb{R}^{d_{\text{ff}}\times d}$ — [parameters] MLP projections.
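A matching sketch for the dense MLP block, counting per-layer parameters from $g_{\text{up}}$ and $d_{\text{ff}}$ (illustrative values; biases ignored, as in the formulas of section 10):

```python
# Per-layer MLP parameter count for a gated (SwiGLU-style) block with g_up = 2.
d, d_ff, g_up = 4096, 14336, 2             # illustrative; r_ff = d_ff / d

p_up = g_up * d * d_ff                      # g_up up-projections W_up^(i): d x d_ff
p_down = d_ff * d                           # down-projection W_down: d_ff x d
p_mlp_per_layer = p_up + p_down             # matches P_mlp_per_layer in section 10

print(f"r_ff = {d_ff / d:.2f}, P_mlp_per_layer = {p_mlp_per_layer:,}")
```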

4) Normalization & residuals

  • $\mathsf{Norm}\in\{\text{RMSNorm},\text{LayerNorm}\}$ — [-] Norm type.
  • $\epsilon_{\text{norm}}$ — [-] Norm epsilon.
  • $\mathbb{1}_{\text{prenorm}}$ — [-] 1 if pre‑norm, else 0.
  • $s_{\text{res}}$ — [-] Residual scaling factor.
  • $p_{\text{res}}$ — [-] Residual dropout probability.

5) Embeddings & LM head

  • $V$ — [tokens] Vocabulary size.
  • $d_{\text{emb}}$ — [features] Token embedding width.
  • $\mathbb{1}_{\text{tie}}$ — [-] 1 if embeddings and LM head are tied.
  • $n_{\text{special}}$ — [tokens] Count of special tokens.
  • $W_{E}\in\mathbb{R}^{V\times d_{\text{emb}}}$ — [parameters] Token embedding matrix.
  • $W_{\text{LM}}\in\mathbb{R}^{d\times V}$ — [parameters] LM head matrix.

6) Mixture‑of‑Experts (MoE)

  • $E,\,e$ — [#] Total experts and active experts per token (top‑$e$).
  • $N_{L,\text{moe}}$ — [#] Number of MoE layers.
  • $d_{\text{moe}}$ — [features] Expert MLP width.
  • $r_{\text{moe}}{=}\frac{d_{\text{moe}}}{d}$ — [-] Expert expansion ratio.
  • $k_{\text{top}}$ — [#] Router top‑$k$ (usually $e$).
  • $C_{\text{cap}}$ — [-] Capacity factor per expert.
  • $p_{\text{drop,moe}}$ — [-] Token drop probability on overflow.
  • $\tau_{\text{router}}$ — [-] Router temperature.
  • $\lambda_{\text{load}}$ — [-] Load‑balancing loss weight.
  • $\mathbb{1}_{\text{shared}}$ — [-] 1 if shared/global expert present.

Expert shapes (per MoE layer)

  • $W_{\text{up}}^{(e)}\in\mathbb{R}^{d\times d_{\text{moe}}}$, $W_{\text{down}}^{(e)}\in\mathbb{R}^{d_{\text{moe}}\times d}$ — [parameters] Per‑expert MLP projections.
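To make the total-vs-active distinction concrete, a sketch (illustrative expert counts and widths) of the per-layer expert parameters:

```python
# Per-expert and per-MoE-layer parameter counts; all values are illustrative.
d, d_moe, g_up = 4096, 1408, 2             # expert width, gated expert MLP
E, e = 64, 8                               # total experts, active experts per token

p_expert = g_up * d * d_moe + d_moe * d    # P_expert (one expert's projections)
p_all_layer = E * p_expert                 # parameters resident in one MoE layer
p_active_layer = e * p_expert              # parameters touched per token per layer

print(f"P_expert = {p_expert:,}")
print(f"resident/layer = {p_all_layer:,}, active/layer = {p_active_layer:,}")
```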

7) Precision, quantization & cache

  • $b$ — [bytes/elt] Bytes per element for a dtype (e.g., BF16→2).
  • $b_w,\,b_a,\,b_{kv}$ — [bytes/elt] Bytes per element for weights, activations, KV cache.
  • $b_m,\,b_v$ — [bytes/elt] Bytes per element for Adam first/second moments.
  • $q_w,\,q_a,\,q_{kv}$ — [bits] Quantization bit‑widths for weights, activations, KV cache.
  • $g_q$ — [#] Quantization group size.
  • $\mathbb{1}_{\text{sym}}$ — [-] 1 if symmetric quantization.
  • $\mathbb{1}_{\text{zp}}$ — [-] 1 if zero‑points used.
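One common way these symbols combine is the effective bytes per weight of group quantization, amortizing per-group metadata over $g_q$ elements. The sketch below assumes a 16-bit scale per group and a $q_w$-bit zero-point when $\mathbb{1}_{\text{zp}}{=}1$; real formats vary, so treat the widths as assumptions:

```python
# Effective bytes per quantized weight element, amortizing per-group metadata.
# Assumes a 16-bit scale per group and a q_w-bit zero-point when used; these
# widths are illustrative assumptions, not fixed by any particular format.
def bytes_per_weight(q_w: int, g_q: int, zero_point: bool) -> float:
    scale_bits = 16                               # assumed per-group scale width
    zp_bits = q_w if zero_point else 0            # assumed zero-point width
    bits = q_w + (scale_bits + zp_bits) / g_q     # payload + amortized metadata
    return bits / 8.0

print(bytes_per_weight(q_w=4, g_q=128, zero_point=True))   # ~0.52 bytes/elt
```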

8) Training schedule

  • $B_{\text{seq}}$ — [#] Sequences per microbatch (per device).
  • $L_{\text{train}}$ — [tokens] Training sequence length.
  • $A$ — [#] Gradient accumulation steps.
  • $n_{\text{GPU}}$ — [#] Number of GPUs.
  • $\mathcal{B}_{\text{tok}}{=}B_{\text{seq}}\,L_{\text{train}}\,A\,n_{\text{GPU}}$ — [tokens/step] Global tokens per optimizer step.
  • $S_{\text{steps}}$ — [#] Optimizer steps.
  • $T_{\text{train}}{=}\mathcal{B}_{\text{tok}}\,S_{\text{steps}}$ — [tokens] Total pretraining tokens.
  • $\eta$ — [-] Base/peak learning rate.
  • $\beta_1,\,\beta_2,\,\epsilon_{\text{adam}}$ — [-] Adam/AdamW hyperparameters.
  • $\lambda$ — [-] Weight decay.
  • $S_{\text{warm}}$ — [#] Warmup steps.
  • $c_{\text{grad}}$ — [-] Gradient‑norm clip threshold.
  • $p_{\text{label}}$ — [-] Label smoothing probability.
  • $\mathcal{L}$ — [-] Training objective (e.g., cross‑entropy).
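The batch bookkeeping above reduces to two products; a minimal sketch with an illustrative schedule:

```python
# Global tokens per optimizer step and total pretraining tokens.
B_seq, L_train, A, n_gpu = 4, 8192, 8, 256   # illustrative schedule
S_steps = 250_000

tokens_per_step = B_seq * L_train * A * n_gpu   # B_tok
total_tokens = tokens_per_step * S_steps        # T_train = B_tok * S_steps

print(f"B_tok = {tokens_per_step:,} tokens/step")
print(f"T_train = {total_tokens:,} tokens (~{total_tokens/1e12:.1f}T)")
```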

9) Parallelism & system topology

  • $D_p,\,T_p,\,P_p,\,S_p,\,E_p$ — [#] Data, tensor, pipeline, sequence, expert parallel degrees.
  • $F_{\text{peak}}$ — [FLOP/s] GPU peak tensor throughput (BF16/FP16).
  • $\text{BW}_{\text{HBM}}$ — [bytes/s] On‑device HBM bandwidth.
  • $\text{BW}_{\text{NVLink}}$ — [bytes/s] Node‑local interconnect bandwidth.
  • $\text{BW}_{\text{NIC}}$ — [bytes/s] Cross‑node network bandwidth.
  • $\eta_{\text{compute}},\,\eta_{\text{bw}}$ — [-] Achieved compute and bandwidth utilizations.

10) Parameter counts

  • $P$ — [parameters] Total parameter count.
  • $P_{\text{tok}}$ — [parameters] Embedding + LM head params (if tied, count once).
  • $P_{\text{attn,per\_layer}} \approx d\,(h_q d_h + 2h_{kv} d_h) + d\,(h_q d_h)$ — [parameters] Attention per layer.
  • $P_{\text{mlp,per\_layer}} \approx g_{\text{up}}\,d\,d_{\text{ff}} + d_{\text{ff}}\,d$ — [parameters] MLP per layer.
  • $P_{\text{dense}} \approx N_L\,(P_{\text{attn,per\_layer}}+P_{\text{mlp,per\_layer}}) + P_{\text{tok}}$ — [parameters] Dense total.
  • $P_{\text{expert}} \approx g_{\text{up}}\,d\,d_{\text{moe}} + d_{\text{moe}}\,d$ — [parameters] Per‑expert params.
  • $P_{\text{moe\_all}} \approx N_{L,\text{moe}}\cdot E\cdot P_{\text{expert}}$ — [parameters] All experts across MoE layers.
  • $P_{\text{active}} \approx P_{\text{dense}} + N_{L,\text{moe}}\cdot e\cdot P_{\text{expert}}$ — [parameters] Active params per token.
  • $P_{\text{kv\_state}} = 2\,N_L\,h_{kv}\,d_h$ — [features/token] KV features stored per token (across layers).
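These approximations compose into a small estimator; the sketch below follows the formulas literally (biases and norm weights ignored) with an illustrative MoE-style config:

```python
# Approximate parameter counts from the formulas above; illustrative config.
d, d_h, h_q, h_kv = 4096, 128, 32, 8
N_L, N_L_moe = 32, 32
d_ff, d_moe, g_up = 14336, 1408, 2
E, e = 64, 8
V, d_emb, tied = 128_000, 4096, True

p_attn_layer = d * (h_q * d_h + 2 * h_kv * d_h) + d * (h_q * d_h)
p_mlp_layer = g_up * d * d_ff + d_ff * d
p_tok = V * d_emb * (1 if tied else 2)                # P_tok, counted once if tied

p_dense = N_L * (p_attn_layer + p_mlp_layer) + p_tok  # P_dense
p_expert = g_up * d * d_moe + d_moe * d               # P_expert
p_moe_all = N_L_moe * E * p_expert                    # P_moe_all
p_active = p_dense + N_L_moe * e * p_expert           # P_active per token

print(f"P_dense ~ {p_dense/1e9:.2f}B, P_moe_all ~ {p_moe_all/1e9:.2f}B, "
      f"P_active ~ {p_active/1e9:.2f}B")
```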

11) Memory sizing

  • $M_{\text{param}} = P\,b_w$ — [bytes] Resident model weights (per replica).
  • $M_{\text{opt}} \approx P\,(b_w + b_m + b_v)$ — [bytes] Adam optimizer states.
  • $M_{\text{grad}} = P\,b_w$ — [bytes] Gradient memory (if not sharded).
  • $M_{\text{act}} \approx \kappa\,B_{\text{seq}}\,L_{\text{train}}\,d\,b_a$ — [bytes] Activations (training); $\kappa$ depends on checkpointing/attention.
  • $B$ — [#] Inference batch size (concurrent sequences).
  • $M_{\text{KV}} = B\,L\,N_L\,(2\,h_{kv}\,d_h)\,b_{kv}$ — [bytes] KV cache at decode.
  • $M_{\text{KV,win}} = B\,\min(L,S)\,N_L\,(2\,h_{kv}\,d_h)\,b_{kv}$ — [bytes] KV cache with sliding window.
  • $M_{\text{embed}} = V\,d_{\text{emb}}\,b_w$ — [bytes] Token embeddings (add $V\,d\,b_w$ for the LM head if untied).
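A sizing sketch that evaluates these byte formulas for an illustrative 8B-parameter model (BF16 weights and KV cache, FP32 Adam moments):

```python
# Memory sizing from the formulas above; all values are illustrative.
P = 8_000_000_000                        # total parameters
b_w, b_m, b_v, b_kv = 2, 4, 4, 2         # BF16 weights, FP32 moments, BF16 KV
N_L, h_kv, d_h = 32, 8, 128
B, L, S = 16, 8192, 4096                 # batch, cache length, window
GiB = 1024 ** 3

m_param = P * b_w                                         # M_param
m_opt = P * (b_w + b_m + b_v)                             # M_opt
m_kv = B * L * N_L * (2 * h_kv * d_h) * b_kv              # M_KV
m_kv_win = B * min(L, S) * N_L * (2 * h_kv * d_h) * b_kv  # M_KV,win

print(f"weights {m_param/GiB:.1f} GiB, optimizer {m_opt/GiB:.1f} GiB")
print(f"KV cache {m_kv/GiB:.1f} GiB (windowed {m_kv_win/GiB:.1f} GiB)")
```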

12) FLOPs & arithmetic intensity

Per new token at decode (cache length $L$)

  • $F_{\text{proj}} \approx 2\,d\,(h_q d_h + 2h_{kv} d_h) + 2\,(h_q d_h)\,d$ — [FLOPs] QKV + output projections.
  • $F_{\text{attn}}(L) \approx 4\,L\,h_q\,d_h$ — [FLOPs] $QK^\top$ and $AV$.
  • $F_{\text{mlp}} \approx 2\,(g_{\text{up}}+1)\,d\,d_{\text{ff}}$ — [FLOPs] MLP GEMMs.
  • $F_{\text{router}} \approx 2\,d\,E$ — [FLOPs] Router (MoE layer).
  • $F_{\text{decode}}(L) \approx N_L\left(F_{\text{proj}} + F_{\text{attn}}(L) + F_{\text{mlp}}\right)$ — [FLOPs] Dense stack (add $N_{L,\text{moe}}F_{\text{router}}$ if MoE).
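A sketch of the per-token decode estimate that combines these terms for a dense stack (add $N_{L,\text{moe}}F_{\text{router}}$ for MoE, per the note above); values are illustrative:

```python
# Approximate FLOPs to generate one token at decode with cache length L.
d, d_h, h_q, h_kv = 4096, 128, 32, 8     # illustrative dense config
d_ff, g_up, N_L = 14336, 2, 32

def flops_decode(L: int) -> float:
    f_proj = 2 * d * (h_q * d_h + 2 * h_kv * d_h) + 2 * (h_q * d_h) * d  # QKV + W_O
    f_attn = 4 * L * h_q * d_h                                           # QK^T and AV
    f_mlp = 2 * (g_up + 1) * d * d_ff                                    # MLP GEMMs
    return N_L * (f_proj + f_attn + f_mlp)

for L in (1024, 8192, 65536):
    print(f"L = {L:>6}: {flops_decode(L)/1e9:.1f} GFLOPs/token")
```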

Prefill (context build, length $T_{\text{ctx}}$)

  • $F_{\text{prefill}}(T_{\text{ctx}}) \sim O\!\left(N_L\,T_{\text{ctx}}\,d^2 + N_L\,T_{\text{ctx}}^2\,h_q d_h\right)$ — [FLOPs] Quadratic attention term dominates without windowing.

Arithmetic intensity

  • $I \equiv \frac{\text{FLOPs}}{\text{bytes moved}}$ — [FLOPs/byte] Roofline intensity.

13) Throughput, latency & efficiency

  • $\mathcal{T}_{\text{tok/s}}$ — [tokens/s] Generation throughput.
  • $t_{\text{P50}},\,t_{\text{P95}}$ — [s] Latency percentiles (first‑token/per‑token).
  • $F_{\text{achieved}}{=}\eta_{\text{compute}}\,F_{\text{peak}}$ — [FLOP/s] Effective compute.
  • $\text{BW}_{\text{achieved}}{=}\eta_{\text{bw}}\,\text{BW}_{\text{HBM}}$ — [bytes/s] Effective bandwidth.
  • $\rho_{\text{comm}}$ — [-] Communication fraction of step time.
  • $\phi_{\text{cache}}{=}\frac{M_{\text{KV}}}{\text{HBM capacity}}$ — [-] KV cache fraction of HBM.
  • $\chi \equiv \frac{F_{\text{attn}}(L)}{F_{\text{mlp}}}$ — [-] Attention/MLP compute ratio.
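Decode is typically bandwidth-bound, so a back-of-envelope throughput ceiling follows from the bytes streamed per generated token (active weights once plus the KV cache read). The sketch below makes exactly those assumptions, with illustrative hardware numbers, and is not a full performance model:

```python
# Rough bandwidth-bound decode throughput ceiling at batch size 1.
# Assumes each token streams the active weights once plus the KV cache.
P_active, b_w = 8_000_000_000, 2          # active params, BF16 weights
N_L, h_kv, d_h, b_kv = 32, 8, 128, 2
L = 8192                                  # current cache length
BW_HBM, eta_bw = 3.35e12, 0.7             # illustrative HBM bandwidth, utilization

bytes_weights = P_active * b_w
bytes_kv = L * N_L * (2 * h_kv * d_h) * b_kv     # per sequence
bytes_per_token = bytes_weights + bytes_kv

tok_per_s = (eta_bw * BW_HBM) / bytes_per_token  # BW_achieved / bytes moved
print(f"~{tok_per_s:.0f} tokens/s (bandwidth-bound estimate)")
```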

14) Inference (sampling)

  • $T$ — [-] Softmax temperature.
  • $k$ — [#] Top‑$k$ cutoff.
  • $p$ — [-] Nucleus (top‑$p$) mass.
  • $\rho$ — [-] Repetition penalty factor.
  • $\pi_{\text{presence}},\,\phi_{\text{frequency}}$ — [-] Presence/frequency penalty strengths.
  • $b_{\text{beam}}$ — [#] Beam width.
  • $L_{\text{gen}}$ — [tokens] Target/generated tokens per request.
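A minimal sampling sketch showing how $T$, $k$, and $p$ combine (temperature, then top-$k$, then nucleus truncation); the ordering and renormalization details vary across implementations:

```python
import math, random

def sample(logits: list[float], T: float = 1.0, k: int = 0, p: float = 1.0) -> int:
    """Draw a token id: temperature -> top-k -> top-p (nucleus) -> sample."""
    scaled = [x / T for x in logits]
    ids = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    if k > 0:
        ids = ids[:k]                                # keep the k highest logits
    m = max(scaled[i] for i in ids)
    probs = [math.exp(scaled[i] - m) for i in ids]   # stable softmax on kept ids
    z = sum(probs)
    probs = [q / z for q in probs]
    kept, mass = [], 0.0
    for i, q in zip(ids, probs):                     # smallest prefix with mass >= p
        kept.append((i, q))
        mass += q
        if mass >= p:
            break
    z = sum(q for _, q in kept)                      # renormalize over the nucleus
    r, acc = random.random() * z, 0.0
    for i, q in kept:
        acc += q
        if r <= acc:
            return i
    return kept[-1][0]

print(sample([2.0, 1.0, 0.5, -1.0], T=0.8, k=3, p=0.9))
```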

15) Useful sums & products

  • $g_{\text{GQA}}=\frac{h_q}{h_{kv}}$; $r_{\text{ff}}=\frac{d_{\text{ff}}}{d}$; $r_{\text{moe}}=\frac{d_{\text{moe}}}{d}$ — [-] Ratios.
  • $P_{\text{active}} \approx P_{\text{dense}} + N_{L,\text{moe}}\cdot e\cdot P_{\text{expert}}$ — [parameters] Active params per token (MoE).
  • $K\!V_{\text{feat}} = 2\,h_{kv}\,d_h$ — [features/token/layer] KV features per token per layer.
  • $M_{\text{KV}} = B\,L\,N_L\,(2\,h_{kv}\,d_h)\,b_{kv}$; $M_{\text{KV,win}} = B\,\min(L,S)\,N_L\,(2\,h_{kv}\,d_h)\,b_{kv}$ — [bytes] KV memory.
  • If $h_{kv}{=}h_q$ and $h_q d_h{=}d$: $P_{\text{attn,per\_layer}}\!\approx\!4d^2$, $F_{\text{proj}}\!\approx\!4d^2$, $F_{\text{attn}}(L)\!\approx\!4Ld$, $F_{\text{mlp}}\!\approx\!2(g_{\text{up}}+1)dd_{\text{ff}}$ — [parameters]/[FLOPs].
  • $T_{\text{train}}=\mathcal{B}_{\text{tok}}\,S_{\text{steps}}$ — [tokens] Training tokens.
  • $\text{Perf} \le \min\!\big(F_{\text{achieved}},\; I\,\text{BW}_{\text{achieved}}\big)$ — [FLOP/s] Roofline bound.

16) Family mapping hints

  • Dense families (llama3/4, gemma3, seed‑oss): $E{=}e{=}1$, $N_{L,\text{moe}}{=}0$, specify $h_q,h_{kv}$ (GQA if $h_{kv}{<}h_q$), $\mathcal{P}$, $L_{\max}$, $r_{\text{ff}}$, $g_{\text{up}}$, $\mathsf{Norm}$, $\epsilon_{\text{norm}}$, $(b_w,b_a,b_{kv})$.
  • qwen3 (dense/long‑context options): as dense, with explicit $(s_{\text{rope}}, s_{\text{ntk}})$ and optional window $S$.
  • qwen3‑moe, gpt‑oss (MoE): additionally specify $E$, $e$, $N_{L,\text{moe}}$, $d_{\text{moe}}$ (or $r_{\text{moe}}$), router knobs $(k_{\text{top}}, C_{\text{cap}}, p_{\text{drop,moe}}, \tau_{\text{router}}, \lambda_{\text{load}}, \mathbb{1}_{\text{shared}})$.

17) Sanity checklist

  • Ensure $d = h_q\,d_h$ (or declare $d_q,\,d_k,\,d_v$).
  • Always report $L_{\max}$ and current $L$ for decode metrics.
  • With GQA/MQA, include $g_{\text{GQA}}$ and $h_{kv}$ in KV/FLOP formulas.
  • For MoE, report both $P$ and $P_{\text{active}}$.
  • Pair utilization with ceilings: $(\eta_{\text{compute}},F_{\text{peak}})$ and $(\eta_{\text{bw}},\text{BW}_{\text{HBM}})$.
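The checklist translates directly into assertions; a sketch over a hypothetical dense config:

```python
# Sanity checklist as assertions over a config dict; values are hypothetical.
cfg = dict(d=4096, h_q=32, h_kv=8, d_h=128, L_max=131072, L=8192,
           E=1, e=1, N_L_moe=0)

assert cfg["d"] == cfg["h_q"] * cfg["d_h"], "declare d_q, d_k, d_v if dims differ"
assert cfg["h_q"] % cfg["h_kv"] == 0, "g_GQA = h_q / h_kv must be an integer"
assert cfg["L"] <= cfg["L_max"], "decode cache length exceeds supported context"
assert (cfg["E"] == cfg["e"] == 1) == (cfg["N_L_moe"] == 0), "MoE settings inconsistent"
print("config passes the sanity checklist")
```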