Use this dictionary whenever referring to LLM hyperparameters, shapes, and efficiency metrics across this project. It is a unified superset covering gpt‑oss, qwen3, qwen3‑moe, llama3, llama4, gemma3, and seed‑oss. When a feature is unused, assign the trivial setting (e.g., dense models: $E{=}e{=}1$, $N_{L,\text{moe}}{=}0$).
Unit key: “[-]” dimensionless; “[#]” count; “[features]” channel width; “[tokens]” token length; “[parameters]” parameter count; “[bits]” and “[bytes]”/“[bytes/elt]” storage; “[FLOPs]” operations; “[FLOP/s]”, “[bytes/s]”, “[tokens/s]” rates; “[s]” seconds.
- $N_L$ — [#] Transformer block (layer) count.
- $d$ — [features] Model hidden size (residual width).
- $h_q,\,h_{kv}$ — [#] Query heads and KV heads (GQA/MQA).
- $g_{\text{GQA}}{=}\frac{h_q}{h_{kv}}$ — [-] GQA grouping factor.
- $d_h$ — [features] Per‑head dim; usually $d = h_q d_h$.
- $d_q,\,d_k,\,d_v$ — [features] Optional per‑projection head dims when unequal.
- $p_{\text{attn}}$ — [-] Attention dropout probability.
- $\alpha_{\text{attn}}{=}\frac{1}{\sqrt{d_k}}$ — [-] Attention scaling factor.
- $W_Q\in\mathbb{R}^{d\times(h_q d_h)}$, $W_K\in\mathbb{R}^{d\times(h_{kv} d_h)}$, $W_V\in\mathbb{R}^{d\times(h_{kv} d_h)}$, $W_O\in\mathbb{R}^{(h_q d_h)\times d}$ — [parameters] Attention projection matrices.
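
A minimal sketch, in Python, of how these shapes and the GQA grouping relate; the dimensions are illustrative assumptions (llama3‑8B‑like), and the helper name `attn_shapes` is hypothetical.

```python
# Sketch: attention projection shapes and the GQA grouping factor.
# The numbers used below are illustrative assumptions, not quoted from any config.

def attn_shapes(d: int, h_q: int, h_kv: int, d_h: int):
    assert d == h_q * d_h, "expected d = h_q * d_h (otherwise declare d_q, d_k, d_v)"
    g_gqa = h_q // h_kv                      # g_GQA = h_q / h_kv
    shapes = {
        "W_Q": (d, h_q * d_h),
        "W_K": (d, h_kv * d_h),
        "W_V": (d, h_kv * d_h),
        "W_O": (h_q * d_h, d),
    }
    params = sum(rows * cols for rows, cols in shapes.values())
    return g_gqa, shapes, params

# Assumed llama3-8B-like dims: d=4096, h_q=32, h_kv=8, d_h=128.
print(attn_shapes(4096, 32, 8, 128))
```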
- $L_{\max}$ — [tokens] Maximum supported/trained context length.
- $L$ — [tokens] Cached context length at the current decode step.
- $S$ — [tokens] Sliding‑window/local attention span.
- $k_{\text{sinks}}$ — [tokens] Count of sink/pinned tokens.
- $d_{\text{rope}}$ — [features] Rotated channels for RoPE.
- $f_{\text{rope}}{=}\frac{d_{\text{rope}}}{d}$ — [-] RoPE channel fraction.
- $\theta_{\text{base}}$ — [-] RoPE base.
- $s_{\text{rope}}$ — [-] RoPE scaling factor.
- $s_{\text{ntk}}$ — [-] NTK‑aware scaling factor.
- $\mathcal{P}\in\{\text{RoPE},\text{ALiBi},\text{LE},\dots\}$ — [-] Positional scheme tag.
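
A minimal sketch of RoPE angle computation under the common convention of inverse frequencies $\theta_{\text{base}}^{-2i/d_{\text{rope}}}$ with linear position scaling by $s_{\text{rope}}$; the exact scaling scheme varies by family, so treat this as an assumption.

```python
import math

# Sketch of RoPE inverse frequencies with simple linear position scaling.
# theta_base, d_rope, s_rope follow this dictionary; the pos / s_rope rescaling
# is the common "linear" scheme, assumed here for illustration.

def rope_inv_freqs(theta_base: float, d_rope: int):
    return [theta_base ** (-2.0 * i / d_rope) for i in range(d_rope // 2)]

def rope_angles(pos: int, theta_base: float, d_rope: int, s_rope: float = 1.0):
    # Rotation angle for each rotated channel pair at token position `pos`.
    return [(pos / s_rope) * f for f in rope_inv_freqs(theta_base, d_rope)]

print(rope_angles(pos=128, theta_base=10000.0, d_rope=128, s_rope=1.0)[:4])
```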
- $d_{\text{ff}}$ — [features] MLP intermediate width.
- $r_{\text{ff}}{=}\frac{d_{\text{ff}}}{d}$ — [-] MLP expansion ratio.
- $g_{\text{up}}$ — [#] Up‑projection branches ($1$ = GeLU MLP, $2$ = SwiGLU/GeGLU).
- $f_{\text{act}}$ — [-] Activation (e.g., GeLU, SwiGLU).
- $p_{\text{mlp}}$ — [-] MLP dropout probability.
- $W_{\text{up}}^{(i)}\in\mathbb{R}^{d\times d_{\text{ff}}}$ for $i{=}1..g_{\text{up}}$, $W_{\text{down}}\in\mathbb{R}^{d_{\text{ff}}\times d}$ — [parameters] MLP projections.
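
A quick sanity check of the per‑layer MLP parameter count $g_{\text{up}}\,d\,d_{\text{ff}} + d_{\text{ff}}\,d$; the widths used are illustrative (llama3‑8B‑like), not quoted from a config.

```python
# Per-layer MLP parameters: g_up branches of (d x d_ff) plus one (d_ff x d) down-projection.
def mlp_params_per_layer(d: int, d_ff: int, g_up: int) -> int:
    return g_up * d * d_ff + d_ff * d

# SwiGLU example with assumed widths d=4096, d_ff=14336, g_up=2.
print(mlp_params_per_layer(4096, 14336, 2))   # ~176M parameters per layer
```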
- $\mathsf{Norm}\in\{\text{RMSNorm},\text{LayerNorm}\}$ — [-] Norm type.
- $\epsilon_{\text{norm}}$ — [-] Norm epsilon.
- $\mathbb{1}_{\text{prenorm}}$ — [-] 1 if pre‑norm, else 0.
- $s_{\text{res}}$ — [-] Residual scaling factor.
- $p_{\text{res}}$ — [-] Residual dropout probability.
- $V$ — [tokens] Vocabulary size.
- $d_{\text{emb}}$ — [features] Token embedding width.
- $\mathbb{1}_{\text{tie}}$ — [-] 1 if embeddings and LM head are tied.
- $n_{\text{special}}$ — [tokens] Count of special tokens.
- $W_{E}\in\mathbb{R}^{V\times d_{\text{emb}}}$ — [parameters] Token embedding matrix.
- $W_{\text{LM}}\in\mathbb{R}^{d\times V}$ — [parameters] LM head matrix.
- $E,\,e$ — [#] Total experts and active experts per token (top‑$e$).
- $N_{L,\text{moe}}$ — [#] Number of MoE layers.
- $d_{\text{moe}}$ — [features] Expert MLP width.
- $r_{\text{moe}}{=}\frac{d_{\text{moe}}}{d}$ — [-] Expert expansion ratio.
- $k_{\text{top}}$ — [#] Router top‑$k$ (usually $e$).
- $C_{\text{cap}}$ — [-] Capacity factor per expert.
- $p_{\text{drop,moe}}$ — [-] Token drop probability on overflow.
- $\tau_{\text{router}}$ — [-] Router temperature.
- $\lambda_{\text{load}}$ — [-] Load‑balancing loss weight.
- $\mathbb{1}_{\text{shared}}$ — [-] 1 if a shared/global expert is present.
- $W_{\text{up}}^{(e)}\in\mathbb{R}^{d\times d_{\text{moe}}}$, $W_{\text{down}}^{(e)}\in\mathbb{R}^{d_{\text{moe}}\times d}$ — [parameters] Per‑expert MLP projections.
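
A minimal router sketch (softmax with temperature $\tau_{\text{router}}$, then top‑$k$ selection and gate renormalization); this illustrates a common routing scheme as an assumption, not any specific model's router, and it omits the capacity and load‑balancing machinery governed by $C_{\text{cap}}$ and $\lambda_{\text{load}}$.

```python
import math

# Sketch: pick top-k experts per token from router logits, with temperature tau.
def route_token(logits, k_top: int, tau: float = 1.0):
    # Softmax over the E expert logits at temperature tau_router.
    scaled = [x / tau for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    probs = [x / z for x in exps]
    # Top-k expert indices and their renormalized gate weights.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k_top]
    gate_sum = sum(probs[i] for i in top)
    return [(i, probs[i] / gate_sum) for i in top]

print(route_token([0.2, 1.5, -0.3, 0.9], k_top=2, tau=1.0))
```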
- $b$ — [bytes/elt] Bytes per element for a dtype (e.g., BF16 → 2).
- $b_w,\,b_a,\,b_{kv}$ — [bytes/elt] Bytes per element for weights, activations, KV cache.
- $b_m,\,b_v$ — [bytes/elt] Bytes per element for Adam first/second moments.
- $q_w,\,q_a,\,q_{kv}$ — [bits] Quantization bit‑widths for weights, activations, KV cache.
- $g_q$ — [#] Quantization group size.
- $\mathbb{1}_{\text{sym}}$ — [-] 1 if symmetric quantization.
- $\mathbb{1}_{\text{zp}}$ — [-] 1 if zero‑points are used.
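
A sketch of effective bytes per quantized element; the overhead model (one 16‑bit scale per group of $g_q$ elements, plus a $q_w$‑bit zero‑point when $\mathbb{1}_{\text{zp}}{=}1$) is a common convention assumed here for illustration, not a fixed rule.

```python
# Sketch: effective bytes per quantized weight element, assuming a 16-bit scale
# per group of g_q elements and, if zero-points are used, one q_w-bit zero-point
# per group. These overhead conventions are illustrative assumptions.
def effective_bytes_per_elt(q_w: int, g_q: int, use_zero_point: bool) -> float:
    payload_bits = q_w
    overhead_bits = 16 + (q_w if use_zero_point else 0)   # per group
    return (payload_bits + overhead_bits / g_q) / 8.0

# 4-bit weights, group size 128, asymmetric (with zero-points):
print(effective_bytes_per_elt(q_w=4, g_q=128, use_zero_point=True))  # ~0.52 bytes/elt
```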
- $B_{\text{seq}}$ — [#] Sequences per microbatch (per device).
- $L_{\text{train}}$ — [tokens] Training sequence length.
- $A$ — [#] Gradient accumulation steps.
- $n_{\text{GPU}}$ — [#] Number of GPUs.
- $\mathcal{B}_{\text{tok}}{=}B_{\text{seq}}\,L_{\text{train}}\,A\,n_{\text{GPU}}$ — [tokens/step] Global tokens per optimizer step.
- $S_{\text{steps}}$ — [#] Optimizer steps.
- $T_{\text{train}}{=}\mathcal{B}_{\text{tok}}\,S_{\text{steps}}$ — [tokens] Total pretraining tokens.
- $\eta$ — [-] Base/peak learning rate.
- $\beta_1,\,\beta_2,\,\epsilon_{\text{adam}}$ — [-] Adam/AdamW hyperparameters.
- $\lambda$ — [-] Weight decay.
- $S_{\text{warm}}$ — [#] Warmup steps.
- $c_{\text{grad}}$ — [-] Gradient‑norm clip threshold.
- $p_{\text{label}}$ — [-] Label smoothing probability.
- $\mathcal{L}$ — [-] Training objective (e.g., cross‑entropy).
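
A small worked example of $\mathcal{B}_{\text{tok}}$ and $T_{\text{train}}$; all numbers are assumed for illustration.

```python
# Sketch: global batch size in tokens and total pretraining tokens.
def tokens_per_step(b_seq: int, l_train: int, accum: int, n_gpu: int) -> int:
    return b_seq * l_train * accum * n_gpu        # B_tok

def total_training_tokens(b_tok: int, s_steps: int) -> int:
    return b_tok * s_steps                        # T_train

b_tok = tokens_per_step(b_seq=2, l_train=8192, accum=8, n_gpu=64)
print(b_tok)                                          # 8,388,608 tokens per optimizer step
print(total_training_tokens(b_tok, s_steps=250_000))  # ~2.1e12 tokens
```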
- $D_p,\,T_p,\,P_p,\,S_p,\,E_p$ — [#] Data, tensor, pipeline, sequence, expert parallel degrees.
- $F_{\text{peak}}$ — [FLOP/s] GPU peak tensor throughput (BF16/FP16).
- $\text{BW}_{\text{HBM}}$ — [bytes/s] On‑device HBM bandwidth.
- $\text{BW}_{\text{NVLink}}$ — [bytes/s] Node‑local interconnect bandwidth.
- $\text{BW}_{\text{NIC}}$ — [bytes/s] Cross‑node network bandwidth.
- $\eta_{\text{compute}},\,\eta_{\text{bw}}$ — [-] Achieved compute and bandwidth utilizations.
- $P$ — [parameters] Total parameter count.
- $P_{\text{tok}}$ — [parameters] Embedding + LM head params (if tied, count once).
- $P_{\text{attn,per\_layer}} \approx d\,(h_q d_h + 2h_{kv} d_h) + d\,(h_q d_h)$ — [parameters] Attention per layer.
- $P_{\text{mlp,per\_layer}} \approx g_{\text{up}}\,d\,d_{\text{ff}} + d_{\text{ff}}\,d$ — [parameters] MLP per layer.
- $P_{\text{dense}} \approx N_L\,(P_{\text{attn,per\_layer}}+P_{\text{mlp,per\_layer}}) + P_{\text{tok}}$ — [parameters] Dense total.
- $P_{\text{expert}} \approx g_{\text{up}}\,d\,d_{\text{moe}} + d_{\text{moe}}\,d$ — [parameters] Per‑expert params.
- $P_{\text{moe\_all}} \approx N_{L,\text{moe}}\cdot E\cdot P_{\text{expert}}$ — [parameters] All experts across MoE layers.
- $P_{\text{active}} \approx P_{\text{dense}} + N_{L,\text{moe}}\cdot e\cdot P_{\text{expert}}$ — [parameters] Active params per token.
- $P_{\text{kv\_state}} = 2\,N_L\,h_{kv}\,d_h$ — [features/token] KV features stored per token (across layers).
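
A sketch that evaluates the approximations above; the dims are llama3‑8B‑like assumptions, and norm/bias parameters are ignored.

```python
# Sketch: approximate parameter counts from the formulas above.
# Illustrative dims only; norm and bias parameters are ignored (they are << 1%).
def p_attn_per_layer(d, h_q, h_kv, d_h):
    return d * (h_q * d_h + 2 * h_kv * d_h) + d * (h_q * d_h)

def p_mlp_per_layer(d, d_ff, g_up):
    return g_up * d * d_ff + d_ff * d

def p_dense(n_layers, d, h_q, h_kv, d_h, d_ff, g_up, vocab, tied=True):
    p_tok = vocab * d * (1 if tied else 2)        # embedding (+ LM head if untied)
    return n_layers * (p_attn_per_layer(d, h_q, h_kv, d_h)
                       + p_mlp_per_layer(d, d_ff, g_up)) + p_tok

# Assumed llama3-8B-like shape: 32 layers, d=4096, 32/8 heads, d_h=128,
# d_ff=14336, SwiGLU (g_up=2), V=128256, untied LM head.
print(p_dense(32, 4096, 32, 8, 128, 14336, 2, 128256, tied=False) / 1e9)  # ~8.0 B
```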
- $M_{\text{param}} = P\,b_w$ — [bytes] Resident model weights (per replica).
- $M_{\text{opt}} \approx P\,(b_w + b_m + b_v)$ — [bytes] Adam optimizer states.
- $M_{\text{grad}} = P\,b_w$ — [bytes] Gradient memory (if not sharded).
- $M_{\text{act}} \approx \kappa\,B_{\text{seq}}\,L_{\text{train}}\,d\,b_a$ — [bytes] Activations (training); $\kappa$ depends on checkpointing/attention.
- $B$ — [#] Inference batch size (concurrent sequences).
- $M_{\text{KV}} = B\,L\,N_L\,(2\,h_{kv}\,d_h)\,b_{kv}$ — [bytes] KV cache at decode.
- $M_{\text{KV,win}} = B\,\min(L,S)\,N_L\,(2\,h_{kv}\,d_h)\,b_{kv}$ — [bytes] KV cache with sliding window.
- $M_{\text{embed}} = V\,d_{\text{emb}}\,b_w$ — [bytes] Token embeddings (LM head adds more if untied).
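
A sketch of the KV‑cache formula, with an optional sliding window; the batch, context, and dtype choices are assumptions for illustration.

```python
# Sketch: KV-cache size at decode time, with and without a sliding window.
def kv_cache_bytes(batch, ctx_len, n_layers, h_kv, d_h, bytes_per_elt, window=None):
    span = ctx_len if window is None else min(ctx_len, window)
    return batch * span * n_layers * (2 * h_kv * d_h) * bytes_per_elt

# Assumed llama3-8B-like cache: 32 layers, 8 KV heads, d_h=128, BF16 (2 bytes).
gib = kv_cache_bytes(batch=8, ctx_len=32_768, n_layers=32, h_kv=8, d_h=128,
                     bytes_per_elt=2) / 2**30
print(f"{gib:.1f} GiB")   # 32.0 GiB of KV cache for 8 x 32k-token sequences
```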
- $F_{\text{proj}} \approx 2\,d\,(h_q d_h + 2h_{kv} d_h) + 2\,(h_q d_h)\,d$ — [FLOPs] QKV + output projections.
- $F_{\text{attn}}(L) \approx 4\,L\,h_q\,d_h$ — [FLOPs] $QK^\top$ and $AV$.
- $F_{\text{mlp}} \approx 2\,(g_{\text{up}}+1)\,d\,d_{\text{ff}}$ — [FLOPs] MLP GEMMs.
- $F_{\text{router}} \approx 2\,d\,E$ — [FLOPs] Router (MoE layer).
- $F_{\text{decode}}(L) \approx N_L\left(F_{\text{proj}} + F_{\text{attn}}(L) + F_{\text{mlp}}\right)$ — [FLOPs] Dense stack (add $N_{L,\text{moe}}\,F_{\text{router}}$ if MoE).
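
A sketch of per‑token decode FLOPs from the formulas above (dense stack); the dims and context length are illustrative assumptions.

```python
# Sketch: per-token decode FLOPs, F_decode(L), for a dense stack.
def decode_flops(ctx_len, n_layers, d, h_q, h_kv, d_h, d_ff, g_up):
    f_proj = 2 * d * (h_q * d_h + 2 * h_kv * d_h) + 2 * (h_q * d_h) * d
    f_attn = 4 * ctx_len * h_q * d_h
    f_mlp = 2 * (g_up + 1) * d * d_ff
    return n_layers * (f_proj + f_attn + f_mlp)

# Assumed llama3-8B-like dims at an 8k-token cached context:
print(decode_flops(8192, 32, 4096, 32, 8, 128, 14336, 2) / 1e9)  # ~18 GFLOPs/token
```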
- $F_{\text{prefill}}(T_{\text{ctx}}) \sim O\!\left(N_L\,T_{\text{ctx}}\,d^2 + N_L\,T_{\text{ctx}}^2\,h_q d_h\right)$ — [FLOPs] Quadratic attention term dominates without windowing.
- $I \equiv \frac{\text{FLOPs}}{\text{bytes moved}}$ — [FLOPs/byte] Roofline intensity.
- $\mathcal{T}_{\text{tok/s}}$ — [tokens/s] Generation throughput.
- $t_{\text{P50}},\,t_{\text{P95}}$ — [s] Latency percentiles (first‑token/per‑token).
- $F_{\text{achieved}}{=}\eta_{\text{compute}}\,F_{\text{peak}}$ — [FLOP/s] Effective compute.
- $\text{BW}_{\text{achieved}}{=}\eta_{\text{bw}}\,\text{BW}_{\text{HBM}}$ — [bytes/s] Effective bandwidth.
- $\rho_{\text{comm}}$ — [-] Communication fraction of step time.
- $\phi_{\text{cache}}{=}\frac{M_{\text{KV}}}{\text{HBM capacity}}$ — [-] KV cache fraction of HBM.
- $\chi \equiv \frac{F_{\text{attn}}(L)}{F_{\text{mlp}}}$ — [-] Attention/MLP compute ratio.
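
A back‑of‑envelope roofline bound on single‑stream decode throughput, assuming bytes moved per token is roughly active weights plus KV read, with H100‑class ceilings used as placeholder hardware numbers; this is a sketch, not a benchmark.

```python
# Sketch: roofline-style bound on single-stream decode throughput.
# Bytes moved per token ~ active weights + KV-cache read; FLOPs from F_decode.
# A back-of-envelope model under stated assumptions, not a measured result.
def decode_tok_per_s_bound(flops_per_tok, bytes_per_tok,
                           f_peak, bw_hbm, eta_compute=0.5, eta_bw=0.8):
    t_compute = flops_per_tok / (eta_compute * f_peak)   # compute-bound time per token
    t_memory = bytes_per_tok / (eta_bw * bw_hbm)         # bandwidth-bound time per token
    return 1.0 / max(t_compute, t_memory)

# Assumed: ~18 GFLOPs/token, ~17 GB moved/token (BF16 weights + KV),
# H100-class peak ~989e12 FLOP/s BF16 and ~3.35e12 bytes/s HBM.
print(decode_tok_per_s_bound(18e9, 17e9, 989e12, 3.35e12))  # ~158 tok/s, memory-bound
```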
- $T$ — [-] Softmax temperature.
- $k$ — [#] Top‑$k$ cutoff.
- $p$ — [-] Nucleus (top‑$p$) mass.
- $\rho$ — [-] Repetition penalty factor.
- $\pi_{\text{presence}},\,\phi_{\text{frequency}}$ — [-] Presence/frequency penalty strengths.
- $b_{\text{beam}}$ — [#] Beam width.
- $L_{\text{gen}}$ — [tokens] Target/generated tokens per request.
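
A pure‑Python sketch of temperature, top‑$k$, and top‑$p$ filtering over next‑token logits; it is illustrative, not any library's sampling API.

```python
import math, random

# Sketch: temperature + top-k + top-p (nucleus) filtering, then sampling.
def sample_next(logits, temperature=1.0, top_k=0, top_p=1.0):
    scaled = [x / max(temperature, 1e-8) for x in logits]
    m = max(scaled)
    weights = [math.exp(x - m) for x in scaled]
    z = sum(weights)
    items = sorted(((w / z, i) for i, w in enumerate(weights)), reverse=True)
    if top_k > 0:
        items = items[:top_k]                 # keep the k highest-probability tokens
    if top_p < 1.0:                           # keep the smallest prefix with mass >= p
        kept, mass = [], 0.0
        for prob, idx in items:
            kept.append((prob, idx))
            mass += prob
            if mass >= top_p:
                break
        items = kept
    total = sum(prob for prob, _ in items)
    r = random.random() * total
    for prob, idx in items:
        r -= prob
        if r <= 0:
            return idx
    return items[-1][1]

print(sample_next([2.0, 1.0, 0.5, -1.0], temperature=0.8, top_k=3, top_p=0.9))
```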
- $g_{\text{GQA}}=\frac{h_q}{h_{kv}}$; $r_{\text{ff}}=\frac{d_{\text{ff}}}{d}$; $r_{\text{moe}}=\frac{d_{\text{moe}}}{d}$ — [-] Ratios.
- $P_{\text{active}} \approx P_{\text{dense}} + N_{L,\text{moe}}\cdot e\cdot P_{\text{expert}}$ — [parameters] Active params per token (MoE).
- $K\!V_{\text{feat}} = 2\,h_{kv}\,d_h$ — [features/token/layer] KV features per token per layer.
- $M_{\text{KV}} = B\,L\,N_L\,(2\,h_{kv}\,d_h)\,b_{kv}$; $M_{\text{KV,win}} = B\,\min(L,S)\,N_L\,(2\,h_{kv}\,d_h)\,b_{kv}$ — [bytes] KV memory.
- If $h_{kv}{=}h_q$ and $h_q d_h{=}d$: $P_{\text{attn,per\_layer}}\!\approx\!4d^2$, $F_{\text{proj}}\!\approx\!8d^2$, $F_{\text{attn}}(L)\!\approx\!4Ld$, $F_{\text{mlp}}\!\approx\!2(g_{\text{up}}+1)\,d\,d_{\text{ff}}$ — [parameters]/[FLOPs].
- $T_{\text{train}}=\mathcal{B}_{\text{tok}}\,S_{\text{steps}}$ — [tokens] Training tokens.
- $\text{Perf} \le \min\!\big(F_{\text{achieved}},\; I\,\text{BW}_{\text{achieved}}\big)$ — [FLOP/s] Roofline bound.
- Dense families (llama3/4, gemma3, seed‑oss): $E{=}e{=}1$, $N_{L,\text{moe}}{=}0$; specify $h_q, h_{kv}$ (GQA if $h_{kv}{<}h_q$), $\mathcal{P}$, $L_{\max}$, $r_{\text{ff}}$, $g_{\text{up}}$, $\mathsf{Norm}$, $\epsilon_{\text{norm}}$, $(b_w, b_a, b_{kv})$.
- qwen3 (dense/long‑context options): as dense, with explicit $(s_{\text{rope}}, s_{\text{ntk}})$ and optional window $S$.
- qwen3‑moe, gpt‑oss (MoE): additionally specify $E$, $e$, $N_{L,\text{moe}}$, $d_{\text{moe}}$ (or $r_{\text{moe}}$), and router knobs $(k_{\text{top}}, C_{\text{cap}}, p_{\text{drop,moe}}, \tau_{\text{router}}, \lambda_{\text{load}}, \mathbb{1}_{\text{shared}})$; a sketch of both layouts follows this list.
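
A sketch of how these family settings might be recorded as plain dictionaries; the field names and values are illustrative assumptions, not official configs.

```python
# Illustrative dense vs MoE parameter records using this dictionary's symbols.
# Values are placeholders, not quoted from any released config.
dense_family = {
    "N_L": 32, "d": 4096, "h_q": 32, "h_kv": 8, "d_h": 128,   # GQA: h_kv < h_q
    "P_scheme": "RoPE", "L_max": 131_072,
    "r_ff": 3.5, "g_up": 2, "Norm": "RMSNorm", "eps_norm": 1e-5,
    "bytes": {"b_w": 2, "b_a": 2, "b_kv": 2},                 # BF16 everywhere
    "E": 1, "e": 1, "N_L_moe": 0,                             # trivial MoE settings
}

moe_family = {
    **dense_family,
    "E": 64, "e": 4, "N_L_moe": 32, "r_moe": 0.5,
    "router": {"k_top": 4, "C_cap": 1.25, "p_drop_moe": 0.0,
               "tau_router": 1.0, "lambda_load": 0.01, "shared_expert": False},
}
print(moe_family["E"], moe_family["e"])
```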
- Ensure $d = h_q\,d_h$ (or declare $d_q, d_k, d_v$).
- Always report $L_{\max}$ and the current $L$ for decode metrics.
- With GQA/MQA, include $g_{\text{GQA}}$ and $h_{kv}$ in KV/FLOP formulas.
- For MoE, report both $P$ and $P_{\text{active}}$.
- Pair utilizations with their ceilings: $(\eta_{\text{compute}}, F_{\text{peak}})$ and $(\eta_{\text{bw}}, \text{BW}_{\text{HBM}})$.
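
A lightweight consistency check applying this checklist; the field names follow the illustrative record format sketched above, and the helper `check_record` is hypothetical.

```python
# Sketch: sanity-check a parameter record against the checklist above.
# Field names follow the illustrative record format assumed earlier on this page.
def check_record(cfg: dict) -> list:
    issues = []
    if cfg["d"] != cfg["h_q"] * cfg["d_h"] and "d_q" not in cfg:
        issues.append("d != h_q * d_h and no per-projection head dims declared")
    if cfg["h_q"] % cfg["h_kv"] != 0:
        issues.append("h_q not divisible by h_kv (GQA grouping ill-defined)")
    if cfg.get("N_L_moe", 0) > 0 and cfg.get("e", 1) > cfg.get("E", 1):
        issues.append("active experts e exceeds total experts E")
    return issues

print(check_record({"d": 4096, "h_q": 32, "h_kv": 8, "d_h": 128}))  # -> []
```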