Use this dictionary whenever referring to LLM hyperparameters, shapes, and efficiency metrics across this project. It is a unified superset covering gpt‑oss, qwen3, qwen3‑moe, llama3, llama4, gemma3, and seed‑oss. When a feature is unused, assign the trivial setting (e.g., dense models:
Unit key: “[-]” dimensionless; “[#]” count; “[features]” channel width; “[tokens]” token length; “[bytes]”, “[FLOP/s]”, “[bytes/s]”.
-
$N_L$ — [#] Transformer block (layer) count. -
$d$ — [features] Model hidden size (residual width). -
$h_q,,h_{kv}$ — [#] Query heads and KV heads (GQA/MQA).