Date: 2026-03-08
Model: google/gemma-3-1b-it (262,144-token vocab)
Eval dataset: wikitext-2-raw-v1 validation (254,828 positions)
Code: voltropy/shortlist@8168cac
We discovered that a static, frequency-ranked token set with a simple margin-based fallback to full-vocab scoring achieves better parity than a trained neural router, with zero parameters, zero training, and zero inference-time routing.
The key insight: the teacher model's output distribution is extremely concentrated. Out of a 262K-token vocabulary, only ~80K tokens ever appear in the top-10 predictions across 1.2M training positions, and the top 60K (ranked by how often they appear in the top-10) cover 98-99.9% of all predictions, depending on k.
- Collect a frequency histogram from teacher-model predictions on the training data (1.2M positions, 24 shards). For each position, record which tokens appear in the teacher's top-k predictions.
- Rank tokens by how often they appear anywhere in the teacher's top-10 predictions. Select the top 60,000 tokens (22.9% of the vocab).
- At decode time, score only the 60K static candidates using the LM head (a single matrix multiply: hidden @ W[static_set].T). If the logit margin between the top-1 and top-2 candidates exceeds a threshold, accept the prediction. Otherwise, fall back to full-vocab scoring.
- No neural network is involved in candidate selection. The static set is a fixed list of token IDs determined entirely offline.
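The offline selection step can be sketched in a few lines (a minimal sketch with pure stdlib; the function name and toy data are illustrative, not the actual pipeline code in shortlist/):

```python
from collections import Counter

def build_static_set(topk_ids_per_position, set_size=60_000):
    """Rank tokens by how often they appear in the teacher's top-k
    predictions, and keep the `set_size` most frequent token IDs."""
    freq = Counter()
    for topk_ids in topk_ids_per_position:  # e.g. top-10 token IDs per position
        freq.update(topk_ids)
    return [tok for tok, _ in freq.most_common(set_size)]

# toy example: 3 positions, top-3 teacher predictions each
positions = [[5, 9, 2], [5, 2, 7], [5, 9, 1]]
static_set = build_static_set(positions, set_size=4)
# token 5 appears in all 3 positions, so it ranks first
```

In the real run, `topk_ids_per_position` comes from the teacher's top-10 over 1.2M positions, and the resulting ID list is the entire "model": a fixed array determined once, offline.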
Perfect-coverage rates, with tokens ranked by top-10 appearance frequency (a top-k column reports the fraction of positions whose entire teacher top-k lies inside the static set):
| Static set size | % of vocab | top-1 | top-2 | top-4 | top-10 |
|---|---|---|---|---|---|
| 5,000 | 1.9% | 92.9% | 87.5% | 79.0% | 60.8% |
| 10,000 | 3.8% | 96.2% | 93.0% | 87.2% | 73.2% |
| 20,000 | 7.6% | 98.4% | 97.0% | 94.0% | 85.2% |
| 30,000 | 11.5% | 99.3% | 98.5% | 96.9% | 91.3% |
| 45,000 | 17.2% | 99.8% | 99.5% | 98.8% | 96.1% |
| 60,000 | 22.9% | 99.9% | 99.8% | 99.6% | 98.5% |
| 75,000 | 28.6% | 100.0% | 100.0% | 99.9% | 99.7% |
Extended coverage (tokens ranked by top-20 and top-30 appearance frequency):
| Static set size | % of vocab | top-1 | top-4 | top-10 | top-20 | top-30 |
|---|---|---|---|---|---|---|
| 20,000 | 7.6% | 98.0% | 93.0% | 84.9% | 73.3% | 62.9% |
| 30,000 | 11.5% | 99.0% | 96.1% | 90.7% | 82.4% | 74.4% |
| 45,000 | 17.2% | 99.6% | 98.3% | 95.5% | 90.4% | 85.1% |
| 60,000 | 22.9% | 99.8% | 99.3% | 97.9% | 95.0% | 91.6% |
| 90,000 | 34.4% | 100.0% | 99.9% | 99.6% | 98.9% | 97.8% |
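The perfect-coverage numbers in the tables above reduce to a containment check per position (a minimal sketch; function name and toy data are illustrative):

```python
def topk_coverage(static_set, teacher_topk_per_position):
    """Fraction of positions whose entire teacher top-k prediction list
    falls inside the static set ("perfect coverage")."""
    s = set(static_set)
    hits = sum(all(t in s for t in topk) for topk in teacher_topk_per_position)
    return hits / len(teacher_topk_per_position)

# toy example: static set {1, 2, 3}; 2 of 3 positions are fully covered
cov = topk_coverage({1, 2, 3}, [[1, 2], [3, 1], [2, 9]])  # 2/3
```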
Margin-based fallback results with the 60K static set:
| Margin threshold | Tier 0 (static 60K) | Fallback (full 262K) | Parity |
|---|---|---|---|
| 1.5 (conservative) | 33.0% | 67.0% | 99.94% |
| 0.5 | 70.2% | 29.8% | 99.81% |
| 0.3 | 80.6% | 19.4% | 99.75% |
| 0.1 | 93.1% | 6.9% | 99.67% |
| 0.02 | 97.5% | 2.5% | 99.65% |
| 0.0 (no fallback) | 100.0% | 0.0% | 98.85% |
Best operating point: margin=0.02 resolves 97.5% of positions from the 60K static set with only 2.5% fallback, achieving 99.65% parity. The margin distribution is bimodal: positions either have margin > 0.02 or margin ≈ 0 (a near-tie), with almost nothing in between.
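The tiered decode described above can be sketched as follows (NumPy for clarity, single position; `W` is the LM-head weight matrix and all names are illustrative, not the actual shortlist/decode.py implementation):

```python
import numpy as np

def tiered_argmax(hidden, W, static_ids, margin=0.02):
    """Tier 0: score only the static candidates (hidden @ W[static_ids].T).
    Accept if the top-1/top-2 logit margin clears the threshold;
    otherwise fall back to scoring the full vocabulary."""
    logits = W[static_ids] @ hidden                 # (len(static_ids),)
    top2 = np.argpartition(logits, -2)[-2:]         # indices of the 2 best
    top2 = top2[np.argsort(logits[top2])[::-1]]     # sorted: [top-1, top-2]
    if logits[top2[0]] - logits[top2[1]] >= margin:
        return int(static_ids[top2[0]]), False      # resolved in tier 0
    return int(np.argmax(W @ hidden)), True         # full-vocab fallback

# toy LM head: 6-token vocab, 2-dim hidden state
W = np.array([[1., 0.], [0., 1.], [2., 0.], [0., 2.], [3., 3.], [0., 0.]])
h = np.array([1., 0.])
tok, fell_back = tiered_argmax(h, W, [0, 2, 4])     # clear margin -> (4, False)
```

The second return value flags whether the fallback fired, which is what the Tier 0 / Fallback split in the table counts.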
For reference, the trained router approach (SwiGLU MLP selecting from 4096 embedding-clustered expert shortlists) achieved:
| Configuration | top-1 perfect | top-10 perfect | Parity |
|---|---|---|---|
| Router, 4 experts (h1024) | 92.08% | 34.47% | 98.92% |
| Router, 8 experts | 93.69% | 39.07% | 99.20% |
| Router, 16 experts | 95.00% | 44.05% | 99.39% |
| Router, 32 experts | 95.92% | 49.36% | 99.53% |
| Router, 64 experts | 96.70% | 53.85% | 99.64% |
| Static 60K (no router) | 99.9% | 98.5% | 99.65% |
The static 60K set dominates the router at every expert count, despite having zero trainable parameters.
The router's candidate sets were small (~2K tokens: 512 base + 4 experts x ~512 shortlist tokens). Even with 64 experts (64 x 512 = 32K expert tokens + 512 base), the total candidate pool was smaller than 60K and was context-dependent (the router could pick the wrong experts). A static 60K set is both larger and deterministic.
The router added only ~11-16% on top of the 80.6% base-set coverage. The static approach simply expands the "base set" to 60K, which is large enough to cover almost everything.
All artifacts are in S3 bucket voltcode-artifacts-17f9c348:
- runs/shortlist-router/static_set/token_freq_histogram.pt -- full token frequency histogram (Counter dict from 1.2M training positions)
- runs/shortlist-router/router_train_512tok_h1024_10ep/ -- 10-epoch h1024 router (best router: 98.88% parity with 4 experts); includes all epoch checkpoints, training diagnostics, and eval logs
- runs/shortlist-router/router_train_512tok_h2048_10ep/ -- 10-epoch h2048 router (98.92% parity; the wider FFN didn't help); includes eval logs for 4/8/16-expert configurations
- runs/shortlist-router/router_train_512tok_continued/ -- 20-epoch continued training (slightly worse than 10-epoch)
- scripts/eval_static_set.py -- standalone eval for static candidate sets with margin fallback
- shortlist/eval.py -- general eval framework (modified to report perfect-coverage metrics)
- shortlist/train.py -- router training code (train_router() function)
- shortlist/cli.py -- CLI entry points, including the train-router subcommand
- shortlist/decode.py -- tiered decode with confidence-based escalation
- shortlist/models.py -- ShortlistStudent model with router_coverage_loss()
Next steps:
- Integrate the static-set approach into the main decode pipeline as a first-class tier
- Investigate whether a smaller static set (e.g. 30K) with a lightweight fallback mechanism could achieve similar parity with lower compute
- Profile the actual wall-clock speedup of scoring 60K vs 262K candidates in the LM head matmul
- Consider whether the static set should be model-specific or if a single set transfers across model sizes
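For the profiling item, a rough CPU-side baseline can be sketched with NumPy and timeit before instrumenting the real GPU path (a scaled-down sketch; dimensions and function name are illustrative, and a real run would use the actual 60,000 / 262,144 x d_model LM-head shapes on the deployment hardware):

```python
import timeit
import numpy as np

def lmhead_time(n_candidates, d_model, repeats=5):
    """Best-of-several wall-clock time for one LM-head matmul
    over `n_candidates` vocabulary rows."""
    rng = np.random.default_rng(0)
    W = rng.standard_normal((n_candidates, d_model), dtype=np.float32)
    h = rng.standard_normal(d_model, dtype=np.float32)
    return min(timeit.repeat(lambda: W @ h, number=repeats)) / repeats

# scaled-down illustration of the 60K-vs-262K comparison
t_static = lmhead_time(6_000, 256)
t_full = lmhead_time(26_214, 256)
```

Since the matmul is memory-bandwidth-bound at batch size 1, the speedup should track the row-count ratio fairly closely, but only a measurement on the real hardware settles it.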