
@belisarius222
Created March 9, 2026 01:02

Static Frequency-Ranked Shortlist for Speculative Decoding

Date: 2026-03-08
Model: google/gemma-3-1b-it (262,144-token vocab)
Eval dataset: wikitext-2-raw-v1 validation (254,828 positions)
Code: voltropy/shortlist@8168cac

Summary

We discovered that a static, frequency-ranked token set with a simple margin-based fallback to full-vocab scoring achieves better parity than a trained neural router, with zero parameters, zero training, and zero inference-time routing.

The key insight: the teacher model's output distribution is extremely concentrated. Out of a 262K-token vocabulary, only ~80K tokens ever appear in the top-10 predictions across 1.2M training positions, and the top 60K (ranked by how often they appear in the top-10) cover 99.9% of top-1 predictions and 98.5% of top-10 predictions.

Approach

  1. Collect frequency histogram from teacher model predictions on training data (1.2M positions, 24 shards). For each position, record which tokens appear in the teacher's top-k predictions.

  2. Rank tokens by how often they appear anywhere in the teacher's top-10 predictions. Select the top 60,000 tokens (22.9% of vocab).

  3. At decode time, score only the 60K static candidates using the LM head (a single matrix multiply: hidden @ W[static_set].T). If the logit margin between the top-1 and top-2 candidates exceeds a threshold, accept the prediction. Otherwise, fall back to full-vocab scoring.

  4. No neural network is involved in the candidate selection. The static set is a fixed list of token IDs determined entirely offline.
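The offline construction (steps 1-2 above) can be sketched as follows. This is a minimal sketch, not the repo's code: the teacher-logit source is abstracted as an iterable of per-shard `[positions, vocab]` tensors, and `build_static_set` is an illustrative name.

```python
from collections import Counter

import torch

def build_static_set(logit_batches, top_k=10, set_size=60_000):
    """Rank tokens by how often they appear anywhere in the teacher's
    top-k predictions, then keep the `set_size` most frequent IDs."""
    freq = Counter()
    for logits in logit_batches:                    # logits: [positions, vocab]
        topk_ids = logits.topk(top_k, dim=-1).indices
        freq.update(topk_ids.flatten().tolist())    # count every top-k appearance
    static_ids = [tok for tok, _ in freq.most_common(set_size)]
    return torch.tensor(static_ids, dtype=torch.long), freq
```

The returned ID list is the entire "model": a fixed set of token IDs determined offline, with the frequency `Counter` kept around as the histogram artifact.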

Results

Static set coverage (% of positions where ALL top-k teacher tokens are in the set)

Tokens ranked by top-10 appearance frequency:

| Static set size | % of vocab | top-1 | top-2 | top-4 | top-10 |
|-----------------|------------|-------|-------|-------|--------|
| 5,000           | 1.9%       | 92.9% | 87.5% | 79.0% | 60.8%  |
| 10,000          | 3.8%       | 96.2% | 93.0% | 87.2% | 73.2%  |
| 20,000          | 7.6%       | 98.4% | 97.0% | 94.0% | 85.2%  |
| 30,000          | 11.5%      | 99.3% | 98.5% | 96.9% | 91.3%  |
| 45,000          | 17.2%      | 99.8% | 99.5% | 98.8% | 96.1%  |
| 60,000          | 22.9%      | 99.9% | 99.8% | 99.6% | 98.5%  |
| 75,000          | 28.6%      | 100.0% | 100.0% | 99.9% | 99.7% |

Extended coverage (tokens ranked by top-20 and top-30 appearance frequency):

| Static set size | % of vocab | top-1 | top-4 | top-10 | top-20 | top-30 |
|-----------------|------------|-------|-------|--------|--------|--------|
| 20,000          | 7.6%       | 98.0% | 93.0% | 84.9%  | 73.3%  | 62.9%  |
| 30,000          | 11.5%      | 99.0% | 96.1% | 90.7%  | 82.4%  | 74.4%  |
| 45,000          | 17.2%      | 99.6% | 98.3% | 95.5%  | 90.4%  | 85.1%  |
| 60,000          | 22.9%      | 99.8% | 99.3% | 97.9%  | 95.0%  | 91.6%  |
| 90,000          | 34.4%      | 100.0% | 99.9% | 99.6% | 98.9%  | 97.8%  |
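The coverage metric in these tables counts a position as covered at top-k only if every one of the teacher's top-k tokens is inside the static set. One way to compute it (a sketch; the function name is illustrative):

```python
import torch

def coverage(teacher_topk_ids, static_ids):
    """teacher_topk_ids: [positions, k] teacher top-k token IDs.
    Returns the fraction of positions whose entire top-k lies in the set."""
    in_set = torch.isin(teacher_topk_ids, static_ids)   # [positions, k] bool
    return in_set.all(dim=-1).float().mean().item()
```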

End-to-end parity with fallback (60K static set, wikitext-2 validation)

| Margin threshold   | Tier0 (static 60K) | Fallback (full 262K) | Parity |
|--------------------|--------------------|----------------------|--------|
| 1.5 (conservative) | 33.0%              | 67.0%                | 99.94% |
| 0.5                | 70.2%              | 29.8%                | 99.81% |
| 0.3                | 80.6%              | 19.4%                | 99.75% |
| 0.1                | 93.1%              | 6.9%                 | 99.67% |
| 0.02               | 97.5%              | 2.5%                 | 99.65% |
| 0.0 (no fallback)  | 100.0%             | 0.0%                 | 98.85% |

Best operating point: margin=0.02 gives 97.5% of positions resolved from the 60K static set with only 2.5% fallback, achieving 99.65% parity. The margin distribution is bimodal -- positions either have margin > 0.02 or margin ~= 0 (a near-tie), with nothing in between.
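The tiered decode step can be sketched as below. This is a simplified single-position version under stated assumptions: `W` is the LM head weight matrix `[vocab, hidden]`, and `predict_next_token` is an illustrative name, not the repo's API.

```python
import torch

def predict_next_token(hidden, W, static_ids, margin=0.02):
    """Tier 0: score only the static candidates (one small matmul).
    Fall back to full-vocab scoring when the top-1/top-2 logit margin
    is below `margin`. Returns (token_id, used_fallback)."""
    tier0 = hidden @ W[static_ids].T          # [num_static] logits
    top2 = tier0.topk(2)
    # >= so that margin=0.0 reproduces the "no fallback" row above
    if (top2.values[0] - top2.values[1]) >= margin:
        return static_ids[top2.indices[0]].item(), False
    full = hidden @ W.T                       # full-vocab fallback
    return full.argmax().item(), True
```

Because the margin distribution is bimodal, almost every position clears even a tiny threshold like 0.02, which is why the fallback rate stays at 2.5%.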

Comparison to trained neural router

For reference, the trained router approach (SwiGLU MLP selecting from 4096 embedding-clustered expert shortlists) achieved:

| Configuration              | top-1 perfect | top-10 perfect | Parity |
|----------------------------|---------------|----------------|--------|
| Router, 4 experts (h1024)  | 92.08%        | 34.47%         | 98.92% |
| Router, 8 experts          | 93.69%        | 39.07%         | 99.20% |
| Router, 16 experts         | 95.00%        | 44.05%         | 99.39% |
| Router, 32 experts         | 95.92%        | 49.36%         | 99.53% |
| Router, 64 experts         | 96.70%        | 53.85%         | 99.64% |
| Static 60K (no router)     | 99.9%         | 98.5%          | 99.65% |

The static 60K set dominates the router at every expert count, despite having zero trainable parameters.

Why the router couldn't compete

The router's candidate sets were small (~2K tokens: 512 base + 4 experts x ~512 shortlist tokens). Even with 64 experts (64 x 512 = 32K expert tokens + 512 base), the total candidate pool was smaller than 60K and, worse, context-dependent: the router could pick the wrong experts. A static 60K set is both larger and deterministic.

The router added only ~11-16% on top of the 80.6% base-set coverage. The static approach simply expands the "base set" to 60K, which is large enough to cover almost everything.

Artifacts

All artifacts are in S3 bucket voltcode-artifacts-17f9c348:

Static set approach

  • runs/shortlist-router/static_set/token_freq_histogram.pt -- Full token frequency histogram (Counter dict from 1.2M training positions)

Router training runs (for comparison)

  • runs/shortlist-router/router_train_512tok_h1024_10ep/ -- 10-epoch h1024 router (best router: 98.88% parity with 4 experts)
    • Includes all epoch checkpoints, training diagnostics, eval logs
  • runs/shortlist-router/router_train_512tok_h2048_10ep/ -- 10-epoch h2048 router (98.92% parity, wider FFN didn't help)
    • Includes eval logs for 4/8/16 expert configurations
  • runs/shortlist-router/router_train_512tok_continued/ -- 20-epoch continued training (slightly worse than 10-epoch)

Key files in the codebase

Next steps

  • Integrate the static-set approach into the main decode pipeline as a first-class tier
  • Investigate whether a smaller static set (e.g. 30K) with a lightweight fallback mechanism could achieve similar parity with lower compute
  • Profile the actual wall-clock speedup of scoring 60K vs 262K candidates in the LM head matmul
  • Consider whether the static set should be model-specific or if a single set transfers across model sizes