Skip to content

Instantly share code, notes, and snippets.

@benvanik
Created February 22, 2026 04:26
Show Gist options
  • Select an option

  • Save benvanik/9090b155a3102b333896c1d92c451cbc to your computer and use it in GitHub Desktop.

Select an option

Save benvanik/9090b155a3102b333896c1d92c451cbc to your computer and use it in GitHub Desktop.
iree/tokenizer/ perf

Tokenizer Performance Comparison

Benchmark comparing IREE's C tokenizer against the best available Rust and Python tokenizer implementations.

Last updated: 2026-02-21. Re-run benchmarks before citing these numbers.

Test Environment

  • CPU: AMD EPYC (192 cores, 5391 MHz), 32 KB L1d, 1 MB L2, 32 MB L3
  • OS: Linux 6.14.0-37-generic
  • IREE: C23, Clang -O3 -march=native, ThinLTO, backtracking BPE
  • Rust bpe crate: bpe 0.2 + bpe-openai 0.3, opt-level=3, lto=true, backtracking BPE
  • tiktoken-rs: tiktoken-rs 0.9, opt-level=3, lto=true, priority-queue BPE
  • HF (Rust): tokenizers 0.22.2, opt-level=3, lto=true, Rust-native (no Python)
  • tiktoken (Python): tiktoken 0.12.0 (Python wrapper around tiktoken-rs)
  • HF tokenizers (Python): tokenizers 0.21.4 (Python wrapper around Rust backend)

Methodology

All benchmarks use identical methodology: full files, single copy, cache-hot. Each benchmark runs for at least 2 seconds to ensure statistical stability. The same three text files are used across all implementations.

Corpus File Size Description
ASCII sherlock.txt 595 KB English prose (Sherlock Holmes)
CJK japanese.txt 100 KB Japanese text
Code c_code.txt 171 KB C source code

GPT-2 r50k_base (50K vocab)

All GPT-2 benchmarks use the same vocabulary for fair comparison — same algorithm, same vocabulary, different implementations.

GPT-2 Encode Throughput (MiB/s, higher is better)

Corpus IREE Rust bpe HF (Rust) tiktoken-rs tiktoken (Py) HF (Py)
ASCII 44.7 39.9 4.0 15.0 19.9 2.7
CJK 37.9 37.6 4.2 15.4 15.8 2.6
Code 52.1 30.6 3.9 12.7 14.6 2.5

IREE uses the best of streaming and one-shot modes. All Rust crates are native (no Python overhead). tiktoken (Python) calls tiktoken-rs under the hood but adds Python dispatch overhead.

IREE vs Rust bpe (same algorithm, same vocabulary)

Both IREE and Rust bpe use backtracking BPE with GPT-2 r50k_base vocabulary. This is the fairest head-to-head comparison:

  • Code: IREE is 1.70x faster (52.1 vs 30.6 MiB/s). Code text has predictable token patterns (identifiers, operators, whitespace) that benefit from IREE's DAAC trie-based vocabulary lookup with O(1) state transitions.

  • ASCII: IREE leads (44.7 vs 39.9 MiB/s, 1.12x faster). The reachability-gated whole-segment fast-path skips full BPE merge processing for segments that match a single reachable vocabulary token. IREE's streaming mode (44.7 MiB/s) outperforms one-shot (32.5 MiB/s) by reusing pre-allocated state. Profiling shows the remaining bottleneck is the backtracking algorithm (29% of time) and pair validation (17%), with the DFA regex segmenter at 8%.

  • CJK: IREE now leads (37.9 vs 37.6 MiB/s, 1.01x). CJK tokens average only 1.3 bytes each (vs 3.5 for ASCII), which amplifies all per-token costs. Optimizations have improved CJK by +84% total (from 20.6 MiB/s): whole-segment reachability fast-path (+31%), Unicode category fast-paths for Latin Extended and CJK ranges (+11%), whitespace rejection fast-paths for CJK codepoints (+7%), forced inlining of UTF-8 decode / Unicode classification hot-path functions (+3%), and expanding the pair validation cache from 256 to 4096 entries (+12%). The GPT-2 ByteLevel pretokenizer remaps all bytes to ASCII/Latin codepoints before BPE merging, so the Unicode classification functions are only exercised during regex segmentation — tokenizers that operate on raw UTF-8 (e.g., LLaMA 3.1) see much larger gains from inlining (+47% CJK throughput).

    Both IREE and Rust bpe validate token pairs using iterative decomposition via a split table. Rust bpe's implementation is a tight ~40-line loop with a single deterministic decomposition path. IREE's is more thorough: a stack-based search that tries multiple decomposition paths (split table first, then all byte-split positions) and handles deferred merge semantics. The 4096-entry pair cache (32 KB, fits in L1) compensates for the additional per-call cost by caching valid pairs across segments.

IREE vs HF tokenizers (Rust-native)

The HuggingFace tokenizers Rust crate is the same backend that the Python tokenizers library wraps. Running it directly from Rust eliminates Python overhead, isolating the algorithm-level performance difference:

Corpus IREE HF (Rust) Speedup
ASCII 44.7 MiB/s 4.0 MiB/s 11.2x
CJK 37.9 MiB/s 4.2 MiB/s 9.0x
Code 52.1 MiB/s 3.9 MiB/s 13.4x

HF tokenizers is uniformly ~4 MiB/s regardless of corpus, suggesting a per-token overhead floor that dominates encoding time. The Python wrapper adds only ~30% on top of this.

IREE vs tiktoken-rs (different algorithm, same vocabulary)

tiktoken-rs uses priority-queue BPE (O(n log L) per segment) vs IREE's backtracking BPE (O(n) amortized). IREE is 2.5-4.1x faster on encode:

Corpus Speedup
ASCII 3.0x
CJK 2.5x
Code 4.1x

GPT-2 Decode Throughput (MiB/s, higher is better)

Corpus IREE Rust bpe HF (Rust) tiktoken-rs tiktoken (Py) HF (Py)
ASCII 655 481 47.6 310 165 31
CJK 1,558 320 30.5 231 104 20
Code 2,428 490 36.4 338 139 23

IREE dominates decode across the board.

IREE vs Rust bpe decode

Corpus IREE Rust bpe Speedup
ASCII 655 MiB/s 481 MiB/s 1.4x
CJK 1,558 MiB/s 320 MiB/s 4.9x
Code 2,428 MiB/s 490 MiB/s 5.0x

The architecture is fundamentally different: IREE pre-decodes vocabulary entries at initialization time, storing decoded UTF-8 strings contiguously in memory. Decode is a direct lookup + memcpy from the pre-computed table. Rust bpe reconstructs strings from per-token byte sequences at decode time.

The CJK result (1.5 GiB/s) demonstrates how cache-friendly this approach is: CJK tokens average 1.87 bytes each, so the token-to-string lookup stays in L1 cache and the output is nearly a sequential memory write.

LLaMA 3.1 (128K vocab)

LLaMA 3.1 uses a 128K-token vocabulary — 2.5x larger than GPT-2. The Rust bpe crate cannot load LLaMA 3's vocabulary (its rank-based BPE invariant check fails because LLaMA 3 has 280K merge rules for 128K tokens, creating "unreachable" tokens that violate the crate's self-encoding assertion). The comparison here is IREE vs HuggingFace tokenizers (Rust-native, no Python).

LLaMA 3 Encode Throughput (MiB/s, higher is better)

Corpus IREE HF (Rust) Speedup
ASCII 42.8 4.4 9.7x
CJK 23.0 5.8 4.0x
Code 57.1 4.5 12.7x

IREE's advantage is largest on Code (12.7x) because the 128K vocabulary contains long code-specific tokens (average 4.2 bytes/token vs GPT-2's 2.4), reducing BPE merge iterations per byte and letting IREE's whole-segment fast-path trigger more frequently. CJK shows the smallest advantage (4.0x) because CJK tokens are still relatively short (3.2 bytes/token), meaning more BPE work per byte.

LLaMA 3 Decode Throughput (MiB/s, higher is better)

Corpus IREE HF (Rust) Speedup
ASCII 655 46.9 14.0x
CJK 1,558 44.6 34.9x
Code 2,428 49.4 49.1x

IREE's pre-decoded lookup table architecture gives an enormous advantage on decode. The CJK speedup (34.9x) reflects the extreme cache-friendliness of IREE's approach: with short CJK tokens, the lookup table fits in L1 cache and decode becomes a near-sequential memory write.

Feature Comparison

IREE's tokenizer goes beyond raw throughput:

Feature IREE Rust bpe tiktoken-rs HF tokenizers
Streaming encode Yes No No No
Streaming decode Yes No No No
Zero-allocation encode Yes No No No
Offset tracking Yes No No Yes
Special token handling Yes No No Yes
BPE + WordPiece + Unigram Yes BPE only BPE only Yes
Pure C (no runtime deps) Yes Rust stdlib Rust stdlib Rust + Python
Embeddable (no allocator) Yes No No No

IREE's streaming API allows encoding arbitrarily long text with bounded memory (configurable 4-64 KB transform buffer), producing tokens incrementally. The Rust and Python alternatives require the entire input in memory and produce all tokens at once.

Reproduction

IREE benchmark

cd ~/src/iree/loom

# GPT-2
iree-bazel-run --copt=-O3 --copt=-march=native --features=thin_lto \
  //runtime/src/iree/tokenizer/tools:comprehensive_benchmark -- \
  --tokenizer_json=path/to/gpt2.json \
  --ascii_text=path/to/sherlock.txt \
  --cjk_text=path/to/japanese.txt \
  --code_text=path/to/c_code.txt \
  --benchmark_min_time=2s

# LLaMA 3
iree-bazel-run --copt=-O3 --copt=-march=native --features=thin_lto \
  //runtime/src/iree/tokenizer/tools:comprehensive_benchmark -- \
  --tokenizer_json=path/to/NousResearch_Meta-Llama-3.1-8B.json \
  --ascii_text=path/to/sherlock.txt \
  --cjk_text=path/to/japanese.txt \
  --code_text=path/to/c_code.txt \
  --benchmark_min_time=2s

Or use the convenience script that downloads tokenizers automatically:

runtime/src/iree/tokenizer/tools/run_benchmarks.sh

Rust benchmarks (bpe + tiktoken-rs + HF tokenizers)

cd ~/src/iree-tmp/tokenizers/benchmarks/bpe_bench
cargo run --release           # All benchmarks (GPT-2 + LLaMA 3)
cargo run --release -- --gpt2   # GPT-2 only
cargo run --release -- --llama3 # LLaMA 3 only

Python benchmarks

# tiktoken (all corpora, encode + decode)
python3 ~/src/iree-tmp/tokenizers/benchmarks/comprehensive_tiktoken.py

# HuggingFace tokenizers (all corpora, encode + decode)
python3 ~/src/iree-tmp/tokenizers/benchmarks/comprehensive_huggingface.py

Test data

Text corpora are at ~/src/iree-tmp/tokenizers/txt/ and tokenizer JSON files at ~/src/iree-tmp/tokenizers/json/. The run_benchmarks.sh script downloads these automatically if not present.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment