Benchmark comparing IREE's C tokenizer against the best available Rust and Python tokenizer implementations.
Last updated: 2026-02-21. Re-run benchmarks before citing these numbers.
- CPU: AMD EPYC (192 cores, 5391 MHz), 32 KB L1d, 1 MB L2, 32 MB L3
- OS: Linux 6.14.0-37-generic
- IREE: C23, Clang, -O3 -march=native, ThinLTO, backtracking BPE
- Rust bpe crate: bpe 0.2 + bpe-openai 0.3, opt-level=3, lto=true, backtracking BPE
- tiktoken-rs: tiktoken-rs 0.9, opt-level=3, lto=true, priority-queue BPE
- HF (Rust): tokenizers 0.22.2, opt-level=3, lto=true, Rust-native (no Python)
- tiktoken (Python): tiktoken 0.12.0 (Python wrapper around tiktoken-rs)
- HF tokenizers (Python): tokenizers 0.21.4 (Python wrapper around Rust backend)
All benchmarks use identical methodology: full files, single copy, cache-hot. Each benchmark runs for at least 2 seconds to ensure statistical stability. The same three text files are used across all implementations.
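The timing loop behind these numbers can be sketched as follows. This is an illustrative Python sketch of the methodology, not the actual harness; `bench_throughput` and its parameters are hypothetical names.

```python
import time

def bench_throughput(encode, data: bytes, min_time_s: float = 2.0) -> float:
    """Run encode() on the same cache-hot buffer until at least
    min_time_s of wall time has elapsed, then report MiB/s."""
    # Warm-up pass so the first timed iteration starts cache-hot.
    encode(data)
    iterations = 0
    start = time.perf_counter()
    while True:
        encode(data)
        iterations += 1
        elapsed = time.perf_counter() - start
        if elapsed >= min_time_s:
            break
    total_bytes = iterations * len(data)
    return total_bytes / elapsed / (1024 * 1024)
```

Running until a minimum elapsed time (rather than a fixed iteration count) is what keeps fast and slow implementations equally stable statistically.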
| Corpus | File | Size | Description |
|---|---|---|---|
| ASCII | sherlock.txt | 595 KB | English prose (Sherlock Holmes) |
| CJK | japanese.txt | 100 KB | Japanese text |
| Code | c_code.txt | 171 KB | C source code |
All GPT-2 benchmarks use the same vocabulary for fair comparison — same algorithm, same vocabulary, different implementations.
Encode throughput in MiB/s (higher is better):

| Corpus | IREE | Rust bpe | HF (Rust) | tiktoken-rs | tiktoken (Py) | HF (Py) |
|---|---|---|---|---|---|---|
| ASCII | 44.7 | 39.9 | 4.0 | 15.0 | 19.9 | 2.7 |
| CJK | 37.9 | 37.6 | 4.2 | 15.4 | 15.8 | 2.6 |
| Code | 52.1 | 30.6 | 3.9 | 12.7 | 14.6 | 2.5 |
IREE uses the best of streaming and one-shot modes. All Rust crates are native (no Python overhead). tiktoken (Python) calls tiktoken-rs under the hood but adds Python dispatch overhead.
Both IREE and Rust bpe use backtracking BPE with GPT-2 r50k_base vocabulary. This is the fairest head-to-head comparison:
- Code: IREE is 1.70x faster (52.1 vs 30.6 MiB/s). Code text has predictable token patterns (identifiers, operators, whitespace) that benefit from IREE's DAAC trie-based vocabulary lookup with O(1) state transitions.
- ASCII: IREE leads (44.7 vs 39.9 MiB/s, 1.12x faster). The reachability-gated whole-segment fast-path skips full BPE merge processing for segments that match a single reachable vocabulary token. IREE's streaming mode (44.7 MiB/s) outperforms one-shot (32.5 MiB/s) by reusing pre-allocated state. Profiling shows the remaining bottlenecks are the backtracking algorithm (29% of time) and pair validation (17%), with the DFA regex segmenter at 8%.
- CJK: IREE now leads (37.9 vs 37.6 MiB/s, 1.01x). CJK tokens average only 1.3 bytes each (vs 3.5 for ASCII), which amplifies all per-token costs. Optimizations have improved CJK by +84% total (from 20.6 MiB/s): whole-segment reachability fast-path (+31%), Unicode category fast-paths for Latin Extended and CJK ranges (+11%), whitespace rejection fast-paths for CJK codepoints (+7%), forced inlining of UTF-8 decode / Unicode classification hot-path functions (+3%), and expanding the pair validation cache from 256 to 4096 entries (+12%). The GPT-2 ByteLevel pretokenizer remaps all bytes to ASCII/Latin codepoints before BPE merging, so the Unicode classification functions are only exercised during regex segmentation; tokenizers that operate on raw UTF-8 (e.g., LLaMA 3.1) see much larger gains from inlining (+47% CJK throughput).
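The whole-segment fast-path mentioned above can be sketched as follows. This is Python pseudocode of the idea, not the C implementation; `vocab`, `reachable`, and `bpe_merge` are illustrative stand-ins, where `reachable` is the set of token ids the merge process can actually produce.

```python
def encode_segment(segment: bytes, vocab: dict, reachable: set, bpe_merge) -> list:
    """Whole-segment fast-path sketch: if the entire pretokenized
    segment is a single reachable vocabulary token, emit it directly
    and skip BPE merge processing; otherwise fall back to full BPE."""
    token_id = vocab.get(segment)
    if token_id is not None and token_id in reachable:
        return [token_id]          # fast path: one lookup, no merges
    return bpe_merge(segment)      # slow path: full backtracking BPE
```

The reachability gate matters for correctness: a vocabulary entry that BPE merging could never produce must not be emitted by the fast path, or encode output would diverge from the reference tokenizer.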
Both IREE and Rust bpe validate token pairs using iterative decomposition via a split table. Rust bpe's implementation is a tight ~40-line loop with a single deterministic decomposition path. IREE's is more thorough: a stack-based search that tries multiple decomposition paths (split table first, then all byte-split positions) and handles deferred merge semantics. The 4096-entry pair cache (32 KB, fits in L1) compensates for the additional per-call cost by caching valid pairs across segments.
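A minimal sketch of such a pair cache, assuming a direct-mapped table keyed by the token pair (`PairCache` and `validate` are illustrative names; the real implementation is C and stores packed entries):

```python
CACHE_SIZE = 4096  # power of two; the C version's 4096 entries fit in 32 KB

class PairCache:
    """Direct-mapped cache of token-pair validity. `validate` stands in
    for the expensive decomposition check described above; repeated
    queries for the same pair hit the cache instead."""
    def __init__(self, validate):
        self.validate = validate
        self.keys = [None] * CACHE_SIZE
        self.vals = [False] * CACHE_SIZE

    def is_valid_pair(self, left: int, right: int) -> bool:
        key = (left, right)
        slot = hash(key) & (CACHE_SIZE - 1)   # direct-mapped: one slot per key
        if self.keys[slot] == key:
            return self.vals[slot]            # cache hit
        result = self.validate(left, right)   # slow path: full decomposition
        self.keys[slot] = key
        self.vals[slot] = result
        return result
```

A direct-mapped table trades occasional conflict evictions for a lookup that is a single index computation, which is why it can amortize a slow validation routine across segments.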
The HuggingFace tokenizers Rust crate is the same backend that the Python
tokenizers library wraps. Running it directly from Rust eliminates Python
overhead, isolating the algorithm-level performance difference:
| Corpus | IREE | HF (Rust) | Speedup |
|---|---|---|---|
| ASCII | 44.7 MiB/s | 4.0 MiB/s | 11.2x |
| CJK | 37.9 MiB/s | 4.2 MiB/s | 9.0x |
| Code | 52.1 MiB/s | 3.9 MiB/s | 13.4x |
HF tokenizers is uniformly ~4 MiB/s regardless of corpus, suggesting a per-token overhead floor that dominates encoding time. The Python wrapper adds only ~30% on top of this.
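A quick sanity check of that per-token floor, using GPT-2's ~2.4 bytes/token average noted later in this document:

```python
# Back-of-the-envelope: a ~4 MiB/s ceiling at ~2.4 bytes/token implies
# a fixed cost of roughly 570 ns per emitted token.
MIB = 1024 * 1024
throughput_bytes_per_s = 4.0 * MIB
bytes_per_token = 2.4  # GPT-2 average bytes per token (from the LLaMA 3.1 section)
tokens_per_s = throughput_bytes_per_s / bytes_per_token
ns_per_token = 1e9 / tokens_per_s
```

Several hundred nanoseconds per token is far above a hash lookup or a merge step, which is consistent with a fixed per-token bookkeeping cost dominating the encode loop.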
tiktoken-rs uses priority-queue BPE (O(n log L) per segment) vs IREE's backtracking BPE (O(n) amortized). IREE is 2.5-4.1x faster on encode:
| Corpus | Encode speedup (IREE vs tiktoken-rs) |
|---|---|
| ASCII | 3.0x |
| CJK | 2.5x |
| Code | 4.1x |
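For reference, priority-queue BPE can be sketched as follows (an illustrative Python version of the general technique, not tiktoken-rs's code): pop the lowest-rank adjacent pair from a heap, merge it, push the newly created neighbor pairs, and skip heap entries invalidated by earlier merges.

```python
import heapq

def bpe_priority_queue(piece: bytes, ranks: dict) -> list:
    """Priority-queue BPE sketch: `ranks` maps mergeable byte strings to
    merge priority (lower merges first). Returns token byte strings."""
    if not piece:
        return []
    parts = [bytes([b]) for b in piece]
    # prev/next links make each merge O(1); `alive` marks dead slots.
    nxt = list(range(1, len(parts))) + [None]
    prv = [None] + list(range(len(parts) - 1))
    alive = [True] * len(parts)

    heap = []
    def push(i):
        j = nxt[i]
        if j is not None:
            merged = parts[i] + parts[j]
            if merged in ranks:
                heapq.heappush(heap, (ranks[merged], i, parts[i], parts[j]))
    for i in range(len(parts)):
        push(i)

    while heap:
        rank, i, left, right = heapq.heappop(heap)
        j = nxt[i] if alive[i] else None
        # Lazy deletion: skip stale entries whose pair is no longer intact.
        if j is None or parts[i] != left or parts[j] != right:
            continue
        parts[i] = left + right           # merge right neighbor into left
        alive[j] = False
        nxt[i] = nxt[j]
        if nxt[j] is not None:
            prv[nxt[j]] = i
        push(i)                           # new pair with the next symbol
        if prv[i] is not None:
            push(prv[i])                  # new pair with the previous symbol
    return [p for k, p in enumerate(parts) if alive[k]]
```

The log factor comes from the heap operations; backtracking BPE avoids the heap entirely, which is where the constant-factor and asymptotic gap in the table above originates.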
Decode throughput in MiB/s (higher is better):

| Corpus | IREE | Rust bpe | HF (Rust) | tiktoken-rs | tiktoken (Py) | HF (Py) |
|---|---|---|---|---|---|---|
| ASCII | 655 | 481 | 47.6 | 310 | 165 | 31 |
| CJK | 1,558 | 320 | 30.5 | 231 | 104 | 20 |
| Code | 2,428 | 490 | 36.4 | 338 | 139 | 23 |
IREE dominates decode across the board.
| Corpus | IREE | Rust bpe | Speedup |
|---|---|---|---|
| ASCII | 655 MiB/s | 481 MiB/s | 1.4x |
| CJK | 1,558 MiB/s | 320 MiB/s | 4.9x |
| Code | 2,428 MiB/s | 490 MiB/s | 5.0x |
The architecture is fundamentally different: IREE pre-decodes vocabulary entries at initialization time, storing decoded UTF-8 strings contiguously in memory. Decode is a direct lookup + memcpy from the pre-computed table. Rust bpe reconstructs strings from per-token byte sequences at decode time.
The CJK result (1.5 GiB/s) demonstrates how cache-friendly this approach is: CJK tokens average 1.87 bytes each, so the token-to-string lookup stays in L1 cache and the output is nearly a sequential memory write.
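The pre-decoded table can be sketched as follows (illustrative Python; `DecodedVocabTable` is a hypothetical name, and the C implementation stores offsets and bytes in flat arrays rather than Python objects):

```python
class DecodedVocabTable:
    """Decode via a pre-computed table: every token's byte string is
    decoded once at init and stored contiguously, so decode is a pure
    offset lookup plus a copy with no per-token reconstruction."""
    def __init__(self, token_bytes: list):
        blob = bytearray()
        self.offsets = []
        for tb in token_bytes:
            self.offsets.append((len(blob), len(tb)))  # (offset, length)
            blob += tb
        self.blob = bytes(blob)  # contiguous, cache-friendly storage

    def decode(self, token_ids) -> bytes:
        out = bytearray()
        for tid in token_ids:
            off, length = self.offsets[tid]
            out += self.blob[off:off + length]  # lookup + copy only
        return bytes(out)
```

The design trades a one-time initialization cost and some memory for a decode loop whose inner body is essentially a table read and a memcpy.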
LLaMA 3.1 uses a 128K-token vocabulary — 2.5x larger than GPT-2. The Rust
bpe crate cannot load LLaMA 3's vocabulary (its rank-based BPE invariant
check fails because LLaMA 3 has 280K merge rules for 128K tokens, creating
"unreachable" tokens that violate the crate's self-encoding assertion). The
comparison here is IREE vs HuggingFace tokenizers (Rust-native, no Python).
Encode throughput in MiB/s:

| Corpus | IREE | HF (Rust) | Speedup |
|---|---|---|---|
| ASCII | 42.8 | 4.4 | 9.7x |
| CJK | 23.0 | 5.8 | 4.0x |
| Code | 57.1 | 4.5 | 12.7x |
IREE's advantage is largest on Code (12.7x) because the 128K vocabulary contains long code-specific tokens (average 4.2 bytes/token vs GPT-2's 2.4), reducing BPE merge iterations per byte and letting IREE's whole-segment fast-path trigger more frequently. CJK shows the smallest advantage (4.0x) because CJK tokens are still relatively short (3.2 bytes/token), meaning more BPE work per byte.
Decode throughput in MiB/s:

| Corpus | IREE | HF (Rust) | Speedup |
|---|---|---|---|
| ASCII | 655 | 46.9 | 14.0x |
| CJK | 1,558 | 44.6 | 34.9x |
| Code | 2,428 | 49.4 | 49.1x |
IREE's pre-decoded lookup table architecture gives an enormous advantage on decode. The CJK speedup (34.9x) reflects the extreme cache-friendliness of IREE's approach: with short CJK tokens, the lookup table fits in L1 cache and decode becomes a near-sequential memory write.
IREE's tokenizer goes beyond raw throughput:
| Feature | IREE | Rust bpe | tiktoken-rs | HF tokenizers |
|---|---|---|---|---|
| Streaming encode | Yes | No | No | No |
| Streaming decode | Yes | No | No | No |
| Zero-allocation encode | Yes | No | No | No |
| Offset tracking | Yes | No | No | Yes |
| Special token handling | Yes | No | No | Yes |
| BPE + WordPiece + Unigram | Yes | BPE only | BPE only | Yes |
| Pure C (no runtime deps) | Yes | Rust stdlib | Rust stdlib | Rust + Python |
| Embeddable (no allocator) | Yes | No | No | No |
IREE's streaming API allows encoding arbitrarily long text with bounded memory (configurable 4-64 KB transform buffer), producing tokens incrementally. The Rust and Python alternatives require the entire input in memory and produce all tokens at once.
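The streaming contract can be sketched as follows (illustrative Python, not the C API; `encode_segments` and `segment_boundary` stand in for the real pretokenizer and BPE stages, and the buffer cap mirrors the configurable transform buffer described above):

```python
def streaming_encode(chunks, encode_segments, segment_boundary, max_buffer=4096):
    """Bounded-memory streaming encode sketch: buffer input until a safe
    pretokenization boundary, emit tokens for the complete prefix, and
    carry the incomplete tail over to the next chunk."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # Last point where cutting cannot split a segment across chunks.
        cut = segment_boundary(buffer)
        if cut > 0:
            yield from encode_segments(buffer[:cut])
            buffer = buffer[cut:]
        if len(buffer) > max_buffer:
            # Oversized single segment: force-flush to keep memory bounded.
            yield from encode_segments(buffer)
            buffer = b""
    if buffer:
        yield from encode_segments(buffer)  # final partial segment
```

Only the carried-over tail is ever resident, so memory use is bounded by the buffer size regardless of input length, which is the property the one-shot APIs lack.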
```shell
cd ~/src/iree/loom

# GPT-2
iree-bazel-run --copt=-O3 --copt=-march=native --features=thin_lto \
  //runtime/src/iree/tokenizer/tools:comprehensive_benchmark -- \
  --tokenizer_json=path/to/gpt2.json \
  --ascii_text=path/to/sherlock.txt \
  --cjk_text=path/to/japanese.txt \
  --code_text=path/to/c_code.txt \
  --benchmark_min_time=2s

# LLaMA 3
iree-bazel-run --copt=-O3 --copt=-march=native --features=thin_lto \
  //runtime/src/iree/tokenizer/tools:comprehensive_benchmark -- \
  --tokenizer_json=path/to/NousResearch_Meta-Llama-3.1-8B.json \
  --ascii_text=path/to/sherlock.txt \
  --cjk_text=path/to/japanese.txt \
  --code_text=path/to/c_code.txt \
  --benchmark_min_time=2s
```

Or use the convenience script that downloads tokenizers automatically:

```shell
runtime/src/iree/tokenizer/tools/run_benchmarks.sh
```

```shell
cd ~/src/iree-tmp/tokenizers/benchmarks/bpe_bench
cargo run --release              # All benchmarks (GPT-2 + LLaMA 3)
cargo run --release -- --gpt2    # GPT-2 only
cargo run --release -- --llama3  # LLaMA 3 only
```

```shell
# tiktoken (all corpora, encode + decode)
python3 ~/src/iree-tmp/tokenizers/benchmarks/comprehensive_tiktoken.py

# HuggingFace tokenizers (all corpora, encode + decode)
python3 ~/src/iree-tmp/tokenizers/benchmarks/comprehensive_huggingface.py
```

Text corpora are at ~/src/iree-tmp/tokenizers/txt/ and tokenizer JSON files at
~/src/iree-tmp/tokenizers/json/. The run_benchmarks.sh script downloads these
automatically if not present.