benvanik/PERFORMANCE_COMPARISON.md

## PERFORMANCE_COMPARISON.md

      
    Raw
  

              PERFORMANCE_COMPARISON.md
            
          
    Tokenizer Performance Comparison

Benchmark comparing IREE's C tokenizer against the best available Rust and Python
tokenizer implementations.
Last updated: 2026-02-21. Re-run benchmarks before citing these numbers.
Test Environment


CPU: AMD EPYC (192 cores, 5391 MHz), 32 KB L1d, 1 MB L2, 32 MB L3
OS: Linux 6.14.0-37-generic
IREE: C23, Clang -O3 -march=native, ThinLTO, backtracking BPE
Rust bpe crate: bpe 0.2 + bpe-openai 0.3, opt-level=3, lto=true, backtracking BPE
tiktoken-rs: tiktoken-rs 0.9, opt-level=3, lto=true, priority-queue BPE
HF (Rust): tokenizers 0.22.2, opt-level=3, lto=true, Rust-native (no Python)
tiktoken (Python): tiktoken 0.12.0 (Python wrapper around tiktoken-rs)
HF tokenizers (Python): tokenizers 0.21.4 (Python wrapper around Rust backend)

Methodology

All benchmarks use identical methodology: full files, single copy, cache-hot.
Each benchmark runs for at least 2 seconds to ensure statistical stability.
The same three text files are used across all implementations.


Corpus
File
Size
Description


ASCII
sherlock.txt
595 KB
English prose (Sherlock Holmes)


CJK
japanese.txt
100 KB
Japanese text


Code
c_code.txt
171 KB
C source code


GPT-2 r50k_base (50K vocab)

All GPT-2 benchmarks use the same vocabulary for fair comparison — same algorithm,
same vocabulary, different implementations.
GPT-2 Encode Throughput (MiB/s, higher is better)


Corpus
IREE
Rust bpe
HF (Rust)
tiktoken-rs
tiktoken (Py)
HF (Py)


ASCII
44.7
39.9
4.0
15.0
19.9
2.7


CJK
37.9
37.6
4.2
15.4
15.8
2.6


Code
52.1
30.6
3.9
12.7
14.6
2.5


IREE uses the best of streaming and one-shot modes. All Rust crates are native
(no Python overhead). tiktoken (Python) calls tiktoken-rs under the hood but adds
Python dispatch overhead.
IREE vs Rust bpe (same algorithm, same vocabulary)

Both IREE and Rust bpe use backtracking BPE with GPT-2 r50k_base vocabulary.
This is the fairest head-to-head comparison:


Code: IREE is 1.70x faster (52.1 vs 30.6 MiB/s). Code text has
predictable token patterns (identifiers, operators, whitespace) that benefit
from IREE's DAAC trie-based vocabulary lookup with O(1) state transitions.


ASCII: IREE leads (44.7 vs 39.9 MiB/s, 1.12x faster). The
reachability-gated whole-segment fast-path skips full BPE merge processing
for segments that match a single reachable vocabulary token. IREE's streaming
mode (44.7 MiB/s) outperforms one-shot (32.5 MiB/s) by reusing pre-allocated
state. Profiling shows the remaining bottleneck is the backtracking algorithm
(29% of time) and pair validation (17%), with the DFA regex segmenter at 8%.


CJK: IREE now leads (37.9 vs 37.6 MiB/s, 1.01x). CJK tokens
average only 1.3 bytes each (vs 3.5 for ASCII), which amplifies all per-token
costs. Optimizations have improved CJK by +84% total (from 20.6 MiB/s):
whole-segment reachability fast-path (+31%), Unicode category fast-paths for
Latin Extended and CJK ranges (+11%), whitespace rejection fast-paths for
CJK codepoints (+7%), forced inlining of UTF-8 decode / Unicode
classification hot-path functions (+3%), and expanding the pair validation
cache from 256 to 4096 entries (+12%). The GPT-2 ByteLevel pretokenizer
remaps all bytes to ASCII/Latin codepoints before BPE merging, so the
Unicode classification functions are only exercised during regex
segmentation — tokenizers that operate on raw UTF-8 (e.g., LLaMA 3.1) see
much larger gains from inlining (+47% CJK throughput).
Both IREE and Rust bpe validate token pairs using iterative decomposition
via a split table. Rust bpe's implementation is a tight ~40-line loop with
a single deterministic decomposition path. IREE's is more thorough: a
stack-based search that tries multiple decomposition paths (split table
first, then all byte-split positions) and handles deferred merge semantics.
The 4096-entry pair cache (32 KB, fits in L1) compensates for the additional
per-call cost by caching valid pairs across segments.


IREE vs HF tokenizers (Rust-native)

The HuggingFace tokenizers Rust crate is the same backend that the Python
tokenizers library wraps. Running it directly from Rust eliminates Python
overhead, isolating the algorithm-level performance difference:


Corpus
IREE
HF (Rust)
Speedup


ASCII
44.7 MiB/s
4.0 MiB/s
11.2x


CJK
37.9 MiB/s
4.2 MiB/s
9.0x


Code
52.1 MiB/s
3.9 MiB/s
13.4x


HF tokenizers is uniformly ~4 MiB/s regardless of corpus, suggesting a
per-token overhead floor that dominates encoding time. The Python wrapper adds
only ~30% on top of this.
IREE vs tiktoken-rs (different algorithm, same vocabulary)

tiktoken-rs uses priority-queue BPE (O(n log L) per segment) vs IREE's
backtracking BPE (O(n) amortized). IREE is 2.5-4.1x faster on encode:


Corpus
Speedup


ASCII
3.0x


CJK
2.5x


Code
4.1x


GPT-2 Decode Throughput (MiB/s, higher is better)


Corpus
IREE
Rust bpe
HF (Rust)
tiktoken-rs
tiktoken (Py)
HF (Py)


ASCII
655
481
47.6
310
165
31


CJK
1,558
320
30.5
231
104
20


Code
2,428
490
36.4
338
139
23


IREE dominates decode across the board.
IREE vs Rust bpe decode


Corpus
IREE
Rust bpe
Speedup


ASCII
655 MiB/s
481 MiB/s
1.4x


CJK
1,558 MiB/s
320 MiB/s
4.9x


Code
2,428 MiB/s
490 MiB/s
5.0x


The architecture is fundamentally different: IREE pre-decodes vocabulary entries
at initialization time, storing decoded UTF-8 strings contiguously in memory.
Decode is a direct lookup + memcpy from the pre-computed table. Rust bpe
reconstructs strings from per-token byte sequences at decode time.
The CJK result (1.5 GiB/s) demonstrates how cache-friendly this approach is:
CJK tokens average 1.87 bytes each, so the token-to-string lookup stays in L1
cache and the output is nearly a sequential memory write.
LLaMA 3.1 (128K vocab)

LLaMA 3.1 uses a 128K-token vocabulary — 2.5x larger than GPT-2. The Rust
bpe crate cannot load LLaMA 3's vocabulary (its rank-based BPE invariant
check fails because LLaMA 3 has 280K merge rules for 128K tokens, creating
"unreachable" tokens that violate the crate's self-encoding assertion). The
comparison here is IREE vs HuggingFace tokenizers (Rust-native, no Python).
LLaMA 3 Encode Throughput (MiB/s, higher is better)


Corpus
IREE
HF (Rust)
Speedup


ASCII
42.8
4.4
9.7x


CJK
23.0
5.8
4.0x


Code
57.1
4.5
12.7x


IREE's advantage is largest on Code (12.7x) because the 128K vocabulary
contains long code-specific tokens (average 4.2 bytes/token vs GPT-2's 2.4),
reducing BPE merge iterations per byte and letting IREE's whole-segment
fast-path trigger more frequently. CJK shows the smallest advantage (4.0x)
because CJK tokens are still relatively short (3.2 bytes/token), meaning more
BPE work per byte.
LLaMA 3 Decode Throughput (MiB/s, higher is better)


Corpus
IREE
HF (Rust)
Speedup


ASCII
655
46.9
14.0x


CJK
1,558
44.6
34.9x


Code
2,428
49.4
49.1x


IREE's pre-decoded lookup table architecture gives an enormous advantage on
decode. The CJK speedup (34.9x) reflects the extreme cache-friendliness of
IREE's approach: with short CJK tokens, the lookup table fits in L1 cache and
decode becomes a near-sequential memory write.
Feature Comparison

IREE's tokenizer goes beyond raw throughput:


Feature
IREE
Rust bpe
tiktoken-rs
HF tokenizers


Streaming encode
Yes
No
No
No


Streaming decode
Yes
No
No
No


Zero-allocation encode
Yes
No
No
No


Offset tracking
Yes
No
No
Yes


Special token handling
Yes
No
No
Yes


BPE + WordPiece + Unigram
Yes
BPE only
BPE only
Yes


Pure C (no runtime deps)
Yes
Rust stdlib
Rust stdlib
Rust + Python


Embeddable (no allocator)
Yes
No
No
No


IREE's streaming API allows encoding arbitrarily long text with bounded memory
(configurable 4-64 KB transform buffer), producing tokens incrementally. The
Rust and Python alternatives require the entire input in memory and produce all
tokens at once.
Reproduction

IREE benchmark

cd ~/src/iree/loom

# GPT-2
iree-bazel-run --copt=-O3 --copt=-march=native --features=thin_lto \
  //runtime/src/iree/tokenizer/tools:comprehensive_benchmark -- \
  --tokenizer_json=path/to/gpt2.json \
  --ascii_text=path/to/sherlock.txt \
  --cjk_text=path/to/japanese.txt \
  --code_text=path/to/c_code.txt \
  --benchmark_min_time=2s

# LLaMA 3
iree-bazel-run --copt=-O3 --copt=-march=native --features=thin_lto \
  //runtime/src/iree/tokenizer/tools:comprehensive_benchmark -- \
  --tokenizer_json=path/to/NousResearch_Meta-Llama-3.1-8B.json \
  --ascii_text=path/to/sherlock.txt \
  --cjk_text=path/to/japanese.txt \
  --code_text=path/to/c_code.txt \
  --benchmark_min_time=2s
Or use the convenience script that downloads tokenizers automatically:
runtime/src/iree/tokenizer/tools/run_benchmarks.sh
Rust benchmarks (bpe + tiktoken-rs + HF tokenizers)

cd ~/src/iree-tmp/tokenizers/benchmarks/bpe_bench
cargo run --release           # All benchmarks (GPT-2 + LLaMA 3)
cargo run --release -- --gpt2   # GPT-2 only
cargo run --release -- --llama3 # LLaMA 3 only
Python benchmarks

# tiktoken (all corpora, encode + decode)
python3 ~/src/iree-tmp/tokenizers/benchmarks/comprehensive_tiktoken.py

# HuggingFace tokenizers (all corpora, encode + decode)
python3 ~/src/iree-tmp/tokenizers/benchmarks/comprehensive_huggingface.py
Test data

Text corpora are at ~/src/iree-tmp/tokenizers/txt/ and tokenizer JSON files at
~/src/iree-tmp/tokenizers/json/. The run_benchmarks.sh script downloads these
automatically if not present.
Corpus	File	Size	Description
ASCII	sherlock.txt	595 KB	English prose (Sherlock Holmes)
CJK	japanese.txt	100 KB	Japanese text
Code	c_code.txt	171 KB	C source code
Corpus	IREE	Rust bpe	HF (Rust)	tiktoken-rs	tiktoken (Py)	HF (Py)
ASCII	44.7	39.9	4.0	15.0	19.9	2.7
CJK	37.9	37.6	4.2	15.4	15.8	2.6
Code	52.1	30.6	3.9	12.7	14.6	2.5
Corpus	IREE	HF (Rust)	Speedup
ASCII	44.7 MiB/s	4.0 MiB/s	11.2x
CJK	37.9 MiB/s	4.2 MiB/s	9.0x
Code	52.1 MiB/s	3.9 MiB/s	13.4x
Corpus	IREE	Rust bpe	HF (Rust)	tiktoken-rs	tiktoken (Py)	HF (Py)
ASCII	655	481	47.6	310	165	31
CJK	1,558	320	30.5	231	104	20
Code	2,428	490	36.4	338	139	23
Corpus	IREE	Rust bpe	Speedup
ASCII	655 MiB/s	481 MiB/s	1.4x
CJK	1,558 MiB/s	320 MiB/s	4.9x
Code	2,428 MiB/s	490 MiB/s	5.0x
Corpus	IREE	HF (Rust)	Speedup
ASCII	42.8	4.4	9.7x
CJK	23.0	5.8	4.0x
Code	57.1	4.5	12.7x
Corpus	IREE	HF (Rust)	Speedup
ASCII	655	46.9	14.0x
CJK	1,558	44.6	34.9x
Code	2,428	49.4	49.1x
Feature	IREE	Rust bpe	tiktoken-rs	HF tokenizers
Streaming encode	Yes	No	No	No
Streaming decode	Yes	No	No	No
Zero-allocation encode	Yes	No	No	No
Offset tracking	Yes	No	No	Yes
Special token handling	Yes	No	No	Yes
BPE + WordPiece + Unigram	Yes	BPE only	BPE only	Yes
Pure C (no runtime deps)	Yes	Rust stdlib	Rust stdlib	Rust + Python
Embeddable (no allocator)	Yes	No	No	No