Machine: AMD Ryzen 9 9950X 16-Core Processor
Date: 2026-01-08
Features: --features parallel
| Size | unbatched | batched<1> | Δ | unbatched×2 | batched<2> | Δ |
|---|---|---|---|---|---|---|
| 2^16×128 | 1.79 ms | 1.83 ms | +2% | 3.49 ms | 3.33 ms | -5% |
| 2^16×512 | 6.14 ms | 6.17 ms | ~0% | 12.28 ms | 12.46 ms | +2% |
| 2^16×4096 | 47.59 ms | 47.69 ms | ~0% | 95.17 ms | 99.58 ms | +5% |
| 2^18×128 | 6.09 ms | 6.10 ms | ~0% | 12.19 ms | 11.99 ms | -2% |
| 2^18×512 | 22.82 ms | 23.21 ms | +2% | 45.83 ms | 46.41 ms | +1% |
| 2^18×4096 | 179.26 ms | 179.10 ms | ~0% | 358.69 ms | 372.41 ms | +4% |
| 2^20×128 | 22.52 ms | 22.59 ms | ~0% | 45.30 ms | 45.87 ms | +1% |
| 2^20×512 | 88.06 ms | 88.17 ms | ~0% | 178.71 ms | 181.90 ms | +2% |
| 2^20×4096 | 701.30 ms | 700.55 ms | ~0% | 1.40 s | 1.45 s | +4% |
| 2^22×128 | 88.38 ms | 88.31 ms | ~0% | 176.86 ms | 182.29 ms | +3% |
| 2^22×512 | 348.25 ms | 348.28 ms | ~0% | 693.73 ms | 714.54 ms | +3% |
| 2^22×4096 | 2.78 s | 2.80 s | ~0% | 5.56 s | 5.77 s | +4% |
- batched<1> overhead: negligible (0-2%), well within noise
- batched<2> benefit: up to 5% faster on small narrow matrices (128 cols at 2^16 and 2^18); slight overhead (1-5%) elsewhere
- Wide matrices (4096 cols): no benefit, memory bandwidth saturated; the batched version is slightly slower (4-5% overhead). The unbatched×N vs batched<N> measurement pattern is sketched below.
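A minimal sketch of that measurement pattern, with a placeholder workload standing in for whatever the real benchmark computes per matrix (all names below are illustrative, not the crate's API):

```rust
use std::hint::black_box;
use std::time::Instant;

// A "matrix" here is just rows × cols elements stored flat; the real benchmark
// operates on field-element matrices.
fn make_matrix(rows: usize, cols: usize) -> Vec<u64> {
    (0..rows * cols).map(|i| i as u64).collect()
}

// Placeholder per-matrix work; the real benchmark runs the crate's actual kernel.
fn process_one(m: &[u64]) -> u64 {
    m.iter().fold(0u64, |acc, &x| acc.wrapping_add(x.wrapping_mul(3)))
}

// Batched variant: one call over N matrices, mirroring the `batched<N>` column.
fn process_batched<const N: usize>(ms: [&[u64]; N]) -> [u64; N] {
    ms.map(process_one)
}

fn main() {
    let (rows, cols) = (1 << 16, 128);
    let a = make_matrix(rows, cols);
    let b = make_matrix(rows, cols);

    // "unbatched×2": two independent calls timed back to back.
    let t = Instant::now();
    black_box(process_one(&a));
    black_box(process_one(&b));
    println!("unbatched×2: {:?}", t.elapsed());

    // "batched<2>": a single call that sees both matrices at once.
    let t = Instant::now();
    black_box(process_batched::<2>([a.as_slice(), b.as_slice()]));
    println!("batched<2>:  {:?}", t.elapsed());
}
```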
Compares 4× unbatched calls vs batched<4>.
| Size | unbatched×4 | batched<4> | Δ |
|---|---|---|---|
| 2^16×128 | 7.26 ms | 6.67 ms | -8% |
| 2^16×512 | 24.79 ms | 25.33 ms | +2% |
| 2^16×4096 | 189.05 ms | 203.00 ms | +7% |
| 2^18×128 | 24.00 ms | 23.58 ms | -2% |
| 2^18×512 | 90.57 ms | 93.91 ms | +4% |
| 2^18×4096 | 715.68 ms | 757.23 ms | +6% |
| 2^20×128 | 89.96 ms | 91.68 ms | +2% |
| 2^20×512 | 352.87 ms | 379.19 ms | +7% |
| 2^20×4096 | 2.80 s | 2.94 s | +5% |
| 2^22×128 | 352.68 ms | 363.71 ms | +3% |
| 2^22×512 | 1.41 s | 1.45 s | +3% |
| 2^22×4096 | 11.15 s | 11.70 s | +5% |
- batched<4> benefit: up to 8% faster on small narrow matrices (2^16×128)
- Narrow matrices (128 cols): slight benefit or minimal overhead (-8% to +3%), with the benefit shrinking as rows grow
- Medium matrices (512 cols): 2-7% overhead
- Wide matrices (4096 cols): 5-7% overhead, memory bandwidth saturated
- The benefit of batched<4> decreases as matrix size increases, likely due to cache/memory bandwidth effects
Simulates opening the quotient polynomial at 1 vs 2 evaluation points. For degree-3 constraints, the quotient has degree 2·(n-1), requiring 2 chunks; the resulting flattened column counts (see the snippet after this list) are:
- BabyBear: 8 columns (2 chunks × extension degree 4)
- Goldilocks: 4 columns (2 chunks × extension degree 2)
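A minimal restatement of that column-count arithmetic (the constant names are illustrative, not identifiers from the crate):

```rust
// Flattened quotient width = number of chunks × extension degree
// (4 over BabyBear, 2 over Goldilocks, per the bullets above).
const NUM_CHUNKS: usize = 2; // degree-3 constraints ⇒ quotient degree 2·(n-1) ⇒ 2 chunks
const BABYBEAR_EXT_DEGREE: usize = 4;
const GOLDILOCKS_EXT_DEGREE: usize = 2;

fn main() {
    assert_eq!(NUM_CHUNKS * BABYBEAR_EXT_DEGREE, 8); // BabyBear: 8 columns
    assert_eq!(NUM_CHUNKS * GOLDILOCKS_EXT_DEGREE, 4); // Goldilocks: 4 columns
}
```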
BabyBear (8 columns):
| log₂(rows) | 1 point | 2 points | Overhead |
|---|---|---|---|
| 16 | 266 µs | 370 µs | 1.39× |
| 18 | 679 µs | 1085 µs | 1.60× |
| 20 | 2075 µs | 3596 µs | 1.73× |
| 22 | 6855 µs | 12851 µs | 1.87× |
Goldilocks (4 columns):
| log₂(rows) | 1 point | 2 points | Overhead |
|---|---|---|---|
| 16 | 177 µs | 288 µs | 1.63× |
| 18 | 361 µs | 471 µs | 1.30× |
| 20 | 866 µs | 1342 µs | 1.55× |
| 22 | 4625 µs | 6315 µs | 1.37× |
Opening at 2 points costs 1.30×–1.87× as much as opening at a single point (well below the naive 2×).
The results here are broadly consistent with the PR author's M2 Pro benchmarks:
- batched<1> overhead: Similar negligible overhead (0-5% on M2 Pro vs 0-2% here)
- batched<2> benefit: M2 Pro showed 5-8% benefit on narrow matrices; this AMD setup shows ~2-5%
- batched<4> benefit: Up to 8% faster on small narrow matrices, but overhead increases with matrix size
- Wide matrices: Both show no benefit on 4096-column matrices; batched<4> shows 5-7% overhead
- Quotient opening: M2 Pro showed 1.2×–1.65× overhead; this AMD setup shows 1.30×–1.87×
The AMD Ryzen 9 9950X generally shows higher absolute latencies than the M2 Pro, but the relative performance characteristics of the batched implementation are consistent across architectures.