# Benchmark Results for PR #1225: columnwise_dot_product_batched

- **Machine:** AMD Ryzen 9 9950X 16-Core Processor
- **Date:** 2026-01-08
- **Features:** `--features parallel`

## General columnwise_dot_product (BabyBear, EF = degree-4 extension)

Each Δ compares the batched column to the unbatched column immediately to its left; negative means the batched call is faster.

| Size | unbatched | batched<1> | Δ | unbatched×2 | batched<2> | Δ |
|-----------|-----------|------------|-----|-------------|------------|-----|
| 2^16×128  | 1.79 ms   | 1.83 ms    | +2% | 3.49 ms     | 3.33 ms    | -5% |
| 2^16×512  | 6.14 ms   | 6.17 ms    | ~0% | 12.28 ms    | 12.46 ms   | +2% |
| 2^16×4096 | 47.59 ms  | 47.69 ms   | ~0% | 95.17 ms    | 99.58 ms   | +5% |
| 2^18×128  | 6.09 ms   | 6.10 ms    | ~0% | 12.19 ms    | 11.99 ms   | -2% |
| 2^18×512  | 22.82 ms  | 23.21 ms   | +2% | 45.83 ms    | 46.41 ms   | +1% |
| 2^18×4096 | 179.26 ms | 179.10 ms  | ~0% | 358.69 ms   | 372.41 ms  | +4% |
| 2^20×128  | 22.52 ms  | 22.59 ms   | ~0% | 45.30 ms    | 45.87 ms   | +1% |
| 2^20×512  | 88.06 ms  | 88.17 ms   | ~0% | 178.71 ms   | 181.90 ms  | +2% |
| 2^20×4096 | 701.30 ms | 700.55 ms  | ~0% | 1.40 s      | 1.45 s     | +4% |
| 2^22×128  | 88.38 ms  | 88.31 ms   | ~0% | 176.86 ms   | 182.29 ms  | +3% |
| 2^22×512  | 348.25 ms | 348.28 ms  | ~0% | 693.73 ms   | 714.54 ms  | +3% |
| 2^22×4096 | 2.78 s    | 2.80 s     | ~0% | 5.56 s      | 5.77 s     | +4% |

### Key findings

- batched<1> overhead: negligible (0-2%), well within noise
- batched<2> benefit: up to 5% faster on narrow matrices (128 cols) at 2^16-2^18 rows; the benefit fades at 2^20 rows and above
- Wide matrices (4096 cols): no benefit, memory bandwidth saturated; the batched version is slightly slower (~4-5% overhead). A sketch of the batching idea, and of why it stops paying off here, follows this list.
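
The mechanics behind these deltas: an unbatched call makes one full pass over the matrix per right-hand-side vector, whereas a batched<K> call computes all K column-wise dot products during a single pass, so the matrix is streamed from memory once instead of K times. The sketch below illustrates the idea in plain Rust; the row-major slice layout, the f64 stand-in for field elements, and the function signature are assumptions for illustration, not the PR's actual API.

```rust
/// Illustrative sketch of the batching idea (not the PR's implementation).
/// `matrix` is row-major with `width` columns; `vs` holds K vectors of
/// length `height`. Returns, for each vector, one dot product per column.
fn columnwise_dot_product_batched<const K: usize>(
    matrix: &[f64],
    width: usize,
    vs: [&[f64]; K],
) -> [Vec<f64>; K] {
    let height = matrix.len() / width;
    let mut out: [Vec<f64>; K] = std::array::from_fn(|_| vec![0.0; width]);
    // Single pass: each row is loaded from memory once and feeds all K
    // accumulators, instead of K separate passes over the whole matrix.
    for r in 0..height {
        let row = &matrix[r * width..(r + 1) * width];
        for k in 0..K {
            let scale = vs[k][r];
            for (acc, &x) in out[k].iter_mut().zip(row) {
                *acc += scale * x;
            }
        }
    }
    out
}

fn main() {
    // 2x3 matrix; batched<2> handles both vectors in one traversal.
    let m = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0];
    let (v1, v2): (&[f64], &[f64]) = (&[1.0, 1.0], &[1.0, -1.0]);
    let [a, b] = columnwise_dot_product_batched(&m, 3, [v1, v2]);
    assert_eq!(a, vec![5.0, 7.0, 9.0]); // column sums
    assert_eq!(b, vec![-3.0, -3.0, -3.0]);
}
```

Note that only the matrix traffic is shared: the multiply-accumulate work still scales with K. Once that work (or already-saturated bandwidth, as on the 4096-column cases) dominates, batching has little left to save, which matches the flat-to-slightly-negative deltas on wide matrices.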

## Batched<4> Performance (BabyBear, EF = degree-4 extension)

Compares 4 separate unbatched calls against a single batched<4> call.

| Size | unbatched×4 | batched<4> | Δ |
|-----------|-------------|------------|-----|
| 2^16×128  | 7.26 ms     | 6.67 ms    | -8% |
| 2^16×512  | 24.79 ms    | 25.33 ms   | +2% |
| 2^16×4096 | 189.05 ms   | 203.00 ms  | +7% |
| 2^18×128  | 24.00 ms    | 23.58 ms   | -2% |
| 2^18×512  | 90.57 ms    | 93.91 ms   | +4% |
| 2^18×4096 | 715.68 ms   | 757.23 ms  | +6% |
| 2^20×128  | 89.96 ms    | 91.68 ms   | +2% |
| 2^20×512  | 352.87 ms   | 379.19 ms  | +7% |
| 2^20×4096 | 2.80 s      | 2.94 s     | +5% |
| 2^22×128  | 352.68 ms   | 363.71 ms  | +3% |
| 2^22×512  | 1.41 s      | 1.45 s     | +3% |
| 2^22×4096 | 11.15 s     | 11.70 s    | +5% |

### Key findings

- batched<4> benefit: up to 8% faster on small narrow matrices (2^16×128)
- Narrow matrices (128 cols): the benefit shrinks with row count, from -8% at 2^16 to +3% at 2^22
- Medium matrices (512 cols): 2-7% overhead
- Wide matrices (4096 cols): 5-7% overhead, memory bandwidth saturated (see the working-set estimate after this list)
- Overall, the benefit of batched<4> decreases as matrix size increases, likely due to cache and memory-bandwidth effects
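
To put a rough number on "memory bandwidth saturated": none of the wide cases comes close to fitting in cache. A back-of-the-envelope working-set estimate, assuming the usual 4-byte packed representation of BabyBear elements:

```latex
% Working set of the 4096-column cases, at 4 bytes per BabyBear element:
\[
  2^{16} \times 4096 \times 4\,\mathrm{B} = 1\,\mathrm{GiB},
  \qquad\ldots\qquad
  2^{22} \times 4096 \times 4\,\mathrm{B} = 64\,\mathrm{GiB},
\]
% versus 64 MiB of L3 on the 9950X: every pass must stream the matrix
% from DRAM, with or without batching.
```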

## Quotient Opening Simulation (degree-3 AIR)

Simulates opening the quotient polynomial at 1 vs. 2 evaluation points. For degree-3 constraints the quotient has degree ≈ 2·(n−1), requiring 2 chunks (a derivation of this bound follows the list):

- BabyBear: 8 columns (2 chunks × extension degree 4)
- Goldilocks: 4 columns (2 chunks × extension degree 2)
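
Where the chunk count comes from, as a sanity check (assuming the standard uni-STARK quotient over a trace domain of size n with vanishing polynomial Z_H(X) = X^n − 1; this derivation is not taken from the PR):

```latex
% Trace columns over n rows interpolate to polynomials of degree <= n - 1,
% so composing them with a degree-3 constraint gives deg C <= 3(n - 1).
% Dividing by the degree-n vanishing polynomial of the trace domain:
\[
  \deg Q \;\le\; 3(n-1) - n \;=\; 2n - 3 \;\approx\; 2(n-1),
\]
% which exceeds n - 1, so Q splits into 2 chunks of degree < n. Each chunk
% contributes one extension-field value per row: 2 chunks x degree-4
% extension = 8 base columns for BabyBear, 2 x 2 = 4 for Goldilocks.
```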

### BabyBear (8 columns)

| log₂(rows) | 1 point | 2 points | Overhead |
|------------|---------|----------|----------|
| 16 | 266 µs  | 370 µs   | 1.39× |
| 18 | 679 µs  | 1085 µs  | 1.60× |
| 20 | 2075 µs | 3596 µs  | 1.73× |
| 22 | 6855 µs | 12851 µs | 1.87× |

### Goldilocks (4 columns)

| log₂(rows) | 1 point | 2 points | Overhead |
|------------|---------|----------|----------|
| 16 | 177 µs  | 288 µs   | 1.63× |
| 18 | 361 µs  | 471 µs   | 1.30× |
| 20 | 866 µs  | 1342 µs  | 1.55× |
| 22 | 4625 µs | 6315 µs  | 1.37× |

### Key findings

Opening at 2 points costs 1.30×–1.87× as much as a single point, well below the naive 2× of two independent openings: the traversal of the trace columns is shared between the points, so only the per-point accumulation work doubles (see the toy model below).
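
A toy cost model makes the bracket explicit. The split into a shared traversal cost T_r and a per-point accumulation cost T_a is an assumption for intuition, not a measured decomposition:

```latex
\[
  \frac{\mathrm{cost}(2\ \mathrm{points})}{\mathrm{cost}(1\ \mathrm{point})}
  \;\approx\; \frac{T_r + 2T_a}{T_r + T_a} \;\in\; (1, 2),
\]
% near 1 when the shared traversal dominates, approaching 2 when per-point
% accumulation dominates; the observed 1.30x-1.87x sits inside this
% interval, strictly below the naive 2x.
```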

## Comparison with PR Author's Benchmarks (Apple M2 Pro)

The results here are broadly consistent with the PR author's M2 Pro benchmarks:

- batched<1> overhead: similarly negligible (0-5% on M2 Pro vs 0-2% here)
- batched<2> benefit: M2 Pro showed 5-8% gains on narrow matrices; this AMD setup shows ~2-5%
- batched<4> benefit: up to 8% faster on small narrow matrices, with overhead increasing with matrix size
- Wide matrices: both machines show no benefit at 4096 columns; batched<4> carries 5-7% overhead
- Quotient opening: M2 Pro showed 1.2×–1.65× overhead; this AMD setup shows 1.30×–1.87×

The AMD Ryzen 9 9950X generally shows higher absolute latencies than the M2 Pro, but the relative performance characteristics of the batched implementation are consistent across architectures.
