Machine: AMD Ryzen 9 9950X 16-Core Processor
Date: 2026-01-08
Features: --features parallel
| Size | unbatched | batched<1> | Δ | unbatched×2 | batched<2> | Δ |
|---|---|---|---|---|---|---|
| 2^16×128 | 1.79 ms | 1.83 ms | +2% | 3.49 ms | 3.33 ms | -5% |
| 2^16×512 | 6.14 ms | 6.17 ms | ~0% | 12.28 ms | 12.46 ms | +2% |
| 2^16×4096 | 47.59 ms | 47.69 ms | ~0% | 95.17 ms | 99.58 ms | +5% |
| 2^18×128 | 6.09 ms | 6.10 ms | ~0% | 12.19 ms | 11.99 ms | -2% |
| 2^18×512 | 22.82 ms | 23.21 ms | +2% | 45.83 ms | 46.41 ms | +1% |
| 2^18×4096 | 179.26 ms | 179.10 ms | ~0% | 358.69 ms | 372.41 ms | +4% |
| 2^20×128 | 22.52 ms | 22.59 ms | ~0% | 45.30 ms | 45.87 ms | +1% |
| 2^20×512 | 88.06 ms | 88.17 ms | ~0% | 178.71 ms | 181.90 ms | +2% |
| 2^20×4096 | 701.30 ms | 700.55 ms | ~0% | 1.40 s | 1.45 s | +4% |
| 2^22×128 | 88.38 ms | 88.31 ms | ~0% | 176.86 ms | 182.29 ms | +3% |
| 2^22×512 | 348.25 ms | 348.28 ms | ~0% | 693.73 ms | 714.54 ms | +3% |
| 2^22×4096 | 2.78 s | 2.80 s | ~0% | 5.56 s | 5.77 s | +4% |
- batched<1> overhead: negligible (0-2%), well within noise
- batched<2> benefit: up to 5% faster on small narrow matrices (128 cols at 2^16 and 2^18); slight overhead (1-5%) elsewhere
- Wide matrices (4096 cols): no benefit, memory bandwidth saturated; the batched version is slightly slower (4-5% overhead). The unbatched×N vs batched<N> measurement pattern is sketched below.
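A minimal sketch of that measurement pattern, with a placeholder workload standing in for whatever the real benchmark computes per matrix (all names below are illustrative, not the crate's API):

```rust
use std::hint::black_box;
use std::time::Instant;

// A "matrix" here is just rows × cols elements stored flat; the real benchmark
// operates on field-element matrices.
fn make_matrix(rows: usize, cols: usize) -> Vec<u64> {
    (0..rows * cols).map(|i| i as u64).collect()
}

// Placeholder per-matrix work; the real benchmark runs the crate's actual kernel.
fn process_one(m: &[u64]) -> u64 {
    m.iter().fold(0u64, |acc, &x| acc.wrapping_add(x.wrapping_mul(3)))
}

// Batched variant: one call over N matrices, mirroring the `batched<N>` column.
fn process_batched<const N: usize>(ms: [&[u64]; N]) -> [u64; N] {
    ms.map(process_one)
}

fn main() {
    let (rows, cols) = (1 << 16, 128);
    let a = make_matrix(rows, cols);
    let b = make_matrix(rows, cols);

    // "unbatched×2": two independent calls timed back to back.
    let t = Instant::now();
    black_box(process_one(&a));
    black_box(process_one(&b));
    println!("unbatched×2: {:?}", t.elapsed());

    // "batched<2>": a single call that sees both matrices at once.
    let t = Instant::now();
    black_box(process_batched::<2>([a.as_slice(), b.as_slice()]));
    println!("batched<2>:  {:?}", t.elapsed());
}
```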
Compares 4× unbatched calls vs batched<4>.
| Size | unbatched×4 | batched<4> | Δ |
|---|---|---|---|
| 2^16×128 | 7.26 ms | 6.67 ms | -8% |
| 2^16×512 | 24.79 ms | 25.33 ms | +2% |
| 2^16×4096 | 189.05 ms | 203.00 ms | +7% |
| 2^18×128 | 24.00 ms | 23.58 ms | -2% |
| 2^18×512 | 90.57 ms | 93.91 ms | +4% |
| 2^18×4096 | 715.68 ms | 757.23 ms | +6% |
| 2^20×128 | 89.96 ms | 91.68 ms | +2% |
| 2^20×512 | 352.87 ms | 379.19 ms | +7% |
| 2^20×4096 | 2.80 s | 2.94 s | +5% |
| 2^22×128 | 352.68 ms | 363.71 ms | +3% |
| 2^22×512 | 1.41 s | 1.45 s | +3% |
| 2^22×4096 | 11.15 s | 11.70 s | +5% |
- batched<4> benefit: up to 8% faster on small narrow matrices (2^16×128)
- Narrow matrices (128 cols): slight benefit or minimal overhead (-8% to +3%), with the benefit shrinking as rows grow
- Medium matrices (512 cols): 2-7% overhead
- Wide matrices (4096 cols): 5-7% overhead, memory bandwidth saturated
- The benefit of batched<4> decreases as matrix size increases, likely due to cache/memory bandwidth effects
Simulates opening the quotient polynomial at 1 vs 2 evaluation points. For degree-3 constraints, the quotient has degree 2·(n-1), requiring 2 chunks; the resulting flattened column counts (see the snippet after this list) are:
- BabyBear: 8 columns (2 chunks × extension degree 4)
- Goldilocks: 4 columns (2 chunks × extension degree 2)
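A minimal restatement of that column-count arithmetic (the constant names are illustrative, not identifiers from the crate):

```rust
// Flattened quotient width = number of chunks × extension degree
// (4 over BabyBear, 2 over Goldilocks, per the bullets above).
const NUM_CHUNKS: usize = 2; // degree-3 constraints ⇒ quotient degree 2·(n-1) ⇒ 2 chunks
const BABYBEAR_EXT_DEGREE: usize = 4;
const GOLDILOCKS_EXT_DEGREE: usize = 2;

fn main() {
    assert_eq!(NUM_CHUNKS * BABYBEAR_EXT_DEGREE, 8); // BabyBear: 8 columns
    assert_eq!(NUM_CHUNKS * GOLDILOCKS_EXT_DEGREE, 4); // Goldilocks: 4 columns
}
```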
BabyBear (8 columns):
| log₂(rows) | 1 point | 2 points | Overhead |
|---|---|---|---|
| 16 | 266 µs | 370 µs | 1.39× |
| 18 | 679 µs | 1085 µs | 1.60× |
| 20 | 2075 µs | 3596 µs | 1.73× |
| 22 | 6855 µs | 12851 µs | 1.87× |
Goldilocks (4 columns):
| log₂(rows) | 1 point | 2 points | Overhead |
|---|---|---|---|
| 16 | 177 µs | 288 µs | 1.63× |
| 18 | 361 µs | 471 µs | 1.30× |
| 20 | 866 µs | 1342 µs | 1.55× |
| 22 | 4625 µs | 6315 µs | 1.37× |
Opening at 2 points costs 1.30×–1.87× as much as opening at a single point (well below the naive 2×).
The results here are broadly consistent with the PR author's M2 Pro benchmarks:
- batched<1> overhead: Similar negligible overhead (0-5% on M2 Pro vs 0-2% here)
- batched<2> benefit: M2 Pro showed 5-8% benefit on narrow matrices; this AMD setup shows ~2-5%
- batched<4> benefit: Up to 8% faster on small narrow matrices, but overhead increases with matrix size
- Wide matrices: Both show no benefit on 4096-column matrices; batched<4> shows 5-7% overhead
- Quotient opening: M2 Pro showed 1.2×–1.65× overhead; this AMD setup shows 1.30×–1.87×
The AMD Ryzen 9 9950X generally shows higher absolute latencies than the M2 Pro, but the relative performance characteristics of the batched implementation are consistent across architectures.