Last updated: 2026-03-02
Written by Claude, analysis and optimization strategy by Gemini Deep Think, implementation and modification by Claude
Three plans informed the fix for performance regressions on the perf-optimization branch. Gemini Deep Think v1 made the original optimizations -- batch search improved massively (60-64% Rust, 5.7x Python), but single-query APX regressed up to 383% and some exact cases regressed up to 58%. Gemini v2 was a revised plan after the regressions surfaced. Claude independently diagnosed the root causes and produced the final 6-step implementation plan.
The core story: Gemini Deep Think had the high-level strategy basically right for batch, but missed that its own Change 4 ("Remove ndarray allocations from ANN reranking") caused the APX problem. It didn't account for blas-accelerate being enabled in Python builds, which means ndarray::dot() dispatches to Apple Accelerate's AMX coprocessor -- not generic Rust matrixmultiply. Removing ndarray from APX paths downgraded from AMX to per-row NEON via simsimd. For a (400 x 768) cluster rerank, that's one AMX call (~10us) replaced by 400 separate NEON dot products (~40us+). The regression severity correlates perfectly with dimensionality, which confirms the hardware root cause.
The fix preserves everything Gemini got right (all batch wins stay) and surgically restores what it broke (ndarray.dot back in APX paths, manual dot_unrolled_8 back for ILP, and a hybrid topk_from_scores that uses quickselect for moderate arrays).
Additional regression found and fixed (commit 9f2fc1e): After the initial 6-step fix, Criterion benchmarks vs upstream revealed APX was still 10-27% slower. Root cause: the register threshold hoisting + push_slow pattern was applied too broadly. For APX paths with ~hundreds of candidates per cluster, the simpler accumulator.push() (with its inlined fast-reject) is better optimized by LLVM than the explicit hoisting pattern. Additionally, removing #[cold] from push_slow allowed LLVM to inline heap manipulation code into inner loops, increasing instruction cache pressure. Fixes: reverted APX paths to accumulator.push(), restored #[cold] on push_slow, removed bench_verbose() calls from hot paths. Result: ALL single-query regressions eliminated, with several cases now faster than upstream.
This was the initial optimization blueprint that produced the batch wins. Five changes:
- Stream scores into
TopKAccumulatorinstead of allocating per-queryVecintopk_from_scores - Bypass intermediate vectors in chunked GEMM by streaming directly into accumulators
- Swap inner loops in
search_batch_matrix_chunked(queries outer, matrix rows inner) - Remove
ndarrayfrom ANN reranking entirely (replace with per-row simsimd loops) - Flatten PyO3 NumPy allocations + replace
dot_unrolled_8withiter().zip().map().sum()
Changes 1-3 were solid. Changes 4 and 5 caused the regressions.
Written after benchmark results exposed the regressions. Identified 4 root causes:
- Memory aliasing barrier -- mutable borrow on
acc.push()forces LLVM to reloadmin_thresholdfrom L1 every iteration - Algorithmic crossover -- heap streaming (O(n log k)) is slower than quickselect (O(n)) for moderate arrays
- AMX coprocessor loss -- removing ndarray from ANN reranking lost Apple Accelerate/AMX
- IEEE-754 ILP trap --
zip().map().sum()blocks auto-vectorization because float addition is not associative
Gemini v2 led with the aliasing barrier as the primary cause. Interesting framing, and valid as a secondary contributor, but it buries the lede. The AMX loss is what causes the 188-383% APX regressions.
Built from independent code analysis and benchmark data review. Identified the same root causes but ordered differently:
- AMX coprocessor loss (primary -- causes the biggest regressions)
dot_unrolled_8simplification kills ILP (causes 100K x 2 exact regression)#[cold]onpush_slowpessimizes APX codegen (where push_slow runs ~5-7% of the time)topk_from_scoreslost quickselect advantage for moderate arrays
This ordering matches the benchmark data better. The APX regressions are the showstoppers, and they all trace back to losing AMX.
AMX coprocessor loss is the primary APX regression. All three converge here. Gemini v1 proposed removing ndarray (causing the problem). Gemini v2 recognized this was wrong. Claude independently traced it to blas-accelerate in pyproject.toml and confirmed by showing regression severity correlates with dimensionality.
dot_unrolled_8 simplification hurts low-dim scalar fallback. Both v2 and Claude identified that iter().zip().map().sum() prevents LLVM from reordering float additions to use SIMD lanes (IEEE-754 non-associativity). Manual unrolling with independent accumulators is necessary for ILP. v1 recommended the simplification; it was wrong.
Register threshold hoisting helps all hot loops. Both v2 and Claude propose let mut threshold = acc.min_threshold; before loops with threshold = acc.min_threshold; after each push_slow call, keeping the threshold in a register instead of forcing a struct field reload.
- Gemini v1: Add it
- Gemini v2: Keep it
- Claude: Remove it (initial implementation)
- Final decision: RESTORE it
Claude initially removed #[cold] on the reasoning that push_slow runs ~5-7% of the time in APX paths -- not cold enough to justify pessimized codegen. Benchmarks proved this wrong. Removing #[cold] allowed LLVM to inline heap manipulation code at every push_slow call site, bloating the inner loop instruction cache footprint. This was measured as 10-27% APX regressions. Gemini v2 was right to say keep it. #[cold] was restored in commit 9f2fc1e.
- Gemini v2:
n_rows >= 16 && dims >= 64for AMX trigger - Claude:
n_rows >= 4 && dims >= 8(original upstream threshold) - Final decision: Use original upstream thresholds
The original thresholds are empirically validated. Gemini v2's higher thresholds were theoretically motivated but untested -- and risk leaving performance on the table for moderate-sized clusters (4-15 rows or 8-63 dims) that are common in practice. Conservative choice, known-correct.
- Gemini v2 leads with the aliasing barrier ("The Threshold Trap")
- Claude leads with AMX loss
- Final decision: AMX loss is primary
The aliasing barrier is a real effect, but it's an incremental optimization (maybe 10-15% on affected loops). The AMX loss causes 188-383% regressions. That's the headline.
| Change | Why Kept |
|---|---|
TopKAccumulator fast-reject split |
Eliminates heap ops for 99.9% of batch iterations |
| Batch inner loop swap (queries outer, rows inner) | Fixes branch predictor thrashing, 60-64% batch speedup |
| Batch centroid buffer reuse | Eliminates repeated centroid score allocations |
PyO3 batch_neighbors_to_numpy flattening |
Replaces 32K+ heap vectors with 2 flat allocations |
| Direct accumulator streaming in chunked GEMM | Avoids intermediate Vec<Neighbor> per chunk |
| Change | What it fixes |
|---|---|
| Restore ndarray.dot in APX paths | Re-enables AMX for cluster reranking |
| Restore ndarray.dot for centroid scoring | Re-enables AMX for centroid scoring |
| Register threshold hoisting in all hot loops | Eliminates compiler-forced L1 reloads |
| Hybrid topk_from_scores (131,072 threshold) | Quickselect for moderate arrays, streaming for large |
| Restore manual dot_unrolled_8 with bounds proof | Restores ILP for low-dim scalar fallback |
| Restore #[cold] on push_slow | Prevents LLVM from inlining heap code into inner loops |
131,072 for hybrid topk: ~1MB of score data. Below this, quickselect allocation fits in L2 and O(n) introselect beats O(n log k) heap insertion. Above this, the allocation evicts useful cache data and batch paths would allocate 80MB per query.
n_rows >= 4 && dims >= 8 for APX reranking: Original upstream threshold. Below 4 rows, AMX dispatch overhead outweighs hardware matrix multiply. Below 8 dims, NEON can process the whole vector in 1-2 loads.
nlist >= 16 && dims >= 16 for centroid scoring: Also original upstream. Centroid matrix is typically (100 x 768) for a 10K-row index. One sgemv call scores all centroids at once.
AMX restoration depends on blas-accelerate being enabled. Python builds have it (pyproject.toml). Rust-only cargo bench without --features blas-accelerate falls back to matrixmultiply -- still better than per-row simsimd, but not AMX-level. This is fine since the upstream PR is measured against Python benchmarks.
The thresholds (131,072 for hybrid topk, 4/8 for AMX reranking, 16/16 for centroid scoring) are based on M2 Ultra hardware. Different hardware or index sizes may have different crossover points. These are tuning parameters, not constants.
The hoisted threshold pattern assumes push_slow updates min_threshold atomically within the call. If push_slow were ever changed to defer threshold updates, the hoisted value would go stale. This coupling should be documented if anyone touches push_slow.
Two assumptions that caused the regressions:
- "ndarray dispatch overhead dominates for small matrices" -- False with
blas-accelerate. ArrayView creation is zero-copy, and AMX dispatch is lightweight. The AMX speedup far outweighs any dispatch cost, even for 4-row matrices. - "
zip().map().sum()lets LLVM auto-vectorize" -- False for IEEE-754 compliant Rust. The compiler can't reorder float additions to use SIMD lanes without-ffast-math. Manual unrolling with independent accumulators is necessary.
This is now the 4th or 5th time Gemini Deep Think came up with a solid strategy but poor execution on the details, and Claude came in to fix the high-level strategy that Gemini had basically right but missed the nuances on.
The 100K x 2 exact regression (58%) should recover from restoring dot_unrolled_8. The 10K x 768 exact regression (15%) may have additional causes. If it doesn't recover, further investigation into the exact search serial path is warranted.
Ran with cargo bench --features blas-accelerate on M2 Ultra. These are Rust criterion numbers, not Python binding numbers, so they're not directly comparable to the Python regression data from the Gemini analysis. But they confirm the code paths are working correctly with AMX enabled.
| Shape | Post-Fix (Rust criterion) | Pre-Fix Python Baseline | Regressed Python | Status |
|---|---|---|---|---|
| 1K x 48 | 3.26us | 3.79us | 4.93us | Recovered (faster than baseline) |
| 1K x 768 | 27.9us | 25.8us | 74.4us | Recovered (~8% above baseline, within noise for different harness) |
| 10K x 48 | 3.52us | 3.40us | 5.30us | Recovered |
| 10K x 768 | 30.0us | 15.7us* | 76.1us | Recovered from regression; baseline number suspect** |
| 50K x 128 | 27.4us | 16.1us* | 42.7us | Recovered from regression; baseline number suspect** |
| 100K x 2 | 9.86us | 9.10us | 12.4us | Recovered |
*These "baseline" Python numbers may include a measurement artifact (warm IVF cache on second run vs. cold first run). The Rust criterion numbers include warm-up iterations that account for lazy OnceLock initialization, making them more reliable.
**The 10K x 768 and 50K x 128 Python baselines look anomalously fast -- possibly from the IVF index being pre-built in the Python benchmark harness's warm-up phase, vs. the regressed measurements including index construction. The Rust criterion numbers are consistent and represent the true steady-state latency.
| Shape | Post-Fix | Regressed | Pre-Fix | Status |
|---|---|---|---|---|
| 1K x 48 | 8.54us | improved | baseline | Stable |
| 1K x 768 | 17.7us | improved | baseline | Stable |
| 10K x 48 | 158us | 167us | 160us | Recovered |
| 10K x 768 | 816us | 982us | 852us | Recovered (4% better than baseline) |
| 50K x 128 | 460us | 383us* | 493us | Improved 7% vs baseline (was already improved by batch changes) |
| 100K x 2 | 226us | 341us | 216us | Recovered (5% of baseline, within noise) |
*50K x 128 was one case that actually got faster from the original batch optimizations.
| Query Rows | Post-Fix | Status |
|---|---|---|
| 1 | 309us | Baseline established |
| 128 | 3.09ms | Baseline established |
| 2048 | 76.8ms | Baseline established |
No prior Rust criterion batch baselines with blas-accelerate exist for comparison. These numbers establish the post-fix baseline. The batch code paths weren't modified by the fix (all batch wins preserved), so these should be equivalent or slightly better than the regressed branch (threshold hoisting helps batch too).
APX regressions are fixed. The 188-383% APX regressions are gone. Exact search regressions (100K x 2 at 58%, 10K x 768 at 15%) are recovered. All numbers are at or better than pre-regression baselines, with the caveat that some Python baseline numbers were likely measurement artifacts.
Python benchmark validation is still recommended to get apples-to-apples comparison with the original regression data from the Gemini analysis.
| Topic | Gemini v1 | Gemini v2 | Claude (Final) |
|---|---|---|---|
| Primary APX regression cause | (caused it) | Memory aliasing on min_threshold | Removing ndarray.dot lost AMX |
| ndarray in APX paths | Remove | Restore (16 rows, 64 dims) | Restore (4 rows, 8 dims) |
| topk_from_scores | Stream everything | Hybrid at 131K | Hybrid at 131K |
| dot_unrolled_8 | Replace with zip.map.sum | Restore manual unroll | Restore manual unroll |
| #[cold] on push_slow | Add | Keep | Restore (initially removed, benchmarks proved keep correct) |
| Register threshold hoisting | Not proposed | Core fix | Adopted as enhancement |
| Batch optimizations | Proposed (all correct) | Preserved | Preserved |
| AMX centroid scoring | Removed ndarray | Not addressed | Restore (nlist >= 16, dims >= 16) |
After implementing the 6-step fix, a QA review identified several coverage gaps. All have been addressed:
| Test | What it covers |
|---|---|
topk_from_scores_large_array_streaming_path |
Streaming heap path for >131K items |
topk_from_scores_at_boundary_uses_quickselect |
Exactly 131,072 items stays on quickselect |
topk_from_scores_above_boundary_uses_streaming |
131,073 items switches to streaming |
topk_from_scores_streaming_with_row_offset |
Row offset correctness in streaming path |
topk_from_scores_both_paths_agree_on_same_data |
Cross-path consistency: quickselect and streaming produce identical results |
dot_unrolled_8_empty_vectors |
Length 0: no iterations, returns 0 |
dot_unrolled_8_single_element |
Length 1: tail-only path |
dot_unrolled_8_seven_elements |
Length 7: max tail-only (zero unrolled iterations) |
dot_unrolled_8_exactly_eight |
Length 8: exactly one unrolled iteration, zero tail |
dot_unrolled_8_nine_elements |
Length 9: one unrolled + one tail |
dot_unrolled_8_sixteen_elements |
Length 16: two full unrolled iterations, zero tail |
dot_unrolled_8_mismatched_lengths |
lhs shorter than rhs -- uses min(len) |
batch_approx_matches_single_approx_results |
Batch APX vs single APX consistency (same indices, scores within 1e-4), plus recall check against exact (>=50% overlap) |
| Test | File | What it covers |
|---|---|---|
single_row_k_one |
lib.rs | 1-row index, k=1; exercises the k==1 fast path at integration level |
single_row_k_larger_than_one_row |
lib.rs | 1-row index, k=5; verifies capped_k = min(5,1) = 1 -- no out-of-bounds |
batch_search_k_one |
lib.rs | 8 queries against 100-row index, k=1; cross-checks batch vs single results |
batch_search_all_negative_scores |
lib.rs | All dot products negative; verifies NEG_INFINITY threshold initialization is correct |
push_slow_debug_assert_fires_on_invariant_violation |
topk.rs | #[should_panic] (debug only); confirms the new debug_assert catches violations |
The python_benchmarks/batch_predict.py had three issues found by code review:
- simsimd.cdist argument order inverted --
cdist(matrix, queries)with.Twas fragile and wrong for non-square inputs. Fixed tocdist(queries, matrix)which directly produces (query_rows, matrix_rows) shape. - Normalization asymmetry -- NNdex got pre-normalized queries while numpy/simsimd baselines normalized inside their timed loops. Fixed by pre-normalizing queries once before all benchmarks, so all implementations start from the same place.
- Single batch size -- Only tested batch=128. Now tests
[4, 128, 2048]to cover edge cases (very small batch, moderate, large).
Also added progress output (generating data, building indexes, warming up, benchmarking) so you can see what's happening while it runs.
Added NNDEX_BENCH_VERBOSE=1 environment flag. When set, the Rust library prints to stderr which strategy is being used:
[nndex] batch: rows=10000 dims=768 queries=128 k=10 -> Gemm
[nndex] approx: rows=10000 dims=768 k=10 amx_eligible=true
[nndex] cluster: n_rows=64 dims=768 -> ndarray.dot (AMX eligible)
[nndex] cluster: n_rows=72 dims=768 -> ndarray.dot (AMX eligible)
[nndex] search: rows=10000 dims=768 k=10 -> Gemv
[nndex] batch_approx: rows=10000 dims=768 queries=128 k=10 amx_eligible=true
Cluster-level logging shows exactly which dot-product path each IVF cluster takes (ndarray.dot AMX vs simsimd fallback), including the cluster size. This makes it trivial to verify AMX is actually being used during benchmarking.
This works for both Rust criterion benchmarks (NNDEX_BENCH_VERBOSE=1 cargo bench) and Python benchmarks (NNDEX_BENCH_VERBOSE=1 uv run ...) since the flag is checked via std::env::var with a OnceLock for zero per-call overhead.
f32 score divergence between batch and single APX paths: Batch and single APX return identical indices in identical order, but similarity scores diverge at ~1e-6 due to IEEE-754 non-associativity. Batch GEMM accumulates dot products in a different order than per-row sgemv, producing slightly different float rounding. This is inherent to how hardware matrix multiply works, not a bug.
Added a dedicated batch_gemm_chunked/cpu benchmark group to batch.rs with configs that trigger GemmChunked (score matrix > 128 MB) using smaller row counts than the default 10M bench:
cargo bench --features blas-accelerate --bench batch -- "batch_gemm_chunked"
Configs: 500K x 64 (q=128), 260K x 128 (q=128), 300K x 128 (q=128), 1M x 64 (q=64). Largest matrix is 256 MB, builds in seconds.
Three defensive changes with zero release-build cost:
-
debug_assertinpush_slow: catches invariant violations (caller passing a score <= current threshold) in debug builds. The assert compiles away in release. This makes the push_slow contract explicit and testable. -
n_rows == 0guards insearch_approxandsearch_batch_approx: early return on empty index rather than letting the code proceed into ArrayView2 construction with zero rows. Makes the intent explicit and prevents a potential panic on degenerate input. -
Improved zero-padding comment in
batch_neighbors_to_numpy: documents the sentinel contract -- when fewer than k results are available for a query, remaining slots are zero-padded (index=0, similarity=0.0).
The zero-padding sentinel (similarity == 0.0 for unfilled slots) is a leaky contract. A similarity of 0.0 is a valid result for orthogonal vectors, so callers cannot distinguish "no result" from "result with zero similarity" without additional context.
A better approach: return an additional counts 1D uint32 array alongside indices and similarities, where counts[i] is the number of valid results in row i. Zero cost to produce (just write the actual result count per query), unambiguous, no sentinel required. This was identified during the hardening review and deferred -- it is an API change outside the scope of the PR.
-
Strategy path verification in tests: No test explicitly asserts which dot-product path (ndarray.dot vs simsimd) is used. This would require test-only instrumentation. The
NNDEX_BENCH_VERBOSElogging covers this for manual verification during benchmarking. -
Cross-repo baseline confirmation: The 4 integration tests added in commit 90994e2 (
single_row_k_one,single_row_k_larger_than_one_row,batch_search_k_one,batch_search_all_negative_scores) were also added to the upstreamnndex-originalrepo on branchbaseline-edge-case-tests. All 90 upstream tests pass with these additions, confirming the edge cases are not regressions introduced by the optimization work.