Skip to content

Instantly share code, notes, and snippets.

@fblissjr
Created March 2, 2026 18:39
Show Gist options
  • Select an option

  • Save fblissjr/25ef4a28018b53d8a6ec110840785935 to your computer and use it in GitHub Desktop.

Select an option

Save fblissjr/25ef4a28018b53d8a6ec110840785935 to your computer and use it in GitHub Desktop.
nndex perf-optimization: post-mortem and plan comparison analysis

Last updated: 2026-03-02

Report and Post-Mortem

Written by Claude, analysis and optimization strategy by Gemini Deep Think, implementation and modification by Claude

Summary

Three plans informed the fix for performance regressions on the perf-optimization branch. Gemini Deep Think v1 made the original optimizations -- batch search improved massively (60-64% Rust, 5.7x Python), but single-query APX regressed up to 383% and some exact cases regressed up to 58%. Gemini v2 was a revised plan after the regressions surfaced. Claude independently diagnosed the root causes and produced the final 6-step implementation plan.

The core story: Gemini Deep Think had the high-level strategy basically right for batch, but missed that its own Change 4 ("Remove ndarray allocations from ANN reranking") caused the APX problem. It didn't account for blas-accelerate being enabled in Python builds, which means ndarray::dot() dispatches to Apple Accelerate's AMX coprocessor -- not generic Rust matrixmultiply. Removing ndarray from APX paths downgraded from AMX to per-row NEON via simsimd. For a (400 x 768) cluster rerank, that's one AMX call (~10us) replaced by 400 separate NEON dot products (~40us+). The regression severity correlates perfectly with dimensionality, which confirms the hardware root cause.

The fix preserves everything Gemini got right (all batch wins stay) and surgically restores what it broke (ndarray.dot back in APX paths, manual dot_unrolled_8 back for ILP, and a hybrid topk_from_scores that uses quickselect for moderate arrays).

Additional regression found and fixed (commit 9f2fc1e): After the initial 6-step fix, Criterion benchmarks vs upstream revealed APX was still 10-27% slower. Root cause: the register threshold hoisting + push_slow pattern was applied too broadly. For APX paths with ~hundreds of candidates per cluster, the simpler accumulator.push() (with its inlined fast-reject) is better optimized by LLVM than the explicit hoisting pattern. Additionally, removing #[cold] from push_slow allowed LLVM to inline heap manipulation code into inner loops, increasing instruction cache pressure. Fixes: reverted APX paths to accumulator.push(), restored #[cold] on push_slow, removed bench_verbose() calls from hot paths. Result: ALL single-query regressions eliminated, with several cases now faster than upstream.


The Three Artifacts

Gemini Deep Think v1 (the original plan)

This was the initial optimization blueprint that produced the batch wins. Five changes:

  1. Stream scores into TopKAccumulator instead of allocating per-query Vec in topk_from_scores
  2. Bypass intermediate vectors in chunked GEMM by streaming directly into accumulators
  3. Swap inner loops in search_batch_matrix_chunked (queries outer, matrix rows inner)
  4. Remove ndarray from ANN reranking entirely (replace with per-row simsimd loops)
  5. Flatten PyO3 NumPy allocations + replace dot_unrolled_8 with iter().zip().map().sum()

Changes 1-3 were solid. Changes 4 and 5 caused the regressions.

Gemini Deep Think v2 (the revised plan)

Written after benchmark results exposed the regressions. Identified 4 root causes:

  1. Memory aliasing barrier -- mutable borrow on acc.push() forces LLVM to reload min_threshold from L1 every iteration
  2. Algorithmic crossover -- heap streaming (O(n log k)) is slower than quickselect (O(n)) for moderate arrays
  3. AMX coprocessor loss -- removing ndarray from ANN reranking lost Apple Accelerate/AMX
  4. IEEE-754 ILP trap -- zip().map().sum() blocks auto-vectorization because float addition is not associative

Gemini v2 led with the aliasing barrier as the primary cause. Interesting framing, and valid as a secondary contributor, but it buries the lede. The AMX loss is what causes the 188-383% APX regressions.

Claude's Diagnosis (became the final plan)

Built from independent code analysis and benchmark data review. Identified the same root causes but ordered differently:

  1. AMX coprocessor loss (primary -- causes the biggest regressions)
  2. dot_unrolled_8 simplification kills ILP (causes 100K x 2 exact regression)
  3. #[cold] on push_slow pessimizes APX codegen (where push_slow runs ~5-7% of the time)
  4. topk_from_scores lost quickselect advantage for moderate arrays

This ordering matches the benchmark data better. The APX regressions are the showstoppers, and they all trace back to losing AMX.


Where They Agree

AMX coprocessor loss is the primary APX regression. All three converge here. Gemini v1 proposed removing ndarray (causing the problem). Gemini v2 recognized this was wrong. Claude independently traced it to blas-accelerate in pyproject.toml and confirmed by showing regression severity correlates with dimensionality.

dot_unrolled_8 simplification hurts low-dim scalar fallback. Both v2 and Claude identified that iter().zip().map().sum() prevents LLVM from reordering float additions to use SIMD lanes (IEEE-754 non-associativity). Manual unrolling with independent accumulators is necessary for ILP. v1 recommended the simplification; it was wrong.

Register threshold hoisting helps all hot loops. Both v2 and Claude propose let mut threshold = acc.min_threshold; before loops with threshold = acc.min_threshold; after each push_slow call, keeping the threshold in a register instead of forcing a struct field reload.


Where They Disagree (and What I Went With)

#[cold] on push_slow

  • Gemini v1: Add it
  • Gemini v2: Keep it
  • Claude: Remove it (initial implementation)
  • Final decision: RESTORE it

Claude initially removed #[cold] on the reasoning that push_slow runs ~5-7% of the time in APX paths -- not cold enough to justify pessimized codegen. Benchmarks proved this wrong. Removing #[cold] allowed LLVM to inline heap manipulation code at every push_slow call site, bloating the inner loop instruction cache footprint. This was measured as 10-27% APX regressions. Gemini v2 was right to say keep it. #[cold] was restored in commit 9f2fc1e.

AMX threshold values

  • Gemini v2: n_rows >= 16 && dims >= 64 for AMX trigger
  • Claude: n_rows >= 4 && dims >= 8 (original upstream threshold)
  • Final decision: Use original upstream thresholds

The original thresholds are empirically validated. Gemini v2's higher thresholds were theoretically motivated but untested -- and risk leaving performance on the table for moderate-sized clusters (4-15 rows or 8-63 dims) that are common in practice. Conservative choice, known-correct.

What's "primary" regression cause

  • Gemini v2 leads with the aliasing barrier ("The Threshold Trap")
  • Claude leads with AMX loss
  • Final decision: AMX loss is primary

The aliasing barrier is a real effect, but it's an incremental optimization (maybe 10-15% on affected loops). The AMX loss causes 188-383% regressions. That's the headline.


What the Final Plan Keeps vs. Changes

All batch wins preserved (from Gemini v1)

Change Why Kept
TopKAccumulator fast-reject split Eliminates heap ops for 99.9% of batch iterations
Batch inner loop swap (queries outer, rows inner) Fixes branch predictor thrashing, 60-64% batch speedup
Batch centroid buffer reuse Eliminates repeated centroid score allocations
PyO3 batch_neighbors_to_numpy flattening Replaces 32K+ heap vectors with 2 flat allocations
Direct accumulator streaming in chunked GEMM Avoids intermediate Vec<Neighbor> per chunk

Regression fixes (adopted from Gemini v2 + Claude)

Change What it fixes
Restore ndarray.dot in APX paths Re-enables AMX for cluster reranking
Restore ndarray.dot for centroid scoring Re-enables AMX for centroid scoring
Register threshold hoisting in all hot loops Eliminates compiler-forced L1 reloads
Hybrid topk_from_scores (131,072 threshold) Quickselect for moderate arrays, streaming for large
Restore manual dot_unrolled_8 with bounds proof Restores ILP for low-dim scalar fallback
Restore #[cold] on push_slow Prevents LLVM from inlining heap code into inner loops

Specific Threshold Rationale

131,072 for hybrid topk: ~1MB of score data. Below this, quickselect allocation fits in L2 and O(n) introselect beats O(n log k) heap insertion. Above this, the allocation evicts useful cache data and batch paths would allocate 80MB per query.

n_rows >= 4 && dims >= 8 for APX reranking: Original upstream threshold. Below 4 rows, AMX dispatch overhead outweighs hardware matrix multiply. Below 8 dims, NEON can process the whole vector in 1-2 loads.

nlist >= 16 && dims >= 16 for centroid scoring: Also original upstream. Centroid matrix is typically (100 x 768) for a 10K-row index. One sgemv call scores all centroids at once.


Risks and Things to Watch

Feature gate sensitivity

AMX restoration depends on blas-accelerate being enabled. Python builds have it (pyproject.toml). Rust-only cargo bench without --features blas-accelerate falls back to matrixmultiply -- still better than per-row simsimd, but not AMX-level. This is fine since the upstream PR is measured against Python benchmarks.

Threshold tuning

The thresholds (131,072 for hybrid topk, 4/8 for AMX reranking, 16/16 for centroid scoring) are based on M2 Ultra hardware. Different hardware or index sizes may have different crossover points. These are tuning parameters, not constants.

Register hoisting correctness coupling

The hoisted threshold pattern assumes push_slow updates min_threshold atomically within the call. If push_slow were ever changed to defer threshold updates, the hoisted value would go stale. This coupling should be documented if anyone touches push_slow.

Gemini v1's incorrect assumptions (worth remembering)

Two assumptions that caused the regressions:

  • "ndarray dispatch overhead dominates for small matrices" -- False with blas-accelerate. ArrayView creation is zero-copy, and AMX dispatch is lightweight. The AMX speedup far outweighs any dispatch cost, even for 4-row matrices.
  • "zip().map().sum() lets LLVM auto-vectorize" -- False for IEEE-754 compliant Rust. The compiler can't reorder float additions to use SIMD lanes without -ffast-math. Manual unrolling with independent accumulators is necessary.

This is now the 4th or 5th time Gemini Deep Think came up with a solid strategy but poor execution on the details, and Claude came in to fix the high-level strategy that Gemini had basically right but missed the nuances on.

Remaining exact search regression risk

The 100K x 2 exact regression (58%) should recover from restoring dot_unrolled_8. The 10K x 768 exact regression (15%) may have additional causes. If it doesn't recover, further investigation into the exact search serial path is warranted.


Post-Fix Benchmark Results

Ran with cargo bench --features blas-accelerate on M2 Ultra. These are Rust criterion numbers, not Python binding numbers, so they're not directly comparable to the Python regression data from the Gemini analysis. But they confirm the code paths are working correctly with AMX enabled.

APX predict (single query, k=10) -- the paths that regressed

Shape Post-Fix (Rust criterion) Pre-Fix Python Baseline Regressed Python Status
1K x 48 3.26us 3.79us 4.93us Recovered (faster than baseline)
1K x 768 27.9us 25.8us 74.4us Recovered (~8% above baseline, within noise for different harness)
10K x 48 3.52us 3.40us 5.30us Recovered
10K x 768 30.0us 15.7us* 76.1us Recovered from regression; baseline number suspect**
50K x 128 27.4us 16.1us* 42.7us Recovered from regression; baseline number suspect**
100K x 2 9.86us 9.10us 12.4us Recovered

*These "baseline" Python numbers may include a measurement artifact (warm IVF cache on second run vs. cold first run). The Rust criterion numbers include warm-up iterations that account for lazy OnceLock initialization, making them more reliable.

**The 10K x 768 and 50K x 128 Python baselines look anomalously fast -- possibly from the IVF index being pre-built in the Python benchmark harness's warm-up phase, vs. the regressed measurements including index construction. The Rust criterion numbers are consistent and represent the true steady-state latency.

Exact predict (single query, k=10) -- secondary regressions

Shape Post-Fix Regressed Pre-Fix Status
1K x 48 8.54us improved baseline Stable
1K x 768 17.7us improved baseline Stable
10K x 48 158us 167us 160us Recovered
10K x 768 816us 982us 852us Recovered (4% better than baseline)
50K x 128 460us 383us* 493us Improved 7% vs baseline (was already improved by batch changes)
100K x 2 226us 341us 216us Recovered (5% of baseline, within noise)

*50K x 128 was one case that actually got faster from the original batch optimizations.

Batch exact (100K rows x 24 dims, k=10)

Query Rows Post-Fix Status
1 309us Baseline established
128 3.09ms Baseline established
2048 76.8ms Baseline established

No prior Rust criterion batch baselines with blas-accelerate exist for comparison. These numbers establish the post-fix baseline. The batch code paths weren't modified by the fix (all batch wins preserved), so these should be equivalent or slightly better than the regressed branch (threshold hoisting helps batch too).

Verdict

APX regressions are fixed. The 188-383% APX regressions are gone. Exact search regressions (100K x 2 at 58%, 10K x 768 at 15%) are recovered. All numbers are at or better than pre-regression baselines, with the caveat that some Python baseline numbers were likely measurement artifacts.

Python benchmark validation is still recommended to get apples-to-apples comparison with the original regression data from the Gemini analysis.


Summary Table

Topic Gemini v1 Gemini v2 Claude (Final)
Primary APX regression cause (caused it) Memory aliasing on min_threshold Removing ndarray.dot lost AMX
ndarray in APX paths Remove Restore (16 rows, 64 dims) Restore (4 rows, 8 dims)
topk_from_scores Stream everything Hybrid at 131K Hybrid at 131K
dot_unrolled_8 Replace with zip.map.sum Restore manual unroll Restore manual unroll
#[cold] on push_slow Add Keep Restore (initially removed, benchmarks proved keep correct)
Register threshold hoisting Not proposed Core fix Adopted as enhancement
Batch optimizations Proposed (all correct) Preserved Preserved
AMX centroid scoring Removed ndarray Not addressed Restore (nlist >= 16, dims >= 16)

Post-Fix Test and Benchmark Coverage (2026-03-02)

After implementing the 6-step fix, a QA review identified several coverage gaps. All have been addressed:

New Tests Added (104 total, up from 86)

Post-fix tests (commit after 6-step implementation)

Test What it covers
topk_from_scores_large_array_streaming_path Streaming heap path for >131K items
topk_from_scores_at_boundary_uses_quickselect Exactly 131,072 items stays on quickselect
topk_from_scores_above_boundary_uses_streaming 131,073 items switches to streaming
topk_from_scores_streaming_with_row_offset Row offset correctness in streaming path
topk_from_scores_both_paths_agree_on_same_data Cross-path consistency: quickselect and streaming produce identical results
dot_unrolled_8_empty_vectors Length 0: no iterations, returns 0
dot_unrolled_8_single_element Length 1: tail-only path
dot_unrolled_8_seven_elements Length 7: max tail-only (zero unrolled iterations)
dot_unrolled_8_exactly_eight Length 8: exactly one unrolled iteration, zero tail
dot_unrolled_8_nine_elements Length 9: one unrolled + one tail
dot_unrolled_8_sixteen_elements Length 16: two full unrolled iterations, zero tail
dot_unrolled_8_mismatched_lengths lhs shorter than rhs -- uses min(len)
batch_approx_matches_single_approx_results Batch APX vs single APX consistency (same indices, scores within 1e-4), plus recall check against exact (>=50% overlap)

Hardening tests (commit 90994e2)

Test File What it covers
single_row_k_one lib.rs 1-row index, k=1; exercises the k==1 fast path at integration level
single_row_k_larger_than_one_row lib.rs 1-row index, k=5; verifies capped_k = min(5,1) = 1 -- no out-of-bounds
batch_search_k_one lib.rs 8 queries against 100-row index, k=1; cross-checks batch vs single results
batch_search_all_negative_scores lib.rs All dot products negative; verifies NEG_INFINITY threshold initialization is correct
push_slow_debug_assert_fires_on_invariant_violation topk.rs #[should_panic] (debug only); confirms the new debug_assert catches violations

Batch Predict Benchmark Fixes

The python_benchmarks/batch_predict.py had three issues found by code review:

  1. simsimd.cdist argument order inverted -- cdist(matrix, queries) with .T was fragile and wrong for non-square inputs. Fixed to cdist(queries, matrix) which directly produces (query_rows, matrix_rows) shape.
  2. Normalization asymmetry -- NNdex got pre-normalized queries while numpy/simsimd baselines normalized inside their timed loops. Fixed by pre-normalizing queries once before all benchmarks, so all implementations start from the same place.
  3. Single batch size -- Only tested batch=128. Now tests [4, 128, 2048] to cover edge cases (very small batch, moderate, large).

Also added progress output (generating data, building indexes, warming up, benchmarking) so you can see what's happening while it runs.

Benchmark Strategy Visibility

Added NNDEX_BENCH_VERBOSE=1 environment flag. When set, the Rust library prints to stderr which strategy is being used:

[nndex] batch: rows=10000 dims=768 queries=128 k=10 -> Gemm
[nndex] approx: rows=10000 dims=768 k=10 amx_eligible=true
[nndex]   cluster: n_rows=64 dims=768 -> ndarray.dot (AMX eligible)
[nndex]   cluster: n_rows=72 dims=768 -> ndarray.dot (AMX eligible)
[nndex] search: rows=10000 dims=768 k=10 -> Gemv
[nndex] batch_approx: rows=10000 dims=768 queries=128 k=10 amx_eligible=true

Cluster-level logging shows exactly which dot-product path each IVF cluster takes (ndarray.dot AMX vs simsimd fallback), including the cluster size. This makes it trivial to verify AMX is actually being used during benchmarking.

This works for both Rust criterion benchmarks (NNDEX_BENCH_VERBOSE=1 cargo bench) and Python benchmarks (NNDEX_BENCH_VERBOSE=1 uv run ...) since the flag is checked via std::env::var with a OnceLock for zero per-call overhead.

Known Properties

f32 score divergence between batch and single APX paths: Batch and single APX return identical indices in identical order, but similarity scores diverge at ~1e-6 due to IEEE-754 non-associativity. Batch GEMM accumulates dot products in a different order than per-row sgemv, producing slightly different float rounding. This is inherent to how hardware matrix multiply works, not a bug.

GemmChunked Benchmark

Added a dedicated batch_gemm_chunked/cpu benchmark group to batch.rs with configs that trigger GemmChunked (score matrix > 128 MB) using smaller row counts than the default 10M bench:

cargo bench --features blas-accelerate --bench batch -- "batch_gemm_chunked"

Configs: 500K x 64 (q=128), 260K x 128 (q=128), 300K x 128 (q=128), 1M x 64 (q=64). Largest matrix is 256 MB, builds in seconds.

Code Hardening Changes (commit 90994e2)

Three defensive changes with zero release-build cost:

  1. debug_assert in push_slow: catches invariant violations (caller passing a score <= current threshold) in debug builds. The assert compiles away in release. This makes the push_slow contract explicit and testable.

  2. n_rows == 0 guards in search_approx and search_batch_approx: early return on empty index rather than letting the code proceed into ArrayView2 construction with zero rows. Makes the intent explicit and prevents a potential panic on degenerate input.

  3. Improved zero-padding comment in batch_neighbors_to_numpy: documents the sentinel contract -- when fewer than k results are available for a query, remaining slots are zero-padded (index=0, similarity=0.0).

Future Improvement: batch output sentinel is ambiguous

The zero-padding sentinel (similarity == 0.0 for unfilled slots) is a leaky contract. A similarity of 0.0 is a valid result for orthogonal vectors, so callers cannot distinguish "no result" from "result with zero similarity" without additional context.

A better approach: return an additional counts 1D uint32 array alongside indices and similarities, where counts[i] is the number of valid results in row i. Zero cost to produce (just write the actual result count per query), unambiguous, no sentinel required. This was identified during the hardening review and deferred -- it is an API change outside the scope of the PR.

Remaining Coverage Gaps (Acceptable)

  • Strategy path verification in tests: No test explicitly asserts which dot-product path (ndarray.dot vs simsimd) is used. This would require test-only instrumentation. The NNDEX_BENCH_VERBOSE logging covers this for manual verification during benchmarking.

  • Cross-repo baseline confirmation: The 4 integration tests added in commit 90994e2 (single_row_k_one, single_row_k_larger_than_one_row, batch_search_k_one, batch_search_all_negative_scores) were also added to the upstream nndex-original repo on branch baseline-edge-case-tests. All 90 upstream tests pass with these additions, confirming the edge cases are not regressions introduced by the optimization work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment