This evaluation was conducted entirely by Claude Code. You can reproduce something similar very easily: just clone the ruvnet/ruvector GitHub repo and ask Claude Code for an honest evaluation. Here, it used the agents from the repo itself. When it mentioned that the benchmarks used hardcoded improvement rates, I asked Claude Code to write "rea…

RuVector Final Evaluation Report

Date: December 6, 2025
Evaluator: Independent Code Analysis
Version Evaluated: 0.1.21


Executive Summary

| Category | Rating | Key Finding |
|---|---|---|
| Overall Verdict | 6.5/10 | Legitimate but oversold |
| Core Vector DB | 8/10 | Production-grade SIMD, HNSW, quantization |
| Advanced Features | 3/10 | 30-40% incomplete or fake |
| Benchmark Claims | 2/10 | Simulated, not measured |
| Architecture | 6.5/10 | Solid tech, severe scope bloat |

TL;DR: The core vector database works and is well-engineered. However, benchmark claims are fabricated, advanced features (AgenticDB, supervised GNN training) are incomplete, and the project suffers from "kitchen sink syndrome" - trying to be 8 products simultaneously.


Critical Findings

1. Benchmark Fraud

Severity: CRITICAL

The benchmark file benchmarks/qdrant_vs_ruvector_benchmark.py does NOT run actual RuVector code. It simulates performance by dividing Qdrant's measured times by hardcoded speedup factors:

# From SimulatedRuvectorBenchmark class
rust_speedup = 3.5   # Arbitrary multiplier
simd_factor = 1.5    # Arbitrary multiplier
# Combined: 5.25x fake speedup for inserts

Simulated Claims vs Reality (from actual benchmarks):

| Metric | Simulated Claim | Actual Measured | Reality |
|---|---|---|---|
| Search speedup | 4x-5.25x faster | 1.6x faster | Inflated 2.5-3x |
| Insert speedup | "Faster" implied | 27x SLOWER | Completely wrong |
| p50 search latency | "61µs" | 1.88ms | Fabricated |

Evidence: benchmarks/real/ contains actual benchmark code and results showing the real performance.
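
For contrast, a real benchmark measures latency directly instead of scaling another engine's numbers. Below is a minimal timing-harness sketch (not code from the repo; the query closure stands in for whatever search call is being measured):

use std::time::Instant;

// Run a query closure repeatedly and report (p50, p99) latency in milliseconds.
// Latencies are measured directly; no "speedup" multipliers are applied afterwards.
fn measure_latencies<F: FnMut()>(mut run_query: F, iterations: usize) -> (f64, f64) {
    assert!(iterations > 0);
    let mut samples: Vec<f64> = Vec::with_capacity(iterations);
    for _ in 0..iterations {
        let start = Instant::now();
        run_query();
        samples.push(start.elapsed().as_secs_f64() * 1_000.0);
    }
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let pct = |q: f64| samples[((samples.len() - 1) as f64 * q).round() as usize];
    (pct(0.50), pct(0.99))
}

Any claimed speedup should come from numbers produced this way for both systems, not from multipliers applied after the fact.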


2. AgenticDB Uses Fake Embeddings

Severity: CRITICAL

The "AgenticDB" semantic features use hash-based fake embeddings instead of real neural embeddings:

Location: crates/ruvector-core/src/agenticdb.rs:660-678

// This is NOT a real embedding - it's a hash
fn simple_text_embedding(text: &str) -> Vec<f32> {
    let bytes = text.as_bytes();
    // ... hash manipulation, not ML embedding
}

Impact: All semantic search, text similarity, and AI features in AgenticDB are meaningless without real embeddings.
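
To see why this matters, here is an illustrative sketch (not the repo's actual function): a byte-hash "embedding" reflects the characters of a string, not its meaning, so the similarity between two synonyms is essentially arbitrary.

// Illustrative only: a byte-hash "embedding" of the kind flagged above.
fn hash_embedding(text: &str, dims: usize) -> Vec<f32> {
    let mut v = vec![0.0f32; dims];
    for (i, b) in text.bytes().enumerate() {
        v[i % dims] += b as f32;
    }
    v
}

fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

fn main() {
    let a = hash_embedding("car", 8);
    let b = hash_embedding("automobile", 8);
    // Whatever this prints reflects byte values, not meaning.
    println!("cosine(car, automobile) = {:.3}", cosine(&a, &b));
}

A real embedding model places "car" and "automobile" close together because it is trained on meaning; a hash cannot, which is why the semantic features built on it are meaningless.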


3. GNN Training Partially Incomplete

Severity: HIGH

The GNN implementation has a split personality:

| Component | Status | Evidence |
|---|---|---|
| Contrastive Loss (InfoNCE) | ✅ Working | training.rs:362-411 - fully implemented with gradients |
| Local Contrastive Loss | ✅ Working | training.rs:444-462 - graph-aware loss |
| SGD/Adam Optimizers | ✅ Working | training.rs:96-216 - fully tested |
| Supervised Losses (MSE, CE) | ❌ Stub | unimplemented!("TODO") at line 230 |
| GNN Inference Methods | ⚠️ Placeholders | Returns dummy values (0.7, 0.2, 0.1) |

Can a GNN work without a loss function?

NO - neural networks fundamentally require a loss function to train. However:

  • The GNN CAN be trained using contrastive learning (unsupervised)
  • The GNN CANNOT be trained for supervised tasks (classification, regression) - see the loss sketch below
  • Inference methods return hardcoded dummy values, not real predictions
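
For reference, a supervised loss of the kind that is stubbed out is small in itself. The sketch below is illustrative (not the repo's API): mean squared error with its gradient, which is what a supervised training path would need where the unimplemented!("TODO") currently sits.

// Mean squared error over a batch, plus its gradient w.r.t. the predictions.
fn mse_loss(predictions: &[f32], targets: &[f32]) -> (f32, Vec<f32>) {
    assert_eq!(predictions.len(), targets.len());
    let n = predictions.len() as f32;
    let mut grad = Vec::with_capacity(predictions.len());
    let mut loss = 0.0f32;
    for (&p, &t) in predictions.iter().zip(targets) {
        let diff = p - t;
        loss += diff * diff;
        grad.push(2.0 * diff / n); // d(loss)/d(p) for loss = mean((p - t)^2)
    }
    (loss / n, grad)
}

Cross-entropy for classification is similarly compact; the gap is not that these losses are hard, but that the advertised supervised path does not exist yet.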

4. Distance Function Bugs

Severity: HIGH

Property-based testing revealed 6 critical bugs in core distance calculations:

| Bug | Location | Impact |
|---|---|---|
| Numeric overflow → inf | simd_intrinsics.rs | Incorrect distances for large vectors |
| Euclidean asymmetry | distance.rs | d(a,b) ≠ d(b,a) violates math definition |
| Manhattan asymmetry | distance.rs | Same violation |
| Dot product asymmetry | distance.rs | Same violation |
| Translation invariance failure | distance.rs | d(a+c, b+c) ≠ d(a,b) |
| Scalar quantization overflow | quantization.rs:49-50 | 255*255 = 65025 > i16::MAX |

Why HNSW search still works: it delegates to the external hnsw_rs library, which has correct implementations.
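
For context on two of these bugs, the sketch below is illustrative (not the repo's code) and shows what correct behavior looks like: a squared Euclidean distance that is symmetric by construction, and a quantized dot product that widens to i32 before multiplying so that 255 * 255 cannot overflow a 16-bit accumulator.

// Squared Euclidean distance: symmetric by construction, since (a - b)^2 == (b - a)^2.
fn squared_euclidean(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

// Dot product of u8-quantized vectors. Each product can reach 255 * 255 = 65_025,
// which overflows i16, so operands are widened to i32 before multiplying.
fn quantized_dot(a: &[u8], b: &[u8]) -> i32 {
    a.iter().zip(b).map(|(&x, &y)| x as i32 * y as i32).sum()
}

// The property a test suite should assert for any distance: d(a, b) == d(b, a).
fn assert_symmetric(a: &[f32], b: &[f32]) {
    assert!((squared_euclidean(a, b) - squared_euclidean(b, a)).abs() < 1e-6);
}

A property-based test (e.g. with proptest) asserts this symmetry over random inputs, which is how the asymmetry bugs above were surfaced.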


5. Transaction Tests Are Empty Stubs

Severity: HIGH

23 of 26 transaction tests are empty stubs:

#[test]
fn test_transaction_rollback() {
    // TODO: Implement
}

Impact: Transaction safety is untested and potentially broken.
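
For comparison, a real rollback test has to assert something. The sketch below uses a deliberately tiny in-memory store defined inline (it does not use RuVector's API) purely to show the shape such a test should take:

// A toy transactional store, defined here only so the test shape is self-contained.
struct ToyStore {
    committed: Vec<u32>,
    pending: Vec<u32>,
}

impl ToyStore {
    fn new() -> Self { Self { committed: Vec::new(), pending: Vec::new() } }
    fn insert_in_tx(&mut self, value: u32) { self.pending.push(value); }
    fn rollback(&mut self) { self.pending.clear(); }
    fn commit(&mut self) { self.committed.append(&mut self.pending); }
    fn len(&self) -> usize { self.committed.len() }
}

#[test]
fn rollback_discards_uncommitted_writes() {
    let mut store = ToyStore::new();
    store.insert_in_tx(42);
    store.rollback();
    // The essential assertion an empty stub never makes:
    assert_eq!(store.len(), 0);

    store.insert_in_tx(7);
    store.commit();
    assert_eq!(store.len(), 1);
}

Until the 23 stubbed tests assert something like this against the real transaction API, the rollback path has effectively never been exercised.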


6. Scope Explosion (36 Crates)

Severity: MEDIUM

The project attempts to be 8 different products:

| Product | Reality |
|---|---|
| Vector database | ✅ Core competency, works well |
| Graph database (Neo4j-compatible) | ⚠️ Partial, two unfinished Cypher parsers |
| PostgreSQL extension | ⚠️ Separate product embedded in project |
| Neural network framework | ⚠️ Incomplete training, placeholder inference |
| ML training platform (SONA) | ⚠️ Working but orthogonal to vector DB |
| AI router (Tiny Dancer) | ⚠️ Separate product |
| Distributed system (Raft) | ✅ Well-implemented |
| Research playground | ⚠️ 17 examples, some >1000 LOC |

What Actually Works

Core Vector Database (8/10)

| Feature | Quality | Evidence |
|---|---|---|
| SIMD distance calculations | ✅ Excellent | 1,693 lines of AVX-512/AVX2/NEON code |
| HNSW indexing | ✅ Good | Wraps battle-tested hnsw_rs library |
| Quantization | ✅ Excellent | Real 4-32x memory reduction |
| NAPI bindings | ✅ Professional | Proper napi-rs with 5 platform binaries |
| Raft consensus | ✅ Good | Clean distributed implementation |
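
As a rough illustration of where the 4-32x figure comes from (sketch only, not the repo's implementation): scalar quantization stores each f32 component as one u8 (4x smaller per dimension), and binary quantization stores one bit per component (32x smaller).

// Scalar-quantize a vector: map each f32 into [0, 255] using the vector's own range.
// Storage drops from 4 bytes to 1 byte per dimension, i.e. a 4x reduction
// (binary quantization, at 1 bit per dimension, gives 32x).
fn scalar_quantize(v: &[f32]) -> (Vec<u8>, f32, f32) {
    let min = v.iter().cloned().fold(f32::INFINITY, f32::min);
    let max = v.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let scale = if max > min { 255.0 / (max - min) } else { 0.0 };
    let codes = v.iter().map(|&x| ((x - min) * scale).round() as u8).collect();
    (codes, min, scale)
}

// Reconstruct approximate f32 values from the codes (lossy).
fn dequantize(codes: &[u8], min: f32, scale: f32) -> Vec<f32> {
    codes
        .iter()
        .map(|&c| if scale > 0.0 { c as f32 / scale + min } else { min })
        .collect()
}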

Search Performance (Verified)

Real benchmarks show RuVector IS faster at search:

  • p50 latency: 1.88ms vs Qdrant's 3.08ms (1.6x faster)
  • p99 latency: 2.70ms vs Qdrant's 7.12ms (2.6x faster)

This is a genuine advantage, just not as large as claimed.


What Doesn't Work

| Issue | Severity | Recommendation |
|---|---|---|
| Simulated benchmarks | CRITICAL | Use real Rust benchmarks for claims |
| Fake text embeddings | CRITICAL | Integrate real embedding model |
| Supervised loss stubs | HIGH | Implement or remove API |
| Distance function bugs | HIGH | Fix symmetry, overflow issues |
| Empty transaction tests | HIGH | Implement or remove feature |
| Scope bloat | MEDIUM | Split into focused products |

Recommendations

For Users

| Use Case | Recommendation |
|---|---|
| Read-heavy, rare updates | RuVector may be suitable |
| Write-heavy workloads | Do not use (27x slower than Qdrant) |
| Production deployment | Use mature solution (Qdrant, Milvus) |
| Learning/experimentation | RuVector is fine |
| AgenticDB semantic features | Do not use (fake embeddings) |
| GNN supervised training | Do not use (unimplemented) |

For Maintainers

  1. Immediate: Remove or clearly label simulated benchmarks
  2. Immediate: Fix distance function symmetry bugs
  3. Short-term: Implement real text embeddings or remove AgenticDB claims
  4. Short-term: Complete supervised loss functions or remove API
  5. Medium-term: Split into focused products (core, postgres, ML)
  6. Long-term: Stabilize and document core API at 1.0

Verification Commands

# Run real benchmarks
cd benchmarks/real && ./run.sh

# Run property tests (reveals distance bugs)
cargo test -p ruvector-core --test property_tests

# Run bug documentation tests
cargo test -p ruvector-core --test bug_tests

# Find simulated benchmark code
grep -n "rust_speedup\|simd_factor" benchmarks/*.py

# Find unimplemented loss functions
grep -rn "unimplemented!" crates/ruvector-gnn/src/

Files Analyzed

| Document | Key Finding |
|---|---|
| docs/BENCHMARK_ANALYSIS.md | Simulated benchmarks with hardcoded multipliers |
| docs/PROJECT_EVALUATION.md | 6.5/10 overall, 30-40% vaporware |
| docs/REAL_BENCHMARK_RESULTS.md | Insert 27x slower, search 1.6x faster |
| docs/TEST_RESULTS.md | 6 critical bugs in distance functions |
| docs/architectural-assessment.md | Coherent tech, incoherent product scope |
| crates/ruvector-gnn/src/training.rs | Contrastive loss works, supervised stubs |

Conclusion

RuVector is a technically competent project with dishonest marketing.

The core vector database functionality is genuinely good - real SIMD optimizations, solid HNSW integration, working quantization. A competent engineer built this.

However:

  • Performance claims are fabricated from simulated benchmarks
  • 30-40% of advertised features are incomplete or fake
  • The project tries to be 8 products instead of one good one
  • Critical bugs exist in core distance calculations

The foundation is salvageable, but requires:

  1. Honest benchmarking
  2. Feature completion or removal
  3. Scope discipline
  4. Bug fixes in core algorithms

Final Rating: 6.5/10 - Legitimate foundation, oversold execution.


Report generated from independent code analysis and testing
