Treat LFM2 as the reasoning head, ruvector as the world model and memory, and FastGRNN as the control circuit that decides how to use both.
- LFM2 as the language core (700M and 1.2B, optionally 2.6B). (liquid.ai)
- ruvector as a vector plus graph memory with attention over neighborhoods.
- FastGRNN as the tiny router RNN that decides how to use LFM2 and ruvector per request. (arXiv)
You can adapt the language and infra stack (Python, Rust, Node) without changing the logic.
Define these targets explicitly so you can benchmark against them.
Target devices
- Tier A: laptop or desktop CPU (for dev and on‑prem).
- Tier B: mobile or edge CPU or NPU (Snapdragon, Apple silicon).

Core objectives
- Median end‑to‑end latency under 500 ms for “simple” queries on Tier A and under 800 ms on Tier B.
- 2x speedup vs a “naive” baseline (single LFM2 1.2B, no router, naive RAG) at equal or better quality.
- Retrieval and reasoning quality comparable to “always use LFM2 2.6B with full context”. (arXiv)

Functional goals
- Persistent graph memory (ruvector) with attention over neighborhoods.
- Router that chooses model size, context size, and retrieval strategy per request.
- All components deployable on device or on‑prem.
LFM2 Inference Service
- Hosts the quantized LFM2 checkpoints behind a single generation API (implementation details below).

Embedding Service
Either:
- The LFM2 encoder head (if you use the retrieval variant or a pooled representation). (arXiv)
- Or a separate small encoder (ruvector’s current embedder) projected to your existing dimensionality.

ruvector Memory Service
Stores:
- Nodes: texts, states, tool results, compressed summaries.
- Vectors: dense embeddings.
- Graph: edges with relations and weights.

Index:
- HNSW for approximate nearest neighbors. (arXiv)

Attention engine:
- Graph attention over retrieved neighborhoods (GAT‑style or custom attention).

FastGRNN Router Service
- Small RNN with gating as described by Kusupati et al. (arXiv)
- Inputs: query and retrieval stats.
- Outputs: routing decisions.

Orchestrator / Gateway
- Single entry point for clients.
- Implements the step‑by‑step request flow.
- Handles logging and benchmarking.
For a user query q:

1. Preprocess and embed q.
2. Call ruvector for approximate nearest neighbors via HNSW.
3. Run ruvector attention over the neighborhood.
4. Extract routing features (query and retrieval stats).
5. Call the FastGRNN router to decide:
   - model size
   - context size
   - sampling config
   - fallback strategy
6. Build the prompt using the top‑k attended nodes.
7. Call the chosen LFM2 model.
8. Optionally write back:
   - new nodes
   - updated edges
   - a compressed summary node

All of these steps are instrumented for latency and quality metrics.
Use a simple but expressive schema. Store the vector outside the hot row if needed for memory locality.

Node:

```json
{
  "id": "uuid",
  "vector": [float], // d dims
  "text": "string",
  "type": "doc|memory|trace|tool_result|summary",
  "source": "kb|user|agent|system",
  "metadata": {
    "timestamp": "ISO-8601",
    "tags": ["string"],
    "language": "en",
    "domain": "support|finance|ops|...",
    "version": "string",
    "score": 0.0
  }
}
```

Edge:

```json
{
  "id": "uuid",
  "src": "node-id",
  "dst": "node-id",
  "rel": "cites|follows|same_topic|agent_step|...",
  "weight": 0.0,
  "metadata": {
    "timestamp": "ISO-8601",
    "created_by": "router|agent|user",
    "confidence": 0.0
  }
}
```

Edges are used during graph attention.
For each request, compute a fixed‑length feature vector f:

Query stats
- `len_tokens`, `lang_id`, domain id or one‑hot, log frequency of user.

Embedding stats
- `||embedding||_2`, top principal component coordinate (optional).

HNSW search stats (from ruvector)
- `k`, `mean_distance`, `std_distance`, `min_distance`, `max_distance`.
- `entropy_attention` (after attention weights are computed).
- `depth_touched` (hops visited in the graph).

System constraints
- `budget_ms`, `device_class` (enum), `privacy_level` (enum).

Concatenate these into a numeric vector, for example of size 64 or 128.
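A minimal sketch of the feature extractor. The accessor names (`hnsw_res.distances`, `attn_res.weights`, `attn_res.depth_touched`, the `constraints` fields) are assumptions to be adapted to ruvector's actual API:

```python
import math
import numpy as np

def build_router_features(query_tokens, emb, hnsw_res, attn_res, constraints):
    """Concatenate query, embedding, retrieval, and constraint stats into one vector."""
    d = np.asarray(hnsw_res.distances)              # distances of the k nearest neighbors
    w = np.asarray(attn_res.weights)                # normalized graph-attention weights
    entropy = -float(np.sum(w * np.log(w + 1e-9)))  # spread of the attention mass
    feats = [
        len(query_tokens),                          # len_tokens
        constraints.lang_id,                        # lang_id (int-coded)
        math.log1p(constraints.user_frequency),     # log frequency of user
        float(np.linalg.norm(emb)),                 # ||embedding||_2
        len(d),                                     # k
        float(d.mean()), float(d.std()),            # mean/std distance
        float(d.min()), float(d.max()),             # min/max distance
        entropy,                                    # entropy_attention
        attn_res.depth_touched,                     # hops visited in the graph
        constraints.budget_ms,
        constraints.device_class,                   # enum as int
        constraints.privacy_level,                  # enum as int
    ]
    return np.asarray(feats, dtype=np.float32)
```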
FastGRNN outputs:
- `model_id` logits over {350m, 700m, 1.2b, 2.6b}.
- `context_size_bin` logits over {small, medium, large}.
- `temperature`, `top_p` (via regression heads or quantized bins).
- Optional `fallback_to_cloud` probability.

During inference, apply softmax or argmax and map to concrete values using a config table.
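A sketch of that decoding step. The config table values here are placeholders for illustration, not tuned recommendations:

```python
import numpy as np

MODEL_IDS = ["lfm2-350m", "lfm2-700m", "lfm2-1.2b", "lfm2-2.6b"]
CTX_BINS = {"small": 1024, "medium": 2048, "large": 8192}  # illustrative token budgets

def decode_route(model_logits, ctx_logits, temperature, top_p, cloud_prob, threshold=0.5):
    """Map raw router head outputs to a concrete routing decision via the config table."""
    return {
        "model_id": MODEL_IDS[int(np.argmax(model_logits))],
        "max_ctx_tokens": list(CTX_BINS.values())[int(np.argmax(ctx_logits))],
        "temperature": float(temperature),
        "top_p": float(top_p),
        "fallback_to_cloud": bool(cloud_prob > threshold),
    }
```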
Pick a reference environment:

- Python 3.11, Poetry or uv for env management.
- Key libraries: these follow from the runtime choice below (llama.cpp bindings for CPU, vLLM for GPU, Hugging Face Hub for model downloads).
- Download models from Hugging Face: `LiquidAI/LFM2-*`. (Hugging Face)
- Decide runtime:
  - For CPU only: llama.cpp with Q4 or Q5 quantization.
  - For mixed CPU / GPU: vLLM or similar.
Define a simple gRPC or HTTP API:

POST /generate

```json
{
  "model_id": "lfm2-700m",
  "prompt": "string",
  "max_tokens": 512,
  "temperature": 0.7,
  "top_p": 0.9
}
```
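A minimal client call against this endpoint. The host, port, and response shape are assumptions about your deployment:

```python
import requests

resp = requests.post(
    "http://localhost:8000/generate",  # hypothetical local deployment
    json={
        "model_id": "lfm2-700m",
        "prompt": "Summarize the retrieved context blocks.",
        "max_tokens": 512,
        "temperature": 0.7,
        "top_p": 0.9,
    },
    timeout=30,
)
print(resp.json()["text"])  # assumes the service returns {"text": ...}
```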
Implement:
- KV cache enabled.
- Streaming responses if possible, but still log total latency.

For embeddings:
- Either export the final hidden state of LFM2 or use a lighter embedding model, keeping the dimension consistent with ruvector (a pooling sketch follows).
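A sketch of the pooled‑hidden‑state route, assuming the checkpoint name and that a recent transformers `AutoModel` supports it (verify against the model card before relying on this):

```python
import torch
from transformers import AutoModel, AutoTokenizer

CKPT = "LiquidAI/LFM2-700M"  # assumed checkpoint name; check the Hub listing
tok = AutoTokenizer.from_pretrained(CKPT)
enc = AutoModel.from_pretrained(CKPT)

@torch.no_grad()
def embed(text: str) -> torch.Tensor:
    """Mean-pool the final hidden states over real (non-padding) tokens."""
    batch = tok(text, return_tensors="pt", truncation=True)
    hidden = enc(**batch).last_hidden_state        # (1, seq, d)
    mask = batch["attention_mask"].unsqueeze(-1)   # (1, seq, 1)
    pooled = (hidden * mask).sum(1) / mask.sum(1)  # average over unmasked positions
    return torch.nn.functional.normalize(pooled, dim=-1).squeeze(0)
```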
Storage layer
- Use Postgres or a key‑value store for nodes and edges.
- Keep the vector column in a separate table, or use a dedicated vector DB if you prefer.

HNSW index
- Approximate nearest neighbor index over node vectors; parameter tuning is covered below.

Graph attention

Implementation sketch:
1. Retrieve the top `k` neighbors using HNSW.
2. Get the adjacency list for these nodes up to depth `h` (for example 2 hops).
3. Build a small induced subgraph `G_q`.
4. Run one or two layers of attention:

```
# Pseudo
for node i in G_q.nodes:
    h_i = W_node * concat(embedding_i, meta_i)

for edge (i, j) in G_q.edges:
    e_ij = rel_emb[rel_ij]

for l in range(L):
    for node i in G_q.nodes:
        # attention over neighbors j in N(i)
        alpha_ij = softmax_j( a^T [W h_i || W h_j || e_ij] )
        h_i_new = sum_j alpha_ij * (W h_j)
    h_i = h_i_new
```

5. Use the final `h_i` for each node as a score, and normalize the scores to obtain attention weights.
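A runnable single‑layer version of the sketch in PyTorch, with edge‑relation embeddings folded into the score as one simplification. This is an illustration of the GAT‑style layer, not ruvector's actual engine; the per‑node softmax loop is fine at the capped subgraph sizes discussed later:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionLayer(nn.Module):
    """One GAT-style layer over a small induced subgraph G_q."""

    def __init__(self, d_in: int, d_out: int, n_rels: int):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)   # shared node projection
        self.rel_emb = nn.Embedding(n_rels, d_out)    # relation embedding e_ij
        self.a = nn.Linear(3 * d_out, 1, bias=False)  # scoring vector a

    def forward(self, h, edge_index, edge_rel):
        # h: (N, d_in); edge_index: (2, E) rows are [src, dst]; edge_rel: (E,)
        # Add self-loops beforehand so every node attends to itself.
        Wh = self.W(h)
        src, dst = edge_index
        e = self.rel_emb(edge_rel)
        # score_ij = leaky_relu(a^T [Wh_i || Wh_j || e_ij])
        scores = F.leaky_relu(
            self.a(torch.cat([Wh[dst], Wh[src], e], dim=-1))
        ).squeeze(-1)
        alpha = torch.empty_like(scores)
        for i in dst.unique():            # softmax over each node's incoming edges
            m = dst == i
            alpha[m] = F.softmax(scores[m], dim=0)
        out = torch.zeros_like(Wh)        # h_i_new = sum_j alpha_ij * Wh_j
        out.index_add_(0, dst, alpha.unsqueeze(-1) * Wh[src])
        return out
```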
Context builder

- Sort nodes by attention weight, descending.
- Truncate by:
  - `max_docs` chosen by the router.
  - `max_tokens_for_context` (for example 2k tokens).
- Build structured context (a builder sketch follows):

```
[Context block 1] (score=0.93, type=doc, tags=...)
text...

[Context block 2] (score=0.87, type=summary, tags=...)
text...
```
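A minimal `build_context` sketch. The per‑node fields (`text`, `score`, `type`, `tags`) and the `count_tokens` helper are assumptions standing in for your node schema and tokenizer:

```python
def build_context(attn_res, max_docs: int, max_tokens: int) -> str:
    """Pack the highest-attention nodes into the structured context format."""
    nodes = sorted(attn_res.nodes, key=lambda n: n.score, reverse=True)[:max_docs]
    blocks, used = [], 0
    for i, n in enumerate(nodes, start=1):
        block = (f"[Context block {i}] (score={n.score:.2f}, "
                 f"type={n.type}, tags={','.join(n.tags)})\n{n.text}")
        cost = count_tokens(block)  # assumed tokenizer helper
        if used + cost > max_tokens:
            break                   # respect the router's token budget
        blocks.append(block)
        used += cost
    return "\n\n".join(blocks)
```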
Implement FastGRNN following the original paper but with a small hidden size. (arXiv)

Model equations

For an input vector sequence `x_t` (you can use a single step, so `t = 1`), standard FastGRNN is:

[ z_t = \sigma(W x_t + U h_{t-1} + b_z) ]
[ \tilde h_t = \tanh(W x_t + U h_{t-1} + b_h) ]
[ h_t = (\zeta (1 - z_t) + \nu) \odot \tilde h_t + z_t \odot h_{t-1} ]

where `W` and `U` are shared between the gate and the candidate state, and `\zeta` and `\nu` are scalar parameters constrained to keep training stable and prediction efficient. (arXiv)

Use `t = 1` with `h_0 = 0`, so the router is basically a gated MLP with a recurrent flavor.
Input and heads

- Input dimension: `d_in` between 32 and 128.
- Hidden size: 32 or 64.
- Classification heads:
  - `model_logits = W_m h_1 + b_m`, size 4.
  - `ctx_logits = W_c h_1 + b_c`, size 3.
- Regression heads:
  - `temperature = softplus(w_T^T h_1 + b_T)`, clipped.
  - `top_p = sigmoid(w_P^T h_1 + b_P)`, scaled between 0.7 and 1.
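A compact PyTorch sketch of the router following the equations and heads above. The raw gate initializations and the temperature clamp are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FastGRNNRouter(nn.Module):
    """Single-step FastGRNN (t=1, h_0=0) with the four routing heads."""

    def __init__(self, d_in: int = 64, d_hid: int = 32):
        super().__init__()
        self.W = nn.Linear(d_in, d_hid, bias=False)   # shared input projection
        self.U = nn.Linear(d_hid, d_hid, bias=False)  # shared recurrent projection
        self.b_z = nn.Parameter(torch.zeros(d_hid))
        self.b_h = nn.Parameter(torch.zeros(d_hid))
        # raw scalars; sigmoid keeps zeta and nu in (0, 1)
        self.zeta_raw = nn.Parameter(torch.tensor(3.0))
        self.nu_raw = nn.Parameter(torch.tensor(-3.0))
        self.head_model = nn.Linear(d_hid, 4)   # {350m, 700m, 1.2b, 2.6b}
        self.head_ctx = nn.Linear(d_hid, 3)     # {small, medium, large}
        self.head_temp = nn.Linear(d_hid, 1)
        self.head_top_p = nn.Linear(d_hid, 1)

    def forward(self, x, h_prev=None):
        if h_prev is None:
            h_prev = x.new_zeros(x.shape[0], self.U.in_features)  # h_0 = 0
        pre = self.W(x) + self.U(h_prev)
        z = torch.sigmoid(pre + self.b_z)
        h_tilde = torch.tanh(pre + self.b_h)
        zeta, nu = torch.sigmoid(self.zeta_raw), torch.sigmoid(self.nu_raw)
        h = (zeta * (1 - z) + nu) * h_tilde + z * h_prev
        return {
            "model_logits": self.head_model(h),
            "ctx_logits": self.head_ctx(h),
            "temperature": F.softplus(self.head_temp(h)).clamp(max=2.0),
            "top_p": 0.7 + 0.3 * torch.sigmoid(self.head_top_p(h)),
        }
```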
Training data

Create training samples from logs of a “brute force” baseline:

- Run a period without routing: always use `lfm2-2.6b` with a large context.
- For each request:
  - Log the feature vector `f`.
  - Log quality metrics (human or auto evaluation).
  - Log latency and resource usage.
- Offline, simulate alternative decisions:
  - For the same request, rerun using 350M, 700M, 1.2B and different context sizes on a sample subset.
  - Compute a scalar utility `U` such as:

    [ U = Q - \lambda \cdot \mathrm{latency\_ms} - \mu \cdot \mathrm{cost\_unit} ]

    where `Q` is the quality score, and `\lambda` and `\mu` weight latency and cost.
- Derive labels:
  - `y_model` is the model with max `U`.
  - `y_ctx` is the smallest context bin among those within a margin of `U_max - \epsilon`.
  - `y_temp`, `y_top_p` come from a regression fit to the top performing trials.
- Train FastGRNN to predict these labels or values (a label‑derivation sketch follows).
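A sketch of label derivation from the simulated trials. The record fields (`model_id`, `ctx_bin` as an ordered int, `quality`, `latency_ms`, `cost`) are hypothetical names for your log schema:

```python
def derive_labels(trials, lam=0.001, mu=0.1, eps=0.05):
    """Pick routing labels for one request from its simulated trials."""
    # U = Q - lambda * latency_ms - mu * cost_unit
    for t in trials:
        t["U"] = t["quality"] - lam * t["latency_ms"] - mu * t["cost"]
    best = max(trials, key=lambda t: t["U"])
    y_model = best["model_id"]
    # smallest context bin whose utility is within eps of the best
    good = [t for t in trials if t["U"] >= best["U"] - eps]
    y_ctx = min(t["ctx_bin"] for t in good)
    return y_model, y_ctx, best["U"]
```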
Loss

[ L = L_{\mathrm{model}} + L_{\mathrm{ctx}} + \alpha \cdot L_{\mathrm{temp}} + \beta \cdot L_{\mathrm{top\_p}} ]

- Use cross‑entropy for the model and context heads, smooth L1 or MSE for the regression heads.
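In code, using the head outputs from the `FastGRNNRouter` sketch above:

```python
import torch.nn.functional as F

def router_loss(out, y_model, y_ctx, y_temp, y_top_p, alpha=0.5, beta=0.5):
    """Cross-entropy for the classifier heads, smooth L1 for the regression heads."""
    return (F.cross_entropy(out["model_logits"], y_model)
            + F.cross_entropy(out["ctx_logits"], y_ctx)
            + alpha * F.smooth_l1_loss(out["temperature"].squeeze(-1), y_temp)
            + beta * F.smooth_l1_loss(out["top_p"].squeeze(-1), y_top_p))
```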
Serving
- Export the router as a TorchScript or ONNX model.
- Load it in the Orchestrator or as a sidecar microservice.
Core pseudo code:

```python
def handle_request(query, constraints):
    t0 = now()
    pre = preprocess(query)
    emb = embed(pre)  # LFM2 or external encoder

    # ruvector retrieval
    hnsw_res = ruv.search(emb, k=64)
    attn_res = ruv.attend(hnsw_res, hops=2)

    # feature extraction for router
    feats = build_router_features(query, emb, hnsw_res, attn_res, constraints)

    # FastGRNN route
    route = fastgrnn(feats)

    context = build_context(attn_res, max_docs=route.max_docs,
                            max_tokens=route.max_ctx_tokens)
    prompt = assemble_prompt(query, context)

    # LFM2 call
    llm_t0 = now()
    reply = lfm2_generate(
        model_id=route.model_id,
        prompt=prompt,
        temperature=route.temperature,
        top_p=route.top_p,
    )
    llm_t1 = now()

    # optional writeback
    new_nodes, new_edges = post_process_and_writeback(query, reply, attn_res)

    t1 = now()
    log_metrics(
        query=query,
        route=route,
        latency_total_ms=(t1 - t0),
        latency_llm_ms=(llm_t1 - llm_t0),
        retrieval_stats=hnsw_res.stats,
        quality_placeholder=None,  # filled later by evaluators
    )
    return reply
```

You want both performance and quality benchmarks, plus router performance.
Define a benchmark matrix:

- Devices:
  - Laptop CPU (for example 8‑core x86 or Apple M‑series).
  - Smartphone or dev board with a Snapdragon‑class SoC. (liquid.ai)
- Models:
  - `lfm2-350m`, `lfm2-700m`, `lfm2-1.2b`, `lfm2-2.6b`.
- Quantization:
  - Q4 and Q5 for CPU (llama.cpp).
  - 8‑bit or 4‑bit weight‑only for vLLM if a GPU is available.
For each cell, log:
- Prefill tokens per second.
- Decode tokens per second.
- Peak memory (RAM).
The LFM2 technical report gives reference numbers showing up to 2x speedup vs Qwen3 on CPU, which you can use for sanity checks. (arXiv)
Define scenarios:
- Simple FAQ.
- Moderate reasoning with some retrieval.
- Heavy multi step question requiring deeper graph attention.
For each scenario:

- Sample 100 to 1,000 queries from real workloads.
- Run four systems:
  - Baseline big: `lfm2-2.6b` with large context, no router.
  - Mid fixed: `lfm2-1.2b` with medium context.
  - Small fixed: `lfm2-700m` with small context.
  - Routed: your full ruvector plus FastGRNN system.

Measure:
- P50, P90, P99 end‑to‑end latency.
- LLM portion vs retrieval‑plus‑routing portion.
- Request success rate (no timeouts).
Pick a mix of:

- Public benchmarks for sanity:
  - GSM8K, MMLU subsets, IFEval‑style instruction tests. LFM2 already reports strong small‑model performance on these; use them to confirm you have not degraded quality. (arXiv)
- Internal tasks:
  - Domain QA with ground‑truth answers.
  - Retrieval tasks: “is the correct document present in the context”.

Metrics:
- Exact match and F1 on QA tasks.
- Retrieval recall at k (R@5, R@10).
- Judge model evaluation:
  - Use an external LLM as a judge to rate answers from 1 to 5 on helpfulness and correctness.
Important: evaluate both the baseline big system and the routed system on the same dataset. Compute regret:
[ \text{Regret} = \mathbb{E}[Q_{\text{big}} - Q_{\text{routed}}] ]
Target: keep average regret under 0.1 points on a 1 to 5 scale while gaining substantial latency savings.
From the same dataset:

- Route distribution:
  - Fraction of calls to each model size.
  - Fraction of “escalations” where the router goes to 2.6B.
- Oracle comparison:
  - For each request, compute the best decision among your candidate policies.
  - Measure how often FastGRNN picks the same decision, or one within epsilon utility of it.
- Cost efficiency:
  - Total compute tokens or “model‑tokens” consumed.
  - Compare to the always‑big baseline.
Define a log row per request:

```json
{
  "request_id": "uuid",
  "timestamp": "ISO-8601",
  "user_id": "hash",
  "query": "string",
  "router_features": [...],
  "router_decision": {
    "model_id": "lfm2-700m",
    "context_size": 2048,
    "temperature": 0.7,
    "top_p": 0.9
  },
  "retrieval_stats": {
    "k": 64,
    "mean_distance": 0.61,
    "std_distance": 0.07,
    "entropy_attention": 1.93,
    "nodes_used": 12
  },
  "latency_ms": {
    "total": 420,
    "retrieval": 80,
    "router": 2,
    "llm": 310,
    "writeback": 28
  },
  "quality": {
    "label": "correct|partial|wrong",
    "score": 4.5,
    "judge_model": "gpt-5.1-pro"
  }
}
```

This becomes the training data for router refinement and system optimization.
Run this as a phased program rather than tweaking everything at once.
Choose the smallest viable model per task
- Start with 700M as the default, 1.2B for complex requests, 350M for classification. LFM2 is tuned for strong performance at these sizes. (liquid.ai)

Quantization strategy
- For CPU: use 4‑bit weights (Q4) for 350M and 700M, and 5‑bit if you can afford it for 1.2B and 2.6B.
- Verify there is no major quality drop using GSM8K and a sample of your tasks.

Context packing
- Aggressively deduplicate and summarize before passing context.
- Use LFM2 itself to compress multiple high‑attention nodes into a short summary node and write it back to ruvector.

Prompt templates
- Keep prompts short and stable. Long system prompts kill prefill speed.
- Benchmark standard vs minimal templates.

KV cache reuse and chunking
- For multi‑turn dialogs, reuse the KV cache; LFM2 is designed for long contexts and can benefit greatly from KV reuse. (arXiv)
Tune HNSW parameters
- Start with `M=32`, `efConstruction=200`, `efSearch=64`.
- Measure recall vs a brute‑force baseline on a validation set.
- Adjust `efSearch` upward if recall is too low, downward if latency is too high (see the tuning sketch below). (arXiv)
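A minimal tuning harness, assuming the hnswlib library as a stand‑in; adapt to ruvector's native index if it exposes these knobs. The random data is a placeholder for your node vectors:

```python
import numpy as np
import hnswlib

def brute_force_topk(base, queries, k):
    """Exact cosine top-k, used as recall ground truth."""
    b = base / np.linalg.norm(base, axis=1, keepdims=True)
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    return np.argsort(-(q @ b.T), axis=1)[:, :k]

dim, n, k = 384, 50_000, 10
data = np.random.rand(n, dim).astype(np.float32)  # stand-in for node vectors

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, M=32, ef_construction=200)
index.add_items(data)

queries = data[:100]
truth = brute_force_topk(data, queries, k)

# sweep efSearch upward until recall is acceptable or the latency budget is hit
for ef in (32, 64, 128, 256):
    index.set_ef(ef)
    labels, _ = index.knn_query(queries, k=k)
    recall = np.mean([len(set(l) & set(t)) / k for l, t in zip(labels, truth)])
    print(f"efSearch={ef}: recall@{k}={recall:.3f}")
```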
Hybrid indexing for huge corpora
- If you have billions of nodes, consider a hybrid like HANNIS or partitioned HNSW: cluster documents, build one HNSW per cluster, and route into a cluster first. (VLDB)

Graph pruning
- Prune edges with very low weights or old timestamps to keep neighborhoods sparse and attention efficient.
- Limit the degree per node to a budget (for example 64) using age‑plus‑weight heuristics, as sketched below.
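A sketch of the degree cap. The `timestamp_unix` field and the exponential‑decay blend of weight and age are illustrative assumptions layered on the edge schema above:

```python
import time

def prune_edges(edges, max_degree=64, half_life_s=30 * 86400):
    """Keep each node's best max_degree outgoing edges by a weight-plus-recency score."""
    now = time.time()

    def score(e):
        age = now - e["metadata"]["timestamp_unix"]  # assumed unix-time field
        recency = 0.5 ** (age / half_life_s)         # exponential age decay
        return e["weight"] * recency

    by_src = {}
    for e in edges:
        by_src.setdefault(e["src"], []).append(e)
    kept = []
    for src, es in by_src.items():
        kept.extend(sorted(es, key=score, reverse=True)[:max_degree])
    return kept
```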
Attention optimization
- Use multi‑head attention with low ranks and fused ops if possible.
- Cap the induced subgraph size (for example 256 nodes) to bound compute.

Model size vs CPU time
- Start with hidden size 32 and measure average router latency. FastGRNN is designed to be tiny and should be sub‑millisecond on CPU. (arXiv)
- If decision quality is poor (high regret), increase the hidden size gradually.
Curriculum training
- Begin training on easy routing decisions (clear latency vs quality tradeoffs).
- Gradually introduce more ambiguous examples.

Online refinement
- Periodically retrain the router using new logs.
- Use bandit‑style exploration where a small fraction of requests try alternative routing decisions.

Guardrails
- For very high‑risk or high‑value queries (by user, domain, or self‑estimated uncertainty), always escalate to 2.6B regardless of the router, or require an explicit router confidence threshold.

Batching across requests
- Under high load, batch LFM2 calls with compatible models and sampling configs. vLLM makes this straightforward. (Hugging Face)

Cache common queries
- Maintain a cache keyed by the normalized query plus top retrieval IDs (a key sketch follows).
- Serve from cache if a near‑identical query appears and the graph neighborhood has not changed.
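A sketch of the cache key. The normalization here (lowercase plus whitespace collapse) is deliberately simple and is an assumption, not a prescription:

```python
import hashlib
import re

def cache_key(query: str, top_node_ids: list[str]) -> str:
    """Key = normalized query + sorted top retrieval IDs, so neighborhood changes invalidate."""
    norm = re.sub(r"\s+", " ", query.strip().lower())
    payload = norm + "|" + ",".join(sorted(top_node_ids))
    return hashlib.sha256(payload.encode()).hexdigest()
```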
Parallel retrieval and routing
- Embed and retrieve in parallel, then run the router once both the embedding and the retrieval stats are ready.
- The router can use a subset of features if the full stats are not needed.

Tiered deployment
- On device, keep 350M and 700M.
- On edge or server, keep 1.2B and 2.6B.
- The router decides whether to stay local or escalate across tiers, respecting privacy flags.
You can use this as a project checklist:

- LFM2 service up with four sizes, quantized, with metrics.
- Embedding service aligned with ruvector dimensions.
- ruvector:
  - Node and edge schemas implemented.
  - HNSW index online and tuned.
  - Graph attention implemented and benchmarked on synthetic graphs.
- Orchestrator:
  - Happy path working: query → retrieval → context → LFM2.
  - Telemetry and logging integrated.
- Baseline benchmarks:
  - Always‑big and mid‑fixed systems fully measured.
- Router:
  - Feature extractor implemented.
  - FastGRNN implemented and trained from baseline logs.
  - FastGRNN serving endpoint integrated into the orchestrator.
- Routed system:
  - A/B tested vs the always‑big baseline.
  - Latency and quality targets validated.
- Optimization loop:
  - Monthly or continuous retraining for the router.
  - HNSW and attention parameters periodically re‑tuned.
  - New LFM2 variants plugged in as they arrive. (liquid.ai)