Treat LFM2 as the reasoning head, ruvector as the world model and memory, and FastGRNN as the control circuit that decides how to use both.
- LFM2 as the language core (700M and 1.2B, optionally 2.6B). (liquid.ai)
- ruvector as a vector plus graph memory with attention over neighborhoods.
- FastGRNN as the tiny router RNN that decides how to use LFM2 and ruvector per request. (arXiv)
You can adapt the language and infra stack (Python, Rust, Node) without changing the logic.
Define these targets explicitly so you can benchmark against them.
Target devices
- Tier A: laptop or desktop CPU (for dev and on‑prem).
- Tier B: mobile or edge CPU or NPU (Snapdragon, Apple silicon).

Core objectives
- Median end‑to‑end latency under 500 ms for “simple” queries on Tier A and under 800 ms on Tier B.
- 2x speedup vs a “naive” baseline (single LFM2 1.2B, no router, naive RAG) at equal or better quality.
- Retrieval and reasoning quality comparable to “always use LFM2 2.6B with full context”. (arXiv)

Functional goals
- Persistent graph memory (ruvector) with attention over neighborhoods.
- Router that chooses model size, context size, and retrieval strategy per request.
- All components deployable on device or on‑prem.
LFM2 Inference Service
- Hosts the quantized LFM2 checkpoints behind a single generation API (implementation details below).

Embedding Service
Either:
- The LFM2 encoder head (if you use the retrieval variant or a pooled representation). (arXiv)
- Or a separate small encoder (ruvector’s current embedder) projected to your existing dimensionality.

ruvector Memory Service
Stores:
- Nodes: texts, states, tool results, compressed summaries.
- Vectors: dense embeddings.
- Graph: edges with relations and weights.

Index:
- HNSW for approximate nearest neighbors. (arXiv)

Attention engine:
- Graph attention over retrieved neighborhoods (GAT‑style or custom attention).

FastGRNN Router Service
- Small RNN with gating as described by Kusupati et al. (arXiv)
- Inputs: query and retrieval stats.
- Outputs: routing decisions.

Orchestrator / Gateway
- Single entry point for clients.
- Implements the step‑by‑step request flow.
- Handles logging and benchmarking.
For a user query q:

1. Preprocess and embed q.
2. Call ruvector for approximate nearest neighbors via HNSW.
3. Run ruvector attention over the neighborhood.
4. Extract routing features (query and retrieval stats).
5. Call the FastGRNN router to decide:
   - model size
   - context size
   - sampling config
   - fallback strategy
6. Build the prompt using the top‑k attended nodes.
7. Call the chosen LFM2 model.
8. Optionally write back:
   - new nodes
   - updated edges
   - a compressed summary node

All of these steps are instrumented for latency and quality metrics.
Use a simple but expressive schema. Store the vector outside the hot row if needed for memory locality.

Node:

```json
{
  "id": "uuid",
  "vector": [float], // d dims
  "text": "string",
  "type": "doc|memory|trace|tool_result|summary",
  "source": "kb|user|agent|system",
  "metadata": {
    "timestamp": "ISO-8601",
    "tags": ["string"],
    "language": "en",
    "domain": "support|finance|ops|...",
    "version": "string",
    "score": 0.0
  }
}
```

Edge:

```json
{
  "id": "uuid",
  "src": "node-id",
  "dst": "node-id",
  "rel": "cites|follows|same_topic|agent_step|...",
  "weight": 0.0,
  "metadata": {
    "timestamp": "ISO-8601",
    "created_by": "router|agent|user",
    "confidence": 0.0
  }
}
```

Edges are used during graph attention.
For each request, compute a fixed‑length feature vector f:

Query stats
- `len_tokens`, `lang_id`, domain id or one‑hot, log frequency of user.

Embedding stats
- `||embedding||_2`, top principal component coordinate (optional).

HNSW search stats (from ruvector)
- `k`, `mean_distance`, `std_distance`, `min_distance`, `max_distance`.
- `entropy_attention` (after attention weights are computed).
- `depth_touched` (hops visited in the graph).

System constraints
- `budget_ms`, `device_class` (enum), `privacy_level` (enum).

Concatenate these into a numeric vector, for example of size 64 or 128.
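A minimal sketch of the feature extractor. The accessor names (`hnsw_res.distances`, `attn_res.weights`, `attn_res.depth_touched`, the `constraints` fields) are assumptions to be adapted to ruvector's actual API:

```python
import math
import numpy as np

def build_router_features(query_tokens, emb, hnsw_res, attn_res, constraints):
    """Concatenate query, embedding, retrieval, and constraint stats into one vector."""
    d = np.asarray(hnsw_res.distances)              # distances of the k nearest neighbors
    w = np.asarray(attn_res.weights)                # normalized graph-attention weights
    entropy = -float(np.sum(w * np.log(w + 1e-9)))  # spread of the attention mass
    feats = [
        len(query_tokens),                          # len_tokens
        constraints.lang_id,                        # lang_id (int-coded)
        math.log1p(constraints.user_frequency),     # log frequency of user
        float(np.linalg.norm(emb)),                 # ||embedding||_2
        len(d),                                     # k
        float(d.mean()), float(d.std()),            # mean/std distance
        float(d.min()), float(d.max()),             # min/max distance
        entropy,                                    # entropy_attention
        attn_res.depth_touched,                     # hops visited in the graph
        constraints.budget_ms,
        constraints.device_class,                   # enum as int
        constraints.privacy_level,                  # enum as int
    ]
    return np.asarray(feats, dtype=np.float32)
```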
FastGRNN outputs:
- `model_id` logits over {350m, 700m, 1.2b, 2.6b}.
- `context_size_bin` logits over {small, medium, large}.
- `temperature`, `top_p` (via regression heads or quantized bins).
- Optional `fallback_to_cloud` probability.

During inference, apply softmax or argmax and map to concrete values using a config table.
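A sketch of that decoding step. The config table values here are placeholders for illustration, not tuned recommendations:

```python
import numpy as np

MODEL_IDS = ["lfm2-350m", "lfm2-700m", "lfm2-1.2b", "lfm2-2.6b"]
CTX_BINS = {"small": 1024, "medium": 2048, "large": 8192}  # illustrative token budgets

def decode_route(model_logits, ctx_logits, temperature, top_p, cloud_prob, threshold=0.5):
    """Map raw router head outputs to a concrete routing decision via the config table."""
    return {
        "model_id": MODEL_IDS[int(np.argmax(model_logits))],
        "max_ctx_tokens": list(CTX_BINS.values())[int(np.argmax(ctx_logits))],
        "temperature": float(temperature),
        "top_p": float(top_p),
        "fallback_to_cloud": bool(cloud_prob > threshold),
    }
```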
Pick a reference environment:

- Python 3.11, Poetry or uv for env management.
- Key libraries: these follow from the runtime choice below (llama.cpp bindings for CPU, vLLM for GPU, Hugging Face Hub for model downloads).
- Download models from Hugging Face: `LiquidAI/LFM2-*`. (Hugging Face)
- Decide runtime:
  - For CPU only: llama.cpp with Q4 or Q5 quantization.
  - For mixed CPU / GPU: vLLM or similar.
Define a simple gRPC or HTTP API:

POST /generate

```json
{
  "model_id": "lfm2-700m",
  "prompt": "string",
  "max_tokens": 512,
  "temperature": 0.7,
  "top_p": 0.9
}
```
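A minimal client call against this endpoint. The host, port, and response shape are assumptions about your deployment:

```python
import requests

resp = requests.post(
    "http://localhost:8000/generate",  # hypothetical local deployment
    json={
        "model_id": "lfm2-700m",
        "prompt": "Summarize the retrieved context blocks.",
        "max_tokens": 512,
        "temperature": 0.7,
        "top_p": 0.9,
    },
    timeout=30,
)
print(resp.json()["text"])  # assumes the service returns {"text": ...}
```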
Implement:
- KV cache enabled.
- Streaming responses if possible, but still log total latency.

For embeddings:
- Either export the final hidden state of LFM2 or use a lighter embedding model, keeping the dimension consistent with ruvector (a pooling sketch follows).
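A sketch of the pooled‑hidden‑state route, assuming the checkpoint name and that a recent transformers `AutoModel` supports it (verify against the model card before relying on this):

```python
import torch
from transformers import AutoModel, AutoTokenizer

CKPT = "LiquidAI/LFM2-700M"  # assumed checkpoint name; check the Hub listing
tok = AutoTokenizer.from_pretrained(CKPT)
enc = AutoModel.from_pretrained(CKPT)

@torch.no_grad()
def embed(text: str) -> torch.Tensor:
    """Mean-pool the final hidden states over real (non-padding) tokens."""
    batch = tok(text, return_tensors="pt", truncation=True)
    hidden = enc(**batch).last_hidden_state        # (1, seq, d)
    mask = batch["attention_mask"].unsqueeze(-1)   # (1, seq, 1)
    pooled = (hidden * mask).sum(1) / mask.sum(1)  # average over unmasked positions
    return torch.nn.functional.normalize(pooled, dim=-1).squeeze(0)
```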
Storage layer
- Use Postgres or a key‑value store for nodes and edges.
- Keep the vector column in a separate table, or use a dedicated vector DB if you prefer.

HNSW index
- Approximate nearest neighbor index over node vectors; parameter tuning is covered below.

Graph attention

Implementation sketch:
1. Retrieve the top `k` neighbors using HNSW.
2. Get the adjacency list for these nodes up to depth `h` (for example 2 hops).
3. Build a small induced subgraph `G_q`.
4. Run one or two layers of attention:

```
# Pseudo
for node i in G_q.nodes:
    h_i = W_node * concat(embedding_i, meta_i)

for edge (i, j) in G_q.edges:
    e_ij = rel_emb[rel_ij]

for l in range(L):
    for node i in G_q.nodes:
        # attention over neighbors j in N(i)
        alpha_ij = softmax_j( a^T [W h_i || W h_j || e_ij] )
        h_i_new = sum_j alpha_ij * (W h_j)
    h_i = h_i_new
```

5. Use the final `h_i` for each node as a score, and normalize the scores to obtain attention weights.
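A runnable single‑layer version of the sketch in PyTorch, with edge‑relation embeddings folded into the score as one simplification. This is an illustration of the GAT‑style layer, not ruvector's actual engine; the per‑node softmax loop is fine at the capped subgraph sizes discussed later:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionLayer(nn.Module):
    """One GAT-style layer over a small induced subgraph G_q."""

    def __init__(self, d_in: int, d_out: int, n_rels: int):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)   # shared node projection
        self.rel_emb = nn.Embedding(n_rels, d_out)    # relation embedding e_ij
        self.a = nn.Linear(3 * d_out, 1, bias=False)  # scoring vector a

    def forward(self, h, edge_index, edge_rel):
        # h: (N, d_in); edge_index: (2, E) rows are [src, dst]; edge_rel: (E,)
        # Add self-loops beforehand so every node attends to itself.
        Wh = self.W(h)
        src, dst = edge_index
        e = self.rel_emb(edge_rel)
        # score_ij = leaky_relu(a^T [Wh_i || Wh_j || e_ij])
        scores = F.leaky_relu(
            self.a(torch.cat([Wh[dst], Wh[src], e], dim=-1))
        ).squeeze(-1)
        alpha = torch.empty_like(scores)
        for i in dst.unique():            # softmax over each node's incoming edges
            m = dst == i
            alpha[m] = F.softmax(scores[m], dim=0)
        out = torch.zeros_like(Wh)        # h_i_new = sum_j alpha_ij * Wh_j
        out.index_add_(0, dst, alpha.unsqueeze(-1) * Wh[src])
        return out
```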
Context builder

- Sort nodes by attention weight, descending.
- Truncate by:
  - `max_docs` chosen by the router.
  - `max_tokens_for_context` (for example 2k tokens).
- Build structured context (a builder sketch follows):

```
[Context block 1] (score=0.93, type=doc, tags=...)
text...

[Context block 2] (score=0.87, type=summary, tags=...)
text...
```
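A minimal `build_context` sketch. The per‑node fields (`text`, `score`, `type`, `tags`) and the `count_tokens` helper are assumptions standing in for your node schema and tokenizer:

```python
def build_context(attn_res, max_docs: int, max_tokens: int) -> str:
    """Pack the highest-attention nodes into the structured context format."""
    nodes = sorted(attn_res.nodes, key=lambda n: n.score, reverse=True)[:max_docs]
    blocks, used = [], 0
    for i, n in enumerate(nodes, start=1):
        block = (f"[Context block {i}] (score={n.score:.2f}, "
                 f"type={n.type}, tags={','.join(n.tags)})\n{n.text}")
        cost = count_tokens(block)  # assumed tokenizer helper
        if used + cost > max_tokens:
            break                   # respect the router's token budget
        blocks.append(block)
        used += cost
    return "\n\n".join(blocks)
```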
Implement FastGRNN following the original paper but with a small hidden size. (arXiv)

Model equations

For an input vector sequence `x_t` (you can use a single step, so `t = 1`), standard FastGRNN is:

[ z_t = \sigma(W x_t + U h_{t-1} + b_z) ]
[ \tilde h_t = \tanh(W x_t + U h_{t-1} + b_h) ]
[ h_t = (\zeta (1 - z_t) + \nu) \odot \tilde h_t + z_t \odot h_{t-1} ]

where `W` and `U` are shared between the gate and the candidate state, and `\zeta` and `\nu` are scalar parameters constrained to keep training stable and prediction efficient. (arXiv)

Use `t = 1` with `h_0 = 0`, so the router is basically a gated MLP with a recurrent flavor.
Input and heads

- Input dimension: `d_in` between 32 and 128.
- Hidden size: 32 or 64.
- Classification heads:
  - `model_logits = W_m h_1 + b_m`, size 4.
  - `ctx_logits = W_c h_1 + b_c`, size 3.
- Regression heads:
  - `temperature = softplus(w_T^T h_1 + b_T)`, clipped.
  - `top_p = sigmoid(w_P^T h_1 + b_P)`, scaled between 0.7 and 1.
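A compact PyTorch sketch of the router following the equations and heads above. The raw gate initializations and the temperature clamp are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FastGRNNRouter(nn.Module):
    """Single-step FastGRNN (t=1, h_0=0) with the four routing heads."""

    def __init__(self, d_in: int = 64, d_hid: int = 32):
        super().__init__()
        self.W = nn.Linear(d_in, d_hid, bias=False)   # shared input projection
        self.U = nn.Linear(d_hid, d_hid, bias=False)  # shared recurrent projection
        self.b_z = nn.Parameter(torch.zeros(d_hid))
        self.b_h = nn.Parameter(torch.zeros(d_hid))
        # raw scalars; sigmoid keeps zeta and nu in (0, 1)
        self.zeta_raw = nn.Parameter(torch.tensor(3.0))
        self.nu_raw = nn.Parameter(torch.tensor(-3.0))
        self.head_model = nn.Linear(d_hid, 4)   # {350m, 700m, 1.2b, 2.6b}
        self.head_ctx = nn.Linear(d_hid, 3)     # {small, medium, large}
        self.head_temp = nn.Linear(d_hid, 1)
        self.head_top_p = nn.Linear(d_hid, 1)

    def forward(self, x, h_prev=None):
        if h_prev is None:
            h_prev = x.new_zeros(x.shape[0], self.U.in_features)  # h_0 = 0
        pre = self.W(x) + self.U(h_prev)
        z = torch.sigmoid(pre + self.b_z)
        h_tilde = torch.tanh(pre + self.b_h)
        zeta, nu = torch.sigmoid(self.zeta_raw), torch.sigmoid(self.nu_raw)
        h = (zeta * (1 - z) + nu) * h_tilde + z * h_prev
        return {
            "model_logits": self.head_model(h),
            "ctx_logits": self.head_ctx(h),
            "temperature": F.softplus(self.head_temp(h)).clamp(max=2.0),
            "top_p": 0.7 + 0.3 * torch.sigmoid(self.head_top_p(h)),
        }
```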
Training data

Create training samples from logs of a “brute force” baseline:

- Run a period without routing: always use `lfm2-2.6b` with a large context.
- For each request:
  - Log the feature vector `f`.
  - Log quality metrics (human or auto evaluation).
  - Log latency and resource usage.
- Offline, simulate alternative decisions:
  - For the same request, rerun using 350M, 700M, 1.2B and different context sizes on a sample subset.
  - Compute a scalar utility `U` such as:

    [ U = Q - \lambda \cdot \mathrm{latency\_ms} - \mu \cdot \mathrm{cost\_unit} ]

    where `Q` is the quality score, and `\lambda` and `\mu` weight latency and cost.
- Derive labels:
  - `y_model` is the model with max `U`.
  - `y_ctx` is the smallest context bin among those within a margin of `U_max - \epsilon`.
  - `y_temp`, `y_top_p` come from a regression fit to the top performing trials.
- Train FastGRNN to predict these labels or values (a label‑derivation sketch follows).
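A sketch of label derivation from the simulated trials. The record fields (`model_id`, `ctx_bin` as an ordered int, `quality`, `latency_ms`, `cost`) are hypothetical names for your log schema:

```python
def derive_labels(trials, lam=0.001, mu=0.1, eps=0.05):
    """Pick routing labels for one request from its simulated trials."""
    # U = Q - lambda * latency_ms - mu * cost_unit
    for t in trials:
        t["U"] = t["quality"] - lam * t["latency_ms"] - mu * t["cost"]
    best = max(trials, key=lambda t: t["U"])
    y_model = best["model_id"]
    # smallest context bin whose utility is within eps of the best
    good = [t for t in trials if t["U"] >= best["U"] - eps]
    y_ctx = min(t["ctx_bin"] for t in good)
    return y_model, y_ctx, best["U"]
```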
Loss

[ L = L_{\mathrm{model}} + L_{\mathrm{ctx}} + \alpha \cdot L_{\mathrm{temp}} + \beta \cdot L_{\mathrm{top\_p}} ]

- Use cross‑entropy for the model and context heads, smooth L1 or MSE for the regression heads.
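In code, using the head outputs from the `FastGRNNRouter` sketch above:

```python
import torch.nn.functional as F

def router_loss(out, y_model, y_ctx, y_temp, y_top_p, alpha=0.5, beta=0.5):
    """Cross-entropy for the classifier heads, smooth L1 for the regression heads."""
    return (F.cross_entropy(out["model_logits"], y_model)
            + F.cross_entropy(out["ctx_logits"], y_ctx)
            + alpha * F.smooth_l1_loss(out["temperature"].squeeze(-1), y_temp)
            + beta * F.smooth_l1_loss(out["top_p"].squeeze(-1), y_top_p))
```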
Serving
- Export the router as a TorchScript or ONNX model.
- Load it in the Orchestrator or as a sidecar microservice.
Core pseudo code:

```python
def handle_request(query, constraints):
    t0 = now()
    pre = preprocess(query)
    emb = embed(pre)  # LFM2 or external encoder

    # ruvector retrieval
    hnsw_res = ruv.search(emb, k=64)
    attn_res = ruv.attend(hnsw_res, hops=2)

    # feature extraction for router
    feats = build_router_features(query, emb, hnsw_res, attn_res, constraints)

    # FastGRNN route
    route = fastgrnn(feats)

    context = build_context(attn_res, max_docs=route.max_docs,
                            max_tokens=route.max_ctx_tokens)
    prompt = assemble_prompt(query, context)

    # LFM2 call
    llm_t0 = now()
    reply = lfm2_generate(
        model_id=route.model_id,
        prompt=prompt,
        temperature=route.temperature,
        top_p=route.top_p,
    )
    llm_t1 = now()

    # optional writeback
    new_nodes, new_edges = post_process_and_writeback(query, reply, attn_res)

    t1 = now()
    log_metrics(
        query=query,
        route=route,
        latency_total_ms=(t1 - t0),
        latency_llm_ms=(llm_t1 - llm_t0),
        retrieval_stats=hnsw_res.stats,
        quality_placeholder=None,  # filled later by evaluators
    )
    return reply
```

You want both performance and quality benchmarks, plus router performance.
Define a benchmark matrix:

- Devices:
  - Laptop CPU (for example 8‑core x86 or Apple M‑series).
  - Smartphone or dev board with a Snapdragon‑class SoC. (liquid.ai)
- Models:
  - `lfm2-350m`, `lfm2-700m`, `lfm2-1.2b`, `lfm2-2.6b`.
- Quantization:
  - Q4 and Q5 for CPU (llama.cpp).
  - 8‑bit or 4‑bit weight‑only for vLLM if a GPU is available.
For each cell, log:
- Prefill tokens per second.
- Decode tokens per second.
- Peak memory (RAM).
The LFM2 technical report gives reference numbers showing up to 2x speedup vs Qwen3 on CPU, which you can use for sanity checks. (arXiv)
Define scenarios:
- Simple FAQ.
- Moderate reasoning with some retrieval.
- Heavy multi step question requiring deeper graph attention.
For each scenario:

- Sample 100 to 1,000 queries from real workloads.
- Run four systems:
  - Baseline big: `lfm2-2.6b` with large context, no router.
  - Mid fixed: `lfm2-1.2b` with medium context.
  - Small fixed: `lfm2-700m` with small context.
  - Routed: your full ruvector plus FastGRNN system.

Measure:
- P50, P90, P99 end‑to‑end latency.
- LLM portion vs retrieval‑plus‑routing portion.
- Request success rate (no timeouts).
Pick a mix of:

- Public benchmarks for sanity:
  - GSM8K, MMLU subsets, IFEval‑style instruction tests. LFM2 already reports strong small‑model performance on these; use them to confirm you have not degraded quality. (arXiv)
- Internal tasks:
  - Domain QA with ground‑truth answers.
  - Retrieval tasks: “is the correct document present in the context”.

Metrics:
- Exact match and F1 on QA tasks.
- Retrieval recall at k (R@5, R@10).
- Judge model evaluation:
  - Use an external LLM as a judge to rate answers from 1 to 5 on helpfulness and correctness.
Important: evaluate both the baseline big system and the routed system on the same dataset. Compute regret:
[ \text{Regret} = \mathbb{E}[Q_{\text{big}} - Q_{\text{routed}}] ]
Target: keep average regret under 0.1 points on a 1 to 5 scale while gaining substantial latency savings.
From the same dataset:

- Route distribution:
  - Fraction of calls to each model size.
  - Fraction of “escalations” where the router goes to 2.6B.
- Oracle comparison:
  - For each request, compute the best decision among your candidate policies.
  - Measure how often FastGRNN picks the same decision, or one within epsilon utility of it.
- Cost efficiency:
  - Total compute tokens or “model‑tokens” consumed.
  - Compare to the always‑big baseline.
Define a log row per request:

```json
{
  "request_id": "uuid",
  "timestamp": "ISO-8601",
  "user_id": "hash",
  "query": "string",
  "router_features": [...],
  "router_decision": {
    "model_id": "lfm2-700m",
    "context_size": 2048,
    "temperature": 0.7,
    "top_p": 0.9
  },
  "retrieval_stats": {
    "k": 64,
    "mean_distance": 0.61,
    "std_distance": 0.07,
    "entropy_attention": 1.93,
    "nodes_used": 12
  },
  "latency_ms": {
    "total": 420,
    "retrieval": 80,
    "router": 2,
    "llm": 310,
    "writeback": 28
  },
  "quality": {
    "label": "correct|partial|wrong",
    "score": 4.5,
    "judge_model": "gpt-5.1-pro"
  }
}
```

This becomes the training data for router refinement and system optimization.
Run this as a phased program rather than tweaking everything at once.
Choose the smallest viable model per task
- Start with 700M as the default, 1.2B for complex requests, 350M for classification. LFM2 is tuned for strong performance at these sizes. (liquid.ai)

Quantization strategy
- For CPU: use 4‑bit weights (Q4) for 350M and 700M, and 5‑bit if you can afford it for 1.2B and 2.6B.
- Verify there is no major quality drop using GSM8K and a sample of your tasks.

Context packing
- Aggressively deduplicate and summarize before passing context.
- Use LFM2 itself to compress multiple high‑attention nodes into a short summary node and write it back to ruvector.

Prompt templates
- Keep prompts short and stable. Long system prompts kill prefill speed.
- Benchmark standard vs minimal templates.

KV cache reuse and chunking
- For multi‑turn dialogs, reuse the KV cache; LFM2 is designed for long contexts and can benefit greatly from KV reuse. (arXiv)
Tune HNSW parameters
- Start with `M=32`, `efConstruction=200`, `efSearch=64`.
- Measure recall vs a brute‑force baseline on a validation set.
- Adjust `efSearch` upward if recall is too low, downward if latency is too high (see the tuning sketch below). (arXiv)
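A minimal tuning harness, assuming the hnswlib library as a stand‑in; adapt to ruvector's native index if it exposes these knobs. The random data is a placeholder for your node vectors:

```python
import numpy as np
import hnswlib

def brute_force_topk(base, queries, k):
    """Exact cosine top-k, used as recall ground truth."""
    b = base / np.linalg.norm(base, axis=1, keepdims=True)
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    return np.argsort(-(q @ b.T), axis=1)[:, :k]

dim, n, k = 384, 50_000, 10
data = np.random.rand(n, dim).astype(np.float32)  # stand-in for node vectors

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, M=32, ef_construction=200)
index.add_items(data)

queries = data[:100]
truth = brute_force_topk(data, queries, k)

# sweep efSearch upward until recall is acceptable or the latency budget is hit
for ef in (32, 64, 128, 256):
    index.set_ef(ef)
    labels, _ = index.knn_query(queries, k=k)
    recall = np.mean([len(set(l) & set(t)) / k for l, t in zip(labels, truth)])
    print(f"efSearch={ef}: recall@{k}={recall:.3f}")
```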
Hybrid indexing for huge corpora
- If you have billions of nodes, consider a hybrid like HANNIS or partitioned HNSW: cluster documents, build one HNSW per cluster, and route into a cluster first. (VLDB)

Graph pruning
- Prune edges with very low weights or old timestamps to keep neighborhoods sparse and attention efficient.
- Limit the degree per node to a budget (for example 64) using age‑plus‑weight heuristics, as sketched below.
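A sketch of the degree cap. The `timestamp_unix` field and the exponential‑decay blend of weight and age are illustrative assumptions layered on the edge schema above:

```python
import time

def prune_edges(edges, max_degree=64, half_life_s=30 * 86400):
    """Keep each node's best max_degree outgoing edges by a weight-plus-recency score."""
    now = time.time()

    def score(e):
        age = now - e["metadata"]["timestamp_unix"]  # assumed unix-time field
        recency = 0.5 ** (age / half_life_s)         # exponential age decay
        return e["weight"] * recency

    by_src = {}
    for e in edges:
        by_src.setdefault(e["src"], []).append(e)
    kept = []
    for src, es in by_src.items():
        kept.extend(sorted(es, key=score, reverse=True)[:max_degree])
    return kept
```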
Attention optimization
- Use multi‑head attention with low ranks and fused ops if possible.
- Cap the induced subgraph size (for example 256 nodes) to bound compute.

Model size vs CPU time
- Start with hidden size 32 and measure average router latency. FastGRNN is designed to be tiny and should be sub‑millisecond on CPU. (arXiv)
- If decision quality is poor (high regret), increase the hidden size gradually.
Curriculum training
- Begin training on easy routing decisions (clear latency vs quality tradeoffs).
- Gradually introduce more ambiguous examples.

Online refinement
- Periodically retrain the router using new logs.
- Use bandit‑style exploration where a small fraction of requests try alternative routing decisions.

Guardrails
- For very high‑risk or high‑value queries (by user, domain, or self‑estimated uncertainty), always escalate to 2.6B regardless of the router, or require an explicit router confidence threshold.

Batching across requests
- Under high load, batch LFM2 calls with compatible models and sampling configs. vLLM makes this straightforward. (Hugging Face)

Cache common queries
- Maintain a cache keyed by the normalized query plus top retrieval IDs (a key sketch follows).
- Serve from cache if a near‑identical query appears and the graph neighborhood has not changed.
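A sketch of the cache key. The normalization here (lowercase plus whitespace collapse) is deliberately simple and is an assumption, not a prescription:

```python
import hashlib
import re

def cache_key(query: str, top_node_ids: list[str]) -> str:
    """Key = normalized query + sorted top retrieval IDs, so neighborhood changes invalidate."""
    norm = re.sub(r"\s+", " ", query.strip().lower())
    payload = norm + "|" + ",".join(sorted(top_node_ids))
    return hashlib.sha256(payload.encode()).hexdigest()
```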
Parallel retrieval and routing
- Embed and retrieve in parallel, then run the router once both the embedding and the retrieval stats are ready.
- The router can use a subset of features if the full stats are not needed.

Tiered deployment
- On device, keep 350M and 700M.
- On edge or server, keep 1.2B and 2.6B.
- The router decides whether to stay local or escalate across tiers, respecting privacy flags.
You can use this as a project checklist:

- LFM2 service up with four sizes, quantized, with metrics.
- Embedding service aligned with ruvector dimensions.
- ruvector:
  - Node and edge schemas implemented.
  - HNSW index online and tuned.
  - Graph attention implemented and benchmarked on synthetic graphs.
- Orchestrator:
  - Happy path working: query → retrieval → context → LFM2.
  - Telemetry and logging integrated.
- Baseline benchmarks:
  - Always‑big and mid‑fixed systems fully measured.
- Router:
  - Feature extractor implemented.
  - FastGRNN implemented and trained from baseline logs.
  - FastGRNN serving endpoint integrated into the orchestrator.
- Routed system:
  - A/B tested vs the always‑big baseline.
  - Latency and quality targets validated.
- Optimization loop:
  - Monthly or continuous retraining for the router.
  - HNSW and attention parameters periodically re‑tuned.
  - New LFM2 variants plugged in as they arrive. (liquid.ai)