hw.embedding.agents.rag

RAG implementation reference.


🗂 Folder layout

rag-agent-pgvector/
│
├── data/
│   └── docs.json                # seed KB
│
├── src/
│   ├── db.js                    # Postgres connection + table setup
│   ├── embed.js                 # create + store embeddings in pgvector
│   ├── retrieve.js              # query pgvector for top-k docs
│   ├── agent.js                 # reasoning + reflection loop
│   └── utils.js                 # small helpers
│
├── .env                         # API + DB creds
├── package.json
└── index.js                     # entry point

1️⃣ .env

OPENAI_API_KEY=sk-your-key
DATABASE_URL=postgresql://postgres:password@localhost:5432/ragdb

You need Postgres 15+ with pgvector installed:

psql -d ragdb -c "CREATE EXTENSION IF NOT EXISTS vector;"

2️⃣ package.json

{
  "name": "rag-agent-pgvector",
  "version": "1.0.0",
  "type": "module",
  "scripts": {
    "start": "node index.js",
    "embed": "node src/embed.js"
  },
  "dependencies": {
    "dotenv": "^16.4.0",
    "openai": "^4.0.0",
    "pg": "^8.11.0"
  }
}

Install:

npm install

3️⃣ data/docs.json

The same toy corpus as the earlier JSON + cosine version:

[
  {
    "id": "D1",
    "title": "Billing Policy",
    "text": "You can pause your subscription once per year for up to 4 weeks. Pauses longer than 4 weeks require program lead approval."
  },
  {
    "id": "D2",
    "title": "Refunds",
    "text": "Refunds are available within 7 days of first payment only."
  },
  {
    "id": "D3",
    "title": "Cohort Rules",
    "text": "Attendance below 60% requires remedial tasks; pauses don’t reset attendance."
  },
  {
    "id": "D4",
    "title": "Contact",
    "text": "Email support@neog.camp for approval cases; include start and end dates."
  }
]

4️⃣ src/db.js

// src/db.js
import pg from "pg";
import dotenv from "dotenv";
dotenv.config();

export const pool = new pg.Pool({
  connectionString: process.env.DATABASE_URL,
});

export async function setupDB() {
  await pool.query(`
    CREATE TABLE IF NOT EXISTS documents (
      id TEXT PRIMARY KEY,
      title TEXT,
      content TEXT,
      embedding vector(1536)
    );
  `);
  console.log("✅ documents table ready");
}

5️⃣ src/embed.js

// src/embed.js
import fs from "fs";
import path from "path";
import OpenAI from "openai";
import { pool, setupDB } from "./db.js";
import dotenv from "dotenv";
dotenv.config();

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const docs = JSON.parse(fs.readFileSync(path.resolve("data/docs.json"), "utf-8"));

async function createEmbeddings() {
  await setupDB();

  console.log("🧠 Creating embeddings and inserting into Postgres...");
  for (const doc of docs) {
    const input = `${doc.title}\n${doc.text}`;
    const emb = await openai.embeddings.create({
      model: "text-embedding-3-small",
      input,
    });
    const vec = emb.data[0].embedding;

    await pool.query(
      `INSERT INTO documents (id, title, content, embedding)
       VALUES ($1,$2,$3,$4)
       ON CONFLICT (id) DO UPDATE SET
         title = EXCLUDED.title,
         content = EXCLUDED.content,
         embedding = EXCLUDED.embedding;`,
      // pgvector expects the '[0.1,0.2,...]' text format, not a JS/Postgres array
      [doc.id, doc.title, doc.text, JSON.stringify(vec)]
    );

    console.log(`→ stored ${doc.id}`);
  }
  console.log("✅ All documents embedded.");
  await pool.end();
}

createEmbeddings();

Run once:

npm run embed

6️⃣ src/retrieve.js

// src/retrieve.js
import OpenAI from "openai";
import { pool } from "./db.js";
import dotenv from "dotenv";
dotenv.config();

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

export async function retrieveRelevantDocs(query, k = 2) {
  // 1. get embedding for query
  const emb = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: query,
  });
  const qVec = emb.data[0].embedding;

  // 2. similarity search using pgvector's <=> operator (cosine distance)
  const res = await pool.query(
    `SELECT id, title, content, 1 - (embedding <=> $1) AS similarity
     FROM documents
     ORDER BY embedding <=> $1
     LIMIT $2;`,
    // pgvector expects the '[0.1,0.2,...]' text format, so serialize the JS array
    [JSON.stringify(qVec), k]
  );

  return res.rows;
}

Note: <=> is pgvector's cosine-distance operator; smaller = closer, and 1 - distance is the cosine similarity. (<-> is L2 distance and <#> is negative inner product; pick one operator and use it consistently for scoring, ordering, and indexing.)
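
For reference, here is a quick psql experiment against the documents table above that shows all three operators side by side. It compares every row against D1's own embedding so no placeholder vector is needed; this cross-join form is only for illustration and will not use an index:

SELECT d.id,
       d.embedding <-> q.embedding AS l2_distance,       -- Euclidean distance
       d.embedding <=> q.embedding AS cosine_distance,   -- 1 - cosine similarity
       d.embedding <#> q.embedding AS neg_inner_product  -- negative inner product
FROM documents d,
     (SELECT embedding FROM documents WHERE id = 'D1') q
ORDER BY cosine_distance;

D1 should come back with an L2 and cosine distance of 0 against itself; the ordering of the remaining rows is what retrieval actually uses.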


7️⃣ src/agent.js

// src/agent.js
import OpenAI from "openai";
import { retrieveRelevantDocs } from "./retrieve.js";
import dotenv from "dotenv";
dotenv.config();

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

export async function agenticAnswer(question) {
  console.log("🤔 Question:", question);

  // Step 1 — retrieve
  const docs = await retrieveRelevantDocs(question, 2);
  console.log("📚 Retrieved:", docs.map(d => d.title).join(", "));

  const context = docs.map(d => `${d.title}: ${d.content}`).join("\n\n");

  // Step 2 — generate grounded answer
  const completion = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content:
          "Answer truthfully using only the context. Cite the doc titles you used."
      },
      {
        role: "user",
        content: `Context:\n${context}\n\nQuestion: ${question}\nAnswer:`
      }
    ]
  });

  const answer = completion.choices[0].message.content;
  return { answer, sources: docs.map(d => d.title) };
}
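
The folder layout calls agent.js a "reasoning + reflection loop", and the tickets below ask you to add the reflection half. Here is a minimal sketch of that step, layered on top of agenticAnswer and retrieveRelevantDocs as written above; the 0.25 threshold and the rewrite hint are illustrative, not tuned values:

// src/agent.js (optional) — coverage-style reflection: if the best match is weak,
// broaden the query before generating. Uses the retrieveRelevantDocs import above.
// Note it retrieves twice (probe + answer); fine for a sketch.
export async function agenticAnswerWithReflection(question) {
  const probe = await retrieveRelevantDocs(question, 2);
  const bestSimilarity = probe.length ? Number(probe[0].similarity) : 0;

  if (bestSimilarity < 0.25) {
    console.log("🔁 Weak retrieval, broadening the query before answering...");
    return agenticAnswer(`${question} (pause policy, approval, duration)`);
  }
  return agenticAnswer(question);
}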

8️⃣ index.js

// index.js
import { agenticAnswer } from "./src/agent.js";
import { pool } from "./src/db.js";

const question = "Can I pause my neoG Bootcamp subscription for 2 months?";

const { answer, sources } = await agenticAnswer(question);

console.log("\n💬 Final Answer:\n", answer);
console.log("\n📎 Sources:", sources);

await pool.end();

🚀 Run

npm run embed
npm start

Output:

🤔 Question: Can I pause my neoG Bootcamp subscription for 2 months?
📚 Retrieved: Billing Policy, Contact

💬 Final Answer:
You can pause for up to 4 weeks per year. Two months exceeds this limit, so you
must get program-lead approval. Email support@neog.camp with your dates.
(Sources: Billing Policy, Contact)

🧭 Why pgvector is better

| Aspect      | JSON + cosine      | pgvector                           |
|-------------|--------------------|------------------------------------|
| Scale       | few hundred docs   | millions easily                    |
| Speed       | linear scan        | index-accelerated (ivfflat, hnsw)  |
| Query       | manual cosine math | SQL (ORDER BY embedding <-> $1)    |
| Maintenance | static file        | updatable, versionable             |
| Integration | none               | full SQL joins, filters, metadata  |
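
The "manual cosine math" column refers to the brute-force approach you would run over a JSON file. A minimal sketch of that baseline (names are illustrative), useful mostly to appreciate what pgvector replaces:

// Cosine similarity between two embedding arrays: the JSON + cosine baseline.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Linear scan: score every doc, sort, take top-k. Fine for hundreds of docs, not millions.
const topK = (docs, queryVec, k = 2) =>
  docs
    .map(d => ({ ...d, score: cosineSimilarity(queryVec, d.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);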

🥭 Technical Primer: Agentic RAG — the “look-it-up before you speak” pattern

This is a practical, beginner-friendly guide to what Agentic RAG is and how it works, using plain language and web-dev parallels. No hand-waving, no mysticism—just the moving parts and how they click.


1) What problem are we solving, really?

LLMs are great at writing, but not great at remembering specific facts on demand (latest policies, your internal docs, error codes). RAG (Retrieval-Augmented Generation) fixes this by letting the model look things up before it answers—like your brain opening a tab, scanning a doc, then replying.

Agentic AI adds the “common sense” loop: the model can decide to search again, refine the query, ask for a missing detail, or cite sources. Think of it as a cautious senior dev who checks the docs and runs another query before committing a PR.

Core idea: Don’t guess. Retrieve → read → respond. If weak, try again.


2) Mental model (with a web-dev analogy)

Picture a typical web request cycle:

  1. Request comes in → Express route/controller.
  2. Decide what to fetch → Query builder chooses tables/filters.
  3. Fetch → SQL query hits DB/index.
  4. Compose response → Templating layer builds HTML/JSON.
  5. Middleware → Retries, logging, rate limits, validation.

Now map that to Agentic RAG:

  1. User question → “controller”.
  2. Plan: Do I know this, or should I fetch? → simple heuristic.
  3. Retrieve → vector search (pgvector) for top-k relevant chunks.
  4. Generate → LLM composes an answer using those chunks.
  5. Reflect → was the answer grounded? If not, re-query/clarify.

RAG is basically semantic SQL + a polite writer.


3) The minimum viable architecture (MVA)

You can build a solid RAG with four parts:

  1. Knowledge base: your documents (Markdown, PDFs, FAQs).
  2. Embedder: turns text into vectors (OpenAI Embeddings).
  3. Vector index: stores vectors and finds nearest neighbors (pgvector in Postgres).
  4. Generator: the LLM that writes the answer (OpenAI chat completion).

Agentic layer sits on top: small rules that decide whether to re-query, ask a follow-up, or provide citations.

Web-dev parallel:

  • Knowledge base = your tables.
  • Embedder = a preprocessor that creates search-friendly columns.
  • Vector index = your DB index (GIN/HNSW).
  • Generator = your templating engine.
  • Agentic layer = middleware + service orchestration.

4) End-to-end example (neoG policy scenario)

Question: “Can I pause my neoG Bootcamp subscription for 2 months?”

Docs (tiny KB):

  • Billing Policy: “You can pause once per year for up to 4 weeks. Longer needs program-lead approval.”
  • Contact: “Email support@neog.camp with dates.”

Flow:

  1. Plan: It’s a policy query → retrieve.
  2. Retrieve: Query embedding for “pause 2 months” → top-k returns “Billing Policy” and “Contact”.
  3. Generate: LLM drafts: 2 months > 4 weeks → needs approval; email support with dates.
  4. Reflect: Are sources present? Yes. Confidence: high.
  5. Respond with receipts.

Why this matters: The model isn’t hallucinating; it’s quoting your source of truth.


5) Key concepts in small bites

  • Embedding: A numeric fingerprint of text. Similar meaning → nearby vectors. Web analogy: computed columns for search (like tsvector), but semantic.

  • Vector search: “Find me chunks like this query.” Web analogy: ORDER BY <-> in pgvector is your semantic ORDER BY similarity.

  • Chunking: Split docs into 200–1000-token pieces so retrieval picks focused parts. Web analogy: normalized rows vs one giant blob.

  • Top-k: Return k best matches (usually 3–8). Web analogy: LIMIT k.

  • Grounding: Forcing the LLM to answer using retrieved text. Web analogy: server-side rendering with strict data props—no client-side fantasy.

  • Reflection: If confidence is low, retry retrieval or ask a clarifying question. Web analogy: a middleware that retries a flaky DB call or returns 422 asking for params.


6) How planning and reflection actually work (without buzzwords)

Start with rules, not “AI planning”:

  • Rule A (coverage): If no retrieved chunk has ≥ X similarity (e.g., 0.25), expand query (“pause policy approval”).
  • Rule B (citations): If the draft answer has no citation tokens ([Source: …]), reject and retry with a stricter prompt.
  • Rule C (length): If answer is too short/too long, regenerate with an explicit style hint.
  • Rule D (clarify): If the query lacks required fields (dates, cohort), ask a single clarifying question.

That’s agentic enough to beat many “fancy framework” demos.


7) Retrieval quality—what actually moves the needle

  • Chunking strategy: Overlap 20–50% to avoid splitting related lines.
  • Field-aware text: Title + headers + content → one string for embedding.
  • Metadata filters: Narrow by collection/tags before vector search.
  • Indexing: ivfflat/hnsw in pgvector for speed at scale.
  • Re-ranking (optional): After top-k, re-score with a tiny cross-encoder or simple rules (title match > body match).

Web-dev instinct applies: clean data, good indexes, pragmatic filters.
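
A minimal sketch of the chunking and field-aware-text points above, using a rough words-as-tokens approximation (a real pipeline would count tokens with a tokenizer such as tiktoken; the sizes and overlap here are illustrative):

// Split a document into overlapping chunks, prefixing each with its title so the
// embedding carries field-aware context. Sizes are in words, not true tokens.
function chunkDocument(doc, chunkSize = 300, overlap = 80) {
  const words = doc.text.split(/\s+/);
  const chunks = [];
  for (let start = 0; start < words.length; start += chunkSize - overlap) {
    const piece = words.slice(start, start + chunkSize).join(" ");
    chunks.push({
      id: `${doc.id}-${chunks.length}`,
      title: doc.title,
      text: `${doc.title}\n${piece}`, // title + content in one string for embedding
    });
    if (start + chunkSize >= words.length) break; // last chunk reached
  }
  return chunks;
}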


8) Prompting that doesn’t collapse under load

System message (strict and boring on purpose):

  • “Answer only from the provided context. If missing, say what’s missing and ask one follow-up.”
  • “Cite the document titles you used in brackets at the end.”
  • “If policy conflicts, choose the newest timestamped doc.”

User message:

  • Context:\n<top-k chunks>\n\nQuestion:\n<user question>\n\nConstraints: cite titles, keep it concise.

Guardrail: If the model mentions facts not present in the chunks, reject and regenerate with a sterner instruction: “Do not invent. If unsure, say ‘Not enough info in context.’”
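
One cheap version of that guardrail, sketched below: check whether the draft cites any retrieved title, and if not, regenerate once with the sterner instruction. The helper names are hypothetical; the API calls mirror agent.js above:

// Returns true if the draft cites at least one retrieved doc title.
const isGrounded = (draft, docs) => docs.some(d => draft.includes(d.title));

// Ask once with the normal strict prompt, once more with the sterner one if needed.
async function answerWithGuardrail(openai, context, question, docs) {
  const ask = system =>
    openai.chat.completions.create({
      model: "gpt-4o-mini",
      messages: [
        { role: "system", content: system },
        { role: "user", content: `Context:\n${context}\n\nQuestion: ${question}` },
      ],
    }).then(r => r.choices[0].message.content);

  let draft = await ask("Answer only from the provided context. Cite the document titles you used in brackets.");
  if (!isGrounded(draft, docs)) {
    draft = await ask("Do not invent. If unsure, say 'Not enough info in context.' Cite the document titles you used in brackets.");
  }
  return draft;
}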


9) Failure modes (and blunt fixes)

  • Hallucination: Add a hard rule—no answer without citations. Penalize in eval.
  • Bad retrieval: Improve chunking + metadata filters. Increase k. Add query rewrite.
  • Latency: Cache embeddings (see the sketch after this list), warm indexes, parallelize retrieval/generation for multi-query flows.
  • Cost creep: Keep chunks short, use smaller models for rewrite/reflect steps, compress context aggressively.
  • Document drift: Re-embed on change, store doc version, prefer newest in conflicts.
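
Even a naive in-process cache around the embedding call helps when the same queries repeat; a sketch of the idea, with Redis or a DB table being the real-world version:

// Naive in-memory embedding cache keyed by the input text.
const embeddingCache = new Map();

async function cachedEmbedding(openai, input) {
  if (embeddingCache.has(input)) return embeddingCache.get(input);
  const res = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input,
  });
  const vec = res.data[0].embedding;
  embeddingCache.set(input, vec);
  return vec;
}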

10) How to measure “it works”

  • Groundedness score: Fraction of answer sentences that map to cited text spans.
  • Exactness on policy Qs: are limits, dates, addresses correct?
  • Coverage: % of queries with at least one high-similarity chunk.
  • User-visible metrics: latency p95, answer length bands, citation count.
  • Cost per answer: embeddings + tokens + DB time.

Start simple: log question → retrieved IDs → final answer → citations. Review diffs like you review PRs.
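
That logging loop can literally be one appended JSON line per request, in a file you diff and review later. A sketch (the logs/runs.jsonl path is illustrative and assumes the logs/ directory exists):

import fs from "fs";

// Append one JSON line per answered question: enough to replay and evaluate runs later.
function logRun({ question, retrievedIds, answer, sources }) {
  const entry = {
    ts: new Date().toISOString(),
    question,
    retrievedIds,
    answer,
    sources,
  };
  fs.appendFileSync("logs/runs.jsonl", JSON.stringify(entry) + "\n");
}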


11) Scaling beyond the toy

  • pgvector tips:

    • Use vector_cosine_ops (cosine) or L2 consistently across write/read.
    • Build an ivfflat or hnsw index once you cross ~50k chunks (see the SQL sketch after this list).
    • Keep metadata columns (collection, tags, updated_at) for filtering.
  • Pipelines:

    • Offline: ingest → chunk → embed → upsert.
    • Online: query → retrieve → generate → reflect → log.
  • Multi-tool agents (later): route to calculator, code runner, or web fetch if the question demands it. Start with retrieval only.
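
The index SQL referenced in the pgvector tips above, as a sketch; match the operator class to the operator you query with (<=> pairs with vector_cosine_ops), and treat the parameters as starting points to benchmark, not tuned values:

-- HNSW: a good default for read-heavy workloads (pgvector 0.5+).
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);

-- ivfflat alternative: choose lists based on table size (rows / 1000 is a common starting point).
CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);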


12) Minimal “from-scratch” build order (no frameworks)

  1. Postgres + pgvector table.
  2. Ingest: parse Markdown/PDF → chunk → embed → upsert.
  3. Retrieval: SELECT … ORDER BY embedding <-> $1 LIMIT k.
  4. Generation: strict system prompt + context stuffing.
  5. Reflection: retry if similarity or citations are missing.
  6. Logging: dump JSON lines; evaluate weekly.

This sequence is the backbone. Fancy libraries are optional sugar.


13) Real-world housekeeping (the boring bits that save you)

  • Source of truth: Keep docs in a repo or CMS; push on change triggers re-embedding.
  • PII: Don’t embed raw sensitive data; mask or store IDs.
  • Access control: Filter by tenant/team before retrieval.
  • Version pins: Log embedding model and chunker version; migrations are real.
  • Backups: It’s a DB. Treat it like one.

14) Tiny pseudo-code sketch (Node mindset, readable)

// plan
if (isPolicyLike(userQuestion)) {
  // retrieve
  const qVec = embed(userQuestion);
  const chunks = db.vectorSearch(qVec, { k: 4, filter: { collection: "policies" } });

  // generate
  const draft = llm.answer({
    system: "Use only provided chunks. Cite titles. If missing info, say so.",
    context: chunks.map(c => `[${c.title}] ${c.text}`).join("\n\n"),
    question: userQuestion
  });

  // reflect
  if (!hasCitations(draft) || lowSimilarity(chunks)) {
    const q2 = rewrite(userQuestion, "focus: pause policy, approval, duration");
    return runAgain(q2);
  }
  return draft;
}
// else: other handlers…

This is all most agent demos are doing—just well-organized.


15) Cheat-sheet glossary (two-liners)

  • RAG: Retrieve relevant text first, then generate the answer using it.
  • Agentic: Add decisions—retry, clarify, cite, or abstain.
  • Embedding: Vector form of text; “meaning coordinates.”
  • Vector DB / pgvector: SQL + special index to search by meaning.
  • Chunk: A slice of a document for precise retrieval.
  • Top-k: The k best matches.
  • Grounding: Answer only from retrieved text.
  • Re-ranking: Reordering results with a better scorer.
  • Reflection: A quick sanity check and retry logic.

16) Where students usually stumble—and how to steer them

  • Too-big chunks → noisy context → vague answers. Fix: 300–700 tokens, with small overlaps.

  • Inconsistent distance ops (cosine vs L2) → weird retrieval. Fix: pick one and stick to it across indexing and queries.

  • Prompts that allow invention → hallucinations. Fix: forbid external facts, require citations, accept “not enough info”.

  • No logs → slow debugging. Fix: log query, chunk IDs, similarities, answer, and citations per run.


17) The blunt takeaway

Agentic RAG is just disciplined reading before careful writing. It’s more like building a search-backed API than “training a genius.” Good chunking, sensible indexing, strict prompts, and boring reflection rules beat hype every day of the week.

Ship the minimal loop. Measure it. Then add cleverness.

🧠 EPIC: Build a Minimal Agentic RAG System with OpenAI + pgvector

🎯 Goal

Build a fully functional Retrieval-Augmented Generation (RAG) pipeline with an agentic reasoning layer — from scratch. By the end of this Epic, the learner should be able to query a local knowledge base, retrieve semantically relevant documents using pgvector, and generate grounded answers using OpenAI’s chat and embedding APIs.

No boilerplate code is provided — learners must implement each ticket step by step.


🏁 Context

Large Language Models are great at reasoning but forgetful about external facts. Retrieval-Augmented Generation (RAG) fixes that by giving the model a memory it can search. Agentic AI adds one more layer: self-reflection — the ability to decide when to retrieve, when to retry, and when to ask for clarification.

This project simulates the workflow of a small AI startup or research lab — where engineers must combine API usage, database design, and system thinking.


🗓️ Project Duration

~5–7 days of focused effort (2–3 hours/day).


🔩 Deliverable

A runnable command:

npm start

that outputs:

🤔 Question: Can I pause my neoG Bootcamp subscription for 2 months?
📚 Retrieved: Billing Policy, Contact
💬 Final Answer:
You can pause for 4 weeks per year. For 2 months, request approval via support@neog.camp.
(Sources: Billing Policy, Contact)

🧱 Tickets Breakdown


🪧 EPIC: Agentic RAG System (OpenAI + pgvector)


Ticket #1 — Project Setup

Goal: Create a Node.js project with environment setup and dependencies.

Tasks:

  • Initialize a Node project (npm init).

  • Install required libraries: dotenv, openai, pg.

  • Create .env file and add placeholders for API and DB credentials.

  • Set up folder structure:

    src/
    data/
    .env
    package.json
    index.js
    

Acceptance Criteria:

  • Project runs a simple “hello world” log.
  • Environment variables load correctly via dotenv.

Ticket #2 — Database Initialization

Goal: Set up PostgreSQL and enable pgvector extension.

Tasks:

  • Create a local database ragdb.

  • Enable the vector extension.

  • Write a db.js file that connects to Postgres using a connection pool.

  • Implement a setup function that creates a documents table with:

    • id TEXT PRIMARY KEY
    • title TEXT
    • content TEXT
    • embedding vector(1536)

Acceptance Criteria:

  • Running the setup script logs “✅ documents table ready”.
  • Table verified via psql \d documents.

Ticket #3 — Knowledge Base Preparation

Goal: Seed the project with a few small reference documents.

Tasks:

  • Create data/docs.json with 3–5 short documents (like FAQs or policies).
  • Ensure structure: {id, title, text}.
  • Add sample content manually — don’t scrape or import.

Acceptance Criteria:

  • JSON file loads correctly via fs.readFileSync.
  • console.log() shows all document titles.

Ticket #4 — Generate and Store Embeddings

Goal: Generate vector embeddings for all documents using OpenAI’s API.

Tasks:

  • Create embed.js that:

    • Reads from data/docs.json.
    • Calls openai.embeddings.create({ model: 'text-embedding-3-small' }).
    • Inserts each doc into Postgres with the resulting vector.

Acceptance Criteria:

  • Running npm run embed populates documents table.
  • SELECT COUNT(*) FROM documents; returns number of docs.
  • Verify one embedding manually (e.g., SELECT id, vector_dims(embedding) FROM documents LIMIT 1; should report 1536).

Ticket #5 — Retrieval with pgvector

Goal: Implement nearest-neighbor retrieval using vector distance.

Tasks:

  • Create retrieve.js with a function retrieveRelevantDocs(query, k).

  • Generate query embedding.

  • Use SQL query:

    SELECT id, title, content
    FROM documents
    ORDER BY embedding <-> $1
    LIMIT $2;
  • Return the top-k docs.

Acceptance Criteria:

  • console.log(await retrieveRelevantDocs("pause subscription")) prints the top 2 docs with similarity scores.
  • Retrieval accuracy looks intuitive.

Ticket #6 — Agentic Reasoning

Goal: Combine retrieval + generation + reflection.

Tasks:

  • Create agent.js.

  • Steps:

    1. Retrieve top-k docs.

    2. Build a context string combining doc titles + text.

    3. Call openai.chat.completions.create() with:

      • system prompt explaining the rules.
      • user prompt including context + question.
    4. (Optional) If retrieval scores are low, rephrase query and retry.

  • Return final answer + sources.

Acceptance Criteria:

  • Given a question, the system prints:

    • retrieved docs
    • final answer
    • sources

Ticket #7 — Integration + Entry Point

Goal: Create index.js that ties everything together.

Tasks:

  • Import the agent function.
  • Ask one fixed question (like “Can I pause my subscription?”).
  • Print the full reasoning chain and final output.

Acceptance Criteria:

  • Running npm start outputs a complete RAG flow.

Ticket #8 — Reflection and Extension

Goal: Make the agent agentic, not static.

Tasks (Optional Stretch):

  • Implement a confidence threshold (e.g., mean similarity < 0.25).
  • If low, refine the query (“policy approval”).
  • Rerun retrieval and generation.
  • Print both attempts.

Acceptance Criteria:

  • System prints when it reformulates the query.
  • Output shows improved results after retry.

Ticket #9 — Indexing & Optimization (Advanced)

Goal: Scale the retrieval layer.

Tasks:

  • Create an index:

    CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops);
  • Benchmark retrieval speed before vs after.

  • Experiment with different similarity ops (<->, <#>, etc.).

Acceptance Criteria:

  • Retrieval latency < 50 ms for 1k docs.
  • Learner explains in comments what the index does.

🧩 Learning Outcomes

By finishing this Epic, learners will:

  1. Understand RAG architecture from first principles.
  2. Use OpenAI embeddings and chat completions programmatically.
  3. Learn pgvector’s SQL semantics and distance operators.
  4. Experience agentic reasoning loops — retrieval → generation → reflection.
  5. Build a reproducible foundation for future products (assistants, copilots, search systems).

🪄 Extensions for Ambitious Builders

  • Add metadata filters (date, tags, author).
  • Log query–answer pairs for fine-tuning datasets.
  • Plug in LangChain or LlamaIndex after writing the primitives manually — to appreciate the abstractions.
  • Replace OpenAI with a local embedding + inference model to understand portability.