@bacalj
Last active March 12, 2026 18:50

ACCESS-CI: Feb/March 2026 Work Plan

Owner: Joe Bacal
Collaborator: Andrew Pasquale
Last updated: 2026-03-11
Status: A.3 Run 5 complete — first full-system test (pgvector + MCP + routing vs UKY). C.1–C.3 complete. Node tracing added to the agent graph (gated behind ?include_trace=true). Next: curate cross-cutting Q&A pairs for the 12 LLM-only questions.


Context

The ACCESS-CI AI system has two RAG backends:

  1. UKY Document RAG — The existing system at access-ai-grace1-external.ccs.uky.edu. Answers are generated from canonical documents. This is what the production QA Bot uses today.
  2. pgvector Q&A-Pair RAG — A separate service called access-qa-service (repo, port 8001). Stores human-curated Q&A pairs with semantic search via sentence-transformers + HNSW indexing. Called by access-agent over HTTP. Currently a fallback only.
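The retrieval idea behind the Q&A-pair service can be sketched without the service itself: rank stored pairs by cosine similarity to the query embedding. This toy version uses hand-made vectors in place of sentence-transformers output and a plain sort in place of pgvector's HNSW index — an illustration of the ranking logic, not code from access-qa-service.

```python
# Toy sketch of semantic Q&A-pair retrieval. Real embeddings come from
# sentence-transformers; real search runs in pgvector under an HNSW index.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def search(pairs: list[tuple[str, str, list[float]]],
           query_vec: list[float], k: int = 2):
    """pairs: (question, answer, embedding) tuples; returns the k nearest."""
    return sorted(pairs, key=lambda p: cosine(p[2], query_vec), reverse=True)[:k]
```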

Andrew built access-agent (LangGraph orchestrator on Linode at access-agent.elytra.net) which classifies queries and routes them to the appropriate backend. His qa-bot-core feature branch (feature/access-agent-integration) already points the chatbot UI at access-agent instead of UKY directly.

What's on the feature/access-agent-integration branch (qa-bot-core)

This branch rewires the chatbot UI to talk to access-agent instead of UKY:

  • Netlify reverse proxy (netlify.toml): /api/* → https://access-agent.elytra.net/api/:splat — no CORS needed; the UI just hits /api/v1/query as a relative path.
  • Request body (qa-flow.tsx): Sends session_id and question_id in the POST body (access-agent's expected format).
  • Response handling (qa-flow.tsx): Reads body.response (access-agent's field) alongside the legacy body.answer/body.text fields.
  • Metadata display (qa-flow.tsx): buildMetadataText() renders confidence, tools_used, and metadata.agent behind a showMetadata toggle — useful for debugging during Project A.

This defines the response contract access-agent must honor: { response, confidence, tools_used, metadata }. Any changes to rag_answer (Project A.2) or the feedback protocol (Project B.1) must stay compatible with this shape.
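As a sanity check, the contract can be expressed as a typed shape plus a minimal compatibility test — a sketch based only on the field names above, not code from either repo:

```python
# Illustrative shape of the access-agent response that qa-bot-core consumes.
from typing import Any, TypedDict

class AgentResponse(TypedDict):
    response: str
    confidence: float
    tools_used: list[str]
    metadata: dict[str, Any]

REQUIRED = ("response", "confidence", "tools_used", "metadata")

def is_contract_compatible(body: dict) -> bool:
    # A quick guard any rag_answer or feedback change should keep passing.
    return all(key in body for key in REQUIRED)
```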


Joe built the access-qa-extraction pipeline (feat/two-shot branch) that generates Q&A pairs from MCP server data via a two-shot LLM process (battery pass + discovery pass), scores them with a judge LLM, and pushes them to Argilla for human review.

The missing link: Getting curated Q&A pairs from Argilla into the pgvector RAG service, and proving this approach outperforms document RAG.

access-agent graph

See SYSTEM_OVERVIEW.md → "Agent Graph" for the full routing table and sequence diagrams.


Project A: Compare Performance of Document RAG and QA-Pair RAG

Goal: Run both RAG backends against the same questions and measure which produces better answers — or where each has strengths and weaknesses.

Why first: We need data on how these two approaches compare before making further investment decisions.

A.1 — Build the Argilla → pgvector sync pipeline ✅

  • Build the pipeline to sync approved Q&A pairs from Argilla into access-qa-service (pgvector)
  • This is real production infrastructure — not a one-off test load
  • The planning repo (03-review-system.md) describes a webhook-driven sync: Argilla record.completed → generate embedding → upsert in pgvector
  • Verify pairs are searchable via the /search endpoint

Result: 83 records synced across 5 domains. Commit 5b57ae0 on access-qa-service/main.
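The sync step can be sketched as a pure function from an approved Argilla record to a pgvector row. The record/field names and the `embed` callable are assumptions, not the actual webhook payload schema:

```python
# Hypothetical shape of the record.completed -> upsert mapping described in
# 03-review-system.md. The real pipeline embeds with sentence-transformers.
def build_upsert_row(record: dict, embed) -> dict:
    """Turn an approved Argilla record into a pgvector upsert payload."""
    fields = record["fields"]
    return {
        "id": record["id"],
        "question": fields["question"],
        "answer": fields["answer"],
        # Embed the question text so /search can match user queries against it.
        "embedding": embed(fields["question"]),
        "domain": record.get("metadata", {}).get("domain", "unknown"),
    }
```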

A.2 — Dual-RAG logging in access-agent ✅

  • Feature branch on access-agent: modify the rag_answer node to query both UKY and pgvector for every question
  • Log both responses to PostgreSQL with metadata (latency, similarity scores, response text, which backend)
  • The user-facing answer can still come from one backend — the other is logged for comparison

Result: Branch feature/dual-rag-logging (commit caf7256). Gated behind DUAL_RAG_LOGGING=true env var. Parallel asyncio.gather queries both backends, logs to rag_comparison_logs table. 19 tests.

Where this lives in code:

  • access-agent/src/agent/nodes/rag_answer.py — dual-RAG path in _dual_rag_answer()
  • access-agent/src/rag_comparison_logger.py — comparison log model + logger
  • access-agent/src/config.py — DUAL_RAG_LOGGING flag
  • access-agent/src/services/qa_client.py — pgvector client
  • access-agent/src/services/uky_client.py — UKY client
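The dual-query pattern can be sketched as follows; the client callables and the log sink are stand-ins for the real services/ clients and the rag_comparison_logs writer, not the branch's actual code:

```python
# Minimal sketch of A.2: query both backends concurrently, log both
# responses with latency, serve the answer from one backend.
import asyncio
import time

async def dual_rag_answer(question: str, uky_query, pg_query, log) -> str:
    start = time.monotonic()
    # Both backends queried in parallel, as in the feature branch.
    uky, pg = await asyncio.gather(uky_query(question), pg_query(question))
    log({
        "question": question,
        "uky_response": uky,
        "pgvector_response": pg,
        "latency_s": time.monotonic() - start,
    })
    return pg  # user-facing answer comes from one backend; the other is logged only
```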

A.3 — Run the comparison and evaluate ✅

Result: 5 runs completed. Runs 1–4 compared RAG-vs-RAG. Run 5 (2026-03-11) was the first full-system comparison: pgvector + MCP tools + LangGraph routing vs UKY baseline.

  • Review Q&A pairs in Argilla to identify which entities/domains have coverage
  • Write test questions (41 total: 25 pgvector-targeted, 8 UKY-targeted, 8 edge cases)
  • Run all questions through agent with DUAL_RAG_LOGGING=true
  • Export comparison logs and build interactive HTML comparison
  • Compare UKY vs pgvector: accuracy, completeness, coverage
  • Identify patterns: pgvector excels on curated entities (96%), UKY covers general how-to topics pgvector has zero pairs for
  • Document findings — see DEV_JOURNAL.md "A.3 Run 3" and "A.3 post-mortem"
  • Run 5: Full-system test (pgvector + MCP + routing). 24 via RAG, 5 via MCP tools, 12 LLM-only. MCP fills Ranch gap, q41 answered for the first time. 12 cross-cutting questions fall to ungrounded LLM synthesis. See DEV_JOURNAL.md "2026-03-11" entry.
  • Build Run 5 comparison HTML — ~/.agent/diagrams/a3-run5-comparison.html

Key findings:

  • RAG-vs-RAG (Runs 1–4): Gap is coverage (content type), not quality. pgvector needs Q&A pairs for general ACCESS topics. This motivated Project C.
  • Full-system (Run 5): MCP tools fill resource gaps (Ranch, project search). 12 questions fall to ungrounded LLM — need curated cross-cutting Q&A pairs and/or better observability to understand routing.

Done: node_trace added to agent graph (commits 04342c8, b7a9bec). Each node appends a structured trace dict to AgentState.node_trace via operator.add reducer. API response includes trace data when ?include_trace=true is passed. OTel/Honeycomb handles ops tracing separately.
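The reducer mechanism can be sketched like this — state fields other than node_trace are illustrative, not the actual AgentState:

```python
# Sketch of node_trace accumulation: the field is declared with an
# operator.add reducer, so each node returns a one-element list that
# LangGraph appends to state rather than overwriting.
import operator
import time
from typing import Annotated, TypedDict

class AgentState(TypedDict):
    question: str
    node_trace: Annotated[list[dict], operator.add]

def classify_node(state: AgentState) -> dict:
    # Each node contributes one structured trace entry.
    return {"node_trace": [{"node": "classify", "ts": time.time()}]}
```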

Dependencies & Open Questions

  • Resolved: Running locally in Docker. UKY is remote, everything else local.
  • Resolved: The pgvector QA service is a separate repo/service: access-qa-service (cloned locally, 83 pairs synced).
  • Resolved: A qualitative comparison was sufficient — Run 3 gave clear signal without a formal evaluation framework.

Project B: Smarter Feedback Collection

Goal: Replace the always-on thumbs up/down with server-driven, context-aware feedback requests. Associate feedback with specific Q&A pairs and tool calls so it can flow back to Argilla for quality improvement.

Why: Current feedback is noisy — shown after every message, not tied to specific retrieval sources. With access-agent's richer response metadata (tools_used, RAG matches, confidence), we can collect feedback that actually improves the system.

B.1 — Design the feedback signal protocol

  • access-agent response already includes: response, confidence, tools_used, metadata
  • Add a feedback_request field to the response:
    {
      "response": "...",
      "feedback_request": {
        "enabled": true,
        "reason": "rag_answer",     // why we're asking
        "source_refs": ["mcp://compute-resources/resources/delta"],
        "rag_match_ids": ["qa-pair-uuid-123"],
        "confidence": 0.87
      }
    }
  • Logic for when to request feedback:
    • Always after RAG-sourced answers (these map to curated Q&A pairs)
    • After combined answers (RAG + tools)
    • Maybe skip for pure dynamic/tool-only answers (ephemeral data, less actionable feedback)
    • Consider confidence-based: ask more when confidence is borderline (0.7–0.85)
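Those rules collapse to a small gating function. The 0.7–0.85 borderline band comes from the bullet above; the source labels are illustrative:

```python
# Sketch of the "when to request feedback" logic in B.1.
def should_request_feedback(answer_source: str, confidence: float) -> bool:
    if answer_source in ("rag", "rag+tools"):
        return True  # RAG-sourced answers map back to curated Q&A pairs
    if answer_source == "tools":
        # Ephemeral/tool-only data: only ask when confidence is borderline.
        return 0.7 <= confidence <= 0.85
    return False
```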

B.2 — Update qa-bot-core to handle smart feedback

  • Read feedback_request from response
  • Only show thumbs up/down when feedback_request.enabled is true
  • When user gives thumbs down, send enriched feedback:
    {
      "session_id": "...",
      "question_id": "...",
      "rating": 0,
      "source_refs": ["mcp://compute-resources/resources/delta"],
      "rag_match_ids": ["qa-pair-uuid-123"],
      "reason": "incorrect"  // optional: incorrect, outdated, incomplete, other
    }
  • Planning docs (03-review-system.md) describe a richer feedback UI for thumbs-down: radio buttons for reason + optional comment

B.3 — Connect feedback to Argilla

  • When a thumbs-down arrives with rag_match_ids, look up the Q&A pair in Argilla
  • Create a record in the feedback-review dataset linking:
    • The original question asked
    • The response given
    • The Q&A pair(s) that sourced it
    • The user's feedback reason
    • The trace_id (for Honeycomb debugging)
  • This gives reviewers direct context: "This curated pair produced a bad answer for this question"

B.4 — Build the feedback API endpoint in access-agent

  • Currently, rating goes to UKY's endpoint (/access/chat/rating/)
  • Need a new endpoint in access-agent: POST /api/v1/feedback
  • This endpoint: validates, logs to PostgreSQL, pushes to Argilla, returns 200
  • Update qa-bot-core to send feedback to this new endpoint
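The endpoint's core steps (validate → log → push → 200) can be sketched framework-free; `store` and `argilla` are injected stand-ins for the PostgreSQL writer and the Argilla client, and the required-field list is an assumption based on B.2's payload:

```python
# Hypothetical core of POST /api/v1/feedback, with framework wiring omitted.
REQUIRED_FIELDS = ("session_id", "question_id", "rating")

def handle_feedback(payload: dict, store, argilla) -> tuple[int, dict]:
    missing = [f for f in REQUIRED_FIELDS if f not in payload]
    if missing:
        return 422, {"error": f"missing fields: {missing}"}
    store(payload)                    # log to PostgreSQL
    if payload.get("rag_match_ids"):  # only curated-pair feedback goes to review
        argilla(payload)
    return 200, {"status": "ok"}
```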

Dependencies & Open Questions

  • Depends on Project A (partially): The feedback system is designed for the QA-pair-backed endpoint. It makes sense to build once access-agent is the active backend.
  • Can start design work now: The protocol design (B.1) and qa-bot-core changes (B.2) don't need access-agent to be fully deployed
  • Question: Should feedback go to access-agent (which proxies to Argilla) or directly to Argilla? Architecture docs suggest through access-agent for consistency.
  • Question: Do we show the richer feedback form (reason + comment) from day one, or start with just thumbs up/down and iterate?

Project C: Document-Based Q&A Extraction Pipeline

Status: ACTIVE — C.1–C.3 complete. 807 document Q&A pairs generated and pushed to Argilla for review. Next: C.4 — sync approved pairs to pgvector and re-run A.3 comparison.

Goal: Extend access-qa-extraction to accept documents (PDFs, web pages) as input and generate Q&A pairs from them, complementing the existing MCP-data extraction.

C.1 — Categorize the document corpus ✅

  • Categorized all 75 files in rag_documents/CORPUS_INDEX.md
  • 20 NET-NEW (process/how-to), 22 USER GUIDE (deep), 17 MCP OVERLAP, 12 DATA FILE, 4 POINTER/EMPTY

C.2 — Build the DocumentExtractor ✅

  • Created parsers.py — PDF (PyMuPDF), docx (python-docx), txt/md parsing + chunking
  • Created DocumentExtractor(BaseExtractor) — reads files from disk, two-shot LLM pipeline, judge, incremental cache
  • Added "documents" domain to question_categories.py with 5 field groups
  • Added source parameter to QAPair.create() for "doc_generated" source type
  • Wired into CLI — qa-extract extract documents works end-to-end
  • Smoke-tested: produces Q&A pairs from docx and md files
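The chunking step before the two-shot LLM pass can be sketched as a simple overlapping-window splitter; the size and overlap values are assumptions, not the parameters parsers.py actually uses:

```python
# Illustrative chunker: fixed-size windows with overlap so Q&A-relevant
# context is not cut at chunk boundaries.
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    chunks = []
    step = size - overlap
    for start in range(0, max(len(text) - overlap, 1), step):
        chunks.append(text[start:start + size])
    return chunks
```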

C.3 — Run extraction on the corpus ✅

  • Run on 20 NET-NEW files (highest priority — fills the A.3 gap)
  • Review output quality, iterate on prompts if needed
  • Run on 22 USER GUIDE files
  • Run on data/ directory (Darwin, Delta, FASTER, Travel Rewards, software lists)
  • Push to Argilla for human review

Result: 825 Q&A pairs total (611 from staging/, 214 from data/). All pushed to Argilla qa-review dataset with document_name metadata for filtering. Added document_name field and fixed source_data to show file reference instead of misleading content preview (commit 8e9edd6).

C.4 — Load into pgvector and re-run A.3

  • Review document Q&A pairs in Argilla — filter by document_name, approve/reject/edit
  • Sync approved document Q&A pairs into pgvector alongside entity pairs
  • Re-run A.3 comparison — the decisive bake-off with expanded coverage
  • If pgvector matches UKY across the board, begin planning UKY retirement

Dependencies & Open Questions

  • Resolved: Project A confirmed the need — pgvector has 0% coverage on general how-to topics that documents cover.
  • Resolved: UKY document corpus obtained — 75 files in rag_documents/ (staging/ and data/ subdirectories). See DEV_JOURNAL.md "2026-03-06" entry for full inventory.
  • Resolved: PDF extraction library choice — PyMuPDF (fitz) for PDFs, python-docx for docx.

Project D: Self-Service Evaluation Harness

Status: Designed (2026-03-10). Full design in EVAL_HARNESS_PLAN.md. Not yet implemented.

Goal: Build a reusable evaluation pipeline so team members can run golden-question bake-offs against different agent configurations and compare results visually — without needing Joe in the loop.

Why: Andrew wants an ongoing process for evaluating agent quality as the system evolves. The current A.3 bake-off was manual (one person, one script, manual HTML generation). The team needs a shared, self-service tool.

D.1 — Golden questions + scenario configs

  • Create eval/golden_questions.yaml — merge 41 questions from A3_TEST_QUESTIONS.md + 35 from tests/e2e_test_cases.csv, deduplicate (~55 questions)
  • Each question: id, text, category, tags, expected assertions (query_type, keywords, tools, answer length)
  • Create scenario YAML configs: baseline, strict_rag, loose_rag, rag_only
  • Scenarios override Settings env vars (RAG thresholds, model, MCP server subsets)
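A hypothetical entry shape for eval/golden_questions.yaml, based only on the bullets above — field names are an assumption, not the final schema:

```yaml
# Sketch of one golden question; assertion keys are illustrative.
- id: q07
  text: "How do I request an allocation on Delta?"
  category: allocations
  tags: [how-to, net-new]
  expect:
    query_type: rag
    keywords: [allocation, Delta]
    tools: []
    min_answer_chars: 200
```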

D.2 — Web UI (static Netlify app)

  • Build eval-ui/ — static web app deployed to Netlify (no backend, no database)
  • Question list view with scenario picker, API key input, run button
  • Calls agent endpoint directly, displays pass/fail per question inline
  • Expandable per-question detail (answer, tools used, assertion results)
  • JSON export/import for comparing runs
  • Visual style matches existing a3-run3-comparison.html

D.3 — Resolve scenario + auth mechanics

  • Decide how scenarios change agent behavior (config-override endpoint, separate deployments, or labels only)
  • Decide API key routing (shared project key on agent vs pass-through)

D.4 — Team onboarding

  • Deploy to Netlify, share URL with team
  • No local setup needed — just a browser

Dependencies & Open Questions

  • Depends on C.4: The harness is most useful once the Q&A bank is comprehensive (entity + document pairs).
  • Can start D.1 now: Golden questions and scenario configs don't require code changes.
  • Question: Should golden questions include science-domain questions from actual researchers? Andrew suggested getting input from "science people."

Recommended Sequence

✅ Weeks 1-3: Project A (complete)
├── A.1: Argilla → pgvector sync pipeline ✅
├── A.2: Dual-RAG logging in access-agent ✅
└── A.3: Run comparison, evaluate results ✅

Current: Project C (document extraction)
├── C.1: Categorize the 75 documents ✅
├── C.2: Build document extractor ✅
├── C.3: Run extraction on corpus, push to Argilla ✅
├── C.4: Load into pgvector alongside entity pairs
└── Re-run A.3 with expanded Q&A bank

Next: Projects B + D (can run in parallel)

Project B (feedback)
├── B.1: Design feedback protocol
├── B.2: Update qa-bot-core
├── B.3: Connect feedback to Argilla
└── B.4: Build feedback endpoint in access-agent

Project D (evaluation harness) — design in EVAL_HARNESS_PLAN.md
├── D.1: Golden questions YAML + scenario configs
├── D.2: Web UI (static Netlify app)
├── D.3: Resolve scenario + auth mechanics
└── D.4: Team onboarding / deploy

Rationale:

  • A first because it validates the approach — done, pgvector wins on quality, loses on coverage
  • C next because it closes the coverage gap that A.3 identified, potentially letting pgvector replace UKY entirely
  • B after because smarter feedback closes the quality loop once the Q&A bank is comprehensive
  • D parallel with B because the eval harness is independent infrastructure — Andrew wants this for ongoing evaluation as the agent evolves

Infrastructure Checklist

  • Can Joe SSH into Linode (45.79.215.140) where access-agent runs?
  • Can Joe run access-agent locally via Docker Compose? ← A.3 plan: yes, both repos have compose files
  • Is the pgvector QA service deployed alongside access-agent? ← Cloned locally, 83 pairs synced (A.1)
  • Does Joe have Argilla credentials for the review instance? ← Credentials in access-argilla repo env file
  • Is Honeycomb accessible for observability during testing?
  • Does the Netlify proxy (qa-bot-core feature branch) work for local testing? ← Bypassing: using direct curl for A.3

Key File References

What — Where

  • Agent graph + routing — access-agent/src/agent/graph.py
  • RAG retrieval node — access-agent/src/agent/nodes/rag_answer.py
  • Query classifier — access-agent/src/agent/nodes/classify.py
  • pgvector QA client — access-agent/src/services/qa_client.py
  • UKY RAG client — access-agent/src/services/uky_client.py
  • QA Service (pgvector RAG) — access-qa-service/ (separate repo, port 8001)
  • Agent state (incl. node_trace) — access-agent/src/agent/state.py
  • API routes (incl. ?include_trace) — access-agent/src/api/routes.py
  • Usage logging — access-agent/src/usage_logger.py
  • RAG comparison logging — access-agent/src/rag_comparison_logger.py
  • RAG comparison tests — access-agent/tests/test_rag_answer.py
  • QA Bot chat flow — qa-bot-core/src/utils/flows/qa-flow.tsx
  • QA Bot → access-agent proxy — qa-bot-core/netlify.toml (Netlify reverse proxy config)
  • QA Bot feedback — qa-bot-core/src/utils/flows/qa-flow.tsx:133-176
  • Extraction pipeline — access-qa-extraction/src/access_qa_extraction/extractors/
  • QA pair model — access-qa-extraction/src/access_qa_extraction/models.py
  • Argilla sync client — access-qa-extraction/src/access_qa_extraction/argilla_client.py
  • Review system design — access-qa-planning/03-review-system.md
  • Agent architecture — access-qa-planning/01-agent-architecture.md
  • Observability plan — access-qa-planning/08-observability.md
  • Feedback design — access-qa-planning/03-review-system.md (feedback section)