Owner: Joe Bacal · Collaborator: Andrew Pasquale · Last updated: 2026-03-11
Status: A.3 Run 5 complete — first full-system test (pgvector + MCP + routing vs UKY). C.1–C.3 complete. Node tracing added to agent graph (gated behind `?include_trace=true`).
Next: curate cross-cutting Q&A pairs for the 12 LLM-only questions.
The ACCESS-CI AI system has two RAG backends:
- UKY Document RAG — the existing system at `access-ai-grace1-external.ccs.uky.edu`. Answers are generated from canonical documents. This is what the production QA Bot uses today.
- pgvector Q&A-Pair RAG — a separate service, `access-qa-service` (repo, port 8001). Stores human-curated Q&A pairs with semantic search via sentence-transformers + HNSW indexing. Called by `access-agent` over HTTP. Currently a fallback only.
Andrew built access-agent (LangGraph orchestrator on Linode at access-agent.elytra.net) which classifies queries and routes them to the appropriate backend. His qa-bot-core feature branch (feature/access-agent-integration) already points the chatbot UI at access-agent instead of UKY directly.
This branch rewires the chatbot UI to talk to access-agent instead of UKY:
- Netlify reverse proxy (`netlify.toml`): `/api/*` → `https://access-agent.elytra.net/api/:splat` — no CORS needed; the UI just hits `/api/v1/query` as a relative path.
- Request body (`qa-flow.tsx`): sends `session_id` and `question_id` in the POST body (access-agent's expected format).
- Response handling (`qa-flow.tsx`): reads `body.response` (access-agent's field) alongside the legacy `body.answer` / `body.text` fields.
- Metadata display (`qa-flow.tsx`): `buildMetadataText()` renders `confidence`, `tools_used`, and `metadata.agent` behind a `showMetadata` toggle — useful for debugging during Project A.
This defines the response contract access-agent must honor: `{ response, confidence, tools_used, metadata }`. Any changes to `rag_answer` (Project A.2) or the feedback protocol (Project B.1) must stay compatible with this shape.
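The contract can be pinned down as a Python `TypedDict` for reference. This is a sketch: the four field names come from the contract above, but the value types and the example payload contents are assumptions, not taken from the actual code.

```python
from typing import Any, TypedDict


class AgentResponse(TypedDict):
    """Response shape the qa-bot-core UI expects from access-agent.

    Field names are from the documented contract; types are assumed.
    """

    response: str             # the answer text rendered in the chat UI
    confidence: float         # answer confidence, e.g. 0.87
    tools_used: list[str]     # MCP tools invoked while answering
    metadata: dict[str, Any]  # extra context, e.g. {"agent": "rag_answer"}


# A payload conforming to the contract (contents illustrative):
example: AgentResponse = {
    "response": "Delta is an ACCESS GPU resource...",
    "confidence": 0.87,
    "tools_used": [],
    "metadata": {"agent": "rag_answer"},
}
```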
Joe built the access-qa-extraction pipeline (feat/two-shot branch) that generates Q&A pairs from MCP server data via a two-shot LLM process (battery pass + discovery pass), scores them with a judge LLM, and pushes them to Argilla for human review.
The missing link: Getting curated Q&A pairs from Argilla into the pgvector RAG service, and proving this approach outperforms document RAG.
See SYSTEM_OVERVIEW.md → "Agent Graph" for the full routing table and sequence diagrams.
Goal: Run both RAG backends against the same questions and measure which produces better answers — or where each has strengths and weaknesses.
Why first: We need data on how these two approaches compare before making further investment decisions.
- Build the pipeline to sync approved Q&A pairs from Argilla into `access-qa-service` (pgvector)
- This is real production infrastructure — not a one-off test load
- The planning repo (03-review-system.md) describes a webhook-driven sync: Argilla `record.completed` → generate embedding → upsert in pgvector
- Verify pairs are searchable via the `/search` endpoint
Result: 83 records synced across 5 domains. Commit 5b57ae0 on access-qa-service/main.
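The webhook flow described in 03-review-system.md (`record.completed` → embed → upsert) reduces to a small piece of plumbing. The sketch below injects the embedding model and the database write as callables, so it runs with stubs; `embed`, `upsert`, and the record fields are hypothetical stand-ins, not the actual access-qa-service API.

```python
from typing import Callable, Sequence


def sync_approved_record(
    record: dict,
    embed: Callable[[str], Sequence[float]],
    upsert: Callable[[dict], None],
) -> bool:
    """Handle one Argilla `record.completed` event.

    In the real pipeline `embed` would be a sentence-transformers model and
    `upsert` an INSERT ... ON CONFLICT against the pgvector table; both names
    are illustrative. Returns True if the record was written to the index.
    """
    # Only approved pairs make it into the RAG index.
    if record.get("status") != "approved":
        return False
    # Embed the question text; the answer rides along as payload.
    vector = embed(record["question"])
    upsert({
        "id": record["id"],
        "question": record["question"],
        "answer": record["answer"],
        "embedding": list(vector),
    })
    return True


# Usage with stub callables standing in for the model and the database:
stored: list[dict] = []
synced = sync_approved_record(
    {"id": "qa-1", "status": "approved", "question": "What is Delta?", "answer": "..."},
    embed=lambda text: [0.1, 0.2, 0.3],
    upsert=stored.append,
)
```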
- Feature branch on `access-agent`: modify the `rag_answer` node to query both UKY and pgvector for every question
- Log both responses to PostgreSQL with metadata (latency, similarity scores, response text, which backend)
- The user-facing answer can still come from one backend — the other is logged for comparison
Result: Branch `feature/dual-rag-logging` (commit caf7256). Gated behind the `DUAL_RAG_LOGGING=true` env var. A parallel `asyncio.gather` queries both backends and logs to the `rag_comparison_logs` table. 19 tests.
Where this lives in code:
- `access-agent/src/agent/nodes/rag_answer.py` — dual-RAG path in `_dual_rag_answer()`
- `access-agent/src/rag_comparison_logger.py` — comparison log model + logger
- `access-agent/src/config.py` — `DUAL_RAG_LOGGING` flag
- `access-agent/src/services/qa_client.py` — pgvector client
- `access-agent/src/services/uky_client.py` — UKY client
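The dual-query pattern behind `_dual_rag_answer()` can be sketched as follows. This is not the actual module code: the function signature, the `timed` wrapper, and the stub backends are illustrative, and only the use of `asyncio.gather` with per-backend logging comes from the description above.

```python
import asyncio
import time


async def dual_rag_query(question: str, query_uky, query_pgvector) -> list:
    """Query both RAG backends concurrently, capturing per-backend latency.

    `query_uky` / `query_pgvector` stand in for the real clients under
    src/services/. return_exceptions=True means one backend failing does
    not sink the other; the exception lands in the results list instead.
    """

    async def timed(name: str, call):
        start = time.perf_counter()
        result = await call(question)
        return {
            "backend": name,
            "latency_s": time.perf_counter() - start,
            "response": result,
        }

    return await asyncio.gather(
        timed("uky", query_uky),
        timed("pgvector", query_pgvector),
        return_exceptions=True,
    )


# Usage with stub backends (gather preserves argument order):
async def fake_uky(q: str) -> str:
    return f"UKY answer to {q!r}"

async def fake_pg(q: str) -> str:
    return f"pgvector answer to {q!r}"

logs = asyncio.run(dual_rag_query("What is Delta?", fake_uky, fake_pg))
```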
Result: 5 runs completed. Runs 1–4 compared RAG-vs-RAG. Run 5 (2026-03-11) was the first full-system comparison: pgvector + MCP tools + LangGraph routing vs UKY baseline.
- Review Q&A pairs in Argilla to identify which entities/domains have coverage
- Write test questions (41 total: 25 pgvector-targeted, 8 UKY-targeted, 8 edge cases)
- Run all questions through the agent with `DUAL_RAG_LOGGING=true`
- Export comparison logs and build an interactive HTML comparison
- Compare UKY vs pgvector: accuracy, completeness, coverage
- Identify patterns: pgvector excels on curated entities (96%); UKY covers general how-to topics for which pgvector has zero pairs
- Document findings — see `DEV_JOURNAL.md` "A.3 Run 3" and "A.3 post-mortem"
- Run 5: full-system test (pgvector + MCP + routing). 24 questions answered via RAG, 5 via MCP tools, 12 LLM-only. MCP fills the Ranch gap; q41 answered for the first time. The 12 cross-cutting questions fall to ungrounded LLM synthesis. See the `DEV_JOURNAL.md` "2026-03-11" entry.
- Build Run 5 comparison HTML — `~/.agent/diagrams/a3-run5-comparison.html`
Key findings:
- RAG-vs-RAG (Runs 1–4): Gap is coverage (content type), not quality. pgvector needs Q&A pairs for general ACCESS topics. This motivated Project C.
- Full-system (Run 5): MCP tools fill resource gaps (Ranch, project search). 12 questions fall to ungrounded LLM — need curated cross-cutting Q&A pairs and/or better observability to understand routing.
Done: node_trace added to agent graph (commits 04342c8, b7a9bec). Each node appends a structured trace dict to AgentState.node_trace via operator.add reducer. API response includes trace data when ?include_trace=true is passed. OTel/Honeycomb handles ops tracing separately.
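The `operator.add` reducer is standard LangGraph state merging: annotating a state field with a reducer tells the graph to combine each node's partial update with the existing value instead of replacing it. A minimal sketch (the node name and trace keys are illustrative; only `node_trace` and the reducer come from the commits above):

```python
import operator
from typing import Annotated, TypedDict


class AgentState(TypedDict):
    # The Annotated reducer tells LangGraph to concatenate each node's
    # returned list onto the running trace rather than overwrite it.
    node_trace: Annotated[list[dict], operator.add]


def classify_node(state: AgentState) -> dict:
    """Each node returns a one-element list; the reducer appends it."""
    return {"node_trace": [{"node": "classify", "query_type": "resource"}]}


# What the framework effectively does when applying a node's update:
state: AgentState = {"node_trace": [{"node": "entry"}]}
update = classify_node(state)
merged = operator.add(state["node_trace"], update["node_trace"])
```

With this in place, the API handler only has to copy `state["node_trace"]` into the response body when `?include_trace=true` is set.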
- Resolved: Running locally in Docker. UKY is remote, everything else local.
- Resolved: The pgvector QA service is a separate repo/service: `access-qa-service` (cloned locally, 83 pairs synced).
- Resolved: A qualitative comparison was sufficient — Run 3 gave clear signal without a formal evaluation framework.
Goal: Replace the always-on thumbs up/down with server-driven, context-aware feedback requests. Associate feedback with specific Q&A pairs and tool calls so it can flow back to Argilla for quality improvement.
Why: Current feedback is noisy — shown after every message, not tied to specific retrieval sources. With access-agent's richer response metadata (tools_used, RAG matches, confidence), we can collect feedback that actually improves the system.
- access-agent responses already include: `response`, `confidence`, `tools_used`, `metadata`
- Add a `feedback_request` field to the response:

```json
{
  "response": "...",
  "feedback_request": {
    "enabled": true,
    "reason": "rag_answer",
    "source_refs": ["mcp://compute-resources/resources/delta"],
    "rag_match_ids": ["qa-pair-uuid-123"],
    "confidence": 0.87
  }
}
```

(`reason` records why the agent is asking for feedback.)

- Logic for when to request feedback:
- Always after RAG-sourced answers (these map to curated Q&A pairs)
- After combined answers (RAG + tools)
- Maybe skip for pure dynamic/tool-only answers (ephemeral data, less actionable feedback)
- Consider confidence-based: ask more when confidence is borderline (0.7–0.85)
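The rules above reduce to a small predicate. This is one possible encoding, not a settled design: the `reason` strings and the decision to gate tool-only answers on borderline confidence are assumptions drawn from the bullets, and the thresholds are the draft 0.7–0.85 band.

```python
def should_request_feedback(reason: str, confidence: float) -> bool:
    """Decide whether a response should carry a feedback_request.

    Draft rules: always ask after RAG-sourced and combined answers,
    and for tool-only answers ask only when confidence is borderline
    (0.7-0.85), where feedback is most informative. Reason strings
    are hypothetical labels, not the agent's actual values.
    """
    if reason in ("rag_answer", "rag_plus_tools"):
        return True
    if reason == "tool_only":
        # Ephemeral data: skip unless the answer itself was shaky.
        return 0.7 <= confidence <= 0.85
    return False
```

Usage: `should_request_feedback("rag_answer", 0.95)` returns `True`, while a confident tool-only answer (`"tool_only"`, 0.95) returns `False`.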
- Read `feedback_request` from the response
- Only show thumbs up/down when `feedback_request.enabled` is true
- When the user gives a thumbs down, send enriched feedback:

```json
{
  "session_id": "...",
  "question_id": "...",
  "rating": 0,
  "source_refs": ["mcp://compute-resources/resources/delta"],
  "rag_match_ids": ["qa-pair-uuid-123"],
  "reason": "incorrect"
}
```

(`reason` is optional: incorrect, outdated, incomplete, other.)

- Planning docs (03-review-system.md) describe a richer feedback UI for thumbs-down: radio buttons for reason + optional comment
- When a thumbs-down arrives with `rag_match_ids`, look up the Q&A pair in Argilla
- Create a record in the `feedback-review` dataset linking:
  - The original question asked
  - The response given
  - The Q&A pair(s) that sourced it
  - The user's feedback reason
  - The trace_id (for Honeycomb debugging)
- This gives reviewers direct context: "This curated pair produced a bad answer for this question"
- Currently, rating goes to UKY's endpoint (`/access/chat/rating/`)
- Need a new endpoint in access-agent: `POST /api/v1/feedback`
POST /api/v1/feedback - This endpoint: validates, logs to PostgreSQL, pushes to Argilla, returns 200
- Update qa-bot-core to send feedback to this new endpoint
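The "validates" step of B.4 might look like the sketch below, framework-free so it runs anywhere. The required/optional field split and the reason vocabulary come from the enriched payload above; everything else (function name, error-list shape) is illustrative, and the PostgreSQL and Argilla steps are deliberately omitted.

```python
VALID_REASONS = {"incorrect", "outdated", "incomplete", "other"}


def validate_feedback(payload: dict) -> list[str]:
    """Validate a POST /api/v1/feedback body; return a list of errors.

    An empty list means the payload is acceptable, after which the real
    endpoint would log to PostgreSQL, push to Argilla, and return 200.
    """
    errors = []
    # session_id / question_id tie feedback to a specific exchange.
    for field in ("session_id", "question_id", "rating"):
        if field not in payload:
            errors.append(f"missing field: {field}")
    if "rating" in payload and payload["rating"] not in (0, 1):
        errors.append("rating must be 0 (down) or 1 (up)")
    # reason is optional and only meaningful on thumbs-down.
    reason = payload.get("reason")
    if reason is not None and reason not in VALID_REASONS:
        errors.append(f"unknown reason: {reason}")
    return errors
```

Keeping validation as a pure function also makes B.4 easy to unit-test before any HTTP wiring exists.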
- Depends on Project A (partially): The feedback system is designed for the QA-pair-backed endpoint. It makes sense to build once access-agent is the active backend.
- Can start design work now: The protocol design (B.1) and qa-bot-core changes (B.2) don't need access-agent to be fully deployed
- Question: Should feedback go to access-agent (which proxies to Argilla) or directly to Argilla? Architecture docs suggest through access-agent for consistency.
- Question: Do we show the richer feedback form (reason + comment) from day one, or start with just thumbs up/down and iterate?
Status: ACTIVE — C.1–C.3 complete. 807 document Q&A pairs generated and pushed to Argilla for review. Next: C.4 — sync approved pairs to pgvector and re-run A.3 comparison.
Goal: Extend access-qa-extraction to accept documents (PDFs, web pages) as input and generate Q&A pairs from them, complementing the existing MCP-data extraction.
- Categorized all 75 files in `rag_documents/` → `CORPUS_INDEX.md`
- 20 NET-NEW (process/how-to), 22 USER GUIDE (deep), 17 MCP OVERLAP, 12 DATA FILE, 4 POINTER/EMPTY
- Created `parsers.py` — PDF (PyMuPDF), docx (python-docx), txt/md parsing + chunking
- Created `DocumentExtractor(BaseExtractor)` — reads files from disk, two-shot LLM pipeline, judge, incremental cache
- Added a `"documents"` domain to `question_categories.py` with 5 field groups
- Added a `source` parameter to `QAPair.create()` for the `"doc_generated"` source type
- Wired into the CLI — `qa-extract extract documents` works end-to-end
- Smoke-tested: produces Q&A pairs from docx and md files
- Run on 20 NET-NEW files (highest priority — fills the A.3 gap)
- Review output quality, iterate on prompts if needed
- Run on 22 USER GUIDE files
- Run on data/ directory (Darwin, Delta, FASTER, Travel Rewards, software lists)
- Push to Argilla for human review
Result: 825 Q&A pairs total (611 from staging/, 214 from data/). All pushed to Argilla qa-review dataset with document_name metadata for filtering. Added document_name field and fixed source_data to show file reference instead of misleading content preview (commit 8e9edd6).
- Review document Q&A pairs in Argilla — filter by `document_name`, approve/reject/edit
- Sync approved document Q&A pairs into pgvector alongside entity pairs
- Re-run A.3 comparison — the decisive bake-off with expanded coverage
- If pgvector matches UKY across the board, begin planning UKY retirement
- Resolved: Project A confirmed the need — pgvector has 0% coverage on general how-to topics that documents cover.
- Resolved: UKY document corpus obtained — 75 files in `rag_documents/` (staging/ and data/ subdirectories). See the `DEV_JOURNAL.md` "2026-03-06" entry for the full inventory.
- Resolved: PDF extraction library choice — PyMuPDF (fitz) for PDFs, python-docx for docx.
Status: Designed (2026-03-10). Full design in `EVAL_HARNESS_PLAN.md`. Not yet implemented.
Goal: Build a reusable evaluation pipeline so team members can run golden-question bake-offs against different agent configurations and compare results visually — without needing Joe in the loop.
Why: Andrew wants an ongoing process for evaluating agent quality as the system evolves. The current A.3 bake-off was manual (one person, one script, manual HTML generation). The team needs a shared, self-service tool.
- Create `eval/golden_questions.yaml` — merge 41 questions from `A3_TEST_QUESTIONS.md` + 35 from `tests/e2e_test_cases.csv`, deduplicate (~55 questions)
- Each question: id, text, category, tags, expected assertions (query_type, keywords, tools, answer length)
- Create scenario YAML configs: `baseline`, `strict_rag`, `loose_rag`, `rag_only`
- Scenarios override `Settings` env vars (RAG thresholds, model, MCP server subsets)
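One possible shape for the two D.1 files, included only to make the plan concrete. Every key name below is a guess at the schema (D.1 is not yet implemented), and the override variable names are hypothetical stand-ins for whatever `Settings` actually exposes.

```yaml
# eval/golden_questions.yaml — sketch of one entry (schema not final)
questions:
  - id: q01
    text: "How do I get an allocation on Delta?"
    category: allocations
    tags: [how-to, delta]
    assertions:
      query_type: rag          # expected classifier outcome
      keywords: [allocation, Delta]
      min_answer_chars: 200

# eval/scenarios/strict_rag.yaml — env-var overrides applied to Settings
name: strict_rag
overrides:
  RAG_SIMILARITY_THRESHOLD: "0.85"   # variable names hypothetical
  MCP_SERVERS: "compute-resources"
```

Keeping assertions declarative like this lets the D.2 web UI evaluate pass/fail client-side without any backend.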
- Build `eval-ui/` — static web app deployed to Netlify (no backend, no database)
- Question list view with scenario picker, API key input, run button
- Calls agent endpoint directly, displays pass/fail per question inline
- Expandable per-question detail (answer, tools used, assertion results)
- JSON export/import for comparing runs
- Visual style matches the existing `a3-run3-comparison.html`
- Decide how scenarios change agent behavior (config-override endpoint, separate deployments, or labels only)
- Decide API key routing (shared project key on agent vs pass-through)
- Deploy to Netlify, share URL with team
- No local setup needed — just a browser
- Depends on C.4: The harness is most useful once the Q&A bank is comprehensive (entity + document pairs).
- Can start D.1 now: Golden questions and scenario configs don't require code changes.
- Question: Should golden questions include science-domain questions from actual researchers? Andrew suggested getting input from "science people."
✅ Weeks 1-3: Project A (complete)
├── A.1: Argilla → pgvector sync pipeline ✅
├── A.2: Dual-RAG logging in access-agent ✅
└── A.3: Run comparison, evaluate results ✅
Current: Project C (document extraction)
├── C.1: Categorize the 75 documents ✅
├── C.2: Build document extractor ✅
├── C.3: Run extraction on corpus, push to Argilla ✅
├── C.4: Load into pgvector alongside entity pairs
└── Re-run A.3 with expanded Q&A bank
Next: Projects B + D (can run in parallel)
Project B (feedback)
├── B.1: Design feedback protocol
├── B.2: Update qa-bot-core
├── B.3: Connect feedback to Argilla
└── B.4: Build feedback endpoint in access-agent
Project D (evaluation harness) — design in EVAL_HARNESS_PLAN.md
├── D.1: Golden questions YAML + scenario configs
├── D.2: Web UI (static Netlify app)
├── D.3: Resolve scenario + auth mechanics
└── D.4: Team onboarding / deploy
Rationale:
- A first because it validates the approach — done, pgvector wins on quality, loses on coverage
- C next because it closes the coverage gap that A.3 identified, potentially letting pgvector replace UKY entirely
- B after because smarter feedback closes the quality loop once the Q&A bank is comprehensive
- D parallel with B because the eval harness is independent infrastructure — Andrew wants this for ongoing evaluation as the agent evolves
- Can Joe SSH into Linode (45.79.215.140) where access-agent runs?
- Can Joe run access-agent locally via Docker Compose? ← A.3 plan: yes, both repos have compose files
- Is the pgvector QA service deployed alongside access-agent? ← Cloned locally, 83 pairs synced (A.1)
- Does Joe have Argilla credentials for the review instance? ← Credentials in the `access-argilla` repo env file
- Is Honeycomb accessible for observability during testing?
- Does the Netlify proxy (qa-bot-core feature branch) work for local testing? ← Bypassing: using direct curl for A.3
| What | Where |
|---|---|
| Agent graph + routing | access-agent/src/agent/graph.py |
| RAG retrieval node | access-agent/src/agent/nodes/rag_answer.py |
| Query classifier | access-agent/src/agent/nodes/classify.py |
| pgvector QA client | access-agent/src/services/qa_client.py |
| UKY RAG client | access-agent/src/services/uky_client.py |
| QA Service (pgvector RAG) | access-qa-service/ (separate repo, port 8001) |
| QA Service client | access-agent/src/services/qa_client.py |
| Agent state (incl. node_trace) | access-agent/src/agent/state.py |
| API routes (incl. ?include_trace) | access-agent/src/api/routes.py |
| Usage logging | access-agent/src/usage_logger.py |
| RAG comparison logging | access-agent/src/rag_comparison_logger.py |
| RAG comparison tests | access-agent/tests/test_rag_answer.py |
| QA Bot chat flow | qa-bot-core/src/utils/flows/qa-flow.tsx |
| QA Bot → access-agent proxy | qa-bot-core/netlify.toml (Netlify reverse proxy config) |
| QA Bot feedback | qa-bot-core/src/utils/flows/qa-flow.tsx:133-176 |
| Extraction pipeline | access-qa-extraction/src/access_qa_extraction/extractors/ |
| QA pair model | access-qa-extraction/src/access_qa_extraction/models.py |
| Argilla sync client | access-qa-extraction/src/access_qa_extraction/argilla_client.py |
| Review system design | access-qa-planning/03-review-system.md |
| Agent architecture | access-qa-planning/01-agent-architecture.md |
| Observability plan | access-qa-planning/08-observability.md |
| Feedback design | access-qa-planning/03-review-system.md (feedback section) |