Owner: Joe Bacal · Collaborator: Andrew Pasquale · Last updated: 2026-03-11
Status: A.3 Run 5 complete — first full-system test (pgvector + MCP + routing vs UKY). C.1–C.3 complete. Node tracing added to agent graph (gated behind `?include_trace=true`).
Next: curate cross-cutting Q&A pairs for the 12 LLM-only questions.
The ACCESS-CI AI system has two RAG backends:
- UKY Document RAG — the existing system at `access-ai-grace1-external.ccs.uky.edu`. Answers are generated from canonical documents. This is what the production QA Bot uses today.
- pgvector Q&A-Pair RAG — a separate service, `access-qa-service` (repo, port 8001). Stores human-curated Q&A pairs with semantic search via sentence-transformers + HNSW indexing. Called by `access-agent` over HTTP. Currently a fallback only.
Andrew built access-agent (LangGraph orchestrator on Linode at access-agent.elytra.net) which classifies queries and routes them to the appropriate backend. His qa-bot-core feature branch (feature/access-agent-integration) already points the chatbot UI at access-agent instead of UKY directly.
This branch rewires the chatbot UI to talk to access-agent instead of UKY:
- Netlify reverse proxy (`netlify.toml`): `/api/*` → `https://access-agent.elytra.net/api/:splat` — no CORS needed; the UI just hits `/api/v1/query` as a relative path.
- Request body (`qa-flow.tsx`): sends `session_id` and `question_id` in the POST body (access-agent's expected format).
- Response handling (`qa-flow.tsx`): reads `body.response` (access-agent's field) alongside the legacy `body.answer` / `body.text` fields.
- Metadata display (`qa-flow.tsx`): `buildMetadataText()` renders `confidence`, `tools_used`, and `metadata.agent` behind a `showMetadata` toggle — useful for debugging during Project A.
This defines the response contract access-agent must honor: `{ response, confidence, tools_used, metadata }`. Any changes to `rag_answer` (Project A.2) or the feedback protocol (Project B.1) must stay compatible with this shape.
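The contract can be pinned down as a Python `TypedDict` for reference. This is a sketch: the four field names come from the contract above, but the value types and the example payload contents are assumptions, not taken from the actual code.

```python
from typing import Any, TypedDict


class AgentResponse(TypedDict):
    """Response shape the qa-bot-core UI expects from access-agent.

    Field names are from the documented contract; types are assumed.
    """

    response: str             # the answer text rendered in the chat UI
    confidence: float         # answer confidence, e.g. 0.87
    tools_used: list[str]     # MCP tools invoked while answering
    metadata: dict[str, Any]  # extra context, e.g. {"agent": "rag_answer"}


# A payload conforming to the contract (contents illustrative):
example: AgentResponse = {
    "response": "Delta is an ACCESS GPU resource...",
    "confidence": 0.87,
    "tools_used": [],
    "metadata": {"agent": "rag_answer"},
}
```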
Joe built the access-qa-extraction pipeline (feat/two-shot branch) that generates Q&A pairs from MCP server data via a two-shot LLM process (battery pass + discovery pass), scores them with a judge LLM, and pushes them to Argilla for human review.
The missing link: Getting curated Q&A pairs from Argilla into the pgvector RAG service, and proving this approach outperforms document RAG.
See SYSTEM_OVERVIEW.md → "Agent Graph" for the full routing table and sequence diagrams.
Goal: Run both RAG backends against the same questions and measure which produces better answers — or where each has strengths and weaknesses.
Why first: We need data on how these two approaches compare before making further investment decisions.
- Build the pipeline to sync approved Q&A pairs from Argilla into `access-qa-service` (pgvector)
- This is real production infrastructure — not a one-off test load
- The planning repo (03-review-system.md) describes a webhook-driven sync: Argilla `record.completed` → generate embedding → upsert in pgvector
- Verify pairs are searchable via the `/search` endpoint
Result: 83 records synced across 5 domains. Commit 5b57ae0 on access-qa-service/main.
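The webhook flow described in 03-review-system.md (`record.completed` → embed → upsert) reduces to a small piece of plumbing. The sketch below injects the embedding model and the database write as callables, so it runs with stubs; `embed`, `upsert`, and the record fields are hypothetical stand-ins, not the actual access-qa-service API.

```python
from typing import Callable, Sequence


def sync_approved_record(
    record: dict,
    embed: Callable[[str], Sequence[float]],
    upsert: Callable[[dict], None],
) -> bool:
    """Handle one Argilla `record.completed` event.

    In the real pipeline `embed` would be a sentence-transformers model and
    `upsert` an INSERT ... ON CONFLICT against the pgvector table; both names
    are illustrative. Returns True if the record was written to the index.
    """
    # Only approved pairs make it into the RAG index.
    if record.get("status") != "approved":
        return False
    # Embed the question text; the answer rides along as payload.
    vector = embed(record["question"])
    upsert({
        "id": record["id"],
        "question": record["question"],
        "answer": record["answer"],
        "embedding": list(vector),
    })
    return True


# Usage with stub callables standing in for the model and the database:
stored: list[dict] = []
synced = sync_approved_record(
    {"id": "qa-1", "status": "approved", "question": "What is Delta?", "answer": "..."},
    embed=lambda text: [0.1, 0.2, 0.3],
    upsert=stored.append,
)
```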
- Feature branch on `access-agent`: modify the `rag_answer` node to query both UKY and pgvector for every question
- Log both responses to PostgreSQL with metadata (latency, similarity scores, response text, which backend)
- The user-facing answer can still come from one backend — the other is logged for comparison
Result: Branch `feature/dual-rag-logging` (commit caf7256). Gated behind the `DUAL_RAG_LOGGING=true` env var. A parallel `asyncio.gather` queries both backends and logs to the `rag_comparison_logs` table. 19 tests.
Where this lives in code:
- `access-agent/src/agent/nodes/rag_answer.py` — dual-RAG path in `_dual_rag_answer()`
- `access-agent/src/rag_comparison_logger.py` — comparison log model + logger
- `access-agent/src/config.py` — `DUAL_RAG_LOGGING` flag
- `access-agent/src/services/qa_client.py` — pgvector client
- `access-agent/src/services/uky_client.py` — UKY client
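The dual-query pattern behind `_dual_rag_answer()` can be sketched as follows. This is not the actual module code: the function signature, the `timed` wrapper, and the stub backends are illustrative, and only the use of `asyncio.gather` with per-backend logging comes from the description above.

```python
import asyncio
import time


async def dual_rag_query(question: str, query_uky, query_pgvector) -> list:
    """Query both RAG backends concurrently, capturing per-backend latency.

    `query_uky` / `query_pgvector` stand in for the real clients under
    src/services/. return_exceptions=True means one backend failing does
    not sink the other; the exception lands in the results list instead.
    """

    async def timed(name: str, call):
        start = time.perf_counter()
        result = await call(question)
        return {
            "backend": name,
            "latency_s": time.perf_counter() - start,
            "response": result,
        }

    return await asyncio.gather(
        timed("uky", query_uky),
        timed("pgvector", query_pgvector),
        return_exceptions=True,
    )


# Usage with stub backends (gather preserves argument order):
async def fake_uky(q: str) -> str:
    return f"UKY answer to {q!r}"

async def fake_pg(q: str) -> str:
    return f"pgvector answer to {q!r}"

logs = asyncio.run(dual_rag_query("What is Delta?", fake_uky, fake_pg))
```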
Result: 5 runs completed. Runs 1–4 compared RAG-vs-RAG. Run 5 (2026-03-11) was the first full-system comparison: pgvector + MCP tools + LangGraph routing vs UKY baseline.
- Review Q&A pairs in Argilla to identify which entities/domains have coverage
- Write test questions (41 total: 25 pgvector-targeted, 8 UKY-targeted, 8 edge cases)
- Run all questions through the agent with `DUAL_RAG_LOGGING=true`
- Export comparison logs and build an interactive HTML comparison
- Compare UKY vs pgvector: accuracy, completeness, coverage
- Identify patterns: pgvector excels on curated entities (96%); UKY covers general how-to topics for which pgvector has zero pairs
- Document findings — see `DEV_JOURNAL.md` "A.3 Run 3" and "A.3 post-mortem"
- Run 5: full-system test (pgvector + MCP + routing). 24 questions answered via RAG, 5 via MCP tools, 12 LLM-only. MCP fills the Ranch gap; q41 answered for the first time. The 12 cross-cutting questions fall to ungrounded LLM synthesis. See the `DEV_JOURNAL.md` "2026-03-11" entry.
- Build Run 5 comparison HTML — `~/.agent/diagrams/a3-run5-comparison.html`
Key findings:
- RAG-vs-RAG (Runs 1–4): Gap is coverage (content type), not quality. pgvector needs Q&A pairs for general ACCESS topics. This motivated Project C.
- Full-system (Run 5): MCP tools fill resource gaps (Ranch, project search). 12 questions fall to ungrounded LLM — need curated cross-cutting Q&A pairs and/or better observability to understand routing.
Done: node_trace added to agent graph (commits 04342c8, b7a9bec). Each node appends a structured trace dict to AgentState.node_trace via operator.add reducer. API response includes trace data when ?include_trace=true is passed. OTel/Honeycomb handles ops tracing separately.
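The `operator.add` reducer is standard LangGraph state merging: annotating a state field with a reducer tells the graph to combine each node's partial update with the existing value instead of replacing it. A minimal sketch (the node name and trace keys are illustrative; only `node_trace` and the reducer come from the commits above):

```python
import operator
from typing import Annotated, TypedDict


class AgentState(TypedDict):
    # The Annotated reducer tells LangGraph to concatenate each node's
    # returned list onto the running trace rather than overwrite it.
    node_trace: Annotated[list[dict], operator.add]


def classify_node(state: AgentState) -> dict:
    """Each node returns a one-element list; the reducer appends it."""
    return {"node_trace": [{"node": "classify", "query_type": "resource"}]}


# What the framework effectively does when applying a node's update:
state: AgentState = {"node_trace": [{"node": "entry"}]}
update = classify_node(state)
merged = operator.add(state["node_trace"], update["node_trace"])
```

With this in place, the API handler only has to copy `state["node_trace"]` into the response body when `?include_trace=true` is set.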
- Resolved: Running locally in Docker. UKY is remote, everything else local.
- Resolved: The pgvector QA service is a separate repo/service: `access-qa-service` (cloned locally, 83 pairs synced).
- Resolved: A qualitative comparison was sufficient — Run 3 gave clear signal without a formal evaluation framework.
Goal: Replace the always-on thumbs up/down with server-driven, context-aware feedback requests. Associate feedback with specific Q&A pairs and tool calls so it can flow back to Argilla for quality improvement.
Why: Current feedback is noisy — shown after every message, not tied to specific retrieval sources. With access-agent's richer response metadata (tools_used, RAG matches, confidence), we can collect feedback that actually improves the system.
- access-agent responses already include: `response`, `confidence`, `tools_used`, `metadata`
- Add a `feedback_request` field to the response:

```json
{
  "response": "...",
  "feedback_request": {
    "enabled": true,
    "reason": "rag_answer",
    "source_refs": ["mcp://compute-resources/resources/delta"],
    "rag_match_ids": ["qa-pair-uuid-123"],
    "confidence": 0.87
  }
}
```

(`reason` records why the agent is asking for feedback.)

- Logic for when to request feedback:
- Always after RAG-sourced answers (these map to curated Q&A pairs)
- After combined answers (RAG + tools)
- Maybe skip for pure dynamic/tool-only answers (ephemeral data, less actionable feedback)
- Consider confidence-based: ask more when confidence is borderline (0.7–0.85)
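The rules above reduce to a small predicate. This is one possible encoding, not a settled design: the `reason` strings and the decision to gate tool-only answers on borderline confidence are assumptions drawn from the bullets, and the thresholds are the draft 0.7–0.85 band.

```python
def should_request_feedback(reason: str, confidence: float) -> bool:
    """Decide whether a response should carry a feedback_request.

    Draft rules: always ask after RAG-sourced and combined answers,
    and for tool-only answers ask only when confidence is borderline
    (0.7-0.85), where feedback is most informative. Reason strings
    are hypothetical labels, not the agent's actual values.
    """
    if reason in ("rag_answer", "rag_plus_tools"):
        return True
    if reason == "tool_only":
        # Ephemeral data: skip unless the answer itself was shaky.
        return 0.7 <= confidence <= 0.85
    return False
```

Usage: `should_request_feedback("rag_answer", 0.95)` returns `True`, while a confident tool-only answer (`"tool_only"`, 0.95) returns `False`.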
- Read `feedback_request` from the response
- Only show thumbs up/down when `feedback_request.enabled` is true
- When the user gives a thumbs down, send enriched feedback:

```json
{
  "session_id": "...",
  "question_id": "...",
  "rating": 0,
  "source_refs": ["mcp://compute-resources/resources/delta"],
  "rag_match_ids": ["qa-pair-uuid-123"],
  "reason": "incorrect"
}
```

(`reason` is optional: incorrect, outdated, incomplete, other.)

- Planning docs (03-review-system.md) describe a richer feedback UI for thumbs-down: radio buttons for reason + optional comment
- When a thumbs-down arrives with `rag_match_ids`, look up the Q&A pair in Argilla
- Create a record in the `feedback-review` dataset linking:
  - The original question asked
  - The response given
  - The Q&A pair(s) that sourced it
  - The user's feedback reason
  - The trace_id (for Honeycomb debugging)
- This gives reviewers direct context: "This curated pair produced a bad answer for this question"
- Currently, rating goes to UKY's endpoint (`/access/chat/rating/`)
- Need a new endpoint in access-agent: `POST /api/v1/feedback`
POST /api/v1/feedback - This endpoint: validates, logs to PostgreSQL, pushes to Argilla, returns 200
- Update qa-bot-core to send feedback to this new endpoint
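The "validates" step of B.4 might look like the sketch below, framework-free so it runs anywhere. The required/optional field split and the reason vocabulary come from the enriched payload above; everything else (function name, error-list shape) is illustrative, and the PostgreSQL and Argilla steps are deliberately omitted.

```python
VALID_REASONS = {"incorrect", "outdated", "incomplete", "other"}


def validate_feedback(payload: dict) -> list[str]:
    """Validate a POST /api/v1/feedback body; return a list of errors.

    An empty list means the payload is acceptable, after which the real
    endpoint would log to PostgreSQL, push to Argilla, and return 200.
    """
    errors = []
    # session_id / question_id tie feedback to a specific exchange.
    for field in ("session_id", "question_id", "rating"):
        if field not in payload:
            errors.append(f"missing field: {field}")
    if "rating" in payload and payload["rating"] not in (0, 1):
        errors.append("rating must be 0 (down) or 1 (up)")
    # reason is optional and only meaningful on thumbs-down.
    reason = payload.get("reason")
    if reason is not None and reason not in VALID_REASONS:
        errors.append(f"unknown reason: {reason}")
    return errors
```

Keeping validation as a pure function also makes B.4 easy to unit-test before any HTTP wiring exists.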
- Depends on Project A (partially): The feedback system is designed for the QA-pair-backed endpoint. It makes sense to build once access-agent is the active backend.
- Can start design work now: The protocol design (B.1) and qa-bot-core changes (B.2) don't need access-agent to be fully deployed
- Question: Should feedback go to access-agent (which proxies to Argilla) or directly to Argilla? Architecture docs suggest through access-agent for consistency.
- Question: Do we show the richer feedback form (reason + comment) from day one, or start with just thumbs up/down and iterate?
Status: ACTIVE — C.1–C.3 complete. 807 document Q&A pairs generated and pushed to Argilla for review. Next: C.4 — sync approved pairs to pgvector and re-run A.3 comparison.
Goal: Extend access-qa-extraction to accept documents (PDFs, web pages) as input and generate Q&A pairs from them, complementing the existing MCP-data extraction.
- Categorized all 75 files in `rag_documents/` → `CORPUS_INDEX.md`
- 20 NET-NEW (process/how-to), 22 USER GUIDE (deep), 17 MCP OVERLAP, 12 DATA FILE, 4 POINTER/EMPTY
- Created `parsers.py` — PDF (PyMuPDF), docx (python-docx), txt/md parsing + chunking
- Created `DocumentExtractor(BaseExtractor)` — reads files from disk, two-shot LLM pipeline, judge, incremental cache
- Added a `"documents"` domain to `question_categories.py` with 5 field groups
- Added a `source` parameter to `QAPair.create()` for the `"doc_generated"` source type
- Wired into the CLI — `qa-extract extract documents` works end-to-end
- Smoke-tested: produces Q&A pairs from docx and md files
- Run on 20 NET-NEW files (highest priority — fills the A.3 gap)
- Review output quality, iterate on prompts if needed
- Run on 22 USER GUIDE files
- Run on data/ directory (Darwin, Delta, FASTER, Travel Rewards, software lists)
- Push to Argilla for human review
Result: 825 Q&A pairs total (611 from staging/, 214 from data/). All pushed to Argilla qa-review dataset with document_name metadata for filtering. Added document_name field and fixed source_data to show file reference instead of misleading content preview (commit 8e9edd6).
- Review document Q&A pairs in Argilla — filter by `document_name`, approve/reject/edit
- Sync approved document Q&A pairs into pgvector alongside entity pairs
- Re-run A.3 comparison — the decisive bake-off with expanded coverage
- If pgvector matches UKY across the board, begin planning UKY retirement
- Resolved: Project A confirmed the need — pgvector has 0% coverage on general how-to topics that documents cover.
- Resolved: UKY document corpus obtained — 75 files in `rag_documents/` (staging/ and data/ subdirectories). See the `DEV_JOURNAL.md` "2026-03-06" entry for the full inventory.
- Resolved: PDF extraction library choice — PyMuPDF (fitz) for PDFs, python-docx for docx.
Status: Designed (2026-03-10). Full design in `EVAL_HARNESS_PLAN.md`. Not yet implemented.
Goal: Build a reusable evaluation pipeline so team members can run golden-question bake-offs against different agent configurations and compare results visually — without needing Joe in the loop.
Why: Andrew wants an ongoing process for evaluating agent quality as the system evolves. The current A.3 bake-off was manual (one person, one script, manual HTML generation). The team needs a shared, self-service tool.
- Create `eval/golden_questions.yaml` — merge 41 questions from `A3_TEST_QUESTIONS.md` + 35 from `tests/e2e_test_cases.csv`, deduplicate (~55 questions)
- Each question: id, text, category, tags, expected assertions (query_type, keywords, tools, answer length)
- Create scenario YAML configs: `baseline`, `strict_rag`, `loose_rag`, `rag_only`
- Scenarios override `Settings` env vars (RAG thresholds, model, MCP server subsets)
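One possible shape for the two D.1 files, included only to make the plan concrete. Every key name below is a guess at the schema (D.1 is not yet implemented), and the override variable names are hypothetical stand-ins for whatever `Settings` actually exposes.

```yaml
# eval/golden_questions.yaml — sketch of one entry (schema not final)
questions:
  - id: q01
    text: "How do I get an allocation on Delta?"
    category: allocations
    tags: [how-to, delta]
    assertions:
      query_type: rag          # expected classifier outcome
      keywords: [allocation, Delta]
      min_answer_chars: 200

# eval/scenarios/strict_rag.yaml — env-var overrides applied to Settings
name: strict_rag
overrides:
  RAG_SIMILARITY_THRESHOLD: "0.85"   # variable names hypothetical
  MCP_SERVERS: "compute-resources"
```

Keeping assertions declarative like this lets the D.2 web UI evaluate pass/fail client-side without any backend.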
- Build `eval-ui/` — static web app deployed to Netlify (no backend, no database)
- Question list view with scenario picker, API key input, run button
- Calls agent endpoint directly, displays pass/fail per question inline
- Expandable per-question detail (answer, tools used, assertion results)
- JSON export/import for comparing runs
- Visual style matches the existing `a3-run3-comparison.html`
- Decide how scenarios change agent behavior (config-override endpoint, separate deployments, or labels only)
- Decide API key routing (shared project key on agent vs pass-through)
- Deploy to Netlify, share URL with team
- No local setup needed — just a browser
- Depends on C.4: The harness is most useful once the Q&A bank is comprehensive (entity + document pairs).
- Can start D.1 now: Golden questions and scenario configs don't require code changes.
- Question: Should golden questions include science-domain questions from actual researchers? Andrew suggested getting input from "science people."
✅ Weeks 1-3: Project A (complete)
├── A.1: Argilla → pgvector sync pipeline ✅
├── A.2: Dual-RAG logging in access-agent ✅
└── A.3: Run comparison, evaluate results ✅
Current: Project C (document extraction)
├── C.1: Categorize the 75 documents ✅
├── C.2: Build document extractor ✅
├── C.3: Run extraction on corpus, push to Argilla ✅
├── C.4: Load into pgvector alongside entity pairs
└── Re-run A.3 with expanded Q&A bank
Next: Projects B + D (can run in parallel)
Project B (feedback)
├── B.1: Design feedback protocol
├── B.2: Update qa-bot-core
├── B.3: Connect feedback to Argilla
└── B.4: Build feedback endpoint in access-agent
Project D (evaluation harness) — design in EVAL_HARNESS_PLAN.md
├── D.1: Golden questions YAML + scenario configs
├── D.2: Web UI (static Netlify app)
├── D.3: Resolve scenario + auth mechanics
└── D.4: Team onboarding / deploy
Rationale:
- A first because it validates the approach — done, pgvector wins on quality, loses on coverage
- C next because it closes the coverage gap that A.3 identified, potentially letting pgvector replace UKY entirely
- B after because smarter feedback closes the quality loop once the Q&A bank is comprehensive
- D parallel with B because the eval harness is independent infrastructure — Andrew wants this for ongoing evaluation as the agent evolves
- Can Joe SSH into Linode (45.79.215.140) where access-agent runs?
- Can Joe run access-agent locally via Docker Compose? ← A.3 plan: yes, both repos have compose files
- Is the pgvector QA service deployed alongside access-agent? ← Cloned locally, 83 pairs synced (A.1)
- Does Joe have Argilla credentials for the review instance? ← Credentials in the `access-argilla` repo env file
- Is Honeycomb accessible for observability during testing?
- Does the Netlify proxy (qa-bot-core feature branch) work for local testing? ← Bypassing: using direct curl for A.3
| What | Where |
|---|---|
| Agent graph + routing | access-agent/src/agent/graph.py |
| RAG retrieval node | access-agent/src/agent/nodes/rag_answer.py |
| Query classifier | access-agent/src/agent/nodes/classify.py |
| pgvector QA client | access-agent/src/services/qa_client.py |
| UKY RAG client | access-agent/src/services/uky_client.py |
| QA Service (pgvector RAG) | access-qa-service/ (separate repo, port 8001) |
| QA Service client | access-agent/src/services/qa_client.py |
| Agent state (incl. node_trace) | access-agent/src/agent/state.py |
| API routes (incl. ?include_trace) | access-agent/src/api/routes.py |
| Usage logging | access-agent/src/usage_logger.py |
| RAG comparison logging | access-agent/src/rag_comparison_logger.py |
| RAG comparison tests | access-agent/tests/test_rag_answer.py |
| QA Bot chat flow | qa-bot-core/src/utils/flows/qa-flow.tsx |
| QA Bot → access-agent proxy | qa-bot-core/netlify.toml (Netlify reverse proxy config) |
| QA Bot feedback | qa-bot-core/src/utils/flows/qa-flow.tsx:133-176 |
| Extraction pipeline | access-qa-extraction/src/access_qa_extraction/extractors/ |
| QA pair model | access-qa-extraction/src/access_qa_extraction/models.py |
| Argilla sync client | access-qa-extraction/src/access_qa_extraction/argilla_client.py |
| Review system design | access-qa-planning/03-review-system.md |
| Agent architecture | access-qa-planning/01-agent-architecture.md |
| Observability plan | access-qa-planning/08-observability.md |
| Feedback design | access-qa-planning/03-review-system.md (feedback section) |