@bacalj
Last active March 12, 2026 20:03
ACCESS-CI Dev Journal

Gist mirror: https://gist.github.com/bacalj/a6c6f9726844611df5a09c83884a0e83

2026-02-28 — A.1: Argilla → pgvector sync pipeline

Goal: Get Q&A pairs from Argilla into access-qa-service (pgvector) so they're searchable via semantic search.

Discovery: access-qa-service already had a /admin/sync endpoint and argilla_sync.py — but the code was scaffolded with placeholder logic that didn't match the actual Argilla v2 API or the record schema created by access-qa-extraction.

What was wrong:

  • Used deprecated Argilla v1 API (rg.init() / rg.load())
  • Guessed at record field access (record.inputs, record.question) — Argilla v2 uses record.fields["question"]
  • Looked for entity_id in metadata (doesn't exist) — needs to come from <<SRC:...>> citation markers in the answer text
  • Default dataset name was "access-qa" but extraction creates "qa-review"
  • argilla Python SDK wasn't in the dependencies

What we fixed (commit 5b57ae0 on access-qa-service/main):

  • Rewrote sync_from_argilla() for Argilla v2 client API
  • Correct field access via record.fields
  • Domain/entity_id extracted from citation markers, with source_ref parsing as fallback
  • Added _get_edited_values() to prefer reviewer edits (future-proofing)
  • Judge scores (faithfulness, relevance, completeness, confidence) carried through to pgvector metadata
  • Added argilla>=2.0.0 as a proper dependency
  • Added Argilla env vars to docker-compose.yml for local dev
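The citation-marker fallback in the list above can be sketched as follows. This is a hypothetical reconstruction, not the code from commit 5b57ae0: the marker grammar is assumed to be `<<SRC:domain:entity_id>>`, based on the `<<SRC:documents:...>>` markers that appear later in this journal, and the function name is made up.

```python
import re

# Assumed marker grammar: <<SRC:domain:entity_id>>
SRC_MARKER = re.compile(r"<<SRC:([^:>]+):([^>]+)>>")

def extract_domain_entity(answer_text: str):
    """Return (domain, entity_id) from the first citation marker, else (None, None)."""
    match = SRC_MARKER.search(answer_text)
    if match:
        return match.group(1), match.group(2)
    return None, None
```

When no marker is present, the sync would fall back to parsing `source_ref`, as noted above.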

Test result:

POST /admin/sync → {"synced": 83, "skipped": 0, "citations_loaded": 12, "errors": []}
POST /search {"query": "What is ACES designed for?"} → similarity_score: 1.0, correct answer with citation

83 records across 5 domains (compute-resources, software-discovery, affinity-groups, allocations, nsf-awards) synced and searchable.

Also documented: Andrew's feature/access-agent-integration branch on qa-bot-core — what it changes (Netlify proxy, request body format, response contract) and why it matters for Projects A and B. Added to FEB_MARCH_PLAN.md and synced to the gist.


2026-02-28 — A.2: Dual-RAG comparison logging in access-agent

Goal: Modify rag_answer node to query both UKY document RAG and pgvector Q&A-pair RAG for every question, logging side-by-side results for A.3 evaluation.

Approach: Parallel queries via asyncio.gather, gated behind DUAL_RAG_LOGGING env var. When the flag is off, behavior is identical to before.
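A minimal sketch of that gated parallel-query pattern, assuming placeholder names throughout (the real helpers are `_query_uky_raw()` / `_query_pgvector_raw()` inside access-agent; these stand-ins just illustrate the concurrency and the flag gate):

```python
import asyncio

DUAL_RAG_LOGGING = True  # stands in for settings.DUAL_RAG_LOGGING

async def query_uky(query: str) -> dict:
    # placeholder for the real _query_uky_raw() helper
    return {"backend": "uky", "answer": f"uky answer to: {query}"}

async def query_pgvector(query: str) -> dict:
    # placeholder for the real _query_pgvector_raw() helper
    return {"backend": "pgvector", "matches": []}

async def answer(query: str):
    if not DUAL_RAG_LOGGING:
        # flag off: single-backend behavior, identical to before
        return await query_uky(query), None
    # flag on: query both backends concurrently; return_exceptions=True
    # keeps one backend's failure from cancelling the other query
    return await asyncio.gather(
        query_uky(query), query_pgvector(query), return_exceptions=True
    )

uky_result, pg_result = asyncio.run(answer("What is ACES?"))
```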

What was built (commit caf7256 on access-agent/feature/dual-rag-logging):

  • src/config.py — Added DUAL_RAG_LOGGING: bool = False setting
  • src/rag_comparison_logger.py (new) — SQLAlchemy model + singleton logger for rag_comparison_logs table. Follows same pattern as usage_logger.py. Table auto-creates on first use.
  • src/agent/nodes/rag_answer.py — Added:
    • _query_uky_raw() / _query_pgvector_raw() — lightweight async helpers that return raw results without span side-effects
    • _dual_rag_answer() — runs both queries concurrently, applies same UKY-primary/pgvector-fallback priority, logs comparison to PostgreSQL
    • Gate in rag_answer_node: settings.DUAL_RAG_LOGGING and rag_endpoint → dual path; else unchanged
  • tests/test_rag_answer.py (new) — 19 tests: citation processing, raw query helpers, dual-RAG logic (UKY served, pgvector fallback, both fail, combined query, below threshold, logger failure resilience), flag gating

Comparison log table schema (rag_comparison_logs):

  • Query context: session_id, question_id, query_text, expanded_query, query_type, rag_endpoint
  • UKY result: uky_response, uky_duration_ms, uky_error
  • pgvector result: pgvector_matches (JSONB), pgvector_best_score, pgvector_match_count, pgvector_duration_ms, pgvector_error
  • Outcome: served_by, served_answer_length

Test result: 94 passed (all existing + 19 new), 0 failures.

What's unchanged: state.py, graph.py, routes.py — the graph contract is untouched. The comparison log is a side-effect inside the rag_answer node.

Next (A.3): Deploy the feature/dual-rag-logging branch with DUAL_RAG_LOGGING=true, ask questions via qa-bot-core or direct API, then query rag_comparison_logs to evaluate UKY vs pgvector.


2026-02-28 — A.3 setup: Docker environment stood up and smoke-tested

Decision: Run A.3 locally in Docker, bypass qa-bot-core, use direct curl requests.

Docker setup (two separate compose projects):

  • access-qa-service/docker-compose.yml → qa-service (port 8001) + PostgreSQL (port 5433) + Redis (port 6380)
  • access-agent/docker-compose.yml → agent (port 8000) + PostgreSQL (port 5432) + Redis
  • access-agent reaches access-qa-service via host.docker.internal:8001 (macOS Docker)
  • UKY endpoint is remote — uses same API key as qa-bot-core (ACCESS_AI_API_KEY)

What we did to get access-agent running:

  • Created access-agent/.env from discovered keys: OPENAI_API_KEY (from access-qa-extraction/.env), ACCESS_AI_API_KEY (same key as QA_MODEL_API_KEY in access-serverless-api/.env and REACT_APP_API_KEY in qa-bot-core/.env.local), plus DUAL_RAG_LOGGING=true, QA_SERVICE_URL=http://host.docker.internal:8001, OTEL_ENABLED=false
  • Modified access-agent/docker-compose.yml: added env_file: .env to the agent service (previously all env vars had to be listed explicitly), removed external mcp-network dependency (MCP servers aren't needed for A.3)
  • Built and started: docker compose up --build -d — all containers healthy

Smoke test (successful):

curl -X POST http://localhost:8000/api/v1/query \
  -H "Content-Type: application/json" \
  -d '{"query": "What is Delta?", "session_id": "test-a3-smoke", "question_id": "smoke-1"}'

→ Got a full UKY-sourced response about Delta (NCSA HPC resource), 6s latency, tools_used: ["uky_rag_retrieval"]. Agent is live and hitting UKY successfully.

Note: The API field is query (not question). The MCP server warnings in the agent logs are expected and harmless — those servers aren't on this Docker network and aren't needed for A.3.

Current container status (all running):

| Service | Port | Notes |
| --- | --- | --- |
| access-agent | 8000 | feature/dual-rag-logging branch, DUAL_RAG_LOGGING=true |
| access-agent postgres | 5432 | checkpointing + comparison logs |
| access-qa-service | 8001 | 83 Q&A pairs loaded |
| qa-service postgres | 5433 | pgvector embeddings |
| access-argilla | 6900 | Q&A pair review UI |

2026-03-02 — A.3 pre-flight: similarity threshold bug found

Goal: Verify Docker environment still works and start A.3 evaluation.

Discovery: pgvector is returning zero matches for reasonable queries like "What is ACES?" — even though we have 20 compute-resources Q&A pairs including several about ACES.

Root cause: The similarity threshold is too aggressive. There are two thresholds stacked:

  1. qa-service default (access-qa-service/src/access_qa_service/config.py:26): rag_similarity_threshold = 0.85
  2. access-agent per-query-type thresholds (access-agent/src/config.py:69-71):
    • RAG_THRESHOLD_STATIC = 0.85 (static queries)
    • RAG_THRESHOLD_COMBINED = 0.75 (combined queries)
    • RAG_THRESHOLD_FALLBACK = 0.65 (fallback)

The agent's _query_pgvector_raw() passes the threshold to the qa-service, which uses it to filter results. For static queries (the most common type), both sides enforce 0.85.

The problem: "What is ACES?" scores 0.84 against the best match ("What is ACES designed for?") — just below the 0.85 cutoff. With threshold 0.3, the same query returns 3 solid matches (0.84, 0.82, 0.76). Short or naturally-phrased questions routinely fall just under 0.85 even when the topic matches perfectly.

Evidence:

curl /search {"query": "What is ACES?", "threshold": 0.85}  → 0 matches
curl /search {"query": "What is ACES?", "threshold": 0.3}   → 3 matches (0.84, 0.82, 0.76)
curl /search {"query": "What is ACES designed for?"}         → 1 match (1.0, exact)
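The effect is easy to reproduce with the observed scores. This is a toy illustration of the cutoff behavior, not the service code:

```python
# Observed similarity scores for "What is ACES?" against the Q&A bank
scores = [0.84, 0.82, 0.76]

def filter_matches(scores, threshold):
    # Both the agent and the qa-service drop anything below the threshold
    return [s for s in scores if s >= threshold]

filter_matches(scores, 0.85)  # strict cutoff: the 0.84 best match is dropped
filter_matches(scores, 0.70)  # relaxed cutoff keeps all three matches
```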

The rag_comparison_logs table confirmed this — both smoke test queries ("What is Delta?", "What is ACES?") show pgvector_match_count: 0 and served_by: uky_general.

What needs to happen before running A.3:

  • Lower the threshold so pgvector actually returns matches for natural queries
  • Options: (a) lower RAG_THRESHOLD_STATIC from 0.85 to ~0.70 in access-agent config, (b) use a comparison-specific override in the dual-RAG path so production defaults aren't touched, or (c) lower the qa-service default
  • Rebuild the access-agent container after the change

Also this session: Created SYSTEM_OVERVIEW.md with sequence diagrams of the three main flows (query answering, knowledge base building, per-entity extraction detail). Updated the agent graph illustration in FEB_MARCH_PLAN.md from mermaid to an emoji-annotated state transition table. Synced plan gist.


2026-03-02 — Threshold fix committed

Change: Lowered all RAG similarity thresholds in access-agent/src/config.py (commit 08809ad on feature/dual-rag-logging):

  • RAG_THRESHOLD_STATIC: 0.85 → 0.70
  • RAG_THRESHOLD_COMBINED: 0.75 → 0.60
  • RAG_THRESHOLD_FALLBACK: 0.65 → 0.50
  • RAG_SIMILARITY_THRESHOLD (legacy): 0.85 → 0.70

Why: Best matches for natural queries scored ~0.84, just below the 0.85 cutoff. This was the A.3 blocker — pgvector returned 0 matches for every query.

Still needed: Rebuild the access-agent Docker container (docker compose up --build -d) and verify the fix with a smoke test before proceeding with A.3.


2026-03-02 — A.3 running: container rebuilt, threshold verified, test questions written

Rebuilt container: docker compose up --build -d picked up the threshold fix. All containers healthy.

Threshold fix verified: "What is ACES?" now returns pgvector_match_count: 3, pgvector_best_score: 0.84. Before the fix this was 0 matches. UKY still served (as designed), but pgvector results are now logging.

Pushed branches: access-agent/feature/dual-rag-logging pushed to GitHub (3 commits: A.2 dual-RAG logging, threshold fix). access-qa-service/main push failed — Joe doesn't have write access to necyberteam/access-qa-service (need Andrew to grant).

QAP coverage (83 pairs across 10 entities in 5 domains):

| Domain | Entity | Pairs |
| --- | --- | --- |
| compute-resources | ACES (TAMU) | 10 |
| compute-resources | Ranch (TACC) | 10 |
| software-discovery | ABINIT | 10 |
| software-discovery | Abaqus | 8 |
| allocations | Grassland bird habitat (#72204) | 9 |
| allocations | RL benchmark (#72205) | 10 |
| nsf-awards | Pollinator conservation AI (#2529183) | 10 |
| nsf-awards | Great Salt Lake dust (#2449122) | 8 |
| affinity-groups | Neocortex (PSC) | 5 |
| affinity-groups | REPACSS (TTU) | 3 |

Test questions written: 40 questions in A3_TEST_QUESTIONS.md, organized in 3 groups:

  • pgvector-targeted (24): Questions about entities we have QAPs for
  • UKY-targeted (8): General ACCESS questions our 83 pairs probably don't cover
  • Edge cases (8): Vague, misspelled, or cross-domain questions

Next: Review the test questions, then fire them all through the agent and pull the comparison logs.


2026-03-04 — A.3 Run 2: first full test, unfair comparison discovered

Run 2 executed: Fired all 41 test questions through the agent with DUAL_RAG_LOGGING=true. All 41 succeeded, 40 logged (q41 classified as dynamic/xdmod). Results exported to a3_results/run2.json.

Run 2 results (high-level): UKY answered 36/40, pgvector had matches for 30/40, served by UKY 36, served by pgvector 4.

Built interactive HTML comparison: ~/.agent/diagrams/a3-run2-comparison.html — expandable rows with side-by-side answers, KPI summary, sidebar nav, analysis section.

Synthesis routing fix: pgvector static matches were previously returned as final_answer (raw Q&A pair text). Changed rag_answer.py to set rag_matches + rag_used instead, and added "synthesize" as a third routing option from route_after_rag in graph.py. This routes pgvector results through the LLM synthesis pipeline.
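The resulting three-way routing can be sketched like this. It is a simplified stand-in for the real route_after_rag in graph.py; the state keys used here are assumptions:

```python
def route_after_rag(state: dict) -> str:
    # UKY responses arrive already LLM-synthesized, so they can end the graph
    if state.get("final_answer"):
        return "end"
    # pgvector matches are raw Q&A pair text and need LLM synthesis first
    if state.get("rag_matches"):
        return "synthesize"
    # neither backend produced anything usable
    return "fallback"
```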

Unfair comparison discovered: Run 2's comparison was apples-to-oranges. UKY answers arrive already LLM-synthesized (UKY's own LLM produces polished prose). pgvector answers in the comparison log were raw Q&A pair text — just the verbatim answer field from the curated pair. This made pgvector look worse than it actually is, since the difference was partly in presentation quality, not underlying knowledge.


2026-03-04 — A.3 Run 3: fair apples-to-apples comparison

Goal: Make the comparison fair by synthesizing pgvector answers through our own LLM before logging them.

What was changed:

  • rag_comparison_logger.py — Added pgvector_synthesized_answer = Column(Text) to the model and log_comparison() method
  • rag_answer.py — Imported _format_rag_matches and _synthesize_with_rag_only from synthesize.py. In _dual_rag_answer(), after getting pgvector matches, calls synthesis to produce an LLM-polished answer before logging. This is what the user would actually see if pgvector served the answer.
  • pyproject.toml — Pinned opentelemetry-instrumentation-langchain<0.53 (newer version had a breaking import for GenAICustomOperationName)
  • Database — ALTER TABLE rag_comparison_logs ADD COLUMN pgvector_synthesized_answer text;
  • Test runner — Created a3_results/run_a3_test.py to fire all 41 questions programmatically

Run 3 results (41/41 succeeded, all logged):

| Metric | Value |
| --- | --- |
| UKY answered | 38/41 (93%) |
| pgvector answered (synthesized) | 27/41 (66%) |
| Both answered | 24 (direct comparison possible) |
| UKY only | 14 |
| pgvector only | 3 |
| Avg pgvector similarity score | 0.84 |

Fair comparison conclusions (from HTML analysis at ~/.agent/diagrams/a3-run3-comparison.html):

  1. The two backends are complementary, not competitive. pgvector gives precise, curated answers for entities we've built Q&A pairs for. UKY covers the long tail of general ACCESS knowledge.

  2. pgvector excels on its own domain: Of 25 pgvector-targeted questions (Q1-Q25), pgvector produced synthesized answers for 24 (96%). These are entities with curated Q&A pairs.

  3. UKY handles breadth that pgvector cannot: For 8 UKY-targeted questions (Q26-Q33) about general ACCESS topics (allocations process, Globus, password reset), pgvector answered 0. Our 83 curated pairs simply don't cover these.

  4. UKY produces longer answers (~157% longer on average when both answer the same question). This may reflect UKY's larger document corpus or that our synthesis prompt is more concise. Length alone doesn't indicate quality.

  5. pgvector retrieval is dramatically faster (~5 ms vs ~2500 ms for UKY), though pgvector now also needs LLM synthesis time (not logged separately).

  6. The quality gap is narrower than Run 2 suggested. With LLM synthesis, pgvector answers read as polished, cited responses. The Run 2 comparison was unfairly penalizing pgvector by showing raw text.

  7. Production recommendation: Use both backends — pgvector for high-confidence domain matches, UKY for everything else. This is already the architecture (_dual_rag_answer uses UKY-primary, pgvector-fallback).

Files produced:

  • a3_results/run3.json — Full export of 41 comparison log entries
  • ~/.agent/diagrams/a3-run3-comparison.html — Interactive comparison with analysis
  • a3_results/run_a3_test.py — Test runner script

2026-03-05 — A.3 post-mortem: reframing the question

Realization: The A.3 analysis drifted toward "complementary backends" and fallback architecture. But that wasn't the original question. From FEB_MARCH_PLAN.md:

"proving this approach outperforms document RAG" (line 34)

"We need data on how these two approaches compare before making further investment decisions" (line 65)

"A first because it validates the approach before investing in B" (line 259)

A.3 was a bake-off to decide whether Q&A-pair RAG can replace UKY document RAG — not to build a hybrid system. The "use both" conclusion was the code's existing fallback architecture leaking into the analysis.

Why pgvector lost on breadth (and it's not about quality)

The coverage gap is entirely explained by content type, not approach quality:

What the extraction pipeline covers (5 MCP server domains, entity-focused):

  • Compute resources (23 entities: ACES, Delta, Anvil, etc.)
  • Software discovery (1,404 packages)
  • Allocations (5,440 projects)
  • NSF awards (10,000+ awards)
  • Affinity groups (55 groups)

These are all "what is X" questions about discrete entities. The pipeline pulls structured data from MCP servers and generates Q&A pairs about each entity's properties.

What UKY has that we don't (general ACCESS documentation):

  • How to apply for an allocation (process docs)
  • How to transfer files / use Globus (how-to guides)
  • How to reset your password (account management)
  • Startup vs research allocations (policy docs)
  • Training resources, publication acknowledgment (educational docs)

These are "how do I" questions about ACCESS-wide processes. They don't live in any MCP server — they live in documentation pages, wikis, and guides that UKY ingested.

We don't know exactly what UKY ingested. The plan has an open question: "Need a list from Andrew of what UKY currently ingests." UKY is a black-box API to us.

The actual A.3 verdict

On entity questions where we have Q&A pairs: pgvector hits 96% (24/25). The synthesized answers are concise and accurate. pgvector retrieval is ~500x faster than UKY (~5ms vs ~2500ms).

On general how-to/process questions: pgvector scores 0%. We simply have zero Q&A pairs for these topics because no MCP server serves allocation process docs or file transfer guides.

The gap is coverage, not quality. If we had Q&A pairs for general ACCESS topics, pgvector would likely match or beat UKY on those too.

Decision point

The plan says Project C ("Extract from ACCESS documentation") was deferred with this note:

"Revisit only if a specific content gap surfaces that exists only in documents with no API equivalent (e.g., narrative tutorials, policy explainers)."

A.3 just surfaced exactly that gap. The 14 UKY-only questions are all process/how-to questions with no API equivalent.

Joe needs to decide:

  1. Pursue Project C — Extract Q&A pairs from ACCESS documentation (not MCP entities). This would close the how-to gap and potentially let pgvector replace UKY entirely. Requires: getting the doc list from Andrew, building a document extractor, running extraction + Argilla review.

  2. Keep UKY for breadth, pgvector for precision — Accept the hybrid architecture. UKY handles general questions, pgvector handles entity questions. Simpler, but you're dependent on UKY's black-box system and can't control answer quality for general topics.

  3. Expand entity coverage first — Before tackling docs, run the existing extraction pipeline against more entities (we only extracted 11 of 23 compute resources, 2 of 1,404 software packages, 2 of 5,440 allocations). More entity coverage might narrow the gap enough.

UKY corpus: confirmed undocumented

Searched all repos (access-qa-planning, access-agent, access-mcp, access-qa-extraction, access-qa-bot) for any documentation of what UKY's system ingests. Found:

  • pages-current-production.md — "The Q&A backend is hosted at the University of Kentucky." No corpus details.
  • pages-access-qa-tool.md line 193 — Notes UKY's tech stack as "ChromaDB, llamaindex." No document list.
  • FEB_MARCH_PLAN.md line 233 — Open question: "Need a list from Andrew of what UKY currently ingests."
  • uky_client.py — Black-box HTTP client. No corpus metadata.

No list of UKY's ingested documents exists anywhere in our repos. Andrew is the only source for this information.

Research options independent of Andrew

Even without the UKY document list, there are viable paths to continue the bake-off:

Option A: Analyze UKY's 14 winning answers for source clues. Read the UKY-only responses from Run 3 and determine whether the information is unique to some internal corpus or is general ACCESS knowledge available on public web pages (support.access-ci.org, allocations.access-ci.org). UKY's answers may contain citations, URLs, or verbatim language that reveals their source documents. This takes ~30 minutes and informs all other options.

Option B: Generate Q&A pairs from public ACCESS content. Point the extraction pipeline (or a variant) at public ACCESS web pages — the allocations guide, getting started pages, Globus documentation, password reset instructions. These are freely available. Generate Q&A pairs, curate them, load into pgvector, re-run A.3. This directly tests whether closing the topic gap closes the performance gap.

Option C: Determine whether UKY's advantage is unique knowledge or general glue. The 14 UKY-only questions are all process/how-to topics. If UKY is synthesizing from the same public ACCESS web pages any user can read, then the "advantage" is simply that we haven't generated Q&A pairs for those topics yet — not that UKY has access to privileged information. This reframes the bake-off: it's not documents vs Q&A pairs, it's about coverage breadth.

Option D: Expand entity coverage as a control. Add Q&A pairs for remaining MCP server domains (events, announcements, system-status) and more entities within existing domains (we only extracted 11 of 23 compute resources, 2 of 1,404 software packages). This tests whether broader entity coverage alone changes the picture.

Recommended sequence: A first (30 min, informs everything), then B (directly tests the hypothesis), with D as low-effort parallel work.


2026-03-06 — UKY corpus obtained, plan aligned with Andrew

UKY document corpus now available

Andrew provided the full set of documents that feed UKY's document RAG. They are in rag_documents/ (75 files, 69 MB) split across two directories:

staging/ (~47 files) — The main corpus. Three categories:

| Category | Examples | Count |
| --- | --- | --- |
| Resource descriptions | ACES, Anvil, Bridges-2, Delta, Expanse, Jetstream-2, Neocortex, Sage, Voyager, Fabric (PDFs) | ~20 |
| User guides | ACES, Anvil, Bridges-2, Delta, Expanse, Jetstream-2, Neocortex, Sage (PDFs) | ~10 |
| Process/how-to docs | Allocations, Globus, MFA, add users, progress reports, office hours, events/trainings, system status (docx) | ~12 |
| Misc | ARA description, SDS pointer, CloudBank login, REPACSS overview, Sage edge apps, current projects | ~5 |

data/ (~28 files) — Per-resource software lists (txt/csv) and resource-specific documentation:

  • Software installed lists for ACES, Anvil, Bridges-2, Darwin, Delta, Expanse, Jetstream2, Kyric, Stampede3
  • Darwin docs (user guide, login, filesystems, job management, SLURM, software)
  • Delta docs (user guide, data management)
  • FASTER docs (intro, SLURM partitions, documentation)
  • ACCESS Travel Rewards (md)

Key observation: The process/how-to docs in staging/ (allocations, Globus, MFA, etc.) are exactly the topics UKY beat pgvector on in A.3. The resource descriptions overlap with what MCP extraction already covers. This confirms the A.3 finding — the gap was coverage, not quality.

Alignment with Andrew

Confirmed the shared end state:

  1. Generate Q&A pairs from these documents — Use a similar two-shot process to what exists for MCP entities, but with documents as input. Andrew: "Probably a similar prompt to the MCP tools can work for generating pairs from docs."

  2. One unified Q&A pair bank in pgvector — Entity pairs (from MCP) + document pairs (from these files) living together, searchable as one corpus.

  3. The orchestrator agent decides routing — RAG for factual queries, MCP for live data, both when needed. Andrew: "The orchestrator agent should decide which tools to use (RAG, MCP, both) and then it should get synthesized. That logic should already exist in access-agent."

  4. UKY goes away — Andrew: "Eventually, we will likely not need the document based RAG since the Q&A pairs are faster." pgvector replaces UKY entirely.

Plan: document extraction pipeline

Step 1: Categorize the corpus. Skim the 75 files and bucket them: resource descriptions (entity overlap with MCP), user guides (process/how-to), general ACCESS docs. Identify what's already covered by MCP extraction vs. what's net new.

Step 2: Build a document extractor in access-qa-extraction. Extend the pipeline to accept documents (PDF/docx) as input. The two-shot prompt structure should carry over — battery pass for coverage, discovery pass for insights. New work: document parsing (PDF text extraction, docx reading) and chunking into logical sections.

Step 3: Run extraction on the full corpus. Generate Q&A pairs from all documents. Push to Argilla for review. This produces pairs for the exact topics pgvector was missing — allocations process, Globus, MFA, user guides.

Step 4: Load into pgvector alongside entity pairs. One unified bank: existing 83 entity pairs + document-sourced pairs. All searchable together.

Step 5: Re-run A.3. Same 41 questions (plus new ones if the expanded corpus suggests them). If pgvector-with-documents matches or beats UKY across the board, the bake-off is won.

Step 6: Simplify the agent routing. Once the Q&A pair bank covers everything, the agent graph simplifies: RAG for factual queries, MCP for live data, synthesis when both contribute. Remove the UKY fallback path.


2026-03-09 — Project C step 1: corpus categorized

Skimmed all 75 files in rag_documents/ and produced a categorized index at rag_documents/CORPUS_INDEX.md. No files were moved or renamed — the index is a read-only reference.

Categorization results

| Category | Files | Priority | Rationale |
| --- | --- | --- | --- |
| NET-NEW process/how-to | 20 | First | Fills the exact A.3 gap — allocations, Globus, MFA, Sage, citations, Jupyter |
| USER GUIDE (deep) | 22 | Second | Operational depth (job submission, filesystems, SLURM) beyond MCP surface data |
| MCP OVERLAP (descriptions) | 17 | Later | 1-page resource catalog entries — MCP already covers most of this |
| DATA FILE | 12 | Skip | Raw software lists (name/version lines) — MCP software-discovery covers this |
| POINTER/EMPTY | 4 | Skip | URL stubs or corrupt files with no substantive content |

Key finding: The 20 NET-NEW files are mostly small docx docs — easy to parse, directly address the A.3 gap. The 22 user guides are larger PDFs with real depth (SLURM partitions, data management, module systems). The 17 resource descriptions are 1-page PDFs that overlap with MCP entity data.

Also this session: Consolidated project documentation — SYSTEM_OVERVIEW.md is now single source of truth for architecture, FEB_MARCH_PLAN.md updated with A.3 results and Project C active status, all three docs gist-mirrored, CLAUDE.md updated with document discipline rules.


2026-03-09 — PRs merged, document extractor built (C.2)

Pre-flight: merged outstanding PRs

access-qa-extraction PR #1 (two-shot pipeline) — squash-merged to main. 4,697 additions across the full two-shot extraction pipeline: battery + discovery prompts, LLM judge, incremental cache, Argilla entity-replace, 5 domain extractors, 144 tests. Branch archived on GitHub.

access-qa-planning PR #1 (companion docs) — squash-merged to main. Documentation updates for two-shot pipeline.

access-agent and qa-bot-core — decided to leave on their branches. qa-bot-core is a production product with its own release routine. access-agent's feature/dual-rag-logging branch mixes evaluation scaffolding with production improvements — better to leave as-is until the bake-off concludes.

Smoke-test on main

Reinstalled access-qa-extraction from clean main. 144/144 tests pass. Started mcp-compute-resources Docker container from access-mcp/docker-compose.yml (port 3002). Ran extraction:

qa-extract extract compute-resources --max-entities 1 --no-judge

Produced 8 Q&A pairs for ACES — 5 battery + 3 discovery, all with citations. Two-shot pipeline confirmed working on main.

Built DocumentExtractor (Project C.2)

Branched feat/document-extractor off clean main. Built the document extraction pipeline:

New files:

  • parsers.py — Standalone document parsing module. parse_docx() (python-docx), parse_pdf() (PyMuPDF/fitz), parse_text() (.txt/.md). Dispatcher parse_document() routes by extension. chunk_text() splits large docs (~6000 words) with overlap. clean_extracted_text() collapses PDF/docx whitespace artifacts.
  • extractors/documents.py — DocumentExtractor(BaseExtractor). Overrides run() to skip MCPClient (documents are local files). Discovers files recursively from config.url directory. Each document/chunk = one entity. Two-shot LLM pipeline (battery + discovery), judge evaluation, incremental cache — same as MCP extractors. Uses source="doc_generated", source_ref="doc://documents/{entity_id}".
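The chunking behavior described for parsers.py can be sketched as follows. This is an illustrative reconstruction, not the actual chunk_text(): the ~6000-word size and the overlap value are assumptions drawn from the description above.

```python
def chunk_text(text: str, max_words: int = 6000, overlap: int = 200) -> list:
    """Split text into ~max_words chunks, overlapping so section
    boundaries aren't lost between chunks."""
    words = text.split()
    if len(words) <= max_words:
        return [text]
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
        start += max_words - overlap  # step back by `overlap` words
    return chunks
```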

Modified files:

  • pyproject.toml — Added python-docx>=1.0.0, PyMuPDF>=1.24.0
  • models.py — Added source parameter to QAPair.create() (default "mcp_extraction", backward-compatible)
  • question_categories.py — Added "documents" to DOMAIN_LABELS, DOMAIN_NOTES, and FIELD_GUIDANCE (5 field groups: overview, key procedures, requirements & eligibility, important details, support & contact)
  • config.py — Added "documents" MCPServerConfig with url=os.getenv("DOCUMENTS_DIR", "../rag_documents")
  • extractors/__init__.py — Added DocumentExtractor import and export
  • cli.py — Added DocumentExtractor to EXTRACTORS registry

Smoke tests

Test 1: qa-extract extract documents --max-entities 1 --no-judge — parsed CORPUS_INDEX.md, produced 6 Q&A pairs about the document corpus.

Test 2: qa-extract extract documents --entity-ids "10_1758119706.911465_data-ACCESS-how-to-cite-Jetstream" --no-judge — parsed a docx file from staging/, produced 5 Q&A pairs about Jetstream citation formats and acknowledgment requirements.

Fix: _title_from_stem() was producing ugly titles from Slack-style filenames (e.g., 10_1758119706.911465_data-ACCESS-how-to-cite-Jetstream). Added re.sub(r"^\d+_[\d.]+_", "", stem) to strip the numeric prefix, plus stripping common prefixes (data-ACCESS-, data:, etc.). Title now renders as "How To Cite Jetstream".
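The fix behaves roughly like this sketch (the prefix list beyond data-ACCESS- is an assumption, and the real _title_from_stem() may differ in detail):

```python
import re

def title_from_stem(stem: str) -> str:
    # Strip the Slack-export numeric prefix, e.g. "10_1758119706.911465_"
    stem = re.sub(r"^\d+_[\d.]+_", "", stem)
    # Strip common data-file prefixes (this list is illustrative)
    for prefix in ("data-ACCESS-", "data-", "data:"):
        if stem.startswith(prefix):
            stem = stem[len(prefix):]
            break
    return stem.replace("-", " ").replace("_", " ").title()

title_from_stem("10_1758119706.911465_data-ACCESS-how-to-cite-Jetstream")
# → "How To Cite Jetstream"
```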

All 144 existing tests still pass after all changes.

First extraction run: staging/ directory (C.3)

Ran DOCUMENTS_DIR="../rag_documents/staging" qa-extract extract documents --no-judge on all 47 files in staging/. Took ~25 minutes (94 LLM calls).

Results: 586 Q&A pairs from 83 entities (46 files processed, 1 corrupt file skipped).

| Category | Entities | Pairs | Notes |
| --- | --- | --- | --- |
| NET-NEW docx (process/how-to) | 19 | ~110 | Allocations, MFA, Globus, Sage, Jupyter |
| User Guide PDFs (chunked) | 39 chunks | ~290 | Jetstream2 (20 chunks), Anvil (6), Bridges-2 (5), etc. |
| MCP Overlap descriptions | 17 | ~134 | 1-page resource PDFs |
| Other (ARA, SDS, REPACSS) | 8 | ~52 | Small docs |
  • 100% citation markers (<<SRC:documents:...>>)
  • All pairs use source: "doc_generated"
  • Large PDFs chunked correctly (~6000 words per chunk with overlap)
  • Quality spot-check: questions are natural, answers contain specific details (URLs, commands, step-by-step procedures)
  • Only error: current-access-projects.docx (known corrupt/empty file)

Output at data/output/documents_qa_pairs.jsonl (gitignored). Branch pushed to GitHub.

Not yet run: data/ directory (Darwin, Delta, FASTER docs + ACCESS-Travel-Rewards.md + software lists).

Second extraction run: data/ directory (C.3)

Ran DOCUMENTS_DIR="../rag_documents/data" qa-extract extract documents --no-judge on all files in data/ subdirectories.

Results: 221 Q&A pairs from 29 entities.

| Subdirectory | Entities | Pairs | Notes |
| --- | --- | --- | --- |
| ACCESS-Resources/Darwin/ | 9 | ~65 | Managing jobs, user guide, compiling, file systems, etc. |
| ACCESS-Resources/Delta/ | 3 chunks | ~25 | Large PDF chunked into 3 |
| ACCESS-Resources/FASTER/ | 4 | ~30 | User guide, system overview, jobs, file systems |
| ACCESS-Travel-Rewards.md | 1 | ~8 | Travel reimbursement program |
| ACCESS-Software-Installed-by-resource/ | 12 | ~93 | Software lists (package names/versions — generic Q&A quality) |
  • Software-list files produced generic "what software is installed on X" pairs — adequate but not high-value. Argilla reviewers can reject low-quality ones.
  • Darwin and FASTER docs produced strong procedural content (SLURM commands, file system paths, compilation flags).

Combined output and Argilla push

Saved staging/ output as documents_staging_qa_pairs.jsonl, combined both runs into documents_all_qa_pairs.jsonl (807 total pairs).

Pushed all 807 pairs to Argilla: qa-extract push data/output/documents_all_qa_pairs.jsonl. Records visible in qa-review dataset at http://localhost:6900.

Docker note: Argilla containers had stale network references from previous sessions. Fixed with docker compose down --remove-orphans && docker network prune -f && docker compose up -d.

Added document_name metadata field

Problem: When reviewing pairs in Argilla, all 807 records had domain: "documents" with no way to tell which source document they came from — the only clue was the source_ref URI (e.g., doc://documents/10_1758119706.911465_data-ACCESS-how-to-cite-Jetstream), which is opaque. For MCP-extracted pairs, domain provides natural grouping (compute-resources, allocations, etc.), but document pairs lack an equivalent.

Fix: Added document_name as an optional metadata field on QAMetadata, populated from the existing _title_from_stem() helper in DocumentExtractor. The field flows through to Argilla as a filterable TermsMetadataProperty. MCP extractors are unaffected (field defaults to None).

Files changed: models.py (field + factory param), documents.py (passes title), argilla_client.py (schema + record metadata).

Re-extraction: Re-ran both staging/ (611 pairs) and data/ (214 pairs) = 825 total. Deleted old Argilla dataset (no schema for document_name), pushed fresh. 72 unique document names now filterable in Argilla.
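
The shape of the models.py change can be sketched with a dataclass (the real QAMetadata is presumably a Pydantic model; fields other than document_name are illustrative):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QAMetadata:
    """Sketch of the metadata model; only document_name is new."""
    domain: str
    entity_id: str
    document_name: Optional[str] = None  # stays None for MCP-extracted pairs

# Document pairs get a title from _title_from_stem(); MCP pairs leave the default.
doc_meta = QAMetadata("documents", "darwin-filesystems",
                      document_name="Darwin Filesystems Storage")
mcp_meta = QAMetadata("compute-resources", "aces")
```

Because the field defaults to None, the MCP extractors need no changes, and Argilla can expose it as a filterable metadata property only for document pairs.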

Fixed source_data for document pairs

Problem: For MCP entity pairs, source_data contains the full entity JSON that the LLM used to generate the Q&A pair — reviewer sees exactly what went in. For document pairs, source_data was set to content_preview: chunk[:500] — the first 500 characters of the chunk. This was misleading: it looked like the source material but only represented a tiny slice of the ~6000-word chunk the LLM actually saw. Reviewers would see a content_preview about topic X when the Q&A pair was about topic Y (from elsewhere in the same chunk).

Fix: Replaced content_preview with a reference: {file, chunk, total_chunks, word_count}. For non-chunked documents, chunk and total_chunks are null. The reviewer sees the file and chunk number; the actual document is in rag_documents/.

Design note on chunking: Large documents (>6000 words) are split into sequential ~6000-word chunks with 500-word overlap. Each chunk is processed as a separate entity — the LLM only sees one chunk at a time, not the whole document. So chunk 9 of a 20-chunk Jetstream PDF starts at roughly word 44,000. This is why the source_ref includes the chunk number (e.g., doc://documents/jetstream-2-user-guide__chunk_9).
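
A minimal sketch of this chunking scheme (word counts and the function name are illustrative, not the actual DocumentExtractor code):

```python
def chunk_words(text: str, size: int = 6000, overlap: int = 500) -> list[str]:
    """Split text into sequential word-based chunks with overlap.

    Chunk n (1-indexed) starts at word (n - 1) * (size - overlap), so
    chunk 9 of a 20-chunk document starts near word 8 * 5500 = 44,000.
    """
    words = text.split()
    if len(words) <= size:
        return [text]  # small documents stay whole
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # last chunk absorbed the tail
    return chunks
```

Each returned chunk is then processed as its own entity, which is why the chunk number must travel with the source_ref.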


2026-03-10 — Evaluation harness design (Project D)

Andrew asked about making the bake-off self-service: editable golden questions, runnable by the team with their own tokens, comparing different agent configurations ("tool combinations"). Key points from the conversation:

  • Golden questions: Andrew wants a curated benchmark set that people can view, add, and modify. These are distinct from the Q&A pairs in Argilla — they're the test inputs used to evaluate the agent.
  • Different tool combinations: Not UKY-vs-pgvector (UKY is going away), but different configurations of our agent — RAG thresholds, MCP server subsets, model choices. Each configuration is a "scenario."
  • Self-service: Team members should be able to run evaluations and see results without Joe in the loop.
  • Ongoing process: Re-run as the agent evolves, not a one-shot comparison.

Designed the evaluation harness. Full design saved as EVAL_HARNESS_PLAN.md. Summary:

  • Golden questions in YAML (merge A3_TEST_QUESTIONS.md + e2e_test_cases.csv → ~55 questions with structured assertions)
  • Scenario configs as YAML files overriding Settings env vars
  • CLI runner calling run_agent() directly (not HTTP) to capture full AgentState
  • HTML report generator producing self-contained comparison pages (matching a3-run3 visual style)
  • New access-agent/eval/ directory

Added as Project D in FEB_MARCH_PLAN.md (D.1–D.4), parallel with Project B after C.4 completes.

Pivot: Initially designed as a CLI-based Python tool (access-agent/eval/). Revised to a static web app on Netlify (eval-ui/) — no Python environment needed, users just open a browser. Golden questions and scenarios bundled at build time, results displayed inline and exportable as JSON. Two open design questions flagged: (1) how scenarios actually change agent behavior given the current API doesn't accept config overrides, and (2) API key routing (server-side vs pass-through). Plan saved as EVAL_HARNESS_PLAN.md.

This is future work — immediate next step remains C.4 (review Argilla, sync pgvector, re-run A.3).


2026-03-10 — C.4: Meta-referencing fix, re-extraction, A.3 Run 4

Meta-referencing problem in document Q&A pairs

Spot-checked the 825 document pairs in Argilla and found a systematic quality issue: 36% (300/825) of generated questions referenced the source documents rather than the subject matter.

Examples:

  • Wrong: "What are the important quotas and limits mentioned in the Darwin Filesystems Storage document?"
  • Right: "What are the storage quotas on Darwin?"

Root cause analysis: Two contributing factors:

  1. FIELD_GUIDANCE field group #1 said "what is this document about?" — 90% of seq-1 (overview) pairs were meta-referencing.
  2. Entity titles included document-type suffixes ("Jetstream 2 User Guide") which primed the LLM to treat the document as the subject.

Prompt and code fixes

question_categories.py — Two changes:

  • Added explicit anti-meta-referencing instruction to DOMAIN_NOTES["documents"] with wrong/right examples.
  • Reworded all 5 field groups in FIELD_GUIDANCE["documents"] to avoid document-referencing (e.g., "Overview — what is this topic about?" instead of "what is this document about?").

documents.py — Added regex to _title_from_stem() to strip document-type suffixes ("User Guide", "Manual", "Handbook", etc.) so the LLM sees "Jetstream 2" instead of "Jetstream 2 User Guide" as the entity name.
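
The suffix stripping can be sketched as a single regex pass (the suffix list here is illustrative; the actual _title_from_stem() pattern may cover more variants):

```python
import re

# Hypothetical suffix list based on the journal entry ("User Guide",
# "Manual", "Handbook", etc.); longest alternatives listed first.
_DOC_SUFFIX = re.compile(
    r"\s+(user guide|guide|manual|handbook)\s*$",
    re.IGNORECASE,
)

def strip_doc_suffix(title: str) -> str:
    """Drop a trailing document-type suffix so the entity name is the subject."""
    return _DOC_SUFFIX.sub("", title).strip()
```

With this in place the LLM sees "Jetstream 2" rather than "Jetstream 2 User Guide", which removes the priming toward document-referencing questions.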

Re-extraction results

Three extraction runs after iterating on fixes:

  1. Staging (first fix): 608 pairs, 10% meta (down from 36%)
  2. Staging (with title suffix fix): 604 pairs, 0.9% meta (6 remaining)
  3. Data directory: 228 pairs

Combined: 832 pairs, 6 meta-referencing (0.7%). Cleared Argilla and pushed fresh.

A.3 Run 4 — the bake-off

Brought up all services locally (qa-service on 8001, access-agent on 8000). Synced 832 document pairs from Argilla and loaded 70 entity pairs via JSONL. Total: 902 pairs in pgvector.

Fired all 41 test questions. Results:

| Metric | Run 3 (83 pairs) | Run 4 (902 pairs) |
|---|---|---|
| UKY hits | 38/41 (93%) | 40/40 (100%) |
| pgvector hits | 27/41 (66%) | 27/40 (67%) |
| pgvector avg latency | ~5ms | ~30ms |

pgvector coverage stayed flat at 67% despite 10x more pairs.

The architectural insight

The 13 missed questions fall into two categories:

  1. Missing source content (4 questions) — Ranch storage has zero Q&A pairs because no Ranch documents exist in rag_documents/ and Ranch wasn't returned from MCP in the extraction run that generated the original test questions.

  2. No cross-cutting Q&A pairs (9 questions) — General ACCESS questions ("How do I apply for an allocation?", "How do I transfer files between resources?", "What training does ACCESS offer?") have no matching pairs even though we have 104 allocation mentions, 50 transfer/Globus mentions, and 40 training mentions across our pairs. The problem: all those mentions are entity-scoped. We have "How do I cite Jetstream?" but not "How do I acknowledge ACCESS?" We have "What allocations does Anvil support?" but not "How do I apply for an allocation?"

The extraction pipeline processes one document at a time, so it only ever generates entity-scoped Q&A pairs. It will never produce cross-cutting "How does ACCESS work in general?" pairs from a single-document prompt.

UKY's advantage is architectural: chunk-level retrieval at query time lets it pull relevant fragments from multiple documents and synthesize on the fly. It doesn't need a pre-generated answer that matches — it just needs chunks that are individually relevant. Our Q&A-pair RAG needs a pair whose question semantically matches the user's question, and no single entity-scoped pair matches a cross-cutting query closely enough.

Decision questions for Andrew

  1. Manually curate cross-cutting pairs — Write 20-30 general ACCESS Q&A pairs by hand. Fast, targeted, but doesn't scale.
  2. Add a cross-cutting extraction pass — Feed the LLM multiple documents simultaneously and ask for general questions that span topics. New pipeline capability.
  3. Keep UKY as fallback for general questions — Accept the hybrid. pgvector for entity questions (fast, verified), UKY for cross-cutting (slow, unverified).
  4. Lower similarity thresholds — Some misses scored 0.55-0.68, not far from the 0.70 cutoff. Won't fix the 0.28-0.49 misses.
  5. Detect cross-cutting-ness at query time — Instead of pre-generating cross-cutting pairs, use pgvector match quality as a signal: low scores with scattered partial matches → route to document chunk RAG or MCP tools. Fits existing agent graph routing.

Files produced

  • a3_results/run4.json — 40 comparison log entries
  • a3_results/run4_enriched.json — enriched with low-threshold best-possible scores
  • ~/.agent/diagrams/a3-run4-bakeoff.html — interactive comparison visualization

Answer richness gap (second dimension)

Even when pgvector hits, many answers are thinner than UKY's. Investigated whether pgvector answers were bypassing LLM synthesis — confirmed they are NOT: _dual_rag_answer() calls _synthesize_with_rag_only() for every pgvector match. The real issue: a single pre-digested Q&A pair gives the synthesis LLM very little to work with, so it returns near-verbatim text. UKY pulls multiple document chunks and the LLM has more raw material to synthesize a richer answer.

However, reviewing side-by-side answers revealed a more nuanced picture:

  • Some pgvector answers are actually better than UKY's (more precise, directly relevant)
  • Some just need link enrichment (the synthesis prompt doesn't encourage adding URLs)
  • Some questions UKY can't answer but pgvector can (entity-specific data from MCP)

This shifts the framing from "pgvector vs UKY" to "how to combine them intelligently."

Quick fix (low-effort, high-impact): The RAG_ONLY_SYNTHESIS_PROMPT in synthesize.py says "Be concise and direct" — this is why the LLM returns near-verbatim single sentences. Updating the prompt to encourage link inclusion, practical context, and resource pointers would immediately enrich thin answers without any architectural changes. The Q&A pair metadata already carries domain and entity_id which could drive link generation.

5th strategic option: cross-cutting detection at query time

Instead of generating cross-cutting Q&A pairs up front, detect cross-cutting-ness at query time based on pgvector results and route accordingly:

  • pgvector score < threshold but > 0.4 → content exists but scattered → fall back to document chunk RAG or plan+MCP
  • pgvector hit but thin answer → enrich with MCP tool calls or document chunks
  • pgvector hit with rich answer → serve it (fast, verified)
  • pgvector zero matches → missing content → MCP or UKY fallback

This fits the existing agent graph — rag_answer already evaluates match quality and routes to plan on weak matches. The change: make that evaluation smarter about why the match is weak.
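
The routing sketched above, as a hypothetical decision function (the threshold values and the answer-length cutoff are assumptions, not the agent's actual numbers):

```python
def route_query(best_score: float, match_count: int, answer_len: int,
                strong: float = 0.70, weak_floor: float = 0.40) -> str:
    """Route a query based on pgvector match quality (thresholds illustrative)."""
    if match_count == 0:
        return "mcp_or_fallback"      # missing content entirely
    if best_score < weak_floor:
        return "mcp_or_fallback"      # matches too weak to trust
    if best_score < strong:
        return "document_chunk_rag"   # content exists but scattered
    if answer_len < 200:
        return "enrich_with_tools"    # hit, but thin answer
    return "serve_rag_answer"         # fast, verified
```

The point is that the weak-match branch becomes a signal about *why* the match is weak, not just a binary fall-through to the planner.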

Bugs noted (not fixed)

  • threshold=0.0 falsy in vectorstore.py: threshold or settings.rag_similarity_threshold treats 0.0 as falsy, falling back to default 0.85. Affects diagnostic queries with threshold=0.
  • q21 not logged: "How much funding did the pollinator conservation AI project get?" was classified as non-RAG (40/41 logged).
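
The first bug is the classic falsy-zero trap; a minimal illustration (names simplified from vectorstore.py):

```python
def effective_threshold(threshold, default=0.85):
    """Show the buggy `or` fallback next to the None-aware fix."""
    # Buggy: `threshold or default` treats 0.0 as falsy, so a
    # diagnostic query with threshold=0 silently runs at 0.85.
    buggy = threshold or default
    # Fix: only substitute the default when threshold is actually None.
    fixed = default if threshold is None else threshold
    return buggy, fixed
```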

2026-03-11 — Run 4 reanalysis: UKY hit rate was overcounted

Discovery: The Run 4 summary reported "UKY hits 40/40 (100%)" — but this counted every UKY response as a hit, including hedges like "The provided documents do not contain specific information about Abaqus. Please open a support ticket." Applied the same hedge detection used at runtime (_rag_answer_is_weak in graph.py) to the logged responses.
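
The reanalysis amounts to a substring check over the logged answers; a simplified sketch (the phrase list is illustrative, not the canonical set in _rag_answer_is_weak):

```python
# Hypothetical hedge phrases; the real list lives in
# access-agent/src/agent/graph.py:_rag_answer_is_weak().
HEDGE_PHRASES = (
    "do not contain",
    "does not contain",
    "please open a support ticket",
)

def is_hedged(answer: str) -> bool:
    """True if the answer is a non-answer hedge rather than a genuine hit."""
    lowered = answer.lower()
    return any(phrase in lowered for phrase in HEDGE_PHRASES)
```

Filtering logged responses through a check like this is what turned UKY's nominal 40/40 into 13/40 genuine answers.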

Corrected Run 4 numbers:

| Metric | pgvector | UKY |
|---|---|---|
| Genuine answers | 27/40 (68%) | 13/40 (33%) |
| Hedged / no match | 13 | 27 |

Head-to-head breakdown:

  • Both answered well: 8
  • pgvector only (UKY hedged): 19
  • UKY only (pgvector no match): 5 — all general process questions (allocations, password reset, file transfer)
  • Neither answered well: 8

What this means: pgvector already outperforms UKY 2-to-1. UKY's 19 entity-specific hedges are questions pgvector handles from curated MCP data (software versions, resource specs, NSF awards) that UKY's document corpus simply doesn't cover. The "UKY as strong fallback" framing was wrong — UKY adds value on only 5 questions, all cross-cutting process topics.

Remaining gap (13 questions): 5 cross-cutting process questions (UKY answers, pgvector doesn't) + 8 neither backend handles. A document-chunk fallback for cross-cutting detection would address most of these, but the urgency is lower than previously thought.

Also this session: Updated SYSTEM_OVERVIEW.md routing table with file names, condition explanations, and node descriptions. Synced gist.


WHERE WE ARE — resume point (updated 2026-03-12)

A.3 Run 5 complete (full-system test). Node tracing added. Top-5 matches + enriched synthesis prompt deployed. Next: curate cross-cutting Q&A pairs for the ~5 procedural LLM-only questions.

What's done

  • A.1 (Argilla → pgvector sync) ✅
  • A.2 (dual-RAG logging in access-agent) ✅
  • A.3 Runs 1–5 complete ✅ — RAG-vs-RAG (Runs 1–4), full-system (Run 5)
  • Post-mortem analysis ✅ — gap is content type (entity vs process), not quality
  • UKY corpus obtained ✅ — 75 files in rag_documents/
  • Direction confirmed with Andrew ✅ — generate Q&A pairs from docs, unify in pgvector, retire UKY
  • C.1 corpus categorized ✅ — index at rag_documents/CORPUS_INDEX.md
  • C.2 document extractor built ✅ — committed and pushed on feat/document-extractor
  • C.3 extraction complete ✅ — 832 pairs (604 staging + 228 data), meta-referencing fixed (36% → 0.7%)
  • Outstanding PRs merged ✅ — both access-qa-extraction and access-qa-planning PRs squash-merged
  • C.4 sync + bake-off ✅ — 902 pairs in pgvector (832 document + 70 entity), 40 questions answered
  • A.3 Run 4 reanalysis ✅ — pgvector 68% vs UKY 33% (hedge responses excluded)
  • A.3 Run 5 ✅ — full-system test (pgvector + MCP + routing). 24 RAG, 5 MCP, 12 LLM-only.
  • Node tracing ✅ — node_trace in AgentState, gated behind ?include_trace=true (commits 04342c8, b7a9bec)
  • Top-5 matches + enriched synthesis prompt ✅ — RAG_TOP_K 3→5, prompt rewritten (commit ef43a21)

The core findings

  1. pgvector is already ahead: 27/40 genuine answers vs UKY's 13/40. pgvector covers entity-specific data (software, resources, awards) that UKY cannot.
  2. Full system closes more gaps: MCP tools answer Ranch questions and project search (Run 5). 12 questions remain LLM-only (ungrounded).
  3. Cross-cutting gap splits into two types: Union-type queries ("What resources support GPUs?") should now be addressed by top-5 multi-match synthesis. Procedural queries ("How do I apply for an allocation?") still need hand-curated cross-cutting Q&A pairs (~5 questions).

What's next

  1. Curate cross-cutting Q&A pairs — allocations, Globus, MFA, training, citation (fills the ~5 procedural LLM-only gaps)
  2. Re-run evaluation — test top-5 + enriched prompt against the 41 questions, compare answer quality
  3. Project D — evaluation harness (EVAL_HARNESS_PLAN.md)
  4. Project B — feedback protocol design

2026-03-11 — A.3 Run 5: Full-System Comparison

What we did

Ran all 41 questions through the production agent graph with MCP servers active and UKY disabled. This is the first system-vs-system test: pgvector RAG + MCP tools + LangGraph routing, compared against UKY's baseline responses from Run 4.

Configuration:

  • ENVIRONMENT=production, MCP_SERVER_HOST=host.docker.internal — agent container reaches MCP servers via Docker host bridge
  • UKY_RAG_ENABLED=false, DUAL_RAG_LOGGING=false — no UKY, no dual-RAG comparison path
  • 10 MCP servers running (access-mcp/docker-compose.yml)
  • 902 Q&A pairs in pgvector (832 document + 70 entity)

Results — 41/41 questions answered:

  • 24 via RAG (rag_retrieval)
  • 5 via MCP tools (search_resources, get_resource_hardware, search_events, search_projects)
  • 12 LLM-only (no tools called)

Key findings

  1. MCP tools fill the Ranch gap. Ranch had zero Q&A pairs — q5, q6, q40 now get real answers via search_resources and get_resource_hardware. Even the misspelled q40 ("reanch storage") resolves.
  2. q41 gets a real answer. "What allocation projects are using machine learning?" calls search_projects, returns 20 real projects with PIs and institutions.
  3. q31 routes to events. "What training resources does ACCESS offer?" calls search_events, though the search returned empty results.
  4. Cross-cutting questions (q3, q7, q8, q26-q28, q32-q33, q38) fall to LLM synthesis. Neither RAG nor MCP covers these general ACCESS process questions. Answers read well but are ungrounded — could hallucinate.

What we learned about observability

The API response only exposes tools_used, confidence, execution_strategy, tool_count. We cannot tell from the response:

  • What the classifier decided (static/dynamic/combined)
  • Which graph nodes actually executed (e.g. did RAG fire and fail before falling to LLM?)
  • RAG similarity scores for matched pairs
  • Whether _rag_answer_is_weak triggered
  • The plan content (if the planner node ran)
  • MCP tool arguments and raw responses

The 12 "LLM-only" answers are a black box — we can't distinguish "classified as static, RAG returned nothing, fell through to LLM" from "classified as static, LLM answered directly without trying RAG." Adding a node_trace to QueryResponse is the immediate next step.

Report

Interactive HTML comparison at ~/.agent/diagrams/a3-run5-comparison.html. Matches the Run 3/4 report format: KPI cards, filters, expandable side-by-side comparison. Note: hedge detection has a known issue — see below.

Known issue: hedge detection false positives

The report's hedge detection uses substring matching against phrases like "do not contain", "does not explicitly", etc. UKY q27 ("The provided documents do not specify...") is marked hedged but none of the exact phrases match — the detection was too aggressive. The h2h classification for q27 and potentially others needs review. Should align with access-agent/src/agent/graph.py:_rag_answer_is_weak() which uses the canonical hedge phrases.

Raw data

  • a3_results/run5.json — 41 questions with full agent responses
  • a3_results/uky_baseline_from_run4.json — UKY baseline (40 questions, q21 missing)
  • a3_results/run_a3_test.py — test runner (updated for Run 5: captures full response, saves to JSON)

What's next

  1. Add node tracing to agent graph — track which nodes executed, classification result, RAG scores. Expose in QueryResponse.metadata.
  2. Re-run with tracing — Run 5b with node trace data, so we can see exactly how each question routes.
  3. Fix hedge detection — align report's hedge logic with _rag_answer_is_weak() from the agent codebase.
  4. Tune synthesis prompt: RAG_ONLY_SYNTHESIS_PROMPT produces thin answers when one pair matches. Add links, context.
  5. Curate 20-30 cross-cutting Q&A pairs — allocations, Globus, MFA, training, citation (fills the 12 LLM-only gaps).

To restart Docker (if containers are down)

cd /Users/josephbacal/Projects/sweet-and-fizzy/access-ci/access-qa-service && docker compose up -d
cd /Users/josephbacal/Projects/sweet-and-fizzy/access-ci/access-agent && docker compose up -d

Verify with docker ps — you should see access-agent-agent-1 (8000), qa-service-app (8001), and their postgres/redis containers.

Quick smoke test to confirm everything works

curl -s -X POST http://localhost:8000/api/v1/query \
  -H "Content-Type: application/json" \
  -d '{"query": "What is ACES?", "session_id": "test", "question_id": "test-1"}' | python3 -m json.tool

2026-03-12 — Node tracing added to agent graph

What was built

Two commits on access-agent/main:

04342c8 — Added node_trace to AgentState as Annotated[list[dict[str, Any]], operator.add]. Each graph node appends a structured trace dict recording what it decided:

  • classify: query_type, confidence, domain, rag_endpoint, reason, whether query was expanded
  • rag_answer: source (uky/pgvector), match_count, best_score, rag_used, has_final_answer
  • plan: requires_tools, tool_count, tool names, strategy
  • execute: tools_called, succeeded, failed
  • evaluate: is_helpful, reason
  • recover: action taken, new tools selected
  • synthesize: strategy, answer_length
  • domain_agent: domain, tool_count

The /api/v1/query response includes classification summary and node_trace in metadata. 10 files changed (all node files + state.py + routes.py).
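
The accumulation pattern can be sketched without pulling in LangGraph itself: the Annotated reducer tells the graph runtime to concatenate each node's returned list onto the accumulated state rather than overwrite it (the trace payload fields below are illustrative):

```python
import operator
from typing import Annotated, Any, TypedDict

class AgentState(TypedDict):
    # Other state keys elided. With LangGraph, the reducer in the
    # annotation (operator.add) merges each node's returned list
    # into the accumulated trace instead of replacing it.
    node_trace: Annotated[list[dict[str, Any]], operator.add]

def classify(state: AgentState) -> dict:
    # Each node returns only its own trace entries; the reducer merges them.
    return {"node_trace": [{
        "node": "classify",
        "query_type": "static",
        "confidence": 0.9,
    }]}
```

Because every node appends rather than assigns, the final state carries the full execution path in order, which is exactly what the eval harness needs to inspect.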

b7a9bec — Gated node_trace behind ?include_trace=true query parameter. OTel/Honeycomb (added Jan 2026, commit 422b92d) already provides full distributed tracing for ops. node_trace serves a different consumer: the eval harness needs trace data inline in the API response so it can programmatically inspect routing decisions without querying an external service. Nodes continue accumulating trace dicts in state (zero overhead), but the response only includes them when opted in.
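
The gating itself is a small response-assembly change; a hypothetical sketch (the real handler lives in routes.py, takes include_trace as a FastAPI query parameter, and gets its state from run_agent()):

```python
def build_response(state: dict, include_trace: bool = False) -> dict:
    """Assemble the API response; trace data is included only on opt-in."""
    response = {
        "answer": state["answer"],
        "metadata": {"classification": state.get("classification")},
    }
    if include_trace:  # maps to ?include_trace=true on /api/v1/query
        response["metadata"]["node_trace"] = state.get("node_trace", [])
    return response
```

The trace always accumulates in state; only the serialization is conditional, so the default response stays lean.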

Why two tracing systems

  • OTel/Honeycomb: Ops. Waterfall view of every span, LLM call, MCP tool. External service.
  • node_trace: Eval. Inline in API response. Shows decisions (classifier output, RAG scores, tool selection) not timing. Consumable programmatically by the eval harness (Project D).

2026-03-12 — Top-5 RAG matches + enriched synthesis prompt

Context

Andrew suggested returning the top 5 Q&A pair matches (instead of just the best) and letting the synthesizer combine them — simpler than building document-chunk retrieval for cross-cutting queries. Analysis confirmed the pipeline already supported multiple matches end-to-end (RAG_TOP_K was 3, qa-service accepts up to 20, all downstream code iterates over the full list). The change was purely configuration + prompt.

What changed (commit ef43a21)

config.py: RAG_TOP_K 3 → 5. More material for the synthesizer, especially for union-type cross-cutting queries where related entity pairs from different resources can be combined.

synthesize.py: RAG_ONLY_SYNTHESIS_PROMPT rewritten:

  • "Be concise and direct" → "Answer the question thoroughly"
  • New guideline: when multiple knowledge entries are provided, synthesize into a unified answer
  • URLs/links elevated to IMPORTANT (matching the tool-only and combined prompts)
  • Added practical next steps guidance and support ticket link (both already present in the other prompts, missing here)

What this addresses

  • Thin answers: Single-match answers were near-verbatim because the prompt said "be concise." Now the LLM is instructed to give a complete, actionable response with links and context.
  • Union-type cross-cutting queries: "What resources support GPUs?" now gets 5 entity-scoped pairs (Delta, Bridges-2, ACES, etc.) and the prompt tells the LLM to combine them.
  • Does NOT fix procedural cross-cutting: "How do I apply for an allocation?" still has no matching pairs at any score. These need hand-curated cross-cutting Q&A pairs (~5 questions).

Commit Log (Joe Bacal, Feb 2026 work)

Commits across all repos related to the Feb/March plan. Older commits omitted.

access-qa-extraction (feat/two-shot)

| Hash | Date | Message |
|---|---|---|
| c8fbf0b | 02-26 | docs: remove historical docs, update system overview for two-shot |
| 853e88f | 02-26 | replace GUIDED-TOUR with TRACE-TOUR signposts; fix software name casing |
| 00ba293 | 02-24 | prompt: add rule to quote long lowercase entity names in Q&A |
| 7b0590e | 02-24 | prompt: enhance rule 4 to check free-text fields; update review observations doc |
| 28be413 | 02-24 | fix: entity name interpolation + temporal language + coming-soon cleanup |
| 170e87d | 02-24 | docs: log full corpus scan results — quantify issues #1/#2, add issue #3 |
| 8336f45 | 02-24 | docs: move allocations:72170 finding to Patterns (positive, not an issue) |
| d7f57f5 | 02-24 | docs: log allocations:72170 as non-issue (Jurafsky in source data, verified) |
| 4f9c22d | 02-24 | docs: add retrieval surface area rationale to P1 (self-contained answers) |
| a4f7b66 | 02-24 | docs: note preferred fix for P1 — entity name interpolation in user prompt |
| 43e980e | 02-24 | docs: clarify P1 — entity name needed in both Q and A for RAG |
| 70f9424 | 02-24 | docs: add P1 pattern — questions must be self-contained (cross-cuts #1 and #2) |
| 6084c93 | 02-24 | docs: log issue #2 — decontextualized-question pattern (pervasive) |
| 07da145 | 02-24 | docs: log issue #1 — temporal-assumption in affinity-groups events |
| c4ec468 | 02-24 | docs: add qa-review-observations.md for tracking Argilla review issues |
| 6857db8 | 02-24 | docs: improve signpost comments + fix COMING SOON name normalization |
| 579e10d | 02-24 | fix: normalize "COMING SOON" resource names to lowercase |
| 7bd43ba | 02-24 | wip: some signpost comments |
| 3333c32 | 02-23 | docs: update guided-tour |
| 66e1819 | 02-20 | refactor: adopt two-shot as sole extraction strategy |
| 7803147 | 02-20 | fix: restore missing return in software_discovery._generate_qa_pairs |
| 7791e2b | 02-20 | feat: add --prompt-strategy flag for A/B/C extraction experiment |
| b662dc9 | 02-20 | feat: implement entity-replace for Argilla push |
| 80fc641 | 02-20 | docs: update plan with metadata on human actions on archive records |
| 9d54819 | 02-19 | fix(data-quality): separate NSF program fields and add per-domain LLM guidance |
| 39a4c06 | 02-19 | refactor: remove factoid templates and bonus generation (2-pass pipeline) |
| 5268caa | 02-19 | docs: reflect entity-replace decision and update README |
| 8c9e7f2 | 02-18 | docs: update all docs for freeform extraction pipeline and Argilla dedup |
| 4181585 | 02-18 | feat: roll out freeform extraction to all 5 extractors |
| da79f7d | 02-18 | feat: freeform extraction replaces category+bonus two-pass approach |
| 2833d7b | 02-18 | docs: update for Argilla metadata integration and test count |
| e6d08fa | 02-18 | feat(argilla): add eval_issues and source_ref to Argilla records |
| 3c762c9 | 02-18 | feat(argilla): push judge scores and granularity to Argilla metadata |
| 24c8373 | 02-17 | feat(judge): LLM judge evaluation scores for Q&A pair quality |
| 93a1fb2 | 02-17 | feat(bonus): LLM exploratory questions for entity-unique information |
| 068c08a | 02-17 | feat(incremental): hash-based change detection to skip unchanged entities |
| 9059614 | 02-17 | fix(factoids): data quality guards for template generation |
| 3662d8b | 02-13 | feat(generators): dual-granularity Q&A + extend comparisons to all 5 domains |
| fa2ff93 | 02-12 | fix(nsf-awards): normalize primaryProgram list + skip unused MCPClient |
| f3b1437 | 02-12 | feat(extractors): fixed question categories + direct API for allocations/nsf-awards |
| fdebdab | 02-12 | feat(software-discovery): switch from search terms to list_all_software |
| e33d006 | 02-11 | feat(extract): add max_entities cap for cheap test runs |
| 2da2c32 | 02-10 | Use real enumerations from taxonomies.ts for search terms |
| d987dee | 02-10 | Add report command for MCP coverage stats without LLM calls |
| 6c4667c | 02-10 | Add ExtractionConfig to centralize extraction parameters |
| 0b16ba8 | 02-04 | Fix Q&A pair ID collisions by appending question hash |
| cf384bc | 02-04 | Add Argilla integration for pushing Q&A pairs to human review |
| 51e9877 | 02-04 | Expand extraction queries, fix software-discovery, update docs |
| a69ce2e | 02-02 | Fix allocations and nsf-awards extractors returning 0 results |
| 038d42d | 02-02 | Add dedicated OpenAI backend (LLM_BACKEND=openai) |
| b557300 | 02-01 | Add LOCAL_DIRECTIONS.md and update .env.example for OpenAI setup |
| d45eda1 | 02-01 | Add NSFAwardsExtractor and register in CLI/validator |
| b67eba0 | 02-01 | Add AllocationsExtractor and register in CLI/validator |
| 18c0e49 | 01-31 | Add AffinityGroupsExtractor and fix MCP server port defaults |
| de28ab2 | 01-31 | Add CLAUDE.md and update README with local dev setup guide |

access-qa-service (main)

| Hash | Date | Message |
|---|---|---|
| 5b57ae0 | 02-28 | Fix Argilla sync to work with access-qa-extraction's dataset schema |

access-agent (main / feature/dual-rag-logging)

| Hash | Date | Message |
|---|---|---|
| ef43a21 | 03-12 | feat: return top-5 RAG matches and enrich synthesis prompt |
| b7a9bec | 03-12 | feat: gate node_trace behind ?include_trace query parameter |
| 04342c8 | 03-11 | feat: add node_trace to agent graph for execution path observability |
| de26e37 | | feat: route pgvector through LLM synthesis + fair comparison logging |
| 08809ad | | fix: lower RAG similarity thresholds — 0.85 was filtering valid matches |
| caf7256 | 02-28 | feat: add dual-RAG comparison logging for A.2 evaluation |

access-mcp (main)

| Hash | Date | Message |
|---|---|---|
| bb3b54f | 02-04 | spike: Add list-all fallbacks to allocations and nsf-awards routers |

access-qa-planning (update/mcp-extraction-two-shot)

| Hash | Date | Message |
|---|---|---|
| a84fb4a | 02-26 | docs: GUIDED-TOUR.md → TRACE-TOUR.extract.md in file tree |
| 033c46e | 02-23 | docs: update mcp-extraction-impl to reflect two-shot pipeline and entity-replace |

access-argilla (main)

| Hash | Date | Message |
|---|---|---|
| d5cb931 | 01-30 | chore: init claude file |