@bacalj
Last active March 12, 2026 20:03
ACCESS-CI Dev Journal

Gist mirror: https://gist.github.com/bacalj/a6c6f9726844611df5a09c83884a0e83

2026-02-28 — A.1: Argilla → pgvector sync pipeline

Goal: Get Q&A pairs from Argilla into access-qa-service (pgvector) so they're searchable via semantic search.

Discovery: access-qa-service already had a /admin/sync endpoint and argilla_sync.py — but the code was scaffolded with placeholder logic that didn't match the actual Argilla v2 API or the record schema created by access-qa-extraction.

What was wrong:

  • Used deprecated Argilla v1 API (rg.init() / rg.load())
  • Guessed at record field access (record.inputs, record.question) — Argilla v2 uses record.fields["question"]
  • Looked for entity_id in metadata (doesn't exist) — needs to come from <<SRC:...>> citation markers in the answer text
  • Default dataset name was "access-qa" but extraction creates "qa-review"
  • argilla Python SDK wasn't in the dependencies

What we fixed (commit 5b57ae0 on access-qa-service/main):

  • Rewrote sync_from_argilla() for Argilla v2 client API
  • Correct field access via record.fields
  • Domain/entity_id extracted from citation markers, with source_ref parsing as fallback
  • Added _get_edited_values() to prefer reviewer edits (future-proofing)
  • Judge scores (faithfulness, relevance, completeness, confidence) carried through to pgvector metadata
  • Added argilla>=2.0.0 as a proper dependency
  • Added Argilla env vars to docker-compose.yml for local dev
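The citation-marker fallback in the list above can be sketched as follows. This is a hypothetical reconstruction, not the code from commit 5b57ae0: the marker grammar is assumed to be `<<SRC:domain:entity_id>>`, based on the `<<SRC:documents:...>>` markers that appear later in this journal, and the function name is made up.

```python
import re

# Assumed marker grammar: <<SRC:domain:entity_id>>
SRC_MARKER = re.compile(r"<<SRC:([^:>]+):([^>]+)>>")

def extract_domain_entity(answer_text: str):
    """Return (domain, entity_id) from the first citation marker, else (None, None)."""
    match = SRC_MARKER.search(answer_text)
    if match:
        return match.group(1), match.group(2)
    return None, None
```

When no marker is present, the sync would fall back to parsing `source_ref`, as noted above.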

Test result:

POST /admin/sync → {"synced": 83, "skipped": 0, "citations_loaded": 12, "errors": []}
POST /search {"query": "What is ACES designed for?"} → similarity_score: 1.0, correct answer with citation

83 records across 5 domains (compute-resources, software-discovery, affinity-groups, allocations, nsf-awards) synced and searchable.

Also documented: Andrew's feature/access-agent-integration branch on qa-bot-core — what it changes (Netlify proxy, request body format, response contract) and why it matters for Projects A and B. Added to FEB_MARCH_PLAN.md and synced to the gist.


2026-02-28 — A.2: Dual-RAG comparison logging in access-agent

Goal: Modify rag_answer node to query both UKY document RAG and pgvector Q&A-pair RAG for every question, logging side-by-side results for A.3 evaluation.

Approach: Parallel queries via asyncio.gather, gated behind DUAL_RAG_LOGGING env var. When the flag is off, behavior is identical to before.
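A minimal sketch of that gated parallel-query pattern, assuming placeholder names throughout (the real helpers are `_query_uky_raw()` / `_query_pgvector_raw()` inside access-agent; these stand-ins just illustrate the concurrency and the flag gate):

```python
import asyncio

DUAL_RAG_LOGGING = True  # stands in for settings.DUAL_RAG_LOGGING

async def query_uky(query: str) -> dict:
    # placeholder for the real _query_uky_raw() helper
    return {"backend": "uky", "answer": f"uky answer to: {query}"}

async def query_pgvector(query: str) -> dict:
    # placeholder for the real _query_pgvector_raw() helper
    return {"backend": "pgvector", "matches": []}

async def answer(query: str):
    if not DUAL_RAG_LOGGING:
        # flag off: single-backend behavior, identical to before
        return await query_uky(query), None
    # flag on: query both backends concurrently; return_exceptions=True
    # keeps one backend's failure from cancelling the other query
    return await asyncio.gather(
        query_uky(query), query_pgvector(query), return_exceptions=True
    )

uky_result, pg_result = asyncio.run(answer("What is ACES?"))
```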

What was built (commit caf7256 on access-agent/feature/dual-rag-logging):

  • src/config.py — Added DUAL_RAG_LOGGING: bool = False setting
  • src/rag_comparison_logger.py (new) — SQLAlchemy model + singleton logger for rag_comparison_logs table. Follows same pattern as usage_logger.py. Table auto-creates on first use.
  • src/agent/nodes/rag_answer.py — Added:
    • _query_uky_raw() / _query_pgvector_raw() — lightweight async helpers that return raw results without span side-effects
    • _dual_rag_answer() — runs both queries concurrently, applies same UKY-primary/pgvector-fallback priority, logs comparison to PostgreSQL
    • Gate in rag_answer_node: settings.DUAL_RAG_LOGGING and rag_endpoint → dual path; else unchanged
  • tests/test_rag_answer.py (new) — 19 tests: citation processing, raw query helpers, dual-RAG logic (UKY served, pgvector fallback, both fail, combined query, below threshold, logger failure resilience), flag gating

Comparison log table schema (rag_comparison_logs):

  • Query context: session_id, question_id, query_text, expanded_query, query_type, rag_endpoint
  • UKY result: uky_response, uky_duration_ms, uky_error
  • pgvector result: pgvector_matches (JSONB), pgvector_best_score, pgvector_match_count, pgvector_duration_ms, pgvector_error
  • Outcome: served_by, served_answer_length

Test result: 94 passed (all existing + 19 new), 0 failures.

What's unchanged: state.py, graph.py, routes.py — the graph contract is untouched. The comparison log is a side-effect inside the rag_answer node.

Next (A.3): Deploy the feature/dual-rag-logging branch with DUAL_RAG_LOGGING=true, ask questions via qa-bot-core or direct API, then query rag_comparison_logs to evaluate UKY vs pgvector.


2026-02-28 — A.3 setup: Docker environment stood up and smoke-tested

Decision: Run A.3 locally in Docker, bypass qa-bot-core, use direct curl requests.

Docker setup (two separate compose projects):

  • access-qa-service/docker-compose.yml → qa-service (port 8001) + PostgreSQL (port 5433) + Redis (port 6380)
  • access-agent/docker-compose.yml → agent (port 8000) + PostgreSQL (port 5432) + Redis
  • access-agent reaches access-qa-service via host.docker.internal:8001 (macOS Docker)
  • UKY endpoint is remote — uses same API key as qa-bot-core (ACCESS_AI_API_KEY)

What we did to get access-agent running:

  • Created access-agent/.env from discovered keys: OPENAI_API_KEY (from access-qa-extraction/.env), ACCESS_AI_API_KEY (same key as QA_MODEL_API_KEY in access-serverless-api/.env and REACT_APP_API_KEY in qa-bot-core/.env.local), plus DUAL_RAG_LOGGING=true, QA_SERVICE_URL=http://host.docker.internal:8001, OTEL_ENABLED=false
  • Modified access-agent/docker-compose.yml: added env_file: .env to the agent service (previously all env vars had to be listed explicitly), removed external mcp-network dependency (MCP servers aren't needed for A.3)
  • Built and started: docker compose up --build -d — all containers healthy

Smoke test (successful):

curl -X POST http://localhost:8000/api/v1/query \
  -H "Content-Type: application/json" \
  -d '{"query": "What is Delta?", "session_id": "test-a3-smoke", "question_id": "smoke-1"}'

→ Got a full UKY-sourced response about Delta (NCSA HPC resource), 6s latency, tools_used: ["uky_rag_retrieval"]. Agent is live and hitting UKY successfully.

Note: The API field is query (not question). The MCP server warnings in the agent logs are expected and harmless — those servers aren't on this Docker network and aren't needed for A.3.

Current container status (all running):

| Service | Port | Notes |
| --- | --- | --- |
| access-agent | 8000 | feature/dual-rag-logging branch, DUAL_RAG_LOGGING=true |
| access-agent postgres | 5432 | checkpointing + comparison logs |
| access-qa-service | 8001 | 83 Q&A pairs loaded |
| qa-service postgres | 5433 | pgvector embeddings |
| access-argilla | 6900 | Q&A pair review UI |

2026-03-02 — A.3 pre-flight: similarity threshold bug found

Goal: Verify Docker environment still works and start A.3 evaluation.

Discovery: pgvector is returning zero matches for reasonable queries like "What is ACES?" — even though we have 20 compute-resources Q&A pairs including several about ACES.

Root cause: The similarity threshold is too aggressive. There are two thresholds stacked:

  1. qa-service default (access-qa-service/src/access_qa_service/config.py:26): rag_similarity_threshold = 0.85
  2. access-agent per-query-type thresholds (access-agent/src/config.py:69-71):
    • RAG_THRESHOLD_STATIC = 0.85 (static queries)
    • RAG_THRESHOLD_COMBINED = 0.75 (combined queries)
    • RAG_THRESHOLD_FALLBACK = 0.65 (fallback)

The agent's _query_pgvector_raw() passes the threshold to the qa-service, which uses it to filter results. For static queries (the most common type), both sides enforce 0.85.

The problem: "What is ACES?" scores 0.84 against the best match ("What is ACES designed for?") — just below the 0.85 cutoff. With threshold 0.3, the same query returns 3 solid matches (0.84, 0.82, 0.76). Short or naturally-phrased questions routinely fall just under 0.85 even when the topic matches perfectly.

Evidence:

curl /search {"query": "What is ACES?", "threshold": 0.85}  → 0 matches
curl /search {"query": "What is ACES?", "threshold": 0.3}   → 3 matches (0.84, 0.82, 0.76)
curl /search {"query": "What is ACES designed for?"}         → 1 match (1.0, exact)
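The effect is easy to reproduce with the observed scores. This is a toy illustration of the cutoff behavior, not the service code:

```python
# Observed similarity scores for "What is ACES?" against the Q&A bank
scores = [0.84, 0.82, 0.76]

def filter_matches(scores, threshold):
    # Both the agent and the qa-service drop anything below the threshold
    return [s for s in scores if s >= threshold]

filter_matches(scores, 0.85)  # strict cutoff: the 0.84 best match is dropped
filter_matches(scores, 0.70)  # relaxed cutoff keeps all three matches
```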

The rag_comparison_logs table confirmed this — both smoke test queries ("What is Delta?", "What is ACES?") show pgvector_match_count: 0 and served_by: uky_general.

What needs to happen before running A.3:

  • Lower the threshold so pgvector actually returns matches for natural queries
  • Options: (a) lower RAG_THRESHOLD_STATIC from 0.85 to ~0.70 in access-agent config, (b) use a comparison-specific override in the dual-RAG path so production defaults aren't touched, or (c) lower the qa-service default
  • Rebuild the access-agent container after the change

Also this session: Created SYSTEM_OVERVIEW.md with sequence diagrams of the three main flows (query answering, knowledge base building, per-entity extraction detail). Updated the agent graph illustration in FEB_MARCH_PLAN.md from mermaid to an emoji-annotated state transition table. Synced plan gist.


2026-03-02 — Threshold fix committed

Change: Lowered all RAG similarity thresholds in access-agent/src/config.py (commit 08809ad on feature/dual-rag-logging):

  • RAG_THRESHOLD_STATIC: 0.85 → 0.70
  • RAG_THRESHOLD_COMBINED: 0.75 → 0.60
  • RAG_THRESHOLD_FALLBACK: 0.65 → 0.50
  • RAG_SIMILARITY_THRESHOLD (legacy): 0.85 → 0.70

Why: Best matches for natural queries scored ~0.84, just below the 0.85 cutoff. This was the A.3 blocker — pgvector returned 0 matches for every query.

Still needed: Rebuild the access-agent Docker container (docker compose up --build -d) and verify the fix with a smoke test before proceeding with A.3.


2026-03-02 — A.3 running: container rebuilt, threshold verified, test questions written

Rebuilt container: docker compose up --build -d picked up the threshold fix. All containers healthy.

Threshold fix verified: "What is ACES?" now returns pgvector_match_count: 3, pgvector_best_score: 0.84. Before the fix this was 0 matches. UKY still served (as designed), but pgvector results are now logging.

Pushed branches: access-agent/feature/dual-rag-logging pushed to GitHub (3 commits: A.2 dual-RAG logging, threshold fix). access-qa-service/main push failed — Joe doesn't have write access to necyberteam/access-qa-service (need Andrew to grant).

QAP coverage (83 pairs across 10 entities in 5 domains):

| Domain | Entity | Pairs |
| --- | --- | --- |
| compute-resources | ACES (TAMU) | 10 |
| compute-resources | Ranch (TACC) | 10 |
| software-discovery | ABINIT | 10 |
| software-discovery | Abaqus | 8 |
| allocations | Grassland bird habitat (#72204) | 9 |
| allocations | RL benchmark (#72205) | 10 |
| nsf-awards | Pollinator conservation AI (#2529183) | 10 |
| nsf-awards | Great Salt Lake dust (#2449122) | 8 |
| affinity-groups | Neocortex (PSC) | 5 |
| affinity-groups | REPACSS (TTU) | 3 |

Test questions written: 40 questions in A3_TEST_QUESTIONS.md, organized in 3 groups:

  • pgvector-targeted (24): Questions about entities we have QAPs for
  • UKY-targeted (8): General ACCESS questions our 83 pairs probably don't cover
  • Edge cases (8): Vague, misspelled, or cross-domain questions

Next: Review the test questions, then fire them all through the agent and pull the comparison logs.


2026-03-04 — A.3 Run 2: first full test, unfair comparison discovered

Run 2 executed: Fired all 41 test questions through the agent with DUAL_RAG_LOGGING=true. All 41 succeeded, 40 logged (q41 classified as dynamic/xdmod). Results exported to a3_results/run2.json.

Run 2 results (high-level): UKY answered 36/40, pgvector had matches for 30/40, served by UKY 36, served by pgvector 4.

Built interactive HTML comparison: ~/.agent/diagrams/a3-run2-comparison.html — expandable rows with side-by-side answers, KPI summary, sidebar nav, analysis section.

Synthesis routing fix: pgvector static matches were previously returned as final_answer (raw Q&A pair text). Changed rag_answer.py to set rag_matches + rag_used instead, and added "synthesize" as a third routing option from route_after_rag in graph.py. This routes pgvector results through the LLM synthesis pipeline.
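The resulting three-way routing can be sketched like this. It is a simplified stand-in for the real route_after_rag in graph.py; the state keys used here are assumptions:

```python
def route_after_rag(state: dict) -> str:
    # UKY responses arrive already LLM-synthesized, so they can end the graph
    if state.get("final_answer"):
        return "end"
    # pgvector matches are raw Q&A pair text and need LLM synthesis first
    if state.get("rag_matches"):
        return "synthesize"
    # neither backend produced anything usable
    return "fallback"
```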

Unfair comparison discovered: Run 2's comparison was apples-to-oranges. UKY answers arrive already LLM-synthesized (UKY's own LLM produces polished prose). pgvector answers in the comparison log were raw Q&A pair text — just the verbatim answer field from the curated pair. This made pgvector look worse than it actually is, since the difference was partly in presentation quality, not underlying knowledge.


2026-03-04 — A.3 Run 3: fair apples-to-apples comparison

Goal: Make the comparison fair by synthesizing pgvector answers through our own LLM before logging them.

What was changed:

  • rag_comparison_logger.py — Added pgvector_synthesized_answer = Column(Text) to the model and log_comparison() method
  • rag_answer.py — Imported _format_rag_matches and _synthesize_with_rag_only from synthesize.py. In _dual_rag_answer(), after getting pgvector matches, calls synthesis to produce an LLM-polished answer before logging. This is what the user would actually see if pgvector served the answer.
  • pyproject.toml — Pinned opentelemetry-instrumentation-langchain<0.53 (newer version had a breaking import for GenAICustomOperationName)
  • Database — ALTER TABLE rag_comparison_logs ADD COLUMN pgvector_synthesized_answer text;
  • Test runner — Created a3_results/run_a3_test.py to fire all 41 questions programmatically

Run 3 results (41/41 succeeded, all logged):

| Metric | Value |
| --- | --- |
| UKY answered | 38/41 (93%) |
| pgvector answered (synthesized) | 27/41 (66%) |
| Both answered | 24 (direct comparison possible) |
| UKY only | 14 |
| pgvector only | 3 |
| Avg pgvector similarity score | 0.84 |

Fair comparison conclusions (from HTML analysis at ~/.agent/diagrams/a3-run3-comparison.html):

  1. The two backends are complementary, not competitive. pgvector gives precise, curated answers for entities we've built Q&A pairs for. UKY covers the long tail of general ACCESS knowledge.

  2. pgvector excels on its own domain: Of 25 pgvector-targeted questions (Q1-Q25), pgvector produced synthesized answers for 24 (96%). These are entities with curated Q&A pairs.

  3. UKY handles breadth that pgvector cannot: For 8 UKY-targeted questions (Q26-Q33) about general ACCESS topics (allocations process, Globus, password reset), pgvector answered 0. Our 83 curated pairs simply don't cover these.

  4. UKY produces longer answers (~157% longer on average when both answer the same question). This may reflect UKY's larger document corpus or that our synthesis prompt is more concise. Length alone doesn't indicate quality.

  5. pgvector retrieval is dramatically faster (~5 ms vs ~2500 ms for UKY), though pgvector now also needs LLM synthesis time (not logged separately).

  6. The quality gap is narrower than Run 2 suggested. With LLM synthesis, pgvector answers read as polished, cited responses. The Run 2 comparison was unfairly penalizing pgvector by showing raw text.

  7. Production recommendation: Use both backends — pgvector for high-confidence domain matches, UKY for everything else. This is already the architecture (_dual_rag_answer uses UKY-primary, pgvector-fallback).

Files produced:

  • a3_results/run3.json — Full export of 41 comparison log entries
  • ~/.agent/diagrams/a3-run3-comparison.html — Interactive comparison with analysis
  • a3_results/run_a3_test.py — Test runner script

2026-03-05 — A.3 post-mortem: reframing the question

Realization: The A.3 analysis drifted toward "complementary backends" and fallback architecture. But that wasn't the original question. From FEB_MARCH_PLAN.md:

"proving this approach outperforms document RAG" (line 34)

"We need data on how these two approaches compare before making further investment decisions" (line 65)

"A first because it validates the approach before investing in B" (line 259)

A.3 was a bake-off to decide whether Q&A-pair RAG can replace UKY document RAG — not to build a hybrid system. The "use both" conclusion was the code's existing fallback architecture leaking into the analysis.

Why pgvector lost on breadth (and it's not about quality)

The coverage gap is entirely explained by content type, not approach quality:

What the extraction pipeline covers (5 MCP server domains, entity-focused):

  • Compute resources (23 entities: ACES, Delta, Anvil, etc.)
  • Software discovery (1,404 packages)
  • Allocations (5,440 projects)
  • NSF awards (10,000+ awards)
  • Affinity groups (55 groups)

These are all "what is X" questions about discrete entities. The pipeline pulls structured data from MCP servers and generates Q&A pairs about each entity's properties.

What UKY has that we don't (general ACCESS documentation):

  • How to apply for an allocation (process docs)
  • How to transfer files / use Globus (how-to guides)
  • How to reset your password (account management)
  • Startup vs research allocations (policy docs)
  • Training resources, publication acknowledgment (educational docs)

These are "how do I" questions about ACCESS-wide processes. They don't live in any MCP server — they live in documentation pages, wikis, and guides that UKY ingested.

We don't know exactly what UKY ingested. The plan has an open question: "Need a list from Andrew of what UKY currently ingests." UKY is a black-box API to us.

The actual A.3 verdict

On entity questions where we have Q&A pairs: pgvector hits 96% (24/25). The synthesized answers are concise and accurate. pgvector retrieval is ~500x faster than UKY (~5ms vs ~2500ms).

On general how-to/process questions: pgvector scores 0%. We simply have zero Q&A pairs for these topics because no MCP server serves allocation process docs or file transfer guides.

The gap is coverage, not quality. If we had Q&A pairs for general ACCESS topics, pgvector would likely match or beat UKY on those too.

Decision point

The plan says Project C ("Extract from ACCESS documentation") was deferred with this note:

"Revisit only if a specific content gap surfaces that exists only in documents with no API equivalent (e.g., narrative tutorials, policy explainers)."

A.3 just surfaced exactly that gap. The 14 UKY-only questions are all process/how-to questions with no API equivalent.

Joe needs to decide:

  1. Pursue Project C — Extract Q&A pairs from ACCESS documentation (not MCP entities). This would close the how-to gap and potentially let pgvector replace UKY entirely. Requires: getting the doc list from Andrew, building a document extractor, running extraction + Argilla review.

  2. Keep UKY for breadth, pgvector for precision — Accept the hybrid architecture. UKY handles general questions, pgvector handles entity questions. Simpler, but you're dependent on UKY's black-box system and can't control answer quality for general topics.

  3. Expand entity coverage first — Before tackling docs, run the existing extraction pipeline against more entities (we only extracted 11 of 23 compute resources, 2 of 1,404 software packages, 2 of 5,440 allocations). More entity coverage might narrow the gap enough.

UKY corpus: confirmed undocumented

Searched all repos (access-qa-planning, access-agent, access-mcp, access-qa-extraction, access-qa-bot) for any documentation of what UKY's system ingests. Found:

  • pages-current-production.md — "The Q&A backend is hosted at the University of Kentucky." No corpus details.
  • pages-access-qa-tool.md line 193 — Notes UKY's tech stack as "ChromaDB, llamaindex." No document list.
  • FEB_MARCH_PLAN.md line 233 — Open question: "Need a list from Andrew of what UKY currently ingests."
  • uky_client.py — Black-box HTTP client. No corpus metadata.

No list of UKY's ingested documents exists anywhere in our repos. Andrew is the only source for this information.

Research options independent of Andrew

Even without the UKY document list, there are viable paths to continue the bake-off:

Option A: Analyze UKY's 14 winning answers for source clues. Read the UKY-only responses from Run 3 and determine whether the information is unique to some internal corpus or is general ACCESS knowledge available on public web pages (support.access-ci.org, allocations.access-ci.org). UKY's answers may contain citations, URLs, or verbatim language that reveals their source documents. This takes ~30 minutes and informs all other options.

Option B: Generate Q&A pairs from public ACCESS content. Point the extraction pipeline (or a variant) at public ACCESS web pages — the allocations guide, getting started pages, Globus documentation, password reset instructions. These are freely available. Generate Q&A pairs, curate them, load into pgvector, re-run A.3. This directly tests whether closing the topic gap closes the performance gap.

Option C: Determine whether UKY's advantage is unique knowledge or general glue. The 14 UKY-only questions are all process/how-to topics. If UKY is synthesizing from the same public ACCESS web pages any user can read, then the "advantage" is simply that we haven't generated Q&A pairs for those topics yet — not that UKY has access to privileged information. This reframes the bake-off: it's not documents vs Q&A pairs, it's about coverage breadth.

Option D: Expand entity coverage as a control. Add Q&A pairs for remaining MCP server domains (events, announcements, system-status) and more entities within existing domains (we only extracted 11 of 23 compute resources, 2 of 1,404 software packages). This tests whether broader entity coverage alone changes the picture.

Recommended sequence: A first (30 min, informs everything), then B (directly tests the hypothesis), with D as low-effort parallel work.


2026-03-06 — UKY corpus obtained, plan aligned with Andrew

UKY document corpus now available

Andrew provided the full set of documents that feed UKY's document RAG. They are in rag_documents/ (75 files, 69 MB) split across two directories:

staging/ (~47 files) — The main corpus. Three categories:

| Category | Examples | Count |
| --- | --- | --- |
| Resource descriptions | ACES, Anvil, Bridges-2, Delta, Expanse, Jetstream-2, Neocortex, Sage, Voyager, Fabric (PDFs) | ~20 |
| User guides | ACES, Anvil, Bridges-2, Delta, Expanse, Jetstream-2, Neocortex, Sage (PDFs) | ~10 |
| Process/how-to docs | Allocations, Globus, MFA, add users, progress reports, office hours, events/trainings, system status (docx) | ~12 |
| Misc | ARA description, SDS pointer, CloudBank login, REPACSS overview, Sage edge apps, current projects | ~5 |

data/ (~28 files) — Per-resource software lists (txt/csv) and resource-specific documentation:

  • Software installed lists for ACES, Anvil, Bridges-2, Darwin, Delta, Expanse, Jetstream2, Kyric, Stampede3
  • Darwin docs (user guide, login, filesystems, job management, SLURM, software)
  • Delta docs (user guide, data management)
  • FASTER docs (intro, SLURM partitions, documentation)
  • ACCESS Travel Rewards (md)

Key observation: The process/how-to docs in staging/ (allocations, Globus, MFA, etc.) are exactly the topics UKY beat pgvector on in A.3. The resource descriptions overlap with what MCP extraction already covers. This confirms the A.3 finding — the gap was coverage, not quality.

Alignment with Andrew

Confirmed the shared end state:

  1. Generate Q&A pairs from these documents — Use a similar two-shot process to what exists for MCP entities, but with documents as input. Andrew: "Probably a similar prompt to the MCP tools can work for generating pairs from docs."

  2. One unified Q&A pair bank in pgvector — Entity pairs (from MCP) + document pairs (from these files) living together, searchable as one corpus.

  3. The orchestrator agent decides routing — RAG for factual queries, MCP for live data, both when needed. Andrew: "The orchestrator agent should decide which tools to use (RAG, MCP, both) and then it should get synthesized. That logic should already exist in access-agent."

  4. UKY goes away — Andrew: "Eventually, we will likely not need the document based RAG since the Q&A pairs are faster." pgvector replaces UKY entirely.

Plan: document extraction pipeline

Step 1: Categorize the corpus. Skim the 75 files and bucket them: resource descriptions (entity overlap with MCP), user guides (process/how-to), general ACCESS docs. Identify what's already covered by MCP extraction vs. what's net new.

Step 2: Build a document extractor in access-qa-extraction. Extend the pipeline to accept documents (PDF/docx) as input. The two-shot prompt structure should carry over — battery pass for coverage, discovery pass for insights. New work: document parsing (PDF text extraction, docx reading) and chunking into logical sections.

Step 3: Run extraction on the full corpus. Generate Q&A pairs from all documents. Push to Argilla for review. This produces pairs for the exact topics pgvector was missing — allocations process, Globus, MFA, user guides.

Step 4: Load into pgvector alongside entity pairs. One unified bank: existing 83 entity pairs + document-sourced pairs. All searchable together.

Step 5: Re-run A.3. Same 41 questions (plus new ones if the expanded corpus suggests them). If pgvector-with-documents matches or beats UKY across the board, the bake-off is won.

Step 6: Simplify the agent routing. Once the Q&A pair bank covers everything, the agent graph simplifies: RAG for factual queries, MCP for live data, synthesis when both contribute. Remove the UKY fallback path.


2026-03-09 — Project C step 1: corpus categorized

Skimmed all 75 files in rag_documents/ and produced a categorized index at rag_documents/CORPUS_INDEX.md. No files were moved or renamed — the index is a read-only reference.

Categorization results

| Category | Files | Priority | Rationale |
| --- | --- | --- | --- |
| NET-NEW process/how-to | 20 | First | Fills the exact A.3 gap — allocations, Globus, MFA, Sage, citations, Jupyter |
| USER GUIDE (deep) | 22 | Second | Operational depth (job submission, filesystems, SLURM) beyond MCP surface data |
| MCP OVERLAP (descriptions) | 17 | Later | 1-page resource catalog entries — MCP already covers most of this |
| DATA FILE | 12 | Skip | Raw software lists (name/version lines) — MCP software-discovery covers this |
| POINTER/EMPTY | 4 | Skip | URL stubs or corrupt files with no substantive content |

Key finding: The 20 NET-NEW files are mostly small docx docs — easy to parse, directly address the A.3 gap. The 22 user guides are larger PDFs with real depth (SLURM partitions, data management, module systems). The 17 resource descriptions are 1-page PDFs that overlap with MCP entity data.

Also this session: Consolidated project documentation — SYSTEM_OVERVIEW.md is now single source of truth for architecture, FEB_MARCH_PLAN.md updated with A.3 results and Project C active status, all three docs gist-mirrored, CLAUDE.md updated with document discipline rules.


2026-03-09 — PRs merged, document extractor built (C.2)

Pre-flight: merged outstanding PRs

access-qa-extraction PR #1 (two-shot pipeline) — squash-merged to main. 4,697 additions across the full two-shot extraction pipeline: battery + discovery prompts, LLM judge, incremental cache, Argilla entity-replace, 5 domain extractors, 144 tests. Branch archived on GitHub.

access-qa-planning PR #1 (companion docs) — squash-merged to main. Documentation updates for two-shot pipeline.

access-agent and qa-bot-core — decided to leave on their branches. qa-bot-core is a production product with its own release routine. access-agent's feature/dual-rag-logging branch mixes evaluation scaffolding with production improvements — better to leave as-is until the bake-off concludes.

Smoke-test on main

Reinstalled access-qa-extraction from clean main. 144/144 tests pass. Started mcp-compute-resources Docker container from access-mcp/docker-compose.yml (port 3002). Ran extraction:

qa-extract extract compute-resources --max-entities 1 --no-judge

Produced 8 Q&A pairs for ACES — 5 battery + 3 discovery, all with citations. Two-shot pipeline confirmed working on main.

Built DocumentExtractor (Project C.2)

Branched feat/document-extractor off clean main. Built the document extraction pipeline:

New files:

  • parsers.py — Standalone document parsing module. parse_docx() (python-docx), parse_pdf() (PyMuPDF/fitz), parse_text() (.txt/.md). Dispatcher parse_document() routes by extension. chunk_text() splits large docs (~6000 words) with overlap. clean_extracted_text() collapses PDF/docx whitespace artifacts.
  • extractors/documents.py — DocumentExtractor(BaseExtractor). Overrides run() to skip MCPClient (documents are local files). Discovers files recursively from config.url directory. Each document/chunk = one entity. Two-shot LLM pipeline (battery + discovery), judge evaluation, incremental cache — same as MCP extractors. Uses source="doc_generated", source_ref="doc://documents/{entity_id}".
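The chunking behavior described for parsers.py can be sketched as follows. This is an illustrative reconstruction, not the actual chunk_text(): the ~6000-word size and the overlap value are assumptions drawn from the description above.

```python
def chunk_text(text: str, max_words: int = 6000, overlap: int = 200) -> list:
    """Split text into ~max_words chunks, overlapping so section
    boundaries aren't lost between chunks."""
    words = text.split()
    if len(words) <= max_words:
        return [text]
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
        start += max_words - overlap  # step back by `overlap` words
    return chunks
```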

Modified files:

  • pyproject.toml — Added python-docx>=1.0.0, PyMuPDF>=1.24.0
  • models.py — Added source parameter to QAPair.create() (default "mcp_extraction", backward-compatible)
  • question_categories.py — Added "documents" to DOMAIN_LABELS, DOMAIN_NOTES, and FIELD_GUIDANCE (5 field groups: overview, key procedures, requirements & eligibility, important details, support & contact)
  • config.py — Added "documents" MCPServerConfig with url=os.getenv("DOCUMENTS_DIR", "../rag_documents")
  • extractors/__init__.py — Added DocumentExtractor import and export
  • cli.py — Added DocumentExtractor to EXTRACTORS registry

Smoke tests

Test 1: qa-extract extract documents --max-entities 1 --no-judge — parsed CORPUS_INDEX.md, produced 6 Q&A pairs about the document corpus.

Test 2: qa-extract extract documents --entity-ids "10_1758119706.911465_data-ACCESS-how-to-cite-Jetstream" --no-judge — parsed a docx file from staging/, produced 5 Q&A pairs about Jetstream citation formats and acknowledgment requirements.

Fix: _title_from_stem() was producing ugly titles from Slack-style filenames (e.g., 10_1758119706.911465_data-ACCESS-how-to-cite-Jetstream). Added re.sub(r"^\d+_[\d.]+_", "", stem) to strip the numeric prefix, plus stripping common prefixes (data-ACCESS-, data:, etc.). Title now renders as "How To Cite Jetstream".
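The fix behaves roughly like this sketch (the prefix list beyond data-ACCESS- is an assumption, and the real _title_from_stem() may differ in detail):

```python
import re

def title_from_stem(stem: str) -> str:
    # Strip the Slack-export numeric prefix, e.g. "10_1758119706.911465_"
    stem = re.sub(r"^\d+_[\d.]+_", "", stem)
    # Strip common data-file prefixes (this list is illustrative)
    for prefix in ("data-ACCESS-", "data-", "data:"):
        if stem.startswith(prefix):
            stem = stem[len(prefix):]
            break
    return stem.replace("-", " ").replace("_", " ").title()

title_from_stem("10_1758119706.911465_data-ACCESS-how-to-cite-Jetstream")
# → "How To Cite Jetstream"
```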

All 144 existing tests still pass after all changes.

First extraction run: staging/ directory (C.3)

Ran DOCUMENTS_DIR="../rag_documents/staging" qa-extract extract documents --no-judge on all 47 files in staging/. Took ~25 minutes (94 LLM calls).

Results: 586 Q&A pairs from 83 entities (46 files processed, 1 corrupt file skipped).

| Category | Entities | Pairs | Notes |
| --- | --- | --- | --- |
| NET-NEW docx (process/how-to) | 19 | ~110 | Allocations, MFA, Globus, Sage, Jupyter |
| User Guide PDFs (chunked) | 39 chunks | ~290 | Jetstream2 (20 chunks), Anvil (6), Bridges-2 (5), etc. |
| MCP Overlap descriptions | 17 | ~134 | 1-page resource PDFs |
| Other (ARA, SDS, REPACSS) | 8 | ~52 | Small docs |
  • 100% citation markers (<<SRC:documents:...>>)
  • All pairs use source: "doc_generated"
  • Large PDFs chunked correctly (~6000 words per chunk with overlap)
  • Quality spot-check: questions are natural, answers contain specific details (URLs, commands, step-by-step procedures)
  • Only error: current-access-projects.docx (known corrupt/empty file)

Output at data/output/documents_qa_pairs.jsonl (gitignored). Branch pushed to GitHub.

Not yet run: data/ directory (Darwin, Delta, FASTER docs + ACCESS-Travel-Rewards.md + software lists).

Second extraction run: data/ directory (C.3)

Ran DOCUMENTS_DIR="../rag_documents/data" qa-extract extract documents --no-judge on all files in data/ subdirectories.

Results: 221 Q&A pairs from 29 entities.

| Subdirectory | Entities | Pairs | Notes |
| --- | --- | --- | --- |
| ACCESS-Resources/Darwin/ | 9 | ~65 | Managing jobs, user guide, compiling, file systems, etc. |
| ACCESS-Resources/Delta/ | 3 chunks | ~25 | Large PDF chunked into 3 |
| ACCESS-Resources/FASTER/ | 4 | ~30 | User guide, system overview, jobs, file systems |
| ACCESS-Travel-Rewards.md | 1 | ~8 | Travel reimbursement program |
| ACCESS-Software-Installed-by-resource/ | 12 | ~93 | Software lists (package names/versions — generic Q&A quality) |
  • Software-list files produced generic "what software is installed on X" pairs — adequate but not high-value. Argilla reviewers can reject low-quality ones.
  • Darwin and FASTER docs produced strong procedural content (SLURM commands, file system paths, compilation flags).

Combined output and Argilla push

Saved staging/ output as documents_staging_qa_pairs.jsonl, combined both runs into documents_all_qa_pairs.jsonl (807 total pairs).

Pushed all 807 pairs to Argilla: qa-extract push data/output/documents_all_qa_pairs.jsonl. Records visible in qa-review dataset at http://localhost:6900.

Docker note: Argilla containers had stale network references from previous sessions. Fixed with docker compose down --remove-orphans && docker network prune -f && docker compose up -d.

Added document_name metadata field

Problem: When reviewing pairs in Argilla, all 807 records had domain: "documents" with no way to tell which source document they came from — the only clue was the source_ref URI (e.g., doc://documents/10_1758119706.911465_data-ACCESS-how-to-cite-Jetstream), which is opaque. For MCP-extracted pairs, domain provides natural grouping (compute-resources, allocations, etc.), but document pairs lack an equivalent.

Fix: Added document_name as an optional metadata field on QAMetadata, populated from the existing _title_from_stem() helper in DocumentExtractor. The field flows through to Argilla as a filterable TermsMetadataProperty. MCP extractors are unaffected (field defaults to None).

Files changed: models.py (field + factory param), documents.py (passes title), argilla_client.py (schema + record metadata).

Re-extraction: Re-ran both staging/ (611 pairs) and data/ (214 pairs) = 825 total. Deleted old Argilla dataset (no schema for document_name), pushed fresh. 72 unique document names now filterable in Argilla.
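
The shape of the models.py change can be sketched with a dataclass (the real QAMetadata is presumably a Pydantic model; fields other than document_name are illustrative):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QAMetadata:
    """Sketch of the metadata model; only document_name is new."""
    domain: str
    entity_id: str
    document_name: Optional[str] = None  # stays None for MCP-extracted pairs

# Document pairs get a title from _title_from_stem(); MCP pairs leave the default.
doc_meta = QAMetadata("documents", "darwin-filesystems",
                      document_name="Darwin Filesystems Storage")
mcp_meta = QAMetadata("compute-resources", "aces")
```

Because the field defaults to None, the MCP extractors need no changes, and Argilla can expose it as a filterable metadata property only for document pairs.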

Fixed source_data for document pairs

Problem: For MCP entity pairs, source_data contains the full entity JSON that the LLM used to generate the Q&A pair — reviewer sees exactly what went in. For document pairs, source_data was set to content_preview: chunk[:500] — the first 500 characters of the chunk. This was misleading: it looked like the source material but only represented a tiny slice of the ~6000-word chunk the LLM actually saw. Reviewers would see a content_preview about topic X when the Q&A pair was about topic Y (from elsewhere in the same chunk).

Fix: Replaced content_preview with a reference: {file, chunk, total_chunks, word_count}. For non-chunked documents, chunk and total_chunks are null. The reviewer sees the file and chunk number; the actual document is in rag_documents/.

Design note on chunking: Large documents (>6000 words) are split into sequential ~6000-word chunks with 500-word overlap. Each chunk is processed as a separate entity — the LLM only sees one chunk at a time, not the whole document. So chunk 9 of a 20-chunk Jetstream PDF starts at roughly word 44,000. This is why the source_ref includes the chunk number (e.g., doc://documents/jetstream-2-user-guide__chunk_9).
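
A minimal sketch of this chunking scheme (word counts and the function name are illustrative, not the actual DocumentExtractor code):

```python
def chunk_words(text: str, size: int = 6000, overlap: int = 500) -> list[str]:
    """Split text into sequential word-based chunks with overlap.

    Chunk n (1-indexed) starts at word (n - 1) * (size - overlap), so
    chunk 9 of a 20-chunk document starts near word 8 * 5500 = 44,000.
    """
    words = text.split()
    if len(words) <= size:
        return [text]  # small documents stay whole
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # last chunk absorbed the tail
    return chunks
```

Each returned chunk is then processed as its own entity, which is why the chunk number must travel with the source_ref.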


2026-03-10 — Evaluation harness design (Project D)

Andrew asked about making the bake-off self-service: editable golden questions, runnable by the team with their own tokens, comparing different agent configurations ("tool combinations"). Key points from the conversation:

  • Golden questions: Andrew wants a curated benchmark set that people can view, add, and modify. These are distinct from the Q&A pairs in Argilla — they're the test inputs used to evaluate the agent.
  • Different tool combinations: Not UKY-vs-pgvector (UKY is going away), but different configurations of our agent — RAG thresholds, MCP server subsets, model choices. Each configuration is a "scenario."
  • Self-service: Team members should be able to run evaluations and see results without Joe in the loop.
  • Ongoing process: Re-run as the agent evolves, not a one-shot comparison.

Designed the evaluation harness. Full design saved as EVAL_HARNESS_PLAN.md. Summary:

  • Golden questions in YAML (merge A3_TEST_QUESTIONS.md + e2e_test_cases.csv → ~55 questions with structured assertions)
  • Scenario configs as YAML files overriding Settings env vars
  • CLI runner calling run_agent() directly (not HTTP) to capture full AgentState
  • HTML report generator producing self-contained comparison pages (matching a3-run3 visual style)
  • New access-agent/eval/ directory

Added as Project D in FEB_MARCH_PLAN.md (D.1–D.4), parallel with Project B after C.4 completes.

Pivot: Initially designed as a CLI-based Python tool (access-agent/eval/). Revised to a static web app on Netlify (eval-ui/) — no Python environment needed, users just open a browser. Golden questions and scenarios bundled at build time, results displayed inline and exportable as JSON. Two open design questions flagged: (1) how scenarios actually change agent behavior given the current API doesn't accept config overrides, and (2) API key routing (server-side vs pass-through). Plan saved as EVAL_HARNESS_PLAN.md.

This is future work — immediate next step remains C.4 (review Argilla, sync pgvector, re-run A.3).


2026-03-10 — C.4: Meta-referencing fix, re-extraction, A.3 Run 4

Meta-referencing problem in document Q&A pairs

Spot-checked the 825 document pairs in Argilla and found a systematic quality issue: 36% (300/825) of generated questions referenced the source documents rather than the subject matter.

Examples:

  • Wrong: "What are the important quotas and limits mentioned in the Darwin Filesystems Storage document?"
  • Right: "What are the storage quotas on Darwin?"

Root cause analysis: Two contributing factors:

  1. FIELD_GUIDANCE field group #1 said "what is this document about?" — 90% of seq-1 (overview) pairs were meta-referencing.
  2. Entity titles included document-type suffixes ("Jetstream 2 User Guide") which primed the LLM to treat the document as the subject.

Prompt and code fixes

question_categories.py — Two changes:

  • Added explicit anti-meta-referencing instruction to DOMAIN_NOTES["documents"] with wrong/right examples.
  • Reworded all 5 field groups in FIELD_GUIDANCE["documents"] to avoid document-referencing (e.g., "Overview — what is this topic about?" instead of "what is this document about?").

documents.py — Added regex to _title_from_stem() to strip document-type suffixes ("User Guide", "Manual", "Handbook", etc.) so the LLM sees "Jetstream 2" instead of "Jetstream 2 User Guide" as the entity name.
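
The suffix stripping can be sketched as a single regex pass (the suffix list here is illustrative; the actual _title_from_stem() pattern may cover more variants):

```python
import re

# Hypothetical suffix list based on the journal entry ("User Guide",
# "Manual", "Handbook", etc.); longest alternatives listed first.
_DOC_SUFFIX = re.compile(
    r"\s+(user guide|guide|manual|handbook)\s*$",
    re.IGNORECASE,
)

def strip_doc_suffix(title: str) -> str:
    """Drop a trailing document-type suffix so the entity name is the subject."""
    return _DOC_SUFFIX.sub("", title).strip()
```

With this in place the LLM sees "Jetstream 2" rather than "Jetstream 2 User Guide", which removes the priming toward document-referencing questions.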

Re-extraction results

Three extraction runs after iterating on fixes:

  1. Staging (first fix): 608 pairs, 10% meta (down from 36%)
  2. Staging (with title suffix fix): 604 pairs, 0.9% meta (6 remaining)
  3. Data directory: 228 pairs

Combined: 832 pairs, 6 meta-referencing (0.7%). Cleared Argilla and pushed fresh.

A.3 Run 4 — the bake-off

Brought up all services locally (qa-service on 8001, access-agent on 8000). Synced 832 document pairs from Argilla and loaded 70 entity pairs via JSONL. Total: 902 pairs in pgvector.

Fired all 41 test questions. Results:

| Metric | Run 3 (83 pairs) | Run 4 (902 pairs) |
|---|---|---|
| UKY hits | 38/41 (93%) | 40/40 (100%) |
| pgvector hits | 27/41 (66%) | 27/40 (67%) |
| pgvector avg latency | ~5ms | ~30ms |

pgvector coverage stayed flat at 67% despite 10x more pairs.

The architectural insight

The 13 missed questions fall into two categories:

  1. Missing source content (4 questions) — Ranch storage has zero Q&A pairs because no Ranch documents exist in rag_documents/ and Ranch wasn't returned from MCP in the extraction run that generated the original test questions.

  2. No cross-cutting Q&A pairs (9 questions) — General ACCESS questions ("How do I apply for an allocation?", "How do I transfer files between resources?", "What training does ACCESS offer?") have no matching pairs even though we have 104 allocation mentions, 50 transfer/Globus mentions, and 40 training mentions across our pairs. The problem: all those mentions are entity-scoped. We have "How do I cite Jetstream?" but not "How do I acknowledge ACCESS?" We have "What allocations does Anvil support?" but not "How do I apply for an allocation?"

The extraction pipeline processes one document at a time, so it only ever generates entity-scoped Q&A pairs. It will never produce cross-cutting "How does ACCESS work in general?" pairs from a single-document prompt.

UKY's advantage is architectural: chunk-level retrieval at query time lets it pull relevant fragments from multiple documents and synthesize on the fly. It doesn't need a pre-generated answer that matches — it just needs chunks that are individually relevant. Our Q&A-pair RAG needs a pair whose question semantically matches the user's question, and no single entity-scoped pair matches a cross-cutting query closely enough.

Decision questions for Andrew

  1. Manually curate cross-cutting pairs — Write 20-30 general ACCESS Q&A pairs by hand. Fast, targeted, but doesn't scale.
  2. Add a cross-cutting extraction pass — Feed the LLM multiple documents simultaneously and ask for general questions that span topics. New pipeline capability.
  3. Keep UKY as fallback for general questions — Accept the hybrid. pgvector for entity questions (fast, verified), UKY for cross-cutting (slow, unverified).
  4. Lower similarity thresholds — Some misses scored 0.55-0.68, not far from the 0.70 cutoff. Won't fix the 0.28-0.49 misses.
  5. Detect cross-cutting-ness at query time — Instead of pre-generating cross-cutting pairs, use pgvector match quality as a signal: low scores with scattered partial matches → route to document chunk RAG or MCP tools. Fits existing agent graph routing.

Files produced

  • a3_results/run4.json — 40 comparison log entries
  • a3_results/run4_enriched.json — enriched with low-threshold best-possible scores
  • ~/.agent/diagrams/a3-run4-bakeoff.html — interactive comparison visualization

Answer richness gap (second dimension)

Even when pgvector hits, many answers are thinner than UKY's. Investigated whether pgvector answers were bypassing LLM synthesis — confirmed they are NOT: _dual_rag_answer() calls _synthesize_with_rag_only() for every pgvector match. The real issue: a single pre-digested Q&A pair gives the synthesis LLM very little to work with, so it returns near-verbatim text. UKY pulls multiple document chunks and the LLM has more raw material to synthesize a richer answer.

However, reviewing side-by-side answers revealed a more nuanced picture:

  • Some pgvector answers are actually better than UKY's (more precise, directly relevant)
  • Some just need link enrichment (the synthesis prompt doesn't encourage adding URLs)
  • Some questions UKY can't answer but pgvector can (entity-specific data from MCP)

This shifts the framing from "pgvector vs UKY" to "how to combine them intelligently."

Quick fix (low-effort, high-impact): The RAG_ONLY_SYNTHESIS_PROMPT in synthesize.py says "Be concise and direct" — this is why the LLM returns near-verbatim single sentences. Updating the prompt to encourage link inclusion, practical context, and resource pointers would immediately enrich thin answers without any architectural changes. The Q&A pair metadata already carries domain and entity_id which could drive link generation.

5th strategic option: cross-cutting detection at query time

Instead of generating cross-cutting Q&A pairs up front, detect cross-cutting-ness at query time based on pgvector results and route accordingly:

  • pgvector score < threshold but > 0.4 → content exists but scattered → fall back to document chunk RAG or plan+MCP
  • pgvector hit but thin answer → enrich with MCP tool calls or document chunks
  • pgvector hit with rich answer → serve it (fast, verified)
  • pgvector zero matches → missing content → MCP or UKY fallback

This fits the existing agent graph — rag_answer already evaluates match quality and routes to plan on weak matches. The change: make that evaluation smarter about why the match is weak.
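
The routing sketched above, as a hypothetical decision function (the threshold values and the answer-length cutoff are assumptions, not the agent's actual numbers):

```python
def route_query(best_score: float, match_count: int, answer_len: int,
                strong: float = 0.70, weak_floor: float = 0.40) -> str:
    """Route a query based on pgvector match quality (thresholds illustrative)."""
    if match_count == 0:
        return "mcp_or_fallback"      # missing content entirely
    if best_score < weak_floor:
        return "mcp_or_fallback"      # matches too weak to trust
    if best_score < strong:
        return "document_chunk_rag"   # content exists but scattered
    if answer_len < 200:
        return "enrich_with_tools"    # hit, but thin answer
    return "serve_rag_answer"         # fast, verified
```

The point is that the weak-match branch becomes a signal about *why* the match is weak, not just a binary fall-through to the planner.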

Bugs noted (not fixed)

  • threshold=0.0 falsy in vectorstore.py: threshold or settings.rag_similarity_threshold treats 0.0 as falsy, falling back to default 0.85. Affects diagnostic queries with threshold=0.
  • q21 not logged: "How much funding did the pollinator conservation AI project get?" was classified as non-RAG (40/41 logged).
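
The first bug is the classic falsy-zero trap; a minimal illustration (names simplified from vectorstore.py):

```python
def effective_threshold(threshold, default=0.85):
    """Show the buggy `or` fallback next to the None-aware fix."""
    # Buggy: `threshold or default` treats 0.0 as falsy, so a
    # diagnostic query with threshold=0 silently runs at 0.85.
    buggy = threshold or default
    # Fix: only substitute the default when threshold is actually None.
    fixed = default if threshold is None else threshold
    return buggy, fixed
```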

2026-03-11 — Run 4 reanalysis: UKY hit rate was overcounted

Discovery: The Run 4 summary reported "UKY hits 40/40 (100%)" — but this counted every UKY response as a hit, including hedges like "The provided documents do not contain specific information about Abaqus. Please open a support ticket." Applied the same hedge detection used at runtime (_rag_answer_is_weak in graph.py) to the logged responses.
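
The reanalysis amounts to a substring check over the logged answers; a simplified sketch (the phrase list is illustrative, not the canonical set in _rag_answer_is_weak):

```python
# Hypothetical hedge phrases; the real list lives in
# access-agent/src/agent/graph.py:_rag_answer_is_weak().
HEDGE_PHRASES = (
    "do not contain",
    "does not contain",
    "please open a support ticket",
)

def is_hedged(answer: str) -> bool:
    """True if the answer is a non-answer hedge rather than a genuine hit."""
    lowered = answer.lower()
    return any(phrase in lowered for phrase in HEDGE_PHRASES)
```

Filtering logged responses through a check like this is what turned UKY's nominal 40/40 into 13/40 genuine answers.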

Corrected Run 4 numbers:

| Metric | pgvector | UKY |
|---|---|---|
| Genuine answers | 27/40 (68%) | 13/40 (33%) |
| Hedged / no match | 13 | 27 |

Head-to-head breakdown:

  • Both answered well: 8
  • pgvector only (UKY hedged): 19
  • UKY only (pgvector no match): 5 — all general process questions (allocations, password reset, file transfer)
  • Neither answered well: 8

What this means: pgvector already outperforms UKY 2-to-1. UKY's 19 entity-specific hedges are questions pgvector handles from curated MCP data (software versions, resource specs, NSF awards) that UKY's document corpus simply doesn't cover. The "UKY as strong fallback" framing was wrong — UKY adds value on only 5 questions, all cross-cutting process topics.

Remaining gap (13 questions): 5 cross-cutting process questions (UKY answers, pgvector doesn't) + 8 neither backend handles. A document-chunk fallback for cross-cutting detection would address most of these, but the urgency is lower than previously thought.

Also this session: Updated SYSTEM_OVERVIEW.md routing table with file names, condition explanations, and node descriptions. Synced gist.


WHERE WE ARE — resume point (updated 2026-03-12)

A.3 Run 5 complete (full-system test). Node tracing added. Top-5 matches + enriched synthesis prompt deployed. Next: curate cross-cutting Q&A pairs for the ~5 procedural LLM-only questions.

What's done

  • A.1 (Argilla → pgvector sync) ✅
  • A.2 (dual-RAG logging in access-agent) ✅
  • A.3 Runs 1–5 complete ✅ — RAG-vs-RAG (Runs 1–4), full-system (Run 5)
  • Post-mortem analysis ✅ — gap is content type (entity vs process), not quality
  • UKY corpus obtained ✅ — 75 files in rag_documents/
  • Direction confirmed with Andrew ✅ — generate Q&A pairs from docs, unify in pgvector, retire UKY
  • C.1 corpus categorized ✅ — index at rag_documents/CORPUS_INDEX.md
  • C.2 document extractor built ✅ — committed and pushed on feat/document-extractor
  • C.3 extraction complete ✅ — 832 pairs (604 staging + 228 data), meta-referencing fixed (36% → 0.7%)
  • Outstanding PRs merged ✅ — both access-qa-extraction and access-qa-planning PRs squash-merged
  • C.4 sync + bake-off ✅ — 902 pairs in pgvector (832 document + 70 entity), 40 questions answered
  • A.3 Run 4 reanalysis ✅ — pgvector 68% vs UKY 33% (hedge responses excluded)
  • A.3 Run 5 ✅ — full-system test (pgvector + MCP + routing). 24 RAG, 5 MCP, 12 LLM-only.
  • Node tracing ✅ — node_trace in AgentState, gated behind ?include_trace=true (commits 04342c8, b7a9bec)
  • Top-5 matches + enriched synthesis prompt ✅ — RAG_TOP_K 3→5, prompt rewritten (commit ef43a21)

The core findings

  1. pgvector is already ahead: 27/40 genuine answers vs UKY's 13/40. pgvector covers entity-specific data (software, resources, awards) that UKY cannot.
  2. Full system closes more gaps: MCP tools answer Ranch questions and project search (Run 5). 12 questions remain LLM-only (ungrounded).
  3. Cross-cutting gap splits into two types: Union-type queries ("What resources support GPUs?") should now be addressed by top-5 multi-match synthesis. Procedural queries ("How do I apply for an allocation?") still need hand-curated cross-cutting Q&A pairs (~5 questions).

What's next

  1. Curate cross-cutting Q&A pairs — allocations, Globus, MFA, training, citation (fills the ~5 procedural LLM-only gaps)
  2. Re-run evaluation — test top-5 + enriched prompt against the 41 questions, compare answer quality
  3. Project D — evaluation harness (EVAL_HARNESS_PLAN.md)
  4. Project B — feedback protocol design

2026-03-11 — A.3 Run 5: Full-System Comparison

What we did

Ran all 41 questions through the production agent graph with MCP servers active and UKY disabled. This is the first system-vs-system test: pgvector RAG + MCP tools + LangGraph routing, compared against UKY's baseline responses from Run 4.

Configuration:

  • ENVIRONMENT=production, MCP_SERVER_HOST=host.docker.internal — agent container reaches MCP servers via Docker host bridge
  • UKY_RAG_ENABLED=false, DUAL_RAG_LOGGING=false — no UKY, no dual-RAG comparison path
  • 10 MCP servers running (access-mcp/docker-compose.yml)
  • 902 Q&A pairs in pgvector (832 document + 70 entity)

Results — 41/41 questions answered:

  • 24 via RAG (rag_retrieval)
  • 5 via MCP tools (search_resources, get_resource_hardware, search_events, search_projects)
  • 12 LLM-only (no tools called)

Key findings

  1. MCP tools fill the Ranch gap. Ranch had zero Q&A pairs — q5, q6, q40 now get real answers via search_resources and get_resource_hardware. Even the misspelled q40 ("reanch storage") resolves.
  2. q41 gets a real answer. "What allocation projects are using machine learning?" calls search_projects, returns 20 real projects with PIs and institutions.
  3. q31 routes to events. "What training resources does ACCESS offer?" calls search_events, though the search returned empty results.
  4. Cross-cutting questions (q3, q7, q8, q26-q28, q32-q33, q38) fall to LLM synthesis. Neither RAG nor MCP covers these general ACCESS process questions. Answers read well but are ungrounded — could hallucinate.

What we learned about observability

The API response only exposes tools_used, confidence, execution_strategy, tool_count. We cannot tell from the response:

  • What the classifier decided (static/dynamic/combined)
  • Which graph nodes actually executed (e.g. did RAG fire and fail before falling to LLM?)
  • RAG similarity scores for matched pairs
  • Whether _rag_answer_is_weak triggered
  • The plan content (if the planner node ran)
  • MCP tool arguments and raw responses

The 12 "LLM-only" answers are a black box — we can't distinguish "classified as static, RAG returned nothing, fell through to LLM" from "classified as static, LLM answered directly without trying RAG." Adding a node_trace to QueryResponse is the immediate next step.

Report

Interactive HTML comparison at ~/.agent/diagrams/a3-run5-comparison.html. Matches the Run 3/4 report format: KPI cards, filters, expandable side-by-side comparison. Note: hedge detection has a known issue — see below.

Known issue: hedge detection false positives

The report's hedge detection uses substring matching against phrases like "do not contain", "does not explicitly", etc. UKY q27 ("The provided documents do not specify...") is marked hedged but none of the exact phrases match — the detection was too aggressive. The h2h classification for q27 and potentially others needs review. Should align with access-agent/src/agent/graph.py:_rag_answer_is_weak() which uses the canonical hedge phrases.

Raw data

  • a3_results/run5.json — 41 questions with full agent responses
  • a3_results/uky_baseline_from_run4.json — UKY baseline (40 questions, q21 missing)
  • a3_results/run_a3_test.py — test runner (updated for Run 5: captures full response, saves to JSON)

What's next

  1. Add node tracing to agent graph — track which nodes executed, classification result, RAG scores. Expose in QueryResponse.metadata.
  2. Re-run with tracing — Run 5b with node trace data, so we can see exactly how each question routes.
  3. Fix hedge detection — align report's hedge logic with _rag_answer_is_weak() from the agent codebase.
  4. Tune synthesis prompt: RAG_ONLY_SYNTHESIS_PROMPT produces thin answers when one pair matches. Add links, context.
  5. Curate 20-30 cross-cutting Q&A pairs — allocations, Globus, MFA, training, citation (fills the 12 LLM-only gaps).

To restart Docker (if containers are down)

cd /Users/josephbacal/Projects/sweet-and-fizzy/access-ci/access-qa-service && docker compose up -d
cd /Users/josephbacal/Projects/sweet-and-fizzy/access-ci/access-agent && docker compose up -d

Verify with docker ps — you should see access-agent-agent-1 (8000), qa-service-app (8001), and their postgres/redis containers.

Quick smoke test to confirm everything works

curl -s -X POST http://localhost:8000/api/v1/query \
  -H "Content-Type: application/json" \
  -d '{"query": "What is ACES?", "session_id": "test", "question_id": "test-1"}' | python3 -m json.tool

2026-03-12 — Node tracing added to agent graph

What was built

Two commits on access-agent/main:

04342c8 — Added node_trace to AgentState as Annotated[list[dict[str, Any]], operator.add]. Each graph node appends a structured trace dict recording what it decided:

  • classify: query_type, confidence, domain, rag_endpoint, reason, whether query was expanded
  • rag_answer: source (uky/pgvector), match_count, best_score, rag_used, has_final_answer
  • plan: requires_tools, tool_count, tool names, strategy
  • execute: tools_called, succeeded, failed
  • evaluate: is_helpful, reason
  • recover: action taken, new tools selected
  • synthesize: strategy, answer_length
  • domain_agent: domain, tool_count

The /api/v1/query response includes classification summary and node_trace in metadata. 10 files changed (all node files + state.py + routes.py).
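
The accumulation pattern can be sketched without pulling in LangGraph itself: the Annotated reducer tells the graph runtime to concatenate each node's returned list onto the accumulated state rather than overwrite it (the trace payload fields below are illustrative):

```python
import operator
from typing import Annotated, Any, TypedDict

class AgentState(TypedDict):
    # Other state keys elided. With LangGraph, the reducer in the
    # annotation (operator.add) merges each node's returned list
    # into the accumulated trace instead of replacing it.
    node_trace: Annotated[list[dict[str, Any]], operator.add]

def classify(state: AgentState) -> dict:
    # Each node returns only its own trace entries; the reducer merges them.
    return {"node_trace": [{
        "node": "classify",
        "query_type": "static",
        "confidence": 0.9,
    }]}
```

Because every node appends rather than assigns, the final state carries the full execution path in order, which is exactly what the eval harness needs to inspect.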

b7a9bec — Gated node_trace behind ?include_trace=true query parameter. OTel/Honeycomb (added Jan 2026, commit 422b92d) already provides full distributed tracing for ops. node_trace serves a different consumer: the eval harness needs trace data inline in the API response so it can programmatically inspect routing decisions without querying an external service. Nodes continue accumulating trace dicts in state (zero overhead), but the response only includes them when opted in.
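
The gating itself is a small response-assembly change; a hypothetical sketch (the real handler lives in routes.py, takes include_trace as a FastAPI query parameter, and gets its state from run_agent()):

```python
def build_response(state: dict, include_trace: bool = False) -> dict:
    """Assemble the API response; trace data is included only on opt-in."""
    response = {
        "answer": state["answer"],
        "metadata": {"classification": state.get("classification")},
    }
    if include_trace:  # maps to ?include_trace=true on /api/v1/query
        response["metadata"]["node_trace"] = state.get("node_trace", [])
    return response
```

The trace always accumulates in state; only the serialization is conditional, so the default response stays lean.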

Why two tracing systems

  • OTel/Honeycomb: Ops. Waterfall view of every span, LLM call, MCP tool. External service.
  • node_trace: Eval. Inline in API response. Shows decisions (classifier output, RAG scores, tool selection) not timing. Consumable programmatically by the eval harness (Project D).

2026-03-12 — Top-5 RAG matches + enriched synthesis prompt

Context

Andrew suggested returning the top 5 Q&A pair matches (instead of just the best) and letting the synthesizer combine them — simpler than building document-chunk retrieval for cross-cutting queries. Analysis confirmed the pipeline already supported multiple matches end-to-end (RAG_TOP_K was 3, qa-service accepts up to 20, all downstream code iterates over the full list). The change was purely configuration + prompt.

What changed (commit ef43a21)

config.py: RAG_TOP_K 3 → 5. More material for the synthesizer, especially for union-type cross-cutting queries where related entity pairs from different resources can be combined.

synthesize.py: RAG_ONLY_SYNTHESIS_PROMPT rewritten:

  • "Be concise and direct" → "Answer the question thoroughly"
  • New guideline: when multiple knowledge entries are provided, synthesize into a unified answer
  • URLs/links elevated to IMPORTANT (matching the tool-only and combined prompts)
  • Added practical next steps guidance and support ticket link (both already present in the other prompts, missing here)

What this addresses

  • Thin answers: Single-match answers were near-verbatim because the prompt said "be concise." Now the LLM is instructed to give a complete, actionable response with links and context.
  • Union-type cross-cutting queries: "What resources support GPUs?" now gets 5 entity-scoped pairs (Delta, Bridges-2, ACES, etc.) and the prompt tells the LLM to combine them.
  • Does NOT fix procedural cross-cutting: "How do I apply for an allocation?" still has no matching pairs at any score. These need hand-curated cross-cutting Q&A pairs (~5 questions).

Commit Log (Joe Bacal, Feb 2026 work)

Commits across all repos related to the Feb/March plan. Older commits omitted.

access-qa-extraction (feat/two-shot)

| Hash | Date | Message |
|---|---|---|
| c8fbf0b | 02-26 | docs: remove historical docs, update system overview for two-shot |
| 853e88f | 02-26 | replace GUIDED-TOUR with TRACE-TOUR signposts; fix software name casing |
| 00ba293 | 02-24 | prompt: add rule to quote long lowercase entity names in Q&A |
| 7b0590e | 02-24 | prompt: enhance rule 4 to check free-text fields; update review observations doc |
| 28be413 | 02-24 | fix: entity name interpolation + temporal language + coming-soon cleanup |
| 170e87d | 02-24 | docs: log full corpus scan results — quantify issues #1/#2, add issue #3 |
| 8336f45 | 02-24 | docs: move allocations:72170 finding to Patterns (positive, not an issue) |
| d7f57f5 | 02-24 | docs: log allocations:72170 as non-issue (Jurafsky in source data, verified) |
| 4f9c22d | 02-24 | docs: add retrieval surface area rationale to P1 (self-contained answers) |
| a4f7b66 | 02-24 | docs: note preferred fix for P1 — entity name interpolation in user prompt |
| 43e980e | 02-24 | docs: clarify P1 — entity name needed in both Q and A for RAG |
| 70f9424 | 02-24 | docs: add P1 pattern — questions must be self-contained (cross-cuts #1 and #2) |
| 6084c93 | 02-24 | docs: log issue #2 — decontextualized-question pattern (pervasive) |
| 07da145 | 02-24 | docs: log issue #1 — temporal-assumption in affinity-groups events |
| c4ec468 | 02-24 | docs: add qa-review-observations.md for tracking Argilla review issues |
| 6857db8 | 02-24 | docs: improve signpost comments + fix COMING SOON name normalization |
| 579e10d | 02-24 | fix: normalize "COMING SOON" resource names to lowercase |
| 7bd43ba | 02-24 | wip: some signpost comments |
| 3333c32 | 02-23 | docs: update guided-tour |
| 66e1819 | 02-20 | refactor: adopt two-shot as sole extraction strategy |
| 7803147 | 02-20 | fix: restore missing return in software_discovery._generate_qa_pairs |
| 7791e2b | 02-20 | feat: add --prompt-strategy flag for A/B/C extraction experiment |
| b662dc9 | 02-20 | feat: implement entity-replace for Argilla push |
| 80fc641 | 02-20 | docs: update plan with metadata on human actions on archive records |
| 9d54819 | 02-19 | fix(data-quality): separate NSF program fields and add per-domain LLM guidance |
| 39a4c06 | 02-19 | refactor: remove factoid templates and bonus generation (2-pass pipeline) |
| 5268caa | 02-19 | docs: reflect entity-replace decision and update README |
| 8c9e7f2 | 02-18 | docs: update all docs for freeform extraction pipeline and Argilla dedup |
| 4181585 | 02-18 | feat: roll out freeform extraction to all 5 extractors |
| da79f7d | 02-18 | feat: freeform extraction replaces category+bonus two-pass approach |
| 2833d7b | 02-18 | docs: update for Argilla metadata integration and test count |
| e6d08fa | 02-18 | feat(argilla): add eval_issues and source_ref to Argilla records |
| 3c762c9 | 02-18 | feat(argilla): push judge scores and granularity to Argilla metadata |
| 24c8373 | 02-17 | feat(judge): LLM judge evaluation scores for Q&A pair quality |
| 93a1fb2 | 02-17 | feat(bonus): LLM exploratory questions for entity-unique information |
| 068c08a | 02-17 | feat(incremental): hash-based change detection to skip unchanged entities |
| 9059614 | 02-17 | fix(factoids): data quality guards for template generation |
| 3662d8b | 02-13 | feat(generators): dual-granularity Q&A + extend comparisons to all 5 domains |
| fa2ff93 | 02-12 | fix(nsf-awards): normalize primaryProgram list + skip unused MCPClient |
| f3b1437 | 02-12 | feat(extractors): fixed question categories + direct API for allocations/nsf-awards |
| fdebdab | 02-12 | feat(software-discovery): switch from search terms to list_all_software |
| e33d006 | 02-11 | feat(extract): add max_entities cap for cheap test runs |
| 2da2c32 | 02-10 | Use real enumerations from taxonomies.ts for search terms |
| d987dee | 02-10 | Add report command for MCP coverage stats without LLM calls |
| 6c4667c | 02-10 | Add ExtractionConfig to centralize extraction parameters |
| 0b16ba8 | 02-04 | Fix Q&A pair ID collisions by appending question hash |
| cf384bc | 02-04 | Add Argilla integration for pushing Q&A pairs to human review |
| 51e9877 | 02-04 | Expand extraction queries, fix software-discovery, update docs |
| a69ce2e | 02-02 | Fix allocations and nsf-awards extractors returning 0 results |
| 038d42d | 02-02 | Add dedicated OpenAI backend (LLM_BACKEND=openai) |
| b557300 | 02-01 | Add LOCAL_DIRECTIONS.md and update .env.example for OpenAI setup |
| d45eda1 | 02-01 | Add NSFAwardsExtractor and register in CLI/validator |
| b67eba0 | 02-01 | Add AllocationsExtractor and register in CLI/validator |
| 18c0e49 | 01-31 | Add AffinityGroupsExtractor and fix MCP server port defaults |
| de28ab2 | 01-31 | Add CLAUDE.md and update README with local dev setup guide |

access-qa-service (main)

| Hash | Date | Message |
|---|---|---|
| 5b57ae0 | 02-28 | Fix Argilla sync to work with access-qa-extraction's dataset schema |

access-agent (main / feature/dual-rag-logging)

| Hash | Date | Message |
|---|---|---|
| ef43a21 | 03-12 | feat: return top-5 RAG matches and enrich synthesis prompt |
| b7a9bec | 03-12 | feat: gate node_trace behind ?include_trace query parameter |
| 04342c8 | 03-11 | feat: add node_trace to agent graph for execution path observability |
| de26e37 | | feat: route pgvector through LLM synthesis + fair comparison logging |
| 08809ad | | fix: lower RAG similarity thresholds — 0.85 was filtering valid matches |
| caf7256 | 02-28 | feat: add dual-RAG comparison logging for A.2 evaluation |

access-mcp (main)

| Hash | Date | Message |
|---|---|---|
| bb3b54f | 02-04 | spike: Add list-all fallbacks to allocations and nsf-awards routers |

access-qa-planning (update/mcp-extraction-two-shot)

| Hash | Date | Message |
|---|---|---|
| a84fb4a | 02-26 | docs: GUIDED-TOUR.md → TRACE-TOUR.extract.md in file tree |
| 033c46e | 02-23 | docs: update mcp-extraction-impl to reflect two-shot pipeline and entity-replace |

access-argilla (main)

| Hash | Date | Message |
|---|---|---|
| d5cb931 | 01-30 | chore: init claude file |