Gist mirror: https://gist.github.com/bacalj/a6c6f9726844611df5a09c83884a0e83
Goal: Get Q&A pairs from Argilla into access-qa-service (pgvector) so they're searchable via semantic search.
Discovery: access-qa-service already had a /admin/sync endpoint and argilla_sync.py — but the code was scaffolded with placeholder logic that didn't match the actual Argilla v2 API or the record schema created by access-qa-extraction.
What was wrong:
- Used deprecated Argilla v1 API (`rg.init()`/`rg.load()`)
- Guessed at record field access (`record.inputs`, `record.question`) — Argilla v2 uses `record.fields["question"]`
- Looked for `entity_id` in metadata (doesn't exist) — it needs to come from `<<SRC:...>>` citation markers in the answer text
- Default dataset name was `"access-qa"`, but extraction creates `"qa-review"`
- `argilla` Python SDK wasn't in the dependencies
What we fixed (commit 5b57ae0 on access-qa-service/main):
- Rewrote `sync_from_argilla()` for the Argilla v2 client API
- Correct field access via `record.fields`
- Domain/entity_id extracted from citation markers, with `source_ref` parsing as fallback
- Added `_get_edited_values()` to prefer reviewer edits (future-proofing)
- Judge scores (faithfulness, relevance, completeness, confidence) carried through to pgvector metadata
- Added `argilla>=2.0.0` as a proper dependency
- Added Argilla env vars to `docker-compose.yml` for local dev
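The citation-marker extraction can be sketched in a few lines. This is an illustrative parser, not the actual `argilla_sync.py` code — the `<<SRC:domain:entity_id>>` marker shape is assumed from the markers quoted in this log:

```python
import re

# Hypothetical sketch: pull (domain, entity_id) out of a <<SRC:...>> citation
# marker embedded in answer text. Marker format assumed: <<SRC:domain:entity_id>>.
MARKER_RE = re.compile(r"<<SRC:([^:>]+):([^>]+)>>")

def extract_citation(answer_text: str):
    """Return (domain, entity_id) from the first citation marker, or None."""
    match = MARKER_RE.search(answer_text)
    return (match.group(1), match.group(2)) if match else None

example = "ACES is a composable accelerator testbed. <<SRC:compute-resources:aces>>"
print(extract_citation(example))  # → ('compute-resources', 'aces')
```

With `source_ref` parsing as the fallback when no marker is present, this is enough to recover the metadata that the placeholder code expected to find in Argilla record metadata.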
Test result:
POST /admin/sync → {"synced": 83, "skipped": 0, "citations_loaded": 12, "errors": []}
POST /search {"query": "What is ACES designed for?"} → similarity_score: 1.0, correct answer with citation
83 records across 5 domains (compute-resources, software-discovery, affinity-groups, allocations, nsf-awards) synced and searchable.
Also documented: Andrew's feature/access-agent-integration branch on qa-bot-core — what it changes (Netlify proxy, request body format, response contract) and why it matters for Projects A and B. Added to FEB_MARCH_PLAN.md and synced to the gist.
Goal: Modify rag_answer node to query both UKY document RAG and pgvector Q&A-pair RAG for every question, logging side-by-side results for A.3 evaluation.
Approach: Parallel queries via asyncio.gather, gated behind DUAL_RAG_LOGGING env var. When the flag is off, behavior is identical to before.
What was built (commit caf7256 on access-agent/feature/dual-rag-logging):
- `src/config.py` — Added `DUAL_RAG_LOGGING: bool = False` setting
- `src/rag_comparison_logger.py` (new) — SQLAlchemy model + singleton logger for the `rag_comparison_logs` table. Follows the same pattern as `usage_logger.py`. Table auto-creates on first use.
- `src/agent/nodes/rag_answer.py` — Added:
  - `_query_uky_raw()` / `_query_pgvector_raw()` — lightweight async helpers that return raw results without span side-effects
  - `_dual_rag_answer()` — runs both queries concurrently, applies the same UKY-primary/pgvector-fallback priority, logs the comparison to PostgreSQL
  - Gate in `rag_answer_node`: `settings.DUAL_RAG_LOGGING and rag_endpoint` → dual path; else unchanged
- `tests/test_rag_answer.py` (new) — 19 tests: citation processing, raw query helpers, dual-RAG logic (UKY served, pgvector fallback, both fail, combined query, below threshold, logger failure resilience), flag gating
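The concurrent-query pattern can be sketched as follows. The helper names mirror the ones above, but the bodies and result shapes are illustrative stand-ins, not the actual `rag_answer.py` code:

```python
import asyncio

# Minimal sketch of the dual-RAG pattern: run both backends concurrently with
# asyncio.gather, keep UKY-primary/pgvector-fallback priority. Sleeps stand in
# for the real network calls; result dicts are assumed shapes.
async def _query_uky_raw(query: str) -> dict:
    await asyncio.sleep(0.01)          # stands in for the remote UKY call
    return {"answer": f"UKY answer to: {query}", "error": None}

async def _query_pgvector_raw(query: str) -> dict:
    await asyncio.sleep(0.01)          # stands in for the pgvector search
    return {"matches": [{"score": 0.84}], "error": None}

async def dual_rag_answer(query: str) -> str:
    # gather() preserves argument order, so the results unpack deterministically
    uky, pgv = await asyncio.gather(
        _query_uky_raw(query), _query_pgvector_raw(query)
    )
    # UKY-primary, pgvector-fallback priority
    if uky["error"] is None and uky["answer"]:
        return uky["answer"]
    if pgv["matches"]:
        return "pgvector fallback answer"
    return "no answer"

print(asyncio.run(dual_rag_answer("What is Delta?")))
# → UKY answer to: What is Delta?
```

Because both queries run concurrently, the dual path adds roughly the latency of the slower backend, not the sum of the two.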
Comparison log table schema (`rag_comparison_logs`):
- Query context: `session_id`, `question_id`, `query_text`, `expanded_query`, `query_type`, `rag_endpoint`
- UKY result: `uky_response`, `uky_duration_ms`, `uky_error`
- pgvector result: `pgvector_matches` (JSONB), `pgvector_best_score`, `pgvector_match_count`, `pgvector_duration_ms`, `pgvector_error`
- Outcome: `served_by`, `served_answer_length`
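For reference, the schema can be expressed as plain SQL. This is an illustrative SQLite version (the real table is PostgreSQL, where `pgvector_matches` is JSONB and the SQLAlchemy model creates the table on first use); column types here are assumptions:

```python
import sqlite3

# Sketch of the rag_comparison_logs schema from the bullet list above,
# rendered as DDL against an in-memory SQLite database for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE rag_comparison_logs (
    id INTEGER PRIMARY KEY,
    -- query context
    session_id TEXT, question_id TEXT, query_text TEXT,
    expanded_query TEXT, query_type TEXT, rag_endpoint TEXT,
    -- UKY result
    uky_response TEXT, uky_duration_ms REAL, uky_error TEXT,
    -- pgvector result
    pgvector_matches TEXT,        -- JSONB in the real Postgres table
    pgvector_best_score REAL, pgvector_match_count INTEGER,
    pgvector_duration_ms REAL, pgvector_error TEXT,
    -- outcome
    served_by TEXT, served_answer_length INTEGER
);
""")
conn.execute(
    "INSERT INTO rag_comparison_logs (query_text, served_by, pgvector_match_count) "
    "VALUES (?, ?, ?)",
    ("What is ACES?", "uky_general", 0),
)
print(conn.execute("SELECT COUNT(*) FROM rag_comparison_logs").fetchone()[0])  # → 1
```

One row per question makes the A.3 evaluation a straight SQL query over `served_by`, scores, and durations.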
Test result: 94 passed (all existing + 19 new), 0 failures.
What's unchanged: `state.py`, `graph.py`, `routes.py` — the graph contract is untouched. The comparison log is a side-effect inside the rag_answer node.
Next (A.3): Deploy the feature/dual-rag-logging branch with DUAL_RAG_LOGGING=true, ask questions via qa-bot-core or direct API, then query rag_comparison_logs to evaluate UKY vs pgvector.
Decision: Run A.3 locally in Docker, bypass qa-bot-core, use direct curl requests.
Docker setup (two separate compose projects):
- `access-qa-service/docker-compose.yml` → qa-service (port 8001) + PostgreSQL (port 5433) + Redis (port 6380)
- `access-agent/docker-compose.yml` → agent (port 8000) + PostgreSQL (port 5432) + Redis
- access-agent reaches access-qa-service via `host.docker.internal:8001` (macOS Docker)
- UKY endpoint is remote — uses the same API key as qa-bot-core (`ACCESS_AI_API_KEY`)
What we did to get access-agent running:
- Created `access-agent/.env` from discovered keys: `OPENAI_API_KEY` (from `access-qa-extraction/.env`), `ACCESS_AI_API_KEY` (same key as `QA_MODEL_API_KEY` in `access-serverless-api/.env` and `REACT_APP_API_KEY` in `qa-bot-core/.env.local`), plus `DUAL_RAG_LOGGING=true`, `QA_SERVICE_URL=http://host.docker.internal:8001`, `OTEL_ENABLED=false`
- Modified `access-agent/docker-compose.yml`: added `env_file: .env` to the agent service (previously all env vars had to be listed explicitly), removed the external `mcp-network` dependency (MCP servers aren't needed for A.3)
- Built and started: `docker compose up --build -d` — all containers healthy
Smoke test (successful):
curl -X POST http://localhost:8000/api/v1/query \
-H "Content-Type: application/json" \
-d '{"query": "What is Delta?", "session_id": "test-a3-smoke", "question_id": "smoke-1"}'
→ Got a full UKY-sourced response about Delta (NCSA HPC resource), 6s latency, tools_used: ["uky_rag_retrieval"]. Agent is live and hitting UKY successfully.
Note: The API field is `query` (not `question`). The MCP server warnings in the agent logs are expected and harmless — those servers aren't on this Docker network and aren't needed for A.3.
Current container status (all running):
| Service | Port | Notes |
|---|---|---|
| access-agent | 8000 | feature/dual-rag-logging branch, DUAL_RAG_LOGGING=true |
| access-agent postgres | 5432 | checkpointing + comparison logs |
| access-qa-service | 8001 | 83 Q&A pairs loaded |
| qa-service postgres | 5433 | pgvector embeddings |
| access-argilla | 6900 | Q&A pair review UI |
Goal: Verify Docker environment still works and start A.3 evaluation.
Discovery: pgvector is returning zero matches for reasonable queries like "What is ACES?" — even though we have 20 compute-resources Q&A pairs including several about ACES.
Root cause: The similarity threshold is too aggressive. There are two thresholds stacked:
- qa-service default (`access-qa-service/src/access_qa_service/config.py:26`): `rag_similarity_threshold = 0.85`
- access-agent per-query-type thresholds (`access-agent/src/config.py:69-71`): `RAG_THRESHOLD_STATIC = 0.85` (static queries), `RAG_THRESHOLD_COMBINED = 0.75` (combined queries), `RAG_THRESHOLD_FALLBACK = 0.65` (fallback)
The agent's `_query_pgvector_raw()` passes the threshold to the qa-service, which uses it to filter results. For static queries (the most common type), both sides enforce 0.85.
The problem: "What is ACES?" scores 0.84 against the best match ("What is ACES designed for?") — just below the 0.85 cutoff. With threshold 0.3, the same query returns 3 solid matches (0.84, 0.82, 0.76). Short or naturally-phrased questions routinely fall just under 0.85 even when the topic matches perfectly.
Evidence:
curl /search {"query": "What is ACES?", "threshold": 0.85} → 0 matches
curl /search {"query": "What is ACES?", "threshold": 0.3} → 3 matches (0.84, 0.82, 0.76)
curl /search {"query": "What is ACES designed for?"} → 1 match (1.0, exact)
The rag_comparison_logs table confirmed this — both smoke test queries ("What is Delta?", "What is ACES?") show pgvector_match_count: 0 and served_by: uky_general.
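The failure mode reduces to a one-line filter. A tiny illustration using the observed scores (the real filtering happens server-side in the qa-service):

```python
# Similarity scores observed for "What is ACES?" against the Q&A bank,
# as reported in the curl evidence above.
matches = [0.84, 0.82, 0.76, 0.41]

def above_threshold(scores, threshold):
    # Keep only matches at or above the similarity cutoff
    return [s for s in scores if s >= threshold]

print(len(above_threshold(matches, 0.85)))  # → 0 (the A.3 blocker)
print(len(above_threshold(matches, 0.70)))  # → 3 (what a lower cutoff returns)
```

A best match of 0.84 against a 0.85 cutoff means every natural phrasing falls just short, so the comparison log records zero pgvector matches for the whole run.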
What needs to happen before running A.3:
- Lower the threshold so pgvector actually returns matches for natural queries
- Options: (a) lower `RAG_THRESHOLD_STATIC` from 0.85 to ~0.70 in access-agent config, (b) use a comparison-specific override in the dual-RAG path so production defaults aren't touched, or (c) lower the qa-service default
- Rebuild the access-agent container after the change
Also this session: Created SYSTEM_OVERVIEW.md with sequence diagrams of the three main flows (query answering, knowledge base building, per-entity extraction detail). Updated the agent graph illustration in FEB_MARCH_PLAN.md from mermaid to an emoji-annotated state transition table. Synced plan gist.
Change: Lowered all RAG similarity thresholds in access-agent/src/config.py (commit 08809ad on feature/dual-rag-logging):
- `RAG_THRESHOLD_STATIC`: 0.85 → 0.70
- `RAG_THRESHOLD_COMBINED`: 0.75 → 0.60
- `RAG_THRESHOLD_FALLBACK`: 0.65 → 0.50
- `RAG_SIMILARITY_THRESHOLD` (legacy): 0.85 → 0.70
Why: Best matches for natural queries scored ~0.84, just below the 0.85 cutoff. This was the A.3 blocker — pgvector returned 0 matches for every query.
Still needed: Rebuild the access-agent Docker container (docker compose up --build -d) and verify the fix with a smoke test before proceeding with A.3.
Rebuilt container: docker compose up --build -d picked up the threshold fix. All containers healthy.
Threshold fix verified: "What is ACES?" now returns pgvector_match_count: 3, pgvector_best_score: 0.84. Before the fix this was 0 matches. UKY still served (as designed), but pgvector results are now logging.
Pushed branches: access-agent/feature/dual-rag-logging pushed to GitHub (3 commits: A.2 dual-RAG logging, threshold fix). access-qa-service/main push failed — Joe doesn't have write access to necyberteam/access-qa-service (need Andrew to grant).
QAP coverage (83 pairs across 10 entities in 5 domains):
| Domain | Entity | Pairs |
|---|---|---|
| compute-resources | ACES (TAMU) | 10 |
| compute-resources | Ranch (TACC) | 10 |
| software-discovery | ABINIT | 10 |
| software-discovery | Abaqus | 8 |
| allocations | Grassland bird habitat (#72204) | 9 |
| allocations | RL benchmark (#72205) | 10 |
| nsf-awards | Pollinator conservation AI (#2529183) | 10 |
| nsf-awards | Great Salt Lake dust (#2449122) | 8 |
| affinity-groups | Neocortex (PSC) | 5 |
| affinity-groups | REPACSS (TTU) | 3 |
Test questions written: 40 questions in A3_TEST_QUESTIONS.md, organized in 3 groups:
- pgvector-targeted (24): Questions about entities we have QAPs for
- UKY-targeted (8): General ACCESS questions our 83 pairs probably don't cover
- Edge cases (8): Vague, misspelled, or cross-domain questions
Next: Review the test questions, then fire them all through the agent and pull the comparison logs.
Run 2 executed: Fired all 41 test questions through the agent with DUAL_RAG_LOGGING=true. All 41 succeeded, 40 logged (q41 classified as dynamic/xdmod). Results exported to a3_results/run2.json.
Run 2 results (high-level): UKY answered 36/40, pgvector had matches for 30/40, served by UKY 36, served by pgvector 4.
Built interactive HTML comparison: ~/.agent/diagrams/a3-run2-comparison.html — expandable rows with side-by-side answers, KPI summary, sidebar nav, analysis section.
Synthesis routing fix: pgvector static matches were previously returned as `final_answer` (raw Q&A pair text). Changed `rag_answer.py` to set `rag_matches` + `rag_used` instead, and added `"synthesize"` as a third routing option from `route_after_rag` in `graph.py`. This routes pgvector results through the LLM synthesis pipeline.
Unfair comparison discovered: Run 2's comparison was apples-to-oranges. UKY answers arrive already LLM-synthesized (UKY's own LLM produces polished prose). pgvector answers in the comparison log were raw Q&A pair text — just the verbatim answer field from the curated pair. This made pgvector look worse than it actually is, since the difference was partly in presentation quality, not underlying knowledge.
Goal: Make the comparison fair by synthesizing pgvector answers through our own LLM before logging them.
What was changed:
- `rag_comparison_logger.py` — Added `pgvector_synthesized_answer = Column(Text)` to the model and the `log_comparison()` method
- `rag_answer.py` — Imported `_format_rag_matches` and `_synthesize_with_rag_only` from `synthesize.py`. In `_dual_rag_answer()`, after getting pgvector matches, calls synthesis to produce an LLM-polished answer before logging. This is what the user would actually see if pgvector served the answer.
- `pyproject.toml` — Pinned `opentelemetry-instrumentation-langchain<0.53` (the newer version had a breaking import for `GenAICustomOperationName`)
- Database — `ALTER TABLE rag_comparison_logs ADD COLUMN pgvector_synthesized_answer text;`
- Test runner — Created `a3_results/run_a3_test.py` to fire all 41 questions programmatically
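A runner like that can be sketched as a small loop that builds one request body per question. This is a hypothetical outline, not the actual script — the transport is injected as a callable so the loop is testable offline, and the field names follow the smoke-test curl earlier in this log:

```python
import json

# Illustrative question list; the real runner reads the 41 A.3 questions.
QUESTIONS = ["What is ACES?", "What is Delta?", "How do I use Globus?"]

def build_payloads(questions, session_id="a3-run"):
    # One JSON body per question; the agent API expects "query", not "question"
    for i, q in enumerate(questions, start=1):
        yield json.dumps(
            {"query": q, "session_id": session_id, "question_id": f"q{i}"}
        )

def run(questions, post):
    """post(body) -> response dict. Returns the count of successful questions."""
    ok = 0
    for body in build_payloads(questions):
        if post(body).get("status") == "ok":
            ok += 1
    return ok

# Offline usage with a stub transport standing in for the HTTP POST:
print(run(QUESTIONS, post=lambda body: {"status": "ok"}))  # → 3
```

In the real script the `post` callable would POST to `http://localhost:8000/api/v1/query`; each run then leaves one row per question in `rag_comparison_logs`.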
Run 3 results (41/41 succeeded, all logged):
| Metric | Value |
|---|---|
| UKY answered | 38/41 (93%) |
| pgvector answered (synthesized) | 27/41 (66%) |
| Both answered | 24 (direct comparison possible) |
| UKY only | 14 |
| pgvector only | 3 |
| Avg pgvector similarity score | 0.84 |
Fair comparison conclusions (from HTML analysis at ~/.agent/diagrams/a3-run3-comparison.html):
- The two backends are complementary, not competitive. pgvector gives precise, curated answers for entities we've built Q&A pairs for. UKY covers the long tail of general ACCESS knowledge.
- pgvector excels on its own domain: Of 25 pgvector-targeted questions (Q1–Q25), pgvector produced synthesized answers for 24 (96%). These are entities with curated Q&A pairs.
- UKY handles breadth that pgvector cannot: For the 8 UKY-targeted questions (Q26–Q33) about general ACCESS topics (allocations process, Globus, password reset), pgvector answered 0. Our 83 curated pairs simply don't cover these.
- UKY produces longer answers (~157% longer on average when both answer the same question). This may reflect UKY's larger document corpus or that our synthesis prompt is more concise. Length alone doesn't indicate quality.
- pgvector retrieval is dramatically faster (~5 ms vs ~2500 ms for UKY), though pgvector now also needs LLM synthesis time (not logged separately).
- The quality gap is narrower than Run 2 suggested. With LLM synthesis, pgvector answers read as polished, cited responses. The Run 2 comparison was unfairly penalizing pgvector by showing raw text.
- Production recommendation: Use both backends — pgvector for high-confidence domain matches, UKY for everything else. This is already the architecture (`_dual_rag_answer` uses UKY-primary, pgvector-fallback).
Files produced:
- `a3_results/run3.json` — Full export of 41 comparison log entries
- `~/.agent/diagrams/a3-run3-comparison.html` — Interactive comparison with analysis
- `a3_results/run_a3_test.py` — Test runner script
Realization: The A.3 analysis drifted toward "complementary backends" and fallback architecture. But that wasn't the original question. From FEB_MARCH_PLAN.md:
- "proving this approach outperforms document RAG" (line 34)
- "We need data on how these two approaches compare before making further investment decisions" (line 65)
- "A first because it validates the approach before investing in B" (line 259)
A.3 was a bake-off to decide whether Q&A-pair RAG can replace UKY document RAG — not to build a hybrid system. The "use both" conclusion was the code's existing fallback architecture leaking into the analysis.
The coverage gap is entirely explained by content type, not approach quality:
What the extraction pipeline covers (5 MCP server domains, entity-focused):
- Compute resources (23 entities: ACES, Delta, Anvil, etc.)
- Software discovery (1,404 packages)
- Allocations (5,440 projects)
- NSF awards (10,000+ awards)
- Affinity groups (55 groups)
These are all "what is X" questions about discrete entities. The pipeline pulls structured data from MCP servers and generates Q&A pairs about each entity's properties.
What UKY has that we don't (general ACCESS documentation):
- How to apply for an allocation (process docs)
- How to transfer files / use Globus (how-to guides)
- How to reset your password (account management)
- Startup vs research allocations (policy docs)
- Training resources, publication acknowledgment (educational docs)
These are "how do I" questions about ACCESS-wide processes. They don't live in any MCP server — they live in documentation pages, wikis, and guides that UKY ingested.
We don't know exactly what UKY ingested. The plan has an open question: "Need a list from Andrew of what UKY currently ingests." UKY is a black-box API to us.
On entity questions where we have Q&A pairs: pgvector hits 96% (24/25). The synthesized answers are concise and accurate. pgvector retrieval is ~500x faster than UKY (~5ms vs ~2500ms).
On general how-to/process questions: pgvector scores 0%. We simply have zero Q&A pairs for these topics because no MCP server serves allocation process docs or file transfer guides.
The gap is coverage, not quality. If we had Q&A pairs for general ACCESS topics, pgvector would likely match or beat UKY on those too.
The plan says Project C ("Extract from ACCESS documentation") was deferred with this note:
"Revisit only if a specific content gap surfaces that exists only in documents with no API equivalent (e.g., narrative tutorials, policy explainers)."
A.3 just surfaced exactly that gap. The 14 UKY-only questions are all process/how-to questions with no API equivalent.
Joe needs to decide:
- Pursue Project C — Extract Q&A pairs from ACCESS documentation (not MCP entities). This would close the how-to gap and potentially let pgvector replace UKY entirely. Requires: getting the doc list from Andrew, building a document extractor, running extraction + Argilla review.
- Keep UKY for breadth, pgvector for precision — Accept the hybrid architecture. UKY handles general questions, pgvector handles entity questions. Simpler, but you're dependent on UKY's black-box system and can't control answer quality for general topics.
- Expand entity coverage first — Before tackling docs, run the existing extraction pipeline against more entities (we only extracted 11 of 23 compute resources, 2 of 1,404 software packages, 2 of 5,440 allocations). More entity coverage might narrow the gap enough.
Searched all repos (access-qa-planning, access-agent, access-mcp, access-qa-extraction, access-qa-bot) for any documentation of what UKY's system ingests. Found:
- `pages-current-production.md` — "The Q&A backend is hosted at the University of Kentucky." No corpus details.
- `pages-access-qa-tool.md` line 193 — Notes UKY's tech stack as "ChromaDB, llamaindex." No document list.
- `FEB_MARCH_PLAN.md` line 233 — Open question: "Need a list from Andrew of what UKY currently ingests."
- `uky_client.py` — Black-box HTTP client. No corpus metadata.
No list of UKY's ingested documents exists anywhere in our repos. Andrew is the only source for this information.
Even without the UKY document list, there are viable paths to continue the bake-off:
Option A: Analyze UKY's 14 winning answers for source clues. Read the UKY-only responses from Run 3 and determine whether the information is unique to some internal corpus or is general ACCESS knowledge available on public web pages (support.access-ci.org, allocations.access-ci.org). UKY's answers may contain citations, URLs, or verbatim language that reveals their source documents. This takes ~30 minutes and informs all other options.
Option B: Generate Q&A pairs from public ACCESS content. Point the extraction pipeline (or a variant) at public ACCESS web pages — the allocations guide, getting started pages, Globus documentation, password reset instructions. These are freely available. Generate Q&A pairs, curate them, load into pgvector, re-run A.3. This directly tests whether closing the topic gap closes the performance gap.
Option C: Determine whether UKY's advantage is unique knowledge or general glue. The 14 UKY-only questions are all process/how-to topics. If UKY is synthesizing from the same public ACCESS web pages any user can read, then the "advantage" is simply that we haven't generated Q&A pairs for those topics yet — not that UKY has access to privileged information. This reframes the bake-off: it's not documents vs Q&A pairs, it's about coverage breadth.
Option D: Expand entity coverage as a control. Add Q&A pairs for remaining MCP server domains (events, announcements, system-status) and more entities within existing domains (we only extracted 11 of 23 compute resources, 2 of 1,404 software packages). This tests whether broader entity coverage alone changes the picture.
Recommended sequence: A first (30 min, informs everything), then B (directly tests the hypothesis), with D as low-effort parallel work.
Andrew provided the full set of documents that feed UKY's document RAG. They are in rag_documents/ (75 files, 69 MB) split across two directories:
staging/ (~47 files) — The main corpus. Three categories:
| Category | Examples | Count |
|---|---|---|
| Resource descriptions | ACES, Anvil, Bridges-2, Delta, Expanse, Jetstream-2, Neocortex, Sage, Voyager, Fabric (PDFs) | ~20 |
| User guides | ACES, Anvil, Bridges-2, Delta, Expanse, Jetstream-2, Neocortex, Sage (PDFs) | ~10 |
| Process/how-to docs | Allocations, Globus, MFA, add users, progress reports, office hours, events/trainings, system status (docx) | ~12 |
| Misc | ARA description, SDS pointer, CloudBank login, REPACSS overview, Sage edge apps, current projects | ~5 |
data/ (~28 files) — Per-resource software lists (txt/csv) and resource-specific documentation:
- Software installed lists for ACES, Anvil, Bridges-2, Darwin, Delta, Expanse, Jetstream2, Kyric, Stampede3
- Darwin docs (user guide, login, filesystems, job management, SLURM, software)
- Delta docs (user guide, data management)
- FASTER docs (intro, SLURM partitions, documentation)
- ACCESS Travel Rewards (md)
Key observation: The process/how-to docs in staging/ (allocations, Globus, MFA, etc.) are exactly the topics UKY beat pgvector on in A.3. The resource descriptions overlap with what MCP extraction already covers. This confirms the A.3 finding — the gap was coverage, not quality.
Confirmed the shared end state:
- Generate Q&A pairs from these documents — Use a similar two-shot process to what exists for MCP entities, but with documents as input. Andrew: "Probably a similar prompt to the MCP tools can work for generating pairs from docs."
- One unified Q&A pair bank in pgvector — Entity pairs (from MCP) + document pairs (from these files) living together, searchable as one corpus.
- The orchestrator agent decides routing — RAG for factual queries, MCP for live data, both when needed. Andrew: "The orchestrator agent should decide which tools to use (RAG, MCP, both) and then it should get synthesized. That logic should already exist in access-agent."
- UKY goes away — Andrew: "Eventually, we will likely not need the document based RAG since the Q&A pairs are faster." pgvector replaces UKY entirely.
Step 1: Categorize the corpus. Skim the 75 files and bucket them: resource descriptions (entity overlap with MCP), user guides (process/how-to), general ACCESS docs. Identify what's already covered by MCP extraction vs. what's net new.
Step 2: Build a document extractor in access-qa-extraction. Extend the pipeline to accept documents (PDF/docx) as input. The two-shot prompt structure should carry over — battery pass for coverage, discovery pass for insights. New work: document parsing (PDF text extraction, docx reading) and chunking into logical sections.
Step 3: Run extraction on the full corpus. Generate Q&A pairs from all documents. Push to Argilla for review. This produces pairs for the exact topics pgvector was missing — allocations process, Globus, MFA, user guides.
Step 4: Load into pgvector alongside entity pairs. One unified bank: existing 83 entity pairs + document-sourced pairs. All searchable together.
Step 5: Re-run A.3. Same 41 questions (plus new ones if the expanded corpus suggests them). If pgvector-with-documents matches or beats UKY across the board, the bake-off is won.
Step 6: Simplify the agent routing. Once the Q&A pair bank covers everything, the agent graph simplifies: RAG for factual queries, MCP for live data, synthesis when both contribute. Remove the UKY fallback path.
Skimmed all 75 files in rag_documents/ and produced a categorized index at rag_documents/CORPUS_INDEX.md. No files were moved or renamed — the index is a read-only reference.
| Category | Files | Priority | Rationale |
|---|---|---|---|
| NET-NEW process/how-to | 20 | First | Fills the exact A.3 gap — allocations, Globus, MFA, Sage, citations, Jupyter |
| USER GUIDE (deep) | 22 | Second | Operational depth (job submission, filesystems, SLURM) beyond MCP surface data |
| MCP OVERLAP (descriptions) | 17 | Later | 1-page resource catalog entries — MCP already covers most of this |
| DATA FILE | 12 | Skip | Raw software lists (name/version lines) — MCP software-discovery covers this |
| POINTER/EMPTY | 4 | Skip | URL stubs or corrupt files with no substantive content |
Key finding: The 20 NET-NEW files are mostly small docx docs — easy to parse, directly address the A.3 gap. The 22 user guides are larger PDFs with real depth (SLURM partitions, data management, module systems). The 17 resource descriptions are 1-page PDFs that overlap with MCP entity data.
Also this session: Consolidated project documentation — SYSTEM_OVERVIEW.md is now single source of truth for architecture, FEB_MARCH_PLAN.md updated with A.3 results and Project C active status, all three docs gist-mirrored, CLAUDE.md updated with document discipline rules.
access-qa-extraction PR #1 (two-shot pipeline) — squash-merged to main. 4,697 additions across the full two-shot extraction pipeline: battery + discovery prompts, LLM judge, incremental cache, Argilla entity-replace, 5 domain extractors, 144 tests. Branch archived on GitHub.
access-qa-planning PR #1 (companion docs) — squash-merged to main. Documentation updates for two-shot pipeline.
access-agent and qa-bot-core — decided to leave on their branches. qa-bot-core is a production product with its own release routine. access-agent's feature/dual-rag-logging branch mixes evaluation scaffolding with production improvements — better to leave as-is until the bake-off concludes.
Reinstalled access-qa-extraction from clean main. 144/144 tests pass. Started mcp-compute-resources Docker container from access-mcp/docker-compose.yml (port 3002). Ran extraction:
qa-extract extract compute-resources --max-entities 1 --no-judge
Produced 8 Q&A pairs for ACES — 5 battery + 3 discovery, all with citations. Two-shot pipeline confirmed working on main.
Branched feat/document-extractor off clean main. Built the document extraction pipeline:
New files:
- `parsers.py` — Standalone document parsing module. `parse_docx()` (python-docx), `parse_pdf()` (PyMuPDF/fitz), `parse_text()` (.txt/.md). Dispatcher `parse_document()` routes by extension. `chunk_text()` splits large docs (~6000 words) with overlap. `clean_extracted_text()` collapses PDF/docx whitespace artifacts.
- `extractors/documents.py` — `DocumentExtractor(BaseExtractor)`. Overrides `run()` to skip MCPClient (documents are local files). Discovers files recursively from the `config.url` directory. Each document/chunk = one entity. Two-shot LLM pipeline (battery + discovery), judge evaluation, incremental cache — same as MCP extractors. Uses `source="doc_generated"`, `source_ref="doc://documents/{entity_id}"`.
Modified files:
- `pyproject.toml` — Added `python-docx>=1.0.0`, `PyMuPDF>=1.24.0`
- `models.py` — Added `source` parameter to `QAPair.create()` (default `"mcp_extraction"`, backward-compatible)
- `question_categories.py` — Added `"documents"` to `DOMAIN_LABELS`, `DOMAIN_NOTES`, and `FIELD_GUIDANCE` (5 field groups: overview, key procedures, requirements & eligibility, important details, support & contact)
- `config.py` — Added a `"documents"` `MCPServerConfig` with `url=os.getenv("DOCUMENTS_DIR", "../rag_documents")`
- `extractors/__init__.py` — Added `DocumentExtractor` import and export
- `cli.py` — Added `DocumentExtractor` to the `EXTRACTORS` registry
Test 1: `qa-extract extract documents --max-entities 1 --no-judge` — parsed CORPUS_INDEX.md and produced 6 Q&A pairs about the document corpus.
Test 2: `qa-extract extract documents --entity-ids "10_1758119706.911465_data-ACCESS-how-to-cite-Jetstream" --no-judge` — parsed a docx file from staging/ and produced 5 Q&A pairs about Jetstream citation formats and acknowledgment requirements.
Fix: `_title_from_stem()` was producing ugly titles from Slack-style filenames (e.g., `10_1758119706.911465_data-ACCESS-how-to-cite-Jetstream`). Added `re.sub(r"^\d+_[\d.]+_", "", stem)` to strip the numeric prefix, plus stripping common prefixes (`data-ACCESS-`, `data:`, etc.). The title now renders as "How To Cite Jetstream".
All 144 existing tests still pass after all changes.
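The title fix can be sketched end to end. The numeric-prefix regex is the one quoted above; the exact prefix list and the title-casing step are illustrative assumptions about the rest of the helper:

```python
import re

# Sketch of the _title_from_stem() fix: strip the Slack-export numeric prefix
# (e.g. "10_1758119706.911465_"), then common export prefixes, then title-case.
def title_from_stem(stem: str) -> str:
    stem = re.sub(r"^\d+_[\d.]+_", "", stem)         # regex from the fix above
    for prefix in ("data-ACCESS-", "data:"):          # assumed prefix list
        if stem.startswith(prefix):
            stem = stem[len(prefix):]
    return stem.replace("-", " ").replace("_", " ").title()

print(title_from_stem("10_1758119706.911465_data-ACCESS-how-to-cite-Jetstream"))
# → How To Cite Jetstream
```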
Ran `DOCUMENTS_DIR="../rag_documents/staging" qa-extract extract documents --no-judge` on all 47 files in staging/. Took ~25 minutes (94 LLM calls).
Results: 586 Q&A pairs from 83 entities (46 files processed, 1 corrupt file skipped).
| Category | Entities | Pairs | Notes |
|---|---|---|---|
| NET-NEW docx (process/how-to) | 19 | ~110 | Allocations, MFA, Globus, Sage, Jupyter |
| User Guide PDFs (chunked) | 39 chunks | ~290 | Jetstream2 (20 chunks), Anvil (6), Bridges-2 (5), etc. |
| MCP Overlap descriptions | 17 | ~134 | 1-page resource PDFs |
| Other (ARA, SDS, REPACSS) | 8 | ~52 | Small docs |
- 100% citation markers (`<<SRC:documents:...>>`)
- All pairs use `source: "doc_generated"`
- Large PDFs chunked correctly (~6000 words per chunk with overlap)
- Quality spot-check: questions are natural, answers contain specific details (URLs, commands, step-by-step procedures)
- Only error: `current-access-projects.docx` (known corrupt/empty file)
Output at data/output/documents_qa_pairs.jsonl (gitignored). Branch pushed to GitHub.
Not yet run: data/ directory (Darwin, Delta, FASTER docs + ACCESS-Travel-Rewards.md + software lists).
Ran `DOCUMENTS_DIR="../rag_documents/data" qa-extract extract documents --no-judge` on all files in the data/ subdirectories.
Results: 221 Q&A pairs from 29 entities.
| Subdirectory | Entities | Pairs | Notes |
|---|---|---|---|
| ACCESS-Resources/Darwin/ | 9 | ~65 | Managing jobs, user guide, compiling, file systems, etc. |
| ACCESS-Resources/Delta/ | 3 chunks | ~25 | Large PDF chunked into 3 |
| ACCESS-Resources/FASTER/ | 4 | ~30 | User guide, system overview, jobs, file systems |
| ACCESS-Travel-Rewards.md | 1 | ~8 | Travel reimbursement program |
| ACCESS-Software-Installed-by-resource/ | 12 | ~93 | Software lists (package names/versions — generic Q&A quality) |
- Software-list files produced generic "what software is installed on X" pairs — adequate but not high-value. Argilla reviewers can reject low-quality ones.
- Darwin and FASTER docs produced strong procedural content (SLURM commands, file system paths, compilation flags).
Saved staging/ output as documents_staging_qa_pairs.jsonl, combined both runs into documents_all_qa_pairs.jsonl (807 total pairs).
Pushed all 807 pairs to Argilla: qa-extract push data/output/documents_all_qa_pairs.jsonl. Records visible in qa-review dataset at http://localhost:6900.
Docker note: Argilla containers had stale network references from previous sessions. Fixed with docker compose down --remove-orphans && docker network prune -f && docker compose up -d.
Problem: When reviewing pairs in Argilla, all 807 records had domain: "documents" with no way to tell which source document they came from — the only clue was the source_ref URI (e.g., doc://documents/10_1758119706.911465_data-ACCESS-how-to-cite-Jetstream), which is opaque. For MCP-extracted pairs, domain provides natural grouping (compute-resources, allocations, etc.), but document pairs lack an equivalent.
Fix: Added document_name as an optional metadata field on QAMetadata, populated from the existing _title_from_stem() helper in DocumentExtractor. The field flows through to Argilla as a filterable TermsMetadataProperty. MCP extractors are unaffected (field defaults to None).
Files changed: models.py (field + factory param), documents.py (passes title), argilla_client.py (schema + record metadata).
Re-extraction: Re-ran both staging/ (611 pairs) and data/ (214 pairs) = 825 total. Deleted old Argilla dataset (no schema for document_name), pushed fresh. 72 unique document names now filterable in Argilla.
Problem: For MCP entity pairs, source_data contains the full entity JSON that the LLM used to generate the Q&A pair — reviewer sees exactly what went in. For document pairs, source_data was set to content_preview: chunk[:500] — the first 500 characters of the chunk. This was misleading: it looked like the source material but only represented a tiny slice of the ~6000-word chunk the LLM actually saw. Reviewers would see a content_preview about topic X when the Q&A pair was about topic Y (from elsewhere in the same chunk).
Fix: Replaced content_preview with a reference: {file, chunk, total_chunks, word_count}. For non-chunked documents, chunk and total_chunks are null. The reviewer sees the file and chunk number; the actual document is in rag_documents/.
Design note on chunking: Large documents (>6000 words) are split into sequential ~6000-word chunks with 500-word overlap. Each chunk is processed as a separate entity — the LLM only sees one chunk at a time, not the whole document. So chunk 9 of a 20-chunk Jetstream PDF starts at roughly word 44,000. This is why the source_ref includes the chunk number (e.g., doc://documents/jetstream-2-user-guide__chunk_9).
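The chunking scheme above can be sketched as a sliding word window. This is an illustrative reconstruction, not the actual DocumentExtractor code — the function name and split-on-whitespace tokenization are assumptions; only the sizes (6000-word chunks, 500-word overlap) come from the log:

```python
def chunk_words(text: str, chunk_size: int = 6000, overlap: int = 500) -> list[str]:
    """Split text into sequential ~chunk_size-word windows with overlap.

    Illustrative sketch of the scheme described above (names assumed).
    Each chunk starts (chunk_size - overlap) = 5500 words after the last,
    so chunk 9 (1-indexed) starts at word 8 * 5500 = 44,000.
    """
    words = text.split()
    if len(words) <= chunk_size:
        return [text]  # non-chunked document: single "chunk"
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]
```

With this step size, a 20-chunk document implies roughly 20 × 5500 ≈ 110,000 words, consistent with chunk 9 beginning near word 44,000.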
Andrew asked about making the bake-off self-service: editable golden questions, runnable by the team with their own tokens, comparing different agent configurations ("tool combinations"). Key points from the conversation:
- Golden questions: Andrew wants a curated benchmark set that people can view, add, and modify. These are distinct from the Q&A pairs in Argilla — they're the test inputs used to evaluate the agent.
- Different tool combinations: Not UKY-vs-pgvector (UKY is going away), but different configurations of our agent — RAG thresholds, MCP server subsets, model choices. Each configuration is a "scenario."
- Self-service: Team members should be able to run evaluations and see results without Joe in the loop.
- Ongoing process: Re-run as the agent evolves, not a one-shot comparison.
Designed the evaluation harness. Full design saved as EVAL_HARNESS_PLAN.md. Summary:
- Golden questions in YAML (merge A3_TEST_QUESTIONS.md + e2e_test_cases.csv → ~55 questions with structured assertions)
- Scenario configs as YAML files overriding `Settings` env vars
- CLI runner calling `run_agent()` directly (not HTTP) to capture full `AgentState`
- HTML report generator producing self-contained comparison pages (matching a3-run3 visual style)
- New `access-agent/eval/` directory
Added as Project D in FEB_MARCH_PLAN.md (D.1–D.4), parallel with Project B after C.4 completes.
Pivot: Initially designed as a CLI-based Python tool (access-agent/eval/). Revised to a static web app on Netlify (eval-ui/) — no Python environment needed, users just open a browser. Golden questions and scenarios bundled at build time, results displayed inline and exportable as JSON. Two open design questions flagged: (1) how scenarios actually change agent behavior given the current API doesn't accept config overrides, and (2) API key routing (server-side vs pass-through). Plan saved as EVAL_HARNESS_PLAN.md.
This is future work — immediate next step remains C.4 (review Argilla, sync pgvector, re-run A.3).
Spot-checked the 825 document pairs in Argilla and found a systematic quality issue: 36% (300/825) of generated questions referenced the source documents rather than the subject matter.
Examples:
- Wrong: "What are the important quotas and limits mentioned in the Darwin Filesystems Storage document?"
- Right: "What are the storage quotas on Darwin?"
Root cause analysis: Two contributing factors:
- FIELD_GUIDANCE field group #1 said "what is this document about?" — 90% of seq-1 (overview) pairs were meta-referencing.
- Entity titles included document-type suffixes ("Jetstream 2 User Guide") which primed the LLM to treat the document as the subject.
question_categories.py — Two changes:
- Added explicit anti-meta-referencing instruction to `DOMAIN_NOTES["documents"]` with wrong/right examples.
- Reworded all 5 field groups in `FIELD_GUIDANCE["documents"]` to avoid document-referencing (e.g., "Overview — what is this topic about?" instead of "what is this document about?").
documents.py — Added regex to _title_from_stem() to strip document-type suffixes ("User Guide", "Manual", "Handbook", etc.) so the LLM sees "Jetstream 2" instead of "Jetstream 2 User Guide" as the entity name.
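A minimal sketch of the suffix-stripping idea, assuming a trailing-suffix regex; the exact suffix list and helper name in documents.py may differ:

```python
import re

# Document-type suffixes to strip from entity titles so the LLM treats the
# subject ("Jetstream 2"), not the document ("Jetstream 2 User Guide"),
# as the entity name. The suffix list here is illustrative.
DOC_TYPE_SUFFIX = re.compile(
    r"\s+(User Guide|Guide|Manual|Handbook|Documentation)\s*$",
    re.IGNORECASE,
)

def strip_doc_suffix(title: str) -> str:
    """Remove a trailing document-type suffix from a title, if present."""
    return DOC_TYPE_SUFFIX.sub("", title).strip()
```

Ordering the alternation with "User Guide" before "Guide" matters: the regex engine tries alternatives left to right, so the longer suffix is removed whole.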
Three extraction runs after iterating on fixes:
- Staging (first fix): 608 pairs, 10% meta (down from 36%)
- Staging (with title suffix fix): 604 pairs, 0.9% meta (6 remaining)
- Data directory: 228 pairs
Combined: 832 pairs, 6 meta-referencing (0.7%). Cleared Argilla and pushed fresh.
Brought up all services locally (qa-service on 8001, access-agent on 8000). Synced 832 document pairs from Argilla and loaded 70 entity pairs via JSONL. Total: 902 pairs in pgvector.
Fired all 41 test questions. Results:
| Metric | Run 3 (83 pairs) | Run 4 (902 pairs) |
|---|---|---|
| UKY hits | 38/41 (93%) | 40/40 (100%) |
| pgvector hits | 27/41 (66%) | 27/40 (67%) |
| pgvector avg latency | ~5ms | ~30ms |
pgvector coverage stayed flat at 67% despite 10x more pairs.
The 13 missed questions fall into two categories:
- Missing source content (4 questions) — Ranch storage has zero Q&A pairs because no Ranch documents exist in `rag_documents/` and Ranch wasn't returned from MCP in the extraction run that generated the original test questions.
- No cross-cutting Q&A pairs (9 questions) — General ACCESS questions ("How do I apply for an allocation?", "How do I transfer files between resources?", "What training does ACCESS offer?") have no matching pairs even though we have 104 allocation mentions, 50 transfer/Globus mentions, and 40 training mentions across our pairs. The problem: all those mentions are entity-scoped. We have "How do I cite Jetstream?" but not "How do I acknowledge ACCESS?" We have "What allocations does Anvil support?" but not "How do I apply for an allocation?"
The extraction pipeline processes one document at a time, so it only ever generates entity-scoped Q&A pairs. It will never produce cross-cutting "How does ACCESS work in general?" pairs from a single-document prompt.
UKY's advantage is architectural: chunk-level retrieval at query time lets it pull relevant fragments from multiple documents and synthesize on the fly. It doesn't need a pre-generated answer that matches — it just needs chunks that are individually relevant. Our Q&A-pair RAG needs a pair whose question semantically matches the user's question, and no single entity-scoped pair matches a cross-cutting query closely enough.
- Manually curate cross-cutting pairs — Write 20-30 general ACCESS Q&A pairs by hand. Fast, targeted, but doesn't scale.
- Add a cross-cutting extraction pass — Feed the LLM multiple documents simultaneously and ask for general questions that span topics. New pipeline capability.
- Keep UKY as fallback for general questions — Accept the hybrid. pgvector for entity questions (fast, verified), UKY for cross-cutting (slow, unverified).
- Lower similarity thresholds — Some misses scored 0.55-0.68, not far from the 0.70 cutoff. Won't fix the 0.28-0.49 misses.
- Detect cross-cutting-ness at query time — Instead of pre-generating cross-cutting pairs, use pgvector match quality as a signal: low scores with scattered partial matches → route to document chunk RAG or MCP tools. Fits existing agent graph routing.
- `a3_results/run4.json` — 40 comparison log entries
- `a3_results/run4_enriched.json` — enriched with low-threshold best-possible scores
- `~/.agent/diagrams/a3-run4-bakeoff.html` — interactive comparison visualization
Even when pgvector hits, many answers are thinner than UKY's. Investigated whether pgvector answers were bypassing LLM synthesis — confirmed they are NOT: _dual_rag_answer() calls _synthesize_with_rag_only() for every pgvector match. The real issue: a single pre-digested Q&A pair gives the synthesis LLM very little to work with, so it returns near-verbatim text. UKY pulls multiple document chunks and the LLM has more raw material to synthesize a richer answer.
However, reviewing side-by-side answers revealed a more nuanced picture:
- Some pgvector answers are actually better than UKY's (more precise, directly relevant)
- Some just need link enrichment (the synthesis prompt doesn't encourage adding URLs)
- Some questions UKY can't answer but pgvector can (entity-specific data from MCP)
This shifts the framing from "pgvector vs UKY" to "how to combine them intelligently."
Quick fix (low-effort, high-impact): The RAG_ONLY_SYNTHESIS_PROMPT in synthesize.py says "Be concise and direct" — this is why the LLM returns near-verbatim single sentences. Updating the prompt to encourage link inclusion, practical context, and resource pointers would immediately enrich thin answers without any architectural changes. The Q&A pair metadata already carries domain and entity_id which could drive link generation.
Instead of generating cross-cutting Q&A pairs up front, detect cross-cutting-ness at query time based on pgvector results and route accordingly:
- pgvector score < threshold but > 0.4 → content exists but scattered → fall back to document chunk RAG or plan+MCP
- pgvector hit but thin answer → enrich with MCP tool calls or document chunks
- pgvector hit with rich answer → serve it (fast, verified)
- pgvector zero matches → missing content → MCP or UKY fallback
This fits the existing agent graph — rag_answer already evaluates match quality and routes to plan on weak matches. The change: make that evaluation smarter about why the match is weak.
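The four branches above can be sketched as a single routing function. This is a design sketch, not the agent's actual code: the function name, score thresholds (0.70/0.40), and the thin-answer length heuristic are all assumptions:

```python
def route_from_pgvector(
    matches: list[dict],
    strong: float = 0.70,   # assumed "good match" cutoff
    weak: float = 0.40,     # assumed "content exists but scattered" floor
) -> str:
    """Illustrative query-time routing based on pgvector match quality."""
    if not matches:
        return "mcp_or_fallback"        # zero matches: missing content
    best = max(m["score"] for m in matches)
    if best < weak:
        return "mcp_or_fallback"        # nothing usable
    if best < strong:
        return "document_chunk_rag"     # scattered partial matches
    if len(matches[0].get("answer", "")) < 200:
        return "enrich_with_tools"      # hit, but thin answer (length heuristic assumed)
    return "serve_qa_pair"              # rich hit: fast, verified
```

The point of the sketch is that the routing signal already exists in the scores the rag_answer node sees — no new retrieval infrastructure is required to make the weak-match evaluation smarter.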
- `threshold=0.0` falsy in vectorstore.py: `threshold or settings.rag_similarity_threshold` treats 0.0 as falsy, falling back to the default 0.85. Affects diagnostic queries with `threshold=0`.
- q21 not logged: "How much funding did the pollinator conservation AI project get?" was classified as non-RAG (40/41 logged).
Discovery: The Run 4 summary reported "UKY hits 40/40 (100%)" — but this counted every UKY response as a hit, including hedges like "The provided documents do not contain specific information about Abaqus. Please open a support ticket." Applied the same hedge detection used at runtime (_rag_answer_is_weak in graph.py) to the logged responses.
Corrected Run 4 numbers:
| Metric | pgvector | UKY |
|---|---|---|
| Genuine answers | 27/40 (68%) | 13/40 (33%) |
| Hedged / no match | 13 | 27 |
Head-to-head breakdown:
- Both answered well: 8
- pgvector only (UKY hedged): 19
- UKY only (pgvector no match): 5 — all general process questions (allocations, password reset, file transfer)
- Neither answered well: 8
What this means: pgvector already outperforms UKY 2-to-1. UKY's 19 entity-specific hedges are questions pgvector handles from curated MCP data (software versions, resource specs, NSF awards) that UKY's document corpus simply doesn't cover. The "UKY as strong fallback" framing was wrong — UKY adds value on only 5 questions, all cross-cutting process topics.
Remaining gap (13 questions): 5 cross-cutting process questions (UKY answers, pgvector doesn't) + 8 neither backend handles. A document-chunk fallback for cross-cutting detection would address most of these, but the urgency is lower than previously thought.
Also this session: Updated SYSTEM_OVERVIEW.md routing table with file names, condition explanations, and node descriptions. Synced gist.
A.3 Run 5 complete (full-system test). Node tracing added. Top-5 matches + enriched synthesis prompt deployed. Next: curate cross-cutting Q&A pairs for the ~5 procedural LLM-only questions.
- A.1 (Argilla → pgvector sync) ✅
- A.2 (dual-RAG logging in access-agent) ✅
- A.3 Runs 1–5 complete ✅ — RAG-vs-RAG (Runs 1–4), full-system (Run 5)
- Post-mortem analysis ✅ — gap is content type (entity vs process), not quality
- UKY corpus obtained ✅ — 75 files in `rag_documents/`
- Direction confirmed with Andrew ✅ — generate Q&A pairs from docs, unify in pgvector, retire UKY
- C.1 corpus categorized ✅ — index at `rag_documents/CORPUS_INDEX.md`
- C.2 document extractor built ✅ — committed and pushed on `feat/document-extractor`
- C.3 extraction complete ✅ — 832 pairs (604 staging + 228 data), meta-referencing fixed (36% → 0.7%)
- Outstanding PRs merged ✅ — both `access-qa-extraction` and `access-qa-planning` PRs squash-merged
- C.4 sync + bake-off ✅ — 902 pairs in pgvector (832 document + 70 entity), 40 questions answered
- A.3 Run 4 reanalysis ✅ — pgvector 68% vs UKY 33% (hedge responses excluded)
- A.3 Run 5 ✅ — full-system test (pgvector + MCP + routing). 24 RAG, 5 MCP, 12 LLM-only.
- Node tracing ✅ — `node_trace` in AgentState, gated behind `?include_trace=true` (commits `04342c8`, `b7a9bec`)
- Top-5 matches + enriched synthesis prompt ✅ — `RAG_TOP_K` 3→5, prompt rewritten (commit `ef43a21`)
- pgvector is already ahead: 27/40 genuine answers vs UKY's 13/40. pgvector covers entity-specific data (software, resources, awards) that UKY cannot.
- Full system closes more gaps: MCP tools answer Ranch questions and project search (Run 5). 12 questions remain LLM-only (ungrounded).
- Cross-cutting gap splits into two types: Union-type queries ("What resources support GPUs?") should now be addressed by top-5 multi-match synthesis. Procedural queries ("How do I apply for an allocation?") still need hand-curated cross-cutting Q&A pairs (~5 questions).
- Curate cross-cutting Q&A pairs — allocations, Globus, MFA, training, citation (fills the ~5 procedural LLM-only gaps)
- Re-run evaluation — test top-5 + enriched prompt against the 41 questions, compare answer quality
- Project D — evaluation harness (EVAL_HARNESS_PLAN.md)
- Project B — feedback protocol design
Ran all 41 questions through the production agent graph with MCP servers active and UKY disabled. This is the first system-vs-system test: pgvector RAG + MCP tools + LangGraph routing, compared against UKY's baseline responses from Run 4.
Configuration:
- `ENVIRONMENT=production`, `MCP_SERVER_HOST=host.docker.internal` — agent container reaches MCP servers via Docker host bridge
- `UKY_RAG_ENABLED=false`, `DUAL_RAG_LOGGING=false` — no UKY, no dual-RAG comparison path
- 10 MCP servers running (access-mcp/docker-compose.yml)
- 902 Q&A pairs in pgvector (832 document + 70 entity)
Results — 41/41 questions answered:
- 24 via RAG (`rag_retrieval`)
- 5 via MCP tools (`search_resources`, `get_resource_hardware`, `search_events`, `search_projects`)
- 12 LLM-only (no tools called)
- MCP tools fill the Ranch gap. Ranch had zero Q&A pairs — q5, q6, q40 now get real answers via `search_resources` and `get_resource_hardware`. Even the misspelled q40 ("reanch storage") resolves.
- q41 gets a real answer. "What allocation projects are using machine learning?" calls `search_projects`, which returns 20 real projects with PIs and institutions.
- q31 routes to events. "What training resources does ACCESS offer?" calls `search_events`, though the search returned empty results.
- Cross-cutting questions (q3, q7, q8, q26-q28, q32-q33, q38) fall to LLM synthesis. Neither RAG nor MCP covers these general ACCESS process questions. Answers read well but are ungrounded — could hallucinate.
The API response only exposes tools_used, confidence, execution_strategy, tool_count. We cannot tell from the response:
- What the classifier decided (`static`/`dynamic`/`combined`)
- Which graph nodes actually executed (e.g. did RAG fire and fail before falling to LLM?)
- RAG similarity scores for matched pairs
- Whether `_rag_answer_is_weak` triggered
- The plan content (if the planner node ran)
- MCP tool arguments and raw responses
The 12 "LLM-only" answers are a black box — we can't distinguish "classified as static, RAG returned nothing, fell through to LLM" from "classified as static, LLM answered directly without trying RAG." Adding a node_trace to QueryResponse is the immediate next step.
Interactive HTML comparison at ~/.agent/diagrams/a3-run5-comparison.html. Matches the Run 3/4 report format: KPI cards, filters, expandable side-by-side comparison. Note: hedge detection has a known issue — see below.
The report's hedge detection uses substring matching against phrases like "do not contain", "does not explicitly", etc. UKY q27 ("The provided documents do not specify...") is marked hedged but none of the exact phrases match — the detection was too aggressive. The h2h classification for q27 and potentially others needs review. Should align with access-agent/src/agent/graph.py:_rag_answer_is_weak() which uses the canonical hedge phrases.
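For concreteness, substring-based hedge detection looks like the sketch below. The phrase list here is illustrative — the canonical set lives in `graph.py:_rag_answer_is_weak()`, which is exactly why the report should import it rather than maintain its own copy:

```python
# Illustrative hedge-phrase list — NOT the canonical set from
# access-agent/src/agent/graph.py:_rag_answer_is_weak().
HEDGE_PHRASES = (
    "do not contain",
    "does not contain",
    "do not specify",
    "does not explicitly",
    "open a support ticket",
)

def is_hedged(answer: str) -> bool:
    """Mark an answer as a hedge if it contains any known hedge phrase."""
    text = answer.lower()
    return any(phrase in text for phrase in HEDGE_PHRASES)
```

Divergence between two hand-maintained phrase lists is precisely the q27 failure mode: one list matches an answer the other does not, and the head-to-head classification silently disagrees with runtime routing.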
- `a3_results/run5.json` — 41 questions with full agent responses
- `a3_results/uky_baseline_from_run4.json` — UKY baseline (40 questions, q21 missing)
- `a3_results/run_a3_test.py` — test runner (updated for Run 5: captures full response, saves to JSON)
- Add node tracing to agent graph — track which nodes executed, classification result, RAG scores. Expose in `QueryResponse.metadata`.
- Re-run with tracing — Run 5b with node trace data, so we can see exactly how each question routes.
- Fix hedge detection — align the report's hedge logic with `_rag_answer_is_weak()` from the agent codebase.
- Tune synthesis prompt — `RAG_ONLY_SYNTHESIS_PROMPT` produces thin answers when one pair matches. Add links, context.
- Curate 20-30 cross-cutting Q&A pairs — allocations, Globus, MFA, training, citation (fills the 12 LLM-only gaps).
```shell
cd /Users/josephbacal/Projects/sweet-and-fizzy/access-ci/access-qa-service && docker compose up -d
cd /Users/josephbacal/Projects/sweet-and-fizzy/access-ci/access-agent && docker compose up -d
```
Verify with `docker ps` — you should see `access-agent-agent-1` (8000), `qa-service-app` (8001), and their postgres/redis containers.
```shell
curl -s -X POST http://localhost:8000/api/v1/query \
  -H "Content-Type: application/json" \
  -d '{"query": "What is ACES?", "session_id": "test", "question_id": "test-1"}' | python3 -m json.tool
```
Two commits on access-agent/main:
04342c8 — Added `node_trace` to `AgentState` as `Annotated[list[dict[str, Any]], operator.add]`. Each graph node appends a structured trace dict recording what it decided:
- classify: query_type, confidence, domain, rag_endpoint, reason, whether query was expanded
- rag_answer: source (uky/pgvector), match_count, best_score, rag_used, has_final_answer
- plan: requires_tools, tool_count, tool names, strategy
- execute: tools_called, succeeded, failed
- evaluate: is_helpful, reason
- recover: action taken, new tools selected
- synthesize: strategy, answer_length
- domain_agent: domain, tool_count
The /api/v1/query response includes classification summary and node_trace in metadata. 10 files changed (all node files + state.py + routes.py).
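The Annotated-reducer pattern above can be sketched as follows. This is a minimal illustration of the mechanism, not the agent's actual state or node code — field names and node bodies are assumptions; with `operator.add` as the reducer, each node returns only its own entry and the graph framework concatenates entries into state:

```python
import operator
from typing import Annotated, Any, TypedDict

class AgentState(TypedDict):
    # operator.add is the reducer: node outputs are concatenated, so each
    # node appends its trace entry without reading or rewriting the list.
    node_trace: Annotated[list[dict[str, Any]], operator.add]

def classify(state: AgentState) -> dict:
    # each node returns one structured dict describing its decision
    return {"node_trace": [{"node": "classify", "query_type": "static"}]}

def rag_answer(state: AgentState) -> dict:
    return {"node_trace": [{"node": "rag_answer", "match_count": 3, "best_score": 0.91}]}
```

Because nodes never mutate the existing list, accumulation is zero-overhead when the response omits the trace: the dicts sit in state and are simply not serialized unless `?include_trace=true` is set.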
b7a9bec — Gated node_trace behind ?include_trace=true query parameter. OTel/Honeycomb (added Jan 2026, commit 422b92d) already provides full distributed tracing for ops. node_trace serves a different consumer: the eval harness needs trace data inline in the API response so it can programmatically inspect routing decisions without querying an external service. Nodes continue accumulating trace dicts in state (zero overhead), but the response only includes them when opted in.
- OTel/Honeycomb: Ops. Waterfall view of every span, LLM call, MCP tool. External service.
- `node_trace`: Eval. Inline in API response. Shows decisions (classifier output, RAG scores, tool selection), not timing. Consumable programmatically by the eval harness (Project D).
Andrew suggested returning the top 5 Q&A pair matches (instead of just the best) and letting the synthesizer combine them — simpler than building document-chunk retrieval for cross-cutting queries. Analysis confirmed the pipeline already supported multiple matches end-to-end (RAG_TOP_K was 3, qa-service accepts up to 20, all downstream code iterates over the full list). The change was purely configuration + prompt.
config.py — RAG_TOP_K: 3 → 5. More material for the synthesizer, especially for union-type cross-cutting queries where related entity pairs from different resources can be combined.
synthesize.py — RAG_ONLY_SYNTHESIS_PROMPT rewritten:
- "Be concise and direct" → "Answer the question thoroughly"
- New guideline: when multiple knowledge entries are provided, synthesize into a unified answer
- URLs/links elevated to IMPORTANT (matching the tool-only and combined prompts)
- Added practical next steps guidance and support ticket link (both already present in the other prompts, missing here)
- Thin answers: Single-match answers were near-verbatim because the prompt said "be concise." Now the LLM is instructed to give a complete, actionable response with links and context.
- Union-type cross-cutting queries: "What resources support GPUs?" now gets 5 entity-scoped pairs (Delta, Bridges-2, ACES, etc.) and the prompt tells the LLM to combine them.
- Does NOT fix procedural cross-cutting: "How do I apply for an allocation?" still has no matching pairs at any score. These need hand-curated cross-cutting Q&A pairs (~5 questions).
Commits across all repos related to the Feb/March plan. Older commits omitted.
| Hash | Date | Message |
|---|---|---|
| c8fbf0b | 02-26 | docs: remove historical docs, update system overview for two-shot |
| 853e88f | 02-26 | replace GUIDED-TOUR with TRACE-TOUR signposts; fix software name casing |
| 00ba293 | 02-24 | prompt: add rule to quote long lowercase entity names in Q&A |
| 7b0590e | 02-24 | prompt: enhance rule 4 to check free-text fields; update review observations doc |
| 28be413 | 02-24 | fix: entity name interpolation + temporal language + coming-soon cleanup |
| 170e87d | 02-24 | docs: log full corpus scan results — quantify issues #1/#2, add issue #3 |
| 8336f45 | 02-24 | docs: move allocations:72170 finding to Patterns (positive, not an issue) |
| d7f57f5 | 02-24 | docs: log allocations:72170 as non-issue (Jurafsky in source data, verified) |
| 4f9c22d | 02-24 | docs: add retrieval surface area rationale to P1 (self-contained answers) |
| a4f7b66 | 02-24 | docs: note preferred fix for P1 — entity name interpolation in user prompt |
| 43e980e | 02-24 | docs: clarify P1 — entity name needed in both Q and A for RAG |
| 70f9424 | 02-24 | docs: add P1 pattern — questions must be self-contained (cross-cuts #1 and #2) |
| 6084c93 | 02-24 | docs: log issue #2 — decontextualized-question pattern (pervasive) |
| 07da145 | 02-24 | docs: log issue #1 — temporal-assumption in affinity-groups events |
| c4ec468 | 02-24 | docs: add qa-review-observations.md for tracking Argilla review issues |
| 6857db8 | 02-24 | docs: improve signpost comments + fix COMING SOON name normalization |
| 579e10d | 02-24 | fix: normalize "COMING SOON" resource names to lowercase |
| 7bd43ba | 02-24 | wip: some signpost comments |
| 3333c32 | 02-23 | docs: update guided-tour |
| 66e1819 | 02-20 | refactor: adopt two-shot as sole extraction strategy |
| 7803147 | 02-20 | fix: restore missing return in software_discovery._generate_qa_pairs |
| 7791e2b | 02-20 | feat: add --prompt-strategy flag for A/B/C extraction experiment |
| b662dc9 | 02-20 | feat: implement entity-replace for Argilla push |
| 80fc641 | 02-20 | docs: update plan with metadata on human actions on archive records |
| 9d54819 | 02-19 | fix(data-quality): separate NSF program fields and add per-domain LLM guidance |
| 39a4c06 | 02-19 | refactor: remove factoid templates and bonus generation (2-pass pipeline) |
| 5268caa | 02-19 | docs: reflect entity-replace decision and update README |
| 8c9e7f2 | 02-18 | docs: update all docs for freeform extraction pipeline and Argilla dedup |
| 4181585 | 02-18 | feat: roll out freeform extraction to all 5 extractors |
| da79f7d | 02-18 | feat: freeform extraction replaces category+bonus two-pass approach |
| 2833d7b | 02-18 | docs: update for Argilla metadata integration and test count |
| e6d08fa | 02-18 | feat(argilla): add eval_issues and source_ref to Argilla records |
| 3c762c9 | 02-18 | feat(argilla): push judge scores and granularity to Argilla metadata |
| 24c8373 | 02-17 | feat(judge): LLM judge evaluation scores for Q&A pair quality |
| 93a1fb2 | 02-17 | feat(bonus): LLM exploratory questions for entity-unique information |
| 068c08a | 02-17 | feat(incremental): hash-based change detection to skip unchanged entities |
| 9059614 | 02-17 | fix(factoids): data quality guards for template generation |
| 3662d8b | 02-13 | feat(generators): dual-granularity Q&A + extend comparisons to all 5 domains |
| fa2ff93 | 02-12 | fix(nsf-awards): normalize primaryProgram list + skip unused MCPClient |
| f3b1437 | 02-12 | feat(extractors): fixed question categories + direct API for allocations/nsf-awards |
| fdebdab | 02-12 | feat(software-discovery): switch from search terms to list_all_software |
| e33d006 | 02-11 | feat(extract): add max_entities cap for cheap test runs |
| 2da2c32 | 02-10 | Use real enumerations from taxonomies.ts for search terms |
| d987dee | 02-10 | Add report command for MCP coverage stats without LLM calls |
| 6c4667c | 02-10 | Add ExtractionConfig to centralize extraction parameters |
| 0b16ba8 | 02-04 | Fix Q&A pair ID collisions by appending question hash |
| cf384bc | 02-04 | Add Argilla integration for pushing Q&A pairs to human review |
| 51e9877 | 02-04 | Expand extraction queries, fix software-discovery, update docs |
| a69ce2e | 02-02 | Fix allocations and nsf-awards extractors returning 0 results |
| 038d42d | 02-02 | Add dedicated OpenAI backend (LLM_BACKEND=openai) |
| b557300 | 02-01 | Add LOCAL_DIRECTIONS.md and update .env.example for OpenAI setup |
| d45eda1 | 02-01 | Add NSFAwardsExtractor and register in CLI/validator |
| b67eba0 | 02-01 | Add AllocationsExtractor and register in CLI/validator |
| 18c0e49 | 01-31 | Add AffinityGroupsExtractor and fix MCP server port defaults |
| de28ab2 | 01-31 | Add CLAUDE.md and update README with local dev setup guide |
| Hash | Date | Message |
|---|---|---|
| 5b57ae0 | 02-28 | Fix Argilla sync to work with access-qa-extraction's dataset schema |
| Hash | Date | Message |
|---|---|---|
| ef43a21 | 03-12 | feat: return top-5 RAG matches and enrich synthesis prompt |
| b7a9bec | 03-12 | feat: gate node_trace behind ?include_trace query parameter |
| 04342c8 | 03-11 | feat: add node_trace to agent graph for execution path observability |
| de26e37 | — | feat: route pgvector through LLM synthesis + fair comparison logging |
| 08809ad | — | fix: lower RAG similarity thresholds — 0.85 was filtering valid matches |
| caf7256 | 02-28 | feat: add dual-RAG comparison logging for A.2 evaluation |
| Hash | Date | Message |
|---|---|---|
| bb3b54f | 02-04 | spike: Add list-all fallbacks to allocations and nsf-awards routers |
| Hash | Date | Message |
|---|---|---|
| a84fb4a | 02-26 | docs: GUIDED-TOUR.md → TRACE-TOUR.extract.md in file tree |
| 033c46e | 02-23 | docs: update mcp-extraction-impl to reflect two-shot pipeline and entity-replace |
| Hash | Date | Message |
|---|---|---|
| d5cb931 | 01-30 | chore: init claude file |