Looking at the script, there are 3 distinct `pairctl` interactions, all routed through a common helper.

All `pairctl` calls flow through this helper, which:

- Generates a unique `/tmp/knowctl-{uuid}.json` output path
- Constructs a prompt with a TypeScript schema plus an instruction to write JSON to that file
- Calls `pairctl send-message <prompt> --cli <cli>` with a 5-minute timeout
- Reads the JSON file directly (avoiding parsing the chat response)
Why this pattern? Parsing JSON from chat responses is fragile—agents wrap it in markdown, add commentary, etc. Writing to a file gives clean output.
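A minimal sketch of what this helper plausibly looks like in Python. The name `call_pairctl`, the timeout constant, and the exact prompt wording are assumptions for illustration, not the script's actual code:

```python
import json
import subprocess
import uuid

TIMEOUT_SECONDS = 300  # the 5-minute timeout mentioned above


def call_pairctl(prompt: str, schema: str, cli: str):
    # Unique temp path so concurrent calls never collide
    out_path = f"/tmp/knowctl-{uuid.uuid4()}.json"
    full_prompt = (
        f"{prompt}\n\n"
        f"Write your answer as JSON matching this TypeScript schema:\n"
        f"{schema}\n\n"
        f"Write ONLY the JSON to the file {out_path}"
    )
    subprocess.run(
        ["pairctl", "send-message", full_prompt, "--cli", cli],
        check=True,
        timeout=TIMEOUT_SECONDS,
    )
    # Read the file the agent wrote instead of parsing its chat reply
    with open(out_path) as f:
        return json.load(f)
```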
**Interaction 1: Document selection**

When: Topic has more documents than `MAX_DOCS_PER_TOPIC` (5)
Purpose: Have an LLM curate which documents are worth generating test queries for
Prompt:

```
Select {max_docs} documents from this topic that would be best for
generating semantic search test queries.

Choose documents that:
- Have substantial, rich content (not just short references)
- Cover diverse aspects of the topic
- Would generate interesting and varied test questions

Available documents:
{knowctl list-documents output}
```
Schema: `string[]` (array of document IDs)
Goal: Avoid wasting evaluation budget on thin/duplicate docs
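A hedged sketch of how this step might be wired through the helper above, reusing the hypothetical `call_pairctl`; `selection_prompt` and `docs` are assumed names:

```python
# Hypothetical wiring; identifiers here are illustrative, not the script's.
selected_ids: list[str] = call_pairctl(
    prompt=selection_prompt,  # the selection prompt shown above
    schema="string[]  // array of document IDs",
    cli="claude-sonnet",
)
docs = [d for d in docs if d.id in selected_ids]  # drop unselected documents
```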
**Interaction 2: Question generation**

When: For each selected document
Purpose: Generate 8 diverse test questions that the document can answer
Prompt:

```
Generate 8 test questions for evaluating semantic search retrieval quality.

Document (topic: {topic}, ID: {ordinal_id}):
---
{first 4000 chars of content}
---

Create 8 diverse questions that this document can answer. Be comprehensive:
1. Factual - direct answer found in text
2. Factual - another direct fact from text
3. Procedural - how to do something described in the doc
4. Procedural - a different how-to from the doc
5. Conceptual - requires understanding the "why" explained in the doc
6. Troubleshooting - what if X goes wrong (if applicable)
7. Rephrased - uses synonyms/different terminology than the doc
8. Edge case - less obvious use case covered by the doc
```
Schema: `Question[]`, where each `Question` has `question: string` and `difficulty: "easy" | "medium" | "hard"`
Goal: Create a varied test corpus covering different query types
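The per-document generation loop might then look like this, a sketch under the same assumptions; `build_generation_prompt`, `selected_docs`, and `generation_cli` are hypothetical names:

```python
# Hypothetical generation loop; the real script's names may differ.
QUESTION_SCHEMA = """\
interface Question {
  question: string;
  difficulty: "easy" | "medium" | "hard";
}
// result: Question[]
"""

all_questions = []
for doc in selected_docs:
    questions = call_pairctl(
        prompt=build_generation_prompt(doc),  # the prompt shown above
        schema=QUESTION_SCHEMA,
        cli=generation_cli,  # claude-sonnet or codex in this phase
    )
    all_questions.extend(questions)
```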
**Interaction 3: Ground-truth evaluation**

When: For each generated question (Phase 2)
Purpose: Verify/correct which document best answers the question
Prompt:

```
Find which document in the "{topic}" topic best answers this question:

Question: {question}

This question was generated from document {source_doc}, but another
document might answer it better.

Use these knowctl commands to explore:
- knowctl list-documents {topic} → see all documents with summaries
- knowctl show-document <doc-id> → read full document content

Process:
1. Run list-documents to see available documents and their summaries
2. Identify 2-4 promising candidates based on summaries (must include {source_doc})
3. Use show-document to read the full content of each candidate
4. Decide which document best answers the question
```
Schema:

```typescript
interface Result {
  best_doc: string;
  confidence: "high" | "medium" | "low";
  reason: string;
  explored_docs: string[];
}
```

Goal: The evaluating agent actually explores the knowledge base via shell commands to determine ground truth. This is "agentic evaluation": not just asking "which doc?" but giving the agent tools to investigate.
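Under the same assumptions as the earlier sketches, the evaluation call might be wired like this; `build_eval_prompt`, `RESULT_SCHEMA`, and the ground-truth correction are illustrative:

```python
# Hypothetical evaluation step; the agent itself runs the knowctl commands.
RESULT_SCHEMA = """\
interface Result {
  best_doc: string;
  confidence: "high" | "medium" | "low";
  reason: string;
  explored_docs: string[];
}
"""

result = call_pairctl(
    prompt=build_eval_prompt(topic, question, source_doc),  # prompt shown above
    schema=RESULT_SCHEMA,
    cli=eval_cli,  # rotated relative to the generator (see below)
)
if result["best_doc"] != source_doc:
    # The evaluator found a better-matching document; correct the ground truth.
    expected_doc = result["best_doc"]
```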
The script uses a two-phase batch pipeline with CLI rotation:
| Phase | CLI Assignment | Purpose |
|---|---|---|
| Generation | claude-sonnet, codex | Generate questions in parallel |
| Evaluation | Rotated (codex, claude-sonnet) | A different agent evaluates than the one that generated |
This rotation (line 772: `rotate_clis(base_clis, 1)`) ensures the evaluator is a different model from the generator, a form of adversarial validation.
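A minimal sketch of what `rotate_clis` plausibly does, assuming a simple left rotation; the script's actual implementation may differ:

```python
def rotate_clis(clis: list[str], n: int) -> list[str]:
    """Rotate the CLI list left by n positions.

    rotate_clis(["claude-sonnet", "codex"], 1) -> ["codex", "claude-sonnet"],
    so each question's evaluator differs from the generator that produced it.
    """
    n %= len(clis)
    return clis[n:] + clis[:n]
```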