Looking at the script, there are 3 distinct `pairctl` interactions, all routed through a common helper.

All `pairctl` calls flow through this helper, which:

- Generates a unique `/tmp/knowctl-{uuid}.json` output path
- Constructs a prompt with a TypeScript schema plus an instruction to write JSON to that file
- Calls `pairctl send-message <prompt> --cli <cli>` with a 5-minute timeout
- Reads the JSON file directly (avoiding parsing the chat response)
Why this pattern? Parsing JSON from chat responses is fragile—agents wrap it in markdown, add commentary, etc. Writing to a file gives clean output.
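A minimal sketch of what this helper plausibly looks like in Python. The name `call_pairctl`, the timeout constant, and the exact prompt wording are assumptions for illustration, not the script's actual code:

```python
import json
import subprocess
import uuid

TIMEOUT_SECONDS = 300  # the 5-minute timeout mentioned above


def call_pairctl(prompt: str, schema: str, cli: str):
    # Unique temp path so concurrent calls never collide
    out_path = f"/tmp/knowctl-{uuid.uuid4()}.json"
    full_prompt = (
        f"{prompt}\n\n"
        f"Write your answer as JSON matching this TypeScript schema:\n"
        f"{schema}\n\n"
        f"Write ONLY the JSON to the file {out_path}"
    )
    subprocess.run(
        ["pairctl", "send-message", full_prompt, "--cli", cli],
        check=True,
        timeout=TIMEOUT_SECONDS,
    )
    # Read the file the agent wrote instead of parsing its chat reply
    with open(out_path) as f:
        return json.load(f)
```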
**Interaction 1: Document selection**

When: Topic has more documents than `MAX_DOCS_PER_TOPIC` (5)
Purpose: Have an LLM curate which documents are worth generating test queries for
Prompt:

```
Select {max_docs} documents from this topic that would be best for
generating semantic search test queries.

Choose documents that:
- Have substantial, rich content (not just short references)
- Cover diverse aspects of the topic
- Would generate interesting and varied test questions

Available documents:
{knowctl list-documents output}
```
Schema: `string[]` (array of document IDs)
Goal: Avoid wasting evaluation budget on thin/duplicate docs
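A hedged sketch of how this step might be wired through the helper above, reusing the hypothetical `call_pairctl`; `selection_prompt` and `docs` are assumed names:

```python
# Hypothetical wiring; identifiers here are illustrative, not the script's.
selected_ids: list[str] = call_pairctl(
    prompt=selection_prompt,  # the selection prompt shown above
    schema="string[]  // array of document IDs",
    cli="claude-sonnet",
)
docs = [d for d in docs if d.id in selected_ids]  # drop unselected documents
```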
**Interaction 2: Question generation**

When: For each selected document
Purpose: Generate 8 diverse test questions that the document can answer
Prompt:

```
Generate 8 test questions for evaluating semantic search retrieval quality.

Document (topic: {topic}, ID: {ordinal_id}):
---
{first 4000 chars of content}
---

Create 8 diverse questions that this document can answer. Be comprehensive:
1. Factual - direct answer found in text
2. Factual - another direct fact from text
3. Procedural - how to do something described in the doc
4. Procedural - a different how-to from the doc
5. Conceptual - requires understanding the "why" explained in the doc
6. Troubleshooting - what if X goes wrong (if applicable)
7. Rephrased - uses synonyms/different terminology than the doc
8. Edge case - less obvious use case covered by the doc
```
Schema: `Question[]`, where each `Question` has `question: string` and `difficulty: "easy" | "medium" | "hard"`
Goal: Create a varied test corpus covering different query types
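The per-document generation loop might then look like this, a sketch under the same assumptions; `build_generation_prompt`, `selected_docs`, and `generation_cli` are hypothetical names:

```python
# Hypothetical generation loop; the real script's names may differ.
QUESTION_SCHEMA = """\
interface Question {
  question: string;
  difficulty: "easy" | "medium" | "hard";
}
// result: Question[]
"""

all_questions = []
for doc in selected_docs:
    questions = call_pairctl(
        prompt=build_generation_prompt(doc),  # the prompt shown above
        schema=QUESTION_SCHEMA,
        cli=generation_cli,  # claude-sonnet or codex in this phase
    )
    all_questions.extend(questions)
```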
**Interaction 3: Ground-truth evaluation**

When: For each generated question (Phase 2)
Purpose: Verify/correct which document best answers the question
Prompt:

```
Find which document in the "{topic}" topic best answers this question:

Question: {question}

This question was generated from document {source_doc}, but another
document might answer it better.

Use these knowctl commands to explore:
- knowctl list-documents {topic} → see all documents with summaries
- knowctl show-document <doc-id> → read full document content

Process:
1. Run list-documents to see available documents and their summaries
2. Identify 2-4 promising candidates based on summaries (must include {source_doc})
3. Use show-document to read the full content of each candidate
4. Decide which document best answers the question
```
Schema:

```typescript
interface Result {
  best_doc: string;
  confidence: "high" | "medium" | "low";
  reason: string;
  explored_docs: string[];
}
```

Goal: The evaluating agent actually explores the knowledge base via shell commands to determine ground truth. This is "agentic evaluation": not just asking "which doc?" but giving the agent tools to investigate.
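Under the same assumptions as the earlier sketches, the evaluation call might be wired like this; `build_eval_prompt`, `RESULT_SCHEMA`, and the ground-truth correction are illustrative:

```python
# Hypothetical evaluation step; the agent itself runs the knowctl commands.
RESULT_SCHEMA = """\
interface Result {
  best_doc: string;
  confidence: "high" | "medium" | "low";
  reason: string;
  explored_docs: string[];
}
"""

result = call_pairctl(
    prompt=build_eval_prompt(topic, question, source_doc),  # prompt shown above
    schema=RESULT_SCHEMA,
    cli=eval_cli,  # rotated relative to the generator (see below)
)
if result["best_doc"] != source_doc:
    # The evaluator found a better-matching document; correct the ground truth.
    expected_doc = result["best_doc"]
```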
The script uses a two-phase batch pipeline with CLI rotation:
| Phase | CLI Assignment | Purpose |
|---|---|---|
| Generation | claude-sonnet, codex | Generate questions in parallel |
| Evaluation | Rotated (codex, claude-sonnet) | A different agent evaluates than the one that generated |
This rotation (line 772: `rotate_clis(base_clis, 1)`) ensures the evaluator is a different model from the generator, a form of adversarial validation.
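A minimal sketch of what `rotate_clis` plausibly does, assuming a simple left rotation; the script's actual implementation may differ:

```python
def rotate_clis(clis: list[str], n: int) -> list[str]:
    """Rotate the CLI list left by n positions.

    rotate_clis(["claude-sonnet", "codex"], 1) -> ["codex", "claude-sonnet"],
    so each question's evaluator differs from the generator that produced it.
    """
    n %= len(clis)
    return clis[n:] + clis[:n]
```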