
@possibilities
Created January 21, 2026 01:46
gen-queries.py: pairctl Interaction Analysis

The script makes three distinct pairctl interactions, all routed through a common helper.

Core Mechanism: ask_llm_for_json() (lines 153-213)

All pairctl calls flow through this helper which:

  1. Generates a unique /tmp/knowctl-{uuid}.json output path
  2. Constructs a prompt with TypeScript schema + instruction to write JSON to that file
  3. Calls pairctl send-message <prompt> --cli <cli> with a 5-minute timeout
  4. Reads the JSON file back directly, avoiding the need to parse the chat response

Why this pattern? Parsing JSON from chat responses is fragile—agents wrap it in markdown, add commentary, etc. Writing to a file gives clean output.
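
Below is a minimal sketch of this helper in Python (the script's own language), assembled from the four steps above. The function name and the pairctl invocation come from the analysis; the parameter names, exact prompt wording, and lack of error handling are assumptions.

import json
import subprocess
import uuid

def ask_llm_for_json(prompt_body: str, schema: str, cli: str):
    # 1. Unique output path the agent is asked to write to
    output_path = f"/tmp/knowctl-{uuid.uuid4()}.json"

    # 2. Prompt = task + TypeScript schema + instruction to write JSON to that file
    prompt = (
        f"{prompt_body}\n\n"
        f"Respond by writing JSON matching this TypeScript schema:\n{schema}\n\n"
        f"Write ONLY the JSON to {output_path}"
    )

    # 3. Route the message through pairctl with a 5-minute timeout
    subprocess.run(
        ["pairctl", "send-message", prompt, "--cli", cli],
        check=True,
        timeout=300,
    )

    # 4. Read the file directly instead of parsing the chat response
    with open(output_path) as f:
        return json.load(f)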


Interaction 1: Document Selection (lines 216-259)

When: Topic has more documents than MAX_DOCS_PER_TOPIC (5)

Purpose: Have an LLM curate which documents are worth generating test queries for

Prompt:

Select {max_docs} documents from this topic that would be best for 
generating semantic search test queries.

Choose documents that:
- Have substantial, rich content (not just short references)
- Cover diverse aspects of the topic
- Would generate interesting and varied test questions

Available documents:
{knowctl list-documents output}

Schema: string[] — array of document IDs

Goal: Avoid wasting evaluation budget on thin/duplicate docs
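
Put together, the call for this interaction would look roughly like the sketch below, reusing the ask_llm_for_json helper sketched earlier. The function name select_documents and the doc_listing parameter are hypothetical; the prompt text and the string[] schema come from the analysis.

MAX_DOCS_PER_TOPIC = 5

def select_documents(doc_listing: str, cli: str) -> list[str]:
    # doc_listing is the captured output of `knowctl list-documents <topic>`
    prompt = (
        f"Select {MAX_DOCS_PER_TOPIC} documents from this topic that would be best for\n"
        "generating semantic search test queries.\n\n"
        "Choose documents that:\n"
        "- Have substantial, rich content (not just short references)\n"
        "- Cover diverse aspects of the topic\n"
        "- Would generate interesting and varied test questions\n\n"
        f"Available documents:\n{doc_listing}"
    )
    # Schema is string[]: an array of selected document IDs
    return ask_llm_for_json(prompt, "string[]", cli)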


Interaction 2: Question Generation (lines 356-396)

When: For each selected document

Purpose: Generate 8 diverse test questions that the document can answer

Prompt:

Generate 8 test questions for evaluating semantic search retrieval quality.

Document (topic: {topic}, ID: {ordinal_id}):
---
{first 4000 chars of content}
---

Create 8 diverse questions that this document can answer. Be comprehensive:
1. Factual - direct answer found in text
2. Factual - another direct fact from text
3. Procedural - how to do something described in the doc
4. Procedural - a different how-to from the doc
5. Conceptual - requires understanding the "why" explained in the doc
6. Troubleshooting - what if X goes wrong (if applicable)
7. Rephrased - uses synonyms/different terminology than the doc
8. Edge case - less obvious use case covered by the doc

Schema: Question[] with question: string and difficulty: easy|medium|hard

Goal: Create a varied test corpus covering different query types
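
A sketch of the per-document call, again reusing ask_llm_for_json. The function name generate_questions is hypothetical, the schema string is an approximation of the Question[] shape described above, and the eight question types are elided into a comment.

def generate_questions(topic: str, ordinal_id: str, content: str, cli: str) -> list[dict]:
    prompt = (
        "Generate 8 test questions for evaluating semantic search retrieval quality.\n\n"
        f"Document (topic: {topic}, ID: {ordinal_id}):\n"
        "---\n"
        f"{content[:4000]}\n"  # only the first 4000 characters of the document are sent
        "---\n\n"
        "Create 8 diverse questions that this document can answer. Be comprehensive:\n"
        # ... the eight question types listed above (factual x2, procedural x2,
        # conceptual, troubleshooting, rephrased, edge case) go here
    )
    schema = 'Array<{ question: string; difficulty: "easy" | "medium" | "hard" }>'
    return ask_llm_for_json(prompt, schema, cli)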


Interaction 3: Ground Truth Evaluation (lines 262-313)

When: For each generated question (Phase 2)

Purpose: Verify/correct which document best answers the question

Prompt:

Find which document in the "{topic}" topic best answers this question:

Question: {question}

This question was generated from document {source_doc}, but another 
document might answer it better.

Use these knowctl commands to explore:
- knowctl list-documents {topic}  → see all documents with summaries
- knowctl show-document <doc-id>  → read full document content

Process:
1. Run list-documents to see available documents and their summaries
2. Identify 2-4 promising candidates based on summaries (must include {source_doc})
3. Use show-document to read the full content of each candidate
4. Decide which document best answers the question

Schema:

interface Result {
  best_doc: string;
  confidence: "high" | "medium" | "low";
  reason: string;
  explored_docs: string[];
}

Goal: The evaluating agent actually explores the knowledge base via shell commands to determine ground truth. This is "agentic evaluation"—not just asking "which doc?" but giving the agent tools to investigate.
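
A sketch of the evaluation call. Note that the shell exploration happens on the agent's side; this function only builds the prompt and collects the verdict. The function name evaluate_ground_truth is hypothetical and the exploration steps are elided into a comment; the schema is the Result interface shown above.

RESULT_SCHEMA = """interface Result {
  best_doc: string;
  confidence: "high" | "medium" | "low";
  reason: string;
  explored_docs: string[];
}"""

def evaluate_ground_truth(topic: str, question: str, source_doc: str, cli: str) -> dict:
    prompt = (
        f'Find which document in the "{topic}" topic best answers this question:\n\n'
        f"Question: {question}\n\n"
        f"This question was generated from document {source_doc}, but another\n"
        "document might answer it better.\n\n"
        "Use these knowctl commands to explore:\n"
        f"- knowctl list-documents {topic}  → see all documents with summaries\n"
        "- knowctl show-document <doc-id>  → read full document content\n\n"
        # ... the four-step exploration process from the prompt above goes here
    )
    return ask_llm_for_json(prompt, RESULT_SCHEMA, cli)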


Orchestration Pattern

The script uses a two-phase batch pipeline with CLI rotation:

Phase        CLI assignment                    Purpose
Generation   claude-sonnet, codex              Generate questions in parallel
Evaluation   rotated (codex, claude-sonnet)    Evaluate each question with a different agent than the one that generated it

This rotation (line 772: rotate_clis(base_clis, 1)) ensures the evaluator is a different model from the generator, a form of adversarial validation.
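
The rotation itself is most likely a simple list shift. A plausible implementation consistent with the rotate_clis(base_clis, 1) call site (an assumption, since only the call is visible in the analysis):

def rotate_clis(clis: list[str], n: int) -> list[str]:
    # Shift the CLI list left by n positions, wrapping around
    n %= len(clis)
    return clis[n:] + clis[:n]

# rotate_clis(["claude-sonnet", "codex"], 1) == ["codex", "claude-sonnet"],
# so each question is evaluated by a different CLI than the one that generated it.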
