Tool-Based Source Collection Architecture - Master Design Document

Overview

This document consolidates the complete design for moving STORM's source collection from a context-heavy architecture to tool-based state management. The change addresses a context window overflow issue: the bibliographer prompt is effective enough that it gathers sources faster than DSPy's trajectory truncation can handle.

Design Evolution (Chronological)

Version 1: Initial Tool-Based Architecture

Gist: https://gist.github.com/estsauver/e808b32d5c6a1cb5c884d12d54b7d3c3

Key Concepts:

  • Move source storage from agent context to external Redis
  • Four core tools: save_source, get_progress, check_completion, finalize_sources
  • Incremental state management instead of accumulating everything in trajectory
  • Honest Assessment: Good long-term architecture, but it might be over-engineering for the immediate problem

Tools:

save_source(source_type, external_id, url, title, abstract, key_excerpts, topics_covered, citation_id)
get_progress() → {"total_sources": 15, "by_type": {...}, "by_topic": {...}}
check_completion() → {"ready_to_finish": bool, "criteria": {...}}
finalize_sources() → List[dict]  # All sources

Architecture:

Agent → save_source() → Redis → get_progress() → check_completion() → FINISH
Context: Only recent activity (not all 30+ sources)
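A minimal sketch of the external storage these tools wrap, assuming a redis-py client; the key name sources:{job_id} and the explicit job_id argument are illustrative only (the real tools would bind the job internally):

# Sketch only: redis-py client assumed; key names and the explicit job_id
# parameter are illustrative, not the final SourceCollector API.
import json
import redis

r = redis.Redis(decode_responses=True)

def save_source(job_id: str, **source) -> dict:
    # Append one source dict to external storage instead of the agent trajectory.
    r.rpush(f"sources:{job_id}", json.dumps(source))
    return {"saved": True, "total_sources": r.llen(f"sources:{job_id}")}

def get_progress(job_id: str) -> dict:
    # Summarize saved sources without pulling full documents back into context.
    sources = [json.loads(s) for s in r.lrange(f"sources:{job_id}", 0, -1)]
    by_type: dict = {}
    for s in sources:
        by_type[s["source_type"]] = by_type.get(s["source_type"], 0) + 1
    return {"total_sources": len(sources), "by_type": by_type}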

Version 2: Question-Aware Partitioning (Enhanced)

Gist: https://gist.github.com/estsauver/2d9b8dab77eecd8a4f64c3052eb1d458

Key Enhancement: Merge Phase 1 (gathering) + Phase 2 (distribution)

User Insight: "I think the save sources tool needs to do something like allocating the chunk/citation to a relevant question too?"

Changes:

  • Added relevant_questions parameter to save_source
  • Agent assigns sources to questions immediately during gathering
  • Per-question progress tracking
  • Just-in-time distribution instead of post-hoc partitioning

Enhanced save_source:

save_source(
    source_type: str,
    external_id: str,
    url: str,
    title: str,
    abstract: str,
    key_excerpts: List[str],
    topics_covered: List[str],
    citation_id: str,
    relevant_questions: List[str]  # ⭐ NEW! e.g., ["clinical.efficacy", "mechanism.moa"]
) → {
    "source_id": "src_012",
    "assigned_to": ["clinical.efficacy", "clinical.safety"],
    "questions_updated": {
        "clinical.efficacy": {"source_count": 6, "sufficient": True}
    }
}

Storage Schema:

# Redis structure
source_collection:{job_id}:sources → List of all source dicts
source_collection:{job_id}:question:{question_key} → List of source_ids
source_collection:{job_id}:metadata → JSON metadata
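
A sketch of how save_source could write against this schema, assuming redis-py; the src_{...} id format and the extra :seen set used for deduplication are hypothetical additions, not part of the schema above:

# Sketch only: redis-py client assumed; the :seen set and source_id format are
# hypothetical conveniences for deduplication.
import json
import redis

r = redis.Redis(decode_responses=True)

def save_source(job_id: str, source: dict, relevant_questions: list) -> str:
    base = f"source_collection:{job_id}"
    source_id = f"src_{source['source_type']}_{source['external_id']}"
    # Store the source once, even if it answers several questions.
    if r.sadd(f"{base}:seen", source_id):  # returns 1 only on first insert
        r.rpush(f"{base}:sources", json.dumps({**source, "source_id": source_id}))
    # Assign it to each relevant question.
    for question in relevant_questions:
        r.rpush(f"{base}:question:{question}", source_id)
    return source_id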

Benefits:

  • Intentional gathering - Agent knows WHY it's gathering each source
  • Real-time feedback - "I have enough mechanism, need more competitive"
  • Adaptive search - Agent pivots based on per-question gaps
  • Direct to synthesis - Output ready for Phase 3 (AnswerSynthesizer)

Version 3: Refined API Design

Gist: https://gist.github.com/estsauver/bf7fbb3938acd0f280564593fd7cbdc1

Key Improvements:

  • Simplified parameters: 9 → 7 for save_source
  • Flat, scannable response structure (no deep nesting)
  • Visual indicators (✓ ⚠) for quick assessment
  • All responses <500 chars to avoid context bloat
  • Human-readable status messages
  • Actionable feedback ("next_focus", "suggestion")

Refined save_source:

def save_source(
    source_type: str,               # Required: "pubmed", "clinicaltrials", etc.
    external_id: str,               # Required: PMID, NCT number, etc.
    url: str,                       # Required: Full URL to source
    title: str,                     # Required: Document title
    relevant_questions: List[str],  # Required: Question keys this answers
    key_excerpts: Optional[str] = None,      # Optional: 1-2 sentence summary
    citation_id: Optional[str] = None        # Optional: If already registered
) -> dict

Removed Redundancies:

  • abstract (redundant with key_excerpts)
  • topics_covered (duplicates relevant_questions)
  • Made citation_id optional (not available yet during gathering)

Refined get_progress:

{
    "total": 25,
    "questions": {
        "mechanism.moa":           "✓ 8 sources",
        "clinical.efficacy":       "✓ 7 sources",
        "competitive.landscape":   "⚠ 2 sources (need 3 more)",
    },
    "summary": "3/6 questions complete, 22 more sources needed",
    "next_focus": ["competitive.landscape", "market.status"]
}
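
A sketch of how this response could be assembled from stored counts; the helper name and parameters are hypothetical, and min_sources mirrors the question taxonomy defined later in this document:

# Sketch only: question_counts would come from the per-question Redis lists;
# min_sources mirrors the taxonomy below. Helper name/parameters are hypothetical.
def build_progress(total_sources: int, question_counts: dict, min_sources: dict) -> dict:
    questions, next_focus = {}, []
    for key, needed in min_sources.items():
        have = question_counts.get(key, 0)
        if have >= needed:
            questions[key] = f"✓ {have} sources"
        else:
            questions[key] = f"⚠ {have} sources (need {needed - have} more)"
            next_focus.append(key)
    complete = len(min_sources) - len(next_focus)
    remaining = sum(max(0, n - question_counts.get(q, 0)) for q, n in min_sources.items())
    return {
        "total": total_sources,
        "questions": questions,
        "summary": f"{complete}/{len(min_sources)} questions complete, {remaining} more sources needed",
        "next_focus": next_focus,
    }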

Refined check_completion:

{
    "ready": false,
    "progress": "3/6 questions complete (50%)",
    "missing": {
        "competitive.landscape": "Need 3 more sources (currently 2/5)",
        "market.status": "Need 4 more sources (currently 1/5)"
    },
    "suggestion": "Focus on competitive landscape and market data"
}

Version 4: Automatic Citation Registration (Final)

Key Enhancement: Make citation handling seamless

User Request: "I think we should make it clear in the design that if a citation is not registered yet, we should register it"

Design Principle: The tool should automatically register citations if not provided - agent shouldn't need to remember to call register_citation separately.

Final save_source:

def save_source(
    source_type: str,               # Required: "pubmed", "clinicaltrials", etc.
    external_id: str,               # Required: PMID, NCT number, etc.
    url: str,                       # Required: Full URL to source
    title: str,                     # Required: Document title
    relevant_questions: List[str],  # Required: Question keys this answers
    key_excerpts: Optional[str] = None,      # Optional: 1-2 sentence summary
    citation_id: Optional[str] = None        # Optional: If already registered
) -> dict:
    """
    Save a source and assign it to relevant questions.

    AUTOMATIC CITATION REGISTRATION:
    - If citation_id is provided: Use it (agent already registered the citation)
    - If citation_id is None: Automatically register a citation for this source
      using the provided metadata (source_type, external_id, title, url, key_excerpts)

    This ensures every source has a citation_id for use in answer synthesis.

    Returns:
        {
            "source_id": "src_012",
            "citation_id": "cit_abc123",  # Always returned (auto-registered if needed)
            "citation_status": "existing" | "auto_registered",
            "assigned_to": ["clinical.efficacy", "clinical.safety"],
            "status": {
                "clinical.efficacy": "sufficient",
                "clinical.safety": "sufficient"
            },
            "message": "✓ Saved source #12 (PubMed) → 2 questions"
        }
    """

Implementation Logic:

if citation_id is None:
    # Auto-register citation
    citation_id = register_citation(
        claim=title,  # Use title as claim
        source_type=source_type,
        source_id=external_id,
        direct_quote=key_excerpts or title,
        context=f"Source: {url}",
        metadata={"title": title, "url": url, "source_type": source_type}
    )
    citation_status = "auto_registered"
else:
    # Validate that citation exists
    if not get_citation(citation_id):
        raise ValueError(f"Citation {citation_id} not found")
    citation_status = "existing"

Agent Workflows:

Option 1: Explicit Registration (Advanced)

# Agent extracts specific claim first
citation_id = register_citation(
    claim="Trastuzumab binds HER2 with KD of 5 nM",
    source_type="pubmed",
    source_id="12345678",
    direct_quote="We measured binding affinity...",
)

# Then saves with citation
save_source(..., citation_id=citation_id)

Option 2: Auto-Registration (Simple, Recommended)

# Agent just saves directly
save_source(
    source_type="pubmed",
    external_id="12345678",
    title="Trastuzumab HER2 binding characteristics",
    relevant_questions=["mechanism.moa"],
    key_excerpts="KD = 5 nM for HER2 binding"
    # citation_id not provided - will be auto-registered
)

Final Tool Suite

1. save_source - Intelligent Gathering + Assignment + Citation

def save_source(
    source_type: str,
    external_id: str,
    url: str,
    title: str,
    relevant_questions: List[str],
    key_excerpts: Optional[str] = None,
    citation_id: Optional[str] = None
) -> dict

2. get_progress - Per-Question Coverage

def get_progress() -> dict:
    """
    Returns:
        {
            "total": 25,
            "questions": {
                "mechanism.moa": "✓ 8 sources",
                "competitive.landscape": "⚠ 2 sources (need 3 more)"
            },
            "summary": "3/6 questions complete, 22 more sources needed",
            "next_focus": ["competitive.landscape", "market.status"]
        }
    """

3. check_completion - Question-Level Validation

def check_completion() -> dict:
    """
    Returns:
        {
            "ready": false,
            "progress": "3/6 questions complete (50%)",
            "missing": {
                "competitive.landscape": "Need 3 more sources (currently 2/5)"
            },
            "suggestion": "Focus on competitive landscape and market data"
        }
    """

4. finalize_sources - Pre-Organized Output

def finalize_sources() -> dict:
    """
    Returns sources organized by question, ready for Phase 3 (AnswerSynthesizer).

    Note: This is an internal workflow tool, not exposed to the agent.
    The agent never sees this response - the workflow retrieves it directly.
    """

Question Taxonomy (Configuration)

STORM_QUESTIONS = {
    "mechanism.moa": {
        "label": "Mechanism of Action",
        "description": "How does the drug work? Target binding, pathway effects",
        "min_sources": 5
    },
    "clinical.efficacy": {
        "label": "Clinical Efficacy",
        "description": "Trial results, patient outcomes, response rates",
        "min_sources": 5
    },
    "clinical.safety": {
        "label": "Safety Profile",
        "description": "Adverse events, toxicity, contraindications",
        "min_sources": 5
    },
    "competitive.landscape": {
        "label": "Competitive Landscape",
        "description": "Other drugs, combinations, market position",
        "min_sources": 5
    },
    "market.status": {
        "label": "Market Status",
        "description": "Development stage, approvals, commercialization",
        "min_sources": 3
    },
    "ip.patents": {
        "label": "Intellectual Property",
        "description": "Patent coverage, exclusivity, formulations",
        "min_sources": 3
    }
}
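
A sketch of how check_completion might consume this taxonomy, assuming the same per-question counts used by get_progress; the suggestion wording is illustrative:

# Sketch only: question_counts comes from the per-question Redis lists; the
# suggestion string is illustrative.
def check_completion(question_counts: dict) -> dict:
    missing = {}
    for key, spec in STORM_QUESTIONS.items():
        have = question_counts.get(key, 0)
        if have < spec["min_sources"]:
            missing[key] = f"Need {spec['min_sources'] - have} more sources (currently {have}/{spec['min_sources']})"
    complete = len(STORM_QUESTIONS) - len(missing)
    result = {
        "ready": not missing,
        "progress": f"{complete}/{len(STORM_QUESTIONS)} questions complete ({100 * complete // len(STORM_QUESTIONS)}%)",
        "missing": missing,
    }
    if missing:
        labels = [STORM_QUESTIONS[k]["label"] for k in missing]
        result["suggestion"] = "Focus on " + " and ".join(labels)
    return result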

Modified Prompt (Bibliographer with Tools)

class BroadResearchSignature(dspy.Signature):
    """
    YOUR JOB: Build a comprehensive source bibliography organized by research question.

    You are given a RESEARCH SYLLABUS with 6 questions about this drug:

    1. mechanism.moa - How does the drug work? Target binding, pathway effects
    2. clinical.efficacy - What are the trial results? Patient outcomes, response rates
    3. clinical.safety - What are the risks? Adverse events, toxicity
    4. competitive.landscape - How does it compare? Other drugs, combinations
    5. market.status - Development stage, approvals, commercialization
    6. ip.patents - Patent coverage, exclusivity, formulations

    INTELLIGENT WORKFLOW:
    1. Search a database (search_pubmed, search_clinicaltrials, etc.)
    2. For EACH promising result:
       a. Use scrape_url to read the full content
       b. Determine which question(s) it answers
       c. Use save_source(..., relevant_questions=["clinical.efficacy"])
       d. Get feedback: "Saved source #5 → clinical.efficacy (now 5 sources ✓)"

    3. Periodically use get_progress to see per-question status
    4. Adjust your search strategy to fill gaps
    5. Use check_completion before finishing
    6. Call FINISH only when check_completion shows ready: true

    COMPLETION CRITERIA:
    ✓ All 6 questions have at least 5 sources each (30+ total)
    ✓ Each question covers its designated topic area
    ✓ Sources span multiple databases (PubMed, ClinicalTrials, ChEMBL, etc.)

    SMART FEATURES:
    - You can assign the same source to multiple questions if relevant
    - You get real-time feedback on which questions need more sources
    - Context stays small because sources are saved externally
    """

Implementation Estimate

Phase 1: Core Storage (2 hours)

  • SourceCollector class with Redis backend
  • Question-aware storage schema
  • Deduplication logic (same source → multiple questions)

Phase 2: Tool Implementation (2 hours)

  • save_source with question assignment + auto-citation
  • get_progress with per-question breakdown
  • check_completion with gap analysis
  • finalize_sources with organized output

Phase 3: Prompt & Integration (1 hour)

  • Update BroadResearchSignature with question list
  • Modify BroadResearcher to use tools
  • Update STORM workflow to use pre-organized output

Phase 4: Testing (1 hour)

  • Test question assignment logic
  • Verify adaptive search behavior
  • Confirm output format for AnswerSynthesizer
  • Test automatic citation registration

Total: ~6 hours

Benefits Summary

✅ No more context overflow - Context contains only recent activity
✅ Unlimited source gathering - Storage is external
✅ Real-time progress tracking - Can query saved sources anytime
✅ Natural checkpointing - Sources persist even if agent crashes
✅ Better observability - Can inspect Redis to see what's saved
✅ Scalable - Works for 50, 100, 1000+ sources
✅ Intentional gathering - Agent knows WHY it's gathering each source
✅ Adaptive search - Agent can pivot based on per-question gaps
✅ Direct to synthesis - Output ready for Phase 3 (AnswerSynthesizer)
✅ Natural deduplication - Same source can answer multiple questions
✅ Seamless citations - Auto-registration eliminates cognitive load

Key Design Principles

  1. Simplicity - Minimize required parameters (5 required, 2 optional)
  2. Actionability - Return information the agent can act on
  3. Consistency - Similar patterns across tools
  4. Conciseness - Keep responses small (<500 chars) to avoid context bloat
  5. Robustness - Handle edge cases gracefully (duplicate sources, invalid questions)
  6. Automation - Auto-register citations to reduce agent complexity
  7. Transparency - Clear feedback on what was auto-generated vs provided

Next Steps

  1. Decision: Implement tool-based architecture vs quick fixes (lower max_iterations)
  2. If implementing: Start with Phase 1 (Core Storage) and incrementally build
  3. Testing: Run full end-to-end STORM demo with Trastuzumab
  4. Validation: Verify 30-50+ sources gathered across all 6 questions
  5. Production: Deploy and monitor for context window issues

Last Updated: 2025-12-29
Status: Design Complete - Ready for Implementation Decision
