Tool-Based Source Collection Architecture - Master Design Document

Overview

This document consolidates the complete design for moving STORM's source collection from a context-heavy architecture to tool-based state management. The change addresses a context window overflow issue: the bibliographer prompt is effective enough that it gathers sources faster than DSPy's trajectory truncation can handle.

Design Evolution (Chronological)

Version 1: Initial Tool-Based Architecture

Gist: https://gist.github.com/estsauver/e808b32d5c6a1cb5c884d12d54b7d3c3

Key Concepts:

  • Move source storage from agent context to external Redis
  • Four core tools: save_source, get_progress, check_completion, finalize_sources
  • Incremental state management instead of accumulating everything in trajectory
  • Honest Assessment: Good long-term architecture, but it might be over-engineering for the immediate problem

Tools:

save_source(source_type, external_id, url, title, abstract, key_excerpts, topics_covered, citation_id)
get_progress() → {"total_sources": 15, "by_type": {...}, "by_topic": {...}}
check_completion() → {"ready_to_finish": bool, "criteria": {...}}
finalize_sources() → List[dict]  # All sources

Architecture:

Agent → save_source() → Redis → get_progress() → check_completion() → FINISH
Context: Only recent activity (not all 30+ sources)
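A minimal sketch of the external storage these tools wrap, assuming a redis-py client; the key name sources:{job_id} and the explicit job_id argument are illustrative only (the real tools would bind the job internally):

# Sketch only: redis-py client assumed; key names and the explicit job_id
# parameter are illustrative, not the final SourceCollector API.
import json
import redis

r = redis.Redis(decode_responses=True)

def save_source(job_id: str, **source) -> dict:
    # Append one source dict to external storage instead of the agent trajectory.
    r.rpush(f"sources:{job_id}", json.dumps(source))
    return {"saved": True, "total_sources": r.llen(f"sources:{job_id}")}

def get_progress(job_id: str) -> dict:
    # Summarize saved sources without pulling full documents back into context.
    sources = [json.loads(s) for s in r.lrange(f"sources:{job_id}", 0, -1)]
    by_type: dict = {}
    for s in sources:
        by_type[s["source_type"]] = by_type.get(s["source_type"], 0) + 1
    return {"total_sources": len(sources), "by_type": by_type}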

Version 2: Question-Aware Partitioning (Enhanced)

Gist: https://gist.github.com/estsauver/2d9b8dab77eecd8a4f64c3052eb1d458

Key Enhancement: Merge Phase 1 (gathering) + Phase 2 (distribution)

User Insight: "I think the save sources tool needs to do something like allocating the chunk/citation to a relevant question too?"

Changes:

  • Added relevant_questions parameter to save_source
  • Agent assigns sources to questions immediately during gathering
  • Per-question progress tracking
  • Just-in-time distribution instead of post-hoc partitioning

Enhanced save_source:

save_source(
    source_type: str,
    external_id: str,
    url: str,
    title: str,
    abstract: str,
    key_excerpts: List[str],
    topics_covered: List[str],
    citation_id: str,
    relevant_questions: List[str]  # ⭐ NEW! e.g., ["clinical.efficacy", "mechanism.moa"]
) → {
    "source_id": "src_012",
    "assigned_to": ["clinical.efficacy", "clinical.safety"],
    "questions_updated": {
        "clinical.efficacy": {"source_count": 6, "sufficient": True}
    }
}

Storage Schema:

# Redis structure
source_collection:{job_id}:sources → List of all source dicts
source_collection:{job_id}:question:{question_key} → List of source_ids
source_collection:{job_id}:metadata → JSON metadata
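
A sketch of how save_source could write against this schema, assuming redis-py; the src_{...} id format and the extra :seen set used for deduplication are hypothetical additions, not part of the schema above:

# Sketch only: redis-py client assumed; the :seen set and source_id format are
# hypothetical conveniences for deduplication.
import json
import redis

r = redis.Redis(decode_responses=True)

def save_source(job_id: str, source: dict, relevant_questions: list) -> str:
    base = f"source_collection:{job_id}"
    source_id = f"src_{source['source_type']}_{source['external_id']}"
    # Store the source once, even if it answers several questions.
    if r.sadd(f"{base}:seen", source_id):  # returns 1 only on first insert
        r.rpush(f"{base}:sources", json.dumps({**source, "source_id": source_id}))
    # Assign it to each relevant question.
    for question in relevant_questions:
        r.rpush(f"{base}:question:{question}", source_id)
    return source_id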

Benefits:

  • Intentional gathering - Agent knows WHY it's gathering each source
  • Real-time feedback - "I have enough mechanism, need more competitive"
  • Adaptive search - Agent pivots based on per-question gaps
  • Direct to synthesis - Output ready for Phase 3 (AnswerSynthesizer)

Version 3: Refined API Design

Gist: https://gist.github.com/estsauver/bf7fbb3938acd0f280564593fd7cbdc1

Key Improvements:

  • Simplified parameters: 9 → 7 for save_source
  • Flat, scannable response structure (no deep nesting)
  • Visual indicators (✓ ⚠) for quick assessment
  • All responses <500 chars to avoid context bloat
  • Human-readable status messages
  • Actionable feedback ("next_focus", "suggestion")

Refined save_source:

def save_source(
    source_type: str,               # Required: "pubmed", "clinicaltrials", etc.
    external_id: str,               # Required: PMID, NCT number, etc.
    url: str,                       # Required: Full URL to source
    title: str,                     # Required: Document title
    relevant_questions: List[str],  # Required: Question keys this answers
    key_excerpts: Optional[str] = None,      # Optional: 1-2 sentence summary
    citation_id: Optional[str] = None        # Optional: If already registered
) -> dict

Removed Redundancies:

  • abstract (redundant with key_excerpts)
  • topics_covered (duplicates relevant_questions)
  • Made citation_id optional (not available yet during gathering)

Refined get_progress:

{
    "total": 25,
    "questions": {
        "mechanism.moa":           "✓ 8 sources",
        "clinical.efficacy":       "✓ 7 sources",
        "competitive.landscape":   "⚠ 2 sources (need 3 more)",
    },
    "summary": "3/6 questions complete, 22 more sources needed",
    "next_focus": ["competitive.landscape", "market.status"]
}
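
A sketch of how this response could be assembled from stored counts; the helper name and parameters are hypothetical, and min_sources mirrors the question taxonomy defined later in this document:

# Sketch only: question_counts would come from the per-question Redis lists;
# min_sources mirrors the taxonomy below. Helper name/parameters are hypothetical.
def build_progress(total_sources: int, question_counts: dict, min_sources: dict) -> dict:
    questions, next_focus = {}, []
    for key, needed in min_sources.items():
        have = question_counts.get(key, 0)
        if have >= needed:
            questions[key] = f"✓ {have} sources"
        else:
            questions[key] = f"⚠ {have} sources (need {needed - have} more)"
            next_focus.append(key)
    complete = len(min_sources) - len(next_focus)
    remaining = sum(max(0, n - question_counts.get(q, 0)) for q, n in min_sources.items())
    return {
        "total": total_sources,
        "questions": questions,
        "summary": f"{complete}/{len(min_sources)} questions complete, {remaining} more sources needed",
        "next_focus": next_focus,
    }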

Refined check_completion:

{
    "ready": false,
    "progress": "3/6 questions complete (50%)",
    "missing": {
        "competitive.landscape": "Need 3 more sources (currently 2/5)",
        "market.status": "Need 4 more sources (currently 1/5)"
    },
    "suggestion": "Focus on competitive landscape and market data"
}

Version 4: Automatic Citation Registration (Final)

Key Enhancement: Make citation handling seamless

User Request: "I think we should make it clear in the design that if a citation is not registered yet, we should register it"

Design Principle: The tool should automatically register citations if not provided - agent shouldn't need to remember to call register_citation separately.

Final save_source:

def save_source(
    source_type: str,               # Required: "pubmed", "clinicaltrials", etc.
    external_id: str,               # Required: PMID, NCT number, etc.
    url: str,                       # Required: Full URL to source
    title: str,                     # Required: Document title
    relevant_questions: List[str],  # Required: Question keys this answers
    key_excerpts: Optional[str] = None,      # Optional: 1-2 sentence summary
    citation_id: Optional[str] = None        # Optional: If already registered
) -> dict:
    """
    Save a source and assign it to relevant questions.

    AUTOMATIC CITATION REGISTRATION:
    - If citation_id is provided: Use it (agent already registered the citation)
    - If citation_id is None: Automatically register a citation for this source
      using the provided metadata (source_type, external_id, title, url, key_excerpts)

    This ensures every source has a citation_id for use in answer synthesis.

    Returns:
        {
            "source_id": "src_012",
            "citation_id": "cit_abc123",  # Always returned (auto-registered if needed)
            "citation_status": "existing" | "auto_registered",
            "assigned_to": ["clinical.efficacy", "clinical.safety"],
            "status": {
                "clinical.efficacy": "sufficient",
                "clinical.safety": "sufficient"
            },
            "message": "✓ Saved source #12 (PubMed) → 2 questions"
        }
    """

Implementation Logic:

if citation_id is None:
    # Auto-register citation
    citation_id = register_citation(
        claim=title,  # Use title as claim
        source_type=source_type,
        source_id=external_id,
        direct_quote=key_excerpts or title,
        context=f"Source: {url}",
        metadata={"title": title, "url": url, "source_type": source_type}
    )
    citation_status = "auto_registered"
else:
    # Validate that citation exists
    if not get_citation(citation_id):
        raise ValueError(f"Citation {citation_id} not found")
    citation_status = "existing"

Agent Workflows:

Option 1: Explicit Registration (Advanced)

# Agent extracts specific claim first
citation_id = register_citation(
    claim="Trastuzumab binds HER2 with KD of 5 nM",
    source_type="pubmed",
    source_id="12345678",
    direct_quote="We measured binding affinity...",
)

# Then saves with citation
save_source(..., citation_id=citation_id)

Option 2: Auto-Registration (Simple, Recommended)

# Agent just saves directly
save_source(
    source_type="pubmed",
    external_id="12345678",
    title="Trastuzumab HER2 binding characteristics",
    relevant_questions=["mechanism.moa"],
    key_excerpts="KD = 5 nM for HER2 binding"
    # citation_id not provided - will be auto-registered
)

Final Tool Suite

1. save_source - Intelligent Gathering + Assignment + Citation

def save_source(
    source_type: str,
    external_id: str,
    url: str,
    title: str,
    relevant_questions: List[str],
    key_excerpts: Optional[str] = None,
    citation_id: Optional[str] = None
) -> dict

2. get_progress - Per-Question Coverage

def get_progress() -> dict:
    """
    Returns:
        {
            "total": 25,
            "questions": {
                "mechanism.moa": "✓ 8 sources",
                "competitive.landscape": "⚠ 2 sources (need 3 more)"
            },
            "summary": "3/6 questions complete, 22 more sources needed",
            "next_focus": ["competitive.landscape", "market.status"]
        }
    """

3. check_completion - Question-Level Validation

def check_completion() -> dict:
    """
    Returns:
        {
            "ready": false,
            "progress": "3/6 questions complete (50%)",
            "missing": {
                "competitive.landscape": "Need 3 more sources (currently 2/5)"
            },
            "suggestion": "Focus on competitive landscape and market data"
        }
    """

4. finalize_sources - Pre-Organized Output

def finalize_sources() -> dict:
    """
    Returns sources organized by question, ready for Phase 3 (AnswerSynthesizer).

    Note: This is an internal workflow tool, not exposed to the agent.
    The agent never sees this response - the workflow retrieves it directly.
    """

Question Taxonomy (Configuration)

STORM_QUESTIONS = {
    "mechanism.moa": {
        "label": "Mechanism of Action",
        "description": "How does the drug work? Target binding, pathway effects",
        "min_sources": 5
    },
    "clinical.efficacy": {
        "label": "Clinical Efficacy",
        "description": "Trial results, patient outcomes, response rates",
        "min_sources": 5
    },
    "clinical.safety": {
        "label": "Safety Profile",
        "description": "Adverse events, toxicity, contraindications",
        "min_sources": 5
    },
    "competitive.landscape": {
        "label": "Competitive Landscape",
        "description": "Other drugs, combinations, market position",
        "min_sources": 5
    },
    "market.status": {
        "label": "Market Status",
        "description": "Development stage, approvals, commercialization",
        "min_sources": 3
    },
    "ip.patents": {
        "label": "Intellectual Property",
        "description": "Patent coverage, exclusivity, formulations",
        "min_sources": 3
    }
}
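
A sketch of how check_completion might consume this taxonomy, assuming the same per-question counts used by get_progress; the suggestion wording is illustrative:

# Sketch only: question_counts comes from the per-question Redis lists; the
# suggestion string is illustrative.
def check_completion(question_counts: dict) -> dict:
    missing = {}
    for key, spec in STORM_QUESTIONS.items():
        have = question_counts.get(key, 0)
        if have < spec["min_sources"]:
            missing[key] = f"Need {spec['min_sources'] - have} more sources (currently {have}/{spec['min_sources']})"
    complete = len(STORM_QUESTIONS) - len(missing)
    result = {
        "ready": not missing,
        "progress": f"{complete}/{len(STORM_QUESTIONS)} questions complete ({100 * complete // len(STORM_QUESTIONS)}%)",
        "missing": missing,
    }
    if missing:
        labels = [STORM_QUESTIONS[k]["label"] for k in missing]
        result["suggestion"] = "Focus on " + " and ".join(labels)
    return result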

Modified Prompt (Bibliographer with Tools)

class BroadResearchSignature(dspy.Signature):
    """
    YOUR JOB: Build a comprehensive source bibliography organized by research question.

    You are given a RESEARCH SYLLABUS with 6 questions about this drug:

    1. mechanism.moa - How does the drug work? Target binding, pathway effects
    2. clinical.efficacy - What are the trial results? Patient outcomes, response rates
    3. clinical.safety - What are the risks? Adverse events, toxicity
    4. competitive.landscape - How does it compare? Other drugs, combinations
    5. market.status - Development stage, approvals, commercialization
    6. ip.patents - Patent coverage, exclusivity, formulations

    INTELLIGENT WORKFLOW:
    1. Search a database (search_pubmed, search_clinicaltrials, etc.)
    2. For EACH promising result:
       a. Use scrape_url to read the full content
       b. Determine which question(s) it answers
       c. Use save_source(..., relevant_questions=["clinical.efficacy"])
       d. Get feedback: "Saved source #5 → clinical.efficacy (now 5 sources ✓)"

    3. Periodically use get_progress to see per-question status
    4. Adjust your search strategy to fill gaps
    5. Use check_completion before finishing
    6. Call FINISH only when check_completion shows ready: true

    COMPLETION CRITERIA:
    ✓ All 6 questions have at least 5 sources each (30+ total)
    ✓ Each question covers its designated topic area
    ✓ Sources span multiple databases (PubMed, ClinicalTrials, ChEMBL, etc.)

    SMART FEATURES:
    - You can assign the same source to multiple questions if relevant
    - You get real-time feedback on which questions need more sources
    - Context stays small because sources are saved externally
    """

Implementation Estimate

Phase 1: Core Storage (2 hours)

  • SourceCollector class with Redis backend
  • Question-aware storage schema
  • Deduplication logic (same source → multiple questions)

Phase 2: Tool Implementation (2 hours)

  • save_source with question assignment + auto-citation
  • get_progress with per-question breakdown
  • check_completion with gap analysis
  • finalize_sources with organized output

Phase 3: Prompt & Integration (1 hour)

  • Update BroadResearchSignature with question list
  • Modify BroadResearcher to use tools
  • Update STORM workflow to use pre-organized output

Phase 4: Testing (1 hour)

  • Test question assignment logic
  • Verify adaptive search behavior
  • Confirm output format for AnswerSynthesizer
  • Test automatic citation registration

Total: ~6 hours

Benefits Summary

✅ No more context overflow - Context contains only recent activity
✅ Unlimited source gathering - Storage is external
✅ Real-time progress tracking - Can query saved sources anytime
✅ Natural checkpointing - Sources persist even if agent crashes
✅ Better observability - Can inspect Redis to see what's saved
✅ Scalable - Works for 50, 100, 1000+ sources
✅ Intentional gathering - Agent knows WHY it's gathering each source
✅ Adaptive search - Agent can pivot based on per-question gaps
✅ Direct to synthesis - Output ready for Phase 3 (AnswerSynthesizer)
✅ Natural deduplication - Same source can answer multiple questions
✅ Seamless citations - Auto-registration eliminates cognitive load

Key Design Principles

  1. Simplicity - Minimize required parameters (5 required, 2 optional)
  2. Actionability - Return information the agent can act on
  3. Consistency - Similar patterns across tools
  4. Conciseness - Keep responses small (<500 chars) to avoid context bloat
  5. Robustness - Handle edge cases gracefully (duplicate sources, invalid questions)
  6. Automation - Auto-register citations to reduce agent complexity
  7. Transparency - Clear feedback on what was auto-generated vs provided

Next Steps

  1. Decision: Implement tool-based architecture vs quick fixes (lower max_iterations)
  2. If implementing: Start with Phase 1 (Core Storage) and incrementally build
  3. Testing: Run full end-to-end STORM demo with Trastuzumab
  4. Validation: Verify 30-50+ sources gathered across all 6 questions
  5. Production: Deploy and monitor for context window issues

Last Updated: 2025-12-29
Status: Design Complete - Ready for Implementation Decision
