This document consolidates the complete design for moving STORM's source collection from a context-heavy architecture to tool-based state management. The change addresses a context window overflow issue: the bibliographer prompt is effective enough that it gathers sources faster than DSPy's trajectory truncation can handle.
Gist: https://gist.github.com/estsauver/e808b32d5c6a1cb5c884d12d54b7d3c3
Key Concepts:
- Move source storage from agent context to external Redis
- Four core tools: save_source, get_progress, check_completion, finalize_sources
- Incremental state management instead of accumulating everything in the trajectory
- Honest Assessment: Good long-term architecture, but it might be over-engineering for the immediate problem
Tools:
save_source(source_type, external_id, url, title, abstract, key_excerpts, topics_covered, citation_id)
get_progress() → {"total_sources": 15, "by_type": {...}, "by_topic": {...}}
check_completion() → {"ready_to_finish": bool, "criteria": {...}}
finalize_sources() → List[dict]  # All sources
Architecture:
Agent → save_source() → Redis → get_progress() → check_completion() → FINISH
Context: Only recent activity (not all 30+ sources)
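A minimal sketch of the idea under assumed details (Redis connection, job-scoped keys, src_NNN ids); the real save_source and get_progress are specified later in this document:

import json
import redis

r = redis.Redis()  # assumed connection; host/port/db come from config in practice

def save_source(job_id: str, source: dict) -> dict:
    """Append one source dict to external storage instead of the agent trajectory."""
    key = f"source_collection:{job_id}:sources"
    total = r.rpush(key, json.dumps(source))  # rpush returns the new list length
    return {"source_id": f"src_{total:03d}", "total_sources": total}

def get_progress(job_id: str) -> dict:
    """Summarize what has been saved so far without echoing full source bodies."""
    raw = r.lrange(f"source_collection:{job_id}:sources", 0, -1)
    sources = [json.loads(item) for item in raw]
    by_type: dict = {}
    for s in sources:
        by_type[s["source_type"]] = by_type.get(s["source_type"], 0) + 1
    return {"total_sources": len(sources), "by_type": by_type}

The agent only ever sees the small return dicts; the sources themselves stay in Redis.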
Gist: https://gist.github.com/estsauver/2d9b8dab77eecd8a4f64c3052eb1d458
Key Enhancement: Merge Phase 1 (gathering) + Phase 2 (distribution)
User Insight: "I think the save sources tool needs to do something like allocating the chunk/citation to a relevant question too?"
Changes:
- Added relevant_questions parameter to save_source
- Agent assigns sources to questions immediately during gathering
- Per-question progress tracking
- Just-in-time distribution instead of post-hoc partitioning
Enhanced save_source:
save_source(
source_type: str,
external_id: str,
url: str,
title: str,
abstract: str,
key_excerpts: List[str],
topics_covered: List[str],
citation_id: str,
relevant_questions: List[str] # ⭐ NEW! e.g., ["clinical.efficacy", "mechanism.moa"]
) → {
"source_id": "src_012",
"assigned_to": ["clinical.efficacy", "clinical.safety"],
"questions_updated": {
"clinical.efficacy": {"source_count": 6, "sufficient": True}
}
}
Storage Schema:
# Redis structure
source_collection:{job_id}:sources → List of all source dicts
source_collection:{job_id}:question:{question_key} → List of source_ids
source_collection:{job_id}:metadata → JSON metadata
Benefits:
- Intentional gathering - Agent knows WHY it's gathering each source
- Real-time feedback - "I have enough mechanism, need more competitive"
- Adaptive search - Agent pivots based on per-question gaps
- Direct to synthesis - Output ready for Phase 3 (AnswerSynthesizer)
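A sketch of the storage schema above as a small helper class, assuming a Redis client; the name SourceCollector is taken from Phase 1 of the implementation plan below, and the method bodies are illustrative:

import json
import redis

class SourceCollector:
    """Question-aware source storage backed by the Redis schema above (sketch)."""

    def __init__(self, job_id: str, client=None):
        self.r = client or redis.Redis()
        self.prefix = f"source_collection:{job_id}"

    def add(self, source: dict, relevant_questions: list) -> str:
        # Store the full source dict exactly once...
        index = self.r.rpush(f"{self.prefix}:sources", json.dumps(source))
        source_id = f"src_{index:03d}"
        # ...and index its id under every question it answers, so one source can
        # serve multiple questions without being duplicated.
        for question_key in relevant_questions:
            self.r.rpush(f"{self.prefix}:question:{question_key}", source_id)
        return source_id

    def question_counts(self) -> dict:
        keys = self.r.keys(f"{self.prefix}:question:*")
        return {k.decode().rsplit(":", 1)[-1]: self.r.llen(k) for k in keys}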
Gist: https://gist.github.com/estsauver/bf7fbb3938acd0f280564593fd7cbdc1
Key Improvements:
- Simplified parameters: 9 → 7 for save_source
- Flat, scannable response structure (no deep nesting)
- Visual indicators (✓ ⚠) for quick assessment
- All responses <500 chars to avoid context bloat
- Human-readable status messages
- Actionable feedback ("next_focus", "suggestion")
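As an illustration of the response style (flat structure, ✓/⚠ indicators, small payloads), a hedged sketch of how the per-question status strings could be rendered; the helper name and threshold handling are assumptions:

def format_question_status(count: int, min_sources: int) -> str:
    """Render one question's progress as a short, scannable string (assumed helper)."""
    if count >= min_sources:
        return f"✓ {count} sources"
    return f"⚠ {count} sources (need {min_sources - count} more)"

# format_question_status(8, 5)  -> "✓ 8 sources"
# format_question_status(2, 5)  -> "⚠ 2 sources (need 3 more)"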
Refined save_source:
def save_source(
source_type: str, # Required: "pubmed", "clinicaltrials", etc.
external_id: str, # Required: PMID, NCT number, etc.
url: str, # Required: Full URL to source
title: str, # Required: Document title
relevant_questions: List[str], # Required: Question keys this answers
key_excerpts: Optional[str] = None, # Optional: 1-2 sentence summary
citation_id: Optional[str] = None # Optional: If already registered
) -> dict
Removed Redundancies:
- abstract (redundant with key_excerpts)
- topics_covered (duplicates relevant_questions)
- Made citation_id optional (not available yet during gathering)
Refined get_progress:
{
"total": 25,
"questions": {
"mechanism.moa": "✓ 8 sources",
"clinical.efficacy": "✓ 7 sources",
"competitive.landscape": "⚠ 2 sources (need 3 more)",
},
"summary": "3/6 questions complete, 22 more sources needed",
"next_focus": ["competitive.landscape", "market.status"]
}
Refined check_completion:
{
"ready": false,
"progress": "3/6 questions complete (50%)",
"missing": {
"competitive.landscape": "Need 3 more sources (currently 2/5)",
"market.status": "Need 4 more sources (currently 1/5)"
},
"suggestion": "Focus on competitive landscape and market data"
}
Key Enhancement: Make citation handling seamless
User Request: "I think we should make it clear in the design that if a citation is not registered yet, we should register it"
Design Principle: The tool should automatically register a citation if one is not provided - the agent shouldn't need to remember to call register_citation separately.
Final save_source:
def save_source(
source_type: str, # Required: "pubmed", "clinicaltrials", etc.
external_id: str, # Required: PMID, NCT number, etc.
url: str, # Required: Full URL to source
title: str, # Required: Document title
relevant_questions: List[str], # Required: Question keys this answers
key_excerpts: Optional[str] = None, # Optional: 1-2 sentence summary
citation_id: Optional[str] = None # Optional: If already registered
) -> dict:
"""
Save a source and assign it to relevant questions.
AUTOMATIC CITATION REGISTRATION:
- If citation_id is provided: Use it (agent already registered the citation)
- If citation_id is None: Automatically register a citation for this source
using the provided metadata (source_type, external_id, title, url, key_excerpts)
This ensures every source has a citation_id for use in answer synthesis.
Returns:
{
"source_id": "src_012",
"citation_id": "cit_abc123", # Always returned (auto-registered if needed)
"citation_status": "existing" | "auto_registered",
"assigned_to": ["clinical.efficacy", "clinical.safety"],
"status": {
"clinical.efficacy": "sufficient",
"clinical.safety": "sufficient"
},
"message": "✓ Saved source #12 (PubMed) → 2 questions"
}
"""Implementation Logic:
if citation_id is None:
# Auto-register citation
citation_id = register_citation(
claim=title, # Use title as claim
source_type=source_type,
source_id=external_id,
direct_quote=key_excerpts or title,
context=f"Source: {url}",
metadata={"title": title, "url": url, "source_type": source_type}
)
citation_status = "auto_registered"
else:
# Validate that citation exists
if not get_citation(citation_id):
raise ValueError(f"Citation {citation_id} not found")
citation_status = "existing"Agent Workflows:
Option 1: Explicit Registration (Advanced)
# Agent extracts specific claim first
citation_id = register_citation(
claim="Trastuzumab binds HER2 with KD of 5 nM",
source_type="pubmed",
source_id="12345678",
direct_quote="We measured binding affinity...",
)
# Then saves with citation
save_source(..., citation_id=citation_id)
Option 2: Auto-Registration (Simple, Recommended)
# Agent just saves directly
save_source(
source_type="pubmed",
external_id="12345678",
title="Trastuzumab HER2 binding characteristics",
relevant_questions=["mechanism.moa"],
key_excerpts="KD = 5 nM for HER2 binding"
# citation_id not provided - will be auto-registered
)
def save_source(
source_type: str,
external_id: str,
url: str,
title: str,
relevant_questions: List[str],
key_excerpts: Optional[str] = None,
citation_id: Optional[str] = None
) -> dict
def get_progress() -> dict:
"""
Returns:
{
"total": 25,
"questions": {
"mechanism.moa": "✓ 8 sources",
"competitive.landscape": "⚠ 2 sources (need 3 more)"
},
"summary": "3/6 questions complete, 22 more sources needed",
"next_focus": ["competitive.landscape", "market.status"]
}
"""def check_completion() -> dict:
"""
Returns:
{
"ready": false,
"progress": "3/6 questions complete (50%)",
"missing": {
"competitive.landscape": "Need 3 more sources (currently 2/5)"
},
"suggestion": "Focus on competitive landscape and market data"
}
"""def finalize_sources() -> dict:
"""
Returns sources organized by question, ready for Phase 3 (AnswerSynthesizer).
Note: This is an internal workflow tool, not exposed to agent.
Agent never sees this response - workflow retrieves it directly.
"""STORM_QUESTIONS = {
"mechanism.moa": {
"label": "Mechanism of Action",
"description": "How does the drug work? Target binding, pathway effects",
"min_sources": 5
},
"clinical.efficacy": {
"label": "Clinical Efficacy",
"description": "Trial results, patient outcomes, response rates",
"min_sources": 5
},
"clinical.safety": {
"label": "Safety Profile",
"description": "Adverse events, toxicity, contraindications",
"min_sources": 5
},
"competitive.landscape": {
"label": "Competitive Landscape",
"description": "Other drugs, combinations, market position",
"min_sources": 5
},
"market.status": {
"label": "Market Status",
"description": "Development stage, approvals, commercialization",
"min_sources": 3
},
"ip.patents": {
"label": "Intellectual Property",
"description": "Patent coverage, exclusivity, formulations",
"min_sources": 3
}
}
class BroadResearchSignature(dspy.Signature):
"""
YOUR JOB: Build a comprehensive source bibliography organized by research question.
You are given a RESEARCH SYLLABUS with 6 questions about this drug:
1. mechanism.moa - How does the drug work? Target binding, pathway effects
2. clinical.efficacy - What are the trial results? Patient outcomes, response rates
3. clinical.safety - What are the risks? Adverse events, toxicity
4. competitive.landscape - How does it compare? Other drugs, combinations
5. market.status - Development stage, approvals, commercialization
6. ip.patents - Patent coverage, exclusivity, formulations
INTELLIGENT WORKFLOW:
1. Search a database (search_pubmed, search_clinicaltrials, etc.)
2. For EACH promising result:
a. Use scrape_url to read the full content
b. Determine which question(s) it answers
c. Use save_source(..., relevant_questions=["clinical.efficacy"])
d. Get feedback: "Saved source #5 → clinical.efficacy (now 5 sources ✓)"
3. Periodically use get_progress to see per-question status
4. Adjust your search strategy to fill gaps
5. Use check_completion before finishing
6. Call FINISH only when check_completion shows ready_to_finish: true
COMPLETION CRITERIA:
✓ All 6 questions have at least 5 sources each (30+ total)
✓ Each question covers its designated topic area
✓ Sources span multiple databases (PubMed, ClinicalTrials, ChEMBL, etc.)
SMART FEATURES:
- You can assign the same source to multiple questions if relevant
- You get real-time feedback on which questions need more sources
- Context stays small because sources are saved externally
"""Phase 1: Core Storage (2 hours)
- SourceCollector class with Redis backend
- Question-aware storage schema
- Deduplication logic (same source → multiple questions)
Phase 2: Tool Implementation (2 hours)
- save_source with question assignment + auto-citation
- get_progress with per-question breakdown
- check_completion with gap analysis (sketched below, after the plan)
- finalize_sources with organized output
Phase 3: Prompt & Integration (1 hour)
- Update BroadResearchSignature with question list
- Modify BroadResearcher to use tools (see the wiring sketch after this phase)
- Update STORM workflow to use pre-organized output
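A sketch of the Phase 3 wiring under stated assumptions: DSPy's dspy.ReAct accepts a signature plus a list of tool callables, the tools are assumed to live on the collector (Phase 2), and the signature would also need input/output fields (not shown above) to run. This is illustrative, not the actual BroadResearcher:

import dspy

def build_broad_researcher(collector):
    """Wire the state-management tools into a ReAct-style agent (sketch)."""
    tools = [
        collector.save_source,       # external state instead of trajectory accumulation
        collector.get_progress,
        collector.check_completion,
        # plus the existing search tools: search_pubmed, search_clinicaltrials, scrape_url, ...
    ]
    # max_iters is an assumption; tune it once sources no longer bloat the context.
    return dspy.ReAct(BroadResearchSignature, tools=tools, max_iters=40)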
Phase 4: Testing (1 hour)
- Test question assignment logic
- Verify adaptive search behavior
- Confirm output format for AnswerSynthesizer
- Test automatic citation registration
Total: ~6 hours
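A sketch of the gap analysis behind check_completion (Phase 2), assuming the STORM_QUESTIONS config above and a per-question count lookup such as SourceCollector.question_counts from the earlier sketch:

def check_completion(question_counts: dict) -> dict:
    """Compare per-question source counts against STORM_QUESTIONS minimums (sketch)."""
    missing = {}
    complete = 0
    for key, spec in STORM_QUESTIONS.items():
        have = question_counts.get(key, 0)
        need = spec["min_sources"]
        if have >= need:
            complete += 1
        else:
            missing[key] = f"Need {need - have} more sources (currently {have}/{need})"
    total = len(STORM_QUESTIONS)
    return {
        "ready": not missing,
        "progress": f"{complete}/{total} questions complete ({complete * 100 // total}%)",
        "missing": missing,
        "suggestion": ("Focus on: " + ", ".join(missing)) if missing else "All questions covered",
    }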
✅ No more context overflow - Context contains only recent activity
✅ Unlimited source gathering - Storage is external
✅ Real-time progress tracking - Can query saved sources anytime
✅ Natural checkpointing - Sources persist even if agent crashes
✅ Better observability - Can inspect Redis to see what's saved
✅ Scalable - Works for 50, 100, 1000+ sources
✅ Intentional gathering - Agent knows WHY it's gathering each source
✅ Adaptive search - Agent can pivot based on per-question gaps
✅ Direct to synthesis - Output ready for Phase 3 (AnswerSynthesizer)
✅ Natural deduplication - Same source can answer multiple questions
✅ Seamless citations - Auto-registration eliminates cognitive load
- Simplicity - Minimize required parameters (5 required, 2 optional)
- Actionability - Return information the agent can act on
- Consistency - Similar patterns across tools
- Conciseness - Keep responses small (<500 chars) to avoid context bloat
- Robustness - Handle edge cases gracefully (duplicate sources, invalid questions)
- Automation - Auto-register citations to reduce agent complexity
- Transparency - Clear feedback on what was auto-generated vs provided
- Decision: Implement the tool-based architecture vs. quick fixes (e.g., lowering max_iterations)
- If implementing: Start with Phase 1 (Core Storage) and incrementally build
- Testing: Run full end-to-end STORM demo with Trastuzumab
- Validation: Verify 30-50+ sources gathered across all 6 questions
- Production: Deploy and monitor for context window issues
Last Updated: 2025-12-29
Status: Design Complete - Ready for Implementation Decision