Puppeteer Memory: Giving AI Agents Long-Term Memory Without the Context Window Tax

TL;DR: We built a two-agent architecture where a "Puppeteer" agent manages memory (timeline checkpoints + learned knowledge) and directs a stateless "Executor" agent. The executor is a Boltzmann brain—spawned fresh each run with no concept of history. But to the user, the system appears to have perfect memory.


The Problem: AI Agents Forget

Every AI agent has a fundamental constraint: limited context windows.

When you run an autonomous agent on a schedule—say, monitoring news every 6 hours—each run starts fresh. The agent doesn't remember what it did yesterday, what articles it already sent you, or that it tried a failing approach three times.

┌─────────────────────────────────────────────────────────────┐
│                    THE MEMORY PROBLEM                       │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   Run 1: "Found 3 AI articles, sent email"                  │
│                      ↓                                      │
│   Run 2: "Found 3 AI articles, sent email"  ← Same articles!│
│                      ↓                                      │
│   Run 3: "Found 3 AI articles, sent email"  ← User annoyed  │
│                      ↓                                      │
│   Run 4: "Found 3 AI articles, sent email"  ← Unsubscribed  │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Current solutions are expensive:

  • Stuff complete history into context → token costs explode
  • Summarize history → lose critical details
  • RAG over history → adds latency and complexity

We wanted something different.


The Insight: Separate Memory from Execution

What if instead of making one agent do everything, we split the responsibilities?

Concern                  Who Handles It
-----------------------  ---------------
Remember everything      Puppeteer Agent
Execute current task     Executor Agent
Analyze history          Puppeteer Agent
Use tools                Executor Agent
Avoid duplicate work     Puppeteer Agent
Follow instructions      Executor Agent
Learn reusable facts     Puppeteer Agent
Self-critique prompts    Puppeteer Agent

The executor becomes a focused, efficient worker. The puppeteer becomes the strategic brain with access to complete history.
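
In code, the split is just two call signatures. Here is a rough TypeScript sketch (the interface and field names are ours, not from any particular framework): the puppeteer's input is everything, the executor's input is only the prompt it was handed.

// Hypothetical shapes for the two roles (illustrative only).

interface RunRecord {
  status: "completed" | "failed";
  summary: string;
  tools: string[];
}

// The puppeteer sees the full picture and produces instructions.
interface Puppeteer {
  analyze(input: {
    runHistory: RunRecord[];    // complete history
    checkpoints: string[];      // timeline markers
    knowledge: string[];        // learned facts
    userInstructions: string;   // the original ask
  }): Promise<{ executorSystemPrompt: string }>;
}

// The executor sees only its current instructions and tools.
interface Executor {
  run(systemPrompt: string, tools: string[]): Promise<RunRecord>;
}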


The Architecture

┌──────────────────────────────────────────────────────────────────┐
│                        USER                                      │
│                          │                                       │
│            "Monitor HN for AI news, email me daily"              │
│                          │                                       │
└──────────────────────────┼───────────────────────────────────────┘
                           ▼
┌──────────────────────────────────────────────────────────────────┐
│                   PUPPETEER AGENT                                │
│                                                                  │
│  "I can see everything. I remember everything.                   │
│   But I don't execute—I direct."                                 │
│                                                                  │
│  ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐        │
│  │   Run     │ │Checkpoints│ │ Knowledge │ │   User    │        │
│  │  History  │ │ (timeline │ │ (learned  │ │Instructions│       │
│  │ (50 runs) │ │  markers) │ │  facts)   │ │           │        │
│  └───────────┘ └───────────┘ └───────────┘ └───────────┘        │
│        │             │              │             │              │
│        └─────────────┴──────────────┴─────────────┘              │
│                          ▼                                       │
│              ┌───────────────────────┐                           │
│              │   ANALYZE & DECIDE    │                           │
│              │   + SELF-REFLECT      │                           │
│              └───────────────────────┘                           │
│                          │                                       │
│                          ▼                                       │
│              ┌───────────────────────┐                           │
│              │  GENERATE FOCUSED     │                           │
│              │  SYSTEM PROMPT        │                           │
│              └───────────────────────┘                           │
│                                                                  │
└──────────────────────────┼───────────────────────────────────────┘
                           │
                           │ "Search HN for AI posts from
                           │  last 6 hours. Don't re-send
                           │  these 3 articles..."
                           ▼
┌──────────────────────────────────────────────────────────────────┐
│                    EXECUTOR AGENT                                │
│                                                                  │
│  "I only see my current instructions.                            │
│   I have no concept of past or future—just this task."           │
│                                                                  │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                   SYSTEM PROMPT                          │    │
│  │  (Generated by Puppeteer—executor doesn't know this)     │    │
│  │                                                          │    │
│  │  "You are an AI news monitoring agent.                   │    │
│  │   Search for posts from last 6 hours.                    │    │
│  │   Don't re-send: GPT-5 article, Anthropic funding..."    │    │
│  └─────────────────────────────────────────────────────────┘    │
│                          │                                       │
│                          ▼                                       │
│              ┌───────────────────────┐                           │
│              │   EXECUTE WITH TOOLS  │                           │
│              │   hacker_news search  │                           │
│              │   send_email          │                           │
│              └───────────────────────┘                           │
│                                                                  │
└──────────────────────────┼───────────────────────────────────────┘
                           │
                           │ Results stored
                           ▼
┌──────────────────────────────────────────────────────────────────┐
│                      RUN HISTORY                                 │
│                                                                  │
│  Run 47: { status: "completed", summary: "Found 2 new           │
│            articles, sent digest", tools: ["hacker_news",       │
│            "send_email"], messages: [...] }                      │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
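
A minimal sketch of how each executor result might be persisted for the puppeteer to read later. The field names mirror the record above; the in-memory array and helper are illustrative stand-ins for whatever storage an implementation actually uses.

// Run record as stored after each executor run (sketch).
interface RunRecord {
  runNumber: number;
  status: "completed" | "failed";
  summary: string;
  tools: string[];
  messages: unknown[];  // full transcript, read only by the puppeteer
}

const runHistory: RunRecord[] = [];

function recordRun(run: RunRecord): void {
  runHistory.push(run);
  // Keep only the recent window the puppeteer analyzes (e.g. 50 runs).
  if (runHistory.length > 50) runHistory.shift();
}

recordRun({
  runNumber: 47,
  status: "completed",
  summary: "Found 2 new articles, sent digest",
  tools: ["hacker_news", "send_email"],
  messages: [],
});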

Key Design Decisions

1. The Executor Doesn't Know About the Puppeteer

This is crucial. The executor agent receives a system prompt and simply executes. It is not aware that:

  • A puppeteer exists
  • Its "memory" is actually injected context
  • Its instructions were carefully crafted based on 50 previous runs

Why? Clean separation. The executor can be any standard agent framework. No special integration needed.

2. The Puppeteer is Stateless (But Has Checkpoints + Knowledge)

The puppeteer doesn't remember its own previous decisions. Each invocation is fresh analysis.

But it has two memory mechanisms:

Checkpoints — Timeline markers ("what happened when") with actionables:

┌─────────────────────────────────────────────────────────────┐
│                      CHECKPOINTS                            │
│              (Puppeteer's notes to future self)             │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Run #10 [milestone]                                        │
│  "Monitoring pattern established. HN search + email         │
│   digest working reliably. Threshold: 2+ articles."         │
│                                                             │
│  Run #34 [issue_resolved]                                   │
│  "Email delivery fixed. Root cause: wrong recipient.        │
│   ACTIONABLE: If email fails again, check user              │
│   instructions first—may be another address change."        │
│                                                             │
│  Run #45 [strategy_change]                                  │
│  "User added Reddit. Now monitoring HN + r/ML.              │
│   ACTIONABLE: Use different rate limits for Reddit API."    │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Knowledge — Reusable facts ("what we learned") that apply across runs:

┌─────────────────────────────────────────────────────────────┐
│                       KNOWLEDGE                             │
│              (Distilled, reusable learnings)                │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Topic: kite_api_auth                                       │
│  "Kite sessions expire at 6 AM IST daily. Auth failures     │
│   in morning runs are expected. User must re-authenticate   │
│   via OAuth—don't retry repeatedly."                        │
│                                                             │
│  Topic: competitor_c_scraping                               │
│  "Main website blocks scraping. Use LinkedIn company        │
│   page instead. Pricing info is in 'About' section."        │
│                                                             │
└─────────────────────────────────────────────────────────────┘

The difference:

  • Checkpoint: "Run #45 failed due to Kite auth" (timeline event)
  • Knowledge: "Kite sessions expire at 6 AM IST" (reusable fact)
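
One possible shape for the two stores, directly mirroring the examples above (a sketch; the field names are assumptions, not a fixed schema):

// Checkpoint: a timeline event, tied to a specific run.
interface Checkpoint {
  run: number;
  kind: "milestone" | "issue_resolved" | "strategy_change";
  note: string;
  actionable?: string;  // advice for the puppeteer's future self
}

// Knowledge: a reusable fact, not tied to any single run.
interface Knowledge {
  topic: string;
  fact: string;
}

const checkpoint: Checkpoint = {
  run: 45,
  kind: "strategy_change",
  note: "User added Reddit. Now monitoring HN + r/ML.",
  actionable: "Use different rate limits for the Reddit API.",
};

const knowledge: Knowledge = {
  topic: "kite_api_auth",
  fact: "Kite sessions expire at 6 AM IST daily; don't retry repeatedly.",
};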

3. Reuse Policies Minimize Puppeteer Invocations

Running the puppeteer before every executor run is wasteful. Instead, the puppeteer outputs a reuse policy:

{
  "executor_system_prompt": "You are an AI news agent...",

  "reuse_policy": {
    "valid_for_runs": 5,
    "invalidate_on": ["executor_failed", "user_instructions_changed"],
    "dynamic_fields": {
      "{{hours_since_last_run}}": "Calculate from timestamps",
      "{{recent_articles}}": "Extract from last 3 runs"
    }
  }
}

How it works:

┌─────────────────────────────────────────────────────────────┐
│                    EXECUTION FLOW                           │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Cron triggers executor run                                 │
│            │                                                │
│            ▼                                                │
│  ┌─────────────────────┐                                    │
│  │ Check reuse policy  │                                    │
│  └─────────────────────┘                                    │
│            │                                                │
│     ┌──────┴──────┐                                         │
│     │             │                                         │
│     ▼             ▼                                         │
│  [Valid]      [Invalid]                                     │
│     │             │                                         │
│     ▼             ▼                                         │
│  Fill in      Invoke                                        │
│  dynamic      Puppeteer                                     │
│  fields       for fresh                                     │
│     │         analysis                                      │
│     │             │                                         │
│     └──────┬──────┘                                         │
│            │                                                │
│            ▼                                                │
│     Run Executor                                            │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Result: Puppeteer might run once every 5-10 executor runs, not every time.
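
A sketch of the reuse check that runs on each trigger. The snake_case fields follow the JSON above; the run counter and event list are assumptions about how a scheduler might track invalidation conditions.

interface ReusePolicy {
  valid_for_runs: number;
  invalidate_on: string[];                 // e.g. "executor_failed"
  dynamic_fields: Record<string, string>;  // placeholder -> how to compute it
}

interface CachedPlan {
  executor_system_prompt: string;
  reuse_policy: ReusePolicy;
  runsSinceGenerated: number;              // incremented by the scheduler
}

// Decide whether the cached prompt is still usable, or whether the
// puppeteer must be invoked for a fresh analysis.
function shouldReuse(plan: CachedPlan, eventsSinceGenerated: string[]): boolean {
  if (plan.runsSinceGenerated >= plan.reuse_policy.valid_for_runs) return false;
  return !plan.reuse_policy.invalidate_on.some((e) => eventsSinceGenerated.includes(e));
}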

4. Self-Reflection Improves Prompt Quality

Before finalizing output, the puppeteer critiques its own work:

┌─────────────────────────────────────────────────────────────┐
│                    REFLECTION                               │
│              (Puppeteer's self-critique)                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  "Being conservative with reuse policy (3 runs) since       │
│   we just resolved a failure. Added tool_error:send_email   │
│   to invalidation conditions as extra safety. Prompt        │
│   explicitly mentions resolution to give executor           │
│   confidence. Concern: may be including too much context    │
│   but erring on the side of clarity after the failure."     │
│                                                             │
└─────────────────────────────────────────────────────────────┘

The reflection documents tradeoffs, concerns, and reasoning—helping debug prompt quality over time.


What the Puppeteer Sees vs. What the Executor Sees

┌─────────────────────────────────┬─────────────────────────────────┐
│         PUPPETEER               │           EXECUTOR              │
├─────────────────────────────────┼─────────────────────────────────┤
│                                 │                                 │
│  Complete run history (50+)     │  Nothing about history          │
│                                 │                                 │
│  All checkpoints (timeline)     │  Nothing about checkpoints      │
│                                 │                                 │
│  All knowledge (learned facts)  │  Facts baked into prompt        │
│                                 │                                 │
│  User's original instructions   │  Distilled current task         │
│                                 │                                 │
│  Failed approaches              │  "Don't do X" (without why)     │
│                                 │                                 │
│  Tool usage patterns            │  Available tools list           │
│                                 │                                 │
│  Previous puppeteer outputs     │  Nothing about puppeteer        │
│                                 │                                 │
│  Strategic timeline             │  Just this prompt, nothing else │
│                                 │                                 │
└─────────────────────────────────┴─────────────────────────────────┘

A Concrete Example

User instruction: "Monitor Hacker News for AI news and email me daily digests"

After 47 runs, the puppeteer is invoked:

What Puppeteer Analyzes:

  • Checkpoints: "Pattern established at run #10"
  • Last 5 runs: All successful, 2-4 articles each
  • No failures, stable operation

What Puppeteer Outputs:

{
  "analysis_summary": "Stable monitoring. 5 successful runs. No changes needed.",

  "strategy": "Continue cyclical monitoring. Maintain current approach.",

  "reflection": "High reuse (8 runs) appropriate for stable monitoring. Dynamic fields handle time-sensitive data. No concerns—pattern proven reliable.",

  "executor_system_prompt": "You are an AI news monitoring agent.\n\nSearch for AI posts from the past {{hours}} hours.\n\nStories already sent (do not repeat):\n{{recent_articles}}\n\nIf 2+ new stories found, send email digest.\nIf <2, skip and wait for next run.",

  "reuse_policy": {
    "valid_for_runs": 8,
    "invalidate_on": ["executor_failed"],
    "dynamic_fields": {
      "{{hours}}": "Hours since last run",
      "{{recent_articles}}": "Titles from last 5 email sends"
    }
  },

  "checkpoints_to_create": [],
  "knowledge_to_create": []
}

What Executor Sees:

You are an AI news monitoring agent.

Search for AI posts from the past 6 hours.

Stories already sent (do not repeat):
- "GPT-5 Released with 1M Context Window"
- "Anthropic Raises $2B Series C"
- "Open Source LLM Beats GPT-4 on Benchmarks"

If 2+ new stories found, send email digest.
If <2, skip and wait for next run.

The executor has no idea this prompt was carefully constructed by analyzing 47 previous runs. It just executes.
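
Filling in the dynamic fields is plain string substitution over the cached template. A minimal sketch, assuming the {{...}} placeholders from the reuse policy above and example values drawn from the run history:

// Resolve {{placeholders}} in the cached prompt just before an executor run.
function renderPrompt(template: string, values: Record<string, string>): string {
  return Object.entries(values).reduce(
    (prompt, [placeholder, value]) => prompt.split(placeholder).join(value),
    template,
  );
}

const cachedTemplate =
  "Search for AI posts from the past {{hours}} hours.\n" +
  "Stories already sent (do not repeat):\n{{recent_articles}}";

const prompt = renderPrompt(cachedTemplate, {
  "{{hours}}": "6",  // computed from the timestamp of the last run
  "{{recent_articles}}": '- "GPT-5 Released with 1M Context Window"',  // from recent sends
});

console.log(prompt);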


Why This Works

Benefit                  How
-----------------------  ----------------------------------------------------------------
No duplicate work        Puppeteer analyzes history, tells executor what's already done
Learns from failures     Puppeteer sees failed approaches, routes around them
Retains knowledge        Reusable facts (API quirks, workarounds) persist in knowledge base
Efficient token usage    Executor gets minimal context; puppeteer runs infrequently
Clean separation         Executor is standard agent; no special memory integration
Strategic continuity     Checkpoints preserve high-level timeline with actionables
Self-improving           Reflections help identify and fix prompt quality issues
Graceful degradation     If puppeteer fails, last cached prompt still works

When to Use This Pattern

Good fit:

  • Long-running autonomous agents (cron-scheduled)
  • Agents that build on previous work (research, monitoring)
  • Multi-phase projects spanning many runs
  • Agents where duplicate work is costly or annoying

Not needed:

  • Single-shot tasks
  • Conversational agents (already have session context)
  • Agents where every run is independent

The Mental Model

Think of it like a film production:

Role               Agent
-----------------  ------------------------------------------------------------
Director           Puppeteer: sees the whole story, plans each scene
Actor              Executor: performs the current scene brilliantly
Script             System prompt: the actor's instructions for this scene
Dailies            Run history: the footage the director reviews
Continuity notes   Checkpoints: timeline markers ("day 3 of shoot")
Production bible   Knowledge: reusable facts ("actor allergic to peanuts")

The actor doesn't need to remember every previous scene. The director handles continuity. The audience (user) sees seamless performance.


What's Next

We're building this into Alvix AI, our open-source autonomous agent platform. The architecture is generic—any agent with any tools can use this memory pattern.

Key implementation pieces:

  • Puppeteer prompt engineering (the hard part)
  • Checkpoint storage and retrieval (timeline markers)
  • Knowledge base with semantic search (reusable facts)
  • Dynamic field resolution for cached prompts
  • Reuse policy evaluation logic
  • Reflection logging for prompt quality debugging
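
A rough sketch of how those pieces could fit together on each scheduled trigger. Everything here is a hypothetical stand-in (declared stubs, invented names), not the Alvix AI API:

// Illustrative stubs for storage and the two agents.
interface Plan {
  executor_system_prompt: string;
  valid_for_runs: number;
  runs_used: number;
}

declare function loadCachedPlan(): Plan | null;
declare function saveCachedPlan(plan: Plan): void;
declare function invokePuppeteer(fullContext: object): Promise<Plan>;
declare function gatherFullContext(): object;   // run history, checkpoints, knowledge, instructions
declare function fillDynamicFields(prompt: string): string;
declare function runExecutor(prompt: string): Promise<{ status: string; summary: string }>;
declare function recordRun(result: { status: string; summary: string }): void;

// One cron tick: reuse the cached plan while the policy allows it,
// otherwise ask the puppeteer for a fresh one, then run the executor.
async function onCronTick(): Promise<void> {
  let plan = loadCachedPlan();
  if (!plan || plan.runs_used >= plan.valid_for_runs) {
    plan = await invokePuppeteer(gatherFullContext());  // fresh analysis
    plan.runs_used = 0;
  }

  const prompt = fillDynamicFields(plan.executor_system_prompt);
  const result = await runExecutor(prompt);

  recordRun(result);       // becomes input to the next puppeteer analysis
  plan.runs_used += 1;
  saveCachedPlan(plan);
}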

The insight is simple: Don't make one agent do everything. Let a strategic agent manage memory—both episodic (checkpoints) and semantic (knowledge). Let an execution agent do work. Connect them with carefully crafted, self-critiqued prompts.

The executor is a Boltzmann brain—no past, no future, just now. The puppeteer sees everything. The user gets an agent that actually remembers—and learns.

