There's a conversation I keep having with builders who are reaching for the Claude Agent SDK. It goes something like this: "I want to manage my own context. I want to inject state, control what gets summarized, and have it work with caching—not fight me."
They've usually tried the AI SDK, found it too low-level for agentic loops, looked at Claude Code, and assumed the context lifecycle is opaque. A black box the SDK manages for you and you don't get to touch.
Here's the thing: that assumption is wrong. And the V2 SDK makes it clearer than ever.
Let me show you what's actually available.
Before getting into mechanics, let's be precise about what "context control" actually means in this setting—because it's not one thing, it's at least three:
- Injection — inserting facts, observations, or structured state into the conversation at the right moment
- Compaction — controlling how a long conversation gets summarized when it hits context limits
- Persistence — resuming a session later with the same accumulated context
Most people who feel stuck are actually blocked on #2. They don't want the default compaction behavior—they want to observe what's about to be compressed, extract the signal they care about, and feed it forward in a structured way. If you've been watching the agent memory space, Mastra shipped exactly this idea recently: their Observational Memory system runs two background agents (an Observer and a Reflector) that continuously compress conversations into a dated event log—specific decisions, actions, what changed—rather than a lossy summary. The context never blows up; it stays bounded and stable. That's the target to aim at.
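To make the event-log idea concrete, here's a minimal sketch (the types and the cap are my own illustration, not Mastra's actual API) of a bounded, dated event log:

```typescript
// Hypothetical shapes: an illustration of "event log, not prose summary".
type ObservationEvent = {
  date: string;                             // when it happened
  kind: "decision" | "action" | "change";   // what kind of fact this is
  detail: string;                           // the specific thing worth keeping
};

// Bounded append: once the cap is hit, the oldest entries fall off,
// so the log (and the context it feeds) stays a stable size.
function appendBounded(
  log: ObservationEvent[],
  event: ObservationEvent,
  cap = 50
): ObservationEvent[] {
  const next = [...log, event];
  return next.length > cap ? next.slice(next.length - cap) : next;
}
```

An Observer-style agent would emit entries like `{ date: "2025-01-12", kind: "decision", detail: "switched token refresh to a 401 retry" }`; the bound is what keeps the context from blowing up.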
All of this is achievable. Let's take it piece by piece.
If you've been on the SDK for a while, V2 changes the mental model in a useful way.
V1: query() returns an async stream ─────────────────────────────
query(prompt) ──► [stream of messages: user, assistant, result]
Multi-turn requires you to build an async iterable generator
and feed messages into it. Awkward. Coordination overhead.
V2: session has explicit send()/stream() separation ─────────────
createSession() ──► session
        │
        ├── session.send("Turn 1")
        │       └── session.stream() ──► [messages...]
        │
        ├── session.send("Turn 2")      ← same session
        │       └── session.stream() ──► [messages...]
        │
        └── session.close()
The separation of send() and stream() matters more than it looks. It creates a natural seam between turns where you can run logic—read state, inject context, decide what to send next. V1 made that seam implicit (and annoying). V2 makes it explicit.
import { unstable_v2_createSession } from "@anthropic-ai/claude-agent-sdk";
await using session = unstable_v2_createSession({
model: "claude-opus-4-6"
});
// Turn 1 — get initial response
await session.send("Analyze the auth module and note any issues.");
for await (const msg of session.stream()) {
if (msg.type === "assistant") {
const text = msg.message.content
.filter(b => b.type === "text")
.map(b => b.text)
.join("");
console.log("Turn 1:", text);
}
}
// ← here is your seam. Run whatever logic you need.
// Inject, check state, decide what comes next.
// Turn 2 — continue with full context preserved
await session.send("Now fix the token validation issue you found.");
for await (const msg of session.stream()) {
if (msg.type === "assistant") {
// handle response...
}
}
// `await using` automatically calls session.close() when the scope exits

Session persistence works across application restarts too—unstable_v2_resumeSession(sessionId) picks up exactly where you left off.
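The durable half of that is just storing the session id somewhere that survives a restart. A minimal sketch (the file-based store and its function names are mine, not the SDK's):

```typescript
import { readFile, writeFile } from "fs/promises";

// Map your own task ids to SDK session ids in a JSON file, so a later
// process can look one up and hand it to unstable_v2_resumeSession().
async function saveSessionId(storePath: string, taskId: string, sessionId: string) {
  let store: Record<string, string> = {};
  try {
    store = JSON.parse(await readFile(storePath, "utf-8"));
  } catch {
    // first run: no store file yet
  }
  store[taskId] = sessionId;
  await writeFile(storePath, JSON.stringify(store, null, 2));
}

async function loadSessionId(storePath: string, taskId: string): Promise<string | null> {
  try {
    const store = JSON.parse(await readFile(storePath, "utf-8"));
    return store[taskId] ?? null;
  } catch {
    return null;
  }
}
```

Anything durable works here; a JSON file is just the smallest thing that demonstrates the shape.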
The cleanest way to inject structured information into the context is via the system prompt. You can do this as a full replacement or—more usefully—as an append to the existing preset:
import { query } from "@anthropic-ai/claude-agent-sdk";
// Inject observations as a structured system prompt append
const observationalMemory = {
knownIssues: ["OAuth token not refreshing on 401", "Missing rate limit headers"],
projectContext: "Node.js API, Express 4.x, Postgres",
lastCheckpoint: "Refactored auth.ts, tests passing"
};
for await (const msg of query({
prompt: "Continue from where we left off on the auth refactor.",
options: {
systemPrompt: {
type: "preset",
preset: "claude_code",
append: `
## Observational Memory
The following state was captured from the previous session:
Known Issues:
${observationalMemory.knownIssues.map(i => `- ${i}`).join("\n")}
Project Context: ${observationalMemory.projectContext}
Last Checkpoint: ${observationalMemory.lastCheckpoint}
Use this as ground truth. Do not re-investigate what's already resolved.
`
}
}
})) {
// handle messages
}This is cache-friendly, by the way. Because the system prompt is prepended to every request, it lands in the prompt cache and you're not paying full price on repeated tokens. Structure it so the stable parts come first and the dynamic parts come at the end—that maximizes cache hit rate.
Here's where "accessing the internals" actually becomes possible—and where Dennison's concern is worth addressing directly.
You can't directly read or mutate the raw in-memory context window mid-conversation. That part is opaque. What you can do is hook into the moment just before the context gets compacted—when the full transcript is written to disk as a JSONL file—and do whatever you want with it.
That hook is called PreCompact.
Context Lifecycle with PreCompact Hook
─────────────────────────────────────
Session starts
│
▼
[Context grows across turns]
│
│ ← hits ~80% of context limit (auto)
│ or you trigger manually via /compact
▼
PreCompact Hook fires ◄──────────────────────────────────┐
│ │
├── receives: session_id │
├── receives: transcript_path ← JSONL on disk │
├── receives: trigger ("auto" | "manual") │
└── can return: custom_instructions │
│ │
▼ │
SDK reads transcript, │
uses custom_instructions │
to guide summarization │
│ │
▼ │
Compacted summary replaces │
conversation history │
│ │
▼ │
compact_boundary event emitted ──────────┘
(with pre_tokens count)
│
▼
Session continues with compressed context
The transcript_path is the key. It's a JSONL file containing the full conversation history. You can read it, extract whatever you care about, and pass targeted instructions back to influence how compaction summarizes.
Here's a pattern for observational memory using the PreCompact hook:
import { query, type PreCompactHookInput } from "@anthropic-ai/claude-agent-sdk";
// Your observation extractor — runs over the raw transcript
async function extractObservations(transcriptPath: string): Promise<string> {
const fs = await import("fs/promises");
const lines = (await fs.readFile(transcriptPath, "utf-8"))
.split("\n")
.filter(Boolean)
.map(l => JSON.parse(l));
// Pull out assistant messages and look for structured markers
// (You'd customize this for your own schema)
const decisions: string[] = [];
const issues: string[] = [];
for (const entry of lines) {
if (entry.role === "assistant") {
const text = entry.content
?.filter((b: any) => b.type === "text")
.map((b: any) => b.text)
.join("") ?? "";
// Example: extract lines that look like decisions or findings
if (text.includes("DECISION:")) {
decisions.push(text.match(/DECISION:(.*)/)?.[1]?.trim() ?? "");
}
if (text.includes("ISSUE:")) {
issues.push(text.match(/ISSUE:(.*)/)?.[1]?.trim() ?? "");
}
}
}
return `
When compacting, prioritize preserving:
- These decisions were made: ${decisions.join("; ")}
- These issues were identified: ${issues.join("; ")}
- Maintain the exact file paths and function names mentioned.
- Collapse exploratory back-and-forth; keep conclusions.
`.trim();
}
// Register the hook
const hookHandler = async (input: PreCompactHookInput) => {
const customInstructions = await extractObservations(input.transcript_path);
return {
hookSpecificOutput: {
hookEventName: "PreCompact" as const,
customInstructions
}
};
};
// Run your session with the hook attached
for await (const msg of query({
prompt: "Let's continue the refactor.",
options: {
hooks: {
PreCompact: hookHandler
}
}
})) {
if (msg.type === "system" && msg.subtype === "compact_boundary") {
console.log(`Compacted. Was ${msg.compact_metadata.pre_tokens} tokens.`);
}
}

This is the closest manual approximation of what Mastra OM does automatically. One honest difference: Mastra's approach produces an event-based log—specific dated entries about what happened and what was decided—rather than a prose summary. The PreCompact hook uses compaction, which is inherently more lossy. But with precise custom_instructions, you can push the summary toward that event-log style and recover most of the signal.
The auto-trigger fires when context hits a threshold. But you can also trigger it manually on your own schedule—which is particularly useful if you want to compact at a meaningful boundary in your workflow rather than an arbitrary token count.
// Trigger compaction via the /compact slash command
for await (const msg of query({
prompt: "/compact",
options: { maxTurns: 1 }
})) {
if (msg.type === "system" && msg.subtype === "compact_boundary") {
console.log("Compaction complete.");
console.log(`Tokens before: ${msg.compact_metadata.pre_tokens}`);
}
}

Combined with the PreCompact hook, this gives you full control over the when and the how. Compact after each major task phase. Extract observations. Inject curated memory into the next phase. The session ID persists through all of it.
Putting it all together, here's how you'd wire up a session that approximates Mastra's OM approach with cache-friendliness. You're doing manually what Mastra automates: extract observations at phase boundaries, store them, inject them forward.
┌─────────────────────────────────────────────────────────┐
│ Session Architecture │
├─────────────────────────────────────────────────────────┤
│ │
│ System Prompt (STABLE — cache hit every request) │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Base instructions + project context + schema │ │
│ └─────────────────────────────────────────────────┘ │
│ + │
│ System Prompt Append (SEMI-STABLE — injected at start │
│ of each phase) │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Observational memory from last compact │ │
│ │ Known issues, decisions, checkpoints │ │
│ └─────────────────────────────────────────────────┘ │
│ + │
│ Live Conversation (DYNAMIC — grows until compact) │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Turn 1... Turn 2... Turn N... │ │
│ │ [PreCompact Hook fires here] │ │
│ │ → extract observations │ │
│ │ → write custom_instructions │ │
│ │ → compact_boundary emitted │ │
│ └─────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────┘
import {
unstable_v2_createSession,
unstable_v2_resumeSession
} from "@anthropic-ai/claude-agent-sdk";
// Your persistent observation store (replace with DB, Convex, etc.).
// Wire the PreCompact hook from earlier to write into it; for example,
// have extractObservations() save its findings here as a side effect.
let savedObservations: string = "";
async function runPhase(
sessionId: string | null,
phasePrompt: string
): Promise<string> {
const session = sessionId
? unstable_v2_resumeSession(sessionId, { model: "claude-opus-4-6" })
: unstable_v2_createSession({
model: "claude-opus-4-6",
systemPrompt: {
type: "preset",
preset: "claude_code",
append: savedObservations
? `\n## Memory from Previous Phase\n${savedObservations}`
: ""
}
});
await session.send(phasePrompt);
let newSessionId = session.sessionId;
for await (const msg of session.stream()) {
if ("session_id" in msg) newSessionId = msg.session_id;
if (msg.type === "system" && msg.subtype === "compact_boundary") {
console.log(`Phase compacted. Tokens: ${msg.compact_metadata.pre_tokens}`);
}
}
// Manual compact at end of phase to crystallize observations
await session.send("/compact");
for await (const msg of session.stream()) {
if (msg.type === "system" && msg.subtype === "compact_boundary") {
console.log("Phase boundary set.");
}
}
session.close();
return newSessionId;
}
// Phase 1
let sid = await runPhase(null, "Audit the authentication module.");
// Phase 2 — resumes with observations injected
sid = await runPhase(sid, "Fix the issues from the audit.");
// Phase 3
sid = await runPhase(sid, "Write tests for the fixes.");

Dennison's last reply in our thread was: "interesting, last time I checked you couldn't access the internals."
That's fair—and the nuance is worth being precise about. You can't directly read or mutate what's sitting in the model's in-memory context mid-turn. That's genuinely opaque. What you can access is:
- The transcript JSONL — full conversation history on disk, available in every hook via transcript_path
- The system prompt — fully replaceable or appendable at session creation
- Compaction behavior — steerable via custom_instructions in the PreCompact hook
- Session state — resumable by ID, persistable indefinitely
For most real use cases—and definitely for Mastra-inspired observational memory—those surfaces are enough. You're not editing memory directly; you're doing what any good engineer does: intercepting the right moments, extracting signal, and feeding it forward in a structured way.
Worth noting: if you want Mastra OM's full behavior—automated background Observer/Reflector agents, true event logs, zero compaction—you'd need to bring in @mastra/memory directly or build those background agents yourself. What the Claude Agent SDK gives you is the seams. What Mastra gives you is the automation. They're not mutually exclusive.
In fact, the cleaner integration might go the other direction from what you'd expect. Rather than running both systems in parallel, you could call @mastra/memory's Observer from inside the PreCompact hook—delegating your extraction logic to it entirely, getting back structured event-log observations in Mastra's format, then feeding those forward as custom_instructions. The Claude Agent SDK stays in charge of the session loop; @mastra/memory becomes your extraction engine. Each piece in its lane. Whether the package exposes the Observer as a standalone callable (rather than requiring a full Mastra Agent wrapper) is worth confirming in the package API before building on it—but architecturally, it's the right fit.
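As a sketch of that shape, with the extraction engine injected as a plain function (since a standalone Observer callable is unconfirmed, `ObserverFn` here is my own stand-in), the hook itself can stay engine-agnostic:

```typescript
import { readFile } from "fs/promises";

// `ObserverFn` stands in for whatever extraction engine you plug in:
// Mastra's Observer if its API allows, or your own. The hook is agnostic.
type ObserverFn = (transcript: string) => Promise<string[]>;

function makePreCompactHandler(observe: ObserverFn) {
  return async (input: { transcript_path: string }) => {
    const transcript = await readFile(input.transcript_path, "utf-8");
    const events = await observe(transcript);
    return {
      hookSpecificOutput: {
        hookEventName: "PreCompact" as const,
        customInstructions:
          "Preserve these dated events verbatim:\n" +
          events.map(e => `- ${e}`).join("\n")
      }
    };
  };
}
```

Swapping extraction engines then means swapping one function, without touching the session loop.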
If I were wiring this up from scratch today, I'd sequence it this way:
- Start with V2 sessions — unstable_v2_createSession + send()/stream(). Get comfortable with the multi-turn model before adding hooks.
- Add system prompt injection — even before you tackle compaction, you can inject structured context as a system prompt append. Works immediately, plays well with caching.
- Wire up the PreCompact hook — read the transcript, extract what you care about, test that your custom_instructions are steering the summary the way you want.
- Move to manual compaction triggers — compact at meaningful task boundaries instead of relying on auto-trigger.
- Persist observations externally — write extracted observations to your own store (Convex, Postgres, whatever) so you can reconstruct memory across sessions, not just within them. And if you find yourself wanting to fully automate the Observer/Reflector loop rather than trigger it manually, @mastra/memory is worth looking at—it handles that whole layer.
The SDK gives you the seams. What you build in those seams is yours.