@bdougie
Created March 10, 2026 03:33
How observational memory and Tapes work together in the Pokemon agent
# Observational Memory + Tapes in the Pokemon Agent
## The Problem
Long agent runs hit context compaction. When Claude's context window fills up, older messages get summarized and the cache prefix breaks. The agent loses continuity: it forgets what it tried, what failed, and what worked. Each new session starts from scratch.
## How Tapes Solves Storage
[Tapes](https://tapes.dev) proxies all LLM API calls and records every conversation turn in a content-addressable SQLite database at `.tapes/tapes.sqlite`. Each node in the database has:
- A content hash (its primary key)
- A `parent_hash` linking it to the previous turn
- The full message content (role, text, tool calls, tool results)
- Token counts (input, output, cache creation, cache read)
- Timestamps, model name, agent name
Sessions form chains. A root node (`parent_hash IS NULL`) starts a session. Each subsequent turn points back to its parent. To reconstruct a session, you walk the chain with a recursive CTE:
```sql
WITH RECURSIVE chain(h) AS (
    SELECT ?  -- start from the root hash
    UNION ALL
    SELECT n.hash
    FROM nodes n
    JOIN chain ON n.parent_hash = chain.h
)
SELECT n.*
FROM chain
JOIN nodes n ON n.hash = chain.h
ORDER BY n.created_at;
```
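The same walk works from Python with only the standard library. A minimal sketch, assuming the schema described above (`nodes` with `hash`, `parent_hash`, `created_at`); `read_chain` is a hypothetical helper, not part of `tape_reader.py`:

```python
import sqlite3

def read_chain(db_path: str, root_hash: str) -> list:
    """Return all nodes in a session, oldest first, by walking the
    parent_hash chain down from the root node."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row  # rows are addressable by column name
    rows = conn.execute(
        """
        WITH RECURSIVE chain(h) AS (
            SELECT ?
            UNION ALL
            SELECT n.hash FROM nodes n
            JOIN chain ON n.parent_hash = chain.h
        )
        SELECT n.* FROM chain JOIN nodes n ON n.hash = chain.h
        ORDER BY n.created_at
        """,
        (root_hash,),
    ).fetchall()
    conn.close()
    return rows
```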
This gives you the full conversation in order: every user message, every assistant response, every tool call and result.
## How Observational Memory Distills It
The observer sits between Tapes (the raw data) and the agent's next session (which needs context). It runs after a session ends and extracts what matters.
### The pipeline
```
.tapes/tapes.sqlite
        ▼
TapeReader.read_session(root_hash)
  │ parses nodes into TapeEntry objects
  │ extracts tool uses, tool results, token counts
        ▼
Observer.observe_session(session)
  │ runs 4 heuristic extractors:
  │   1. Session goal (first user message, skipping system noise)
  │   2. Tool errors and exception tracebacks
  │   3. Files created (Write tool invocations)
  │   4. Token usage summary
  │ classifies each observation by priority:
  │   [important]     bug, error, crash, security, etc.
  │   [possible]      test, refactor, update, etc.
  │   [informational] everything else
        ▼
Observer.write_observations()
  │ appends to .tapes/memory/observations.md
  │ grouped by date, deduplicates headers
        ▼
.tapes/memory/observations.md     ← next session reads this
.tapes/memory/observer_state.json ← tracks processed sessions
```
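The priority step can be sketched as a first-match keyword classifier. The keyword lists below are illustrative, taken from the examples in the diagram; the actual lists in `observer.py` may differ:

```python
# Illustrative keyword lists; the real extractor's lists are assumptions here.
IMPORTANT = ("bug", "error", "crash", "security")
POSSIBLE = ("test", "refactor", "update")

def classify(observation: str) -> str:
    """Assign a priority label by keyword match, first bucket wins."""
    text = observation.lower()
    if any(k in text for k in IMPORTANT):
        return "important"
    if any(k in text for k in POSSIBLE):
        return "possible"
    return "informational"
```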
### What the output looks like
```markdown
## 2026-03-09
- [important] Session goal: fix the crash in battle strategy (session: a3f8c012)
- [important] Tool error: ModuleNotFoundError: No module named 'numpy' (session: a3f8c012)
- [possible] File created: scripts/pathfinding.py (session: a3f8c012)
- [informational] Token usage: 45000 input, 12000 output, 38000 cache read (session: a3f8c012)
```
### What it skips
The observer filters noise that would pollute memory:
- `<system-reminder>` tags that Tapes stores as user-role nodes
- Casual mentions of "error" in assistant text (only matches `ValueError:`, `ModuleNotFoundError:`, etc. at line start)
- Sessions that have already been processed (watermark in `observer_state.json`)
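The line-start exception match might look like the following sketch; the exact regex in `observer.py` is an assumption, but the idea is to require a capitalized `...Error:` name at the start of a line rather than matching "error" anywhere in prose:

```python
import re

# Match exception names like `ValueError:` or `ModuleNotFoundError:` only at
# the start of a line (optionally dotted, e.g. `requests.exceptions.HTTPError:`).
# The exact pattern used by observer.py is an assumption.
EXC_RE = re.compile(r"^(?:\w+\.)*[A-Z]\w*Error:", re.MULTILINE)

def has_traceback_error(text: str) -> bool:
    """True if the text contains a line-initial exception, not casual prose."""
    return bool(EXC_RE.search(text))
```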
## The Data Flow Across Sessions
```
Session 1: Agent plays Pokemon Red
└── Tapes records every LLM turn to .tapes/tapes.sqlite

After session: python3 scripts/observe_cli.py
├── Reads new sessions from SQLite
├── Extracts observations
├── Writes to .tapes/memory/observations.md
└── Updates watermark so these sessions aren't reprocessed

Session 2: Agent loads observations.md at startup
├── Knows what Session 1 tried
├── Knows what errors occurred
├── Knows what files were created
└── Picks up where Session 1 left off
```
## Three Scripts, One Job
| Script | Role |
|--------|------|
| `tape_reader.py` | Pure stdlib SQLite reader. Parses nodes into `TapeEntry` / `TapeSession` dataclasses. No dependencies beyond `sqlite3`. |
| `observer.py` | Heuristic pattern matcher. No LLM calls. Extracts observations via keyword matching and structural patterns. |
| `observe_cli.py` | CLI wrapper. Auto-detects `.tapes/tapes.sqlite`, supports `--dry-run`, `--reset`, `--session <hash>`. |
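An illustrative `argparse` surface for the flags listed above; the help text and internal structure are assumptions, only the flag names come from the table:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """CLI flags as documented for observe_cli.py; help text is illustrative."""
    p = argparse.ArgumentParser(prog="observe_cli")
    p.add_argument("--dry-run", action="store_true",
                   help="extract observations without writing them")
    p.add_argument("--reset", action="store_true",
                   help="clear the watermark and reprocess all sessions")
    p.add_argument("--session", metavar="HASH",
                   help="process a single session by root hash")
    return p
```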
## Design Decisions
**No LLM in the observer.** The extraction is pure heuristics: regex for tracebacks, keyword matching for priority, structural checks for tool errors. This keeps the observer fast, free, and deterministic. An LLM-based summarizer could be layered on top later.
**Pure stdlib for the reader.** `tape_reader.py` uses only `sqlite3`, `json`, and `dataclasses`. No pandas, no ORM, no external dependencies. It runs anywhere Python runs.
**Watermark-based idempotency.** `observer_state.json` tracks which session hashes have been processed. Running the observer twice produces no duplicates. `--reset` clears the watermark to reprocess everything.
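A minimal sketch of the watermark, assuming `observer_state.json` stores a list of processed session hashes (the exact file layout is an assumption):

```python
import json
from pathlib import Path

def load_processed(state_path: Path) -> set:
    """Read the set of already-processed session hashes, empty on first run."""
    if state_path.exists():
        return set(json.loads(state_path.read_text()).get("processed", []))
    return set()

def mark_processed(state_path: Path, hashes: set) -> None:
    """Persist the watermark so reruns skip these sessions (idempotency)."""
    state_path.write_text(json.dumps({"processed": sorted(hashes)}))
```

Deleting the file (the `--reset` behavior) makes every session look new again.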
**Observations live alongside the data.** The output goes to `.tapes/memory/`, right next to `tapes.sqlite`. Both are gitignored. The observations are a derived view of the raw data, not a separate source of truth.
## Inspired By
- [Mastra's observational memory](https://mastra.ai/blog/observational-memory): the concept of distilling agent sessions into prioritized observations
- [Factorio Learning Environment](https://github.com/JackHopkins/factorio-learning-environment): incremental report distillation and error catalog patterns