@bdougie
Created March 10, 2026 03:33
How observational memory and Tapes work together in the Pokemon agent
# Observational Memory + Tapes in the Pokemon Agent
## The Problem
Long agent runs hit context compaction. When Claude's context window fills up, older messages get summarized and the cache prefix breaks. The agent loses continuity: it forgets what it tried, what failed, and what worked. Each new session starts from scratch.
## How Tapes Solves Storage
[Tapes](https://tapes.dev) proxies all LLM API calls and records every conversation turn in a content-addressable SQLite database at `.tapes/tapes.sqlite`. Each node in the database has:
- A content hash (its primary key)
- A `parent_hash` linking it to the previous turn
- The full message content (role, text, tool calls, tool results)
- Token counts (input, output, cache creation, cache read)
- Timestamps, model name, agent name
Sessions form chains. A root node (`parent_hash IS NULL`) starts a session. Each subsequent turn points back to its parent. To reconstruct a session, you walk the chain with a recursive CTE:
```sql
WITH RECURSIVE chain(h) AS (
    SELECT ?  -- start from the root hash
    UNION ALL
    SELECT n.hash
    FROM nodes n
    JOIN chain ON n.parent_hash = chain.h
)
SELECT n.*
FROM chain
JOIN nodes n ON n.hash = chain.h
ORDER BY n.created_at;
```
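The same walk works from Python with only the standard library. A minimal sketch, assuming the schema described above (`nodes` with `hash`, `parent_hash`, `created_at`); `read_chain` is a hypothetical helper, not part of `tape_reader.py`:

```python
import sqlite3

def read_chain(db_path: str, root_hash: str) -> list:
    """Return all nodes in a session, oldest first, by walking the
    parent_hash chain down from the root node."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row  # rows are addressable by column name
    rows = conn.execute(
        """
        WITH RECURSIVE chain(h) AS (
            SELECT ?
            UNION ALL
            SELECT n.hash FROM nodes n
            JOIN chain ON n.parent_hash = chain.h
        )
        SELECT n.* FROM chain JOIN nodes n ON n.hash = chain.h
        ORDER BY n.created_at
        """,
        (root_hash,),
    ).fetchall()
    conn.close()
    return rows
```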
This gives you the full conversation in order: every user message, every assistant response, every tool call and result.
## How Observational Memory Distills It
The observer sits between Tapes (the raw data) and the agent's next session (which needs context). It runs after a session ends and extracts what matters.
### The pipeline
```
.tapes/tapes.sqlite
        ▼
TapeReader.read_session(root_hash)
  │ parses nodes into TapeEntry objects
  │ extracts tool uses, tool results, token counts
        ▼
Observer.observe_session(session)
  │ runs 4 heuristic extractors:
  │   1. Session goal (first user message, skipping system noise)
  │   2. Tool errors and exception tracebacks
  │   3. Files created (Write tool invocations)
  │   4. Token usage summary
  │ classifies each observation by priority:
  │   [important]     bug, error, crash, security, etc.
  │   [possible]      test, refactor, update, etc.
  │   [informational] everything else
        ▼
Observer.write_observations()
  │ appends to .tapes/memory/observations.md
  │ grouped by date, deduplicates headers
        ▼
.tapes/memory/observations.md     ← next session reads this
.tapes/memory/observer_state.json ← tracks processed sessions
```
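The priority step can be sketched as a first-match keyword classifier. The keyword lists below are illustrative, taken from the examples in the diagram; the actual lists in `observer.py` may differ:

```python
# Illustrative keyword lists; the real extractor's lists are assumptions here.
IMPORTANT = ("bug", "error", "crash", "security")
POSSIBLE = ("test", "refactor", "update")

def classify(observation: str) -> str:
    """Assign a priority label by keyword match, first bucket wins."""
    text = observation.lower()
    if any(k in text for k in IMPORTANT):
        return "important"
    if any(k in text for k in POSSIBLE):
        return "possible"
    return "informational"
```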
### What the output looks like
```markdown
## 2026-03-09
- [important] Session goal: fix the crash in battle strategy (session: a3f8c012)
- [important] Tool error: ModuleNotFoundError: No module named 'numpy' (session: a3f8c012)
- [possible] File created: scripts/pathfinding.py (session: a3f8c012)
- [informational] Token usage: 45000 input, 12000 output, 38000 cache read (session: a3f8c012)
```
### What it skips
The observer filters noise that would pollute memory:
- `<system-reminder>` tags that Tapes stores as user-role nodes
- Casual mentions of "error" in assistant text (only matches `ValueError:`, `ModuleNotFoundError:`, etc. at line start)
- Sessions that have already been processed (watermark in `observer_state.json`)
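The line-start exception match might look like the following sketch; the exact regex in `observer.py` is an assumption, but the idea is to require a capitalized `...Error:` name at the start of a line rather than matching "error" anywhere in prose:

```python
import re

# Match exception names like `ValueError:` or `ModuleNotFoundError:` only at
# the start of a line (optionally dotted, e.g. `requests.exceptions.HTTPError:`).
# The exact pattern used by observer.py is an assumption.
EXC_RE = re.compile(r"^(?:\w+\.)*[A-Z]\w*Error:", re.MULTILINE)

def has_traceback_error(text: str) -> bool:
    """True if the text contains a line-initial exception, not casual prose."""
    return bool(EXC_RE.search(text))
```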
## The Data Flow Across Sessions
```
Session 1: Agent plays Pokemon Red
└── Tapes records every LLM turn to .tapes/tapes.sqlite

After session: python3 scripts/observe_cli.py
├── Reads new sessions from SQLite
├── Extracts observations
├── Writes to .tapes/memory/observations.md
└── Updates watermark so these sessions aren't reprocessed

Session 2: Agent loads observations.md at startup
├── Knows what Session 1 tried
├── Knows what errors occurred
├── Knows what files were created
└── Picks up where Session 1 left off
```
## Three Scripts, One Job
| Script | Role |
|--------|------|
| `tape_reader.py` | Pure stdlib SQLite reader. Parses nodes into `TapeEntry` / `TapeSession` dataclasses. No dependencies beyond `sqlite3`. |
| `observer.py` | Heuristic pattern matcher. No LLM calls. Extracts observations via keyword matching and structural patterns. |
| `observe_cli.py` | CLI wrapper. Auto-detects `.tapes/tapes.sqlite`, supports `--dry-run`, `--reset`, `--session <hash>`. |
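An illustrative `argparse` surface for the flags listed above; the help text and internal structure are assumptions, only the flag names come from the table:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """CLI flags as documented for observe_cli.py; help text is illustrative."""
    p = argparse.ArgumentParser(prog="observe_cli")
    p.add_argument("--dry-run", action="store_true",
                   help="extract observations without writing them")
    p.add_argument("--reset", action="store_true",
                   help="clear the watermark and reprocess all sessions")
    p.add_argument("--session", metavar="HASH",
                   help="process a single session by root hash")
    return p
```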
## Design Decisions
**No LLM in the observer.** The extraction is pure heuristics: regex for tracebacks, keyword matching for priority, structural checks for tool errors. This keeps the observer fast, free, and deterministic. An LLM-based summarizer could be layered on top later.
**Pure stdlib for the reader.** `tape_reader.py` uses only `sqlite3`, `json`, and `dataclasses`. No pandas, no ORM, no external dependencies. It runs anywhere Python runs.
**Watermark-based idempotency.** `observer_state.json` tracks which session hashes have been processed. Running the observer twice produces no duplicates. `--reset` clears the watermark to reprocess everything.
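A minimal sketch of the watermark, assuming `observer_state.json` stores a list of processed session hashes (the exact file layout is an assumption):

```python
import json
from pathlib import Path

def load_processed(state_path: Path) -> set:
    """Read the set of already-processed session hashes, empty on first run."""
    if state_path.exists():
        return set(json.loads(state_path.read_text()).get("processed", []))
    return set()

def mark_processed(state_path: Path, hashes: set) -> None:
    """Persist the watermark so reruns skip these sessions (idempotency)."""
    state_path.write_text(json.dumps({"processed": sorted(hashes)}))
```

Deleting the file (the `--reset` behavior) makes every session look new again.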
**Observations live alongside the data.** The output goes to `.tapes/memory/`, right next to `tapes.sqlite`. Both are gitignored. The observations are a derived view of the raw data, not a separate source of truth.
## Inspired By
- [Mastra's observational memory](https://mastra.ai/blog/observational-memory): the concept of distilling agent sessions into prioritized observations
- [Factorio Learning Environment](https://github.com/JackHopkins/factorio-learning-environment): incremental report distillation and error catalog patterns