Your agent isn’t a chatbot. It’s a long-lived distributed system. If it isn’t riding on a durable stream, it’s a goldfish with WiFi.
My AI research agent pulled the receipts - Jay Kreps’ The Log, Kafka docs, Flink papers - and the pattern is boringly clear: the backbone of real agents is a durable log. Append-only. Ordered. Stored so you can replay, audit, and deterministically rebuild state.
Translate Kafka-ish primitives to agent needs:
- Topics - domains of activity: agent-decisions, tool-invocations, human-feedback.
- Partitions - shard by entity or thread_id to keep per-user order and scale.
- Offsets - durable progress markers so a crash resumes without duplicating work.
- Retention and replay - reproduce yesterday’s bug or run post-mortems.
- Consumer groups - horizontal scale and auto-failover.
- Delivery semantics - idempotent producers, transactional commits where it truly matters.
- Compaction - keep latest truth for profiles while history lives elsewhere.
- Backpressure - built-in brake pedal so you don’t melt an API quota.
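A minimal sketch of two of those primitives - a CloudEvents-style envelope plus a stable partition key - using only the standard library. Field names and the SHA-256 partitioner are illustrative (Kafka itself hashes keys with murmur2; any stable hash shows the idea):

```python
import hashlib
import time
import uuid

def partition_for(key: str, num_partitions: int) -> int:
    """Stable partition choice: the same thread_id always lands on the
    same partition, which is what preserves per-thread ordering."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

def make_event(topic: str, thread_id: str, event_type: str, data: dict) -> dict:
    """CloudEvents-style envelope: a stable id consumers can use as an
    idempotency key, source/type for routing, time for the audit trail."""
    return {
        "specversion": "1.0",
        "id": str(uuid.uuid4()),        # idempotency key for consumers
        "source": f"agent://{topic}",
        "type": event_type,
        "time": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "partitionkey": thread_id,      # keeps per-thread order
        "data": data,
    }

evt = make_event("agent-decisions", "thread-42", "decision.approved",
                 {"action": "refund"})
p = partition_for(evt["partitionkey"], 12)
```

Swap the dict for your schema-registry type when you have one; the point is that ordering comes from the key, not from luck.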
Why not REST, ad-hoc queues, or just vectors:
- REST is point-to-point request/response - the result vanishes once the call returns. Agents need a shared, ordered fact stream others can subscribe to.
- You can’t debug a week-long run from memory that resets on restart. You can from a log plus offsets.
- Durable logs turn chaos into something you can audit, rewind, and fix.
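"A log plus offsets" is just a fold: state is a pure function of the events so far, so a rebuild from offset 0 is deterministic. A hedged sketch with hypothetical event shapes:

```python
def apply(state: dict, event: dict) -> dict:
    """Pure reducer: no I/O, no clock, so replaying the same log
    always yields the same state."""
    if event["type"] == "tool.result":
        state.setdefault("results", []).append(event["data"])
    elif event["type"] == "decision":
        state["last_decision"] = event["data"]
    return state

def rebuild(log: list, from_offset: int = 0):
    """Fold the log into state, returning the next offset to consume.
    Persist that offset and a crash-restart resumes instead of redoing work."""
    state: dict = {}
    offset = from_offset
    for event in log[from_offset:]:
        state = apply(state, event)
        offset += 1
    return state, offset
```

The discipline that makes this work: side effects live in the tool layer, not in the reducer.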
What breaks without it:
- Double charges when a refund tool runs twice after a crash.
- Two agents race to approve because nothing orders their decisions.
- Cold-start amnesia - restarts redo work and drift into state you can’t explain to a regulator.
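The double-charge failure is exactly what consumer-side idempotency keys prevent: a crash-and-redeliver must not run the side effect twice. A sketch with an in-memory set - production would back this with a durable store (a database table or a compacted topic):

```python
class RefundTool:
    """Hypothetical tool consumer that dedupes on an idempotency key
    carried in the event envelope."""

    def __init__(self):
        self.seen: set = set()       # durable store in real life
        self.refunds_issued = 0

    def handle(self, event: dict) -> bool:
        key = event["idempotency_key"]
        if key in self.seen:         # redelivery after a crash: skip it
            return False
        self.seen.add(key)
        self.refunds_issued += 1     # the side effect happens once
        return True
```

Note the ordering: record the key, then perform the effect, inside one transaction if your store allows it.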
Reality check:
- Exactly-once is contextual and expensive. It ends at non-transactional edges. Use idempotency keys and the outbox pattern. Save transactions for the few flows where correctness demands it.
- Hot partitions will throttle you. Partitioning is a design decision, not an afterthought.
- Replay is power, and it has a cost. Without retention policies, TTLs, and schema discipline, the log rots.
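The outbox pattern mentioned above, sketched with sqlite3 so it runs anywhere. The state change and the outgoing event commit in one local transaction; a relay then drains the outbox to the log. Table and column names are illustrative:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("""CREATE TABLE outbox (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    topic TEXT, payload TEXT, published INTEGER DEFAULT 0)""")
conn.execute("INSERT INTO accounts VALUES ('acct-1', 100)")

def refund(conn, account_id: str, amount: int) -> None:
    # One transaction: both rows commit or neither does, so the event
    # can never exist without the state change (or vice versa).
    with conn:
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                     (amount, account_id))
        conn.execute("INSERT INTO outbox (topic, payload) VALUES (?, ?)",
                     ("tool-invocations",
                      json.dumps({"type": "refund", "account": account_id,
                                  "amount": amount})))

def relay(conn) -> int:
    """Poll unpublished outbox rows and hand them to the log producer."""
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0").fetchall()
    for row_id, topic, payload in rows:
        # producer.send(topic, payload) would go here
        conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    conn.commit()
    return len(rows)
```

If the relay crashes after publishing but before marking the row, the event is redelivered - which is why the idempotency keys on the consumer side are non-negotiable.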
90-day, no-drama plan:
- Weeks 1-2: Stand up a Kafka-compatible backbone. Add a schema registry. Define a simple CloudEvents-style envelope.
- Weeks 3-4: Partition by thread_id. Add idempotency keys to every tool call. Wire DLQs.
- Weeks 5-6: Propagate OpenTelemetry trace IDs in headers. Build a record-replay harness.
- Weeks 7-12: Add compaction and PII TTLs. Chaos-test restarts. Introduce transactions only where needed.
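The weeks 5-6 record-replay harness can be smaller than it sounds. A hedged sketch (class and field names are mine): record live tool outputs onto a tape, then replay the tape deterministically - no network, same order - to reproduce yesterday’s bug.

```python
class ToolHarness:
    """Wraps tool calls. In 'record' mode it hits the live function and
    logs (tool, args, result); in 'replay' mode it serves results from
    the tape and fails loudly if the run diverges."""

    def __init__(self, mode: str, tape=None):
        self.mode = mode                  # "record" or "replay"
        self.tape = tape if tape is not None else []
        self.cursor = 0

    def call(self, tool_name: str, args: dict, live_fn):
        if self.mode == "record":
            result = live_fn(**args)      # real API / tool call
            self.tape.append({"tool": tool_name, "args": args,
                              "result": result})
            return result
        entry = self.tape[self.cursor]    # replay: deterministic, offline
        self.cursor += 1
        assert entry["tool"] == tool_name and entry["args"] == args, \
            "run diverged from tape"
        return entry["result"]
```

Persist the tape next to the offsets and a post-mortem becomes a unit test.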
Takeaway: Treat your agent as a stream processor. The next devtool war won’t be won by prompts - it will be won by durable logs, schemas, and replay.
What’s your partition key for agents today, and what bit you first - dupes, hot shards, or missing replay? 🧰