LinkedIn Post - 2026-02-25 05:15

Your agent isn’t a chatbot. It’s a long-lived distributed system. If it isn’t riding on a durable stream, it’s a goldfish with WiFi.

My AI research agent pulled the receipts - Jay Kreps’ The Log, Kafka docs, Flink papers - and the pattern is boringly clear: the backbone of real agents is a durable log. Append-only. Ordered. Stored so you can replay, audit, and deterministically rebuild state.

Translate Kafka-ish primitives to agent needs:

  • Topics - domains of activity: agent-decisions, tool-invocations, human-feedback.
  • Partitions - shard by entity or thread_id to keep per-user order and scale.
  • Offsets - durable progress markers so a crash resumes without duping work.
  • Retention and replay - reproduce yesterday’s bug or run post-mortems.
  • Consumer groups - horizontal scale and auto-failover.
  • Delivery semantics - idempotent producers, transactional commits where it truly matters.
  • Compaction - keep latest truth for profiles while history lives elsewhere.
  • Backpressure - built-in brake pedal so you don’t melt an API quota.
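The partition-and-offset ideas above fit in a few lines. A minimal sketch, assuming a fixed partition count (the hash function and names here are illustrative, not Kafka's actual partitioner):

```python
import hashlib

NUM_PARTITIONS = 8

def partition_for(thread_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Stable hash: every event for a given thread lands on the same
    partition, so per-thread ordering is preserved while threads scale out."""
    digest = hashlib.sha256(thread_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Same thread_id always maps to the same partition.
p1 = partition_for("thread-42")
p2 = partition_for("thread-42")
```

Swap the hash or partition count and ordering guarantees change, which is why partitioning is a design decision, not a default.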

Why not REST, ad-hoc queues, or just vectors:

  • REST is call-and-forget. Agents need a shared, ordered fact stream others can subscribe to.
  • You can’t debug a week-long run from scratch memory. You can from a log plus offsets.
  • Durable logs turn chaos into something you can audit, rewind, and fix.
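The log-plus-offsets debugging story is easy to demo. A toy in-memory version (real systems persist this, but the offset semantics are the same):

```python
class DurableLog:
    """Toy append-only log: ordered events, offset-based resume and replay."""

    def __init__(self):
        self._events = []

    def append(self, event) -> int:
        self._events.append(event)
        return len(self._events) - 1  # offset of the appended event

    def read_from(self, offset: int):
        """Replay everything from a saved offset onward."""
        return self._events[offset:]

log = DurableLog()
log.append({"type": "agent-decision", "data": "approve"})
checkpoint = log.append({"type": "tool-invocation", "data": "refund"})
log.append({"type": "human-feedback", "data": "ok"})

# A consumer that crashed after `checkpoint` resumes here, redoing nothing.
resumed = log.read_from(checkpoint + 1)
```

With REST there is nothing to `read_from` after a crash; with a log, resumption is one offset lookup.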

What breaks without it:

  • Double charges when a refund tool runs twice after a crash.
  • Two agents race to approve because nothing orders their decisions.
  • Cold-start amnesia - restarts redo work, and state drifts in ways you can’t explain to a regulator.
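The double-charge failure is exactly what an idempotency key kills. A minimal sketch (in-memory store and function names are assumptions; production would persist the keys):

```python
processed: dict[str, str] = {}  # idempotency_key -> prior result

def refund(idempotency_key: str, amount: int) -> str:
    """Redelivery of the same event is a no-op, not a second charge."""
    if idempotency_key in processed:
        # Crash-then-replay lands here: return the original result.
        return processed[idempotency_key]
    result = f"refunded {amount}"  # imagine the real payment call here
    processed[idempotency_key] = result
    return result

first = refund("order-123-refund", 50)
second = refund("order-123-refund", 50)  # replayed event, same key
```

The key must be derived from the event, not generated at call time, or replays mint fresh keys and you are back to double charges.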

Reality check:

  • Exactly-once is contextual and expensive. It ends at non-transactional edges. Use idempotency keys and the outbox pattern. Save transactions for the few flows where correctness demands it.
  • Hot partitions will throttle you. Partitioning is a design decision, not an afterthought.
  • Replay is both power and cost. Budget retention, enforce TTLs and schema discipline, or the log rots.
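The outbox pattern mentioned above is just one local transaction covering both the state change and the outgoing event. A sketch with SQLite standing in for the service database (table names are illustrative):

```python
import sqlite3

# Outbox pattern: write the state change and the outgoing event in ONE
# local transaction; a separate relay later publishes outbox rows to the log.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE refunds (order_id TEXT PRIMARY KEY, amount INT)")
conn.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY, payload TEXT)")

with conn:  # both inserts commit together, or neither does
    conn.execute("INSERT INTO refunds VALUES (?, ?)", ("order-123", 50))
    conn.execute(
        "INSERT INTO outbox (payload) VALUES (?)",
        ('{"type": "refund-issued", "order_id": "order-123"}',),
    )

# The relay polls this, publishes to the topic, then marks rows as sent.
pending = conn.execute("SELECT payload FROM outbox").fetchall()
```

No distributed transaction, no two-phase commit: the non-transactional edge moves into the relay, where idempotent publishing handles it.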

90-day, no-drama plan:

  • Weeks 1-2: Stand up a Kafka-compatible backbone. Add a schema registry. Define a simple CloudEvents-style envelope.
  • Weeks 3-4: Partition by thread_id. Add idempotency keys to every tool call. Wire DLQs.
  • Weeks 5-6: Propagate OpenTelemetry trace IDs in headers. Build a record-replay harness.
  • Weeks 7-12: Add compaction and PII TTLs. Chaos-test restarts. Introduce transactions only where needed.
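The weeks 1-2 envelope can be a dozen lines. A minimal CloudEvents-style sketch (field set follows the CloudEvents 1.0 core attributes; the helper name is an assumption):

```python
import json
import uuid
from datetime import datetime, timezone

def make_envelope(event_type: str, source: str, data: dict) -> dict:
    """Minimal CloudEvents-style envelope; extend with trace headers and
    schema IDs as the registry comes online."""
    return {
        "specversion": "1.0",
        "id": str(uuid.uuid4()),          # unique per event, doubles as dedupe key
        "type": event_type,               # e.g. "agent.decision"
        "source": source,                 # which agent/component emitted it
        "time": datetime.now(timezone.utc).isoformat(),
        "data": data,
    }

evt = make_envelope("agent.decision", "/agents/research-1", {"action": "approve"})
wire = json.dumps(evt)  # what actually lands on the topic
```

Locking this shape down early is what makes the week 5-6 record-replay harness and the week 7-12 compaction work tractable.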

Takeaway: Treat your agent as a stream processor. The next devtool war won’t be won by prompts - it will be won by durable logs, schemas, and replay.

What’s your partition key for agents today, and what bit you first - dupes, hot shards, or missing replay? 🧰
