Mudguy is a system that connects AI-powered NPCs to any MUD/MUSH game server using lowest-common-denominator protocols. NPCs perceive the game world, form memories, maintain relationships, and interact with players and each other to create a living, breathing world — even when no players are present.
This document captures the high-level architecture and design intent. It is not an implementation plan. Individual subsystems will receive their own detailed design documents before any code is written.
Mudguy runs alongside any existing MUD server. Each NPC logs into the MUD as a regular player character via telnet. From the MUD server's perspective, NPCs are indistinguishable from human players — they connect, they type commands, they receive text output.
The intelligence lives entirely on the Mudguy side. Each NPC processes its own telnet stream through a multi-tiered classification pipeline, invoking language models only when necessary. A distributed contention system prevents NPCs from flooding the game with noise, while an initiative rotation gives idle NPCs the opportunity to act unprompted.
- Not a MUD server. It connects to existing MUD servers as a client.
- Not a chatbot. NPCs have structured memory, personality, relationships, and goals. They don't just respond to prompts.
- Not a distributed system. Everything runs on a single machine. We optimize for clarity and correctness, not horizontal scale.
- The world feels alive. NPCs act on their own. Players walk into scenes already in progress.
- Technical constraints become narrative mechanics. Context window limits become sleep cycles. Processing latency becomes thoughtful pauses.
- Expensive operations are rare. The system is designed so that full LLM reasoning calls are infrequent. Most processing is rule-based or handled by small, fast models.
- NPCs are observers first, speakers second. Every NPC sees everything in its room. Most of what it sees becomes memory, not speech.
- The system is MUD-agnostic. No assumptions about specific MUD engines. If it speaks telnet, Mudguy can connect to it.
┌──────────────────────────────────────────────────────────────────┐
│ MUDGUY │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │NPC Actor │ │NPC Actor │ │NPC Actor │ ... │
│ │ + Telnet │ │ + Telnet │ │ + Telnet │ │
│ │ + Timers │ │ + Timers │ │ + Timers │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│ │ │ │ │
│ │ ┌─────────┴──────────┐ │ │
│ │ │ Coordination Bus │ │ │
│ │ │ (echo tagging, │ │ │
│ │ │ thread tracking, │ │ │
│ │ │ initiative) │ │ │
│ │ └────────────────────┘ │ │
│ │ │ │ │
│ ┌────┴──────────────┴──────────────┴─────┐ ┌───────────────┐ │
│ │ LLM Gateway │ │ Memory │ │
│ │ (Tier 1: small/fast │ │ Service │ │
│ │ Tier 2: large/slow) │ │ (SQLite) │ │
│ └────────────────────────────────────────┘ └───────────────┘ │
└──────────────────────────────────────────────────────────────────┘
│ │ │
│ telnet │ telnet │ telnet
│ (dedicated │ (dedicated │ (dedicated
│ socket) │ socket) │ socket)
▼ ▼ ▼
┌──────────────────────────────────────────┐
│ MUD Server │
│ (routes text by room natively) │
└──────────────────────────────────────────┘
Telnet Pool — Manages one TCP connection per NPC to the MUD server. Handles telnet protocol negotiation (NAWS, GMCP, MSDP, MCCP where available). Buffers incoming text and performs line-level parsing, ANSI stripping, and prompt detection. Writes outgoing game commands from NPC actors to their sockets.
NPC Actors — One actor per NPC character. Contains the NPC's hierarchical state machine, personality definition, and local state. Receives stimuli directly from its own telnet connection, runs the classification pipeline, computes contention delays, and invokes LLM reasoning when its delay expires.
Coordination Bus — Tags outgoing NPC commands so that each NPC can recognize and filter its own echoed output in Tier 0. Tracks conversation threads for continuity scoring. Manages the initiative rotation. NPCs receive stimuli directly from the MUD server via their own telnet connections — the bus does not route game text.
Initiative System — Maintains a global round-robin rotation of all awake NPCs. Grants initiative turns to idle NPCs — those with no pending contention timer are eligible. Skips NPCs that are already responding to a stimulus. Manages the day cycle (detects when all NPCs have slept, triggers new day).
Memory Service — Manages persistent block-based memory storage backed by SQLite. Handles block CRUD operations, relationship links between blocks, and sleep-cycle memory consolidation. Serializes all writes to maintain ACID guarantees.
LLM Gateway — Abstracts LLM provider details. Exposes two tiers: a fast/cheap tier for stimulus classification (Tier 1) and a slow/capable tier for full NPC reasoning (Tier 2). Handles retries, timeouts, and rate limiting.
Every line of text received from the MUD passes through a three-tier classification pipeline before reaching the NPC's reasoning engine.
Tier 0 (rules): zero latency, zero cost. Pattern matching and state correlation.
Filters (discard completely):
- The NPC's own echoed output (identified via coordination bus tagging)
- Known system messages (MOTD, server broadcasts, login banners)
- Identical repeated ambient text within a configurable window
Routes (bypass further classification):
- GMCP/MSDP structured data → direct state updates (HP, room info, etc.)
- Text correlated to a pending command → ACTION_RESULT, returned to the originating tool call
Passes through everything else to Tier 1.
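The Tier 0 stage is pure rule evaluation with no model calls. A minimal sketch of the filter and route logic; the `Tier0Verdict` type, the echo-tag scheme, and the system-message patterns are all illustrative here, not a final API:

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Outcome of Tier 0 processing (names are illustrative, not final).
#[derive(Debug, PartialEq)]
enum Tier0Verdict {
    Discard,      // own echo, system noise, repeated ambient text
    ActionResult, // correlated to a pending command; returned to the tool call
    PassToTier1,  // everything else
}

struct Tier0 {
    /// Tags the coordination bus attached to our own outgoing commands.
    own_echo_tags: Vec<String>,
    /// Recently seen ambient lines, for duplicate suppression.
    recent: VecDeque<(String, Instant)>,
    dedup_window: Duration,
}

impl Tier0 {
    fn classify(&mut self, line: &str, pending_command: Option<&str>) -> Tier0Verdict {
        // 1. Filter our own echoed output, identified via coordination bus tags.
        if self.own_echo_tags.iter().any(|tag| line.contains(tag)) {
            return Tier0Verdict::Discard;
        }
        // 2. Filter known system messages (real patterns would come from config).
        if line.starts_with("Welcome to") || line.starts_with("[server]") {
            return Tier0Verdict::Discard;
        }
        // 3. Suppress identical ambient text repeated inside the dedup window.
        let now = Instant::now();
        self.recent.retain(|(_, t)| now.duration_since(*t) < self.dedup_window);
        if self.recent.iter().any(|(l, _)| l == line) {
            return Tier0Verdict::Discard;
        }
        self.recent.push_back((line.to_string(), now));
        // 4. Route text correlated to a pending command back to its originator.
        //    (A real correlation engine would be far less naive than substring match.)
        if let Some(cmd) = pending_command {
            if line.contains(cmd) {
                return Tier0Verdict::ActionResult;
            }
        }
        Tier0Verdict::PassToTier1
    }
}
```

The ordering matters: echo filtering runs first so an NPC's own commands never reach the dedup window or the correlation check.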
Tier 1 (classification): a small, fast language model (Haiku-class) or a fine-tuned local model. ~200ms latency, ~$0.001/call.
Receives the raw text plus the NPC's name and block index. Returns a structured classification:
Category: DIRECT_ADDRESS | ROOM_CONVERSATION | WORLD_EVENT |
COMBAT_EVENT | AMBIENT
Speaker: identified speaker name (if applicable)
Speaker is NPC: whether the speaker is a Mudguy-controlled NPC
Addresses: list of names being addressed
Urgency: 0.0-1.0
Relevance to me: 0.0-1.0
Topic tags: brief keywords
Memory worthy: boolean
Memory note: one-line summary for block update (if memory worthy)
Bid interest: 0.0-1.0
Bid reason: brief justification
Retrieve blocks: list of block IDs to pre-load for Tier 2
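On the Rust side, this classification could be carried as a plain struct. A sketch with illustrative field names only; the real schema, and its serde JSON mapping, belongs to the classification subsystem's design doc:

```rust
/// Tier 1 output for one NPC's view of one stimulus (illustrative schema).
#[derive(Debug, Clone, PartialEq)]
enum Category {
    DirectAddress,
    RoomConversation,
    WorldEvent,
    CombatEvent,
    Ambient,
}

#[derive(Debug, Clone)]
struct Classification {
    category: Category,
    speaker: Option<String>,      // identified speaker name, if any
    speaker_is_npc: bool,         // whether the speaker is Mudguy-controlled
    addresses: Vec<String>,       // names being addressed
    urgency: f32,                 // 0.0..=1.0
    relevance_to_me: f32,         // 0.0..=1.0
    topic_tags: Vec<String>,
    memory_worthy: bool,
    memory_note: Option<String>,  // one-line summary for block update
    bid_interest: f32,            // 0.0..=1.0
    bid_reason: String,
    retrieve_blocks: Vec<String>, // block IDs to pre-load for Tier 2
}
```

Keeping the schema flat and fully typed means Tier 0 code, the contention logic, and the memory service all consume the same structure without re-parsing model output.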
Tier 1 runs for every NPC that can perceive the stimulus, not just one. Each NPC gets its own classification with its own relevance and bid scores.
Tier 2 (reasoning): a large, capable language model (Opus/Sonnet-class). ~2-5s latency, ~$0.02-0.05/call.
Only invoked when:
- An NPC's contention delay expires (it is the most urgent responder)
- An NPC receives an initiative turn and has observations worth acting on
The Tier 2 prompt is assembled from: the NPC's personality and system prompt, the block index (always), pre-retrieved memory blocks (selected by Tier 1), pending observations since the last Tier 2 call, and the current stimulus or initiative prompt.
The model has access to both action tools (say, emote, whisper, move, pass) and memory tools (create_block, update_block, retrieve_block, link_blocks).
NPCs resolve who speaks — and when — through a distributed contention mechanism inspired by CSMA/CA (Carrier-Sense Multiple Access with Collision Avoidance). There is no central arbiter. Each NPC independently computes a response delay based on how urgently it wants to respond, then waits. The most urgent NPC naturally responds first.
When Tier 1 classifies a stimulus as worth responding to, the NPC computes a response delay — a wait duration before it begins Tier 2 reasoning. High urgency produces a short delay; low urgency produces a long delay. The NPC continues listening during the wait period.
If a new stimulus arrives before the delay expires (another NPC responded, the player said something else, a world event occurred), the NPC re-evaluates. The new stimulus may increase urgency (accelerating the response), decrease it (extending the delay), or resolve the situation entirely (canceling the response). This preemption loop continues until either the delay expires and Tier 2 begins, or the NPC decides to pass.
There is no hard cap on conversation length. The fatigue and sleep mechanics naturally limit how long any NPC can sustain activity, preventing runaway scenes without artificial cutoffs.
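The preemption loop can be sketched as a deterministic simulation. The real implementation would live in an async task using tokio timers, but the re-evaluation logic itself is pure; all names here are illustrative:

```rust
/// How a newly arrived stimulus affects a pending response (illustrative).
#[derive(Debug, Clone, Copy)]
enum Reeval {
    Accelerate(f32), // multiply remaining delay by a factor < 1.0
    Extend(f32),     // multiply remaining delay by a factor > 1.0
    Cancel,          // situation resolved; the NPC passes
}

/// Simulate the wait period. `events` are (arrival_time, re-evaluation) pairs
/// in chronological order. Returns Some(fire_time) if Tier 2 reasoning begins,
/// or None if the NPC decides to pass.
fn run_contention(initial_delay: f32, events: &[(f32, Reeval)]) -> Option<f32> {
    let mut fire_at = initial_delay;
    for &(t, reeval) in events {
        if t >= fire_at {
            break; // delay expired before this event arrived: Tier 2 already began
        }
        let remaining = fire_at - t;
        match reeval {
            Reeval::Accelerate(f) | Reeval::Extend(f) => fire_at = t + remaining * f,
            Reeval::Cancel => return None,
        }
    }
    Some(fire_at)
}
```

Because the logic is pure, it can be property-tested exhaustively; the async wrapper only contributes the timer.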
The response delay is derived from a bid score — a weighted sum of deterministic factors computed locally with no model calls.
Hard factors (highest weight):
- Direct address to this NPC (near-zero delay)
- Role relevance (blacksmith bids high on weapon talk)
- Conversation continuity (NPC was already in this conversation thread)
Soft factors:
- Recency penalty (spoke recently → longer delay)
- Fatigue multiplier (tired NPCs wait longer)
- Budget remaining (per-NPC rate limit)
The bid score maps to a delay through a configurable curve. The delay includes a small random jitter to prevent ties when two NPCs compute similar urgency.
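One possible shape for that curve, sketched as a pure function. The exponent, bounds, and jitter source are tuning decisions for the coordination subsystem's design doc, not fixed choices:

```rust
/// Map a bid score in [0.0, 1.0] to a response delay in seconds.
/// Higher bids produce shorter delays; the exponent shapes the falloff.
fn bid_to_delay(bid: f32, min_delay: f32, max_delay: f32, jitter: f32) -> f32 {
    let bid = bid.clamp(0.0, 1.0);
    // bid 1.0 -> min_delay, bid 0.0 -> max_delay, sublinear in between.
    let base = max_delay - (max_delay - min_delay) * bid.powf(0.5);
    // Small random jitter breaks ties when two NPCs compute similar urgency.
    // (A real implementation would draw this from an RNG per evaluation.)
    base + jitter
}
```

With these placeholder parameters, a direct address (bid near 1.0) waits roughly `min_delay`, while a marginally interested bystander waits close to `max_delay`.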
An NPC that has been participating in a conversation thread receives a continuity bonus (shorter delay) for follow-up messages in that thread. Threads are identified by speaker + time proximity + topic continuity. This naturally keeps one NPC engaged in a conversation rather than having different NPCs jump in uninvited.
Direct address always overrides thread continuity — if a player turns to address a different NPC, that NPC's delay drops to near-zero regardless of who has been carrying the conversation.
Without initiative, a room full of NPCs would stand in silence until a player arrives. The initiative system gives each NPC periodic opportunities to act unprompted — starting conversations, performing activities, reacting to accumulated observations.
A global round-robin cycles through all awake NPCs. The rotation is not per-room — the entire world shares one initiative clock. This prevents rooms with many NPCs from generating disproportionate activity.
NPCs with a pending contention timer are skipped — they are already going to act in response to a stimulus and do not need an initiative nudge. Initiative exists only for idle NPCs.
On each tick:
- Advance to the next awake, idle NPC (no pending contention timer)
- The NPC receives an initiative event
- The NPC decides whether to act or pass
- Initiative advances to the next eligible NPC
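The tick above can be sketched as a cursor over the NPC roster; `eligible` stands in for the awake-and-idle check (no pending contention timer), and all names are illustrative:

```rust
/// Minimal initiative rotation sketch (NPC ids and eligibility are illustrative).
struct Initiative {
    roster: Vec<u32>, // all NPC ids, in rotation order
    cursor: usize,
}

impl Initiative {
    /// Advance to the next awake, idle NPC and grant it initiative.
    /// Returns None if no NPC is eligible this tick (all busy or asleep),
    /// which is also the condition the day-cycle logic would watch for.
    fn next_turn(&mut self, eligible: impl Fn(u32) -> bool) -> Option<u32> {
        for _ in 0..self.roster.len() {
            let id = self.roster[self.cursor];
            self.cursor = (self.cursor + 1) % self.roster.len();
            if eligible(id) {
                return Some(id); // this NPC receives an initiative event
            }
        }
        None
    }
}
```

Scanning at most one full lap per tick keeps the rotation fair: an ineligible NPC is skipped, not starved, because the cursor still advances past it.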
When an NPC gets its turn, the Tier 2 model receives the NPC's current state, accumulated observations since its last turn, pending memory updates for review, and room context (who is present, what has been happening). The model can act (say, emote, move), manage memories (update blocks, create links), or pass.
Players always have initiative — their actions enter the system as stimuli at any time, regardless of where the initiative rotation currently sits. Player stimuli receive a priority boost in bid scoring, ensuring NPCs are responsive to players even during NPC-NPC scenes.
Each NPC maintains a persistent, structured memory organized into named blocks. A block is a unit of knowledge about a person, place, thing, event, or concept. The NPC always has access to a compact index of all its blocks (summaries only), and can retrieve full block content on demand.
| Type | Represents | Examples |
|---|---|---|
| SELF | The NPC's own identity, goals, state | "My personality", "My current mood" |
| PERSON | A known character | "Grok the Mercenary", "A mysterious stranger" |
| PLACE | A location | "The Rusty Tankard", "Northfield village" |
| THING | An object or item of significance | "Grok's debt to me", "The starlight salve" |
| EVENT | Something that happened | "The dragon attack on Northfield" |
| CONCEPT | Abstract knowledge | "The Merchant Guild dispute", "How healing magic works" |
Each block contains:
- Header: type, name, creation time, last updated, access count, tags, one-line summary
- Content: structured free-text with key facts, relationship notes, and recent interactions
- Open threads: unresolved questions, pending actions, or items of interest related to this block
- Links: named relationships to other blocks (e.g., "participant in", "located at", "friend of")
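In Rust, a block might be represented roughly as follows. Field names and types are illustrative; the actual schema is a Memory Service design decision:

```rust
use std::collections::HashMap;

/// The six block types from the table above.
#[derive(Debug, Clone, PartialEq)]
enum BlockType { SelfBlock, Person, Place, Thing, Event, Concept }

/// Header: the part of a block that lives in the always-loaded index.
#[derive(Debug, Clone)]
struct BlockHeader {
    block_type: BlockType,
    name: String,
    created_at: u64,   // unix seconds (representation is a design-doc choice)
    updated_at: u64,
    access_count: u32,
    tags: Vec<String>,
    summary: String,   // one line; this is what Tier 2 always sees
}

/// Full block: loaded on demand, pre-retrieved by Tier 1 or requested by Tier 2.
#[derive(Debug, Clone)]
struct Block {
    header: BlockHeader,
    content: String,                     // structured free-text
    open_threads: Vec<String>,           // unresolved questions, pending actions
    links: HashMap<String, Vec<String>>, // relation name -> target block names
}
```

Separating `BlockHeader` from `Block` makes the "index always present, content on demand" split explicit in the type system.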
The block index (all headers) is always present in the NPC's context. Full block content is loaded on demand, either pre-retrieved by Tier 1 or explicitly requested by Tier 2.
Create — First encounter with a new entity. Tier 2 creates a block with initial observations.
Update — New information is learned. Two paths:
- Tier 1 (cheap): appends a one-line note to the block's recent interactions log. No Tier 2 call needed. Used for ambient observations.
- Tier 2 (thorough): restructures the block content, updates key facts, adjusts relationship notes. Used during active reasoning or consolidation.
Link — Tier 2 discovers a relationship between two blocks and creates a named edge.
Merge — Two blocks are discovered to refer to the same entity. Content is combined.
Archive — Block hasn't been accessed in a long time. Moves to cold storage (index entry preserved with summary, full content requires explicit retrieval).
When Tier 1 identifies a block as relevant, it also follows relationship links to pre-retrieve associated blocks (up to a configurable depth, default 1). If a stimulus mentions the dragon bounty, the system also loads the blocks for Lady Vastra and the dragon attack event — things the NPC would naturally associate.
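Transitive pre-retrieval is a bounded breadth-first walk over block links. A sketch, with the link graph modeled as a plain adjacency map (block names stand in for whatever ID scheme the Memory Service settles on):

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Follow block links breadth-first up to `max_depth` hops from the seeds.
/// Returns every block id to pre-load for Tier 2 (seeds included).
fn transitive_retrieve(
    links: &HashMap<String, Vec<String>>,
    seeds: &[String],
    max_depth: usize,
) -> HashSet<String> {
    let mut seen: HashSet<String> = seeds.iter().cloned().collect();
    let mut queue: VecDeque<(String, usize)> =
        seeds.iter().map(|s| (s.clone(), 0)).collect();
    while let Some((id, depth)) = queue.pop_front() {
        if depth == max_depth {
            continue; // depth budget exhausted along this path
        }
        for next in links.get(&id).into_iter().flatten() {
            if seen.insert(next.clone()) {
                queue.push_back((next.clone(), depth + 1));
            }
        }
    }
    seen
}
```

At the default depth of 1, the dragon-bounty example pulls in the blocks directly linked to it and nothing further.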
Every NPC observes every stimulus in its room, regardless of whether it bids to respond. Tier 1 evaluates memory_worthy for all NPCs independently. An NPC sitting silently in a tavern still accumulates rich memories of the conversations happening around it. This is the primary memory formation path — most observations are recorded without any visible NPC action.
LLM-based NPCs accumulate conversation history in their context window. Context is finite. Rather than silently degrading as context fills, the system maps this constraint to a narrative mechanic: NPCs get tired and go to sleep.
NPC behavior degrades gracefully as context fills:
| Context Usage | Behavior |
|---|---|
| 0-50% | Alert. Full engagement. Initiates conversations. Curious. |
| 50-70% | Normal. Responds readily, initiates less. |
| 70-85% | Tired. Shorter responses. Wraps up conversations. Reduced bids. |
| 85-95% | Drowsy. Only responds to direct address. Auto-passes initiative. |
| 95%+ | Must sleep immediately. |
Fatigue is enforced at the system level through bid multipliers and initiative auto-passing. Narrative hints are injected into the system prompt so the LLM naturally adjusts its tone ("You're feeling tired...").
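One way to express the table as system-level enforcement is a single lookup from context usage to effects. The thresholds are taken from the table; the multiplier values are placeholders to be tuned, and the direct-address exception at the drowsy tier is an assumption about how the bid pipeline would apply them:

```rust
/// System-level fatigue effects derived from context usage (0.0..=1.0).
/// Returns (bid multiplier, auto-pass initiative?, must sleep now?).
/// Note: at the drowsy tier the multiplier is 0.0, on the assumption that
/// direct address bypasses the multiplier entirely rather than scaling it.
fn fatigue_effects(context_usage: f32) -> (f32, bool, bool) {
    match context_usage {
        u if u >= 0.95 => (0.0, true, true),   // must sleep immediately
        u if u >= 0.85 => (0.0, true, false),  // drowsy: direct address only
        u if u >= 0.70 => (0.5, false, false), // tired: reduced bids
        u if u >= 0.50 => (0.9, false, false), // normal: initiates less
        _ => (1.0, false, false),              // alert: full engagement
    }
}
```

The narrative hints injected into the system prompt are a separate concern; this function covers only the mechanical side (bids and initiative).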
When an NPC sleeps, a three-phase process runs:
Phase 1 — Narrative Exit. One final Tier 2 call with the instruction to say goodbye and leave naturally. The NPC's departure is visible to other characters in the room.
Phase 2 — Memory Consolidation. A dedicated Tier 2 call processes the full conversation history into block updates: new blocks created, existing blocks updated, new links formed, a day summary event block created, and the self/state block updated with mood and unresolved goals.
Phase 3 — Context Wipe. Conversation history is cleared. The NPC enters the Dormant state and is removed from the initiative rotation.
NPCs sleep at different times based on their activity level. Busy NPCs (bartenders in crowded taverns) exhaust context quickly and sleep first. Quiet observers last much longer. This creates a natural staggered rhythm — the world gradually goes quiet as NPCs retire one by one.
When the last NPC sleeps, a new day begins. All NPCs wake with fresh context, their consolidated memory blocks, and their previous day's summary. The initiative rotation resumes.
Options for handling player interaction while NPCs sleep (to be determined in detailed design):
- Narrative framing: sleeping NPCs are simply unavailable. Players experience a natural world cycle.
- Emergency wake: direct player interaction can trigger an abbreviated wake with partial consolidation.
- Watchman pattern: a designated low-engagement NPC stays awake to handle basic player needs.
Rust. Chosen for:
- Strong type system enforces state machine correctness at compile time
- Async/await with Tokio handles concurrent NPC connections cleanly
- Ownership model naturally prevents shared mutable state bugs
- Performance headroom for Tier 0 rule evaluation and stream parsing
- Single binary deployment
Actor Model — Each NPC is an actor with isolated state and a message mailbox. Actors communicate exclusively through the coordination bus. One actor crashing does not affect others.
Hierarchical State Machine — Each NPC actor contains an HSM managing its lifecycle: Awake (Idle, Classifying, Bidding, Deliberating, Acting, Tired) → Sleeping (Consolidating, Dormant). State transitions are enforced at compile time.
Mediator / Event Bus — The coordination bus is a mediator that owns references to all NPC mailboxes. NPCs never reference each other. All inter-NPC coordination flows through typed channels on the bus.
| Concern | Crate | Role |
|---|---|---|
| Async runtime | tokio | Tasks, select!, timers |
| Channels | flume | Fast MPMC for the coordination bus |
| Actor framework | kameo | NPC actor lifecycle (or raw tokio::spawn) |
| State machines | statig | Hierarchical NPC state |
| Database | sqlx + sqlite | Persistent memory block storage |
| Telnet | tokio::net + libmudtelnet | MUD connections and protocol negotiation |
| HTTP client | reqwest | LLM API calls |
| Serialization | serde + serde_json | LLM payloads, block content |
| Logging | tracing | Structured observability |
| Error handling | anyhow | Ergonomic error propagation |
| Property testing | proptest | Property-based test generation and shrinking |
| Mutation testing | cargo-mutants | Test quality verification via fault injection |
Each NPC processes its own telnet stream independently. The MUD server handles room-scoped delivery — NPCs only receive text relevant to their location.
MUD Server
│ │ │
│ telnet │ telnet │ telnet
│ (socket A) │ (socket B) │ (socket C)
▼ ▼ ▼
NPC Actor A NPC Actor B NPC Actor C
│ │ │
Tier 0 Tier 0 Tier 0
(rules) (rules) (rules)
│ │ │
Tier 1 Tier 1 Tier 1
(Haiku) (Haiku) (Haiku)
│ │ │
delay: 0.2s delay: 1.5s pass
│ │
┌──┘ │
│ (delay expires │ (new stimulus arrives
│ first) │ before delay expires →
▼ │ re-evaluate)
Tier 2 │
(Opus) │
│ │
action: say "..." │
│ │
Coordination Bus ─┘
(tags as NPC output)
│
Telnet Pool
│
▼
MUD Server ──────────▶ NPC B & C receive the
speech on their own sockets
Mudguy is an agent-driven project. AI coding agents will participate in every phase of development — writing implementations, refactoring subsystems, and evolving the codebase over time. This changes the relationship between code and tests fundamentally.
Human developers build intuition about their code. They notice when something "feels wrong" even if tests pass. Agents have no such intuition. An agent that sees green tests assumes correctness. If the tests are shallow — if they are the moral equivalent of `assert 1 == 1` — the agent will confidently build on a rotten foundation.
Every subsystem must be accompanied by tests that formally verify its behavior. Tests are not an afterthought or a checklist item. They are the specification that agents use to reason about correctness. Weak tests are worse than no tests, because they create false confidence.
The project enforces three complementary testing strategies:
Layer 1: Property-Based Testing (proptest)
Unit tests verify specific examples. Property tests verify invariants across entire input spaces. For every deterministic subsystem, tests must express the properties that hold for all valid inputs, not just the handful of cases a developer thought of.
Examples of properties in Mudguy:
- Bid scores are always in [0.0, 1.0] regardless of input.
- Serializing a memory block and deserializing it produces an identical block.
- The initiative rotation visits every awake NPC exactly once per cycle.
- Classification always produces exactly one category from the valid set.
- Tier 0 rules always fire before Tier 1 for any input.
proptest generates hundreds of random inputs per test, finds counterexamples, and automatically shrinks failures to minimal reproductions. This catches edge cases that example-based tests miss.
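As an illustration of the first property: a hypothetical `bid_score` function (the weights are placeholders) and a property check over randomized inputs. The real suite would express this as a `proptest!` block; a tiny hand-rolled generator stands in here so the sketch is self-contained:

```rust
/// Hypothetical bid scoring function; the weights are placeholders.
fn bid_score(urgency: f32, relevance: f32, recency_penalty: f32) -> f32 {
    let raw = 0.5 * urgency + 0.4 * relevance - 0.3 * recency_penalty;
    raw.clamp(0.0, 1.0)
}

/// Property: bid scores stay in [0.0, 1.0] for arbitrary finite inputs,
/// including deliberately out-of-range ones. proptest would generate and
/// shrink these cases automatically; a small LCG stands in here.
fn check_bid_score_bounds(cases: u32) -> bool {
    let mut state: u64 = 0x2545_F491;
    let mut next = move || {
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        // Map to roughly [-10.0, 10.0) to probe out-of-range inputs.
        ((state >> 33) as f32 / (1u64 << 31) as f32) * 20.0 - 10.0
    };
    (0..cases).all(|_| {
        let s = bid_score(next(), next(), next());
        (0.0..=1.0).contains(&s)
    })
}
```

The point of the property formulation is that it survives refactors: however `bid_score` is reweighted later, the bound must still hold for every input.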
Layer 2: Mutation Testing (cargo-mutants)
Even well-written tests can be accidentally weak — executing code paths without actually verifying their results. Mutation testing answers the question: "If I introduced a bug here, would any test catch it?"
cargo-mutants systematically injects faults (flipping operators, replacing return values, removing match arms) and runs the test suite against each mutation. A "surviving" mutation is a test gap — code that can be broken without any test noticing.
Mutation testing is run periodically (not on every commit) to audit test quality across the codebase. Surviving mutants are treated as bugs in the test suite, not in the implementation.
Layer 3: Mock Telnet Integration Tests
The mock telnet server (see Design Decision 7) replays scripted MUD scenarios end-to-end. Integration tests assert on NPC responses, bid outcomes, memory operations, and state transitions. These validate that the full pipeline — from raw telnet bytes to NPC actions — behaves correctly as a system, not just in isolation.
Each subsystem design document must include a Verification Strategy section that specifies:
- Properties — What invariants must hold? These become proptest properties.
- Boundaries — What are the edge cases? These become targeted unit tests.
- Integration points — How does this subsystem interact with others? These become integration tests using the mock telnet server or test harnesses.
- Mutation coverage targets — Which functions are critical enough that no mutation should survive?
Subsystems with non-deterministic behavior (LLM calls) must isolate the non-deterministic boundary behind a trait, so that all logic around it is tested against deterministic mocks.
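A sketch of that boundary; the trait name, methods, and string-based signatures are illustrative only:

```rust
/// The non-deterministic boundary: everything behind this trait is mocked in tests.
/// (The real gateway would use structured request/response types, not strings.)
trait LlmGateway {
    fn classify(&self, npc: &str, line: &str) -> String; // Tier 1
    fn reason(&self, prompt: &str) -> String;            // Tier 2
}

/// Deterministic mock used by the test suite: canned responses, no network.
struct MockGateway;

impl LlmGateway for MockGateway {
    fn classify(&self, _npc: &str, line: &str) -> String {
        // Trivial canned rule, enough to drive the logic under test.
        if line.contains("you") { "DIRECT_ADDRESS".into() } else { "AMBIENT".into() }
    }
    fn reason(&self, _prompt: &str) -> String {
        "pass".into()
    }
}

/// Logic under test depends only on the trait, never on a live provider.
fn should_bid(gateway: &dyn LlmGateway, npc: &str, line: &str) -> bool {
    gateway.classify(npc, line) == "DIRECT_ADDRESS"
}
```

In production the same `should_bid` runs against the real gateway; in tests it runs against `MockGateway`, so every surrounding code path is fully deterministic.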
A subsystem is not complete until:
- Property tests cover all stated invariants.
- Mutation testing shows no surviving mutants in core logic.
- Integration tests demonstrate correct behavior in the context of the full pipeline.
If an agent cannot verify its work against meaningful tests, the work is not done.
The following subsystems will each receive their own detailed design document before implementation begins.
Telnet connection management, protocol negotiation (GMCP, MSDP, MCCP, MXP), line buffering, ANSI stripping, prompt detection, and command-response correlation. The correlation engine that matches outgoing commands to incoming response text.
Tier 0 rule engine, Tier 1 prompt design and response parsing, block pre-retrieval logic, memory-worthy evaluation. Includes the Tier 1 model selection strategy (hosted small model vs. local fine-tuned model).
NPC output tagging (echo filtering), conversation thread tracking, contention delay computation and tuning (bid-to-delay curve), preemption logic (new stimulus during wait period), rate limiting, initiative coordination.
Initiative rotation algorithm, idle-NPC eligibility (skip NPCs with pending contention), day cycle state machine, fatigue curve enforcement, sleep triggers, new-day synchronization, player interaction policy during sleep.
Actor lifecycle, HSM state definitions and transitions, Tier 2 prompt assembly pipeline (how personality + block index + pre-retrieved blocks + observations + stimulus are composed), tool definitions and execution, fatigue hint injection.
Block schema and storage (SQLite table design), block index maintenance, CRUD operations, relationship links and graph queries, transitive retrieval, Tier 1 lightweight updates vs. Tier 2 structured updates, merge and archive operations.
Consolidation prompt design, day summary generation, self/state update logic, relationship evolution during consolidation, consolidation quality (what gets preserved vs. lost), wake-up prompt assembly.
Provider abstraction, Tier 1 and Tier 2 model configuration, tool-use serialization and response parsing, retry and timeout policies, token counting and budget tracking, cost monitoring.
Personality definition format, relationship initialization, starting memory blocks, role-based bid weights, MUD-specific calibration (login sequences, prompt patterns, command vocabulary).
These questions were originally deferred as open questions and have been resolved.
1. Tier 1 model: Hosted Haiku only. Both tiers use the Anthropic API (Haiku for Tier 1, Opus for Tier 2). At ~10K classification calls/day, hosted Haiku costs ~$90/month — significantly cheaper than provisioning local inference hardware. The LLM Gateway abstraction allows swapping to a local backend later if economics change.
2. MUD-specific calibration: Minimal config + LLM parsing. Each MUD requires a small TOML configuration file (connection details, basic command vocabulary like `look`, `say`, `go`). The LLM handles parsing unstructured MUD output — no per-MUD adapter code needed. This maximizes portability across MUD servers.
3. Player identity: No distinction. The system does not attempt to classify entities as players vs. server NPCs. All non-self entities are treated equally. NPCs respond based on stimulus content and context, not identity type. This is simpler, more universal, and avoids unreliable heuristics.
4. Room awareness: Socket-scoped stimuli. Each NPC receives only its own telnet stream — the MUD server handles room-scoped delivery natively. Mudguy does not track which room an NPC is in; the MUD server is the sole source of truth for room membership. This eliminates an entire class of synchronization bugs and removes complexity from the coordination bus.
5. Graceful degradation: Go silent. When the LLM API is unavailable, NPCs become passive — no bids, no initiative actions. They remain present in the room but do not respond. Normal behavior resumes when API connectivity is restored. No scripted fallback or narrative excuse needed.
6. Observability: Structured JSON logging. All system events (stimuli, classifications, bids, responses, memory operations, token costs) are logged as structured JSON. No built-in dashboard or metrics export in v1 — operators use external log analysis tools. Dashboard and Prometheus/OpenTelemetry export are future work.
7. Testing: Mock telnet server. A lightweight mock MUD server replays scripted scenarios for testing. Tests assert on NPC responses, bid outcomes, and memory operations. The mock server doubles as a development tool for iterating on NPC behavior without a live MUD connection. See section 9 (Testing Philosophy) for the broader verification strategy including property-based and mutation testing.
8. Cost model: Per-NPC tracking with budget caps. Token usage is tracked per NPC per day. Operators set a daily cost ceiling for each NPC. When an NPC's budget is exhausted, it triggers the sleep mechanic — a natural narrative integration of cost control. Estimated steady-state cost: Tier 1 (Haiku) ~$0.001/classification, Tier 2 (Opus) ~$0.02–0.05/response. Active NPC in a busy room: ~$2–5/day depending on response frequency.
9. Emergency wake: Tier 1-gated. When a sleeping NPC receives a stimulus, Tier 1 classifies whether it is significant enough to warrant waking (e.g., being directly addressed vs. ambient noise). If the stimulus is significant, the NPC wakes with fresh context and a "just woken" narrative prompt — it acts tired, disoriented, and groggy. This provides responsiveness for meaningful interactions while preserving the sleep mechanic for trivial ones.
10. Multi-MUD support: One MUD per instance. Each Mudguy process connects to a single MUD server. Running NPCs on multiple MUDs requires multiple Mudguy instances. This keeps process isolation simple and avoids cross-MUD coordination complexity.
The system is successful when:
- A player can walk into a tavern and find NPCs already engaged in conversation that references events from previous days.
- An NPC remembers a player from yesterday and references their shared history.
- Multiple NPCs in the same room maintain distinct personalities and don't all respond to the same stimulus.
- The world continues to evolve when no players are present.
- An operator can point Mudguy at a new MUD server and have NPCs functional with minimal configuration.
- The system runs sustainably — cost per NPC-hour is predictable and manageable.