This document contains user stories for implementing the Raft consensus algorithm in Rust. Raft is a distributed consensus protocol that enables a cluster of nodes to agree on a sequence of state transitions and maintain a replicated state machine across multiple servers, even in the presence of failures.
- Distributed Systems Engineer (DSE): Needs a correct, performant Raft implementation to build distributed systems
- Application Developer (AD): Uses Raft as a library to add consensus features to their application
- Operations Engineer (OE): Deploys, monitors, and maintains Raft clusters in production
- System Architect (SA): Designs distributed systems requiring consensus and fault tolerance
- DSE: Requires safety guarantees, performance, and comprehensive observability
- AD: Needs simple APIs, clear documentation, and reliable behavior
- OE: Requires monitoring capabilities, operational tools, and debuggability
- SA: Needs proven correctness, scalability characteristics, and deployment flexibility
The stories are organized into 8 major phases:
- Foundation/Infrastructure (F-series): Establishes project structure, core types, and abstractions
- Core Raft Algorithm (C-series): Implements leader election, log replication, and commit rules
- Persistence (P-series): Adds durable storage for critical state
- Network/RPC (N-series): Implements communication between nodes
- State Machine Integration (S-series): Connects Raft to application state machines
- Cluster Membership (M-series): Enables dynamic cluster configuration changes
- Observability/Monitoring (O-series): Adds metrics, logging, and health checks
- Testing and Validation (T-series): Comprehensive testing strategies
This sequencing enables incremental development where each story builds upon previous ones while delivering value independently.
As a Distributed Systems Engineer I want a properly initialized Rust project with standard tooling So that I have a solid foundation for implementing Raft with good development practices
Acceptance Criteria:
- Cargo project created with appropriate metadata (name, version, authors)
- Project includes standard directories (src, tests, examples, benches)
- Basic .gitignore file excludes build artifacts and IDE files
- README file describes the project purpose
- License file is present (MIT or Apache 2.0)
- Cargo.toml includes workspace configuration if needed
- Project compiles successfully with cargo build
- Basic CI configuration file is present (GitHub Actions or similar)
Definition of Done:
- Developer can clone the repository and build successfully
- All standard Rust tooling works (cargo build, cargo test, cargo clippy)
- Project structure follows Rust community conventions
As a Distributed Systems Engineer I want well-defined types for basic Raft concepts So that I can represent Raft state with type safety and clarity
Acceptance Criteria:
- NodeId type defined (wraps unique node identifier)
- Term type defined (monotonically increasing term number)
- LogIndex type defined (position in the log, starting from 1)
- ServerState enum defined (Follower, Candidate, Leader)
- Types implement required traits (Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord as appropriate)
- Types have clear documentation comments explaining their purpose
- Types use Rust's type system to prevent invalid states (e.g., Term cannot be negative)
- Module structure organizes types logically
Definition of Done:
- All types compile without warnings
- Documentation is clear and accurate
- Types can be used in subsequent stories
- Unit tests verify type behavior (ordering, equality)
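A minimal sketch of how these core types might be defined, assuming the serde crate (with its derive feature) so the same types can be reused by later serialization stories; the u64 representations are illustrative rather than required:

```rust
use serde::{Deserialize, Serialize};

/// Unique identifier for a node in the cluster.
#[derive(Clone, Copy, Debug, Hash, PartialEq, Eq, PartialOrd, Ord, Serialize, Deserialize)]
pub struct NodeId(pub u64);

/// Monotonically increasing election term; an unsigned integer makes
/// negative terms unrepresentable.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq, PartialOrd, Ord, Serialize, Deserialize)]
pub struct Term(pub u64);

/// Position in the replicated log; 0 is reserved for "no entry",
/// real entries start at index 1.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq, PartialOrd, Ord, Serialize, Deserialize)]
pub struct LogIndex(pub u64);

/// The three roles a Raft server can take.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ServerState {
    Follower,
    Candidate,
    Leader,
}
```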
As a Distributed Systems Engineer I want a type representing a single log entry So that I can store commands with their associated metadata
Acceptance Criteria:
- LogEntry struct contains command data (generic over command type)
- LogEntry includes term number when entry was created
- LogEntry includes index position in the log
- LogEntry supports serialization and deserialization
- LogEntry implements Clone, Debug traits
- Clear documentation explains each field's purpose
- Type supports generic command payloads (using generics or trait objects)
- LogEntry is immutable once created
Definition of Done:
- LogEntry can be created and its fields accessed
- LogEntry can be serialized to bytes and deserialized back
- Type supports various command types
- Documentation explains usage patterns
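One possible shape for the entry type, reusing the Term and LogIndex newtypes sketched above and keeping fields private so entries are effectively immutable after construction:

```rust
use serde::{Deserialize, Serialize};

/// One replicated log entry, generic over the application's command type.
/// Fields are private, so an entry cannot be mutated after construction.
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct LogEntry<C> {
    term: Term,
    index: LogIndex,
    command: C,
}

impl<C> LogEntry<C> {
    pub fn new(term: Term, index: LogIndex, command: C) -> Self {
        Self { term, index, command }
    }

    pub fn term(&self) -> Term { self.term }
    pub fn index(&self) -> LogIndex { self.index }
    pub fn command(&self) -> &C { &self.command }
}
```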
As a Distributed Systems Engineer I want a type representing state that must survive crashes So that I can track what needs to be persisted for safety
Acceptance Criteria:
- PersistentState struct includes current term
- PersistentState includes votedFor (Option)
- PersistentState includes log entries (Vec)
- Type enforces invariants (e.g., term never decreases)
- Type supports serialization for storage
- Clear documentation explains why each field must be persistent
- Type provides builder pattern or constructor for initialization
- Type implements Clone, Debug traits
Definition of Done:
- PersistentState can be created and modified
- All fields are accessible via appropriate methods
- Type can be serialized and deserialized
- Documentation is complete and accurate
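A hedged sketch of the persistent state, building on the earlier type sketches; the update_term helper shows one way to enforce the term-never-decreases invariant:

```rust
/// State that must survive crashes (Figure 2 of the Raft paper). Losing any
/// of these fields after acknowledging an RPC can violate safety.
#[derive(Clone, Debug, serde::Serialize, serde::Deserialize)]
pub struct PersistentState<C> {
    current_term: Term,
    voted_for: Option<NodeId>,
    log: Vec<LogEntry<C>>,
}

impl<C> PersistentState<C> {
    pub fn new() -> Self {
        Self { current_term: Term(0), voted_for: None, log: Vec::new() }
    }

    pub fn current_term(&self) -> Term { self.current_term }
    pub fn voted_for(&self) -> Option<NodeId> { self.voted_for }
    pub fn log(&self) -> &[LogEntry<C>] { &self.log }

    /// Enforces the "term never decreases" invariant; adopting a newer term
    /// also clears the vote cast in the previous term.
    pub fn update_term(&mut self, term: Term) {
        assert!(term >= self.current_term, "term must never decrease");
        if term > self.current_term {
            self.current_term = term;
            self.voted_for = None;
        }
    }
}
```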
As a Distributed Systems Engineer I want types for volatile state (not persisted) So that I can track runtime state that can be reconstructed after restart
Acceptance Criteria:
- VolatileState struct includes commitIndex (highest known committed entry)
- VolatileState includes lastApplied (highest entry applied to state machine)
- LeaderState struct includes nextIndex map (per-follower tracking)
- LeaderState includes matchIndex map (per-follower tracking)
- Types implement Default for initialization
- Clear documentation explains each field's purpose and when it's used
- Types support efficient updates and queries
- Types implement Clone, Debug traits
Definition of Done:
- VolatileState and LeaderState can be created and modified
- Fields can be updated atomically where needed
- Types integrate cleanly with persistent state
- Documentation explains relationship to persistent state
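A possible layout for the volatile and leader-only state, with the leader-state constructor reflecting the initialization rule from the Raft paper (nextIndex = last log index + 1, matchIndex = 0):

```rust
use std::collections::HashMap;

/// Volatile state kept by every server; reconstructed after restart.
#[derive(Clone, Debug, Default)]
pub struct VolatileState {
    /// Highest log index known to be committed.
    pub commit_index: LogIndex,
    /// Highest log index applied to the state machine.
    pub last_applied: LogIndex,
}

/// Volatile state kept only while this node is leader.
#[derive(Clone, Debug, Default)]
pub struct LeaderState {
    /// For each follower, the next log index to send.
    pub next_index: HashMap<NodeId, LogIndex>,
    /// For each follower, the highest index known to be replicated.
    pub match_index: HashMap<NodeId, LogIndex>,
}

impl LeaderState {
    /// Built when a candidate wins an election.
    pub fn new(peers: &[NodeId], last_log_index: LogIndex) -> Self {
        let next = LogIndex(last_log_index.0 + 1);
        Self {
            next_index: peers.iter().map(|&p| (p, next)).collect(),
            match_index: peers.iter().map(|&p| (p, LogIndex(0))).collect(),
        }
    }
}
```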
As a Distributed Systems Engineer I want types for all Raft RPC messages So that I can communicate between nodes with type safety
Acceptance Criteria:
- RequestVoteRequest struct includes term, candidateId, lastLogIndex, lastLogTerm
- RequestVoteResponse struct includes term, voteGranted
- AppendEntriesRequest struct includes term, leaderId, prevLogIndex, prevLogTerm, entries, leaderCommit
- AppendEntriesResponse struct includes term, success, matchIndex
- All messages support serialization/deserialization
- Messages implement Clone, Debug traits
- Clear documentation explains each field per Raft specification
- Types use previously defined types (Term, NodeId, LogIndex, LogEntry)
Definition of Done:
- All message types compile and can be instantiated
- Messages can be serialized to bytes and deserialized
- Documentation matches Raft paper specification
- Message types are used in RPC layer stories
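The message structs could mirror Figure 2 of the Raft paper directly; the snake_case field names and serde derives are assumptions of this sketch:

```rust
use serde::{Deserialize, Serialize};

/// Invoked by candidates to gather votes (Raft §5.2).
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct RequestVoteRequest {
    pub term: Term,
    pub candidate_id: NodeId,
    pub last_log_index: LogIndex,
    pub last_log_term: Term,
}

#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct RequestVoteResponse {
    pub term: Term,
    pub vote_granted: bool,
}

/// Invoked by the leader to replicate entries; empty `entries` is a heartbeat (Raft §5.3).
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct AppendEntriesRequest<C> {
    pub term: Term,
    pub leader_id: NodeId,
    pub prev_log_index: LogIndex,
    pub prev_log_term: Term,
    pub entries: Vec<LogEntry<C>>,
    pub leader_commit: LogIndex,
}

#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct AppendEntriesResponse {
    pub term: Term,
    pub success: bool,
    /// Highest index the follower has matched; helps the leader update nextIndex quickly.
    pub match_index: LogIndex,
}
```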
As a Distributed Systems Engineer I want a trait defining the storage interface So that I can swap storage implementations without changing core Raft logic
Acceptance Criteria:
- Storage trait defines methods for reading/writing term
- Storage trait defines methods for reading/writing votedFor
- Storage trait defines methods for appending log entries
- Storage trait defines methods for reading log entries by index
- Storage trait defines method for reading last log entry
- All methods return Result types for error handling
- Trait is async-compatible (returns Future or uses async_trait)
- Clear documentation explains storage contract and durability requirements
- Trait includes method for atomic multi-field updates
Definition of Done:
- Storage trait compiles and is well-documented
- Trait can be implemented by different backends
- Error types are defined for storage failures
- Trait supports all operations needed by Raft
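One plausible form of the trait, assuming the async_trait and thiserror crates; the exact method set and error variants are illustrative, not a fixed API:

```rust
use async_trait::async_trait;

/// Errors a storage backend may return; a real implementation would likely
/// carry richer context.
#[derive(Debug, thiserror::Error)]
pub enum StorageError {
    #[error("I/O error: {0}")]
    Io(#[from] std::io::Error),
    #[error("log index {0:?} not found")]
    NotFound(LogIndex),
    #[error("data corruption detected: {0}")]
    Corruption(String),
}

/// Durable storage contract for Raft's persistent state. Every write method
/// must be durable (fsynced) before it returns Ok.
#[async_trait]
pub trait Storage<C: Send + Sync + 'static>: Send + Sync {
    async fn current_term(&self) -> Result<Term, StorageError>;
    async fn set_current_term(&self, term: Term) -> Result<(), StorageError>;

    async fn voted_for(&self) -> Result<Option<NodeId>, StorageError>;
    async fn set_voted_for(&self, node: Option<NodeId>) -> Result<(), StorageError>;

    async fn append_entries(&self, entries: Vec<LogEntry<C>>) -> Result<(), StorageError>;
    async fn entry(&self, index: LogIndex) -> Result<LogEntry<C>, StorageError>;
    async fn last_entry(&self) -> Result<Option<LogEntry<C>>, StorageError>;

    /// Atomically persist term and vote together, e.g. when starting an election.
    async fn save_hard_state(&self, term: Term, voted_for: Option<NodeId>)
        -> Result<(), StorageError>;
}
```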
As a Distributed Systems Engineer I want an in-memory storage implementation So that I can test Raft logic without disk I/O complexity
Acceptance Criteria:
- InMemoryStorage implements Storage trait
- Storage holds term, votedFor, and log entries in memory
- All Storage trait methods are implemented correctly
- Implementation uses appropriate synchronization (Mutex/RwLock) for thread safety
- Implementation returns errors for invalid operations (e.g., reading non-existent entry)
- Clear documentation explains this is for testing, not production
- Storage can be created empty or with initial state
- Implementation is efficient for test scenarios
Definition of Done:
- InMemoryStorage compiles and all trait methods work
- Basic tests verify reading and writing state
- Storage can be used in unit tests for other components
- Documentation clarifies testing-only usage
As a Distributed Systems Engineer I want comprehensive error types for all failure modes So that I can handle errors appropriately and provide clear diagnostics
Acceptance Criteria:
- RaftError enum covers all error categories (Storage, Network, Protocol, Configuration)
- Each error variant includes context (message, source error if applicable)
- Errors implement std::error::Error trait
- Errors implement Display for user-friendly messages
- Errors distinguish recoverable from fatal errors
- Clear documentation explains when each error occurs
- Errors support error chaining (from other error types)
- Errors include structured data for programmatic handling
Definition of Done:
- RaftError compiles and implements required traits
- Errors can be created and converted from underlying errors
- Error messages are clear and actionable
- Documentation explains error handling strategies
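A sketch using the thiserror crate, reusing the StorageError sketched for the storage trait; the variant names and the recoverability split are assumptions:

```rust
/// Top-level error type for the library.
#[derive(Debug, thiserror::Error)]
pub enum RaftError {
    #[error("storage error: {0}")]
    Storage(#[from] StorageError),

    #[error("network error talking to {peer:?}: {message}")]
    Network { peer: NodeId, message: String },

    #[error("not the leader; current leader is {leader:?}")]
    NotLeader { leader: Option<NodeId> },

    #[error("invalid configuration: {0}")]
    Configuration(String),
}

impl RaftError {
    /// Distinguish failures the caller can retry from fatal ones.
    pub fn is_recoverable(&self) -> bool {
        matches!(self, RaftError::Network { .. } | RaftError::NotLeader { .. })
    }
}
```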
As an Application Developer I want a configuration structure with sensible defaults So that I can configure Raft behavior without understanding all details
Acceptance Criteria:
- RaftConfig struct includes election timeout range (min/max)
- RaftConfig includes heartbeat interval
- RaftConfig includes node ID and cluster peer list
- RaftConfig includes storage configuration
- RaftConfig provides builder pattern for construction
- Default values match Raft paper recommendations (150-300ms election timeout)
- Configuration validates values (e.g., heartbeat < election timeout)
- Clear documentation explains each configuration option
- Configuration supports loading from file or environment variables
Definition of Done:
- RaftConfig can be created with defaults
- Configuration can be customized via builder
- Invalid configurations are rejected with clear error messages
- Documentation explains tuning recommendations
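A minimal configuration sketch with paper-recommended defaults and a validation step; the field names and the chosen default heartbeat interval are assumptions:

```rust
use std::time::Duration;

/// Runtime configuration with defaults close to the Raft paper's suggestions.
#[derive(Clone, Debug)]
pub struct RaftConfig {
    pub node_id: NodeId,
    pub peers: Vec<NodeId>,
    pub election_timeout_min: Duration,
    pub election_timeout_max: Duration,
    pub heartbeat_interval: Duration,
}

impl RaftConfig {
    pub fn new(node_id: NodeId, peers: Vec<NodeId>) -> Self {
        Self {
            node_id,
            peers,
            election_timeout_min: Duration::from_millis(150),
            election_timeout_max: Duration::from_millis(300),
            heartbeat_interval: Duration::from_millis(50),
        }
    }

    /// Reject configurations that would make elections unstable.
    pub fn validate(&self) -> Result<(), RaftError> {
        if self.election_timeout_min >= self.election_timeout_max {
            return Err(RaftError::Configuration(
                "election_timeout_min must be < election_timeout_max".into(),
            ));
        }
        if self.heartbeat_interval >= self.election_timeout_min {
            return Err(RaftError::Configuration(
                "heartbeat_interval must be < election timeout".into(),
            ));
        }
        Ok(())
    }
}
```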
As a Distributed Systems Engineer I want nodes to initialize in follower state So that new nodes start in a safe, passive state
Acceptance Criteria:
- Node initializes with ServerState::Follower
- Node initializes persistent state (term=0, votedFor=None, empty log)
- Node initializes volatile state (commitIndex=0, lastApplied=0)
- Node loads existing state from storage if available
- Clear logging indicates node started as follower
- Node ID is set from configuration
- Initialization completes successfully or returns clear error
Definition of Done:
- Node can be created and starts as follower
- State is properly initialized from storage or defaults
- Logs provide visibility into initialization process
- Node is ready to receive RPCs
As a Distributed Systems Engineer I want followers to start an election timer So that followers can detect leader failure
Acceptance Criteria:
- Follower creates timer with randomized timeout (within configured range)
- Timeout is chosen uniformly from election timeout range
- Timer starts when node becomes follower
- Timer can be reset without restarting the node
- Clear logging shows timer value and expiration time
- Timer uses monotonic clock (not wall clock)
- Timer is cancelled when node changes state
- Different followers choose different random timeouts
Definition of Done:
- Election timer runs and can be observed
- Timer timeout is randomized appropriately
- Timer can be reset on receiving valid RPCs
- Logging provides visibility into timer behavior
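One way the randomized, resettable timer could be built on tokio's monotonic Instant; the rand crate and the overall structure are assumptions of this sketch:

```rust
use rand::Rng;
use std::time::Duration;
use tokio::time::{sleep_until, Instant};

/// Choose a fresh timeout uniformly from the configured range.
fn random_election_timeout(min: Duration, max: Duration) -> Duration {
    let ms = rand::thread_rng().gen_range(min.as_millis()..max.as_millis());
    Duration::from_millis(ms as u64)
}

/// Resettable deadline based on a monotonic clock. The Raft task would race
/// `expired()` against incoming RPCs and call `reset()` whenever a valid
/// heartbeat or vote grant arrives.
struct ElectionTimer {
    deadline: Instant,
    min: Duration,
    max: Duration,
}

impl ElectionTimer {
    fn new(min: Duration, max: Duration) -> Self {
        Self { deadline: Instant::now() + random_election_timeout(min, max), min, max }
    }

    fn reset(&mut self) {
        self.deadline = Instant::now() + random_election_timeout(self.min, self.max);
    }

    async fn expired(&self) {
        sleep_until(self.deadline).await;
    }
}
```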
As a Distributed Systems Engineer I want followers to transition to candidate when election timer expires So that the cluster can elect a new leader when current leader fails
Acceptance Criteria:
- When election timer expires, follower transitions to candidate state
- Node increments term before transitioning
- New term is persisted before state transition
- Election timeout event is logged clearly
- Transition only occurs if node is still a follower
- Timer expiration doesn't occur if timer was reset
- State transition is atomic
Definition of Done:
- Follower becomes candidate after timeout expires
- Term is incremented and persisted
- State transition is logged and observable
- No data races in state transition
As a Distributed Systems Engineer I want candidates to start an election when entering candidate state So that a new leader can be elected
Acceptance Criteria:
- Candidate votes for itself (updates votedFor)
- Vote for self is persisted before requesting votes from others
- Candidate resets election timer with new random timeout
- Candidate tracks votes received (starts with 1 for self)
- Current term is already incremented (from follower transition)
- Clear logging indicates election started with term number
- Candidate prepares to send RequestVote RPCs (actual sending in later story)
- Election state is initialized (vote count, nodes contacted)
Definition of Done:
- Candidate votes for itself and persists vote
- Election timer is reset with new random value
- Vote tracking is initialized
- Election start is logged and observable
As a Distributed Systems Engineer I want candidates to detect when they receive majority votes So that successful candidates can become leader
Acceptance Criteria:
- Candidate calculates quorum size (cluster_size / 2 + 1)
- Candidate counts votes including self-vote
- When vote count reaches quorum, candidate recognizes election success
- Quorum calculation works for odd and even cluster sizes
- Vote counting is thread-safe (if concurrent vote responses)
- Clear logging when quorum is reached
- Candidate tracks which nodes granted votes (for debugging)
- Duplicate votes from same node don't count twice
Definition of Done:
- Candidate correctly identifies when quorum is reached
- Quorum calculation is correct for various cluster sizes
- Vote counting is safe under concurrency
- Election success is logged and observable
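A small sketch of vote tallying that satisfies the criteria above: the candidate seeds the tally with its own vote, duplicate votes are ignored via set semantics, and the quorum formula works for odd and even cluster sizes (3-of-5, 3-of-4, 2-of-3):

```rust
use std::collections::HashSet;

/// Tracks votes for one election attempt.
struct VoteTally {
    cluster_size: usize,
    granted: HashSet<NodeId>,
}

impl VoteTally {
    /// A candidate always starts with its own vote.
    fn new(cluster_size: usize, self_id: NodeId) -> Self {
        let mut granted = HashSet::new();
        granted.insert(self_id);
        Self { cluster_size, granted }
    }

    /// Record a granted vote; duplicates from the same node don't count twice
    /// because `HashSet::insert` is idempotent.
    fn record_vote(&mut self, from: NodeId) {
        self.granted.insert(from);
    }

    /// Majority of the cluster.
    fn has_quorum(&self) -> bool {
        self.granted.len() >= self.cluster_size / 2 + 1
    }
}
```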
As a Distributed Systems Engineer I want candidates to become leader upon winning election So that the cluster has a leader to coordinate operations
Acceptance Criteria:
- Candidate transitions to Leader state after receiving quorum votes
- Leader initializes nextIndex for each follower (last log index + 1)
- Leader initializes matchIndex for each follower (0)
- Election timer is stopped (leaders don't need it)
- State transition is atomic
- Clear logging indicates node became leader with term number
- Leader state is initialized before accepting client requests
- Transition only occurs if still in candidate state and term unchanged
Definition of Done:
- Candidate becomes leader after winning election
- Leader state (nextIndex, matchIndex) is properly initialized
- State transition is logged and observable
- Leader is ready to send heartbeats
As a Distributed Systems Engineer I want candidates to retry elections when timeouts occur without winning So that elections eventually succeed even with split votes
Acceptance Criteria:
- When election timer expires in candidate state, start new election
- Increment term before starting new election
- Reset election timer with new random timeout
- Reset vote count and vote for self again
- New term is persisted before proceeding
- Clear logging indicates new election attempt with new term
- No limit is imposed on election retries (candidates retry indefinitely)
- Previous election state is cleared
Definition of Done:
- Candidate retries election on timeout
- Each retry uses a new term and random timeout
- Split votes eventually resolve through retries
- Retry behavior is logged and observable
As a Distributed Systems Engineer I want leaders to start a heartbeat timer So that leaders can maintain leadership by sending periodic heartbeats
Acceptance Criteria:
- Leader creates heartbeat timer with configured interval
- Heartbeat interval is less than election timeout (typically electionTimeout / 2)
- Timer starts when node becomes leader
- Timer fires periodically until node is no longer leader
- Clear logging shows heartbeat interval
- Timer is cancelled when leader loses leadership
- Timer uses monotonic clock
- Heartbeat timer is separate from election timer
Definition of Done:
- Heartbeat timer runs while node is leader
- Timer fires at configured interval
- Timer is stopped when leadership ends
- Timer behavior is logged and observable
As a Distributed Systems Engineer I want leaders to create AppendEntries messages for heartbeats So that followers know the leader is alive
Acceptance Criteria:
- Leader creates AppendEntriesRequest for each follower
- Heartbeat messages have empty entries array
- Messages include current term
- Messages include leader's ID
- Messages include prevLogIndex and prevLogTerm (based on follower's nextIndex)
- Messages include current commitIndex
- Each follower gets a message tailored to their state (based on nextIndex)
- Messages are prepared when heartbeat timer fires
- Message creation doesn't modify any state
Definition of Done:
- Heartbeat messages are created correctly
- Each follower receives appropriate message
- Messages contain all required fields per Raft spec
- Message preparation is efficient
As a Distributed Systems Engineer I want followers to reset election timer on receiving valid heartbeat So that followers don't start unnecessary elections while leader is active
Acceptance Criteria:
- Follower receives AppendEntriesRequest with empty entries (heartbeat)
- Follower verifies message term is >= current term
- If message term is higher, update current term and persist
- Reset election timer with new random timeout
- Return success response if log consistency check passes
- Clear logging indicates heartbeat received from leader
- Don't start election while receiving regular heartbeats
- Handle heartbeat even if log is inconsistent (return failure but reset timer)
Definition of Done:
- Followers reset timer on valid heartbeats
- Followers remain followers while heartbeats arrive
- Timer reset prevents unnecessary elections
- Heartbeat handling is logged
As a Distributed Systems Engineer I want nodes to detect when they see a higher term So that nodes can update to current term and maintain safety
Acceptance Criteria:
- Node compares incoming RPC term with current term
- If incoming term > current term, update current term
- New term is persisted before continuing
- Reset votedFor to None when updating term
- If node is leader or candidate, step down to follower
- Clear logging indicates term update with old and new values
- Term update is atomic with state change
- Respond to RPC using updated term
Definition of Done:
- Nodes correctly detect higher terms
- Term and votedFor are updated and persisted
- Leaders/candidates step down when seeing higher term
- Term updates are logged and observable
As a Distributed Systems Engineer I want nodes to validate RequestVote requests So that only valid vote requests are granted
Acceptance Criteria:
- Node receives RequestVoteRequest
- Check if request term >= current term
- If request term < current term, reject with current term in response
- If request term > current term, update term first (per C-11)
- Validate request contains required fields (candidateId, lastLogIndex, lastLogTerm)
- Log validation result (granted or denied with reason)
- Response includes current term
- Validation doesn't grant vote yet (just checks basic validity)
Definition of Done:
- RequestVote requests are validated correctly
- Invalid requests are rejected with appropriate response
- Term updates occur before vote granting
- Validation is logged
As a Distributed Systems Engineer I want nodes to grant votes only if not already voted this term So that each node votes at most once per term (election safety)
Acceptance Criteria:
- Node checks votedFor field
- If votedFor is None, vote can be granted (subject to other checks)
- If votedFor is Some(candidateId) for this candidate, grant vote (idempotent)
- If votedFor is Some(differentId), reject vote
- Vote granting updates votedFor and persists before responding
- Clear logging shows vote decision and reason
- Vote persistence is atomic
- Response includes voteGranted boolean and current term
Definition of Done:
- Nodes vote at most once per term
- Vote granting respects existing votes
- Votes are persisted before responding
- Vote decisions are logged with reasons
As a Distributed Systems Engineer I want nodes to grant votes only to candidates with up-to-date logs So that new leaders have all committed entries (leader completeness)
Acceptance Criteria:
- Compare candidate's last log term with voter's last log term
- If candidate's last term > voter's last term, candidate log is more up-to-date
- If terms equal, compare last log index (higher index is more up-to-date)
- If candidate's log is not at least as up-to-date, reject vote
- If candidate's log is up-to-date and other checks pass, grant vote
- Clear logging shows log comparison result
- Comparison uses lastLogIndex and lastLogTerm from request
- Log comparison is correct for empty logs (term=0, index=0)
Definition of Done:
- Vote granting includes log up-to-date check
- Only candidates with up-to-date logs receive votes
- Log comparison logic is correct per Raft spec
- Comparison results are logged
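The up-to-date comparison itself is only a few lines; this sketch assumes the Term and LogIndex newtypes from the foundation stories, with an empty log represented as (Term(0), LogIndex(0)):

```rust
/// Raft §5.4.1: the candidate's log is at least as up-to-date as the voter's
/// if its last term is higher, or the terms are equal and its last index is
/// at least as large.
fn candidate_log_is_up_to_date(
    candidate_last_term: Term,
    candidate_last_index: LogIndex,
    my_last_term: Term,
    my_last_index: LogIndex,
) -> bool {
    if candidate_last_term != my_last_term {
        candidate_last_term > my_last_term
    } else {
        candidate_last_index >= my_last_index
    }
}
```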
As a Distributed Systems Engineer I want followers to verify log consistency on AppendEntries So that followers maintain identical log prefixes (log matching property)
Acceptance Criteria:
- Follower checks if log contains entry at prevLogIndex with term = prevLogTerm
- If prevLogIndex is 0, consistency check passes (for first entry)
- If log doesn't have entry at prevLogIndex, return failure
- If log has entry at prevLogIndex but different term, return failure
- If consistency check passes, proceed with appending entries
- If consistency check fails, return success=false with current term
- Clear logging shows consistency check result with index/term values
- Don't modify log on consistency failure
Definition of Done:
- Followers correctly verify log consistency
- Inconsistent logs are detected and rejected
- Consistency check follows Raft specification
- Check results are logged with diagnostic information
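A sketch of the consistency check, assuming the LogEntry accessors from the earlier type sketch; a real implementation would index into the log directly rather than scan it:

```rust
/// AppendEntries consistency check (Raft §5.3): the follower must contain an
/// entry at prevLogIndex whose term equals prevLogTerm. prevLogIndex of 0
/// means the entries start at the beginning of the log, so the check passes.
fn log_matches<C>(log: &[LogEntry<C>], prev_index: LogIndex, prev_term: Term) -> bool {
    if prev_index == LogIndex(0) {
        return true;
    }
    log.iter()
        .find(|e| e.index() == prev_index)
        .map_or(false, |e| e.term() == prev_term)
}
```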
As a Distributed Systems Engineer I want followers to append new entries from leader So that follower logs replicate leader's log
Acceptance Criteria:
- After successful consistency check, append entries from request
- If log already has entries at those indices, they must match (or be deleted)
- Delete conflicting entries and all that follow (log must match leader)
- Append all new entries to log
- Persist new log entries before responding
- Clear logging shows which entries were appended (indices/terms)
- Entry appending is atomic (all or nothing)
- Response indicates success=true after appending
Definition of Done:
- Followers append entries correctly
- Conflicting entries are handled properly
- Log entries are persisted before response
- Append operations are logged
As a Distributed Systems Engineer I want followers to update commit index from leader So that followers know which entries are committed
Acceptance Criteria:
- Follower receives leaderCommit field from AppendEntriesRequest
- If leaderCommit > follower's commitIndex, update commitIndex
- New commitIndex is min(leaderCommit, index of last new entry)
- CommitIndex never decreases
- Clear logging when commitIndex advances
- CommitIndex update doesn't require persistence (volatile state)
- Trigger state machine application after commit index update
- CommitIndex is bounded by follower's log size
Definition of Done:
- Followers update commitIndex from leader
- CommitIndex advances monotonically
- CommitIndex never exceeds log size
- Updates are logged and observable
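The follower-side update reduces to a guarded minimum, as in this sketch:

```rust
/// Follower-side commit index update: never move backwards, and never
/// advance past the last entry the follower actually holds.
fn update_commit_index(
    commit_index: &mut LogIndex,
    leader_commit: LogIndex,
    last_new_entry: LogIndex,
) {
    if leader_commit > *commit_index {
        *commit_index = leader_commit.min(last_new_entry);
    }
}
```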
As a Distributed Systems Engineer I want leaders to update matchIndex based on AppendEntries responses So that leaders know how much of the log each follower has replicated
Acceptance Criteria:
- Leader receives AppendEntriesResponse from follower
- If response success=true, update matchIndex[follower] to last appended entry index
- Update nextIndex[follower] to matchIndex[follower] + 1
- If response success=false, decrement nextIndex[follower] and retry
- MatchIndex values are used to determine commit point
- Clear logging shows matchIndex and nextIndex updates
- Updates are thread-safe if responses arrive concurrently
- Leader doesn't update matchIndex for itself
Definition of Done:
- Leaders track replication progress per follower
- matchIndex and nextIndex are updated correctly
- Failed AppendEntries trigger retry with lower nextIndex
- Progress tracking is logged
As a Distributed Systems Engineer I want leaders to advance commit index when majority has replicated So that entries become committed and can be applied to state machine
Acceptance Criteria:
- Leader calculates highest index N where majority have matchIndex >= N
- N must be > current commitIndex
- Entry at index N must have term = current term (safety check)
- If such N exists, update commitIndex to N
- CommitIndex never decreases
- Clear logging when commitIndex advances with new value
- Commit determination runs after each AppendEntries response
- Leader's own log counts toward majority (leader has all entries)
Definition of Done:
- Leaders correctly determine commit point
- Commit requires majority replication
- Only entries from current term are committed directly
- Commit index advancement is logged
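A sketch of the leader's commit rule; term_of is an assumed helper that looks up the term of the entry at a given index:

```rust
use std::collections::HashMap;

/// Leader-side commit rule (Raft §5.3, §5.4.2): find the highest N such that
/// a majority has matchIndex >= N and log[N].term == currentTerm.
fn advance_commit_index(
    commit_index: LogIndex,
    current_term: Term,
    last_log_index: LogIndex,
    match_index: &HashMap<NodeId, LogIndex>,
    cluster_size: usize,
    term_of: impl Fn(LogIndex) -> Option<Term>,
) -> LogIndex {
    let quorum = cluster_size / 2 + 1;
    let mut best = commit_index;
    for n in (commit_index.0 + 1)..=last_log_index.0 {
        let n = LogIndex(n);
        // The leader itself counts as one replica of every entry it holds.
        let replicas = 1 + match_index.values().filter(|&&m| m >= n).count();
        if replicas >= quorum && term_of(n) == Some(current_term) {
            best = n;
        }
    }
    best
}
```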
As an Application Developer I want to submit commands to the leader So that my commands are replicated and applied to the state machine
Acceptance Criteria:
- Leader provides interface to accept client commands
- Leader appends command to its local log with current term
- New log entry is persisted before responding to client
- Leader returns index and term to client (for tracking)
- Leader begins replicating entry via AppendEntries RPCs
- If node is not leader, reject command with error (redirect to leader)
- Clear logging shows command acceptance with assigned index
- Command acceptance is thread-safe
Definition of Done:
- Clients can submit commands to leader
- Commands are appended to log and persisted
- Replication begins immediately
- Non-leaders reject commands appropriately
As an Application Developer I want to receive notification when my command is committed So that I know the operation is durable and will be applied
Acceptance Criteria:
- Leader provides mechanism to wait for specific log index to be committed
- When commitIndex advances past command's index, notify waiting client
- Notification includes success indication
- If leadership is lost before commit, notify client of failure
- Waiting is async (doesn't block leader operation)
- Clear timeout mechanism for client waits
- Multiple clients can wait for different indices simultaneously
- Notification occurs promptly after commit
Definition of Done:
- Clients can wait for commit notification
- Notifications are sent when commands commit
- Leadership loss is communicated to waiting clients
- Waiting doesn't block the Raft node
As a Distributed Systems Engineer I want committed entries to be applied to the state machine So that the replicated state machine processes commands
Acceptance Criteria:
- When commitIndex > lastApplied, apply intervening entries
- Apply entries in strict log order (from lastApplied+1 to commitIndex)
- Each entry is applied exactly once to the state machine
- Update lastApplied after successful application
- Clear logging shows which entries are being applied
- Application happens on all nodes (leaders and followers)
- State machine application is decoupled from log replication
- Handle state machine errors gracefully
Definition of Done:
- Committed entries are applied to state machine
- Entries are applied in order, exactly once
- lastApplied tracks application progress
- Application is logged and observable
As an Operations Engineer I want nodes to detect when snapshots are needed So that log size is bounded and doesn't grow indefinitely
Acceptance Criteria:
- Monitor log size (number of entries or bytes)
- When log exceeds configured threshold, trigger snapshot
- Snapshot only includes committed entries (up to lastApplied)
- Snapshot triggering doesn't block normal operations
- Clear logging indicates snapshot is being created
- Configurable threshold for snapshot trigger (entries or size)
- Snapshot creation can be triggered manually for testing
- Only create snapshot if there are entries to compact
Definition of Done:
- Nodes detect when snapshot is needed
- Snapshot creation is triggered appropriately
- Triggering doesn't disrupt normal operation
- Snapshot triggers are logged
As a Distributed Systems Engineer I want nodes to create snapshots of state machine state So that old log entries can be discarded
Acceptance Criteria:
- Capture current state machine state as snapshot
- Snapshot includes lastIncludedIndex (last entry applied to state)
- Snapshot includes lastIncludedTerm (term of lastIncludedIndex entry)
- Snapshot creation is atomic (consistent point-in-time)
- Snapshot is stored durably (via storage layer)
- Clear logging indicates snapshot creation progress
- Snapshot creation doesn't block state machine queries
- Old snapshots can be replaced with newer ones
Definition of Done:
- Nodes create snapshots of state machine state
- Snapshots include required metadata
- Snapshots are stored durably
- Snapshot creation is logged
As a Distributed Systems Engineer I want nodes to discard log entries included in snapshot So that log storage is reclaimed
Acceptance Criteria:
- After snapshot creation, discard entries up to lastIncludedIndex
- Keep entries after lastIncludedIndex (not yet in snapshot)
- Update storage to remove old entries
- Keep snapshot metadata (lastIncludedIndex, lastIncludedTerm)
- Clear logging shows how many entries were discarded
- Discarding is atomic (snapshot and log update)
- Log compaction doesn't lose any committed entries
- Log indices remain consistent (don't renumber)
Definition of Done:
- Old log entries are discarded after snapshot
- Storage space is reclaimed
- Log remains consistent and usable
- Compaction is logged with statistics
As a Distributed Systems Engineer I want a file-based storage implementation So that Raft state persists across process restarts
Acceptance Criteria:
- FileStorage implements Storage trait
- Storage uses separate files for term, votedFor, and log
- Files are created in configured directory
- File paths are configurable
- Basic file I/O works (create, read, write)
- Storage directory is created if it doesn't exist
- Clear error handling for file operations (permission denied, disk full, etc.)
- Storage can be initialized from existing files
Definition of Done:
- FileStorage implements all Storage trait methods
- Files are created and accessed correctly
- Basic persistence works (write then read returns same data)
- File errors are handled gracefully
As a Distributed Systems Engineer I want storage writes to be atomic So that partial writes don't corrupt state on crash
Acceptance Criteria:
- Use write-rename pattern for atomic file updates
- Write to temporary file first
- Call fsync on temporary file to ensure durability
- Rename temporary file to actual file (atomic on POSIX)
- If crash occurs, either old or new state is intact (never partial)
- Clear logging of write operations
- Handle errors during any step of atomic write
- Temporary files are cleaned up on error
Definition of Done:
- Storage writes are atomic and durable
- Crash during write leaves system in consistent state
- Atomic write pattern is correctly implemented
- Write operations are logged
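A sketch of the write-rename pattern using only the standard library; note that fully durable implementations also fsync the parent directory after the rename, which is elided here:

```rust
use std::fs::{self, File};
use std::io::Write;
use std::path::Path;

/// Write-then-rename so a crash leaves either the old or the new file intact.
fn atomic_write(path: &Path, data: &[u8]) -> std::io::Result<()> {
    let tmp = path.with_extension("tmp");
    {
        let mut f = File::create(&tmp)?;
        f.write_all(data)?;
        f.sync_all()?; // fsync the temp file before making it visible
    }
    fs::rename(&tmp, path)?; // atomic on POSIX filesystems
    Ok(())
}
```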
As a Distributed Systems Engineer I want efficient log appending to file So that log writes have acceptable performance
Acceptance Criteria:
- Append-only log file for better performance
- Use buffered writes with explicit flush for durability
- Batch multiple entries into single fsync when possible
- Index file or header for quick lookups by index
- Handle log truncation (for conflicting entries)
- Clear logging of log operations
- Configurable fsync policy (every write vs batched)
- Log file format supports forward and backward scanning
Definition of Done:
- Log appending is efficient (measured via benchmarks)
- Durability guarantees are maintained
- Log file can be read and appended correctly
- Performance is acceptable for target workload
As an Operations Engineer I want storage to detect corrupted files So that I'm alerted to data corruption issues
Acceptance Criteria:
- Use checksums (CRC32 or better) for data integrity
- Verify checksums on read operations
- Detect truncated files (incomplete writes)
- Return clear error when corruption is detected
- Log corruption events with details
- Don't automatically "fix" corruption (preserve evidence)
- Provide tools to manually recover from corruption
- Consider using length-prefixed records for structure validation
Definition of Done:
- Corruption is reliably detected on read
- Corruption errors are clear and actionable
- Checksums protect against silent corruption
- Corruption detection is tested
As a Distributed Systems Engineer I want nodes to recover state from files on restart So that restarts don't lose critical state
Acceptance Criteria:
- On initialization, load term from term file
- Load votedFor from votedFor file
- Load all log entries from log file
- Verify data integrity during load (checksums)
- Handle missing files (initialize to defaults)
- Clear logging shows what state was recovered
- Recovery fails cleanly if corruption detected
- Recovery performance is reasonable (acceptable startup time)
Definition of Done:
- Nodes successfully recover state on restart
- Recovered state matches state before shutdown
- Missing or corrupt files are handled appropriately
- Recovery process is logged
As a Distributed Systems Engineer I want snapshots to be stored durably So that snapshots survive restarts and enable log compaction
Acceptance Criteria:
- Snapshot stored as separate file
- Snapshot file includes lastIncludedIndex and lastIncludedTerm metadata
- Snapshot file includes serialized state machine state
- Snapshot writes are atomic (write-rename pattern)
- Snapshot files use checksums for integrity
- Old snapshots are replaced by newer ones
- Snapshot loading on recovery
- Clear logging of snapshot persistence operations
Definition of Done:
- Snapshots are persistently stored
- Snapshot files include all necessary data
- Snapshots can be loaded on restart
- Snapshot persistence is atomic and durable
As a Distributed Systems Engineer I want storage to support log compaction via snapshot So that old log entries can be removed after snapshot
Acceptance Criteria:
- Provide method to install snapshot and truncate log
- Atomically replace log entries with snapshot
- Remove log entries up to snapshot's lastIncludedIndex
- Keep snapshot file and remaining log entries
- Ensure atomicity (snapshot + log truncation together)
- Recovery works with snapshot + partial log
- Clear logging of compaction operations
- Storage space is reclaimed after compaction
Definition of Done:
- Log compaction via snapshot works correctly
- Old entries are removed and storage reclaimed
- Recovery works with compacted logs
- Compaction is atomic and safe
As a Distributed Systems Engineer I want an abstract RPC transport interface So that I can swap network implementations without changing Raft logic
Acceptance Criteria:
- RpcTransport trait defines methods for sending RequestVote, AppendEntries
- Trait methods are async (return Futures)
- Methods accept target node ID and request message
- Methods return Result with response or network error
- Trait supports timeout configuration per request
- Clear documentation of transport contract
- Trait is generic over message types
- Transport handles connection management internally
Definition of Done:
- RpcTransport trait is well-defined and documented
- Trait can be implemented by different backends (TCP, gRPC, in-memory)
- Trait methods cover all Raft RPC needs
- Trait is async-compatible
As a Distributed Systems Engineer I want an in-memory RPC transport for tests So that I can test Raft logic without real network I/O
Acceptance Criteria:
- InMemoryTransport implements RpcTransport trait
- Uses channels or similar for message passing between nodes
- Simulates RPC calls synchronously or with minimal delay
- Supports multiple nodes in same process
- Can simulate message delays for testing
- Can simulate message loss (drop messages)
- Can simulate network partitions (block messages between groups)
- Clear API for controlling test scenarios
Definition of Done:
- InMemoryTransport works for multi-node tests
- Message passing is reliable in normal mode
- Network failures can be injected for testing
- Transport is used in Raft integration tests
As an Operations Engineer I want TCP connections between Raft nodes So that nodes can communicate over real networks
Acceptance Criteria:
- TcpTransport implements RpcTransport trait
- Establishes TCP connections to other nodes
- Connection pool manages connections per target node
- Reconnects automatically on connection failure
- Configurable connection timeout
- Configurable request timeout
- Clear logging of connection events
- Handles node unavailability gracefully
Definition of Done:
- TCP connections are established and managed
- Connection failures trigger reconnection
- Transport works over real networks
- Connection management is logged
As a Distributed Systems Engineer I want RPC messages serialized for network transmission So that messages can be sent over TCP
Acceptance Criteria:
- Messages serialized using chosen format (bincode, MessagePack, protobuf)
- Serialization includes message length prefix
- Deserialization validates message format
- Handles serialization errors gracefully
- Supports large messages (log entries, snapshots)
- Clear error messages for serialization failures
- Versioning support for protocol evolution
- Efficient serialization (low overhead)
Definition of Done:
- Messages are correctly serialized and deserialized
- Large messages are handled efficiently
- Serialization errors are caught and reported
- Wire format is documented
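A sketch of length-prefixed framing over an async byte stream; the choice of bincode for the payload, anyhow for errors, and tokio for async I/O are assumptions, not requirements of the story:

```rust
use serde::{de::DeserializeOwned, Serialize};
use tokio::io::{AsyncRead, AsyncReadExt, AsyncWrite, AsyncWriteExt};

/// Write one message as a 4-byte big-endian length prefix followed by the body.
async fn write_frame<W, T>(writer: &mut W, msg: &T) -> anyhow::Result<()>
where
    W: AsyncWrite + Unpin,
    T: Serialize,
{
    let body = bincode::serialize(msg)?;
    writer.write_u32(body.len() as u32).await?;
    writer.write_all(&body).await?;
    Ok(())
}

/// Read one length-prefixed message and deserialize it.
async fn read_frame<R, T>(reader: &mut R) -> anyhow::Result<T>
where
    R: AsyncRead + Unpin,
    T: DeserializeOwned,
{
    let len = reader.read_u32().await? as usize;
    let mut buf = vec![0u8; len];
    reader.read_exact(&mut buf).await?;
    Ok(bincode::deserialize(&buf)?)
}
```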
As a Distributed Systems Engineer I want to send RequestVote RPCs over TCP So that candidates can request votes during elections
Acceptance Criteria:
- Serialize RequestVoteRequest message
- Send over TCP connection to target node
- Wait for response with timeout
- Deserialize RequestVoteResponse
- Return response or timeout error
- Handle connection failures during request
- Clear logging of RPC calls (target, term, success/failure)
- Retry logic for transient failures (optional)
Definition of Done:
- RequestVote RPCs are sent successfully
- Responses are received and deserialized
- Timeouts and failures are handled
- RPC calls are logged
As a Distributed Systems Engineer I want to send AppendEntries RPCs over TCP So that leaders can replicate log entries to followers
Acceptance Criteria:
- Serialize AppendEntriesRequest message
- Send over TCP connection to target node
- Wait for response with timeout
- Deserialize AppendEntriesResponse
- Return response or timeout error
- Handle large messages (many log entries)
- Clear logging of RPC calls
- Performance is acceptable (low latency)
Definition of Done:
- AppendEntries RPCs are sent successfully
- Large messages are handled correctly
- Performance meets requirements
- RPC calls are logged
As a Distributed Systems Engineer I want nodes to listen for incoming RPC connections So that nodes can receive RPC requests from other nodes
Acceptance Criteria:
- RPC server listens on configured address and port
- Accepts incoming TCP connections
- Spawns handler for each connection
- Handles multiple concurrent connections
- Configurable listen address and port
- Clear logging when server starts listening
- Graceful shutdown of server
- Handles connection errors (port already in use, etc.)
Definition of Done:
- RPC server accepts incoming connections
- Multiple connections are handled concurrently
- Server can be started and stopped cleanly
- Server startup/shutdown is logged
As a Distributed Systems Engineer I want nodes to receive and process RequestVote RPCs So that candidates can collect votes
Acceptance Criteria:
- Receive and deserialize RequestVoteRequest
- Invoke Raft node's vote handling logic (C-12, C-13, C-14)
- Serialize and send RequestVoteResponse
- Handle multiple simultaneous vote requests
- Clear logging of received requests
- Return errors for malformed requests
- Handle handler errors gracefully
- Respond promptly (low latency)
Definition of Done:
- RequestVote requests are received and processed
- Responses are sent correctly
- Vote granting logic is invoked properly
- Request handling is logged
As a Distributed Systems Engineer I want nodes to receive and process AppendEntries RPCs So that followers can replicate log entries from leader
Acceptance Criteria:
- Receive and deserialize AppendEntriesRequest
- Invoke Raft node's append handling logic (C-15, C-16, C-17)
- Serialize and send AppendEntriesResponse
- Handle heartbeats (empty entries) correctly
- Clear logging of received requests
- Handle large requests (many entries) efficiently
- Return errors for malformed requests
- Respond promptly
Definition of Done:
- AppendEntries requests are received and processed
- Responses are sent correctly
- Log replication logic is invoked properly
- Request handling is efficient and logged
As a Distributed Systems Engineer I want candidates to send RequestVote RPCs to all nodes in parallel So that elections complete quickly
Acceptance Criteria:
- Candidate sends RequestVote to all other nodes concurrently
- Uses async I/O to avoid blocking
- Collects responses as they arrive
- Doesn't wait for all responses (proceed when quorum reached)
- Times out requests that take too long
- Clear logging shows when requests are sent
- Handles partial responses (some nodes unreachable)
- Early termination when quorum reached or election lost
Definition of Done:
- Vote requests are sent in parallel
- Elections complete in minimal time
- Quorum detection works with partial responses
- Parallel sending is logged
As a Distributed Systems Engineer I want leaders to send AppendEntries to all followers in parallel So that replication is fast
Acceptance Criteria:
- Leader sends AppendEntries to all followers concurrently
- Uses async I/O for parallelism
- Tracks responses per follower independently
- Updates matchIndex as responses arrive
- Retries failed requests to specific followers
- Clear logging shows replication progress
- Handles slow or unresponsive followers without blocking others
- Heartbeats are sent in parallel as well
Definition of Done:
- AppendEntries are sent in parallel to all followers
- Replication is efficient and non-blocking
- Per-follower progress is tracked independently
- Parallel replication is logged
As an Operations Engineer I want clear timeout behavior for RPC requests So that network delays don't cause indefinite blocking
Acceptance Criteria:
- All RPC requests have configurable timeout
- Timeout is enforced for connection and response
- Timed-out requests return timeout error
- Timeouts don't crash the node
- Clear logging when timeouts occur (which node, RPC type)
- Timeout values are reasonable defaults but configurable
- Distinguish between connection timeout and read timeout
- Canceled requests clean up resources
Definition of Done:
- All RPCs enforce timeout
- Timeouts are handled gracefully
- Timeout configuration works
- Timeouts are logged
As a Distributed Systems Engineer I want appropriate retry logic for transient failures So that temporary network issues don't cause permanent failures
Acceptance Criteria:
- Retry logic for connection failures (node temporarily unreachable)
- Exponential backoff for retries
- Maximum retry limit to avoid infinite loops
- Don't retry on non-transient errors (invalid request, etc.)
- Clear logging of retry attempts
- Retry logic is configurable (enable/disable, max retries)
- Retries don't violate Raft safety properties
- Distinguish between retryable and non-retryable errors
Definition of Done:
- Transient failures are retried appropriately
- Retries use exponential backoff
- Retry behavior is logged
- Retries don't compromise safety
As an Application Developer I want a trait defining the state machine interface So that I can implement custom state machines for my application
Acceptance Criteria:
- StateMachine trait defines apply method (apply command to state)
- Apply method receives command and returns result
- Trait defines snapshot method (capture current state)
- Trait defines restore method (restore from snapshot)
- Methods use generic or trait object for command type
- Clear documentation explains state machine contract
- Apply must be deterministic (same command = same result)
- Trait is simple and easy to implement
Definition of Done:
- StateMachine trait is well-defined and documented
- Trait can be implemented for various applications
- Trait supports all required operations
- Examples show how to implement the trait
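One plausible shape for the trait; the associated types and exact method signatures are assumptions of this sketch:

```rust
/// Contract between Raft and the application. `apply` must be deterministic:
/// given the same sequence of commands, every replica must reach the same state.
pub trait StateMachine: Send {
    type Command;
    type Output;

    /// Apply one committed command and return its result.
    fn apply(&mut self, command: Self::Command) -> Self::Output;

    /// Serialize the current state for snapshotting.
    fn snapshot(&self) -> Vec<u8>;

    /// Replace the current state with one restored from a snapshot.
    fn restore(&mut self, snapshot: &[u8]);
}
```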
As an Application Developer I want an example state machine implementation So that I understand how to use Raft for my application
Acceptance Criteria:
- Implement simple key-value store as StateMachine
- Support commands: Put(key, value), Get(key), Delete(key)
- Apply method updates internal HashMap
- Snapshot method serializes entire HashMap
- Restore method deserializes HashMap from snapshot
- Clear documentation explains implementation
- Example can be used in tests and demos
- State transitions are deterministic
Definition of Done:
- KV store implements StateMachine trait correctly
- All operations work as expected
- Snapshot/restore preserves state exactly
- Example is documented and usable
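One way the key-value example could implement the trait sketched above; using bincode for snapshot encoding is an assumption:

```rust
use serde::{Deserialize, Serialize};
use std::collections::HashMap;

#[derive(Clone, Debug, Serialize, Deserialize)]
pub enum KvCommand {
    Put { key: String, value: String },
    Get { key: String },
    Delete { key: String },
}

#[derive(Default)]
pub struct KvStore {
    data: HashMap<String, String>,
}

impl StateMachine for KvStore {
    type Command = KvCommand;
    type Output = Option<String>;

    fn apply(&mut self, command: KvCommand) -> Option<String> {
        match command {
            KvCommand::Put { key, value } => self.data.insert(key, value),
            KvCommand::Get { key } => self.data.get(&key).cloned(),
            KvCommand::Delete { key } => self.data.remove(&key),
        }
    }

    fn snapshot(&self) -> Vec<u8> {
        bincode::serialize(&self.data).expect("serializing a HashMap cannot fail")
    }

    fn restore(&mut self, snapshot: &[u8]) {
        self.data = bincode::deserialize(snapshot).expect("snapshot produced by `snapshot`");
    }
}
```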
As a Distributed Systems Engineer I want committed entries applied to state machine in order So that state machine sees consistent command sequence
Acceptance Criteria:
- Maintain queue of entries to be applied (lastApplied to commitIndex)
- Apply entries sequentially in log order
- Each entry applied exactly once
- Application happens asynchronously (doesn't block Raft)
- Handle state machine errors (log but don't crash)
- Clear logging shows which entries are applied
- Backpressure if state machine is slow
- Application catches up efficiently after being behind
Definition of Done:
- Committed entries are applied in order
- Application is decoupled from Raft main loop
- State machine errors are handled gracefully
- Application progress is logged
As an Application Developer I want to read committed state without going through the log So that read-only queries are fast
Acceptance Criteria:
- Leader tracks read requests with current commitIndex
- Leader confirms leadership by sending heartbeat to majority
- After majority confirms leadership, serve read at tracked commitIndex
- Ensure read sees all committed writes
- Reads don't require log entries
- Clear documentation explains linearizability guarantee
- Read requests return error if leader loses leadership
- Performance is better than appending reads to the log
Definition of Done:
- Linearizable reads work on leader
- Reads don't create log entries
- Read linearizability is guaranteed
- Read performance is acceptable
As an Application Developer I want to query state machine state without modifying it So that I can read data from the replicated state
Acceptance Criteria:
- Provide read-only query interface to state machine
- Queries don't modify state machine
- Queries use read index mechanism for linearizability (S-4)
- Support queries on followers (with weaker consistency)
- Clear documentation of consistency guarantees per mode
- Query errors are handled gracefully
- Queries are efficient (low latency)
- Non-leader nodes can redirect queries to leader
Definition of Done:
- State machine can be queried safely
- Linearizable and non-linearizable reads both supported
- Query interface is documented with consistency semantics
- Query performance is acceptable
As a Distributed Systems Engineer I want snapshots to capture state machine state So that log compaction preserves application state
Acceptance Criteria:
- Snapshot calls state machine's snapshot method
- Snapshot includes state machine state bytes
- Snapshot metadata includes lastAppliedIndex and term
- Restore calls state machine's restore method on startup
- Snapshot/restore handles large state (streaming if needed)
- Clear logging of snapshot operations
- Snapshot is consistent (point-in-time)
- State machine state matches log application up to snapshot point
Definition of Done:
- Snapshots capture complete state machine state
- Restore recreates exact state
- Large states are handled efficiently
- Snapshot integration is logged
As a Distributed Systems Engineer I want InstallSnapshot RPC message types defined So that leaders can send snapshots to lagging followers
Acceptance Criteria:
- InstallSnapshotRequest includes term, leaderId
- Request includes lastIncludedIndex and lastIncludedTerm
- Request includes snapshot data (chunk for large snapshots)
- Request includes offset and done flag (for chunked transfer)
- InstallSnapshotResponse includes term
- Messages support serialization
- Clear documentation explains fields per Raft spec
- Messages support large snapshot transfer
Definition of Done:
- InstallSnapshot message types are defined
- Messages support chunked snapshot transfer
- Messages can be serialized and sent over network
- Message structure matches Raft specification
As a Distributed Systems Engineer I want leaders to send snapshots to followers missing log entries So that lagging followers can catch up without full log replay
Acceptance Criteria:
- Leader detects when follower needs snapshot (nextIndex <= snapshot's lastIncludedIndex)
- Leader sends InstallSnapshotRequest instead of AppendEntries
- Large snapshots are sent in chunks
- Leader tracks transfer progress per follower
- Leader resends chunks on failure
- Clear logging of snapshot transfers
- Transfer doesn't block normal replication to other followers
- Leader continues heartbeats during snapshot transfer
Definition of Done:
- Leaders send snapshots to lagging followers
- Large snapshots are transferred successfully
- Transfer is reliable (handles failures)
- Snapshot transfers are logged
As a Distributed Systems Engineer I want followers to receive and install snapshots from leader So that followers can catch up when missing too many log entries
Acceptance Criteria:
- Follower receives InstallSnapshotRequest
- Verify request term and step down if needed
- Receive snapshot chunks and assemble
- Discard log entries covered by snapshot
- Restore state machine from snapshot
- Update commitIndex and lastApplied to snapshot point
- Clear logging of snapshot installation
- Reset election timer on receiving snapshot (leader is active)
- Respond with success after installation
Definition of Done:
- Followers receive and install snapshots
- State machine is restored from snapshot
- Old log entries are discarded
- Installation is logged and observable
As a Distributed Systems Engineer I want configuration changes represented as log entries So that membership changes are replicated and committed like data
Acceptance Criteria:
- Define ConfigurationEntry type with list of node IDs
- Configuration entries have special type marker
- Configurations are versioned (track which is current)
- Support serialization for log storage
- Clear documentation explains configuration log entries
- Configuration entry includes old and new configuration (for joint consensus)
- Configuration is immutable once created
- Type distinguishes configuration from regular data entries
Definition of Done:
- Configuration entry type is defined
- Type can be stored in log like other entries
- Configuration changes are represented clearly
- Type is documented
As an Operations Engineer I want to add a new node to the cluster So that cluster can grow or replace failed nodes
Acceptance Criteria:
- Leader accepts request to add new server
- Leader verifies no concurrent configuration change in progress
- Leader appends configuration entry to log with new server added
- New configuration is replicated and committed
- New configuration becomes effective after commit
- Clear logging of membership change process
- New server must catch up before voting (non-voting learner initially)
- Operation fails if concurrent change exists
Definition of Done:
- New servers can be added to cluster
- Addition is safe (no split-brain)
- Configuration change is committed before taking effect
- Addition process is logged
As an Operations Engineer I want to remove a node from the cluster So that cluster can shrink or remove failed nodes permanently
Acceptance Criteria:
- Leader accepts request to remove server
- Leader verifies no concurrent configuration change in progress
- Leader appends configuration entry to log with server removed
- New configuration is replicated and committed
- New configuration becomes effective after commit
- Clear logging of membership change process
- Removed server shuts down after change is committed
- Handle case where leader removes itself
Definition of Done:
- Servers can be removed from cluster
- Removal is safe and clean
- Removed servers stop participating
- Removal process is logged
As a Distributed Systems Engineer I want only one configuration change at a time So that cluster doesn't have multiple pending configurations
Acceptance Criteria:
- Track if configuration change is in progress
- Reject new configuration change requests while one is pending
- Consider change complete when configuration entry is committed
- Clear error message when concurrent change is rejected
- Log indicates when change completes
- System resumes accepting changes after completion
- Leader tracks pending change status
- Followers can't initiate configuration changes
Definition of Done:
- Only one configuration change happens at a time
- Concurrent changes are rejected cleanly
- System correctly tracks change status
- Concurrency control is logged
As a Distributed Systems Engineer I want new servers to replicate log without voting So that new servers catch up before affecting quorum
Acceptance Criteria:
- New server receives log entries like follower
- New server doesn't participate in elections (doesn't vote or become candidate)
- Leader tracks catch-up progress for new server
- Leader transitions new server to voting member when caught up
- Caught up means matchIndex is close to leader's last log index
- Clear logging shows catch-up progress
- Catch-up happens before new server joins voting configuration
- Timeout if new server doesn't catch up in reasonable time
Definition of Done:
- New servers catch up without voting
- Catch-up progress is tracked and logged
- Transition to voting happens when ready
- Catch-up mechanism is reliable
As an Application Developer I want to query current cluster configuration So that I know which nodes are in the cluster
Acceptance Criteria:
- Provide API to retrieve current committed configuration
- Return list of node IDs in current configuration
- Indicate which node is current leader (if known)
- Query works on any node (leader or follower)
- Configuration returned is the committed configuration (not pending)
- Clear documentation of query API
- Query is fast (doesn't require consensus)
- Result distinguishes voting and non-voting members
Definition of Done:
- Configuration can be queried easily
- Query returns accurate current membership
- API is documented and simple
- Query is efficient
As an Operations Engineer I want to bootstrap initial cluster configuration So that cluster can start with known membership
Acceptance Criteria:
- Initial configuration specified in config file or API
- All nodes start with same initial configuration
- Initial configuration is stored as the first log entry (index 0 or 1)
- Bootstrap only happens on first start (not on restart)
- Clear logging shows bootstrap process
- Bootstrap includes all initial cluster members
- Configuration is persisted before accepting operations
- Nodes in initial configuration can elect leader
Definition of Done:
- Cluster can be bootstrapped with initial configuration
- All nodes agree on initial membership
- Bootstrap process is reliable
- Bootstrap is logged clearly
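A sketch of the first-start guard, with an assumed `Storage` trait standing in for the real persistence layer:

```rust
use std::collections::BTreeSet;

type NodeId = u64;

// Hypothetical storage methods for illustration; the real trait differs.
trait Storage {
    fn has_existing_state(&self) -> bool;
    fn persist_initial_config(&mut self, voters: &BTreeSet<NodeId>);
}

/// Returns true if this node bootstrapped, false if it already had state.
fn bootstrap(storage: &mut dyn Storage, initial_voters: BTreeSet<NodeId>) -> bool {
    // Bootstrap only on a truly fresh node: a restarted node must recover its
    // persisted configuration instead of re-bootstrapping.
    if storage.has_existing_state() {
        return false;
    }
    storage.persist_initial_config(&initial_voters);
    true
}
```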
As an Operations Engineer I want structured logs for core Raft events So that I can monitor cluster behavior and debug issues
Acceptance Criteria:
- Log state transitions (Follower/Candidate/Leader)
- Log term changes with old and new term
- Log election starts and results
- Log vote granting/denying with reason
- Log heartbeat sending and receiving
- Logs include timestamp, node ID, term
- Use structured logging format (JSON or key-value)
- Configurable log level (debug, info, warn, error)
Definition of Done:
- Core Raft events are logged comprehensively
- Logs are structured and parseable
- Log level is configurable
- Logs provide good visibility into cluster state
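One way to satisfy these criteria is the `tracing`/`tracing-subscriber` crates with JSON output; the events and field names below are illustrative, not a fixed schema:

```rust
use tracing::{info, Level};

fn init_logging() {
    // JSON output with a configurable level (requires the `json` feature of
    // `tracing-subscriber`); key-value output is the default formatter.
    tracing_subscriber::fmt()
        .json()
        .with_max_level(Level::INFO)
        .init();
}

fn log_state_transition(node_id: u64, term: u64, from: &str, to: &str) {
    info!(node_id, term, from, to, "state transition");
}

fn log_vote(node_id: u64, term: u64, candidate: u64, granted: bool, reason: &str) {
    info!(node_id, term, candidate, granted, reason, "vote decision");
}
```

Shorthand field capture (`info!(node_id, term, ...)`) keeps events structured and machine-parseable without string formatting.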
As an Operations Engineer I want logs for log replication activity So that I can monitor replication progress and lag
Acceptance Criteria:
- Log when entries are appended (index, term, count)
- Log replication progress per follower (matchIndex updates)
- Log commit index advancement
- Log state machine application (lastApplied updates)
- Log replication failures and retries
- Logs include relevant indices and node IDs
- Log frequency is reasonable (not too verbose)
- Logs help diagnose replication issues
Definition of Done:
- Replication activity is well-logged
- Logs help identify slow followers
- Commit progress is visible
- Logs are not overwhelming in volume
As an Operations Engineer I want metrics exposed for monitoring systems So that I can track cluster health over time
Acceptance Criteria:
- Expose current term as gauge
- Expose current state (Follower/Candidate/Leader) as gauge
- Expose commit index and last applied as gauges
- Expose election count as counter
- Expose RPC request/response counts as counters
- Metrics follow standard format (Prometheus, StatsD)
- Metrics include node ID label
- Metrics are updated in real-time
Definition of Done:
- Basic metrics are exposed
- Metrics can be scraped by monitoring systems
- Metrics are accurate and up-to-date
- Metrics follow standard conventions
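A sketch using the `prometheus` crate; the metric names, help strings, and `node_id` const label are assumptions to be aligned with the project's naming conventions:

```rust
use prometheus::{IntCounter, IntGauge, Opts, Registry};

pub struct RaftMetrics {
    pub current_term: IntGauge,
    pub commit_index: IntGauge,
    pub elections_started: IntCounter,
}

impl RaftMetrics {
    pub fn register(registry: &Registry, node_id: &str) -> prometheus::Result<Self> {
        let current_term = IntGauge::with_opts(
            Opts::new("raft_current_term", "Current Raft term")
                .const_label("node_id", node_id),
        )?;
        let commit_index = IntGauge::with_opts(
            Opts::new("raft_commit_index", "Highest committed log index")
                .const_label("node_id", node_id),
        )?;
        let elections_started = IntCounter::with_opts(
            Opts::new("raft_elections_started_total", "Elections this node has started")
                .const_label("node_id", node_id),
        )?;
        registry.register(Box::new(current_term.clone()))?;
        registry.register(Box::new(commit_index.clone()))?;
        registry.register(Box::new(elections_started.clone()))?;
        Ok(Self { current_term, commit_index, elections_started })
    }
}
```

Updating is then a matter of calling, for example, `metrics.current_term.set(term as i64)` whenever the term changes.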
As an Operations Engineer I want performance metrics So that I can monitor latency and throughput
Acceptance Criteria:
- Expose commit latency histogram (time from append to commit)
- Expose RPC latency histograms (per RPC type)
- Expose throughput (commands per second)
- Expose log size (entries and bytes)
- Expose snapshot size and frequency
- Metrics include percentiles (p50, p95, p99)
- Metrics help identify performance issues
- Metrics are updated continuously
Definition of Done:
- Performance metrics are exposed
- Latency and throughput are trackable
- Metrics help identify bottlenecks
- Metrics are useful for capacity planning
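Latency histograms can share the same registry; the bucket boundaries below are placeholders to be tuned for the target workload, and p50/p95/p99 are typically derived from the histogram by the monitoring system (e.g. `histogram_quantile` in Prometheus):

```rust
use prometheus::{Histogram, HistogramOpts, Registry};
use std::time::Instant;

pub fn commit_latency_histogram(registry: &Registry) -> prometheus::Result<Histogram> {
    let hist = Histogram::with_opts(
        HistogramOpts::new("raft_commit_latency_seconds", "Time from append to commit")
            .buckets(vec![0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0]),
    )?;
    registry.register(Box::new(hist.clone()))?;
    Ok(hist)
}

pub fn observe_commit(hist: &Histogram, appended_at: Instant) {
    // Record the elapsed time from append to commit in seconds.
    hist.observe(appended_at.elapsed().as_secs_f64());
}
```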
As an Operations Engineer I want a health check endpoint So that load balancers and orchestrators can detect node health
Acceptance Criteria:
- HTTP endpoint returns health status (healthy/unhealthy)
- Health check considers whether the node is running, storage is accessible, and the node is in a valid state
- Returns 200 OK when healthy, 503 when unhealthy
- Response includes basic info (node ID, state, term)
- Health check is fast (< 100ms)
- Configurable health check criteria
- Clear documentation of health check semantics
- Endpoint doesn't require authentication (or has separate auth)
Definition of Done:
- Health check endpoint works reliably
- Orchestrators can use endpoint for health detection
- Unhealthy nodes are detected appropriately
- Endpoint is documented
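A framework-agnostic sketch of the health evaluation; the JSON shape and criteria are assumptions, and the `http_status` mapping would be wired into whichever HTTP server the node embeds:

```rust
use serde::Serialize;

#[derive(Serialize)]
pub struct HealthReport {
    pub healthy: bool,
    pub node_id: u64,
    pub state: String, // "follower" | "candidate" | "leader"
    pub term: u64,
}

impl HealthReport {
    /// Maps the report to an HTTP status code: 200 when healthy, 503 otherwise.
    pub fn http_status(&self) -> u16 {
        if self.healthy { 200 } else { 503 }
    }
}

pub fn check_health(storage_ok: bool, state: &str, node_id: u64, term: u64) -> HealthReport {
    // "Healthy" here means: the node is running, storage responds, and the
    // node is in a valid Raft state; the exact criteria should be configurable.
    let healthy = storage_ok && matches!(state, "follower" | "candidate" | "leader");
    HealthReport { healthy, node_id, state: state.to_string(), term }
}
```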
As an Operations Engineer I want cluster-wide aggregate metrics So that I can monitor overall cluster health
Acceptance Criteria:
- Track quorum status (have majority or not)
- Track leader election frequency
- Track follower lag (max and average)
- Track configuration change events
- Track snapshot frequency and size
- Per-node metrics can be aggregated cluster-wide (via external monitoring)
- Alerts can be configured on metrics
- Metrics help identify cluster-wide issues
Definition of Done:
- Cluster-wide metrics are exposed
- Metrics provide cluster health visibility
- Metrics support alerting strategies
- Metrics are documented
As a Distributed Systems Engineer I want an endpoint to inspect node state So that I can debug issues in development and production
Acceptance Criteria:
- Endpoint returns current state (Follower/Candidate/Leader)
- Returns current term, votedFor, commitIndex, lastApplied
- Returns log summary (size, last index, last term)
- Returns current configuration (cluster members)
- Returns leader ID (if known)
- Response is JSON formatted
- Endpoint requires authentication in production
- Clear documentation of endpoint
Definition of Done:
- Debug endpoint provides comprehensive state view
- Endpoint is useful for troubleshooting
- State information is accurate
- Endpoint is documented
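The response could be a single serializable status struct; the field names below mirror the acceptance criteria and are otherwise assumptions:

```rust
use serde::Serialize;

#[derive(Serialize)]
pub struct NodeStatus {
    pub node_id: u64,
    pub state: String,            // "follower" | "candidate" | "leader"
    pub current_term: u64,
    pub voted_for: Option<u64>,
    pub commit_index: u64,
    pub last_applied: u64,
    pub log_size: u64,
    pub log_last_index: u64,
    pub log_last_term: u64,
    pub leader_id: Option<u64>,
    pub cluster_members: Vec<u64>,
}

/// Serialize the status for the debug endpoint's JSON response.
pub fn to_json(status: &NodeStatus) -> serde_json::Result<String> {
    serde_json::to_string_pretty(status)
}
```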
As a Distributed Systems Engineer I want distributed tracing for requests So that I can trace request flow across cluster nodes
Acceptance Criteria:
- Integrate with tracing framework (OpenTelemetry, Jaeger)
- Trace client requests through consensus process
- Trace RPC calls between nodes
- Spans include relevant context (term, index, node IDs)
- Tracing is configurable (enable/disable, sampling rate)
- Tracing overhead is minimal
- Clear documentation of tracing setup
- Traces help diagnose latency issues
Definition of Done:
- Tracing integration works correctly
- Requests can be traced across nodes
- Tracing provides useful performance insights
- Tracing is documented
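With the `tracing` crate, spans can carry node, term, and index context through the propose path; exporting those spans to Jaeger or an OpenTelemetry collector would be layered on via `tracing-opentelemetry` and is not shown here. The function and span names below are illustrative:

```rust
use tracing::{info_span, instrument};

// Client request path: the span records node_id and term automatically so
// traces can be correlated across nodes handling the same request.
#[instrument(skip(command))]
fn propose(node_id: u64, term: u64, command: Vec<u8>) {
    let replicate = info_span!("replicate_to_followers", entries = command.len());
    replicate.in_scope(|| {
        // ... send AppendEntries RPCs, wait for a quorum, apply to the state machine ...
    });
}
```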
As a Distributed Systems Engineer I want unit tests for core types So that basic type behavior is verified
Acceptance Criteria:
- Tests for Term, LogIndex, NodeId ordering and equality
- Tests for LogEntry creation and serialization
- Tests for PersistentState and VolatileState
- Tests for RPC message types serialization
- Tests cover normal cases and edge cases
- Tests use standard Rust test framework
- Clear test names describe what is tested
- All tests pass consistently
Definition of Done:
- Core types have comprehensive unit tests
- Tests verify expected behavior
- Edge cases are covered
- Tests run fast and reliably
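Example tests in the standard framework, assuming `Term` and `LogIndex` are ordered tuple-struct newtypes as defined in the foundation stories:

```rust
#[cfg(test)]
mod tests {
    use super::{LogIndex, Term};

    #[test]
    fn terms_compare_by_value() {
        assert!(Term(2) > Term(1));
        assert_eq!(Term(3), Term(3));
    }

    #[test]
    fn log_indices_are_ordered() {
        // Ord on the newtype should follow the underlying index.
        let mut indices = vec![LogIndex(3), LogIndex(1), LogIndex(2)];
        indices.sort();
        assert_eq!(indices, vec![LogIndex(1), LogIndex(2), LogIndex(3)]);
    }
}
```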
As a Distributed Systems Engineer I want tests for storage implementations So that persistence behavior is correct
Acceptance Criteria:
- Tests for InMemoryStorage read/write operations
- Tests for FileStorage read/write operations
- Tests for atomic writes and crash scenarios
- Tests for corruption detection
- Tests for recovery from persisted state
- Tests for snapshot storage
- Tests verify storage trait contract
- Tests cover error conditions
Definition of Done:
- Storage implementations have thorough tests
- Persistence correctness is verified
- Error handling is tested
- Tests cover crash recovery scenarios
As a Distributed Systems Engineer I want tests for election logic So that leader election works correctly
Acceptance Criteria:
- Tests for follower timeout and candidate transition
- Tests for vote granting rules (term, log up-to-date, already voted)
- Tests for quorum detection
- Tests for candidate to leader transition
- Tests for split vote and retry
- Tests for higher term detection and step down
- Tests use mocked dependencies (storage, network)
- Tests verify election safety invariants
Definition of Done:
- Election logic is thoroughly tested
- All vote granting rules are verified
- Election transitions work correctly
- Safety properties are tested
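A representative vote-granting test, with `RaftNode`, `RequestVote`, and the handler signature as hypothetical stand-ins for the real election module:

```rust
#[cfg(test)]
mod election_tests {
    use super::*;

    #[test]
    fn rejects_vote_for_candidate_with_stale_log() {
        // Follower whose log ends at (term 3, index 5).
        let mut follower = RaftNode::test_follower(/* last_log_term */ 3, /* last_log_index */ 5);

        // A candidate in a newer term but with an older log must be rejected
        // (the "log up-to-date" rule).
        let request = RequestVote { term: 4, candidate_id: 2, last_log_index: 4, last_log_term: 2 };
        let reply = follower.handle_request_vote(request);

        assert!(!reply.vote_granted);
        assert_eq!(reply.term, 4); // the follower still adopts the higher term
    }
}
```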
As a Distributed Systems Engineer I want tests for log replication So that log consistency is maintained
Acceptance Criteria:
- Tests for log consistency check (prevIndex/prevTerm)
- Tests for appending new entries
- Tests for handling conflicting entries
- Tests for commitIndex advancement
- Tests for matchIndex/nextIndex tracking
- Tests for commit point calculation (majority)
- Tests verify log matching property
- Tests cover edge cases (empty log, first entry)
Definition of Done:
- Log replication logic is well-tested
- Consistency checks are verified
- Commit logic is correct
- Edge cases are covered
As a Distributed Systems Engineer I want integration tests for leader election So that election works across multiple nodes
Acceptance Criteria:
- Test 3-node cluster elects single leader
- Test 5-node cluster elects single leader
- Test election completes within timeout
- Test only one leader per term
- Test followers recognize leader
- Tests use InMemoryTransport
- Tests verify election safety
- Tests are deterministic and repeatable
Definition of Done:
- Multi-node election tests pass
- Leader election is reliable
- Election safety is verified
- Tests run consistently
As a Distributed Systems Engineer I want integration tests for log replication So that logs are consistent across nodes
Acceptance Criteria:
- Test leader replicates entries to followers
- Test followers have identical logs after replication
- Test commit happens when majority replicates
- Test entries are applied to state machine
- Test multiple commands in sequence
- Tests verify log matching property across nodes
- Tests use InMemoryTransport and InMemoryStorage
- Tests check final state machine state matches
Definition of Done:
- Multi-node replication tests pass
- Logs are consistent across cluster
- State machines converge
- Tests are reliable
As a Distributed Systems Engineer I want tests for leader failure scenarios So that cluster recovers from leader failures
Acceptance Criteria:
- Test leader crash triggers new election
- Test new leader is elected with all committed entries
- Test cluster continues accepting commands after re-election
- Test state machine state is preserved
- Tests verify leader completeness property
- Tests simulate leader crash (stop responding)
- Tests verify election timeout works
- Tests check committed entries are not lost
Definition of Done:
- Leader failure tests pass
- Cluster recovers automatically
- Committed data is preserved
- Re-election works reliably
As a Distributed Systems Engineer I want tests for network partition scenarios So that cluster handles partitions safely
Acceptance Criteria:
- Test majority partition can continue
- Test minority partition cannot commit entries
- Test cluster recovers when partition heals
- Test no split-brain occurs
- Tests verify safety during partition
- Tests use InMemoryTransport with partition simulation
- Tests verify entries committed only in majority partition
- Tests check partition healing reconciles logs
Definition of Done:
- Partition tests pass
- Cluster is safe during partitions
- Split-brain is prevented
- Partition recovery works
As a Distributed Systems Engineer I want tests for crash and recovery So that nodes recover correctly after restart
Acceptance Criteria:
- Test node crashes and restarts
- Test node recovers persisted state (term, vote, log)
- Test recovered node rejoins cluster
- Test recovered node replicates any missed entries
- Tests use FileStorage for persistence
- Tests verify recovered state matches pre-crash state
- Tests check node catches up after restart
- Tests verify no data loss
Definition of Done:
- Crash recovery tests pass
- Nodes recover correctly
- Persisted state is accurate
- Recovery is reliable
As a Distributed Systems Engineer I want tests for log compaction So that snapshots work correctly
Acceptance Criteria:
- Test snapshot creation after threshold
- Test log entries are discarded after snapshot
- Test snapshot includes correct state machine state
- Test node can recover from snapshot
- Test InstallSnapshot RPC for lagging followers
- Tests verify snapshot metadata (lastIncludedIndex/Term)
- Tests check storage is reclaimed
- Tests verify state machine state after snapshot recovery
Definition of Done:
- Snapshot tests pass
- Compaction works correctly
- Snapshots preserve state accurately
- Snapshot installation works
As a Distributed Systems Engineer I want property-based tests for safety properties So that Raft invariants are verified across many scenarios
Acceptance Criteria:
- Test election safety (at most one leader per term)
- Test log matching (same index/term implies identical prefix)
- Test leader completeness (leader has all committed entries)
- Test state machine safety (nodes apply same commands at same index)
- Tests generate random inputs (commands, failures, partitions)
- Tests use property testing framework (quickcheck, proptest)
- Tests run many iterations to find edge cases
- Tests verify invariants never violated
Definition of Done:
- Property-based tests verify safety invariants
- Tests run across many random scenarios
- No invariant violations found
- Tests provide high confidence in correctness
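A sketch of the election-safety property with `proptest`; `SimulatedCluster` is a hypothetical deterministic harness that steps the cluster under random events and reports observed leaders:

```rust
#[cfg(test)]
mod safety_props {
    use super::SimulatedCluster; // hypothetical deterministic test harness
    use proptest::prelude::*;
    use std::collections::HashMap;

    proptest! {
        #[test]
        fn at_most_one_leader_per_term(seed in any::<u64>(), steps in 1usize..500) {
            let mut cluster = SimulatedCluster::new(5, seed);
            let mut leaders_by_term: HashMap<u64, u64> = HashMap::new();

            for _ in 0..steps {
                cluster.step_random(); // deliver/drop messages, fire timeouts, crash nodes
                for (term, leader) in cluster.observed_leaders() {
                    // Election safety: a term may never have two different leaders.
                    let prev = leaders_by_term.entry(term).or_insert(leader);
                    prop_assert_eq!(*prev, leader);
                }
            }
        }
    }
}
```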
As a Distributed Systems Engineer I want chaos tests with fault injection So that system handles arbitrary failure combinations
Acceptance Criteria:
- Randomly crash and restart nodes
- Randomly delay or drop messages
- Randomly partition network
- Randomly slow down nodes
- Run client workload during chaos
- Verify no data loss or inconsistency
- Tests run for extended duration (minutes/hours)
- Tests verify safety properties maintained during chaos
Definition of Done:
- Chaos tests run successfully
- System remains safe under arbitrary failures
- No safety violations detected
- Tests provide confidence in production readiness
As a Distributed Systems Engineer I want performance benchmarks So that latency and throughput meet requirements
Acceptance Criteria:
- Benchmark commit latency for single command
- Benchmark throughput (commands per second)
- Benchmark election time (leader failure to new leader)
- Benchmark snapshot creation time
- Benchmarks use realistic cluster sizes (3, 5, 7 nodes)
- Benchmarks run on realistic hardware/network
- Results compared against requirements
- Benchmarks are repeatable
Definition of Done:
- Performance benchmarks run successfully
- Latency meets requirements (< 50ms commit)
- Throughput meets requirements (> 10k ops/sec)
- Performance is documented
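A benchmark skeleton with `criterion` (placed under `benches/`); `TestCluster` and `propose_and_wait_commit` are assumed harness names:

```rust
use criterion::{criterion_group, criterion_main, Criterion};

fn commit_latency(c: &mut Criterion) {
    // Hypothetical 3-node local cluster started once and reused across iterations.
    let mut cluster = TestCluster::start(3);

    c.bench_function("commit_single_command_3_nodes", |b| {
        b.iter(|| cluster.propose_and_wait_commit(b"set x=1".to_vec()))
    });
}

criterion_group!(benches, commit_latency);
criterion_main!(benches);
```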
As a Distributed Systems Engineer I want linearizability verification tests So that consistency guarantees are proven
Acceptance Criteria:
- Record all client operations (commands and reads)
- Record timestamps and responses
- Use linearizability checker (Knossos or similar)
- Verify history is linearizable
- Tests run with concurrent clients
- Tests include failures and partitions
- Tests verify both writes and reads
- Tests fail if non-linearizable behavior detected
Definition of Done:
- Linearizability tests pass
- Consistency is proven formally
- Tests cover concurrent and failure scenarios
- System provides correct consistency guarantees
As an Application Developer I want end-to-end tests demonstrating complete usage So that I understand how to use Raft in my application
Acceptance Criteria:
- Test complete cluster lifecycle (bootstrap, operations, shutdown)
- Test client submitting commands and reading results
- Test adding and removing nodes
- Test rolling upgrades (simulated)
- Test monitoring and metrics collection
- Tests use realistic configuration
- Tests demonstrate best practices
- Tests serve as examples
Definition of Done:
- End-to-end tests cover full usage scenarios
- Tests demonstrate complete Raft functionality
- Tests serve as documentation and examples
- Tests pass reliably
Stories F-1 through F-10 and C-1 through C-3 establish the foundation and should be completed first to enable all other work.
Stories C-4 through C-22 implement the core Raft algorithm and must be completed for a minimally functional system.
Stories P-1 through P-7 (Persistence), N-1 through N-13 (Network), S-1 through S-9 (State Machine) make the system production-ready.
Stories M-1 through M-7 (Membership), O-1 through O-8 (Observability) add operational capabilities.
Stories T-1 through T-15 should be developed in parallel with implementation stories, not deferred to the end.
- F-series has minimal dependencies, can start immediately
- C-series depends on F-series foundation types
- P-series depends on F-7 (Storage trait)
- N-series depends on F-6 (RPC types) and C-series (handlers)
- S-series depends on C-series (state machine application logic)
- M-series depends on C-series and N-series (requires working cluster)
- O-series can be developed in parallel with implementation
- T-series should accompany each phase
These 110+ user stories collectively implement the Raft consensus algorithm by:
- Establishing Foundation (F-series): Creating type-safe representations of Raft concepts and storage abstractions
- Implementing Core Algorithm (C-series): Building leader election, log replication, and commit mechanisms that ensure safety
- Adding Persistence (P-series): Making state durable across crashes, preventing data loss
- Enabling Communication (N-series): Allowing nodes to send RPCs and coordinate across a network
- Integrating Applications (S-series): Connecting Raft to state machines, enabling real applications
- Supporting Operations (M-series): Allowing dynamic cluster membership changes
- Providing Visibility (O-series): Making the system observable and debuggable
- Ensuring Correctness (T-series): Verifying safety properties and validating the implementation
Each story is independently valuable, testable, and builds incrementally toward a complete, production-ready Raft implementation in Rust that satisfies all functional requirements, non-functional requirements, and business rules from the problem analysis.
The stories follow INVEST principles:
- Independent: Most stories can be implemented without blocking on others in the same phase
- Negotiable: Details can be refined during implementation
- Valuable: Each delivers observable functionality or verification
- Estimable: Scope is clear enough to estimate complexity
- Small: Each can be completed in days, not weeks
- Testable: Clear acceptance criteria enable verification
Document Version: 1.0
Generated: 2026-01-15
Based on Problem Analysis: /Users/andrealaforgia/dev/personal/claude-code-agents/problem-analysis.md