@oddur
Last active October 28, 2025 15:11
SimTime Service Multi-Instance Architecture Plan - Leader-Follower approach for horizontal scaling

SimTime Service Multi-Instance Architecture Plan

Executive Summary

This document outlines a solution to enable the SimTime service to run safely across multiple instances, eliminating the current single point of failure while maintaining time consistency across the distributed system.


ELI5: The Solution in Simple Terms

The Problem

Imagine you have a single master clock in a building that everyone looks at to know the time. If that clock breaks, nobody knows what time it is anymore. That's our current problem - we have only one time server, and if it fails, the entire game loses track of time.

The Solution

Instead of one clock, we'll have multiple clocks that work together:

  1. The Leader Clock 🕐

    • One clock is chosen as the "leader" - this is the official time
    • It's like the principal's clock at school that everyone agrees is correct
    • If it breaks, another clock quickly becomes the new leader
  2. The Follower Clocks 🕑🕒🕓

    • All other clocks are "followers" that sync with the leader
    • They ask the leader "what time is it?" every minute
    • Between asks, they keep track of time on their own (like counting seconds)
    • They remember the difference between their time and the leader's time
  3. How It Stays in Sync

    • When a follower asks for time, it's like calling someone in another timezone
    • Follower: "What time do you have?" (sent at 2:00:00)
    • Leader: "It's 2:00:05"
    • Follower: "Okay, I'm 5 seconds behind you"
    • Now the follower adds 5 seconds to its own counting
  4. Why This Is Smart

    • Game clients can ask ANY clock for the time
    • Followers don't need to constantly bother the leader
    • If the leader breaks, a follower becomes the new leader once the 30-second lease runs out and the next 10-second poll notices (roughly 15-40 seconds after the failure)
    • Everyone always agrees on what time it is (within milliseconds)

Real World Analogy

It's like having multiple smartphones that all sync their time with the cell tower. Even if one phone dies, the others keep working. And even if the main cell tower fails, the phones keep showing the right time because they know how to count seconds on their own.

The clever part: We're reusing the same time-sync technology that game clients already use to sync with the server. We're just making some servers act like "super clients" that can also answer time questions from real game clients.


Why PostgreSQL: Aligning with Unified Datastore Strategy

Technical Strategy Alignment

Our technical guidelines state:

"We have a single, unified way of storing data at rest in Postgres databases. Where possible, lean into going along the grain of relational databases since that is where the data lives."

Using PostgreSQL for SimTime coordination provides critical advantages:

  1. Unified Backup/Restore

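    # When you back up PostgreSQL, time state comes along
    # (-Fc writes the custom archive format that pg_restore expects)
    pg_dump -Fc production > backup.dump
    
    # Restore includes the time state as of the backup
    pg_restore -d production backup.dump
    # SimTime resumes from the snapshot stored in the backup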
  2. Transactional Consistency

    BEGIN;
    -- Game event and time update in same transaction
    INSERT INTO game_events (...);
    UPDATE simtime_state SET ...;
    COMMIT;  -- Both succeed or both fail
  3. Zero Additional Infrastructure

    • No separate coordination store (such as a NATS KV bucket) to maintain
    • No separate backup strategy
    • Single connection pool
    • One monitoring system

PostgreSQL vs NATS - Division of Responsibilities

| Component | PostgreSQL | NATS | Our Choice |
| --- | --- | --- | --- |
| Leader Election | Lock tables with TTL | KV with TTL | PostgreSQL ✅ |
| State Persistence | Native tables with ACID | KV Store (eventual) | PostgreSQL ✅ |
| Backup/Restore | Included automatically | Separate system | PostgreSQL ✅ |
| Observability | Rich SQL queries | Limited | PostgreSQL ✅ |
| Instance Coordination | Polling (simple, reliable) | Pub/Sub | PostgreSQL ✅ |
| Game Client Broadcasting | Possible but suboptimal | Excellent fan-out | NATS ✅ |

We use PostgreSQL for reliability (leader election, state) and NATS for performance (broadcasting to thousands of game clients).

The Decisive Factor

When disaster recovery happens:

  • PostgreSQL: Restore one backup, time state included
  • NATS: Restore PostgreSQL + NATS separately, hope they align

This alone justifies PostgreSQL given our unified datastore strategy.


Current State & Problem

The Issue

The SimTime service currently cannot scale horizontally due to its stateful nature:

Current Architecture:
┌─────────────────────┐
│  SimTime Service    │ ← Single Instance (SPOF)
│  - Mutable State    │
│  - No Persistence   │
│  - No Coordination  │
└─────────────────────┘
           ↓
    All Game Services

Why It's a Problem

  1. Single Point of Failure (SPOF)

    • Service crash = no time synchronization for entire system
    • Deployment/updates require downtime
    • No redundancy for this critical service
  2. State Management Issues

    // Each instance maintains local state:
    private DateTime serverEpoch;
    private long multiplier;
    private DateTime? pausedTimestamp;
    • Multiple instances = divergent state
    • No synchronization mechanism
    • Race conditions on time manipulation
  3. Scalability Limitations

    • Cannot handle increased load by adding instances
    • All time sync requests hit single instance
    • Becomes bottleneck as system grows
  4. Sync Load Overwhelms Single Instance

    Production Reality with 10,000 concurrent clients:
    - Each client syncs every 60 seconds
    - 10,000 clients ÷ 60 seconds ≈ 167 sync requests/second
    - Each sync involves: network I/O, time calculation, response
    - Single instance CPU and network become saturated
    
    Result: Sync latency increases → Time drift → Game desynchronization
    
    • Vertical scaling (bigger instance) has limits
    • Cannot distribute load across multiple instances
    • Critical during peak gaming hours or events
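
The load arithmetic above can be checked with a few lines of Python (used here for brevity; the service itself is C#). This is only a sketch: the function name is hypothetical, and it assumes clients are spread evenly across instances with unaligned sync timers.

```python
def sync_requests_per_second(clients: int, sync_interval_s: float, instances: int = 1) -> float:
    """Average sync requests per second that each instance has to answer.

    Assumes an even spread of clients across instances and no
    synchronized bursts (both are simplifications).
    """
    if clients < 0 or sync_interval_s <= 0 or instances < 1:
        raise ValueError("need clients >= 0, sync_interval_s > 0, instances >= 1")
    return clients / sync_interval_s / instances


# Figures from the text: 10,000 clients, one sync every 60 seconds.
single = sync_requests_per_second(10_000, 60)
three = sync_requests_per_second(10_000, 60, instances=3)
print(f"1 instance:  {single:.0f} req/s")
print(f"3 instances: {three:.0f} req/s each")
```

With three instances behind a load balancer, each one would see roughly a third of the single-instance rate under these assumptions.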

Proposed Solution: Leader-Follower Architecture

High-Level Design

Proposed Architecture:
┌─────────────────────┐
│   Leader Instance   │ ← Source of Truth
│  (SimTimeServer)    │   Elected via PostgreSQL
└─────────────────────┘
           ↓
    PostgreSQL + Optional Pub/Sub
           ↓
┌─────────────────────┐  ┌─────────────────────┐
│  Follower Instance  │  │  Follower Instance  │
│  (SimTimeClient)    │  │  (SimTimeClient)    │
└─────────────────────┘  └─────────────────────┘
           ↓                        ↓
     Game Clients              Game Clients

Key Innovation: Followers as SimTime Clients

The key insight: follower instances use the existing SimTimeClient library to synchronize with the leader, just like game clients do. This compensates for clock skew between instances and keeps every follower consistent with the leader.

How It Works

1. Leader Election (via PostgreSQL Lock Table)

// Optimized: Single query returns both election result AND current leader endpoint
// This eliminates an extra round-trip to the database for followers
public async Task<LeaderElectionResult> TryBecomeLeader() {
    var sql = @"
        WITH election_attempt AS (
            INSERT INTO simtime_leader
            (resource_name, instance_id, instance_endpoint, expires_at)
            VALUES ('simtime', @instanceId, @endpoint, NOW() + INTERVAL '30 seconds')
            ON CONFLICT (resource_name) DO UPDATE SET
                instance_id = @instanceId,
                instance_endpoint = @endpoint,
                acquired_at = NOW(),
                expires_at = NOW() + INTERVAL '30 seconds',
                fencing_token = simtime_leader.fencing_token + 1  -- New token on every takeover
            WHERE simtime_leader.expires_at < NOW()  -- Only if expired
            RETURNING instance_id, instance_endpoint
        )
        -- Column aliases match the LeaderElectionResult property names
        SELECT
            instance_id AS LeaderId,
            instance_endpoint AS LeaderEndpoint,
            instance_id = @instanceId AS BecameLeader
        FROM election_attempt
        UNION ALL
        SELECT
            instance_id AS LeaderId,
            instance_endpoint AS LeaderEndpoint,
            false AS BecameLeader
        FROM simtime_leader
        WHERE resource_name = 'simtime'
            AND NOT EXISTS (SELECT 1 FROM election_attempt)
        LIMIT 1";

    var result = await db.QuerySingleAsync<LeaderElectionResult>(sql,
        new { instanceId, endpoint });

    if (result.BecameLeader) {
        StartLeaseRenewal();  // Renew every 15 seconds
        RunAsLeader();        // Full SimTimeServer mode
    } else {
        // We already have the leader endpoint from the query!
        await ConnectToLeader(result.LeaderEndpoint);
        RunAsFollower();      // SimTimeClient mode
    }

    return result;
}

public class LeaderElectionResult {
    public bool BecameLeader { get; set; }
    public string LeaderEndpoint { get; set; }
    public string LeaderId { get; set; }
}
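
The lease semantics of the election query can be illustrated with an in-memory model. The Python sketch below is a hypothetical stand-in for the single-row simtime_leader table, not the service's implementation: class and method names are invented for illustration, and bumping the fencing token on each takeover is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

LEASE_SECONDS = 30.0  # mirrors the 30-second TTL used in the SQL


@dataclass
class Lease:
    instance_id: str
    endpoint: str
    expires_at: float
    fencing_token: int


class LeaderTable:
    """In-memory stand-in for the single-row simtime_leader table (hypothetical)."""

    def __init__(self) -> None:
        self.row: Optional[Lease] = None

    def try_become_leader(self, instance_id: str, endpoint: str, now: float) -> Tuple[bool, Lease]:
        """Take the lease only if it is free or expired; always return the holder."""
        if self.row is None or self.row.expires_at < now:
            # Assumption: every takeover gets a strictly larger fencing token.
            token = 1 if self.row is None else self.row.fencing_token + 1
            self.row = Lease(instance_id, endpoint, now + LEASE_SECONDS, token)
            return True, self.row
        return False, self.row

    def renew(self, instance_id: str, now: float) -> bool:
        """Extend the lease, but only for the instance that still holds it."""
        row = self.row
        if row is not None and row.instance_id == instance_id and row.expires_at >= now:
            row.expires_at = now + LEASE_SECONDS
            return True
        return False


table = LeaderTable()
print(table.try_become_leader("a", "a:5000", now=0.0)[0])   # first caller wins
print(table.try_become_leader("b", "b:5000", now=5.0)[0])   # lease still held by "a"
print(table.try_become_leader("b", "b:5000", now=31.0)[0])  # lease expired, "b" takes over
```

As in the SQL, a losing caller still learns who the current leader is from the returned row, so no second lookup is needed.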

2. Follower Synchronization

public class FollowerMode {
    private SimTimeClient clientToLeader;

    // async Task (not void) so the initial sync below can be awaited
    public async Task Initialize(string leaderEndpoint) {
        // Create transport to communicate with leader
        var transport = new GrpcSimTimeTransport(leaderEndpoint);

        // Use SimTimeClient for time synchronization
        clientToLeader = new SimTimeClient(transport);
        clientToLeader.StartSyncing();

        // Wait for initial sync
        await clientToLeader.WaitUntilTimeIsSynced();
    }

    // Serve client requests using synchronized time
    public DateTime GetCurrentTime() {
        return clientToLeader.ServerNow; // Locally extrapolated!
    }
}

3. Clock Synchronization Algorithm (NTP-style)

The SimTimeClient estimates its clock offset from a request/response exchange with the leader:

1. Follower sends sync request with timestamp T1
2. Leader receives at T2, responds at T3
3. Follower receives response at T4

Round Trip Time = (T4 - T1)
One-way latency ≈ RTT / 2
Clock offset ≈ T3 - (T1 + one-way latency)

Result: Follower has an estimate of its offset to the leader's clock; the error is at most RTT / 2 and approaches zero when the network path is symmetric
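
As a worked example of the formulas above, here is a minimal Python sketch (Python for brevity; the function name is hypothetical and this is not the SimTimeClient code). It uses integer milliseconds, with T1 and T4 read from the follower's clock and T3 from the leader's clock.

```python
from typing import Tuple


def estimate_offset(t1_ms: int, t3_ms: int, t4_ms: int) -> Tuple[float, int]:
    """Return (estimated offset, round-trip time) in milliseconds.

    t1_ms: follower clock when the request was sent
    t3_ms: leader clock when the response was sent
    t4_ms: follower clock when the response arrived
    Assumes the outbound and return paths take equally long.
    """
    rtt = t4_ms - t1_ms
    one_way = rtt / 2
    offset = t3_ms - (t1_ms + one_way)
    return offset, rtt


# Follower sends at 100.000 s and receives at 100.080 s on its own clock;
# the leader stamped the reply at 105.040 s on the leader's clock.
offset, rtt = estimate_offset(100_000, 105_040, 100_080)
print(f"RTT = {rtt} ms, follower is {offset:.0f} ms behind the leader")
```

The follower then adds the offset to its local clock whenever it needs the leader's time; a positive offset means the follower's clock is behind.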

4. Request Routing

| Operation | Leader Behavior | Follower Behavior |
| --- | --- | --- |
| GetTimeData | Serve from local state | Serve from synchronized state |
| Sync | Provide time reference | Provide synchronized time |
| Time Manipulation | Execute and broadcast | Proxy to leader |
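
The routing table can be expressed as a small dispatch function. The Python sketch below is illustrative only: the returned action strings are hypothetical labels, and the set of manipulation operations (Pause, Resume, ChangeMultiplier) is taken from the method names used elsewhere in this document.

```python
from enum import Enum


class Role(Enum):
    LEADER = "leader"
    FOLLOWER = "follower"


# Reads are answered by whichever instance receives them.
READ_OPS = {"GetTimeData", "Sync"}
# Time manipulations must be executed by the leader.
WRITE_OPS = {"Pause", "Resume", "ChangeMultiplier"}


def route(role: Role, operation: str) -> str:
    """Decide how an instance in the given role handles an operation."""
    if operation in READ_OPS:
        return "serve_locally"
    if operation in WRITE_OPS:
        return "execute_and_broadcast" if role is Role.LEADER else "proxy_to_leader"
    raise ValueError(f"unknown operation: {operation}")


for role in Role:
    for op in ("GetTimeData", "Pause"):
        print(f"{role.value:8s} {op:12s} -> {route(role, op)}")
```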

5. Client Notification via NATS

While PostgreSQL handles leader election and state persistence, NATS is still used for broadcasting time modification events to all connected game clients:

// Leader-side implementation
public class TimeManipulationBroadcast {
    private readonly INatsPublisher natsPublisher;

    public async Task OnTimeManipulation(TimeManipulationEvent evt) {
        // Leader executes the manipulation
        UpdateLocalState(evt);

        // Save to PostgreSQL for persistence
        await SaveToPostgreSQL(evt);

        // Broadcast to ALL game clients via NATS
        await natsPublisher.PublishAsync("simtime.events", new {
            Type = evt.Type,  // "pause", "resume", "multiplier_change", "advance"
            NewState = GetCurrentTimeData(),
            Timestamp = DateTime.UtcNow
        });

        // Note: Follower instances detect changes by polling PostgreSQL,
        // not through NATS. This NATS broadcast is for game clients only.
    }
}

// Follower-side implementation: Always proxy to leader
public class FollowerTimeManipulation {
    private readonly SimTimeServerInternalServiceClient leaderClient;
    private volatile bool isLeader;  // Not readonly: the role changes after a failover

    public async Task<PauseResponse> PauseAsync(PauseRequest request) {
        if (isLeader) {
            // Leader executes directly
            return await ExecutePauseAsync(request);
        } else {
            // Follower ALWAYS proxies to leader via leaderClient
            return await leaderClient.PauseAsync(request);
        }
    }

    public async Task<ResumeResponse> ResumeAsync(ResumeRequest request) {
        if (!isLeader) {
            // Always proxy to leader
            return await leaderClient.ResumeAsync(request);
        }
        return await ExecuteResumeAsync(request);
    }

    public async Task<ChangeMultiplierResponse> ChangeMultiplierAsync(ChangeMultiplierRequest request) {
        if (!isLeader) {
            // Always proxy to leader
            return await leaderClient.ChangeMultiplierAsync(request);
        }
        return await ExecuteChangeMultiplierAsync(request);
    }
}

This hybrid approach leverages:

  • PostgreSQL for reliable state and leader coordination
  • NATS for efficient fan-out to thousands of game clients
  • Proxying ensures all time manipulations go through the leader
  • Existing client code continues to work unchanged

6. Unified Polling Architecture (No LISTEN/NOTIFY)

We intentionally use simple polling instead of PostgreSQL LISTEN/NOTIFY for follower coordination:

Why Polling Is Superior Here:

  • Connection Simplicity: No persistent DB connections or reconnection logic
  • Container-Friendly: Works perfectly with connection pooling and Kubernetes
  • Predictable Behavior: Fixed 10-second intervals, easy to debug and monitor
  • Resilient: Network blips don't break anything, just delay by one poll cycle
  • Simpler Testing: No need to mock LISTEN/NOTIFY, fully deterministic

The Unified Polling Loop:

public class FollowerPollingService {
    private string currentLeaderEndpoint;
    private DateTime lastStateUpdate;
    private readonly TimeSpan pollInterval = TimeSpan.FromSeconds(10);

    public async Task RunPollingLoop() {
        while (running) {
            try {
                // Single efficient query gets everything we need
                var sql = @"
                    SELECT
                        l.instance_id as leader_id,
                        l.instance_endpoint as leader_endpoint,
                        l.expires_at > NOW() as leader_active,
                        s.sim_time_ticks,
                        s.epoch_ticks,
                        s.multiplier,
                        s.paused_timestamp_ticks,
                        s.saved_at as state_updated_at
                    FROM simtime_leader l
                    LEFT JOIN LATERAL (
                        SELECT * FROM simtime_state
                        WHERE is_current = true
                        ORDER BY saved_at DESC LIMIT 1
                    ) s ON true  -- Keep the leader row even before the first snapshot exists
                    WHERE l.resource_name = 'simtime'";

                var status = await db.QuerySingleOrDefaultAsync<dynamic>(sql);

                if (status == null || !status.leader_active) {
                    // No active leader - attempt election, then wait one poll
                    // interval before re-checking (a bare continue would skip
                    // the delay below and spin)
                    await TryBecomeLeader();
                    await Task.Delay(pollInterval);
                    continue;
                }

                // Check if leader changed
                if (status.leader_endpoint != currentLeaderEndpoint) {
                    Logger.Info($"Leader changed to {status.leader_endpoint}");
                    await ReconnectToNewLeader(status.leader_endpoint);
                    currentLeaderEndpoint = status.leader_endpoint;
                }

                // Check if time state was manipulated by admin
                if (status.state_updated_at != null && status.state_updated_at > lastStateUpdate) {
                    Logger.Info("Time state changed, refreshing");
                    RefreshLocalTimeState(status);
                    lastStateUpdate = status.state_updated_at;
                }

            } catch (Exception ex) {
                Logger.Error($"Polling failed: {ex.Message}");
                // Continue polling - system remains available
            }

            await Task.Delay(pollInterval);
        }
    }
}

Acceptable Latency Trade-offs:

| Event | Detection Latency | Impact | Mitigation |
| --- | --- | --- | --- |
| Leader failure | 0-10 seconds after lease expiry | Rare event (weekly/monthly) | Followers continue serving |
| Time manipulation | 0-10 seconds | Admin operation, non-critical | Game clients get instant NATS update |
| New leader elected | 0-10 seconds | Follows leader failure | Extrapolation continues working |

The 10-second polling interval is one third of the 30-second leader TTL, so followers notice an expired lease within one poll cycle while keeping database load minimal.
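
To make the timing concrete, the Python sketch below (a hypothetical helper, not part of the service) combines the three intervals used in this document: the 30-second lease TTL, the 15-second renewal interval, and the 10-second poll interval. It estimates how long after a leader crash a follower first sees the lease as expired, ignoring query time and the reconnect that follows.

```python
from typing import Tuple


def failover_window(lease_ttl_s: float, renew_interval_s: float, poll_interval_s: float) -> Tuple[float, float]:
    """Return (best, worst) seconds from leader crash to detection by a follower."""
    if not 0 < renew_interval_s < lease_ttl_s or poll_interval_s <= 0:
        raise ValueError("need 0 < renew_interval_s < lease_ttl_s and poll_interval_s > 0")
    # Best case: the crash happens just before a renewal was due, so only
    # (TTL - renewal interval) of lease remains, and a poll lands right after expiry.
    best = lease_ttl_s - renew_interval_s
    # Worst case: the crash happens right after a renewal (a full TTL remains),
    # and the follower's poll has just missed the expiry.
    worst = lease_ttl_s + poll_interval_s
    return best, worst


best, worst = failover_window(lease_ttl_s=30, renew_interval_s=15, poll_interval_s=10)
print(f"leader failure detected {best}-{worst} s after the crash")
```

Followers keep serving extrapolated time throughout this window, so the gap only affects time manipulation requests that need a leader.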


PostgreSQL Schema

Database Tables

-- Leader election table
CREATE TABLE simtime_leader (
    resource_name TEXT PRIMARY KEY DEFAULT 'simtime',
    instance_id TEXT NOT NULL,
    instance_endpoint TEXT NOT NULL,  -- Where followers connect
    acquired_at TIMESTAMPTZ DEFAULT NOW(),
    expires_at TIMESTAMPTZ NOT NULL,
    heartbeat_at TIMESTAMPTZ DEFAULT NOW(),
    fencing_token BIGSERIAL,  -- Monotonic counter for split-brain prevention
    metadata JSONB,  -- Version, capabilities, etc.

    CONSTRAINT valid_lease CHECK (expires_at > acquired_at)
);

-- State persistence table
CREATE TABLE simtime_state (
    id BIGSERIAL PRIMARY KEY,
    saved_at TIMESTAMPTZ DEFAULT NOW(),
    sim_time_ticks BIGINT NOT NULL,
    epoch_ticks BIGINT NOT NULL,
    multiplier INTEGER NOT NULL,
    paused_timestamp_ticks BIGINT,
    is_current BOOLEAN DEFAULT true,
    saved_by TEXT NOT NULL,
    fencing_token BIGINT NOT NULL  -- Must match leader's token
);

-- Audit trail for time manipulations
CREATE TABLE simtime_events (
    id BIGSERIAL PRIMARY KEY,
    event_time TIMESTAMPTZ DEFAULT NOW(),
    event_type TEXT NOT NULL,  -- 'pause', 'resume', 'multiplier_change', 'advance', 'leader_change', 'clean_shutdown'
    old_value JSONB,
    new_value JSONB,
    performed_by TEXT NOT NULL,
    instance_id TEXT NOT NULL
);

-- Indexes for performance
CREATE INDEX idx_current_state ON simtime_state(is_current)
WHERE is_current = true;

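-- Index predicates may only use IMMUTABLE functions, so NOW() cannot
-- appear here; filter on expires_at > NOW() at query time instead
CREATE INDEX idx_active_leader ON simtime_leader(expires_at DESC);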

CREATE INDEX idx_recent_events ON simtime_events(event_time DESC);
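
The schema comments describe fencing_token as a monotonic counter for split-brain prevention, but the document does not show how it is enforced. The Python sketch below illustrates the general fencing-token technique under stated assumptions: the class is a hypothetical in-memory stand-in for the state table, and rejecting any write with a token lower than the highest one seen is an assumed rule, not confirmed behavior of the service.

```python
from typing import Any, Optional


class FencedStateStore:
    """Accepts a write only if its fencing token is not older than the newest seen."""

    def __init__(self) -> None:
        self.highest_token = 0
        self.state: Optional[Any] = None

    def write(self, fencing_token: int, state: Any) -> bool:
        if fencing_token < self.highest_token:
            # A deposed leader is still writing with its old token: reject.
            return False
        self.highest_token = fencing_token
        self.state = state
        return True


store = FencedStateStore()
print(store.write(1, "snapshot from leader A"))      # accepted
print(store.write(2, "snapshot from leader B"))      # accepted, B took over
print(store.write(1, "late snapshot from leader A")) # rejected as stale
print(store.state)
```

In SQL the same check could be folded into the snapshot statement as a condition on the leader row's current token, so a leader that lost its lease without noticing cannot overwrite newer state.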

Why Lock Table Instead of Advisory Locks

| Aspect | Advisory Locks | Lock Table | Our Choice |
| --- | --- | --- | --- |
| Automatic cleanup | ✅ On disconnect | ❌ Manual TTL | Lock Table |
| Survives brief disconnects | ❌ Lost immediately | ✅ TTL-based | Lock Table ✅ |
| Observability | ❌ No metadata | ✅ Full visibility | Lock Table ✅ |
| Debugging | ❌ Can't see holder | ✅ SQL queries | Lock Table ✅ |
| Graceful handoff | ❌ Not possible | ✅ Update owner | Lock Table ✅ |
| Audit trail | ❌ None | ✅ History table | Lock Table ✅ |
| Works with polling | ❌ Requires connection | ✅ Stateless queries | Lock Table ✅ |

The lock table approach provides production-grade observability:

-- Who is the leader?
SELECT instance_id, instance_endpoint, expires_at - NOW() as remaining
FROM simtime_leader WHERE expires_at > NOW();

-- Leader changes in last hour
SELECT * FROM simtime_events
WHERE event_type = 'leader_change'
AND event_time > NOW() - INTERVAL '1 hour';

Detailed Sequence Diagrams

1. Leader Election Sequence

sequenceDiagram
    participant I1 as Instance 1
    participant I2 as Instance 2
    participant I3 as Instance 3
    participant PG as PostgreSQL

    Note over I1,PG: Startup - All instances try to become leader
    I1->>PG: INSERT INTO simtime_leader ... ON CONFLICT UPDATE<br/>WHERE expires_at < NOW()
    PG-->>I1: Success (became_leader = true)
    Note over I1: Becomes Leader

    I2->>PG: INSERT INTO simtime_leader ... ON CONFLICT UPDATE<br/>WHERE expires_at < NOW()
    PG-->>I2: Failed (became_leader = false, leader_endpoint = Instance 1)
    Note over I2: Becomes Follower

    I3->>PG: INSERT INTO simtime_leader ... ON CONFLICT UPDATE<br/>WHERE expires_at < NOW()
    PG-->>I3: Failed (became_leader = false, leader_endpoint = Instance 1)
    Note over I3: Becomes Follower

    loop Every 15 seconds
        I1->>PG: UPDATE simtime_leader SET expires_at = NOW() + 30s
        PG-->>I1: Success (lease renewed)
    end

    Note over I2,I3: Followers poll for leader changes (no LISTEN/NOTIFY)
    loop Every 10 seconds (polling loop)
        I2->>PG: SELECT leader, time_state FROM simtime_leader, simtime_state
        PG-->>I2: Current leader and state
        I3->>PG: SELECT leader, time_state FROM simtime_leader, simtime_state
        PG-->>I3: Current leader and state
    end

2. Follower Time Synchronization

sequenceDiagram
    participant F as Follower Instance
    participant SC as SimTimeClient
    participant L as Leader Instance
    participant LS as Leader SimTimeServer

    Note over F,LS: Initial Synchronization
    F->>SC: Initialize(leaderEndpoint)
    SC->>SC: StartSyncing()

    loop Every 60 seconds
        SC->>L: SyncNoStreamRequest(T1=local_time)
        L->>LS: OnTimeSync(T1)
        LS->>LS: T2=DateTime.UtcNow
        LS-->>L: TimeSyncResponse(T2)
        L-->>SC: Response(ServerTime=T2, RTT)
        SC->>SC: Calculate offset = T2 - (T1 + RTT/2)
        SC->>SC: Store serverTimeDifference
    end

    SC->>L: GetTimeDataRequest()
    L->>LS: GetTimeData()
    LS-->>L: TimeData(epoch, multiplier, paused)
    L-->>SC: TimeDataResponse
    SC->>SC: Store TimeData
    Note over SC: Now synchronized

3. Client Request Handling - Leader vs Follower

sequenceDiagram
    participant GC as Game Client
    participant LB as Load Balancer
    participant F as Follower
    participant L as Leader
    participant SC as SimTimeClient

    alt GetTimeData Request to Follower
        GC->>LB: GetTimeData()
        LB->>F: GetTimeData()
        F->>SC: ServerNow
        SC->>SC: Calculate: lastSync + offset + elapsed
        SC-->>F: Synchronized time
        F-->>GC: TimeDataResponse
        Note over GC,F: No network call to leader!
    else GetTimeData Request to Leader
        GC->>LB: GetTimeData()
        LB->>L: GetTimeData()
        L->>L: Return local state
        L-->>GC: TimeDataResponse
    end

4. Time Manipulation Flow

sequenceDiagram
    participant Admin as Admin Client
    participant F as Follower
    participant L as Leader
    participant PG as PostgreSQL
    participant NATS as NATS
    participant GC as Game Clients
    participant F2 as Other Followers

    alt Request to Follower
        Admin->>F: Pause()
        F->>F: Check if leader
        Note over F: Not leader - proxy to leader
        F->>L: Proxy: Pause()
        L->>L: UpdateState(paused=true)
        L->>PG: INSERT INTO simtime_events(...)
        L->>PG: UPDATE simtime_state SET ...
        L->>NATS: Publish("simtime.events", newState)
        NATS-->>GC: Broadcast to all game clients
        Note over F,F2: Followers detect change on next poll (0-10s)
        L-->>F: Success
        F-->>Admin: Success
    else Request to Leader
        Admin->>L: Pause()
        L->>L: UpdateState(paused=true)
        L->>PG: INSERT INTO simtime_events(...)
        L->>PG: UPDATE simtime_state SET ...
        L->>NATS: Publish("simtime.events", newState)
        NATS-->>GC: Broadcast to all game clients
        Note over F,F2: Followers detect change on next poll (0-10s)
        L-->>Admin: Success
    end

5. Leader Failure & Recovery

sequenceDiagram
    participant L as Leader (Instance 1)
    participant PG as PostgreSQL
    participant F1 as Follower 1
    participant F2 as Follower 2
    participant GC as Game Clients

    Note over L: Leader healthy, renewing lease
    loop Every 15 seconds
        L->>PG: UPDATE simtime_leader SET expires_at = NOW() + 30s
        PG-->>L: Success
        L->>PG: INSERT INTO simtime_state (snapshot)
        PG-->>L: State saved
    end

    Note over L: Leader crashes!
    L-xL: Crash/Network Issue

    Note over PG: Lease expires 30 seconds after the last renewal<br/>(15-30 seconds after the crash)
    Note over F1,F2: Followers detect on next poll (0-10s)

    loop Every 10 seconds (polling)
        F1->>PG: SELECT * FROM simtime_leader WHERE expires_at > NOW()
        PG-->>F1: No active leader (TTL expired)
        Note over F1: Detected leader failure
        break Leader gone
        end
    end

    Note over F1,F2: Race to become leader
    F1->>PG: INSERT ... ON CONFLICT UPDATE WHERE expires_at < NOW()
    PG-->>F1: Success (became_leader = true)

    F1->>PG: SELECT * FROM simtime_state WHERE is_current = true
    PG-->>F1: Latest snapshot
    Note over F1: Becomes new Leader

    F2->>PG: INSERT ... ON CONFLICT UPDATE WHERE expires_at < NOW()
    PG-->>F2: Failed (became_leader = false, leader_endpoint = Follower 1)
    Note over F2: Remains Follower
    F2->>F1: StartSyncing() with new leader (no extra query needed!)

    Note over GC: Continuous service during transition
    GC->>F2: GetTimeData()
    F2->>F2: Serve from cached/extrapolated time
    F2-->>GC: TimeDataResponse

6. Time Extrapolation in Followers

sequenceDiagram
    participant GC as Game Client
    participant F as Follower
    participant SC as SimTimeClient
    participant Cache as Local State

    Note over Cache: Last sync: T0, Offset: +50ms

    GC->>F: GetTimeData() at T0+30s
    F->>SC: ServerNow
    SC->>Cache: Get last sync data
    Cache-->>SC: lastSync=T0, offset=50ms
    SC->>SC: elapsed = Now() - lastSync = 30s
    SC->>SC: time = T0 + offset + elapsed
    SC->>SC: Apply multiplier (e.g., 24x)
    SC->>SC: simTime = T0 + (30s * 24) = T0+12min
    SC-->>F: Calculated SimTime
    F-->>GC: TimeDataResponse(T0+12min)

    Note over GC,Cache: No network call needed!
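
The extrapolation in the diagram above can be sketched in Python (for brevity; the real logic lives in the C# SimTimeClient and is not shown in this document). All names and the reference-point representation are assumptions: the follower first converts its local clock to the leader's clock using the stored offset, then scales the elapsed leader time by the multiplier, freezing at the pause timestamp if the simulation is paused.

```python
from typing import Optional


def leader_clock_ms(local_now_ms: int, offset_ms: int) -> int:
    """Follower's estimate of the leader's wall clock."""
    return local_now_ms + offset_ms


def sim_time_ms(sim_at_ref_ms: int, ref_leader_ms: int, leader_now_ms: int,
                multiplier: int, paused_at_leader_ms: Optional[int] = None) -> int:
    """Simulated time, extrapolated from a known reference point.

    sim_at_ref_ms is the simulated time that was current when the leader's
    clock read ref_leader_ms. While paused, time stops at the pause timestamp.
    """
    effective_now = leader_now_ms if paused_at_leader_ms is None else paused_at_leader_ms
    return sim_at_ref_ms + (effective_now - ref_leader_ms) * multiplier


# Values from the diagram: offset +50 ms, 30 s since the last sync, 24x multiplier.
ref = leader_clock_ms(0, 50)        # leader clock at the last sync
now = leader_clock_ms(30_000, 50)   # leader clock 30 s later
advanced = sim_time_ms(0, ref, now, multiplier=24)
print(f"simulated time advanced by {advanced / 60_000:.0f} minutes")
```

No call to the leader is needed for this calculation, which is why followers can answer GetTimeData locally between syncs.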

7. Complete System Overview

graph TB
    subgraph "PostgreSQL Infrastructure"
        PG[PostgreSQL Database]
        LT[simtime_leader table<br/>Leader Election]
        ST[simtime_state table<br/>State Persistence]
        ET[simtime_events table<br/>Audit Trail]
    end

    subgraph "NATS Infrastructure"
        NATS[NATS Pub/Sub<br/>Client Notifications]
    end

    subgraph "Leader Instance"
        L[SimTimeServer<br/>Source of Truth]
        LG[gRPC Endpoint]
        LE[Leader Election<br/>Module]
    end

    subgraph "Follower Instance 1"
        F1[SimTimeClient]
        F1G[gRPC Endpoint]
        F1E[Election Watcher]
    end

    subgraph "Follower Instance 2"
        F2[SimTimeClient]
        F2G[gRPC Endpoint]
        F2E[Election Watcher]
    end

    subgraph "Game Services"
        GS1[Game Service 1]
        GS2[Game Service 2]
        GS3[Game Service 3]
    end

    PG --> LT
    PG --> ST
    PG --> ET

    LE -.->|Renew Lease| LT
    F1E -.->|Poll| LT
    F2E -.->|Poll| LT

    L -->|Save Snapshots| ST
    L -->|Log Events| ET
    L -->|Broadcast Events| NATS
    NATS -.->|Time Updates| GS1
    NATS -.->|Time Updates| GS2
    NATS -.->|Time Updates| GS3

    F1 -->|Sync Time| LG
    F2 -->|Sync Time| LG

    GS1 -->|GetTime| F1G
    GS2 -->|GetTime| F2G
    GS3 -->|GetTime| LG

    style L fill:#f96,stroke:#333,stroke-width:4px
    style F1 fill:#9cf,stroke:#333,stroke-width:2px
    style F2 fill:#9cf,stroke:#333,stroke-width:2px
    style PG fill:#326ce5,stroke:#333,stroke-width:2px
    style NATS fill:#27aae1,stroke:#333,stroke-width:2px

These diagrams illustrate the key interaction patterns in the distributed SimTime architecture:

  1. Leader Election: Shows how instances compete for leadership and maintain it through lease renewal
  2. Time Synchronization: Details the NTP-style algorithm used by followers to sync with the leader
  3. Request Handling: Demonstrates how both leaders and followers serve client requests
  4. Time Manipulation: Shows command routing for administrative operations
  5. Failure Recovery: Illustrates the seamless transition when a leader fails
  6. Time Extrapolation: Explains how followers calculate time locally without network calls
  7. System Overview: Provides a high-level view of all components and their relationships

Full Restart Recovery & State Persistence

The Challenge

When all SimTime servers restart simultaneously (e.g., cluster restart, power failure, Kubernetes namespace recreation), we face critical challenges:

  1. Complete State Loss - All in-memory time state vanishes
  2. Time Regression - Could jump from Day 15 back to Day 1
  3. Client Desynchronization - Clients have different time than restarted servers
  4. Lost Manipulations - Paused state, multiplier changes are forgotten

The Solution: Persistent State Snapshots

State Persistence Strategy

// Leader saves snapshots every 10 seconds
public class TimeSnapshotManager {
    private readonly IDbConnection db;
    private readonly Timer snapshotTimer;
    private readonly long fencingToken;

    public async Task SaveSnapshot() {
        var sql = @"
            -- Mark previous snapshots as not current
            UPDATE simtime_state SET is_current = false WHERE is_current = true;

            -- Insert new snapshot
            INSERT INTO simtime_state (
                sim_time_ticks,
                epoch_ticks,
                multiplier,
                paused_timestamp_ticks,
                is_current,
                saved_by,
                fencing_token
            ) VALUES (
                @simTimeTicks,
                @epochTicks,
                @multiplier,
                @pausedTimestampTicks,
                true,
                @savedBy,
                @fencingToken
            )";

        await db.ExecuteAsync(sql, new {
            simTimeTicks = GetCurrentSimTime().Ticks,
            epochTicks = currentEpoch.Ticks,
            multiplier = currentMultiplier,
            pausedTimestampTicks = pausedTimestamp?.Ticks,
            savedBy = instanceId,
            fencingToken = this.fencingToken
        });
    }
}

Recovery Protocol on Startup

public async Task<TimeData> RecoverTimeState() {
    // 1. Try to load from PostgreSQL snapshot
    var sql = @"
        SELECT
            saved_at,
            sim_time_ticks,
            epoch_ticks,
            multiplier,
            paused_timestamp_ticks
        FROM simtime_state
        WHERE is_current = true
        ORDER BY saved_at DESC
        LIMIT 1";

    var snapshot = await db.QuerySingleOrDefaultAsync<dynamic>(sql);

    if (snapshot != null) {
        var savedAt = (DateTime)snapshot.saved_at;
        var age = DateTime.UtcNow - savedAt;

        if (age < TimeSpan.FromMinutes(5)) {
            // 2. Extrapolate from snapshot; a paused clock does not advance
            var simTime = new DateTime((long)snapshot.sim_time_ticks);
            bool isPaused = snapshot.paused_timestamp_ticks != null;
            var elapsedSimTime = isPaused
                ? TimeSpan.Zero
                : TimeSpan.FromTicks(age.Ticks * (long)snapshot.multiplier);
            var extrapolatedTime = simTime + elapsedSimTime;

            Logger.Info($"Recovered from snapshot: {savedAt}, extrapolated {age}");
            return new TimeData {
                Epoch = new DateTime((long)snapshot.epoch_ticks),
                Multiplier = snapshot.multiplier,
                CurrentTime = extrapolatedTime,
                PausedTimestamp = isPaused
                    ? new DateTime((long)snapshot.paused_timestamp_ticks)
                    : (DateTime?)null
            };
        }
    }

    // 3. Fallback to environment variables
    Logger.Warn("No recent snapshot found, initializing from environment");
    return InitializeFromEnvironment();
}
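
The recovery rule can be condensed into a single function. This Python sketch is a simplified model of the protocol above, not the service code: times are plain seconds, the function name is hypothetical, and treating a snapshot with a timestamp in the future as unusable is an assumption added here.

```python
from typing import Optional

MAX_SNAPSHOT_AGE_S = 5 * 60  # same 5-minute freshness limit as the C# code


def recover_sim_time(snapshot_sim_s: int, saved_at_s: int, now_s: int,
                     multiplier: int, paused: bool) -> Optional[int]:
    """Return the recovered simulated time, or None to signal the fallback path."""
    age = now_s - saved_at_s
    if not 0 <= age < MAX_SNAPSHOT_AGE_S:
        return None  # too old (or from the future): initialize from environment
    if paused:
        return snapshot_sim_s  # a paused clock did not advance while we were down
    return snapshot_sim_s + age * multiplier


print(recover_sim_time(1_000, saved_at_s=100, now_s=130, multiplier=24, paused=False))
print(recover_sim_time(1_000, saved_at_s=100, now_s=130, multiplier=24, paused=True))
print(recover_sim_time(1_000, saved_at_s=100, now_s=500, multiplier=24, paused=False))
```

Because every instance reads the same snapshot and applies the same rule, all of them arrive at nearly the same time after a full restart, differing only by when each one ran the query.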

Full Restart Sequence Diagram

sequenceDiagram
    participant PG as PostgreSQL
    participant I1 as Instance 1
    participant I2 as Instance 2
    participant I3 as Instance 3
    participant GC as Game Clients

    Note over I1,I3: All instances start simultaneously

    par Instance 1 Recovery
        I1->>PG: SELECT * FROM simtime_state WHERE is_current = true
        PG-->>I1: Snapshot (saved 30s ago)
        I1->>I1: Extrapolate: saved_time + 30s * multiplier
    and Instance 2 Recovery
        I2->>PG: SELECT * FROM simtime_state WHERE is_current = true
        PG-->>I2: Snapshot (saved 30s ago)
        I2->>I2: Extrapolate: saved_time + 30s * multiplier
    and Instance 3 Recovery
        I3->>PG: SELECT * FROM simtime_state WHERE is_current = true
        PG-->>I3: Snapshot (saved 30s ago)
        I3->>I3: Extrapolate: saved_time + 30s * multiplier
    end

    Note over I1,I3: All instances have consistent time

    I1->>PG: INSERT INTO simtime_leader ... ON CONFLICT UPDATE
    PG-->>I1: Success - Becomes Leader
    I2->>PG: INSERT INTO simtime_leader ... ON CONFLICT UPDATE
    PG-->>I2: Failed - Becomes Follower
    I3->>PG: INSERT INTO simtime_leader ... ON CONFLICT UPDATE
    PG-->>I3: Failed - Becomes Follower

    Note over I1: Leader starts snapshot timer
    loop Every 10 seconds
        I1->>PG: INSERT INTO simtime_state (snapshot)
        PG-->>I1: State persisted
    end

    Note over GC: Clients detect time continuity
    GC->>I2: GetTimeData()
    I2-->>GC: TimeData (continuous from before restart)
    GC->>GC: Small adjustment, no major jump

    Note over PG: Key advantage: Backup includes time state!
    PG->>PG: pg_dump includes simtime tables

Graceful Shutdown Protocol

public class GracefulShutdown {
    public async Task OnShutdown() {
        if (IsLeader) {
            // Save final snapshot before shutdown
            await SaveSnapshot();

            // Mark shutdown as clean in audit log
            await db.ExecuteAsync(@"
                INSERT INTO simtime_events (
                    event_type,
                    new_value,
                    performed_by,
                    instance_id
                ) VALUES (
                    'clean_shutdown',
                    @newValue::jsonb,
                    @performedBy,
                    @instanceId
                )", new {
                    newValue = JsonSerializer.Serialize(new {
                        timestamp = DateTime.UtcNow,
                        lastTime = GetCurrentSimTime()
                    }),
                    performedBy = instanceId,
                    instanceId = instanceId
                });
        }

        // Stop time advancement
        PauseTimeAdvancement();

        // Wait for pending operations
        await WaitForPendingOperations();
    }
}

Client Resilience for Time Jumps

public class TimeJumpDetection {
    private DateTime lastServerTime;
    private readonly TimeSpan maxAcceptableJump = TimeSpan.FromMinutes(1);

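    public async Task<bool> ValidateTimeUpdate(DateTime newServerTime) {
        // First update: nothing to compare against yet
        if (lastServerTime == default) {
            lastServerTime = newServerTime;
            return true;
        }

        var timeDiff = Math.Abs((newServerTime - lastServerTime).TotalSeconds);

        if (timeDiff > maxAcceptableJump.TotalSeconds) {
            Logger.Warn($"Large time jump detected: {timeDiff}s");

            // Force complete resynchronization
            await ForceCompleteResync();

            // Notify game systems of time discontinuity
            OnTimeDiscontinuity?.Invoke(lastServerTime, newServerTime);

            // The resync establishes a new baseline for later comparisons
            lastServerTime = newServerTime;
            return false; // Reject normal update
        }

        lastServerTime = newServerTime;
        return true; // Accept update
    }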
}
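
The check above compares consecutive server timestamps, so the ordinary gap between two syncs also counts toward the threshold. One possible refinement, sketched here under the assumption that the client knows how much wall-clock time has passed since the last update and the current multiplier, is to compare the new server time with the time the client expected to see. `TimeJumpCheck`, `ExpectedSimTime`, and `IsJump` are hypothetical names and are not part of the existing SimTime library.

```csharp
using System;

// Hypothetical helper: flags a jump only when the server time deviates
// from what local extrapolation predicted.
public static class TimeJumpCheck {
    // Sim time the client expects after elapsedWallClock at the given multiplier.
    public static DateTime ExpectedSimTime(
        DateTime lastServerTime, TimeSpan elapsedWallClock, double multiplier) {
        return lastServerTime + TimeSpan.FromTicks(
            (long)(elapsedWallClock.Ticks * multiplier));
    }

    // True when the deviation from the expected time exceeds maxJump.
    public static bool IsJump(DateTime expected, DateTime actual, TimeSpan maxJump) {
        return (actual - expected).Duration() > maxJump;
    }
}
```

With this shape, a client syncing every 60 seconds at a 2x multiplier expects sim time to advance by 120 seconds, and only a deviation from that expectation beyond the threshold is treated as a discontinuity.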

Configuration Additions

New environment variables for production:

# Snapshot Configuration
SIMTIME_SNAPSHOT_INTERVAL: "10"        # seconds between snapshots
SIMTIME_SNAPSHOT_RETENTION: "3600"     # keep snapshots for 1 hour
SIMTIME_RECOVERY_MAX_AGE: "300"        # max age of snapshot to use (5 min)

# Recovery Behavior
SIMTIME_ALLOW_ENV_FALLBACK: "true"     # fallback to env vars if no snapshot
SIMTIME_REQUIRE_CLEAN_SHUTDOWN: "false" # require clean shutdown marker

Recovery Scenarios

| Scenario | Recovery Method | Time Continuity |
|---|---|---|
| Clean restart (< 1 min) | Latest snapshot + extrapolation | Perfect continuity |
| Unclean restart (< 5 min) | Latest snapshot + extrapolation | Near-perfect (< 1s drift) |
| Long outage (> 5 min) | Environment variables + warning | Possible time jump |
| PostgreSQL data lost | Environment variables + alert | Time reset to Day 1 |
| Partial restart | Active instances provide time | Perfect continuity |
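
The scenarios above reduce to a small decision rule. The following is a minimal sketch of that rule, assuming the instance already knows whether a live leader is reachable and how old the newest snapshot is. `RecoverySource` and `RecoveryPolicy` are illustrative names, and the age limit corresponds to `SIMTIME_RECOVERY_MAX_AGE`.

```csharp
using System;

public enum RecoverySource { LiveLeader, Snapshot, Environment }

// Hypothetical helper that mirrors the recovery scenarios listed above.
public static class RecoveryPolicy {
    // Order of preference: a running leader, then a fresh snapshot,
    // then environment variables.
    public static RecoverySource Choose(
        bool liveLeaderAvailable, TimeSpan? snapshotAge, TimeSpan maxSnapshotAge) {
        if (liveLeaderAvailable) {
            return RecoverySource.LiveLeader;   // Partial restart
        }
        if (snapshotAge.HasValue && snapshotAge.Value < maxSnapshotAge) {
            return RecoverySource.Snapshot;     // Restart within the age limit
        }
        return RecoverySource.Environment;      // Long outage or data lost
    }
}
```

Keeping the rule in one pure function makes each row of the scenario list straightforward to unit test without a database.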

Monitoring & Alerts

Alerts to Configure:
  - snapshot_age_high: Snapshot older than 60 seconds
  - recovery_from_env: Had to use environment variables
  - large_time_jump: Time jumped > 60 seconds
  - snapshot_save_failed: Failed to persist snapshot

Kubernetes Integration & Pod Discovery

The Challenge: Self-Identification in Kubernetes Deployments

When using Kubernetes Deployments (your approach), SimTime instances face unique challenges:

  • Pods don't inherently know their own IP address
  • Pod IPs are ephemeral and change on every restart
  • Pod names are unpredictable (e.g., simtime-server-7b4d9c-x2f3)
  • Service ClusterIPs don't work (followers need specific leader pod IP)

Solution: Kubernetes Downward API with Deployments

Since you're using Deployments, the Kubernetes Downward API is essential for pod self-discovery:

Deployment Configuration

apiVersion: apps/v1
kind: Deployment
metadata:
  name: simtime-server
spec:
  replicas: 3
  template:
    spec:
      containers:
      - name: simtime
        image: simtime:latest
        env:
        # Pod identification
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        # Service discovery
        - name: SERVICE_NAME
          value: "simtime-internal"
        - name: GRPC_PORT
          value: "50051"

Leader Registration Code

public class LeaderElectionService {
    private readonly string instanceId;
    private readonly string instanceEndpoint;

    public LeaderElectionService(IConfiguration config) {
        // Get pod-specific information from Downward API
        var podName = config["POD_NAME"]
            ?? throw new Exception("POD_NAME not set - check Downward API config");
        var podIp = config["POD_IP"]
            ?? throw new Exception("POD_IP not set - check Downward API config");
        var grpcPort = config["GRPC_PORT"] ?? "50051";

        // Build unique instance ID and endpoint
        this.instanceId = podName;  // e.g., "simtime-server-abc123"
        this.instanceEndpoint = $"{podIp}:{grpcPort}";  // e.g., "10.244.1.5:50051"
    }

    public async Task<bool> TryBecomeLeader() {
        var sql = @"
            INSERT INTO simtime_leader (
                resource_name,
                instance_id,
                instance_endpoint,  -- Followers will connect here
                expires_at,
                metadata
            ) VALUES (
                'simtime',
                @instanceId,        -- simtime-server-abc123
                @instanceEndpoint,  -- 10.244.1.5:50051
                NOW() + INTERVAL '30 seconds',
                @metadata::jsonb
            )
            ON CONFLICT (resource_name) DO UPDATE SET
                instance_id = @instanceId,
                instance_endpoint = @instanceEndpoint,
                expires_at = NOW() + INTERVAL '30 seconds'
            WHERE simtime_leader.expires_at < NOW()
            RETURNING instance_id = @instanceId as became_leader";

        var metadata = new {
            pod_name = instanceId,
            pod_ip = instanceEndpoint.Split(':')[0],
            version = Assembly.GetExecutingAssembly().GetName().Version?.ToString()
        };

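        // The upsert returns no row when another instance holds a live lease,
        // so treat "no row" as "did not become leader" instead of throwing
        var becameLeader = await db.QuerySingleOrDefaultAsync<bool?>(sql, new {
            instanceId,
            instanceEndpoint,
            metadata = JsonSerializer.Serialize(metadata)
        });
        return becameLeader ?? false;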
    }
}
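
The lease rule encoded in the SQL above can be easier to reason about as a small in-memory model. The sketch below is for illustration and unit testing only and is not a replacement for the PostgreSQL row. `LeaseModel` is a hypothetical name, and the renewal rule (only the current holder may extend the lease) is an assumption about how renewal is meant to behave.

```csharp
#nullable enable
using System;

// In-memory model of the simtime_leader row, for reasoning and tests only.
public sealed class LeaseModel {
    private readonly TimeSpan ttl;
    public string? HolderId { get; private set; }
    public DateTime ExpiresAt { get; private set; }

    public LeaseModel(TimeSpan ttl) { this.ttl = ttl; }

    // Same rule as the upsert: take the lease if none exists or it has expired.
    public bool TryAcquire(string instanceId, DateTime now) {
        if (HolderId != null && ExpiresAt >= now) {
            return false;  // A live lease exists, so the takeover is refused
        }
        HolderId = instanceId;
        ExpiresAt = now + ttl;
        return true;
    }

    // Assumed renewal rule: only the current holder may extend the lease.
    public bool TryRenew(string instanceId, DateTime now) {
        if (HolderId != instanceId) return false;
        ExpiresAt = now + ttl;
        return true;
    }
}
```

Passing `now` in explicitly keeps the model deterministic, which makes expiry and takeover cases testable without waiting for real time to pass.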

Follower Connection (No Separate Discovery Needed!)

public class FollowerService {
    private GrpcChannel? leaderChannel;
    private SimTimeClient simTimeClient;

    public async Task ConnectToLeader(string leaderEndpoint) {
        // Leader endpoint already provided by TryBecomeLeader()
        // No need for a separate database query!

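        // Release any channel to a previous leader before reconnecting
        leaderChannel?.Dispose();

        // Connect directly to leader pod IP over plaintext HTTP/2
        // e.g., http://10.244.1.5:50051
        leaderChannel = GrpcChannel.ForAddress($"http://{leaderEndpoint}");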

        // Initialize SimTimeClient for time synchronization
        var transport = new GrpcSimTimeTransport(leaderChannel);
        simTimeClient = new SimTimeClient(transport);
        simTimeClient.StartSyncing();

        Logger.Info($"Connected to leader at {leaderEndpoint}");
    }

    // If leader changes, we need to check periodically
    public async Task<string?> CheckLeaderChange() {
        var sql = @"
            SELECT instance_endpoint
            FROM simtime_leader
            WHERE resource_name = 'simtime'
                AND expires_at > NOW()";

        return await db.QuerySingleOrDefaultAsync<string>(sql);
    }
}

Important: Why Deployments Work Despite Ephemeral IPs

With Deployments, pod IPs change on every restart, but this is not a problem because:

  1. Leader re-registers its new IP on startup when acquiring leadership
  2. Followers re-discover the leader's new endpoint from PostgreSQL
  3. TTL-based leases ensure stale endpoints are automatically cleaned up
  4. Connection retry logic handles IP changes gracefully

// Followers automatically handle leader IP changes
public async Task MaintainLeaderConnection() {
    while (running) {
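        var currentEndpoint = await DiscoverLeader();
        // No live lease means an election is in progress: keep the current
        // connection and check again on the next poll
        if (currentEndpoint != null && currentEndpoint != lastKnownEndpoint) {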
            // Leader IP changed (pod restarted)
            await ReconnectToLeader(currentEndpoint);
            lastKnownEndpoint = currentEndpoint;
        }
        await Task.Delay(TimeSpan.FromSeconds(10));
    }
}

Development vs Production with Deployments

public class EndpointConfiguration {
    public static string GetInstanceEndpoint(IConfiguration config) {
        var env = config["ASPNETCORE_ENVIRONMENT"];
        var port = config["GRPC_PORT"] ?? "50051";

        if (env == "Development") {
            // Local development: use localhost
            return $"localhost:{port}";
        } else if (config["POD_IP"] != null) {
            // Kubernetes Deployment: use pod IP (your primary scenario)
            // This IP is ephemeral and will change on pod restart
            return $"{config["POD_IP"]}:{port}";
        } else {
            // Fallback for other environments (Docker, VM, etc.)
            return $"{Environment.MachineName}:{port}";
        }
    }
}

Deployment-Specific Considerations

Since you're using Deployments:

  1. Pod IP Changes: Every pod restart gets a new IP

    • Solution: Leader updates PostgreSQL on startup
    • Followers poll for endpoint changes
  2. Rolling Updates: During deployment, pods cycle through

    • Old leader continues until new pod is ready
    • New pod tries to become leader
    • Seamless transition via TTL-based leases
  3. Scale Events: When scaling up/down

    • New pods become followers automatically
    • If leader is scaled down, election happens within TTL window
  4. No Persistent Identity: Pods have random suffixes

    • Use pod name as instance ID for uniqueness
    • Don't rely on pod name for network discovery

Key Points

  1. Downward API is Essential: It is the simplest reliable way for a pod to learn its own name and IP address
  2. Direct Pod Connection: Followers must connect to the specific leader pod, not a service
  3. Ephemeral IPs Are Fine: Leader re-registration and TTL cleanup handle IP changes automatically
  4. Graceful Degradation: Code handles both Kubernetes and local development scenarios
  5. Metadata Storage: Leader table stores both instance ID and endpoint for observability

Implementation Details

Phase 1: PostgreSQL Schema & Leader Election

Tasks:
  - Create PostgreSQL schema (simtime_leader, simtime_state, simtime_events)
  - Implement lock table-based leader election
  - Add lease renewal mechanism with TTL
  - Add observability queries and metrics

Phase 2: Follower Client Mode

Tasks:
  - Create GrpcSimTimeTransport implementing ISimTimeClientTransport
  - Integrate SimTimeClient into follower instances
  - Override time serving methods to use client
  - Add sync quality metrics

Phase 3: State Persistence & Recovery

Tasks:
  - Implement PostgreSQL snapshot persistence
  - Add recovery from snapshots on startup
  - Implement graceful shutdown with final snapshot
  - Add audit trail for time manipulations

Phase 4: Testing & Rollout

Tasks:
  - Chaos testing (leader failures, network partitions)
  - Load testing with multiple instances
  - Gradual rollout with feature flags
  - Monitor time consistency metrics

Advantages

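1. Unified Backup/Restore ✅

# Backup includes everything (custom format, which pg_restore can read)
pg_dump -Fc production > backup.dump

# Restore includes time state automatically
pg_restore -d production backup.dump
# Time state is restored with the rest of the data
# (snapshots older than SIMTIME_RECOVERY_MAX_AGE are ignored on startup)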

2. Eliminates Clock Skew ✅

Traditional Problem:
Instance A clock: 14:00:00.000
Instance B clock: 14:00:00.050
Result: 50ms time difference!

Our Solution:
Both instances: 14:00:00.000 (synchronized to the leader, within the < 10ms sync target)

3. High Availability ✅

  • Leader failure → New election in ~30 seconds
  • Followers continue serving during election
  • Zero downtime deployments possible
  • PostgreSQL HA as foundation

4. Production Observability ✅

-- Real-time monitoring
SELECT instance_id, expires_at - NOW() as ttl FROM simtime_leader;

-- Audit trail included
SELECT * FROM simtime_events WHERE event_time > NOW() - INTERVAL '1 hour';

5. Proven Technology ✅

  • PostgreSQL lock tables are battle-tested
  • SimTimeClient has been used in production for years
  • NTP algorithm is industry standard
  • Your team already has PostgreSQL expertise

6. Horizontal Scaling ✅

Single Instance: 10,000 clients → 167 req/s → Overloaded
Multi-Instance:  10,000 clients ÷ 5 instances = 2,000 clients each → 33 req/s → Smooth
  • Load distributes across all instances
  • Each instance handles subset of clients
  • Add more instances as player base grows
  • No single bottleneck
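
The request rates quoted above are consistent with each client syncing once every 60 seconds (10,000 ÷ 60 ≈ 167). A small sketch of that arithmetic follows, assuming clients are spread evenly across instances. `SyncLoad` is an illustrative name, and the 60-second interval is inferred from the figures rather than measured.

```csharp
using System;

// Hypothetical capacity helper for back-of-the-envelope planning.
public static class SyncLoad {
    // Average sync requests per second that one instance sees, assuming
    // an even spread of clients and one sync per client per interval.
    public static double RequestsPerSecondPerInstance(
        int clients, int instances, double syncIntervalSeconds) {
        if (instances <= 0) {
            throw new ArgumentOutOfRangeException(nameof(instances));
        }
        if (syncIntervalSeconds <= 0) {
            throw new ArgumentOutOfRangeException(nameof(syncIntervalSeconds));
        }
        return clients / (double)instances / syncIntervalSeconds;
    }
}
```

For example, `SyncLoad.RequestsPerSecondPerInstance(10000, 5, 60)` gives roughly 33 requests per second per instance, matching the multi-instance figure above.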

7. Performance ✅

  • Followers calculate time locally (no network calls)
  • Sync happens every 60 seconds (configurable)
  • Scales linearly with instances
  • Leader election returns current leader in single query (no extra round trip)

8. Simplified Coordination via Polling ✅

  • No persistent database connections needed
  • No LISTEN/NOTIFY complexity or reconnection logic
  • Works perfectly with connection pooling
  • Predictable 10-second detection latency is acceptable for rare events
  • Easier to monitor, debug, and test

9. Transactional Consistency ✅

BEGIN;
INSERT INTO game_events (...);
UPDATE simtime_state SET ...;
COMMIT;  -- Atomic operation

Disadvantages & Mitigations

1. Leader Election Complexity ⚠️

Issue: Distributed consensus is complex

Mitigation:

  • Use proven PostgreSQL lock tables with TTL
  • Simple lease-based approach
  • Pure polling (no LISTEN/NOTIFY complexity)
  • Extensive testing of edge cases

2. Brief Inconsistency During Leader Change ⚠️

Issue: Up to 30 second window for new leader election

Mitigation:

  • Followers serve last known good time
  • Time continues advancing via extrapolation
  • New leader elected quickly

3. Additional Network Traffic ⚠️

Issue: Followers sync with leader periodically

Mitigation:

  • Sync interval configurable (default 60s)
  • Minimal data transfer (~100 bytes)
  • Use gRPC streaming for efficiency

4. Complexity vs Single Instance ⚠️

Issue: More moving parts than single instance

Mitigation:

  • Comprehensive monitoring and alerting
  • Distributed tracing for debugging
  • Fallback to single-instance mode if needed

Alternative Approaches Considered

1. NATS KV for State Storage ❌

Why Rejected for State Storage (still used for client broadcasting):

  • No read-after-write consistency for critical state
  • Follower lag issues for leader election
  • Incompatible with immediate consistency needs
  • Separate backup/restore from PostgreSQL
  • Would split critical state across two systems

Note: NATS is still used for broadcasting time events to game clients, where its excellent fan-out performance shines

2. Redis for State Storage ❌

Why Rejected:

  • Adds new infrastructure dependency
  • Separate backup strategy needed
  • Not aligned with unified datastore strategy
  • Team lacks Redis expertise

3. Client-Side Time Calculation ❌

Why Rejected:

  • Pushes complexity to every client
  • Hard to coordinate time manipulation
  • Difficult to debug time issues

4. Keep Single Instance ❌

Why Rejected:

  • Doesn't solve SPOF issue
  • No horizontal scaling
  • Risky for production system

Risk Analysis

| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Leader election fails | Low | High | Manual override, alerting |
| Clock drift during partition | Medium | Medium | Shorter sync intervals |
| Follower sync failures | Low | Low | Mark instance unhealthy |
| Time jumps on leader change | Low | High | Smooth convergence algorithm |
| Performance degradation | Low | Medium | Monitoring, auto-scaling |
| Full cluster restart | Low | Critical | Persistent snapshots, recovery protocol |
| PostgreSQL data loss | Very Low | Critical | Environment var fallback, alerting |
| Snapshot corruption | Very Low | High | Multiple snapshot versions, validation |
| Long outage (> 5 min) | Low | High | Client resync protocol, time jump detection |
| Graceless shutdown | Medium | Medium | Regular snapshots, extrapolation on recovery |

Critical Architecture Issues & Solutions

This section identifies critical safety issues that must be addressed before production deployment, along with their solutions.

1. Split-Brain Vulnerability 🧠⚠️

The Problem: During network partitions, the system can end up with multiple leaders:

Time 0:   Leader renewing every 15s (30s lease), Followers polling every 10s
Time 10:  Network partition occurs
          - Leader → PostgreSQL: ❌ (lease renewals fail)
          - Followers → PostgreSQL: ✅ (still works)
          - Followers → Leader: ❌ (network partition)
Time 30:  Lease expires; leader has not stepped down and keeps serving clients
Time 40:  Follower A acquires the expired lease and becomes new leader
Time 45:  Network partition heals
Result:   TWO LEADERS! Original leader + Follower A

Why it's dangerous:

  • Game clients could get different times from different "leaders"
  • Time manipulations could be lost or duplicated
  • Data corruption in time-dependent game events
  • Inconsistent event timers across the game

The Solution - Monotonic Fencing Tokens:

// Every state update must check fencing token
public async Task<bool> AcceptTimeUpdate(TimeState newState) {
    // Reject updates from older leaders. The current leader keeps sending
    // the same token, so an equal token must still be accepted.
    if (newState.fencing_token < lastKnownFencingToken) {
        Logger.Warn($"Rejecting stale leader update: {newState.fencing_token} < {lastKnownFencingToken}");
        return false;
    }

    lastKnownFencingToken = newState.fencing_token;
    ApplyTimeUpdate(newState);
    return true;
}

// Leader must include fencing token in all responses
public TimeDataResponse GetTimeData() {
    return new TimeDataResponse {
        Time = CalculateCurrentTime(),
        FencingToken = this.currentFencingToken,  // Critical!
        LeaderId = this.instanceId
    };
}
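
To make the token rule concrete, here is a minimal in-memory sketch of a follower-side guard. It assumes tokens are integers that increase by at least one on every leadership change. `FencingGuard` is a hypothetical name, and how the token is issued (for example, a counter bumped in PostgreSQL on each takeover) is not shown.

```csharp
using System;

// Hypothetical follower-side guard. Not thread-safe: callers are assumed
// to serialize updates.
public sealed class FencingGuard {
    private long highestSeen;

    // Accepts the current term (equal token) and newer terms (higher token),
    // and rejects anything from an older term.
    public bool Accept(long fencingToken) {
        if (fencingToken < highestSeen) {
            return false;
        }
        highestSeen = fencingToken;
        return true;
    }
}
```

Walking the split-brain timeline through the guard: if the original leader holds token 7 and the new leader holds token 8, a follower that has seen token 8 rejects any later update carrying token 7, so the stale leader can no longer influence it.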

2. Transaction Isolation Missing 💾❌

The Problem: State updates aren't atomic, creating brief windows with no valid state:

-- These run as separate statements!
UPDATE simtime_state SET is_current = false WHERE is_current = true;  -- Moment 1: No current state!
-- DANGER ZONE: Another instance queries here and gets NULL!
INSERT INTO simtime_state (...) VALUES (...);                         -- Moment 2: State restored

Timeline of Disaster:

Microsecond 1: UPDATE executes (no current state exists!)
Microsecond 2: Follower queries: "SELECT * WHERE is_current = true"
               Returns: NULL! 💥
Microsecond 3: INSERT executes (current state exists again)
Microsecond 4: Follower crashes due to null state

Why it's dangerous:

  • Brief moments with NO valid time state
  • Instances could fail to start or crash
  • Clients could get null responses
  • Recovery logic might initialize from environment (time jump!)

The Solution - Atomic Transactions:

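public async Task SaveSnapshot() {
    // One transaction: readers see either the old current row or the new
    // one, never a moment with none. Assumes db is an open connection.
    using var tx = db.BeginTransaction();

    await db.ExecuteAsync(
        "UPDATE simtime_state SET is_current = false WHERE is_current = true",
        transaction: tx);

    await db.ExecuteAsync(@"
        INSERT INTO simtime_state (
            sim_time_ticks, epoch_ticks, multiplier, paused_timestamp_ticks,
            is_current, saved_by, fencing_token
        ) VALUES (
            @simTimeTicks, @epochTicks, @multiplier, @pausedTimestampTicks,
            true, @savedBy, @fencingToken
        )", parameters, tx);

    // Disposing tx without Commit (for example after an exception) rolls back
    tx.Commit();
}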

// Alternative: Single UPSERT operation
public async Task SaveSnapshotAtomic() {
    var sql = @"
        INSERT INTO simtime_state (id, sim_time_ticks, epoch_ticks, multiplier, is_current)
        VALUES (1, @simTimeTicks, @epochTicks, @multiplier, true)
        ON CONFLICT (id) WHERE is_current = true
        DO UPDATE SET
            sim_time_ticks = @simTimeTicks,
            epoch_ticks = @epochTicks,
            multiplier = @multiplier,  -- keep the stored multiplier in step
            saved_at = NOW()";

    await db.ExecuteAsync(sql, parameters);
}

3. Resource Leaks - Timer Cleanup 🚰

The Problem: Timers are created but never disposed when losing leadership:

private Timer renewalTimer;

private void StartLeaseRenewal() {
    renewalTimer = new Timer(async _ => {
        // Renews every 15 seconds forever!
    }, null, TimeSpan.FromSeconds(15), TimeSpan.FromSeconds(15));
}

// When leadership lost, timer keeps running!
// After 10 leadership changes = 10 orphaned timers

Accumulation Over Time:

Hour 0:   1 leadership change  = 1 orphaned timer
Hour 1:   3 leadership changes = 3 orphaned timers (4 total)
Hour 6:   2 more changes       = 2 orphaned timers (6 total)
Day 1:    20 changes total     = 20 timers firing every 15s!

Result: 20 timers × 4 queries/min = 80 unnecessary DB queries/min
        Plus growing memory usage!

Why it's dangerous:

  • Memory usage grows unbounded
  • Database gets hammered with invalid renewal attempts
  • Connection pool exhaustion
  • Can trigger cascading failures under load

The Solution - Proper Cleanup:

public class LeaderElectionService : IDisposable {
    private Timer? renewalTimer;
    private CancellationTokenSource? leadershipCts;

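    private void StartLeaseRenewal() {
        // Dispose anything left over from a previous leadership term
        renewalTimer?.Dispose();
        leadershipCts?.Dispose();

        // Create cancellation token for this leadership term
        leadershipCts = new CancellationTokenSource();
        // Capture the token: the field is set to null when leadership is lost
        var token = leadershipCts.Token;

        renewalTimer = new Timer(async _ => {
            try {
                if (token.IsCancellationRequested) {
                    return;  // Stop if cancelled
                }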

                var renewed = await RenewLease();
                if (!renewed) {
                    await OnLostLeadership();
                }
            } catch (Exception ex) {
                Logger.Error($"Lease renewal failed: {ex}");
                await OnLostLeadership();
            }
        }, null, TimeSpan.FromSeconds(15), TimeSpan.FromSeconds(15));
    }

    private async Task OnLostLeadership() {
        // Critical: Clean up resources!
        leadershipCts?.Cancel();
        renewalTimer?.Dispose();
        renewalTimer = null;
        leadershipCts?.Dispose();
        leadershipCts = null;

        Logger.Info("Leadership lost, cleaned up resources");
        await TransitionToFollower();
    }

    public void Dispose() {
        renewalTimer?.Dispose();
        leadershipCts?.Dispose();
    }
}

4. Thundering Herd Problem 🐘🐘🐘

The Problem: When the leader fails, ALL followers detect it simultaneously and race to become leader:

Leader fails at T=0
At T=10: All 20 followers detect failure simultaneously
At T=10.001: All 20 followers execute:
    INSERT INTO simtime_leader ... ON CONFLICT UPDATE

Database perspective:
- 20 simultaneous complex CTE queries
- All competing for same row lock
- Lock contention causes serialization
- Connection pool exhaustion
- Query queue backs up

Cascade Effect:

T+0ms:    20 queries hit DB simultaneously
T+100ms:  DB CPU = 100%, lock wait queue forming
T+500ms:  Connection pool exhausted
T+1000ms: New queries timeout
T+2000ms: Health checks fail
T+3000ms: Kubernetes restarts "unhealthy" pods
T+4000ms: More instances race for leadership
Result:   Complete system failure

Why it's dangerous:

  • Database CPU spikes to 100%
  • All instances freeze waiting for query results
  • Connection pool exhaustion
  • Can trigger cascading failures
  • Game clients timeout and disconnect

The Solution - Jittered Elections:

public class SmartLeaderElection {
    private readonly Random random = new Random();

    public async Task OnLeaderFailureDetected() {
        // Add random jitter to prevent thundering herd
        var baseDelay = 1000;  // 1 second base
        var jitter = random.Next(0, 5000);  // 0-5 seconds random
        var totalDelay = baseDelay + jitter;

        Logger.Info($"Leader failed, waiting {totalDelay}ms before election attempt");
        await Task.Delay(totalDelay);

        // Check if someone else already became leader during our wait
        var currentLeader = await CheckCurrentLeader();
        if (currentLeader != null) {
            Logger.Info($"New leader already elected: {currentLeader}");
            await ConnectToLeader(currentLeader);
            return;
        }

        // Now try to become leader
        await TryBecomeLeader();
    }

    // Even smarter: Exponential backoff for retries
    public async Task ElectionWithBackoff() {
        var attempt = 0;
        var maxAttempts = 5;

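        while (attempt < maxAttempts) {
            var delay = Math.Min(1000 * Math.Pow(2, attempt), 30000);  // Cap at 30s
            var jitter = random.Next(0, (int)delay / 2);

            await Task.Delay((int)delay + jitter);

            if (await TryBecomeLeader()) {
                return;  // Success!
            }

            // Losing the race is not an error: if another instance now holds
            // a valid lease, follow it instead of retrying
            var currentLeader = await CheckCurrentLeader();
            if (currentLeader != null) {
                await ConnectToLeader(currentLeader);
                return;
            }

            attempt++;
        }

        throw new Exception("No leader elected or discovered after max attempts");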
    }
}
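
The delay calculation is the part of the election logic that is easiest to get subtly wrong, so it can help to isolate it as a pure function. The sketch below is one way to do that, assuming a 1 second base, a 30 second cap, and up to 50% added jitter as in the code above. `ElectionBackoff.DelayMs` is a hypothetical helper, not an existing API.

```csharp
using System;

// Hypothetical helper that isolates the backoff arithmetic for testing.
public static class ElectionBackoff {
    // Delay in milliseconds before election attempt number `attempt`
    // (0-based): exponential growth capped at maxMs, plus up to 50% jitter.
    public static int DelayMs(
        int attempt, Random rng, int baseMs = 1000, int maxMs = 30000) {
        if (attempt < 0) {
            throw new ArgumentOutOfRangeException(nameof(attempt));
        }
        // Clamp the exponent so the shift cannot overflow a long.
        long exponential = (long)baseMs << Math.Min(attempt, 20);
        int capped = (int)Math.Min(exponential, (long)maxMs);
        // Upper bound of Next is exclusive, so add 1 to allow the full 50%.
        return capped + rng.Next(0, capped / 2 + 1);
    }
}
```

Passing the `Random` instance in lets tests use a fixed seed, while production code can pass a shared instance so that instances started at the same moment still spread their attempts.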

Production Scenario: Black Friday Game Event

Without These Fixes:

18:00: Black Friday event starts, 50,000 players online
18:15: Network blip causes followers to lose connection to leader
18:16: Split-brain occurs (Issue #1)
       - Original leader still serving time with an expired lease
       - Follower A becomes new leader
       - Players get different event end times
18:17: Support tickets: "Why does my friend have different timer?"
18:20: DevOps attempts emergency restart
18:21: All instances restart, try to recover state
       - One instance hits the NULL state window (Issue #2)
       - Instance crashes during initialization
18:22: Crashed instance restarts, old timer still running (Issue #3)
       - Orphaned timer hammering database
18:25: Leader fails under load
18:26: 20 instances thundering herd the database (Issue #4)
       - Database CPU 100%
       - All queries timeout
18:27: Complete game outage
18:45: Emergency war room called
19:30: Manual intervention to recover
20:00: Service restored, but event ruined

With These Fixes:

18:00: Black Friday event starts, 50,000 players online
18:15: Network blip causes followers to lose connection to leader
18:16: Fencing token prevents split-brain
       - Followers reject old leader's updates
       - Clean leader transition occurs
18:17: New leader elected with jitter (no thundering herd)
       - Database load normal
       - Players see consistent times
18:18: Event continues smoothly
       - Monitoring shows brief leader transition
       - No player impact
23:00: Event completes successfully

Implementation Priority

These issues must be fixed in this order:

  1. Transaction Isolation (Easiest, highest impact)
  2. Resource Leaks (Simple fix, prevents accumulation)
  3. Thundering Herd (Moderate complexity, critical for scale)
  4. Split-Brain (Most complex, but essential for correctness)

All fixes should be implemented before production deployment.


Success Metrics

Functional Metrics

  • ✅ Zero time jumps during normal operation
  • ✅ < 10ms time difference between instances
  • ✅ < 30 second leader election time
  • ✅ 99.99% availability
  • ✅ < 1 second time drift after full restart
  • ✅ 100% time continuity with snapshots available

Performance Metrics

  • ✅ < 1ms latency for time queries (local calculation)
  • ✅ < 100ms for time sync operations
  • ✅ Support 100+ instances without degradation
  • ✅ < 50ms snapshot save time
  • ✅ < 100ms recovery time on startup

Operational Metrics

  • ✅ Zero-downtime deployments
  • ✅ Automatic failover on leader failure
  • ✅ Self-healing after network partitions
  • ✅ Successful recovery from full cluster restart
  • ✅ Snapshot age always < 30 seconds
  • ✅ Zero data loss with graceful shutdown

Migration Strategy

Step 1: Deploy in Shadow Mode

  • Deploy multi-instance version alongside single instance
  • Compare time outputs, don't serve traffic
  • Validate consistency

Step 2: Gradual Traffic Shift

  • 10% → 25% → 50% → 100% over 2 weeks
  • Monitor metrics at each stage
  • Rollback capability at each step

Step 3: Decommission Single Instance

  • Keep single instance as emergency fallback
  • After 30 days stable, remove entirely

Code Examples

Leader Election Implementation

public class PostgresLeaderElection {
    private readonly IDbConnection db;
    private readonly string instanceId;
    private readonly string instanceEndpoint;
    private Timer renewalTimer;

    public async Task<LeaderElectionResult> TryBecomeLeader() {
        var sql = @"
            WITH election_attempt AS (
                INSERT INTO simtime_leader
                (resource_name, instance_id, instance_endpoint, expires_at, heartbeat_at)
                VALUES ('simtime', @instanceId, @endpoint, NOW() + INTERVAL '30 seconds', NOW())
                ON CONFLICT (resource_name) DO UPDATE SET
                    instance_id = @instanceId,
                    instance_endpoint = @endpoint,
                    expires_at = NOW() + INTERVAL '30 seconds',
                    heartbeat_at = NOW()
                WHERE simtime_leader.expires_at < NOW()
                RETURNING instance_id, instance_endpoint
            )
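            -- Aliases match the LeaderElectionResult properties used below
            -- (Dapper maps columns to properties by name; LeaderId is assumed)
            SELECT
                instance_id AS LeaderId,
                instance_endpoint AS LeaderEndpoint,
                instance_id = @instanceId AS BecameLeader
            FROM election_attempt
            UNION ALL
            SELECT
                instance_id AS LeaderId,
                instance_endpoint AS LeaderEndpoint,
                false AS BecameLeader
            FROM simtime_leader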
            WHERE resource_name = 'simtime'
                AND NOT EXISTS (SELECT 1 FROM election_attempt)
            LIMIT 1";

        var result = await db.QuerySingleAsync<LeaderElectionResult>(sql,
            new { instanceId, endpoint = instanceEndpoint });

        if (result.BecameLeader) {
            StartLeaseRenewal();
            await LogLeaderChange("acquired");
        } else {
            // No need for a second query - we have the leader endpoint!
            await InitializeFollowerMode(result.LeaderEndpoint);
        }

        return result;
    }

    private void StartLeaseRenewal() {
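        renewalTimer = new Timer(async _ => {
            try {
                var renewed = await db.ExecuteAsync(@"
                    UPDATE simtime_leader
                    SET expires_at = NOW() + INTERVAL '30 seconds',
                        heartbeat_at = NOW()
                    WHERE resource_name = 'simtime'
                    AND instance_id = @instanceId",
                    new { instanceId });

                if (renewed == 0) {
                    await OnLostLeadership();
                }
            } catch (Exception ex) {
                // An unhandled exception in a timer callback would crash
                // the process, so treat a failed renewal as lost leadership
                Logger.Error($"Lease renewal failed: {ex}");
                await OnLostLeadership();
            }
        }, null, TimeSpan.FromSeconds(15), TimeSpan.FromSeconds(15));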
    }

    private async Task LogLeaderChange(string eventType) {
        await db.ExecuteAsync(@"
            INSERT INTO simtime_events
            (event_type, instance_id, performed_by)
            VALUES (@eventType, @instanceId, @instanceId)",
            new { eventType = $"leader_{eventType}", instanceId });
    }
}

Follower Transport Implementation

public class FollowerTransport : ISimTimeClientTransport {
    private readonly SimTimeServerExternalServiceClient leaderClient;

    public async Task<TimeSyncResponse> SyncAsync(CancellationToken ct) {
        var request = new SyncNoStreamRequest {
            ClientTimeSendTicks = DateTime.UtcNow.Ticks
        };

        var response = await leaderClient.SyncNoStreamAsync(request, cancellationToken: ct);

        return new TimeSyncResponse {
            Successful = response.Successful,
            ServerTimeTicks = response.ServerTimeTicks,
            ClientTimeSendTicks = response.ClientTimeSendTicks,
            ClientTimeReceivedTicks = DateTime.UtcNow.Ticks
        };
    }

    public async Task<TimeDataResponse> GetTimeDataAsync(CancellationToken ct) {
        var response = await leaderClient.GetTimeDataAsync(new(), cancellationToken: ct);

        return new TimeDataResponse {
            Success = true,
            CurrentEpochTicks = response.CurrentEpochTicks,
            CurrentMultiplier = response.CurrentMultiplier,
            CurrentPausedTimestamp = response.CurrentPausedTimestamp
        };
    }
}
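
The three timestamps returned by `SyncAsync` are enough for an NTP-style offset estimate. How `SimTimeClient` actually computes its offset is not shown in this plan, so the following is only a sketch of the standard single-server-timestamp estimate, assuming the network delay is roughly symmetric. `ClockOffset` and its methods are hypothetical names.

```csharp
using System;

// Hypothetical helper showing the arithmetic behind an NTP-style sync.
public static class ClockOffset {
    // Round-trip time of the sync exchange, in ticks.
    public static long RoundTripTicks(long clientSendTicks, long clientReceivedTicks) {
        return clientReceivedTicks - clientSendTicks;
    }

    // Estimated (leader minus follower) clock offset in ticks, assuming the
    // leader stamped its time at the midpoint of the round trip.
    public static long OffsetTicks(
        long clientSendTicks, long serverTimeTicks, long clientReceivedTicks) {
        long midpoint = clientSendTicks + (clientReceivedTicks - clientSendTicks) / 2;
        return serverTimeTicks - midpoint;
    }
}
```

A follower would add the estimated offset to its own clock when serving time locally. The error of the estimate is bounded by half the round-trip time, which is why a short, stable network path to the leader matters.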

Conclusion

This architecture solves the SimTime scaling problem by:

  1. Eliminating SPOF through leader election and failover
  2. Ensuring consistency via SimTimeClient synchronization
  3. Maintaining performance with local time calculation
  4. Reusing proven code from the existing SimTime library

The key insight is treating follower instances as "thick clients" that maintain synchronized time state, allowing them to serve requests locally without hitting the leader for every query.

This design provides the consistency guarantees required for game time synchronization while enabling horizontal scaling and high availability.


Next Steps

  1. Review & Approval: Discuss with team, address concerns
  2. Proof of Concept: Build minimal version to validate approach
  3. Design Review: Detailed technical design document
  4. Implementation: Follow phased approach outlined above
  5. Testing: Comprehensive testing including chaos engineering
  6. Rollout: Gradual deployment with monitoring


Document Version: 1.0
Author: Architecture Team
Date: 2025-10-28
Status: DRAFT - For Review
