oddur/simtime-multi-instance-architecture.md

## simtime-multi-instance-architecture.md

      
    SimTime Service Multi-Instance Architecture Plan

Executive Summary

This document outlines a solution to enable the SimTime service to run safely across multiple instances, eliminating the current single point of failure while maintaining time consistency across the distributed system.

ELI5: The Solution in Simple Terms

The Problem

Imagine you have a single master clock in a building that everyone looks at to know the time. If that clock breaks, nobody knows what time it is anymore. That's our current problem: we have only one time server, and if it fails, the entire game loses track of time.

The Solution

Instead of one clock, we'll have multiple clocks that work together:


The Leader Clock 🕐

One clock is chosen as the "leader" - this is the official time
It's like the principal's clock at school that everyone agrees is correct
If it breaks, another clock quickly becomes the new leader


The Follower Clocks 🕑🕒🕓

All other clocks are "followers" that sync with the leader
They ask the leader "what time is it?" every minute
Between asks, they keep track of time on their own (like counting seconds)
They remember the difference between their time and the leader's time


How It Stays in Sync

When a follower asks for time, it's like calling someone in another timezone
Follower: "What time do you have?" (sent at 2:00:00)
Leader: "It's 2:00:05"
Follower: "Okay, I'm 5 seconds behind you"
Now the follower adds 5 seconds to its own counting


Why This Is Smart

Game clients can ask ANY clock for the time
Followers don't need to constantly bother the leader
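If the leader breaks, a follower becomes the new leader once the 30-second lease has expired and the next 10-second poll notices (roughly 15 to 40 seconds after the failure)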
Everyone always agrees on what time it is (within milliseconds)


Real World Analogy

It's like having multiple smartphones that all sync their time with the cell tower. Even if one phone dies, the others keep working. And even if the main cell tower fails, the phones keep showing the right time because they know how to count seconds on their own.
The clever part: We're reusing the same time-sync technology that game clients already use to sync with the server. We're just making some servers act like "super clients" that can also answer time questions from real game clients.

Why PostgreSQL: Aligning with Unified Datastore Strategy

Technical Strategy Alignment

Our technical guidelines state:

"We have a single, unified way of storing data at rest in Postgres databases. Where possible, lean into going along the grain of relational databases since that is where the data lives."

Using PostgreSQL for SimTime coordination provides critical advantages:


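Unified Backup/Restore
# When you back up PostgreSQL, the time state comes along
# (-Fc writes the custom archive format that pg_restore expects)
pg_dump -Fc production > backup.dump

# Restoring brings back the time state as of the backup
pg_restore -d production backup.dump
# SimTime resumes from the last snapshot stored in the backup
# (subject to the snapshot max-age check in the recovery protocol)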


Transactional Consistency
BEGIN;
-- Game event and time update in same transaction
INSERT INTO game_events (...);
UPDATE simtime_state SET ...;
COMMIT;  -- Both succeed or both fail


Zero Additional Infrastructure

No separate coordination store (such as NATS KV) to maintain
No separate backup strategy
Single connection pool
One monitoring system


PostgreSQL vs NATS - Division of Responsibilities


| Component | PostgreSQL | NATS | Our Choice |
|---|---|---|---|
| Leader Election | Lock tables with TTL | KV with TTL | PostgreSQL ✅ |
| State Persistence | Native tables with ACID | KV Store (eventual) | PostgreSQL ✅ |
| Backup/Restore | Included automatically | Separate system | PostgreSQL ✅ |
| Observability | Rich SQL queries | Limited | PostgreSQL ✅ |
| Instance Coordination | Polling (simple, reliable) | Pub/Sub | PostgreSQL ✅ |
| Game Client Broadcasting | Possible but suboptimal | Excellent fan-out | NATS ✅ |


We use PostgreSQL for reliability (leader election, state) and NATS for performance (broadcasting to thousands of game clients).

The Decisive Factor

When disaster recovery happens:

PostgreSQL: Restore one backup, time state included
NATS: Restore PostgreSQL + NATS separately, hope they align

This alone justifies PostgreSQL given our unified datastore strategy.

Current State & Problem

The Issue

The SimTime service currently cannot scale horizontally due to its stateful nature:
Current Architecture:
┌─────────────────────┐
│  SimTime Service    │ ← Single Instance (SPOF)
│  - Mutable State    │
│  - No Persistence   │
│  - No Coordination  │
└─────────────────────┘
           ↓
    All Game Services

Why It's a Problem


Single Point of Failure (SPOF)

Service crash = no time synchronization for entire system
Deployment/updates require downtime
No redundancy for this critical service


State Management Issues
// Each instance maintains local state:
private DateTime serverEpoch;
private long multiplier;
private DateTime? pausedTimestamp;

Multiple instances = divergent state
No synchronization mechanism
Race conditions on time manipulation


Scalability Limitations

Cannot handle increased load by adding instances
All time sync requests hit single instance
Becomes bottleneck as system grows


Sync Load Overwhelms Single Instance
Production Reality with 10,000 concurrent clients:
- Each client syncs every 60 seconds
- 10,000 clients ÷ 60 seconds ≈ 167 sync requests/second
- Each sync involves: network I/O, time calculation, response
- Single instance CPU and network become saturated

Result: Sync latency increases → Time drift → Game desynchronization


Vertical scaling (bigger instance) has limits
Cannot distribute load across multiple instances
Critical during peak gaming hours or events


Proposed Solution: Leader-Follower Architecture

High-Level Design

Proposed Architecture:
┌─────────────────────┐
│   Leader Instance   │ ← Source of Truth
│  (SimTimeServer)    │   Elected via PostgreSQL
└─────────────────────┘
           ↓
    PostgreSQL + Optional Pub/Sub
           ↓
┌─────────────────────┐  ┌─────────────────────┐
│  Follower Instance  │  │  Follower Instance  │
│  (SimTimeClient)    │  │  (SimTimeClient)    │
└─────────────────────┘  └─────────────────────┘
           ↓                        ↓
     Game Clients              Game Clients

Key Innovation: Followers as SimTime Clients

The breakthrough insight: Follower instances use the existing SimTimeClient library to synchronize with the leader, just like game clients do. This compensates for clock skew between instances and keeps every instance's view of time consistent with the leader's.

How It Works

1. Leader Election (via PostgreSQL Lock Table)

// Optimized: Single query returns both election result AND current leader endpoint
// This eliminates an extra round-trip to the database for followers
public async Task<LeaderElectionResult> TryBecomeLeader() {
    // Column aliases match the LeaderElectionResult property names so the
    // row maps onto LeaderId / LeaderEndpoint / BecameLeader.
    var sql = @"
        WITH election_attempt AS (
            INSERT INTO simtime_leader
            (resource_name, instance_id, instance_endpoint, expires_at)
            VALUES ('simtime', @instanceId, @endpoint, NOW() + INTERVAL '30 seconds')
            ON CONFLICT (resource_name) DO UPDATE SET
                instance_id = @instanceId,
                instance_endpoint = @endpoint,
                acquired_at = NOW(),
                expires_at = NOW() + INTERVAL '30 seconds',
                -- BIGSERIAL only assigns a value on INSERT, so bump it on takeover
                fencing_token = simtime_leader.fencing_token + 1
            WHERE simtime_leader.expires_at < NOW()  -- Only if expired
            RETURNING instance_id, instance_endpoint
        )
        SELECT
            instance_id AS LeaderId,
            instance_endpoint AS LeaderEndpoint,
            instance_id = @instanceId AS BecameLeader
        FROM election_attempt
        UNION ALL
        SELECT
            instance_id AS LeaderId,
            instance_endpoint AS LeaderEndpoint,
            false AS BecameLeader
        FROM simtime_leader
        WHERE resource_name = 'simtime'
            AND NOT EXISTS (SELECT 1 FROM election_attempt)
        LIMIT 1";

    var result = await db.QuerySingleOrDefaultAsync<LeaderElectionResult>(sql,
        new { instanceId, endpoint });

    if (result == null) {
        // Possible when two instances race on the very first INSERT: the loser's
        // statement may not yet see the winner's row. The next poll cycle retries.
        return null;
    }

    if (result.BecameLeader) {
        StartLeaseRenewal();  // Renew every 15 seconds
        RunAsLeader();        // Full SimTimeServer mode
    } else {
        // We already have the leader endpoint from the query!
        await ConnectToLeader(result.LeaderEndpoint);
        RunAsFollower();      // SimTimeClient mode
    }

    return result;
}

public class LeaderElectionResult {
    public bool BecameLeader { get; set; }
    public string LeaderEndpoint { get; set; }
    public string LeaderId { get; set; }
}
2. Follower Synchronization

public class FollowerMode {
    private SimTimeClient clientToLeader;

    // Returns Task (not void) so the await below compiles and callers
    // can wait for the initial sync to finish
    public async Task Initialize(string leaderEndpoint) {
        // Create transport to communicate with leader
        var transport = new GrpcSimTimeTransport(leaderEndpoint);

        // Use SimTimeClient for time synchronization
        clientToLeader = new SimTimeClient(transport);
        clientToLeader.StartSyncing();

        // Wait for initial sync
        await clientToLeader.WaitUntilTimeIsSynced();
    }

    // Serve client requests using synchronized time
    public DateTime GetCurrentTime() {
        return clientToLeader.ServerNow; // Locally extrapolated!
    }
}
3. Clock Synchronization Algorithm (NTP-style)

The SimTimeClient uses NTP-style time synchronization:
1. Follower sends sync request with timestamp T1
2. Leader receives at T2, responds at T3
3. Follower receives response at T4

Round Trip Time = (T4 - T1)
One-way latency ≈ RTT / 2
Clock offset ≈ T3 - (T1 + one-way latency)

Result: Follower has an estimate of its offset to the leader's clock. The estimate is accurate to the extent that network latency is symmetric and the leader's processing time (T3 - T2) is negligible.
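
As a minimal sketch of the offset formula above (not the actual SimTimeClient code, whose internals are not shown in this document), the calculation could look like the following. The class and method names are hypothetical, and the sketch assumes symmetric latency and ignores leader processing time.

```csharp
using System;

// Hypothetical helper illustrating the offset formula; not part of SimTimeClient.
public static class ClockOffsetSketch
{
    // Estimates (leader clock - follower clock).
    // t1Sent and t4Received are read from the follower's clock,
    // t3LeaderTime is the leader's clock value carried in the response.
    public static TimeSpan EstimateOffset(DateTime t1Sent, DateTime t3LeaderTime, DateTime t4Received)
    {
        TimeSpan roundTrip = t4Received - t1Sent;                   // RTT = T4 - T1
        TimeSpan oneWay = TimeSpan.FromTicks(roundTrip.Ticks / 2);  // assumes symmetric latency
        return t3LeaderTime - (t1Sent + oneWay);                    // offset = T3 - (T1 + RTT/2)
    }

    public static void Main()
    {
        var t1 = new DateTime(2024, 1, 1, 14, 0, 0, DateTimeKind.Utc); // follower clock at send
        var t4 = t1.AddMilliseconds(40);                               // follower clock at receive
        var t3 = t1.AddMilliseconds(5020);                             // leader clock in the response
        Console.WriteLine(EstimateOffset(t1, t3, t4).TotalMilliseconds);
    }
}
```

With a 40 ms round trip and a leader timestamp 5,020 ms ahead of the send time, the estimated offset is 5,000 ms: the follower's clock is about five seconds behind the leader's.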

4. Request Routing


| Operation | Leader Behavior | Follower Behavior |
|---|---|---|
| GetTimeData | Serve from local state | Serve from synchronized state |
| Sync | Provide time reference | Provide synchronized time |
| Time Manipulation | Execute and broadcast | Proxy to leader |


5. Client Notification via NATS

While PostgreSQL handles leader election and state persistence, NATS is still used for broadcasting time modification events to all connected game clients:
// Leader-side implementation
public class TimeManipulationBroadcast {
    private readonly INatsPublisher natsPublisher;

    public async Task OnTimeManipulation(TimeManipulationEvent evt) {
        // Leader executes the manipulation
        UpdateLocalState(evt);

        // Save to PostgreSQL for persistence
        await SaveToPostgreSQL(evt);

        // Broadcast to ALL game clients via NATS
        await natsPublisher.PublishAsync("simtime.events", new {
            Type = evt.Type,  // "pause", "resume", "multiplier_change", "advance"
            NewState = GetCurrentTimeData(),
            Timestamp = DateTime.UtcNow
        });

        // Note: Follower instances detect changes by polling PostgreSQL,
        // not through NATS. This NATS broadcast is for game clients only.
    }
}

// Follower-side implementation: Always proxy to leader
public class FollowerTimeManipulation {
    private readonly SimTimeServerInternalServiceClient leaderClient;
    // Not readonly: the role changes whenever an election changes the leader
    private volatile bool isLeader;

    public async Task<PauseResponse> PauseAsync(PauseRequest request) {
        if (isLeader) {
            // Leader executes directly
            return await ExecutePauseAsync(request);
        } else {
            // Follower ALWAYS proxies to leader. leaderClient is assumed to be
            // re-created whenever the polling loop sees a new leader endpoint,
            // so no extra endpoint lookup is needed per request.
            return await leaderClient.PauseAsync(request);
        }
    }

    public async Task<ResumeResponse> ResumeAsync(ResumeRequest request) {
        if (!isLeader) {
            // Always proxy to leader
            return await leaderClient.ResumeAsync(request);
        }
        return await ExecuteResumeAsync(request);
    }

    public async Task<ChangeMultiplierResponse> ChangeMultiplierAsync(ChangeMultiplierRequest request) {
        if (!isLeader) {
            // Always proxy to leader
            return await leaderClient.ChangeMultiplierAsync(request);
        }
        return await ExecuteChangeMultiplierAsync(request);
    }
}
This hybrid approach leverages:

PostgreSQL for reliable state and leader coordination
NATS for efficient fan-out to thousands of game clients
Proxying ensures all time manipulations go through the leader
Existing client code continues to work unchanged

6. Unified Polling Architecture (No LISTEN/NOTIFY)

We intentionally use simple polling instead of PostgreSQL LISTEN/NOTIFY for follower coordination:
Why Polling Is Superior Here:

Connection Simplicity: No persistent DB connections or reconnection logic
Container-Friendly: Works perfectly with connection pooling and Kubernetes
Predictable Behavior: Fixed 10-second intervals, easy to debug and monitor
Resilient: Network blips don't break anything, just delay by one poll cycle
Simpler Testing: No need to mock LISTEN/NOTIFY, fully deterministic

The Unified Polling Loop:
public class FollowerPollingService {
    private string currentLeaderEndpoint;
    private DateTime lastStateUpdate;
    private readonly TimeSpan pollInterval = TimeSpan.FromSeconds(10);

    public async Task RunPollingLoop() {
        while (running) {
            try {
                // Single efficient query gets everything we need.
                // LEFT JOIN keeps the leader row even before the first snapshot exists.
                var sql = @"
                    SELECT
                        l.instance_id as leader_id,
                        l.instance_endpoint as leader_endpoint,
                        l.expires_at > NOW() as leader_active,
                        s.sim_time_ticks,
                        s.epoch_ticks,
                        s.multiplier,
                        s.paused_timestamp_ticks,
                        s.saved_at as state_updated_at
                    FROM simtime_leader l
                    LEFT JOIN LATERAL (
                        SELECT * FROM simtime_state
                        WHERE is_current = true
                        ORDER BY saved_at DESC LIMIT 1
                    ) s ON true
                    WHERE l.resource_name = 'simtime'";

                var status = await db.QuerySingleOrDefaultAsync<dynamic>(sql);

                if (status == null || !status.leader_active) {
                    // No active leader - attempt election, then fall through to the
                    // delay below so a failed attempt cannot spin in a tight loop
                    await TryBecomeLeader();
                } else {
                    // Check if leader changed
                    if (status.leader_endpoint != currentLeaderEndpoint) {
                        Logger.Info($"Leader changed to {status.leader_endpoint}");
                        await ReconnectToNewLeader(status.leader_endpoint);
                        currentLeaderEndpoint = status.leader_endpoint;
                    }

                    // Check if time state was manipulated by admin
                    if (status.state_updated_at != null
                        && status.state_updated_at > lastStateUpdate) {
                        Logger.Info("Time state changed, refreshing");
                        RefreshLocalTimeState(status);
                        lastStateUpdate = status.state_updated_at;
                    }
                }

            } catch (Exception ex) {
                Logger.Error($"Polling failed: {ex.Message}");
                // Continue polling - system remains available
            }

            await Task.Delay(pollInterval);
        }
    }
}
Acceptable Latency Trade-offs:


| Event | Detection Latency | Impact | Mitigation |
|---|---|---|---|
| Leader failure | 0-10 seconds | Rare event (weekly/monthly) | Followers continue serving |
| Time manipulation | 0-10 seconds | Admin operation, non-critical | Game clients get instant NATS update |
| New leader elected | 0-10 seconds | Follows leader failure | Extrapolation continues working |


The 10-second polling interval is 3x faster than the 30-second leader TTL, ensuring we detect changes promptly while keeping database load minimal.
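
To make the timing concrete, here is a small sketch of the arithmetic behind leader-failure detection. It assumes the leader crashes without releasing its lease and that followers only notice on a poll after `expires_at` has passed; the class and method names are hypothetical.

```csharp
using System;

// Hypothetical helper: bounds on the time between a leader crash and a follower noticing.
public static class FailoverWindowSketch
{
    public static (TimeSpan Min, TimeSpan Max) DetectionWindow(
        TimeSpan leaseTtl, TimeSpan renewInterval, TimeSpan pollInterval)
    {
        // Best case: the crash happens just before a renewal was due, so only
        // (ttl - renewInterval) of the lease remains, and a poll fires right at expiry.
        TimeSpan min = leaseTtl - renewInterval;

        // Worst case: the crash happens right after a renewal (full TTL remaining),
        // and the lease expires just after a poll, costing almost one more poll cycle.
        TimeSpan max = leaseTtl + pollInterval;

        return (min, max);
    }
}
```

With the values used in this document (30-second TTL, 15-second renewal, 10-second poll), the window runs from 15 to 40 seconds. The 0-10 second figures in the table above are the polling component only, counted from lease expiry.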

PostgreSQL Schema

Database Tables

-- Leader election table
CREATE TABLE simtime_leader (
    resource_name TEXT PRIMARY KEY DEFAULT 'simtime',
    instance_id TEXT NOT NULL,
    instance_endpoint TEXT NOT NULL,  -- Where followers connect
    acquired_at TIMESTAMPTZ DEFAULT NOW(),
    expires_at TIMESTAMPTZ NOT NULL,
    heartbeat_at TIMESTAMPTZ DEFAULT NOW(),
    fencing_token BIGSERIAL,  -- Split-brain guard; BIGSERIAL only assigns on INSERT, so each takeover must increment it
    metadata JSONB,  -- Version, capabilities, etc.

    CONSTRAINT valid_lease CHECK (expires_at > acquired_at)
);

-- State persistence table
CREATE TABLE simtime_state (
    id BIGSERIAL PRIMARY KEY,
    saved_at TIMESTAMPTZ DEFAULT NOW(),
    sim_time_ticks BIGINT NOT NULL,
    epoch_ticks BIGINT NOT NULL,
    multiplier INTEGER NOT NULL,
    paused_timestamp_ticks BIGINT,
    is_current BOOLEAN DEFAULT true,
    saved_by TEXT NOT NULL,
    fencing_token BIGINT NOT NULL  -- Must match leader's token
);

-- Audit trail for time manipulations
CREATE TABLE simtime_events (
    id BIGSERIAL PRIMARY KEY,
    event_time TIMESTAMPTZ DEFAULT NOW(),
    event_type TEXT NOT NULL,  -- 'pause', 'resume', 'multiplier_change', 'advance', 'leader_change', 'clean_shutdown'
    old_value JSONB,
    new_value JSONB,
    performed_by TEXT NOT NULL,
    instance_id TEXT NOT NULL
);

-- Indexes for performance
CREATE INDEX idx_current_state ON simtime_state(is_current)
WHERE is_current = true;

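-- No extra index on simtime_leader: it holds a single row per resource_name,
-- which the primary key already covers. (A partial index with
-- WHERE expires_at > NOW() is rejected by PostgreSQL, because index
-- predicates may only use IMMUTABLE functions.)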

CREATE INDEX idx_recent_events ON simtime_events(event_time DESC);
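
The schema comments say a snapshot's fencing_token must match the leader's token, but the document does not show how that is enforced. As a rough illustration of the idea only, the sketch below uses an in-memory stand-in for the simtime_state write path; `FencedStateStore` and its members are hypothetical, and a real implementation would perform the comparison in SQL against simtime_leader.

```csharp
// Hypothetical in-memory stand-in for the simtime_state write path.
public sealed class FencedStateStore
{
    private readonly object gate = new object();
    private long highestTokenSeen;

    public long CurrentSimTimeTicks { get; private set; }

    // Accepts a write only if the writer's token is at least as new as any token seen so far.
    public bool TryWrite(long fencingToken, long simTimeTicks)
    {
        lock (gate)
        {
            if (fencingToken < highestTokenSeen)
            {
                return false; // Writer lost leadership at some point; its state is stale.
            }

            highestTokenSeen = fencingToken;
            CurrentSimTimeTicks = simTimeTicks;
            return true;
        }
    }
}
```

A former leader that was paused (for example by a long GC pause) and wakes up after a takeover still carries its old token, so its late snapshot is rejected instead of overwriting the new leader's state.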
Why Lock Table Instead of Advisory Locks


| Aspect | Advisory Locks | Lock Table | Our Choice |
|---|---|---|---|
| Automatic cleanup | ✅ On disconnect | ❌ Manual TTL | Lock Table |
| Survives brief disconnects | ❌ Lost immediately | ✅ TTL-based | Lock Table ✅ |
| Observability | ❌ No metadata | ✅ Full visibility | Lock Table ✅ |
| Debugging | ❌ Can't see holder | ✅ SQL queries | Lock Table ✅ |
| Graceful handoff | ❌ Not possible | ✅ Update owner | Lock Table ✅ |
| Audit trail | ❌ None | ✅ History table | Lock Table ✅ |
| Works with polling | ❌ Requires connection | ✅ Stateless queries | Lock Table ✅ |


The lock table approach provides production-grade observability:
-- Who is the leader?
SELECT instance_id, instance_endpoint, expires_at - NOW() as remaining
FROM simtime_leader WHERE expires_at > NOW();

-- Leader changes in last hour
SELECT * FROM simtime_events
WHERE event_type = 'leader_change'
AND event_time > NOW() - INTERVAL '1 hour';

Detailed Sequence Diagrams

1. Leader Election Sequence


      sequenceDiagram
    participant I1 as Instance 1
    participant I2 as Instance 2
    participant I3 as Instance 3
    participant PG as PostgreSQL

    Note over I1,PG: Startup - All instances try to become leader
    I1->>PG: INSERT INTO simtime_leader ... ON CONFLICT UPDATE<br/>WHERE expires_at < NOW()
    PG-->>I1: Success (became_leader = true)
    Note over I1: Becomes Leader

    I2->>PG: INSERT INTO simtime_leader ... ON CONFLICT UPDATE<br/>WHERE expires_at < NOW()
    PG-->>I2: Failed (became_leader = false, leader_endpoint = Instance 1)
    Note over I2: Becomes Follower

    I3->>PG: INSERT INTO simtime_leader ... ON CONFLICT UPDATE<br/>WHERE expires_at < NOW()
    PG-->>I3: Failed (became_leader = false, leader_endpoint = Instance 1)
    Note over I3: Becomes Follower

    loop Every 15 seconds
        I1->>PG: UPDATE simtime_leader SET expires_at = NOW() + 30s
        PG-->>I1: Success (lease renewed)
    end

    Note over I2,I3: Followers poll for leader changes (no LISTEN/NOTIFY)
    loop Every 10 seconds (polling loop)
        I2->>PG: SELECT leader, time_state FROM simtime_leader, simtime_state
        PG-->>I2: Current leader and state
        I3->>PG: SELECT leader, time_state FROM simtime_leader, simtime_state
        PG-->>I3: Current leader and state
    end

    
2. Follower Time Synchronization


      sequenceDiagram
    participant F as Follower Instance
    participant SC as SimTimeClient
    participant L as Leader Instance
    participant LS as Leader SimTimeServer

    Note over F,LS: Initial Synchronization
    F->>SC: Initialize(leaderEndpoint)
    SC->>SC: StartSyncing()

    loop Every 60 seconds
        SC->>L: SyncNoStreamRequest(T1=local_time)
        L->>LS: OnTimeSync(T1)
        LS->>LS: T2=DateTime.UtcNow
        LS-->>L: TimeSyncResponse(T2)
        L-->>SC: Response(ServerTime=T2, RTT)
        SC->>SC: Calculate offset = T2 - (T1 + RTT/2)
        SC->>SC: Store serverTimeDifference
    end

    SC->>L: GetTimeDataRequest()
    L->>LS: GetTimeData()
    LS-->>L: TimeData(epoch, multiplier, paused)
    L-->>SC: TimeDataResponse
    SC->>SC: Store TimeData
    Note over SC: Now synchronized

    
3. Client Request Handling - Leader vs Follower


      sequenceDiagram
    participant GC as Game Client
    participant LB as Load Balancer
    participant F as Follower
    participant L as Leader
    participant SC as SimTimeClient

    alt GetTimeData Request to Follower
        GC->>LB: GetTimeData()
        LB->>F: GetTimeData()
        F->>SC: ServerNow
        SC->>SC: Calculate: lastSync + offset + elapsed
        SC-->>F: Synchronized time
        F-->>GC: TimeDataResponse
        Note over GC,F: No network call to leader!
    else GetTimeData Request to Leader
        GC->>LB: GetTimeData()
        LB->>L: GetTimeData()
        L->>L: Return local state
        L-->>GC: TimeDataResponse
    end

    
4. Time Manipulation Flow


      sequenceDiagram
    participant Admin as Admin Client
    participant F as Follower
    participant L as Leader
    participant PG as PostgreSQL
    participant NATS as NATS
    participant GC as Game Clients
    participant F2 as Other Followers

    alt Request to Follower
        Admin->>F: Pause()
        F->>F: Check if leader
        Note over F: Not leader - proxy to leader
        F->>L: Proxy: Pause()
        L->>L: UpdateState(paused=true)
        L->>PG: INSERT INTO simtime_events(...)
        L->>PG: UPDATE simtime_state SET ...
        L->>NATS: Publish("simtime.events", newState)
        NATS-->>GC: Broadcast to all game clients
        Note over F,F2: Followers detect change on next poll (0-10s)
        L-->>F: Success
        F-->>Admin: Success
    else Request to Leader
        Admin->>L: Pause()
        L->>L: UpdateState(paused=true)
        L->>PG: INSERT INTO simtime_events(...)
        L->>PG: UPDATE simtime_state SET ...
        L->>NATS: Publish("simtime.events", newState)
        NATS-->>GC: Broadcast to all game clients
        Note over F,F2: Followers detect change on next poll (0-10s)
        L-->>Admin: Success
    end

    
5. Leader Failure & Recovery


      sequenceDiagram
    participant L as Leader (Instance 1)
    participant PG as PostgreSQL
    participant F1 as Follower 1
    participant F2 as Follower 2
    participant GC as Game Clients

    Note over L: Leader healthy, renewing lease
    loop Every 15 seconds
        L->>PG: UPDATE simtime_leader SET expires_at = NOW() + 30s
        PG-->>L: Success
        L->>PG: INSERT INTO simtime_state (snapshot)
        PG-->>L: State saved
    end

    Note over L: Leader crashes!
    L-xL: Crash/Network Issue

    Note over PG: After 30 seconds, TTL expires
    Note over F1,F2: Followers detect on next poll (0-10s)

    loop Every 10 seconds (polling)
        F1->>PG: SELECT * FROM simtime_leader WHERE expires_at > NOW()
        PG-->>F1: No active leader (TTL expired)
        Note over F1: Detected leader failure
        break Leader gone
        end
    end

    Note over F1,F2: Race to become leader
    F1->>PG: INSERT ... ON CONFLICT UPDATE WHERE expires_at < NOW()
    PG-->>F1: Success (became_leader = true)

    F1->>PG: SELECT * FROM simtime_state WHERE is_current = true
    PG-->>F1: Latest snapshot
    Note over F1: Becomes new Leader

    F2->>PG: INSERT ... ON CONFLICT UPDATE WHERE expires_at < NOW()
    PG-->>F2: Failed (became_leader = false, leader_endpoint = Follower 1)
    Note over F2: Remains Follower
    F2->>F1: StartSyncing() with new leader (no extra query needed!)

    Note over GC: Continuous service during transition
    GC->>F2: GetTimeData()
    F2->>F2: Serve from cached/extrapolated time
    F2-->>GC: TimeDataResponse

    
6. Time Extrapolation in Followers


      sequenceDiagram
    participant GC as Game Client
    participant F as Follower
    participant SC as SimTimeClient
    participant Cache as Local State

    Note over Cache: Last sync: T0, Offset: +50ms

    GC->>F: GetTimeData() at T0+30s
    F->>SC: ServerNow
    SC->>Cache: Get last sync data
    Cache-->>SC: lastSync=T0, offset=50ms
    SC->>SC: elapsed = Now() - lastSync = 30s
    SC->>SC: time = T0 + offset + elapsed
    SC->>SC: Apply multiplier (e.g., 24x)
    SC->>SC: simTime = T0 + (30s * 24) = T0+12min
    SC-->>F: Calculated SimTime
    F-->>GC: TimeDataResponse(T0+12min)

    Note over GC,Cache: No network call needed!
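
The extrapolation step in this diagram can be sketched as a pure function. This is an illustration under stated assumptions rather than the SimTimeClient implementation: the names are hypothetical, the offset correction is assumed to be already folded into the sim time recorded at the last sync, and a paused clock is assumed not to advance.

```csharp
using System;

// Hypothetical helper showing the extrapolation arithmetic from the diagram.
public static class SimTimeExtrapolation
{
    // simTimeAtLastSync: simulated time known at the last successful sync
    // realElapsed:       real (wall-clock) time that has passed since that sync
    // multiplier:        how many simulated seconds pass per real second
    public static DateTime Extrapolate(
        DateTime simTimeAtLastSync, TimeSpan realElapsed, long multiplier, bool isPaused)
    {
        if (isPaused)
        {
            return simTimeAtLastSync; // A paused clock does not advance.
        }

        return simTimeAtLastSync.AddTicks(realElapsed.Ticks * multiplier);
    }
}
```

For the numbers in the diagram, 30 real seconds at a 24x multiplier yields 12 simulated minutes. In a real service the elapsed time would ideally come from a monotonic source such as `System.Diagnostics.Stopwatch`, so that local clock adjustments do not distort the result.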

    
7. Complete System Overview


      graph TB
    subgraph "PostgreSQL Infrastructure"
        PG[PostgreSQL Database]
        LT[simtime_leader table<br/>Leader Election]
        ST[simtime_state table<br/>State Persistence]
        ET[simtime_events table<br/>Audit Trail]
    end

    subgraph "NATS Infrastructure"
        NATS[NATS Pub/Sub<br/>Client Notifications]
    end

    subgraph "Leader Instance"
        L[SimTimeServer<br/>Source of Truth]
        LG[gRPC Endpoint]
        LE[Leader Election<br/>Module]
    end

    subgraph "Follower Instance 1"
        F1[SimTimeClient]
        F1G[gRPC Endpoint]
        F1E[Election Watcher]
    end

    subgraph "Follower Instance 2"
        F2[SimTimeClient]
        F2G[gRPC Endpoint]
        F2E[Election Watcher]
    end

    subgraph "Game Services"
        GS1[Game Service 1]
        GS2[Game Service 2]
        GS3[Game Service 3]
    end

    PG --> LT
    PG --> ST
    PG --> ET

    LE -.->|Renew Lease| LT
    F1E -.->|Poll| LT
    F2E -.->|Poll| LT

    L -->|Save Snapshots| ST
    L -->|Log Events| ET
    L -->|Broadcast Events| NATS
    NATS -.->|Time Updates| GS1
    NATS -.->|Time Updates| GS2
    NATS -.->|Time Updates| GS3

    F1 -->|Sync Time| LG
    F2 -->|Sync Time| LG

    GS1 -->|GetTime| F1G
    GS2 -->|GetTime| F2G
    GS3 -->|GetTime| LG

    style L fill:#f96,stroke:#333,stroke-width:4px
    style F1 fill:#9cf,stroke:#333,stroke-width:2px
    style F2 fill:#9cf,stroke:#333,stroke-width:2px
    style PG fill:#326ce5,stroke:#333,stroke-width:2px
    style NATS fill:#27aae1,stroke:#333,stroke-width:2px

    
These diagrams illustrate the key interaction patterns in the distributed SimTime architecture:

Leader Election: Shows how instances compete for leadership and maintain it through lease renewal
Time Synchronization: Details the NTP-style algorithm used by followers to sync with the leader
Request Handling: Demonstrates how both leaders and followers serve client requests
Time Manipulation: Shows command routing for administrative operations
Failure Recovery: Illustrates the seamless transition when a leader fails
Time Extrapolation: Explains how followers calculate time locally without network calls
System Overview: Provides a high-level view of all components and their relationships


Full Restart Recovery & State Persistence

The Challenge

When all SimTime servers restart simultaneously (e.g., cluster restart, power failure, Kubernetes namespace recreation), we face critical challenges:

Complete State Loss - All in-memory time state vanishes
Time Regression - Could jump from Day 15 back to Day 1
Client Desynchronization - Clients have different time than restarted servers
Lost Manipulations - Paused state, multiplier changes are forgotten

The Solution: Persistent State Snapshots

State Persistence Strategy

// Leader saves snapshots every 10 seconds
public class TimeSnapshotManager {
    private readonly IDbConnection db;
    private readonly Timer snapshotTimer;
    private readonly long fencingToken;

    public async Task SaveSnapshot() {
        var sql = @"
            -- Mark previous snapshots as not current
            UPDATE simtime_state SET is_current = false WHERE is_current = true;

            -- Insert new snapshot
            INSERT INTO simtime_state (
                sim_time_ticks,
                epoch_ticks,
                multiplier,
                paused_timestamp_ticks,
                is_current,
                saved_by,
                fencing_token
            ) VALUES (
                @simTimeTicks,
                @epochTicks,
                @multiplier,
                @pausedTimestampTicks,
                true,
                @savedBy,
                @fencingToken
            )";

        await db.ExecuteAsync(sql, new {
            simTimeTicks = GetCurrentSimTime().Ticks,
            epochTicks = currentEpoch.Ticks,
            multiplier = currentMultiplier,
            pausedTimestampTicks = pausedTimestamp?.Ticks,
            savedBy = instanceId,
            fencingToken = this.fencingToken
        });
    }
}
Recovery Protocol on Startup

public async Task<TimeData> RecoverTimeState() {
    // 1. Try to load from PostgreSQL snapshot
    var sql = @"
        SELECT
            saved_at,
            sim_time_ticks,
            epoch_ticks,
            multiplier,
            paused_timestamp_ticks
        FROM simtime_state
        WHERE is_current = true
        ORDER BY saved_at DESC
        LIMIT 1";

    var snapshot = await db.QuerySingleOrDefaultAsync<dynamic>(sql);

    if (snapshot != null) {
        var savedAt = (DateTime)snapshot.saved_at;
        var age = DateTime.UtcNow - savedAt;

        if (age < TimeSpan.FromMinutes(5)) {
            // 2. Extrapolate from snapshot. A paused clock must not advance,
            //    so elapsed time is only added when no pause timestamp was saved.
            var simTime = new DateTime((long)snapshot.sim_time_ticks);
            bool wasPaused = snapshot.paused_timestamp_ticks != null;
            var extrapolatedTime = wasPaused
                ? simTime
                : simTime.AddTicks(age.Ticks * (long)snapshot.multiplier);

            Logger.Info($"Recovered from snapshot: {savedAt}, extrapolated {age}");
            return new TimeData {
                Epoch = new DateTime((long)snapshot.epoch_ticks),
                Multiplier = snapshot.multiplier,
                CurrentTime = extrapolatedTime,
                PausedTimestamp = wasPaused
                    ? new DateTime((long)snapshot.paused_timestamp_ticks)
                    : (DateTime?)null
            };
        }
    }

    // 3. Fallback to environment variables
    Logger.Warn("No recent snapshot found, initializing from environment");
    return InitializeFromEnvironment();
}
Full Restart Sequence Diagram


      sequenceDiagram
    participant PG as PostgreSQL
    participant I1 as Instance 1
    participant I2 as Instance 2
    participant I3 as Instance 3
    participant GC as Game Clients

    Note over I1,I3: All instances start simultaneously

    par Instance 1 Recovery
        I1->>PG: SELECT * FROM simtime_state WHERE is_current = true
        PG-->>I1: Snapshot (saved 30s ago)
        I1->>I1: Extrapolate: saved_time + 30s
    and Instance 2 Recovery
        I2->>PG: SELECT * FROM simtime_state WHERE is_current = true
        PG-->>I2: Snapshot (saved 30s ago)
        I2->>I2: Extrapolate: saved_time + 30s
    and Instance 3 Recovery
        I3->>PG: SELECT * FROM simtime_state WHERE is_current = true
        PG-->>I3: Snapshot (saved 30s ago)
        I3->>I3: Extrapolate: saved_time + 30s
    end

    Note over I1,I3: All instances have consistent time

    I1->>PG: INSERT INTO simtime_leader ... ON CONFLICT UPDATE
    PG-->>I1: Success - Becomes Leader
    I2->>PG: INSERT INTO simtime_leader ... ON CONFLICT UPDATE
    PG-->>I2: Failed - Becomes Follower
    I3->>PG: INSERT INTO simtime_leader ... ON CONFLICT UPDATE
    PG-->>I3: Failed - Becomes Follower

    Note over I1: Leader starts snapshot timer
    loop Every 10 seconds
        I1->>PG: INSERT INTO simtime_state (snapshot)
        PG-->>I1: State persisted
    end

    Note over GC: Clients detect time continuity
    GC->>I2: GetTimeData()
    I2-->>GC: TimeData (continuous from before restart)
    GC->>GC: Small adjustment, no major jump

    Note over PG: Key advantage: Backup includes time state!
    PG->>PG: pg_dump includes simtime tables

    
Graceful Shutdown Protocol

public class GracefulShutdown {
    public async Task OnShutdown() {
        if (IsLeader) {
            // Save final snapshot before shutdown
            await SaveSnapshot();

            // Mark shutdown as clean in audit log
            await db.ExecuteAsync(@"
                INSERT INTO simtime_events (
                    event_type,
                    new_value,
                    performed_by,
                    instance_id
                ) VALUES (
                    'clean_shutdown',
                    @newValue::jsonb,
                    @performedBy,
                    @instanceId
                )", new {
                    newValue = JsonSerializer.Serialize(new {
                        timestamp = DateTime.UtcNow,
                        lastTime = GetCurrentSimTime()
                    }),
                    performedBy = instanceId,
                    instanceId = instanceId
                });
        }

        // Stop time advancement
        PauseTimeAdvancement();

        // Wait for pending operations
        await WaitForPendingOperations();
    }
}
Client Resilience for Time Jumps

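public class TimeJumpDetection {
    private DateTime? lastServerTime;  // null until the first update is seen
    private readonly TimeSpan maxAcceptableJump = TimeSpan.FromMinutes(1);

    // Raised with (previous, new) server time when a discontinuity is detected
    public event Action<DateTime, DateTime>? OnTimeDiscontinuity;

    public async Task<bool> ValidateTimeUpdate(DateTime newServerTime) {
        // First update: nothing to compare against, so accept it as the baseline
        if (lastServerTime is null) {
            lastServerTime = newServerTime;
            return true;
        }

        var previous = lastServerTime.Value;
        var timeDiff = (newServerTime - previous).Duration();

        // Later updates are always compared against the most recent server time
        lastServerTime = newServerTime;

        if (timeDiff > maxAcceptableJump) {
            Logger.Warn($"Large time jump detected: {timeDiff.TotalSeconds}s");

            // Force complete resynchronization
            await ForceCompleteResync();

            // Notify game systems of time discontinuity
            OnTimeDiscontinuity?.Invoke(previous, newServerTime);

            return false; // Reject normal update
        }

        return true; // Accept update
    }
}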
Configuration Additions

New environment variables for production:
# Snapshot Configuration
SIMTIME_SNAPSHOT_INTERVAL: "10"        # seconds between snapshots
SIMTIME_SNAPSHOT_RETENTION: "3600"     # keep snapshots for 1 hour
SIMTIME_RECOVERY_MAX_AGE: "300"        # max age of snapshot to use (5 min)

# Recovery Behavior
SIMTIME_ALLOW_ENV_FALLBACK: "true"     # fallback to env vars if no snapshot
SIMTIME_REQUIRE_CLEAN_SHUTDOWN: "false" # require clean shutdown marker
Recovery Scenarios


| Scenario | Recovery Method | Time Continuity |
|---|---|---|
| Clean restart (< 1 min) | Latest snapshot + extrapolation | Perfect continuity |
| Unclean restart (< 5 min) | Latest snapshot + extrapolation | Near-perfect (< 1s drift) |
| Long outage (> 5 min) | Environment variables + warning | Possible time jump |
| PostgreSQL data lost | Environment variables + alert | Time reset to Day 1 |
| Partial restart | Active instances provide time | Perfect continuity |
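
The choice between snapshot recovery and the environment fallback in the scenarios above can be sketched as a small decision function. The names here are hypothetical; `maxAge` and `allowEnvFallback` stand in for `SIMTIME_RECOVERY_MAX_AGE` and `SIMTIME_ALLOW_ENV_FALLBACK`, and the `Refuse` outcome is an assumption, since the document does not say what happens when the fallback is disabled and no usable snapshot exists.

```csharp
using System;

public enum RecoverySource { Snapshot, Environment, Refuse }

public static class RecoveryPolicy {
    // Decide where to initialize time from, given the newest snapshot's age
    // (null means no snapshot was found at all).
    public static RecoverySource Choose(TimeSpan? snapshotAge, TimeSpan maxAge, bool allowEnvFallback) {
        if (snapshotAge is TimeSpan age && age <= maxAge) {
            return RecoverySource.Snapshot;     // recent enough to extrapolate from
        }

        // Snapshot missing or too old: fall back only if configuration allows it
        return allowEnvFallback ? RecoverySource.Environment : RecoverySource.Refuse;
    }
}
```

With the defaults listed earlier (300-second maximum age, fallback enabled), a 30-second-old snapshot yields `Snapshot` and a 10-minute-old one yields `Environment`.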


Monitoring & Alerts

Alerts to Configure:
  - snapshot_age_high: Snapshot older than 60 seconds
  - recovery_from_env: Had to use environment variables
  - large_time_jump: Time jumped > 60 seconds
  - snapshot_save_failed: Failed to persist snapshot

Kubernetes Integration & Pod Discovery

The Challenge: Self-Identification in Kubernetes Deployments

When using Kubernetes Deployments (your approach), SimTime instances face unique challenges:

Pods are not told their own name or IP address unless it is injected into the container
Pod IPs are ephemeral and change on every restart
Pod names are unpredictable (e.g., simtime-server-7b4d9c-x2f3)
A Service ClusterIP load-balances across all pods, so it cannot be used to reach the leader (followers need the specific leader pod IP)

Solution: Kubernetes Downward API with Deployments

Since you're using Deployments, the Kubernetes Downward API is essential for pod self-discovery:
Deployment Configuration

apiVersion: apps/v1
kind: Deployment
metadata:
  name: simtime-server
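spec:
  replicas: 3
  selector:
    matchLabels:
      app: simtime-server   # label name is an example; must match the template labels
  template:
    metadata:
      labels:
        app: simtime-server
    spec: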
      containers:
      - name: simtime
        image: simtime:latest
        env:
        # Pod identification
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        # Service discovery
        - name: SERVICE_NAME
          value: "simtime-internal"
        - name: GRPC_PORT
          value: "50051"
Leader Registration Code

public class LeaderElectionService {
    private readonly string instanceId;
    private readonly string instanceEndpoint;

    public LeaderElectionService(IConfiguration config) {
        // Get pod-specific information from Downward API
        var podName = config["POD_NAME"]
            ?? throw new Exception("POD_NAME not set - check Downward API config");
        var podIp = config["POD_IP"]
            ?? throw new Exception("POD_IP not set - check Downward API config");
        var grpcPort = config["GRPC_PORT"] ?? "50051";

        // Build unique instance ID and endpoint
        this.instanceId = podName;  // e.g., "simtime-server-abc123"
        this.instanceEndpoint = $"{podIp}:{grpcPort}";  // e.g., "10.244.1.5:50051"
    }

    public async Task<bool> TryBecomeLeader() {
        var sql = @"
            INSERT INTO simtime_leader (
                resource_name,
                instance_id,
                instance_endpoint,  -- Followers will connect here
                expires_at,
                metadata
            ) VALUES (
                'simtime',
                @instanceId,        -- simtime-server-abc123
                @instanceEndpoint,  -- 10.244.1.5:50051
                NOW() + INTERVAL '30 seconds',
                @metadata::jsonb
            )
            ON CONFLICT (resource_name) DO UPDATE SET
                instance_id = @instanceId,
                instance_endpoint = @instanceEndpoint,
                expires_at = NOW() + INTERVAL '30 seconds'
            WHERE simtime_leader.expires_at < NOW()
            RETURNING instance_id = @instanceId as became_leader";

        var metadata = new {
            pod_name = instanceId,
            pod_ip = instanceEndpoint.Split(':')[0],
            version = Assembly.GetExecutingAssembly().GetName().Version?.ToString()
        };

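        // The statement returns no row when another instance holds a live lease
        // (the ON CONFLICT ... WHERE filter skips the update), so treat
        // "no row" as "did not become leader" instead of throwing
        var becameLeader = await db.QuerySingleOrDefaultAsync<bool?>(sql, new {
            instanceId,
            instanceEndpoint,
            metadata = JsonSerializer.Serialize(metadata)
        });
        return becameLeader ?? false;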
    }
}
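
The lease rules encoded in the SQL above can be illustrated without a database. The sketch below is an in-memory stand-in for the single `simtime` row, not part of the proposed service; the class and method names are hypothetical, and the 30-second TTL matches the interval used in the query.

```csharp
using System;

// In-memory stand-in for the 'simtime' row in simtime_leader (illustration only).
public sealed class LeaseRow {
    private static readonly TimeSpan Ttl = TimeSpan.FromSeconds(30);

    public string? HolderId { get; private set; }
    public DateTime ExpiresAtUtc { get; private set; }

    // Mirrors INSERT ... ON CONFLICT DO UPDATE ... WHERE expires_at < NOW():
    // succeeds only when the row is empty or the current lease has expired.
    public bool TryAcquire(string instanceId, DateTime nowUtc) {
        if (HolderId != null && ExpiresAtUtc >= nowUtc) {
            return false;  // a live lease exists
        }
        HolderId = instanceId;
        ExpiresAtUtc = nowUtc + Ttl;
        return true;
    }

    // Mirrors the renewal UPDATE: only the current holder can extend the lease.
    public bool TryRenew(string instanceId, DateTime nowUtc) {
        if (HolderId != instanceId) {
            return false;  // someone else took over; caller must step down
        }
        ExpiresAtUtc = nowUtc + Ttl;
        return true;
    }
}
```

A failed `TryRenew` is the signal that leadership was lost; a failed `TryAcquire` simply means the instance should follow the current holder.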
Follower Connection (No Separate Discovery Needed!)

public class FollowerService {
    private GrpcChannel? leaderChannel;
    private SimTimeClient simTimeClient;

    public async Task ConnectToLeader(string leaderEndpoint) {
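        // The leader endpoint comes from the election result (or from
        // CheckLeaderChange() below), so no extra database query is needed here

        // Drop any channel to a previous leader before reconnecting
        leaderChannel?.Dispose();

        // Connect directly to leader pod IP over plaintext HTTP/2
        // e.g., http://10.244.1.5:50051
        leaderChannel = GrpcChannel.ForAddress($"http://{leaderEndpoint}");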

        // Initialize SimTimeClient for time synchronization
        var transport = new GrpcSimTimeTransport(leaderChannel);
        simTimeClient = new SimTimeClient(transport);
        simTimeClient.StartSyncing();

        Logger.Info($"Connected to leader at {leaderEndpoint}");
    }

    // If leader changes, we need to check periodically
    public async Task<string?> CheckLeaderChange() {
        var sql = @"
            SELECT instance_endpoint
            FROM simtime_leader
            WHERE resource_name = 'simtime'
                AND expires_at > NOW()";

        return await db.QuerySingleOrDefaultAsync<string>(sql);
    }
}
Important: Why Deployments Work Despite Ephemeral IPs

With Deployments, pod IPs change on every restart, but this is not a problem because:

Leader re-registers its new IP on startup when acquiring leadership
Followers re-discover the leader's new endpoint from PostgreSQL
TTL-based leases ensure stale endpoints are automatically cleaned up
Connection retry logic handles IP changes gracefully

// Followers automatically handle leader IP changes
public async Task MaintainLeaderConnection() {
    while (running) {
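        var currentEndpoint = await DiscoverLeader();
        // null means no live lease right now (election in progress):
        // keep the existing connection and check again on the next poll
        if (currentEndpoint != null && currentEndpoint != lastKnownEndpoint) {
            // Leader endpoint changed (pod restarted or new leader elected)
            await ReconnectToLeader(currentEndpoint);
            lastKnownEndpoint = currentEndpoint;
        }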
        await Task.Delay(TimeSpan.FromSeconds(10));
    }
}
Development vs Production with Deployments

public class EndpointConfiguration {
    public static string GetInstanceEndpoint(IConfiguration config) {
        var env = config["ASPNETCORE_ENVIRONMENT"];
        var port = config["GRPC_PORT"] ?? "50051";  // same port setting in every environment

        if (env == "Development") {
            // Local development: use localhost
            return $"localhost:{port}";
        } else if (config["POD_IP"] != null) {
            // Kubernetes Deployment: use pod IP (your primary scenario)
            // This IP is ephemeral and will change on pod restart
            return $"{config["POD_IP"]}:{port}";
        } else {
            // Fallback for other environments (Docker, VM, etc.)
            return $"{Environment.MachineName}:{port}";
        }
    }
}
Deployment-Specific Considerations

Since you're using Deployments:


Pod IP Changes: Every pod restart gets a new IP
  - Solution: Leader updates PostgreSQL on startup
  - Followers poll for endpoint changes

Rolling Updates: During deployment, pods cycle through
  - Old leader continues until new pod is ready
  - New pod tries to become leader
  - Seamless transition via TTL-based leases

Scale Events: When scaling up/down
  - New pods become followers automatically
  - If leader is scaled down, election happens within TTL window

No Persistent Identity: Pods have random suffixes
  - Use pod name as instance ID for uniqueness
  - Don't rely on pod name for network discovery


Key Points


Downward API is Essential: It is the simplest reliable way for a pod to learn its own name and IP address
Direct Pod Connection: Followers must connect to the specific leader pod, not a service
Ephemeral IPs Are Fine: Leader re-registration and TTL cleanup handle IP changes automatically
Graceful Degradation: Code handles both Kubernetes and local development scenarios
Metadata Storage: Leader table stores both instance ID and endpoint for observability


Implementation Details

Phase 1: PostgreSQL Schema & Leader Election

Tasks:
  - Create PostgreSQL schema (simtime_leader, simtime_state, simtime_events)
  - Implement lock table-based leader election
  - Add lease renewal mechanism with TTL
  - Add observability queries and metrics
Phase 2: Follower Client Mode

Tasks:
  - Create GrpcSimTimeTransport implementing ISimTimeClientTransport
  - Integrate SimTimeClient into follower instances
  - Override time serving methods to use client
  - Add sync quality metrics
Phase 3: State Persistence & Recovery

Tasks:
  - Implement PostgreSQL snapshot persistence
  - Add recovery from snapshots on startup
  - Implement graceful shutdown with final snapshot
  - Add audit trail for time manipulations
Phase 4: Testing & Rollout

Tasks:
  - Chaos testing (leader failures, network partitions)
  - Load testing with multiple instances
  - Gradual rollout with feature flags
  - Monitor time consistency metrics

Advantages

1. Unified Backup/Restore ✅

# Backup includes everything! (custom-format archive, as required by pg_restore)
pg_dump -Fc production > backup.dump

# Restore includes time state automatically
pg_restore -d production backup.dump
# Time state is restored to the exact backup point!
2. Eliminates Clock Skew ✅

Traditional Problem:
Instance A clock: 14:00:00.000
Instance B clock: 14:00:00.050
Result: 50ms time difference!

Our Solution:
Both instances: 14:00:00.000 (synchronized to leader, target < 10ms difference)

3. High Availability ✅


Leader failure → New election in ~30 seconds
Followers continue serving during election
Zero downtime deployments possible
PostgreSQL HA as foundation

4. Production Observability ✅

-- Real-time monitoring
SELECT instance_id, expires_at - NOW() as ttl FROM simtime_leader;

-- Audit trail included
SELECT * FROM simtime_events WHERE event_time > NOW() - INTERVAL '1 hour';
5. Proven Technology ✅


PostgreSQL lock tables are battle-tested
SimTimeClient has been used in production for years
The NTP algorithm is an industry standard
Your team already has PostgreSQL expertise

6. Horizontal Scaling ✅

Single Instance: 10,000 clients, one sync per client every 60s → 167 req/s → Overloaded
Multi-Instance:  10,000 clients ÷ 5 instances = 2,000 clients each → 33 req/s per instance → Smooth


Load distributes across all instances
Each instance handles subset of clients
Add more instances as player base grows
No single bottleneck
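
The load figures above can be reproduced with a short calculation. This sketch assumes each client syncs once per interval and that clients are spread evenly across instances; the 60-second interval is inferred from the 167 req/s figure rather than stated in this section, and the helper name is hypothetical.

```csharp
using System;

public static class SyncLoad {
    // Requests per second each instance sees when every client syncs once per
    // interval and clients are distributed evenly across instances.
    public static double RequestsPerSecondPerInstance(int clients, int instances, TimeSpan syncInterval) {
        if (instances <= 0) {
            throw new ArgumentOutOfRangeException(nameof(instances));
        }
        if (syncInterval <= TimeSpan.Zero) {
            throw new ArgumentOutOfRangeException(nameof(syncInterval));
        }

        return clients / (double)instances / syncInterval.TotalSeconds;
    }
}
```

For 10,000 clients and a 60-second interval this gives roughly 167 req/s on one instance and roughly 33 req/s per instance across five, matching the numbers quoted above.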

7. Performance ✅


Followers calculate time locally (no network calls)
Sync happens every 60 seconds (configurable)
Scales linearly with instances
Leader election returns current leader in single query (no extra round trip)

8. Simplified Coordination via Polling ✅


No persistent database connections needed
No LISTEN/NOTIFY complexity or reconnection logic
Works perfectly with connection pooling
Predictable 10-second detection latency is acceptable for rare events
Easier to monitor, debug, and test

9. Transactional Consistency ✅

BEGIN;
INSERT INTO game_events (...);
UPDATE simtime_state SET ...;
COMMIT;  -- Atomic operation

Disadvantages & Mitigations

1. Leader Election Complexity ⚠️

Issue: Distributed consensus is complex
Mitigation:

Use proven PostgreSQL lock tables with TTL
Simple lease-based approach
Pure polling (no LISTEN/NOTIFY complexity)
Extensive testing of edge cases

2. Brief Inconsistency During Leader Change ⚠️

Issue: Up to 30 second window for new leader election
Mitigation:

Followers serve last known good time
Time continues advancing via extrapolation
New leader elected quickly

3. Additional Network Traffic ⚠️

Issue: Followers sync with leader periodically
Mitigation:

Sync interval configurable (default 60s)
Minimal data transfer (~100 bytes)
Use gRPC streaming for efficiency

4. Complexity vs Single Instance ⚠️

Issue: More moving parts than single instance
Mitigation:

Comprehensive monitoring and alerting
Distributed tracing for debugging
Fallback to single-instance mode if needed


Alternative Approaches Considered

1. NATS KV for State Storage ❌

Why Rejected for State Storage (still used for client broadcasting):

No read-after-write consistency for critical state
Follower lag issues for leader election
Incompatible with immediate consistency needs
Separate backup/restore from PostgreSQL
Would split critical state across two systems

Note: NATS is still used for broadcasting time events to game clients, where its excellent fan-out performance shines
2. Redis for State Storage ❌

Why Rejected:

Adds new infrastructure dependency
Separate backup strategy needed
Not aligned with unified datastore strategy
Team lacks Redis expertise

3. Client-Side Time Calculation ❌

Why Rejected:

Pushes complexity to every client
Hard to coordinate time manipulation
Difficult to debug time issues

4. Keep Single Instance ❌

Why Rejected:

Doesn't solve SPOF issue
No horizontal scaling
Risky for production system


Risk Analysis


| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Leader election fails | Low | High | Manual override, alerting |
| Clock drift during partition | Medium | Medium | Shorter sync intervals |
| Follower sync failures | Low | Low | Mark instance unhealthy |
| Time jumps on leader change | Low | High | Smooth convergence algorithm |
| Performance degradation | Low | Medium | Monitoring, auto-scaling |
| Full cluster restart | Low | Critical | Persistent snapshots, recovery protocol |
| PostgreSQL data loss | Very Low | Critical | Environment var fallback, alerting |
| Snapshot corruption | Very Low | High | Multiple snapshot versions, validation |
| Long outage (> 5 min) | Low | High | Client resync protocol, time jump detection |
| Graceless shutdown | Medium | Medium | Regular snapshots, extrapolation on recovery |


Critical Architecture Issues & Solutions

This section identifies critical safety issues that must be addressed before production deployment, along with their solutions.
1. Split-Brain Vulnerability 🧠⚠️

The Problem:
During network partitions, the system can end up with multiple leaders:
Time 0:   Leader renewing every 15s (30s lease), Followers polling every 10s
Time 10:  Network partition occurs
          - Leader → PostgreSQL: ❌ (lease renewals fail)
          - Followers → PostgreSQL: ✅ (still works)
          - Game clients → Leader: ✅ (still served)
Time 30:  Lease expires in PostgreSQL; the isolated leader has not stepped down
Time 40:  Follower A acquires the expired lease and becomes new leader
Time 45:  Network partition heals
Result:   TWO LEADERS! Original leader + Follower A

Why it's dangerous:

Game clients could get different times from different "leaders"
Time manipulations could be lost or duplicated
Data corruption in time-dependent game events
Inconsistent event timers across the game

The Solution - Monotonic Fencing Tokens:
// Every state update must check fencing token
public async Task<bool> AcceptTimeUpdate(TimeState newState) {
    // Reject updates from superseded leaders. The current leader sends the
    // same token on every update, so an equal token must be accepted.
    if (newState.fencing_token < lastKnownFencingToken) {
        Logger.Warn($"Rejecting stale leader update: {newState.fencing_token} < {lastKnownFencingToken}");
        return false;
    }

    lastKnownFencingToken = newState.fencing_token;
    ApplyTimeUpdate(newState);
    return true;
}

// Leader must include fencing token in all responses
public TimeDataResponse GetTimeData() {
    return new TimeDataResponse {
        Time = CalculateCurrentTime(),
        FencingToken = this.currentFencingToken,  // Critical!
        LeaderId = this.instanceId
    };
}
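
The follower-side check can be isolated into a small, thread-safe gate. This is a sketch under one assumption that the document does not spell out: the fencing token is incremented once per leadership term (for example, when the lease is acquired), so repeated updates from the same leader carry the same token. The `FencingGate` name is hypothetical.

```csharp
using System;

// Follower-side gate that remembers the highest fencing token seen so far.
public sealed class FencingGate {
    private readonly object gate = new object();
    private long highestToken;

    // Returns true if an update carrying this token may be applied.
    public bool TryAccept(long token) {
        lock (gate) {
            if (token < highestToken) {
                return false;      // issued by a leader that has since been replaced
            }
            highestToken = token;  // same leader (equal) or a newer leader (greater)
            return true;
        }
    }
}
```

Once a follower has seen token 2, any late message from the leader that held token 1 is dropped, which is what stops a healed partition from reintroducing the old leader's time.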
2. Transaction Isolation Missing 💾❌

The Problem:
State updates aren't atomic, creating brief windows with no valid state:
-- These run as separate statements!
UPDATE simtime_state SET is_current = false WHERE is_current = true;  -- Moment 1: No current state!
-- DANGER ZONE: Another instance queries here and gets NULL!
INSERT INTO simtime_state (...) VALUES (...);                         -- Moment 2: State restored
Timeline of Disaster:
Microsecond 1: UPDATE executes (no current state exists!)
Microsecond 2: Follower queries: "SELECT * WHERE is_current = true"
               Returns: NULL! 💥
Microsecond 3: INSERT executes (current state exists again)
Microsecond 4: Follower crashes due to null state

Why it's dangerous:

Brief moments with NO valid time state
Instances could fail to start or crash
Clients could get null responses
Recovery logic might initialize from environment (time jump!)

The Solution - Atomic Transactions:
public async Task SaveSnapshot() {
    var sql = @"
        BEGIN;
        -- Both operations succeed or both fail
        UPDATE simtime_state SET is_current = false WHERE is_current = true;
        INSERT INTO simtime_state (
            sim_time_ticks, epoch_ticks, multiplier, paused_timestamp_ticks,
            is_current, saved_by, fencing_token
        ) VALUES (
            @simTimeTicks, @epochTicks, @multiplier, @pausedTimestampTicks,
            true, @savedBy, @fencingToken
        );
        COMMIT;";

    await db.ExecuteAsync(sql, parameters);
}

// Alternative: Single UPSERT operation
public async Task SaveSnapshotAtomic() {
    var sql = @"
        INSERT INTO simtime_state (id, sim_time_ticks, epoch_ticks, multiplier, is_current)
        VALUES (1, @simTimeTicks, @epochTicks, @multiplier, true)
        ON CONFLICT (id) WHERE is_current = true
        DO UPDATE SET
            sim_time_ticks = @simTimeTicks,
            epoch_ticks = @epochTicks,
            multiplier = @multiplier,  -- without this, multiplier changes are lost on update
            saved_at = NOW()";

    await db.ExecuteAsync(sql, parameters);
}
3. Resource Leaks - Timer Cleanup 🚰

The Problem:
Timers are created but never disposed when losing leadership:
private Timer renewalTimer;

private void StartLeaseRenewal() {
    renewalTimer = new Timer(async _ => {
        // Renews every 15 seconds forever!
    }, null, TimeSpan.FromSeconds(15), TimeSpan.FromSeconds(15));
}

// When leadership lost, timer keeps running!
// After 10 leadership changes = 10 orphaned timers
Accumulation Over Time:
Hour 0:   1 leadership change  = 1 orphaned timer
Hour 1:   3 leadership changes = 3 orphaned timers (4 total)
Hour 6:   2 more changes       = 2 orphaned timers (6 total)
Day 1:    20 changes total     = 20 timers firing every 15s!

Result: 20 timers × 4 queries/min = 80 unnecessary DB queries/min
        Plus growing memory usage!

Why it's dangerous:

Memory usage grows unbounded
Database gets hammered with invalid renewal attempts
Connection pool exhaustion
Can trigger cascading failures under load

The Solution - Proper Cleanup:
public class LeaderElectionService : IDisposable {
    private Timer? renewalTimer;
    private CancellationTokenSource? leadershipCts;

    private void StartLeaseRenewal() {
        // Never stack a second timer on top of one from an earlier term
        StopLeaseRenewal();

        // Create cancellation token for this leadership term. The token is
        // captured locally so the callback never touches a disposed source.
        leadershipCts = new CancellationTokenSource();
        var token = leadershipCts.Token;

        renewalTimer = new Timer(async _ => {
            if (token.IsCancellationRequested) {
                return;  // Stop if cancelled
            }

            try {
                var renewed = await RenewLease();
                if (!renewed) {
                    await OnLostLeadership();
                }
            } catch (Exception ex) {
                Logger.Error($"Lease renewal failed: {ex}");
                await OnLostLeadership();
            }
        }, null, TimeSpan.FromSeconds(15), TimeSpan.FromSeconds(15));
    }

    private void StopLeaseRenewal() {
        // Critical: Clean up resources!
        leadershipCts?.Cancel();
        renewalTimer?.Dispose();
        renewalTimer = null;
        leadershipCts?.Dispose();
        leadershipCts = null;
    }

    private async Task OnLostLeadership() {
        StopLeaseRenewal();

        Logger.Info("Leadership lost, cleaned up resources");
        await TransitionToFollower();
    }

    public void Dispose() {
        // Cancel as well as dispose so an in-flight callback exits early
        StopLeaseRenewal();
    }
}
4. Thundering Herd Problem 🐘🐘🐘

The Problem:
When the leader fails, ALL followers detect it simultaneously and race to become leader:
Leader fails at T=0
At T=10: All 20 followers detect failure simultaneously
At T=10.001: All 20 followers execute:
    INSERT INTO simtime_leader ... ON CONFLICT UPDATE

Database perspective:
- 20 simultaneous complex CTE queries
- All competing for same row lock
- Lock contention causes serialization
- Connection pool exhaustion
- Query queue backs up

Cascade Effect:
T+0ms:    20 queries hit DB simultaneously
T+100ms:  DB CPU = 100%, lock wait queue forming
T+500ms:  Connection pool exhausted
T+1000ms: New queries timeout
T+2000ms: Health checks fail
T+3000ms: Kubernetes restarts "unhealthy" pods
T+4000ms: More instances race for leadership
Result:   Complete system failure

Why it's dangerous:

Database CPU spikes to 100%
All instances freeze waiting for query results
Connection pool exhaustion
Can trigger cascading failures
Game clients timeout and disconnect

The Solution - Jittered Elections:
public class SmartLeaderElection {
    private readonly Random random = new Random();

    public async Task OnLeaderFailureDetected() {
        // Add random jitter to prevent thundering herd
        var baseDelay = 1000;  // 1 second base
        var jitter = random.Next(0, 5000);  // 0-5 seconds random
        var totalDelay = baseDelay + jitter;

        Logger.Info($"Leader failed, waiting {totalDelay}ms before election attempt");
        await Task.Delay(totalDelay);

        // Check if someone else already became leader during our wait
        var currentLeader = await CheckCurrentLeader();
        if (currentLeader != null) {
            Logger.Info($"New leader already elected: {currentLeader}");
            await ConnectToLeader(currentLeader);
            return;
        }

        // Now try to become leader
        await TryBecomeLeader();
    }

    // Even smarter: Exponential backoff for retries
    public async Task ElectionWithBackoff() {
        const int maxAttempts = 5;

        for (var attempt = 0; attempt < maxAttempts; attempt++) {
            var delay = (int)Math.Min(1000 * Math.Pow(2, attempt), 30000);  // Cap at 30s
            var jitter = random.Next(0, delay / 2);

            await Task.Delay(delay + jitter);

            if (await TryBecomeLeader()) {
                return;  // Success: this instance is the leader
            }

            // Losing the race is not an error: follow whoever won
            var currentLeader = await CheckCurrentLeader();
            if (currentLeader != null) {
                await ConnectToLeader(currentLeader);
                return;
            }
        }

        throw new Exception("No leader could be elected after max attempts");
    }
}
Production Scenario: Black Friday Game Event

Without These Fixes:
18:00: Black Friday event starts, 50,000 players online
18:15: Network blip cuts the leader off from PostgreSQL
18:16: Split-brain occurs (Issue #1)
       - Original leader's lease expires, but it keeps serving time
       - Follower A becomes new leader
       - Players get different event end times
18:17: Support tickets: "Why does my friend have different timer?"
18:20: DevOps attempts emergency restart
18:21: All instances restart, try to recover state
       - One instance hits the NULL state window (Issue #2)
       - Instance crashes during initialization
18:22: Instances that lost leadership still run their old renewal timers (Issue #3)
       - Orphaned timers hammering database
18:25: Leader fails under load
18:26: 20 instances thundering herd the database (Issue #4)
       - Database CPU 100%
       - All queries timeout
18:27: Complete game outage
18:45: Emergency war room called
19:30: Manual intervention to recover
20:00: Service restored, but event ruined

With These Fixes:
18:00: Black Friday event starts, 50,000 players online
18:15: Network blip cuts the leader off from PostgreSQL
18:16: Fencing token prevents split-brain
       - Followers reject old leader's updates
       - Clean leader transition occurs
18:17: New leader elected with jitter (no thundering herd)
       - Database load normal
       - Players see consistent times
18:18: Event continues smoothly
       - Monitoring shows brief leader transition
       - No player impact
23:00: Event completes successfully

Implementation Priority

These issues must be fixed in this order:

Transaction Isolation (Easiest, highest impact)
Resource Leaks (Simple fix, prevents accumulation)
Thundering Herd (Moderate complexity, critical for scale)
Split-Brain (Most complex, but essential for correctness)

All fixes should be implemented before production deployment.

Success Metrics

Functional Metrics


✅ Zero time jumps during normal operation
✅ < 10ms time difference between instances
✅ < 30 second leader election time
✅ 99.99% availability
✅ < 1 second time drift after full restart
✅ 100% time continuity with snapshots available

Performance Metrics


✅ < 1ms latency for time queries (local calculation)
✅ < 100ms for time sync operations
✅ Support 100+ instances without degradation
✅ < 50ms snapshot save time
✅ < 100ms recovery time on startup

Operational Metrics


✅ Zero-downtime deployments
✅ Automatic failover on leader failure
✅ Self-healing after network partitions
✅ Successful recovery from full cluster restart
✅ Snapshot age always < 30 seconds
✅ Zero data loss with graceful shutdown


Migration Strategy

Step 1: Deploy in Shadow Mode


Deploy multi-instance version alongside single instance
Compare time outputs, don't serve traffic
Validate consistency

Step 2: Gradual Traffic Shift


10% → 25% → 50% → 100% over 2 weeks
Monitor metrics at each stage
Rollback capability at each step

Step 3: Decommission Single Instance


Keep single instance as emergency fallback
After 30 days stable, remove entirely


Code Examples

Leader Election Implementation

public class PostgresLeaderElection {
    private readonly IDbConnection db;
    private readonly string instanceId;
    private readonly string instanceEndpoint;
    private Timer renewalTimer;

    public async Task<LeaderElectionResult> TryBecomeLeader() {
        var sql = @"
            WITH election_attempt AS (
                INSERT INTO simtime_leader
                (resource_name, instance_id, instance_endpoint, expires_at, heartbeat_at)
                VALUES ('simtime', @instanceId, @endpoint, NOW() + INTERVAL '30 seconds', NOW())
                ON CONFLICT (resource_name) DO UPDATE SET
                    instance_id = @instanceId,
                    instance_endpoint = @endpoint,
                    expires_at = NOW() + INTERVAL '30 seconds',
                    heartbeat_at = NOW()
                WHERE simtime_leader.expires_at < NOW()
                RETURNING instance_id, instance_endpoint
            )
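            -- Aliases match the LeaderElectionResult property names used below
            -- (Dapper matches column names to properties case-insensitively)
            SELECT
                instance_id,
                instance_endpoint AS LeaderEndpoint,
                instance_id = @instanceId AS BecameLeader
            FROM election_attempt
            UNION ALL
            SELECT
                instance_id,
                instance_endpoint AS LeaderEndpoint,
                false AS BecameLeader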
            FROM simtime_leader
            WHERE resource_name = 'simtime'
                AND NOT EXISTS (SELECT 1 FROM election_attempt)
            LIMIT 1";

        var result = await db.QuerySingleAsync<LeaderElectionResult>(sql,
            new { instanceId, endpoint = instanceEndpoint });

        if (result.BecameLeader) {
            StartLeaseRenewal();
            await LogLeaderChange("acquired");
        } else {
            // No need for a second query - we have the leader endpoint!
            await InitializeFollowerMode(result.LeaderEndpoint);
        }

        return result;
    }

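    private void StartLeaseRenewal() {
        // Dispose any timer left from an earlier leadership term (see Resource Leaks)
        renewalTimer?.Dispose();
        renewalTimer = new Timer(async _ => {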
            var renewed = await db.ExecuteAsync(@"
                UPDATE simtime_leader
                SET expires_at = NOW() + INTERVAL '30 seconds',
                    heartbeat_at = NOW()
                WHERE resource_name = 'simtime'
                AND instance_id = @instanceId",
                new { instanceId });

            if (renewed == 0) {
                await OnLostLeadership();
            }
        }, null, TimeSpan.FromSeconds(15), TimeSpan.FromSeconds(15));
    }

    private async Task LogLeaderChange(string eventType) {
        await db.ExecuteAsync(@"
            INSERT INTO simtime_events
            (event_type, instance_id, performed_by)
            VALUES (@eventType, @instanceId, @instanceId)",
            new { eventType = $"leader_{eventType}", instanceId });
    }
}
Follower Transport Implementation

public class FollowerTransport : ISimTimeClientTransport {
    private readonly SimTimeServerExternalServiceClient leaderClient;

    public async Task<TimeSyncResponse> SyncAsync(CancellationToken ct) {
        var request = new SyncNoStreamRequest {
            ClientTimeSendTicks = DateTime.UtcNow.Ticks
        };

        var response = await leaderClient.SyncNoStreamAsync(request, cancellationToken: ct);

        return new TimeSyncResponse {
            Successful = response.Successful,
            ServerTimeTicks = response.ServerTimeTicks,
            ClientTimeSendTicks = response.ClientTimeSendTicks,
            ClientTimeReceivedTicks = DateTime.UtcNow.Ticks
        };
    }

    public async Task<TimeDataResponse> GetTimeDataAsync(CancellationToken ct) {
        var response = await leaderClient.GetTimeDataAsync(new(), cancellationToken: ct);

        return new TimeDataResponse {
            Success = true,
            CurrentEpochTicks = response.CurrentEpochTicks,
            CurrentMultiplier = response.CurrentMultiplier,
            CurrentPausedTimestamp = response.CurrentPausedTimestamp
        };
    }
}
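
The three timestamps returned by `SyncAsync` are enough to estimate the follower's offset from the leader. The SimTimeClient's actual algorithm is not shown in this document; the sketch below is a common single-exchange simplification of the NTP offset calculation, assuming roughly symmetric network delay. The `ClockOffset` helper and its method names are hypothetical.

```csharp
using System;

public static class ClockOffset {
    // Estimate (leader clock - local clock) in ticks from one request/response
    // exchange, assuming the request and response take about equally long.
    public static long EstimateOffsetTicks(long clientSendTicks, long serverTimeTicks, long clientReceivedTicks) {
        long roundTrip = clientReceivedTicks - clientSendTicks;
        long localMidpoint = clientSendTicks + roundTrip / 2;  // local time when the leader likely answered
        return serverTimeTicks - localMidpoint;
    }

    // Convert a local timestamp into leader time using the stored offset.
    public static long ToLeaderTicks(long localTicks, long offsetTicks) {
        return localTicks + offsetTicks;
    }
}
```

Between syncs a follower only needs `ToLeaderTicks(DateTime.UtcNow.Ticks, offset)`, which is why serving time requires no network call.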

Conclusion

This architecture solves the SimTime scaling problem by:

Eliminating SPOF through leader election and failover
Ensuring consistency via SimTimeClient synchronization
Maintaining performance with local time calculation
Reusing proven code from the existing SimTime library

The key insight is treating follower instances as "thick clients" that maintain synchronized time state, allowing them to serve requests locally without hitting the leader for every query.
This design provides the consistency guarantees required for game time synchronization while enabling horizontal scaling and high availability.

Next Steps

1. Review & Approval: Discuss with team, address concerns
2. Proof of Concept: Build minimal version to validate approach
3. Design Review: Detailed technical design document
4. Implementation: Follow phased approach outlined above
5. Testing: Comprehensive testing including chaos engineering
6. Rollout: Gradual deployment with monitoring

References


PostgreSQL Lock Tables
NATS Pub/Sub (for game client notifications)
NTP Protocol Specification
Distributed Systems Clock Synchronization
Internal: SimTimeClient Implementation (library-simtime/src/Klang.Seed.SimTime.Client/)


Document Version: 1.0
Author: Architecture Team
Date: 2025-10-28
Status: DRAFT - For Review

| Component | PostgreSQL | NATS | Our Choice |
| --- | --- | --- | --- |
| Leader Election | Lock tables with TTL | KV with TTL | PostgreSQL ✅ |
| State Persistence | Native tables with ACID | KV Store (eventual) | PostgreSQL ✅ |
| Backup/Restore | Included automatically | Separate system | PostgreSQL ✅ |
| Observability | Rich SQL queries | Limited | PostgreSQL ✅ |
| Instance Coordination | Polling (simple, reliable) | Pub/Sub | PostgreSQL ✅ |
| Game Client Broadcasting | Possible but suboptimal | Excellent fan-out | NATS ✅ |

| Operation | Leader Behavior | Follower Behavior |
| --- | --- | --- |
| GetTimeData | Serve from local state | Serve from synchronized state |
| Sync | Provide time reference | Provide synchronized time |
| Time Manipulation | Execute and broadcast | Proxy to leader |

| Event | Detection Latency | Impact | Mitigation |
| --- | --- | --- | --- |
| Leader failure | 0-10 seconds | Rare event (weekly/monthly) | Followers continue serving |
| Time manipulation | 0-10 seconds | Admin operation, non-critical | Game clients get instant NATS update |
| New leader elected | 0-10 seconds | Follows leader failure | Extrapolation continues working |

| Aspect | Advisory Locks | Lock Table | Our Choice |
| --- | --- | --- | --- |
| Automatic cleanup | ✅ On disconnect | ❌ Manual TTL | Lock Table |
| Survives brief disconnects | ❌ Lost immediately | ✅ TTL-based | Lock Table ✅ |
| Observability | ❌ No metadata | ✅ Full visibility | Lock Table ✅ |
| Debugging | ❌ Can't see holder | ✅ SQL queries | Lock Table ✅ |
| Graceful handoff | ❌ Not possible | ✅ Update owner | Lock Table ✅ |
| Audit trail | ❌ None | ✅ History table | Lock Table ✅ |
| Works with polling | ❌ Requires connection | ✅ Stateless queries | Lock Table ✅ |
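
The lock-table properties compared above (TTL-based expiry, renewal by the holder, and handoff by updating the owner) can be sketched with an in-memory stand-in. This is only an illustration of the lease semantics: `LeaderLease` and its methods are hypothetical names, the real design keeps this state in a PostgreSQL row, and the TTL passed in is whatever value the deployment chooses.

```csharp
#nullable enable
using System;

// In-memory stand-in for a single-row lock table (hypothetical).
public sealed class LeaderLease {
    private string? owner;
    private DateTime expiresAtUtc = DateTime.MinValue;

    public string? Owner => owner;

    // Acquire or renew the lease. Succeeds when the lease is free, expired,
    // or already held by the caller; otherwise the current holder keeps it.
    public bool TryAcquire(string instanceId, DateTime nowUtc, TimeSpan ttl) {
        bool expired = nowUtc >= expiresAtUtc;
        if (owner is null || expired || owner == instanceId) {
            owner = instanceId;
            expiresAtUtc = nowUtc + ttl;
            return true;
        }
        return false;
    }

    // Graceful handoff: a holder with a live lease names its successor.
    public bool TryHandOff(string fromId, string toId, DateTime nowUtc, TimeSpan ttl) {
        if (owner != fromId || nowUtc >= expiresAtUtc) return false;
        owner = toId;
        expiresAtUtc = nowUtc + ttl;
        return true;
    }
}
```

Because ownership depends only on a stored expiry time, a leader that briefly loses its database connection keeps the lease until the TTL runs out, and any instance can check or claim it with a stateless polling query.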

| Scenario | Recovery Method | Time Continuity |
| --- | --- | --- |
| Clean restart (< 1 min) | Latest snapshot + extrapolation | Perfect continuity |
| Unclean restart (< 5 min) | Latest snapshot + extrapolation | Near-perfect (< 1s drift) |
| Long outage (> 5 min) | Environment variables + warning | Possible time jump |
| PostgreSQL data lost | Environment variables + alert | Time reset to Day 1 |
| Partial restart | Active instances provide time | Perfect continuity |

| Risk | Probability | Impact | Mitigation |
| --- | --- | --- | --- |
| Leader election fails | Low | High | Manual override, alerting |
| Clock drift during partition | Medium | Medium | Shorter sync intervals |
| Follower sync failures | Low | Low | Mark instance unhealthy |
| Time jumps on leader change | Low | High | Smooth convergence algorithm |
| Performance degradation | Low | Medium | Monitoring, auto-scaling |
| Full cluster restart | Low | Critical | Persistent snapshots, recovery protocol |
| PostgreSQL data loss | Very Low | Critical | Environment var fallback, alerting |
| Snapshot corruption | Very Low | High | Multiple snapshot versions, validation |
| Long outage (> 5 min) | Low | High | Client resync protocol, time jump detection |
| Graceless shutdown | Medium | Medium | Regular snapshots, extrapolation on recovery |