This document outlines a solution to enable the SimTime service to run safely across multiple instances, eliminating the current single point of failure while maintaining time consistency across the distributed system.
Imagine you have a single master clock in a building that everyone looks at to know the time. If that clock breaks, nobody knows what time it is anymore. That's our current problem - we have only one time server, and if it fails, the entire game loses track of time.
Instead of one clock, we'll have multiple clocks that work together:
**The Leader Clock**
- One clock is chosen as the "leader" - this is the official time
- It's like the principal's clock at school that everyone agrees is correct
- If it breaks, another clock quickly becomes the new leader
**The Follower Clocks**
- All other clocks are "followers" that sync with the leader
- They ask the leader "what time is it?" every minute
- Between asks, they keep track of time on their own (like counting seconds)
- They remember the difference between their time and the leader's time
**How It Stays in Sync**
- When a follower asks for time, it's like calling someone in another timezone
- Follower: "What time do you have?" (sent at 2:00:00)
- Leader: "It's 2:00:05"
- Follower: "Okay, I'm 5 seconds behind you"
- Now the follower adds 5 seconds to its own counting
**Why This Is Smart**
- Game clients can ask ANY clock for the time
- Followers don't need to constantly bother the leader
- If the leader breaks, a follower becomes the new leader within 30 seconds
- Everyone always agrees on what time it is (within milliseconds)
It's like having multiple smartphones that all sync their time with the cell tower. Even if one phone dies, the others keep working. And even if the main cell tower fails, the phones keep showing the right time because they know how to count seconds on their own.
The clever part: We're reusing the same time-sync technology that game clients already use to sync with the server. We're just making some servers act like "super clients" that can also answer time questions from real game clients.
Our technical guidelines state:
"We have a single, unified way of storing data at rest in Postgres databases. Where possible, lean into going along the grain of relational databases since that is where the data lives."
Using PostgreSQL for SimTime coordination provides critical advantages:
**Unified Backup/Restore**

-- When you backup PostgreSQL, time state comes along
pg_dump production > backup.dump
-- Restore includes perfect time state
pg_restore backup.dump
-- SimTime continues from exact backup point!
**Transactional Consistency**

BEGIN;
-- Game event and time update in same transaction
INSERT INTO game_events (...);
UPDATE simtime_state SET ...;
COMMIT; -- Both succeed or both fail
**Zero Additional Infrastructure**
- No NATS to maintain
- No separate backup strategy
- Single connection pool
- One monitoring system
| Component | PostgreSQL | NATS | Our Choice |
|---|---|---|---|
| Leader Election | Lock tables with TTL | KV with TTL | PostgreSQL ✅ |
| State Persistence | Native tables with ACID | KV Store (eventual) | PostgreSQL ✅ |
| Backup/Restore | Included automatically | Separate system | PostgreSQL ✅ |
| Observability | Rich SQL queries | Limited | PostgreSQL ✅ |
| Instance Coordination | Polling (simple, reliable) | Pub/Sub | PostgreSQL ✅ |
| Game Client Broadcasting | Possible but suboptimal | Excellent fan-out | NATS ✅ |
We use PostgreSQL for reliability (leader election, state) and NATS for performance (broadcasting to thousands of game clients).
In a disaster recovery scenario:
- PostgreSQL: Restore one backup, time state included
- NATS: Restore PostgreSQL + NATS separately, hope they align
This alone justifies PostgreSQL given your unified datastore strategy.
The SimTime service currently cannot scale horizontally due to its stateful nature:
Current Architecture:
┌─────────────────────┐
│  SimTime Service    │  ← Single Instance (SPOF)
│  - Mutable State    │
│  - No Persistence   │
│  - No Coordination  │
└─────────────────────┘
           ↓
   All Game Services
**Single Point of Failure (SPOF)**
- Service crash = no time synchronization for entire system
- Deployment/updates require downtime
- No redundancy for this critical service
**State Management Issues**

// Each instance maintains local state:
private DateTime serverEpoch;
private long multiplier;
private DateTime? pausedTimestamp;
- Multiple instances = divergent state
- No synchronization mechanism
- Race conditions on time manipulation
**Scalability Limitations**
- Cannot handle increased load by adding instances
- All time sync requests hit single instance
- Becomes bottleneck as system grows
**Sync Load Overwhelms Single Instance**

Production reality with 10,000 concurrent clients:
- Each client syncs every 60 seconds
- 10,000 clients ÷ 60 seconds ≈ 167 sync requests/second
- Each sync involves network I/O, time calculation, and a response
- Single-instance CPU and network become saturated
- Result: sync latency increases → time drift → game desynchronization
- Vertical scaling (bigger instance) has limits
- Cannot distribute load across multiple instances
- Critical during peak gaming hours or events
Proposed Architecture:
┌─────────────────────┐
│  Leader Instance    │  ← Source of Truth,
│  (SimTimeServer)    │    elected via PostgreSQL
└─────────────────────┘
           ↕
PostgreSQL + Optional Pub/Sub
           ↕
┌─────────────────────┐   ┌─────────────────────┐
│ Follower Instance   │   │ Follower Instance   │
│  (SimTimeClient)    │   │  (SimTimeClient)    │
└─────────────────────┘   └─────────────────────┘
           ↓                         ↓
      Game Clients              Game Clients
The breakthrough insight: Follower instances use the existing SimTimeClient library to synchronize with the leader, just like game clients do. This solves clock skew and ensures consistency.
// Optimized: Single query returns both election result AND current leader endpoint
// This eliminates an extra round-trip to the database for followers
public async Task<LeaderElectionResult> TryBecomeLeader() {
var sql = @"
WITH election_attempt AS (
INSERT INTO simtime_leader
(resource_name, instance_id, instance_endpoint, expires_at)
VALUES ('simtime', @instanceId, @endpoint, NOW() + INTERVAL '30 seconds')
ON CONFLICT (resource_name) DO UPDATE SET
instance_id = @instanceId,
instance_endpoint = @endpoint,
expires_at = NOW() + INTERVAL '30 seconds'
WHERE simtime_leader.expires_at < NOW() -- Only if expired
RETURNING instance_id, instance_endpoint
)
SELECT
instance_id,
instance_endpoint,
instance_id = @instanceId as became_leader
FROM election_attempt
UNION ALL
SELECT
instance_id,
instance_endpoint,
false as became_leader
FROM simtime_leader
WHERE resource_name = 'simtime'
AND NOT EXISTS (SELECT 1 FROM election_attempt)
LIMIT 1";
var result = await db.QuerySingleAsync<LeaderElectionResult>(sql,
new { instanceId, endpoint });
if (result.BecameLeader) {
StartLeaseRenewal(); // Renew every 15 seconds
RunAsLeader(); // Full SimTimeServer mode
} else {
// We already have the leader endpoint from the query!
await ConnectToLeader(result.LeaderEndpoint);
RunAsFollower(); // SimTimeClient mode
}
return result;
}
public class LeaderElectionResult {
public bool BecameLeader { get; set; }
public string LeaderEndpoint { get; set; }
public string LeaderId { get; set; }
}

public class FollowerMode {
private SimTimeClient clientToLeader;
public async Task Initialize(string leaderEndpoint) {
// Create transport to communicate with leader
var transport = new GrpcSimTimeTransport(leaderEndpoint);
// Use SimTimeClient for time synchronization
clientToLeader = new SimTimeClient(transport);
clientToLeader.StartSyncing();
// Wait for initial sync
await clientToLeader.WaitUntilTimeIsSynced();
}
// Serve client requests using synchronized time
public DateTime GetCurrentTime() {
return clientToLeader.ServerNow; // Locally extrapolated!
}
}

The SimTimeClient uses an NTP-style time synchronization algorithm:
1. Follower sends sync request with timestamp T1
2. Leader receives at T2, responds at T3
3. Follower receives response at T4
Round Trip Time (RTT) = T4 - T1
One-way latency ≈ RTT / 2
Clock offset = T3 - (T1 + one-way latency)
Result: the follower knows its estimated offset to the leader's clock
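To make the arithmetic concrete, here is a minimal sketch of the same calculation; the type and member names are illustrative assumptions, not the real SimTimeClient API:

```csharp
// Sketch of the NTP-style offset math above. T1 and T4 come from the
// follower's local clock; T3 is the time reported by the leader.
public readonly record struct SyncSample(DateTime T1, DateTime T3, DateTime T4)
{
    public TimeSpan RoundTrip => T4 - T1;

    // offset = T3 - (T1 + RTT/2); positive when the local clock lags the leader
    public TimeSpan Offset => T3 - (T1 + TimeSpan.FromTicks(RoundTrip.Ticks / 2));

    // Leader-aligned "now" on the follower: local clock plus the learned offset
    public DateTime LeaderNow(DateTime localNow) => localNow + Offset;
}
```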
| Operation | Leader Behavior | Follower Behavior |
|---|---|---|
| GetTimeData | Serve from local state | Serve from synchronized state |
| Sync | Provide time reference | Provide synchronized time |
| Time Manipulation | Execute and broadcast | Proxy to leader |
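Read as code, the table reduces to a simple dispatch rule. A minimal sketch, assuming an `isLeader` flag and the serving/proxy helpers shown elsewhere in this document:

```csharp
// Sketch only: reads are answered wherever they land; writes go to the leader.
public Task<TimeDataResponse> GetTimeDataAsync() =>
    isLeader
        ? Task.FromResult(ServeFromLocalState())      // leader: authoritative state
        : Task.FromResult(ServeFromSyncedClient());   // follower: SimTimeClient state

public Task<PauseResponse> PauseAsync(PauseRequest request) =>
    isLeader
        ? ExecutePauseAsync(request)                  // leader: execute, persist, broadcast
        : leaderClient.PauseAsync(request);           // follower: proxy to leader
```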
While PostgreSQL handles leader election and state persistence, NATS is still used for broadcasting time modification events to all connected game clients:
// Leader-side implementation
public class TimeManipulationBroadcast {
private readonly INatsPublisher natsPublisher;
public async Task OnTimeManipulation(TimeManipulationEvent evt) {
// Leader executes the manipulation
UpdateLocalState(evt);
// Save to PostgreSQL for persistence
await SaveToPostgreSQL(evt);
// Broadcast to ALL game clients via NATS
await natsPublisher.PublishAsync("simtime.events", new {
Type = evt.Type, // "pause", "resume", "multiplier_change", "advance"
NewState = GetCurrentTimeData(),
Timestamp = DateTime.UtcNow
});
// Note: Follower instances detect changes by polling PostgreSQL,
// not through NATS. This NATS broadcast is for game clients only.
}
}
// Follower-side implementation: Always proxy to leader
public class FollowerTimeManipulation {
private readonly SimTimeServerInternalServiceClient leaderClient;
private readonly bool isLeader;
public async Task<PauseResponse> PauseAsync(PauseRequest request) {
if (isLeader) {
// Leader executes directly
return await ExecutePauseAsync(request);
} else {
// Follower ALWAYS proxies to the leader; the polling loop keeps
// leaderClient pointed at the current leader endpoint
return await leaderClient.PauseAsync(request);
}
}
public async Task<ResumeResponse> ResumeAsync(ResumeRequest request) {
if (!isLeader) {
// Always proxy to leader
return await leaderClient.ResumeAsync(request);
}
return await ExecuteResumeAsync(request);
}
public async Task<ChangeMultiplierResponse> ChangeMultiplierAsync(ChangeMultiplierRequest request) {
if (!isLeader) {
// Always proxy to leader
return await leaderClient.ChangeMultiplierAsync(request);
}
return await ExecuteChangeMultiplierAsync(request);
}
}

This hybrid approach leverages:
- PostgreSQL for reliable state and leader coordination
- NATS for efficient fan-out to thousands of game clients
- Proxying ensures all time manipulations go through the leader
- Existing client code continues to work unchanged
We intentionally use simple polling instead of PostgreSQL LISTEN/NOTIFY for follower coordination:
Why Polling Is Superior Here:
- Connection Simplicity: No persistent DB connections or reconnection logic
- Container-Friendly: Works perfectly with connection pooling and Kubernetes
- Predictable Behavior: Fixed 10-second intervals, easy to debug and monitor
- Resilient: Network blips don't break anything, just delay by one poll cycle
- Simpler Testing: No need to mock LISTEN/NOTIFY, fully deterministic
The Unified Polling Loop:
public class FollowerPollingService {
private string currentLeaderEndpoint;
private DateTime lastStateUpdate;
private readonly TimeSpan pollInterval = TimeSpan.FromSeconds(10);
public async Task RunPollingLoop() {
while (running) {
try {
// Single efficient query gets everything we need
var sql = @"
SELECT
l.instance_id as leader_id,
l.instance_endpoint as leader_endpoint,
l.expires_at > NOW() as leader_active,
s.sim_time_ticks,
s.epoch_ticks,
s.multiplier,
s.paused_timestamp_ticks,
s.saved_at as state_updated_at
FROM simtime_leader l
CROSS JOIN LATERAL (
SELECT * FROM simtime_state
WHERE is_current = true
ORDER BY saved_at DESC LIMIT 1
) s
WHERE l.resource_name = 'simtime'";
var status = await db.QuerySingleOrDefaultAsync<dynamic>(sql);
if (status == null || !status.leader_active) {
// No active leader - attempt election
await TryBecomeLeader();
continue;
}
// Check if leader changed
if (status.leader_endpoint != currentLeaderEndpoint) {
Logger.Info($"Leader changed to {status.leader_endpoint}");
await ReconnectToNewLeader(status.leader_endpoint);
currentLeaderEndpoint = status.leader_endpoint;
}
// Check if time state was manipulated by admin
if (status.state_updated_at > lastStateUpdate) {
Logger.Info("Time state changed, refreshing");
RefreshLocalTimeState(status);
lastStateUpdate = status.state_updated_at;
}
} catch (Exception ex) {
Logger.Error($"Polling failed: {ex.Message}");
// Continue polling - system remains available
}
await Task.Delay(pollInterval);
}
}
}

Acceptable Latency Trade-offs:
| Event | Detection Latency | Impact | Mitigation |
|---|---|---|---|
| Leader failure | 0-10 seconds | Rare event (weekly/monthly) | Followers continue serving |
| Time manipulation | 0-10 seconds | Admin operation, non-critical | Game clients get instant NATS update |
| New leader elected | 0-10 seconds | Follows leader failure | Extrapolation continues working |
The 10-second polling interval is 3x faster than the 30-second leader TTL, so a follower notices an expired lease at most one poll cycle after it lapses, while keeping database load minimal.
-- Leader election table
CREATE TABLE simtime_leader (
resource_name TEXT PRIMARY KEY DEFAULT 'simtime',
instance_id TEXT NOT NULL,
instance_endpoint TEXT NOT NULL, -- Where followers connect
acquired_at TIMESTAMPTZ DEFAULT NOW(),
expires_at TIMESTAMPTZ NOT NULL,
heartbeat_at TIMESTAMPTZ DEFAULT NOW(),
fencing_token BIGSERIAL, -- Monotonic counter for split-brain prevention
metadata JSONB, -- Version, capabilities, etc.
CONSTRAINT valid_lease CHECK (expires_at > acquired_at)
);
-- State persistence table
CREATE TABLE simtime_state (
id BIGSERIAL PRIMARY KEY,
saved_at TIMESTAMPTZ DEFAULT NOW(),
sim_time_ticks BIGINT NOT NULL,
epoch_ticks BIGINT NOT NULL,
multiplier INTEGER NOT NULL,
paused_timestamp_ticks BIGINT,
is_current BOOLEAN DEFAULT true,
saved_by TEXT NOT NULL,
fencing_token BIGINT NOT NULL -- Must match leader's token
);
-- Audit trail for time manipulations
CREATE TABLE simtime_events (
id BIGSERIAL PRIMARY KEY,
event_time TIMESTAMPTZ DEFAULT NOW(),
event_type TEXT NOT NULL, -- 'pause', 'resume', 'multiplier_change', 'advance'
old_value JSONB,
new_value JSONB,
performed_by TEXT NOT NULL,
instance_id TEXT NOT NULL
);
-- Indexes for performance
CREATE INDEX idx_current_state ON simtime_state(is_current)
WHERE is_current = true;
-- Note: NOW() is not immutable, so it cannot appear in a partial-index
-- predicate; index the column directly instead
CREATE INDEX idx_active_leader ON simtime_leader(expires_at DESC);
CREATE INDEX idx_recent_events ON simtime_events(event_time DESC);

| Aspect | Advisory Locks | Lock Table | Our Choice |
|---|---|---|---|
| Automatic cleanup | ✅ On disconnect | ❌ Manual TTL | Lock Table |
| Survives brief disconnects | ❌ Lost immediately | ✅ TTL-based | Lock Table ✅ |
| Observability | ❌ No metadata | ✅ Full visibility | Lock Table ✅ |
| Debugging | ❌ Can't see holder | ✅ SQL queries | Lock Table ✅ |
| Graceful handoff | ❌ Not possible | ✅ Update owner | Lock Table ✅ |
| Audit trail | ❌ None | ✅ History table | Lock Table ✅ |
| Works with polling | ❌ Requires connection | ✅ Stateless queries | Lock Table ✅ |
The lock table approach provides production-grade observability:
-- Who is the leader?
SELECT instance_id, instance_endpoint, expires_at - NOW() as remaining
FROM simtime_leader WHERE expires_at > NOW();
-- Leader changes in last hour
SELECT * FROM simtime_events
WHERE event_type = 'leader_change'
AND event_time > NOW() - INTERVAL '1 hour';

sequenceDiagram
participant I1 as Instance 1
participant I2 as Instance 2
participant I3 as Instance 3
participant PG as PostgreSQL
Note over I1,PG: Startup - All instances try to become leader
I1->>PG: INSERT INTO simtime_leader ... ON CONFLICT UPDATE<br/>WHERE expires_at < NOW()
PG-->>I1: Success (became_leader = true)
Note over I1: Becomes Leader
I2->>PG: INSERT INTO simtime_leader ... ON CONFLICT UPDATE<br/>WHERE expires_at < NOW()
PG-->>I2: Failed (became_leader = false, leader_endpoint = Instance 1)
Note over I2: Becomes Follower
I3->>PG: INSERT INTO simtime_leader ... ON CONFLICT UPDATE<br/>WHERE expires_at < NOW()
PG-->>I3: Failed (became_leader = false, leader_endpoint = Instance 1)
Note over I3: Becomes Follower
loop Every 15 seconds
I1->>PG: UPDATE simtime_leader SET expires_at = NOW() + 30s
PG-->>I1: Success (lease renewed)
end
Note over I2,I3: Followers poll for leader changes (no LISTEN/NOTIFY)
loop Every 10 seconds (polling loop)
I2->>PG: SELECT leader, time_state FROM simtime_leader, simtime_state
PG-->>I2: Current leader and state
I3->>PG: SELECT leader, time_state FROM simtime_leader, simtime_state
PG-->>I3: Current leader and state
end
sequenceDiagram
participant F as Follower Instance
participant SC as SimTimeClient
participant L as Leader Instance
participant LS as Leader SimTimeServer
Note over F,LS: Initial Synchronization
F->>SC: Initialize(leaderEndpoint)
SC->>SC: StartSyncing()
loop Every 60 seconds
SC->>L: SyncNoStreamRequest(T1=local_time)
L->>LS: OnTimeSync(T1)
LS->>LS: T2=DateTime.UtcNow
LS-->>L: TimeSyncResponse(T2)
L-->>SC: Response(ServerTime=T2, RTT)
SC->>SC: Calculate offset = T2 - (T1 + RTT/2)
SC->>SC: Store serverTimeDifference
end
SC->>L: GetTimeDataRequest()
L->>LS: GetTimeData()
LS-->>L: TimeData(epoch, multiplier, paused)
L-->>SC: TimeDataResponse
SC->>SC: Store TimeData
Note over SC: Now synchronized
sequenceDiagram
participant GC as Game Client
participant LB as Load Balancer
participant F as Follower
participant L as Leader
participant SC as SimTimeClient
alt GetTimeData Request to Follower
GC->>LB: GetTimeData()
LB->>F: GetTimeData()
F->>SC: ServerNow
SC->>SC: Calculate: lastSync + offset + elapsed
SC-->>F: Synchronized time
F-->>GC: TimeDataResponse
Note over GC,F: No network call to leader!
else GetTimeData Request to Leader
GC->>LB: GetTimeData()
LB->>L: GetTimeData()
L->>L: Return local state
L-->>GC: TimeDataResponse
end
sequenceDiagram
participant Admin as Admin Client
participant F as Follower
participant L as Leader
participant PG as PostgreSQL
participant NATS as NATS
participant GC as Game Clients
participant F2 as Other Followers
alt Request to Follower
Admin->>F: Pause()
F->>F: Check if leader
Note over F: Not leader - proxy to leader
F->>L: Proxy: Pause()
L->>L: UpdateState(paused=true)
L->>PG: INSERT INTO simtime_events(...)
L->>PG: UPDATE simtime_state SET ...
L->>NATS: Publish("simtime.events", newState)
NATS-->>GC: Broadcast to all game clients
Note over F,F2: Followers detect change on next poll (0-10s)
L-->>F: Success
F-->>Admin: Success
else Request to Leader
Admin->>L: Pause()
L->>L: UpdateState(paused=true)
L->>PG: INSERT INTO simtime_events(...)
L->>PG: UPDATE simtime_state SET ...
L->>NATS: Publish("simtime.events", newState)
NATS-->>GC: Broadcast to all game clients
Note over F,F2: Followers detect change on next poll (0-10s)
L-->>Admin: Success
end
sequenceDiagram
participant L as Leader (Instance 1)
participant PG as PostgreSQL
participant F1 as Follower 1
participant F2 as Follower 2
participant GC as Game Clients
Note over L: Leader healthy, renewing lease
loop Every 15 seconds
L->>PG: UPDATE simtime_leader SET expires_at = NOW() + 30s
PG-->>L: Success
L->>PG: INSERT INTO simtime_state (snapshot)
PG-->>L: State saved
end
Note over L: Leader crashes!
L-xL: Crash/Network Issue
Note over PG: After 30 seconds, TTL expires
Note over F1,F2: Followers detect on next poll (0-10s)
loop Every 10 seconds (polling)
F1->>PG: SELECT * FROM simtime_leader WHERE expires_at > NOW()
PG-->>F1: No active leader (TTL expired)
Note over F1: Detected leader failure
break Leader gone
end
end
Note over F1,F2: Race to become leader
F1->>PG: INSERT ... ON CONFLICT UPDATE WHERE expires_at < NOW()
PG-->>F1: Success (became_leader = true)
F1->>PG: SELECT * FROM simtime_state WHERE is_current = true
PG-->>F1: Latest snapshot
Note over F1: Becomes new Leader
F2->>PG: INSERT ... ON CONFLICT UPDATE WHERE expires_at < NOW()
PG-->>F2: Failed (became_leader = false, leader_endpoint = Follower 1)
Note over F2: Remains Follower
F2->>F1: StartSyncing() with new leader (no extra query needed!)
Note over GC: Continuous service during transition
GC->>F2: GetTimeData()
F2->>F2: Serve from cached/extrapolated time
F2-->>GC: TimeDataResponse
sequenceDiagram
participant GC as Game Client
participant F as Follower
participant SC as SimTimeClient
participant Cache as Local State
Note over Cache: Last sync: T0, Offset: +50ms
GC->>F: GetTimeData() at T0+30s
F->>SC: ServerNow
SC->>Cache: Get last sync data
Cache-->>SC: lastSync=T0, offset=50ms
SC->>SC: elapsed = Now() - lastSync = 30s
SC->>SC: time = T0 + offset + elapsed
SC->>SC: Apply multiplier (e.g., 24x)
SC->>SC: simTime = T0 + (30s * 24) = T0+12min
SC-->>F: Calculated SimTime
F-->>GC: TimeDataResponse(T0+12min)
Note over GC,Cache: No network call needed!
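The diagram's calculation, written out as code. This is a minimal sketch under the assumption that these four values are captured at each sync; the names are illustrative, not the real SimTimeClient fields:

```csharp
// Sketch: serve time locally between syncs; no network call required.
public DateTime ExtrapolatedSimTime(
    DateTime lastSyncLocal,  // local wall-clock at the last sync (T0)
    TimeSpan leaderOffset,   // offset learned from the NTP-style sync (+50ms above)
    DateTime simAtLastSync,  // simulated time reported by the leader at T0
    long multiplier)         // sim seconds per real second (24x in the diagram)
{
    var realElapsed = DateTime.UtcNow + leaderOffset - lastSyncLocal;           // 30s in the diagram
    return simAtLastSync + TimeSpan.FromTicks(realElapsed.Ticks * multiplier);  // 30s * 24 = +12 min
}
```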
graph TB
subgraph "PostgreSQL Infrastructure"
PG[PostgreSQL Database]
LT[simtime_leader table<br/>Leader Election]
ST[simtime_state table<br/>State Persistence]
ET[simtime_events table<br/>Audit Trail]
end
subgraph "NATS Infrastructure"
NATS[NATS Pub/Sub<br/>Client Notifications]
end
subgraph "Leader Instance"
L[SimTimeServer<br/>Source of Truth]
LG[gRPC Endpoint]
LE[Leader Election<br/>Module]
end
subgraph "Follower Instance 1"
F1[SimTimeClient]
F1G[gRPC Endpoint]
F1E[Election Watcher]
end
subgraph "Follower Instance 2"
F2[SimTimeClient]
F2G[gRPC Endpoint]
F2E[Election Watcher]
end
subgraph "Game Services"
GS1[Game Service 1]
GS2[Game Service 2]
GS3[Game Service 3]
end
PG --> LT
PG --> ST
PG --> ET
LE -.->|Renew Lease| LT
F1E -.->|Poll| LT
F2E -.->|Poll| LT
L -->|Save Snapshots| ST
L -->|Log Events| ET
L -->|Broadcast Events| NATS
NATS -.->|Time Updates| GS1
NATS -.->|Time Updates| GS2
NATS -.->|Time Updates| GS3
F1 -->|Sync Time| LG
F2 -->|Sync Time| LG
GS1 -->|GetTime| F1G
GS2 -->|GetTime| F2G
GS3 -->|GetTime| LG
style L fill:#f96,stroke:#333,stroke-width:4px
style F1 fill:#9cf,stroke:#333,stroke-width:2px
style F2 fill:#9cf,stroke:#333,stroke-width:2px
style PG fill:#326ce5,stroke:#333,stroke-width:2px
style NATS fill:#27aae1,stroke:#333,stroke-width:2px
These diagrams illustrate the key interaction patterns in the distributed SimTime architecture:
- Leader Election: Shows how instances compete for leadership and maintain it through lease renewal
- Time Synchronization: Details the NTP-style algorithm used by followers to sync with the leader
- Request Handling: Demonstrates how both leaders and followers serve client requests
- Time Manipulation: Shows command routing for administrative operations
- Failure Recovery: Illustrates the seamless transition when a leader fails
- Time Extrapolation: Explains how followers calculate time locally without network calls
- System Overview: Provides a high-level view of all components and their relationships
When all SimTime servers restart simultaneously (e.g., cluster restart, power failure, Kubernetes namespace recreation), we face critical challenges:
- Complete State Loss - All in-memory time state vanishes
- Time Regression - Could jump from Day 15 back to Day 1
- Client Desynchronization - Clients have different time than restarted servers
- Lost Manipulations - Paused state, multiplier changes are forgotten
// Leader saves snapshots every 10 seconds
public class TimeSnapshotManager {
private readonly IDbConnection db;
private readonly Timer snapshotTimer;
private readonly long fencingToken;
public async Task SaveSnapshot() {
var sql = @"
-- Mark previous snapshots as not current
UPDATE simtime_state SET is_current = false WHERE is_current = true;
-- Insert new snapshot
INSERT INTO simtime_state (
sim_time_ticks,
epoch_ticks,
multiplier,
paused_timestamp_ticks,
is_current,
saved_by,
fencing_token
) VALUES (
@simTimeTicks,
@epochTicks,
@multiplier,
@pausedTimestampTicks,
true,
@savedBy,
@fencingToken
)";
await db.ExecuteAsync(sql, new {
simTimeTicks = GetCurrentSimTime().Ticks,
epochTicks = currentEpoch.Ticks,
multiplier = currentMultiplier,
pausedTimestampTicks = pausedTimestamp?.Ticks,
savedBy = instanceId,
fencingToken = this.fencingToken
});
}
}

public async Task<TimeData> RecoverTimeState() {
// 1. Try to load from PostgreSQL snapshot
var sql = @"
SELECT
saved_at,
sim_time_ticks,
epoch_ticks,
multiplier,
paused_timestamp_ticks
FROM simtime_state
WHERE is_current = true
ORDER BY saved_at DESC
LIMIT 1";
var snapshot = await db.QuerySingleOrDefaultAsync<dynamic>(sql);
if (snapshot != null) {
var savedAt = (DateTime)snapshot.saved_at;
var age = DateTime.UtcNow - savedAt;
if (age < TimeSpan.FromMinutes(5)) {
// 2. Extrapolate from snapshot
var simTime = new DateTime((long)snapshot.sim_time_ticks);
var extrapolatedTime = simTime + (age * snapshot.multiplier);
Logger.Info($"Recovered from snapshot: {savedAt}, extrapolated {age}");
return new TimeData {
Epoch = new DateTime((long)snapshot.epoch_ticks),
Multiplier = snapshot.multiplier,
CurrentTime = extrapolatedTime,
PausedTimestamp = snapshot.paused_timestamp_ticks != null
? new DateTime((long)snapshot.paused_timestamp_ticks)
: null
};
}
}
// 3. Fallback to environment variables
Logger.Warn("No recent snapshot found, initializing from environment");
return InitializeFromEnvironment();
}

sequenceDiagram
participant PG as PostgreSQL
participant I1 as Instance 1
participant I2 as Instance 2
participant I3 as Instance 3
participant GC as Game Clients
Note over I1,I3: All instances start simultaneously
par Instance 1 Recovery
I1->>PG: SELECT * FROM simtime_state WHERE is_current = true
PG-->>I1: Snapshot (saved 30s ago)
I1->>I1: Extrapolate: saved_time + 30s
and Instance 2 Recovery
I2->>PG: SELECT * FROM simtime_state WHERE is_current = true
PG-->>I2: Snapshot (saved 30s ago)
I2->>I2: Extrapolate: saved_time + 30s
and Instance 3 Recovery
I3->>PG: SELECT * FROM simtime_state WHERE is_current = true
PG-->>I3: Snapshot (saved 30s ago)
I3->>I3: Extrapolate: saved_time + 30s
end
Note over I1,I3: All instances have consistent time
I1->>PG: INSERT INTO simtime_leader ... ON CONFLICT UPDATE
PG-->>I1: Success - Becomes Leader
I2->>PG: INSERT INTO simtime_leader ... ON CONFLICT UPDATE
PG-->>I2: Failed - Becomes Follower
I3->>PG: INSERT INTO simtime_leader ... ON CONFLICT UPDATE
PG-->>I3: Failed - Becomes Follower
Note over I1: Leader starts snapshot timer
loop Every 10 seconds
I1->>PG: INSERT INTO simtime_state (snapshot)
PG-->>I1: State persisted
end
Note over GC: Clients detect time continuity
GC->>I2: GetTimeData()
I2-->>GC: TimeData (continuous from before restart)
GC->>GC: Small adjustment, no major jump
Note over PG: Key advantage: Backup includes time state!
PG->>PG: pg_dump includes simtime tables
public class GracefulShutdown {
public async Task OnShutdown() {
if (IsLeader) {
// Save final snapshot before shutdown
await SaveSnapshot();
// Mark shutdown as clean in audit log
await db.ExecuteAsync(@"
INSERT INTO simtime_events (
event_type,
new_value,
performed_by,
instance_id
) VALUES (
'clean_shutdown',
@newValue::jsonb,
@performedBy,
@instanceId
)", new {
newValue = JsonSerializer.Serialize(new {
timestamp = DateTime.UtcNow,
lastTime = GetCurrentSimTime()
}),
performedBy = instanceId,
instanceId = instanceId
});
}
// Stop time advancement
PauseTimeAdvancement();
// Wait for pending operations
await WaitForPendingOperations();
}
}

public class TimeJumpDetection {
private DateTime lastServerTime;
private readonly TimeSpan maxAcceptableJump = TimeSpan.FromMinutes(1);
public async Task<bool> ValidateTimeUpdate(DateTime newServerTime) {
var timeDiff = Math.Abs((newServerTime - lastServerTime).TotalSeconds);
if (timeDiff > maxAcceptableJump.TotalSeconds) {
Logger.Warn($"Large time jump detected: {timeDiff}s");
// Force complete resynchronization
await ForceCompleteResync();
// Notify game systems of time discontinuity
OnTimeDiscontinuity?.Invoke(lastServerTime, newServerTime);
return false; // Reject normal update
}
return true; // Accept update
}
}

New environment variables for production:
# Snapshot Configuration
SIMTIME_SNAPSHOT_INTERVAL: "10" # seconds between snapshots
SIMTIME_SNAPSHOT_RETENTION: "3600" # keep snapshots for 1 hour
SIMTIME_RECOVERY_MAX_AGE: "300" # max age of snapshot to use (5 min)
# Recovery Behavior
SIMTIME_ALLOW_ENV_FALLBACK: "true" # fallback to env vars if no snapshot
SIMTIME_REQUIRE_CLEAN_SHUTDOWN: "false" # require clean shutdown marker

| Scenario | Recovery Method | Time Continuity |
|---|---|---|
| Clean restart (< 1 min) | Latest snapshot + extrapolation | Perfect continuity |
| Unclean restart (< 5 min) | Latest snapshot + extrapolation | Near-perfect (< 1s drift) |
| Long outage (> 5 min) | Environment variables + warning | Possible time jump |
| PostgreSQL data lost | Environment variables + alert | Time reset to Day 1 |
| Partial restart | Active instances provide time | Perfect continuity |
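A minimal sketch of how these variables could feed the recovery path shown earlier; the helper and defaults are assumptions that mirror the documented values, not existing code:

```csharp
// Sketch: bind the SIMTIME_* variables (defaults match the configuration above).
static TimeSpan EnvSeconds(string name, int fallbackSeconds) =>
    TimeSpan.FromSeconds(
        int.TryParse(Environment.GetEnvironmentVariable(name), out var s)
            ? s : fallbackSeconds);

var snapshotInterval = EnvSeconds("SIMTIME_SNAPSHOT_INTERVAL", 10);
var recoveryMaxAge   = EnvSeconds("SIMTIME_RECOVERY_MAX_AGE", 300);
var allowEnvFallback = (Environment.GetEnvironmentVariable("SIMTIME_ALLOW_ENV_FALLBACK") ?? "true")
    .Equals("true", StringComparison.OrdinalIgnoreCase);

// RecoverTimeState would then compare the snapshot's age against recoveryMaxAge
// instead of the hard-coded TimeSpan.FromMinutes(5) shown earlier.
```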
Alerts to Configure:
- snapshot_age_high: Snapshot older than 60 seconds
- recovery_from_env: Had to use environment variables
- large_time_jump: Time jumped > 60 seconds
- snapshot_save_failed: Failed to persist snapshot

When using Kubernetes Deployments (your approach), SimTime instances face unique challenges:
- Pods don't inherently know their own IP address
- Pod IPs are ephemeral and change on every restart
- Pod names are unpredictable (e.g., simtime-server-7b4d9c-x2f3)
- Service ClusterIPs don't work (followers need the specific leader pod's IP)
Since you're using Deployments, the Kubernetes Downward API is essential for pod self-discovery:
apiVersion: apps/v1
kind: Deployment
metadata:
name: simtime-server
spec:
replicas: 3
template:
spec:
containers:
- name: simtime
image: simtime:latest
env:
# Pod identification
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
- name: POD_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
# Service discovery
- name: SERVICE_NAME
value: "simtime-internal"
- name: GRPC_PORT
value: "50051"public class LeaderElectionService {
private readonly string instanceId;
private readonly string instanceEndpoint;
public LeaderElectionService(IConfiguration config) {
// Get pod-specific information from Downward API
var podName = config["POD_NAME"]
?? throw new Exception("POD_NAME not set - check Downward API config");
var podIp = config["POD_IP"]
?? throw new Exception("POD_IP not set - check Downward API config");
var grpcPort = config["GRPC_PORT"] ?? "50051";
// Build unique instance ID and endpoint
this.instanceId = podName; // e.g., "simtime-server-abc123"
this.instanceEndpoint = $"{podIp}:{grpcPort}"; // e.g., "10.244.1.5:50051"
}
public async Task<bool> TryBecomeLeader() {
var sql = @"
INSERT INTO simtime_leader (
resource_name,
instance_id,
instance_endpoint, -- Followers will connect here
expires_at,
metadata
) VALUES (
'simtime',
@instanceId, -- simtime-server-abc123
@instanceEndpoint, -- 10.244.1.5:50051
NOW() + INTERVAL '30 seconds',
@metadata::jsonb
)
ON CONFLICT (resource_name) DO UPDATE SET
instance_id = @instanceId,
instance_endpoint = @instanceEndpoint,
expires_at = NOW() + INTERVAL '30 seconds'
WHERE simtime_leader.expires_at < NOW()
RETURNING instance_id = @instanceId as became_leader";
var metadata = new {
pod_name = instanceId,
pod_ip = instanceEndpoint.Split(':')[0],
version = Assembly.GetExecutingAssembly().GetName().Version?.ToString()
};
// RETURNING yields no row when another instance holds an unexpired lease,
// so "no row" means we did not become leader
var becameLeader = await db.QuerySingleOrDefaultAsync<bool?>(sql, new {
instanceId,
instanceEndpoint,
metadata = JsonSerializer.Serialize(metadata)
});
return becameLeader == true;
}
}

public class FollowerService {
private GrpcChannel? leaderChannel;
private SimTimeClient simTimeClient;
public async Task ConnectToLeader(string leaderEndpoint) {
// Leader endpoint already provided by TryBecomeLeader()
// No need for a separate database query!
// Connect directly to leader pod IP
// e.g., grpc://10.244.1.5:50051
leaderChannel = GrpcChannel.ForAddress($"http://{leaderEndpoint}");
// Initialize SimTimeClient for time synchronization
var transport = new GrpcSimTimeTransport(leaderChannel);
simTimeClient = new SimTimeClient(transport);
simTimeClient.StartSyncing();
Logger.Info($"Connected to leader at {leaderEndpoint}");
}
// If leader changes, we need to check periodically
public async Task<string?> CheckLeaderChange() {
var sql = @"
SELECT instance_endpoint
FROM simtime_leader
WHERE resource_name = 'simtime'
AND expires_at > NOW()";
return await db.QuerySingleOrDefaultAsync<string>(sql);
}
}With Deployments, pod IPs change on every restart, but this is not a problem because:
- Leader re-registers its new IP on startup when acquiring leadership
- Followers re-discover the leader's new endpoint from PostgreSQL
- TTL-based leases ensure stale endpoints are automatically cleaned up
- Connection retry logic handles IP changes gracefully
// Followers automatically handle leader IP changes
public async Task MaintainLeaderConnection() {
while (running) {
var currentEndpoint = await DiscoverLeader();
if (currentEndpoint != lastKnownEndpoint) {
// Leader IP changed (pod restarted)
await ReconnectToLeader(currentEndpoint);
lastKnownEndpoint = currentEndpoint;
}
await Task.Delay(TimeSpan.FromSeconds(10));
}
}

public class EndpointConfiguration {
public static string GetInstanceEndpoint(IConfiguration config) {
var env = config["ASPNETCORE_ENVIRONMENT"];
if (env == "Development") {
// Local development: use localhost
return "localhost:50051";
} else if (config["POD_IP"] != null) {
// Kubernetes Deployment: use pod IP (your primary scenario)
// This IP is ephemeral and will change on pod restart
return $"{config["POD_IP"]}:{config["GRPC_PORT"] ?? "50051"}";
} else {
// Fallback for other environments (Docker, VM, etc.)
return $"{Environment.MachineName}:50051";
}
}
}

Since you're using Deployments:
- **Pod IP Changes**: Every pod restart gets a new IP
  - Solution: Leader updates PostgreSQL on startup
  - Followers poll for endpoint changes
- **Rolling Updates**: During deployment, pods cycle through
  - Old leader continues until new pod is ready
  - New pod tries to become leader
  - Seamless transition via TTL-based leases
- **Scale Events**: When scaling up/down
  - New pods become followers automatically
  - If the leader is scaled down, election happens within the TTL window
- **No Persistent Identity**: Pods have random suffixes
  - Use the pod name as instance ID for uniqueness
  - Don't rely on the pod name for network discovery
- Downward API is Essential: Without it, pods can't know their own IP address
- Direct Pod Connection: Followers must connect to the specific leader pod, not a service
- Ephemeral IPs Are Fine: Leader re-registration and TTL cleanup handle IP changes automatically
- Graceful Degradation: Code handles both Kubernetes and local development scenarios
- Metadata Storage: Leader table stores both instance ID and endpoint for observability
Tasks:
- Create PostgreSQL schema (simtime_leader, simtime_state, simtime_events)
- Implement lock table-based leader election
- Add lease renewal mechanism with TTL
- Add observability queries and metrics

Tasks:
- Create GrpcSimTimeTransport implementing ISimTimeClientTransport
- Integrate SimTimeClient into follower instances
- Override time serving methods to use client
- Add sync quality metrics

Tasks:
- Implement PostgreSQL snapshot persistence
- Add recovery from snapshots on startup
- Implement graceful shutdown with final snapshot
- Add audit trail for time manipulations

Tasks:
- Chaos testing (leader failures, network partitions)
- Load testing with multiple instances
- Gradual rollout with feature flags
- Monitor time consistency metrics

-- Backup includes everything!
pg_dump production > backup.dump
-- Restore includes time state automatically
pg_restore backup.dump
-- Time continues from exact backup point!

Traditional Problem:
Instance A clock: 14:00:00.000
Instance B clock: 14:00:00.050
Result: 50ms time difference!
Our Solution:
Both instances: 14:00:00.000 (synchronized to leader)
- Leader failure → New election in ~30 seconds
- Followers continue serving during election
- Zero downtime deployments possible
- PostgreSQL HA as foundation
-- Real-time monitoring
SELECT instance_id, expires_at - NOW() as ttl FROM simtime_leader;
-- Audit trail included
SELECT * FROM simtime_events WHERE event_time > NOW() - INTERVAL '1 hour';

- PostgreSQL lock tables are battle-tested
- SimTimeClient used in production for years
- NTP algorithm is industry standard
- Your team already has PostgreSQL expertise
Single Instance: 10,000 clients → 167 req/s → Overloaded
Multi-Instance: 10,000 clients ÷ 5 instances = 2,000 clients each → 33 req/s → Smooth
- Load distributes across all instances
- Each instance handles subset of clients
- Add more instances as player base grows
- No single bottleneck
- Followers calculate time locally (no network calls)
- Sync happens every 60 seconds (configurable)
- Scales linearly with instances
- Leader election returns current leader in single query (no extra round trip)
- No persistent database connections needed
- No LISTEN/NOTIFY complexity or reconnection logic
- Works perfectly with connection pooling
- Predictable 10-second detection latency is acceptable for rare events
- Easier to monitor, debug, and test
BEGIN;
INSERT INTO game_events (...);
UPDATE simtime_state SET ...;
COMMIT; -- Atomic operation

Issue: Distributed consensus is complex
Mitigation:
- Use proven PostgreSQL lock tables with TTL
- Simple lease-based approach
- Pure polling (no LISTEN/NOTIFY complexity)
- Extensive testing of edge cases
Issue: Up to a 30-second window for new leader election
Mitigation:
- Followers serve last known good time
- Time continues advancing via extrapolation
- New leader elected quickly
Issue: Followers add network overhead by syncing with the leader periodically
Mitigation:
- Sync interval configurable (default 60s)
- Minimal data transfer (~100 bytes)
- Use gRPC streaming for efficiency
Issue: More moving parts than a single instance
Mitigation:
- Comprehensive monitoring and alerting
- Distributed tracing for debugging
- Fallback to single-instance mode if needed
**NATS (as the state store)**

Why Rejected for State Storage (still used for client broadcasting):
- No read-after-write consistency for critical state
- Follower lag issues for leader election
- Incompatible with immediate consistency needs
- Separate backup/restore from PostgreSQL
- Would split critical state across two systems
Note: NATS is still used for broadcasting time events to game clients, where its excellent fan-out performance shines
**Redis**

Why Rejected:
- Adds new infrastructure dependency
- Separate backup strategy needed
- Not aligned with unified datastore strategy
- Team lacks Redis expertise
**Client-Side Time Calculation**

Why Rejected:
- Pushes complexity to every client
- Hard to coordinate time manipulation
- Difficult to debug time issues
**Vertical Scaling Only**

Why Rejected:
- Doesn't solve SPOF issue
- No horizontal scaling
- Risky for production system
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Leader election fails | Low | High | Manual override, alerting |
| Clock drift during partition | Medium | Medium | Shorter sync intervals |
| Follower sync failures | Low | Low | Mark instance unhealthy |
| Time jumps on leader change | Low | High | Smooth convergence algorithm |
| Performance degradation | Low | Medium | Monitoring, auto-scaling |
| Full cluster restart | Low | Critical | Persistent snapshots, recovery protocol |
| PostgreSQL data loss | Very Low | Critical | Environment var fallback, alerting |
| Snapshot corruption | Very Low | High | Multiple snapshot versions, validation |
| Long outage (> 5 min) | Low | High | Client resync protocol, time jump detection |
| Graceless shutdown | Medium | Medium | Regular snapshots, extrapolation on recovery |
This section identifies critical safety issues that must be addressed before production deployment, along with their solutions.
The Problem: During network partitions, the system can end up with multiple leaders:
Time 0:  Leader renewing every 15s, followers polling every 10s
Time 10: Network partition occurs
         - Leader → PostgreSQL: ✅ (still works)
         - Followers → PostgreSQL: ✅ (still works)
         - Followers → Leader: ❌ (network partition)
Time 30: Followers think the leader is dead (can't reach it)
Time 40: Follower A becomes new leader
Time 45: Network partition heals
Result:  TWO LEADERS! Original leader + Follower A
Why it's dangerous:
- Game clients could get different times from different "leaders"
- Time manipulations could be lost or duplicated
- Data corruption in time-dependent game events
- Inconsistent event timers across the game
The Solution - Monotonic Fencing Tokens:
// Every state update must check fencing token
public async Task<bool> AcceptTimeUpdate(TimeState newState) {
// Only accept updates from newer leaders
if (newState.fencing_token <= lastKnownFencingToken) {
Logger.Warn($"Rejecting stale leader update: {newState.fencing_token} <= {lastKnownFencingToken}");
return false;
}
lastKnownFencingToken = newState.fencing_token;
ApplyTimeUpdate(newState);
return true;
}
// Leader must include fencing token in all responses
public TimeDataResponse GetTimeData() {
return new TimeDataResponse {
Time = CalculateCurrentTime(),
FencingToken = this.currentFencingToken, // Critical!
LeaderId = this.instanceId
};
}

The Problem: State updates aren't atomic, creating brief windows with no valid state:
-- These run as separate statements!
UPDATE simtime_state SET is_current = false WHERE is_current = true; -- Moment 1: No current state!
-- DANGER ZONE: Another instance queries here and gets NULL!
INSERT INTO simtime_state (...) VALUES (...); -- Moment 2: State restored

Timeline of Disaster:
Microsecond 1: UPDATE executes (no current state exists!)
Microsecond 2: Follower queries: "SELECT * WHERE is_current = true"
Returns: NULL!
Microsecond 3: INSERT executes (current state exists again)
Microsecond 4: Follower crashes due to null state
Why it's dangerous:
- Brief moments with NO valid time state
- Instances could fail to start or crash
- Clients could get null responses
- Recovery logic might initialize from environment (time jump!)
The Solution - Atomic Transactions:
public async Task SaveSnapshot() {
var sql = @"
BEGIN;
-- Both operations succeed or both fail
UPDATE simtime_state SET is_current = false WHERE is_current = true;
INSERT INTO simtime_state (
sim_time_ticks, epoch_ticks, multiplier, paused_timestamp_ticks,
is_current, saved_by, fencing_token
) VALUES (
@simTimeTicks, @epochTicks, @multiplier, @pausedTimestampTicks,
true, @savedBy, @fencingToken
);
COMMIT;";
await db.ExecuteAsync(sql, parameters);
}
// Alternative: Single UPSERT operation (one "current" row, keyed by id = 1;
// saved_by and fencing_token omitted here for brevity)
public async Task SaveSnapshotAtomic() {
var sql = @"
INSERT INTO simtime_state (id, sim_time_ticks, epoch_ticks, multiplier, is_current)
VALUES (1, @simTimeTicks, @epochTicks, @multiplier, true)
ON CONFLICT (id) -- id is the primary key; a partial-index predicate is not valid here
DO UPDATE SET
sim_time_ticks = @simTimeTicks,
epoch_ticks = @epochTicks,
multiplier = @multiplier,
saved_at = NOW()";
await db.ExecuteAsync(sql, parameters);
}

The Problem: Timers are created but never disposed when losing leadership:
private Timer renewalTimer;
private void StartLeaseRenewal() {
renewalTimer = new Timer(async _ => {
// Renews every 15 seconds forever!
}, null, TimeSpan.FromSeconds(15), TimeSpan.FromSeconds(15));
}
// When leadership lost, timer keeps running!
// After 10 leadership changes = 10 orphaned timers

Accumulation Over Time:
Hour 0: 1 leadership change = 1 orphaned timer
Hour 1: 3 leadership changes = 3 orphaned timers (4 total)
Hour 6: 2 more changes = 2 orphaned timers (6 total)
Day 1: 20 changes total = 20 timers firing every 15s!
Result: 20 timers × 4 queries/min = 80 unnecessary DB queries/min
Plus growing memory usage!
Why it's dangerous:
- Memory usage grows unbounded
- Database gets hammered with invalid renewal attempts
- Connection pool exhaustion
- Can trigger cascading failures under load
The Solution - Proper Cleanup:
public class LeaderElectionService : IDisposable {
private Timer? renewalTimer;
private CancellationTokenSource? leadershipCts;
private void StartLeaseRenewal() {
// Create cancellation token for this leadership term
leadershipCts = new CancellationTokenSource();
renewalTimer = new Timer(async _ => {
try {
if (leadershipCts.Token.IsCancellationRequested) {
return; // Stop if cancelled
}
var renewed = await RenewLease();
if (!renewed) {
await OnLostLeadership();
}
} catch (Exception ex) {
Logger.Error($"Lease renewal failed: {ex}");
await OnLostLeadership();
}
}, null, TimeSpan.FromSeconds(15), TimeSpan.FromSeconds(15));
}
private async Task OnLostLeadership() {
// Critical: Clean up resources!
leadershipCts?.Cancel();
renewalTimer?.Dispose();
renewalTimer = null;
leadershipCts?.Dispose();
leadershipCts = null;
Logger.Info("Leadership lost, cleaned up resources");
await TransitionToFollower();
}
public void Dispose() {
renewalTimer?.Dispose();
leadershipCts?.Dispose();
}
}

The Problem: When the leader fails, ALL followers detect it simultaneously and race to become leader:
Leader fails at T=0
At T=10: All 20 followers detect failure simultaneously
At T=10.001: All 20 followers execute:
INSERT INTO simtime_leader ... ON CONFLICT UPDATE
Database perspective:
- 20 simultaneous complex CTE queries
- All competing for same row lock
- Lock contention causes serialization
- Connection pool exhaustion
- Query queue backs up
Cascade Effect:
T+0ms: 20 queries hit DB simultaneously
T+100ms: DB CPU = 100%, lock wait queue forming
T+500ms: Connection pool exhausted
T+1000ms: New queries timeout
T+2000ms: Health checks fail
T+3000ms: Kubernetes restarts "unhealthy" pods
T+4000ms: More instances race for leadership
Result: Complete system failure
Why it's dangerous:
- Database CPU spikes to 100%
- All instances freeze waiting for query results
- Connection pool exhaustion
- Can trigger cascading failures
- Game clients timeout and disconnect
The Solution - Jittered Elections:
public class SmartLeaderElection {
private readonly Random random = new Random();
public async Task OnLeaderFailureDetected() {
// Add random jitter to prevent thundering herd
var baseDelay = 1000; // 1 second base
var jitter = random.Next(0, 5000); // 0-5 seconds random
var totalDelay = baseDelay + jitter;
Logger.Info($"Leader failed, waiting {totalDelay}ms before election attempt");
await Task.Delay(totalDelay);
// Check if someone else already became leader during our wait
var currentLeader = await CheckCurrentLeader();
if (currentLeader != null) {
Logger.Info($"New leader already elected: {currentLeader}");
await ConnectToLeader(currentLeader);
return;
}
// Now try to become leader
await TryBecomeLeader();
}
// Even smarter: Exponential backoff for retries
public async Task ElectionWithBackoff() {
var attempt = 0;
var maxAttempts = 5;
while (attempt < maxAttempts) {
var delay = Math.Min(1000 * Math.Pow(2, attempt), 30000); // Cap at 30s
var jitter = random.Next(0, (int)delay / 2);
await Task.Delay((int)delay + jitter);
if (await TryBecomeLeader()) {
return; // Success!
}
attempt++;
}
throw new Exception("Failed to elect leader after max attempts");
}
}

Without These Fixes:
18:00: Black Friday event starts, 50,000 players online
18:15: Network blip causes followers to lose connection to leader
18:16: Split-brain occurs (Issue #1)
- Original leader still renewing
- Follower A becomes new leader
- Players get different event end times
18:17: Support tickets: "Why does my friend have different timer?"
18:20: DevOps attempts emergency restart
18:21: All instances restart, try to recover state
- One instance hits the NULL state window (Issue #2)
- Instance crashes during initialization
18:22: Crashed instance restarts, old timer still running (Issue #3)
- Orphaned timer hammering database
18:25: Leader fails under load
18:26: 20 instances thundering herd the database (Issue #4)
- Database CPU 100%
- All queries timeout
18:27: Complete game outage
18:45: Emergency war room called
19:30: Manual intervention to recover
20:00: Service restored, but event ruined
With These Fixes:
18:00: Black Friday event starts, 50,000 players online
18:15: Network blip causes followers to lose connection to leader
18:16: Fencing token prevents split-brain
- Followers reject old leader's updates
- Clean leader transition occurs
18:17: New leader elected with jitter (no thundering herd)
- Database load normal
- Players see consistent times
18:18: Event continues smoothly
- Monitoring shows brief leader transition
- No player impact
23:00: Event completes successfully
These issues must be fixed in this order:
- Transaction Isolation (Easiest, highest impact)
- Resource Leaks (Simple fix, prevents accumulation)
- Thundering Herd (Moderate complexity, critical for scale)
- Split-Brain (Most complex, but essential for correctness)
All fixes should be implemented before production deployment.
- ✅ Zero time jumps during normal operation
- ✅ < 10ms time difference between instances
- ✅ < 30 second leader election time
- ✅ 99.99% availability
- ✅ < 1 second time drift after full restart
- ✅ 100% time continuity with snapshots available
- ✅ < 1ms latency for time queries (local calculation)
- ✅ < 100ms for time sync operations
- ✅ Support 100+ instances without degradation
- ✅ < 50ms snapshot save time
- ✅ < 100ms recovery time on startup
- ✅ Zero-downtime deployments
- ✅ Automatic failover on leader failure
- ✅ Self-healing after network partitions
- ✅ Successful recovery from full cluster restart
- ✅ Snapshot age always < 30 seconds
- ✅ Zero data loss with graceful shutdown
- Deploy multi-instance version alongside single instance
- Compare time outputs, don't serve traffic
- Validate consistency
- 10% β 25% β 50% β 100% over 2 weeks
- Monitor metrics at each stage
- Rollback capability at each step
- Keep single instance as emergency fallback
- After 30 days stable, remove entirely
public class PostgresLeaderElection {
private readonly IDbConnection db;
private readonly string instanceId;
private readonly string instanceEndpoint;
private Timer renewalTimer;
public async Task<LeaderElectionResult> TryBecomeLeader() {
var sql = @"
WITH election_attempt AS (
INSERT INTO simtime_leader
(resource_name, instance_id, instance_endpoint, expires_at, heartbeat_at)
VALUES ('simtime', @instanceId, @endpoint, NOW() + INTERVAL '30 seconds', NOW())
ON CONFLICT (resource_name) DO UPDATE SET
instance_id = @instanceId,
instance_endpoint = @endpoint,
expires_at = NOW() + INTERVAL '30 seconds',
heartbeat_at = NOW()
WHERE simtime_leader.expires_at < NOW()
RETURNING instance_id, instance_endpoint
)
SELECT
instance_id,
instance_endpoint,
instance_id = @instanceId as became_leader
FROM election_attempt
UNION ALL
SELECT
instance_id,
instance_endpoint,
false as became_leader
FROM simtime_leader
WHERE resource_name = 'simtime'
AND NOT EXISTS (SELECT 1 FROM election_attempt)
LIMIT 1";
var result = await db.QuerySingleAsync<LeaderElectionResult>(sql,
new { instanceId, endpoint = instanceEndpoint });
if (result.BecameLeader) {
StartLeaseRenewal();
await LogLeaderChange("acquired");
} else {
// No need for a second query - we have the leader endpoint!
await InitializeFollowerMode(result.LeaderEndpoint);
}
return result;
}
private void StartLeaseRenewal() {
renewalTimer = new Timer(async _ => {
var renewed = await db.ExecuteAsync(@"
UPDATE simtime_leader
SET expires_at = NOW() + INTERVAL '30 seconds',
heartbeat_at = NOW()
WHERE resource_name = 'simtime'
AND instance_id = @instanceId",
new { instanceId });
if (renewed == 0) {
await OnLostLeadership();
}
}, null, TimeSpan.FromSeconds(15), TimeSpan.FromSeconds(15));
}
private async Task LogLeaderChange(string eventType) {
await db.ExecuteAsync(@"
INSERT INTO simtime_events
(event_type, instance_id, performed_by)
VALUES (@eventType, @instanceId, @instanceId)",
new { eventType = $"leader_{eventType}", instanceId });
}
}

public class FollowerTransport : ISimTimeClientTransport {
private readonly SimTimeServerExternalServiceClient leaderClient;
public async Task<TimeSyncResponse> SyncAsync(CancellationToken ct) {
var request = new SyncNoStreamRequest {
ClientTimeSendTicks = DateTime.UtcNow.Ticks
};
var response = await leaderClient.SyncNoStreamAsync(request, cancellationToken: ct);
return new TimeSyncResponse {
Successful = response.Successful,
ServerTimeTicks = response.ServerTimeTicks,
ClientTimeSendTicks = response.ClientTimeSendTicks,
ClientTimeReceivedTicks = DateTime.UtcNow.Ticks
};
}
public async Task<TimeDataResponse> GetTimeDataAsync(CancellationToken ct) {
var response = await leaderClient.GetTimeDataAsync(new(), cancellationToken: ct);
return new TimeDataResponse {
Success = true,
CurrentEpochTicks = response.CurrentEpochTicks,
CurrentMultiplier = response.CurrentMultiplier,
CurrentPausedTimestamp = response.CurrentPausedTimestamp
};
}
}

This architecture solves the SimTime scaling problem by:
- Eliminating SPOF through leader election and failover
- Ensuring consistency via SimTimeClient synchronization
- Maintaining performance with local time calculation
- Reusing proven code from the existing SimTime library
The key insight is treating follower instances as "thick clients" that maintain synchronized time state, allowing them to serve requests locally without hitting the leader for every query.
This design provides the consistency guarantees required for game time synchronization while enabling horizontal scaling and high availability.
- Review & Approval: Discuss with team, address concerns
- Proof of Concept: Build minimal version to validate approach
- Design Review: Detailed technical design document
- Implementation: Follow phased approach outlined above
- Testing: Comprehensive testing including chaos engineering
- Rollout: Gradual deployment with monitoring
- PostgreSQL Lock Tables
- NATS Pub/Sub (for game client notifications)
- NTP Protocol Specification
- Distributed Systems Clock Synchronization
- Internal: SimTimeClient Implementation (library-simtime/src/Klang.Seed.SimTime.Client/)
Document Version: 1.0
Author: Architecture Team
Date: 2025-10-28
Status: DRAFT - For Review