@PierreZ
Last active March 10, 2026 13:49
FoundationDB Internal RPC Analysis: Idempotency, Retries, and Design Patterns


Context

Deep analysis of FoundationDB's internal RPC architecture — not the external client API, but the inter-role communication primitives that make FDB's distributed transaction engine work. The goal is to extract design principles around idempotency and retry safety that can inform RPC design in other distributed systems.


1. The RPC Framework Layer

Core Abstractions (fdbrpc/)

FDB's RPC system is built on Flow (a custom actor-based concurrency framework) and uses a request/reply pattern with typed endpoints:

| Primitive | Role | Key File |
| --- | --- | --- |
| `ReplyPromise<T>` | Single-shot reply channel | fdbrpc/include/fdbrpc/fdbrpc.h |
| `ReplyPromiseStream<T>` | Multi-shot streaming reply with backpressure | fdbrpc/include/fdbrpc/fdbrpc.h |
| `RequestStream<T>` / `PublicRequestStream<T>` | Typed endpoint for receiving requests | fdbrpc/include/fdbrpc/fdbrpc.h |
| FlowTransport | Network transport, endpoint registry | fdbrpc/include/fdbrpc/FlowTransport.h |
| IFailureMonitor | Failure detection oracle | fdbrpc/include/fdbrpc/FailureMonitor.h |

Two Fundamental Delivery Modes

The framework provides two fundamental send modes (plus a reliable-with-timeout variant) whose choice determines retry behavior:

getReply() — Reliable, At-Least-Once

Client → sendReliable() → retries until cancelled or permanent failure
  • Packet queued in ReliablePacketList on the Peer
  • Retransmitted on reconnection
  • Cancelled via sendCanceler() when failure monitor fires
  • Implication: Server MUST handle duplicate deliveries or the request must be idempotent

tryGetReply() — Unreliable, At-Most-Once

Client → sendUnreliable() → single attempt, races against disconnect signal
  • Single packet send via sendUnreliable()
  • Races reply against onDisconnectOrFailure() from IFailureMonitor
  • On failure: returns ErrorOr<T> with error, caller decides retry
  • Implication: No server-side dedup needed, but caller must handle missing replies

getReplyUnlessFailedFor() — Reliable With Timeout

Client → sendReliable() → but give up after sustained failure duration
  • Combines reliable send with onFailedFor(endpoint, duration, slope)
  • Used for RPCs where waiting forever is worse than failing

The retryBrokenPromise() Helper

```cpp
// genericactors.actor.h:40-57
loop {
    try {
        reply = wait(to.getReply(request));
        return reply;
    } catch (Error& e) {
        if (e.code() != error_code_broken_promise) throw;
        resetReply(request);
        wait(delayJittered(FLOW_KNOBS->PREVENT_FAST_SPIN_DELAY));
    }
}
```
  • Only for well-known endpoints that may restart (e.g., Master, ClusterController)
  • NOT for ordinary endpoints — a broken_promise there means the actor is gone
  • Automatically resets the ReplyPromise for re-send

Public vs Private Endpoints

| Type | Authorization | Use Case |
| --- | --- | --- |
| `PublicRequestStream<T>` | Calls T::verify() — checks tenant authorization | Client-facing (reads, commits) |
| `RequestStream<T>` (private) | No verification | Internal role-to-role communication |

Failed verification → permission_denied() error + audit log.


2. Complete RPC Interface Catalog

CommitProxy (13 RPCs) — fdbclient/include/fdbclient/CommitProxyInterface.h

| Request | Reply | Delivery | Idempotency Mechanism |
| --- | --- | --- | --- |
| CommitTransactionRequest | CommitID | Reliable | IdempotencyId (optional, 16-255 bytes) |
| GetReadVersionRequest | GetReadVersionReply | Reliable | Implicit (read-only) |
| GetKeyServerLocationsRequest | GetKeyServerLocationsReply | Reliable | Naturally idempotent (read) |
| GetStorageServerRejoinInfoRequest | Reply | Reliable | Naturally idempotent |
| TxnStateRequest | Void | One-way | Version-ordered |
| GetHealthMetricsRequest | Reply | Unreliable | Naturally idempotent |
| ProxySnapRequest | Void | One-way | |
| ExclusionSafetyCheckRequest | Reply | Reliable | Naturally idempotent |
| GetDDMetricsRequest | Reply | Reliable | Naturally idempotent |
| ExpireIdempotencyIdRequest | Void | One-way | Convergent (count-based) |
| GetTenantIdRequest | Reply | Reliable | Naturally idempotent |
| GetBlobGranuleLocationsRequest | Reply | Reliable | Naturally idempotent |
| SetThrottledShardRequest | Reply | Reliable | Last-writer-wins |

GrvProxy (4 RPCs) — fdbclient/include/fdbclient/GrvProxyInterface.h

| Request | Reply | Delivery | Idempotency |
| --- | --- | --- | --- |
| GetReadVersionRequest | GetReadVersionReply | Reliable | Naturally idempotent |
| GetHealthMetricsRequest | Reply | Unreliable | Naturally idempotent |
| GlobalConfigRefreshRequest | Reply | Reliable | Naturally idempotent |
| waitFailure | Void | Special (failure detection) | N/A |

Master (5 RPCs) — fdbserver/include/fdbserver/MasterInterface.h

| Request | Reply | Delivery | Idempotency |
| --- | --- | --- | --- |
| GetCommitVersionRequest | GetCommitVersionReply | Reliable | requestNum monotonic sequence + dedup |
| GetRawCommittedVersionRequest | Reply | Reliable | Naturally idempotent (read) |
| ReportRawCommittedVersionRequest | Void | One-way | Monotonic (max wins) |
| UpdateRecoveryDataRequest | Void | One-way | Version-ordered |
| waitFailure | Void | Failure detection | N/A |

Resolver (5 RPCs) — fdbserver/include/fdbserver/ResolverInterface.h

| Request | Reply | Delivery | Idempotency |
| --- | --- | --- | --- |
| ResolveTransactionBatchRequest | Reply | Reliable | prevVersion wait + version sequencing |
| ResolutionMetricsRequest | Reply | Reliable | Naturally idempotent |
| ResolutionSplitRequest | Reply | Reliable | Naturally idempotent |
| txnState | TxnStateRequest | One-way | Version-ordered |
| waitFailure | Void | Failure detection | N/A |

TLog (11 RPCs) — fdbserver/include/fdbserver/TLogInterface.h

| Request | Reply | Delivery | Idempotency |
| --- | --- | --- | --- |
| TLogCommitRequest | Reply | Reliable | Version uniqueness (one version = one commit) |
| TLogPeekRequest | Reply | Reliable | Optional sequence: (UID, int) dedup |
| TLogPeekStreamRequest | Stream | Streaming | Sequence-based |
| TLogPopRequest | Void | One-way | Monotonic (only advances, never regresses) |
| TLogLockRequest | LockResult | Reliable | Recovery protocol ensures single lock |
| TLogQueuingMetricsRequest | Reply | Reliable | Naturally idempotent |
| TLogConfirmRunningRequest | Void | One-way | Heartbeat |
| TLogRecoveryFinishedRequest | Void | One-way | Convergent |
| TLogDisablePopRequest | Void | One-way | Toggle state |
| TLogEnablePopRequest | Void | One-way | Toggle state |
| TLogSnapRequest | Void | One-way | |

StorageServer (26+ RPCs) — fdbclient/include/fdbclient/StorageServerInterface.h

Read operations (all naturally idempotent via MVCC + version pinning):

  • GetValueRequest, GetKeyRequest, GetKeyValuesRequest
  • GetMappedKeyValuesRequest, GetKeyValuesStreamRequest
  • WatchValueRequest

Metadata/status (all naturally idempotent):

  • GetShardStateRequest, WaitMetricsRequest, SplitMetricsRequest
  • GetStorageMetricsRequest, StorageQueuingMetricsRequest
  • GetKeyValueStoreTypeRequest, ReadHotSubRangeRequest, SplitRangeRequest

Change feeds: ChangeFeedStreamRequest, OverlappingChangeFeedsRequest, ChangeFeedPopRequest (monotonic), ChangeFeedVersionUpdateRequest

Checkpoints: GetCheckpointRequest, FetchCheckpointRequest, FetchCheckpointKeyValuesRequest

Admin: AuditStorageRequest, GetHotShardsRequest, GetStorageCheckSumRequest, BulkDumpRequest


3. The Five Idempotency Patterns

FDB uses exactly five distinct patterns for achieving idempotency across internal RPCs:

Pattern 1: Version-Is-The-Key (TLog, Resolver)

Where: TLogCommitRequest, ResolveTransactionBatchRequest

The globally unique, monotonically increasing commit version assigned by the Master acts as a natural deduplication key. The entire cluster has a single version sequencer (Master), so:

  • Each version is assigned exactly once
  • TLog and Resolver wait on prevVersion before processing version
  • Duplicate delivery of the same version is either ignored or produces the same result
  • The prevVersion → version chain forms an implicit total order

```
CommitProxy → Master: GetCommitVersion(requestNum=42)
Master → CommitProxy: {version=1000, prevVersion=999}
CommitProxy → Resolver: Resolve(version=1000, prevVersion=999)  // Resolver waits for 999 first
CommitProxy → TLog: Commit(version=1000)                        // TLog waits for 999 first
```

Design insight: When your system has a single sequencer, the sequence number itself is the idempotency key. No additional dedup state needed.
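The prevVersion gate can be sketched outside Flow as a small single-threaded structure. The names here (VersionGate, tryApply) are hypothetical, and a real TLog parks out-of-order commits on futures rather than in a map; this only illustrates the "one version, one application, total order" behavior:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Illustrative sketch: a commit for `version` is applied at most once, and
// only after `prevVersion` has been applied. Duplicate deliveries are no-ops.
class VersionGate {
public:
    explicit VersionGate(int64_t initialVersion) : applied_(initialVersion) {}

    // Returns true if the commit was applied now; false if it was a duplicate
    // or must wait for its predecessor (here parked in a map, not a future).
    bool tryApply(int64_t prevVersion, int64_t version, const std::string& data) {
        if (version <= applied_) return false;          // duplicate delivery: no-op
        if (prevVersion != applied_) {
            pending_[prevVersion] = { version, data };  // out of order: park it
            return false;
        }
        apply(version, data);
        drainPending();  // applying a version may unblock parked successors
        return true;
    }

    int64_t appliedVersion() const { return applied_; }

private:
    void apply(int64_t version, const std::string&) { applied_ = version; }
    void drainPending() {
        auto it = pending_.find(applied_);
        while (it != pending_.end()) {
            apply(it->second.first, it->second.second);
            pending_.erase(it);
            it = pending_.find(applied_);
        }
    }
    int64_t applied_;
    std::map<int64_t, std::pair<int64_t, std::string>> pending_;  // prevVersion -> (version, data)
};
```

Note how no dedup table is needed: the comparison against the applied version is the entire duplicate check.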

Pattern 2: Monotonic Request Numbers (Master ↔ CommitProxy)

Where: GetCommitVersionRequest.requestNum

Each CommitProxy maintains a per-proxy monotonic counter. The Master tracks mostRecentProcessedRequestNum per proxy:

```cpp
// Master side (masterserver.actor.cpp)
if (requestNum <= mostRecentProcessedRequestNum) {
    // Already processed — return cached result
    return cachedCommitVersion;
}
if (requestNum > currentRequestNum + 1) {
    // Future request — wait (Never)
    return Never;
}
// Process and cache
```

Design insight: When the caller is a fixed set of known processes each issuing sequential requests, a per-caller sequence number with server-side last-seen tracking is the simplest dedup.
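The same idea as a runnable sketch. SeqDedup and its method names are assumptions, not FDB identifiers, and unlike the real Master this toy caches only the single most recent reply per caller, which suffices when each caller retries only its latest request:

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <string>

// Illustrative per-caller sequence-number dedup, loosely modeled on the
// Master's mostRecentProcessedRequestNum tracking.
class SeqDedup {
public:
    // Returns the cached reply for a duplicate, a fresh reply for the next
    // sequence number, and nullopt for a future request (which a real server
    // would park until its predecessors arrive).
    std::optional<std::string> handle(const std::string& caller, uint64_t requestNum) {
        auto& st = perCaller_[caller];
        if (requestNum <= st.lastSeen) return st.lastReply;     // duplicate: replay cache
        if (requestNum != st.lastSeen + 1) return std::nullopt; // future: wait
        st.lastSeen = requestNum;
        st.lastReply = "reply-" + std::to_string(requestNum);   // stand-in for real work
        return st.lastReply;
    }

private:
    struct State { uint64_t lastSeen = 0; std::string lastReply; };
    std::map<std::string, State> perCaller_;
};
```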

Pattern 3: Explicit Idempotency IDs (Client → CommitProxy)

Where: CommitTransactionRequest.idempotencyId

Client-generated 16-255 byte random token, stored in the \xff\x02/idmp/ keyspace with the commit version. On retry after commit_unknown_result:

1. Client generates IdempotencyId (16 random bytes)
2. Client → CommitProxy: Commit(mutations, idempotencyId)
3. Network failure — client doesn't know if committed
4. Client → CommitProxy: commitDummyTransaction() to flush pipeline
5. Client → StorageServer: scan \xff\x02/idmp/[readVersion, readVersion+MAX_LIFE] for id
6. Found → return original versionstamp
   Not found → throw transaction_too_old (safe to retry from scratch)

Storage format batches IDs to reduce write amplification:

```
Key:   \xff\x02/idmp/${commitVersion_BE_8bytes}${highBatchIndex_1byte}
Value: ${protocolVersion}${timestamp_i64}(${idLen_1byte}${id}${lowBatchIndex_1byte})*
```

Design insight: When the caller is untrusted/unbounded and the operation has side effects, caller-generated opaque tokens stored alongside the result enable post-hoc dedup queries.
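A minimal sketch of the post-hoc query, using a version-indexed in-memory token store in place of the real \xff\x02/idmp/ keyspace; the class and method names are illustrative, not FDB's:

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <set>
#include <string>

// Tokens are stored under their commit version, mirroring the shape of the
// \xff\x02/idmp/ keyspace; an ambiguous client scans a bounded version window.
class IdempotencyIndex {
public:
    // In FDB the token is written in the same atomic commit as the mutations.
    void recordCommit(int64_t commitVersion, const std::string& token) {
        byVersion_[commitVersion].insert(token);
    }

    // After commit_unknown_result: scan [lo, hi] for our token. Found means
    // the original commit succeeded at the returned version; not found means
    // it is safe to retry from scratch.
    std::optional<int64_t> findToken(int64_t lo, int64_t hi, const std::string& token) const {
        for (auto it = byVersion_.lower_bound(lo); it != byVersion_.end() && it->first <= hi; ++it)
            if (it->second.count(token)) return it->first;
        return std::nullopt;
    }

private:
    std::map<int64_t, std::set<std::string>> byVersion_;  // commitVersion -> tokens
};
```

The bounded window matters: it is what lets expired tokens be garbage-collected without making the "not found" answer ambiguous.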

Pattern 4: Naturally Idempotent Operations (Storage reads, metadata)

Where: All read RPCs, all metadata/metrics RPCs

MVCC + version-pinned reads are inherently idempotent:

```
GetValue(key="foo", version=1000)  // Always returns same result for same version
```

No dedup needed. Retries are always safe. This covers the majority of RPCs by count.

Design insight: Design read paths to be version-pinned. If every read includes the version it's reading at, retries are free.
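A version-pinned read can be sketched with a per-key map of versions; MvccMap is a hypothetical name, not FDB's actual data structure:

```cpp
#include <cstdint>
#include <iterator>
#include <map>
#include <optional>
#include <string>

// A read at version V returns the newest write at or below V, so repeating
// the same (key, version) read yields the same answer regardless of writes
// that land afterwards.
struct MvccMap {
    std::map<std::string, std::map<int64_t, std::string>> data;  // key -> version -> value

    void set(const std::string& key, int64_t version, const std::string& value) {
        data[key][version] = value;
    }

    std::optional<std::string> get(const std::string& key, int64_t version) const {
        auto k = data.find(key);
        if (k == data.end()) return std::nullopt;
        auto v = k->second.upper_bound(version);   // first write strictly after `version`
        if (v == k->second.begin()) return std::nullopt;
        return std::prev(v)->second;               // newest write at or below `version`
    }
};
```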

Pattern 5: Monotonic/Convergent One-Way Messages

Where: TLogPopRequest, ChangeFeedPopRequest, ReportRawCommittedVersionRequest

These are fire-and-forget messages where the operation is monotonic:

```
TLogPop(version=500)   // "I've consumed everything up to 500"
TLogPop(version=500)   // Duplicate — no-op, already at 500
TLogPop(version=400)   // Out-of-order — ignored, already past 400
```

Design insight: If the operation is "advance a watermark," make it take the max. Duplicates and reordering become harmless.
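The entire mechanism fits in a few lines; a sketch with illustrative names:

```cpp
#include <algorithm>
#include <cstdint>

// A watermark that only advances: taking the max absorbs duplicates and
// reordered deliveries, so pop-style one-way messages need no dedup state.
struct Watermark {
    int64_t value = 0;

    // Idempotent and commutative: applying any multiset of pops in any
    // order yields the same final value.
    void pop(int64_t upTo) { value = std::max(value, upTo); }
};
```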


4. The Commit Path: A Case Study in Layered Idempotency

The commit path shows how these patterns compose across roles:

```
Client                CommitProxy           Master          Resolver        TLog
  |                       |                   |               |              |
  |--- CommitTxnReq ----->|                   |               |              |
  |    (idempotencyId)    |                   |               |              |
  |                       |--- GetCommitVer ->|               |              |
  |                       |   (requestNum=N)  |               |              |
  |                       |                   |-- dedup by N--|              |
  |                       |<-- version=V -----|               |              |
  |                       |                   |               |              |
  |                       |--- ResolveBatch ----------------->|              |
  |                       |   (version=V, prevVersion=V-1)    |              |
  |                       |                   |               |-- wait V-1 --|
  |                       |<-- committed/conflict ------------|              |
  |                       |                   |               |              |
  |                       |--- TLogCommit ----------------------------------->|
  |                       |   (version=V)                     |              |
  |                       |                   |               |   wait V-1   |
  |                       |<-- persisted ----------------------------------------|
  |                       |                   |               |              |
  |<-- CommitID(V) -------|                   |               |              |
  |    (+ store idmpId)   |                   |               |              |
```

Each hop uses a different pattern:

  1. Client → CommitProxy: IdempotencyId (Pattern 3)
  2. CommitProxy → Master: requestNum (Pattern 2)
  3. CommitProxy → Resolver: version chain (Pattern 1)
  4. CommitProxy → TLog: version chain (Pattern 1)
  5. CommitProxy → Client: reply (if lost, client queries via Pattern 3)

5. Error Handling and Retry Taxonomy

Error Categories

| Error | Meaning | Retry Safe? | Action |
| --- | --- | --- | --- |
| transaction_too_old | Read version expired | Yes | Get new read version, retry |
| not_committed | Conflict detected | Yes | Full retry with new read version |
| commit_unknown_result | Network failure during commit | Only with IdempotencyId | Query idmp keyspace or accept ambiguity |
| request_maybe_delivered | Send uncertainty | Same as above | Same as above |
| broken_promise | Remote actor died | Depends on endpoint type | Well-known: retry. Others: re-resolve |
| connection_failed | Network partition | Depends on operation | Re-resolve endpoint, retry |
| process_behind | Server overloaded | Yes | Backoff and retry |

Client-Side Retry Flow (NativeAPI.actor.cpp)

```cpp
// Simplified from tryCommit()
try {
    CommitID reply = wait(commitProxy.commit.getReply(req));
    return reply;
} catch (Error& e) {
    if (e.code() == error_code_commit_unknown_result) {
        if (req.idempotencyId.valid()) {
            // 1. Flush pipeline with dummy conflicting transaction
            wait(commitDummyTransaction(...));
            // 2. Search idempotency keyspace for our ID
            Optional<CommitResult> r = wait(determineCommitStatus(
                trState, readVersion, readVersion + MAX_WRITE_TRANSACTION_LIFE_VERSIONS, id));
            if (r.present()) return r.get();  // Was committed!
            throw transaction_too_old();      // Was NOT committed, safe to retry
        }
        throw commit_unknown_result();  // No idempotency ID, propagate ambiguity
    }
    throw;  // Other errors propagate to the caller's retry loop
}
```

The commitDummyTransaction Trick

Before querying for idempotency ID status, FDB commits a dummy transaction that conflicts with the original. This ensures the original commit is either fully committed or fully aborted — it cannot be "in flight" anymore. This is necessary because the idempotency ID is written atomically with the transaction.


6. Storage Server: Version-Based Exactly-Once

Storage servers apply mutations from TLogs via version ordering:

```cpp
StorageServer {
    VersionedData versionedData;     // [storageVersion, version.get()]
    Version durableVersion;          // Everything below is on disk
    map<Version, VerUpdateRef> mutationLog;  // (durableVersion, version]
}
```

Key invariant (from storageserver.actor.cpp): "All items in versionedData.atLatest() have insertVersion() > durableVersion, but views at older versions may contain older items which are also in storage (this is OK because of idempotency)"

This means:

  • Mutations at version V are applied exactly once into the versioned data structure
  • Recovery replays mutations from TLogs starting at durableVersion
  • Replayed mutations at already-applied versions are naturally deduplicated by the versioned structure
  • No explicit dedup map needed — the version IS the dedup key
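A toy model of the replay behavior, under the assumption that mutations land in a version-keyed structure; VersionedStore and its members are illustrative names, not FDB's:

```cpp
#include <cstdint>
#include <map>
#include <string>

// Because mutations are keyed by commit version, replaying the TLog stream
// from durableVersion after a restart re-inserts the same (version, mutation)
// pairs and the store converges to the same state. No explicit dedup map.
struct VersionedStore {
    int64_t durableVersion = 0;                // everything at or below is on disk
    std::map<int64_t, std::string> mutations;  // version -> mutation (stand-in for VerUpdateRef)

    void apply(int64_t version, const std::string& m) {
        mutations[version] = m;  // re-applying an already-applied version changes nothing
    }
    void makeDurable(int64_t upTo) { durableVersion = upTo; }
};
```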

7. RPC Design Guidance (Extracted Principles)

Principle 1: Choose Your Delivery Guarantee First

```
Is the operation naturally idempotent?  ──Yes──→  tryGetReply() (at-most-once)
         │ No
         ↓
Does the server have a natural dedup key?  ──Yes──→  getReply() (at-least-once)
         │ No                                         with version/seqnum dedup
         ↓
Is the caller bounded and sequential?  ──Yes──→  Per-caller sequence numbers
         │ No
         ↓
Use caller-generated idempotency tokens  ──→  Store with result, query on retry
```

Principle 2: Version as Universal Ordering Primitive

FDB's single-sequencer (Master) version is the backbone of ALL idempotency:

  • It orders TLog commits
  • It orders Resolver conflict checks
  • It pins reads for MVCC
  • It bounds idempotency ID searches

Takeaway: If your system has a logical clock or sequence oracle, thread it through every RPC. It eliminates entire classes of dedup problems.

Principle 3: Monotonic Operations Need No Dedup

For "advance watermark" operations (pop, report version, update metrics):

  • Accept the max of (current, incoming)
  • Duplicates and reordering are harmless
  • No state to track, no cleanup needed

Principle 4: Batch Idempotency Tokens to Reduce Write Amplification

FDB batches up to 256 idempotency IDs into a single KV pair per commit version range. Without batching: 1M TPS = 1M extra keys/second. With batching: 1M TPS = ~4K extra keys/second (256 IDs per key).

Principle 5: Separate the "Was it done?" Query from the "Do it" Path

FDB's idempotency doesn't prevent duplicate execution at the proxy level. Instead:

  1. Commit happens (or doesn't)
  2. ID is stored alongside the result
  3. On ambiguity, query the stored ID

This is cheaper than preventing duplicates because the common case (no failure) pays near-zero cost. Only the rare failure path pays for the query.

Principle 6: Flush Before Query

Before checking "did my operation succeed?", flush the pipeline:

```
commitDummyTransaction()        →  ensures original is fully resolved
then scan idempotency keyspace  →  definitive answer
```

Without the flush, you might get a false negative (operation still in flight).

Principle 7: Public Endpoints Verify, Private Endpoints Trust

FDB splits endpoints into PublicRequestStream (client-facing, calls verify()) and RequestStream (internal, no check). This avoids authorization overhead on the hot internal paths while protecting the external surface.

Principle 8: Use Failure Detection to Bound Retries

getReplyUnlessFailedFor(duration, slope) prevents indefinite retry against a dead node. The IFailureMonitor provides onFailedFor() which uses sustained failure duration rather than instantaneous checks — avoids flapping.
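Sustained-failure detection can be approximated with a tiny tracker that records when an endpoint started failing. This is a sketch with hypothetical names, not the actual IFailureMonitor API (the real monitor also applies a slope parameter to scale the window):

```cpp
#include <cassert>

// An endpoint counts as failed only after it has been continuously failing
// for `duration` seconds, filtering out momentary blips that cause flapping.
struct FailureTracker {
    double failingSince = -1.0;  // negative means currently considered healthy

    void report(bool healthy, double now) {
        if (healthy)
            failingSince = -1.0;  // any success resets the window
        else if (failingSince < 0.0)
            failingSince = now;   // first failure observed: start the clock
    }

    bool failedFor(double duration, double now) const {
        return failingSince >= 0.0 && (now - failingSince) >= duration;
    }
};
```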


8. Key Files Reference

| File | Contents |
| --- | --- |
| fdbrpc/include/fdbrpc/fdbrpc.h | ReplyPromise, RequestStream, delivery modes |
| fdbrpc/include/fdbrpc/FlowTransport.h | Transport, endpoints, reliable/unreliable send |
| fdbrpc/include/fdbrpc/FailureMonitor.h | Failure detection interface |
| fdbrpc/include/fdbrpc/genericactors.actor.h | retryBrokenPromise, sendCanceler, waitValueOrSignal |
| fdbclient/include/fdbclient/CommitProxyInterface.h | CommitProxy RPCs |
| fdbclient/include/fdbclient/StorageServerInterface.h | StorageServer RPCs (26+) |
| fdbserver/include/fdbserver/MasterInterface.h | Master RPCs (requestNum dedup) |
| fdbserver/include/fdbserver/ResolverInterface.h | Resolver RPCs (version chain) |
| fdbserver/include/fdbserver/TLogInterface.h | TLog RPCs (version-based) |
| fdbclient/IdempotencyId.actor.cpp | IdempotencyId storage & query |
| fdbserver/CommitProxyServer.actor.cpp | Commit path implementation |
| fdbclient/NativeAPI.actor.cpp | Client retry logic, determineCommitStatus |
| design/idempotency_ids.md | Design doc for idempotency IDs |
| design/Commit/How a commit is done in FDB.md | Commit path walkthrough |

Verification

This is a read-only analysis. To verify findings:

  1. Trace the commit path: grep -r "getReply\|tryGetReply\|retryBrokenPromise" fdbserver/CommitProxyServer.actor.cpp
  2. Check Master dedup: read fdbserver/masterserver.actor.cpp, search for requestNum
  3. Check TLog version ordering: read fdbserver/TLogServer.actor.cpp, search for prevVersion
  4. Check idempotency ID storage: read fdbclient/IdempotencyId.actor.cpp

FoundationDB Internal RPC Analysis: Idempotency, Retries, and Design Patterns

Context

Deep analysis of FoundationDB's internal RPC architecture — not the external client API, but the inter-role communication primitives that make FDB's distributed transaction engine work. The goal is to extract design principles around idempotency and retry safety that can inform RPC design in other distributed systems.


1. The RPC Framework Layer

Core Abstractions (fdbrpc/)

FDB's RPC system is built on Flow (a custom actor-based concurrency framework) and uses a request/reply pattern with typed endpoints:

Primitive Role Key File
ReplyPromise<T> Single-shot reply channel fdbrpc/include/fdbrpc/fdbrpc.h
ReplyPromiseStream<T> Multi-shot streaming reply with backpressure fdbrpc/include/fdbrpc/fdbrpc.h
RequestStream<T> / PublicRequestStream<T> Typed endpoint for receiving requests fdbrpc/include/fdbrpc/fdbrpc.h
FlowTransport Network transport, endpoint registry fdbrpc/include/fdbrpc/FlowTransport.h
IFailureMonitor Failure detection oracle fdbrpc/include/fdbrpc/FailureMonitor.h

Two Fundamental Delivery Modes

The framework provides two send modes that fundamentally determine retry behavior:

getReply() — Reliable, At-Least-Once

Client → sendReliable() → retries until cancelled or permanent failure
  • Packet queued in ReliablePacketList on the Peer
  • Retransmitted on reconnection
  • Cancelled via sendCanceler() when failure monitor fires
  • Implication: Server MUST handle duplicate deliveries or the request must be idempotent

tryGetReply() — Unreliable, At-Most-Once

Client → sendUnreliable() → single attempt, races against disconnect signal
  • Single packet send via sendUnreliable()
  • Races reply against onDisconnectOrFailure() from IFailureMonitor
  • On failure: returns ErrorOr<T> with error, caller decides retry
  • Implication: No server-side dedup needed, but caller must handle missing replies

getReplyUnlessFailedFor() — Reliable With Timeout

Client → sendReliable() → but give up after sustained failure duration
  • Combines reliable send with onFailedFor(endpoint, duration, slope)
  • Used for RPCs where waiting forever is worse than failing

The retryBrokenPromise() Helper

// genericactors.actor.h:40-57
loop {
    try {
        reply = wait(to.getReply(request));
        return reply;
    } catch (Error& e) {
        if (e.code() != error_code_broken_promise) throw;
        resetReply(request);
        wait(delayJittered(FLOW_KNOBS->PREVENT_FAST_SPIN_DELAY));
    }
}
  • Only for well-known endpoints that may restart (e.g., Master, ClusterController)
  • NOT for ordinary endpoints — a broken_promise there means the actor is gone
  • Automatically resets the ReplyPromise for re-send

Public vs Private Endpoints

Type Authorization Use Case
PublicRequestStream<T> Calls T::verify() — checks tenant authorization Client-facing (reads, commits)
RequestStream<T> (private) No verification Internal role-to-role communication

Failed verification → permission_denied() error + audit log.


2. Complete RPC Interface Catalog

CommitProxy (13 RPCs) — fdbclient/include/fdbclient/CommitProxyInterface.h

Request Reply Delivery Idempotency Mechanism
CommitTransactionRequest CommitID Reliable IdempotencyId (optional, 16-255 bytes)
GetReadVersionRequest GetReadVersionReply Reliable Implicit (read-only)
GetKeyServerLocationsRequest GetKeyServerLocationsReply Reliable Naturally idempotent (read)
GetStorageServerRejoinInfoRequest Reply Reliable Naturally idempotent
TxnStateRequest Void One-way Version-ordered
GetHealthMetricsRequest Reply Unreliable Naturally idempotent
ProxySnapRequest Void One-way
ExclusionSafetyCheckRequest Reply Reliable Naturally idempotent
GetDDMetricsRequest Reply Reliable Naturally idempotent
ExpireIdempotencyIdRequest Void One-way Convergent (count-based)
GetTenantIdRequest Reply Reliable Naturally idempotent
GetBlobGranuleLocationsRequest Reply Reliable Naturally idempotent
SetThrottledShardRequest Reply Reliable Last-writer-wins

GrvProxy (4 RPCs) — fdbclient/include/fdbclient/GrvProxyInterface.h

Request Reply Delivery Idempotency
GetReadVersionRequest GetReadVersionReply Reliable Naturally idempotent
GetHealthMetricsRequest Reply Unreliable Naturally idempotent
GlobalConfigRefreshRequest Reply Reliable Naturally idempotent
waitFailure Void Special (failure detection) N/A

Master (5 RPCs) — fdbserver/include/fdbserver/MasterInterface.h

Request Reply Delivery Idempotency
GetCommitVersionRequest GetCommitVersionReply Reliable requestNum monotonic sequence + dedup
GetRawCommittedVersionRequest Reply Reliable Naturally idempotent (read)
ReportRawCommittedVersionRequest Void One-way Monotonic (max wins)
UpdateRecoveryDataRequest Void One-way Version-ordered
waitFailure Void Failure detection N/A

Resolver (5 RPCs) — fdbserver/include/fdbserver/ResolverInterface.h

Request Reply Delivery Idempotency
ResolveTransactionBatchRequest Reply Reliable prevVersion wait + version sequencing
ResolutionMetricsRequest Reply Reliable Naturally idempotent
ResolutionSplitRequest Reply Reliable Naturally idempotent
txnState TxnStateRequest One-way Version-ordered
waitFailure Void Failure detection N/A

TLog (11 RPCs) — fdbserver/include/fdbserver/TLogInterface.h

Request Reply Delivery Idempotency
TLogCommitRequest Reply Reliable Version uniqueness (one version = one commit)
TLogPeekRequest Reply Reliable Optional sequence: (UID, int) dedup
TLogPeekStreamRequest Stream Streaming Sequence-based
TLogPopRequest Void One-way Monotonic (only advances, never regresses)
TLogLockRequest LockResult Reliable Recovery protocol ensures single lock
TLogQueuingMetricsRequest Reply Reliable Naturally idempotent
TLogConfirmRunningRequest Void One-way Heartbeat
TLogRecoveryFinishedRequest Void One-way Convergent
TLogDisablePopRequest Void One-way Toggle state
TLogEnablePopRequest Void One-way Toggle state
TLogSnapRequest Void One-way

StorageServer (26+ RPCs) — fdbclient/include/fdbclient/StorageServerInterface.h

Read operations (all naturally idempotent via MVCC + version pinning):

  • GetValueRequest, GetKeyRequest, GetKeyValuesRequest
  • GetMappedKeyValuesRequest, GetKeyValuesStreamRequest
  • WatchValueRequest

Metadata/status (all naturally idempotent):

  • GetShardStateRequest, WaitMetricsRequest, SplitMetricsRequest
  • GetStorageMetricsRequest, StorageQueuingMetricsRequest
  • GetKeyValueStoreTypeRequest, ReadHotSubRangeRequest, SplitRangeRequest

Change feeds: ChangeFeedStreamRequest, OverlappingChangeFeedsRequest, ChangeFeedPopRequest (monotonic), ChangeFeedVersionUpdateRequest

Checkpoints: GetCheckpointRequest, FetchCheckpointRequest, FetchCheckpointKeyValuesRequest

Admin: AuditStorageRequest, GetHotShardsRequest, GetStorageCheckSumRequest, BulkDumpRequest


3. The Five Idempotency Patterns

FDB uses exactly five distinct patterns for achieving idempotency across internal RPCs:

Pattern 1: Version-Is-The-Key (TLog, Resolver)

Where: TLogCommitRequest, ResolveTransactionBatchRequest

The globally unique, monotonically increasing commit version assigned by the Master acts as a natural deduplication key. The entire cluster has a single version sequencer (Master), so:

  • Each version is assigned exactly once
  • TLog and Resolver wait on prevVersion before processing version
  • Duplicate delivery of the same version is either ignored or produces the same result
  • The prevVersion → version chain forms an implicit total order
CommitProxy → Master: GetCommitVersion(requestNum=42)
Master → CommitProxy: {version=1000, prevVersion=999}
CommitProxy → Resolver: Resolve(version=1000, prevVersion=999)  // Resolver waits for 999 first
CommitProxy → TLog: Commit(version=1000)                         // TLog waits for 999 first

Design insight: When your system has a single sequencer, the sequence number itself is the idempotency key. No additional dedup state needed.

Pattern 2: Monotonic Request Numbers (Master ↔ CommitProxy)

Where: GetCommitVersionRequest.requestNum

Each CommitProxy maintains a per-proxy monotonic counter. The Master tracks mostRecentProcessedRequestNum per proxy:

// Master side (masterserver.actor.cpp)
if (requestNum <= mostRecentProcessedRequestNum) {
    // Already processed — return cached result
    return cachedCommitVersion;
}
if (requestNum > currentRequestNum + 1) {
    // Future request — wait (Never)
    return Never;
}
// Process and cache

Design insight: When the caller is a fixed set of known processes each issuing sequential requests, a per-caller sequence number with server-side last-seen tracking is the simplest dedup.

Pattern 3: Explicit Idempotency IDs (Client → CommitProxy)

Where: CommitTransactionRequest.idempotencyId

Client-generated 16-255 byte random token, stored in the \xff\x02/idmp/ keyspace with the commit version. On retry after commit_unknown_result:

1. Client generates IdempotencyId (16 random bytes)
2. Client → CommitProxy: Commit(mutations, idempotencyId)
3. Network failure — client doesn't know if committed
4. Client → CommitProxy: commitDummyTransaction() to flush pipeline
5. Client → StorageServer: scan \xff\x02/idmp/[readVersion, readVersion+MAX_LIFE] for id
6. Found → return original versionstamp
   Not found → throw transaction_too_old (safe to retry from scratch)

Storage format batches IDs to reduce write amplification:

Key:   \xff\x02/idmp/${commitVersion_BE_8bytes}${highBatchIndex_1byte}
Value: ${protocolVersion}${timestamp_i64}(${idLen_1byte}${id}${lowBatchIndex_1byte})*

Design insight: When the caller is untrusted/unbounded and the operation has side effects, caller-generated opaque tokens stored alongside the result enable post-hoc dedup queries.

Pattern 4: Naturally Idempotent Operations (Storage reads, metadata)

Where: All read RPCs, all metadata/metrics RPCs

MVCC + version-pinned reads are inherently idempotent:

GetValue(key="foo", version=1000)  // Always returns same result for same version

No dedup needed. Retries are always safe. This covers the majority of RPCs by count.

Design insight: Design read paths to be version-pinned. If every read includes the version it's reading at, retries are free.

Pattern 5: Monotonic/Convergent One-Way Messages

Where: TLogPopRequest, ChangeFeedPopRequest, ReportRawCommittedVersionRequest

These are fire-and-forget messages where the operation is monotonic:

TLogPop(version=500)   // "I've consumed everything up to 500"
TLogPop(version=500)   // Duplicate — no-op, already at 500
TLogPop(version=400)   // Out-of-order — ignored, already past 400

Design insight: If the operation is "advance a watermark," make it take the max. Duplicates and reordering become harmless.


4. The Commit Path: A Case Study in Layered Idempotency

The commit path shows how these patterns compose across roles:

Client                CommitProxy           Master          Resolver        TLog
  |                       |                   |               |              |
  |--- CommitTxnReq ----->|                   |               |              |
  |    (idempotencyId)    |                   |               |              |
  |                       |--- GetCommitVer ->|               |              |
  |                       |   (requestNum=N)  |               |              |
  |                       |                   |-- dedup by N--|              |
  |                       |<-- version=V -----|               |              |
  |                       |                   |               |              |
  |                       |--- ResolveBatch ----------------->|              |
  |                       |   (version=V, prevVersion=V-1)    |              |
  |                       |                   |               |-- wait V-1 --|
  |                       |<-- committed/conflict ------------|              |
  |                       |                   |               |              |
  |                       |--- TLogCommit ----------------------------------->|
  |                       |   (version=V)                     |              |
  |                       |                   |               |   wait V-1   |
  |                       |<-- persisted ----------------------------------------|
  |                       |                   |               |              |
  |<-- CommitID(V) -------|                   |               |              |
  |    (+ store idmpId)   |                   |               |              |

Each hop uses a different pattern:

  1. Client → CommitProxy: IdempotencyId (Pattern 3)
  2. CommitProxy → Master: requestNum (Pattern 2)
  3. CommitProxy → Resolver: version chain (Pattern 1)
  4. CommitProxy → TLog: version chain (Pattern 1)
  5. CommitProxy → Client: reply (if lost, client queries via Pattern 3)

5. Error Handling and Retry Taxonomy

Error Categories

| Error | Meaning | Retry Safe? | Action |
|---|---|---|---|
| transaction_too_old | Read version expired | Yes | Get new read version, retry |
| not_committed | Conflict detected | Yes | Full retry with new read version |
| commit_unknown_result | Network failure during commit | Only with IdempotencyId | Query idmp keyspace or accept ambiguity |
| request_maybe_delivered | Send uncertainty | Same as above | Same as above |
| broken_promise | Remote actor died | Depends on endpoint type | Well-known: retry. Others: re-resolve |
| connection_failed | Network partition | Depends on operation | Re-resolve endpoint, retry |
| process_behind | Server overloaded | Yes | Backoff and retry |

Client-Side Retry Flow (NativeAPI.actor.cpp)

// Simplified from tryCommit()
try {
    CommitID reply = wait(commitProxy.commit.getReply(req));
    return reply;
} catch (Error& e) {
    if (e.code() == error_code_commit_unknown_result) {
        if (req.idempotencyId.valid()) {
            // 1. Flush pipeline with dummy conflicting transaction
            wait(commitDummyTransaction(...));
            // 2. Search idempotency keyspace for our ID
            Optional<CommitResult> r = wait(determineCommitStatus(
                trState, readVersion, readVersion + MAX_WRITE_TRANSACTION_LIFE_VERSIONS, id));
            if (r.present()) return r.get();  // Was committed!
            throw transaction_too_old();      // Was NOT committed, safe to retry
        }
        throw commit_unknown_result();  // No idempotency ID, propagate ambiguity
    }
    throw e;  // Other errors propagate to the caller's retry loop
}

The commitDummyTransaction Trick

Before querying for idempotency ID status, FDB commits a dummy transaction that conflicts with the original. This ensures the original commit is either fully committed or fully aborted — it cannot be "in flight" anymore. This is necessary because the idempotency ID is written atomically with the transaction.


6. Storage Server: Version-Based Exactly-Once

Storage servers apply mutations from TLogs via version ordering:

StorageServer {
    VersionedData versionedData;     // [storageVersion, version.get()]
    Version durableVersion;          // Everything below is on disk
    map<Version, VerUpdateRef> mutationLog;  // (durableVersion, version]
}

Key invariant (from storageserver.actor.cpp): "All items in versionedData.atLatest() have insertVersion() > durableVersion, but views at older versions may contain older items which are also in storage (this is OK because of idempotency)"

This means:

  • Mutations at version V are applied exactly once into the versioned data structure
  • Recovery replays mutations from TLogs starting at durableVersion
  • Replayed mutations at already-applied versions are naturally deduplicated by the versioned structure
  • No explicit dedup map needed — the version IS the dedup key
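
A minimal sketch of this invariant, assuming a hypothetical `VersionedStore` rather than FDB's actual storage engine: recovery replays whole mutation batches keyed by version, and any batch at or below the applied version is dropped without consulting a dedup map.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical version-keyed apply loop: the version IS the dedup key,
// so replaying an already-applied batch after a restart is a no-op.
struct VersionedStore {
    int64_t appliedVersion = 0;                  // analogous to durableVersion
    std::map<std::string, std::string> data;

    // Apply the mutation batch tagged with `version`; replays are dropped.
    bool apply(int64_t version, const std::map<std::string, std::string>& mutations) {
        if (version <= appliedVersion)
            return false;                        // already applied: natural dedup
        for (const auto& [k, v] : mutations)
            data[k] = v;
        appliedVersion = version;
        return true;
    }
};
```

The key property: the replay decision is a single integer comparison, so idempotent recovery costs nothing per mutation.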

7. RPC Design Guidance (Extracted Principles)

Principle 1: Choose Your Delivery Guarantee First

Is the operation naturally idempotent?  ──Yes──→  tryGetReply() (at-most-once)
         │ No
         ↓
Does the server have a natural dedup key?  ──Yes──→  getReply() (at-least-once)
         │ No                                         with version/seqnum dedup
         ↓
Is the caller bounded and sequential?  ──Yes──→  Per-caller sequence numbers
         │ No
         ↓
Use caller-generated idempotency tokens  ──→  Store with result, query on retry

Principle 2: Version as Universal Ordering Primitive

FDB's single-sequencer (Master) version is the backbone of ALL idempotency:

  • It orders TLog commits
  • It orders Resolver conflict checks
  • It pins reads for MVCC
  • It bounds idempotency ID searches

Takeaway: If your system has a logical clock or sequence oracle, thread it through every RPC. It eliminates entire classes of dedup problems.

Principle 3: Monotonic Operations Need No Dedup

For "advance watermark" operations (pop, report version, update metrics):

  • Accept the max of (current, incoming)
  • Duplicates and reordering are harmless
  • No state to track, no cleanup needed

Principle 4: Batch Idempotency Tokens to Reduce Write Amplification

FDB batches up to 256 idempotency IDs into a single KV pair per commit version range. Without batching: 1M TPS = 1M extra keys/second. With batching: 1M TPS = ~4K extra keys/second (256 IDs per key).
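
The real batching lives in IdempotencyId.actor.cpp; this is only an illustrative counter (a hypothetical `IdempotencyBatcher`) that demonstrates the write-amplification arithmetic: packing up to 256 tokens into one KV pair turns 1M token writes into roughly 4K key writes.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical batcher: accumulates idempotency tokens and emits one
// KV write per full batch, instead of one write per token.
struct IdempotencyBatcher {
    static constexpr size_t batchSize = 256;
    std::vector<std::string> pending;
    size_t keysWritten = 0;

    void add(std::string id) {
        pending.push_back(std::move(id));
        if (pending.size() == batchSize)
            flush();
    }

    void flush() {
        if (pending.empty())
            return;
        ++keysWritten;   // one KV write covers the whole batch
        pending.clear();
    }
};
```

For 1,000 tokens this writes 4 keys (3 full batches plus one partial flush) instead of 1,000, a ~250x reduction, matching the 1M TPS -> ~4K keys/second figure above.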

Principle 5: Separate the "Was it done?" Query from the "Do it" Path

FDB's idempotency doesn't prevent duplicate execution at the proxy level. Instead:

  1. Commit happens (or doesn't)
  2. ID is stored alongside the result
  3. On ambiguity, query the stored ID

This is cheaper than preventing duplicates because the common case (no failure) pays near-zero cost. Only the rare failure path pays for the query.
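
The separation can be sketched with a hypothetical `CommitLedger` (not FDB's actual keyspace layout): the "do it" path stores the token alongside the result at near-zero cost, and only the ambiguous-failure path ever runs the "was it done?" lookup.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <string>

// Hypothetical ledger: commit stores the idempotency token with the
// result; the lookup runs only after an ambiguous failure.
struct CommitLedger {
    std::map<std::string, int64_t> committed;   // idempotencyId -> commit version

    int64_t commit(const std::string& id, int64_t version) {
        committed[id] = version;                // stored atomically with the commit
        return version;
    }

    // Called only on commit_unknown_result: did my commit become durable?
    std::optional<int64_t> wasCommitted(const std::string& id) const {
        auto it = committed.find(id);
        if (it == committed.end())
            return std::nullopt;                // definitely not committed
        return it->second;                      // committed at this version
    }
};
```

Note the query is only definitive after the pipeline has been flushed (the dummy-transaction trick above); until then a negative answer could mean "still in flight".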

Principle 6: Flush Before Query

Before checking "did my operation succeed?", flush the pipeline:

commitDummyTransaction()  →  ensures original is fully resolved
then scan idempotency keyspace  →  definitive answer

Without the flush, you might get a false negative (operation still in flight).

Principle 7: Public Endpoints Verify, Private Endpoints Trust

FDB splits endpoints into PublicRequestStream (client-facing, calls verify()) and RequestStream (internal, no check). This avoids authorization overhead on the hot internal paths while protecting the external surface.

Principle 8: Use Failure Detection to Bound Retries

getReplyUnlessFailedFor(duration, slope) prevents indefinite retry against a dead node. The IFailureMonitor provides onFailedFor() which uses sustained failure duration rather than instantaneous checks — avoids flapping.


8. Key Files Reference

| File | Contents |
|---|---|
| fdbrpc/include/fdbrpc/fdbrpc.h | ReplyPromise, RequestStream, delivery modes |
| fdbrpc/include/fdbrpc/FlowTransport.h | Transport, endpoints, reliable/unreliable send |
| fdbrpc/include/fdbrpc/FailureMonitor.h | Failure detection interface |
| fdbrpc/include/fdbrpc/genericactors.actor.h | retryBrokenPromise, sendCanceler, waitValueOrSignal |
| fdbclient/include/fdbclient/CommitProxyInterface.h | CommitProxy RPCs |
| fdbclient/include/fdbclient/StorageServerInterface.h | StorageServer RPCs (26+) |
| fdbserver/include/fdbserver/MasterInterface.h | Master RPCs (requestNum dedup) |
| fdbserver/include/fdbserver/ResolverInterface.h | Resolver RPCs (version chain) |
| fdbserver/include/fdbserver/TLogInterface.h | TLog RPCs (version-based) |
| fdbclient/IdempotencyId.actor.cpp | IdempotencyId storage & query |
| fdbserver/CommitProxyServer.actor.cpp | Commit path implementation |
| fdbclient/NativeAPI.actor.cpp | Client retry logic, determineCommitStatus |
| design/idempotency_ids.md | Design doc for idempotency IDs |
| design/Commit/How a commit is done in FDB.md | Commit path walkthrough |

Verification

This is a read-only analysis. To verify findings:

  1. Trace the commit path: grep -r "getReply\|tryGetReply\|retryBrokenPromise" fdbserver/CommitProxyServer.actor.cpp
  2. Check Master dedup: read fdbserver/masterserver.actor.cpp, search for requestNum
  3. Check TLog version ordering: read fdbserver/TLogServer.actor.cpp, search for prevVersion
  4. Check idempotency ID storage: read fdbclient/IdempotencyId.actor.cpp

FoundationDB Internal RPC Analysis: Method Taxonomy, Idempotency & Design Patterns

Context

Deep analysis of FoundationDB's internal RPC architecture — the inter-role communication primitives that make FDB's distributed transaction engine work. Extracted from the actual codebase, not documentation.


1. The Ten RPC Method Categories

Every internal RPC call in FDB falls into one of ten structural categories. These aren't just "how to send a message" — each category implies a different contract around failure, ordering, idempotency, and who bears retry responsibility.

Category A: Request-Reply (Single Target)


A1. getReply() — Reliable, Blocking, At-Least-Once

reply = wait(interface.endpoint.getReply(request, taskPriority));

Semantics: Queues packet in ReliablePacketList. Retransmits on reconnection. Caller blocks until reply or permanent failure. Server may see duplicates.

Where used (hot path):

| Caller | Callee | Request | File |
|---|---|---|---|
| CommitProxy | Master | GetCommitVersionRequest | CommitProxyServer.actor.cpp:1016 |
| CommitProxy | Resolver | ResolveTransactionBatchRequest | CommitProxyServer.actor.cpp:1132 |
| CommitProxy | TLog | TLogCommitRequest | TagPartitionedLogSystem.actor.cpp:675 |
| GrvProxy | Ratekeeper | GetRateInfoRequest | GrvProxyServer.actor.cpp:426 |
| GrvProxy | Master | GetRawCommittedVersionRequest | GrvProxyServer.actor.cpp:479 |

Where used (cold path — recovery, admin):

| Caller | Callee | Request | File |
|---|---|---|---|
| ClusterRecovery | Master | UpdateRecoveryDataRequest | ClusterRecovery.actor.cpp:1072 |
| ClusterRecovery | TLog | TLogLockRequest | ClusterRecovery.actor.cpp |

Idempotency burden: On the server. Since transport retransmits, server must either:

  • Be naturally idempotent (reads), or
  • Dedup by version/seqnum (TLog, Resolver), or
  • Dedup by request number (Master)

When to use: Critical RPCs where the caller cannot proceed without the answer.


A2. tryGetReply() — Unreliable, Non-Blocking, At-Most-Once

ErrorOr<Reply> reply = wait(interface.endpoint.tryGetReply(request));

Semantics: Single sendUnreliable(). Races reply against onDisconnectOrFailure(). Returns ErrorOr<T> — caller handles failure explicitly. Server sees request at most once.

Where used:

| Caller | Callee | Request | File |
|---|---|---|---|
| CommitProxy | DataDistributor | DistributorSnapRequest | CommitProxyServer.actor.cpp:3483 |
| CommitProxy | DataDistributor | ExclusionSafetyCheckRequest | CommitProxyServer.actor.cpp:3530 |
| CommitProxy | BlobManager | ExclusionSafetyCheckRequest | CommitProxyServer.actor.cpp:3537 |

Idempotency burden: On the caller. If the reply is lost, caller decides whether to retry. Server never sees duplicates from the transport layer.

When to use: Admin/operational RPCs that can fail gracefully. Never on the commit hot path.


A3. getReplyUnlessFailedFor(duration, slope) — Reliable With Timeout

ErrorOr<Reply> reply = wait(
    interface.endpoint.getReplyUnlessFailedFor(request, 2.0, 0.0));

Semantics: Like getReply() but races against IFailureMonitor::onFailedFor(). If endpoint stays failed for duration + slope * time_since_last_good, returns error.

Where used:

| Caller | Callee | Request | File |
|---|---|---|---|
| StorageServer | Remote StorageServer | GetKeyValuesRequest | storageserver.actor.cpp:5636 |
| ClusterRecovery | Workers | InitializeCommitProxyRequest | ClusterRecovery.actor.cpp |
| ClusterRecovery | Workers | InitializeResolverRequest | ClusterRecovery.actor.cpp |
| ClusterRecovery | Workers | InitializeTLogRequest | ClusterRecovery.actor.cpp |

Idempotency burden: Same as getReply() — server may see duplicates.

When to use: RPCs where waiting forever is worse than failing. Recovery initialization (bounded by TLOG_TIMEOUT), cross-datacenter reads.


A4. retryBrokenPromise() — Infinite Retry for Well-Known Endpoints

reply = wait(retryBrokenPromise(interface.endpoint, request));

Semantics: Wraps getReply() in a loop that catches broken_promise (endpoint died), resets the ReplyPromise, waits delayJittered(PREVENT_FAST_SPIN_DELAY), and retries. All other errors propagate.

// genericactors.actor.h:40-57
loop {
    try {
        reply = wait(to.getReply(request));
        return reply;
    } catch (Error& e) {
        if (e.code() != error_code_broken_promise) throw;
        resetReply(request);
        wait(delayJittered(FLOW_KNOBS->PREVENT_FAST_SPIN_DELAY));
    }
}

Where used: Implicitly via brokenPromiseToNever() wrapper in many broadcast patterns. Directly for well-known singletons (Master, ClusterController) that restart after failure.

Critical rule: NEVER use for ephemeral endpoints. A broken_promise from a storage server means that specific server is gone — retrying the same endpoint is pointless.

When to use: Only for well-known endpoints that are expected to come back after restart.


Category B: Fire-and-Forget (No Reply)


B1. send() — One-Way Message, No Acknowledgment

interface.endpoint.send(request);   // returns void, no future

Semantics: Sends message and immediately continues. No reply channel. Lost messages are silently dropped.

Where used:

| Caller | Callee | Request | Safety Mechanism |
|---|---|---|---|
| CommitProxy | TLog | TLogPopRequest | Monotonic (max version wins) |
| Client | CommitProxy | ExpireIdempotencyIdRequest | Convergent (count-based) |
| CommitProxy | Master | ReportRawCommittedVersionRequest | Monotonic (max version wins) |
| TLog internals | | Completion signals (Void) | Idempotent signals |
| Recovery | Proxies | TLogRecoveryFinishedRequest | Convergent state |

Idempotency burden: The operation itself must be safe to lose or duplicate:

  • Monotonic operations (pop = advance watermark)
  • Convergent operations (state that converges regardless of message count)
  • Signals (triggering something that checks its own state)

When to use: Watermark advances, metric updates, best-effort cleanup, internal signals.


Category C: Multi-Target Patterns


C1. Broadcast + waitForAll() — Fan-Out, All Must Succeed

std::vector<Future<Reply>> replies;
for (auto& resolver : resolvers) {
    replies.push_back(brokenPromiseToNever(
        resolver.resolve.getReply(request, TaskPriority::ProxyResolverReply)));
}
std::vector<Reply> results = wait(getAll(replies));

Semantics: Send to ALL targets, wait for ALL to respond. Any single failure fails the entire operation.

Where used:

| Caller | Callee | Request | Why ALL? |
|---|---|---|---|
| CommitProxy | ALL Resolvers | ResolveTransactionBatchRequest | Each resolver owns a key range partition — need all conflict results |
| CommitProxy | ALL Proxies | TxnStateRequest (broadcast) | All proxies must have consistent txn state |
| ClusterRecovery | ALL new roles | Initialize*Request | All roles must be up before recovery completes |

Failure mode: If any target fails, the caller gets broken_promise or timeout. On the commit path, this triggers recovery.

When to use: When the operation is partitioned across targets (each has unique work) or when consistency requires all participants to acknowledge.


C2. Broadcast + quorum(N) — Fan-Out, Quorum Suffices

std::vector<Future<Void>> tLogCommitResults;
for (auto& tlog : logSet->logServers) {
    tLogCommitResults.push_back(
        success(tlog->get().interf().commit.getReply(request, TaskPriority::ProxyTLogCommitReply)));
}
wait(quorum(tLogCommitResults, tLogCommitResults.size() - tLogWriteAntiQuorum));

Semantics: Send to ALL targets, but only wait for N - antiQuorum to respond. Tolerates antiQuorum failures.

Where used:

| Caller | Callee | Quorum Size | Why Quorum? |
|---|---|---|---|
| CommitProxy | TLogs (per log set) | count - tLogWriteAntiQuorum | Replication factor allows some failures |
| ClusterRecovery | Coordinators | Majority (N/2 + 1) | Leader election consensus |

Where used (inverted — quorum=1 for failure detection):

// ClusterRecovery.actor.cpp:416-428
tagError<Void>(quorum(proxyFailures, 1), commit_proxy_failed())
tagError<Void>(quorum(resolverFailures, 1), resolver_failed())

ANY single failure triggers recovery.

When to use: Replicated writes (TLogs), consensus protocols (coordinators), failure watchers (any-1 failure = action).
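
The quorum rule itself is one comparison; a minimal sketch (hypothetical `quorumReached`, not Flow's `quorum()` actor) of the N - antiQuorum condition used for TLog pushes:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical quorum check: with N targets and an anti-quorum of A,
// the caller may proceed once N - A acknowledgments have arrived.
bool quorumReached(const std::vector<bool>& acks, size_t antiQuorum) {
    size_t got = 0;
    for (bool a : acks)
        if (a) ++got;
    return got >= acks.size() - antiQuorum;
}
```

With antiQuorum = 0 this degenerates to waitForAll; the inverted form "quorum of 1 failure" is the same check over failure futures instead of acks.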


C3. loadBalance() — Replica Selection With Hedging

Reply reply = wait(loadBalance(
    cx.getPtr(), locationInfo.locations,
    &StorageServerInterface::getValue,
    request, TaskPriority::DefaultEndpoint));

Semantics: Picks the best replica based on queue depth metrics (smoothOutstanding). Sends tryGetReply() to best server. If slow, hedges by sending to next-best after a delay. On failure, rotates through all replicas. If all down, waits for any to recover.

Internal flow:

1. Score replicas by smoothOutstanding (pending request count)
2. Send tryGetReply() to bestAlt
3. After secondMultiplier * queueDepth delay → hedge to nextAlt
4. On error → increment alt, try next replica
5. On all-failed → quorum(failureMonitor.onStateEqual(available), 1) → retry
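
Step 1 of that flow — scoring replicas by outstanding requests and keeping a hedge candidate — can be sketched as follows. This is a hypothetical `chooseReplicas`, not LoadBalance.actor.h's implementation; it just shows the best/next-best selection that hedging relies on.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical replica scorer: pick the replica with the fewest
// outstanding requests as the primary, and the next-best as the hedge
// target to try if the primary is slow. Assumes at least 2 replicas.
struct ReplicaChoice { size_t best; size_t hedge; };

ReplicaChoice chooseReplicas(const std::vector<double>& outstanding) {
    size_t best = 0, hedge = 0;
    double bestScore = std::numeric_limits<double>::max();
    double hedgeScore = std::numeric_limits<double>::max();
    for (size_t i = 0; i < outstanding.size(); ++i) {
        if (outstanding[i] < bestScore) {
            hedge = best;               // old best becomes the hedge
            hedgeScore = bestScore;
            best = i;
            bestScore = outstanding[i];
        } else if (outstanding[i] < hedgeScore) {
            hedge = i;
            hedgeScore = outstanding[i];
        }
    }
    return {best, hedge};
}
```

Hedging to `hedge` after a delay is only safe because the request is a version-pinned read — both replicas return the same answer.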

Where used:

| Caller | Callee | Request | File |
|---|---|---|---|
| Client | StorageServer | GetValueRequest | NativeAPI.actor.cpp:3717 |
| Client | StorageServer | GetKeyRequest | NativeAPI.actor.cpp |
| Client | StorageServer | GetKeyValuesRequest | NativeAPI.actor.cpp |
| Client | CommitProxy | GetDDMetricsRequest | NativeAPI.actor.cpp:8256 |

Idempotency: All load-balanced RPCs are reads — naturally idempotent via MVCC. The hedging means the same read may hit two replicas, which is fine.

When to use: Read-path RPCs with multiple replicas. Never for writes.


Category D: Streaming & Long-Lived


D1. getReplyStream() — Server-Streaming With Backpressure

ReplyPromiseStream<GetKeyValuesStreamReply> stream =
    locations[shard].locations->get(idx, &StorageServerInterface::getKeyValuesStream)
        .getReplyStream(request);

loop {
    GetKeyValuesStreamReply reply = waitNext(stream.getFuture());
    // process chunk...
}

Semantics: Client sends one request, server streams back multiple replies. Flow-controlled via acknowledgment tokens (bytesAcknowledged vs bytesSent). Server pauses when client falls behind (onReady() blocks).
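
The acknowledgment-token flow control can be sketched as a byte window. This hypothetical `StreamWindow` (the real protocol lives in ReplyPromiseStream) shows the bytesSent vs bytesAcknowledged comparison that pauses the server when the client falls behind:

```cpp
#include <cstdint>

// Hypothetical stream flow-control window: the server stops sending once
// unacknowledged bytes reach the window size, and resumes as the client
// acknowledges consumed chunks.
struct StreamWindow {
    int64_t bytesSent = 0;
    int64_t bytesAcknowledged = 0;
    int64_t windowBytes;

    explicit StreamWindow(int64_t window) : windowBytes(window) {}

    // Analogous to the server's onReady() check before sending a chunk.
    bool readyToSend() const { return bytesSent - bytesAcknowledged < windowBytes; }

    void onSend(int64_t n) { bytesSent += n; }
    void onAck(int64_t n) { bytesAcknowledged += n; }
};
```

A slow consumer simply stops acknowledging, which stalls `readyToSend()` and bounds server-side buffering without any explicit pause message.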

Where used:

| Caller | Callee | Request | Nature |
|---|---|---|---|
| Client | StorageServer | GetKeyValuesStreamRequest | Large range reads |
| Client | StorageServer | ChangeFeedStreamRequest | Long-lived mutation subscription |
| Client | StorageServer | FetchCheckpointKeyValuesRequest | Checkpoint streaming |
| TLog consumer | TLog | TLogPeekStreamRequest | Log tailing |

Idempotency: Each stream message has a sequence number. On reconnect, the stream must be re-established from scratch (no resume). The caller re-issues the request with an updated version/cursor.

When to use: Unbounded result sets, long-lived subscriptions, log tailing.


D2. waitFailure — The Anti-RPC (Death Watch)

// Server side: advertises a "come watch me die" endpoint
addActor.send(waitFailureServer(proxy.waitFailure.getFuture()));

// Client side: blocks until the server dies
wait(waitFailureClient(ssi.waitFailure, timeout));

Semantics: Not a request-reply at all. The server holds the ReplyPromise open and only sends when it's shutting down. The client treats a broken_promise (process died) the same as an explicit reply. Used to turn failure detection into a composable Future.

Where used:

| Watcher | Watched | Trigger On Failure |
|---|---|---|
| ClusterRecovery | ALL CommitProxies | commit_proxy_failed() → recovery |
| ClusterRecovery | ALL GrvProxies | grv_proxy_failed() → recovery |
| ClusterRecovery | ALL Resolvers | resolver_failed() → recovery |
| DataDistribution | StorageServers | Remove from team, re-replicate |
| Ratekeeper | StorageServers | Adjust rate limits |

Composition with quorum:

// Trigger recovery when ANY proxy dies
tagError<Void>(quorum(proxyWatchers, 1), commit_proxy_failed())

When to use: Failure detection for roles that need monitoring. Combined with quorum(watchers, 1) for "first death" triggers, or individually for per-server tracking.


2. Decision Matrix: Choosing the Right Method

                        ┌─ Read-only, version-pinned?
                        │   YES → loadBalance() [C3]
                        │
                        ├─ Need reply from ALL targets?
                        │   YES → broadcast + waitForAll [C1]
                        │
                        ├─ Need reply from QUORUM?
                        │   YES → broadcast + quorum(N) [C2]
                        │
                        ├─ Large/unbounded result?
                        │   YES → getReplyStream() [D1]
                        │
                        ├─ No reply needed?
                        │   YES → Is operation monotonic/convergent?
                        │           YES → send() [B1]
                        │           NO  → DON'T use fire-and-forget
                        │
Is it a single target? ─┤
                        ├─ Well-known endpoint that restarts?
                        │   YES → retryBrokenPromise() [A4]
                        │
                        ├─ Admin/operational, can fail?
                        │   YES → tryGetReply() [A2]
                        │
                        ├─ Need bounded wait time?
                        │   YES → getReplyUnlessFailedFor() [A3]
                        │
                        └─ Critical, must succeed?
                            YES → getReply() [A1]

3. The Five Idempotency Patterns (by RPC Category)

Each RPC method category implies a different idempotency strategy:

| Category | Transport Guarantee | Who Deduplicates | Mechanism |
|---|---|---|---|
| A1 getReply | At-least-once | Server | Version chain, request numbers, or natural idempotency |
| A2 tryGetReply | At-most-once | Caller (retry decision) | Caller retries if needed; server sees no dupes |
| A3 getReplyUnlessFailedFor | At-least-once | Server | Same as A1 |
| A4 retryBrokenPromise | At-least-once + actor retry | Server | Same as A1, but survives endpoint restarts |
| B1 send | At-most-once, lossy | Neither | Operation must be monotonic/convergent by design |
| C1 broadcast+waitAll | At-least-once per target | Each server | Version chain (Resolver, TLog) |
| C2 broadcast+quorum | At-least-once per target | Each server | Version uniqueness (TLog) |
| C3 loadBalance | At-most-once per attempt | Caller (hedging = safe for reads) | MVCC reads are naturally idempotent |
| D1 getReplyStream | Sequenced stream | Stream protocol | Sequence numbers in stream; restart on disconnect |
| D2 waitFailure | N/A (death signal) | N/A | Not a data RPC |

The Five Server-Side Patterns

When the server bears the dedup burden (A1, A3, A4, C1, C2), FDB uses five mechanisms:

  1. Version-is-the-key — TLogCommitRequest, ResolveTransactionBatchRequest

    • Global version = dedup key. prevVersion chain enforces ordering.
  2. Per-caller sequence numbers — GetCommitVersionRequest.requestNum

    • Bounded caller set. Server tracks lastProcessedRequestNum per caller.
  3. Explicit idempotency tokens — CommitTransactionRequest.idempotencyId

    • Unbounded callers. Token stored with result, queried on retry.
  4. Natural idempotency — All reads, all metadata queries

    • MVCC + version pinning. Same inputs = same outputs.
  5. Monotonic/convergent — TLogPopRequest, ReportRawCommittedVersionRequest

    • server_state = max(server_state, incoming). Dupes and reorder harmless.
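
Pattern 2 (per-caller sequence numbers) is the least obvious of the five; here is a minimal sketch as a hypothetical `SeqnumDedup` (not masterserver.actor.cpp's implementation). The server caches the last requestNum and reply per caller, so a retransmitted request replays the cached reply instead of re-executing.

```cpp
#include <cstdint>
#include <map>
#include <optional>

// Hypothetical per-caller seqnum dedup: safe because the caller set is
// bounded and each caller issues strictly increasing request numbers
// (starting at 1 in this sketch).
struct SeqnumDedup {
    struct CallerState { uint64_t lastNum = 0; int64_t lastReply = 0; };
    std::map<int, CallerState> callers;   // callerId -> last request seen
    int64_t nextVersion = 1000;           // stand-in for "execute the request"

    std::optional<int64_t> handle(int callerId, uint64_t requestNum) {
        auto& st = callers[callerId];
        if (requestNum == st.lastNum)
            return st.lastReply;          // duplicate: replay cached reply
        if (requestNum < st.lastNum)
            return std::nullopt;          // stale retransmit: reject
        st.lastNum = requestNum;
        st.lastReply = nextVersion++;     // execute exactly once
        return st.lastReply;
    }
};
```

The state is O(callers), not O(requests), which is why this pattern only works for a bounded, known caller set like the proxies talking to the Master.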

4. The Commit Path: The Categories in Action

The commit path is the best case study because it touches almost every category:

Client ──[C3 loadBalance]──→ GrvProxy          (GetReadVersion — read, load-balanced)
Client ──[A1 getReply]─────→ CommitProxy        (CommitTransactionRequest)
  CommitProxy ──[A1 getReply]────→ Master       (GetCommitVersionRequest — seqnum dedup)
  CommitProxy ──[C1 broadcast]───→ ALL Resolvers (ResolveTransactionBatchRequest — version chain)
  CommitProxy ──[C2 quorum]──────→ TLogs        (TLogCommitRequest — version uniqueness)
  CommitProxy ──[B1 send]───────→ Master        (ReportRawCommittedVersion — monotonic)
CommitProxy ──[A1 reply]──→ Client              (CommitID)

Background:
  ClusterRecovery ──[D2 waitFailure]──→ ALL Proxies    (death watch, quorum=1)
  StorageServer ──[D1 stream]────────→ TLog            (TLogPeekStream — log tailing)
  Client ──[B1 send]──→ CommitProxy                    (ExpireIdempotencyId — convergent)

5. Error Taxonomy Mapped to Categories

| Error | Produced By | Affected Categories | Correct Response |
|---|---|---|---|
| broken_promise | Actor died | A1, A3, C1, C2 | A4 retries; others re-resolve endpoint |
| request_maybe_delivered | Transport uncertainty | A1 | Check idempotency ID or propagate commit_unknown_result |
| commit_unknown_result | Proxy reply lost | A1 (commit path) | Query \xff\x02/idmp/ keyspace after flush |
| connection_failed | Network partition | D1 (streams) | Re-establish stream from last known position |
| process_behind | Server overloaded | C3 (loadBalance) | Load balancer auto-rotates to next replica |
| transaction_too_old | Version expired | C3 | Get new read version, retry entire transaction |
| not_committed | Conflict detected | A1 (commit reply) | Safe to retry entire transaction |
| permission_denied | PublicRequestStream::verify() | A1, A2 (public endpoints) | Refresh auth token |

6. Key Design Principles

P1: The method category determines the idempotency contract

Don't bolt idempotency onto an RPC after the fact. Choose the send method first — it dictates who bears the dedup burden.

P2: Version is the universal dedup key

FDB's single-sequencer (Master) version threads through every write-path RPC. If your system has a logical clock, use it as the dedup key everywhere.

P3: Reads are free to retry — make them version-pinned

All loadBalance() RPCs include a version. Same version = same result. Hedging is free.

P4: One-way messages must be monotonic by construction

Never fire-and-forget a non-monotonic mutation. If send() loses the message or delivers it twice, the system state must still be correct.

P5: Broadcast strategy follows data partitioning

  • Resolvers: waitForAll because each owns a key range partition
  • TLogs: quorum because they're replicas of the same data
  • Load-balanced reads: single target because any replica has the answer

P6: Failure detection is a composable primitive

waitFailure + quorum(watchers, 1) = "trigger on first death". This is cleaner than polling health endpoints.

P7: Hedge reads, never hedge writes

loadBalance() sends to a second replica after a delay. Safe for reads (MVCC). Never done for writes — that would violate exactly-once semantics.

P8: Bound your waits on cold paths

Recovery uses getReplyUnlessFailedFor() with TLOG_TIMEOUT so a dead worker doesn't hang recovery forever. Hot-path RPCs use getReply() because the failure monitor will trigger recovery instead.


7. Key Files Reference

| File | Contents |
|---|---|
| fdbrpc/include/fdbrpc/fdbrpc.h | ReplyPromise, RequestStream, all send methods |
| fdbrpc/include/fdbrpc/LoadBalance.actor.h | loadBalance() implementation with hedging |
| fdbrpc/include/fdbrpc/genericactors.actor.h | retryBrokenPromise, sendCanceler, waitValueOrSignal |
| fdbrpc/include/fdbrpc/FlowTransport.h | sendReliable() / sendUnreliable() |
| fdbrpc/include/fdbrpc/FailureMonitor.h | onFailedFor(), onDisconnectOrFailure() |
| fdbserver/CommitProxyServer.actor.cpp | Resolver broadcast, TxnState broadcast, commit path |
| fdbserver/TagPartitionedLogSystem.actor.cpp | TLog quorum push |
| fdbserver/ClusterRecovery.actor.cpp | Recovery recruitment, waitFailure quorum |
| fdbclient/NativeAPI.actor.cpp | loadBalance reads, client retry logic |
| fdbserver/masterserver.actor.cpp | requestNum dedup |
| fdbserver/storageserver.actor.cpp | Version-based exactly-once |
| fdbclient/IdempotencyId.actor.cpp | Explicit idempotency token storage |
| design/idempotency_ids.md | Design doc for idempotency IDs |
| design/Commit/How a commit is done in FDB.md | Commit path sequence diagrams |