Deep analysis of FoundationDB's internal RPC architecture — not the external client API, but the inter-role communication primitives that make FDB's distributed transaction engine work. The goal is to extract design principles around idempotency and retry safety that can inform RPC design in other distributed systems.
FDB's RPC system is built on Flow (a custom actor-based concurrency framework) and uses a request/reply pattern with typed endpoints:
| Primitive | Role | Key File |
|---|---|---|
ReplyPromise<T> |
Single-shot reply channel | fdbrpc/include/fdbrpc/fdbrpc.h |
ReplyPromiseStream<T> |
Multi-shot streaming reply with backpressure | fdbrpc/include/fdbrpc/fdbrpc.h |
RequestStream<T> / PublicRequestStream<T> |
Typed endpoint for receiving requests | fdbrpc/include/fdbrpc/fdbrpc.h |
FlowTransport |
Network transport, endpoint registry | fdbrpc/include/fdbrpc/FlowTransport.h |
IFailureMonitor |
Failure detection oracle | fdbrpc/include/fdbrpc/FailureMonitor.h |
The framework provides two send modes that fundamentally determine retry behavior:
Client → sendReliable() → retries until cancelled or permanent failure
- Packet queued in
ReliablePacketListon thePeer - Retransmitted on reconnection
- Cancelled via
sendCanceler()when failure monitor fires - Implication: Server MUST handle duplicate deliveries or the request must be idempotent
Client → sendUnreliable() → single attempt, races against disconnect signal
- Single packet send via
sendUnreliable() - Races reply against
onDisconnectOrFailure()fromIFailureMonitor - On failure: returns
ErrorOr<T>with error, caller decides retry - Implication: No server-side dedup needed, but caller must handle missing replies
Client → sendReliable() → but give up after sustained failure duration
- Combines reliable send with
onFailedFor(endpoint, duration, slope) - Used for RPCs where waiting forever is worse than failing
// genericactors.actor.h:40-57
loop {
try {
reply = wait(to.getReply(request));
return reply;
} catch (Error& e) {
if (e.code() != error_code_broken_promise) throw;
resetReply(request);
wait(delayJittered(FLOW_KNOBS->PREVENT_FAST_SPIN_DELAY));
}
}- Only for well-known endpoints that may restart (e.g., Master, ClusterController)
- NOT for ordinary endpoints — a broken_promise there means the actor is gone
- Automatically resets the ReplyPromise for re-send
| Type | Authorization | Use Case |
|---|---|---|
PublicRequestStream<T> |
Calls T::verify() — checks tenant authorization |
Client-facing (reads, commits) |
RequestStream<T> (private) |
No verification | Internal role-to-role communication |
Failed verification → permission_denied() error + audit log.
| Request | Reply | Delivery | Idempotency Mechanism |
|---|---|---|---|
CommitTransactionRequest |
CommitID |
Reliable | IdempotencyId (optional, 16-255 bytes) |
GetReadVersionRequest |
GetReadVersionReply |
Reliable | Implicit (read-only) |
GetKeyServerLocationsRequest |
GetKeyServerLocationsReply |
Reliable | Naturally idempotent (read) |
GetStorageServerRejoinInfoRequest |
Reply |
Reliable | Naturally idempotent |
TxnStateRequest |
Void |
One-way | Version-ordered |
GetHealthMetricsRequest |
Reply |
Unreliable | Naturally idempotent |
ProxySnapRequest |
Void |
One-way | — |
ExclusionSafetyCheckRequest |
Reply |
Reliable | Naturally idempotent |
GetDDMetricsRequest |
Reply |
Reliable | Naturally idempotent |
ExpireIdempotencyIdRequest |
Void |
One-way | Convergent (count-based) |
GetTenantIdRequest |
Reply |
Reliable | Naturally idempotent |
GetBlobGranuleLocationsRequest |
Reply |
Reliable | Naturally idempotent |
SetThrottledShardRequest |
Reply |
Reliable | Last-writer-wins |
| Request | Reply | Delivery | Idempotency |
|---|---|---|---|
GetReadVersionRequest |
GetReadVersionReply |
Reliable | Naturally idempotent |
GetHealthMetricsRequest |
Reply |
Unreliable | Naturally idempotent |
GlobalConfigRefreshRequest |
Reply |
Reliable | Naturally idempotent |
waitFailure |
Void |
Special (failure detection) | N/A |
| Request | Reply | Delivery | Idempotency |
|---|---|---|---|
GetCommitVersionRequest |
GetCommitVersionReply |
Reliable | requestNum monotonic sequence + dedup |
GetRawCommittedVersionRequest |
Reply |
Reliable | Naturally idempotent (read) |
ReportRawCommittedVersionRequest |
Void |
One-way | Monotonic (max wins) |
UpdateRecoveryDataRequest |
Void |
One-way | Version-ordered |
waitFailure |
Void |
Failure detection | N/A |
| Request | Reply | Delivery | Idempotency |
|---|---|---|---|
ResolveTransactionBatchRequest |
Reply |
Reliable | prevVersion wait + version sequencing |
ResolutionMetricsRequest |
Reply |
Reliable | Naturally idempotent |
ResolutionSplitRequest |
Reply |
Reliable | Naturally idempotent |
txnState |
TxnStateRequest |
One-way | Version-ordered |
waitFailure |
Void |
Failure detection | N/A |
| Request | Reply | Delivery | Idempotency |
|---|---|---|---|
TLogCommitRequest |
Reply |
Reliable | Version uniqueness (one version = one commit) |
TLogPeekRequest |
Reply |
Reliable | Optional sequence: (UID, int) dedup |
TLogPeekStreamRequest |
Stream | Streaming | Sequence-based |
TLogPopRequest |
Void |
One-way | Monotonic (only advances, never regresses) |
TLogLockRequest |
LockResult |
Reliable | Recovery protocol ensures single lock |
TLogQueuingMetricsRequest |
Reply |
Reliable | Naturally idempotent |
TLogConfirmRunningRequest |
Void |
One-way | Heartbeat |
TLogRecoveryFinishedRequest |
Void |
One-way | Convergent |
TLogDisablePopRequest |
Void |
One-way | Toggle state |
TLogEnablePopRequest |
Void |
One-way | Toggle state |
TLogSnapRequest |
Void |
One-way | — |
Read operations (all naturally idempotent via MVCC + version pinning):
GetValueRequest,GetKeyRequest,GetKeyValuesRequestGetMappedKeyValuesRequest,GetKeyValuesStreamRequestWatchValueRequest
Metadata/status (all naturally idempotent):
GetShardStateRequest,WaitMetricsRequest,SplitMetricsRequestGetStorageMetricsRequest,StorageQueuingMetricsRequestGetKeyValueStoreTypeRequest,ReadHotSubRangeRequest,SplitRangeRequest
Change feeds: ChangeFeedStreamRequest, OverlappingChangeFeedsRequest, ChangeFeedPopRequest (monotonic), ChangeFeedVersionUpdateRequest
Checkpoints: GetCheckpointRequest, FetchCheckpointRequest, FetchCheckpointKeyValuesRequest
Admin: AuditStorageRequest, GetHotShardsRequest, GetStorageCheckSumRequest, BulkDumpRequest
FDB uses exactly five distinct patterns for achieving idempotency across internal RPCs:
Where: TLogCommitRequest, ResolveTransactionBatchRequest
The globally unique, monotonically increasing commit version assigned by the Master acts as a natural deduplication key. The entire cluster has a single version sequencer (Master), so:
- Each version is assigned exactly once
- TLog and Resolver wait on
prevVersionbefore processingversion - Duplicate delivery of the same version is either ignored or produces the same result
- The
prevVersion → versionchain forms an implicit total order
CommitProxy → Master: GetCommitVersion(requestNum=42)
Master → CommitProxy: {version=1000, prevVersion=999}
CommitProxy → Resolver: Resolve(version=1000, prevVersion=999) // Resolver waits for 999 first
CommitProxy → TLog: Commit(version=1000) // TLog waits for 999 first
Design insight: When your system has a single sequencer, the sequence number itself is the idempotency key. No additional dedup state needed.
Where: GetCommitVersionRequest.requestNum
Each CommitProxy maintains a per-proxy monotonic counter. The Master tracks
mostRecentProcessedRequestNum per proxy:
// Master side (masterserver.actor.cpp)
if (requestNum <= mostRecentProcessedRequestNum) {
// Already processed — return cached result
return cachedCommitVersion;
}
if (requestNum > currentRequestNum + 1) {
// Future request — wait (Never)
return Never;
}
// Process and cacheDesign insight: When the caller is a fixed set of known processes each issuing sequential requests, a per-caller sequence number with server-side last-seen tracking is the simplest dedup.
Where: CommitTransactionRequest.idempotencyId
Client-generated 16-255 byte random token, stored in the \xff\x02/idmp/ keyspace with the
commit version. On retry after commit_unknown_result:
1. Client generates IdempotencyId (16 random bytes)
2. Client → CommitProxy: Commit(mutations, idempotencyId)
3. Network failure — client doesn't know if committed
4. Client → CommitProxy: commitDummyTransaction() to flush pipeline
5. Client → StorageServer: scan \xff\x02/idmp/[readVersion, readVersion+MAX_LIFE] for id
6. Found → return original versionstamp
Not found → throw transaction_too_old (safe to retry from scratch)
Storage format batches IDs to reduce write amplification:
Key: \xff\x02/idmp/${commitVersion_BE_8bytes}${highBatchIndex_1byte}
Value: ${protocolVersion}${timestamp_i64}(${idLen_1byte}${id}${lowBatchIndex_1byte})*
Design insight: When the caller is untrusted/unbounded and the operation has side effects, caller-generated opaque tokens stored alongside the result enable post-hoc dedup queries.
Where: All read RPCs, all metadata/metrics RPCs
MVCC + version-pinned reads are inherently idempotent:
GetValue(key="foo", version=1000) // Always returns same result for same version
No dedup needed. Retries are always safe. This covers the majority of RPCs by count.
Design insight: Design read paths to be version-pinned. If every read includes the version it's reading at, retries are free.
Where: TLogPopRequest, ChangeFeedPopRequest, ReportRawCommittedVersionRequest
These are fire-and-forget messages where the operation is monotonic:
TLogPop(version=500) // "I've consumed everything up to 500"
TLogPop(version=500) // Duplicate — no-op, already at 500
TLogPop(version=400) // Out-of-order — ignored, already past 400
Design insight: If the operation is "advance a watermark," make it take the max. Duplicates and reordering become harmless.
The commit path shows how these patterns compose across roles:
Client CommitProxy Master Resolver TLog
| | | | |
|--- CommitTxnReq ----->| | | |
| (idempotencyId) | | | |
| |--- GetCommitVer ->| | |
| | (requestNum=N) | | |
| | |-- dedup by N--| |
| |<-- version=V -----| | |
| | | | |
| |--- ResolveBatch ----------------->| |
| | (version=V, prevVersion=V-1) | |
| | | |-- wait V-1 --|
| |<-- committed/conflict ------------| |
| | | | |
| |--- TLogCommit ----------------------------------->|
| | (version=V) | |
| | | | wait V-1 |
| |<-- persisted ----------------------------------------|
| | | | |
|<-- CommitID(V) -------| | | |
| (+ store idmpId) | | | |
Each hop uses a different pattern:
- Client → CommitProxy: IdempotencyId (Pattern 3)
- CommitProxy → Master: requestNum (Pattern 2)
- CommitProxy → Resolver: version chain (Pattern 1)
- CommitProxy → TLog: version chain (Pattern 1)
- CommitProxy → Client: reply (if lost, client queries via Pattern 3)
| Error | Meaning | Retry Safe? | Action |
|---|---|---|---|
transaction_too_old |
Read version expired | Yes | Get new read version, retry |
not_committed |
Conflict detected | Yes | Full retry with new read version |
commit_unknown_result |
Network failure during commit | Only with IdempotencyId | Query idmp keyspace or accept ambiguity |
request_maybe_delivered |
Send uncertainty | Same as above | Same as above |
broken_promise |
Remote actor died | Depends on endpoint type | Well-known: retry. Others: re-resolve |
connection_failed |
Network partition | Depends on operation | Re-resolve endpoint, retry |
process_behind |
Server overloaded | Yes | Backoff and retry |
// Simplified from tryCommit()
try {
CommitID reply = wait(commitProxy.commit.getReply(req));
return reply;
} catch (Error& e) {
if (e.code() == error_code_commit_unknown_result) {
if (req.idempotencyId.valid()) {
// 1. Flush pipeline with dummy conflicting transaction
wait(commitDummyTransaction(...));
// 2. Search idempotency keyspace for our ID
Optional<CommitResult> r = wait(determineCommitStatus(
trState, readVersion, readVersion + MAX_WRITE_TRANSACTION_LIFE_VERSIONS, id));
if (r.present()) return r.get(); // Was committed!
throw transaction_too_old(); // Was NOT committed, safe to retry
}
throw commit_unknown_result(); // No idempotency ID, propagate ambiguity
}
}Before querying for idempotency ID status, FDB commits a dummy transaction that conflicts with the original. This ensures the original commit is either fully committed or fully aborted — it cannot be "in flight" anymore. This is necessary because the idempotency ID is written atomically with the transaction.
Storage servers apply mutations from TLogs via version ordering:
StorageServer {
VersionedData versionedData; // [storageVersion, version.get()]
Version durableVersion; // Everything below is on disk
map<Version, VerUpdateRef> mutationLog; // (durableVersion, version]
}
Key invariant (from storageserver.actor.cpp): "All items in versionedData.atLatest() have
insertVersion() > durableVersion, but views at older versions may contain older items which are
also in storage (this is OK because of idempotency)"
This means:
- Mutations at version V are applied exactly once into the versioned data structure
- Recovery replays mutations from TLogs starting at
durableVersion - Replayed mutations at already-applied versions are naturally deduplicated by the versioned structure
- No explicit dedup map needed — the version IS the dedup key
Is the operation naturally idempotent? ──Yes──→ tryGetReply() (at-most-once)
│ No
↓
Does the server have a natural dedup key? ──Yes──→ getReply() (at-least-once)
│ No with version/seqnum dedup
↓
Is the caller bounded and sequential? ──Yes──→ Per-caller sequence numbers
│ No
↓
Use caller-generated idempotency tokens ──→ Store with result, query on retry
FDB's single-sequencer (Master) version is the backbone of ALL idempotency:
- It orders TLog commits
- It orders Resolver conflict checks
- It pins reads for MVCC
- It bounds idempotency ID searches
Takeaway: If your system has a logical clock or sequence oracle, thread it through every RPC. It eliminates entire classes of dedup problems.
For "advance watermark" operations (pop, report version, update metrics):
- Accept the max of (current, incoming)
- Duplicates and reordering are harmless
- No state to track, no cleanup needed
FDB batches up to 256 idempotency IDs into a single KV pair per commit version range. Without batching: 1M TPS = 1M extra keys/second. With batching: 1M TPS = ~4K extra keys/second (256 IDs per key).
FDB's idempotency doesn't prevent duplicate execution at the proxy level. Instead:
- Commit happens (or doesn't)
- ID is stored alongside the result
- On ambiguity, query the stored ID
This is cheaper than preventing duplicates because the common case (no failure) pays near-zero cost. Only the rare failure path pays for the query.
Before checking "did my operation succeed?", flush the pipeline:
commitDummyTransaction() → ensures original is fully resolved
then scan idempotency keyspace → definitive answer
Without the flush, you might get a false negative (operation still in flight).
FDB splits endpoints into PublicRequestStream (client-facing, calls verify()) and
RequestStream (internal, no check). This avoids authorization overhead on the hot internal
paths while protecting the external surface.
getReplyUnlessFailedFor(duration, slope) prevents indefinite retry against a dead node.
The IFailureMonitor provides onFailedFor() which uses sustained failure duration rather
than instantaneous checks — avoids flapping.
| File | Contents |
|---|---|
fdbrpc/include/fdbrpc/fdbrpc.h |
ReplyPromise, RequestStream, delivery modes |
fdbrpc/include/fdbrpc/FlowTransport.h |
Transport, endpoints, reliable/unreliable send |
fdbrpc/include/fdbrpc/FailureMonitor.h |
Failure detection interface |
fdbrpc/include/fdbrpc/genericactors.actor.h |
retryBrokenPromise, sendCanceler, waitValueOrSignal |
fdbclient/include/fdbclient/CommitProxyInterface.h |
CommitProxy RPCs |
fdbclient/include/fdbclient/StorageServerInterface.h |
StorageServer RPCs (26+) |
fdbserver/include/fdbserver/MasterInterface.h |
Master RPCs (requestNum dedup) |
fdbserver/include/fdbserver/ResolverInterface.h |
Resolver RPCs (version chain) |
fdbserver/include/fdbserver/TLogInterface.h |
TLog RPCs (version-based) |
fdbclient/IdempotencyId.actor.cpp |
IdempotencyId storage & query |
fdbserver/CommitProxyServer.actor.cpp |
Commit path implementation |
fdbclient/NativeAPI.actor.cpp |
Client retry logic, determineCommitStatus |
design/idempotency_ids.md |
Design doc for idempotency IDs |
design/Commit/How a commit is done in FDB.md |
Commit path walkthrough |
This is a read-only analysis. To verify findings:
- Trace the commit path:
grep -r "getReply\|tryGetReply\|retryBrokenPromise" fdbserver/CommitProxyServer.actor.cpp - Check Master dedup: read
fdbserver/masterserver.actor.cpp, search forrequestNum - Check TLog version ordering: read
fdbserver/TLogServer.actor.cpp, search forprevVersion - Check idempotency ID storage: read
fdbclient/IdempotencyId.actor.cpp