@tbg
Created March 4, 2026 10:15
review-crdb skill example: PR #164792 (physical modeling in simulator)

Review: PR #164792 — mmaintegration: introduce physical capacity model

Summary

This PR introduces a physical capacity model for MMA that expresses store loads and capacities in physical resource units (CPU ns/s, disk bytes) and threads amplification factors through all range-load callsites. It is a well-structured, well-documented change with excellent commit messages. The core algebraic claim (load/capacity ratio preservation) is correct. There are a few issues worth addressing, the most important being a missing capacity floor that changes overload behavior.

The PR is split across three commits on top of a base PR (#164760):

  1. mmaprototype: plumb node CPU in MakeStoreLoadMsg — replaces derived ReportedCPU/CapacityCPU (incremental sums) with explicit physical NodeCPULoad/NodeCPUCapacity in StoreLoadMsg.
  2. mmaintegration: introduce physical capacity model — adds the physical_model.go file with computePhysicalCPU, computePhysicalDisk, computePhysicalWriteBandwidth, and the public MakePhysicalRangeLoad and ComputeAmplificationFactors entry points.
  3. kvserver: convert MMA store and range loads to physical units — wires the physical model into all callers: MakeStoreLoadMsg, mmaRangeLoad, tryConstructMMARangeMsg, NonMMAPreTransferLease, NonMMAPreChangeReplicas, and the simulator.

Findings

Correctness

minCapacity floor is far too low for CPU, a behavioral change from the old model (physical_model.go, line defining const minCapacity):

The old computeCPUCapacityWithCap uses cpuCapacityFloorPerStore = 0.1 * 1e9 (0.1 cores = 100,000,000 ns/s) as the minimum per-store CPU capacity. The new computePhysicalCPU uses minCapacity = 1.0 (1 ns/s) as a shared floor across all dimensions (CPU, disk, write bandwidth).

For disk and write bandwidth, a floor of 1 byte is reasonable. But for CPU, a floor of 1 ns/s is effectively no floor at all. Under the old model, when a node is severely overloaded, capacity bottoms out at 0.1 cores, giving a maximum utilization of ~20x (for 2 cores of direct load). Under the new model, with the amplified load (storeCPU * ampFactor) and a floor of 1 ns/s, the utilization can spike to billions-x. The old code has extensive commentary explaining why the 0.1-core floor was chosen and why the resulting 20x max utilization is acceptable.

This is a meaningful behavioral change in overload scenarios. It could cause all severely-overloaded stores to look identically infinite, losing the differentiation the old floor provides.
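The blow-up described above can be checked with toy numbers. This is an illustration, not the PR's code; the floor values come from the review text:

```go
package main

import "fmt"

const (
	oldCPUFloor = 0.1 * 1e9 // cpuCapacityFloorPerStore: 0.1 cores in ns/s
	newFloor    = 1.0       // shared minCapacity: 1 ns/s
)

// utilization clamps capacity to the given floor before dividing.
func utilization(load, capacity, floor float64) float64 {
	if capacity < floor {
		capacity = floor
	}
	return load / capacity
}

func main() {
	load := 2.0 * 1e9 // 2 cores of direct load on a severely overloaded store
	fmt.Println(utilization(load, 0, oldCPUFloor)) // 20: capped by the 0.1-core floor
	fmt.Println(utilization(load, 0, newFloor))    // 2e+09: effectively unbounded
}
```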

Suggested fix: use a per-dimension floor, preserving cpuCapacityFloorPerStore for the CPU dimension. For example:

```go
func computePhysicalCPU(desc roachpb.StoreDescriptor) (res physicalDimension) {
    defer func() { res.capacity = max(cpuCapacityFloorPerStore, res.capacity) }()
    // ...
}
```

Or define separate constants per dimension and apply them in setDimension.
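A minimal sketch of that alternative, with hypothetical dimension names and a simplified setDimension signature; only the 0.1-core value comes from the existing code:

```go
package main

import (
	"fmt"
	"math"
)

type dimension int

const (
	dimCPU dimension = iota
	dimByteSize
	dimWriteBandwidth
)

// capacityFloor returns a per-dimension minimum capacity.
func capacityFloor(d dimension) float64 {
	switch d {
	case dimCPU:
		return 0.1 * 1e9 // preserve the 0.1-core CPU floor (ns/s)
	default:
		return 1.0 // a 1-byte floor is fine for disk and write bandwidth
	}
}

// setDimension clamps capacity to the dimension's floor before storing it.
func setDimension(d dimension, load, capacity float64) (l, c float64) {
	return load, math.Max(capacityFloor(d), capacity)
}

func main() {
	_, c := setDimension(dimCPU, 0, 0)
	fmt.Println(c) // the CPU dimension bottoms out at 1e+08 ns/s
}
```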

Severity: blocking


ByteSize dimension semantics changed from logical to physical (physical_model.go, computePhysicalDisk):

The old model used load=LogicalBytes and capacity=LogicalBytes/fractionUsed, keeping everything in logical (uncompressed) byte space. The new model uses load=Used (physical bytes) and capacity=Used+Available (physical bytes).

This changes the semantics of the ByteSize dimension. The amplification factor (Used/LogicalBytes, capped at 5.0) is applied at the range-load boundary via MakePhysicalRangeLoad to convert per-range LogicalBytes to physical units, so the ratio should still work out (range physical load / store physical capacity). However, the old computeStoreByteSizeCapacity had special handling for empty/new stores (falling back to availableBytes when logicalBytes == 0 or fractionUsed < 0.01). The new model doesn't have this — it simply uses Used + Available, which could be zero if the store reports 0 for both. In that case, minCapacity = 1.0 kicks in as a 1-byte floor.

This is probably fine in practice (stores with 0 used and 0 available are rare edge cases), and the old code's comment about the empty-store case being "desirably pessimistic" was about using compressed Available as logical capacity, which is no longer relevant. But it's worth confirming the behavioral change is intended — especially since the old logical-space model treated compression ratios differently (they were folded into the capacity rather than applied as a separate amplification factor).
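The ratio-preservation claim for this dimension can be checked with toy numbers. Values are illustrative, and the sketch assumes the amplification factor (Used/LogicalBytes) is below its 5.0 cap:

```go
package main

import "fmt"

// oldByteSizeRatio works in logical byte space: load is LogicalBytes,
// capacity is LogicalBytes/fractionUsed.
func oldByteSizeRatio(logical, used, available float64) float64 {
	fractionUsed := used / (used + available)
	return logical / (logical / fractionUsed)
}

// newByteSizeRatio works in physical byte space: per-range logical bytes
// are amplified by Used/LogicalBytes, capacity is Used+Available.
func newByteSizeRatio(logical, used, available float64) float64 {
	amp := used / logical
	return (logical * amp) / (used + available)
}

func main() {
	// 100 units of logical data compressing to 40 physical, 60 physical free.
	fmt.Println(oldByteSizeRatio(100, 40, 60)) // 0.4
	fmt.Println(newByteSizeRatio(100, 40, 60)) // 0.4
}
```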

Severity: suggestion — confirm this is intentional and document the semantic change more explicitly in the commit message or code comments.


computePhysicalCPU load uses CPUPerSecond, not storesCPU/numStores: asymmetry with capacity (physical_model.go, computePhysicalCPU):

The code uses storeCPU = desc.Capacity.CPUPerSecond for load (per-store direct CPU) but splits capacity evenly ((nodeCap - bg) / numStores). The comment explains this asymmetry well — it preserves per-store differentiation. As a consistency check: when CPUPerSecond > 0, load = CPUPerSecond * ampFactor, and the total across all stores is sum(CPUPerSecond_i) * ampFactor, which should equal storesCPU * ampFactor = mmaLoad. This holds by definition since StoresCPURate = sum(CPUPerSecond_i), so the totals are consistent. Good.

But the fallback storeCPU = storesCPU / numStores when CPUPerSecond <= 0 could produce surprising results if only some stores have CPUPerSecond = 0. In that case, those stores would get the average load rather than zero. This seems like an unlikely scenario and the fallback is documented, but worth noting.
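A toy illustration of that fallback (the helper and values are made up, not the PR's code):

```go
package main

import "fmt"

// storeCPULoad mirrors the fallback noted above: a store reporting
// CPUPerSecond == 0 receives the node average rather than zero.
func storeCPULoad(cpuPerSecond, storesCPU float64, numStores int) float64 {
	if cpuPerSecond > 0 {
		return cpuPerSecond
	}
	return storesCPU / float64(numStores)
}

func main() {
	storesCPU := 3e9 // node-wide sum across three stores
	fmt.Println(storeCPULoad(2e9, storesCPU, 3)) // 2e+09: uses its own reading
	fmt.Println(storeCPULoad(0, storesCPU, 3))   // 1e+09: gets the node average
}
```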

Severity: nit


Comments

processStoreLoadMsg comment and adjustedCPU update logic (cluster_state.go, around lines 1381-1399 in the diff):

The new comment and code for updating adjustedCPU is well-written and clearly explains the multi-store pending delta preservation. The formula ns.adjustedCPU += (storeMsg.NodeCPULoad - ns.NodeCPULoad) - storePendingDelta correctly accounts for:

  1. The baseline shift from the new NodeCPULoad value
  2. Stripping this store's pending delta (which will be re-applied below)

This is a nice improvement over the old incremental sum approach.
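The arithmetic can be sanity-checked with toy numbers (all values illustrative):

```go
package main

import "fmt"

// updateAdjustedCPU applies the quoted formula: shift by the new baseline,
// strip this store's pending delta, then re-apply it (as the real code does
// further down).
func updateAdjustedCPU(adjusted, oldNodeLoad, newNodeLoad, pendingDelta float64) float64 {
	adjusted += (newNodeLoad - oldNodeLoad) - pendingDelta
	adjusted += pendingDelta // re-applied below, per the code
	return adjusted
}

func main() {
	// Previously: node load 8 plus a pending delta of 2 → adjustedCPU 10.
	// A fresh StoreLoadMsg reports node load 9.
	fmt.Println(updateAdjustedCPU(10, 8, 9, 2)) // 11: new baseline 9 + preserved delta 2
}
```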


Duplicate function-level comment on mmaRangeLoad (mma_replica_store.go, before the standalone mmaRangeLoad function):

The comment block starting "The returned RangeLoad contains stats..." is duplicated between the standalone mmaRangeLoad function and the method (mr *mmaReplica) mmaRangeLoad. In the current code on master this is already the case, so this is pre-existing. But now that the standalone function just delegates to MakePhysicalRangeLoad, the comment on it should be updated to reflect that it's a thin wrapper applying amplification factors.

Severity: nit


Redactability

The new Amp and AmpVector types correctly implement SafeFormatter and derive String() via redact.StringWithoutMarkers. Since amplification factors are purely numeric (no PII), this is appropriate. No issues here.


Go conventions

The new types and functions follow CockroachDB conventions. The Amp type as a named float64 is clean. IdentityAmpVector() is a nice constructor. No issues.


Commit structure

The three-commit structure is well-organized:

  1. Groundwork commit (plumb node CPU): adds the physical node-level CPU fields (NodeCPULoad, NodeCPUCapacity) to StoreLoadMsg and NodeLoad, rewrites processStoreLoadMsg to use them, and resolves the old TODO about why only CPU is tracked at the node level. This is a clean, self-contained change.

  2. Core model commit (introduce physical capacity model): adds the physical model functions and types. Self-contained and well-tested via the datadriven test.

  3. Wiring commit (convert MMA store and range loads to physical units): threads the physical model through all callers. Mechanical in nature.

This separation is sensible. The first commit establishes the new data path, the second introduces the model, and the third wires it up.

One structural observation: commit 2 and commit 3 could arguably be one commit, since commit 2 introduces code that isn't called by anything until commit 3 lands. But keeping them separate is fine for reviewability — the model logic in commit 2 is complex enough to warrant isolated review.


PR description

The PR description is thorough and well-structured. It clearly explains the algebraic equivalence of the old and new models, the motivation (unit mismatch in adjustedCPU, uniform LoadVector across dimensions), and the trade-off (estimation error in both load and capacity). The per-commit summaries are helpful for multi-commit review.

The description mentions the PR is rebased on #164760. That base PR is still open, which is fine for review but should be noted for merge ordering.

Aspects skipped

  • Compatibility: not applicable. This change is internal to MMA, which is not yet the default allocator. No RPCs, persisted formats, or inter-version state are affected.
  • Red/green testing: the change preserves fractionUsed ratios by design, so the existing tests continue to pass with the same expectations. The new TestComputePhysicalStore datadriven test covers the physical model directly. The testdata updates to add node-cpu-load/node-cpu-capacity to every store-load-msg are mechanical and consistent.