This PR introduces a physical capacity model for MMA that expresses store loads and capacities in physical resource units (CPU ns/s, disk bytes) and threads amplification factors through all range-load callsites. It is a well-structured, well-documented change with excellent commit messages. The core algebraic claim (load/capacity ratio preservation) is correct. There are a few issues worth addressing, the most important being a missing capacity floor that changes overload behavior.
The PR is split across three commits on top of a base PR (#164760):
- mmaprototype: plumb node CPU in MakeStoreLoadMsg. Replaces derivedReportedCPU/CapacityCPU (incremental sums) with explicit physical NodeCPULoad/NodeCPUCapacity in StoreLoadMsg.
- mmaintegration: introduce physical capacity model. Adds the physical_model.go file with computePhysicalCPU, computePhysicalDisk, computePhysicalWriteBandwidth, and the public MakePhysicalRangeLoad and ComputeAmplificationFactors entry points.
- kvserver: convert MMA store and range loads to physical units. Wires the physical model into all callers: MakeStoreLoadMsg, mmaRangeLoad, tryConstructMMARangeMsg, NonMMAPreTransferLease, NonMMAPreChangeReplicas, and the simulator.
minCapacity floor is far too low for CPU — behavioral change from the old model
physical_model.go, line defining const minCapacity:
The old computeCPUCapacityWithCap uses cpuCapacityFloorPerStore = 0.1 * 1e9
(0.1 cores = 100,000,000 ns/s) as the minimum per-store CPU capacity. The new
computePhysicalCPU uses minCapacity = 1.0 (1 ns/s) as a shared floor across
all dimensions (CPU, disk, write bandwidth).
For disk and write bandwidth, a floor of 1 byte is reasonable. But for CPU, a
floor of 1 ns/s is effectively no floor at all. Under the old model, when a
node is severely overloaded, capacity bottoms out at 0.1 cores, giving a
maximum utilization of ~20x (for 2 cores of direct load). Under the new model,
with the amplified load (storeCPU * ampFactor) and a floor of 1 ns/s, the
utilization can spike to billions-x. The old code has extensive commentary
explaining why the 0.1-core floor was chosen and why the resulting 20x max
utilization is acceptable.
This is a meaningful behavioral change in overload scenarios. It could cause all severely-overloaded stores to look identically infinite, losing the differentiation the old floor provides.
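To make the magnitude concrete, a back-of-envelope sketch (the numbers are illustrative, not taken from the PR):

```go
package main

import "fmt"

func main() {
	const (
		oldFloor = 0.1 * 1e9 // cpuCapacityFloorPerStore: 0.1 cores in ns/s
		newFloor = 1.0       // shared minCapacity: 1 ns/s
	)
	// Hypothetical severely-overloaded store: 2 cores of direct load,
	// amplified 3x, with reported capacity driven down to the floor.
	load := 2.0 * 1e9 * 3.0
	fmt.Printf("old-floor utilization: %.0fx\n", load/oldFloor) // 60x
	fmt.Printf("new-floor utilization: %.0fx\n", load/newFloor) // 6000000000x
}
```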
Suggested fix: use a per-dimension floor, preserving cpuCapacityFloorPerStore
for the CPU dimension. For example:
```go
func computePhysicalCPU(desc roachpb.StoreDescriptor) (res physicalDimension) {
	defer func() { res.capacity = max(cpuCapacityFloorPerStore, res.capacity) }()
	// ...
}
```

Or define separate constants per dimension and apply them in setDimension.
Severity: blocking
ByteSize dimension semantics changed from logical to physical
physical_model.go, computePhysicalDisk:
The old model used load=LogicalBytes and capacity=LogicalBytes/fractionUsed,
keeping everything in logical (uncompressed) byte space. The new model uses
load=Used (physical bytes) and capacity=Used+Available (physical bytes).
This changes the semantics of the ByteSize dimension. The amplification factor
(Used/LogicalBytes, capped at 5.0) is applied at the range-load boundary via
MakePhysicalRangeLoad to convert per-range LogicalBytes to physical units,
so the ratio should still work out (range physical load / store physical
capacity). However, the old computeStoreByteSizeCapacity had special handling
for empty/new stores (falling back to availableBytes when logicalBytes == 0
or fractionUsed < 0.01). The new model doesn't have this — it simply uses
Used + Available, which could be zero if the store reports 0 for both. In that
case, minCapacity = 1.0 kicks in as a 1-byte floor.
This is probably fine in practice (stores with 0 used and 0 available are rare
edge cases), and the old code's comment about the empty-store case being
"desirably pessimistic" was about using compressed Available as logical
capacity, which is no longer relevant. But it's worth confirming the behavioral
change is intended — especially since the old logical-space model treated
compression ratios differently (they were folded into the capacity rather than
applied as a separate amplification factor).
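For reference, the ratio preservation claimed above can be checked numerically (illustrative numbers; amp stands for the Used/LogicalBytes amplification factor described in this section):

```go
package main

import "fmt"

func main() {
	// Illustrative store: 100 units of logical bytes compressed to 40 units
	// physical (Used), 60 units Available, and a range holding 10 logical units.
	const (
		storeLogical = 100.0
		used         = 40.0
		available    = 60.0
		rangeLogical = 10.0
	)
	// Old model: logical load over logical capacity (LogicalBytes / fractionUsed).
	fractionUsed := used / (used + available)
	oldRatio := rangeLogical / (storeLogical / fractionUsed)

	// New model: the amplification factor converts the range's logical bytes
	// to physical bytes; capacity is physical (Used + Available).
	amp := used / storeLogical // capped at 5.0 in the real code
	newRatio := (rangeLogical * amp) / (used + available)

	fmt.Printf("old=%.4f new=%.4f\n", oldRatio, newRatio)
}
```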
Severity: suggestion — confirm this is intentional and document the semantic change more explicitly in the commit message or code comments.
computePhysicalCPU load uses CPUPerSecond not storesCPU/numStores — asymmetry with capacity
physical_model.go, computePhysicalCPU:
The code uses storeCPU = desc.Capacity.CPUPerSecond for load (per-store direct
CPU) but splits capacity evenly ((nodeCap - bg) / numStores). The comment
explains this asymmetry well — it preserves per-store differentiation. Note that
when CPUPerSecond > 0, load = CPUPerSecond * ampFactor, and the total across all
stores is sum(CPUPerSecond_i) * ampFactor = storesCPU * ampFactor = mmaLoad.
Since StoresCPURate = sum(CPUPerSecond_i) by definition, the totals are
consistent. Good.
But the fallback storeCPU = storesCPU / numStores when CPUPerSecond <= 0
could produce surprising results if only some stores have CPUPerSecond = 0. In
that case, those stores would get the average load rather than zero. This seems
like an unlikely scenario and the fallback is documented, but worth noting.
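The averaging fallback can be seen in a tiny sketch (hypothetical values; the variable names mirror the review's description, not the actual code):

```go
package main

import "fmt"

func main() {
	const numStores = 3
	perStore := []float64{4e9, 2e9, 0} // CPUPerSecond per store, ns/s
	var storesCPU float64
	for _, c := range perStore {
		storesCPU += c
	}
	for i, c := range perStore {
		load := c
		if load <= 0 {
			// Fallback: the store gets the node average, not zero, so a
			// store reporting 0 here appears to carry 2e9 ns/s of load.
			load = storesCPU / numStores
		}
		fmt.Printf("store %d load: %.0f ns/s\n", i, load)
	}
}
```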
Severity: nit
processStoreLoadMsg comment and adjustedCPU update logic
cluster_state.go, around lines 1381-1399 (in the diff):
The new comment and code for updating adjustedCPU is well-written and clearly
explains the multi-store pending delta preservation. The formula
ns.adjustedCPU += (storeMsg.NodeCPULoad - ns.NodeCPULoad) - storePendingDelta
correctly accounts for:
- the baseline shift from the new NodeCPULoad value
- stripping this store's pending delta (which will be re-applied below)
This is a nice improvement over the old incremental sum approach.
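A minimal standalone sketch of that bookkeeping, with hypothetical numbers and the formula's fields flattened into locals:

```go
package main

import "fmt"

func main() {
	oldNodeCPULoad := 10.0e9   // ns.NodeCPULoad before the message
	newNodeCPULoad := 12.0e9   // storeMsg.NodeCPULoad
	storePendingDelta := 0.3e9 // this store's pending delta, re-applied later
	// adjustedCPU currently carries the old baseline plus pending deltas from
	// this store (+0.3e9) and a sibling store on the same node (+0.2e9).
	adjustedCPU := 10.0e9 + 0.3e9 + 0.2e9

	// The review's formula: shift the baseline, strip only this store's delta.
	adjustedCPU += (newNodeCPULoad - oldNodeCPULoad) - storePendingDelta

	// The sibling's +0.2e9 delta survives on top of the new 12e9 baseline.
	fmt.Printf("adjustedCPU = %.2e ns/s\n", adjustedCPU) // 1.22e+10
}
```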
Duplicate function-level comment on mmaRangeLoad
mma_replica_store.go, before the standalone mmaRangeLoad function:
The comment block starting "The returned RangeLoad contains stats..." is
duplicated between the standalone mmaRangeLoad function and the method
(mr *mmaReplica) mmaRangeLoad. In the current code on master this is already
the case, so this is pre-existing. But now that the standalone function just
delegates to MakePhysicalRangeLoad, the comment on it should be updated to
reflect that it's a thin wrapper applying amplification factors.
Severity: nit
The new Amp and AmpVector types correctly implement SafeFormatter and
derive String() via redact.StringWithoutMarkers. Since amplification factors
are purely numeric (no PII), this is appropriate. No issues here.
The new types and functions follow CockroachDB conventions. The Amp type as
a named float64 is clean. IdentityAmpVector() is a nice constructor. No
issues.
The three-commit structure is well-organized:
- Groundwork commit (plumb node CPU): adds the physical node-level CPU fields (NodeCPULoad, NodeCPUCapacity) to StoreLoadMsg and NodeLoad, rewrites processStoreLoadMsg to use them, and resolves the old TODO about why only CPU is tracked at the node level. This is a clean, self-contained change.
- Core model commit (introduce physical capacity model): adds the physical model functions and types. Self-contained and well-tested via the datadriven test.
- Wiring commit (convert MMA store and range loads to physical units): threads the physical model through all callers. Mechanical in nature.
This separation is sensible. The first commit establishes the new data path, the second introduces the model, and the third wires it up.
One structural observation: commit 2 and commit 3 could arguably be one commit, since commit 2 introduces code that isn't called by anything until commit 3 lands. But keeping them separate is fine for reviewability — the model logic in commit 2 is complex enough to warrant isolated review.
The PR description is thorough and well-structured. It clearly explains the
algebraic equivalence of the old and new models, the motivation (unit mismatch
in adjustedCPU, uniform LoadVector across dimensions), and the trade-off
(estimation error in both load and capacity). The per-commit summaries are
helpful for multi-commit review.
The description mentions the PR is rebased on #164760. That base PR is still open, which is fine for review but should be noted for merge ordering.
- Compatibility: not applicable. This change is internal to MMA, which is not yet the default allocator. No RPCs, persisted formats, or inter-version state are affected.
- Red/green testing: the behavioral fix here preserves fractionUsed ratios by design, so the existing tests continue to pass with the same expectations. The new TestComputePhysicalStore datadriven test covers the physical model directly. The testdata updates to add node-cpu-load/node-cpu-capacity to every store-load-msg are mechanical and consistent.