This PR implements the KV portion of SKIP LOCKED support for
SELECT ... FOR UPDATE SKIP LOCKED and SELECT ... FOR SHARE SKIP LOCKED.
The change spans the MVCC scanner, KV concurrency control, optimistic
evaluation, timestamp cache, refresh spans, and the lock table. The SQL
optimizer still rejects SKIP LOCKED (the SQL portion was extracted into
a separate PR, #83627), so this is plumbing-only from the KV side.
Overall assessment: the change is well-structured across five focused commits, the isolation model is clearly articulated, and the test coverage (data-driven MVCC tests, lock table tests, concurrency manager tests, integration tests) is thorough. The commit arc is clean and tells a reviewable story. There are a few correctness and design concerns worth discussing, and some comments that could be improved.
IsKeyLockedByConflictingTxn races on the lock table snapshot --
pebble_mvcc_scanner.go, lock_table.go
- Severity: blocking
The IsKeyLockedByConflictingTxn implementation in lockTableGuardImpl takes
l.mu.Lock() on the lockState found in the tableSnapshot btree. This is
correct for reading the lock holder, but there is a subtle concern: the
tableSnapshot is captured at sequencing time, while evaluation happens later.
Between those two points, a lock could be released and the lockState object
could be reused or its fields updated. The l.mu.Lock() protects the read, but
the question is whether the lockState itself could have been garbage collected
or reused for a different key.
In practice, CockroachDB's lock table retains lockState objects in the btree
even after release (they are only removed lazily), and the btree snapshot is
taken under lockTableImpl.mu. So this should be safe. However, the PR does
not document this invariant -- a comment in IsKeyLockedByConflictingTxn
explaining why it is safe to access lockState from the snapshot after
sequencing would be valuable, since a future refactor could break this
assumption.
Non-locking reads with skipLocked and failOnMoreRecent: WriteTooOldError
on unlocked keys is correct but surprising --
pebble_mvcc_scanner.go (cases 2a, 4a)
- Severity: suggestion
When skipLocked and failOnMoreRecent are both set, the scanner skips keys
that are locked by conflicting transactions but still raises
WriteTooOldError for unlocked keys with more recent versions. This is
logically consistent: the weaker isolation applies only to lock conflicts, not
to MVCC conflicts on unlocked keys.
However, this means a SELECT ... FOR UPDATE SKIP LOCKED can still fail with
WriteTooOldError and force a retry. The MVCC test coverage confirms this
behavior. A comment somewhere (perhaps in the MVCCScan or MVCCGet
documentation) should call this out explicitly, as it may surprise users who
expect SKIP LOCKED to "never block and never error on conflicts."
collectSpansRead casts req.(roachpb.LockingReadRequest) without guarding
against non-locking reads -- replica_read.go
- Severity: suggestion
In collectSpansRead, when ba.WaitPolicy == lock.WaitPolicy_SkipLocked && roachpb.CanSkipLocked(req), the code does:
getAlloc.get.KeyLocking = req.(roachpb.LockingReadRequest).KeyLockingStrength()If a CanSkipLocked request were ever added that does not implement
LockingReadRequest, this would panic. Currently CanSkipLocked is only true
for GetRequest, ScanRequest, and ReverseScanRequest, all of which
implement LockingReadRequest, so this is safe today. But the assertion is
implicit. A comment or a defensive check would prevent a future surprise.
Intent resolution cost for SKIP LOCKED -- intent_resolver.go
- Severity: nit
The TODO comment added at CleanupIntentsAsync raises a valid concern: if
SKIP LOCKED scans frequently encounter intents from live transactions, the
async intent resolution will attempt to resolve them, only to find they are
still active. The comment says "is this needed if the intents could not have
expired yet (i.e. they are not at least 5s old)?" This is a good question for a
follow-up, but does not block this PR. The comment is appropriately marked as a
TODO.
- Severity: nit
The commit arc is well done:
- Rename
SKIP->SKIP_LOCKED(mechanical, clear) - Refactor
getAndAdvancecase ordering (mechanical, enables next commit) - Teach MVCC scanner about skip-locked (core behavior change)
- Teach concurrency control about skip-locked (KV integration)
- Teach optimistic evaluation about skip-locked (performance optimization)
This is a clean progression. The refactor commit (#2) is properly separated from the behavioral change (#3), which makes both easier to review. The behavioral changes in #3 are well-tested with the extensive data-driven test file.
One minor observation: commit 4 is quite large because it touches concurrency control, timestamp cache, refresh spans, and span collection all at once. These are tightly coupled for correctness (they all relate to the weaker isolation semantics), so splitting them further might not improve reviewability. The PR description's implementation section explains these touchpoints well.
LockTableView interface doc: grammar -- pkg/storage/mvcc.go
- Severity: nit
// LockTableView is a transaction-bound view into an in-memory collections of
"collections" -> "collection"
MVCCGet doc: grammar -- pkg/storage/mvcc.go
- Severity: nit
// In this mode, the LockTableView provided in the options is consulted any
// observed key to determine whether it is locked with an unreplicated lock.
Missing "for" -- should be "is consulted for any observed key".
collectSpans comment: typo -- replica_send.go
- Severity: nit
// able to constraint their read spans
"constraint" -> "constrain"
The change to RefreshSpanIterate to return an error is a good improvement.
The previous signature silently discarded potential errors from
ResponseKeyIterate. The callers are updated appropriately.
Duplicate scanLockStrength function --
lock_table_test.go and datadriven_util_test.go
- Severity: suggestion
The scanLockStrength function is defined identically in both
lock_table_test.go and datadriven_util_test.go (same package). Since
datadriven_util_test.go appears to be the shared utility file for the
concurrency tests, the one in lock_table_test.go is redundant. Only one
definition should be needed.
The PR description is thorough and well-written. The implementation section explains the isolation model clearly, the interaction with latching, the timestamp cache, and refresh spans. The example SQL session at the top orients the reader immediately. The honest "I haven't done serious benchmarking" note is appreciated.
The note about extracting the SQL changes to a separate PR (#83627) is helpful for understanding the scope.
- Redactability: no new types with
String()orError()methods are added; theLockTableViewinterface has no string representation. - Compatibility: the new
WaitPolicy_SkipLockedproto enum value is additive (value 2) and the SQL layer still rejects it, so there is no mixed-version compatibility concern for this PR specifically. A cluster version gate will be needed when the SQL layer is wired up.
None -- all referenced rules and skills were available.