This PR adds a roachtest that exercises changefeeds during rolling node
drain+restart cycles and introduces a COCKROACH_CHANGEFEED_TESTING_SLOW_RETRY
env var for reaching max backoff behavior quickly. The test is well-structured
and the motivation is clear. There are a few structural and correctness issues
worth addressing.
suggestion: The PR mixes two distinct changes in a single commit: (1) the
COCKROACH_CHANGEFEED_TESTING_SLOW_RETRY env var in retry.go (a behavioral
change to production code) and (2) the roachtest in cdc.go. These should be
separate commits. The retry knob is independently useful and reviewable.
suggestion — pkg/cmd/roachtest/tests/cdc.go, workload goroutine
(m.Go): The workload goroutine calls t.Fatal(err) instead of returning the
error. While this works in roachtests (where t.Fatal calls
runtime.Goexit()), the monitor's m.Go contract expects errors to be
returned. The restart goroutine correctly returns errors — the workload
goroutine should do the same for consistency:
m.Go(func(ctx context.Context) error {
cmd := fmt.Sprintf(...)
return c.RunE(ctx, option.WithNodes(c.Node(c.Spec().NodeCount)), cmd)
})suggestion — pkg/cmd/roachtest/tests/cdc.go, wait channel: The wait
channel is created and immediately closed at the end of the synchronous function
literal, then read from (<-wait) on the very next line. It serves no purpose
— the function literal runs synchronously on the main goroutine, so by the time
<-wait executes, the channel is already closed. This is dead code inherited
from runCDCInitialScanRollingRestart where it made more sense (the function
literal ran in a different flow). Remove the wait channel and just call
m.Wait() directly after the function literal returns. (andyyang890 also
flagged this.)
suggestion — pkg/cmd/roachtest/tests/cdc.go, restart goroutine when
doRestarts is false: When doRestarts is false, the restart goroutine blocks
on stopRestarts for the entire test duration doing nothing. It would be
cleaner to not start this goroutine at all when doRestarts is false.
(andyyang890 also flagged this.)
suggestion — pkg/ccl/changefeedccl/retry.go, line 20-22: The comment
"useful for testing the retry behavior near its maximum" is ambiguous — "its
maximum" could refer to the retry count or the backoff duration. Clarify:
// useSlowRetry starts retry backoff at 32s so tests reach the maximum
// backoff duration without many retries.
nit — pkg/cmd/roachtest/tests/cdc.go, the function literal containing
the changefeed monitoring loop: This 70-line anonymous function would benefit
from a brief comment at the top explaining its phases: (1) create changefeed,
(2) poll lag during restart phase, (3) poll lag during recovery phase, (4)
assert final lag. (andyyang890 also flagged this.)
suggestion — pkg/cmd/roachtest/tests/cdc.go, rollingRestartParams
defaults: The zero-value defaulting pattern (if testDuration == 0 { ... })
hides what the registered test actually uses. Since there is only one registered
variant, consider either removing the defaults and requiring explicit values, or
documenting the defaults in the struct's doc comment so a reader of the test
registration can understand the full parameterization without reading the
function body. (andyyang890 also flagged this.)
nit — pkg/ccl/changefeedccl/retry.go, lines 39-47: The useSlowRetry
block can be merged with the useFastRetry block as an else if to reduce
nesting and make the precedence relationship self-documenting:
if useSlowRetry {
opts = retry.Options{...}
} else if useFastRetry {
opts = retry.Options{...}
}(andyyang890 also flagged this.)
nit: The PR description is missing a Release note: None line, which is
required by convention. (andyyang890 also flagged this.)
nit — pkg/cmd/roachtest/tests/cdc.go: The test uses
c.NewDeprecatedMonitor. As DarrylWong noted in an existing comment, the
preferred approach is t.Go with Monitor: true in the TestSpec, which
avoids the need for ExpectDeath/ResetDeaths bookkeeping. Not blocking since
existing CDC tests also use the deprecated monitor, but worth adopting the
modern API in new code.
- Redactability: No new types with
String()orError()methods. - Red/green testing: This is a new test, not a behavioral fix, so the red/green pattern does not apply.
- Compatibility: No RPCs, persisted formats, or shared state changes. The env var is test-only and does not affect production behavior unless explicitly set.
None — all referenced rules and skills were available.