Skip to content

Instantly share code, notes, and snippets.

@tbg
Created March 4, 2026 10:15
Show Gist options
  • Select an option

  • Save tbg/69ec2eea0c4ab5d794a135fa62e551dc to your computer and use it in GitHub Desktop.

Select an option

Save tbg/69ec2eea0c4ab5d794a135fa62e551dc to your computer and use it in GitHub Desktop.
review-crdb skill example: PR #164677 (connection retry roachtest)

Review: PR #164677 — changefeedccl: add roachtest for CDC rolling restarts with KV workload

Summary

This PR adds a roachtest that exercises changefeeds during rolling node drain+restart cycles and introduces a COCKROACH_CHANGEFEED_TESTING_SLOW_RETRY env var for reaching max backoff behavior quickly. The test is well-structured and the motivation is clear. There are a few structural and correctness issues worth addressing.

Findings

Commit structure

suggestion: The PR mixes two distinct changes in a single commit: (1) the COCKROACH_CHANGEFEED_TESTING_SLOW_RETRY env var in retry.go (a behavioral change to production code) and (2) the roachtest in cdc.go. These should be separate commits. The retry knob is independently useful and reviewable.

Error handling

suggestionpkg/cmd/roachtest/tests/cdc.go, workload goroutine (m.Go): The workload goroutine calls t.Fatal(err) instead of returning the error. While this works in roachtests (where t.Fatal calls runtime.Goexit()), the monitor's m.Go contract expects errors to be returned. The restart goroutine correctly returns errors — the workload goroutine should do the same for consistency:

m.Go(func(ctx context.Context) error {
    cmd := fmt.Sprintf(...)
    return c.RunE(ctx, option.WithNodes(c.Node(c.Spec().NodeCount)), cmd)
})

Unnecessary synchronization channel

suggestionpkg/cmd/roachtest/tests/cdc.go, wait channel: The wait channel is created and immediately closed at the end of the synchronous function literal, then read from (<-wait) on the very next line. It serves no purpose — the function literal runs synchronously on the main goroutine, so by the time <-wait executes, the channel is already closed. This is dead code inherited from runCDCInitialScanRollingRestart where it made more sense (the function literal ran in a different flow). Remove the wait channel and just call m.Wait() directly after the function literal returns. (andyyang890 also flagged this.)

Idle goroutine in control mode

suggestionpkg/cmd/roachtest/tests/cdc.go, restart goroutine when doRestarts is false: When doRestarts is false, the restart goroutine blocks on stopRestarts for the entire test duration doing nothing. It would be cleaner to not start this goroutine at all when doRestarts is false. (andyyang890 also flagged this.)

Comments

suggestionpkg/ccl/changefeedccl/retry.go, line 20-22: The comment "useful for testing the retry behavior near its maximum" is ambiguous — "its maximum" could refer to the retry count or the backoff duration. Clarify:

// useSlowRetry starts retry backoff at 32s so tests reach the maximum
// backoff duration without many retries.

nitpkg/cmd/roachtest/tests/cdc.go, the function literal containing the changefeed monitoring loop: This 70-line anonymous function would benefit from a brief comment at the top explaining its phases: (1) create changefeed, (2) poll lag during restart phase, (3) poll lag during recovery phase, (4) assert final lag. (andyyang890 also flagged this.)

Go conventions

suggestionpkg/cmd/roachtest/tests/cdc.go, rollingRestartParams defaults: The zero-value defaulting pattern (if testDuration == 0 { ... }) hides what the registered test actually uses. Since there is only one registered variant, consider either removing the defaults and requiring explicit values, or documenting the defaults in the struct's doc comment so a reader of the test registration can understand the full parameterization without reading the function body. (andyyang890 also flagged this.)

nitpkg/ccl/changefeedccl/retry.go, lines 39-47: The useSlowRetry block can be merged with the useFastRetry block as an else if to reduce nesting and make the precedence relationship self-documenting:

if useSlowRetry {
    opts = retry.Options{...}
} else if useFastRetry {
    opts = retry.Options{...}
}

(andyyang890 also flagged this.)

PR description

nit: The PR description is missing a Release note: None line, which is required by convention. (andyyang890 also flagged this.)

Deprecated API

nitpkg/cmd/roachtest/tests/cdc.go: The test uses c.NewDeprecatedMonitor. As DarrylWong noted in an existing comment, the preferred approach is t.Go with Monitor: true in the TestSpec, which avoids the need for ExpectDeath/ResetDeaths bookkeeping. Not blocking since existing CDC tests also use the deprecated monitor, but worth adopting the modern API in new code.

Aspects skipped

  • Redactability: No new types with String() or Error() methods.
  • Red/green testing: This is a new test, not a behavioral fix, so the red/green pattern does not apply.
  • Compatibility: No RPCs, persisted formats, or shared state changes. The env var is test-only and does not affect production behavior unless explicitly set.

Missing sources

None — all referenced rules and skills were available.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment