
Tobias Grieger tbg

@tbg
tbg / output.md
Created March 11, 2026 12:35
investigate workflow runs

/investigate workflow runs

113 non-skipped runs from 2026-02-19 to 2026-03-11.

| Date | Who | Result | Issue | Title | Run |
| --- | --- | --- | --- | --- | --- |
| 2026-03-11 | rafiss | success | #163431 | roachtest: ruby-pg failed [liveness session expired before transaction] | run |
| 2026-03-10 | williamchoe3 | success | #165212 | pkg/sql/opt/opbench/opbench_test_/opbench_test: pkg failed | run |
| 2026-03-10 | dt | success | #164906 | roachtest: backup-restore/online-restore failed | run |
| 2026-03-10 | dt | success | #165013 | backup: TestBackupRestoreCrossTab | |
@tbg
tbg / heapscan.md
Created March 10, 2026 11:55
heapScan overestimate under GOGC=off + GOMEMLIMIT


Setup

Single-node CockroachDB (n2-standard-16, 64GB RAM) running a KV workload at ~20% CPU with GOGC=off and GOMEMLIMIT=51GiB. The live heap is ~480MB, but with GOGC disabled, the heap grows to ~50GB before GC triggers (driven entirely by the memory limit). GC runs roughly every 24 seconds.

The log output

@tbg
tbg / gcpacer-tail.md
Created March 9, 2026 13:09
GC + pacer trace tail from tobias-gcassist (2026-03-09)

GC + Pacer trace tail (tobias-gcassist, 2026-03-09 ~12:45 UTC)

Single-node CockroachDB (n2-standard-16), KV workload at ~20-25% CPU. GODEBUG=gctrace=1,gcpacertrace=1.

```
pacer: assist ratio=+1.966144e+000 (scan 226 MB in 1660->1736 MB) workers=4++0.000000e+000
pacer: 27% CPU (25 exp.) for 151835216+1501680+2831530 B work (155682658 B exp.) in 1741434408 B -> 1766226960 B (∆goal -54389166, cons/mark +1.702709e-001)
gc 20311 @11135.890s 0%: 0.099+11+0.098 ms clock, 1.5+5.1/45/49+1.5 ms cpu, 1660->1684->434 MB, 1736 MB goal, 1 MB stacks, 2 MB globals, 16 P
pacer: sweep done at heap size 458MB; allocated 23MB during sweep; swept 218348 pages at +1.681737e-004 pages/byte
```
@tbg
tbg / gc-assist-metric-issue.md
Created March 9, 2026 12:28
OTel Datadog exporter inflates counter metric rates by ~3x


Summary

The Datadog cockroachdb.sys.gc.assist.ns metric (and likely all Prometheus counter-type metrics) reports a rate ~3x higher than the actual rate when using .as_rate(). The root cause appears to be a mismatch between the OTel Prometheus scrape interval (30s) and the interval metadata submitted to Datadog by the OTel Datadog exporter (suspected 10s, matching the batch processor timeout).

@tbg
tbg / experiment.md
Last active March 9, 2026 13:14
Single-Node KV Workload Experiment with GC tracing analysis

Single-Node KV Workload Experiment

2026-03-09T09:23:05Z by Showboat dev

Create a single-node CockroachDB cluster with OpenTelemetry and fluent-bit for Datadog observability, then run a KV workload targeting ~20-25% CPU.

Cluster Creation

@tbg
tbg / review-pr-164900.md
Created March 5, 2026 09:54
Review of cockroachdb/cockroach PR #164900: mmaintegration: introduce physical capacity model

Review: PR #164900 — mmaintegration: introduce physical capacity model

Author: wenyihu6 | Branch: oldmodel2 | Epic: CRDB-55052

Blocking Issues (must fix)

  1. [correctness] highDiskSpaceUtilization comment is now stale (capacity_model.go:703-724): The comment explains that fractionUsed = load/capacity = LogicalBytes / (LogicalBytes / diskUtil) = diskUtil. Under the new model, load=Used, capacity=Used+Available — the math still recovers actual disk utilization, but the comment references the old LogicalBytes-based derivation and is now misleading.

  2. [correctness] minCapacity floor is dramatically lower than the old floor (physical_model.go): The old model had cpuCapacityFloorPerStore = 0.1 * 1e9 (0.1 cores). The new minCapacity = 1.0 means 1 ns/s — effectively zero CPU capacity. The old floor existed to prevent utilization from going to infinity on overloaded nodes (its comment explains this in detail). If a store has non-zero load and capacity=1 ns/s, utilization

@tbg
tbg / review.md
Created March 4, 2026 10:15
review-crdb skill example: PR #161454 (engine separation ReadWriter)

Review: PR #161454 — kvserver: thread in correct engine when destroying and subsuming replicas

Summary

This PR replaces two uses of kvstorage.TODOReadWriter(b.batch) in replicaAppBatch.runPostAddTriggersReplicaOnly with a new b.ReadWriter() helper that correctly separates the state engine batch (b.batch) from the raft engine batch (b.RaftBatch()). This is part of the broader effort to logically separate the state and raft engines in the apply stack (issue #161059). The change is correct, small, and follows the pattern

@tbg
tbg / review.md
Created March 4, 2026 10:15
review-crdb skill example: PR #79134 (SKIP LOCKED implementation)

Review: PR #79134 — kv: support FOR {UPDATE,SHARE} SKIP LOCKED

Summary

This PR implements the KV portion of SKIP LOCKED support for SELECT ... FOR UPDATE SKIP LOCKED and SELECT ... FOR SHARE SKIP LOCKED. The change spans the MVCC scanner, KV concurrency control, optimistic evaluation, timestamp cache, refresh spans, and the lock table. The SQL optimizer still rejects SKIP LOCKED (the SQL portion was extracted into a separate PR, #83627), so this is plumbing-only from the KV side.

@tbg
tbg / review.md
Created March 4, 2026 10:15
review-crdb skill example: PR #164677 (connection retry roachtest)

Review: PR #164677 — changefeedccl: add roachtest for CDC rolling restarts with KV workload

Summary

This PR adds a roachtest that exercises changefeeds during rolling node drain+restart cycles and introduces a COCKROACH_CHANGEFEED_TESTING_SLOW_RETRY env var for reaching max backoff behavior quickly. The test is well-structured and the motivation is clear. There are a few structural and correctness issues worth addressing.

@tbg
tbg / review.md
Created March 4, 2026 10:15
review-crdb skill example: PR #164792 (physical modeling in simulator)

Review: PR #164792 — mmaintegration: introduce physical capacity model

Summary

This PR introduces a physical capacity model for MMA that expresses store loads and capacities in physical resource units (CPU ns/s, disk bytes) and threads amplification factors through all range-load callsites. It is a well-structured, well-documented change with excellent commit messages. The core algebraic claim (load/capacity ratio preservation) is correct. There are a few issues worth addressing, the most important being a missing capacity floor that changes