The Datadog `cockroachdb.sys.gc.assist.ns` metric (and likely all Prometheus
counter-type metrics) reports a rate ~3x higher than the actual rate when using
`.as_rate()`. The root cause appears to be a mismatch between the OTel
Prometheus scrape interval (30s) and the interval metadata submitted to Datadog
by the OTel Datadog exporter (suspected 10s, matching the batch processor
timeout).
We ran a single-node CockroachDB cluster (tobias-gcassist, n2-standard-16)
with a KV workload at ~20-25% CPU and `GODEBUG=gctrace=1` enabled, and compared
the GC assist CPU time from three sources:
| Source | GC assist rate |
|---|---|
| gctrace (stderr, parsed from 50 cycles) | ~8.8 ms/s |
| /_status/vars (polled raw counter, 5 samples over 49s) | ~7.6 ms/s |
| Datadog .as_rate() | ~24-26 ms/s |
The gctrace and `/_status/vars` numbers agree (they both read from
`gcController.assistTime` via the Go runtime metric
`/cpu/classes/gc/mark/assist:cpu-seconds`). The Datadog number is ~3.2-3.4x
higher.
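For reference, the per-cycle assist CPU time can be extracted from gctrace output with a small parser. This is a sketch assuming the standard `GODEBUG=gctrace=1` line format (the CPU section is sweep-termination + assist/background/idle during mark + mark-termination); the sample lines below are fabricated, not from the actual run:

```python
import re

# A gctrace line looks like:
# gc 1 @0.019s 0%: 0.14+0.30+0.002 ms clock, 1.1+44/0.24/0.40+0.019 ms cpu, ...
# The assist CPU time is the number after the first '+' and before the first '/'
# in the "ms cpu" section.
CPU_RE = re.compile(r"([\d.]+)\+([\d.]+)/([\d.]+)/([\d.]+)\+([\d.]+) ms cpu")
TS_RE = re.compile(r"@([\d.]+)s")

def assist_rate_ms_per_s(lines):
    """Sum assist CPU ms across GC cycles, divide by elapsed wall time."""
    assist_ms, timestamps = 0.0, []
    for line in lines:
        m, t = CPU_RE.search(line), TS_RE.search(line)
        if m and t:
            assist_ms += float(m.group(2))
            timestamps.append(float(t.group(1)))
    return assist_ms / (timestamps[-1] - timestamps[0])

# Fabricated sample: two cycles 10s apart, 44 ms of assist each -> ~8.8 ms/s,
# roughly the rate observed in the table above.
sample = [
    "gc 1 @5.0s 2%: 0.14+0.30+0.002 ms clock, 1.1+44/0.24/0.40+0.019 ms cpu, 4->4->0 MB, 5 MB goal, 8 P",
    "gc 2 @15.0s 2%: 0.12+0.28+0.002 ms clock, 1.0+44/0.22/0.38+0.018 ms cpu, 4->4->0 MB, 5 MB goal, 8 P",
]
print(assist_rate_ms_per_s(sample))  # ~8.8
```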
The `sys_gc_assist_ns` metric is correctly exposed as a Prometheus counter
(a monotonically increasing cumulative value in nanoseconds):

```
# TYPE sys_gc_assist_ns counter
sys_gc_assist_ns{node_id="1"} 6.8206110615e+10
```
The OTel pipeline processes this as follows:

- The Prometheus receiver scrapes `/_status/vars` every 30s (configured in the
  OTel collector config under `scrape_interval: 30s`).
- The Datadog exporter converts the cumulative counter to per-interval deltas
  (expected behavior for Datadog's `count` submission type).
- The delta over a 30s scrape interval is correct (~240M ns for ~8 ms/s of
  assist time).
- However, Datadog's `.as_rate()` divides by ~10s instead of 30s, producing a
  3x inflated rate.
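The inflation is pure arithmetic: the exporter submits the correct 30s delta, but the rate query divides it by the wrong interval. A minimal sketch using the ~240M ns delta quoted above:

```python
delta_ns = 2.4e8                 # correct delta over one 30s scrape interval

true_rate = delta_ns / 30        # 8e6 ns/s == 8 ms/s, matching gctrace
inflated_rate = delta_ns / 10    # what .as_rate() appears to compute

print(true_rate / 1e6)           # 8.0 (ms/s)
print(inflated_rate / 1e6)       # 24.0 (ms/s)
print(inflated_rate / true_rate) # 3.0 -- the observed inflation factor
```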
The suspected cause: the OTel `batch/datadog` processor is configured with
`timeout: 10s`. This timeout value appears to leak into the interval metadata
that the Datadog exporter attaches to each submitted count, rather than using
the actual interval between Prometheus scrapes.
Relevant OTel collector config:
```yaml
receivers:
  prometheus/cockroachdb:
    config:
      global:
        scrape_interval: 30s  # <-- actual data interval

processors:
  batch/datadog:
    timeout: 10s  # <-- suspected source of wrong interval metadata
```

Querying the same metric three ways in Datadog shows the issue:
```
# Raw values (already converted to deltas by the exporter):
avg:cockroachdb.sys.gc.assist.ns{cluster:tobias-gcassist}
# => ~230-250M per point, points spaced 30s apart

# .as_rate() divides by ~10s instead of 30s:
avg:cockroachdb.sys.gc.assist.ns{cluster:tobias-gcassist}.as_rate()
# => ~24M ns/s (should be ~8M ns/s)

# Correct rate can be obtained manually:
# raw_value / 30 ≈ 8M ns/s ≈ 8 ms/s ✓
```
This likely affects all Prometheus counter-type metrics flowing through the
OTel Datadog exporter with this configuration, not just sys.gc.assist.ns.
Any metric queried with .as_rate() would show rates inflated by the same
factor (~scrape_interval / batch_timeout).
The cluster tobias-gcassist is still running with the workload and
GODEBUG=gctrace=1 enabled. You can verify by:
```shell
# Ground truth from /_status/vars (poll twice, 30s apart):
roachprod run tobias-gcassist:1 -- \
  "curl -s http://localhost:26258/_status/vars | grep '^sys_gc_assist_ns'"

# Compare with Datadog:
pup metrics query --from 5m \
  --query 'avg:cockroachdb.sys.gc.assist.ns{cluster:tobias-gcassist}.as_rate()'
```
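Turning the two `/_status/vars` polls into a ground-truth rate takes only a few lines. A sketch, assuming the exposition line format shown earlier; the two sample values are hypothetical, chosen 2.4e8 ns apart to match the ~8 ms/s scale from the run:

```python
import re

def parse_counter(vars_text, name="sys_gc_assist_ns"):
    """Extract a cumulative counter value from /_status/vars output."""
    m = re.search(rf"^{name}{{[^}}]*}} ([0-9.e+]+)", vars_text, re.MULTILINE)
    return float(m.group(1))

# Hypothetical output of two polls taken 30s apart:
poll0 = 'sys_gc_assist_ns{node_id="1"} 6.8206110615e+10'
poll1 = 'sys_gc_assist_ns{node_id="1"} 6.8446110615e+10'

delta_ns = parse_counter(poll1) - parse_counter(poll0)
rate_ms_per_s = delta_ns / 30 / 1e6
print(rate_ms_per_s)  # 8.0 -- should agree with gctrace, not with .as_rate()
```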