The Datadog `cockroachdb.sys.gc.assist.ns` metric (and likely all Prometheus
counter-type metrics) reports a rate ~3x higher than the actual rate when using
`.as_rate()`. The root cause appears to be a mismatch between the OTel
Prometheus scrape interval (30s) and the interval metadata submitted to Datadog
by the OTel Datadog exporter (suspected 10s, matching the batch processor
timeout).
We ran a single-node CockroachDB cluster (tobias-gcassist, n2-standard-16)
with a KV workload at ~20-25% CPU and `GODEBUG=gctrace=1` enabled, and compared
the GC assist CPU time from three sources:
| Source | GC assist rate |
|---|---|
| gctrace (stderr, parsed from 50 cycles) | ~8.8 ms/s |
| /_status/vars (polled raw counter, 5 samples over 49s) | ~7.6 ms/s |
| Datadog .as_rate() | ~24-26 ms/s |
The gctrace and `/_status/vars` numbers agree (they both read from
`gcController.assistTime` via the Go runtime metric
`/cpu/classes/gc/mark/assist:cpu-seconds`). The Datadog number is ~3.2-3.4x
higher.
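For reference, the per-cycle assist CPU time can be extracted from gctrace output with a small parser. This is a sketch assuming the standard `GODEBUG=gctrace=1` line format (the CPU section is sweep-termination + assist/background/idle during mark + mark-termination); the sample lines below are fabricated, not from the actual run:

```python
import re

# A gctrace line looks like:
# gc 1 @0.019s 0%: 0.14+0.30+0.002 ms clock, 1.1+44/0.24/0.40+0.019 ms cpu, ...
# The assist CPU time is the number after the first '+' and before the first '/'
# in the "ms cpu" section.
CPU_RE = re.compile(r"([\d.]+)\+([\d.]+)/([\d.]+)/([\d.]+)\+([\d.]+) ms cpu")
TS_RE = re.compile(r"@([\d.]+)s")

def assist_rate_ms_per_s(lines):
    """Sum assist CPU ms across GC cycles, divide by elapsed wall time."""
    assist_ms, timestamps = 0.0, []
    for line in lines:
        m, t = CPU_RE.search(line), TS_RE.search(line)
        if m and t:
            assist_ms += float(m.group(2))
            timestamps.append(float(t.group(1)))
    return assist_ms / (timestamps[-1] - timestamps[0])

# Fabricated sample: two cycles 10s apart, 44 ms of assist each -> ~8.8 ms/s,
# roughly the rate observed in the table above.
sample = [
    "gc 1 @5.0s 2%: 0.14+0.30+0.002 ms clock, 1.1+44/0.24/0.40+0.019 ms cpu, 4->4->0 MB, 5 MB goal, 8 P",
    "gc 2 @15.0s 2%: 0.12+0.28+0.002 ms clock, 1.0+44/0.22/0.38+0.018 ms cpu, 4->4->0 MB, 5 MB goal, 8 P",
]
print(assist_rate_ms_per_s(sample))  # ~8.8
```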
The `sys_gc_assist_ns` metric is correctly exposed as a Prometheus counter
(a monotonically increasing cumulative value in nanoseconds):

```
# TYPE sys_gc_assist_ns counter
sys_gc_assist_ns{node_id="1"} 6.8206110615e+10
```
The OTel pipeline processes this as follows:

- The Prometheus receiver scrapes `/_status/vars` every 30s (configured in the
  OTel collector config under `scrape_interval: 30s`).
- The Datadog exporter converts the cumulative counter to per-interval deltas
  (expected behavior for Datadog's `count` submission type).
- The delta over a 30s scrape interval is correct (~240M ns for ~8 ms/s of
  assist time).
- However, Datadog's `.as_rate()` divides by ~10s instead of 30s, producing a
  3x inflated rate.
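The inflation is pure arithmetic: the exporter submits the correct 30s delta, but the rate query divides it by the wrong interval. A minimal sketch using the ~240M ns delta quoted above:

```python
delta_ns = 2.4e8                 # correct delta over one 30s scrape interval

true_rate = delta_ns / 30        # 8e6 ns/s == 8 ms/s, matching gctrace
inflated_rate = delta_ns / 10    # what .as_rate() appears to compute

print(true_rate / 1e6)           # 8.0 (ms/s)
print(inflated_rate / 1e6)       # 24.0 (ms/s)
print(inflated_rate / true_rate) # 3.0 -- the observed inflation factor
```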
The suspected cause: the OTel `batch/datadog` processor is configured with
`timeout: 10s`. This timeout value appears to leak into the interval metadata
that the Datadog exporter attaches to each submitted count, rather than using
the actual interval between Prometheus scrapes.
Relevant OTel collector config:
```yaml
receivers:
  prometheus/cockroachdb:
    config:
      global:
        scrape_interval: 30s  # <-- actual data interval

processors:
  batch/datadog:
    timeout: 10s  # <-- suspected source of wrong interval metadata
```

Querying the same metric three ways in Datadog shows the issue:
```
# Raw values (already converted to deltas by the exporter):
avg:cockroachdb.sys.gc.assist.ns{cluster:tobias-gcassist}
# => ~230-250M per point, points spaced 30s apart

# .as_rate() divides by ~10s instead of 30s:
avg:cockroachdb.sys.gc.assist.ns{cluster:tobias-gcassist}.as_rate()
# => ~24M ns/s (should be ~8M ns/s)

# Correct rate can be obtained manually:
# raw_value / 30 ≈ 8M ns/s ≈ 8 ms/s ✓
```
This likely affects all Prometheus counter-type metrics flowing through the
OTel Datadog exporter with this configuration, not just sys.gc.assist.ns.
Any metric queried with .as_rate() would show rates inflated by the same
factor (~scrape_interval / batch_timeout).
The cluster tobias-gcassist is still running with the workload and
GODEBUG=gctrace=1 enabled. You can verify by:
```shell
# Ground truth from /_status/vars (poll twice, 30s apart):
roachprod run tobias-gcassist:1 -- \
  "curl -s http://localhost:26258/_status/vars | grep '^sys_gc_assist_ns'"

# Compare with Datadog:
pup metrics query --from 5m \
  --query 'avg:cockroachdb.sys.gc.assist.ns{cluster:tobias-gcassist}.as_rate()'
```
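Turning the two `/_status/vars` polls into a ground-truth rate takes only a few lines. A sketch, assuming the exposition line format shown earlier; the two sample values are hypothetical, chosen 2.4e8 ns apart to match the ~8 ms/s scale from the run:

```python
import re

def parse_counter(vars_text, name="sys_gc_assist_ns"):
    """Extract a cumulative counter value from /_status/vars output."""
    m = re.search(rf"^{name}{{[^}}]*}} ([0-9.e+]+)", vars_text, re.MULTILINE)
    return float(m.group(1))

# Hypothetical output of two polls taken 30s apart:
poll0 = 'sys_gc_assist_ns{node_id="1"} 6.8206110615e+10'
poll1 = 'sys_gc_assist_ns{node_id="1"} 6.8446110615e+10'

delta_ns = parse_counter(poll1) - parse_counter(poll0)
rate_ms_per_s = delta_ns / 30 / 1e6
print(rate_ms_per_s)  # 8.0 -- should agree with gctrace, not with .as_rate()
```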