@tbg
Created March 9, 2026 12:28
OTel Datadog exporter inflates counter metric rates by ~3x

Summary

The Datadog cockroachdb.sys.gc.assist.ns metric (and likely all Prometheus counter-type metrics) reports a rate ~3x higher than the true rate when queried with .as_rate(). The root cause appears to be a mismatch between the OTel Prometheus scrape interval (30s) and the interval metadata the OTel Datadog exporter submits to Datadog (suspected to be 10s, matching the batch processor timeout).

How we found this

We ran a single-node CockroachDB cluster (tobias-gcassist, n2-standard-16) with a KV workload at ~20-25% CPU with GODEBUG=gctrace=1 enabled, and compared the GC assist CPU time from three sources:

| Source | GC assist rate |
| --- | --- |
| gctrace (stderr, parsed from 50 cycles) | ~8.8 ms/s |
| /_status/vars (polled raw counter, 5 samples over 49s) | ~7.6 ms/s |
| Datadog .as_rate() | ~24-26 ms/s |

The gctrace and /_status/vars numbers agree (they both read from gcController.assistTime via the Go runtime metric /cpu/classes/gc/mark/assist:cpu-seconds). The Datadog number is ~3.2-3.4x higher.
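Both ground-truth sources ultimately read the same runtime counter. As a minimal sketch, it can also be sampled directly via Go's runtime/metrics package (the window length here is arbitrary; longer windows smooth the rate):

```go
package main

import (
	"fmt"
	"runtime/metrics"
	"time"
)

// readAssist returns cumulative GC assist CPU time in seconds — the same
// counter that backs gctrace and sys_gc_assist_ns.
func readAssist() float64 {
	s := []metrics.Sample{{Name: "/cpu/classes/gc/mark/assist:cpu-seconds"}}
	metrics.Read(s)
	return s[0].Value.Float64()
}

func main() {
	before := readAssist()
	time.Sleep(100 * time.Millisecond) // sample window
	after := readAssist()
	// Assist CPU seconds per wall-clock second; multiply by 1000 to get
	// the ms/s figures used in the table above.
	fmt.Printf("assist rate: %.6f s/s\n", (after-before)/0.1)
}
```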

Root cause analysis

The sys_gc_assist_ns metric is correctly exposed as a Prometheus counter (monotonically increasing cumulative value in nanoseconds):

```
# TYPE sys_gc_assist_ns counter
sys_gc_assist_ns{node_id="1"} 6.8206110615e+10
```

The OTel pipeline processes this as follows:

  1. Prometheus receiver scrapes /_status/vars every 30s (configured in the OTel collector config under scrape_interval: 30s).
  2. Datadog exporter converts the cumulative counter to per-interval deltas (expected behavior for Datadog's count submission type).
  3. The delta over a 30s scrape interval is correct (~240M ns for ~8 ms/s of assist time).
  4. However, Datadog's .as_rate() divides by ~10s instead of 30s, producing a 3x inflated rate.
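The mismatch in step 4 can be checked with back-of-the-envelope arithmetic, using the intervals and counter delta from the measurements above:

```go
package main

import "fmt"

const (
	scrapeIntervalS = 30.0  // Prometheus receiver scrape_interval
	batchTimeoutS   = 10.0  // suspected interval metadata sent to Datadog
	deltaNs         = 240e6 // counter delta over one 30s scrape (~8 ms/s assist)
)

func main() {
	correct := deltaNs / scrapeIntervalS // what .as_rate() should return
	inflated := deltaNs / batchTimeoutS  // what Datadog actually shows
	fmt.Printf("correct: %.0f ns/s, inflated: %.0f ns/s, factor: %.1fx\n",
		correct, inflated, inflated/correct)
}
```

This reproduces exactly the ~3x factor observed between /_status/vars and Datadog.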

The suspected cause: the OTel batch/datadog processor is configured with timeout: 10s. This timeout value appears to leak into the interval metadata that the Datadog exporter attaches to each submitted count, rather than using the actual interval between Prometheus scrapes.

Relevant OTel collector config:

```yaml
receivers:
  prometheus/cockroachdb:
    config:
      global:
        scrape_interval: 30s    # <-- actual data interval
processors:
  batch/datadog:
    timeout: 10s                # <-- suspected source of wrong interval metadata
```

Verification

Querying the same metric three ways in Datadog shows the issue:

```
# Raw values (already converted to deltas by the exporter):
avg:cockroachdb.sys.gc.assist.ns{cluster:tobias-gcassist}
# => ~230-250M per point, points spaced 30s apart

# .as_rate() divides by ~10s instead of 30s:
avg:cockroachdb.sys.gc.assist.ns{cluster:tobias-gcassist}.as_rate()
# => ~24M ns/s (should be ~8M ns/s)

# Correct rate can be obtained manually:
# raw_value / 30 ≈ 8M ns/s ≈ 8 ms/s ✓
```

Impact

This likely affects all Prometheus counter-type metrics flowing through the OTel Datadog exporter with this configuration, not just sys.gc.assist.ns. Any metric queried with .as_rate() would show rates inflated by the same factor (~scrape_interval / batch_timeout).

Reproducer

The cluster tobias-gcassist is still running with the workload and GODEBUG=gctrace=1 enabled. You can verify by:

```shell
# Ground truth from /_status/vars (poll twice, 30s apart):
roachprod run tobias-gcassist:1 -- \
  "curl -s http://localhost:26258/_status/vars | grep '^sys_gc_assist_ns'"

# Compare with Datadog:
pup metrics query --from 5m \
  --query 'avg:cockroachdb.sys.gc.assist.ns{cluster:tobias-gcassist}.as_rate()'
```
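The ground-truth computation from two /_status/vars polls can be sketched in Go. The two sample payloads below are hypothetical (the second is the first plus ~240M ns, i.e. 30s at 8 ms/s):

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseCounter extracts the value of a named Prometheus counter from an
// exposition-format payload such as the one served by /_status/vars.
func parseCounter(payload, name string) (float64, bool) {
	sc := bufio.NewScanner(strings.NewReader(payload))
	for sc.Scan() {
		line := sc.Text()
		if !strings.HasPrefix(line, name) {
			continue
		}
		fields := strings.Fields(line)
		if len(fields) != 2 {
			continue
		}
		if v, err := strconv.ParseFloat(fields[1], 64); err == nil {
			return v, true
		}
	}
	return 0, false
}

func main() {
	// Hypothetical samples taken 30s apart.
	sample1 := `sys_gc_assist_ns{node_id="1"} 6.8206110615e+10`
	sample2 := `sys_gc_assist_ns{node_id="1"} 6.8446110615e+10`
	v1, _ := parseCounter(sample1, "sys_gc_assist_ns")
	v2, _ := parseCounter(sample2, "sys_gc_assist_ns")
	// Delta over the 30s poll interval, converted from ns/s to ms/s.
	fmt.Printf("rate: %.1f ms/s\n", (v2-v1)/30/1e6) // => rate: 8.0 ms/s
}
```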