Gist by @c4milo, created March 2, 2026.
Redpanda AI Gateway - Token Tracker Architecture

Token Tracker Architecture

System Overview

The token tracker is a shared infrastructure layer that captures LLM token usage from provider responses and feeds two consumers: the rate limiter (tokens/time) and the spend limiter (cost/time). It uses a write-through cache pattern with Redis for fast aggregation and PostgreSQL for durable storage.

```mermaid
graph TB
    subgraph "Middleware Chain"
        OIDC["OIDC Auth"]
        RC["reqcontext<br/>(parse body once)"]
        SL["Spend Limiter<br/>(CEL evaluation + limit check)"]
        RL["Rate Limiter"]
        RT["Router"]
        TT["Token Tracker<br/>(captures response)"]
    end

    subgraph "Storage Layer"
        Redis["Redis<br/>(Sorted Sets + Lua scripts)<br/>Aggregation happens here"]
        PG["PostgreSQL<br/>(llm_token_usage)<br/>Durable storage + fallback"]
        Kafka["Redpanda/Kafka<br/>(optional stream)"]
    end

    subgraph "Pricing"
        MP["Model Pricing Config<br/>(in-memory lookup)"]
    end

    OIDC --> RC --> SL --> RL --> RT
    RT -->|response| TT

    SL -->|"QueryCostByExpression()<br/>→ Lua aggregates in Redis<br/>→ Go compares against limit"| Redis
    Redis -.->|cache miss| PG

    TT -->|"WriteToCache() (Lua ZADD)"| Redis
    TT -->|"WriteTokenUsage() (batch COPY)"| PG
    TT -.->|optional| Kafka
    TT -->|"calculateCosts()"| MP
```

Core Interfaces

```go
// Writer — records token usage (PostgreSQL batch, Redpanda/Kafka stream)
type Writer interface {
    WriteTokenUsage(ctx context.Context, usage *TokenUsage) error
    Close(ctx context.Context) error
}

// Querier — reads aggregated usage (PostgreSQL direct, or CachedQuerier with Redis)
type Querier interface {
    QueryTokenUsage(ctx context.Context, key string, window time.Duration) (input, output int64, err error)
    QueryCost(ctx context.Context, key string, windowSeconds int) (totalCostCents int64, err error)
    QueryCostByExpression(ctx context.Context, ruleID, key, expression, keyExtractor string, windowSeconds int) (int64, error)
    QueryTokenUsageByExpression(ctx context.Context, ruleID, key, expression, keyExtractor string, window time.Duration) (int64, int64, error)
}
```

Implementations

| Interface | Implementation | Description |
|---|---|---|
| Writer | PostgreSQLWriter | Batch COPY protocol writes (100 records / 1s flush) |
| Writer | RedpandaWriter | Kafka/Redpanda streaming writes (franz-go) |
| Querier | PostgreSQLQuerier | Direct SQL against llm_token_usage |
| Querier | CachedQuerier | Redis cache wrapping any Querier (read-aside) |

Middleware Chain Order

```mermaid
sequenceDiagram
    box Request Path (outer → inner)
        participant C as Client
        participant OIDC as OIDC
        participant RC as reqcontext
        participant SL as Spend Limiter
        participant RL as Rate Limiter
        participant B as Backend (LLM)
    end

    Note over C,B: REQUEST DIRECTION →
    C->>OIDC: POST /v1/chat/completions
    OIDC->>RC: validate JWT, store claims
    RC->>SL: parse body once, store model/provider
    SL->>RL: evaluate CEL, check limits, store keys
    RL->>B: forward to LLM provider

    Note over C,B: ← RESPONSE DIRECTION
    B-->>RL: LLM response
    RL-->>SL: (pass through)
    Note over SL: Token Tracker captures here<br/>(innermost middleware)
    SL-->>RC: record usage → Redis + PostgreSQL
    RC-->>OIDC:
    OIDC-->>C: 200 OK + response
```

The spend limiter wraps the token tracker middleware:

  • Request: spend limiter evaluates CEL, checks limits, stores keys in context
  • Response: token tracker captures tokens, reads keys from context, records usage

Sequence 1: Normal Request (Spend Limit Not Exceeded)

Key detail: Redis Lua scripts handle aggregation atomically (ZRANGEBYSCORE + SUM loop). The limit comparison (totalCostCents >= limitCents) happens in Go after Redis returns the aggregate.

```mermaid
sequenceDiagram
    participant C as Client
    participant SL as Spend Limiter (Go)
    participant B as Backend (LLM)
    participant TT as Token Tracker (Go)
    participant R as Redis (Lua)
    participant PG as PostgreSQL

    C->>SL: POST /v1/chat/completions

    Note over SL: 1. Evaluate CEL match expression<br/>"oidc_claims.tier == 'free'"
    Note over SL: 2. Extract key via CEL<br/>"oidc_claims.sub" → "user-123"

    SL->>R: EVAL costCacheReadScript {hash_tag} 2592000
    Note over R: Lua executes atomically:<br/>1. TIME → get Redis clock<br/>2. ZRANGEBYSCORE → entries in window<br/>3. SUM loop → aggregate cost_cents<br/>4. return total (or nil if empty)
    R-->>SL: 350 (i.e. $3.50)

    Note over SL: 3. Go compares: 350 < 500 limit<br/>→ ALLOWED
    Note over SL: 4. Store keys in context:<br/>SpendLimitRuleID, SpendKey,<br/>SpendKeyExtractor, SpendLimitExpression

    SL->>B: forward request
    B-->>TT: LLM response (tokens in body)

    Note over TT: 5. Extract tokens<br/>(context-first, then gjson parse)
    Note over TT: 6. Calculate cost from pricing config<br/>input=150 × $2.50/1M<br/>output=300 × $10.00/1M

    par 3 parallel Lua cache writes
        TT->>R: EVAL cacheWriteScript (rate limit tokens)
        Note over R: Lua: TIME → INCR seq →<br/>ZADD to 4 sorted sets<br/>(minute/hour/day/month)
        TT->>R: EVAL cacheWriteScript (spend limit tokens)
        TT->>R: EVAL costCacheWriteScript (spend limit cost)
    end
    R-->>TT: Redis TIME timestamps

    TT->>PG: buffer record (flush at 100 records or 1s)
    Note over PG: Batch COPY protocol<br/>(10x faster than INSERT)

    TT-->>C: 200 OK + response
```

Sequence 2: Spend Limit Exceeded (429)

```mermaid
sequenceDiagram
    participant C as Client
    participant SL as Spend Limiter (Go)
    participant R as Redis (Lua)

    C->>SL: POST /v1/chat/completions

    Note over SL: CEL match → key = "user-123"

    SL->>R: EVAL costCacheReadScript {hash_tag} 2592000
    Note over R: Lua: TIME → ZRANGEBYSCORE<br/>→ SUM loop → return 520
    R-->>SL: 520 (i.e. $5.20)

    Note over SL: Go compares: 520 >= 500 limit<br/>→ BLOCKED (action: "block")

    SL-->>C: 429 Too Many Requests
    Note right of C: Headers:<br/>SpendLimit-Policy: cost_per_month_cents=500<br/>SpendLimit: cost_per_month_cents=520<br/>Retry-After: 86400
```

Sequence 3: Cache Miss → PostgreSQL Fallback

```mermaid
sequenceDiagram
    participant SL as Spend Limiter (Go)
    participant CQ as CachedQuerier (Go)
    participant R as Redis (Lua)
    participant PG as PostgreSQL

    SL->>CQ: QueryCostByExpression()
    CQ->>R: EVAL costCacheReadScript
    Note over R: Lua: ZRANGEBYSCORE returns<br/>0 entries → return nil
    R-->>CQ: nil (cache miss)

    Note over CQ: Cache miss → fallback to PostgreSQL

    CQ->>PG: SELECT SUM(total_cost_cents)<br/>FROM llm_token_usage<br/>WHERE spend_key = $1<br/>AND spend_limit_expression = $2<br/>AND timestamp >= NOW() - interval
    PG-->>CQ: 350

    CQ-->>SL: 350 ($3.50)
    Note over SL: Go compares: 350 < 500 → ALLOWED
```
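The read-aside fallback can be sketched as below; the `costReader` closures stand in for the real Redis and PostgreSQL queriers, and the error value is a simplification of the Lua script's nil result:

```go
package main

import (
	"errors"
	"fmt"
)

// errCacheMiss stands in for the nil result the Lua script returns
// when the sorted set has no entries in the window.
var errCacheMiss = errors.New("cache miss")

// costReader abstracts a cost lookup keyed by spend key.
type costReader func(key string) (int64, error)

// queryCostCacheFirst sketches the read-aside pattern: try Redis
// first, and on a miss (or any cache error) fall back to PostgreSQL.
func queryCostCacheFirst(cache, db costReader, key string) (int64, error) {
	if cents, err := cache(key); err == nil {
		return cents, nil
	}
	return db(key) // durable store is the authoritative fallback
}

func main() {
	cache := func(string) (int64, error) { return 0, errCacheMiss }
	db := func(string) (int64, error) { return 350, nil }
	cents, _ := queryCostCacheFirst(cache, db, "user-123")
	fmt.Println(cents) // cache missed, so the value came from PostgreSQL
}
```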

What Runs Where

```mermaid
graph LR
    subgraph "Go (Gateway Process)"
        CEL["CEL Expression<br/>Evaluation"]
        CMP["Limit Comparison<br/>totalCost >= limit"]
        COST["Cost Calculation<br/>tokens × price/1M"]
        BATCH["Batch Buffering<br/>100 records / 1s"]
    end

    subgraph "Redis (Lua Scripts, Atomic)"
        TIME["TIME command<br/>(clock source)"]
        ZADD_W["ZADD + EXPIRE<br/>(write to 4 windows)"]
        ZRANGE["ZRANGEBYSCORE<br/>(range query)"]
        SUM["SUM loop<br/>(aggregate in Lua)"]
        SEQ["INCR seq<br/>(collision prevention)"]
    end

    subgraph "PostgreSQL"
        COPY["COPY protocol<br/>(batch insert)"]
        AGG["SUM() + NOW()<br/>(fallback aggregation)"]
    end

    CEL --> CMP
    TIME --> ZADD_W
    TIME --> ZRANGE --> SUM --> CMP
    SEQ --> ZADD_W
    COST --> ZADD_W
    BATCH --> COPY
    SUM -.->|cache miss| AGG --> CMP
```

Write Architecture

Redis (immediate, for fast queries)

```mermaid
graph LR
    TT["Token Tracker<br/>Middleware"] -->|"Lua: TIME → INCR seq →<br/>ZADD to 4 sorted sets"| SETS

    subgraph SETS ["Sorted Sets per {hash_tag}"]
        M["token:usage:...:<b>minute</b><br/>TTL: 120s"]
        H["token:usage:...:<b>hour</b><br/>TTL: 7,200s"]
        D["token:usage:...:<b>day</b><br/>TTL: 172,800s"]
        Mo["token:usage:...:<b>month</b><br/>TTL: 5,184,000s"]
    end
```

PostgreSQL (batched, durable)

```mermaid
graph LR
    TT["Token Tracker"] -->|"append"| BUF["In-Memory Buffer"]
    BUF -->|"batch_size=100<br/>OR timeout=1s"| COPY["COPY Protocol<br/>(10x faster than INSERT)"]
    COPY --> PG["llm_token_usage table"]
```
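The buffer-then-COPY behavior can be sketched as a size-triggered batcher. This is a simplification: the 1s ticker-driven timeout flush is elided, and plain strings stand in for `TokenUsage` records:

```go
package main

import "fmt"

// batcher flushes when the buffer reaches batchSize, mirroring the
// batch_size=100 setting; a real implementation would also run a
// ticker to enforce the 1s batch_timeout.
type batcher struct {
	buf       []string // stand-in for *TokenUsage records
	batchSize int
	flush     func([]string) // stand-in for the COPY write
}

func (b *batcher) add(rec string) {
	b.buf = append(b.buf, rec)
	if len(b.buf) >= b.batchSize {
		b.flushNow()
	}
}

func (b *batcher) flushNow() {
	if len(b.buf) == 0 {
		return // nothing buffered, skip the round trip
	}
	b.flush(b.buf)
	b.buf = nil
}

func main() {
	total := 0
	b := &batcher{batchSize: 3, flush: func(recs []string) { total += len(recs) }}
	for i := 0; i < 7; i++ {
		b.add(fmt.Sprintf("rec-%d", i))
	}
	b.flushNow() // stand-in for the timeout-driven flush
	fmt.Println(total)
}
```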

Redis Cache Key Format

```
token:usage:{rule_id:key:combined_hash}:{window}
token:cost:{rule_id:key:combined_hash}:{window}
token:write:seq:{rule_id:key:combined_hash}          # sequence counter, 60s TTL
```

  • {} = Redis Cluster hash tags — all keys for one entity route to the same slot
  • combined_hash = SHA256(expression + "|" + keyExtractor)[:16]
  • Sorted Set score = Unix timestamp from redis.call('TIME')
  • Usage value: "seq:timestamp:{"input":150,"output":300}"
  • Cost value: "seq:timestamp:cost_cents"
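The `combined_hash` component can be sketched as below, assuming the truncation is taken over the hex-encoded digest:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// combinedHash reproduces the cache-key hash described above:
// SHA256 over "expression|keyExtractor", truncated to 16 characters.
// (Hex encoding of the digest is an assumption of this sketch.)
func combinedHash(expression, keyExtractor string) string {
	sum := sha256.Sum256([]byte(expression + "|" + keyExtractor))
	return hex.EncodeToString(sum[:])[:16]
}

func main() {
	h := combinedHash("model.startsWith('openai/gpt-4')", "request.context.oidc_claims.sub")
	fmt.Println(len(h), h) // always 16 characters, stable per rule
}
```

Because the hash covers both the expression and the key extractor, two rules that happen to extract the same key still get disjoint sorted sets.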

Clock Synchronization Strategy

| Operation | Clock Source | Why |
|---|---|---|
| Cache write | redis.call('TIME') | Single source of truth for cache |
| Cache read | redis.call('TIME') | Consistent window calculation |
| DB write | Timestamp from Redis TIME | Sync between cache and durable store |
| DB read | NOW() (PostgreSQL) | Clock-sync-free queries |

Read Architecture (Cache-First)

```mermaid
graph TD
    Q["CachedQuerier.QueryCostByExpression()"] --> CACHE{"Redis Lua script"}
    CACHE -->|"Hit (~1ms)<br/>TIME + ZRANGEBYSCORE + SUM"| RET["Return aggregated total<br/>to Go for limit comparison"]
    CACHE -->|"Miss (nil result)"| PG["PostgreSQL fallback<br/>SUM() with NOW() window<br/>(~10-50ms)"]
    PG --> RET
```

Lua Read Script (cost aggregation)

```lua
-- All of this executes atomically inside Redis
local time = redis.call('TIME')
local now = tonumber(time[1]) + (tonumber(time[2]) / 1000000)
local since = now - window_seconds

-- Pick smallest sorted set covering the window
if window_seconds <= 60 then key = '...:minute'
elseif window_seconds <= 3600 then key = '...:hour'
elseif window_seconds <= 86400 then key = '...:day'
else key = '...:month' end

local entries = redis.call('ZRANGEBYSCORE', key, since, now)
if #entries == 0 then return nil end  -- cache miss

-- Aggregate inside Redis
local total_cost = 0
for _, entry in ipairs(entries) do
    local cost_str = entry:match('[^:]+:[^:]+:(.*)')
    total_cost = total_cost + tonumber(cost_str)
end
return total_cost  -- Go code then compares this against the limit
```

Go Limit Comparison (outside Redis)

```go
// spendlimit_handler.go:493-505
totalCostCents, err := querier.QueryCostByExpression(ctx, rule.ID, key, expression, rule.KeyExtractor, windowSeconds)
if err != nil {
    return nil // Fail-open: don't block on query errors
}

blocked := totalCostCents >= limitCents  // ← comparison happens in Go
```

Expression-Based Cost Isolation

The same user can have different limits tracked independently:

```mermaid
graph TD
    U["user-123"] --> R1["Rule: gpt4-budget<br/>expr: model.startsWith('openai/gpt-4')<br/>limit: $10/month"]
    U --> R2["Rule: total-budget<br/>expr: true<br/>limit: $100/month"]

    R1 --> C1["Redis: {gpt4-budget:user-123:a1b2c3d4}<br/>Separate sorted sets"]
    R2 --> C2["Redis: {total-budget:user-123:e5f6g7h8}<br/>Separate sorted sets"]

    R1 --> PG1["PG: WHERE spend_key='user-123'<br/>AND spend_limit_expression=<br/>'model.startsWith(...)'"]
    R2 --> PG2["PG: WHERE spend_key='user-123'<br/>AND spend_limit_expression='true'"]
```

The combined_hash = SHA256(expression + "|" + keyExtractor)[:16] ensures each rule gets its own cache keys. PostgreSQL queries filter by both spend_key AND spend_limit_expression.


Cost Calculation

Costs are calculated at write time in the token tracker middleware (not at query time), using in-memory pricing config:

```go
// Token tracker middleware, after capturing response
record.InputCostCents  = (record.InputTokens * pricing.InputTokenPriceCentsPerMillion) / 1_000_000
record.OutputCostCents = (record.OutputTokens * pricing.OutputTokenPriceCentsPerMillion) / 1_000_000
record.TotalCostCents  = record.InputCostCents + record.OutputCostCents + ...
```

Price resolution priority: Custom (per-org) → Standard → Default

Costs are stored as cents in BIGINT, which avoids floating-point precision issues.
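A worked example of the formula above, using round numbers so the integer arithmetic stays whole-cent (sub-cent rounding behavior is an implementation detail not covered here):

```go
package main

import "fmt"

// costCents applies the write-time formula: prices are expressed as
// cents per million tokens, so $2.50/1M tokens is 250 cents/1M.
func costCents(tokens, centsPerMillion int64) int64 {
	return tokens * centsPerMillion / 1_000_000
}

func main() {
	inputCents := costCents(1_000_000, 250)  // 1M input tokens at $2.50/1M
	outputCents := costCents(500_000, 1_000) // 500K output tokens at $10.00/1M
	fmt.Println(inputCents, outputCents, inputCents+outputCents) // 250 500 750
}
```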


PostgreSQL Schema

```sql
CREATE TABLE llm_token_usage (
    id                             BIGSERIAL PRIMARY KEY,
    timestamp                      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    request_id                     TEXT NOT NULL,
    frontend_id                    TEXT,
    backend_id                     TEXT,
    -- Spend limiting (CEL-based)
    spend_limit_rule_id            TEXT,
    spend_key                      TEXT,
    spend_key_extractor            TEXT,
    spend_limit_expression         TEXT,
    request_context                JSONB,
    -- Model
    model_id                       TEXT,
    provider                       TEXT,
    -- Tokens
    input_tokens                   BIGINT,
    output_tokens                  BIGINT,
    total_tokens                   BIGINT,
    -- Costs (cents, BIGINT for precision)
    input_cost_cents               BIGINT,
    output_cost_cents              BIGINT,
    cached_input_read_cost_cents   BIGINT,
    cached_input_write_cost_cents  BIGINT,
    unit_cost_cents                BIGINT,
    total_cost_cents               BIGINT NOT NULL DEFAULT 0,
    -- Metadata
    path                           TEXT,
    method                         TEXT,
    status_code                    BIGINT,
    streaming                      BOOLEAN,
    duration_ms                    BIGINT,
    created_at                     TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
```

Key Indexes

| Index | Columns | Purpose |
|---|---|---|
| idx_..._timestamp | timestamp DESC | Time-range queries |
| idx_..._request_id | request_id | Debugging / audit |
| idx_..._provider_model | provider, model_id, timestamp DESC | Spending by provider/model |
| idx_..._spend_key_cost | spend_key, timestamp DESC, total_cost_cents (partial) | Cost aggregation |
| idx_..._spend_enforcement | spend_key, spend_limit_expression, timestamp DESC, total_cost_cents (partial) | Per-rule isolation |
| idx_..._context_gin | request_context (GIN) | Flexible JSONB queries |

Fail-Open Design

Every error is logged but never blocks requests:

| Failure | Behavior |
|---|---|
| Cache write fails | Logged as warning, request continues |
| Cache read fails | Falls back to PostgreSQL |
| PostgreSQL write fails | Logged, response already returned |
| PostgreSQL read fails | Logged, spend check passes (fail-open) |
| Token extraction fails | Logged as debug, no record written |
| Cost calculation fails | Logged, record written without cost |

Configuration

```yaml
spend_limiter:
  enabled: true
  token_tracker:
    storage: postgresql
    postgresql:
      storage_id: "postgres-main"
      batch_size: 100
      batch_timeout: 1000          # ms
    async: true
    cache:
      enabled: true
      storage_id: "redis-main"

spend_limits:
  - frontend_id: "main"
    rules:
      - id: "free-tier"
        name: "Free Tier - $5/month"
        expression: "request.context.oidc_claims.tier == 'free'"
        key_extractor: "request.context.oidc_claims.sub"
        tokens_per_minute: 10000
        cost_per_month_cents: 500
        action: "block"             # "block" | "warn" | "throttle"
```

Key Files

| File | Role |
|---|---|
| internal/tokentracker/interfaces.go | Core interfaces and TokenUsage struct |
| internal/tokentracker/extractor.go | Token extraction from LLM responses (gjson) |
| internal/tokentracker/cached_querier.go | Redis cache layer with Lua scripts |
| internal/tokentracker/querier_postgresql.go | PostgreSQL querier |
| internal/tokentracker/writer_postgresql.go | Batched COPY writer |
| internal/frontend/middleware/tokentracker/ | HTTP middleware (capture + record) |
| internal/frontend/middleware/spendlimit/ | Spend limit enforcement middleware |
| internal/cel/hash.go | Expression hash for cache key isolation |
| internal/tokentracker/postgresql/sql/queries.sql | All SQL query definitions |
| internal/tokentracker/postgresql/migrations/ | Schema migrations |