Gist by @c4milo, created March 2, 2026.
Redpanda AI Gateway - Token Tracker Architecture

Token Tracker Architecture

System Overview

The token tracker is a shared infrastructure layer that captures LLM token usage from provider responses and feeds two consumers: the rate limiter (tokens/time) and the spend limiter (cost/time). It uses a write-through cache pattern with Redis for fast aggregation and PostgreSQL for durable storage.

```mermaid
graph TB
    subgraph "Middleware Chain"
        OIDC["OIDC Auth"]
        RC["reqcontext<br/>(parse body once)"]
        SL["Spend Limiter<br/>(CEL evaluation + limit check)"]
        RL["Rate Limiter"]
        RT["Router"]
        TT["Token Tracker<br/>(captures response)"]
    end

    subgraph "Storage Layer"
        Redis["Redis<br/>(Sorted Sets + Lua scripts)<br/>Aggregation happens here"]
        PG["PostgreSQL<br/>(llm_token_usage)<br/>Durable storage + fallback"]
        Kafka["Redpanda/Kafka<br/>(optional stream)"]
    end

    subgraph "Pricing"
        MP["Model Pricing Config<br/>(in-memory lookup)"]
    end

    OIDC --> RC --> SL --> RL --> RT
    RT -->|response| TT

    SL -->|"QueryCostByExpression()<br/>→ Lua aggregates in Redis<br/>→ Go compares against limit"| Redis
    Redis -.->|cache miss| PG

    TT -->|"WriteToCache() (Lua ZADD)"| Redis
    TT -->|"WriteTokenUsage() (batch COPY)"| PG
    TT -.->|optional| Kafka
    TT -->|"calculateCosts()"| MP
```

Core Interfaces

```go
// Writer — records token usage (PostgreSQL batch, Redpanda/Kafka stream)
type Writer interface {
    WriteTokenUsage(ctx context.Context, usage *TokenUsage) error
    Close(ctx context.Context) error
}

// Querier — reads aggregated usage (PostgreSQL direct, or CachedQuerier with Redis)
type Querier interface {
    QueryTokenUsage(ctx context.Context, key string, window time.Duration) (input, output int64, err error)
    QueryCost(ctx context.Context, key string, windowSeconds int) (totalCostCents int64, err error)
    QueryCostByExpression(ctx context.Context, ruleID, key, expression, keyExtractor string, windowSeconds int) (int64, error)
    QueryTokenUsageByExpression(ctx context.Context, ruleID, key, expression, keyExtractor string, window time.Duration) (int64, int64, error)
}
```

Implementations

| Interface | Implementation | Description |
|---|---|---|
| Writer | PostgreSQLWriter | Batch COPY protocol writes (100 records / 1s flush) |
| Writer | RedpandaWriter | Kafka/Redpanda streaming writes (franz-go) |
| Querier | PostgreSQLQuerier | Direct SQL against llm_token_usage |
| Querier | CachedQuerier | Redis cache wrapping any Querier (read-aside) |

Middleware Chain Order

```mermaid
sequenceDiagram
    box Request Path (outer → inner)
        participant C as Client
        participant OIDC as OIDC
        participant RC as reqcontext
        participant SL as Spend Limiter
        participant RL as Rate Limiter
        participant B as Backend (LLM)
    end

    Note over C,B: REQUEST DIRECTION →
    C->>OIDC: POST /v1/chat/completions
    OIDC->>RC: validate JWT, store claims
    RC->>SL: parse body once, store model/provider
    SL->>RL: evaluate CEL, check limits, store keys
    RL->>B: forward to LLM provider

    Note over C,B: ← RESPONSE DIRECTION
    B-->>RL: LLM response
    RL-->>SL: (pass through)
    Note over SL: Token Tracker captures here<br/>(innermost middleware)
    SL-->>RC: record usage → Redis + PostgreSQL
    RC-->>OIDC:
    OIDC-->>C: 200 OK + response
```

The spend limiter wraps the token tracker middleware:

  • Request: spend limiter evaluates CEL, checks limits, stores keys in context
  • Response: token tracker captures tokens, reads keys from context, records usage

Sequence 1: Normal Request (Spend Limit Not Exceeded)

Key detail: Redis Lua scripts handle aggregation atomically (ZRANGEBYSCORE + SUM loop). The limit comparison (totalCostCents >= limitCents) happens in Go after Redis returns the aggregate.

```mermaid
sequenceDiagram
    participant C as Client
    participant SL as Spend Limiter (Go)
    participant B as Backend (LLM)
    participant TT as Token Tracker (Go)
    participant R as Redis (Lua)
    participant PG as PostgreSQL

    C->>SL: POST /v1/chat/completions

    Note over SL: 1. Evaluate CEL match expression<br/>"oidc_claims.tier == 'free'"
    Note over SL: 2. Extract key via CEL<br/>"oidc_claims.sub" → "user-123"

    SL->>R: EVAL costCacheReadScript {hash_tag} 2592000
    Note over R: Lua executes atomically:<br/>1. TIME → get Redis clock<br/>2. ZRANGEBYSCORE → entries in window<br/>3. SUM loop → aggregate cost_cents<br/>4. return total (or nil if empty)
    R-->>SL: 350 (i.e. $3.50)

    Note over SL: 3. Go compares: 350 < 500 limit<br/>→ ALLOWED
    Note over SL: 4. Store keys in context:<br/>SpendLimitRuleID, SpendKey,<br/>SpendKeyExtractor, SpendLimitExpression

    SL->>B: forward request
    B-->>TT: LLM response (tokens in body)

    Note over TT: 5. Extract tokens<br/>(context-first, then gjson parse)
    Note over TT: 6. Calculate cost from pricing config<br/>input=150 × $2.50/1M<br/>output=300 × $10.00/1M

    par 3 parallel Lua cache writes
        TT->>R: EVAL cacheWriteScript (rate limit tokens)
        Note over R: Lua: TIME → INCR seq →<br/>ZADD to 4 sorted sets<br/>(minute/hour/day/month)
        TT->>R: EVAL cacheWriteScript (spend limit tokens)
        TT->>R: EVAL costCacheWriteScript (spend limit cost)
    end
    R-->>TT: Redis TIME timestamps

    TT->>PG: buffer record (flush at 100 records or 1s)
    Note over PG: Batch COPY protocol<br/>(10x faster than INSERT)

    TT-->>C: 200 OK + response
```

Sequence 2: Spend Limit Exceeded (429)

```mermaid
sequenceDiagram
    participant C as Client
    participant SL as Spend Limiter (Go)
    participant R as Redis (Lua)

    C->>SL: POST /v1/chat/completions

    Note over SL: CEL match → key = "user-123"

    SL->>R: EVAL costCacheReadScript {hash_tag} 2592000
    Note over R: Lua: TIME → ZRANGEBYSCORE<br/>→ SUM loop → return 520
    R-->>SL: 520 (i.e. $5.20)

    Note over SL: Go compares: 520 >= 500 limit<br/>→ BLOCKED (action: "block")

    SL-->>C: 429 Too Many Requests
    Note right of C: Headers:<br/>SpendLimit-Policy: cost_per_month_cents=500<br/>SpendLimit: cost_per_month_cents=520<br/>Retry-After: 86400
```

Sequence 3: Cache Miss → PostgreSQL Fallback

```mermaid
sequenceDiagram
    participant SL as Spend Limiter (Go)
    participant CQ as CachedQuerier (Go)
    participant R as Redis (Lua)
    participant PG as PostgreSQL

    SL->>CQ: QueryCostByExpression()
    CQ->>R: EVAL costCacheReadScript
    Note over R: Lua: ZRANGEBYSCORE returns<br/>0 entries → return nil
    R-->>CQ: nil (cache miss)

    Note over CQ: Cache miss → fallback to PostgreSQL

    CQ->>PG: SELECT SUM(total_cost_cents)<br/>FROM llm_token_usage<br/>WHERE spend_key = $1<br/>AND spend_limit_expression = $2<br/>AND timestamp >= NOW() - interval
    PG-->>CQ: 350

    CQ-->>SL: 350 ($3.50)
    Note over SL: Go compares: 350 < 500 → ALLOWED
```
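The read-aside fallback can be sketched as below; the `costReader` closures stand in for the real Redis and PostgreSQL queriers, and the error value is a simplification of the Lua script's nil result:

```go
package main

import (
	"errors"
	"fmt"
)

// errCacheMiss stands in for the nil result the Lua script returns
// when the sorted set has no entries in the window.
var errCacheMiss = errors.New("cache miss")

// costReader abstracts a cost lookup keyed by spend key.
type costReader func(key string) (int64, error)

// queryCostCacheFirst sketches the read-aside pattern: try Redis
// first, and on a miss (or any cache error) fall back to PostgreSQL.
func queryCostCacheFirst(cache, db costReader, key string) (int64, error) {
	if cents, err := cache(key); err == nil {
		return cents, nil
	}
	return db(key) // durable store is the authoritative fallback
}

func main() {
	cache := func(string) (int64, error) { return 0, errCacheMiss }
	db := func(string) (int64, error) { return 350, nil }
	cents, _ := queryCostCacheFirst(cache, db, "user-123")
	fmt.Println(cents) // cache missed, so the value came from PostgreSQL
}
```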

What Runs Where

```mermaid
graph LR
    subgraph "Go (Gateway Process)"
        CEL["CEL Expression<br/>Evaluation"]
        CMP["Limit Comparison<br/>totalCost >= limit"]
        COST["Cost Calculation<br/>tokens × price/1M"]
        BATCH["Batch Buffering<br/>100 records / 1s"]
    end

    subgraph "Redis (Lua Scripts, Atomic)"
        TIME["TIME command<br/>(clock source)"]
        ZADD_W["ZADD + EXPIRE<br/>(write to 4 windows)"]
        ZRANGE["ZRANGEBYSCORE<br/>(range query)"]
        SUM["SUM loop<br/>(aggregate in Lua)"]
        SEQ["INCR seq<br/>(collision prevention)"]
    end

    subgraph "PostgreSQL"
        COPY["COPY protocol<br/>(batch insert)"]
        AGG["SUM() + NOW()<br/>(fallback aggregation)"]
    end

    CEL --> CMP
    TIME --> ZADD_W
    TIME --> ZRANGE --> SUM --> CMP
    SEQ --> ZADD_W
    COST --> ZADD_W
    BATCH --> COPY
    SUM -.->|cache miss| AGG --> CMP
```

Write Architecture

Redis (immediate, for fast queries)

```mermaid
graph LR
    TT["Token Tracker<br/>Middleware"] -->|"Lua: TIME → INCR seq →<br/>ZADD to 4 sorted sets"| SETS

    subgraph SETS ["Sorted Sets per {hash_tag}"]
        M["token:usage:...:<b>minute</b><br/>TTL: 120s"]
        H["token:usage:...:<b>hour</b><br/>TTL: 7,200s"]
        D["token:usage:...:<b>day</b><br/>TTL: 172,800s"]
        Mo["token:usage:...:<b>month</b><br/>TTL: 5,184,000s"]
    end
```

PostgreSQL (batched, durable)

```mermaid
graph LR
    TT["Token Tracker"] -->|"append"| BUF["In-Memory Buffer"]
    BUF -->|"batch_size=100<br/>OR timeout=1s"| COPY["COPY Protocol<br/>(10x faster than INSERT)"]
    COPY --> PG["llm_token_usage table"]
```
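The buffer-then-COPY behavior can be sketched as a size-triggered batcher. This is a simplification: the 1s ticker-driven timeout flush is elided, and plain strings stand in for `TokenUsage` records:

```go
package main

import "fmt"

// batcher flushes when the buffer reaches batchSize, mirroring the
// batch_size=100 setting; a real implementation would also run a
// ticker to enforce the 1s batch_timeout.
type batcher struct {
	buf       []string // stand-in for *TokenUsage records
	batchSize int
	flush     func([]string) // stand-in for the COPY write
}

func (b *batcher) add(rec string) {
	b.buf = append(b.buf, rec)
	if len(b.buf) >= b.batchSize {
		b.flushNow()
	}
}

func (b *batcher) flushNow() {
	if len(b.buf) == 0 {
		return // nothing buffered, skip the round trip
	}
	b.flush(b.buf)
	b.buf = nil
}

func main() {
	total := 0
	b := &batcher{batchSize: 3, flush: func(recs []string) { total += len(recs) }}
	for i := 0; i < 7; i++ {
		b.add(fmt.Sprintf("rec-%d", i))
	}
	b.flushNow() // stand-in for the timeout-driven flush
	fmt.Println(total)
}
```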

Redis Cache Key Format

```
token:usage:{rule_id:key:combined_hash}:{window}
token:cost:{rule_id:key:combined_hash}:{window}
token:write:seq:{rule_id:key:combined_hash}          # sequence counter, 60s TTL
```

  • {} = Redis Cluster hash tags — all keys for one entity route to the same slot
  • combined_hash = SHA256(expression + "|" + keyExtractor)[:16]
  • Sorted Set score = Unix timestamp from redis.call('TIME')
  • Usage value: "seq:timestamp:{"input":150,"output":300}"
  • Cost value: "seq:timestamp:cost_cents"
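The `combined_hash` component can be sketched as below, assuming the truncation is taken over the hex-encoded digest:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// combinedHash reproduces the cache-key hash described above:
// SHA256 over "expression|keyExtractor", truncated to 16 characters.
// (Hex encoding of the digest is an assumption of this sketch.)
func combinedHash(expression, keyExtractor string) string {
	sum := sha256.Sum256([]byte(expression + "|" + keyExtractor))
	return hex.EncodeToString(sum[:])[:16]
}

func main() {
	h := combinedHash("model.startsWith('openai/gpt-4')", "request.context.oidc_claims.sub")
	fmt.Println(len(h), h) // always 16 characters, stable per rule
}
```

Because the hash covers both the expression and the key extractor, two rules that happen to extract the same key still get disjoint sorted sets.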

Clock Synchronization Strategy

| Operation | Clock Source | Why |
|---|---|---|
| Cache write | redis.call('TIME') | Single source of truth for cache |
| Cache read | redis.call('TIME') | Consistent window calculation |
| DB write | Timestamp from Redis TIME | Sync between cache and durable store |
| DB read | NOW() (PostgreSQL) | Clock-sync-free queries |

Read Architecture (Cache-First)

```mermaid
graph TD
    Q["CachedQuerier.QueryCostByExpression()"] --> CACHE{"Redis Lua script"}
    CACHE -->|"Hit (~1ms)<br/>TIME + ZRANGEBYSCORE + SUM"| RET["Return aggregated total<br/>to Go for limit comparison"]
    CACHE -->|"Miss (nil result)"| PG["PostgreSQL fallback<br/>SUM() with NOW() window<br/>(~10-50ms)"]
    PG --> RET
```

Lua Read Script (cost aggregation)

```lua
-- All of this executes atomically inside Redis
local time = redis.call('TIME')
local now = tonumber(time[1]) + (tonumber(time[2]) / 1000000)
local since = now - window_seconds

-- Pick smallest sorted set covering the window
if window_seconds <= 60 then key = '...:minute'
elseif window_seconds <= 3600 then key = '...:hour'
elseif window_seconds <= 86400 then key = '...:day'
else key = '...:month' end

local entries = redis.call('ZRANGEBYSCORE', key, since, now)
if #entries == 0 then return nil end  -- cache miss

-- Aggregate inside Redis
local total_cost = 0
for _, entry in ipairs(entries) do
    local cost_str = entry:match('[^:]+:[^:]+:(.*)')
    total_cost = total_cost + tonumber(cost_str)
end
return total_cost  -- Go code then compares this against the limit
```

Go Limit Comparison (outside Redis)

```go
// spendlimit_handler.go:493-505
totalCostCents, err := querier.QueryCostByExpression(ctx, rule.ID, key, expression, rule.KeyExtractor, windowSeconds)
if err != nil {
    return nil // Fail-open: don't block on query errors
}

blocked := totalCostCents >= limitCents  // ← comparison happens in Go
```

Expression-Based Cost Isolation

The same user can have different limits tracked independently:

```mermaid
graph TD
    U["user-123"] --> R1["Rule: gpt4-budget<br/>expr: model.startsWith('openai/gpt-4')<br/>limit: $10/month"]
    U --> R2["Rule: total-budget<br/>expr: true<br/>limit: $100/month"]

    R1 --> C1["Redis: {gpt4-budget:user-123:a1b2c3d4}<br/>Separate sorted sets"]
    R2 --> C2["Redis: {total-budget:user-123:e5f6g7h8}<br/>Separate sorted sets"]

    R1 --> PG1["PG: WHERE spend_key='user-123'<br/>AND spend_limit_expression=<br/>'model.startsWith(...)'"]
    R2 --> PG2["PG: WHERE spend_key='user-123'<br/>AND spend_limit_expression='true'"]
```

The combined_hash = SHA256(expression + "|" + keyExtractor)[:16] ensures each rule gets its own cache keys. PostgreSQL queries filter by both spend_key AND spend_limit_expression.


Cost Calculation

Costs are calculated at write time in the token tracker middleware (not at query time), using in-memory pricing config:

```go
// Token tracker middleware, after capturing response
record.InputCostCents  = (record.InputTokens * pricing.InputTokenPriceCentsPerMillion) / 1_000_000
record.OutputCostCents = (record.OutputTokens * pricing.OutputTokenPriceCentsPerMillion) / 1_000_000
record.TotalCostCents  = record.InputCostCents + record.OutputCostCents + ...
```

Price resolution priority: Custom (per-org) → Standard → Default

Costs are stored as cents in BIGINT, which avoids floating-point precision issues.
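A worked example of the formula above, using round numbers so the integer arithmetic stays whole-cent (sub-cent rounding behavior is an implementation detail not covered here):

```go
package main

import "fmt"

// costCents applies the write-time formula: prices are expressed as
// cents per million tokens, so $2.50/1M tokens is 250 cents/1M.
func costCents(tokens, centsPerMillion int64) int64 {
	return tokens * centsPerMillion / 1_000_000
}

func main() {
	inputCents := costCents(1_000_000, 250)  // 1M input tokens at $2.50/1M
	outputCents := costCents(500_000, 1_000) // 500K output tokens at $10.00/1M
	fmt.Println(inputCents, outputCents, inputCents+outputCents) // 250 500 750
}
```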


PostgreSQL Schema

```sql
CREATE TABLE llm_token_usage (
    id                             BIGSERIAL PRIMARY KEY,
    timestamp                      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    request_id                     TEXT NOT NULL,
    frontend_id                    TEXT,
    backend_id                     TEXT,
    -- Spend limiting (CEL-based)
    spend_limit_rule_id            TEXT,
    spend_key                      TEXT,
    spend_key_extractor            TEXT,
    spend_limit_expression         TEXT,
    request_context                JSONB,
    -- Model
    model_id                       TEXT,
    provider                       TEXT,
    -- Tokens
    input_tokens                   BIGINT,
    output_tokens                  BIGINT,
    total_tokens                   BIGINT,
    -- Costs (cents, BIGINT for precision)
    input_cost_cents               BIGINT,
    output_cost_cents              BIGINT,
    cached_input_read_cost_cents   BIGINT,
    cached_input_write_cost_cents  BIGINT,
    unit_cost_cents                BIGINT,
    total_cost_cents               BIGINT NOT NULL DEFAULT 0,
    -- Metadata
    path                           TEXT,
    method                         TEXT,
    status_code                    BIGINT,
    streaming                      BOOLEAN,
    duration_ms                    BIGINT,
    created_at                     TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
```

Key Indexes

| Index | Columns | Purpose |
|---|---|---|
| idx_..._timestamp | timestamp DESC | Time-range queries |
| idx_..._request_id | request_id | Debugging / audit |
| idx_..._provider_model | provider, model_id, timestamp DESC | Spending by provider/model |
| idx_..._spend_key_cost | spend_key, timestamp DESC, total_cost_cents (partial) | Cost aggregation |
| idx_..._spend_enforcement | spend_key, spend_limit_expression, timestamp DESC, total_cost_cents (partial) | Per-rule isolation |
| idx_..._context_gin | request_context (GIN) | Flexible JSONB queries |

Fail-Open Design

Every error is logged but never blocks requests:

| Failure | Behavior |
|---|---|
| Cache write fails | Logged as warning, request continues |
| Cache read fails | Falls back to PostgreSQL |
| PostgreSQL write fails | Logged, response already returned |
| PostgreSQL read fails | Logged, spend check passes (fail-open) |
| Token extraction fails | Logged as debug, no record written |
| Cost calculation fails | Logged, record written without cost |

Configuration

```yaml
spend_limiter:
  enabled: true
  token_tracker:
    storage: postgresql
    postgresql:
      storage_id: "postgres-main"
      batch_size: 100
      batch_timeout: 1000          # ms
    async: true
    cache:
      enabled: true
      storage_id: "redis-main"

spend_limits:
  - frontend_id: "main"
    rules:
      - id: "free-tier"
        name: "Free Tier - $5/month"
        expression: "request.context.oidc_claims.tier == 'free'"
        key_extractor: "request.context.oidc_claims.sub"
        tokens_per_minute: 10000
        cost_per_month_cents: 500
        action: "block"             # "block" | "warn" | "throttle"
```

Key Files

| File | Role |
|---|---|
| internal/tokentracker/interfaces.go | Core interfaces and TokenUsage struct |
| internal/tokentracker/extractor.go | Token extraction from LLM responses (gjson) |
| internal/tokentracker/cached_querier.go | Redis cache layer with Lua scripts |
| internal/tokentracker/querier_postgresql.go | PostgreSQL querier |
| internal/tokentracker/writer_postgresql.go | Batched COPY writer |
| internal/frontend/middleware/tokentracker/ | HTTP middleware (capture + record) |
| internal/frontend/middleware/spendlimit/ | Spend limit enforcement middleware |
| internal/cel/hash.go | Expression hash for cache key isolation |
| internal/tokentracker/postgresql/sql/queries.sql | All SQL query definitions |
| internal/tokentracker/postgresql/migrations/ | Schema migrations |