The token tracker is a shared infrastructure layer that captures LLM token usage from provider responses and feeds two consumers: the rate limiter (tokens/time) and the spend limiter (cost/time). It uses a write-through cache pattern with Redis for fast aggregation and PostgreSQL for durable storage.
```mermaid
graph TB
    subgraph "Middleware Chain"
        OIDC["OIDC Auth"]
        RC["reqcontext<br/>(parse body once)"]
        SL["Spend Limiter<br/>(CEL evaluation + limit check)"]
        RL["Rate Limiter"]
        RT["Router"]
        TT["Token Tracker<br/>(captures response)"]
    end
    subgraph "Storage Layer"
        Redis["Redis<br/>(Sorted Sets + Lua scripts)<br/>Aggregation happens here"]
        PG["PostgreSQL<br/>(llm_token_usage)<br/>Durable storage + fallback"]
        Kafka["Redpanda/Kafka<br/>(optional stream)"]
    end
    subgraph "Pricing"
        MP["Model Pricing Config<br/>(in-memory lookup)"]
    end
    OIDC --> RC --> SL --> RL --> RT
    RT -->|response| TT
    SL -->|"QueryCostByExpression()<br/>→ Lua aggregates in Redis<br/>→ Go compares against limit"| Redis
    Redis -.->|cache miss| PG
    TT -->|"WriteToCache() (Lua ZADD)"| Redis
    TT -->|"WriteTokenUsage() (batch COPY)"| PG
    TT -.->|optional| Kafka
    TT -->|"calculateCosts()"| MP
```
```go
// Writer — records token usage (PostgreSQL batch, Redpanda/Kafka stream)
type Writer interface {
	WriteTokenUsage(ctx context.Context, usage *TokenUsage) error
	Close(ctx context.Context) error
}

// Querier — reads aggregated usage (PostgreSQL direct, or CachedQuerier with Redis)
type Querier interface {
	QueryTokenUsage(ctx context.Context, key string, window time.Duration) (input, output int64, err error)
	QueryCost(ctx context.Context, key string, windowSeconds int) (totalCostCents int64, err error)
	QueryCostByExpression(ctx context.Context, ruleID, key, expression, keyExtractor string, windowSeconds int) (int64, error)
	QueryTokenUsageByExpression(ctx context.Context, ruleID, key, expression, keyExtractor string, window time.Duration) (int64, int64, error)
}
```

| Interface | Implementation | Description |
|---|---|---|
| `Writer` | `PostgreSQLWriter` | Batch COPY protocol writes (100 records / 1s flush) |
| `Writer` | `RedpandaWriter` | Kafka/Redpanda streaming writes (franz-go) |
| `Querier` | `PostgreSQLQuerier` | Direct SQL against `llm_token_usage` |
| `Querier` | `CachedQuerier` | Redis cache wrapping any `Querier` (read-aside) |
```mermaid
sequenceDiagram
    box Request Path (outer → inner)
        participant C as Client
        participant OIDC as OIDC
        participant RC as reqcontext
        participant SL as Spend Limiter
        participant RL as Rate Limiter
        participant B as Backend (LLM)
    end
    Note over C,B: REQUEST DIRECTION →
    C->>OIDC: POST /v1/chat/completions
    OIDC->>RC: validate JWT, store claims
    RC->>SL: parse body once, store model/provider
    SL->>RL: evaluate CEL, check limits, store keys
    RL->>B: forward to LLM provider
    Note over C,B: ← RESPONSE DIRECTION
    B-->>RL: LLM response
    RL-->>SL: (pass through)
    Note over SL: Token Tracker captures here<br/>(innermost middleware)
    SL-->>RC: record usage → Redis + PostgreSQL
    RC-->>OIDC: (pass through)
    OIDC-->>C: 200 OK + response
```
The spend limiter wraps the token tracker middleware:
- Request: spend limiter evaluates CEL, checks limits, stores keys in context
- Response: token tracker captures tokens, reads keys from context, records usage
Key detail: Redis Lua scripts handle aggregation atomically (ZRANGEBYSCORE + a SUM loop). The limit comparison (`totalCostCents >= limitCents`) happens in Go after Redis returns the aggregate.
```mermaid
sequenceDiagram
    participant C as Client
    participant SL as Spend Limiter (Go)
    participant B as Backend (LLM)
    participant TT as Token Tracker (Go)
    participant R as Redis (Lua)
    participant PG as PostgreSQL
    C->>SL: POST /v1/chat/completions
    Note over SL: 1. Evaluate CEL match expression<br/>"oidc_claims.tier == 'free'"
    Note over SL: 2. Extract key via CEL<br/>"oidc_claims.sub" → "user-123"
    SL->>R: EVAL costCacheReadScript {hash_tag} 2592000
    Note over R: Lua executes atomically:<br/>1. TIME → get Redis clock<br/>2. ZRANGEBYSCORE → entries in window<br/>3. SUM loop → aggregate cost_cents<br/>4. return total (or nil if empty)
    R-->>SL: 350 (i.e. $3.50)
    Note over SL: 3. Go compares: 350 < 500 limit<br/>→ ALLOWED
    Note over SL: 4. Store keys in context:<br/>SpendLimitRuleID, SpendKey,<br/>SpendKeyExtractor, SpendLimitExpression
    SL->>B: forward request
    B-->>TT: LLM response (tokens in body)
    Note over TT: 5. Extract tokens<br/>(context-first, then gjson parse)
    Note over TT: 6. Calculate cost from pricing config<br/>input=150 × $2.50/1M<br/>output=300 × $10.00/1M
    par 3 parallel Lua cache writes
        TT->>R: EVAL cacheWriteScript (rate limit tokens)
        Note over R: Lua: TIME → INCR seq →<br/>ZADD to 4 sorted sets<br/>(minute/hour/day/month)
        TT->>R: EVAL cacheWriteScript (spend limit tokens)
        TT->>R: EVAL costCacheWriteScript (spend limit cost)
    end
    R-->>TT: Redis TIME timestamps
    TT->>PG: buffer record (flush at 100 records or 1s)
    Note over PG: Batch COPY protocol<br/>(10x faster than INSERT)
    TT-->>C: 200 OK + response
```
```mermaid
sequenceDiagram
    participant C as Client
    participant SL as Spend Limiter (Go)
    participant R as Redis (Lua)
    C->>SL: POST /v1/chat/completions
    Note over SL: CEL match → key = "user-123"
    SL->>R: EVAL costCacheReadScript {hash_tag} 2592000
    Note over R: Lua: TIME → ZRANGEBYSCORE<br/>→ SUM loop → return 520
    R-->>SL: 520 (i.e. $5.20)
    Note over SL: Go compares: 520 >= 500 limit<br/>→ BLOCKED (action: "block")
    SL-->>C: 429 Too Many Requests
    Note right of C: Headers:<br/>SpendLimit-Policy: cost_per_month_cents=500<br/>SpendLimit: cost_per_month_cents=520<br/>Retry-After: 86400
```
```mermaid
sequenceDiagram
    participant SL as Spend Limiter (Go)
    participant CQ as CachedQuerier (Go)
    participant R as Redis (Lua)
    participant PG as PostgreSQL
    SL->>CQ: QueryCostByExpression()
    CQ->>R: EVAL costCacheReadScript
    Note over R: Lua: ZRANGEBYSCORE returns<br/>0 entries → return nil
    R-->>CQ: nil (cache miss)
    Note over CQ: Cache miss → fallback to PostgreSQL
    CQ->>PG: SELECT SUM(total_cost_cents)<br/>FROM llm_token_usage<br/>WHERE spend_key = $1<br/>AND spend_limit_expression = $2<br/>AND timestamp >= NOW() - interval
    PG-->>CQ: 350
    CQ-->>SL: 350 ($3.50)
    Note over SL: Go compares: 350 < 500 → ALLOWED
```
```mermaid
graph LR
    subgraph "Go (Gateway Process)"
        CEL["CEL Expression<br/>Evaluation"]
        CMP["Limit Comparison<br/>totalCost >= limit"]
        COST["Cost Calculation<br/>tokens × price/1M"]
        BATCH["Batch Buffering<br/>100 records / 1s"]
    end
    subgraph "Redis (Lua Scripts, Atomic)"
        TIME["TIME command<br/>(clock source)"]
        ZADD_W["ZADD + EXPIRE<br/>(write to 4 windows)"]
        ZRANGE["ZRANGEBYSCORE<br/>(range query)"]
        SUM["SUM loop<br/>(aggregate in Lua)"]
        SEQ["INCR seq<br/>(collision prevention)"]
    end
    subgraph "PostgreSQL"
        COPY["COPY protocol<br/>(batch insert)"]
        AGG["SUM() + NOW()<br/>(fallback aggregation)"]
    end
    CEL --> CMP
    TIME --> ZADD_W
    TIME --> ZRANGE --> SUM --> CMP
    SEQ --> ZADD_W
    COST --> ZADD_W
    BATCH --> COPY
    SUM -.->|cache miss| AGG --> CMP
```
```mermaid
graph LR
    TT["Token Tracker<br/>Middleware"] -->|"Lua: TIME → INCR seq →<br/>ZADD to 4 sorted sets"| SETS
    subgraph SETS ["Sorted Sets per {hash_tag}"]
        M["token:usage:...:<b>minute</b><br/>TTL: 120s"]
        H["token:usage:...:<b>hour</b><br/>TTL: 7,200s"]
        D["token:usage:...:<b>day</b><br/>TTL: 172,800s"]
        Mo["token:usage:...:<b>month</b><br/>TTL: 5,184,000s"]
    end
```
```mermaid
graph LR
    TT["Token Tracker"] -->|"append"| BUF["In-Memory Buffer"]
    BUF -->|"batch_size=100<br/>OR timeout=1s"| COPY["COPY Protocol<br/>(10x faster than INSERT)"]
    COPY --> PG["llm_token_usage table"]
```
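A minimal sketch of the size-or-timeout flush policy, assuming nothing beyond what the diagram states; `flushFn` stands in for the COPY-protocol batch write, and the timer stands in for the 1s timeout path:

```go
package main

import (
	"fmt"
	"time"
)

type record struct{ requestID string }

// batcher buffers records and flushes when the buffer reaches batchSize;
// a caller-driven timer covers the timeout case for partial batches.
type batcher struct {
	buf       []record
	batchSize int
	flushFn   func([]record) // real writer: one COPY FROM per batch
}

func (b *batcher) add(r record) {
	b.buf = append(b.buf, r)
	if len(b.buf) >= b.batchSize {
		b.flush()
	}
}

func (b *batcher) flush() {
	if len(b.buf) == 0 {
		return
	}
	b.flushFn(b.buf)
	b.buf = nil
}

func main() {
	b := &batcher{batchSize: 3, flushFn: func(rs []record) { fmt.Println("flushed", len(rs)) }}
	for i := 0; i < 4; i++ {
		b.add(record{fmt.Sprintf("req-%d", i)}) // third add triggers a size flush
	}
	<-time.After(10 * time.Millisecond) // stands in for the 1s batch_timeout
	b.flush()                           // timeout flush of the remaining record
}
```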
```
token:usage:{rule_id:key:combined_hash}:{window}
token:cost:{rule_id:key:combined_hash}:{window}
token:write:seq:{rule_id:key:combined_hash}   # sequence counter, 60s TTL
```

- `{}` = Redis Cluster hash tags — all keys for one entity route to the same slot
- `combined_hash` = `SHA256(expression + "|" + keyExtractor)[:16]`
- Sorted Set score = Unix timestamp from `redis.call('TIME')`
- Usage value: `seq:timestamp:{"input":150,"output":300}`
- Cost value: `seq:timestamp:cost_cents`
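The key derivation can be sketched as follows. The `[:16]` truncation is read here as 16 hex characters of the SHA-256 digest (an assumption), and `costKey` is an illustrative helper, not the gateway's actual function:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// combinedHash returns the first 16 hex characters of
// SHA256(expression + "|" + keyExtractor).
func combinedHash(expression, keyExtractor string) string {
	sum := sha256.Sum256([]byte(expression + "|" + keyExtractor))
	return hex.EncodeToString(sum[:])[:16]
}

// costKey builds a spend-cost cache key; the {...} portion is a Redis Cluster
// hash tag, so every window for one (rule, key) pair lands on the same slot.
func costKey(ruleID, key, hash, window string) string {
	return fmt.Sprintf("token:cost:{%s:%s:%s}:%s", ruleID, key, hash, window)
}

func main() {
	h := combinedHash("model.startsWith('openai/gpt-4')", "oidc_claims.sub")
	fmt.Println(costKey("gpt4-budget", "user-123", h, "month"))
}
```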
| Operation | Clock Source | Why |
|---|---|---|
| Cache write | `redis.call('TIME')` | Single source of truth for cache |
| Cache read | `redis.call('TIME')` | Consistent window calculation |
| DB write | Timestamp from Redis `TIME` | Sync between cache and durable store |
| DB read | `NOW()` (PostgreSQL) | Clock-sync-free queries |
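The Lua scripts derive sorted-set scores from the `TIME` reply, which returns `[seconds, microseconds]`. The conversion can be shown in Go purely as an illustration of the arithmetic (the real code performs it inside Lua):

```go
package main

import "fmt"

// scoreFromRedisTime converts a redis.call('TIME') reply of
// [seconds, microseconds] into the fractional Unix timestamp
// used as the sorted-set score.
func scoreFromRedisTime(sec, usec int64) float64 {
	return float64(sec) + float64(usec)/1_000_000
}

func main() {
	fmt.Println(scoreFromRedisTime(1_700_000_000, 500_000))
}
```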
```mermaid
graph TD
    Q["CachedQuerier.QueryCostByExpression()"] --> CACHE{"Redis Lua script"}
    CACHE -->|"Hit (~1ms)<br/>TIME + ZRANGEBYSCORE + SUM"| RET["Return aggregated total<br/>to Go for limit comparison"]
    CACHE -->|"Miss (nil result)"| PG["PostgreSQL fallback<br/>SUM() with NOW() window<br/>(~10-50ms)"]
    PG --> RET
```
```lua
-- All of this executes atomically inside Redis
local time = redis.call('TIME')
local now = tonumber(time[1]) + (tonumber(time[2]) / 1000000)
local since = now - window_seconds

-- Pick the smallest sorted set covering the window
if window_seconds <= 60 then key = '...:minute'
elseif window_seconds <= 3600 then key = '...:hour'
elseif window_seconds <= 86400 then key = '...:day'
else key = '...:month' end

local entries = redis.call('ZRANGEBYSCORE', key, since, now)
if #entries == 0 then return nil end  -- cache miss

-- Aggregate inside Redis
local total_cost = 0
for _, entry in ipairs(entries) do
    local cost_str = entry:match('[^:]+:[^:]+:(.*)')
    total_cost = total_cost + tonumber(cost_str)
end
return total_cost  -- Go code then compares this against the limit
```

```go
// spendlimit_handler.go:493-505
totalCostCents, err := querier.QueryCostByExpression(ctx, rule.ID, key, expression, rule.KeyExtractor, windowSeconds)
if err != nil {
	return nil // Fail-open: don't block on query errors
}
blocked := totalCostCents >= limitCents // ← comparison happens in Go
```

The same user can have different limits tracked independently:
```mermaid
graph TD
    U["user-123"] --> R1["Rule: gpt4-budget<br/>expr: model.startsWith('openai/gpt-4')<br/>limit: $10/month"]
    U --> R2["Rule: total-budget<br/>expr: true<br/>limit: $100/month"]
    R1 --> C1["Redis: {gpt4-budget:user-123:a1b2c3d4}<br/>Separate sorted sets"]
    R2 --> C2["Redis: {total-budget:user-123:e5f6g7h8}<br/>Separate sorted sets"]
    R1 --> PG1["PG: WHERE spend_key='user-123'<br/>AND spend_limit_expression=<br/>'model.startsWith(...)'"]
    R2 --> PG2["PG: WHERE spend_key='user-123'<br/>AND spend_limit_expression='true'"]
```
The `combined_hash = SHA256(expression + "|" + keyExtractor)[:16]` ensures each rule gets its own cache keys. PostgreSQL queries filter by both `spend_key` AND `spend_limit_expression`.
Costs are calculated at write time in the token tracker middleware (not at query time), using in-memory pricing config:
```go
// Token tracker middleware, after capturing response
record.InputCostCents = (record.InputTokens * pricing.InputTokenPriceCentsPerMillion) / 1_000_000
record.OutputCostCents = (record.OutputTokens * pricing.OutputTokenPriceCentsPerMillion) / 1_000_000
record.TotalCostCents = record.InputCostCents + record.OutputCostCents + ...
```

Price resolution priority: Custom (per-org) → Standard → Default
Costs are stored as integer cents in BIGINT columns, which avoids floating-point precision issues.
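A worked instance of the write-time formula, using the per-million cent prices from the sequence diagram ($2.50/1M input, $10.00/1M output). Note that plain integer division truncates sub-cent remainders; whether the production code rounds differently is not shown in this section:

```go
package main

import "fmt"

// costCents applies the write-time formula: tokens × price-in-cents-per-million,
// divided by 1M, all in int64 to avoid floating point.
func costCents(tokens, priceCentsPerMillion int64) int64 {
	return (tokens * priceCentsPerMillion) / 1_000_000
}

func main() {
	input := costCents(2_000_000, 250)  // $2.50 per 1M tokens → 250 cents/1M
	output := costCents(300_000, 1_000) // $10.00 per 1M tokens → 1000 cents/1M
	fmt.Println(input, output, input+output) // prints 500 300 800
}
```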
```sql
CREATE TABLE llm_token_usage (
    id BIGSERIAL PRIMARY KEY,
    timestamp TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    request_id TEXT NOT NULL,
    frontend_id TEXT,
    backend_id TEXT,
    -- Spend limiting (CEL-based)
    spend_limit_rule_id TEXT,
    spend_key TEXT,
    spend_key_extractor TEXT,
    spend_limit_expression TEXT,
    request_context JSONB,
    -- Model
    model_id TEXT,
    provider TEXT,
    -- Tokens
    input_tokens BIGINT,
    output_tokens BIGINT,
    total_tokens BIGINT,
    -- Costs (cents, BIGINT for precision)
    input_cost_cents BIGINT,
    output_cost_cents BIGINT,
    cached_input_read_cost_cents BIGINT,
    cached_input_write_cost_cents BIGINT,
    unit_cost_cents BIGINT,
    total_cost_cents BIGINT NOT NULL DEFAULT 0,
    -- Metadata
    path TEXT,
    method TEXT,
    status_code BIGINT,
    streaming BOOLEAN,
    duration_ms BIGINT,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
```

| Index | Columns | Purpose |
|---|---|---|
| `idx_..._timestamp` | `timestamp DESC` | Time-range queries |
| `idx_..._request_id` | `request_id` | Debugging / audit |
| `idx_..._provider_model` | `provider, model_id, timestamp DESC` | Spending by provider/model |
| `idx_..._spend_key_cost` | `spend_key, timestamp DESC, total_cost_cents` (partial) | Cost aggregation |
| `idx_..._spend_enforcement` | `spend_key, spend_limit_expression, timestamp DESC, total_cost_cents` (partial) | Per-rule isolation |
| `idx_..._context_gin` | `request_context` (GIN) | Flexible JSONB queries |
Every error is logged but never blocks requests:
| Failure | Behavior |
|---|---|
| Cache write fails | Logged as warning, request continues |
| Cache read fails | Falls back to PostgreSQL |
| PostgreSQL write fails | Logged, response already returned |
| PostgreSQL read fails | Logged, spend check passes (fail-open) |
| Token extraction fails | Logged as debug, no record written |
| Cost calculation fails | Logged, record written without cost |
```yaml
spend_limiter:
  enabled: true
  token_tracker:
    storage: postgresql
    postgresql:
      storage_id: "postgres-main"
      batch_size: 100
      batch_timeout: 1000 # ms
      async: true
    cache:
      enabled: true
      storage_id: "redis-main"
  spend_limits:
    - frontend_id: "main"
      rules:
        - id: "free-tier"
          name: "Free Tier - $5/month"
          expression: "request.context.oidc_claims.tier == 'free'"
          key_extractor: "request.context.oidc_claims.sub"
          tokens_per_minute: 10000
          cost_per_month_cents: 500
          action: "block" # "block" | "warn" | "throttle"
```

| File | Role |
|---|---|
| `internal/tokentracker/interfaces.go` | Core interfaces and `TokenUsage` struct |
| `internal/tokentracker/extractor.go` | Token extraction from LLM responses (gjson) |
| `internal/tokentracker/cached_querier.go` | Redis cache layer with Lua scripts |
| `internal/tokentracker/querier_postgresql.go` | PostgreSQL querier |
| `internal/tokentracker/writer_postgresql.go` | Batched COPY writer |
| `internal/frontend/middleware/tokentracker/` | HTTP middleware (capture + record) |
| `internal/frontend/middleware/spendlimit/` | Spend limit enforcement middleware |
| `internal/cel/hash.go` | Expression hash for cache key isolation |
| `internal/tokentracker/postgresql/sql/queries.sql` | All SQL query definitions |
| `internal/tokentracker/postgresql/migrations/` | Schema migrations |