@matthew-gerstman
Created February 19, 2026 21:57
Agent streaming performance research & plan (local dev regression)

Agent Streaming Performance Research

Problem Statement

Agent responses in the Dashboard UI arrive one word at a time with a long delay before the first word. This is a regression over the past few weeks, exclusively in local dev — production/staging are fine.


Architecture Overview

Execution Flow

User sends message
  → API POST /chat
  → Inngest event: obvious/agent.execute
  → step.run('prepare-execution')   — tracing, DB record, emit started event
  → step.run('check-credits')       — 5+ async ops: workspace resolution, feature flags, credit balance
  → step.run('check-queue-N')       — feature flag + queue check
  → step.run('agent-step-N')        — context building + streamText() + iterate fullStream

Streaming Path (inside agent-step)

// step-executor.ts:272-313
const streamResult = streamText({
  model: provider(model),  // Routes through Vercel AI Gateway
  experimental_transform: ai.smoothStream({ delayInMs: 50 }),
  // ...
})

for await (const message of streamResult.fullStream) {
  await this.handleChunk(config.threadId, message)  // AWAITS every chunk
}

Per-Token Event Emission

// step-executor.ts:637-648
if (chunk.type === 'text-delta') {
  this.fullText += chunk.text ?? ''
  return this.eventEmitter?.emitMessageUpdated({
    id: this.currentMessageId,
    fullText: this.fullText,      // FULL accumulated text sent every token
    toolCalls: this.toolCalls,
  }, { ephemeral: true })
}

Event Emission → Redis → SSE → Dashboard

handleChunk(text-delta)
  → AgentEventEmitter.emitMessageUpdated()
  → emitEvent() builds event object with makeId('evt', 16)
  → eventsService.publish(CHANNELS.USER(userId), event)
  → JSON.stringify(eventWithTimestamp)    // serializes FULL accumulated text
  → await redisPublisher.publish(topic, message)  // AWAITED per-token
  → SSE endpoint streams to dashboard
  → EventStreamService.onmessage → React state update

For ephemeral events (text-delta): no DB storage, just Redis PUBLISH. But every token still awaits the publish round-trip.


Root Cause Analysis

Primary: AI Gateway Internet Round-Trip (LOCAL DEV ONLY)

The AI Gateway was introduced ~Feb 8 (PR #3441). All AI calls now route through Vercel's infrastructure:

Local dev path:

Local machine → Internet → Vercel AI Gateway → Anthropic API → Gateway → Internet → Local machine

Production path:

Cloud server → Vercel AI Gateway → Anthropic API → Gateway → Cloud server
(all within the same cloud region; sub-20ms overhead)

In local dev, every request and every SSE chunk traverses a full internet round-trip. This explains:

  • Long delay before first word: TTFT includes gateway routing + internet latency
  • Slow per-chunk delivery: each SSE event proxied through gateway

The direct provider fallback was removed on Feb 18 (PR #7075), but the gateway was already the default path since Feb 8.

Secondary: Per-Token Redis PUBLISH Backpressure

The for await loop awaits handleChunk which awaits Redis PUBLISH for every text-delta. This creates backpressure:

  • Each token waits for Redis to confirm the previous publish before reading the next
  • Even with ~1ms local Redis latency, 2000 tokens = 2 seconds of pure blocking
  • Combined with gateway latency, this compounds
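The blocking arithmetic above can be sketched with a toy model. Everything here (PUBLISH_LATENCY_MS, fakePublish, both loop functions) is invented for illustration and is not the real step-executor code; the token count is scaled down from 2000 to keep the demo quick:

```typescript
// Toy model of the per-token publish backpressure. Each simulated token
// awaits a ~1ms "publish" before the next one is read, so total blocking
// scales linearly with token count.
const PUBLISH_LATENCY_MS = 1
const TOKEN_COUNT = 200 // scaled down from 2000 for a quick demo

const fakePublish = (): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, PUBLISH_LATENCY_MS))

// Current behavior: await every publish inside the read loop.
async function awaitedLoop(): Promise<number> {
  const start = Date.now()
  for (let i = 0; i < TOKEN_COUNT; i++) {
    await fakePublish() // blocks the loop for one round-trip per token
  }
  return Date.now() - start
}

// Proposed behavior for ephemeral deltas: fire-and-forget, flush at the end.
async function fireAndForgetLoop(): Promise<number> {
  const start = Date.now()
  const pending: Promise<void>[] = []
  for (let i = 0; i < TOKEN_COUNT; i++) {
    pending.push(fakePublish().catch(() => {})) // not awaited in the loop
  }
  await Promise.all(pending)
  return Date.now() - start
}
```

At 2000 tokens the awaited loop's floor is token count × latency, which is where the "2 seconds of pure blocking" figure comes from.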

Tertiary: Growing Payload Size (O(n²) Serialization)

Each text-delta sends fullText (entire accumulated response), not just the delta. As the response grows:

  • Token 1: serialize ~10 bytes
  • Token 1000: serialize ~5KB
  • Token 2000: serialize ~10KB

Total serialization for a 2000-token response: ~10MB aggregate. Not the primary bottleneck, but it adds up.
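The aggregate figure is easy to sanity-check; the ~5 bytes/token estimate comes from the bullets above:

```typescript
// Back-of-envelope check of the O(n²) serialization cost: every text-delta
// re-serializes the full accumulated text, at roughly 5 bytes per token.
const BYTES_PER_TOKEN = 5
const TOKENS = 2000

let totalBytes = 0
for (let i = 1; i <= TOKENS; i++) {
  totalBytes += i * BYTES_PER_TOKEN // token i re-serializes i tokens' worth of text
}
// Closed form: 5 * 2000 * 2001 / 2 = 10,005,000 bytes ≈ 10 MB aggregate
```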

Minor: smoothStream 50ms Delay

experimental_transform: ai.smoothStream({ delayInMs: 50 }) adds artificial per-chunk buffering. Small but additive.

Minor: Inngest Dev Server Step Overhead

Each step.run() in the Inngest dev server involves HTTP round-trips for memoization. The three pre-streaming steps (prepare-execution, check-credits, check-queue) add startup latency that is worse against the local dev server than against production Inngest.


Key Files

  • apps/api/src/agents/obvious-v2/execution/step-executor.ts — streaming loop, handleChunk, smoothStream config
  • apps/api/src/agents/obvious-v2/state/events.ts — AgentEventEmitter; builds and publishes events
  • apps/api/src/redis/events.service.ts — Redis publish implementation, ephemeral handling
  • apps/api/src/inngest/obvious-agent-execution.ts — Inngest function with pre-streaming step orchestration
  • apps/api/src/agents/obvious-v2/state/provider.ts — provider initialization (gateway vs direct)
  • apps/api/src/agents/lib/gateway.ts — AI Gateway singleton configuration

Timeline of Relevant Changes

  • Feb 8, PR #3441 — AI Gateway integration with BYOK. Primary cause: adds the internet hop in local dev.
  • Feb 8, PR #6305 — Braintrust SDK upgrade to v2.2.0. Adds tracing wrapping overhead.
  • Feb 8, PR #6530 — Drizzle ORM v1→v2 migration. Potential query perf changes.
  • Feb 10, PR #6684 — Move gateway fallback models into mode registry. Minor impact.
  • Feb 18, PR #7075 — Remove direct-provider fallback, gateway required. Removed the escape hatch.

Proposed Fix

1. Restore Direct Provider Path for Local Dev (Highest Impact)

Re-introduce initializeDirectProvider (removed in PR #7075) as a local-dev-only path. When USE_LOCALSTACK or USE_DIRECT_PROVIDER env var is set, bypass the AI Gateway and call Anthropic directly.

The deleted code is preserved in commit ffd341e648.

2. Fire-and-Forget Ephemeral Text-Delta Publishes

Stop awaiting the Redis publish for text-delta and text-start events. These events are ephemeral and self-correcting (each carries the full accumulated fullText). Add a .catch() for error logging. Keep the await for text-end, tool-input-start, and tool-call.

3. Throttle Text-Delta Event Emission

Batch text-deltas into one Redis publish every ~32ms (~30fps). This cuts publishes from one per token (~2000 per response) to at most ~31 per second of streaming. Remove smoothStream({ delayInMs: 50 }), which becomes redundant.


Expected Impact

  • Direct provider for local dev — eliminates the internet round-trip (~50-200ms off TTFT, plus per-chunk latency). Local dev only.
  • Fire-and-forget deltas — saves ~2-4 seconds on a 2000-token response. All environments.
  • Throttle to 32ms — additional reduction from far fewer publishes. All environments.

Fix Slow Agent Streaming Performance (Local Dev)

Context

Agent responses in the Dashboard UI are arriving one word at a time with a long delay before the first word. This is a regression over the past few weeks, exclusively in local dev. Production/staging are fine.

Primary root cause: The AI Gateway introduction (~Feb 8, PR #3441) routes all AI calls through Vercel's infrastructure. In production this adds negligible latency (same cloud). In local dev, every request and every SSE chunk traverses an internet round-trip (local machine → Vercel Gateway → Anthropic → Gateway → local machine), significantly increasing TTFT and per-chunk delivery time.

Secondary cause: Every text-delta token awaits a Redis PUBLISH before processing the next token, creating backpressure. This compounds the gateway latency.

Commits

  1. perf(agent): restore direct provider path for local dev
  2. perf(agent): fire-and-forget ephemeral text-delta publishes
  3. perf(agent): throttle text-delta emission to 32ms intervals

Changes

1. Restore Direct Provider Path for Local Dev (Highest Impact)

File: apps/api/src/agents/obvious-v2/state/provider.ts

Re-introduce initializeDirectProvider (removed in PR #7075) as a local-dev-only path. When USE_LOCALSTACK or a new USE_DIRECT_PROVIDER env var is set, bypass the AI Gateway and call Anthropic/OpenAI directly using @ai-sdk/anthropic etc.

The deleted code is in commit ffd341e648 — restore the initializeDirectProvider function and the conditional in initializeProvider:

// Use gateway in production, direct provider locally
if (process.env.USE_LOCALSTACK || process.env.USE_DIRECT_PROVIDER) {
  return initializeDirectProvider(providerName, modelName, betas)
}
return initializeWithGateway(providerName, modelName, betas)

File: apps/api/src/agents/obvious-v2/state/provider.ts (types)

Restore the union ProviderFactory type to include direct SDK providers.

2. Fire-and-Forget Ephemeral Text-Delta Publishes

File: apps/api/src/agents/obvious-v2/execution/step-executor.ts

In handleChunk (~lines 637-648), stop awaiting the Redis publish for text-delta and text-start events. Add a .catch() for error logging. These events are ephemeral and self-correcting (each carries the full accumulated fullText).

Keep await for text-end, tool-input-start, tool-call — these trigger DB persistence.
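A minimal sketch of the intended split. SketchEmitter and its publish method are stand-ins for AgentEventEmitter and the Redis publish; the real API may differ:

```typescript
// Sketch: fire-and-forget for ephemeral events, awaited publish for events
// that trigger DB persistence. Names are illustrative, not the real code.
type AgentEvent = { type: string; fullText?: string }

class SketchEmitter {
  published: string[] = []

  // Stand-in for the Redis PUBLISH round-trip.
  private publish(event: AgentEvent): Promise<void> {
    this.published.push(event.type)
    return Promise.resolve()
  }

  handleChunk(event: AgentEvent): Promise<void> | void {
    if (event.type === 'text-delta' || event.type === 'text-start') {
      // Fire-and-forget: self-correcting, each delta carries the full text.
      this.publish(event).catch((err) => console.warn('publish failed', err))
      return
    }
    // text-end / tool-input-start / tool-call still return the promise,
    // so the caller awaits them before DB persistence.
    return this.publish(event)
  }
}
```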

3. Throttle Text-Delta Event Emission

File: apps/api/src/agents/obvious-v2/execution/step-executor.ts

Add a throttled emit that batches text-deltas into one Redis publish every ~32ms (~30fps), cutting publishes from one per token (~2000 per response) to at most ~31 per second of streaming.

  • Add emitThrottledDelta() / flushDelta() methods
  • text-end handler flushes pending delta before persisting final state
  • Remove smoothStream({ delayInMs: 50 }) — redundant with throttled emit
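A sketch of the throttle: the method names (emitThrottledDelta, flushDelta) follow the bullets above, but the implementation details and the in-memory publish stand-in are assumptions:

```typescript
// Sketch of the 32ms trailing-edge throttle. Deltas accumulate locally and
// at most one publish fires per interval; flushDelta() is called from the
// text-end handler so the final state is never dropped.
const THROTTLE_MS = 32 // ~30fps

class ThrottledDeltaEmitter {
  private fullText = ''
  private timer: ReturnType<typeof setTimeout> | null = null
  publishes: string[] = []

  private publish(): void {
    this.publishes.push(this.fullText) // stand-in for the Redis PUBLISH
  }

  emitThrottledDelta(delta: string): void {
    this.fullText += delta
    if (this.timer) return // a publish is already scheduled for this window
    this.timer = setTimeout(() => {
      this.timer = null
      this.publish()
    }, THROTTLE_MS)
  }

  // The text-end handler calls this before persisting the final message.
  flushDelta(): void {
    if (this.timer) {
      clearTimeout(this.timer)
      this.timer = null
    }
    this.publish()
  }
}
```

Because each publish carries the full accumulated text, collapsing many deltas into one window loses nothing; the dashboard just receives fewer, larger updates.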

Files to Modify

  • apps/api/src/agents/obvious-v2/state/provider.ts — restore direct provider path
  • apps/api/src/agents/obvious-v2/execution/step-executor.ts — fire-and-forget deltas, throttling, remove smoothStream

Verification

  1. Run bun obvious up locally
  2. Open Dashboard, start a fresh conversation
  3. Confirm streaming is fast and smooth (not word-by-word)
  4. Verify text-end still persists final message correctly (check DB)
  5. Run bun obvious test --changed
  6. Compare TTFT with and without USE_DIRECT_PROVIDER to confirm gateway is the bottleneck