Gist by @hmseeb, created February 20, 2026 17:55

PersonaPlex Voice Pipeline — Status Snapshot (2026-02-20)

Architecture

Cascaded STT → LLM → TTS voice pipeline on LiveKit Agents SDK v1.4.2

Hardware

  • Current: RunPod RTX A5000 (24GB VRAM), Pod ID: hww87ivbmxgcvq
  • VRAM usage: ~16GB / 24.6GB with all services running

Stack

| Component | Implementation | Serving | VRAM |
| --- | --- | --- | --- |
| STT | faster-whisper large-v3-turbo | In-process (CUDA, int8_float16) | ~2.3GB |
| LLM | Qwen 3 8B AWQ | SGLang (`--mem-fraction-static 0.40`) | ~8.7GB |
| TTS | Orpheus 3B Q8_0 GGUF | llama.cpp server + Orpheus-FastAPI | ~4.3GB + 0.7GB |
| VAD | Silero VAD | CPU | <1MB |
| Turn Detection | LiveKit MultilingualModel | CPU | ~50MB |

Pipeline Latency (warm, benchmarked 2026-02-20)

Individual Components

| Component | Latency | Notes |
| --- | --- | --- |
| VAD silence detection | ~300ms | min_silence_duration=0.3s |
| STT (faster-whisper) | 162-230ms | Batch mode, language="en" forced |
| Turn detection | 67-113ms | unlikely_threshold=0.05; was ~1100ms at 0.65 |
| LLM TTFT (SGLang) | 15-18ms | Qwen 3 8B AWQ, RadixAttention |
| TTS TTFA (Orpheus) | 245-363ms | 245ms isolated; 346-363ms under GPU contention |
| Network/WebRTC jitter | ~100-200ms | RunPod Canada → user |
| Total perceived | ~950ms-1.2s | From user stops speaking to first audio heard |
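The total can be sanity-checked by summing per-component midpoints. This is a rough serial-sum budget, not a measurement — some stages overlap in practice (e.g. VAD and turn detection run concurrently), so treat it as an upper-bound style estimate:

```python
# Back-of-envelope check of the perceived-latency budget.
# Midpoints taken from the table above; components partially overlap
# in the real pipeline, so the serial sum slightly overstates latency.
components_ms = {
    "vad_silence": 300,               # min_silence_duration=0.3s
    "stt_batch": (162 + 230) / 2,
    "turn_detection": (67 + 113) / 2,
    "llm_ttft": (15 + 18) / 2,
    "tts_ttfa": (245 + 363) / 2,
    "network_jitter": (100 + 200) / 2,
}

total_ms = sum(components_ms.values())
print(f"serial-sum estimate: {total_ms:.0f}ms")
```

The sum lands inside the observed ~950ms-1.2s window, which suggests the component table accounts for essentially all of the perceived delay.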

Key Optimizations Applied

  1. SNAC `min_frames_first` fix: changed from 7 → 14 in `speechpipe.py`. Root cause: SNAC outputs 2048 samples per frame, so the `[2048:4096]` slice needs ≥2 buffered frames. Cut TTS TTFA from 475ms → 245ms.
  2. Turn detection threshold: changed `TURN_UNLIKELY_THRESHOLD` from 0.65 → 0.05. The MultilingualModel English default is 0.0289. Cut turn detection from ~1100ms → 67-113ms.
  3. Max endpointing delay: reduced from 1.5s → 1.0s.
  4. Filler phrases: trigger at 300ms (was 500ms), with more natural phrasing.
  5. SNAC auto-warmup: added to `start-native.sh` to eliminate the 8.5s cold start.
  6. STT language forcing: `language="en"` saves ~110ms vs auto-detect.
  7. STT compute_type: `int8_float16` saves ~1GB VRAM vs float16 at the same latency.
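The SNAC fix (item 1) comes down to slice arithmetic. A minimal illustration — the buffering is simplified, and the 7-codec-tokens-per-frame figure is an assumption inferred from the 7 → 14 change (two frames' worth of tokens):

```python
# SNAC decodes 2048 audio samples per frame, and speechpipe.py takes the
# slice [2048:4096] of the sample buffer (skipping the first frame). The
# buffer therefore needs at least 2 frames before that slice yields a
# full 2048-sample chunk.
FRAME_SAMPLES = 2048
TOKENS_PER_FRAME = 7   # assumption: one SNAC frame per 7 codec tokens
MIN_FRAMES_FIRST = 2 * TOKENS_PER_FRAME  # 14, matching the applied fix

def first_chunk(num_frames: int) -> int:
    """Return how many samples the [2048:4096] slice yields."""
    buffer = [0] * (num_frames * FRAME_SAMPLES)  # stand-in for decoded audio
    return len(buffer[2048:4096])

print(first_chunk(1))  # 1 frame buffered: slice is empty, no audio emitted
print(first_chunk(2))  # 2 frames buffered: full 2048-sample chunk
```

With the old value of 7 (one frame), the first slice came back empty and playback stalled until more frames arrived, which is where the extra ~230ms of TTFA went.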

Current Configuration (agent/config.py)

```python
# Turn detection
TURN_UNLIKELY_THRESHOLD = 0.05      # Was 0.65
MAX_ENDPOINTING_DELAY = 1.0         # Was 1.5
MIN_ENDPOINTING_DELAY = 0.3

# VAD
VAD_MIN_SILENCE_DURATION = 0.3
VAD_ACTIVATION_THRESHOLD = 0.5

# Filler
FILLER_DELAY_MS = 300               # Was 500
FILLER_PHRASES = [
    "Mm, let me see...",
    "Ah, sure...",
    "Um...",
    "Right, so...",
    "Yeah, one sec...",
    "Let me check on that...",
]

# Interruption
MIN_INTERRUPTION_DURATION = 0.5
FALSE_INTERRUPTION_TIMEOUT = 3.0

# STT
WHISPER_MODEL = "deepdml/faster-whisper-large-v3-turbo-ct2"
WHISPER_DEVICE = "cuda"
WHISPER_COMPUTE_TYPE = "int8_float16"

# LLM
SGLANG_MODEL = "Qwen/Qwen3-8B-AWQ"
SGLANG_URL = "http://localhost:30000/v1"

# TTS
ORPHEUS_API_URL = "http://localhost:5005/v1"
ORPHEUS_VOICE = "tara"
ORPHEUS_USE_OPENAI_PLUGIN = True    # response_format="pcm"
```
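The filler mechanism (`FILLER_DELAY_MS`) amounts to racing the LLM's first token against a timer. The sketch below is a hypothetical simplification, not the agent's actual code — the real logic lives in the LiveKit agent callbacks — but it shows the intended behavior:

```python
import asyncio
import random

FILLER_DELAY_MS = 300
FILLER_PHRASES = ["Mm, let me see...", "Ah, sure...", "Um..."]

async def slow_llm(latency_s: float) -> str:
    """Stand-in for the LLM call; latency_s simulates time-to-first-token."""
    await asyncio.sleep(latency_s)
    return "Here's the answer."

async def respond_with_filler(llm_task: asyncio.Task) -> list[str]:
    """Speak a filler phrase only if the LLM is slower than the delay."""
    spoken: list[str] = []
    done, _ = await asyncio.wait({llm_task}, timeout=FILLER_DELAY_MS / 1000)
    if not done:  # still waiting past 300ms: cover the gap with a filler
        spoken.append(random.choice(FILLER_PHRASES))
        await llm_task
    spoken.append(llm_task.result())
    return spoken

async def demo(llm_latency_s: float) -> list[str]:
    task = asyncio.ensure_future(slow_llm(llm_latency_s))
    return await respond_with_filler(task)
```

With SGLang's 15-18ms TTFT the timer rarely fires on plain generation; it mainly covers slower paths such as long prompts or tool calls.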

Room Lifecycle Fix

  • delete_room_on_close=True in RoomOptions — fixes zombie agent processes blocking reconnections (SDK issue #3174)
  • room.empty_timeout: 10 in LiveKit server config — safety net for stale rooms

VRAM Breakdown (RTX A5000, 24GB)

```
SGLang (Qwen 3 8B AWQ)            8,652 MiB
llama-server (Orpheus Q8_0)       4,348 MiB
faster-whisper (large-v3-turbo)   2,300 MiB (estimated, loads per-job)
Orpheus-FastAPI (SNAC)              712 MiB
──────────────────────────────────────────
Total                           ~16,012 MiB / 24,564 MiB
Free                             ~8,552 MiB
```
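The totals cross-check directly:

```python
# Cross-check of the VRAM breakdown above (values in MiB, as reported).
usage_mib = {
    "sglang_qwen3_8b_awq": 8652,
    "llama_server_orpheus_q8": 4348,
    "faster_whisper_large_v3_turbo": 2300,  # estimated; loads per-job
    "orpheus_fastapi_snac": 712,
}
total_mib = sum(usage_mib.values())
card_mib = 24564  # RTX A5000 VRAM as reported by the driver

print(total_mib)             # 16012
print(card_mib - total_mib)  # 8552
```

The ~8.5GB of headroom is what keeps TTFA from regressing further under contention; a larger LLM or a bigger `--mem-fraction-static` would eat into it.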

Known Issues

  1. Agent startup takes ~4.5 minutes — forkserver imports ctranslate2/faster_whisper slowly
  2. TTS TTFA regresses under GPU contention — 245ms isolated → 346-363ms when STT+SGLang active
  3. Process memory warnings — agent job process uses ~960MB RAM (faster-whisper model)
  4. Batch STT adds latency — Must wait for user to fully finish speaking before processing
  5. SDK issue #3174 — child processes hang on exit due to non-daemon threads (mitigated by delete_room_on_close)
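For issue 1, one common mitigation (a hypothetical sketch, not something already in this codebase) is to defer the heavy ctranslate2/faster_whisper import until the first transcription job, so forked agent processes start quickly and only the first job pays the import cost:

```python
import importlib
import sys

_whisper_model = None

def get_whisper_model(model_name: str = "deepdml/faster-whisper-large-v3-turbo-ct2"):
    """Load faster-whisper lazily on first use instead of at process start.

    Moves the slow ctranslate2 import off the forkserver startup path;
    the first transcription job absorbs the load time instead.
    """
    global _whisper_model
    if _whisper_model is None:
        faster_whisper = importlib.import_module("faster_whisper")  # deferred heavy import
        _whisper_model = faster_whisper.WhisperModel(
            model_name, device="cuda", compute_type="int8_float16"
        )
    return _whisper_model

# Until the first call, the heavy modules stay unloaded:
print("faster_whisper" in sys.modules)  # False, assuming nothing else imported it
```

Whether this helps depends on where the 4.5 minutes actually go — if most of it is model weight loading rather than module import, lazy importing only shifts the wait.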

Remaining Bottlenecks (in priority order)

  1. Batch STT (~200ms) — Streaming STT would process audio incrementally
  2. TTS TTFA (~245-363ms) — GPU contention makes this worse
  3. VAD silence detection (~300ms) — Necessary, hard to reduce
  4. Network latency (~100-200ms) — Inherent to remote deployment

Pod Services Startup (scripts/start-native.sh)

redis-server → livekit-server → SGLang → llama-server → Orpheus-FastAPI → warmup → agent
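Each stage of this chain depends on the previous service being ready. A minimal readiness helper along these lines (hypothetical — `start-native.sh`'s actual waiting logic may differ) polls the service's TCP port before moving on:

```python
import socket
import time

def wait_for_port(host: str, port: int, timeout_s: float = 120.0) -> bool:
    """Poll until a TCP service accepts connections, or give up after timeout_s."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.5)  # service not up yet; retry shortly
    return False

# Ports from this document's config: redis 6379, SGLang 30000,
# Orpheus-FastAPI 5005. (llama-server's port is not stated here.)
```

Gating each launch on the previous port avoids the race where the agent starts before SGLang has finished loading weights.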

Files Modified (uncommitted)

  • agent/config.py — Turn detection, filler, VAD tuning
  • agent/main.py — RoomOptions(delete_room_on_close=True) import + usage
  • agent/agents/voice_agent.py — Natural speech system prompt
  • .env.example — Updated defaults
  • services/livekit/livekit.yaml — room.empty_timeout
  • scripts/start-native.sh — Auto-warmup section

Pod-Only Changes (not in git)

  • /workspace/Orpheus-FastAPI/tts_engine/speechpipe.py — min_frames_first=14 (was 7)
  • /workspace/personaplex/services/livekit/livekit-native.yaml — room.empty_timeout: 10, redis: localhost:6379