Gist by @hmseeb, created February 20, 2026 17:55

PersonaPlex Voice Pipeline — Status Snapshot (2026-02-20)

Architecture

Cascaded STT → LLM → TTS voice pipeline on LiveKit Agents SDK v1.4.2

Hardware

  • Current: RunPod RTX A5000 (24GB VRAM), Pod ID: hww87ivbmxgcvq
  • VRAM usage: ~16GB / 24.6GB with all services running

Stack

| Component | Implementation | Serving | VRAM |
| --- | --- | --- | --- |
| STT | faster-whisper large-v3-turbo | In-process (CUDA, int8_float16) | ~2.3GB |
| LLM | Qwen 3 8B AWQ | SGLang (`--mem-fraction-static 0.40`) | ~8.7GB |
| TTS | Orpheus 3B Q8_0 GGUF | llama.cpp server + Orpheus-FastAPI | ~4.3GB + 0.7GB |
| VAD | Silero VAD | CPU | <1MB |
| Turn Detection | LiveKit MultilingualModel | CPU | ~50MB |

Pipeline Latency (warm, benchmarked 2026-02-20)

Individual Components

| Component | Latency | Notes |
| --- | --- | --- |
| VAD silence detection | ~300ms | min_silence_duration=0.3s |
| STT (faster-whisper) | 162-230ms | Batch mode, language="en" forced |
| Turn detection | 67-113ms | unlikely_threshold=0.05; was ~1100ms at 0.65 |
| LLM TTFT (SGLang) | 15-18ms | Qwen 3 8B AWQ, RadixAttention |
| TTS TTFA (Orpheus) | 245-363ms | 245ms isolated; 346-363ms under GPU contention |
| Network/WebRTC jitter | ~100-200ms | RunPod Canada → user |
| Total perceived | ~950ms-1.2s | From user stops speaking to first audio heard |
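The total can be sanity-checked by summing per-component midpoints. This is a rough serial-sum budget, not a measurement — some stages overlap in practice (e.g. VAD and turn detection run concurrently), so treat it as an upper-bound style estimate:

```python
# Back-of-envelope check of the perceived-latency budget.
# Midpoints taken from the table above; components partially overlap
# in the real pipeline, so the serial sum slightly overstates latency.
components_ms = {
    "vad_silence": 300,               # min_silence_duration=0.3s
    "stt_batch": (162 + 230) / 2,
    "turn_detection": (67 + 113) / 2,
    "llm_ttft": (15 + 18) / 2,
    "tts_ttfa": (245 + 363) / 2,
    "network_jitter": (100 + 200) / 2,
}

total_ms = sum(components_ms.values())
print(f"serial-sum estimate: {total_ms:.0f}ms")
```

The sum lands inside the observed ~950ms-1.2s window, which suggests the component table accounts for essentially all of the perceived delay.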

Key Optimizations Applied

  1. SNAC `min_frames_first` fix: changed from 7 → 14 in `speechpipe.py`. Root cause: SNAC outputs 2048 samples per frame, so the `[2048:4096]` slice needs ≥2 buffered frames. Cut TTS TTFA from 475ms → 245ms.
  2. Turn detection threshold: changed `TURN_UNLIKELY_THRESHOLD` from 0.65 → 0.05. The MultilingualModel English default is 0.0289. Cut turn detection from ~1100ms → 67-113ms.
  3. Max endpointing delay: reduced from 1.5s → 1.0s.
  4. Filler phrases: trigger at 300ms (was 500ms), with more natural phrasing.
  5. SNAC auto-warmup: added to `start-native.sh` to eliminate the 8.5s cold start.
  6. STT language forcing: `language="en"` saves ~110ms vs auto-detect.
  7. STT compute_type: `int8_float16` saves ~1GB VRAM vs float16 at the same latency.
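The SNAC fix (item 1) comes down to slice arithmetic. A minimal illustration — the buffering is simplified, and the 7-codec-tokens-per-frame figure is an assumption inferred from the 7 → 14 change (two frames' worth of tokens):

```python
# SNAC decodes 2048 audio samples per frame, and speechpipe.py takes the
# slice [2048:4096] of the sample buffer (skipping the first frame). The
# buffer therefore needs at least 2 frames before that slice yields a
# full 2048-sample chunk.
FRAME_SAMPLES = 2048
TOKENS_PER_FRAME = 7   # assumption: one SNAC frame per 7 codec tokens
MIN_FRAMES_FIRST = 2 * TOKENS_PER_FRAME  # 14, matching the applied fix

def first_chunk(num_frames: int) -> int:
    """Return how many samples the [2048:4096] slice yields."""
    buffer = [0] * (num_frames * FRAME_SAMPLES)  # stand-in for decoded audio
    return len(buffer[2048:4096])

print(first_chunk(1))  # 1 frame buffered: slice is empty, no audio emitted
print(first_chunk(2))  # 2 frames buffered: full 2048-sample chunk
```

With the old value of 7 (one frame), the first slice came back empty and playback stalled until more frames arrived, which is where the extra ~230ms of TTFA went.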

Current Configuration (agent/config.py)

```python
# Turn detection
TURN_UNLIKELY_THRESHOLD = 0.05      # Was 0.65
MAX_ENDPOINTING_DELAY = 1.0         # Was 1.5
MIN_ENDPOINTING_DELAY = 0.3

# VAD
VAD_MIN_SILENCE_DURATION = 0.3
VAD_ACTIVATION_THRESHOLD = 0.5

# Filler
FILLER_DELAY_MS = 300               # Was 500
FILLER_PHRASES = [
    "Mm, let me see...",
    "Ah, sure...",
    "Um...",
    "Right, so...",
    "Yeah, one sec...",
    "Let me check on that...",
]

# Interruption
MIN_INTERRUPTION_DURATION = 0.5
FALSE_INTERRUPTION_TIMEOUT = 3.0

# STT
WHISPER_MODEL = "deepdml/faster-whisper-large-v3-turbo-ct2"
WHISPER_DEVICE = "cuda"
WHISPER_COMPUTE_TYPE = "int8_float16"

# LLM
SGLANG_MODEL = "Qwen/Qwen3-8B-AWQ"
SGLANG_URL = "http://localhost:30000/v1"

# TTS
ORPHEUS_API_URL = "http://localhost:5005/v1"
ORPHEUS_VOICE = "tara"
ORPHEUS_USE_OPENAI_PLUGIN = True    # response_format="pcm"
```
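The filler mechanism (`FILLER_DELAY_MS`) amounts to racing the LLM's first token against a timer. The sketch below is a hypothetical simplification, not the agent's actual code — the real logic lives in the LiveKit agent callbacks — but it shows the intended behavior:

```python
import asyncio
import random

FILLER_DELAY_MS = 300
FILLER_PHRASES = ["Mm, let me see...", "Ah, sure...", "Um..."]

async def slow_llm(latency_s: float) -> str:
    """Stand-in for the LLM call; latency_s simulates time-to-first-token."""
    await asyncio.sleep(latency_s)
    return "Here's the answer."

async def respond_with_filler(llm_task: asyncio.Task) -> list[str]:
    """Speak a filler phrase only if the LLM is slower than the delay."""
    spoken: list[str] = []
    done, _ = await asyncio.wait({llm_task}, timeout=FILLER_DELAY_MS / 1000)
    if not done:  # still waiting past 300ms: cover the gap with a filler
        spoken.append(random.choice(FILLER_PHRASES))
        await llm_task
    spoken.append(llm_task.result())
    return spoken

async def demo(llm_latency_s: float) -> list[str]:
    task = asyncio.ensure_future(slow_llm(llm_latency_s))
    return await respond_with_filler(task)
```

With SGLang's 15-18ms TTFT the timer rarely fires on plain generation; it mainly covers slower paths such as long prompts or tool calls.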

Room Lifecycle Fix

  • delete_room_on_close=True in RoomOptions — fixes zombie agent processes blocking reconnections (SDK issue #3174)
  • room.empty_timeout: 10 in LiveKit server config — safety net for stale rooms

VRAM Breakdown (RTX A5000, 24GB)

```
SGLang (Qwen 3 8B AWQ)            8,652 MiB
llama-server (Orpheus Q8_0)       4,348 MiB
faster-whisper (large-v3-turbo)   2,300 MiB (estimated, loads per-job)
Orpheus-FastAPI (SNAC)              712 MiB
──────────────────────────────────────────
Total                           ~16,012 MiB / 24,564 MiB
Free                             ~8,552 MiB
```
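The totals cross-check directly:

```python
# Cross-check of the VRAM breakdown above (values in MiB, as reported).
usage_mib = {
    "sglang_qwen3_8b_awq": 8652,
    "llama_server_orpheus_q8": 4348,
    "faster_whisper_large_v3_turbo": 2300,  # estimated; loads per-job
    "orpheus_fastapi_snac": 712,
}
total_mib = sum(usage_mib.values())
card_mib = 24564  # RTX A5000 VRAM as reported by the driver

print(total_mib)             # 16012
print(card_mib - total_mib)  # 8552
```

The ~8.5GB of headroom is what keeps TTFA from regressing further under contention; a larger LLM or a bigger `--mem-fraction-static` would eat into it.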

Known Issues

  1. Agent startup takes ~4.5 minutes — forkserver imports ctranslate2/faster_whisper slowly
  2. TTS TTFA regresses under GPU contention — 245ms isolated → 346-363ms when STT+SGLang active
  3. Process memory warnings — agent job process uses ~960MB RAM (faster-whisper model)
  4. Batch STT adds latency — Must wait for user to fully finish speaking before processing
  5. SDK issue #3174 — child processes hang on exit due to non-daemon threads (mitigated by delete_room_on_close)
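For issue 1, one common mitigation (a hypothetical sketch, not something already in this codebase) is to defer the heavy ctranslate2/faster_whisper import until the first transcription job, so forked agent processes start quickly and only the first job pays the import cost:

```python
import importlib
import sys

_whisper_model = None

def get_whisper_model(model_name: str = "deepdml/faster-whisper-large-v3-turbo-ct2"):
    """Load faster-whisper lazily on first use instead of at process start.

    Moves the slow ctranslate2 import off the forkserver startup path;
    the first transcription job absorbs the load time instead.
    """
    global _whisper_model
    if _whisper_model is None:
        faster_whisper = importlib.import_module("faster_whisper")  # deferred heavy import
        _whisper_model = faster_whisper.WhisperModel(
            model_name, device="cuda", compute_type="int8_float16"
        )
    return _whisper_model

# Until the first call, the heavy modules stay unloaded:
print("faster_whisper" in sys.modules)  # False, assuming nothing else imported it
```

Whether this helps depends on where the 4.5 minutes actually go — if most of it is model weight loading rather than module import, lazy importing only shifts the wait.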

Remaining Bottlenecks (in priority order)

  1. Batch STT (~200ms) — Streaming STT would process audio incrementally
  2. TTS TTFA (~245-363ms) — GPU contention makes this worse
  3. VAD silence detection (~300ms) — Necessary, hard to reduce
  4. Network latency (~100-200ms) — Inherent to remote deployment

Pod Services Startup (scripts/start-native.sh)

redis-server → livekit-server → SGLang → llama-server → Orpheus-FastAPI → warmup → agent
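Each stage of this chain depends on the previous service being ready. A minimal readiness helper along these lines (hypothetical — `start-native.sh`'s actual waiting logic may differ) polls the service's TCP port before moving on:

```python
import socket
import time

def wait_for_port(host: str, port: int, timeout_s: float = 120.0) -> bool:
    """Poll until a TCP service accepts connections, or give up after timeout_s."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.5)  # service not up yet; retry shortly
    return False

# Ports from this document's config: redis 6379, SGLang 30000,
# Orpheus-FastAPI 5005. (llama-server's port is not stated here.)
```

Gating each launch on the previous port avoids the race where the agent starts before SGLang has finished loading weights.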

Files Modified (uncommitted)

  • agent/config.py — Turn detection, filler, VAD tuning
  • agent/main.py — RoomOptions(delete_room_on_close=True) import + usage
  • agent/agents/voice_agent.py — Natural speech system prompt
  • .env.example — Updated defaults
  • services/livekit/livekit.yaml — room.empty_timeout
  • scripts/start-native.sh — Auto-warmup section

Pod-Only Changes (not in git)

  • /workspace/Orpheus-FastAPI/tts_engine/speechpipe.py — min_frames_first=14 (was 7)
  • /workspace/personaplex/services/livekit/livekit-native.yaml — room.empty_timeout: 10, redis: localhost:6379