- Current: RunPod RTX A5000 (24GB VRAM), Pod ID: hww87ivbmxgcvq
- VRAM usage: ~16GB / 24.6GB with all services running
## Stack

| Component | Implementation | Serving | VRAM |
| --- | --- | --- | --- |
| STT | faster-whisper large-v3-turbo | In-process (CUDA, `int8_float16`) | ~2.3GB |
| LLM | Qwen 3 8B AWQ | SGLang (`--mem-fraction-static 0.40`) | ~8.7GB |
| TTS | Orpheus 3B Q8_0 GGUF | llama.cpp server + Orpheus-FastAPI | ~4.3GB + 0.7GB |
| VAD | Silero VAD | CPU | <1MB |
| Turn Detection | LiveKit MultilingualModel | CPU | ~50MB |
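The VRAM column can be sanity-checked against the ~16GB total quoted above. A quick back-of-envelope sketch (figures copied from the table; these are approximate measured values, not reserved limits):

```python
# Approximate VRAM budget on the 24.6GB RTX A5000 (values from the table).
vram_gb = {
    "stt_faster_whisper": 2.3,
    "llm_sglang_qwen3_8b_awq": 8.7,
    "tts_llamacpp_orpheus": 4.3,
    "tts_orpheus_fastapi": 0.7,
}

total = sum(vram_gb.values())
headroom = 24.6 - total
print(f"total ~{total:.1f}GB, headroom ~{headroom:.1f}GB")  # ~16.0GB used, ~8.6GB free
```

The ~8.6GB of headroom is what keeps `--mem-fraction-static 0.40` safe: SGLang's static allocation stays well clear of the TTS and STT footprints.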
## Pipeline Latency (warm, benchmarked 2026-02-20)

### Individual Components

| Component | Latency | Notes |
| --- | --- | --- |
| VAD silence detection | ~300ms | `min_silence_duration=0.3s` |
| STT (faster-whisper) | 162-230ms | Batch mode, `language="en"` forced |
| Turn detection | 67-113ms | `unlikely_threshold=0.05`; was ~1100ms at 0.65 |
| LLM TTFT (SGLang) | 15-18ms | Qwen 3 8B AWQ, RadixAttention |
| TTS TTFA (Orpheus) | 245-363ms | 245ms isolated, 346-363ms under GPU contention |
| Network/WebRTC jitter | ~100-200ms | RunPod Canada → user |
| **Total perceived** | **~950ms-1.2s** | From user stops speaking → hears response |
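The total is roughly the sum of the stages above. A back-of-envelope check using range midpoints (assuming the stages run strictly sequentially, which slightly overestimates where stages overlap):

```python
# Midpoints of the measured ranges from the latency table, in ms.
latency_ms = {
    "vad_silence": 300,
    "stt": 196,            # 162-230ms
    "turn_detection": 90,  # 67-113ms
    "llm_ttft": 17,        # 15-18ms
    "tts_ttfa": 304,       # 245-363ms
    "network": 150,        # ~100-200ms
}

total_ms = sum(latency_ms.values())
print(f"estimated perceived latency: ~{total_ms}ms")  # lands inside 950ms-1.2s
```

This also makes the bottlenecks obvious: VAD silence detection and TTS TTFA together account for more than half the budget, while the LLM's TTFT is negligible.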
## Key Optimizations Applied

- **SNAC `min_frames_first` fix:** changed from 7 to 14 in `speechpipe.py`. Root cause: SNAC outputs 2048 samples per frame, so the `[2048:4096]` slice needs ≥2 decoded frames. Cut TTS TTFA from 475ms to 245ms.
- **Turn detection threshold:** changed `TURN_UNLIKELY_THRESHOLD` from 0.65 to 0.05. MultilingualModel's English default is 0.0289. Cut turn detection from ~1100ms to 67-113ms.
- **Max endpointing delay:** reduced from 1.5s to 1.0s.
- **Filler phrases:** trigger at 300ms (was 500ms), with more natural phrasing.
- **SNAC auto-warmup:** added to `start-native.sh` to eliminate an 8.5s cold start.
- **STT language forcing:** `language="en"` saves ~110ms vs auto-detect.
- **STT compute type:** `int8_float16` saves ~1GB VRAM vs `float16` at the same latency.
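The arithmetic behind the SNAC fix is worth spelling out. A minimal sketch, assuming (as the note above implies) that the counter is in SNAC codes at 7 codes per decoded frame:

```python
import math

# SNAC decodes 2048 audio samples per frame, and speechpipe.py keeps the
# [2048:4096] slice of the decoded buffer to avoid decoder edge artifacts.
CODES_PER_FRAME = 7       # assumed 7-codebook layout, one frame per 7 codes
SAMPLES_PER_FRAME = 2048
SLICE_START, SLICE_END = 2048, 4096

# The slice only contains audio once at least SLICE_END samples exist.
frames_needed = math.ceil(SLICE_END / SAMPLES_PER_FRAME)
min_codes_first = frames_needed * CODES_PER_FRAME

# Old value (7 codes = 1 frame = 2048 samples): the [2048:4096] slice was
# empty, so the first chunk produced no audio and TTFA waited a full extra
# decode cycle. New value (14 codes = 2 frames = 4096 samples) fills it.
old_samples = (7 // CODES_PER_FRAME) * SAMPLES_PER_FRAME
assert old_samples <= SLICE_START  # slice would have been empty
```

So 14 is not a tuned magic number; it is the minimum count that makes the first slice non-empty.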
## Current Configuration (`agent/config.py`)

```python
# Turn detection
TURN_UNLIKELY_THRESHOLD = 0.05  # Was 0.65
MAX_ENDPOINTING_DELAY = 1.0     # Was 1.5
MIN_ENDPOINTING_DELAY = 0.3

# VAD
VAD_MIN_SILENCE_DURATION = 0.3
VAD_ACTIVATION_THRESHOLD = 0.5

# Filler
FILLER_DELAY_MS = 300  # Was 500
FILLER_PHRASES = [
    "Mm, let me see...", "Ah, sure...", "Um...",
    "Right, so...", "Yeah, one sec...", "Let me check on that...",
]

# Interruption
MIN_INTERRUPTION_DURATION = 0.5
FALSE_INTERRUPTION_TIMEOUT = 3.0

# STT
WHISPER_MODEL = "deepdml/faster-whisper-large-v3-turbo-ct2"
WHISPER_DEVICE = "cuda"
WHISPER_COMPUTE_TYPE = "int8_float16"

# LLM
SGLANG_MODEL = "Qwen/Qwen3-8B-AWQ"
SGLANG_URL = "http://localhost:30000/v1"

# TTS
ORPHEUS_API_URL = "http://localhost:5005/v1"
ORPHEUS_VOICE = "tara"
ORPHEUS_USE_OPENAI_PLUGIN = True  # response_format="pcm"
```