
@kika
Last active February 22, 2026 13:20
Qwen 3.5 397B fp8 on Hot Isle
# Benchmark client (run this against the server started below):
uv run vllm bench serve \
  --backend openai-chat --base-url http://localhost:8000 \
  --num-prompts 10 --request-rate 1 --endpoint /v1/chat/completions
# Server: export the model name on its own line so the env-var prefixes
# below apply directly to the serve command. (In the original, the
# commented-out DEBUG line broke the backslash continuation, and `export`
# would have tried to export the `uv run vllm serve` words themselves.)
export MODEL="Qwen/Qwen3.5-397B-A17B-FP8"
# Uncomment for verbose logs by prefixing: VLLM_LOGGING_LEVEL=DEBUG
VLLM_WORKER_MULTIPROC_METHOD=spawn \
PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True,garbage_collection_threshold:0.8,max_split_size_mb:512" \
uv run vllm serve "$MODEL" --port 8000 \
  --tensor-parallel-size 4 --max-model-len 262144 --reasoning-parser qwen3 \
  --language-model-only --kv-cache-dtype fp8 \
  --max-num-batched-tokens 8192 --gpu-memory-utilization 0.85
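Before running the benchmark, it can help to smoke-test the endpoint by hand. A minimal sketch, assuming the serve command above is up on localhost:8000 and the model is registered under its Hugging Face name (the actual curl call is left commented out so the snippet is safe to paste anywhere):

```shell
# Build a minimal OpenAI-style chat request for the served model.
PAYLOAD='{"model":"Qwen/Qwen3.5-397B-A17B-FP8","messages":[{"role":"user","content":"Say hello in one word."}],"max_tokens":16}'

# Uncomment once the server is up:
# curl -s http://localhost:8000/v1/chat/completions \
#   -H 'Content-Type: application/json' \
#   -d "$PAYLOAD"

# Show the request body we would send.
echo "$PAYLOAD"
```

If the curl returns a JSON body with a `choices` array, the server is healthy and the bench command above should run cleanly.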