A step-by-step guide for running the Qwen3-Coder-Next-FP8 model locally via vLLM, connecting Claude Code to it, and verifying end-to-end operation.
Goal: claude -p "what is kubernetes" → coherent answer from a local model
- A Brev account with access to H100 GPU instances
- `brev` CLI installed locally
- `claude` CLI installed locally (or available on the instance)
Note: As of early 2026, `brev create --gpu hyperstack_H100` fails with "instance type not found" because the old CLI does not pass a workspace group ID. Use the API directly instead:
Create an instance via the Brev API (one-time):
# Get your org ID
brev org ls # copy the ID of your active org (format: org-XXXX...)
# Create the instance
TOKEN=$(python3 -c "import json, os; d=json.load(open(os.path.expanduser('~/.brev/credentials.json'))); print(d['access_token'])")
ORG_ID="org-XXXXXXXXXXXXXXXXXXXXXXXXX" # replace with your org ID
curl -sf -X POST \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
"https://brevapi.us-west-2-prod.control-plane.brev.dev/api/organizations/$ORG_ID/workspaces" \
-d '{
"name": "qwen3-h100",
"workspaceGroupId": "shadeform",
"diskStorage": "120Gi",
"instanceType": "hyperstack_H100",
"workspaceVersion": "v1",
"vmOnlyMode": true,
"portMappings": {},
"workspaceTemplateId": "4nbb4lg2s",
"launchJupyterOnStart": false
}' | python3 -c "import sys,json; ws=json.load(sys.stdin).get('workspace',{}); print('Created:', ws.get('id'), ws.get('name'), ws.get('status'))"

Wait for it to be RUNNING:
brev ls   # wait until STATUS = RUNNING (typically 3-5 minutes)

Register the SSH config:
brev refresh

SSH in directly:
ssh -F ~/.brev/ssh_config qwen3-h100-host   # note: shadeform instances use the "-host" suffix

Verify the GPU:
nvidia-smi

Expected: NVIDIA H100 PCIe listed with ~80 GB VRAM.
Disk note: Hyperstack instances have a ~97 GB root disk. The Docker image uses ~30 GB and the model download needs ~80 GB, so use the ephemeral disk (`/ephemeral`, ~750 GB) for the HuggingFace cache; this is handled in Step 2 below.
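Before kicking off the download, it is worth confirming the cache disk actually has room. A minimal sketch (the 90 GiB threshold is an assumption based on the sizes above; the demo call uses `/` only so the snippet runs anywhere, on the instance you would pass `/ephemeral`):

```python
import shutil

def free_gib(path: str) -> float:
    """Return free space at `path` in GiB."""
    return shutil.disk_usage(path).free / 2**30

# The FP8 checkpoint is ~80 GB, so ~90 GiB of headroom is a
# reasonable (assumed) threshold before starting the download.
needed_gib = 90
path = "/"  # use "/ephemeral" on the instance
print(f"{path}: {free_gib(path):.1f} GiB free (want ~{needed_gib} GiB)")
```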
Run in the background (detached). The first run downloads the model (~80 GB FP8) from HuggingFace — this takes 15–20 minutes. Subsequent runs use the local cache (~2–3 min).
Important: Mount to `/ephemeral/huggingface` (the 750 GB ephemeral disk), not `~/.cache/huggingface`. The root disk is only ~97 GB and fills up during the download.
mkdir -p /ephemeral/huggingface
docker run -d --name qwen3 \
--gpus all --ipc=host --network host \
--ulimit memlock=-1 --ulimit stack=67108864 \
-e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
-v /ephemeral/huggingface:/root/.cache/huggingface \
nvcr.io/nvidia/vllm:26.01-py3 \
vllm serve "Qwen/Qwen3-Coder-Next-FP8" \
--served-model-name qwen3-coder-next \
--port 8000 \
--max-model-len 16384 \
--gpu-memory-utilization 0.98 \
--swap-space 16 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--attention-backend flash_attn \
--kv-cache-dtype fp8 \
--max-num-seqs 1 \
--enable-chunked-prefill \
--max-num-batched-tokens 32

| Flag | Why |
|---|---|
| `--max-model-len 16384` | The H100 PCIe has 80 GB; model weights alone use 74.88 GiB. At FP8 KV cache with 0.98 utilization, only 16384 tokens of KV cache fit. Larger values cause OOM. |
| `--gpu-memory-utilization 0.98` | Pushes close to the limit to fit as many KV cache blocks as possible. |
| `--attention-backend flash_attn` | The flashinfer backend causes CUDA graph OOM on H100 PCIe with this model. |
| `--kv-cache-dtype fp8` | Cuts KV cache memory in half vs fp16. |
| `--max-num-seqs 1` | Single-user setup; prevents memory pressure from concurrent sequences. |
| `--enable-chunked-prefill --max-num-batched-tokens 32` | Critical: Qwen3-Coder-Next uses Flash Linear Attention (FLA) with a Triton autotuner. Sequences >32 tokens trigger a BT=64 tile configuration that OOMs during Triton autotuning. Chunked prefill splits any longer prompt into ≤32-token chunks, keeping the tile size at BT=32, which fits within the ~1.8 GB of free memory. |
| `--enable-prefix-caching` | Omitted intentionally; incompatible with chunked prefill for this model. |
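The memory arithmetic behind these flags is simple enough to sanity-check. A sketch (illustrative only; the raw difference is larger than the ~1.8 GB of usable headroom because activation workspace and CUDA graph capture consume the rest):

```python
total_gib = 80.0     # H100 PCIe HBM capacity
util = 0.98          # --gpu-memory-utilization
weights_gib = 74.88  # FP8 weights, as reported in vLLM's startup log

budget_gib = total_gib * util             # memory vLLM is allowed to use
after_weights = budget_gib - weights_gib  # left over, before runtime overhead
print(f"budget {budget_gib:.2f} GiB, after weights {after_weights:.2f} GiB")
# Activation buffers and CUDA graph capture eat most of this,
# leaving roughly 1.8 GB of real KV-cache headroom.
```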
docker logs qwen3 -f 2>&1 | grep -E "Ready|READY|loading|Loading|ERROR|error"

Poll until the model is serving:
until curl -sf http://localhost:8000/v1/models > /dev/null; do
echo "$(date +%H:%M:%S) waiting..."
sleep 15
done
echo "vLLM is ready"

Verify the model is listed:
curl -s http://localhost:8000/v1/models | python3 -c "
import sys, json
d = json.load(sys.stdin)
print('Model:', d['data'][0]['id'])
print('Max context:', d['data'][0]['max_model_len'])
"

Expected output:
Model: qwen3-coder-next
Max context: 16384
Claude Code uses the Anthropic Messages API format (POST /v1/messages).
vLLM speaks OpenAI's format (POST /v1/chat/completions or /v1/responses).
litellm bridges them.
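A rough sketch of the core translation (greatly simplified; the real litellm proxy also handles streaming, tool calls, content blocks, and many other fields):

```python
def anthropic_to_openai(body: dict) -> dict:
    """Map the core fields of an Anthropic /v1/messages request onto an
    OpenAI /v1/chat/completions request. Simplified illustration only."""
    messages = []
    # Anthropic carries the system prompt as a top-level field;
    # OpenAI expects it as the first message.
    if body.get("system"):
        messages.append({"role": "system", "content": body["system"]})
    messages.extend(body.get("messages", []))
    return {
        "model": body["model"],
        "messages": messages,
        "max_tokens": body.get("max_tokens", 1024),
    }

req = anthropic_to_openai({
    "model": "qwen3-coder-next",
    "system": "You are a helpful assistant.",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "what is kubernetes"}],
})
print(req["messages"][0]["role"])  # the system prompt became a message
```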
Install litellm (if not present):
pip install 'litellm[proxy]'

Note: `pip install litellm` (without `[proxy]`) is missing the `backoff` dependency and will fail with `ImportError: Missing dependency No module named 'backoff'`.
Start litellm:
export OPENAI_API_KEY=dummy
export PATH="$HOME/.local/bin:$PATH" # pip installs to ~/.local/bin
nohup litellm \
--model openai/qwen3-coder-next \
--api_base http://localhost:8000/v1 \
--drop_params \
--port 4000 > /tmp/litellm.log 2>&1 &
echo "litellm PID: $!"

The `--drop_params` flag silently drops unsupported parameters that Claude Code sends (such as `reasoning_effort`), preventing HTTP 400 errors.
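Conceptually, `--drop_params` filters the request before forwarding it to the backend. A toy sketch (the allow-list below is illustrative, not litellm's actual list):

```python
# Hypothetical allow-list of parameters the OpenAI-compatible
# backend accepts; anything else is silently removed.
SUPPORTED = {"model", "messages", "max_tokens", "temperature", "stream"}

def drop_unsupported(params: dict) -> dict:
    """Keep only parameters the backend understands."""
    return {k: v for k, v in params.items() if k in SUPPORTED}

req = {"model": "qwen3-coder-next", "messages": [], "reasoning_effort": "high"}
clean = drop_unsupported(req)
print(sorted(clean))  # reasoning_effort is gone; no HTTP 400
```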
Verify litellm is up:
sleep 3
curl -s http://localhost:4000/v1/models | python3 -c "import sys,json; print(json.load(sys.stdin)['data'][0]['id'])"

Expected: qwen3-coder-next
# Check Node.js (requires 18+)
node --version || (curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash - && sudo apt-get install -y nodejs)
sudo npm install -g @anthropic-ai/claude-code
claude --version

export ANTHROPIC_BASE_URL=http://localhost:4000
export ANTHROPIC_AUTH_TOKEN=dummy
export ANTHROPIC_API_KEY=dummy
export ANTHROPIC_MODEL=qwen3-coder-next
export ANTHROPIC_SMALL_FAST_MODEL=qwen3-coder-next
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
export API_TIMEOUT_MS=120000

Why dummy tokens? vLLM and litellm don't validate auth; any non-empty string works. Claude Code just requires the variables to be set.
claude -p "what is kubernetes" \
--system-prompt "You are a helpful assistant." \
--tools "" \
--no-session-persistence

Expected: A clear, structured explanation of Kubernetes from the local Qwen3 model.
Kubernetes (often stylized as K8s) is an open-source container orchestration platform
for automating the deployment, scaling, and management of containerized applications.
## Core Concepts
- Container Orchestration: Kubernetes automates the scheduling and running of containers
across multiple hosts, handling tasks like load balancing, networking, and storage.
- Cluster Architecture: control plane (api-server, scheduler, etcd) + worker nodes
...
- `curl http://localhost:8000/v1/models` returns qwen3-coder-next
- `curl http://localhost:4000/v1/models` returns qwen3-coder-next (via litellm)
- `claude -p "what is kubernetes" --system-prompt "..." --tools "" --no-session-persistence` returns a coherent answer
torch.OutOfMemoryError: CUDA out of memory.
RuntimeError: CUDA out of memory occurred when warming up sampler
Cause: --max-model-len is too large. The model weights use 74.88 GiB; with
--gpu-memory-utilization 0.98 on an 80 GB H100, there's only ~1.8 GB free for KV
cache blocks.
Fix: Use --max-model-len 16384 (as shown above). Do not increase it.
RuntimeError: Triton Error [CUDA]: out of memory
(in chunk_fwd_kernel_o autotuner)
Cause: Qwen3-Coder-Next uses Flash Linear Attention (FLA). Its Triton kernel
chunk_fwd_kernel_o has an autotuner that benchmarks different tile configurations.
With only ~1.8 GB free, sequences >32 tokens trigger BT=64 tiles which OOM during
autotuning.
Fix: Add --enable-chunked-prefill --max-num-batched-tokens 32 to vLLM. This
splits prompt processing into chunks of ≤32 tokens, keeping BT≤32. This is the
most important non-obvious flag for this model on H100 PCIe.
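The effect of chunking is easy to quantify. A sketch (`prefill_chunks` is a hypothetical helper for illustration, not a vLLM API):

```python
import math

def prefill_chunks(prompt_tokens: int, chunk: int = 32) -> int:
    """Number of prefill passes for a prompt, with --max-num-batched-tokens
    set to `chunk`: each pass feeds the FLA kernel at most `chunk` tokens."""
    return math.ceil(prompt_tokens / chunk)

print(prefill_chunks(20))     # short prompt: 1 pass
print(prefill_chunks(16384))  # full 16384-token context: 512 passes
```

Each pass stays at or under 32 tokens, so the Triton autotuner never selects the BT=64 tile configuration that OOMs.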
This crash kills the vLLM process. Restart with docker start qwen3.
RuntimeError: Engine core initialization failed. See root cause above.
Cause: --attention-backend flashinfer — this backend performs CUDA graph capture
which uses additional GPU memory and causes OOM on this hardware.
Fix: Use --attention-backend flash_attn instead.
API Error: 400 {"error": {"message": "litellm.BadRequestError: UnsupportedParamsError..."}}
Cause: Claude Code sends reasoning_effort parameter that vLLM doesn't support.
Fix: Add --drop_params to litellm startup command.
litellm.BadRequestError: OpenAIException - {"error": {"message":
"EngineCore encountered an issue..."}}
Cause: vLLM's engine crashed (Triton OOM) and returned an error to litellm.
Fix: Restart vLLM container (docker start qwen3) and confirm chunked prefill is
enabled.
Symptom: SSH connection drops with exit code 255 when running pkill -f litellm.
Cause: pkill -f PATTERN matches the full command line of ALL processes, including
the current bash session which has PATTERN as an argument.
Fix: Kill by PID instead:
LITELLM_PID=$(ps aux | grep "litellm" | grep -v grep | awk '{print $2}' | head -1)
kill "$LITELLM_PID"

API Error: 400 context window exceeded
Cause: Claude Code's default system prompt is ~17,767 tokens, exceeding the 16,384 token limit.
Fix: Always use --system-prompt "..." and --tools "" to replace the default
system prompt with a short one.
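A quick way to check that a replacement system prompt fits the budget. The chars/4 heuristic below is a rough rule of thumb for English text, not the model's tokenizer; for exact counts you would tokenize with the model's own vocabulary:

```python
def rough_tokens(text: str) -> int:
    """Crude token estimate: ~4 characters per token for English prose."""
    return max(1, len(text) // 4)

MAX_CONTEXT = 16384  # --max-model-len from the vLLM command
system_prompt = "You are a helpful assistant."
used = rough_tokens(system_prompt)
print(f"~{used} tokens, ~{MAX_CONTEXT - used} left for the conversation")
```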
Claude Code (local Mac)
│ ANTHROPIC_BASE_URL=http://<brev-ip>:4000
│ POST /v1/messages?beta=true (Anthropic API format)
▼
litellm proxy (Brev instance, port 4000)
│ --model openai/qwen3-coder-next
│ --api_base http://localhost:8000/v1
│ Translates: Anthropic → OpenAI Responses API
▼
vLLM serving Qwen3-Coder-Next-FP8 (Brev instance, port 8000)
│ --enable-chunked-prefill --max-num-batched-tokens 32
│ Chunks long prompts → FLA kernel sees ≤32 tokens per pass
▼
NVIDIA H100 PCIe (80 GB VRAM)
│ Model weights: 74.88 GiB FP8
│ KV cache: ~1.8 GB remaining @ 0.98 utilization
└─ Qwen3-Coder-Next-FP8 inference
If the Brev instance reboots, Docker containers stop. Restart both services:
Warning: `/ephemeral` is ephemeral: it survives reboots but NOT instance stop/start cycles on some providers. If the model cache is gone, `docker run` again (a full download is needed).
# Restart vLLM (model is already cached in /ephemeral/huggingface if not wiped)
docker start qwen3
# Wait for it to be ready (2-3 minutes from cache)
until curl -sf http://localhost:8000/v1/models > /dev/null; do sleep 15; done
echo "vLLM ready"
# Restart litellm
export OPENAI_API_KEY=dummy
nohup litellm \
--model openai/qwen3-coder-next \
--api_base http://localhost:8000/v1 \
--drop_params \
--port 4000 > /tmp/litellm.log 2>&1 &

- This setup requires: H100 PCIe 80 GB or a larger single GPU
- Does NOT work on H100 PCIe 80 GB without chunked prefill due to FLA Triton OOM
- The original article used an Apple M4 Max with 128 GB unified memory, which has more headroom
- Would work without chunked prefill on: H100 SXM 80 GB × 2 (tensor parallel), or any system with >80 GB VRAM in a single device
# 1. SSH in (shadeform instances use the "-host" suffix in ssh config)
ssh -F ~/.brev/ssh_config qwen3-h100-host
# 2. Start vLLM (first time takes 15-20 min to download model)
# Use /ephemeral for HF cache — root disk is only ~97 GB
mkdir -p /ephemeral/huggingface
docker run -d --name qwen3 \
--gpus all --ipc=host --network host \
--ulimit memlock=-1 --ulimit stack=67108864 \
-e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
-v /ephemeral/huggingface:/root/.cache/huggingface \
nvcr.io/nvidia/vllm:26.01-py3 \
vllm serve "Qwen/Qwen3-Coder-Next-FP8" \
--served-model-name qwen3-coder-next \
--port 8000 --max-model-len 16384 \
--gpu-memory-utilization 0.98 --swap-space 16 \
--enable-auto-tool-choice --tool-call-parser qwen3_coder \
--attention-backend flash_attn --kv-cache-dtype fp8 \
--max-num-seqs 1 \
--enable-chunked-prefill --max-num-batched-tokens 32
# 3. Wait for vLLM to be ready
until curl -sf http://localhost:8000/v1/models > /dev/null; do
echo "$(date +%H:%M:%S) waiting..."; sleep 15; done
echo "vLLM ready"
# 4. Install litellm and start proxy
export PATH="$HOME/.local/bin:$PATH"
pip install -q 'litellm[proxy]'
export OPENAI_API_KEY=dummy
nohup litellm --model openai/qwen3-coder-next \
--api_base http://localhost:8000/v1 --drop_params --port 4000 \
> /tmp/litellm.log 2>&1 &
# 5. Install Claude Code (Node 20 required)
curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash - > /dev/null 2>&1
sudo apt-get install -y nodejs > /dev/null 2>&1
sudo npm install -g @anthropic-ai/claude-code
# 6. Set Claude Code env vars
export ANTHROPIC_BASE_URL=http://localhost:4000
export ANTHROPIC_AUTH_TOKEN=dummy
export ANTHROPIC_API_KEY=dummy
export ANTHROPIC_MODEL=qwen3-coder-next
export ANTHROPIC_SMALL_FAST_MODEL=qwen3-coder-next
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
export API_TIMEOUT_MS=120000
# 7. Test!
claude -p "what is kubernetes" \
--system-prompt "You are a helpful assistant." \
--tools "" --no-session-persistence