@dims
Created February 26, 2026 10:56
Running Qwen3-Coder-Next-FP8 with Claude Code on Brev H100

A step-by-step guide for running the Qwen3-Coder-Next-FP8 model locally via vLLM, connecting Claude Code to it, and verifying end-to-end operation.

Goal: claude -p "what is kubernetes" → coherent answer from a local model


Prerequisites

  • A Brev account with access to H100 GPU instances
  • brev CLI installed locally
  • claude CLI installed locally (or available on the instance)

Step 1 — Create and Connect to the Brev H100 Instance

Note: As of early 2026, brev create --gpu hyperstack_H100 fails with "instance type not found" because the CLI does not pass a workspace group ID. Use the API directly instead:

Create an instance via the Brev API (one-time):

# Get your org ID
brev org ls  # copy the ID of your active org (format: org-XXXX...)

# Create the instance
TOKEN=$(python3 -c "import json, os; print(json.load(open(os.path.expanduser('~/.brev/credentials.json')))['access_token'])")
ORG_ID="org-XXXXXXXXXXXXXXXXXXXXXXXXX"  # replace with your org ID

curl -sf -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  "https://brevapi.us-west-2-prod.control-plane.brev.dev/api/organizations/$ORG_ID/workspaces" \
  -d '{
    "name": "qwen3-h100",
    "workspaceGroupId": "shadeform",
    "diskStorage": "120Gi",
    "instanceType": "hyperstack_H100",
    "workspaceVersion": "v1",
    "vmOnlyMode": true,
    "portMappings": {},
    "workspaceTemplateId": "4nbb4lg2s",
    "launchJupyterOnStart": false
  }' | python3 -c "import sys,json; ws=json.load(sys.stdin).get('workspace',{}); print('Created:', ws.get('id'), ws.get('name'), ws.get('status'))"

Wait for it to be RUNNING:

brev ls  # wait until STATUS = RUNNING (typically 3-5 minutes)

Register SSH config:

brev refresh

SSH directly:

ssh -F ~/.brev/ssh_config qwen3-h100-host  # note: shadeform instances use the "-host" suffix

Verify the GPU:

nvidia-smi

Expected: NVIDIA H100 PCIe listed with ~80 GB VRAM.

Disk note: Hyperstack instances have ~97 GB root disk. The Docker image uses ~30 GB. The model download needs ~80 GB. Use the ephemeral disk (/ephemeral, ~750 GB) for the HuggingFace cache — this is handled in Step 2 below.
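Before kicking off the download, it can be worth confirming the target disk actually has room. A minimal sketch (the path and the ~80 GB figure come from the note above; has_room is a hypothetical helper, not part of any tool here):

```python
import os
import shutil

def has_room(path, need_gb):
    """True if the filesystem holding `path` has at least need_gb GB free."""
    return shutil.disk_usage(path).free / 1e9 >= need_gb

# The FP8 model download needs roughly 80 GB in the HuggingFace cache.
cache = "/ephemeral/huggingface" if os.path.isdir("/ephemeral") else "."
print(cache, "has room for the model:", has_room(cache, 80))
```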


Step 2 — Start vLLM in Docker

Run in the background (detached). The first run downloads the model (~80 GB FP8) from HuggingFace — this takes 15–20 minutes. Subsequent runs use the local cache (~2–3 min).

Important: Mount to /ephemeral/huggingface (the 750 GB ephemeral disk), not ~/.cache/huggingface. The root disk is only ~97 GB and fills up during download.

mkdir -p /ephemeral/huggingface

docker run -d --name qwen3 \
  --gpus all --ipc=host --network host \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  -v /ephemeral/huggingface:/root/.cache/huggingface \
  nvcr.io/nvidia/vllm:26.01-py3 \
  vllm serve "Qwen/Qwen3-Coder-Next-FP8" \
  --served-model-name qwen3-coder-next \
  --port 8000 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.98 \
  --swap-space 16 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --attention-backend flash_attn \
  --kv-cache-dtype fp8 \
  --max-num-seqs 1 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 32

Flag rationale

  • --max-model-len 16384: The H100 PCIe has 80 GB; the model weights alone use 74.88 GiB. With an FP8 KV cache at 0.98 utilization, only 16384 tokens of KV cache fit. Larger values cause OOM.
  • --gpu-memory-utilization 0.98: Pushes close to the limit to fit as many KV cache blocks as possible.
  • --attention-backend flash_attn: The flashinfer backend causes CUDA graph OOM on the H100 PCIe with this model.
  • --kv-cache-dtype fp8: Halves KV cache memory versus fp16.
  • --max-num-seqs 1: Single-user setup; prevents memory pressure from concurrent sequences.
  • --enable-chunked-prefill --max-num-batched-tokens 32: Critical. Qwen3-Coder-Next uses Flash Linear Attention (FLA) with a Triton autotuner; sequences longer than 32 tokens trigger a BT=64 tile configuration that OOMs during autotuning. Chunked prefill splits any longer prompt into chunks of at most 32 tokens, keeping the tile size at BT=32, which fits within the ~1.8 GB of free memory.
  • --enable-prefix-caching: Omitted intentionally; incompatible with chunked prefill for this model.
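The KV-cache arithmetic behind --kv-cache-dtype fp8 can be sketched as follows. The layer, head, and dimension constants below are illustrative placeholders, not Qwen3-Coder-Next's actual architecture values; only the formula and the 2x fp8-versus-fp16 ratio carry over:

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes):
    # K and V each store layers * kv_heads * head_dim values per token.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

free_bytes = 1.8 * 1024**3                 # ~1.8 GiB left after weights
fp8 = kv_bytes_per_token(48, 8, 128, 1)    # illustrative model shape
fp16 = kv_bytes_per_token(48, 8, 128, 2)   # same shape, 2-byte KV entries

print("tokens that fit @ fp8: ", int(free_bytes // fp8))
print("tokens that fit @ fp16:", int(free_bytes // fp16))
```

With a fixed free-memory budget, halving the per-token KV size doubles the context that fits, which is what makes 16384 tokens reachable at all.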

Watch startup progress

docker logs qwen3 -f 2>&1 | grep -E "Ready|READY|loading|Loading|ERROR|error"

Step 3 — Wait for the Endpoint to Be Ready

Poll until the model is serving:

until curl -sf http://localhost:8000/v1/models > /dev/null; do
  echo "$(date +%H:%M:%S) waiting..."
  sleep 15
done
echo "vLLM is ready"

Verify the model is listed:

curl -s http://localhost:8000/v1/models | python3 -c "
import sys, json
d = json.load(sys.stdin)
print('Model:', d['data'][0]['id'])
print('Max context:', d['data'][0]['max_model_len'])
"

Expected output:

Model: qwen3-coder-next
Max context: 16384

Step 4 — Start the litellm Proxy

Claude Code uses the Anthropic Messages API format (POST /v1/messages). vLLM speaks OpenAI's format (POST /v1/chat/completions or /v1/responses). litellm bridges them.
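A simplified sketch of the translation litellm performs. The real proxy also maps tools, content blocks, streaming, and error shapes; this only shows the core role/field mapping, and anthropic_to_openai is a name invented for illustration:

```python
def anthropic_to_openai(req):
    """Map an Anthropic Messages request onto an OpenAI chat request."""
    messages = []
    if req.get("system"):
        # Anthropic carries the system prompt as a top-level field;
        # OpenAI carries it as the first message.
        messages.append({"role": "system", "content": req["system"]})
    messages.extend({"role": m["role"], "content": m["content"]}
                    for m in req["messages"])
    return {"model": req["model"], "messages": messages,
            "max_tokens": req["max_tokens"]}

anthropic_req = {
    "model": "qwen3-coder-next",
    "max_tokens": 512,
    "system": "You are a helpful assistant.",
    "messages": [{"role": "user", "content": "what is kubernetes"}],
}
print(anthropic_to_openai(anthropic_req))
```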

Install litellm (if not present):

pip install 'litellm[proxy]'

Note: pip install litellm (without [proxy]) is missing the backoff dependency and will fail with ImportError: Missing dependency No module named 'backoff'.

Start litellm:

export OPENAI_API_KEY=dummy
export PATH="$HOME/.local/bin:$PATH"  # pip installs to ~/.local/bin
nohup litellm \
  --model openai/qwen3-coder-next \
  --api_base http://localhost:8000/v1 \
  --drop_params \
  --port 4000 > /tmp/litellm.log 2>&1 &
echo "litellm PID: $!"

The --drop_params flag silently drops unsupported parameters that Claude Code sends (such as reasoning_effort), preventing HTTP 400 errors.
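What --drop_params effectively does, in miniature. The allow-list below is illustrative, not litellm's real table of supported parameters:

```python
# Hypothetical allow-list; litellm derives the real one per provider.
SUPPORTED = {"model", "messages", "max_tokens", "temperature", "stream"}

def drop_params(payload):
    # Silently discard parameters the backend would reject with HTTP 400.
    return {k: v for k, v in payload.items() if k in SUPPORTED}

req = {"model": "qwen3-coder-next", "messages": [],
       "max_tokens": 64, "reasoning_effort": "high"}
print(sorted(drop_params(req)))
```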

Verify litellm is up:

sleep 3
curl -s http://localhost:4000/v1/models | python3 -c "import sys,json; print(json.load(sys.stdin)['data'][0]['id'])"

Expected: qwen3-coder-next


Step 5 — Install Claude Code (if not already installed)

# Check Node.js (requires 18+)
node --version || (curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash - && sudo apt-get install -y nodejs)

sudo npm install -g @anthropic-ai/claude-code
claude --version

Step 6 — Configure Claude Code to Use the Local Endpoint

export ANTHROPIC_BASE_URL=http://localhost:4000
export ANTHROPIC_AUTH_TOKEN=dummy
export ANTHROPIC_API_KEY=dummy
export ANTHROPIC_MODEL=qwen3-coder-next
export ANTHROPIC_SMALL_FAST_MODEL=qwen3-coder-next
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
export API_TIMEOUT_MS=120000

Why dummy tokens? vLLM and litellm don't validate auth — any non-empty string works. Claude Code requires the vars to be set.


Step 7 — Run the Test Prompt

claude -p "what is kubernetes" \
  --system-prompt "You are a helpful assistant." \
  --tools "" \
  --no-session-persistence

Expected: A clear, structured explanation of Kubernetes from the local Qwen3 model.

Example output

Kubernetes (often stylized as K8s) is an open-source container orchestration platform
for automating the deployment, scaling, and management of containerized applications.

## Core Concepts
- Container Orchestration: Kubernetes automates the scheduling and running of containers
  across multiple hosts, handling tasks like load balancing, networking, and storage.
- Cluster Architecture: control plane (api-server, scheduler, etcd) + worker nodes
...

Verification Checklist

  • curl http://localhost:8000/v1/models returns qwen3-coder-next
  • curl http://localhost:4000/v1/models returns qwen3-coder-next (via litellm)
  • claude -p "what is kubernetes" --system-prompt "..." --tools "" --no-session-persistence returns a coherent answer

Failure Modes and Fixes

OOM: CUDA out of memory on startup

torch.OutOfMemoryError: CUDA out of memory.
RuntimeError: CUDA out of memory occurred when warming up sampler

Cause: --max-model-len is too large. The model weights use 74.88 GiB; with --gpu-memory-utilization 0.98 on an 80 GB H100, there's only ~1.8 GB free for KV cache blocks.

Fix: Use --max-model-len 16384 (as shown above). Do not increase it.


Triton OOM on first inference

RuntimeError: Triton Error [CUDA]: out of memory
(in chunk_fwd_kernel_o autotuner)

Cause: Qwen3-Coder-Next uses Flash Linear Attention (FLA). Its Triton kernel chunk_fwd_kernel_o has an autotuner that benchmarks different tile configurations. With only ~1.8 GB free, sequences >32 tokens trigger BT=64 tiles which OOM during autotuning.

Fix: Add --enable-chunked-prefill --max-num-batched-tokens 32 to vLLM. This splits prompt processing into chunks of ≤32 tokens, keeping BT≤32. This is the most important non-obvious flag for this model on H100 PCIe.

This crash kills the vLLM process. Restart with docker start qwen3.
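The effect of the fix is easy to quantify: with --max-num-batched-tokens 32, an N-token prompt is prefilled in ceil(N / 32) forward passes, so the FLA kernel never sees more than 32 tokens at once:

```python
import math

def prefill_passes(prompt_tokens, max_batched=32):
    """Forward passes needed to prefill a prompt under chunked prefill."""
    return math.ceil(prompt_tokens / max_batched)

print(prefill_passes(32))     # 1 pass: fits in a single chunk
print(prefill_passes(33))     # 2 passes: a chunk of 32, then 1 token
print(prefill_passes(16384))  # 512 passes for a full-context prompt
```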


vLLM crash: engine core initialization failed

RuntimeError: Engine core initialization failed. See root cause above.

Cause: --attention-backend flashinfer — this backend performs CUDA graph capture which uses additional GPU memory and causes OOM on this hardware.

Fix: Use --attention-backend flash_attn instead.


litellm 400 error: unsupported parameter

API Error: 400 {"error": {"message": "litellm.BadRequestError: UnsupportedParamsError..."}}

Cause: Claude Code sends a reasoning_effort parameter that vLLM doesn't support.

Fix: Add --drop_params to litellm startup command.


litellm 400 error: EngineCore encountered an issue

litellm.BadRequestError: OpenAIException - {"error": {"message":
"EngineCore encountered an issue..."}}

Cause: vLLM's engine crashed (Triton OOM) and returned an error to litellm.

Fix: Restart vLLM container (docker start qwen3) and confirm chunked prefill is enabled.


SSH session killed when managing processes

Symptom: SSH connection drops with exit code 255 when running pkill -f litellm.

Cause: pkill -f PATTERN matches against the full command line of every process, including the shell session that launched the command, whose own command line contains PATTERN; pkill therefore kills its own session.

Fix: Kill by PID instead:

LITELLM_PID=$(ps aux | grep "litellm" | grep -v grep | awk '{print $2}' | head -1)
kill "$LITELLM_PID"

Claude Code context window error

API Error: 400 context window exceeded

Cause: Claude Code's default system prompt is ~17,767 tokens, exceeding the 16,384 token limit.

Fix: Always use --system-prompt "..." and --tools "" to replace the default system prompt with a short one.
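A rough sanity check that a replacement system prompt stays within budget. The 4-characters-per-token ratio is a crude heuristic, not the model's actual tokenizer:

```python
def approx_tokens(text):
    # Very rough: English text averages about 4 characters per token.
    return len(text) // 4

system_prompt = "You are a helpful assistant."
budget = 16384  # --max-model-len from Step 2
print(approx_tokens(system_prompt), "of", budget, "tokens (approx.)")
```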


Architecture Diagram

Claude Code (local Mac)
  │  ANTHROPIC_BASE_URL=http://<brev-ip>:4000
  │  POST /v1/messages?beta=true (Anthropic API format)
  ▼
litellm proxy (Brev instance, port 4000)
  │  --model openai/qwen3-coder-next
  │  --api_base http://localhost:8000/v1
  │  Translates: Anthropic → OpenAI Responses API
  ▼
vLLM serving Qwen3-Coder-Next-FP8 (Brev instance, port 8000)
  │  --enable-chunked-prefill --max-num-batched-tokens 32
  │  Chunks long prompts → FLA kernel sees ≤32 tokens per pass
  ▼
NVIDIA H100 PCIe (80 GB VRAM)
  │  Model weights: 74.88 GiB FP8
  │  KV cache: ~1.8 GB remaining @ 0.98 utilization
  └─ Qwen3-Coder-Next-FP8 inference

Restarting After Instance Restart

If the Brev instance reboots, Docker containers stop. Restart both services:

Warning: /ephemeral is ephemeral: it survives reboots but NOT instance stop/start cycles on some providers. If the model cache is gone, re-run the docker run command from Step 2 (the full ~80 GB download is needed again).
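A quick way to check whether the cache survived before restarting. The path is the mount chosen in Step 2; cache_present is a small helper written for this guide:

```python
import os

def cache_present(cache_dir="/ephemeral/huggingface"):
    """True if the HuggingFace cache directory exists and is non-empty."""
    return os.path.isdir(cache_dir) and bool(os.listdir(cache_dir))

print("cache present" if cache_present()
      else "cache gone; repeat Step 2's docker run")
```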

# Restart vLLM (model is already cached in /ephemeral/huggingface if not wiped)
docker start qwen3

# Wait for it to be ready (2-3 minutes from cache)
until curl -sf http://localhost:8000/v1/models > /dev/null; do sleep 15; done
echo "vLLM ready"

# Restart litellm
export OPENAI_API_KEY=dummy
nohup litellm \
  --model openai/qwen3-coder-next \
  --api_base http://localhost:8000/v1 \
  --drop_params \
  --port 4000 > /tmp/litellm.log 2>&1 &

Hardware Notes

  • This setup requires: H100 PCIe 80 GB or larger single GPU
  • Does NOT work on H100 PCIe 80 GB without chunked prefill due to FLA Triton OOM
  • The original article used: Apple M4 Max with 128 GB unified memory — more headroom
  • Would work without chunked prefill on: H100 SXM 80 GB × 2 (tensor parallel), or any system with >80 GB VRAM in a single device

Quick Reference: All Commands in Order

# 1. SSH in (shadeform instances use the "-host" suffix in ssh config)
ssh -F ~/.brev/ssh_config qwen3-h100-host

# 2. Start vLLM (first time takes 15-20 min to download model)
# Use /ephemeral for HF cache — root disk is only ~97 GB
mkdir -p /ephemeral/huggingface
docker run -d --name qwen3 \
  --gpus all --ipc=host --network host \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  -v /ephemeral/huggingface:/root/.cache/huggingface \
  nvcr.io/nvidia/vllm:26.01-py3 \
  vllm serve "Qwen/Qwen3-Coder-Next-FP8" \
  --served-model-name qwen3-coder-next \
  --port 8000 --max-model-len 16384 \
  --gpu-memory-utilization 0.98 --swap-space 16 \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder \
  --attention-backend flash_attn --kv-cache-dtype fp8 \
  --max-num-seqs 1 \
  --enable-chunked-prefill --max-num-batched-tokens 32

# 3. Wait for vLLM to be ready
until curl -sf http://localhost:8000/v1/models > /dev/null; do
  echo "$(date +%H:%M:%S) waiting..."; sleep 15; done
echo "vLLM ready"

# 4. Install litellm and start proxy
export PATH="$HOME/.local/bin:$PATH"
pip install -q 'litellm[proxy]'
export OPENAI_API_KEY=dummy
nohup litellm --model openai/qwen3-coder-next \
  --api_base http://localhost:8000/v1 --drop_params --port 4000 \
  > /tmp/litellm.log 2>&1 &

# 5. Install Claude Code (Node 20 required)
curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash - > /dev/null 2>&1
sudo apt-get install -y nodejs > /dev/null 2>&1
sudo npm install -g @anthropic-ai/claude-code

# 6. Set Claude Code env vars
export ANTHROPIC_BASE_URL=http://localhost:4000
export ANTHROPIC_AUTH_TOKEN=dummy
export ANTHROPIC_API_KEY=dummy
export ANTHROPIC_MODEL=qwen3-coder-next
export ANTHROPIC_SMALL_FAST_MODEL=qwen3-coder-next
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
export API_TIMEOUT_MS=120000

# 7. Test!
claude -p "what is kubernetes" \
  --system-prompt "You are a helpful assistant." \
  --tools "" --no-session-persistence