I do all of this SSH'd into the DGX Spark from another machine, so everything below is a terminal command.
sudo apt install python3-dev
curl -LsSf https://astral.sh/uv/install.sh | sh
Exit your shell and log back in so that uv is on your PATH.
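As a quick sanity check (my own habit, not a required step), confirm uv is now reachable:
uv --version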
Set up the build environment so the compile targets the right GPU architecture and uses the system CUDA toolkit.
export TORCH_CUDA_ARCH_LIST=12.1a
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
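These exports only apply to the current shell session. If you expect to rebuild later, one option (my own convenience, not something vLLM requires) is to append them to ~/.bashrc so they survive future SSH sessions:
cat >> ~/.bashrc <<'EOF'
export TORCH_CUDA_ARCH_LIST=12.1a
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
EOF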
mkdir -p ~/src
cd ~/src
git clone https://github.com/vllm-project/vllm.git
cd vllm
uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
uv pip install xgrammar triton flashinfer-python flashinfer-cubin --prerelease=allow
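At this point it's worth a quick sanity check (my own addition, not an official step) that torch can see the GPU and the extra packages import cleanly:
python -c "import torch, triton, xgrammar, flashinfer; print(torch.__version__, torch.cuda.is_available(), torch.cuda.get_device_name(0))"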
Next, tell vLLM to use the torch you just installed, and pull in the build dependencies.
python use_existing_torch.py
uv pip install -r requirements/build.txt
The build will take at least 30 minutes.
uv pip install -v --no-build-isolation -e .
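If the build exhausts memory, vLLM's build respects the MAX_JOBS environment variable to cap parallel compile jobs. The value 8 below is just a starting point; tune it for your machine:
MAX_JOBS=8 uv pip install -v --no-build-isolation -e .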
This should give you a working vLLM that can serve most models. Note that NVFP4 MoE models are still a work in progress, but other FP4 models like gpt-oss-20b should work fine.
vllm serve openai/gpt-oss-20b \
--async-scheduling \
--gpu-memory-utilization 0.4 \
--tool-call-parser openai \
--enable-auto-tool-choice
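Once the server reports it's ready, a quick smoke test from another terminal (assuming the default port, 8000) against the OpenAI-compatible endpoint:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "openai/gpt-oss-20b", "messages": [{"role": "user", "content": "Say hello."}]}'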