I do all of this SSH'd into the DGX Spark from another machine, so everything below is a terminal command.
sudo apt install python3-dev
curl -LsSf https://astral.sh/uv/install.sh | sh
Exit your shell and log back in so that uv is on your PATH.
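As a quick sanity check (my own habit, not a required step), confirm uv is now reachable:
uv --version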
Set up the build environment so the compile targets the right GPU architecture and uses the system CUDA toolkit.
export TORCH_CUDA_ARCH_LIST=12.1a
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
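These exports only apply to the current shell session. If you expect to rebuild later, one option (my own convenience, not something vLLM requires) is to append them to ~/.bashrc so they survive future SSH sessions:
cat >> ~/.bashrc <<'EOF'
export TORCH_CUDA_ARCH_LIST=12.1a
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
EOF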
mkdir -p ~/src
cd ~/src
git clone https://github.com/vllm-project/vllm.git
cd vllm
uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
uv pip install xgrammar triton flashinfer-python flashinfer-cubin --prerelease=allow
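At this point it's worth a quick sanity check (my own addition, not an official step) that torch can see the GPU and the extra packages import cleanly:
python -c "import torch, triton, xgrammar, flashinfer; print(torch.__version__, torch.cuda.is_available(), torch.cuda.get_device_name(0))"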
Next, tell vLLM to use the torch you just installed, and pull in the build dependencies.
python use_existing_torch.py
uv pip install -r requirements/build.txt
The build will take at least 30 minutes.
uv pip install -v --no-build-isolation -e .
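If the build exhausts memory, vLLM's build respects the MAX_JOBS environment variable to cap parallel compile jobs. The value 8 below is just a starting point; tune it for your machine:
MAX_JOBS=8 uv pip install -v --no-build-isolation -e .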
This should give you a working vLLM that can serve most models. Note that NVFP4 MoE models are still a work in progress, but other FP4 models like gpt-oss-20b should work fine.
vllm serve openai/gpt-oss-20b \
--async-scheduling \
--gpu-memory-utilization 0.4 \
--tool-call-parser openai \
--enable-auto-tool-choice
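Once the server reports it's ready, a quick smoke test from another terminal (assuming the default port, 8000) against the OpenAI-compatible endpoint:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "openai/gpt-oss-20b", "messages": [{"role": "user", "content": "Say hello."}]}'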