This tutorial captures end-to-end reference flows for running AIPerf against vLLM-hosted models. Each chapter covers a specific OpenAI-compatible endpoint: how to launch the vLLM server, run the AIPerf benchmark, and interpret sample results collected on a 1x H100 system.
Note: The vLLM examples rely on the latest main branch (commit `081b5594a2b1a37ea793659bb6767c497beef45d`) to access Qwen3 model support. If an official release ships with matching capabilities, substitute the `pip install` command accordingly.
- Ubuntu 22.04+ with NVIDIA H100 (or equivalent Hopper GPU) and recent CUDA drivers.
- Python 3.10+ with a project virtual environment (assumed at `.venv`). If you don't already have one, create it and install the project along with vLLM:
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -e .
pip install --no-cache-dir git+https://github.com/vllm-project/vllm@main
deactivate
```

With prerequisites in place, proceed to the endpoint-specific guides below.
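Before diving in, you can sanity-check the toolchain with a quick probe. This is a minimal sketch; it assumes only the two packages installed above, and that the `aiperf` entrypoint accepts `--help` (adjust if your version differs):

```bash
source .venv/bin/activate
python -c "import vllm; print('vllm', vllm.__version__)"  # confirms the vLLM build imports cleanly
aiperf --help > /dev/null && echo "aiperf CLI OK"         # confirms the aiperf entrypoint is on PATH
```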
All commands below assume you are in the repository root with the project virtual environment (`.venv`) already activated.
```bash
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-0.6B \
  --port 8001 \
  --trust-remote-code
```

Wait for the logs to report successful CUDA graph capture; the server will then be ready on `http://localhost:8001/v1/chat/completions`.
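If you prefer scripting over log-watching, you can poll the server until it responds. A small sketch, assuming your vLLM build exposes the `/health` endpoint (recent versions do; adjust the URL if yours differs):

```bash
# Block until the vLLM server answers health checks
until curl -sf http://localhost:8001/health > /dev/null; do
  echo "waiting for vLLM to come up..."
  sleep 2
done
echo "vLLM is ready"
```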
```bash
curl -s -o /tmp/chat_smoke.json -w "%{http_code}" \
  http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 5
  }'
cat /tmp/chat_smoke.json
```

Expect a 200 status and a short assistant reply payload.
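To inspect just the assistant text rather than the raw payload, `jq` works well (assuming `jq` is installed; the response follows the standard OpenAI chat schema):

```bash
# Extract only the assistant's reply from the saved smoke-test response
jq -r '.choices[0].message.content' /tmp/chat_smoke.json
```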
```bash
aiperf profile \
  --model Qwen/Qwen3-0.6B \
  --endpoint-type chat \
  --endpoint /v1/chat/completions \
  --url http://localhost:8001 \
  --streaming \
  --concurrency 4 \
  --request-count 16 \
  --warmup-request-count 2 \
  --synthetic-input-tokens-mean 64 \
  --synthetic-input-tokens-stddev 0 \
  --output-tokens-mean 128 \
  --output-tokens-stddev 0 \
  --conversation-num 4 \
  --random-seed 7
```

The run completes in ~2 seconds on an H100 and emits CSV/JSON exports under `artifacts/Qwen_Qwen3-0.6B-openai-chat-concurrency4/` for downstream analysis.
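Export filenames vary by AIPerf version, so rather than hard-coding them, the sketch below simply lists the run directory and previews whichever CSVs it finds:

```bash
RUN_DIR=artifacts/Qwen_Qwen3-0.6B-openai-chat-concurrency4
ls "$RUN_DIR"
# Render each CSV export as an aligned table for a quick look
find "$RUN_DIR" -name '*.csv' -exec sh -c \
  'echo "== $1 =="; column -s, -t < "$1" | head -20' _ {} \;
```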
```
┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━┓
┃ Metric         ┃      avg ┃    min ┃    max ┃    p99 ┃    p90 ┃    p50 ┃  std ┃
┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━┩
│ Time to First  │    13.00 │  10.81 │  15.62 │  15.57 │  15.23 │  13.18 │ 1.54 │
│ Token (ms)     │          │        │        │        │        │        │      │
│ Request        │   410.50 │ 404.00 │ 416.77 │ 416.69 │ 416.20 │ 411.07 │ 4.09 │
│ Latency (ms)   │          │        │        │        │        │        │      │
│ Output Token   │ 1,241.17 │    N/A │    N/A │    N/A │    N/A │    N/A │  N/A │
│ Throughput     │          │        │        │        │        │        │      │
│ (tokens/sec)   │          │        │        │        │        │        │      │
│ Request        │     9.70 │    N/A │    N/A │    N/A │    N/A │    N/A │  N/A │
│ Throughput     │          │        │        │        │        │        │      │
│ (requests/sec) │          │        │        │        │        │        │      │
└────────────────┴──────────┴────────┴────────┴────────┴────────┴────────┴──────┘
```

These figures were gathered with deterministic synthetic workloads (fixed token lengths) to simplify cross-run comparisons. Adjust token distributions, concurrency, or request counts to match your production scenario.
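For instance, to see how latency and throughput trade off under load, you can sweep concurrency with the same flag set used above (a sketch; each run writes its own artifacts directory keyed by the concurrency value):

```bash
# Sweep concurrency levels against the chat endpoint from the previous run
for c in 1 2 4 8 16; do
  aiperf profile \
    --model Qwen/Qwen3-0.6B \
    --endpoint-type chat \
    --endpoint /v1/chat/completions \
    --url http://localhost:8001 \
    --streaming \
    --concurrency "$c" \
    --request-count $((c * 4)) \
    --warmup-request-count 2 \
    --synthetic-input-tokens-mean 64 \
    --synthetic-input-tokens-stddev 0 \
    --output-tokens-mean 128 \
    --output-tokens-stddev 0 \
    --random-seed 7
done
```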
```bash
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-VL-3B-Instruct-AWQ \
  --port 8002 \
  --trust-remote-code
```

Wait for the logs to report successful model loading; the server will then be ready on `http://localhost:8002/v1/chat/completions`.
Run the following smoke test to verify the vision-language endpoint. It embeds
the included architecture diagram as a base64 data URL and requests a caption
through the OpenAI-compatible chat.completions API.
```bash
python - <<'PY'
from openai import OpenAI
import base64, json
from pathlib import Path

client = OpenAI(base_url="http://localhost:8002/v1", api_key="dummy")
image_b64 = base64.b64encode(Path("docs/diagrams/aiperf-diagram-256.png").read_bytes()).decode()
resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-3B-Instruct-AWQ",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this diagram"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}", "detail": "low"},
                },
            ],
        }
    ],
    max_tokens=128,
)
print(json.dumps(resp.to_dict(), indent=2)[:800])
PY
```

Expect a 200 response with a short textual description of the diagram.
AIPerf can generate synthetic images for vision-language benchmarking using the `--image-width-mean` and `--image-height-mean` parameters:
```bash
aiperf profile \
  --model Qwen/Qwen2.5-VL-3B-Instruct-AWQ \
  --endpoint-type chat \
  --endpoint /v1/chat/completions \
  --url http://localhost:8002 \
  --concurrency 2 \
  --request-count 8 \
  --warmup-request-count 2 \
  --synthetic-input-tokens-mean 64 \
  --synthetic-input-tokens-stddev 0 \
  --output-tokens-mean 128 \
  --output-tokens-stddev 0 \
  --image-width-mean 100 \
  --image-width-stddev 0 \
  --image-height-mean 100 \
  --image-height-stddev 0 \
  --random-seed 7
```

```
┏━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ Metric         ┃    avg ┃    min ┃    max ┃    p99 ┃    p90 ┃    p50 ┃   std ┃
┡━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ Request        │ 571.04 │ 367.46 │ 640.21 │ 640.19 │ 640.00 │ 625.36 │ 92.21 │
│ Latency (ms)   │        │        │        │        │        │        │       │
│ Output         │ 111.62 │  70.00 │ 128.00 │ 128.00 │ 128.00 │ 126.50 │ 21.49 │
│ Sequence       │        │        │        │        │        │        │       │
│ Length         │        │        │        │        │        │        │       │
│ (tokens)       │        │        │        │        │        │        │       │
│ Input Sequence │  64.00 │  64.00 │  64.00 │  64.00 │  64.00 │  64.00 │  0.00 │
│ Length         │        │        │        │        │        │        │       │
│ (tokens)       │        │        │        │        │        │        │       │
│ Output Token   │ 372.23 │    N/A │    N/A │    N/A │    N/A │    N/A │   N/A │
│ Throughput     │        │        │        │        │        │        │       │
│ (tokens/sec)   │        │        │        │        │        │        │       │
│ Request        │   3.33 │    N/A │    N/A │    N/A │    N/A │    N/A │   N/A │
│ Throughput     │        │        │        │        │        │        │       │
│ (requests/sec) │        │        │        │        │        │        │       │
│ Request Count  │   8.00 │    N/A │    N/A │    N/A │    N/A │    N/A │   N/A │
│ (requests)     │        │        │        │        │        │        │       │
└────────────────┴────────┴────────┴────────┴────────┴────────┴────────┴───────┘
```

```bash
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-Embedding-0.6B \
  --port 8003 \
  --trust-remote-code
```

Wait for the logs to report successful model loading; the server will then be ready on `http://localhost:8003/v1/embeddings`.
```bash
curl -s -o /tmp/embed_smoke.json -w "%{http_code}" \
  http://localhost:8003/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-Embedding-0.6B",
    "input": "hello world"
  }'
cat /tmp/embed_smoke.json
```

Expect a 200 status and an embedding array of 1,024 floats (the hidden size of Qwen3-Embedding-0.6B).
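You can confirm the dimensionality straight from the saved payload (assuming `jq` is installed; the response follows the standard OpenAI embeddings schema):

```bash
# Count the floats in the first returned embedding
jq '.data[0].embedding | length' /tmp/embed_smoke.json   # expect 1024 for this model
```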
```bash
AIPERF_LOG_LEVEL=info HF_TOKEN=... \
aiperf profile \
  --model Qwen/Qwen3-Embedding-0.6B \
  --endpoint-type embeddings \
  --endpoint /v1/embeddings \
  --url http://localhost:8003 \
  --concurrency 8 \
  --request-count 32 \
  --warmup-request-count 4 \
  --synthetic-input-tokens-mean 128 \
  --synthetic-input-tokens-stddev 0 \
  --random-seed 7
```

```
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃ Metric              ┃    avg ┃   min ┃   max ┃   p99 ┃   p90 ┃   p50 ┃   std ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━┩
│ Request Latency     │  32.53 │ 12.21 │ 45.16 │ 45.16 │ 45.10 │ 38.40 │ 11.85 │
│ (ms)                │        │       │       │       │       │       │       │
│ Request Throughput  │ 219.88 │   N/A │   N/A │   N/A │   N/A │   N/A │   N/A │
│ (requests/sec)      │        │       │       │       │       │       │       │
│ Request Count       │  32.00 │   N/A │   N/A │   N/A │   N/A │   N/A │   N/A │
│ (requests)          │        │       │       │       │       │       │       │
└─────────────────────┴────────┴───────┴───────┴───────┴───────┴───────┴───────┘
```
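As a sanity check on results like these, request throughput should approach concurrency divided by average request latency; the shortfall reflects client and scheduling overhead. A quick back-of-the-envelope calculation using the table above:

```bash
# Ideal throughput at concurrency 8 with 32.53 ms average latency
awk 'BEGIN { printf "upper bound: %.1f req/s (measured: 219.88)\n", 8 / 0.03253 }'
```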