
AIPerf Profiling of Text, Image, & Embeddings Endpoints

This tutorial captures end-to-end reference flows for running AIPerf against vLLM-hosted models. Each chapter covers a specific OpenAI-compatible endpoint: how to launch the vLLM server, run the AIPerf benchmark, and interpret sample results collected on a 1x H100 system.

Note

The vLLM examples rely on a main-branch build (commit 081b5594a2b1a37ea793659bb6767c497beef45d) for Qwen3 model support. Once an official release ships with matching capabilities, substitute the pip install command accordingly.

Prerequisites

  • Ubuntu 22.04+ with NVIDIA H100 (or equivalent Hopper GPU) and recent CUDA drivers.
  • Python 3.10+ with a project virtual environment (assumed at .venv). If you don't already have one, create it and install the project along with vLLM:
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -e .
pip install --no-cache-dir git+https://github.com/vllm-project/vllm@main

With prerequisites in place, proceed to the endpoint-specific guides below.

All commands below assume you are in the repository root with the project virtual environment (.venv) already activated.
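To confirm you are actually in that environment, a quick check (the distribution name aiperf is an assumption about how the AIPerf package installs; adjust if yours differs):

# Verify the key packages resolve inside the active virtual environment.
# "aiperf" as a distribution name is an assumption -- adjust if the
# package installs under a different name.
from importlib.metadata import PackageNotFoundError, version

for pkg in ("vllm", "aiperf"):
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed -- revisit the prerequisites above")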

Chat Completions — Qwen/Qwen3-0.6B

1. Launch the vLLM server

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-0.6B \
  --port 8001 \
  --trust-remote-code

Wait for the logs to report successful CUDA graph capture; the server is then ready at http://localhost:8001/v1/chat/completions.
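If you prefer scripting the wait instead of tailing logs, a minimal sketch that polls vLLM's health route (the OpenAI-compatible server exposes GET /health, which returns 200 once startup completes):

# Poll the vLLM server until it reports healthy, then proceed.
import time
import urllib.request

HEALTH_URL = "http://localhost:8001/health"

for attempt in range(60):
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=2) as resp:
            if resp.status == 200:
                print(f"server ready after {attempt + 1} checks")
                break
    except OSError:
        pass  # connection refused or timed out; server still starting
    time.sleep(5)
else:
    raise SystemExit("server did not become ready in time")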

2. Verify the endpoint

curl -s -o /tmp/chat_smoke.json -w "%{http_code}" \
  http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-0.6B",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 5
      }'
cat /tmp/chat_smoke.json

Expect a 200 status and a short assistant reply payload.
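Because the benchmark below runs with --streaming, time to first token (TTFT) is a headline metric. You can sanity-check it by hand before profiling; a minimal sketch using the openai client (any OpenAI-compatible client works, and vLLM ignores the API key by default):

# Measure time to first streamed token, roughly what AIPerf reports as TTFT.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8001/v1", api_key="dummy")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=32,
    stream=True,
)
for chunk in stream:
    # The first chunk carrying content marks the first token's arrival.
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"TTFT: {(time.perf_counter() - start) * 1000:.1f} ms")
        break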

3. Run the AIPerf benchmark

aiperf profile \
  --model Qwen/Qwen3-0.6B \
  --endpoint-type chat \
  --endpoint /v1/chat/completions \
  --url http://localhost:8001 \
  --streaming \
  --concurrency 4 \
  --request-count 16 \
  --warmup-request-count 2 \
  --synthetic-input-tokens-mean 64 \
  --synthetic-input-tokens-stddev 0 \
  --output-tokens-mean 128 \
  --output-tokens-stddev 0 \
  --conversation-num 4 \
  --random-seed 7

The run completes in ~2 seconds on an H100 and emits CSV/JSON exports under artifacts/Qwen_Qwen3-0.6B-openai-chat-concurrency4/ for downstream analysis.
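To preview those exports without assuming their exact file names, you can glob the artifact directory; a minimal sketch:

# List the exports AIPerf wrote for this run and peek at each CSV header.
from pathlib import Path

art_dir = Path("artifacts/Qwen_Qwen3-0.6B-openai-chat-concurrency4")
for path in sorted(art_dir.glob("*")):
    print(path.name, f"({path.stat().st_size} bytes)")
    if path.suffix == ".csv":
        lines = path.read_text().splitlines()
        print("    " + "\n    ".join(lines[:2]))  # header + first data row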

4. Sample metrics snapshot

┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━┓
┃         Metric ┃      avg ┃    min ┃    max ┃    p99 ┃    p90 ┃    p50 ┃  std ┃
┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━┩
│  Time to First │    13.00 │  10.81 │  15.62 │  15.57 │  15.23 │  13.18 │ 1.54 │
│     Token (ms) │          │        │        │        │        │        │      │
│        Request │   410.50 │ 404.00 │ 416.77 │ 416.69 │ 416.20 │ 411.07 │ 4.09 │
│   Latency (ms) │          │        │        │        │        │        │      │
│   Output Token │ 1,241.17 │    N/A │    N/A │    N/A │    N/A │    N/A │  N/A │
│     Throughput │          │        │        │        │        │        │      │
│   (tokens/sec) │          │        │        │        │        │        │      │
│        Request │     9.70 │    N/A │    N/A │    N/A │    N/A │    N/A │  N/A │
│     Throughput │          │        │        │        │        │        │      │
│ (requests/sec) │          │        │        │        │        │        │      │
└────────────────┴──────────┴────────┴────────┴────────┴────────┴────────┴──────┘

These figures were gathered with deterministic synthetic workloads (fixed token lengths) to simplify cross-run comparisons. Adjust token distributions, concurrency, or request counts to match your production scenario.
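A common next step is sweeping concurrency to trace the latency/throughput curve rather than measuring a single operating point. A sketch of such a sweep, reusing the flags from the command above (the concurrency values and request-count scaling are illustrative):

# Re-run the same profile at several concurrency levels; AIPerf separates
# results into per-run artifact directories keyed by model and concurrency.
import subprocess

for concurrency in (1, 2, 4, 8):
    subprocess.run(
        [
            "aiperf", "profile",
            "--model", "Qwen/Qwen3-0.6B",
            "--endpoint-type", "chat",
            "--endpoint", "/v1/chat/completions",
            "--url", "http://localhost:8001",
            "--streaming",
            "--concurrency", str(concurrency),
            "--request-count", str(concurrency * 4),
            "--warmup-request-count", "2",
            "--synthetic-input-tokens-mean", "64",
            "--synthetic-input-tokens-stddev", "0",
            "--output-tokens-mean", "128",
            "--output-tokens-stddev", "0",
            "--conversation-num", "4",
            "--random-seed", "7",
        ],
        check=True,
    )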


Chat Completions (Vision-Language) — Qwen/Qwen2.5-VL-3B-Instruct-AWQ

1. Launch the vLLM server

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-VL-3B-Instruct-AWQ \
  --port 8002 \
  --trust-remote-code

Wait for the logs to report successful model loading; the server is then ready at http://localhost:8002/v1/chat/completions.

2. Verify the endpoint with an image prompt

Run the following smoke test to verify the vision-language endpoint. It embeds the included architecture diagram as a base64 data URL and requests a caption through the OpenAI-compatible chat.completions API.

python - <<'PY'
from openai import OpenAI
import base64, json
from pathlib import Path

client = OpenAI(base_url="http://localhost:8002/v1", api_key="dummy")
image_b64 = base64.b64encode(Path("docs/diagrams/aiperf-diagram-256.png").read_bytes()).decode()

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-3B-Instruct-AWQ",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this diagram"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}", "detail": "low"},
                },
            ],
        }
    ],
    max_tokens=128,
)
print(json.dumps(resp.to_dict(), indent=2)[:800])
PY

The client raises if the request fails; a successful call prints a short textual description of the diagram.

3. Benchmark with AIPerf

AIPerf can generate synthetic images for vision-language benchmarking using the --image-width-mean and --image-height-mean parameters:

aiperf profile \
  --model Qwen/Qwen2.5-VL-3B-Instruct-AWQ \
  --endpoint-type chat \
  --endpoint /v1/chat/completions \
  --url http://localhost:8002 \
  --concurrency 2 \
  --request-count 8 \
  --warmup-request-count 2 \
  --synthetic-input-tokens-mean 64 \
  --synthetic-input-tokens-stddev 0 \
  --output-tokens-mean 128 \
  --output-tokens-stddev 0 \
  --image-width-mean 100 \
  --image-width-stddev 0 \
  --image-height-mean 100 \
  --image-height-stddev 0 \
  --random-seed 7
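The image flags set the dimensions of the synthetic images AIPerf attaches to each request. For a rough feel of the payload a 100×100 image adds once embedded as a base64 data URL (the same encoding the smoke test above uses), a hypothetical sketch with Pillow; AIPerf's actual image generation may differ:

# Estimate the data-URL size of a 100x100 PNG, approximating the extra
# request payload introduced by the synthetic image flags above.
import base64
import io

from PIL import Image

buf = io.BytesIO()
Image.new("RGB", (100, 100), "gray").save(buf, format="PNG")
b64 = base64.b64encode(buf.getvalue()).decode()
print(f"PNG bytes: {len(buf.getvalue())}, "
      f"data-URL length: {len('data:image/png;base64,') + len(b64)}")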

4. Sample metrics snapshot

┏━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃         Metric ┃    avg ┃    min ┃    max ┃    p99 ┃    p90 ┃    p50 ┃   std ┃
┡━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│        Request │ 571.04 │ 367.46 │ 640.21 │ 640.19 │ 640.00 │ 625.36 │ 92.21 │
│   Latency (ms) │        │        │        │        │        │        │       │
│         Output │ 111.62 │  70.00 │ 128.00 │ 128.00 │ 128.00 │ 126.50 │ 21.49 │
│       Sequence │        │        │        │        │        │        │       │
│         Length │        │        │        │        │        │        │       │
│       (tokens) │        │        │        │        │        │        │       │
│ Input Sequence │  64.00 │  64.00 │  64.00 │  64.00 │  64.00 │  64.00 │  0.00 │
│         Length │        │        │        │        │        │        │       │
│       (tokens) │        │        │        │        │        │        │       │
│   Output Token │ 372.23 │    N/A │    N/A │    N/A │    N/A │    N/A │   N/A │
│     Throughput │        │        │        │        │        │        │       │
│   (tokens/sec) │        │        │        │        │        │        │       │
│        Request │   3.33 │    N/A │    N/A │    N/A │    N/A │    N/A │   N/A │
│     Throughput │        │        │        │        │        │        │       │
│ (requests/sec) │        │        │        │        │        │        │       │
│  Request Count │   8.00 │    N/A │    N/A │    N/A │    N/A │    N/A │   N/A │
│     (requests) │        │        │        │        │        │        │       │
└────────────────┴────────┴────────┴────────┴────────┴────────┴────────┴───────┘

Embeddings — Qwen/Qwen3-Embedding-0.6B

1. Launch the vLLM server

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-Embedding-0.6B \
  --port 8003 \
  --trust-remote-code

Wait for the logs to report successful model loading; the server is then ready at http://localhost:8003/v1/embeddings.

2. Verify the endpoint

curl -s -o /tmp/embed_smoke.json -w "%{http_code}" \
  http://localhost:8003/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-Embedding-0.6B",
        "input": "hello world"
      }'
cat /tmp/embed_smoke.json

Expect a 200 status and an embedding array of 1,024 floats (the native dimensionality of Qwen3-Embedding-0.6B).
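To confirm the dimensionality programmatically, a minimal sketch using the openai client (an assumption; any OpenAI-compatible client works):

# Request one embedding and print its length; expect 1024 for this model.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8003/v1", api_key="dummy")
resp = client.embeddings.create(
    model="Qwen/Qwen3-Embedding-0.6B",
    input="hello world",
)
print(len(resp.data[0].embedding))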

3. Run the AIPerf benchmark

AIPERF_LOG_LEVEL=info HF_TOKEN=... \
aiperf profile \
  --model Qwen/Qwen3-Embedding-0.6B \
  --endpoint-type embeddings \
  --endpoint /v1/embeddings \
  --url http://localhost:8003 \
  --concurrency 8 \
  --request-count 32 \
  --warmup-request-count 4 \
  --synthetic-input-tokens-mean 128 \
  --synthetic-input-tokens-stddev 0 \
  --random-seed 7

4. Sample metrics snapshot

┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃              Metric ┃    avg ┃   min ┃   max ┃   p99 ┃   p90 ┃   p50 ┃   std ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━┩
│     Request Latency │  32.53 │ 12.21 │ 45.16 │ 45.16 │ 45.10 │ 38.40 │ 11.85 │
│                (ms) │        │       │       │       │       │       │       │
│  Request Throughput │ 219.88 │   N/A │   N/A │   N/A │   N/A │   N/A │   N/A │
│      (requests/sec) │        │       │       │       │       │       │       │
│       Request Count │  32.00 │   N/A │   N/A │   N/A │   N/A │   N/A │   N/A │
│          (requests) │        │       │       │       │       │       │       │
└─────────────────────┴────────┴───────┴───────┴───────┴───────┴───────┴───────┘