
AIPerf Profiling of Text, Image, & Embeddings Endpoints

This tutorial captures end-to-end reference flows for running AIPerf against vLLM-hosted models. Each chapter covers a specific OpenAI-compatible endpoint: how to launch the vLLM server, run the AIPerf benchmark, and interpret sample results collected on a 1x H100 system.

Note

The vLLM examples rely on a main-branch build (commit 081b5594a2b1a37ea793659bb6767c497beef45d) for Qwen3 model support. Once an official release ships with matching capabilities, substitute the pip install command accordingly.

Prerequisites

  • Ubuntu 22.04+ with NVIDIA H100 (or equivalent Hopper GPU) and recent CUDA drivers.
  • Python 3.10+ with a project virtual environment (assumed at .venv). If you don't already have one, create it and install the project along with vLLM:
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -e .
pip install --no-cache-dir git+https://github.com/vllm-project/vllm@main

With prerequisites in place, proceed to the endpoint-specific guides below.

All commands below assume you are in the repository root with the project virtual environment (.venv) already activated.
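To confirm you are actually in that environment, a quick check (the distribution name aiperf is an assumption about how the AIPerf package installs; adjust if yours differs):

# Verify the key packages resolve inside the active virtual environment.
# "aiperf" as a distribution name is an assumption -- adjust if the
# package installs under a different name.
from importlib.metadata import PackageNotFoundError, version

for pkg in ("vllm", "aiperf"):
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed -- revisit the prerequisites above")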

Chat Completions — Qwen/Qwen3-0.6B

1. Launch the vLLM server

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-0.6B \
  --port 8001 \
  --trust-remote-code

Wait for the logs to report successful CUDA graph capture; the server is then ready at http://localhost:8001/v1/chat/completions.
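If you prefer scripting the wait instead of tailing logs, a minimal sketch that polls vLLM's health route (the OpenAI-compatible server exposes GET /health, which returns 200 once startup completes):

# Poll the vLLM server until it reports healthy, then proceed.
import time
import urllib.request

HEALTH_URL = "http://localhost:8001/health"

for attempt in range(60):
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=2) as resp:
            if resp.status == 200:
                print(f"server ready after {attempt + 1} checks")
                break
    except OSError:
        pass  # connection refused or timed out; server still starting
    time.sleep(5)
else:
    raise SystemExit("server did not become ready in time")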

2. Verify the endpoint

curl -s -o /tmp/chat_smoke.json -w "%{http_code}" \
  http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-0.6B",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 5
      }'
cat /tmp/chat_smoke.json

Expect a 200 status and a short assistant reply payload.
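Because the benchmark below runs with --streaming, time to first token (TTFT) is a headline metric. You can sanity-check it by hand before profiling; a minimal sketch using the openai client (any OpenAI-compatible client works, and vLLM ignores the API key by default):

# Measure time to first streamed token, roughly what AIPerf reports as TTFT.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8001/v1", api_key="dummy")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=32,
    stream=True,
)
for chunk in stream:
    # The first chunk carrying content marks the first token's arrival.
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"TTFT: {(time.perf_counter() - start) * 1000:.1f} ms")
        break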

3. Run the AIPerf benchmark

aiperf profile \
  --model Qwen/Qwen3-0.6B \
  --endpoint-type chat \
  --endpoint /v1/chat/completions \
  --url http://localhost:8001 \
  --streaming \
  --concurrency 4 \
  --request-count 16 \
  --warmup-request-count 2 \
  --synthetic-input-tokens-mean 64 \
  --synthetic-input-tokens-stddev 0 \
  --output-tokens-mean 128 \
  --output-tokens-stddev 0 \
  --conversation-num 4 \
  --random-seed 7

The run completes in ~2 seconds on an H100 and emits CSV/JSON exports under artifacts/Qwen_Qwen3-0.6B-openai-chat-concurrency4/ for downstream analysis.
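To preview those exports without assuming their exact file names, you can glob the artifact directory; a minimal sketch:

# List the exports AIPerf wrote for this run and peek at each CSV header.
from pathlib import Path

art_dir = Path("artifacts/Qwen_Qwen3-0.6B-openai-chat-concurrency4")
for path in sorted(art_dir.glob("*")):
    print(path.name, f"({path.stat().st_size} bytes)")
    if path.suffix == ".csv":
        lines = path.read_text().splitlines()
        print("    " + "\n    ".join(lines[:2]))  # header + first data row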

4. Sample metrics snapshot

┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━┓
┃         Metric ┃      avg ┃    min ┃    max ┃    p99 ┃    p90 ┃    p50 ┃  std ┃
┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━┩
│  Time to First │    13.00 │  10.81 │  15.62 │  15.57 │  15.23 │  13.18 │ 1.54 │
│     Token (ms) │          │        │        │        │        │        │      │
│        Request │   410.50 │ 404.00 │ 416.77 │ 416.69 │ 416.20 │ 411.07 │ 4.09 │
│   Latency (ms) │          │        │        │        │        │        │      │
│   Output Token │ 1,241.17 │    N/A │    N/A │    N/A │    N/A │    N/A │  N/A │
│     Throughput │          │        │        │        │        │        │      │
│   (tokens/sec) │          │        │        │        │        │        │      │
│        Request │     9.70 │    N/A │    N/A │    N/A │    N/A │    N/A │  N/A │
│     Throughput │          │        │        │        │        │        │      │
│ (requests/sec) │          │        │        │        │        │        │      │
└────────────────┴──────────┴────────┴────────┴────────┴────────┴────────┴──────┘

These figures were gathered with deterministic synthetic workloads (fixed token lengths) to simplify cross-run comparisons. Adjust token distributions, concurrency, or request counts to match your production scenario.
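A common next step is sweeping concurrency to trace the latency/throughput curve rather than measuring a single operating point. A sketch of such a sweep, reusing the flags from the command above (the concurrency values and request-count scaling are illustrative):

# Re-run the same profile at several concurrency levels; AIPerf separates
# results into per-run artifact directories keyed by model and concurrency.
import subprocess

for concurrency in (1, 2, 4, 8):
    subprocess.run(
        [
            "aiperf", "profile",
            "--model", "Qwen/Qwen3-0.6B",
            "--endpoint-type", "chat",
            "--endpoint", "/v1/chat/completions",
            "--url", "http://localhost:8001",
            "--streaming",
            "--concurrency", str(concurrency),
            "--request-count", str(concurrency * 4),
            "--warmup-request-count", "2",
            "--synthetic-input-tokens-mean", "64",
            "--synthetic-input-tokens-stddev", "0",
            "--output-tokens-mean", "128",
            "--output-tokens-stddev", "0",
            "--conversation-num", "4",
            "--random-seed", "7",
        ],
        check=True,
    )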


Chat Completions (Vision-Language) — Qwen/Qwen2.5-VL-3B-Instruct-AWQ

1. Launch the vLLM server

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-VL-3B-Instruct-AWQ \
  --port 8002 \
  --trust-remote-code

Wait for the logs to report successful model loading; the server is then ready at http://localhost:8002/v1/chat/completions.

2. Verify the endpoint with an image prompt

Run the following smoke test to verify the vision-language endpoint. It embeds the included architecture diagram as a base64 data URL and requests a caption through the OpenAI-compatible chat.completions API.

python - <<'PY'
from openai import OpenAI
import base64, json
from pathlib import Path

client = OpenAI(base_url="http://localhost:8002/v1", api_key="dummy")
image_b64 = base64.b64encode(Path("docs/diagrams/aiperf-diagram-256.png").read_bytes()).decode()

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-3B-Instruct-AWQ",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this diagram"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}", "detail": "low"},
                },
            ],
        }
    ],
    max_tokens=128,
)
print(json.dumps(resp.to_dict(), indent=2)[:800])
PY

The client raises if the request fails; a successful call prints a short textual description of the diagram.

3. Benchmark with AIPerf

AIPerf can generate synthetic images for vision-language benchmarking using the --image-width-mean and --image-height-mean parameters:

aiperf profile \
  --model Qwen/Qwen2.5-VL-3B-Instruct-AWQ \
  --endpoint-type chat \
  --endpoint /v1/chat/completions \
  --url http://localhost:8002 \
  --concurrency 2 \
  --request-count 8 \
  --warmup-request-count 2 \
  --synthetic-input-tokens-mean 64 \
  --synthetic-input-tokens-stddev 0 \
  --output-tokens-mean 128 \
  --output-tokens-stddev 0 \
  --image-width-mean 100 \
  --image-width-stddev 0 \
  --image-height-mean 100 \
  --image-height-stddev 0 \
  --random-seed 7
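The image flags set the dimensions of the synthetic images AIPerf attaches to each request. For a rough feel of the payload a 100×100 image adds once embedded as a base64 data URL (the same encoding the smoke test above uses), a hypothetical sketch with Pillow; AIPerf's actual image generation may differ:

# Estimate the data-URL size of a 100x100 PNG, approximating the extra
# request payload introduced by the synthetic image flags above.
import base64
import io

from PIL import Image

buf = io.BytesIO()
Image.new("RGB", (100, 100), "gray").save(buf, format="PNG")
b64 = base64.b64encode(buf.getvalue()).decode()
print(f"PNG bytes: {len(buf.getvalue())}, "
      f"data-URL length: {len('data:image/png;base64,') + len(b64)}")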

4. Sample metrics snapshot

┏━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃         Metric ┃    avg ┃    min ┃    max ┃    p99 ┃    p90 ┃    p50 ┃   std ┃
┡━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│        Request │ 571.04 │ 367.46 │ 640.21 │ 640.19 │ 640.00 │ 625.36 │ 92.21 │
│   Latency (ms) │        │        │        │        │        │        │       │
│         Output │ 111.62 │  70.00 │ 128.00 │ 128.00 │ 128.00 │ 126.50 │ 21.49 │
│       Sequence │        │        │        │        │        │        │       │
│         Length │        │        │        │        │        │        │       │
│       (tokens) │        │        │        │        │        │        │       │
│ Input Sequence │  64.00 │  64.00 │  64.00 │  64.00 │  64.00 │  64.00 │  0.00 │
│         Length │        │        │        │        │        │        │       │
│       (tokens) │        │        │        │        │        │        │       │
│   Output Token │ 372.23 │    N/A │    N/A │    N/A │    N/A │    N/A │   N/A │
│     Throughput │        │        │        │        │        │        │       │
│   (tokens/sec) │        │        │        │        │        │        │       │
│        Request │   3.33 │    N/A │    N/A │    N/A │    N/A │    N/A │   N/A │
│     Throughput │        │        │        │        │        │        │       │
│ (requests/sec) │        │        │        │        │        │        │       │
│  Request Count │   8.00 │    N/A │    N/A │    N/A │    N/A │    N/A │   N/A │
│     (requests) │        │        │        │        │        │        │       │
└────────────────┴────────┴────────┴────────┴────────┴────────┴────────┴───────┘

Embeddings — Qwen/Qwen3-Embedding-0.6B

1. Launch the vLLM server

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-Embedding-0.6B \
  --port 8003 \
  --trust-remote-code

Wait for the logs to report successful model loading; the server is then ready at http://localhost:8003/v1/embeddings.

2. Verify the endpoint

curl -s -o /tmp/embed_smoke.json -w "%{http_code}" \
  http://localhost:8003/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-Embedding-0.6B",
        "input": "hello world"
      }'
cat /tmp/embed_smoke.json

Expect a 200 status and an embedding array of 1,024 floats (the native dimensionality of Qwen3-Embedding-0.6B).
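To confirm the dimensionality programmatically, a minimal sketch using the openai client (an assumption; any OpenAI-compatible client works):

# Request one embedding and print its length; expect 1024 for this model.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8003/v1", api_key="dummy")
resp = client.embeddings.create(
    model="Qwen/Qwen3-Embedding-0.6B",
    input="hello world",
)
print(len(resp.data[0].embedding))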

3. Run the AIPerf benchmark

AIPERF_LOG_LEVEL=info HF_TOKEN=... \
aiperf profile \
  --model Qwen/Qwen3-Embedding-0.6B \
  --endpoint-type embeddings \
  --endpoint /v1/embeddings \
  --url http://localhost:8003 \
  --concurrency 8 \
  --request-count 32 \
  --warmup-request-count 4 \
  --synthetic-input-tokens-mean 128 \
  --synthetic-input-tokens-stddev 0 \
  --random-seed 7

4. Sample metrics snapshot

┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃              Metric ┃    avg ┃   min ┃   max ┃   p99 ┃   p90 ┃   p50 ┃   std ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━┩
│     Request Latency │  32.53 │ 12.21 │ 45.16 │ 45.16 │ 45.10 │ 38.40 │ 11.85 │
│                (ms) │        │       │       │       │       │       │       │
│  Request Throughput │ 219.88 │   N/A │   N/A │   N/A │   N/A │   N/A │   N/A │
│      (requests/sec) │        │       │       │       │       │       │       │
│       Request Count │  32.00 │   N/A │   N/A │   N/A │   N/A │   N/A │   N/A │
│          (requests) │        │       │       │       │       │       │       │
└─────────────────────┴────────┴───────┴───────┴───────┴───────┴───────┴───────┘