AIPerf Comprehensive Benchmarking Guide - WIP

AIPerf: Comprehensive LLM Benchmarking

Presentation Date: November 13, 2025
Tool: AIPerf v0.3.0


Table of Contents

  1. Setup: Installing AIPerf 0.3.0
  2. Test Endpoint Details
  3. Use Case 1: Simple Profiling with Static ISL/OSL
  4. Use Case 2: Auditing Raw Results - Custom Percentile Analysis
  5. Use Case 3: Trace-Based Benchmarking with Mooncake
  6. Use Case 4: Goodput Analysis - Measuring SLA Compliance
  7. Use Case 5: Time-Sliced Analysis - Performance Over Time
  8. Summary
  9. Advanced Topics
  10. Coming Soon

Setup: Installing AIPerf 0.3.0

Note: AIPerf 0.3.0 is not yet available on PyPI. Install from the GitHub repository:

# Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate

# Install AIPerf from release/0.3.0 branch
pip install --upgrade git+https://github.com/ai-dynamo/aiperf.git@release/0.3.0

# Verify installation
aiperf --version
# Expected output: 0.3.0

Why 0.3.0?

  • ✅ Fixed dashboard UI bug with tokenizer downloads
  • ✅ Improved stability and performance
  • ✅ Enhanced reporting features
  • ✅ Direct aiperf command (no need for python -m aiperf)

Test Endpoint Details

Note: This was a demo endpoint used for the November 13, 2025 presentation. The cluster has been taken down.

Model: Qwen3-0.6B (Qwen/Qwen3-0.6B)
Inference Engine: vLLM v0.11.0
Architecture: 8-way data parallelism (8 independent vLLM replicas)
Hardware: 8x NVIDIA H200 GPUs (1 GPU per replica)
Deployment: Kubernetes on Nebius Cloud

Demo Endpoint (no longer active):

# This endpoint was available during the demo:
# export ENDPOINT_URL="http://89.169.112.187:8000"

Why this endpoint was chosen for the demo:

  • Small model (~600M parameters) = high throughput for benchmarking
  • 8 replicas = demonstrated horizontal scaling
  • Public access = allowed live demonstration

Use Case 1: Simple Profiling with Static ISL/OSL

Goal: Measure baseline performance under controlled load

Command

aiperf profile \
  --model qwen3-0.6b \
  --url $ENDPOINT_URL \
  --endpoint-type chat \
  --streaming \
  --concurrency 100 \
  --request-count 1000 \
  --isl 1000 \
  --osl 500 \
  --tokenizer Qwen/Qwen3-0.6B

Parameters Explained

Arg             | Value           | Purpose
----------------|-----------------|-------------------------------------
--model         | qwen3-0.6b      | Model identifier (matches endpoint)
--url           | $ENDPOINT_URL   | Target inference server
--endpoint-type | chat            | OpenAI chat completions API
--streaming     | (flag)          | Enable token streaming
--concurrency   | 100             | Simultaneous connections
--request-count | 1000            | Total requests to send
--isl           | 1000            | Input tokens per request
--osl           | 500             | Output tokens per response
--tokenizer     | Qwen/Qwen3-0.6B | HuggingFace tokenizer for accuracy

Key Insight: This creates 100 "virtual users" sending 1,000 requests total with large payloads (1000→500 tokens).

Results

                          NVIDIA AIPerf | LLM Metrics                           
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ Metric                  ┃      avg ┃    min ┃    max ┃    p99 ┃    p90 ┃    p50 ┃   std ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ Time to First Token (ms)│   347.15 │ 204.55 │1,052.66│  815.02│  577.05│  289.49│ 143.57│
│ Request Latency (ms)    │ 2,101.75 │ 693.08 │4,770.98│3,613.75│2,319.79│2,057.50│ 303.17│
│ Inter Token Latency (ms)│     3.57 │   1.99 │   8.55 │   5.78 │   3.93 │   3.49 │   0.54│
│ Output Token Throughput │22,521.42 │    N/A │    N/A │    N/A │    N/A │    N/A │    N/A│
│           (tokens/sec)  │          │        │        │        │        │        │       │
│ Request Throughput      │    45.70 │    N/A │    N/A │    N/A │    N/A │    N/A │    N/A│
│      (requests/sec)     │          │        │        │        │        │        │       │
│ Request Count           │ 1,000.00 │    N/A │    N/A │    N/A │    N/A │    N/A │    N/A│
└─────────────────────────┴──────────┴────────┴────────┴────────┴────────┴────────┴───────┘

Benchmark Duration: 21.88 sec
Success Rate: 100% (0 errors)

Key Takeaways

TTFT = 347ms: Fast first token delivery - users see responses quickly
Request Latency = 2.1s: Total time to generate 500 tokens per request
System Throughput = 22.5K tokens/sec: High capacity with 100 concurrent users
ITL = 3.57ms: Smooth, consistent token streaming
P99 Latency = 3.6s: Even worst-case requests complete reasonably fast

What we learned:

  • With 100 concurrent users and large payloads (1000→500 tokens), the system maintained stable performance
  • P99 latency (3.6s) is only ~70% above the average (2.1s) - good consistency at the tail
  • Zero errors = reliable service under load
  • 22.5K tokens/sec sustained throughput demonstrates 8-replica scaling effectiveness

Evolution: Pareto Curve Analysis - Resource Efficiency vs. User Experience

Goal: Understand the trade-off between resource utilization (TPS/GPU) and user experience (TPS/User) at different concurrency levels.

The Experiment

We ran the same benchmark at 5 different concurrency levels (10, 50, 100, 200, 500) to observe how throughput per GPU and throughput per user change:

# Example commands (run each separately)
aiperf profile --model qwen3-0.6b --url $ENDPOINT_URL \
  --endpoint-type chat --streaming --concurrency 10 \
  --request-count 1000 --isl 1000 --osl 500 \
  --tokenizer Qwen/Qwen3-0.6B --artifact-dir artifacts/pareto-c10

# (Repeat for concurrency: 50, 100, 200, 500)
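If you prefer to script the sweep rather than run each command by hand, here is a minimal Python sketch. It assumes the aiperf CLI and flags shown above, and that ENDPOINT_URL is exported in your environment:

import os
import subprocess

# Sweep driver for the Pareto experiment: re-runs the benchmark above
# at each concurrency level, writing artifacts to separate directories.
endpoint_url = os.environ["ENDPOINT_URL"]  # same env var used in the shell examples

for c in [10, 50, 100, 200, 500]:
    subprocess.run(
        [
            "aiperf", "profile",
            "--model", "qwen3-0.6b",
            "--url", endpoint_url,
            "--endpoint-type", "chat",
            "--streaming",
            "--concurrency", str(c),
            "--request-count", "1000",
            "--isl", "1000",
            "--osl", "500",
            "--tokenizer", "Qwen/Qwen3-0.6B",
            "--artifact-dir", f"artifacts/pareto-c{c}",
        ],
        check=True,
    )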

Results: The Pareto Curve

Concurrency | Total TPS | TPS/GPU | TPS/User | TTFT (avg)
------------|-----------|---------|----------|-----------
10          | 3,045     | 1,522   | 364.69   | ~250 ms
50          | 12,890    | 6,445   | 326.10   | ~270 ms
100         | 22,521    | 11,261  | 285.03   | ~347 ms
200         | 35,999    | 18,000  | 238.67   | ~420 ms
500         | 29,836    | 14,918  | 128.85   | ~1,129 ms

Hardware: 8 vLLM replicas on 8x NVIDIA H200 GPUs (one replica per GPU)

Visualizing the Trade-off

[Visual: Pareto Curve Chart - See presentation slide]
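In place of the slide, here is a small, illustrative matplotlib sketch that re-plots the measured values from the table above (the numbers are copied from that table, not recomputed):

import matplotlib.pyplot as plt

# Measured values from the Pareto table above
concurrency = [10, 50, 100, 200, 500]
tps_per_gpu = [1522, 6445, 11261, 18000, 14918]
tps_per_user = [364.69, 326.10, 285.03, 238.67, 128.85]

fig, ax = plt.subplots()
ax.plot(tps_per_user, tps_per_gpu, marker="o")
for c, x, y in zip(concurrency, tps_per_user, tps_per_gpu):
    ax.annotate(f"c={c}", (x, y), textcoords="offset points", xytext=(5, 5))
ax.set_xlabel("TPS per user (user experience)")
ax.set_ylabel("TPS per GPU (resource efficiency)")
ax.set_title("Pareto curve: resource efficiency vs. user experience")
fig.savefig("pareto_curve.png", dpi=150)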

The Pareto frontier shows the inverse relationship between resource efficiency and user experience:

Point          | Concurrency | TPS/GPU | TPS/User | Interpretation
---------------|-------------|---------|----------|--------------------------------------------
Far Right      | c=10        | 1,522   | 365      | Best user experience, poor GPU utilization
Moving Up-Left | c=50        | 6,445   | 326      | Trading UX for efficiency
               | c=100       | 11,261  | 285      | Balanced middle ground
Peak ⭐        | c=200       | 18,000  | 239      | Maximum GPU efficiency
Collapse       | c=500       | 14,918  | 129      | Over-saturation degrades both

Key Insight: The Pareto curve demonstrates you cannot optimize both metrics simultaneously. Choose your operating point based on whether you prioritize cost efficiency (c=200) or user experience (c=10-50).

Key Insights from the Pareto Curve

Low Concurrency (10-50):

  • Best user experience: 365 tokens/sec per user = very responsive
  • Poor resource utilization: Only 1,500-6,500 TPS/GPU = GPUs are underutilized
  • Use case: Premium tier, low-latency applications

Medium Concurrency (100-200):

  • Balanced performance: ~11,000-18,000 TPS/GPU
  • Good user experience: ~240-285 tokens/sec per user
  • Sweet spot at c=200: Peak resource utilization (18K TPS/GPU) with acceptable user experience
  • Use case: General production workloads

High Concurrency (500+):

  • Degraded resource utilization: TPS/GPU drops from 18K → 15K
  • Poor user experience: 129 tokens/sec per user, TTFT = 1.1 seconds
  • Queuing dominates: Request backlog causes both metrics to degrade
  • Use case: Avoid this region unless cost is the only priority

The Business Trade-off

Question: Should you optimize for cost efficiency (max TPS/GPU) or user satisfaction (max TPS/User)?

Priority        | Optimal Concurrency | Justification
----------------|---------------------|-----------------------------------------
User Experience | 10-50               | Sub-300ms TTFT, 325+ tokens/sec/user
Balanced        | 100-200             | Up to 18K TPS/GPU, 240+ tokens/sec/user
Cost Efficiency | 200                 | Peak TPS/GPU before degradation

The c=200 "sweet spot":

  • 12x better resource utilization vs. c=10 (18K vs. 1.5K TPS/GPU)
  • Only 35% reduction in per-user throughput (239 vs. 365 tokens/sec/user)
  • TTFT still under 500ms for most requests

What We Learned

🔍 Performance is non-linear: Doubling concurrency doesn't double throughput
📊 The rise-and-fall curve: TPS/GPU climbs, peaks at c=200, then falls due to queuing overhead
⚖️ No free lunch: Higher concurrency = better GPU utilization BUT worse user experience
🎯 Know your SLA: Choose concurrency based on your latency vs. throughput priorities

Pro tip: Run this analysis on YOUR endpoint with YOUR request patterns to find YOUR sweet spot!


Use Case 2: Auditing Raw Results - Custom Percentile Analysis

Scenario: Your management defines SLAs using P75, not the standard P50/P90/P99 that AIPerf reports by default.

Goal: Calculate P75 TTFT from raw benchmark data.

Understanding the Raw Data: profile_export.jsonl

AIPerf outputs detailed per-request data in profile_export.jsonl. Each line is a JSON record:

{
  "metadata": {
    "session_num": 87,
    "x_request_id": "abd8df1a-7904-4aa0-8107-0d74ba0ac0d7",
    "turn_index": 0,
    "request_start_ns": 1763066701865462000,
    "request_end_ns": 1763066703082535666,
    "worker_id": "worker_b431129c"
  },
  "metrics": {
    "time_to_first_token": {
      "value": 582.66,
      "unit": "ms"
    },
    "output_token_count": {
      "value": 194,
      "unit": "tokens"
    },
    "request_latency": {
      "value": 1210.008,
      "unit": "ms"
    },
    "input_sequence_length": {
      "value": 1000,
      "unit": "tokens"
    },
    "output_sequence_length": {
      "value": 194,
      "unit": "tokens"
    },
    "inter_token_latency": {
      "value": 3.25,
      "unit": "ms"
    }
  }
}

Key fields: Every request has time_to_first_token, request_latency, ISL, OSL, and more.

Calculating P75 TTFT

import json
import numpy as np

# Read the TTFT value from every per-request record
ttft_values = []
with open("artifacts/.../profile_export.jsonl", "r") as f:
    for line in f:
        record = json.loads(line)
        metrics = record.get("metrics", {})
        # Skip records without a TTFT (e.g., failed requests)
        if "time_to_first_token" in metrics:
            ttft_values.append(metrics["time_to_first_token"]["value"])

# Calculate P75
p75_ttft = np.percentile(ttft_values, 75)
print(f"P75 TTFT: {p75_ttft:.2f} ms")

Results from Our Benchmark

============================================================
TTFT Percentile Analysis
============================================================
Total requests analyzed: 1000

Percentiles (ms):
  P25 (25th percentile): 242.45 ms
  P50 (50th percentile): 289.49 ms
  P75 (75th percentile): 422.87 ms  ⭐ YOUR SLA METRIC
  P90 (90th percentile): 577.05 ms
  P99 (99th percentile): 815.02 ms
============================================================

Key Takeaways

P75 = 422.87ms: 75% of requests get first token within this time
Raw data access: Calculate ANY custom metric your org needs
Full transparency: Every request is logged with complete metrics
Easy parsing: Standard JSON format, one record per line

Why this matters:

  • Different orgs have different SLA definitions
  • P75 is a common SLA target (balance between typical and worst-case)
  • AIPerf's raw exports let you calculate ANY percentile or custom metric
  • No need to re-run benchmarks for a different analysis - recompute offline, as in the sketch below
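As an example of that flexibility, here is a slightly more general sketch building on the P75 script above. The percentile_report helper is hypothetical (not part of AIPerf); it only relies on the per-request record format shown earlier:

import json
import numpy as np

def percentile_report(path, metric_names, percentiles=(25, 50, 75, 90, 99)):
    """Collect per-request values for the given metrics and print custom percentiles."""
    values = {name: [] for name in metric_names}
    with open(path, "r") as f:
        for line in f:
            metrics = json.loads(line).get("metrics", {})
            for name in metric_names:
                if name in metrics:
                    values[name].append(metrics[name]["value"])
    for name, vals in values.items():
        if not vals:
            continue  # metric absent from this export
        stats = ", ".join(f"P{p}={np.percentile(vals, p):.2f}" for p in percentiles)
        print(f"{name} ({len(vals)} requests): {stats}")

# Example: the metrics shown in the sample record above
percentile_report(
    "artifacts/.../profile_export.jsonl",
    ["time_to_first_token", "request_latency", "inter_token_latency"],
)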

Use Case 3: Trace-Based Benchmarking with Mooncake

Goal: Test your system under realistic production workload patterns using privacy-preserving traces.

What is Mooncake Trace Data?

Mooncake is an open-source, KV-cache-centric LLM serving system that has released real production traces from its arXiv Q&A service. These traces capture actual user behavior, including:

  • Request arrival times
  • Input/output token lengths
  • Block hash IDs: Privacy-preserving identifiers for KV cache reuse patterns

Understanding Block Hashing

The Problem: Sharing production traces risks leaking sensitive user data.

Mooncake's Solution: Hash every 512-token block of input. Users asking about the same document get the same hash IDs, enabling cache reuse analysis without revealing content.

Example: Multi-turn conversation

Turn 1: User uploads paper (7,500 tokens) + question (500 tokens)
├─ Total: 8,000 tokens = 16 blocks
└─ Hash IDs: [46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61]

Turn 2: Same paper + different question (8,500 tokens)
├─ Total: 8,500 tokens = 17 blocks
├─ Hash IDs: [46-61] (reused!) + [62] (new)
└─ ✅ Cache hit rate: 94% (16/17 blocks reused)

Turn 3: Same paper + another question (9,000 tokens)
├─ Total: 9,000 tokens = 18 blocks  
├─ Hash IDs: [46-61] (reused!) + [62, 63] (new)
└─ ✅ Cache hit rate: 89% (16/18 blocks reused)

Key insight: Hash IDs reveal cache reuse opportunities while completely protecting user privacy.
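To make the idea concrete, here is an illustrative Python sketch of prefix-chained block hashing. The exact scheme Mooncake uses may differ; this just shows why two requests sharing the same long prefix end up with the same leading hash IDs:

import hashlib

BLOCK_SIZE = 512  # tokens per block, as in the Mooncake traces

def block_hashes(token_ids):
    """Hash each 512-token block, chaining in the previous block's hash so that
    identical prefixes always produce identical hash sequences."""
    hashes, prev = [], b""
    for start in range(0, len(token_ids), BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        digest = hashlib.sha256(prev + str(block).encode()).hexdigest()[:8]
        hashes.append(digest)
        prev = digest.encode()
    return hashes

# Two turns that share the same 7,500-token paper but ask different questions
paper = list(range(7_500))                    # stand-in for the tokenized paper
turn1 = paper + list(range(10_000, 10_500))   # paper + question 1 (500 tokens)
turn2 = paper + list(range(20_000, 21_000))   # paper + question 2 (1,000 tokens)

h1, h2 = block_hashes(turn1), block_hashes(turn2)
shared = sum(a == b for a, b in zip(h1, h2))
print(f"Turn 2 reuses {shared}/{len(h2)} blocks")  # blocks covering only the shared paper match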

The Mooncake arXiv Trace Dataset

======================================================================
MOONCAKE ARXIV TRACE - DATASET CHARACTERISTICS
======================================================================

📊 OVERALL STATISTICS
  Total Requests: 23,608
  Duration: 60.0 minutes (3,600 seconds)
  Avg Request Rate: 393.5 requests/minute

📏 TOKEN DISTRIBUTION (Input + Output)
  Mean: 8,772 tokens
  Median: 6,402 tokens
  P25: 3,331 tokens  |  P75: 7,562 tokens
  P90: 17,140 tokens |  P99: 61,961 tokens
  Max: 125,878 tokens

📊 REQUEST SIZE DISTRIBUTION
  Token Range          | Count  | Percentage | Visualization
  ──────────────────────────────────────────────────────────
       0 -  5,000   | 7,632 |  32.3%    | ████████████████
   5,000 - 10,000   | 11,626|  49.2%    | ████████████████████████
  10,000 - 20,000   | 2,499 |  10.6%    | █████
  20,000 - 40,000   | 1,325 |   5.6%    | ██
  40,000 - 60,000   |   272 |   1.2%    | 
  60,000 - 80,000   |   135 |   0.6%    | 
  80,000 - 100,000  |    65 |   0.3%    | 
  100,000+          |    54 |   0.2%    | 

⏱️  REQUEST ARRIVAL PATTERN (5-minute windows)
  Time Window         | Requests | Load Pattern
  ───────────────────────────────────────────────────────
    0 -   4 min     | 1,765    | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
    5 -   9 min     | 1,657    | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
   10 -  14 min     | 1,875    | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
   15 -  19 min     | 1,860    | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
   20 -  24 min     | 1,992    | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
   25 -  29 min     | 2,010    | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
   30 -  34 min     | 2,012    | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
   35 -  39 min     | 2,063    | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
   40 -  44 min     | 2,133    | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
   45 -  49 min     | 2,026    | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
   50 -  54 min     | 2,125    | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
   55 -  59 min     | 1,680    | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
======================================================================

Key characteristics of real production traffic:

Highly Variable Request Sizes: 49% of requests are 5K-10K tokens, but tail extends to 125K
Long-Context Dominant: Median of 6,402 tokens vs. typical benchmarks using 1K-2K
Consistent Load: ~393 requests/minute with relatively steady arrival rate
Heavy Tail Distribution: 2% of requests exceed 40K tokens (production reality!)

This represents real-world patterns you won't get from synthetic benchmarks:

  • Multi-turn conversations (shared hash IDs across requests)
  • Variable request sizes (not uniform 1K/500 like Use Case 1)
  • Realistic timing (actual production arrival patterns)
  • Long-context queries that stress-test model limits

Running a Trace-Based Benchmark

# Download the Mooncake trace
curl -o mooncake_trace.jsonl https://raw.githubusercontent.com/kvcache-ai/Mooncake/refs/heads/main/FAST25-release/arxiv-trace/mooncake_trace.jsonl

# Option 1: Replay with original timing (for end-to-end system testing)
aiperf profile \
  --model qwen3-0.6b \
  --url $ENDPOINT_URL \
  --endpoint-type chat \
  --streaming \
  --input-file mooncake_trace.jsonl \
  --custom-dataset-type mooncake_trace \
  --fixed-schedule \
  --tokenizer Qwen/Qwen3-0.6B

# Option 2: Replay as fast as possible (for capacity testing)
aiperf profile \
  --model qwen3-0.6b \
  --url $ENDPOINT_URL \
  --endpoint-type chat \
  --streaming \
  --input-file mooncake_trace.jsonl \
  --custom-dataset-type mooncake_trace \
  --tokenizer Qwen/Qwen3-0.6B

Key Differences from Synthetic Benchmarks

Aspect          | Use Case 1 (Synthetic) | Use Case 3 (Trace-Based)
----------------|------------------------|---------------------------------------------
Request Pattern | Uniform (all 1000→500) | Variable, heavy-tailed (up to ~126K tokens)
Arrival Timing  | Constant concurrency   | Bursty, realistic timing
KV Cache        | No reuse patterns      | Real cache-sharing patterns
Use Case        | Steady-state capacity  | Production validation

Why Trace-Based Benchmarking Matters

Realistic Load Testing: Test how your system handles actual production patterns, not idealized synthetic load
KV Cache Validation: If you implement cache sharing (like Mooncake), trace data shows real hit rates
Capacity Planning: See performance under bursty traffic with variable request sizes
Privacy-Preserving: Hash-based traces enable sharing without exposing sensitive data

Pro tip: Use --fixed-schedule for end-to-end system validation (respects timing), or remove it to stress-test maximum throughput capacity.

Real Benchmark Results: 5-Minute Mooncake Trace (5x Speed)

We extracted the first 5 minutes of the Mooncake trace (1,765 requests) and sped it up 5x so it replays in about one minute.
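A minimal Python sketch of how such a subset could be prepared. It assumes each trace record carries a millisecond "timestamp" field, as in the public Mooncake arXiv trace; adjust the field name if your trace differs:

import json

# Keep the first 5 minutes of the trace and compress timestamps 5x
# so the replay finishes in ~1 minute.
SPEEDUP = 5
CUTOFF_MS = 5 * 60 * 1000

with open("mooncake_trace.jsonl") as src, open("mooncake_trace_5min_5x.jsonl", "w") as dst:
    for line in src:
        record = json.loads(line)
        if record["timestamp"] <= CUTOFF_MS:
            record["timestamp"] = record["timestamp"] // SPEEDUP
            dst.write(json.dumps(record) + "\n")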

# Replay the 5-minute subset (timestamps already sped up 5x)
aiperf profile \
  --model qwen3-0.6b \
  --url $ENDPOINT_URL \
  --endpoint-type chat \
  --streaming \
  --input-file mooncake_trace_5min_5x.jsonl \
  --custom-dataset-type mooncake_trace \
  --fixed-schedule \
  --tokenizer Qwen/Qwen3-0.6B

Results:

                          NVIDIA AIPerf | LLM Metrics                           
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┓
┃ Metric                ┃     avg ┃    min ┃     max ┃     p99 ┃     p90 ┃    p50 ┃     std ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━┩
│ Time to First Token   │  407.42 │ 212.68 │ 1,519.5 │  951.16 │  586.01 │ 370.20 │  150.12 │
│              (ms)     │         │        │         │         │         │        │         │
│ Request Latency (ms)  │ 1,171.0 │ 243.14 │ 6,665.7 │ 4,184.4 │ 2,615.9 │ 648.33 │  978.09 │
│ Inter Token Latency   │    5.97 │   0.00 │   88.31 │   17.88 │   10.72 │   4.54 │    5.46 │
│              (ms)     │         │        │         │         │         │        │         │
│ Output Sequence Length│  175.27 │   1.00 │ 1,165.0 │  761.65 │  510.00 │  28.00 │  220.30 │
│          (tokens)     │         │        │         │         │         │        │         │
│ Input Sequence Length │ 7,243.0 │ 890.00 │32,236.0 │27,260.0 │15,157.0 │6,344.0 │ 5,536.0 │
│          (tokens)     │         │        │         │         │         │        │         │
│ Output Token          │ 4,675.0 │    N/A │     N/A │     N/A │     N/A │    N/A │     N/A │
│ Throughput (tok/sec)  │         │        │         │         │         │        │         │
│ Request Throughput    │   26.68 │    N/A │     N/A │     N/A │     N/A │    N/A │     N/A │
│      (requests/sec)   │         │        │         │         │         │        │         │
│ Request Count         │ 1,690   │    N/A │     N/A │     N/A │     N/A │    N/A │     N/A │
│      (successful)     │         │        │         │         │         │        │         │
└───────────────────────┴─────────┴────────┴─────────┴─────────┴─────────┴────────┴─────────┘

Benchmark Duration: 63.35 sec
Success Rate: 96% (75 requests exceeded 32K context window)

Key Observations from Trace-Based Testing

Highly Variable Request Sizes:

  • Input: 890→32,236 tokens (36x range!)
  • Output: 1→1,165 tokens
  • Median input: 6,344 tokens (much larger than our synthetic 1K)

Performance Under Real Load:

  • TTFT = 407ms average despite inputs averaging over 7K tokens (6.3K median)
  • System handled 4,675 tokens/sec with bursty, variable traffic
  • P99 TTFT = 951ms (some large requests took longer, as expected)

Realistic Failures:

  • 75 requests (4%) exceeded Qwen3-0.6B's 32K context limit
  • This reveals a real operational constraint you'd miss with synthetic tests
  • Production insight: Need longer-context model or request filtering

Production Timing Patterns:

  • Trace shows realistic request bursts and lulls
  • Not constant load like --concurrency 100
  • More representative of actual user traffic patterns

What we learned from trace-based vs. synthetic testing:

  • Use Case 1 (synthetic): 100% success, uniform 1K→500 tokens, 22.5K TPS
  • Use Case 3 (trace): 96% success, variable 890→32K input tokens, 4.7K TPS, revealed context window issues

Trace-based testing exposes real-world challenges that synthetic benchmarks hide!


Use Case 4: Goodput Analysis - Measuring SLA Compliance

Goal: Measure what percentage of requests meet your defined Service Level Objectives (SLOs), not just average performance.

What is Goodput?

Goodput = The fraction of requests that meet ALL specified SLA thresholds.

Why it matters:

  • Throughput tells you how many requests/sec your system handles
  • Goodput tells you how many requests/sec deliver acceptable user experience
  • A system can have high throughput but low goodput if most requests miss SLAs!

Definition (from DistServe paper):

"Goodput measures the number of requests per second that meet specified service-level objectives (SLOs), providing a metric that directly reflects user-perceived quality of service."

Real-World Example: Why Goodput > Throughput

Imagine two systems serving 1000 requests/min:

  • System A: 950 requests under SLA, 50 requests timeout → 95% goodput
  • System B: 500 requests under SLA, 500 requests slow → 50% goodput

Both have the same throughput, but System A delivers 2x better user experience!

Running Goodput Analysis

We'll use the same Mooncake trace, but add SLO thresholds:

# Define SLA thresholds based on your business requirements
# Example: TTFT ≤ 370ms, Request Latency ≤ 648ms

aiperf profile \
  --model qwen3-0.6b \
  --url $ENDPOINT_URL \
  --endpoint-type chat \
  --streaming \
  --input-file mooncake_trace_5min_5x.jsonl \
  --custom-dataset-type mooncake_trace \
  --fixed-schedule \
  --tokenizer Qwen/Qwen3-0.6B \
  --goodput "time_to_first_token:370 request_latency:648"

Goodput Results

                          NVIDIA AIPerf | LLM Metrics                           
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┓
┃ Metric                ┃     avg ┃    min ┃     max ┃     p99 ┃     p90 ┃    p50 ┃     std ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━┩
│ Time to First Token   │  428.86 │ 209.96 │ 1,651.8 │ 1,109.7 │  649.21 │ 385.29 │  176.32 │
│              (ms)     │         │        │         │         │         │        │         │
│ Request Latency (ms)  │ 1,208.9 │ 229.80 │ 6,280.6 │ 4,350.7 │ 2,726.4 │ 691.07 │ 1,005.5 │
│ Request Throughput    │   26.67 │    N/A │     N/A │     N/A │     N/A │    N/A │     N/A │
│      (requests/sec)   │         │        │         │         │         │        │         │
│ Goodput               │    7.43 │    N/A │     N/A │     N/A │     N/A │    N/A │     N/A │  ⭐
│ (requests/sec)        │         │        │         │         │         │        │         │
└───────────────────────┴─────────┴────────┴─────────┴─────────┴─────────┴────────┴─────────┘

Benchmark Duration: 63.37 sec
Success Rate: 96% (75 requests exceeded 32K context window)

Key Insights from Goodput Analysis

Goodput vs. Throughput:

Total Throughput: 26.67 requests/sec (100%)
Goodput:           7.43 requests/sec (28%)  ⚠️
────────────────────────────────────────────
Only 28% of requests met BOTH SLO requirements!

Understanding the results:

  • SLO Thresholds: TTFT ≤ 370ms AND Request Latency ≤ 648ms
  • Average TTFT: 428ms (above threshold)
  • Median Latency: 691ms (above threshold)
  • 72% of requests failed to meet at least one SLO
  • Raw throughput doesn't reveal this user experience gap!

How SLO compliance works:

  • Requests must meet ALL SLO criteria to count toward goodput
  • A request with TTFT=350ms but latency=700ms fails (missed latency SLO)
  • A request with TTFT=400ms but latency=600ms fails (missed TTFT SLO)
  • Only requests with TTFT ≤ 370ms AND latency ≤ 648ms count as goodput (the sketch below recomputes this from the raw export)
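You can double-check the reported goodput yourself from the raw profile_export.jsonl (same per-request format as in Use Case 2). A minimal sketch, assuming the SLO thresholds used above:

import json

# SLO thresholds matching the --goodput flag above (both in ms)
SLO = {"time_to_first_token": 370, "request_latency": 648}

total = good = 0
with open("artifacts/.../profile_export.jsonl") as f:
    for line in f:
        metrics = json.loads(line).get("metrics", {})
        if not all(name in metrics for name in SLO):
            continue  # skip records missing either metric (e.g., failed requests)
        total += 1
        # A request counts toward goodput only if it meets ALL thresholds
        if all(metrics[name]["value"] <= limit for name, limit in SLO.items()):
            good += 1

print(f"{good}/{total} requests met all SLOs ({100 * good / total:.1f}%)")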

What Goodput Tells You That Metrics Don't

Metric       | What It Measures           | What It Misses
-------------|----------------------------|--------------------------------------
Average TTFT | Typical first-token delay  | Tail latency, SLA violations
P99 Latency  | Worst-case performance     | Overall SLA compliance rate
Throughput   | System capacity            | User experience quality
Goodput      | % of requests meeting SLAs | Nothing - it's the complete picture!

Using Goodput for Capacity Planning

Question: How many servers do I need to handle 1000 req/sec with 95% goodput?

Without goodput analysis:

  • Measure throughput: 26.67 req/sec per server
  • Calculate: 1000 / 26.67 = 38 servers
  • Problem: This assumes all requests meet SLAs! ❌

With goodput analysis:

  • Measure goodput: 7.43 req/sec per server (28% of throughput)
  • Calculate: 1000 / 7.43 = 135 servers
  • Reality: Need 3.5x more capacity to meet SLAs ✅

The cost of ignoring goodput: Underprovisioning by 250%!
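The same arithmetic as a quick sanity check in Python:

import math

target_rps, throughput_per_server, goodput_per_server = 1000, 26.67, 7.43
print(math.ceil(target_rps / throughput_per_server))  # 38 servers if you ignore SLAs
print(math.ceil(target_rps / goodput_per_server))     # 135 servers to actually meet them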

Adjusting SLOs for Your Business

Different use cases need different SLOs:

# Strict SLOs (premium tier)
--goodput "time_to_first_token:250 request_latency:500"

# Balanced SLOs (standard tier)
--goodput "time_to_first_token:370 request_latency:648"

# Relaxed SLOs (batch processing)
--goodput "time_to_first_token:600 request_latency:2500"

Pro tip: Set SLO thresholds based on your business requirements, then use goodput to measure compliance and plan capacity accordingly.


Use Case 5: Time-Sliced Analysis - Performance Over Time

Goal: Understand how performance metrics evolve during a benchmark to detect warm-up effects, degradation patterns, or load-dependent behavior.

What is Time-Slicing?

Time-slicing divides your benchmark into sequential time windows, computing metrics independently for each window.

Why it matters:

  • Detect warm-up effects: Identify cold-start latency vs. steady-state performance
  • Spot degradation: Find memory leaks or resource exhaustion over time
  • Understand load patterns: See how performance changes as traffic evolves
  • Validate SLAs over time: Ensure consistent performance, not just averages

Running Time-Sliced Analysis

We'll use the same Mooncake trace with 10-second time slices:

aiperf profile \
  --model qwen3-0.6b \
  --url $ENDPOINT_URL \
  --endpoint-type chat \
  --streaming \
  --input-file mooncake_trace_5min_5x.jsonl \
  --custom-dataset-type mooncake_trace \
  --fixed-schedule \
  --tokenizer Qwen/Qwen3-0.6B \
  --slice-duration 10

Output: AIPerf generates additional files:

  • profile_export_aiperf_timeslices.csv - Time-series data in tidy format
  • profile_export_aiperf_timeslices.json - Hierarchical time-series data
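The timeslice files hold the per-window aggregates; if you want full control, the same windows can be recomputed from the raw per-request export (field names as shown in Use Case 2). A minimal sketch:

import json
from collections import defaultdict
import numpy as np

SLICE_SEC = 10
ttft_by_slice = defaultdict(list)

with open("artifacts/.../profile_export.jsonl") as f:
    records = [json.loads(line) for line in f]

# Bucket requests into 10-second windows relative to the first request's start time
t0 = min(r["metadata"]["request_start_ns"] for r in records)
for r in records:
    metrics = r.get("metrics", {})
    if "time_to_first_token" not in metrics:
        continue
    slice_idx = int((r["metadata"]["request_start_ns"] - t0) / 1e9 // SLICE_SEC)
    ttft_by_slice[slice_idx].append(metrics["time_to_first_token"]["value"])

for idx in sorted(ttft_by_slice):
    vals = ttft_by_slice[idx]
    print(f"slice {idx} ({idx * SLICE_SEC}-{(idx + 1) * SLICE_SEC}s): "
          f"{len(vals)} requests, avg TTFT {np.mean(vals):.0f} ms, "
          f"p90 {np.percentile(vals, 90):.0f} ms")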

Time-Sliced Results

==========================================================================================
TIME-SLICED PERFORMANCE ANALYSIS (10-second slices)
==========================================================================================

Slice |  Time   | Requests | TTFT (ms) | Latency (ms) | Throughput
  #   | Window  |  Count   |  avg (p90)|  avg  (p90)  | (tokens/s)
------------------------------------------------------------------------------------------
  0   |  0-10s  |      111 |   545 (  900) |  1516 ( 3217) |       3203
  1   | 10-20s  |      223 |   381 (  560) |  1050 ( 2300) |       3027
  2   | 20-30s  |      279 |   376 (  502) |  1266 ( 3008) |       4014
  3   | 30-40s  |      293 |   388 (  655) |  1272 ( 2942) |       3648
  4   | 40-50s  |      302 |   387 (  500) |   976 ( 2173) |       3554
  5   | 50-60s  |      303 |   344 (  444) |   999 ( 2313) |       3470
  6   | 60-70s  |      179 |   374 (  517) |  1427 ( 2803) |       4258
==========================================================================================

TREND ANALYSIS:
  TTFT Range: 344ms - 545ms (variation: 58.6%)
  Throughput Range: 3027 - 4258 tokens/s
  First slice TTFT: 545ms vs. Last slice: 374ms

✅ Warm-up detected: TTFT improved after first slice (cold start effect)
==========================================================================================

Key Insights from Time-Sliced Analysis

1. Warm-Up Effect Detected:

Slice 0 (0-10s):   TTFT = 545ms  ⚠️  Cold start
Slice 1 (10-20s):  TTFT = 381ms  ✅  30% improvement after warm-up
Slices 2-6:        TTFT = 344-388ms  ✅  Stable steady-state

Why this matters:

  • First 10 seconds show 545ms TTFT (above target)
  • Performance improves 30% after warm-up
  • Steady-state performance (344-388ms) is significantly better than cold-start
  • Implication: Pre-warming servers before production traffic prevents SLA violations

2. Variable Load Patterns:

  • Request distribution not uniform: 111 requests (slice 0) → 303 requests (slice 5)
  • Throughput varies with load: 3.0K - 4.3K tokens/sec
  • System handles variable load without significant degradation

3. No Performance Degradation:

  • TTFT remains stable from slice 1-6 (344-388ms range)
  • No upward trend in latency over time
  • No signs of memory leaks or resource exhaustion
  • System is healthy for sustained operation

Comparing Overall vs. Time-Sliced Metrics

Metric  | Overall Average | Slice 0 (Cold) | Slices 1-6 (Warm)
--------|-----------------|----------------|-----------------------
TTFT    | 386 ms          | 545 ms (+41%)  | 344-388 ms (baseline)
Latency | 1,172 ms        | 1,516 ms       | 976-1,427 ms

The hidden truth: Overall averages mask the 41% cold-start penalty!

Use Cases for Time-Slicing

Scenario 1: Detecting Warm-Up Effects

Problem: SLA violations in first minute of operation
Solution: Use time-slicing to quantify warm-up penalty
Action: Pre-warm servers or set longer health check delays

Scenario 2: Finding Memory Leaks

Problem: Performance degrades after hours of operation
Solution: Run long benchmark with time-slicing (--benchmark-duration 3600 --slice-duration 300)
Look for: Increasing TTFT/latency in later slices

Scenario 3: Load Pattern Validation

Problem: Trace-based tests with varying load
Solution: Time-slice to see if performance varies with request density
Look for: Correlation between requests/slice and latency

Best Practices

Choose appropriate slice duration:

  • Too short (<5s): High variance, unstable metrics
  • Too long (>60s): Miss fine-grained patterns
  • Recommended: 10-30 seconds for most workloads

Use with trace-based benchmarks:

  • Time-slicing + realistic traces = complete picture
  • See both overall AND time-evolving performance

Compare cold vs. warm state:

  • Exclude slice 0 from steady-state SLA calculations
  • Report both cold-start and warm-state performance separately

Monitor for degradation:

  • Upward trend in latency = resource issue
  • Flat or decreasing latency = healthy system

Summary

We've demonstrated 5 powerful AIPerf use cases:

  1. Simple Profiling + Pareto Analysis: Find the sweet spot between user experience and resource utilization
  2. Custom Percentile Analysis: Calculate any metric your organization needs from raw data
  3. Trace-Based Benchmarking: Test with realistic production workload patterns
  4. Goodput Analysis: Measure actual SLA compliance, not just raw throughput
  5. Time-Sliced Analysis: Understand performance evolution and detect warm-up/degradation

Key Takeaway: Synthetic benchmarks (Use Case 1) provide baseline capacity, but real-world validation requires traces (Use Case 3), goodput (Use Case 4), and time-series analysis (Use Case 5) to ensure production readiness.


Advanced Topics

In-Cluster Benchmarking

For high-scale testing, consider running AIPerf from within your Kubernetes cluster to:

  • Eliminate network latency between client and server
  • Avoid ephemeral port exhaustion on client machines at extreme concurrency
  • Test true server capacity without client-side bottlenecks

Deploy a load-tester pod in the same cluster as your inference endpoint and use the internal ClusterIP service address for benchmarking.

Request Cancellation Testing

Simulate real-world user behavior where requests are cancelled mid-flight (e.g., users navigating away, timeouts):

aiperf profile \
  --model qwen3-0.6b \
  --url $ENDPOINT_URL \
  --endpoint-type chat \
  --streaming \
  --concurrency 10 \
  --request-count 100 \
  --request-cancellation-rate 20 \
  --request-cancellation-delay 0.5 \
  --isl 800 \
  --osl 400 \
  --tokenizer Qwen/Qwen3-0.6B

Parameters:

  • --request-cancellation-rate 20: Cancel 20% of requests
  • --request-cancellation-delay 0.5: Wait 0.5 seconds before cancelling

Use Cases:

  • Test server resource cleanup and connection pooling
  • Measure impact of cancellations on remaining requests
  • Validate graceful degradation under partial failures

Coming Soon

AIPerf is actively developing new capabilities:

Server-Side Metrics

  • Internal queuing metrics: Measure time spent in scheduling queues, KV cache wait times, and batching delays
  • Engine telemetry: Deep visibility into request lifecycle within the inference engine
  • Resource utilization: Track memory pressure, GPU utilization, and scheduling efficiency

Synthetic Load for KV Cache Efficiency Testing

  • Block reuse patterns: Generate synthetic traces that test prefix caching and attention reuse
  • Cache hit rate benchmarks: Measure performance gains from shared KV cache blocks
  • Memory efficiency testing: Validate PagedAttention and other cache optimization strategies

Kubernetes-Native Benchmarking

  • Distributed load generation: Deploy multiple load-tester pods to simulate thousands of concurrent users
  • Large-scale workloads: Test production-scale traffic patterns without client-side bottlenecks
  • Automated orchestration: Kubernetes operators to manage benchmark lifecycles and resource allocation

Automatic Plot Generation

  • Post-profiling analysis: Automatically generate visualizations after each benchmark run
  • Server-side metrics: Plot GPU utilization, memory usage, and queuing patterns over time
  • Client-side results: Visualize latency distributions, throughput curves, and goodput trends
  • Comparison views: Side-by-side charts for multiple benchmark runs

GitHub Gist: https://gist.github.com/BenHamm/31c648f7d7331c94c1f3a45859db6677
