Presentation Date: November 13, 2025
Tool: AIPerf v0.3.0
- Setup: Installing AIPerf 0.3.0
- Test Endpoint Details
- Use Case 1: Simple Profiling with Static ISL/OSL
- Use Case 2: Auditing Raw Results - Custom Percentile Analysis
- Use Case 3: Trace-Based Benchmarking with Mooncake
- Use Case 4: Goodput Analysis - Measuring SLA Compliance
- Use Case 5: Time-Sliced Analysis - Performance Over Time
- Summary
- Advanced Topics
- Coming Soon
Note: AIPerf 0.3.0 is not yet available on PyPI. Install from the GitHub repository:
# Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate
# Install AIPerf from release/0.3.0 branch
pip install --upgrade git+https://github.com/ai-dynamo/aiperf.git@release/0.3.0
# Verify installation
aiperf --version
# Expected output: 0.3.0

Why 0.3.0?
- ✅ Fixed dashboard UI bug with tokenizer downloads
- ✅ Improved stability and performance
- ✅ Enhanced reporting features
- ✅ Direct `aiperf` command (no need for `python -m aiperf`)
Note: This was a demo endpoint used for the November 13, 2025 presentation. The cluster has been taken down.
Model: Qwen3-0.6B (Qwen/Qwen3-0.6B)
Inference Engine: vLLM v0.11.0
Architecture: 8-way data parallelism (8 independent vLLM replicas)
Hardware: 8x NVIDIA H200 GPUs (1 GPU per replica)
Deployment: Kubernetes on Nebius Cloud
Demo Endpoint (no longer active):
# This endpoint was available during the demo:
# export ENDPOINT_URL="http://89.169.112.187:8000"

Why this endpoint was chosen for the demo:
- Small model (~600M parameters) = high throughput for benchmarking
- 8 replicas = demonstrated horizontal scaling
- Public access = allowed live demonstration
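The demo endpoint is gone, but before benchmarking your own deployment it is worth a quick readiness check. A minimal sketch, assuming an OpenAI-compatible server (as vLLM exposes) reachable at `ENDPOINT_URL`:

```python
import os

import requests  # third-party: pip install requests

# Point this at your own OpenAI-compatible endpoint; the demo IP above is no longer live.
endpoint_url = os.environ.get("ENDPOINT_URL", "http://localhost:8000")

# vLLM's OpenAI-compatible server exposes GET /v1/models; a 200 response listing your
# model is a good sign the endpoint is ready to benchmark.
resp = requests.get(f"{endpoint_url}/v1/models", timeout=10)
resp.raise_for_status()
print("Models served:", [m["id"] for m in resp.json()["data"]])
```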
Goal: Measure baseline performance under controlled load
aiperf profile \
--model qwen3-0.6b \
--url $ENDPOINT_URL \
--endpoint-type chat \
--streaming \
--concurrency 100 \
--request-count 1000 \
--isl 1000 \
--osl 500 \
  --tokenizer Qwen/Qwen3-0.6B

| Arg | Value | Purpose |
|---|---|---|
| `--model` | `qwen3-0.6b` | Model identifier (matches endpoint) |
| `--url` | `$ENDPOINT_URL` | Target inference server |
| `--endpoint-type` | `chat` | OpenAI chat completions API |
| `--streaming` | (flag) | Enable token streaming |
| `--concurrency` | `100` | Simultaneous connections |
| `--request-count` | `1000` | Total requests to send |
| `--isl` | `1000` | Input tokens per request |
| `--osl` | `500` | Output tokens per response |
| `--tokenizer` | `Qwen/Qwen3-0.6B` | HuggingFace tokenizer for accuracy |
Key Insight: This creates 100 "virtual users" sending 1,000 requests total with large payloads (1000→500 tokens).
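Before reading the aggregate table below, it helps to see what TTFT and inter-token latency mean for a single streaming request. This is an illustrative sketch only (not AIPerf's implementation), assuming an OpenAI-compatible `/v1/chat/completions` endpoint; SSE chunks are treated as a rough proxy for tokens:

```python
import json
import os
import time

import requests  # pip install requests

endpoint_url = os.environ.get("ENDPOINT_URL", "http://localhost:8000")
payload = {
    "model": "qwen3-0.6b",
    "messages": [{"role": "user", "content": "Explain KV caching in one paragraph."}],
    "stream": True,
    "max_tokens": 128,
}

start = time.perf_counter()
chunk_times = []
with requests.post(f"{endpoint_url}/v1/chat/completions",
                   json=payload, stream=True, timeout=120) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        # OpenAI-style streaming sends "data: {...}" SSE lines, ending with "data: [DONE]".
        if not line.startswith(b"data: ") or line == b"data: [DONE]":
            continue
        chunk = json.loads(line[len(b"data: "):])
        if chunk.get("choices") and chunk["choices"][0]["delta"].get("content"):
            chunk_times.append(time.perf_counter())

# TTFT = time until the first content chunk; ITL = gaps between consecutive chunks.
ttft_ms = (chunk_times[0] - start) * 1000
itls_ms = [(b - a) * 1000 for a, b in zip(chunk_times, chunk_times[1:])]
print(f"TTFT: {ttft_ms:.1f} ms | mean ITL: {sum(itls_ms) / max(len(itls_ms), 1):.1f} ms")
```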
NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ Metric ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p50 ┃ std ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ Time to First Token (ms)│ 347.15 │ 204.55 │1,052.66│ 815.02│ 577.05│ 289.49│ 143.57│
│ Request Latency (ms) │ 2,101.75 │ 693.08 │4,770.98│3,613.75│2,319.79│2,057.50│ 303.17│
│ Inter Token Latency (ms)│ 3.57 │ 1.99 │ 8.55 │ 5.78 │ 3.93 │ 3.49 │ 0.54│
│ Output Token Throughput │22,521.42 │ N/A │ N/A │ N/A │ N/A │ N/A │ N/A│
│ (tokens/sec) │ │ │ │ │ │ │ │
│ Request Throughput │ 45.70 │ N/A │ N/A │ N/A │ N/A │ N/A │ N/A│
│ (requests/sec) │ │ │ │ │ │ │ │
│ Request Count │ 1,000.00 │ N/A │ N/A │ N/A │ N/A │ N/A │ N/A│
└─────────────────────────┴──────────┴────────┴────────┴────────┴────────┴────────┴───────┘
Benchmark Duration: 21.88 sec
Success Rate: 100% (0 errors)
✅ TTFT = 347ms: Fast first token delivery - users see responses quickly
✅ Request Latency = 2.1s: Total time to generate 500 tokens per request
✅ System Throughput = 22.5K tokens/sec: High capacity with 100 concurrent users
✅ ITL = 3.57ms: Smooth, consistent token streaming
✅ P99 Latency = 3.6s: Even worst-case requests complete reasonably fast
What we learned:
- With 100 concurrent users and large payloads (1000→500 tokens), the system maintained stable performance
- P99 latency (3.6s) vs. avg (2.1s) shows good consistency: the tail is only ~70% higher than the mean
- Zero errors = reliable service under load
- 22.5K tokens/sec sustained throughput demonstrates 8-replica scaling effectiveness
Goal: Understand the trade-off between resource utilization (TPS/GPU) and user experience (TPS/User) at different concurrency levels.
We ran the same benchmark at 5 different concurrency levels (10, 50, 100, 200, 500) to observe how throughput per GPU and throughput per user change:
# Example commands (run each separately)
aiperf profile --model qwen3-0.6b --url $ENDPOINT_URL \
--endpoint-type chat --streaming --concurrency 10 \
--request-count 1000 --isl 1000 --osl 500 \
--tokenizer Qwen/Qwen3-0.6B --artifact-dir artifacts/pareto-c10
# (Repeat for concurrency: 50, 100, 200, 500)

| Concurrency | Total TPS | TPS/GPU | TPS/User | TTFT (avg) |
|---|---|---|---|---|
| 10 | 3,045 | 1,522 | 364.69 | ~250 ms |
| 50 | 12,890 | 6,445 | 326.10 | ~270 ms |
| 100 | 22,521 | 11,261 | 285.03 | ~347 ms |
| 200 | 35,999 | 18,000 ⭐ | 238.67 | ~420 ms |
| 500 | 29,836 | 14,918 | 128.85 | ~1,129 ms |
Hardware: 8 vLLM replicas on 8 H200 GPUs (so we divide Total TPS by 8 for TPS/GPU)
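To automate the sweep, a small driver script works fine (a sketch; the flags mirror the single-run command above, and the artifact directory naming is just a convention):

```python
import os
import subprocess

# Run the same benchmark at each concurrency level, writing results to separate artifact dirs.
for concurrency in (10, 50, 100, 200, 500):
    subprocess.run(
        [
            "aiperf", "profile",
            "--model", "qwen3-0.6b",
            "--url", os.environ["ENDPOINT_URL"],
            "--endpoint-type", "chat",
            "--streaming",
            "--concurrency", str(concurrency),
            "--request-count", "1000",
            "--isl", "1000",
            "--osl", "500",
            "--tokenizer", "Qwen/Qwen3-0.6B",
            "--artifact-dir", f"artifacts/pareto-c{concurrency}",
        ],
        check=True,
    )
```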
[Visual: Pareto Curve Chart - See presentation slide]
The Pareto frontier shows the inverse relationship between resource efficiency and user experience:
| Point | Concurrency | TPS/GPU | TPS/User | Interpretation |
|---|---|---|---|---|
| Far Right | c=10 | 1,522 | 365 | Best user experience, poor GPU utilization |
| Moving Up-Left | c=50 | 6,445 | 326 | Trading UX for efficiency |
| Middle | c=100 | 11,261 | 285 | Balanced middle ground |
| Peak ⭐ | c=200 | 18,000 | 239 | Maximum GPU efficiency |
| Collapse | c=500 | 14,918 | 129 | Over-saturation degrades both |
Key Insight: The Pareto curve demonstrates you cannot optimize both metrics simultaneously. Choose your operating point based on whether you prioritize cost efficiency (c=200) or user experience (c=10-50).
✅ Low Concurrency (10-50):
- Best user experience: 365 tokens/sec per user = very responsive
- Poor resource utilization: Only 1,500-6,500 TPS/GPU = GPUs are underutilized
- Use case: Premium tier, low-latency applications
✅ Medium Concurrency (100-200):
- Balanced performance: ~11,000-18,000 TPS/GPU
- Good user experience: ~240-285 tokens/sec per user
- Sweet spot at c=200: Peak resource utilization (18K TPS/GPU) with acceptable user experience
- Use case: General production workloads
❌ High Concurrency (500+):
- Degraded resource utilization: TPS/GPU drops from 18K → 15K
- Poor user experience: 129 tokens/sec per user, TTFT = 1.1 seconds
- Queuing dominates: Request backlog causes both metrics to degrade
- Use case: Avoid this region unless cost is the only priority
Question: Should you optimize for cost efficiency (max TPS/GPU) or user satisfaction (max TPS/User)?
| Priority | Optimal Concurrency | Justification |
|---|---|---|
| User Experience | 10-50 | Sub-300ms TTFT, 325+ tokens/sec/user |
| Balanced | 100-200 ⭐ | 18K TPS/GPU, 240+ tokens/sec/user |
| Cost Efficiency | 200 | Peak TPS/GPU before degradation |
The c=200 "sweet spot":
- 12x better resource utilization vs. c=10 (18K vs. 1.5K TPS/GPU)
- Only 35% reduction in per-user throughput (239 vs. 365 tokens/sec/user)
- TTFT still under 500ms for most requests
🔍 Performance is non-linear: Doubling concurrency doesn't double throughput
📊 The inverted-U curve: TPS/GPU rises, peaks at c=200, then falls due to queuing overhead
⚖️ No free lunch: Higher concurrency = better GPU utilization BUT worse user experience
🎯 Know your SLA: Choose concurrency based on your latency vs. throughput priorities
Pro tip: Run this analysis on YOUR endpoint with YOUR request patterns to find YOUR sweet spot!
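To build the table above from your own runs, you can aggregate the raw per-request exports (the `profile_export.jsonl` schema is shown in Use Case 2). This is a sketch under a few assumptions: the artifact layout from the sweep above, 8 GPUs, and per-user throughput approximated as each request's output tokens over its own latency; AIPerf's built-in summaries may compute these slightly differently.

```python
import json
from pathlib import Path

NUM_GPUS = 8  # adjust to your deployment

for concurrency in (10, 50, 100, 200, 500):
    records = []
    for jsonl in Path(f"artifacts/pareto-c{concurrency}").rglob("profile_export.jsonl"):
        with open(jsonl) as f:
            records.extend(json.loads(line) for line in f)
    if not records:
        continue

    # Wall-clock window from the earliest request start to the latest request end.
    start_ns = min(r["metadata"]["request_start_ns"] for r in records)
    end_ns = max(r["metadata"]["request_end_ns"] for r in records)
    duration_s = (end_ns - start_ns) / 1e9

    total_tokens = sum(r["metrics"]["output_token_count"]["value"] for r in records)
    total_tps = total_tokens / duration_s

    # Per-user throughput: each request's output tokens divided by its own latency (seconds).
    per_user = [
        r["metrics"]["output_token_count"]["value"]
        / (r["metrics"]["request_latency"]["value"] / 1000.0)
        for r in records
    ]

    print(f"c={concurrency:>3}  total TPS={total_tps:8.0f}  "
          f"TPS/GPU={total_tps / NUM_GPUS:8.0f}  "
          f"TPS/user={sum(per_user) / len(per_user):7.1f}")
```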
Scenario: Your management defines SLAs using P75, not the standard P50/P90/P99 that AIPerf reports by default.
Goal: Calculate P75 TTFT from raw benchmark data.
AIPerf outputs detailed per-request data in profile_export.jsonl. Each line is a JSON record:
{
"metadata": {
"session_num": 87,
"x_request_id": "abd8df1a-7904-4aa0-8107-0d74ba0ac0d7",
"turn_index": 0,
"request_start_ns": 1763066701865462000,
"request_end_ns": 1763066703082535666,
"worker_id": "worker_b431129c"
},
"metrics": {
"time_to_first_token": {
"value": 582.66,
"unit": "ms"
},
"output_token_count": {
"value": 194,
"unit": "tokens"
},
"request_latency": {
"value": 1210.008,
"unit": "ms"
},
"input_sequence_length": {
"value": 1000,
"unit": "tokens"
},
"output_sequence_length": {
"value": 194,
"unit": "tokens"
},
"inter_token_latency": {
"value": 3.25,
"unit": "ms"
}
}
}

Key fields: Every request has time_to_first_token, request_latency, ISL, OSL, and more.
import json
import numpy as np
from pathlib import Path
# Read all TTFT values
ttft_values = []
with open("artifacts/.../profile_export.jsonl", 'r') as f:
for line in f:
record = json.loads(line)
ttft = record['metrics']['time_to_first_token']['value']
ttft_values.append(ttft)
# Calculate P75
p75_ttft = np.percentile(ttft_values, 75)
print(f"P75 TTFT: {p75_ttft:.2f} ms")============================================================
TTFT Percentile Analysis
============================================================
Total requests analyzed: 1000
Percentiles (ms):
P25 (25th percentile): 242.45 ms
P50 (50th percentile): 289.49 ms
P75 (75th percentile): 422.87 ms ⭐ YOUR SLA METRIC
P90 (90th percentile): 577.05 ms
P99 (99th percentile): 815.02 ms
============================================================
✅ P75 = 422.87ms: 75% of requests get first token within this time
✅ Raw data access: Calculate ANY custom metric your org needs
✅ Full transparency: Every request is logged with complete metrics
✅ Easy parsing: Standard JSON format, one record per line
Why this matters:
- Different orgs have different SLA definitions
- P75 is a common SLA target (balance between typical and worst-case)
- AIPerf's raw exports let you calculate ANY percentile or custom metric (see the sketch below)
- No need to re-run benchmarks for a different analysis
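Since every per-request metric shares the same `{value, unit}` shape, a tiny helper generalizes the P75 example to any metric and any percentile (a sketch; the artifact path is the same placeholder as above):

```python
import json

import numpy as np


def metric_percentile(jsonl_path: str, metric_name: str, pct: float) -> float:
    """Percentile of any per-request metric found in profile_export.jsonl."""
    values = []
    with open(jsonl_path) as f:
        for line in f:
            metric = json.loads(line)["metrics"].get(metric_name)
            if metric is not None:  # a metric may be absent on some requests
                values.append(metric["value"])
    return float(np.percentile(values, pct))


path = "artifacts/.../profile_export.jsonl"  # placeholder path, as above
print("P75 TTFT (ms):        ", metric_percentile(path, "time_to_first_token", 75))
print("P95 latency (ms):     ", metric_percentile(path, "request_latency", 95))
print("P75 inter-token (ms): ", metric_percentile(path, "inter_token_latency", 75))
```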
Goal: Test your system under realistic production workload patterns using privacy-preserving traces.
Mooncake is an open-source KV cache sharing system that released real production traces from their arXiv Q&A service. These traces capture actual user behavior including:
- Request arrival times
- Input/output token lengths
- Block hash IDs: Privacy-preserving identifiers for KV cache reuse patterns
The Problem: Sharing production traces risks leaking sensitive user data.
Mooncake's Solution: Hash every 512-token block of input. Users asking about the same document get the same hash IDs, enabling cache reuse analysis without revealing content.
Example: Multi-turn conversation
Turn 1: User uploads paper (7,500 tokens) + question (500 tokens)
├─ Total: 8,000 tokens = 16 blocks
└─ Hash IDs: [46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61]
Turn 2: Same paper + different question (8,500 tokens)
├─ Total: 8,500 tokens = 17 blocks
├─ Hash IDs: [46-61] (reused!) + [62] (new)
└─ ✅ Cache hit rate: 94% (16/17 blocks reused)
Turn 3: Same paper + another question (9,000 tokens)
├─ Total: 9,000 tokens = 18 blocks
├─ Hash IDs: [46-61] (reused!) + [62, 63] (new)
└─ ✅ Cache hit rate: 89% (16/18 blocks reused)
Key insight: Hash IDs reveal cache reuse opportunities while completely protecting user privacy.
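To make the mechanics concrete, here is a conceptual sketch of block hashing (our own illustration, not Mooncake's actual implementation): hashing each 512-token block means prompts that share a long prefix share the corresponding hash IDs, while the hashes reveal nothing about the underlying text. Exact block counts depend on how prompts align to block boundaries.

```python
import hashlib

BLOCK_SIZE = 512  # tokens per block, matching the Mooncake trace granularity


def block_hash_ids(token_ids):
    """Map a token sequence to privacy-preserving per-block hash IDs (conceptual sketch)."""
    ids = []
    for i in range(0, len(token_ids), BLOCK_SIZE):
        block = token_ids[i : i + BLOCK_SIZE]
        ids.append(hashlib.sha256(str(block).encode()).hexdigest()[:8])
    return ids


paper = [0] * 7_500        # stand-in token IDs for a shared 7,500-token paper
question_1 = [1] * 500     # first question about the paper
question_2 = [2] * 1_000   # a different, longer question about the same paper

ids1 = block_hash_ids(paper + question_1)
ids2 = block_hash_ids(paper + question_2)
shared = sum(a == b for a, b in zip(ids1, ids2))
print(f"turn 1: {len(ids1)} blocks | turn 2: {len(ids2)} blocks | shared prefix blocks: {shared}")
```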
======================================================================
MOONCAKE ARXIV TRACE - DATASET CHARACTERISTICS
======================================================================
📊 OVERALL STATISTICS
Total Requests: 23,608
Duration: 60.0 minutes (3,600 seconds)
Avg Request Rate: 393.5 requests/minute
📏 TOKEN DISTRIBUTION (Input + Output)
Mean: 8,772 tokens
Median: 6,402 tokens
P25: 3,331 tokens | P75: 7,562 tokens
P90: 17,140 tokens | P99: 61,961 tokens
Max: 125,878 tokens
📊 REQUEST SIZE DISTRIBUTION
Token Range | Count | Percentage | Visualization
──────────────────────────────────────────────────────────
0 - 5,000 | 7,632 | 32.3% | ████████████████
5,000 - 10,000 | 11,626| 49.2% | ████████████████████████
10,000 - 20,000 | 2,499 | 10.6% | █████
20,000 - 40,000 | 1,325 | 5.6% | ██
40,000 - 60,000 | 272 | 1.2% |
60,000 - 80,000 | 135 | 0.6% |
80,000 - 100,000 | 65 | 0.3% |
100,000+ | 54 | 0.2% |
⏱️ REQUEST ARRIVAL PATTERN (5-minute windows)
Time Window | Requests | Load Pattern
───────────────────────────────────────────────────────
0 - 4 min | 1,765 | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
5 - 9 min | 1,657 | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
10 - 14 min | 1,875 | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
15 - 19 min | 1,860 | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
20 - 24 min | 1,992 | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
25 - 29 min | 2,010 | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
30 - 34 min | 2,012 | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
35 - 39 min | 2,063 | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
40 - 44 min | 2,133 | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
45 - 49 min | 2,026 | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
50 - 54 min | 2,125 | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
55 - 59 min | 1,680 | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
======================================================================
Key characteristics of real production traffic:
✅ Highly Variable Request Sizes: 49% of requests are 5K-10K tokens, but tail extends to 125K
✅ Long-Context Dominant: Median of 6,402 tokens vs. typical benchmarks using 1K-2K
✅ Consistent Load: ~393 requests/minute with relatively steady arrival rate
✅ Heavy Tail Distribution: 2% of requests exceed 40K tokens (production reality!)
This represents real-world patterns you won't get from synthetic benchmarks:
- Multi-turn conversations (shared hash IDs across requests)
- Variable request sizes (not uniform 1K/500 like Use Case 1)
- Realistic timing (actual production arrival patterns)
- Long-context queries that stress-test model limits
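The dataset summary above can be reproduced by scanning the raw trace. A sketch, assuming each line carries `timestamp` (milliseconds), `input_length`, `output_length`, and `hash_ids` fields as in the published arXiv trace; verify the field names against the file you download:

```python
import json

import numpy as np

with open("mooncake_trace.jsonl") as f:
    records = [json.loads(line) for line in f]

timestamps_s = np.array([r["timestamp"] for r in records]) / 1000.0        # ms -> s
total_tokens = np.array([r["input_length"] + r["output_length"] for r in records])

duration_min = (timestamps_s.max() - timestamps_s.min()) / 60.0
print(f"Total requests:   {len(records)}")
print(f"Duration:         {duration_min:.1f} minutes")
print(f"Avg request rate: {len(records) / duration_min:.1f} requests/minute")
print(f"Tokens (in+out):  mean={total_tokens.mean():.0f}  median={np.median(total_tokens):.0f}  "
      f"p99={np.percentile(total_tokens, 99):.0f}  max={total_tokens.max()}")
```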
# Download the Mooncake trace
curl -o mooncake_trace.jsonl https://raw.githubusercontent.com/kvcache-ai/Mooncake/refs/heads/main/FAST25-release/arxiv-trace/mooncake_trace.jsonl
# Option 1: Replay with original timing (for end-to-end system testing)
aiperf profile \
--model qwen3-0.6b \
--url $ENDPOINT_URL \
--endpoint-type chat \
--streaming \
--input-file mooncake_trace.jsonl \
--custom-dataset-type mooncake_trace \
--fixed-schedule \
--tokenizer Qwen/Qwen3-0.6B
# Option 2: Replay as fast as possible (for capacity testing)
aiperf profile \
--model qwen3-0.6b \
--url $ENDPOINT_URL \
--endpoint-type chat \
--streaming \
--input-file mooncake_trace.jsonl \
--custom-dataset-type mooncake_trace \
  --tokenizer Qwen/Qwen3-0.6B

| Aspect | Use Case 1 (Synthetic) | Use Case 3 (Trace-Based) |
|---|---|---|
| Request Pattern | Uniform (all 1000→500) | Variable (hundreds to ~126K tokens) |
| Arrival Timing | Constant concurrency | Bursty, realistic timing |
| KV Cache | No reuse patterns | Real cache-sharing patterns |
| Use Case | Steady-state capacity | Production validation |
✅ Realistic Load Testing: Test how your system handles actual production patterns, not idealized synthetic load
✅ KV Cache Validation: If you implement cache sharing (like Mooncake), trace data shows real hit rates
✅ Capacity Planning: See performance under bursty traffic with variable request sizes
✅ Privacy-Preserving: Hash-based traces enable sharing without exposing sensitive data
Pro tip: Use --fixed-schedule for end-to-end system validation (respects timing), or remove it to stress-test maximum throughput capacity.
We extracted the first 5 minutes of the Mooncake trace (1,765 requests) and sped it up 5x so it replays in about a minute. A sketch of the extraction step is shown below, followed by the replay command.
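Deriving the subset is a few lines of preprocessing (a sketch, assuming the millisecond `timestamp` field noted earlier; the output filename matches the command below):

```python
import json

FIVE_MINUTES_MS = 5 * 60 * 1000
SPEEDUP = 5

with open("mooncake_trace.jsonl") as src, open("mooncake_trace_5min_5x.jsonl", "w") as dst:
    for line in src:
        record = json.loads(line)
        if record["timestamp"] >= FIVE_MINUTES_MS:
            continue                                           # keep only the first 5 minutes
        record["timestamp"] = record["timestamp"] // SPEEDUP   # compress arrival times 5x
        dst.write(json.dumps(record) + "\n")
```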
# Replay the subset (first 5 minutes, sped up 5x)
aiperf profile \
--model qwen3-0.6b \
--url $ENDPOINT_URL \
--endpoint-type chat \
--streaming \
--input-file mooncake_trace_5min_5x.jsonl \
--custom-dataset-type mooncake_trace \
--fixed-schedule \
  --tokenizer Qwen/Qwen3-0.6B

Results:
NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┓
┃ Metric ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p50 ┃ std ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━┩
│ Time to First Token │ 407.42 │ 212.68 │ 1,519.5 │ 951.16 │ 586.01 │ 370.20 │ 150.12 │
│ (ms) │ │ │ │ │ │ │ │
│ Request Latency (ms) │ 1,171.0 │ 243.14 │ 6,665.7 │ 4,184.4 │ 2,615.9 │ 648.33 │ 978.09 │
│ Inter Token Latency │ 5.97 │ 0.00 │ 88.31 │ 17.88 │ 10.72 │ 4.54 │ 5.46 │
│ (ms) │ │ │ │ │ │ │ │
│ Output Sequence Length│ 175.27 │ 1.00 │ 1,165.0 │ 761.65 │ 510.00 │ 28.00 │ 220.30 │
│ (tokens) │ │ │ │ │ │ │ │
│ Input Sequence Length │ 7,243.0 │ 890.00 │32,236.0 │27,260.0 │15,157.0 │6,344.0 │ 5,536.0 │
│ (tokens) │ │ │ │ │ │ │ │
│ Output Token │ 4,675.0 │ N/A │ N/A │ N/A │ N/A │ N/A │ N/A │
│ Throughput (tok/sec) │ │ │ │ │ │ │ │
│ Request Throughput │ 26.68 │ N/A │ N/A │ N/A │ N/A │ N/A │ N/A │
│ (requests/sec) │ │ │ │ │ │ │ │
│ Request Count │ 1,690 │ N/A │ N/A │ N/A │ N/A │ N/A │ N/A │
│ (successful) │ │ │ │ │ │ │ │
└───────────────────────┴─────────┴────────┴─────────┴─────────┴─────────┴────────┴─────────┘
Benchmark Duration: 63.35 sec
Success Rate: 96% (75 requests exceeded 32K context window)
✅ Highly Variable Request Sizes:
- Input: 890→32,236 tokens (36x range!)
- Output: 1→1,165 tokens
- Median input: 6,344 tokens (much larger than our synthetic 1K)
✅ Performance Under Real Load:
- TTFT = 407ms average despite a median input of ~6.3K tokens (mean ~7.2K)
- System handled 4,675 tokens/sec with bursty, variable traffic
- P99 TTFT = 951ms (some large requests took longer, as expected)
✅ Realistic Failures:
- 75 requests (4%) exceeded Qwen3-0.6B's 32K context limit
- This reveals a real operational constraint you'd miss with synthetic tests
- Production insight: Need longer-context model or request filtering
✅ Production Timing Patterns:
- Trace shows realistic request bursts and lulls
- Not constant load like `--concurrency 100`
- More representative of actual user traffic patterns
What we learned from trace-based vs. synthetic testing:
- Use Case 1 (synthetic): 100% success, uniform 1K→500 tokens, 22.5K TPS
- Use Case 3 (trace): 96% success, variable 890→32K input tokens, 4.7K TPS, revealed context window issues
Trace-based testing exposes real-world challenges that synthetic benchmarks hide!
Goal: Measure what percentage of requests meet your defined Service Level Objectives (SLOs), not just average performance.
Goodput = The fraction of requests that meet ALL specified SLA thresholds.
Why it matters:
- Throughput tells you how many requests/sec your system handles
- Goodput tells you how many requests/sec deliver acceptable user experience
- A system can have high throughput but low goodput if most requests miss SLAs!
Definition (from DistServe paper):
"Goodput measures the number of requests per second that meet specified service-level objectives (SLOs), providing a metric that directly reflects user-perceived quality of service."
Imagine two systems serving 1000 requests/min:
- System A: 950 requests under SLA, 50 requests timeout → 95% goodput
- System B: 500 requests under SLA, 500 requests slow → 50% goodput
Both have the same throughput, but System A delivers 2x better user experience!
We'll use the same Mooncake trace, but add SLO thresholds:
# Define SLA thresholds based on your business requirements
# Example: TTFT ≤ 370ms, Request Latency ≤ 648ms
aiperf profile \
--model qwen3-0.6b \
--url $ENDPOINT_URL \
--endpoint-type chat \
--streaming \
--input-file mooncake_trace_5min_5x.jsonl \
--custom-dataset-type mooncake_trace \
--fixed-schedule \
--tokenizer Qwen/Qwen3-0.6B \
--goodput "time_to_first_token:370 request_latency:648" NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┓
┃ Metric ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p50 ┃ std ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━┩
│ Time to First Token │ 428.86 │ 209.96 │ 1,651.8 │ 1,109.7 │ 649.21 │ 385.29 │ 176.32 │
│ (ms) │ │ │ │ │ │ │ │
│ Request Latency (ms) │ 1,208.9 │ 229.80 │ 6,280.6 │ 4,350.7 │ 2,726.4 │ 691.07 │ 1,005.5 │
│ Request Throughput │ 26.67 │ N/A │ N/A │ N/A │ N/A │ N/A │ N/A │
│ (requests/sec) │ │ │ │ │ │ │ │
│ Goodput │ 7.43 │ N/A │ N/A │ N/A │ N/A │ N/A │ N/A │ ⭐
│ (requests/sec) │ │ │ │ │ │ │ │
└───────────────────────┴─────────┴────────┴─────────┴─────────┴─────────┴────────┴─────────┘
Benchmark Duration: 63.37 sec
Success Rate: 96% (75 requests exceeded 32K context window)
Goodput vs. Throughput:
Total Throughput: 26.67 requests/sec (100%)
Goodput: 7.43 requests/sec (28%) ⚠️
────────────────────────────────────────────
Only 28% of requests met BOTH SLO requirements!
Understanding the results:
- SLO Thresholds: TTFT ≤ 370ms AND Request Latency ≤ 648ms
- Average TTFT: 428ms (above threshold)
- Median Latency: 691ms (above threshold)
- 72% of requests failed to meet at least one SLO
- Raw throughput doesn't reveal this user experience gap!
How SLO compliance works:
- Requests must meet ALL SLO criteria to count toward goodput
- A request with TTFT=350ms but latency=700ms fails (missed latency SLO)
- A request with TTFT=400ms but latency=600ms fails (missed TTFT SLO)
- Only requests with TTFT≤370ms AND latency≤648ms count as goodput (recomputed from the raw export in the sketch below)
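Because goodput is just a per-request pass/fail check, you can also recompute it offline from the raw export (same jsonl schema as Use Case 2) and try different thresholds without re-running the benchmark. A sketch:

```python
import json

# SLO thresholds in ms, matching the --goodput flags above; adjust per tier.
SLOS = {"time_to_first_token": 370.0, "request_latency": 648.0}

total = good = 0
with open("artifacts/.../profile_export.jsonl") as f:  # placeholder path, as in Use Case 2
    for line in f:
        metrics = json.loads(line)["metrics"]
        total += 1
        # A request counts toward goodput only if it meets every SLO.
        if all(metrics[name]["value"] <= limit for name, limit in SLOS.items()):
            good += 1

print(f"Requests meeting all SLOs: {good}/{total} ({100 * good / total:.1f}%)")
```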
| Metric | What It Measures | What It Misses |
|---|---|---|
| Average TTFT | Typical first token delay | Tail latency, SLA violations |
| P99 Latency | Worst-case performance | Overall SLA compliance rate |
| Throughput | System capacity | User experience quality |
| Goodput ⭐ | % requests meeting SLAs | Nothing - it's the complete picture! |
Question: How many servers do I need to handle 1000 req/sec with 95% goodput?
Without goodput analysis:
- Measure throughput: 26.67 req/sec per server
- Calculate: 1000 / 26.67 = 38 servers
- Problem: This assumes all requests meet SLAs! ❌
With goodput analysis:
- Measure goodput: 7.43 req/sec per server (28% of throughput)
- Calculate: 1000 / 7.43 = 135 servers
- Reality: Need 3.5x more capacity to meet SLAs ✅
The cost of ignoring goodput: underprovisioning by roughly 3.5x (38 servers when you actually need 135)!
Different use cases need different SLOs:
# Strict SLOs (premium tier)
--goodput "time_to_first_token:250 request_latency:500"
# Balanced SLOs (standard tier)
--goodput "time_to_first_token:370 request_latency:648"
# Relaxed SLOs (batch processing)
--goodput "time_to_first_token:600 request_latency:2500"Pro tip: Set SLO thresholds based on your business requirements, then use goodput to measure compliance and plan capacity accordingly.
Goal: Understand how performance metrics evolve during a benchmark to detect warm-up effects, degradation patterns, or load-dependent behavior.
Time-slicing divides your benchmark into sequential time windows, computing metrics independently for each window.
Why it matters:
- Detect warm-up effects: Identify cold-start latency vs. steady-state performance
- Spot degradation: Find memory leaks or resource exhaustion over time
- Understand load patterns: See how performance changes as traffic evolves
- Validate SLAs over time: Ensure consistent performance, not just averages
We'll use the same Mooncake trace with 10-second time slices:
aiperf profile \
--model qwen3-0.6b \
--url $ENDPOINT_URL \
--endpoint-type chat \
--streaming \
--input-file mooncake_trace_5min_5x.jsonl \
--custom-dataset-type mooncake_trace \
--fixed-schedule \
--tokenizer Qwen/Qwen3-0.6B \
  --slice-duration 10

Output: AIPerf generates additional files:
- `profile_export_aiperf_timeslices.csv` - Time-series data in tidy format
- `profile_export_aiperf_timeslices.json` - Hierarchical time-series data
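The same view can also be reconstructed directly from the raw per-request export using `request_start_ns` (a sketch; slice boundaries and exact numbers may differ slightly from the files AIPerf generates and from the analysis shown below):

```python
import json
from collections import defaultdict

import numpy as np

SLICE_S = 10  # seconds per slice, matching --slice-duration above

with open("artifacts/.../profile_export.jsonl") as f:  # placeholder path
    records = [json.loads(line) for line in f]

t0 = min(r["metadata"]["request_start_ns"] for r in records)
slices = defaultdict(list)
for r in records:
    idx = int((r["metadata"]["request_start_ns"] - t0) / 1e9) // SLICE_S
    slices[idx].append(r["metrics"]["time_to_first_token"]["value"])

for idx in sorted(slices):
    ttfts = slices[idx]
    print(f"slice {idx:2d} ({idx * SLICE_S:3d}-{(idx + 1) * SLICE_S:3d}s): "
          f"{len(ttfts):4d} requests  TTFT avg {np.mean(ttfts):6.1f} ms  "
          f"p90 {np.percentile(ttfts, 90):6.1f} ms")
```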
==========================================================================================
TIME-SLICED PERFORMANCE ANALYSIS (10-second slices)
==========================================================================================
Slice | Time | Requests | TTFT (ms) | Latency (ms) | Throughput
# | Window | Count | avg (p90)| avg (p90) | (tokens/s)
------------------------------------------------------------------------------------------
0 | 0-10s | 111 | 545 ( 900) | 1516 ( 3217) | 3203
1 | 10-20s | 223 | 381 ( 560) | 1050 ( 2300) | 3027
2 | 20-30s | 279 | 376 ( 502) | 1266 ( 3008) | 4014
3 | 30-40s | 293 | 388 ( 655) | 1272 ( 2942) | 3648
4 | 40-50s | 302 | 387 ( 500) | 976 ( 2173) | 3554
5 | 50-60s | 303 | 344 ( 444) | 999 ( 2313) | 3470
6 | 60-70s | 179 | 374 ( 517) | 1427 ( 2803) | 4258
==========================================================================================
TREND ANALYSIS:
TTFT Range: 344ms - 545ms (variation: 58.6%)
Throughput Range: 3027 - 4258 tokens/s
First slice TTFT: 545ms vs. Last slice: 374ms
✅ Warm-up detected: TTFT improved after first slice (cold start effect)
==========================================================================================
1. Warm-Up Effect Detected:
Slice 0 (0-10s): TTFT = 545ms ⚠️ Cold start
Slice 1 (10-20s): TTFT = 381ms ✅ 30% improvement after warm-up
Slices 2-6: TTFT = 344-388ms ✅ Stable steady-state
Why this matters:
- First 10 seconds show 545ms TTFT (above target)
- Performance improves 30% after warm-up
- Steady-state performance (344-388ms) is significantly better than cold-start
- Implication: Pre-warming servers before production traffic prevents SLA violations
2. Variable Load Patterns:
- Request distribution not uniform: 111 requests (slice 0) → 303 requests (slice 5)
- Throughput varies with load: 3.0K - 4.3K tokens/sec
- System handles variable load without significant degradation
3. No Performance Degradation:
- TTFT remains stable from slice 1-6 (344-388ms range)
- No upward trend in latency over time
- No signs of memory leaks or resource exhaustion
- System is healthy for sustained operation
| Metric | Overall Average | Slice 0 (Cold) | Slice 1-6 (Warm) |
|---|---|---|---|
| TTFT | 386ms | 545ms (+41%) | 344-388ms (baseline) |
| Latency | 1,172ms | 1,516ms | 976-1,427ms |
The hidden truth: Overall averages mask the 41% cold-start penalty!
Scenario 1: Detecting Warm-Up Effects
Problem: SLA violations in first minute of operation
Solution: Use time-slicing to quantify warm-up penalty
Action: Pre-warm servers or set longer health check delays
Scenario 2: Finding Memory Leaks
Problem: Performance degrades after hours of operation
Solution: Run long benchmark with time-slicing (--benchmark-duration 3600 --slice-duration 300)
Look for: Increasing TTFT/latency in later slices
Scenario 3: Load Pattern Validation
Problem: Trace-based tests with varying load
Solution: Time-slice to see if performance varies with request density
Look for: Correlation between requests/slice and latency
✅ Choose appropriate slice duration:
- Too short (<5s): High variance, unstable metrics
- Too long (>60s): Miss fine-grained patterns
- Recommended: 10-30 seconds for most workloads
✅ Use with trace-based benchmarks:
- Time-slicing + realistic traces = complete picture
- See both overall AND time-evolving performance
✅ Compare cold vs. warm state:
- Exclude slice 0 from steady-state SLA calculations
- Report both cold-start and warm-state performance separately
✅ Monitor for degradation:
- Upward trend in latency = resource issue
- Flat or decreasing latency = healthy system
We've demonstrated 5 powerful AIPerf use cases:
- Simple Profiling + Pareto Analysis: Find the sweet spot between user experience and resource utilization
- Custom Percentile Analysis: Calculate any metric your organization needs from raw data
- Trace-Based Benchmarking: Test with realistic production workload patterns
- Goodput Analysis: Measure actual SLA compliance, not just raw throughput
- Time-Sliced Analysis: Understand performance evolution and detect warm-up/degradation
Key Takeaway: Synthetic benchmarks (Use Case 1) provide baseline capacity, but real-world validation requires traces (Use Case 3), goodput (Use Case 4), and time-series analysis (Use Case 5) to ensure production readiness.
For high-scale testing, consider running AIPerf from within your Kubernetes cluster to:
- Eliminate network latency between client and server
- Avoid ephemeral port exhaustion on client machines at extreme concurrency
- Test true server capacity without client-side bottlenecks
Deploy a load-tester pod in the same cluster as your inference endpoint and use the internal ClusterIP service address for benchmarking.
Simulate real-world user behavior where requests are cancelled mid-flight (e.g., users navigating away, timeouts):
aiperf profile \
--model qwen3-0.6b \
--url $ENDPOINT_URL \
--endpoint-type chat \
--streaming \
--concurrency 10 \
--request-count 100 \
--request-cancellation-rate 20 \
--request-cancellation-delay 0.5 \
--isl 800 \
--osl 400 \
  --tokenizer Qwen/Qwen3-0.6B

Parameters:
- `--request-cancellation-rate 20`: Cancel 20% of requests
- `--request-cancellation-delay 0.5`: Wait 0.5 seconds before cancelling
Use Cases:
- Test server resource cleanup and connection pooling
- Measure impact of cancellations on remaining requests
- Validate graceful degradation under partial failures
AIPerf is actively developing new capabilities:
- Internal queuing metrics: Measure time spent in scheduling queues, KV cache wait times, and batching delays
- Engine telemetry: Deep visibility into request lifecycle within the inference engine
- Resource utilization: Track memory pressure, GPU utilization, and scheduling efficiency
- Block reuse patterns: Generate synthetic traces that test prefix caching and attention reuse
- Cache hit rate benchmarks: Measure performance gains from shared KV cache blocks
- Memory efficiency testing: Validate PagedAttention and other cache optimization strategies
- Distributed load generation: Deploy multiple load-tester pods to simulate thousands of concurrent users
- Large-scale workloads: Test production-scale traffic patterns without client-side bottlenecks
- Automated orchestration: Kubernetes operators to manage benchmark lifecycles and resource allocation
- Post-profiling analysis: Automatically generate visualizations after each benchmark run
- Server-side metrics: Plot GPU utilization, memory usage, and queuing patterns over time
- Client-side results: Visualize latency distributions, throughput curves, and goodput trends
- Comparison views: Side-by-side charts for multiple benchmark runs
GitHub Gist: https://gist.github.com/BenHamm/31c648f7d7331c94c1f3a45859db6677