Presentation Date: November 13, 2025
Tool: AIPerf v0.3.0
- Setup: Installing AIPerf 0.3.0
- Test Endpoint Details
- Use Case 1: Simple Profiling with Static ISL/OSL
- Use Case 2: Auditing Raw Results - Custom Percentile Analysis
- Use Case 3: Trace-Based Benchmarking with Mooncake
- Use Case 4: Goodput Analysis - Measuring SLA Compliance
- Use Case 5: Time-Sliced Analysis - Performance Over Time
- Summary
- Advanced Topics
- Coming Soon
Note: AIPerf 0.3.0 is not yet available on PyPI. Install from the GitHub repository:
# Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate
# Install AIPerf from release/0.3.0 branch
pip install --upgrade git+https://github.com/ai-dynamo/aiperf.git@release/0.3.0
# Verify installation
aiperf --version
# Expected output: 0.3.0

Why 0.3.0?
- ✅ Fixed dashboard UI bug with tokenizer downloads
- ✅ Improved stability and performance
- ✅ Enhanced reporting features
- ✅ Direct `aiperf` command (no need for `python -m aiperf`)
Note: This was a demo endpoint used for the November 13, 2025 presentation. The cluster has been taken down.
Model: Qwen3-0.6B (Qwen/Qwen3-0.6B)
Inference Engine: vLLM v0.11.0
Architecture: 8-way data parallelism (8 independent vLLM replicas)
Hardware: 8x NVIDIA H200 GPUs (1 GPU per replica)
Deployment: Kubernetes on Nebius Cloud
Demo Endpoint (no longer active):
# This endpoint was available during the demo:
# export ENDPOINT_URL="http://89.169.112.187:8000"

Why this endpoint was chosen for the demo:
- Small model (~600M parameters) = high throughput for benchmarking
- 8 replicas = demonstrated horizontal scaling
- Public access = allowed live demonstration
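The demo endpoint is gone, but before benchmarking your own deployment it is worth a quick readiness check. A minimal sketch, assuming an OpenAI-compatible server (as vLLM exposes) reachable at `ENDPOINT_URL`:

```python
import os

import requests  # third-party: pip install requests

# Point this at your own OpenAI-compatible endpoint; the demo IP above is no longer live.
endpoint_url = os.environ.get("ENDPOINT_URL", "http://localhost:8000")

# vLLM's OpenAI-compatible server exposes GET /v1/models; a 200 response listing your
# model is a good sign the endpoint is ready to benchmark.
resp = requests.get(f"{endpoint_url}/v1/models", timeout=10)
resp.raise_for_status()
print("Models served:", [m["id"] for m in resp.json()["data"]])
```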
Goal: Measure baseline performance under controlled load
aiperf profile \
--model qwen3-0.6b \
--url $ENDPOINT_URL \
--endpoint-type chat \
--streaming \
--concurrency 100 \
--request-count 1000 \
--isl 1000 \
--osl 500 \
  --tokenizer Qwen/Qwen3-0.6B

| Arg | Value | Purpose |
|---|---|---|
| `--model` | `qwen3-0.6b` | Model identifier (matches endpoint) |
| `--url` | `$ENDPOINT_URL` | Target inference server |
| `--endpoint-type` | `chat` | OpenAI chat completions API |
| `--streaming` | (flag) | Enable token streaming |
| `--concurrency` | `100` | Simultaneous connections |
| `--request-count` | `1000` | Total requests to send |
| `--isl` | `1000` | Input tokens per request |
| `--osl` | `500` | Output tokens per response |
| `--tokenizer` | `Qwen/Qwen3-0.6B` | HuggingFace tokenizer for accuracy |
Key Insight: This creates 100 "virtual users" sending 1,000 requests total with large payloads (1000→500 tokens).
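Before reading the aggregate table below, it helps to see what TTFT and inter-token latency mean for a single streaming request. This is an illustrative sketch only (not AIPerf's implementation), assuming an OpenAI-compatible `/v1/chat/completions` endpoint; SSE chunks are treated as a rough proxy for tokens:

```python
import json
import os
import time

import requests  # pip install requests

endpoint_url = os.environ.get("ENDPOINT_URL", "http://localhost:8000")
payload = {
    "model": "qwen3-0.6b",
    "messages": [{"role": "user", "content": "Explain KV caching in one paragraph."}],
    "stream": True,
    "max_tokens": 128,
}

start = time.perf_counter()
chunk_times = []
with requests.post(f"{endpoint_url}/v1/chat/completions",
                   json=payload, stream=True, timeout=120) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        # OpenAI-style streaming sends "data: {...}" SSE lines, ending with "data: [DONE]".
        if not line.startswith(b"data: ") or line == b"data: [DONE]":
            continue
        chunk = json.loads(line[len(b"data: "):])
        if chunk.get("choices") and chunk["choices"][0]["delta"].get("content"):
            chunk_times.append(time.perf_counter())

# TTFT = time until the first content chunk; ITL = gaps between consecutive chunks.
ttft_ms = (chunk_times[0] - start) * 1000
itls_ms = [(b - a) * 1000 for a, b in zip(chunk_times, chunk_times[1:])]
print(f"TTFT: {ttft_ms:.1f} ms | mean ITL: {sum(itls_ms) / max(len(itls_ms), 1):.1f} ms")
```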
NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ Metric ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p50 ┃ std ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ Time to First Token (ms)│ 347.15 │ 204.55 │1,052.66│ 815.02│ 577.05│ 289.49│ 143.57│
│ Request Latency (ms) │ 2,101.75 │ 693.08 │4,770.98│3,613.75│2,319.79│2,057.50│ 303.17│
│ Inter Token Latency (ms)│ 3.57 │ 1.99 │ 8.55 │ 5.78 │ 3.93 │ 3.49 │ 0.54│
│ Output Token Throughput │22,521.42 │ N/A │ N/A │ N/A │ N/A │ N/A │ N/A│
│ (tokens/sec) │ │ │ │ │ │ │ │
│ Request Throughput │ 45.70 │ N/A │ N/A │ N/A │ N/A │ N/A │ N/A│
│ (requests/sec) │ │ │ │ │ │ │ │
│ Request Count │ 1,000.00 │ N/A │ N/A │ N/A │ N/A │ N/A │ N/A│
└─────────────────────────┴──────────┴────────┴────────┴────────┴────────┴────────┴───────┘
Benchmark Duration: 21.88 sec
Success Rate: 100% (0 errors)
✅ TTFT = 347ms: Fast first token delivery - users see responses quickly
✅ Request Latency = 2.1s: Total time to generate 500 tokens per request
✅ System Throughput = 22.5K tokens/sec: High capacity with 100 concurrent users
✅ ITL = 3.57ms: Smooth, consistent token streaming
✅ P99 Latency = 3.6s: Even worst-case requests complete reasonably fast
What we learned:
- With 100 concurrent users and large payloads (1000→500 tokens), the system maintained stable performance
- P99 latency (3.6s) vs. avg (2.1s) shows good consistency: the tail is only ~70% higher than the mean
- Zero errors = reliable service under load
- 22.5K tokens/sec sustained throughput demonstrates 8-replica scaling effectiveness
Goal: Understand the trade-off between resource utilization (TPS/GPU) and user experience (TPS/User) at different concurrency levels.
We ran the same benchmark at 5 different concurrency levels (10, 50, 100, 200, 500) to observe how throughput per GPU and throughput per user change:
# Example commands (run each separately)
aiperf profile --model qwen3-0.6b --url $ENDPOINT_URL \
--endpoint-type chat --streaming --concurrency 10 \
--request-count 1000 --isl 1000 --osl 500 \
--tokenizer Qwen/Qwen3-0.6B --artifact-dir artifacts/pareto-c10
# (Repeat for concurrency: 50, 100, 200, 500)

| Concurrency | Total TPS | TPS/GPU | TPS/User | TTFT (avg) |
|---|---|---|---|---|
| 10 | 3,045 | 1,522 | 364.69 | ~250 ms |
| 50 | 12,890 | 6,445 | 326.10 | ~270 ms |
| 100 | 22,521 | 11,261 | 285.03 | ~347 ms |
| 200 | 35,999 | 18,000 ⭐ | 238.67 | ~420 ms |
| 500 | 29,836 | 14,918 | 128.85 | ~1,129 ms |
Hardware: 8 vLLM replicas on 8 H200 GPUs (so we divide Total TPS by 8 for TPS/GPU)
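To automate the sweep, a small driver script works fine (a sketch; the flags mirror the single-run command above, and the artifact directory naming is just a convention):

```python
import os
import subprocess

# Run the same benchmark at each concurrency level, writing results to separate artifact dirs.
for concurrency in (10, 50, 100, 200, 500):
    subprocess.run(
        [
            "aiperf", "profile",
            "--model", "qwen3-0.6b",
            "--url", os.environ["ENDPOINT_URL"],
            "--endpoint-type", "chat",
            "--streaming",
            "--concurrency", str(concurrency),
            "--request-count", "1000",
            "--isl", "1000",
            "--osl", "500",
            "--tokenizer", "Qwen/Qwen3-0.6B",
            "--artifact-dir", f"artifacts/pareto-c{concurrency}",
        ],
        check=True,
    )
```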
[Visual: Pareto Curve Chart - See presentation slide]
The Pareto frontier shows the inverse relationship between resource efficiency and user experience:
| Point | Concurrency | TPS/GPU | TPS/User | Interpretation |
|---|---|---|---|---|
| Far Right | c=10 | 1,522 | 365 | Best user experience, poor GPU utilization |
| Moving Up-Left | c=50 | 6,445 | 326 | Trading UX for efficiency |
| Middle | c=100 | 11,261 | 285 | Balanced middle ground |
| Peak ⭐ | c=200 | 18,000 | 239 | Maximum GPU efficiency |
| Collapse | c=500 | 14,918 | 129 | Over-saturation degrades both |
Key Insight: The Pareto curve demonstrates you cannot optimize both metrics simultaneously. Choose your operating point based on whether you prioritize cost efficiency (c=200) or user experience (c=10-50).
✅ Low Concurrency (10-50):
- Best user experience: 365 tokens/sec per user = very responsive
- Poor resource utilization: Only 1,500-6,500 TPS/GPU = GPUs are underutilized
- Use case: Premium tier, low-latency applications
✅ Medium Concurrency (100-200):
- Balanced performance: ~11,000-18,000 TPS/GPU
- Good user experience: ~240-285 tokens/sec per user
- Sweet spot at c=200: Peak resource utilization (18K TPS/GPU) with acceptable user experience
- Use case: General production workloads
❌ High Concurrency (500+):
- Degraded resource utilization: TPS/GPU drops from 18K → 15K
- Poor user experience: 129 tokens/sec per user, TTFT = 1.1 seconds
- Queuing dominates: Request backlog causes both metrics to degrade
- Use case: Avoid this region unless cost is the only priority
Question: Should you optimize for cost efficiency (max TPS/GPU) or user satisfaction (max TPS/User)?
| Priority | Optimal Concurrency | Justification |
|---|---|---|
| User Experience | 10-50 | Sub-300ms TTFT, 325+ tokens/sec/user |
| Balanced | 100-200 ⭐ | 18K TPS/GPU, 240+ tokens/sec/user |
| Cost Efficiency | 200 | Peak TPS/GPU before degradation |
The c=200 "sweet spot":
- 12x better resource utilization vs. c=10 (18K vs. 1.5K TPS/GPU)
- Only 35% reduction in per-user throughput (239 vs. 365 tokens/sec/user)
- TTFT still under 500ms for most requests
🔍 Performance is non-linear: Doubling concurrency doesn't double throughput
📊 The inverted-U curve: TPS/GPU rises, peaks at c=200, then falls due to queuing overhead
⚖️ No free lunch: Higher concurrency = better GPU utilization BUT worse user experience
🎯 Know your SLA: Choose concurrency based on your latency vs. throughput priorities
Pro tip: Run this analysis on YOUR endpoint with YOUR request patterns to find YOUR sweet spot!
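To build the table above from your own runs, you can aggregate the raw per-request exports (the `profile_export.jsonl` schema is shown in Use Case 2). This is a sketch under a few assumptions: the artifact layout from the sweep above, 8 GPUs, and per-user throughput approximated as each request's output tokens over its own latency; AIPerf's built-in summaries may compute these slightly differently.

```python
import json
from pathlib import Path

NUM_GPUS = 8  # adjust to your deployment

for concurrency in (10, 50, 100, 200, 500):
    records = []
    for jsonl in Path(f"artifacts/pareto-c{concurrency}").rglob("profile_export.jsonl"):
        with open(jsonl) as f:
            records.extend(json.loads(line) for line in f)
    if not records:
        continue

    # Wall-clock window from the earliest request start to the latest request end.
    start_ns = min(r["metadata"]["request_start_ns"] for r in records)
    end_ns = max(r["metadata"]["request_end_ns"] for r in records)
    duration_s = (end_ns - start_ns) / 1e9

    total_tokens = sum(r["metrics"]["output_token_count"]["value"] for r in records)
    total_tps = total_tokens / duration_s

    # Per-user throughput: each request's output tokens divided by its own latency (seconds).
    per_user = [
        r["metrics"]["output_token_count"]["value"]
        / (r["metrics"]["request_latency"]["value"] / 1000.0)
        for r in records
    ]

    print(f"c={concurrency:>3}  total TPS={total_tps:8.0f}  "
          f"TPS/GPU={total_tps / NUM_GPUS:8.0f}  "
          f"TPS/user={sum(per_user) / len(per_user):7.1f}")
```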
Scenario: Your management defines SLAs using P75, not the standard P50/P90/P99 that AIPerf reports by default.
Goal: Calculate P75 TTFT from raw benchmark data.
AIPerf outputs detailed per-request data in profile_export.jsonl. Each line is a JSON record:
{
"metadata": {
"session_num": 87,
"x_request_id": "abd8df1a-7904-4aa0-8107-0d74ba0ac0d7",
"turn_index": 0,
"request_start_ns": 1763066701865462000,
"request_end_ns": 1763066703082535666,
"worker_id": "worker_b431129c"
},
"metrics": {
"time_to_first_token": {
"value": 582.66,
"unit": "ms"
},
"output_token_count": {
"value": 194,
"unit": "tokens"
},
"request_latency": {
"value": 1210.008,
"unit": "ms"
},
"input_sequence_length": {
"value": 1000,
"unit": "tokens"
},
"output_sequence_length": {
"value": 194,
"unit": "tokens"
},
"inter_token_latency": {
"value": 3.25,
"unit": "ms"
}
}
}

Key fields: Every request has time_to_first_token, request_latency, ISL, OSL, and more.
import json
import numpy as np
from pathlib import Path
# Read all TTFT values
ttft_values = []
with open("artifacts/.../profile_export.jsonl", 'r') as f:
for line in f:
record = json.loads(line)
ttft = record['metrics']['time_to_first_token']['value']
ttft_values.append(ttft)
# Calculate P75
p75_ttft = np.percentile(ttft_values, 75)
print(f"P75 TTFT: {p75_ttft:.2f} ms")============================================================
TTFT Percentile Analysis
============================================================
Total requests analyzed: 1000
Percentiles (ms):
P25 (25th percentile): 242.45 ms
P50 (50th percentile): 289.49 ms
P75 (75th percentile): 422.87 ms ⭐ YOUR SLA METRIC
P90 (90th percentile): 577.05 ms
P99 (99th percentile): 815.02 ms
============================================================
✅ P75 = 422.87ms: 75% of requests get first token within this time
✅ Raw data access: Calculate ANY custom metric your org needs
✅ Full transparency: Every request is logged with complete metrics
✅ Easy parsing: Standard JSON format, one record per line
Why this matters:
- Different orgs have different SLA definitions
- P75 is a common SLA target (balance between typical and worst-case)
- AIPerf's raw exports let you calculate ANY percentile or custom metric (see the sketch below)
- No need to re-run benchmarks for a different analysis
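Since every per-request metric shares the same `{value, unit}` shape, a tiny helper generalizes the P75 example to any metric and any percentile (a sketch; the artifact path is the same placeholder as above):

```python
import json

import numpy as np


def metric_percentile(jsonl_path: str, metric_name: str, pct: float) -> float:
    """Percentile of any per-request metric found in profile_export.jsonl."""
    values = []
    with open(jsonl_path) as f:
        for line in f:
            metric = json.loads(line)["metrics"].get(metric_name)
            if metric is not None:  # a metric may be absent on some requests
                values.append(metric["value"])
    return float(np.percentile(values, pct))


path = "artifacts/.../profile_export.jsonl"  # placeholder path, as above
print("P75 TTFT (ms):        ", metric_percentile(path, "time_to_first_token", 75))
print("P95 latency (ms):     ", metric_percentile(path, "request_latency", 95))
print("P75 inter-token (ms): ", metric_percentile(path, "inter_token_latency", 75))
```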
Goal: Test your system under realistic production workload patterns using privacy-preserving traces.
Mooncake is an open-source KV cache sharing system that released real production traces from their arXiv Q&A service. These traces capture actual user behavior including:
- Request arrival times
- Input/output token lengths
- Block hash IDs: Privacy-preserving identifiers for KV cache reuse patterns
The Problem: Sharing production traces risks leaking sensitive user data.
Mooncake's Solution: Hash every 512-token block of input. Users asking about the same document get the same hash IDs, enabling cache reuse analysis without revealing content.
Example: Multi-turn conversation
Turn 1: User uploads paper (7,500 tokens) + question (500 tokens)
├─ Total: 8,000 tokens = 16 blocks
└─ Hash IDs: [46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61]
Turn 2: Same paper + different question (8,500 tokens)
├─ Total: 8,500 tokens = 17 blocks
├─ Hash IDs: [46-61] (reused!) + [62] (new)
└─ ✅ Cache hit rate: 94% (16/17 blocks reused)
Turn 3: Same paper + another question (9,000 tokens)
├─ Total: 9,000 tokens = 18 blocks
├─ Hash IDs: [46-61] (reused!) + [62, 63] (new)
└─ ✅ Cache hit rate: 89% (16/18 blocks reused)
Key insight: Hash IDs reveal cache reuse opportunities while completely protecting user privacy.
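To make the mechanics concrete, here is a conceptual sketch of block hashing (our own illustration, not Mooncake's actual implementation): hashing each 512-token block means prompts that share a long prefix share the corresponding hash IDs, while the hashes reveal nothing about the underlying text. Exact block counts depend on how prompts align to block boundaries.

```python
import hashlib

BLOCK_SIZE = 512  # tokens per block, matching the Mooncake trace granularity


def block_hash_ids(token_ids):
    """Map a token sequence to privacy-preserving per-block hash IDs (conceptual sketch)."""
    ids = []
    for i in range(0, len(token_ids), BLOCK_SIZE):
        block = token_ids[i : i + BLOCK_SIZE]
        ids.append(hashlib.sha256(str(block).encode()).hexdigest()[:8])
    return ids


paper = [0] * 7_500        # stand-in token IDs for a shared 7,500-token paper
question_1 = [1] * 500     # first question about the paper
question_2 = [2] * 1_000   # a different, longer question about the same paper

ids1 = block_hash_ids(paper + question_1)
ids2 = block_hash_ids(paper + question_2)
shared = sum(a == b for a, b in zip(ids1, ids2))
print(f"turn 1: {len(ids1)} blocks | turn 2: {len(ids2)} blocks | shared prefix blocks: {shared}")
```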
======================================================================
MOONCAKE ARXIV TRACE - DATASET CHARACTERISTICS
======================================================================
📊 OVERALL STATISTICS
Total Requests: 23,608
Duration: 60.0 minutes (3,600 seconds)
Avg Request Rate: 393.5 requests/minute
📏 TOKEN DISTRIBUTION (Input + Output)
Mean: 8,772 tokens
Median: 6,402 tokens
P25: 3,331 tokens | P75: 7,562 tokens
P90: 17,140 tokens | P99: 61,961 tokens
Max: 125,878 tokens
📊 REQUEST SIZE DISTRIBUTION
Token Range | Count | Percentage | Visualization
──────────────────────────────────────────────────────────
0 - 5,000 | 7,632 | 32.3% | ████████████████
5,000 - 10,000 | 11,626| 49.2% | ████████████████████████
10,000 - 20,000 | 2,499 | 10.6% | █████
20,000 - 40,000 | 1,325 | 5.6% | ██
40,000 - 60,000 | 272 | 1.2% |
60,000 - 80,000 | 135 | 0.6% |
80,000 - 100,000 | 65 | 0.3% |
100,000+ | 54 | 0.2% |
⏱️ REQUEST ARRIVAL PATTERN (5-minute windows)
Time Window | Requests | Load Pattern
───────────────────────────────────────────────────────
0 - 4 min | 1,765 | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
5 - 9 min | 1,657 | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
10 - 14 min | 1,875 | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
15 - 19 min | 1,860 | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
20 - 24 min | 1,992 | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
25 - 29 min | 2,010 | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
30 - 34 min | 2,012 | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
35 - 39 min | 2,063 | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
40 - 44 min | 2,133 | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
45 - 49 min | 2,026 | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
50 - 54 min | 2,125 | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
55 - 59 min | 1,680 | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
======================================================================
Key characteristics of real production traffic:
✅ Highly Variable Request Sizes: 49% of requests are 5K-10K tokens, but tail extends to 125K
✅ Long-Context Dominant: Median of 6,402 tokens vs. typical benchmarks using 1K-2K
✅ Consistent Load: ~393 requests/minute with relatively steady arrival rate
✅ Heavy Tail Distribution: 2% of requests exceed 40K tokens (production reality!)
This represents real-world patterns you won't get from synthetic benchmarks:
- Multi-turn conversations (shared hash IDs across requests)
- Variable request sizes (not uniform 1K/500 like Use Case 1)
- Realistic timing (actual production arrival patterns)
- Long-context queries that stress-test model limits
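The dataset summary above can be reproduced by scanning the raw trace. A sketch, assuming each line carries `timestamp` (milliseconds), `input_length`, `output_length`, and `hash_ids` fields as in the published arXiv trace; verify the field names against the file you download:

```python
import json

import numpy as np

with open("mooncake_trace.jsonl") as f:
    records = [json.loads(line) for line in f]

timestamps_s = np.array([r["timestamp"] for r in records]) / 1000.0        # ms -> s
total_tokens = np.array([r["input_length"] + r["output_length"] for r in records])

duration_min = (timestamps_s.max() - timestamps_s.min()) / 60.0
print(f"Total requests:   {len(records)}")
print(f"Duration:         {duration_min:.1f} minutes")
print(f"Avg request rate: {len(records) / duration_min:.1f} requests/minute")
print(f"Tokens (in+out):  mean={total_tokens.mean():.0f}  median={np.median(total_tokens):.0f}  "
      f"p99={np.percentile(total_tokens, 99):.0f}  max={total_tokens.max()}")
```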
# Download the Mooncake trace
curl -o mooncake_trace.jsonl https://raw.githubusercontent.com/kvcache-ai/Mooncake/refs/heads/main/FAST25-release/arxiv-trace/mooncake_trace.jsonl
# Option 1: Replay with original timing (for end-to-end system testing)
aiperf profile \
--model qwen3-0.6b \
--url $ENDPOINT_URL \
--endpoint-type chat \
--streaming \
--input-file mooncake_trace.jsonl \
--custom-dataset-type mooncake_trace \
--fixed-schedule \
--tokenizer Qwen/Qwen3-0.6B
# Option 2: Replay as fast as possible (for capacity testing)
aiperf profile \
--model qwen3-0.6b \
--url $ENDPOINT_URL \
--endpoint-type chat \
--streaming \
--input-file mooncake_trace.jsonl \
--custom-dataset-type mooncake_trace \
  --tokenizer Qwen/Qwen3-0.6B

| Aspect | Use Case 1 (Synthetic) | Use Case 3 (Trace-Based) |
|---|---|---|
| Request Pattern | Uniform (all 1000→500) | Variable (hundreds to ~126K tokens) |
| Arrival Timing | Constant concurrency | Bursty, realistic timing |
| KV Cache | No reuse patterns | Real cache-sharing patterns |
| Use Case | Steady-state capacity | Production validation |
✅ Realistic Load Testing: Test how your system handles actual production patterns, not idealized synthetic load
✅ KV Cache Validation: If you implement cache sharing (like Mooncake), trace data shows real hit rates
✅ Capacity Planning: See performance under bursty traffic with variable request sizes
✅ Privacy-Preserving: Hash-based traces enable sharing without exposing sensitive data
Pro tip: Use --fixed-schedule for end-to-end system validation (respects timing), or remove it to stress-test maximum throughput capacity.
We extracted the first 5 minutes of the Mooncake trace (1,765 requests) and sped it up 5x so it replays in about a minute. A sketch of the extraction step is shown below, followed by the replay command.
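Deriving the subset is a few lines of preprocessing (a sketch, assuming the millisecond `timestamp` field noted earlier; the output filename matches the command below):

```python
import json

FIVE_MINUTES_MS = 5 * 60 * 1000
SPEEDUP = 5

with open("mooncake_trace.jsonl") as src, open("mooncake_trace_5min_5x.jsonl", "w") as dst:
    for line in src:
        record = json.loads(line)
        if record["timestamp"] >= FIVE_MINUTES_MS:
            continue                                           # keep only the first 5 minutes
        record["timestamp"] = record["timestamp"] // SPEEDUP   # compress arrival times 5x
        dst.write(json.dumps(record) + "\n")
```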
# Replay the subset (first 5 minutes, sped up 5x)
aiperf profile \
--model qwen3-0.6b \
--url $ENDPOINT_URL \
--endpoint-type chat \
--streaming \
--input-file mooncake_trace_5min_5x.jsonl \
--custom-dataset-type mooncake_trace \
--fixed-schedule \
  --tokenizer Qwen/Qwen3-0.6B

Results:
NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┓
┃ Metric ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p50 ┃ std ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━┩
│ Time to First Token │ 407.42 │ 212.68 │ 1,519.5 │ 951.16 │ 586.01 │ 370.20 │ 150.12 │
│ (ms) │ │ │ │ │ │ │ │
│ Request Latency (ms) │ 1,171.0 │ 243.14 │ 6,665.7 │ 4,184.4 │ 2,615.9 │ 648.33 │ 978.09 │
│ Inter Token Latency │ 5.97 │ 0.00 │ 88.31 │ 17.88 │ 10.72 │ 4.54 │ 5.46 │
│ (ms) │ │ │ │ │ │ │ │
│ Output Sequence Length│ 175.27 │ 1.00 │ 1,165.0 │ 761.65 │ 510.00 │ 28.00 │ 220.30 │
│ (tokens) │ │ │ │ │ │ │ │
│ Input Sequence Length │ 7,243.0 │ 890.00 │32,236.0 │27,260.0 │15,157.0 │6,344.0 │ 5,536.0 │
│ (tokens) │ │ │ │ │ │ │ │
│ Output Token │ 4,675.0 │ N/A │ N/A │ N/A │ N/A │ N/A │ N/A │
│ Throughput (tok/sec) │ │ │ │ │ │ │ │
│ Request Throughput │ 26.68 │ N/A │ N/A │ N/A │ N/A │ N/A │ N/A │
│ (requests/sec) │ │ │ │ │ │ │ │
│ Request Count │ 1,690 │ N/A │ N/A │ N/A │ N/A │ N/A │ N/A │
│ (successful) │ │ │ │ │ │ │ │
└───────────────────────┴─────────┴────────┴─────────┴─────────┴─────────┴────────┴─────────┘
Benchmark Duration: 63.35 sec
Success Rate: 96% (75 requests exceeded 32K context window)
✅ Highly Variable Request Sizes:
- Input: 890→32,236 tokens (36x range!)
- Output: 1→1,165 tokens
- Median input: 6,344 tokens (much larger than our synthetic 1K)
✅ Performance Under Real Load:
- TTFT = 407ms average despite a median input of ~6.3K tokens (mean ~7.2K)
- System handled 4,675 tokens/sec with bursty, variable traffic
- P99 TTFT = 951ms (some large requests took longer, as expected)
✅ Realistic Failures:
- 75 requests (4%) exceeded Qwen3-0.6B's 32K context limit
- This reveals a real operational constraint you'd miss with synthetic tests
- Production insight: Need longer-context model or request filtering
✅ Production Timing Patterns:
- Trace shows realistic request bursts and lulls
- Not constant load like `--concurrency 100`
- More representative of actual user traffic patterns
What we learned from trace-based vs. synthetic testing:
- Use Case 1 (synthetic): 100% success, uniform 1K→500 tokens, 22.5K TPS
- Use Case 3 (trace): 96% success, variable 890→32K input tokens, 4.7K TPS, revealed context window issues
Trace-based testing exposes real-world challenges that synthetic benchmarks hide!
Goal: Measure what percentage of requests meet your defined Service Level Objectives (SLOs), not just average performance.
Goodput = The fraction of requests that meet ALL specified SLA thresholds.
Why it matters:
- Throughput tells you how many requests/sec your system handles
- Goodput tells you how many requests/sec deliver acceptable user experience
- A system can have high throughput but low goodput if most requests miss SLAs!
Definition (from DistServe paper):
"Goodput measures the number of requests per second that meet specified service-level objectives (SLOs), providing a metric that directly reflects user-perceived quality of service."
Imagine two systems serving 1000 requests/min:
- System A: 950 requests under SLA, 50 requests timeout → 95% goodput
- System B: 500 requests under SLA, 500 requests slow → 50% goodput
Both have the same throughput, but System A delivers 2x better user experience!
We'll use the same Mooncake trace, but add SLO thresholds:
# Define SLA thresholds based on your business requirements
# Example: TTFT ≤ 370ms, Request Latency ≤ 648ms
aiperf profile \
--model qwen3-0.6b \
--url $ENDPOINT_URL \
--endpoint-type chat \
--streaming \
--input-file mooncake_trace_5min_5x.jsonl \
--custom-dataset-type mooncake_trace \
--fixed-schedule \
--tokenizer Qwen/Qwen3-0.6B \
--goodput "time_to_first_token:370 request_latency:648" NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┓
┃ Metric ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p50 ┃ std ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━┩
│ Time to First Token │ 428.86 │ 209.96 │ 1,651.8 │ 1,109.7 │ 649.21 │ 385.29 │ 176.32 │
│ (ms) │ │ │ │ │ │ │ │
│ Request Latency (ms) │ 1,208.9 │ 229.80 │ 6,280.6 │ 4,350.7 │ 2,726.4 │ 691.07 │ 1,005.5 │
│ Request Throughput │ 26.67 │ N/A │ N/A │ N/A │ N/A │ N/A │ N/A │
│ (requests/sec) │ │ │ │ │ │ │ │
│ Goodput │ 7.43 │ N/A │ N/A │ N/A │ N/A │ N/A │ N/A │ ⭐
│ (requests/sec) │ │ │ │ │ │ │ │
└───────────────────────┴─────────┴────────┴─────────┴─────────┴─────────┴────────┴─────────┘
Benchmark Duration: 63.37 sec
Success Rate: 96% (75 requests exceeded 32K context window)
Goodput vs. Throughput:
Total Throughput: 26.67 requests/sec (100%)
Goodput: 7.43 requests/sec (28%) ⚠️
────────────────────────────────────────────
Only 28% of requests met BOTH SLO requirements!
Understanding the results:
- SLO Thresholds: TTFT ≤ 370ms AND Request Latency ≤ 648ms
- Average TTFT: 428ms (above threshold)
- Median Latency: 691ms (above threshold)
- 72% of requests failed to meet at least one SLO
- Raw throughput doesn't reveal this user experience gap!
How SLO compliance works:
- Requests must meet ALL SLO criteria to count toward goodput
- A request with TTFT=350ms but latency=700ms fails (missed latency SLO)
- A request with TTFT=400ms but latency=600ms fails (missed TTFT SLO)
- Only requests with TTFT≤370ms AND latency≤648ms count as goodput (recomputed from the raw export in the sketch below)
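Because goodput is just a per-request pass/fail check, you can also recompute it offline from the raw export (same jsonl schema as Use Case 2) and try different thresholds without re-running the benchmark. A sketch:

```python
import json

# SLO thresholds in ms, matching the --goodput flags above; adjust per tier.
SLOS = {"time_to_first_token": 370.0, "request_latency": 648.0}

total = good = 0
with open("artifacts/.../profile_export.jsonl") as f:  # placeholder path, as in Use Case 2
    for line in f:
        metrics = json.loads(line)["metrics"]
        total += 1
        # A request counts toward goodput only if it meets every SLO.
        if all(metrics[name]["value"] <= limit for name, limit in SLOS.items()):
            good += 1

print(f"Requests meeting all SLOs: {good}/{total} ({100 * good / total:.1f}%)")
```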
| Metric | What It Measures | What It Misses |
|---|---|---|
| Average TTFT | Typical first token delay | Tail latency, SLA violations |
| P99 Latency | Worst-case performance | Overall SLA compliance rate |
| Throughput | System capacity | User experience quality |
| Goodput ⭐ | % requests meeting SLAs | Nothing - it's the complete picture! |
Question: How many servers do I need to handle 1000 req/sec with 95% goodput?
Without goodput analysis:
- Measure throughput: 26.67 req/sec per server
- Calculate: 1000 / 26.67 = 38 servers
- Problem: This assumes all requests meet SLAs! ❌
With goodput analysis:
- Measure goodput: 7.43 req/sec per server (28% of throughput)
- Calculate: 1000 / 7.43 = 135 servers
- Reality: Need 3.5x more capacity to meet SLAs ✅
The cost of ignoring goodput: underprovisioning by roughly 3.5x (38 servers when you actually need 135)!
Different use cases need different SLOs:
# Strict SLOs (premium tier)
--goodput "time_to_first_token:250 request_latency:500"
# Balanced SLOs (standard tier)
--goodput "time_to_first_token:370 request_latency:648"
# Relaxed SLOs (batch processing)
--goodput "time_to_first_token:600 request_latency:2500"Pro tip: Set SLO thresholds based on your business requirements, then use goodput to measure compliance and plan capacity accordingly.
Goal: Understand how performance metrics evolve during a benchmark to detect warm-up effects, degradation patterns, or load-dependent behavior.
Time-slicing divides your benchmark into sequential time windows, computing metrics independently for each window.
Why it matters:
- Detect warm-up effects: Identify cold-start latency vs. steady-state performance
- Spot degradation: Find memory leaks or resource exhaustion over time
- Understand load patterns: See how performance changes as traffic evolves
- Validate SLAs over time: Ensure consistent performance, not just averages
We'll use the same Mooncake trace with 10-second time slices:
aiperf profile \
--model qwen3-0.6b \
--url $ENDPOINT_URL \
--endpoint-type chat \
--streaming \
--input-file mooncake_trace_5min_5x.jsonl \
--custom-dataset-type mooncake_trace \
--fixed-schedule \
--tokenizer Qwen/Qwen3-0.6B \
  --slice-duration 10

Output: AIPerf generates additional files:
- `profile_export_aiperf_timeslices.csv` - Time-series data in tidy format
- `profile_export_aiperf_timeslices.json` - Hierarchical time-series data
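The same view can also be reconstructed directly from the raw per-request export using `request_start_ns` (a sketch; slice boundaries and exact numbers may differ slightly from the files AIPerf generates and from the analysis shown below):

```python
import json
from collections import defaultdict

import numpy as np

SLICE_S = 10  # seconds per slice, matching --slice-duration above

with open("artifacts/.../profile_export.jsonl") as f:  # placeholder path
    records = [json.loads(line) for line in f]

t0 = min(r["metadata"]["request_start_ns"] for r in records)
slices = defaultdict(list)
for r in records:
    idx = int((r["metadata"]["request_start_ns"] - t0) / 1e9) // SLICE_S
    slices[idx].append(r["metrics"]["time_to_first_token"]["value"])

for idx in sorted(slices):
    ttfts = slices[idx]
    print(f"slice {idx:2d} ({idx * SLICE_S:3d}-{(idx + 1) * SLICE_S:3d}s): "
          f"{len(ttfts):4d} requests  TTFT avg {np.mean(ttfts):6.1f} ms  "
          f"p90 {np.percentile(ttfts, 90):6.1f} ms")
```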
==========================================================================================
TIME-SLICED PERFORMANCE ANALYSIS (10-second slices)
==========================================================================================
Slice | Time | Requests | TTFT (ms) | Latency (ms) | Throughput
# | Window | Count | avg (p90)| avg (p90) | (tokens/s)
------------------------------------------------------------------------------------------
0 | 0-10s | 111 | 545 ( 900) | 1516 ( 3217) | 3203
1 | 10-20s | 223 | 381 ( 560) | 1050 ( 2300) | 3027
2 | 20-30s | 279 | 376 ( 502) | 1266 ( 3008) | 4014
3 | 30-40s | 293 | 388 ( 655) | 1272 ( 2942) | 3648
4 | 40-50s | 302 | 387 ( 500) | 976 ( 2173) | 3554
5 | 50-60s | 303 | 344 ( 444) | 999 ( 2313) | 3470
6 | 60-70s | 179 | 374 ( 517) | 1427 ( 2803) | 4258
==========================================================================================
TREND ANALYSIS:
TTFT Range: 344ms - 545ms (variation: 58.6%)
Throughput Range: 3027 - 4258 tokens/s
First slice TTFT: 545ms vs. Last slice: 374ms
✅ Warm-up detected: TTFT improved after first slice (cold start effect)
==========================================================================================
1. Warm-Up Effect Detected:
Slice 0 (0-10s): TTFT = 545ms ⚠️ Cold start
Slice 1 (10-20s): TTFT = 381ms ✅ 30% improvement after warm-up
Slices 2-6: TTFT = 344-388ms ✅ Stable steady-state
Why this matters:
- First 10 seconds show 545ms TTFT (above target)
- Performance improves 30% after warm-up
- Steady-state performance (344-388ms) is significantly better than cold-start
- Implication: Pre-warming servers before production traffic prevents SLA violations
2. Variable Load Patterns:
- Request distribution not uniform: 111 requests (slice 0) → 303 requests (slice 5)
- Throughput varies with load: 3.0K - 4.3K tokens/sec
- System handles variable load without significant degradation
3. No Performance Degradation:
- TTFT remains stable from slice 1-6 (344-388ms range)
- No upward trend in latency over time
- No signs of memory leaks or resource exhaustion
- System is healthy for sustained operation
| Metric | Overall Average | Slice 0 (Cold) | Slice 1-6 (Warm) |
|---|---|---|---|
| TTFT | 386ms | 545ms (+41%) | 344-388ms (baseline) |
| Latency | 1,172ms | 1,516ms | 976-1,427ms |
The hidden truth: Overall averages mask the 41% cold-start penalty!
Scenario 1: Detecting Warm-Up Effects
Problem: SLA violations in first minute of operation
Solution: Use time-slicing to quantify warm-up penalty
Action: Pre-warm servers or set longer health check delays
Scenario 2: Finding Memory Leaks
Problem: Performance degrades after hours of operation
Solution: Run long benchmark with time-slicing (--benchmark-duration 3600 --slice-duration 300)
Look for: Increasing TTFT/latency in later slices
Scenario 3: Load Pattern Validation
Problem: Trace-based tests with varying load
Solution: Time-slice to see if performance varies with request density
Look for: Correlation between requests/slice and latency
✅ Choose appropriate slice duration:
- Too short (<5s): High variance, unstable metrics
- Too long (>60s): Miss fine-grained patterns
- Recommended: 10-30 seconds for most workloads
✅ Use with trace-based benchmarks:
- Time-slicing + realistic traces = complete picture
- See both overall AND time-evolving performance
✅ Compare cold vs. warm state:
- Exclude slice 0 from steady-state SLA calculations
- Report both cold-start and warm-state performance separately
✅ Monitor for degradation:
- Upward trend in latency = resource issue
- Flat or decreasing latency = healthy system
We've demonstrated 5 powerful AIPerf use cases:
- Simple Profiling + Pareto Analysis: Find the sweet spot between user experience and resource utilization
- Custom Percentile Analysis: Calculate any metric your organization needs from raw data
- Trace-Based Benchmarking: Test with realistic production workload patterns
- Goodput Analysis: Measure actual SLA compliance, not just raw throughput
- Time-Sliced Analysis: Understand performance evolution and detect warm-up/degradation
Key Takeaway: Synthetic benchmarks (Use Case 1) provide baseline capacity, but real-world validation requires traces (Use Case 3), goodput (Use Case 4), and time-series analysis (Use Case 5) to ensure production readiness.
For high-scale testing, consider running AIPerf from within your Kubernetes cluster to:
- Eliminate network latency between client and server
- Avoid ephemeral port exhaustion on client machines at extreme concurrency
- Test true server capacity without client-side bottlenecks
Deploy a load-tester pod in the same cluster as your inference endpoint and use the internal ClusterIP service address for benchmarking.
Simulate real-world user behavior where requests are cancelled mid-flight (e.g., users navigating away, timeouts):
aiperf profile \
--model qwen3-0.6b \
--url $ENDPOINT_URL \
--endpoint-type chat \
--streaming \
--concurrency 10 \
--request-count 100 \
--request-cancellation-rate 20 \
--request-cancellation-delay 0.5 \
--isl 800 \
--osl 400 \
  --tokenizer Qwen/Qwen3-0.6B

Parameters:
- `--request-cancellation-rate 20`: Cancel 20% of requests
- `--request-cancellation-delay 0.5`: Wait 0.5 seconds before cancelling
Use Cases:
- Test server resource cleanup and connection pooling
- Measure impact of cancellations on remaining requests
- Validate graceful degradation under partial failures
AIPerf is actively developing new capabilities:
- Internal queuing metrics: Measure time spent in scheduling queues, KV cache wait times, and batching delays
- Engine telemetry: Deep visibility into request lifecycle within the inference engine
- Resource utilization: Track memory pressure, GPU utilization, and scheduling efficiency
- Block reuse patterns: Generate synthetic traces that test prefix caching and attention reuse
- Cache hit rate benchmarks: Measure performance gains from shared KV cache blocks
- Memory efficiency testing: Validate PagedAttention and other cache optimization strategies
- Distributed load generation: Deploy multiple load-tester pods to simulate thousands of concurrent users
- Large-scale workloads: Test production-scale traffic patterns without client-side bottlenecks
- Automated orchestration: Kubernetes operators to manage benchmark lifecycles and resource allocation
- Post-profiling analysis: Automatically generate visualizations after each benchmark run
- Server-side metrics: Plot GPU utilization, memory usage, and queuing patterns over time
- Client-side results: Visualize latency distributions, throughput curves, and goodput trends
- Comparison views: Side-by-side charts for multiple benchmark runs
GitHub Gist: https://gist.github.com/BenHamm/31c648f7d7331c94c1f3a45859db6677