We tested AIConfigurator (version 0.4.0) against the performance claims in the "Advanced Disagg Perf Tuning" guide and found a significant discrepancy between AIC's predictions and the guide's reported results.
Key Finding: AIC predicts disaggregated serving provides 7-8% improvement, while the guide reports 102-148% improvement - a 10-20x difference in expected gains.
Source Document: The guide being tested is from PR #4655 by davilu-nvidia (submitted Nov 27, 2025, currently under review and not yet merged).
Model: Qwen3-32B-FP8
Hardware: H200 SXM
Backend: TensorRT-LLM
Parameters:
- Input Sequence Length (ISL): 4000 tokens
- Output Sequence Length (OSL): 500 tokens
- TPS/user target: ≥60 tokens/s/user
- Total GPUs: 8 (for fair comparison)
600 ms TTFT scenario:

| Metric | Guide Claims | AIC Predicts | Difference |
|---|---|---|---|
| Improvement | 148% (2.48x) | 8% (1.08x) | 140 percentage points |
| Best Agg | Not specified | 686.87 tokens/s/gpu | - |
| Best Disagg | Not specified | 739.12 tokens/s/gpu | - |
1200 ms TTFT scenario:

| Metric | Guide Claims | AIC Predicts | Difference |
|---|---|---|---|
| Improvement | 102% (2.02x) | 7% (1.07x) | 95 percentage points |
| Best Agg | Not specified | 689.39 tokens/s/gpu | - |
| Best Disagg | Not specified | 739.12 tokens/s/gpu | - |
Best aggregated configuration (600 ms TTFT scenario):
- Throughput: 686.87 tokens/s/gpu
- Configuration: 4 replicas × TP2 = 8 GPUs
- Batch size: 24
- TTFT: 512.55 ms ✓
- TPOT: 16.53 ms ✓
- User throughput: 60.51 tokens/s/user ✓

Best disaggregated configuration (identical in both scenarios):
- Throughput: 739.12 tokens/s/gpu
- Configuration: 2 replicas × (2P×TP1 + 1D×TP2) = 8 GPUs
  - Prefill: 2 workers × TP1 = 2 GPUs (batch_size=1)
  - Decode: 1 worker × TP2 = 2 GPUs (batch_size=56)
- TTFT: 547.98 ms ✓
- TPOT: 11.96 ms ✓
- User throughput: 83.58 tokens/s/user ✓
Improvement (600 ms scenario): 739.12 / 686.87 = 1.076x ≈ 7.6%
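The 1200 ms scenario yields a similar ratio against its own best aggregated config (689.39 tokens/s/gpu, from the full results in the appendix). A quick recomputation of both gains from the numbers above:

```bash
# Recompute the predicted disagg-over-agg gains from the
# per-GPU throughputs reported by AIC.
awk 'BEGIN {
  printf "600ms TTFT:  %.1f%% gain\n", (739.12 / 686.87 - 1) * 100
  printf "1200ms TTFT: %.1f%% gain\n", (739.12 / 689.39 - 1) * 100
}'
# -> 600ms TTFT:  7.6% gain
# -> 1200ms TTFT: 7.2% gain
```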
The AIConfigurator README includes an example with stricter SLA targets than what we tested:
```bash
aiconfigurator cli default --model QWEN3_32B --total_gpus 32 --system h200_sxm \
  --ttft 300 --tpot 10 --isl 4000 --osl 500 --prefix 500
```

This example claims 1.64x higher tokens/s/gpu for disaggregated vs. aggregated serving.
Comparison of test scenarios:
| Source | TTFT | TPOT | ISL | OSL | Prefix | Result |
|---|---|---|---|---|---|---|
| AIC README example | 300ms | 10ms | 4000 | 500 | 500 | 1.64x (64% improvement) |
| Our test (600ms) | 600ms | 16.67ms | 4000 | 500 | 0 | 1.08x (8% improvement) |
| Our test (1200ms) | 1200ms | 16.67ms | 4000 | 500 | 0 | 1.07x (7% improvement) |
| Guide claims | 600ms / 1200ms | - | 4000 | 500 | 0 | 2.48x / 2.02x (148% / 102%) |
Observations:
- Stricter SLAs (lower TTFT/TPOT) appear to favor disaggregated serving more; a sweep sketch to probe this follows below
- Prefix caching (500 tokens) may also contribute to the higher gains
- Even the 1.64x from AIC's own example still falls well short of the guide's 2-2.5x claims
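A minimal sketch for probing that sensitivity with AIC itself, reproducing the three scenarios from the table above at a fixed 8 GPUs (every flag appears elsewhere in this report; whether `--prefix 0` is accepted as "no prefix caching" is an assumption):

```bash
# Run AIC across the three SLA scenarios from the comparison table,
# holding total GPUs at 8 (the README example used 32), then compare
# the reported agg vs. disagg tokens/s/gpu for each run.
# Columns: TTFT(ms) TPOT(ms) prefix(tokens)
while read -r ttft tpot prefix; do
  aiconfigurator cli default \
    --model QWEN3_32B --total_gpus 8 --system h200_sxm \
    --isl 4000 --osl 500 \
    --ttft "$ttft" --tpot "$tpot" --prefix "$prefix" \
    --save_dir "./aic-sweep-ttft${ttft}-prefix${prefix}"
  # assumption: --prefix 0 means prefix caching disabled
done <<'EOF'
300 10 500
600 16.67 0
1200 16.67 0
EOF
```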
From the AIConfigurator README:
"Results can be overly optimistic in the low-speed, high-throughput region."
This suggests AIC's predictions may have accuracy issues in certain operating regimes.
From aic_based_disagg_perf_tuning.md:
> Based on AIC run and minimum manual fine tuning process:
> - Under TTFT constraint of 600 ms, disagg delivers a 148% tps/gpu perf gain over agg
> - Under TTFT constraint of 1200 ms, disagg delivers a 102% tps/gpu perf gain over agg
When we run AIC with the same parameters (ISL=4000, OSL=500, TTFT=600/1200ms, 8 GPUs on H200):
- 600ms TTFT: 8% improvement (1.08x)
- 1200ms TTFT: 7% improvement (1.07x)
This is not a small measurement error - it's an order of magnitude difference:
- Expected gain per guide: 2-2.5x faster
- AIC prediction: 1.07-1.08x faster
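For the 10-20x framing in the summary, this is simply the ratio of the claimed percentage gain to the predicted one:

```bash
# Ratio of the guide's claimed gain to AIC's predicted gain.
awk 'BEGIN {
  printf "600ms TTFT:  %.1fx\n", 148 / 8
  printf "1200ms TTFT: %.1fx\n", 102 / 7
}'
# -> 600ms TTFT:  18.5x
# -> 1200ms TTFT: 14.6x
```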
Possible explanations for the gap:

1. Manual Tuning Not Captured
   - The guide mentions "manual fine tuning based on AIC suggestions"
   - What tuning was done? Could this explain the gap?
2. Configuration Differences
   - Did the actual tested configs differ from what AIC recommends?
   - The existing recipe uses different replica counts than AIC suggests
3. Workload Characteristics
   - Real traffic patterns vs. synthetic benchmarks
   - Request arrival patterns and queuing behavior
   - Prefix caching effects (though the guide says it is disabled)
4. Measurement Methodology
   - How was "tps/gpu" calculated in the actual tests?
   - What concurrency level was used?
   - What metric aggregation (mean, median, P99)?
5. AIC Model Accuracy
   - Is AIC's performance database calibrated correctly for H200?
   - Does AIC capture all online serving overheads?
   - The guide notes that "AIC can handle TTFT from engine execution, but not other online serving overheads"
6. Version/Configuration Drift (a version check sketch follows this list)
   - AIC uses the TensorRT-LLM 1.0.0rc3 performance database
   - What backend version was actually tested?
   - Are all quantization settings identical?
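As a quick local check of what is installed in a given test environment (a sketch; the pip package names are an assumption based on the standard pip installs of both projects):

```bash
# Report the installed versions of the configurator and the backend
# (assumes both were installed via pip in this environment).
pip show aiconfigurator | grep -i '^version'
pip show tensorrt_llm | grep -i '^version'
```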
To resolve this discrepancy, we need the following information:

Deployment details:
- What were the exact deployment configurations that achieved the 102-148% gains?
- What manual tuning was performed after the AIC suggestions?
- Can you share the complete deployment YAMLs used in testing?

Benchmark methodology:
- What AIPerf command was used (concurrency, request count, etc.)?
- How was "tps/gpu" calculated from the benchmark results?
- Which metrics were captured (mean, median, P99)?
- Can you share the raw benchmark results (AIPerf artifact files)?

Environment:
- What TensorRT-LLM version was used?
- What container image tags were deployed?
- Were there any special cluster configurations or optimizations?
We propose to:

1. Validate with Real Benchmarks (a command sketch follows this list)
   - Deploy both AIC-generated configs on the H200 cluster
   - Run an identical benchmark workload
   - Compare actual results vs. AIC predictions vs. guide claims
2. Test Existing Recipes
   - Deploy the recipes referenced in the guide (recipes/qwen3-32b-fp8/)
   - Benchmark them with the same methodology
   - See if we can reproduce the 102-148% gains
3. Collaborate with the AIC Team
   - Share findings and get clarification
   - Understand whether there are known limitations in AIC's predictions
   - Improve documentation and expectations for future users
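A minimal sketch for step 1 (the aiconfigurator invocation matches the appendix; the deploy step is a placeholder, since the actual command depends on the cluster tooling):

```bash
# Regenerate the AIC-recommended configs for both SLA scenarios.
for ttft in 600 1200; do
  aiconfigurator cli default \
    --model QWEN3_32B --total_gpus 8 --system h200_sxm \
    --isl 4000 --osl 500 --ttft "$ttft" --tpot 16.67 \
    --save_dir "./validate-ttft${ttft}"
done

# Placeholder: deploy the generated artifacts, then benchmark with
# AIPerf using the same ISL/OSL and concurrency in both runs.
# kubectl apply -f ./validate-ttft600/<generated-deployment>.yaml
```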
Test Date: December 1, 2025
AIC Version: 0.4.0 (latest release from Nov 24, 2025)
Cluster: Nebius H200 (16 nodes × 8 H200 GPUs)
Document Tested: aic_based_disagg_perf_tuning.md from PR #4655
Document Status: Under review, not yet merged (as of Dec 1, 2025)
AIC Commands Used:
For the 600 ms TTFT scenario:

```bash
aiconfigurator cli default \
  --model QWEN3_32B \
  --total_gpus 8 \
  --system h200_sxm \
  --isl 4000 \
  --osl 500 \
  --ttft 600 \
  --tpot 16.67 \
  --save_dir ./aic-configs-ttft600
```

For the 1200 ms TTFT scenario:

```bash
aiconfigurator cli default \
  --model QWEN3_32B \
  --total_gpus 8 \
  --system h200_sxm \
  --isl 4000 \
  --osl 500 \
  --ttft 1200 \
  --tpot 16.67 \
  --save_dir ./aic-configs-ttft1200
```

Note on TPOT calculation: TPOT = 1000 ms / TPS_user_target = 1000/60 ≈ 16.67 ms per token.
Full AIC results for the 600 ms TTFT scenario:

Aggregated Top 3:
| Rank | tokens/s/gpu | Config | TTFT | User Throughput |
|---|---|---|---|---|
| 1 | 686.87 | 4×TP2 | 512.55ms | 60.51 tokens/s/user |
| 2 | 635.59 | 2×TP4 | 588.09ms | 68.72 tokens/s/user |
| 3 | 487.56 | 1×TP8 | 488.37ms | 64.02 tokens/s/user |
Disaggregated Top 3:
| Rank | tokens/s/gpu | Prefill Config | Decode Config | TTFT | User Throughput |
|---|---|---|---|---|---|
| 1 | 739.12 | 2×TP1 (bs=1) | 1×TP2 (bs=56) | 547.98ms | 83.58 tokens/s/user |
| 2 | 739.12 | 4×TP1 (bs=1) | 1×TP4 (bs=60) | 547.98ms | 111.32 tokens/s/user |
| 3 | 646.07 | 1×TP1 (bs=1) | 1×TP1 (bs=22) | 547.98ms | 63.71 tokens/s/user |
Full AIC results for the 1200 ms TTFT scenario:

Aggregated Top 3:
| Rank | tokens/s/gpu | Config | TTFT | User Throughput |
|---|---|---|---|---|
| 1 | 689.39 | 2×TP4 | 630.54ms | 61.62 tokens/s/user |
| 2 | 686.87 | 4×TP2 | 512.55ms | 60.51 tokens/s/user |
| 3 | 622.10 | 8×TP1 | 1198.37ms | 62.49 tokens/s/user |
Disaggregated Top 3:
| Rank | tokens/s/gpu | Prefill Config | Decode Config | TTFT | User Throughput |
|---|---|---|---|---|---|
| 1 | 739.12 | 2×TP1 (bs=1) | 1×TP2 (bs=56) | 547.98ms | 83.58 tokens/s/user |
| 2 | 739.12 | 4×TP1 (bs=1) | 1×TP4 (bs=60) | 547.98ms | 111.32 tokens/s/user |
| 3 | 646.07 | 1×TP1 (bs=1) | 1×TP1 (bs=22) | 547.98ms | 63.71 tokens/s/user |
For questions or collaboration on validating these findings:
- GitHub: [your-github]
- Team: Dynamo Product Management
We welcome feedback from the AIC/SA team to help understand and resolve this discrepancy.