We tested AIConfigurator (version 0.4.0) against the performance claims in the "Advanced Disagg Perf Tuning" guide and found a significant discrepancy between AIC's predictions and the guide's reported results.
Key Finding: AIC predicts disaggregated serving provides 7-8% improvement, while the guide reports 102-148% improvement - a 10-20x difference in expected gains.
Source Document: The guide being tested is from PR #4655 by davilu-nvidia (submitted Nov 27, 2025, currently under review and not yet merged).
Model: Qwen3-32B-FP8
Hardware: H200 SXM
Backend: TensorRT-LLM
Parameters:
- Input Sequence Length (ISL): 4000 tokens
- Output Sequence Length (OSL): 500 tokens
- TPS/user target: ≥60 tokens/s/user
- Total GPUs: 8 (for fair comparison)
600 ms TTFT scenario:

| Metric | Guide Claims | AIC Predicts | Difference |
|---|---|---|---|
| Improvement | 148% (2.48x) | 8% (1.08x) | 140 percentage points |
| Best Agg | Not specified | 686.87 tokens/s/gpu | - |
| Best Disagg | Not specified | 739.12 tokens/s/gpu | - |
1200 ms TTFT scenario:

| Metric | Guide Claims | AIC Predicts | Difference |
|---|---|---|---|
| Improvement | 102% (2.02x) | 7% (1.07x) | 95 percentage points |
| Best Agg | Not specified | 689.39 tokens/s/gpu | - |
| Best Disagg | Not specified | 739.12 tokens/s/gpu | - |
Best aggregated configuration (600 ms TTFT scenario):
- Throughput: 686.87 tokens/s/gpu
- Configuration: 4 replicas × TP2 = 8 GPUs
- Batch size: 24
- TTFT: 512.55 ms ✓
- TPOT: 16.53 ms ✓
- User throughput: 60.51 tokens/s/user ✓

Best disaggregated configuration (identical in both scenarios):
- Throughput: 739.12 tokens/s/gpu
- Configuration: 2 replicas × (2P×TP1 + 1D×TP2) = 8 GPUs
  - Prefill: 2 workers × TP1 = 2 GPUs (batch_size=1)
  - Decode: 1 worker × TP2 = 2 GPUs (batch_size=56)
- TTFT: 547.98 ms ✓
- TPOT: 11.96 ms ✓
- User throughput: 83.58 tokens/s/user ✓
Improvement (600 ms scenario): 739.12 / 686.87 = 1.076x ≈ 7.6%
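The 1200 ms scenario yields a similar ratio against its own best aggregated config (689.39 tokens/s/gpu, from the full results in the appendix). A quick recomputation of both gains from the numbers above:

```bash
# Recompute the predicted disagg-over-agg gains from the
# per-GPU throughputs reported by AIC.
awk 'BEGIN {
  printf "600ms TTFT:  %.1f%% gain\n", (739.12 / 686.87 - 1) * 100
  printf "1200ms TTFT: %.1f%% gain\n", (739.12 / 689.39 - 1) * 100
}'
# -> 600ms TTFT:  7.6% gain
# -> 1200ms TTFT: 7.2% gain
```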
The AIConfigurator README includes an example with stricter SLA targets than what we tested:
```bash
aiconfigurator cli default --model QWEN3_32B --total_gpus 32 --system h200_sxm \
  --ttft 300 --tpot 10 --isl 4000 --osl 500 --prefix 500
```

This example claims 1.64x higher tokens/s/gpu for disaggregated vs. aggregated serving.
Comparison of test scenarios:
| Source | TTFT | TPOT | ISL | OSL | Prefix | Result |
|---|---|---|---|---|---|---|
| AIC README example | 300ms | 10ms | 4000 | 500 | 500 | 1.64x (64% improvement) |
| Our test (600ms) | 600ms | 16.67ms | 4000 | 500 | 0 | 1.08x (8% improvement) |
| Our test (1200ms) | 1200ms | 16.67ms | 4000 | 500 | 0 | 1.07x (7% improvement) |
| Guide claims | 600ms / 1200ms | - | 4000 | 500 | 0 | 2.48x / 2.02x (148% / 102%) |
Observations:
- Stricter SLAs (lower TTFT/TPOT) appear to favor disaggregated serving more; a sweep sketch to probe this follows below
- Prefix caching (500 tokens) may also contribute to the higher gains
- Even the 1.64x from AIC's own example still falls well short of the guide's 2-2.5x claims
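A minimal sketch for probing that sensitivity with AIC itself, reproducing the three scenarios from the table above at a fixed 8 GPUs (every flag appears elsewhere in this report; whether `--prefix 0` is accepted as "no prefix caching" is an assumption):

```bash
# Run AIC across the three SLA scenarios from the comparison table,
# holding total GPUs at 8 (the README example used 32), then compare
# the reported agg vs. disagg tokens/s/gpu for each run.
# Columns: TTFT(ms) TPOT(ms) prefix(tokens)
while read -r ttft tpot prefix; do
  aiconfigurator cli default \
    --model QWEN3_32B --total_gpus 8 --system h200_sxm \
    --isl 4000 --osl 500 \
    --ttft "$ttft" --tpot "$tpot" --prefix "$prefix" \
    --save_dir "./aic-sweep-ttft${ttft}-prefix${prefix}"
  # assumption: --prefix 0 means prefix caching disabled
done <<'EOF'
300 10 500
600 16.67 0
1200 16.67 0
EOF
```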
From the AIConfigurator README:
"Results can be overly optimistic in the low-speed, high-throughput region."
This suggests AIC's predictions may have accuracy issues in certain operating regimes.
From aic_based_disagg_perf_tuning.md:
> Based on AIC run and minimum manual fine tuning process:
> - Under TTFT constraint of 600 ms, disagg delivers a 148% tps/gpu perf gain over agg
> - Under TTFT constraint of 1200 ms, disagg delivers a 102% tps/gpu perf gain over agg
When we run AIC with the same parameters (ISL=4000, OSL=500, TTFT=600/1200ms, 8 GPUs on H200):
- 600ms TTFT: 8% improvement (1.08x)
- 1200ms TTFT: 7% improvement (1.07x)
This is not a small measurement error - it's an order of magnitude difference:
- Expected gain per guide: 2-2.5x faster
- AIC prediction: 1.07-1.08x faster
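For the 10-20x framing in the summary, this is simply the ratio of the claimed percentage gain to the predicted one:

```bash
# Ratio of the guide's claimed gain to AIC's predicted gain.
awk 'BEGIN {
  printf "600ms TTFT:  %.1fx\n", 148 / 8
  printf "1200ms TTFT: %.1fx\n", 102 / 7
}'
# -> 600ms TTFT:  18.5x
# -> 1200ms TTFT: 14.6x
```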
Possible explanations for the gap:

1. Manual Tuning Not Captured
   - The guide mentions "manual fine tuning based on AIC suggestions"
   - What tuning was done? Could this explain the gap?
2. Configuration Differences
   - Did the actual tested configs differ from what AIC recommends?
   - The existing recipe uses different replica counts than AIC suggests
3. Workload Characteristics
   - Real traffic patterns vs. synthetic benchmarks
   - Request arrival patterns and queuing behavior
   - Prefix caching effects (though the guide says it is disabled)
4. Measurement Methodology
   - How was "tps/gpu" calculated in the actual tests?
   - What concurrency level was used?
   - What metric aggregation (mean, median, P99)?
5. AIC Model Accuracy
   - Is AIC's performance database calibrated correctly for H200?
   - Does AIC capture all online serving overheads?
   - The guide notes that "AIC can handle TTFT from engine execution, but not other online serving overheads"
6. Version/Configuration Drift (a version check sketch follows this list)
   - AIC uses the TensorRT-LLM 1.0.0rc3 performance database
   - What backend version was actually tested?
   - Are all quantization settings identical?
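As a quick local check of what is installed in a given test environment (a sketch; the pip package names are an assumption based on the standard pip installs of both projects):

```bash
# Report the installed versions of the configurator and the backend
# (assumes both were installed via pip in this environment).
pip show aiconfigurator | grep -i '^version'
pip show tensorrt_llm | grep -i '^version'
```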
To resolve this discrepancy, we need the following information:

Deployment details:
- What were the exact deployment configurations that achieved the 102-148% gains?
- What manual tuning was performed after the AIC suggestions?
- Can you share the complete deployment YAMLs used in testing?

Benchmark methodology:
- What AIPerf command was used (concurrency, request count, etc.)?
- How was "tps/gpu" calculated from the benchmark results?
- Which metrics were captured (mean, median, P99)?
- Can you share the raw benchmark results (AIPerf artifact files)?

Environment:
- What TensorRT-LLM version was used?
- What container image tags were deployed?
- Were there any special cluster configurations or optimizations?
We propose to:

1. Validate with Real Benchmarks (a command sketch follows this list)
   - Deploy both AIC-generated configs on the H200 cluster
   - Run an identical benchmark workload
   - Compare actual results vs. AIC predictions vs. guide claims
2. Test Existing Recipes
   - Deploy the recipes referenced in the guide (recipes/qwen3-32b-fp8/)
   - Benchmark them with the same methodology
   - See if we can reproduce the 102-148% gains
3. Collaborate with the AIC Team
   - Share findings and get clarification
   - Understand whether there are known limitations in AIC's predictions
   - Improve documentation and expectations for future users
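A minimal sketch for step 1 (the aiconfigurator invocation matches the appendix; the deploy step is a placeholder, since the actual command depends on the cluster tooling):

```bash
# Regenerate the AIC-recommended configs for both SLA scenarios.
for ttft in 600 1200; do
  aiconfigurator cli default \
    --model QWEN3_32B --total_gpus 8 --system h200_sxm \
    --isl 4000 --osl 500 --ttft "$ttft" --tpot 16.67 \
    --save_dir "./validate-ttft${ttft}"
done

# Placeholder: deploy the generated artifacts, then benchmark with
# AIPerf using the same ISL/OSL and concurrency in both runs.
# kubectl apply -f ./validate-ttft600/<generated-deployment>.yaml
```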
Test Date: December 1, 2025
AIC Version: 0.4.0 (latest release from Nov 24, 2025)
Cluster: Nebius H200 (16 nodes × 8 H200 GPUs)
Document Tested: aic_based_disagg_perf_tuning.md from PR #4655
Document Status: Under review, not yet merged (as of Dec 1, 2025)
AIC Commands Used:
For the 600 ms TTFT scenario:

```bash
aiconfigurator cli default \
  --model QWEN3_32B \
  --total_gpus 8 \
  --system h200_sxm \
  --isl 4000 \
  --osl 500 \
  --ttft 600 \
  --tpot 16.67 \
  --save_dir ./aic-configs-ttft600
```

For the 1200 ms TTFT scenario:

```bash
aiconfigurator cli default \
  --model QWEN3_32B \
  --total_gpus 8 \
  --system h200_sxm \
  --isl 4000 \
  --osl 500 \
  --ttft 1200 \
  --tpot 16.67 \
  --save_dir ./aic-configs-ttft1200
```

Note on TPOT calculation: TPOT = 1000 ms / TPS_user_target = 1000/60 ≈ 16.67 ms per token.
Full AIC results for the 600 ms TTFT scenario:

Aggregated Top 3:
| Rank | tokens/s/gpu | Config | TTFT | User Throughput |
|---|---|---|---|---|
| 1 | 686.87 | 4×TP2 | 512.55ms | 60.51 tokens/s/user |
| 2 | 635.59 | 2×TP4 | 588.09ms | 68.72 tokens/s/user |
| 3 | 487.56 | 1×TP8 | 488.37ms | 64.02 tokens/s/user |
Disaggregated Top 3:
| Rank | tokens/s/gpu | Prefill Config | Decode Config | TTFT | User Throughput |
|---|---|---|---|---|---|
| 1 | 739.12 | 2×TP1 (bs=1) | 1×TP2 (bs=56) | 547.98ms | 83.58 tokens/s/user |
| 2 | 739.12 | 4×TP1 (bs=1) | 1×TP4 (bs=60) | 547.98ms | 111.32 tokens/s/user |
| 3 | 646.07 | 1×TP1 (bs=1) | 1×TP1 (bs=22) | 547.98ms | 63.71 tokens/s/user |
Full AIC results for the 1200 ms TTFT scenario:

Aggregated Top 3:
| Rank | tokens/s/gpu | Config | TTFT | User Throughput |
|---|---|---|---|---|
| 1 | 689.39 | 2×TP4 | 630.54ms | 61.62 tokens/s/user |
| 2 | 686.87 | 4×TP2 | 512.55ms | 60.51 tokens/s/user |
| 3 | 622.10 | 8×TP1 | 1198.37ms | 62.49 tokens/s/user |
Disaggregated Top 3:
| Rank | tokens/s/gpu | Prefill Config | Decode Config | TTFT | User Throughput |
|---|---|---|---|---|---|
| 1 | 739.12 | 2×TP1 (bs=1) | 1×TP2 (bs=56) | 547.98ms | 83.58 tokens/s/user |
| 2 | 739.12 | 4×TP1 (bs=1) | 1×TP4 (bs=60) | 547.98ms | 111.32 tokens/s/user |
| 3 | 646.07 | 1×TP1 (bs=1) | 1×TP1 (bs=22) | 547.98ms | 63.71 tokens/s/user |
For questions or collaboration on validating these findings:
- GitHub: [your-github]
- Team: Dynamo Product Management
We welcome feedback from the AIC/SA team to help understand and resolve this discrepancy.