@BenHamm
Last active December 1, 2025 23:24
AIConfigurator Prediction Mismatch: 7-8% vs 102-148% Disaggregated Serving Performance Gains

AIConfigurator Performance Prediction Mismatch

Summary

We tested AIConfigurator (version 0.4.0) against the performance claims in the "Advanced Disagg Perf Tuning" guide and found a significant discrepancy between AIC's predictions and the guide's reported results.

Key Finding: AIC predicts disaggregated serving provides a 7-8% improvement, while the guide reports a 102-148% improvement - roughly a 15-18x difference in the size of the predicted gain.

Source Document: The guide being tested is from PR #4655 by davilu-nvidia (submitted Nov 27, 2025, currently under review and not yet merged).


Test Setup

Model: Qwen3-32B-FP8
Hardware: H200 SXM
Backend: TensorRT-LLM
Parameters:

  • Input Sequence Length (ISL): 4000 tokens
  • Output Sequence Length (OSL): 500 tokens
  • TPS/user target: ≥60 tokens/s/user
  • Total GPUs: 8 (for fair comparison)

Results Comparison

Scenario 1: TTFT = 600ms

| Metric      | Guide Claims  | AIC Predicts        | Difference            |
|-------------|---------------|---------------------|-----------------------|
| Improvement | 148% (2.48x)  | 8% (1.08x)          | 140 percentage points |
| Best Agg    | Not specified | 686.87 tokens/s/gpu | -                     |
| Best Disagg | Not specified | 739.12 tokens/s/gpu | -                     |

Scenario 2: TTFT = 1200ms

| Metric      | Guide Claims  | AIC Predicts        | Difference           |
|-------------|---------------|---------------------|----------------------|
| Improvement | 102% (2.02x)  | 7% (1.07x)          | 95 percentage points |
| Best Agg    | Not specified | 689.39 tokens/s/gpu | -                    |
| Best Disagg | Not specified | 739.12 tokens/s/gpu | -                    |

Detailed AIC Results

TTFT = 600ms - Aggregated (Top Config)

Throughput: 686.87 tokens/s/gpu
Configuration: 4 replicas × TP2 = 8 GPUs
Batch Size: 24
TTFT: 512.55ms ✓
TPOT: 16.53ms ✓
User Throughput: 60.51 tokens/s/user ✓

TTFT = 600ms - Disaggregated (Top Config)

Throughput: 739.12 tokens/s/gpu
Configuration: 2 replicas × (2P×TP1 + 1D×TP2) = 8 GPUs
  - Prefill: 2 workers × TP1 = 2 GPUs (batch_size=1)
  - Decode: 1 worker × TP2 = 2 GPUs (batch_size=56)
TTFT: 547.98ms ✓
TPOT: 11.96ms ✓
User Throughput: 83.58 tokens/s/user ✓
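The GPU accounting for this disaggregated layout can be sanity-checked with a few lines of Python (a quick arithmetic check using only the worker counts and TP degrees reported by AIC above):

```python
# Verify that 2 replicas x (2 prefill workers @ TP1 + 1 decode worker @ TP2)
# really adds up to the 8-GPU budget used for the comparison.
replicas = 2
prefill_workers, prefill_tp = 2, 1
decode_workers, decode_tp = 1, 2

gpus_per_replica = prefill_workers * prefill_tp + decode_workers * decode_tp
total_gpus = replicas * gpus_per_replica
print(total_gpus)  # 8
```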

Improvement: 739.12 / 686.87 = 1.076x = 7.6%
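The improvement ratio above can be reproduced directly from the two top-config throughputs (a sanity check on the arithmetic, not an AIC API call):

```python
# Reproduce the disagg-vs-agg improvement ratio from AIC's top configurations.
best_agg = 686.87     # tokens/s/gpu, aggregated: 4 replicas x TP2
best_disagg = 739.12  # tokens/s/gpu, disaggregated: 2 x (2P TP1 + 1D TP2)

ratio = best_disagg / best_agg
gain_pct = (ratio - 1) * 100
print(f"{ratio:.3f}x = {gain_pct:.1f}% improvement")  # 1.076x = 7.6% improvement
```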


Additional Context

AIConfigurator README Example

The AIConfigurator README includes an example with stricter SLA targets than what we tested:

aiconfigurator cli default --model QWEN3_32B --total_gpus 32 --system h200_sxm \
  --ttft 300 --tpot 10 --isl 4000 --osl 500 --prefix 500

This example claims 1.64x higher tokens/s/gpu for disaggregated vs aggregated.

Comparison of test scenarios:

| Source             | TTFT           | TPOT    | ISL  | OSL | Prefix | Result                      |
|--------------------|----------------|---------|------|-----|--------|-----------------------------|
| AIC README example | 300ms          | 10ms    | 4000 | 500 | 500    | 1.64x (64% improvement)     |
| Our test (600ms)   | 600ms          | 16.67ms | 4000 | 500 | 0      | 1.08x (8% improvement)      |
| Our test (1200ms)  | 1200ms         | 16.67ms | 4000 | 500 | 0      | 1.07x (7% improvement)      |
| Guide claims       | 600ms / 1200ms | -       | 4000 | 500 | 0      | 2.48x / 2.02x (148% / 102%) |

Observations:

  • Stricter SLAs (lower TTFT/TPOT) appear to favor disaggregated serving more
  • Prefix caching (500 tokens) may also contribute to higher gains
  • But even the 1.64x from AIC's own example is still far from the guide's 2-2.5x claims

Known Limitations

From the AIConfigurator README:

"Results can be overly optimistic in the low-speed, high-throughput region."

This suggests AIC's predictions may have accuracy issues in certain operating regimes.


Analysis

What the Guide Says

From aic_based_disagg_perf_tuning.md:

Based on AIC run and minimum manual fine tuning process:

  • Under TTFT constraint of 600 ms, disagg delivers a 148% tps/gpu perf gain over agg
  • Under TTFT constraint of 1200 ms, disagg delivers a 102% tps/gpu perf gain over agg

What AIC Actually Predicts

When we run AIC with the same parameters (ISL=4000, OSL=500, TTFT=600/1200ms, 8 GPUs on H200):

  • 600ms TTFT: 8% improvement (1.08x)
  • 1200ms TTFT: 7% improvement (1.07x)

The Gap

This is not a small measurement error - it is an order-of-magnitude difference:

  • Expected gain per guide: 2-2.5x faster
  • AIC prediction: 1.07-1.08x faster
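The size of the gap can be quantified for both scenarios with a short Python sketch (using only the gain percentages stated by the guide and predicted by AIC):

```python
# Compare the guide's claimed gains against AIC's predicted gains,
# both as a percentage-point gap and as a ratio of the two gains.
scenarios = {
    "TTFT=600ms":  {"guide_gain_pct": 148, "aic_gain_pct": 8},
    "TTFT=1200ms": {"guide_gain_pct": 102, "aic_gain_pct": 7},
}

for name, s in scenarios.items():
    pp_gap = s["guide_gain_pct"] - s["aic_gain_pct"]      # percentage points
    gain_ratio = s["guide_gain_pct"] / s["aic_gain_pct"]  # how many times larger
    print(f"{name}: {pp_gap} pp gap, guide's gain is {gain_ratio:.1f}x AIC's")
# TTFT=600ms:  140 pp gap, guide's gain is 18.5x AIC's
# TTFT=1200ms:  95 pp gap, guide's gain is 14.6x AIC's
```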

Possible Explanations

  1. Manual Tuning Not Captured

    • Guide mentions "manual fine tuning based on AIC suggestions"
    • What tuning was done? Could this explain the gap?
  2. Configuration Differences

    • Did the actual tested configs differ from what AIC recommends?
    • Existing recipe uses different replica counts than AIC suggests
  3. Workload Characteristics

    • Real traffic patterns vs synthetic benchmarks
    • Request arrival patterns and queuing behavior
    • Prefix caching effects (though guide says it's disabled)
  4. Measurement Methodology

    • How was "tps/gpu" calculated in the actual tests?
    • What concurrency level was used?
    • What metric aggregation (mean, median, P99)?
  5. AIC Model Accuracy

    • Is AIC's performance database calibrated correctly for H200?
    • Does AIC capture all online serving overheads?
    • Guide mentions "AIC can handle TTFT from engine execution, but not other online serving overheads"
  6. Version/Configuration Drift

    • AIC uses TensorRT-LLM 1.0.0rc3 performance database
    • What backend version was actually tested?
    • Are all quantization settings identical?

Questions for the AIC/SA Team

To resolve this discrepancy, we need:

Configuration Details

  1. What were the exact deployment configurations that achieved 102-148% gains?
  2. What manual tuning was performed after AIC suggestions?
  3. Can you share the complete deployment YAMLs used in testing?

Benchmark Details

  1. What AIPerf command was used? (concurrency, request count, etc.)
  2. How was "tps/gpu" calculated from the benchmark results?
  3. What metrics were captured? (mean? median? P99?)
  4. Can you share the raw benchmark results (AIPerf artifact files)?

Environment Details

  1. What TensorRT-LLM version was used?
  2. What container image tags were deployed?
  3. Were there any special cluster configurations or optimizations?

Next Steps

We propose to:

  1. Validate with Real Benchmarks

    • Deploy both AIC-generated configs on H200 cluster
    • Run identical benchmark workload
    • Compare actual results vs AIC predictions vs guide claims
  2. Test Existing Recipes

    • Deploy the recipes referenced in the guide (recipes/qwen3-32b-fp8/)
    • Benchmark them with the same methodology
    • See if we can reproduce the 102-148% gains
  3. Collaborate with AIC Team

    • Share findings and get clarification
    • Understand if there are known limitations in AIC's predictions
    • Improve documentation/expectations for future users

Environment Information

Test Date: December 1, 2025
AIC Version: 0.4.0 (latest release from Nov 24, 2025)
Cluster: Nebius H200 (16 nodes × 8 H200 GPUs)
Document Tested: aic_based_disagg_perf_tuning.md from PR #4655
Document Status: Under review, not yet merged (as of Dec 1, 2025)

AIC Commands Used:

For 600ms TTFT scenario:

aiconfigurator cli default \
  --model QWEN3_32B \
  --total_gpus 8 \
  --system h200_sxm \
  --isl 4000 \
  --osl 500 \
  --ttft 600 \
  --tpot 16.67 \
  --save_dir ./aic-configs-ttft600

For 1200ms TTFT scenario:

aiconfigurator cli default \
  --model QWEN3_32B \
  --total_gpus 8 \
  --system h200_sxm \
  --isl 4000 \
  --osl 500 \
  --ttft 1200 \
  --tpot 16.67 \
  --save_dir ./aic-configs-ttft1200

Note on TPOT calculation: TPOT = 1000ms / TPS_user_target = 1000/60 ≈ 16.67ms per token
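The conversion in the note above is a one-liner; a minimal check of the arithmetic behind the --tpot argument:

```python
# Derive the --tpot value (ms per output token) from the per-user throughput SLA.
tps_user_target = 60                # tokens/s/user (the guide's SLA)
tpot_ms = 1000 / tps_user_target    # ms per token
print(f"--tpot {tpot_ms:.2f}")      # --tpot 16.67
```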


Appendix: Full AIC Output

TTFT = 600ms Results

Aggregated Top 3:

| Rank | tokens/s/gpu | Config | TTFT     | User Throughput     |
|------|--------------|--------|----------|---------------------|
| 1    | 686.87       | 4×TP2  | 512.55ms | 60.51 tokens/s/user |
| 2    | 635.59       | 2×TP4  | 588.09ms | 68.72 tokens/s/user |
| 3    | 487.56       | 1×TP8  | 488.37ms | 64.02 tokens/s/user |

Disaggregated Top 3:

| Rank | tokens/s/gpu | Prefill Config | Decode Config | TTFT     | User Throughput      |
|------|--------------|----------------|---------------|----------|----------------------|
| 1    | 739.12       | 2×TP1 (bs=1)   | 1×TP2 (bs=56) | 547.98ms | 83.58 tokens/s/user  |
| 2    | 739.12       | 4×TP1 (bs=1)   | 1×TP4 (bs=60) | 547.98ms | 111.32 tokens/s/user |
| 3    | 646.07       | 1×TP1 (bs=1)   | 1×TP1 (bs=22) | 547.98ms | 63.71 tokens/s/user  |

TTFT = 1200ms Results

Aggregated Top 3:

| Rank | tokens/s/gpu | Config | TTFT      | User Throughput     |
|------|--------------|--------|-----------|---------------------|
| 1    | 689.39       | 2×TP4  | 630.54ms  | 61.62 tokens/s/user |
| 2    | 686.87       | 4×TP2  | 512.55ms  | 60.51 tokens/s/user |
| 3    | 622.10       | 8×TP1  | 1198.37ms | 62.49 tokens/s/user |

Disaggregated Top 3:

| Rank | tokens/s/gpu | Prefill Config | Decode Config | TTFT     | User Throughput      |
|------|--------------|----------------|---------------|----------|----------------------|
| 1    | 739.12       | 2×TP1 (bs=1)   | 1×TP2 (bs=56) | 547.98ms | 83.58 tokens/s/user  |
| 2    | 739.12       | 4×TP1 (bs=1)   | 1×TP4 (bs=60) | 547.98ms | 111.32 tokens/s/user |
| 3    | 646.07       | 1×TP1 (bs=1)   | 1×TP1 (bs=22) | 547.98ms | 63.71 tokens/s/user  |

Contact

For questions or collaboration on validating these findings:

  • GitHub: [your-github]
  • Team: Dynamo Product Management

We welcome feedback from the AIC/SA team to help understand and resolve this discrepancy.
