AIPerf Benchmark: Claude Sonnet 4.5 via NVIDIA API

Overview

Performance benchmark of the aws/anthropic/bedrock-claude-sonnet-4-5-v1 model hosted on NVIDIA's inference API, exercised with long prompts (8K input tokens) and 1K-token output targets.

Benchmark Date: November 3, 2025
Tool: AIPerf v0.2.0
Test Configuration: 100 requests with streaming enabled


Reproduction Instructions

Prerequisites

# Install AIPerf
pip install aiperf

# Or use a virtual environment
python3 -m venv venv
source venv/bin/activate
pip install aiperf
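
To confirm the install before running anything, a quick check of the package metadata is enough (this uses standard pip; nothing AIPerf-specific is assumed here):

# Verify that AIPerf 0.2.0 (or newer) is installed
pip show aiperf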

Run the Benchmark

aiperf profile \
  --model "aws/anthropic/bedrock-claude-sonnet-4-5-v1" \
  --url https://inference-api.nvidia.com \
  --endpoint-type chat \
  --streaming \
  --tokenizer gpt2 \
  --synthetic-input-tokens-mean 8000 \
  --synthetic-input-tokens-stddev 0 \
  --output-tokens-mean 1000 \
  --output-tokens-stddev 0 \
  --request-count 100 \
  --api-key YOUR_NVIDIA_API_KEY_HERE \
  --ui-type simple
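
To keep the key out of shell history, it can help to export it once and reference it from the command above (NVIDIA_API_KEY is just a shell variable name chosen here, not something AIPerf itself reads):

# Export the key once, then pass --api-key "$NVIDIA_API_KEY" in the command above
export NVIDIA_API_KEY=YOUR_NVIDIA_API_KEY_HERE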

Configuration Details

  • Model: aws/anthropic/bedrock-claude-sonnet-4-5-v1
  • Endpoint Type: Chat completions (OpenAI-compatible; see the request sketch after this list)
  • Streaming: Enabled
  • Input Sequence Length (ISL): 8,000 tokens (mean, stddev=0)
  • Output Sequence Length (OSL): 1,000 tokens (mean, stddev=0)
  • Request Count: 100 requests
  • Concurrency: 1 (default)
  • Tokenizer: gpt2 (for token counting)

Performance Results

Key Metrics Summary

| Metric | Average | Min | Max | P99 | P90 | P50 | Std Dev |
|---|---|---|---|---|---|---|---|
| Time to First Token (ms) | 3,105.18 | 53.20 | 12,275.68 | 11,285.55 | 3,933.44 | 2,993.15 | 1,702.43 |
| Time to Second Token (ms) | 1.41 | 0.07 | 9.28 | 7.11 | 1.76 | 1.25 | 1.15 |
| Request Latency (ms) | 16,902.57 | 53.20 | 36,493.88 | 29,443.70 | 27,396.16 | 16,143.62 | 8,145.46 |
| Inter Token Latency (ms) | 21.60 | 0.00 | 34.24 | 30.42 | 26.74 | 23.53 | 7.53 |
| Output Token Throughput Per User (tokens/sec) | 43.88 | 29.21 | 139.86 | 125.31 | 47.17 | 42.11 | 13.87 |
| Output Sequence Length (tokens) | 620.65 | 176.00 | 1,051.00 | 1,019.00 | 975.10 | 542.00 | 256.21 |
| Input Sequence Length (tokens) | 8,000.00 | 8,000.00 | 8,000.00 | 8,000.00 | 8,000.00 | 8,000.00 | 0.00 |

Throughput Metrics

  • Output Token Throughput: 36.68 tokens/sec (aggregate)
  • Request Throughput: 0.06 requests/sec
  • Total Requests: 100
  • Total Benchmark Duration: 1,692.14 seconds (≈28.2 minutes)
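
As a sanity check, the aggregate figure follows directly from the table above: 620.65 average output tokens × 100 requests ÷ 1,692.14 seconds ≈ 36.7 tokens/sec, matching the reported 36.68 tokens/sec.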

Analysis & Key Findings

🔍 Performance Characteristics

  1. Time to First Token (TTFT)

    • Average: ~3.1 seconds
    • High variability (max 12.3 s, std dev ~1.7 s) suggests variable prefill and queueing time
    • P90 at 3.9 s shows most requests cluster near the average, with a long tail of slow outliers
    • The large context (8K tokens) contributes to prefill latency
  2. Streaming Performance

    • Time to Second Token: 1.41ms average (excellent)
    • Inter Token Latency: 21.6 ms average (~46 tokens/sec decode speed; see the arithmetic after this list)
    • Consistent streaming after initial token generation
  3. Output Length Variance

    • Requested: 1,000 tokens (mean)
    • Actual average: 620.65 tokens
    • Note: Without min_tokens or ignore_eos parameters, the model stops at natural completion points
  4. End-to-End Latency

    • Average request: ~16.9 seconds
    • Max request: ~36.5 seconds
    • P50: ~16.1 seconds (median experience)
  5. Per-User Throughput

    • Average: 43.88 tokens/sec per user
    • This includes TTFT overhead
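
The decode speed quoted in item 2 is simple arithmetic on the inter-token latency: 1,000 ms ÷ 21.60 ms per token ≈ 46 tokens/sec once streaming is underway. The per-user throughput of 43.88 tokens/sec comes out lower because, as noted in item 5, it also absorbs the ~3.1 s spent waiting for the first token.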

💡 Recommendations

To achieve target 1,000 token outputs:

# Add these parameters to force minimum output length:
--extra-inputs min_tokens:1000 \
--extra-inputs ignore_eos:true

For higher throughput testing:

# Increase concurrency to test parallel load:
--concurrency 10

# Or use request-rate mode:
--request-rate 5

For production SLO validation:

# Set goodput thresholds:
--goodput request_latency:20000 \
--goodput time_to_first_token:5000 \
--goodput inter_token_latency:50
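
Taken together, the goodput thresholds simply bolt onto the original invocation. A combined run might look like the sketch below (same configuration as above, with the API key read from the NVIDIA_API_KEY variable suggested earlier):

# Original benchmark plus SLO thresholds (sketch; values are the ones used in this document)
aiperf profile \
  --model "aws/anthropic/bedrock-claude-sonnet-4-5-v1" \
  --url https://inference-api.nvidia.com \
  --endpoint-type chat \
  --streaming \
  --tokenizer gpt2 \
  --synthetic-input-tokens-mean 8000 \
  --synthetic-input-tokens-stddev 0 \
  --output-tokens-mean 1000 \
  --output-tokens-stddev 0 \
  --request-count 100 \
  --api-key "$NVIDIA_API_KEY" \
  --ui-type simple \
  --goodput request_latency:20000 \
  --goodput time_to_first_token:5000 \
  --goodput inter_token_latency:50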

Environment Details

  • Python Version: 3.13.7
  • AIPerf Version: 0.2.0
  • OS: macOS (darwin 25.0.0)
  • Test Date: November 3, 2025 at 12:33 PM UTC

Raw Output Artifacts

The benchmark generates three output files:

  1. CSV Export: profile_export_aiperf.csv - Raw per-request data
  2. JSON Export: profile_export_aiperf.json - Structured metrics
  3. Log File: aiperf.log - Detailed execution logs

Location: artifacts/aws_anthropic_bedrock-claude-sonnet-4-5-v1-openai-chat-concurrency1/
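
For a quick look at the structured metrics without writing any parsing code, the JSON export can be pretty-printed from the shell (a minimal sketch; it makes no assumptions about the export's schema, and python3 -m json.tool is part of the standard library):

# Pretty-print the first part of the JSON export
python3 -m json.tool artifacts/aws_anthropic_bedrock-claude-sonnet-4-5-v1-openai-chat-concurrency1/profile_export_aiperf.json | head -50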


Additional Notes

  • The model is served through an OpenAI-compatible chat completions API
  • Token counting uses the gpt2 tokenizer, which only approximates Claude's own tokenizer, so ISL/OSL figures are approximate
  • All 100 requests completed successfully with no errors
  • Single-worker, sequential execution (concurrency=1)
  • Streaming mode enables token-by-token delivery

About AIPerf

AIPerf is a benchmarking tool for generative AI model inference endpoints; it produced all of the metrics reported in this document.


Generated with AIPerf v0.2.0
