Performance benchmark of the aws/anthropic/bedrock-claude-sonnet-4-5-v1 model hosted on NVIDIA's inference API, tested with large context windows (8K input tokens, 1K output tokens).
Benchmark Date: November 3, 2025
Tool: AIPerf v0.2.0
Test Configuration: 100 requests with streaming enabled
```
# Install AIPerf
pip install aiperf

# Or use a virtual environment
python3 -m venv venv
source venv/bin/activate
pip install aiperf
```

Benchmark command:

```
aiperf profile \
--model "aws/anthropic/bedrock-claude-sonnet-4-5-v1" \
--url https://inference-api.nvidia.com \
--endpoint-type chat \
--streaming \
--tokenizer gpt2 \
--synthetic-input-tokens-mean 8000 \
--synthetic-input-tokens-stddev 0 \
--output-tokens-mean 1000 \
--output-tokens-stddev 0 \
--request-count 100 \
--api-key YOUR_NVIDIA_API_KEY_HERE \
--ui-type simple
```

- Model: aws/anthropic/bedrock-claude-sonnet-4-5-v1
- Endpoint Type: Chat completions (OpenAI-compatible)
- Streaming: Enabled
- Input Sequence Length (ISL): 8,000 tokens (mean, stddev=0)
- Output Sequence Length (OSL): 1,000 tokens (mean, stddev=0)
- Request Count: 100 requests
- Concurrency: 1 (default)
- Tokenizer: gpt2 (for token counting)
| Metric | Average | Min | Max | P99 | P90 | P50 | Std Dev |
|---|---|---|---|---|---|---|---|
| Time to First Token (ms) | 3,105.18 | 53.20 | 12,275.68 | 11,285.55 | 3,933.44 | 2,993.15 | 1,702.43 |
| Time to Second Token (ms) | 1.41 | 0.07 | 9.28 | 7.11 | 1.76 | 1.25 | 1.15 |
| Request Latency (ms) | 16,902.57 | 53.20 | 36,493.88 | 29,443.70 | 27,396.16 | 16,143.62 | 8,145.46 |
| Inter Token Latency (ms) | 21.60 | 0.00 | 34.24 | 30.42 | 26.74 | 23.53 | 7.53 |
| Output Token Throughput Per User (tokens/sec) | 43.88 | 29.21 | 139.86 | 125.31 | 47.17 | 42.11 | 13.87 |
| Output Sequence Length (tokens) | 620.65 | 176.00 | 1,051.00 | 1,019.00 | 975.10 | 542.00 | 256.21 |
| Input Sequence Length (tokens) | 8,000.00 | 8,000.00 | 8,000.00 | 8,000.00 | 8,000.00 | 8,000.00 | 0.00 |
- Output Token Throughput: 36.68 tokens/sec (aggregate)
- Request Throughput: 0.06 requests/sec
- Total Requests: 100
- Total Benchmark Duration: 1,692.14 seconds (≈28.2 minutes)
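As a quick consistency check, the aggregate figures follow from the per-request averages in the table above; a minimal sketch using only numbers reported here:

```python
# Cross-check the aggregate throughput figures reported above.
requests = 100
mean_output_tokens = 620.65   # average Output Sequence Length from the table
duration_s = 1692.14          # total benchmark duration

total_output_tokens = requests * mean_output_tokens
print(f"{total_output_tokens / duration_s:.2f} tokens/sec")  # ~36.68 aggregate
print(f"{requests / duration_s:.2f} requests/sec")           # ~0.06
```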
**Time to First Token (TTFT)**
- Average: ~3.1 seconds
- High variability (max 12.3s) suggests variable prefill time
- P90 at 3.9s indicates consistent performance for most requests
- The large context (8K tokens) contributes to prefill latency
**Streaming Performance**
- Time to Second Token: 1.41ms average (excellent)
- Inter Token Latency: 21.6ms average (~46 tokens/sec decode speed)
- Consistent streaming after initial token generation
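The decode-speed figure in parentheses is simply the reciprocal of the average inter-token latency:

```python
# Steady-state decode speed implied by the average inter-token latency.
itl_ms = 21.60
print(f"{1000 / itl_ms:.1f} tokens/sec")  # ~46.3, matching the ~46 above
```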
**Output Length Variance**
- Requested: 1,000 tokens (mean)
- Actual average: 620.65 tokens
- Note: Without `min_tokens` or `ignore_eos` parameters, the model stops at natural completion points (see the recommendations below)
**End-to-End Latency**
- Average request: ~16.9 seconds
- Max request: ~36.5 seconds
- P50: ~16.1 seconds (median experience)
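To first order, end-to-end latency decomposes into prefill (TTFT) plus decode time. A rough check with the averages above; note that averages of per-request ratios do not compose exactly, so a small gap is expected:

```python
# Approximate average request latency as TTFT + decode time.
ttft_s = 3.105        # average time to first token
itl_s = 0.0216        # average inter-token latency
mean_osl = 620.65     # average output sequence length

approx = ttft_s + (mean_osl - 1) * itl_s
print(f"{approx:.1f} s")  # ~16.5 s, close to the measured 16.9 s average
```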
**Per-User Throughput**
- Average: 43.88 tokens/sec per user
- This figure includes TTFT overhead, so it trails the ~46 tokens/sec steady-state decode speed
To achieve the target 1,000-token outputs:

```
# Add these parameters to force minimum output length:
--extra-inputs min_tokens:1000 \
--extra-inputs ignore_eos:true
```

For higher-throughput testing:

```
# Increase concurrency to test parallel load:
--concurrency 10
# Or use request-rate mode:
--request-rate 5
```

For production SLO validation:

```
# Set goodput thresholds:
--goodput request_latency:20000 \
--goodput time_to_first_token:5000 \
--goodput inter_token_latency:50
```

Goodput counts only the requests that meet every threshold at once (here: request latency ≤ 20,000 ms, TTFT ≤ 5,000 ms, inter-token latency ≤ 50 ms).

- Python Version: 3.13.7
- AIPerf Version: 0.2.0
- OS: macOS (darwin 25.0.0)
- Test Date: November 3, 2025 at 12:33 PM UTC
The benchmark generates three output files:
- CSV Export: `profile_export_aiperf.csv` (raw per-request data)
- JSON Export: `profile_export_aiperf.json` (structured metrics)
- Log File: `aiperf.log` (detailed execution logs)

Location: artifacts/aws_anthropic_bedrock-claude-sonnet-4-5-v1-openai-chat-concurrency1/
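For deeper analysis, the per-request CSV can be loaded directly. A minimal sketch; the exported column names vary by AIPerf version, so inspect the header rather than assuming it:

```python
import pandas as pd

# Load the raw per-request export from the artifacts directory above.
df = pd.read_csv("profile_export_aiperf.csv")

# Inspect the actual column names before computing custom percentiles.
print(df.columns.tolist())
print(df.describe())
```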
- The model uses the OpenAI-compatible chat completions API (see the sketch after this list)
- Tokenization handled by gpt2 tokenizer (approximate for Claude)
- All 100 requests completed successfully with no errors
- Single-worker, sequential execution (concurrency=1)
- Streaming mode enables token-by-token delivery
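For reference, a hypothetical sketch of the kind of call this benchmark exercises (not AIPerf's internal implementation; the `/v1` path suffix and the `openai` client usage are assumptions):

```python
from openai import OpenAI

# Assumption: the endpoint exposes OpenAI-compatible chat completions under /v1.
client = OpenAI(
    base_url="https://inference-api.nvidia.com/v1",
    api_key="YOUR_NVIDIA_API_KEY_HERE",
)

# Streaming delivers tokens incrementally, which is what TTFT/ITL measure.
stream = client.chat.completions.create(
    model="aws/anthropic/bedrock-claude-sonnet-4-5-v1",
    messages=[{"role": "user", "content": "Summarize this document."}],
    max_tokens=1000,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```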
AIPerf is a comprehensive benchmarking tool for generative AI models. Learn more:
- GitHub: https://github.com/ai-dynamo/aiperf
- Documentation: https://github.com/ai-dynamo/aiperf/tree/main/docs
- PyPI: https://pypi.org/project/aiperf/
Generated with AIPerf v0.2.0