AIPerf Benchmark: Claude Sonnet 4.5 via NVIDIA API

Overview

Performance benchmark of the aws/anthropic/bedrock-claude-sonnet-4-5-v1 model hosted on NVIDIA's inference API, exercised with long prompts (8K input tokens) and 1K-token output targets.

Benchmark Date: November 3, 2025
Tool: AIPerf v0.2.0
Test Configuration: 100 requests with streaming enabled


Reproduction Instructions

Prerequisites

# Install AIPerf
pip install aiperf

# Or use a virtual environment
python3 -m venv venv
source venv/bin/activate
pip install aiperf
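
To confirm the install before running anything, a quick check of the package metadata is enough (this uses standard pip; nothing AIPerf-specific is assumed here):

# Verify that AIPerf 0.2.0 (or newer) is installed
pip show aiperf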

Run the Benchmark

aiperf profile \
  --model "aws/anthropic/bedrock-claude-sonnet-4-5-v1" \
  --url https://inference-api.nvidia.com \
  --endpoint-type chat \
  --streaming \
  --tokenizer gpt2 \
  --synthetic-input-tokens-mean 8000 \
  --synthetic-input-tokens-stddev 0 \
  --output-tokens-mean 1000 \
  --output-tokens-stddev 0 \
  --request-count 100 \
  --api-key YOUR_NVIDIA_API_KEY_HERE \
  --ui-type simple
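
To keep the key out of shell history, it can help to export it once and reference it from the command above (NVIDIA_API_KEY is just a shell variable name chosen here, not something AIPerf itself reads):

# Export the key once, then pass --api-key "$NVIDIA_API_KEY" in the command above
export NVIDIA_API_KEY=YOUR_NVIDIA_API_KEY_HERE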

Configuration Details

  • Model: aws/anthropic/bedrock-claude-sonnet-4-5-v1
  • Endpoint Type: Chat completions (OpenAI-compatible; see the request sketch after this list)
  • Streaming: Enabled
  • Input Sequence Length (ISL): 8,000 tokens (mean, stddev=0)
  • Output Sequence Length (OSL): 1,000 tokens (mean, stddev=0)
  • Request Count: 100 requests
  • Concurrency: 1 (default)
  • Tokenizer: gpt2 (for token counting)

Performance Results

Key Metrics Summary

| Metric | Average | Min | Max | P99 | P90 | P50 | Std Dev |
|---|---|---|---|---|---|---|---|
| Time to First Token (ms) | 3,105.18 | 53.20 | 12,275.68 | 11,285.55 | 3,933.44 | 2,993.15 | 1,702.43 |
| Time to Second Token (ms) | 1.41 | 0.07 | 9.28 | 7.11 | 1.76 | 1.25 | 1.15 |
| Request Latency (ms) | 16,902.57 | 53.20 | 36,493.88 | 29,443.70 | 27,396.16 | 16,143.62 | 8,145.46 |
| Inter Token Latency (ms) | 21.60 | 0.00 | 34.24 | 30.42 | 26.74 | 23.53 | 7.53 |
| Output Token Throughput Per User (tokens/sec) | 43.88 | 29.21 | 139.86 | 125.31 | 47.17 | 42.11 | 13.87 |
| Output Sequence Length (tokens) | 620.65 | 176.00 | 1,051.00 | 1,019.00 | 975.10 | 542.00 | 256.21 |
| Input Sequence Length (tokens) | 8,000.00 | 8,000.00 | 8,000.00 | 8,000.00 | 8,000.00 | 8,000.00 | 0.00 |

Throughput Metrics

  • Output Token Throughput: 36.68 tokens/sec (aggregate)
  • Request Throughput: 0.06 requests/sec
  • Total Requests: 100
  • Total Benchmark Duration: 1,692.14 seconds (≈28.2 minutes)
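
As a sanity check, the aggregate figure follows directly from the table above: 620.65 average output tokens × 100 requests ÷ 1,692.14 seconds ≈ 36.7 tokens/sec, matching the reported 36.68 tokens/sec.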

Analysis & Key Findings

🔍 Performance Characteristics

  1. Time to First Token (TTFT)

    • Average: ~3.1 seconds
    • High variability (max 12.3 s, std dev ~1.7 s) suggests variable prefill and queueing time
    • P90 at 3.9 s shows most requests cluster near the average, with a long tail of slow outliers
    • The large context (8K tokens) contributes to prefill latency
  2. Streaming Performance

    • Time to Second Token: 1.41ms average (excellent)
    • Inter Token Latency: 21.6 ms average (~46 tokens/sec decode speed; see the arithmetic after this list)
    • Consistent streaming after initial token generation
  3. Output Length Variance

    • Requested: 1,000 tokens (mean)
    • Actual average: 620.65 tokens
    • Note: Without min_tokens or ignore_eos parameters, the model stops at natural completion points
  4. End-to-End Latency

    • Average request: ~16.9 seconds
    • Max request: ~36.5 seconds
    • P50: ~16.1 seconds (median experience)
  5. Per-User Throughput

    • Average: 43.88 tokens/sec per user
    • This includes TTFT overhead
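
The decode speed quoted in item 2 is simple arithmetic on the inter-token latency: 1,000 ms ÷ 21.60 ms per token ≈ 46 tokens/sec once streaming is underway. The per-user throughput of 43.88 tokens/sec comes out lower because, as noted in item 5, it also absorbs the ~3.1 s spent waiting for the first token.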

💡 Recommendations

To achieve target 1,000 token outputs:

# Add these parameters to force minimum output length:
--extra-inputs min_tokens:1000 \
--extra-inputs ignore_eos:true

For higher throughput testing:

# Increase concurrency to test parallel load:
--concurrency 10

# Or use request-rate mode:
--request-rate 5

For production SLO validation:

# Set goodput thresholds:
--goodput request_latency:20000 \
--goodput time_to_first_token:5000 \
--goodput inter_token_latency:50
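
Taken together, the goodput thresholds simply bolt onto the original invocation. A combined run might look like the sketch below (same configuration as above, with the API key read from the NVIDIA_API_KEY variable suggested earlier):

# Original benchmark plus SLO thresholds (sketch; values are the ones used in this document)
aiperf profile \
  --model "aws/anthropic/bedrock-claude-sonnet-4-5-v1" \
  --url https://inference-api.nvidia.com \
  --endpoint-type chat \
  --streaming \
  --tokenizer gpt2 \
  --synthetic-input-tokens-mean 8000 \
  --synthetic-input-tokens-stddev 0 \
  --output-tokens-mean 1000 \
  --output-tokens-stddev 0 \
  --request-count 100 \
  --api-key "$NVIDIA_API_KEY" \
  --ui-type simple \
  --goodput request_latency:20000 \
  --goodput time_to_first_token:5000 \
  --goodput inter_token_latency:50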

Environment Details

  • Python Version: 3.13.7
  • AIPerf Version: 0.2.0
  • OS: macOS (darwin 25.0.0)
  • Test Date: November 3, 2025 at 12:33 PM UTC

Raw Output Artifacts

The benchmark generates three output files:

  1. CSV Export: profile_export_aiperf.csv - Raw per-request data
  2. JSON Export: profile_export_aiperf.json - Structured metrics
  3. Log File: aiperf.log - Detailed execution logs

Location: artifacts/aws_anthropic_bedrock-claude-sonnet-4-5-v1-openai-chat-concurrency1/
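
For a quick look at the structured metrics without writing any parsing code, the JSON export can be pretty-printed from the shell (a minimal sketch; it makes no assumptions about the export's schema, and python3 -m json.tool is part of the standard library):

# Pretty-print the first part of the JSON export
python3 -m json.tool artifacts/aws_anthropic_bedrock-claude-sonnet-4-5-v1-openai-chat-concurrency1/profile_export_aiperf.json | head -50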


Additional Notes

  • The model is served through an OpenAI-compatible chat completions API
  • Token counting uses the gpt2 tokenizer, which only approximates Claude's own tokenizer, so ISL/OSL figures are approximate
  • All 100 requests completed successfully with no errors
  • Single-worker, sequential execution (concurrency=1)
  • Streaming mode enables token-by-token delivery

About AIPerf

AIPerf is a benchmarking tool for generative AI model inference endpoints; it produced all of the metrics reported in this document.


Generated with AIPerf v0.2.0
