A comprehensive load generation and benchmarking toolkit from Weka for testing LLM inference performance with realistic KV cache patterns.
The weka-new-kv-cache-tester is a Python-based toolkit designed to benchmark LLM inference servers with realistic agentic coding workloads. It includes multiple testing tools and a dataset of 588 real Claude Code conversation traces.
Author: Callan Fox (Weka)
License: Apache 2.0
Version: 1.1.0 (as of 2026-01-08)
| Issue | Old Weka Dataset | New KV Cache Tester |
|---|---|---|
| Block Size | 256 tokens | 64 tokens (finer granularity) |
| Hash IDs | Local per conversation | Still local, but tester handles cache simulation |
| Context Sizes | Avg 179K tokens (too large for OSS) | Avg 131K tokens, 62.7% within 128K |
| Timestamps | Relative only | Relative, but tester manages realistic replay |
| Metadata | Basic | Rich: tool_tokens, system_tokens, model info |
| Tooling | None | Full load generator with adaptive scaling |
```
weka-new-kv-cache-tester/
├── trace_replay_tester.py   # Main trace replay tool (3,525 lines)
├── generate_index.py        # Dashboard generator
├── requirements.txt         # Dependencies
├── LICENSE                  # Apache 2.0
├── docs/                    # Documentation
│   ├── trace_replay_tester.md
│   ├── cache_rate_tester.md
│   ├── single_prompt_tester.md
│   ├── working_set_tester.md
│   └── utilities.md
└── traces/                  # 588 conversation traces
    ├── 013e38c8.json
    ├── 016625cd.json
    └── ... (588 JSON files)
```
The flagship tool, `trace_replay_tester.py`, replays real agentic coding traces with realistic timing, cache patterns, and message structures.
Key Features:
- Fire-and-forget async request dispatch
- Adaptive user scaling based on TTFT thresholds
- Cache pressure budgeting (`--max-new-tokens-per-period`)
- Working set limits (`--max-working-set-tokens`)
- Warm prefix caching for cross-conversation cache sharing
- Trace advancement (start users mid-conversation)
- Admission control (`--max-concurrent-requests`)
Quick Start:
```bash
python trace_replay_tester.py \
    --api-endpoint http://localhost:8000 \
    --trace-directory ./traces \
    --output-dir ./results \
    --start-users 5 \
    --max-users 50 \
    --max-ttft 2.0 \
    --test-duration 300
```
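Adaptive user scaling works by comparing a recent TTFT statistic (e.g. p95, selectable via `--ttft-metric` in the full example further below) against the `--max-ttft` target. The sketch below is a minimal, conceptual version of such a control loop; the function name, variable names, and one-user step size are assumptions for illustration, not the tool's internals.

```python
# Conceptual sketch of TTFT-threshold-driven user scaling.
# Names and the step size are illustrative assumptions, not the tool's internals.
def next_user_count(current_users: int, ttft_p95: float,
                    max_ttft: float, max_users: int, start_users: int) -> int:
    if ttft_p95 < max_ttft:
        # Server still meets the latency target: add load, up to --max-users.
        return min(current_users + 1, max_users)
    # Over the TTFT threshold: back off, but never below --start-users.
    return max(current_users - 1, start_users)
```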
The cache rate tester (`cache_rate_tester.py`) tests performance across various cache hit rates (0%, 5%, 10%, ..., 100%) with a fixed working set size.

Modes:
- Sustained (default): Continuous load with adaptive concurrency
- Fixed: Test specific concurrency levels
```bash
python cache_rate_tester.py \
    --api-endpoint http://localhost:8000 \
    --context-sizes 32000 \
    --working-set-size 2000000 \
    --max-ttft 2.0 \
    --output-dir test_output
```

The single prompt tester (`single_prompt_tester.py`) is a simple cold start vs. cached prompt comparison - a quick smoke test for cache functionality.
```bash
python single_prompt_tester.py \
    --api-endpoint http://localhost:8000 \
    --context-sizes 8000 32000 64000
```

The working set tester (`working_set_tester.py`) tests performance as the working set grows - useful for understanding memory tier transitions (HBM → DRAM → SSD).
```bash
python working_set_tester.py \
    --api-endpoint http://localhost:8000 \
    --context-sizes 30000 \
    --min-working-set-size 100000 \
    --max-working-set-size 5000000 \
    --working-set-increments 10 \
    --max-ttft 2.0
```

Each trace file is a JSON document representing a Claude Code conversation:
```json
{
  "id": "016625cd",
  "models": ["claude-sonnet-4-5-20250929"],
  "block_size": 64,
  "tool_tokens": 9117,
  "system_tokens": 3100,
  "requests": [
    {
      "t": 0.0,
      "type": "s",
      "model": "claude-sonnet-4-5-20250929",
      "in": 17557,
      "out": 438,
      "hash_ids": [1, 2, 3, ...],
      "input_types": ["text"],
      "output_types": ["text"],
      "stop": ""
    }
  ]
}
```

| Field | Type | Description |
|---|---|---|
| `id` | string | Unique conversation identifier |
| `models` | string[] | Models used in conversation |
| `block_size` | int | Token block size (64 tokens) |
| `tool_tokens` | int | Tokens in tool definitions (~8-12K for Claude Code) |
| `system_tokens` | int | Tokens in system prompt (~2-3K for Claude Code) |
| `requests` | array | List of requests in the conversation |
| Field | Type | Description |
|---|---|---|
| `t` | float | Timestamp (seconds from conversation start) |
| `type` | string | `"s"` = streaming, `"n"` = non-streaming |
| `model` | string | Model used for this request |
| `in` | int | Input token count |
| `out` | int | Output token count |
| `hash_ids` | int[] | Block hash IDs for cache simulation |
| `input_types` | string[] | Content types: `["text"]`, `["tool_result"]` |
| `output_types` | string[] | Response content types |
| `stop` | string | Stop reason: `""`, `"tool_use"`, `"end_turn"` |
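For quick inspection, a trace can be read with the standard library alone. The sketch below walks one of the bundled conversations and treats previously seen `hash_ids` as cache hits; this is a simplified reading of the schema above, not the tester's actual simulation logic.

```python
import json

# Illustrative reader for the trace schema documented above.
with open("traces/016625cd.json") as f:
    trace = json.load(f)

seen_blocks: set[int] = set()
for req in trace["requests"]:
    blocks = req["hash_ids"]
    # Simplification: any block hash seen earlier in this conversation
    # is counted as a cache hit.
    hits = sum(1 for b in blocks if b in seen_blocks)
    seen_blocks.update(blocks)
    print(f"t={req['t']:>8.1f}s in={req['in']:>7} out={req['out']:>6} "
          f"cached_blocks={hits}/{len(blocks)}")
```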
| Metric | Value |
|---|---|
| Total Traces | 588 |
| Block Size | 64 tokens |
| Models | Claude Sonnet 4, Opus 4, Haiku 4.5 |
| Input Context Size per Request | Value |
|---|---|
| Mean | 130,908 tokens |
| Median | 110,303 tokens |
| Min | 0 tokens |
| Max | 636,522 tokens |
| Output Tokens per Request | Value |
|---|---|
| Mean | 432 tokens |
| Median | 231 tokens |
| Min | 0 tokens |
| Max | 15,390 tokens |
| Threshold | % of Requests Within |
|---|---|
| ≤ 32K tokens | 8.3% |
| ≤ 64K tokens | 22.8% |
| ≤ 128K tokens | 62.7% |
Note: This is significantly better than the old dataset (42.5% within 128K), making it more suitable for OSS models like Llama 3.1 (128K context).
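These distribution figures can be recomputed directly from the bundled traces using only the documented `in` field. The short script below is illustrative and not part of the toolkit.

```python
import json
from pathlib import Path

# Collect the input token count of every request across all 588 traces.
sizes = [req["in"]
         for path in Path("traces").glob("*.json")
         for req in json.loads(path.read_text())["requests"]]

# Fraction of requests that fit within common context-window limits.
for limit in (32_000, 64_000, 128_000):
    pct = 100 * sum(s <= limit for s in sizes) / len(sizes)
    print(f"<= {limit // 1000}K tokens: {pct:.1f}% of requests")
```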
| Metric | Average |
|---|---|
| Tool tokens | 10,445 |
| System tokens | 2,459 |
| Total shared prefix | ~12,904 tokens |
This shared prefix data enables the warm prefix caching feature, simulating how Claude Code's tool definitions and system prompts are typically already cached across conversations.
Warm prefix caching simulates cross-conversation cache sharing:
- User 1's first request: all cache misses on shared prefix
- User 2+'s first request: cache hits on the warm prefix portion
```bash
python trace_replay_tester.py \
    --warm-prefix-pct 0.5 \  # 50% of tool+system tokens pre-warmed
    ...
```
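At block granularity, the warm prefix determines how many 64-token blocks of a new user's first request can be counted as already cached. The helper below is a small illustration of that accounting; the function name and rounding choice are assumptions, not the tester's internals.

```python
# Illustrative warm-prefix accounting at 64-token block granularity.
BLOCK_SIZE = 64

def warm_prefix_blocks(tool_tokens: int, system_tokens: int, warm_pct: float) -> int:
    """Number of shared-prefix blocks treated as already cached server-side."""
    shared_tokens = tool_tokens + system_tokens
    return int(warm_pct * shared_tokens) // BLOCK_SIZE

# With the dataset averages (~10,445 tool + ~2,459 system tokens) and
# --warm-prefix-pct 0.5, roughly 100 blocks of each new user's first
# request are expected to hit the cache.
print(warm_prefix_blocks(10_445, 2_459, 0.5))
```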
Trace advancement simulates production workloads where sessions are at various stages:

```bash
python trace_replay_tester.py \
    --advance-min 0.0 --advance-max 0.5 \  # Start 0-50% through trace
    --prime-cache \                        # Warm server cache at start position
    ...
```
Admission control limits concurrent requests to prevent server overload:

```bash
python trace_replay_tester.py \
    --max-concurrent-requests 20 \
    --max-users 50 \
    ...
```
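Conceptually, admission control is a bounded-concurrency gate in front of the request dispatcher. The sketch below shows the generic pattern (an asyncio semaphore), offered as an illustration rather than the tester's actual code.

```python
import asyncio

# Generic admission-control pattern: at most N requests in flight.
# Illustrative only; not the tester's implementation. `dispatch` stands in
# for whatever coroutine sends one request to the inference server.
MAX_CONCURRENT_REQUESTS = 20
slots = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)

async def admit(dispatch, payload):
    async with slots:  # requests beyond the limit queue here, not at the server
        return await dispatch(payload)
```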
Cache pressure budgeting limits cache churn per assessment period:

```bash
python trace_replay_tester.py \
    --max-new-tokens-per-period 500000 \
    --max-working-set-tokens 10000000 \
    ...
```

| File | Description |
|---|---|
| `summary_trace_replay.csv` | Per-assessment-period metrics |
| `detailed_results.csv` | Per-request metrics |
| `user_lifecycle.csv` | User start/complete/truncate events |
| `test_metadata.json` | Configuration and trace stats |
| `*.html` | Plotly visualizations |
| `index.html` | Dashboard |
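The CSVs are plain tabular output and can be post-processed with pandas (already a dependency). The column names in the sketch below are assumptions about the output schema, hence the inspection step and the guard before use.

```python
import pandas as pd

# Load the per-request results written by trace_replay_tester.py.
df = pd.read_csv("results/detailed_results.csv")
print(df.columns.tolist())        # inspect the actual schema first
if "ttft" in df.columns:          # hypothetical column name, adjust to the real one
    print("p95 TTFT:", df["ttft"].quantile(0.95))
```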
```bash
cd weka-new-kv-cache-tester
pip install -r requirements.txt
```

Dependencies:
- openai>=1.0.0
- transformers>=4.30.0
- torch>=2.0.0
- plotly>=5.14.0
- pandas>=2.0.0
- numpy>=1.24.0
- requests>=2.31.0
| Feature | Weka KV Cache Tester | AIPerf |
|---|---|---|
| Purpose | KV cache benchmarking | General LLM benchmarking |
| Trace Format | Custom JSON | Mooncake JSONL |
| Block Size | 64 tokens | Configurable |
| Adaptive Scaling | Yes (built-in) | No (fixed schedule) |
| Warm Prefix | Yes | No |
| Trace Advancement | Yes | No |
| Output | HTML dashboards | JSON/CSV exports |
- Local Hash IDs: Hash IDs are still local per conversation, not global across the dataset. The tester handles cache simulation internally, but true cross-conversation cache analysis requires global hashing.
- Relative Timestamps: Timestamps are still relative to conversation start. The tester manages realistic replay, but global traffic pattern analysis is limited.
- Claude-Focused: Traces are from Claude Code sessions, so patterns may not perfectly represent other agentic workloads.
```bash
python trace_replay_tester.py \
    --api-endpoint http://localhost:8000 \
    --trace-directory ./traces \
    --output-dir ./capacity_test \
    --start-users 10 \
    --max-users 100 \
    --max-ttft 2.0 \
    --ttft-metric p95 \
    --max-new-tokens-per-period 1000000 \
    --test-duration 1800 \
    --recycle \
    --warm-prefix-pct 0.5 \
    --advance-min 0.0 --advance-max 0.5 \
    --prime-cache \
    --seed 42
```

This configuration:
- Starts with 10 users, scales up to 100
- Targets 2.0s p95 TTFT
- Limits cache churn to 1M tokens per 30s period
- Recycles completed users with new traces
- Simulates 50% warm prefix cache sharing
- Starts users at random positions (0-50%) with cache priming
- Runs for 30 minutes
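For context, the cache-churn budget in this configuration works out as follows (a back-of-the-envelope calculation using the 30 s assessment period stated above and the dataset's 64-token block size):

```python
# Implied cache-churn rate for --max-new-tokens-per-period 1000000.
tokens_per_period, period_s, block = 1_000_000, 30, 64
print(f"Max cache churn: {tokens_per_period / period_s:,.0f} new KV tokens/s")
print(f"               = {tokens_per_period // block:,} new 64-token blocks per period")
```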