Weka KV Cache Tester

A comprehensive load generation and benchmarking toolkit from Weka for testing LLM inference performance with realistic KV cache patterns.

Overview

The weka-new-kv-cache-tester is a Python-based toolkit designed to benchmark LLM inference servers with realistic agentic coding workloads. It includes multiple testing tools and a dataset of 588 real Claude Code conversation traces.

Author: Callan Fox (Weka)
License: Apache 2.0
Version: 1.1.0 (as of 2026-01-08)

Key Improvements Over Previous Dataset

| Issue | Old Weka Dataset | New KV Cache Tester |
| --- | --- | --- |
| Block Size | 256 tokens | 64 tokens (finer granularity) |
| Hash IDs | Local per conversation | Still local, but tester handles cache simulation |
| Context Sizes | Avg 179K tokens (too large for OSS) | Avg 131K tokens, 62.7% within 128K |
| Timestamps | Relative only | Relative, but tester manages realistic replay |
| Metadata | Basic | Rich: tool_tokens, system_tokens, model info |
| Tooling | None | Full load generator with adaptive scaling |

Repository Structure

weka-new-kv-cache-tester/
├── trace_replay_tester.py    # Main trace replay tool (3,525 lines)
├── generate_index.py         # Dashboard generator
├── requirements.txt          # Dependencies
├── LICENSE                   # Apache 2.0
├── docs/                     # Documentation
│   ├── trace_replay_tester.md
│   ├── cache_rate_tester.md
│   ├── single_prompt_tester.md
│   ├── working_set_tester.md
│   └── utilities.md
└── traces/                   # 588 conversation traces
    ├── 013e38c8.json
    ├── 016625cd.json
    └── ... (588 JSON files)

Testing Tools

1. Trace Replay Tester (trace_replay_tester.py)

The flagship tool. It replays real agentic coding traces with realistic timing, cache patterns, and message structures.

Key Features:

  • Fire-and-forget async request dispatch
  • Adaptive user scaling based on TTFT thresholds
  • Cache pressure budgeting (--max-new-tokens-per-period)
  • Working set limits (--max-working-set-tokens)
  • Warm prefix caching for cross-conversation cache sharing
  • Trace advancement (start users mid-conversation)
  • Admission control (--max-concurrent-requests)

Quick Start:

python trace_replay_tester.py \
    --api-endpoint http://localhost:8000 \
    --trace-directory ./traces \
    --output-dir ./results \
    --start-users 5 \
    --max-users 50 \
    --max-ttft 2.0 \
    --test-duration 300
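
The adaptive user scaling listed among the key features can be pictured roughly as follows. This is a minimal sketch of the idea only, not the tool's actual implementation; the p95 check and the step size of one user are illustrative assumptions.

import numpy as np

def adjust_user_count(ttft_samples, current_users, max_users,
                      max_ttft=2.0, step=1):
    """Illustrative adaptive-scaling step: grow the simulated user pool
    while p95 TTFT stays under the target, otherwise hold or back off."""
    p95 = float(np.percentile(ttft_samples, 95))
    if p95 < max_ttft and current_users < max_users:
        return current_users + step      # headroom left: add a user
    if p95 > max_ttft and current_users > 1:
        return current_users - step      # over target: shed a user
    return current_users                 # at equilibrium

# Example: recent TTFTs (seconds) observed during one assessment period
print(adjust_user_count([0.4, 0.6, 0.9, 1.2, 1.5], current_users=5, max_users=50))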

2. Cache Rate Tester

Tests performance across various cache hit rates (0%, 5%, 10%, ..., 100%) with a fixed working set size.

Modes:

  • Sustained (default): Continuous load with adaptive concurrency
  • Fixed: Test specific concurrency levels

python cache_rate_tester.py \
    --api-endpoint http://localhost:8000 \
    --context-sizes 32000 \
    --working-set-size 2000000 \
    --max-ttft 2.0 \
    --output-dir test_output
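
One way to picture how a target cache hit rate is produced at a fixed context size: reuse a prompt prefix whose length is the hit rate times the context size, and fill the remainder with fresh tokens. The sketch below only illustrates that idea; the tester's actual prompt construction may differ.

def build_prompt_token_counts(context_size, hit_rate):
    """Split a fixed-size context into a reused (cacheable) prefix and a
    unique suffix so that roughly `hit_rate` of the input can hit the cache."""
    reused = int(context_size * hit_rate)   # tokens shared with earlier requests
    unique = context_size - reused          # freshly generated tokens
    return reused, unique

for rate in (0.0, 0.5, 1.0):
    print(rate, build_prompt_token_counts(32000, rate))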

3. Single Prompt Tester

A simple cold-start vs. cached-prompt comparison; useful as a quick smoke test for cache functionality.

python single_prompt_tester.py \
    --api-endpoint http://localhost:8000 \
    --context-sizes 8000 32000 64000
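
A cold-vs-cached check can also be reproduced by hand with the OpenAI-compatible client the toolkit already depends on: send the same long prompt twice and compare the time to the first streamed chunk. The endpoint, model name, and prompt below are placeholder assumptions, not values from the toolkit.

import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

def ttft(prompt, model="your-model-name"):
    """Time from request start to the first streamed chunk."""
    start = time.monotonic()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=16,
        stream=True,
    )
    for _ in stream:              # first chunk marks the first token
        return time.monotonic() - start

prompt = "word " * 8000           # long prompt so prefix caching matters
print("cold:  ", ttft(prompt))
print("cached:", ttft(prompt))    # second run should hit the KV cache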

4. Working Set Tester

Tests performance as the working set grows; useful for understanding memory tier transitions (HBM → DRAM → SSD).

python working_set_tester.py \
    --api-endpoint http://localhost:8000 \
    --context-sizes 30000 \
    --min-working-set-size 100000 \
    --max-working-set-size 5000000 \
    --working-set-increments 10 \
    --max-ttft 2.0
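
The working-set sweep can be thought of as an evenly spaced schedule from the minimum to the maximum size. A small sketch of that schedule, assuming linear spacing (which may not match the tool's internal behavior):

def working_set_schedule(min_size, max_size, increments):
    """Evenly spaced working-set sizes from min_size to max_size inclusive."""
    step = (max_size - min_size) / increments
    return [int(min_size + i * step) for i in range(increments + 1)]

# Matches the flag values in the example above (assumed interpretation)
print(working_set_schedule(100_000, 5_000_000, 10))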

Trace Format (New)

Each trace file is a JSON document representing a Claude Code conversation:

{
  "id": "016625cd",
  "models": ["claude-sonnet-4-5-20250929"],
  "block_size": 64,
  "tool_tokens": 9117,
  "system_tokens": 3100,
  "requests": [
    {
      "t": 0.0,
      "type": "s",
      "model": "claude-sonnet-4-5-20250929",
      "in": 17557,
      "out": 438,
      "hash_ids": [1, 2, 3, ...],
      "input_types": ["text"],
      "output_types": ["text"],
      "stop": ""
    }
  ]
}

Top-Level Fields

| Field | Type | Description |
| --- | --- | --- |
| id | string | Unique conversation identifier |
| models | string[] | Models used in conversation |
| block_size | int | Token block size (64 tokens) |
| tool_tokens | int | Tokens in tool definitions (~8-12K for Claude Code) |
| system_tokens | int | Tokens in system prompt (~2-3K for Claude Code) |
| requests | array | List of requests in the conversation |

Request Fields

| Field | Type | Description |
| --- | --- | --- |
| t | float | Timestamp (seconds from conversation start) |
| type | string | "s" = streaming, "n" = non-streaming |
| model | string | Model used for this request |
| in | int | Input token count |
| out | int | Output token count |
| hash_ids | int[] | Block hash IDs for cache simulation |
| input_types | string[] | Content types: ["text"], ["tool_result"] |
| output_types | string[] | Response content types |
| stop | string | Stop reason: "", "tool_use", "end_turn" |
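
A trace file with this shape can be loaded with nothing beyond the standard library. The sketch below uses the field names documented above to report basic per-conversation token totals.

import json
from pathlib import Path

def summarize_trace(path):
    """Load one trace and report basic per-conversation token totals."""
    trace = json.loads(Path(path).read_text())
    requests = trace["requests"]
    return {
        "id": trace["id"],
        "num_requests": len(requests),
        "total_input_tokens": sum(r["in"] for r in requests),
        "total_output_tokens": sum(r["out"] for r in requests),
        "shared_prefix_tokens": trace["tool_tokens"] + trace["system_tokens"],
    }

print(summarize_trace("traces/016625cd.json"))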

Dataset Statistics (588 Traces)

| Metric | Value |
| --- | --- |
| Total Traces | 588 |
| Block Size | 64 tokens |
| Models | Claude Sonnet 4, Opus 4, Haiku 4.5 |

Input Sequence Length (ISL)

| Statistic | Value |
| --- | --- |
| Mean | 130,908 tokens |
| Median | 110,303 tokens |
| Min | 0 tokens |
| Max | 636,522 tokens |

Output Sequence Length (OSL)

| Statistic | Value |
| --- | --- |
| Mean | 432 tokens |
| Median | 231 tokens |
| Min | 0 tokens |
| Max | 15,390 tokens |

Context Length Compatibility

| Threshold | % of Requests Within |
| --- | --- |
| ≤ 32K tokens | 8.3% |
| ≤ 64K tokens | 22.8% |
| ≤ 128K tokens | 62.7% |

Note: This is significantly better than the old dataset (42.5% within 128K), making it more suitable for OSS models like Llama 3.1 (128K context).
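
These percentages can be recomputed from the traces themselves. A small sketch, assuming the trace format documented earlier, that counts the fraction of requests whose input fits a given context window:

import json
from pathlib import Path

def pct_within(trace_dir, limit):
    """Fraction of all requests whose input token count fits within `limit`."""
    counts = [r["in"]
              for p in Path(trace_dir).glob("*.json")
              for r in json.loads(p.read_text())["requests"]]
    return 100.0 * sum(c <= limit for c in counts) / len(counts)

for limit in (32_000, 64_000, 128_000):
    print(f"<= {limit:>7} tokens: {pct_within('./traces', limit):.1f}%")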

Shared Prefix Data

| Metric | Average |
| --- | --- |
| Tool tokens | 10,445 |
| System tokens | 2,459 |
| Total shared prefix | ~12,904 tokens |

This shared prefix data enables the warm prefix caching feature, simulating how Claude Code's tool definitions and system prompts are typically already cached across conversations.

Key Features

1. Warm Prefix Caching

Simulates cross-conversation cache sharing:

  • User 1's first request: all cache misses on shared prefix
  • User 2+'s first request: cache hits on the warm prefix portion

# --warm-prefix-pct 0.5: pre-warm 50% of tool+system tokens
python trace_replay_tester.py \
    --warm-prefix-pct 0.5 \
    ...
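
In token terms, the warm prefix fraction applies to each trace's shared tool and system prompt tokens. A rough illustration, assuming the fraction is applied to tool_tokens + system_tokens as the flag's description suggests:

def warm_prefix_tokens(tool_tokens, system_tokens, warm_prefix_pct):
    """Tokens assumed already cached for a new user's first request."""
    return int((tool_tokens + system_tokens) * warm_prefix_pct)

# Using the dataset averages from the Shared Prefix Data table
print(warm_prefix_tokens(10_445, 2_459, 0.5))   # 6452 pre-warmed tokens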

2. Trace Advancement

Simulates production workloads where sessions are at various stages:

# --advance-min/--advance-max: start each user 0-50% of the way through its trace
# --prime-cache: warm the server cache up to the start position
python trace_replay_tester.py \
    --advance-min 0.0 --advance-max 0.5 \
    --prime-cache \
    ...
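
Conceptually, trace advancement just means choosing each user's starting request somewhere between the two fractions. A minimal sketch of that choice; the uniform draw is an assumption about how the range is sampled:

import random

def pick_start_index(num_requests, advance_min=0.0, advance_max=0.5, rng=random):
    """Choose the request index a simulated user begins replay from."""
    fraction = rng.uniform(advance_min, advance_max)
    return int(fraction * num_requests)

random.seed(42)
print(pick_start_index(num_requests=200))   # somewhere in the first half of the trace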

3. Admission Control

Limits concurrent requests to prevent server overload:

python trace_replay_tester.py \
    --max-concurrent-requests 20 \
    --max-users 50 \
    ...
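
Admission control of this kind is commonly implemented with an asyncio semaphore that caps in-flight requests regardless of how many users are active. A generic sketch of the pattern, not the tool's code:

import asyncio

MAX_CONCURRENT_REQUESTS = 20
admission = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)

async def send_with_admission(send_request, payload):
    """Block until a slot is free, so the server never sees more than
    MAX_CONCURRENT_REQUESTS in flight at once."""
    async with admission:
        return await send_request(payload)

async def main():
    async def fake_send(payload):            # stand-in for a real HTTP call
        await asyncio.sleep(0.1)
        return payload
    results = await asyncio.gather(
        *(send_with_admission(fake_send, i) for i in range(50)))
    print(len(results), "requests completed")

asyncio.run(main())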

4. Cache Pressure Budgeting

Limits cache churn per assessment period:

python trace_replay_tester.py \
    --max-new-tokens-per-period 500000 \
    --max-working-set-tokens 10000000 \
    ...
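
Cache pressure budgeting can be pictured as a per-period token accountant: new (cache-miss) prefill tokens are admitted only while the period's budget lasts. This is an illustrative sketch of that idea, not the tester's implementation:

class TokenBudget:
    """Tracks new (cache-miss) tokens admitted during one assessment period."""

    def __init__(self, max_new_tokens_per_period):
        self.limit = max_new_tokens_per_period
        self.used = 0

    def try_admit(self, new_tokens):
        """Admit the request only if it fits in the remaining budget."""
        if self.used + new_tokens > self.limit:
            return False          # defer: would exceed this period's cache churn
        self.used += new_tokens
        return True

    def reset(self):
        self.used = 0             # called at the start of each assessment period

budget = TokenBudget(500_000)
print(budget.try_admit(400_000))  # True
print(budget.try_admit(200_000))  # False, over the 500K budget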

Output Files

| File | Description |
| --- | --- |
| summary_trace_replay.csv | Per-assessment-period metrics |
| detailed_results.csv | Per-request metrics |
| user_lifecycle.csv | User start/complete/truncate events |
| test_metadata.json | Configuration and trace stats |
| *.html | Plotly visualizations |
| index.html | Dashboard |
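
Since the per-request output is plain CSV, it can be post-processed with pandas (already in the dependency list). The column name below is an assumption, not a documented field; inspect the actual header of detailed_results.csv before relying on it.

import pandas as pd

df = pd.read_csv("results/detailed_results.csv")
print(df.columns.tolist())                      # discover the actual schema first

# Example analysis once the TTFT column is identified ("ttft" is assumed here)
if "ttft" in df.columns:
    print("p50 TTFT:", df["ttft"].quantile(0.50))
    print("p95 TTFT:", df["ttft"].quantile(0.95))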

Installation

cd weka-new-kv-cache-tester
pip install -r requirements.txt

Dependencies:

  • openai>=1.0.0
  • transformers>=4.30.0
  • torch>=2.0.0
  • plotly>=5.14.0
  • pandas>=2.0.0
  • numpy>=1.24.0
  • requests>=2.31.0

Comparison: Weka KV Cache Tester vs AIPerf

| Feature | Weka KV Cache Tester | AIPerf |
| --- | --- | --- |
| Purpose | KV cache benchmarking | General LLM benchmarking |
| Trace Format | Custom JSON | Mooncake JSONL |
| Block Size | 64 tokens | Configurable |
| Adaptive Scaling | Yes (built-in) | No (fixed schedule) |
| Warm Prefix | Yes | No |
| Trace Advancement | Yes | No |
| Output | HTML dashboards | JSON/CSV exports |

Remaining Limitations

  1. Local Hash IDs: Hash IDs are still local per conversation, not global across the dataset. The tester handles cache simulation internally, but true cross-conversation cache analysis requires global hashing.

  2. Relative Timestamps: Timestamps are still relative to conversation start. The tester manages realistic replay, but global traffic pattern analysis is limited.

  3. Claude-Focused: Traces are from Claude Code sessions, so patterns may not perfectly represent other agentic workloads.

Example: Production Capacity Test

python trace_replay_tester.py \
    --api-endpoint http://localhost:8000 \
    --trace-directory ./traces \
    --output-dir ./capacity_test \
    --start-users 10 \
    --max-users 100 \
    --max-ttft 2.0 \
    --ttft-metric p95 \
    --max-new-tokens-per-period 1000000 \
    --test-duration 1800 \
    --recycle \
    --warm-prefix-pct 0.5 \
    --advance-min 0.0 --advance-max 0.5 \
    --prime-cache \
    --seed 42

This configuration:

  • Starts with 10 users, scales up to 100
  • Targets 2.0s p95 TTFT
  • Limits cache churn to 1M tokens per 30s period
  • Recycles completed users with new traces
  • Simulates 50% warm prefix cache sharing
  • Starts users at random positions (0-50%) with cache priming
  • Runs for 30 minutes