Weka KV Cache Tester

A comprehensive load generation and benchmarking toolkit from Weka for testing LLM inference performance with realistic KV cache patterns.

Overview

The weka-new-kv-cache-tester is a Python-based toolkit designed to benchmark LLM inference servers with realistic agentic coding workloads. It includes multiple testing tools and a dataset of 588 real Claude Code conversation traces.

Author: Callan Fox (Weka)
License: Apache 2.0
Version: 1.1.0 (as of 2026-01-08)

Key Improvements Over Previous Dataset

| Issue | Old Weka Dataset | New KV Cache Tester |
| --- | --- | --- |
| Block Size | 256 tokens | 64 tokens (finer granularity) |
| Hash IDs | Local per conversation | Still local, but tester handles cache simulation |
| Context Sizes | Avg 179K tokens (too large for OSS) | Avg 131K tokens, 62.7% within 128K |
| Timestamps | Relative only | Relative, but tester manages realistic replay |
| Metadata | Basic | Rich: tool_tokens, system_tokens, model info |
| Tooling | None | Full load generator with adaptive scaling |

Repository Structure

weka-new-kv-cache-tester/
├── trace_replay_tester.py    # Main trace replay tool (3,525 lines)
├── generate_index.py         # Dashboard generator
├── requirements.txt          # Dependencies
├── LICENSE                   # Apache 2.0
├── docs/                     # Documentation
│   ├── trace_replay_tester.md
│   ├── cache_rate_tester.md
│   ├── single_prompt_tester.md
│   ├── working_set_tester.md
│   └── utilities.md
└── traces/                   # 588 conversation traces
    ├── 013e38c8.json
    ├── 016625cd.json
    └── ... (588 JSON files)

Testing Tools

1. Trace Replay Tester (trace_replay_tester.py)

The flagship tool. It replays real agentic coding traces with realistic timing, cache patterns, and message structures.

Key Features:

  • Fire-and-forget async request dispatch
  • Adaptive user scaling based on TTFT thresholds
  • Cache pressure budgeting (--max-new-tokens-per-period)
  • Working set limits (--max-working-set-tokens)
  • Warm prefix caching for cross-conversation cache sharing
  • Trace advancement (start users mid-conversation)
  • Admission control (--max-concurrent-requests)

Quick Start:

python trace_replay_tester.py \
    --api-endpoint http://localhost:8000 \
    --trace-directory ./traces \
    --output-dir ./results \
    --start-users 5 \
    --max-users 50 \
    --max-ttft 2.0 \
    --test-duration 300
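
The adaptive user scaling listed among the key features can be pictured roughly as follows. This is a minimal sketch of the idea only, not the tool's actual implementation; the p95 check and the step size of one user are illustrative assumptions.

import numpy as np

def adjust_user_count(ttft_samples, current_users, max_users,
                      max_ttft=2.0, step=1):
    """Illustrative adaptive-scaling step: grow the simulated user pool
    while p95 TTFT stays under the target, otherwise hold or back off."""
    p95 = float(np.percentile(ttft_samples, 95))
    if p95 < max_ttft and current_users < max_users:
        return current_users + step      # headroom left: add a user
    if p95 > max_ttft and current_users > 1:
        return current_users - step      # over target: shed a user
    return current_users                 # at equilibrium

# Example: recent TTFTs (seconds) observed during one assessment period
print(adjust_user_count([0.4, 0.6, 0.9, 1.2, 1.5], current_users=5, max_users=50))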

2. Cache Rate Tester

Tests performance across various cache hit rates (0%, 5%, 10%, ..., 100%) with a fixed working set size.

Modes:

  • Sustained (default): Continuous load with adaptive concurrency
  • Fixed: Test specific concurrency levels

python cache_rate_tester.py \
    --api-endpoint http://localhost:8000 \
    --context-sizes 32000 \
    --working-set-size 2000000 \
    --max-ttft 2.0 \
    --output-dir test_output
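
One way to picture how a target cache hit rate is produced at a fixed context size: reuse a prompt prefix whose length is the hit rate times the context size, and fill the remainder with fresh tokens. The sketch below only illustrates that idea; the tester's actual prompt construction may differ.

def build_prompt_token_counts(context_size, hit_rate):
    """Split a fixed-size context into a reused (cacheable) prefix and a
    unique suffix so that roughly `hit_rate` of the input can hit the cache."""
    reused = int(context_size * hit_rate)   # tokens shared with earlier requests
    unique = context_size - reused          # freshly generated tokens
    return reused, unique

for rate in (0.0, 0.5, 1.0):
    print(rate, build_prompt_token_counts(32000, rate))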

3. Single Prompt Tester

A simple cold-start vs. cached-prompt comparison; useful as a quick smoke test for cache functionality.

python single_prompt_tester.py \
    --api-endpoint http://localhost:8000 \
    --context-sizes 8000 32000 64000
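
A cold-vs-cached check can also be reproduced by hand with the OpenAI-compatible client the toolkit already depends on: send the same long prompt twice and compare the time to the first streamed chunk. The endpoint, model name, and prompt below are placeholder assumptions, not values from the toolkit.

import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

def ttft(prompt, model="your-model-name"):
    """Time from request start to the first streamed chunk."""
    start = time.monotonic()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=16,
        stream=True,
    )
    for _ in stream:              # first chunk marks the first token
        return time.monotonic() - start

prompt = "word " * 8000           # long prompt so prefix caching matters
print("cold:  ", ttft(prompt))
print("cached:", ttft(prompt))    # second run should hit the KV cache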

4. Working Set Tester

Tests performance as the working set grows; useful for understanding memory tier transitions (HBM → DRAM → SSD).

python working_set_tester.py \
    --api-endpoint http://localhost:8000 \
    --context-sizes 30000 \
    --min-working-set-size 100000 \
    --max-working-set-size 5000000 \
    --working-set-increments 10 \
    --max-ttft 2.0
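
The working-set sweep can be thought of as an evenly spaced schedule from the minimum to the maximum size. A small sketch of that schedule, assuming linear spacing (which may not match the tool's internal behavior):

def working_set_schedule(min_size, max_size, increments):
    """Evenly spaced working-set sizes from min_size to max_size inclusive."""
    step = (max_size - min_size) / increments
    return [int(min_size + i * step) for i in range(increments + 1)]

# Matches the flag values in the example above (assumed interpretation)
print(working_set_schedule(100_000, 5_000_000, 10))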

Trace Format (New)

Each trace file is a JSON document representing a Claude Code conversation:

{
  "id": "016625cd",
  "models": ["claude-sonnet-4-5-20250929"],
  "block_size": 64,
  "tool_tokens": 9117,
  "system_tokens": 3100,
  "requests": [
    {
      "t": 0.0,
      "type": "s",
      "model": "claude-sonnet-4-5-20250929",
      "in": 17557,
      "out": 438,
      "hash_ids": [1, 2, 3, ...],
      "input_types": ["text"],
      "output_types": ["text"],
      "stop": ""
    }
  ]
}

Top-Level Fields

| Field | Type | Description |
| --- | --- | --- |
| id | string | Unique conversation identifier |
| models | string[] | Models used in conversation |
| block_size | int | Token block size (64 tokens) |
| tool_tokens | int | Tokens in tool definitions (~8-12K for Claude Code) |
| system_tokens | int | Tokens in system prompt (~2-3K for Claude Code) |
| requests | array | List of requests in the conversation |

Request Fields

| Field | Type | Description |
| --- | --- | --- |
| t | float | Timestamp (seconds from conversation start) |
| type | string | "s" = streaming, "n" = non-streaming |
| model | string | Model used for this request |
| in | int | Input token count |
| out | int | Output token count |
| hash_ids | int[] | Block hash IDs for cache simulation |
| input_types | string[] | Content types: ["text"], ["tool_result"] |
| output_types | string[] | Response content types |
| stop | string | Stop reason: "", "tool_use", "end_turn" |
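
A trace file with this shape can be loaded with nothing beyond the standard library. The sketch below uses the field names documented above to report basic per-conversation token totals.

import json
from pathlib import Path

def summarize_trace(path):
    """Load one trace and report basic per-conversation token totals."""
    trace = json.loads(Path(path).read_text())
    requests = trace["requests"]
    return {
        "id": trace["id"],
        "num_requests": len(requests),
        "total_input_tokens": sum(r["in"] for r in requests),
        "total_output_tokens": sum(r["out"] for r in requests),
        "shared_prefix_tokens": trace["tool_tokens"] + trace["system_tokens"],
    }

print(summarize_trace("traces/016625cd.json"))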

Dataset Statistics (588 Traces)

| Metric | Value |
| --- | --- |
| Total Traces | 588 |
| Block Size | 64 tokens |
| Models | Claude Sonnet 4, Opus 4, Haiku 4.5 |

Input Sequence Length (ISL)

| Statistic | Value |
| --- | --- |
| Mean | 130,908 tokens |
| Median | 110,303 tokens |
| Min | 0 tokens |
| Max | 636,522 tokens |

Output Sequence Length (OSL)

| Statistic | Value |
| --- | --- |
| Mean | 432 tokens |
| Median | 231 tokens |
| Min | 0 tokens |
| Max | 15,390 tokens |

Context Length Compatibility

| Threshold | % of Requests Within |
| --- | --- |
| ≤ 32K tokens | 8.3% |
| ≤ 64K tokens | 22.8% |
| ≤ 128K tokens | 62.7% |

Note: This is significantly better than the old dataset (42.5% within 128K), making it more suitable for OSS models like Llama 3.1 (128K context).
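
These percentages can be recomputed from the traces themselves. A small sketch, assuming the trace format documented earlier, that counts the fraction of requests whose input fits a given context window:

import json
from pathlib import Path

def pct_within(trace_dir, limit):
    """Fraction of all requests whose input token count fits within `limit`."""
    counts = [r["in"]
              for p in Path(trace_dir).glob("*.json")
              for r in json.loads(p.read_text())["requests"]]
    return 100.0 * sum(c <= limit for c in counts) / len(counts)

for limit in (32_000, 64_000, 128_000):
    print(f"<= {limit:>7} tokens: {pct_within('./traces', limit):.1f}%")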

Shared Prefix Data

| Metric | Average |
| --- | --- |
| Tool tokens | 10,445 |
| System tokens | 2,459 |
| Total shared prefix | ~12,904 tokens |

This shared prefix data enables the warm prefix caching feature, simulating how Claude Code's tool definitions and system prompts are typically already cached across conversations.

Key Features

1. Warm Prefix Caching

Simulates cross-conversation cache sharing:

  • User 1's first request: all cache misses on shared prefix
  • User 2+'s first request: cache hits on the warm prefix portion

# --warm-prefix-pct 0.5: pre-warm 50% of tool+system tokens
python trace_replay_tester.py \
    --warm-prefix-pct 0.5 \
    ...
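
In token terms, the warm prefix fraction applies to each trace's shared tool and system prompt tokens. A rough illustration, assuming the fraction is applied to tool_tokens + system_tokens as the flag's description suggests:

def warm_prefix_tokens(tool_tokens, system_tokens, warm_prefix_pct):
    """Tokens assumed already cached for a new user's first request."""
    return int((tool_tokens + system_tokens) * warm_prefix_pct)

# Using the dataset averages from the Shared Prefix Data table
print(warm_prefix_tokens(10_445, 2_459, 0.5))   # 6452 pre-warmed tokens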

2. Trace Advancement

Simulates production workloads where sessions are at various stages:

# --advance-min/--advance-max: start each user 0-50% of the way through its trace
# --prime-cache: warm the server cache up to the start position
python trace_replay_tester.py \
    --advance-min 0.0 --advance-max 0.5 \
    --prime-cache \
    ...
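
Conceptually, trace advancement just means choosing each user's starting request somewhere between the two fractions. A minimal sketch of that choice; the uniform draw is an assumption about how the range is sampled:

import random

def pick_start_index(num_requests, advance_min=0.0, advance_max=0.5, rng=random):
    """Choose the request index a simulated user begins replay from."""
    fraction = rng.uniform(advance_min, advance_max)
    return int(fraction * num_requests)

random.seed(42)
print(pick_start_index(num_requests=200))   # somewhere in the first half of the trace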

3. Admission Control

Limits concurrent requests to prevent server overload:

python trace_replay_tester.py \
    --max-concurrent-requests 20 \
    --max-users 50 \
    ...
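
Admission control of this kind is commonly implemented with an asyncio semaphore that caps in-flight requests regardless of how many users are active. A generic sketch of the pattern, not the tool's code:

import asyncio

MAX_CONCURRENT_REQUESTS = 20
admission = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)

async def send_with_admission(send_request, payload):
    """Block until a slot is free, so the server never sees more than
    MAX_CONCURRENT_REQUESTS in flight at once."""
    async with admission:
        return await send_request(payload)

async def main():
    async def fake_send(payload):            # stand-in for a real HTTP call
        await asyncio.sleep(0.1)
        return payload
    results = await asyncio.gather(
        *(send_with_admission(fake_send, i) for i in range(50)))
    print(len(results), "requests completed")

asyncio.run(main())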

4. Cache Pressure Budgeting

Limits cache churn per assessment period:

python trace_replay_tester.py \
    --max-new-tokens-per-period 500000 \
    --max-working-set-tokens 10000000 \
    ...
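
Cache pressure budgeting can be pictured as a per-period token accountant: new (cache-miss) prefill tokens are admitted only while the period's budget lasts. This is an illustrative sketch of that idea, not the tester's implementation:

class TokenBudget:
    """Tracks new (cache-miss) tokens admitted during one assessment period."""

    def __init__(self, max_new_tokens_per_period):
        self.limit = max_new_tokens_per_period
        self.used = 0

    def try_admit(self, new_tokens):
        """Admit the request only if it fits in the remaining budget."""
        if self.used + new_tokens > self.limit:
            return False          # defer: would exceed this period's cache churn
        self.used += new_tokens
        return True

    def reset(self):
        self.used = 0             # called at the start of each assessment period

budget = TokenBudget(500_000)
print(budget.try_admit(400_000))  # True
print(budget.try_admit(200_000))  # False, over the 500K budget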

Output Files

| File | Description |
| --- | --- |
| summary_trace_replay.csv | Per-assessment-period metrics |
| detailed_results.csv | Per-request metrics |
| user_lifecycle.csv | User start/complete/truncate events |
| test_metadata.json | Configuration and trace stats |
| *.html | Plotly visualizations |
| index.html | Dashboard |
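
Since the per-request output is plain CSV, it can be post-processed with pandas (already in the dependency list). The column name below is an assumption, not a documented field; inspect the actual header of detailed_results.csv before relying on it.

import pandas as pd

df = pd.read_csv("results/detailed_results.csv")
print(df.columns.tolist())                      # discover the actual schema first

# Example analysis once the TTFT column is identified ("ttft" is assumed here)
if "ttft" in df.columns:
    print("p50 TTFT:", df["ttft"].quantile(0.50))
    print("p95 TTFT:", df["ttft"].quantile(0.95))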

Installation

cd weka-new-kv-cache-tester
pip install -r requirements.txt

Dependencies:

  • openai>=1.0.0
  • transformers>=4.30.0
  • torch>=2.0.0
  • plotly>=5.14.0
  • pandas>=2.0.0
  • numpy>=1.24.0
  • requests>=2.31.0

Comparison: Weka KV Cache Tester vs AIPerf

| Feature | Weka KV Cache Tester | AIPerf |
| --- | --- | --- |
| Purpose | KV cache benchmarking | General LLM benchmarking |
| Trace Format | Custom JSON | Mooncake JSONL |
| Block Size | 64 tokens | Configurable |
| Adaptive Scaling | Yes (built-in) | No (fixed schedule) |
| Warm Prefix | Yes | No |
| Trace Advancement | Yes | No |
| Output | HTML dashboards | JSON/CSV exports |

Remaining Limitations

  1. Local Hash IDs: Hash IDs are still local per conversation, not global across the dataset. The tester handles cache simulation internally, but true cross-conversation cache analysis requires global hashing.

  2. Relative Timestamps: Timestamps are still relative to conversation start. The tester manages realistic replay, but global traffic pattern analysis is limited.

  3. Claude-Focused: Traces are from Claude Code sessions, so patterns may not perfectly represent other agentic workloads.

Example: Production Capacity Test

python trace_replay_tester.py \
    --api-endpoint http://localhost:8000 \
    --trace-directory ./traces \
    --output-dir ./capacity_test \
    --start-users 10 \
    --max-users 100 \
    --max-ttft 2.0 \
    --ttft-metric p95 \
    --max-new-tokens-per-period 1000000 \
    --test-duration 1800 \
    --recycle \
    --warm-prefix-pct 0.5 \
    --advance-min 0.0 --advance-max 0.5 \
    --prime-cache \
    --seed 42

This configuration:

  • Starts with 10 users, scales up to 100
  • Targets 2.0s p95 TTFT
  • Limits cache churn to 1M tokens per 30s period
  • Recycles completed users with new traces
  • Simulates 50% warm prefix cache sharing
  • Starts users at random positions (0-50%) with cache priming
  • Runs for 30 minutes