Inference providers that want to contribute to open source projects are limited in what they can share: they must respect user privacy and protect proprietary prompts. Real LLM inference traces contain sensitive information, but without realistic traces, benchmarks can't accurately reflect production workloads.
Block hashing converts tokens into cryptographic hash IDs that preserve prefix-matching patterns while protecting content. The approach is based on Mooncake AI's design (USENIX FAST'25).
Key insight: each block's hash incorporates the previous block's hash, so identical prefixes produce identical hash sequences. This lets you benchmark prefix-caching systems with realistic hit rates.
block_hashes = {}  # Maps raw hash values to small sequential IDs
counter = 0        # Next hash ID to assign

def get_block_hashes(ids, prev_hash):
    """Each block's hash depends on ALL previous blocks via prev_hash chaining."""
    global counter
    # Combine this block's token hash with the previous block's hash
    hash_val = hash(hash(tuple(ids)) + prev_hash)
    if hash_val not in block_hashes:
        block_hashes[hash_val] = counter
        counter += 1
    return block_hashes[hash_val], hash_val
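As a quick illustration (the token IDs below are made up, and the chain is assumed to start from a prev_hash of 0), two requests whose first 4-token block is identical get the same leading hash ID:

# Hypothetical token IDs: the first 4-token block is identical in both requests
req_a = [101, 2054, 2003, 1996, 3007, 1997, 2605]
req_b = [101, 2054, 2003, 1996, 3007, 1997, 5288]

block_size = 4
for name, ids in [("A", req_a), ("B", req_b)]:
    prev_hash, hash_ids = 0, []
    for i in range(0, len(ids), block_size):
        hash_id, prev_hash = get_block_hashes(ids[i:i + block_size], prev_hash)
        hash_ids.append(hash_id)
    print(name, hash_ids)  # e.g. A -> [0, 1], B -> [0, 2]: the leading block matches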
# Setup
python3 -m venv venv
source venv/bin/activate
pip install transformers jinja2
# Run
python block_parse.py
# View results
cat block_parse_output.jsonl

Input:
{"timestamp": 0, "input": [{"role": "user", "content": "What is the capital of France?"}], "output": "The capital of France is Paris..."}
{"timestamp": 300, "input": [{"role": "user", "content": "What is the capital of Finland?"}], "output": "The capital of Finland is Helsinki."}Output:
{"timestamp": 0, "input_length": 12, "output_length": 30, "hash_ids": [1, 2, 3]}
{"timestamp": 300, "input_length": 12, "output_length": 7, "hash_ids": [1, 12, 13]}Both start with hash_id: 1 → first 4 tokens match → cache hit detected!
BLOCK_SIZE = 4 # Tokens per block (Mooncake uses 512 in production)
model_name = "deepseek-ai/DeepSeek-R1" # Match your deploymentNote: If your input is already formatted with a chat template, modify line 39:
# Before:
input_ids = tokenizer.encode(tokenizer.apply_chat_template(input_prompt, tokenize=False, add_generation_prompt=True), add_special_tokens=False)
# After:
input_ids = tokenizer.encode(input_prompt, add_special_tokens=False)
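For context, here is a rough end-to-end sketch of how these pieces fit together: tokenize the chat-templated input, split it into BLOCK_SIZE-token blocks, and chain them through get_block_hashes to produce a record in the output format shown above. Variable and function names here are my own, not necessarily those in block_parse.py, and the sketch assumes the model's tokenizer can be downloaded.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)

def build_trace_record(request):
    """Turn one raw request into a sanitized trace record (no text or token IDs retained)."""
    input_ids = tokenizer.encode(
        tokenizer.apply_chat_template(request["input"], tokenize=False, add_generation_prompt=True),
        add_special_tokens=False,
    )
    output_ids = tokenizer.encode(request["output"], add_special_tokens=False)
    prev_hash, hash_ids = 0, []
    for i in range(0, len(input_ids), BLOCK_SIZE):
        hash_id, prev_hash = get_block_hashes(input_ids[i:i + BLOCK_SIZE], prev_hash)
        hash_ids.append(hash_id)
    return {
        "timestamp": request["timestamp"],
        "input_length": len(input_ids),
        "output_length": len(output_ids),
        "hash_ids": hash_ids,
    }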
Shared (safe):
- Request timestamps
- Token counts
- Hash ID sequences
- Cache hit patterns
Protected (never exposed):
- Actual text content
- Token IDs
- User information
- Proprietary prompts
Use cases:
- Share traces with partners under NDA
- Create reproducible benchmark recipes
- Test prefix caching with realistic hit rates (see the hit-ratio sketch after this list)
- Model actual workload characteristics
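For example, here is a rough sketch (again not part of block_parse.py) that estimates an upper-bound prefix-cache hit ratio from a trace file, assuming an unbounded cache that retains every block it has ever seen:

import json

def estimate_hit_ratio(trace_path):
    """Upper-bound hit ratio: fraction of blocks already produced by an earlier request."""
    seen, hits, total = set(), 0, 0
    with open(trace_path) as f:
        for line in f:
            record = json.loads(line)
            for hash_id in record["hash_ids"]:
                total += 1
                if hash_id in seen:
                    hits += 1
                seen.add(hash_id)
    return hits / total if total else 0.0

print(estimate_hit_ratio("block_parse_output.jsonl"))

Because each hash ID encodes its entire prefix, a repeated ID means the whole prefix up to that block could have been served from cache, so the estimate can be compared against hit ratios reported for real deployments as a sanity check on how realistic a trace is.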
Based on Mooncake AI's work on KV-cache-centric LLM serving (USENIX FAST'25). Their production deployment shows 50% cache hit ratios on real workloads.
Note:
Python's built-in hash() function is not guaranteed to be stable across interpreter versions, platforms, or runs (string hashing is randomized via PYTHONHASHSEED), so hash IDs may not reproduce across environments. I would recommend using hashlib (e.g., md5 + hexdigest()) or similar to get a stable hash function. Open questions: Can we use the block hash from Dynamo directly? Are output blocks or lengths helpful?
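A minimal sketch of that suggested hashlib-based variant, assuming the same chained-block scheme (this is an alternative idea, not the current script's behavior; names are illustrative):

import hashlib

stable_block_hashes = {}   # hex digest -> small sequential ID
stable_counter = 0

def get_block_hashes_stable(ids, prev_hash=""):
    """Chained block hash using hashlib.md5 so hash IDs reproduce across runs and machines."""
    global stable_counter
    data = (prev_hash + ":" + ",".join(map(str, ids))).encode()
    hash_val = hashlib.md5(data).hexdigest()
    if hash_val not in stable_block_hashes:
        stable_block_hashes[hash_val] = stable_counter
        stable_counter += 1
    return stable_block_hashes[hash_val], hash_val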