Block Hash: Privacy-Preserving LLM Trace Sharing

The Problem

Inference providers that want to contribute to open source projects are limited in what they can share while respecting user privacy and protecting proprietary prompts. Real LLM inference traces contain sensitive information, but without realistic traces, benchmarks can't accurately reflect production workloads.

The Solution

Block hashing converts blocks of tokens into opaque hash IDs that preserve prefix-matching patterns while protecting content. Based on the Mooncake approach (USENIX FAST'25).

Key insight: Hash blocks include the previous block's hash, so identical prefixes produce identical hash sequences. This lets you benchmark prefix caching systems with realistic hit rates.

How It Works

# block_hashes maps raw hash values to small sequential IDs; counter is the next ID to assign.
block_hashes = {}
counter = 1

def get_block_hashes(ids, prev_hash):
    """Each block's hash depends on ALL previous blocks via prev_hash chaining."""
    global counter
    hash_val = hash(hash(tuple(ids)) + prev_hash)
    if hash_val not in block_hashes:
        block_hashes[hash_val] = counter
        counter += 1
    return block_hashes[hash_val], hash_val
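
A quick sketch of the prefix-matching property, reusing get_block_hashes and the block_hashes/counter globals above (the token IDs here are made up purely for illustration):

BLOCK_SIZE = 4

def hash_ids_for(token_ids):
    """Chunk token_ids into blocks and return their sequential hash IDs."""
    prev_hash, ids = 0, []
    for i in range(0, len(token_ids), BLOCK_SIZE):
        block_id, prev_hash = get_block_hashes(token_ids[i:i + BLOCK_SIZE], prev_hash)
        ids.append(block_id)
    return ids

# Two requests that share their first 4 tokens share their first hash ID,
# then diverge as soon as the token blocks differ.
print(hash_ids_for([10, 11, 12, 13, 20, 21, 22, 23]))  # e.g. [1, 2]
print(hash_ids_for([10, 11, 12, 13, 30, 31, 32, 33]))  # e.g. [1, 3]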

Quick Start

# Setup
python3 -m venv venv
source venv/bin/activate
pip install transformers jinja2

# Run (reads input_raw.jsonl from the working directory)
python block_parse.py

# View results
cat block_parse_output.jsonl
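
If you don't already have a trace, a minimal input_raw.jsonl for a smoke test can be written like this (the two rows are taken from the sample input file shown at the bottom of this gist):

import json

rows = [
    {"timestamp": 0, "input": [{"role": "user", "content": "What is the capital of France?"}],
     "output": "The capital of France is Paris."},
    {"timestamp": 300, "input": [{"role": "user", "content": "What is the capital of Finland?"}],
     "output": "The capital of Finland is Helsinki."},
]
with open("input_raw.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")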

Example Conversion

Input:

{"timestamp": 0, "input": [{"role": "user", "content": "What is the capital of France?"}], "output": "The capital of France is Paris..."}
{"timestamp": 300, "input": [{"role": "user", "content": "What is the capital of Finland?"}], "output": "The capital of Finland is Helsinki."}

Output:

{"timestamp": 0, "input_length": 12, "output_length": 30, "hash_ids": [1, 2, 3]}
{"timestamp": 300, "input_length": 12, "output_length": 7, "hash_ids": [1, 12, 13]}

Both start with hash_id: 1 → first 4 tokens match → cache hit detected!
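
One way to turn the shared trace into a cache-hit estimate is to count a block as a hit whenever its hash ID appeared in any earlier request. This rough sketch ignores eviction and cache capacity:

import json

seen, hit_blocks, total_blocks = set(), 0, 0
with open("block_parse_output.jsonl", encoding="utf-8") as f:
    for line in f:
        for hid in json.loads(line)["hash_ids"]:
            total_blocks += 1
            if hid in seen:
                hit_blocks += 1
            seen.add(hid)
print(f"Block-level prefix hit rate: {hit_blocks / total_blocks:.0%}")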

Configuration

BLOCK_SIZE = 4  # Tokens per block (Mooncake uses 512 in production)
model_name = "deepseek-ai/DeepSeek-R1"  # Match your deployment
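
The number of hash IDs per request is just the token count divided by BLOCK_SIZE, rounded up; with BLOCK_SIZE = 4, the 12-token example above yields the 3 hash IDs shown:

import math

BLOCK_SIZE = 4
print(math.ceil(12 / BLOCK_SIZE))  # 3 blocks -> hash_ids like [1, 2, 3]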

Note: If your input is already formatted with a chat template, modify the input tokenization line in block_parse.py:

# Before:
input_ids = tokenizer.encode(tokenizer.apply_chat_template(input_prompt, tokenize=False, add_generation_prompt=True), add_special_tokens=False)

# After:
input_ids = tokenizer.encode(input_prompt, add_special_tokens=False)

What Gets Shared vs Protected

Shared (safe):

  • Request timestamps
  • Token counts
  • Hash ID sequences
  • Cache hit patterns

Protected (never exposed):

  • Actual text content
  • Token IDs
  • User information
  • Proprietary prompts

Use Cases

  • Share traces with partners under NDA
  • Create reproducible benchmark recipes
  • Test prefix caching with realistic hit rates
  • Model actual workload characteristics

Credit

Based on Mooncake's work on KV-cache-centric LLM serving (USENIX FAST'25). Their production deployment shows 50% cache hit ratios on real workloads.

block_parse.py

import json
from transformers import AutoTokenizer

input_file = "input_raw.jsonl"
output_file = "block_parse_output.jsonl"
BLOCK_SIZE = 4
model_name = "deepseek-ai/DeepSeek-R1"
tokenizer_needed = True

if tokenizer_needed:
    tokenizer = AutoTokenizer.from_pretrained(model_name)

# Maps raw hash values to small sequential IDs, so shared traces never expose the hashes themselves.
block_hashes = {}
counter = 1

def get_block_hashes(ids, prev_hash):
    """Each block's hash depends on ALL previous blocks via prev_hash chaining."""
    global counter
    hash_val = hash(hash(tuple(ids)) + prev_hash)
    if hash_val not in block_hashes:
        block_hashes[hash_val] = counter
        counter += 1
    return block_hashes[hash_val], hash_val

# Read the input jsonl line by line and write one anonymized record per request.
with open(input_file, 'r', encoding='utf-8') as file:
    with open(output_file, 'w', encoding='utf-8') as outfile:
        for line in file:
            data = json.loads(line)
            input_prompt = data['input']
            output_prompt = data['output']
            print(input_prompt, output_prompt)  # debug print of the raw request
            if tokenizer_needed:
                input_ids = tokenizer.encode(tokenizer.apply_chat_template(input_prompt, tokenize=False, add_generation_prompt=True), add_special_tokens=False)
                output_ids = tokenizer.encode(output_prompt, add_special_tokens=False)
            else:
                # Inputs are already token ID lists; use them directly.
                input_ids = input_prompt
                output_ids = output_prompt
            input_length = len(input_ids)
            output_length = len(output_ids)
            timestamp = data['timestamp']
            input_hashes = []
            prev_hash = 0
            for i in range(0, len(input_ids), BLOCK_SIZE):
                block_ids = input_ids[i:i + BLOCK_SIZE]
                block_hash, prev_hash = get_block_hashes(block_ids, prev_hash)
                input_hashes.append(block_hash)
            outfile.write(json.dumps({
                "timestamp": timestamp,
                "input_length": input_length,
                "output_length": output_length,
                "hash_ids": input_hashes,
            }) + '\n')
{"timestamp": 0, "input_length": 12, "output_length": 30, "hash_ids": [1, 2, 3]}
{"timestamp": 100, "input_length": 14, "output_length": 46, "hash_ids": [4, 5, 6, 7]}
{"timestamp": 200, "input_length": 15, "output_length": 69, "hash_ids": [8, 9, 10, 11]}
{"timestamp": 300, "input_length": 12, "output_length": 7, "hash_ids": [1, 12, 13]}
{"timestamp": 100, "input_length": 21, "output_length": 46, "hash_ids": [4, 5, 14, 15, 16, 17]}
{"timestamp": 0, "input": [{"role": "user", "content": "What is the capital of France?"}], "output": "The capital of France is Paris. It is located in the north-central part of the country and has been the capital since the 12th century."}
{"timestamp": 100, "input": [{"role": "user", "content": "Explain what machine learning is in simple terms."}], "output": "Machine learning is a type of artificial intelligence that allows computers to learn from data and improve their performance over time without being explicitly programmed for every task. Think of it like teaching a child through examples rather than giving them a rulebook."}
{"timestamp": 200, "input": [{"role": "user", "content": "How do I make a good cup of coffee?"}], "output": "To make a good cup of coffee: 1) Use fresh, quality beans, 2) Grind them just before brewing, 3) Use the right water temperature (195-205°F), 4) Maintain proper coffee-to-water ratio (1:15-1:17), and 5) Clean your equipment regularly."}
{"timestamp": 300, "input": [{"role": "user", "content": "What is the capital of Finland?"}], "output": "The capital of Finland is Helsinki."}
{"timestamp": 100, "input": [{"role": "user", "content": "Explain what machine learning is in simple terms. It should be a short answer."}], "output": "Machine learning is a type of artificial intelligence that allows computers to learn from data and improve their performance over time without being explicitly programmed for every task. Think of it like teaching a child through examples rather than giving them a rulebook."}
@michaelfeil commented Nov 4, 2025

Note: the hash() function in Python depends on interpreter state. I would recommend using hashlib, e.g. md5 + hexdigest, or similar to get a stable hash function. Can we use the block hash from Dynamo directly?

Are output blocks or length helpful?
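
A stable alternative along the lines of this suggestion might look like the following (a sketch using hashlib; it is not part of the original script, and chaining the previous block's hex digest is just one possible choice):

import hashlib

def stable_block_hash(ids, prev_hex=""):
    """Interpreter-independent block hash: chain the previous block's hex digest."""
    payload = prev_hex + "," + ",".join(map(str, ids))
    return hashlib.md5(payload.encode()).hexdigest()

h1 = stable_block_hash([10, 11, 12, 13])
h2 = stable_block_hash([20, 21, 22, 23], prev_hex=h1)  # same inputs -> same digests across runs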
