


Ben Hamm (@BenHamm)

  • OctoML
  • Seattle, WA
@BenHamm
BenHamm / AIC_PREDICTION_MISMATCH_GIST.md
Last active December 1, 2025 23:24
AIConfigurator Prediction Mismatch: 7-8% vs 102-148% Disaggregated Serving Performance Gains

AIConfigurator Performance Prediction Mismatch

Summary

We tested AIConfigurator (version 0.4.0) against the performance claims in the "Advanced Disagg Perf Tuning" guide and found a significant discrepancy between AIC's predictions and the guide's reported results.

Key Finding: AIC predicts disaggregated serving provides 7-8% improvement, while the guide reports 102-148% improvement - a 10-20x difference in expected gains.

Source Document: The guide being tested is from PR #4655 by davilu-nvidia (submitted Nov 27, 2025, currently under review and not yet merged).

@BenHamm
BenHamm / AIPERF-PRESENTATION.md
Last active November 14, 2025 09:32
AIPerf Comprehensive Benchmarking Guide - WIP
@BenHamm
BenHamm / README.md
Last active November 4, 2025 02:20
Block Hash: Privacy-Preserving LLM Trace Sharing


The Problem

Inference providers that want to contribute to open source projects are limited in what they can share while respecting user privacy and protecting proprietary prompts. Real LLM inference traces contain sensitive information, but without realistic traces, benchmarks can't accurately reflect production workloads.

The Solution

Block hashing converts tokens into cryptographic hash IDs that preserve prefix-matching patterns while protecting content. Based on Mooncake AI's approach (USENIX FAST'25).
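As a rough illustration of the idea, here is a minimal sketch assuming chained SHA-256 over fixed-size token blocks; the block size and exact hash construction are illustrative, not the gist's actual scheme:

import hashlib

BLOCK_SIZE = 16  # tokens per block; illustrative value, not taken from the gist

def block_hashes(token_ids, block_size=BLOCK_SIZE):
    # Chain each block's hash to its predecessor so that two traces
    # sharing a token prefix also share their leading hash IDs, which
    # preserves prefix-cache hit patterns without exposing raw tokens.
    hashes = []
    prev = b""
    for i in range(0, len(token_ids), block_size):
        block = token_ids[i:i + block_size]
        digest = hashlib.sha256(prev + b"|" + ",".join(map(str, block)).encode()).hexdigest()
        hashes.append(digest)
        prev = digest.encode()
    return hashes

# Two traces with a shared 16-token prefix agree on the first hash ID only
a = block_hashes(list(range(32)))
b = block_hashes(list(range(16)) + [99] * 16)
assert a[0] == b[0] and a[1] != b[1]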

@BenHamm
BenHamm / aiperf_benchmark_results.md
Created November 3, 2025 21:24
AIPerf Benchmark: Claude Sonnet 4.5 via NVIDIA API - 8K input, 1K output, 100 requests

AIPerf Benchmark: Claude Sonnet 4.5 via NVIDIA API

Overview

Performance benchmark of aws/anthropic/bedrock-claude-sonnet-4-5-v1 model hosted on NVIDIA's inference API, testing with large context windows (8K input tokens, 1K output tokens).

Benchmark Date: November 3, 2025
Tool: AIPerf v0.2.0
Test Configuration: 100 requests with streaming enabled
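For reference, the same setup condensed as data. This is a descriptive sketch only; the key names are ours, not literal AIPerf v0.2.0 CLI flags, which this excerpt doesn't show:

# Descriptive reconstruction of the run configuration (key names are
# ours, not AIPerf flags)
benchmark_config = {
    "model": "aws/anthropic/bedrock-claude-sonnet-4-5-v1",
    "endpoint": "NVIDIA inference API",
    "input_tokens_mean": 8000,   # ~8K-token prompts
    "output_tokens_mean": 1000,  # ~1K-token completions
    "request_count": 100,
    "streaming": True,
}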

@BenHamm
BenHamm / brev-instance-alert-setup.md
Last active October 24, 2025 20:58
Brev Instance Alert Setup for macOS - Never forget to stop your instances!


Brev Instance Alert Setup for macOS

Never forget to stop your Brev instances again! This setup creates an alert that pops up every 4 hours if you have running Brev instances.
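A minimal sketch of the check behind such an alert, assuming the Brev CLI lists instances via `brev ls` and that running instances show a RUNNING status (both assumptions; the actual gist may wire this up through launchd differently):

import subprocess

# Assumption: `brev ls` lists instances and running ones include "RUNNING"
out = subprocess.run(["brev", "ls"], capture_output=True, text=True).stdout
if "RUNNING" in out:
    # Native macOS popup via AppleScript
    subprocess.run([
        "osascript", "-e",
        'display alert "Brev instances are running!" message "Remember to stop them."',
    ])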

What the Alert Looks Like

Brev Instance Alert Screenshot

@BenHamm
BenHamm / QWEN32B_DISAGGREGATION_REPORT.md
Created October 22, 2025 17:02
Qwen3-32B Disaggregation Performance Analysis: Why Disaggregation Underperformed

Qwen3-32B Disaggregation Performance Analysis

Date: October 22, 2025
Environment: Nebius H200 Kubernetes Cluster
Infrastructure: Dynamo LLM Serving Platform v0.5.0
Model: Qwen/Qwen3-32B with FP8 quantization


1. Executive Summary

@BenHamm
BenHamm / AIPerf_permutations_guide.md
Last active September 27, 2025 00:51
AI Perf Permutations

AIPerf Profiling of Text, Image, & Embeddings Endpoints

This tutorial captures end-to-end reference flows for running AIPerf against vLLM-hosted models. Each chapter covers a specific OpenAI-compatible endpoint: how to launch the vLLM server, run the AIPerf benchmark, and interpret the results.
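To make the "OpenAI-compatible" part concrete, here is a minimal smoke test against a locally hosted vLLM server; this is a sketch, and the base URL, port, and model name are placeholders for whatever the tutorial launches:

from openai import OpenAI

# Placeholders: adjust base_url and model to match the vLLM server you launched
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
resp = client.chat.completions.create(
    model="Qwen/Qwen3-32B",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)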

FAKE TensorRT-Cloud PyTorch Sweep Quickstart Guide

This guide walks you through using TensorRT-Cloud to run performance sweeps with the TRT LLM PyTorch backend.

--> THIS GUIDE IS PSEUDOCODE AND JUST A PRODUCT MANAGER'S SUGGESTION. WE HAVE NOT YET BUILT THIS FEATURE <---

Overview

Unlike the C++ backend which uses ahead-of-time (AoT) compilation, TRT LLM PyTorch uses just-in-time (JIT) compilation. This means we'll be configuring runtime parameters rather than build parameters for our performance sweep.
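In that spirit, a purely hypothetical sketch of what sweeping runtime (rather than build) parameters could look like; the guide above is explicitly unbuilt, and every parameter name here is a placeholder, not a real flag:

import itertools

# Hypothetical runtime-parameter grid; names are placeholders
sweep_space = {
    "max_batch_size": [8, 16, 32],
    "kv_cache_free_gpu_mem_fraction": [0.8, 0.9],
    "enable_chunked_prefill": [True, False],
}

for values in itertools.product(*sweep_space.values()):
    run_config = dict(zip(sweep_space.keys(), values))
    # With JIT compilation these would be applied at launch time,
    # not baked into an ahead-of-time engine build.
    print(run_config)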

import speech_recognition as sr
from text_to_speech import text_to_speech  # a different module, but you can guess what it does
import pygame

pygame.mixer.init()

def capture_speech():
    r = sr.Recognizer()
    with sr.Microphone() as source:  # use the default microphone as the audio source
        pygame.mixer.Sound("open_mic.wav").play()  # play a prompt sound so the user knows they can speak
        audio = r.listen(source)  # record until a pause in speech (assumed continuation; the preview is truncated here)
    return audio
import speech_recognition as sr
import time
import sys

# Temporarily redirect stdout to /dev/null so import-time prints are silenced
sys.stdout = open('/dev/null', 'w')
from text_to_speech import text_to_speech
import pygame
sys.stdout = sys.__stdout__  # restore normal output

pygame.mixer.init()