
Ben Hamm BenHamm

  • OctoML
  • Seattle, WA
BenHamm / DEPLOYMENT.md
Last active March 13, 2026 20:00
Constrained Decoding for VLM Content Safety Classification — Qwen2.5-VL + vLLM structured outputs eliminates 5% JSON schema failure rate

Deployment Guide

Prerequisites

  • Kubernetes cluster with NVIDIA GPU nodes
  • kubectl configured and authenticated
  • GPU with ≥16 GB VRAM (A100, H100, B200, etc.)

1. Create Namespace and Storage
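A minimal manifest sketch for this step, assuming a hypothetical `vlm-safety` namespace and a PersistentVolumeClaim for model weights (the names and storage size are illustrative, not taken from the guide):

```yaml
# namespace-and-storage.yaml -- applied with: kubectl apply -f namespace-and-storage.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: vlm-safety
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-weights
  namespace: vlm-safety
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 50Gi   # Qwen2.5-VL weights plus headroom; adjust to your model
```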

BenHamm / b200-sweep-report.md
Last active February 16, 2026 23:29
SGLang B200 Benchmark Sweep — 9 Models on 8×B200 (NVSwitch)

SGLang B200 Benchmark Sweep — 9 Models on 8×B200

Date: 2026-02-15/16
Hardware: 8× NVIDIA B200 183GB, NV18 NVSwitch, Xeon 8570 (224 cores), 2TB RAM
GPU Clocks: SM 1965 MHz, MEM 3996 MHz (locked at max)
Driver: 570.195.03
Framework: SGLang v0.5.8.post1 (Docker lmsysorg/sglang:v0.5.8.post1)
Benchmark: sglang.bench_serving, random 1K input / 1K output tokens
OS: Ubuntu 24.04.3 LTS

BenHamm / context-management-rules.md
Created February 16, 2026 03:57
Context Management Rules for Long-Running AI Sub-Agents

Context Management Rules for Long-Running AI Sub-Agents

When an AI agent runs long tasks (benchmarks, deployments, CI pipelines), context overflow is the #1 failure mode. Compaction only runs between turns — if a single turn accumulates too much tool output, the agent dies with no recovery.

The Problem

A sub-agent doing GPU benchmarks might chain tool calls like:

  1. docker pull → 200 lines of layer progress
  2. docker logs → 500 lines of model loading / warmup
  3. bench_serving → 1000+ lines of progress bars
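One mitigation for the chain above is to clip every tool result before it enters context, keeping only the head and tail of long output. A minimal sketch (the function name and line limits are illustrative, not part of the rules document):

```python
def clip_tool_output(text: str, head: int = 20, tail: int = 20, max_lines: int = 60) -> str:
    """Keep only the first `head` and last `tail` lines of a long tool
    output, replacing the middle with an elision marker, so a single
    turn of chained tool calls cannot flood the agent's context."""
    lines = text.splitlines()
    if len(lines) <= max_lines:
        return text
    omitted = len(lines) - head - tail
    return "\n".join(lines[:head] + [f"... [{omitted} lines omitted] ..."] + lines[-tail:])
```

With this in place, a 1000-line `bench_serving` progress log collapses to 41 lines before the agent ever sees it.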
BenHamm / openclaw-glm5-b200-benchmark-report.md
Last active February 14, 2026 06:44
GLM-5-FP8 Inference Benchmark: 8×B200 vs 8×H200 (SGLang + EAGLE)

OpenClaw - GLM-5-FP8 Inference Benchmark Report: 8×B200 vs 8×H200

Date: 2026-02-13
Model: zai-org/GLM-5-FP8 (744B MoE, 40B active parameters)
Framework: SGLang (lmsysorg/sglang:glm5-blackwell)
Hardware: 8× NVIDIA B200 183GB, NV18 NVSwitch, Xeon 8570 (224 cores), 2TB RAM
Reference: SGLang GLM-5 Cookbook (8×H200 141GB)
AI Agent: Claude-4.6-Opus running on OpenClaw


BenHamm / BENCHMARK_RESULTS_GIST.md
Created December 17, 2025 23:35
Qwen3-32B Disaggregated Serving Benchmark Results - AIConfigurator vs Actual Performance

Qwen3-32B Disaggregated Serving Benchmark Results

Date: December 17, 2025
Model: Qwen/Qwen3-32B-FP8
Cluster: Nebius H200 (16 GPUs)
Framework: TensorRT-LLM via Dynamo


1. Cluster Configuration

BenHamm / AIC_WALKTHROUGH_GUIDE.md
Last active December 19, 2025 17:02
AIConfigurator Walkthrough: Finding Optimal LLM Deployment Configurations

AIConfigurator: Fast-Track Your LLM Deployment on NVIDIA Dynamo

What is NVIDIA Dynamo?

NVIDIA Dynamo is a high-throughput, low-latency inference framework for serving generative AI models across multi-node GPU clusters. As LLMs grow beyond what a single GPU can handle, Dynamo solves the orchestration challenge of coordinating shards, routing requests, and transferring KV cache data across distributed systems.

Key capabilities:

  • Disaggregated serving — Separates prefill and decode phases for optimized GPU utilization
  • KV-aware routing — Routes requests to workers with the highest cache hit rate
  • KV Block Manager — Offloads KV cache to CPU, SSD, or remote memory (G2/G3/G4) for higher throughput
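The KV-aware routing idea can be illustrated with a toy router that scores each worker by how many leading block hashes of the request it already has cached. This is only a sketch of the concept, not Dynamo's actual implementation; all names are hypothetical:

```python
def pick_worker(request_blocks: list[str], worker_caches: dict[str, set[str]]) -> str:
    """Toy KV-aware router: count how many *leading* block hashes of the
    request each worker already holds in its KV cache, and route the
    request to the worker with the longest cached prefix."""
    def prefix_hits(cached: set[str]) -> int:
        hits = 0
        for block_hash in request_blocks:
            if block_hash not in cached:
                break  # prefix reuse stops at the first miss
            hits += 1
        return hits
    return max(worker_caches, key=lambda w: prefix_hits(worker_caches[w]))
```

For example, a request with blocks `["h1", "h2", "h3"]` routes to a worker caching `{"h1", "h2"}` over one caching only `{"h1"}`, because the former can skip prefill for two blocks.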
BenHamm / AIC_PREDICTION_MISMATCH_GIST.md
Last active December 1, 2025 23:24
AIConfigurator Prediction Mismatch: 7-8% vs 102-148% Disaggregated Serving Performance Gains

AIConfigurator Performance Prediction Mismatch

Summary

We tested AIConfigurator (version 0.4.0) against the performance claims in the "Advanced Disagg Perf Tuning" guide and found a significant discrepancy between AIC's predictions and the guide's reported results.

Key Finding: AIC predicts disaggregated serving provides a 7-8% improvement, while the guide reports 102-148% improvement, roughly a 13-21x gap in expected gains.

Source Document: The guide being tested is from PR #4655 by davilu-nvidia (submitted Nov 27, 2025, currently under review and not yet merged).
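As a sanity check, the gap between the two ranges can be computed directly from the percentages quoted in the summary:

```python
# AIC's predicted gain vs. the guide's reported gain, as fractions
aic = (0.07, 0.08)      # AIC prediction: 7-8% improvement
guide = (1.02, 1.48)    # guide report: 102-148% improvement

# Smallest and largest ratios between the two ranges
lo = guide[0] / aic[1]  # 1.02 / 0.08 = 12.75
hi = guide[1] / aic[0]  # 1.48 / 0.07 ≈ 21.1
print(f"gap: {lo:.1f}x to {hi:.1f}x")
```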

BenHamm / AIPERF-PRESENTATION.md
Last active January 29, 2026 18:42
AIPerf Comprehensive Benchmarking Guide - WIP
BenHamm / README.md
Last active November 4, 2025 02:20
Block Hash: Privacy-Preserving LLM Trace Sharing

Block Hash: Privacy-Preserving LLM Trace Sharing

The Problem

Inference providers that want to contribute to open source projects are limited in what they can share while respecting user privacy and protecting proprietary prompts. Real LLM inference traces contain sensitive information, but without realistic traces, benchmarks can't accurately reflect production workloads.

The Solution

Block hashing converts tokens into cryptographic hash IDs that preserve prefix-matching patterns while protecting content. Based on Mooncake AI's approach (USENIX FAST'25).
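A minimal sketch of the chaining idea, assuming SHA-256 and a fixed block size (both illustrative; see the Mooncake paper for the real scheme). Because each block's hash folds in the previous block's hash, two token sequences that share a prefix produce identical leading hash IDs, which is exactly the signal prefix-cache benchmarks need, while the raw tokens stay unrecoverable:

```python
import hashlib

def block_hashes(token_ids: list[int], block_size: int = 16) -> list[str]:
    """Convert a token sequence into chained block-hash IDs: each ID is
    the hash of (previous ID + this block's tokens), so equal prefixes
    yield equal leading IDs without exposing token content."""
    hashes: list[str] = []
    prev = b""
    full = len(token_ids) - len(token_ids) % block_size  # drop the partial tail block
    for i in range(0, full, block_size):
        block = token_ids[i:i + block_size]
        digest = hashlib.sha256(prev + repr(block).encode()).hexdigest()[:16]
        hashes.append(digest)
        prev = digest.encode()
    return hashes
```

Two traces sharing their first 16 tokens get the same first hash ID, and their IDs diverge at the first differing block.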

BenHamm / aiperf_benchmark_results.md
Created November 3, 2025 21:24
AIPerf Benchmark: Claude Sonnet 4.5 via NVIDIA API - 8K input, 1K output, 100 requests

AIPerf Benchmark: Claude Sonnet 4.5 via NVIDIA API

Overview

Performance benchmark of the aws/anthropic/bedrock-claude-sonnet-4-5-v1 model hosted on NVIDIA's inference API, tested with large context windows (8K input tokens, 1K output tokens).

Benchmark Date: November 3, 2025
Tool: AIPerf v0.2.0
Test Configuration: 100 requests with streaming enabled