
Ben Hamm BenHamm

  • OctoML
  • Seattle, WA
BenHamm / DEPLOYMENT.md
Last active March 13, 2026 20:00
Constrained Decoding for VLM Content Safety Classification — Qwen2.5-VL + vLLM structured outputs eliminates 5% JSON schema failure rate

Deployment Guide

Prerequisites

  • Kubernetes cluster with NVIDIA GPU nodes
  • kubectl configured and authenticated
  • GPU with ≥16 GB VRAM (A100, H100, B200, etc.)

1. Create Namespace and Storage
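A minimal manifest sketch for this step, assuming a hypothetical `vlm-safety` namespace and a PersistentVolumeClaim for model weights (the names and storage size are illustrative, not taken from the guide):

```yaml
# namespace-and-storage.yaml -- applied with: kubectl apply -f namespace-and-storage.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: vlm-safety
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-weights
  namespace: vlm-safety
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 50Gi   # Qwen2.5-VL weights plus headroom; adjust to your model
```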

BenHamm / b200-sweep-report.md
Last active February 16, 2026 23:29
SGLang B200 Benchmark Sweep — 9 Models on 8×B200 (NVSwitch)

SGLang B200 Benchmark Sweep — 9 Models on 8×B200

Date: 2026-02-15/16
Hardware: 8× NVIDIA B200 183GB, NV18 NVSwitch, Xeon 8570 (224 cores), 2TB RAM
GPU Clocks: SM 1965 MHz, MEM 3996 MHz (locked at max)
Driver: 570.195.03
Framework: SGLang v0.5.8.post1 (Docker lmsysorg/sglang:v0.5.8.post1)
Benchmark: sglang.bench_serving, random 1K input / 1K output tokens
OS: Ubuntu 24.04.3 LTS

BenHamm / context-management-rules.md
Created February 16, 2026 03:57
Context Management Rules for Long-Running AI Sub-Agents

Context Management Rules for Long-Running AI Sub-Agents

When an AI agent runs long tasks (benchmarks, deployments, CI pipelines), context overflow is the #1 failure mode. Compaction only runs between turns — if a single turn accumulates too much tool output, the agent dies with no recovery.

The Problem

A sub-agent doing GPU benchmarks might chain tool calls like:

  1. docker pull → 200 lines of layer progress
  2. docker logs → 500 lines of model loading / warmup
  3. bench_serving → 1000+ lines of progress bars
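One mitigation for the chain above is to clip every tool result before it enters context, keeping only the head and tail of long output. A minimal sketch (the function name and line limits are illustrative, not part of the rules document):

```python
def clip_tool_output(text: str, head: int = 20, tail: int = 20, max_lines: int = 60) -> str:
    """Keep only the first `head` and last `tail` lines of a long tool
    output, replacing the middle with an elision marker, so a single
    turn of chained tool calls cannot flood the agent's context."""
    lines = text.splitlines()
    if len(lines) <= max_lines:
        return text
    omitted = len(lines) - head - tail
    return "\n".join(lines[:head] + [f"... [{omitted} lines omitted] ..."] + lines[-tail:])
```

With this in place, a 1000-line `bench_serving` progress log collapses to 41 lines before the agent ever sees it.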
BenHamm / openclaw-glm5-b200-benchmark-report.md
Last active February 14, 2026 06:44
GLM-5-FP8 Inference Benchmark: 8×B200 vs 8×H200 (SGLang + EAGLE)

OpenClaw - GLM-5-FP8 Inference Benchmark Report: 8×B200 vs 8×H200

Date: 2026-02-13
Model: zai-org/GLM-5-FP8 (744B MoE, 40B active parameters)
Framework: SGLang (lmsysorg/sglang:glm5-blackwell)
Hardware: 8× NVIDIA B200 183GB, NV18 NVSwitch, Xeon 8570 (224 cores), 2TB RAM
Reference: SGLang GLM-5 Cookbook (8×H200 141GB)
AI Agent: Claude-4.6-Opus running on OpenClaw


BenHamm / BENCHMARK_RESULTS_GIST.md
Created December 17, 2025 23:35
Qwen3-32B Disaggregated Serving Benchmark Results - AIConfigurator vs Actual Performance

Qwen3-32B Disaggregated Serving Benchmark Results

Date: December 17, 2025
Model: Qwen/Qwen3-32B-FP8
Cluster: Nebius H200 (16 GPUs)
Framework: TensorRT-LLM via Dynamo


1. Cluster Configuration

BenHamm / AIC_WALKTHROUGH_GUIDE.md
Last active December 19, 2025 17:02
AIConfigurator Walkthrough: Finding Optimal LLM Deployment Configurations

AIConfigurator: Fast-Track Your LLM Deployment on NVIDIA Dynamo

What is NVIDIA Dynamo?

NVIDIA Dynamo is a high-throughput, low-latency inference framework for serving generative AI models across multi-node GPU clusters. As LLMs grow beyond what a single GPU can handle, Dynamo solves the orchestration challenge of coordinating shards, routing requests, and transferring KV cache data across distributed systems.

Key capabilities:

  • Disaggregated serving — Separates prefill and decode phases for optimized GPU utilization
  • KV-aware routing — Routes requests to workers with the highest cache hit rate
  • KV Block Manager — Offloads KV cache to CPU, SSD, or remote memory (G2/G3/G4) for higher throughput
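The KV-aware routing idea can be illustrated with a toy router that scores each worker by how many leading block hashes of the request it already has cached. This is only a sketch of the concept, not Dynamo's actual implementation; all names are hypothetical:

```python
def pick_worker(request_blocks: list[str], worker_caches: dict[str, set[str]]) -> str:
    """Toy KV-aware router: count how many *leading* block hashes of the
    request each worker already holds in its KV cache, and route the
    request to the worker with the longest cached prefix."""
    def prefix_hits(cached: set[str]) -> int:
        hits = 0
        for block_hash in request_blocks:
            if block_hash not in cached:
                break  # prefix reuse stops at the first miss
            hits += 1
        return hits
    return max(worker_caches, key=lambda w: prefix_hits(worker_caches[w]))
```

For example, a request with blocks `["h1", "h2", "h3"]` routes to a worker caching `{"h1", "h2"}` over one caching only `{"h1"}`, because the former can skip prefill for two blocks.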
BenHamm / AIC_PREDICTION_MISMATCH_GIST.md
Last active December 1, 2025 23:24
AIConfigurator Prediction Mismatch: 7-8% vs 102-148% Disaggregated Serving Performance Gains

AIConfigurator Performance Prediction Mismatch

Summary

We tested AIConfigurator (version 0.4.0) against the performance claims in the "Advanced Disagg Perf Tuning" guide and found a significant discrepancy between AIC's predictions and the guide's reported results.

Key Finding: AIC predicts disaggregated serving provides a 7-8% improvement, while the guide reports 102-148% improvement, roughly a 13-21x gap in expected gains.

Source Document: The guide being tested is from PR #4655 by davilu-nvidia (submitted Nov 27, 2025, currently under review and not yet merged).
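As a sanity check, the gap between the two ranges can be computed directly from the percentages quoted in the summary:

```python
# AIC's predicted gain vs. the guide's reported gain, as fractions
aic = (0.07, 0.08)      # AIC prediction: 7-8% improvement
guide = (1.02, 1.48)    # guide report: 102-148% improvement

# Smallest and largest ratios between the two ranges
lo = guide[0] / aic[1]  # 1.02 / 0.08 = 12.75
hi = guide[1] / aic[0]  # 1.48 / 0.07 ≈ 21.1
print(f"gap: {lo:.1f}x to {hi:.1f}x")
```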

BenHamm / AIPERF-PRESENTATION.md
Last active January 29, 2026 18:42
AIPerf Comprehensive Benchmarking Guide - WIP
BenHamm / README.md
Last active November 4, 2025 02:20
Block Hash: Privacy-Preserving LLM Trace Sharing

Block Hash: Privacy-Preserving LLM Trace Sharing

The Problem

Inference providers that want to contribute to open source projects are limited in what they can share while respecting user privacy and protecting proprietary prompts. Real LLM inference traces contain sensitive information, but without realistic traces, benchmarks can't accurately reflect production workloads.

The Solution

Block hashing converts tokens into cryptographic hash IDs that preserve prefix-matching patterns while protecting content. Based on Mooncake AI's approach (USENIX FAST'25).
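A minimal sketch of the chaining idea, assuming SHA-256 and a fixed block size (both illustrative; see the Mooncake paper for the real scheme). Because each block's hash folds in the previous block's hash, two token sequences that share a prefix produce identical leading hash IDs, which is exactly the signal prefix-cache benchmarks need, while the raw tokens stay unrecoverable:

```python
import hashlib

def block_hashes(token_ids: list[int], block_size: int = 16) -> list[str]:
    """Convert a token sequence into chained block-hash IDs: each ID is
    the hash of (previous ID + this block's tokens), so equal prefixes
    yield equal leading IDs without exposing token content."""
    hashes: list[str] = []
    prev = b""
    full = len(token_ids) - len(token_ids) % block_size  # drop the partial tail block
    for i in range(0, full, block_size):
        block = token_ids[i:i + block_size]
        digest = hashlib.sha256(prev + repr(block).encode()).hexdigest()[:16]
        hashes.append(digest)
        prev = digest.encode()
    return hashes
```

Two traces sharing their first 16 tokens get the same first hash ID, and their IDs diverge at the first differing block.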

BenHamm / aiperf_benchmark_results.md
Created November 3, 2025 21:24
AIPerf Benchmark: Claude Sonnet 4.5 via NVIDIA API - 8K input, 1K output, 100 requests

AIPerf Benchmark: Claude Sonnet 4.5 via NVIDIA API

Overview

Performance benchmark of the aws/anthropic/bedrock-claude-sonnet-4-5-v1 model hosted on NVIDIA's inference API, tested with large context windows (8K input tokens, 1K output tokens).

Benchmark Date: November 3, 2025
Tool: AIPerf v0.2.0
Test Configuration: 100 requests with streaming enabled