
@BenHamm
Last active February 16, 2026 23:29
SGLang B200 Benchmark Sweep — 9 Models on 8×B200 (NVSwitch)


Date: 2026-02-15/16
Hardware: 8× NVIDIA B200 183GB, NV18 NVSwitch, Xeon 8570 (224 cores), 2TB RAM
GPU Clocks: SM 1965 MHz, MEM 3996 MHz (locked at max)
Driver: 570.195.03
Framework: SGLang v0.5.8.post1 (Docker lmsysorg/sglang:v0.5.8.post1)
Benchmark: sglang.bench_serving, random 1K input / 1K output tokens
OS: Ubuntu 24.04.3 LTS


Executive Summary

Benchmarked 9 SGLang cookbook models on a single 8×B200 NVSwitch node. All models ran with their cookbook-recommended configs. 8 of 9 completed successfully; DeepSeek-V3.2-Exp crashed during DeepGEMM JIT warmup (known issue).

Top throughput per GPU (tok/s/GPU at peak concurrency):

  1. Nemotron3-Nano-30B — 6,830 tok/s/GPU (TP=1, FP8) 🥇
  2. Qwen3-Coder-Next — 1,497 tok/s/GPU (TP=2, BF16) 🥈
  3. GPT-OSS-120B — 1,179 tok/s/GPU (TP=8, MXFP4) 🥉

Fastest decode latency (TPOT at c=1):

  1. GPT-OSS-120B — 2.41ms 🥇
  2. Nemotron3-Nano-30B — 4.36ms 🥈
  3. Qwen3-Coder-Next — 4.53ms 🥉

Results Summary

| # | Model | Params (Total / Active) | TP | Quant | Latency tok/s (c=1) | TPOT (c=1) | Throughput tok/s | Peak tok/s | tok/s/GPU |
|---|-------|------------------------|----|-------|---------------------|------------|------------------|------------|-----------|
| 1 | GLM-5-FP8 | 744B / ~40B | 8 | FP8 | 112 | 7.69ms* | 1,370 | n/a | 171 |
| 2 | Nemotron3-Nano-30B | 30B / 3B | 1 | FP8 | 223 | 4.36ms | 6,830 | 11,272 | 6,830 |
| 3 | Qwen3-Coder-Next | 80B / 3B | 2 | BF16 | 204 | 4.53ms | 2,994 | 5,708 | 1,497 |
| 4 | GPT-OSS-120B | 117B / ~12B | 8 | MXFP4 | 397 | 2.41ms | 9,432 | 13,021 | 1,179 |
| 5 | Qwen3-235B-A22B | 235B / 22B | 8 | BF16 | 145 | 6.70ms | 3,366 | 4,651 | 421 |
| 6 | GLM-4.6 | ~700B | 8 | BF16 | 66 | 14.93ms | 1,822 | 2,900 | 228 |
| 7 | Kimi-K2-Instruct | 1T / ~32B | 8 | BF16 | 128 | 7.54ms | 2,166 | 4,205 | 271 |
| 8 | DeepSeek-V3.2-Exp | 685B / ~37B | 8 | BF16 | ❌ CRASHED | ❌ CRASHED | n/a | n/a | n/a |
| 9 | DeepSeek-R1-0528-FP4 | 685B / ~37B | 8 | FP4 | 88 | 5.74ms | 302 | 5,088 | 38 |

*GLM-5 TPOT measured with EAGLE speculative decoding (accept length ~3.52). Without EAGLE: 20.22ms.

⚠️ Caveat: Concurrency levels varied by model (c=100 for most, c=256 for Nemotron3-Nano) and may not represent saturation points. TP values follow cookbook recommendations, not a controlled variable. tok/s/GPU normalization helps compare efficiency but models with lower TP inherently have less communication overhead.
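The tok/s/GPU column is plain normalization of sustained throughput by TP degree; a minimal sketch using the figures from the table above:

```python
# Normalize aggregate throughput by TP degree to get tok/s/GPU.
# Figures are the sustained (c=100/c=256) throughput numbers above.
results = {
    # model: (sustained throughput tok/s, TP)
    "Nemotron3-Nano-30B": (6830, 1),
    "Qwen3-Coder-Next": (2994, 2),
    "GPT-OSS-120B": (9432, 8),
    "Qwen3-235B-A22B": (3366, 8),
}

per_gpu = {m: round(tput / tp) for m, (tput, tp) in results.items()}
print(per_gpu)  # GPT-OSS-120B -> 1179, Qwen3-Coder-Next -> 1497, ...
```

As the caveat notes, this is an efficiency comparison only: a TP=1 model pays no all-reduce cost, so the normalized numbers flatter low-TP configs.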


Per-Model Details

1. GLM-5-FP8 (744B MoE, ~40B active)

  • Model: zai-org/GLM-5-FP8
  • Config: TP=8, EAGLE speculative decoding (3 steps, topk=1, 4 draft tokens)
  • Image: lmsysorg/sglang:glm5-blackwell
  • Full report: GLM-5 B200 Benchmark Gist
| Metric | Value |
|--------|-------|
| Latency (c=1) output tok/s | 112 |
| TTFT (c=1) | 246ms |
| TPOT (c=1, EAGLE) | 7.69ms |
| TPOT (c=1, no EAGLE) | 20.22ms |
| Throughput (c=100) output tok/s | 1,370 |
| EAGLE accept length | 3.52 |

Notes: EAGLE is critical — 2.6× better decode latency. B200 beats H200 by ~13% throughput (1,370 vs 1,215 tok/s).


2. Nemotron3-Nano-30B-A3B-FP8 (30B, 3B active — Mamba-MoE hybrid)

  • Model: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8
  • Config: TP=1 (single B200), --max-running-requests 1024
| Metric | Value |
|--------|-------|
| Latency (c=1) output tok/s | 223 |
| TTFT (c=1, median) | 46ms |
| TPOT (c=1) | 4.36ms |
| Throughput (c=256) output tok/s | 6,830 |
| Peak output tok/s | 11,272 |

Notes: The only model that runs on a single GPU. The Mamba SSM cache consumed 57.7GB (conv_state 1.0GB + ssm_state 56.7GB). Throughput came in below the cookbook reference (11,552 tok/s), likely due to an SGLang version difference.
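The reported cache figures imply a per-request state budget; quick arithmetic, under the assumption (not verified here) that the 57.7GB cache is pre-allocated to cover the full `--max-running-requests 1024`:

```python
# Per-request Mamba state budget, assuming the reported 57.7 GB cache
# is sized for the full --max-running-requests 1024 (an assumption,
# not something the benchmark measured directly).
conv_state_gb = 1.0
ssm_state_gb = 56.7
max_running = 1024

per_request_mb = (conv_state_gb + ssm_state_gb) * 1024 / max_running
print(f"~{per_request_mb:.1f} MB of SSM/conv state per request")
```

Unlike a KV cache, this footprint is constant per request regardless of sequence length, which is part of why the Mamba-MoE hybrid sustains such high concurrency on one GPU.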


3. Qwen3-Coder-Next (80B, 3B active — MoE)

  • Model: Qwen/Qwen3-Coder-Next
  • Config: TP=2
| Metric | c=1 | c=16 | c=100 |
|--------|-----|------|-------|
| Output tok/s | 204 | 1,281 | 2,994 |
| Peak tok/s | 226 | 1,758 | 5,708 |
| TPOT | 4.53ms | 11.52ms | 31.53ms |
| TTFT (median) | 148ms | 146ms | 156ms |

Notes: Only needs 2 GPUs. Excellent latency scaling. 256K context support.


4. GPT-OSS-120B (117B MoE, ~12B active) ⭐ TOP THROUGHPUT

  • Model: openai/gpt-oss-120b
  • Config: TP=8, --reasoning-parser gpt-oss, MXFP4 MoE kernels
| Metric | c=1 | c=16 | c=100 |
|--------|-----|------|-------|
| Output tok/s | 397 | 3,221 | 9,432 |
| Peak tok/s | 417 | 3,942 | 13,021 |
| TPOT | 2.41ms | 4.70ms | 9.53ms |
| TTFT (median) | 37ms | 35ms | 62ms |

Notes: Absolute throughput king. MXFP4 quantization + efficient MoE routing = incredible B200 utilization. 2.41ms TPOT at c=1 is the fastest decode of any model tested.


5. Qwen3-235B-A22B-Instruct-2507 (235B MoE, 22B active)

  • Model: Qwen/Qwen3-235B-A22B-Instruct-2507
  • Config: TP=8, BF16
| Metric | c=1 | c=16 | c=100 |
|--------|-----|------|-------|
| Output tok/s | 145 | 1,048 | 3,366 |
| Peak tok/s | 150 | 1,263 | 4,651 |
| TPOT | 6.70ms | 14.06ms | 26.77ms |
| TTFT (median) | 69ms | 73ms | 84ms |

Notes: Solid throughput for 22B active params. BF16 weights (~438GB). Stable TTFT across concurrency levels.


6. GLM-4.6 (~700B MoE)

  • Model: zai-org/GLM-4.6
  • Config: TP=8, BF16
| Metric | c=1 | c=16 | c=100 |
|--------|-----|------|-------|
| Output tok/s | 66 | 570 | 1,822 |
| Peak tok/s | 67 | 720 | 2,900 |
| TPOT | 14.93ms | 26.18ms | 49.89ms |
| TTFT (median) | 109ms | 118ms | 128ms |

Notes: Slowest model tested. Very large BF16 weights (665GB). Thinking model (responds with <think> tags). Standard sglang image works (doesn't need glm5-blackwell).


7. Kimi-K2-Instruct (1T MoE, ~32B active)

  • Model: moonshotai/Kimi-K2-Instruct
  • Config: TP=8, BF16
| Metric | c=1 | c=100 |
|--------|-----|-------|
| Output tok/s | 128 | 2,166 |
| Peak tok/s | 133 | 4,205 |
| TPOT | 7.54ms | 42.23ms |
| TTFT (median) | 116ms | 208ms |

Notes: Largest model tested (1T params, ~959GB weights). Required DeepGEMM JIT warmup (~2 min for 32K kernels). Solid throughput despite massive size.


8. DeepSeek-V3.2-Exp (685B MoE) ❌ CRASHED

  • Model: deepseek-ai/DeepSeek-V3.2-Exp
  • Config: TP=8, attempted with --mem-fraction-static 0.80 --dist-timeout 3600
  • Result: Server crashes during DeepGEMM JIT compile warmup with NCCL timeout / EOFError
  • Tried: Default config, reduced mem fraction — both failed
  • Root cause: Likely OOM during warmup phase when both model weights + JIT compilation buffers exceed GPU memory

9. DeepSeek-R1-0528-FP4 (685B MoE, ~37B active)

  • Model: nvidia/DeepSeek-R1-0528-FP4-v2
  • Config: TP=8, --quantization modelopt_fp4, --mem-fraction-static 0.80
| Metric | c=1 | c=100 |
|--------|-----|-------|
| Output tok/s | 88 | 302 |
| Peak tok/s | 175 | 5,088 |
| TPOT | 5.74ms | 323.14ms |
| TTFT (median) | 2,598ms | 2,711ms |

Notes: FP4 quant allowed DeepSeek R1 to load where V3.2 BF16 crashed. High TTFT (~2.6s) likely due to FP4 dequantization overhead during prefill. Massive gap between peak (5,088) and sustained (302) throughput suggests memory pressure. Reasoning model — responds with <think> blocks.


Key Observations

1. MXFP4 is a game-changer for MoE throughput

GPT-OSS at 9,432 tok/s absolutely dominates. MXFP4 quantization reduces memory bandwidth per expert while maintaining quality, letting the B200s' bandwidth shine.
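A back-of-envelope bound shows why weight precision dominates decode: each token must stream the active expert weights from HBM. A sketch, assuming ~8 TB/s per B200 (a ballpark figure, not measured here) and ignoring KV-cache traffic, activations, and kernel efficiency:

```python
# Rough memory-bandwidth bound on single-stream decode.
# bytes_per_token = active params x bytes per weight. This ignores
# KV cache, activations, and TP sharding, so it is only an upper
# bound on single-stream speed, not a prediction.
HBM_BYTES_PER_S = 8e12  # ~8 TB/s per B200 (assumed ballpark)

def decode_bound_tok_s(active_params: float, bits_per_weight: float) -> float:
    bytes_per_token = active_params * bits_per_weight / 8
    return HBM_BYTES_PER_S / bytes_per_token

# GPT-OSS-120B: ~12B active, MXFP4 (4-bit) vs a hypothetical BF16 run
print(decode_bound_tok_s(12e9, 4))   # MXFP4: ~1333 tok/s bound per GPU
print(decode_bound_tok_s(12e9, 16))  # BF16:  ~333 tok/s bound per GPU
```

Whatever the absolute bandwidth figure, the 4× ratio between 4-bit and 16-bit weights is exact, which is the structural reason MXFP4 lifts MoE decode throughput.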

2. Active parameter count correlates with decode speed

| Active Params | Best TPOT (c=1) | Model |
|---------------|-----------------|-------|
| ~3B | 4.36ms | Nemotron3-Nano |
| ~3B | 4.53ms | Qwen3-Coder-Next |
| ~12B | 2.41ms | GPT-OSS (MXFP4 helps) |
| ~22B | 6.70ms | Qwen3-235B |
| ~32B | 7.54ms | Kimi-K2 |
| ~37B | 5.74ms | DeepSeek-R1 (FP4) |
| ~40B | 7.69ms | GLM-5 (EAGLE) |

3. NVSwitch enables massive TP scaling

All 8-GPU models benefit from NVSwitch's ~900 GB/s bisection bandwidth. TP=8 all-reduce is nearly free, which is why disaggregated serving (splitting GPUs) performs worse on this topology.

4. DeepGEMM warmup is a deployment concern

Both DeepSeek models and Kimi-K2 require JIT-compiling ~32K GEMM kernels on cold start. Kimi-K2 took ~2 min; DeepSeek-V3.2 crashed entirely. Production deployments should pre-warm or use persistent containers.
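One way to pre-warm: gate traffic on the server's health endpoint, then force one generation so JIT paths compile before real load arrives. A sketch, assuming SGLang's `/health` and `/generate` HTTP endpoints on the launch port (verify the paths against your SGLang version):

```shell
# Pre-warm a cold server so DeepGEMM JIT compilation finishes before
# traffic arrives. Assumes SGLang exposes /health and /generate on
# port 30000 (check your version's HTTP API).
until curl -sf http://localhost:30000/health > /dev/null; do
  echo "waiting for server..."
  sleep 10
done

# A single short generation exercises remaining lazy-init paths.
curl -s http://localhost:30000/generate \
  -H 'Content-Type: application/json' \
  -d '{"text": "warmup", "sampling_params": {"max_new_tokens": 8}}'
```

Persistent containers (rather than fresh launches per benchmark) avoid paying the ~2 min kernel-compile cost repeatedly.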

5. FP4 quantization: loads but with caveats

DeepSeek-R1 FP4 succeeded where BF16 V3.2 crashed, but sustained throughput (302 tok/s) was much lower than peak (5,088), and TTFT was 2.6s. FP4 dequantization overhead during prefill is significant.

6. EAGLE speculative decoding matters for large models

GLM-5 with EAGLE: 7.69ms TPOT. Without: 20.22ms. For decode-heavy workloads on 700B+ models, spec decode is essential.
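The measured numbers also let you back out EAGLE's per-step overhead: with accept length 3.52, the ideal speedup would be 3.52×, but the observed TPOT improvement is smaller. A quick check using only the figures reported above:

```python
# Relate EAGLE's accept length to the observed TPOT improvement.
# accept_length tokens are emitted per verify step, so the ideal
# speedup equals accept_length; the shortfall is draft + verify cost.
tpot_base = 20.22     # ms, no EAGLE (measured above)
tpot_eagle = 7.69     # ms, with EAGLE (measured above)
accept_length = 3.52  # measured above

observed_speedup = tpot_base / tpot_eagle        # ~2.63x
overhead_factor = accept_length / observed_speedup  # ~1.34x per-step cost

print(f"observed {observed_speedup:.2f}x of an ideal {accept_length}x")
print(f"implied per-step overhead ~{overhead_factor:.2f}x")
```

So each EAGLE step costs roughly a third more than a plain decode step here, which is still an easy win at 3.52 accepted tokens per step.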


Hardware Utilization Notes

  • Disk: Models range from 31GB (Nemotron3-Nano FP8) to 959GB (Kimi-K2 BF16). Plan for 1TB+ cache per large model.
  • GPU Memory: All 8×183GB = 1.46TB total. Kimi-K2 (1T, BF16) used nearly all of it. FP8/FP4 quant reduces memory 2-4×.
  • GPU Clocks: Locked at max (SM 1965, MEM 3996) for consistent benchmarks.
  • Interconnect: NV18 NVSwitch — all-to-all GPU communication is ~free for TP.
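For capacity planning, a first-order weight-memory estimate (params × bytes per param) gets you close; a sketch, noting it ignores KV cache, CUDA graphs, and activation buffers, which is why real deployments still need `--mem-fraction-static` headroom:

```python
# First-order weight-memory estimate: params x bytes per param.
# Ignores KV cache, CUDA graphs, and activation buffers, so actual
# GPU memory use is higher than these figures.
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "FP4": 0.5}
TOTAL_GPU_GB = 8 * 183  # 1,464 GB across the node

def weight_gb(params_billions: float, dtype: str) -> float:
    return params_billions * BYTES_PER_PARAM[dtype]  # billions -> GB

print(weight_gb(235, "BF16"))  # ~470 GB (doc reports ~438 GB on disk)
print(weight_gb(685, "FP4"))   # ~343 GB, vs a BF16 load that crashed
```

The estimate slightly overshoots the measured checkpoint sizes (tokenizer/embedding details differ per model), but it is accurate enough to predict the FP8/FP4 2-4× savings noted above.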

Reproducibility

All benchmarks used:

```shell
# Server launch
sudo docker run -d --name bench-server \
  --gpus all --ipc=host --net=host \
  -v /home/nvadmin/.cache/huggingface:/root/.cache/huggingface \
  lmsysorg/sglang:v0.5.8.post1 \
  python3 -m sglang.launch_server \
    --model-path <MODEL> --tp <TP> \
    --host 0.0.0.0 --port 30000
```

```shell
# Benchmarks
python3 -m sglang.bench_serving \
  --backend sglang --model <MODEL> \
  --dataset-name random \
  --random-input-len 1000 --random-output-len 1000 \
  --num-prompts <N> --max-concurrency <C>
```

Concurrency levels: c=1 (latency), c=16 (mid), c=100 (throughput). Prompt counts: 10/80/500-1000.
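The three-level sweep can be scripted as a single loop; a sketch pairing each concurrency with a prompt count in the ranges above (the model name is an example placeholder):

```shell
# Sweep the three concurrency levels used in this report. Prompt
# counts scale with concurrency so each run reaches steady state.
MODEL=openai/gpt-oss-120b   # example; substitute the model under test
for pair in "1 10" "16 80" "100 500"; do
  set -- $pair
  C=$1
  N=$2
  python3 -m sglang.bench_serving \
    --backend sglang --model "$MODEL" \
    --dataset-name random \
    --random-input-len 1000 --random-output-len 1000 \
    --num-prompts "$N" --max-concurrency "$C"
done
```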


Benchmarked by Thermidor 🦞 on 8×B200, 2026-02-15/16
