
@BenHamm
Last active February 16, 2026 23:29
SGLang B200 Benchmark Sweep — 9 Models on 8×B200 (NVSwitch)


Date: 2026-02-15/16
Hardware: 8× NVIDIA B200 183GB, NV18 NVSwitch, Xeon 8570 (224 cores), 2TB RAM
GPU Clocks: SM 1965 MHz, MEM 3996 MHz (locked at max)
Driver: 570.195.03
Framework: SGLang v0.5.8.post1 (Docker lmsysorg/sglang:v0.5.8.post1)
Benchmark: sglang.bench_serving, random 1K input / 1K output tokens
OS: Ubuntu 24.04.3 LTS


Executive Summary

Benchmarked 9 SGLang cookbook models on a single 8×B200 NVSwitch node. All models ran with their cookbook-recommended configs. 8 of 9 completed successfully; DeepSeek-V3.2-Exp crashed during DeepGEMM JIT warmup (known issue).

Top throughput per GPU (tok/s/GPU at peak concurrency):

  1. Nemotron3-Nano-30B — 6,830 tok/s/GPU (TP=1, FP8) 🥇
  2. Qwen3-Coder-Next — 1,497 tok/s/GPU (TP=2, BF16) 🥈
  3. GPT-OSS-120B — 1,179 tok/s/GPU (TP=8, MXFP4) 🥉

Fastest decode latency (TPOT at c=1):

  1. GPT-OSS-120B — 2.41ms 🥇
  2. Nemotron3-Nano-30B — 4.36ms 🥈
  3. Qwen3-Coder-Next — 4.53ms 🥉

Results Summary

| # | Model | Params (Total / Active) | TP | Quant | Latency tok/s (c=1) | TPOT (c=1) | Throughput tok/s | Peak tok/s | tok/s/GPU |
|---|-------|------------------------|----|-------|---------------------|------------|------------------|------------|-----------|
| 1 | GLM-5-FP8 | 744B / ~40B | 8 | FP8 | 112 | 7.69ms* | 1,370 | n/a | 171 |
| 2 | Nemotron3-Nano-30B | 30B / 3B | 1 | FP8 | 223 | 4.36ms | 6,830 | 11,272 | 6,830 |
| 3 | Qwen3-Coder-Next | 80B / 3B | 2 | BF16 | 204 | 4.53ms | 2,994 | 5,708 | 1,497 |
| 4 | GPT-OSS-120B | 117B / ~12B | 8 | MXFP4 | 397 | 2.41ms | 9,432 | 13,021 | 1,179 |
| 5 | Qwen3-235B-A22B | 235B / 22B | 8 | BF16 | 145 | 6.70ms | 3,366 | 4,651 | 421 |
| 6 | GLM-4.6 | ~700B | 8 | BF16 | 66 | 14.93ms | 1,822 | 2,900 | 228 |
| 7 | Kimi-K2-Instruct | 1T / ~32B | 8 | BF16 | 128 | 7.54ms | 2,166 | 4,205 | 271 |
| 8 | DeepSeek-V3.2-Exp | 685B / ~37B | 8 | BF16 | ❌ CRASHED | ❌ CRASHED | n/a | n/a | n/a |
| 9 | DeepSeek-R1-0528-FP4 | 685B / ~37B | 8 | FP4 | 88 | 5.74ms | 302 | 5,088 | 38 |

*GLM-5 TPOT measured with EAGLE speculative decoding (accept length ~3.52). Without EAGLE: 20.22ms.

⚠️ Caveat: Concurrency levels varied by model (c=100 for most, c=256 for Nemotron3-Nano) and may not represent saturation points. TP values follow cookbook recommendations, not a controlled variable. tok/s/GPU normalization helps compare efficiency but models with lower TP inherently have less communication overhead.
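The tok/s/GPU column is plain normalization of sustained throughput by TP degree; a minimal sketch using the figures from the table above:

```python
# Normalize aggregate throughput by TP degree to get tok/s/GPU.
# Figures are the sustained (c=100/c=256) throughput numbers above.
results = {
    # model: (sustained throughput tok/s, TP)
    "Nemotron3-Nano-30B": (6830, 1),
    "Qwen3-Coder-Next": (2994, 2),
    "GPT-OSS-120B": (9432, 8),
    "Qwen3-235B-A22B": (3366, 8),
}

per_gpu = {m: round(tput / tp) for m, (tput, tp) in results.items()}
print(per_gpu)  # GPT-OSS-120B -> 1179, Qwen3-Coder-Next -> 1497, ...
```

As the caveat notes, this is an efficiency comparison only: a TP=1 model pays no all-reduce cost, so the normalized numbers flatter low-TP configs.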


Per-Model Details

1. GLM-5-FP8 (744B MoE, ~40B active)

  • Model: zai-org/GLM-5-FP8
  • Config: TP=8, EAGLE speculative decoding (3 steps, topk=1, 4 draft tokens)
  • Image: lmsysorg/sglang:glm5-blackwell
  • Full report: GLM-5 B200 Benchmark Gist
| Metric | Value |
|--------|-------|
| Latency (c=1) output tok/s | 112 |
| TTFT (c=1) | 246ms |
| TPOT (c=1, EAGLE) | 7.69ms |
| TPOT (c=1, no EAGLE) | 20.22ms |
| Throughput (c=100) output tok/s | 1,370 |
| EAGLE accept length | 3.52 |

Notes: EAGLE is critical — 2.6× better decode latency. B200 beats H200 by ~13% throughput (1,370 vs 1,215 tok/s).


2. Nemotron3-Nano-30B-A3B-FP8 (30B, 3B active — Mamba-MoE hybrid)

  • Model: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8
  • Config: TP=1 (single B200), --max-running-requests 1024
| Metric | Value |
|--------|-------|
| Latency (c=1) output tok/s | 223 |
| TTFT (c=1, median) | 46ms |
| TPOT (c=1) | 4.36ms |
| Throughput (c=256) output tok/s | 6,830 |
| Peak output tok/s | 11,272 |

Notes: The only model that runs on a single GPU. The Mamba SSM cache consumed 57.7GB (conv_state 1.0GB + ssm_state 56.7GB). Throughput came in below the cookbook reference (11,552 tok/s), likely due to an SGLang version difference.
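The reported cache figures imply a per-request state budget; quick arithmetic, under the assumption (not verified here) that the 57.7GB cache is pre-allocated to cover the full `--max-running-requests 1024`:

```python
# Per-request Mamba state budget, assuming the reported 57.7 GB cache
# is sized for the full --max-running-requests 1024 (an assumption,
# not something the benchmark measured directly).
conv_state_gb = 1.0
ssm_state_gb = 56.7
max_running = 1024

per_request_mb = (conv_state_gb + ssm_state_gb) * 1024 / max_running
print(f"~{per_request_mb:.1f} MB of SSM/conv state per request")
```

Unlike a KV cache, this footprint is constant per request regardless of sequence length, which is part of why the Mamba-MoE hybrid sustains such high concurrency on one GPU.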


3. Qwen3-Coder-Next (80B, 3B active — MoE)

  • Model: Qwen/Qwen3-Coder-Next
  • Config: TP=2
| Metric | c=1 | c=16 | c=100 |
|--------|-----|------|-------|
| Output tok/s | 204 | 1,281 | 2,994 |
| Peak tok/s | 226 | 1,758 | 5,708 |
| TPOT | 4.53ms | 11.52ms | 31.53ms |
| TTFT (median) | 148ms | 146ms | 156ms |

Notes: Only needs 2 GPUs. Excellent latency scaling. 256K context support.


4. GPT-OSS-120B (117B MoE, ~12B active) ⭐ TOP THROUGHPUT

  • Model: openai/gpt-oss-120b
  • Config: TP=8, --reasoning-parser gpt-oss, MXFP4 MoE kernels
| Metric | c=1 | c=16 | c=100 |
|--------|-----|------|-------|
| Output tok/s | 397 | 3,221 | 9,432 |
| Peak tok/s | 417 | 3,942 | 13,021 |
| TPOT | 2.41ms | 4.70ms | 9.53ms |
| TTFT (median) | 37ms | 35ms | 62ms |

Notes: Absolute throughput king. MXFP4 quantization + efficient MoE routing = incredible B200 utilization. 2.41ms TPOT at c=1 is the fastest decode of any model tested.


5. Qwen3-235B-A22B-Instruct-2507 (235B MoE, 22B active)

  • Model: Qwen/Qwen3-235B-A22B-Instruct-2507
  • Config: TP=8, BF16
| Metric | c=1 | c=16 | c=100 |
|--------|-----|------|-------|
| Output tok/s | 145 | 1,048 | 3,366 |
| Peak tok/s | 150 | 1,263 | 4,651 |
| TPOT | 6.70ms | 14.06ms | 26.77ms |
| TTFT (median) | 69ms | 73ms | 84ms |

Notes: Solid throughput for 22B active params. BF16 weights (~438GB). Stable TTFT across concurrency levels.


6. GLM-4.6 (~700B MoE)

  • Model: zai-org/GLM-4.6
  • Config: TP=8, BF16
| Metric | c=1 | c=16 | c=100 |
|--------|-----|------|-------|
| Output tok/s | 66 | 570 | 1,822 |
| Peak tok/s | 67 | 720 | 2,900 |
| TPOT | 14.93ms | 26.18ms | 49.89ms |
| TTFT (median) | 109ms | 118ms | 128ms |

Notes: Slowest model tested. Very large BF16 weights (665GB). Thinking model (responds with <think> tags). Standard sglang image works (doesn't need glm5-blackwell).


7. Kimi-K2-Instruct (1T MoE, ~32B active)

  • Model: moonshotai/Kimi-K2-Instruct
  • Config: TP=8, BF16
| Metric | c=1 | c=100 |
|--------|-----|-------|
| Output tok/s | 128 | 2,166 |
| Peak tok/s | 133 | 4,205 |
| TPOT | 7.54ms | 42.23ms |
| TTFT (median) | 116ms | 208ms |

Notes: Largest model tested (1T params, ~959GB weights). Required DeepGEMM JIT warmup (~2 min for 32K kernels). Solid throughput despite massive size.


8. DeepSeek-V3.2-Exp (685B MoE) ❌ CRASHED

  • Model: deepseek-ai/DeepSeek-V3.2-Exp
  • Config: TP=8, attempted with --mem-fraction-static 0.80 --dist-timeout 3600
  • Result: Server crashes during DeepGEMM JIT compile warmup with NCCL timeout / EOFError
  • Tried: Default config, reduced mem fraction — both failed
  • Root cause: Likely OOM during warmup phase when both model weights + JIT compilation buffers exceed GPU memory

9. DeepSeek-R1-0528-FP4 (685B MoE, ~37B active)

  • Model: nvidia/DeepSeek-R1-0528-FP4-v2
  • Config: TP=8, --quantization modelopt_fp4, --mem-fraction-static 0.80
| Metric | c=1 | c=100 |
|--------|-----|-------|
| Output tok/s | 88 | 302 |
| Peak tok/s | 175 | 5,088 |
| TPOT | 5.74ms | 323.14ms |
| TTFT (median) | 2,598ms | 2,711ms |

Notes: FP4 quant allowed DeepSeek R1 to load where V3.2 BF16 crashed. High TTFT (~2.6s) likely due to FP4 dequantization overhead during prefill. Massive gap between peak (5,088) and sustained (302) throughput suggests memory pressure. Reasoning model — responds with <think> blocks.


Key Observations

1. MXFP4 is a game-changer for MoE throughput

GPT-OSS at 9,432 tok/s absolutely dominates. MXFP4 quantization reduces memory bandwidth per expert while maintaining quality, letting the B200s' bandwidth shine.
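A back-of-envelope bound shows why weight precision dominates decode: each token must stream the active expert weights from HBM. A sketch, assuming ~8 TB/s per B200 (a ballpark figure, not measured here) and ignoring KV-cache traffic, activations, and kernel efficiency:

```python
# Rough memory-bandwidth bound on single-stream decode.
# bytes_per_token = active params x bytes per weight. This ignores
# KV cache, activations, and TP sharding, so it is only an upper
# bound on single-stream speed, not a prediction.
HBM_BYTES_PER_S = 8e12  # ~8 TB/s per B200 (assumed ballpark)

def decode_bound_tok_s(active_params: float, bits_per_weight: float) -> float:
    bytes_per_token = active_params * bits_per_weight / 8
    return HBM_BYTES_PER_S / bytes_per_token

# GPT-OSS-120B: ~12B active, MXFP4 (4-bit) vs a hypothetical BF16 run
print(decode_bound_tok_s(12e9, 4))   # MXFP4: ~1333 tok/s bound per GPU
print(decode_bound_tok_s(12e9, 16))  # BF16:  ~333 tok/s bound per GPU
```

Whatever the absolute bandwidth figure, the 4× ratio between 4-bit and 16-bit weights is exact, which is the structural reason MXFP4 lifts MoE decode throughput.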

2. Active parameter count correlates with decode speed

| Active Params | Best TPOT (c=1) | Model |
|---------------|-----------------|-------|
| ~3B | 4.36ms | Nemotron3-Nano |
| ~3B | 4.53ms | Qwen3-Coder-Next |
| ~12B | 2.41ms | GPT-OSS (MXFP4 helps) |
| ~22B | 6.70ms | Qwen3-235B |
| ~32B | 7.54ms | Kimi-K2 |
| ~37B | 5.74ms | DeepSeek-R1 (FP4) |
| ~40B | 7.69ms | GLM-5 (EAGLE) |

3. NVSwitch enables massive TP scaling

All 8-GPU models benefit from NVSwitch's ~900 GB/s bisection bandwidth. TP=8 all-reduce is nearly free, which is why disaggregated serving (splitting GPUs) performs worse on this topology.

4. DeepGEMM warmup is a deployment concern

Both DeepSeek models and Kimi-K2 require JIT-compiling ~32K GEMM kernels on cold start. Kimi-K2 took ~2 min; DeepSeek-V3.2 crashed entirely. Production deployments should pre-warm or use persistent containers.
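One way to pre-warm: gate traffic on the server's health endpoint, then force one generation so JIT paths compile before real load arrives. A sketch, assuming SGLang's `/health` and `/generate` HTTP endpoints on the launch port (verify the paths against your SGLang version):

```shell
# Pre-warm a cold server so DeepGEMM JIT compilation finishes before
# traffic arrives. Assumes SGLang exposes /health and /generate on
# port 30000 (check your version's HTTP API).
until curl -sf http://localhost:30000/health > /dev/null; do
  echo "waiting for server..."
  sleep 10
done

# A single short generation exercises remaining lazy-init paths.
curl -s http://localhost:30000/generate \
  -H 'Content-Type: application/json' \
  -d '{"text": "warmup", "sampling_params": {"max_new_tokens": 8}}'
```

Persistent containers (rather than fresh launches per benchmark) avoid paying the ~2 min kernel-compile cost repeatedly.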

5. FP4 quantization: loads but with caveats

DeepSeek-R1 FP4 succeeded where BF16 V3.2 crashed, but sustained throughput (302 tok/s) was much lower than peak (5,088), and TTFT was 2.6s. FP4 dequantization overhead during prefill is significant.

6. EAGLE speculative decoding matters for large models

GLM-5 with EAGLE: 7.69ms TPOT. Without: 20.22ms. For decode-heavy workloads on 700B+ models, spec decode is essential.
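The measured numbers also let you back out EAGLE's per-step overhead: with accept length 3.52, the ideal speedup would be 3.52×, but the observed TPOT improvement is smaller. A quick check using only the figures reported above:

```python
# Relate EAGLE's accept length to the observed TPOT improvement.
# accept_length tokens are emitted per verify step, so the ideal
# speedup equals accept_length; the shortfall is draft + verify cost.
tpot_base = 20.22     # ms, no EAGLE (measured above)
tpot_eagle = 7.69     # ms, with EAGLE (measured above)
accept_length = 3.52  # measured above

observed_speedup = tpot_base / tpot_eagle        # ~2.63x
overhead_factor = accept_length / observed_speedup  # ~1.34x per-step cost

print(f"observed {observed_speedup:.2f}x of an ideal {accept_length}x")
print(f"implied per-step overhead ~{overhead_factor:.2f}x")
```

So each EAGLE step costs roughly a third more than a plain decode step here, which is still an easy win at 3.52 accepted tokens per step.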


Hardware Utilization Notes

  • Disk: Models range from 31GB (Nemotron3-Nano FP8) to 959GB (Kimi-K2 BF16). Plan for 1TB+ cache per large model.
  • GPU Memory: All 8×183GB = 1.46TB total. Kimi-K2 (1T, BF16) used nearly all of it. FP8/FP4 quant reduces memory 2-4×.
  • GPU Clocks: Locked at max (SM 1965, MEM 3996) for consistent benchmarks.
  • Interconnect: NV18 NVSwitch — all-to-all GPU communication is ~free for TP.
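For capacity planning, a first-order weight-memory estimate (params × bytes per param) gets you close; a sketch, noting it ignores KV cache, CUDA graphs, and activation buffers, which is why real deployments still need `--mem-fraction-static` headroom:

```python
# First-order weight-memory estimate: params x bytes per param.
# Ignores KV cache, CUDA graphs, and activation buffers, so actual
# GPU memory use is higher than these figures.
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "FP4": 0.5}
TOTAL_GPU_GB = 8 * 183  # 1,464 GB across the node

def weight_gb(params_billions: float, dtype: str) -> float:
    return params_billions * BYTES_PER_PARAM[dtype]  # billions -> GB

print(weight_gb(235, "BF16"))  # ~470 GB (doc reports ~438 GB on disk)
print(weight_gb(685, "FP4"))   # ~343 GB, vs a BF16 load that crashed
```

The estimate slightly overshoots the measured checkpoint sizes (tokenizer/embedding details differ per model), but it is accurate enough to predict the FP8/FP4 2-4× savings noted above.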

Reproducibility

All benchmarks used:

```shell
# Server launch
sudo docker run -d --name bench-server \
  --gpus all --ipc=host --net=host \
  -v /home/nvadmin/.cache/huggingface:/root/.cache/huggingface \
  lmsysorg/sglang:v0.5.8.post1 \
  python3 -m sglang.launch_server \
    --model-path <MODEL> --tp <TP> \
    --host 0.0.0.0 --port 30000
```

```shell
# Benchmarks
python3 -m sglang.bench_serving \
  --backend sglang --model <MODEL> \
  --dataset-name random \
  --random-input-len 1000 --random-output-len 1000 \
  --num-prompts <N> --max-concurrency <C>
```

Concurrency levels: c=1 (latency), c=16 (mid), c=100 (throughput). Prompt counts: 10/80/500-1000.
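The three-level sweep can be scripted as a single loop; a sketch pairing each concurrency with a prompt count in the ranges above (the model name is an example placeholder):

```shell
# Sweep the three concurrency levels used in this report. Prompt
# counts scale with concurrency so each run reaches steady state.
MODEL=openai/gpt-oss-120b   # example; substitute the model under test
for pair in "1 10" "16 80" "100 500"; do
  set -- $pair
  C=$1
  N=$2
  python3 -m sglang.bench_serving \
    --backend sglang --model "$MODEL" \
    --dataset-name random \
    --random-input-len 1000 --random-output-len 1000 \
    --num-prompts "$N" --max-concurrency "$C"
done
```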


Benchmarked by Thermidor 🦞 on 8×B200, 2026-02-15/16
