Date: 2026-02-15/16
Hardware: 8× NVIDIA B200 183GB, NV18 NVSwitch, Xeon 8570 (224 cores), 2TB RAM
GPU Clocks: SM 1965 MHz, MEM 3996 MHz (locked at max)
Driver: 570.195.03
Framework: SGLang v0.5.8.post1 (Docker lmsysorg/sglang:v0.5.8.post1)
Benchmark: sglang.bench_serving, random 1K input / 1K output tokens
OS: Ubuntu 24.04.3 LTS
Benchmarked 9 SGLang cookbook models on a single 8×B200 NVSwitch node. All models ran with their cookbook-recommended configs. 8 of 9 completed successfully; DeepSeek-V3.2-Exp crashed during DeepGEMM JIT warmup (known issue).
Top throughput per GPU (tok/s/GPU at peak concurrency):
- Nemotron3-Nano-30B — 6,830 tok/s/GPU (TP=1, FP8) 🥇
- Qwen3-Coder-Next — 1,497 tok/s/GPU (TP=2, BF16) 🥈
- GPT-OSS-120B — 1,179 tok/s/GPU (TP=8, MXFP4) 🥉
Fastest decode latency (TPOT at c=1):
- GPT-OSS-120B — 2.41ms 🥇
- Nemotron3-Nano-30B — 4.36ms 🥈
- Qwen3-Coder-Next — 4.53ms 🥉
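The two leaderboards are linked: at c=1, single-stream output tok/s is roughly the reciprocal of TPOT. A quick sanity check (approximate, ignoring TTFT and per-request overhead):

```python
# At concurrency 1, each output token takes ~TPOT, so decode speed ~= 1000/TPOT.
def decode_tok_per_s(tpot_ms: float) -> float:
    """Approximate single-stream output tok/s implied by a TPOT measurement."""
    return 1000.0 / tpot_ms

# GPT-OSS-120B: 2.41 ms TPOT implies ~415 tok/s; the measured 397 tok/s is
# close, with the gap coming from TTFT and scheduling overhead.
print(round(decode_tok_per_s(2.41)))  # 415
```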
| # | Model | Params (Total/Active) | TP | Quant | Latency tok/s (c=1) | TPOT (c=1) | Throughput tok/s | Peak tok/s | tok/s/GPU |
|---|---|---|---|---|---|---|---|---|---|
| 1 | GLM-5-FP8 | 744B / ~40B | 8 | FP8 | 112 | 7.69ms* | 1,370 | — | 171 |
| 2 | Nemotron3-Nano-30B | 30B / 3B | 1 | FP8 | 223 | 4.36ms | 6,830 | 11,272 | 6,830 |
| 3 | Qwen3-Coder-Next | 80B / 3B | 2 | BF16 | 204 | 4.53ms | 2,994 | 5,708 | 1,497 |
| 4 | GPT-OSS-120B | 117B / ~12B | 8 | MXFP4 | 397 | 2.41ms | 9,432 | 13,021 | 1,179 |
| 5 | Qwen3-235B-A22B | 235B / 22B | 8 | BF16 | 145 | 6.70ms | 3,366 | 4,651 | 421 |
| 6 | GLM-4.6 | 355B / 32B | 8 | BF16 | 66 | 14.93ms | 1,822 | 2,900 | 228 |
| 7 | Kimi-K2-Instruct | 1T / ~32B | 8 | FP8 | 128 | 7.54ms | 2,166 | 4,205 | 271 |
| 8 | DeepSeek-V3.2-Exp | 685B / ~37B | 8 | BF16 | ❌ CRASHED | — | ❌ CRASHED | — | — |
| 9 | DeepSeek-R1-0528-FP4 | 685B / ~37B | 8 | FP4 | 88 | 5.74ms | 302 | 5,088 | 38 |
*GLM-5 TPOT measured with EAGLE speculative decoding (accept length ~3.52). Without EAGLE: 20.22ms.
- Model: zai-org/GLM-5-FP8
- Config: TP=8, EAGLE speculative decoding (3 steps, topk=1, 4 draft tokens)
- Image: lmsysorg/sglang:glm5-blackwell
- Full report: GLM-5 B200 Benchmark Gist
| Metric | Value |
|---|---|
| Latency (c=1) output tok/s | 112 |
| TTFT (c=1) | 246ms |
| TPOT (c=1, EAGLE) | 7.69ms |
| TPOT (c=1, no EAGLE) | 20.22ms |
| Throughput (c=100) output tok/s | 1,370 |
| EAGLE accept length | 3.52 |
Notes: EAGLE is critical — 2.6× better decode latency. B200 beats H200 by ~13% throughput (1,370 vs 1,215 tok/s).
- Model: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8
- Config: TP=1 (single B200), --max-running-requests 1024
| Metric | Value |
|---|---|
| Latency (c=1) output tok/s | 223 |
| TTFT (c=1) | 46ms (median) |
| TPOT (c=1) | 4.36ms |
| Throughput (c=256) output tok/s | 6,830 |
| Peak output tok/s | 11,272 |
Notes: Only model running on a single GPU. Mamba SSM cache consumed 57.7GB (conv_state 1.0GB + ssm_state 56.7GB). Below cookbook reference (11,552 tok/s) — likely SGLang version diff.
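Assuming the Mamba cache pool is pre-allocated evenly across the --max-running-requests slots (a plausible reading of how fixed-size SSM state caches are sized, not verified against SGLang internals), the per-request state works out to:

```python
# Per-request Mamba state, assuming the 57.7 GB pool is split evenly across
# the 1024 request slots set by --max-running-requests (assumption, see above).
cache_gb = 57.7        # conv_state 1.0 GB + ssm_state 56.7 GB, from this run
max_running = 1024     # --max-running-requests
per_request_mb = cache_gb * 1000 / max_running
print(f"{per_request_mb:.1f} MB per request")  # 56.3 MB per request
```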
- Model: Qwen/Qwen3-Coder-Next
- Config: TP=2
| Metric | c=1 | c=16 | c=100 |
|---|---|---|---|
| Output tok/s | 204 | 1,281 | 2,994 |
| Peak tok/s | 226 | 1,758 | 5,708 |
| TPOT | 4.53ms | 11.52ms | 31.53ms |
| TTFT (median) | 148ms | 146ms | 156ms |
Notes: Only needs 2 GPUs. Excellent latency scaling. 256K context support.
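One way to read "excellent latency scaling": aggregate speedup over c=1 should track c × TPOT(1)/TPOT(c), and for this model the prediction from the TPOT row matches the measured throughput row almost exactly:

```python
# Aggregate batching speedup predicted from TPOT alone vs. measured throughput.
def batch_speedup(c: int, tpot_1_ms: float, tpot_c_ms: float) -> float:
    return c * tpot_1_ms / tpot_c_ms

predicted = batch_speedup(16, 4.53, 11.52)      # from the TPOT row
measured = 1281 / 204                           # from the output tok/s row
print(round(predicted, 2), round(measured, 2))  # 6.29 6.28
```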
- Model: openai/gpt-oss-120b
- Config: TP=8, --reasoning-parser gpt-oss, MXFP4 MoE kernels
| Metric | c=1 | c=16 | c=100 |
|---|---|---|---|
| Output tok/s | 397 | 3,221 | 9,432 |
| Peak tok/s | 417 | 3,942 | 13,021 |
| TPOT | 2.41ms | 4.70ms | 9.53ms |
| TTFT (median) | 37ms | 35ms | 62ms |
Notes: Absolute throughput king. MXFP4 quantization + efficient MoE routing = incredible B200 utilization. 2.41ms TPOT at c=1 is the fastest decode of any model tested.
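A rough weight-read roofline makes the utilization claim concrete. The inputs below are assumptions, not measurements: ~12B active params, ~4.25 bits/param effective for MXFP4 (4-bit values plus block scales), and ~8 TB/s HBM bandwidth per B200:

```python
# Decode-step roofline: time to stream the active expert weights from HBM.
active_params = 12e9    # assumed active params per token
bits_per_param = 4.25   # assumed MXFP4 effective size incl. block scales
hbm_bw = 8e12           # assumed ~8 TB/s HBM3e per B200
gpus = 8                # TP=8 splits the weight read across GPUs

bytes_per_token = active_params * bits_per_param / 8  # ~6.4 GB touched/token
min_tpot_ms = bytes_per_token / (hbm_bw * gpus) * 1e3
print(f"{min_tpot_ms:.3f} ms")  # 0.100 ms
# The weight-read floor is ~0.1 ms vs. 2.41 ms measured: at c=1 the decode
# loop is dominated by launch and TP communication overhead, not weight reads.
```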
- Model: Qwen/Qwen3-235B-A22B-Instruct-2507
- Config: TP=8, BF16
| Metric | c=1 | c=16 | c=100 |
|---|---|---|---|
| Output tok/s | 145 | 1,048 | 3,366 |
| Peak tok/s | 150 | 1,263 | 4,651 |
| TPOT | 6.70ms | 14.06ms | 26.77ms |
| TTFT (median) | 69ms | 73ms | 84ms |
Notes: Solid throughput for 22B active params. BF16 weights (~438GB). Stable TTFT across concurrency levels.
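The ~438GB figure is consistent with the parameter count once units are pinned down: 235B × 2 bytes is 470 decimal GB, i.e. ~438 GiB, so the report is quoting binary units:

```python
# BF16 weight footprint for 235B params, in decimal GB vs. binary GiB.
params = 235e9
gb = params * 2 / 1e9       # decimal gigabytes
gib = params * 2 / 2**30    # binary gibibytes
print(f"{gb:.0f} GB = {gib:.0f} GiB")  # 470 GB = 438 GiB
```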
- Model: zai-org/GLM-4.6
- Config: TP=8, BF16
| Metric | c=1 | c=16 | c=100 |
|---|---|---|---|
| Output tok/s | 66 | 570 | 1,822 |
| Peak tok/s | 67 | 720 | 2,900 |
| TPOT | 14.93ms | 26.18ms | 49.89ms |
| TTFT (median) | 109ms | 118ms | 128ms |
Notes: Slowest model tested. Very large BF16 weights (665GB). Thinking model (responds with <think> tags). Standard sglang image works (doesn't need glm5-blackwell).
- Model: moonshotai/Kimi-K2-Instruct
- Config: TP=8, FP8 (native block-FP8 checkpoint)
| Metric | c=1 | c=100 |
|---|---|---|
| Output tok/s | 128 | 2,166 |
| Peak tok/s | 133 | 4,205 |
| TPOT | 7.54ms | 42.23ms |
| TTFT (median) | 116ms | 208ms |
Notes: Largest model tested (1T params, ~959GB weights). Required DeepGEMM JIT warmup (~2 min for 32K kernels). Solid throughput despite massive size.
- Model: deepseek-ai/DeepSeek-V3.2-Exp
- Config: TP=8, attempted with --mem-fraction-static 0.80 --dist-timeout 3600
- Result: server crashes during DeepGEMM JIT compile warmup with an NCCL timeout / EOFError
- Tried: default config, reduced mem fraction — both failed
- Root cause: likely OOM during the warmup phase, when model weights plus JIT compilation buffers exceed GPU memory
- Model: nvidia/DeepSeek-R1-0528-FP4-v2
- Config: TP=8, --quantization modelopt_fp4 --mem-fraction-static 0.80
| Metric | c=1 | c=100 |
|---|---|---|
| Output tok/s | 88 | 302 |
| Peak tok/s | 175 | 5,088 |
| TPOT | 5.74ms | 323.14ms |
| TTFT (median) | 2,598ms | 2,711ms |
Notes: FP4 quant allowed DeepSeek R1 to load where V3.2 BF16 crashed. High TTFT (~2.6s) likely due to FP4 dequantization overhead during prefill. Massive gap between peak (5,088) and sustained (302) throughput suggests memory pressure. Reasoning model — responds with <think> blocks.
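Quantifying that peak-vs-sustained gap, with GPT-OSS-120B's numbers from above as the contrast:

```python
# Sustained/peak ratio at high concurrency: a healthy run holds most of its
# peak; this one holds ~6% of it.
r1 = 302 / 5088          # DeepSeek-R1 FP4, c=100
gpt_oss = 9432 / 13021   # GPT-OSS-120B, c=100, for contrast
print(f"{r1:.1%} vs {gpt_oss:.1%}")  # 5.9% vs 72.4%
```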
GPT-OSS at 9,432 tok/s absolutely dominates. MXFP4 quantization reduces memory bandwidth per expert while maintaining quality, letting the B200s' bandwidth shine.
| Active Params | Best TPOT (c=1) | Model |
|---|---|---|
| ~3B | 4.36ms | Nemotron3-Nano |
| ~3B | 4.53ms | Qwen3-Coder-Next |
| ~12B | 2.41ms | GPT-OSS (MXFP4 helps) |
| ~22B | 6.70ms | Qwen3-235B |
| ~32B | 7.54ms | Kimi-K2 |
| ~37B | 5.74ms | DeepSeek-R1 (FP4) |
| ~40B | 7.69ms | GLM-5 (EAGLE) |
All 8-GPU models benefit from NVSwitch's ~900 GB/s bisection bandwidth. TP=8 all-reduce is nearly free, which is why disaggregated serving (splitting GPUs) performs worse on this topology.
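A bandwidth-only estimate shows why the TP=8 all-reduce barely registers at decode time. The hidden size and layer count below are hypothetical placeholders (not taken from any model in this report), and per-call launch latency, which this ignores, is the larger cost in practice:

```python
# Ring all-reduce moves 2*(N-1)/N of the message per GPU. For one decode token
# the message is one activation vector; assume 2 all-reduces per layer
# (attention + MLP). All model dims here are hypothetical.
n, hidden, layers, dtype_bytes = 8, 7168, 60, 2    # placeholder dims, BF16
msg = hidden * dtype_bytes                         # one token's activation
per_gpu = 2 * (n - 1) / n * msg * 2 * layers       # bytes moved per decode step
t_us = per_gpu / 900e9 * 1e6                       # at ~900 GB/s NVLink per GPU
print(f"{per_gpu / 1e6:.2f} MB, {t_us:.1f} us per token")  # 3.01 MB, 3.3 us per token
```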
Both DeepSeek models and Kimi-K2 require JIT-compiling ~32K GEMM kernels on cold start. Kimi-K2 took ~2 min; DeepSeek-V3.2 crashed entirely. Production deployments should pre-warm or use persistent containers.
DeepSeek-R1 FP4 succeeded where BF16 V3.2 crashed, but sustained throughput (302 tok/s) was much lower than peak (5,088), and TTFT was 2.6s. FP4 dequantization overhead during prefill is significant.
GLM-5 with EAGLE: 7.69ms TPOT. Without: 20.22ms. For decode-heavy workloads on 700B+ models, spec decode is essential.
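The accept length also explains why the speedup is 2.6× rather than 3.5×: each target forward now yields ~3.52 tokens, but the draft passes add overhead. Backing that out from the two TPOT numbers:

```python
# EAGLE accounting: accept length caps the speedup; the shortfall vs. the
# measured speedup is the draft + verification overhead per target forward.
tpot_base, tpot_eagle, accept_len = 20.22, 7.69, 3.52
speedup = tpot_base / tpot_eagle      # measured speedup
overhead = accept_len / speedup - 1   # extra time per target forward
print(f"{speedup:.2f}x, ~{overhead:.0%} overhead")  # 2.63x, ~34% overhead
```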
- Disk: Models range from 31GB (Nemotron3-Nano FP8) to 959GB (Kimi-K2 BF16). Plan for 1TB+ cache per large model.
- GPU Memory: All 8×183GB = 1.46TB total. Kimi-K2 (1T params, ~959GB FP8 weights) used nearly all of it. FP8/FP4 quant reduces memory 2-4×.
- GPU Clocks: Locked at max (SM 1965, MEM 3996) for consistent benchmarks.
- Interconnect: NV18 NVSwitch — all-to-all GPU communication is ~free for TP.
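The memory planning above reduces to one rule of thumb, weights ≈ params × bits / 8, with block scales, embeddings, and KV cache on top:

```python
# Approximate weight footprint in GiB for a given param count and quant width.
def weight_gib(params_billion: float, bits_per_param: float) -> float:
    return params_billion * 1e9 * bits_per_param / 8 / 2**30

# 235B BF16, 685B FP8, and 685B FP4 (before scale overhead):
print(round(weight_gib(235, 16)),
      round(weight_gib(685, 8)),
      round(weight_gib(685, 4)))  # 438 638 319
```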
All benchmarks used:

```bash
# Server launch
sudo docker run -d --name bench-server \
  --gpus all --ipc=host --net=host \
  -v /home/nvadmin/.cache/huggingface:/root/.cache/huggingface \
  lmsysorg/sglang:v0.5.8.post1 \
  python3 -m sglang.launch_server \
  --model-path <MODEL> --tp <TP> \
  --host 0.0.0.0 --port 30000

# Benchmarks
python3 -m sglang.bench_serving \
  --backend sglang --model <MODEL> \
  --dataset-name random \
  --random-input-len 1000 --random-output-len 1000 \
  --num-prompts <N> --max-concurrency <C>
```

Concurrency levels: c=1 (latency), c=16 (mid), c=100 (throughput). Prompt counts: 10/80/500-1000.
Benchmarked by Thermidor 🦞 on 8×B200, 2026-02-15/16