@arcanemachine
Last active February 18, 2026 19:37
# Some 4x MI50 32GB Benchmarks (Qwen3-Coder-Next Q4_0, Q4_K_M)

TL;DR: Putting two of the four cards on PCIe x1 cuts prompt-processing (pp) throughput by more than half for this model, but the token-generation (tg) decrease is much less significant.
Notes:
- Devices 0 and 2 are on PCIe 3.0 x16; Devices 1 and 3 are on PCIe x1
- Flash attention disabled (enabling it made no difference)
- ROCm version: 6.3.3
- llama.cpp compiled with pwilkin autoparser branch
---
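As a sanity check on the TL;DR, the relative throughput drops can be computed from the Q4_0 numbers in the tables below. Note this compares a 2-card run against a 4-card run, so card count and link width both change; the figures are copied from the benchmark output, and this is just arithmetic, not a new measurement:

```python
# Throughput figures copied from the llama-bench runs below (Q4_0, t/s).
pp_2card, tg_2card = 547.26, 38.06   # 2 cards, both on PCIe 3.0 x16
pp_4card, tg_4card = 200.63, 31.56   # 4 cards, 2 of them on PCIe x1

pp_drop = 1 - pp_4card / pp_2card    # fraction of pp512 throughput lost
tg_drop = 1 - tg_4card / tg_2card    # fraction of tg128 throughput lost

print(f"pp512 drop: {pp_drop:.0%}")  # ~63% slower prompt processing
print(f"tg128 drop: {tg_drop:.0%}")  # ~17% slower token generation
```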
## 2 cards, both on PCIe 3.0 x16
```
user@aipc:~/code/ai/repo/scripts$ CUDA_VISIBLE_DEVICES=0,2 ./llama-bench -m ../../models/Qwen3-Coder-Next-Q4_0.gguf
ggml_cuda_init: found 2 ROCm devices:
  Device 0: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
  Device 1: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
```
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3next 80B.A3B Q4_0 | 42.19 GiB | 79.67 B | ROCm | 99 | pp512 | 547.26 ± 1.86 |
| qwen3next 80B.A3B Q4_0 | 42.19 GiB | 79.67 B | ROCm | 99 | tg128 | 38.06 ± 0.10 |
build: 8c9ef65f5 (8095)
```
user@aipc:~/code/ai/repo/scripts$ CUDA_VISIBLE_DEVICES=0,2 ./llama-bench -m ../../models/Qwen3-Coder-Next-Q4_K_M.gguf
ggml_cuda_init: found 2 ROCm devices:
  Device 0: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
  Device 1: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
```
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3next 80B.A3B Q4_K - Medium | 45.17 GiB | 79.67 B | ROCm | 99 | pp512 | 491.68 ± 1.89 |
| qwen3next 80B.A3B Q4_K - Medium | 45.17 GiB | 79.67 B | ROCm | 99 | tg128 | 36.24 ± 0.16 |
## 4 cards, 2 on PCIe 3.0 x16, 2 on PCIe x1
```
user@aipc:~/code/ai/repo/scripts$ ./llama-bench -m ../../models/Qwen3-Coder-Next-Q4_0.gguf
ggml_cuda_init: found 4 ROCm devices:
  Device 0: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
  Device 1: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
  Device 2: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
  Device 3: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
```
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3next 80B.A3B Q4_0 | 42.19 GiB | 79.67 B | ROCm | 99 | pp512 | 200.63 ± 0.34 |
| qwen3next 80B.A3B Q4_0 | 42.19 GiB | 79.67 B | ROCm | 99 | tg128 | 31.56 ± 0.10 |
build: 8c9ef65f5 (8095)
```
user@aipc:~/code/ai/repo/scripts$ ./llama-bench -m ../../models/Qwen3-Coder-Next-Q4_K_M.gguf
ggml_cuda_init: found 4 ROCm devices:
  Device 0: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
  Device 1: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
  Device 2: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
  Device 3: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
```
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3next 80B.A3B Q4_K - Medium | 45.17 GiB | 79.67 B | ROCm | 99 | pp512 | 192.41 ± 0.27 |
| qwen3next 80B.A3B Q4_K - Medium | 45.17 GiB | 79.67 B | ROCm | 99 | tg128 | 29.96 ± 0.32 |
build: 8c9ef65f5 (8095)
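For completeness, the Q4_0 vs Q4_K_M tradeoff in the 4-card run can be summarized the same way (figures copied from the tables above; arithmetic only):

```python
# 4-card figures copied from the tables above.
q4_0   = {"size_gib": 42.19, "pp512": 200.63, "tg128": 31.56}
q4_k_m = {"size_gib": 45.17, "pp512": 192.41, "tg128": 29.96}

# Relative change of Q4_K_M vs Q4_0: ~+7% larger, ~4-5% slower.
for key in q4_0:
    rel = q4_k_m[key] / q4_0[key] - 1
    print(f"{key}: {rel:+.1%} for Q4_K_M vs Q4_0")
```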