@arcanemachine
Last active February 18, 2026 19:37
# Some 4x MI50 32GB Benchmarks (Qwen3-Coder-Next Q4_0, Q4_K_M)

TL;DR: Putting two of the four cards on PCIe x1 cuts prompt-processing (pp) throughput by more than half for this model, but the token-generation (tg) decrease is much less significant.
Notes:
- Devices 0 and 2 are on PCIe 3.0 x16; Devices 1 and 3 are on PCIe x1
- Flash attention disabled (enabling it made no difference)
- ROCm version: 6.3.3
- llama.cpp compiled with pwilkin autoparser branch
---
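As a sanity check on the TL;DR, the relative throughput drops can be computed from the Q4_0 numbers in the tables below. Note this compares a 2-card run against a 4-card run, so card count and link width both change; the figures are copied from the benchmark output, and this is just arithmetic, not a new measurement:

```python
# Throughput figures copied from the llama-bench runs below (Q4_0, t/s).
pp_2card, tg_2card = 547.26, 38.06   # 2 cards, both on PCIe 3.0 x16
pp_4card, tg_4card = 200.63, 31.56   # 4 cards, 2 of them on PCIe x1

pp_drop = 1 - pp_4card / pp_2card    # fraction of pp512 throughput lost
tg_drop = 1 - tg_4card / tg_2card    # fraction of tg128 throughput lost

print(f"pp512 drop: {pp_drop:.0%}")  # ~63% slower prompt processing
print(f"tg128 drop: {tg_drop:.0%}")  # ~17% slower token generation
```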
## 2 cards, both on PCIe 3.0 x16
```
user@aipc:~/code/ai/repo/scripts$ CUDA_VISIBLE_DEVICES=0,2 ./llama-bench -m ../../models/Qwen3-Coder-Next-Q4_0.gguf
ggml_cuda_init: found 2 ROCm devices:
  Device 0: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
  Device 1: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
```
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3next 80B.A3B Q4_0 | 42.19 GiB | 79.67 B | ROCm | 99 | pp512 | 547.26 ± 1.86 |
| qwen3next 80B.A3B Q4_0 | 42.19 GiB | 79.67 B | ROCm | 99 | tg128 | 38.06 ± 0.10 |
build: 8c9ef65f5 (8095)
```
user@aipc:~/code/ai/repo/scripts$ CUDA_VISIBLE_DEVICES=0,2 ./llama-bench -m ../../models/Qwen3-Coder-Next-Q4_K_M.gguf
ggml_cuda_init: found 2 ROCm devices:
  Device 0: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
  Device 1: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
```
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3next 80B.A3B Q4_K - Medium | 45.17 GiB | 79.67 B | ROCm | 99 | pp512 | 491.68 ± 1.89 |
| qwen3next 80B.A3B Q4_K - Medium | 45.17 GiB | 79.67 B | ROCm | 99 | tg128 | 36.24 ± 0.16 |
## 4 cards, 2 on PCIe 3.0 x16, 2 on PCIe x1
```
user@aipc:~/code/ai/repo/scripts$ ./llama-bench -m ../../models/Qwen3-Coder-Next-Q4_0.gguf
ggml_cuda_init: found 4 ROCm devices:
  Device 0: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
  Device 1: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
  Device 2: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
  Device 3: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
```
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3next 80B.A3B Q4_0 | 42.19 GiB | 79.67 B | ROCm | 99 | pp512 | 200.63 ± 0.34 |
| qwen3next 80B.A3B Q4_0 | 42.19 GiB | 79.67 B | ROCm | 99 | tg128 | 31.56 ± 0.10 |
build: 8c9ef65f5 (8095)
```
user@aipc:~/code/ai/repo/scripts$ ./llama-bench -m ../../models/Qwen3-Coder-Next-Q4_K_M.gguf
ggml_cuda_init: found 4 ROCm devices:
  Device 0: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
  Device 1: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
  Device 2: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
  Device 3: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
```
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3next 80B.A3B Q4_K - Medium | 45.17 GiB | 79.67 B | ROCm | 99 | pp512 | 192.41 ± 0.27 |
| qwen3next 80B.A3B Q4_K - Medium | 45.17 GiB | 79.67 B | ROCm | 99 | tg128 | 29.96 ± 0.32 |
build: 8c9ef65f5 (8095)
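For completeness, the Q4_0 vs Q4_K_M tradeoff in the 4-card run can be summarized the same way (figures copied from the tables above; arithmetic only):

```python
# 4-card figures copied from the tables above.
q4_0   = {"size_gib": 42.19, "pp512": 200.63, "tg128": 31.56}
q4_k_m = {"size_gib": 45.17, "pp512": 192.41, "tg128": 29.96}

# Relative change of Q4_K_M vs Q4_0: ~+7% larger, ~4-5% slower.
for key in q4_0:
    rel = q4_k_m[key] / q4_0[key] - 1
    print(f"{key}: {rel:+.1%} for Q4_K_M vs Q4_0")
```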