On my setup it was surprisingly hard to verify that a model is fully loaded and running on the dedicated GPU. The GPU can show some activity, or the model can be only partially loaded while the CPU still does most of the work, and on Ubuntu you'll need a handful of tools to verify ROCm performance.
This is mostly a reminder for myself, but might prove useful if you have the same setup:
$ lscpu | grep -i model
Model name: AMD Ryzen 9 6900HS with Radeon Graphics
$ nvtop
Device 0 [AMD Radeon RX 6700S] PCIe GEN 4@ 8x
Device 1 [AMD Radeon Graphics] Integrated GPU
$ uname -a
Linux vs-ubuntu 6.11.0-19-generic #19~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Feb 17 11:51:52 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
As the llama.cpp docs suggest, I followed the official AMD guide; ignore the compatibility list, it works fine:
https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.4.2/install/quick-start.html
Verify the installation with: https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.4.2/install/post-install.html
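The post-install checks boil down to something like this (the grep is just my shortcut, not from the guide; on my setup both the iGPU and the RX 6700S show up as agents):
rocminfo | grep -i "marketing name"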
If at the end of this you don't see your device in the rocm-smi output below, fix that before proceeding:
rocm-smi
========================================== ROCm System Management Interface ==========================================
==================================================== Concise Info ====================================================
Device Node IDs Temp Power Partitions SCLK MCLK Fan Perf PwrCap VRAM% GPU%
(DID, GUID) (Edge) (Avg) (Mem, Compute, ID)
======================================================================================================================
0 1 0x73ef, 3931 57.0°C 5.0W N/A, N/A, 0 700Mhz 96Mhz 100.0% auto 100.0W 75% 0%
1 2 0x1681, 52857 55.0°C 18.0W N/A, N/A, 0 N/A 2400Mhz 0% auto N/A 83% 0%
======================================================================================================================
================================================ End of ROCm SMI Log =================================================
Since llama.cpp is big and bulky, it makes sense to first verify that HIP works at all. ROCm ships with some C++ samples for HIP, and the README is helpful for building and running them:
ls /opt/rocm/share/hip/samples/
0_Intro 1_Utils 2_Cookbook CMakeLists.txt common packaging README.md
But I found them very minimalistic and too small to actually see VRAM and GPU usage fill up.
Grok helped me write a large matrix multiplication that actually fills the 8 GB of VRAM: stress_hip.cpp. The matrix sizes are adjustable with:
int M = 23000;
int N = 23000;
int K = 23000;
Make sure to set HIP_DEVICE_ID to the device number reported by rocm-smi.
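The original stress_hip.cpp isn't reproduced in this post; a minimal sketch of the same idea, assuming a naive HIP GEMM kernel over three large square matrices (my reconstruction, not the actual file), looks roughly like this:
// stress_hip_sketch.cpp - reconstruction of the idea, not the original file:
// allocate three large matrices and run a naive GEMM so VRAM and GPU%
// are clearly visible in rocm-smi / nvtop
#include <hip/hip_runtime.h>
#include <cstdio>

#define HIP_CHECK(expr) do { hipError_t e = (expr); if (e != hipSuccess) { \
    fprintf(stderr, "HIP error %s at line %d\n", hipGetErrorString(e), __LINE__); return 1; } } while (0)

__global__ void gemm_naive(const float *A, const float *B, float *C, int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[(size_t)row * K + k] * B[(size_t)k * N + col];
        C[(size_t)row * N + col] = acc;
    }
}

int main() {
    const int HIP_DEVICE_ID = 0;               // device number from rocm-smi
    const int M = 23000, N = 23000, K = 23000; // ~2.1 GB per float matrix
    HIP_CHECK(hipSetDevice(HIP_DEVICE_ID));

    float *dA, *dB, *dC;
    HIP_CHECK(hipMalloc(&dA, (size_t)M * K * sizeof(float)));
    HIP_CHECK(hipMalloc(&dB, (size_t)K * N * sizeof(float)));
    HIP_CHECK(hipMalloc(&dC, (size_t)M * N * sizeof(float)));
    // fill the inputs with non-zero bytes; the values don't matter for a stress test
    HIP_CHECK(hipMemset(dA, 1, (size_t)M * K * sizeof(float)));
    HIP_CHECK(hipMemset(dB, 1, (size_t)K * N * sizeof(float)));

    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (M + block.y - 1) / block.y);
    gemm_naive<<<grid, block>>>(dA, dB, dC, M, N, K);
    HIP_CHECK(hipGetLastError());
    HIP_CHECK(hipDeviceSynchronize()); // keeps the GPU at 100% for a good while

    printf("done\n");
    hipFree(dA); hipFree(dB); hipFree(dC);
    return 0;
}
Three 23000x23000 float matrices are about 2.1 GB each, roughly 6.3 GB in total, so together with runtime overhead they fill most of the 8 GB card.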
Compile and run with the HIP compiler:
hipcc stress_hip.cpp -o stress_hip --offload-arch=gfx1030
./stress_hip
You can monitor the VRAM and GPU% with:
watch rocm-smi
# second terminal
nvtop
Now that we know HIP works and the correct device is being used, we can move on to llama.cpp:
https://github.com/ggml-org/llama.cpp/blob/baad94885df512bb24ab01e2b22d1998fce4d00e/docs/build.md#hip
Compile llama.cpp with HIP and run qwen3 on the CPU/GPU:
git clone git@github.com:ggml-org/llama.cpp.git
cd llama.cpp
export ROCM_ARCH="gfx1030" # I target gfx1030 although the 6700S is gfx1032, which isn't on ROCm's official support list; the HSA_OVERRIDE_GFX_VERSION export below makes it run the gfx1030 binaries
# DEBUG BUILD, if needed
#HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)"
#cmake -S . -B build -DGGML_HIP=ON -DLLAMA_ALL_WARNINGS=ON -DAMDGPU_TARGETS=$ROCM_ARCH -DCMAKE_BUILD_TYPE=Debug
#cmake --build build --config Debug -j$(nproc)
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)"
cmake -S . -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=$ROCM_ARCH -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
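# optional sanity check: newer llama.cpp builds have a --list-devices flag
# (check --help if yours doesn't); it should report the ROCm device(s)
./build/bin/llama-cli --list-devices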
cd build/bin
export HIP_VISIBLE_DEVICES=0
export HSA_OVERRIDE_GFX_VERSION=10.3.0
# this downloads the model magically https://ollama.com/library/qwen3:4b
./llama-run qwen3:4b
# exit
# ctrl+c
This minor detail cost me a lot of time: it's only ever mentioned there and in some Dockerfiles, but if you don't define the number of layers to offload to the GPU with -n / -ngl, it defaults to running 100% on the CPU. There will still be a minor increase in VRAM usage, which will confuse you!
Verbose output shows which device is visible and loaded and how many layers are offloaded to the GPU; if you try to load too many layers of a 14B model with only 8 GB of VRAM, you'll get an OOM error.
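So for interactive use, pass the layer count explicitly to llama-run (using the -n / --ngl option mentioned above; 99 effectively means "offload everything that fits"):
./llama-run --ngl 99 qwen3:4b
The benchmark below does the same with llama-bench: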
export HIP_VISIBLE_DEVICES=0
export HSA_OVERRIDE_GFX_VERSION=10.3.0
./llama-bench -m qwen3\:4b -ngl 99 -v
# ---
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 6700S, gfx1030 (0x1030), VMM: no, Wave Size: 32
# depending on the model, you'll see how many layers are actually loaded into VRAM
llama_kv_cache_unified: layer 0: dev = ROCm0
llama_kv_cache_unified: layer 1: dev = ROCm0
llama_kv_cache_unified: layer 2: dev = ROCm0
llama_kv_cache_unified: layer 3: dev = ROCm0
llama_kv_cache_unified: layer 4: dev = ROCm0
llama_kv_cache_unified: layer 5: dev = ROCm0
llama_kv_cache_unified: layer 6: dev = ROCm0
llama_kv_cache_unified: layer 7: dev = ROCm0
llama_kv_cache_unified: layer 8: dev = ROCm0
llama_kv_cache_unified: layer 9: dev = ROCm0
llama_kv_cache_unified: layer 10: dev = ROCm0
...
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | tg128 | 43.27 ± 0.12 |
llama_perf_context_print: load time = 6925.96 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 641 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 21717.15 ms / 642 tokens
llama_perf_context_print: graphs reused = 0
# more readable without -v:
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3 4B Q4_K - Medium | 2.44 GiB | 4.02 B | ROCm | 99 | pp512 | 601.23 ± 8.72 |
| qwen3 4B Q4_K - Medium | 2.44 GiB | 4.02 B | ROCm | 99 | tg128 | 39.52 ± 0.35 |
If you monitor nvtop & rocm-smi you'll see that the GPU fills its VRAM and is mostly at full load during the benchmark.
You can also switch to the integrated GPU with HIP_VISIBLE_DEVICES=1, use both with HIP_VISIBLE_DEVICES="0,1", or set -n / -ngl to 0; all of these perform worse, especially on the output-token metric tg128.
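For comparison, the same benchmark can be pointed at the iGPU, at both GPUs, or at the CPU only (same flags as above, just different device visibility):
# iGPU only
HIP_VISIBLE_DEVICES=1 ./llama-bench -m qwen3:4b -ngl 99
# both GPUs
HIP_VISIBLE_DEVICES="0,1" ./llama-bench -m qwen3:4b -ngl 99
# CPU-only baseline, no layers offloaded
./llama-bench -m qwen3:4b -ngl 0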