On my setup it was surprisingly hard to verify that a model is fully loaded and running on the dedicated GPU. The GPU can show some activity, or the model can be only partially loaded while the CPU still does most of the work, and on Ubuntu you'll need a handful of tools to verify ROCm performance.
This is mostly a reminder for myself, but might prove useful if you have the same setup:
$ lscpu | grep -i model
Model name: AMD Ryzen 9 6900HS with Radeon Graphics
$ nvtop
Device 0 [AMD Radeon RX 6700S] PCIe GEN 4@ 8x
Device 1 [AMD Radeon Graphics] Integrated GPU
$ uname -a
Linux vs-ubuntu 6.11.0-19-generic #19~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Feb 17 11:51:52 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
As the llama.cpp docs suggest, I followed the official AMD guide; ignore the compatibility list, it works fine:
https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.4.2/install/quick-start.html
Verify the installation with: https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.4.2/install/post-install.html
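The post-install checks boil down to something like this (the grep is just my shortcut, not from the guide; on my setup both the iGPU and the RX 6700S show up as agents):
rocminfo | grep -i "marketing name"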
If at the end of this you don't see your device in the rocm-smi output below, fix that before proceeding:
rocm-smi
========================================== ROCm System Management Interface ==========================================
==================================================== Concise Info ====================================================
Device Node IDs Temp Power Partitions SCLK MCLK Fan Perf PwrCap VRAM% GPU%
(DID, GUID) (Edge) (Avg) (Mem, Compute, ID)
======================================================================================================================
0 1 0x73ef, 3931 57.0°C 5.0W N/A, N/A, 0 700Mhz 96Mhz 100.0% auto 100.0W 75% 0%
1 2 0x1681, 52857 55.0°C 18.0W N/A, N/A, 0 N/A 2400Mhz 0% auto N/A 83% 0%
======================================================================================================================
================================================ End of ROCm SMI Log =================================================
Since llama.cpp is big and bulky, it makes sense to first verify that HIP works at all. ROCm ships with some C++ samples for HIP, and the README is helpful for building and running them:
ls /opt/rocm/share/hip/samples/
0_Intro 1_Utils 2_Cookbook CMakeLists.txt common packaging README.md
But I found them very minimalistic and too small to actually see VRAM and GPU usage fill up.
Grok helped me write a large matrix multiplication that actually fills the 8 GB of VRAM: stress_hip.cpp. The matrix sizes are adjustable with:
int M = 23000;
int N = 23000;
int K = 23000;
Make sure to set HIP_DEVICE_ID to the device number reported by rocm-smi.
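The original stress_hip.cpp isn't reproduced in this post; a minimal sketch of the same idea, assuming a naive HIP GEMM kernel over three large square matrices (my reconstruction, not the actual file), looks roughly like this:
// stress_hip_sketch.cpp - reconstruction of the idea, not the original file:
// allocate three large matrices and run a naive GEMM so VRAM and GPU%
// are clearly visible in rocm-smi / nvtop
#include <hip/hip_runtime.h>
#include <cstdio>

#define HIP_CHECK(expr) do { hipError_t e = (expr); if (e != hipSuccess) { \
    fprintf(stderr, "HIP error %s at line %d\n", hipGetErrorString(e), __LINE__); return 1; } } while (0)

__global__ void gemm_naive(const float *A, const float *B, float *C, int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[(size_t)row * K + k] * B[(size_t)k * N + col];
        C[(size_t)row * N + col] = acc;
    }
}

int main() {
    const int HIP_DEVICE_ID = 0;               // device number from rocm-smi
    const int M = 23000, N = 23000, K = 23000; // ~2.1 GB per float matrix
    HIP_CHECK(hipSetDevice(HIP_DEVICE_ID));

    float *dA, *dB, *dC;
    HIP_CHECK(hipMalloc(&dA, (size_t)M * K * sizeof(float)));
    HIP_CHECK(hipMalloc(&dB, (size_t)K * N * sizeof(float)));
    HIP_CHECK(hipMalloc(&dC, (size_t)M * N * sizeof(float)));
    // fill the inputs with non-zero bytes; the values don't matter for a stress test
    HIP_CHECK(hipMemset(dA, 1, (size_t)M * K * sizeof(float)));
    HIP_CHECK(hipMemset(dB, 1, (size_t)K * N * sizeof(float)));

    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (M + block.y - 1) / block.y);
    gemm_naive<<<grid, block>>>(dA, dB, dC, M, N, K);
    HIP_CHECK(hipGetLastError());
    HIP_CHECK(hipDeviceSynchronize()); // keeps the GPU at 100% for a good while

    printf("done\n");
    hipFree(dA); hipFree(dB); hipFree(dC);
    return 0;
}
Three 23000x23000 float matrices are about 2.1 GB each, roughly 6.3 GB in total, so together with runtime overhead they fill most of the 8 GB card.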
Compile and run with the HIP compiler:
hipcc stress_hip.cpp -o stress_hip --offload-arch=gfx1030
./stress_hip
You can monitor the VRAM and GPU% with:
watch rocm-smi
# second terminal
nvtop
Now that we know HIP works and the correct device is being used, we can move on to llama.cpp:
https://github.com/ggml-org/llama.cpp/blob/baad94885df512bb24ab01e2b22d1998fce4d00e/docs/build.md#hip
Compile llama.cpp with HIP and run qwen3 on the CPU/GPU:
git clone git@github.com:ggml-org/llama.cpp.git
cd llama.cpp
export ROCM_ARCH="gfx1030" # I target gfx1030 although the 6700S is gfx1032, which isn't on ROCm's official support list; the HSA_OVERRIDE_GFX_VERSION export below makes it run the gfx1030 binaries
# DEBUG BUILD, if needed
#HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)"
#cmake -S . -B build -DGGML_HIP=ON -DLLAMA_ALL_WARNINGS=ON -DAMDGPU_TARGETS=$ROCM_ARCH -DCMAKE_BUILD_TYPE=Debug
#cmake --build build --config Debug -j$(nproc)
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)"
cmake -S . -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=$ROCM_ARCH -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
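# optional sanity check: newer llama.cpp builds have a --list-devices flag
# (check --help if yours doesn't); it should report the ROCm device(s)
./build/bin/llama-cli --list-devices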
cd build/bin
export HIP_VISIBLE_DEVICES=0
export HSA_OVERRIDE_GFX_VERSION=10.3.0
# this downloads the model magically https://ollama.com/library/qwen3:4b
./llama-run qwen3:4b
# exit
# ctrl+c
This minor detail cost me a lot of time: it's only ever mentioned there and in some Dockerfiles, but if you don't define the number of layers to offload to the GPU with -n / -ngl, it defaults to running 100% on the CPU. There will still be a minor increase in VRAM usage, which will confuse you!
Verbose output shows which device is visible and loaded and how many layers are offloaded to the GPU; if you try to load too many layers of a 14B model with only 8 GB of VRAM, you'll get an OOM error.
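So for interactive use, pass the layer count explicitly to llama-run (using the -n / --ngl option mentioned above; 99 effectively means "offload everything that fits"):
./llama-run --ngl 99 qwen3:4b
The benchmark below does the same with llama-bench: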
export HIP_VISIBLE_DEVICES=0
export HSA_OVERRIDE_GFX_VERSION=10.3.0
./llama-bench -m qwen3\:4b -ngl 99 -v
# ---
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 6700S, gfx1030 (0x1030), VMM: no, Wave Size: 32
# depending on the model, you'll see how many layers are actually loaded into VRAM
llama_kv_cache_unified: layer 0: dev = ROCm0
llama_kv_cache_unified: layer 1: dev = ROCm0
llama_kv_cache_unified: layer 2: dev = ROCm0
llama_kv_cache_unified: layer 3: dev = ROCm0
llama_kv_cache_unified: layer 4: dev = ROCm0
llama_kv_cache_unified: layer 5: dev = ROCm0
llama_kv_cache_unified: layer 6: dev = ROCm0
llama_kv_cache_unified: layer 7: dev = ROCm0
llama_kv_cache_unified: layer 8: dev = ROCm0
llama_kv_cache_unified: layer 9: dev = ROCm0
llama_kv_cache_unified: layer 10: dev = ROCm0
...
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | tg128 | 43.27 ± 0.12 |
llama_perf_context_print: load time = 6925.96 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 641 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 21717.15 ms / 642 tokens
llama_perf_context_print: graphs reused = 0
# more readable without -v:
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3 4B Q4_K - Medium | 2.44 GiB | 4.02 B | ROCm | 99 | pp512 | 601.23 ± 8.72 |
| qwen3 4B Q4_K - Medium | 2.44 GiB | 4.02 B | ROCm | 99 | tg128 | 39.52 ± 0.35 |
If you monitor nvtop & rocm-smi you'll see that the GPU fills its VRAM and is mostly at full load during the benchmark.
You can also switch to the integrated GPU with HIP_VISIBLE_DEVICES=1, use both with HIP_VISIBLE_DEVICES="0,1", or set -n / -ngl to 0; all of these perform worse, especially on the output-token metric tg128.
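For comparison, the same benchmark can be pointed at the iGPU, at both GPUs, or at the CPU only (same flags as above, just different device visibility):
# iGPU only
HIP_VISIBLE_DEVICES=1 ./llama-bench -m qwen3:4b -ngl 99
# both GPUs
HIP_VISIBLE_DEVICES="0,1" ./llama-bench -m qwen3:4b -ngl 99
# CPU-only baseline, no layers offloaded
./llama-bench -m qwen3:4b -ngl 0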