@danielrosehill
Created November 23, 2025 16:08
A Dummy's Guide to AMD GPU Issues on Linux - Understanding RDNA3, TLB Fences, and Kernel Parameters

A Dummy's Guide to AMD GPU Issues on Linux

A beginner-friendly guide to understanding and fixing AMD GPU crashes, freezes, and instability on Linux.

Common Symptoms

You might be experiencing AMD GPU issues if you see:

  • System freezes/crashes randomly
  • Black screens
  • "GPU hung" or "fence timeout" messages in logs
  • Display flickering or artifacts
  • Messages about "overdrive" or "power management"
  • Applications crash when using GPU acceleration

Understanding the Terminology

GPU Architecture Terms

RDNA/RDNA2/RDNA3: AMD's GPU architecture generations

  • RDNA: RX 5000 series (e.g., RX 5700 XT)
  • RDNA2: RX 6000 series (e.g., RX 6800 XT)
  • RDNA3: RX 7000 series (e.g., RX 7900 XTX, RX 7700 XT)

Navi 10/21/23/31/32/33: Code names for specific GPU chips

  • Navi 32 = RX 7700 XT / 7800 XT
  • Navi 33 = RX 7600
  • Navi 31 = RX 7900 XTX / XT

GFX Version: Internal GPU identifier used by the driver and ROCm (e.g., gfx1100 for Navi 31, gfx1101 for Navi 32)

Technical Terms Simplified

DMA (Direct Memory Access): How the GPU accesses system memory without involving the CPU. Think of it as a direct highway between GPU and RAM.

TLB (Translation Lookaside Buffer): A cache that translates memory addresses. Like a phone book for memory locations.

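Fence Timeout: A fence is a marker the driver uses to tell when the GPU has finished a task. A fence timeout happens when the GPU fails to signal that marker before the deadline. The system waits... and waits... and eventually gives up, which you see as a hang, a GPU reset, or a crash.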

TLB Fence Timeout: The specific problem where the GPU can't complete memory translation tasks in time. This is a known bug in RDNA3 GPUs on certain Linux kernels.

IOMMU (Input-Output Memory Management Unit): Hardware that manages memory access for devices. Sometimes causes conflicts with AMD GPUs.

SMU (System Management Unit): Firmware that controls GPU power, clocks, and thermal management.

Power DPM (Dynamic Power Management): System that adjusts GPU clock speeds and voltage based on workload.

Overdrive: AMD's name for its manual overclocking and tuning controls (clocks, voltage, fan behavior). When the logs say "overdrive enabled," it usually just means those controls are unlocked, not that the GPU is actually overclocked.
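
To make the fence timeout idea above concrete, here is a toy analogy in shell. It is not how the driver is implemented: the "fence" is just a file that some other task is expected to create, and wait_for_fence is a hypothetical helper that gives up after a deadline.

```shell
# Toy analogy only: wait for a marker file (the "fence") to appear,
# and give up once the deadline passes.
# Usage: wait_for_fence FILE TIMEOUT_SECONDS
wait_for_fence() {
    fence=$1
    timeout=$2
    waited=0
    while [ ! -e "$fence" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            # This is the moment the real driver would log a timeout.
            echo "fence timeout after ${timeout}s" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

If the file shows up in time the function returns success; otherwise it reports a timeout, which is roughly the situation the kernel log messages describe.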

Driver/Firmware Terms

AMDGPU: The open-source Linux kernel driver for modern AMD GPUs

ROCm: AMD's compute platform for GPU computing (like CUDA for Nvidia)

Mesa: The open-source graphics stack that implements OpenGL, Vulkan, etc.

Firmware: Low-level software that runs on the GPU itself

Diagnosing Your GPU Issues

Step 1: Check Your Logs

# Check kernel messages for GPU errors
sudo dmesg | grep -i "amdgpu\|gpu\|fence\|timeout" | tail -50

# Check system logs
sudo journalctl -b -0 --no-pager | grep -i "amdgpu\|gpu hung\|fence" | tail -50

Step 2: What to Look For

TLB Fence Issues (RDNA3 specific):

amdgpu_tlb_fence_work
dma_fence_wait_timeout
Trying to push to a killed entity

→ This usually points to a kernel bug, not a hardware problem

Power Management Issues:

amdgpu: GPU recovery enabled
runtime pm
gfx_off

→ Power features causing instability

Firmware Mismatches:

SMU driver if version not matched

→ Driver and firmware versions don't match

Display Issues:

DC (Display Core)
DMUB (Display Microcontroller)

→ Display subsystem problems
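
The categories above can be turned into a rough triage helper. This is a minimal sketch: classify_amdgpu_log is a hypothetical function name, the patterns are copied from the lists above, and "Display Core" as a literal search string is an assumption about how display messages are worded.

```shell
# Count log lines per category, reading the log from standard input.
classify_amdgpu_log() {
    awk '
        /amdgpu_tlb_fence_work|dma_fence_wait_timeout|Trying to push to a killed entity/ { tlb++; next }
        /GPU recovery|runtime pm|gfx_off/   { power++; next }
        /SMU driver if version not matched/ { fw++; next }
        /DMUB|Display Core/                 { display++; next }
        END {
            # Counters that were never incremented print as 0.
            printf "tlb_fence=%d power=%d firmware=%d display=%d\n", tlb, power, fw, display
        }'
}

# Example usage (may need root to read the kernel log):
#   sudo dmesg | classify_amdgpu_log
```

A nonzero count only tells you where to look first; read the matching lines themselves before changing anything.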

Step 3: Identify Your GPU

# Get GPU info
lspci | grep -i vga

# ROCm info (if installed)
rocm-smi --showproductname

# Check GFX version (if ROCm is installed)
rocminfo | grep -i gfx

Common AMD GPU Problems

1. TLB Fence Timeouts (RDNA3)

Problem: GPU freezes, system hangs, "fence timeout" in logs
Affected: Mainly RX 7000 series on kernels 6.14-6.17
Cause: Kernel bug in memory management
Solution: Kernel parameters (see below)

2. Power Management Crashes

Problem: System freezes during idle or wake from sleep
Affected: Most AMD GPUs
Cause: Aggressive power saving features
Solution: Disable runtime PM and GFX off

3. Display Flickering/Black Screens

Problem: Random black screens, flickering
Affected: Multi-monitor setups, high refresh rate
Cause: Display Core (DC) bugs
Solution: DC-specific kernel parameters

4. Overheating/Throttling

Problem: GPU performance drops, thermal warnings
Affected: All GPUs with inadequate cooling
Cause: Poor airflow, dust, faulty firmware
Solution: Physical cleaning, firmware update, custom fan curves

Kernel Parameters Explained

Kernel parameters are settings you pass to the Linux kernel at boot. On distributions that boot with GRUB, you add them to the GRUB_CMDLINE_LINUX_DEFAULT line in /etc/default/grub and then regenerate the GRUB configuration.

Critical Parameters for Stability

amdgpu.tmz=0

  • What: Disables Trusted Memory Zone
  • Why: TMZ has bugs on RDNA3, causes freezes
  • When to use: RDNA3 GPUs with random crashes

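amdgpu.sg_display=0

  • What: Disables scatter-gather display (scanning the display out of system memory)
  • Why: Reported to reduce flicker and DMA fence timeouts
  • When to use: Display issues, TLB fence timeouts. The kernel documentation describes this option as relevant to APUs, so it may have no effect on a discrete card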

amdgpu.dcdebugmask=0x10

  • What: Sets a Display Core debug flag; 0x10 disables Panel Self Refresh (PSR)
  • Why: PSR can cause flicker, stutter, or hangs on some panels
  • When to use: Display-related freezes or flicker

iommu=soft

  • What: Uses software bounce buffering (SWIOTLB) instead of the hardware IOMMU
  • Why: Hardware IOMMU can conflict with AMD GPUs
  • When to use: DMA/fence timeout issues

amdgpu.gpu_recovery=1

  • What: Enables automatic GPU recovery after hangs
  • Why: GPU can reset itself instead of crashing system
  • When to use: Always recommended

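amdgpu.gfx_off=0

  • What: Intended to disable GFXOFF, the power-gating state that shuts the graphics engine off when idle
  • Why: GFX off state causes crashes on some GPUs
  • When to use: Idle crashes, wake-from-sleep issues
  • Caveat: Confirm your kernel recognizes this option with modinfo -p amdgpu | grep gfx_off. If it is not listed, the kernel ignores it, and GFXOFF is controlled through bit 0x8000 of amdgpu.ppfeaturemask instead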

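amdgpu.runpm=0

  • What: Disables runtime power management (powering the GPU down while it is idle)
  • Why: Runtime PM can cause suspend/resume crashes
  • When to use: Sleep/wake issues
  • Note: The option is spelled runpm. amdgpu.runtime_pm does not appear to be a recognized option, so check modinfo -p amdgpu before relying on it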

amdgpu.ppfeaturemask=0xffffffff

  • What: Enables all PowerPlay features, including Overdrive
  • Why: Sometimes disabling features causes more problems
  • When to use: When conservative settings don't work. You may see an "Overdrive is enabled" notice in the kernel log afterwards

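amdgpu.dc=0

  • What: Disables Display Core (uses legacy display code)
  • Why: DC has bugs; the legacy path can be more stable on older GPUs
  • When to use: Last resort for display issues (loses features). Newer GPUs, RDNA included, may have no legacy display path, so this can leave you without display output; keep a way to boot without it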
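
amdgpu.ppfeaturemask is a bit mask, so individual features can be switched on or off with a little arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and the bit values (0x4000 for Overdrive, 0x8000 for GFXOFF) and the starting mask 0xfff7bfff are assumptions based on the kernel's PP_FEATURE_MASK definitions that you should verify against your own kernel source.

```shell
# Assumed bit values; check your kernel's PP_FEATURE_MASK enum.
PP_OVERDRIVE_BIT=0x4000
PP_GFXOFF_BIT=0x8000

# clear_feature_bit MASK BIT: print MASK with BIT cleared, as 32-bit hex.
clear_feature_bit() {
    printf '0x%08x\n' $(( $1 & ~$2 & 0xffffffff ))
}

# has_feature_bit MASK BIT: succeed if BIT is set in MASK.
has_feature_bit() {
    [ $(( $1 & $2 )) -ne 0 ]
}

# Example: start from an assumed default mask and turn GFXOFF off.
clear_feature_bit 0xfff7bfff "$PP_GFXOFF_BIT"   # prints 0xfff73fff
```

The mask currently in use can be read from /sys/module/amdgpu/parameters/ppfeaturemask, which is a safer starting point than a value copied from a guide.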

Example GRUB Configuration

Edit /etc/default/grub:

# Use ONE of the two lines below; a second assignment overrides the first.

# For RDNA3 TLB fence issues
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amdgpu.tmz=0 amdgpu.sg_display=0 amdgpu.dcdebugmask=0x10 iommu=soft"

# For general stability
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amdgpu.gpu_recovery=1 amdgpu.gfx_off=0 amdgpu.runpm=0"

# After editing, update GRUB (Debian/Ubuntu; other distros use grub-mkconfig or grub2-mkconfig):
sudo update-grub
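
If you would rather script the edit than do it by hand, something like the following could work. It is a sketch under assumptions: add_kernel_param is a hypothetical helper, it expects a single double-quoted GRUB_CMDLINE_LINUX_DEFAULT line, it relies on GNU sed's -i option, and it does not escape unusual characters such as | or & in the parameter.

```shell
# add_kernel_param FILE PARAM
# Append PARAM to GRUB_CMDLINE_LINUX_DEFAULT in FILE unless already present.
add_kernel_param() {
    file=$1
    param=$2
    # Bail out if the file has no such line to edit.
    grep -q '^GRUB_CMDLINE_LINUX_DEFAULT="' "$file" || return 1
    # Extract the current value between the double quotes.
    current=$(sed -n 's/^GRUB_CMDLINE_LINUX_DEFAULT="\(.*\)"$/\1/p' "$file")
    case " $current " in
        *" $param "*) return 0 ;;   # already there, nothing to do
    esac
    new="${current:+$current }$param"
    sed -i "s|^GRUB_CMDLINE_LINUX_DEFAULT=\".*\"\$|GRUB_CMDLINE_LINUX_DEFAULT=\"$new\"|" "$file"
}

# Example usage (back up the file first, then regenerate the GRUB config):
#   sudo cp /etc/default/grub /etc/default/grub.bak
#   add_kernel_param /etc/default/grub amdgpu.gpu_recovery=1   # run as root
```

Running it twice with the same parameter leaves the line unchanged, which makes it safe to re-run while experimenting.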

Step-by-Step Troubleshooting

Level 1: Information Gathering

  1. Identify your GPU:

    lspci | grep VGA
  2. Check kernel version:

    uname -r
  3. Check for errors in logs:

    sudo dmesg | grep -i amdgpu | tail -50
    sudo journalctl -b -0 | grep -i "fence\|timeout" | tail -20
  4. Check driver version:

    modinfo amdgpu | grep version
  5. Check firmware version:

    sudo dmesg | grep "smu fw version"
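
The information-gathering steps above can be bundled into one report that is easy to attach to a bug report. This is a rough sketch: gpu_report is a hypothetical function, and it skips tools that are missing rather than failing. Reading the kernel log may require root on your system, in which case those sections come out empty.

```shell
# Print kernel version, GPU model, and recent GPU-related log lines.
gpu_report() {
    echo "== Kernel =="
    uname -r

    echo "== GPU =="
    if command -v lspci >/dev/null 2>&1; then
        lspci | grep -i vga || echo "(no VGA device listed)"
    else
        echo "(lspci not installed)"
    fi

    echo "== amdgpu kernel messages (last 50) =="
    # Empty if dmesg is restricted to root.
    dmesg 2>/dev/null | grep -i amdgpu | tail -n 50

    echo "== fence/timeout messages (last 20) =="
    if command -v journalctl >/dev/null 2>&1; then
        journalctl -b -0 --no-pager 2>/dev/null | grep -i "fence\|timeout" | tail -n 20
    fi
    return 0
}

# Example usage:
#   gpu_report > gpu-report.txt
```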

Level 2: Basic Fixes

  1. Update everything:

    sudo apt update && sudo apt upgrade
    sudo apt install linux-firmware
  2. Try a different kernel:

    • Reboot and select an older kernel from GRUB menu
    • For RDNA3: kernel 6.11.x often more stable than 6.14+
  3. Add basic stability parameters:

    sudo nano /etc/default/grub
    # Add to GRUB_CMDLINE_LINUX_DEFAULT:
    amdgpu.gpu_recovery=1 amdgpu.gfx_off=0
    
    sudo update-grub
    sudo reboot

Level 3: Advanced Fixes

  1. For TLB Fence Timeouts (RDNA3):

    # Add these parameters:
    amdgpu.tmz=0 amdgpu.sg_display=0 amdgpu.dcdebugmask=0x10 iommu=soft
  2. For Power Management Issues:

    # Add these parameters:
    amdgpu.runpm=0 amdgpu.gfx_off=0
  3. For Display Issues:

    # Try these one at a time:
    amdgpu.dcdebugmask=0x10
    amdgpu.dc=0  # Last resort - loses features
  4. Create a modprobe config (alternative to kernel parameters):

    sudo nano /etc/modprobe.d/amdgpu.conf

    Add:

    options amdgpu gpu_recovery=1
    options amdgpu gfx_off=0
    options amdgpu tmz=0
    

    Then:

    sudo update-initramfs -u
    sudo reboot
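
The kernel ignores module options it does not recognize, so a typo in the modprobe config can go unnoticed. A possible sanity check is sketched below. unknown_amdgpu_options is a hypothetical helper; it assumes the parameter list was saved from modinfo -p amdgpu, whose output lines start with the parameter name followed by a colon.

```shell
# unknown_amdgpu_options CONF PARAMS
# Print option names from "options amdgpu ..." lines in CONF that are
# not present in PARAMS (a saved copy of `modinfo -p amdgpu`).
unknown_amdgpu_options() {
    conf=$1
    params=$2
    # Emit each name=value token on its own line.
    awk '$1 == "options" && $2 == "amdgpu" { for (i = 3; i <= NF; i++) print $i }' "$conf" |
    while IFS='=' read -r name _value; do
        grep -q "^${name}:" "$params" || printf '%s\n' "$name"
    done
}

# Example usage:
#   modinfo -p amdgpu > /tmp/amdgpu-params.txt
#   unknown_amdgpu_options /etc/modprobe.d/amdgpu.conf /tmp/amdgpu-params.txt
```

Any name it prints is worth double-checking before you rebuild the initramfs.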

Level 4: Nuclear Options

If nothing else works:

  1. Try the proprietary driver (AMDGPU-PRO):

    • Not recommended for gaming
    • Better for compute workloads
    • Download from AMD website
  2. Downgrade to an older kernel:

    # Install older kernel
    sudo apt install linux-image-6.11.0-8-generic
    # Boot into it from GRUB menu
  3. File a bug report:

Monitoring and Verification

Check if Parameters Are Active

# See active kernel parameters
cat /proc/cmdline

# Check specific amdgpu parameter
cat /sys/module/amdgpu/parameters/gpu_recovery
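
To check a single value without reading the whole command line by eye, a small parser can help. This is a sketch: cmdline_value is a hypothetical helper that reads /proc/cmdline by default and prints the last value given for a parameter, on the assumption that a later occurrence is the one that takes effect.

```shell
# cmdline_value NAME [FILE]
# Print the value of NAME=... from the kernel command line (last one wins).
cmdline_value() {
    name=$1
    file=${2:-/proc/cmdline}
    # One token per line, then pick tokens that start with "NAME=".
    tr -s ' ' '\n' < "$file" |
    awk -v n="$name=" '
        index($0, n) == 1 { v = substr($0, length(n) + 1); found = 1 }
        END { if (found) print v }'
}

# Example usage:
#   cmdline_value amdgpu.gpu_recovery
```

An empty result means the parameter was not passed at boot, which usually points to a GRUB config that was edited but not regenerated.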

Monitor GPU Status

# Watch GPU clocks and temperature (ROCm)
watch -n 1 rocm-smi

# Check power management state (use card1, card2, ... if the AMD GPU is not card0)
cat /sys/class/drm/card0/device/power_dpm_force_performance_level

# Check current GPU clocks
cat /sys/class/drm/card0/device/pp_dpm_sclk
cat /sys/class/drm/card0/device/pp_dpm_mclk
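
The sysfs paths above assume the AMD GPU is card0, which is not always the case on systems with more than one GPU. The sketch below shows one way to locate the right directory and read the active clock. Both helpers are hypothetical, and active_clock assumes the pp_dpm_* files mark the current level with a trailing asterisk.

```shell
# find_amdgpu_cards [ROOT]
# Print the device directory of every DRM card bound to the amdgpu driver.
find_amdgpu_cards() {
    root=${1:-/sys/class/drm}
    for dev in "$root"/card[0-9]*/device; do
        # Connector entries such as card0-DP-1 have no driver link; skip them.
        [ -e "$dev/driver" ] || continue
        driver=$(basename "$(readlink -f "$dev/driver")")
        [ "$driver" = "amdgpu" ] && printf '%s\n' "$dev"
    done
    return 0
}

# active_clock FILE
# Print the clock from the line marked with "*" in a pp_dpm_* file.
active_clock() {
    awk '/\*[[:space:]]*$/ { print $2 }' "$1"
}

# Example usage:
#   for dev in $(find_amdgpu_cards); do active_clock "$dev/pp_dpm_sclk"; done
```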

Test Stability

# OpenGL smoke test (light load; confirms rendering works)
glxgears -fullscreen

# Vulkan smoke test
vkcube

# Watch temperature, power, and clocks while a workload runs (if ROCm installed)
rocm-smi --showtemp --showpower --showclocks

Common Myths Debunked

Myth: "Overdrive causes crashes"
Reality: Overdrive is AMD's name for its manual overclocking controls. A log message saying it is enabled is usually harmless unless you have actually changed clocks or voltages.

Myth: "AMD GPUs don't work on Linux"
Reality: They work great! RDNA3 just has some kernel bugs being fixed.

Myth: "You need proprietary drivers"
Reality: The open-source AMDGPU driver is excellent and recommended.

Myth: "Lowering clocks fixes stability"
Reality: Usually doesn't help. Most issues are driver/kernel bugs, not hardware limits.

Myth: "More power management = better"
Reality: Aggressive power saving often causes more crashes than it's worth.

Getting Help

  1. Check logs first - most of the diagnosis comes from reading dmesg/journalctl
  2. Search existing issues - Your problem is probably known
  3. Provide details:
    • GPU model (exact SKU)
    • Kernel version
    • Driver version
    • Full dmesg output showing the error
    • Steps to reproduce

Summary: Quick Reference

Problem                    First Try                         If That Fails
TLB fence timeout (RDNA3)  amdgpu.tmz=0 amdgpu.sg_display=0  Try kernel 6.11.x
Sleep/wake crashes         amdgpu.runpm=0 amdgpu.gfx_off=0   Add amdgpu.gpu_recovery=1
Display flickering         amdgpu.dcdebugmask=0x10           Try amdgpu.dc=0
Random freezes             amdgpu.gpu_recovery=1             Add iommu=soft
Poor performance           Check thermals                    Update firmware

Note: This guide is based on real-world troubleshooting of RDNA3 GPU issues on Linux. Always back up your system before making changes, and remember that kernel/driver bugs get fixed over time - sometimes just waiting for updates is the best solution.

Disclaimer: Information provided is for educational purposes. The author is not responsible for any system instability or data loss. Always maintain backups and test changes carefully.


Generated by Claude Code - Verify all technical information before applying to production systems.
