A beginner-friendly guide to understanding and fixing AMD GPU crashes, freezes, and instability on Linux.
- Common Symptoms
- Understanding the Terminology
- Diagnosing Your GPU Issues
- Common AMD GPU Problems
- Kernel Parameters Explained
- Step-by-Step Troubleshooting
You might be experiencing AMD GPU issues if you see:
- System freezes/crashes randomly
- Black screens
- "GPU hung" or "fence timeout" messages in logs
- Display flickering or artifacts
- Messages about "overdrive" or "power management"
- Applications crash when using GPU acceleration
RDNA/RDNA2/RDNA3: AMD's GPU architecture generations
- RDNA: RX 5000 series (e.g., RX 5700 XT)
- RDNA2: RX 6000 series (e.g., RX 6800 XT)
- RDNA3: RX 7000 series (e.g., RX 7900 XTX, RX 7700 XT)
Navi 10/21/23/31/32/33: Code names for specific GPU chips
- Navi 32 = RX 7700 XT / 7800 XT
- Navi 33 = RX 7600
- Navi 31 = RX 7900 XTX / XT
GFX Version: Internal GPU identifier (e.g., gfx1100 for Navi 31, gfx1101 for Navi 32, gfx1102 for Navi 33)
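To make the relationship between these names concrete, here is a small sketch that maps a GFX version string to its chip code name. The helper name `gfx_to_chip` is hypothetical, and the table covers only a handful of chips, so treat it as an illustration rather than a complete lookup.

```shell
#!/usr/bin/env bash
# Map a GFX version string to the chip code name and example cards.
# gfx_to_chip is a hypothetical helper; the table is deliberately incomplete.
gfx_to_chip() {
    case "$1" in
        gfx1010) echo "Navi 10 (RDNA, e.g. RX 5700 XT)" ;;
        gfx1030) echo "Navi 21 (RDNA2, e.g. RX 6800 XT)" ;;
        gfx1032) echo "Navi 23 (RDNA2)" ;;
        gfx1100) echo "Navi 31 (RDNA3, RX 7900 XTX / XT)" ;;
        gfx1101) echo "Navi 32 (RDNA3, RX 7700 XT / 7800 XT)" ;;
        gfx1102) echo "Navi 33 (RDNA3, RX 7600)" ;;
        *)       echo "unknown"; return 1 ;;
    esac
}

gfx_to_chip gfx1101
```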
DMA (Direct Memory Access): How the GPU accesses system memory without involving the CPU. Think of it as a direct highway between GPU and RAM.
TLB (Translation Lookaside Buffer): A cache that translates memory addresses. Like a phone book for memory locations.
Fence Timeout: When the GPU promises to finish a task by a deadline but fails to do so. The system waits... and waits... and eventually gives up, causing a crash.
TLB Fence Timeout: The specific problem where the GPU can't complete memory translation tasks in time. This is a known kernel bug that affects RDNA3 GPUs on certain Linux kernels.
IOMMU (Input-Output Memory Management Unit): Hardware that manages memory access for devices. Sometimes causes conflicts with AMD GPUs.
SMU (System Management Unit): Firmware that controls GPU power, clocks, and thermal management.
Power DPM (Dynamic Power Management): System that adjusts GPU clock speeds and voltage based on workload.
Overdrive: AMD's term for its overclocking and tuning features. A log message saying overdrive is enabled usually just means those controls are unlocked, not that the GPU is actually overclocked.
AMDGPU: The open-source Linux kernel driver for modern AMD GPUs
ROCm: AMD's compute platform for GPU computing (like CUDA for Nvidia)
Mesa: The open-source graphics stack that implements OpenGL, Vulkan, etc.
Firmware: Low-level software that runs on the GPU itself
# Check kernel messages for GPU errors
sudo dmesg | grep -i "amdgpu\|gpu\|fence\|timeout" | tail -50
# Check system logs
sudo journalctl -b -0 --no-pager | grep -i "amdgpu\|gpu hung\|fence" | tail -50
TLB Fence Issues (RDNA3 specific):
amdgpu_tlb_fence_work
dma_fence_wait_timeout
Trying to push to a killed entity
→ This is a kernel bug, not a hardware problem
Power Management Issues:
amdgpu: GPU recovery enabled
runtime pm
gfx_off
→ Power features causing instability
Firmware Mismatches:
SMU driver if version not matched
→ Driver and firmware versions don't match
Display Issues:
DC (Display Core)
DMUB (Display Microcontroller)
→ Display subsystem problems
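The pattern groups above can be turned into a rough log filter. The sketch below tags each matching line with a category; `classify_gpu_log` is a hypothetical helper, it uses only a subset of the strings listed above, and matching is case-sensitive, so expect to adjust the patterns for your own logs.

```shell
#!/usr/bin/env bash
# Tag each log line read from stdin with the category it most likely belongs to.
# classify_gpu_log is a hypothetical helper; lines that match nothing are dropped.
classify_gpu_log() {
    local line
    while IFS= read -r line; do
        case "$line" in
            *amdgpu_tlb_fence_work*|*dma_fence_wait_timeout*|*"killed entity"*)
                echo "tlb-fence: $line" ;;
            *"runtime pm"*|*gfx_off*)
                echo "power: $line" ;;
            *"SMU driver if version not matched"*)
                echo "firmware: $line" ;;
            *DMUB*|*"Display Core"*)
                echo "display: $line" ;;
        esac
    done
}

# Typical use (dmesg needs root on many distributions):
#   sudo dmesg | classify_gpu_log
printf '%s\n' "amdgpu: SMU driver if version not matched" | classify_gpu_log
```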
# Get GPU info
lspci | grep -i vga
# ROCm info (if installed)
rocm-smi --showproductname
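# Check GFX version (X11 sessions only; the Xorg log may not record it)
grep "GFX Version" /var/log/Xorg.0.log
# With ROCm installed, rocminfo also lists the gfx name
rocminfo | grep -i gfx
Problem: GPU freezes, system hangs, "fence timeout" in logs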
Affected: Mainly RX 7000 series on kernels 6.14-6.17
Cause: Kernel bug in memory management
Solution: Kernel parameters (see below)
Problem: System freezes during idle or wake from sleep
Affected: Most AMD GPUs
Cause: Aggressive power saving features
Solution: Disable runtime PM and GFX off
Problem: Random black screens, flickering
Affected: Multi-monitor setups, high refresh rate
Cause: Display Core (DC) bugs
Solution: DC-specific kernel parameters
Problem: GPU performance drops, thermal warnings
Affected: All GPUs with inadequate cooling
Cause: Poor airflow, dust, faulty firmware
Solution: Physical cleaning, firmware update, custom fan curves
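Since the fence timeout problem above is tied to a kernel range, it can help to check the running kernel against it. The sketch below assumes the 6.14-6.17 range quoted in this guide; `kernel_in_reported_range` is a hypothetical helper, and the range itself may change as fixes land.

```shell
#!/usr/bin/env bash
# Return success if a kernel release string falls in the 6.14-6.17 range
# that this guide associates with RDNA3 fence timeouts.
# kernel_in_reported_range is a hypothetical helper name.
kernel_in_reported_range() {
    local release="${1:-$(uname -r)}"
    local major minor
    major="${release%%.*}"          # "6.15.3-generic" -> "6"
    minor="${release#*.}"           # -> "15.3-generic"
    minor="${minor%%[!0-9]*}"       # -> "15"
    [ "$major" = "6" ] && [ "${minor:-0}" -ge 14 ] && [ "${minor:-0}" -le 17 ]
}

if kernel_in_reported_range; then
    echo "Kernel $(uname -r) is in the 6.14-6.17 range"
else
    echo "Kernel $(uname -r) is outside the 6.14-6.17 range"
fi
```

Passing a release string as the first argument checks that string instead of the running kernel, which is handy when deciding which entry to pick from the GRUB menu.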
Kernel parameters are settings you pass to the Linux kernel at boot. On distributions that boot with GRUB, they're added to the GRUB_CMDLINE_LINUX_DEFAULT line in /etc/default/grub.
amdgpu.tmz=0
- What: Disables Trusted Memory Zone
- Why: TMZ has bugs on RDNA3, causes freezes
- When to use: RDNA3 GPUs with random crashes
amdgpu.sg_display=0
- What: Disables scatter-gather for display
- Why: Reduces DMA fence timeouts
- When to use: Display issues, TLB fence timeouts
amdgpu.dcdebugmask=0x10
- What: Disables Panel Self Refresh (PSR) in Display Core
- Why: PSR can cause flicker or hangs on some panels
- When to use: Display-related freezes
iommu=soft
- What: Uses software bounce buffering (SWIOTLB) instead of the hardware IOMMU
- Why: Hardware IOMMU can conflict with AMD GPUs
- When to use: DMA/fence timeout issues
amdgpu.gpu_recovery=1
- What: Enables automatic GPU recovery after hangs
- Why: GPU can reset itself instead of crashing system
- When to use: Always recommended
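amdgpu.gfx_off=0
- What: Intended to disable GFX power gating (GFXOFF)
- Why: The GFXOFF state causes crashes on some GPUs
- When to use: Idle crashes, wake-from-sleep issues
- Caveat: Check that your kernel accepts this parameter with modinfo -p amdgpu; if gfx_off is not listed, the kernel ignores it, and GFXOFF is instead controlled through the GFXOFF bit of amdgpu.ppfeaturemask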
amdgpu.runpm=0
- What: Disables runtime power management (the parameter is named runpm; runtime_pm is not a recognized amdgpu option)
- Why: Runtime PM causes suspend/resume crashes
- When to use: Sleep/wake issues
amdgpu.ppfeaturemask=0xffffffff
- What: Enables all power play features
- Why: Sometimes disabling features causes more problems
- When to use: When conservative settings don't work
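amdgpu.dc=0
- What: Disables Display Core (uses legacy display code)
- Why: DC has bugs; the legacy path can be more stable on GPUs that still have one
- When to use: Last resort for display issues (loses features). Legacy display code exists only for older GPU generations, so on RDNA cards this may leave you without display output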
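The ppfeaturemask value described above is a bitmask, so individual features can be switched off by clearing single bits. The sketch below shows the arithmetic. The bit values (0x4000 for overdrive, 0x8000 for GFXOFF) are an assumption taken from the kernel's amd_shared.h header as of recent kernels, and `clear_feature_bit` is a hypothetical helper; verify the values against your kernel source before relying on them.

```shell
#!/usr/bin/env bash
# Bit values assumed from the kernel's amd_shared.h (verify for your kernel).
PP_OVERDRIVE_MASK=$(( 0x4000 ))
PP_GFXOFF_MASK=$(( 0x8000 ))

# Print MASK with the given BIT cleared, as a 32-bit hex value.
# clear_feature_bit is a hypothetical helper name.
clear_feature_bit() {
    local mask=$(( $1 )) bit=$(( $2 ))
    printf '0x%08x\n' $(( mask & ~bit & 0xffffffff ))
}

# All features except GFXOFF:
clear_feature_bit 0xffffffff "$PP_GFXOFF_MASK"       # prints 0xffff7fff
# All features except overdrive:
clear_feature_bit 0xffffffff "$PP_OVERDRIVE_MASK"    # prints 0xffffbfff

# The mask currently in effect, if the amdgpu module is loaded:
cat /sys/module/amdgpu/parameters/ppfeaturemask 2>/dev/null || true
```

The resulting value would go on the kernel command line as amdgpu.ppfeaturemask=0xffff7fff, in the same way as the other parameters.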
Edit /etc/default/grub:
# For RDNA3 TLB fence issues
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amdgpu.tmz=0 amdgpu.sg_display=0 amdgpu.dcdebugmask=0x10 iommu=soft"
# For general stability (keep only one GRUB_CMDLINE_LINUX_DEFAULT line; merge parameters into it if you need both sets)
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amdgpu.gpu_recovery=1 amdgpu.gfx_off=0 amdgpu.runpm=0"
# After editing, update GRUB:
sudo update-grub
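Editing the GRUB line by hand is easy to get wrong, so here is a minimal sketch of a helper that appends one parameter only if it is missing. `add_kernel_param` is a hypothetical function; it assumes GNU sed and a single-line, double-quoted GRUB_CMDLINE_LINUX_DEFAULT as found in a stock Ubuntu /etc/default/grub. The demo runs against a temporary copy rather than the real file.

```shell
#!/usr/bin/env bash
# Append a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT unless already present.
# add_kernel_param is a hypothetical helper; assumes GNU sed and a one-line,
# double-quoted value.
add_kernel_param() {
    local param="$1" file="${2:-/etc/default/grub}"
    local current
    current=$(sed -n 's/^GRUB_CMDLINE_LINUX_DEFAULT="\(.*\)"$/\1/p' "$file")
    case " $current " in
        *" $param "*) return 0 ;;   # already present, nothing to do
    esac
    cp "$file" "$file.bak"          # keep a backup before editing
    sed -i "s|^GRUB_CMDLINE_LINUX_DEFAULT=\"\(.*\)\"\$|GRUB_CMDLINE_LINUX_DEFAULT=\"\1 $param\"|" "$file"
}

# Demo on a throwaway file instead of the real /etc/default/grub.
demo=$(mktemp)
echo 'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"' > "$demo"
add_kernel_param amdgpu.gpu_recovery=1 "$demo"
add_kernel_param amdgpu.gpu_recovery=1 "$demo"   # second call changes nothing
cat "$demo"
rm -f "$demo" "$demo.bak"
```

On a real system the helper would need root (for example via sudo bash), followed by sudo update-grub and a reboot.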
- Identify your GPU:
lspci | grep VGA
- Check kernel version:
uname -r
- Check for errors in logs:
sudo dmesg | grep -i amdgpu | tail -50
sudo journalctl -b -0 | grep -i "fence\|timeout" | tail -20
- Check driver version:
modinfo amdgpu | grep version
- Check firmware version:
sudo dmesg | grep "smu fw version"
- Update everything:
sudo apt update && sudo apt upgrade
sudo apt install linux-firmware
- Try a different kernel:
  - Reboot and select an older kernel from the GRUB menu
  - For RDNA3: kernel 6.11.x is often more stable than 6.14+
- Add basic stability parameters:
sudo nano /etc/default/grub
# Add to GRUB_CMDLINE_LINUX_DEFAULT: amdgpu.gpu_recovery=1 amdgpu.gfx_off=0
sudo update-grub
sudo reboot
- For TLB Fence Timeouts (RDNA3):
# Add these parameters:
amdgpu.tmz=0 amdgpu.sg_display=0 amdgpu.dcdebugmask=0x10 iommu=soft
- For Power Management Issues:
# Add these parameters:
amdgpu.runpm=0 amdgpu.gfx_off=0
- For Display Issues:
# Try these one at a time:
amdgpu.dcdebugmask=0x10
amdgpu.dc=0  # Last resort - loses features
- Create a modprobe config (alternative to kernel parameters):
sudo nano /etc/modprobe.d/amdgpu.conf
Add:
options amdgpu gpu_recovery=1
options amdgpu gfx_off=0
options amdgpu tmz=0
Then:
sudo update-initramfs -u
sudo reboot
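After rebooting, it is worth confirming that the options actually took effect, because the kernel ignores parameter names it does not recognize. A minimal sketch, assuming the usual sysfs location for module parameters; `show_module_params` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Print name=value for every readable parameter in a module's sysfs directory.
# show_module_params is a hypothetical helper; the default path is where a
# loaded amdgpu module exposes its parameters.
show_module_params() {
    local dir="${1:-/sys/module/amdgpu/parameters}" f
    [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
    for f in "$dir"/*; do
        [ -r "$f" ] && printf '%s=%s\n' "${f##*/}" "$(cat "$f")"
    done
    return 0
}

# Show only a few of the parameters discussed in this guide.
show_module_params | grep -E '^(gpu_recovery|tmz|runpm|sg_display)=' || true
```

A parameter that is missing from the output was not accepted by the driver, which usually points to a misspelled or unsupported name.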
If nothing else works:
- Try the proprietary driver (AMDGPU-PRO):
  - Not recommended for gaming
  - Better for compute workloads
  - Download from AMD website
- Downgrade to an older kernel:
# Install older kernel
sudo apt install linux-image-6.11.0-8-generic
# Boot into it from GRUB menu
- File a bug report:
  - Check existing bugs: https://gitlab.freedesktop.org/drm/amd/-/issues
  - Include: dmesg output, GPU model, kernel version, reproduction steps
# See active kernel parameters
cat /proc/cmdline
# Check specific amdgpu parameter
cat /sys/module/amdgpu/parameters/gpu_recovery
# Watch GPU clocks and temperature (ROCm)
watch -n 1 rocm-smi
# Check power management state
cat /sys/class/drm/card0/device/power_dpm_force_performance_level
# Check current GPU clocks
cat /sys/class/drm/card0/device/pp_dpm_sclk
cat /sys/class/drm/card0/device/pp_dpm_mclk
# OpenGL stress test
glxgears -fullscreen
# Vulkan stress test
vkcube
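# Monitor temperature, power, and clocks while a test runs (if ROCm installed)
rocm-smi --showtemp --showpower --showclocks
Myth: "Overdrive causes crashes"
Reality: Overdrive is AMD's name for its overclocking and tuning interface. The kernel message only reports that the feature is unlocked and is usually harmless.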
Myth: "AMD GPUs don't work on Linux"
Reality: They work great! RDNA3 just has some kernel bugs being fixed.
Myth: "You need proprietary drivers"
Reality: The open-source AMDGPU driver is excellent and recommended.
Myth: "Lowering clocks fixes stability"
Reality: Usually doesn't help. Most issues are driver/kernel bugs, not hardware limits.
Myth: "More power management = better"
Reality: Aggressive power saving often causes more crashes than it's worth.
- Check logs first - 90% of diagnosis is reading dmesg/journalctl
- Search existing issues - Your problem is probably known
- Provide details:
  - GPU model (exact SKU)
  - Kernel version
  - Driver version
  - Full dmesg output showing the error
  - Steps to reproduce
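The details listed above can be gathered in one pass. The sketch below writes them to a text file; `collect_gpu_report` and the default file name are hypothetical, and each command is treated as optional so a report is still produced when a tool is missing or needs root.

```shell
#!/usr/bin/env bash
# Gather kernel version, command line, GPU model, and amdgpu log lines
# into one text file for a bug report.
# collect_gpu_report is a hypothetical helper name.
collect_gpu_report() {
    local out="${1:-amdgpu-report.txt}"
    {
        echo "== Kernel version =="
        uname -r
        echo "== Kernel command line =="
        cat /proc/cmdline 2>/dev/null || true
        echo "== GPU model =="
        if command -v lspci >/dev/null 2>&1; then
            lspci -nn | grep -iE 'vga|display|3d' || true
        fi
        echo "== amdgpu kernel messages =="
        dmesg 2>/dev/null | grep -i amdgpu || true
    } > "$out"
    echo "Report written to $out"
}

collect_gpu_report
```

Run it with sudo if dmesg is restricted on your distribution, and add the reproduction steps by hand before attaching the file.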
- AMD GPU Linux Kernel Driver: https://gitlab.freedesktop.org/drm/amd
- Mesa Graphics: https://gitlab.freedesktop.org/mesa/mesa
- ROCm: https://github.com/RadeonOpenCompute/ROCm
- Arch Wiki (excellent resource): https://wiki.archlinux.org/title/AMDGPU
- Ubuntu AMD GPU Guide: https://help.ubuntu.com/community/RadeonDriver
| Problem | First Try | If That Fails |
|---|---|---|
| TLB fence timeout (RDNA3) | amdgpu.tmz=0 amdgpu.sg_display=0 | Try kernel 6.11.x |
| Sleep/wake crashes | amdgpu.runpm=0 amdgpu.gfx_off=0 | Add amdgpu.gpu_recovery=1 |
| Display flickering | amdgpu.dcdebugmask=0x10 | Try amdgpu.dc=0 |
| Random freezes | amdgpu.gpu_recovery=1 | Add iommu=soft |
| Poor performance | Check thermals | Update firmware |
Note: This guide is based on real-world troubleshooting of RDNA3 GPU issues on Linux. Always back up your system before making changes, and remember that kernel/driver bugs get fixed over time - sometimes just waiting for updates is the best solution.
Disclaimer: Information provided is for educational purposes. The author is not responsible for any system instability or data loss. Always maintain backups and test changes carefully.
Generated by Claude Code - Verify all technical information before applying to production systems.