This is stuff that has worked well for me.
Tested on Arch Linux, Ryzen 7 5800X, 64GB RAM, RX 7900 GRE, ROCm 7.2.
Changelog
- 2026-02-07: Switch to upstream flash-attention + FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON.
Create a Python 3.13 venv:

python3.13 -m venv venv
pip install --pre torch torchvision torchaudio --index-url https://rocm.nightlies.amd.com/v2/gfx110X-all/

See the ROCm install instructions at https://github.com/Dao-AILab/flash-attention.
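To confirm the nightly wheel is actually the ROCm/HIP build and that the GPU is visible, a quick check run inside the venv can help (a sketch; it prints a notice instead of failing if torch isn't importable):

```shell
# Sanity check: run inside the venv. Prints the HIP version the wheel was
# built against and whether a GPU device is visible to torch.
python3 - <<'PY'
import importlib.util

if importlib.util.find_spec("torch") is None:
    print("torch not installed")
else:
    import torch
    # torch.version.hip is set on ROCm builds (None on CUDA/CPU wheels)
    print("hip:", torch.version.hip)
    print("gpu visible:", torch.cuda.is_available())
PY
```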
An optimised forward attention config can be set with FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON (see the env vars section),
or you can try autotuning; see the repo's README and ensure you have a new enough version checked out.
pip install -r requirements.txt

Note: also install any custom_nodes requirements (not described here).
# slower, but more stable / fewer OOMs. No OOMs? Maybe you don't need this.
export PYTORCH_NO_HIP_MEMORY_CACHING=1
# triton
export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
export FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE
## Significantly faster attn_fwd performance for wan2.2 workflows
export FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON='{"BLOCK_M":128,"BLOCK_N":64,"waves_per_eu":1,"PRE_LOAD_V":false,"num_stages":1,"num_warps":8}'
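## Optional sanity check (my addition, not required): the single-quoted JSON
## above is easy to mangle when copying between shells/configs. This fails
## loudly if a quote got eaten, and prints "config JSON ok" otherwise.
printf '%s' "$FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON" | python3 -m json.tool >/dev/null && echo "config JSON ok"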
# pytorch switches on NHWC for rocm > 7, causes significant miopen regressions for upscaling
# todo: fixed now? since what pytorch version?
export PYTORCH_MIOPEN_SUGGEST_NHWC=0
# miopen
## Tell comfyui to *not* disable miopen/cudnn, otherwise upscale perf is much worse
export COMFYUI_ENABLE_MIOPEN=1
## miopen default find mode causes significant initial slowness, yields little or no benefit to workloads I tested
export MIOPEN_FIND_MODE=FAST

Notes:
- Maybe don't use PYTORCH_TUNABLEOP_ENABLED (tunable ops): it's slow to tune and can yield little benefit afterwards, at least for the wan2.2, sdxl and upscale workloads I tested. If you do use it, don't just leave online tuning on forever.
- --use-flash-attention: use the faster flash attention installed above.
- --disable-pinned-memory: Comfy-Org/ComfyUI#11781 (comment)
- --cache-ram 32: optional, helps prevent comfy from using up all 64GB of RAM.
- Comfy-Org/ComfyUI#10238: WanImageToVideo, WanFirstLastFrameToVideo: Add vae_tile_size optional arg. Use vae_tile_size: 256 for a significant encode perf improvement. Add with e.g.

  git remote add alexheretic https://github.com/alexheretic/ComfyUI
  git fetch alexheretic
  git merge --squash alexheretic/wan-vae-tiled-encode
- Use tiled vae decode nodes (size 256 for wan).