This is stuff that has worked well for me.
Tested on Arch Linux, Ryzen 7 5800X, 64GB RAM, RX 7900 GRE, ROCm 7.2.
Changelog
- 2026-02-07: Switch to upstream flash-attention + FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON.
Create a Python 3.13 venv:

python3.13 -m venv venv
pip install --pre torch torchvision torchaudio --index-url https://rocm.nightlies.amd.com/v2/gfx110X-all/

See the ROCm install instructions at https://github.com/Dao-AILab/flash-attention.
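To confirm the nightly wheel is actually the ROCm/HIP build and that the GPU is visible, a quick check run inside the venv can help (a sketch; it prints a notice instead of failing if torch isn't importable):

```shell
# Sanity check: run inside the venv. Prints the HIP version the wheel was
# built against and whether a GPU device is visible to torch.
python3 - <<'PY'
import importlib.util

if importlib.util.find_spec("torch") is None:
    print("torch not installed")
else:
    import torch
    # torch.version.hip is set on ROCm builds (None on CUDA/CPU wheels)
    print("hip:", torch.version.hip)
    print("gpu visible:", torch.cuda.is_available())
PY
```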
An optimised forward attention config can be set with FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON (see the env vars section),
or you can try autotuning; see the repo's README and ensure you have a new enough version checked out.
pip install -r requirements.txt

Note: also install any custom_nodes requirements (not described here).
# slower, but more stable / fewer OOMs. No OOMs? Maybe you don't need this.
export PYTORCH_NO_HIP_MEMORY_CACHING=1
# triton
export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
export FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE
## Significantly faster attn_fwd performance for wan2.2 workflows
export FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON='{"BLOCK_M":128,"BLOCK_N":64,"waves_per_eu":1,"PRE_LOAD_V":false,"num_stages":1,"num_warps":8}'
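## Optional sanity check (my addition, not required): the single-quoted JSON
## above is easy to mangle when copying between shells/configs. This fails
## loudly if a quote got eaten, and prints "config JSON ok" otherwise.
printf '%s' "$FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON" | python3 -m json.tool >/dev/null && echo "config JSON ok"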
# pytorch switches on NHWC for rocm > 7, causes significant miopen regressions for upscaling
# todo: fixed now? since what pytorch version?
export PYTORCH_MIOPEN_SUGGEST_NHWC=0
# miopen
## Tell comfyui to *not* disable miopen/cudnn, otherwise upscale perf is much worse
export COMFYUI_ENABLE_MIOPEN=1
## miopen default find mode causes significant initial slowness, yields little or no benefit to workloads I tested
export MIOPEN_FIND_MODE=FAST

Notes:
- Maybe don't use PYTORCH_TUNABLEOP_ENABLED (tunable ops): it's slow to tune and can yield little benefit afterwards, at least for the wan2.2, sdxl and upscale workloads I tested. If you do use it, don't just leave online tuning on forever.
- --use-flash-attention: use the faster flash attention installed above.
- --disable-pinned-memory: Comfy-Org/ComfyUI#11781 (comment)
- --cache-ram 32: optional, helps prevent comfy from using up all 64GB of RAM.
- Comfy-Org/ComfyUI#10238: WanImageToVideo, WanFirstLastFrameToVideo: Add vae_tile_size optional arg. Use vae_tile_size: 256 for a significant encode perf improvement. Add with e.g.

  git remote add alexheretic https://github.com/alexheretic/ComfyUI
  git fetch alexheretic
  git merge --squash alexheretic/wan-vae-tiled-encode
- Use tiled vae decode nodes (size 256 for wan).