This is stuff that has worked well for me.
Tested on Arch Linux, Ryzen 7 5800X, 64GB RAM, RX 7900 GRE, ROCm 7.2
## Changelog

- 2026-02-07: Switch to upstream flash-attention + `FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON`.
## Create python 3.13 venv

```shell
python3.13 -m venv venv
source venv/bin/activate
pip install --pre torch torchvision torchaudio --index-url https://rocm.nightlies.amd.com/v2/gfx110X-all/
```

See the ROCm install instructions in https://github.com/Dao-AILab/flash-attention.
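A sketch of the upstream install, based on the flash-attention repo's ROCm/Triton-backend notes — verify against the current README before running:

```shell
# Hedged sketch: build upstream flash-attention with the Triton AMD backend
# (no CUDA toolchain needed). Check the repo README for current instructions.
git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention
FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE python setup.py install
```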
An optimised forward-attention config can be set with `FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON` (see the env vars section),
or you can try autotuning: see the repo's README and ensure you have a new enough version checked out.
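Since a malformed value in that env var is easy to ship silently, it can be worth validating the JSON string before exporting it. A minimal check using `python3 -m json.tool` (any parse error exits non-zero):

```shell
# Validate the tuning config string before exporting it.
CFG='{"BLOCK_M":128,"BLOCK_N":64,"waves_per_eu":1,"PRE_LOAD_V":false,"num_stages":1,"num_warps":8}'
if echo "$CFG" | python3 -m json.tool >/dev/null; then
  export FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON="$CFG"
  echo "config exported"
else
  echo "invalid JSON, not exporting" >&2
fi
```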
```shell
pip install -r requirements.txt
```

Note: also install any custom_nodes requirements (not described here).
## Env vars

```shell
# slower, but more stable / fewer OOMs. No OOMs? Maybe you don't need this.
export PYTORCH_NO_HIP_MEMORY_CACHING=1

# triton
export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
export FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE
## Significantly faster attn_fwd performance for wan2.2 workflows
export FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON='{"BLOCK_M":128,"BLOCK_N":64,"waves_per_eu":1,"PRE_LOAD_V":false,"num_stages":1,"num_warps":8}'

# pytorch switches on NHWC for rocm > 7, which causes significant miopen regressions for upscaling
# todo: fixed now? since what pytorch version?
export PYTORCH_MIOPEN_SUGGEST_NHWC=0

# miopen
## Tell comfyui to *not* disable miopen/cudnn, otherwise upscale perf is much worse
export COMFYUI_ENABLE_MIOPEN=1
## miopen default find mode causes significant initial slowness, yields little or no benefit to workloads I tested
export MIOPEN_FIND_MODE=FAST
```

Notes:
- Maybe don't use `PYTORCH_TUNABLEOP_ENABLED` (TunableOp): it's slow to tune and can have little benefit afterwards, at least for the wan2.2, sdxl and upscale workloads I tested. If you do use it, don't just leave online tuning on forever.
- `--use-flash-attention`: use the faster flash attention installed above.
- `--disable-pinned-memory`: see Comfy-Org/ComfyUI#11781 (comment).
- `--cache-ram 32`: optional, helps prevent comfy from using up all 64GB of RAM.
- Comfy-Org/ComfyUI#10238: WanImageToVideo, WanFirstLastFrameToVideo: add `vae_tile_size` optional arg. Use `vae_tile_size: 256` for a significant encode perf improvement. Add with e.g.

  ```shell
  git remote add alexheretic https://github.com/alexheretic/ComfyUI
  git fetch alexheretic
  git merge --squash alexheretic/wan-vae-tiled-encode
  ```
- Use tiled vae decode nodes (size 256 for wan).
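On the TunableOp point: if you do experiment with it, a common pattern is a one-off tuning run that records results to a file, then everyday runs that reuse the file with online tuning off. A sketch using PyTorch's TunableOp env var names — double-check them against your torch version:

```shell
# One-off tuning run: record tuned kernel selections to a CSV.
export PYTORCH_TUNABLEOP_ENABLED=1
export PYTORCH_TUNABLEOP_TUNING=1
export PYTORCH_TUNABLEOP_FILENAME=tunableop_results.csv
# ...run the workload once, then for everyday use keep lookups enabled
# but disable further online tuning:
export PYTORCH_TUNABLEOP_TUNING=0
```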
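Putting it together: the env vars and launch flags above can live in one small launcher script. This is a sketch — the ComfyUI checkout path and venv location are assumptions for this box:

```shell
#!/usr/bin/env bash
set -euo pipefail
# Hypothetical launcher: assumes ComfyUI is checked out at ~/ComfyUI with the
# venv created earlier inside it. Adjust paths to your setup.
cd "$HOME/ComfyUI"
source venv/bin/activate
# env vars from the env vars section go here, e.g.:
export FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE
python main.py --use-flash-attention --disable-pinned-memory --cache-ram 32
```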