@alexheretic
Last active March 10, 2026 16:55
7900 GRE / gfx1100 optimised ComfyUI setup for Linux

This is stuff that has worked well for me.

Tested on Arch Linux, Ryzen 7 5800X, 64GB RAM, RX 7900 GRE, ROCm 7.2

Changelog
  • 2026-02-07: Switch to upstream flash-attention + FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON.

Setup venv

Create python 3.13 venv

python3.13 -m venv venv
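After creating the venv, activate it so everything below installs into it rather than the system Python (a minimal sketch; the `venv` path is the one created above):

```shell
# Activate the venv; subsequent pip/python commands now use it
source venv/bin/activate

# Optional: upgrade pip inside the venv before installing packages
pip install --upgrade pip
```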

Install torch

pip install --pre torch torchvision torchaudio --index-url https://rocm.nightlies.amd.com/v2/gfx110X-all/
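To confirm you got a ROCm build of torch and that the GPU is visible, a quick smoke test (exact version strings will differ on your machine):

```shell
# torch.version.hip is the HIP/ROCm runtime version; it is None on CPU/CUDA builds
python -c "import torch; print(torch.__version__, torch.version.hip)"

# ROCm builds expose the GPU via the torch.cuda API surface
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
```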

Install flash-attention

See rocm install instructions in https://github.com/Dao-AILab/flash-attention.

An optimised forward-attention config can be set with FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON (see the env vars section), or you can try autotuning; see the repo's README and make sure you have a new enough version checked out.
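A quick check that flash-attention actually installed into the venv (a sketch; run it with the venv activated):

```shell
# Smoke test: import the package and print its version
python -c "import flash_attn; print(flash_attn.__version__)"
```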

Install comfy requirements

pip install -r requirements.txt

Note: Also install any custom_nodes requirements (not described here).

Env vars

# slower, but more stable / fewer OOMs. No OOMs? Maybe you don't need this.
export PYTORCH_NO_HIP_MEMORY_CACHING=1

# triton
export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
export FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE
## Significantly faster attn_fwd performance for wan2.2 workflows
export FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON='{"BLOCK_M":128,"BLOCK_N":64,"waves_per_eu":1,"PRE_LOAD_V":false,"num_stages":1,"num_warps":8}'

# pytorch switches on NHWC for ROCm > 7, which causes significant miopen regressions for upscaling
# todo: fixed now? since what pytorch version?
export PYTORCH_MIOPEN_SUGGEST_NHWC=0

# miopen
## Tell comfyui to *not* disable miopen/cudnn, otherwise upscale perf is much worse
export COMFYUI_ENABLE_MIOPEN=1
## miopen's default find mode causes significant initial slowness and yields little or no benefit for the workloads I tested
export MIOPEN_FIND_MODE=FAST

Notes:

  • Maybe don't use PYTORCH_TUNABLEOP_ENABLED (tunable ops): it's slow to tune and can have little benefit afterwards, at least for the wan2.2, sdxl, and upscale workloads I tested. If you do use it, don't leave online tuning on indefinitely.

ComfyUI args

  • --use-flash-attention: use faster flash attention installed above.
  • --disable-pinned-memory: Comfy-Org/ComfyUI#11781 (comment)
  • --cache-ram 32: optional; helps prevent Comfy from using all 64GB of RAM.
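Putting the args together, a launch might look like this (illustrative; run from your ComfyUI checkout with the venv activated and the env vars above exported):

```shell
# Launch ComfyUI with the flags described above
python main.py --use-flash-attention --disable-pinned-memory --cache-ram 32
```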

ComfyUI proposed patches

  • Comfy-Org/ComfyUI#10238: WanImageToVideo, WanFirstLastFrameToVideo: Add vae_tile_size optional arg. Use vae_tile_size: 256 for significant encode perf improvement. Add with e.g.
    git remote add alexheretic https://github.com/alexheretic/ComfyUI
    git fetch alexheretic
    git merge --squash alexheretic/wan-vae-tiled-encode

Usage hints

  • Use tiled vae decode nodes (size 256 for wan).
@legitsplit

Thanks for sharing, proved also useful for my 9060 XT :)

@mikharju

Good for 7900 XTX too. I did try to get Flash Attention to work before on Bazzite and DistroBox, but wasn't sure if it was working or not since not much improvement could be seen. With all of your optimizations though, WAN videos are coming twice as fast now compared to before. Huge thanks! Also your vae_tile_size option rocks!
