TL;DR - I think the paper is a good contribution and basically holds up, but Figure 2 seems suspicious and the released repo doesn't include the pieces (AE training code and pretrained 4096-element AEs) that would be needed to make DC-AE practically competitive with SD/SDXL VAEs.
DC-AE is an MIT / Tsinghua / NVIDIA paper about improving generative autoencoders (like the SD VAE) in the high-spatial-compression-ratio regime.
I am interested in improved autoencoders, so this gist/thread is my attempt to analyze and review some key claims from the DC-AE paper.
(Disclaimer: I work at NVIDIA in an unrelated org :) - this review is written in my personal capacity as an autoencoder buff).





DC-AE Paper Claim 3 - DC-AE achieves uniformly faster + better results vs. the SD-VAE - MIXED
The paper shows lots of specific tests where the DC-AE recipe does better than the SD-VAE recipe (authors' reproduction?) at various tasks, and I'm okay with these claims.
However, I think the DC-AE paper implies it's better than SD-VAE in general, for all purposes, which I think needs some qualification.
The 0.90 and 2.04 ImageNet-Val numbers for SD-VAE f8c4 and f32c64 appear to be taken directly from the original SD paper (or else reproduced with incredible accuracy?), but those autoencoders were trained on OpenImages, not ImageNet (per the SD paper's Table 8 caption), and those numbers should be better if SD-VAE were finetuned on ImageNet-Train.
The revised paper includes additional results in Table 8, but doesn't say what these results represent (pretrained SD-VAE or reproduction? trained on what dataset / evaluated on what dataset?), so I don't know how to interpret them.
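The released models / code are definitely not yet better than the SD-VAE release, for two reasons: the repo lacks the AE training code, and it lacks pretrained AEs with 4096-element latents (the latent size of SD-VAE f8c4 at 256x256).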
Snippet for adding latent sizes to HF page
```js
// Meant for the browser console: annotates each "dc-ae" header on the page
// with its latent shape and element count for a 256x256 input.
Array.from(document.querySelectorAll("header")).forEach(el => {
  if (!el.title.includes("dc-ae")) return;
  // Remove labels left over from a previous run.
  Array.from(el.querySelectorAll(".volume-label")).forEach(child => el.removeChild(child));
  // Parse the downsampling factor f and the channel count c from the title.
  let [f, c] = el.title.match(/f(\d+)c(\d+)/).slice(1).map(x => parseInt(x));
  let hw = Math.floor(256 / f);
  let numelFor256 = Math.pow(hw, 2) * c;
  console.log(el.title, f, c, numelFor256);
  // Build the orange monospace label, e.g. "32x8x8 = 2048".
  let label = document.createElement("span");
  label.classList.add("volume-label");
  label.style.color = "orangered";
  label.style.display = "inline-block";
  label.style.paddingLeft = "1em";
  label.style.fontSize = "0.9em";
  label.style.fontFamily = "monospace";
  label.textContent = `${c}x${hw}x${hw} = ${numelFor256}`;
  el.appendChild(label);
})
```
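For anyone who wants the same arithmetic without the DOM, here is a minimal sketch that runs in Node. `latentSize` is a hypothetical helper of mine (not something from the DC-AE repo), and the model names in the loop are illustrative; only the `f<F>c<C>` tag in each name matters. It assumes a square 256x256 input, like the snippet above.

```js
// Hypothetical helper (not part of the DC-AE repo): parse an "f<F>c<C>" tag
// from a model name and compute the latent shape for a square input image.
function latentSize(name, imageSize = 256) {
  const match = name.match(/f(\d+)c(\d+)/);
  if (!match) return null; // the name carries no f/c tag
  const f = parseInt(match[1], 10); // spatial downsampling factor
  const c = parseInt(match[2], 10); // latent channel count
  const hw = Math.floor(imageSize / f); // latent height (= width)
  return { f, c, hw, numel: c * hw * hw };
}

// Illustrative names (assumptions); f8c4 and f32c64 are the SD-VAE configs
// discussed above, the other two are example high-compression tags.
for (const name of ["sd-vae-f8c4", "sd-vae-f32c64", "example-f32c32", "example-f64c128"]) {
  const s = latentSize(name);
  console.log(`${name}: ${s.c}x${s.hw}x${s.hw} = ${s.numel}`);
}
```

Both SD-VAE configs come out at 4096 elements (4x32x32 and 64x8x8), while the two example high-compression tags come out at 2048, which is the kind of gap I mean when I talk about 4096-element AEs.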