@cloneofsimo
Created September 21, 2025 14:06

Simplifying the Contrastive Flow Matching Objective

We show that the CFM (Contrastive Flow Matching) objective is fundamentally no different from FM: its effective regression target is simply an affine transformation of the FM target velocity, which could be learned post-hoc.

Assumptions:

  • $\alpha_t = 1 - t, \sigma_t = t \Rightarrow v = -x_i + \epsilon_i, \tilde v = -\tilde x + \tilde\epsilon$
  • $(\tilde x,\tilde\epsilon)$ is drawn i.i.d. from the dataset, independent of $(x_i,\epsilon_i)$
  • Sampled noise has zero mean: $\mathbb{E}[\tilde\epsilon] = 0$
  • Let $\mu_x := \mathbb{E}[\tilde x]$ (empirical average $\bar x$ in practice)
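
To make these assumptions concrete, here is a minimal sketch of the interpolation and the resulting target velocity (shapes are arbitrary and purely illustrative):

import torch

# Assumed parameterization: x_t = alpha_t * x + sigma_t * eps with alpha_t = 1 - t, sigma_t = t,
# so the FM regression target is the velocity d/dt x_t = -x + eps.
x = torch.randn(4, 8)      # data samples x_i
eps = torch.randn(4, 8)    # noise eps_i
t = torch.rand(4, 1)       # times in [0, 1]
x_t = (1 - t) * x + t * eps
v_target = -x + eps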

Lemma (Averaging targets)

The minimizer of an expected squared error depends only on the mean of the random target. Concretely, for any random target $Y$ and any predictor $f$ that does not depend on the realization of $Y$,

$$\arg\min_f \mathbb{E} |f - Y|^2 = \arg\min_f |f - \mathbb{E}[Y]|^2 $$
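
This is just the standard bias-variance decomposition of the expected squared error; the second term does not involve $f$:

$$ \mathbb{E}|f - Y|^2 = |f - \mathbb{E}[Y]|^2 + \mathbb{E}|Y - \mathbb{E}[Y]|^2. $$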


Original per-sample loss

$$ L_i(\theta) = |\hat v_i - v_i|^2 - \lambda |\hat v_i - \tilde v|^2, $$

with

$$ v_i = -x_i + \epsilon_i, \qquad \tilde v = -\tilde x + \tilde\epsilon $$


Averaging over the random contrastive sample

Using the lemma, we may average the target involving the random draw $(\tilde x,\tilde\epsilon)$ before optimizing.
Since

$$ \mathbb{E}[\tilde v] = -\mu_x, $$

we get

$$ \mathbb{E}_{\tilde x,\tilde\epsilon}[L_i(\theta)] = (1-\lambda)|\hat v_i - y_i|^2 + \text{const}, $$

with the effective regression target

$$ y_i = \frac{v_i - \lambda \mathbb{E}[\tilde v]}{1-\lambda} = \frac{-x_i + \epsilon_i + \lambda\mu_x}{1-\lambda}. $$
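
The step above is a completing-the-square identity, in which the dropped constant does not depend on $\hat v_i$:

$$ |\hat v_i - v_i|^2 - \lambda|\hat v_i - \mathbb{E}[\tilde v]|^2 = (1-\lambda)\Big|\hat v_i - \frac{v_i - \lambda\,\mathbb{E}[\tilde v]}{1-\lambda}\Big|^2 + \text{const}. $$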


Final simplified objective

For $\lambda < 1$ (so the weight is positive), minimizing the expected CFM loss is equivalent to minimizing a weighted MSE:

$$ \min_\theta \mathbb{E}_{i}\Big[ (1-\lambda)|\hat v_i - \tfrac{-x_i + \epsilon_i + \lambda\mu_x}{1-\lambda}|^2 \Big] \propto \min_\theta \mathbb{E}_{i}\Big[ |\hat v_i - \tfrac{-x_i + \epsilon_i + \lambda\mu_x}{1-\lambda}|^2 \Big]. $$
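
As a quick numerical sanity check of this algebra (a minimal sketch: the vector m stands in for $\mathbb{E}[\tilde v]$, and the gap between the two objectives should be a constant that does not depend on the prediction):

import torch

torch.manual_seed(0)
d, lam = 5, 0.3
v = torch.randn(d)               # per-sample FM target v_i
m = torch.randn(d)               # stands in for E[v_tilde] = -mu_x
y = (v - lam * m) / (1 - lam)    # effective target y_i

def gap(v_hat):
    # (averaged CFM loss) minus (1 - lam) * (simplified loss); constant in v_hat
    lhs = ((v_hat - v) ** 2).sum() - lam * ((v_hat - m) ** 2).sum()
    rhs = (1 - lam) * ((v_hat - y) ** 2).sum()
    return lhs - rhs

assert torch.allclose(gap(torch.randn(d)), gap(torch.randn(d)), atol=1e-4)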

Notice that $\mu_x$, the average of the target distribution, is extremely simple to learn. Furthermore, it is common practice to center the VAE latents, because trained VAE latents are not empirically $N(0, I_n)$; after centering, $\mu_x \approx 0$. Thus one can expect CFM to have no effect apart from rescaling the velocity field.
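
To make the "learned post-hoc" remark concrete, here is a minimal sketch (cfm_pred_from_fm_pred is a hypothetical helper; it assumes $\mu_x$ is estimated from the training data): an FM-trained predictor maps onto the CFM optimum via a fixed affine transform of its output.

def cfm_pred_from_fm_pred(v_fm, mu_x, lam):
    # The CFM effective target is an affine transform of the FM target,
    # y = (v + lam * mu_x) / (1 - lam); applying the same map to an
    # FM-trained predictor's output recovers the CFM-optimal prediction.
    return (v_fm + lam * mu_x) / (1 - lam)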

@cloneofsimo (Author)

If you are doubting this math, here is a simple follow-up example showing that the two objectives are indeed completely identical,
i.e., CFM is essentially identical to FM in the sense that it simply replaces the target $v \Rightarrow (v + \lambda \mu_x)/(1 - \lambda)$.

import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

d = 5   # dimension of v
n = 10000 # number of samples
lam = 0.3

# dataset: x_i and eps_i
center = torch.randn(d)
x = torch.randn(n, d) * 0.8 + center 
eps = torch.randn(n, d)

# contrastive pool: x_tilde, eps_tilde. Shifting by one index is effectively the same as independent sampling.
# if you are paranoid, just resample, but use a larger n.
x_tilde = x.roll(1, dims=0)
eps_tilde = eps.roll(1, dims=0)

# empirical averages
mu_x = x_tilde.mean(0)   # dataset mean
# eps mean is ~0 so we skip

# -----------------------------
# Targets
# -----------------------------
# Original v, v_tilde
v = -x + eps
v_tilde = -x_tilde + eps_tilde

# Expected target y_i (simplified, affine transformation of v)
y = (v + lam * mu_x) / (1 - lam)

# -----------------------------
# Model parameter: a single learnable vector
# -----------------------------
param = nn.Parameter(torch.randn(d))

# -----------------------------
# Optimizers
# -----------------------------
opt1 = optim.SGD([param], lr=0.1)

# Train with simplified objective
for step in range(300):
    opt1.zero_grad()
    loss = ((param - y)**2).mean()  # simplified form
    loss.backward()
    opt1.step()

print("Optimized parameter (simplified):", param.data)

# -----------------------------
# Now check with original objective
# -----------------------------
param2 = nn.Parameter(torch.randn(d))
opt2 = optim.SGD([param2], lr=0.1)

for step in range(300):
    opt2.zero_grad()
    loss = ((param2 - v)**2).mean() - lam * ((param2 - v_tilde)**2).mean()
    loss.backward()
    opt2.step()

print("Optimized parameter (original):   ", param2.data)
assert torch.allclose(param2.data, param.data, atol=1e-1)
# Optimized parameter (simplified): tensor([-1.5306,  0.3296,  2.1793, -0.5896,  1.1025])
# Optimized parameter (original):    tensor([-1.5314,  0.3228,  2.1817, -0.5823,  1.0969])

It is indeed completely identical.
