@cloneofsimo
Created September 21, 2025 14:06

Simplifying the Contrastive Flow Matching Objective

We show that the CFM (Contrastive Flow Matching) objective is fundamentally no different from FM: its effective regression target is simply an affine transformation of the FM target velocity, which could be learned post-hoc.

Assumptions:

  • $\alpha_t = 1 - t, \sigma_t = t \Rightarrow v = -x_i + \epsilon_i, \tilde v = -\tilde x + \tilde\epsilon$
  • $(\tilde x,\tilde\epsilon)$ is drawn i.i.d. from the dataset, independent of $(x_i,\epsilon_i)$
  • Sampled noise has zero mean: $\mathbb{E}[\tilde\epsilon] = 0$
  • Let $\mu_x := \mathbb{E}[\tilde x]$ (empirical average $\bar x$ in practice)
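
To make these assumptions concrete, here is a minimal sketch of the interpolation and the resulting target velocity (shapes are arbitrary and purely illustrative):

import torch

# Assumed parameterization: x_t = alpha_t * x + sigma_t * eps with alpha_t = 1 - t, sigma_t = t,
# so the FM regression target is the velocity d/dt x_t = -x + eps.
x = torch.randn(4, 8)      # data samples x_i
eps = torch.randn(4, 8)    # noise eps_i
t = torch.rand(4, 1)       # times in [0, 1]
x_t = (1 - t) * x + t * eps
v_target = -x + eps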

Lemma (Averaging targets)

The minimizer of an expected squared error depends only on the mean of the random target. Concretely, for any random target $Y$ and any predictor $f$ that does not depend on the realization of $Y$,

$$\arg\min_f \mathbb{E} |f - Y|^2 = \arg\min_f |f - \mathbb{E}[Y]|^2 $$
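
This is just the standard bias-variance decomposition of the expected squared error; the second term does not involve $f$:

$$ \mathbb{E}|f - Y|^2 = |f - \mathbb{E}[Y]|^2 + \mathbb{E}|Y - \mathbb{E}[Y]|^2. $$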


Original per-sample loss

$$ L_i(\theta) = |\hat v_i - v_i|^2 - \lambda |\hat v_i - \tilde v|^2, $$

with

$$ v_i = -x_i + \epsilon_i, \qquad \tilde v = -\tilde x + \tilde\epsilon $$


Averaging over the random contrastive sample

Using the lemma, we may average the target involving the random draw $(\tilde x,\tilde\epsilon)$ before optimizing.
Since

$$ \mathbb{E}[\tilde v] = -\mu_x, $$

we get

$$ \mathbb{E}_{\tilde x,\tilde\epsilon}[L_i(\theta)] = (1-\lambda)|\hat v_i - y_i|^2 + \text{const}, $$

with the effective regression target

$$ y_i = \frac{v_i - \lambda \mathbb{E}[\tilde v]}{1-\lambda} = \frac{-x_i + \epsilon_i + \lambda\mu_x}{1-\lambda}. $$
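
The step above is a completing-the-square identity, in which the dropped constant does not depend on $\hat v_i$:

$$ |\hat v_i - v_i|^2 - \lambda|\hat v_i - \mathbb{E}[\tilde v]|^2 = (1-\lambda)\Big|\hat v_i - \frac{v_i - \lambda\,\mathbb{E}[\tilde v]}{1-\lambda}\Big|^2 + \text{const}. $$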


Final simplified objective

For $\lambda < 1$ (so the weight is positive), minimizing the expected CFM loss is equivalent to minimizing a weighted MSE:

$$ \min_\theta \mathbb{E}_{i}\Big[ (1-\lambda)|\hat v_i - \tfrac{-x_i + \epsilon_i + \lambda\mu_x}{1-\lambda}|^2 \Big] \propto \min_\theta \mathbb{E}_{i}\Big[ |\hat v_i - \tfrac{-x_i + \epsilon_i + \lambda\mu_x}{1-\lambda}|^2 \Big]. $$
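
As a quick numerical sanity check of this algebra (a minimal sketch: the vector m stands in for $\mathbb{E}[\tilde v]$, and the gap between the two objectives should be a constant that does not depend on the prediction):

import torch

torch.manual_seed(0)
d, lam = 5, 0.3
v = torch.randn(d)               # per-sample FM target v_i
m = torch.randn(d)               # stands in for E[v_tilde] = -mu_x
y = (v - lam * m) / (1 - lam)    # effective target y_i

def gap(v_hat):
    # (averaged CFM loss) minus (1 - lam) * (simplified loss); constant in v_hat
    lhs = ((v_hat - v) ** 2).sum() - lam * ((v_hat - m) ** 2).sum()
    rhs = (1 - lam) * ((v_hat - y) ** 2).sum()
    return lhs - rhs

assert torch.allclose(gap(torch.randn(d)), gap(torch.randn(d)), atol=1e-4)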

Notice that $\mu_x$, the average of the target distribution, is extremely simple to learn. Furthermore, it is common practice to center the VAE latents, because trained VAE latents are not empirically $N(0, I_n)$; after centering, $\mu_x \approx 0$. Thus one can expect CFM to have no effect apart from rescaling the velocity field.
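
To make the "learned post-hoc" remark concrete, here is a minimal sketch (cfm_pred_from_fm_pred is a hypothetical helper; it assumes $\mu_x$ is estimated from the training data): an FM-trained predictor maps onto the CFM optimum via a fixed affine transform of its output.

def cfm_pred_from_fm_pred(v_fm, mu_x, lam):
    # The CFM effective target is an affine transform of the FM target,
    # y = (v + lam * mu_x) / (1 - lam); applying the same map to an
    # FM-trained predictor's output recovers the CFM-optimal prediction.
    return (v_fm + lam * mu_x) / (1 - lam)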

@cloneofsimo (Author)

If you are doubting this math, here is a simple follow-up example showing that the two objectives are indeed completely identical,
i.e., CFM is essentially identical to FM in the sense that it simply replaces the target $v \Rightarrow (v + \lambda \mu_x)/(1 - \lambda)$.

import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

d = 5   # dimension of v
n = 10000 # number of samples
lam = 0.3

# dataset: x_i and eps_i
center = torch.randn(d)
x = torch.randn(n, d) * 0.8 + center 
eps = torch.randn(n, d)

# contrastive pool: x_tilde, eps_tilde. Shifting by one index is effectively the same as independent sampling.
# if you are paranoid, just resample, but use a larger n.
x_tilde = x.roll(1, dims=0)
eps_tilde = eps.roll(1, dims=0)

# empirical averages
mu_x = x_tilde.mean(0)   # dataset mean
# eps mean is ~0 so we skip

# -----------------------------
# Targets
# -----------------------------
# Original v, v_tilde
v = -x + eps
v_tilde = -x_tilde + eps_tilde

# Expected target y_i (simplified, affine transformation of v)
y = (v + lam * mu_x) / (1 - lam)

# -----------------------------
# Model parameter: a single learnable vector
# -----------------------------
param = nn.Parameter(torch.randn(d))

# -----------------------------
# Optimizers
# -----------------------------
opt1 = optim.SGD([param], lr=0.1)

# Train with simplified objective
for step in range(300):
    opt1.zero_grad()
    loss = ((param - y)**2).mean()  # simplified form
    loss.backward()
    opt1.step()

print("Optimized parameter (simplified):", param.data)

# -----------------------------
# Now check with original objective
# -----------------------------
param2 = nn.Parameter(torch.randn(d))
opt2 = optim.SGD([param2], lr=0.1)

for step in range(300):
    opt2.zero_grad()
    loss = ((param2 - v)**2).mean() - lam * ((param2 - v_tilde)**2).mean()
    loss.backward()
    opt2.step()

print("Optimized parameter (original):   ", param2.data)
assert torch.allclose(param2.data, param.data, atol=1e-1)
# Optimized parameter (simplified): tensor([-1.5306,  0.3296,  2.1793, -0.5896,  1.1025])
# Optimized parameter (original):    tensor([-1.5314,  0.3228,  2.1817, -0.5823,  1.0969])

It is indeed completely identical.
