Training-Free Constrained Generation

Applying constraints in Latent Diffusion Models (Zampini et al.).

While our VFDIFF implementation operates directly in image space, modern state-of-the-art models like Stable Diffusion operate in a compressed latent space. Zampini et al. (2025) introduce a breakthrough method to enforce strict physical constraints on these latent models, bridging the gap between high-fidelity generation and scientific accuracy.

The Latent Space Gap

The Problem

Diffusion happens in latent space $\mathcal{Z}$, where data $\mathbf{z}_t$ is compressed. However, physical constraints (like fluid incompressibility) are defined in the real image space $\mathcal{X}$.

We cannot directly evaluate $g(\mathbf{z}_t)$ because the latent vector doesn't have physical units.

The Solution

We use the pre-trained decoder $\mathcal{D}$ to map the latent vector back to image space, evaluate the constraint, and then backpropagate the error through the decoder.

$$ \nabla_{\mathbf{z}_t} g = \left( \frac{\partial \mathcal{D}}{\partial \mathbf{z}_t} \right)^\top \nabla_{\mathbf{x}} g $$
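This Jacobian-transpose rule can be checked numerically with a toy setup. The sketch below uses a *linear* stand-in decoder $\mathcal{D}(\mathbf{z}) = W\mathbf{z}$ (so the Jacobian is just $W$) and a quadratic toy constraint; all names (`W`, `decode`, `g`) are illustrative, not from the paper.

```python
import numpy as np

# Linear toy decoder D(z) = W z standing in for the pretrained VAE decoder.
rng = np.random.default_rng(0)
latent_dim, image_dim = 8, 32
W = rng.standard_normal((image_dim, latent_dim))  # Jacobian of the toy decoder

def decode(z):
    return W @ z

def g(x):
    # Toy constraint violation: 0.5 * ||x||^2.
    return 0.5 * np.sum(x ** 2)

def grad_x_g(x):
    return x  # gradient of 0.5 * ||x||^2 w.r.t. x

def latent_constraint_grad(z):
    """grad_z g(D(z)) = (dD/dz)^T grad_x g — the Jacobian-transpose rule."""
    return W.T @ grad_x_g(decode(z))
```

In a real latent diffusion model the decoder is nonlinear, so the Jacobian-vector product is computed implicitly by reverse-mode autodiff rather than by materializing $W$.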

Proximal Langevin Dynamics

Instead of simple guidance, Zampini's method treats generation as a constrained optimization problem. At every denoising step $t$, it performs an inner optimization loop (Proximal Operator) to project the sample onto the feasible manifold.

Step 1: Standard Denoising Step

Compute the standard score estimate (or noise prediction).

$$ \mathbf{z}'_t = \mathbf{z}_t + \gamma_t \nabla_{\mathbf{z}_t} \log q(\mathbf{z}_t | \mathbf{z}_0) + \sqrt{2\gamma_t} \boldsymbol{\epsilon} $$
Step 2: Decoded Constraint Evaluation

Decode the latent and compute constraint violation $g(\mathcal{D}(\mathbf{z}'_t))$.

$$ \mathbf{x}' = \mathcal{D}(\mathbf{z}'_t), \qquad \text{then evaluate } g(\mathbf{x}') $$
Step 3: Proximal Correction (Optimization Loop)

Iteratively update $\mathbf{z}'_t$ to minimize the constraint violation, using gradients through the decoder.

$$ \mathbf{z}^{(k+1)} \leftarrow \mathbf{z}^{(k)} - \eta \nabla_{\mathbf{z}} \left( g(\mathcal{D}(\mathbf{z})) + \frac{1}{2\lambda} \|\mathcal{D}(\mathbf{z}) - \mathcal{D}(\mathbf{z}_t)\|^2 \right) $$

This loop continues until the constraint is satisfied (e.g., divergence < threshold).
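The three steps above can be sketched as a small numpy loop. This is a schematic toy, not Zampini's implementation: the decoder is linear, the "physics" violation is a quadratic zero-sum penalty, and all hyperparameters (`gamma`, `eta`, `lam`, `tol`) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy linear "decoder": D(z) = W z, so its Jacobian is just W.
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [0.5, -0.5]])

def decode(z):
    return W @ z

def g(x):
    # Toy "physics" violation: squared total mass (a zero-sum constraint).
    return 0.5 * np.sum(x) ** 2

def grad_z_g(z):
    # Chain rule: (dD/dz)^T grad_x g, with grad_x g = sum(x) * ones.
    x = decode(z)
    return W.T @ np.full_like(x, np.sum(x))

def constrained_step(z_t, score, gamma=1e-2, eta=0.05, lam=1e3,
                     tol=1e-4, max_iter=200):
    # Step 1: standard Langevin denoising update.
    z = z_t + gamma * score(z_t) + np.sqrt(2 * gamma) * rng.standard_normal(z_t.shape)
    z_anchor = z.copy()
    # Steps 2-3: decode, evaluate the violation, and run the proximal
    # inner loop until the constraint is (approximately) satisfied.
    for _ in range(max_iter):
        if g(decode(z)) < tol:
            break
        # Gradient of the proximity term (1/2λ)||D(z) - D(z_anchor)||^2.
        prox_grad = W.T @ (decode(z) - decode(z_anchor)) / lam
        z = z - eta * (grad_z_g(z) + prox_grad)
    return z
```

The proximity term keeps the corrected latent close to the sample the denoiser produced, so the inner loop repairs the constraint violation without destroying the generative trajectory.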

The "Pause and Fix" Mechanism

The inner loop achieves strict feasibility by effectively pausing the diffusion process at each timestep, correcting the sample until it is feasible, and only then continuing the denoising trajectory.


Terminology Reference

| Symbol | Description | Context |
| --- | --- | --- |
| $\mathbf{z}_t$ | Latent vector at timestep $t$ | The compressed representation undergoing diffusion. |
| $\mathcal{D}(\cdot)$ | Decoder network | Maps the latent vector $\mathbf{z}$ back to image space $\mathbf{x}$. |
| $\nabla_{\mathbf{z}} g$ | Latent constraint gradient | Gradient w.r.t. the latent, obtained by backpropagation through $\mathcal{D}$. |
| $\eta$ | Proximal step size | Learning rate for the inner optimization loop. |
| $\lambda$ | Regularization weight | Keeps the update close to the original noisy sample. |

Why Latent Diffusion?

Latent Diffusion Models (LDMs) like Stable Diffusion operate in a compressed lower-dimensional space (e.g., 64x64 latent for a 512x512 image). This offers massive memory savings and faster training/inference for high-resolution generation.

The Trade-off:

| Feature | Our Method (VFDIFF) | Zampini's Method |
| --- | --- | --- |
| Operating Space | Image space ($\mathbf{x}_t$): direct pixel manipulation | Latent space ($\mathbf{z}_t$): compressed feature manipulation |
| Clean Data Estimate | Tweedie's formula: $\hat{x}_0 = \frac{x_t - \sqrt{1-\bar{\alpha}_t}\epsilon}{\sqrt{\bar{\alpha}_t}}$ | Decoder pass: $\mathbf{x}' = \mathcal{D}(\mathbf{z}'_t) \approx \mathbf{x}_0$ |
| Gradient Computation | Two variants: (1) Autograd, backprop through Tweedie; (2) Delta Noise, manual "extra noise" scaling of the $\hat{x}_0$ gradient | Decoder backprop: must backpropagate through the entire decoder network, $\nabla_z = (\partial \mathcal{D}/\partial z)^\top \nabla_x$ |
| Constraint Enforcement | Soft guidance: nudges the sample (can be multi-step) | Strict optimization (inner loop): iteratively corrects until strictly feasible |

Our Approach (Image Space)

We rely on Tweedie's Formula to estimate the clean field $\hat{x}_0$ from the noisy input $x_t$. We then compute the physics loss on this estimate.

Gradient calculation variants:

  • Autograd: differentiates through the Tweedie formula itself. Mathematically rigorous but computationally heavy.
  • Delta Noise: treats the gradient of the loss w.r.t. $\hat{x}_0$ as an "extra noise" term. This is a fast, heuristic approximation that avoids heavy graph traversal.

Pros: "Delta Noise" is extremely fast; no neural-network backprop is required for guidance.
Cons: limited to the native resolution of the diffusion model itself.
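The Tweedie estimate and the "Delta Noise" correction can be sketched in a few lines of numpy. Note the exact scaling of the extra-noise term is an implementation choice; the $\sqrt{1-\bar{\alpha}_t}$ factor and the toy zero-mean loss below are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

def tweedie_x0(x_t, eps_pred, alpha_bar_t):
    """Clean-data estimate: x0_hat = (x_t - sqrt(1 - a_bar) * eps) / sqrt(a_bar)."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)

def physics_grad(x0_hat):
    # Toy stand-in for grad of L_physics w.r.t. x0_hat: penalize nonzero mean.
    return np.full_like(x0_hat, x0_hat.mean())

def delta_noise_eps(x_t, eps_pred, alpha_bar_t, guidance_scale=1.0):
    """Fold the physics gradient into the noise prediction as "extra noise".

    No backprop through any network is needed; the sqrt(1 - a_bar) scaling
    here is one illustrative choice for mapping the x0 gradient to eps-space.
    """
    x0_hat = tweedie_x0(x_t, eps_pred, alpha_bar_t)
    return eps_pred + guidance_scale * np.sqrt(1.0 - alpha_bar_t) * physics_grad(x0_hat)
```

Re-running Tweedie with the corrected noise prediction yields a clean-data estimate whose constraint violation (here, a nonzero mean) is reduced, which is exactly the "nudge" that soft guidance applies at each step.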

Zampini's Approach (Latent Space)

Since diffusion happens in latent space $z$, the model doesn't "know" physics. Zampini's method bridges this by decoding $z \to x$ to check strict compliance.

Decoder Backpropagation:

To guide the latent $z$, the gradient from the physics loss $\nabla_x \mathcal{L}$ must be passed back through the Decoder network using the chain rule: $$ \nabla_z = (\partial \mathcal{D}/\partial z)^T \nabla_x $$

  • Pros: Allows enforcing constraints on high-res outputs from compressed latents.
  • Cons: Extremely expensive. Backpropagating through a GAN/VAE decoder makes each guidance step roughly $K$ times slower than our "Delta Noise" update, where $K$ scales with the number of decoder layers.

Failed Baselines (Appendix E.1)

The authors briefly mention trying two alternative strategies, both of which failed to produce high-quality results (high FID scores, i.e. poor sample quality) and were excluded from the main paper:

  • Image Space Correction: project the sample in image space, then re-encode it to the latent space.
  • Learned Latent Corrector: train a network to project latents directly.

Summary

Zampini's method represents a significant advancement for Latent Diffusion Models, making them viable for scientific applications. However, for our specific 2D vector field application, our direct image-space approach remains highly effective and avoids the complexity of decoding/encoding loops, provided we operate at manageable resolutions.