Project Roadmap

Strategic plans for implementing strict guidance and latent space diffusion.

Our current implementation utilizes a straightforward gradient descent approach for physics guidance. While effective for simple nudging, it lacks the rigorous guarantees of Zampini's method. This roadmap outlines our plan to bridge this gap, starting with strict guidance in image space and evolving into a full Latent Diffusion Model (LDM).

Phase 1: Implementing Strict Guidance

Why the current approach is insufficient

Our current single- or multi-step gradient descent typically guides the sample toward lower-energy states. However, it does not guarantee that the final sample strictly satisfies the physical constraints ($\nabla \cdot \mathbf{v} = 0$).

  • No Feasibility Guarantee: The model may resist the "nudge" if it conflicts strongly with the learned prior.
  • No Error Bounds: We cannot define a maximum tolerance $\varepsilon$ for physics violations.
  • Trajectory Drift: Without correction loops, the trajectory can drift significantly from the physical manifold between steps.

The Plan: Proximal Langevin Dynamics

We will adapt Zampini's Inner Loop mechanism to our current image-space model. Instead of a simple gradient update step, we will implement a "Pause and Fix" loop at every diffusion timestep.

Step 1: Check Violation

At each timestep $t$, immediately after estimating $\hat{x}_0$ via Tweedie's formula, calculate the divergence and curl errors.
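As a concrete sketch of the violation check, the snippet below computes finite-difference divergence and curl errors for a 2D vector field. The function name `physics_errors` and the use of central differences via `np.gradient` are illustrative choices, not part of the existing codebase:

```python
import numpy as np

def physics_errors(vx, vy, dx=1.0):
    """Finite-difference divergence and curl errors of a 2D vector field.

    vx, vy: 2D arrays indexed [row (y), col (x)].
    Returns mean-squared divergence and mean-squared curl.
    """
    dvx_dx = np.gradient(vx, dx, axis=1)
    dvx_dy = np.gradient(vx, dx, axis=0)
    dvy_dx = np.gradient(vy, dx, axis=1)
    dvy_dy = np.gradient(vy, dx, axis=0)
    div = dvx_dx + dvy_dy   # ∇·v, should be ~0 for an incompressible field
    curl = dvy_dx - dvx_dy  # z-component of ∇×v
    return float(np.mean(div**2)), float(np.mean(curl**2))
```

In practice these errors would be evaluated on $\hat{x}_0$, the Tweedie estimate of the clean sample, not on $x_t$ itself.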

Step 2: Inner Loop Optimization

Enter a while loop that performs gradient descent on $x_t$ until the error falls below a strict threshold $\varepsilon$:

$$ x_t^{(k+1)} \leftarrow x_t^{(k)} - \eta \nabla_{x_t} \mathcal{L}_{physics}(\hat{x}_0) $$
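The inner loop can be sketched as follows. The callable `physics_loss_and_grad`, the step size `eta`, and the `max_iter` safety cap are illustrative assumptions; in the real model the gradient would come from autograd through the Tweedie estimate:

```python
import numpy as np

def strict_guidance_step(x_t, physics_loss_and_grad, eps=1e-4, eta=0.1, max_iter=200):
    """'Pause and fix' loop: descend on x_t until the physics loss on the
    clean estimate falls below eps (or max_iter is reached).

    physics_loss_and_grad: callable x_t -> (loss, grad), where the loss is
    evaluated on x0_hat(x_t) and grad is d(loss)/d(x_t).
    """
    x = x_t.copy()
    for _ in range(max_iter):
        loss, grad = physics_loss_and_grad(x)
        if loss < eps:  # strict tolerance met; resume the diffusion step
            break
        x = x - eta * grad
    return x
```

The `max_iter` cap keeps sampling time bounded when the prior and the constraints conflict strongly.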

Phase 2: Transition to Latent Space

Once strict guidance is verified in image space, we will retrain our standard diffusion model to operate in a compressed latent space. This aligns our architecture with state-of-the-art models like Stable Diffusion and enables higher-resolution generation.

Action Plan

  1. Train VAE / Autoencoder: Train a Variational Autoencoder to compress vector fields into a compact latent representation $\mathbf{z}$.
  2. Retrain Diffusion U-Net: Train the diffusion model to predict noise in the latent space $\mathcal{Z}$ instead of pixel space.
  3. Implement Decoder Guidance: Port the Phase 1 strict guidance logic to use backpropagation through the decoder ($\nabla_z = (\partial \mathcal{D}/\partial z)^T \nabla_x$), fully realizing Zampini's method.
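The chain rule in step 3 can be illustrated with a minimal sketch that pulls an image-space physics gradient back into latent space. Here the Jacobian-vector products are formed by finite differences purely for self-containment; the actual implementation would backpropagate through the decoder with autograd. `decode` and `image_grad` are placeholder callables:

```python
import numpy as np

def decoder_guidance_grad(z, decode, image_grad, eps=1e-5):
    """Compute ∇_z L = (∂D/∂z)^T ∇_x L for a decoder D.

    decode: latent z -> image x (flattened arrays).
    image_grad: image x -> ∇_x L, the image-space physics gradient.
    Uses central finite differences to form each Jacobian column.
    """
    g_x = image_grad(decode(z))
    g_z = np.zeros_like(z)
    for i in range(z.size):
        dz = np.zeros_like(z)
        dz[i] = eps
        # (∂D/∂z_i)·∇_x L via a central-difference directional derivative
        g_z[i] = np.dot(decode(z + dz) - decode(z - dz), g_x) / (2 * eps)
    return g_z
```

This loop is $O(\dim z)$ decoder evaluations, which is exactly why the real port will use a single reverse-mode backpropagation through the decoder instead.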

Expected Benefits

  • Scalability: Generate significantly larger vector fields (e.g., 1024x1024) with standard GPU memory.
  • Faster Inference: Denoising happens in the smaller latent space, reducing the computational cost of each step.
  • Foundation Model Capability: Opens the door to fine-tuning existing large-scale models on physical data.