TL;DR
In the context of generative modeling, we examine ODEs, SDEs, and two recent works that share the idea of learning shortcuts that traverse the vector fields defined by ODEs in fewer steps. We then discuss how this idea generalizes to both ODE- and SDE-based models.
Differential Equations
Let’s start with a general scenario of generative modeling: suppose you want to generate data that follows a distribution $p_{\text{data}}$. In many cases, the exact form of $p_{\text{data}}$ is unknown. What you can do is follow the idea of normalizing flow1: start from a very simple, closed-form distribution $p_0$ (for example, a standard normal distribution), transform this distribution through time $t \in [0, 1]$ with intermediate distributions $p_t$, and finally obtain the estimated distribution $p_1 \approx p_{\text{data}}$. By doing this, you are essentially trying to solve a differential equation (DE)2 that depends on time:

$$dx_t = f(x_t, t)\, dt + g(x_t, t)\, dW_t$$
where $f(x_t, t)$ is the drift component that is deterministic, and $g(x_t, t)\, dW_t$ is the diffusion term driven by Brownian motion3 (denoted by $W_t$) that is stochastic. This differential equation specifies a time-dependent vector (velocity) field4 telling us how a data point $x_t$ should move as time evolves from $t = 0$ to $t = 1$ (i.e., a flow5 from $p_0$ to $p_1$). Below we give an illustration where $x$ is 1-dimensional:
Vector field between two distributions specified by a differential equation.
When $g(x_t, t) = 0$, we get an ordinary differential equation (ODE)6 where the vector field is deterministic, i.e., the movement of $x_t$ is fully determined by $x_t$ and $t$. Otherwise, we get a stochastic differential equation (SDE)7 where the movement of $x_t$ has a certain level of randomness. Extending the previous illustration, below we show the difference in the flow of $x_t$ under an ODE and an SDE:
Difference of movements in vector fields specified by ODE and SDE. Source: Song, Yang, et al. “Score-based generative modeling through stochastic differential equations.” Note that their time is reversed.
As you would imagine, once we manage to solve the differential equation, even if we still cannot obtain a closed form of $p_1$, we can sample from it by sampling a data point $x_0$ from $p_0$ and obtaining the generated data point $x_1$ through the following forward-time integral, computed with an integration technique of our choice:

$$x_1 = x_0 + \int_0^1 f(x_t, t)\, dt + \int_0^1 g(x_t, t)\, dW_t$$
Or, more intuitively, moving $x_0$ towards $x_1$ along time in the vector field:
A flow of a data point moving from $x_0$ towards $x_1$ in the vector field.
ODE and Flow Matching
ODE in Generative Modeling
For now, let’s focus on the ODE formulation since it is notationally simpler than the SDE. Recall the ODE of our generative model:

$$dx_t = f(x_t, t)\, dt$$
Essentially, $f(x_t, t)$ is the vector field. For every possible combination of data point $x_t$ and time $t$, $f(x_t, t)$ is the instantaneous velocity at which the point will move. To generate a data point $x_1$, we perform the integral:

$$x_1 = x_0 + \int_0^1 f(x_t, t)\, dt$$
To calculate this integral, a simple and widely adopted method is the Euler method8. Choose time steps $0 = t_0 < t_1 < \dots < t_N = 1$, and for each integration step $i$:

$$x_{t_{i+1}} = x_{t_i} + (t_{i+1} - t_i)\, f(x_{t_i}, t_i)$$
In other words, we discretize the time span $[0, 1]$ into $N$ time steps, and at each step the data point is moved based on the instantaneous velocity at the current step.
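To make this concrete, below is a minimal sketch of the Euler sampler in PyTorch. The velocity network `f_theta(x, t)` is a placeholder for whatever model you have trained; the sketch assumes it takes a batch of points and a batch of times.

```python
import torch

@torch.no_grad()
def euler_sample(f_theta, x0, num_steps=50):
    """Integrate dx_t = f(x_t, t) dt from t=0 to t=1 with the Euler method.

    f_theta is an assumed learned velocity network f_theta(x, t);
    x0 is a batch of samples drawn from the source distribution p_0.
    """
    x = x0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((x.shape[0],), i * dt, device=x.device)
        x = x + dt * f_theta(x, t)  # x_{t+dt} = x_t + dt * f(x_t, t)
    return x
```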
Note
There are other methods to calculate the integral, of course. For example, one can use the solvers in the `torchdiffeq` Python package9.
Flow Matching
In many scenarios, the exact form of the vector field $f(x_t, t)$ is unknown. The general idea of flow matching10 is to find a ground truth vector field that defines a flow transporting $p_0$ to $p_1$, and to build a neural network $f_\theta(x_t, t)$ that is trained to match the ground truth vector field, hence the name. In practice, this is usually done by independently sampling $x_0$ from the noise distribution and $x_1$ from the training data, calculating the intermediate data point $x_t$ and the ground truth velocity at $(x_t, t)$, and minimizing the deviation between $f_\theta(x_t, t)$ and that ground truth velocity.
Ideally, the ground truth vector field should be as straight as possible, so we can use a small number of steps to calculate the ODE integral. Thus, the ground truth velocity is usually defined following the optimal transport flow map:

$$x_t = (1 - t)\, x_0 + t\, x_1, \qquad \frac{dx_t}{dt} = x_1 - x_0$$
And a neural network $f_\theta$ is trained to match the ground truth vectors as:

$$\mathcal{L}_{\text{FM}} = \mathbb{E}_{x_0, x_1, t}\left[\left\| f_\theta(x_t, t) - (x_1 - x_0) \right\|^2\right]$$
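As a rough sketch of what this looks like in code (assuming a velocity network `f_theta(x, t)`; names and shapes are illustrative, not a canonical implementation):

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(f_theta, x1):
    """One flow matching training step with the optimal transport flow map."""
    x0 = torch.randn_like(x1)                      # sample noise x_0 ~ p_0
    t = torch.rand(x1.shape[0], device=x1.device)  # sample t uniformly in [0, 1]
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))       # reshape t to broadcast over data dims
    xt = (1 - t_) * x0 + t_ * x1                   # intermediate point x_t
    return F.mse_loss(f_theta(xt, t), x1 - x0)     # match the ground truth velocity x_1 - x_0
```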
Curvy Vector Field
Although the ground truth vector field is designed to be straight, in practice it usually is not. When the data space is high-dimensional and the target distribution is complex, there will be multiple pairs of $(x_0, x_1)$ that result in the same intermediate data point $x_t$, and thus multiple velocities $x_1 - x_0$. At the end of the day, the actual ground truth velocity at $(x_t, t)$ will be the average of all possible velocities that pass through $x_t$. This leads to a “curvy” vector field, illustrated as follows:
Left: multiple vectors passing through the same intermediate data point. Right: the resulting ground truth vector field. Source: Geng, Zhengyang, et al. “Mean Flows for One-step Generative Modeling.” Note that the notation in the figure follows the original paper and differs from the notation used in this post.
As we discussed, when you calculate the ODE integral, you are using the instantaneous velocity (the tangent of the curves in the vector field) at each step. As you would imagine, this leads to subpar performance when using a small number of steps, as demonstrated below:
Native flow matching models fail at few-step sampling. Source: Frans, Kevin, et al. “One step diffusion via shortcut models.”
Shortcut Vector Field
If we cannot straighten the ground truth vector field, can we tackle the problem of few-step sampling by learning velocities that properly jump across long time steps instead of learning the instantaneous velocities? Yes, we can.
Shortcut Models
Shortcut models11 implement the above idea by training a network $s_\theta(x_t, t, d)$ to match the velocities that jump across long time steps (termed shortcuts in the paper). A ground truth shortcut with step size $d$ is the velocity pointing from $x_t$ to $x_{t+d}$; formally:

$$s(x_t, t, d) = \frac{x_{t+d} - x_t}{d}$$
Ideally, you can transform $x_0$ to $x_1$ within one step with the learned shortcuts:

$$x_1 = x_0 + s_\theta(x_0, 0, 1)$$
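A sketch of few-step sampling with such a network might look like the following, where `s_theta(x, t, d)` is an assumed shortcut network; with `num_steps=1` it reduces to the one-step case above.

```python
import torch

@torch.no_grad()
def shortcut_sample(s_theta, x0, num_steps=1):
    """Few-step sampling with an assumed shortcut network s_theta(x, t, d)."""
    x = x0
    d = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((x.shape[0],), i * d, device=x.device)
        d_ = torch.full_like(t, d)
        x = x + d * s_theta(x, t, d_)  # jump a step of size d along the shortcut
    return x
```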
Note
Of course, in practice shortcut models face the same problem mentioned in the Curvy Vector Field section: the same data point $x_t$ corresponds to multiple shortcut velocities pointing to different data points $x_{t+d}$, making the ground truth shortcut velocity at $(x_t, t, d)$ the average of all possibilities. So, shortcut models have a performance advantage over conventional flow matching models with few sampling steps, but their one-step performance typically does not match their performance with more steps.
The theory is quite straightforward; the tricky part is the model training. First, the network expands from learning all possible velocities at $(x_t, t)$ to all possible velocities at $(x_t, t, d)$ for every valid step size $d$. Essentially, the shortcut vector field has one more dimension than the instantaneous vector field, making the learning space larger. Second, calculating a ground truth shortcut involves calculating an integral of the instantaneous velocity from $t$ to $t + d$, which can be computationally heavy.
To tackle these challenges, shortcut models introduce the self-consistency of shortcuts: one shortcut with step size $2d$ should equal the average of two consecutive shortcuts, each with step size $d$:

$$s(x_t, t, 2d) = \frac{1}{2}\left[\, s(x_t, t, d) + s(x_{t+d}, t + d, d) \,\right], \qquad x_{t+d} = x_t + d\, s(x_t, t, d)$$
The model is then trained with a combination of matching instantaneous velocities and self-consistency of shortcuts, as below. Notice that we don’t train a separate network for matching the instantaneous velocities, but leverage the fact that the shortcut is the instantaneous velocity when $d = 0$:

$$\mathcal{L} = \mathbb{E}\left[\; \underbrace{\left\| s_\theta(x_t, t, 0) - (x_1 - x_0) \right\|^2}_{\text{flow matching}} \;+\; \underbrace{\left\| s_\theta(x_t, t, 2d) - \mathrm{sg}\!\left(s_{\text{target}}\right) \right\|^2}_{\text{self-consistency}} \;\right]$$
where $\mathrm{sg}(\cdot)$ is stop gradient, i.e., detaching from back-propagation, which makes the self-consistency target $s_{\text{target}} = \frac{1}{2}\left[\, s_\theta(x_t, t, d) + s_\theta(x_{t+d}, t + d, d) \,\right]$ (with $x_{t+d} = x_t + d\, s_\theta(x_t, t, d)$) a pseudo ground truth. Below is an illustration of the training process provided in the original paper.
Training of the shortcut models with self-consistency loss.
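For intuition, here is a rough sketch of one training step under stated simplifications: a single fixed step size `d` instead of sampling step sizes as in the paper, and no separate batching of the two loss terms. `s_theta(x, t, d)` is again an assumed shortcut network.

```python
import torch
import torch.nn.functional as F

def shortcut_loss(s_theta, x1, d=0.125):
    """Flow matching + self-consistency loss for a shortcut model (simplified sketch)."""
    x0 = torch.randn_like(x1)
    # Sample t so that t + 2d stays inside [0, 1].
    t = torch.rand(x1.shape[0], device=x1.device) * (1 - 2 * d)
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))
    xt = (1 - t_) * x0 + t_ * x1
    zeros, d_ = torch.zeros_like(t), torch.full_like(t, d)

    # Flow matching term: the shortcut with d = 0 is the instantaneous velocity.
    loss_fm = F.mse_loss(s_theta(xt, t, zeros), x1 - x0)

    # Self-consistency term: a shortcut of size 2d should equal the average of
    # two consecutive shortcuts of size d; the target is detached (stop gradient).
    with torch.no_grad():
        s1 = s_theta(xt, t, d_)
        x_next = xt + d * s1
        s2 = s_theta(x_next, t + d, d_)
        target = 0.5 * (s1 + s2)
    loss_sc = F.mse_loss(s_theta(xt, t, 2 * d_), target)

    return loss_fm + loss_sc
```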
Mean Flow
Mean flow12 is another work sharing the idea of learning velocities that take large-step-size shortcuts, but with a stronger theoretical foundation and a different approach to training.
Illustration of the average velocity provided in the original paper.
Mean flow defines an average velocity $u(x_t, r, t)$ as a shortcut between times $r$ and $t$, where $r$ and $t$ are independent:

$$u(x_t, r, t) = \frac{1}{t - r} \int_r^t f(x_\tau, \tau)\, d\tau$$
This average velocity is essentially equivalent to a shortcut in shortcut models given $d = t - r$. What differentiates mean flow from shortcut models is that mean flow aims to provide a ground truth of the vector field defined by $u$, and directly trains a network $u_\theta$ to match that ground truth.
We transform the above equation by differentiating both sides with respect to $t$ and rearranging components, and get:

$$u(x_t, r, t) = f(x_t, t) - (t - r)\, \frac{d}{dt} u(x_t, r, t)$$
We get the average velocity on the left, and the instantaneous velocity and time derivative components on the right. This defines the ground truth average vector field, and our goal now is to calculate the right-hand side. We already know the ground truth instantaneous velocity, $f(x_t, t) = x_1 - x_0$ (in the per-sample sense used in flow matching training). To compute the time derivative component, we can expand it in terms of partial derivatives:

$$\frac{d}{dt} u(x_t, r, t) = \frac{\partial u}{\partial x_t}\frac{d x_t}{d t} + \frac{\partial u}{\partial r}\frac{d r}{d t} + \frac{\partial u}{\partial t}\frac{d t}{d t}$$
From the ODE definition, $\frac{d x_t}{d t} = f(x_t, t)$ and $\frac{d t}{d t} = 1$. Since $r$ and $t$ are independent, $\frac{d r}{d t} = 0$. Thus, we have:

$$\frac{d}{dt} u(x_t, r, t) = \frac{\partial u}{\partial x_t}\, f(x_t, t) + \frac{\partial u}{\partial t}$$
This means the time derivative component is the product of the Jacobian of $u$ with respect to $(x_t, r, t)$ and the tangent vector $(f(x_t, t), 0, 1)$. In practice, this can be computed using the Jacobian-vector product (JVP) functions in NN libraries, such as the `torch.func.jvp` function in PyTorch. In summary, the mean flow loss function is:

$$\mathcal{L}_{\text{MF}} = \mathbb{E}\left[\left\| u_\theta(x_t, r, t) - \mathrm{sg}\!\left( f(x_t, t) - (t - r)\left( \frac{\partial u_\theta}{\partial x_t}\, f(x_t, t) + \frac{\partial u_\theta}{\partial t} \right) \right) \right\|^2\right], \qquad f(x_t, t) = x_1 - x_0$$
Notice that the JVP computation inside $\mathrm{sg}(\cdot)$ is performed with the network $u_\theta$ itself. In this regard, this loss function shares a similar idea with the self-consistency loss in shortcut models: supervising the network with outputs produced by the network itself.
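Below is a sketch of how this objective could be written with `torch.func.jvp`, assuming an average-velocity network `u_theta(x, r, t)`; the sampling of `r` and `t` and the loss weighting are simplified relative to the original paper.

```python
import torch
import torch.nn.functional as F

def mean_flow_loss(u_theta, x1):
    """Mean flow objective (simplified sketch) using a forward-mode JVP."""
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.shape[0], device=x1.device)
    r = torch.rand(x1.shape[0], device=x1.device) * t   # ensure r <= t
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))
    xt = (1 - t_) * x0 + t_ * x1
    v = x1 - x0                                          # ground truth instantaneous velocity

    # d/dt u(x_t, r, t) = (du/dx) f + du/dt, i.e. a JVP with tangents (f, 0, 1).
    u, dudt = torch.func.jvp(
        u_theta, (xt, r, t), (v, torch.zeros_like(r), torch.ones_like(t))
    )

    # Target: u = f - (t - r) * du/dt, detached so it acts as a pseudo ground truth.
    tr = (t - r).view(-1, *([1] * (x1.dim() - 1)))
    return F.mse_loss(u, (v - tr * dudt).detach())
```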
Extended Reading: Rectified Flow
Both shortcut models and mean flow are built on top of the ground truth curvy ODE field. They don’t modify the field $f(x_t, t)$, but rather try to learn shortcut velocities that can traverse the field with fewer Euler steps. This is reflected in their loss function design: shortcut models’ loss function explicitly includes a standard flow matching component, and mean flow’s loss function is derived from the relationship between the vector fields $u$ and $f$.
Rectified flow13, another family of flow matching models that aims to achieve one-step sampling, is fundamentally different in this regard. It aims to replace the original ground truth ODE field with a new one with straight flows. Ideally, the resulting ODE field has zero curvature, enabling one-step integration with the simple Euler method. This usually involves augmentation of the training data and a repeated reflow process.
We won’t discuss rectified flow in further detail in this post, but it’s worth pointing out its difference from shortcut models and mean flow.
SDE and Score Matching
SDE in Generative Modeling
SDE, as its name suggests, is a differential equation with a stochastic component. Recall the general differential equation we introduced at the beginning:

$$dx_t = f(x_t, t)\, dt + g(x_t, t)\, dW_t$$
In practice, the diffusion term usually only depends on $t$, so we will use the simpler formula going forward:

$$dx_t = f(x_t, t)\, dt + g(t)\, dW_t$$
$W_t$ is the Brownian motion (a.k.a. standard Wiener process). In practice, its behavior over time can be described as $W_{t + \Delta t} - W_t \sim \mathcal{N}(0, \Delta t\, I)$. This is the source of the SDE’s stochasticity, and also why people like to call the family of SDE-based generative models diffusion models14, since Brownian motion is derived from physical diffusion processes15.
In the context of generative modeling, the stochasticity in SDE means it can theoretically handle augmented data or data that is stochastic in nature (e.g., financial data) more gracefully. Practically, it also enables techniques such as stochastic control guidance16. At the same time, it also means SDE is mathematically more complicated than ODE. We no longer have a deterministic vector field specifying flows of data points moving towards $p_1$. Instead, both $f$ and $g$ have to be designed to ensure that the SDE leads to the target distribution we want.
To solve the SDE, similar to the Euler method used for solving an ODE, we can use the Euler-Maruyama method17:

$$x_{t_{i+1}} = x_{t_i} + (t_{i+1} - t_i)\, f(x_{t_i}, t_i) + g(t_i)\, \sqrt{t_{i+1} - t_i}\; \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, I)$$
In other words, we move the data point guided by the velocity $f(x_{t_i}, t_i)$, plus a bit of Gaussian noise scaled by $g(t_i)\sqrt{t_{i+1} - t_i}$.
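As a minimal sketch (assuming a learned drift network `f_theta(x, t)` and a diffusion coefficient given as a plain Python callable `g(t)`):

```python
import torch

@torch.no_grad()
def euler_maruyama_sample(f_theta, g, x0, num_steps=100):
    """Integrate dx_t = f(x_t, t) dt + g(t) dW_t from t=0 to t=1 (Euler-Maruyama)."""
    x = x0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((x.shape[0],), i * dt, device=x.device)
        noise = torch.randn_like(x) * (dt ** 0.5)        # dW ~ N(0, dt)
        x = x + dt * f_theta(x, t) + g(i * dt) * noise   # drift step + scaled Gaussian noise
    return x
```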
Score Matching
In an SDE, the exact form of the vector field $f(x_t, t)$ is still (quite likely) unknown. To solve this, the general idea is consistent with flow matching: we want to find the ground truth and build a neural network to match it.
Score matching models18 implement this idea by parameterizing $f(x_t, t)$ as:

$$f(x_t, t) = v(x_t, t) + \frac{g(t)^2}{2}\, \nabla_{x_t} \log p_t(x_t)$$
where $v(x_t, t)$ is a velocity similar to that in the ODE, and $\nabla_{x_t} \log p_t(x_t)$ is the score (a.k.a. informant)19 of $p_t$. Without going too deep into the relevant theories, think of the score as a “compass” that points in the direction in which $x_t$ becomes more likely to belong to the distribution $p_t$. The beauty of introducing the score is that, depending on the definition of the ground truth $p_t$, the velocity can be derived from the score, or vice versa. Then, we only have to focus on building a learnable score function $s_\theta(x_t, t)$ to match the ground truth score using the loss function below, hence the name score matching:

$$\mathcal{L}_{\text{SM}} = \mathbb{E}\left[\left\| s_\theta(x_t, t) - \nabla_{x_t} \log p_t(x_t \mid x_1) \right\|^2\right]$$
For example, if we have time-dependent coefficients $\alpha_t$ and $\sigma_t$ (termed noise schedulers in most diffusion models), and define that $x_t$ follows the distribution $\mathcal{N}(\alpha_t x_1, \sigma_t^2 I)$ given a clean data point $x_1$:

$$x_t = \alpha_t x_1 + \sigma_t \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
then we will have:

$$\nabla_{x_t} \log p_t(x_t \mid x_1) = -\frac{x_t - \alpha_t x_1}{\sigma_t^2} = -\frac{\epsilon}{\sigma_t}$$
Some works14 also propose to re-parameterize the score function with the noise $\epsilon$ sampled from a standard normal distribution, so that the neural network can be a learnable denoiser $\epsilon_\theta(x_t, t)$ that matches the noise rather than the score. Since $\nabla_{x_t} \log p_t(x_t \mid x_1) = -\epsilon / \sigma_t$, the two approaches are equivalent.
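Putting the pieces together, a denoising score matching step might be sketched as follows, where `s_theta(x, t)` is an assumed score network and `alpha(t)`, `sigma(t)` are assumed noise-schedule callables:

```python
import torch
import torch.nn.functional as F

def score_matching_loss(s_theta, x1, alpha, sigma):
    """Denoising score matching with a generic Gaussian noise schedule (sketch)."""
    t = torch.rand(x1.shape[0], device=x1.device)
    a = alpha(t).view(-1, *([1] * (x1.dim() - 1)))
    s = sigma(t).view(-1, *([1] * (x1.dim() - 1)))
    eps = torch.randn_like(x1)
    xt = a * x1 + s * eps                 # x_t = alpha_t * x_1 + sigma_t * eps
    target = -eps / s                     # conditional score: -(x_t - alpha_t x_1) / sigma_t^2
    return F.mse_loss(s_theta(xt, t), target)
```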
Shortcuts in SDE
Most existing efforts sharing the idea of shortcut vector fields are grounded in ODEs. However, given the close relationship between SDE and ODE, learning an SDE that follows the same idea should be straightforward. Generally speaking, SDE training, similar to ODE training, focuses on the deterministic drift component $f(x_t, t)$. One should be able to, for example, use the same mean flow loss function to train a score function for solving an SDE.
Note
Needless to say, generalizing shortcut models and mean flow to flow matching models with ground truth vector fields other than optimal transport flow requires no modification either, since most such models (e.g., Bayesian flow) are ultimately grounded in ODE.
One caveat of training a “shortcut SDE” is that the ideal result of one-step sampling contradicts the stochastic nature of SDE: if you are going to perform the sampling in one step, you are probably better off using an ODE to begin with. Still, I believe it would be useful to train an SDE such that its benefits over ODE are preserved while still allowing the number of sampling steps to be lowered for better computational efficiency.
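To illustrate the idea, here is a purely hypothetical sketch of few-step SDE sampling where the drift comes from a shortcut-style network `u_theta(x, t, d)` (e.g., one trained with a mean-flow-like objective) and `g(t)` is the diffusion coefficient; this is a sketch of the concept, not a validated recipe.

```python
import torch

@torch.no_grad()
def shortcut_sde_sample(u_theta, g, x0, num_steps=10):
    """Hypothetical few-step Euler-Maruyama sampling with a shortcut-style drift."""
    x = x0
    d = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((x.shape[0],), i * d, device=x.device)
        d_ = torch.full_like(t, d)
        noise = torch.randn_like(x) * (d ** 0.5)        # dW ~ N(0, d)
        x = x + d * u_theta(x, t, d_) + g(i * d) * noise
    return x
```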
Below are some preliminary results I obtained from a set of amorphous material generation experiments. You don’t need to understand the figure—just know that it shows that applying the idea of learning shortcuts to SDE does yield better results compared to the vanilla SDE when using few-step sampling.
Structural functions of generated materials, sampled in 10 steps.
References
- Holderrieth and Erives, “An Introduction to Flow Matching and Diffusion Models.”
- Song and Ermon, “Generative Modeling by Estimating Gradients of the Data Distribution.”
Footnotes
- Rezende, Danilo, and Shakir Mohamed. “Variational inference with normalizing flows.” ↩
- https://en.wikipedia.org/wiki/Ordinary_differential_equation ↩
- https://en.wikipedia.org/wiki/Stochastic_differential_equation ↩
- Lipman, Yaron, et al. “Flow matching for generative modeling.” ↩
- Frans, Kevin, et al. “One step diffusion via shortcut models.” ↩
- Geng, Zhengyang, et al. “Mean Flows for One-step Generative Modeling.” ↩
- Liu, Xingchao, et al. “Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow.” ↩
- Ho, Jonathan, Ajay Jain, and Pieter Abbeel. “Denoising diffusion probabilistic models.” ↩ ↩2
- Huang et al., “Symbolic Music Generation with Non-Differentiable Rule Guided Diffusion.” ↩
- Song et al., “Score-Based Generative Modeling through Stochastic Differential Equations.” ↩