A Beginner's Guide to Diffusion Models

February 15, 2025

Itô process, Chapman–Kolmogorov, Kramers–Moyal

1. The Chapman–Kolmogorov Equation

1.1. What Is It?

A Markov process is one in which the future depends only on the present state, not on its past. Let

$$P(x, t \mid x_0, t_0)$$

denote the probability density of transitioning from state $x_0$ at time $t_0$ to state $x$ at time $t$. The Chapman–Kolmogorov equation tells us that long-time transitions can be computed by “summing over” intermediate states. For three times $t_0 < t_1 < t_2$, the equation reads:

$$P(x, t_2 \mid x_0, t_0) = \int_{-\infty}^{\infty} P(x, t_2 \mid y, t_1)\,P(y, t_1 \mid x_0, t_0)\,dy.$$
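As a quick numerical sanity check, the sketch below assumes the transition density of standard Brownian motion, a Gaussian whose variance equals the elapsed time. That choice is only an illustration; the equation itself holds for any Markov process. The helper name `gauss_kernel`, the times, and the grid are hypothetical.

```python
import numpy as np

def gauss_kernel(x, y, dt):
    """Transition density of standard Brownian motion from y to x over time dt."""
    return np.exp(-(x - y) ** 2 / (2.0 * dt)) / np.sqrt(2.0 * np.pi * dt)

# Times t0 < t1 < t2 and the start/end states (illustrative values).
t0, t1, t2 = 0.0, 0.7, 1.5
x0, x = 0.3, -0.4

# Integrate over the intermediate state y with a simple Riemann sum.
y = np.linspace(-12.0, 12.0, 20001)
dy = y[1] - y[0]
integrand = gauss_kernel(x, y, t2 - t1) * gauss_kernel(y, x0, t1 - t0)
two_step = np.sum(integrand) * dy

# Direct transition density over the whole interval.
direct = gauss_kernel(x, x0, t2 - t0)

print(two_step, direct)
```

The two printed numbers should agree to many digits, because composing two Gaussian kernels adds their variances.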

1.2. Step-by-Step Derivation

  1. Law of Total Probability:
    To compute the probability of being in state $x$ at time $t_2$ given $x_0$ at time $t_0$, we condition on an intermediate state $y$ at time $t_1$:

    $$\Pr(X_{t_2} = x \mid X_{t_0} = x_0) = \int \Pr(X_{t_2} = x,\, X_{t_1} = y \mid X_{t_0} = x_0)\,dy.$$
  2. Markov Property:
    By the product rule, the joint density in the integrand factors as

    $$\Pr(X_{t_2} = x,\, X_{t_1} = y \mid X_{t_0} = x_0) = \Pr(X_{t_2} = x \mid X_{t_1} = y,\, X_{t_0} = x_0)\,\Pr(X_{t_1} = y \mid X_{t_0} = x_0).$$

    Since the process is Markovian,

    $$\Pr(X_{t_2} = x \mid X_{t_1} = y,\, X_{t_0} = x_0) = \Pr(X_{t_2} = x \mid X_{t_1} = y),$$

    so we can write:

    $$P(x, t_2 \mid x_0, t_0) = \int P(x, t_2 \mid y, t_1)\,P(y, t_1 \mid x_0, t_0)\,dy.$$
  3. Discrete Case:
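    In a discrete state space, the integral becomes a sum. Writing $P_{ij}^{(n)}$ for the probability of going from state $i$ to state $j$ in $n$ steps:

    $$P_{ij}^{(n+m)} = \sum_{k} P_{ik}^{(n)}\,P_{kj}^{(m)}.$$

    This shows that multi-step transition probabilities are obtained by multiplying the transition matrices of the shorter steps. In the continuous case, the matrix product is replaced by the integral over the intermediate state.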
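The discrete identity is easy to verify numerically. The sketch below uses a hypothetical 3-state transition matrix (the values are made up for illustration) and checks that the $(n+m)$-step matrix equals the product of the $n$-step and $m$-step matrices.

```python
import numpy as np

# A hypothetical 3-state transition matrix; each row sums to 1.
P = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.0, 0.3, 0.7],
])

n, m = 2, 3
P_n = np.linalg.matrix_power(P, n)        # n-step transition probabilities
P_m = np.linalg.matrix_power(P, m)        # m-step transition probabilities
P_nm = np.linalg.matrix_power(P, n + m)   # (n+m)-step transition probabilities

# Chapman-Kolmogorov: P^(n+m)_ij = sum_k P^(n)_ik P^(m)_kj
ck_holds = np.allclose(P_nm, P_n @ P_m)
print(ck_holds)
```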


2. The Kramers–Moyal Expansion

2.1. From Chapman–Kolmogorov to a Differential Equation

For continuous processes, we study the evolution of the probability density $p(x,t)$ over a small time interval $\Delta t$. Starting with:

$$p(x, t+\Delta t \mid x_0, t_0) = \int_{-\infty}^{\infty} p(x, t+\Delta t \mid y, t)\,p(y, t \mid x_0, t_0)\,dy,$$

we probe the evolution by multiplying by a smooth test function $\phi(x)$ and integrating over $x$.

2.2. Taylor Expansion of the Test Function

For a fixed intermediate state $y$, we expand $\phi(x)$ about $x = y$:

$$\phi(x) = \phi(y) + (x-y)\,\phi'(y) + \frac{(x-y)^2}{2}\,\phi''(y) + \cdots.$$

Substituting this expansion into the inner integral yields:

$$\int \phi(x)\,p(x, t+\Delta t \mid y, t)\,dx = \phi(y) \underbrace{\int p(x, t+\Delta t \mid y, t)\,dx}_{=1} + \phi'(y) \int (x-y)\,p(x, t+\Delta t \mid y, t)\,dx + \cdots.$$

2.3. Defining Moments and Kramers–Moyal Coefficients

Define the $n$-th moment over the small interval $\Delta t$ as:

$$M^{(n)}(y,t,\Delta t) = \int (x-y)^n\,p(x, t+\Delta t \mid y, t)\,dx.$$

Assuming these moments scale linearly with $\Delta t$, we define the Kramers–Moyal coefficients as:

$$D^{(n)}(y,t) = \lim_{\Delta t \to 0} \frac{1}{n!\,\Delta t}\,M^{(n)}(y,t,\Delta t).$$
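To make the definition concrete, the sketch below estimates the first two coefficients from samples at a small but finite time step. It assumes an Ornstein–Uhlenbeck process $dX = -\theta X\,dt + \sigma\,dW$, chosen here only because its transition density is an exactly known Gaussian; for that process the coefficients are $D^{(1)}(y) = -\theta y$ and $D^{(2)}(y) = \sigma^2/2$. The parameter values and the helper `km_coefficient` are hypothetical.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

theta, sigma = 1.0, 0.8                  # assumed OU parameters
y, dt, n_samples = 0.5, 0.01, 1_000_000  # start state, small time step, sample count

# Exact OU transition: X_{t+dt} given X_t = y is Gaussian.
trans_mean = y * np.exp(-theta * dt)
trans_std = sigma * np.sqrt((1.0 - np.exp(-2.0 * theta * dt)) / (2.0 * theta))
x = trans_mean + trans_std * rng.standard_normal(n_samples)

def km_coefficient(x, y, dt, n):
    """Finite-dt estimate of D^(n)(y): sample moment M^(n) divided by n! * dt."""
    moment = np.mean((x - y) ** n)
    return moment / (math.factorial(n) * dt)

d1_est = km_coefficient(x, y, dt, 1)
d2_est = km_coefficient(x, y, dt, 2)
print(d1_est, -theta * y)        # estimate vs. limiting value of D^(1)
print(d2_est, sigma ** 2 / 2.0)  # estimate vs. limiting value of D^(2)
```

Because `dt` is finite, the estimates carry a small bias of order `dt` on top of the sampling noise.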

2.4. Fokker–Planck Equation

Inserting the Taylor expansion into the integrated Chapman–Kolmogorov equation, subtracting the zeroth-order term, dividing by $\Delta t$, and letting $\Delta t \to 0$ expresses $\int \phi(x)\,\partial_t p(x,t)\,dx$ as a sum of terms $\int \phi^{(n)}(y)\,D^{(n)}(y,t)\,p(y,t)\,dy$. Integrating each term by parts $n$ times moves the derivatives from $\phi$ onto $D^{(n)}p$ and produces a factor $(-1)^n$. Because $\phi$ is arbitrary, we obtain:

$$\frac{\partial p(x,t)}{\partial t} = \sum_{n=1}^{\infty} (-1)^n \frac{\partial^n}{\partial x^n} \Bigl[D^{(n)}(x,t)\,p(x,t)\Bigr].$$

In many applications, the coefficients $D^{(n)}$ for $n \ge 3$ vanish or are negligible. Truncating at $n=2$ yields the Fokker–Planck equation:

$$\frac{\partial p(x,t)}{\partial t} = -\frac{\partial}{\partial x}\Bigl[D^{(1)}(x,t)\,p(x,t)\Bigr] + \frac{\partial^2}{\partial x^2}\Bigl[D^{(2)}(x,t)\,p(x,t)\Bigr].$$
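As a worked special case (an illustration chosen here, not part of the derivation), take zero drift, $D^{(1)} = 0$, and a constant second coefficient, $D^{(2)} = D$. The Fokker–Planck equation then reduces to the heat equation, whose solution for a process started at the point $x_0$ at time $0$ is a spreading Gaussian:

```latex
\frac{\partial p}{\partial t} = D\,\frac{\partial^2 p}{\partial x^2},
\qquad
p(x, t \mid x_0, 0) = \frac{1}{\sqrt{4\pi D t}}\,
\exp\!\left(-\frac{(x - x_0)^2}{4 D t}\right).
```

The variance of this density is $2Dt$, so it grows linearly in time, which is consistent with the assumption that the second moment scales linearly with $\Delta t$.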

3. Itô’s Lemma

3.1. Real-World Example: Brownian Motion of a Pollen Grain

Imagine you are observing a tiny pollen grain suspended in water. The grain is bombarded by water molecules, and these collisions cause it to move in a seemingly random way. This erratic motion is called Brownian motion.

1. Discrete Modeling

Suppose you record the position of the pollen grain at discrete time intervals of length $\Delta t$. At each time step, the grain’s position changes due to:

  • Drift: There might be a very slight overall current in the water, which gives a predictable, small shift.
  • Random Kicks (Diffusion): The collisions with water molecules produce random displacements.

A discrete update of the position $X_t$ can be written as:

$$X_{t+\Delta t} = X_t + \mu(X_t, t)\,\Delta t + \sigma(X_t, t)\,\sqrt{\Delta t}\,Z,$$

where:

  • $X_t$ is the pollen grain’s position at time $t$.
  • $\mu(X_t, t)$ represents any systematic drift (for example, due to a gentle water current).
  • $\sigma(X_t, t)$ represents the intensity of the random collisions.
  • $Z$ is a standard normal random variable, $Z\sim N(0,1)$, drawn independently at each time step.

In this context, the term $\mu(X_t,t)\,\Delta t$ models the small, steady displacement due to the current, and the term $\sigma(X_t,t)\,\sqrt{\Delta t}\,Z$ models the random displacements caused by molecular collisions.

2. Variance and Scaling

Because $Z$ is normally distributed with mean 0 and variance 1, the variance of the random term is:

$$\text{Var}\Bigl[\sigma(X_t,t)\,\sqrt{\Delta t}\,Z\Bigr] = \sigma^2(X_t,t)\,\Delta t.$$

This shows that over a short time interval $\Delta t$, the variance of the displacement is proportional to $\Delta t$, which is a hallmark of Brownian motion.

3. Taking the Continuous-Time Limit

When we let $\Delta t \to 0$, the process is observed over infinitely many infinitesimally small time steps. In the limit, by the central limit theorem (and Donsker’s invariance principle), the cumulative effect of the random displacements converges to a continuous-time Brownian motion $W_t$. Therefore, the discrete update

$$X_{t+\Delta t} = X_t + \mu(X_t, t)\,\Delta t + \sigma(X_t, t)\,\sqrt{\Delta t}\,Z$$

transforms into the stochastic differential equation (SDE):

$$dX_t = \mu(X_t,t)\,dt + \sigma(X_t,t)\,dW_t.$$

Here,

  • The term $\mu(X_t,t)\,dt$ still represents the drift (the effect of the current in the water).
  • The term $\sigma(X_t,t)\,dW_t$ represents the random fluctuations (the effect of molecular collisions), with the important property that $(dW_t)^2 = dt$.
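The discrete update rule can be run directly as a simulation scheme, usually called the Euler–Maruyama method. The sketch below is a minimal version under assumed constant drift and diffusion (the values `drift = 0.1` and `vol = 0.5` are made up); the function name `euler_maruyama` and its signature are hypothetical.

```python
import numpy as np

def euler_maruyama(mu, sigma, x0, t_end, n_steps, n_paths, rng):
    """Simulate dX = mu(X,t) dt + sigma(X,t) dW using the discrete update rule."""
    dt = t_end / n_steps
    x = np.full(n_paths, x0, dtype=float)
    for k in range(n_steps):
        t = k * dt
        z = rng.standard_normal(n_paths)  # independent Z ~ N(0, 1) per path and step
        x = x + mu(x, t) * dt + sigma(x, t) * np.sqrt(dt) * z
    return x

rng = np.random.default_rng(0)
drift, vol = 0.1, 0.5  # assumed constant coefficients
x_end = euler_maruyama(
    mu=lambda x, t: drift * np.ones_like(x),
    sigma=lambda x, t: vol * np.ones_like(x),
    x0=0.0, t_end=2.0, n_steps=200, n_paths=50_000, rng=rng,
)

# With constant coefficients, X_T is Gaussian with mean drift*T and variance vol^2*T.
print(x_end.mean(), x_end.var())
```

With these values the sample mean should be close to $0.1 \times 2 = 0.2$ and the sample variance close to $0.5^2 \times 2 = 0.5$, illustrating the linear growth of variance with time.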

3.2. The Setup

Suppose that $X_t$ satisfies the stochastic differential equation (SDE):

$$dX_t = \mu(X_t,t)\,dt + \sigma(X_t,t)\,dW_t,$$

where:

  • $\mu(X_t,t)$ is the drift,
  • $\sigma(X_t,t)$ is the diffusion coefficient,
  • $dW_t$ is an increment of standard Brownian motion, with $\mathbb{E}[dW_t] = 0,\quad (dW_t)^2 = dt,\quad dt\,dW_t = 0,\quad dt^2 = 0.$

Let $f(x,t)$ be a function in $C^{1,2}$ (i.e., once continuously differentiable in $t$ and twice continuously differentiable in $x$). We want to compute $df(X_t,t)$.

3.3. Derivation

  1. Taylor Expansion:
    Expand $f(X_t + dX_t, t+dt)$ about $(X_t, t)$ and write $df$ for the resulting change in $f$:

    $$df = f_t(X_t,t)\,dt + f_x(X_t,t)\,dX_t + \frac{1}{2} f_{xx}(X_t,t)\,(dX_t)^2 + \text{higher order terms}.$$
  2. Substitute the SDE:
    Replace $dX_t$ by

    $$dX_t = \mu(X_t,t)\,dt + \sigma(X_t,t)\,dW_t.$$

    Thus,

    $$f_x(X_t,t)\,dX_t = f_x(X_t,t) \Bigl[\mu(X_t,t)\,dt + \sigma(X_t,t)\,dW_t\Bigr].$$
  3. Compute $(dX_t)^2$:
    We have

    $$(dX_t)^2 = \Bigl[\mu(X_t,t)\,dt + \sigma(X_t,t)\,dW_t\Bigr]^2.$$

    Expanding:

    $$(dX_t)^2 = \mu^2(X_t,t)(dt)^2 + 2\mu(X_t,t)\sigma(X_t,t)\,dt\,dW_t + \sigma^2(X_t,t)(dW_t)^2.$$

    Using the rules:

    $$(dt)^2 = 0,\quad dt\,dW_t = 0,\quad (dW_t)^2 = dt,$$

    we get:

    $$(dX_t)^2 = \sigma^2(X_t,t)\,dt.$$
  4. Combine the Terms:
    Substitute back into the Taylor expansion:

    $$\begin{aligned} df &= f_t(X_t,t)\,dt + f_x(X_t,t) \Bigl[\mu(X_t,t)\,dt + \sigma(X_t,t)\,dW_t\Bigr] + \frac{1}{2} f_{xx}(X_t,t)\,\sigma^2(X_t,t)\,dt\\[1mm] &= \Bigl[f_t(X_t,t) + \mu(X_t,t)f_x(X_t,t) + \frac{1}{2}\sigma^2(X_t,t)f_{xx}(X_t,t)\Bigr]dt + \sigma(X_t,t)f_x(X_t,t)\,dW_t. \end{aligned}$$

This is Itô’s lemma:

$$\boxed{df(X_t,t)= \left[f_t(X_t,t) + \mu(X_t,t)f_x(X_t,t) + \frac{1}{2}\sigma^2(X_t,t)f_{xx}(X_t,t)\right]dt + \sigma(X_t,t)f_x(X_t,t)\,dW_t.}$$
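As a worked example (geometric Brownian motion is an illustrative choice, not taken from the text above), let $dX_t = a X_t\,dt + b X_t\,dW_t$ with constants $a$ and $b$, so that the drift is $a X_t$ and the diffusion coefficient is $b X_t$, and take $f(x,t) = \ln x$. Then $f_t = 0$, $f_x = 1/x$, $f_{xx} = -1/x^2$, and Itô’s lemma gives:

```latex
d(\ln X_t)
= \left[ a X_t \cdot \frac{1}{X_t}
       + \frac{1}{2}\, b^2 X_t^2 \cdot \left(-\frac{1}{X_t^2}\right) \right] dt
  + b X_t \cdot \frac{1}{X_t}\, dW_t
= \left( a - \tfrac{1}{2} b^2 \right) dt + b\, dW_t .
```

The extra $-\tfrac{1}{2}b^2$ term is the Itô correction. The ordinary chain rule would give only $a\,dt + b\,dW_t$.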

4. Constructing an Itô Process

4.1. The Itô Integral

Suppose you have a function $\sigma(t)$ (possibly random, but non-anticipative) and wish to integrate it with respect to Brownian motion $W_t$. The Itô integral is defined by:

  1. Partitioning the Time Interval:
    Divide the interval $[0,t]$ into small subintervals:

    $$0 = t_0 < t_1 < \cdots < t_n = t.$$
  2. Forming the Riemann Sum:
    Let $\Delta W_i = W_{t_{i+1}} - W_{t_i}$. Then approximate the integral as:

    $$S_n = \sum_{i=0}^{n-1} \sigma(t_i)\,\Delta W_i.$$

    Evaluating $\sigma$ at the left endpoint $t_i$ means each summand uses only information available before the increment $\Delta W_i$ is realized, which is what makes the integral non-anticipative.

  3. Taking the Limit:
    As the partition gets finer, the sum converges (in the mean-square sense) to the Itô integral:

    $$\int_0^t \sigma(s)\,dW_s = \lim_{\max(t_{i+1}-t_i) \to 0} \sum_{i=0}^{n-1} \sigma(t_i) \Bigl(W_{t_{i+1}} - W_{t_i}\Bigr).$$
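The choice of the left endpoint matters, unlike in an ordinary Riemann integral. The sketch below takes the integrand to be the Brownian path itself, $\sigma(s) = W_s$ (an illustrative choice), and compares the left-endpoint sum with the right-endpoint sum on one simulated path. It relies on the standard closed form $\int_0^T W_s\,dW_s = \tfrac{1}{2}(W_T^2 - T)$; the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
t_end, n = 1.0, 200_000
dt = t_end / n

# Brownian path on a uniform partition: W[0] = 0, increments ~ N(0, dt).
dW = np.sqrt(dt) * rng.standard_normal(n)
W = np.concatenate(([0.0], np.cumsum(dW)))

# Integrand sigma(s) = W_s, evaluated at the left vs. the right endpoint.
ito_sum = np.sum(W[:-1] * dW)   # left endpoint: the Ito choice
right_sum = np.sum(W[1:] * dW)  # right endpoint: peeks at the coming increment

ito_exact = 0.5 * (W[-1] ** 2 - t_end)
print(ito_sum, ito_exact)       # these should nearly agree
print(right_sum - ito_sum)      # sum of squared increments, close to t_end
```

The two sums differ by the sum of squared increments, which does not vanish as the partition is refined; it converges to the elapsed time instead.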

4.2. Defining the Itô Process

An Itô process combines a drift part and a diffusion part:

$$X_t = X_0 + \int_0^t \mu(s)\,ds + \int_0^t \sigma(s)\,dW_s.$$

  • The drift term $\int_0^t \mu(s)\,ds$ is a standard Lebesgue integral.
  • The diffusion term $\int_0^t \sigma(s)\,dW_s$ is the Itô integral.

4.3. Some Key Properties

  • Continuity:
    The process $X_t$ has continuous sample paths (under suitable conditions on $\mu$ and $\sigma$).
  • Quadratic Variation:
    The quadratic variation is contributed solely by the diffusion part: $\langle X \rangle_t = \int_0^t \sigma^2(s)\,ds.$
  • Martingale Component:
    Removing the drift, the diffusion part $\int_0^t \sigma(s)\,dW_s$ is a martingale (a local martingale in general, and a true martingale when $\mathbb{E}\int_0^t \sigma^2(s)\,ds < \infty$).
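The quadratic variation property can be checked on a simulated path. The sketch below assumes a constant drift and a deterministic, time-varying $\sigma(t)$ (both made up for illustration) and compares the sum of squared increments with a Riemann sum for $\int_0^t \sigma^2(s)\,ds$.

```python
import numpy as np

rng = np.random.default_rng(0)
t_end, n = 1.0, 200_000
dt = t_end / n
t = np.arange(n) * dt                      # left endpoints t_i

mu = 3.0                                   # hypothetical constant drift
sigma = 1.0 + 0.5 * np.sin(2 * np.pi * t)  # hypothetical deterministic sigma(t)

dW = np.sqrt(dt) * rng.standard_normal(n)
dX = mu * dt + sigma * dW                  # increments of the Ito process

realized_qv = np.sum(dX ** 2)              # sum of squared increments
expected_qv = np.sum(sigma ** 2) * dt      # Riemann sum for the integral of sigma^2
print(realized_qv, expected_qv)
```

Even with a sizeable drift, the two numbers nearly coincide: the drift contributes terms of order `dt` to the sum of squares, which vanish as the partition is refined.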

5. Diffusion Models: Putting It All Together

Diffusion models use these concepts to describe how data is gradually corrupted by noise and then recovered.

  1. Forward Process (Noising):
    Starting with a data sample $x_0$, noise is gradually added by evolving $x_0$ using an Itô process. The evolution of the probability density $p(x,t)$ is governed by the Fokker–Planck equation (obtained by truncating the Kramers–Moyal expansion at second order). For the SDE $dX_t = \mu\,dt + \sigma\,dW_t$, the coefficients are $D^{(1)} = \mu$ and $D^{(2)} = \sigma^2/2$.

  2. Reverse Process (Denoising):
    To generate or recover data, the process is reversed. The reverse-time stochastic differential equation, derived using time-reversal techniques and Itô’s lemma, employs the gradient of the log-density, $\nabla_x \log p(x,t)$ (known as the score function), to guide a noisy sample back to the data distribution.
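The two steps can be sketched end to end on a toy problem. Everything specific below is an assumption made for illustration: the data are one-dimensional Gaussian, the forward process is the Ornstein–Uhlenbeck SDE $dX = -\tfrac{1}{2}X\,dt + dW$, and the reverse-time SDE is taken in its standard form $dX = [f - g^2\,\nabla_x \log p]\,dt + g\,d\bar{W}$, run backward in time. Because the data are Gaussian, the score is available in closed form; in a real diffusion model a trained neural network would stand in for the `score` function.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setup: 1-D Gaussian "data" and a forward OU process
#   dX = -0.5 * X dt + dW   (drift f = -0.5 x, diffusion g = 1).
data_mean, data_std = 2.0, 0.5
t_end, n_steps, n_samples = 10.0, 1000, 20_000
dt = t_end / n_steps

def marginal(t):
    """Mean and variance of p(x, t) for Gaussian data under the forward OU process."""
    decay = np.exp(-t)
    mean = data_mean * np.sqrt(decay)
    var = data_std ** 2 * decay + (1.0 - decay)
    return mean, var

def score(x, t):
    """Exact score d/dx log p(x, t); a trained network would replace this."""
    mean, var = marginal(t)
    return -(x - mean) / var

# Reverse process: start from N(0, 1), which approximates p(x, t_end),
# and step backward in time with Euler-Maruyama.
x = rng.standard_normal(n_samples)
for k in range(n_steps, 0, -1):
    t = k * dt
    f = -0.5 * x
    reverse_drift = f - score(x, t)  # f - g^2 * score, with g = 1
    z = rng.standard_normal(n_samples)
    x = x - reverse_drift * dt + np.sqrt(dt) * z

print(x.mean(), x.std())  # should land near data_mean and data_std
```

Starting from pure noise, the samples end up distributed approximately like the original data, up to discretization and sampling error.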