
Mathematical Formulation of CausalImpact Analysis Using Structural Time Series and Gibbs Sampling

December 4, 2024


1. Overview

CausalImpact Analysis is a statistical method designed to estimate the causal effect of an intervention by comparing observed data against a counterfactual scenario—what would have happened in the absence of the intervention. This analysis leverages a Structural Time Series (STS) model to capture the underlying data-generating processes and employs Gibbs Sampling, a Bayesian inference technique, to derive posterior distributions of the model parameters.
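Before turning to the mathematics, here is a minimal, hedged usage sketch assuming the Python port `tfcausalimpact` (whose `CausalImpact(data, pre_period, post_period)` call mirrors the original R package); the series, column names, and index ranges below are made up purely for illustration:

```python
# Minimal usage sketch (assumes `tfcausalimpact` is installed: pip install tfcausalimpact).
# Data layout: first column is the response y, remaining columns are control series X.
import numpy as np
import pandas as pd
from causalimpact import CausalImpact

rng = np.random.default_rng(0)
x = 100 + np.cumsum(rng.normal(size=200))        # hypothetical control series
y = 1.2 * x + rng.normal(scale=2.0, size=200)    # response that tracks the control
y[150:] += 10                                    # simulated intervention effect after t = 150

data = pd.DataFrame({"y": y, "x": x})
pre_period = [0, 149]     # pre-intervention window used to fit the STS model
post_period = [150, 199]  # post-intervention window compared against the counterfactual

ci = CausalImpact(data, pre_period, post_period)
print(ci.summary())       # point and cumulative effect estimates with credible intervals
```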


2. Structural Time Series (STS) Model

The Structural Time Series (STS) model offers a robust framework for modeling time-series data by decomposing it into various components such as trend, seasonality, and regression effects.

2.1. State-Space Representation

The STS model is formulated within a state-space framework, comprising two primary equations: the State Transition Equation and the Observation Equation.

a. State Vector ($\mathbf{x}_t$)

The state vector encapsulates all latent (unobserved) components influencing the observed data at time $t$:

$$\mathbf{x}_t = \begin{bmatrix} \ell_t \\ s_{1,t} \\ s_{2,t} \\ \vdots \\ s_{K,t} \\ \boldsymbol{\beta}_t \end{bmatrix}$$

Components:

- $\ell_t$: the local level (trend) component
- $s_{k,t}$, $k = 1, \dots, K$: the seasonal components
- $\boldsymbol{\beta}_t$: the regression coefficients for the covariates $\mathbf{X}_t$

b. State Transition Equation

The evolution of the state vector over time is governed by:

$$\mathbf{x}_t = \mathbf{G}\,\mathbf{x}_{t-1} + \mathbf{w}_t$$

Where:

- $\mathbf{G}$ is the state transition matrix,
- $\mathbf{w}_t$ is the state noise, with

$$\mathbf{w}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{W})$$

Detailed Structure:

Assuming independent evolution of each component, for example

$$\ell_t = \ell_{t-1} + \eta_{\ell,t}, \quad \eta_{\ell,t} \sim \mathcal{N}(0, \sigma_\ell^2), \qquad \boldsymbol{\beta}_t = \boldsymbol{\beta}_{t-1} + \boldsymbol{\eta}_{\beta,t}, \quad \boldsymbol{\eta}_{\beta,t} \sim \mathcal{N}(\mathbf{0}, \mathbf{\Sigma}_\beta),$$

with the seasonal components following their own random walks (see Section 9.2).

Thus, the transition matrices are defined as:

$$\mathbf{G} = \begin{bmatrix} 1 & 0 & \dots & 0 & \mathbf{0}^\top \\ 0 & 1 & \dots & 0 & \mathbf{0}^\top \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \dots & 1 & \mathbf{0}^\top \\ \mathbf{0} & \mathbf{0} & \dots & \mathbf{0} & \mathbf{I} \end{bmatrix}, \quad \mathbf{W} = \begin{bmatrix} \sigma_\ell^2 & \mathbf{0} & \dots & \mathbf{0} \\ \mathbf{0} & \sigma_{s_1}^2 \mathbf{I} & \dots & \mathbf{0} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \dots & \mathbf{\Sigma}_\beta \end{bmatrix}$$

Parameters:

- $\sigma_\ell^2$: variance of the local level innovations
- $\sigma_{s_k}^2$: variance of the $k$-th seasonal component's innovations
- $\mathbf{\Sigma}_\beta$: covariance of the regression-coefficient innovations
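To make the block structure concrete, here is a small NumPy sketch that assembles $\mathbf{G}$ and $\mathbf{W}$ as block-diagonal matrices; the dimensions and variance values are arbitrary placeholders for illustration only:

```python
import numpy as np
from scipy.linalg import block_diag

K, J = 4, 3                    # illustrative sizes: K seasonal states, J covariates
sigma_l2 = 0.1                 # local level innovation variance
sigma_s2 = 0.05                # seasonal innovation variance (one per component in general)
Sigma_beta = 0.01 * np.eye(J)  # covariance of regression-coefficient innovations

# Transition matrix G: identity blocks for the level, seasonal states, and coefficients
G = block_diag(np.eye(1), np.eye(K), np.eye(J))

# State noise covariance W with the same block structure
W = block_diag(sigma_l2 * np.eye(1), sigma_s2 * np.eye(K), Sigma_beta)

print(G.shape, W.shape)        # both (1 + K + J, 1 + K + J)
```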

2.2. Observation Equation

The Observation Equation links the latent state vector to the observed data:

$$y_t = \mathbf{F}^\top \mathbf{x}_t + \varepsilon_t, \quad \varepsilon_t \sim \mathcal{N}(0, \sigma_\varepsilon^2)$$

Where:

$$\mathbf{F} = \begin{bmatrix} 1 & \mathbf{0}_{K}^\top & \mathbf{X}_t^\top \end{bmatrix}^\top$$

with $\mathbf{X}_t$ the vector of covariates (control series) observed at time $t$, and $\varepsilon_t$ the observation noise.
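The following sketch simulates data from this state-space model to show how the transition and observation equations fit together. The dimensions, variances, and fixed coefficient vector are illustrative assumptions, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(1)
T, K, J = 120, 4, 2                     # illustrative sizes
sigma_l, sigma_s, sigma_eps = 0.3, 0.05, 0.5

X = rng.normal(size=(T, J))             # hypothetical covariate matrix (T x J)
beta = np.array([1.5, -0.7])            # regression coefficients, held fixed in this sketch

level = np.zeros(T)
season = np.zeros((T, K))
y = np.zeros(T)

for t in range(T):
    # State transition: each component follows its own random walk (identity G)
    level[t] = (level[t - 1] if t > 0 else 10.0) + rng.normal(scale=sigma_l)
    season[t] = (season[t - 1] if t > 0 else np.zeros(K)) + rng.normal(scale=sigma_s, size=K)

    # Observation vector F_t = [1, 0_K, X_t] exactly as defined above
    # (note the seasonal block of F is zero in that definition)
    x_t = np.concatenate(([level[t]], season[t], beta))
    F_t = np.concatenate(([1.0], np.zeros(K), X[t]))

    # Observation equation: y_t = F_t' x_t + eps_t
    y[t] = F_t @ x_t + rng.normal(scale=sigma_eps)
```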

3. Priors and Hyperparameters

In Bayesian analysis, priors represent initial beliefs about the model parameters before observing the data. Proper specification of priors is essential as they influence the posterior distributions.

3.1. Local Level Variance Prior

$$\sigma_\ell^2 \sim \text{Inverse-Gamma}(\alpha_\ell, \beta_\ell)$$

3.2. Observation Noise Variance Prior

$$\sigma_\varepsilon^2 \sim \text{Inverse-Gamma}(\alpha_\varepsilon, \beta_\varepsilon)$$

3.3. Regression Weights Prior

Assuming a multivariate normal prior for the regression coefficients $\boldsymbol{\beta}$:

$$\boldsymbol{\beta} \sim \mathcal{N}(\mathbf{0}, \mathbf{\Lambda}^{-1}), \quad \mathbf{\Lambda} = 0.01 \times \frac{0.5 \times (\mathbf{X}^\top \mathbf{X})}{N}$$

3.4. Initial State Priors

$$\ell_0 \sim \mathcal{N}(y_0, \sigma_y^2), \quad s_{k,0} \sim \mathcal{N}(0, \sigma_{s_k}^2), \quad \boldsymbol{\beta}_0 \sim \mathcal{N}(\mathbf{0}, \mathbf{\Sigma}_\beta)$$
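As a sketch of how these priors might be assembled in code, the helper below collects the hyperparameters of Sections 3.1–3.4; the Inverse-Gamma values and the use of the sample variance of $y$ for $\sigma_y^2$ are placeholder assumptions:

```python
import numpy as np

def prior_hyperparameters(y, X, alpha=0.01, beta=0.01):
    """Illustrative prior setup following Sections 3.1-3.4 (alpha/beta are placeholders)."""
    N = X.shape[0]

    # Inverse-Gamma hyperparameters for the level and observation-noise variances
    level_var_prior = (alpha, beta)    # (alpha_l, beta_l)
    obs_var_prior = (alpha, beta)      # (alpha_eps, beta_eps)

    # Precision matrix for the regression weights: Lambda = 0.01 * 0.5 * X'X / N
    Lambda = 0.01 * (0.5 * (X.T @ X)) / N

    # Initial level prior: l_0 ~ N(y_0, sigma_y^2); sample variance stands in for sigma_y^2
    initial_level = (y[0], np.var(y))

    return level_var_prior, obs_var_prior, Lambda, initial_level
```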

4. Bayesian Inference via Gibbs Sampling

Gibbs Sampling is a Markov Chain Monte Carlo (MCMC) method used to sample from the joint posterior distribution of model parameters and latent states.

4.1. Posterior Distribution

The objective is to sample from the joint posterior distribution:

$$p(\mathbf{x}_{1:T}, \boldsymbol{\theta} \mid \mathbf{y}_{1:T})$$

Where:

- $\mathbf{x}_{1:T}$ denotes the latent states across all time points,
- $\boldsymbol{\theta} = \{\sigma_\ell^2, \sigma_\varepsilon^2, \sigma_{s_k}^2, \boldsymbol{\beta}, \dots\}$ collects the model parameters,
- $\mathbf{y}_{1:T}$ is the observed series.

Using Bayes’ theorem:

$$p(\mathbf{x}_{1:T}, \boldsymbol{\theta} \mid \mathbf{y}) \propto p(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\theta}) \cdot p(\mathbf{x}_{1:T} \mid \boldsymbol{\theta}) \cdot p(\boldsymbol{\theta})$$

4.2. Gibbs Sampling Steps

Gibbs Sampling iteratively samples each parameter conditioned on the current values of all other parameters.

Step 1: Sample Local Level Variance ($\sigma_\ell^2$)

$$\sigma_\ell^2 \mid \mathbf{x}_{1:T}, \mathbf{y}_{1:T} \sim \text{Inverse-Gamma}\left(\alpha_\ell^*, \beta_\ell^*\right)$$

Where:

$$\alpha_\ell^* = \alpha_\ell + \frac{T}{2}, \quad \beta_\ell^* = \beta_\ell + \frac{1}{2} \sum_{t=1}^T (\ell_t - \ell_{t-1})^2$$
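A minimal sketch of this update using SciPy's Inverse-Gamma distribution (the hyperparameter values are placeholders); Step 2 follows the same pattern with the observation residuals $y_t - \mathbf{F}^\top \mathbf{x}_t$ in place of the level increments:

```python
import numpy as np
from scipy.stats import invgamma

def sample_level_variance(level, alpha_l=0.01, beta_l=0.01, rng=None):
    """Draw sigma_l^2 from its Inverse-Gamma conditional posterior (Step 1).

    `level` is the currently sampled level path (including the initial state, so that
    the number of increments plays the role of T in the formulas above).
    """
    rng = rng or np.random.default_rng()
    diffs = np.diff(level)                        # l_t - l_{t-1}
    alpha_star = alpha_l + len(diffs) / 2.0
    beta_star = beta_l + 0.5 * np.sum(diffs ** 2)
    # scipy's invgamma uses shape `a` and `scale`; `scale` corresponds to beta*
    return invgamma.rvs(a=alpha_star, scale=beta_star, random_state=rng)
```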

Step 2: Sample Observation Noise Variance ($\sigma_\varepsilon^2$)

$$\sigma_\varepsilon^2 \mid \mathbf{x}_{1:T}, \mathbf{y}_{1:T} \sim \text{Inverse-Gamma}\left(\alpha_\varepsilon^*, \beta_\varepsilon^*\right)$$

Where:

$$\alpha_\varepsilon^* = \alpha_\varepsilon + \frac{T}{2}, \quad \beta_\varepsilon^* = \beta_\varepsilon + \frac{1}{2} \sum_{t=1}^T (y_t - \mathbf{F}^\top \mathbf{x}_t)^2$$

Step 3: Sample Regression Weights ($\boldsymbol{\beta}$)

Assuming time-invariant regression coefficients:

$$\boldsymbol{\beta} \mid \mathbf{x}_{1:T}, \mathbf{y}_{1:T}, \sigma_\varepsilon^2 \sim \mathcal{N}\left(\mathbf{m}, \mathbf{V}\right)$$

Where:

$$\mathbf{V} = \left(\mathbf{\Lambda} + \frac{\mathbf{X}^\top \mathbf{X}}{\sigma_\varepsilon^2}\right)^{-1}, \quad \mathbf{m} = \mathbf{V} \left(\frac{\mathbf{X}^\top \mathbf{y}}{\sigma_\varepsilon^2}\right)$$
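A sketch of this conditional draw with NumPy; in a full sampler `y` would typically be the observations with the other state components subtracted out, but here it is used directly as in the formulas above:

```python
import numpy as np

def sample_regression_weights(X, y, Lambda, sigma_eps2, rng=None):
    """Draw beta from its conditional Gaussian posterior (Step 3).

    Implements V = (Lambda + X'X / sigma_eps^2)^{-1} and m = V X'y / sigma_eps^2.
    """
    rng = rng or np.random.default_rng()
    V = np.linalg.inv(Lambda + (X.T @ X) / sigma_eps2)
    m = V @ (X.T @ y) / sigma_eps2
    return rng.multivariate_normal(m, V)
```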

Step 4: Sample State Vectors ($\mathbf{x}_{1:T}$)

Use a forward-filtering, backward-sampling (FFBS) algorithm, or a similar method, to draw the latent state path $\mathbf{x}_{1:T}$ given the current parameter values and the observed data.
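The full FFBS recursion for the complete state vector is lengthy; the sketch below restricts it to the local level component alone (random-walk transition, $y_t = \ell_t + \varepsilon_t$), which is the simplest case and illustrates the forward Kalman filter followed by backward sampling. The initial moments `m0`, `P0` are diffuse placeholders:

```python
import numpy as np

def ffbs_local_level(y, sigma_l2, sigma_eps2, m0=0.0, P0=1e6, rng=None):
    """Forward-filter, backward-sample the local level path (Step 4, level only)."""
    rng = rng or np.random.default_rng()
    T = len(y)
    m = np.zeros(T)   # filtered means
    P = np.zeros(T)   # filtered variances

    # Forward Kalman filter
    m_prev, P_prev = m0, P0
    for t in range(T):
        m_pred, P_pred = m_prev, P_prev + sigma_l2     # predict (random walk)
        K = P_pred / (P_pred + sigma_eps2)             # Kalman gain
        m[t] = m_pred + K * (y[t] - m_pred)            # update mean
        P[t] = (1.0 - K) * P_pred                      # update variance
        m_prev, P_prev = m[t], P[t]

    # Backward sampling
    level = np.zeros(T)
    level[-1] = rng.normal(m[-1], np.sqrt(P[-1]))
    for t in range(T - 2, -1, -1):
        J = P[t] / (P[t] + sigma_l2)                   # smoothing gain
        mean = m[t] + J * (level[t + 1] - m[t])
        var = (1.0 - J) * P[t]
        level[t] = rng.normal(mean, np.sqrt(var))
    return level
```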

4.3. Multiple MCMC Chains

To ensure convergence and robustness:

- Run several independent chains from dispersed starting values.
- Discard an initial burn-in portion of each chain (and optionally thin the remaining draws).
- Check that the chains agree, e.g., via trace plots or the Gelman–Rubin statistic $\hat{R}$; a small sketch of this diagnostic follows below.
- Pool the retained draws across chains for posterior inference.
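A minimal sketch of the Gelman–Rubin $\hat{R}$ diagnostic for a single scalar parameter, assuming the post-burn-in draws are stored as a 2-D array (it is not a replacement for a full MCMC diagnostics library):

```python
import numpy as np

def gelman_rubin(chains):
    """Gelman-Rubin R-hat for one scalar parameter; `chains` has shape (n_chains, n_draws)."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)          # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # within-chain variance
    var_hat = (n - 1) / n * W + B / n        # pooled variance estimate
    return np.sqrt(var_hat / W)              # values near 1 suggest the chains have mixed
```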


5. Posterior Predictive Inference

With posterior samples, derive predictions for the counterfactual scenario ($\hat{y}_t$) and assess the impact of the intervention.

5.1. Posterior Means

For each time $t$, the posterior mean prediction is:

$$\hat{y}_t = \mathbb{E}[y_t \mid \mathbf{y}_{1:T}]$$

For each posterior draw of the state path, the corresponding prediction is

$$\hat{y}_t = \mathbf{F}^\top \mathbf{x}_t$$

and the posterior mean is obtained by averaging these predictions over draws.

5.2. Credible Intervals

Compute the $(1-\alpha)$ credible intervals (e.g., 95% with $\alpha = 0.05$) for $\hat{y}_t$:

$$\hat{y}_t^{(q)} = \text{Quantile}\left(\hat{y}_t, q\right), \quad q \in \left\{ \frac{\alpha}{2},\ 1 - \frac{\alpha}{2} \right\}$$
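A short sketch of these summaries, assuming the posterior-predictive draws of the counterfactual are stored in an array `pred_draws` of shape `(n_draws, T)` (one row per retained MCMC sample):

```python
import numpy as np

def summarize_predictions(pred_draws, alpha=0.05):
    """Posterior mean and (1 - alpha) credible interval per time point."""
    mean = pred_draws.mean(axis=0)
    lower = np.quantile(pred_draws, alpha / 2, axis=0)
    upper = np.quantile(pred_draws, 1 - alpha / 2, axis=0)
    return mean, lower, upper
```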

6. Causal Effect Estimation

Assess the intervention’s impact by comparing observed data with model predictions.

6.1. Point Effects

The immediate difference at time $t$:

$$\text{Point Effect}_t = y_t - \hat{y}_t$$

6.2. Cumulative Effects

Total impact from intervention start $T_{\text{start}}$ to time $t$:

$$\text{Cumulative Effect}_t = \sum_{\tau=T_{\text{start}}}^{t} (y_\tau - \hat{y}_\tau)$$
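A sketch of both effect types, computed per posterior draw so that credible intervals for the cumulative effect are also available; `pred_draws` is the same hypothetical `(n_draws, T)` array as above:

```python
import numpy as np

def causal_effects(y_observed, pred_draws, t_start):
    """Point and cumulative effects over the post-intervention period."""
    point = y_observed[None, :] - pred_draws            # per-draw point effects, shape (n_draws, T)
    cumulative = np.cumsum(point[:, t_start:], axis=1)  # summed from T_start onward
    return point, cumulative

# Example summaries of the final cumulative effect (posterior mean and 95% interval):
# cum_mean = cumulative[:, -1].mean()
# cum_ci = np.quantile(cumulative[:, -1], [0.025, 0.975])
```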

6.3. Summary Statistics

Over the post-intervention period $T_{\text{post}}$, the analysis typically reports:

- the average effect (the mean of the point effects),
- the cumulative effect (the sum of the point effects), and
- the posterior probability of a causal effect, i.e., the fraction of posterior draws in which the cumulative effect has the same sign as its posterior mean.


7. Matrix Operations and Linear Algebra

Efficient computation and representation of the STS model rely heavily on matrix operations.

7.1. State Transition Matrix ($\mathbf{G}$)

$$\mathbf{G} = \begin{bmatrix} 1 & 0 & \dots & 0 & \mathbf{0}^\top \\ 0 & 1 & \dots & 0 & \mathbf{0}^\top \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \dots & 1 & \mathbf{0}^\top \\ \mathbf{0} & \mathbf{0} & \dots & \mathbf{0} & \mathbf{I} \end{bmatrix}$$

7.2. Observation Matrix ($\mathbf{F}$)

$$\mathbf{F}^\top = \begin{bmatrix} 1 & \mathbf{0}_{K}^\top & \mathbf{X}_t^\top \end{bmatrix}$$

7.3. Covariance Matrix ($\mathbf{W}$)

$$\mathbf{W} = \begin{bmatrix} \sigma_\ell^2 & \mathbf{0} & \dots & \mathbf{0} \\ \mathbf{0} & \sigma_{s_1}^2 \mathbf{I} & \dots & \mathbf{0} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \dots & \mathbf{\Sigma}_\beta \end{bmatrix}$$

7.4. Precision Matrix for Regression Weights ($\mathbf{\Lambda}$)

$$\mathbf{\Lambda} = 0.01 \times \frac{0.5 \times (\mathbf{X}^\top \mathbf{X})}{N}$$

7.5. Likelihood Function

For the entire dataset, the likelihood is:

$$p(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\theta}) = \prod_{t=1}^T \mathcal{N}(y_t \mid \mathbf{F}^\top \mathbf{x}_t, \sigma_\varepsilon^2)$$
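As a small sketch, the corresponding log-likelihood can be evaluated directly, assuming the observation vectors and sampled states are stacked row by row into `(T, d)` arrays (names are illustrative):

```python
import numpy as np
from scipy.stats import norm

def log_likelihood(y, F, states, sigma_eps2):
    """Gaussian log-likelihood of Section 7.5: sum_t log N(y_t | F_t' x_t, sigma_eps^2)."""
    mu = np.einsum("td,td->t", F, states)   # F_t' x_t for each time point
    return norm.logpdf(y, loc=mu, scale=np.sqrt(sigma_eps2)).sum()
```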

7.6. Posterior Distribution

Using Bayes’ theorem:

$$p(\mathbf{x}_{1:T}, \boldsymbol{\theta} \mid \mathbf{y}) \propto p(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\theta}) \cdot p(\mathbf{x}_{1:T} \mid \boldsymbol{\theta}) \cdot p(\boldsymbol{\theta})$$

Where:

- $p(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\theta})$ is the observation likelihood (Section 7.5),
- $p(\mathbf{x}_{1:T} \mid \boldsymbol{\theta})$ is the state prior implied by the transition equation,
- $p(\boldsymbol{\theta})$ collects the parameter priors from Section 3.


8. Data Standardization and Scaling

Proper data preprocessing ensures that the model accurately captures patterns without being skewed by varying scales.

8.1. Standardizing Data

$$y_t' = \frac{y_t - \mu_y}{\sigma_y}$$
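A tiny sketch of this transformation, returning the inverse map so predictions can be expressed back on the original scale (helper names are illustrative):

```python
import numpy as np

def standardize(y):
    """Standardize the series and return a function to map predictions back."""
    mu, sigma = y.mean(), y.std()
    return (y - mu) / sigma, lambda z: z * sigma + mu

# y_std, unscale = standardize(y)   # fit the model on y_std, then apply unscale() to predictions
```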

8.2. Scaling Priors

Prior hyperparameters are expressed on the scale of the (standardized) data: for example, the initial-level prior uses $\sigma_y$ (Section 3.4) and the regression precision $\mathbf{\Lambda}$ is scaled by $\mathbf{X}^\top \mathbf{X} / N$ (Section 3.3), so the priors remain sensible regardless of the original units.


9. Seasonal Effects Handling

Seasonality is a common feature in time-series data, representing periodic fluctuations.

9.1. Seasonal Components ($s_{k,t}$)

Each seasonal component $s_{k,t}$ captures a periodic pattern that repeats with period $m$ (e.g., day-of-week effects with $m = 7$).

9.2. Seasonal Drift ($\sigma_{s_k}$)

Allows seasonal trends to gradually change over time:

$$s_{k,t} = s_{k,t-m} + \epsilon_{k,t}, \quad \epsilon_{k,t} \sim \mathcal{N}(0, \sigma_{s_k}^2)$$
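A short simulation sketch of such a drifting seasonal pattern; the period, horizon, and drift standard deviation are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
T, m = 8 * 7, 7                     # illustrative: 8 weeks of daily data, weekly period
sigma_s = 0.1                       # seasonal drift standard deviation

season = np.zeros(T)
season[:m] = rng.normal(size=m)     # initial seasonal pattern
season[:m] -= season[:m].mean()     # roughly center the initial pattern

for t in range(m, T):
    # Each point drifts from its value one full period earlier
    season[t] = season[t - m] + rng.normal(scale=sigma_s)
```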

10. Summary

This CausalImpact Analysis formulation encompasses a comprehensive Bayesian Structural Time Series (STS) model with the following core components:

  1. State-Space Model: Defines the dynamics of latent states, including local level, seasonal components, and regression coefficients, influencing the observed data.
  2. Priors: Utilizes Inverse-Gamma priors for variances and Normal priors for regression weights and initial states, integrating domain knowledge and ensuring regularization.
  3. Gibbs Sampling: Employs Gibbs Sampling to iteratively sample from conditional posterior distributions of model parameters and latent states, ensuring convergence to the joint posterior.
  4. Posterior Predictive Inference: Derives posterior mean predictions and credible intervals, providing probabilistic estimates of counterfactual scenarios.
  5. Causal Effect Estimation: Quantifies the impact of interventions through point and cumulative effects by juxtaposing observed data against model predictions.
  6. Matrix Operations: Leverages matrix algebra for efficient representation and computation of state transitions, observations, and parameter updates.

By articulating the entire process in precise mathematical terms, this formulation not only enhances interpretability but also lays the groundwork for extensions or modifications of the model as the needs of data analysis and causal inference evolve.


11. Appendix

Derivation of Sampling the Local Level Variance ($\sigma_\ell^2$)

This section provides a detailed mathematical derivation of the sampling step for the Local Level Variance ($\sigma_\ell^2$) within the Gibbs Sampling procedure.

1. Model Setup

1.1. State Transition Equation

The evolution of the Local Level component is given by:

$$\ell_t = \ell_{t-1} + \eta_t, \quad \eta_t \sim \mathcal{N}(0, \sigma_\ell^2)$$

1.2. Prior for $\sigma_\ell^2$

An Inverse-Gamma prior is assumed:

$$\sigma_\ell^2 \sim \text{Inverse-Gamma}(\alpha_\ell, \beta_\ell)$$

2. Likelihood Function

Given the state transition, the likelihood of the level path $\{\ell_t\}_{t=1}^T$ is:

$$p(\{\ell_t\}_{t=1}^T \mid \{\ell_{t-1}\}_{t=1}^T, \sigma_\ell^2) = \prod_{t=1}^T \mathcal{N}(\ell_t \mid \ell_{t-1}, \sigma_\ell^2)$$

Expanding the Normal density:

$$p(\{\ell_t\}_{t=1}^T \mid \sigma_\ell^2) = (2\pi \sigma_\ell^2)^{-T/2} \exp\left( -\frac{1}{2\sigma_\ell^2} \sum_{t=1}^T (\ell_t - \ell_{t-1})^2 \right)$$

3. Posterior Distribution

Applying Bayes’ theorem:

$$p(\sigma_\ell^2 \mid \text{data}) \propto p(\text{data} \mid \sigma_\ell^2) \cdot p(\sigma_\ell^2)$$

Substituting the likelihood and prior:

$$p(\sigma_\ell^2 \mid \text{data}) \propto (\sigma_\ell^2)^{-T/2} \exp\left( -\frac{1}{2\sigma_\ell^2} \sum_{t=1}^T (\ell_t - \ell_{t-1})^2 \right) \cdot (\sigma_\ell^2)^{-\alpha_\ell - 1} \exp\left( -\frac{\beta_\ell}{\sigma_\ell^2} \right)$$

Combining like terms:

$$p(\sigma_\ell^2 \mid \text{data}) \propto (\sigma_\ell^2)^{-\left(\alpha_\ell + \frac{T}{2} + 1\right)} \exp\left( -\frac{1}{\sigma_\ell^2} \left( \frac{1}{2} \sum_{t=1}^T (\ell_t - \ell_{t-1})^2 + \beta_\ell \right) \right)$$

4. Identifying the Posterior Distribution

Recognizing the form of the Inverse-Gamma distribution:

$$\text{Inverse-Gamma}(x \mid \alpha', \beta') = \frac{{\beta'}^{\alpha'}}{\Gamma(\alpha')}\, x^{-\alpha' - 1} \exp\left( -\frac{\beta'}{x} \right)$$

We identify the posterior parameters:

$$\alpha_\ell^* = \alpha_\ell + \frac{T}{2}, \quad \beta_\ell^* = \beta_\ell + \frac{1}{2} \sum_{t=1}^T (\ell_t - \ell_{t-1})^2$$

Thus, the conditional posterior is:

$$\sigma_\ell^2 \mid \text{data} \sim \text{Inverse-Gamma}(\alpha_\ell^*, \beta_\ell^*)$$

Summary of the Sampling Step:

$$\sigma_\ell^2 \mid \text{data} \sim \text{Inverse-Gamma}\left(\alpha_\ell + \frac{T}{2},\ \beta_\ell + \frac{1}{2} \sum_{t=1}^T (\ell_t - \ell_{t-1})^2 \right)$$

This conjugate relationship between the Normal likelihood and the Inverse-Gamma prior facilitates efficient Gibbs Sampling, enabling straightforward updates of $\sigma_\ell^2$ in each iteration.

