
Mathematical Formulation of CausalImpact Analysis Using Structural Time Series and Gibbs Sampling

December 4, 2024


1. Overview

CausalImpact Analysis is a statistical method designed to estimate the causal effect of an intervention by comparing observed data against a counterfactual scenario—what would have happened in the absence of the intervention. This analysis leverages a Structural Time Series (STS) model to capture the underlying data-generating processes and employs Gibbs Sampling, a Bayesian inference technique, to derive posterior distributions of the model parameters.
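Before turning to the mathematics, here is a minimal, hedged usage sketch assuming the Python port `tfcausalimpact` (whose `CausalImpact(data, pre_period, post_period)` call mirrors the original R package); the series, column names, and index ranges below are made up purely for illustration:

```python
# Minimal usage sketch (assumes `tfcausalimpact` is installed: pip install tfcausalimpact).
# Data layout: first column is the response y, remaining columns are control series X.
import numpy as np
import pandas as pd
from causalimpact import CausalImpact

rng = np.random.default_rng(0)
x = 100 + np.cumsum(rng.normal(size=200))        # hypothetical control series
y = 1.2 * x + rng.normal(scale=2.0, size=200)    # response that tracks the control
y[150:] += 10                                    # simulated intervention effect after t = 150

data = pd.DataFrame({"y": y, "x": x})
pre_period = [0, 149]     # pre-intervention window used to fit the STS model
post_period = [150, 199]  # post-intervention window compared against the counterfactual

ci = CausalImpact(data, pre_period, post_period)
print(ci.summary())       # point and cumulative effect estimates with credible intervals
```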


2. Structural Time Series (STS) Model

The Structural Time Series (STS) model offers a robust framework for modeling time-series data by decomposing it into various components such as trend, seasonality, and regression effects.

2.1. State-Space Representation

The STS model is formulated within a state-space framework, comprising two primary equations: the State Transition Equation and the Observation Equation.

a. State Vector ($\mathbf{x}_t$)

The state vector encapsulates all latent (unobserved) components influencing the observed data at time $t$:

$$\mathbf{x}_t = \begin{bmatrix} \ell_t \\ s_{1,t} \\ s_{2,t} \\ \vdots \\ s_{K,t} \\ \boldsymbol{\beta}_t \end{bmatrix}$$

Components:

- $\ell_t$: the local level (trend) component
- $s_{k,t}$, $k = 1, \dots, K$: the seasonal components
- $\boldsymbol{\beta}_t$: the regression coefficients for the covariates $\mathbf{X}_t$

b. State Transition Equation

The evolution of the state vector over time is governed by:

$$\mathbf{x}_t = \mathbf{G}\,\mathbf{x}_{t-1} + \mathbf{w}_t$$

Where:

- $\mathbf{G}$ is the state transition matrix,
- $\mathbf{w}_t$ is the state noise, with

$$\mathbf{w}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{W})$$

Detailed Structure:

Assuming independent evolution of each component, for example

$$\ell_t = \ell_{t-1} + \eta_{\ell,t}, \quad \eta_{\ell,t} \sim \mathcal{N}(0, \sigma_\ell^2), \qquad \boldsymbol{\beta}_t = \boldsymbol{\beta}_{t-1} + \boldsymbol{\eta}_{\beta,t}, \quad \boldsymbol{\eta}_{\beta,t} \sim \mathcal{N}(\mathbf{0}, \mathbf{\Sigma}_\beta),$$

with the seasonal components following their own random walks (see Section 9.2).

Thus, the transition matrices are defined as:

$$\mathbf{G} = \begin{bmatrix} 1 & 0 & \dots & 0 & \mathbf{0}^\top \\ 0 & 1 & \dots & 0 & \mathbf{0}^\top \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \dots & 1 & \mathbf{0}^\top \\ \mathbf{0} & \mathbf{0} & \dots & \mathbf{0} & \mathbf{I} \end{bmatrix}, \quad \mathbf{W} = \begin{bmatrix} \sigma_\ell^2 & \mathbf{0} & \dots & \mathbf{0} \\ \mathbf{0} & \sigma_{s_1}^2 \mathbf{I} & \dots & \mathbf{0} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \dots & \mathbf{\Sigma}_\beta \end{bmatrix}$$

Parameters:

- $\sigma_\ell^2$: variance of the local level innovations
- $\sigma_{s_k}^2$: variance of the $k$-th seasonal component's innovations
- $\mathbf{\Sigma}_\beta$: covariance of the regression-coefficient innovations
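To make the block structure concrete, here is a small NumPy sketch that assembles $\mathbf{G}$ and $\mathbf{W}$ as block-diagonal matrices; the dimensions and variance values are arbitrary placeholders for illustration only:

```python
import numpy as np
from scipy.linalg import block_diag

K, J = 4, 3                    # illustrative sizes: K seasonal states, J covariates
sigma_l2 = 0.1                 # local level innovation variance
sigma_s2 = 0.05                # seasonal innovation variance (one per component in general)
Sigma_beta = 0.01 * np.eye(J)  # covariance of regression-coefficient innovations

# Transition matrix G: identity blocks for the level, seasonal states, and coefficients
G = block_diag(np.eye(1), np.eye(K), np.eye(J))

# State noise covariance W with the same block structure
W = block_diag(sigma_l2 * np.eye(1), sigma_s2 * np.eye(K), Sigma_beta)

print(G.shape, W.shape)        # both (1 + K + J, 1 + K + J)
```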

2.2. Observation Equation

The Observation Equation links the latent state vector to the observed data:

$$y_t = \mathbf{F}^\top \mathbf{x}_t + \varepsilon_t, \quad \varepsilon_t \sim \mathcal{N}(0, \sigma_\varepsilon^2)$$

Where:

$$\mathbf{F} = \begin{bmatrix} 1 & \mathbf{0}_{K}^\top & \mathbf{X}_t^\top \end{bmatrix}^\top$$

with $\mathbf{X}_t$ the vector of covariates (control series) observed at time $t$, and $\varepsilon_t$ the observation noise.
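The following sketch simulates data from this state-space model to show how the transition and observation equations fit together. The dimensions, variances, and fixed coefficient vector are illustrative assumptions, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(1)
T, K, J = 120, 4, 2                     # illustrative sizes
sigma_l, sigma_s, sigma_eps = 0.3, 0.05, 0.5

X = rng.normal(size=(T, J))             # hypothetical covariate matrix (T x J)
beta = np.array([1.5, -0.7])            # regression coefficients, held fixed in this sketch

level = np.zeros(T)
season = np.zeros((T, K))
y = np.zeros(T)

for t in range(T):
    # State transition: each component follows its own random walk (identity G)
    level[t] = (level[t - 1] if t > 0 else 10.0) + rng.normal(scale=sigma_l)
    season[t] = (season[t - 1] if t > 0 else np.zeros(K)) + rng.normal(scale=sigma_s, size=K)

    # Observation vector F_t = [1, 0_K, X_t] exactly as defined above
    # (note the seasonal block of F is zero in that definition)
    x_t = np.concatenate(([level[t]], season[t], beta))
    F_t = np.concatenate(([1.0], np.zeros(K), X[t]))

    # Observation equation: y_t = F_t' x_t + eps_t
    y[t] = F_t @ x_t + rng.normal(scale=sigma_eps)
```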

3. Priors and Hyperparameters

In Bayesian analysis, priors represent initial beliefs about the model parameters before observing the data. Proper specification of priors is essential as they influence the posterior distributions.

3.1. Local Level Variance Prior

$$\sigma_\ell^2 \sim \text{Inverse-Gamma}(\alpha_\ell, \beta_\ell)$$

3.2. Observation Noise Variance Prior

$$\sigma_\varepsilon^2 \sim \text{Inverse-Gamma}(\alpha_\varepsilon, \beta_\varepsilon)$$

3.3. Regression Weights Prior

Assuming a multivariate normal prior for the regression coefficients $\boldsymbol{\beta}$:

$$\boldsymbol{\beta} \sim \mathcal{N}(\mathbf{0}, \mathbf{\Lambda}^{-1}), \quad \mathbf{\Lambda} = 0.01 \times \frac{0.5 \times (\mathbf{X}^\top \mathbf{X})}{N}$$

3.4. Initial State Priors

$$\ell_0 \sim \mathcal{N}(y_0, \sigma_y^2), \quad s_{k,0} \sim \mathcal{N}(0, \sigma_{s_k}^2), \quad \boldsymbol{\beta}_0 \sim \mathcal{N}(\mathbf{0}, \mathbf{\Sigma}_\beta)$$
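As a sketch of how these priors might be assembled in code, the helper below collects the hyperparameters of Sections 3.1–3.4; the Inverse-Gamma values and the use of the sample variance of $y$ for $\sigma_y^2$ are placeholder assumptions:

```python
import numpy as np

def prior_hyperparameters(y, X, alpha=0.01, beta=0.01):
    """Illustrative prior setup following Sections 3.1-3.4 (alpha/beta are placeholders)."""
    N = X.shape[0]

    # Inverse-Gamma hyperparameters for the level and observation-noise variances
    level_var_prior = (alpha, beta)    # (alpha_l, beta_l)
    obs_var_prior = (alpha, beta)      # (alpha_eps, beta_eps)

    # Precision matrix for the regression weights: Lambda = 0.01 * 0.5 * X'X / N
    Lambda = 0.01 * (0.5 * (X.T @ X)) / N

    # Initial level prior: l_0 ~ N(y_0, sigma_y^2); sample variance stands in for sigma_y^2
    initial_level = (y[0], np.var(y))

    return level_var_prior, obs_var_prior, Lambda, initial_level
```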

4. Bayesian Inference via Gibbs Sampling

Gibbs Sampling is a Markov Chain Monte Carlo (MCMC) method used to sample from the joint posterior distribution of model parameters and latent states.

4.1. Posterior Distribution

The objective is to sample from the joint posterior distribution:

$$p(\mathbf{x}_{1:T}, \boldsymbol{\theta} \mid \mathbf{y}_{1:T})$$

Where:

- $\mathbf{x}_{1:T}$ denotes the latent states across all time points,
- $\boldsymbol{\theta} = \{\sigma_\ell^2, \sigma_\varepsilon^2, \sigma_{s_k}^2, \boldsymbol{\beta}, \dots\}$ collects the model parameters,
- $\mathbf{y}_{1:T}$ is the observed series.

Using Bayes’ theorem:

$$p(\mathbf{x}_{1:T}, \boldsymbol{\theta} \mid \mathbf{y}) \propto p(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\theta}) \cdot p(\mathbf{x}_{1:T} \mid \boldsymbol{\theta}) \cdot p(\boldsymbol{\theta})$$

4.2. Gibbs Sampling Steps

Gibbs Sampling iteratively samples each parameter conditioned on the current values of all other parameters.

Step 1: Sample Local Level Variance ($\sigma_\ell^2$)

$$\sigma_\ell^2 \mid \mathbf{x}_{1:T}, \mathbf{y}_{1:T} \sim \text{Inverse-Gamma}\left(\alpha_\ell^*, \beta_\ell^*\right)$$

Where:

$$\alpha_\ell^* = \alpha_\ell + \frac{T}{2}, \quad \beta_\ell^* = \beta_\ell + \frac{1}{2} \sum_{t=1}^T (\ell_t - \ell_{t-1})^2$$
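A minimal sketch of this update using SciPy's Inverse-Gamma distribution (the hyperparameter values are placeholders); Step 2 follows the same pattern with the observation residuals $y_t - \mathbf{F}^\top \mathbf{x}_t$ in place of the level increments:

```python
import numpy as np
from scipy.stats import invgamma

def sample_level_variance(level, alpha_l=0.01, beta_l=0.01, rng=None):
    """Draw sigma_l^2 from its Inverse-Gamma conditional posterior (Step 1).

    `level` is the currently sampled level path (including the initial state, so that
    the number of increments plays the role of T in the formulas above).
    """
    rng = rng or np.random.default_rng()
    diffs = np.diff(level)                        # l_t - l_{t-1}
    alpha_star = alpha_l + len(diffs) / 2.0
    beta_star = beta_l + 0.5 * np.sum(diffs ** 2)
    # scipy's invgamma uses shape `a` and `scale`; `scale` corresponds to beta*
    return invgamma.rvs(a=alpha_star, scale=beta_star, random_state=rng)
```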

Step 2: Sample Observation Noise Variance ($\sigma_\varepsilon^2$)

$$\sigma_\varepsilon^2 \mid \mathbf{x}_{1:T}, \mathbf{y}_{1:T} \sim \text{Inverse-Gamma}\left(\alpha_\varepsilon^*, \beta_\varepsilon^*\right)$$

Where:

$$\alpha_\varepsilon^* = \alpha_\varepsilon + \frac{T}{2}, \quad \beta_\varepsilon^* = \beta_\varepsilon + \frac{1}{2} \sum_{t=1}^T (y_t - \mathbf{F}^\top \mathbf{x}_t)^2$$

Step 3: Sample Regression Weights ($\boldsymbol{\beta}$)

Assuming time-invariant regression coefficients:

$$\boldsymbol{\beta} \mid \mathbf{x}_{1:T}, \mathbf{y}_{1:T}, \sigma_\varepsilon^2 \sim \mathcal{N}\left(\mathbf{m}, \mathbf{V}\right)$$

Where:

$$\mathbf{V} = \left(\mathbf{\Lambda} + \frac{\mathbf{X}^\top \mathbf{X}}{\sigma_\varepsilon^2}\right)^{-1}, \quad \mathbf{m} = \mathbf{V} \left(\frac{\mathbf{X}^\top \mathbf{y}}{\sigma_\varepsilon^2}\right)$$
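A sketch of this conditional draw with NumPy; in a full sampler `y` would typically be the observations with the other state components subtracted out, but here it is used directly as in the formulas above:

```python
import numpy as np

def sample_regression_weights(X, y, Lambda, sigma_eps2, rng=None):
    """Draw beta from its conditional Gaussian posterior (Step 3).

    Implements V = (Lambda + X'X / sigma_eps^2)^{-1} and m = V X'y / sigma_eps^2.
    """
    rng = rng or np.random.default_rng()
    V = np.linalg.inv(Lambda + (X.T @ X) / sigma_eps2)
    m = V @ (X.T @ y) / sigma_eps2
    return rng.multivariate_normal(m, V)
```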

Step 4: Sample State Vectors ($\mathbf{x}_{1:T}$)

Use a forward-filtering, backward-sampling (FFBS) algorithm, or a similar method, to draw the latent state path $\mathbf{x}_{1:T}$ given the current parameter values and the observed data.
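The full FFBS recursion for the complete state vector is lengthy; the sketch below restricts it to the local level component alone (random-walk transition, $y_t = \ell_t + \varepsilon_t$), which is the simplest case and illustrates the forward Kalman filter followed by backward sampling. The initial moments `m0`, `P0` are diffuse placeholders:

```python
import numpy as np

def ffbs_local_level(y, sigma_l2, sigma_eps2, m0=0.0, P0=1e6, rng=None):
    """Forward-filter, backward-sample the local level path (Step 4, level only)."""
    rng = rng or np.random.default_rng()
    T = len(y)
    m = np.zeros(T)   # filtered means
    P = np.zeros(T)   # filtered variances

    # Forward Kalman filter
    m_prev, P_prev = m0, P0
    for t in range(T):
        m_pred, P_pred = m_prev, P_prev + sigma_l2     # predict (random walk)
        K = P_pred / (P_pred + sigma_eps2)             # Kalman gain
        m[t] = m_pred + K * (y[t] - m_pred)            # update mean
        P[t] = (1.0 - K) * P_pred                      # update variance
        m_prev, P_prev = m[t], P[t]

    # Backward sampling
    level = np.zeros(T)
    level[-1] = rng.normal(m[-1], np.sqrt(P[-1]))
    for t in range(T - 2, -1, -1):
        J = P[t] / (P[t] + sigma_l2)                   # smoothing gain
        mean = m[t] + J * (level[t + 1] - m[t])
        var = (1.0 - J) * P[t]
        level[t] = rng.normal(mean, np.sqrt(var))
    return level
```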

4.3. Multiple MCMC Chains

To ensure convergence and robustness:

- Run several independent chains from dispersed starting values.
- Discard an initial burn-in portion of each chain (and optionally thin the remaining draws).
- Check that the chains agree, e.g., via trace plots or the Gelman–Rubin statistic $\hat{R}$; a small sketch of this diagnostic follows below.
- Pool the retained draws across chains for posterior inference.
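A minimal sketch of the Gelman–Rubin $\hat{R}$ diagnostic for a single scalar parameter, assuming the post-burn-in draws are stored as a 2-D array (it is not a replacement for a full MCMC diagnostics library):

```python
import numpy as np

def gelman_rubin(chains):
    """Gelman-Rubin R-hat for one scalar parameter; `chains` has shape (n_chains, n_draws)."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)          # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # within-chain variance
    var_hat = (n - 1) / n * W + B / n        # pooled variance estimate
    return np.sqrt(var_hat / W)              # values near 1 suggest the chains have mixed
```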


5. Posterior Predictive Inference

With posterior samples, derive predictions for the counterfactual scenario ($\hat{y}_t$) and assess the impact of the intervention.

5.1. Posterior Means

For each time $t$, the posterior mean prediction is:

$$\hat{y}_t = \mathbb{E}[y_t \mid \mathbf{y}_{1:T}]$$

For each posterior draw of the state path, the corresponding prediction is

$$\hat{y}_t = \mathbf{F}^\top \mathbf{x}_t$$

and the posterior mean is obtained by averaging these predictions over draws.

5.2. Credible Intervals

Compute the $(1-\alpha)$ credible intervals (e.g., 95% with $\alpha = 0.05$) for $\hat{y}_t$:

$$\hat{y}_t^{(q)} = \text{Quantile}\left(\hat{y}_t, q\right), \quad q \in \left\{ \frac{\alpha}{2},\ 1 - \frac{\alpha}{2} \right\}$$
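A short sketch of these summaries, assuming the posterior-predictive draws of the counterfactual are stored in an array `pred_draws` of shape `(n_draws, T)` (one row per retained MCMC sample):

```python
import numpy as np

def summarize_predictions(pred_draws, alpha=0.05):
    """Posterior mean and (1 - alpha) credible interval per time point."""
    mean = pred_draws.mean(axis=0)
    lower = np.quantile(pred_draws, alpha / 2, axis=0)
    upper = np.quantile(pred_draws, 1 - alpha / 2, axis=0)
    return mean, lower, upper
```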

6. Causal Effect Estimation

Assess the intervention’s impact by comparing observed data with model predictions.

6.1. Point Effects

The immediate difference at time $t$:

$$\text{Point Effect}_t = y_t - \hat{y}_t$$

6.2. Cumulative Effects

Total impact from intervention start $T_{\text{start}}$ to time $t$:

$$\text{Cumulative Effect}_t = \sum_{\tau=T_{\text{start}}}^{t} (y_\tau - \hat{y}_\tau)$$
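A sketch of both effect types, computed per posterior draw so that credible intervals for the cumulative effect are also available; `pred_draws` is the same hypothetical `(n_draws, T)` array as above:

```python
import numpy as np

def causal_effects(y_observed, pred_draws, t_start):
    """Point and cumulative effects over the post-intervention period."""
    point = y_observed[None, :] - pred_draws            # per-draw point effects, shape (n_draws, T)
    cumulative = np.cumsum(point[:, t_start:], axis=1)  # summed from T_start onward
    return point, cumulative

# Example summaries of the final cumulative effect (posterior mean and 95% interval):
# cum_mean = cumulative[:, -1].mean()
# cum_ci = np.quantile(cumulative[:, -1], [0.025, 0.975])
```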

6.3. Summary Statistics

Over the post-intervention period $T_{\text{post}}$, the analysis typically reports:

- the average effect (the mean of the point effects),
- the cumulative effect (the sum of the point effects), and
- the posterior probability of a causal effect, i.e., the fraction of posterior draws in which the cumulative effect has the same sign as its posterior mean.


7. Matrix Operations and Linear Algebra

Efficient computation and representation of the STS model rely heavily on matrix operations.

7.1. State Transition Matrix ($\mathbf{G}$)

$$\mathbf{G} = \begin{bmatrix} 1 & 0 & \dots & 0 & \mathbf{0}^\top \\ 0 & 1 & \dots & 0 & \mathbf{0}^\top \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \dots & 1 & \mathbf{0}^\top \\ \mathbf{0} & \mathbf{0} & \dots & \mathbf{0} & \mathbf{I} \end{bmatrix}$$

7.2. Observation Matrix ($\mathbf{F}$)

$$\mathbf{F}^\top = \begin{bmatrix} 1 & \mathbf{0}_{K}^\top & \mathbf{X}_t^\top \end{bmatrix}$$

7.3. Covariance Matrix ($\mathbf{W}$)

$$\mathbf{W} = \begin{bmatrix} \sigma_\ell^2 & \mathbf{0} & \dots & \mathbf{0} \\ \mathbf{0} & \sigma_{s_1}^2 \mathbf{I} & \dots & \mathbf{0} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \dots & \mathbf{\Sigma}_\beta \end{bmatrix}$$

7.4. Precision Matrix for Regression Weights ($\mathbf{\Lambda}$)

$$\mathbf{\Lambda} = 0.01 \times \frac{0.5 \times (\mathbf{X}^\top \mathbf{X})}{N}$$

7.5. Likelihood Function

For the entire dataset, the likelihood is:

$$p(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\theta}) = \prod_{t=1}^T \mathcal{N}(y_t \mid \mathbf{F}^\top \mathbf{x}_t, \sigma_\varepsilon^2)$$
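As a small sketch, the corresponding log-likelihood can be evaluated directly, assuming the observation vectors and sampled states are stacked row by row into `(T, d)` arrays (names are illustrative):

```python
import numpy as np
from scipy.stats import norm

def log_likelihood(y, F, states, sigma_eps2):
    """Gaussian log-likelihood of Section 7.5: sum_t log N(y_t | F_t' x_t, sigma_eps^2)."""
    mu = np.einsum("td,td->t", F, states)   # F_t' x_t for each time point
    return norm.logpdf(y, loc=mu, scale=np.sqrt(sigma_eps2)).sum()
```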

7.6. Posterior Distribution

Using Bayes’ theorem:

$$p(\mathbf{x}_{1:T}, \boldsymbol{\theta} \mid \mathbf{y}) \propto p(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\theta}) \cdot p(\mathbf{x}_{1:T} \mid \boldsymbol{\theta}) \cdot p(\boldsymbol{\theta})$$

Where:

- $p(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\theta})$ is the observation likelihood (Section 7.5),
- $p(\mathbf{x}_{1:T} \mid \boldsymbol{\theta})$ is the state prior implied by the transition equation,
- $p(\boldsymbol{\theta})$ collects the parameter priors from Section 3.


8. Data Standardization and Scaling

Proper data preprocessing ensures that the model accurately captures patterns without being skewed by varying scales.

8.1. Standardizing Data

$$y_t' = \frac{y_t - \mu_y}{\sigma_y}$$
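A tiny sketch of this transformation, returning the inverse map so predictions can be expressed back on the original scale (helper names are illustrative):

```python
import numpy as np

def standardize(y):
    """Standardize the series and return a function to map predictions back."""
    mu, sigma = y.mean(), y.std()
    return (y - mu) / sigma, lambda z: z * sigma + mu

# y_std, unscale = standardize(y)   # fit the model on y_std, then apply unscale() to predictions
```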

8.2. Scaling Priors

Prior hyperparameters are expressed on the scale of the (standardized) data: for example, the initial-level prior uses $\sigma_y$ (Section 3.4) and the regression precision $\mathbf{\Lambda}$ is scaled by $\mathbf{X}^\top \mathbf{X} / N$ (Section 3.3), so the priors remain sensible regardless of the original units.


9. Seasonal Effects Handling

Seasonality is a common feature in time-series data, representing periodic fluctuations.

9.1. Seasonal Components ($s_{k,t}$)

Each seasonal component $s_{k,t}$ captures a periodic pattern that repeats with period $m$ (e.g., day-of-week effects with $m = 7$).

9.2. Seasonal Drift ($\sigma_{s_k}$)

Allows seasonal trends to gradually change over time:

$$s_{k,t} = s_{k,t-m} + \epsilon_{k,t}, \quad \epsilon_{k,t} \sim \mathcal{N}(0, \sigma_{s_k}^2)$$
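A short simulation sketch of such a drifting seasonal pattern; the period, horizon, and drift standard deviation are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
T, m = 8 * 7, 7                     # illustrative: 8 weeks of daily data, weekly period
sigma_s = 0.1                       # seasonal drift standard deviation

season = np.zeros(T)
season[:m] = rng.normal(size=m)     # initial seasonal pattern
season[:m] -= season[:m].mean()     # roughly center the initial pattern

for t in range(m, T):
    # Each point drifts from its value one full period earlier
    season[t] = season[t - m] + rng.normal(scale=sigma_s)
```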

10. Summary

This CausalImpact Analysis formulation encompasses a comprehensive Bayesian Structural Time Series (STS) model with the following core components:

  1. State-Space Model: Defines the dynamics of latent states, including local level, seasonal components, and regression coefficients, influencing the observed data.
  2. Priors: Utilizes Inverse-Gamma priors for variances and Normal priors for regression weights and initial states, integrating domain knowledge and ensuring regularization.
  3. Gibbs Sampling: Employs Gibbs Sampling to iteratively sample from conditional posterior distributions of model parameters and latent states, ensuring convergence to the joint posterior.
  4. Posterior Predictive Inference: Derives posterior mean predictions and credible intervals, providing probabilistic estimates of counterfactual scenarios.
  5. Causal Effect Estimation: Quantifies the impact of interventions through point and cumulative effects by juxtaposing observed data against model predictions.
  6. Matrix Operations: Leverages matrix algebra for efficient representation and computation of state transitions, observations, and parameter updates.

By articulating the entire process in precise mathematical terms, this formulation not only enhances interpretability but also lays the groundwork for extensions or modifications of the model as the needs of data analysis and causal inference evolve.


11. Appendix

Derivation of Sampling the Local Level Variance ($\sigma_\ell^2$)

This section provides a detailed mathematical derivation of the sampling step for the Local Level Variance ($\sigma_\ell^2$) within the Gibbs Sampling procedure.

1. Model Setup

1.1. State Transition Equation

The evolution of the Local Level component is given by:

$$\ell_t = \ell_{t-1} + \eta_t, \quad \eta_t \sim \mathcal{N}(0, \sigma_\ell^2)$$

1.2. Prior for $\sigma_\ell^2$

An Inverse-Gamma prior is assumed:

$$\sigma_\ell^2 \sim \text{Inverse-Gamma}(\alpha_\ell, \beta_\ell)$$

2. Likelihood Function

Given the state transition, the likelihood of the level path $\{\ell_t\}_{t=1}^T$ is:

$$p(\{\ell_t\}_{t=1}^T \mid \{\ell_{t-1}\}_{t=1}^T, \sigma_\ell^2) = \prod_{t=1}^T \mathcal{N}(\ell_t \mid \ell_{t-1}, \sigma_\ell^2)$$

Expanding the Normal density:

$$p(\{\ell_t\}_{t=1}^T \mid \sigma_\ell^2) = (2\pi \sigma_\ell^2)^{-T/2} \exp\left( -\frac{1}{2\sigma_\ell^2} \sum_{t=1}^T (\ell_t - \ell_{t-1})^2 \right)$$

3. Posterior Distribution

Applying Bayes’ theorem:

$$p(\sigma_\ell^2 \mid \text{data}) \propto p(\text{data} \mid \sigma_\ell^2) \cdot p(\sigma_\ell^2)$$

Substituting the likelihood and prior:

$$p(\sigma_\ell^2 \mid \text{data}) \propto (\sigma_\ell^2)^{-T/2} \exp\left( -\frac{1}{2\sigma_\ell^2} \sum_{t=1}^T (\ell_t - \ell_{t-1})^2 \right) \cdot (\sigma_\ell^2)^{-\alpha_\ell - 1} \exp\left( -\frac{\beta_\ell}{\sigma_\ell^2} \right)$$

Combining like terms:

$$p(\sigma_\ell^2 \mid \text{data}) \propto (\sigma_\ell^2)^{-\left(\alpha_\ell + \frac{T}{2} + 1\right)} \exp\left( -\frac{1}{\sigma_\ell^2} \left( \frac{1}{2} \sum_{t=1}^T (\ell_t - \ell_{t-1})^2 + \beta_\ell \right) \right)$$

4. Identifying the Posterior Distribution

Recognizing the form of the Inverse-Gamma distribution:

$$\text{Inverse-Gamma}(x \mid \alpha', \beta') = \frac{{\beta'}^{\alpha'}}{\Gamma(\alpha')}\, x^{-\alpha' - 1} \exp\left( -\frac{\beta'}{x} \right)$$

We identify the posterior parameters:

$$\alpha_\ell^* = \alpha_\ell + \frac{T}{2}, \quad \beta_\ell^* = \beta_\ell + \frac{1}{2} \sum_{t=1}^T (\ell_t - \ell_{t-1})^2$$

Thus, the conditional posterior is:

$$\sigma_\ell^2 \mid \text{data} \sim \text{Inverse-Gamma}(\alpha_\ell^*, \beta_\ell^*)$$

Summary of the Sampling Step:

$$\sigma_\ell^2 \mid \text{data} \sim \text{Inverse-Gamma}\left(\alpha_\ell + \frac{T}{2},\ \beta_\ell + \frac{1}{2} \sum_{t=1}^T (\ell_t - \ell_{t-1})^2 \right)$$

This conjugate relationship between the Normal likelihood and the Inverse-Gamma prior facilitates efficient Gibbs Sampling, enabling straightforward updates of $\sigma_\ell^2$ in each iteration.

