Statistical Foundations

This section covers the statistical principles underlying the Causation Entropy framework, including hypothesis testing, multiple comparisons, bootstrap methods, and theoretical guarantees. Understanding these foundations is essential for proper application and interpretation of causal discovery results.

Hypothesis Testing Framework

Null and Alternative Hypotheses

In causal discovery, the fundamental hypothesis test is:

\[H_0: I(X_j^{(t-\tau)}; X_i^{(t)} | \mathbf{Z}_i^{(t)}) = 0\]

\[H_1: I(X_j^{(t-\tau)}; X_i^{(t)} | \mathbf{Z}_i^{(t)}) > 0\]

Interpretation: - $H_0$: No causal relationship (conditional independence) - $H_1$: Causal relationship exists (conditional dependence)

Key Insight: Conditional independence testing forms the backbone of information-theoretic causal discovery.

Test Statistics and Distributions

The test statistic is the conditional mutual information:

\[T = \hat{I}(X_j^{(t-\tau)}; X_i^{(t)} | \mathbf{Z}_i^{(t)})\]

Distribution Under Null: For most information estimators, the null distribution is not analytically tractable. This motivates non-parametric approaches like permutation testing.

Asymptotic Properties: Under regularity conditions, for the Gaussian estimator:

\[2n \cdot \hat{I}(X;Y|Z) \xrightarrow{d} \chi^2_{df}\]

where :math:`df$ depends on the dimensionalities of :math:`X$, :math:`Y$, and :math:`Z$.

Permutation Testing

Theoretical Foundation

Permutation Distribution: Generate $B$ permutations :math:${X^{(b)}}_{b=1}^B$ and compute:

\[\{T^{(b)}\}_{b=1}^B = \{\hat{I}(X^{(b)}; Y | Z)\}_{b=1}^B\]

P-value Calculation:

\[p = \frac{1 + \sum_{b=1}^B \mathbb{I}(T^{(b)} \geq T_{\text{obs}})}{B + 1}\]

Permutation Strategies

Simple Permutation: Randomly shuffle :math:`X$ across all observations.

Conditional Permutation: For continuous :math:`Z$, this is challenging. Alternatives include:

Residual Permutation: Permute residuals from :math:`X sim f(Z)$
Local Permutation: Permute within neighborhoods of similar :math:`Z$ values
Model-Based Permutation: Fit :math:`p(X|Z)$ and generate synthetic data

Block Permutation: For time series data, preserve temporal structure:

\[\text{Block}(X, l) = [X_{i:i+l-1}, X_{j:j+l-1}, \ldots]\]

where blocks of length :math:`l$ are permuted rather than individual observations.

Statistical Properties

Exactness: Permutation tests provide exact control of Type I error under :math:`H_0$.

Power: Power depends on: - Effect size (true conditional mutual information) - Sample size :math:`n$ - Number of permutations :math:`B$ - Quality of information estimator

Computational Cost: Total cost is :math:`O((B+1) cdot C_{text{estimator}})$ where :math:`C_{text{estimator}}$ is the cost of computing one conditional mutual information estimate.

Sequential Testing in oCSE

Forward Selection Testing

At each forward selection step :math:`s$:

Test all remaining candidates: :math:`{H_{0,k}}_{k in mathcal{R}_s}$
Apply multiple testing correction within the step
Select the most significant candidate (if any pass the threshold)

Step-wise FDR Control: .. math:

\alpha_s = \alpha \cdot \frac{|\mathcal{R}_s|}{|\mathcal{R}_1|}

This allocates the error budget proportionally across steps.

Backward Elimination Testing

Test each selected predictor for continued significance:

\[H_0: I(X_j^{(t-\tau)}; X_i^{(t)} | \mathbf{S}_i \setminus \{j\}) = 0\]

Challenges: - Dependencies between tests (same target, overlapping conditioning sets) - Multiple testing across different removal orders

Solutions: - Use more conservative :math:`alpha$ for backward phase - Apply FDR control across all backward tests - Use stability-based selection criteria

Bootstrap Methods

Bootstrap Confidence Intervals

Procedure: 1. Generate $B$ bootstrap samples :math:${(mathbf{X}^{(b)}, mathbf{Y}^{(b)})}_{b=1}^B$ 2. Compute :math:`{hat{I}^{(b)}}_{b=1}^B$ for each bootstrap sample 3. Construct confidence interval: :math:`[hat{I}_{(alpha/2)}, hat{I}_{(1-alpha/2)}]$

Time Series Bootstrap: Standard bootstrap assumes i.i.d. data. For time series:

Block Bootstrap: .. math:

\text{Bootstrap Sample} = [B_1, B_2, \ldots, B_k]

where :math:`B_i$ are overlapping blocks of length :math:`l$.

Stationary Bootstrap: Random block lengths with geometric distribution.

Bootstrap-based Variable Selection

Stability Selection: For each bootstrap sample, perform variable selection and compute selection probability:

\[\Pi_j = P(\text{variable } j \text{ selected}) = \frac{1}{B} \sum_{b=1}^B \mathbb{I}(j \in \hat{\mathbf{S}}^{(b)})\]

Select variables with :math:`Pi_j geq pi_{text{threshold}}$ (typically 0.6-0.8).

Theoretical Guarantees: Under appropriate conditions, stability selection provides FDR control:

\[\mathbb{E}[\text{FDR}] \leq \frac{1}{2\pi_{\text{threshold}} - 1} \cdot \frac{\mathbb{E}[V]}{|\hat{\mathbf{S}}|}\]

Theoretical Guarantees

Consistency Properties

Selection Consistency: An estimator is selection consistent if:

\[P(\hat{\mathbf{S}} = \mathbf{S}_{\text{true}}) \to 1 \text{ as } n \to \infty\]

Conditions for oCSE: 1. Information Estimator Consistency: $\hat{I} \xrightarrow{P} I$ 2. **Significance Level Scaling:** :math:$alpha_n to 0$ appropriately 3. Sparsity: :math:`|\mathbf{S}_{\text{true}}| = o(n)$ 4. Signal Strength: Minimum true CMI bounded away from 0

Estimation Error Bounds

For Gaussian estimators, the estimation error satisfies:

\[|\hat{I} - I| = O_p\left(\sqrt{\frac{d \log n}{n}}\right)\]

where :math:`d$ is the effective dimensionality.

Implications for Causal Discovery: - Need $n \gg d \log n$ for reliable estimation - True relationships must have CMI significantly larger than :math:$sqrt{frac{d log n}{n}}$

High-Dimensional Theory

Conditions for :math:`p > n$: When the number of potential predictors exceeds sample size:

Sparsity: :math:`s = |\mathbf{S}_{\text{true}}| ll n$
Restricted Eigenvalue Condition: For information matrices
Signal-to-Noise Ratio: True CMI values sufficiently large

Phase Transitions: In high-dimensional regimes, there are sharp phase transitions where selection becomes possible/impossible based on the scaling of :math:`n$, :math:`p$, and :math:`s$.

Power Analysis

Theoretical Power

The power of a conditional independence test is:

\[\text{Power} = P(\text{reject } H_0 | H_1 \text{ true}) = P(T > t_{\alpha} | I > 0)\]

Factors Affecting Power: - Effect Size: Larger true CMI increases power - Sample Size: Power increases with :math:`n$ - Dimensionality: Higher dimensions reduce power (curse of dimensionality) - Information Estimator: Different estimators have different power characteristics

Sample Size Calculations

Rule of Thumb for Gaussian Estimator: To detect CMI of size :math:`delta$ with power :math:`1-beta$:

\[n \gtrsim \frac{(z_{\alpha} + z_{\beta})^2}{\delta^2} \cdot d\]

where :math:`d$ is the effective dimensionality.

Simulation-Based Power Analysis: 1. Specify effect sizes of interest 2. Generate synthetic data under alternative hypothesis 3. Apply testing procedure and compute empirical power 4. Repeat for different sample sizes to find required :math:`n$

Robustness and Sensitivity

Robustness to Outliers

Impact of Outliers: Information estimators vary in sensitivity to outliers: - Gaussian: Highly sensitive (based on sample covariance) - k-NN: Moderately sensitive (distance-based) - Histogram: Least sensitive (discretization reduces impact)

Robust Estimators: - Trimmed estimators: Remove extreme observations - M-estimators: Downweight outliers in computation - Robust covariance: Use robust estimates in Gaussian methods

Model Misspecification

Gaussian Assumption Violations: When data is non-Gaussian but Gaussian estimator is used: - May detect only linear relationships - Power reduced for nonlinear dependencies - Type I error control generally maintained

Non-stationarity: Time-varying relationships violate stationarity assumptions: - Use adaptive window methods - Apply tests for structural breaks - Consider time-varying parameter models

Sensitivity Analysis

Parameter Sensitivity: Assess robustness to hyperparameter choices: - Information estimator parameters (bandwidth, k) - Significance levels ($\alpha$) - Maximum lag (:math:$tau_{max}$)

Cross-Validation: Use held-out data to validate discovered relationships:

\[\text{CV-Score} = \frac{1}{K} \sum_{k=1}^K I_{\text{test},k}(\hat{\mathbf{S}}_{\text{train},k})\]

Practical Guidelines

Sample Size Requirements

Minimum Sample Sizes by Estimator:

Sample Size Guidelines
Estimator	Low Dim (d≤5)	Medium Dim (5<d≤20)	High Dim (d>20)
Gaussian	n ≥ 50	n ≥ 100	n ≥ 500
k-NN	n ≥ 100	n ≥ 500	n ≥ 1000+
KDE	n ≥ 200	n ≥ 1000	Not recommended

Significance Level Selection

Forward Selection: Use more stringent $\alpha$ to control false positives - Conservative: :math:$alpha = 0.01$ - Standard: $\alpha = 0.05$ - Liberal: :math:$alpha = 0.10$ (exploratory analysis)

Backward Elimination: Can use less stringent $\alpha$ for pruning - Typical: :math:$alpha_{text{backward}} = 1.5 times alpha_{text{forward}}$

Multiple Testing: Always apply appropriate corrections when testing multiple relationships simultaneously.

Diagnostic Procedures

Model Checking

Residual Analysis: After variable selection, examine residuals for: - Independence (serial correlation tests) - Normality (if using Gaussian methods) - Heteroscedasticity

Information Criteria: Compare model performance using information-theoretic criteria:

\[\text{AIC}_{\text{info}} = -2 \sum_{j \in \hat{\mathbf{S}}} \hat{I}_j + 2|\hat{\mathbf{S}}|\]

Stability Analysis

Bootstrap Stability: Assess selection stability across bootstrap samples:

\[\text{Stability Score} = \frac{1}{B} \sum_{b=1}^B \frac{|\hat{\mathbf{S}}^{(b)} \cap \hat{\mathbf{S}}|}{|\hat{\mathbf{S}}^{(b)} \cup \hat{\mathbf{S}}|}\]

Cross-Validation Stability: Use K-fold CV to assess robustness to data splitting.

Future Directions

Methodological Advances

Adaptive Testing: Data-driven significance level selection
Sequential FDR: Improved multiple testing for sequential selection
Robust Information Measures: Estimators less sensitive to outliers
High-Dimensional Theory: Better understanding of :math:`p >> n$ regimes
Causal-Specific Tests: Tests designed specifically for causal relationships

Computational Improvements

Parallel Testing: Efficient parallel algorithms for permutation tests
Approximate Methods: Fast approximate significance testing

Conclusion

The statistical foundations of optimal Causal Entropy provide the theoretical framework for reliable causal discovery. Key principles include:

Rigorous Hypothesis Testing: All causal claims should be statistically validated
Multiple Testing Awareness: Control for multiple comparisons when testing many relationships
Bootstrap Methods: Use resampling for uncertainty quantification and stability assessment
Power Considerations: Ensure sufficient sample sizes for reliable detection
Robustness Checks: Validate methods across different assumptions and parameter choices

Understanding these statistical foundations is crucial for proper application and interpretation of causal discovery results. Practitioners should always validate their findings through appropriate statistical testing and sensitivity analysis.