Statistical Foundations
This section covers the statistical principles underlying the Causation Entropy framework, including hypothesis testing, multiple comparisons, bootstrap methods, and theoretical guarantees. Understanding these foundations is essential for proper application and interpretation of causal discovery results.
Hypothesis Testing Framework
Null and Alternative Hypotheses
In causal discovery, the fundamental hypothesis test is:

.. math::

   H_0: I(X; Y \mid Z) = 0 \quad \text{versus} \quad H_1: I(X; Y \mid Z) > 0

Interpretation:

- :math:`H_0`: No causal relationship (conditional independence)
- :math:`H_1`: A causal relationship exists (conditional dependence)
Key Insight: Conditional independence testing forms the backbone of information-theoretic causal discovery.
Test Statistics and Distributions
The test statistic is the estimated conditional mutual information:

.. math::

   T = \hat{I}(X; Y \mid Z)
Distribution Under Null: For most information estimators, the null distribution is not analytically tractable. This motivates non-parametric approaches like permutation testing.
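Asymptotic Properties: Under regularity conditions, for the Gaussian estimator (with information measured in nats):

.. math::

   2n \, \hat{I}(X; Y \mid Z) \xrightarrow{d} \chi^2_{df} \quad \text{under } H_0

where :math:`df` depends on the dimensionalities of :math:`X`, :math:`Y`, and :math:`Z`.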
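The asymptotic test can be sketched for the simplest case. This is a minimal illustration, assuming scalar :math:`X`, :math:`Y`, and :math:`Z`, information measured in nats, and the commonly used form :math:`2n\hat{I} \sim \chi^2_1` under the null. The helper names (``pearson``, ``gaussian_cmi``, ``asymptotic_p_value``) are hypothetical and not part of any library API.

```python
import math
import random

def pearson(a, b):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)

def gaussian_cmi(x, y, z):
    """Gaussian CMI estimate (nats) for scalar x, y, z via the partial correlation."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    partial = (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))
    return -0.5 * math.log(1 - partial ** 2)

def asymptotic_p_value(x, y, z):
    """P-value assuming 2 * n * I_hat follows chi-square(1) under the null."""
    stat = 2 * len(x) * gaussian_cmi(x, y, z)
    # With one degree of freedom the chi-square survival function
    # reduces to erfc(sqrt(stat / 2)).
    return math.erfc(math.sqrt(stat / 2))

# Synthetic data (assumption): z drives x; y_null depends on z only,
# y_alt also depends on x directly.
rng = random.Random(0)
z = [rng.gauss(0, 1) for _ in range(500)]
x = [zi + rng.gauss(0, 1) for zi in z]
y_null = [zi + rng.gauss(0, 1) for zi in z]
y_alt = [xi + zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]

p_null = asymptotic_p_value(x, y_null, z)
p_alt = asymptotic_p_value(x, y_alt, z)
```

For vector-valued variables the degrees of freedom change and the scalar shortcut through ``math.erfc`` no longer applies, so a general chi-square survival function would be needed.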
Permutation Testing
Theoretical Foundation
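Permutation Distribution: Generate :math:`B` permutations :math:`\{X^{(b)}\}_{b=1}^B` and compute:

.. math::

   T^{(b)} = \hat{I}(X^{(b)}; Y \mid Z), \quad b = 1, \ldots, B

P-value Calculation:

.. math::

   p = \frac{1 + \sum_{b=1}^{B} \mathbb{1}\left[T^{(b)} \geq T_{\text{obs}}\right]}{B + 1}

where :math:`T_{\text{obs}}` is the statistic computed on the unpermuted data.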
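A permutation p-value can be sketched as follows. This is a minimal illustration, assuming scalar variables, a Gaussian CMI estimator based on the partial correlation, simple shuffling of :math:`X`, and the common convention of counting the observed statistic as one of the :math:`B + 1` values. All function names are hypothetical.

```python
import math
import random

def pearson(a, b):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)

def gaussian_cmi(x, y, z):
    """Gaussian CMI estimate (nats) for scalar x, y, z via the partial correlation."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    partial = (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))
    return -0.5 * math.log(1 - partial ** 2)

def permutation_p_value(x, y, z, n_perm=200, seed=0):
    """P-value (1 + #{T_b >= T_obs}) / (B + 1) from simple shuffles of x."""
    rng = random.Random(seed)
    observed = gaussian_cmi(x, y, z)
    x_perm = list(x)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(x_perm)  # breaks any link between x and (y, z)
        if gaussian_cmi(x_perm, y, z) >= observed:
            exceed += 1
    return (1 + exceed) / (n_perm + 1)

# Synthetic data (assumption): x is independent of z, and y depends on both.
rng = random.Random(1)
n = 200
x = [rng.gauss(0, 1) for _ in range(n)]
z = [rng.gauss(0, 1) for _ in range(n)]
y = [xi + zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]

p_value = permutation_p_value(x, y, z)
```

Because the smallest attainable p-value is :math:`1/(B+1)`, the number of permutations bounds the resolution of the test: with :math:`B = 200`, no p-value below roughly 0.005 can be reported.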
Permutation Strategies
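Simple Permutation: Randomly shuffle :math:`X` across all observations. This also destroys any dependence between :math:`X` and :math:`Z`.

Conditional Permutation: Shuffle :math:`X` only among observations that share the same value of :math:`Z`, so that the dependence between :math:`X` and :math:`Z` is preserved. For continuous :math:`Z`, exact matches do not occur, which makes this challenging. Alternatives include: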
- Residual Permutation: Permute residuals from :math:`X \sim f(Z)`
- Local Permutation: Permute within neighborhoods of similar :math:`Z` values
- Model-Based Permutation: Fit :math:`p(X|Z)` and generate synthetic data
Block Permutation: For time series data, preserve temporal structure:
where blocks of length :math:`l` are permuted rather than individual observations.
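Block permutation can be illustrated with a short sketch. It assumes non-overlapping contiguous blocks and a hypothetical helper named ``block_permute``; the choice of block length is left to the user.

```python
import random

def block_permute(series, block_length, seed=None):
    """Shuffle contiguous blocks of `series`, keeping the order inside each block.

    If len(series) is not a multiple of block_length, the final block is
    shorter and is shuffled along with the others.
    """
    rng = random.Random(seed)
    blocks = [series[i:i + block_length]
              for i in range(0, len(series), block_length)]
    rng.shuffle(blocks)
    return [value for block in blocks for value in block]

series = list(range(12))
shuffled = block_permute(series, block_length=4, seed=1)
```

Within each block the original ordering survives, so serial dependence at lags shorter than the block length is largely preserved, while the alignment between the permuted series and the other variables is broken.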
Statistical Properties
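Exactness: Permutation tests provide exact control of Type I error under :math:`H_0` when the permuted observations are exchangeable under the null. Approximate schemes, such as local or residual permutation, control the error rate only approximately.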
Power: Power depends on:

- Effect size (true conditional mutual information)
- Sample size :math:`n`
- Number of permutations :math:`B`
- Quality of the information estimator
Computational Cost: Total cost is :math:`O((B+1) \cdot C_{\text{estimator}})`, where :math:`C_{\text{estimator}}` is the cost of computing one conditional mutual information estimate.
Sequential Testing in oCSE
Forward Selection Testing
At each forward selection step :math:`s`:

1. Test all remaining candidates: :math:`\{H_{0,k}\}_{k \in \mathcal{R}_s}`
2. Apply multiple testing correction within the step
3. Select the most significant candidate (if any pass the threshold)
Step-wise FDR Control:

.. math::

   \alpha_s = \alpha \cdot \frac{|\mathcal{R}_s|}{|\mathcal{R}_1|}
This allocates the error budget proportionally across steps.
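As a small worked example of this allocation, the sketch below evaluates :math:`\alpha_s` for a hypothetical run with ten initial candidates and one candidate selected per step. The function name ``stepwise_alpha`` is an assumption made for illustration.

```python
def stepwise_alpha(alpha, n_remaining, n_initial):
    """Per-step level alpha_s = alpha * |R_s| / |R_1|."""
    if not 0 < n_remaining <= n_initial:
        raise ValueError("need 0 < n_remaining <= n_initial")
    return alpha * n_remaining / n_initial

# Ten initial candidates; one is selected at each of the first two steps.
schedule = [stepwise_alpha(0.05, remaining, 10) for remaining in (10, 9, 8)]
```

The per-step level shrinks from 0.05 to about 0.045 and then 0.04, so later steps test against a progressively stricter threshold.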
Backward Elimination Testing
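Test each selected predictor for continued significance, conditioning on the other selected predictors:

.. math::

   H_0: I(X_j; Y \mid \mathbf{S} \setminus \{X_j\}) = 0 \quad \text{for each } X_j \in \mathbf{S}

where :math:`\mathbf{S}` denotes the set selected in the forward phase.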
Challenges:

- Dependencies between tests (same target, overlapping conditioning sets)
- Multiple testing across different removal orders

Solutions:

- Use a more conservative :math:`\alpha` for the backward phase
- Apply FDR control across all backward tests
- Use stability-based selection criteria
Bootstrap Methods
Bootstrap Confidence Intervals
Procedure:

1. Generate :math:`B` bootstrap samples :math:`\{(\mathbf{X}^{(b)}, \mathbf{Y}^{(b)})\}_{b=1}^B`
2. Compute :math:`\{\hat{I}^{(b)}\}_{b=1}^B`, one estimate for each bootstrap sample
3. Construct the confidence interval from the empirical quantiles: :math:`[\hat{I}_{(\alpha/2)}, \hat{I}_{(1-\alpha/2)}]`
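The procedure above can be sketched as a percentile bootstrap. This is a minimal illustration for i.i.d. data, assuming scalar variables and, to keep it short, an unconditional Gaussian mutual information estimator. The names ``gaussian_mi`` and ``bootstrap_ci`` are hypothetical.

```python
import math
import random

def gaussian_mi(x, y):
    """Gaussian mutual information (nats) of two scalar samples: -0.5 * log(1 - r^2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r_squared = sxy ** 2 / (sxx * syy)
    return -0.5 * math.log(1 - r_squared)

def bootstrap_ci(x, y, n_boot=500, alpha=0.05, seed=0):
    """Percentile interval from estimates on resampled (x, y) pairs."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample pairs with replacement
        estimates.append(gaussian_mi([x[i] for i in idx], [y[i] for i in idx]))
    estimates.sort()
    lower = estimates[math.floor((alpha / 2) * (n_boot - 1))]
    upper = estimates[math.ceil((1 - alpha / 2) * (n_boot - 1))]
    return lower, upper

# Synthetic data (assumption): y = x + noise.
rng = random.Random(2)
x = [rng.gauss(0, 1) for _ in range(300)]
y = [xi + rng.gauss(0, 1) for xi in x]
lower, upper = bootstrap_ci(x, y)
```

Resampling whole :math:`(x, y)` pairs keeps the dependence between the variables intact, which is what the interval is meant to quantify.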
Time Series Bootstrap: Standard bootstrap assumes i.i.d. data. For time series:
Block Bootstrap:

.. math::

   \text{Bootstrap Sample} = [B_1, B_2, \ldots, B_k]

where :math:`B_i` are overlapping blocks of length :math:`l`.
Stationary Bootstrap: Random block lengths with geometric distribution.
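Both time series schemes can be expressed as index generators. The sketch below assumes a moving block bootstrap with blocks drawn from all overlapping windows, and a stationary bootstrap with geometrically distributed block lengths and wrap-around at the end of the series. Function names and parameter values are hypothetical.

```python
import random

def moving_block_indices(n, block_length, rng):
    """Indices for a block bootstrap sample of length n, built from overlapping blocks."""
    indices = []
    while len(indices) < n:
        start = rng.randrange(n - block_length + 1)
        indices.extend(range(start, start + block_length))
    return indices[:n]

def stationary_indices(n, mean_block_length, rng):
    """Indices for a stationary bootstrap sample with geometric block lengths."""
    p = 1.0 / mean_block_length  # probability of starting a new block
    indices = [rng.randrange(n)]
    while len(indices) < n:
        if rng.random() < p:
            indices.append(rng.randrange(n))       # start a new block
        else:
            indices.append((indices[-1] + 1) % n)  # continue, wrapping around
    return indices

rng = random.Random(0)
mb = moving_block_indices(100, 10, rng)
sb = stationary_indices(100, 10, rng)
```

Applying the same index list to every variable (for example ``[x[i] for i in mb]`` and ``[y[i] for i in mb]``) keeps the series aligned within each bootstrap sample.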
Bootstrap-based Variable Selection
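Stability Selection: For each bootstrap sample, perform variable selection and compute the selection probability of each variable:

.. math::

   \Pi_j = \frac{1}{B} \sum_{b=1}^{B} \mathbb{1}\left[j \in \hat{\mathbf{S}}^{(b)}\right]

where :math:`\hat{\mathbf{S}}^{(b)}` is the set selected on bootstrap sample :math:`b`.

Select variables with :math:`\Pi_j \geq \pi_{\text{threshold}}` (typically 0.6 to 0.8).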
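Selection probabilities can be estimated as in the sketch below. The selector used here is a deliberately simple stand-in (a correlation threshold) and is not the oCSE procedure; it only illustrates how selection frequencies are accumulated. All names and threshold values are assumptions.

```python
import math
import random

def pearson(a, b):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)

def select(columns, y, min_abs_corr=0.3):
    """Stand-in selector (assumption): keep columns strongly correlated with y."""
    return {j for j, col in enumerate(columns)
            if abs(pearson(col, y)) >= min_abs_corr}

def selection_probabilities(columns, y, n_boot=100, seed=0):
    """Fraction of bootstrap samples in which each column is selected."""
    rng = random.Random(seed)
    n = len(y)
    counts = [0] * len(columns)
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        y_b = [y[i] for i in idx]
        cols_b = [[col[i] for i in idx] for col in columns]
        for j in select(cols_b, y_b):
            counts[j] += 1
    return [c / n_boot for c in counts]

# Synthetic data (assumption): only column 0 influences y.
rng = random.Random(1)
n = 200
columns = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(4)]
y = [columns[0][i] + rng.gauss(0, 1) for i in range(n)]

probs = selection_probabilities(columns, y)
stable = [j for j, p in enumerate(probs) if p >= 0.7]
```

In practice the call to ``select`` would be replaced by the full forward and backward selection, which makes the cost roughly :math:`B` times that of a single run.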
Theoretical Guarantees: Under appropriate conditions, stability selection provides FDR control:
Theoretical Guarantees
Consistency Properties
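Selection Consistency: An estimator :math:`\hat{\mathbf{S}}_n` of the true predictor set is selection consistent if:

.. math::

   \lim_{n \to \infty} P\left(\hat{\mathbf{S}}_n = \mathbf{S}_{\text{true}}\right) = 1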
Conditions for oCSE:

1. Information Estimator Consistency: :math:`\hat{I} \xrightarrow{P} I`
2. Significance Level Scaling: :math:`\alpha_n \to 0` appropriately
3. Sparsity: :math:`|\mathbf{S}_{\text{true}}| = o(n)`
4. Signal Strength: Minimum true CMI bounded away from 0
Estimation Error Bounds
For Gaussian estimators, the estimation error satisfies:

.. math::

   |\hat{I} - I| = O_P\left(\sqrt{\frac{d \log n}{n}}\right)

where :math:`d` is the effective dimensionality.
Implications for Causal Discovery:

- Need :math:`n \gg d \log n` for reliable estimation
- True relationships must have CMI significantly larger than :math:`\sqrt{\frac{d \log n}{n}}`
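To get a feel for this rate, the sketch below evaluates :math:`\sqrt{d \log n / n}` for a few sample sizes. The constant in front of the rate is not specified here, so the numbers are order-of-magnitude guides only; ``cmi_noise_floor`` is a hypothetical helper name.

```python
import math

def cmi_noise_floor(d, n):
    """Rate sqrt(d * log(n) / n), using the natural logarithm and no constant factor."""
    return math.sqrt(d * math.log(n) / n)

# Effective dimensionality d = 10 at increasing sample sizes.
floors = {n: cmi_noise_floor(10, n) for n in (100, 1000, 10000)}
```

For :math:`d = 10` the values are about 0.68, 0.26, and 0.10 nats, so a hundredfold increase in :math:`n` lowers the floor by less than a factor of ten because of the logarithmic term.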
High-Dimensional Theory
Conditions for :math:`p > n`: When the number of potential predictors exceeds the sample size:

- Sparsity: :math:`s = |\mathbf{S}_{\text{true}}| \ll n`
- Restricted Eigenvalue Condition: For information matrices
- Signal-to-Noise Ratio: True CMI values sufficiently large
Phase Transitions: In high-dimensional regimes, there are sharp phase transitions where selection becomes possible or impossible depending on the scaling of :math:`n`, :math:`p`, and :math:`s`.
Power Analysis
Theoretical Power
The power of a conditional independence test is the probability of rejecting :math:`H_0` when :math:`H_1` is true:

.. math::

   \text{Power} = P(\text{reject } H_0 \mid H_1 \text{ is true}) = 1 - \beta

Factors Affecting Power:

- Effect Size: Larger true CMI increases power
- Sample Size: Power increases with :math:`n`
- Dimensionality: Higher dimensions reduce power (curse of dimensionality)
- Information Estimator: Different estimators have different power characteristics
Sample Size Calculations
Rule of Thumb for Gaussian Estimator: To detect CMI of size :math:`\delta` with power :math:`1-\beta`:
where :math:`d` is the effective dimensionality.
Simulation-Based Power Analysis:

1. Specify effect sizes of interest
2. Generate synthetic data under the alternative hypothesis
3. Apply the testing procedure and compute empirical power
4. Repeat for different sample sizes to find the required :math:`n`
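The four steps can be sketched as a small simulation. This is an illustration under stated assumptions: scalar variables, a linear Gaussian alternative :math:`Y = cX + \varepsilon`, an unconditional Gaussian mutual information estimator, and an asymptotic chi-square test with one degree of freedom in place of a permutation test to keep the run time short. All names are hypothetical.

```python
import math
import random

def gaussian_mi(x, y):
    """Gaussian mutual information (nats) of two scalar samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return -0.5 * math.log(1 - sxy ** 2 / (sxx * syy))

def rejects(x, y, alpha):
    """Asymptotic test (assumption): 2 * n * I_hat ~ chi-square(1) under independence."""
    stat = 2 * len(x) * gaussian_mi(x, y)
    return math.erfc(math.sqrt(stat / 2)) < alpha

def empirical_power(n, coupling, n_trials=200, alpha=0.05, seed=0):
    """Fraction of simulated datasets in which the null is rejected."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        x = [rng.gauss(0, 1) for _ in range(n)]
        y = [coupling * xi + rng.gauss(0, 1) for xi in x]
        if rejects(x, y, alpha):
            hits += 1
    return hits / n_trials

# Step 1: effect size (coupling 0.3); steps 2-4: simulate across sample sizes.
power_by_n = {n: empirical_power(n, coupling=0.3) for n in (25, 50, 100, 200)}
false_positive_rate = empirical_power(50, coupling=0.0)
```

Running the same function with ``coupling=0.0`` gives the empirical Type I error, which is a useful sanity check on the test before reading off the required :math:`n` from ``power_by_n``.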
Robustness and Sensitivity
Robustness to Outliers
Impact of Outliers: Information estimators vary in sensitivity to outliers:

- Gaussian: Highly sensitive (based on sample covariance)
- k-NN: Moderately sensitive (distance-based)
- Histogram: Least sensitive (discretization reduces impact)

Robust Estimators:

- Trimmed estimators: Remove extreme observations
- M-estimators: Downweight outliers in computation
- Robust covariance: Use robust estimates in Gaussian methods
Model Misspecification
Gaussian Assumption Violations: When the data are non-Gaussian but a Gaussian estimator is used:

- May detect only linear relationships
- Power reduced for nonlinear dependencies
- Type I error control generally maintained

Non-stationarity: Time-varying relationships violate stationarity assumptions. Possible remedies:

- Use adaptive window methods
- Apply tests for structural breaks
- Consider time-varying parameter models
Sensitivity Analysis
Parameter Sensitivity: Assess robustness to hyperparameter choices:

- Information estimator parameters (bandwidth, k)
- Significance levels (:math:`\alpha`)
- Maximum lag (:math:`\tau_{max}`)
Cross-Validation: Use held-out data to validate discovered relationships:
Practical Guidelines
Sample Size Requirements
Minimum Sample Sizes by Estimator:
| Estimator | Low Dim (d ≤ 5) | Medium Dim (5 < d ≤ 20) | High Dim (d > 20) |
|---|---|---|---|
| Gaussian | n ≥ 50 | n ≥ 100 | n ≥ 500 |
| k-NN | n ≥ 100 | n ≥ 500 | n ≥ 1000+ |
| KDE | n ≥ 200 | n ≥ 1000 | Not recommended |
Significance Level Selection
Forward Selection: Use a more stringent :math:`\alpha` to control false positives:

- Conservative: :math:`\alpha = 0.01`
- Standard: :math:`\alpha = 0.05`
- Liberal: :math:`\alpha = 0.10` (exploratory analysis)

Backward Elimination: Can use a less stringent :math:`\alpha` for pruning:

- Typical: :math:`\alpha_{\text{backward}} = 1.5 \times \alpha_{\text{forward}}`
Multiple Testing: Always apply appropriate corrections when testing multiple relationships simultaneously.
Diagnostic Procedures
Model Checking
Residual Analysis: After variable selection, examine residuals for:

- Independence (serial correlation tests)
- Normality (if using Gaussian methods)
- Heteroscedasticity
Information Criteria: Compare model performance using information-theoretic criteria:
Stability Analysis
Bootstrap Stability: Assess selection stability across bootstrap samples:
Cross-Validation Stability: Use K-fold CV to assess robustness to data splitting.
Future Directions
Methodological Advances
- Adaptive Testing: Data-driven significance level selection
- Sequential FDR: Improved multiple testing for sequential selection
- Robust Information Measures: Estimators less sensitive to outliers
- High-Dimensional Theory: Better understanding of :math:`p \gg n` regimes
- Causal-Specific Tests: Tests designed specifically for causal relationships
Computational Improvements
- Parallel Testing: Efficient parallel algorithms for permutation tests
- Approximate Methods: Fast approximate significance testing
Conclusion
The statistical foundations of optimal Causation Entropy (oCSE) provide the theoretical framework for reliable causal discovery. Key principles include:
- Rigorous Hypothesis Testing: All causal claims should be statistically validated
- Multiple Testing Awareness: Control for multiple comparisons when testing many relationships
- Bootstrap Methods: Use resampling for uncertainty quantification and stability assessment
- Power Considerations: Ensure sufficient sample sizes for reliable detection
- Robustness Checks: Validate methods across different assumptions and parameter choices
Understanding these statistical foundations is crucial for proper application and interpretation of causal discovery results. Practitioners should always validate their findings through appropriate statistical testing and sensitivity analysis.