Standard Causal Entropy
Standard optimal Causation Entropy (standard oCSE) is the canonical implementation of the causation entropy framework. The method starts from an initial conditioning set, typically lagged values of the target variable, and builds the causal predictor set in two phases: forward selection followed by backward elimination.
Mathematical Foundation
The standard oCSE algorithm is built around the conditional mutual information measure:
\[I(X_j^{(t-\tau)}; X_i^{(t)} | \mathbf{Z}_i^{(t)})\]
where:
- \(X_j^{(t-\tau)}\) is the candidate predictor variable \(j\) at lag \(\tau\)
- \(X_i^{(t)}\) is the target variable \(i\) at the current time
- \(\mathbf{Z}_i^{(t)} = \mathbf{Z}_{\text{init}} \cup \mathbf{S}_i^{(t)}\) is the conditioning set

The conditioning set consists of two components:
- \(\mathbf{Z}_{\text{init}}\): Initial conditioning variables (usually lagged target values)
- \(\mathbf{S}_i^{(t)}\): Previously selected predictor variables
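For jointly Gaussian variables, conditional mutual information has a closed form in terms of covariance log-determinants. The block below is a minimal sketch of such an estimator under that assumption, not the package's own implementation; the name `gaussian_cmi` and its argument layout are hypothetical and chosen only for illustration.

```python
import numpy as np


def gaussian_cmi(x, y, z):
    """Gaussian estimate of I(x; y | z) in nats (illustrative helper).

    x, y, z are 2-D arrays with one row per time step; z may have zero columns.
    Uses I = 0.5 * [ln|S_xz| + ln|S_yz| - ln|S_z| - ln|S_xyz|],
    where S_* are sample covariance matrices.
    """
    def logdet_cov(*blocks):
        data = np.hstack(blocks)
        if data.shape[1] == 0:
            return 0.0  # determinant of an empty covariance matrix is 1
        cov = np.atleast_2d(np.cov(data, rowvar=False))
        return np.linalg.slogdet(cov)[1]

    return 0.5 * (logdet_cov(x, z) + logdet_cov(y, z)
                  - logdet_cov(z) - logdet_cov(x, y, z))


# Synthetic check: x carries information about y_driven beyond z, but not about y_indep.
rng = np.random.default_rng(0)
T = 5000
z = rng.normal(size=(T, 1))                   # conditioning variable
x = rng.normal(size=(T, 1))                   # candidate predictor
y_driven = x + z + rng.normal(size=(T, 1))    # depends on x given z
y_indep = z + rng.normal(size=(T, 1))         # depends on z only

cmi_driven = gaussian_cmi(x, y_driven, z)     # theoretical value is 0.5 * ln(2)
cmi_indep = gaussian_cmi(x, y_indep, z)       # theoretical value is 0
print(cmi_driven, cmi_indep)
```

In the first case the conditional variance of the target halves once `x` is known, which is where the theoretical value of 0.5 ln 2 nats comes from.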
Algorithm Description
The standard oCSE algorithm proceeds in two main phases:
Phase 1: Forward Selection with Initial Conditioning
Input:
- Time series data \(\mathbf{X} \in \mathbb{R}^{T \times n}\)
- Maximum lag \(\tau_{\max}\)
- Significance level \(\alpha_{\text{forward}}\)
- Number of permutations \(N_{\text{perm}}\)
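Initialization: For each target variable \(i\), construct the initial conditioning set, typically from the target's own lagged values:
\[\mathbf{Z}_{\text{init},i} = \{X_i^{(t-1)}, X_i^{(t-2)}, \ldots, X_i^{(t-\tau_{\max})}\}\]
This incorporates the autoregressive structure of the target variable.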
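The initialization can be sketched as an array construction. The example below assumes the initial conditioning set holds the target's own lags 1 through \(\tau_{\max}\) and that candidates are the lagged values of all other variables; the helper name `build_lagged_design` and its return layout are hypothetical.

```python
import numpy as np


def build_lagged_design(data, target, tau_max):
    """Build target, initial conditioning set, and candidate matrix (illustrative).

    data : array (T, n) with n >= 2; rows are time steps.
    Returns Y (T - tau_max, 1), Z_init (T - tau_max, tau_max),
    X_lagged (T - tau_max, (n - 1) * tau_max) and a list of (variable, lag) labels.
    """
    T, n = data.shape

    def lagged(j, tau):
        # Value of variable j at time t - tau, aligned with t = tau_max, ..., T - 1
        return data[tau_max - tau:T - tau, j]

    Y = data[tau_max:, [target]]
    Z_init = np.column_stack([lagged(target, tau) for tau in range(1, tau_max + 1)])
    labels = [(j, tau) for j in range(n) if j != target
              for tau in range(1, tau_max + 1)]
    X_lagged = np.column_stack([lagged(j, tau) for j, tau in labels])
    return Y, Z_init, X_lagged, labels


data = np.arange(20, dtype=float).reshape(10, 2)  # toy series: T = 10, n = 2
Y, Z_init, X_lagged, labels = build_lagged_design(data, target=0, tau_max=2)
print(Y.shape, Z_init.shape, X_lagged.shape, labels)
```

The first \(\tau_{\max}\) time steps are dropped so that every row has a complete set of lags.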
Forward Selection Loop:
Candidate Evaluation: For each remaining candidate predictor \(X_j^{(t-\tau)}\):
\[\text{CMI}_{j,\tau} = I(X_j^{(t-\tau)}; X_i^{(t)} | \mathbf{Z}_i^{(t)})\]
Best Candidate Selection: Choose the predictor with maximum conditional mutual information:
\[(j^*, \tau^*) = \arg\max_{j,\tau} \text{CMI}_{j,\tau}\]
Significance Testing: Perform a permutation test for \(H_0: I(X_{j^*}^{(t-\tau^*)}; X_i^{(t)} | \mathbf{Z}_i^{(t)}) = 0\):
- Generate \(N_{\text{perm}}\) permutations of \(X_{j^*}^{(t-\tau^*)}\)
- Compute null distribution: \(\{\text{CMI}_{\text{perm}}^{(k)}\}_{k=1}^{N_{\text{perm}}}\)
- Determine threshold: \(\theta = \text{percentile}(\{\text{CMI}_{\text{perm}}^{(k)}\}, 100(1-\alpha_{\text{forward}}))\)

Selection Decision:
\[\text{Accept } X_{j^*}^{(t-\tau^*)} \text{ if } \text{CMI}_{j^*,\tau^*} \geq \theta\]
If the candidate is rejected, the loop terminates.
Conditioning Set Update: If accepted, update the conditioning set and repeat the loop with the remaining candidates:
\[\mathbf{Z}_i^{(t)} \leftarrow \mathbf{Z}_i^{(t)} \cup \{X_{j^*}^{(t-\tau^*)}\}\]
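The significance test and selection decision can be sketched as follows. This is a minimal illustration, not the package's implementation: `permutation_threshold` is a hypothetical helper, `cmi_fn` stands for any estimator with signature `cmi_fn(x, y, z)`, and `partial_corr_cmi` is a Gaussian stand-in used only to make the example runnable.

```python
import numpy as np


def partial_corr_cmi(x, y, z):
    """Gaussian CMI in nats from the partial correlation of x and y given z."""
    design = np.column_stack([np.ones(len(x)), z])
    rx = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    ry = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    r = np.corrcoef(rx.ravel(), ry.ravel())[0, 1]
    return -0.5 * np.log(1.0 - r ** 2)


def permutation_threshold(x, y, z, cmi_fn, alpha=0.05, n_perm=200, seed=0):
    """Return (observed CMI, threshold theta, accept flag) for one candidate."""
    rng = np.random.default_rng(seed)
    # Null distribution: shuffle the candidate in time, keep y and z fixed
    null = np.array([cmi_fn(rng.permutation(x), y, z) for _ in range(n_perm)])
    theta = np.percentile(null, 100 * (1 - alpha))
    observed = cmi_fn(x, y, z)
    return observed, theta, observed >= theta


rng = np.random.default_rng(1)
T = 500
z = rng.normal(size=(T, 1))            # stand-in for the current conditioning set
x_true = rng.normal(size=(T, 1))       # candidate that drives y
x_noise = rng.normal(size=(T, 1))      # candidate unrelated to y
y = 0.8 * x_true + z + rng.normal(size=(T, 1))

obs_true, theta_true, accept_true = permutation_threshold(x_true, y, z, partial_corr_cmi)
obs_noise, theta_noise, accept_noise = permutation_threshold(x_noise, y, z, partial_corr_cmi)
print(obs_true, theta_true, accept_true)
print(obs_noise, theta_noise, accept_noise)
```

Shuffling only the candidate preserves the relationship between the target and the conditioning set while breaking any link between the candidate and the target, which is what the null hypothesis requires.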
Phase 2: Backward Elimination
Objective: Remove spurious predictors that may have been selected due to transitivity or confounding effects.
Backward Elimination Loop: For each selected predictor \(X_j^{(t-\tau)} \in \mathbf{S}_i\) (in random order):
Conditioning Set Construction:
\[\mathbf{Z}_{-j} = \mathbf{Z}_{\text{init},i} \cup (\mathbf{S}_i \setminus \{X_j^{(t-\tau)}\})\]
Conditional Mutual Information:
\[\text{CMI}_{j,\tau} = I(X_j^{(t-\tau)}; X_i^{(t)} | \mathbf{Z}_{-j})\]
Significance Testing: Test \(H_0: I(X_j^{(t-\tau)}; X_i^{(t)} | \mathbf{Z}_{-j}) = 0\) with the same permutation procedure as in Phase 1, using significance level \(\alpha_{\text{backward}}\) to obtain the threshold \(\theta_{\text{backward}}\).
Elimination Decision:
\[\text{Remove } X_j^{(t-\tau)} \text{ if } \text{CMI}_{j,\tau} < \theta_{\text{backward}}\]
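A minimal sketch of the elimination loop, under stated assumptions: `backward_elimination` is a hypothetical helper, predictors are identified by column index in a lagged matrix `X`, the selected set shrinks as predictors are removed, and `partial_corr_cmi` is a Gaussian stand-in for whichever estimator is actually used.

```python
import numpy as np


def partial_corr_cmi(x, y, z):
    """Gaussian CMI in nats from the partial correlation of x and y given z."""
    design = np.column_stack([np.ones(len(x)), z])
    rx = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    ry = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    r = np.corrcoef(rx.ravel(), ry.ravel())[0, 1]
    return -0.5 * np.log(1.0 - r ** 2)


def backward_elimination(X, Y, Z_init, selected, cmi_fn,
                         alpha=0.05, n_perm=200, seed=0):
    """Drop selected columns of X that are not significant given all the others."""
    rng = np.random.default_rng(seed)
    kept = list(selected)
    for j in rng.permutation(selected):          # visit predictors in random order
        others = [k for k in kept if k != j]
        Z_minus_j = np.hstack([Z_init, X[:, others]])
        x_j = X[:, [j]]
        observed = cmi_fn(x_j, Y, Z_minus_j)
        null = [cmi_fn(rng.permutation(x_j), Y, Z_minus_j) for _ in range(n_perm)]
        theta = np.percentile(null, 100 * (1 - alpha))
        if observed < theta:
            kept.remove(j)                       # not significant: eliminate
    return kept


rng = np.random.default_rng(2)
T = 1000
driver = rng.normal(size=T)
proxy = driver + rng.normal(size=T)    # related to the target only through driver
X = np.column_stack([driver, proxy])
Y = (driver + 0.5 * rng.normal(size=T)).reshape(-1, 1)
Z_init = rng.normal(size=(T, 1))       # stand-in for lagged target values

kept = backward_elimination(X, Y, Z_init, selected=[0, 1], cmi_fn=partial_corr_cmi)
print(kept)
```

In this toy setup column 0 is the true driver, so it should survive elimination, while column 1 is the kind of transitive predictor this phase is designed to remove; whether it is actually removed in a given run depends on the permutation draw and the chosen significance level.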
Key Properties
Initial Conditioning Benefits
The initial conditioning set \(\mathbf{Z}_{\text{init}}\) provides several advantages:
Autoregressive Control: Controls for the natural temporal dependencies in the target variable
Enhanced Specificity: Identifies predictors that provide information beyond autoregressive patterns
Confounding Mitigation: Reduces spurious relationships due to common trends or cycles
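Mathematical Formulation:
\[I(X_j^{(t-\tau)}; X_i^{(t)} | \mathbf{Z}_{\text{init}}) = H(X_i^{(t)} | \mathbf{Z}_{\text{init}}) - H(X_i^{(t)} | \mathbf{Z}_{\text{init}}, X_j^{(t-\tau)})\]
where \(H(\cdot | \cdot)\) denotes conditional entropy. This measures the additional predictive information provided by \(X_j^{(t-\tau)}\) beyond what is already captured by the autoregressive terms.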
Conditioning Set Evolution
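The conditioning set evolves as:
\[\mathbf{Z}_i^{(0)} = \mathbf{Z}_{\text{init}}, \qquad \mathbf{Z}_i^{(k+1)} = \mathbf{Z}_i^{(k)} \cup \{X_{j^*}^{(t-\tau^*)}\}\]
where \(X_{j^*}^{(t-\tau^*)}\) is the predictor selected at iteration \(k+1\).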
Information-Theoretic Interpretation
The standard oCSE framework can be understood through the lens of information decomposition. For a target variable \(X_i^{(t)}\) with autoregressive history \(\mathbf{H}_i\) and external predictor \(X_j^{(t-\tau)}\):
\[I(X_j^{(t-\tau)}; X_i^{(t)}) = I(X_j^{(t-\tau)}; X_i^{(t)} | \mathbf{H}_i) + I(X_j^{(t-\tau)}; X_i^{(t)}; \mathbf{H}_i)\]
The first term represents the direct causal influence, while the second, the interaction information among the three variables, represents information shared with the autoregressive structure. Standard oCSE focuses on the first term.
Advantages and Limitations
Advantages
Autoregressive Control: Explicitly accounts for temporal dependencies in the target
Theoretical Foundation: Grounded in information theory with clear interpretations
Flexible Information Measures: Supports various entropy estimators (Gaussian, k-NN, KDE, etc.)
Statistical Rigor: Permutation-based significance testing controls false positives
Multivariate Conditioning: Properly handles confounding through conditioning sets
Limitations
Computational Complexity: \(O(n^2 \tau_{\max} N_{\text{perm}} T)\) scaling
Initial Conditioning Assumption: Assumes autoregressive structure is relevant
Greedy Selection: Forward selection may miss globally optimal solutions
Sample Size Requirements: Information estimators require sufficient data
Stationarity Assumptions: Most effective on stationary time series
Implementation Considerations
Hyperparameter Selection
Significance Levels:
- \(\alpha_{\text{forward}}\): Controls Type I error in forward selection (typically 0.01-0.05)
- \(\alpha_{\text{backward}}\): Controls Type I error in backward elimination (typically 0.05-0.10)

Maximum Lag:
- Should reflect domain knowledge of system dynamics
- Computational cost scales linearly with \(\tau_{\max}\)
- Rule of thumb: \(\tau_{\max} \approx \sqrt{T}\) for exploratory analysis

Permutation Count:
- Minimum 100 for rough estimates
- 1000+ for publication-quality significance tests
- Sampling error of the estimated significance level scales as \(1/\sqrt{N_{\text{perm}}}\)
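The effect of the permutation count can be made concrete with a short calculation. The sketch below uses the binomial standard error of an estimated p-value and the common convention that the smallest attainable permutation p-value is \(1/(N_{\text{perm}}+1)\); the function name is hypothetical.

```python
import math


def permutation_pvalue_resolution(n_perm, p=0.05):
    """Smallest attainable p-value and binomial standard error near a true p-value p."""
    smallest_p = 1.0 / (n_perm + 1)
    std_error = math.sqrt(p * (1.0 - p) / n_perm)
    return smallest_p, std_error


resolution = {n: permutation_pvalue_resolution(n) for n in (100, 1000, 10000)}
for n, (smallest_p, std_error) in resolution.items():
    print(f"N_perm={n:>5}: smallest p = {smallest_p:.5f}, "
          f"std. error near p=0.05 = {std_error:.4f}")
```

Going from 100 to 10000 permutations shrinks the standard error by a factor of 10, which is why a tenfold gain in precision costs a hundredfold increase in permutations.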
Information Estimator Choice
| Data Type | Recommended Estimator | Notes |
|---|---|---|
| Continuous Gaussian | Gaussian | Exact under normality assumption |
| Continuous Non-Gaussian | k-NN or KDE | k-NN more robust to dimensionality |
| Mixed/Discrete | Histogram or k-NN | Careful binning for histogram |
| High-Dimensional | Geometric k-NN | Accounts for manifold structure |
| Small Sample | Gaussian (if appropriate) | Parametric methods more sample-efficient |
Example Implementation
Here’s a conceptual implementation of the standard oCSE forward selection:
import numpy as np


def standard_forward_selection(X, Y, Z_init, alpha=0.05, n_shuffles=200):
    """
    Standard oCSE forward selection with initial conditioning.

    Parameters
    ----------
    X : array (T, n*tau_max)
        Lagged predictor matrix
    Y : array (T, 1)
        Target variable
    Z_init : array (T, p)
        Initial conditioning set
    alpha : float
        Significance level of the permutation test
    n_shuffles : int
        Number of permutations per test

    Returns
    -------
    selected : list of int
        Column indices of X accepted as predictors, in selection order
    """
    # conditional_mutual_information and permutation_test are assumed to be
    # defined elsewhere (CMI estimator and permutation-based significance test).
    n_predictors = X.shape[1]
    selected = []
    Z_current = Z_init.copy()

    while True:
        # Evaluate remaining candidates
        remaining = [i for i in range(n_predictors) if i not in selected]
        if not remaining:
            break

        cmi_values = []
        for j in remaining:
            X_j = X[:, [j]]
            cmi = conditional_mutual_information(X_j, Y, Z_current)
            cmi_values.append(cmi)

        # Select best candidate
        best_idx = remaining[np.argmax(cmi_values)]
        best_cmi = max(cmi_values)

        # Significance test
        X_best = X[:, [best_idx]]
        significant = permutation_test(X_best, Y, Z_current,
                                       best_cmi, alpha, n_shuffles)
        if not significant:
            break

        # Accept and update the conditioning set
        selected.append(best_idx)
        Z_current = np.hstack([Z_current, X_best])

    return selected
Comparison with Alternative Methods
The standard oCSE can be compared with its variants:
| Method | Initial Conditioning | Advantages | Use Cases |
|---|---|---|---|
| Standard oCSE | Yes (lagged target) | Controls autoregression | Time series with strong temporal structure |
| Alternative oCSE | No | Simpler, fewer assumptions | Exploratory analysis, weak temporal structure |
| Information LASSO | Variable | Handles high dimensions | Large predictor spaces |
| Pure LASSO | No | Computationally efficient | Linear relationships, benchmarking |
Theoretical Connections
The standard oCSE framework connects to several established concepts:
Granger Causality: Standard oCSE generalizes linear Granger causality to information-theoretic measures with flexible conditioning.
Transfer Entropy: Related but distinct. Transfer entropy typically uses uniform conditioning across all variables, while oCSE uses targeted conditioning sets.
Partial Correlation: With the Gaussian estimator, standard oCSE is closely related to partial correlation analysis; with nonparametric estimators it extends to nonlinear relationships.
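The link to partial correlation can be written out explicitly. For jointly Gaussian scalar variables \(X\) and \(Y\) with conditioning set \(\mathbf{Z}\), conditional mutual information is a monotone function of the partial correlation \(\rho_{XY \cdot \mathbf{Z}}\) (the symbols here are generic and chosen for illustration):

```latex
\[
I(X; Y \mid \mathbf{Z}) = -\tfrac{1}{2} \ln\!\left(1 - \rho_{XY \cdot \mathbf{Z}}^{2}\right)
\]
\[
\rho_{XY \cdot \mathbf{Z}} = 0.5
\;\Rightarrow\;
I(X; Y \mid \mathbf{Z}) = -\tfrac{1}{2} \ln(0.75) \approx 0.144 \text{ nats}
\]
```

Ranking candidates by Gaussian conditional mutual information is therefore equivalent to ranking them by the magnitude of their partial correlation with the target.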
Conclusion
Standard oCSE provides a principled, information-theoretic approach to causal discovery that explicitly accounts for autoregressive structure in time series data. The method’s strength lies in its theoretical foundation, flexibility in information measures, and rigorous statistical testing. However, users should be aware of its computational requirements and the assumptions inherent in the initial conditioning approach.
The method is particularly well-suited for time series with strong temporal dependencies where controlling for autoregressive effects is crucial for accurate causal inference.