Alternative Causal Entropy
Alternative optimal causation entropy (alternative oCSE) is a simplified variant of the oCSE framework in which forward selection begins with an empty conditioning set rather than one pre-filled with autoregressive terms. The approach is more exploratory: relationships are built from the data alone, without prior assumptions about autoregressive structure.
Mathematical Foundation
The alternative oCSE algorithm uses the same conditional mutual information (CMI) measure as standard oCSE, but with a different conditioning strategy.
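For reference, conditional mutual information can be written as a difference of conditional entropies. In the notation used in the rest of this section (candidate \(X_j^{(t-\tau)}\), target \(X_i^{(t)}\), conditioning set \(\mathbf{Z}_i\)), the standard definition is:

\[I(X_j^{(t-\tau)}; X_i^{(t)} | \mathbf{Z}_i) = H(X_i^{(t)} | \mathbf{Z}_i) - H(X_i^{(t)} | X_j^{(t-\tau)}, \mathbf{Z}_i)\]

With \(\mathbf{Z}_i = \emptyset\) this reduces to the marginal mutual information \(H(X_i^{(t)}) - H(X_i^{(t)} | X_j^{(t-\tau)})\).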
The key difference is in how the conditioning set evolves:
- Standard oCSE: \(\mathbf{Z}_i^{(t)} = \mathbf{Z}_{\text{init}} \cup \mathbf{S}_i^{(t)}\)
- Alternative oCSE: \(\mathbf{Z}_i^{(t)} = \mathbf{S}_i^{(t)}\) (no initial conditioning)
This means the algorithm starts with marginal mutual information and builds conditional dependencies organically through the selection process.
Algorithm Description
The alternative oCSE follows the same two-phase structure as standard oCSE but with different initialization and conditioning logic.
Phase 1: Forward Selection without Initial Conditioning
Initialization: For each target variable \(i\):
- Selected predictors: \(\mathbf{S}_i = \emptyset\)
- Conditioning set: \(\mathbf{Z}_i = \emptyset\)
Iteration k=1 (Marginal Selection):
Marginal Mutual Information: For each candidate predictor \(X_j^{(t-\tau)}\):
\[\text{MI}_{j,\tau} = I(X_j^{(t-\tau)}; X_i^{(t)}) = H(X_i^{(t)}) - H(X_i^{(t)} | X_j^{(t-\tau)})\]
Best Candidate Selection:
\[(j^*, \tau^*) = \arg\max_{j,\tau} \text{MI}_{j,\tau}\]
Significance Testing: Test \(H_0: I(X_{j^*}^{(t-\tau^*)}; X_i^{(t)}) = 0\)
First Selection: If significant, \(\mathbf{S}_i \leftarrow \{X_{j^*}^{(t-\tau^*)}\}\)
Subsequent Iterations (k≥2):
Conditional Mutual Information: For remaining candidates:
\[\text{CMI}_{j,\tau} = I(X_j^{(t-\tau)}; X_i^{(t)} | \mathbf{S}_i)\]
Selection and Conditioning Update: Following the same logic as standard oCSE
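The MI and CMI terms in Phase 1 need an estimator. A minimal sketch, assuming the variables are jointly Gaussian so that both quantities reduce to log-determinants of sample covariance matrices. The Gaussian assumption and the names `gaussian_cmi` and `_logdet_cov` are illustrative choices made here, not something this section prescribes:

```python
import numpy as np


def _logdet_cov(*blocks):
    """Log-determinant of the sample covariance of column-stacked arrays."""
    data = np.hstack(blocks)
    cov = np.atleast_2d(np.cov(data, rowvar=False))
    _, logdet = np.linalg.slogdet(cov)
    return logdet


def gaussian_cmi(x, y, z=None):
    """I(X; Y | Z) in nats under a joint-Gaussian assumption.

    x, y, z are arrays of shape (T, d). With z=None the function
    returns the marginal mutual information I(X; Y).
    """
    if z is None:
        return 0.5 * (_logdet_cov(x) + _logdet_cov(y) - _logdet_cov(x, y))
    return 0.5 * (_logdet_cov(x, z) + _logdet_cov(y, z)
                  - _logdet_cov(z) - _logdet_cov(x, y, z))
```

Calling `gaussian_cmi(x, y)` corresponds to the first iteration (empty conditioning set), and `gaussian_cmi(x, y, z)` to later iterations, where `z` holds the columns selected so far. For strongly non-Gaussian data a nonparametric estimator would likely be a better fit.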
Phase 2: Backward Elimination
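Identical to standard oCSE, except that the conditioning set for elimination contains only the other selected predictors: each \(X_j^{(t-\tau)} \in \mathbf{S}_i\) is retained only if \(I(X_j^{(t-\tau)}; X_i^{(t)} | \mathbf{S}_i \setminus \{X_j^{(t-\tau)}\})\) is significant.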
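The elimination pass can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: `backward_elimination` is a hypothetical name, and `is_significant(x_j, Y, Z)` is a hypothetical callable that returns True when \(I(x_j; Y | Z)\) passes the significance test (with `Z=None` for an empty conditioning set).

```python
import numpy as np


def backward_elimination(X, Y, selected, is_significant):
    """Drop selected predictors that are redundant given the others.

    X : array (T, n_predictors), Y : array (T, 1)
    selected : list of column indices from forward selection
    is_significant : hypothetical callable (x_j, Y, Z) -> bool
    """
    kept = list(selected)
    for j in list(selected):
        # Condition only on the other predictors that are still kept
        others = [k for k in kept if k != j]
        Z = X[:, others] if others else None
        if not is_significant(X[:, [j]], Y, Z):
            kept.remove(j)
    return kept
```

Because removals take effect immediately, a predictor tested later is conditioned only on the predictors that survived earlier tests.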
Key Algorithmic Differences
Conditioning Set Evolution
The conditioning set starts empty and grows by one predictor per accepted selection, without prior assumptions.
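One way to write this growth, using the iteration index \(k\) from the algorithm description as a superscript (a notational choice assumed here), is:

\[\mathbf{Z}_i^{(0)} = \emptyset, \qquad \mathbf{Z}_i^{(k)} = \mathbf{Z}_i^{(k-1)} \cup \{X_{j_k^*}^{(t-\tau_k^*)}\} = \mathbf{S}_i^{(k)}\]

where \((j_k^*, \tau_k^*)\) denotes the candidate accepted at iteration \(k\). If the candidate at iteration \(k\) is not significant, forward selection stops and the set is left unchanged.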
Information-Theoretic Interpretation
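The alternative approach can be understood through the chain rule of mutual information, which decomposes the joint information that the selected predictors carry about the target into a sum of conditional terms, one per selection step.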
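Writing \(X_{s_k}\) as shorthand for the lagged predictor selected at iteration \(k\) (a notation assumed here for brevity), the chain rule for \(K\) selected predictors reads:

\[I(X_{s_1}, \dots, X_{s_K}; X_i^{(t)}) = \sum_{k=1}^{K} I(X_{s_k}; X_i^{(t)} | X_{s_1}, \dots, X_{s_{k-1}})\]

For \(k = 1\) the conditioning set is empty, so the first term is the marginal mutual information \(I(X_{s_1}; X_i^{(t)})\).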
The algorithm greedily maximizes each term in this decomposition, building the joint information incrementally. This provides a different perspective compared to standard oCSE, which conditions on autoregressive structure from the start.
First Selection: Marginal Mutual Information
The first selected predictor maximizes marginal mutual information, \((j^*, \tau^*) = \arg\max_{j,\tau} I(X_j^{(t-\tau)}; X_i^{(t)})\).
This captures the strongest unconditional relationship, which may include autoregressive effects if they are the dominant signal.
Subsequent Selections: Conditional Uniqueness
Later selections maximize conditional mutual information given the predictors already selected, \(I(X_j^{(t-\tau)}; X_i^{(t)} | \mathbf{S}_i)\).
This ensures each new predictor provides unique information not already captured by previously selected variables.
Advantages and Limitations
Advantages
No Prior Assumptions: Does not assume autoregressive structure is important
Exploratory Discovery: May find unexpected relationships not constrained by temporal assumptions
Simpler Implementation: Fewer parameters and initialization steps
Computational Efficiency: Slightly faster due to smaller initial conditioning sets
Data-Driven: Relationships emerge purely from information content in data
Interpretability: First selection shows strongest marginal relationship
Limitations
Confounding Risk: Without autoregressive control, may select spurious relationships
Order Dependence: Early selections heavily influence later conditioning
Transitivity Issues: May select indirect relationships as direct causes
Temporal Structure Ignored: Does not explicitly account for time series nature
Higher False Positive Risk: Less conservative than standard approach
When to Use Alternative oCSE
Recommended Scenarios
Exploratory Analysis: Initial investigation of unknown systems
Cross-Sectional Data: When temporal structure is less important
Weak Autoregressive Systems: Time series without strong temporal dependencies
Comparative Studies: Baseline for comparison with standard oCSE
High-Dimensional Systems: When autoregressive conditioning becomes prohibitive
Non-Temporal Networks: Spatial or other non-temporal relationship discovery
Comparison with Standard oCSE
Consider a simple three-variable system \(X_1, X_2, X_3\) with a maximum lag of 1, in which \(X_2\) causally drives \(X_1\).
Standard oCSE (Target: \(X_1^{(t)}\)):
Initial conditioning: \(\mathbf{Z}_{\text{init}} = \{X_1^{(t-1)}\}\)
Evaluate: \(I(X_2^{(t-1)}; X_1^{(t)} | X_1^{(t-1)})\) and \(I(X_3^{(t-1)}; X_1^{(t)} | X_1^{(t-1)})\)
Likely selects \(X_2^{(t-1)}\) due to causal relationship
Alternative oCSE (Target: \(X_1^{(t)}\)):
No initial conditioning: \(\mathbf{Z} = \emptyset\)
Evaluate: \(I(X_1^{(t-1)}; X_1^{(t)})\), \(I(X_2^{(t-1)}; X_1^{(t)})\), \(I(X_3^{(t-1)}; X_1^{(t)})\)
Likely selects \(X_1^{(t-1)}\) first (strongest marginal relationship)
Then evaluates \(I(X_2^{(t-1)}; X_1^{(t)} | X_1^{(t-1)})\) and \(I(X_3^{(t-1)}; X_1^{(t)} | X_1^{(t-1)})\)
Selects \(X_2^{(t-1)}\) second
Both methods may reach the same final result, but through different paths and with different interpretations of the relationships.
Implementation Considerations
Hyperparameter Differences
Significance Levels:
- May require more conservative \(\alpha\) values due to increased multiple testing
- Consider Bonferroni or FDR corrections for the first (marginal) selection phase
Information Estimator Selection:
- Same considerations as standard oCSE
- May benefit from robust estimators due to less initial regularization
Stopping Criteria:
- Consider earlier stopping due to increased false positive risk
- Monitor conditioning set size to prevent overfitting
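The two corrections mentioned under Significance Levels can be sketched in a few lines. This is a minimal illustration assuming one p-value per marginal candidate has already been computed. The function names are hypothetical, and the FDR procedure shown is the Benjamini-Hochberg step-up rule, one common choice among several:

```python
import numpy as np


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise error at alpha."""
    return alpha / n_tests


def benjamini_hochberg(p_values, alpha=0.05):
    """Boolean mask of hypotheses rejected at FDR level alpha (BH step-up)."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # The k-th smallest p-value is compared against alpha * k / m
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        # Reject everything up to the largest rank that passes its threshold
        k = np.max(np.nonzero(below)[0])
        reject[order[: k + 1]] = True
    return reject
```

Bonferroni is the more conservative of the two. With many lagged candidates in the marginal phase, it may remove weak but genuine relationships that the FDR rule would keep.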
Example Implementation
import numpy as np

# mutual_information, conditional_mutual_information, and permutation_test
# are assumed to be defined elsewhere (they are not shown in this section).


def alternative_forward_selection(X, Y, alpha=0.05, n_shuffles=200):
    """
    Alternative oCSE forward selection without initial conditioning.

    Parameters
    ----------
    X : array (T, n*tau_max)
        Lagged predictor matrix
    Y : array (T, 1)
        Target variable
    alpha : float
        Significance level for the permutation test
    n_shuffles : int
        Number of shuffles used by the permutation test

    Returns
    -------
    selected : list of int
        Column indices of X, in selection order
    """
    n_predictors = X.shape[1]
    selected = []
    Z_current = None  # Start with empty conditioning set

    while True:
        # Evaluate remaining candidates
        remaining = [i for i in range(n_predictors) if i not in selected]
        if not remaining:
            break

        cmi_values = []
        for j in remaining:
            X_j = X[:, [j]]
            if Z_current is None:
                # First iteration: marginal mutual information
                cmi = mutual_information(X_j, Y)
            else:
                # Subsequent iterations: conditional mutual information
                cmi = conditional_mutual_information(X_j, Y, Z_current)
            cmi_values.append(cmi)

        # Select best candidate
        best_pos = int(np.argmax(cmi_values))
        best_idx = remaining[best_pos]
        best_cmi = cmi_values[best_pos]

        # Significance test (Z_current is None on the first iteration)
        X_best = X[:, [best_idx]]
        significant = permutation_test(X_best, Y, Z_current,
                                       best_cmi, alpha, n_shuffles)
        if not significant:
            break

        # Accept and update conditioning set
        selected.append(best_idx)
        if Z_current is None:
            Z_current = X_best
        else:
            Z_current = np.hstack([Z_current, X_best])

    return selected
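The significance test used by forward selection is not defined in this section. A minimal sketch of one possible shuffle test follows, assuming the estimator is passed in as a hypothetical `cmi_fn(x, Y, Z)` callable that accepts `Z=None` for the marginal case. The extra `cmi_fn` and `rng` keyword arguments are additions made for this illustration:

```python
import numpy as np


def permutation_test(X_j, Y, Z, observed, alpha=0.05, n_shuffles=200,
                     *, cmi_fn, rng=None):
    """Return True if `observed` exceeds the (1 - alpha) null quantile.

    The null distribution is built by shuffling the rows of X_j, which
    breaks its temporal pairing with Y (and, as a side effect, with Z).
    """
    rng = np.random.default_rng() if rng is None else rng
    null = np.empty(n_shuffles)
    for s in range(n_shuffles):
        X_perm = X_j[rng.permutation(len(X_j))]
        null[s] = cmi_fn(X_perm, Y, Z)
    return observed > np.quantile(null, 1.0 - alpha)
```

To use it with the six-argument call in the forward selection example, the estimator can be bound first, for instance with `functools.partial(permutation_test, cmi_fn=my_estimator)`. Note that a plain shuffle also destroys the dependence between the candidate and the conditioning set, so for conditional tests it is only an approximation to the conditional null.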
Diagnostic Analysis
The following diagnostics help explain differences between standard and alternative oCSE results.
Selection Order Analysis
Compare the order of variable selection:
def compare_selection_order(X, Y, Z_init):
    """Compare selection order between methods."""
    # standard_forward_selection and identify_autoregressive_variables are
    # assumed to be defined elsewhere (they are not shown in this section).

    # Standard oCSE
    standard_order = standard_forward_selection(X, Y, Z_init)

    # Alternative oCSE
    alternative_order = alternative_forward_selection(X, Y)

    print("Standard oCSE order:", standard_order)
    print("Alternative oCSE order:", alternative_order)

    # Check whether an autoregressive term is among the first two
    # selections made by alternative oCSE
    auto_vars = identify_autoregressive_variables(X, Y)
    alt_first_auto = any(var in auto_vars for var in alternative_order[:2])

    return {
        'standard_order': standard_order,
        'alternative_order': alternative_order,
        'alternative_selects_autoregressive': alt_first_auto
    }
Conditional MI Comparison
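Analyze how conditioning affects relationship strength by comparing, for each candidate predictor, its marginal mutual information with the target against its mutual information conditioned on the autoregressive terms.
Taking the difference (marginal minus conditional), positive values indicate relationships that appear stronger without autoregressive conditioning.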
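A minimal sketch of this comparison, assuming the estimators are supplied as hypothetical callables `mi_fn(x, Y)` and `cmi_fn(x, Y, Z)`, and that `Z_auto` holds the autoregressive (lagged target) columns. The function name `conditioning_effect` is also an assumption made for illustration:

```python
import numpy as np


def conditioning_effect(X, Y, Z_auto, mi_fn, cmi_fn):
    """Marginal MI minus autoregressive-conditioned CMI, per column of X.

    X : array (T, n_candidates), Y : array (T, 1)
    Z_auto : array (T, n_auto) of autoregressive terms
    """
    diffs = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        x_j = X[:, [j]]
        diffs[j] = mi_fn(x_j, Y) - cmi_fn(x_j, Y, Z_auto)
    return diffs
```

Candidates with a large positive difference are the ones whose apparent relationship with the target is mostly explained by the target's own history, which makes them the likeliest source of disagreement between the two methods.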
Theoretical Implications
Model Selection Perspective
Alternative oCSE can be viewed as a model selection procedure that builds complexity incrementally: starting from the empty model, each accepted selection adds one predictor.
Each selection represents a model complexity increase justified by information gain.
Connection to Feature Selection
The algorithm is closely related to information-based feature selection methods, particularly those using mutual information criteria. The key difference is the explicit temporal structure and causal interpretation.
Conclusion
Alternative oCSE provides a complementary approach to causal discovery that prioritizes data-driven relationship discovery over temporal assumptions. While it may be more susceptible to confounding and spurious relationships, it offers valuable insights for exploratory analysis and systems where autoregressive structure is not dominant.
The method is particularly useful as:
- A baseline for comparison with standard oCSE
- An exploratory tool for unknown systems
- A method for cross-sectional or weakly temporal data
- A diagnostic tool to understand the role of autoregressive conditioning
Users should consider both approaches and compare results to gain a comprehensive understanding of the causal structure in their data.