Information Theory
This section provides a comprehensive overview of information theory concepts that underpin the Causation Entropy framework. Understanding these foundational concepts is essential for applying and interpreting the methods in this library.
Fundamental Concepts
Entropy
Shannon Entropy quantifies the average information content in a random variable.
For a discrete random variable \(X\) with probability mass function \(p(x)\):
For a continuous random variable with probability density function \(f(x)\):
Properties: - \(H(X) \geq 0\) with equality if and only if \(X\) is deterministic - \(H(X)\) is maximized when \(X\) is uniformly distributed - Units depend on logarithm base: bits (base 2), nats (base e), or dits (base 10)
Joint and Conditional Entropy
Joint Entropy of two variables \(X\) and \(Y\):
Conditional Entropy of \(X\) given \(Y\):
Chain Rule of Entropy:
Mutual Information
Definition and Properties
Mutual Information between \(X\) and \(Y\):
Alternative formulation using Kullback-Leibler divergence:
Properties: - \(I(X;Y) \geq 0\) with equality if and only if \(X\) and \(Y\) are independent - \(I(X;Y) = I(Y;X)\) (symmetric) - \(I(X;X) = H(X)\) (self-information equals entropy) - \(I(X;Y) \leq \min(H(X), H(Y))\) (bounded by marginal entropies)
Conditional Mutual Information
Conditional Mutual Information of \(X\) and \(Y\) given \(Z\):
Equivalently:
Chain Rule for Mutual Information:
Information Decomposition:
This decomposition separates direct relationships from those mediated by \(Z\).
Information-Theoretic Measures for Causality
Transfer Entropy
Transfer Entropy from \(Y\) to \(X\):
where \(X_t^{(k)} = (X_t, X_{t-1}, \ldots, X_{t-k+1})\) represents the history of \(X\).
This measures the information flow from \(Y\)’s past to \(X\)’s future, beyond what is already contained in \(X\)’s own past.
Causation Entropy
Causation Entropy extends transfer entropy by considering multiple potential causes:
where: - \(\mathbf{H}_i^{(t)}\) is the historical information of variable \(i\) - \(\mathbf{S}_i^{(t)}\) is the set of other selected causal variables
The “optimal” aspect comes from systematic selection of \(\mathbf{S}_i^{(t)}\) to maximize information while controlling for statistical significance.
Estimation Methods
The choice of entropy estimator significantly affects the performance of information-theoretic causal discovery methods.
Parametric Estimators
Gaussian Entropy: For multivariate Gaussian \(\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\):
where \(k\) is the dimensionality.
Gaussian Mutual Information:
where \(\rho\) is the correlation coefficient.
Gaussian Conditional Mutual Information:
Non-Parametric Estimators
k-Nearest Neighbor (k-NN): The Kraskov-Stögbauer-Grassberger (KSG) estimator:
where: - \(\psi\) is the digamma function - \(N\) is the sample size - \(n_x, n_y\) are neighbor counts in marginal spaces
Kernel Density Estimation (KDE): Estimate density using kernel functions:
Then compute entropy as:
Histogram-Based: Discretize continuous variables and use discrete entropy formulas:
where \(n_i\) is the count in bin \(i\).
Estimator Comparison
Method |
Bias |
Variance |
Complexity |
Best Use Case |
|---|---|---|---|---|
Gaussian |
Low (if Gaussian) |
Low |
O(n³) |
Linear relationships |
k-NN |
Moderate |
Moderate |
O(n² log n) |
General purpose |
KDE |
Moderate |
High |
O(n²) |
Smooth densities |
Advanced Information Measures
Partial Information Decomposition
Decompose multivariate information into components:
This framework helps understand how multiple variables jointly provide information about a target.
Information Geometry
Information measures have geometric interpretations:
Fisher Information Metric:
Kullback-Leibler Divergence:
This measures the “distance” between probability distributions.
Multivariate Extensions
Multivariate Mutual Information
For \(n\) variables \(X_1, \ldots, X_n\):
Total Correlation:
Dual Total Correlation:
Information Networks
Construct networks where edge weights represent information flow:
This creates a partial correlation network in information-theoretic terms.
Practical Considerations
Sample Size Requirements
Information estimators have different sample size requirements:
Rule of Thumb: - Gaussian: \(N \geq 10 \times \text{dimensionality}\) - k-NN: \(N \geq 100 \times \text{dimensionality}\) - KDE: \(N \geq 1000 \times \text{dimensionality}\)
High-Dimensional Challenges: The “curse of dimensionality” affects all non-parametric estimators. Consider dimensionality reduction or parametric assumptions when \(d > 10\).
Bias and Variance Tradeoffs
Bias Sources: - Finite sample effects - Boundary effects (KDE) - Model assumptions (Gaussian)
Variance Sources: - Random sampling variation - Parameter choices (bandwidth, k) - Outliers and noise
Bias-Variance Management: - Cross-validation for parameter selection - Bootstrap for uncertainty quantification - Robust estimators for outlier handling
Statistical Testing
Permutation Tests
Test \(H_0: I(X;Y|Z) = 0\) using permutation of \(X\):
Compute observed \(I_{\text{obs}}(X;Y|Z)\)
Generate \(B$ permutations of :math:`X\): \(X^{(b)}\)
Compute null statistics: \(I^{(b)} = I(X^{(b)};Y|Z)\)
P-value: \(p = \frac{1 + \sum_{b=1}^B \mathbb{I}(I^{(b)} \geq I_{\text{obs}})}{B + 1}\)
Bootstrap Confidence Intervals
Construct confidence intervals for information measures:
where :math:`Q_p$ is the :math:`p$-quantile of bootstrap samples.
Multiple Testing Correction
When testing multiple relationships, correct for multiple comparisons:
Bonferroni Correction: .. math:
\alpha_{\text{adjusted}} = \frac{\alpha}{m}
False Discovery Rate (FDR): .. math:
\text{FDR} = \mathbb{E}\left[\frac{\text{False Positives}}{\max(\text{Total Positives}, 1)}\right]
Applications in Causal Discovery
Variable Selection
Use information measures for feature selection:
This balances information gain with model complexity.
Conditional Independence Testing
Test conditional independence: :math:`X perp Y | Z$
Equivalent to testing: :math:`I(X;Y|Z) = 0$
Advantages over linear methods: - Detects nonlinear dependencies - No distributional assumptions - Robust to outliers (with appropriate estimators)
Network Structure Learning
Learn network structure by testing all pairwise conditional independencies:
For each triple \((X_i, X_j, X_k)\): - Test \(I(X_i; X_j | X_k) = 0\) - Build network based on significant dependencies
Future Directions
Emerging Methods
1. Deep Learning Estimators: Neural networks for entropy estimation 5. Robust Estimation: Methods resilient to model misspecification
Conclusion
Information theory provides the mathematical foundation for the Optimal Causation Entropy framework. Understanding entropy, mutual information, and their estimation is crucial for effective application of these methods. Key takeaways:
Estimator Choice Matters: Different methods have different strengths and assumptions
Sample Size is Critical: Information estimators need sufficient data
Statistical Testing is Essential: Always validate relationships with significance tests
High Dimensions are Challenging: Consider regularization or dimensionality reduction
The field continues to evolve, with new estimators and theoretical insights regularly emerging. Practitioners should stay informed of developments and validate methods on their specific data types and problem domains.