Standard Causal Entropy

The Standard optimal Causal Entropy (standard oCSE) represents the canonical implementation of the causation entropy framework. This method begins with an initial conditioning set, typically consisting of lagged values of the target variable, and systematically builds the causal predictor set through forward selection and backward elimination phases.

Mathematical Foundation

The standard oCSE algorithm is built around the conditional mutual information measure:

\[I(X_j^{(t-\tau)}; X_i^{(t)} | \mathbf{Z}_i^{(t)}) = \sum_{x_j,x_i,\mathbf{z}} p(x_j,x_i,\mathbf{z}) \log \frac{p(x_i,x_j|\mathbf{z})}{p(x_i|\mathbf{z})p(x_j|\mathbf{z})}\]

where: - \(X_j^{(t-\tau)}\) is the candidate predictor variable \(j\) at lag \(\tau\) - \(X_i^{(t)}\) is the target variable \(i\) at the current time - \(\mathbf{Z}_i^{(t)} = \mathbf{Z}_{\text{init}} \cup \mathbf{S}_i^{(t)}\) is the conditioning set

The conditioning set consists of two components: - \(\mathbf{Z}_{\text{init}}\): Initial conditioning variables (usually lagged target values) - \(\mathbf{S}_i^{(t)}\): Previously selected predictor variables

Algorithm Description

The standard oCSE algorithm proceeds in two main phases:

Phase 1: Forward Selection with Initial Conditioning

Input: - Time series data \(\mathbf{X} \in \mathbb{R}^{T \times n}\) - Maximum lag \(\tau_{\max}\) - Significance level \(\alpha_{\text{forward}}\) - Number of permutations \(N_{\text{perm}}\)

Initialization: For each target variable \(i\), construct the initial conditioning set:

\[\mathbf{Z}_{\text{init},i} = \{X_i^{(t-1)}, X_i^{(t-2)}, \ldots, X_i^{(t-\tau_{\max})}\}\]

This incorporates the autoregressive structure of the target variable.

Forward Selection Loop:

Candidate Evaluation: For each remaining candidate predictor \(X_j^{(t-\tau)}\):

\[\text{CMI}_{j,\tau} = I(X_j^{(t-\tau)}; X_i^{(t)} | \mathbf{Z}_i^{(t)})\]
Best Candidate Selection: Choose the predictor with maximum conditional mutual information:

\[(j^*, \tau^*) = \arg\max_{j,\tau} \text{CMI}_{j,\tau}\]
Significance Testing: Perform permutation test for \(H_0: I(X_{j^*}^{(t-\tau^*)}; X_i^{(t)} | \mathbf{Z}_i^{(t)}) = 0\)
- Generate \(N_{\text{perm}}\) permutations of \(X_{j^*}^{(t-\tau^*)}\)
- Compute null distribution: \(\{\text{CMI}_{\text{perm}}^{(k)}\}_{k=1}^{N_{\text{perm}}}\)
- Determine threshold: \(\theta = \text{percentile}(\{\text{CMI}_{\text{perm}}^{(k)}\}, 100(1-\alpha_{\text{forward}}))\)
Selection Decision:

\[\text{Accept } X_{j^*}^{(t-\tau^*)} \text{ if } \text{CMI}_{j^*,\tau^*} \geq \theta\]
Conditioning Set Update: If accepted, update:

\[\mathbf{Z}_i^{(t)} \leftarrow \mathbf{Z}_i^{(t)} \cup \{X_{j^*}^{(t-\tau^*)}\}\]

Phase 2: Backward Elimination

Objective: Remove spurious predictors that may have been selected due to transitivity or confounding effects.

Backward Elimination Loop: For each selected predictor \(X_j^{(t-\tau)} \in \mathbf{S}_i\) (in random order):

Conditioning Set Construction:

\[\mathbf{Z}_{-j} = \mathbf{Z}_{\text{init},i} \cup (\mathbf{S}_i \setminus \{X_j^{(t-\tau)}\})\]
Conditional Mutual Information:

\[\text{CMI}_{j,\tau} = I(X_j^{(t-\tau)}; X_i^{(t)} | \mathbf{Z}_{-j})\]
Significance Testing: Test \(H_0: I(X_j^{(t-\tau)}; X_i^{(t)} | \mathbf{Z}_{-j}) = 0\)
Elimination Decision:

\[\text{Remove } X_j^{(t-\tau)} \text{ if } \text{CMI}_{j,\tau} < \theta_{\text{backward}}\]

Key Properties

Initial Conditioning Benefits

The initial conditioning set \(\mathbf{Z}_{\text{init}}\) provides several advantages:

Autoregressive Control: Controls for the natural temporal dependencies in the target variable
Enhanced Specificity: Identifies predictors that provide information beyond autoregressive patterns
Confounding Mitigation: Reduces spurious relationships due to common trends or cycles

Mathematical Formulation:

\[I(X_j^{(t-\tau)}; X_i^{(t)} | \mathbf{Z}_{\text{init}}) = H(X_i^{(t)} | \mathbf{Z}_{\text{init}}) - H(X_i^{(t)} | X_j^{(t-\tau)}, \mathbf{Z}_{\text{init}})\]

This measures the additional predictive information provided by \(X_j^{(t-\tau)}\) beyond what is already captured by the autoregressive terms.

Conditioning Set Evolution

The conditioning set evolves as:

\[\begin{split}\mathbf{Z}_i^{(0)} &= \mathbf{Z}_{\text{init},i} \\ \mathbf{Z}_i^{(k+1)} &= \mathbf{Z}_i^{(k)} \cup \{X_{j^*}^{(t-\tau^*)}\}\end{split}\]

where \(X_{j^*}^{(t-\tau^*)}\) is the predictor selected at iteration \(k+1\).

Information-Theoretic Interpretation

The standard oCSE framework can be understood through the lens of information decomposition. For a target variable \(X_i^{(t)}\) with autoregressive history \(\mathbf{H}_i\) and external predictor \(X_j^{(t-\tau)}\):

\[I(X_j^{(t-\tau)}; X_i^{(t)}) = I(X_j^{(t-\tau)}; X_i^{(t)} | \mathbf{H}_i) + I(X_j^{(t-\tau)}; \mathbf{H}_i)\]

The first term represents the direct causal influence, while the second represents shared information with the autoregressive structure. Standard oCSE focuses on the first term.

Advantages and Limitations

Advantages

Autoregressive Control: Explicitly accounts for temporal dependencies in the target
Theoretical Foundation: Grounded in information theory with clear interpretations
Flexible Information Measures: Supports various entropy estimators (Gaussian, k-NN, KDE, etc.)
Statistical Rigor: Permutation-based significance testing controls false positives
Multivariate Conditioning: Properly handles confounding through conditioning sets

Limitations

Computational Complexity: \(O(n^2 \tau_{\max} N_{\text{perm}} T)\) scaling
Initial Conditioning Assumption: Assumes autoregressive structure is relevant
Greedy Selection: Forward selection may miss globally optimal solutions
Sample Size Requirements: Information estimators require sufficient data
Stationarity Assumptions: Most effective on stationary time series

Implementation Considerations

Hyperparameter Selection

Significance Levels: - \(\alpha_{\text{forward}}\): Controls Type I error in forward selection (typically 0.01-0.05) - \(\alpha_{\text{backward}}\): Controls Type I error in backward elimination (typically 0.05-0.10)

Maximum Lag: - Should reflect domain knowledge of system dynamics - Computational cost scales linearly with \(\tau_{\max}\) - Rule of thumb: \(\tau_{\max} \approx \sqrt{T}\) for exploratory analysis

Permutation Count: - Minimum 100 for rough estimates - 1000+ for publication-quality significance tests - Precision scales as \(1/\sqrt{N_{\text{perm}}}\)

Information Estimator Choice

Estimator Selection Guide
Data Type	Recommended Estimator	Notes
Continuous Gaussian	Gaussian	Exact under normality assumption
Continuous Non-Gaussian	k-NN or KDE	k-NN more robust to dimensionality
Mixed/Discrete	Histogram or k-NN	Careful binning for histogram
High-Dimensional	Geometric k-NN	Accounts for manifold structure
Small Sample	Gaussian (if appropriate)	Parametric methods more sample-efficient

Example Implementation

Here’s a conceptual implementation of the standard oCSE forward selection:

def standard_forward_selection(X, Y, Z_init, alpha=0.05, n_shuffles=200):
    """
    Standard oCSE forward selection with initial conditioning.

    Parameters
    ----------
    X : array (T, n*tau_max)
        Lagged predictor matrix
    Y : array (T, 1)
        Target variable
    Z_init : array (T, p)
        Initial conditioning set
    """
    n_predictors = X.shape[1]
    selected = []
    Z_current = Z_init.copy()

    while True:
        # Evaluate remaining candidates
        remaining = [i for i in range(n_predictors) if i not in selected]
        if not remaining:
            break

        cmi_values = []
        for j in remaining:
            X_j = X[:, [j]]
            cmi = conditional_mutual_information(X_j, Y, Z_current)
            cmi_values.append(cmi)

        # Select best candidate
        best_idx = remaining[np.argmax(cmi_values)]
        best_cmi = max(cmi_values)

        # Significance test
        X_best = X[:, [best_idx]]
        significant = permutation_test(X_best, Y, Z_current,
                                     best_cmi, alpha, n_shuffles)

        if not significant:
            break

        # Accept and update
        selected.append(best_idx)
        Z_current = np.hstack([Z_current, X_best])

    return selected

Comparison with Alternative Methods

The standard oCSE can be compared with its variants:

Method Comparison
Method	Initial Conditioning	Advantages	Use Cases
Standard oCSE	Yes (lagged target)	Controls autoregression	Time series with strong temporal structure
Alternative oCSE	No	Simpler, fewer assumptions	Exploratory analysis, weak temporal structure
Information LASSO	Variable	Handles high dimensions	Large predictor spaces
Pure LASSO	No	Computationally efficient	Linear relationships, benchmarking

Theoretical Connections

The standard oCSE framework connects to several established concepts:

Granger Causality: Standard oCSE generalizes linear Granger causality to information-theoretic measures with flexible conditioning.

Transfer Entropy: Related but distinct - transfer entropy typically uses uniform conditioning across all variables, while oCSE uses targeted conditioning sets.

Partial Correlation: The Gaussian version of standard oCSE is closely related to partial correlation analysis but extends to nonlinear relationships.

Conclusion

Standard oCSE provides a principled, information-theoretic approach to causal discovery that explicitly accounts for autoregressive structure in time series data. The method’s strength lies in its theoretical foundation, flexibility in information measures, and rigorous statistical testing. However, users should be aware of its computational requirements and the assumptions inherent in the initial conditioning approach.

The method is particularly well-suited for time series with strong temporal dependencies where controlling for autoregressive effects is crucial for accurate causal inference.