Alternative Causal Entropy

The Alternative optimal Causation Entropy (alternative oCSE) is a simplified variant of the causation entropy framework that begins with an empty conditioning set. It takes a more exploratory approach to causal discovery: relationships are built from the data alone, without prior assumptions about autoregressive structure.

Mathematical Foundation

The alternative oCSE algorithm uses the same conditional mutual information measure as standard oCSE, but with a different conditioning strategy:

\[I(X_j^{(t-\tau)}; X_i^{(t)} | \mathbf{S}_i^{(t)}) = \sum_{x_j,x_i,\mathbf{s}} p(x_j,x_i,\mathbf{s}) \log \frac{p(x_i,x_j|\mathbf{s})}{p(x_i|\mathbf{s})p(x_j|\mathbf{s})}\]

The key difference is in the conditioning set evolution:

- Standard oCSE: \(\mathbf{Z}_i^{(t)} = \mathbf{Z}_{\text{init}} \cup \mathbf{S}_i^{(t)}\)
- Alternative oCSE: \(\mathbf{Z}_i^{(t)} = \mathbf{S}_i^{(t)}\) (no initial conditioning)

This means the algorithm starts with marginal mutual information and builds conditional dependencies organically through the selection process.
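
The difference between the marginal and the conditional quantity can be illustrated with a plug-in (frequency-count) estimator for discrete samples. This is a minimal sketch, not the estimator used by any particular oCSE implementation; the function name `plugin_cmi` and the toy XOR data are assumptions made for illustration.

```python
from collections import Counter
from math import log


def plugin_cmi(xs, ys, zs=None):
    """Plug-in estimate (in nats) of I(X; Y | Z) from discrete samples.

    With zs=None the conditioning set is empty and the result is the
    marginal mutual information I(X; Y).
    """
    n = len(xs)
    if zs is None:
        zs = [()] * n  # a constant Z is equivalent to no conditioning
    n_xyz = Counter(zip(xs, ys, zs))
    n_xz = Counter(zip(xs, zs))
    n_yz = Counter(zip(ys, zs))
    n_z = Counter(zs)
    total = 0.0
    for (x, y, z), c in n_xyz.items():
        # p(x,y|z) / (p(x|z) p(y|z)) = n_xyz * n_z / (n_xz * n_yz)
        total += (c / n) * log(c * n_z[z] / (n_xz[(x, z)] * n_yz[(y, z)]))
    return total


# Toy XOR data: y = x XOR z, with every (x, z) combination observed once.
xs = [0, 0, 1, 1]
zs = [0, 1, 0, 1]
ys = [x ^ z for x, z in zip(xs, zs)]

marginal = plugin_cmi(xs, ys)         # empty conditioning set
conditional = plugin_cmi(xs, ys, zs)  # conditioning on Z
```

On this toy data the marginal value is zero while the conditional value is log 2, so a first step based on marginal mutual information alone would not flag X as informative about Y.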

Algorithm Description

The alternative oCSE follows the same two-phase structure as standard oCSE but with different initialization and conditioning logic.

Phase 1: Forward Selection without Initial Conditioning

Initialization: For each target variable \(i\):

- Selected predictors: \(\mathbf{S}_i = \emptyset\)
- Conditioning set: \(\mathbf{Z}_i = \emptyset\)

Iteration k=1 (Marginal Selection):

Marginal Mutual Information: For each candidate predictor \(X_j^{(t-\tau)}\):

\[\text{MI}_{j,\tau} = I(X_j^{(t-\tau)}; X_i^{(t)}) = H(X_i^{(t)}) - H(X_i^{(t)} | X_j^{(t-\tau)})\]
Best Candidate Selection:

\[(j^*, \tau^*) = \arg\max_{j,\tau} \text{MI}_{j,\tau}\]
Significance Testing: Test \(H_0: I(X_{j^*}^{(t-\tau^*)}; X_i^{(t)}) = 0\)
First Selection: If significant, \(\mathbf{S}_i \leftarrow \{X_{j^*}^{(t-\tau^*)}\}\)

Subsequent Iterations (k≥2):

Conditional Mutual Information: For remaining candidates:

\[\text{CMI}_{j,\tau} = I(X_j^{(t-\tau)}; X_i^{(t)} | \mathbf{S}_i)\]
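Selection and Conditioning Update: As in standard oCSE, select the candidate with the largest CMI and test its significance. If it is significant, add it to \(\mathbf{S}_i\), which also serves as the conditioning set for the next iteration; otherwise stop.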

Phase 2: Backward Elimination

Identical to standard oCSE, except that the conditioning set used when testing predictor \(X_j^{(t-\tau)}\) for removal contains only the other selected predictors:

\[\mathbf{Z}_{-j} = \mathbf{S}_i \setminus \{X_j^{(t-\tau)}\}\]
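
The elimination step can be sketched as follows. This is an illustrative outline only: `is_significant` is a hypothetical callable standing in for whatever significance test is used, `toy_test` is a crude stand-in based on a fixed partial-correlation threshold, and updating the kept set as predictors are removed is an assumption about the procedure.

```python
import numpy as np


def backward_elimination(X, Y, selected, is_significant):
    """Drop each selected column that is not significant given the others.

    `is_significant(x_j, Y, Z)` is a hypothetical callable returning True when
    x_j carries information about Y beyond Z (Z is None for an empty set).
    """
    kept = list(selected)
    for j in list(selected):
        others = [k for k in kept if k != j]  # S_i without X_j
        Z = X[:, others] if others else None
        if not is_significant(X[:, [j]], Y, Z):
            kept.remove(j)
    return kept


def toy_test(x_j, y, Z, threshold=0.1):
    """Stand-in for a real test: |partial correlation| above a fixed threshold."""
    if Z is not None:
        # Remove the linear effect of Z from both variables.
        design = np.column_stack([np.ones(len(Z)), Z])
        x_j = x_j - design @ np.linalg.lstsq(design, x_j, rcond=None)[0]
        y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return abs(np.corrcoef(x_j.ravel(), y.ravel())[0, 1]) > threshold


rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
# Column 2 is a noisy copy of column 0; only column 0 drives Y directly.
X[:, 2] = X[:, 0] + 0.5 * rng.normal(size=2000)
Y = (X[:, 0] + 0.5 * rng.normal(size=2000)).reshape(-1, 1)

kept = backward_elimination(X, Y, selected=[0, 1, 2], is_significant=toy_test)
```

In this synthetic example the unrelated column and the noisy copy are both removed once the true driver is in the conditioning set, leaving only column 0.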

Key Algorithmic Differences

Conditioning Set Evolution

The conditioning set grows incrementally without prior assumptions:

\[\begin{split}\mathbf{Z}_i^{(0)} &= \emptyset \\ \mathbf{Z}_i^{(1)} &= \{X_{j_1}^{(t-\tau_1)}\} \\ \mathbf{Z}_i^{(2)} &= \{X_{j_1}^{(t-\tau_1)}, X_{j_2}^{(t-\tau_2)}\} \\ &\vdots \\ \mathbf{Z}_i^{(k)} &= \{X_{j_1}^{(t-\tau_1)}, \ldots, X_{j_k}^{(t-\tau_k)}\}\end{split}\]

Information-Theoretic Interpretation

The alternative approach can be understood through the chain rule of mutual information:

\[I(X_{j_1}, X_{j_2}, \ldots, X_{j_k}; X_i) = \sum_{m=1}^k I(X_{j_m}; X_i | X_{j_1}, \ldots, X_{j_{m-1}})\]

The algorithm greedily maximizes each term in this decomposition, building the joint information incrementally. This provides a different perspective compared to standard oCSE, which conditions on autoregressive structure from the start.
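
For \(k = 2\) the decomposition can be verified directly by writing each mutual information as a difference of entropies (\(H\) denotes Shannon entropy; lags are omitted as in the chain rule above):

\[\begin{split}I(X_{j_1}, X_{j_2}; X_i) &= H(X_i) - H(X_i | X_{j_1}, X_{j_2}) \\ &= \left[H(X_i) - H(X_i | X_{j_1})\right] + \left[H(X_i | X_{j_1}) - H(X_i | X_{j_1}, X_{j_2})\right] \\ &= I(X_{j_1}; X_i) + I(X_{j_2}; X_i | X_{j_1})\end{split}\]

The first bracket is the quantity maximized at iteration 1 and the second is the quantity maximized at iteration 2. Note that maximizing the terms one at a time is a greedy strategy and does not by itself guarantee that the sum is maximal over all pairs.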

First Selection: Marginal Mutual Information

The first selected predictor maximizes marginal mutual information:

\[X_{j_1}^{(t-\tau_1)} = \arg\max_{j,\tau} I(X_j^{(t-\tau)}; X_i^{(t)})\]

This captures the strongest unconditional relationship, which may include autoregressive effects if they are the dominant signal.

Subsequent Selections: Conditional Uniqueness

Later selections maximize conditional mutual information:

\[X_{j_k}^{(t-\tau_k)} = \arg\max_{j,\tau} I(X_j^{(t-\tau)}; X_i^{(t)} | X_{j_1}^{(t-\tau_1)}, \ldots, X_{j_{k-1}}^{(t-\tau_{k-1})})\]

This ensures each new predictor provides unique information not already captured by previously selected variables.

Advantages and Limitations

Advantages

No Prior Assumptions: Does not assume autoregressive structure is important
Exploratory Discovery: May find unexpected relationships not constrained by temporal assumptions
Simpler Implementation: Fewer parameters and initialization steps
Computational Efficiency: Slightly faster due to smaller initial conditioning sets
Data-Driven: Relationships emerge purely from information content in data
Interpretability: First selection shows strongest marginal relationship

Limitations

Confounding Risk: Without autoregressive control, may select spurious relationships
Order Dependence: Early selections heavily influence later conditioning
Transitivity Issues: May select indirect relationships as direct causes
Temporal Structure Ignored: Does not explicitly control for the target's own past unless it happens to be selected
Higher False Positive Risk: Less conservative than standard approach

When to Use Alternative oCSE

Recommended Scenarios

Exploratory Analysis: Initial investigation of unknown systems
Cross-Sectional Data: When temporal structure is less important
Weak Autoregressive Systems: Time series without strong temporal dependencies
Comparative Studies: Baseline for comparison with standard oCSE
High-Dimensional Systems: When autoregressive conditioning becomes prohibitive
Non-Temporal Networks: Spatial or other non-temporal relationship discovery

Comparison with Standard oCSE

Consider a simple three-variable system:

\[\begin{split}X_1^{(t)} &= 0.5 X_1^{(t-1)} + 0.3 X_2^{(t-1)} + \epsilon_1^{(t)} \\ X_2^{(t)} &= 0.4 X_2^{(t-1)} + \epsilon_2^{(t)} \\ X_3^{(t)} &= 0.6 X_3^{(t-1)} + \epsilon_3^{(t)}\end{split}\]

Standard oCSE (Target: \(X_1^{(t)}\)):

Initial conditioning: \(\mathbf{Z}_{\text{init}} = \{X_1^{(t-1)}\}\)
Evaluate: \(I(X_2^{(t-1)}; X_1^{(t)} | X_1^{(t-1)})\) and \(I(X_3^{(t-1)}; X_1^{(t)} | X_1^{(t-1)})\)
Likely selects \(X_2^{(t-1)}\) due to causal relationship

Alternative oCSE (Target: \(X_1^{(t)}\)):

No initial conditioning: \(\mathbf{Z} = \emptyset\)
Evaluate: \(I(X_1^{(t-1)}; X_1^{(t)})\), \(I(X_2^{(t-1)}; X_1^{(t)})\), \(I(X_3^{(t-1)}; X_1^{(t)})\)
Likely selects \(X_1^{(t-1)}\) first (strongest marginal relationship)
Then evaluates \(I(X_2^{(t-1)}; X_1^{(t)} | X_1^{(t-1)})\) and \(I(X_3^{(t-1)}; X_1^{(t)} | X_1^{(t-1)})\)
Selects \(X_2^{(t-1)}\) second

Both methods may reach the same final result, but through different paths and with different interpretations of the relationships.
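
The alternative oCSE path for this system can be reproduced numerically. The sketch below assumes independent unit-variance Gaussian noise terms (the noise distribution is not specified above), a Gaussian mutual information formula based on correlations, and a series length and seed chosen only for illustration. Under these assumptions the theoretical lag-1 correlation of the target with \(X_1^{(t-1)}\) is about 0.53, against about 0.33 for \(X_2^{(t-1)}\).

```python
import numpy as np

rng = np.random.default_rng(1)
T = 5000
x = np.zeros((T, 3))
for t in range(1, T):
    e = rng.normal(size=3)  # assumed unit-variance Gaussian noise
    x[t, 0] = 0.5 * x[t - 1, 0] + 0.3 * x[t - 1, 1] + e[0]
    x[t, 1] = 0.4 * x[t - 1, 1] + e[1]
    x[t, 2] = 0.6 * x[t - 1, 2] + e[2]

target = x[1:, 0]   # X_1 at time t
lagged = x[:-1, :]  # X_1, X_2, X_3 at time t-1


def gaussian_mi(r):
    """Mutual information (nats) of two jointly Gaussian scalars with correlation r."""
    return -0.5 * np.log(1.0 - r**2)


def partial_corr(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y given a single variable z."""
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))


# Iteration 1: marginal MI of each lagged candidate with the target.
r_y = np.array([np.corrcoef(lagged[:, j], target)[0, 1] for j in range(3)])
marginal_mi = gaussian_mi(r_y)
first = int(np.argmax(marginal_mi))

# Iteration 2: CMI of the remaining candidates given the first selection.
cmi = {}
for j in range(3):
    if j == first:
        continue
    r_jz = np.corrcoef(lagged[:, j], lagged[:, first])[0, 1]
    cmi[j] = gaussian_mi(partial_corr(r_y[j], r_jz, r_y[first]))
second = max(cmi, key=cmi.get)
```

Here `first` and `second` are zero-based column indices, so index 0 corresponds to \(X_1^{(t-1)}\) and index 1 to \(X_2^{(t-1)}\). Significance testing is omitted; the sketch only ranks candidates.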

Implementation Considerations

Hyperparameter Differences

Significance Levels:

- May require more conservative \(\alpha\) values due to increased multiple testing
- Consider Bonferroni or FDR corrections for the first (marginal) selection phase

Information Estimator Selection:

- Same considerations as standard oCSE
- May benefit from robust estimators due to less initial regularization

Stopping Criteria:

- Consider earlier stopping due to increased false positive risk
- Monitor conditioning set size to prevent overfitting
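
The two multiple-testing corrections mentioned under Significance Levels can be sketched as below. The sketch assumes that a p-value is available for every candidate in the marginal phase, which the selection loop itself does not require; the p-values shown are hypothetical and the function names are illustrative.

```python
import numpy as np


def bonferroni(p_values, alpha=0.05):
    """Reject H0 where p <= alpha / m (controls the family-wise error rate)."""
    p = np.asarray(p_values, dtype=float)
    return p <= alpha / p.size


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate)."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Rank r (1-based) passes if its p-value is at most alpha * r / m.
    passed = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])  # largest rank meeting its threshold
        reject[order[: k + 1]] = True
    return reject


# Hypothetical p-values from testing ten lagged candidates marginally.
p_values = [0.212, 0.001, 0.039, 0.008, 0.041, 0.060, 0.042, 0.074, 0.205, 0.216]
bonf = bonferroni(p_values)
bh = benjamini_hochberg(p_values)
```

With these hypothetical values Bonferroni rejects only the candidate at index 1, while Benjamini-Hochberg also rejects the one at index 3, which illustrates that the FDR correction is the less conservative of the two.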

Example Implementation

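import numpy as np


def alternative_forward_selection(X, Y, alpha=0.05, n_shuffles=200):
    """
    Alternative oCSE forward selection without initial conditioning.

    Assumes mutual_information, conditional_mutual_information, and
    permutation_test are defined elsewhere.

    Parameters
    ----------
    X : array (T, n*tau_max)
        Lagged predictor matrix
    Y : array (T, 1)
        Target variable
    alpha : float
        Significance level for the permutation test
    n_shuffles : int
        Number of shuffles used by the permutation test

    Returns
    -------
    selected : list of int
        Column indices of X in the order they were selected
    """
    n_predictors = X.shape[1]
    selected = []
    Z_current = None  # Start with empty conditioning set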

    while True:
        # Evaluate remaining candidates
        remaining = [i for i in range(n_predictors) if i not in selected]
        if not remaining:
            break

        cmi_values = []
        for j in remaining:
            X_j = X[:, [j]]
            if Z_current is None:
                # First iteration: marginal mutual information
                cmi = mutual_information(X_j, Y)
            else:
                # Subsequent iterations: conditional mutual information
                cmi = conditional_mutual_information(X_j, Y, Z_current)
            cmi_values.append(cmi)

        # Select best candidate
        best_idx = remaining[np.argmax(cmi_values)]
        best_cmi = max(cmi_values)

        # Significance test
        X_best = X[:, [best_idx]]
        significant = permutation_test(X_best, Y, Z_current,
                                     best_cmi, alpha, n_shuffles)

        if not significant:
            break

        # Accept and update conditioning set
        selected.append(best_idx)
        if Z_current is None:
            Z_current = X_best
        else:
            Z_current = np.hstack([Z_current, X_best])

    return selected
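
The listing above relies on three helpers that are not defined in this section. One possible set of definitions, matching the call signatures used above, is sketched here. It assumes jointly Gaussian variables (so that mutual information reduces to log-determinants of sample covariances) and a simple shuffle of the candidate predictor as the null model; both are illustrative choices rather than the estimators of any particular oCSE implementation, and the optional `rng` argument is an addition for reproducibility.

```python
import numpy as np


def _logdet_cov(*blocks):
    """Log-determinant of the sample covariance of the column-stacked blocks."""
    data = np.column_stack(blocks)
    cov = np.atleast_2d(np.cov(data, rowvar=False))
    return np.linalg.slogdet(cov)[1]


def conditional_mutual_information(X, Y, Z=None):
    """Gaussian estimate of I(X; Y | Z) in nats; Z=None gives I(X; Y)."""
    if Z is None:
        return 0.5 * (_logdet_cov(X) + _logdet_cov(Y) - _logdet_cov(X, Y))
    return 0.5 * (_logdet_cov(X, Z) + _logdet_cov(Y, Z)
                  - _logdet_cov(Z) - _logdet_cov(X, Y, Z))


def mutual_information(X, Y):
    """Gaussian estimate of the marginal mutual information I(X; Y) in nats."""
    return conditional_mutual_information(X, Y, None)


def permutation_test(X, Y, Z, observed, alpha=0.05, n_shuffles=200, rng=None):
    """Return True if `observed` exceeds the shuffled-X null at level alpha."""
    rng = np.random.default_rng(rng)
    null = np.array([
        # Shuffling the rows of X breaks its link to Y (and to Z).
        conditional_mutual_information(X[rng.permutation(len(X))], Y, Z)
        for _ in range(n_shuffles)
    ])
    p_value = (1 + np.sum(null >= observed)) / (1 + n_shuffles)
    return bool(p_value <= alpha)


# Small check on synthetic data: y depends linearly on x.
rng = np.random.default_rng(0)
x = rng.normal(size=(500, 1))
y = x + rng.normal(size=(500, 1))
mi = mutual_information(x, y)
significant = permutation_test(x, y, None, mi, alpha=0.05, n_shuffles=200, rng=1)
```

With these definitions in scope, `alternative_forward_selection` can be called as written. The Gaussian estimator only captures linear dependence, so a nonparametric estimator would be needed for nonlinear systems.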

Diagnostic Analysis

To understand the differences between standard and alternative oCSE results:

Selection Order Analysis

Compare the order of variable selection:

def compare_selection_order(X, Y, Z_init):
    """Compare selection order between methods."""

    # Standard oCSE
    standard_order = standard_forward_selection(X, Y, Z_init)

    # Alternative oCSE
    alternative_order = alternative_forward_selection(X, Y)

    print("Standard oCSE order:", standard_order)
    print("Alternative oCSE order:", alternative_order)

    # Check whether an autoregressive term is among the first two alternative selections
    auto_vars = identify_autoregressive_variables(X, Y)
    alt_first_auto = any(var in auto_vars for var in alternative_order[:2])

    return {
        'standard_order': standard_order,
        'alternative_order': alternative_order,
        'alternative_selects_autoregressive': alt_first_auto
    }

Conditional MI Comparison

Analyze how conditioning affects relationship strength:

\[\Delta\text{CMI}_{j,\tau} = I(X_j^{(t-\tau)}; X_i^{(t)} | \mathbf{S}_i) - I(X_j^{(t-\tau)}; X_i^{(t)} | \mathbf{Z}_{\text{init}} \cup \mathbf{S}_i)\]

Positive values indicate relationships that appear stronger without autoregressive conditioning.
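
A minimal sketch of this diagnostic is shown below. It assumes scalar, jointly Gaussian variables and obtains the conditional mutual information from the partial correlation read off the inverse covariance (precision) matrix; the function names and the synthetic data are illustrative assumptions.

```python
import numpy as np


def gaussian_cmi(x, y, Z=None):
    """I(x; y | Z) in nats for scalar x, y under a joint-Gaussian assumption."""
    cols = [x, y] if Z is None else [x, y, Z]
    precision = np.linalg.inv(np.cov(np.column_stack(cols), rowvar=False))
    # Partial correlation of x and y given all remaining columns.
    rho = -precision[0, 1] / np.sqrt(precision[0, 0] * precision[1, 1])
    return -0.5 * np.log(1.0 - rho**2)


def delta_cmi(x_j, y, S, Z_init):
    """CMI given S alone minus CMI given Z_init together with S (S may be None)."""
    without_init = gaussian_cmi(x_j, y, S)
    both = Z_init if S is None else np.column_stack([Z_init, S])
    return without_init - gaussian_cmi(x_j, y, both)


# Synthetic case: x_j is a noisy copy of the target's past and has no direct effect.
rng = np.random.default_rng(0)
n = 2000
y_past = rng.normal(size=n)
x_j = y_past + rng.normal(size=n)
y = 0.8 * y_past + rng.normal(size=n)

delta = delta_cmi(x_j, y, S=None, Z_init=y_past)
```

In this example `delta` is clearly positive: the candidate looks informative when the target's past is left out of the conditioning set, and the apparent relationship vanishes once it is included.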

Theoretical Implications

Model Selection Perspective

Alternative oCSE can be viewed as a model selection procedure that builds complexity incrementally:

\[\begin{split}\text{Model}_0: &\quad X_i^{(t)} = \epsilon_i^{(t)} \\ \text{Model}_1: &\quad X_i^{(t)} = f_1(X_{j_1}^{(t-\tau_1)}) + \epsilon_i^{(t)} \\ \text{Model}_2: &\quad X_i^{(t)} = f_2(X_{j_1}^{(t-\tau_1)}, X_{j_2}^{(t-\tau_2)}) + \epsilon_i^{(t)} \\ &\vdots\end{split}\]

Each selection represents a model complexity increase justified by information gain.
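
Under the additional assumptions that the variables are jointly Gaussian and each \(f_k\) is the least-squares linear predictor, the information gain of step \(k\) can be written in terms of residual variances. Here \(h\) denotes differential entropy and \(\sigma_k^2\) the residual variance of \(\text{Model}_k\); both symbols are introduced only for this illustration.

\[\begin{split}I(X_{j_k}^{(t-\tau_k)}; X_i^{(t)} | \mathbf{Z}_i^{(k-1)}) &= h(X_i^{(t)} | \mathbf{Z}_i^{(k-1)}) - h(X_i^{(t)} | \mathbf{Z}_i^{(k)}) \\ &= \frac{1}{2} \log \frac{\sigma_{k-1}^2}{\sigma_k^2}\end{split}\]

The first equality holds in general; the second uses the Gaussian assumption. In this setting a selection is justified exactly when adding the predictor reduces the residual variance by a statistically significant factor.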

Connection to Feature Selection

The algorithm is closely related to information-based feature selection methods, particularly those using mutual information criteria. The key difference is the explicit temporal structure and causal interpretation.

Conclusion

Alternative oCSE provides a complementary approach to causal discovery that prioritizes data-driven relationship discovery over temporal assumptions. While it may be more susceptible to confounding and spurious relationships, it offers valuable insights for exploratory analysis and systems where autoregressive structure is not dominant.

The method is particularly useful as:

- A baseline for comparison with standard oCSE
- An exploratory tool for unknown systems
- A method for cross-sectional or weakly temporal data
- A diagnostic tool to understand the role of autoregressive conditioning

Users should consider both approaches and compare results to gain a comprehensive understanding of the causal structure in their data.