Alternative Causal Entropy

The Alternative optimal Causation Entropy (alternative oCSE) is a simplified variant of the causation entropy framework that begins with an empty conditioning set. This approach takes a more exploratory view of causal discovery: relationships are built purely from the data, with no prior assumption that the target's own past (autoregressive structure) must be conditioned on.

Mathematical Foundation

The alternative oCSE algorithm uses the same conditional mutual information measure as standard oCSE, but with a different conditioning strategy:

\[I(X_j^{(t-\tau)}; X_i^{(t)} | \mathbf{S}_i^{(t)}) = \sum_{x_j,x_i,\mathbf{s}} p(x_j,x_i,\mathbf{s}) \log \frac{p(x_i,x_j|\mathbf{s})}{p(x_i|\mathbf{s})p(x_j|\mathbf{s})}\]
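For discrete-valued data, the sum above can be estimated by replacing each probability with an empirical frequency. The following is a minimal sketch of such a plug-in estimator. The function name `plugin_cmi` is hypothetical and is not part of any oCSE implementation; plug-in estimates are also biased upward for small samples, so this is an illustration rather than a recommended estimator.

```python
from collections import Counter
from math import log


def plugin_cmi(xj, xi, s):
    """Plug-in estimate of I(xj; xi | s) in nats from discrete samples.

    xj, xi, s are equal-length sequences of hashable values. Pass a
    constant sequence for s to obtain the marginal mutual information.
    """
    n = len(xi)
    n_xys = Counter(zip(xj, xi, s))
    n_xs = Counter(zip(xj, s))
    n_ys = Counter(zip(xi, s))
    n_s = Counter(s)

    total = 0.0
    for (a, b, c), count in n_xys.items():
        # p(a, b | c) / (p(a | c) p(b | c)) = n_abc * n_c / (n_ac * n_bc)
        ratio = count * n_s[c] / (n_xs[(a, c)] * n_ys[(b, c)])
        total += (count / n) * log(ratio)
    return total
```

When `s` carries a tuple of the currently selected predictors at each time step, the same function gives the conditional quantity used in later iterations.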

The key difference is in the conditioning set evolution:

  - Standard oCSE: \(\mathbf{Z}_i^{(t)} = \mathbf{Z}_{\text{init}} \cup \mathbf{S}_i^{(t)}\)

  - Alternative oCSE: \(\mathbf{Z}_i^{(t)} = \mathbf{S}_i^{(t)}\) (no initial conditioning)

This means the algorithm starts with marginal mutual information and builds conditional dependencies organically through the selection process.

Algorithm Description

The alternative oCSE follows the same two-phase structure as standard oCSE but with different initialization and conditioning logic.

Phase 1: Forward Selection without Initial Conditioning

Initialization: For each target variable \(i\):

  - Selected predictors: \(\mathbf{S}_i = \emptyset\)

  - Conditioning set: \(\mathbf{Z}_i = \emptyset\)

Iteration k=1 (Marginal Selection):

  1. Marginal Mutual Information: For each candidate predictor \(X_j^{(t-\tau)}\):

    \[\text{MI}_{j,\tau} = I(X_j^{(t-\tau)}; X_i^{(t)}) = H(X_i^{(t)}) - H(X_i^{(t)} | X_j^{(t-\tau)})\]
  2. Best Candidate Selection:

    \[(j^*, \tau^*) = \arg\max_{j,\tau} \text{MI}_{j,\tau}\]
  3. Significance Testing: Test \(H_0: I(X_{j^*}^{(t-\tau^*)}; X_i^{(t)}) = 0\)

  4. First Selection: If significant, \(\mathbf{S}_i \leftarrow \{X_{j^*}^{(t-\tau^*)}\}\)
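The four steps of the first iteration can be sketched as follows. This is an illustration under stated assumptions, not the reference implementation: it assumes jointly Gaussian variables, for which \(I(X;Y) = -\tfrac{1}{2}\log(1-\rho^2)\), and it uses a plain shuffle test that ignores autocorrelation in the candidate series. The names `gaussian_mi` and `first_selection` are hypothetical.

```python
import numpy as np


def gaussian_mi(x, y):
    """Mutual information in nats between two 1-D series, assuming joint Gaussianity."""
    r = np.corrcoef(x, y)[0, 1]
    return -0.5 * np.log(1.0 - r ** 2)


def first_selection(candidates, target, alpha=0.05, n_shuffles=200, seed=0):
    """Return the column index of the best marginal candidate, or None.

    candidates : array (T, n_candidates) of lagged predictors
    target     : array (T,) holding the target series
    """
    rng = np.random.default_rng(seed)

    # Steps 1 and 2: score every candidate and keep the arg max.
    scores = [gaussian_mi(candidates[:, j], target)
              for j in range(candidates.shape[1])]
    best = int(np.argmax(scores))

    # Step 3: shuffle the winning candidate to build a null distribution.
    null = [gaussian_mi(rng.permutation(candidates[:, best]), target)
            for _ in range(n_shuffles)]
    threshold = np.quantile(null, 1.0 - alpha)

    # Step 4: accept only if the observed value exceeds the null quantile.
    return best if scores[best] > threshold else None
```

Returning `None` corresponds to leaving \(\mathbf{S}_i\) empty, in which case forward selection ends without any predictor.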

Subsequent Iterations (k≥2):

  1. Conditional Mutual Information: For remaining candidates:

    \[\text{CMI}_{j,\tau} = I(X_j^{(t-\tau)}; X_i^{(t)} | \mathbf{S}_i)\]
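  2. Selection and Conditioning Update: As in standard oCSE, select the candidate with the largest CMI and test its significance. If it is significant, add it to \(\mathbf{S}_i\) (which here is also the full conditioning set) and repeat; otherwise stop.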

Phase 2: Backward Elimination

Identical to standard oCSE, but the conditioning set for elimination only includes previously selected predictors:

\[\mathbf{Z}_{-j} = \mathbf{S}_i \setminus \{X_j^{(t-\tau)}\}\]
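A backward pass with this leave-one-out conditioning set might look like the sketch below. It assumes a Gaussian CMI estimator based on covariance log-determinants and a simple shuffle test; the names `gaussian_cmi` and `backward_elimination` are hypothetical and chosen only for this illustration.

```python
import numpy as np


def gaussian_cmi(x, y, z=None):
    """I(x; y | z) in nats, assuming jointly Gaussian variables."""
    def logdet(*blocks):
        cov = np.atleast_2d(np.cov(np.column_stack(blocks), rowvar=False))
        return np.linalg.slogdet(cov)[1]

    if z is None:
        return 0.5 * (logdet(x) + logdet(y) - logdet(x, y))
    return 0.5 * (logdet(x, z) + logdet(y, z) - logdet(z) - logdet(x, y, z))


def backward_elimination(X, y, selected, alpha=0.05, n_shuffles=200, seed=0):
    """Drop selected columns of X that are redundant given the other kept ones."""
    rng = np.random.default_rng(seed)
    kept = list(selected)
    for j in list(selected):
        # Z_{-j}: every currently kept predictor except j.
        others = [k for k in kept if k != j]
        z = X[:, others] if others else None

        observed = gaussian_cmi(X[:, j], y, z)
        null = [gaussian_cmi(rng.permutation(X[:, j]), y, z)
                for _ in range(n_shuffles)]
        if observed <= np.quantile(null, 1.0 - alpha):
            kept.remove(j)
    return kept
```

Because `kept` shrinks as predictors are removed, later tests condition only on the predictors that have survived so far.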

Key Algorithmic Differences

Conditioning Set Evolution

The conditioning set grows incrementally without prior assumptions:

\[\begin{split}\mathbf{Z}_i^{(0)} &= \emptyset \\ \mathbf{Z}_i^{(1)} &= \{X_{j_1}^{(t-\tau_1)}\} \\ \mathbf{Z}_i^{(2)} &= \{X_{j_1}^{(t-\tau_1)}, X_{j_2}^{(t-\tau_2)}\} \\ &\vdots \\ \mathbf{Z}_i^{(k)} &= \{X_{j_1}^{(t-\tau_1)}, \ldots, X_{j_k}^{(t-\tau_k)}\}\end{split}\]

Information-Theoretic Interpretation

The alternative approach can be understood through the chain rule of mutual information:

\[I(X_{j_1}, X_{j_2}, \ldots, X_{j_k}; X_i) = \sum_{m=1}^k I(X_{j_m}; X_i | X_{j_1}, \ldots, X_{j_{m-1}})\]

The algorithm greedily maximizes each term in this decomposition, building the joint information incrementally. This provides a different perspective compared to standard oCSE, which conditions on autoregressive structure from the start.
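The decomposition can be checked numerically. The sketch below assumes jointly Gaussian variables, where \(I(X; Y | Z) = \tfrac{1}{2}\log\big(\sigma^2_{Y|Z} / \sigma^2_{Y|X,Z}\big)\) and the conditional variances are estimated as least-squares residual variances. The helper names `residual_variance` and `gaussian_info` are hypothetical.

```python
import numpy as np


def residual_variance(y, predictors=None):
    """Variance of y left after a least-squares fit on the predictors plus an intercept."""
    design = np.ones((len(y), 1))
    if predictors is not None:
        design = np.column_stack([design, predictors])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return np.mean((y - design @ coef) ** 2)


def gaussian_info(y, predictors, given=None):
    """I(predictors; y | given) in nats under a joint-Gaussian assumption."""
    both = predictors if given is None else np.column_stack([given, predictors])
    return 0.5 * np.log(residual_variance(y, given) / residual_variance(y, both))


rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 3))
y = x @ np.array([1.0, 0.5, 0.0]) + rng.standard_normal(1000)

# Left-hand side: information carried jointly by all three predictors.
joint = gaussian_info(y, x)

# Right-hand side: one conditional term per predictor, in selection order.
terms = [gaussian_info(y, x[:, m], x[:, :m] if m else None) for m in range(3)]

print(joint, sum(terms))
```

With this estimator the two printed values agree up to floating-point error, because the ratios of residual variances telescope. Each term is also the log-ratio of residual variances before and after adding one predictor, which is one way to read the incremental information gain.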

First Selection: Marginal Mutual Information

The first selected predictor maximizes marginal mutual information:

\[X_{j_1}^{(t-\tau_1)} = \arg\max_{j,\tau} I(X_j^{(t-\tau)}; X_i^{(t)})\]

This captures the strongest unconditional relationship, which may include autoregressive effects if they are the dominant signal.

Subsequent Selections: Conditional Uniqueness

Later selections maximize conditional mutual information:

\[X_{j_k}^{(t-\tau_k)} = \arg\max_{j,\tau} I(X_j^{(t-\tau)}; X_i^{(t)} | X_{j_1}^{(t-\tau_1)}, \ldots, X_{j_{k-1}}^{(t-\tau_{k-1})})\]

This ensures each new predictor provides unique information not already captured by previously selected variables.

Advantages and Limitations

Advantages

  1. No Prior Assumptions: Does not assume autoregressive structure is important

  2. Exploratory Discovery: May find unexpected relationships not constrained by temporal assumptions

  3. Simpler Implementation: Fewer parameters and initialization steps

  4. Computational Efficiency: Slightly faster due to smaller initial conditioning sets

  5. Data-Driven: Relationships emerge purely from information content in data

  6. Interpretability: First selection shows strongest marginal relationship

Limitations

  1. Confounding Risk: Without autoregressive control, may select spurious relationships

  2. Order Dependence: Early selections heavily influence later conditioning

  3. Transitivity Issues: May select indirect relationships as direct causes

  4. Temporal Structure Not Enforced: The target's own past is not conditioned on by default, so autocorrelation is controlled only if autoregressive terms happen to be selected

  5. Higher False Positive Risk: Less conservative than standard approach

When to Use Alternative oCSE

Comparison with Standard oCSE

Consider a simple three-variable system:

\[\begin{split}X_1^{(t)} &= 0.5 X_1^{(t-1)} + 0.3 X_2^{(t-1)} + \epsilon_1^{(t)} \\ X_2^{(t)} &= 0.4 X_2^{(t-1)} + \epsilon_2^{(t)} \\ X_3^{(t)} &= 0.6 X_3^{(t-1)} + \epsilon_3^{(t)}\end{split}\]

Standard oCSE (Target: \(X_1^{(t)}\)):

  1. Initial conditioning: \(\mathbf{Z}_{\text{init}} = \{X_1^{(t-1)}\}\)

  2. Evaluate: \(I(X_2^{(t-1)}; X_1^{(t)} | X_1^{(t-1)})\) and \(I(X_3^{(t-1)}; X_1^{(t)} | X_1^{(t-1)})\)

  3. Likely selects \(X_2^{(t-1)}\) due to causal relationship

Alternative oCSE (Target: \(X_1^{(t)}\)):

  1. No initial conditioning: \(\mathbf{Z} = \emptyset\)

  2. Evaluate: \(I(X_1^{(t-1)}; X_1^{(t)})\), \(I(X_2^{(t-1)}; X_1^{(t)})\), \(I(X_3^{(t-1)}; X_1^{(t)})\)

  3. Likely selects \(X_1^{(t-1)}\) first (strongest marginal relationship)

  4. Then evaluates \(I(X_2^{(t-1)}; X_1^{(t)} | X_1^{(t-1)})\) and \(I(X_3^{(t-1)}; X_1^{(t)} | X_1^{(t-1)})\)

  5. Selects \(X_2^{(t-1)}\) second

Both methods may reach the same final result, but through different paths and with different interpretations of the relationships.
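The walk-through above can be reproduced on simulated data. This sketch assumes unit-variance Gaussian noise for every \(\epsilon\) (the noise distribution is not specified in the system above) and uses a Gaussian CMI estimator; `gaussian_cmi` is a hypothetical helper.

```python
import numpy as np


def gaussian_cmi(x, y, z=None):
    """I(x; y | z) in nats, assuming jointly Gaussian variables."""
    def logdet(*blocks):
        cov = np.atleast_2d(np.cov(np.column_stack(blocks), rowvar=False))
        return np.linalg.slogdet(cov)[1]

    if z is None:
        return 0.5 * (logdet(x) + logdet(y) - logdet(x, y))
    return 0.5 * (logdet(x, z) + logdet(y, z) - logdet(z) - logdet(x, y, z))


# Simulate the three-variable system (assumed noise: standard normal).
rng = np.random.default_rng(0)
T = 5000
X = np.zeros((T, 3))
for t in range(1, T):
    eps = rng.standard_normal(3)
    X[t, 0] = 0.5 * X[t - 1, 0] + 0.3 * X[t - 1, 1] + eps[0]
    X[t, 1] = 0.4 * X[t - 1, 1] + eps[1]
    X[t, 2] = 0.6 * X[t - 1, 2] + eps[2]

past, now = X[:-1], X[1:, 0]  # lag-1 predictors and the target X_1^{(t)}

# Alternative oCSE, iteration 1: marginal MI of each lagged variable.
marginal = [gaussian_cmi(past[:, j], now) for j in range(3)]

# Iteration 2 (and standard oCSE's first step): condition on X_1^{(t-1)}.
conditional = [gaussian_cmi(past[:, j], now, past[:, [0]]) for j in (1, 2)]

print("marginal MI:", marginal)
print("CMI given X1 lag:", conditional)
```

For these coefficients the autoregressive term has the largest marginal value, and \(X_2^{(t-1)}\) has the larger conditional value, matching steps 3 to 5 of the alternative procedure.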

Implementation Considerations

Hyperparameter Differences

Significance Levels:

  - May require more conservative \(\alpha\) values due to increased multiple testing

  - Consider Bonferroni or FDR corrections for the first (marginal) selection phase

Information Estimator Selection:

  - Same considerations as standard oCSE

  - May benefit from robust estimators due to less initial regularization

Stopping Criteria:

  - Consider earlier stopping due to increased false positive risk

  - Monitor conditioning set size to prevent overfitting
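The two corrections mentioned under Significance Levels can be sketched as follows. How the per-candidate p-values are obtained is left open here (for example, one permutation test per candidate); the function names `bonferroni_alpha` and `benjamini_hochberg` are hypothetical, and the second implements the standard Benjamini-Hochberg step-up procedure for FDR control.

```python
import numpy as np


def bonferroni_alpha(alpha, n_candidates):
    """Per-test significance level that keeps the family-wise error rate at alpha."""
    return alpha / n_candidates


def benjamini_hochberg(p_values, q=0.05):
    """Boolean mask of rejected hypotheses under the Benjamini-Hochberg procedure."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)

    # Compare the k-th smallest p-value with the threshold q * k / m.
    below = p[order] <= q * np.arange(1, m + 1) / m

    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # largest rank that passes
        reject[order[:k + 1]] = True      # reject everything up to that rank
    return reject
```

For the marginal phase, `n_candidates` would be the number of (variable, lag) pairs scored in the first iteration.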

Example Implementation

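import numpy as np

# mutual_information, conditional_mutual_information and permutation_test
# are assumed to be provided elsewhere (estimator and significance test).


def alternative_forward_selection(X, Y, alpha=0.05, n_shuffles=200):
    """
    Alternative oCSE forward selection without initial conditioning.

    Parameters
    ----------
    X : array (T, n*tau_max)
        Lagged predictor matrix
    Y : array (T, 1)
        Target variable
    alpha : float
        Significance level for the permutation test
    n_shuffles : int
        Number of shuffles used to build the null distribution

    Returns
    -------
    selected : list of int
        Column indices of X, in the order they were selected
    """
    n_predictors = X.shape[1]
    selected = []
    Z_current = None  # Start with empty conditioning set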

    while True:
        # Evaluate remaining candidates
        remaining = [i for i in range(n_predictors) if i not in selected]
        if not remaining:
            break

        cmi_values = []
        for j in remaining:
            X_j = X[:, [j]]
            if Z_current is None:
                # First iteration: marginal mutual information
                cmi = mutual_information(X_j, Y)
            else:
                # Subsequent iterations: conditional mutual information
                cmi = conditional_mutual_information(X_j, Y, Z_current)
            cmi_values.append(cmi)

        # Select best candidate
        best_idx = remaining[np.argmax(cmi_values)]
        best_cmi = max(cmi_values)

        # Significance test
        X_best = X[:, [best_idx]]
        significant = permutation_test(X_best, Y, Z_current,
                                     best_cmi, alpha, n_shuffles)

        if not significant:
            break

        # Accept and update conditioning set
        selected.append(best_idx)
        if Z_current is None:
            Z_current = X_best
        else:
            Z_current = np.hstack([Z_current, X_best])

    return selected
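The function above relies on three helpers that this section does not define. One possible set of definitions is sketched below so that the example can be run end to end. These are assumptions made for illustration: the estimator is a Gaussian one based on covariance log-determinants, the test is a plain shuffle test, and the optional `seed` argument is an addition that the call above does not use. An actual oCSE implementation may use different estimators and signatures.

```python
import numpy as np


def _logdet_cov(*blocks):
    """Log-determinant of the sample covariance of the column-stacked blocks."""
    cov = np.atleast_2d(np.cov(np.column_stack(blocks), rowvar=False))
    return np.linalg.slogdet(cov)[1]


def mutual_information(X, Y):
    """I(X; Y) in nats, assuming jointly Gaussian variables."""
    return 0.5 * (_logdet_cov(X) + _logdet_cov(Y) - _logdet_cov(X, Y))


def conditional_mutual_information(X, Y, Z):
    """I(X; Y | Z) in nats, assuming jointly Gaussian variables."""
    return 0.5 * (_logdet_cov(X, Z) + _logdet_cov(Y, Z)
                  - _logdet_cov(Z) - _logdet_cov(X, Y, Z))


def permutation_test(X, Y, Z, observed, alpha=0.05, n_shuffles=200, seed=None):
    """Return True if `observed` exceeds the (1 - alpha) quantile of a shuffle null."""
    rng = np.random.default_rng(seed)
    null = np.empty(n_shuffles)
    for s in range(n_shuffles):
        # Shuffle the candidate in time to break its link with the target.
        X_perm = X[rng.permutation(len(X))]
        if Z is None:
            null[s] = mutual_information(X_perm, Y)
        else:
            null[s] = conditional_mutual_information(X_perm, Y, Z)
    return bool(observed > np.quantile(null, 1.0 - alpha))
```

The argument order of `permutation_test` follows the positional call made in `alternative_forward_selection`, with `Z=None` signalling the marginal case.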

Diagnostic Analysis

To understand the differences between standard and alternative oCSE results:

Selection Order Analysis

Compare the order of variable selection:

def compare_selection_order(X, Y, Z_init):
    """Compare selection order between methods."""

    # Standard oCSE
    standard_order = standard_forward_selection(X, Y, Z_init)

    # Alternative oCSE
    alternative_order = alternative_forward_selection(X, Y)

    print("Standard oCSE order:", standard_order)
    print("Alternative oCSE order:", alternative_order)

    # Check whether an autoregressive term is among the first two alternative selections
    auto_vars = identify_autoregressive_variables(X, Y)
    alt_first_auto = any(var in auto_vars for var in alternative_order[:2])

    return {
        'standard_order': standard_order,
        'alternative_order': alternative_order,
        'alternative_selects_autoregressive': alt_first_auto
    }

Conditional MI Comparison

Analyze how conditioning affects relationship strength:

\[\Delta\text{CMI}_{j,\tau} = I(X_j^{(t-\tau)}; X_i^{(t)} | \mathbf{S}_i) - I(X_j^{(t-\tau)}; X_i^{(t)} | \mathbf{Z}_{\text{init}} \cup \mathbf{S}_i)\]

Positive values indicate relationships that appear stronger without autoregressive conditioning.

Theoretical Implications

Model Selection Perspective

Alternative oCSE can be viewed as a model selection procedure that builds complexity incrementally:

\[\begin{split}\text{Model}_0: &\quad X_i^{(t)} = \epsilon_i^{(t)} \\ \text{Model}_1: &\quad X_i^{(t)} = f_1(X_{j_1}^{(t-\tau_1)}) + \epsilon_i^{(t)} \\ \text{Model}_2: &\quad X_i^{(t)} = f_2(X_{j_1}^{(t-\tau_1)}, X_{j_2}^{(t-\tau_2)}) + \epsilon_i^{(t)} \\ &\vdots\end{split}\]

Each selection represents a model complexity increase justified by information gain.

Connection to Feature Selection

The algorithm is closely related to information-based feature selection methods, particularly those using mutual information criteria. The key difference is the explicit temporal structure and causal interpretation.

Conclusion

Alternative oCSE provides a complementary approach to causal discovery that prioritizes data-driven relationship discovery over temporal assumptions. While it may be more susceptible to confounding and spurious relationships, it offers valuable insights for exploratory analysis and systems where autoregressive structure is not dominant.

The method is particularly useful as:

  - A baseline for comparison with standard oCSE

  - An exploratory tool for unknown systems

  - A method for cross-sectional or weakly temporal data

  - A diagnostic tool to understand the role of autoregressive conditioning

Users should consider both approaches and compare results to gain a comprehensive understanding of the causal structure in their data.