Pure LASSO Methods

Pure LASSO methods represent the linear regression baseline for causal discovery in the Causation Entropy framework. While not information-theoretic in nature, these methods serve as important benchmarks and provide computationally efficient alternatives for linear systems. This section covers the theoretical foundation, implementation, and role of LASSO-based approaches in causal network inference.

Mathematical Foundation

Standard LASSO Formulation

The LASSO (Least Absolute Shrinkage and Selection Operator) solves the optimization problem:

\[\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \frac{1}{2n} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_1\]

This can be equivalently formulated as a constrained optimization:

\[\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \frac{1}{2n} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 \quad \text{subject to} \quad \|\boldsymbol{\beta}\|_1 \leq t\]

where $t$ corresponds to the constraint level determined by $\lambda$.

Causal Discovery Context

For causal discovery from time series, the LASSO problem becomes:

Target Variable: $X_i^{(t)}$ for $t = \tau_{\max} + 1, \ldots, T$

Predictor Matrix: .. math:

\mathbf{X}_{i,\text{lag}} = \begin{bmatrix}
X_1^{(\tau_{\max})} & X_1^{(\tau_{\max}-1)} & \cdots & X_1^{(1)} & \cdots & X_n^{(\tau_{\max})} & \cdots & X_n^{(1)} \\
X_1^{(\tau_{\max}+1)} & X_1^{(\tau_{\max})} & \cdots & X_1^{(2)} & \cdots & X_n^{(\tau_{\max}+1)} & \cdots & X_n^{(2)} \\
\vdots & \vdots & \ddots & \vdots & \ddots & \vdots & \ddots & \vdots \\
X_1^{(T-1)} & X_1^{(T-2)} & \cdots & X_1^{(T-\tau_{\max})} & \cdots & X_n^{(T-1)} & \cdots & X_n^{(T-\tau_{\max})}
\end{bmatrix}

Response Vector: .. math:

\mathbf{y}_i = \begin{bmatrix} X_i^{(\tau_{\max}+1)} \\ X_i^{(\tau_{\max}+2)} \\ \vdots \\ X_i^{(T)} \end{bmatrix}

The resulting coefficient vector has structure: .. math:

\boldsymbol{\beta}_i = [\beta_{i,1}^{(1)}, \beta_{i,1}^{(2)}, \ldots, \beta_{i,1}^{(\tau_{\max})}, \ldots, \beta_{i,n}^{(1)}, \ldots, \beta_{i,n}^{(\tau_{\max})}]^T

where $\beta_{i,j}^{(\tau)}$ represents the influence of variable $j$ at lag $\tau$ on variable $i$.

Causal Interpretation

Edge Detection

A directed edge from variable $j$ to variable $i$ at lag $\tau$ is inferred if:

\[|\hat{\beta}_{i,j}^{(\tau)}| > \epsilon\]

where $\epsilon$ is a small threshold (typically machine precision).

The strongest lag for each relationship can be determined by:

\[\tau_{i,j}^* = \arg\max_{\tau \in \{1,\ldots,\tau_{\max}\}} |\hat{\beta}_{i,j}^{(\tau)}|\]

Network Construction

The inferred adjacency matrix $\mathbf{A}$ has entries:

\[\begin{split}A_{ji} = \begin{cases} 1 & \text{if } \max_\tau |\hat{\beta}_{i,j}^{(\tau)}| > \epsilon \\ 0 & \text{otherwise} \end{cases}\end{split}\]

With optional lag information:

\[\begin{split}L_{ji} = \begin{cases} \tau_{i,j}^* & \text{if } A_{ji} = 1 \\ 0 & \text{otherwise} \end{cases}\end{split}\]

Regularization Parameter Selection

The choice of $\lambda$ critically affects the sparsity-accuracy tradeoff.

Cross-Validation Approach

Standard k-fold cross-validation minimizes prediction error:

\[\lambda^*_{CV} = \arg\min_\lambda \frac{1}{K} \sum_{k=1}^K \|\mathbf{y}_k^{\text{test}} - \mathbf{X}_k^{\text{test}}\hat{\boldsymbol{\beta}}_k(\lambda)\|_2^2\]

Information Criteria

Akaike Information Criterion (AIC): .. math:

\text{AIC}(\lambda) = n \log(\text{RSS}(\lambda)/n) + 2|\hat{\mathbf{S}}(\lambda)|

Bayesian Information Criterion (BIC): .. math:

\text{BIC}(\lambda) = n \log(\text{RSS}(\lambda)/n) + |\hat{\mathbf{S}}(\lambda)| \log n

where $\text{RSS}(\lambda) = \|\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}(\lambda)\|_2^2$ and $|\hat{\mathbf{S}}(\lambda)|$ is the number of selected predictors.

Stability Selection

For more robust selection, use stability selection across bootstrap samples:

\[\Pi_j(\lambda) = P(\beta_j(\lambda) \neq 0) = \frac{1}{B} \sum_{b=1}^B \mathbb{I}(\hat{\beta}_j^{(b)}(\lambda) \neq 0)\]

Select variables with :math:`Pi_j(lambda) geq pi_{text{thresh}}$ (typically 0.6-0.8).

Implementation Approaches

Standard LASSO Implementation

from sklearn.linear_model import LassoLarsIC, LassoCV
import numpy as np

def lasso_causal_discovery(data, max_lag=5, criterion='bic', alpha=None):
    """
    Discover causal network using LASSO regression.

    Parameters
    ----------
    data : array (T, n)
        Time series data
    max_lag : int
        Maximum lag to consider
    criterion : str
        Model selection criterion ('aic', 'bic', or 'cv')
    alpha : float or None
        Regularization parameter (if None, automatically selected)
    """
    T, n = data.shape

    # Create lagged design matrix
    X_lagged, Y_targets = create_lagged_matrices(data, max_lag)

    # Initialize results
    adjacency = np.zeros((n, n))
    coefficients = {}

    # Fit LASSO for each target variable
    for i in range(n):
        Y_i = Y_targets[:, i]

        if alpha is None:
            if criterion in ['aic', 'bic']:
                # Use information criterion for model selection
                lasso = LassoLarsIC(criterion=criterion,
                                  normalize=True,
                                  fit_intercept=True)
            else:
                # Use cross-validation
                lasso = LassoCV(cv=5, normalize=True, fit_intercept=True)
        else:
            from sklearn.linear_model import Lasso
            lasso = Lasso(alpha=alpha, normalize=True, fit_intercept=True)

        # Fit model
        lasso.fit(X_lagged, Y_i)

        # Extract causal relationships
        beta_i = lasso.coef_
        coefficients[i] = beta_i

        # Determine adjacency (reshape to (n, max_lag) structure)
        beta_reshaped = beta_i.reshape(n, max_lag)

        # Check for non-zero coefficients
        for j in range(n):
            if j != i:  # No self-loops
                if np.any(np.abs(beta_reshaped[j, :]) > 1e-8):
                    adjacency[j, i] = 1  # j -> i

    return adjacency, coefficients

Advanced LASSO Variants

Adaptive LASSO

Uses data-dependent weights to improve selection properties:

\[\hat{\boldsymbol{\beta}}_{\text{adaptive}} = \arg\min_{\boldsymbol{\beta}} \frac{1}{2n} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \sum_{j=1}^p \frac{1}{|\hat{\beta}_j^{\text{OLS}}|^\gamma} |\beta_j|\]

where $\hat{\boldsymbol{\beta}}^{\text{OLS}}$ are ordinary least squares estimates and $\gamma > 0$.

Group LASSO for Temporal Structure

Groups coefficients by variable across all lags:

\[\hat{\boldsymbol{\beta}}_{\text{group}} = \arg\min_{\boldsymbol{\beta}} \frac{1}{2n} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \sum_{j=1}^n \|\boldsymbol{\beta}_j\|_2\]

where $\boldsymbol{\beta}_j = [\beta_{j}^{(1)}, \ldots, \beta_{j}^{(\tau_{\max})}]^T$ contains all lag coefficients for variable $j$.

Elastic Net

Combines L1 and L2 penalties:

\[\hat{\boldsymbol{\beta}}_{\text{enet}} = \arg\min_{\boldsymbol{\beta}} \frac{1}{2n} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda_1 \|\boldsymbol{\beta}\|_1 + \lambda_2 \|\boldsymbol{\beta}\|_2^2\]

This addresses multicollinearity issues common in time series data.

Theoretical Properties

Consistency and Oracle Properties

Under appropriate conditions, LASSO achieves:

Selection Consistency: .. math:

P(\hat{\mathbf{S}} = \mathbf{S}_{\text{true}}) \to 1 \text{ as } n \to \infty

Parameter Consistency: .. math:

\|\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_{\text{true}}\|_2 = O_p(\sqrt{s \log p / n})

where $s = |\mathbf{S}_{\text{true}}|$ is the true sparsity level.

Conditions for Consistency

Key assumptions for theoretical guarantees:

Restricted Eigenvalue Condition: .. math:

\inf_{\boldsymbol{\delta} \in \mathcal{C}_s} \frac{\|\mathbf{X}\boldsymbol{\delta}\|_2^2}{n\|\boldsymbol{\delta}\|_2^2} \geq \phi_{\min} > 0

Sparsity: $s = o(n / \log p)$
Signal Strength: $\min_{j \in \mathbf{S}_{\text{true}}} |\beta_j| \geq c\sqrt{\log p / n}$
Regularization Choice: $\lambda \asymp \sqrt{\log p / n}$

Advantages and Limitations

Advantages

Computational Efficiency: Fast algorithms (coordinate descent, LARS)
High-Dimensional Capability: Handles $p >> n$ scenarios
Theoretical Guarantees: Well-established consistency theory
Interpretability: Sparse solutions with clear coefficients
Software Maturity: Robust, well-tested implementations
Automatic Selection: Built-in variable selection
Scalability: Efficient for very large datasets

Limitations

Linearity Assumption: Cannot detect nonlinear relationships
Correlation Issues: May select arbitrary variables from correlated groups
Causal Interpretation: Linear coefficients ≠ causal relationships
Temporal Assumptions: Assumes stationary, linear dynamics
No Significance Testing: No built-in statistical testing framework
Parameter Sensitivity: Results depend heavily on $\lambda$ choice

Comparison with Information-Theoretic Methods

Method Comparison
Aspect	LASSO	Standard oCSE	Information LASSO
Relationship Type	Linear only	Linear + Nonlinear	Mixed
Computational Speed	Very Fast	Slow	Moderate
High Dimensions	Excellent	Limited	Good
Statistical Testing	Limited	Rigorous	Developing
Theoretical Foundation	Mature	Strong (IT)	Emerging
Implementation	Simple	Complex	Moderate

When to Use LASSO Methods

Recommended Scenarios

Linear Systems: When relationships are primarily linear
High-Dimensional Data: $p >> n$ scenarios
Computational Constraints: Limited time/resources
Baseline Analysis: Initial exploration before sophisticated methods
Benchmarking: Comparison standard for other methods
Large-Scale Systems: Very large $n$, $p$, or $T$
Real-Time Applications: When fast inference is required

Avoid When

Nonlinear Systems: Complex, nonlinear relationships dominate
Small-Scale Problems: Information-theoretic methods are feasible
Causal Rigor Required: Need formal causal guarantees
Heterogeneous Data: Mixed data types or distributions

Best Practices

Preprocessing

Standardization: Center and scale variables to unit variance
Stationarity: Check and ensure stationarity (differencing if needed)
Outlier Detection: Remove or robust handling of outliers
Missing Data: Imputation or removal strategies

Model Selection

Cross-Validation: Use time series aware CV (e.g., time series split)
Information Criteria: BIC for conservative selection, AIC for liberal
Stability Selection: For robust variable selection
Path Analysis: Examine full regularization path

Post-Processing

Lag Consolidation: Combine multiple lags of same variable
Significance Assessment: Bootstrap or permutation-based confidence intervals
Network Validation: Compare with known relationships or other methods
Robustness Checks: Sensitivity analysis across parameter choices

Example Analysis

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LassoLarsIC

def analyze_lasso_path(data, target_var=0, max_lag=5):
    """Analyze LASSO regularization path for causal discovery."""

    # Prepare data
    X_lagged, Y_targets = create_lagged_matrices(data, max_lag)
    Y_target = Y_targets[:, target_var]

    # Fit LASSO path
    lasso = LassoLarsIC(criterion='bic', fit_intercept=True, normalize=True)
    lasso.fit(X_lagged, Y_target)

    # Extract selected variables
    selected_vars = np.where(lasso.coef_ != 0)[0]
    n_vars = data.shape[1]

    # Map back to (variable, lag) pairs
    selected_relationships = []
    for idx in selected_vars:
        var_idx = idx // max_lag
        lag_idx = idx % max_lag + 1  # lag starts from 1
        coeff = lasso.coef_[idx]
        selected_relationships.append((var_idx, lag_idx, coeff))

    # Print results
    print(f"Target Variable: {target_var}")
    print(f"Selected Relationships:")
    for var_idx, lag, coeff in selected_relationships:
        print(f"  Variable {var_idx} at lag {lag}: {coeff:.4f}")

    return selected_relationships, lasso

Integration with oCSE Framework

LASSO methods are integrated into the oCSE framework as:

Baseline Comparison: Standard benchmark for evaluation
Initial Screening: Fast preliminary variable selection
High-Dimensional Preprocessing: Dimension reduction before oCSE
Hybrid Approaches: Combined with information-theoretic methods
Validation Tool: Cross-validation of oCSE results

Future Directions

Research Areas

Nonlinear Extensions: Kernel LASSO, neural network regularization
Causal LASSO: Explicit causal objective functions
Time Series Adaptations: Specialized methods for temporal data
Robust Variants: Methods robust to outliers and model misspecification
Bayesian LASSO: Uncertainty quantification in variable selection

Methodological Improvements

Adaptive Regularization: Data-driven $\lambda$ selection
Group Structures: Better handling of temporal and cross-sectional grouping
Multi-Task Learning: Joint learning across multiple target variables
Online Methods: Streaming/online causal discovery
Distributed Computing: Scalable implementations for massive datasets

Conclusion

Pure LASSO methods provide a valuable computational and theoretical foundation for causal discovery in the oCSE framework. While limited to linear relationships, they offer unmatched computational efficiency and theoretical guarantees that make them essential tools for:

High-dimensional problems where information-theoretic methods are infeasible
Baseline comparisons and method evaluation
Initial screening in large-scale analyses
Systems where linear relationships dominate

The integration of LASSO methods with information-theoretic approaches represents a promising direction for combining computational efficiency with the ability to detect complex, nonlinear relationships. Understanding both approaches and their appropriate application domains is crucial for effective causal discovery in practice.