=====================
Pure LASSO Methods
=====================

Pure LASSO methods represent the linear regression baseline for causal discovery in 
the Causation Entropy framework. While not information-theoretic in nature, these 
methods serve as important benchmarks and provide computationally efficient alternatives 
for linear systems. This section covers the theoretical foundation, implementation, and 
role of LASSO-based approaches in causal network inference.

Mathematical Foundation
=======================

Standard LASSO Formulation
--------------------------

The LASSO (Least Absolute Shrinkage and Selection Operator) solves the optimization problem:

.. math::

   \hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \frac{1}{2n} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_1

This can be equivalently formulated as a constrained optimization:

.. math::

   \hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \frac{1}{2n} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 \quad \text{subject to} \quad \|\boldsymbol{\beta}\|_1 \leq t

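where :math:`t \geq 0` is the constraint level: for every :math:`\lambda \geq 0` there is a data-dependent :math:`t` for which the two problems share a solution, and a larger :math:`\lambda` corresponds to a smaller :math:`t`.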
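
As a minimal sketch of this correspondence, one can fit the penalized problem and read off the implied constraint level :math:`t = \|\hat{\boldsymbol{\beta}}\|_1`. The sketch assumes scikit-learn's ``Lasso``, whose documented objective matches the penalized form above with ``alpha`` in the role of :math:`\lambda`. The synthetic data and the helper name ``penalized_objective`` are illustrative only.

.. code-block:: python

   import numpy as np
   from sklearn.linear_model import Lasso

   # Synthetic data (illustrative): 3 active predictors out of 10
   rng = np.random.default_rng(0)
   n_samples, p = 200, 10
   X = rng.standard_normal((n_samples, p))
   beta_true = np.zeros(p)
   beta_true[:3] = [1.5, -2.0, 0.8]
   y = X @ beta_true + 0.1 * rng.standard_normal(n_samples)

   def penalized_objective(beta, X, y, lam):
       """(1 / 2n) * ||y - X beta||_2^2 + lam * ||beta||_1."""
       n = X.shape[0]
       return np.sum((y - X @ beta) ** 2) / (2 * n) + lam * np.sum(np.abs(beta))

   lam = 0.1
   beta_hat = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_

   # Constraint level implied by this lambda
   t = np.sum(np.abs(beta_hat))

   # Unpenalized least squares for comparison (lambda = 0)
   beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
   print(f"t = {t:.3f}, ||beta_ols||_1 = {np.sum(np.abs(beta_ols)):.3f}")

The fitted :math:`t` never exceeds the L1 norm of the least-squares solution, which is the sense in which the penalty shrinks the coefficients.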

Causal Discovery Context
------------------------

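For causal discovery from time series, a separate LASSO problem is solved for each target variable :math:`i`. In this subsection :math:`n` denotes the number of variables (not the sample size, as in the formulation above); each regression has :math:`T - \tau_{\max}` observations and :math:`p = n\tau_{\max}` lagged predictors.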

**Target Variable:** :math:`X_i^{(t)}` for :math:`t = \tau_{\max} + 1, \ldots, T`

**Predictor Matrix:**

.. math::

   \mathbf{X}_{i,\text{lag}} = \begin{bmatrix}
   X_1^{(\tau_{\max})} & X_1^{(\tau_{\max}-1)} & \cdots & X_1^{(1)} & \cdots & X_n^{(\tau_{\max})} & \cdots & X_n^{(1)} \\
   X_1^{(\tau_{\max}+1)} & X_1^{(\tau_{\max})} & \cdots & X_1^{(2)} & \cdots & X_n^{(\tau_{\max}+1)} & \cdots & X_n^{(2)} \\
   \vdots & \vdots & \ddots & \vdots & \ddots & \vdots & \ddots & \vdots \\
   X_1^{(T-1)} & X_1^{(T-2)} & \cdots & X_1^{(T-\tau_{\max})} & \cdots & X_n^{(T-1)} & \cdots & X_n^{(T-\tau_{\max})}
   \end{bmatrix}

**Response Vector:**

.. math::

   \mathbf{y}_i = \begin{bmatrix} X_i^{(\tau_{\max}+1)} \\ X_i^{(\tau_{\max}+2)} \\ \vdots \\ X_i^{(T)} \end{bmatrix}

The resulting coefficient vector has structure:

.. math::

   \boldsymbol{\beta}_i = [\beta_{i,1}^{(1)}, \beta_{i,1}^{(2)}, \ldots, \beta_{i,1}^{(\tau_{\max})}, \ldots, \beta_{i,n}^{(1)}, \ldots, \beta_{i,n}^{(\tau_{\max})}]^T

where :math:`\beta_{i,j}^{(\tau)}` represents the influence of variable :math:`j` at lag :math:`\tau` on variable :math:`i`.

Causal Interpretation
=====================

Edge Detection
--------------

A directed edge from variable :math:`j` to variable :math:`i` at lag :math:`\tau` is inferred if:

.. math::

   |\hat{\beta}_{i,j}^{(\tau)}| > \epsilon

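where :math:`\epsilon` is a small numerical tolerance (for example :math:`10^{-8}`). LASSO sets excluded coefficients exactly to zero, so :math:`\epsilon` only guards against round-off.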

The strongest lag for each relationship can be determined by:

.. math::

   \tau_{i,j}^* = \arg\max_{\tau \in \{1,\ldots,\tau_{\max}\}} |\hat{\beta}_{i,j}^{(\tau)}|

Network Construction
--------------------

The inferred adjacency matrix :math:`\mathbf{A}` has entries:

.. math::

   A_{ji} = \begin{cases}
   1 & \text{if } \max_\tau |\hat{\beta}_{i,j}^{(\tau)}| > \epsilon \\
   0 & \text{otherwise}
   \end{cases}

With optional lag information:

.. math::

   L_{ji} = \begin{cases}
   \tau_{i,j}^* & \text{if } A_{ji} = 1 \\
   0 & \text{otherwise}
   \end{cases}
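
The two definitions above can be sketched in a few lines of NumPy. The layout of the coefficient array, ``beta[i, j, tau - 1]`` for the effect of variable :math:`j` at lag :math:`\tau` on target :math:`i`, and the function name ``build_network`` are assumptions made for this illustration.

.. code-block:: python

   import numpy as np

   def build_network(beta, eps=1e-8):
       """
       Build adjacency and lag matrices from LASSO coefficients.

       beta has shape (n, n, max_lag) with beta[i, j, tau - 1] the
       coefficient of variable j at lag tau in the regression for target i.
       """
       abs_beta = np.abs(beta)
       strength = abs_beta.max(axis=2)          # indexed [i, j]: max over lags
       best_lag = abs_beta.argmax(axis=2) + 1   # lags are numbered from 1
       A = (strength > eps).astype(int).T       # transpose: A[j, i] = 1 means j -> i
       L = np.where(A == 1, best_lag.T, 0)      # strongest lag where an edge exists
       return A, L

   # Toy coefficients for 3 variables and 2 lags
   beta = np.zeros((3, 3, 2))
   beta[1, 0, 1] = 0.7    # variable 0 -> variable 1 at lag 2
   beta[2, 1, 0] = -0.4   # variable 1 -> variable 2 at lag 1
   beta[2, 1, 1] = 0.1    # weaker lag-2 effect on the same edge

   A, L = build_network(beta)
   print(A)
   print(L)

As written, the formulas also mark self-lags (:math:`j = i`) on the diagonal; an implementation may choose to exclude them.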

Regularization Parameter Selection
==================================

The choice of :math:`\lambda` critically affects the sparsity-accuracy tradeoff.

Cross-Validation Approach
-------------------------

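K-fold cross-validation selects the :math:`\lambda` that minimizes the average held-out prediction error. For time series, the folds should respect temporal order so that each model is tested on data that follows its training window: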

.. math::

   \lambda^*_{CV} = \arg\min_\lambda \frac{1}{K} \sum_{k=1}^K \|\mathbf{y}_k^{\text{test}} - \mathbf{X}_k^{\text{test}}\hat{\boldsymbol{\beta}}_k(\lambda)\|_2^2
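
A minimal sketch of cross-validated selection with temporally ordered folds, assuming scikit-learn's ``LassoCV`` and ``TimeSeriesSplit``. The synthetic predictors stand in for a lagged design matrix and are illustrative only.

.. code-block:: python

   import numpy as np
   from sklearn.linear_model import LassoCV
   from sklearn.model_selection import TimeSeriesSplit

   # Illustrative data: rows are ordered in time, two active predictors
   rng = np.random.default_rng(1)
   n_samples, p = 300, 8
   X = rng.standard_normal((n_samples, p))
   y = 1.2 * X[:, 0] - 0.8 * X[:, 3] + 0.2 * rng.standard_normal(n_samples)

   # Each split trains on an initial segment and tests on the block after it
   cv = TimeSeriesSplit(n_splits=5)
   model = LassoCV(cv=cv).fit(X, y)

   lambda_cv = model.alpha_                 # selected regularization strength
   selected = np.flatnonzero(model.coef_)   # indices of retained predictors
   print(f"lambda_cv = {lambda_cv:.4f}, selected = {selected}")

Cross-validation targets prediction error, so it tends to keep more predictors than the information criteria described next.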

Information Criteria
--------------------

**Akaike Information Criterion (AIC):**

.. math::

   \text{AIC}(\lambda) = n \log(\text{RSS}(\lambda)/n) + 2|\hat{\mathbf{S}}(\lambda)|

**Bayesian Information Criterion (BIC):**

.. math::

   \text{BIC}(\lambda) = n \log(\text{RSS}(\lambda)/n) + |\hat{\mathbf{S}}(\lambda)| \log n

where :math:`n` is the number of observations, :math:`\text{RSS}(\lambda) = \|\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}(\lambda)\|_2^2`, and
:math:`|\hat{\mathbf{S}}(\lambda)|` is the number of selected predictors.
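
The two criteria can be evaluated directly from these formulas along a grid of :math:`\lambda` values. The sketch below assumes scikit-learn's ``lasso_path`` for the fits; the function name ``information_criteria_path``, the grid, and the synthetic data are illustrative choices.

.. code-block:: python

   import numpy as np
   from sklearn.linear_model import lasso_path

   def information_criteria_path(X, y, n_grid=50):
       """Evaluate AIC and BIC on a decreasing grid of lambda values."""
       n = X.shape[0]
       # lasso_path fits no intercept, so center the data first
       Xc = X - X.mean(axis=0)
       yc = y - y.mean()
       # Smallest lambda for which every coefficient is zero
       lam_max = np.max(np.abs(Xc.T @ yc)) / n
       lambdas = lam_max * np.logspace(0, -3, n_grid)
       lambdas, coefs, _ = lasso_path(Xc, yc, alphas=lambdas)
       rss = np.sum((yc[:, None] - Xc @ coefs) ** 2, axis=0)
       k = np.count_nonzero(coefs, axis=0)   # |S(lambda)|
       aic = n * np.log(rss / n) + 2 * k
       bic = n * np.log(rss / n) + k * np.log(n)
       return lambdas, k, aic, bic

   rng = np.random.default_rng(2)
   X = rng.standard_normal((200, 12))
   y = 1.0 * X[:, 1] - 0.7 * X[:, 5] + 0.3 * rng.standard_normal(200)

   lambdas, k, aic, bic = information_criteria_path(X, y)
   k_aic = k[np.argmin(aic)]
   k_bic = k[np.argmin(bic)]
   print(f"AIC keeps {k_aic} predictors, BIC keeps {k_bic}")

Because :math:`\log n > 2` once :math:`n \geq 8`, the BIC minimizer on a common grid never keeps more predictors than the AIC minimizer.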

Stability Selection
-------------------

For more robust selection, use stability selection across bootstrap samples:

.. math::

   \Pi_j(\lambda) = P(\beta_j(\lambda) \neq 0) = \frac{1}{B} \sum_{b=1}^B \mathbb{I}(\hat{\beta}_j^{(b)}(\lambda) \neq 0)

Select variables with :math:`\Pi_j(\lambda) \geq \pi_{\text{thresh}}` (typically 0.6-0.8).
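
A minimal sketch of the selection-frequency estimate, assuming scikit-learn's ``Lasso`` and a plain row-wise bootstrap. Resampling rows independently ignores serial dependence, so for time series a block-based resampling scheme may be more appropriate. The function name ``stability_selection`` and the synthetic data are illustrative.

.. code-block:: python

   import numpy as np
   from sklearn.linear_model import Lasso

   def stability_selection(X, y, lam, n_resamples=50, threshold=0.7, seed=0):
       """Estimate selection frequencies Pi_j(lam) over bootstrap samples."""
       rng = np.random.default_rng(seed)
       n, p = X.shape
       counts = np.zeros(p)
       for _ in range(n_resamples):
           idx = rng.integers(0, n, size=n)   # bootstrap: rows drawn with replacement
           coef = Lasso(alpha=lam).fit(X[idx], y[idx]).coef_
           counts += (coef != 0)
       freq = counts / n_resamples
       return freq, np.flatnonzero(freq >= threshold)

   rng = np.random.default_rng(3)
   X = rng.standard_normal((200, 10))
   y = 2.0 * X[:, 0] - 1.5 * X[:, 4] + 0.5 * rng.standard_normal(200)

   freq, stable = stability_selection(X, y, lam=0.1)
   print(np.round(freq, 2))
   print(stable)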

Implementation Approaches
=========================

Standard LASSO Implementation
-----------------------------

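.. code-block:: python

   import numpy as np
   from sklearn.linear_model import Lasso, LassoCV, LassoLarsIC
   from sklearn.preprocessing import StandardScaler


   def create_lagged_matrices(data, max_lag):
       """
       Build the lagged design matrix and the aligned targets.

       Columns are grouped by variable, then ordered by lag 1..max_lag,
       matching the coefficient layout described above.
       """
       T, n = data.shape
       # Slice for lag tau holds data[t - tau] for t = max_lag, ..., T - 1
       lagged = np.stack(
           [data[max_lag - tau:T - tau] for tau in range(1, max_lag + 1)],
           axis=2,
       )  # shape (T - max_lag, n, max_lag)
       X_lagged = lagged.reshape(T - max_lag, n * max_lag)
       Y_targets = data[max_lag:]
       return X_lagged, Y_targets


   def lasso_causal_discovery(data, max_lag=5, criterion='bic', alpha=None):
       """
       Discover causal network using LASSO regression.

       Parameters
       ----------
       data : array (T, n)
           Time series data
       max_lag : int
           Maximum lag to consider
       criterion : str
           Model selection criterion ('aic', 'bic', or 'cv')
       alpha : float or None
           Regularization parameter (if None, automatically selected)

       Returns
       -------
       adjacency : array (n, n)
           adjacency[j, i] = 1 if an edge j -> i was selected
       coefficients : dict
           Coefficient vector for each target, on the standardized scale
       """
       T, n = data.shape

       # Create lagged design matrix
       X_lagged, Y_targets = create_lagged_matrices(data, max_lag)

       # The ``normalize`` argument was removed in scikit-learn 1.2,
       # so the predictors are standardized explicitly instead.
       X_scaled = StandardScaler().fit_transform(X_lagged)

       # Initialize results
       adjacency = np.zeros((n, n))
       coefficients = {}

       # Fit LASSO for each target variable
       for i in range(n):
           Y_i = Y_targets[:, i]

           if alpha is not None:
               # Fixed, user-supplied regularization strength
               lasso = Lasso(alpha=alpha, fit_intercept=True)
           elif criterion in ('aic', 'bic'):
               # Use information criterion for model selection
               lasso = LassoLarsIC(criterion=criterion, fit_intercept=True)
           elif criterion == 'cv':
               # Use cross-validation
               lasso = LassoCV(cv=5, fit_intercept=True)
           else:
               raise ValueError(f"Unknown criterion: {criterion!r}")

           # Fit model
           lasso.fit(X_scaled, Y_i)

           # Extract causal relationships
           beta_i = lasso.coef_
           coefficients[i] = beta_i

           # Row j of the reshaped array holds lags 1..max_lag of variable j
           beta_reshaped = beta_i.reshape(n, max_lag)

           # Check for non-zero coefficients
           for j in range(n):
               if j != i:  # No self-loops
                   if np.any(np.abs(beta_reshaped[j, :]) > 1e-8):
                       adjacency[j, i] = 1  # j -> i

       return adjacency, coefficients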

Advanced LASSO Variants
=======================

Adaptive LASSO
--------------

Uses data-dependent weights to improve selection properties:

.. math::

   \hat{\boldsymbol{\beta}}_{\text{adaptive}} = \arg\min_{\boldsymbol{\beta}} \frac{1}{2n} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \sum_{j=1}^p \frac{1}{|\hat{\beta}_j^{\text{OLS}}|^\gamma} |\beta_j|

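where :math:`\hat{\boldsymbol{\beta}}^{\text{OLS}}` are ordinary least squares estimates and :math:`\gamma > 0`. Predictors with small initial estimates receive large penalties, while strong predictors are shrunk less. When :math:`p > n` and OLS is unavailable, another initial estimate (for example ridge regression) can take its place.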
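
The weighted problem can be solved with an ordinary LASSO solver by rescaling the columns of :math:`\mathbf{X}`. The sketch below assumes scikit-learn's ``Lasso`` and ``LinearRegression``; the function name ``adaptive_lasso``, the small constant ``eps``, and the synthetic data are illustrative assumptions.

.. code-block:: python

   import numpy as np
   from sklearn.linear_model import Lasso, LinearRegression

   def adaptive_lasso(X, y, lam, gamma=1.0, eps=1e-8):
       """Adaptive LASSO via column rescaling (requires n > p for the OLS step)."""
       beta_ols = LinearRegression().fit(X, y).coef_
       # scale_j = 1 / w_j, where w_j = 1 / |beta_ols_j|^gamma is the penalty weight
       scale = np.abs(beta_ols) ** gamma + eps
       # Substituting beta_j = scale_j * theta_j turns the weighted penalty
       # into a plain L1 penalty on theta
       theta = Lasso(alpha=lam).fit(X * scale, y).coef_
       return theta * scale

   rng = np.random.default_rng(4)
   X = rng.standard_normal((200, 6))
   beta_true = np.array([1.5, 0.0, -1.0, 0.0, 0.0, 0.0])
   y = X @ beta_true + 0.3 * rng.standard_normal(200)

   beta_adaptive = adaptive_lasso(X, y, lam=0.05)
   print(np.round(beta_adaptive, 3))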

Group LASSO for Temporal Structure
----------------------------------

Groups coefficients by source variable, so that all lags of a variable are selected or dropped together:

.. math::

   \hat{\boldsymbol{\beta}}_{\text{group}} = \arg\min_{\boldsymbol{\beta}} \frac{1}{2n} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \sum_{j=1}^n \|\boldsymbol{\beta}_j\|_2

where :math:`\boldsymbol{\beta}_j = [\beta_{j}^{(1)}, \ldots, \beta_{j}^{(\tau_{\max})}]^T` contains all lag coefficients for variable :math:`j`.
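
One way to solve this objective is proximal gradient descent with block soft-thresholding. The following NumPy sketch is an illustration under that assumption rather than a reference implementation; the function name ``group_lasso``, the fixed iteration count, and the synthetic data are hypothetical, and the column layout (variable-major, lags 1 to :math:`\tau_{\max}`) follows the coefficient vector defined earlier.

.. code-block:: python

   import numpy as np

   def group_lasso(X, y, groups, lam, n_iter=500):
       """Proximal gradient descent for the group LASSO objective."""
       n, p = X.shape
       beta = np.zeros(p)
       # Step size 1 / L, with L the Lipschitz constant of the smooth part
       step = n / np.linalg.norm(X, 2) ** 2
       for _ in range(n_iter):
           grad = X.T @ (X @ beta - y) / n
           z = beta - step * grad
           for g in groups:
               # Block soft-thresholding: shrink the whole group toward zero
               norm_g = np.linalg.norm(z[g])
               shrink = max(0.0, 1.0 - step * lam / norm_g) if norm_g > 0 else 0.0
               beta[g] = shrink * z[g]
       return beta

   # 4 variables with 3 lags each; only variable 1 drives the target
   n_vars, max_lag = 4, 3
   groups = [np.arange(j * max_lag, (j + 1) * max_lag) for j in range(n_vars)]

   rng = np.random.default_rng(5)
   X = rng.standard_normal((300, n_vars * max_lag))
   beta_true = np.zeros(n_vars * max_lag)
   beta_true[groups[1]] = [1.0, -0.5, 0.3]
   y = X @ beta_true + 0.1 * rng.standard_normal(300)

   beta_group = group_lasso(X, y, groups, lam=0.1)
   active = [j for j, g in enumerate(groups) if np.any(beta_group[g] != 0)]
   print(active)

Inactive groups are set exactly to zero as a block, which maps directly onto the presence or absence of an edge :math:`j \to i`.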

Elastic Net
-----------

Combines L1 and L2 penalties:

.. math::

   \hat{\boldsymbol{\beta}}_{\text{enet}} = \arg\min_{\boldsymbol{\beta}} \frac{1}{2n} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda_1 \|\boldsymbol{\beta}\|_1 + \lambda_2 \|\boldsymbol{\beta}\|_2^2

The L2 term stabilizes the solution when predictors are strongly correlated, as lagged copies of the same series typically are.
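
scikit-learn's ``ElasticNet`` parameterizes the same objective with ``alpha`` and ``l1_ratio`` instead of :math:`(\lambda_1, \lambda_2)`. Under its documented objective, the mapping is :math:`\texttt{alpha} = \lambda_1 + 2\lambda_2` and :math:`\texttt{l1\_ratio} = \lambda_1 / (\lambda_1 + 2\lambda_2)`. The sketch below applies that mapping to two nearly collinear predictors; the wrapper name ``elastic_net`` and the data are illustrative.

.. code-block:: python

   import numpy as np
   from sklearn.linear_model import ElasticNet

   def elastic_net(X, y, lam1, lam2):
       """Fit the (lam1, lam2) elastic net using scikit-learn's parameterization."""
       alpha = lam1 + 2.0 * lam2
       l1_ratio = lam1 / alpha
       return ElasticNet(alpha=alpha, l1_ratio=l1_ratio).fit(X, y)

   # Columns 0 and 1 are almost identical; column 2 is unrelated noise
   rng = np.random.default_rng(6)
   z = rng.standard_normal(300)
   X = np.column_stack([
       z + 0.01 * rng.standard_normal(300),
       z + 0.01 * rng.standard_normal(300),
       rng.standard_normal(300),
   ])
   y = 2.0 * z + 0.1 * rng.standard_normal(300)

   coef_enet = elastic_net(X, y, lam1=0.05, lam2=0.05).coef_
   print(np.round(coef_enet, 3))

The two correlated columns receive similar coefficients, whereas a pure L1 penalty may keep one and drop the other.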

Theoretical Properties
======================

Consistency and Oracle Properties
---------------------------------

Under appropriate conditions, LASSO achieves:

**Selection Consistency:**

.. math::

   P(\hat{\mathbf{S}} = \mathbf{S}_{\text{true}}) \to 1 \text{ as } n \to \infty

**Parameter Consistency:**

.. math::

   \|\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_{\text{true}}\|_2 = O_p(\sqrt{s \log p / n})

where :math:`s = |\mathbf{S}_{\text{true}}|` is the true sparsity level, :math:`n` is the number of observations, and :math:`p` is the number of candidate predictors.

Conditions for Consistency
--------------------------

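Key assumptions for theoretical guarantees are listed below. Conditions 1, 2, and 4 underpin the estimation rate; exact support recovery additionally relies on the signal-strength condition 3 and on an irrepresentable-type (mutual incoherence) condition on the design, which is not listed here: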

1. **Restricted Eigenvalue Condition:**

   .. math::

      \inf_{\boldsymbol{\delta} \in \mathcal{C}_s} \frac{\|\mathbf{X}\boldsymbol{\delta}\|_2^2}{n\|\boldsymbol{\delta}\|_2^2} \geq \phi_{\min} > 0

2. **Sparsity:** :math:`s = o(n / \log p)`

3. **Signal Strength:** :math:`\min_{j \in \mathbf{S}_{\text{true}}} |\beta_j| \geq c\sqrt{\log p / n}`

4. **Regularization Choice:** :math:`\lambda \asymp \sqrt{\log p / n}`

Advantages and Limitations
==========================

Advantages
----------

1. **Computational Efficiency:** Fast algorithms (coordinate descent, LARS)
2. **High-Dimensional Capability:** Handles :math:`p \gg n` scenarios
3. **Theoretical Guarantees:** Well-established consistency theory
4. **Interpretability:** Sparse solutions with clear coefficients
5. **Software Maturity:** Robust, well-tested implementations
6. **Automatic Selection:** Built-in variable selection
7. **Scalability:** Efficient for very large datasets

Limitations
-----------

1. **Linearity Assumption:** Cannot detect nonlinear relationships
2. **Correlation Issues:** May select arbitrary variables from correlated groups
3. **Causal Interpretation:** A nonzero regression coefficient indicates predictive association, not necessarily a causal relationship
4. **Temporal Assumptions:** Assumes stationary, linear dynamics
5. **No Significance Testing:** No built-in statistical testing framework
6. **Parameter Sensitivity:** Results depend heavily on :math:`\lambda` choice

Comparison with Information-Theoretic Methods
=============================================

.. list-table:: Method Comparison
   :widths: 20 25 25 30
   :header-rows: 1

   * - Aspect
     - LASSO
     - Standard oCSE
     - Information LASSO
   * - Relationship Type
     - Linear only
     - Linear + Nonlinear
     - Mixed
   * - Computational Speed
     - Very Fast
     - Slow
     - Moderate
   * - High Dimensions
     - Excellent
     - Limited
     - Good
   * - Statistical Testing
     - Limited
     - Rigorous
     - Developing
   * - Theoretical Foundation
     - Mature
     - Strong (IT)
     - Emerging
   * - Implementation
     - Simple
     - Complex
     - Moderate

When to Use LASSO Methods
=========================

Recommended Scenarios
---------------------

1. **Linear Systems:** When relationships are primarily linear
2. **High-Dimensional Data:** :math:`p \gg n` scenarios
3. **Computational Constraints:** Limited time/resources
4. **Baseline Analysis:** Initial exploration before sophisticated methods
5. **Benchmarking:** Comparison standard for other methods
6. **Large-Scale Systems:** Very large :math:`n`, :math:`p`, or :math:`T`
7. **Real-Time Applications:** When fast inference is required

Avoid When
----------

1. **Nonlinear Systems:** Complex, nonlinear relationships dominate
2. **Small-Scale Problems:** Information-theoretic methods are feasible
3. **Causal Rigor Required:** Need formal causal guarantees
4. **Heterogeneous Data:** Mixed data types or distributions

Best Practices
==============

Preprocessing
-------------

1. **Standardization:** Center and scale variables to unit variance
2. **Stationarity:** Check and ensure stationarity (differencing if needed)
3. **Outlier Detection:** Remove outliers or handle them with robust methods
4. **Missing Data:** Impute missing values or remove the affected observations

Model Selection
---------------

1. **Cross-Validation:** Use time series aware CV (e.g., time series split)
2. **Information Criteria:** BIC for conservative selection, AIC for liberal
3. **Stability Selection:** For robust variable selection
4. **Path Analysis:** Examine full regularization path

Post-Processing
---------------

1. **Lag Consolidation:** Combine multiple lags of the same variable into a single edge
2. **Significance Assessment:** Bootstrap confidence intervals or permutation tests
3. **Network Validation:** Compare with known relationships or other methods
4. **Robustness Checks:** Sensitivity analysis across parameter choices

Example Analysis
================

.. code-block:: python

   import numpy as np
   from sklearn.linear_model import LassoLarsIC
   from sklearn.preprocessing import StandardScaler

   def analyze_lasso_path(data, target_var=0, max_lag=5):
       """Analyze LASSO regularization path for causal discovery."""

       # Prepare data. create_lagged_matrices is the helper used by
       # lasso_causal_discovery; its columns are grouped by variable,
       # then ordered by lag 1..max_lag.
       X_lagged, Y_targets = create_lagged_matrices(data, max_lag)
       Y_target = Y_targets[:, target_var]

       # The ``normalize`` argument was removed in scikit-learn 1.2,
       # so the predictors are standardized explicitly instead.
       X_scaled = StandardScaler().fit_transform(X_lagged)

       # Fit the LARS path and keep the model with the lowest BIC
       lasso = LassoLarsIC(criterion='bic', fit_intercept=True)
       lasso.fit(X_scaled, Y_target)

       # Extract selected variables
       selected_vars = np.flatnonzero(lasso.coef_)

       # Map back to (variable, lag) pairs
       selected_relationships = []
       for idx in selected_vars:
           var_idx = idx // max_lag
           lag_idx = idx % max_lag + 1  # lag starts from 1
           coeff = lasso.coef_[idx]     # coefficient on the standardized scale
           selected_relationships.append((var_idx, lag_idx, coeff))

       # Print results
       print(f"Target Variable: {target_var}")
       print("Selected Relationships:")
       for var_idx, lag, coeff in selected_relationships:
           print(f"  Variable {var_idx} at lag {lag}: {coeff:.4f}")

       return selected_relationships, lasso

Integration with oCSE Framework
===============================

LASSO methods are integrated into the oCSE framework as:

1. **Baseline Comparison:** Standard benchmark for evaluation
2. **Initial Screening:** Fast preliminary variable selection  
3. **High-Dimensional Preprocessing:** Dimension reduction before oCSE
4. **Hybrid Approaches:** Combined with information-theoretic methods
5. **Validation Tool:** Cross-validation of oCSE results

Future Directions
=================

Research Areas
--------------

1. **Nonlinear Extensions:** Kernel LASSO, neural network regularization
2. **Causal LASSO:** Explicit causal objective functions
3. **Time Series Adaptations:** Specialized methods for temporal data
4. **Robust Variants:** Methods robust to outliers and model misspecification
5. **Bayesian LASSO:** Uncertainty quantification in variable selection

Methodological Improvements
---------------------------

1. **Adaptive Regularization:** Data-driven :math:`\lambda` selection
2. **Group Structures:** Better handling of temporal and cross-sectional grouping
3. **Multi-Task Learning:** Joint learning across multiple target variables
4. **Online Methods:** Streaming/online causal discovery
5. **Distributed Computing:** Scalable implementations for massive datasets

Conclusion
==========

Pure LASSO methods provide a valuable computational and theoretical foundation for 
causal discovery in the oCSE framework. While limited to linear relationships, they 
offer high computational efficiency and well-developed theoretical guarantees, which 
makes them useful for:

- High-dimensional problems where information-theoretic methods are infeasible
- Baseline comparisons and method evaluation
- Initial screening in large-scale analyses
- Systems where linear relationships dominate

The integration of LASSO methods with information-theoretic approaches represents a 
promising direction for combining computational efficiency with the ability to detect 
complex, nonlinear relationships. Understanding both approaches and their appropriate 
application domains is crucial for effective causal discovery in practice.