============================
Information-Theoretic LASSO
============================

The Information-Theoretic LASSO (info-LASSO) method combines the information-theoretic 
foundation of optimal Causal Entropy with the regularization framework of LASSO regression. 
This hybrid approach aims to handle high-dimensional predictor spaces while maintaining 
the nonlinear relationship detection capabilities of information measures.

Mathematical Foundation
=======================

Traditional LASSO Objective
---------------------------

The standard LASSO optimization problem is:

.. math::

   \hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \frac{1}{2n} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_1

where:
- :math:`\mathbf{y} \in \mathbb{R}^n` is the response vector
- :math:`\mathbf{X} \in \mathbb{R}^{n \times p}` is the predictor matrix  
- :math:`\boldsymbol{\beta} \in \mathbb{R}^p` are the regression coefficients
- :math:`\lambda \geq 0` is the regularization parameter

Information-Theoretic Extension
------------------------------

The info-LASSO modifies this framework by incorporating information-theoretic weights 
and measures. The conceptual objective becomes:

.. math::

   \hat{\mathbf{S}} = \arg\min_{\mathbf{S} \subseteq \{1,\ldots,p\}} \left[ -\sum_{j \in \mathbf{S}} w_j \cdot I_j + \lambda |\mathbf{S}| \right]

where:
- :math:`\mathbf{S}` is the selected predictor subset
- :math:`w_j` are information-based weights
- :math:`I_j` represents the information contribution of predictor :math:`j`
- :math:`\lambda` controls the sparsity-information tradeoff

Information-Theoretic Weights
=============================

The weights :math:`w_j` are derived from conditional mutual information measures:

Base Weights
-----------

For each potential predictor :math:`X_j^{(t-\tau)}`:

.. math::

   w_{j,\tau} = \frac{I(X_j^{(t-\tau)}; X_i^{(t)} | \mathbf{Z}_i)}{\sum_{k,\tau'} I(X_k^{(t-\tau')}; X_i^{(t)} | \mathbf{Z}_i)}

This normalizes the conditional mutual information values to create relative importance weights.

Adaptive Weighting
------------------

The weights can be updated iteratively based on current selections:

.. math::

   w_{j,\tau}^{(k+1)} = w_{j,\tau}^{(k)} \cdot \exp\left(\alpha \cdot I(X_j^{(t-\tau)}; X_i^{(t)} | \mathbf{S}_i^{(k)})\right)

where:
- :math:`\mathbf{S}_i^{(k)}` is the current selection set at iteration :math:`k`
- :math:`\alpha` controls the adaptation rate

Significance-Based Weighting
----------------------------

Incorporate statistical significance from permutation tests:

.. math::

   w_{j,\tau} = I(X_j^{(t-\tau)}; X_i^{(t)} | \mathbf{Z}_i) \cdot \mathbb{I}(\text{p-value}_{j,\tau} < \alpha_{\text{thresh}})

where :math:`\mathbb{I}(\cdot)` is the indicator function and :math:`\text{p-value}_{j,\tau}` 
comes from permutation testing.

Algorithmic Approaches
=====================

There are several ways to implement information-theoretic LASSO:

Approach 1: Weighted LASSO with Information Weights
--------------------------------------------------

1. **Weight Computation:** Calculate information-theoretic weights for all predictors
2. **Weighted LASSO:** Solve the modified LASSO problem:

   .. math::

      \hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \frac{1}{2n} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \sum_{j=1}^p \frac{1}{w_j} |\beta_j|

3. **Selection:** Variables with :math:`\hat{\beta}_j \neq 0` are selected

Approach 2: Information-Guided Regularization Path
--------------------------------------------------

1. **Information Ranking:** Rank predictors by conditional mutual information
2. **Adaptive :math:`\lambda`:** Use different regularization for different predictors:

   .. math::

      \lambda_j = \lambda_0 \cdot \exp(-\gamma \cdot \text{rank}(I_j))

3. **Group Selection:** Apply group-wise regularization based on information content

Approach 3: Iterative Information-LASSO
---------------------------------------

Alternate between information computation and LASSO selection:

.. code-block:: none

   Initialize: S = ∅, Z = Z_init
   Repeat:
       1. Compute CMI for all candidates given current Z
       2. Update weights based on CMI values  
       3. Solve weighted LASSO with current weights
       4. Update selection S and conditioning set Z
   Until convergence

Implementation Framework
=======================

Two-Stage Implementation
-----------------------

**Stage 1: Information Assessment**

.. code-block:: python

   def compute_information_weights(X, Y, Z, method='gaussian'):
       """Compute information-theoretic weights for all predictors."""
       n_predictors = X.shape[1]
       weights = np.zeros(n_predictors)
       
       for j in range(n_predictors):
           X_j = X[:, [j]]
           cmi = conditional_mutual_information(X_j, Y, Z, method=method)
           weights[j] = cmi
           
       # Normalize weights
       weights = weights / np.sum(weights) if np.sum(weights) > 0 else weights
       return weights

**Stage 2: Weighted LASSO**

.. code-block:: python

   def information_lasso(X, Y, weights, lambda_reg=1.0):
       """Solve weighted LASSO problem with information weights."""
       from sklearn.linear_model import Lasso
       
       # Create penalty weights (inverse of information weights)
       penalty_weights = 1.0 / (weights + 1e-8)  # Add small constant for stability
       
       # Weighted features (approximate weighted penalty via feature scaling)
       X_weighted = X / penalty_weights.reshape(1, -1)
       
       # Fit LASSO
       lasso = Lasso(alpha=lambda_reg, fit_intercept=True)
       lasso.fit(X_weighted, Y.ravel())
       
       # Recover original coefficients
       beta_original = lasso.coef_ / penalty_weights
       
       # Select non-zero coefficients
       selected = np.where(np.abs(beta_original) > 1e-6)[0]
       return selected, beta_original

Adaptive Implementation
----------------------

.. code-block:: python

   def adaptive_information_lasso(X, Y, Z_init, max_iter=10, tol=1e-6):
       """Iterative information-guided LASSO selection."""
       n_predictors = X.shape[1]
       selected = []
       Z_current = Z_init.copy() if Z_init is not None else None
       
       for iteration in range(max_iter):
           # Compute current information weights
           weights = compute_information_weights(X, Y, Z_current)
           
           # Apply weighted LASSO
           new_selected, _ = information_lasso(X, Y, weights)
           
           # Check convergence
           if set(new_selected) == set(selected):
               break
               
           # Update selection and conditioning set
           selected = list(new_selected)
           if len(selected) > 0:
               Z_current = X[:, selected]
               if Z_init is not None:
                   Z_current = np.hstack([Z_init, Z_current])
           
       return selected

Theoretical Properties
=====================

Sparsity-Information Tradeoff
----------------------------

The info-LASSO balances two competing objectives:

.. math::

   \text{Information Gain} = \sum_{j \in \mathbf{S}} I(X_j^{(t-\tau_j)}; X_i^{(t)} | \mathbf{Z}_i)

.. math::

   \text{Complexity Cost} = \lambda |\mathbf{S}|

The optimal selection satisfies:

.. math::

   \mathbf{S}^* = \arg\max_{\mathbf{S}} \left[ \sum_{j \in \mathbf{S}} I_j - \lambda |\mathbf{S}| \right]

Consistency Properties
---------------------

Under appropriate conditions, the info-LASSO estimator has similar consistency properties 
to standard LASSO:

1. **Selection Consistency:** :math:`P(\hat{\mathbf{S}} = \mathbf{S}_{\text{true}}) \to 1` as :math:`n \to \infty`
2. **Estimation Consistency:** :math:`\|\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_{\text{true}}\|_2 \to 0` 

The key difference is that "true" relationships are defined by information-theoretic 
rather than linear relationships.

Oracle Properties
-----------------

With proper choice of :math:`\lambda_n`, the info-LASSO can achieve oracle properties:

.. math::

   \lambda_n = o(n^{-1/2}) \quad \text{and} \quad \lambda_n \sqrt{\log p} \to \infty

Advantages and Limitations
=========================

Advantages
----------

1. **High-Dimensional Capability:** Handles :math:`p >> n` scenarios better than pure oCSE
2. **Regularization:** Built-in protection against overfitting
3. **Computational Efficiency:** Leverages fast LASSO solvers
4. **Information Preservation:** Maintains information-theoretic relationships
5. **Flexibility:** Can incorporate various information measures
6. **Path Solutions:** Can explore entire regularization path

Limitations
-----------

1. **Linear Approximation:** LASSO stage assumes linear relationships
2. **Weight Sensitivity:** Performance depends on quality of information weights
3. **Parameter Tuning:** Requires selection of :math:`\lambda` parameter
4. **Implementation Complexity:** More complex than pure LASSO or pure oCSE
5. **Theoretical Gaps:** Limited theoretical analysis for information-theoretic variant

Hyperparameter Selection
=======================

Cross-Validation for :math:`\lambda`
-----------------------------------

Use information-theoretic criteria for model selection:

.. math::

   \lambda^* = \arg\min_\lambda \text{CV-Score}(\lambda)

where the CV-Score can be based on:
- Prediction error (traditional)
- Information criteria (AIC, BIC)
- Out-of-sample mutual information

Information Criteria
--------------------

Adapt traditional criteria to information-theoretic setting:

.. math::

   \text{AIC}_{\text{info}} = -2 \sum_{j \in \hat{\mathbf{S}}} I_j + 2|\hat{\mathbf{S}}|

.. math::

   \text{BIC}_{\text{info}} = -2 \sum_{j \in \hat{\mathbf{S}}} I_j + |\hat{\mathbf{S}}| \log n

Comparison with Standard oCSE
=============================

.. list-table:: Method Comparison
   :widths: 25 25 25 25
   :header-rows: 1

   * - Aspect
     - Standard oCSE
     - Info-LASSO
     - Hybrid Approach
   * - High Dimensions
     - Limited (p < n)
     - Good (p >> n)
     - Excellent
   * - Nonlinear Relations
     - Excellent
     - Limited
     - Good
   * - Computation Time
     - Slow
     - Fast
     - Medium
   * - Parameter Tuning
     - Minimal
     - Moderate
     - Complex
   * - Theoretical Foundation
     - Strong
     - Developing
     - Emerging

Use Case Guidelines
==================

When to Use Info-LASSO
----------------------

1. **High-Dimensional Data:** :math:`p > n` or :math:`p \approx n`
2. **Mixed Relationships:** Combination of linear and nonlinear dependencies  
3. **Computational Constraints:** Limited time for full oCSE analysis
4. **Regularization Needed:** Risk of overfitting with standard oCSE
5. **LASSO Familiarity:** Team comfortable with regularization approaches

When to Avoid Info-LASSO
------------------------

1. **Purely Nonlinear Systems:** Standard oCSE more appropriate
2. **Low-Dimensional Data:** Overhead not justified for small :math:`p`
3. **Theoretical Requirements:** Need rigorous information-theoretic guarantees
4. **Complex Conditioning:** Requires sophisticated conditioning strategies

Example Application
==================

Consider a high-dimensional time series system:

.. code-block:: python

   import numpy as np
   from causationentropy.core.discovery import discover_network
   
   # Generate high-dimensional data
   T, n = 500, 50  # p >> n scenario
   data = generate_high_dim_system(T, n, sparsity=0.1)
   
   # Apply different methods
   # Standard oCSE (may struggle with high dimensions)
   G_standard = discover_network(data, method='standard', max_lag=2)
   
   # Information LASSO
   G_info_lasso = discover_network(data, method='information_lasso', max_lag=2)
   
   # Pure LASSO (baseline)
   G_lasso = discover_network(data, method='lasso', max_lag=2)
   
   # Compare results
   compare_networks(G_standard, G_info_lasso, G_lasso, true_network)

Future Directions
=================

Research Opportunities
---------------------

1. **Theoretical Analysis:** Develop consistency theory for information-LASSO
2. **Adaptive Methods:** Dynamic weight adjustment during selection
3. **Group Information LASSO:** Extend to grouped variable selection
4. **Nonlinear Extensions:** Kernel or neural network variants
5. **Bayesian Formulation:** Probabilistic interpretation of information weights

Implementation Improvements
--------------------------

1. **Efficient Algorithms:** Specialized solvers for information-weighted problems
2. **Automatic Tuning:** Data-driven :math:`\lambda` selection methods
3. **Parallel Computing:** Distributed computation for large-scale problems
4. **Memory Optimization:** Efficient storage for high-dimensional cases

Conclusion
==========

Information-theoretic LASSO represents a promising direction for combining the strengths 
of information-based causal discovery with the computational and regularization advantages 
of LASSO methods. While still an active area of research, it offers practical solutions 
for high-dimensional causal discovery problems where traditional oCSE methods may struggle.

The approach is particularly valuable in scenarios where:
- Dimensionality exceeds sample size
- Computational efficiency is important
- Mixed linear/nonlinear relationships exist
- Regularization is desired

Future development will likely focus on theoretical foundations, algorithmic improvements, 
and extensions to more complex relationship structures.