Glossary

This glossary defines key terms and concepts used throughout the Causation Entropy library and documentation.

Causal Discovery: The process of inferring causal relationships between variables from observational data, without direct experimental intervention. Distinguished from correlation analysis by attempting to identify directional, mechanistic relationships.
Optimal Causation Entropy (oCSE): An information-theoretic measure of causal influence based on conditional mutual information. Quantifies how much information a potential cause provides about an effect, beyond what is already known from other variables.
Conditional Mutual Information (CMI): A measure of mutual dependence between two variables given knowledge of a third variable (or set of variables). Mathematically:

\[I(X; Y | Z) = H(X | Z) - H(X | Y, Z)\]
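For discrete variables, the identity above can be checked numerically with plug-in (frequency-count) entropies. The following is a minimal sketch under that assumption, not the library's estimator; the function names `entropy` and `conditional_mutual_information` are hypothetical. It uses the XOR example, where X and Y are independent on their own but become fully dependent once Z = X XOR Y is known.

```python
from collections import Counter
from itertools import product
from math import log2


def entropy(samples):
    """Plug-in Shannon entropy (bits) of a list of hashable outcomes."""
    n = len(samples)
    return -sum((c / n) * log2(c / n) for c in Counter(samples).values())


def conditional_mutual_information(x, y, z):
    """I(X; Y | Z) = H(X | Z) - H(X | Y, Z), each term built from joint entropies."""
    h_x_given_z = entropy(list(zip(x, z))) - entropy(z)
    h_x_given_yz = entropy(list(zip(x, y, z))) - entropy(list(zip(y, z)))
    return h_x_given_z - h_x_given_yz


# Enumerate the four equally likely (x, y) pairs and set z = x XOR y.
pairs = list(product([0, 1], repeat=2))
x = [a for a, _ in pairs]
y = [b for _, b in pairs]
z = [a ^ b for a, b in pairs]

cmi_given_xor = conditional_mutual_information(x, y, z)           # 1 bit
cmi_given_const = conditional_mutual_information(x, y, [0] * 4)   # 0 bits, equals I(X; Y)
print(cmi_given_xor, cmi_given_const)
```

Conditioning on a constant reduces the conditional mutual information to the ordinary mutual information, which is zero here because X and Y are independent.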
Conditioning Set: The set of variables \(\mathbf{Z}\) that are held constant when computing conditional mutual information. In causal discovery, this typically includes confounding variables and previously selected predictors.
False Discovery Rate (FDR): The expected proportion of false positives among all discoveries (rejected null hypotheses). In causal discovery, this controls the expected fraction of incorrectly identified causal relationships.
False Positive Rate (FPR): The probability of incorrectly identifying a causal relationship when none exists. Also known as Type I error rate or \(1 - \text{specificity}\).

\[\text{FPR} = \frac{\text{FP}}{\text{FP} + \text{TN}}\]
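As an illustration, the FPR of an inferred network can be computed by comparing an estimated adjacency matrix against a known ground truth. This is a small sketch with hypothetical names (`confusion_counts`, `true_adj`, `est_adj`); it assumes binary adjacency matrices and excludes self-loops, which may differ from how the library scores results. The same counts also give the realized false discovery proportion FP / (FP + TP), whose expectation is the FDR.

```python
def confusion_counts(true_adj, est_adj):
    """Count TP, FP, TN, FN over the off-diagonal entries of two 0/1 matrices."""
    tp = fp = tn = fn = 0
    n = len(true_adj)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # assumption: self-loops are not scored
            truth, guess = bool(true_adj[i][j]), bool(est_adj[i][j])
            if guess and truth:
                tp += 1
            elif guess:
                fp += 1
            elif truth:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


# Ground truth: 0 -> 1 and 1 -> 2.  Estimate: 0 -> 1 and 0 -> 2.
true_adj = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
est_adj = [[0, 1, 1], [0, 0, 0], [0, 0, 0]]

tp, fp, tn, fn = confusion_counts(true_adj, est_adj)
fpr = fp / (fp + tn)  # 1 / (1 + 3) = 0.25
fdp = fp / (fp + tp)  # 1 / (1 + 1) = 0.5
print(fpr, fdp)
```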
Forward Selection: A greedy algorithm phase that iteratively selects the predictor variable with the highest conditional mutual information with the target, subject to statistical significance constraints.
Backward Elimination: A pruning phase that removes previously selected predictors that no longer maintain statistical significance when conditioned on all other selected variables.
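The two phases can be sketched together on a toy discrete data set. This is a simplified illustration, not the library's implementation: it uses a plug-in estimate of conditional mutual information and replaces the statistical significance test with a fixed threshold, and all names (`cmi`, `forward_selection`, `backward_elimination`, `threshold`) are hypothetical. In the example, the target is fully determined by `a` and `b`, while `p` is a noisy copy of the target.

```python
from collections import Counter
from math import log2


def entropy(columns):
    """Plug-in joint entropy (bits) of zero or more aligned discrete columns."""
    if not columns:
        return 0.0
    rows = list(zip(*columns))
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in Counter(rows).values())


def cmi(x, y, cond):
    """I(X; Y | Z) = H(X, Z) + H(Y, Z) - H(X, Y, Z) - H(Z)."""
    return (entropy([x] + cond) + entropy([y] + cond)
            - entropy([x, y] + cond) - entropy(cond))


def forward_selection(candidates, target, threshold):
    """Greedily add the candidate with the largest CMI given those already selected."""
    selected = []
    while len(selected) < len(candidates):
        cond = [candidates[s] for s in selected]
        scores = {name: cmi(col, target, cond)
                  for name, col in candidates.items() if name not in selected}
        best = max(scores, key=scores.get)
        if scores[best] <= threshold:  # stand-in for a significance test
            break
        selected.append(best)
    return selected


def backward_elimination(candidates, target, selected, threshold):
    """Drop predictors that add nothing once the other kept predictors are known."""
    kept = list(selected)
    for name in selected:
        others = [candidates[s] for s in kept if s != name]
        if cmi(candidates[name], target, others) <= threshold:
            kept.remove(name)
    return kept


# Toy data: target = 2a + b; p equals the target in 29 of every 32 rows.
target, a, b, p = [], [], [], []
for t in range(4):
    for noise in range(32):
        target.append(t)
        a.append(t // 2)
        b.append(t % 2)
        p.append(t if noise < 29 else (t + noise - 28) % 4)

candidates = {"p": p, "a": a, "b": b}
selected = forward_selection(candidates, target, threshold=0.01)
kept = backward_elimination(candidates, target, selected, threshold=0.01)
print(selected, kept)
```

Forward selection picks `p` first because it carries the most information about the target on its own, then adds `a` and `b` (in either order, since they tie). Backward elimination then removes `p`, which is redundant once both true parents are in the conditioning set.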
Granger Causality: A statistical concept of causality based on predictability: X is said to Granger-cause Y if past values of X contain information that helps predict Y beyond what is contained in past values of Y alone.
Information Criterion: A measure used for model selection that balances goodness of fit against model complexity. Common examples include AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion).
k-Nearest Neighbor (k-NN) Estimator: A non-parametric method for estimating probability densities and information measures based on distances to the k-th nearest neighbor in the data space.
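One well-known estimator of this type is the Kozachenko-Leonenko entropy estimator. The sketch below implements it for one-dimensional data only, using the form \(\hat{H} = \psi(N) - \psi(k) + \log 2 + \frac{1}{N}\sum_i \log r_i\), where \(r_i\) is the distance from sample \(i\) to its k-th nearest neighbor. It is an illustration under those assumptions rather than the library's estimator, and the names `knn_entropy_1d` and `digamma_int` are hypothetical.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329


def digamma_int(n):
    """Digamma function at a positive integer: psi(n) = -gamma + sum_{j<n} 1/j."""
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, n))


def knn_entropy_1d(samples, k=3):
    """Kozachenko-Leonenko entropy estimate (nats) for 1-D samples without ties."""
    s = sorted(samples)
    n = len(s)
    if n <= k:
        raise ValueError("need more samples than neighbors")
    log_dist_sum = 0.0
    for i, xi in enumerate(s):
        # In sorted 1-D data the k nearest neighbors lie within k positions of i.
        lo, hi = max(0, i - k), min(n, i + k + 1)
        dists = sorted(abs(xi - s[j]) for j in range(lo, hi) if j != i)
        log_dist_sum += math.log(dists[k - 1])
    return digamma_int(n) - digamma_int(k) + math.log(2.0) + log_dist_sum / n


rng = random.Random(0)
gaussian_samples = [rng.gauss(0.0, 1.0) for _ in range(2000)]
estimate = knn_entropy_1d(gaussian_samples, k=3)
exact = 0.5 * math.log(2.0 * math.pi * math.e)  # entropy of N(0, 1) in nats
print(estimate, exact)
```

For a standard normal sample the estimate should land close to the analytic value of about 1.419 nats, with the gap shrinking as the sample size grows.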
Kernel Density Estimation (KDE): A non-parametric method for estimating probability density functions by placing kernel functions (typically Gaussian) at each data point and summing their contributions.
Lag: The time delay \(\tau\) between a potential cause and its effect in time series analysis. A lag of \(\tau\) means the cause variable at time \(t-\tau\) potentially influences the effect variable at time \(t\).
LASSO (Least Absolute Shrinkage and Selection Operator): A regularization method that performs variable selection by adding an L1 penalty term to the loss function:

\[\min_\beta \frac{1}{2n}||y - X\beta||_2^2 + \lambda ||\beta||_1\]
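To make the effect of the L1 penalty concrete, the objective above can be minimized with cyclic coordinate descent, where each coefficient update is a soft-thresholding step. The following is a minimal pure-Python sketch (hypothetical names `soft_threshold` and `lasso_coordinate_descent`), not a production solver and not necessarily the solver the library relies on.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold, returning 0 inside [-threshold, threshold]."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimize (1 / 2n) * ||y - X beta||^2 + lam * ||beta||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X beta while beta is all zeros
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in X]
            z = sum(v * v for v in col) / n
            if z == 0.0:
                continue  # an all-zero column cannot explain anything
            # Correlation of column j with the residual that ignores beta[j].
            rho = sum(v * (r + v * beta[j]) for v, r in zip(col, residual)) / n
            new_beta = soft_threshold(rho, lam) / z
            delta = new_beta - beta[j]
            residual = [r - v * delta for v, r in zip(col, residual)]
            beta[j] = new_beta
    return beta


# Orthogonal design: ordinary least squares would give beta = (2.0, 0.1).
X = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
y = [2.0, -2.0, 0.1, -0.1]
beta = lasso_coordinate_descent(X, y, lam=0.1)
print(beta)
```

With \(\lambda = 0.1\), the strong coefficient is shrunk from 2.0 to 1.8 and the weak one is set exactly to zero, which is the variable-selection behavior the definition refers to.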
Maximum Lag: The maximum time delay \(\tau_{\max}\) considered in causal discovery. Variables are tested as potential causes at lags \(1, 2, \ldots, \tau_{\max}\).
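In practice, testing lags \(1, \ldots, \tau_{\max}\) means building one lagged copy of every variable per lag, aligned with the target times \(t = \tau_{\max}, \ldots, T-1\). The sketch below shows one way to do this with plain lists; the function name `lagged_candidates` and the dictionary layout are assumptions made for illustration, not the library's data format.

```python
def lagged_candidates(series, tau_max):
    """Map (name, tau) to the values of that variable at time t - tau,
    for every target time t = tau_max, ..., T - 1."""
    lengths = {len(values) for values in series.values()}
    if len(lengths) != 1:
        raise ValueError("all series must have the same length")
    n_time = lengths.pop()
    if not 1 <= tau_max < n_time:
        raise ValueError("tau_max must satisfy 1 <= tau_max < series length")
    return {
        (name, tau): values[tau_max - tau : n_time - tau]
        for name, values in series.items()
        for tau in range(1, tau_max + 1)
    }


series = {"x": [1, 2, 3, 4, 5], "y": [10, 20, 30, 40, 50]}
candidates = lagged_candidates(series, tau_max=2)
targets_x = series["x"][2:]  # x at t = 2, 3, 4

print(candidates[("x", 1)])  # x at t - 1: [2, 3, 4]
print(candidates[("y", 2)])  # y at t - 2: [10, 20, 30]
```

Every candidate column has the same length as the target column, so each row pairs an effect at time \(t\) with potential causes at \(t - \tau\).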
Mutual Information: A measure of mutual dependence between two variables, quantifying the amount of information obtained about one variable by observing another:

\[I(X; Y) = H(X) - H(X | Y) = H(Y) - H(Y | X)\]
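The symmetry in this identity can be verified directly on a small discrete sample. The sketch below uses plug-in entropies and a hypothetical helper `entropy`; it is meant only to illustrate the formula. Here Y agrees with X in six of eight observations.

```python
from collections import Counter
from math import log2


def entropy(*columns):
    """Plug-in joint entropy (bits) of one or more aligned discrete columns."""
    rows = list(zip(*columns))
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in Counter(rows).values())


x = [0, 0, 1, 1, 0, 1, 0, 1]
y = [0, 0, 1, 1, 0, 1, 1, 0]

h_x_given_y = entropy(x, y) - entropy(y)  # H(X | Y) = H(X, Y) - H(Y)
h_y_given_x = entropy(x, y) - entropy(x)  # H(Y | X) = H(X, Y) - H(X)

mi_from_x = entropy(x) - h_x_given_y
mi_from_y = entropy(y) - h_y_given_x
print(mi_from_x, mi_from_y)  # both about 0.19 bits
```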
Network Inference: The process of reconstructing the structure of a network (graph) from observational data on the nodes. In causal discovery, this involves identifying directed edges representing causal relationships.
Permutation Test: A non-parametric statistical test that assesses significance by comparing the observed test statistic to a distribution generated by randomly permuting the data under the null hypothesis.
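A minimal sketch of such a test is shown below. It shuffles one variable to break any dependence while preserving both marginal distributions, and reports the fraction of shuffles whose statistic is at least as large as the observed one. The absolute Pearson correlation is used as the test statistic purely for brevity; the names `permutation_p_value` and `abs_correlation` are hypothetical, and the library may permute and score differently.

```python
import math
import random


def abs_correlation(x, y):
    """Absolute Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return abs(sxy) / math.sqrt(sxx * syy)


def permutation_p_value(x, y, statistic, n_permutations=999, seed=0):
    """One-sided p-value for 'statistic is larger than expected under independence'."""
    rng = random.Random(seed)
    observed = statistic(x, y)
    shuffled = list(x)
    at_least_as_large = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # destroys any pairing between x and y
        if statistic(shuffled, y) >= observed:
            at_least_as_large += 1
    # The +1 terms count the observed arrangement itself, so p is never 0.
    return (at_least_as_large + 1) / (n_permutations + 1)


rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(50)]
y_dependent = [2.0 * v + rng.gauss(0.0, 1.0) for v in x]
y_independent = [rng.gauss(0.0, 1.0) for _ in x]

p_dependent = permutation_p_value(x, y_dependent, abs_correlation)
p_independent = permutation_p_value(x, y_independent, abs_correlation)
print(p_dependent, p_independent)
```

The dependent pair should yield a p-value near the smallest attainable value, 1 / (n_permutations + 1), while the independent pair yields a p-value that is uniformly distributed over repeated experiments.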
Statistical Significance: The probability that an observed relationship occurred by chance, typically assessed using p-values and compared to a significance level \(\alpha\) (commonly 0.05).
Time Series: A sequence of data points indexed by time, typically collected at successive, equally-spaced points in time.
Transfer Entropy: An information-theoretic measure of directed information transfer between time series, closely related to Granger causality but based on information theory rather than linear prediction.
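For discrete series with a history length of one step, transfer entropy from X to Y can be written as the conditional mutual information \(I(Y_t; X_{t-1} | Y_{t-1})\). The sketch below estimates it with plug-in entropies; the one-step history, the estimator, and the names `joint_entropy` and `transfer_entropy` are assumptions made for illustration.

```python
import random
from collections import Counter
from math import log2


def joint_entropy(*columns):
    """Plug-in joint entropy (bits) of one or more aligned discrete columns."""
    rows = list(zip(*columns))
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in Counter(rows).values())


def transfer_entropy(source, target):
    """Lag-1 transfer entropy I(target_t ; source_{t-1} | target_{t-1}) in bits."""
    now, past, source_past = target[1:], target[:-1], source[:-1]
    return (joint_entropy(now, past) + joint_entropy(source_past, past)
            - joint_entropy(now, source_past, past) - joint_entropy(past))


# x is a fair coin; y copies x with a delay of one step.
rng = random.Random(0)
x = [rng.randint(0, 1) for _ in range(2000)]
y = [0] + x[:-1]

te_x_to_y = transfer_entropy(x, y)  # close to 1 bit
te_y_to_x = transfer_entropy(y, x)  # close to 0 bits
print(te_x_to_y, te_y_to_x)
```

The asymmetry between the two directions is what makes transfer entropy a directed measure, unlike mutual information.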
True Positive Rate (TPR): The probability of correctly identifying a causal relationship when it exists. Also known as sensitivity, recall, or statistical power.

\[\text{TPR} = \frac{\text{TP}}{\text{TP} + \text{FN}}\]
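When networks are stored as sets of directed edges, the TPR is the fraction of true edges that were recovered. The sketch below assumes edges are `(cause, effect)` tuples, so a reversed edge does not count as a hit; the function name `true_positive_rate` is hypothetical.

```python
def true_positive_rate(true_edges, estimated_edges):
    """TPR = TP / (TP + FN) for two collections of directed (cause, effect) edges."""
    true_edges, estimated_edges = set(true_edges), set(estimated_edges)
    if not true_edges:
        raise ValueError("TPR is undefined when there are no true edges")
    tp = len(true_edges & estimated_edges)
    fn = len(true_edges - estimated_edges)
    return tp / (tp + fn)


true_edges = {("x1", "x2"), ("x2", "x3"), ("x1", "x3")}
estimated_edges = {("x1", "x2"), ("x2", "x3"), ("x3", "x1")}  # last edge is reversed

tpr = true_positive_rate(true_edges, estimated_edges)
print(tpr)  # 2 of 3 true edges recovered
```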
Vector Autoregression (VAR): A multivariate extension of autoregressive models where each variable is regressed on lagged values of itself and all other variables in the system:

\[\mathbf{x}_t = \mathbf{A}_1 \mathbf{x}_{t-1} + \cdots + \mathbf{A}_p \mathbf{x}_{t-p} + \boldsymbol{\epsilon}_t\]
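The recursion can be simulated directly, which is a common way to generate test data with a known causal structure. The following sketch uses nested lists and hypothetical names (`simulate_var`, `coeffs`, `initial`); entry `A[i][j]` of a coefficient matrix is the influence of variable `j` at the lagged time on variable `i` at time `t`. Gaussian noise is an assumption of this example.

```python
import random


def simulate_var(coeffs, initial, n_steps, noise_std=0.0, seed=0):
    """Simulate x_t = A_1 x_{t-1} + ... + A_p x_{t-p} + eps_t.

    coeffs  : list [A_1, ..., A_p] of d x d nested lists
    initial : list of p starting vectors, oldest first
    Returns the full history, including the initial vectors.
    """
    if len(initial) != len(coeffs):
        raise ValueError("need one initial vector per lag")
    rng = random.Random(seed)
    d = len(initial[0])
    history = [list(v) for v in initial]
    for _ in range(n_steps):
        new = [rng.gauss(0.0, noise_std) for _ in range(d)]
        for lag, A in enumerate(coeffs, start=1):
            past = history[-lag]
            for i in range(d):
                new[i] += sum(A[i][j] * past[j] for j in range(d))
        history.append(new)
    return history


# VAR(1) with two variables: variable 0 drives variable 1 (A[1][0] = 0.4).
A1 = [[0.5, 0.0],
      [0.4, 0.5]]
trajectory = simulate_var([A1], initial=[[1.0, 0.0]], n_steps=2)
print(trajectory)  # noise-free: [1, 0] -> [0.5, 0.4] -> [0.25, 0.4]
```

Setting `noise_std` above zero produces stochastic data, and the zero entry `A[0][1]` encodes the absence of an edge from variable 1 to variable 0.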

Mathematical Notation

Common mathematical symbols used throughout the documentation:

Mathematical Symbols
Symbol	Meaning
\(H(X)\)	Entropy of random variable X
\(I(X; Y)\)	Mutual information between X and Y
\(I(X; Y \mid Z)\)	Conditional mutual information between X and Y given Z
\(X^{(t)}\)	Variable X at time t
\(X_i^{(t-\tau)}\)	Variable i at time t-τ (lag τ)
\(\mathbf{Z}_i^{(t)}\)	Conditioning set for variable i at time t
\(\tau\)	Time lag
\(\tau_{\max}\)	Maximum lag considered
\(\alpha\)	Significance level (e.g., 0.05)
\(\lambda\)	Regularization parameter
\(\mathbf{A}\)	Adjacency matrix
\(\rho\)	Spectral radius or correlation coefficient
\(\epsilon\)	Error term or small constant
\(\psi(\cdot)\)	Digamma function
\(\Gamma(\cdot)\)	Gamma function
\(|\mathbf{M}|\)	Determinant of matrix M
\(\mathbf{I}_n\)	n×n identity matrix
\(\mathbb{E}[\cdot]\)	Expected value
\(\text{Var}(\cdot)\)	Variance
\(\text{Cov}(\cdot, \cdot)\)	Covariance

Abbreviations

Common Abbreviations
Abbreviation	Full Term
oCSE	optimal Causation Entropy
CMI	Conditional Mutual Information
MI	Mutual Information
KDE	Kernel Density Estimation
k-NN	k-Nearest Neighbor
KSG	Kraskov-Stögbauer-Grassberger (estimator)
LASSO	Least Absolute Shrinkage and Selection Operator
VAR	Vector Autoregression
AIC	Akaike Information Criterion
BIC	Bayesian Information Criterion
ROC	Receiver Operating Characteristic
AUC	Area Under Curve
TPR	True Positive Rate
FPR	False Positive Rate
FDR	False Discovery Rate
TE	Transfer Entropy
GC	Granger Causality