7.1. Supported Information Criteria and Cross Validation#

Information Criterion is a commonly used method for model selection, which balances the goodness of fit of the model with its complexity. The choice of an appropriate information criterion depends on the nature of the specific problem, the characteristics of the data, and the preferences of the researcher. Generally, a smaller information criterion indicates a better model. Below, we will introduce some commonly used information criteria and provide examples of how to call them in skscope.

[3]:

import numpy as np
import jax.numpy as jnp
from skscope import ScopeSolver, utilities
from sklearn.datasets import make_regression

import warnings
warnings.filterwarnings('ignore')

We take the linear regression model as an example. Let’s assume that the independent variables of the samples are denoted by $X \in \mathbb{R}^{n \times p}$, and the dependent variables of the samples are $y = X\beta^* + \varepsilon \in \mathbb{R}^p$, where $\varepsilon \in \mathbb{R}^p$ is the noise term. We set the sample size to $n=50$, the dimension to $p=50$ and the ture active size is $k=3$.

Next, we construct the samples X, y, and provide the loss function ols_loss().

[4]:

n, p, k = 50, 50, 3

np.random.seed(0)
X, y, beta = make_regression(n_samples=n, n_features=p, n_informative=k, coef=True)
print('Ture active set:', np.where(beta != 0)[0].tolist())

def ols_loss(params):
    loss = jnp.sum((y - X @ params) ** 2)
    return loss

Ture active set: [10, 20, 45]

7.1.1. Akaike Information Criterion (AIC)#

The Akaike Information Criterion (AIC) was proposed by Japanese statistician Hirotugu Akaike in 1974 [1] and is a commonly used method for model selection. AIC balances the goodness of fit of a model with its complexity by introducing a penalty term based on the sparsity of parameters onto the maximum likelihood estimation. The formula for AIC is given by:

\[\text{AIC} = -2\log(L) + 2k\]

Here, $L$ denotes the maximum likelihood function of the model, and $k$ represents the number of parameters in the model. In model selection, smaller AIC values indicate better fitting of the model to the data, while simultaneously favoring simpler models. Therefore, AIC serves as a tool for comparing different models and selecting the optimal one.

AIC can be applied to a wide range of models:

Multivariate linear regression:

Multivariate linear regression models the relationship between multiple independent variables $X \in \mathbb{R}^{n \times k}$ and a single dependent variable $Y \in \mathbb{R}^{n \times p}$. It’s represented by the equation $Y = X\beta + \epsilon$, where $\epsilon$ is the error term matrix.

According to [2], in multivariate linear regression, the models selected by AIC are consistent under certain conditions. These conditions include the following constraints on $(k, p, n)$:

\[\text{as }\{k, p, n\} \rightarrow \infty,\quad p / n \rightarrow c \in(0,1),\quad k / n \rightarrow \alpha \in(0,1-c).\]
Principal component analysis:

Principal Component Analysis (PCA) is a statistical technique used for dimensionality reduction and data compression while preserving most of the original information. It’s widely used in data preprocessing and exploratory data analysis.

Based on [3], in PCA, AIC consistently selects models under specific conditions, including the following constraints on $(p, n)$:

\[\text{as }\{p, n\} \rightarrow \infty,\quad p / n \rightarrow c \in(0,1) .\]

Below, we consider five models with sparsity levels of [1, 2, 3, 4, 5], and utilize AIC as the criterion for model selection within the skscope. Subsequently, we determine the sparsity levels and parameters chosen by AIC.

[5]:

def AIC(
    objective_value: float,
    dimensionality: int,
    effective_params_num: int,
    train_size: int,
):
    return (train_size * np.log(objective_value)
            + 2 * effective_params_num)


solver = ScopeSolver(
    dimensionality=p,
    sparsity=[1, 2, 3, 4, 5],
    sample_size=n,
    ic_method=AIC,
    )
params_scope = solver.solve(ols_loss, jit = True)

print('sparsity:', np.sum(params_scope != 0))
print('active set:', np.where(params_scope != 0)[0].tolist())

sparsity: 3
active set: [10, 20, 45]

7.1.2. Bayesian Information Criterion (BIC)#

The Bayesian Information Criterion (BIC), also known as the Schwarz Criterion, was proposed by Russian-American statistician Georgy Schwarz in 1978 [4]. Similar to AIC, BIC is a model selection criterion used to balance goodness of fit and model complexity.

The formula for BIC is given by:

\[\text{BIC} = -2\log(L) + k\log(n) ,\]

where $L$ is the maximized likelihood of the model, and $k$ is the number of parameters in the model. BIC, like AIC, incorporates a penalty term for model complexity. However, BIC introduces a penalty based on the sample size. This means that for datasets with larger sample sizes, BIC tends to favor simpler models to avoid overfitting. Conversely, for datasets with smaller sample sizes, BIC tends to favor more complex models because smaller datasets may benefit from models that capture more complexity.

BIC can also be applied to a wide range of models:

Multivariate linear regression:

Similarly, as stated in [2], in multivariate linear regression, the results selected by BIC exhibit consistency under certain conditions, which include:

\[\text{as }\{k, p, n\} \rightarrow \infty,\quad p / n \rightarrow c \in(0,1),\quad k / n \rightarrow \alpha \in(0,1-c).\]
Principal component analysis:

As mentioned in [3], in PCA, the outcomes chosen by BIC demonstrate consistency under specific conditions, comprising:

\[\text{as }\{p, n\} \rightarrow \infty,\quad p / n \rightarrow c \in(0,1) .\]

Below, we utilize BIC as the criterion for model selection within the skscope.

[6]:

def BIC(
    objective_value: float,
    dimensionality: int,
    effective_params_num: int,
    train_size: int,
):
    return (train_size * np.log(objective_value)
            + effective_params_num * np.log(train_size))


solver = ScopeSolver(
    dimensionality=p,
    sparsity=[1, 2, 3, 4, 5],
    sample_size=n,
    ic_method=BIC,
    )
params_scope = solver.solve(ols_loss, jit = True)

print('sparsity:', np.sum(params_scope != 0))
print('active set:', np.where(params_scope != 0)[0].tolist())

sparsity: 3
active set: [10, 20, 45]

7.1.3. Generalized Information Criterion (GIC)#

The form of the Generalized Information Criterion (GIC) is given by:

\[\text{GIC} = -2\log(L) + k\log(p)\log\log(n) ,\]

where $L$ is the maximized likelihood, and $k$ is the number of parameters. It can be observed that the penalty term of GIC is dependent on the dimension $p$, which allows GIC to still select relatively sparse solutions even when $p$ is large.

The modified extended Bayesian information criterion (MEBIC), proposed by in [5], is equivalent to GIC when sample size is sufficiently large.

GIC can be utilized in the following scenarios:

Linear regression:

skscope is a method based on the splicing algorithm. GIC used in the splicing algorithm for linear regression is consistent under certain conditions [6]. These conditions include constraints on the sample size $n$, dimension $p$, and the true sparsity level $s^*$:

\[\frac{s^*\log(p)\log\log(n)}{n}=o(1).\]
Single index models:

In a single index model, the relationship between the response variable $Y$ and the predictor variables $X_1, X_2, \ldots, X_p$ is assumed to be linear but with a special structure. Instead of directly modeling the response as a linear combination of the predictors, a single index $Z$ is introduced, which is a linear combination of the predictors:

\[Z = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p .\]

The response variable $Y$ is then modeled as a function of this single index:

\[Y = f(Z) + \varepsilon,\]

where $f(\cdot)$ is a link function and $\varepsilon$ is the error term.

GIC used in the splicing algorithm for single index models is consistent under certain conditions [7]. These conditions also include:

\[\frac{s^*\log(p)\log\log(n)}{n}=o(1).\]

Below, we utilize GIC as the criterion for model selection within the skscope.

[7]:

solver = ScopeSolver(
    dimensionality=p,
    sparsity=[1, 2, 3, 4, 5],
    sample_size=n,
    ic_method=utilities.GIC
    )
params_scope = solver.solve(ols_loss)

print('sparsity:', np.sum(params_scope != 0))
print('active set:', np.where(params_scope != 0)[0].tolist())

sparsity: 3
active set: [10, 20, 45]

7.1.4. Extended Bayesian Information Criterion (EBIC)#

The Extended Bayesian Information Criterion (EBIC) is an extension of the BIC, aiming to overcome some limitations when selecting high-dimensional models and providing a more flexible criterion for model selection.

In comparison to the traditional BIC, EBIC introduces an additional parameter $\gamma$ to control the penalty for model complexity. The formula for EBIC is as follows:

\[\text{EBIC} = -2\log(L) + k\log(n) + 2 \gamma k \log(p).\]

Here, $ L $ represents the maximized likelihood of the model, $ k $ is the number of parameters in the model, and $\gamma$ is an adjustable parameter.

Similar to BIC, EBIC considers both model complexity and sample size in the penalty term. However, EBIC introduces an extra parameter $\gamma$, allowing for greater flexibility in the model selection process. By adjusting the value of $\gamma$, a finer balance between goodness of fit and model complexity can be achieved. When $\gamma = 0$, EBIC reduces to BIC.

EBIC finds wide application in high-dimensional data analysis and model selection, especially in fitting sparse models or performing variable selection. By tuning the $\gamma$ parameter, more precise model selection tailored to the specific characteristics of the problem can be achieved, thereby enhancing the flexibility and accuracy of model selection.

EBIC can be applied to various models:

Linear regression:

In the context of linear regression, the results obtained using EBIC are consistent under certain conditions [8]. These conditions include:

\[p=O(n^\kappa),\]

for some constant $\kappa$.
Gaussian Graphical Models:

Gaussian graphical models (GGMs) represent dependencies between variables in multivariate Gaussian distributions using a graph structure. Nodes represent variables, and edges denote conditional dependencies. They are valuable for modeling high-dimensional data, such as in genetics and finance. GGMs are inferred by estimating the sparse precision matrix, capturing relationships after conditioning on other variables. They offer a powerful framework for understanding complex dependencies in data.

In the realm of GGMs, the outcomes obtained with EBIC remain consistent under specific conditions [9]. These conditions encompass:

\[\begin{split}\left\{\begin{array}{l} p=O\left(n^\kappa\right), p \rightarrow \infty, \\ (p+2 s^*) \log p \times \frac{\lambda_{\max }^2}{\theta_0^2}=o(n), \end{array}\right.\end{split}\]

Where $s^*$ represents the true sparsity level, and $\lambda_{\max}$ and $\theta_0$ are parameters associated with the true model, thus, The second condition can be seen to a certain extent $p\log p=o(n)$.

Below, we utilize EBIC as the criterion for model selection within the skscope.

[8]:

def EBIC(
    objective_value: float,
    dimensionality: int,
    effective_params_num: int,
    train_size: int,
):
    return (train_size * np.log(objective_value)
            + effective_params_num * np.log(train_size)
            + 2 * 0.1 * effective_params_num * np.log(dimensionality))


solver = ScopeSolver(
    dimensionality=p,
    sparsity=[1, 2, 3, 4, 5],
    sample_size=n,
    ic_method=EBIC,
    )
params_scope = solver.solve(ols_loss, jit = True)

print('sparsity:', np.sum(params_scope != 0))
print('active set:', np.where(params_scope != 0)[0].tolist())

sparsity: 3
active set: [10, 20, 45]

7.1.5. Cross Validation (CV)#

Cross validation [10] is a statistical method used to assess the predictive capability of a model. It is typically employed to evaluate how well a model performs on unseen data, thereby avoiding overfitting or underfitting during the model selection process. Cross validation involves repeatedly splitting the dataset into two subsets: a training set used to fit the model and a testing set used to evaluate its performance. Below, we will demonstrate how to invoke cross validation in skscope.

[9]:

def ols_loss_cv(params, data):
    return jnp.sum(
        jnp.square(data[1] - data[0] @ params)
    )

def split_method(data, index):
    return (data[0][index, :], data[1][index])

solver = ScopeSolver(
    dimensionality=p,
    sparsity=[1, 2, 3, 4, 5],
    sample_size=n,
    split_method=split_method,
    cv=10,
)
params_scope = solver.solve(ols_loss_cv, data=(X, y), jit = True)

print('sparsity:', np.sum(params_scope != 0))
print('active set:', np.where(params_scope != 0)[0].tolist())

sparsity: 5
active set: [10, 16, 20, 28, 45]

7.1.6. Comparison of Information Criteria and Cross Validation#

In scenarios with low dimensionality.

We continue to consider the linear regression model mentioned above, but with a sample size of $n=200$, dimensionality $p=10$. Below, we compare the results of different criteria in this scenario.

[10]:

n, p, k = 200, 10, 3

np.random.seed(0)
X, y, beta = make_regression(n_samples=n, n_features=p, n_informative=k, coef=True)
print('Ture active set:', np.where(beta != 0)[0].tolist())

# aic
solver = ScopeSolver(
    dimensionality=p,
    sparsity=[1, 2, 3, 4, 5],
    sample_size=n,
    ic_method=AIC
    )
params_scope = solver.solve(ols_loss)

print('sparsity of aic:', np.sum(params_scope != 0))
print('active set of aic:', np.where(params_scope != 0)[0].tolist())

# bic
solver = ScopeSolver(
    dimensionality=p,
    sparsity=[1, 2, 3, 4, 5],
    sample_size=n,
    ic_method=BIC
    )
params_scope = solver.solve(ols_loss)

print('sparsity of bic:', np.sum(params_scope != 0))
print('active set of bic:', np.where(params_scope != 0)[0].tolist())

# sic
solver = ScopeSolver(
    dimensionality=p,
    sparsity=[1, 2, 3, 4, 5],
    sample_size=n,
    ic_method=utilities.GIC
    )
params_scope = solver.solve(ols_loss)

print('sparsity of sic:', np.sum(params_scope != 0))
print('active set of sic:', np.where(params_scope != 0)[0].tolist())

# ebic
solver = ScopeSolver(
    dimensionality=p,
    sparsity=[1, 2, 3, 4, 5],
    sample_size=n,
    ic_method=EBIC
    )
params_scope = solver.solve(ols_loss)

print('sparsity of ebic:', np.sum(params_scope != 0))
print('active set of ebic:', np.where(params_scope != 0)[0].tolist())

# cv
solver = ScopeSolver(
    dimensionality=p,
    sparsity=[1, 2, 3, 4, 5],
    sample_size=n,
    split_method=split_method,
    cv=10,
)
params_scope = solver.solve(ols_loss_cv, data=(X, y), jit = True)

print('sparsity of cv:', np.sum(params_scope != 0))
print('active set:', np.where(params_scope != 0)[0].tolist())

Ture active set: [3, 5, 6]
sparsity of aic: 3
active set of aic: [3, 5, 6]
sparsity of bic: 3
active set of bic: [3, 5, 6]
sparsity of sic: 3
active set of sic: [3, 5, 6]
sparsity of ebic: 3
active set of ebic: [3, 5, 6]
sparsity of cv: 3
active set: [3, 5, 6]

In the above experiment, it can be seen that in low dimensions, all methods can obtain the true active set.

In scenarios with high dimensionality.

We consider the same linear regression model as mentioned above, but with a sample size of $n=50$, dimensionality $p=1000$ and sparsity $k=5$. Below, we compare the results of different criteria in this situation.

[26]:

n, p, k = 50, 1000, 5

np.random.seed(0)
X, y, beta = make_regression(n_samples=n, n_features=p, n_informative=k, coef=True)
print('Ture active set:', np.where(beta != 0)[0].tolist())

# aic
solver = ScopeSolver(
    dimensionality=p,
    sparsity=range(11),
    sample_size=n,
    ic_method=AIC
    )
params_scope = solver.solve(ols_loss)

print('sparsity of aic:', np.sum(params_scope != 0))
print('active set of aic:', np.where(params_scope != 0)[0].tolist())

# bic
solver = ScopeSolver(
    dimensionality=p,
    sparsity=range(11),
    sample_size=n,
    ic_method=BIC
    )
params_scope = solver.solve(ols_loss)

print('sparsity of bic:', np.sum(params_scope != 0))
print('active set of bic:', np.where(params_scope != 0)[0].tolist())

# sic
solver = ScopeSolver(
    dimensionality=p,
    sparsity=range(11),
    sample_size=n,
    ic_method=utilities.GIC
    )
params_scope = solver.solve(ols_loss)

print('sparsity of sic:', np.sum(params_scope != 0))
print('active set of sic:', np.where(params_scope != 0)[0].tolist())

# ebic
solver = ScopeSolver(
    dimensionality=p,
    sparsity=range(11),
    sample_size=n,
    ic_method=EBIC
    )
params_scope = solver.solve(ols_loss)

print('sparsity of ebic:', np.sum(params_scope != 0))
print('active set of ebic:', np.where(params_scope != 0)[0].tolist())

# cv
solver = ScopeSolver(
    dimensionality=p,
    sparsity=range(11),
    sample_size=n,
    split_method=split_method,
    cv=10,
)
params_scope = solver.solve(ols_loss_cv, data=(X, y), jit = True)

print('sparsity of cv:', np.sum(params_scope != 0))
print('active set:', np.where(params_scope != 0)[0].tolist())

Ture active set: [508, 539, 569, 636, 981]
sparsity of aic: 9
active set of aic: [331, 387, 428, 508, 539, 569, 579, 636, 981]
sparsity of bic: 9
active set of bic: [331, 387, 428, 508, 539, 569, 579, 636, 981]
sparsity of sic: 5
active set of sic: [508, 539, 569, 636, 981]
sparsity of ebic: 5
active set of ebic: [508, 539, 569, 636, 981]
sparsity of cv: 7
active set: [508, 539, 545, 569, 579, 636, 981]

In the above experiment, it can be observed that in high dimensions, GIC and EBIC can obtain the true active set, while AIC, BIC and cross-validation may select more variables.

7.1.7. Reference#

[1] Akaike, H. (1974). A new look at the statistical model identification. IEEE transactions on automatic control, 19(6), 716-723.

[2] Bai, Z., Fujikoshi, Y., & Hu, J. (2018). Strong consistency of the AIC, BIC, $ C_p $ and KOO methods in high-dimensional multivariate linear regression. arXiv preprint arXiv:1810.12609.

[3] Bai, Z., Choi, K. P., & Fujikoshi, Y. (2018). Consistency of AIC and BIC in estimating the number of significant components in high-dimensional principal component analysis. The Annals of Statistics, 46(3), 1050-1076.

[4] Schwarz, G. (1978). Estimating the dimension of a model. The annals of statistics, 461-464.

[5] Zhang, T. (2024). Variables selection using L0 penalty. Computational Statistics & Data Analysis, 190, 107860.

[6] Zhu, J., Wen, C., Zhu, J., Zhang, H., & Wang, X. (2020). A polynomial algorithm for best-subset selection problem. Proceedings of the National Academy of Sciences, 117(52), 33117-33123.

[7] Tang, B., Zhu, J., Zhu, J., Wang, X., & Zhang, H. (2023). A Consistent and Scalable Algorithm for Best Subset Selection in Single Index Models. arXiv preprint arXiv:2309.06230.

[8] Chen, J., & Chen, Z. (2008). Extended Bayesian information criteria for model selection with large model spaces. Biometrika, 95(3), 759-771.

[9] Foygel, R., & Drton, M. (2010). Extended Bayesian information criteria for Gaussian graphical models. Advances in neural information processing systems, 23.

[10] Hastie, T., Tibshirani, R., Friedman, J. H., & Friedman, J. H. (2009). The elements of statistical learning: data mining, inference, and prediction (Vol. 2, pp. 1-758). New York: springer.