1.1. Linear Regression

1.1.1. Introduction

Linear regression models the dependent variable as a linear function of the independent variables and seeks the linear function that best fits the data. Suppose we collect \(n\) independent observations of a response variable and \(p\) explanatory variables, denoted \(y \in R^n\) and \(X \in R^{n\times p}\). Let \(\epsilon_1, \ldots, \epsilon_n\) be i.i.d. zero-mean random noise terms and \(\epsilon = (\epsilon_1, \ldots, \epsilon_n)^{\top}\). With \(\beta^{*} \in R^p\) denoting the true coefficient vector, the linear model takes the form:

\[y=X \beta^{*} +\epsilon.\]

However, linear regression on high-dimensional data faces several challenges, such as:

Computational efficiency: As the number of independent variables increases, the computational complexity of the model also increases. Including all independent variables in the model can result in long computation times and high memory usage. Variable selection improves computational efficiency by reducing the number of variables considered.
Feature correlation: In high-dimensional data, there may be strong correlations between independent variables. These redundant features cause model instability and multicollinearity. Variable selection eliminates redundant features with weak correlations to the target variable, reducing the impact of correlation.
Interpretability: In practical applications, interpretability of the model is crucial. Variable selection removes irrelevant or unimportant variables, making the model easier to understand and explain. This allows us to identify the factors that truly influence the target variable.
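
The feature-correlation point can be illustrated numerically. The following is a minimal sketch, assuming a synthetic design in which a hypothetical third predictor `x3` is a near copy of `x1`; it compares the condition number of the Gram matrix with and without the redundant column. The variable names and the noise scale `0.01` are illustrative choices, not part of the original example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Two independent predictors.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X_indep = np.column_stack([x1, x2])

# Hypothetical redundant feature: almost an exact copy of x1.
x3 = x1 + 0.01 * rng.normal(size=n)
X_corr = np.column_stack([x1, x2, x3])

# Condition number of the Gram matrix X^T X / n.
# A large value means the least-squares solution is numerically unstable.
cond_indep = np.linalg.cond(X_indep.T @ X_indep / n)
cond_corr = np.linalg.cond(X_corr.T @ X_corr / n)

print(f"independent features: {cond_indep:.1f}")
print(f"with redundant copy:  {cond_corr:.1f}")
```

Dropping either `x1` or `x3` restores a well-conditioned problem, which is the effect variable selection aims for.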

To address these challenges, variable selection is necessary in high-dimensional linear regression. By selecting relevant and meaningful independent variables, we can reduce the complexity of the model, improve predictive performance, and gain better insights into the relationships within the data.

We can consider minimizing the loss function under suitable sparse constraint conditions to obtain appropriate parameter estimates. In other words, we can formulate the problem as follows:

\[\begin{split}\begin{aligned} \arg\min_{\beta} & L(\beta)=\frac1{2n}\| y-X\beta \|_2^2 \\ &\text{subject to:} \; \| \beta \|_0 \leq s ,\\ \end{aligned}\end{split}\]

where \(\| \beta \|_0\) denotes the \(l_0\) norm of \(\beta\), that is, the number of non-zero entries of \(\beta\), and \(s\) is the desired sparsity level.
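
For very small \(p\), the sparsity-constrained problem can be solved by exhaustive search, which makes the formulation concrete. The sketch below is not how skscope works internally; it is a brute-force illustration under assumed toy settings (`n = 100`, `p = 8`, `s = 3`), and the helper name `best_subset` is hypothetical.

```python
import itertools
import numpy as np

def best_subset(X, y, s):
    """Exhaustively minimise (1/2n)||y - X b||^2 subject to ||b||_0 <= s.

    Only supports of size exactly s are enumerated: enlarging a support
    never increases the least-squares loss, so smaller supports cannot win.
    """
    n, p = X.shape
    best_loss, best_beta = np.inf, np.zeros(p)
    for support in itertools.combinations(range(p), s):
        cols = list(support)
        # Ordinary least squares restricted to the candidate support.
        coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
        resid = y - X[:, cols] @ coef
        loss = resid @ resid / (2 * n)
        if loss < best_loss:
            best_loss = loss
            best_beta = np.zeros(p)
            best_beta[cols] = coef
    return best_beta, best_loss

# Toy data with three active coefficients.
rng = np.random.default_rng(0)
n, p, s = 100, 8, 3
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [1, 2, 3]
y = X @ beta_true + rng.normal(0, 0.1, n)

beta_hat, loss = best_subset(X, y, s)
print("selected support:", np.flatnonzero(beta_hat))
```

The number of candidate supports grows combinatorially in \(p\), so this approach is only feasible for tiny problems; that cost is the reason dedicated sparse solvers are used in practice.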

1.1.2. Examples

Next, we use skscope to solve the problem above and compare it with the Lasso [1], a commonly used regularization technique that selects relevant features and reduces model complexity through an \(l_1\) sparsity penalty, thereby improving model interpretability and generalization. The Lasso regularization parameter is chosen by 5-fold cross-validation.

First, let’s consider the case with no intercept.

[1]:

import numpy as np
import jax.numpy as jnp
import matplotlib.pyplot as plt
import seaborn as sns
from skscope import ScopeSolver
from sklearn.linear_model import LassoCV

import warnings
warnings.filterwarnings('ignore')

We work with a dataset of size \(n = 150\) and dimension \(p = 30\). Both the design matrix \(X\) and the noise \(\epsilon\) are drawn from normal distributions. The true coefficient vector is \(\beta^{*} = (1, 2, 3, 0, ..., 0)^{\top}\), so its support set consists of the first three positions. Now, let’s construct the samples.

[2]:

n, p = 150, 30
rng = np.random.default_rng(0)
X = rng.normal(0, 1, (n, p))
beta = np.zeros(p)
beta[:3] = [1, 2, 3]
y = X @ beta + rng.normal(0, 0.1, n)

Next, we use skscope and Lasso to estimate the parameters.

[3]:

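def ols_loss(params):
    # Mean squared error; it has the same minimizer as the 1/(2n) loss above.
    loss = jnp.mean((y - X @ params) ** 2)
    return loss

# Search for a coefficient vector with 3 non-zero entries among p candidates.
solver = ScopeSolver(p, sparsity=3)
params_scope = solver.solve(ols_loss)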

lasso_cv = LassoCV(cv=5, fit_intercept=False)
lasso_cv.fit(X, y)
params_lasso = lasso_cv.coef_

Subsequently, we compute the squared estimation error \(\| \hat{\beta}-\beta^{*} \|_2^2\) of the estimates obtained from the two methods.

[4]:

print('scope: ', np.sum((params_scope-beta) ** 2).round(4))
print('lasso: ', np.sum((params_lasso-beta) ** 2).round(4))

scope:  0.0003
lasso:  0.001
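
The quantity printed above is the squared distance between each estimate and the true coefficient vector. A complementary check for a variable selection method is whether the estimated support matches the true one. The following sketch uses hypothetical toy vectors (not the estimates from the cells above) and illustrative helper names.

```python
import numpy as np

def estimation_error(beta_hat, beta_true):
    """Squared l2 distance between estimated and true coefficients."""
    return float(np.sum((beta_hat - beta_true) ** 2))

def support_recovered(beta_hat, beta_true):
    """True if the estimate is non-zero in exactly the same positions as the truth."""
    return bool(np.array_equal(np.flatnonzero(beta_hat), np.flatnonzero(beta_true)))

# Hypothetical toy values chosen for illustration only.
beta_true = np.array([1.0, 2.0, 3.0, 0.0, 0.0])
beta_sparse = np.array([1.01, 1.99, 3.0, 0.0, 0.0])   # exactly 3 non-zero entries
beta_dense = np.array([0.98, 1.97, 2.98, 0.01, 0.0])  # shrunken, one spurious entry

for name, est in [("sparse", beta_sparse), ("dense", beta_dense)]:
    print(name, round(estimation_error(est, beta_true), 4), support_recovered(est, beta_true))
```

Applied to the estimates above, the same helpers could be called as `support_recovered(params_scope, beta)` and `support_recovered(params_lasso, beta)`.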

Next, let’s consider the case with an intercept term.

We now consider the regression model \(y=\beta_0^*+X \beta^{*} +\epsilon\), where \(\beta_0^*\) is the intercept term. We set \(\beta_0^*=1\) and otherwise use the same settings as before to construct the samples.

[5]:

n, p = 150, 30
rng = np.random.default_rng(0)
X = rng.normal(0, 1, (n, p))
beta = np.zeros(p)
beta[:3] = [1, 2, 3]
# The noise mean of 1 plays the role of the intercept beta_0 = 1.
y = X @ beta + rng.normal(1, 0.1, n)

Next, we use skscope and Lasso to estimate the parameter \(\beta^{*}\) and the intercept term \(\beta_0^*\).

[6]:

lasso_cv = LassoCV(cv=5)
lasso_cv.fit(X, y)
intercept_lasso = lasso_cv.intercept_
params_lasso = lasso_cv.coef_

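# Prepend a column of ones; ols_loss reads the global X, so it now fits an intercept.
X = np.hstack((np.ones((n, 1)), X))
# sparsity=4: three slopes plus the intercept (index 0, passed as preselect).
solver = ScopeSolver(p + 1, sparsity=4, preselect=0)
scope_estimate = solver.solve(ols_loss)
intercept_scope = scope_estimate[0]
params_scope = scope_estimate[1:]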
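
The cell above handles the intercept by prepending a column of ones to \(X\), so that the first coefficient acts as \(\beta_0\); the `preselect=0` argument presumably keeps that column in the selected support (an assumption about skscope that is not verified here). The sketch below uses plain least squares on a small assumed dataset (`p = 3`) to show that the column-of-ones approach gives the same fit as centering \(X\) and \(y\) and recovering the intercept afterwards.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 150, 3
X = rng.normal(size=(n, p))
beta = np.array([1.0, 2.0, 3.0])
intercept = 1.0
y = intercept + X @ beta + rng.normal(0, 0.1, n)

# Route 1: prepend a column of ones; coefficient 0 is then the intercept.
X_aug = np.hstack((np.ones((n, 1)), X))
coef_aug, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

# Route 2: centre X and y, fit the slopes, then recover the intercept
# as mean(y) - mean(X) @ slopes.
X_c = X - X.mean(axis=0)
y_c = y - y.mean()
slopes, *_ = np.linalg.lstsq(X_c, y_c, rcond=None)
intercept_centered = y.mean() - X.mean(axis=0) @ slopes

print("intercept (column of ones):", coef_aug[0])
print("intercept (centering):     ", intercept_centered)
```

Both routes agree up to floating-point error; the column-of-ones form is convenient when the solver only accepts a single parameter vector.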

Afterward, we compute the squared estimation error \(\| \hat{\beta}-\beta^{*} \|_2^2\) of the coefficient estimates obtained from the two methods.

[7]:

print('scope: ', np.sum((params_scope-beta) ** 2).round(4))
print('lasso: ', np.sum((params_lasso-beta) ** 2).round(4))

scope:  0.0004
lasso:  0.0011

Below are the absolute differences between the intercept term obtained by these two methods and its actual value.

[8]:

print('scope: ', np.abs(intercept_scope - 1).round(4))
print('lasso: ', np.abs(intercept_lasso - 1).round(4))

scope:  0.0082
lasso:  0.0106

Summary

In both cases, with and without an intercept, the squared estimation error of the skscope estimates is smaller than that of the Lasso estimates, and the skscope intercept estimate is also closer to the true value. This suggests that skscope is a suitable method for variable selection in high-dimensional linear regression.

1.1.3. Reference

[1] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1), 267-288.