[1]:
%matplotlib inline
2.4. Multiple response linear regression#
We would like to use an example to show how the sparse-constrained optimization for multiple response linear regression works in our program.
2.4.1. Introduction#
Multiple linear regression is also known as multi output linear regression, multivariate linear regression, or multidimensional linear regression, which is used to predict multiple dependent variables using a linear combination of multiple independent variables. This is different from simple linear regression, which is used to predict a single dependent variable using a single independent variable.
The model is expressed as:
where \(y\) is an \(m\)-dimensional response variable, \(x\) is \(p\)-dimensional predictors, \(B \in R^{m \times p}\) is the sparse coefficient matrix, \(\epsilon\) is an \(m\)-dimensional random noise variable with zero mean.
With \(n\) independent data of the explanatory variables \(X\) and the response variable \(Y\), we can estimate $B^* $ by minimizing the objective function under sparsity constraint:
where \(|| B ||_ {0, 2}\) is the number of non-zero rows of \(B\).
Here is Python code for solving sparse multiple linear regression problem:
2.4.2. Import necessary packages#
[2]:
from skscope import ScopeSolver
import jax.numpy as jnp
import numpy as np
from sklearn.datasets import make_regression
2.4.3. Set a seed#
[3]:
np.random.seed(123)
2.4.4. Generate the data#
Firstly, we shall conduct multiple response linear regression on an artificial dataset for demonstration. The make_regression from sklearn.datasets function allows us to generate simulated data. The artificial dataset contains 500 observations and 20 predictors but only five predictors have influence on the three possible responses.
[4]:
n, p, k, m = 500, 20, 5, 3
x, y, coef = make_regression(n_samples=n, n_features=p, n_informative=k, n_targets=m, coef=True)
print('real variables\' index:\n', set(np.nonzero(coef)[0]))
print('real variables:\n', coef)
real variables' index:
{2, 9, 15, 16, 17}
real variables:
[[ 0. 0. 0. ]
[ 0. 0. 0. ]
[35.47424401 1.60554839 50.64398325]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[90.2768738 50.5206623 97.04288743]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[91.25666229 94.36559002 41.8819058 ]
[63.67968592 92.21002028 18.0026354 ]
[17.80823712 59.56585099 46.58491993]
[ 0. 0. 0. ]
[ 0. 0. 0. ]]
2.4.5. Define multiple response linear regression loss#
Secondly, to carry out sparse-constrained optimization for multiple response linear regression loss, we define the loss function multi_linear_objective accorting to 1 that matches the data generating function make_regression.
[5]:
def multi_linear_objective(params):
return jnp.sum(jnp.square(y - jnp.matmul(x, params.reshape((p, m)))))
2.4.6. Use skscope to solve the sparse multiple response linear regression problem#
After defining the data and the loss function, we can call ScopeSolver to solve the sparse-constrained optimization problem.
[6]:
solver = ScopeSolver(p * m, k, group=[i for i in range(p) for j in range(m)])
params = solver.solve(multi_linear_objective)
Now the solver.params contains the coefficients of multiple response linear regression model with no more than 5 variables. That is, those variables with a coefficient 0 is unused in the model:
[7]:
print(solver.params.reshape((p, m)))
[[ 0. 0. 0. ]
[ 0. 0. 0. ]
[35.4742443 1.60554861 50.64398376]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[90.27687424 50.52066226 97.04288718]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[91.25666144 94.36558946 41.88190469]
[63.67968542 92.21001962 18.00263606]
[17.80823746 59.56585047 46.58492027]
[ 0. 0. 0. ]
[ 0. 0. 0. ]]
We can further compare the coefficients estimated by skscope and the real coefficients in two-fold:
The true support set and the estimated support set
The true nonzero parameters and the estimated nonzero parameters
[8]:
print('real variables\' index:\n', set(np.nonzero(coef)[0]))
print('Estimated variables\' index:\n', set(np.nonzero(solver.params.reshape((p, m)))[0]))
real variables' index:
{2, 9, 15, 16, 17}
Estimated variables' index:
{2, 9, 15, 16, 17}
[9]:
print("Estimated parameter:\n", solver.params.reshape((p, m)))
print("True parameter:\n", coef)
Estimated parameter:
[[ 0. 0. 0. ]
[ 0. 0. 0. ]
[35.4742443 1.60554861 50.64398376]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[90.27687424 50.52066226 97.04288718]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[91.25666144 94.36558946 41.88190469]
[63.67968542 92.21001962 18.00263606]
[17.80823746 59.56585047 46.58492027]
[ 0. 0. 0. ]
[ 0. 0. 0. ]]
True parameter:
[[ 0. 0. 0. ]
[ 0. 0. 0. ]
[35.47424401 1.60554839 50.64398325]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[90.2768738 50.5206623 97.04288743]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[91.25666229 94.36559002 41.8819058 ]
[63.67968592 92.21002028 18.0026354 ]
[17.80823712 59.56585099 46.58491993]
[ 0. 0. 0. ]
[ 0. 0. 0. ]]
[ ]: