[1]:
%matplotlib inline
2.4. Multiple response linear regression#
We would like to use an example to show how the sparse-constrained optimization for multiple response linear regression works in our program.
2.4.1. Introduction#
Multiple response linear regression, also known as multivariate linear regression, is an extension of the classical linear regression model that allows for multiple dependent variables. Instead of modeling a single response variable, this approach models multiple response variables simultaneously, capturing the relationships between them and the set of predictor variables. This method is particularly useful when the response variables are correlated, as it accounts for the potential dependencies between them. Here are some examples of its practical applications:
Environmental science: In environmental science, multiple response linear regression is used to model the relationship between various environmental factors and multiple indicators of ecosystem health.
Finance and economics: Economists might use this method to model how interest rates, inflation, and unemployment rates jointly affect GDP growth, stock market returns, and consumer spending.
Engineering: In engineering, multiple response linear regression is applied in quality control and process optimization. Engineers might use this method to model how different manufacturing process parameters affect multiple quality metrics of a product, such as strength, durability, and efficiency.
The model is expressed as:
where \(y\) is an \(m\)-dimensional response variable, \(x\) is \(p\)-dimensional predictors, \(B \in R^{m \times p}\) is the sparse coefficient matrix, \(\epsilon\) is an \(m\)-dimensional random noise variable with zero mean.
With \(n\) independent data of the explanatory variables \(X\) and the response variable \(Y\), we can estimate $B^* $ by minimizing the objective function under sparsity constraint:
where \(|| B ||_ {0, 2}\) is the number of non-zero rows of \(B\).
Here is Python code for solving sparse multiple linear regression problem:
2.4.2. Data for multiple response linear regression#
We import necessary packages and set a seed.
[2]:
from skscope import ScopeSolver
import jax.numpy as jnp
import numpy as np
from sklearn.datasets import make_regression
[3]:
np.random.seed(123)
Firstly, we shall conduct multiple response linear regression on an artificial dataset for demonstration. The make_regression from sklearn.datasets function allows us to generate simulated data. The artificial dataset contains 500 observations and 20 predictors but only five predictors have influence on the three possible responses.
[4]:
n, p, k, m = 500, 20, 5, 3
x, y, coef = make_regression(n_samples=n, n_features=p, n_informative=k, n_targets=m, coef=True)
print('real variables\' index:\n', set(np.nonzero(coef)[0]))
print('real variables:\n', coef)
real variables' index:
{2, 9, 15, 16, 17}
real variables:
[[ 0. 0. 0. ]
[ 0. 0. 0. ]
[35.47424401 1.60554839 50.64398325]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[90.2768738 50.5206623 97.04288743]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[91.25666229 94.36559002 41.8819058 ]
[63.67968592 92.21002028 18.0026354 ]
[17.80823712 59.56585099 46.58491993]
[ 0. 0. 0. ]
[ 0. 0. 0. ]]
2.4.3. Solve the problem#
Secondly, to carry out sparse-constrained optimization for multiple response linear regression loss, we define the loss function multi_linear_objective accorting to 1 that matches the data generating function make_regression.
[5]:
def multi_linear_objective(params):
return jnp.sum(jnp.square(y - jnp.matmul(x, params.reshape((p, m)))))
We use skscope to solve the sparse multiple response linear regression problem After defining the data and the loss function, we can call ScopeSolver to solve the sparse-constrained optimization problem.
[6]:
solver = ScopeSolver(p * m, k, group=[i for i in range(p) for j in range(m)])
params = solver.solve(multi_linear_objective)
Now the solver.params contains the coefficients of multiple response linear regression model with no more than 5 variables. That is, those variables with a coefficient 0 is unused in the model:
[7]:
print(solver.params.reshape((p, m)))
[[ 0. 0. 0. ]
[ 0. 0. 0. ]
[35.4742443 1.60554861 50.64398376]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[90.27687424 50.52066226 97.04288718]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[91.25666144 94.36558946 41.88190469]
[63.67968542 92.21001962 18.00263606]
[17.80823746 59.56585047 46.58492027]
[ 0. 0. 0. ]
[ 0. 0. 0. ]]
We can further compare the coefficients estimated by skscope and the real coefficients in two-fold:
The true support set and the estimated support set
The true nonzero parameters and the estimated nonzero parameters
[8]:
print('real variables\' index:\n', set(np.nonzero(coef)[0]))
print('Estimated variables\' index:\n', set(np.nonzero(solver.params.reshape((p, m)))[0]))
real variables' index:
{2, 9, 15, 16, 17}
Estimated variables' index:
{2, 9, 15, 16, 17}
[9]:
print("Estimated parameter:\n", solver.params.reshape((p, m)))
print("True parameter:\n", coef)
Estimated parameter:
[[ 0. 0. 0. ]
[ 0. 0. 0. ]
[35.4742443 1.60554861 50.64398376]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[90.27687424 50.52066226 97.04288718]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[91.25666144 94.36558946 41.88190469]
[63.67968542 92.21001962 18.00263606]
[17.80823746 59.56585047 46.58492027]
[ 0. 0. 0. ]
[ 0. 0. 0. ]]
True parameter:
[[ 0. 0. 0. ]
[ 0. 0. 0. ]
[35.47424401 1.60554839 50.64398325]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[90.2768738 50.5206623 97.04288743]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
[91.25666229 94.36559002 41.8819058 ]
[63.67968592 92.21002028 18.0026354 ]
[17.80823712 59.56585099 46.58491993]
[ 0. 0. 0. ]
[ 0. 0. 0. ]]