[1]:

%matplotlib inline

2.4. Multiple response linear regression#

We would like to use an example to show how the sparse-constrained optimization for multiple response linear regression works in our program.

2.4.1. Introduction#

Multiple linear regression is also known as multi output linear regression, multivariate linear regression, or multidimensional linear regression, which is used to predict multiple dependent variables using a linear combination of multiple independent variables. This is different from simple linear regression, which is used to predict a single dependent variable using a single independent variable.

The model is expressed as:

\[y = B^* x + \epsilon,\]

where $y$ is an $m$-dimensional response variable, $x$ is $p$-dimensional predictors, $B \in R^{m \times p}$ is the sparse coefficient matrix, $\epsilon$ is an $m$-dimensional random noise variable with zero mean.

With $n$ independent data of the explanatory variables $X$ and the response variable $Y$, we can estimate $B^* $ by minimizing the objective function under sparsity constraint:

\[arg\min_{B}L(B) := ||Y-B X||^2, s.t. || B ||_ {0,2} \leq s, \tag{1}\]

where $|| B ||_ {0, 2}$ is the number of non-zero rows of $B$.

Here is Python code for solving sparse multiple linear regression problem:

2.4.2. Import necessary packages#

[2]:

from skscope import ScopeSolver
import jax.numpy as jnp
import numpy as np
from sklearn.datasets import make_regression

2.4.3. Set a seed#

[3]:

np.random.seed(123)

2.4.4. Generate the data#

Firstly, we shall conduct multiple response linear regression on an artificial dataset for demonstration. The make_regression from sklearn.datasets function allows us to generate simulated data. The artificial dataset contains 500 observations and 20 predictors but only five predictors have influence on the three possible responses.

[4]:

n, p, k, m = 500, 20, 5, 3
x, y, coef = make_regression(n_samples=n, n_features=p, n_informative=k, n_targets=m, coef=True)

print('real variables\' index:\n', set(np.nonzero(coef)[0]))
print('real variables:\n', coef)

real variables' index:
 {2, 9, 15, 16, 17}
real variables:
 [[ 0.          0.          0.        ]
 [ 0.          0.          0.        ]
 [35.47424401  1.60554839 50.64398325]
 [ 0.          0.          0.        ]
 [ 0.          0.          0.        ]
 [ 0.          0.          0.        ]
 [ 0.          0.          0.        ]
 [ 0.          0.          0.        ]
 [ 0.          0.          0.        ]
 [90.2768738  50.5206623  97.04288743]
 [ 0.          0.          0.        ]
 [ 0.          0.          0.        ]
 [ 0.          0.          0.        ]
 [ 0.          0.          0.        ]
 [ 0.          0.          0.        ]
 [91.25666229 94.36559002 41.8819058 ]
 [63.67968592 92.21002028 18.0026354 ]
 [17.80823712 59.56585099 46.58491993]
 [ 0.          0.          0.        ]
 [ 0.          0.          0.        ]]

2.4.5. Define multiple response linear regression loss#

Secondly, to carry out sparse-constrained optimization for multiple response linear regression loss, we define the loss function multi_linear_objective accorting to 1 that matches the data generating function make_regression.

[5]:

def multi_linear_objective(params):
    return jnp.sum(jnp.square(y - jnp.matmul(x, params.reshape((p, m)))))

2.4.6. Use skscope to solve the sparse multiple response linear regression problem#

After defining the data and the loss function, we can call ScopeSolver to solve the sparse-constrained optimization problem.

[6]:

solver = ScopeSolver(p * m, k, group=[i for i in range(p) for j in range(m)])
params = solver.solve(multi_linear_objective)

Now the solver.params contains the coefficients of multiple response linear regression model with no more than 5 variables. That is, those variables with a coefficient 0 is unused in the model:

[7]:

print(solver.params.reshape((p, m)))

[[ 0.          0.          0.        ]
 [ 0.          0.          0.        ]
 [35.4742443   1.60554861 50.64398376]
 [ 0.          0.          0.        ]
 [ 0.          0.          0.        ]
 [ 0.          0.          0.        ]
 [ 0.          0.          0.        ]
 [ 0.          0.          0.        ]
 [ 0.          0.          0.        ]
 [90.27687424 50.52066226 97.04288718]
 [ 0.          0.          0.        ]
 [ 0.          0.          0.        ]
 [ 0.          0.          0.        ]
 [ 0.          0.          0.        ]
 [ 0.          0.          0.        ]
 [91.25666144 94.36558946 41.88190469]
 [63.67968542 92.21001962 18.00263606]
 [17.80823746 59.56585047 46.58492027]
 [ 0.          0.          0.        ]
 [ 0.          0.          0.        ]]

We can further compare the coefficients estimated by skscope and the real coefficients in two-fold:

The true support set and the estimated support set
The true nonzero parameters and the estimated nonzero parameters

[8]:

print('real variables\' index:\n', set(np.nonzero(coef)[0]))
print('Estimated variables\' index:\n', set(np.nonzero(solver.params.reshape((p, m)))[0]))

real variables' index:
 {2, 9, 15, 16, 17}
Estimated variables' index:
 {2, 9, 15, 16, 17}

[9]:

print("Estimated parameter:\n", solver.params.reshape((p, m)))
print("True parameter:\n", coef)

Estimated parameter:
 [[ 0.          0.          0.        ]
 [ 0.          0.          0.        ]
 [35.4742443   1.60554861 50.64398376]
 [ 0.          0.          0.        ]
 [ 0.          0.          0.        ]
 [ 0.          0.          0.        ]
 [ 0.          0.          0.        ]
 [ 0.          0.          0.        ]
 [ 0.          0.          0.        ]
 [90.27687424 50.52066226 97.04288718]
 [ 0.          0.          0.        ]
 [ 0.          0.          0.        ]
 [ 0.          0.          0.        ]
 [ 0.          0.          0.        ]
 [ 0.          0.          0.        ]
 [91.25666144 94.36558946 41.88190469]
 [63.67968542 92.21001962 18.00263606]
 [17.80823746 59.56585047 46.58492027]
 [ 0.          0.          0.        ]
 [ 0.          0.          0.        ]]
True parameter:
 [[ 0.          0.          0.        ]
 [ 0.          0.          0.        ]
 [35.47424401  1.60554839 50.64398325]
 [ 0.          0.          0.        ]
 [ 0.          0.          0.        ]
 [ 0.          0.          0.        ]
 [ 0.          0.          0.        ]
 [ 0.          0.          0.        ]
 [ 0.          0.          0.        ]
 [90.2768738  50.5206623  97.04288743]
 [ 0.          0.          0.        ]
 [ 0.          0.          0.        ]
 [ 0.          0.          0.        ]
 [ 0.          0.          0.        ]
 [ 0.          0.          0.        ]
 [91.25666229 94.36559002 41.8819058 ]
 [63.67968592 92.21002028 18.0026354 ]
 [17.80823712 59.56585099 46.58491993]
 [ 0.          0.          0.        ]
 [ 0.          0.          0.        ]]

[ ]: