[1]:
%matplotlib inline

2.2. Poisson Regression#

We would like to use an example to show how the sparse-constrained optimization for poisson regression works in our program.

2.2.1. Introduction to poisson regression#

Poisson Regression involves regression models in which the response variable is in the form of counts. For example, the count of number of car accidents or number of customers in line at a reception desk. The response variables is assumed to follow a Poisson distribution.

The general mathematical equation for Poisson regression is

\[\begin{align}\log(E(y)) = \beta_0 + \beta_1 X_1+\beta_2 X_2+\dots+\beta_p X_p.\end{align} \tag{1}\]

With \(n\) independent data of the explanatory variables \(x\) and the response variable \(y\), we can estimate \(\beta\) by minimizing the negative log-likelihood function under sparsity constraint:

\[\arg \min _{\beta \in R^p} L(\beta):=-\frac{1}{n} \sum_{i=1}^n\left\{y_i x_i^T \beta-\exp \left(x_i^T \beta\right)-\log \left(y!\right)\right\}, \text { s.t. }\|\beta\|_0 \leq s .\tag{2}\]

Here is Python code for solving sparse poisson regression problem:

2.2.1.1. Import necessary packages#

[2]:
import numpy as np
from abess.datasets import make_glm_data
import jax.numpy as jnp
from skscope import ScopeSolver

2.2.1.2. Set a seed#

[3]:
np.random.seed(123)

2.2.1.3. Generate the data#

Firstly, we generate some artificial data using logic 1. Consider a dataset containing n=500 observations with p=500 variables. The make_glm_data function allows us to generate simulated data by specifying the family="poisson". By specifying k = 5, we set only 5 of the 500 variables to have effect on the expectation of the response.

[4]:
np.random.seed(1)

n = 500
p = 500
s = 5

data = make_glm_data(n=n, p=p, k=s, family="poisson")
X = data.x
y = data.y

print("The first 5 x observation:\n", X[0:5, ])
print("The first 5 y observation:\n", y[0:5])
The first 5 x observation:
 [[ 0.09600233 -0.03994146 -0.0349905  ... -0.13339626 -0.06864823
  -0.00362435]
 [-0.1022829   0.00154895 -0.05209999 ... -0.0005665   0.02165318
  -0.01390639]
 [-0.00940902 -0.15288256  0.03033626 ... -0.00915983  0.11700274
  -0.08647483]
 [ 0.07684115 -0.02137192 -0.01879096 ... -0.0527822   0.03978687
   0.08170928]
 [ 0.0287065   0.01281823 -0.02994294 ... -0.00994214  0.04239522
  -0.0176029 ]]
The first 5 y observation:
 [0 0 0 2 0]

2.2.1.4. Define function to calculate negative log-likelihood of poisson regression to serve as the loss to solve#

Secondly, we define the loss function poisson_loss accorting to 2 that matches the data generating function make_glm_data.

[5]:
def poisson_loss(params):
    xbeta = jnp.clip(X @ params, -30, 30)
    return jnp.mean(jnp.exp(xbeta) - y * xbeta) #omit \log y! term

2.2.1.5. Use skscope to solve the sparse possion regression problem#

After defining the data generation and loss function, we can call ScopeSolver to solve the sparse-constrained optimization problem.

[6]:
solver = ScopeSolver(p, s)
params = solver.solve(poisson_loss, jit=True)

Now the solver.params contains the coefficients of poisson model with no more than 5 variables. That is, those variables with a coefficient 0 is unused in the model:

[7]:
print(solver.params)
[0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         2.46268901 0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         4.07112904 0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 2.86969217 0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         5.24157465 0.         7.80686747 0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.        ]

We can further compare the coefficients estimated by skscope and the real coefficients in three-fold:

  • The true support set and the estimated support set

  • The true nonzero parameters and the estimated nonzero parameters

  • The true loss value and the estimated values

[8]:
true_support_set = np.nonzero(data.coef_)[0]
estimated_support_set = solver.support_set
print("True support set: ", true_support_set)
print("Estimated support set: ", estimated_support_set)
True support set:  [  4 189 360 433 435]
Estimated support set:  [111 189 360 433 435]
[9]:
print("True parameters: ", data.coef_[true_support_set])
print("Estimated parameters: ", solver.params[estimated_support_set])
True parameters:  [2.19224112 4.07074072 2.57632915 5.56121611 7.40111469]
Estimated parameters:  [2.46268901 4.07112904 2.86969217 5.24157465 7.80686747]
[10]:
print("True loss value: ", poisson_loss(data.coef_))
print("Estimated loss value: ", poisson_loss(solver.params))
True loss value:  0.7049176
Estimated loss value:  0.6987839

2.2.2. More on the results#

We can plot the sparse signal recovering from the noisy observations to visualize the results.

[11]:
import matplotlib.pyplot as plt
(inx_true,) =  data.coef_.nonzero()
(inx_est,) =  solver.params.nonzero()

# plot the sparse signal
plt.figure(figsize=(7, 7))
plt.subplot(2, 1, 1)
plt.stem(inx_true, data.coef_[inx_true], markerfmt='o', basefmt='k-')
plt.plot([0, 500], [0, 0], 'r-', lw=2)
plt.xlim(0, 500)
plt.title("Sparse signal")
#plt.plot(inx_true, true_params[inx_true], drawstyle='steps-post')

# plot the noisy reconstruction
plt.subplot(2, 1, 2)
plt.stem(inx_est, solver.params[inx_est], markerfmt='o', basefmt='k-')
plt.plot([0, 500], [0, 0], 'r-', lw=2)
plt.xlim(0, 500)
plt.title("Recovered signal from noisy observations")
#plt.plot(inx_est, solver.params[inx_est], drawstyle='steps-post')

plt.show()
../../_images/gallery_GeneralizedLinearModels_poisson-regression_21_0.png
[ ]: