Skip to main content
Ctrl+K
skscope

Site Navigation

  • User Guide
  • Software Features
  • Examples Gallery
  • API References
  • Contributor’s Guide
    • Frequent asked questions
    • Benchmarks
  • GitHub
  • PyPI
  • Conda

Site Navigation

  • User Guide
  • Software Features
  • Examples Gallery
  • API References
  • Contributor’s Guide
    • Frequent asked questions
    • Benchmarks
  • GitHub
  • PyPI
  • Conda
Ctrl+K

Section Navigation

  • 1. Linear Model and its Variants
    • 1.1. Linear Regression
    • 1.2. Linear regression with square-root loss
    • 1.3. Robust regression
    • 1.4. Quantile regression and expectile regression
    • 1.5. Linear mixed model
    • 1.6. Non-negative least squares
    • 1.7. Isotonic Regression
  • 2. Generalized Linear Models
    • 2.1. Logistic regressions
    • 2.2. Poisson regression
    • 2.3. Gamma regression
    • 2.4. Multiple response linear regression
    • 2.5. Multinomial logistic regression
    • 2.6. Multiple response non-negativity identity link Poisson model
  • 3. Survival Models
    • 3.1. Aalen’s additive hazards model
    • 3.2. Cox’s proportional hazards model
    • 3.3. Multivariate failure time model
    • 3.4. Competing risk model
  • 4. Fusion Models
    • 4.1. 1D trend filtering
    • 4.2. Piecewise-linear trend filtering with periodic components
    • 4.3. Spatial trend filtering
    • 4.4. DFS-Graph-Trend-Filtering
  • 5. Graphical Models
    • 5.1. Sparse Gaussian Precision Matrix
    • 5.2. Sparse Precision Matrix
    • 5.3. Sparse Ising Model
  • 6. Miscellaneous
    • 6.1. Non-linear feature selection via HSIC-SCOPE
    • 6.2. Portfolio selection
    • 6.3. Correlation inference for compositional data
    • 6.4. Classification on imbalanced labels with focal loss
    • 6.5. Exemplar-based clustering
  • 7. Sparsity Level Selection
    • 7.1. Supported Information Criteria and Cross Validation
  • Examples Gallery
  • 2. Generalized Linear Models

2.6. Multiple response non-negativity identity link Poisson model#

2.6.1. Introduction#

Poisson Regression involves regression models in which the response variable is in the form of counts. In the case that your covariates would act additively, identity link is better than log link. Next is an example.

  • Chromatography (GC/MS analysis): In chromatography (GC/MS analysis) one would often measure the superimposed signal of several approx Gaussian shaped peaks and this superimposed signal is measured with an electron multiplier, which means that the measured signal are ion counts and therefore Poisson distributed.

Since each of the peaks have by definition a positive height and act additively and the noise is Poisson, a nonnegative Poisson model with identity link would be appropriate here, and a log link Poisson model would be plain wrong. Based on the above elaboration, we implement the code for simulating data as followed coupled with a visualization of the signals.

[1]:
import numpy as np
import matplotlib.pyplot as plt

def simulate_spike_train(n=200, npeaks=20, channels=1, peakhrange=(1000, 10000), seed=123, Plot=True):
    np.random.seed(seed)
    x = np.arange(n)

    # Ensure number of peaks does not exceed the number of points
    npeaks = min(npeaks, n)
    # Sample peak locations
    u = np.random.choice(x, npeaks, replace=False)
    u.sort()
    # Generate peak heights logarithmically within the specified range
    h = 10 ** np.random.uniform(np.log10(min(peakhrange)), np.log10(max(peakhrange)), (npeaks, channels))

    # Initialize the true coefficient array
    a = np.zeros((n, channels))
    a[u, :] = h

    # Gaussian peak shape function
    def gauspeak(x, u, w, h=1):
        return h * np.exp(-((x-u)**2) / (2 * w**2))

    # Create a banded matrix with Gaussian shapes
    W = 5  # width of the Gaussian
    X = np.column_stack([gauspeak(x, ui, W, 1) for ui in x])

    # Generate the noiseless simulated signal
    y_nonoise = X @ a
    # Add Poisson noise to the signal
    y = np.random.poisson(y_nonoise)

    # Plotting
    if Plot and channels == 1:
        plt.figure(figsize=(10, 5))
        plt.plot(x, y, label="Noisy Signal", linestyle='-', marker='')
        plt.stem(x, a, 'r', markerfmt='ro', label="True Peaks", basefmt="r-")
        plt.title("simluated GC-MS data (red=true peaks)")
        plt.xlabel("Time")
        plt.ylabel("Signal")
        plt.legend()
        plt.show()

    if Plot and channels > 1:
        plt.figure(figsize=(10, 8))
        plt.imshow(y.T**0.1, aspect='auto', interpolation='nearest')
        plt.colorbar(label='Noisy Signal (Power 0.1)')
        plt.xlabel('Time')
        plt.ylabel('Channel')
        plt.title('simluated GC-MS data (red=true peaks)')
        true_peaks = np.nonzero(a.sum(axis=1))[0]
        for peak in true_peaks:
            plt.axvline(x=peak, color='red', label='True Peaks' if peak == true_peaks[0] else "")

    return {'y': y, 'X': X, 'a': a, 'supp':u}

data = simulate_spike_train()
../../_images/gallery_GeneralizedLinearModels_poisson-identity-link_2_0.png

The mathematical equation for Poisson regression with identity link is

\[\begin{align}E(y) = \beta_0 + \beta_1 X_1+\beta_2 X_2+\dots+\beta_p X_p.\end{align} \tag{1}\]

With \(n\) independent data of the explanatory variables \(x\) and the response variable \(y\), we can estimate \(\beta\) by minimizing the negative log-likelihood function under sparsity constraint:

\[\arg \min _{\beta \in R^p} L(\beta):=-\frac{1}{n} \sum_{i=1}^n\left\{y_i \log(x_i^T \beta)-x_i^T \beta-\log \left(y!\right)\right\}, \text { s.t. }\|\beta\|_0 \leq s .\tag{2}\]

However, since \(\log(x_i^T \beta)\) requires that \(x_i^T \beta\) be non-negative, this presents difficulties for optimization. One solution is to use a weighted non-negative least squares (weighted NNLS) loss function as an approximation.

2.6.2. Solve the problem#

Here is Python code for solving sparse poisson regression problem:

[2]:
from jax import numpy as jnp

def loss(params, data):
    n, channels = data['y'].shape
    return jnp.sum(jnp.square((data['y'] - data['X'] @ params.reshape(n, channels))) / (data['y'] + 0.1))

Here is an example of recovering 20 true peaks from a noisy signal. The true signal and estimated signals are visualized for illustrating the quality of the estimated signals.

[3]:
from skscope.layer import NonNegative
from skscope import ScopeSolver


n, npeaks, channels = 1000, 20, 1
data = simulate_spike_train(n, npeaks, channels, seed=123)

solver = ScopeSolver(
    dimensionality=n*channels,
    sparsity=npeaks,
)

solver.solve(
    objective=lambda params: loss(params, data),
    layers=[NonNegative(dimensionality = n*channels)],
    jit=True,
)
# Plotting
plt.figure(figsize=(10, 5))
plt.plot(np.arange(n), data['y'], label="Noisy Signal", linestyle='-', marker='')
plt.stem(np.arange(n), solver.params, 'r', markerfmt='ro', label="Estimated Peaks", basefmt="r-")
plt.title("simluated GC-MS data (red=estimated peaks)")
plt.xlabel("Time")
plt.ylabel("Signal")
plt.legend()
plt.show()
../../_images/gallery_GeneralizedLinearModels_poisson-identity-link_6_0.png
../../_images/gallery_GeneralizedLinearModels_poisson-identity-link_6_1.png

2.6.3. Case of multivariate#

In this part, we expand the dimension of the response to 50 to enrich the illustration of using skscope to solve sparsity-constraint optimization problem. Again, the true and estimated signals are both visualized.

[4]:
n, npeaks, channels = 1000, 10, 50
data = simulate_spike_train(n, npeaks, channels, seed=123)

solver = ScopeSolver(
    dimensionality=n*channels,
    sparsity=npeaks,
    group=[i for i in range(n) for _ in range(channels)],
)

solver.solve(
    objective=lambda params: loss(params, data),
    layers=[NonNegative(dimensionality = n*channels)],
    jit=True,
)
plt.figure(figsize=(10, 8))
plt.imshow(data['a'].T, aspect='auto', interpolation='nearest')
plt.colorbar(label='True Peaks')
plt.xlabel('Time')
plt.ylabel('Channel')
plt.title('True peaks and estimated peaks (red)')

true_peaks = np.nonzero(solver.params.reshape(n, channels).sum(axis=1))[0]
for peak in true_peaks:
    plt.axvline(x=peak, color='red', label='Estimated Peaks' if peak == true_peaks[0] else "")
../../_images/gallery_GeneralizedLinearModels_poisson-identity-link_8_0.png
../../_images/gallery_GeneralizedLinearModels_poisson-identity-link_8_1.png

The covariate matrix can be coded as a sparse matrix to save memory, but solving it will be more time-consuming.

[5]:
from jax.experimental import sparse

n, npeaks, channels = 1000, 10, 50
data = simulate_spike_train(n, npeaks, channels, seed=123)
print(f"The size of dense covariate matrix is {data['X'].nbytes / (1024**2)} MB")
data['X'] = sparse.BCOO.fromdense(data['X'])
print(f"The size of sparse covariate matrix is {(data['X'].data.nbytes + data['X'].indices.nbytes) / (1024**2)} MB")

solver = ScopeSolver(
    dimensionality=n*channels,
    sparsity=npeaks,
    sample_size=n,
    group=[i for i in range(n) for _ in range(channels)],
)

solver.solve(
    objective=lambda params: loss(params, data),
    layers=[NonNegative(dimensionality = n*channels)],
    jit=True,
)
plt.figure(figsize=(10, 8))
plt.imshow(data['a'].T, aspect='auto', interpolation='nearest')
plt.colorbar(label='Beta Value (Power 0.1)')
plt.xlabel('Time')
plt.ylabel('Channel')
plt.title('simluated GC-MS data (red=true peaks)')

true_peaks = np.nonzero(solver.params.reshape(n, channels).sum(axis=1))[0]
for peak in true_peaks:
    plt.axvline(x=peak, color='red', label='True Peaks' if peak == true_peaks[0] else "")
The size of dense covariate matrix is 7.62939453125 MB
The size of sparse covariate matrix is 1.4714584350585938 MB
../../_images/gallery_GeneralizedLinearModels_poisson-identity-link_10_1.png
../../_images/gallery_GeneralizedLinearModels_poisson-identity-link_10_2.png

2.6.4. Case of large-scale dataset#

The last example shows the time consumption when the problem size is large in real situations.

[6]:
import time
from skscope import HTPSolver
n, npeaks, channels = 10000, 100, 500
data = simulate_spike_train(n, npeaks, channels, seed=123, Plot=False)

solver = HTPSolver(
    dimensionality=n*channels,
    sparsity=npeaks,
    group=[i for i in range(n) for _ in range(channels)],
)
start_time = time.time()
solver.solve(
    objective=lambda params: loss(params, data),
    layers=[NonNegative(dimensionality = n*channels)],
    jit=True,
)
print(f"Use {time.time() - start_time} s.")
plt.figure(figsize=(10, 8))
plt.imshow(data['a'].T, aspect='auto', interpolation='nearest')
plt.colorbar(label='Beta Value (Power 0.1)')
plt.xlabel('Time')
plt.ylabel('Channel')
plt.title('simluated GC-MS data (red=true peaks)')

true_peaks = np.nonzero(solver.params.reshape(n, channels).sum(axis=1))[0]
for peak in true_peaks:
    plt.axvline(x=peak, color='red', label='True Peaks' if peak == true_peaks[0] else "")
Use 98.22258281707764 s.
../../_images/gallery_GeneralizedLinearModels_poisson-identity-link_12_1.png

previous

2.5. Multinomial logistic regression

next

3. Survival Models

On this page
  • 2.6.1. Introduction
  • 2.6.2. Solve the problem
  • 2.6.3. Case of multivariate
  • 2.6.4. Case of large-scale dataset
Edit on GitHub
Show Source

© Copyright 2023, abess-team.

Created using Sphinx 7.2.6.

Built with the PyData Sphinx Theme 0.14.1.