Comparison with `abess`#

Abess is a well-known open-source library that supports variable selection. A comparative table summarizing the supported features of abess and skscope is listed below:

Supported features	`abess`	`skscope`
Sparsity matrices	Covarites only	YES
Warm starts	YES	YES
GPU/TPU	NO	YES
Multi CPUs	YES	YES
:nbsphinx-math:`sklearn `compatibility	YES	YES
General objective function	YES	YES
No. supported algorithms	1	7
Group-structure sparisty	YES	YES
Nuisance parameters	YES	YES

Table: The support features of the abess and skscope.

In this notebook, we compare the performance (recovery accuracy and computation time) of skscope and abesson three tasks as follows:

Linear Regression
Non-negative Linear Regression
Logistic Regression.

Besides, we compare the corresponding performance with different data (sample size \(n\), feature number \(p\) and support size \(s\)) dimension.

[1]:

import time
import numpy as np
import pandas as pd
from IPython.display import display

import jax.numpy as jnp
from skscope import ScopeSolver


from abess.datasets import make_glm_data
from abess import LinearRegression, LogisticRegression

The following is the main test function which records the recovery accuracy and computation time of both methods.

[2]:

def test_time(n=500, p=1000, s=5, random_state=None, rep=1):
    print('='*20 + f'  n={n}, p={p}, s={s}  ' + '='*20 )
    # rng = np.random.default_rng(random_state)
    # true_support_set = rng.choice(np.arange(p), size=s, replace=False)
    # real_coef = np.zeros(p)

    # iterables = [['Linear', 'Non-Neg', 'Logistic'], ['abess', 'skscope']]
    # index = pd.MultiIndex.from_product(iterables, names=['Task', 'Model'])
    # res = pd.DataFrame(columns=['Accuracy', 'Time'], index = index)
    res = pd.DataFrame(columns=['Task', 'Model', 'Accuracy', 'Time'])

    for i in range(rep):
        rng = np.random.default_rng(i)
        true_support_set = rng.choice(np.arange(p), size=s, replace=False)
        real_coef = np.zeros(p)
        for task in ['Linear', 'Non-Neg', 'Logistic']:
            if task == 'Linear':
                real_coef[true_support_set] = rng.choice(np.arange(1, 4), size=s) * rng.choice([1, -1], size=s)
                data = make_glm_data(n=n, p=p, k=s, family='gaussian', coef_=real_coef)
                model = LinearRegression(support_size=s)
                def objective(params):
                    loss = jnp.mean((y - X @ params) ** 2)
                    return loss
                solver = ScopeSolver(p, sparsity=s)
            elif task == 'Non-Neg':
                real_coef[true_support_set] = rng.choice(np.arange(1, 4), size=s) * rng.choice([1], size=s)
                data = make_glm_data(n=n, p=p, k=s, family='gaussian', coef_=real_coef)
                model = LinearRegression(support_size=s)
                def objective(params):
                    loss = jnp.mean((y - X @ params) ** 2)
                    return loss
                solver = ScopeSolver(p, sparsity=s)
            elif task == 'Logistic':
                real_coef[true_support_set] = rng.choice(np.arange(1, 4), size=s) * rng.choice([1, -1], size=s)
                data = make_glm_data(n=n, p=p, k=s, family='binomial', coef_=real_coef)
                model = LogisticRegression(support_size=s)
                def objective(params):
                    xbeta = jnp.clip(X @ params, -30, 30)
                    return jnp.mean(jnp.log(1 + jnp.exp(xbeta)) - y * xbeta)
                solver = ScopeSolver(p, sparsity=s)

            X, y = data.x, data.y

            # abess
            t_begin = time.time()
            model.fit(X, y)
            t_abess = time.time() - t_begin
            acc_abess = len(set(np.nonzero(model.coef_)[0]) & set(true_support_set)) / s
            # res.loc[(type, 'abess')] = [acc_abess, np.round(t_abess, 4)]
            res.loc[len(res)] = [task, 'abess', acc_abess, t_abess]

            # skscope
            t_begin = time.time()
            params = solver.solve(objective, jit=True)
            t_skscope = time.time() - t_begin
            acc_skscope = len(set(np.nonzero(params)[0]) & set(np.nonzero(data.coef_)[0])) / s
            # res.loc[(type, 'skscope')] = [acc_skscope, np.round(t_skscope, 4)]
            res.loc[len(res)] = [task, 'skscope', acc_skscope, t_skscope]

    res_mean = res.groupby(['Task', 'Model']).mean()
    res_std = res.groupby(['Task', 'Model']).std()
    res_mean['Accuracy'] = res_mean['Accuracy'].map(lambda x: f'{x:.2f}')
    res_mean['Time'] = res_mean['Time'].map(lambda x: f'{x:.2f}')
    res_std['Accuracy'] = res_std['Accuracy'].map(lambda x: f' ({x:.2f})')
    res_std['Time'] = res_std['Time'].map(lambda x: f' ({x:.2f})')
    res_all = res_mean + res_std
    print(res_all)
    return res_all

We compare these three tasks with

\[(n, p, s)\in\{(500, 1000, 5), (2500, 5000, 25), (5000, 10000, 50)\}.\]

Certainly, abess is faster than skscope since all three tasks are generalized linear model and are specicialized class for abess.

This phenomenon is admmissible since the auto-differention procedure of skscope is general and takes more computation time.

All results are shwon in the following tables.

[3]:

settings = [
    (500, 1000, 10),
    (2500, 5000, 50),
    (5000, 10000, 100),
]
for setting in settings:
    n, p, s = setting
    res = test_time(n=n, p=p, s=s, rep=10)

====================  n=500, p=1000, s=10  ====================
                     Accuracy         Time
Task     Model
Linear   abess    1.00 (0.00)  0.01 (0.00)
         skscope  1.00 (0.00)  0.23 (0.03)
Logistic abess    0.99 (0.03)  0.02 (0.00)
         skscope  0.99 (0.03)  0.26 (0.03)
Non-Neg  abess    1.00 (0.00)  0.01 (0.00)
         skscope  1.00 (0.00)  0.22 (0.02)
====================  n=2500, p=5000, s=50  ====================
                     Accuracy         Time
Task     Model
Linear   abess    1.00 (0.00)  0.55 (0.12)
         skscope  1.00 (0.00)  3.88 (1.10)
Logistic abess    1.00 (0.00)  0.83 (0.15)
         skscope  1.00 (0.00)  5.00 (0.84)
Non-Neg  abess    1.00 (0.00)  0.63 (0.27)
         skscope  1.00 (0.00)  3.86 (1.01)
====================  n=5000, p=10000, s=100  ====================
                     Accuracy          Time
Task     Model
Linear   abess    1.00 (0.00)   2.19 (0.22)
         skscope  1.00 (0.00)  14.09 (1.28)
Logistic abess    1.00 (0.00)   6.81 (3.32)
         skscope  1.00 (0.00)  24.12 (5.83)
Non-Neg  abess    1.00 (0.00)   2.26 (0.10)
         skscope  1.00 (0.00)  14.80 (1.00)

Comparison with abess#

Comparison with `abess`#