1 Introduction

Variable selection and regularization are essential tools in high-dimensional data analysis. Many existing strategies achieve both high prediction accuracy and interpretability. For instance, the lasso (Tibshirani 1996) was popularized thanks to its computational efficiency (Efron et al 2004), variable selection consistency (Zhao and Yu 2006), and estimation consistency (Negahban et al 2012). We refer to Zou (2006); Bickel et al (2009); Efron et al (2007); Lounici (2008); Yuan and Lin (2006); Zhao et al (2009); Wang et al (2007) for more in-depth discussions of the lasso. Later, the elastic net (Zou and Hastie 2005) was proposed by linearly combining the lasso and ridge regression-like penalties. As a more flexible model, the elastic net has been shown to outperform the lasso for high-dimensional data.

Recall that a regularized linear model has the following general form:

$$\begin{aligned} Y=\beta _0+\beta _1X_1+\ldots +\beta _pX_p+g(\beta _1,\ldots ,\beta _p)+\epsilon , \end{aligned}$$
(1.1)

where \(Y\in {\mathbb {R}}\) is the response variable, \(X_1,\ldots ,\) \(X_p\in {\mathbb {R}}\) are p predictors, g is some penalty function and \(\epsilon \) is the residual. In the setting of ordinary linear models, one places no range constraint on the coefficients, i.e., it is assumed that \(\beta _1,\ldots ,\beta _p\in {\mathbb {R}}\). In practice, however, \(\beta _1,\ldots ,\beta _p\) are often restricted to a prior range of values. For example, in portfolio management the coefficients are interpreted as allocations of assets in a fund, which take values in [0, 1]; in academic grading the coefficients are interpreted as weights of a list of courses, which also lie in [0, 1]. Such constraints may influence the behavior of the penalty g, as well as the estimated values of \(\beta _1,\ldots ,\beta _p\). To adapt to this real world constraint, Wu et al (2014) and Wu and Yang (2014) introduced the nonnegative lasso and the nonnegative elastic net, respectively, which have been successfully applied to the real world index tracking problem without short sales (this corresponds to a nonnegativity constraint on the weights). Many more such range constraints on the regression coefficients arise in real world problems, so more flexible models are needed to address problems that require arbitrary-range constraints on the regression coefficients. With this motivation, our first goal is to suggest a novel method that handles an arbitrary rectangle-range constraint on the regression coefficients. Also recall that Zou (2006), Mouret et al (2013) and Sokolov et al (2016) introduced methods that generalize the lasso and elastic net, respectively, by placing adaptive weights on the predictors and penalties. We adopt this setting in our model as well, i.e., the coefficients in the penalties are weighted. In summary, our paper proposes a method that allows arbitrary rectangle-range constraints on the regression coefficients of a generalized elastic net and provides rigorous theoretical results to support the consistency of the model and its solution. The proposed arbitrary rectangle-range generalized elastic net method (abbreviated to ARGEN) is a regularization method that deals with high-dimensional problems and generalizes the nonnegative elastic net. The motivation for using ARGEN is given below:

1. Like the elastic net (Zou 2006), ARGEN reduces the estimation variance and removes unimportant factors. Being more general than the elastic net, ARGEN often yields better prediction results. This fact is shown in our simulation and empirical studies, see Sects. 5 and 6.

2. Compared with the nonnegative elastic net, ARGEN allows arbitrary lower and upper bound constraints on the coefficients. As discussed above, restricting the regression coefficients to a specific rectangle range is often required in real world applications, so this setting gives ARGEN more adaptability to real world constraints. In Sect. 6 we present the S&P 500 index tracking problem as one example: instead of considering nonnegative allocation parameters (i.e., each regression coefficient \(\beta _j\in [0,+\infty )\)), our hyper-parameter tuning result shows that considering \(\beta _j\in [0.0082,0.6]\) or [0.0041, 0.8] yields lower out-of-sample error. This result reveals that each stock considered to track the S&P 500 index should have a weight of no more than \(80\%\).

3. ARGEN also accounts for the effects of individual and interactive penalty weights. These weights measure the factor importance of the p features \(X_1,\ldots ,X_p\) in the regression. The traditional elastic net assumes the p features have equal factor importance, which is often not the case in the real world. Not only does this setting of weights better fit real world situations, it also promises better performance due to its larger parameter search space.

To solve ARGEN, we introduce a novel algorithm, multiplicative updates for solving quadratic programming with rectangle range and weighted \(l_1\) regularizer (abbreviated to MU-QP-RR-W-\(l_1\)). We summarize the main contributions of our paper as follows:

1. We introduce ARGEN, a method for variable selection and regularization problems that require the regression coefficients to lie in some rectangle in \({\mathbb {R}}^p\) (see (2.1)). As a flexible approach, ARGEN includes the nonnegative elastic net and a number of new extensions of the lasso, ridge, and elastic net models.

2. Subject to some conditions on the inputs, we obtain the variable selection consistency, the estimation consistency, and the limiting distribution of the ARGEN estimator. We refer to Theorems 2.1, 2.2, and 2.4.

3. A novel algorithm, MU-QP-RR-W-\(l_1\), is introduced to solve the general quadratic programming problem \(\min _{v\in [0, l]} F(v) = v'Av/2 + b' v+d'|v-v^0|\), following the notations in (3.3). The algorithm is implemented as a Python library and publicly shared through the PyPI server.

4. We show a successful real world application of the ARGEN approach to the S&P 500 index tracking problem. Readers can get full access to the Python script in the GitHub repository.

Throughout the paper, we denote the transpose of a matrix by \((\cdot )'\), the i-th column of a matrix by \((\cdot )_i\), the entry in the i-th row and j-th column of a matrix by \((\cdot )_{ij}\), the diagonal matrix with diagonal vector \({{\textbf {x}}}\) by \({{\,\textrm{diag}\,}}({{\textbf {x}}})\), and the maximum (resp. minimum) element of a vector by \(\max (\cdot )\) (resp. \(\min (\cdot )\)). Besides, an \(n\times n\) matrix X can be expressed as \( X=(X_{ij})_{1\le i,j\le n}\). The elementwise absolute value of a vector or matrix is denoted by \(|\cdot |\): for \({{\textbf {x}}}=(x_1, \ldots , x_p)\), \(|\mathrm {{\textbf {x}}}|:=(|x_1|,\ldots ,|x_p|)\), and for an \(n\times n\) matrix X, \(|X|:=(|X_{ij}|)_{1\le i,j\le n}\). Moreover, for two equal-length vectors \({{{\textbf {x}}}} =(x_1,\ldots ,x_p)\) and \({{{\textbf {y}}}}=(y_1,\ldots ,y_p)\), we denote the p-dimensional interval by \([{{{\textbf {x}}}}, {{{\textbf {y}}}}]:=[x_1,y_1]\times \ldots \times [x_p,y_p]\).

In the sequel, we consider the linear regression model

$$\begin{aligned} Y = X\beta ^* + \epsilon , \end{aligned}$$
(1.2)

where X is a deterministic \( n \times p \) design matrix, \(Y = (y_1~\ldots ~y_n)'\) is an \(n\times 1\) response vector, and \(\epsilon = ( \epsilon _1~\ldots ~\epsilon _n)' \) is Gaussian noise with marginal variance \(\sigma ^2 \). Without loss of generality, we assume all p predictors are real-valued and centered, so the intercept can be ignored. \(\beta ^* \in {\mathbb {R}}^{p}\) is the vector of regression coefficients.
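For concreteness, the model (1.2) can be simulated as in the following minimal Python sketch; the sample size, dimension, sparsity level, and noise scale below are arbitrary illustrative choices rather than values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 100, 20, 5                       # illustrative sample size, dimension, sparsity

X = rng.standard_normal((n, p))            # design matrix, treated as deterministic once drawn
X -= X.mean(axis=0)                        # center the predictors so the intercept can be ignored

beta_star = np.zeros(p)
beta_star[:q] = rng.uniform(0.5, 2.0, q)   # q-sparse true coefficient vector

sigma = 1.0
eps = sigma * rng.standard_normal(n)       # Gaussian noise with variance sigma^2
Y = X @ beta_star + eps                    # response generated from model (1.2)
```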

The rest of the paper is organized as follows. In Sect. 2, we discuss the analytical features of ARGEN and its variable selection consistency (Theorem 2.1), estimation consistency (Theorem 2.2), and the estimator's limiting distribution (Theorem 2.4). In Sect. 3, we propose an efficient algorithm, MU-QP-RR-W-\(l_1\), for solving ARGEN. Approaches we use to speed up hyper-parameter optimization are discussed in Sect. 4. Simulations that compare the performances of various methods are conducted in Sect. 5. Section 6 shows an application of ARGEN to the real world S&P 500 index tracking problem. Section 7 is devoted to the conclusion and a discussion of future research. Technical proofs are provided in the Appendix.

2 The ARGEN

2.1 Definition

In practice it is often natural to assume sparsity in high-dimensional problems. Therefore, in the sequel we assume that the linear model (1.2) is q-sparse, i.e., \(\beta ^*\) has at most \( q ~(q \ll p) \) nonzero elements. We intend to cope with the case when there is a control on the range of the coefficients: let \( s =(s_1,\ldots ,s_p)\), \( t=(t_1,\ldots ,t_p)\) with \(s_i\in {\mathbb {R}}\cup \{-\infty \},~t_i\in {\mathbb {R}}\cup \{+\infty \}\) and \(s_i< t_i\) for all \(i=1,\ldots ,p\); the optimal coefficients lie in a p-dimensional rectangle \({\mathcal {I}}:=[s, t]\subset {\mathbb {R}}^p\). To capture the penalty weights of individual features, we introduce \(\textrm{w}_{n} = (\textrm{w}_{n,1}~ \ldots ~ \mathrm w_{n,p})'\) as the weights of the coefficients in the \(l_1\) penalty, satisfying \(\mathrm w_{n,i}\ge 0\) for \(i=1,\ldots ,p\). In addition to individual features, a positive semi-definite matrix \(\Sigma _n\) is introduced to represent the penalty weights for interactions between any two features. Consider the linear model (1.2) and let \( \beta =(\beta _1~\ldots ~\beta _p)'\) be a vector in \({\mathbb {R}}^p\). The ARGEN estimator of \(\beta \) is given by

$$\begin{aligned} {\widehat{\beta }}:={\widehat{\beta }}(\lambda _n^{(1)},\lambda _n^{(2)},\mathrm w_n,\Sigma _n)=\mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{\beta \in {\mathcal {I}}}\Big (\Vert Y-X\beta \Vert _2^2+\lambda _n^{(1)}\mathrm w_n'|\beta |+\lambda _n^{(2)}\beta '\Sigma _n\beta \Big ). \end{aligned}$$
(2.1)

Here \(\lambda _n^{(1)},\lambda _n^{(2)}\ge 0\) are the tuning parameters which control the importance of the \(l_1\) and \(l_2\) regularization terms, respectively.

The ARGEN (2.1) naturally extends the elastic net method: it reduces to the elastic net when \({\mathcal {I}}={\mathbb {R}}^p\), \(\mathrm w_n=(1~ \ldots ~ 1)'\), and \(\Sigma _n\) is the identity matrix. ARGEN thus also extends the lasso and ridge methods by further setting \(\lambda _n^{(2)}=0\) and \(\lambda _n^{(1)}=0\), respectively. In addition, ARGEN becomes the nonnegative elastic net if we replace \({\mathcal {I}}={\mathbb {R}}^p\) with \({\mathcal {I}}={\mathbb {R}}_+^p:=[0,+\infty )^p\) in the setting of the elastic net.

2.2 Variable selection consistency

We define the variable selection consistency for the ARGEN as follows. For \(i=1,\ldots ,p\), we decompose the interval \([s_i,t_i]\) with \(s_i<t_i\) into 7 disjoint sub-intervals:

$$\begin{aligned} {[}s_i,t_i]=\bigcup _{k=2}^6{\mathcal {G}}_i^{(k)}\bigcup \mathcal G_i^{(1-)}\bigcup {\mathcal {G}}_i^{(1+)}, \end{aligned}$$

where

$$\begin{aligned}{} & {} {\mathcal {G}}_i^{(1-)}=(s_i, t_i)\cap (-\infty ,0),\\{} & {} {\mathcal {G}}_i^{(1+)}=(s_i, t_i)\cap (0,+\infty ),\\{} & {} {\mathcal {G}}_i^{(2)}=\{s_i\}\backslash \{0\},~~~~~{\mathcal {G}}_i^{(3)}=\{t_i\}\backslash \{0\},\\{} & {} {\mathcal {G}}_i^{(4)}=\{s_i\}\cap \{0\},~~~~~{\mathcal {G}}_i^{(5)}=\{t_i\}\cap \{0\},\\{} & {} {\mathcal {G}}_i^{(6)}=(s_i,t_i)\cap \{0\}. \end{aligned}$$

In addition, we define \( {\mathcal {G}}_i^{(1)}=\mathcal G_i^{(1-)}\cup {\mathcal {G}}_i^{(1+)} \) for simplicity. Correspondingly, each coefficient in \(\beta ^*\) belongs to one of the 7 groups of values; i.e., for each \(i=1,\ldots ,p\), there is a unique \(k_i\in \{1-,1+,2,\ldots ,6\}\) such that \( \beta _i^*\in \mathcal G_i^{(k_i)}. \) Now for \(j\in \{1-,1+,2,\ldots ,6\}\), denote by

$$\begin{aligned} S_{(j)}=\left\{ i\in \{1,\ldots ,p\}:~\beta _i^*\in \mathcal G_{i}^{(j)}\right\} , \end{aligned}$$

the set of indexes i for which \(\beta _i^*\) belongs to the j-th group of values, and let \(\# S_{(j)}\) be the cardinality of the set. Correspondingly, we can define

$$\begin{aligned} \begin{aligned}&\widehat{S}_{(j)}(\lambda _n^{(1)},\lambda _n^{(2)},\mathrm w_n,\mathrm \Sigma _n)\\&\quad =\left\{ i\in \{1,\ldots ,p\}:{{\widehat{\beta }}}_i\in \mathcal G_{i}^{(j)}\right\} . \end{aligned} \end{aligned}$$
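The group membership just defined is purely combinatorial, so it can be transcribed directly into code; the following Python sketch (with hypothetical helper names) classifies each \(\beta _i^*\) into its group and collects the index sets \(S_{(j)}\).

```python
import numpy as np

def coefficient_group(beta_i, s_i, t_i):
    """Label of beta_i within [s_i, t_i] according to the partition
    G^(1-), G^(1+), G^(2), ..., G^(6) of Section 2.2."""
    if s_i < beta_i < t_i:                       # interior of the interval
        if beta_i < 0:
            return "1-"
        if beta_i > 0:
            return "1+"
        return "6"                               # interior zero
    if beta_i == s_i:
        return "4" if s_i == 0 else "2"          # lower boundary: zero vs nonzero
    if beta_i == t_i:
        return "5" if t_i == 0 else "3"          # upper boundary: zero vs nonzero
    raise ValueError("beta_i must lie in [s_i, t_i]")

def index_sets(beta_star, s, t):
    """S_(j): indexes i for which beta*_i belongs to the j-th group."""
    S = {}
    for i, (b, si, ti) in enumerate(zip(beta_star, s, t)):
        S.setdefault(coefficient_group(b, si, ti), []).append(i)
    return S

# toy illustration with three coefficients
print(index_sets(np.array([0.0, 0.0, -2.0]),
                 np.array([-1.0, 0.0, -np.inf]),
                 np.array([1.0, 1.0, np.inf])))   # {'6': [0], '4': [1], '1-': [2]}
```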

Definition 1

ARGEN (2.1) is said to have variable selection consistency if there exist \(\lambda _n^{(1)}\), \(\lambda _n^{(2)}\), \(\mathrm w_n\), and \(\Sigma _n\) such that

$$\begin{aligned} \begin{aligned}&\mathbb P\left( \widehat{S}_{(j)}\big (\lambda _n^{(1)},\lambda _n^{(2)},\mathrm w_n,\mathrm \Sigma _n\big ) = S_{(j)}\Big |S_{(j)}\ne \emptyset \right) \\&\xrightarrow [n\rightarrow \infty ]{}1~\text{ for } j\in \{1-,1+,2,\ldots ,6\}. \end{aligned} \end{aligned}$$
(2.2)

(2.2) implies that, for n large enough, \({{\widehat{\beta }}}_i\) equals \(\beta _i^*\) with high probability whenever \(\beta ^*_i\in \{0,s_i,t_i\}\). This property includes the variable selection consistency of the nonnegative elastic net and of the elastic net as particular cases. Therefore our definition of variable selection consistency for ARGEN is broader than that for the "free-range" or nonnegative elastic net (Zhao and Yu 2006; Wu et al 2014; Wu and Yang 2014).

Let \(X_{(1)}=(X_{(1-)}, X_{(1+)})\) and, for \(j\in \{1-,1+,2,\ldots ,6\}\), let \(X_{(j)} = \left( X_i\right) _{i\in S_{(j)}}\) be the observed predictor values corresponding to the jth group of indexes. Similarly, let \(\beta ^*_{(j)} = \left( \beta ^*_i\right) _{i\in S_{(j)}}\), \(s_{(j)} = \left( s_i\right) _{i\in S_{(j)}}\), \(t_{(j)} = \left( t_i\right) _{i\in S_{(j)}}\), \(\mathrm w_{n, (j)} = \left( \mathrm w_{n, i}\right) _{i\in S_{(j)}}\), and \(\Sigma _{n, (j_1j_2)} = \left( \Sigma _{n, i_1,i_2}\right) _{i_1\in S_{(j_1)}, i_2\in S_{(j_2)}}\). Moreover, let C be

$$\begin{aligned} \begin{aligned} C&:=\begin{pmatrix} C_{ij} \end{pmatrix}_{1\le i,j\le 6}\\ {}&=\frac{1}{n}X'X=\begin{pmatrix} \frac{1}{n}X'_{(i)} X_{(j)} \end{pmatrix}_{1\le i,j\le 6} \end{aligned} \end{aligned}$$
(2.3)

and \(\Lambda _{\min }(C_{11})\) be the minimal eigenvalue of \(C_{11}\). Denote by

(2.4)

where for a vector \(v=(v_1,\ldots ,v_n)\), \({{\,\textrm{sign}\,}}(v):=({{\,\textrm{sign}\,}}(v_1),\ldots ,{{\,\textrm{sign}\,}}(v_n))\) denotes the vector of signs of the elements in v, and \({{\,\textrm{diag}\,}}(v)\) denotes the diagonal matrix with diagonal elements v. To show ARGEN admits the variable selection consistency (2.2), we assume that the following conditions hold:

(2.5)
(2.6)
(2.7)
(2.8)

and for \(j\in \{1-,1+\}\),

(2.9)
(2.10)

Besides, we assume that the arbitrary rectangle-range elastic irrepresentable condition (AREIC), defined below, is satisfied.

Definition 2

The AREIC is given as: For \(j=2,\ldots ,6\) satisfying \(S_{(j)}\ne \emptyset \), there exists a positive constant vector \(\eta _{(j)}\), such that

(2.11)

where \( (D_{(2)}~D_{(3)}~D_{(4)}~D_{(5)})=\big ({{\,\textrm{diag}\,}}({{\,\textrm{sign}\,}}( s_{(2)}))\) \(~{{\,\textrm{diag}\,}}({{\,\textrm{sign}\,}}( t_{(3)}))~1~-1\big ). \)

Let us roughly explain how the technical condition AREIC plays its role in the derivation of the variable selection consistency of ARGEN. First by Lemma A.1 in the appendix, for \(j\in \{1-,1+,2,\ldots ,6\}\),

$$\begin{aligned} \begin{aligned}&\mathbb P\Big (\widehat{S}_{(j)}\big (\lambda _n^{(1)},\lambda _n^{(2)},\mathrm w_n,\mathrm \Sigma _n\big ) = S_{(j)}\Big |S_{(j)}\ne \emptyset \Big )\\&\ge {\mathbb {P}}\left( {\mathcal {E}}(V_{(j)})\right) , \end{aligned} \end{aligned}$$

where the events \({\mathcal {E}}(V_{(j)})\) are given in (A2). Next, the AREIC (the left-hand side of (2.11) constitutes the major part of \({\mathbb {P}}\left( {\mathcal {E}}(V_{(j)})\right) \)) algebraically leads to \({\mathbb {P}}\left( {\mathcal {E}}(V_{(j)})\right) \xrightarrow [n\rightarrow +\infty ]{}1\), which implies

$$\begin{aligned} \begin{aligned}&\mathbb P\Big (\widehat{S}_{(j)}\big (\lambda _n^{(1)},\lambda _n^{(2)},\mathrm w_n,\mathrm \Sigma _n\big ) = S_{(j)}\Big |S_{(j)}\ne \emptyset \Big )\\&\xrightarrow [n\rightarrow +\infty ]{}1, \end{aligned} \end{aligned}$$

i.e., the variable selection consistency is established.

When \(s=0\), \(t=+\infty \), \(\mathrm w_n=1\) and \(\Sigma _{n}\) is the identity matrix, the AREIC becomes the nonnegative elastic irrepresentable condition (NEIC) as follows:

$$\begin{aligned} C_{61}\Big (C_{11}+\frac{\lambda _n^{(2)}}{n}\Big )^{-1}\Big (\textbf{1}+\frac{2\lambda _n^{(2)}}{\lambda _n^{(1)}}\beta _{(1)}^*\Big )\le \textbf{1 }- \eta _{(6)}, \end{aligned}$$
(2.12)

which yielded the variable selection consistency of nonnegative elastic net (Zhao et al 2014). If, in addition to (2.12), \(\lambda _n^{(2)}=0\), the NEIC then becomes the nonnegative irrepresentable condition (NIC):

$$\begin{aligned} C_{61}C_{11}^{-1}{\textbf{1}}\le {\textbf{1}} - \eta _{(6)}, \end{aligned}$$

which was a sufficient condition to obtain the variable selection consistency of the nonnegative lasso (Wu et al 2014). Note that, NIC is a nonnegative version of the irrepresentable condition (IC) for the variable selection consistency of the lasso (Zhao and Yu 2006):

$$\begin{aligned} |C_{61}C_{11}^{-1}{{\,\textrm{sign}\,}}(\beta _{(1)}^*)|\le {\textbf{1}} - \eta _{(6)}. \end{aligned}$$
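As a numerical illustration, the IC displayed above can be checked directly from data; the following sketch (with a toy design and an illustrative \(\eta \)) does so, using the support/zero-coefficient block partition of C.

```python
import numpy as np

def irrepresentable_condition(X, beta_star, eta=0.1):
    """Check the lasso irrepresentable condition
    |C_{S^c S} C_{SS}^{-1} sign(beta*_S)| <= 1 - eta  (elementwise),
    where S is the support of beta* and C = X'X / n."""
    n = X.shape[0]
    C = X.T @ X / n
    S = np.flatnonzero(beta_star != 0)           # support indexes
    Sc = np.flatnonzero(beta_star == 0)          # zero-coefficient indexes
    lhs = np.abs(C[np.ix_(Sc, S)] @ np.linalg.solve(C[np.ix_(S, S)],
                                                    np.sign(beta_star[S])))
    return bool(np.all(lhs <= 1 - eta)), lhs

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 8))
beta_star = np.array([3.0, 1.5, 0, 0, 2.0, 0, 0, 0])
holds, lhs = irrepresentable_condition(X, beta_star)
print(holds, lhs.round(3))
```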

IC is a sufficient and necessary condition for the variable selection consistency of the lasso (Zhao and Yu 2006), whereas NIC is only a sufficient condition for the nonnegative lasso; in practice, however, NIC is easier to satisfy than IC since it does not require the absolute value on the left-hand side of the inequality. The AREIC is thus a natural generalization of the previous sufficient conditions NEIC and NIC for variable selection consistency. Below we state the first main result of the paper. Its proof is given in Appendix A.

Theorem 2.1

Under AREIC and the conditions (2.5) - (2.10), the ARGEN possesses the variable selection consistency property (2.2).

2.3 Estimation consistency

Recall that an estimation method with target parameter \(\beta ^*\) has the property of estimation consistency if

$$\begin{aligned} \Vert {\widehat{\beta }} - \beta ^*\Vert _2 \xrightarrow [n\rightarrow \infty ]{{\mathbb {P}}} 0, \end{aligned}$$

where \(\Vert \cdot \Vert _2\) denotes the Euclidean distance and \(\xrightarrow [n\rightarrow \infty ]{{\mathbb {P}}}\) is the convergence in probability. Besides the variable selection consistency, ARGEN admits estimation consistency, subject to the following conditions.

(i) \(\beta ^*\in {\mathcal {I}}\). Let \(p=p_n\), \(q=q_n\) be non-decreasing as n increases.

(ii) \(\mathrm w_n=(\mathrm w_{n,1},\ldots , \mathrm w_{n,p_n})\) with \(\mathrm w_{n, 1},\ldots ,\mathrm w_{n, p_n}>0\) and \(\Sigma _n\) are given.

(iii) Let \(X_j\) be the jth column of X, which satisfies

    $$\begin{aligned} \max _{1\le j\le p_n}\frac{2(X_j'X_j+\lambda _n^{(2)}\Sigma _{n,jj})}{(1+\lambda _n^{(2)})\textrm{w}_{n,j}^2} \le 1,~\text{ for } \text{ all } ~ n\ge 1. \end{aligned}$$
(iv) X satisfies the restricted eigenvalue (RE) condition, i.e., there exists a constant \(\kappa >0\) such that for all \(n\ge 1\) and all \(\beta \in {\mathcal {I}}\) satisfying

    $$\begin{aligned} \sum _{j=4}^6\mathrm w_{n,(j)}'|\beta _{(j)}|\le 3\sum _{j=1}^3\mathrm w_{n,(j)}'|\beta _{(j)}|, \end{aligned}$$

    we have

    $$\begin{aligned} 2(\Vert X\beta \Vert _2^2+\lambda _n^{(2)}\beta '\Sigma _n \beta ) \ge \kappa (1+\lambda _n^{(2)}) \Vert {{\,\textrm{diag}\,}}(\mathrm w_{n})\beta \Vert _2^2. \end{aligned}$$
(v) \(\lambda _n^{(1)}\), \(\lambda _n^{(2)}\), \(\mathrm w_{n}\), \(p_n\) and \(q_n\) satisfy

    $$\begin{aligned} \frac{q_n(\lambda _n^{(1)})^2}{(1+\lambda _n^{(2)})^2}\xrightarrow [n\rightarrow \infty ]{}0 \end{aligned}$$

    and

    $$\begin{aligned} p_n\exp \bigg (-\frac{n}{8\sigma ^2}\frac{\big (\lambda _n^{(1)}\big )^2}{1+\lambda _n^{(2)}}\bigg )\xrightarrow [n\rightarrow \infty ]{}0, \end{aligned}$$

    where \(\sigma >0\) is the residual standard deviation of the ARGEN.

Below we state the estimation consistency of the ARGEN.

Theorem 2.2

Consider a \(q_n\)-sparse instance of the ARGEN (2.1). Let X satisfy conditions (i)-(iv) and let the regularization parameters satisfy \(\lambda _{n}^{(1)}>0, \lambda _{n}^{(2)}\ge 0\). Then the ARGEN solution \({\widehat{\beta }}:={\widehat{\beta }}(\lambda _n^{(1)},\lambda _n^{(2)},\mathrm w_n, \Sigma _n)\) satisfies:

$$\begin{aligned}{} & {} \begin{aligned}&{\mathbb {P}}\bigg (\Vert {{\,\textrm{diag}\,}}(\mathrm w_{n})({\widehat{\beta }} - \beta ^*)\Vert _2^2 >\frac{9q_n(\lambda _n^{(1)})^2}{\kappa ^2(1+\lambda _n^{(2)})^2}\bigg )\\&\quad \le 2p_n\exp \bigg (-\frac{n}{8\sigma ^2}\frac{(\lambda _n^{(1)})^2}{1+\lambda _n^{(2)}}\bigg ), \end{aligned} \end{aligned}$$
(2.13)
$$\begin{aligned}{} & {} \begin{aligned}&{\mathbb {P}}\bigg (\Vert {{\,\textrm{diag}\,}}(\mathrm w_{n})({\widehat{\beta }} - \beta ^*) \Vert _1 > \frac{12q_n\lambda _n^{(1)}}{\kappa (1+\lambda _n^{(2)})}\bigg )\\&\quad \le 2p_n\exp \bigg (-\frac{n}{8\sigma ^2}\frac{(\lambda _n^{(1)})^2}{1+\lambda _n^{(2)}}\bigg ), \end{aligned} \end{aligned}$$
(2.14)

where \(\sigma >0\) denotes the residual standard deviation of the ARGEN. In addition if (v) holds, we have

$$\begin{aligned} \Vert {\widehat{\beta }} - \beta ^*\Vert _2\xrightarrow [n\rightarrow \infty ]{{\mathbb {P}}}0. \end{aligned}$$
(2.15)

Proof

The main idea of the proof is to transform the ARGEN problem into a rectangle-range lasso problem. Let

$$\begin{aligned} \begin{aligned}{}&{} \widetilde{X}=\frac{\sqrt{2n}}{\sqrt{1+\lambda _n^{(2)}}}\begin{pmatrix} X{{\,\text {diag}\,}}(\mathrm w_{n})^{-1}\\ \sqrt{\lambda _n^{(2)}} \Sigma _n^{1/2}{{\,\text {diag}\,}}(\mathrm w_{n})^{-1} \end{pmatrix}_{(n+p)\times p },\\{}&{} \widetilde{Y} = \begin{pmatrix} \sqrt{2n}Y\\ 0 \end{pmatrix}_{(n+p)\times 1 }, \\{}&{} {\widetilde{\beta }}^* = \sqrt{1+\lambda _n^{(2)}}{{\,\text {diag}\,}}(\mathrm w_{n})\beta ^*, ~~~~\lambda _n = \frac{\lambda _n^{(1)}}{\sqrt{1+\lambda _n^{(2)}}},\\{}&{} \widetilde{{\mathcal {I}}} =\prod _{i=1}^{p_n}\left[ \sqrt{1+\lambda _n^{(2)}} \mathrm w_{n,i}s_i,\sqrt{1+\lambda _n^{(2)}} \mathrm w_{n,i}t_i\right] . \end{aligned} \end{aligned}$$

Then the ARGEN (2.1) can be written as the rectangle-range lasso:

$$\begin{aligned} \begin{aligned} \widehat{{\widetilde{\beta }}}(\lambda _n)&=\mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{\beta \in \widetilde{{\mathcal {I}}}}\left( \frac{1}{2n}\big \Vert \widetilde{Y} - \widetilde{X}\beta \big \Vert _2^2+\lambda _n|\beta |\right) \\&=\sqrt{1+\lambda _n^{(2)}}{{\,\textrm{diag}\,}}(\mathrm w_{n}){\widehat{\beta }}. \end{aligned} \end{aligned}$$
(2.16)

In view of conditions (i)-(iv), all requirements of Corollary 2 in Negahban et al (2012) are satisfied. Therefore, applying Corollary 2 in Negahban et al (2012) to the lasso (2.16) yields the results. We point out that: (1) based on its proof, Corollary 2 in Negahban et al (2012) also works for the rectangle-range lasso; (2) there is a typo in the statement of Corollary 2 in Negahban et al (2012): the inequalities (34) there should be corrected to

$$\begin{aligned} \begin{aligned}&\Vert {\widehat{\theta }}_{\lambda _n} - \theta ^* \Vert _2^2\le \frac{144\sigma ^2}{\kappa _{{\mathcal {L}}}^2}\frac{s\log p}{n}\\&\Vert {\widehat{\theta }}_{\lambda _n} - \theta ^* \Vert _1\le \frac{48\sigma }{\kappa _{{\mathcal {L}}}}s\sqrt{\frac{\log p}{n}}. \end{aligned} \end{aligned}$$

\(\square \)
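The augmentation used in the proof can be made concrete; the sketch below builds \(\widetilde{X}\), \(\widetilde{Y}\), \(\lambda _n\), and the transformed rectangle from the quantities defined above. The symmetric square root \(\Sigma _n^{1/2}\) is computed here via an eigendecomposition, which is one valid choice.

```python
import numpy as np

def argen_to_rectangle_lasso(X, Y, w, Sigma, lam1, lam2, s, t):
    """Build the rectangle-range lasso (2.16) equivalent to ARGEN (2.1).
    Returns (X_tilde, Y_tilde, lam_n, lower, upper)."""
    n, p = X.shape
    W_inv = np.diag(1.0 / w)                           # diag(w_n)^{-1}, assumes w > 0
    vals, vecs = np.linalg.eigh(Sigma)                 # symmetric square root of Sigma_n
    Sigma_half = vecs @ np.diag(np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T
    c = np.sqrt(2.0 * n) / np.sqrt(1.0 + lam2)
    X_tilde = c * np.vstack([X @ W_inv,
                             np.sqrt(lam2) * Sigma_half @ W_inv])    # (n + p) x p
    Y_tilde = np.concatenate([np.sqrt(2.0 * n) * Y, np.zeros(p)])    # (n + p,)
    lam_n = lam1 / np.sqrt(1.0 + lam2)
    scale = np.sqrt(1.0 + lam2) * w                    # maps [s, t] to the rectangle I-tilde
    return X_tilde, Y_tilde, lam_n, scale * s, scale * t
```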

If we assume \(\mathrm w_n\nrightarrow 0\) as \(n\rightarrow \infty \) in Theorem 2.2, we easily obtain the estimation consistency conditions for the nonnegative lasso (see Proposition 1 in Wu et al (2014)) and the nonnegative elastic net. Note that the estimation consistency of the nonnegative elastic net (Wu and Yang 2014) has not yet been derived, hence we state it below as a corollary of Theorem 2.2. To obtain the corollary it suffices to observe that \(\sum _{i=1}^3\mathrm w_{n,(i)}'\mathrm w_{n,(i)}=q_n\) when \(\mathrm w_{n,j}=1\) for all \(n\ge 1\) and \(j=1,\ldots ,p_n\).

Corollary 2.3

Consider a \(q_n\)-sparse nonnegative elastic net model. Assume:

(i) \(\beta ^*\ge 0\). \(p_n,q_n\) are non-decreasing as n increases.

(ii) Let \(X_j\) be the jth column of X, which satisfies

    $$\begin{aligned} \frac{2(X_j'X_j+\lambda _n^{(2)})}{1+\lambda _n^{(2)}} \le 1,~\text{ for } \text{ all } ~ j = 1,\ldots ,p. \end{aligned}$$
(iii) There exists a constant \(\kappa >0\) such that

    $$\begin{aligned} 2(\Vert X\beta \Vert _2^2+\lambda _n^{(2)}\Vert \beta \Vert _2^2 )\ge \kappa (1+\lambda _n^{(2)}) \Vert \beta \Vert _2^2 \end{aligned}$$

    for all \(\beta \ge 0\) satisfying

    $$\begin{aligned} \sum _{j\in \{1,\ldots ,p_n\}:~\beta _j^*=0}|\beta _{j}|\le 3\sum _{j\in \{1,\ldots ,p_n\}:~\beta _j^*\ne 0}|\beta _{j}|. \end{aligned}$$

Let \(\lambda _{n}^{(1)}>0, \lambda _{n}^{(2)}\ge 0\). Then the nonnegative elastic net solution \({\hat{\beta }}\) satisfies the following inequalities:

$$\begin{aligned}&{\mathbb {P}}\bigg (\Vert {\hat{\beta }} - \beta ^*\Vert _2^2 >\frac{9q_n(\lambda _n^{(1)})^2}{\kappa ^2(1+\lambda _n^{(2)})^2}\bigg ) \le 2p_n\exp \bigg (-\frac{n}{8\sigma ^2}\frac{(\lambda _n^{(1)})^2}{1+\lambda _n^{(2)}}\bigg ),\\&{\mathbb {P}}\bigg (\Vert {\hat{\beta }} - \beta ^* \Vert _1 > \frac{12q_n\lambda _n^{(1)}}{\kappa (1+\lambda _n^{(2)})}\bigg ) \le 2p_n\exp \bigg (-\frac{n}{8\sigma ^2}\frac{(\lambda _n^{(1)})^2}{1+\lambda _n^{(2)}}\bigg ). \end{aligned}$$

As a consequence of Corollary 2.3, \({{\widehat{\beta }}}\) is consistent if

$$\begin{aligned} \frac{q_n(\lambda _n^{(1)})^2}{(1+\lambda _n^{(2)})^2}\xrightarrow [n\rightarrow \infty ]{}0 \end{aligned}$$

and

$$\begin{aligned} p_n \exp \bigg (-\frac{n(\lambda _{n}^{(1)})^2}{8\sigma ^2(1+\lambda _{n}^{(2)})}\bigg )\xrightarrow [n\rightarrow \infty ]{}0. \end{aligned}$$

If we take \(\lambda _n^{(2)}=0\) and \(\lambda _n^{(1)}=4\sigma \sqrt{\log p_n/n}\) in Corollary 2.3, we obtain the nonnegative lasso's tail probability control as in Proposition 1 in Wu et al (2014). If we further assume \(\beta ^*\in {\mathbb {R}}^{p_n}\) in Corollary 2.3, we derive the tail bounds for the lasso (see Corollary 2 in Negahban et al (2012)).

2.4 Limiting distributions of ARGEN estimators

We now study the asymptotic behavior in distribution of the ARGEN estimator as \(n\rightarrow \infty \). Again we make use of the transformation of ARGEN into the rectangle-range lasso model (2.16), since the limiting distributions of lasso regression estimators have been studied in Fu and Knight (2000). Observe that (2.16) is equivalent to

$$\begin{aligned} \widehat{{\widetilde{\beta }}}(\lambda _n) =\mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{\beta \in \widetilde{{\mathcal {I}}}}\left( \big \Vert \breve{Y} - \breve{X}\beta \big \Vert _2^2+\lambda _n |\beta |\right) , \end{aligned}$$
(2.17)

where \(\breve{X}={\widetilde{X}}/\sqrt{2n}\) and \(\breve{Y}=\widetilde{Y}/\sqrt{2n}\). (2.17) is then the type of lasso studied in Fu and Knight (2000). Assume that the row vectors of \(\breve{X}\), denoted by \(\breve{X}^{(i)}\), \(i=1,\ldots ,n\), satisfy

$$\begin{aligned} \frac{1}{n}\sum _{i=1}^n\breve{X}^{(i)}{\breve{X}^{(i)}}{}'\xrightarrow [n\rightarrow \infty ]{}M, \end{aligned}$$
(2.18)

where M is a nonsingular nonnegative definite matrix and

$$\begin{aligned} \frac{1}{n}\max \limits _{1\le i\le n}{\breve{X}^{(i)}}{}'\breve{X}^{(i)}\xrightarrow [n\rightarrow \infty ]{}0. \end{aligned}$$
(2.19)

It follows from Theorem 2 in Fu and Knight (2000) that the ARGEN estimator \({{\widehat{\beta }}}\) has the following asymptotic behavior in distribution.

Theorem 2.4

Assume \(\lim _{n\rightarrow \infty }p_n=p\), \(\lim _{n\rightarrow \infty }\lambda _n^{(2)}\) \(=\lambda ^{(2)}\) and \(\lim _{n\rightarrow \infty }\mathrm w_n=\mathrm w=(\mathrm w_1,\ldots ,\mathrm w_p)\). Let X, \(\mathrm w_n\) and \(\Sigma _n\) satisfy (2.18) and (2.19). Also assume

$$\begin{aligned} \frac{\lambda _n^{(1)}}{\sqrt{n(1+\lambda _n^{(2)})}}\xrightarrow [n\rightarrow \infty ]{}\lambda _0\ge 0. \end{aligned}$$

Then

$$\begin{aligned} \sqrt{n}({{\widehat{\beta }}}-\beta ^*)\xrightarrow [n\rightarrow \infty ]{\text{ law }}\mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{u\in {\mathcal {I}}}(V(u)), \end{aligned}$$

where \(\xrightarrow [n\rightarrow \infty ]{\text{ law }}\) denotes the convergence in distribution; V(u) is a Gaussian random variable given as

$$\begin{aligned} \begin{aligned}&V(u)=-2u'G+u'Mu\\&\qquad +\lambda _0\sum _{j=1}^p\left( u_j{{\,\textrm{sign}\,}}(\beta _j^*)\mathbb {1}(\beta _j^*\ne 0)+|u_j|\mathbb {1}(\beta _j^*=0)\right) . \end{aligned} \end{aligned}$$

In the above expression of V(u), \(G\sim {\mathcal {N}}(0,\sigma ^2\,M)\), \(u_j\) denotes the jth coordinate of u and \(\mathbb {1}\) is the indicator function.

Theorem 2.4 includes the asymptotic behaviors of the elastic net and nonnegative elastic net estimators as particular cases. As another particular case, when \(\lambda _0=0\) and \(p=1\) (so that M is a scalar and \({\mathcal {I}}=[s_1,t_1]\)), by the convexity of V we obtain

$$\begin{aligned} \mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{u\in {\mathcal {I}}}(V(u))=\left\{ \begin{array}{ll} M^{-1}G&{}~\text{ if }~M^{-1}G\in [s_1,t_1];\\ s_1&{}~\text{ if }~M^{-1}G<s_1;\\ t_1&{}~\text{ if }~M^{-1}G>t_1. \end{array} \right. \end{aligned}$$

In the above example, if \(p\ge 2\) and \({\mathcal {I}}\ne {\mathbb {R}}^p\), \(\mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{u\in {\mathcal {I}}}(V(u))\) has no simple explicit expression; it is the solution of a quadratic programming problem. In the next section we provide a multiplicative updates numerical algorithm to solve the ARGEN. This algorithm may be further applied to simulate \(\mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{u\in {\mathcal {I}}}(V(u))\) numerically.
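The one-dimensional case above is easy to simulate; the following sketch draws from the limiting distribution when \(\lambda _0=0\) and \(p=1\) by clipping \(M^{-1}G\) to \([s_1,t_1]\), with illustrative values of M, \(\sigma \), \(s_1\), \(t_1\).

```python
import numpy as np

rng = np.random.default_rng(2)
M, sigma, s1, t1 = 2.0, 1.0, -0.5, 1.0                 # illustrative values
G = rng.normal(0.0, sigma * np.sqrt(M), size=100_000)  # G ~ N(0, sigma^2 M)
samples = np.clip(G / M, s1, t1)                       # argmin of V over [s1, t1] when lambda_0 = 0
print(samples.mean().round(4),
      (samples == s1).mean().round(4),                 # probability mass accumulated at s1
      (samples == t1).mean().round(4))                 # probability mass accumulated at t1
```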

3 MU-QP-RR-W-\(l_1\) Algorithm for Solving ARGEN

In this section we provide a solution of ARGEN by an extension of the multiplicative updates algorithm. Given \(\mathrm w_n,~\Sigma _n\) and \(\lambda _n^{(1)},\lambda _n^{(2)}\ge 0\), the ARGEN in (2.1) can be expressed as the following equivalent problem:

(3.1)

To simplify the problem, we rewrite it by taking

and obtain an equivalent problem of (3.1), that is,

$$\begin{aligned} {\left\{ \begin{array}{ll} \text {minimize } F(v) = v'Av+b'v+d'|v-v^0|, \\ \text {subject to } v \in [0, l]. \end{array}\right. } \end{aligned}$$
(3.2)

This is obtained by noting that \( |v+ s|-|v-v^0|=s^+:=\big (\max \{0,s_1\}~\ldots ~\max \{0,s_p\}\big )'\) and omitting the constant terms \( d' s^++ (3/2)s'A s-b' s. \) Here the matrix A is symmetric positive semi-definite. The problem (3.2) is a quadratic programming problem, except that it contains the term \(d'|v-v^0|\).

Sha et al (2007a) derived multiplicative updates for solving the nonnegative quadratic programming problem. The algorithm has a simple closed form and a rapid convergence rate. Our problem (3.2), however, contains absolute values as well as lower and upper limits on the optimization variables, so direct application of the algorithm in Sha et al (2007a) is impractical. Therefore, we propose a new iterative algorithm to solve (3.2), called multiplicative updates for solving quadratic programming with rectangle range and weighted \(l_1\) regularizer (abbreviated to MU-QP-RR-W-\(l_1\)).

Let us formulate a more general problem that can be solved by MU-QP-RR-W-\(l_1\):

$$\begin{aligned} {\left\{ \begin{array}{ll} \text{ minimize } ~F(v) = \frac{1}{2} v'Av + b' v+d'|v-v^0|, \\ \text{ subject } \text{ to }~v\in [0, l]. \end{array}\right. } \end{aligned}$$
(3.3)

Here \(v,b,d,v^0,l\) are column vectors of dimension p, where the elements of \(d, v^0\) are nonnegative and the elements of l are positive. The matrix \( A=(A_{ij})_{1\le i,j\le p}\) is positive semi-definite. In fact, nonnegative quadratic programming (see e.g. Equation (5) in Sha et al (2007a) or (20) in Wu and Yang (2014)) is a special case of (3.3), obtained by taking the elements of \(d, v^0\) to be 0 and the elements of l to be infinity.

Let us further adopt the following notations. For \(i,j\in \{1,\ldots ,p\}\), we define the positive part and negative part of \(A_{ij}\) by

$$\begin{aligned} A_{ij}^{+} := \max \left\{ 0,A_{ij}\right\} ~ \text { and }~ A_{ij}^{-} := \max \left\{ 0,-A_{ij}\right\} . \end{aligned}$$

Then denote the positive part and negative part of the matrix A by

$$\begin{aligned} A^+:=\big (A_{ij}^+\big )_{1\le i,j\le p}~\text{ and }~A^-:=\big (A_{ij}^{-}\big )_{1\le i,j\le p}. \end{aligned}$$

It follows that \(A=A^+-A^-\) and \(|A|:=\left( |A_{ij}|\right) _{1\le i,j\le p}=A^++A^-\). Let \( a_i(v):= (A^+ v)_i\) and \(c_i(v):= (A^- v)_i \). We then present the MU-QP-RR-W-\(l_1\) algorithm in pseudocode below.

[Algorithm 1: MU-QP-RR-W-\(l_1\), the multiplicative update (3.4) in pseudocode]

We point out that the conditions \(r_1>v_i^0\) and \(r_2<v_i^0\) in (3.4) are mutually exclusive when \(v_i^{(m)}>0\). This is because, on one hand, \(r_1>v_i^0\) is equivalent to

$$\begin{aligned} \begin{aligned}&\frac{2a_i(v)v_i^0}{v_i}+(b_i+d_i)<0\\&\text{ or }\\&\left\{ \begin{array}{ll} &{} \frac{2a_i(v)v_i^0}{v_i}+(b_i+d_i)\ge 0, \\ &{} a_i(v)\left( \frac{v_i^0}{v_i}\right) ^2+(b_i+d_i)\frac{v_i^0}{v_i}-c_i(v)<0. \end{array}\right. \end{aligned} \end{aligned}$$
(3.5)

On the other hand, \(r_2<v_i^0\) is equivalent to

$$\begin{aligned} \left\{ \begin{array}{ll} &{} \frac{2a_i(v)v_i^0}{v_i}+(b_i-d_i)\ge 0, \\ &{} a_i(v)\left( \frac{v_i^0}{v_i}\right) ^2+(b_i-d_i)\frac{v_i^0}{v_i}-c_i(v)>0. \end{array}\right. \end{aligned}$$
(3.6)

Since \(d_i\ge 0\), it is obvious that (3.5) and (3.6) are mutually exclusive.

A special case of the algorithm arises when \(d_i=0\) and \(l_i=+\infty \) for \(i=1,\ldots ,p\); then (3.4) becomes

$$\begin{aligned} v_i\longleftarrow v_i\Big (\frac{-b_i + \sqrt{b_i^2 + 4a_i(v) c_i(v)}}{2a_i(v)}\Big ), \end{aligned}$$

for \(i=1,\ldots ,p\), which reduces to the one for nonnegative quadratic programming (see e.g. (7) - (12) in Sha et al (2007a) or (21) - (22) in Wu and Yang (2014)).
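Since the closed form of the update (3.4) is given in Algorithm 1 rather than in the text, the following Python sketch reconstructs a plausible version of it from the equivalences (3.5)-(3.6) and the special case above: \(r_1\) and \(r_2\) are taken to be the multiplicative candidates associated with \(b_i+d_i\) and \(b_i-d_i\), respectively, the kink \(v_i^0\) is used when neither condition fires, and the result is clipped to \([0, l_i]\). Treat it as an illustration of the multiplicative-update idea, not as the authoritative algorithm.

```python
import numpy as np

def mu_qp_rr_w_l1(A, b, d, v0, l, v_init, n_iter=2000):
    """Multiplicative-update style iteration for problem (3.3):
        min_{v in [0, l]}  (1/2) v'Av + b'v + d'|v - v0|.
    The r1/r2 formulas below are inferred from (3.5)-(3.6) and from the
    d = 0, l = +inf special case; consult Algorithm 1 for the exact update."""
    A_plus, A_minus = np.maximum(A, 0.0), np.maximum(-A, 0.0)   # A = A^+ - A^-
    v = np.asarray(v_init, dtype=float).copy()
    for _ in range(n_iter):
        a = A_plus @ v + 1e-15          # a_i(v); tiny epsilon guards the division in this sketch
        c = A_minus @ v                 # c_i(v)
        r1 = v * (-(b + d) + np.sqrt((b + d) ** 2 + 4.0 * a * c)) / (2.0 * a)
        r2 = v * (-(b - d) + np.sqrt((b - d) ** 2 + 4.0 * a * c)) / (2.0 * a)
        cand = np.where(r1 > v0, r1, np.where(r2 < v0, r2, v0))
        v = np.clip(cand, 0.0, l)       # keep the iterate inside the rectangle [0, l]
    return v
```

With \(d=0\), \(v^0=0\), and \(l=+\infty \), the candidate above reduces to the nonnegative quadratic programming update quoted in the preceding display.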

The MU-QP-RR-W-\(l_1\) iteration converges monotonically to the global minimum of the objective function \(F(\cdot )\) in (3.3) over the rectangle \([0, l]\). This is summarized in the theorem below:

Theorem 3.1

Let \(F(\cdot ),~A,~b,~d,~v^0,~ l\) be given as in the problem (3.3). Define an auxiliary function \(G( \cdot , \cdot )\) by: for \(u,v\in [0, l]\),

(3.7)

For any positive-valued vector \(v\in [0, l]\), pick a vector \(U(v)\in \mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{ u\in [0, l]}G(u, v) \). Then U(v) satisfies the following:

(i) For any \(v\in [0, l]\),

    $$\begin{aligned}{} & {} F(U(v))\le F(v), \end{aligned}$$
    (3.8)
    $$\begin{aligned}{} & {} F(U(v))= F(v)~\text{ if } \text{ and } \text{ only } \text{ if }~U(v)=v. \end{aligned}$$
    (3.9)
(ii) For each \(v\in [0, l]\), U(v) is the updated value of v, presented in the form of (3.4).

The approach we use to prove Theorem 3.1 is similar to the ones used in the Expectation-Maximization algorithm (Dempster et al 1977), nonnegative matrix factorization (Lee and Seung 2000), and multiplicative updates for nonnegative quadratic programming (Sha et al 2007a, b). More specifically, the proof proceeds in two steps. First, we establish the auxiliary function \(G( \cdot , \cdot )\) in (3.7) and show that MU-QP-RR-W-\(l_1\) monotonically decreases the value of the objective function \(F( \cdot )\) in (3.3). Then, we show that the iterative updates (3.4) in Algorithm 1 converge to the global minimum. The complete proof of Theorem 3.1 is provided in Appendix B.

4 Hyper-parameter optimization

ARGEN is a family of models that includes many well-known constrained linear models. For instance, in Table 1, ARLS, ARL, ARR, and AREN correspond to the arbitrary rectangle-range least squares, lasso, ridge, and elastic net, respectively. Besides, based on the choice of parameters we can propose other new methods, including ARGL, ARGR, ARLEN and ARREN, which are applicable to more complicated problems and usually perform better than their free-range counterparts.

However, many of the methods in Table 1 involve a very high-dimensional hyper-parameter space, so grid search for tuning the parameters can be computationally expensive. Other tuning approaches such as Bayesian optimization and gradient-based optimization, developed to obtain better results in fewer evaluations, are also costly in our case because they tend to search around local minima when the complexity of the surface is relatively high. Since discovering better tuning methods is not a focus of this paper (it is a potential direction of our future research), we simply use random search (\(N_{calls}\) trials) to avoid costly searching over the entire grid. Following the convention in Zou and Hastie (2005) and Tibshirani (1996), we use the mean-squared error (MSE) as the score function, that is,

$$\begin{aligned} MSE={\mathbb {E}}\big \Vert X{\widehat{\beta }}-X\beta ^*\big \Vert _2^2={\mathbb {E}}\big [({\widehat{\beta }}-\beta ^*)'X'X({\widehat{\beta }}-\beta ^*)\big ]. \end{aligned}$$
(4.1)
Table 1 Particular examples of ARGEN methods and their parameter setting
Table 2 Tuning grid for different methods

To further speed up the tuning process, we select the following values for each of the potential hyper-parameters to tune on. \(\lambda _n^{(1)},~\lambda _n^{(2)}\) take integer values in the sets \(\Lambda ^{(1)}\) and \(\Lambda ^{(2)}\), respectively, where for each \(i=1,2\), \( \Lambda ^{(i)}=\big \{ 0,~1,\ldots ,\lambda _{{{\text {up}}}} ^{(i)}\big \}~\text{ for } \text{ some }~\lambda _{{{\text {up}}}} ^{(i)}\in \mathbb {Z}_+. \) The weight vector \(\mathrm w_n\) takes values in

$$\begin{aligned} W=\Big \{(\mathrm w_1~\ldots ~\mathrm w_p)':~\mathrm w_1,\ldots ,\mathrm w_p\in \{0,1,\ldots ,w _{{{\text {up}}}}\}\Big \}, \end{aligned}$$

for some \(w _{{{\text {up}}}} \in \mathbb {Z}_+\). The matrix \(\mathrm \Sigma _n\) can be decomposed as \(\Sigma _n=PDP'\) with an orthogonal matrix P and a nonnegative diagonal matrix D. Therefore, the values of \(\mathrm \Sigma _n\) are considered in

$$\begin{aligned} \begin{aligned} \Sigma =\Big \{PDP':&D={{\,\textrm{diag}\,}}(d_1,\cdots ,d_p), \\ {}&d_1,\ldots ,d_p\in \left\{ 0,\cdots ,d _{{{\text {up}}}} \right\} \Big \}, \end{aligned} \end{aligned}$$

for some \(d _{{{\text {up}}}} \in \mathbb {Z}_+\) and orthogonal matrix P. The cardinalities of \(\Lambda ^{(1)}\), \(\Lambda ^{(2)}\), W, and \(\Sigma \) are \(\lambda _{{{\text {up}}}} ^{(1)}+1\), \(\lambda _{{{\text {up}}}} ^{(2)}+1\), \((w _{{{\text {up}}}} +1)^p\), and \((d _{{{\text {up}}}} +1)^p\), respectively; we thus obtain the total number of grid values for each method, listed in Table 2.

To improve the performance of the methods, we can: (1) randomly choose more trials, that is, increase \(N_{calls}\), to cover more values on the grid, since theoretically searching on the whole grid will give the best result; (2) increase the values of \(w _{{{\text {up}}}} \) and \(d _{{{\text {up}}}} \), but at the same time \(N_{calls}\) also needs to be increased since higher values of \(w _{{{\text {up}}}} \) and \(d _{{{\text {up}}}} \) will exponentially enlarge the grid.
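A minimal random-search loop over the grid just described might look as follows; fit_and_score is a hypothetical placeholder standing in for fitting ARGEN with a given hyper-parameter set and returning the validation MSE (4.1).

```python
import numpy as np

rng = np.random.default_rng(3)
p, lam1_up, lam2_up, w_up, d_up, n_calls = 8, 100, 100, 2, 2, 500

def fit_and_score(lam1, lam2, w, Sigma):
    # placeholder: fit ARGEN on the training set with these hyper-parameters
    # and return the validation MSE (4.1); replaced here by a dummy value
    return rng.random()

def sample_hyper_parameters():
    """Draw one random point from Lambda^(1) x Lambda^(2) x W x Sigma."""
    lam1 = int(rng.integers(0, lam1_up + 1))
    lam2 = int(rng.integers(0, lam2_up + 1))
    w = rng.integers(0, w_up + 1, size=p)                     # element of W
    D = np.diag(rng.integers(0, d_up + 1, size=p).astype(float))
    P = np.linalg.qr(rng.standard_normal((p, p)))[0]          # a random orthogonal matrix
    return lam1, lam2, w, P @ D @ P.T                         # Sigma_n = P D P'

best_score, best_params = np.inf, None
for _ in range(n_calls):
    params = sample_hyper_parameters()
    score = fit_and_score(*params)
    if score < best_score:
        best_score, best_params = score, params
```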

5 Simulations

5.1 Signal recovery

The purpose of the signal recovery examples below is to explore the best possible performance of ARGEN, and to show ARGEN’s ability to “reduce noise” and deal with high-dimensional (\(p\gg n\)) sparse signals.

First, we consider the same problem as in Mohammadi et al (2018) to compare our results with theirs. In the following, we briefly outline the problem. A sparse signal \(\beta ^*\in {\mathbb {R}}^{4096}\) with 160 spikes of amplitude 1 is generated and plotted in Fig. 1 (top); this is the true regression coefficient vector. The design matrix \(X\in {\mathbb {R}}^{1024\times 4096}\) is generated with each entry sampled i.i.d. from the standard normal distribution and each pair of rows orthogonalized. The response vector \(Y\in {\mathbb {R}}^{1024}\) is then generated through \(Y=X\beta ^*+\epsilon \), where \(\epsilon \in {\mathbb {R}}^{1024}\) is a vector of i.i.d. Gaussian noise with zero mean and variance 0.1. The lower and upper bounds of \({{\widehat{\beta }}}\) are \(-1\) and 1, respectively. Given \(\lambda _n^{(1)}=10, \lambda _n^{(2)}=0,\) and \(w_i=0 \) if \(\beta _i\ne 0 \) for \(i=1,\ldots , 4096\), we obtain the recovered signal \({{\widehat{\beta }}}\) and its difference from the true signal in the middle and bottom plots of Fig. 1. As a result, ARGEN achieves an MSE of 0.00069, lower than the MSE of 0.00273 in Mohammadi et al (2018).
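A sketch of the data-generating step of this experiment is given below; the row orthogonalization is done here with a QR decomposition, which is one way to obtain pairwise-orthogonal rows (and rescales the entries relative to a raw standard-normal draw), and the spike locations are chosen at random.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, n_spikes = 1024, 4096, 160

beta_star = np.zeros(p)
spikes = rng.choice(p, size=n_spikes, replace=False)
beta_star[spikes] = 1.0                                  # 160 spikes of amplitude 1

G = rng.standard_normal((n, p))
X = np.linalg.qr(G.T, mode='reduced')[0].T               # rows made mutually orthogonal
eps = np.sqrt(0.1) * rng.standard_normal(n)              # Gaussian noise, variance 0.1
Y = X @ beta_star + eps                                  # observed response
```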

To show that ARGEN can deal with more complicated problems, we consider another signal recovery problem. We follow the same settings as in the previous problem, but this time replace the amplitude of each spike by a random value generated from the uniform distribution over [0, 1). The corresponding true and recovered signals and their difference are plotted in Fig. 2. The MSE obtained by ARGEN is 0.00166.

Fig. 1: Signal recovery with equal-length spikes. MSE \(= 0.00069\)

Fig. 2: Signal recovery with arbitrary-length spikes. MSE \(= 0.00166\)

5.2 Methods comparison

In this section, we compare the performances of the methods listed in Table 1. We adopt the following setup to tune the four hyper-parameters: \(\lambda _{{{\text {up}}}} ^{(1)}=100,~ \lambda _{{{\text {up}}}} ^{(2)}=100,~w _{{{\text {up}}}} =2,~d _{{{\text {up}}}} =2\). Considering the computational cost, the number of random grid values tried (\(N_{calls}\)) is 100 for ARL and ARR, 500 for AREN, 1280 for ARGL and ARGR, 2560 for ARLEN and ARREN, and 6554 for ARGEN.

We conduct 8 examples to test the performance of each method. In each example, we simulate 50 data sets from \( Y=X\beta ^*+\epsilon \), \(\epsilon \sim {\mathcal {N}}(0,\sigma ^2)\), and each data set consists of independent training, validation, and testing sets. We use the training set to fit the models, tune the parameters on the validation set, and compute the test error, measured by the MSE (4.1), on the testing set. In the following, we outline these examples.

In Example 1, let \(\beta ^*=(3~1.5~0~0~2~0~0~0)'\), \(p=8\), \(\sigma =3\), and the pairwise correlation between \(X_i\) and \(X_j\) be \(0.5^{|i-j|}\) for all ij. We use 20 observations for training, 20 for validation, and 200 for testing.

Example 2 is the same as Example 1, except that each entry of \(\beta ^*\) is replaced with 0.85.

In Example 3, let \(\sigma =15\), \(p=40\),

$$\begin{aligned} \beta ^*=(\underbrace{0~\cdots ~0}_{\text{10 times}}~\underbrace{2~\cdots ~2}_{\text{10 times}}~\underbrace{0~\cdots ~0}_{\text{10 times}}~\underbrace{2~\cdots ~2}_{\text{10 times}})', \end{aligned}$$

and the pairwise correlation between \(X_i\) and \(X_j\) be 0.5 for all ij. We use 100 observations for training, 100 for validation, and 400 for testing.

In Example 4, let \(\sigma =15\), \(p=15\),

$$\begin{aligned} \beta ^*=(\underbrace{3~\cdots ~3}_\text {6 times}~\underbrace{0~\cdots ~0}_\text {9 times})'. \end{aligned}$$

Let the design matrix X be generated as:

$$\begin{aligned} \begin{aligned}{}&{} x_i=Z_1+\epsilon _i^x, ~~Z_1\sim {\mathcal {N}}(0,1), ~~i=1,2,\\{}&{} x_i=Z_2+\epsilon _i^x, ~~Z_2\sim {\mathcal {N}}(0,1), ~~i=3,4,\\{}&{} x_i=Z_3+\epsilon _i^x, ~~Z_3\sim {\mathcal {N}}(0,1), ~~i=5,6,\\{}&{} \text { where }~\epsilon _i^x\text {'s are } \text { i.i.d. }~ {\mathcal {N}}(0,0.01),~ i=1,\cdots ,6. \\{}&{} x_i\sim N(0,1), ~~x_i\text {'s are } \text { i.i.d. }, ~~i=7,\cdots ,15. \end{aligned} \end{aligned}$$

We use 40 observations for training, 40 for validation, and 100 for testing.

Example 5 is the same as Example 1, except that \(\beta ^*=(-3~-1.5~0~0~2~0~0~0)'\) and \(\beta _i\ge -1000\) for all i.

Example 6 is the same as Example 1, except that each entry of \(\beta ^*\) is replaced with a randomly generated number in \([-5,5]\), and these values are used for all 50 data sets. Besides, we restrict \(\beta _i\in [-5,5]\) for all i.

Example 7 is the same as Example 1, but uses \(\beta ^*=(-6~-8~0~0~7~0~0~0)'\) and restricts \(\beta _i\in [-5,5]\) for all i.

Example 8 is the same as Example 4, but uses 5 observations for training, 5 for validation, and 50 for testing. Besides, we restrict \(\beta _i\ge -1000\) for all i and use

$$\begin{aligned} \beta ^*=(\underbrace{-3~\cdots ~-3}_\text {6 times}~\underbrace{0~\cdots ~0}_\text {9 times})'. \end{aligned}$$

The first three examples above are from Zou and Hastie (2005) and Tibshirani (1996) and were originally constructed for the lasso. The fourth example is similar to one in Zou and Hastie (2005), which creates a grouped-variable situation. None of the first four examples, however, requires lower or upper bound constraints on the coefficients. To show and test that ARGEN is applicable to more general and complicated problems, we add four more examples, Examples 5 to 8. In each of Examples 5, 6, and 8, constraints are added that include the true coefficients. In Example 7, we provide a case where the true coefficients lie outside the interval constraints. The values 1000 and 5 were chosen arbitrarily to illustrate the model's ability to work with constrained coefficients. Moreover, another purpose of the last example is to test model performance in high-dimensional \((p\ge n)\) scenarios.
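As an illustration of the simulation setup, the following sketch generates one training set of Example 1, with pairwise correlation \(0.5^{|i-j|}\) induced through a Cholesky factor of the correlation matrix (one standard construction).

```python
import numpy as np

rng = np.random.default_rng(5)
p, n_train, sigma = 8, 20, 3.0
beta_star = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])

corr = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))  # corr(X_i, X_j) = 0.5^{|i-j|}
L = np.linalg.cholesky(corr)
X = rng.standard_normal((n_train, p)) @ L.T                          # rows ~ N(0, corr)
Y = X @ beta_star + sigma * rng.standard_normal(n_train)             # Example 1 response
```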

Table 3 Median MSE and the corresponding standard error (given in the parentheses) over 50 replications for each method and example

Table 3 summarizes the median MSE and its corresponding standard error over the 50 data sets for each method in Table 1 and each of the above 8 examples. The MSEs of examples with different \(\sigma \) are not comparable because they are simulated with different noise variances. Some of our examples do, however, share a similar simulation process, and their MSEs are at the same level. For instance, Examples 1 and 5 are similar except that Example 5 has a lower limit of \(-1000\) on the coefficients, whereas Example 1 has no limit. As a result of this lower constraint, Example 5 tends to force coefficients above the lower limit, resulting in relatively higher MSEs than Example 1. Beyond cross-example comparisons, it is more meaningful to compare the MSEs across methods for each example. The overall performance of methods with more parameters is better than that of methods with fewer parameters. More specifically, ARGL, ARGR, ARLEN, ARREN, and ARGEN in most cases outperform ARLS, ARL, ARR, and AREN. For instance, ARGEN, the most complicated method, which includes all four hyper-parameters, performs best in Examples 1, 2, 5, 6, and 7. ARLEN and ARREN, the second most complex methods, provide the second highest accuracy in Examples 1, 2, 3, 4, 5, 7, and 8. ARGL and ARGR are at the third level of performance in Examples 1, 2, 5, 6, and 7. However, in Table 3, performance does not always increase as the model gets more complicated. This is because the ratio of values searched (\(N_{calls}\)) to the total number of values on the grid is not the same for all methods, due to the exponential increase of the size of the grid as more hyper-parameters are included. It is also because we keep the same \(N_{calls}\) in each method for all the examples, which, in fact, have different dimensionality. Therefore, the performance of methods like ARLEN, ARREN, and ARGEN is worse than expected in some of the examples.

6 Real world application - S &P 500 index tracking

6.1 Outline

Index tracking is a passive management strategy that replicates the return of a market index (e.g., the S&P 500 in New York and the FTSE 100 in London) by constructing an equity portfolio that contains only a subset of the index constituents, so as to reduce transaction and management costs (Connor and Leland 1995; Franks 1992; Jacobs and Levy 1996; Jobst et al 2001; Larsen and Resnick 1998; Lobo et al 2000; Toy and Zurack 1989).

In this section, we show how ARGEN applies to index tracking, an asset allocation (Markowitz 1952) problem with allocation constraints, and compare the results with those of the nonnegative lasso (Wu et al 2014) and the nonnegative elastic net (Wu and Yang 2014). Through this example, (1) we provide general practical guidance for adapting ARGEN to solve real world problems; (2) we demonstrate ARGEN's feasibility and flexibility compared to the existing methods. In particular, we highlight that ARGEN can deal with problems that require constraints on the coefficients, while none of the existing methods (Wu et al 2014; Wu and Yang 2014) can.

We take tracking the US S&P 500 index as our example. It is worth noting that in Wu et al (2014) the nonnegative lasso is applied to tracking the CSI 300 index, and in Wu and Yang (2014) the nonnegative elastic net is used to track the CSI 300 and SSE 180. Both indexes are based on stocks without short sales. This is not the case for the US stock market, where short sales are allowed. Although short-selling is allowed in the US market, traditional mutual funds still hold long-only portfolios. Previous research (Almazan et al 2004; Agarwal et al 2009; Chen et al 2013; An et al 2021) suggested that only a small portion of mutual funds hold short positions in their portfolios, even though short-selling is allowed for many of them. In particular, the data from An et al (2021) suggest that about 90% of the sampled mutual funds held long-only portfolios in the 8 quarters before the research date (July 2021), even though 40% of those funds are explicitly allowed to hold short positions. There are long-only funds in the US market, to name a few: MS INVF Global Advantage Fund, Jennison Global Opportunity Fund, and Thematics Safety Fund. According to the managing partner of Reverb ETF (a user-voting-based, long-only, diversified equity fund), constraints are usually in place for constituents to ensure portfolio diversification and risk mitigation. From a global perspective, short selling still faces limitations or additional regulations outside the US, such as in China, India, and Brazil. In 2020, due to market volatility, several European countries (Belgium, France, Italy, etc.) imposed short-selling restrictions for 2 months from March 18, 2020, and a few Asian countries (South Korea, Indonesia, and Thailand) imposed longer short-selling restrictions (Manson, 2020). Given the above, we believe that an empirical study with a nonnegative range setting is a good fit for general applications in practice.

In portfolio management, how closely the constructed tracking portfolio's return follows that of the benchmark index is a primary measurement for assessing portfolio performance for passive strategies such as index tracking. Hence, inspired by Sant'Anna et al (2020), we evaluate tracking portfolio performance from the following three perspectives. Our primary performance measurement is the tracking error (TE),

$$\begin{aligned} {\text {TE}} = \sqrt{\frac{\sum \limits _{t=1}^{T}\left( (r_{t}^{p} - r_{t}^{b}) - {\mathbb {E}}[r_{t}^{p} - r_{t}^{b}]\right) ^2}{T}} \end{aligned}$$

for measuring the volatility of the excess return of a portfolio to the corresponding benchmark. We also compute the annual volatility of portfolio return (ARV),

$$\begin{aligned} {\text {ARV}} = \sqrt{252} \sqrt{\frac{\sum \limits _{t=1}^{T}\left( r_{t}^{p} - {\mathbb {E}}[r_{t}^{p}]\right) ^2}{T}} \end{aligned}$$

to measure the annualized return volatility of a portfolio. In addition, we also report the cumulative return

$$\begin{aligned} {\text {CR}} = \prod _{t=1}^{T} (1+r_{t}^{p}) - 1 \end{aligned}$$

of the constructed portfolios in our study. Here \(r_{t}^{p}\) denotes the portfolio return at time t, \(r_{t}^{b}\) is the benchmark return at time t, and T is the total number of periods.
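The three measures translate directly into code; the sketch below computes them from arrays of daily portfolio and benchmark returns (the 252-day annualization factor follows the ARV definition above).

```python
import numpy as np

def tracking_error(r_p, r_b):
    """TE: standard deviation of the excess return of the portfolio over the benchmark."""
    excess = r_p - r_b
    return np.sqrt(np.mean((excess - excess.mean()) ** 2))

def annual_return_volatility(r_p):
    """ARV: annualized (252 trading days) volatility of the daily portfolio returns."""
    return np.sqrt(252.0) * np.sqrt(np.mean((r_p - r_p.mean()) ** 2))

def cumulative_return(r_p):
    """CR: compounded return of the portfolio over the whole period."""
    return np.prod(1.0 + r_p) - 1.0
```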

Because there is no guarantee that the normalized \({{\widehat{\beta }}}\) still lies below t, we introduce the following normalization process, which constrains the choice of the lower and upper limits of \({\mathcal {I}}\). Recall that \(s_i\) and \(t_i\) are the lower and upper bounds of the coefficient \(\beta _i\). To guarantee that the portfolio weight (i.e., the normalized \(\beta _i\)), denoted by \({\tilde{\beta }}_i\), of stock i satisfies \(0 \le {\tilde{\beta }}_i \le t_i \le 1\), we need \(s_i\) and \(t_i\) to satisfy

$$\begin{aligned} t_i + \sum _{j\in \{1,\ldots ,p\}\backslash \{i\}} s_j \ge 1, \end{aligned}$$

because it yields

$$\begin{aligned} {\tilde{\beta }}_i = \frac{\beta _i}{\sum _{j=1}^{p} \beta _j} \le \frac{t_i}{t_i+\sum _{j\in \{1,\ldots ,p\}\backslash \{i\}} s_j}\le t_i. \end{aligned}$$

In the special case \(s_i = s_0\) and \(t_i = t_0\) for all i, we choose the lower and upper bounds such that

$$\begin{aligned} \frac{1-t_0}{p-1} \le s_0 \le t_0 \le 1. \end{aligned}$$
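A quick numerical check of this condition is sketched below, using the equal-bound special case with the pair [0.0082, 0.6] from this section and p = 50 selected stocks (an illustrative combination).

```python
import numpy as np

p, s0, t0 = 50, 0.0082, 0.6                 # equal bounds s_i = s0, t_i = t0
assert (1 - t0) / (p - 1) <= s0 <= t0 <= 1  # the condition on (s0, t0) derived above

rng = np.random.default_rng(6)
beta = rng.uniform(s0, t0, size=p)          # any feasible coefficient vector in [s0, t0]^p
weights = beta / beta.sum()                 # normalized portfolio weights
print(weights.max() <= t0)                  # True: every weight stays below its cap t0
```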

We use 5-year (from February 19, 2016 to February 18, 2020) historical daily prices (1259 data points) of the S&P 500 as our benchmark index and those of its constituent equities. Because the list of S&P 500 constituents is updated regularly by S&P Dow Jones Indices LLC, we only include the daily prices of the 377 stocks that did not change during the period of interest. In the linear model (1.2), Y is the vector of daily returns of the S&P 500 index and the columns of X are the daily returns of these stocks. To follow a buy-and-hold investment strategy, we split the data into training, validation, and testing sets. The training and validation sets consist of the first 252 data points (12 months), 20% of which are in the validation set. The remaining 1006 data points form the testing set. In addition, we construct a long-only portfolio by ensuring the lower bound s assumes only nonnegative values.

In the following, we outline our procedure for the index tracking problem. First, we target selecting N individual stocks to construct the tracking portfolio. The number N is within the normal range of the number of stocks held to reduce risk exposure and avoid unnecessary transaction costs. In other words, we constrain the number of nonzero elements in \({\widehat{\beta }}\) to N. Thus we use the bisection search (Wu and Yang 2014) in Algorithm 2 to determine the optimal \(\lambda _n^{(1)}\) that produces the right number of nonzero coefficients, given \(N=30,~50,~70,~90\), respectively, \(\lambda _n^{(2)}=0\), \(\textrm{w}_n\) with equal elements, and \({\mathcal {I}}=[0,+\infty )^p\). Hence we obtain the N stocks selected by the model for each choice of N. This first process proceeds on the training and validation sets.
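Since Algorithm 2 is only referenced here, the following generic sketch illustrates the bisection idea on \(\lambda _n^{(1)}\): larger penalties produce fewer nonzero coefficients, so the interval is halved until the target count N is reached. The function fit_argen is a hypothetical placeholder for solving ARGEN at a given \(\lambda _n^{(1)}\) and returning the coefficient vector; consult Algorithm 2 for the exact procedure.

```python
import numpy as np

def bisection_for_sparsity(fit_argen, target_n, lam_lo=1e-8, lam_hi=1.0,
                           max_iter=50, tol=1e-12):
    """Bisection on lambda^(1) so that the fitted ARGEN has target_n nonzero coefficients.
    fit_argen(lam) is a placeholder returning the coefficient vector for penalty lam."""
    lam = 0.5 * (lam_lo + lam_hi)
    for _ in range(max_iter):
        lam = 0.5 * (lam_lo + lam_hi)
        n_nonzero = np.count_nonzero(fit_argen(lam))
        if n_nonzero == target_n or lam_hi - lam_lo < tol:
            break
        if n_nonzero > target_n:
            lam_lo = lam            # too many active stocks: increase the penalty
        else:
            lam_hi = lam            # too few active stocks: decrease the penalty
    return lam
```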

Table 4 ARGEN vs ARLS tracking portfolio performance in the testing period
[Algorithm 2: bisection search for \(\lambda _n^{(1)}\) in pseudocode]

Next, we consider \({\mathcal {I}} = [0.0082, 0.6]\) and \({\mathcal {I}} = [0.0041, 0.8]\), respectively, and apply ARLS and ARGEN to the corresponding data sets of the selected 30, 50, 70, and 90 stocks to evaluate the ARGEN algorithm. The ARLS is viewed as the baseline. For ARGEN, we search \(\lambda _n^{(1)}\) randomly in the range \((10^{-8}, 5\times 10^{-2})\), a smaller search grid than that in Sect. 4, since a larger range results in more than 50 vanishing coefficients. The \(\lambda _n^{(2)}\) is randomly searched in the range \((10^{-8}, 10^2)\). We take \(w _{{{\text {up}}}} =1\) and \(d _{{{\text {up}}}} =1\). The hyper-parameter tuning process is conducted in the Optuna hyper-parameter optimization framework (Akiba et al 2019); we select the parameter set that has the lowest validation score, measured by MSE, compared with that of ARLS, and then apply it to the testing data set to evaluate and compare the out-of-sample performance of the ARGEN and ARLS portfolios.

6.2 Experimental results

We follow the procedure elaborated in the previous subsection to construct multiple ARGEN and ARLS portfolios with different numbers of stocks and different numbers of hyper-parameter tuning trials. The portfolios' testing performance is summarized in Table 4.

In particular, Table 4 illustrates the performance of ARGEN and ARLS portfolios constructed with different coefficient bounds and numbers of stocks. Across the different portfolio construction configurations, the ARGEN portfolios tend to have lower tracking errors and lower annualized return volatility than the ARLS portfolios, while satisfying the coefficient bound conditions. Although the ARGEN portfolios tend to have lower cumulative returns, these are comparable with the S&P 500 index cumulative return during the same period, except for the 30-stock ARGEN portfolios. Portfolios with the wider constraint range ([0.0041, 0.8]) track better than portfolios with the narrower constraint range ([0.0082, 0.6]), which is expected behavior. Another expected behavior we observe from the results is that the tracking errors decrease as we increase the number of stocks in the portfolios.

7 Conclusion and future perspectives

In this paper, we propose ARGEN for variable selection and regularization. ARGEN linearly combines generalized lasso and ridge penalties, namely \(\textrm{w}_n'|\beta |\) and \(\beta '\Sigma _n\beta \), and it allows arbitrary lower and upper constraints on the coefficients. Many well-known methods, including the (nonnegative) lasso, ridge, and (nonnegative) elastic net, are particular cases of ARGEN. We show that ARGEN has variable selection and estimation consistency subject to some conditions. We propose an algorithm to solve the ARGEN problem by applying multiplicative updates to a quadratic programming problem with a rectangle range and weighted \(l_1\) regularizer (MU-QP-RR-W-\(l_1\)). The algorithm is implemented as a Python library distributed through the PyPI server. The simulations and the index-tracking application present evidence that ARGEN usually outperforms the other methods discussed in the paper, thanks to its flexibility and adaptability, for problems with a small to moderate number of predictors. In problems with a huge number of predictors, although ARGEN should perform best in theory, the computational cost might be high; in this situation ARLEN and ARREN might be better choices. We refer readers to the GitHub repository for full access to the code for the simulation and application parts.

Although in this paper the ARGEN penalty is added to linear models, applying it to other loss functions to improve their performance is a possible direction of future research. Motivated by the index tracking problem, a constraint that guarantees the sum of the weights equals one may be considered as another direction. Asymptotic behavior in law of the ARGEN estimator remains unknown. Most importantly, a more efficient tuning process is needed to apply ARGEN to more complicated problems. All of the above problems are open for future study.