1 Introduction

Variable selection and regularization are essential tools in high-dimensional data analysis. Many existing strategies achieve both high prediction accuracy and interpretability. For instance, the lasso (Tibshirani 1996) was popularized thanks to its computational efficiency (Efron et al 2004), variable selection consistency (Zhao and Yu 2006), and estimation consistency (Negahban et al 2012). We refer to Zou (2006); Bickel et al (2009); Efron et al (2007); Lounici (2008); Yuan and Lin (2006); Zhao et al (2009); Wang et al (2007) for more in-depth discussions of the lasso. Later, the elastic net (Zou and Hastie 2005) was proposed by linearly combining the lasso and ridge regression-like penalties. As a more flexible model, the elastic net has been shown to outperform the lasso for high-dimensional data.

Recall that a regularized linear model has the following general form:

$$\begin{aligned} Y=\beta _0+\beta _1X_1+\ldots +\beta _pX_p+g(\beta _1,\ldots ,\beta _p)+\epsilon , \end{aligned}$$
(1.1)

where \(Y\in {\mathbb {R}}\) is the response variable, \(X_1,\ldots ,\) \(X_p\in {\mathbb {R}}\) are p predictors, g is some penalty function and \(\epsilon \) is the residual. In the setting of ordinary linear models, one places no range constraint on the coefficients, i.e., it is assumed that \(\beta _1,\ldots ,\beta _p\in {\mathbb {R}}\). In practice, however, \(\beta _1,\ldots ,\beta _p\) are often restricted to a prior range of values. For example, in portfolio management the coefficients are interpreted as allocations of assets in a fund, which take values in [0, 1]; in academic grading the coefficients are interpreted as weights of a list of courses, which also lie in [0, 1]. Such constraints may influence the behavior of the penalty g, as well as the estimated values of \(\beta _1,\ldots ,\beta _p\). To adapt to this real world constraint, Wu et al (2014) and Wu and Yang (2014) introduced the nonnegative lasso and the nonnegative elastic net, respectively, which have been successfully applied to the real world index tracking problem without short sales (this corresponds to a nonnegativity constraint on the weights). Many more such range constraints on the regression coefficients arise in real world problems, so more flexible models are needed to address problems that require arbitrary-range constraints on the regression coefficients. With this motivation, our first goal is to suggest a novel method that handles an arbitrary rectangle-range constraint on the regression coefficients. Also recall that Zou (2006), Mouret et al (2013) and Sokolov et al (2016) introduced methods that generalize the lasso and elastic net, respectively, by placing adaptive weights on the predictors and penalties. We adopt this setting in our model as well, i.e., the coefficients in the penalties are weighted. In summary, our paper proposes a method that allows arbitrary rectangle-range constraints on the regression coefficients of a generalized elastic net and provides rigorous theoretical results to support the consistency of the model and its solution. The proposed arbitrary rectangle-range generalized elastic net method (abbreviated to ARGEN) is a regularization method that deals with high-dimensional problems and generalizes the nonnegative elastic net. The motivation for using ARGEN is given below:

1. Like the elastic net (Zou 2006), ARGEN reduces the estimation variance and removes unimportant factors. Being more general than the elastic net, ARGEN often yields better prediction results. This fact is shown in our simulation and empirical studies, see Sects. 5 and 6.

2. Compared with the nonnegative elastic net, ARGEN allows arbitrary lower and upper bound constraints on the coefficients. As discussed above, restricting the regression coefficients to a specific rectangle range is often required in real world applications, so this setting gives ARGEN more adaptability to real world constraints. In Sect. 6 we present the S&P 500 index tracking problem as one example: instead of considering nonnegative allocation parameters (i.e., each regression coefficient \(\beta _j\in [0,+\infty )\)), our hyper-parameter tuning result shows that considering \(\beta _j\in [0.0082,0.6]\) or [0.0041, 0.8] yields lower out-of-sample error. This result reveals that each stock considered to track the S&P 500 index should have a weight of no more than \(80\%\).

3. ARGEN also accounts for the effects of individual and interactive penalty weights. These weights measure the factor importance of the p features \(X_1,\ldots ,X_p\) in the regression. The traditional elastic net assumes the p features have equal factor importance, which is often not the case in the real world. Not only does this setting of weights better fit real world situations, it also promises better performance due to its larger parameter search space.

To solve ARGEN, we introduce a novel algorithm, multiplicative updates for solving quadratic programming with rectangle range and weighted \(l_1\) regularizer (abbreviated to MU-QP-RR-W-\(l_1\)). We summarize the main contributions of our paper as follows:

1. We introduce ARGEN, a method for variable selection and regularization problems that require the regression coefficients to lie in some rectangle in \({\mathbb {R}}^p\) (see (2.1)). As a flexible approach, ARGEN includes the nonnegative elastic net and a number of new extensions of the lasso, ridge, and elastic net models.

2. Subject to some conditions on the inputs, we obtain the variable selection consistency, the estimation consistency, and the limiting distribution of the ARGEN estimator. We refer to Theorems 2.1, 2.2, and 2.4.

3. A novel algorithm, MU-QP-RR-W-\(l_1\), is introduced to solve the general quadratic programming problem \(\min _{v\in [0, l]} F(v) = v'Av/2 + b' v+d'|v-v^0|\), following the notations in (3.3). The algorithm is implemented as a Python library and publicly shared through the PyPI server.

4. We show a successful real world application of the ARGEN approach to the S&P 500 index tracking problem. Readers can get full access to the Python script in the GitHub repository.

Throughout the paper, we denote the transpose of a matrix by \((\cdot )'\), the i-th column of a matrix by \((\cdot )_i\), the entry in the i-th row and j-th column of a matrix by \((\cdot )_{ij}\), the diagonal matrix with diagonal vector \({{\textbf {x}}}\) by \({{\,\textrm{diag}\,}}({{\textbf {x}}})\), and the maximum (resp. minimum) element of a vector by \(\max (\cdot )\) (resp. \(\min (\cdot )\)). Besides, an \(n\times n\) matrix X can be expressed as \( X=(X_{ij})_{1\le i,j\le n}\). The elementwise absolute value of a vector or matrix is denoted by \(|\cdot |\): for \({{\textbf {x}}}=(x_1, \ldots , x_p)\), \(|\mathrm {{\textbf {x}}}|:=(|x_1|,\ldots ,|x_p|)\), and for an \(n\times n\) matrix X, \(|X|:=(|X_{ij}|)_{1\le i,j\le n}\). Moreover, for two equal-length vectors \({{{\textbf {x}}}} =(x_1,\ldots ,x_p)\) and \({{{\textbf {y}}}}=(y_1,\ldots ,y_p)\), we denote the p-dimensional interval by \([{{{\textbf {x}}}}, {{{\textbf {y}}}}]:=[x_1,y_1]\times \ldots \times [x_p,y_p]\).

In the sequel, we consider the linear regression model

$$\begin{aligned} Y = X\beta ^* + \epsilon , \end{aligned}$$
(1.2)

where X is a deterministic \( n \times p \) design matrix, \(Y = (y_1~\ldots ~y_n)'\) is an \(n\times 1\) response vector, and \(\epsilon = ( \epsilon _1~\ldots ~\epsilon _n)' \) is Gaussian noise with marginal variance \(\sigma ^2 \). Without loss of generality, we assume all p predictors are real-valued and centered, so the intercept can be ignored. \(\beta ^* \in {\mathbb {R}}^{p}\) is the vector of regression coefficients.
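For concreteness, the model (1.2) can be simulated as in the following minimal Python sketch; the sample size, dimension, sparsity level, and noise scale below are arbitrary illustrative choices rather than values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 100, 20, 5                       # illustrative sample size, dimension, sparsity

X = rng.standard_normal((n, p))            # design matrix, treated as deterministic once drawn
X -= X.mean(axis=0)                        # center the predictors so the intercept can be ignored

beta_star = np.zeros(p)
beta_star[:q] = rng.uniform(0.5, 2.0, q)   # q-sparse true coefficient vector

sigma = 1.0
eps = sigma * rng.standard_normal(n)       # Gaussian noise with variance sigma^2
Y = X @ beta_star + eps                    # response generated from model (1.2)
```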

The rest of the paper is organized as follows. In Sect. 2, we discuss the analytical features of ARGEN and its variable selection consistency (Theorem 2.1), estimation consistency (Theorem 2.2), and the estimator's limiting distribution (Theorem 2.4). In Sect. 3, we propose an efficient algorithm, MU-QP-RR-W-\(l_1\), for solving ARGEN. Approaches we use to speed up hyper-parameter optimization are discussed in Sect. 4. Simulations that compare the performances of various methods are conducted in Sect. 5. Section 6 shows an application of ARGEN to the real world S&P 500 index tracking problem. Section 7 is devoted to the conclusion and a discussion of future research. Technical proofs are provided in the Appendix.

2 The ARGEN

2.1 Definition

In practice it is often natural to assume sparsity in high-dimensional problems. Therefore, in the sequel we assume that the linear model (1.2) is q-sparse, i.e., \(\beta ^*\) has at most \( q ~(q \ll p) \) nonzero elements. We intend to cope with the case when there is a control on the range of the coefficients: let \( s =(s_1,\ldots ,s_p)\), \( t=(t_1,\ldots ,t_p)\) with \(s_i\in {\mathbb {R}}\cup \{-\infty \},~t_i\in {\mathbb {R}}\cup \{+\infty \}\) and \(s_i< t_i\) for all \(i=1,\ldots ,p\); the optimal coefficients lie in a p-dimensional rectangle \({\mathcal {I}}:=[s, t]\subset {\mathbb {R}}^p\). To capture the penalty weights of individual features, we introduce \(\textrm{w}_{n} = (\textrm{w}_{n,1}~ \ldots ~ \mathrm w_{n,p})'\) as the weights of the coefficients in the \(l_1\) penalty, satisfying \(\mathrm w_{n,i}\ge 0\) for \(i=1,\ldots ,p\). In addition to individual features, a positive semi-definite matrix \(\Sigma _n\) is introduced to represent the penalty weights for interactions between any two features. Consider the linear model (1.2) and let \( \beta =(\beta _1~\ldots ~\beta _p)'\) be a vector in \({\mathbb {R}}^p\). The ARGEN estimator of \(\beta \) is given by

$$\begin{aligned} {\widehat{\beta }}:={\widehat{\beta }}(\lambda _n^{(1)},\lambda _n^{(2)},\mathrm w_n,\Sigma _n)=\mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{\beta \in {\mathcal {I}}}\Big (\Vert Y-X\beta \Vert _2^2+\lambda _n^{(1)}\mathrm w_n'|\beta |+\lambda _n^{(2)}\beta '\Sigma _n\beta \Big ). \end{aligned}$$
(2.1)

Here \(\lambda _n^{(1)},\lambda _n^{(2)}\ge 0\) are the tuning parameters which control the importance of the \(l_1\) and \(l_2\) regularization terms, respectively.

The ARGEN (2.1) naturally extends the elastic net method: it reduces to the elastic net when \({\mathcal {I}}={\mathbb {R}}^p\), \(\mathrm w_n=(1~ \ldots ~ 1)'\), and \(\Sigma _n\) is the identity matrix. ARGEN thus also extends the lasso and ridge methods by further setting \(\lambda _n^{(2)}=0\) and \(\lambda _n^{(1)}=0\), respectively. In addition, ARGEN becomes the nonnegative elastic net if we replace \({\mathcal {I}}={\mathbb {R}}^p\) with \({\mathcal {I}}={\mathbb {R}}_+^p:=[0,+\infty )^p\) in the setting of the elastic net.

2.2 Variable selection consistency

We define the variable selection consistency for the ARGEN as follows. For \(i=1,\ldots ,p\), we decompose the interval \([s_i,t_i]\) with \(s_i<t_i\) into 7 disjoint sub-intervals:

$$\begin{aligned} {[}s_i,t_i]=\bigcup _{k=2}^6{\mathcal {G}}_i^{(k)}\bigcup \mathcal G_i^{(1-)}\bigcup {\mathcal {G}}_i^{(1+)}, \end{aligned}$$

where

$$\begin{aligned}{} & {} {\mathcal {G}}_i^{(1-)}=(s_i, t_i)\cap (-\infty ,0),\\{} & {} {\mathcal {G}}_i^{(1+)}=(s_i, t_i)\cap (0,+\infty ),\\{} & {} {\mathcal {G}}_i^{(2)}=\{s_i\}\backslash \{0\},~~~~~{\mathcal {G}}_i^{(3)}=\{t_i\}\backslash \{0\},\\{} & {} {\mathcal {G}}_i^{(4)}=\{s_i\}\cap \{0\},~~~~~{\mathcal {G}}_i^{(5)}=\{t_i\}\cap \{0\},\\{} & {} {\mathcal {G}}_i^{(6)}=(s_i,t_i)\cap \{0\}. \end{aligned}$$

In addition, we define \( {\mathcal {G}}_i^{(1)}=\mathcal G_i^{(1-)}\cup {\mathcal {G}}_i^{(1+)} \) for simplicity. Correspondingly, each coefficient in \(\beta ^*\) belongs to one of the 7 groups of values; i.e., for each \(i=1,\ldots ,p\), there is a unique \(k_i\in \{1-,1+,2,\ldots ,6\}\) such that \( \beta _i^*\in \mathcal G_i^{(k_i)}. \) Now for \(j\in \{1-,1+,2,\ldots ,6\}\), denote by

$$\begin{aligned} S_{(j)}=\left\{ i\in \{1,\ldots ,p\}:~\beta _i^*\in \mathcal G_{i}^{(j)}\right\} , \end{aligned}$$

the set of indexes i for which \(\beta _i^*\) belongs to the j-th group of values, and let \(\# S_{(j)}\) be the cardinality of the set. Correspondingly, we can define

$$\begin{aligned} \begin{aligned}&\widehat{S}_{(j)}(\lambda _n^{(1)},\lambda _n^{(2)},\mathrm w_n,\mathrm \Sigma _n)\\&\quad =\left\{ i\in \{1,\ldots ,p\}:{{\widehat{\beta }}}_i\in \mathcal G_{i}^{(j)}\right\} . \end{aligned} \end{aligned}$$
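The group membership just defined is purely combinatorial, so it can be transcribed directly into code; the following Python sketch (with hypothetical helper names) classifies each \(\beta _i^*\) into its group and collects the index sets \(S_{(j)}\).

```python
import numpy as np

def coefficient_group(beta_i, s_i, t_i):
    """Label of beta_i within [s_i, t_i] according to the partition
    G^(1-), G^(1+), G^(2), ..., G^(6) of Section 2.2."""
    if s_i < beta_i < t_i:                       # interior of the interval
        if beta_i < 0:
            return "1-"
        if beta_i > 0:
            return "1+"
        return "6"                               # interior zero
    if beta_i == s_i:
        return "4" if s_i == 0 else "2"          # lower boundary: zero vs nonzero
    if beta_i == t_i:
        return "5" if t_i == 0 else "3"          # upper boundary: zero vs nonzero
    raise ValueError("beta_i must lie in [s_i, t_i]")

def index_sets(beta_star, s, t):
    """S_(j): indexes i for which beta*_i belongs to the j-th group."""
    S = {}
    for i, (b, si, ti) in enumerate(zip(beta_star, s, t)):
        S.setdefault(coefficient_group(b, si, ti), []).append(i)
    return S

# toy illustration with three coefficients
print(index_sets(np.array([0.0, 0.0, -2.0]),
                 np.array([-1.0, 0.0, -np.inf]),
                 np.array([1.0, 1.0, np.inf])))   # {'6': [0], '4': [1], '1-': [2]}
```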

Definition 1

ARGEN (2.1) is said to have variable selection consistency if there exist \(\lambda _n^{(1)}\), \(\lambda _n^{(2)}\), \(\mathrm w_n\), and \(\Sigma _n\) such that

$$\begin{aligned} \begin{aligned}&\mathbb P\left( \widehat{S}_{(j)}\big (\lambda _n^{(1)},\lambda _n^{(2)},\mathrm w_n,\mathrm \Sigma _n\big ) = S_{(j)}\Big |S_{(j)}\ne \emptyset \right) \\&\xrightarrow [n\rightarrow \infty ]{}1~\text{ for } j\in \{1-,1+,2,\ldots ,6\}. \end{aligned} \end{aligned}$$
(2.2)

(2.2) implies that, for n large enough, \({{\widehat{\beta }}}_i\) equals \(\beta _i^*\) with high probability whenever \(\beta ^*_i\in \{0,s_i,t_i\}\). This property includes the variable selection consistency of the nonnegative elastic net and of the elastic net as particular cases. Therefore our definition of variable selection consistency for ARGEN is broader than that for the "free-range" or nonnegative elastic net (Zhao and Yu 2006; Wu et al 2014; Wu and Yang 2014).

Let \(X_{(1)}=(X_{(1-)}, X_{(1+)})\) and, for \(j\in \{1-,1+,2,\ldots ,6\}\), let \(X_{(j)} = \left( X_i\right) _{i\in S_{(j)}}\) be the observed predictor values corresponding to the jth group of indexes. Similarly, let \(\beta ^*_{(j)} = \left( \beta ^*_i\right) _{i\in S_{(j)}}\), \(s_{(j)} = \left( s_i\right) _{i\in S_{(j)}}\), \(t_{(j)} = \left( t_i\right) _{i\in S_{(j)}}\), \(\mathrm w_{n, (j)} = \left( \mathrm w_{n, i}\right) _{i\in S_{(j)}}\), and \(\Sigma _{n, (j_1j_2)} = \left( \Sigma _{n, i_1,i_2}\right) _{i_1\in S_{(j_1)}, i_2\in S_{(j_2)}}\). Moreover, let C be

$$\begin{aligned} \begin{aligned} C&:=\begin{pmatrix} C_{ij} \end{pmatrix}_{1\le i,j\le 6}\\ {}&=\frac{1}{n}X'X=\begin{pmatrix} \frac{1}{n}X'_{(i)} X_{(j)} \end{pmatrix}_{1\le i,j\le 6} \end{aligned} \end{aligned}$$
(2.3)

and \(\Lambda _{\min }(C_{11})\) be the minimal eigenvalue of \(C_{11}\). Denote by

(2.4)

where for a vector \(v=(v_1,\ldots ,v_n)\), \({{\,\textrm{sign}\,}}(v):=({{\,\textrm{sign}\,}}(v_1),\ldots ,{{\,\textrm{sign}\,}}(v_n))\) denotes the vector of signs of the elements in v, and \({{\,\textrm{diag}\,}}(v)\) denotes the diagonal matrix with diagonal elements v. To show ARGEN admits the variable selection consistency (2.2), we assume that the following conditions hold:

(2.5)
(2.6)
(2.7)
(2.8)

and for \(j\in \{1-,1+\}\),

(2.9)
(2.10)

Besides, we assume that the arbitrary rectangle-range elastic irrepresentable condition (AREIC), defined below, is satisfied.

Definition 2

The AREIC is given as: For \(j=2,\ldots ,6\) satisfying \(S_{(j)}\ne \emptyset \), there exists a positive constant vector \(\eta _{(j)}\), such that

(2.11)

where \( (D_{(2)}~D_{(3)}~D_{(4)}~D_{(5)})=\big ({{\,\textrm{diag}\,}}({{\,\textrm{sign}\,}}( s_{(2)}))\) \(~{{\,\textrm{diag}\,}}({{\,\textrm{sign}\,}}( t_{(3)}))~1~-1\big ). \)

Let us roughly explain how the technical condition AREIC plays its role in the derivation of the variable selection consistency of ARGEN. First by Lemma A.1 in the appendix, for \(j\in \{1-,1+,2,\ldots ,6\}\),

$$\begin{aligned} \begin{aligned}&\mathbb P\Big (\widehat{S}_{(j)}\big (\lambda _n^{(1)},\lambda _n^{(2)},\mathrm w_n,\mathrm \Sigma _n\big ) = S_{(j)}\Big |S_{(j)}\ne \emptyset \Big )\\&\ge {\mathbb {P}}\left( {\mathcal {E}}(V_{(j)})\right) , \end{aligned} \end{aligned}$$

where the events \({\mathcal {E}}(V_{(j)})\) are given in (A2). Next, the AREIC (the left-hand side of (2.11) constitutes the major part of \({\mathbb {P}}\left( {\mathcal {E}}(V_{(j)})\right) \)) algebraically leads to \({\mathbb {P}}\left( {\mathcal {E}}(V_{(j)})\right) \xrightarrow [n\rightarrow +\infty ]{}1\), which implies

$$\begin{aligned} \begin{aligned}&\mathbb P\Big (\widehat{S}_{(j)}\big (\lambda _n^{(1)},\lambda _n^{(2)},\mathrm w_n,\mathrm \Sigma _n\big ) = S_{(j)}\Big |S_{(j)}\ne \emptyset \Big )\\&\xrightarrow [n\rightarrow +\infty ]{}1, \end{aligned} \end{aligned}$$

i.e., the variable selection consistency is established.

When \(s=0\), \(t=+\infty \), \(\mathrm w_n=1\) and \(\Sigma _{n}\) is the identity matrix, the AREIC becomes the nonnegative elastic irrepresentable condition (NEIC) as follows:

$$\begin{aligned} C_{61}\Big (C_{11}+\frac{\lambda _n^{(2)}}{n}\Big )^{-1}\Big (\textbf{1}+\frac{2\lambda _n^{(2)}}{\lambda _n^{(1)}}\beta _{(1)}^*\Big )\le \textbf{1 }- \eta _{(6)}, \end{aligned}$$
(2.12)

which yielded the variable selection consistency of nonnegative elastic net (Zhao et al 2014). If, in addition to (2.12), \(\lambda _n^{(2)}=0\), the NEIC then becomes the nonnegative irrepresentable condition (NIC):

$$\begin{aligned} C_{61}C_{11}^{-1}{\textbf{1}}\le {\textbf{1}} - \eta _{(6)}, \end{aligned}$$

which was a sufficient condition to obtain the variable selection consistency of the nonnegative lasso (Wu et al 2014). Note that, NIC is a nonnegative version of the irrepresentable condition (IC) for the variable selection consistency of the lasso (Zhao and Yu 2006):

$$\begin{aligned} |C_{61}C_{11}^{-1}{{\,\textrm{sign}\,}}(\beta _{(1)}^*)|\le {\textbf{1}} - \eta _{(6)}. \end{aligned}$$
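As a numerical illustration, the IC displayed above can be checked directly from data; the following sketch (with a toy design and an illustrative \(\eta \)) does so, using the support/zero-coefficient block partition of C.

```python
import numpy as np

def irrepresentable_condition(X, beta_star, eta=0.1):
    """Check the lasso irrepresentable condition
    |C_{S^c S} C_{SS}^{-1} sign(beta*_S)| <= 1 - eta  (elementwise),
    where S is the support of beta* and C = X'X / n."""
    n = X.shape[0]
    C = X.T @ X / n
    S = np.flatnonzero(beta_star != 0)           # support indexes
    Sc = np.flatnonzero(beta_star == 0)          # zero-coefficient indexes
    lhs = np.abs(C[np.ix_(Sc, S)] @ np.linalg.solve(C[np.ix_(S, S)],
                                                    np.sign(beta_star[S])))
    return bool(np.all(lhs <= 1 - eta)), lhs

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 8))
beta_star = np.array([3.0, 1.5, 0, 0, 2.0, 0, 0, 0])
holds, lhs = irrepresentable_condition(X, beta_star)
print(holds, lhs.round(3))
```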

IC is a sufficient and necessary condition for the variable selection consistency of the lasso (Zhao and Yu 2006), whereas NIC is only a sufficient condition for the nonnegative lasso; in practice, however, NIC is easier to satisfy than IC since it does not require the absolute value on the left-hand side of the inequality. The AREIC is thus a natural generalization of the previous sufficient conditions NEIC and NIC for variable selection consistency. Below we state the first main result of the paper. Its proof is given in Appendix A.

Theorem 2.1

Under AREIC and the conditions (2.5) - (2.10), the ARGEN possesses the variable selection consistency property (2.2).

2.3 Estimation consistency

Recall that an estimation method with target parameter \(\beta ^*\) has the property of estimation consistency if

$$\begin{aligned} \Vert {\widehat{\beta }} - \beta ^*\Vert _2 \xrightarrow [n\rightarrow \infty ]{{\mathbb {P}}} 0, \end{aligned}$$

where \(\Vert \cdot \Vert _2\) denotes the Euclidean distance and \(\xrightarrow [n\rightarrow \infty ]{{\mathbb {P}}}\) is the convergence in probability. Besides the variable selection consistency, ARGEN admits estimation consistency, subject to the following conditions.

(i) \(\beta ^*\in {\mathcal {I}}\). Let \(p=p_n\), \(q=q_n\) be non-decreasing as n increases.

(ii) \(\mathrm w_n=(\mathrm w_{n,1},\ldots , \mathrm w_{n,p_n})\) with \(\mathrm w_{n, 1},\ldots ,\mathrm w_{n, p_n}>0\) and \(\Sigma _n\) are given.

(iii) Let \(X_j\) be the jth column of X, which satisfies

    $$\begin{aligned} \max _{1\le j\le p_n}\frac{2(X_j'X_j+\lambda _n^{(2)}\Sigma _{n,jj})}{(1+\lambda _n^{(2)})\textrm{w}_{n,j}^2} \le 1,~\text{ for } \text{ all } ~ n\ge 1. \end{aligned}$$
(iv) X satisfies the restricted eigenvalue (RE) condition, i.e., there exists a constant \(\kappa >0\) such that for all \(n\ge 1\) and all \(\beta \in {\mathcal {I}}\) satisfying

    $$\begin{aligned} \sum _{j=4}^6\mathrm w_{n,(j)}'|\beta _{(j)}|\le 3\sum _{j=1}^3\mathrm w_{n,(j)}'|\beta _{(j)}|, \end{aligned}$$

    we have

    $$\begin{aligned} 2(\Vert X\beta \Vert _2^2+\lambda _n^{(2)}\beta '\Sigma _n \beta ) \ge \kappa (1+\lambda _n^{(2)}) \Vert {{\,\textrm{diag}\,}}(\mathrm w_{n})\beta \Vert _2^2. \end{aligned}$$
(v) \(\lambda _n^{(1)}\), \(\lambda _n^{(2)}\), \(\mathrm w_{n}\), \(p_n\) and \(q_n\) satisfy

    $$\begin{aligned} \frac{q_n(\lambda _n^{(1)})^2}{(1+\lambda _n^{(2)})^2}\xrightarrow [n\rightarrow \infty ]{}0 \end{aligned}$$

    and

    $$\begin{aligned} p_n\exp \bigg (-\frac{n}{8\sigma ^2}\frac{\big (\lambda _n^{(1)}\big )^2}{1+\lambda _n^{(2)}}\bigg )\xrightarrow [n\rightarrow \infty ]{}0, \end{aligned}$$

    where \(\sigma >0\) is the residual standard deviation of the ARGEN.

Below we state the estimation consistency of the ARGEN.

Theorem 2.2

Consider a \(q_n\)-sparse instance of the ARGEN (2.1). Let X satisfy conditions (i)-(iv) and let the regularization parameters satisfy \(\lambda _{n}^{(1)}>0, \lambda _{n}^{(2)}\ge 0\). Then the ARGEN solution \({\widehat{\beta }}:={\widehat{\beta }}(\lambda _n^{(1)},\lambda _n^{(2)},\mathrm w_n, \Sigma _n)\) satisfies:

$$\begin{aligned}{} & {} \begin{aligned}&{\mathbb {P}}\bigg (\Vert {{\,\textrm{diag}\,}}(\mathrm w_{n})({\widehat{\beta }} - \beta ^*)\Vert _2^2 >\frac{9q_n(\lambda _n^{(1)})^2}{\kappa ^2(1+\lambda _n^{(2)})^2}\bigg )\\&\quad \le 2p_n\exp \bigg (-\frac{n}{8\sigma ^2}\frac{(\lambda _n^{(1)})^2}{1+\lambda _n^{(2)}}\bigg ), \end{aligned} \end{aligned}$$
(2.13)
$$\begin{aligned}{} & {} \begin{aligned}&{\mathbb {P}}\bigg (\Vert {{\,\textrm{diag}\,}}(\mathrm w_{n})({\widehat{\beta }} - \beta ^*) \Vert _1 > \frac{12q_n\lambda _n^{(1)}}{\kappa (1+\lambda _n^{(2)})}\bigg )\\&\quad \le 2p_n\exp \bigg (-\frac{n}{8\sigma ^2}\frac{(\lambda _n^{(1)})^2}{1+\lambda _n^{(2)}}\bigg ), \end{aligned} \end{aligned}$$
(2.14)

where \(\sigma >0\) denotes the residual standard deviation of the ARGEN. In addition if (v) holds, we have

$$\begin{aligned} \Vert {\widehat{\beta }} - \beta ^*\Vert _2\xrightarrow [n\rightarrow \infty ]{{\mathbb {P}}}0. \end{aligned}$$
(2.15)

Proof

The main idea of the proof is to transform the ARGEN problem into a rectangle-range lasso problem. Let

$$\begin{aligned} \begin{aligned}{}&{} \widetilde{X}=\frac{\sqrt{2n}}{\sqrt{1+\lambda _n^{(2)}}}\begin{pmatrix} X{{\,\text {diag}\,}}(\mathrm w_{n})^{-1}\\ \sqrt{\lambda _n^{(2)}} \Sigma _n^{1/2}{{\,\text {diag}\,}}(\mathrm w_{n})^{-1} \end{pmatrix}_{(n+p)\times p },\\{}&{} \widetilde{Y} = \begin{pmatrix} \sqrt{2n}Y\\ 0 \end{pmatrix}_{(n+p)\times 1 }, \\{}&{} {\widetilde{\beta }}^* = \sqrt{1+\lambda _n^{(2)}}{{\,\text {diag}\,}}(\mathrm w_{n})\beta ^*, ~~~~\lambda _n = \frac{\lambda _n^{(1)}}{\sqrt{1+\lambda _n^{(2)}}},\\{}&{} \widetilde{{\mathcal {I}}} =\prod _{i=1}^{p_n}\left[ \sqrt{1+\lambda _n^{(2)}} \mathrm w_{n,i}s_i,\sqrt{1+\lambda _n^{(2)}} \mathrm w_{n,i}t_i\right] . \end{aligned} \end{aligned}$$

Then the ARGEN (2.1) can be written as the rectangle-range lasso:

$$\begin{aligned} \begin{aligned} \widehat{{\widetilde{\beta }}}(\lambda _n)&=\mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{\beta \in \widetilde{{\mathcal {I}}}}\left( \frac{1}{2n}\big \Vert \widetilde{Y} - \widetilde{X}\beta \big \Vert _2^2+\lambda _n|\beta |\right) \\&=\sqrt{1+\lambda _n^{(2)}}{{\,\textrm{diag}\,}}(\mathrm w_{n}){\widehat{\beta }}. \end{aligned} \end{aligned}$$
(2.16)

In view of conditions (i)-(iv), all requirements of Corollary 2 in Negahban et al (2012) are satisfied. Therefore, applying Corollary 2 in Negahban et al (2012) to the lasso (2.16) yields the results. We point out that: (1) based on its proof, Corollary 2 in Negahban et al (2012) also works for the rectangle-range lasso; (2) there is a typo in the statement of Corollary 2 in Negahban et al (2012): the inequalities (34) there should be corrected to

$$\begin{aligned} \begin{aligned}&\Vert {\widehat{\theta }}_{\lambda _n} - \theta ^* \Vert _2^2\le \frac{144\sigma ^2}{\kappa _{{\mathcal {L}}}^2}\frac{s\log p}{n}\\&\Vert {\widehat{\theta }}_{\lambda _n} - \theta ^* \Vert _1\le \frac{48\sigma }{\kappa _{{\mathcal {L}}}}s\sqrt{\frac{\log p}{n}}. \end{aligned} \end{aligned}$$

\(\square \)
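The augmentation used in the proof can be made concrete; the sketch below builds \(\widetilde{X}\), \(\widetilde{Y}\), \(\lambda _n\), and the transformed rectangle from the quantities defined above. The symmetric square root \(\Sigma _n^{1/2}\) is computed here via an eigendecomposition, which is one valid choice.

```python
import numpy as np

def argen_to_rectangle_lasso(X, Y, w, Sigma, lam1, lam2, s, t):
    """Build the rectangle-range lasso (2.16) equivalent to ARGEN (2.1).
    Returns (X_tilde, Y_tilde, lam_n, lower, upper)."""
    n, p = X.shape
    W_inv = np.diag(1.0 / w)                           # diag(w_n)^{-1}, assumes w > 0
    vals, vecs = np.linalg.eigh(Sigma)                 # symmetric square root of Sigma_n
    Sigma_half = vecs @ np.diag(np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T
    c = np.sqrt(2.0 * n) / np.sqrt(1.0 + lam2)
    X_tilde = c * np.vstack([X @ W_inv,
                             np.sqrt(lam2) * Sigma_half @ W_inv])    # (n + p) x p
    Y_tilde = np.concatenate([np.sqrt(2.0 * n) * Y, np.zeros(p)])    # (n + p,)
    lam_n = lam1 / np.sqrt(1.0 + lam2)
    scale = np.sqrt(1.0 + lam2) * w                    # maps [s, t] to the rectangle I-tilde
    return X_tilde, Y_tilde, lam_n, scale * s, scale * t
```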

If we assume \(\mathrm w_n\nrightarrow 0\) as \(n\rightarrow \infty \) in Theorem 2.2, we easily obtain the estimation consistency conditions for the nonnegative lasso (see Proposition 1 in Wu et al (2014)) and the nonnegative elastic net. Note that the estimation consistency of the nonnegative elastic net (Wu and Yang 2014) has not yet been derived, hence we state it below as a corollary of Theorem 2.2. To obtain the corollary it suffices to observe that \(\sum _{i=1}^3\mathrm w_{n,(i)}'\mathrm w_{n,(i)}=q_n\) when \(\mathrm w_{n,j}=1\) for all \(n\ge 1\) and \(j=1,\ldots ,p_n\).

Corollary 2.3

Consider a \(q_n\)-sparse nonnegative elastic net model. Assume:

(i) \(\beta ^*\ge 0\). \(p_n,q_n\) are non-decreasing as n increases.

(ii) Let \(X_j\) be the jth column of X, which satisfies

    $$\begin{aligned} \frac{2(X_j'X_j+\lambda _n^{(2)})}{1+\lambda _n^{(2)}} \le 1,~\text{ for } \text{ all } ~ j = 1,\ldots ,p. \end{aligned}$$
(iii) There exists a constant \(\kappa >0\) such that

    $$\begin{aligned} 2(\Vert X\beta \Vert _2^2+\lambda _n^{(2)}\Vert \beta \Vert _2^2 )\ge \kappa (1+\lambda _n^{(2)}) \Vert \beta \Vert _2^2 \end{aligned}$$

    for all \(\beta \ge 0\) satisfying

    $$\begin{aligned} \sum _{j\in \{1,\ldots ,p_n\}:~\beta _j^*=0}|\beta _{j}|\le 3\sum _{j\in \{1,\ldots ,p_n\}:~\beta _j^*\ne 0}|\beta _{j}|. \end{aligned}$$

Let \(\lambda _{n}^{(1)}>0, \lambda _{n}^{(2)}\ge 0\). Then the nonnegative elastic net solution \({\hat{\beta }}\) satisfies the following inequalities:

$$\begin{aligned}&{\mathbb {P}}\bigg (\Vert {\hat{\beta }} - \beta ^*\Vert _2^2 >\frac{9q_n(\lambda _n^{(1)})^2}{\kappa ^2(1+\lambda _n^{(2)})^2}\bigg ) \le 2p_n\exp \bigg (-\frac{n}{8\sigma ^2}\frac{(\lambda _n^{(1)})^2}{1+\lambda _n^{(2)}}\bigg ),\\&{\mathbb {P}}\bigg (\Vert {\hat{\beta }} - \beta ^* \Vert _1 > \frac{12q_n\lambda _n^{(1)}}{\kappa (1+\lambda _n^{(2)})}\bigg ) \le 2p_n\exp \bigg (-\frac{n}{8\sigma ^2}\frac{(\lambda _n^{(1)})^2}{1+\lambda _n^{(2)}}\bigg ). \end{aligned}$$

As a consequence of Corollary 2.3, \({{\widehat{\beta }}}\) is consistent if

$$\begin{aligned} \frac{q_n(\lambda _n^{(1)})^2}{(1+\lambda _n^{(2)})^2}\xrightarrow [n\rightarrow \infty ]{}0 \end{aligned}$$

and

$$\begin{aligned} p_n \exp \bigg (-\frac{n(\lambda _{n}^{(1)})^2}{8\sigma ^2(1+\lambda _{n}^{(2)})}\bigg )\xrightarrow [n\rightarrow \infty ]{}0. \end{aligned}$$

If we take \(\lambda _n^{(2)}=0\) and \(\lambda _n^{(1)}=4\sigma \sqrt{\log p_n/n}\) in Corollary 2.3, we obtain the nonnegative lasso's tail probability control as in Proposition 1 in Wu et al (2014). If we further assume \(\beta ^*\in {\mathbb {R}}^{p_n}\) in Corollary 2.3, we derive the tail bounds for the lasso (see Corollary 2 in Negahban et al (2012)).

2.4 Limiting distributions of ARGEN estimators

We now study the asymptotic behavior in distribution of the ARGEN estimator as \(n\rightarrow \infty \). Again we make use of the transformation of ARGEN into the rectangle-range lasso model (2.16), since the limiting distributions of lasso regression estimators have been studied in Fu and Knight (2000). Observe that (2.16) is equivalent to

$$\begin{aligned} \widehat{{\widetilde{\beta }}}(\lambda _n) =\mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{\beta \in \widetilde{{\mathcal {I}}}}\left( \big \Vert \breve{Y} - \breve{X}\beta \big \Vert _2^2+\lambda _n |\beta |\right) , \end{aligned}$$
(2.17)

where \(\breve{X}={\widetilde{X}}/\sqrt{2n}\) and \(\breve{Y}=\widetilde{Y}/\sqrt{2n}\). (2.17) is then the type of lasso studied in Fu and Knight (2000). Assume that the row vectors of \(\breve{X}\), denoted by \(\breve{X}^{(i)}\), \(i=1,\ldots ,n\), satisfy

$$\begin{aligned} \frac{1}{n}\sum _{i=1}^n\breve{X}^{(i)}{\breve{X}^{(i)}}{}'\xrightarrow [n\rightarrow \infty ]{}M, \end{aligned}$$
(2.18)

where M is a nonsingular nonnegative definite matrix and

$$\begin{aligned} \frac{1}{n}\max \limits _{1\le i\le n}{\breve{X}^{(i)}}{}'\breve{X}^{(i)}\xrightarrow [n\rightarrow \infty ]{}0. \end{aligned}$$
(2.19)

It follows from Theorem 2 in Fu and Knight (2000) that the ARGEN estimator \({{\widehat{\beta }}}\) has the following asymptotic behavior in distribution.

Theorem 2.4

Assume \(\lim _{n\rightarrow \infty }p_n=p\), \(\lim _{n\rightarrow \infty }\lambda _n^{(2)}\) \(=\lambda ^{(2)}\) and \(\lim _{n\rightarrow \infty }\mathrm w_n=\mathrm w=(\mathrm w_1,\ldots ,\mathrm w_p)\). Let X, \(\mathrm w_n\) and \(\Sigma _n\) satisfy (2.18) and (2.19). Also assume

$$\begin{aligned} \frac{\lambda _n^{(1)}}{\sqrt{n(1+\lambda _n^{(2)})}}\xrightarrow [n\rightarrow \infty ]{}\lambda _0\ge 0. \end{aligned}$$

Then

$$\begin{aligned} \sqrt{n}({{\widehat{\beta }}}-\beta ^*)\xrightarrow [n\rightarrow \infty ]{\text{ law }}\mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{u\in {\mathcal {I}}}(V(u)), \end{aligned}$$

where \(\xrightarrow [n\rightarrow \infty ]{\text{ law }}\) denotes the convergence in distribution; V(u) is a Gaussian random variable given as

$$\begin{aligned} \begin{aligned}&V(u)=-2u'G+u'Mu\\&\qquad +\lambda _0\sum _{j=1}^p\left( u_j{{\,\textrm{sign}\,}}(\beta _j^*)\mathbb {1}(\beta _j^*\ne 0)+|u_j|\mathbb {1}(\beta _j^*=0)\right) . \end{aligned} \end{aligned}$$

In the above expression of V(u), \(G\sim {\mathcal {N}}(0,\sigma ^2\,M)\), \(u_j\) denotes the jth coordinate of u and \(\mathbb {1}\) is the indicator function.

Theorem 2.4 includes the asymptotic behaviors of the elastic net and nonnegative elastic net estimators as particular cases. As another particular case, when \(\lambda _0=0\) and \(p=1\) (so that M is a scalar and \({\mathcal {I}}=[s_1,t_1]\)), by the convexity of V we obtain

$$\begin{aligned} \mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{u\in {\mathcal {I}}}(V(u))=\left\{ \begin{array}{ll} M^{-1}G&{}~\text{ if }~M^{-1}G\in [s_1,t_1];\\ s_1&{}~\text{ if }~M^{-1}G<s_1;\\ t_1&{}~\text{ if }~M^{-1}G>t_1. \end{array} \right. \end{aligned}$$

In the above example, if \(p\ge 2\) and \({\mathcal {I}}\ne {\mathbb {R}}^p\), \(\mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{u\in {\mathcal {I}}}(V(u))\) has no simple explicit expression; it is the solution of a quadratic programming problem. In the next section we provide a multiplicative updates numerical algorithm to solve the ARGEN. This algorithm may be further applied to simulate \(\mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{u\in {\mathcal {I}}}(V(u))\) numerically.
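The one-dimensional case above is easy to simulate; the following sketch draws from the limiting distribution when \(\lambda _0=0\) and \(p=1\) by clipping \(M^{-1}G\) to \([s_1,t_1]\), with illustrative values of M, \(\sigma \), \(s_1\), \(t_1\).

```python
import numpy as np

rng = np.random.default_rng(2)
M, sigma, s1, t1 = 2.0, 1.0, -0.5, 1.0                 # illustrative values
G = rng.normal(0.0, sigma * np.sqrt(M), size=100_000)  # G ~ N(0, sigma^2 M)
samples = np.clip(G / M, s1, t1)                       # argmin of V over [s1, t1] when lambda_0 = 0
print(samples.mean().round(4),
      (samples == s1).mean().round(4),                 # probability mass accumulated at s1
      (samples == t1).mean().round(4))                 # probability mass accumulated at t1
```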

3 MU-QP-RR-W-\(l_1\) Algorithm for Solving ARGEN

In this section we provide a solution of ARGEN by an extension of the multiplicative updates algorithm. Given \(\mathrm w_n,~\Sigma _n\) and \(\lambda _n^{(1)},\lambda _n^{(2)}\ge 0\), the ARGEN in (2.1) can be expressed as the following equivalent problem:

(3.1)

To simplify the problem, we rewrite it by taking

and obtain an equivalent problem of (3.1), that is,

$$\begin{aligned} {\left\{ \begin{array}{ll} \text {minimize } F(v) = v'Av+b'v+d'|v-v^0|, \\ \text {subject to } v \in [0, l]. \end{array}\right. } \end{aligned}$$
(3.2)

This is obtained by noting that \( |v+ s|-|v-v^0|=s^+:=\big (\max \{0,s_1\}~\ldots ~\max \{0,s_p\}\big )'\) and omitting the constant terms \( d' s^++ (3/2)s'A s-b' s. \) Here the matrix A is symmetric positive semi-definite. The problem (3.2) is a quadratic programming problem, except that it contains the term \(d'|v-v^0|\).

Sha et al (2007a) derived multiplicative updates for solving the nonnegative quadratic programming problem. The algorithm has a simple closed form and a rapid convergence rate. Our problem (3.2), however, contains absolute values as well as lower and upper limits on the optimization variables, so direct application of the algorithm in Sha et al (2007a) is impractical. Therefore, we propose a new iterative algorithm to solve (3.2), called multiplicative updates for solving quadratic programming with rectangle range and weighted \(l_1\) regularizer (abbreviated to MU-QP-RR-W-\(l_1\)).

Let us formulate a more general problem that can be solved by MU-QP-RR-W-\(l_1\):

$$\begin{aligned} {\left\{ \begin{array}{ll} \text{ minimize } ~F(v) = \frac{1}{2} v'Av + b' v+d'|v-v^0|, \\ \text{ subject } \text{ to }~v\in [0, l]. \end{array}\right. } \end{aligned}$$
(3.3)

Here \(v,b,d,v^0,l\) are column vectors of dimension p, where the elements of \(d, v^0\) are nonnegative and the elements of l are positive. The matrix \( A=(A_{ij})_{1\le i,j\le p}\) is positive semi-definite. In fact, nonnegative quadratic programming (see e.g. Equation (5) in Sha et al (2007a) or (20) in Wu and Yang (2014)) is a special case of (3.3), obtained by taking the elements of \(d, v^0\) to be 0 and the elements of l to be infinity.

Let us further adopt the following notations. For \(i,j\in \{1,\ldots ,p\}\), we define the positive part and negative part of \(A_{ij}\) by

$$\begin{aligned} A_{ij}^{+} := \max \left\{ 0,A_{ij}\right\} ~ \text { and }~ A_{ij}^{-} := \max \left\{ 0,-A_{ij}\right\} . \end{aligned}$$

Then denote the positive part and negative part of the matrix A by

$$\begin{aligned} A^+:=\big (A_{ij}^+\big )_{1\le i,j\le p}~\text{ and }~A^-:=\big (A_{ij}^{-}\big )_{1\le i,j\le p}. \end{aligned}$$

It follows that \(A=A^+-A^-\) and \(|A|:=\left( |A_{ij}|\right) _{1\le i,j\le p}=A^++A^-\). Let \( a_i(v):= (A^+ v)_i\) and \(c_i(v):= (A^- v)_i \). We then present the MU-QP-RR-W-\(l_1\) algorithm in pseudocode below.

[Algorithm 1: MU-QP-RR-W-\(l_1\), the multiplicative update (3.4) in pseudocode]

We point out that the conditions \(r_1>v_i^0\) and \(r_2<v_i^0\) in (3.4) are mutually exclusive when \(v_i^{(m)}>0\). This is because, on one hand, \(r_1>v_i^0\) is equivalent to

$$\begin{aligned} \begin{aligned}&\frac{2a_i(v)v_i^0}{v_i}+(b_i+d_i)<0\\&\text{ or }\\&\left\{ \begin{array}{ll} &{} \frac{2a_i(v)v_i^0}{v_i}+(b_i+d_i)\ge 0, \\ &{} a_i(v)\left( \frac{v_i^0}{v_i}\right) ^2+(b_i+d_i)\frac{v_i^0}{v_i}-c_i(v)<0. \end{array}\right. \end{aligned} \end{aligned}$$
(3.5)

On the other hand, \(r_2<v_i^0\) is equivalent to

$$\begin{aligned} \left\{ \begin{array}{ll} &{} \frac{2a_i(v)v_i^0}{v_i}+(b_i-d_i)\ge 0, \\ &{} a_i(v)\left( \frac{v_i^0}{v_i}\right) ^2+(b_i-d_i)\frac{v_i^0}{v_i}-c_i(v)>0. \end{array}\right. \end{aligned}$$
(3.6)

Since \(d_i\ge 0\), it is obvious that (3.5) and (3.6) are mutually exclusive.

A special case of the algorithm arises when \(d_i=0\) and \(l_i=+\infty \) for \(i=1,\ldots ,p\); then (3.4) becomes

$$\begin{aligned} v_i\longleftarrow v_i\Big (\frac{-b_i + \sqrt{b_i^2 + 4a_i(v) c_i(v)}}{2a_i(v)}\Big ), \end{aligned}$$

for \(i=1,\ldots ,p\), which reduces to the one for nonnegative quadratic programming (see e.g. (7) - (12) in Sha et al (2007a) or (21) - (22) in Wu and Yang (2014)).
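Since the closed form of the update (3.4) is given in Algorithm 1 rather than in the text, the following Python sketch reconstructs a plausible version of it from the equivalences (3.5)-(3.6) and the special case above: \(r_1\) and \(r_2\) are taken to be the multiplicative candidates associated with \(b_i+d_i\) and \(b_i-d_i\), respectively, the kink \(v_i^0\) is used when neither condition fires, and the result is clipped to \([0, l_i]\). Treat it as an illustration of the multiplicative-update idea, not as the authoritative algorithm.

```python
import numpy as np

def mu_qp_rr_w_l1(A, b, d, v0, l, v_init, n_iter=2000):
    """Multiplicative-update style iteration for problem (3.3):
        min_{v in [0, l]}  (1/2) v'Av + b'v + d'|v - v0|.
    The r1/r2 formulas below are inferred from (3.5)-(3.6) and from the
    d = 0, l = +inf special case; consult Algorithm 1 for the exact update."""
    A_plus, A_minus = np.maximum(A, 0.0), np.maximum(-A, 0.0)   # A = A^+ - A^-
    v = np.asarray(v_init, dtype=float).copy()
    for _ in range(n_iter):
        a = A_plus @ v + 1e-15          # a_i(v); tiny epsilon guards the division in this sketch
        c = A_minus @ v                 # c_i(v)
        r1 = v * (-(b + d) + np.sqrt((b + d) ** 2 + 4.0 * a * c)) / (2.0 * a)
        r2 = v * (-(b - d) + np.sqrt((b - d) ** 2 + 4.0 * a * c)) / (2.0 * a)
        cand = np.where(r1 > v0, r1, np.where(r2 < v0, r2, v0))
        v = np.clip(cand, 0.0, l)       # keep the iterate inside the rectangle [0, l]
    return v
```

With \(d=0\), \(v^0=0\), and \(l=+\infty \), the candidate above reduces to the nonnegative quadratic programming update quoted in the preceding display.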

The MU-QP-RR-W-\(l_1\) iteration converges monotonically to the global minimum of the objective function \(F(\cdot )\) in (3.3) over the rectangle \([0, l]\). This is summarized in the theorem below:

Theorem 3.1

Let \(F(\cdot ),~A,~b,~d,~v^0,~ l\) be given as in the problem (3.3). Define an auxiliary function \(G( \cdot , \cdot )\) by: for \(u,v\in [0, l]\),

(3.7)

For any positive-valued vector \(v\in [0, l]\), pick a vector \(U(v)\in \mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{ u\in [0, l]}G(u, v) \). Then U(v) satisfies the following:

(i) For any \(v\in [0, l]\),

    $$\begin{aligned}{} & {} F(U(v))\le F(v), \end{aligned}$$
    (3.8)
    $$\begin{aligned}{} & {} F(U(v))= F(v)~\text{ if } \text{ and } \text{ only } \text{ if }~U(v)=v. \end{aligned}$$
    (3.9)
(ii) For each \(v\in [0, l]\), U(v) is the updated value of v, presented in the form of (3.4).

The approach we use to prove Theorem 3.1 is similar to the ones used in the Expectation-Maximization algorithm (Dempster et al 1977), nonnegative matrix factorization (Lee and Seung 2000), and multiplicative updates for nonnegative quadratic programming (Sha et al 2007a, b). More specifically, the proof proceeds in two steps. First, we establish the auxiliary function \(G( \cdot , \cdot )\) in (3.7) and show that MU-QP-RR-W-\(l_1\) monotonically decreases the value of the objective function \(F( \cdot )\) in (3.3). Then, we show that the iterative updates (3.4) in Algorithm 1 converge to the global minimum. The complete proof of Theorem 3.1 is provided in Appendix B.

4 Hyper-parameter optimization

ARGEN is a family of models that includes many well-known constrained linear models. For instance, in Table 1, ARLS, ARL, ARR, and AREN correspond to the arbitrary rectangle-range least squares, lasso, ridge, and elastic net, respectively. Besides, based on the choice of parameters we can propose other new methods, including ARGL, ARGR, ARLEN and ARREN, which are applicable to more complicated problems and usually perform better than their free-range counterparts.

However, many of the methods in Table 1 involve a very high-dimensional hyper-parameter space, so grid search for tuning the parameters can be computationally expensive. Other tuning approaches such as Bayesian optimization and gradient-based optimization, developed to obtain better results in fewer evaluations, are also costly in our case because they tend to search around local minima when the complexity of the surface is relatively high. Since discovering better tuning methods is not a focus of this paper (it is a potential direction of our future research), we simply use random search (\(N_{calls}\) trials) to avoid costly searching over the entire grid. Following the convention in Zou and Hastie (2005) and Tibshirani (1996), we use the mean-squared error (MSE) as the score function, that is,

$$\begin{aligned} MSE={\mathbb {E}}\big \Vert X{\widehat{\beta }}-X\beta ^*\big \Vert _2^2={\mathbb {E}}\big [({\widehat{\beta }}-\beta ^*)'X'X({\widehat{\beta }}-\beta ^*)\big ]. \end{aligned}$$
(4.1)
Table 1 Particular examples of ARGEN methods and their parameter setting
Table 2 Tuning grid for different methods

To further speed up the tuning process, we select the following values for each of the potential hyper-parameters to tune on. \(\lambda _n^{(1)},~\lambda _n^{(2)}\) take integer values in the sets \(\Lambda ^{(1)}\) and \(\Lambda ^{(2)}\), respectively, where for each \(i=1,2\), \( \Lambda ^{(i)}=\big \{ 0,~1,\ldots ,\lambda _{{{\text {up}}}} ^{(i)}\big \}~\text{ for } \text{ some }~\lambda _{{{\text {up}}}} ^{(i)}\in \mathbb {Z}_+. \) The weight vector \(\mathrm w_n\) takes values in

$$\begin{aligned} W=\Big \{(\mathrm w_1~\ldots ~\mathrm w_p)':~\mathrm w_1,\ldots ,\mathrm w_p\in \{0,1,\ldots ,w _{{{\text {up}}}}\}\Big \}, \end{aligned}$$

for some \(w _{{{\text {up}}}} \in \mathbb {Z}_+\). The matrix \(\mathrm \Sigma _n\) can be decomposed as \(\Sigma _n=PDP'\) with an orthogonal matrix P and a nonnegative diagonal matrix D. Therefore, the values of \(\mathrm \Sigma _n\) are considered in

$$\begin{aligned} \begin{aligned} \Sigma =\Big \{PDP':&D={{\,\textrm{diag}\,}}(d_1,\cdots ,d_p), \\ {}&d_1,\ldots ,d_p\in \left\{ 0,\cdots ,d _{{{\text {up}}}} \right\} \Big \}, \end{aligned} \end{aligned}$$

for some \(d _{{{\text {up}}}} \in \mathbb {Z}_+\) and orthogonal matrix P. The cardinalities of \(\Lambda ^{(1)}\), \(\Lambda ^{(2)}\), W, and \(\Sigma \) are \(\lambda _{{{\text {up}}}} ^{(1)}+1\), \(\lambda _{{{\text {up}}}} ^{(2)}+1\), \((w _{{{\text {up}}}} +1)^p\), and \((d _{{{\text {up}}}} +1)^p\), respectively; we thus obtain the total number of grid values for each method, listed in Table 2.

To improve the performance of the methods, we can: (1) randomly choose more trials, that is, increase \(N_{calls}\), to cover more values on the grid, since theoretically searching on the whole grid will give the best result; (2) increase the values of \(w _{{{\text {up}}}} \) and \(d _{{{\text {up}}}} \), but at the same time \(N_{calls}\) also needs to be increased since higher values of \(w _{{{\text {up}}}} \) and \(d _{{{\text {up}}}} \) will exponentially enlarge the grid.
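A minimal random-search loop over the grid just described might look as follows; fit_and_score is a hypothetical placeholder standing in for fitting ARGEN with a given hyper-parameter set and returning the validation MSE (4.1).

```python
import numpy as np

rng = np.random.default_rng(3)
p, lam1_up, lam2_up, w_up, d_up, n_calls = 8, 100, 100, 2, 2, 500

def fit_and_score(lam1, lam2, w, Sigma):
    # placeholder: fit ARGEN on the training set with these hyper-parameters
    # and return the validation MSE (4.1); replaced here by a dummy value
    return rng.random()

def sample_hyper_parameters():
    """Draw one random point from Lambda^(1) x Lambda^(2) x W x Sigma."""
    lam1 = int(rng.integers(0, lam1_up + 1))
    lam2 = int(rng.integers(0, lam2_up + 1))
    w = rng.integers(0, w_up + 1, size=p)                     # element of W
    D = np.diag(rng.integers(0, d_up + 1, size=p).astype(float))
    P = np.linalg.qr(rng.standard_normal((p, p)))[0]          # a random orthogonal matrix
    return lam1, lam2, w, P @ D @ P.T                         # Sigma_n = P D P'

best_score, best_params = np.inf, None
for _ in range(n_calls):
    params = sample_hyper_parameters()
    score = fit_and_score(*params)
    if score < best_score:
        best_score, best_params = score, params
```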

5 Simulations

5.1 Signal recovery

The purpose of the signal recovery examples below is to explore the best possible performance of ARGEN, and to show ARGEN’s ability to “reduce noise” and deal with high-dimensional (\(p\gg n\)) sparse signals.

First, we consider the same problem as in Mohammadi et al (2018) to compare our results with theirs. In the following, we briefly outline the problem. A sparse signal \(\beta ^*\in {\mathbb {R}}^{4096}\) with 160 spikes of amplitude 1 is generated and plotted in Fig. 1 (top); this is the true regression coefficient vector. The design matrix \(X\in {\mathbb {R}}^{1024\times 4096}\) is generated with each entry sampled i.i.d. from the standard normal distribution and each pair of rows orthogonalized. The response vector \(Y\in {\mathbb {R}}^{1024}\) is then generated through \(Y=X\beta ^*+\epsilon \), where \(\epsilon \in {\mathbb {R}}^{1024}\) is a vector of i.i.d. Gaussian noise with zero mean and variance 0.1. The lower and upper bounds of \({{\widehat{\beta }}}\) are \(-1\) and 1, respectively. Given \(\lambda _n^{(1)}=10, \lambda _n^{(2)}=0,\) and \(w_i=0 \) if \(\beta _i\ne 0 \) for \(i=1,\ldots , 4096\), we obtain the recovered signal \({{\widehat{\beta }}}\) and its difference from the true signal in the middle and bottom plots of Fig. 1. As a result, ARGEN achieves an MSE of 0.00069, lower than the MSE of 0.00273 in Mohammadi et al (2018).
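A sketch of the data-generating step of this experiment is given below; the row orthogonalization is done here with a QR decomposition, which is one way to obtain pairwise-orthogonal rows (and rescales the entries relative to a raw standard-normal draw), and the spike locations are chosen at random.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, n_spikes = 1024, 4096, 160

beta_star = np.zeros(p)
spikes = rng.choice(p, size=n_spikes, replace=False)
beta_star[spikes] = 1.0                                  # 160 spikes of amplitude 1

G = rng.standard_normal((n, p))
X = np.linalg.qr(G.T, mode='reduced')[0].T               # rows made mutually orthogonal
eps = np.sqrt(0.1) * rng.standard_normal(n)              # Gaussian noise, variance 0.1
Y = X @ beta_star + eps                                  # observed response
```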

To show that ARGEN can deal with more complicated problems, we consider another signal recovery problem. We follow the same settings as in the previous problem, but this time replace the amplitude of each spike by a random value generated from the uniform distribution over [0, 1). The corresponding true and recovered signals and their difference are plotted in Fig. 2. The MSE obtained by ARGEN is 0.00166.

Fig. 1: Signal recovery with equal-length spikes. MSE \(= 0.00069\)

Fig. 2: Signal recovery with arbitrary-length spikes. MSE \(= 0.00166\)

5.2 Methods comparison

In this section, we compare the performances of the methods listed in Table 1. We adopt the following setup to tune the four hyper-parameters: \(\lambda _{{{\text {up}}}} ^{(1)}=100,~ \lambda _{{{\text {up}}}} ^{(2)}=100,~w _{{{\text {up}}}} =2,~d _{{{\text {up}}}} =2\). Considering the computational cost, the number of random grid values tried (\(N_{calls}\)) is 100 for ARL and ARR, 500 for AREN, 1280 for ARGL and ARGR, 2560 for ARLEN and ARREN, and 6554 for ARGEN.

We conduct 8 examples to test the performance of each method. In each example, we simulate 50 data sets from \( Y=X\beta ^*+\epsilon \), \(\epsilon \sim {\mathcal {N}}(0,\sigma ^2)\), and each data set consists of independent training, validation, and testing sets. We use the training set to fit the models, tune the parameters on the validation set, and compute the test error, measured by the MSE (4.1), on the testing set. In the following, we outline these examples.

In Example 1, let \(\beta ^*=(3~1.5~0~0~2~0~0~0)'\), \(p=8\), \(\sigma =3\), and the pairwise correlation between \(X_i\) and \(X_j\) be \(0.5^{|i-j|}\) for all ij. We use 20 observations for training, 20 for validation, and 200 for testing.

Example 2 is the same as Example 1, except that each entry of \(\beta ^*\) is replaced with 0.85.

In Example 3, let \(\sigma =15\), \(p=40\),

$$\begin{aligned} \beta ^*=(\underbrace{0~\cdots ~0}_{\text{10 times}}~\underbrace{2~\cdots ~2}_{\text{10 times}}~\underbrace{0~\cdots ~0}_{\text{10 times}}~\underbrace{2~\cdots ~2}_{\text{10 times}})', \end{aligned}$$

and the pairwise correlation between \(X_i\) and \(X_j\) be 0.5 for all ij. We use 100 observations for training, 100 for validation, and 400 for testing.

In Example 4, let \(\sigma =15\), \(p=15\),

$$\begin{aligned} \beta ^*=(\underbrace{3~\cdots ~3}_\text {6 times}~\underbrace{0~\cdots ~0}_\text {9 times})'. \end{aligned}$$

Let the design matrix X be generated as:

$$\begin{aligned} \begin{aligned}{}&{} x_i=Z_1+\epsilon _i^x, ~~Z_1\sim {\mathcal {N}}(0,1), ~~i=1,2,\\{}&{} x_i=Z_2+\epsilon _i^x, ~~Z_2\sim {\mathcal {N}}(0,1), ~~i=3,4,\\{}&{} x_i=Z_3+\epsilon _i^x, ~~Z_3\sim {\mathcal {N}}(0,1), ~~i=5,6,\\{}&{} \text { where }~\epsilon _i^x\text {'s are } \text { i.i.d. }~ {\mathcal {N}}(0,0.01),~ i=1,\cdots ,6. \\{}&{} x_i\sim N(0,1), ~~x_i\text {'s are } \text { i.i.d. }, ~~i=7,\cdots ,15. \end{aligned} \end{aligned}$$

We use 40 observations for training, 40 for validation, and 100 for testing.

Example 5 is the same as Example 1, except that \(\beta ^*=(-3~-1.5~0~0~2~0~0~0)'\) and \(\beta _i\ge -1000\) for all i.

Example 6 is the same as Example 1, except that each entry of \(\beta ^*\) is replaced with a randomly generated number in \([-5,5]\), and these values are used for all 50 data sets. Besides, we restrict \(\beta _i\in [-5,5]\) for all i.

Example 7 is the same as Example 1, but uses \(\beta ^*=(-6~-8~0~0~7~0~0~0)'\) and restricts \(\beta _i\in [-5,5]\) for all i.

Example 8 is the same as Example 4, but uses 5 observations for training, 5 for validation, and 50 for testing. Besides, we restrict \(\beta _i\ge -1000\) for all i and use

$$\begin{aligned} \beta ^*=(\underbrace{-3~\cdots ~-3}_\text {6 times}~\underbrace{0~\cdots ~0}_\text {9 times})'. \end{aligned}$$

The first three examples above are from Zou and Hastie (2005) and Tibshirani (1996) and were originally constructed for the lasso. The fourth example is similar to one in Zou and Hastie (2005), which creates a grouped-variable situation. None of the first four examples, however, requires lower or upper bound constraints on the coefficients. To show and test that ARGEN is applicable to more general and complicated problems, we add four more examples, Examples 5 to 8. In each of Examples 5, 6, and 8, constraints are added that include the true coefficients. In Example 7, we provide a case where the true coefficients lie outside the interval constraints. The values 1000 and 5 were chosen arbitrarily to illustrate the model's ability to work with constrained coefficients. Moreover, another purpose of the last example is to test model performance in high-dimensional \((p\ge n)\) scenarios.
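As an illustration of the simulation setup, the following sketch generates one training set of Example 1, with pairwise correlation \(0.5^{|i-j|}\) induced through a Cholesky factor of the correlation matrix (one standard construction).

```python
import numpy as np

rng = np.random.default_rng(5)
p, n_train, sigma = 8, 20, 3.0
beta_star = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])

corr = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))  # corr(X_i, X_j) = 0.5^{|i-j|}
L = np.linalg.cholesky(corr)
X = rng.standard_normal((n_train, p)) @ L.T                          # rows ~ N(0, corr)
Y = X @ beta_star + sigma * rng.standard_normal(n_train)             # Example 1 response
```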

Table 3 Median MSE and the corresponding standard error (given in the parentheses) over 50 replications for each method and example

Table 3 summarizes the median MSE and its corresponding standard error over the 50 data sets for each method in Table 1 and each of the above 8 examples. The MSEs of examples with different \(\sigma \) are not comparable because they are simulated with different noise variances. Some of our examples do, however, share a similar simulation process, and their MSEs are at the same level. For instance, Examples 1 and 5 are similar except that Example 5 has a lower limit of \(-1000\) on the coefficients, whereas Example 1 has no limit. As a result of this lower constraint, Example 5 tends to force coefficients above the lower limit, resulting in relatively higher MSEs than Example 1. Beyond cross-example comparisons, it is more meaningful to compare the MSEs across methods for each example. The overall performance of methods with more parameters is better than that of methods with fewer parameters. More specifically, ARGL, ARGR, ARLEN, ARREN, and ARGEN in most cases outperform ARLS, ARL, ARR, and AREN. For instance, ARGEN, the most complicated method, which includes all four hyper-parameters, performs best in Examples 1, 2, 5, 6, and 7. ARLEN and ARREN, the second most complex methods, provide the second highest accuracy in Examples 1, 2, 3, 4, 5, 7, and 8. ARGL and ARGR are at the third level of performance in Examples 1, 2, 5, 6, and 7. However, in Table 3, performance does not always increase as the model gets more complicated. This is because the ratio of values searched (\(N_{calls}\)) to the total number of values on the grid is not the same for all methods, due to the exponential increase of the size of the grid as more hyper-parameters are included. It is also because we keep the same \(N_{calls}\) in each method for all the examples, which, in fact, have different dimensionality. Therefore, the performance of methods like ARLEN, ARREN, and ARGEN is worse than expected in some of the examples.

6 Real world application - S &P 500 index tracking

6.1 Outline

Index tracking is a passive management strategy that replicates the return of a market index (e.g., the S&P 500 in New York and the FTSE 100 in London) by constructing an equity portfolio that contains only a subset of the index constituents, so as to reduce transaction and management costs (Connor and Leland 1995; Franks 1992; Jacobs and Levy 1996; Jobst et al 2001; Larsen and Resnick 1998; Lobo et al 2000; Toy and Zurack 1989).

In this section, we show how ARGEN applies to index tracking, an asset allocation (Markowitz 1952) problem with allocation constraints, and compare the results with those of the nonnegative lasso (Wu et al 2014) and the nonnegative elastic net (Wu and Yang 2014). Through this example, (1) we provide general practical guidance for adapting ARGEN to solve real world problems; (2) we demonstrate ARGEN's feasibility and flexibility compared to the existing methods. In particular, we highlight that ARGEN can deal with problems that require constraints on the coefficients, while none of the existing methods (Wu et al 2014; Wu and Yang 2014) can.

We take tracking the US S&P 500 index as our example. It is worth noting that in Wu et al (2014) the nonnegative lasso is applied to tracking the CSI 300 index, and in Wu and Yang (2014) the nonnegative elastic net is used to track the CSI 300 and SSE 180. Both indexes are based on stocks without short sales. This is not the case for the US stock market, where short sales are allowed. Although short-selling is allowed in the US market, traditional mutual funds still hold long-only portfolios. Previous research (Almazan et al 2004; Agarwal et al 2009; Chen et al 2013; An et al 2021) suggested that only a small portion of mutual funds hold short positions in their portfolios, even though short-selling is allowed for many of them. In particular, the data from An et al (2021) suggest that about 90% of the sampled mutual funds held long-only portfolios in the 8 quarters before the research date (July 2021), even though 40% of those funds are explicitly allowed to hold short positions. There are long-only funds in the US market, to name a few: MS INVF Global Advantage Fund, Jennison Global Opportunity Fund, and Thematics Safety Fund. According to the managing partner of Reverb ETF (a user-voting-based, long-only, diversified equity fund), constraints are usually in place for constituents to ensure portfolio diversification and risk mitigation. From a global perspective, short selling still faces limitations or additional regulations outside the US, such as in China, India, and Brazil. In 2020, due to market volatility, several European countries (Belgium, France, Italy, etc.) imposed short-selling restrictions for 2 months from March 18, 2020, and a few Asian countries (South Korea, Indonesia, and Thailand) imposed longer short-selling restrictions (Manson, 2020). Given the above, we believe that an empirical study with a nonnegative range setting is a good fit for general applications in practice.

In portfolio management, how closely the constructed tracking portfolio's return follows that of the benchmark index is a primary measurement for assessing portfolio performance for passive strategies such as index tracking. Hence, inspired by Sant'Anna et al (2020), we evaluate tracking portfolio performance from the following three perspectives. Our primary performance measurement is the tracking error (TE),

$$\begin{aligned} {\text {TE}} = \sqrt{\frac{\sum \limits _{t=1}^{T}\left( (r_{t}^{p} - r_{t}^{b}) - {\mathbb {E}}[r_{t}^{p} - r_{t}^{b}]\right) ^2}{T}} \end{aligned}$$

for measuring the volatility of the excess return of a portfolio to the corresponding benchmark. We also compute the annual volatility of portfolio return (ARV),

$$\begin{aligned} {\text {ARV}} = \sqrt{252} \sqrt{\frac{\sum \limits _{t=1}^{T}\left( r_{t}^{p} - {\mathbb {E}}[r_{t}^{p}]\right) ^2}{T}} \end{aligned}$$

to measure the annualized return volatility of a portfolio. In addition, we also report the cumulative return

$$\begin{aligned} {\text {CR}} = \prod _{t=1}^{T} (1+r_{t}^{p}) - 1 \end{aligned}$$

of the constructed portfolios in our study. Here \(r_{t}^{p}\) denotes the portfolio return at time t, \(r_{t}^{b}\) is the benchmark return at time t, and T is the total number of periods.
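The three measures translate directly into code; the sketch below computes them from arrays of daily portfolio and benchmark returns (the 252-day annualization factor follows the ARV definition above).

```python
import numpy as np

def tracking_error(r_p, r_b):
    """TE: standard deviation of the excess return of the portfolio over the benchmark."""
    excess = r_p - r_b
    return np.sqrt(np.mean((excess - excess.mean()) ** 2))

def annual_return_volatility(r_p):
    """ARV: annualized (252 trading days) volatility of the daily portfolio returns."""
    return np.sqrt(252.0) * np.sqrt(np.mean((r_p - r_p.mean()) ** 2))

def cumulative_return(r_p):
    """CR: compounded return of the portfolio over the whole period."""
    return np.prod(1.0 + r_p) - 1.0
```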

Because there is no guarantee that the normalized \({{\widehat{\beta }}}\) still lies below t, we introduce the following normalization process, which constrains the choice of the lower and upper limits of \({\mathcal {I}}\). Recall that \(s_i\) and \(t_i\) are the lower and upper bounds of the coefficient \(\beta _i\). To guarantee that the portfolio weight (i.e., the normalized \(\beta _i\)), denoted by \({\tilde{\beta }}_i\), of stock i satisfies \(0 \le {\tilde{\beta }}_i \le t_i \le 1\), we need \(s_i\) and \(t_i\) to satisfy

$$\begin{aligned} t_i + \sum _{j\in \{1,\ldots ,p\}\backslash \{i\}} s_j \ge 1, \end{aligned}$$

because it yields

$$\begin{aligned} {\tilde{\beta }}_i = \frac{\beta _i}{\sum _{j=1}^{p} \beta _j} \le \frac{t_i}{t_i+\sum _{j\in \{1,\ldots ,p\}\backslash \{i\}} s_j}\le t_i. \end{aligned}$$

In the special case \(s_i = s_0\) and \(t_i = t_0\) for all i, we choose the lower and upper bounds such that

$$\begin{aligned} \frac{1-t_0}{p-1} \le s_0 \le t_0 \le 1. \end{aligned}$$
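A quick numerical check of this condition is sketched below, using the equal-bound special case with the pair [0.0082, 0.6] from this section and p = 50 selected stocks (an illustrative combination).

```python
import numpy as np

p, s0, t0 = 50, 0.0082, 0.6                 # equal bounds s_i = s0, t_i = t0
assert (1 - t0) / (p - 1) <= s0 <= t0 <= 1  # the condition on (s0, t0) derived above

rng = np.random.default_rng(6)
beta = rng.uniform(s0, t0, size=p)          # any feasible coefficient vector in [s0, t0]^p
weights = beta / beta.sum()                 # normalized portfolio weights
print(weights.max() <= t0)                  # True: every weight stays below its cap t0
```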

We use 5-year (from February 19, 2016 to February 18, 2020) historical daily prices (1259 data points) of the S&P 500 as our benchmark index and those of its constituent equities. Because the list of S&P 500 constituents is updated regularly by S&P Dow Jones Indices LLC, we only include the daily prices of the 377 stocks that did not change during the period of interest. In the linear model (1.2), Y is the vector of daily returns of the S&P 500 index and the columns of X are the daily returns of these stocks. To follow a buy-and-hold investment strategy, we split the data into training, validation, and testing sets. The training and validation sets consist of the first 252 data points (12 months), 20% of which are in the validation set. The remaining 1006 data points form the testing set. In addition, we construct a long-only portfolio by ensuring the lower bound s assumes only nonnegative values.

In the following, we outline our procedure for the index tracking problem. First, we target selecting N individual stocks to construct the tracking portfolio. The number N is within the normal range of the number of stocks held to reduce risk exposure and avoid unnecessary transaction costs. In other words, we constrain the number of nonzero elements in \({\widehat{\beta }}\) to N. Thus we use the bisection search (Wu and Yang 2014) in Algorithm 2 to determine the optimal \(\lambda _n^{(1)}\) that produces the right number of nonzero coefficients, given \(N=30,~50,~70,~90\), respectively, \(\lambda _n^{(2)}=0\), \(\textrm{w}_n\) with equal elements, and \({\mathcal {I}}=[0,+\infty )^p\). Hence we obtain the N stocks selected by the model for each choice of N. This first process proceeds on the training and validation sets.
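Since Algorithm 2 is only referenced here, the following generic sketch illustrates the bisection idea on \(\lambda _n^{(1)}\): larger penalties produce fewer nonzero coefficients, so the interval is halved until the target count N is reached. The function fit_argen is a hypothetical placeholder for solving ARGEN at a given \(\lambda _n^{(1)}\) and returning the coefficient vector; consult Algorithm 2 for the exact procedure.

```python
import numpy as np

def bisection_for_sparsity(fit_argen, target_n, lam_lo=1e-8, lam_hi=1.0,
                           max_iter=50, tol=1e-12):
    """Bisection on lambda^(1) so that the fitted ARGEN has target_n nonzero coefficients.
    fit_argen(lam) is a placeholder returning the coefficient vector for penalty lam."""
    lam = 0.5 * (lam_lo + lam_hi)
    for _ in range(max_iter):
        lam = 0.5 * (lam_lo + lam_hi)
        n_nonzero = np.count_nonzero(fit_argen(lam))
        if n_nonzero == target_n or lam_hi - lam_lo < tol:
            break
        if n_nonzero > target_n:
            lam_lo = lam            # too many active stocks: increase the penalty
        else:
            lam_hi = lam            # too few active stocks: decrease the penalty
    return lam
```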

Table 4 ARGEN vs ARLS tracking portfolio performance in the testing period
[Algorithm 2: bisection search for \(\lambda _n^{(1)}\) in pseudocode]

Next, we consider \({\mathcal {I}} = [0.0082, 0.6]\) and \({\mathcal {I}} = [0.0041, 0.8]\), respectively, and apply ARLS and ARGEN to the corresponding data sets of the selected 30, 50, 70, and 90 stocks to evaluate the ARGEN algorithm. The ARLS is viewed as the baseline. For ARGEN, we search \(\lambda _n^{(1)}\) randomly in the range \((10^{-8}, 5\times 10^{-2})\), a smaller search grid than that in Sect. 4, since a larger range results in more than 50 vanishing coefficients. The \(\lambda _n^{(2)}\) is randomly searched in the range \((10^{-8}, 10^2)\). We take \(w _{{{\text {up}}}} =1\) and \(d _{{{\text {up}}}} =1\). The hyper-parameter tuning process is conducted in the Optuna hyper-parameter optimization framework (Akiba et al 2019); we select the parameter set that has the lowest validation score, measured by MSE, compared with that of ARLS, and then apply it to the testing data set to evaluate and compare the out-of-sample performance of the ARGEN and ARLS portfolios.

6.2 Experimental results

We follow the procedure elaborated in the previous subsection to construct multiple ARGEN and ARLS portfolios with different numbers of stocks and different numbers of hyper-parameter tuning trials. The portfolios' testing performance is summarized in Table 4.

In particular, Table 4 illustrates the performance of ARGEN and ARLS portfolios constructed with different coefficient bounds and numbers of stocks. Across the different portfolio construction configurations, the ARGEN portfolios tend to have lower tracking errors and lower annualized return volatility than the ARLS portfolios, while satisfying the coefficient bound conditions. Although the ARGEN portfolios tend to have lower cumulative returns, these are comparable with the S&P 500 index cumulative return during the same period, except for the 30-stock ARGEN portfolios. Portfolios with the wider constraint range ([0.0041, 0.8]) track better than portfolios with the narrower constraint range ([0.0082, 0.6]), which is expected behavior. Another expected behavior we observe from the results is that the tracking errors decrease as we increase the number of stocks in the portfolios.

7 Conclusion and future perspectives

In this paper, we propose ARGEN for variable selection and regularization. ARGEN linearly combines generalized lasso and ridge penalties, namely \(\textrm{w}_n'|\beta |\) and \(\beta '\Sigma _n\beta \), and it allows arbitrary lower and upper constraints on the coefficients. Many well-known methods, including the (nonnegative) lasso, ridge, and (nonnegative) elastic net, are particular cases of ARGEN. We show that ARGEN has variable selection and estimation consistency subject to some conditions. We propose an algorithm to solve the ARGEN problem by applying multiplicative updates to a quadratic programming problem with a rectangle range and weighted \(l_1\) regularizer (MU-QP-RR-W-\(l_1\)). The algorithm is implemented as a Python library distributed through the PyPI server. The simulations and the index-tracking application present evidence that ARGEN usually outperforms the other methods discussed in the paper, thanks to its flexibility and adaptability, for problems with a small to moderate number of predictors. In problems with a huge number of predictors, although ARGEN should perform best in theory, the computational cost might be high; in this situation ARLEN and ARREN might be better choices. We refer readers to the GitHub repository for full access to the code for the simulation and application parts.

Although in this paper the ARGEN penalty is added to linear models, applying it to other loss functions to improve their performance is a possible direction of future research. Motivated by the index tracking problem, a constraint that guarantees the sum of the weights equals one may be considered as another direction. Asymptotic behavior in law of the ARGEN estimator remains unknown. Most importantly, a more efficient tuning process is needed to apply ARGEN to more complicated problems. All of the above problems are open for future study.