Variable selection and regularization via arbitrary rectangle-range generalized elastic net

We introduce the arbitrary rectangle-range generalized elastic net penalty method, abbreviated ARGEN, for performing constrained variable selection and regularization in high-dimensional sparse linear models. As a natural extension of the nonnegative elastic net penalty method, ARGEN is proved to have both variable selection consistency and estimation consistency under suitable conditions. The asymptotic behavior in distribution of the ARGEN estimators has been studied in this framework. We also propose an algorithm called MU-QP-RR-W-l_1 to efficiently solve the ARGEN problem. Through simulation studies we show that ARGEN outperforms the elastic net in a number of settings. Finally, an application of S&P 500 index tracking with constraints on the stock allocations is presented to provide general guidance for adapting ARGEN to solve real-world problems.

In the sequel, we consider the linear regression model where X is a deterministic n×p design matrix, Y = (y_1, ..., y_n)' is an n×1 response vector, and ε = (ε_1, ..., ε_n)' is a Gaussian noise vector with marginal variance σ². Without loss of generality, we assume all p predictors are real-valued and centered, so the intercept can be ignored. β* ∈ R^p denotes the vector of regression coefficients. The rest of the paper is organized as follows. In Section 2, we discuss the analytical features of ARGEN, its variable selection consistency (Theorem 2.3), estimation consistency (Theorem 2.4), and the estimators' limiting distribution (Theorem 2.6). In Section 3, we propose an efficient algorithm, MU-QP-RR-W-l_1, for solving ARGEN. Approaches we use to speed up hyper-parameter optimization are discussed in Section 4. Simulations comparing the performances of various methods are conducted in Section 5. Section 6 presents an application of ARGEN to the real-world S&P 500 index tracking problem. Section 7 is devoted to the conclusion and discussion of future research. Technical proofs are deferred to the Appendix.

2 The ARGEN

2.1 Definition
In practice it is often natural to assume sparsity in high-dimensional problems. Therefore, in the sequel we assume that the linear model (1.2) is q-sparse, i.e., β* has at most q (q ≪ p) nonzero elements. We intend to cope with the case where there is a control on the range of the coefficients: letting s = (s_1, ..., s_p), t = (t_1, ..., t_p) with s_i ∈ R ∪ {−∞}, t_i ∈ R ∪ {+∞}, and s_i < t_i for all i = 1, ..., p, the optimal coefficients are constrained to lie in a p-dimensional rectangle I := [s, t] ⊂ R^p. To capture the penalty weights of individual features, we introduce w_n = (w_{n,1}, ..., w_{n,p})' as the weights of the coefficients in the l_1 penalty, satisfying w_{n,i} ≥ 0 for i = 1, ..., p. In addition to individual features, a positive semi-definite matrix Σ_n is introduced to represent the penalty weights of interactions between any two features. Consider the linear model (1.2) and let β = (β_1, ..., β_p)' be a vector in R^p. The ARGEN estimator of β is given by

β̂(λ_n^(1), λ_n^(2), w_n, Σ_n) = arg min_{β ∈ I} { ‖Y − Xβ‖_2² + λ_n^(1) w_n'|β| + λ_n^(2) β'Σ_n β }.   (2.1)

Here λ_n^(1), λ_n^(2) ≥ 0 are the tuning parameters which control the importance of the l_1 and l_2 regularization terms, respectively.
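To make the criterion concrete, here is a minimal sketch (not the paper's implementation; the helper `argen_objective` and its argument names are our own) of evaluating the ARGEN objective (2.1) for a candidate β:

```python
import numpy as np

def argen_objective(beta, X, Y, lam1, lam2, w, Sigma):
    """ARGEN objective (2.1): ||Y - X beta||_2^2
    + lam1 * w'|beta| + lam2 * beta' Sigma beta."""
    resid = Y - X @ beta
    return (resid @ resid
            + lam1 * np.dot(w, np.abs(beta))
            + lam2 * beta @ Sigma @ beta)
```

The minimization itself is over the rectangle I and is treated by the algorithm of Section 3; this helper only evaluates the criterion.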
The ARGEN (2.1) naturally extends the elastic net method: it reduces to the elastic net when I = R^p, w_n = (1, ..., 1)', and Σ_n is the identity matrix. Thus ARGEN also extends the lasso and ridge methods by further setting λ_n^(2) = 0 and λ_n^(1) = 0, respectively. In addition, ARGEN becomes the nonnegative elastic net if we replace I = R^p with I = R_+^p := [0, +∞)^p in the elastic net setting.

Variable Selection Consistency
We define the variable selection consistency for the ARGEN as follows. For i = 1, ..., p, we decompose the interval [s_i, t_i] with s_i < t_i into seven disjoint sub-intervals, among which {s_i}\{0} and {s_i} ∩ {0} distinguish whether the left endpoint is zero. Correspondingly, each coefficient of β* belongs to one of the seven groups of values; i.e., for each i = 1, ..., p, there is a unique group containing β*_i. Now for j ∈ {1−, 1+, 2, ..., 6}, denote by S^(j) the set of indexes i for which β*_i belongs to the j-th group of values, and let #S^(j) be the cardinality of that set. Correspondingly, we define S^(j)(λ_n^(1), λ_n^(2), w_n, Σ_n) as the analogous index set obtained from the estimator. The ARGEN is said to have variable selection consistency if there exist λ_n^(1), λ_n^(2), w_n, and Σ_n such that P(S^(j)(λ_n^(1), λ_n^(2), w_n, Σ_n) = S^(j) for all j) → 1 as n → ∞. (2.2) implies that, starting from some n, the estimated and true index sets coincide with high probability. This property includes the variable selection consistency of the nonnegative elastic net and of the elastic net as particular cases. Therefore our definition of variable selection consistency for ARGEN is broader than that for the "free-range" or nonnegative elastic net [28, 29, 32]. For j ∈ {1−, 1+, 2, ..., 6}, let X_(j) = (X_i)_{i∈S^(j)} be the observed predictor values corresponding to the j-th group of indexes. Similarly, let β*_(j) = (β*_i)_{i∈S^(j)}, s_(j) = (s_i)_{i∈S^(j)}, t_(j) = (t_i)_{i∈S^(j)}, w_{n,(j)} = (w_{n,i})_{i∈S^(j)}, and Σ_{n,(j1 j2)} = (Σ_{n,i1,i2})_{i1∈S^(j1), i2∈S^(j2)}. Moreover, let C denote the scaled Gram matrix and Λ_min(C_11) the minimal eigenvalue of C_11. For a vector v = (v_1, ..., v_n), sign(v) := (sign(v_1), ..., sign(v_n)) denotes the vector of signs of the elements of v, and diag(v) denotes the diagonal matrix with diagonal elements v. To show that ARGEN admits the variable selection consistency (2.2), we assume that the stated conditions hold for j ∈ {1−, 1+}. Besides, we assume that the arbitrary rectangle-range elastic irrepresentable condition (AREIC), defined below, is satisfied.

When s = 0, t = +∞, w_n = 1, and Σ_n is the identity matrix, the AREIC becomes the nonnegative elastic irrepresentable condition (NEIC), which was necessary to yield the variable selection consistency of the nonnegative elastic net [33]. If, in addition to (2.12), λ_n^(2) = 0, the NEIC becomes the nonnegative irrepresentable condition (NIC), which was a necessary condition for the variable selection consistency of the nonnegative lasso [29]. Note that the NIC is a nonnegative version of the irrepresentable condition (IC) for the variable selection consistency of the lasso [32]. Although the IC is a sufficient and necessary condition for the variable selection consistency of the lasso [32] while the NIC is only a necessary condition, in practice the NIC is easier to satisfy than the IC, since it does not require the absolute value on the left-hand side of the inequality. As a result, the AREIC is a natural generalization of the previous necessary conditions NEIC and NIC for variable selection consistency. Below we state the first main result of the paper. Its proof is given in Appendix A.

Estimation Consistency
Recall that an estimation method with target parameter β* has the property of estimation consistency if ‖β̂ − β*‖_2 → 0 in probability as n → ∞. We assume the following conditions. (i) β* ∈ I, and p = p_n, q = q_n are non-decreasing as n increases.
(iii) Let X_j be the j-th column of X, which satisfies the stated column condition. (iv) X satisfies the restricted eigenvalue (RE) condition, i.e., there exists a constant κ > 0 such that the RE inequality holds for all n ≥ 1 and all β ∈ I in the stated cone. (v) λ_n^(1), λ_n^(2), w_n, p_n, and q_n satisfy the stated rate condition, where σ > 0 is the residual standard deviation of the ARGEN.
Below we state the estimation consistency of the ARGEN.
Theorem 2.4. Consider a q_n-sparse instance of the ARGEN (2.1). Let X satisfy conditions (i)-(iv) and let the regularization parameters λ_n^(1), λ_n^(2) ≥ 0. Then the ARGEN solution β̂ := β̂(λ_n^(1), λ_n^(2), w_n, Σ_n) satisfies the stated error bound, where σ > 0 denotes the residual standard deviation of the ARGEN. In addition, if (v) holds, we have estimation consistency.

Proof. The main idea of the proof is to transform the ARGEN problem into a rectangle-range lasso problem. With the appropriate augmented design and response, the ARGEN (2.1) can be written as the rectangle-range lasso (2.16). In view of conditions (i)-(iv), all requirements of Corollary 2 in [19] are satisfied. Therefore, applying Corollary 2 in [19] to the lasso (2.16) yields the results. We point out that: (1) based on its proof, Corollary 2 in [19] works for the rectangle-range lasso; (2) there is a typo in the statement of Corollary 2 in [19]: the inequalities (34) in [19] should be corrected accordingly.

Specializing w_n in Theorem 2.4, we easily obtain the estimation consistency condition for the nonnegative lasso (see Proposition 1 in [29]) and the nonnegative elastic net. Note that the estimation consistency of the nonnegative elastic net [28] has not yet been derived, hence we state it below as a corollary of Theorem 2.4. Corollary 2.5. Consider a q_n-sparse nonnegative elastic net model. Assume: (i) β* ≥ 0, and p_n, q_n are non-decreasing as n increases.
(ii) Let X j be the jth column of X which satisfies

If we take λ_n^(2) = 0 and λ_n^(1) = 4σ√(log p_n / n) in Corollary 2.5, we obtain the nonnegative lasso's tail probability control as in Proposition 1 in [29]. If we further drop the nonnegativity constraint in Corollary 2.5, we derive the tail bounds for the lasso (see Corollary 2 in [19]).
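The transformation used in the proof of Theorem 2.4 is the classical elastic-net-to-lasso augmentation. Below is a sketch (the helper name `to_rectangle_lasso` is ours, and we assume the augmented design X̃ = [X; √λ₂ Σ^{1/2}] and response Ỹ = [Y; 0], up to the paper's exact scaling):

```python
import numpy as np

def to_rectangle_lasso(X, Y, lam2, Sigma):
    """Fold the quadratic penalty lam2 * b'Sigma b into an augmented
    least-squares problem so only the weighted l1 term remains."""
    # Symmetric square root of the PSD penalty matrix Sigma.
    vals, vecs = np.linalg.eigh(Sigma)
    root = vecs @ np.diag(np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T
    X_aug = np.vstack([X, np.sqrt(lam2) * root])
    Y_aug = np.concatenate([Y, np.zeros(X.shape[1])])
    return X_aug, Y_aug
```

By construction, ‖Ỹ − X̃β‖₂² = ‖Y − Xβ‖₂² + λ₂ β'Σβ for every β, so the rectangle constraint and the weighted l_1 term carry over unchanged.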

Limiting Distributions of ARGEN Estimators
We now study the asymptotic behavior in distribution of the ARGEN estimators as n → ∞. Again we can make use of the transformation of ARGEN to the rectangle-range lasso model (2.16), since the limiting distributions of the lasso regression estimators have been studied in [10]. Observe that (2.16) is equivalent to (2.17), where X̃ = X/√(2n) and Ỹ = Y/√(2n); (2.17) is then of the type of lasso studied in [10]. Assume that the row vectors of X, denoted by X(i), i = 1, ..., n, satisfy the conditions of [10], where the limiting matrix M is a nonsingular nonnegative definite matrix. It follows from Theorem 2 in [10] that the ARGEN estimator β̂ has the following asymptotic behavior in distribution.
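For the reader's convenience, we recall the form of the lasso limit theorem of [10] that Theorem 2.6 builds on. This is a sketch in our notation; the exact scaling constants should be taken from (2.17), and the minimization is restricted to the feasible directions induced by I:

```latex
\sqrt{n}\,\bigl(\widehat\beta - \beta^*\bigr) \xrightarrow{\;d\;} \arg\min_{u} V(u),
\qquad W \sim \mathcal{N}\!\left(0, \sigma^2 M\right),
\]
\[
V(u) = -2\,u^\top W + u^\top M u
  + \lambda_0 \sum_{i=1}^{p} \Bigl( u_i\,\operatorname{sign}(\beta^*_i)\,
    \mathbf{1}_{\{\beta^*_i \neq 0\}} + |u_i|\,\mathbf{1}_{\{\beta^*_i = 0\}} \Bigr).
```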
Theorem 2.6 includes the asymptotic behaviors of the elastic net and nonnegative elastic net estimators as particular examples. As another particular example, when λ_0 = 0 and p = 1 (so that M is a scalar and I = [s_1, t_1]), by the fact that V is convex, we obtain an explicit expression for arg min_{u∈I} V(u). In the above example, if p ≥ 2 and I ⊊ R^p, arg min_{u∈I} V(u) has no simple explicit expression. Note that computing arg min_{u∈I} V(u) is a quadratic programming problem. In the next section we provide a multiplicative-updates numerical algorithm to solve the ARGEN. This algorithm may be further applied to compute arg min_{u∈I} V(u) numerically.

MU-QP-RR-W-l 1 Algorithm for Solving ARGEN
In this section we provide a solution of ARGEN using an extended multiplicative updates algorithm. Given w_n, Σ_n, and λ_n^(1), λ_n^(2) ≥ 0, the ARGEN in (2.1) can be expressed as the equivalent problem (3.1). To simplify the problem, we rewrite it by a change of variables and obtain an equivalent problem of (3.1), namely (3.2). Sha et al. [21] derived multiplicative updates for solving the nonnegative quadratic programming problem. Their algorithm has been shown to have a simple closed form and a rapid convergence rate. Our problem (3.2), however, contains absolute values and lower and upper limits on the optimization variables, so a direct application of the algorithm in [21] is impractical. Therefore, we propose a new iterative algorithm to solve (3.2), called multiplicative updates for solving quadratic programming with rectangle range and weighted l_1 regularizer (abbreviated MU-QP-RR-W-l_1).
Let us formulate a more general problem, (3.3), that can be solved by MU-QP-RR-W-l_1. Here v, b, d, v^0, l are column vectors of dimension p, where the elements of d and v^0 are nonnegative and the elements of l are positive. The matrix A = (A_{ij})_{1≤i,j≤p} is positive semi-definite. In fact, the nonnegative quadratic programming problem (see, e.g., Equation (5) in [21] or (20) in [28]) is a special case of (3.3), obtained by taking the elements of d and v^0 to be 0 and the elements of l to be infinity. Let us further adopt the following notation. For i, j ∈ {1, ..., p}, we define the positive part A_{ij}^+ := max(A_{ij}, 0) and the negative part A_{ij}^− := max(−A_{ij}, 0) of A_{ij}, and denote the positive and negative parts of the matrix A by A^+ = (A_{ij}^+) and A^− = (A_{ij}^−).

It follows that A = A^+ − A^−.
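As a quick illustration (our own helper, not part of Algorithm 1), the decomposition can be computed entrywise:

```python
import numpy as np

def split_pos_neg(A):
    """Entrywise decomposition A = A_plus - A_minus,
    with both parts nonnegative."""
    A_plus = np.maximum(A, 0.0)
    A_minus = np.maximum(-A, 0.0)
    return A_plus, A_minus
```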
We then present the MU-QP-RR-W-l_1 algorithm in pseudocode in Algorithm 1.
We point out that the conditions r_1 > v_i^0 and r_2 < v_i^0 in (3.4) are mutually exclusive. On the other hand, since d_i ≥ 0, it is obvious that (3.5) and (3.6) are mutually exclusive.
The MU-QP-RR-W-l_1 converges monotonically to the global minimum of the objective function F(·) in (3.3) over the rectangle. This is summarized in the theorem below: for any positive-valued vector v ∈ [0, l], pick a vector U(v) ∈ arg min_{u∈[0,l]} G(u, v). Then U(v) satisfies the stated properties, where U(v) is the updated value of v, presented in the form of (3.4).
The approach we use to prove Theorem 3.1 is similar to the ones used for the Expectation-Maximization algorithm [4], nonnegative matrix factorization [13], and multiplicative updates for nonnegative quadratic programming [21, 22]. More specifically, the proof proceeds in two steps. First, we establish an auxiliary function G(·, ·) in (3.7) to show that MU-QP-RR-W-l_1 monotonically decreases the value of the objective function F(·) in (3.3). Then, we show that the iterative updates (3.4) in Algorithm 1 converge to the global minimum. The complete proof of Theorem 3.1 is provided in Appendix B.
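For intuition, the iteration might be sketched as follows. This is our reading of update (3.4): the objective is taken as F(v) = (1/2)v'Av + b'v + Σᵢ dᵢ|vᵢ − v⁰ᵢ| on [0, l], and the roots r₁, r₂ are reconstructed from the two convex pieces on either side of the kink at v⁰ᵢ (cf. Appendix B), so the exact clipping rules of Algorithm 1 may differ:

```python
import numpy as np

def mu_qp_rr_w_l1(A, b, d, v0, l, v_init, n_iter=500):
    """Sketch of the MU-QP-RR-W-l1 iteration (our reconstruction)."""
    A_plus = np.maximum(A, 0.0)    # positive part A^+
    A_minus = np.maximum(-A, 0.0)  # negative part A^-
    v = np.asarray(v_init, dtype=float).copy()
    for _ in range(n_iter):
        a = np.maximum(A_plus @ v, 1e-12)  # guard against division by zero
        c = A_minus @ v
        b1 = b + d  # effective slope on the side u_i >= v0_i
        b2 = b - d  # effective slope on the side u_i <  v0_i
        # Multiplicative-update roots of the per-coordinate auxiliary
        # function, one for each side of the kink.
        r1 = v * (-b1 + np.sqrt(b1 ** 2 + 4.0 * a * c)) / (2.0 * a)
        r2 = v * (-b2 + np.sqrt(b2 ** 2 + 4.0 * a * c)) / (2.0 * a)
        new_v = np.asarray(v0, dtype=float).copy()  # default: stay at the kink
        hi = r1 > v0
        lo = r2 < v0
        new_v[hi] = np.minimum(r1[hi], l[hi])
        new_v[lo] = np.maximum(r2[lo], 0.0)
        v = new_v
    return v
```

On a separable toy problem the fixed point matches the analytic minimizer, which is what the monotone-convergence statement of Theorem 3.1 guarantees for the true update.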

Hyper-parameter Optimization
ARGEN is a family that includes many well-known linear models with constraints. For instance, in Table 1, ARLS, ARL, ARR, and AREN correspond to the arbitrary rectangle-range least squares, lasso, ridge, and elastic net, respectively. Besides, based on the choice of parameters we can propose other new methods, including ARGL, ARGR, ARLEN, and ARREN, which are applicable to more complicated problems and usually perform better than models with free-range regression coefficients.
However, many of the methods in Table 1 involve a very high-dimensional hyper-parameter space, so grid search for tuning parameters can be computationally expensive. Other tuning approaches such as Bayesian optimization and gradient-based optimization, developed to obtain better results in fewer evaluations, are also very costly when applied to our case, because they easily get stuck around local minima when the complexity of the score surface is relatively high. Since discovering better tuning methods is not a focus of this paper but a potential direction of future research, we simply use random search (N_calls trials) to avoid costly searching over the entire grid. Following the convention in [35] and [24], we use the mean-squared error (MSE) as the score function, given in (4.1). To further speed up the tuning process, we restrict the values of each hyper-parameter as follows. λ_n^(1) and λ_n^(2) take integer values in the sets Λ^(1) and Λ^(2), respectively, where for each i = 1, 2, Λ^(i) = {0, 1, ..., λ_up^(i)} for some λ_up^(i). The matrix Σ_n can be decomposed as Σ_n = PDP' with an orthogonal matrix P and a nonnegative diagonal matrix D; therefore, values of Σ_n are considered with diagonal entries in {0, 1, ..., d_up} for some d_up ∈ Z_+ and a fixed orthogonal matrix P. The cardinalities of Λ^(1), Λ^(2), W, and Σ are λ_up^(1) + 1, λ_up^(2) + 1, (w_up + 1)^p, and (d_up + 1)^p, respectively; thus we obtain the total number of values on the parameter grid for each method listed in Table 2.
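A minimal random-search loop of the kind described can be sketched as follows (the helper and its names are ours; in practice the score function would run the full fit-and-validate pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_search(score, grids, n_calls=100):
    """Sample n_calls random hyper-parameter combinations from the
    given grids and keep the one with the lowest validation MSE."""
    best, best_score = None, np.inf
    for _ in range(n_calls):
        params = {name: rng.choice(values) for name, values in grids.items()}
        s = score(params)
        if s < best_score:
            best, best_score = params, s
    return best, best_score
```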
To improve the performance of the methods, we can: (1) randomly choose more trials, i.e., increase N_calls, to cover more values on the grid, since theoretically searching the whole grid gives the best result; (2) increase the values of w_up and d_up, in which case N_calls also needs to be increased, since higher values of w_up and d_up exponentially enlarge the grid.

Signal Recovery
The purpose of the signal recovery examples below is to explore the best possible performance of ARGEN, and to show ARGEN's ability to "reduce noise" and deal with high-dimensional (p ≫ n) sparse signals. First, we consider the same problem as in [17] to compare our results with theirs. In the following, we briefly outline the problem. A sparse signal β* ∈ R^4096 with 160 spikes of amplitude 1 is generated and plotted in Figure 1 (top); this is the true regression coefficient vector. The design matrix X ∈ R^{1024×4096} is generated with each entry sampled from the i.i.d. standard normal distribution and each pair of rows orthogonalized. The response vector Y ∈ R^1024 is then generated through Y = Xβ* + ε, where ε ∈ R^1024 is a vector of i.i.d. Gaussian noise with zero mean and variance 0.1. The lower and upper bounds of β are −1 and 1, respectively. Given λ_n^(2) = 0, and w_{n,i} = 0 if β*_i = 0 for i = 1, ..., 4096, we obtain the recovered signal β̂ and its difference from the true signal in the middle and bottom plots of Figure 1. As a result, ARGEN achieves a lower MSE of 0.00069 compared with the MSE of 0.00273 in [17].
To show that ARGEN can deal with more complicated problems, we take another signal recovery problem as an example. We follow the same settings as in the previous problem, but this time replace the amplitude of each spike by a random value drawn from a uniform distribution over [0, 1). The corresponding true and recovered signals and their difference are plotted in Figure 2. The MSE obtained by ARGEN is 0.00166.

Methods Comparison
In this section, we compare the performances of the methods listed in Table 1. We adopt the following setup to tune the four hyper-parameters: λ_up = 100, w_up = 2, d_up = 2. Considering the computational cost, the number of random values (N_calls) on the grid to try is 100 for ARL and ARR, 500 for AREN, 1280 for ARGL and ARGR, 2560 for ARLEN and ARREN, and 6554 for ARGEN.
We conduct eight examples to test the performance of each method. In each example, we simulate 50 datasets from Y = Xβ* + ε, ε ∼ N(0, σ²), and each dataset consists of independent training, validation, and testing sets. We use the training set to fit models; parameters are tuned on the validation set; the test error, measured by the MSE (4.1), is computed on the testing set. In the following, we outline these examples.
Example 2 is the same as Example 1, except that each entry of β* is replaced with 0.85. In Example 3, let σ = 15, p = 40, and the pairwise correlation between X_i and X_j be 0.5 for all i ≠ j. We use 100 observations for training, 100 for validation, and 400 for testing.
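A design with constant pairwise correlation 0.5, as in Example 3, can be simulated along the following lines (our own helper; the paper's exact generator may differ):

```python
import numpy as np

def correlated_design(n, p, rho, sigma_noise, beta, rng):
    """Simulate X with equal pairwise correlation rho
    and Y = X beta + eps."""
    cov = np.full((p, p), rho) + (1.0 - rho) * np.eye(p)
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    Y = X @ beta + rng.normal(0.0, sigma_noise, size=n)
    return X, Y
```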
Example 6 is the same as Example 1, except that each entry of β* is replaced with a randomly generated number in [−5, 5], and this value is used for all 50 datasets. Besides, we restrict β_i ∈ [−5, 5] for all i.
Example 8 is the same as Example 4, but uses 5 observations for training, 5 for validation, and 50 for testing. Besides, we restrict β_i ≥ −1000 for all i.
The first three examples above are from [35] and [24], and were originally constructed for the lasso. The fourth example is similar to that in [35], which creates a grouped-variable situation. None of the first four examples, however, requires constraints on the lower or upper bound of the coefficients. To show and test that ARGEN is applicable to more general and complicated problems, we add four more examples, Examples 5 to 8. In each of Examples 5, 6, and 8, constraints are added and include the true coefficients. In Example 7, we provide a case where the true coefficients lie outside the interval constraints. The values 1000 and 5 were chosen arbitrarily to illustrate the model's ability to work with constrained coefficients. Moreover, another purpose of the last example is to test model performance in high-dimensional (p ≥ n) scenarios. Table 3 summarizes the median MSE and its corresponding standard error over the 50 datasets using each method in Table 1 for each of the above eight examples. The MSEs of examples with different σ are not comparable because they are simulated with different noise variances. Some of our examples do, however, share a similar simulation process, and their MSEs are at the same level. For instance, Examples 1 and 5 are similar except that Example 5 has a lower limit of −1000 on the coefficients, whereas Example 1 has no limit. As a result of these lower constraints, Example 5 tends to force coefficients above the lower limit, resulting in relatively higher MSEs than Example 1. In addition to cross-example comparisons, it is more meaningful to compare the MSEs across methods for each example. The overall performance of methods with more parameters is better than that of methods with fewer parameters. More specifically, ARGL, ARGR, ARLEN, ARREN, and ARGEN in most cases outperform ARLS, ARL, ARR, and AREN. For instance, ARGEN, the most complicated method, which includes all four hyper-parameters, performs best in Examples 1, 2, 5, 6, and 7. ARLEN and ARREN, the second most complex, provide the second-highest accuracy in Examples 1, 2, 3, 4, 5, 7, and 8. ARGL and ARGR are at the third level of performance in Examples 1, 2, 5, 6, and 7. However, in Table 3, performance does not always improve as the model gets more complicated. This is because the ratio of the number of values searched (N_calls) to the total number of values on the grid is not the same for all methods, due to the exponential increase in the size of the grid as more hyper-parameters are included. It is also because we keep the same N_calls for each method across all examples, which in fact have different dimensionality. Therefore, the performance of methods like ARLEN, ARREN, and ARGEN is worse than expected in some of the examples.
6 Real World Application -S&P 500 Index Tracking

Outline
Index tracking is a passive management strategy that replicates the return of a market index (e.g., the S&P 500 in New York or the FTSE 100 in London) by constructing an equity portfolio that contains only a subset of the index constituents, in order to reduce transaction and management costs [3, 7-9, 11, 14, 26].
In this section, we show how ARGEN applies to index tracking, an asset allocation [16] problem with allocation constraints in the financial field, and compare the results with those of the nonnegative lasso [29] and the nonnegative elastic net [28]. Through this example, (1) we provide general practical guidance for adapting ARGEN to solve real-world problems; (2) we demonstrate ARGEN's feasibility and flexibility compared to the existing methods. In particular, we highlight that ARGEN can deal with problems that require constraints on the coefficients, while none of the existing methods [28, 29] can.
In portfolio management, "how close is the constructed tracking portfolio's return to that of the benchmark index" is a primary measurement for assessing portfolio performance for passive strategies such as index tracking. Hence, inspired by [20], we evaluate the tracking portfolio performance from the following three perspectives. Our primary performance measurement is the tracking error (TE), which measures the volatility of the excess return of a portfolio over the corresponding benchmark. We also compute the annual return volatility (ARV), which measures the annualized return volatility of a portfolio. In addition, we report the cumulative return of the constructed portfolios in our study. Here r_t^p denotes the portfolio return at time t, r_t^b is the benchmark return at time t, and T is the total number of periods.
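Under the common 252-trading-day annualization convention (our assumption; the exact definitions in [20] may differ), the three measures can be computed as:

```python
import numpy as np

def tracking_error(r_p, r_b, periods_per_year=252):
    """Annualized volatility of the excess return r_p - r_b."""
    return np.std(r_p - r_b, ddof=1) * np.sqrt(periods_per_year)

def annual_return_volatility(r_p, periods_per_year=252):
    """Annualized volatility of the portfolio return."""
    return np.std(r_p, ddof=1) * np.sqrt(periods_per_year)

def cumulative_return(r_p):
    """Compounded return over the whole period."""
    return np.prod(1.0 + r_p) - 1.0
```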
Because there is no guarantee that the normalized β̂ is still less than t, we introduce the following normalization process, which constrains the way the lower and upper limits of I are chosen. Recall that s_i and t_i are the lower and upper bounds of the coefficient β_i. To guarantee that the portfolio weight (i.e., the normalized β̂_i), denoted by β̃_i, for stock i satisfies 0 ≤ β̃_i ≤ t_i ≤ 1, we need s_i and t_i to satisfy the stated conditions. In the special case of s_i = s_0 and t_i = t_0 for all i, we choose the lower and upper bounds accordingly. We use 5 years (from February 19, 2016 to February 18, 2021) of historical daily prices (1259 data points) of the S&P 500 as our benchmark index, together with those of its constituent equities. Because the list of S&P 500 constituents is updated regularly by S&P Dow Jones Indices LLC, we only include the daily prices of the 377 stocks that were not changed during the period of interest. In the linear model (1.2), Y is the vector of daily returns of the S&P 500 index and the columns of X are the daily returns of these stocks. To follow a buy-and-hold investment strategy, we split the data into training, validation, and testing sets. The training and validation sets consist of the first 252 data points (12 months), 20% of which are in the validation set. The remaining 1006 data points form the testing set. In addition, we construct a long-only portfolio by ensuring the lower bound s assumes only nonnegative values.
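A sketch of the normalization step with a bound check (the helper and the tolerance are ours; the paper's exact conditions on s_i and t_i are given above):

```python
import numpy as np

def normalize_weights(beta, t):
    """Normalize coefficients to portfolio weights summing to one,
    then verify the upper bounds still hold after normalization."""
    total = beta.sum()
    if total <= 0:
        raise ValueError("coefficients must have a positive sum")
    weights = beta / total
    if np.any(weights > t + 1e-12):
        raise ValueError("normalized weight exceeds its upper bound t_i")
    return weights
```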
In the following, we outline our procedure for the index tracking problem. First, we target selecting N individual stocks to construct the tracking portfolio. The number N is within the normal range of the number of stocks held to reduce risk exposure and avoid unnecessary transaction costs. In other words, we constrain the number of nonzero elements in β̂ to N. Thus we use the bisection search [28] in Algorithm 2 to determine the optimal λ_n^(1) within the range (10^-8, 5 × 10^-2), which is a smaller search range than in Section 4, since a larger range results in over 50 vanishing coefficients. The λ_n^(2) is randomly searched in the range (10^-8, 10^2). We take w_up = 1 and d_up = 1. The hyper-parameter tuning process is conducted in the Optuna hyper-parameter optimization framework [1]; we select the parameter set with the lowest validation score, measured by MSE, compared with that of ARLS, and then apply it on the testing data set to evaluate and compare the out-of-sample performances of the portfolios obtained using ARGEN and ARLS.
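The bisection on λ_n^(1) can be sketched as follows (our own mock interface: `fit(lam1)` stands in for solving the ARGEN at a given penalty level and returning the fitted coefficient vector):

```python
import numpy as np

def bisect_lambda(fit, n_target, lam_lo=1e-8, lam_hi=5e-2, max_iter=50):
    """Bisection on lam1: a larger penalty yields fewer nonzero
    coefficients, so the nonzero count is monotone in lam1."""
    lam = 0.5 * (lam_lo + lam_hi)
    for _ in range(max_iter):
        lam = 0.5 * (lam_lo + lam_hi)
        n_nonzero = np.count_nonzero(fit(lam))
        if n_nonzero == n_target:
            return lam
        if n_nonzero > n_target:
            lam_lo = lam  # penalty too weak: too many stocks selected
        else:
            lam_hi = lam  # penalty too strong: too few stocks selected
    return lam
```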

Experimental Results
We follow the procedure elaborated in the previous section to construct multiple ARGEN and ARLS portfolios with different numbers of stocks and different numbers of hyper-parameter tuning trials. The portfolios' testing performance is summarized in Table 4.
In particular, Table 4 illustrates the performance of ARGEN and ARLS portfolios constructed with different coefficient boundaries and numbers of stocks. Across different portfolio construction configurations, the ARGEN portfolios tend to have lower tracking errors and annualized return volatility than the ARLS portfolios, while satisfying the coefficient boundary conditions. Even though the ARGEN portfolios tend to have lower cumulative returns, these are comparable with the S&P 500 index cumulative return during the same period, except for the 30-stock ARGEN portfolios. Portfolios with a wider constraint range ([0.0041, 0.8]) track better than portfolios with a narrower constraint range ([0.0082, 0.6]), which is expected behavior. Another expected behavior we observe from the results is that the tracking errors decrease as we increase the number of stocks in the portfolios.

Table 4: ARGEN vs. ARLS tracking portfolio performance in the testing period. All parameter sets achieved the best score in the validation period using a relatively large number (30000) of hyper-parameter tuning trials. The S&P 500 index realizes a cumulative return (CR) of 66.4697% and an annualized return volatility (ARV) of 20.6233% in the testing period.

Conclusion and Future Perspectives
In this paper, we propose the ARGEN for variable selection and regularization. ARGEN linearly combines generalized lasso and ridge penalties, namely w_n'|β| and β'Σ_n β, and it allows arbitrary lower and upper constraints on the coefficients. Many well-known methods, including the (nonnegative) lasso, ridge, and (nonnegative) elastic net, are particular cases of ARGEN. We show that ARGEN has variable selection and estimation consistency subject to some conditions. We propose an algorithm to solve the ARGEN problem by applying multiplicative updates to a quadratic programming problem with a rectangle range and weighted l_1 regularizer (MU-QP-RR-W-l_1). The algorithm is implemented as a Python library distributed through the PyPI server. The simulations and the index-tracking application provide evidence that ARGEN usually outperforms the other methods discussed in the paper, due to its flexibility and adaptability, for problems with a small to moderate number of predictors. In problems with a huge number of predictors, although ARGEN should perform best theoretically, the tuning cost might be high; in this situation ARLEN and ARREN might be better choices. We refer readers to the GitHub repository for full access to the code for the simulation and application parts. Although in this paper the ARGEN penalty is added to linear models, it could be applied to other loss functions to improve their performance, as a direction of future research. Motivated by the index tracking problem, a constraint that guarantees the sum of the weights equals one may be considered as another direction. The asymptotic behavior in law of the ARGEN estimator in the growing-dimension regime remains unknown. Most importantly, a more efficient tuning process is required to apply ARGEN to more complicated problems. All of the above problems are open for future study.

A Proof of Theorem 2.3
Without loss of generality, to simplify the notation we assume that S^(1−), S^(1+), S^(2), ..., S^(6), taken in order, partition (1, ..., p). In addition, we follow the notation of Section 2.2. In order to show that ARGEN admits variable selection consistency, we first need to show that the following result holds.
By Markov's inequality and Lemma A.2, we obtain the desired bound. Then by (B.1), we have an expression which is bounded above by G(u, v) using (B.3) and (B.4). Next, to show that (3.9) and (ii) hold, it suffices to prove that U(v) is the unique vector in [0, l] that minimizes G(·, v), and that the mapping U(·) has the form (3.4). Observe that G(u, v) can be rewritten as a sum of marginal terms G_i(u_i) plus a term independent of u. Because the term (1/2) Σ_{1≤i,j≤p} A_{ij}^− v_i v_j is independent of u, the minimization of G(u, v) over u can be accomplished by minimizing G_i(u_i) over the marginal variable u_i, for each i = 1, ..., p. Fixing i ∈ {1, ..., p}, below we minimize G_i(u_i) over u_i ∈ [0, l_i] in two cases.
First we consider the case u_i ∈ [0, l_i] ∩ [v_i^0, +∞). In this case G_i(u_i) is differentiable over R. Taking the first and second derivatives of G_i(·), we note that, since u, v ≥ 0 and A is strictly positive definite, Σ_{j=1}^p A_{ij}^+ v_j and Σ_{j=1}^p A_{ij}^− v_j cannot simultaneously equal 0. Therefore G_i''(u_i) > 0 for all u_i ∈ [0, +∞), and hence G_i(·) is strictly convex over [0, +∞). The minimum of G_i(·) over [0, +∞) is then attained at the unique critical point r_1 such that G_i'(r_1) = 0, which yields the stated expression. It follows that the minimum of G_i(·) over [0, l_i] ∩ [v_i^0, +∞) is obtained by restricting r_1 to this interval. Next, we discuss the other case, u_i ∈ [0, l_i] ∩ [0, v_i^0). Taking the first and second derivatives again, because G_i(·) is strictly convex over [0, +∞), the minimum of G_i(·) over [0, +∞) is uniquely attained at the critical point r_2 such that G_i'(r_2) = 0. It follows that the minimum of G_i(·) over [0, l_i] ∩ [0, v_i^0) is obtained by restricting r_2 accordingly. Denote by U : R^p → R^p the resulting update map, defined for i = 1, ..., p as in (B.7).

‖β̂ − β*‖_2 → 0 in probability as n → ∞, where ‖·‖_2 denotes the Euclidean norm and the convergence is in probability. Besides the variable selection consistency, ARGEN admits estimation consistency, subject to the following conditions.

λ_n^(1) values that produce the right number of nonzero coefficients, given N = 30, 50, 70, 90, respectively, with λ_n^(2) = 0, w_n having equal elements, and I = [0, +∞)^p. Hence we obtain the 50 stocks selected by the model. This first process proceeds on the training and validation sets.

Table 1: Particular examples of ARGEN methods and their parameter settings. 1/p is the value of each element of w_n. I represents the identity matrix in R^{p×p}. Parameters with no value assigned are hyper-parameters. AR stands for arbitrary rectangle-range.

Table 2: Tuning grid for different methods.

Table 3: Median MSE and the corresponding standard error (given in parentheses) over 50 replications for each method and example.