COMBSS: Best Subset Selection via Continuous Optimization

The problem of best subset selection in linear regression is considered with the aim of finding a fixed-size subset of features that best fits the response. This is particularly challenging when the total number of available features is very large compared to the number of data samples. Existing optimal methods for solving this problem tend to be slow, while fast methods tend to have low accuracy. Ideally, a new method would perform best subset selection faster than existing optimal methods at comparable accuracy, or be more accurate than methods of comparable computational speed. Here, we propose a novel continuous optimization method that identifies a subset solution path: a small set of models of varying size that contains candidates for the single best subset of features, optimal in a specific sense in linear regression. Our method turns out to be fast, making best subset selection possible even when the number of features is well in excess of thousands. Because of this outstanding overall performance, framing the best subset selection challenge as a continuous optimization problem opens new research directions for feature extraction in a large variety of regression models.


Introduction
Recent developments in information technology have enabled the collection of high-dimensional complex data in engineering, economics, finance, biology, health sciences, and other fields [9]. In high-dimensional data, the number of features is large and often far higher than the number of collected data samples. In many applications, it is desirable to find a parsimonious best subset of predictors so that the resulting model has desirable prediction accuracy [26, 10, 25]. This article recasts the challenge of best subset selection in linear regression as a novel continuous optimization problem. We show that this reframing has enormous potential and substantially advances research into larger dimensional and exhaustive feature selection in regression, making available technology that can reliably and exhaustively select variables when the total number of variables is well in excess of thousands.
Here, we aim to develop a best subset selection method that is faster than existing exhaustive methods while having comparable accuracy, or that is more accurate than other methods of comparable computational speed.
Consider the linear regression model of the form y = Xβ + ϵ, where y = (y_1, ..., y_n)^T is an n-dimensional known response vector, X is a known design matrix of dimension n × p with x_{i,j} indicating the ith observation of the jth explanatory variable, β = (β_1, ..., β_p)^T is the p-dimensional vector of unknown regression coefficients, and ϵ = (ϵ_1, ..., ϵ_n)^T is a vector of unknown errors, unless otherwise specified assumed to be independent and identically distributed. Best subset selection is a classical problem that aims to find a so-called best subset solution path [e.g., see 26, 21] by solving, for a given k,

    min_{β ∈ R^p} ∥y − Xβ∥_2^2  subject to  ∥β∥_0 ≤ k,    (1)

where ∥·∥_2 is the L_2-norm, ∥β∥_0 = Σ_{j=1}^p I(β_j ≠ 0) is the number of non-zero elements in β, and I(·) is the indicator function; the best subset solution path is the collection of the best subsets as k varies from 1 to p. For ease of presentation, we assume that all columns of X are subject to selection, but generalizations are immediate (see Remark 2 for more details).
Exact methods for solving (1) are typically executed by first writing solutions for low-dimensional problems and then selecting the best solution over these. To see this, for any binary vector s = (s_1, ..., s_p)^T ∈ {0, 1}^p, let X_[s] be the matrix of size n × |s|, with |s| = Σ_{j=1}^p s_j, created by keeping only the columns j of X for which s_j = 1, where j = 1, ..., p. Then, for any k, in the exact best subset selection, an optimal s can be found by solving the problem

    min_{s ∈ {0,1}^p : |s| = k} ∥y − X_[s] β_[s]∥_2^2,    (2)

where β_[s] is the low-dimensional least squares estimate of the elements of β with indices corresponding to the non-zero elements of s, given by

    β_[s] = X_[s]^† y,    (3)

where A^† denotes the pseudo-inverse of a matrix A. Both (1) and (2) are essentially solving the same problem, because β_[s] is the least squares solution under the constraint that I(β_j ≠ 0) = s_j for all j = 1, ..., p.
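To make the exact formulation concrete, the following sketch enumerates all column subsets of a given size k and fits least squares on each via the pseudo-inverse, as in (2) and (3). This is feasible only for small p; the function and variable names are ours, not from the paper's implementation.

```python
import itertools
import numpy as np

def best_subset_exact(X, y, k):
    # Brute-force solution of (2): enumerate all subsets of size k,
    # fit least squares on the kept columns, and keep the smallest RSS.
    n, p = X.shape
    best_rss, best_cols = np.inf, None
    for cols in itertools.combinations(range(p), k):
        Xs = X[:, cols]                      # X_[s]: columns with s_j = 1
        beta_s = np.linalg.pinv(Xs) @ y      # beta_[s] = X_[s]^† y
        rss = np.sum((y - Xs @ beta_s) ** 2)
        if rss < best_rss:
            best_rss, best_cols = rss, cols
    s = np.zeros(p, dtype=int)
    s[list(best_cols)] = 1
    return s, best_rss
```

The combinatorial loop is exactly why exact methods break down for large p: the number of subsets of size k grows as p choose k.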
It is well known that solving the exact optimization problem (1) is in general non-deterministic polynomial-time hard [27]. For instance, a popular exact method called leaps-and-bounds [12] is currently practically useful only for values of p smaller than 30 [30]. To overcome this difficulty, the relatively recent method of [2] elegantly reformulates the best subset selection problem (1) as a mixed integer optimization and demonstrates that the problem can be solved for p much larger than 30 using modern mixed integer optimization solvers such as the commercial software Gurobi [14] (which is not freely available except for an initial short period). As the name suggests, the mixed integer optimization formulation has both continuous and discrete constraints. Although the mixed integer optimization approach is faster than the exact methods for large p, its implementation via Gurobi remains slow from a practical point of view [17].
Due to the computational constraints of mixed integer optimization, other popular existing methods for best subset selection are still very common in practice; these include forward stepwise selection, the least absolute shrinkage and selection operator (generally known as the Lasso), and their variants. Forward stepwise selection follows a greedy approach, starting with an empty model (or intercept-only model) and iteratively adding the variable that is most suitable for inclusion [8, 19]. On the other hand, the Lasso [32] solves a convex relaxation of the highly non-convex best subset selection problem by replacing the discrete L_0-norm ∥β∥_0 in (1) with the L_1-norm ∥β∥_1. This clever relaxation makes the Lasso fast, significantly faster than mixed integer optimization solvers. However, it is important to note that Lasso solutions typically do not yield the best subset solution [17, 37] and in essence solve a different problem than exhaustive best subset selection approaches. In summary, there exists a trade-off between speed and accuracy when selecting an existing best subset selection method.
With the aim to develop a method that performs best subset selection as fast as the existing fast methods without compromising the accuracy, in this paper, we design COMBSS, a novel continuous optimization method towards best subset selection.
Our continuous optimization method can be described as follows. Instead of the binary vector space {0, 1}^p as in the exact methods, we consider the whole hypercube [0, 1]^p and, for each t ∈ [0, 1]^p, a new estimate β_t (defined later in Section 2), so that we have the following well-defined continuous extension of the exact problem (2):

    min_{t ∈ [0,1]^p} ∥y − X_t β_t∥_2^2/n + λ Σ_{j=1}^p t_j,    (4)

where X_t is obtained from X by multiplying the jth column of X by t_j for each j = 1, ..., p, and the tuning parameter λ controls the sparsity of the solution obtained, analogous to selecting the best k in the exact optimization. Our construction of β_t guarantees that ∥y − X_s β_s∥_2 = ∥y − X_[s] β_[s]∥_2 at the corner points s of the hypercube [0, 1]^p, and the new objective function ∥y − X_t β_t∥_2^2 is smooth over the hypercube.
While COMBSS aims to find sets of models that are candidates for the best subset of variables, an important property is that it has no discrete constraints, unlike the exact optimization problem (2) or the mixed integer optimization formulation. As a consequence, our method can take advantage of standard continuous optimization methods, such as gradient descent, by starting at an interior point of the hypercube [0, 1]^p and iteratively moving towards a corner that minimizes the objective function. See Fig. 1 for an illustration of our method. In the implementation, we transform the box-constrained problem (4) into an equivalent unconstrained problem so that the gradient descent method can run without experiencing boundary issues.
The rest of the paper is organized as follows: In Section 2, we describe the mathematical framework of the proposed method COMBSS. In Section 3, we first establish the continuity of the objective functions involved in COMBSS, and then we derive expressions for their gradients, which are exploited for conducting continuous optimization. Complete details of the COMBSS algorithm are presented in Section 4. In Section 5, we discuss the roles of the tuning parameters that control the surface shape of the objective functions and the sparsity of the solutions obtained. Section 6 provides steps for efficient implementation of COMBSS using some popular linear algebra techniques. Simulation results comparing COMBSS with existing popular methods are presented in Section 7. We conclude the paper with some brief remarks in Section 8. Proofs of all our theoretical results are provided in Appendix A.

[Figure 1: The objective function (4) for different values of the parameter λ. In each of the three plots, the curve (in red) shows the execution of a basic gradient descent algorithm that, starting at the initial point t_init = (0.5, 0.5)^T, converges towards the best subsets of sizes 0, 1, and 2, respectively.]

Continuous Extension of Best Subset Selection Problem
To see our continuous extension of the exact best subset selection optimization problem (2), for t = (t_1, ..., t_p)^T ∈ [0, 1]^p, define T_t = Diag(t), the diagonal matrix with diagonal elements t_1, ..., t_p, and let X_t = X T_t. With I denoting the identity matrix of appropriate dimension, for a fixed constant δ > 0, define

    L_t = (X_t^T X_t + δ (I − T_t^2))/n,    (5)

where we suppress δ in the notation for ease of reading. Intuitively, L_t can be seen as a 'convex combination' of the matrices (X^T X)/n and δI/n, because X_t^T X_t = T_t X^T X T_t and thus

    L_t = T_t (X^T X/n) T_t + (I − T_t^2)(δ/n).    (6)

Using this notation, now define

    β_t = L_t^† X_t^T y/n.    (7)

We need L_t^† in (7) so that β_t is defined for all t ∈ [0, 1]^p. However, from the way we conduct optimization, we need to compute β_t only for t ∈ [0, 1)^p. We later show in Theorem 1 that for all t ∈ [0, 1)^p, L_t is invertible, and thus in the implementation of our method β_t always takes the form β_t = L_t^{-1} X_t^T y/n, eliminating the need to compute any computationally expensive pseudo-inverse.
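The construction above can be sketched in a few lines; the names are ours and δ is passed explicitly. At a binary corner t = s, the returned vector has zeros at the coordinates with s_j = 0 and the subset least-squares fit elsewhere, in line with Theorem 2.

```python
import numpy as np

def beta_t(t, X, y, delta):
    # beta_t = L_t^{-1} X_t^T y / n with L_t = (X_t^T X_t + delta (I - T_t^2))/n.
    # Valid for t in [0,1)^p, where Theorem 1 guarantees invertibility (and at
    # binary corners whenever the selected columns have full rank).
    n, p = X.shape
    Xt = X * t                               # column j of X scaled by t_j
    Lt = (Xt.T @ Xt + delta * (np.eye(p) - np.diag(t ** 2))) / n
    return np.linalg.solve(Lt, Xt.T @ y / n)
```

At a corner such as s = (1, 0, 1, 0)^T, the rows of L_s indexed by the zero coordinates reduce to δ/n on the diagonal, forcing the corresponding entries of β_s to zero.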
With the support of these observations, an immediate well-defined generalization of the best subset selection problem (1) is

    min_{t ∈ [0,1]^p} ∥y − X_t β_t∥_2^2/n  subject to  Σ_{j=1}^p t_j ≤ k.    (8)

Instead of solving the constrained problem (8), by defining the Lagrangian-type function

    f_λ(t) = ∥y − X_t β_t∥_2^2/n + λ Σ_{j=1}^p t_j,    (9)

for a tunable parameter λ > 0, we aim to solve

    min_{t ∈ [0,1]^p} f_λ(t).    (10)

By defining g_λ(w) = f_λ(t(w)), we reformulate the box-constrained problem (10) into an equivalent unconstrained problem

    min_{w ∈ R^p} g_λ(w),    (11)

where the mapping t = t(w) is

    t(w) = 1 − exp(−w ⊙ w).    (12)

The unconstrained problem (11) is equivalent to the box-constrained problem (10) because the map (12) sends R^p onto [0, 1)^p and f_λ is continuous up to the boundary of the hypercube.

Remark 1. The non-zero parameter δ is important in the expression of the proposed estimator β_t in (7), not only to make L_t invertible for t ∈ [0, 1)^p, but also to make the surface of f_λ(t) have smooth transitions from one corner to another over the hypercube. For example, consider the case δ = 0 in a situation where X^T X is invertible. Then, for any interior point t ∈ (0, 1)^p, since T_t^{-1} exists, the optimal solution to min_β ∥y − X_t β∥_2^2/n after some simplification is T_t^{-1} (X^T X)^{-1} X^T y. As a result, the corresponding minimum loss is ∥y − X (X^T X)^{-1} X^T y∥_2^2/n, which is constant for all t over the interior of the hypercube. Hence, the surface of the loss function would have jumps at the borders while being flat over the interior of the hypercube. Clearly, such a loss function is not useful for conducting continuous optimization.
Remark 2. The proposed method and the corresponding theoretical results presented in this paper easily extend to linear models with an intercept term. More generally, if we want to keep some features in the model, say features j = 1, 2, and 4, then we enforce t_j = 1 for j = 1, 2, 4, and conduct subset selection only over the remaining features by taking t = (1, 1, t_3, 1, t_5, ..., t_p)^T and optimizing over t_3, t_5, ..., t_p.
Remark 3. From the definition, for any t, we can observe that β_t is the solution of

    min_{β ∈ R^p} ∥y − X_t β∥_2^2/n + (δ/n) β^T (I − T_t^2) β,

which can be seen as the well-known Tikhonov regularization. Since the solution β_t does not change even if the penalty λ Σ_{j=1}^p t_j is added to the objective function above, in future work we can consider the joint optimization of this penalized objective over (β, t) as an alternative to (10). This formulation allows us to use block coordinate descent, an iterative method where in each iteration the optimal value of β is obtained for the current t using (7), and an optimal value of t is then obtained given that β.
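The Tikhonov view of Remark 3 can be checked numerically. Assuming the penalty matrix is δ(I − T_t^2) (our reconstruction from the definition of L_t; the normal equations of this ridge problem are exactly L_t β = X_t^T y/n), the two solutions coincide:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, delta = 50, 4, 50.0
X = rng.standard_normal((n, p))
y = X @ np.array([1.0, 0.0, 2.0, 0.0]) + 0.1 * rng.standard_normal(n)
t = np.array([0.2, 0.5, 0.8, 0.4])

Xt = X * t
Lt = (Xt.T @ Xt + delta * (np.eye(p) - np.diag(t ** 2))) / n
beta_from_def = np.linalg.solve(Lt, Xt.T @ y / n)      # definition (7)

# Tikhonov solution with penalty matrix P = delta * (I - diag(t^2)); its
# normal equations (X_t^T X_t + P) beta = X_t^T y are n * (L_t beta = X_t^T y / n)
P = delta * (np.eye(p) - np.diag(t ** 2))
beta_tikhonov = np.linalg.solve(Xt.T @ Xt + P, Xt.T @ y)
# beta_from_def and beta_tikhonov agree
```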

Continuity and Gradients of the Objective Function
In this section, we first prove that the objective function g_λ(w) of the unconstrained optimization problem (11) is continuous on R^p, and then we derive its gradient. En route, we also establish the relationship between β_[s] and β_t, which are respectively defined by (3) and (7). This relationship is useful in understanding the connection between our method and the exact optimization (2).
Theorem 1 shows that for all t ∈ [0, 1)^p, the matrix L_t, defined in (5), is symmetric positive-definite and hence invertible.
Theorem 2. For any s ∈ {0, 1}^p, (β_s)_+ = β_[s] and (β_s)_0 = 0, where (β_s)_+ (respectively, (β_s)_0) denotes the vector constructed from β_s by keeping only the elements with indices j such that s_j = 1 (respectively, s_j = 0).

As an immediate consequence of Theorem 2, we have ∥y − X_s β_s∥_2 = ∥y − X_[s] β_[s]∥_2. Therefore, the objective function of the exact optimization problem (2) is identical to the objective function of its extended optimization problem (8) (with λ = 0) at the corner points s ∈ {0, 1}^p.
Our next result, Theorem 3, shows that f_λ(t) is a continuous function on [0, 1]^p.

Theorem 3. The function f_λ(t) defined in (9) is continuous over [0, 1]^p in the sense that for any sequence t^(1), t^(2), ... ∈ [0, 1)^p converging to t ∈ [0, 1]^p, the limit lim_{l→∞} f_λ(t^(l)) exists and equals f_λ(t).

Corollary 1 establishes the continuity of g_λ on R^p. This is a simple consequence of Theorem 3, because from the definition, g_λ(w) = f_λ(t(w)) with t(w) = 1 − exp(−w ⊙ w). Here and afterwards, in an expression with vectors, 1 denotes a vector of all ones of appropriate dimension, ⊙ denotes the element-wise (or Hadamard) product of two vectors, and the exponential function exp(·) is also applied element-wise.
Corollary 1. The objective function g_λ(w) is continuous at every point w ∈ R^p.
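For reference, the map (12) in code: it sends all of R^p into [0, 1)^p, with t_j = 0 exactly when w_j = 0, which is why the box constraint disappears.

```python
import numpy as np

def t_of_w(w):
    # t(w) = 1 - exp(-w ⊙ w): maps R^p into [0, 1)^p, with t_j = 0 iff w_j = 0
    w = np.asarray(w, dtype=float)
    return 1.0 - np.exp(-w * w)
```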
As mentioned earlier, our continuous optimization method uses a gradient descent method to solve the problem (11). Towards that we need the gradient of g_λ(w). Theorem 4 provides an expression for ∇g_λ(w).

Theorem 4. For every w ∈ R^p, with t = t(w) defined by (12),

    ∇g_λ(w) = (ζ_t + λ1) ⊙ (2 w ⊙ exp(−w ⊙ w)),

where ζ_t denotes the gradient of ∥y − X_t β_t∥_2^2/n with respect to t, whose elements are given in (15) in terms of β_t and an auxiliary vector c_t.

Figure 2 illustrates the typical convergence behavior of t for an example dataset during the execution of a basic gradient descent algorithm for minimizing g_λ(w) using the gradient ∇g_λ given in Theorem 4. Here, w is mapped to t using (12) at each iteration.

Subset Selection Algorithms
Our algorithm COMBSS, as stated in Algorithm 1, takes the data [X, y], the tuning parameters δ and λ, and an initial point w^(0) as input, and returns either a single model or multiple models of different sizes as output. It is executed in three steps.
In Step 1, GradientDescent(w^(0), ∇g_λ) calls a gradient descent method, such as the well-known Adam optimizer, for minimizing the objective function g_λ(w); it takes w^(0) as the initial point and uses the gradient function ∇g_λ for updating the vector w in each iteration; see, for example, [22] for a review of popular gradient-based optimization methods. It terminates when a predefined termination condition is satisfied and returns the sequence w_path = (w^(0), w^(1), ...) of all the points w visited during its execution, where w^(l) denotes the point obtained in the lth iteration of the gradient descent. Usually, a robust termination condition is to terminate when the change in w (or, equivalently, in t(w)) is sufficiently small over a set of consecutive iterations.
Algorithm 1 COMBSS(X, y, δ, λ, w^(0))
1: w_path ← GradientDescent(w^(0), ∇g_λ)
2: Obtain t_path from w_path using the map (12)
3: M ← SubsetMap(t_path)
4: return M

Selecting the initial point w^(0) requires a few considerations. From Theorem 4, for any j = 1, ..., p, we have t_j(w_j) = 0 if and only if w_j = 0, and ∂g_λ(w)/∂w_j = 0 if w_j = 0. Hence, if we start the gradient descent algorithm with w_j^(0) = 0 for some j, both w_j and t_j can continue to take the value 0 forever. As a result, we might not learn the optimal value for w_j (or, equivalently, for t_j). Thus, it is important to select all the elements of w^(0) away from 0.
Consider the second argument, ∇g_λ, of the gradient descent method. From Theorem 4, observe that computing the gradient ∇g_λ(w) involves evaluating an expression of the form L_t^{-1} u twice: first for computing β_t (using (7)) and then for computing the vector c_t (defined in Theorem 4). Since L_t is of dimension p × p, computing the matrix inverse L_t^{-1} can be computationally demanding, particularly in high-dimensional cases (n < p), where p can be very large; see, for example, [13]. Since L_t is invertible, observe that L_t^{-1} u is the unique solution of the linear equation L_t z = u. In Section 6, we first use the well-known Woodbury matrix identity to convert this p-dimensional linear equation problem into an n-dimensional linear equation problem, which is then solved using the conjugate gradient method, a popular linear equation solver. Moreover, again from Theorem 4, notice that ∇g_λ(w) depends on both tuning parameters δ and λ. Specifically, δ is required for computing L_t and λ appears in the penalty term λ Σ_{j=1}^p t_j of the objective function. In Section 5 we provide more details on the roles of these two parameters and instructions on how to choose them.
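Step 1 can be sketched as follows. For illustration only, the closed-form gradient of Theorem 4 is replaced by a finite-difference approximation, and the surrogate objective is written out inline; a practical implementation would use the analytic gradient together with the linear-algebra speed-ups of Section 6. All names are ours.

```python
import numpy as np

def f_lambda(t, X, y, lam, delta):
    # Surrogate objective f_lambda(t) = ||y - X_t beta_t||^2/n + lam * sum(t)
    n, p = X.shape
    Xt = X * t                                             # X_t = X T_t
    Lt = (Xt.T @ Xt + delta * (np.eye(p) - np.diag(t ** 2))) / n
    beta = np.linalg.solve(Lt, Xt.T @ y / n)               # beta_t, t in [0,1)^p
    return np.sum((y - Xt @ beta) ** 2) / n + lam * t.sum()

def g_lambda(w, X, y, lam, delta):
    # g_lambda(w) = f_lambda(t(w)) with t(w) = 1 - exp(-w ⊙ w)
    return f_lambda(1.0 - np.exp(-w * w), X, y, lam, delta)

def combss_descent(X, y, lam, delta, w0, lr=0.05, steps=200, h=1e-6):
    # Basic gradient descent on g_lambda, returning the whole path w_path
    p = X.shape[1]
    w = w0.astype(float).copy()
    w_path = [w.copy()]
    for _ in range(steps):
        grad = np.array([(g_lambda(w + h * e, X, y, lam, delta)
                          - g_lambda(w - h * e, X, y, lam, delta)) / (2 * h)
                         for e in np.eye(p)])              # finite differences
        w = w - lr * grad
        w_path.append(w.copy())
    return w_path
```

On a toy problem where only the first variable carries signal, the descent drives t_1 towards 1 and the remaining coordinates towards 0.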
In Step 2, we obtain the sequence t_path = (t^(0), t^(1), ...) from w_path by using the map (12), that is, t^(l) = t(w^(l)) for each l. Finally, in Step 3, SubsetMap(t_path) takes the sequence t_path as input and finds a set of models M corresponding to the input parameter λ. In the following subsections, we describe two versions of SubsetMap.
The following theoretical result, Theorem 5, guarantees convergence of COMBSS. In particular, this result establishes that a gradient descent algorithm on g_λ(w) converges to an ϵ-stationary point. Towards this, we say that a point w̃ ∈ R^p is an ϵ-stationary point of g_λ(w) if ∥∇g_λ(w̃)∥_2 ≤ ϵ. Since a point w is called a stationary point if ∇g_λ(w) = 0, an ϵ-stationary point provides an approximation to a stationary point.

Theorem 5. There exists a constant α > 0 such that the gradient descent method, starting at any initial point w^(0) and with a fixed positive learning rate smaller than α, converges to an ϵ-stationary point within O(1/ϵ^2) iterations.

Subset Map Version 1
One simple implementation of SubsetMap is stated as Algorithm 2, which we call SubsetMapV1 (where V1 stands for version 1); it requires only the final point of the sequence t_path and returns a single model using a predefined threshold parameter τ ∈ [0, 1).
Due to the tolerance allowed by the termination condition of the gradient descent, some w_j in the final point of w_path can be almost zero but not exactly zero, even though they are meant to converge to zero. As a result, the corresponding t_j also take values close to zero but not exactly zero, because of the mapping from w to t. Therefore, the threshold τ helps map the insignificantly small t_j to 0 and all other t_j to 1. In practice, we call COMBSS(X, y, δ, λ, w^(0)) for each λ over a grid of values.

Algorithm 2 SubsetMapV1(t_path, τ)
1: Take t to be the final point of t_path
2: for j = 1 to p do
3:   s_j ← I(t_j > τ)
4: end for
5: return s = (s_1, ..., s_p)^T

When SubsetMapV1 is used, the larger the value of λ, the higher the sparsity of the resulting model s. Thus, we can control the sparsity of the output model through λ. Since only the last point of t_path matters in this version, an intuitive initialization is to take w^(0) such that t(w^(0)) = (1/2, ..., 1/2)^T, the mid-point of the hypercube [0, 1]^p, as it is equidistant from all the corner points, one of which is the (unknown) target solution of the best subset selection problem.
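Algorithm 2 is a one-line thresholding of the final point of t_path; τ is the user-chosen threshold in [0, 1) (the function name is ours).

```python
import numpy as np

def subset_map_v1(t_path, tau):
    # Threshold the final point of t_path: s_j = I(t_j > tau)
    t_final = np.asarray(t_path[-1])
    return (t_final > tau).astype(int)
```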
In Appendix B, we demonstrate the efficacy of COMBSS using SubsetMapV1 in predicting the true model of the data. In almost all the settings, we observe superior performance of COMBSS in comparison to existing popular methods.

Subset Map Version 2
Ideally, for each k there is a value of λ such that the output model s obtained by SubsetMapV1 has exactly k non-zero elements. However, when the ultimate goal is to find a best suitable model s with |s| = k for every given k ≤ q, for some q ≪ min(n, p), we might not obtain a model for some values of k, since λ is selected over a grid. Furthermore, for a given size k, if there are two models with almost the same mean squared error, then the optimization may have difficulty in distinguishing them. Addressing this difficulty may involve fine-tuning the hyper-parameters of the optimization algorithm.
To overcome these challenges without any hyper-parameter tuning and to reduce the reliance on the parameter λ, we consider the other points in t_path. In particular, we propose a refined implementation of SubsetMap, which we call SubsetMapV2 and state as Algorithm 3. The key idea of this version is that as the gradient descent progresses over the surface of f_λ(t), it can point towards several corners of the hypercube [0, 1]^p before finally moving towards the final corner. Considering all these corners, we can refine the results. Specifically, this version provides, for each λ, a model for every k ≤ q. In this implementation, λ is seen as a parameter that allows us to explore the surface of f_λ(t) rather than as a sparsity parameter.
For the execution of SubsetMapV2, we start at Step 1 with an empty set of models M_k for each k ≤ q. In Step 2, for each t in t_path, we consider the sequence of indices j_1, ..., j_q such that t_{j_1} ≥ t_{j_2} ≥ ... ≥ t_{j_q}. Then, for each k ≤ q, we take s_k to be the binary vector with 1's only at j_1, ..., j_k and add s_k to the set M_k. With this construction, it is clear that M_k consists of models of size k, of which we pick a best candidate s*_k. Finally, the algorithm returns the set consisting of s*_1, ..., s*_q corresponding to the given λ. When the main COMBSS is called for a grid of m values of λ with SubsetMapV2, then for each k ≤ q we obtain at most m models, and among them the model with the minimum mean squared error is selected as the final best model of size k.

Algorithm 3 SubsetMapV2(t_path)
1: M_k ← {} for each k ≤ q
2: for each t = (t_1, ..., t_p)^T in t_path do
3:   Let t_{j_1} ≥ t_{j_2} ≥ ... ≥ t_{j_q} be the q largest elements of t in descending order
4:   for k = 1 to q do
5:     Take s_k ∈ {0, 1}^p with non-zero elements only at j_1, ..., j_k
6:     M_k ← M_k ∪ {s_k}
7:   end for
8: end for
9: for k = 1 to q do
10:   Pick a best candidate s*_k from M_k (the model with the minimum mean squared error)
11: end for
12: return {s*_1, ..., s*_q}

Since this version of COMBSS explores the surface, we can refine results further by starting from different initial points w^(0). Section 7 provides simulations to demonstrate the performance of COMBSS with SubsetMapV2.
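A sketch of SubsetMapV2, with the best candidate of each size chosen by residual sum of squares (our reading of the minimum mean squared error criterion); the names are ours.

```python
import numpy as np

def subset_map_v2(t_path, X, y, q):
    # For every t visited, record the top-k coordinates for each k <= q,
    # then keep, per k, the candidate with the smallest residual sum of squares.
    models = {k: set() for k in range(1, q + 1)}
    for t in t_path:
        order = np.argsort(-np.asarray(t))      # indices with t_j in decreasing order
        for k in range(1, q + 1):
            models[k].add(tuple(sorted(order[:k])))

    def rss(cols):
        Xs = X[:, list(cols)]
        beta = np.linalg.pinv(Xs) @ y           # subset least squares
        return np.sum((y - Xs @ beta) ** 2)

    return {k: min(cands, key=rss) for k, cands in models.items()}
```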
Remark 4. It is not hard to observe that, for each λ, if the model obtained by Algorithm 2 has size k ≤ q, then this model is present in M_k of Algorithm 3; hence, COMBSS with SubsetMapV2 always provides the same or a better solution than COMBSS with SubsetMapV1.

Roles of Tuning Parameters
In this section, we provide insights on how the tuning parameters δ and λ influence the objective function f_λ(t) (or, equivalently, g_λ(w)) and hence the convergence of the algorithm.

Controlling the Shape of f_λ(t) through δ
The normalized cost ∥y − X_t β_t∥_2^2/n provides an estimator of the error variance. For any fixed t, we expect this variance (and hence the objective function f_λ(t)) to be almost the same for all relatively large values of n, particularly in situations where the errors ϵ_i are independent and identically distributed. This is the case at all the corner points s ∈ {0, 1}^p, because at these corner points, from Theorem 2, X_s β_s = X_[s] β_[s], which is independent of δ. We would like to have a similar behavior at all the interior points t ∈ (0, 1)^p as well, so that for each t, the function f_λ(t) is roughly the same for all large values of n. Such consistent behavior is helpful in guaranteeing that the convergence paths of the gradient descent method are approximately the same for large values of n.
Figure 3 shows surface plots of f_λ(t) for different values of n and δ for an example dataset obtained from a linear model with p = 2. Surface plots (a) and (d) correspond to δ = n, and as we can see, the shape of the surface of f_λ(t) over [0, 1]^p is very similar in both these plots.
To make this observation more explicit, we now show that the function f_λ(t), at any t, takes almost the same value for all large n if we keep δ = cn, for a fixed constant c > 0, under the assumption that the data samples are independent and identically distributed (this assumption simplifies the following discussion; however, the conclusion holds more generally).
Observe that

    f_λ(t) = y^T y/n − 2 (X_t^T y/n)^T β_t + β_t^T (X_t^T X_t/n) β_t + λ Σ_{j=1}^p t_j,

where β_t = L_t^{-1} X_t^T y/n. Under the independent and identically distributed assumption, y^T y/n, X^T y/n, and X^T X/n converge element-wise as n increases. Since T_t is independent of n, we would like to choose δ such that L_t^{-1} also converges as n increases. Now recall from (6) that

    L_t = T_t (X^T X/n) T_t + (I − T_t^2)(δ/n).

It is then evident that the choice δ = cn, for a fixed constant c independent of n, makes L_t converge as n increases. Specifically, the choice c = 1 justifies the behavior observed in Figure 3.
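This convergence can be checked numerically. The sketch below (our names) builds L_t for two sample sizes with δ = n; for standard normal features, X^T X/n → I element-wise, so with c = 1 the matrix L_t approaches T_t^2 + (I − T_t^2) = I for any fixed t.

```python
import numpy as np

def L_t(X, t, delta):
    # L_t = (T_t X^T X T_t + delta (I - T_t^2)) / n, as in (6)
    n, p = X.shape
    Xt = X * t
    return (Xt.T @ Xt + delta * (np.eye(p) - np.diag(t ** 2))) / n

rng = np.random.default_rng(1)
t = np.array([0.3, 0.6, 0.9])
L_n1 = L_t(rng.standard_normal((500, 3)), t, delta=500.0)        # delta = n
L_n2 = L_t(rng.standard_normal((50000, 3)), t, delta=50000.0)    # delta = n
# For standard normal features, both matrices approach the identity as n grows,
# so L_n1 and L_n2 agree up to sampling error.
```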

Controlling Sparsity through λ
Intuitively, the larger the value of λ, the sparser the solution offered by COMBSS using SubsetMapV1, when all other parameters are fixed. We now strengthen this understanding mathematically. From Theorem 4,

    ∇f_λ(t) = ζ_t + λ1,  t ∈ (0, 1)^p,

and, for w ∈ R^p,

    ∇g_λ(w) = (ζ_t + λ1) ⊙ (2 w ⊙ exp(−w ⊙ w)),

where ζ_t, given by (15), is independent of λ. Note the following property of ζ_t.
Proposition 1. For any j = 1, ..., p, if all t_i for i ≠ j are fixed, then the jth element of ζ_t satisfies lim_{t_j↓0} ζ_{t,j} = 0.

This result implies that for any j = 1, ..., p, we have lim_{t_j↓0} ∂f_λ(t)/∂t_j = λ, where lim_{t_j↓0} denotes the limit along any sequence of t_j converging to 0 from the right. Since ζ_t is independent of λ, the above limit implies that there is a window (0, a_j) such that the slope ∂f_λ(t)/∂t_j > 0 for t_j ∈ (0, a_j), and that the window size increases (i.e., a_j increases) as λ increases.
As a result, for the function g_λ(w), there exists a constant a′_j > 0 such that ∂g_λ(w)/∂w_j > 0 for w_j ∈ (0, a′_j) and ∂g_λ(w)/∂w_j < 0 for w_j ∈ (−a′_j, 0). In other words, for positive λ, there is a 'valley' on the surface of g_λ(w) along the line w_j = 0, and the valley becomes wider as λ increases. In summary, the larger the value of λ, the more the w_j (or, equivalently, the t_j) tend to be moved towards 0 by the optimization algorithm, and a sparser model is then selected (i.e., a small number k of variables is chosen). At the extreme value λ_max = ∥y∥_2^2/n, all t_j are forced towards 0 and thus the null model is selected.

Low- vs High-dimension
Recall the expression of L_t from (5). We have noted earlier from Theorem 4 that for computing ∇g_λ(w), we twice evaluate matrix-vector products of the form L_t^{-1} u, where L_t^{-1} u is the unique solution of the linear equation L_t z = u. Solving linear equations efficiently is one of the important and well-studied problems in the field of linear algebra. Among the many elegant approaches for solving linear equations, the conjugate gradient method is well-suited for our problem since L_t is symmetric positive-definite; see, for example, [13].
The running time of the conjugate gradient method for solving the linear equation Az = u depends on the dimension of A. For our algorithm, since L_t is of dimension p × p, the conjugate gradient method can return a good approximation of L_t^{-1} u within O(p^2) time when the maximum number of conjugate gradient iterations is fixed. This is true for both low-dimensional models (where p < n) and high-dimensional models (where n < p).
We now focus specifically on high-dimensional models and transform the problem of solving the p-dimensional linear equation L_t z = u into the problem of solving an n-dimensional linear equation. This approach is based on a well-known result in linear algebra called the Woodbury matrix identity. Since we then call the conjugate gradient method to solve an n-dimensional problem instead of a p-dimensional one, we can achieve a much lower overall computational complexity for high-dimensional models. The following result is a consequence of the Woodbury matrix identity, which is stated as Lemma 2 in Appendix A.

Theorem 6. For t ∈ [0, 1)^p, let S_t be the p-dimensional diagonal matrix with jth diagonal element n/(δ(1 − t_j^2)), and let L̃_t = I + X_t S_t X_t^T/n. Then,

    L_t^{-1} = S_t − S_t X_t^T L̃_t^{-1} X_t S_t/n.

The above expression suggests that instead of solving the p-dimensional problem L_t z = u directly, we can first solve the n-dimensional problem L̃_t z̃ = X_t S_t u and substitute the result into L_t^{-1} u = S_t u − S_t X_t^T z̃/n.
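Theorem 6 in code (our names): the p-dimensional solve L_t z = u is replaced by a single n-dimensional solve, which pays off when p ≫ n.

```python
import numpy as np

def solve_Lt_woodbury(X, t, delta, u):
    # Solve L_t z = u via the n-dimensional system of Theorem 6:
    # L_t^{-1} = S_t - S_t X_t^T (I + X_t S_t X_t^T / n)^{-1} X_t S_t / n
    n, p = X.shape
    Xt = X * t
    s = n / (delta * (1.0 - t ** 2))            # diagonal of S_t (needs t in [0,1)^p)
    Lt_n = np.eye(n) + (Xt * s) @ Xt.T / n      # n x n matrix, 'tilde L_t'
    z = np.linalg.solve(Lt_n, Xt @ (s * u))     # the n-dimensional solve
    return s * u - s * (Xt.T @ z) / n           # equals L_t^{-1} u
```

In a practical implementation the dense n-dimensional solve would itself be replaced by the conjugate gradient method.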

A Dimension Reduction Approach
During the execution of the gradient descent algorithm, Step 1 of Algorithm 1, some of the w_j (and hence the corresponding t_j) can reach zero. Particularly, for basic gradient descent and similar methods, once w_j reaches zero it remains zero until the algorithm terminates, because the update of w in the lth iteration of the basic gradient descent depends only on the gradient ∇g_λ(w^(l)), whose jth element satisfies

    ∂g_λ(w)/∂w_j = 0 whenever w_j = 0.    (16)

Because (16) holds, we need to focus only on the partial derivatives ∂g_λ(w)/∂w_j associated with w_j ≠ 0 in order to reduce the cost of computing the gradient ∇g_λ(w). To simplify the notation, let P = {1, ..., p} and, for any t ∈ [0, 1)^p, let Z_t be the set of indices of the zero elements of t, that is,

    Z_t = {j ∈ P : t_j = 0}.

Similar to the notation used in Theorem 2, for a vector u ∈ R^p, we write (u)_+ (respectively, (u)_0) to denote the vector of dimension p − |Z_t| (respectively, |Z_t|) constructed from u by removing all its elements with indices in Z_t (respectively, in P \ Z_t). Similarly, for a matrix A of dimension p × p, we write (A)_+ (respectively, (A)_0) to denote the new matrix constructed from A by removing its rows and columns with indices in Z_t (respectively, in P \ Z_t). Then we have the following result.
Theorem 7. Suppose t ∈ [0, 1)^p. Then, the inverse L_t^{-1} satisfies (L_t^{-1})_+ = ((L_t)_+)^{-1} and (L_t^{-1})_0 = (n/δ) I, with all remaining elements of L_t^{-1} equal to zero (18). Furthermore, we have (β_t)_0 = 0, and (β_t)_+ and (c_t)_+ are obtained by solving linear systems in (L_t)_+, as given in (19) and (20).

In Theorem 7, (18) shows that for every j ∈ Z_t, all the off-diagonal elements of the jth row as well as the jth column of L_t^{-1} are zero while its jth diagonal element is n/δ, and all other elements of L_t^{-1} (which constitute the sub-matrix (L_t^{-1})_+) depend only on (L_t)_+, which can be computed using only the columns of the design matrix X with indices in P \ Z_t. As a consequence, (19) and (20) imply that computing β_t and c_t amounts to solving p_+-dimensional linear equations of the form (L_t)_+ z = v, where p_+ = p − |Z_t|. Since p_+ ≤ p, solving such a p_+-dimensional linear equation using the conjugate gradient method can be faster than solving the original p-dimensional linear equation of the form L_t z = u.
In summary, for a vector t ∈ [0, 1)^p with some elements being 0, the values of f_λ(t) and ∇f_λ(t) do not depend on the columns j of X for which t_j = 0. Therefore, we can reduce the computational complexity by removing all the columns j of the design matrix X for which t_j = 0.

[Figure 4: Comparison of running times for COMBSS with SubsetMapV1 using only the conjugate gradient (ConjGrad), the conjugate gradient with the Woodbury matrix identity (ConjGrad-Woodbury), and the conjugate gradient with both the Woodbury matrix identity and the truncation improvement (ConjGrad-Woodbury-Trunc). For the truncation, η = 0.001. The dataset for this experiment is the same dataset used for Figure 2.]

Making Our Algorithm Fast
In Section 6.2, we noted that when some elements t_j of t are zero, it is faster to compute the objective functions f_λ(t) and g_λ(w) and their gradients ∇f_λ(t) and ∇g_λ(w) by ignoring the corresponding columns j of the design matrix X. In Section 5.2, using Proposition 1, we further noted that for any λ > 0 there is a 'valley' on the surface of g_λ(w) along w_j = 0 for each j = 1, ..., p, and thus, for any j, when w_j (or, equivalently, t_j) is sufficiently small during the execution of the gradient descent method, it will eventually become zero. Using these observations, in the implementation of our method, to reduce the computational cost of estimating the gradients, it is wise to map w_j (and t_j) to 0 when w_j is almost zero. We incorporate this truncation idea into our algorithm as follows.
We first fix a small constant η, say at 0.001. As we run the gradient descent algorithm, when t_j falls below η for some j ∈ P, we set t_j and w_j to zero and stop updating them; that is, t_j and w_j remain zero until the gradient descent algorithm terminates. In each iteration of the gradient descent algorithm, the design matrix is updated by removing all columns corresponding to zero t_j's. If the algorithm starts at a point w with all elements non-zero, the effective dimension p_+, which denotes the number of columns in the updated design matrix, decreases monotonically from p. In an iteration, if p_+ > n, we can use Theorem 6 to reduce the complexity of computing the gradients. However, once p_+ falls below n, we use the conjugate gradient method directly for computing the gradients, without invoking Theorem 6.
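The truncation rule above can be sketched in a few lines. This is a minimal illustration in the spirit of the description, not the released COMBSS implementation; the function name, the learning rate, and the boolean `active` mask are our own devices for freezing coordinates.

```python
import numpy as np

def truncation_step(w, grad, active, eta=1e-3, lr=0.1):
    """One gradient-descent step with the truncation rule.

    `active` is a boolean mask of coordinates still being updated.
    Any coordinate whose t_j = 1 - exp(-w_j^2) falls below `eta` is set
    to exactly zero and frozen for the rest of the run.
    """
    w = w.copy()
    w[active] -= lr * grad[active]      # update only active coordinates
    t = 1.0 - np.exp(-w * w)            # map w to t in [0, 1)
    newly_zero = active & (t < eta)     # truncation threshold reached
    w[newly_zero] = 0.0                 # freeze at zero
    active = active & ~newly_zero
    return w, active
```

In a full implementation, the columns of X corresponding to frozen coordinates would also be dropped at this point, shrinking the effective dimension p_+.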
Figure 4 illustrates, on an example dataset, the substantial improvement in the speed of our algorithm when the above-mentioned ideas are incorporated into its implementation.

Remark 5. From our simulations over the range of scenarios considered in Section 7, we observed that the performance of our method does not vary significantly when η is close to zero. In particular, any value of η close to or below 0.001 is a good choice, in the sense that, if s_η ∈ {0, 1}^p is the model selected by COMBSS, we rarely observed s_η ≠ s_0. Thus, the Hamming distance between s_η and s_0 is zero when η is close to or smaller than 0.001, except in a few generated datasets. This holds both when comparing the estimated true model and when comparing the best subsets.

Simulation Experiments
Our method is available through Python and R code via GitHub¹. The code includes examples where p is as large as order 10,000, and it allows the replication of our simulation results presented in this section and in Appendix B.
In Appendix B, we focus on demonstrating (using SubsetMapV1) the efficacy of our method in predicting the true model of the data. Here, our focus is on demonstrating its efficacy in retrieving best subsets of given sizes, that is, our ability to solve (1) using SubsetMapV2. We compare our approach to forward selection, the Lasso, mixed integer optimization, and L0Learn [17].

Simulation design
The data is generated from the linear model: … Here, each row of the predictor matrix X is generated from a multivariate normal distribution with zero mean and covariance matrix Σ with diagonal elements Σ_{j,j} = 1 and off-diagonal elements Σ_{i,j} = ρ^{|i−j|}, i ≠ j, for a correlation parameter ρ ∈ (−1, 1). The noise ϵ is an n-dimensional vector of independent and identically distributed normal variables with zero mean and variance σ². To investigate a challenging situation, we use ρ = 0.8 to mimic strong correlation between predictors. For each simulation, we fix the signal-to-noise ratio (SNR) and compute the variance σ² of the noise ϵ accordingly. We consider the following two simulation settings:

• Case 1: The first k_0 = 10 components of β are equal to 1 and all other components of β are equal to 0.
• Case 2: The first k_0 = 10 components of β are given by β_i = 0.5^{i−1}, for i = 1, . . ., k_0, and all other components of β are equal to 0.
Both Case 1 and Case 2 assume strong correlation between the active predictors. Case 2 differs from Case 1 by presenting a signal that decays exponentially to 0.
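The simulation design can be sketched as follows. The SNR relation σ² = βᵀΣβ / SNR used below is the standard definition and is an assumption on our part, since the exact expression is not reproduced in this text; function and argument names are ours.

```python
import numpy as np

def generate_data(n=100, p=1000, rho=0.8, snr=5.0, case=1, k0=10, seed=0):
    """Generate (X, y, beta) with AR(1)-correlated predictors and Gaussian
    noise scaled to a target signal-to-noise ratio."""
    rng = np.random.default_rng(seed)
    # Covariance Sigma_{i,j} = rho^{|i-j|}
    idx = np.arange(p)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    beta = np.zeros(p)
    if case == 1:
        beta[:k0] = 1.0                       # Case 1: flat signal
    else:
        beta[:k0] = 0.5 ** np.arange(k0)      # Case 2: exponential decay
    sigma2 = beta @ Sigma @ beta / snr        # assumed SNR definition
    y = X @ beta + rng.normal(0.0, np.sqrt(sigma2), size=n)
    return X, y, beta
```

With ρ = 0.8 and the active coefficients packed into the first k_0 positions, the active predictors are strongly correlated with one another, as intended by the design.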
In the low-dimensional setting, forward stepwise selection (FS) and mixed integer optimization (MIO) were tuned over k = 0, . . ., 20. In this simulation, we ran MIO through the R package bestsubset offered in [15], while we ran L0Learn through the R package L0Learn offered in [18]. For the high-dimensional setting, we do not include MIO due to the computational time constraints it poses.

Low-dimensional case
In the low-dimensional case, we use the exhaustive method to find the exact solution of the best subset problem for every subset size from 1 to p. We then assess the ability of our method to retrieve the exact best subset of each size. Figure 5 shows the frequency of retrieving the exact best subset (provided by exhaustive search) for each subset size k = 1, . . ., p, for Case 1, over 200 replications. For each SNR level, MIO, as expected, perfectly retrieves the optimal best subset of every model size. After MIO, COMBSS gives the best results for retrieving the best subset, compared to FS, the Lasso, and L0Learn. We can also observe that each of these curves follows a U-shape, with the lowest point approximately in the middle. This behaviour seems to be related to the (p choose k) possible choices for each subset size k = 1, . . ., p, as at each k we have (p choose k) options (corner points of [0, 1]^p) to explore. Similar behaviours are reported for the low-dimensional setting of Case 2 in Figure 6.
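The exhaustive ground truth used in this comparison can be sketched as follows; this is our own minimal reference implementation, not the code released with the paper, and it is only feasible for small p because it enumerates all (p choose k) column subsets.

```python
import numpy as np
from itertools import combinations

def exhaustive_best_subset(X, y, k):
    """For subset size k, fit least squares on every subset of k columns
    and return the subset with the smallest residual sum of squares."""
    n, p = X.shape
    best_rss, best_subset = np.inf, None
    for subset in combinations(range(p), k):
        Xs = X[:, subset]
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        rss = np.sum((y - Xs @ beta) ** 2)
        if rss < best_rss:
            best_rss, best_subset = rss, subset
    return best_subset, best_rss
```

A candidate method "retrieves the exact best subset" of size k when its selected support matches the subset returned here.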

High-dimensional case
To assess the performance of our method in retrieving a competitive best subset, we compare the best subset obtained from COMBSS with those of the other methods for two subset sizes, 5 and 10, over 50 replications. Note that the exact best subset is unknown in the high-dimensional case, since it is computationally impractical to conduct an exhaustive search even for moderate subset sizes when p = 1000. Hence, for this comparison, we use the mean squared error (MSE) of the dataset to evaluate which method provides the better subset of sizes 5 and 10. Figure 7 presents these results over 50 replications for SNR values from 2 to 8. As expected, the MSE of all methods decreases as SNR increases. Overall, COMBSS is consistently at least as good as the other methods at providing a competing best subset, while none of the alternative methods is consistent across all cases.
In this high-dimensional setting, as mentioned earlier, deploying MIO, which is based on the Gurobi optimizer, proves impractical (see [16]) due to its prohibitively long running time, extending into the order of hours. In stark contrast, COMBSS exhibits running times of a few seconds in both simulation settings: approximately 4.5 seconds with SubsetMapV1 (for predicting the true model) and approximately 7 seconds with SubsetMapV2 (for best subset selection). We have observed that, for COMBSS, SubsetMapV1 operates at approximately twice the speed of SubsetMapV2. Other existing methods demonstrate even faster running times, within a fraction of a second, but with lower performance than COMBSS. In summary, for best subset selection, COMBSS stands out as the most efficient among the methods that can run within a few seconds. Similarly, in predicting the true model, we believe that the consistently strong performance of COMBSS positions it as a crucial method, particularly in comparison with faster methods such as the Lasso.

Conclusion and Discussion
In this paper, we have introduced COMBSS, a novel continuous optimization method for best subset selection in linear regression. The key idea of COMBSS is to relax the highly difficult discrete constrained best subset selection problem to an unconstrained continuous optimization problem. In particular, COMBSS extends the objective function of best subset selection, which is defined at the corners of the hypercube [0, 1]^p, to a differentiable function defined on the whole hypercube. For this extended function, starting from an interior point, a gradient descent method is executed to find a corner of the hypercube where the objective function is minimal.
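As a toy illustration of this reparametrization (not the authors' implementation, and with a stand-in objective of our own), the box-constrained variable t ∈ [0, 1)^p can be optimized without constraints through the map t(w) = 1 − exp(−w ⊙ w) used later in the paper, with gradients obtained by the chain rule:

```python
import numpy as np

def t_of(w):
    """Map unconstrained w in R^p to t in [0, 1)^p."""
    return 1.0 - np.exp(-w * w)

def grad_descent(grad_f_t, w0, lr=0.05, steps=500):
    """Minimize an extended objective f(t) over t in [0, 1)^p by running
    plain gradient descent on w, where t = 1 - exp(-w*w). `grad_f_t`
    returns the gradient of f with respect to t."""
    w = w0.copy()
    for _ in range(steps):
        t = t_of(w)
        # chain rule: dg/dw_j = (df/dt_j) * 2 w_j exp(-w_j^2)
        w -= lr * grad_f_t(t) * 2.0 * w * np.exp(-w * w)
    return t_of(w)

# Stand-in extended objective f(t) = ||t - c||^2 with c outside the box;
# the iterate is driven toward the corner of [0,1]^2 nearest to c.
c = np.array([1.5, -0.3])
t_star = grad_descent(lambda t: 2.0 * (t - c), w0=np.array([0.7, 0.7]))
```

On this toy objective the iterate approaches the corner (1, 0): the first coordinate climbs toward 1 while the second is driven into the valley at t_2 = 0, mirroring how COMBSS iterates settle at hypercube corners.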
In this paper, our simulation experiments highlight the ability of COMBSS with SubsetMapV2 to retrieve the "exact" best subset of any subset size, in comparison to four existing methods: forward stepwise (FS), the Lasso, L0Learn, and mixed integer optimization (MIO). In Appendix B, we present several simulation experiments in both low-dimensional and high-dimensional setups to illustrate the good performance of COMBSS with SubsetMapV1 in predicting the true model of the data, again in comparison to FS, the Lasso, L0Learn, and MIO. Both of these empirical studies emphasize the potential of COMBSS for feature extraction. In addition to these four methods, we also experimented with the minimax concave penalty (MCP) and smoothly clipped absolute deviation (SCAD), which are available through the R package ncvreg; see [4] for details of these two methods. We omitted the results for MCP and SCAD from our simulation studies, as their performance, although somewhat similar to that of the Lasso, did not compete with COMBSS for best subset selection or for predicting the true model parameters.
In our algorithm, the primary operations involved are the matrix-vector product, the element-wise vector-vector product, and the scalar-vector product. In particular, we note that most of the running time of COMBSS comes from the application of the conjugate gradient method for solving linear equations of the form Az = u using off-the-shelf packages. The main operation involved in conjugate gradient is the matrix-vector product with A, and such operations are known to execute faster on graphics processing unit (GPU) based computers using parallel programming. A future GPU-based implementation of COMBSS could therefore substantially increase the speed of our method. Furthermore, applying stochastic gradient descent [3] instead of gradient descent, and the randomized Kaczmarz algorithm (a variant of stochastic gradient descent) [28] instead of conjugate gradient, has the potential to increase the speed of COMBSS, as stochastic gradient methods take just one data sample in each iteration.
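The matrix-vector-product-only structure can be made concrete with a plain conjugate gradient solver. This is a hedged sketch of the kind of solver COMBSS relies on, not the authors' code; the system matrix A = XᵀX/n + δI below is a stand-in positive-definite matrix in the spirit of the paper's linear systems, not the exact L_t.

```python
import numpy as np

def conjugate_gradient(matvec, u, tol=1e-10, max_iter=1000):
    """Solve A z = u for symmetric positive-definite A, supplied only
    through the matrix-vector product `matvec`."""
    z = np.zeros_like(u)
    r = u - matvec(z)        # residual
    d = r.copy()             # search direction
    rs = r @ r
    for _ in range(max_iter):
        Ad = matvec(d)
        alpha = rs / (d @ Ad)
        z += alpha * d
        r -= alpha * Ad
        rs_new = r @ r
        if np.sqrt(rs_new) < tol * np.linalg.norm(u):
            break
        d = r + (rs_new / rs) * d
        rs = rs_new
    return z

# A is applied through two thin matrix-vector products; A itself (p x p)
# is never formed, which is what makes large p tractable.
rng = np.random.default_rng(0)
n, p = 50, 200
X = rng.standard_normal((n, p))
delta = 1.0
matvec = lambda v: X.T @ (X @ v) / n + delta * v
u = rng.standard_normal(p)
z = conjugate_gradient(matvec, u)
```

Because the solver touches A only through `matvec`, the same loop runs unchanged on a GPU array library, which is the acceleration route suggested above.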
A future direction for finding the best model of a given fixed size k is to explore different options for the penalty term of the objective function f_λ(t). Ideally, if we impose a sufficiently large penalty when Σ_{j=1}^p t_j > k and zero penalty otherwise, we can drive the optimization algorithm towards a model of size k that lies on the hyperplane Σ_{j=1}^p t_j = k. Because such a discrete penalty is not differentiable, we could use smooth alternatives. For instance, the penalty could be taken to be λ(k − Σ_{j=1}^p t_j)² when Σ_{j=1}^p t_j > k and 0 otherwise, for a tuning parameter λ > 0.
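The suggested one-sided smooth penalty and its gradient can be sketched directly; this is our own illustration of the proposal above, not an implemented part of COMBSS.

```python
import numpy as np

def size_penalty(t, k, lam):
    """One-sided smooth penalty lam * (k - sum(t))^2, active only when
    sum(t) > k; zero otherwise."""
    excess = np.sum(t) - k
    return lam * excess**2 if excess > 0 else 0.0

def size_penalty_grad(t, k, lam):
    """Gradient of the penalty: 2*lam*(sum(t) - k) in every coordinate
    when sum(t) > k, and zero otherwise."""
    excess = np.sum(t) - k
    g = 2.0 * lam * excess if excess > 0 else 0.0
    return np.full_like(t, g)
```

Since the penalty is identically zero on the feasible side Σ t_j ≤ k, it only pushes the iterate back toward the hyperplane Σ t_j = k, rather than shrinking models that are already small enough.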
We expect, similarly to the significant bodies of work that focus on the Lasso and on MIO, that there are many avenues to explore for building on the presented COMBSS framework, particularly for tackling best subset selection in ultra-high-dimensional problems. In this paper, we have opened a novel framework for feature selection, and this framework can be extended to models beyond linear regression. For instance, [24] recently extended the COMBSS framework to solve column subset selection and Nyström approximation problems.
Moreover, in the context of Bayesian predictive modeling, [23] introduced Bayesian subset selection for linear prediction or classification, diverging from the traditional emphasis on identifying a single best subset and opting instead to uncover a family of subsets, with notable members such as the smallest acceptable subset. For the more general task of variable selection, the handbook edited by [29] offers an extensive exploration of Bayesian approaches. Extending the concept of COMBSS to more general variable selection, and establishing a connection with Bayesian modelling, appear to be promising avenues for further research.
In addition, the objective function in (13) becomes ∥y − X(t ⊙ β)∥²₂/n when both penalty terms are removed, where we note that X_t β = X(t ⊙ β). Unconstrained optimization of this function over t, β ∈ R^p is studied in the area of implicit regularization; see, e.g., [20, 33, 35, 11, 36]. Gradient descent in our method minimizes over the unconstrained variable w ∈ R^p to obtain an optimal constrained variable t ∈ [0, 1]^p; in their approach, by contrast, t itself is unconstrained. Unlike the gradient descent of our method, which terminates when it is close to a stationary point on the hypercube [0, 1]^p, the gradient descent of their methods may need an early-stopping criterion based on a separate test set.

A Proofs
Proof of Theorem 1. Since both X_t^T X_t and T_t are symmetric, the symmetry of L_t is obvious. We now show that L_t is positive-definite for t ∈ [0, 1)^p by establishing … The matrix … In addition, for all t ∈ [0, 1)^p, the matrix δ(I − T_t²) is also positive-definite because δ > 0, and … which is strictly positive if t ∈ [0, 1)^p and u ∈ R^p \ {0}. Since positive-definite matrices are invertible, we have L_t^† = L_t^{-1}, and thus, …

Theorem 8 is a collection of results from the literature that we need in our proofs. Results (i) and (ii) of Theorem 8 are well known in the literature as the Banachiewicz inversion lemma (see, e.g., Tian and Takane [31]), and (iii) is its generalization to the Moore-Penrose inverse (see Corollary 3.5 (c) in [5]).
Theorem 8. Let M be a square block matrix of the form … with A being a square matrix. Let the Schur complement be S = D − BA^†C. Suppose that D is non-singular. Then the following hold.
(i) If A is non-singular, then M is non-singular if and only if S is non-singular.
(ii) If both A and S are non-singular, then …

Proof of Theorem 2. The inverse of a matrix after a permutation of rows (respectively, columns) is identical to the matrix obtained by applying the same permutation to the columns (respectively, rows) of the inverse. Therefore, without loss of generality, we assume that all the zero elements of s ∈ {0, 1}^p appear at the end, in the form s = (s_1, . . ., s_m, 0, . . ., 0), where m denotes the number of non-zeros in s. Recall that X_[s] is the matrix of size n × |s| created by keeping only the columns j of X for which s_j = 1. Thus, L_s is given by … From Theorem 8 (i), it is evident that L_s is invertible if and only if X_[s]^T X_[s] is invertible. First assume that X_[s]^T X_[s] is invertible. Then, from Theorem 8 (ii), … Now recall the notations (β_s)_+ and (β_s)_0 introduced before the statement of Theorem 2. Then, we use (27) to obtain … When X_[s]^T X_[s] is singular, by replacing the inverse with its pseudo-inverse in the above discussion, and using Theorem 8 (iii) instead of Theorem 8 (ii), we can establish the same conclusions. This is because the corresponding Schur complement for L_s is S = nI/δ, which is symmetric and positive definite.
Proof of Theorem 3. Consider a sequence t_1, t_2, . . . ∈ [0, 1)^p that converges to a point t ∈ [0, 1]^p. The convergence easily holds when t ∈ [0, 1)^p by the continuity of matrix inversion, which states that for any sequence of invertible matrices Z_1, Z_2, . . . converging to an invertible matrix Z, the sequence of inverses Z_1^{-1}, Z_2^{-1}, . . . converges to Z^{-1}. Now, using Theorem 8, we prove the convergence when some or all of the elements of the limit point t are equal to 1. Suppose t has exactly m elements equal to 1.
Further, take … with X_1 denoting the first m columns of X. Similarly, we can write … We now observe that … As a result, … and … Since t_ℓ ∈ [0, 1)^p, L_{t_ℓ} is non-singular (see Theorem 1), and hence we have … Note that the corresponding Schur complement S_ℓ = D_ℓ − B_ℓ A_ℓ^{-1} C_ℓ is non-singular by Theorem 8 (i). Furthermore, since … and hence, … Using (24), … and hence, … is equal to … Now, by defining … we have … Since T_{t,2} < I, we can see that D is symmetric positive definite and hence non-singular (this can be established just as in the proof of Theorem 1). Furthermore, the corresponding Schur complement S = D − BA^†C is symmetric positive definite, and hence non-singular. The symmetry of S is easy to see from the definition, because A and D are symmetric and B = C^T. To see that S is positive definite, for any x ∈ R^{p−m} \ {0}, let z = X_{t,2} x; since … is a projection matrix and hence positive semi-definite, S is also positive definite.
In addition, using the singular value decomposition (SVD) … Similarly, we can show that AA^†C = C. Thus, using (25), … Since lim_{ℓ→∞} X_{t_ℓ,2} = X_{t,2} and lim_{ℓ→∞} B_ℓ = B, from (28) and (29), to show that … it is enough to show that … Since S and each of the S_ℓ are non-singular, (31) holds by the continuity of matrix inversion. Now observe that … To establish (32), we need to show that X̄ = lim_{ℓ→∞} A_ℓ^{-1} X_1^T is equal to X_1^†. Towards this, define … (1/t²_{ℓ,i} − 1).
Then, we observe that both η_ℓ and ϵ_ℓ are strictly positive and tend to zero as ℓ → ∞. Thus, … where, for any two symmetric positive semi-definite matrices Z and Z′, we write … Now, for any matrix norm, denoted ∥ · ∥, using the triangle inequality, … Using the SVD of … That is, if σ_i denotes the ith singular value of X_1, then the corresponding singular value of the first term in (33) goes to zero, and thus the first term in (33) goes to zero. The second term in (33) also converges to zero because of the limit definition of the pseudo-inverse, which states that for any matrix … This completes the proof.
For proving Theorem 4, we use Lemma 1, which gives the partial derivatives of β_t with respect to the elements of t.
Lemma 1. For any t ∈ (0, 1)^p, the partial derivative ∂β_t/∂t_j, for each j = 1, . . ., p, is equal to … where Z = n^{-1} X^T X − δI and E_j is a square matrix of dimension p × p with 1 at the (j, j)th position and 0 everywhere else.
Proof of Lemma 1. The existence of β_t for every t ∈ (0, 1)^p and δ > 0 follows from Theorem 1, which states that L_t is positive-definite and hence guarantees the invertibility of L_t. Since β_t = L_t^{-1} T_t X^T y/n, using matrix calculus, for any j = 1, . . ., p, … where we used the rule for differentiating an invertible matrix, which implies … Since ∂T_t/∂t_j = E_j, and the fact that L_t = T_t Z T_t + δI/n, we get … This completes the proof of Lemma 1.
Proof of Theorem 4. To obtain the gradient ∇f_λ(t) for t ∈ (0, 1)^p, let … Consequently, … where … From the definitions of β_t and γ_t, … which is obtained using Lemma 1 and the facts that ∂T_t/∂t_j = E_j and Z = n^{-1} X^T X − δI. This in turn yields that ∂γ_t/∂t_j is equal to … where we recall that … For a further simplification, recall that c_t = L_t^{-1}(t ⊙ a_t) and d_t = Z(t ⊙ c_t). Then, from (36), the matrix ∂γ_t/∂t of dimension p × p, with jth column ∂γ_t/∂t_j, can be expressed as … From (35), with 1 representing the vector of all ones, ∇f_λ(t) can be expressed as … where … Finally, recall that g_λ(w) = f_λ(t(w)), w ∈ R^p, where the map t(w) = 1 − exp(−w ⊙ w), and … Then, from the chain rule of differentiation, for each j = 1, . . ., p, ∂g_λ(w)/∂w_j = (∂f_λ(t)/∂t_j) · 2 w_j exp(−w_j²).
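The chain-rule factor in the last display can be checked numerically. This side check (our own, not part of the paper) compares the analytic derivative of the componentwise map t_j(w_j) = 1 − exp(−w_j²) against a central finite difference:

```python
import numpy as np

def t_map(w):
    """Componentwise map from unconstrained w to t in [0, 1)."""
    return 1.0 - np.exp(-w * w)

def dt_dw(w):
    """Analytic derivative used in the chain rule: 2 w exp(-w^2)."""
    return 2.0 * w * np.exp(-w * w)

w = np.linspace(-2.0, 2.0, 9)
h = 1e-6
numeric = (t_map(w + h) - t_map(w - h)) / (2.0 * h)
max_err = np.max(np.abs(numeric - dt_dw(w)))
```

The agreement up to O(h²) confirms the factor 2 w_j exp(−w_j²) that multiplies ∂f_λ(t)/∂t_j in the gradient of g_λ.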
Proof of Theorem 5. A function g(w) is ℓ-smooth if its gradient ∇g(w) is Lipschitz continuous with Lipschitz constant ℓ > 0, that is, … for all w, w′ ∈ R^p. From Section 3 of [7], gradient descent on an ℓ-smooth function g(w), with a fixed learning rate of 1/ℓ, starting at an initial point w^(0), achieves an ϵ-stationary point within c ℓ (g(w^(0)) − g*)/ϵ² iterations, where c is a positive constant and g* = min_{w∈R^p} g(w). This result (with a different constant c) holds for any fixed learning rate smaller than 1/ℓ.

Thus, we only need to show that the objective function g_λ(w) is ℓ-smooth for some constant ℓ > 0.

B Additional Simulation Experiments
In this section, through additional simulations, we assess the efficiency of our proposed method, COMBSS using SubsetMapV1, by comparing it with some well-known existing methods, namely forward selection (FS), the Lasso, L0Learn, and MIO. These comparisons are conducted in both low-dimensional and high-dimensional settings. The simulation designs of Case 1 and Case 2 are provided in Section 7.1 of the main article.

B.1 Performance Metrics
For the comparison between the methods, we consider the prediction error as well as several variable selection accuracy metrics. The prediction error (PE) of a method is defined as … where β̂ is the estimated coefficient vector obtained by the method and β* is the true model parameter vector.
The variable selection accuracy metrics used are sensitivity (true positive rate), specificity (true negative rate), accuracy, F1 score, and the Matthews correlation coefficient (MCC) [6].
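These metrics can all be computed from the confusion matrix of the estimated support against the true support. The following is a hedged sketch with our own function name; the formulas are the standard definitions of the five metrics.

```python
import numpy as np

def selection_metrics(s_hat, s_true):
    """Sensitivity, specificity, accuracy, F1 score, and Matthews
    correlation coefficient for binary support vectors in {0,1}^p."""
    s_hat = np.asarray(s_hat, dtype=bool)
    s_true = np.asarray(s_true, dtype=bool)
    tp = np.sum(s_hat & s_true)       # selected and truly active
    tn = np.sum(~s_hat & ~s_true)     # excluded and truly inactive
    fp = np.sum(s_hat & ~s_true)
    fn = np.sum(~s_hat & s_true)
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "F1": 2 * tp / (2 * tp + fp + fn),
        "MCC": (tp * tn - fp * fn) / denom if denom > 0 else 0.0,
    }
```

MCC is the most informative single summary here because, unlike accuracy, it is not inflated by the overwhelming number of true negatives in sparse high-dimensional settings.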

B.2 Model Tuning
In the low-dimensional setting, FS and MIO were tuned over k = 0, . . ., 20. For the high-dimensional setting, FS was tuned over k = 0, . . ., 50. In this simulation, we ran MIO through the R package bestsubset offered in [15], and we fixed the default budget to 1 minute 40 seconds (per problem per k).
The dynamic grid of λ values for COMBSS is constructed identically to the description provided in Section 7.1 of the main article, except that, for each simulation in this appendix, we call COMBSS with SubsetMapV1 for only one initial point t^(0) = (0.5, . . ., 0.5)^T.
For each subset size k for FS and MIO, and each tuning parameter λ for the Lasso and COMBSS, we evaluate the mean squared error (MSE) using a validation set of 5000 samples from the true model defined in (19) in the main article. Then, for all four methods, the best model is the one with the lowest MSE on the validation set. Overall, in the low-dimensional setting, COMBSS outperforms the FS, L0Learn, and MIO methods in terms of MCC, accuracy, and prediction error. It also outperforms the Lasso in terms of MCC in both cases. Note that the Lasso generally presents lower prediction error and accuracy, as it tends to provide denser models than the other methods; as a result, the Lasso suffers from low specificity.

B.3 Simulation Results
Figures 10 and 11 present the results in the high-dimensional setting for Case 1 and Case 2, respectively. The panels in these figures display the average MCC, prediction error, F1-score, sensitivity, and specificity over 50 replications, where the vertical bars denote one standard error. We omitted MIO from these simulations due to its high computational time requirement. We do not present accuracy for the high-dimensional setting, because even a procedure that always selects the null model attains an accuracy of 990/1000 = 0.99.
In both cases, COMBSS clearly outperforms the other three methods (FS, L0Learn, and the Lasso) in terms of MCC, prediction error, and F1 score. In this setting, the Lasso again suffers from selecting dense models and thus exhibits lower specificity.

Figure 1 :
Figure 1: Illustration of the workings of COMBSS for an example data set with p = 2. Plot (a) shows the objective function of the exact method (2) for s ∈ {0, 1}². Observe that the best subsets corresponding to k = 0, k = 1, and k = 2 are (1, 1)^T, (1, 0)^T, and (0, 0)^T, respectively. Plots (b)-(d) show the objective function of our optimization method (4) for different values of the parameter λ. In each of these three plots, the curve (in red) shows the execution of a basic gradient descent algorithm that, starting at the initial point t_init = (0.5, 0.5)^T, converges towards the best subsets of sizes 0, 1, …

Figure 2 :
Figure 2: Convergence of t for a high-dimensional dataset during the execution of basic gradient descent. Solid lines correspond to β_j = 0 and the remaining 5 curves (with line style − • −) correspond to β_j ≠ 0. The dataset is generated using the model (21) of Section 7.1 with only 5 non-zero components of β, all equal to 1, at equally spaced indices between 1 and p = 1000, and with n = 100, ρ = 0.8, and a signal-to-noise ratio of 5. The parameters are λ = 0.1 and δ = n; see Section 5 for more discussion on how to choose λ and δ.

Figure 3 :
Figure 3: Illustration of how δ affects the objective function f_λ(t) (with λ = 0). The dataset consists of 10000 samples generated from the illustrative linear model used in Figure 1. For (a) and (b), 100 samples from the same dataset are used.

Figure 4 :
Figure 4: Running times of our algorithm at λ = 0.1 for an example dataset using the Adam optimizer, a popular gradient-based method. These boxplots are based on 300 replications. Here we compare running times for COMBSS with SubsetMapV1 using only conjugate gradient (ConjGrad), conjugate gradient with the Woodbury matrix identity (ConjGrad-Woodbury), and conjugate gradient with both the Woodbury matrix identity and the truncation improvement (ConjGrad-Woodbury-Trunc). For the truncation, η = 0.001. The dataset for this experiment is the same as that used for Figure 2.

Figure 7 :
Figure 7: Ability of COMBSS (using SubsetMapV2) to provide a competing best subset for subset sizes 5 and 10, in comparison to FS, the Lasso, and L0Learn. The top plots are for Case 1 and the bottom plots are for Case 2.

Figures 8 and 9
Figures 8 and 9 present the results in the low-dimensional setting for Case 1 and Case 2, respectively. The panels in these figures display the average MCC, accuracy, prediction error, F1-score, sensitivity, and specificity over 50 replications, where the vertical bars denote one standard error.

Figure 11 :
Figure 11: Performance results in terms of MCC, prediction error, F1-score, sensitivity, and specificity for Case 2 in the high-dimensional setting where n = 100, p = 1000, and ρ = 0.8.
Using the arguments from the proof of Theorem 2, without loss of generality assume that all 1s in t appear together in the first m positions, that is, t = (1, . . ., 1, …).

Figure 10: Performance results in terms of MCC, prediction error, F1-score, sensitivity, and specificity for Case 1 in the high-dimensional setting where n = 100, p = 1000, and ρ = 0.8.