Abstract
The Sorted L-One Penalized Estimator (SLOPE) is a relatively new convex optimization procedure for selecting predictors in high dimensional regression analyses. SLOPE extends LASSO by replacing the L_{1} penalty norm with a Sorted L_{1} norm, based on a nonincreasing sequence of tuning parameters. This allows SLOPE to adapt to unknown sparsity and achieve an asymptotic minimax convergence rate under a wide range of high dimensional generalized linear models. Additionally, in the case when the design matrix is orthogonal, SLOPE with the sequence of tuning parameters λ^{BH}, corresponding to the sequence of decaying thresholds of the Benjamini-Hochberg multiple testing correction, provably controls the False Discovery Rate (FDR) in the multiple regression model. In this article we provide new asymptotic results on the properties of SLOPE when the elements of the design matrix are iid random variables from the Gaussian distribution. Specifically, we provide conditions under which the asymptotic FDR of SLOPE based on the sequence λ^{BH} converges to zero and the power converges to 1. We illustrate our theoretical asymptotic results with an extensive simulation study. We also provide precise formulas describing the FDR of SLOPE under different loss functions, which sets the stage for future investigation of the model selection properties of SLOPE and its extensions.
1 Introduction
In this article we consider the classical problem of identifying important predictors in the multiple regression model
where X is the design matrix of dimension n × p and \(\epsilon \sim N(0,\sigma ^{2} I_{n\times n})\) is the noise vector. In the case when p > n, the vector of parameters b^{0} is not identifiable and can be uniquely estimated only under certain additional assumptions concerning e.g. its sparsity (i.e. the number of nonzero elements). One of the most popular and computationally tractable methods for estimating b^{0} in the case when p > n is the Least Absolute Shrinkage and Selection Operator (LASSO, Tibshirani (1996)), first introduced in the context of signal processing as Basis Pursuit Denoising (BPDN, Chen and Donoho (1994)). The LASSO estimator is defined as
where \(\Vert \cdot \Vert _{2}\) denotes the regular Euclidean norm in R^{n} and \(\Vert b\Vert _{1}={\sum }_{j=1}^{p} \vert b_{j}\vert \) is the L_{1} norm of b. If \(X^{\prime }X=I\) then LASSO is reduced to the simple shrinkage operator imposed on the elements of the vector \(\tilde Y = X^{\prime }Y\),
where
In this case the choice of the tuning parameter
allows for the control of the probability of at least one false discovery (Family Wise Error Rate, FWER) at level α.
In the context of high dimensional multiple testing, the Bonferroni correction is often replaced by the Benjamini and Hochberg (1995) multiple testing procedure aimed at the control of the False Discovery Rate (FDR). Apart from FDR control, this procedure also has appealing properties in the context of the estimation of the vector of means for the multivariate normal distribution with independent entries (Abramovich et al. 2006) or in the context of minimizing the Bayes Risk related to the 0-1 loss (Bogdan et al. 2011; Neuvial and Roquain, 2012; Frommlet and Bogdan, 2013). In the context of multiple regression with an orthogonal design, the Benjamini-Hochberg procedure works as follows:

(1)
Fix q ∈ (0, 1) and sort the elements of \(\tilde Y\) by magnitude so that
$$ \vert \tilde Y \vert_{(1)} \ge \vert \tilde Y\vert_{(2)} \ge {\ldots} \ge \vert \tilde Y\vert_{(p)} , $$ 
(2)
Identify the largest j such that
Identify the largest j such that
$$ \vert \tilde Y\vert_{(j)} \ge \lambda^{BH}_{j}= \sigma {\Phi}^{-1}\left( 1 - \frac{jq}{2p}\right), \qquad (1.3) $$
Call this index j_{BH}.

(3)
Reject H_{(j)} for every j ≤ j_{BH}.
Thus, in BH, the fixed threshold of the Bonferroni correction, λ^{Bon}, is replaced with the sequence λ^{BH} of ‘sloped’ thresholds for the sorted test statistics (see Fig. 1). This allows for a substantial increase in power and an improvement in prediction properties in the case when some of the predictors are relatively weak.
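The three steps above can be sketched in a few lines of code. This is a minimal illustration (the function name and interface are ours); `scipy.stats.norm.ppf` plays the role of Φ^{−1} in the thresholds (1.3):

```python
import numpy as np
from scipy.stats import norm

def bh_orthogonal(y_tilde, sigma, q):
    """Benjamini-Hochberg step-up selection for an orthogonal design.

    Compares the sorted magnitudes of y_tilde = X'Y against the decaying
    thresholds lambda_j^BH = sigma * Phi^{-1}(1 - j*q/(2p)) and returns the
    indices of the rejected (selected) coordinates.
    """
    p = len(y_tilde)
    j = np.arange(1, p + 1)
    lam = sigma * norm.ppf(1 - j * q / (2 * p))    # lambda^BH sequence (1.3)
    order = np.argsort(-np.abs(y_tilde))           # sort |Y~| in decreasing order
    exceeds = np.abs(y_tilde)[order] >= lam
    if not exceeds.any():
        return np.array([], dtype=int)
    j_bh = np.max(np.nonzero(exceeds)[0]) + 1      # largest j with |Y~|_(j) >= lambda_j
    return np.sort(order[:j_bh])
```

Note the step-up character of the rule: a coordinate whose magnitude falls below its own threshold may still be selected if a later sorted statistic exceeds its threshold.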
The idea of using a decreasing sequence of thresholds was subsequently used in the Sorted L-One Penalized Estimator (SLOPE, Bogdan et al. (2013) and Bogdan et al. (2015)) for the estimation of coefficients in the multiple regression model:
where \(\vert b\vert _{(1)}\geq \ldots \geq \vert b\vert _{(p)}\) are the ordered magnitudes of the elements of b and λ = (λ_{1}, … , λ_{p}) is a nonzero, nonincreasing and nonnegative sequence of tuning parameters. As noted in Bogdan et al. (2013) and Bogdan et al. (2015), the function \(J_{\lambda }(b)={\sum }_{i=1}^{p}\lambda _{i}\vert b\vert _{(i)}\) is a norm. To see this, observe that:

for any constant a ∈ R and a sequence b ∈ R^{p}, J_{λ}(ab) = |a|J_{λ}(b)

J_{λ}(b) = 0 if and only if b = 0 ∈ R^{p}.
To show the triangle inequality:

J_{λ}(x + y) ≤ J_{λ}(x) + J_{λ}(y)
let us denote by ρ the permutation of the set {1, … , p} such that
Then
and the result can be derived from the well known rearrangement inequality, according to which for any permutation ρ and any sequence x ∈ R^{p}:
Thus SLOPE is a convex optimization procedure which can be efficiently solved using classical optimization tools. It is also easy to observe that in the case λ_{1} = … = λ_{p}, SLOPE reduces to LASSO, while in the case λ_{1} > λ_{2} = … = λ_{p} = 0, the Sorted L-One norm J_{λ} reduces to the \(L_{\infty }\) norm.
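These two limiting cases are easy to verify numerically. The following sketch (the function name is ours) evaluates J_λ and checks both reductions:

```python
import numpy as np

def sorted_l1_norm(b, lam):
    """J_lambda(b) = sum_i lam_i * |b|_(i), where |b|_(1) >= ... >= |b|_(p)."""
    return float(np.sum(np.sort(np.abs(b))[::-1] * lam))

b = np.array([3.0, -1.0, 2.0])
# lambda_1 = ... = lambda_p: the penalty is the plain L1 norm (LASSO)
assert sorted_l1_norm(b, np.array([1.0, 1.0, 1.0])) == np.abs(b).sum()
# lambda = (1, 0, ..., 0): the penalty is the L_infinity norm
assert sorted_l1_norm(b, np.array([1.0, 0.0, 0.0])) == np.abs(b).max()
```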
Figure 2 illustrates different shapes of the unit spheres corresponding to different versions of the Sorted L-One Norm. Since the solutions of SLOPE tend to occur on the edges of the respective spheres, Fig. 2 demonstrates the large flexibility of SLOPE with respect to dimensionality reduction. In the case when λ_{1} = … = λ_{p}, SLOPE reduces dimensionality by shrinking the coefficients to zero. In the case when λ_{1} > λ_{2} = … = λ_{p} = 0, the reduction of dimensionality is performed by shrinking the coefficients towards each other (since the edges of the \(l_{\infty }\) sphere correspond to vectors b such that at least two coefficients are equal to each other). In the case when the sequence of thresholding parameters is monotonically decreasing, SLOPE reduces the dimensionality both ways: it shrinks the coefficients towards zero and towards each other. Thus it returns sparse and stable estimators, which have recently been proved to achieve the minimax rate of convergence in the context of sparse high dimensional regression and logistic regression (Su and Candès, 2016; Bellec et al. 2018; Abramovich and Grinshtein, 2017).
From the perspective of model selection, it has been proved in Bogdan et al. (2013) and Bogdan et al. (2015) that SLOPE with the vector of tuning parameters λ^{BH} (1.3) controls FDR at level q under the orthogonal design. This is no longer true if the inner products between columns of the design matrix are different from zero, which almost always occurs if the predictors are random variables. Similar problems with the control of the number of False Discoveries occur for LASSO. Specifically, in Bogdan et al. (2013) it is shown that in the case of the Gaussian design with independent predictors, LASSO with a fixed parameter λ can control FDR only if the true parameter vector is sufficiently sparse. The natural question is whether there exists a bound on the sparsity under which SLOPE can control FDR. In this article we address this question and report a theoretical result on the asymptotic control of FDR by SLOPE under the Gaussian design. Our main theoretical result states that by multiplying the sequence λ^{BH} by a constant larger than 1, one can achieve full asymptotic power and FDR converging to 0 if the number k(n) of nonzero elements in the true vector of regression coefficients satisfies \(k=o\left (\sqrt {\frac {n}{\log p}}\right )\) and the values of these nonzero elements are sufficiently large. We also report results of a simulation study which suggest that the assumption on the signal sparsity is necessary when using the λ^{BH} sequence but is unnecessarily strong when using the heuristic adjustment of this sequence proposed in Bogdan et al. (2015). Simulations also suggest that the asymptotic FDR control is guaranteed independently of the magnitude of the nonzero regression coefficients.
2 Asymptotic Properties of SLOPE
2.1 False Discovery Rate and Power
Let us consider the multiple regression model (1.1) and let \(\hat b\) be some sparse estimator of b^{0}. The numbers of false, true and all rejections, and the number of nonzero elements in b^{0} (respectively: V, TR, R, k), are defined as follows
The False Discovery Rate is defined as:
and the Power as:
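Concretely, FDR is the expectation of the False Discovery Proportion V/(R ∨ 1) (this ratio reappears in the bound (3.18)) and the Power Π is the expectation of TR/k. The counts for a single realization can be computed as follows; this is an illustrative sketch and the function name is ours:

```python
import numpy as np

def rejection_counts(b_hat, b_true):
    """Counts from Section 2.1: V false rejections, TR true rejections,
    R all rejections, and k nonzero true coefficients."""
    sel = b_hat != 0
    true = b_true != 0
    V = int(np.sum(sel & ~true))    # selected but truly zero
    TR = int(np.sum(sel & true))    # selected and truly nonzero
    R = int(np.sum(sel))            # all selected
    k = int(np.sum(true))           # truly nonzero
    # FDP = V / max(R, 1); FDR and Power average FDP and TR/k over realizations
    return V, TR, R, k

b_true = np.array([1.0, 2.0, 0.0, 0.0])
b_hat = np.array([0.9, 0.0, 0.3, 0.0])
V, TR, R, k = rejection_counts(b_hat, b_true)   # (1, 1, 2, 2)
```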
2.2 Asymptotic FDR and Power
We formulate our asymptotic results under the setup where n and p diverge to infinity and p can grow faster than n. Similarly as in the case of the support recovery results for LASSO, we need to impose a constraint on the sparsity of b^{0}, which is measured by the number of truly important predictors \(k = \#\{i: {b^{0}_{i}} \neq 0\}\). Thus we consider the sequence of linear models of the form (1.1) indexed by n and with their "dimensions" characterized by the triplets (n,p_{n},k_{n}). For the sake of clarity, in what follows we drop the subscripts on p and k.
The main result of this article is as follows:
Theorem 2.1.
Consider the linear model of the form (1.1) and assume that all elements of the design matrix \(X \in \mathbb {R}^{n \times p}\) are i.i.d. random variables from the normal N(0,1/n) distribution. Moreover, suppose there exists δ > 0 such that
and suppose
Then for any q ∈ (0,1), the SLOPE procedure with the sequence of tuning parameters
has the following properties
Proof
The proof of Theorem 2.1 makes extensive use of the results on the asymptotic properties of SLOPE reported in Su and Candès (2016). The roadmap of this proof is provided in Section 3, while the proof details can be found in the Appendix and in supplementary materials. □
Remark 2.2 (The assumption on the design matrix).
The assumption that the elements of X are i.i.d. random variables from the normal N(0,1/n) distribution is technical. We expect that the results can be generalized to the case where the predictors are independent, sub-Gaussian random variables. The assumption that the variance is equal to 1/n can be satisfied by an appropriate scaling of such a design matrix. Compared to the classical standardization towards unit variance, our scaling allows for the control of FDR with a sequence of tuning parameters λ which does not depend on the number of observations n. If the data are standardized such that \(X_{ij}\sim N(0,1)\), then Theorem 2.1 holds when the sequence of tuning parameters (2.7) and the lower bound on the signal magnitude (2.5) are both multiplied by n^{− 0.5}.
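The effect of this scaling can be seen in a quick numerical check (a sketch with arbitrarily chosen n and p): with i.i.d. N(0,1/n) entries the columns of X have Euclidean norm close to 1, which is why the tuning sequence need not be rescaled with n.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 1000
# Entries i.i.d. N(0, 1/n); each squared column norm is a chi^2_n / n average,
# so the column norms concentrate around 1.
X = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, p))
col_norms = np.linalg.norm(X, axis=0)
assert abs(col_norms.mean() - 1.0) < 0.05
assert col_norms.std() < 0.1     # fluctuations of order 1/sqrt(2n)
```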
Remark 2.3 (The assumption on the signal strength).
Our assumption on the signal strength is not very restrictive. When using the classical standardization of the explanatory variables (i.e. assuming that \(X_{ij}\sim N(0,1)\)), this assumption allows for the magnitude of the signal to converge to zero at such a rate that
This assumption is needed to obtain power that converges to 1. The proof of Theorem 2.1 implies that it is not needed for the asymptotic FDR control if k is bounded by some constant. Moreover, our simulations suggest that the asymptotic FDR control holds independently of the signal strength provided k satisfies assumption (2.6). The proof of this conjecture would require a substantial refinement of our proof techniques and remains an interesting topic for future work.
2.3 Simulations
In this section we present results of the simulation study. The data are generated according to the linear model:
where elements of the design matrix X are i.i.d. random variables from the normal N(0,1/n) distribution and 𝜖 is independent of X and comes from the standard multivariate normal distribution N(0,I). The parameter vector b^{0} has k nonzero elements and p − k zeroes.
We compare three methods, two versions of SLOPE and LASSO:

1.
SLOPE with the sequence of tuning parameters given by formula (2.7), denoted by "SLOPE".

2.
SLOPE with the sequence:
$$ \lambda_{i}(q) = \left\{\begin{array}{ccc} \sigma {\Phi}^{-1}(1-q/2p) & if & i=1\\ \min\left( \lambda_{i-1}, \sigma{\Phi}^{-1}(1-qi/2p)\sqrt{1+\frac{{\sum}_{j<i}{\lambda_{j}^{2}}}{n-i-2}}\right) & if & i>1 . \end{array}\right. $$This sequence was proposed in Bogdan et al. (2015) as a heuristic correction which takes into account the influence of the cross products between the columns of the design matrix X. We refer to this procedure as heuristic SLOPE ("SLOPE_heur").

3.
LASSO with the tuning parameter:
$$ \lambda= \sigma (1+\delta) {\Phi}^{-1}\left( 1 - \frac{q}{2p}\right) . \qquad (2.8)$$
The tuning parameter for LASSO is equal to the first element of the tuning sequence of SLOPE, so FDR of LASSO and SLOPE are approximately equal when k = 0.
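Both tuning sequences can be sketched as follows. The function names are ours, and we write (2.7) as λ_i = σ(1+δ)Φ^{−1}(1 − iq/(2p)), consistent with the statement above that the LASSO parameter (2.8) equals its first element:

```python
import numpy as np
from scipy.stats import norm

def lambda_theorem(p, q, sigma=1.0, delta=0.0):
    """Sequence (2.7), as inferred from (2.8): sigma*(1+delta)*Phi^{-1}(1 - i*q/(2p))."""
    i = np.arange(1, p + 1)
    return sigma * (1 + delta) * norm.ppf(1 - i * q / (2 * p))

def lambda_heuristic(p, q, n, sigma=1.0):
    """Heuristic sequence of Bogdan et al. (2015): each element capped by its
    predecessor and inflated by a cross-product correction factor."""
    lam = np.empty(p)
    lam[0] = sigma * norm.ppf(1 - q / (2 * p))
    for i in range(2, p + 1):                      # 1-based index i > 1
        infl = np.sqrt(1 + np.sum(lam[:i - 1] ** 2) / (n - i - 2))
        lam[i - 1] = min(lam[i - 2], sigma * norm.ppf(1 - q * i / (2 * p)) * infl)
    return lam

lam_thm = lambda_theorem(100, 0.2)                 # delta = 0 gives lambda^BH
lam_heur = lambda_heuristic(100, 0.2, n=500)
assert abs(lam_thm[0] - lam_heur[0]) < 1e-12       # same first element
assert np.all(np.diff(lam_heur) <= 0)              # nonincreasing by construction
```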
Figure 3 presents the FDR and Power of the compared procedures. First, let us concentrate on the behavior of SLOPE when the sequence of tuning parameters is defined as in Theorem 2.1 (green line in each subplot). The green rectangle contains plots where the sequences of tuning parameters and the signal sparsity k(n) satisfy the assumptions of Theorem 2.1. In this region the FDR of SLOPE slowly converges to zero and the Power converges to 1. Moreover, FDR is close to or below the nominal level q = 0.2 for the whole range of considered values of n. It is also clear that larger values of δ lead to more conservative versions of SLOPE.
In the red area the assumptions of Theorem 2.1 are violated. Here we can see that when δ > 0 and α = 0.5, FDR is still a decreasing function of n but the rate of this decrease is slow and FDR is substantially above level q = 0.2 even for n = 2000. In the case when δ = 0 (i.e. when the original λ^{BH} sequence is used), FDR stabilizes at the value which exceeds the nominal level.
Let us now turn our attention to other methods. We can observe that LASSO is the most conservative procedure and that, as expected, its FDR converges to 0 when k increases. Since the simulated signals are strong, this does not lead to a substantial decrease of Power as compared to SLOPE. Interestingly, SLOPE with a heuristic choice of tuning parameters seems to provide a stable FDR control over the whole range of considered parameter values. This suggests that the upper bound on k provided in assumption (2.6) could be relaxed when working with this heuristic sequence. The proof of this claim remains an interesting topic for further research.
Figure 4 presents simulations for the case when \({b^{0}_{1}}=\ldots ={b^{0}_{k}}=0.9\sqrt {2 \log p}\), i.e. when the signal magnitude does not satisfy the assumption (2.5). Here FDR of SLOPE behaves similarly as in the case of strong signals. These results suggest that the assumption on the signal strength might not be necessary in the context of FDR control. Figure 4 also illustrates a strikingly good control of FDR by the SLOPE with the heuristically adjusted sequence of tuning parameters. LASSO is substantially more conservative than both versions of SLOPE. Its FDR converges to zero, which in the case of such moderate signals leads to a substantial decrease of Power as compared to SLOPE.
3 Roadmap of the Proof
In the first part of this section we characterize the support of the SLOPE estimator. The proofs of the Theorems presented in this part rely only on the differentiability and convexity of the loss function. Therefore we present them in a general form, which will be useful for further work on extensions of SLOPE to Generalized Linear Models or Gaussian Graphical Models.
3.1 Support of the SLOPE estimator under the general loss function
Let us consider the following generalization of SLOPE:
where l(b) is a convex and differentiable loss function (e.g. \(0.5\Vert Y-Xb{\Vert ^{2}_{2}}\) for multiple linear regression). Let R denote the number of nonzero elements of \(\hat b\).
The following Theorems 3.1 and 3.2 characterize the events \(\{\hat {b}_{i} \neq 0\}\) and {R = r} by using the gradient U(b) of the negative loss function − l(b):
Additionally, for a > 0 we define the vector T(a) as
Thus, \(T_{i}(a)=U_{i}(\hat {b})\) if \(\hat {b}_{i}=0\). Also, the additional term \(a\hat {b}_{i}\) has the same sign as \(U_{i}(\hat {b})\), so \(T_{i}(a)>U_{i}(\hat b)\) if \(\hat {b}_{i}\neq 0\). By calculating the subgradient of the objective function of the LASSO estimator, it is possible to check that LASSO selects those variables for which the respective coordinates of T(a) exceed the value of the tuning parameter λ. In Bogdan et al. (2015) the support of \(\hat b\) for SLOPE under the orthogonal design is provided. It is shown there that, similarly as in the case of the Benjamini-Hochberg correction for multiple testing, it is not sufficient to compare the ordered coordinates of T(a) to the respective values of the decaying sequence of tuning parameters: this simple operation could eliminate regressors whose value of T(a) is larger than that of some regressors retained in the model. The SLOPE estimator preserves the ordering of T(a). Thus, identification of the SLOPE support is more involved and requires the introduction of the following sets H_{r}: for r ∈{1, … , p} we define
Theorem 3.1.
Consider the optimization problem (3.1) with an arbitrary sequence \(\lambda _{1}\geqslant \lambda _{2} \geqslant ... \geqslant \lambda _{p} \geqslant 0\). Assume that l(b) is a convex and differentiable function. Then for any a > 0,
Theorem 3.2.
Consider the optimization problem (3.1) with an arbitrary sequence \(\lambda _{1}\geqslant \lambda _{2} \geqslant ... \geqslant \lambda _{p} \geqslant 0\). Assume that l(b) is a convex and differentiable function and that R = r. Then for any a > 0 it holds:
and
Moreover if we assume that \(\lambda _{1}>\lambda _{2}>...>\lambda _{p}\geqslant 0\), then:
The proofs of Theorems 3.1 and 3.2 are provided in the supplementary materials.
3.2 FDR of SLOPE for the general loss function
Corollary 3.3.
Consider the optimization problem (3.1) with an arbitrary sequence \(\lambda _{1}\geqslant \lambda _{2} \geqslant ... \geqslant \lambda _{p} \geqslant 0\). Assume that l(b) is a convex and differentiable function. Then for any a > 0, the FDR of SLOPE is equal to:
Proof
Let us denote the support of the true parameter vector b^{0} by:
and a set that is the complement of S in {1, ... , p} by:
Directly from the definition we obtain:
and Corollary 3.3 is a direct consequence of Theorems 3.1 and 3.2. □
3.3 Proof of Theorem 2.1
We now focus on the multiple regression model (1.1). Elementary calculations show that in this case the vector U (for def. see (3.2)) takes the form
Let us denote for simplicity
and introduce the following notation for the components of T:
and
Naturally, T = M + Γ. Due to (3.4), we can express FDR for linear regression in the following way:
A deeper analysis shows that under the assumptions of Theorem 2.1, the FDR expression (3.4) can be simplified. Corollary 3.5, stated below, follows directly from Lemma 4.4 in Su and Candès (2016) (see the supplementary materials) and shows that, with large probability, only the first elements of the summation over r are different from zero. Furthermore, the following Lemma 3.6 shows that the elements of the vector Γ are sufficiently small, so that we can focus on the properties of the vector M.
Definition 3.4 (Resolvent set, Su and Candès (2016)).
Fix S = supp(b^{0}) of cardinality k, and an integer \(\tilde k^{*}\) obeying \(k < \tilde k^{*} < p\). The set \(\tilde S^{*} = \tilde S^{*}(S, \tilde k^{*})\) is said to be a resolvent set if it is the union of S and the \(\tilde k^{*}k\) indexes with the largest values of \(\vert X_{i}^{\prime } \epsilon \vert \) among all i ∈{1, ... , p}∖ S.
Let us introduce the following notation for the sequence of events on which the union of the supports of b^{0} and \(\hat b\) is contained in \(\tilde S^{*}\):
Corollary 3.5.
Suppose the assumptions of Theorem 2.1 hold. Then there exists a deterministic sequence k^{∗} such that \(k^{*}/p \rightarrow 0\), \(((k^{*})^{2} \log p)/n \rightarrow 0\) and:
Corollary 3.5 follows directly from Lemma 4.4 in Su and Candès (2016) (see Lemma S.2.2 in the supplementary materials and the discussion below). From now on k^{∗} will denote the sequence satisfying Corollary 3.5.
Lemma 3.6.
Let us denote by \(Q_{2}(n, \tilde {\gamma }(n))\) the sequence of events on which the \(l_{\infty }\) norm of the vector Γ is smaller than \(\tilde \gamma (n)\):
If the assumptions of Theorem 2.1 hold then there exists a constant C_{q}, dependent only on q, such that the sequence \(\gamma (n) = C_{q} \sqrt {\frac {(k^{*})^{2} \log p}{n}} \lambda ^{BH}_{k^{*}}\) satisfies:
The proof of Lemma 3.6 is provided in the Appendix.
Let us denote by Q_{3}(n,u) the sequence of events on which the l_{2} norm of the vector 𝜖 divided by \(\sigma \sqrt {n}\) is smaller than 1 + 1/u:
The following Corollary 3.7 is a consequence of the well known results on the concentration of the Gaussian measure (see Theorem S.2.4 in the supplementary materials).
Corollary 3.7.
Let k^{∗} = k^{∗}(n) be the sequence satisfying Corollary 3.5. Then
From now on, for simplicity, we shall denote by Q_{1}, Q_{2} and Q_{3} the sequences Q_{1}(n,k^{∗}), Q_{2}(n,γ) and Q_{3}(n,k^{∗}) respectively. Moreover, let us introduce the following notation for the intersection of Q_{1}, Q_{2} and Q_{3}:
Using the event Q, we can bound FDR in the following way:
The first equality follows from the fact that \(1=\mathbb {1}_{Q}+\mathbb {1}_{Q^{c}}\). The inequality uses the fact that \(\frac {V}{R\vee 1} \leqslant 1\). The second equality is a consequence of formula (3.7) applied to the second term. Naturally, due to conditions (3.12), (3.14) and (3.16), we obtain \(P(Q^{c})\rightarrow 0\). Therefore, we can focus on the properties of the second term in (3.18). Notice that Q_{1} implies that \( R\leqslant k^{*}\) (since \(supp(\hat b) \subset S^{*}\)), therefore we can limit the summation over r to the first k^{∗} elements:
Furthermore, according to Theorems 3.1 and 3.2:
We now introduce the useful notation:

a vector \(M^{(i)} = (M^{(i)}_{1}, ... , M^{(i)}_{p})^{\prime }\) is the following modification of M
$$ M^{(i)}_{j}:= \left\lbrace\begin{array}{l} M_{j}~\text{for}~ j\neq i \\ \infty~\text{for}~ j=i\end{array}\right. $$ 
a set \(H_{r}^{\gamma }\), which is a generalization of the set H_{r} (3.3),
$$ \begin{array}{@{}rcl@{}} H_{r}^{\gamma} &=& \{ w \in \mathbb{R}^{p}: \forall_{j \leqslant r} \sum\limits_{i=j}^{r} (\lambda_{i} - \gamma)\\ &<& \sum\limits_{i=j}^{r} \vert w \vert_{(i)} ~\text{and}~ \forall_{j \geqslant r+1} \sum\limits_{i=r+1}^{j}(\lambda_{i}+ \gamma) \geqslant \sum\limits_{i=r+1}^{j} \vert w\vert_{(i)} \}. \end{array} $$
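As a concrete reading of this definition, a membership check for \(H_{r}^{\gamma }\) can be sketched as follows. The function name is ours, and we take the first condition with λ_i − γ, mirroring the λ_i + γ in the second:

```python
import numpy as np

def in_H_r_gamma(w, lam, r, gamma=0.0):
    """Check w in H_r^gamma: for every j <= r the partial sums of
    (lam_i - gamma) over i = j..r lie strictly below those of |w|_(i),
    and for every j >= r+1 the partial sums of (lam_i + gamma) over
    i = r+1..j dominate those of |w|_(i). gamma = 0 recovers H_r."""
    w_sorted = np.sort(np.abs(w))[::-1]       # |w|_(1) >= ... >= |w|_(p)
    p = len(w)
    for j in range(1, r + 1):                 # 1-based j <= r
        if np.sum(lam[j - 1:r] - gamma) >= np.sum(w_sorted[j - 1:r]):
            return False
    for j in range(r + 1, p + 1):             # 1-based j >= r+1
        if np.sum(lam[r:j] + gamma) < np.sum(w_sorted[r:j]):
            return False
    return True
```

For instance, with lam = (3, 2, 1) the vector w = (4, 0.5, 0.1) lies in H_1 but not in H_2, reflecting that only its largest coordinate clears the thresholds.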
Lemma 3.8 (see the Appendix for the proof) allows for the replacement of an event {T ∈ H_{r},T_{i} > λ_{r},Q_{2}} by an event which depends only on M^{(i)}.
Lemma 3.8.
If T ∈ H_{r}, T_{i} > λ_{r} and Q_{2} occurs, then \(M^{(i)} \in H_{r}^{\gamma }\).
Lemma 3.8, together with the fact that under Q_{2} the inequality T_{i} > λ_{r} implies M_{i} > λ_{r} − γ, allows for the conclusion that
The following Lemma 3.9 (see the Appendix for the proof) describes the asymptotic behavior of the right-hand side of (3.21):
Lemma 3.9.
Under the assumptions of Theorem 2.1, it holds:
The proof of Lemma 3.9 is based on several properties. First,
can be well approximated by
This approximation is a consequence of the fact that conditionally on 𝜖, M^{(i)} and M_{i} are independent. Second, for i ∈ S^{c}:
can be well approximated by
where Φ(⋅) is the cumulative distribution function of the standard normal distribution. The first approximation is a consequence of the fact that for i ∈ S^{c}, \(M_{i} = X_{i}^{\prime }\epsilon \). Thus, conditionally on 𝜖, M_{i} has a normal distribution with mean equal to 0 and variance equal to \(\frac {\Vert \epsilon \Vert ^{2}}{n}\), which is close to σ^{2} (see Corollary 3.7). The second approximation relies on the well known formula
where ϕ(⋅) is the density of the standard normal distribution and o(c) converges to zero as c diverges to infinity.
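The quality of this Gaussian tail approximation, 1 − Φ(c) ≈ ϕ(c)/c, is easy to verify numerically (a quick sketch using scipy; the relative error behaves like 1/c²):

```python
import numpy as np
from scipy.stats import norm

# Mills-ratio approximation: 1 - Phi(c) is close to phi(c)/c for large c
for c in [3.0, 5.0, 10.0]:
    exact = norm.sf(c)            # 1 - Phi(c)
    approx = norm.pdf(c) / c      # phi(c)/c
    rel_err = abs(approx - exact) / exact
    assert rel_err < 1.0 / c**2   # error shrinks like 1/c^2
```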
Lastly
is approximately equal to
This approximation is a consequence of the fact that \(H_{r}^{\gamma }\) does not differ much from H_{r} and the family of sets H_{r} with r ∈{1, … , p} is disjoint. By applying the above approximations we obtain:
Lemma 3.9, together with the fact that under the assumptions of Theorem 2.1 \(P(Q^c)\rightarrow 0\), provides \(FDR \rightarrow 0\). One can notice that \(\left (\frac {k^{*}q}{p} \right )^{(1+\delta )^{2}-1}\) is the factor responsible for the convergence of FDR to 0. It remains an open question whether in the definition of the λ sequence (2.7) the constant δ can be replaced by a sequence converging to 0 at such a rate that the asymptotic FDR is exactly equal to the nominal level q. The proof of this assertion would require a refinement of the bounds provided in Su and Candès (2016) and we leave this as a topic for future research.
Now we argue that under our assumptions the Power of SLOPE converges to 1. Recall that \(TR = \#\{j: b_j^0 \neq 0 \text{ and } \hat b_j \neq 0\}\) denotes the number of true rejections. Observe that
Naturally \({\Pi } \leqslant 1\), therefore it suffices to show that \(P\left (\bigcap _{i \in S}\{\hat b_i \neq 0 \}\right ) \rightarrow 1\).
Lemma 3.10.
Under the assumptions of Theorem 2.1, it holds:
Proof
The proof of Lemma 3.10 is provided in the Appendix. It is based on the sequence of the following inequalities:
The result follows by the conditional independence of \(X_i^{\prime }\epsilon \) (given 𝜖) and the classical approximation (3.23). □
4 Discussion
In this article we provide new asymptotic results on the model selection properties of SLOPE in the case when the elements of the design matrix X come from the normal distribution. Specifically, we provide conditions on the sparsity and the magnitude of true signals such that the FDR of SLOPE based on the sequence λ^{BH}, corresponding to the thresholds of the Benjamini-Hochberg correction for multiple testing, converges to 0 and the Power converges to 1. We believe these results can be extended to sub-Gaussian design matrices with independent columns, which is the topic of ongoing research. Additionally, the general results on the support of SLOPE open the way for investigation of the properties of SLOPE under arbitrary convex and differentiable loss functions.
In simulations we compared SLOPE based on the sequence λ^{BH} with SLOPE based on the heuristic sequence proposed in Bogdan et al. (2015) and with LASSO with the tuning parameter λ equal to the first value of the tuning sequence for SLOPE. When regressors are independent and the vector of true regression coefficients is relatively sparse, the comparison between SLOPE and LASSO bears similarity to the comparison between the Bonferroni and the Benjamini-Hochberg corrections for multiple testing. When k = 0 both methods perform similarly and control FDR (which for k = 0 is equal to the Family Wise Error Rate) close to the nominal level. When k increases, the FDR of LASSO converges to zero, which however comes at the price of a substantial loss of Power for moderately strong signals. Concerning the two versions of SLOPE, our simulations suggest that the heuristic sequence allows for a more accurate FDR control over a wide range of sparsity values. We believe the techniques developed in this article form a good foundation for the analysis of the statistical properties of the heuristic version of SLOPE, which we consider an interesting topic for further research.
Our assumptions on the independence of predictors and the sparsity of the vector of true regression coefficients are restrictive, which is related to the well known problems with FDR control by LASSO (Bogdan et al. 2013; Su et al. 2017). In the case of LASSO, these problems can be solved by using adaptive or reweighted LASSO (Zou, 2006; Candès et al. 2008), which allow for consistent model selection under much weaker assumptions. In these modifications the values of the tuning parameters corresponding to predictors which are deemed important (based on the values of initial estimates) are reduced, which substantially reduces the bias due to shrinkage and improves model selection accuracy. Recently, Jiang et al. (2019) developed the Adaptive Bayesian version of SLOPE (ABSLOPE). According to the results of simulations presented in Jiang et al. (2019), ABSLOPE controls FDR under a much wider set of scenarios than the regular SLOPE, including examples with strongly correlated predictors. We believe our proof techniques can be extended to derive asymptotic FDR control for ABSLOPE, which we leave as an interesting topic for future research.
References
Abramovich, F. and Grinshtein, V. (2017). High-dimensional classification by sparse logistic regression. IEEE Transactions on Information Theory. https://doi.org/10.1109/TIT.2018.2884963.
Abramovich, F., Benjamini, Y., Donoho, D.L. and Johnstone, I.M. (2006). Adapting to unknown sparsity by controlling the false discovery rate. Annals of Statistics 34, 2, 584–653.
Bellec, P.C., Lecué, G. and Tsybakov, A.B. (2018). Slope meets lasso: improved oracle bounds and optimality. Annals of Statistics 46, 6B, 3603–3642.
Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B 57, 1, 289–300.
Bogdan, M., Chakrabarti, A., Frommlet, F. and Ghosh, J.K. (2011). Asymptotic Bayes optimality under sparsity of some multiple testing procedures. Annals of Statistics 39, 1551–1579.
Bogdan, M., van den Berg, E., Su, W. and Candès, E.J. (2013). Statistical estimation and testing via the ordered ℓ_{1} norm. Technical Report 2013-07, Department of Statistics, Stanford University.
Bogdan, M., van den Berg, E., Sabatti, C., Su, W. and Candès, E.J. (2015). Slope – adaptive variable selection via convex optimization. Annals of Applied Statistics 9, 3, 1103–1140.
Candès, E.J., Wakin, M.B. and Boyd, S.P. (2008). Enhancing sparsity by reweighted ℓ_{1} minimization. J. Fourier Anal. Appl. 14, 877–905.
Chen, S. and Donoho, D. (1994). Basis pursuit. In Proceedings of the 28th Asilomar Conference on Signals, Systems and Computers, vol. 1, IEEE, pp. 41–44.
Frommlet, F. and Bogdan, M. (2013). Some optimality properties of FDR controlling rules under sparsity. Electronic Journal of Statistics 7, 1328–1368.
Jiang, W., Bogdan, M., Josse, J., Miasojedow, B., Rockova, V. and the TraumaBase Group (2019). Adaptive Bayesian SLOPE: high-dimensional model selection with missing values. arXiv:1909.06631.
Neuvial, P. and Roquain, E. (2012). On false discovery rate thresholding for classification under sparsity. Annals of Statistics 40, 2572–2600.
Su, W. and Candès, E. (2016). SLOPE is adaptive to unknown sparsity and asymptotically minimax. Annals of Statistics 44, 3, 1038–1068. https://doi.org/10.1214/15-AOS1397.
Su, W., Bogdan, M. and Candès, E.J. (2017). False discoveries occur early on the lasso path. Annals of Statistics 45, 5, 2133–2150.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 1, 267–288.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101, 476, 1418–1429. doi: https://doi.org/10.1198/016214506000000735.
Acknowledgments
We would like to thank the Associate Editor and the Referees for many constructive comments. Also, we would like to thank Damian Brzyski for preparing Fig. 2 and Artur Bogdan for helpful suggestions. The research was funded by the grant of the Polish National Center of Science Nr 2016/23/B/ST1/00454.
Appendix
Proof
The proof of Lemma 3.6
We prove a stronger condition: \(P(Q_1 \cap Q_2) \rightarrow 1\).
Denote by \(X_I,\hat {b}_I,b^0_I\) a submatrix (subvector) of \(X ,\hat {b},b^0\) consisting of columns (vector elements) with indexes in a set I.
Observe that on the event \(Q_1 = \{Supp(b^0)\cup Supp(\hat {b})\subseteq S^{*}\}\) we can express \(\max \limits _i\vert {\Gamma }_i \vert \) in the following way:
We show that both elements are bounded by \(C_q \sqrt {\frac {(k^{*})^2 \log p}{n}} \lambda ^{BH}_{k^{*}}\) with the probability tending to 1. The bound on the first component is a direct corollary from Lemma A.12 proved in Su and Candès (2016).
Corollary 5.1.
Under the assumptions of Theorem 2.1, there exists a constant C_{q} depending only on q such that:
with the probability tending to 1.
Lemma A.12 of Su and Candès (2016) and the proof of Corollary 5.1 can be found in the supplementary materials (see Lemma S.2.5 and the discussion below).
It remains to be proven that the second component is also bounded by \(C_q \sqrt {\frac {(k^{*})^2 \log p}{n}} \lambda ^{BH}_{k^{*}}\) with the probability tending to 1. To do this, we use Lemma A.11 proved in Su and Candès (2016) which provides bounds on the largest and the smallest singular values of \(X_{S^{*}}\) (for details see Lemma S.2.6 in the supplementary materials).
Let \(X_{S^{*}} = G{\Sigma } V^{\prime }\) be the Singular Value Decomposition (SVD) of the matrix \(X_{S^{*}}\), where \(G \in \mathbb {M}_{n \times n}\) is a unitary matrix, \({\Sigma } \in \mathbb {M}_{n \times k^{*}}\) is a diagonal matrix with nonnegative real numbers on the diagonal and \(V \in \mathbb {M}_{k^{*} \times k^{*}}\) is a unitary matrix. Moreover, let us denote by σ_{i},σ_{min} and σ_{max} the ith, the smallest and the largest singular value.
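As a numerical aside (not part of the proof; the dimensions and random seed are arbitrary choices of ours), the decomposition above, and the identification of the largest singular value with the square root of the largest eigenvalue of \(X^{\prime}X\), can be illustrated for a small Gaussian design matrix:

```python
import numpy as np

# Illustrative sketch only: the SVD X = G Sigma V' used in the proof,
# computed for a small random Gaussian design. n = 50 and k_star = 5
# are arbitrary choices; columns are scaled by 1/sqrt(n) as in the paper.
rng = np.random.default_rng(0)
n, k_star = 50, 5
X = rng.standard_normal((n, k_star)) / np.sqrt(n)

# G is n x n unitary, sing_vals holds the k_star singular values
# (the diagonal of Sigma), Vt is the k_star x k_star matrix V'.
G, sing_vals, Vt = np.linalg.svd(X, full_matrices=True)

# The largest singular value equals the square root of the largest
# eigenvalue of the positive semidefinite matrix X'X.
eig_max = np.linalg.eigvalsh(X.T @ X).max()
assert np.isclose(sing_vals.max(), np.sqrt(eig_max))
```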
Let u be an arbitrary unit vector. Due to the relation between the \(l_{\infty }\) and the l_{2} vector norms, and between the l_{2} vector norm and the ∥⋅∥_{2} matrix norm, it holds:
Using the SVD of the matrix \(X_{S^{*}}\) and the submultiplicativity of the ∥⋅∥_{2} matrix norm we continue:
By definition, the maximal singular value of a matrix A equals the square root of the largest eigenvalue of the positive semidefinite matrix \(A^{\prime }A\). Therefore, in our case:
where σ_{i} is the ith singular value of \(X_{S^{*}}\). Due to Lemma A.11 of Su and Candès (2016) (see Lemma S.2.6 in the supplementary materials), we obtain that for some constants C_{1} and C_{2}
and in consequence for an arbitrary unit vector u
with probability at least \(1 - 2e^{-k^{*} \log (p/k^{*})/2} - (\sqrt {2}ek^{*}/p)^{k^{*}}\).
On the other hand, due to Theorem 1.2 in Su and Candès (2016), for any constant δ_{1} > 0
Finally, observe that
Thus, the relations (5.2) and (5.3) imply that for certain constants C_{1} and C_{2}:
with the probability tending to 1.
The inequality (5.4), together with (5.1), provides the thesis of the Lemma. □
Proof of Lemma 3.8
In order to prove Lemma 3.8, we introduce a modification of the vector T analogous to that of the vector M. Denote \(T^{(i)} =(T^{(i)}_1, ... , T^{(i)}_p)^{\prime }\):
In the first step we show that:
Proposition 5.2.
If T ∈ H_{r} and T_{i} > λ_{r}, then T^{(i)} ∈ H_{r}.
Proof
Let us recall the definition of a set H_{r}:
On the one hand, we know that T_{i} > λ_{r}. On the other hand, T ∈ H_{r} implies that \(\vert T \vert _{(r+1)} \leqslant \lambda _{r+1}\) (the second condition in the definition of H_{r} for j = r + 1). Together, these inequalities imply that \(\vert T_i \vert \geqslant \vert T \vert _{(r)}\). Hence we only have to show that:
To see that the above is true we have to consider the relations between the r largest elements of the vectors T and T^{(i)}. Let us assume that T_{i} = T_{(k)} for some \(k \leqslant r\). By the definition of the vector T^{(i)} we know that \(\vert T^{(i)} \vert _{(1)} = \vert T^{(i)}_i \vert = \infty \) and that the other order statistics of T^{(i)} are related to the order statistics of T in the following way:
In consequence we obtain that:
and this implies (5.5). □
In the second step of the proof of Lemma 3.8 we show that:
In order to prove this, we use the following Proposition:
Proposition 5.3.
Let us assume we have three vectors \(A,B,C \in \mathbb {R}^p\) and that A = B + C. Furthermore, let us assume that the vector A is ordered (\(\vert A_1 \vert \geqslant ...\geqslant \vert A_p\vert \)) and define \(d = \sup _i \vert C_i \vert \). Under the above assumptions:
for all i = 1, ... , p.
Proof
Let
From the triangle inequality, we have:
In consequence, \(d \geqslant e\). Thus, to prove Proposition 5.3 it is sufficient to prove that
For this aim, let i_{0} be the index such that:
When \(\vert A_{i_0} \vert = \vert B \vert _{(i_0)}\), then f = 0 and the thesis is obtained immediately. Let us consider the case when \(\vert A_{i_0} \vert > \vert B \vert _{(i_0)}\). If \(\vert B_{i_0} \vert \leqslant \vert B \vert _{(i_0)}\), then:
and in consequence we obtain the thesis. If \(\vert B_{i_0} \vert \ > \vert B \vert _{(i_0)}\), then:
On the other hand we know that:
Therefore, the set \(\mathbb {A}\) has one more element than \(\mathbb {B}\) does. Hence, there exists an element \(A_{i_1} \in \mathbb {A}\), which is associated with an element \(B_{i_1}\) from a complement of the set \(\mathbb {B}\):
In consequence, due to the facts that \(A_{i_0} \leqslant A_{i_1}\) and \(B_{i_1} \leqslant \vert B \vert _{(i_0)}\), we obtain:
which completes the proof.
The proof for the case when \(\vert A_{i_0} \vert < \vert B \vert _{(i_0)}\) is analogous. □
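Proposition 5.3 says that the ordered absolute values are perturbed by at most the sup-norm of the perturbation. As a numerical aside (not part of the argument; sizes and seed are our arbitrary choices), this can be checked by simulation:

```python
import numpy as np

# Illustration only: if A = B + C with d = sup_i |C_i|, then the
# coordinatewise gap between the sorted |A| and sorted |B| is at most d.
rng = np.random.default_rng(1)
p = 1000
B = rng.standard_normal(p)
C = rng.uniform(-0.3, 0.3, size=p)
A = B + C
d = np.abs(C).max()

A_sorted = np.sort(np.abs(A))[::-1]   # |A|_(1) >= ... >= |A|_(p)
B_sorted = np.sort(np.abs(B))[::-1]   # |B|_(1) >= ... >= |B|_(p)
assert np.all(np.abs(A_sorted - B_sorted) <= d + 1e-12)
```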
Now let us recall the relation between T^{(i)} and M^{(i)}:
Furthermore recall that \(Q_2 = \{\max \limits _i\vert {\Gamma }_i \vert \leqslant \gamma \}\). When we assume that Q_{2} occurs and apply Proposition 5.3 to a vector \((\vert T^{(i)}\vert _{(1)}, ... , \vert T^{(i)}\vert _{(p)})^{\prime }\), we obtain that:
for all j = 1, ... , p. Let us recall the definitions of H_{r} and \(H^{\gamma }_r\):
It can be noticed that due to condition (5.6) we have:
and the proof of Lemma 3.8 is completed. □
Proof of Lemma 3.9
Without loss of generality, let us assume that 1 ∈ S^{c}. Observe that conditionally on 𝜖, the vector M^{(i)} and the variable M_{i} are independent. Recall that \(Q_3 = \{ \frac {\Vert \epsilon \Vert _2}{\sigma \sqrt {n}} \leqslant 1+1/k^{*} \}\) depends only on 𝜖. Therefore:
Now for i ∈ S^{c}:
where the last equality is a consequence of the fact that conditionally on 𝜖, \(\sqrt {n}\frac {X_i^{\prime }\epsilon }{\Vert \epsilon \Vert _2}\) has a standard normal distribution. Furthermore, from the definition of λ_{r} and γ we know that for a large enough n
where in the last inequality we used the fact that \(\lambda ^{BH}_r/\sqrt {2 \log (p/qr)} \rightarrow 1\). Let us denote
By applying the above inequality to (5.8) we obtain for a large enough n:
where in the second inequality we used the classical approximation to the tail probability of the standard normal distribution (3.23).
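Assuming (3.23) refers to the classical Mills-ratio bound \(P(Z > t) \leqslant \phi(t)/t\) for t > 0, the following sketch (an aside of ours, not part of the proof) checks it at a few points:

```python
import math

# Illustration only: the Gaussian tail bound P(Z > t) <= phi(t)/t, t > 0,
# where phi is the standard normal density.
def phi(t):
    return math.exp(-t * t / 2.0) / math.sqrt(2.0 * math.pi)

def normal_tail(t):
    # Exact tail probability via the complementary error function.
    return 0.5 * math.erfc(t / math.sqrt(2.0))

for t in [1.0, 2.0, 3.0, 5.0]:
    assert normal_tail(t) <= phi(t) / t
```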
By applying (5.9) and (5.7) to
we obtain for a large enough n:
where in the last equality we used the assumption that 1 ∈ S^{c}, the fact that for i ∈ S^{c} all M^{(i)} have the same distribution and the fact that there are p − k elements in S^{c}.
Due to the fact that \((k^{*}/p)^{\delta } \rightarrow 0\), in order to prove the Lemma it remains to be shown that
We prove this by showing that
and
Naturally, \(P(M^{(1)} \in H^{\gamma }_{k+1}) \leqslant 1\), which, together with (5.11) and (5.12), would provide (5.10) and in consequence the thesis.
We begin by proving (5.11). Let us denote by \(\tilde {W}\) a vector of elements of M^{(1)} with indices in S (corresponding to nonzero elements in \(b^0_i\)). Directly from the definition of the set \(H_r^{\gamma }\), we have for \(r \leqslant k\):
Furthermore
which is a consequence of the fact that \(\tilde {W}\) is a subvector of M^{(1)} and that \(M^{(1)}_1= \vert M^{(1)} \vert _{(1)} = \infty \) (1 ∈ S^{c}). Therefore, it is obvious that \(\vert \tilde {W} \vert _{(r)} \leqslant \vert M^{(1)} \vert _{(r+1)}\). Moreover, let W be a modification of \(\tilde {W}\), where each \(b^0_i\) is replaced by \(2\sigma (1+\delta )\sqrt {2 \log p}\) and the resulting vector is multiplied by a function:
Therefore, \(W_i = (X_j^{\prime }\epsilon + 2\sigma (1+\delta )\sqrt {2 \log p})g(\epsilon )\) for the corresponding j ∈ S. Naturally, the elements of W, conditionally on 𝜖, are independent and identically distributed. Furthermore, due to the fact that \(X_i^{\prime }\epsilon +b^0_i =^{D} -X_i^{\prime }\epsilon +b^0_i\), the fact that the density of \(X^{\prime }_i \epsilon +b^0_i\) has a mode at \(b^0_i\), the assumption \(\min \limits _{i \in S} \vert b^0_i \vert > 2\sigma (1+\delta )\sqrt {2 \log p}\) and the fact that \(2\sigma (1+\delta )\sqrt {2 \log p} \geqslant \lambda _{r+1}+\gamma \) for a large enough n, it holds that:
Now,
where in the equality between the second and the third line we used the standard combinatorial arguments for calculating the cumulative distribution function of the rth order statistic and the fact that elements of W are independent conditionally on 𝜖. In the inequality we used the fact that \( P\left (\vert W_1 \vert \leqslant \lambda _{r+1} +\gamma  \epsilon \right ) \leqslant 1\).
Now, for some j ∈ S (corresponding to W_{1}) and a large enough n we have
where in the second line we first use the definition of the function g(𝜖) and in the inequality we skip the absolute value and use the fact that for a large enough n, \(\lambda _{r+1} +\gamma \leqslant \sigma (1+1.5\delta )\sqrt {2 \log p}\). Now, due to the fact that conditionally on 𝜖, \(X_j^{\prime } \epsilon \) is normal with a mean 0 and a standard deviation \(\Vert \epsilon \Vert _2/\sqrt {n}\), and the fact that Q_{3} provides the upper bound on this standard deviation, for a large enough n we obtain:
In consequence we can bound (5.11) from above
for \(n \rightarrow \infty \). To see the equality between the first and the second line observe that only the index i depends on r. Therefore, we sum the same elements multiple times and for a given i, the summation element occurs i times. The inequality uses the upper bound on the binomial coefficient \({k \choose r} \leqslant \left (\frac {ke}{r}\right )^r\). Naturally, this result is much stronger than the relation (5.11).
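The bound \({k \choose r} \leqslant (ke/r)^r\) on the binomial coefficient used above can be spot-checked numerically (an aside of ours, not part of the proof):

```python
from math import comb, e

# Illustration only: C(k, r) <= (k*e/r)^r for 1 <= r <= k.
for k in [10, 50, 200]:
    for r in range(1, k + 1):
        assert comb(k, r) <= (k * e / r) ** r
```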
Remark 5.4.
The proof of the relation (5.11) is the only place where we use the assumption on the signal strength in the context of FDR control. Naturally, when k is bounded we obtain (5.11) immediately:
and the condition on the signal strength is redundant.
Now, we turn our attention to (5.12). From now on we assume that \(r\geqslant k+2\). Let us denote by V a subvector of M^{(1)} consisting of elements with indices in S^{c} ∖{1} (corresponding to the elements of \(b^0_i\) equal to 0, except \(M^{(1)}_1\)). Notice that \(\vert M^{(1)} \vert _{(k+2)} \leqslant \vert V \vert _{(1)}\). This is a consequence of the fact that \(\vert V \vert _{(1)}\) is the largest element in V and is equal to or larger than p − k − 1 elements in M^{(1)}. Similarly, we can show that:
Conditionally on 𝜖, elements of V are independent and identically distributed. Therefore we have:
where the random vector \(Z = (Z_1, ... , Z_{p-k-1})^{\prime } \sim N(0,\mathbb {I}_{p-k-1})\) is independent of ∥𝜖∥_{2}. Notice that the probability in (5.14) is maximized for the largest possible standard deviation of V (maximal \(\frac {\Vert \epsilon \Vert _2}{\sqrt {n}}\)). Therefore, when restricting to Q_{3}, we obtain:
Moreover for a large enough n we have
which follows directly from the fact that for a large enough n:
and from the fact that:
for \(r \leqslant k^{*}\). In consequence we obtain for a large enough n
where \(s = (1+\delta /2)\sqrt {2 \log p}\).
Let us denote by u_{1}, ... , u_{p−k− 1} i.i.d. random variables from the uniform distribution U[0,1] and by \(u_{[1]} \leqslant ... \leqslant u_{[p-k-1]}\) the corresponding order statistics. We know that
Therefore, by using the classical upper bound
we obtain for a large enough n:
We also know that the ith order statistic of the uniform distribution is a beta-distributed random variable:
On the other hand, a well known fact is that if \(A_1 \sim Gamma(i,\theta )\) and \(A_2 \sim Gamma(p-k-i,\theta )\) are independent, then \(A_1/(A_1+A_2) \sim Beta(i,p-k-i)\). In consequence, when E_{1}, ... , E_{p−k} are i.i.d. random variables from the exponential distribution with a mean equal to 1 it holds:
Therefore
and
where F_{E} is the Erlang cumulative distribution function. Therefore we obtain that
To prove (5.12) it remains to be shown that:
and
The first relation follows directly from the properties of the Erlang cumulative distribution function. The second relation is a consequence of Chebyshev’s inequality (for details see the supplementary materials). This ends the proof. □
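The Beta/Gamma representation of uniform order statistics used in this proof can be illustrated by simulation (a numerical aside of ours; the values of m, i and the number of replications are arbitrary):

```python
import numpy as np

# Illustration only: if E_1, ..., E_m are i.i.d. Exp(1), then
# (E_1 + ... + E_i) / (E_1 + ... + E_m) ~ Beta(i, m - i), which is the
# distribution of the i-th order statistic u_[i] of m - 1 i.i.d. U[0,1]
# variables. Both means should be close to E[Beta(i, m - i)] = i / m.
rng = np.random.default_rng(2)
m, i, reps = 20, 4, 200_000

E = rng.exponential(1.0, size=(reps, m))
ratio = E[:, :i].sum(axis=1) / E.sum(axis=1)

u = np.sort(rng.uniform(size=(reps, m - 1)), axis=1)[:, i - 1]

assert abs(ratio.mean() - i / m) < 0.01
assert abs(u.mean() - i / m) < 0.01
```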
Proof of Lemma 3.10
Recall that we want to show
We can bound the considered probability in the following way:
where in the first and the last equation we used the law of total probability (\(\{R = r\}_{r = 1}^p\) is a partition of a sample space). The second equation is a consequence of Theorem 3.2 and the inequality comes from the fact that λ is a decreasing sequence. Now, observe that:
and recall that T_{i} = M_{i} + Γ_{i} and \(Q_2 = \{\max \limits _i \vert {\Gamma }_i \vert \leqslant \gamma \}\). Therefore, due to the triangle inequality we have:
Now, because \(P(Q_2) \rightarrow 1\), we only have to show that
Let us consider the properties of \(M_i = X_i^{\prime }\epsilon +b^0_i\). Notice that due to the symmetry of the distribution of \(X_i^{\prime }\epsilon \), we have \(X_i^{\prime }\epsilon +b^0_i =^{D} -X_i^{\prime }\epsilon +b^0_i\), therefore
where in the last inequality we omit the absolute value in \(X_i^{\prime }\epsilon +b^0_i\) and subtract \(b^0_i\). Now, due to the assumption on the signal strength, we have for a large enough n:
This is a consequence of the fact that for a large enough n:
Moreover, we know that conditionally on 𝜖, \(X^{\prime }_i\epsilon \) are independent from the normal distribution \(N(0, \Vert \epsilon \Vert _2^2/n) \). Therefore we have:
where in the last inequality we have used the condition defining Q_{3}. Again, due to the fact that \(P(Q_3) \rightarrow 1\), we only have to show that
where \(a = \frac {(1+\delta /2)}{(1+1/k^{*})}\). Now, using the fact that conditionally on 𝜖, \(X^{\prime }_i\epsilon \) are independent random variables from the normal distribution \(N(0, \Vert \epsilon \Vert _2^2/n)\), we obtain
where in the above inequalities we used the bound (5.15) and Bernoulli’s inequality. The convergence is a consequence of the fact that for a large enough n, \(a \geqslant 1\) and the assumption that \(k/p \rightarrow 0\). This ends the proof of \({\Pi } \rightarrow 1\). □
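Bernoulli's inequality invoked in the final step, \((1-x)^n \geqslant 1 - nx\) for \(x \in [0,1]\) and integer \(n \geqslant 1\), can be spot-checked numerically (an aside of ours, not part of the proof):

```python
# Illustration only: (1 - x)^n >= 1 - n*x for x in [0, 1], n >= 1.
for n in [1, 2, 5, 50]:
    for x in [0.0, 0.01, 0.1, 0.5, 1.0]:
        assert (1 - x) ** n >= 1 - n * x - 1e-12
```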
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Kos, M., Bogdan, M. On the Asymptotic Properties of SLOPE. Sankhya A 82, 499–532 (2020). https://doi.org/10.1007/s13171-020-00212-5
Keywords
 Multiple testing
 Model selection
 High dimensional regression
 Convex optimization
AMS (2000) subject classification
 Primary 62J07
 Secondary 62F12
 62J05