On the Asymptotic Properties of SLOPE

The Sorted L-One Penalized Estimator (SLOPE) is a relatively new convex optimization procedure for selecting predictors in high-dimensional regression analyses. SLOPE extends LASSO by replacing the L1 penalty with a sorted L1 norm, based on a non-increasing sequence of tuning parameters. This allows SLOPE to adapt to unknown sparsity and to achieve the asymptotic minimax convergence rate under a wide range of high-dimensional generalized linear models. Additionally, in the case when the design matrix is orthogonal, SLOPE with the sequence of tuning parameters λ_BH, corresponding to the sequence of decaying thresholds of the Benjamini-Hochberg multiple testing correction, provably controls the False Discovery Rate (FDR) in the multiple regression model. In this article we provide new asymptotic results on the properties of SLOPE when the elements of the design matrix are i.i.d. random variables from the Gaussian distribution. Specifically, we provide conditions under which the asymptotic FDR of SLOPE based on the sequence λ_BH converges to zero and the power converges to one. We illustrate our theoretical asymptotic results with an extensive simulation study. We also provide precise formulas describing the FDR of SLOPE under different loss functions, which sets the stage for future investigations of the model selection properties of SLOPE and its extensions.


Introduction
In this article we consider the classical problem of identifying important predictors in the multiple regression model
$$Y = X b^0 + \varepsilon, \qquad (1.1)$$
where $X$ is the design matrix of dimension $n \times p$ and $\varepsilon \sim N(0, \sigma^2 I_{n \times n})$ is the noise vector. In the case when $p > n$, the vector of parameters $b^0$ is not identifiable and can be uniquely estimated only under certain additional assumptions concerning, e.g., its sparsity (i.e. the number of non-zero elements). One of the most popular and computationally tractable methods for estimating $b^0$ in the case when $p > n$ is the Least Absolute Shrinkage and Selection Operator (LASSO, Tibshirani (1996)), first introduced in the context of signal processing as Basis Pursuit Denoising (BPDN, Chen and Donoho (1994)). The LASSO estimator is defined as
$$\hat b = \operatorname{argmin}_{b} \left\{ \tfrac{1}{2}\|Y - Xb\|_2^2 + \lambda |b|_1 \right\},$$
where $\|\cdot\|_2$ denotes the regular Euclidean norm in $\mathbb{R}^n$ and $|b|_1 = \sum_{j=1}^p |b_j|$ is the $L_1$ norm of $b$. If $X'X = I$ then LASSO reduces to the simple shrinkage (soft thresholding) operator applied to the elements of the vector $\tilde Y = X'Y$. In this case the choice of the tuning parameter allows for the control of the probability of at least one false discovery (Family Wise Error Rate, FWER) at level $\alpha$.
In the context of high-dimensional multiple testing, the Bonferroni correction is often replaced by the Benjamini and Hochberg (1995) multiple testing procedure, aimed at the control of the False Discovery Rate (FDR). Apart from FDR control, this procedure also has appealing properties in the context of the estimation of the vector of means of a multivariate normal distribution with independent entries (Abramovich et al., 2006) or in the context of minimizing the Bayes risk related to the 0-1 loss (Bogdan et al., 2011; Neuvial and Roquain, 2012; Frommlet and Bogdan, 2013). In the context of multiple regression with an orthogonal design, the Benjamini-Hochberg (BH) procedure works as follows:
(1) Fix $q \in (0,1)$ and sort the elements of $\tilde Y$ such that $|\tilde Y|_{(1)} \geq |\tilde Y|_{(2)} \geq \ldots \geq |\tilde Y|_{(p)}$.
(2) Identify the largest $j$ such that
$$|\tilde Y|_{(j)} \geq \lambda_{BH}(j) := \sigma \Phi^{-1}\left(1 - \frac{jq}{2p}\right). \qquad (1.3)$$
Call this index $j_{BH}$.
(3) Reject the null hypotheses corresponding to the $j_{BH}$ largest values of $|\tilde Y_i|$.
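The step-up rule described above can be sketched in a few lines. The following is a minimal stdlib-only illustration, not the implementation used in the paper; the bisection-based `phi_inv` is a stand-in for a library quantile function, and the function names are ours.

```python
import math

def phi_inv(u):
    """Standard normal quantile Phi^{-1}(u) via bisection on erf (stdlib-only stand-in)."""
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if 0.5 * (1 + math.erf(mid / math.sqrt(2))) < u:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def benjamini_hochberg(y_tilde, sigma=1.0, q=0.2):
    """Step-up BH selection for the orthogonal design: select the coordinates
    whose sorted statistics satisfy |Y~|_(j) >= lambda_BH(j) = sigma * Phi^{-1}(1 - j*q/(2p))."""
    p = len(y_tilde)
    order = sorted(range(p), key=lambda i: -abs(y_tilde[i]))
    j_bh = 0
    for j in range(1, p + 1):
        lam_j = sigma * phi_inv(1 - j * q / (2 * p))
        if abs(y_tilde[order[j - 1]]) >= lam_j:
            j_bh = j  # keep the largest j passing its own threshold (step-up)
    return sorted(order[:j_bh])
```

For instance, with p = 100 and three statistics equal to 10, the first threshold is roughly Φ⁻¹(0.999) ≈ 3.09, so exactly those three coordinates are selected.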
Thus, in BH, the fixed threshold of the Bonferroni correction, λ Bon , is replaced with the sequence λ BH of 'sloped' thresholds for the sorted test statistics (see Fig. 1). This allows for a substantial increase of power and for an improvement of prediction properties in the case when some of the predictors are relatively weak.
The idea of using a decreasing sequence of thresholds was subsequently used in the Sorted L-One Penalized Estimator (SLOPE, Bogdan et al. (2013); Bogdan et al. (2015)) for the estimation of coefficients in the multiple regression model:
$$\hat b = \operatorname{argmin}_{b} \left\{ \tfrac{1}{2}\|Y - Xb\|_2^2 + \sum_{i=1}^p \lambda_i |b|_{(i)} \right\},$$
where $|b|_{(1)} \geq \ldots \geq |b|_{(p)}$ are the ordered magnitudes of the elements of $b$ and $\lambda = (\lambda_1, \ldots, \lambda_p)$ is a non-zero, non-increasing and non-negative sequence of tuning parameters. As noted in Bogdan et al. (2013) and Bogdan et al. (2015), the function $J_\lambda(b) = \sum_{i=1}^p \lambda_i |b|_{(i)}$ is a norm. To see this, observe that:
• for any constant $a \in \mathbb{R}$ and a vector $b \in \mathbb{R}^p$, $J_\lambda(ab) = |a| J_\lambda(b)$;
• $J_\lambda(b) = 0$ if and only if $b = 0 \in \mathbb{R}^p$.
To show the triangle inequality $J_\lambda(x+y) \leq J_\lambda(x) + J_\lambda(y)$, let $\rho$ be a permutation ordering the magnitudes of $x + y$. Then
$$J_\lambda(x+y) = \sum_{i=1}^p \lambda_i |x_{\rho(i)} + y_{\rho(i)}| \leq \sum_{i=1}^p \lambda_i |x_{\rho(i)}| + \sum_{i=1}^p \lambda_i |y_{\rho(i)}|,$$
and the result can be derived from the well-known rearrangement inequality, according to which for any permutation $\rho$ and any sequence $x \in \mathbb{R}^p$
$$\sum_{i=1}^p \lambda_i |x_{\rho(i)}| \leq \sum_{i=1}^p \lambda_i |x|_{(i)}.$$
Thus SLOPE is a convex optimization procedure which can be efficiently solved using classical optimization tools. It is also easy to observe that in the case $\lambda_1 = \ldots = \lambda_p$ SLOPE reduces to LASSO, while in the case $\lambda_1 > \lambda_2 = \ldots = \lambda_p = 0$ the sorted L1 norm $J_\lambda$ reduces to (a multiple of) the $L_\infty$ norm. Figure 2 illustrates the different shapes of the unit spheres corresponding to different versions of the sorted L1 norm and demonstrates the large flexibility of SLOPE with respect to dimensionality reduction. In the case when $\lambda_1 = \ldots = \lambda_p$, SLOPE reduces dimensionality by shrinking the coefficients to zero. In the case when $\lambda_1 > \lambda_2 = \ldots = \lambda_p = 0$, the reduction of dimensionality is performed by shrinking the coefficients towards each other (since the edges of the $l_\infty$ sphere correspond to vectors $b$ in which at least two coefficients are equal in magnitude). In the case when the sequence of thresholding parameters is monotonically decreasing, SLOPE reduces the dimensionality both ways: it shrinks the coefficients towards zero and towards each other. Thus it returns sparse and stable estimators, which have recently been proved to achieve the minimax rate of convergence in the context of sparse high-dimensional regression and logistic regression (Su and Candès, 2016; Bellec et al., 2018; Abramovich and Grinshtein, 2017).
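The norm properties discussed above are easy to check numerically. A minimal sketch of the sorted L1 norm (the function name is ours, not from the paper):

```python
def sorted_l1_norm(lam, b):
    """J_lambda(b) = sum_i lam_i * |b|_(i), where |b|_(1) >= ... >= |b|_(p)
    and lam is a non-negative, non-increasing sequence."""
    mags = sorted((abs(x) for x in b), reverse=True)
    return sum(l * m for l, m in zip(lam, mags))
```

A constant sequence recovers the L1 (LASSO) penalty, while lam = (λ₁, 0, …, 0) recovers λ₁ times the L∞ norm, matching the two limiting cases in the text.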
From the perspective of model selection, it was proved in Bogdan et al. (2013) and Bogdan et al. (2015) that SLOPE with the vector of tuning parameters $\lambda_{BH}$ (1.3) controls FDR at level $q$ under the orthogonal design. This is no longer true if the inner products between the columns of the design matrix are different from zero, which almost always occurs if the predictors are random variables. Similar problems with the control of the number of false discoveries occur for LASSO. Specifically, in Bogdan et al. (2013) it is shown that in the case of the Gaussian design with independent predictors, LASSO with a fixed parameter $\lambda$ can control FDR only if the true parameter vector is sufficiently sparse. The natural question is whether there exists a bound on the sparsity under which SLOPE can control FDR. In this article we address this question and report a theoretical result on the asymptotic control of FDR by SLOPE under the Gaussian design. Our main theoretical result states that by multiplying the sequence $\lambda_{BH}$ by a constant larger than 1, one can achieve full asymptotic power and FDR converging to 0 if the number $k = k(n)$ of non-zero elements in the true vector of regression coefficients satisfies $k = o\left(\sqrt{n/\log p}\right)$ and the values of these non-zero elements are sufficiently large. We also report the results of a simulation study which suggest that the assumption on the signal sparsity is necessary when using the $\lambda_{BH}$ sequence, but is unnecessarily strong when using the heuristic adjustment of this sequence proposed in Bogdan et al. (2015). The simulations also suggest that the asymptotic FDR control is guaranteed independently of the magnitude of the non-zero regression coefficients.

False Discovery Rate and Power
Let us consider the multiple regression model (1.1) and let $\hat b$ be some sparse estimator of $b^0$. The numbers of false, true and all rejections, and the number of non-zero elements of $b^0$ (respectively: $V$, $TR$, $R$, $k$), are defined as follows:
$$V = \#\{j : b^0_j = 0 \text{ and } \hat b_j \neq 0\}, \qquad TR = \#\{j : b^0_j \neq 0 \text{ and } \hat b_j \neq 0\},$$
$$R = \#\{j : \hat b_j \neq 0\} = V + TR, \qquad k = \#\{j : b^0_j \neq 0\}.$$
The False Discovery Rate is defined as
$$FDR = E\left[\frac{V}{\max(R, 1)}\right] \qquad (2.3)$$
and the Power as
$$\Pi = \frac{E[TR]}{k}. \qquad (2.4)$$
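In a simulation study these quantities are computed per replicate and averaged. A small helper with hypothetical variable names might look as follows; FDR and Power are then the averages of the last two returned values over replicates.

```python
def selection_summary(b_hat, b_true):
    """Counts V, TR, R, k for one replicate, together with the realized
    false discovery proportion V/(R v 1) and true positive proportion TR/k."""
    V = sum(1 for bh, bt in zip(b_hat, b_true) if bh != 0 and bt == 0)
    TR = sum(1 for bh, bt in zip(b_hat, b_true) if bh != 0 and bt != 0)
    R = V + TR
    k = sum(1 for bt in b_true if bt != 0)
    fdp = V / max(R, 1)
    tpp = TR / k if k > 0 else 1.0  # convention for the k = 0 case
    return V, TR, R, k, fdp, tpp
```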

Asymptotic FDR and Power
We formulate our asymptotic results under the setup where $n$ and $p$ diverge to infinity and $p$ can grow faster than $n$. As in the case of the support recovery results for LASSO, we need to impose a constraint on the sparsity of $b^0$, which is measured by the number of truly important predictors $k = \#\{i : b^0_i \neq 0\}$. Thus we consider a sequence of linear models of the form (1.1), indexed by $n$, with their "dimensions" characterized by the triplets $(n, p_n, k_n)$. For the sake of clarity, further in the text we drop the subscripts of $p$ and $k$.
The main result of this article is as follows:

Theorem 2.1. Assume that the elements of the design matrix $X$ are i.i.d. random variables from the $N(0, 1/n)$ distribution, and that for some constant $\delta > 0$ the magnitudes of the non-zero coefficients satisfy
$$\min_{i : b^0_i \neq 0} |b^0_i| \geq 2\sigma(1+\delta)\sqrt{2 \log p} \qquad (2.5)$$
and the sparsity satisfies
$$\frac{k^2 \log p}{n} \to 0, \qquad \frac{k}{p} \to 0. \qquad (2.6)$$
Then for any $q \in (0,1)$, the SLOPE procedure with the sequence of tuning parameters
$$\lambda_i = (1+\delta)\,\sigma\,\Phi^{-1}\left(1 - \frac{iq}{2p}\right) = (1+\delta)\,\lambda_{BH}(i), \qquad i = 1, \ldots, p, \qquad (2.7)$$
satisfies $FDR \to 0$ and $\Pi \to 1$.

Proof. The proof of Theorem 2.1 makes extensive use of the results on the asymptotic properties of SLOPE reported in Su and Candès (2016). The roadmap of this proof is provided in Section 3, while the proof details can be found in the Appendix and in the supplementary materials.
Remark 2.2 (The assumption on the design matrix). The assumption that the elements of X are i.i.d. random variables from the normal N (0, 1/n) distribution is technical. We expect that the results can be generalized to the case where the predictors are independent, sub-gaussian random variables.
The assumption that the variance is equal to $1/n$ can be satisfied by an appropriate scaling of the design matrix. As compared to the classical standardization towards unit variance, our scaling allows for the control of FDR with a sequence of tuning parameters $\lambda$ that does not depend on the number of observations $n$. If the data are standardized such that $X_{ij} \sim N(0, 1)$, then Theorem 2.1 holds when the sequence of tuning parameters (2.7) and the lower bound on the signal magnitude (2.5) are both multiplied by $n^{-0.5}$.
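Under either scaling, the tuning sequence is straightforward to generate. A stdlib-only sketch, assuming (2.7) has the form $\lambda_i = (1+\delta)\sigma\Phi^{-1}(1 - iq/(2p))$, a rescaled $\lambda_{BH}$ sequence; the bisection quantile and the parameter names are ours, and passing `n` applies the $n^{-0.5}$ rescaling discussed above:

```python
import math

def phi_inv(u):
    """Standard normal quantile via bisection on erf (stdlib-only stand-in)."""
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if 0.5 * (1 + math.erf(mid / math.sqrt(2))) < u:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def slope_lambda(p, q, sigma=1.0, delta=0.0, n=None):
    """lambda_i = (1 + delta) * sigma * Phi^{-1}(1 - i*q/(2p)) for i = 1..p.
    If n is given, rescale by n^{-0.5} for the X_ij ~ N(0, 1) standardization."""
    lam = [(1 + delta) * sigma * phi_inv(1 - i * q / (2 * p)) for i in range(1, p + 1)]
    if n is not None:
        lam = [l / math.sqrt(n) for l in lam]
    return lam
```

Setting `delta=0` recovers the original $\lambda_{BH}$ sequence, while `delta > 0` multiplies every element by the same constant $(1+\delta)$.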

Remark 2.3 (The assumption on the signal strength). Our assumption on the signal strength is not very restrictive. When using the classical standardization of the explanatory variables (i.e. assuming that $X_{ij} \sim N(0,1)$), this assumption allows the magnitude of the signal to converge to zero at such a rate that the lower bound $2\sigma(1+\delta)\sqrt{2\log p / n}$ tends to 0 whenever $\log p = o(n)$.

M. Kos and M. Bogdan
This assumption is needed to obtain power converging to 1. The proof of Theorem 2.1 implies that this assumption is not needed for the asymptotic FDR control if $k$ is bounded by some constant. Moreover, our simulations suggest that the asymptotic FDR control holds independently of the signal strength provided $k$ satisfies assumption (2.6). The proof of this conjecture would require a substantial refinement of our proof techniques and remains an interesting topic for future work.

Simulations
In this section we present the results of the simulation study. The data are generated according to the linear model
$$Y = X b^0 + \varepsilon,$$
where the elements of the design matrix $X$ are i.i.d. random variables from the $N(0, 1/n)$ distribution and $\varepsilon$ is independent of $X$ and comes from the standard multivariate normal distribution $N(0, I)$. The parameter vector $b^0$ has $k$ non-zero elements and $p - k$ zeroes.
We present a comparison of three methods, two versions of SLOPE and LASSO:

1. SLOPE with the sequence of tuning parameters given by formula (2.7), denoted by "SLOPE".
2. SLOPE with the heuristically adjusted sequence of tuning parameters proposed in Bogdan et al. (2015), which takes into account the influence of the cross products between the columns of the design matrix $X$. We refer to this procedure as heuristic SLOPE ("SLOPE heur").
3. LASSO with the tuning parameter $\lambda = \lambda_1$, equal to the first element of the tuning sequence of SLOPE, so that the FDR of LASSO and SLOPE are approximately equal when $k = 0$.
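For reference, one replicate of this simulation setup can be generated as follows. This is a stdlib-only sketch with our own function name and defaults; the paper's actual experiments vary $n$, $p$, $k$ and use at least 500 replicates.

```python
import math
import random

def simulate(n, p, k, signal, sigma=1.0, seed=0):
    """One replicate: X_ij i.i.d. N(0, 1/n), b0 with k non-zero entries equal
    to `signal`, eps ~ N(0, sigma^2 I), and Y = X b0 + eps."""
    rng = random.Random(seed)
    sd = 1 / math.sqrt(n)
    X = [[rng.gauss(0, sd) for _ in range(p)] for _ in range(n)]
    b0 = [signal] * k + [0.0] * (p - k)
    eps = [rng.gauss(0, sigma) for _ in range(n)]
    # only the first k coordinates of b0 are non-zero, so the sum stops at k
    Y = [sum(X[i][j] * b0[j] for j in range(k)) + eps[i] for i in range(n)]
    return X, Y, b0
```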
Figure 3 presents the FDR and Power of the different procedures. First, let us concentrate on the behavior of SLOPE when the sequence of tuning parameters is defined as in Theorem 2.1 (green line in each sub-plot). The green rectangle contains plots for which the sequences of tuning parameters and the signal sparsity $k(n)$ satisfy the assumptions of Theorem 2.1. It is noticeable that in this area the FDR of SLOPE slowly converges to zero and the Power converges to 1. Moreover, the FDR is close to or below the nominal level $q = 0.2$ for the whole range of considered values of $n$. It is also clear that larger values of $\delta$ lead to more conservative versions of SLOPE.
In Figure 3 the signal magnitude is proportional to $\sqrt{2 \log p}$ and $q = 0.2$; the results presented in the green rectangle correspond to the values of the parameters which meet the assumptions of Theorem 2.1, the numbers above the green lines give the actual values of the parameter $k$, and each point was obtained by averaging the false and true positive proportions over at least 500 independent replicates. In the red area the assumptions of Theorem 2.1 are violated. Here we can see that when $\delta > 0$ and $\alpha = 0.5$, FDR is still a decreasing function of $n$, but the rate of this decrease is slow and FDR remains substantially above the level $q = 0.2$ even for $n = 2000$. In the case when $\delta = 0$ (i.e. when the original $\lambda_{BH}$ sequence is used), FDR stabilizes at a value which exceeds the nominal level.
Let us now turn our attention to the other methods. We can observe that LASSO is the most conservative procedure and that, as expected, its FDR converges to 0 when $k$ increases. Since the simulated signals are strong, this does not lead to a substantial decrease of Power as compared to SLOPE. Interestingly, SLOPE with the heuristic choice of tuning parameters seems to provide stable FDR control over the whole range of considered parameter values. This suggests that the upper bound on $k$ provided in assumption (2.6) could be relaxed when working with this heuristic sequence. The proof of this claim remains an interesting topic for further research. Figure 4 presents simulations for the case when $b^0_1 = \ldots = b^0_k = 0.9\sqrt{2\log p}$, i.e. when the signal magnitude does not satisfy assumption (2.5). Here the FDR of SLOPE behaves similarly as in the case of strong signals. These results suggest that the assumption on the signal strength might not be necessary in the context of FDR control. Figure 4 also illustrates the strikingly good control of FDR by SLOPE with the heuristically adjusted sequence of tuning parameters. LASSO is substantially more conservative than both versions of SLOPE. Its FDR converges to zero, which in the case of such moderate signals leads to a substantial decrease of Power as compared to SLOPE.

Roadmap of the Proof
In the first part of this section we characterize the support of the SLOPE estimator. The proofs of the theorems presented in this part rely only on the differentiability and convexity of the loss function. Therefore we decided to present them in a general form, which will be useful for further work on extensions of SLOPE to Generalized Linear Models or Gaussian Graphical Models.
3.1. Support of the SLOPE estimator under the general loss function

Let us consider the following generalization of SLOPE:
$$\hat b = \operatorname{argmin}_b \left\{ l(b) + \sum_{i=1}^p \lambda_i |b|_{(i)} \right\}, \qquad (3.1)$$

where $l(b)$ is a convex and differentiable loss function (e.g. $l(b) = \frac{1}{2}\|Y - Xb\|_2^2$ for the multiple linear regression). Let $R$ denote the number of non-zero elements of $\hat b$.
The following Theorems 3.1 and 3.2 characterize the events $\{\hat b_i = 0\}$ and $\{R = r\}$ by using the gradient
$$U(b) = -\nabla l(b) \qquad (3.2)$$
of the negative loss function $-l(b)$.

Additionally, for $a > 0$ we define the vector $T(a) = U(\hat b) + a\hat b$.

Theorem 3.1. Consider the optimization problem (3.1) with an arbitrary sequence $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_p \geq 0$ and assume that $l(b)$ is a convex and differentiable function. Then for any $a > 0$ the event $\{\hat b_i = 0\}$ can be expressed in terms of the order statistics of the vector $|T(a)|$ and the sequence $\lambda$.

Theorem 3.2. Consider the optimization problem (3.1) with an arbitrary sequence $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_p \geq 0$, and assume that $l(b)$ is a convex and differentiable function and that $R = r$. Then for any $a > 0$ the event $\{R = r\}$ can be expressed in terms of the order statistics of the vector $|T(a)|$ and the sequence $\lambda$. Moreover, if we assume that $\lambda_1 > \lambda_2 > \ldots > \lambda_p \geq 0$, then the reverse implication also holds. The proofs of Theorems 3.1 and 3.2 are provided in the supplementary materials.


3.2. FDR of SLOPE for the general loss function
Corollary 3.3. Consider the optimization problem (3.1) with an arbitrary sequence $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_p \geq 0$ and assume that $l(b)$ is a convex and differentiable function. Then for any $a > 0$, the FDR of SLOPE is equal to
$$FDR = \sum_{r=1}^{p} \frac{1}{r} \sum_{i \in S^c} P\left(T(a) \in H_r,\ |T_i(a)| > \lambda_r\right), \qquad (3.4)$$
where $H_r$ denotes the set characterizing the event $\{R = r\}$ in Theorem 3.2.

Proof.
Let us denote the support of the true parameter vector $b^0$ by
$$S = \{i : b^0_i \neq 0\}$$
and the complement of $S$ in $\{1, \ldots, p\}$ by
$$S^c = \{1, \ldots, p\} \setminus S.$$
Directly from the definition we obtain
$$FDR = E\left[\frac{V}{R \vee 1}\right] = \sum_{r=1}^p \frac{1}{r} \sum_{i \in S^c} P(\hat b_i \neq 0,\ R = r), \qquad (3.7)$$
and Corollary 3.3 is a direct consequence of Theorems 3.1 and 3.2.

Proof of Theorem 2.1
We now focus on the multiple regression model (1.1). Elementary calculations show that in this case the vector $U$ (for the definition see (3.2)) takes the form
$$U(\hat b) = X'(Y - X\hat b).$$
Let us denote for simplicity $T = T(1) = X'(Y - X\hat b) + \hat b$ and introduce the following notation for the components of $T$:
$$M = X'\varepsilon + b^0, \qquad \Gamma = (X'X - I)(b^0 - \hat b).$$

Naturally, $T = M + \Gamma$. Due to (3.4), we can express the FDR for linear regression in the following way:
$$FDR = \sum_{r=1}^{p} \frac{1}{r} \sum_{i \in S^c} P\left(T \in H_r,\ |T_i| > \lambda_r\right). \qquad (3.10)$$
A deeper analysis shows that under the assumptions of Theorem 2.1 the FDR expression (3.4) can be simplified. Corollary 3.5, stated below, shows that, with large probability, only the first $k^*$ terms of the summation over $r$ are different from zero. Furthermore, Lemma 3.6 below shows that the elements of the vector $\Gamma$ are sufficiently small, so we can focus on the properties of the vector $M$. Let us introduce the following notation for the sequence of events on which the union of the supports of $b^0$ and $\hat b$ is contained in $S^*$:
$$Q_1(n, k^*) = \{\operatorname{supp}(b^0) \cup \operatorname{supp}(\hat b) \subseteq S^*\}. \qquad (3.11)$$

Corollary 3.5. Suppose the assumptions of Theorem 2.1 hold. Then there exists a deterministic sequence $k^*$ such that $k^*/p \to 0$, $((k^*)^2 \log p)/n \to 0$ and
$$P(Q_1(n, k^*)) \to 1. \qquad (3.12)$$

Corollary 3.5 follows directly from Lemma 4.4 in Su and Candès (2016) (see Lemma S.2.2 in the supplementary materials and the discussion below it). From now on, $k^*$ will denote a sequence satisfying Corollary 3.5.
Lemma 3.6. Let us denote by $Q_2(n, \tilde\gamma(n))$ the sequence of events on which the $l_\infty$ norm of the vector $\Gamma$ is smaller than $\tilde\gamma(n)$:
$$Q_2(n, \tilde\gamma(n)) = \left\{\max_i |\Gamma_i| \leq \tilde\gamma(n)\right\}. \qquad (3.13)$$
If the assumptions of Theorem 2.1 hold, then there exists a constant $C_q$, dependent only on $q$, such that the sequence
$$\gamma(n) = C_q \sqrt{\frac{(k^*)^2 \log p}{n}}\, \lambda_{BH}(k^*)$$
satisfies
$$P(Q_2(n, \gamma(n))) \to 1. \qquad (3.14)$$

The proof of Lemma 3.6 is provided in the Appendix. Let us denote by $Q_3(n, u)$ the sequence of events on which the $l_2$ norm of the noise vector $\varepsilon$ divided by $\sigma\sqrt{n}$ is smaller than $1 + 1/u$:
$$Q_3(n, u) = \left\{\frac{\|\varepsilon\|_2}{\sigma\sqrt{n}} < 1 + \frac{1}{u}\right\}.$$
The following Corollary 3.7 is a consequence of well-known results on the concentration of the Gaussian measure (see Theorem S.2.4 in the supplementary materials).
Corollary 3.7. Let $k^* = k^*(n)$ be a sequence satisfying Corollary 3.5. Then
$$P(Q_3(n, k^*)) \to 1. \qquad (3.16)$$
From now on, for simplicity, we shall denote by $Q_1$, $Q_2$ and $Q_3$ the sequences $Q_1(n, k^*)$, $Q_2(n, \gamma)$ and $Q_3(n, k^*)$, respectively. Moreover, let us denote the intersection of $Q_1$, $Q_2$ and $Q_3$ by
$$Q = Q_1 \cap Q_2 \cap Q_3. \qquad (3.17)$$
By using the event $Q$, we can bound the FDR in the following way:
$$FDR = E\left[\frac{V}{R \vee 1}\,\mathbb{1}_Q\right] + E\left[\frac{V}{R \vee 1}\,\mathbb{1}_{Q^c}\right] \leq \sum_{r=1}^{p} \frac{1}{r} \sum_{i \in S^c} P(\hat b_i \neq 0,\ R = r,\ Q) + P(Q^c). \qquad (3.18)$$
The inequality applied to the second term uses the fact that $\frac{V}{R \vee 1} \leq 1$, while the bound on the first term is a consequence of formula (3.7) applied on the event $Q$. Naturally, due to conditions (3.12), (3.14) and (3.16), we obtain $P(Q^c) \to 0$. Therefore, we can focus on the properties of the first term in (3.18). Notice that $Q_1$ implies $R \leq k^*$ (since $\operatorname{supp}(\hat b) \subseteq S^*$), therefore we can limit the summation over $r$ to the first $k^*$ terms:
$$\sum_{r=1}^{k^*} \frac{1}{r} \sum_{i \in S^c} P(\hat b_i \neq 0,\ R = r,\ Q). \qquad (3.19)$$
Furthermore, according to Theorems 3.1 and 3.2,
$$\{\hat b_i \neq 0,\ R = r\} = \{T \in H_r,\ |T_i| > \lambda_r\}.$$

We now introduce some useful notation. Lemma 3.8 (see the Appendix for the proof) allows for the replacement of the event $\{T \in H_r, |T_i| > \lambda_r, Q_2\}$ by an event which depends only on $M^{(i)}$. Lemma 3.8, together with the fact that under $Q_2$ the inequality $|T_i| > \lambda_r$ implies $|M_i| > \lambda_r - \gamma$, leads to the bound (3.21). The following Lemma 3.9 (see the Appendix for the proof) provides the asymptotic behavior of the right-hand side of (3.21).

Lemma 3.9. Under the assumptions of Theorem 2.1, the right-hand side of (3.21) converges to 0.

The proof of Lemma 3.9 is based on several approximations. First, the probability of the event $\{T \in H_r, |T_i| > \lambda_r\}$ can be well approximated by a product of probabilities involving $M^{(i)}$ and $M_i$.

This approximation is a consequence of the fact that, conditionally on $\varepsilon$, $M^{(i)}$ and $M_i$ are independent. Second, for $i \in S^c$ the probability $P(|M_i| > \lambda_r - \gamma \mid \varepsilon)$ can be well approximated by
$$2\left(1 - \Phi\left(\frac{\lambda_r - \gamma}{\sigma}\right)\right),$$
where $\Phi(\cdot)$ is the cumulative distribution function of the standard normal distribution. The first part of this approximation is a consequence of the fact that for $i \in S^c$, $M_i = X_i'\varepsilon$. Thus, conditionally on $\varepsilon$, $M_i$ has a normal distribution with mean equal to 0 and variance equal to $\frac{\|\varepsilon\|_2^2}{n}$, which is close to $\sigma^2$ (see Corollary 3.7). The second part relies on the well-known formula
$$1 - \Phi(c) = \frac{\phi(c)}{c}\,(1 + o(c)), \qquad (3.23)$$
where $\phi(\cdot)$ is the density of the standard normal distribution and $o(c)$ converges to zero as $c$ diverges to infinity. Lastly, $P(T \in H_r^\gamma)$ does not differ much from $P(T \in H_r)$; this is a consequence of the fact that $H_r^\gamma$ does not differ much from $H_r$ and that the family of sets $H_r$, $r \in \{1, \ldots, p\}$, is disjoint. By applying the above approximations we obtain the asymptotic bound stated in Lemma 3.9.
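The tail formula used above is the standard Mills-ratio approximation $1 - \Phi(c) \approx \phi(c)/c$; its relative error shrinks like $1/c^2$. A quick stdlib-only numerical check (function names are ours):

```python
import math

def normal_tail(c):
    """Exact 1 - Phi(c) via the complementary error function."""
    return 0.5 * math.erfc(c / math.sqrt(2))

def mills_approx(c):
    """phi(c)/c, the leading-order approximation of 1 - Phi(c) for large c."""
    return math.exp(-c * c / 2) / (c * math.sqrt(2 * math.pi))
```

For example, at c = 3 the relative error is below 10%, and at c = 6 it is below 3%, consistent with the $o(c)$ term vanishing as $c$ grows.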

Lemma 3.9, together with the fact that under the assumptions of Theorem 2.1 $P(Q^c) \to 0$, yields $FDR \to 0$. One can notice that $\frac{k^* q}{p^{(1+\delta)^2 - 1}}$ is the factor responsible for the convergence of FDR to 0. It remains an open question whether in the definition of the $\lambda$ sequence (2.7) the constant $\delta$ can be replaced by a sequence converging to 0 at such a rate that the asymptotic FDR is exactly equal to the nominal level $q$. The proof of this assertion would require a refinement of the bounds provided in Su and Candès (2016), and we leave it as a topic for future research.

Now we will argue that under our assumptions the Power of SLOPE converges to 1. Recall that $TR = \#\{j : b^0_j \neq 0 \text{ and } \hat b_j \neq 0\}$ denotes the number of true rejections. Observe that $\Pi \geq P\left(\bigcap_{i \in S} \{\hat b_i \neq 0\}\right)$, since $TR = k$ on this event. Naturally $\Pi \leq 1$; therefore, by showing that $P\left(\bigcap_{i \in S} \{\hat b_i \neq 0\}\right) \to 1$ we obtain the thesis.

Lemma 3.10. Under the assumptions of Theorem 2.1, it holds that
$$P\left(\bigcap_{i \in S} \{\hat b_i \neq 0\}\right) \to 1. \qquad (3.24)$$

Proof. The proof of Lemma 3.10 is provided in the Appendix. It is based on a sequence of inequalities which reduce the problem to bounding the tails of the statistics $M_i$ for $i \in S$. The result follows from the conditional independence of the $X_i'\varepsilon$ (given $\varepsilon$) and the classical approximation (3.23).

Discussion
In this article we provide new asymptotic results on the model selection properties of SLOPE in the case when the elements of the design matrix X come from the normal distribution. Specifically, we provide conditions on the sparsity and the magnitude of true signals such that FDR of SLOPE based on the sequence λ BH , corresponding to the thresholds of the Benjamini-Hochberg correction for multiple testing, converges to 0 and the Power converges to 1. We believe these results can be extended to the sub-gaussian design matrices with independent columns, which is the topic of ongoing research. Additionally, the general results on the support of SLOPE open the way for investigation of the properties of SLOPE under arbitrary convex and differentiable loss functions.
In simulations we compared SLOPE based on the sequence $\lambda_{BH}$ with SLOPE based on the heuristic sequence proposed in Bogdan et al. (2015) and with LASSO with the tuning parameter $\lambda$ set to the first value of the tuning sequence for SLOPE. When the regressors are independent and the vector of true regression coefficients is relatively sparse, the comparison between SLOPE and LASSO bears similarity to the comparison between the Bonferroni and the Benjamini-Hochberg corrections for multiple testing. When $k = 0$, both methods perform similarly and control FDR (which for $k = 0$ is equal to the Family Wise Error Rate) close to the nominal level. When $k$ increases, the FDR of LASSO converges to zero, which however comes at the price of a substantial loss of Power for moderately strong signals. Concerning the two versions of SLOPE, our simulations suggest that the heuristic sequence allows for a more accurate FDR control over a wide range of sparsity values. We believe the techniques developed in this article form a good foundation for the analysis of the statistical properties of the heuristic version of SLOPE, which we consider an interesting topic for further research.
Our assumptions on the independence of predictors and on the sparsity of the vector of true regression coefficients are restrictive, which is related to the well-known problems with FDR control by LASSO (Bogdan et al., 2013; Su et al., 2017). In the case of LASSO, these problems can be solved by using adaptive or reweighted LASSO (Zou, 2006; Candès et al., 2008), which allow for consistent model selection under much weaker assumptions. In these modifications the values of the tuning parameters corresponding to predictors which are deemed important (based on the values of initial estimates) are reduced, which substantially reduces the bias due to shrinkage and improves model selection accuracy. Recently, Jiang et al. (2019) developed the Adaptive Bayesian version of SLOPE (ABSLOPE). According to the simulation results presented in Jiang et al. (2019), ABSLOPE controls FDR under a much wider set of scenarios than regular SLOPE, including examples with strongly correlated predictors. We believe our proof techniques can be extended to derive asymptotic FDR control for ABSLOPE, which we leave as an interesting topic for future research.
Jiang, W. et al. (2019). Adaptive Bayesian SLOPE: high-dimensional model selection with missing values. arXiv:1909.06631.
Neuvial, P. and Roquain, E. (2012). On false discovery rate thresholding for classification under sparsity. The Annals of Statistics, 40(5).

Appendix
Proof of Lemma 3.6. We prove a stronger statement: $P(Q_1 \cap Q_2) \to 1$. Denote by $X_I$, $\hat b_I$, $b^0_I$ the submatrix (subvectors) of $X$, $\hat b$, $b^0$ consisting of the columns (vector elements) with indices in a set $I$.
Observe that on the event $Q_1 = \{\operatorname{supp}(b^0) \cup \operatorname{supp}(\hat b) \subseteq S^*\}$ the vector $b^0 - \hat b$ is supported on $S^*$, so $\max_i |\Gamma_i|$ splits into two components, corresponding to the coordinates in $S^*$ and outside of $S^*$. We show that both components are bounded by $C_q \sqrt{\frac{(k^*)^2 \log p}{n}}\,\lambda_{BH}(k^*)$ with probability tending to 1. The bound on the first component is a direct corollary of Lemma A.12 proved in Su and Candès (2016).
Corollary 4.1. Under the assumptions of Theorem 2.1, there exists a constant $C_q$ depending only on $q$ such that the first component is bounded by $C_q \sqrt{\frac{(k^*)^2 \log p}{n}}\,\lambda_{BH}(k^*)$ with probability tending to 1.
Lemma A.12 of Su and Candès (2016) and the proof of Corollary 4.1 can be found in the supplementary materials (see Lemma S.2.5 and the discussion below).
It remains to be proven that the second component is also bounded by $C_q \sqrt{\frac{(k^*)^2 \log p}{n}}\,\lambda_{BH}(k^*)$ with probability tending to 1. To do this, we use Lemma A.11 proved in Su and Candès (2016), which provides bounds on the largest and the smallest singular values of $X_{S^*}$ (for details see Lemma S.2.6 in the supplementary materials).
Let $X_{S^*} = G \Sigma V'$ be the Singular Value Decomposition (SVD) of the matrix $X_{S^*}$, where $G \in M_{n \times n}$ is an orthogonal matrix, $\Sigma \in M_{n \times k^*}$ is a diagonal matrix with non-negative real numbers on the diagonal, and $V \in M_{k^* \times k^*}$ is an orthogonal matrix. Moreover, let us denote by $\sigma_i$, $\sigma_{\min}$ and $\sigma_{\max}$ the $i$-th, the smallest and the largest singular value of $X_{S^*}$, respectively.
Assume that $u$ is an arbitrary unit vector. Due to the relation between the $l_\infty$ and the $l_2$ vector norms, and additionally between the $l_2$ vector norm and the $\|\cdot\|_2$ matrix norm, it holds that $\|X_{S^*} u\|_\infty \leq \|X_{S^*} u\|_2 \leq \|X_{S^*}\|_2$. Using the SVD of the matrix $X_{S^*}$ and the sub-multiplicativity of the $\|\cdot\|_2$ matrix norm, we can continue this bound. By definition, the maximal singular value of a matrix $A$ equals the square root of the largest eigenvalue of the positive semi-definite matrix $A'A$; therefore in our case $\|X_{S^*}\|_2 = \sigma_{\max}$. Due to Lemma A.11 of Su and Candès (2016) (see Lemma S.2.6 in the supplementary materials), we obtain that for some constants $C_1$ and $C_2$ the singular values of $X_{S^*}$ are bounded, and in consequence $\|X_{S^*} u\|_\infty$ is bounded for an arbitrary unit vector $u$ with probability at least $1 - 2e^{-k^* \log(p/k^*)/2} - (\sqrt{2e}\,k^*/p)^{k^*}$ (4.2). On the other hand, due to Theorem 1.2 in Su and Candès (2016), for any constant $\delta_1 > 0$ the estimation error of $\hat b$ can be controlled (4.3). Thus, the relations (4.2) and (4.3) imply that for certain constants $C_1$ and $C_2$ the second component is bounded by $C_q \sqrt{\frac{(k^*)^2 \log p}{n}}\,\lambda_{BH}(k^*)$ (4.4) with probability tending to 1. The inequality (4.4), together with (4.1), provides the thesis of the Lemma.
Proof of Lemma 3.8. In order to prove Lemma 3.8, we introduce a modification of the vector $T$ similar to that of the vector $M$: denote by $T^{(i)}$ the vector obtained from $T$ by setting the $i$-th coordinate to $\infty$. In the first step we show (4.5). On the one hand, we know that $|T_i| > \lambda_r$. On the other hand, $T \in H_r$ implies that $|T|_{(r+1)} \leq \lambda_{r+1}$ (the second condition in the definition of $H_r$, applied with $j = r+1$). These inequalities together imply that $|T_i| \geq |T|_{(r)}$. Hence we only have to show that the corresponding relation holds between the order statistics of $|T^{(i)}|$ and $|T|$. To see that this is true, we have to consider the relation between the $r$ largest elements of the vectors $|T|$ and $|T^{(i)}|$. Let us assume that $|T_i| = |T|_{(k)}$ for some $k \leq r$. By the definition of $T^{(i)}$ we know that $|T^{(i)}|_{(1)} = |T^{(i)}_i| = \infty$ and that the other order statistics of $|T^{(i)}|$ are related to the order statistics of $|T|$ in the following way: $|T^{(i)}|_{(s)} = |T|_{(s-1)}$ for $s \leq k$ and $|T^{(i)}|_{(s)} = |T|_{(s)}$ for $s > k$. In consequence we obtain $|T^{(i)}|_{(s+1)} \leq |T|_{(s)}$, and this implies (4.5).
In the second step of the proof of Lemma 3.8 we show the reverse inclusion. In order to prove this, we use the following Proposition 4.3, which compares the order statistics of two vectors $A$ and $B$ whose difference is small in the $l_\infty$ norm. From the triangle inequality we have $d \leq e$. Thus, to prove Proposition 4.3 it is sufficient to prove that $\left||A_i| - |B|_{(i)}\right| \leq e$ for $i = 1, 2, \ldots, p$.
For this aim, let $i_0$ be the index defining the quantity $f$. When $|A_{i_0}| = |B|_{(i_0)}$, then $f = 0$ and the thesis is obtained immediately. Let us consider the case when $|A_{i_0}| > |B|_{(i_0)}$; if $|B_{i_0}| \leq |B|_{(i_0)}$, then the required bound follows.

Moreover, let $W$ be the modification of $\tilde W$ in which each $b^0_i$ is replaced by $2\sigma(1+\delta)\sqrt{2\log p}$ and the resulting vector is multiplied by a function $g(\varepsilon)$. Therefore, $W_i = (X_j'\varepsilon + 2\sigma(1+\delta)\sqrt{2\log p})\,g(\varepsilon)$ for the corresponding $j \in S$. Naturally, the elements of $W$, conditionally on $\varepsilon$, are independent and identically distributed. Furthermore, due to the fact that $|X_i'\varepsilon + b^0_i| \stackrel{D}{=} |X_i'\varepsilon + |b^0_i||$, the fact that the density of $X_i'\varepsilon + |b^0_i|$ has a mode at $|b^0_i|$, the assumption $\min_{i \in S} |b^0_i| > 2\sigma(1+\delta)\sqrt{2\log p}$, and the fact that $2\sigma(1+\delta)\sqrt{2\log p} \geq \lambda_{r+1} + \gamma$ for a large enough $n$, the corresponding order-statistic probabilities for $\tilde W$ are bounded by those for $W$; in the equality between the second and the third line of this calculation we use the standard combinatorial argument for calculating the cumulative distribution function of the $r$-th order statistic and the fact that the elements of $W$ are independent conditionally on $\varepsilon$, while in the inequality we use the fact that $P(|W_1| \geq \lambda_{r+1} + \gamma \mid \varepsilon) \leq 1$. Now, for some $j \in S$ (corresponding to $W_1$) and a large enough $n$ we have
$$P(|W_1| \geq \lambda_{r+1} + \gamma \mid \varepsilon) = P\left(\left|(X_j'\varepsilon + 2\sigma(1+\delta)\sqrt{2\log p})\,g(\varepsilon)\right| \geq \lambda_{r+1} + \gamma \mid \varepsilon\right) =$$
$$= P\left(\left|X_j'\varepsilon + 2\sigma(1+\delta)\sqrt{2\log p}\right| \geq \lambda_{r+1} + \gamma,\ Q_3 \mid \varepsilon\right) \geq P\left(X_j'\varepsilon \geq -\sigma(1+\delta/2)\sqrt{2\log p},\ Q_3 \mid \varepsilon\right).$$

Now, we turn our attention to (4.12). From now on we assume that $r \geq k + 2$. Let us denote by $V$ the subvector of $M^{(1)}$ consisting of the elements with indices in $S^c \setminus \{1\}$ (corresponding to the elements $b^0_i$ equal to 0, except $M^{(1)}_1$). Notice that $|M^{(1)}|_{(k+2)} \leq |V|_{(1)}$. This is a consequence of the fact that $|V|_{(1)}$ is the largest element of $|V|$ and is equal to or larger than $p - k - 1$ elements of $|M^{(1)}|$. Similarly, we can show that $|M^{(1)}|_{(r)} \leq |V|_{(r-k-1)}$. Conditionally on $\varepsilon$, the elements of $V$ are independent and identically distributed. Therefore we have (4.14), where the random vector $Z = (Z_1, \ldots, Z_{p-k-1}) \sim N(0, I_{p-k-1})$ is independent of $\|\varepsilon\|_2$.
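The "standard combinatorial argument" for the CDF of an order statistic invoked above is simply that the $r$-th smallest of $m$ i.i.d. draws lies below $t$ exactly when at least $r$ of the draws do. A stdlib-only sketch (for the $r$-th largest, apply the same formula to the complementary probability):

```python
import math

def order_stat_cdf(r, m, F_t):
    """P(X_(r) <= t) for the r-th smallest of m i.i.d. draws, where F_t = F(t):
    at least r of the m draws must fall at or below t."""
    return sum(math.comb(m, j) * F_t ** j * (1 - F_t) ** (m - j)
               for j in range(r, m + 1))
```

For example, the minimum of three fair draws lies below the median with probability $1 - (1/2)^3 = 0.875$.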
Notice that the probability in (4.14) is maximized for the largest possible standard deviation of the elements of $V$ (at most $\frac{\|\varepsilon\|_2}{\sqrt{n}}$). Therefore, when restricting to $Q_3$, we obtain (4.15). Moreover, for a large enough $n$ we have (4.16), which follows directly from the fact that for a large enough $n$, $\lambda_r - \gamma \geq \sigma(1 + 3\delta/4)\sqrt{2\log(p/r)}$, and from a corresponding lower bound on $\sqrt{2\log(p/r)}$ valid for $r \leq k^*$.

In consequence we obtain, for a large enough $n$,
$$P\left(|Z|_{(r-k-1)} \geq \frac{\lambda_r - \gamma}{\sigma(1 + 1/k^*)}\right) \leq P\left(|Z|_{(r-k-1)} \geq s\right),$$
where $s = (1+\delta/2)\sqrt{2\log p}$. Let us denote by $u_1, \ldots, u_{p-k-1}$ i.i.d. random variables from the uniform distribution $U[0,1]$ and by $u_{[1]} \leq \ldots \leq u_{[p-k-1]}$ the corresponding order statistics. We know that
$$P\left(|Z|_{(r-k-1)} \geq s\right) = P\left(u_{[r-k-1]} \leq 2(1 - \Phi(s))\right).$$
Therefore, by using the classical upper bound $1 - \Phi(s) \leq \phi(s)/s$, we obtain for a large enough $n$ a bound in terms of the order statistics of the uniform distribution. We also know that the $i$-th order statistic of the uniform distribution is a beta-distributed random variable:
$$u_{[i]} \sim Beta(i,\ p - k - i).$$
On the other hand, a well-known fact is that if $A_1 \sim Gamma(i, \theta)$ and $A_2 \sim Gamma(p-k-i, \theta)$ are independent, then $A_1/(A_1 + A_2) \sim Beta(i, p-k-i)$. In consequence, when $E_1, \ldots, E_{p-k}$ are i.i.d. random variables from the exponential distribution with mean equal to 1, it holds that
$$u_{[i]} \stackrel{D}{=} \frac{E_1 + \ldots + E_i}{E_1 + \ldots + E_{p-k}},$$
and the probability of interest can be expressed through $F_E$, the Erlang cumulative distribution function. Therefore we obtain the required bound, and to prove (4.12) it remains to be shown that the two relations in (4.17) hold. The first relation follows directly from the properties of the Erlang cumulative distribution function. The second relation is a consequence of Chebyshev's inequality (for details see the supplementary materials). This ends the proof.
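The distributional identities used above (uniform order statistics as Beta variables, and their Gamma/exponential representation) can be checked by a small Monte Carlo experiment. This is a stdlib-only sketch with our own names; both estimators below target the same Beta mean $i/(m+1)$:

```python
import random

def mean_order_stat(m, i, reps, rng):
    """Monte Carlo mean of the i-th smallest of m U[0,1] draws; expectation i/(m+1)."""
    return sum(sorted(rng.random() for _ in range(m))[i - 1]
               for _ in range(reps)) / reps

def mean_gamma_ratio(m, i, reps, rng):
    """Monte Carlo mean of (E_1+...+E_i)/(E_1+...+E_{m+1}) for i.i.d. Exp(1) draws,
    which has the same Beta(i, m+1-i) law as the i-th uniform order statistic."""
    total = 0.0
    for _ in range(reps):
        e = [rng.expovariate(1.0) for _ in range(m + 1)]
        total += sum(e[:i]) / sum(e)
    return total / reps
```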

Proof of Lemma 3.10. Recall that we want to show that $P\left(\bigcap_{i \in S} \{\hat b_i \neq 0\}\right) \to 1$.
We can bound the considered probability in the following way: the equality is a consequence of Theorem 3.2 and the inequality comes from the fact that $\lambda$ is a decreasing sequence. Now, recall that $T_i = M_i + \Gamma_i$ and $Q_2 = \{\max_i |\Gamma_i| \leq \gamma\}$. Therefore, due to the triangle inequality, on the event $Q_2$ the inequality $|M_i| > \lambda_1 + \gamma$ implies $|T_i| > \lambda_1$. Now, because $P(Q_2) \to 1$, we only have to show that
$$P\left(\bigcap_{i \in S} \{|M_i| > \lambda_1 + \gamma\}\right) \to 1.$$
Let us consider the properties of $M_i = X_i'\varepsilon + b^0_i$. Notice that due to the symmetry of the distribution of $X_i'\varepsilon$ we have
$$P(|M_i| > \lambda_1 + \gamma) = P\left(\left|X_i'\varepsilon + |b^0_i|\right| > \lambda_1 + \gamma\right) \geq P\left(X_i'\varepsilon > \lambda_1 + \gamma - |b^0_i|\right),$$
where in the last inequality we omit the absolute value in $|X_i'\varepsilon + |b^0_i||$ and subtract $|b^0_i|$. Now, due to the assumption on the signal strength, we have for a large enough $n$:
$$\lambda_1 + \gamma - |b^0_i| \leq -\sigma(1 + \delta/2)\sqrt{2\log p}.$$
This is a consequence of the fact that for a large enough $n$, $\lambda_1 + \gamma \leq \sigma(1 + 1.5\delta)\sqrt{2\log p}$.
Moreover, we know that conditionally on $\varepsilon$, the variables $X_i'\varepsilon$ are independent random variables from the normal distribution $N\left(0, \frac{\|\varepsilon\|_2^2}{n}\right)$. Therefore we have: