1 Introduction

In this article we consider the classical problem of identifying important predictors in the multiple regression model

$$ Y=X b^{0}+\epsilon , $$
(1.1)

where X is the design matrix of dimension n × p and \(\epsilon \sim N(0,\sigma ^{2} I_{n\times n})\) is the noise vector. In the case when p > n, the vector of parameters b0 is not identifiable and can be uniquely estimated only under certain additional assumptions concerning, e.g., its sparsity (i.e. the number of non-zero elements). One of the most popular and computationally tractable methods for estimating b0 in the case when p > n is the Least Absolute Shrinkage and Selection Operator (LASSO, Tibshirani (1996)), first introduced in the context of signal processing as Basis Pursuit Denoising (BPDN, Chen and Donoho (1994)). The LASSO estimator is defined as

$$ \hat b^{L}=argmin_{b} \left\{ \frac{1}{2} || Y-Xb||_{2}^{2}+\lambda |b|_{1}\right\} , $$
(1.2)

where ||⋅||2 denotes the Euclidean norm in Rn and \(|b|_{1}={\sum }_{j=1}^{p} |b_{j}|\) is the L1 norm of b. If \(X^{\prime }X=I\), then LASSO reduces to a simple shrinkage (soft-thresholding) operator applied to the elements of the vector \(\tilde Y = X^{\prime }Y\),

$$ \hat b^{L}_{i}=\eta_{\lambda}(\tilde Y_{i}) , $$

where

$$ \eta_{\lambda}(x)=\left\{\begin{array}{ccc} x-\lambda&\text{if}&x>\lambda\\ 0&\text{if}&|x|\leq\lambda\\ x+\lambda&\text{if}&x<-\lambda \end{array} \right. . $$

In this case the choice of the tuning parameter

$$ \lambda=\lambda^{Bon}=\sigma {\Phi}^{-1}\left( 1-\frac{\alpha}{2p}\right)=\sigma \sqrt{2\log p}(1+o(1)) , $$

allows for the control of the probability of at least one false discovery (the Family Wise Error Rate, FWER) at level α.
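Under the orthogonal design both the soft-thresholding operator and the Bonferroni-type threshold are easy to compute explicitly. The sketch below is only an illustration of these two formulas (it is not the code used in the paper); the function names are ours.

```python
import numpy as np
from scipy.stats import norm

def soft_threshold(x, lam):
    """Soft-thresholding operator eta_lambda, applied coordinate-wise."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def lasso_orthogonal(Y, X, lam):
    """LASSO solution when X'X = I: soft-threshold the vector Y_tilde = X'Y."""
    return soft_threshold(X.T @ Y, lam)

# Bonferroni-type threshold controlling FWER at level alpha
p, alpha, sigma = 1000, 0.05, 1.0
lam_bon = sigma * norm.ppf(1 - alpha / (2 * p))
```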

In the context of high dimensional multiple testing, the Bonferroni correction is often replaced by the Benjamini and Hochberg (1995) multiple testing procedure, aimed at the control of the False Discovery Rate (FDR). Apart from FDR control, this procedure also has appealing properties in the context of the estimation of the vector of means of a multivariate normal distribution with independent entries (Abramovich et al. 2006) or in the context of minimizing the Bayes Risk related to the 0-1 loss (Bogdan et al. 2011; Neuvial and Roquain, 2012; Frommlet and Bogdan, 2013). In the context of multiple regression with an orthogonal design, the Benjamini-Hochberg procedure works as follows:

  1.

    Fix q ∈ (0, 1) and sort \(\tilde Y\) such that

    $$ |\tilde Y |_{(1)} \ge |\tilde Y|_{(2)} \ge {\ldots} \ge |\tilde Y|_{(p)} , $$
  2.

    Identify the largest j such that

    $$ |\tilde Y|_{(j)} \ge \lambda^{BH}_{j}= \sigma {\Phi}^{-1}\left( 1- \frac{jq}{2p}\right), $$
    (1.3)

    Call this index \(j_{BH}\).

  3.

    Reject \(H_{(j)}\) for every \(j \leq j_{BH}\).

Thus, in BH, the fixed threshold of the Bonferroni correction, \(\lambda^{Bon}\), is replaced with the sequence \(\lambda^{BH}\) of 'sloped' thresholds for the sorted test statistics (see Fig. 1). This allows for a substantial increase of power and for an improvement of prediction properties in the case when some of the predictors are relatively weak.
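The step-up rule (1.3) can be sketched in a few lines; the snippet below (our illustration only, under the orthogonal design, with a hypothetical input \(\tilde Y\)) returns the indices rejected by the Benjamini-Hochberg procedure.

```python
import numpy as np
from scipy.stats import norm

def bh_selection(Y_tilde, q, sigma=1.0):
    """Benjamini-Hochberg step-up applied to |Y_tilde| under the orthogonal design."""
    p = len(Y_tilde)
    lam_bh = sigma * norm.ppf(1 - q * np.arange(1, p + 1) / (2 * p))  # 'sloped' thresholds
    order = np.argsort(-np.abs(Y_tilde))        # indices sorting |Y_tilde| in decreasing order
    exceed = np.abs(Y_tilde)[order] >= lam_bh   # compare sorted statistics with the thresholds
    if not exceed.any():
        return np.array([], dtype=int)          # no rejections
    j_bh = np.max(np.nonzero(exceed)[0]) + 1    # largest j with |Y_tilde|_(j) >= lambda_j^BH
    return order[:j_bh]                         # reject H_(j) for every j <= j_BH

# Example: selected = bh_selection(X.T @ Y, q=0.2)
```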

Figure 1: Bonferroni and Benjamini-Hochberg procedures for multiple testing

The idea of using a decreasing sequence of thresholds was subsequently used in the Sorted L-One Penalized Estimator (SLOPE, Bogdan et al. (2013) and Bogdan et al. (2015)) for the estimation of coefficients in the multiple regression model:

$$ b^{SL}=argmin_{b} \left\{\frac 12\left\|y-Xb\right\|_{2}^{2}+\sum\limits_{i=1}^{p}\lambda_{i}\left|b\right|_{(i)}\right\}, $$
(1.4)

where \(\left |b\right |_{(1)}\geq \ldots \geq \left |b\right |_{(p)}\) are the ordered magnitudes of the elements of b and \(\lambda = (\lambda_{1}, \ldots, \lambda_{p})\) is a non-zero, non-increasing and non-negative sequence of tuning parameters. As noted in Bogdan et al. (2013) and Bogdan et al. (2015), the function \(J_{\lambda }(b)={\sum }_{i=1}^{p}\lambda _{i}\left |b\right |_{(i)}\) is a norm. To see this, observe that:

  • for any constant a ∈ R and a sequence b ∈ Rp, \(J_{\lambda}(ab) = |a|J_{\lambda}(b)\),

  • \(J_{\lambda}(b) = 0\) if and only if b = 0 ∈ Rp.

To show the triangle inequality:

  • \(J_{\lambda}(x + y) \leq J_{\lambda}(x) + J_{\lambda}(y)\),

let us denote by ρ the permutation of the set {1, … , p} such that

$$ \left|x_{\rho(1)}+y_{\rho(1)}\right|\geq \left|x_{\rho(2)}+y_{\rho(2)}\right|\geq ... \geq \left|x_{\rho(p)}+y_{\rho(p)}\right| .$$

Then

$$ J_{\lambda}(x+y) = \sum\limits_{i=1}^{p}\lambda_{i}\left|x_{\rho(i)}+y_{\rho(i)}\right| \leqslant \sum\limits_{i=1}^{p}\lambda_{i}\left|x_{\rho(i)}\right|+\sum\limits_{i=1}^{p}\lambda_{i}\left|y_{\rho(i)}\right| $$

and the result can be derived from the well known rearrangement inequality, according to which for any permutation ρ and any sequence x ∈ Rp:

$$ \sum\limits_{i=1}^{p}\lambda_{i}\left|x_{\rho(i)}\right| \leqslant \sum\limits_{i=1}^{p} \lambda_{i} |x|_{(i)} . $$

Thus SLOPE is a convex optimization procedure which can be efficiently solved using classical optimization tools. It is also easy to observe that in the case \(\lambda_{1} = \ldots = \lambda_{p}\), SLOPE reduces to LASSO, while in the case \(\lambda_{1} > \lambda_{2} = \ldots = \lambda_{p} = 0\), the Sorted L-One norm \(J_{\lambda}\) reduces to (a multiple of) the \(L_{\infty }\) norm.
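Evaluating \(J_{\lambda}\) numerically makes these special cases concrete; a minimal sketch (our illustration, with hypothetical inputs):

```python
import numpy as np

def sorted_l1_norm(b, lam):
    """J_lambda(b) = sum_i lambda_i * |b|_(i), with lam non-increasing and non-negative."""
    return float(np.sum(np.asarray(lam) * np.sort(np.abs(b))[::-1]))

b = np.array([0.5, -2.0, 1.0])
print(sorted_l1_norm(b, [1.0, 1.0, 1.0]))  # lambda_1 = ... = lambda_p: the L1 norm, 3.5
print(sorted_l1_norm(b, [1.0, 0.0, 0.0]))  # lambda_1 > lambda_2 = ... = 0: the L-infinity norm, 2.0
```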

Figure 2 illustrates different shapes of the unit spheres corresponding to different versions of the Sorted L-One Norm. Since the solutions of SLOPE tend to occur on the edges of the respective spheres, Fig. 2 demonstrates the large flexibility of SLOPE with respect to dimensionality reduction. In the case when \(\lambda_{1} = \ldots = \lambda_{p}\), SLOPE reduces dimensionality by shrinking the coefficients to zero. In the case when \(\lambda_{1} > \lambda_{2} = \ldots = \lambda_{p} = 0\), the reduction of dimensionality is performed by shrinking the coefficients towards each other (since the edges of the \(l_{\infty }\) sphere correspond to vectors b such that at least two coefficients are equal to each other). In the case when the sequence of thresholding parameters is monotonically decreasing, SLOPE reduces the dimensionality both ways: it shrinks the coefficients towards zero and towards each other. Thus it returns sparse and stable estimators, which have recently been proved to achieve the minimax rate of convergence in the context of sparse high dimensional regression and logistic regression (Su and Candès, 2016; Bellec et al. 2018; Abramovich and Grinshtein, 2017).

Figure 2: Shapes of different SLOPE spheres

From the perspective of model selection, it has been proved in Bogdan et al. (2013) and Bogdan et al. (2015) that SLOPE with the vector of tuning parameters \(\lambda^{BH}\) (1.3) controls FDR at level q under the orthogonal design. This is no longer true if the inner products between columns of the design matrix are different from zero, which almost always occurs if the predictors are random variables. Similar problems with the control of the number of False Discoveries occur for LASSO. Specifically, in Bogdan et al. (2013) it is shown that in the case of the Gaussian design with independent predictors, LASSO with a fixed parameter λ can control FDR only if the true parameter vector is sufficiently sparse. A natural question is whether there exists a bound on the sparsity under which SLOPE can control FDR. In this article we address this question and report a theoretical result on the asymptotic control of FDR by SLOPE under the Gaussian design. Our main theoretical result states that by multiplying the sequence \(\lambda^{BH}\) by a constant larger than 1, one can achieve full asymptotic power and FDR converging to 0 if the number k(n) of nonzero elements in the true vector of regression coefficients satisfies \(k=o\left (\sqrt {\frac {n}{\log p}}\right )\) and the values of these non-zero elements are sufficiently large. We also report the results of a simulation study which suggest that the assumption on the signal sparsity is necessary when using the \(\lambda^{BH}\) sequence but is unnecessarily strong when using the heuristic adjustment of this sequence proposed in Bogdan et al. (2015). Simulations also suggest that the asymptotic FDR control is guaranteed independently of the magnitude of the non-zero regression coefficients.

2 Asymptotic Properties of SLOPE

2.1 False Discovery Rate and Power

Let us consider the multiple regression model (1.1) and let \(\hat b\) be some sparse estimator of b0. The numbers of false, true and all rejections, and the number of non-zero elements in b0 (respectively: V, TR, R, k), are defined as follows

$$ V = \#\{j: {b_{j}^{0}}=0 ~\text{and}~ \hat b_{j} \neq 0\}, \ \ \ \ \ \ \ R = \#\{j:\hat b_{j} \neq 0\} , $$
(2.1)
$$ TR = \#\{j: {b_{j}^{0}} \neq 0 ~\text{and}~ \hat b_{j} \neq 0\}, \ \ \ \ \ \ \ k = \#\{j: {b^{0}_{j}}\neq 0\} . $$
(2.2)

The False Discovery Rate is defined as:

$$ FDR := \mathbb{E}\left( \frac{V}{R \vee 1}\right) $$
(2.3)

and the Power as:

$$ {\Pi} := \frac{\mathbb{E} \left( TR\right)}{k}. $$
(2.4)
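In simulations, V, R, TR, and hence the false discovery proportion and the proportion of detected signals, are computed directly from the supports of b0 and \(\hat b\); FDR and Power are then estimated by averaging over replicates. A minimal sketch (our illustration, not the code used for the reported experiments):

```python
import numpy as np

def fdp_and_tpp(b0, b_hat, tol=0.0):
    """False discovery proportion V/(R v 1) and true positive proportion TR/k."""
    true_support = np.abs(b0) > 0
    selected = np.abs(b_hat) > tol
    R = int(selected.sum())
    V = int(np.sum(selected & ~true_support))   # false discoveries
    TR = int(np.sum(selected & true_support))   # true discoveries
    k = int(true_support.sum())
    return V / max(R, 1), TR / max(k, 1)

# Averaging fdp_and_tpp over independent replicates estimates FDR (2.3) and Power (2.4).
```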

2.2 Asymptotic FDR and Power

We formulate our asymptotic results under a setup where n and p diverge to infinity and p can grow faster than n. As in the case of the support recovery results for LASSO, we need to impose a constraint on the sparsity of b0, which is measured by the number of truly important predictors \(k = \#\{i: {b^{0}_{i}} \neq 0\}\). Thus we consider a sequence of linear models of the form (1.1), indexed by n, with their "dimensions" characterized by the triplets \((n, p_{n}, k_{n})\). For the sake of clarity, in the remainder of the text we drop the subscripts on p and k.

The main result of this article is as follows:

Theorem 2.1.

Consider the linear model of the form (1.1) and assume that all elements of the design matrix \(X \in \mathbb {R}^{n \times p}\) are i.i.d. random variables from the normal N(0,1/n) distribution. Moreover, suppose there exists δ > 0 such that

$$ \min_{{b^{0}_{j}} \neq 0} \vert {b^{0}_{j}} \vert \geqslant 2\sigma (1+\delta)\sqrt{2 \log p} $$
(2.5)

and suppose

$$ p\rightarrow \infty, \frac{k}{p} \rightarrow 0 ~\text{and}~ \frac{k^{2} \log p}{n} \rightarrow 0. $$
(2.6)

Then for any q ∈ (0,1), the SLOPE procedure with the sequence of tuning parameters

$$ \lambda_{i}(q,\delta) = \sigma (1+\delta) {\Phi}^{-1}\left( 1- \frac{qi}{2p}\right) $$
(2.7)

has the following properties

$$ FDR \rightarrow 0 , {\Pi} \rightarrow 1.$$

Proof

The proof of Theorem 2.1 makes extensive use of the results on the asymptotic properties of SLOPE reported in Su and Candès (2016). The roadmap of this proof is provided in Section 3, while the proof details can be found in the Appendix and in the supplementary materials. □

Remark 2.2 (The assumption on the design matrix).

The assumption that the elements of X are i.i.d. random variables from the normal N(0,1/n) distribution is technical. We expect that the results can be generalized to the case where the predictors are independent, sub-Gaussian random variables. The assumption that the variance is equal to 1/n can be satisfied by an appropriate scaling of such a design matrix. As compared to the classical standardization towards unit variance, our scaling allows for the control of FDR with a sequence of tuning parameters λ which does not depend on the number of observations n. If the data are standardized such that \(X_{ij}\sim N(0,1)\), then Theorem 2.1 holds when the sequence of tuning parameters (2.7) and the lower bound on the signal magnitude (2.5) are both multiplied by \(n^{-1/2}\).
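The sequence (2.7) and the rescaling discussed in Remark 2.2 are straightforward to generate; a minimal sketch (our illustration, with names of our choosing):

```python
import numpy as np
from scipy.stats import norm

def slope_lambda(p, q, delta, sigma=1.0):
    """Tuning sequence (2.7): lambda_i = sigma * (1 + delta) * Phi^{-1}(1 - q*i/(2p))."""
    i = np.arange(1, p + 1)
    return sigma * (1.0 + delta) * norm.ppf(1 - q * i / (2 * p))

# For X_ij ~ N(0, 1/n) use slope_lambda(p, q, delta) directly; for the classical
# standardization X_ij ~ N(0, 1), divide the whole sequence by sqrt(n):
# lam = slope_lambda(p, q, delta) / np.sqrt(n)
```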

Remark 2.3 (The assumption on the signal strength).

Our assumption on the signal strength is not very restrictive. When using the classical standardization of the explanatory variables (i.e. assuming that \(X_{ij}\sim N(0,1)\)), this assumption allows for the magnitude of the signal to converge to zero at such a rate that

$$ \min_{{b^{0}_{j}} \neq 0} \vert {b^{0}_{j}} \vert \geqslant 2\sigma (1+\delta)\sqrt{\frac{2 \log p}{n}} . $$

This assumption is needed to obtain the power that converges to 1. The proof of Theorem 2.1 implies that this assumption is not needed for the asymptotic FDR control if k is bounded by some constant. Moreover, our simulations suggest that the asymptotic FDR control holds independently of the signal strength as long as k satisfies assumption (2.6). The proof of this conjecture would require a substantial refinement of the proof techniques and remains an interesting topic for future work.

2.3 Simulations

In this section we present the results of a simulation study. The data are generated according to the linear model:

$$ Y = Xb^{0} +\epsilon , $$

where the elements of the design matrix X are i.i.d. random variables from the normal N(0,1/n) distribution and \(\epsilon\) is independent of X and comes from the standard multivariate normal distribution N(0,I). The parameter vector b0 has k non-zero elements and p − k zeroes.

We compare three methods, two versions of SLOPE and LASSO:

  1.

    SLOPE with the sequence of tuning parameters given by formula (2.7), denoted by "SLOPE".

  2.

    SLOPE with the sequence:

    $$ \lambda_{i}(q) = \left\{\begin{array}{ccc} \sigma {\Phi}^{-1}\left(1-\frac{q}{2p}\right) & \text{if} & i=1\\ \min\left( \lambda_{i-1}, \sigma{\Phi}^{-1}\left(1-\frac{qi}{2p}\right)\sqrt{1+\frac{{\sum}_{j<i}{\lambda_{j}^{2}}}{n-i-2}}\right) & \text{if} & i>1 . \end{array}\right. $$

    This sequence was proposed in Bogdan et al. (2015) as a heuristic correction which takes into account the influence of the cross products between the columns of the design matrix X. We refer to this procedure as heuristic SLOPE ("SLOPE_heur"); a code sketch of the three tuning-parameter choices is given after this list.

  3.

    LASSO with the tuning parameter:

    $$ \lambda= \sigma (1+\delta) {\Phi}^{-1}\left( 1- \frac{q}{2p}\right) . $$
    (2.8)

The tuning parameter for LASSO is equal to the first element of the tuning sequence of SLOPE, so the FDR of LASSO and that of SLOPE are approximately equal when k = 0.
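The three tuning-parameter choices and the simulation design can be sketched as below (our simplified illustration; the SLOPE and LASSO fits themselves would be obtained from a convex-optimization solver, which we do not reproduce here, and keeping the heuristic sequence flat once its correction term is no longer defined is our own convention for this sketch).

```python
import numpy as np
from scipy.stats import norm

def lambda_bh(p, q, sigma=1.0, delta=0.0):
    """Sequence (2.7); delta = 0 gives the original lambda^BH sequence."""
    i = np.arange(1, p + 1)
    return sigma * (1 + delta) * norm.ppf(1 - q * i / (2 * p))

def lambda_heuristic(p, n, q, sigma=1.0):
    """Heuristic adjustment of lambda^BH proposed in Bogdan et al. (2015)."""
    lam_raw = lambda_bh(p, q, sigma)
    lam = np.empty(p)
    lam[0] = lam_raw[0]
    for i in range(1, p):                       # 0-based i corresponds to index i+1 in the formula
        df = n - (i + 1) - 2
        if df <= 0:                             # correction undefined: keep the sequence flat
            lam[i] = lam[i - 1]
            continue
        lam[i] = min(lam[i - 1], lam_raw[i] * np.sqrt(1 + np.sum(lam[:i] ** 2) / df))
    return lam

# Simulation design used in Figs. 3-4: p = 0.05 * n^1.5, k = round(n^alpha), X_ij ~ N(0, 1/n)
n, alpha_exp, q, delta = 1000, 0.3, 0.2, 0.05
p = int(0.05 * n ** 1.5)
k = int(round(n ** alpha_exp))
X = np.random.randn(n, p) / np.sqrt(n)
b0 = np.zeros(p)
b0[:k] = 2 * (1 + delta) * np.sqrt(2 * np.log(p))   # strong signals, as in Fig. 3
Y = X @ b0 + np.random.randn(n)
lam_lasso = lambda_bh(p, q, delta=delta)[0]         # scalar tuning parameter (2.8)
```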

Figure 3 presents the FDR and Power of the different procedures. First, let us concentrate on the behavior of SLOPE when the sequence of tuning parameters is defined as in Theorem 2.1 (green line in each sub-plot). The green rectangle contains plots where the sequences of tuning parameters and the signal sparsity k(n) satisfy the assumptions of Theorem 2.1. It is noticeable that in this area the FDR of SLOPE slowly converges to zero and the Power converges to 1. Moreover, FDR is close to or below the nominal level q = 0.2 for the whole range of considered values of n. It is also clear that larger values of δ lead to more conservative versions of SLOPE.

Figure 3: FDR and Power of different procedures as functions of n and the parameters δ (see (2.7) and (2.8)) and α, for \(p = 0.05 n^{1.5}\), \(k = \text{round}(n^{\alpha})\), \({b^{0}_{1}}=\ldots ={b^{0}_{k}}=2(1+\delta )\sqrt {2 \log p}\) and q = 0.2. The results presented in the green rectangle correspond to the values of the parameters which meet the assumptions of Theorem 2.1. The numbers above the green lines correspond to the actual values of the parameter k. Each point was obtained by averaging false and true positive proportions over at least 500 independent replicates

In the red area the assumptions of Theorem 2.1 are violated. Here we can see that when δ > 0 and α = 0.5, FDR is still a decreasing function of n, but the rate of this decrease is slow and FDR remains substantially above the level q = 0.2 even for n = 2000. In the case when δ = 0 (i.e. when the original \(\lambda^{BH}\) sequence is used), FDR stabilizes at a value which exceeds the nominal level.

Let us now turn our attention to the other methods. We can observe that LASSO is the most conservative procedure and that, as expected, its FDR converges to 0 when k increases. Since the simulated signals are strong, this does not lead to a substantial decrease of Power as compared to SLOPE. Interestingly, SLOPE with the heuristic choice of tuning parameters seems to provide stable FDR control over the whole range of considered parameter values. This suggests that the upper bound on k provided in assumption (2.6) could be relaxed when working with this heuristic sequence. The proof of this claim remains an interesting topic for further research.

Figure 4 presents simulations for the case when \({b^{0}_{1}}=\ldots ={b^{0}_{k}}=0.9\sqrt {2 \log p}\), i.e. when the signal magnitude does not satisfy assumption (2.5). Here the FDR of SLOPE behaves similarly to the case of strong signals. These results suggest that the assumption on the signal strength might not be necessary in the context of FDR control. Figure 4 also illustrates a strikingly good control of FDR by SLOPE with the heuristically adjusted sequence of tuning parameters. LASSO is substantially more conservative than both versions of SLOPE. Its FDR converges to zero, which in the case of such moderate signals leads to a substantial decrease of Power as compared to SLOPE.

Figure 4: FDR and Power of different procedures as functions of n and the parameters δ (see (2.7) and (2.8)) and α, for \(p = 0.05 n^{1.5}\), \(k = \text{round}(n^{\alpha})\), \({b^{0}_{1}}=\ldots ={b^{0}_{k}}=0.9\sqrt {2 \log p}\) and q = 0.2. The results presented in the green rectangle correspond to the values of the parameters which meet the assumptions of Theorem 2.1, except the condition on the signal strength. The numbers above the green lines correspond to the actual values of the parameter k. Each point was obtained by averaging false and true positive proportions over at least 500 independent replicates

3 Roadmap of the Proof

In the first part of this section we characterize the support of the SLOPE estimator. The proofs of the Theorems presented in this part rely only on the differentiability and convexity of the loss function. Therefore we present them in a general form, which will be useful for further work on extensions of SLOPE to Generalized Linear Models or Gaussian Graphical Models.

3.1 Support of the SLOPE estimator under the general loss function

Let us consider the following generalization of SLOPE:

$$ \hat b=argmin_{b} \left\{l(b)+\sum\limits_{i=1}^{p}\lambda_{i}\left|b\right|_{(i)}\right\}, $$
(3.1)

where l(b) is a convex and differentiable loss function (e.g. \(0.5\Vert Y-Xb\Vert ^{2}_{2}\) for multiple linear regression). Let R denote the number of non-zero elements of \(\hat b\).

The following Theorems 3.1 and 3.2 characterize the events \(\{\hat {b}_{i} \neq 0\}\) and {R = r} by using the gradient U(b) of the negative loss function −l(b):

$$ U(b)=(U_{1}(b), ... , U_{p}(b))^{\prime}= -\nabla l(b) . $$
(3.2)

Additionally, for a > 0 we define the vector T(a) as

$$ T(a) = U(\hat{b})+ a\hat{b} . $$

Thus, \(T_{i}(a)=U_{i}(\hat {b})\) if \(\hat {b}_{i}=0\). Also, the additional term \(a\hat {b}_{i}\) has the same sign as \(U_{i}(\hat {b})\), so \(|T_{i}(a)|>|U_{i}(\hat b)|\) if \(\hat {b}_{i}\neq 0\). By calculating the subgradient of the objective function of the LASSO estimator, it is possible to check that LASSO selects those variables for which the respective coordinates of |T(a)| exceed the value of the tuning parameter λ. In Bogdan et al. (2015) the support of \(\hat b\) for SLOPE under the orthogonal design is characterized. It is shown that, similarly as in the case of the Benjamini-Hochberg correction for multiple testing, it is not sufficient to compare the ordered coordinates of |T(a)| to the respective values of the decaying sequence of tuning parameters. Proceeding in this simple way, one could eliminate regressors whose value of |T(a)| is larger than that of some of the regressors retained in the model, while the SLOPE estimator preserves the ordering of |T(a)|. Thus, identification of the SLOPE support is more involved and requires the introduction of the following sets \(H_{r}\): for r ∈ {1, ..., p} we define

$$ H_{r} = \left\{ w \in \mathbb{R}^{p}: \forall_{j \leqslant r} \sum\limits_{i=j}^{r} \lambda_{i} < \sum\limits_{i=j}^{r} \vert w \vert_{(i)} ~and~ \forall_{j \geqslant r+1} \sum\limits_{i=r+1}^{j}\lambda_{i} \geqslant \sum\limits_{i=r+1}^{j} \vert w\vert_{(i)} \right\}. $$
(3.3)

Theorem 3.1.

Consider the optimization problem (3.1) with an arbitrary sequence \(\lambda _{1}\geqslant \lambda _{2} \geqslant ... \geqslant \lambda _{p} \geqslant 0\). Assume that l(b) is a convex and differentiable function. Then for any a > 0,

$$ R = r \Longleftrightarrow T(a) \in H_{r} . $$

Theorem 3.2.

Consider the optimization problem (3.1) with an arbitrary sequence \(\lambda _{1}\geqslant \lambda _{2} \geqslant ... \geqslant \lambda _{p} \geqslant 0\). Assume that l(b) is a convex and differentiable function and that R = r. Then for any a > 0 it holds:

$$ \left( \hat{b}_{i} \neq 0 \right) \Leftrightarrow \left( \vert T_{i}(a) \vert >\lambda_{r}\right) $$

and

$$ \left( \hat{b}_{i} \neq 0 \right) \Rightarrow \left( \vert U_{i}(\hat b) \vert \geqslant\lambda_{r}\right). $$

Moreover if we assume that \(\lambda _{1}>\lambda _{2}>...>\lambda _{p}\geqslant 0\), then:

$$ \left( \hat{b}_{i} \neq 0 \right) \Leftarrow \left( \vert U_{i}(\hat b) \vert \geqslant \lambda_{r} \right). $$

The proofs of Theorems 3.1 and 3.2 are provided in the supplementary materials.
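Theorems 3.1 and 3.2 translate into a simple recipe for reading off the size of the SLOPE support from the magnitudes |T(a)|: r is the unique index for which the sorted magnitudes fall into \(H_{r}\). A sketch of this check (our illustration of (3.3), not part of the proofs):

```python
import numpy as np

def slope_support_size(t_abs, lam):
    """Return r such that the vector T(a), with magnitudes t_abs, lies in H_r (3.3).

    t_abs: |T(a)| in any order; lam: non-increasing, non-negative tuning sequence.
    """
    w = np.sort(np.abs(t_abs))[::-1]            # |T(a)|_(1) >= ... >= |T(a)|_(p)
    lam = np.asarray(lam, dtype=float)
    p = len(w)
    for r in range(p, -1, -1):                  # by Theorem 3.1 exactly one r matches
        # for all j <= r: sum_{i=j}^{r} lambda_i < sum_{i=j}^{r} |w|_(i)
        cond1 = all(np.sum(lam[j:r]) < np.sum(w[j:r]) for j in range(r))
        # for all j >= r+1: sum_{i=r+1}^{j} lambda_i >= sum_{i=r+1}^{j} |w|_(i)
        cond2 = all(np.sum(lam[r:j]) >= np.sum(w[r:j]) for j in range(r + 1, p + 1))
        if cond1 and cond2:
            return r                            # R = r by Theorem 3.1
    return 0
```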

3.2 FDR of SLOPE for the general loss function

Corollary 3.3.

Consider the optimization problem (3.1) with an arbitrary sequence \(\lambda _{1}\geqslant \lambda _{2} \geqslant ... \geqslant \lambda _{p} \geqslant 0\). Assume that l(b) is a convex and differentiable function. Then for any a > 0, the FDR of SLOPE is equal to:

$$ FDR = \sum\limits_{r=1}^{p}\frac{1}{r}\sum\limits_{i \in S^{c}}P\left( |T_{i}(a)|>\lambda_{r},T(a) \in H_{r} \right). $$
(3.4)

Proof

Let us denote the support of the true parameter vector b0 by:

$$ S = Supp(b^{0}) $$
(3.5)

and a set that is the complement of S in {1, ... , p} by:

$$ S^{c} = \{1, ... , p\}\setminus S . $$
(3.6)

Directly from the definition we obtain:

$$ FDR = \mathbb{E}\left( \frac{V}{R \vee 1}\right) = \sum\limits_{r=1}^{p}\frac{1}{r}\sum\limits_{i \in S^{c}}P\left( R = r, \hat{b}_{i} \neq 0 \right) $$
(3.7)

and Corollary 3.3 is a direct consequence of Theorems 3.1 and 3.2. □

3.3 Proof of Theorem 2.1

We now focus on the multiple regression model (1.1). Elementary calculations show that in this case the vector U (see (3.2) for the definition) takes the form

$$ U(b) = X^{\prime}(Y-Xb) = X^{\prime}\epsilon + X^{\prime}X(b^{0}-b). $$

Let us denote for simplicity

$$ T = T(1) = \hat{b}+U(\hat{b}) = X^{\prime}\epsilon + b^{0} + (X^{\prime}X - \mathbb{I})(b^{0}-\hat{b}) $$

and introduce the following notation for the components of T:

$$ M = X^{\prime}\epsilon + b^{0} $$
(3.8)

and

$$ {\Gamma} = (\mathbb{I}-X^{\prime}X)(\hat{b}-b^{0}) . $$
(3.9)

Naturally, T = M + Γ. Due to (3.4), we can express the FDR for linear regression in the following way:

$$ FDR = \sum\limits_{r=1}^{p}\frac{1}{r}\sum\limits_{i \in S^{c}}P(T \in H_{r},\vert T_{i} \vert > \lambda_{r}) . $$
(3.10)
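For linear regression the quantities T, M and Γ are simple functions of the data and of the SLOPE fit; a small sketch (our illustration, with the noise vector available as in a simulation):

```python
import numpy as np

def decompose_T(X, Y, b0, b_hat):
    """T = T(1) = b_hat + U(b_hat), with the decomposition T = M + Gamma from (3.8)-(3.9)."""
    eps = Y - X @ b0                                     # noise vector (known in a simulation)
    T = b_hat + X.T @ (Y - X @ b_hat)                    # b_hat + U(b_hat)
    M = X.T @ eps + b0                                   # (3.8)
    Gamma = (np.eye(X.shape[1]) - X.T @ X) @ (b_hat - b0)  # (3.9)
    return T, M, Gamma                                   # T == M + Gamma up to numerical error
```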

A deeper analysis shows that under the assumptions of Theorem 2.1, the FDR expression (3.4) can be simplified. Corollary 3.5, stated below, follows directly from Lemma 4.4 in Su and Candès (2016) (see the supplementary materials) and shows that, with large probability, only the first k* elements of the summation over r are different from zero. Furthermore, the following Lemma 3.6 shows that the elements of the vector Γ are sufficiently small, so that we can focus on the properties of the vector M.

Definition 3.4 (Resolvent set, Su and Candès (2016)).

Fix S = supp(b0) of cardinality k, and an integer \(\tilde k^{*}\) obeying \(k < \tilde k^{*} < p\). The set \(\tilde S^{*} = \tilde S^{*}(S, \tilde k^{*})\) is said to be a resolvent set if it is the union of S and the \(\tilde k^{*}-k\) indices with the largest values of \(\vert X_{i}^{\prime } \epsilon \vert \) among all i ∈ {1, ..., p} ∖ S.

Let us introduce the following notation for the sequence of events on which the union of the supports of b0 and \(\hat b\) is contained in \(\tilde S^{*}\):

$$ Q_{1}(n,\tilde k^{*}) = \{Supp(b^{0})\cup Supp(\hat{b})\subseteq \tilde S^{*}\}. $$
(3.11)

Corollary 3.5.

Suppose the assumptions of Theorem 2.1 hold. Then there exists a deterministic sequence k* such that \(k^{*}/p \rightarrow 0\), \(((k^{*})^{2} \log p)/n \rightarrow 0\) and:

$$ P(Q_{1}(n, k^{*})) \rightarrow 1. $$
(3.12)

Corollary 3.5 follows directly from Lemma 4.4 in Su and Candès (2016) (see Lemma S.2.2 in the supplementary materials and the discussion below). From now on, k* will denote the sequence satisfying Corollary 3.5.

Lemma 3.6.

Let us denote by \(Q_{2}(n, \tilde {\gamma }(n))\) the sequence of events on which the \(l_{\infty }\) norm of the vector Γ is smaller than \(\tilde \gamma (n)\):

$$ Q_{2}(n,\tilde \gamma (n)) = \{\max_{i}\vert {\Gamma}_{i} \vert \leqslant \tilde\gamma(n) \} . $$
(3.13)

If the assumptions of Theorem 2.1 hold, then there exists a constant Cq, depending only on q, such that the sequence \(\gamma (n) = C_{q} \sqrt {\frac {(k^{*})^{2} \log p}{n}} \lambda ^{BH}_{k^{*}}\) satisfies:

$$ P\left( Q_{2}(n,\gamma(n)) \right) \rightarrow 1 . $$
(3.14)

The proof of Lemma 3.6 is provided in the Appendix.

Let us denote by Q3(n,u) the sequence of events on which the l2 norm of the vector \(\epsilon\), divided by \(\sigma \sqrt {n}\), is smaller than 1 + 1/u:

$$ Q_{3}(n,u) = \left\{ \frac{\Vert \epsilon \Vert_{2}}{\sigma\sqrt{n}} \leqslant 1+1/u \right\}. $$
(3.15)

The following Corollary 3.7 is a consequence of the well known results on the concentration of the Gaussian measure (see Theorem S.2.4 in the supplementary materials).

Corollary 3.7.

Let k* = k*(n) be the sequence satisfying Corollary 3.5. Then

$$ P\left( Q_{3}(n,k^{*}) \right)\geqslant 1 - e^{- \frac{n}{2(k^{*})^{2}}} \rightarrow 1. $$
(3.16)

From now on, for simplicity, we shall denote by Q1, Q2 and Q3 the sequences Q1(n,k*), Q2(n,γ) and Q3(n,k*), respectively. Moreover, let us introduce the following notation for the intersection of Q1, Q2 and Q3:

$$ Q = Q(n) = Q_{1}(n,k^{*}) \cap Q_{2}(n,\gamma) \cap Q_{3}(n,k^{*}). $$
(3.17)

By using the event Q, we can bound FDR in the following way:

$$ FDR = \mathbb{E}\left( \frac{V}{R \vee 1}\mathbb{1}_{Q^{c}}\right) + \mathbb{E}\left( \frac{V}{R \vee 1}\mathbb{1}_{Q}\right) \leqslant P(Q^{c}) + \mathbb{E}\left( \frac{V}{R \vee 1}\mathbb{1}_{Q}\right) = P(Q^{c}) + \sum\limits_{r=1}^{p}\frac{1}{r}\sum\limits_{i \in S^{c}}P\left( R=r, \hat{b}_{i} \neq 0, Q\right) . $$
(3.18)

The first equality follows from the fact that \(\mathbb{1}_{Q} + \mathbb{1}_{Q^{c}} = 1\). The inequality uses the fact that \(\frac {V}{R\vee 1} \leqslant 1\). The second equality is a consequence of the formula (3.7) applied to the second term. Naturally, due to conditions (3.12), (3.14) and (3.16), we obtain \(P(Q^{c})\rightarrow 0\). Therefore, we can focus on the properties of the second term in (3.18). Notice that Q1 implies that \( R\leqslant k^{*}\) (since \(Supp(\hat b) \subseteq \tilde S^{*}\)), therefore we can limit the summation over r to the first k* elements:

$$ \sum\limits_{r=1}^{p}\frac{1}{r}\sum\limits_{i \in S^{c}} P(R=r,\hat{b}_{i} \neq 0,Q)= \sum\limits_{r=1}^{k^{*}}\frac{1}{r}\sum\limits_{i \in S^{c}}P(R=r,\hat{b}_{i} \neq 0,Q) . $$
(3.19)

Furthermore, according to Theorems 3.1 and 3.2:

$$ \sum\limits_{r=1}^{k^{*}}\frac{1}{r}\sum\limits_{i \in S^{c}} P(R=r,\hat{b}_{i} \neq 0,Q)= \sum\limits_{r=1}^{k^{*}}\frac{1}{r}\sum\limits_{i \in S^{c}}P(T \in H_{r},\vert T_{i} \vert> \lambda_{r},Q) . $$
(3.20)

We now introduce some useful notation:

  • a vector \(M^{(i)} = (M^{(i)}_{1}, ... , M^{(i)}_{p})^{\prime }\) is the following modification of M

    $$ M^{(i)}_{j}:= \left\lbrace\begin{array}{l} M_{j}~\text{for}~ j\neq i \\ \infty~\text{for}~ j=i\end{array}\right. $$
  • a set \(H_{r}^{\gamma }\), which is a generalization of the set \(H_{r}\) (3.3),

    $$ H_{r}^{\gamma} = \left\{ w \in \mathbb{R}^{p}: \forall_{j \leqslant r} \sum\limits_{i=j}^{r} (\lambda_{i}- \gamma) < \sum\limits_{i=j}^{r} \vert w \vert_{(i)} ~and~ \forall_{j \geqslant r+1} \sum\limits_{i=r+1}^{j}(\lambda_{i}+ \gamma) \geqslant \sum\limits_{i=r+1}^{j} \vert w\vert_{(i)} \right\}. $$

Lemma 3.8 (see the Appendix for the proof) allows for the replacement of the event \(\{T \in H_{r}, \vert T_{i} \vert > \lambda_{r}, Q_{2}\}\) by an event which depends only on \(M^{(i)}\).

Lemma 3.8.

If \(T \in H_{r}\), \(\vert T_{i} \vert > \lambda_{r}\) and Q2 occurs, then \(M^{(i)} \in H_{r}^{\gamma }\).

Lemma 3.8, together with the fact that under Q2 the inequality \(\vert T_{i} \vert > \lambda_{r}\) implies \(\vert M_{i} \vert > \lambda_{r} - \gamma\), yields

$$ \sum\limits_{r=1}^{k^{*}}\frac{1}{r}\sum\limits_{i \in S^{c}}\! P(T \!\in\! H_{r},\vert T_{i} \vert\!>\! \lambda_{r},Q) \!\leqslant\! \sum\limits_{r=1}^{k^{*}}\frac{1}{r}\sum\limits_{i \in S^{c}}P\!\left( M^{(i)} \!\in\! H_{r}^{\gamma},\vert M_{i} \vert\!>\! \lambda_{r} - \gamma,Q_{3}\right)\! . $$
(3.21)

The following Lemma 3.9 (see the Appendix for the proof) provides the asymptotic behavior of the right-hand side of (3.21):

Lemma 3.9.

Under the assumptions of Theorem 2.1, it holds:

$$ \sum\limits_{r=1}^{k^{*}}\frac{1}{r}\sum\limits_{i \in S^{c}}P\left( M^{(i)} \in H_{r}^{\gamma},\vert M_{i} \vert> \lambda_{r}-\gamma,Q_{3}\right) \rightarrow 0. $$
(3.22)

The proof of Lemma 3.9 is based on several properties. First,

$$ P\left( M^{(i)} \in H_{r}^{\gamma},\vert M_{i} \vert> \lambda_{r}-\gamma,Q_{3}\right) $$

can be well approximated by

$$ P\left( M^{(i)} \in H_{r}^{\gamma},Q_{3}\right) P\left( \vert M_{i} \vert> \lambda_{r}-\gamma,Q_{3}\right) $$

This approximation is a consequence of the fact that, conditionally on \(\epsilon\), \(M^{(i)}\) and \(M_{i}\) are independent. Second, for \(i \in S^{c}\):

$$ P\left( \vert M_i \vert> \lambda_r-\gamma,Q_3\right) $$

can be well approximated by

$$ 2(1-{\Phi}((1+\delta)\lambda_r^{BH})) \approx \left( \frac{rq}{p} \right)^{(1+\delta)^2} , $$

where Φ(⋅) is the cumulative distribution function of the standard normal distribution. The first approximation is a consequence of the fact that for \(i \in S^{c}\), \(M_{i} = X_{i}^{\prime}\epsilon\). Thus, conditionally on \(\epsilon\), \(M_{i}\) has a normal distribution with mean equal to 0 and variance equal to \(\frac {||\epsilon ||^2}{n}\), which is close to \(\sigma^{2}\) (see Corollary 3.7). The second approximation relies on the well known formula

$$ 1-{\Phi}(c)=\frac{\phi(c)}{c}(1+o(c)) , $$
(3.23)

where ϕ(⋅) is the density of the standard normal distribution and o(c) converges to zero as c diverges to infinity.
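To see where the exponent \((1+\delta)^{2}\) comes from, write \(c_{r} = {\Phi}^{-1}(1 - rq/(2p)) = \lambda_{r}^{BH}/\sigma\), so that \(1-{\Phi}(c_{r}) = rq/(2p)\) and, by (3.23), \(c_{r}^{2} \approx 2\log (2p/(rq))\). Then, in a rough sketch that ignores constants and lower-order terms,

$$ 1-{\Phi}\left((1+\delta)c_{r}\right) \approx \frac{\phi\left((1+\delta)c_{r}\right)}{(1+\delta)c_{r}} = \frac{\phi(c_{r})}{(1+\delta)c_{r}} e^{-\left((1+\delta)^{2}-1\right)c_{r}^{2}/2} \approx \frac{1}{1+\delta}\left(\frac{rq}{2p}\right)^{(1+\delta)^{2}} , $$

which agrees with \(\left( \frac{rq}{p} \right)^{(1+\delta)^{2}}\) up to a multiplicative constant.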

Lastly

$$ \sum\limits_{r=1}^{k^{*}}P\left( M^{(i)} \in H_r^{\gamma}\right) $$

is approximately equal to

$$ \sum\limits_{r=1}^{k^{*}}P\left( M^{(i)} \in H_r\right) \leqslant 1. $$

This approximation is a consequence of the fact that \(H_{r}^{\gamma}\) does not differ much from \(H_{r}\) and that the sets \(H_{r}\), r ∈ {1, ..., p}, are disjoint. By applying the above approximations we obtain:

$$ \sum\limits_{r=1}^{k^{*}}\frac{1}{r}\sum\limits_{i \in S^{c}}P\left( M^{(i)} \in H_r^{\gamma},\vert M_i \vert> \lambda_r-\gamma,Q_3\right) \approx \sum\limits_{i \in S^{c}}\sum\limits_{r=1}^{k^{*}}\frac{1}{r} \left( \frac{rq}{p} \right)^{(1+\delta)^2}P\left( M^{(i)} \in H_r^{\gamma}\right) \leqslant $$
$$ \left( \frac{q}{p} \right)\left( \frac{k^{*}q}{p} \right)^{(1+\delta)^2-1}\underbrace{\sum\limits_{i \in S^{c}}\sum\limits_{r=1}^{k^{*}} P\left( M^{(i)} \in H_r^{\gamma}\right)}_{\approx p-k} \approx \left( \frac{(p-k)q}{p} \right)\left( \frac{k^{*}q}{p} \right)^{(1+\delta)^2-1} \rightarrow 0 $$

Lemma 3.9, together with the fact that under the assumptions of Theorem 2.1 \(P(Q^c)\rightarrow 0\), yields \(FDR \rightarrow 0\). One can notice that \(\left (\frac {k^{*}q}{p} \right )^{(1+\delta )^2-1}\) is the factor responsible for the convergence of FDR to 0. It remains an open question whether, in the definition of the λ sequence (2.7), the constant δ can be replaced by a sequence converging to 0 at such a rate that the asymptotic FDR is exactly equal to the nominal level q. The proof of this assertion would require a refinement of the bounds provided in Su and Candès (2016) and we leave it as a topic for future research.

Now we will argue that under our assumptions the Power of SLOPE converges to 1. Recall that \(TR = \#\{j: b_j^0 \neq 0 ~\text {and}~ \hat b_j \neq 0\}\) denotes the number of true rejections. Observe that

$$ {\Pi} = \frac{1}{k}\mathbb{E}\left( TR \right)= \frac{1}{k} \sum\limits_{j = 1}^k jP(TR=j) \geqslant P(TR = k) = P\left( \bigcap_{i \in S}\{\hat b_i \neq 0 \}\right). $$

Naturally \({\Pi } \leqslant 1\), therefore by showing that \(P\left (\bigcap _{i \in S}\{\hat b_i \neq 0 \}\right ) \rightarrow 1\) we obtain the claim.

Lemma 3.10.

Under the assumptions of Theorem 2.1, it holds:

$$ P\left( \bigcap_{i \in S}\{\hat b_i \neq 0 \}\right) \rightarrow 1. $$
(3.24)

Proof

The proof of Lemma 3.10 is provided in the Appendix. It is based on the following sequence of inequalities:

$$ P\left( \bigcap_{i \in S}\{\hat b_i \neq 0 \}\right)\geq P\left( \bigcap_{i \in S}\{|T_i|>\lambda_1\}\right)\geq P\left( \bigcap_{i \in S}\{|M_i|>\lambda_1+\gamma \}\right) $$
$$ \geq P\left( \bigcap_{i \in S}\{ X_i^{\prime} \epsilon> \lambda_1+\gamma-|b_i^0| \}\right)\geq P\left( \bigcap_{i \in S}\left\{ X_i^{\prime} \epsilon> -\sigma\left( 1+\frac{\delta}{2}\right)\sqrt{2\log p}\right\}\right) . $$

The result follows from the conditional independence of the \(X_i^{\prime }\epsilon \) (given \(\epsilon\)) and the classical approximation (3.23). □

4 Discussion

In this article we provide new asymptotic results on the model selection properties of SLOPE in the case when the elements of the design matrix X come from the normal distribution. Specifically, we provide conditions on the sparsity and the magnitude of true signals such that the FDR of SLOPE based on the sequence \(\lambda^{BH}\), corresponding to the thresholds of the Benjamini-Hochberg correction for multiple testing, converges to 0 and the Power converges to 1. We believe these results can be extended to sub-Gaussian design matrices with independent columns, which is the topic of ongoing research. Additionally, the general results on the support of SLOPE open the way for an investigation of the properties of SLOPE under arbitrary convex and differentiable loss functions.

In simulations we compared SLOPE based on the sequence \(\lambda^{BH}\) with SLOPE based on the heuristic sequence proposed in Bogdan et al. (2015) and with LASSO with the tuning parameter λ set to the first value of the tuning sequence for SLOPE. When the regressors are independent and the vector of true regression coefficients is relatively sparse, the comparison between SLOPE and LASSO bears similarity to the comparison between the Bonferroni and the Benjamini-Hochberg corrections for multiple testing. When k = 0 both methods perform similarly and control FDR (which for k = 0 is equal to the Family Wise Error Rate) close to the nominal level. When k increases, the FDR of LASSO converges to zero, which however comes at the price of a substantial loss of Power for moderately strong signals. Concerning the two versions of SLOPE, our simulations suggest that the heuristic sequence allows for more accurate FDR control over a wide range of sparsity values. We believe the techniques developed in this article form a good foundation for the analysis of the statistical properties of the heuristic version of SLOPE, which we consider an interesting topic for further research.

Our assumptions on the independence of predictors and the sparsity of the vector of true regression coefficients are restrictive, which is related to the well known problems with FDR control by LASSO (Bogdan et al. 2013; Su et al. 2017). In the case of LASSO, these problems can be solved by using adaptive or reweighted LASSO (Zou, 2006; Candès et al. 2008), which allow for consistent model selection under much weaker assumptions. In these modifications the values of the tuning parameters corresponding to predictors which are deemed important (based on the values of initial estimates) are reduced, which substantially reduces the bias due to shrinkage and improves model selection accuracy. Recently, Jiang et al. (2019) developed the Adaptive Bayesian version of SLOPE (ABSLOPE). According to the results of simulations presented in Jiang et al. (2019), ABSLOPE controls FDR under a much wider set of scenarios than the regular SLOPE, including examples with strongly correlated predictors. We believe our proof techniques can be extended to derive asymptotic FDR control for ABSLOPE, which we leave as an interesting topic for future research.