Definition of average power
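To develop optimal weighting strategies, it is useful to generalize the concept of power to the multiple testing setting by considering the average power of the $m_2$ tests in which the alternative hypothesis is true. Assume that $T_j \sim N(\xi_j, 1)$. If $H_j$ is a false null hypothesis for a one-sided test, then $\xi_j > 0$. For simplicity, following the presentation in Roeder and Wasserman, only one-sided tests are considered in our review, although similar developments apply for two-sided tests. Let $\Phi(x)$ denote the standard normal cumulative distribution function, so that the one-sided p-value is $P_j = 1 - \Phi(T_j)$. Under the weighted Bonferroni procedure, which rejects $H_j$ when $P_j \le \alpha W_j / m$, the power for a single test can be expressed as

$$\mathrm{power}(\xi_j, W_j) = \Pr\left(P_j \le \frac{\alpha W_j}{m}\right) = \Pr\left(T_j \ge \Phi^{-1}\left(1 - \frac{\alpha W_j}{m}\right)\right). \tag{1}$$

Equation (1) can be further simplified as

$$\mathrm{power}(\xi_j, W_j) = 1 - \Phi\left(\Phi^{-1}\left(1 - \frac{\alpha W_j}{m}\right) - \xi_j\right). \tag{2}$$

The average power is then defined as

$$\frac{1}{m_2} \sum_{j:\, H_j \text{ false}} \mathrm{power}(\xi_j, W_j). \tag{3}$$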
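The single-test power and the average power under the weighted Bonferroni procedure can be sketched in Python as follows. This is a minimal illustration, assuming one-sided z-tests with unit variance and rejection when the p-value is at most αW_j/m. The function names `single_test_power` and `average_power` are hypothetical, and a test is treated as a false null whenever its mean is positive.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def single_test_power(xi, weight, alpha, m):
    """Power of a one-sided test with mean xi under weighted Bonferroni.

    The test rejects when its p-value is at most alpha * weight / m.
    """
    level = alpha * weight / m
    if level <= 0.0:
        return 0.0  # a zero weight means the test can never reject
    if level >= 1.0:
        return 1.0  # a level of 1 or more always rejects
    # Critical value Phi^{-1}(1 - level), written as -Phi^{-1}(level)
    # because that form is more accurate for small levels.
    critical_value = -STD_NORMAL.inv_cdf(level)
    # 1 - Phi(critical_value - xi) equals Phi(xi - critical_value).
    return STD_NORMAL.cdf(xi - critical_value)


def average_power(xis, weights, alpha):
    """Average power over the tests with a positive mean (false nulls)."""
    m = len(xis)
    powers = [
        single_test_power(xi, w, alpha, m)
        for xi, w in zip(xis, weights)
        if xi > 0
    ]
    return sum(powers) / len(powers) if powers else 0.0
```

For example, `average_power([0.0, 0.0, 3.0, 3.0], [1.0] * 4, 0.05)` gives the average power of the unweighted Bonferroni procedure for two signals of size 3 among four tests.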
In the following sections, we review methods for finding weights that maximize average power in three relevant problem settings, first for Bonferroni control of FWER, then for FDR, and finally for grouped FDR.
Problem setting I: FWER control
Using the weighted Bonferroni procedure to control the FWER at level α, what is the $W = (W_1, W_2, \cdots, W_m)$ that will maximize the average power?
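Roeder and Wasserman showed that the optimal “oracle” weight can be obtained by setting the derivatives of (3) to zero and solving the equations subject to $W_j > 0$ and $\sum_{j=1}^{m} W_j = m$. This leads to the following solution in terms of the unknown test means $\xi_j$:

$$W_j = \frac{m}{\alpha}\left[1 - \Phi\left(\frac{\xi_j}{2} + \frac{c}{\xi_j}\right)\right], \tag{4}$$

where $c$ is a constant so that

$$\sum_{j=1}^{m} W_j = m. \tag{5}$$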
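Because the constant c has no closed form, the oracle weights are found numerically in practice. The sketch below is one possible way to do this, assuming the weight form W_j = (m/α)[1 − Φ(ξ_j/2 + c/ξ_j)] with the weights summing to m. It solves for c by bisection, using the fact that each weight decreases as c increases. The function name `oracle_weights` is hypothetical, and assigning weight 0 to tests with ξ_j ≤ 0 is an assumption of this sketch (such tests contribute nothing to average power).

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def upper_tail(x):
    """1 - Phi(x), computed as Phi(-x) for accuracy in the tail."""
    return STD_NORMAL.cdf(-x)


def oracle_weights(xis, alpha, tol=1e-10, max_iter=200):
    """Weights (m/alpha) * upper_tail(xi/2 + c/xi), with c chosen so they sum to m."""
    m = len(xis)
    if not any(xi > 0 for xi in xis):
        return [1.0] * m  # no signal: fall back to the unweighted procedure

    def weights_for(c):
        # Assumption of this sketch: tests with xi <= 0 get weight 0.
        return [
            (m / alpha) * upper_tail(xi / 2 + c / xi) if xi > 0 else 0.0
            for xi in xis
        ]

    def excess(c):
        return sum(weights_for(c)) - m  # decreasing in c

    # Expand a bracket [lo, hi] with excess(lo) >= 0 >= excess(hi).
    lo, hi = -1.0, 1.0
    while excess(lo) < 0:
        lo *= 2
    while excess(hi) > 0:
        hi *= 2

    # Bisection on c.
    for _ in range(max_iter):
        mid = (lo + hi) / 2
        if excess(mid) > 0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return weights_for((lo + hi) / 2)
```

When all signals share the same mean, as in `oracle_weights([3.0, 3.0], 0.05)`, the solution is the unweighted one with every weight equal to 1.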
Although the $\xi_j$ are unknown, available data can be used to generate preliminary estimates. In the absence of such prior data, it has been proposed to use a data-splitting approach to provide such an estimate. If the data are independent and identically distributed, then one can randomly split the data into two parts and use the first part as a training set to estimate $\xi_j$ and the corresponding optimal weights. These are then applied to the testing set.
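The data-splitting idea can be sketched as follows. This is an illustrative setup rather than the authors' implementation: it assumes each test j has n independent unit-variance observations with mean μ_j, that the test statistic is √n times the sample mean, and that the same observation indices are split for every test. The function name `split_estimates` and its parameters are hypothetical.

```python
import math
import random


def split_estimates(samples, train_fraction=0.5, seed=0):
    """Split observations into training and testing parts.

    samples: list with one list of n observations per test.
    Returns (xi_hat, t_test): the training-based estimates of the
    test-set means and the test statistics from the testing part.
    """
    n = len(samples[0])
    n_train = int(n * train_fraction)
    n_test = n - n_train
    if n_train < 1 or n_test < 1:
        raise ValueError("both parts of the split must be non-empty")

    # Randomly assign observation indices to the two parts.
    order = list(range(n))
    random.Random(seed).shuffle(order)
    train_idx, test_idx = order[:n_train], order[n_train:]

    xi_hat, t_test = [], []
    for obs in samples:
        train_mean = sum(obs[i] for i in train_idx) / n_train
        test_mean = sum(obs[i] for i in test_idx) / n_test
        # The testing statistic has mean sqrt(n_test) * mu_j, so the
        # training mean is rescaled to estimate that quantity.
        xi_hat.append(math.sqrt(n_test) * train_mean)
        t_test.append(math.sqrt(n_test) * test_mean)
    return xi_hat, t_test
```

The estimates `xi_hat` would then be used to compute weights, and the weighted procedure would be applied only to `t_test`, which is why the effective sample size for testing shrinks.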
In a follow-up paper, Roeder et al. applied data-splitting weights in a genome association study. They pointed out that the power gain from the weighted procedure cannot compensate for the power loss that results from splitting the data and using only a fraction of all samples as the test set. Instead, they propose to form k groups of tests, with sizes of perhaps 10–20, that are likely to have the same mean test statistics. Assuming that this grouping is only approximately well informed, the distribution of the test statistics in each group can be assumed to follow a normal mixture distribution based on the proportion of true and false null hypotheses. They suggest moment estimators for the common non-zero mean of the test statistics in each group and for the proportion of true null hypotheses, $\pi_k$, and use these to develop the weights using Equations (4) and (5). If , where $r_k$ denotes the number of tests in the kth group, then . A smoothing procedure is proposed to account for excessive variability. They are able to show that this procedure controls FWER at level α. Software to implement this procedure can be found at http://wpicr.wpic.pitt.edu/WPICCompGen/.
To further demonstrate the merit of the proposed procedure, Roeder and colleagues showed in a simulation study that the grouped weighting procedure gains power when multiple tests with signals are clustered together in one or more groups. When the grouping is poorly chosen and many groups contain no true signal, the weights may not improve power, although in practice little power is lost under such circumstances.
Problem setting II: FDR control
Using the weighted BH (wBH) procedure and controlling FDR at level α, what is the W that will maximize the average power?
Identifying optimal weights under FDR control is more difficult than in the FWER setting because FDR has a random variable (the number of rejections) in the denominator. Roquain and van de Wiel proposed an indirect approach to tackle this problem. They first fix the rejection region, then perform the optimization for each fixed rejection region ($\Delta_j$ := j tests have been rejected), which in turn leads to a family of optimal weight vectors $(W_i(j),\ i = 1, \ldots, m)$, one for each j.
Roquain and van de Wiel  give the following multi-weighted algorithm:
Step 1: Compute for each i the weight $W_i(m)$. If all p-values $P_i$ are less than or equal to $\alpha W_i(m)$, then reject all hypotheses. Otherwise go to step 2.
Step j (j ≥ 2): Set r = m − j + 1 and compute for each i the weight $W_i(r)$ and the weighted p-value $Q_i = P_i / W_i(r)$. Order the weighted p-values following $Q_{(1)} \le Q_{(2)} \le \cdots \le Q_{(m)}$. If $Q_{(r)} \le \alpha r / m$, then reject the null hypotheses corresponding to the r smallest weighted p-values $Q_{(1)}, \ldots, Q_{(r)}$. Otherwise go to step j + 1. If the condition still fails at step j = m, stop and reject none of the null hypotheses.
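The step-up search over rejection counts can be sketched in Python as below. This is a minimal illustration of the uncorrected procedure, assuming the weighted p-value is the p-value divided by its weight and the cutoff for r rejections is αr/m; it applies no FDR correction to the weights. The names `multi_weighted_step_up` and `weight_fn` are hypothetical; `weight_fn(r)` is assumed to return the m weights used when r hypotheses are rejected.

```python
def multi_weighted_step_up(p_values, weight_fn, alpha):
    """Return the sorted indices of the rejected hypotheses.

    weight_fn(r) must return a list of m non-negative weights for
    the rejection count r (1 <= r <= m).
    """
    m = len(p_values)
    for r in range(m, 0, -1):  # r = m, m - 1, ..., 1
        weights = weight_fn(r)
        # A zero weight makes the weighted p-value infinite, so that
        # hypothesis cannot be rejected at this r.
        weighted = [
            p / w if w > 0 else float("inf")
            for p, w in zip(p_values, weights)
        ]
        order = sorted(range(m), key=lambda i: weighted[i])
        # Reject the r smallest if the r-th smallest meets the cutoff.
        if weighted[order[r - 1]] <= alpha * r / m:
            return sorted(order[:r])
    return []  # the condition failed for every r


# Usage: with all weights equal to 1 the search is the standard BH procedure.
unit_weights = lambda r: [1.0, 1.0, 1.0, 1.0]
rejected = multi_weighted_step_up([0.001, 0.02, 0.03, 0.9], unit_weights, 0.05)
```

Passing a `weight_fn` that ignores r gives a single-weight-vector procedure, while a function that varies with r uses a different weight vector for each candidate rejection region.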
Note that if we set all weights to be 1, this procedure reduces to the standard BH procedure. With a single weight vector used for every rejection region, it reduces to the wBH procedure. The unique feature of the multi-weighted linear step-up procedure is that it introduces several weight vectors corresponding to different rejection regions. This yields more flexibility than the wBH procedure in terms of boosting power. However, since this algorithm involves multiple weight vectors under different rejection regions, it does not rigorously control FDR for every pre-determined weight matrix W. Therefore the following adjustment was suggested to control FDR.
Let and replace $W_i(r)$ with in the above step-up procedure to control FDR at level α under the assumption that p-values are independent. Since $W_i(m) \le m$ and α is usually small, Roquain and van de Wiel argue that $W_i(r)$ and are close to each other and the small corrections can be ignored.
Under this multi-weighting framework, one can freely choose the weights for any given rejection region. Since the FDR procedure’s cutoff with r rejections is αr/m, the power can be defined as in (2) and (3), simply replacing α/m with αr/m. The same logic applies to Equation (4). Therefore, the optimal weight for a fixed rejection region r is:

$$W_i(r) = \frac{m}{\alpha r}\left[1 - \Phi\left(\frac{\xi_i}{2} + \frac{c(r)}{\xi_i}\right)\right],$$

where $c(r)$ is a constant that satisfies:

$$\sum_{i=1}^{m} W_i(r) = m.$$
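The weight vector for a fixed rejection count r can be computed numerically in the same spirit. The sketch below assumes the form W_i(r) = (m/(αr))[1 − Φ(ξ_i/2 + c(r)/ξ_i)] with the weights summing to m, and finds c(r) by bisection. The function name `optimal_weights_for_r` is hypothetical. Two choices here are assumptions of this sketch: tests with ξ_i ≤ 0 receive weight 0, and when αr is at least the number of signals the weights are split equally among the signals, since every signal then already has a cutoff of 1 or more.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def optimal_weights_for_r(xis, alpha, r, tol=1e-10, max_iter=200):
    """Weights for a fixed rejection count r, constrained to sum to m."""
    m = len(xis)
    n_signal = sum(1 for xi in xis if xi > 0)
    if n_signal == 0:
        return [1.0] * m  # no signal: unweighted
    if n_signal <= alpha * r:
        # Degenerate case (assumption of this sketch): an equal split
        # already gives every signal a cutoff of at least 1.
        return [m / n_signal if xi > 0 else 0.0 for xi in xis]

    scale = m / (alpha * r)

    def weights_for(c):
        # Phi(-x) is used for the upper tail 1 - Phi(x).
        return [
            scale * STD_NORMAL.cdf(-(xi / 2 + c / xi)) if xi > 0 else 0.0
            for xi in xis
        ]

    def excess(c):
        return sum(weights_for(c)) - m  # decreasing in c

    # Expand a bracket [lo, hi] with excess(lo) >= 0 >= excess(hi).
    lo, hi = -1.0, 1.0
    while excess(lo) < 0:
        lo *= 2
    while excess(hi) > 0:
        hi *= 2

    for _ in range(max_iter):
        mid = (lo + hi) / 2
        if excess(mid) > 0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return weights_for((lo + hi) / 2)
```

Calling this function for r = 1, …, m yields one weight vector per rejection region, which is the kind of weight family the multi-weighted step-up procedure takes as input.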
Roquain and van de Wiel’s idea of fixing the rejection region and offering an algorithm to control FDR at the nominal level is a novel approach for overcoming the challenge that FDR involves the number of rejections, a random quantity. By up-weighting the smaller means when the rejection region is large and the larger means when the rejection region is small, this is a powerful procedure for maximizing the chance of rejection. The method can be particularly useful when prior information is present. Yet we note two concerns about the power gained from the multi-weighting scheme. First, the step-up algorithm ignores the constraint (7), and FDR can be inflated for certain W and m. Especially in genomic studies, where m is large, the chance increases that some corrected weights may be much smaller than the uncorrected ones, so ignoring the correction may cause FDR to rise above the nominal level. Second, in practice we usually cannot guess or estimate the non-centrality parameter $\xi_i$ for false null hypotheses. Without relevant prior information, we can only use the data-splitting approach described in Problem Setting I, and the resulting loss of sample size reduces the power. As suggested by Roeder and Wasserman, using a data-splitting approach with a weighted Bonferroni procedure may have less power than running the unweighted Bonferroni correction on the whole dataset. Therefore, we believe there is still room for improvement over the step-up procedure to address the above concerns.
Problem setting III: grouped FDR control
Using the wBH procedure and controlling FDR at level α, what is the k-valued set of weights $W = (W_1, \ldots, W_k)$ that will maximize the average power? Here, without loss of generality, we assume .
This problem is motivated by Stratified False Discovery Rate (SFDR) control. Sun et al. propose this method in the context of genetic studies when there is a natural stratification of the m hypotheses to be tested. For example, in a genetic study of the long-term complications of type I diabetes, researchers plan to screen about 1500 SNPs in candidate genes and identify SNPs that are associated with at least one out of five phenotypes of interest. A total of 7500 tests will be carried out simultaneously, and a natural stratification exists for these tests. SFDR is therefore appropriate for this type of data.
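SFDR controls FDR in each stratum. Let $\alpha_j$ denote the FDR in the jth stratum. To investigate the relationship between $\alpha_j$ and the overall FDR α, based on the work of Storey, it can be shown that $\mathrm{FDR} \approx E(V)/E(R)$ when tests are independent, where V and R denote the overall numbers of false rejections and total rejections. Then

$$\alpha \approx \frac{E(V)}{E(R)} = \sum_j \frac{E(R^{(j)})}{E(R)} \cdot \frac{E(V^{(j)})}{E(R^{(j)})} \approx \sum_j \frac{E(R^{(j)})}{E(R)}\, \alpha_j,$$

where $V^{(j)}$ and $R^{(j)}$ denote the number of false rejections and total rejections in the jth stratum, and $V = \sum_j V^{(j)}$, $R = \sum_j R^{(j)}$. Since $\sum_j E(R^{(j)})/E(R) = 1$, the overall FDR is a weighted average of the stratum FDRs, so it is easy to show that when FDR in each stratum is controlled at level α, the overall FDR is controlled at α. The SFDR procedure can be implemented in the software package R using the function p.adjust.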
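The source points to R's p.adjust for implementation; the same idea can be sketched in Python as a BH step-up procedure applied separately within each stratum. This is a minimal illustration, not the authors' software, and the function names `bh_reject` and `stratified_bh` are hypothetical.

```python
def bh_reject(p_values, alpha):
    """Standard BH step-up: return sorted indices of rejected hypotheses."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    n_reject = 0
    for rank, i in enumerate(order, start=1):
        # Keep the largest rank whose ordered p-value meets its cutoff.
        if p_values[i] <= alpha * rank / m:
            n_reject = rank
    return sorted(order[:n_reject])


def stratified_bh(p_values, strata, alpha):
    """Apply BH at level alpha within each stratum and pool the rejections.

    strata[i] is the stratum label of hypothesis i.
    """
    groups = {}
    for i, label in enumerate(strata):
        groups.setdefault(label, []).append(i)

    rejected = []
    for members in groups.values():
        local = bh_reject([p_values[i] for i in members], alpha)
        # Map positions within the stratum back to the original indices.
        rejected.extend(members[k] for k in local)
    return sorted(rejected)


# Usage: a small stratum rich in signal next to a larger, sparse one.
p = [0.01, 0.02, 0.5, 0.005, 0.6, 0.7, 0.8, 0.9]
labels = ["a", "a", "a", "b", "b", "b", "b", "b"]
stratified = stratified_bh(p, labels, 0.05)
pooled = bh_reject(p, 0.05)
```

In this toy input the stratified analysis rejects one more hypothesis than the pooled BH analysis, because the signal-rich stratum gets a less stringent cutoff when it is analyzed on its own.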
To demonstrate the merit of SFDR, Sun et al. describe a simulated genome-wide association study with 105,000 SNPs, among which 5,000 SNPs are from candidate genes and 100,000 SNPs are included to systematically scan the genome for novel associations. The number of associated genes in each stratum is assumed to be 100, and the power to detect a single true association is assumed to be 0.7 with a Type I error of 0.001. If the FDR threshold is set to 0.1, then SFDR is expected to identify 111 true associations as compared to 88 via traditional FDR. This simulation indicates that SFDR can take advantage of an imbalanced distribution of true signal across strata.
SFDR is a special case of problem setting III. SFDR controls the FDR in each stratum at level α, while the weighted FDR only requires that the overall FDR be controlled at level α. Because the weaker requirement leaves more degrees of freedom, the optimal weights derived from problem setting III will have at least as much power as SFDR.
Problem setting III is still an open problem. We have not found any literature directly addressing it. Given the indirect solution in problem setting II, the optimal weight for this setting is not hard to estimate. The major difference between setting II and setting III is that setting III reduces the variance among the weights, because all tests in a group share one weight. It is not surprising that the maximum achievable power in setting III is less than in setting II, but setting III has at least two advantages over setting II. First, the weight estimate in setting III is more robust: estimating one non-centrality parameter per group, rather than one for each test, reduces the dimension of the parameter space and leads to more robust estimates. Second, it is possible to use all samples to estimate the unknown parameters rather than using a data-splitting approach, which loses power because of the smaller sample size.