Abstract
Background
Stability of multiple testing procedures, defined as the standard deviation of total number of discoveries, can be used as an indicator of variability of multiple testing procedures. Improving stability of multiple testing procedures can help to increase the consistency of findings from replicated experiments. BenjaminiHochberg’s and Storey’s qvalue procedures are two commonly used multiple testing procedures for controlling false discoveries in genomic studies. Storey’s qvalue procedure has higher power and lower stability than BenjaminiHochberg’s procedure. To improve upon the stability of Storey’s qvalue procedure and maintain its high power in genomic data analysis, we propose a new multiple testing procedure, named BonEV, to control false discovery rate (FDR) based on Bonferroni’s approach.
Results
Simulation studies show that our proposed BonEV procedure can maintain the high power of the Storey’s qvalue procedure and also result in better FDR control and higher stability than Storey’s qvalue procedure for samples of large size(30 in each group) and medium size (15 in each group) for either independent, somewhat correlated, or highly correlated test statistics. When sample size is small (5 in each group), our proposed BonEV procedure has performance between the BenjaminiHochberg procedure and the Storey’s qvalue procedure. Examples using RNASeq data show that the BonEV procedure has higher stability than the Storey’s qvalue procedure while maintaining equivalent power, and higher power than the BenjaminiHochberg’s procedure.
Conclusions
For medium or large sample sizes, the BonEV procedure has improved FDR control and stability compared with the Storey’s qvalue procedure and improved power compared with the BenjaminiHochberg procedure. The BonEV multiple testing procedure is available as the BonEV package in R for download at https://CRAN.Rproject.org/package=BonEV.
Similar content being viewed by others
Background
Microarray and nextgeneration sequencing (NGS) technologies have been widely used in biological and biomedical research to identify novel biomarkers and genomic modifications related to biological processes and diseases. Multiple testing procedures are widely used in microarray and NGS studies to control the multiple testing error rate to minimize false discoveries from the enormous number of simultaneous hypothesis tests [1]. Many multiple testing error rates have been proposed such as familywise error rate (FWER) [2], k familywise error rate (kFWER) [3], and false discovery rate (FDR) [4]. For discovery purposes, the false discovery rate (FDR), defined as the expected proportion of false discoveries among total number of discoveries, is often controlled in multiple testing procedures to select significant features in microarray and NGS studies [4–6]. Benjamini and Hochberg’s FDR controlling procedure [4] and Storey’s qvalue procedure [7, 8] are the most commonly used procedures [9]. The Bonferroni procedure, although perceived as a conservative procedure for multiple testing error rate control, has stability superior to BenjaminiHochberg’s procedure in terms of variability of total number of discoveries, and equivalent power to BenjaminiHochberg’s procedure, when used to control the expected number of false discoveries (EV, where V stands for the number of false discoveries) at a userspecified level [10].
In this study, we examine the stability (defined as standard deviation of the total number of rejected hypotheses) of both BenjaminiHochberg’s FDR controlling procedure and Storey’s qvalue procedure for generating adjusted pvalues to select significant genes or biomarkers in microarray and NGS data analysis. In addition, we propose our own multiple testing procedure (named BonEV) based on Bonferroni’s EV controlling procedure, that has equivalent power, higher stability, and better FDR control than the Storey’s qvalue procedure with at least mediumsized samples in microarray and NGS data analysis. Multiple testing procedures with high power, good FDR control, and high stability are desirable in genomic data analysis due to the high cost of sequencing in genomic studies. The BonEV multiple testing procedure will be attractive to genomic data analysts as it not only maintains the high power of Storey’s qvalue procedure, but also offers better FDR control and higher stability, especially for small to medium sample size studies that need high stability, high power and good FDR control to maximize the odds of true discoveries.
Methods
Suppose we are testing m null hypotheses simultaneously in a highdimensional data analysis for single nucleotide polymorphism (SNP) identification, differential gene expression, or DNA methylation discovery. Among m null hypotheses, m _{0} null hypotheses are true null hypotheses. Among R rejected null hypotheses, V hypotheses are false rejections (false discoveries). Multiple testing error rates need to be controlled to minimize false discoveries among total rejections. False discovery rate (FDR) is a commonly used multiple testing error rate in genomic analysis. Several definitions of FDR have been proposed to measure the false discovery rate such as FDR, positive false discovery rate (pFDR), and \(\frac {E(V)}{E(R)}\). The FDR and pFDR are defined as:
BenjaminiHochberg procedure
The BenjaminiHochberg procedure [4] provides control of FDR at level α through the following stepup procedure:

Order original pvalues p _{ i },i=1,…,m, from the smallest to the largest such that p _{(1)}≤p _{(2)}·≤p _{(m)};

Find k as the largest i for which \(P_{(i)} \le \frac {i}{m}\alpha \);

Reject all null hypotheses H _{ i },i=1,2,…,k.
The BenjaminiHochberg procedure provides \(FDR = \frac {m_{0}}{m}\alpha \le \alpha \), a strong control for FDR at level α for independent and positively correlated test statistics. Meanwhile, the BenjaminiHochberg procedure is also conservative by a factor of \(\frac {m_{0}}{m}\) for controlling FDR at level α.
Storey’s qvalue procedure
Arguing that where m _{0}=m, one would not be interested in cases where no test is significant (F D R=1 in this situation), Storey [7] proposes the definition of positive false discovery rate (pFDR) that is conditional on at least one rejection. The Storey’s qvalue procedure used for controlling pFDR improves power over the BenjaminiHochberg procedure by including the estimation of \(\pi _{0} = \frac {m_{0}}{m}\). Storey’s qvalue procedure proceeds as follows:

Order original pvalues p _{ i },i=1,…,m, from the smallest to the largest such that p _{(1)}≤p _{(2)}·≤p _{(m)};

Find k as the largest i for which \(P_{(i)} \le \frac {i}{m \hat {\pi _{0}}}\alpha \) where \(\pi _{0} = \frac {m_{0}}{m}\);

Reject all null hypothesis H _{ i },i=1,2,…,k.
Storey proposes to estimate π _{0} conservatively by \(\hat {\pi _{0}} = \frac {\sharp \{p_{i} > \lambda \}}{(1\lambda)m}\), where λ is chosen to minimize the meansquared error of the pFDR estimates.
Bonferroni procedure
The Bonferroni procedure has traditionally been considered too conservative for genomic data analysis for discovery purposes. Gordon et al. [10] show the Bonferroni procedure has comparable power and superior stability to the BenjaminiHochberg procedure when used to control the expected number of false discoveries (E(V)). The Bonferroni procedure rejects H _{ i } if \(p_{i} \le \frac {\gamma }{m}\), and controls E(V) at a prespecified number of tests γ when test statistics are either independent or correlated. To prove that the Bonferroni procedure controls E(V) at level γ, we assume p _{ i } for i=1,2,…,m _{0} has an independent uniform distribution, then we have
If we let γ=α·E(R), then we have E(V)≤α·E(R) and \(\frac {E(V)}{E(R)} \le \alpha \). Notice that the Bonferroni procedure used to control E(V) is conservative by a factor of \(\frac {m_{0}}{m}\). Thus, we further improve power of the Bonferroni procedure by estimating m _{0} and replacing m with a conservative estimator of m _{0} in the cutoff value.
BonEV procedure
Based on theorem 1 in Storey’s 2003 paper, we propose our own BonEV procedure to control the pFDR at level α. Theorem 1 states that: Suppose that m identical hypothesis tests are performed with pvalues P _{1},…,P _{ m } and significant region includes all adjusted pvalues P ^{∗} less than or equal to α. Assume that (P _{ i },H _{ i }) are i.i.d. random variables, P _{ i }H _{ i }∼(1−H _{ i })·F _{0}+H _{ i }·F _{1} for some null distribution F _{0} and alternative distribution F _{1}, and H _{ i }∼B e r n o u l l i(π _{1}) for i=1,…,m. Then
Based on p F D R=P r(H=0P ^{∗}≤α), our BonEV procedure is as follows:

Compare each p _{ i } with \(\frac {\alpha \cdot Pr(\widehat {P^{*}} \le \alpha)}{\hat {\pi _{0}}}\), i=1,2,…,m;

Reject H _{ i } if \(p_{i} \leq \frac {\alpha \cdot Pr(\widehat {P^{*}} \le \alpha)}{\hat {\pi _{0}}}\), i=1,2,…,m.
In our BonEV procedure, P ^{∗} are the adjusted pvalues from the BenjaminiHochberg procedure, and P r(P ^{∗}≤α) is estimated by the proportion of null hypotheses that have the BenjaminiHochberg adjusted pvalues ≤α. We use the same estimate of π _{0} as used in Storey’s qvalue procedure.
The following equations show that our BonEV procedure controls pFDR at level α.
So, if E(V)≤α·(m P r(P ^{∗}≤α)), then p F D R≤α. To have E(V)≤α·(m P r(P ^{∗}≤α)) using the Bonferroni approach, we compare each p _{ i } with \(\frac {\alpha \cdot (m Pr(P^{*} \le \alpha))}{m_{0}}\):
Thus, pFDR will be controlled at level α if each p _{ i } is compared with \(\frac {\alpha \cdot (m Pr(P^{*} \le \alpha))}{m_{0}}\). As m _{0}=m·π _{0}, we compare each p _{ i } with \(\frac {\alpha \cdot Pr(P^{*} \le \alpha)}{\pi _{0}}\). P r(P ^{∗}≤α) is estimated by \(Pr(\widehat {P^{*}} \le \alpha) = \frac {\sharp \{P^{*}_{i} \le \alpha \}}{m}\) from BenjaminiHochberg’s FDR controlling method and π _{0} is estimated by \(\hat {\pi _{0}} = \frac {\sharp \{p_{i} > \lambda \}}{(1\lambda)m}\) from Storey’s qvalue method. The expected values of \(Pr(\widehat {P^{*}} \le \alpha)\) and \(\hat {\pi _{0}}\) are
Thus, our procedure controls E(V) at α·(m P r(P ^{∗}≤α)) and controls pFDR at level α. We took advantage of existing R functions to estimate P r(P ^{∗}≤α) and π _{0}. We estimate P r(P ^{∗}≤α) from the p.adjust function in R with Benjamini and Hochberg’s method and estimate π _{0} using the qvalue and pi0est function from the qvalue package.
Our proposed BonEV procedure integrates the approaches of the Bonferroni procedure, BenjaminiHochberg procedure, and Storey’s qvalue procedure. The estimated π _{0} from Storey’s qvalue procedure used in the BonEV procedure increases the cutoff value for pvalues in each comparison, thus improving its power compared to BenjaminiHochberg’s procedure. As the sample size increases, the \(\hat {\pi _{0}}\) is closer to the value of π _{0}, and the power will be further improved. By adapting the estimate of P r(P ^{∗}≤α) from the BenjaminiHochberg procedure for the BonEV procedure, the proportion of false discoveries is reduced compared to Storey’s qvalue procedure. Thus, we expect the BonEV procedure to have better FDR control than Storey’s qvalue procedure. Regarding the stability, singlestep approaches such as the Bonferroni and BonEV procedures are superior to stepwise approaches such as the BenjaminiHochberg procedure and Storey’s qvalue procedure, but, the inclusion of the π _{0} estimate in the BonEV and Storey’s qvalue procedures reduces the stability. Taken together, we expect the BonEV procedure to have better stability than Storey’s qvalue procedure.
Results
Simulation studies
We conduct simulation studies to evaluate the FDR control, power, and stability of our BonEV procedure, the BenjaminiHochberg procedure and Storey’s qvalue procedure. Power is defined as the proportion of true rejections among total nontrue null hypotheses, and stability is defined as the standard deviation (SD) of total number of rejections [11].
Our simulations compare gene expression levels between two groups with equal sample sizes of 5, 15, and 30 in each group. For each sample, we test 10,000 genes with expression levels following multivariate normal distributions with means at vector of 0 for the control group and means at vector of (μ,0) for the treatment group. We set standard deviations at 1 with correlations between genes equal to ρ (ρ = 0, 0.4, or 0.8, depending on the simulation study). All genes were equally correlated with correlation ρ. Meanwhile, we also conduct simulations with pairwise gene correlations randomly selected from uniform(0, 0.8) distribution for sample sizes of 5, 15, and 30 in each of the two groups. We set the proportions of differentially expressed genes (π _{1}=1−π _{0}) at 0.005, 0.01, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.40, 0.50, 0.60, 0.75, and 0.80 for each simulation study. The vector μ is set as a sequence from 1 to 3 with length equal to 10,000×π _{1}. We use tstatistics for two independent samples for testing differential gene expression between groups. Each simulation include 1,000 iterations. Each procedure is set to control the FDR at the 0.05 level.
Simulation results
Figure 1 illustrates FDR, power, and stability estimates of the three multiple testing procedures for small sample size (5 in each group) when ρ=0. For test statistics that were independent from each other (ρ=0),the BenjaminiHochberg procedure has the strongest FDR control (the smallest FDR), followed by the BonEV procedure. Storey’s qvalue procedure has the largest FDR, although all three procedures control FDR within 5%. It is also noticeable that the estimated FDR decreases as the proportion of nontrue null hypotheses increase for all three multiple testing procedures. Storey’s qvalue procedure has the greatest power, followed first by the BonEV procedure, and then the BenjaminiHochberg procedure (Additional file 1: Table S1). All three multiple testing procedures have very low power when the proportions of nontrue null hypotheses are less than 15%, and the power of all multiple testing procedures increases as proportions of nontrue null hypotheses increase. The BenjaminiHochberg procedure is the most stable, followed by the BonEV procedure, with Storey’s qvalue procedure the least stable. The BenjaminiHochberg procedure has the smallest SD of total number of discoveries and Storey’s qvalue procedure has the largest SD of total number of discoveries (Additional file 1: Figure S1). When test statistics are moderately correlated (ρ=0.4), the same trends are observed for FDR, power, and SD of total number of discoveries. We notice the power of our BonEV procedure and Storey’s qvalue procedure converge as the correlation between test statistics increases from 0 to 0.4. Meanwhile, the power of all three procedures increases as correlations of test statistics increase especially when the proportions of nontrue null hypotheses are small. When correlations within test statistics further increase to 0.8, we observe the same trends for FDR, power, and SD of total number of discoveries (Additional file 1: Figure S2). It is also noticeable that the difference in power between Storey’s qvalue procedure and our BonEV procedure gets increasingly smaller as test statistic correlations increase to 0.8 from 0.4. The total number of discoveries (Additional file 1: Table S3) increase as the correlation of test statistics increases. The total number of discoveries of our BonEV procedure is close to Storey’s qvalue procedure and higher than the BenjaminiHochberg procedure. In these small sample size cases, a large proportion of nontrue null hypotheses are not detected, especially when the proportion of nontrue null hypotheses is small. The FDR, power, and stability estimates when the correlations across genes are random are close to the estimates when correlations across genes are 0.4 (ρ=0.4, Additional file 1: Figure S3 and Additional file 1: Table S1).
Figure 2 illustrates the FDR, power, and SD of total number of discoveries of the three multiple testing procedures at sample sizes of 15 in each group when ρ=0. At these sample sizes and with independent test statistics (ρ=0), all three multiple testing procedures control FDR to within 5% except when the proportion of nontrue null hypotheses is very small. The power of our BonEV procedure is very close to Storey’s qvalue procedure and higher than the BenjaminiHochberg procedure when the proportion of nontrue null hypotheses is greater than 0.15 (Additional file 1: Table S2). Storey’s qvalue procedure has the lowest stability (the highest SD of total number of discoveries) among the three multiple testing procedures especially when the proportion of nontrue null hypotheses is large. We observe the same trend at a moderate level of correlation between test statistics (ρ=0.4). The power of our BonEV procedure is the same as the power of Storey’s qvalue procedure, and the BonEV procedure continues to show greater stability (smaller SD of total number of discoveries) than Storey’s qvalue procedure (Additional file 1: Figure S4). At the highest level of correlation between test statistics (ρ=0.8), the trends remain the same (Additional file 1: Figure S5). The power of our BonEV procedure and Storey’s qvalue procedure is greater than the BenjaminiHochberg procedure, and the BenjaminiHochberg procedure has better FDR control and lower SD of total number of discoveries. Our BonEV procedure still has better FDR control, the same power, and higher stability than Storey’s qvalue procedure. The total number of discoveries is much closer to the true number of nontrue null hypotheses when the sample size increases to 15 in each group (Additional file 1: Table S3). The total number of discoveries for the BonEV procedure is almost the same as for Storey’s qvalue procedure, and still higher than the BenjaminiHochberg procedure. Also, for Storey’s qvalue procedure, the total number of discoveries exceeds the number of nontrue null hypotheses when the proportion of nontrue null hypotheses is larger than 40%. Similar results on FDR, power, and stability estimates are observed when the data have random correlations (Additional file 1: Figure S6 and Additional file 1: Table S2).
Figure 3 shows the FDR, power, and SD of total number of discoveries of the three multiple testing procedures when ρ=0 and sample size is as large as 30 in each group. All three multiple testing procedures illustrate reasonable control of FDR to within 5%. The power of our BonEV procedure is equivalent to Storey’s qvalue procedure but higher than the BenjaminiHochberg procedure when the proportion of nontrue null hypotheses is greater than 0.15 (Additional file 1: Table S3). Storey’s qvalue procedure still has the lowest stability among the three multiple testing procedures especially when the proportion of nontrue null hypotheses is large. We observe similar results on FDR, power, and stability estimates at a moderate and high level of correlations between test statistics when ρ=0.4 and ρ=0.8 (Additional file 1: Figure S7 and Additional file 1: Figure S8). We also notice that the power and stability improve as the correlation and sample size increase. The FDR, power, and stability estimates when data have random correlations are similar to those estimates when data has moderate correlations (ρ=0.4, Additional file 1: Figure S9 and Additional file 1: Table S3 and Table S4).
Examples using real data
As a complement to our simulation studies, we also compare apparent test power (total number of discoveries) and stability (SD of total number of discoveries) resulting from these three different procedures using human RNASeq data [12]. We compare gene expression levels between 17 females and 24 males using count data from RNASeq downloaded from the ReCount web site [13]. The RNASeq data from human Bcells that we analyze include 52,580 genes and 41 samples. The summarized count data from the RNASeq experiment is first filtered by only retaining genes that express at a countpermillion (CPM) above 0.5 in at least two samples. The retained 9745 genes are further normalized to eliminate RNA composition biases between libraries by finding a set of scaling factors for the library sizes that minimize the logfold changes between the samples for most genes using the default methoda trimmed mean of M values (TMM) [14] between each pair of samples. The raw pvalue is obtained by fitting negative binomial generalized linear models with CoxReid dispersion estimates using the glmFit and flmLRT function in the edgeR package in Bioconductor [15].
We apply the three multiple testing procedures to the raw pvalues generated from the edgeR package to compare the total number of rejected genes (apparent test power) after controlling the false discovery rate. Figure 4 shows the apparent test powers of these three different multiple testing procedures at different FDR levels ranging from 0.01 to 0.05. The BonEV procedure and Storey’s qvalue procedure produce the same number of rejections at the FDR level of 0.02 to 0.04. The BonEV procedure discovers more genes than the BenjaminiHochberg procedure and Storey’s qvalue procedure when the FDR level is equal to 1%. The apparent test power comparison is consistent with our simulation results.
To examine the stability of the BonEV procedure, we bootstrap the RNASeq samples 1000 times within each group and obtain total number of rejections in each boostrap sample. Then, we examine the stability of the BonEV, the BenjaminiHochberg procedure and the Storey’s qvalue procedure by calculating the standard deviation of total number of rejections from 1000 bootstrap samples. Figure 5 shows the stability comparison results among these three procedures, which are also consistent with the results from our simulation studies.
Using a similar approach, we compare apparent test power and stability of the three multiple testing procedures using pvalues generated from a differential gene expression analysis of the two most commonly used inbred mouse strains in neuroscience research  C57BL/6J (B6) and DBA/2J (D2) [16]. Data from 10 B6 mouse samples and 11 D2 mouse samples in the RNASeq experiment and the count data are again downloaded from the ReCount web site. Using the same filtering criteria with C P M>0.5 in at least two samples, we retain 11471 genes out of 36536 total genes. After using the same analysis method from the edgeR package in Bioconductor to obtain the raw pvalues, we apply the three different multiple testing procedures to calculate adjusted pvalues. Figure 6 shows that the number of discoveries at different FDR levels ranges from 0.01 to 0.05. Our BonEV procedure also has more discoveries at lower FDR levels. The stability shown in Fig. 7 indicates that the BonEV procedure has higher stability than Storey’s qvalue procedure. Again, the results are consistent with our simulation results.
Discussion
In this study, we propose a new multiple testing procedure (BonEV), based on the Bonferroni procedure, intended to improve FDR control and stability as well as maintain power. We compare the BonEV procedure with the BenjaminiHochberg procedure and Storey’s qvalue procedure using both simulation studies and real RNASeq data. Our studies show that the proposed BonEV multiple testing procedure has better control of FDR and higher stability than Storey’s qvalue procedure, and also maintains high levels of power at small and medium sample sizes.
Next generation sequencing and third generation sequencing technology has become more popular in biological and biomedical studies. The sample size in sequencing studies remain small to medium although the price of sequencing per sample has significantly decreased in recent years. Multiple testing procedures with larger power could help increase the probability of novel discoveries. The BonEV procedure, similar to the Storey’s qvalue procedure, offers higher power than the BenjaminiHochberg procedure and increases the costeffectiveness of sequencing studies.
Compared to the Storey’s qvalue procedure, the BonEV procedure offers better FDR control. In recent years, irreproducibility of results in biomedical research has drawn increasing attention in the popular press and academia [17, 18]. Investigations have found many landmarks in preclinical oncology to be nonreproducible, and confirming research conducted by scientists in the haematology and oncology department at the biotechnology firm Amgen in Thousand Oaks, California, finds only 11% of scientific findings they examined to be reproducible [17]. The reasons for irreproducibility are at least, in part, due to false discoveries in those studies [18–20]. With better FDR control, the BonEV procedure will enable more accurate control of false discoveries in genomic studies. Meanwhile, understanding the source of variation in the data generation and analysis can help improve reproducibility of scientific studies [21]. Multiple testing procedures are widely used to select significant features such as genes, SNPs, methylation loci, and others in microarray and NGS studies. Assessment of the stability of statistical findings from multiple testing procedures and improving the stability of these procedures could reduce replication failures. The BonEV procedure will help reduce replication failures compared with the Storey’s qvalue procedure, and provide higher stability.
Conclusions
Our study investigates the stability of BenjaminiHochberg and Storey’s qvalue FDR controlling procedures commonly used in genomic and genetic data analysis and proposes a new multiple testing procedure with higher stability, better FDR control and power equivalent to Storey’s qvalue procedure as well as higher power than the BenjaminiHochberg procedure. The BonEV multiple testing procedure we propose is attractive in microarray and sequencing data analysis in that it has higher power than the BenjaminiHochberg procedure, and better FDR control and higher stability than Storey’s qvalue procedure.
References
Pounds SB. Estimation and control of multiple testing error rates for microarray studies. Brief Bioinforma. 2006; 7(1):25–36.
Hochberg Y, Tamhane AC. Multiple Comparison Procedures. New York, NY: WileyInterscience; 1987.
Dohler S. A sufficient criterion for control of some generalized error rates in multiple testing. Stat Probab Lett. 2014; 92:114–20.
Benjamini Y, Hochberg Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J R Stat Soc Series B (Methodological). 1995; 57(1):289–300.
Benjamini Y, Hochberg Y. On the adaptive control of the false discovery rate in multiple testing with independent statistics. J Educ Behav Stat. 2000; 25(1):60–83.
Keselman HJ, Cribbie R, Holland B. Controlling the rate of type i error over a large set of statistical tests. Br J Math Stat Psychol. 2002; 55:27–39.
Storey JD. A direct approach to false discovery rates. J R Stat Soc Series B (Methodological). 2002; 64(3):479–98.
Storey JD. The positive false discovery rate: a Bayesian interpretation and the qvalue. Ann Stat. 2003; 31(6):2013–35.
Khatree R, Naik D. Computational Methods in Biomedical Research. New York, NY: Chapman & Hall/CRC; 2008.
Gordon A, Glazko G, Qiu X, Yakovlev A. Control of the mean number of false discoveries, bonferroni and stability of multiple testing. Ann Appl Stat. 2007; 1(1):179–90.
Li D, Xie Z, Le Pape M, Dye T. An evaluation of statistical methods for dna methylation microarray data analysis. BMC Bioinforma. 2015; 16(1):1–20. doi:http://dx.doi.org/10.1186/s128590150641x.
Cheung VG, Nayak RR, Wang IX, Elwyn S, Cousins SM, Morley M, Spielman RS. Polymorphic Cis and Transregulation of human gene expression. PLoS Biol. 2010; 8(9):1000480. doi:http://dx.doi.org/10.1371/journal.pbio.1000480.
Frazee AC, Langmead B, Leek JT. Recount: A multiexperiment resource of analysisready rnaseq gene count datasets. BMC Bioinformatics. 2011; 12(1):1–5. doi:http://dx.doi.org/10.1186/1471210512449.
Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of rnaseq data. Genome Biol. 2010; 11(3):25.
Robinson MD, McCatyhy DJ, Smyth GK. edger: a bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010; 26(1):139–40.
Bottomly D, Walter NAR, Hunter JE, Darakjian P, Kawane S, Buck KJ, Searles RP, Mooney M, McWeeney SK, Hitzemann R. Evaluating gene expression in c57bl/6j and dba/2j mouse striatum using rnaseq and microarrays. PLoS ONE. 2011; 6(3):1–8. doi:http://dx.doi.org/10.1371/journal.pone.0017820.
Begley CG, Ellis LM. Drug development: Raise standards for preclinical cancer research. Nature. 2012; 483:531–3.
Plant AL, Locascio LE, May WE, Gallagher PD. Improved reproducibility by assuring confidence in measurements in biomedical research. Nat Methods. 2014; 11(9):895–9.
Begley CG. Reproducibility: Six red flags for suspect work. Nature. 2013; 497:433–4.
McNutt M. Reproducibility. Science. 2014; 343(6168):229–9. doi:http://dx.doi.org/10.1126/science.1250475. http://science.sciencemag.org/content/343/6168/229.full.pdf.
Stodden V. Reproducing statistical results. Annu Rev Stat Appl. 2015; 2:1–19.
Acknowledgements
We thank Stanley Pounds from biostatistics department at St. Jude Children’s Research Hospital for his insightful suggestions and comments on our manuscript and the Center for Integrated Research Computing at the University of Rochester for providing high performance computing resources. We also would like to thank the anonymous reviewers for their insightful comments and suggestions that helped to further improve our manuscript.
Funding
Dr. Li’s, Dr. Zand’s, Mr. Fogg’s, and Dr. Dye’s time is supported by the University of Rochester’s Clinical and Translational Science Award (CTSA) number UL1 TR000042 and UL1 TR002001 from the National Center for Advancing Translational Sciences of the National Institutes of Health. Dr. Zand is also supported by the National Institute of Allergy and Infectious Diseases and the National Institute of Immunology, grant numbers AI098112 and AI069351. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Availability of data and materials
The BonEV multiple testing procedure is available as the BonEV package in R for download at https://CRAN.Rproject.org/package=BonEV. The RNASeq data used in the real data examples are downloaded from http://bowtiebio.sourceforge.net/recount/website.
Authors’ contributions
DL, ZX, MZ, TF, and TD contributed to the conception and design of the study. DL, ZX designed the simulation studies. DL performed the simulation studies and real data analysis. DL, ZX, MZ, TF, and TD interpreted the analysis results and wrote the manuscript. All authors edited and approved the final manuscript. All authors agreed to be accountable for all aspects of the study in ensuring that questions related to the accuracy or integrity of any part of the study are appropriately investigated and resolved.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable
Ethics approval and consent to participate
Not applicable
Author information
Authors and Affiliations
Corresponding author
Additional file
Additional file 1
The additional file includes supplemental figures when correlations were 0.4, 0.8, and random across all genes. Figures S1S9 shows the estimated FDR, power, SD of power, and SD of total discoveries of compared multiple testing procedures, with correlations of 0.4, 0.8, and random, and sample size of 5, 15, and 30 in each group. The values of power and stability were shown in supplemental Table S1S4. Figure S1. Shows the performance of compared multiple testing procedures with ρ = 0.4 and sample size of 5 in each group. Figure S2. Shows the performance of compared multiple testing procedures with ρ= 0.8 and sample size of 5 in each group. Figure S3. Shows the performance of compared multiple testing procedures with random correlation across genes and sample size of 5 in each group. Figure S4. Shows the performance of compared multiple testing procedures with ρ = 0.4 and sample size of 15 in each group. Figure S5. Shows the performance of compared multiple testing procedures with ρ = 0.8 and sample size of 15 in each group. Figure S6. Shows the performance of compared multiple testing procedures with random correlation across genes and sample size of 15 in each group. Figure S7. Shows the performance of compared multiple testing procedures with ρ = 0.4 and sample size of 30 in each group. Figure S8. Shows the performance of compared multiple testing procedures with ρ = 0.8 and sample size of 30 in each group. Figure S9. Shows the performance of compared multiple testing procedures with random correlation across genes and sample size of 30 in each group. Table S1. Shows the power and stability of compared multiple testing procedures for sample size n = 5 in each group. Table S2. Shows the power and stability of compared multiple testing procedures for sample size n = 15 in each group. Table S3. Shows the power and stability of compared multiple testing procedures for sample size n = 30 in each group. Table S4. Shows the total number of rejections of compared multiple testing procedures for sample size n = 5, 15, 30 in each group. (PDF 59 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Li, D., Xie, Z., Zand, M. et al. BonEV: an improved multiple testing procedure for controlling false discovery rates. BMC Bioinformatics 18, 1 (2017). https://doi.org/10.1186/s128590161414x
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s128590161414x