Introduction

In 2006, Narum published a paper in Conservation Genetics pointing out the conservative nature of the Bonferroni approach to correcting for multiple testing when considering a set of statistical inferences, and the resulting potential for elevated Type II error (Narum 2006). He suggested that alternative approaches, such as false discovery rate (FDR) methods of multiple testing correction, can be effective and can provide a better balance between Type I and Type II errors (a Type I error is a false positive, incorrectly rejecting a true null hypothesis, whereas a Type II error is a false negative, a failure to reject a false null hypothesis). Further, Narum (2006) argued that the method of correcting for multiple testing should be chosen on a case-by-case basis, depending on the relative costs of potential Type I and Type II errors. Finally, he proposed the FDR approach of Benjamini and Yekutieli (2001) as an alternative that is potentially more biologically relevant for conservation genetics.

His paper, “Beyond Bonferroni: Less conservative analyses for Conservation Genetics,” has been cited over 600 times to date, not only in the field of conservation genetics but also, increasingly, in biology and medicine. These studies apply the equation that Narum (2006) attributed to the Benjamini and Yekutieli (2001) procedure for multiple testing correction (BY-FDR). However, a careful comparison of the published BY method with what Narum describes as the BY method reveals crucial differences. Close examination of the two works shows that not all steps of the BY-FDR procedure were included in Narum (2006); this implementation of BY is therefore incorrect and cannot be guaranteed to control the FDR. We believe that this error has created confusion about the BY procedure and that the misimplementation is being propagated through an increasing number of studies.

Within this context, this paper has three goals. The first is to provide an overview of the Bonferroni method, the original Benjamini and Hochberg (1995) FDR procedure (BH-FDR), and the Benjamini and Yekutieli (2001) method (BY-FDR). The second is to describe the incorrect implementation of the BY-FDR approach given by Narum, which we will henceforth label the BY-mis (short for BY-misimplementation) approach. The third is to assess the potential impact of this error using 30 of the most recent publications that cite the Narum (2006) paper; given the large number of papers that have applied this approach, however, the specific impact within the fields of conservation genetics, biology, and medicine will need to be evaluated by experts within each domain or sub-domain of research in these fields. We will demonstrate that using the BY-mis approach for multiple testing correction results in higher rates of false positives, especially when a large number of tests are performed. Still, as Narum (2006) pointed out, false negatives can also be a concern, and specific situations may require approaches that limit Type II errors. Typically, larger sample sizes are needed to confirm true negatives. In situations where sample sizes are low, as is often the case in conservation genetics (e.g., low numbers of sampled individuals and/or populations, low numbers of loci in non-model species), decisions based on false negatives could lead to less productive conservation management strategies (Narum 2006). Thus, we also provide simulations of the rates of false negatives under different approaches for multiple testing correction in two specific scenarios.

Theory

We first review the different multiple testing approaches discussed by Narum (2006), following his notation as closely as possible. We start with a collection of k tests, each with a corresponding p-value, \(p_{i}\), i = 1,…,k. A multiple testing procedure identifies a subset of the k tests as significant while controlling some measure of false positive risk that takes into account the number of tests performed. The Bonferroni method controls the family-wise error (FWE), the chance of one or more false positives, by using a fixed threshold of:

$$\alpha_{\text{Bonf}} = \frac{1}{k}\alpha_{\text{FWE}}$$

where \(\alpha_{\text{FWE}}\) is the desired FWE level: all tests with \(p_{i} \le \alpha_{\text{Bonf}}\) can be declared significant while controlling the FWE.
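As a concrete illustration, here is a minimal Python sketch of this rule (a reimplementation of ours; the function and variable names are not from any cited source):

```python
import numpy as np

def bonferroni_significant(pvals, alpha_fwe=0.05):
    """Boolean mask of tests significant under Bonferroni FWE control."""
    pvals = np.asarray(pvals)
    return pvals <= alpha_fwe / pvals.size  # fixed threshold alpha_FWE / k
```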

Benjamini and Hochberg (1995) introduced the false discovery rate (FDR) for multiple testing correction. In describing the FDR it is useful to first define the false discovery proportion (FDP): the ratio of the number of false positive tests to the total number of significant tests, defined as 0 if no tests are significant. The FDR is the expected value of the FDP; put another way, the FDR is the expected proportion of false positives among positives. To find FDR-significant tests, denote the ordered p-values \(p_{(1)} \le p_{(2)} \le \cdots \le p_{(k)}\). Then for a desired \(\alpha_{\text{FDR}}\), let the index i* be found as

$$i^{*} = \hbox{max} \left\{ {i:p_{\left( i \right)} \le \frac{i}{k}\alpha_{\text{FDR}} } \right\},$$

and the tests with \(p_{i} \le p_{(i^{*})}\) can be declared significant while controlling the FDR at \(\alpha_{\text{FDR}}\).
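A minimal Python sketch of this step-up rule (again our own reimplementation, with function names of our choosing):

```python
import numpy as np

def bh_significant(pvals, alpha_fdr=0.05):
    """Benjamini-Hochberg step-up: boolean mask of FDR-significant tests."""
    pvals = np.asarray(pvals)
    k = pvals.size
    order = np.argsort(pvals)                  # sort to get p_(1) <= ... <= p_(k)
    below = pvals[order] <= alpha_fdr * np.arange(1, k + 1) / k
    reject = np.zeros(k, dtype=bool)
    if below.any():                            # i* = max{i : p_(i) <= (i/k) alpha}
        reject[order[: np.flatnonzero(below).max() + 1]] = True
    return reject
```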

The BH-FDR procedure as originally presented assumes independence among the test statistics (Benjamini and Hochberg 1995). However, Benjamini and Yekutieli (2001) found that weaker assumptions suffice, allowing a general form of positive dependence among the test statistics. They also proposed another method that controls the FDR with no assumptions at all about the dependence among the tests, so long as a more stringent criterion is used (their Theorem 1.3), with the index \(i_{\text{BY}}^{*}\) computed as:

$$i_{\text{BY}}^{*} = \hbox{max} \left\{ {i:p_{\left( i \right)} \le \frac{i}{k}\frac{1}{{\mathop \sum \nolimits_{{i^{'} = 1}}^{k} \frac{1}{{ i^{'} }}}}\alpha_{\text{FDR}} } \right\}.$$

With this approach, the tests with \(p_{i} \le p_{(i_{\text{BY}}^{*})}\) are marked significant and the FDR is controlled at \(\alpha_{\text{FDR}}\) under any form of dependency. Note that \(\mathop \sum \nolimits_{i' = 1}^{k} \frac{1}{i'} \approx \log \left( k \right) + \gamma\), where \(\gamma \approx 0.57721\) is the Euler–Mascheroni constant. This is the method we refer to as BY-FDR.
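In code, BY-FDR is simply the BH step-up rule run at a deflated level; a minimal, self-contained sketch of ours:

```python
import numpy as np

def by_significant(pvals, alpha_fdr=0.05):
    """Benjamini-Yekutieli: the BH step-up rule run at alpha / sum_{i=1}^{k} 1/i."""
    pvals = np.asarray(pvals)
    k = pvals.size
    c_k = (1.0 / np.arange(1, k + 1)).sum()   # harmonic sum, ~ log(k) + 0.5772
    order = np.argsort(pvals)
    below = pvals[order] <= (alpha_fdr / c_k) * np.arange(1, k + 1) / k
    reject = np.zeros(k, dtype=bool)
    if below.any():                           # i*_BY is the largest such index
        reject[order[: np.flatnonzero(below).max() + 1]] = True
    return reject
```

Note that the harmonic-sum penalty enters the adaptive step-up comparison; it is not a stand-alone cutoff.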

We can now make a quick comparison of the three methods on the basis of the smallest p-value \(p_{(1)}\): Bonferroni has the fixed threshold \(\alpha_{\text{FWE}}/k\), while BH-FDR compares \(p_{(1)}\) to \(\alpha_{\text{FDR}}/k\) and BY-FDR compares \(p_{(1)}\) to approximately \(\alpha_{\text{FDR}}/(k\log(k))\). Of course, BH-FDR and BY-FDR are adaptive, so successive p-values within a test set are compared to successively more lenient thresholds. However, as BH-FDR and BY-FDR use the same inequality except for the \(\approx 1/\log(k)\) term, BY-FDR can only be more stringent than BH-FDR.

Now, in Narum (2006), the author incorrectly states that the BY-FDR threshold is fixed and equal to:

$$\frac{1}{{\mathop \sum \nolimits_{i = 1}^{k} \frac{1}{i}}}\alpha_{\text{FDR}}.$$

This is a fundamental error, as a key feature of FDR methods is that they are adaptive. The error arose from neglecting that this expression is just one component of the BY procedure [it is to be substituted for q in BY Eq. (1) on p. 1167 (Benjamini and Yekutieli 2001)]. This incorrect application of the BY approach (BY-mis) results in a fixed threshold for a given k.
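For contrast with the sketches above, here is what we understand the BY-mis rule to compute (our reconstruction):

```python
import numpy as np

def by_mis_significant(pvals, alpha_fdr=0.05):
    """The misimplemented rule: a single fixed cutoff with no adaptive i/k factor."""
    pvals = np.asarray(pvals)
    k = pvals.size
    c_k = (1.0 / np.arange(1, k + 1)).sum()
    return pvals <= alpha_fdr / c_k   # e.g., ~0.0151 for k = 15 at alpha = 0.05
```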

Since a fixed threshold specifies the average or per-comparison error rate (PCE), we have taken several approaches to assess the impact of this error. Assuming the complete null, i.e. no signal for any test, k × PCE is the expected number of false positives. For the threshold at the 0.05 level, with k = 105 BY-mis has k × PCE ≈ 1, while with k = 1590, k × PCE ≈ 10. Thus the BY-mis approach is assured to produce a growing number of false positives as k increases. In contrast, for Bonferroni k × PCE is exactly \(\alpha_{\text{FWE}}\), i.e. always less than 1, and every valid FWE or FDR level-α procedure is guaranteed to produce no false positives with probability at least 1 − α (again, in this complete null setting). While the BY-mis threshold does asymptote to zero as k approaches infinity, it approaches zero extremely slowly. For example, with 10 million tests performed, the BY-mis p-value threshold is 0.003, in contrast to the Bonferroni threshold of 0.000000005.
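These figures are easy to verify directly; a quick check in Python:

```python
import numpy as np

def k_times_pce(k, alpha=0.05):
    """Expected number of false positives under the complete null for BY-mis."""
    c_k = (1.0 / np.arange(1, k + 1)).sum()   # harmonic sum
    return k * alpha / c_k

print(k_times_pce(105))     # ~1.0
print(k_times_pce(1590))    # ~10.0

# BY-mis threshold for 10 million tests, using c_k ~ log(k) + 0.5772:
print(0.05 / (np.log(1e7) + 0.5772))   # ~0.003, versus Bonferroni 0.05 / 1e7 = 5e-9
```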

To evaluate the rate of significant p-values found with the Bonferroni, BH, BY, BY-mis, and uncorrected approaches, we conducted a simulation using the Python programming language version 2.7.13 (Python Software Foundation; www.python.org); the code used for all simulations is available in the supplement. We performed simulations using k values ranging from 1 to 100 tests. For each k, we created 50,000 random realizations in which null p-values were computed from test statistics drawn from a standard normal distribution. Thus, for k = 1 we had a total of 50,000 independent p-values, and in this case all of the approaches are identical. For k > 1 we generated k independent p-values and applied each of the methods. A nominal \(\alpha_{\text{FWE}} = \alpha_{\text{FDR}} = 0.05\) was used for all methods. In this null setting, any “discovery” is a false discovery, and so the measured FDR and FWE are the same. We computed the proportion of realizations in which any p-values were found significant, representing an FWE error and an FDP of 1. Figure 1a shows the FDR and FWE as a function of the number of tests: Bonferroni and BH-FDR both control false positives as expected (as an aside, while Bonferroni is often regarded as conservative, in this setting of small k and independent tests it is essentially exact). The FDR/FWE of BY-FDR becomes increasingly conservative, while BY-mis has inflated false positives that increase nearly linearly with k.
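The following condensed sketch conveys the structure of this null simulation (a reimplementation of ours, not the supplement code; it assumes two-sided p-values from the normal test statistics, uses statsmodels for the corrections, and runs fewer iterations for speed):

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
n_iter, alpha = 5_000, 0.05     # the paper used 50,000 iterations

for k in (10, 50, 100):
    hits = {m: 0 for m in ('bonferroni', 'fdr_bh', 'fdr_by')}
    for _ in range(n_iter):
        p = 2 * stats.norm.sf(np.abs(rng.standard_normal(k)))  # null p-values
        for m in hits:
            if multipletests(p, alpha=alpha, method=m)[0].any():
                hits[m] += 1    # any discovery is false here, so FDP = 1
    print(k, {m: h / n_iter for m, h in hits.items()})          # measured FDR = FWE
```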

Fig. 1

Probability of Type I and Type II errors as a function of the number of independent tests performed. a False positive rates under the complete null setting, showing the false discovery rate and family-wise error rate (here identical) plotted against the number of tests performed, using five different approaches: Bonferroni, Benjamini–Hochberg (BH-FDR), Benjamini and Yekutieli (BY-FDR), the BY-misimplementation (BY-mis), and no correction. The simulation demonstrates that the FDR and FWE rise dramatically with k (the number of tests) for BY-mis. b Type II error rates for one non-null test out of a total of k tests (k = 1–100). c Average Type II error rate over 25 non-null tests out of k tests (k = 25–100). Type II error rates rise with k for all multiple testing methods, but BY-mis has dramatically different rates than BY-FDR. A total of 50,000 iterations were run for each simulation, and the Python code is provided in the supplement

In addition, we performed simulations in Python to measure false negative rates for the Bonferroni, BH, BY, and BY-mis approaches to multiple testing correction. These simulations again created 50,000 realizations of sets of k tests, k = 1 to 100, but this time included a mix of null and non-null tests. We performed two classes of simulations, one with 1 non-null test and one with 25 non-null tests. For example, with k = 50 and 1 non-null test, there were 49 random p-values computed from standard Normal test statistics, and 1 p-value generated from a non-null Normal with mean set to give a test with 80% power at the uncorrected level α = 0.05. Likewise, with k = 50 and 25 non-null tests, 25 p-values were generated from null test statistics and 25 from non-null statistics with 80% power to reject the null. This can be seen in Fig. 1b, c, where the probability of a false negative for uncorrected comparisons remains at 0.2. These simulations show that BY-FDR has the highest probability of a Type II error with one simulated non-null result, whereas BH-FDR and Bonferroni are very similar.
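A condensed sketch of this power simulation (again our own reimplementation under a stated assumption: two-sided z-tests, so the non-null mean is \(z_{0.975} + z_{0.80} \approx 2.80\)):

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)
alpha, n_iter = 0.05, 5_000                # the paper used 50,000 iterations
# Mean shift giving ~80% power for a two-sided z-test at alpha = 0.05:
mu = stats.norm.ppf(0.975) + stats.norm.ppf(0.80)   # ~2.80

def type2_rate(k, n_signal, method):
    """Average Type II error rate over the non-null tests."""
    misses = 0
    for _ in range(n_iter):
        z = rng.standard_normal(k)
        z[:n_signal] += mu                             # first n_signal tests are non-null
        p = 2 * stats.norm.sf(np.abs(z))
        reject = multipletests(p, alpha=alpha, method=method)[0]
        misses += n_signal - reject[:n_signal].sum()   # non-null tests not detected
    return misses / (n_iter * n_signal)

print(type2_rate(50, 25, 'fdr_bh'))   # cf. the ~0.4 reported for BH-FDR in Fig. 1c
```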

To illustrate these simulations with an example, suppose a study performed 50 tests (k = 50), half of which are truly non-null. There are then 25 tests in which a false positive is possible, and 25 tests in which a false negative is possible. Since Fig. 1c models the case of 25 non-null tests out of k = 25 to 100, the probability of a false negative for k = 50 is approximately 0.4 for BH-FDR, 0.43 for BY-mis, 0.68 for BY-FDR, and 0.72 for Bonferroni. The probability of a false positive among the 25 null tests can be read from Fig. 1a. With k = 25, the FDR and FWE would be at or below approximately 5% for Bonferroni, BH-FDR, and BY-FDR, but the false discovery rate would be approximately 30% for BY-mis (Fig. 1a). Figure 1b, c shows that for all methods used to correct for multiple testing, the risk of Type II error increases with the number of tests k. However, there is a dramatic difference between the performance of BY-FDR and BY-mis. Note the advantage of the BH-FDR approach in minimizing both false positive and false negative errors, while still controlling FDR.

We also consider the specific set of 15 p-values used in Narum (2006), tabulating the p-value thresholds for the Bonferroni, BH, BY, and BY-mis approaches. Table 1 shows the threshold used for each of the 15 exemplar p-values, with significant tests marked in bold. It can be seen that BY-FDR and BY-mis are not the same: Narum (2006) reported four significant tests, whereas the correct BY-FDR yields two.

Table 1 A set of p-values from 15 significance tests taken from the Narum (2006) paper (column labeled ‘p-value examples’) and a comparison with four approaches to multiple testing (critical p-values for significance)

The example in Table 1 also illustrates the challenge of balancing Type I and Type II errors when choosing a multiple testing correction. The probability that 12 of 15 independent tests would show an uncorrected p-value below 0.05 by chance alone is very low. Thus Bonferroni, with only two significant tests, is likely overly conservative and would result in a higher Type II error rate. The BH-FDR approach, by contrast, identifies 10 of the 15 tests as significant, which in this situation may be more plausible, although it would be helpful to know the covariance structure among the different variables, as statistical dependence between variables is not uncommon. Figure 1c shows the Type II error rates for the simulation of 25 true positives (each with 80% power at the uncorrected p < 0.05 level) and the notable differences between Bonferroni and BH-FDR for k = 1–100 independent tests.

Finally, we used Scopus to identify the 30 most recent publications (search date: February 9, 2019) that cite Narum (2006), to sample the impact of this error on the literature (Table 2). Of these 30 articles, nine (30%) were specifically related to conservation genetics; ten (33.3%) were in the field of biology, mostly involving genetic analyses; nine (30%) were in the field of medicine, most commonly in psychiatry; and the two remaining articles were in the fields of statistics and anthropology. In 20 of these articles (67%) we could confidently determine that BY-mis was used, while this was unclear in six articles (20%), and one article cited Narum (2006) but did not use the BY-mis approach. None of the papers described using a standard statistical software package to calculate the BY-FDR. Eight of the twenty articles that applied the BY-mis approach also cited the Benjamini and Yekutieli (2001) article. Of the 28 relevant articles [excluding Hauser et al. (2018) and Stepien et al. (2018), as these papers cited but did not apply the BY-mis approach], only eight (29%) provide enough information to calculate alternate multiple testing corrections for the data in the specific study. Four of these eight articles show a reduction in the number of significant tests when BY-mis is replaced with BY-FDR, whereas the other four have tests that either are negative (one article) or are so strongly significant that all the tests also pass Bonferroni correction (three articles). Also noteworthy, eight of the twenty articles that applied the BY-mis approach (40%) applied the correction separately to independent subsets of tests, rather than applying it to all tests in the article.

Table 2 List of the 30 most recent articles identified via Scopus (9 February 2019) that cite the Narum (2006) article

Discussion

In 1995, Benjamini and Hochberg proposed the FDR metric and a method to control FDR. Benjamini and Yekutieli in 2001 proposed a method that controls FDR under weaker assumptions but applies a more stringent correction than the BH approach. Narum's (2006) paper provided an overview and examples of the BY-FDR procedure; however, it did not include all steps of the BY algorithm (shown above). A careful reading of Benjamini and Yekutieli (2001) reveals that the expression Narum (2006) used for multiple testing (from Theorem 1.3 on p. 1169 of BY) should be entered as the α in the BH equation (Eq. (1) on p. 1167 of BY), producing an adaptive threshold. Further, on the series of p-values taken from the Narum (2006) paper (Table 1), different results are obtained when comparing the Narum (2006) description of the BY approach with the BY-FDR described by Benjamini and Yekutieli (2001).

Direct calculation shows that BY-mis has an expected number of false positives that increases nearly linearly with the number of tests k, and that this increasing false positive rate differs dramatically from the BY-FDR approach (Fig. 1a). We believe that a large percentage of the over 600 citing publications are liable to have this inflated rate of false positives in their results, notably since results arising from Type I errors are much easier to publish than those from Type II errors. We found that at least 40% of a sample of the 30 most recent papers citing the Narum (2006) article also cite Benjamini and Yekutieli (2001) and state that they applied the BY approach, when in fact they applied the BY-mis approach (Table 2).

We do agree with Narum that the Bonferroni approach can be highly conservative in some multiple testing situations, especially with dependent data. However, there has also been growing concern that many studies fail to replicate (Ioannidis 2005; Open Science Collaboration 2015; Nichols et al. 2017; Gelman 2018). In the past, analyses were often performed without adequately controlling for the number of tests performed (Carp 2012), which resulted in numerous Type I errors but also, likely, fewer Type II errors. We also agree with Narum that individual studies should determine the balance between Type I and Type II errors, as there are situations in many fields where researchers want to limit Type II errors. Examples include situations in conservation genetics where a failure to show a positive effect could direct conservation management strategies that are counter to the survival of a species (Narum 2006). Species at risk of extinction often have smaller populations and lower rates of reproduction (Lynch and Lande 1998), and decisions based on false negatives in some populations could lead to less productive conservation management strategies. An example in medicine where false negatives are the greater concern is the presurgical use of functional magnetic resonance imaging to identify eloquent cortex (Durnez et al. 2013): a false negative could result in the removal of eloquent cortical regions, and thus stringent correction for multiple testing would not be indicated. Thus, in conservation genetics, biology, medicine, and other fields, individual studies may shift the balance toward limiting either Type I or Type II errors, but the rationale for the choice of (or lack of) multiple testing correction should always be provided.

Our attempt to extract the information needed to assess the multiple testing correction in each of the 30 most recent articles citing the Narum (2006) paper highlights the need for greater transparency in the literature regarding multiple testing correction. Over two-thirds of these papers did not provide enough information to replicate the authors' approach to multiple testing correction, nor to compare the different methods. Further, only a minority of these papers presented effect sizes or confidence intervals for their findings, and the omission of these data has been shown to be a problem in many fields of science (Chavalarias et al. 2016). None of the authors described using statistical software packages (e.g., R or SAS) to calculate the BY-FDR, which, if performed correctly, would have resulted in an accurate multiple testing correction. It is likely that the BY-mis approach, which provides a single critical p-value and is trivial to calculate, was simply easier to apply than statistical software. There are currently discussions about moving away from the p ≤ 0.05 approach (American Statistical Association 2016); we recommend that if p-values are presented, they should always be the full, unadjusted p-values and should be accompanied by effect sizes or confidence intervals. Effect sizes and confidence intervals provide greater detail about hypothesis testing than p-values alone (Smith 2018) and will enhance replication, as studies evaluating small effects amid considerable noise are likely to produce false positives (Gelman 2018), particularly in a system that rewards positive findings.
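For reference, the correct BY-FDR adjustment is a one-line call in standard software: in R, p.adjust(p, method = "BY"); in Python, for example, via statsmodels. A minimal sketch (the p-values below are illustrative only, not taken from any cited study):

```python
from statsmodels.stats.multitest import multipletests

pvals = [0.0001, 0.004, 0.019, 0.043, 0.051, 0.27, 0.75]   # illustrative values
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method='fdr_by')
print(reject)   # which tests survive a correct BY-FDR correction
print(p_adj)    # BY-adjusted p-values
```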

In summary, so long as p-values remain one of the methods of choice for reporting statistical results, we agree with Narum (2006) that researchers should carefully consider the available multiple testing corrections and should make a priori decisions based on the Type I and Type II error trade-offs of their specific study. We have provided an overview of FWE and FDR correction approaches and several simulations showing both Type I and Type II error rates. We point out an error in Narum's (2006) description of the BY approach and show that BY-mis does not adequately control the FDR when used for multiple testing correction. Finally, we recommend that authors be transparent in reporting the number of tests, the number of clusters of tests, and the method used when performing multiple testing correction. Presenting effect sizes or confidence intervals is also key.