Accreditation and Quality Assurance, Volume 19, Issue 1, pp 1–10

Multiple hypothesis testing for metrology applications



Abstract

Scrutiny of hypotheses by means of statistical tests is common practice in experimental research. However, valid inference drawn from data analysis requires scientific grounds, and conclusions based on significance tests or hypothesis testing may be problematic, especially when a multiplicity of hypotheses is tested, as in experiments performed on bio-molecules. The present paper focuses on the problem of the false discovery rate, aiming to elicit sound criteria for the rejection or acceptance of hypotheses, with a view to addressing related methods for uncertainty characterization.
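As an illustration of the kind of multiple-testing rejection criterion the paper discusses, the Benjamini–Hochberg step-up procedure controls the false discovery rate across a family of tests. The sketch below is a minimal, generic implementation; the function and variable names (`benjamini_hochberg`, `p_values`, `q`) are illustrative and not taken from the paper.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return the indices of hypotheses rejected at FDR level q
    by the Benjamini-Hochberg step-up procedure."""
    m = len(p_values)
    # Sort p-values in ascending order, remembering original positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= (k/m) * q.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank
    # Reject the hypotheses with the k_max smallest p-values.
    return sorted(order[:k_max])

# Example: five tests at FDR level q = 0.05.
pvals = [0.001, 0.008, 0.039, 0.041, 0.6]
print(benjamini_hochberg(pvals))  # -> [0, 1]
```

Note that the step-up search keeps the *largest* qualifying rank, so a rejected hypothesis may itself fail its own threshold comparison as long as a later one succeeds; here only the two smallest p-values clear their thresholds (0.01 and 0.02).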


Keywords: Measurement science · Statistical significance · Multiple hypothesis tests · False discoveries

Defined functions

False discovery proportion, Eqs. (4, 5)
False discovery rate, Eqs. (7a, 7b, 19, 22)
False non-discovery rate, Eq. (20)
False positive proportion, Eqs. (2, 3)
Family-wise error rate, Eqs. (6, 14)
Marginal FDR, Eq. (9)
Per-comparison error rate, Eq. (10)
Positive FDR, Eq. (8)
Per-family error rate, Eq. (11)
Positive FNR, Eq. (21)
Probability density function
Probability value
Random variable
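Among the quantities defined above, the false discovery proportion (FDP) is a random quantity realized in a single experiment, whereas the false discovery rate (FDR) is its expectation over repeated experiments. A hypothetical simulation (all parameters and names below are illustrative, not drawn from the paper) can make the distinction concrete:

```python
import random

def one_experiment(m0, m1, alpha, rng):
    """Run one experiment with m0 true nulls and m1 true effects,
    reject at a naive fixed threshold alpha, and return the realized FDP."""
    # Null p-values are uniform on [0, 1]; "true effect" p-values are
    # made stochastically small by taking the minimum of 20 uniforms.
    null_p = [rng.random() for _ in range(m0)]
    alt_p = [min(rng.random() for _ in range(20)) for _ in range(m1)]
    false_disc = sum(p <= alpha for p in null_p)
    true_disc = sum(p <= alpha for p in alt_p)
    r = false_disc + true_disc
    # FDP is defined as 0 when no hypothesis is rejected.
    return false_disc / r if r > 0 else 0.0

rng = random.Random(1)
fdps = [one_experiment(m0=90, m1=10, alpha=0.05, rng=rng) for _ in range(2000)]
# The FDP varies from experiment to experiment; its average over many
# experiments estimates the FDR of this naive per-comparison rule.
print(round(sum(fdps) / len(fdps), 3))
```

With 90 true nulls tested at alpha = 0.05, roughly 4.5 false discoveries are expected per experiment, so even with 10 genuine effects the expected FDP (i.e., the FDR) of the uncorrected rule is substantial, which is the motivation for the FDR-controlling procedures surveyed in the paper.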

List of symbols

\( p(b|a) \)

Probability of b conditional on a




Cardinality (number of elements belonging to the set)


(“Hat”) indicates estimation, e.g., \( \hat{J} \) is an estimate of a quantity J



Partially developed in the framework of Joint Research Project SIB54 “Bio-SITrace”, co-funded by the European Metrology Research Programme (EMRP). The EMRP is jointly funded by the EMRP-participating countries within EURAMET and the European Union.



Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  1. Istituto Nazionale di Ricerca Metrologica (INRIM), Turin, Italy
