Scientometrics, Volume 102, Issue 1, pp 411–432

Null hypothesis significance tests. A mix-up of two different theories: the basis for widespread confusion and numerous misinterpretations

  • Jesper W. Schneider

Abstract

Null hypothesis statistical significance tests (NHST) are widely used in quantitative research in the empirical sciences, including scientometrics. Nevertheless, since their introduction nearly a century ago, significance tests have been controversial. Many researchers are not aware of the numerous criticisms raised against NHST. As practiced, NHST has been characterized as a ‘null ritual’ that is overused and too often misapplied and misinterpreted. NHST is in fact a patchwork of two fundamentally different classical statistical testing models, often blended with some wishful quasi-Bayesian interpretations. This is undoubtedly a major reason why NHST is so often misunderstood. But NHST also has intrinsic logical problems, and the epistemic range of the information provided by such tests is much more limited than most researchers recognize. In this article we introduce to the scientometric community the theoretical origins of NHST, which are mostly absent from standard statistical textbooks, and we discuss some of the most prevalent problems relating to the practice of NHST, tracing these problems back to the mix-up of the two different theoretical origins. Finally, we illustrate some of the misunderstandings with examples from the scientometric literature and offer some modest recommendations for sounder practice in quantitative data analysis.
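
To make the distinction concrete, the following is a minimal sketch (not from the article, using hypothetical data) of the two testing logics the abstract argues are conflated in practice: Fisher's graded evidential reading of the p value versus Neyman–Pearson's pre-specified α decision rule. The two-sample t test via SciPy is an illustrative choice only.

```python
# Illustrative sketch (not from the article): the two testing models that
# the paper argues are conflated in common NHST practice.
# Fisher's significance test reports a p value as a graded measure of
# evidence against H0; Neyman-Pearson's hypothesis test fixes alpha (and
# beta, via power planning) in advance and returns only a binary decision.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical data: log-transformed citation counts for two groups of papers.
group_a = rng.normal(loc=2.0, scale=1.0, size=40)
group_b = rng.normal(loc=2.4, scale=1.0, size=40)

t_stat, p_value = stats.ttest_ind(group_a, group_b)

# Fisherian reading: p is data-dependent evidence against H0, with no
# fixed cutoff; a smaller p means stronger grounds to doubt H0.
print(f"Fisher: t = {t_stat:.2f}, p = {p_value:.3f} (graded evidence)")

# Neyman-Pearson reading: alpha is chosen BEFORE seeing the data; the
# output is a behavioural accept/reject decision, not a measure of
# evidence, and the exact p value has no further interpretive role.
alpha = 0.05
decision = "reject H0" if p_value < alpha else "retain H0"
print(f"Neyman-Pearson: alpha = {alpha}, decision: {decision}")

# The hybrid 'null ritual' the paper criticises mixes the two: a fixed
# alpha is used, but the p value is then (mis)read as evidence, or even
# as P(H0 | data), which neither framework licenses.
```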

Keywords

Null hypothesis significance test · Fisher’s significance test · Neyman–Pearson’s hypothesis test · Statistical inference · Scientometrics

Mathematics Subject Classification

97K70 

JEL Classification

C120 


Copyright information

© Akadémiai Kiadó, Budapest, Hungary 2014

Authors and Affiliations

  1. Department of Political Science & Government, Danish Centre for Studies in Research and Research Policy, Aarhus University, Aarhus C, Denmark
