Null hypothesis significance tests. A mix-up of two different theories: the basis for widespread confusion and numerous misinterpretations


The statistician cannot excuse himself from the duty of getting his head clear on the principles of scientific inference, but equally no other thinking man can avoid a like obligation (Fisher 1951, p. 2)

Abstract

Null hypothesis statistical significance tests (NHST) are widely used in quantitative research in the empirical sciences, including scientometrics. Nevertheless, since their introduction nearly a century ago, significance tests have been controversial. Many researchers are unaware of the numerous criticisms raised against NHST. As practiced, NHST has been characterized as a ‘null ritual’ that is overused and too often misapplied and misinterpreted. NHST is in fact a patchwork of two fundamentally different classical statistical testing models, often blended with wishful quasi-Bayesian interpretations. This is undoubtedly a major reason why NHST is so often misunderstood. But NHST also has intrinsic logical problems, and the epistemic range of the information provided by such tests is much more limited than most researchers recognize. In this article we introduce the scientometric community to the theoretical origins of NHST, a history mostly absent from standard statistical textbooks. We discuss some of the most prevalent problems in the practice of NHST and trace them back to the mix-up of the two different theoretical origins. Finally, we illustrate some of the misunderstandings with examples from the scientometric literature and put forward some modest recommendations for a sounder practice of quantitative data analysis.

Notes

  1. Notice that other hypotheses to be nullified, such as directional, non-zero, or interval hypotheses, are possible but seldom used; hence the ‘null ritual’.

  2. Statistical power is the probability of rejecting H0 when it is false (Cohen 1988). Statistical power is determined by the α level, the size of the effect, and the size of the sample used to detect it. Together these elements define the probability density function of the test statistic under the alternative hypothesis.
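To make the interplay concrete, power for a two-sided one-sample z-test can be computed directly from α, the standardized effect size d, and the sample size n. The sketch below is an illustration added here (not from the article), using only Python's standard library:

```python
from statistics import NormalDist

def power_one_sample_z(d, n, alpha=0.05):
    """Power of a two-sided one-sample z-test for a standardized
    effect size d = (mu1 - mu0) / sigma and sample size n."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)   # critical value under H0
    shift = d * n ** 0.5                # mean of the z statistic under H1
    # power = P(|Z| > z_crit | H1): mass beyond both critical values
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)

# Cohen's 'medium' effect (d = 0.5) with n = 64 gives power of about 0.98
print(round(power_one_sample_z(0.5, 64), 2))
```

Raising n or d increases power, while tightening α decreases it, which is exactly the dependence the note describes.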

  3. E.g., a sampling design where one either tosses a coin until it produces a pre-specified pattern or instead performs a pre-specified number of tosses. The resulting data can be identical, but the p values will differ.
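A standard textbook version of this point (the specific numbers here are illustrative, not taken from the article) uses 9 heads and 3 tails observed against H0: fair coin. Whether n = 12 was fixed in advance or the experimenter tossed until the third tail appeared changes the p value for identical data:

```python
from math import comb

# Data: 9 heads, 3 tails; H0: the coin is fair (one-sided test for excess heads).

# Design A: n = 12 tosses fixed in advance (binomial sampling).
# p value = P(9 or more heads in 12 tosses | H0)
p_fixed_n = sum(comb(12, k) for k in range(9, 13)) / 2 ** 12   # 299/4096, about 0.073

# Design B: toss until the 3rd tail appears, which happened on toss 12
# (negative binomial sampling).
# p value = P(3rd tail needs 12 or more tosses | H0)
#         = P(at most 2 tails in the first 11 tosses | H0)
p_stop_rule = sum(comb(11, k) for k in range(3)) / 2 ** 11     # 67/2048, about 0.033

print(p_fixed_n > 0.05, p_stop_rule < 0.05)  # same data, opposite 'verdicts' at the 0.05 level
```

The same 12 tosses are "not significant" under one stopping rule and "significant" under the other, because the p value conditions on the sampling plan, not only on the data.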

  4. For example, the instructions to authors in the journal Epidemiology read: “We strongly discourage the use of p values and language referring to statistical significance” (http://edmgr.ovid.com/epid/accounts/ifauth.htm).

References

  • Abelson, R. P. (1997). On the surprising longevity of flogged horses: Why there is a case for the significance test. Psychological Science, 8(1), 12–15.

  • American Psychological Association. (2010). Publication manual of the APA (6th ed.). Washington, DC: APA.

  • Anderson, D. R. (2008). Model based inference in the life sciences: A primer on evidence. New York: Springer.

  • Anderson, D. R., Burnham, K. P., & Thompson, W. L. (2000). Null hypothesis testing: Problems, prevalence, and an alternative. Journal of Wildlife Management, 64, 912–923.

  • Armstrong, J. S. (2007). Significance tests harm progress in forecasting. International Journal of Forecasting, 23(2), 321–327.

  • Armstrong, J. S. (2012). Illusions in regression analysis. International Journal of Forecasting, 28(3), 689–694.

  • Beninger, P. G., Boldina, I., & Katsanevakis, S. (2012). Strengthening statistical usage in marine ecology. Journal of Experimental Marine Biology and Ecology, 426, 97–108.

  • Berger, J. O., & Berry, D. A. (1988). Statistical analysis and the illusion of objectivity. American Scientist, 76(2), 159–165.

  • Berger, J. O., & Sellke, T. (1987). Testing a point null hypothesis: The irreconcilability of p values and evidence. Journal of the American Statistical Association, 82(397), 112–122.

  • Berk, R. A., & Freedman, D. A. (2003). Statistical assumptions as empirical commitments. In T. G. Blomberg & S. Cohen (Eds.), Law, punishment, and social control: Essays in honor of Sheldon Messinger (pp. 235–254). New York: Aldine.

  • Berk, R. A., Western, B., & Weiss, R. E. (1995). Statistical inference for apparent populations. Sociological Methodology, 25, 421–458.

  • Berkson, J. (1938). Some difficulties of interpretation encountered in the application of the chi-square test. Journal of the American Statistical Association, 33(203), 526–536.

  • Berkson, J. (1942). Tests of significance considered as evidence. Journal of the American Statistical Association, 37(219), 325–335.

  • Boring, E. G. (1919). Mathematical versus scientific significance. Psychological Bulletin, 16, 335–338.

  • Bornmann, L., & Leydesdorff, L. (2013). Statistical tests and research assessments: A comment on Schneider (2012). Journal of the American Society for Information Science and Technology, 64(6), 1306–1308.

  • Carver, R. P. (1978). The case against statistical significance testing. Harvard Educational Review, 48(3), 378–399.

  • Chow, S. L. (1998). Précis of Statistical significance: Rationale, validity, and utility. Behavioral and Brain Sciences, 21(2), 169–239.

  • Clark, C. A. (1963). Hypothesis testing in relation to statistical methodology. Review of Educational Research, 33, 455–473.

  • Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.

  • Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45(12), 1304–1312.

  • Cohen, J. (1994). The earth is round (p < 0.05). American Psychologist, 49(12), 1003–1007.

  • Cortina, J. M., & Dunlap, W. P. (1997). On the logic and purpose of significance testing. Psychological Methods, 2(2), 161–172.

  • Cumming, G. (2012). Understanding the new statistics: Effect sizes, confidence intervals, and meta-analysis. New York: Routledge.

  • Dixon, P., & O’Reilly, T. (1999). Scientific versus statistical inference. Canadian Journal of Experimental Psychology, 53(2), 133–149.

  • Ellis, P. D. (2010). The essential guide to effect sizes: Statistical power, meta-analysis, and the interpretation of research results. Cambridge: Cambridge University Press.

  • Falk, R., & Greenbaum, C. W. (1995). Significance tests die hard. Theory and Psychology, 5, 396–400.

  • Fisher, R. A. (1925). Statistical methods for research workers (1st ed.). London: Oliver & Boyd.

  • Fisher, R. A. (1935a). The design of experiments (1st ed.). Edinburgh: Oliver & Boyd.

  • Fisher, R. A. (1935b). Statistical tests. Nature, 136, 474.

  • Fisher, R. A. (1935c). The logic of inductive inference. Journal of the Royal Statistical Society, 98, 71–76.

  • Fisher, R. A. (1951). The design of experiments (6th ed.). Edinburgh: Oliver & Boyd.

  • Fisher, R. A. (1955). Statistical methods and scientific induction. Journal of the Royal Statistical Society B, 17, 69–78.

  • Fisher, R. A. (1956). Statistical methods and scientific inference. London: Oliver & Boyd.

  • Frick, R. W. (1996). The appropriate use of null hypothesis testing. Psychological Methods, 1(4), 379–390.

  • Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2004). Bayesian data analysis. Boca Raton: Chapman & Hall/CRC.

  • Gelman, A., & Stern, H. (2006). The difference between “significant” and “not significant” is not itself statistically significant. The American Statistician, 60(4), 328–331.

  • Gigerenzer, G. (1993). The superego, the ego, and the id in statistical reasoning. In G. Keren & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences: Methodological issues (pp. 311–339). Hillsdale: Erlbaum.

  • Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Economics, 33(5), 587–606.

  • Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., Beatty, J., & Krüger, L. (1989). The empire of chance: How probability changed science and everyday life. New York: Cambridge University Press.

  • Gill, J. (2007). Bayesian methods: A social and behavioral sciences approach (2nd ed.). Boca Raton: Chapman & Hall/CRC.

  • Glass, G. (2006). Meta-analysis: The quantitative synthesis of research findings. In J. L. Green, G. Camilli, & P. B. Elmore (Eds.), Handbook of complementary methods in education research. Mahwah, NJ: Lawrence Erlbaum.

  • Good, I. J. (1950). Probability and the weighing of evidence. London: Griffin.

  • Goodman, S. N. (1993). P values, hypothesis tests, and likelihood: Implications for epidemiology of a neglected historical debate. American Journal of Epidemiology, 137(5), 485–496.

  • Goodman, S. N. (1999a). Toward evidence-based medical statistics. 1: The P value fallacy. Annals of Internal Medicine, 130(12), 995–1004.

  • Goodman, S. N. (1999b). Toward evidence-based medical statistics. 2: The Bayes factor. Annals of Internal Medicine, 130(12), 1005–1013.

  • Goodman, S. N. (2003). Commentary: The P-value, devalued. International Journal of Epidemiology, 32(5), 699–702.

  • Goodman, S. N. (2008). A dirty dozen: Twelve P-value misconceptions. Seminars in Hematology, 45(3), 135–140.

  • Goodman, S. N., & Greenland, S. (2007). Why most published research findings are false: Problems in the analysis. PLoS Medicine, 4(4), e168.

  • Greenland, S. (1990). Randomization, statistics, and causal inference. Epidemiology, 1(6), 421–429.

  • Greenland, S., & Poole, C. (2013). Living with statistics in observational research. Epidemiology, 24(1), 73–78.

  • Hacking, I. (1965). Logic of statistical inference. Cambridge: Cambridge University Press.

  • Haller, H., & Krauss, S. (2002). Misinterpretations of significance: A problem students share with their teachers. Methods of Psychological Research, 7(1), 1–20.

  • Harlow, L. L., Mulaik, S. A., & Steiger, J. H. (Eds.). (1997). What if there were no significance tests? Mahwah: Lawrence Erlbaum.

  • Hubbard, R. (2004). Alphabet soup: Blurring the distinctions between p’s and α’s in psychological research. Theory and Psychology, 14(3), 295–327.

  • Hubbard, R., & Armstrong, J. S. (2006). Why we don’t really know what statistical significance means: Implications for educators. Journal of Marketing Education, 28(2), 114–120.

  • Hubbard, R., & Bayarri, M. J. (2003). Confusion over measures of evidence (p’s) versus errors (α’s) in classical statistical testing. American Statistician, 57(3), 171–178.

  • Hubbard, R., & Lindsay, R. M. (2008). Why P values are not a useful measure of evidence in statistical significance testing. Theory and Psychology, 18(1), 69–88.

  • Hubbard, R., & Ryan, P. A. (2000). The historical growth of statistical significance testing in psychology and its future prospects. Educational and Psychological Measurement, 60, 661–681.

  • Hunter, J. E. (1997). Needed: A ban on the significance test. Psychological Science, 8, 3–7.

  • Hurlbert, S. H., & Lombardi, C. M. (2009). Final collapse of the Neyman–Pearson decision theoretic framework and rise of the neoFisherian. Annales Zoologici Fennici, 46(5), 311–349.

  • Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2(8), 696–701.

  • Jeffreys, H. (1939). The theory of probability (1st ed.). Oxford: Oxford University Press.

  • Jeffreys, H. (1961). The theory of probability (3rd ed.). Oxford: Oxford University Press.

  • Kirk, R. E. (1996). Practical significance: A concept whose time has come. Educational and Psychological Measurement, 56(5), 746–759.

  • Kline, R. B. (2004). Beyond significance testing: Reforming data analysis methods in behavioral research. Washington, DC: American Psychological Association.

  • Kline, R. B. (2013). Beyond significance testing: Reforming data analysis methods in behavioral research (2nd ed.). Washington, DC: American Psychological Association.

  • Krämer, W., & Gigerenzer, G. (2005). How to confuse with statistics or: The use and misuse of conditional probabilities. Statistical Science, 20(3), 223–230.

  • Kruschke, J. K. (2010). What to believe: Bayesian methods for data analysis. Trends in Cognitive Sciences, 14(7), 293–300.

  • Lehmann, E. L. (1993). The Fisher, Neyman–Pearson theories of testing hypotheses: One theory or two? Journal of the American Statistical Association, 88(424), 1242–1249.

  • Leydesdorff, L. (2013). Does the specification of uncertainty hurt the progress of scientometrics? Journal of Informetrics, 7(2), 292–293.

  • Lindley, D. (1957). A statistical paradox. Biometrika, 44, 187–192.

  • Ludwig, D. A. (2005). Use and misuse of p-values in designed and observational studies: Guide for researchers and reviewers. Aviation, Space and Environmental Medicine, 76(7), 675–680.

  • Lykken, D. T. (1968). Statistical significance in psychological research. Psychological Bulletin, 70(3, Pt. 1), 151–159.

  • Mayo, D. (1996). Error and the growth of experimental knowledge. Chicago: University of Chicago Press.

  • Mayo, D. (2006). Philosophy of statistics. In S. Sarkar & J. Pfeifer (Eds.), The philosophy of science: An encyclopedia (pp. 802–815). London: Routledge.

  • Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806–834.

  • Meehl, P. E. (1990). Appraising and amending theories: The strategy of Lakatosian defense and two principles that warrant it. Psychological Inquiry, 1, 108–141.

  • Morrison, D. E., & Henkel, R. E. (Eds.). (1970). The significance test controversy. Chicago: Aldine.

  • Neyman, J. (1937). Outline of a theory of statistical estimation based on the classical theory of probability. Philosophical Transactions of the Royal Society A, 236, 333–380.

  • Neyman, J., & Pearson, E. S. (1928). On the use and interpretation of certain test criteria for purposes of statistical inference, Part I. Biometrika, 20A, 175–240.

  • Neyman, J., & Pearson, E. S. (1933a). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London A, 231, 289–337.

  • Neyman, J., & Pearson, E. S. (1933b). The testing of statistical hypotheses in relation to probabilities a priori. Proceedings of the Cambridge Philosophical Society, 29, 492–510.

  • Nickerson, R. S. (2000). Null hypothesis significance testing: A review of an old and continuing controversy. Psychological Methods, 5(2), 241–301.

  • Oakes, M. (1986). Statistical inference: A commentary for the social and behavioral sciences. New York: Wiley.

  • Pollard, P., & Richardson, J. T. E. (1987). On the probability of making Type I errors. Psychological Bulletin, 102, 159–163.

  • Rosnow, R. L., & Rosenthal, R. (1989). Statistical procedures and the justification of knowledge in psychological science. American Psychologist, 44(10), 1276–1284.

  • Royall, R. M. (1997). Statistical evidence: A likelihood paradigm. London: Chapman & Hall.

  • Rozeboom, W. W. (1960). The fallacy of the null-hypothesis significance test. Psychological Bulletin, 57(5), 416–428.

  • Scarr, S. (1997). Rules of evidence: A larger context for the statistical debate. Psychological Science, 8, 16–17.

  • Schneider, J. W. (2012). Testing university rankings statistically: Why this perhaps is not such a good idea after all. Some reflections on statistical power, effect size, random sampling and imaginary populations. In É. Archambault, Y. Gingras, & V. Larivière (Eds.), Proceedings of the 17th international conference on science and technology indicators, Montreal. Retrieved from http://2012.sticonference.org/Proceedings/vol2/Schneider_Testing_719.pdf.

  • Schneider, J. W. (2013). Caveats for using statistical significance tests in research assessments. Journal of Informetrics, 7(1), 50–62.

  • Schneider, A. L., & Darcy, R. E. (1984). Policy implications of using significance tests in evaluation research. Evaluation Review, 8(4), 573–582.

  • Schrodt, P. A. (2006). Beyond the linear frequentist orthodoxy. Political Analysis, 14(3), 335–339.

  • Schwab, A., Abrahamson, E., Starbuck, W. H., & Fidler, F. (2011). Researchers should make thoughtful assessments instead of null-hypothesis significance tests. Organization Science, 22(4), 1105–1120.

  • Sellke, T., Bayarri, M. J., & Berger, J. O. (2001). Calibration of p values for testing precise null hypotheses. The American Statistician, 55, 62–71.

  • Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366.

  • Spielman, S. (1974). The logic of tests of significance. Philosophy of Science, 41, 211–226.

  • Starbuck, W. H. (2006). The production of knowledge: The challenge of social science research. Oxford: Oxford University Press.

  • Taagepera, R. (2008). Making social sciences more scientific: The need for predictive models. Oxford: Oxford University Press.

  • Tukey, J. W. (1977). Exploratory data analysis. Reading: Addison-Wesley.

  • Tukey, J. W. (1991). The philosophy of multiple comparisons. Statistical Science, 6(1), 100–116.

  • Wagenmakers, E. J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14(5), 779–804.

  • Webster, E. J., & Starbuck, W. H. (1988). Theory building in industrial and organizational psychology. In C. L. Cooper & I. Robertson (Eds.), International review of industrial and organizational psychology (pp. 93–138). London: Wiley.

  • Wetzels, R., Matzke, D., Lee, M. D., Rouder, J. N., Iverson, G. J., & Wagenmakers, E.-J. (2011). Statistical evidence in experimental psychology: An empirical comparison using 855 t tests. Perspectives on Psychological Science, 6(3), 291–298.

  • Wilkinson, L., & Task Force on Statistical Inference, APA Board on Scientific Affairs. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54(8), 594–604.

  • Ziliak, S. T., & McCloskey, D. N. (2008). The cult of statistical significance: How the standard error costs us jobs, justice, and lives. Ann Arbor: The University of Michigan Press.

Author information

Corresponding author

Correspondence to Jesper W. Schneider.

Cite this article

Schneider, J.W. Null hypothesis significance tests. A mix-up of two different theories: the basis for widespread confusion and numerous misinterpretations. Scientometrics 102, 411–432 (2015). https://doi.org/10.1007/s11192-014-1251-5
