Null hypothesis significance tests. A mix-up of two different theories: the basis for widespread confusion and numerous misinterpretations


The statistician cannot excuse himself from the duty of getting his head clear on the principles of scientific inference, but equally no other thinking man can avoid a like obligation (Fisher 1951, p. 2)

Abstract

Null hypothesis statistical significance tests (NHST) are widely used in quantitative research in the empirical sciences, including scientometrics. Nevertheless, since their introduction nearly a century ago, significance tests have been controversial. Many researchers are unaware of the numerous criticisms raised against NHST. As practiced, NHST has been characterized as a ‘null ritual’ that is overused and too often misapplied and misinterpreted. NHST is in fact a patchwork of two fundamentally different classical statistical testing models, often blended with wishful quasi-Bayesian interpretations. This is undoubtedly a major reason why NHST is so often misunderstood. But NHST also has intrinsic logical problems, and the epistemic range of the information provided by such tests is much more limited than most researchers recognize. In this article we introduce the scientometric community to the theoretical origins of NHST, a history mostly absent from standard statistical textbooks. We discuss some of the most prevalent problems in the practice of NHST and trace them back to the mix-up of the two different theoretical origins. Finally, we illustrate some of the misunderstandings with examples from the scientometric literature and put forward some modest recommendations for a sounder practice of quantitative data analysis.

Notes

  1. Notice that other hypotheses to be nullified, such as directional, non-zero, or interval hypotheses, are possible but seldom used; hence the ‘null ritual’.

  2. Statistical power is the probability of rejecting H0 when it is false (Cohen 1988). Statistical power is determined by the α level, the size of the effect, and the size of the sample used to detect it. Together these elements define the probability density function of the test statistic under the alternative hypothesis.
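To make the interplay concrete, power for a two-sided one-sample z-test can be computed directly from α, the standardized effect size d, and the sample size n. The sketch below is an illustration added here (not from the article), using only Python's standard library:

```python
from statistics import NormalDist

def power_one_sample_z(d, n, alpha=0.05):
    """Power of a two-sided one-sample z-test for a standardized
    effect size d = (mu1 - mu0) / sigma and sample size n."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)   # critical value under H0
    shift = d * n ** 0.5                # mean of the z statistic under H1
    # power = P(|Z| > z_crit | H1): mass beyond both critical values
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)

# Cohen's 'medium' effect (d = 0.5) with n = 64 gives power of about 0.98
print(round(power_one_sample_z(0.5, 64), 2))
```

Raising n or d increases power, while tightening α decreases it, which is exactly the dependence the note describes.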

  3. E.g., a sampling design where one either tosses a coin until it produces a pre-specified pattern or instead performs a pre-specified number of tosses. The resulting data can be identical, but the p values will differ.
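A standard textbook version of this point (the specific numbers here are illustrative, not taken from the article) uses 9 heads and 3 tails observed against H0: fair coin. Whether n = 12 was fixed in advance or the experimenter tossed until the third tail appeared changes the p value for identical data:

```python
from math import comb

# Data: 9 heads, 3 tails; H0: the coin is fair (one-sided test for excess heads).

# Design A: n = 12 tosses fixed in advance (binomial sampling).
# p value = P(9 or more heads in 12 tosses | H0)
p_fixed_n = sum(comb(12, k) for k in range(9, 13)) / 2 ** 12   # 299/4096, about 0.073

# Design B: toss until the 3rd tail appears, which happened on toss 12
# (negative binomial sampling).
# p value = P(3rd tail needs 12 or more tosses | H0)
#         = P(at most 2 tails in the first 11 tosses | H0)
p_stop_rule = sum(comb(11, k) for k in range(3)) / 2 ** 11     # 67/2048, about 0.033

print(p_fixed_n > 0.05, p_stop_rule < 0.05)  # same data, opposite 'verdicts' at the 0.05 level
```

The same 12 tosses are "not significant" under one stopping rule and "significant" under the other, because the p value conditions on the sampling plan, not only on the data.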

  4. For example, the instructions to authors in the journal Epidemiology read: “We strongly discourage the use of p values and language referring to statistical significance” (http://edmgr.ovid.com/epid/accounts/ifauth.htm).

References

  • Abelson, R. P. (1997). On the surprising longevity of flogged horses: Why there is a case for the significance test. Psychological Science, 8(1), 12–15.

  • American Psychological Association. (2010). Publication manual of the APA (6th ed.). Washington, DC: APA.

  • Anderson, D. R. (2008). Model based inference in the life sciences: A primer on evidence. New York: Springer.

  • Anderson, D. R., Burnham, K. P., & Thompson, W. L. (2000). Null hypothesis testing: Problems, prevalence, and an alternative. Journal of Wildlife Management, 64, 912–923.

  • Armstrong, J. S. (2007). Significance tests harm progress in forecasting. International Journal of Forecasting, 23(2), 321–327.

  • Armstrong, J. S. (2012). Illusions in regression analysis. International Journal of Forecasting, 28(3), 689–694.

  • Beninger, P. G., Boldina, I., & Katsanevakis, S. (2012). Strengthening statistical usage in marine ecology. Journal of Experimental Marine Biology and Ecology, 426, 97–108.

  • Berger, J. O., & Berry, D. A. (1988). Statistical analysis and the illusion of objectivity. American Scientist, 76(2), 159–165.

  • Berger, J. O., & Sellke, T. (1987). Testing a point null hypothesis: The irreconcilability of p values and evidence. Journal of the American Statistical Association, 82(397), 112–122.

  • Berk, R. A., & Freedman, D. A. (2003). Statistical assumptions as empirical commitments. In T. G. Blomberg & S. Cohen (Eds.), Law, punishment, and social control: Essays in honor of Sheldon Messinger (pp. 235–254). New York: Aldine.

  • Berk, R. A., Western, B., & Weiss, R. E. (1995). Statistical inference for apparent populations. Sociological Methodology, 25, 421–458.

  • Berkson, J. (1938). Some difficulties of interpretation encountered in the application of the chi-square test. Journal of the American Statistical Association, 33(203), 526–536.

  • Berkson, J. (1942). Tests of significance considered as evidence. Journal of the American Statistical Association, 37(219), 325–335.

  • Boring, E. G. (1919). Mathematical versus scientific significance. Psychological Bulletin, 16, 335–338.

  • Bornmann, L., & Leydesdorff, L. (2013). Statistical tests and research assessments: A comment on Schneider (2012). Journal of the American Society for Information Science and Technology, 64(6), 1306–1308.

  • Carver, R. P. (1978). The case against statistical significance testing. Harvard Educational Review, 48(3), 378–399.

  • Chow, S. L. (1998). Précis of Statistical significance: Rationale, validity, and utility. Behavioral and Brain Sciences, 21(2), 169–239.

  • Clark, C. A. (1963). Hypothesis testing in relation to statistical methodology. Review of Educational Research, 33, 455–473.

  • Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.

  • Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45(12), 1304–1312.

  • Cohen, J. (1994). The earth is round (p < 0.05). American Psychologist, 49(12), 1003–1007.

  • Cortina, J. M., & Dunlap, W. P. (1997). On the logic and purpose of significance testing. Psychological Methods, 2(2), 161–172.

  • Cumming, G. (2012). Understanding the new statistics: Effect sizes, confidence intervals, and meta-analysis. New York: Routledge.

  • Dixon, P., & O’Reilly, T. (1999). Scientific versus statistical inference. Canadian Journal of Experimental Psychology, 53(2), 133–149.

  • Ellis, P. D. (2010). The essential guide to effect sizes: Statistical power, meta-analysis, and the interpretation of research results. Cambridge: Cambridge University Press.

  • Falk, R., & Greenbaum, C. W. (1995). Significance tests die hard. Theory and Psychology, 5, 396–400.

  • Fisher, R. A. (1925). Statistical methods for research workers (1st ed.). London: Oliver & Boyd.

  • Fisher, R. A. (1935a). The design of experiments (1st ed.). Edinburgh: Oliver & Boyd.

  • Fisher, R. A. (1935b). Statistical tests. Nature, 136, 474.

  • Fisher, R. A. (1935c). The logic of inductive inference. Journal of the Royal Statistical Society, 98, 71–76.

  • Fisher, R. A. (1951). The design of experiments (6th ed.). Edinburgh: Oliver & Boyd.

  • Fisher, R. A. (1955). Statistical methods and scientific induction. Journal of the Royal Statistical Society B, 17, 69–78.

  • Fisher, R. A. (1956). Statistical methods and scientific inference. London: Oliver & Boyd.

  • Frick, R. W. (1996). The appropriate use of null hypothesis testing. Psychological Methods, 1(4), 379–390.

  • Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2004). Bayesian data analysis. Boca Raton: Chapman & Hall/CRC.

  • Gelman, A., & Stern, H. (2006). The difference between “significant” and “not significant” is not itself statistically significant. The American Statistician, 60(4), 328–331.

  • Gigerenzer, G. (1993). The superego, the ego, and the id in statistical reasoning. In G. Keren & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences: Methodological issues (pp. 311–339). Hillsdale: Erlbaum.

  • Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Economics, 33(5), 587–606.

  • Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., Beatty, J., & Krüger, L. (1989). The empire of chance: How probability changed science and everyday life. New York: Cambridge University Press.

  • Gill, J. (2007). Bayesian methods: A social and behavioral sciences approach (2nd ed.). Boca Raton: Chapman & Hall/CRC.

  • Glass, G. (2006). Meta-analysis: The quantitative synthesis of research findings. In J. L. Green, G. Camilli, & P. B. Elmore (Eds.), Handbook of complementary methods in education research. Mahwah, NJ: Lawrence Erlbaum.

  • Good, I. J. (1950). Probability and the weighing of evidence. London: Griffin.

  • Goodman, S. N. (1993). P values, hypothesis tests, and likelihood: Implications for epidemiology of a neglected historical debate. American Journal of Epidemiology, 137(5), 485–496.

  • Goodman, S. N. (1999a). Toward evidence-based medical statistics. 1: The P value fallacy. Annals of Internal Medicine, 130(12), 995–1004.

  • Goodman, S. N. (1999b). Toward evidence-based medical statistics. 2: The Bayes factor. Annals of Internal Medicine, 130(12), 1005–1013.

  • Goodman, S. N. (2003). Commentary: The P-value, devalued. International Journal of Epidemiology, 32(5), 699–702.

  • Goodman, S. N. (2008). A dirty dozen: Twelve P-value misconceptions. Seminars in Hematology, 45(3), 135–140.

  • Goodman, S. N., & Greenland, S. (2007). Why most published research findings are false: Problems in the analysis. PLoS Medicine, 4(4), e168.

  • Greenland, S. (1990). Randomization, statistics, and causal inference. Epidemiology, 1(6), 421–429.

  • Greenland, S., & Poole, C. (2013). Living with statistics in observational research. Epidemiology, 24(1), 73–78.

  • Hacking, I. (1965). Logic of statistical inference. Cambridge: Cambridge University Press.

  • Haller, H., & Krauss, S. (2002). Misinterpretations of significance: A problem students share with their teachers. Methods of Psychological Research, 7(1), 1–20.

  • Harlow, L. L., Mulaik, S. A., & Steiger, J. H. (Eds.). (1997). What if there were no significance tests? Mahwah: Lawrence Erlbaum.

  • Hubbard, R. (2004). Alphabet soup: Blurring the distinctions between p’s and α’s in psychological research. Theory and Psychology, 14(3), 295–327.

  • Hubbard, R., & Armstrong, J. S. (2006). Why we don’t really know what statistical significance means: Implications for educators. Journal of Marketing Education, 28(2), 114–120.

  • Hubbard, R., & Bayarri, M. J. (2003). Confusion over measures of evidence (p’s) versus errors (α’s) in classical statistical testing. American Statistician, 57(3), 171–178.

  • Hubbard, R., & Lindsay, R. M. (2008). Why P values are not a useful measure of evidence in statistical significance testing. Theory and Psychology, 18(1), 69–88.

  • Hubbard, R., & Ryan, P. A. (2000). The historical growth of statistical significance testing in psychology and its future prospects. Educational and Psychological Measurement, 60, 661–681.

  • Hunter, J. E. (1997). Needed: A ban on the significance test. Psychological Science, 8, 3–7.

  • Hurlbert, S. H., & Lombardi, C. M. (2009). Final collapse of the Neyman–Pearson decision theoretic framework and rise of the neoFisherian. Annales Zoologici Fennici, 46(5), 311–349.

  • Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2(8), 696–701.

  • Jeffreys, H. (1939). The theory of probability (1st ed.). Oxford: Oxford University Press.

  • Jeffreys, H. (1961). The theory of probability (3rd ed.). Oxford: Oxford University Press.

  • Kirk, R. E. (1996). Practical significance: A concept whose time has come. Educational and Psychological Measurement, 56(5), 746–759.

  • Kline, R. B. (2004). Beyond significance testing: Reforming data analysis methods in behavioral research. Washington, DC: American Psychological Association.

  • Kline, R. B. (2013). Beyond significance testing: Reforming data analysis methods in behavioral research (2nd ed.). Washington, DC: American Psychological Association.

  • Krämer, W., & Gigerenzer, G. (2005). How to confuse with statistics or: The use and misuse of conditional probabilities. Statistical Science, 20(3), 223–230.

  • Kruschke, J. K. (2010). What to believe: Bayesian methods for data analysis. Trends in Cognitive Sciences, 14(7), 293–300.

  • Lehmann, E. L. (1993). The Fisher, Neyman–Pearson theories of testing hypotheses: One theory or two? Journal of the American Statistical Association, 88(424), 1242–1249.

  • Leydesdorff, L. (2013). Does the specification of uncertainty hurt the progress of scientometrics? Journal of Informetrics, 7(2), 292–293.

  • Lindley, D. (1957). A statistical paradox. Biometrika, 44, 187–192.

  • Ludwig, D. A. (2005). Use and misuse of p-values in designed and observational studies: Guide for researchers and reviewers. Aviation, Space and Environmental Medicine, 76(7), 675–680.

  • Lykken, D. T. (1968). Statistical significance in psychological research. Psychological Bulletin, 70(3, Pt. 1), 151–159.

  • Mayo, D. (1996). Error and the growth of experimental knowledge. Chicago: University of Chicago Press.

  • Mayo, D. (2006). Philosophy of statistics. In S. Sarkar & J. Pfeifer (Eds.), The philosophy of science: An encyclopedia (pp. 802–815). London: Routledge.

  • Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806–834.

  • Meehl, P. E. (1990). Appraising and amending theories: The strategy of Lakatosian defense and two principles that warrant it. Psychological Inquiry, 1, 108–141.

  • Morrison, D. E., & Henkel, R. E. (Eds.). (1970). The significance test controversy. Chicago: Aldine.

  • Neyman, J. (1937). Outline of a theory of statistical estimation based on the classical theory of probability. Philosophical Transactions of the Royal Society A, 236, 333–380.

  • Neyman, J., & Pearson, E. S. (1928). On the use and interpretation of certain test criteria for purposes of statistical inference, Part I. Biometrika, 20A, 175–240.

  • Neyman, J., & Pearson, E. S. (1933a). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London A, 231, 289–337.

  • Neyman, J., & Pearson, E. S. (1933b). The testing of statistical hypotheses in relation to probabilities a priori. Proceedings of the Cambridge Philosophical Society, 29, 492–510.

  • Nickerson, R. S. (2000). Null hypothesis significance testing: A review of an old and continuing controversy. Psychological Methods, 5(2), 241–301.

  • Oakes, M. (1986). Statistical inference: A commentary for the social and behavioral sciences. New York: Wiley.

  • Pollard, P., & Richardson, J. T. E. (1987). On the probability of making Type I errors. Psychological Bulletin, 102, 159–163.

  • Rosnow, R. L., & Rosenthal, R. (1989). Statistical procedures and the justification of knowledge in psychological science. American Psychologist, 44(10), 1276–1284.

  • Royall, R. M. (1997). Statistical evidence: A likelihood paradigm. London: Chapman & Hall.

  • Rozeboom, W. W. (1960). The fallacy of the null-hypothesis significance test. Psychological Bulletin, 57(5), 416–428.

  • Scarr, S. (1997). Rules of evidence: A larger context for the statistical debate. Psychological Science, 8, 16–17.

  • Schneider, J. W. (2012). Testing university rankings statistically: Why this perhaps is not such a good idea after all. Some reflections on statistical power, effect size, random sampling and imaginary populations. In É. Archambault, Y. Gingras, & V. Larivière (Eds.), Proceedings of the 17th international conference on science and technology indicators, Montreal. Retrieved from http://2012.sticonference.org/Proceedings/vol2/Schneider_Testing_719.pdf.

  • Schneider, J. W. (2013). Caveats for using statistical significance tests in research assessments. Journal of Informetrics, 7(1), 50–62.

  • Schneider, A. L., & Darcy, R. E. (1984). Policy implications of using significance tests in evaluation research. Evaluation Review, 8(4), 573–582.

  • Schrodt, P. A. (2006). Beyond the linear frequentist orthodoxy. Political Analysis, 14(3), 335–339.

  • Schwab, A., Abrahamson, E., Starbuck, W. H., & Fidler, F. (2011). Researchers should make thoughtful assessments instead of null-hypothesis significance tests. Organization Science, 22(4), 1105–1120.

  • Sellke, T., Bayarri, M. J., & Berger, J. O. (2001). Calibration of p values for testing precise null hypotheses. The American Statistician, 55, 62–71.

  • Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366.

  • Spielman, S. (1974). The logic of tests of significance. Philosophy of Science, 41, 211–226.

  • Starbuck, W. H. (2006). The production of knowledge: The challenge of social science research. Oxford: Oxford University Press.

  • Taagepera, R. (2008). Making social sciences more scientific: The need for predictive models. Oxford: Oxford University Press.

  • Tukey, J. W. (1977). Exploratory data analysis. Reading: Addison-Wesley.

  • Tukey, J. W. (1991). The philosophy of multiple comparisons. Statistical Science, 6(1), 100–116.

  • Wagenmakers, E. J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14(5), 779–804.

  • Webster, E. J., & Starbuck, W. H. (1988). Theory building in industrial and organizational psychology. In C. L. Cooper & I. Robertson (Eds.), International review of industrial and organizational psychology (pp. 93–138). London: Wiley.

  • Wetzels, R., Matzke, D., Lee, M. D., Rouder, J. N., Iverson, G. J., & Wagenmakers, E.-J. (2011). Statistical evidence in experimental psychology: An empirical comparison using 855 t tests. Perspectives on Psychological Science, 6(3), 291–298.

  • Wilkinson, L., & Task Force on Statistical Inference, APA Board on Scientific Affairs. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54(8), 594–604.

  • Ziliak, S. T., & McCloskey, D. N. (2008). The cult of statistical significance: How the standard error costs us jobs, justice, and lives. Ann Arbor: The University of Michigan Press.

Author information

Corresponding author

Correspondence to Jesper W. Schneider.

Cite this article

Schneider, J.W. Null hypothesis significance tests. A mix-up of two different theories: the basis for widespread confusion and numerous misinterpretations. Scientometrics 102, 411–432 (2015). https://doi.org/10.1007/s11192-014-1251-5
