Psychonomic Bulletin & Review, Volume 21, Issue 5, pp 1157–1164

Robust misinterpretation of confidence intervals

  • Rink Hoekstra
  • Richard D. Morey
  • Jeffrey N. Rouder
  • Eric-Jan Wagenmakers
Brief Report

Abstract

Null hypothesis significance testing (NHST) is undoubtedly the most common inferential technique used to justify claims in the social sciences. However, even staunch defenders of NHST agree that its outcomes are often misinterpreted. Confidence intervals (CIs) have frequently been proposed as a more useful alternative to NHST, and their use is strongly encouraged in the APA Manual. Nevertheless, little is known about how researchers interpret CIs. In this study, 120 researchers and 442 students—all in the field of psychology—were asked to assess the truth value of six particular statements involving different interpretations of a CI. Although all six statements were false, both researchers and students endorsed, on average, more than three statements, indicating a gross misunderstanding of CIs. Self-declared experience with statistics was not related to researchers’ performance, and, even more surprisingly, researchers hardly outperformed the students, even though the students had not received any education on statistical inference whatsoever. Our findings suggest that many researchers do not know the correct interpretation of a CI. The misunderstandings surrounding p-values and CIs are particularly unfortunate because they constitute the main tools by which psychologists draw conclusions from data.
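
In frequentist terms, the 95% in a 95% CI describes the long-run coverage of the interval-generating procedure across repeated samples. Any single computed interval either contains the true parameter or does not. The following is a minimal simulation sketch of that property, not the study's materials or method. It assumes normally distributed data with a known standard deviation, so that a simple z-interval applies. The helper names `z_interval` and `coverage` and all parameter values (`true_mu`, `sigma`, `n`, `reps`, `seed`) are hypothetical illustrations.

```python
import random
from statistics import NormalDist, fmean


def z_interval(sample, sigma, level=0.95):
    """Return (lower, upper) of a z-based CI for the mean, assuming sigma is known."""
    n = len(sample)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 when level = 0.95
    half_width = z * sigma / n ** 0.5
    m = fmean(sample)
    return m - half_width, m + half_width


def coverage(true_mu=0.0, sigma=1.0, n=20, reps=10_000, level=0.95, seed=1):
    """Fraction of simulated intervals that contain true_mu (long-run coverage)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(true_mu, sigma) for _ in range(n)]
        lo, hi = z_interval(sample, sigma, level)
        # Each individual interval either covers true_mu (True) or not (False);
        # the "95%" only emerges from the proportion over many repetitions.
        hits += lo <= true_mu <= hi
    return hits / reps


if __name__ == "__main__":
    print(f"Observed coverage: {coverage():.3f}")  # close to 0.95; varies with seed
```

The simulation keeps the true mean fixed and lets the intervals vary, which mirrors the frequentist reading: the probability statement is about the procedure, not about where the parameter lies relative to one particular interval.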

Keywords

Confidence intervals, Significance testing, Inference

Copyright information

© Psychonomic Society, Inc. 2014

Authors and Affiliations

  • Rink Hoekstra (1)
  • Richard D. Morey (1)
  • Jeffrey N. Rouder (2)
  • Eric-Jan Wagenmakers (1)

  1. University of Groningen, Groningen, The Netherlands
  2. University of Missouri, Columbia, USA