Robust misinterpretation of confidence intervals

Abstract

Null hypothesis significance testing (NHST) is undoubtedly the most common inferential technique used to justify claims in the social sciences. However, even staunch defenders of NHST agree that its outcomes are often misinterpreted. Confidence intervals (CIs) have frequently been proposed as a more useful alternative to NHST, and their use is strongly encouraged in the APA Manual. Nevertheless, little is known about how researchers interpret CIs. In this study, 120 researchers and 442 students—all in the field of psychology—were asked to assess the truth value of six particular statements involving different interpretations of a CI. Although all six statements were false, both researchers and students endorsed, on average, more than three statements, indicating a gross misunderstanding of CIs. Self-declared experience with statistics was not related to researchers’ performance, and, even more surprisingly, researchers hardly outperformed the students, even though the students had not received any education on statistical inference whatsoever. Our findings suggest that many researchers do not know the correct interpretation of a CI. The misunderstandings surrounding p-values and CIs are particularly unfortunate because they constitute the main tools by which psychologists draw conclusions from data.
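The abstract turns on the gap between common misreadings of a CI and its one correct, long-run reading: a procedure that generates 95% confidence intervals captures the true parameter in about 95% of repeated samples, while no single interval carries a 95% probability of containing it. A minimal simulation sketch makes this concrete (all parameter values below are arbitrary illustrations, not taken from the study):

```python
import random
import statistics

# Sketch of the correct frequentist reading of a 95% CI: across many
# repeated samples, roughly 95% of the intervals produced by the
# procedure contain the true population mean. No individual interval
# has a 95% probability of containing it. Parameters are arbitrary.
random.seed(42)
TRUE_MEAN, SIGMA, N, REPS = 100.0, 15.0, 30, 10_000
Z = 1.96  # standard-normal critical value; sigma treated as known

hits = 0
for _ in range(REPS):
    sample = [random.gauss(TRUE_MEAN, SIGMA) for _ in range(N)]
    center = statistics.mean(sample)
    half_width = Z * SIGMA / N ** 0.5
    if center - half_width <= TRUE_MEAN <= center + half_width:
        hits += 1

coverage = hits / REPS
print(coverage)  # long-run coverage, close to 0.95
```

The long-run coverage is a property of the procedure, not of any one realized interval, which is precisely the distinction the questionnaire items in this study probe.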

Fig. 1
Fig. 2

References

  1. Abelson, R. P. (1997). A retrospective on the significance test ban of 1999 (if there were no significance tests, they would be invented). In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? Mahwah, NJ: Erlbaum.

  2. American Psychological Association. (2001). Publication manual of the American Psychological Association (5th ed.). Washington, DC: Author.

  3. American Psychological Association. (2009). Publication manual of the American Psychological Association (6th ed.). Washington, DC: Author.

  4. Belia, S., Fidler, F., Williams, J., & Cumming, G. (2005). Researchers misunderstand confidence intervals and standard error bars. Psychological Methods, 10, 389–396.

  5. Berger, J. O. (2006). The case for objective Bayesian analysis. Bayesian Analysis, 1, 385–402.

  6. Berger, J. O., & Wolpert, R. L. (1988). The likelihood principle (2nd ed.). Hayward, CA: Institute of Mathematical Statistics.

  7. Berkson, J. (1942). Tests of significance considered as evidence. Journal of the American Statistical Association, 37, 325–335.

  8. Blaker, H., & Spjøtvoll, E. (2000). Paradoxes and improvements in interval estimation. The American Statistician, 54, 242–247.

  9. Chow, S. L. (1998). A précis of “Statistical significance: Rationale, validity and utility.” Behavioral and Brain Sciences, 21, 169–194.

  10. Cohen, J. (1994). The earth is round (p <.05). American Psychologist, 49, 997–1003.

  11. Cortina, J. M., & Dunlap, W. P. (1997). On the logic and purpose of significance testing. Psychological Methods, 2, 161–172.

  12. Cumming, G., & Finch, S. (2001). A primer on the understanding, use, and calculation of confidence intervals that are based on central and non-central distributions. Educational and Psychological Measurement, 61, 532–574.

  13. Cumming, G., & Fidler, F. (2009). Confidence intervals: Better answers to better questions. Zeitschrift für Psychologie, 217, 15–26. doi:10.1027/0044-3409.217.1.15

  14. Curran-Everett, D. (2000). Multiple comparisons: Philosophies and illustrations. American Journal of Physiology - Regulatory, Integrative and Comparative Physiology, 279, R1–R8.

  15. Dienes, Z. (2011). Bayesian versus orthodox statistics: Which side are you on? Perspectives on Psychological Science, 6, 274–290.

  16. Edwards, W., Lindman, H., & Savage, L. J. (1963). Bayesian statistical inference for psychological research. Psychological Review, 70, 193–242.

  17. Falk, R., & Greenbaum, C. W. (1995). Significance tests die hard: The amazing persistence of a probabilistic misconception. Theory and Psychology, 5, 75–98.

  18. Fidler, F. (2005). From statistical significance to effect estimation: Statistical reform in psychology, medicine and ecology. Unpublished doctoral dissertation, University of Melbourne, Melbourne.

  19. Fidler, F., & Loftus, G. R. (2009). Why figures with error bars should replace p values: Some conceptual arguments and empirical demonstrations. Zeitschrift für Psychologie, 217, 27–37.

  20. Finch, S., Cumming, G., Williams, J., Palmer, L., Griffith, E., Alders, C., Anderson, J., & Goodman, O. (2004). Reform of statistical inference in psychology: The case of memory and cognition. Behavior Research Methods, Instruments, & Computers, 36, 312–324.

  21. Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Economics, 33, 587–606.

  22. Haller, H., & Krauss, S. (2002). Misinterpretations of significance: A problem students share with their teachers? Methods of Psychological Research Online, 7(1), 1–20. Retrieved May 27, 2013, from www2.uni-jena.de/svw/metheval/lehre/0405-ws/evaluationuebung/haller.pdf

  23. Harlow, L. L., Mulaik, S. A., & Steiger, J. H. (Eds.). (1997). What if there were no significance tests? Mahwah, NJ: Erlbaum.

  24. Hoekstra, R., Finch, S., Kiers, H. A. L., & Johnson, A. (2006). Probability as certainty: Dichotomous thinking and the misuse of p-values. Psychonomic Bulletin & Review, 13, 1033–1037.

  25. Hoekstra, R., Johnson, A., & Kiers, H. A. L. (2012). Confidence intervals make a difference: Effects of showing confidence intervals on inferential reasoning. Educational and Psychological Measurement, 72, 1039–1052.

  26. Hoenig, J. M., & Heisey, D. M. (2001). The abuse of power: The pervasive fallacy of power calculations for data analysis. The American Statistician, 55, 19–24.

  27. Jaynes, E. T. (1976). Confidence intervals vs Bayesian intervals. In W. L. Harper & C. A. Hooker (Eds.), Foundations of Probability Theory, Statistical Inference, and Statistical Theories of Science (pp. 175–257). Dordrecht, The Netherlands: Reidel Publishing Company.

  28. Kalinowski, P. (2010). Identifying misconceptions about confidence intervals. In Proceedings of the Eighth International Conference on Teaching Statistics [CD-ROM]. Ljubljana, Slovenia: IASE. Refereed paper.

  29. Kline, R. B. (2004). Beyond significance testing: Reforming data analysis methods in behavioral research. Washington, DC: American Psychological Association.

  30. Kruschke, J. K., Aguinis, H., & Joo, H. (2012). The time has come: Bayesian methods for data analysis in the organizational sciences. Organizational Research Methods, 15, 722–752. doi:10.1177/1094428112457829

  31. Lecoutre, M.-P., Poitevineau, J., & Lecoutre, B. (2003). Even statisticians are not immune to misinterpretations of null hypothesis tests. International Journal of Psychology, 38, 37–45.

  32. Lindley, D. V. (1965). Introduction to probability and statistics from a Bayesian viewpoint. Part 2: Inference. Cambridge: Cambridge University Press.

  33. Morey, R. D. (2013). The consistency test does not (and cannot) deliver what is advertised: A comment on Francis (2013). Journal of Mathematical Psychology. doi:10.1016/j.jmp.2013.03.004

  34. Nosek, B. A., Spies, J. R., & Motyl, M. (2012). Scientific utopia: II. Restructuring incentives and practices to promote truth over publishability. Perspectives on Psychological Science, 7, 615–631. doi:10.1177/1745691612459058

  35. O’Hagan, A. (2004). Dicing with the unknown. Significance, 1, 132–133.

  36. Oakes, M. (1986). Statistical inference: A commentary for the social and behavioural sciences. Chichester: John Wiley & Sons.

  37. Pashler, H., & Wagenmakers, E.-J. (2012). Editors’ introduction to the special section on replicability in psychological science: A crisis of confidence? Perspectives on Psychological Science, 7, 528–530.

  38. Rosnow, R. L., & Rosenthal, R. (1989). Statistical procedures and the justification of knowledge in psychological science. American Psychologist, 44, 1276–1284.

  39. Scheutz, F., Andersen, B., & Wulff, H. R. (1988). What do dentists know about statistics? Scandinavian Journal of Dental Research, 96, 281–287.

  40. Schmidt, F. L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers. Psychological Methods, 1, 115–129.

  41. Schmidt, F. L., & Hunter, J. E. (1997). Eight common but false objections to the discontinuation of significance testing in the analysis of research data. In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? Mahwah, NJ: Erlbaum.

  42. Sellke, T., Bayarri, M.-J., & Berger, J. O. (2001). Calibration of p values for testing precise null hypotheses. The American Statistician, 55, 62–71.

  43. Stone, M. (1969). The role of significance testing: Some data with a message. Biometrika, 56, 485–493.

  44. Wagenmakers, E.-J. (2007). A practical solution to the pervasive problem of p values. Psychonomic Bulletin & Review, 14, 779–804.

  45. Wilkinson, L., & APA Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594–604.

  46. Winch, R. F., & Campbell, D. T. (1969). Proof? No. Evidence? Yes. The significance of tests of significance. American Sociologist, 4, 140–143.

  47. Wulff, H. R., Andersen, B., Brandenhoff, P., & Guttler, F. (1987). What do doctors know about statistics? Statistics in Medicine, 6, 3–10.

Acknowledgements

This work was supported by the starting grant “Bayes or Bust” awarded by the European Research Council, and by National Science Foundation Grants BCS-1240359 and SES-102408.

Author information

Corresponding author

Correspondence to Rink Hoekstra.

Appendices

Appendix 1 Questionnaire on p-values (Gigerenzer, 2004)

(The scenario and the table are reproduced verbatim from Gigerenzer [2004, p. 594].)

Suppose you have a treatment that you suspect may alter performance on a certain task. You compare the means of your control and experimental groups (say 20 subjects in each sample). Further, suppose you use a simple independent means t-test and your result is significant (t = 2.7, d.f. = 18, p = 0.01). Please mark each of the statements below as “true” or “false.” “False” means that the statement does not follow logically from the above premises. Also note that several or none of the statements may be correct.

1. You have absolutely disproved the null hypothesis (that is, there is no difference between the population means).

[] true/false []

2. You have found the probability of the null hypothesis being true.

[] true/false []

3. You have absolutely proved your experimental hypothesis (that there is a difference between the population means).

[] true/false []

4. You can deduce the probability of the experimental hypothesis being true.

[] true/false []

5. You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision.

[] true/false []

6. You have a reliable experimental finding in the sense that if, hypothetically, the experiment were repeated a great number of times, you would obtain a significant result on 99 % of occasions.

[] true/false []

Appendix 2 Questionnaire on confidence intervals

(The questionnaires for the students were in Dutch, and the researchers could choose between an English and a Dutch version.)

Cite this article

Hoekstra, R., Morey, R.D., Rouder, J.N. et al. Robust misinterpretation of confidence intervals. Psychon Bull Rev 21, 1157–1164 (2014). https://doi.org/10.3758/s13423-013-0572-3

Keywords

  • Confidence intervals
  • Significance testing
  • Inference