Null hypothesis significance testing (NHST) is undoubtedly the most common inferential technique used to justify claims in the social sciences. However, even staunch defenders of NHST agree that its outcomes are often misinterpreted. Confidence intervals (CIs) have frequently been proposed as a more useful alternative to NHST, and their use is strongly encouraged in the APA Manual. Nevertheless, little is known about how researchers interpret CIs. In this study, 120 researchers and 442 students—all in the field of psychology—were asked to assess the truth value of six particular statements involving different interpretations of a CI. Although all six statements were false, both researchers and students endorsed, on average, more than three statements, indicating a gross misunderstanding of CIs. Self-declared experience with statistics was not related to researchers’ performance, and, even more surprisingly, researchers hardly outperformed the students, even though the students had not received any education on statistical inference whatsoever. Our findings suggest that many researchers do not know the correct interpretation of a CI. The misunderstandings surrounding p-values and CIs are particularly unfortunate because they constitute the main tools by which psychologists draw conclusions from data.
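For orientation, the standard frequentist reading of a confidence level is a statement about the long-run behavior of the interval-building procedure, not about any single computed interval. The sketch below is an illustration only and is not taken from the study: it assumes normally distributed data with a known standard deviation, and the function name `coverage` and all parameter values are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean


def coverage(mu, sigma, n, reps, level=0.95, seed=1):
    """Share of simulated z-intervals (known sigma) that contain the true mean."""
    rng = random.Random(seed)
    # Two-sided critical value of the standard normal, about 1.96 for 95%.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sigma / math.sqrt(n)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        sample_mean = fmean(sample)
        # The interval is random; the true mean mu is fixed.
        if sample_mean - half_width <= mu <= sample_mean + half_width:
            hits += 1
    return hits / reps


# Hypothetical population values chosen only for illustration.
rate = coverage(mu=100.0, sigma=15.0, n=25, reps=5000)
print(f"share of intervals containing the true mean: {rate:.3f}")
```

Across many repetitions the share of intervals that contain the true mean should sit near the nominal level. Each individual interval either contains the true mean or does not, which is why the confidence level cannot be read as a probability statement about one particular interval.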
Abelson, R. P. (1997). A retrospective on the significance test ban of 1999 (if there were no significance tests, they would be invented). In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? Mahwah, NJ: Erlbaum.
American Psychological Association. (2001). Publication manual of the American Psychological Association (5th ed.). Washington, DC: Author.
American Psychological Association. (2009). Publication manual of the American Psychological Association (6th ed.). Washington, DC: Author.
Belia, S., Fidler, F., Williams, J., & Cumming, G. (2005). Researchers misunderstand confidence intervals and standard error bars. Psychological Methods, 10, 389–396.
Berger, J. O. (2006). The case for objective Bayesian analysis. Bayesian Analysis, 1, 385–402.
Berger, J. O., & Wolpert, R. L. (1988). The likelihood principle (2nd ed.). Hayward, CA: Institute of Mathematical Statistics.
Berkson, J. (1942). Tests of significance considered as evidence. Journal of the American Statistical Association, 37, 325–335.
Blaker, H., & Spjøtvoll, E. (2000). Paradoxes and improvements in interval estimation. The American Statistician, 54, 242–247.
Chow, S. L. (1998). A précis of “Statistical significance: Rationale, validity and utility.” Behavioral and Brain Sciences, 21, 169–194.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003.
Cortina, J. M., & Dunlap, W. P. (1997). On the logic and purpose of significance testing. Psychological Methods, 2, 161–172.
Cumming, G., & Fidler, F. (2009). Confidence intervals: Better answers to better questions. Zeitschrift für Psychologie, 217, 15–26. doi:10.1027/0044-3409.217.1.15
Cumming, G., & Finch, S. (2001). A primer on the understanding, use, and calculation of confidence intervals that are based on central and non-central distributions. Educational and Psychological Measurement, 61, 532–574.
Curran-Everett, D. (2000). Multiple comparisons: Philosophies and illustrations. American Journal of Physiology - Regulatory, Integrative and Comparative Physiology, 279, R1–R8.
Dienes, Z. (2011). Bayesian versus orthodox statistics: Which side are you on? Perspectives on Psychological Science, 6, 274–290.
Edwards, W., Lindman, H., & Savage, L. J. (1963). Bayesian statistical inference for psychological research. Psychological Review, 70, 193–242.
Falk, R., & Greenbaum, C. W. (1995). Significance tests die hard: The amazing persistence of a probabilistic misconception. Theory and Psychology, 5, 75–98.
Fidler, F. (2005). From statistical significance to effect estimation: Statistical reform in psychology, medicine and ecology. Unpublished doctoral dissertation, University of Melbourne, Melbourne.
Fidler, F., & Loftus, G. R. (2009). Why figures with error bars should replace p values: Some conceptual arguments and empirical demonstrations. Zeitschrift für Psychologie, 217, 27–37.
Finch, S., Cumming, G., Williams, J., Palmer, L., Griffith, E., Alders, C., Anderson, J., & Goodman, O. (2004). Reform of statistical inference in psychology: The case of memory and cognition. Behavior Research Methods, Instruments, & Computers, 36, 312–324.
Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Economics, 33, 587–606.
Haller, H., & Krauss, S. (2002). Misinterpretations of significance: A problem students share with their teachers? Methods of Psychological Research Online [On-line serial], 7, 1–20. Retrieved May 27, 2013, from www2.uni-jena.de/svw/metheval/lehre/0405-ws/evaluationuebung/haller.pdf
Harlow, L. L., Mulaik, S. A., & Steiger, J. H. (Eds.). (1997). What if there were no significance tests? Mahwah, NJ: Erlbaum.
Hoekstra, R., Finch, S., Kiers, H. A. L., & Johnson, A. (2006). Probability as certainty: Dichotomous thinking and the misuse of p-values. Psychonomic Bulletin & Review, 13, 1033–1037.
Hoekstra, R., Johnson, A., & Kiers, H. A. L. (2012). Confidence intervals make a difference: Effects of showing confidence intervals on inferential reasoning. Educational and Psychological Measurement, 72, 1039–1052.
Hoenig, J. M., & Heisey, D. M. (2001). The abuse of power: The pervasive fallacy of power calculations for data analysis. The American Statistician, 55, 19–24.
Jaynes, E. T. (1976). Confidence intervals vs Bayesian intervals. In W. L. Harper & C. A. Hooker (Eds.), Foundations of Probability Theory, Statistical Inference, and Statistical Theories of Science (pp. 175–257). Dordrecht, The Netherlands: Reidel Publishing Company.
Kalinowski, P. (2010). Identifying misconceptions about confidence intervals. Proceedings of the Eighth International Conference on Teaching Statistics. [CDROM]. IASE, Ljubljana, Slovenia, Refereed paper.
Kline, R. B. (2004). Beyond significance testing: Reforming data analysis methods in behavioral research. Washington, DC: American Psychological Association.
Kruschke, J. K., Aguinis, H., & Joo, H. (2012). The time has come: Bayesian methods for data analysis in the organizational sciences. Organizational Research Methods, 15, 722–752. doi:10.1177/1094428112457829
Lecoutre, M.-P., Poitevineau, J., & Lecoutre, B. (2003). Even statisticians are not immune to misinterpretations of null hypothesis tests. International Journal of Psychology, 38, 37–45.
Lindley, D. V. (1965). Introduction to probability and statistics from a Bayesian viewpoint. Part 2: Inference. Cambridge: Cambridge University Press.
Morey, R. D. (2013). The consistency test does not-and cannot-deliver what is advertised: A comment on Francis (2013). Journal of Mathematical Psychology. doi:10.1016/j.jmp.2013.03.004
Nosek, B. A., Spies, J. R., & Motyl, M. (2012). Scientific utopia: II. Restructuring incentives and practices to promote truth over publishability. Perspectives on Psychological Science, 7, 615–631. doi:10.1177/1745691612459058
O’Hagan, A. (2004). Dicing with the unknown. Significance, 1, 132–133.
Oakes, M. (1986). Statistical inference: A commentary for the social and behavioural sciences. Chichester: John Wiley & Sons.
Pashler, H., & Wagenmakers, E.-J. (2012). Editors’ introduction to the special section on replicability in psychological science: A crisis of confidence? Perspectives on Psychological Science, 7, 528–530.
Rosnow, R. L., & Rosenthal, R. (1989). Statistical procedures and the justification of knowledge in psychological science. American Psychologist, 44, 1276–1284.
Scheutz, F., Andersen, B., & Wulff, H. R. (1988). What do dentists know about statistics? Scandinavian Journal of Dental Research, 96, 281–287.
Schmidt, F. L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers. Psychological Methods, 1, 115–129.
Schmidt, F. L., & Hunter, J. E. (1997). Eight common but false objections to the discontinuation of significance testing in the analysis of research data. In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? Mahwah, NJ: Erlbaum.
Sellke, T., Bayarri, M.-J., & Berger, J. O. (2001). Calibration of p values for testing precise null hypotheses. The American Statistician, 55, 62–71.
Stone, M. (1969). The role of significance testing: Some data with a message. Biometrika, 56, 485–493.
Wagenmakers, E.-J. (2007). A practical solution to the pervasive problem of p values. Psychonomic Bulletin & Review, 14, 779–804.
Wilkinson, L., & APA Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594–604.
Winch, R. F., & Campbell, D. T. (1969). Proof? No. Evidence? Yes. The significance of tests of significance. American Sociologist, 4, 140–143.
Wulff, H. R., Andersen, B., Brandenhoff, P., & Guttler, F. (1987). What do doctors know about statistics? Statistics in Medicine, 6, 3–10.
This work was supported by the starting grant “Bayes or Bust” awarded by the European Research Council, and by National Science Foundation Grants BCS-1240359 and SES-102408.
Appendix 1 Questionnaire on p-values (Gigerenzer, 2004)
(The scenario and the table are reproduced verbatim from Gigerenzer [2004, p. 594].)
Suppose you have a treatment that you suspect may alter performance on a certain task. You compare the means of your control and experimental groups (say 20 subjects in each sample). Further, suppose you use a simple independent means t-test and your result is significant (t = 2.7, d.f. = 18, p = 0.01). Please mark each of the statements below as “true” or “false.” “False” means that the statement does not follow logically from the above premises. Also note that several or none of the statements may be correct.
1. You have absolutely disproved the null hypothesis (that is, there is no difference between the population means).
 true/false 
2. You have found the probability of the null hypothesis being true.
 true/false 
3. You have absolutely proved your experimental hypothesis (that there is a difference between the population means).
 true/false 
4. You can deduce the probability of the experimental hypothesis being true.
 true/false 
5. You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision.
 true/false 
6. You have a reliable experimental finding in the sense that if, hypothetically, the experiment were repeated a great number of times, you would obtain a significant result on 99 % of occasions.
 true/false 
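As a reminder of what the reported p-value measures, a two-sided p-value is the probability, computed under the null hypothesis, of a t statistic at least as extreme as the one observed. The sketch below is not part of the questionnaire; it evaluates that tail probability for t = 2.7 by numerically integrating the Student t density with Simpson's rule. The function names `t_pdf` and `two_sided_p` are hypothetical. The scenario states 18 degrees of freedom; a pooled independent-samples t-test with 20 subjects per group would conventionally have 20 + 20 − 2 = 38, so both values are computed for comparison.

```python
import math


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t_stat: float, df: int, steps: int = 2000) -> float:
    """P(|T| >= |t_stat|) under the null, via Simpson's rule on [0, |t_stat|]."""
    upper = abs(t_stat)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        # Odd interior points get weight 4, even interior points weight 2.
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = total * h / 3  # P(0 <= T <= |t_stat|)
    return 1.0 - 2.0 * central_mass


p_df18 = two_sided_p(2.7, 18)  # degrees of freedom as stated in the scenario
p_df38 = two_sided_p(2.7, 38)  # degrees of freedom for 20 + 20 - 2
print(f"df = 18: two-sided p = {p_df18:.4f}")
print(f"df = 38: two-sided p = {p_df38:.4f}")
```

Either way, the number is a probability of data at least this extreme given the null hypothesis. It is not the probability of a hypothesis given the data, which is the quantity statements 2 and 4 refer to.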
Appendix 2 Questionnaire on confidence intervals
(The questionnaires for the students were in Dutch, and the researchers could choose between an English and a Dutch version.)
About this article
Cite this article
Hoekstra, R., Morey, R.D., Rouder, J.N. et al. Robust misinterpretation of confidence intervals. Psychon Bull Rev 21, 1157–1164 (2014). https://doi.org/10.3758/s13423-013-0572-3
- Confidence intervals
- Significance testing