Statistical inference is central to the justification of claims across scientific fields. When statistics such as p-values or confidence intervals (CIs) serve as the basis for scientific claims, it is essential that researchers interpret them appropriately; otherwise, one of the central goals of science—the justification of knowledge—is undermined. It is therefore critically important to identify and correct errors where researchers believe that a statistic justifies a particular claim when it, in fact, does not.
Of the statistical techniques used to justify claims in the social sciences, null hypothesis significance testing (NHST) is undoubtedly the most common (Harlow, Mulaik, & Steiger, 1997; Hoekstra, Finch, Kiers, & Johnson, 2006; Kline, 2004). Despite its frequent use, NHST has been criticized for many reasons, including its inability to provide the answers that researchers are interested in (e.g., Berkson, 1942; Cohen, 1994), its violation of the likelihood principle (e.g., Berger & Wolpert, 1988; Wagenmakers, 2007), its tendency to overestimate the evidence against the null hypothesis (e.g., Edwards, Lindman, & Savage, 1963; Sellke, Bayarri, & Berger, 2001), and its dichotomization of evidence (e.g., Fidler & Loftus, 2009; Rosnow & Rosenthal, 1989). In addition, it has been argued that NHST is conceptually difficult to understand and that, consequently, researchers often misinterpret test results (Schmidt, 1996). Although some researchers have defended its usability (e.g., Abelson, 1997; Chow, 1998; Cortina & Dunlap, 1997; Winch & Campbell, 1969), there seems to be widespread agreement that the results from NHST are often misinterpreted.
For example, in a well-known study on the misinterpretation of results from NHST, Oakes (1986) presented a brief scenario with a significant p-value to 70 academic psychologists and asked them to judge as true or false six statements that provided differing interpretations of the significant p-value. All six statements were false; nonetheless, participants endorsed, on average, 2.5 statements, indicating that the psychologists had little understanding of the technique’s correct interpretation. Even when the correct interpretation was added to the set of statements, the average number of incorrectly endorsed statements was about 2.0, whereas the correct interpretation was endorsed in about 11 % of the cases. Falk and Greenbaum (1995) found similar results in a replication of Oakes’s study, and Haller and Krauss (2002) showed that even professors and lecturers teaching statistics often endorse false statements about the results from NHST. Lecoutre, Poitevineau, and Lecoutre (2003) found the same for statisticians working for pharmaceutical companies, and Wulff and colleagues reported misunderstandings among doctors and dentists (Scheutz, Andersen, & Wulff, 1988; Wulff, Andersen, Brandenhoff, & Guttler, 1987). Hoekstra et al. (2006) showed that in more than half of a sample of published articles, a nonsignificant outcome was erroneously interpreted as proof of the absence of an effect, and in about 20 % of the articles, a significant finding was considered absolute proof of the existence of an effect. In sum, p-values are often misinterpreted, even by researchers who use them on a regular basis.
The philosophical underpinning of NHST offers a hint as to why its results are so easily misinterpreted. Specifically, NHST follows the logic of so-called frequentist statistics. Within the framework of frequentist statistics, conclusions are based on a procedure’s average performance for a hypothetical infinite repetition of experiments (i.e., the sample space). Importantly, frequentist statistics does not allow one to assign probabilities to parameters or hypotheses (e.g., O’Hagan, 2004); this can be done only in the framework of Bayesian statistical techniques, which are philosophically incompatible with frequentist statistics. It has been suggested that the common misinterpretations of NHST arise in part because its results are erroneously given a Bayesian interpretation, such as when the p-value is misinterpreted as the probability that the null hypothesis is true (e.g., Cohen, 1994; Dienes, 2011; Falk & Greenbaum, 1995).
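This last misinterpretation can be made concrete with a small simulation. In the sketch below (our own illustration, not part of any cited study; the function names are ours, and a normal approximation stands in for a t test), half of a series of hypothetical experiments have a true null hypothesis and the other half a real effect. Among the experiments that reach p < .05, the proportion in which the null is actually true depends on the base rate of true nulls and on power, not on the p-value itself:

```python
import math
import random
import statistics

def two_sample_p(a, b):
    """Two-sided p-value for a difference in means.
    Normal approximation -- a simplified stand-in for a t test."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    z = abs(statistics.mean(a) - statistics.mean(b)) / se
    return math.erfc(z / math.sqrt(2))  # = 2 * P(Z > z) for standard normal Z

def share_of_true_nulls_among_significant(reps=4000, n=30, effect=0.5, seed=2):
    """Simulate experiments, half with no real effect; return the fraction
    of significant (p < .05) results that come from true null hypotheses."""
    rng = random.Random(seed)
    sig_null = sig_total = 0
    for i in range(reps):
        null_true = i % 2 == 0              # alternate: true null / real effect
        delta = 0.0 if null_true else effect
        a = [rng.gauss(0.0, 1.0) for _ in range(n)]
        b = [rng.gauss(delta, 1.0) for _ in range(n)]
        if two_sample_p(a, b) < 0.05:
            sig_total += 1
            sig_null += null_true
    return sig_null / sig_total
```

Under these particular settings, a nontrivial share of significant results comes from a true null, even though every such result has p < .05; the probability that the null hypothesis is true is a Bayesian quantity that cannot be read off the p-value.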
Within the frequentist framework, a popular alternative to NHST is inference by CIs (e.g., Cumming & Finch, 2001; Fidler & Loftus, 2009; Schmidt, 1996; Schmidt & Hunter, 1997). CIs are often claimed to be a better and more useful alternative to NHST. Schmidt (1996), for example, considered replacing NHST with point estimates and CIs “essential for the future progress of cumulative knowledge in psychological research” (p. 115), and Cumming and Fidler (2009) argued that “NHST . . . has hobbled the theoretical thinking of psychologists for half a century” (p. 15) and that CIs address the problems with NHST, albeit to varying degrees. Fidler and Loftus stated that NHST dichotomizes researchers’ conclusions, and they expected that, because CIs make precision immediately salient, CIs would help to alleviate this dichotomous thinking. Cumming and Finch mentioned four reasons why CIs should be used: First, they give accessible and comprehensible point and interval information and, thus, support substantive understanding and interpretation; second, CIs provide information about any hypothesized value, whereas NHST is informative only about the null; third, CIs are better suited to support meta-analysis; and finally, CIs give direct insight into the precision of the procedure and can therefore be used as an alternative to power calculations.
The criticism of NHST and the advocacy for CIs have had some effect on practice. At the end of the previous century, the American Psychological Association (APA) convened the Task Force on Statistical Inference (TFSI) to study the controversy over NHST and to make a statement about a possible ban on NHST. The TFSI published its findings in 1999 in the American Psychologist (Wilkinson & TFSI, 1999) and encouraged, among other things, the use of CIs, because “it is hard to imagine a situation in which a dichotomous accept–reject decision is better than reporting an actual p-value or, better still, a confidence interval” (Wilkinson & TFSI, 1999, p. 599). The advice of the TFSI was partly incorporated into the fifth and sixth editions of the APA Publication Manual, which call CIs “in general, the best reporting strategy” (APA, 2001, p. 22; APA, 2009, p. 34). Earlier, between 1994 and 1997, as editor of Memory & Cognition, Geoffrey Loftus had tried to reform the publication practices of the journal by encouraging the use of error bars and discouraging NHST. Although his policy had a temporary effect, it accomplished little in the long run (Finch et al., 2004).
The argument for the use of CIs, that they are accessible and comprehensible, rests on the idea that researchers can properly interpret them; including information that researchers cannot interpret is, at best, of limited use and, at worst, potentially misleading. Several previous studies have explored whether presenting results using CIs leads to better interpretations than does presenting the same results using NHST. Belia, Fidler, Williams, and Cumming (2005) showed that participants lacked knowledge of the relationship between CIs and significance levels, suggesting that people might interpret the same results differently depending on whether the data are presented with CIs or with significance test outcomes. Fidler and Loftus (2009) found that both first-year and master’s students tended to make the mistake of accepting the null hypothesis more often if data were presented using NHST than if the same data were presented using CIs. Hoekstra, Johnson, and Kiers (2012) found similar effects for researchers and also found that presenting the same data by means of NHST or CIs affected researchers’ intuitive estimates of the certainty of the existence of a population effect and of the replicability of that effect.
The findings above show that people interpret data differently depending on whether these data are presented through NHST or CIs. Although CIs are endorsed in many articles, little is known about how researchers generally understand them. Fidler (2005) showed that students frequently overlooked the inferential nature of a CI and, instead, interpreted it as a descriptive statistic (e.g., the range of the middle C % of the sample scores, or even an estimate of the sample mean). Hoenig and Heisey (2001) stated that it is “surely prevalent that researchers interpret confidence intervals as if they were Bayesian credibility regions” (p. 5), but they offered no data to back up this claim.
Before proceeding, it is important to recall the correct definition of a CI. A CI is a numerical interval constructed around the estimate of a parameter. Such an interval does not, however, directly indicate a property of the parameter; instead, it indicates a property of the procedure, as is typical for a frequentist technique. Specifically, we may find that a particular procedure, when used repeatedly across a series of hypothetical data sets (i.e., the sample space), yields intervals that contain the true parameter value in 95 % of the cases. When such a procedure is applied to a particular data set, the resulting interval is said to be a 95 % CI. The key point is that a CI does not support a statement about the parameter as it relates to the particular sample at hand; instead, it supports a statement about the long-run performance of the procedure of drawing such intervals in repeated use. Hence, it is incorrect to interpret a 95 % CI as an interval that contains the true value with 95 % probability (e.g., Berger & Wolpert, 1988). As is the case with p-values, CIs do not allow one to make probability statements about parameters or hypotheses.
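The long-run coverage property described above can be demonstrated with a small simulation (a sketch under simplifying assumptions: the function names are ours, and the interval uses the normal-approximation critical value 1.96 rather than a t critical value). Repeatedly drawing samples from a population with a known mean and recording how often the interval captures that mean yields a proportion close to 95 %; this proportion describes the procedure, not any single interval:

```python
import random
import statistics

def ci_95(sample):
    """Approximate 95 % CI for a population mean: mean +/- 1.96 standard errors.
    (Normal approximation; for small samples a t critical value would be used.)"""
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / len(sample) ** 0.5
    return m - 1.96 * se, m + 1.96 * se

def coverage(true_mean=0.0, n=50, reps=5000, seed=1):
    """Fraction of intervals, over repeated samples, that contain true_mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        lo, hi = ci_95(sample)
        hits += lo <= true_mean <= hi
    return hits / reps
```

Crucially, once a single interval has been computed, the simulation licenses no probability statement about that interval: the true mean either is or is not inside it.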
In this article, we address two major questions: first, to what extent CIs are misinterpreted by researchers and students, and second, to what extent any misinterpretations are reduced by research experience. To address these questions, we surveyed students and active researchers about their interpretations of CIs.