In this section, we present an example taken from the confidence interval literature (Berger and Wolpert 1988; Lehmann 1959; Pratt 1961; Welch 1939) designed to bring into focus how CI theory works. This example is intentionally simple; unlike many demonstrations of CIs, no simulations are needed, and almost all results can be derived by readers with some training in probability and geometry. We have also created interactive versions of our figures to aid readers in understanding the example; see the figure captions for details.
A 10-meter-long research submersible with several people on board has lost contact with its surface support vessel. The submersible has a rescue hatch exactly halfway along its length, to which the support vessel will drop a rescue line. Because the rescuers only get one rescue attempt, it is crucial that the line, when dropped to the craft in the deep water, land as close as possible to this hatch. The researchers on the support vessel do not know where the submersible is, but they do know that it forms two distinctive bubbles. These bubbles could form anywhere along the craft’s length, independently, with equal probability, and float to the surface where they can be seen by the support vessel.
The situation is shown in Fig. 1a. The rescue hatch is the unknown location θ, and the bubbles can rise from anywhere with uniform probability between θ−5 meters (the bow of the submersible) and θ+5 meters (the stern of the submersible). The rescuers want to use these bubbles to infer where the hatch is located. We will denote the first and second bubble observed by \(y_{1}\) and \(y_{2}\), respectively; for convenience, we will often use \(x_{1}\) and \(x_{2}\) to denote the bubbles ordered by location, with \(x_{1}\) always denoting the smaller location of the two. Note that \(\bar {y}=\bar {x}\), because order is irrelevant when computing means, and that the distance between the two bubbles is \(|y_{1}-y_{2}|=x_{2}-x_{1}\). We denote this difference as d.
The rescuers first note that from observing two bubbles, it is easy to rule out all values except those within five meters of both bubbles because no bubble can occur further than 5 meters from the hatch. If the two bubble locations were \(y_{1} = 4\) and \(y_{2} = 6\), then the possible locations of the hatch are between 1 and 9, because only these locations are within 5 meters of both bubbles. This constraint is formally captured in the likelihood, which is the joint probability density of the observed data for all possible values of θ. In this case, because the observations are independent, the joint probability density is:
$$\begin{array}{@{}rcl@{}} p(y_{1}, y_{2}; \theta) &=& p_{y}(y_{1};\theta)\times p_{y}(y_{2};\theta). \end{array} $$
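The density for each bubble \(p_{y}\) is uniform across the submersible’s 10 meter length, which means the joint density must be 1/10×1/10=1/100 wherever both bubbles are possible. If the lesser of \(y_{1}\) and \(y_{2}\) (which we denote \(x_{1}\)) is greater than θ−5, then obviously both \(y_{1}\) and \(y_{2}\) must be greater than θ−5. Likewise, if the greater of the two (\(x_{2}\)) is less than θ+5, then both must be less than θ+5. This means that the density, written in terms of constraints on \(x_{1}\) and \(x_{2}\), is: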
$$\begin{array}{@{}rcl@{}} p(y_{1}, y_{2}; \theta) &=& \left\{\begin{array}{ll} 1/100, &\text{if}~x_{1}>\theta-5~\text{and}~x_{2}<\theta+5,\\ 0& \text{otherwise}. \end{array} \right. \end{array} $$
(1)
If we write Eq. 1 as a function of the unknown parameter θ for fixed, observed data, we get the likelihood, which indexes the information provided by the data about the parameter. In this case, it is positive only when a value θ is possible given the observed bubbles (see also Figs. 1 and 5):
$$\begin{array}{@{}rcl@{}} p(\theta ; y_{1}, y_{2}) &=& \left\{\begin{array}{cl} 1,&\theta>x_{2} - 5~\text{and}~\theta\leq x_{1}+5,\\ 0& \text{otherwise}. \end{array}\right.\\ \end{array} $$
We replaced 1/100 with 1 because the particular values of the likelihood do not matter, only their relative values. Writing the likelihood in terms of \(\bar {x}\) and the difference between the bubbles \(d = x_{2}-x_{1}\), we get an interval:
$$\begin{array}{@{}rcl@{}} p(\theta ; y_{1}, y_{2})\! &=&\!\!\left\{\!\begin{array}{ll} 1,&\! \bar{x} - (5 - d/2) < \theta \leq \bar{x} + (5 - d/2),\\ 0& \!\text{otherwise}. \end{array}\right. \end{array} $$
(2)
If the likelihood is positive, the value θ is possible; if it is 0, that value of θ is impossible. Expressing the likelihood as in Eq. 2 allows us to see several important things. First, the likelihood is centered around a reasonable point estimate for θ, \(\bar {x}\). Second, the width of the likelihood 10−d, which here is an index of the uncertainty of the estimate, is larger when the difference between the bubbles d is smaller. When the bubbles are close together, we have little information about θ compared to when the bubbles are far apart. Keeping in mind the likelihood as the information in the data, we can define our confidence procedures.
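The interval of possible hatch locations is simple enough to compute directly. The following is a minimal Python sketch of Eq. 2, not code from the original analysis; the function names and the `half_length` parameter are assumptions introduced here for illustration.

```python
def likelihood_interval(y1, y2, half_length=5.0):
    """Bounds on the hatch location theta that are consistent with both bubbles.

    Returns None if the bubbles are farther apart than the craft is long,
    which cannot happen under the model.
    """
    x1, x2 = min(y1, y2), max(y1, y2)
    lower = x2 - half_length  # theta cannot be more than 5 m below the larger bubble
    upper = x1 + half_length  # theta cannot be more than 5 m above the smaller bubble
    if lower > upper:
        return None
    return lower, upper


def likelihood_interval_centered(y1, y2, half_length=5.0):
    """The same interval written as in Eq. 2: mean +/- (5 - d/2)."""
    xbar = (y1 + y2) / 2
    d = abs(y1 - y2)
    half_width = half_length - d / 2
    return xbar - half_width, xbar + half_width


print(likelihood_interval(4, 6))           # (1.0, 9.0)
print(likelihood_interval_centered(4, 6))  # (1.0, 9.0)
```

Both forms give the same bounds; the centered form makes the width 10−d explicit.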
Five confidence procedures
A group of four statisticians (see Footnote 3) happen to be on board, and the rescuers decide to ask them for help improving their judgments using statistics. The four statisticians suggest four different 50 % confidence procedures. We will outline these four procedures; first, however, we describe a trivial procedure that no one would ever suggest. An applet allowing readers to sample from these confidence procedures is available at the link in the caption for Fig. 1.
0. A trivial procedure
A trivial 50 % confidence procedure can be constructed by using the ordering of the bubbles. If \(y_{1} > y_{2}\), we construct an interval that contains the whole ocean, (−∞, ∞). If \(y_{2} > y_{1}\), we construct an interval that contains only the single, exact point directly under the middle of the rescue boat. This procedure is obviously a 50 % confidence procedure; exactly half of the time (when \(y_{1} > y_{2}\)) the rescue hatch will be within the interval. We describe this interval merely to show that by itself, a procedure including the true value X % of the time means nothing (see also Basu, 1981). We must obviously consider something more than the confidence property, which we discuss subsequently.
1. A procedure based on the sampling distribution of the mean
The first statistician suggests building a confidence procedure using the sampling distribution of the mean \(\bar {x}\). The sampling distribution of \(\bar {x}\) is known to be triangular, with θ as its mean. With this sampling distribution, there is a 50 % probability that \(\bar {x}\) will differ from θ by less than \(5 - 5/\sqrt {2}\), or about 1.46 m. We can thus use \(\bar {x} - \theta \) as a so-called “pivotal quantity” (Casella & Berger, 2002; see the supplement to this manuscript for more details) by noting that there is a 50 % probability that θ is within this same distance of \(\bar {x}\) in repeated samples. This leads to the confidence procedure
$$\bar{x} \pm \left( 5 - 5/\sqrt{2}\right), $$
which we call the “sampling distribution” procedure. This procedure also has the familiar form \(\bar {x} \pm C\times SE\), where here the standard error (that is, the standard deviation of the estimate \(\bar {x}\)) is known to be 2.04.
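Both constants quoted for this procedure can be checked by hand. As a brief sketch, write c for the half-width (a symbol introduced here only for convenience). The deviation \(\bar {x} - \theta \) has a triangular density on (−5, 5), and each bubble has variance \(10^{2}/12\), so:
$$\begin{array}{rcl} \Pr(|\bar{x}-\theta| < c) &=& 1 - \left(1 - c/5\right)^{2} = 0.5 \quad\Rightarrow\quad c = 5 - 5/\sqrt{2} \approx 1.46,\\ SE &=& \sqrt{\dfrac{10^{2}/12}{2}} = \dfrac{10}{\sqrt{24}} \approx 2.04. \end{array} $$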
2. A nonparametric procedure
The second statistician notes that θ is both the mean and median bubble location. Olive (2008) and Rusu and Dobra (2008) suggest a nonparametric confidence procedure for the median that in this case is simply the interval between the two observations:
$$\bar{x} \pm \frac{d}{2}. $$
It is easy to see that this must be a 50 % confidence procedure; the probability that both observations fall below θ is .5×.5=.25, and likewise for both falling above. There is thus a 50 % chance that the two observations encompass θ. Coincidentally, this is the same as the 50 % Student’s t procedure for n = 2.
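The agreement with Student’s t can be verified in a few lines. For n = 2 the sample standard deviation is \(s = d/\sqrt {2}\), so the estimated standard error is \(s/\sqrt {2} = d/2\). The t distribution with one degree of freedom is the Cauchy distribution, whose 75th percentile is \(\tan (\pi /4) = 1\). The central 50 % t interval is therefore
$$\bar{x} \pm t_{0.75,1}\,\frac{s}{\sqrt{2}} = \bar{x} \pm 1\times\frac{d/\sqrt{2}}{\sqrt{2}} = \bar{x} \pm \frac{d}{2}. $$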
3. The uniformly most-powerful procedure
The third statistician, citing Welch (1939), describes a procedure that can be thought of as a slight modification of the nonparametric procedure. Suppose we obtain a particular confidence interval using the nonparametric procedure. If the nonparametric interval is more than 5 meters wide, then it must contain the hatch, because the only possible values are less than 5 meters from both bubbles. Moreover, in this case the interval will contain impossible values, because it will be wider than the likelihood. We can exclude these impossible values by truncating the interval to the likelihood whenever the width of the interval is greater than 5 meters:
$$\bar{x} \pm \left\{\begin{array}{lllr} \frac{d}{2} & \text{if} & d < 5 & \text{(nonparametric procedure)}\\ 5 - \frac{d}{2} &\text{if} & d \geq 5 & \text{(likelihood)} \end{array} \right. $$
This will not change the probability that the interval contains the hatch, because it is simply replacing one interval that is sure to contain it with another. Pratt (1961) noted that this interval can be justified as an inversion of the uniformly most-powerful (UMP) test.
4. An objective Bayesian procedure
The fourth statistician suggests an objective Bayesian procedure. Using this procedure, we simply take the central 50 % of the likelihood as our interval:
$$\bar{x} \pm \frac{1}{2}\left( 5 - \frac{d}{2}\right). $$
From the objective Bayesian viewpoint, this can be justified by assuming a prior distribution that assigns equal probability to each possible hatch location. In Bayesian terms, this procedure generates “credible intervals” for this prior. It can also be justified under Fisher’s fiducial theory; see Welch (1939).
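Because each of the four procedures is the mean plus or minus a half-width, they can be written compactly and their 50 % coverage checked by simulation. The following is a minimal Python sketch, not the authors’ own code; the function names, the choice θ = 0, the sample size, and the seed are all assumptions made for illustration.

```python
import math
import random

HALF = 5.0  # half the submersible's length, in meters
SD_HALF_WIDTH = HALF - HALF / math.sqrt(2)  # about 1.46 m


def half_widths(y1, y2):
    """Half-width of each 50% interval; all four are centered on the mean."""
    d = abs(y1 - y2)
    return {
        "sampling": SD_HALF_WIDTH,
        "nonparametric": d / 2,
        "ump": d / 2 if d < HALF else HALF - d / 2,
        "bayes": (HALF - d / 2) / 2,
    }


def intervals(y1, y2):
    """Return each procedure's interval as a (lower, upper) pair."""
    xbar = (y1 + y2) / 2
    return {name: (xbar - h, xbar + h) for name, h in half_widths(y1, y2).items()}


def coverage(n_samples=200_000, theta=0.0, seed=1):
    """Estimate how often each interval contains the true hatch location."""
    rng = random.Random(seed)
    hits = dict.fromkeys(half_widths(0.0, 0.0), 0)
    for _ in range(n_samples):
        y1 = rng.uniform(theta - HALF, theta + HALF)
        y2 = rng.uniform(theta - HALF, theta + HALF)
        xbar = (y1 + y2) / 2
        for name, h in half_widths(y1, y2).items():
            hits[name] += abs(xbar - theta) < h
    return {name: count / n_samples for name, count in hits.items()}


print(coverage())  # each estimate should be close to 0.5
```

The trivial procedure is omitted because its interval is not centered on the mean.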
Properties of the procedures
The four statisticians report their four confidence procedures to the rescue team, who are understandably bewildered by the fact that there appear to be at least four ways to infer the hatch location from two bubbles. Just after the statisticians present their confidence procedures to the rescuers, two bubbles appear at locations \(x_{1} = 1\) and \(x_{2} = 1.5\). The resulting likelihood and the four confidence intervals are shown in Fig. 1a.
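As a worked check, the procedures defined earlier can be applied by hand to these two bubbles, for which \(\bar {x} = 1.25\) and \(d = 0.5\) (the sampling distribution bounds are rounded to two decimals):
$$\begin{array}{lrcl} \text{Likelihood:} & 1.25 \pm (5 - 0.25) &=& (-3.5,\ 6),\\ \text{Sampling distribution:} & 1.25 \pm (5 - 5/\sqrt{2}) &\approx& (-0.21,\ 2.71),\\ \text{Nonparametric:} & 1.25 \pm 0.25 &=& (1,\ 1.5),\\ \text{UMP } (d < 5): & 1.25 \pm 0.25 &=& (1,\ 1.5),\\ \text{Bayes:} & 1.25 \pm \frac{1}{2}(5 - 0.25) &=& (-1.125,\ 3.625). \end{array} $$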
The fundamental confidence fallacy
After using the observed bubbles to compute the four confidence intervals, the rescuers wonder how to interpret them. It is clear, first of all, why the fundamental confidence fallacy is a fallacy. As Fisher pointed out in the discussion of CI theory mentioned above, for any given problem — as for this one — there are many possible confidence procedures. These confidence procedures will lead to different confidence intervals. In the case of our submersible confidence procedures, all confidence intervals are centered around \(\bar {x}\), and so the intervals will be nested within one another.
If we mistakenly interpret these observed intervals as having a 50 % probability of containing the true value, a logical problem arises. On this interpretation, there must always be a 50 % probability that the shortest interval contains the parameter. The reason is basic probability theory: the narrowest interval would have probability 50 % of including the true value, and the widest interval would have probability 50 % of excluding the true value. According to this reasoning, there must be a 0 % probability that the true value is outside the narrower, nested interval yet inside the wider interval. If we believed the FCF, we would always come to the conclusion that the shortest of a set of nested X% intervals has an X% probability of containing the true value. Of course, the confidence procedure “always choose the shortest of the nested intervals” will tend to have a lower than X% probability of including the true value. If we believed the FCF, then we would have to conclude that the shortest interval simultaneously has an X% probability of containing the true value and a less than X% probability of doing so. Believing the FCF results in contradiction.
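The claim about the procedure “always choose the shortest of the nested intervals” can be illustrated by simulation for the submersible example. The following is a minimal Python sketch under stated assumptions: the function names, θ = 0, the sample size, and the seed are chosen here for illustration, and the nonparametric interval is left out of the minimum because it is never shorter than the UMP interval.

```python
import math
import random

HALF = 5.0  # half the submersible's length, in meters
SD_HALF_WIDTH = HALF - HALF / math.sqrt(2)  # sampling distribution half-width


def shortest_half_width(y1, y2):
    """Half-width of the shortest of the nested 50% intervals for this bubble pair."""
    d = abs(y1 - y2)
    ump = d / 2 if d < HALF else HALF - d / 2
    bayes = (HALF - d / 2) / 2
    return min(SD_HALF_WIDTH, ump, bayes)


def shortest_interval_coverage(n_samples=200_000, theta=0.0, seed=1):
    """Estimate how often the shortest nested interval contains the hatch."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        y1 = rng.uniform(theta - HALF, theta + HALF)
        y2 = rng.uniform(theta - HALF, theta + HALF)
        xbar = (y1 + y2) / 2
        hits += abs(xbar - theta) < shortest_half_width(y1, y2)
    return hits / n_samples


print(shortest_interval_coverage())
```

Under these assumptions the estimate comes out near one third, well below the nominal 50 %.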
This point regarding the problem of interpreting nested CIs is not, by itself, a critique of confidence interval theory proper; it is rather a critique of the folk theory of confidence. Neyman himself was very clear that this interpretation was not permissible, using similarly nested confidence intervals to demonstrate the fallacy (Neyman 1941, pp. 213–215). It is a warning that the improper interpretations of confidence intervals used throughout the scientific literature lead to mutually contradictory inferences, just as Fisher warned.
Even without nested confidence procedures, one can see that the FCF must be a fallacy. Consider Fig. 1b, which shows the resulting likelihood and confidence intervals when \(x_{1} = 0.5\) and \(x_{2} = 9.5\). When the bubbles are far apart, as in this case, the hatch can be localized very precisely: the bubbles are far enough apart that they must have come from the bow and stern of the submersible. The sampling distribution, nonparametric, and UMP confidence intervals all encompass the likelihood, meaning that there is 100 % certainty that these 50 % confidence intervals contain the hatch. Reporting 50 % certainty, 50 % probability, or 50 % confidence in a specific interval that surely contains the parameter would clearly be a mistake.
Relevant subsets
The fact that we can have 100 % certainty that a 50 % CI contains the true value is a specific case of a more general problem flowing from the FCF. The shaded regions in Fig. 2, left column, show when the true value is contained in the various confidence procedures for all possible pairs of observations. The top, middle, and bottom rows correspond to the sampling distribution, nonparametric/UMP, and Bayes procedures, respectively. Because each procedure is a 50 % confidence procedure, in each plot the shaded area occupies 50 % of the larger square delimiting the possible observations. The points ‘a’ and ‘b’ are the bubble patterns in Fig. 1a and b, respectively; point ‘b’ is in the shaded region for each interval because the true value is included in every kind of interval, as shown in Fig. 1b; likewise, ‘a’ is outside every shaded region because all CIs exclude the true value for this observed bubble pair.
Instead of considering the bubbles themselves, we might also translate their locations into the mean location \(\bar {y}\) and the difference between them, \(b = y_{2}-y_{1}\). We can do this without loss of any information: \(\bar {y}\) contains the point estimate of the hatch location, and b contains the information about the precision of that estimate. Figure 2, right column, shows the same information as in the left column, except as a function of \(\bar {y}\) and b. The figures in the right column are 45° clockwise rotations of those in the left. Although the two columns show the same information, the rotated right column reveals a critical fact: the various confidence procedures have different probabilities of containing the true value when the distance between the bubbles varies.
To see this, examine the horizontal line under point ‘a’ in Fig. 2b. The horizontal line is the subset of all bubble pairs that show the same difference between the bubbles as those in Fig. 1a: 0.5 meters. About 31 % of this line falls under the shaded region, meaning that in the long run, 31 % of sampling distribution intervals will contain the true value when the bubbles are 0.5 meters apart. For the nonparametric and UMP intervals (middle row), this percentage is only about 5 %. For the Bayes interval (bottom row), it is exactly 50 %.
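These conditional percentages can be reproduced without simulation. Because the joint density of the bubbles is uniform, for a fixed distance d the mean is uniformly distributed within 5 − d/2 of θ, so an interval with half-width h contains θ with probability h/(5 − d/2), capped at 1. The following is a minimal Python sketch of that calculation; the function name and dictionary keys are assumptions introduced here for illustration.

```python
import math

HALF = 5.0  # half the submersible's length, in meters
SD_HALF_WIDTH = HALF - HALF / math.sqrt(2)  # sampling distribution half-width


def conditional_coverage(d):
    """Probability that each interval contains theta, given distance d (0 <= d < 10)."""
    likelihood_half = HALF - d / 2  # given d, the mean lies within this distance of theta

    def frac(half_width):
        # Share of the possible means for which the interval reaches theta.
        return min(1.0, half_width / likelihood_half)

    ump = d / 2 if d < HALF else HALF - d / 2
    return {
        "sampling": frac(SD_HALF_WIDTH),
        "nonparametric": frac(d / 2),
        "ump": frac(ump),
        "bayes": frac(likelihood_half / 2),
    }


for name, p in conditional_coverage(0.5).items():
    print(f"{name}: {p:.2f}")
```

For d = 0.5 this gives roughly 0.31, 0.05, 0.05, and 0.50, matching the percentages read off Fig. 2; for d = 9, as in Fig. 1b, every procedure except Bayes has conditional coverage 1.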
Believing the FCF implies believing that we can use the long-run probability that a procedure contains the true value as an index of our post-data certainty that a particular interval contains the true value. But in this case, we have identified two long-run probabilities for each interval: the average long-run probability not taking into account the observed difference — that is, 50 % — and the long-run probability taking into account b which, for the sampling distribution interval is 31 % and for the nonparametric/UMP intervals is 5 %. Both are valid long-run probabilities; which do we use for our inference? Under FCF, both are valid. Hence the FCF leads to contradiction.
The existence of multiple, contradictory long-run probabilities brings back into focus the confusion between what we know before the experiment and what we know after the experiment. For any of these confidence procedures, we know before the experiment that 50 % of future CIs will contain the true value. After observing the results, conditioning on a known property of the data (such as, in this case, the variance of the bubbles) can radically alter our assessment of the probability.
The problem of contradictory inferences arising from multiple applicable long-run probabilities is an example of the “reference class” problem (Venn, 1888; Reichenbach, 1949), where a single observed event (e.g., a CI) can be seen as part of several long-run sequences, each with a different long-run probability. Fisher noted that when there are identifiable subsets of the data that have different probabilities of containing the true value (such as those subsets with a particular value of d, in our confidence interval example), those subsets are relevant to the inference (Fisher 1959). The existence of relevant subsets means that one can assign more than one probability to an interval. Relevant subsets are identifiable in many confidence procedures, including the common classical Student’s t interval, where wider CIs have a greater probability of containing the true value (Buehler, 1959; Buehler & Feddersen, 1963; Casella, 1992; Robinson, 1979). There are, as far as we know, only two general strategies for eliminating the threat of contradiction from relevant subsets: Neyman’s strategy of avoiding any assignment of probabilities to particular intervals, and the Bayesian strategy of always conditioning on the observed data, to be discussed subsequently.
The precision and likelihood fallacies
This set of confidence procedures also makes clear the precision fallacy. Consider Fig. 3, which shows how the width of each of the intervals produced by the four confidence procedures changes as a function of the width of the likelihood. The Bayes procedure tracks the uncertainty in the data: when the likelihood is wide, the Bayes CI is wide. The reason for this necessary correspondence between the likelihood and the Bayes interval will be discussed later.
Intervals from the sampling distribution procedure, in contrast, have a fixed width, and so cannot reveal any information about the precision of the estimate. The sampling distribution interval is of the commonly-seen CI form
$$\bar{x}\pm C\times SE. $$
As with the CI for a normal population mean with known population variance, the standard error (defined as the standard deviation of the sampling distribution of \(\bar {x}\)) is known and fixed; here, it is approximately 2.04 (see the supplement for details). This indicates that the long-run standard error, and hence confidence intervals based on the standard error, cannot always be used as a guide to the uncertainty we should have in a parameter estimate.
Strangely, the nonparametric procedure generates intervals whose widths are inversely related to the uncertainty in the parameter estimates. Even more strangely, intervals from the UMP procedure initially increase in width with the uncertainty in the data, but when the width of the likelihood is greater than 5 meters, the width of the UMP interval is inversely related to the uncertainty in the data, like the nonparametric interval. This can lead to bizarre situations. Consider observing the UMP 50 % interval [1,1.5]. This is consistent with two possible sets of observations: (1,1.5), and (−3.5,6). Both of these sets of bubbles will lead to the same CI. Yet the second data set indicates high precision, and the first very low precision! The UMP and sampling distribution procedures share the dubious distinction that their CIs cannot be used to work backwards to the observations. In spite of being the “most powerful” procedure, the UMP procedure clearly throws away important information.
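The two data sets in this example can be checked directly. The following is a minimal Python sketch; the function names and default parameters are assumptions introduced here for illustration.

```python
def ump_interval(y1, y2, half_length=5.0):
    """UMP 50% interval: nonparametric when d < 5, truncated to the likelihood otherwise."""
    xbar = (y1 + y2) / 2
    d = abs(y1 - y2)
    half_width = d / 2 if d < half_length else half_length - d / 2
    return xbar - half_width, xbar + half_width


def likelihood_width(y1, y2, length=10.0):
    """Width of the range of hatch locations still possible given the bubbles."""
    return length - abs(y1 - y2)


for bubbles in [(1.0, 1.5), (-3.5, 6.0)]:
    print(bubbles, ump_interval(*bubbles), likelihood_width(*bubbles))
# (1.0, 1.5) (1.0, 1.5) 9.5
# (-3.5, 6.0) (1.0, 1.5) 0.5
```

Both pairs yield the interval (1, 1.5), even though the range of possible hatch locations is 9.5 meters wide in the first case and only 0.5 meters wide in the second.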
To see how the likelihood fallacy is manifest in this example, consider again Fig. 3. When the uncertainty is high, the likelihood is wide; yet the nonparametric and UMP intervals are extremely narrow, both indicating false precision and excluding almost all likely values. Furthermore, the sampling distribution procedure and the nonparametric procedure can contain impossible values (see Footnote 4).
Evaluating the confidence procedures
The rescuers who have been offered the four intervals above have a choice to make: which confidence procedure to choose? We have shown that several of the confidence procedures have counter-intuitive properties, but thus far, we have not made any firm commitments about which confidence procedures should be preferred to the others. For the sake of our rescue team, who have a decision to make about which interval to use, we now compare the four procedures directly. We begin with the evaluation of the procedures from the perspective of confidence interval theory, then evaluate them according to Bayesian theory.
As previously mentioned, confidence interval theory specifies that better intervals will include false values less often. Figure 4 shows the probability that each of the procedures includes a value θ′ at a specified distance from the hatch θ. All procedures are 50 % confidence procedures, and so they include the true value θ 50 % of the time. Importantly, however, the procedures include particular false values θ′≠θ at different rates. See the interactive versions of Figs. 1 and 4 linked in the figure captions for a hands-on demonstration.
The trivial procedure (T; gray horizontal line) is obviously a bad interval because it includes every false value with the same frequency as the true value. This is analogous to a hypothesis test with power equal to its Type I error rate. The trivial procedure will be worse than any other procedure, unless the procedure is specifically constructed to be pathological. The UMP procedure (UMP), on the other hand, is better than every other procedure for every value of θ′. This is due to the fact that it was created by inverting a most-powerful test.
The ordering among the three remaining procedures can be seen by comparing their curves. The sampling distribution procedure (SD) is always superior to the Bayes procedure (B), but not to the nonparametric procedure (NP). The nonparametric procedure and the Bayes procedure curves overlap, so one is not preferred to the other. Welch (1939) remarked that the Bayes procedure is “not the best way of constructing confidence limits” using precisely the frequentist comparison shown in Fig. 4 with the UMP interval (see Footnote 5).
The frequentist comparison between the procedures is instructive, because we have arrived at an ordering of the procedures employing the criteria suggested by Neyman and used by the modern developers of new confidence procedures: coverage and power. The UMP procedure is the best, followed by the sampling distribution procedure. The sampling distribution procedure is better than the Bayes procedure. The nonparametric procedure is not preferred to any interval, but neither is it the worst.
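The kind of comparison plotted in Fig. 4 can be approximated by simulation: draw many bubble pairs and count how often each interval contains a false value θ′ = θ + δ. The following is a minimal Python sketch, not the authors’ own code; the function names, θ = 0, δ = 2, the sample size, and the seed are assumptions chosen for illustration, and the trivial procedure is omitted.

```python
import math
import random

HALF = 5.0  # half the submersible's length, in meters
SD_HALF_WIDTH = HALF - HALF / math.sqrt(2)  # sampling distribution half-width


def half_widths(y1, y2):
    """Half-width of each 50% interval; all four are centered on the mean."""
    d = abs(y1 - y2)
    return {
        "sampling": SD_HALF_WIDTH,
        "nonparametric": d / 2,
        "ump": d / 2 if d < HALF else HALF - d / 2,
        "bayes": (HALF - d / 2) / 2,
    }


def inclusion_rates(delta, n_samples=200_000, seed=1):
    """Estimate how often each interval contains theta + delta, taking theta = 0."""
    rng = random.Random(seed)
    hits = dict.fromkeys(half_widths(0.0, 0.0), 0)
    for _ in range(n_samples):
        y1 = rng.uniform(-HALF, HALF)
        y2 = rng.uniform(-HALF, HALF)
        xbar = (y1 + y2) / 2
        for name, h in half_widths(y1, y2).items():
            hits[name] += abs(xbar - delta) < h
    return {name: count / n_samples for name, count in hits.items()}


print(inclusion_rates(0.0))  # true value: each rate should be close to 0.5
print(inclusion_rates(2.0))  # a false value 2 m from the hatch
```

A lower rate at δ ≠ 0 means the procedure excludes that false value more often, which is the sense in which confidence interval theory ranks one procedure above another.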
We can also examine the procedures from a Bayesian perspective, which is primarily concerned with whether the inferences are reasonable in light of the data and what was known before the data were observed (Howson and Urbach 2006). We have already seen that interpreting the non-Bayesian procedures in this way leads to trouble, and that the Bayesian procedure, unsurprisingly, has better properties in this regard. We will show how the Bayesian interval was derived in order to provide insight into why it has good properties.
Consider the left column of Fig. 5, which shows Bayesian reasoning from prior and likelihood to posterior and so-called credible interval. The prior distribution in the top panel shows that before observing the data, all the locations in this region are equally probable. Upon observing the bubbles shown in Fig. 1a — also shown in the top of the “likelihood” panel — the likelihood is a function that is 1 for all possible locations for the hatch, and 0 otherwise. To combine our prior knowledge with the new information from the two bubbles, we condition what we knew before on the information in the data by multiplying by the likelihood — or, equivalently, excluding values we know to be impossible — which results in the posterior distribution in the bottom row. The central 50 % credible interval contains all values in the central 50 % of the area of the posterior, shown as the shaded region. The right column of Fig. 5 shows a similar computation using an informative prior distribution that does not assume that all locations are equally likely, as might occur if some other information about the location of the submersible were available.
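The prior-times-likelihood computation described here can be sketched numerically on a grid of candidate hatch locations. The following Python code is a minimal illustration under stated assumptions: the function name, the grid limits and resolution, and in particular the normal-shaped informative prior are hypothetical choices made here, not the prior used for Fig. 5.

```python
import math

HALF = 5.0  # half the submersible's length, in meters


def credible_interval(y1, y2, prior=lambda theta: 1.0,
                      lo=-20.0, hi=20.0, n=40_001, mass=0.5):
    """Central credible interval from a grid approximation to the posterior."""
    x1, x2 = min(y1, y2), max(y1, y2)
    step = (hi - lo) / (n - 1)
    grid = [lo + i * step for i in range(n)]
    # Posterior is proportional to prior times likelihood; the likelihood
    # is 1 where theta is possible given the bubbles and 0 elsewhere.
    post = [prior(t) if x2 - HALF < t < x1 + HALF else 0.0 for t in grid]
    total = sum(post)
    tail = (1 - mass) / 2
    cumulative = 0.0
    lower = upper = None
    for t, p in zip(grid, post):
        cumulative += p / total
        if lower is None and cumulative >= tail:
            lower = t
        if upper is None and cumulative >= 1 - tail:
            upper = t
    return lower, upper


# Flat prior (the default): every location on the grid equally probable.
flat = credible_interval(1.0, 1.5)

# Hypothetical informative prior: normal shape centered at 0 with SD 2.
informative = credible_interval(
    1.0, 1.5, prior=lambda t: math.exp(-0.5 * (t / 2.0) ** 2))

print(flat, informative)
```

With the flat prior the result agrees, up to grid resolution, with the closed-form Bayes interval for these bubbles, 1.25 ± 2.375; the informative prior shifts and narrows the interval.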
It is now obvious why the Bayesian credible interval has the properties typically ascribed to confidence intervals. The credible interval can be interpreted as having a 50 % probability of containing the true value, because the values within it account for 50 % of the posterior probability. It reveals the precision of our knowledge of the parameter, in light of the data and prior, through its relationship with the posterior and likelihood.
Of the five procedures considered, intervals from the Bayesian procedure are the only ones that can be said to have 50 % probability of containing the true value, upon observing the data. Importantly, the ability to interpret the interval in this way arises from Bayesian theory and not from confidence interval theory. Also importantly, it was necessary to stipulate a prior to obtain the desired interval; the interval should be interpreted in light of the stipulated prior. Of the other four intervals, none can be justified as providing a “reasonable” inference or conclusion from the data, because of their strange properties and because there is no possible prior distribution that could lead to these procedures. In this light, it is clear why Neyman’s rejection of “conclusions” and “reasoning” from data naturally flowed from his theory: the theory, after all, does not support such ideas. It is also clear that if they care about making reasonable inferences from data, scientists might want to reject confidence interval theory as a basis for evaluating procedures.
We can now review what we know concerning the four procedures. Only the Bayesian procedure — when its intervals are interpreted as credible intervals — allows the interpretation that there is a 50 % probability that the hatch is located in the interval. Only the Bayesian procedure properly tracks the precision of the estimate. Only the Bayesian procedure covers the plausible values in the expected way: the other procedures produce intervals that are known with certainty — by simple logic — to contain the true value, but still are “50 %” intervals. The non-Bayesian intervals have undesirable, even bizarre properties, which would lead any reasonable analyst to reject them as a means to draw inferences. Yet the Bayesian procedure is judged by frequentist CI theory as inferior.
The disconnect between frequentist theory and Bayesian theory arises from the different goals of the two theories. Frequentist theory is a “pre-data” theory. It looks forward, devising procedures that will have particular average properties in repeated sampling (Jaynes 2003; Mayo 1981; 1982) in the future (see also Neyman, 1937, p. 349). This thinking can be clearly seen in Neyman (1941) as quoted above: reasoning ends once the procedure is derived. Confidence interval theory is vested in the average frequency of including or excluding true and false parameter values, respectively. Any given inference may — or may not — be reasonable in light of the observed data, but this is not Neyman’s concern; he disclaims any conclusions or beliefs on the basis of data. Bayesian theory, on the other hand, is a post-data theory: a Bayesian analyst uses the information in the data to determine what is reasonable to believe, in light of the model assumptions and prior information.
Using an interval justified by a pre-data theory to make post-data inferences can lead to unjustified, and possibly arbitrary, inferences. This problem is not limited to the pedagogical submersible example (Berger and Wolpert 1988; Wagenmakers et al. 2014) though this simple example is instructive for identifying these issues. In the next section we show how a commonly-used confidence interval leads to similarly flawed post-data inferences.