The fallacy of placing confidence in confidence intervals

Interval estimates – estimates of parameters that include an allowance for sampling uncertainty – have long been touted as a key component of statistical analyses. There are several kinds of interval estimates, but the most popular are confidence intervals (CIs): intervals that contain the true parameter value in some known proportion of repeated samples, on average. The width of confidence intervals is thought to index the precision of an estimate; CIs are thought to be a guide to which parameter values are plausible or reasonable; and the confidence coefficient of the interval (e.g., 95%) is thought to index the plausibility that the true parameter is included in the interval. We show in a number of examples that CIs do not necessarily have any of these properties, and can lead to unjustified or arbitrary inferences. For this reason, we caution against relying upon confidence interval theory to justify interval estimates, and suggest that other theories of interval estimation should be used instead.

Electronic supplementary material: The online version of this article (doi:10.3758/s13423-015-0947-8) contains supplementary material, which is available to authorized users.


Sampling distribution procedure
Consider the sample mean, $\bar{y} = (y_1 + y_2)/2$. As the mean of two uniform deviates, $\bar{y}$ is well known to have a triangular distribution with location $\theta$ and minimum and maximum $\theta - 5$ and $\theta + 5$, respectively. This distribution is shown in Figure 1.
It is desired to find the width of the base of the shaded region in Figure 1 such that it has an area of .5. To do this we first find the width of the base of the unshaded triangular area marked "a" in Figure 1 such that the area of the triangle is .25. The corresponding unshaded triangle on the left side will also have area .25, which means that, since the figure is a density, the shaded region must have the remaining area of .5. Elementary geometry shows that the width of the base of triangle "a" is $5/\sqrt{2}$, meaning that the distance between $\theta$ and the altitude of triangle "a" is $5 - 5/\sqrt{2}$, or about 1.46 meters. We can thus say that
$$\Pr\left(\theta - (5 - 5/\sqrt{2}) < \bar{y} < \theta + (5 - 5/\sqrt{2})\right) = .5,$$
which implies that, in repeated sampling,
$$\Pr\left(\bar{y} - (5 - 5/\sqrt{2}) < \theta < \bar{y} + (5 - 5/\sqrt{2})\right) = .5,$$
which defines the sampling distribution confidence procedure. This is an example of using $\bar{y} - \theta$ as a pivotal quantity (Casella & Berger, 2002). We can also derive the standard deviation of the sampling distribution of $\bar{y}$, also called the standard error. It is defined as
$$SE = \sqrt{\int_{-5}^{5} z^2 p(z)\,dz},$$
where $p(z)$ is the triangular sampling distribution in Figure 1 centered around $\theta = 0$. Solving the integral yields $SE = 5/\sqrt{6} \approx 2.04$ meters.
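The coverage of this confidence procedure and the value of the standard error can both be checked numerically. The following is a minimal simulation sketch; the value of θ and the number of replicates are arbitrary choices for demonstration.

```r
# Simulate repeated sampling from two Uniform(theta - 5, theta + 5) deviates
# and check the 50% coverage of the sampling distribution procedure.
set.seed(1)
theta = 10
reps = 1e5
y1 = runif(reps, theta - 5, theta + 5)
y2 = runif(reps, theta - 5, theta + 5)
ybar = (y1 + y2) / 2

half_width = 5 - 5 / sqrt(2)   # about 1.46 meters
covered = (ybar - half_width < theta) & (theta < ybar + half_width)

mean(covered)   # should be near .5
sd(ybar)        # should be near 5/sqrt(6), about 2.04
```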

Bayesian procedure
The posterior distribution is proportional to the likelihood times the prior. The likelihood is
$$p(y_1, y_2 \mid \theta) \propto I(\theta - 5 < y_1 < \theta + 5) \times I(\theta - 5 < y_2 < \theta + 5),$$
where $I$ is an indicator function. Note that since this is the product of two indicator functions, it can only be nonzero when both indicator functions' conditions are met; that is, when $y_1 + 5$ and $y_2 + 5$ are both greater than $\theta$, and $y_1 - 5$ and $y_2 - 5$ are both less than $\theta$. If the minimum of $y_1 + 5$ and $y_2 + 5$ is greater than $\theta$, then so too must be the maximum. The likelihood thus can be rewritten
$$p(y_1, y_2 \mid \theta) \propto I(x_2 - 5 < \theta < x_1 + 5),$$
where $x_1$ and $x_2$ are the minimum and maximum observations, respectively. If the prior for $\theta$ is proportional to a constant, then the posterior is
$$p(\theta \mid y_1, y_2) \propto I(x_2 - 5 < \theta < x_1 + 5).$$
This posterior is a uniform distribution over all a posteriori possible values of $\theta$ (that is, all $\theta$ values within 5 meters of all observations), has width $10 - (x_2 - x_1)$, and is centered around $\bar{x} = (x_1 + x_2)/2$. Because the posterior comprises all values of $\theta$ the data have not ruled out – and is essentially just the classical likelihood – the width of this posterior can be taken as an indicator of the precision of the estimate of $\theta$. The middle 50% of the likelihood can be taken as a 50% objective Bayesian credible interval. Proof that this Bayesian procedure is also a confidence procedure is trivial and can be found in Welch (1939).
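Because the posterior is uniform, the 50% credible interval has a closed form: the middle half of the interval $(x_2 - 5, x_1 + 5)$. A minimal sketch in R (the function name is ours):

```r
# 50% objective Bayesian credible interval for the submersible example:
# the central half of the uniform posterior on (x2 - 5, x1 + 5).
credible50 = function(y1, y2) {
  x1 = min(y1, y2)                # minimum observation
  x2 = max(y1, y2)                # maximum observation
  center = (x1 + x2) / 2
  width = 10 - (x2 - x1)          # full width of the posterior
  c(lower = center - width / 4, upper = center + width / 4)
}

credible50(-4.5, 4.5)   # widely spread bubbles: very narrow interval (-0.25, 0.25)
```

Note how the interval narrows as the two observations spread apart, reflecting the fact that far-apart bubbles nearly pin down the hatch location.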

BUGS implementation
The submersible example was selected in part because it is so trivial; the confidence intervals and Bayesian credible intervals can be derived with very little effort. However, for more complicated problems, credible intervals can be more challenging to derive. Thankfully, modern Bayesian software tools make estimation of credible intervals in many problems as trivial as stating the problem along with priors on the parameters.
BUGS is a special language that allows users to define a model and prior. Using software that interprets the BUGS language, such as JAGS (Plummer, 2003) or WinBUGS (Lunn, Thomas, Best, & Spiegelhalter, 2000), the model and prior are then combined with the data. The software then outputs samples from the posterior distribution for all the parameters, which can be used to create credible intervals.
A full explanation of how to use the BUGS language is beyond the scope of this supplement. Readers can find more information about using BUGS in Ntzoufras (2009) and Lee and Wagenmakers (2013), and many tutorials are available on the web. Here, we show how to obtain a credible interval for the submersible example (Example 1) using JAGS; in a later section we show how to obtain a credible interval for Example 2, $\omega^2$ in ANOVA designs.
We first define the model and prior in R using the BUGS language. Notice that this simply states the distributions of the data points, along with a prior for $\theta$.

```r
BUGS_model = "
model{
  y1 ~ dunif(theta - 5, theta + 5)
  y2 ~ dunif(theta - 5, theta + 5)
  theta ~ dnorm(theta_mean, theta_precision)
}
"
```

We now define a list of values that will get passed to JAGS. y1 and y2 are the observed data values from Figure 1A, and the prior we choose is an informative prior for demonstration.

```r
for_JAGS = list(
  y1 = -4.5,
  y2 = 4.5,
  theta_mean = -2.5,
  theta_precision = 1/10^2
)
```

Since precision is the reciprocal of variance, the prior on $\theta$ corresponds to a Normal($\mu = -2.5$, $\sigma = 10$) prior. All that remains is to load JAGS, combine the model information in BUGS_model with the data in for_JAGS, and obtain samples from the posterior distribution. Note the resemblance to Figure 5, bottom panel, in the manuscript. We use the summary function on the samples to obtain a point estimate as well as quantiles of the posterior distribution, which can be used to form credible intervals. The 50% central credible interval is the interval between the 25th and 75th percentiles.

In the manuscript, we compare Steiger's (2004) confidence intervals for $\omega^2$ to Bayesian highest posterior density (HPD) credible intervals. In this section we describe how the Bayesian HPD intervals were computed.
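The step of loading JAGS and drawing posterior samples can be sketched as follows, assuming the rjags package (which requires a JAGS installation); the number of iterations is an arbitrary choice for demonstration:

```r
library(rjags)

# Combine the BUGS model with the data and prior settings, then sample.
jags_model = jags.model(textConnection(BUGS_model), data = for_JAGS)
samples = coda.samples(jags_model, variable.names = "theta", n.iter = 10000)

# Point estimate and quantiles; the 25th and 75th percentiles
# give the 50% central credible interval.
summary(samples)
```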
Consider a one-way design with $J$ groups and $N$ observations in each group. Let $y_{ij}$ be the $i$th observation in the $j$th group. Also suppose that
$$y_{ij} \sim \mbox{Normal}(\mu_j, \sigma^2),$$
where $\mu_j$ is the population mean of the $j$th group and $\sigma^2$ is the error variance. We assume a "noninformative" prior on the parameters $\mu, \sigma^2$:
$$p(\mu_1, \ldots, \mu_J, \sigma^2) \propto (\sigma^2)^{-1}.$$
This prior is flat on $(\mu_1, \ldots, \mu_J, \log\sigma^2)$. In application, it would be wiser to assume an informative prior on these parameters, in particular assuming a population over the $\mu$ parameters or even the possibility that $\mu_1 = \ldots = \mu_J = 0$ (Rouder, Morey, Speckman, & Province, 2012). However, for this manuscript we compare against a "non-informative" prior in order to show the differences between the confidence interval and the Bayesian result with "objective" priors.
Assuming the prior above, an elementary Bayesian calculation (Gelman, Carlin, Stern, & Rubin, 2004) reveals that
$$\sigma^2 \mid y \sim \mbox{Inverse Gamma}\left(J(N-1)/2,\, S/2\right),$$
where $S$ is the error sum-of-squares from the corresponding one-way ANOVA, and
$$\mu_j \mid \sigma^2, y \sim \mbox{Normal}\left(\bar{x}_j,\, \sigma^2/N\right),$$
where $\mu_j$ and $\bar{x}_j$ are the true and observed means for the $j$th group. Following Steiger (2004) we can define
$$\alpha_j = \mu_j - \frac{1}{J}\sum_{j=1}^J \mu_j$$
as the deviation from the grand mean of the $j$th group, and
$$\lambda = \frac{N\sum_{j=1}^J \alpha_j^2}{\sigma^2}, \qquad \omega^2 = \frac{\lambda}{\lambda + NJ}.$$
It is then straightforward to set up an MCMC sampler for $\omega^2$. Let $M$ be the number of MCMC iterations desired. We first draw $M$ samples from the marginal posterior distribution of $\sigma^2$, then sample the group means from the conditional posterior distribution for $\mu_1, \ldots, \mu_J$. Using these posterior samples, $M$ posterior samples for $\lambda$ and $\omega^2$ can be computed.
The following R function will sample from the marginal posterior distribution of $\omega^2$:

```r
## Assumes that data.frame y has two columns:
## $y is the dependent variable
## $grp is the grouping variable, as a factor
```

The Bayes.posterior.omega2 function can be used to compute the posterior and HPD for the first example in the manuscript. The fake.data.F function, defined in the file steiger.utility.R (available with the manuscript source code at https://github.com/richarddmorey/ConfidenceIntervalsFallacy), generates a data set with a specified $F$ statistic.
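The full Bayes.posterior.omega2 function is available with the manuscript source code; a minimal sketch of such a sampler, implementing the two-step scheme described above, might look like the following (the function and variable names here are ours, and equal group sizes $N$ are assumed):

```r
# Sample M draws from the marginal posterior of omega^2 for a balanced
# one-way design. y is a data.frame with columns $y (dependent variable)
# and $grp (grouping factor).
posterior_omega2_sketch = function(y, M = 10000) {
  J = nlevels(y$grp)
  N = nrow(y) / J
  xbar = tapply(y$y, y$grp, mean)            # observed group means
  S = sum((y$y - ave(y$y, y$grp))^2)         # error sum-of-squares

  # sigma^2 | y ~ Inverse Gamma(J(N-1)/2, S/2)
  sigma2 = 1 / rgamma(M, shape = J * (N - 1) / 2, rate = S / 2)

  omega2 = numeric(M)
  for (m in 1:M) {
    # mu_j | sigma^2, y ~ Normal(xbar_j, sigma^2 / N)
    mu = rnorm(J, mean = xbar, sd = sqrt(sigma2[m] / N))
    alpha = mu - mean(mu)                    # deviations from grand mean
    lambda = N * sum(alpha^2) / sigma2[m]
    omega2[m] = lambda / (lambda + N * J)
  }
  omega2
}
```

By construction each draw of $\omega^2 = \lambda/(\lambda + NJ)$ lies in $[0, 1)$, and the vector of draws can be summarized with a posterior mean and an HPD interval.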

BUGS implementation
Although the code above can be used to quickly sample $\omega^2$ for any one-way design, it is not particularly generalizable for typical users. We can use the BUGS language for Bayesian modeling to create credible intervals in a way that is more accessible to the general user. At the end of the BUGS model, we define our quantities of interest:

```r
lambda <- N * sum( mu_dev_sq )
omega2 <- lambda / ( lambda + N * J )
```

In the R code below, we define all the constants and the data needed for the analysis, including the prior parameters. These prior parameters were chosen to approximate the "non-informative" prior we used in the previous analysis. As we mentioned in the manuscript, we do not generally advise the use of such non-informative priors; these values are merely chosen for demonstration. In practice, reasonable values would be chosen to inform the analysis.

```r
for_JAGS = list(
  y = y$y,
  group = y$grp,
  N = N,
  J = J,
  NJ = N * J,
  mean_mu = 0,
  precision_mu = 1e-6,
  a_variance = 1e-6,
  b_variance = 1e-6
)
```

The following code joins the model (BUGS_model) with the data and defined constants (for_JAGS) and draws 10,000 samples from the posterior distribution, outputting the samples of omega2, the parameter of interest. Note the close similarity between Figure 4 and Figure 3. We can do whatever we like with these samples; of particular interest would be a point estimate and credible interval. For the point estimate, we might select the posterior mean; for the credible interval, we can compute a highest-density region.
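Given any vector of posterior samples, the posterior mean is immediate, and coda's HPDinterval function computes a highest-density interval from an mcmc object. A self-contained sketch of the same idea – the shortest interval containing the requested posterior mass, appropriate for unimodal posteriors – is:

```r
# Shortest interval containing `prob` of the posterior samples:
# an empirical highest-density region for unimodal posteriors.
hpd_interval = function(samples, prob = 0.5) {
  sorted = sort(samples)
  n = length(sorted)
  k = ceiling(prob * n)                       # samples inside the interval
  widths = sorted[k:n] - sorted[1:(n - k + 1)]
  i = which.min(widths)                       # shortest such window
  c(lower = sorted[i], upper = sorted[i + k - 1])
}

# Illustration with standard normal "samples": the shortest 50% interval
# is approximately (-0.67, 0.67).
hpd_interval(qnorm(ppoints(1e5)), prob = 0.5)
```

For skewed posteriors such as that of $\omega^2$, this shortest interval generally differs from the central (equal-tailed) interval, which is why the manuscript reports HPD intervals.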