# Bayes factor design analysis: Planning for compelling evidence

## Abstract

A sizeable literature exists on the use of frequentist power analysis in the null-hypothesis significance testing (NHST) paradigm to facilitate the design of informative experiments. In contrast, there is almost no literature that discusses the design of experiments when Bayes factors (BFs) are used as a measure of evidence. Here we explore Bayes Factor Design Analysis (BFDA) as a useful tool to design studies for maximum efficiency and informativeness. We elaborate on three possible BF designs, (a) a fixed-*n* design, (b) an open-ended Sequential Bayes Factor (SBF) design, where researchers can test after each participant and can stop data collection whenever there is strong evidence for either \(\mathcal {H}_{1}\) or \(\mathcal {H}_{0}\), and (c) a modified SBF design that defines a maximal sample size where data collection is stopped regardless of the current state of evidence. We demonstrate how the properties of each design (i.e., expected strength of evidence, expected sample size, expected probability of misleading evidence, expected probability of weak evidence) can be evaluated using Monte Carlo simulations and equip researchers with the necessary information to compute their own Bayesian design analyses.

## Keywords

Bayes factor, Power analysis, Design analysis, Design planning, Sequential testing

“The following rule of experimentation is therefore suggested: perform that experiment for which the expected gain in information is the greatest, and continue experimentation until a preassigned amount of information has been attained” (Lindley 1956, p. 987)

We aim to explore *Bayes Factor Design Analysis* (BFDA) as a useful tool to design studies for maximum efficiency and informativeness. In the classical frequentist framework, statistical power refers to the long-term probability (across multiple hypothetical studies) of obtaining a significant *p*-value in case an effect of a certain size exists (Cohen 1988). Classical power analysis is a special case of the broader class of *design analysis*, which uses prior guesses of effect sizes and other parameters in order to compute distributions of any study outcome (Gelman and Carlin 2014).^{1} The general principle is to assume a certain state of reality, most importantly the expected true effect size, and tune the settings of a research design in a way such that certain desirable outcomes are likely to occur. For example, in frequentist power analysis, the property “sample size” of a design can be tuned such that, say, 80 % of all studies would yield a *p*-value <.05 if an effect of a certain size exists.

The framework of design analysis is general and can be used both for Bayesian and non-Bayesian designs, and it can be applied to any study outcome of interest. For example, in designs reporting Bayes factors a researcher can plan sample size such that, say, 80 % of all studies result in a compelling Bayes factor, for instance BF_{10}>10 (Weiss 1997; De Santis 2004). One can also determine the sample size such that, with a desired probability of occurrence, a highest density interval for a parameter excludes zero, or a particular parameter is estimated with a predefined precision (Kruschke 2014; Gelman and Tuerlinckx 2000). Hence, the concept of prospective design analysis, which refers to design planning before data are collected, is not limited to null-hypothesis significance testing (NHST), and our paper applies the concept to studies that use Bayes factors (BFs) as an index of evidence.

The first part of this article provides a short introduction to BFs as a measure of evidence for a hypothesis (relative to an alternative hypothesis). The second part describes how compelling evidence is a necessary ingredient for strong inference, which has been argued to be the fastest way to increase knowledge (Platt 1964). The third part of this article elaborates on how to apply the idea of design analysis to research designs with BFs. The fourth part introduces three BF designs, (a) a fixed-*n* design, (b) an open-ended Sequential Bayes Factor (SBF) design, where researchers can test after each participant and can stop data collection when there is strong evidence for either \(\mathcal {H}_{1}\) or \(\mathcal {H}_{0}\), and (c) a modified SBF design that defines a maximal sample size where data collection is stopped in any case. We demonstrate how to use Monte Carlo simulations and graphical summaries to assess the properties of each design and how to plan for compelling evidence. Finally, we discuss the approach in terms of possible extensions, the issue of (un)biased effect size estimates in sequential designs, and practical considerations.

## Bayes factors as an index of evidence

The Bayes factor is “fundamental to the Bayesian comparison of alternative statistical models” (O’Hagan and Forster 2004, p. 55) and it represents “the standard Bayesian solution to the hypothesis testing and model selection problems” (Lewis and Raftery 1997, p. 648) and “the primary tool used in Bayesian inference for hypothesis testing and model selection” (Berger 2006, p. 378). Here we briefly describe the Bayes factor as it applies to the standard scenario where a precise, point-null hypothesis \(\mathcal {H}_{0}\) is compared to a composite alternative hypothesis \(\mathcal {H}_{1}\). Under a composite hypothesis, the parameter of interest is not restricted to a particular fixed value (Jeffreys 1961). In the case of a *t*-test, for instance, the null hypothesis specifies the absence of an effect, that is, \(\mathcal {H}_{0}: \delta = 0\), whereas the composite alternative hypothesis allows effect size to take on nonzero values.

In order to gauge the support that the data provide for \(\mathcal {H}_{0}\) versus \(\mathcal {H}_{1}\), the Bayes factor hypothesis test requires that both models make predictions. This, in turn, requires that the expectations under \(\mathcal {H}_{1}\) are made explicit by assigning effect size *δ* a prior distribution, for instance a normal distribution centered on zero with a standard deviation of 1, \(\mathcal {H}_{1}: \delta \sim \mathcal {N}(0,1)\).
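For orientation, the ratio of predictive performance that the Bayes factor expresses can be written out in its standard textbook form. The notation below (\(\pi\) for the prior density of *δ* under \(\mathcal {H}_{1}\), "data" for the observed data) is an assumption of this sketch and may differ from the article's own notation.

\[
\mathrm{BF}_{01} = \frac{p(\text{data} \mid \mathcal {H}_{0})}{p(\text{data} \mid \mathcal {H}_{1})},
\qquad
p(\text{data} \mid \mathcal {H}_{1}) = \int p(\text{data} \mid \delta)\, \pi(\delta \mid \mathcal {H}_{1})\, \mathrm{d}\delta,
\qquad
\mathrm{BF}_{10} = \frac{1}{\mathrm{BF}_{01}}.
\]

Under the point null, \(p(\text{data} \mid \mathcal {H}_{0})\) is simply the likelihood evaluated at \(\delta = 0\), whereas under \(\mathcal {H}_{1}\) the likelihood is averaged over the prior on *δ*.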

Depending on the order of numerator and denominator in the ratio, the Bayes factor is either denoted as BF_{01} (“\(\mathcal {H}_{0}\) over \(\mathcal {H}_{1}\)”, as in Eq. (1)) or as its inverse BF_{10} (“\(\mathcal {H}_{1}\) over \(\mathcal {H}_{0}\)”). When the Bayes factor BF_{01} equals 5, this indicates that the data are five times more likely under \(\mathcal {H}_{0}\) than under \(\mathcal {H}_{1}\), meaning that \(\mathcal {H}_{0}\) has issued a better probabilistic prediction for the observed data than did \(\mathcal {H}_{1}\). In contrast, when BF_{01} equals 0.25 the data support \(\mathcal {H}_{1}\) over \(\mathcal {H}_{0}\). Specifically, the data are 1/BF_{01}=BF_{10}=4 times more likely under \(\mathcal {H}_{1}\) than under \(\mathcal {H}_{0}\).

The Bayes factor offers several advantages for the practical researcher (Wagenmakers et al. 2016). First, the Bayes factor quantifies evidence, both for \(\mathcal {H}_{1}\) and for \(\mathcal {H}_{0}\); second, its predictive underpinnings entail that neither \(\mathcal {H}_{0}\) nor \(\mathcal {H}_{1}\) need be “true” for the Bayes factor to be useful (but see van Erven et al. 2012); third, the Bayes factor does not force an all-or-none decision, but instead coherently reallocates belief on a continuous scale; fourth, the Bayes factor distinguishes between absence of evidence and evidence of absence (e.g., Dienes 2014, 2016); fifth, the Bayes factor does not require adjustment for sampling plans (i.e., the Stopping Rule Principle; Bayarri et al. 2016; Berger and Wolpert 1988; Rouder 2014). A practical corollary is that, in contrast to *p*-values, Bayes factors retain their meaning in situations common in ecology and astronomy, where nature provides data over time and sampling plans do not exist (Wagenmakers et al. 2016).

| Bayes factor BF_{10} | Evidence category |
|---|---|
| > 100 | Extreme evidence for \(\mathcal {H}_{1}\) |
| 30 - 100 | Very strong evidence for \(\mathcal {H}_{1}\) |
| 10 - 30 | Strong evidence for \(\mathcal {H}_{1}\) |
| 3 - 10 | Moderate evidence for \(\mathcal {H}_{1}\) |
| 1 - 3 | Anecdotal evidence for \(\mathcal {H}_{1}\) |
| 1 | No evidence |
| 1/3 - 1 | Anecdotal evidence for \(\mathcal {H}_{0}\) |
| 1/10 - 1/3 | Moderate evidence for \(\mathcal {H}_{0}\) |
| 1/30 - 1/10 | Strong evidence for \(\mathcal {H}_{0}\) |
| 1/100 - 1/30 | Very strong evidence for \(\mathcal {H}_{0}\) |
| < 1/100 | Extreme evidence for \(\mathcal {H}_{0}\) |
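As a small illustration, the table above can be turned into a lookup function. The function name `evidence_category` is hypothetical, and because the table does not say to which category an exact boundary value (e.g., exactly 3) belongs, the sketch assumes that a boundary value falls into the weaker category.

```python
def evidence_category(bf10: float) -> str:
    """Map a Bayes factor BF10 to the verbal label from the table above."""
    if bf10 <= 0:
        raise ValueError("A Bayes factor must be positive.")
    if bf10 == 1:
        return "No evidence"
    # Express the evidence as a value >= 1 in favor of the supported hypothesis.
    favored = "H1" if bf10 > 1 else "H0"
    strength = bf10 if bf10 > 1 else 1.0 / bf10
    # Assumption: exact boundary values are assigned to the weaker category.
    if strength > 100:
        label = "Extreme"
    elif strength > 30:
        label = "Very strong"
    elif strength > 10:
        label = "Strong"
    elif strength > 3:
        label = "Moderate"
    else:
        label = "Anecdotal"
    return f"{label} evidence for {favored}"


print(evidence_category(4))     # BF10 = 4
print(evidence_category(0.25))  # BF10 = 0.25, i.e. BF01 = 4
```

Because the labels are symmetric around 1, the function only needs the thresholds 3, 10, 30, and 100 and applies them to BF_{10} or to its reciprocal.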

## The purpose of design analyses: planning for compelling evidence

In the planning phase of an experiment, the purpose of a prospective design analysis is to facilitate the design of a study that ensures a sufficiently high probability of detecting an effect if it exists. Executed correctly, this is a crucial ingredient to *strong inference* (Platt 1964), which includes “[d]evising a crucial experiment [...], with alternative possible outcomes, each of which will, as nearly as possible, exclude one or more of the hypotheses” (p. 347). In other words, a study design with strong inferential properties is likely to provide compelling evidence, either for one hypothesis or for the other. Such a study generally does not leave researchers in a state of inference that is inconclusive.

When a study is underpowered, in contrast, it most likely provides only weak inference. Within the framework of frequentist statistics, underpowered studies result in *p*-values that are relatively nondiagnostic. Specifically, underpowered studies inflate both false-negative and false-positive results (Button et al. 2013; Dreber et al. 2015; Ioannidis 2005; Lakens and Evers 2014), wasting valuable resources such as the time and effort of participants, the lives of animals, and scientific funding provided by society. Consequently, research unlikely to produce diagnostic outcomes is inefficient and can even be considered unethical (Halpern et al. 2002; Emanuel et al. 2000; but see Bacchetti et al. 2005).

To summarize, the primary purpose of a prospective design analysis is to assist in the design of studies that increase the probability of obtaining compelling evidence, a necessary requirement for strong inference.

## Design analysis for Bayes factor designs

We apply design analysis to studies that report the Bayes factor as a measure of evidence. Note, first, that we seek to evaluate the operational characteristics of a Bayesian research design *before* the data are collected (i.e., a prospective design analysis). Therefore, our work centers on design, not on inference; once specific data have been collected, pre-data design analyses are inferentially irrelevant, at least from a Bayesian perspective (Bayarri et al. 2016; Wagenmakers et al. 2014). Second, our focus is on the Bayes factor as a measure of evidence, and we expressly ignore both prior model probabilities and utilities (Berger 1985; Taroni et al. 2010; Lindley 1997), two elements that are essential for decision making yet orthogonal to the quantification of evidence provided by the observed data. Thus, we consider scenarios where “the object of experimentation is not to reach decisions but rather to gain knowledge about the world” (Lindley 1956, p. 986).

### Target outcome of a Bayes factor design analysis: strong evidence and no misleading evidence

In the context of evaluating the empirical support for and against a null hypothesis, Bayes factors quantify the strength of evidence for that null hypothesis \(\mathcal {H}_{0}\) relative to the alternative hypothesis \(\mathcal {H}_{1}\). To facilitate strong inference, we wish to design studies such that they are likely to result in compelling Bayes factors in favor of the true hypothesis – thus, the informativeness of a design may be quantified by the expected Bayes factor (Good 1979; Lindley 1956; Cavagnaro et al. 2009), or an entire distribution of Bayes factors.

Prior to the experiment, one may expect that in the majority of data sets that may be obtained the Bayes factor will point towards the correct hypothesis. However, for particular data sets sampling variability may result in a misleading Bayes factor, that is, a Bayes factor that points towards the incorrect hypothesis. For example, even when \(\mathcal {H}_{0}\) holds in the population, a random sample can show strong evidence in favor of \(\mathcal {H}_{1}\), just by sampling fluctuations. We term this situation *false positive evidence* (FPE). If, in contrast, the data set shows strong evidence for \(\mathcal {H}_{0}\), although in reality \(\mathcal {H}_{1}\) is correct, we term this *false negative evidence* (FNE). In general terms, misleading evidence is defined as a situation where the data show strong evidence in favor of the incorrect hypothesis (Royall 2000).

Research designs differ with respect to their probability of generating misleading evidence. The probability of yielding misleading evidence is a pre-data concept that should not be confused with a related but different post-data concept, namely the probability that the evidence obtained in a particular data set is misleading (Blume 2002).

The expected strength of evidence (i.e., the expected BF) and the probability of misleading evidence are conceptually distinct, but practically tightly related properties of a research design (Royall 2000), as in general higher evidential thresholds will lead to lower rates of misleading evidence (Blume 2008; Schönbrodt et al. 2015). To summarize, the joint goal of a prospective design analysis should be a high probability of obtaining strong evidence and a low probability of obtaining misleading evidence, which usually go together.

### Dealing with uncertainty in expected effect size

Power in a classical power analysis is a *conditional power*, because the computed power is conditional on the assumed true (or minimally interesting) effect size. One difficulty is to commit to a point estimate of that parameter when there is considerable uncertainty about it. This uncertainty could be dealt with by computing the necessary sample size for a *set* of plausible fixed parameter values. For example, previous experiments may suggest that the true effect size is around 0.5, but a researcher feels that the true effect could as well be 0.3 or 0.7, and computes the necessary sample sizes for these effect size guesses as well. Such a sensitivity analysis gives an idea about the variability of resulting sample sizes.
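A minimal sketch of such a sensitivity analysis is given below. It assumes a two-sided, two-sample test with α = .05 and 80 % power, and it uses the normal approximation \(n = 2\,[(z_{1-\alpha/2} + z_{\text{power}})/d]^{2}\) per group, which slightly underestimates the exact *t*-test sample size. The function name `n_per_group` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-group n for a two-sided, two-sample test of effect size d.

    Normal approximation: n = 2 * ((z_{1 - alpha/2} + z_{power}) / d) ** 2.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


# Sensitivity analysis over the set of plausible effect sizes from the text.
for d in (0.3, 0.5, 0.7):
    print(f"d = {d}: about {n_per_group(d)} participants per group")
```

Because the required sample size scales with \(1/d^{2}\), the pessimistic guess of 0.3 demands several times as many participants as the optimistic guess of 0.7.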

A problem of this approach, however, is that there is no principled way of choosing an appropriate sample size from this set: Should the researcher aim for the conservative estimate, which would be highly inefficient in case the true effect is larger? Or should she aim for the optimistic estimate, which would lead to a low actual power if the true effect size is at the lower end of plausible values?

#### Prior effect size distributions quantify uncertainty

Extending the procedure of a sensitivity analysis, however, one can compute the probability of achieving a research goal averaged across all possible effect sizes. For this purpose, one has to define prior plausibilities of the effect sizes, compute the distribution of target outcomes for each effect size, and then obtain a weighted average. This averaged probability of success has been called “assurance” (O’Hagan et al. 2005) or “expected Bayesian power” (Spiegelhalter et al. 2004), and is the expected probability of success with respect to the prior.^{2}
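The weighted average can be sketched numerically as follows. The sketch assumes that "success" means a significant two-sided, two-sample test (normal approximation) and that the prior plausibility of the effect size is \(\mathcal {N}(0.5, \sigma = 0.1)\); both choices, as well as the function names, are illustrative assumptions rather than part of the original analysis.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_two_sample(d, n, alpha=0.05):
    """Conditional power of a two-sided two-sample test (normal approximation)."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n / 2)  # expected z statistic with n participants per group
    return (1 - STD_NORMAL.cdf(z_crit - shift)) + STD_NORMAL.cdf(-z_crit - shift)


def assurance(n, prior_mean=0.5, prior_sd=0.1, alpha=0.05, grid=2001):
    """Prior-weighted average of the conditional power (midpoint rule)."""
    prior = NormalDist(prior_mean, prior_sd)
    lo, hi = prior_mean - 6 * prior_sd, prior_mean + 6 * prior_sd
    step = (hi - lo) / grid
    total = 0.0
    for i in range(grid):
        d = lo + (i + 0.5) * step
        total += power_two_sample(d, n, alpha) * prior.pdf(d) * step
    return total


n = 64
print(f"conditional power at d = 0.5: {power_two_sample(0.5, n):.3f}")
print(f"assurance under the prior:    {assurance(n):.3f}")
```

In this setting the assurance is somewhat lower than the conditional power at the prior mean, because the power lost for smaller effect sizes is not fully compensated by the power gained for larger ones.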

Garthwaite et al. (2005) give advice on how to elicit a prior distribution from experts. These procedures help an expert to formulate his or her substantive knowledge in probabilistic form, which in turn can be used for Bayesian computations. Such an elicitation typically includes several steps, for example asking experts about the most plausible value (i.e., about the mode of the prior), or asking about the quantiles, such as ‘Please make a guess about a very high value, such that you feel there is only a 5 % probability the true value would exceed your guess’.

Morris et al. (2014) provide an online tool that can help to fit an appropriate distribution to an expert’s input.^{3}

#### Design priors vs. analysis priors

Two types of priors can be differentiated (Walley et al. 2015; O’Hagan and Stevens 2001). *Design priors* are used before data collection to quantify prior beliefs about the true state of nature. These design priors are used to do design analyses and in general to assist experimental design. *Analysis priors*, in contrast, are used for Bayesian statistical analysis after the data are in.

At first glance it might appear straightforward to use the same priors for design planning and for data analysis. The two types of priors, however, can serve different goals. The design prior is used to tune the design before data collection to make compelling evidence likely and to avoid misleading evidence. The target audience for a design analysis is mainly the researcher him- or herself, who wants to design the most informed study. Hence, design priors should be based on the researcher’s experience and can contain a lot of existing prior information to aid an optimal planning of the study’s design. Relying on a non-central, highly informative prior (in the extreme case, a point effect size guess as in classical power analysis) can result in a highly efficient design (i.e., with a just large-enough sample size) if the real effect size is close to that guess. On the other hand, it carries the risk of ending up with inconclusive evidence if the true effect is actually smaller. A less informative design prior, in contrast, will typically lead to larger planned sample sizes, as more plausibility is assigned to smaller effect sizes.^{4} This increases the chances of compelling evidence in the actual data analysis, but can be inefficient compared to a design that uses a more precise (and valid) effect size guess. Researchers may balance that trade-off based on their subjective certainty about plausible effect sizes, utilities about successful or failed studies, or budget constraints. Whenever prospective design analyses are used to motivate sample size costs in grant applications, the design priors should be convincing to the funder and the grant reviewers.

The analysis priors that are used to compute the BF, in contrast, should be convincing to a skeptical target audience, and therefore often are less informative than the design priors. In the examples of this paper, we will use an informed, non-central prior distribution for the planning stage, but a default effect size prior (which is less informative) for data analysis.

## Three exemplary designs for a Bayes factor design analysis

1. *Fixed-n design*: In this design, a sample of fixed size is collected and one data analysis is performed at the end. From this perspective, one can ask the following design-related questions: Given a fixed sample size and the expected effect size – what BFs can be expected? What sample size do I need to have at least a 90 % probability of obtaining a BF_{10} of, say, 6 or greater? What is the probability of obtaining misleading evidence?
2. *Open-ended sequential designs*: Here participants are added to a growing sample and BFs are computed until a desired level of evidence is reached (Schönbrodt et al. 2015). As long as researchers do not run out of participants, time, or money, this approach eliminates the possibility of ending up with weak evidence. With this design, one can ask the following design-related questions: Given the desired level of evidence and the expected effect size – what distribution of sample sizes can be expected? What is the probability of obtaining misleading evidence?
3. *Sequential designs with maximal n*: In this modification of the open-ended SBF design, participants are added until (a) a desired level of evidence is obtained, or (b) a maximum number of participants has been reached. If sampling is stopped because of (b), the evidence will not be as strong as desired initially, but the direction and the strength of the BF can still be interpreted. With this design, one can ask the following design-related questions: Given the desired level of evidence, the expected effect size, and the maximum sample size – what distribution of sample sizes can be expected? How many studies can be expected to stop because of crossing the evidential threshold, and how many because *n*_{max} has been reached? What is the probability of obtaining misleading evidence?

As most design planning concerns directional hypotheses, we will focus on these in this paper. Furthermore, in our examples we use the JZS default Bayes factor for a two-group *t*-test provided in the *BayesFactor* package (Morey and Rouder 2015) for the R Environment for Statistical Computing (R Core Team 2014) and in JASP (JASP Team 2016). The JZS Bayes factor assumes that effect sizes under \(\mathcal {H}_{1}\) (expressed as Cohen’s *d*) follow a central Cauchy distribution (Rouder et al. 2009). The Cauchy distribution with a scale parameter of 1 equals a *t* distribution with one degree of freedom. This prior has several convenient properties and can be used as a default choice when no specific information about the expected effect sizes is available. The width of the Cauchy distribution can be tuned using the scale parameter, which corresponds to smaller or larger plausible effect sizes. In our examples below, we use a default scale parameter of \(\sqrt {2}/2\). This corresponds to the prior expectation that 50 % of probability mass is placed on effect sizes that have an (absolute) size smaller than \(\sqrt {2}/2\), and 50 % larger than \(\sqrt {2}/2\). Note that all computations and procedures outlined here are not restricted to these specific choices and can be easily generalized to undirected tests and all other flavors of Bayes factors as well (Dienes 2014).
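The 50 % statement follows directly from the distribution function of a central Cauchy distribution. As a short worked check, writing *r* for the scale parameter (a symbol introduced here for illustration):

\[
P(|\delta| \le x) = \frac{2}{\pi} \arctan\!\left(\frac{x}{r}\right),
\qquad
P(|\delta| \le r) = \frac{2}{\pi} \arctan(1) = \frac{2}{\pi} \cdot \frac{\pi}{4} = \frac{1}{2}.
\]

The scale parameter is thus the median of the absolute effect size under the prior, whatever value of *r* is chosen; with \(r = \sqrt {2}/2\), half of the prior mass lies on \(|\delta| < \sqrt {2}/2\).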

### Fixed-*n* design

With a pre-determined sample size, the following questions can be asked in a design analysis: (a) What is the expected distribution of obtained evidence? (b) What is the probability of obtaining misleading evidence? (c) Sample size determination: What sample size is necessary such that compelling evidence can be expected with sufficiently high probability?

Monte Carlo simulations can be used to answer these questions easily. In our example, we focus on a test for the difference between two population means (i.e., a Bayesian *t*-test; Rouder et al. 2009). For didactic purposes, we demonstrate this design analysis with a fixed expected effect size (i.e., without a prior distribution). This way the design analysis is analogous to a classical power analysis in the NHST paradigm, which also assumes a fixed effect size under \(\mathcal {H}_{1}\).

1. Define a population that reflects the expected effect size under \(\mathcal {H}_{1}\) and, if prior information is available, other properties of the real data (e.g., specific distributional properties). In the example given below, we used two populations with normal distributions and a fixed standardized mean difference of *δ* = 0.5.
2. Draw a random sample of size *n*_{fixed} from the populations (all *n* refer to sample size in each group).
3. Compute the BF for that simulated data using the analysis prior that will also be used in the actual data analysis and save the result. In the example given below, we analyzed simulated data with a Cauchy prior (scale parameter = \(\sqrt {2}/2\)).
4. Repeat steps 2 and 3, say, 10,000 times.
5. In order to compute the probability of false-positive evidence, the same simulation must be done under the \(\mathcal {H}_{0}\) (i.e., two populations that have no mean difference).^{5}
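The steps above can be sketched in a few lines of Python. To stay self-contained, the sketch deliberately swaps the JZS (Cauchy-prior) Bayes factor for a simpler closed-form Bayes factor: it assumes a known unit variance in both groups and a normal analysis prior \(\delta \sim \mathcal {N}(0,1)\), and it is two-sided rather than directional. The resulting rates will therefore not reproduce the numbers reported in this article; only the simulation logic carries over. All function names are hypothetical.

```python
import random
from math import exp, sqrt


def bf10_normal_prior(mean_diff, n, tau):
    """BF10 for H0: delta = 0 vs. H1: delta ~ N(0, tau^2).

    Assumes two groups of n observations with known unit variance, so the
    observed mean difference has sampling variance 2 / n. This is NOT the
    JZS Bayes factor used in the article.
    """
    s2 = 2.0 / n
    shrink = s2 / (s2 + tau ** 2)
    return sqrt(shrink) * exp(0.5 * mean_diff ** 2 * tau ** 2 / (s2 * (s2 + tau ** 2)))


def simulate_fixed_n(delta, n, reps=10_000, tau=1.0, lower=1 / 6, upper=6.0, seed=1):
    """Steps 1-4: simulate `reps` studies and tabulate the evidence they yield."""
    rng = random.Random(seed)
    counts = {"support_H1": 0, "support_H0": 0, "inconclusive": 0}
    for _ in range(reps):
        # Step 2: draw one sample of size n from each population.
        group1 = [rng.gauss(delta, 1.0) for _ in range(n)]
        group2 = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mean_diff = sum(group1) / n - sum(group2) / n
        # Step 3: compute the BF with the analysis prior and classify it.
        bf10 = bf10_normal_prior(mean_diff, n, tau)
        if bf10 > upper:
            counts["support_H1"] += 1
        elif bf10 < lower:
            counts["support_H0"] += 1
        else:
            counts["inconclusive"] += 1
    return {key: value / reps for key, value in counts.items()}


under_h1 = simulate_fixed_n(delta=0.5, n=100, reps=2000)
under_h0 = simulate_fixed_n(delta=0.0, n=100, reps=2000)  # step 5
print("H1 true:", under_h1)
print("H0 true:", under_h0)
```

Under \(\mathcal {H}_{1}\), the entry `support_H0` estimates the rate of false negative evidence; under \(\mathcal {H}_{0}\), the entry `support_H1` estimates the rate of false positive evidence. Replacing `bf10_normal_prior` with a function that computes the JZS Bayes factor from the raw samples would bring the sketch closer to the analysis described here.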

The proposed simulations provide a distribution of obtained BFs under \(\mathcal {H}_{1}\), and another distribution under \(\mathcal {H}_{0}\). For these distributions, one can set several thresholds and retrieve the probability that a random study will provide a BF in a certain evidential category. For example, one can set a single threshold at BF_{10} = 1 and compute the probability of obtaining a BF with the wrong direction. Or, one can aim for more compelling evidence and set thresholds at BF_{10} = 6 and BF_{10} = 1/6. This means evidence is deemed inconclusive when 1/6<BF_{10}<6. Furthermore, one can define asymmetric thresholds under \(\mathcal {H}_{0}\) and \(\mathcal {H}_{1}\). Depending on the analysis prior in the computation of the BF, it can be expensive and time-consuming to gather strong evidence for \(\mathcal {H}_{0}\). In these cases one can relax the requirements for strong \(\mathcal {H}_{0}\) support and still aim for strong \(\mathcal {H}_{1}\) support, for example by using thresholds 1/6 and 20 (Weiss 1997).

#### Expected distribution of BFs and rates of misleading evidence

Figure 3 shows the BF_{10} distribution that can be expected under \(\mathcal {H}_{1}\) (top row) and under \(\mathcal {H}_{0}\) (bottom row). The simulations were conducted with two fixed sample sizes: *n* = 20 (left column) and *n* = 100 (right column). Evidence thresholds were defined at 1/6 and 6. If an effect of *δ* = 0.5 exists and studies with *n* = 20 are conducted, 0.3 % of all simulated studies point towards the (wrong) \(\mathcal {H}_{0}\) (BF_{10} < 1/6). This is the rate of false negative evidence, and it is visualized as the dark grey area in the top density of Fig. 3A. Conversely, 21.1 % of studies show \(\mathcal {H}_{1}\) support (BF_{10} > 6; light grey area in the top density), which is the probability of true positive results. The remaining 78.5 % of studies yield inconclusive evidence (1/6 < BF_{10} < 6; medium grey area in the top density).

If, however, no effect exists (see bottom density of Fig. 3A), 0.9 % of all studies will yield false-positive evidence (BF_{10}>6), and 13.7 % of all studies correctly support \(\mathcal {H}_{0}\) with the desired strength of evidence (BF_{10}<1/6). A large majority of studies (85.5 %) remain inconclusive under \(\mathcal {H}_{0}\) with respect to that threshold. Hence, a design with that fixed sample size has a high probability of being uninformative under \(\mathcal {H}_{0}\).

With increasing sample size the BF distributions under \(\mathcal {H}_{1}\) and \(\mathcal {H}_{0}\) diverge (see Fig. 3B), making it more likely to obtain compelling evidence for either hypothesis. Consequently, the probability of misleading evidence and the probability of inconclusive evidence are reduced. At *n* = 100 and evidential thresholds of 6 and 1/6 the rate of false negative evidence drops from 0.3 % to virtually 0 %, and the rate of false positive evidence drops from 0.9 % to 0.6 %. The probability of detecting an existing effect of *δ* = 0.5 increases from 21.1 % to 84.0 %, and the probability of finding evidence in favor of a true \(\mathcal {H}_{0}\) increases from 13.7 % to 53.4 %.

#### Sample size determination

For sample size determination, simulated sample sizes can be adjusted until the computed probability of achieving a research goal under \(\mathcal {H}_{1}\) is close to the desired level. In our example, the necessary sample size for achieving a BF_{10}>6 under \(\mathcal {H}_{1}\) with a probability of 95 % would be *n* = 146. Such a fixed-*n* Bayes factor design with *n* = 146 implies a false negative rate of virtually 0 %, and, under \(\mathcal {H}_{0}\), a false positive rate of 0.4 % and a probability of 61.5 % to correctly support \(\mathcal {H}_{0}\).
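The search for a sufficient sample size can be sketched as a simple upward scan over *n*. As a simplification, the sketch again replaces the JZS Bayes factor with a closed-form Bayes factor (known unit variance, normal analysis prior \(\delta \sim \mathcal {N}(0,1)\), two-sided), for which the probability of exceeding a threshold can be computed analytically instead of by simulation. The resulting *n* is therefore not expected to match the value reported above, and the function names are hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist


def prob_bf10_exceeds(threshold, delta, n, tau=1.0):
    """P(BF10 > threshold) for a fixed true effect delta and n per group.

    Assumes threshold >= 1, known unit variance, and the analysis prior
    delta ~ N(0, tau^2); under these assumptions BF10 > threshold is
    equivalent to |mean difference| > critical.
    """
    s2 = 2.0 / n  # sampling variance of the mean difference
    log_term = log(threshold) + 0.5 * log((s2 + tau ** 2) / s2)
    critical = sqrt(2.0 * s2 * (s2 + tau ** 2) * log_term / tau ** 2)
    sampling = NormalDist(delta, sqrt(s2))
    return (1 - sampling.cdf(critical)) + sampling.cdf(-critical)


def required_n(delta, threshold=6.0, target=0.95, tau=1.0, n_max=10_000):
    """Smallest n per group (scanning upward from 2) that reaches the target."""
    for n in range(2, n_max + 1):
        if prob_bf10_exceeds(threshold, delta, n, tau) >= target:
            return n
    raise ValueError("target probability not reached within n_max")


n_needed = required_n(delta=0.5)
print(f"n per group for P(BF10 > 6) >= 0.95: {n_needed}")
print(f"false positive rate at that n: {prob_bf10_exceeds(6.0, 0.0, n_needed):.4f}")
```

With a simulation-based Bayes factor, the same scan applies, but each candidate *n* then requires its own Monte Carlo run, so a coarser grid or a bisection search is usually preferable.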

In a pre-data design perspective the focus is on the frequentist properties of BFs. We should mention that this can be complemented by investigating the Bayesian properties of BFs. From that perspective, one can look at the probability of a hypothesis being true given a certain BF (Rouder 2014). When \(\mathcal {H}_{1}\) and \(\mathcal {H}_{0}\) have equal prior probability, and when the analysis prior equals the design prior, then a single study with a BF_{10} of, say, 6 has 6:1 odds of stemming from \(\mathcal {H}_{1}\).
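As a worked version of this statement, Bayes' rule in odds form gives, for equal prior probabilities and BF_{10} = 6:

\[
\frac{P(\mathcal {H}_{1} \mid \text{data})}{P(\mathcal {H}_{0} \mid \text{data})}
= \mathrm{BF}_{10} \times \frac{P(\mathcal {H}_{1})}{P(\mathcal {H}_{0})}
= 6 \times 1 = 6,
\qquad
P(\mathcal {H}_{1} \mid \text{data}) = \frac{6}{6 + 1} \approx 0.857.
\]

Posterior odds of 6:1 thus correspond to a posterior probability of about 86 % for \(\mathcal {H}_{1}\), provided the conditions named above hold.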

The goal of obtaining strong evidence can be achieved by planning a sample size that ensures a strong enough BF with sufficient probability. There is, however, an easier way that *guarantees* compelling evidence: Sample sequentially and compute the BF until the desired level of evidence is achieved. This design will be explained in the next section.

### Open-ended sequential Bayes factor design: *SBF*

In the planning phase of an experiment, it is often difficult to decide on an expected or minimally interesting effect size. If the planned effect size is smaller than the true effect size, the fixed *n* will be inefficient. More often, presumably, the effect size is overestimated in the planning stage, leading to a smaller actual probability to detect a true effect.

A proposed solution that is less dependent on the true effect size is the Sequential Bayes Factor (SBF) design (Schönbrodt et al. 2015). In this design, the sample size is increased until the desired level of evidence for \(\mathcal {H}_{1}\) or \(\mathcal {H}_{0}\) has been reached (see also Wald 1945; Kass and Raftery 1995; Berger et al. 1994; Dienes 2008; Lindley 1956). This principle of “accumulation of evidence” is also central to optimal models for human perceptual decision making (e.g., random walk models, diffusion models; Bogacz et al. 2006; Forstmann et al. 2016). This accumulation principle allows a flexible adaptation of the sample size based on the actual empirical evidence.

In the planning phase of an SBF design, researchers define an a priori threshold that represents the desired grade of evidence, for example a BF_{10} of 6 for \(\mathcal {H}_{1}\) and the reciprocal value of 1/6 for \(\mathcal {H}_{0}\). Furthermore, an analysis prior for the effect sizes under \(\mathcal {H}_{1}\) is defined in order to compute the BF. Finally, the researcher may determine a minimum number of participants to be collected regardless, before the optional stopping phase of the experiment (e.g., *n*_{min} = 20 per group).

After a sample of *n*_{min} participants has been collected, a BF is computed. If this BF does not exceed the \(\mathcal {H}_{1}\) threshold or the \(\mathcal {H}_{0}\) threshold, the sample size is increased as often as desired and a new BF computed at each stage (even after each participant). As soon as one of the thresholds is reached or exceeded, sampling can be stopped. One prominent advantage of sequential designs is that sample sizes are in most cases smaller than those from fixed-*n* designs with the same error rates.^{6} For example, in typical scenarios the SBF design for comparing two group means yielded about 50 % smaller samples on average compared to the optimal NHST fixed-*n* design with the same error rates (Schönbrodt et al. 2015).

With regard to design analysis in an SBF design, one can ask: (a) What is the probability of obtaining misleading evidence by stopping at the wrong threshold? (b) What is the expected sample size until an evidential threshold is reached?

1. Define a population that reflects the expected effect size under \(\mathcal {H}_{1}\) and, if prior information is available, other properties of the real data. In the example given below, we used two populations with normal distributions and a standardized mean difference that has been drawn from a normal distribution \(\mathcal {N} (0.5, \sigma = 0.1)\) at each iteration.
2. Draw a random sample of size *n* _{ m i n } from the populations.
3. Compute the BF for that simulated data set, using the analysis prior that will also be used in the actual data analysis (in our example: a Cauchy prior with scale parameter = \(\sqrt {2}/2\)). If the BF exceeds the \(\mathcal {H}_{1}\) or the \(\mathcal {H}_{0}\) threshold (in our example: > 6 or < 1/6), stop sampling, and save the final BF and the current sample size. If the BF does not exceed a threshold yet, increase sample size (in our example: by 1 in each group). Repeat step 3 until one of both thresholds is exceeded.
4. Repeat steps 1 to 3, say, 10,000 times.
5. In order to compute the rate of false-positive evidence and the expected sample size under \(\mathcal {H}_{0}\), the same simulation must be done under the \(\mathcal {H}_{0}\) (i.e., two populations that have no mean difference).
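
The five steps above can be sketched in a few lines of Python. This is a minimal illustration under loudly stated assumptions, not the authors' R code. To stay free of dependencies, it replaces the Cauchy-prior *t*-test BF with a closed-form BF for a two-group *z*-test with known unit variance and a normal analysis prior \(\delta \sim \mathcal {N}(0, \tau ^{2})\) with \(\tau = \sqrt {2}/2\). The function names, the `n_cap` safety guard, and the seed are hypothetical, and the resulting numbers will not reproduce those reported in this paper.

```python
import math
import random
import statistics


def bf10_normal_prior(mean_diff, n_per_group, tau=math.sqrt(2) / 2):
    """BF10 for H0: delta = 0 vs. H1: delta ~ N(0, tau^2), known SD = 1.

    ASSUMPTION: closed-form stand-in for the Cauchy-prior t-test BF.
    """
    se2 = 2.0 / n_per_group            # variance of the observed mean difference
    shrink = se2 / (tau ** 2 + se2)    # ratio of H0 to H1 predictive variances
    return math.sqrt(shrink) * math.exp(0.5 * mean_diff ** 2 * (1.0 - shrink) / se2)


def run_sbf_study(rng, delta, n_min=20, upper=6.0, lower=1 / 6, step=1,
                  n_cap=100_000):
    """Steps 2-3: sample n_min per group, then add `step` per group until a
    threshold is reached. Returns (final BF10, n per group)."""
    n, sum_diff = 0, 0.0
    target = n_min
    while True:
        while n < target:
            # one new participant per group: group 1 has mean delta, group 2 mean 0
            sum_diff += rng.gauss(delta, 1.0) - rng.gauss(0.0, 1.0)
            n += 1
        bf = bf10_normal_prior(sum_diff / n, n)
        # n_cap is a hypothetical guard against endless loops, not part of the design
        if bf >= upper or bf <= lower or n >= n_cap:
            return bf, n
        target = n + step


def simulate_sbf(n_studies, under_h1, seed=1, **design):
    """Steps 1, 4, 5: repeat the study under H1 (delta ~ N(0.5, 0.1)) or H0."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_studies):
        delta = rng.gauss(0.5, 0.1) if under_h1 else 0.0
        results.append(run_sbf_study(rng, delta, **design))
    upper = design.get("upper", 6.0)
    ns = [n for _, n in results]
    return {
        # true positive rate under H1, false positive rate under H0
        "rate_h1_threshold": sum(bf >= upper for bf, _ in results) / n_studies,
        "mean_n": statistics.mean(ns),
        "median_n": statistics.median(ns),
    }


if __name__ == "__main__":
    print("under H1:", simulate_sbf(2000, under_h1=True))
    print("under H0:", simulate_sbf(2000, under_h1=False))
```

Passing `step=5` or `step=10` to `simulate_sbf` checks the BF less often, which is one way to explore how the testing schedule changes the operating characteristics of the design.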

This design can completely eliminate weak evidence, as data collection is continued until evidence is conclusive in either direction. The consistency property ensures that BFs ultimately drift either towards 0 or towards *∞* and every study ends up producing compelling evidence – unless researchers run out of time, money, or participants (Edwards et al. 1963). We call this design “open-ended” because there is no fixed termination point defined a priori (in contrast to the SBF design with maximal sample size, which is outlined below). “Open-ended”, however, does not imply that data collection can continue forever without hitting a threshold; on the contrary, the consistency property of BFs guarantees that the probability of collecting samples indefinitely is zero.

Figure 4 (top) shows the evolution of the BF_{10} in several studies where the true effect size follows the prior distribution displayed in Fig. 1. Each grey line in the plot shows how the BF_{10} of a specific study evolves with increasing *n*. Some studies hit the (correct) \(\mathcal {H}_{1}\) boundary sooner, some later, and the distribution of stopping-*n*s is visualized as the density on top of the \(\mathcal {H}_{1}\) boundary. Although all trajectories are guaranteed to drift towards and across the correct threshold in the limiting case, some hit the wrong \(\mathcal {H}_{0}\) threshold prematurely. Most misleading evidence happens at early stages of the sequential sampling. Consequently, increasing *n* _{ m i n } also decreases the rate of misleading evidence (Schönbrodt et al. 2015). Figure 4 (bottom) shows the same evolution of BFs under \(\mathcal {H}_{0}\).

#### Expected rates of misleading evidence

If one updates the BF after each single participant under this \(\mathcal {H}_{1}\) of \(d \sim \mathcal {N} (0.5, \sigma = 0.1)\) and evidential thresholds at 6 and 1/6, 97.2 % of all studies stop at the correct \(\mathcal {H}_{1}\) threshold (i.e., the true positive rate), 2.8 % stop incorrectly at the \(\mathcal {H}_{0}\) threshold (i.e., the false negative rate). Under the \(\mathcal {H}_{0}\), 93.8 % terminate at the correct \(\mathcal {H}_{0}\) threshold, and 6.2 % at the incorrect \(\mathcal {H}_{1}\) threshold (i.e., the false positive rate).

The algorithm above computes the BF after each single participant. The more often a researcher checks whether the BF has exceeded the thresholds, the higher the probability of misleading evidence, because the chances are increased that the stop is at a random extreme value. In contrast to NHST, however, where the probability of a Type-I error can be pushed towards 100 % if enough interim tests are performed (Armitage et al. 1969), the rate of misleading evidence has an upper limit in the SBF design. When the simulations are conducted with interim tests after each single participant, one obtains the upper bound on the rate of misleading evidence. In the current example this leads to a maximal FPE rate of 6.2 %. If the BF is computed after every 5 participants, the rate is reduced to 5.2 %, after every 10 participants to 4.5 %. It should be noted that these changes in FPE rate are, from an inferential Bayesian perspective, irrelevant (Rouder 2014).
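
A standard argument, added here as a supplementary sketch and not taken from the cited sources, shows why such an upper limit exists. Assume a point \(\mathcal {H}_{0}\) without nuisance parameters and a proper prior under \(\mathcal {H}_{1}\), so that the BF is a ratio of proper marginal likelihoods (default BFs with improper priors on nuisance parameters need additional care). Under \(\mathcal {H}_{0}\), the sequence of BF_{10} values over increasing *n* is then a nonnegative martingale with expectation 1, and Ville's inequality bounds the probability of ever reaching a threshold *k*:

```latex
\Pr\!\left( \sup_{n \ge 1} \mathrm{BF}_{10}^{(n)} \ge k \;\middle|\; \mathcal{H}_{0} \right) \le \frac{1}{k},
\qquad \text{so for } k = 6: \quad
\Pr(\text{false positive evidence}) \le \tfrac{1}{6} \approx 16.7\,\%.
```

Under these assumptions the bound holds no matter how often the BF is inspected, and the simulated maximal FPE rate of 6.2 % lies well below it.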

#### Expected sample size

In the above example, the average sample size at the stopping point (across both threshold hits) under \(\mathcal {H}_{1}\) is *n* = 53, the median sample size is *n* = 36, and 80 % of all studies stop with fewer than 74 participants. Under \(\mathcal {H}_{0}\), the sample size is on average 93, median = 46, and 80 % quantile = 115. Hence, although the SBF design has no a priori defined upper limit of sample size, the prospective design analysis reveals estimates of the expected sample sizes.

Furthermore, this example highlights the efficiency of the sequential design. A fixed-*n* Bayes factor design that also aims for evidence with BF_{10}≥6 (resp. ≤1/6) with the same true positive rate of 97.2 % requires *n* = 241 participants (but will have different rates of misleading evidence).

### Sequential Bayes factor with maximal *n*: *SBF+maxN*

The SBF design is attractive because a study is guaranteed to end up with compelling evidence. A practical drawback of the open-ended SBF design, however, is that the BF can meander in the inconclusive region for hundreds or even thousands of participants when effect sizes are very small (Schönbrodt et al. 2015). In practice, researchers do not have unlimited resources, and usually want to set a maximum sample size based on budget, time, or availability of participants.

The *SBF+maxN* design extends the SBF design with such an upper limit on the sample size. Data collection is stopped whenever one of both evidential thresholds has been exceeded, or when the a priori defined maximal sample size has been reached. When sampling is stopped because *n* _{max} has been reached, one can still interpret the final BF. Although it has not reached the threshold for compelling evidence, its direction and strength can still be interpreted.

When planning an *SBF+maxN* design, one can ask: (a) How many studies can be expected to stop because of crossing an evidential threshold, and how many because of reaching *n* _{max}?, (b) What is the probability of obtaining misleading evidence?, (c) If sampling stopped at *n* _{max}: How many of these studies have a BF that points into the correct direction? (d) What distribution of sample sizes can be expected?

Again, Monte Carlo simulations can be used to examine the operational characteristics of this design. The computation is equivalent to the SBF design above, with the only exception that step 3 is terminated when the BF exceeds the \(\mathcal {H}_{1}\) or \(\mathcal {H}_{0}\) threshold, or *n* reaches *n* _{max}.

To highlight the flexibility and practicality of the *SBF+maxN* design, we consider a hypothetical scenario in which a researcher intends to test as efficiently as possible, has practical limitations on the maximal sample size, and wants to keep the rate of false positive evidence low. To achieve this goal, we introduce some changes to the example from the open-ended *SBF* design above: Asymmetric boundaries, a different minimal sample size, and a maximum sample size.

False positive evidence happens when the \(\mathcal {H}_{1}\) boundary is hit prematurely although \(\mathcal {H}_{0}\) is true. As most misleading evidence happens at early terminations of a sequential design, the FPE rate can be reduced by increasing *n* _{ m i n } (say, *n* _{ m i n } = 40). Furthermore, the FPE rate can be reduced by a high \(\mathcal {H}_{1}\) threshold (say, BF_{10} ≥ 30). With an equally strong threshold for \(\mathcal {H}_{0}\) (1/30), however, the expected sample size can easily go into the thousands under \(\mathcal {H}_{0}\) (Schönbrodt et al. 2015). To avoid such a protraction, the researcher may set a lenient \(\mathcal {H}_{0}\) threshold of BF_{10} < 1/6. Finally, due to budget restrictions, the maximum affordable sample size is defined as *n* _{max} = 100. With these settings, the researcher accepts a higher expected rate of false negative evidence (caused by the lenient \(\mathcal {H}_{0}\) threshold) and some probability of weak evidence (when the study is terminated at *n* _{max}) in exchange for a smaller expected sample size, a low rate of false positive evidence, and the certainty that the sample size does not exceed *n* _{max}.
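
The settings just described (asymmetric thresholds at 30 and 1/6, *n* _{ m i n } = 40, *n* _{max} = 100) can be plugged into a small Monte Carlo sketch. As a loudly stated assumption, the code uses a closed-form BF for a two-group *z*-test with known unit variance and a normal analysis prior \(\delta \sim \mathcal {N}(0, \tau ^{2})\), \(\tau = \sqrt {2}/2\), as a stand-in for the Cauchy-prior *t*-test BF. All function names are hypothetical, and the resulting percentages will differ from those reported in this paper.

```python
import math
import random
import statistics


def bf10_normal_prior(mean_diff, n_per_group, tau=math.sqrt(2) / 2):
    """BF10 for H0: delta = 0 vs. H1: delta ~ N(0, tau^2), known SD = 1.

    ASSUMPTION: closed-form stand-in for the Cauchy-prior t-test BF.
    """
    se2 = 2.0 / n_per_group
    shrink = se2 / (tau ** 2 + se2)
    return math.sqrt(shrink) * math.exp(0.5 * mean_diff ** 2 * (1.0 - shrink) / se2)


def run_sbf_maxn_study(rng, delta, n_min=40, n_max=100, upper=30.0, lower=1 / 6):
    """Test after every participant pair from n_min on; stop at a threshold
    or at n_max. Returns (stopping reason, n per group, final BF10)."""
    n, sum_diff = 0, 0.0
    while True:
        sum_diff += rng.gauss(delta, 1.0) - rng.gauss(0.0, 1.0)
        n += 1
        if n < n_min:
            continue
        bf = bf10_normal_prior(sum_diff / n, n)
        if bf >= upper:
            return "H1", n, bf
        if bf <= lower:
            return "H0", n, bf
        if n >= n_max:
            return "n_max", n, bf


def simulate_sbf_maxn(n_studies=2000, seed=1, **design):
    """Simulate under H1 with delta ~ N(0.5, 0.1), as in the running example."""
    rng = random.Random(seed)
    outcomes = [run_sbf_maxn_study(rng, rng.gauss(0.5, 0.1), **design)
                for _ in range(n_studies)]
    share = {reason: sum(o[0] == reason for o in outcomes) / n_studies
             for reason in ("H1", "H0", "n_max")}
    ns = [o[1] for o in outcomes]
    return share, statistics.mean(ns), statistics.median(ns)


if __name__ == "__main__":
    share, mean_n, median_n = simulate_sbf_maxn()
    print(share, mean_n, median_n)
```

One property of the stand-in BF is worth knowing before interpreting its output: its smallest possible value at a given *n* is \(\sqrt {(2/n)/(\tau ^{2} + 2/n)}\), which stays above 1/6 until *n* = 140 per group. Within *n* _{max} = 100, simulated studies therefore end either at the \(\mathcal {H}_{1}\) boundary or at *n* _{max}.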

To summarize, this example uses evidential thresholds for BF_{10} at 30 and 1/6, *n* _{ m i n } = 40, and *n* _{max} = 100. The uncertainty about the effect size under \(\mathcal {H}_{1}\) is expressed as \(\delta \sim \mathcal {N} (0.5, \sigma = 0.1)\). Figure 5 visualizes the trajectories and stopping point distributions under \(\mathcal {H}_{1}\) (results under \(\mathcal {H}_{0}\) not shown). The upper and lower densities show the distribution of *n* for all studies that hit a threshold. The distribution on the right shows the distribution of BF_{10} for all studies that stopped at *n* _{max}.

#### Expected stopping threshold (\(\mathcal {H}_{1}\), \(\mathcal {H}_{0}\), or *n* _{max}) and expected rates of misleading evidence

Under \(\mathcal {H}_{1}\) of this example, 70.6 % of all studies hit the correct \(\mathcal {H}_{1}\) threshold (i.e., the true positive rate), and 1.6 % hit the wrong \(\mathcal {H}_{0}\) threshold (i.e., the false negative rate). The remaining 27.8 % of studies stopped at *n* _{max} and remained inconclusive with respect to the a priori set thresholds.

One goal in the example was a low FPE rate. Under \(\mathcal {H}_{0}\) (not displayed), 70.9 % of all studies hit the correct \(\mathcal {H}_{0}\) threshold and 0.6 % hit the wrong \(\mathcal {H}_{1}\) threshold (i.e., the false positive rate). The remaining 28.5 % of studies stopped at *n* _{max} and remained inconclusive with respect to the a priori set thresholds.

Again, these are the maximum rates of misleading evidence, when a test after each participant is computed. More realistic sequential tests, such as testing after every 10 participants, will lower these rates.

#### Distribution of evidence at *n* _{max}

The BF of studies that did not reach the a priori threshold for compelling evidence can still be interpreted. In the current example, we categorize the inconclusive studies into results that show at least moderate evidence for either hypothesis (BF < 1/3 or BF > 3) or are completely inconclusive (1/3 < BF < 3). Of course any other threshold can be used to categorize the non-compelling studies; in general a BF of 3 provides only weak evidence for a hypothesis and implies, from a design perspective, a high rate of misleading evidence (Schönbrodt et al. 2015).

In the current example, under \(\mathcal {H}_{1}\), 15.5 % of all studies terminated at *n* _{max} with a BF_{10}> 3, meaning that these studies correctly indicated at least moderate evidence for \(\mathcal {H}_{1}\). 11.6 % of studies remained inconclusive (1/3<BF_{10}<3), and 0.7 % pointed towards the wrong hypothesis (BF_{10}<1/3). Under \(\mathcal {H}_{0}\), 1.1 % incorrectly pointed towards \(\mathcal {H}_{1}\), 10.8 % towards \(\mathcal {H}_{0}\), and 16.6 % remained inconclusive.
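
The three-way categorization used above is simple to apply to the final BFs of simulated studies that stopped at *n* _{max}. The helper name and the example BF values below are made up for illustration.

```python
from collections import Counter


def categorize_bf10(bf10, moderate=3.0):
    """Label the final BF10 of a study that stopped at n_max (hypothetical helper)."""
    if bf10 > moderate:
        return "moderate evidence for H1"
    if bf10 < 1.0 / moderate:
        return "moderate evidence for H0"
    return "inconclusive"


final_bfs = [12.4, 0.8, 2.1, 0.25, 4.0]   # made-up BF10 values, not simulation output
counts = Counter(categorize_bf10(bf) for bf in final_bfs)
print(counts)
```

Changing the `moderate` argument applies any other cutoff for the non-compelling studies.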

#### Expected sample size

The average expected sample size under \(\mathcal {H}_{1}\) (combined across all studies, regardless of the stopping condition) is *n* = 69, with a median of 65. The average expected sample size under \(\mathcal {H}_{0}\) is *n* = 66, with a median of 56. Hence, the average expected sample size is under both hypotheses considerably lower than *n* _{max}, which has been defined at *n* = 100.

## Discussion

We explored the concept of a Bayes Factor Design Analysis, and how it can help to plan a study for compelling evidence. Pre-data design analyses allow researchers to plan a study in a way that strong inference is likely. As in frequentist power analysis, one has to find a trade-off between the rates of misleading evidence, the desired probability of achieving compelling evidence, and practical limits concerning sample size. Additionally, in order to compute the expected outcomes of future studies, one has to make explicit one’s assumptions about several key parameters, such as the expected effect size under \(\mathcal {H}_{1}\). Any pre-data analysis is conditional on these assumptions, and the validity of the results depends on the validity of the assumptions. If reality does not follow the assumptions, the actual operational characteristics of a design will differ from the results of the design analysis. For example, if the actual effect size is smaller than anticipated, a chosen design actually has higher FNE rates and, in the sequential case, larger expected sample sizes until a threshold is reached.

In contrast to *p*-values, the interpretation of Bayes factors does not depend on stopping rules (Rouder 2014). This property allows researchers to use flexible research designs without the requirement of special and ad-hoc corrections. For example, the proposed *SBF+maxN* design stops abruptly at *n* _{max}. An alternative procedure is one where the evidential thresholds gradually move closer together as *n* increases. This implies that a lower grade of evidence is accepted when sampling was not already stopped at a strong evidential threshold, and puts a practical (but not fixed) upper limit on sample size (for an application in response time modeling see Boehm et al. 2015). The properties of this special design (or of any sequential or non-sequential BF design) can be evaluated using the same simulation approach outlined in this paper. This further underscores the flexibility and the generality of the sequential Bayesian procedure.

### From the planning stage to the analysis stage

This paper covered the planning stage, before data are collected. After a design has been chosen, based on a careful evaluation of its operational characteristics, the actual study is carried out (see also Fig. 2). A design analysis only relates to the actual inference if the same analysis prior is used in the planning stage and in the analysis stage. Additionally, the BF computation in the analysis stage should contain a sensitivity analysis, which shows whether the inference is robust against reasonable variations in the analysis prior.

It is important to note that, in contrast to NHST, the inference drawn from the actual data set is entirely independent from the planning stage (Berger and Wolpert 1988; Wagenmakers et al. 2014; Dienes 2011). All inferential information is contained in the actual data set, the analysis prior, and the likelihood function. Hypothetical studies from the planning stage (that have not been done) cannot add anything. From that perspective, it would be perfectly fine to use a different analysis prior in the actual analysis than in the design analysis. This would not invalidate the inference (as long as the chosen analysis prior is defensible); it just would disconnect the pre-data design analysis, which from a post-data perspective is irrelevant anyway, from the actual analysis.

### Unbiasedness of effect size estimates

Concerning the sequential procedures described here, some authors have raised concerns that these procedures result in biased effect size estimates (e.g., Bassler et al. 2010; Kruschke 2014). We believe these concerns are overstated, for at least two reasons.

First, it is true that studies that terminate early at the \(\mathcal {H}_{1}\) boundary will, on average, overestimate the true effect. This conditional bias, however, is balanced by late terminations, which will, on average, underestimate the true effect. Early terminations have a smaller sample size than late terminations, and consequently receive less weight in a meta-analysis. When all studies (i.e., early and late terminations) are considered together, the bias is negligible (Fan et al. 2004; Schönbrodt et al. 2015; Goodman 2007; Berry et al. 2010). Hence, across multiple studies the sequential procedure is approximately unbiased.

Second, the conditional bias of early terminations is conceptually equivalent to the bias that results when only significant studies are reported and non-significant studies disappear into the file drawer (Goodman 2007). In all experimental designs –whether sequential, non-sequential, frequentist, or Bayesian– the average effect size inevitably increases when one selectively averages studies that show a larger-than-average effect size. Selective publishing is a concern across the board, and an unbiased research synthesis requires that one considers significant and non-significant results, as well as early and late terminations.

Although sequential designs have negligible unconditional bias, it may nevertheless be desirable to provide a principled “correction” for the conditional bias at early terminations, in particular when the effect size of a single study is evaluated. For this purpose, Goodman (2007) outlines a Bayesian approach that uses prior expectations about plausible effect sizes (see also Pocock and Hughes 1989). This approach shrinks extreme estimates from early terminations towards more plausible regions. Smaller sample sizes are naturally more sensitive to prior-induced shrinkage, and hence the proposed correction fits the fact that most extreme deviations from the true value are found in very early terminations that have a small sample size (Schönbrodt et al. 2015).
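
The contrast between conditional and unconditional bias can be explored with a small simulation. The sketch below rests on several assumptions that are not part of this paper: a fixed true effect of 0.5, known unit variance, and a closed-form normal-prior BF as a stand-in for the Cauchy-prior *t*-test BF. It compares the mean estimate among the earliest terminations with an average across all studies that is weighted by sample size. For a fixed true effect, Wald's identity implies that this weighted average converges to the true effect, whereas the earliest terminations overestimate it. All names are hypothetical.

```python
import math
import random
import statistics


def bf10_normal_prior(mean_diff, n_per_group, tau=math.sqrt(2) / 2):
    """BF10 for H0: delta = 0 vs. H1: delta ~ N(0, tau^2), known SD = 1.

    ASSUMPTION: closed-form stand-in for the Cauchy-prior t-test BF.
    """
    se2 = 2.0 / n_per_group
    shrink = se2 / (tau ** 2 + se2)
    return math.sqrt(shrink) * math.exp(0.5 * mean_diff ** 2 * (1.0 - shrink) / se2)


def run_study(rng, delta, n_min=20, upper=6.0, lower=1 / 6, n_cap=5000):
    """Open-ended SBF study; returns (n per group, estimated effect)."""
    n, sum_diff = 0, 0.0
    while True:
        sum_diff += rng.gauss(delta, 1.0) - rng.gauss(0.0, 1.0)
        n += 1
        if n < n_min:
            continue
        bf = bf10_normal_prior(sum_diff / n, n)
        # n_cap is a hypothetical safety guard, not part of the design
        if bf >= upper or bf <= lower or n >= n_cap:
            return n, sum_diff / n


def bias_summary(n_studies=2000, delta=0.5, n_min=20, seed=1):
    rng = random.Random(seed)
    studies = [run_study(rng, delta, n_min=n_min) for _ in range(n_studies)]
    early = [d for n, d in studies if n == n_min]   # stopped at the first look
    total_n = sum(n for n, _ in studies)
    return {
        "early_mean": statistics.mean(early),
        "unweighted_mean": statistics.mean(d for _, d in studies),
        "n_weighted_mean": sum(n * d for n, d in studies) / total_n,
    }


if __name__ == "__main__":
    print(bias_summary())
```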

### Practical considerations

Many granting agencies require a priori computations for the determination of sample size. This ensures that proposers explicitly consider the expected or minimally relevant effect size. Such calculations are necessary to pinpoint the amount of requested money to pay participants.

The *SBF+maxN* design seems especially suitable for a scenario where researchers want to take advantage of the high efficiency of a sequential design but still have to define a fixed (maximum) sample size in a proposal. For this purpose, one could compute a first design analysis based on an open-ended SBF design to determine a reasonable *n* _{max}. If, for example, the 80 % quantile of the stopping-*n* distribution is used as *n* _{max} in an *SBF+maxN* design, one can expect to hit a boundary before *n* _{max} is reached in 80 % of all studies. Although there is a risk of 20 % that a study does not reach compelling evidence within the funding limit, this outcome is not a “failure” as the direction and the size of the final BF can still be interpreted. In a second design analysis one should consider the characteristics of that *SBF+maxN* design and evaluate whether the rates of misleading evidence are acceptable.

This approach enables researchers to define an informed upper limit for sample size, which allows them to apply for a predefined amount of money. Still, one can save resources if the evidence is strong enough for an earlier stop, and in almost all cases the study will be more efficient than a fixed-*n* NHST design with comparable error rates (Schönbrodt et al. 2015).
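
The first step of this two-stage planning procedure, reading an *n* _{max} off the stopping-*n* distribution of an open-ended design analysis, amounts to a quantile computation. The sketch below uses a nearest-rank quantile. The helper name and the list of stopping-*n* values are made up for illustration and are not simulation output.

```python
def quantile_nearest_rank(values, percent):
    """Smallest observed value with at least `percent` % of values <= it."""
    ordered = sorted(values)
    # integer form of ceil(percent * len / 100), avoiding float rounding
    rank = max(1, (percent * len(ordered) + 99) // 100)
    return ordered[rank - 1]


stopping_ns = [25, 30, 36, 41, 52, 60, 74, 90, 120, 300]  # made-up stopping-n values
n_max = quantile_nearest_rank(stopping_ns, 80)
share_within = sum(n <= n_max for n in stopping_ns) / len(stopping_ns)
print(n_max, share_within)
```

In a real design analysis, `stopping_ns` would hold the stopping sample sizes from the open-ended SBF simulation, and the chosen *n* _{max} would then be fed into a second simulation of the *SBF+maxN* design.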

### Conclusion

In the planning phase of a study it is essential to carry out a design analysis in order to formalize one’s expectations and facilitate the design of informative experiments. A large body of literature is available on planning frequentist designs, but little practical advice exists for research designs that employ Bayes factors as a measure of evidence. In this contribution we elaborate on three BF designs –a fixed-*n* design, an open-ended Sequential Bayes Factor (SBF) design, and an SBF design with maximal sample size– and demonstrate how the properties of each design can be evaluated using Monte Carlo simulations. Based on the analyses of the operational characteristics of a design, the specific settings of the research design can be balanced in a way that compelling evidence is a likely outcome of the to-be-conducted study, misleading evidence is an unlikely outcome, and sample sizes are within practical limits.

## Footnotes

1. Other authors have used “power analysis” as a generic term for the “probability of achieving a research goal” (e.g. Kruschke 2010, p. 1). In line with Gelman and Carlin (2014), we prefer the more general term “design analysis” and reserve “power analysis” for the special case where a design analysis aims to ensure a minimum rate of true positive outcomes in a hypothesis test (i.e., prob(strong \(\mathcal {H}_{1}\) evidence \(| \mathcal {H}_{1}\))), which is the classical meaning of statistical power.
2. It is possible to construct an unconditional effect size prior that describes the plausibility of effect sizes both under \(\mathcal {H}_{1}\) and \(\mathcal {H}_{0}\), for example by defining a prior effect size distribution that assigns considerable probability to values around zero and the opposite direction, or by using a mixture distribution that has some mass around zero, and some mass around a non-zero effect size (Muirhead and Soaita 2013). Here, in contrast, we prefer to construct a conditional effect size prior under \(\mathcal {H}_{1}\) and to contrast it with a point \(\mathcal {H}_{0}\) that has all probability mass on zero. Hence, the result of our design analysis is a conditional average probability of success under \(\mathcal {H}_{1}\), which Eaton et al. (2013) consider to be the most plausible average probability for sample size planning.
3.
4. In Bayesian parameter estimation so-called uninformative priors are quite common. A very wide prior, such as a half-normal distribution with mean = 0 and *SD* = 10, however, should not be used for design analysis, as too much probability mass is placed upon unrealistically large effect sizes. Such a design analysis will yield planned sample sizes that are usually too small, and consequently the actual data analysis will most likely be uninformative. As any design choice involves the fundamental trade-off between expected strength of evidence and efficiency, there exists no “uninformative” design prior in prospective design analysis.
5. The R code is also available on the OSF (https://osf.io/qny5x/).
6. For a procedure related to the SBF, the sequential probability ratio test (SPRT; Wald 1945), it has been proven that this test of two simple (point) hypotheses is an optimal test. That means that of all tests with the same error rates it requires the fewest observations on average (Wald and Wolfowitz 1948), with sample sizes that are typically 50 % lower than the best competing fixed-*n* design.

## Notes

### Acknowledgments

This research was supported in part by grant 283876 “Bayes or Bust!” awarded by the European Research Council.

## References

- Armitage, P., McPherson, C.K., & Rowe, B.C. (1969). Repeated significance tests on accumulating data.
*Journal of the Royal Statistical Society. Series A (General)*,*132*(2), 235–244.CrossRefGoogle Scholar - Bacchetti, P., Wolf, L.E., Segal, M.R., & McCulloch, C.E. (2005). Ethics and sample size.
*American Journal of Epidemiology*,*161*(2), 105–110. doi: 10.1093/aje/kwi014.CrossRefPubMedGoogle Scholar - Bassler, D., Briel, M., Montori, V.M., Lane, M., Glasziou, P., Zhou, Q., & Guyatt, G.H. (2010). Stopping randomized trials early for benefit and estimation of treatment effects: Systematic review and meta-regression analysis.
*Journal of the American Medical Association*,*303*(12), 1180–1187.CrossRefPubMedGoogle Scholar - Bayarri, M.J., Benjamin, D.J., Berger, J.O., & Sellke, T.M. (2016). Rejection odds and rejection ratios: A proposal for statistical practice in testing hypotheses.
*Journal of Mathematical Psychology*,*72*, 90–103. doi: 10.1016/j.jmp.2015.12.007.CrossRefGoogle Scholar - Berger, J.O. (1985).
*Statistical decision theory and Bayesian analysis*, 2nd ed. Springer: New York.CrossRefGoogle Scholar - Berger, J.O. (2006). Bayes factors. In Kotz, S., Balakrishnan, N., Read, C., Vidakovic, B., & Johnson, N.L. (Eds.),
*Encyclopedia of statistical sciences*. 2nd ed., (Vol. 1, pp. 378–386). Hoboken, NJ: Wiley.Google Scholar - Berger, J.O., & Wolpert, R.L. (1988).
*The likelihood principle*, 2nd ed. Hayward, CA: Institute of Mathematical Statistics.Google Scholar - Berger, J.O., Brown, L.D., & Wolpert, R.L. (1994). A unified conditional frequentist and Bayesian test for fixed and sequential simple hypothesis testing.
*The Annals of Statistics*,*22*(4), 1787–1807. doi: 10.1214/aos/1176325757.CrossRefGoogle Scholar - Berry, S.M., Bradley, P.C., & Connor, J. (2010). Bias and trials stopped early for benefit.
*JAMA*,*304*, 156–159. doi: 10.1001/jama.2010.930.CrossRefPubMedGoogle Scholar - Blume, J.D. (2002). Likelihood methods for measuring statistical evidence.
*Statistics in Medicine*,*21*(17), 2563–2599. doi: 10.1002/sim.1216.CrossRefPubMedGoogle Scholar - Blume, J.D. (2008). How often likelihood ratios are misleading in sequential trials.
*Communications in Statistics: Theory & Methods*,*37*(8), 1193–1206. doi: 10.1080/03610920701713336.CrossRefGoogle Scholar - Boehm, U., Hawkins, G.E., Brown, S., van Rijn, H., & Wagenmakers, E.-J. (2015). Of monkeys and men: Impatience in perceptual decision-making.
*Psychonomic Bulletin & Review*,*23*(3), 738–749. doi: 10.3758/s13423-015-0958-5. - Bogacz, R., Brown, E., Moehlis, J., Holmes, P., & Cohen, J.D. (2006). The Physics of Optimal Decision Making: A Formal Analysis of Models of Performance in Two?alternative Forced Choice Tasks.
*Psychological Review*,*113*, 700–765.CrossRefPubMedGoogle Scholar - Button, K.S., Ioannidis, J.P., Mokrysz, C., Nosek, B.A., Flint, J., Robinson, E.S., & Munafò, M.R. (2013). Power failure: Why small sample size undermines the reliability of neuroscience.
*Nature Reviews Neuroscience*,*14*(5), 365–376.CrossRefPubMedGoogle Scholar - Cavagnaro, D.R., Myung, J.I., Pitt, M.A., & Kujala, J.V. (2009). Adaptive design optimization: A mutual information-based approach to model discrimination in cognitive science.
*Neural Computation*,*22*(4), 887–905. doi: 10.1162/neco.2009.02-09-959.CrossRefGoogle Scholar - Cohen, J. (1988).
*Statistical power analysis for the behavioral sciences*. New Jersey, US: Lawrence Erlbaum Associates.Google Scholar - De Santis, F. (2004). Statistical evidence and sample size determination for Bayesian hypothesis testing.
*Journal of Statistical Planning and Inference*,*124*, 121–144.CrossRefGoogle Scholar - Dienes, Z. (2008).
*Understanding psychology as a science: An introduction to scientific and statistical inference*. New York: Palgrave Macmillan.Google Scholar - Dienes, Z. (2011). Bayesian versus orthodox statistics: Which side are you on?.
*Perspectives on Psycho-logical Science*,*6*(3), 274–290. doi: 10.1177/1745691611406920.CrossRefGoogle Scholar - Dienes, Z. (2014). Using Bayes to get the most out of nonsignificant results.
*Frontiers in Psychology: Quantitative Psychology and Measurement*,*5*, 781. doi: 10.3389/fpsyg.2014.00781.CrossRefGoogle Scholar - Dienes, Z. (2016). How Bayes factors change scientific practice.
*Journal of Mathematical Psychology*,*72*, 78–89. doi: 10.1016/j.jmp.2015.10.003.CrossRefGoogle Scholar - Dreber, A., Pfeiffer, T., Almenberg, J., Isaksson, S., Wilson, B., Chen, Y., & Johannesson, M. (2015). Using prediction markets to estimate the reproducibility of scientific research.
*Proceedings of the National Academy of Sciences*,*112*(50), 15343–15347. doi: 10.1073/pnas.1516179112.CrossRefGoogle Scholar - Eaton, M.L., Muirhead, R.J., & Soaita, A.I. (2013). On the limiting behavior of the probability of claiming superiority in a Bayesian context.
*Bayesian Analysis*,*8*(1), 221–232.CrossRefGoogle Scholar - Edwards, W., Lindman, H., & Savage, L.J. (1963). Bayesian statistical inference for psychological research.
*Psychological Review*,*70*(3), 193–242. doi: 10.1037/h0044139.CrossRefGoogle Scholar - Emanuel, E.J., Wendler, D., & Grady, C. (2000). What makes clinical research ethical?.
*JAMA*,*283*(20), 2701–2711.CrossRefPubMedGoogle Scholar - Fan, X., DeMets, D.L., & Lan, K.K.G. (2004). Conditional bias of point estimates following a group sequential test.
*Journal of Biopharmaceutical Statistics*,*14*(2), 505–530. doi: 10.1081/BIP-120037195.CrossRefPubMedGoogle Scholar - Forstmann, B.U., Ratcliff, R., & Wagenmakers, E.-J. (2016). Sequential Sampling Models in Cognitive Neuroscience: Advantages, Applications, and Extensions.
*Annual Review of Psychology*,*67*, 641–666.CrossRefPubMedGoogle Scholar - Garthwaite, P.H., Kadane, J.B., & O’Hagan, A. (2005). Statistical methods for eliciting probability distributions.
*Journal of the American Statistical Association*,*100*(470), 680–701. doi: 10.1198/016214505000000105.CrossRefGoogle Scholar - Gelman, A., & Carlin, J. (2014). Beyond power calculations: Assessing Type S (sign) and Type M (magnitude) errors.
*Perspectives on Psychological Science*,*9*(6), 641–651. doi: 10.1177/1745691614551642.CrossRefPubMedGoogle Scholar - Gelman, A., & Tuerlinckx, F. (2000). Type S error rates for classical and Bayesian single and multiple comparison procedures.
*Computational Statistics*,*15*(3), 373–390.CrossRefGoogle Scholar - Good, I.J. (1979). Studies in the history of probability and statistics. XXXVII A. M. Turing?s statistical work in World War II.
*Biometrika*,*66*(2), 393–396. doi: 10.1093/biomet/66.2.393.CrossRefGoogle Scholar - Goodman, S.N. (2007). Stopping at nothing? Some dilemmas of data monitoring in clinical trials.
*Annals of Internal Medicine*,*146*(12), 882–887.CrossRefPubMedGoogle Scholar - Halpern, S.D., Karlawish, J.H.T., & Berlin, J.A. (2002). The continuing unethical conduct of underpowered clinical trials.
*JAMA*,*288*(3), 358–362.CrossRefPubMedGoogle Scholar - Hoijtink, H., Klugkist, I., & Boelen, P. (2008).
*Bayesian Evaluation of Informative Hypotheses*. New York: Springer.CrossRefGoogle Scholar - Ioannidis, J.P.A. (2005). Why most published research findings are false.
*PLoS Med*,*2*(8), e124. doi: 10.1371/journal.pmed.0020124.CrossRefPubMedPubMedCentralGoogle Scholar - JASP Team (2016). JASP (Version 0.7.5.6)[Computer software].Google Scholar
- Jeffreys, H. (1961). The theory of probability. Oxford University Press.Google Scholar
- Kass, R.E., & Raftery, A.E. (1995). Bayes factors.
*Journal of the American Statistical Association*,*90*(430), 773–795.CrossRefGoogle Scholar - Kruschke, J.K. (2010). Bayesian data analysis.
*Wiley Interdisciplinary Reviews: Cognitive Science*,*1*(5), 658–676. doi: 10.1002/wcs.72.PubMedGoogle Scholar - Kruschke, J.K. (2014).
*Doing Bayesian data analysis: A tutorial with R, JAGS, and Stan*, 2nd edn. Boston: Academic Press.Google Scholar - Lakens, D., & Evers, E.R.K. (2014). Sailing from the seas of chaos into the corridor of stability: Practical recommendations to increase the informational value of studies.
*Perspectives on Psychological Science*,*9*(3), 278–292. doi: 10.1177/1745691614528520.CrossRefPubMedGoogle Scholar - Lee, M.D., & Wagenmakers, E.-J. (2013). Bayesian cognitive modeling: A practical course. Cambridge University Press.Google Scholar
- Lewis, S.M., & Raftery, A.E. (1997). Estimating Bayes Factors via posterior simulation with the Laplace Metropolis estimator.
*Journal of the American Statistical Association*,*92*, 648–655.Google Scholar - Lindley, D.V. (1956). On a Measure of the Information Provided by an Experiment.
*The Annals of Mathematical Statistics*,*27*, 986– 1005.CrossRefGoogle Scholar - Lindley, D.V. (1997). The choice of sample size.
*Journal of the Royal Statistical Society. Series D (The Statistician)*,*46*(2), 129– 138.CrossRefGoogle Scholar - Ly, A., Verhagen, J., & Wagenmakers, E.-J. (2016). Harold Jeffreys’s default Bayes factor hypothesis tests: Explanation, extension, and application in psychology.
*Journal of Mathematical Psychology. Bayes Factors for Testing Hypotheses in Psychological Research: Practical Relevance and New Developments*,*72*, 19–32. doi: 10.1016/j.jmp.2015.06.004.Google Scholar - Morey, R.D., & Rouder, J.N. (2015). BayesFactor: Computation of Bayes factors for common designs.Google Scholar
- Morey, R.D., Romeijn, J.-W., & Rouder, J.N. (2016). The philosophy of Bayes factors and the quantification of statistical evidence.
*Journal of Mathematical Psychology. Bayes Factors for Testing Hypotheses in Psychological Research: Practical Relevance and New Developments*, *72*, 6–18. doi: 10.1016/j.jmp.2015.11.001.
- Morris, D.E., Oakley, J.E., & Crowe, J.A. (2014). A web-based tool for eliciting probability distributions from experts.
*Environmental Modelling & Software*, *52*, 1–4. doi: 10.1016/j.envsoft.2013.10.010.
- Muirhead, R.J., & Soaita, A.I. (2013). On an approach to Bayesian sample sizing in clinical trials. In Jones, G., & Shen, X. (Eds.),
*Advances in Modern Statistical Theory and Applications: A Festschrift in honor of Morris L. Eaton* (pp. 126–137). Beachwood, Ohio: Institute of Mathematical Statistics.
- Mulder, J., & Wagenmakers, E.-J. (2016). Editors’ introduction to the special issue Bayes factors for testing hypotheses in psychological research: Practical relevance and new developments.
*Journal of Mathematical Psychology. Bayes Factors for Testing Hypotheses in Psychological Research: Practical Relevance and New Developments*, *72*, 1–5. doi: 10.1016/j.jmp.2016.01.002.
- O’Hagan, A., & Forster, J. (2004). Kendall’s Advanced Theory of Statistics Vol. 2B: Bayesian Inference (2nd ed.). London: Arnold.
- O’Hagan, A., & Stevens, J.W. (2001). Bayesian assessment of sample size for clinical trials of cost effectiveness.
*Medical Decision Making: An International Journal of the Society for Medical Decision Making*, *21*(3), 219–230.
- O’Hagan, A., Stevens, J.W., & Campbell, M.J. (2005). Assurance in clinical trial design.
*Pharmaceutical Statistics*, *4*(3), 187–201. doi: 10.1002/pst.175.
- Platt, J.R. (1964). Strong inference.
*Science*, *146*(3642), 347–353. doi: 10.1126/science.146.3642.347.
- Pocock, S.J., & Hughes, M.D. (1989). Practical problems in interim analyses, with particular regard to estimation.
*Controlled Clinical Trials*, *10*(4), 209–221.
- R Core Team (2014). R: A language and environment for statistical computing. Vienna, Austria.
- Rouder, J.N. (2014). Optional stopping: No problem for Bayesians.
*Psychonomic Bulletin & Review*, *21*(2), 301–308. doi: 10.3758/s13423-014-0595-4.
- Rouder, J.N., Morey, R.D., Speckman, P.L., & Province, J.M. (2012). Default Bayes factors for ANOVA designs.
*Journal of Mathematical Psychology*, *56*(5), 356–374. doi: 10.1016/j.jmp.2012.08.001.
- Rouder, J.N., Speckman, P.L., Sun, D., Morey, R.D., & Iverson, G. (2009). Bayesian t tests for accepting and rejecting the null hypothesis.
*Psychonomic Bulletin & Review*, *16*(2), 225–237.
- Royall, R.M. (2000). On the probability of observing misleading statistical evidence.
*Journal of the American Statistical Association*, *95*(451), 760–768. doi: 10.2307/2669456.
- Schönbrodt, F.D. (2016). BFDA: Bayes factor design analysis package for R. https://github.com/nicebread/BFDA.
- Schönbrodt, F.D., Wagenmakers, E.-J., Zehetleitner, M., & Perugini, M. (2015). Sequential hypothesis testing with Bayes factors: Efficiently testing mean differences.
*Psychological Methods*. doi: 10.1037/met0000061.
- Spiegelhalter, D.J., Abrams, K.R., & Myles, J.P. (2004). Bayesian approaches to clinical trials and health-care evaluation. John Wiley & Sons.
- Taroni, F., Bozza, S., Biedermann, A., Garbolino, P., & Aitken, C. (2010).
*Data analysis in forensic science: A Bayesian decision perspective*. Chichester: John Wiley & Sons.
- van Erven, T., Grünwald, P., & de Rooij, S. (2012). Catching up faster by switching sooner: A predictive approach to adaptive estimation with an application to the AIC–BIC dilemma.
*Journal of the Royal Statistical Society B*, *74*, 361–417.
- Wagenmakers, E.-J., Grünwald, P., & Steyvers, M. (2006). Accumulative prediction error and the selection of time series models.
*Journal of Mathematical Psychology*, *50*, 149–166.
- Wagenmakers, E.-J., Morey, R.D., & Lee, M.D. (2016). Bayesian benefits for the pragmatic researcher.
*Current Directions in Psychological Science*, *25*(3), 169–176. doi: 10.1177/0963721416643289.
- Wagenmakers, E.-J., Verhagen, J., Ly, A., Bakker, M., Lee, M.D., Matzke, D., & Morey, R.D. (2014). A power fallacy. *Behavior Research Methods*, *47*(4), 913–917. doi: 10.3758/s13428-014-0517-4.
- Wald, A. (1945). Sequential tests of statistical hypotheses.
*The Annals of Mathematical Statistics*, *16*(2), 117–186.
- Wald, A., & Wolfowitz, J. (1948). Optimum character of the sequential probability ratio test.
*The Annals of Mathematical Statistics*, *19*(3), 326–339. doi: 10.1214/aoms/1177730197.
- Walley, R.J., Smith, C.L., Gale, J.D., & Woodward, P. (2015). Advantages of a wholly Bayesian approach to assessing efficacy in early drug development: a case study.
*Pharmaceutical Statistics*, *14*(3), 205–215. doi: 10.1002/pst.1675.
- Weiss, R. (1997). Bayesian sample size calculations for hypothesis testing.
*Journal of the Royal Statistical Society. Series D (The Statistician)*, *46*(2), 185–191.