Introduction

Inferences made on the basis of behavioral experiments have the potential to influence both scientific consensus and personalized treatment recommendations. However, strong and accurate inferences can require a daunting number of observations, a requirement that can be prohibitive when resources, e.g., participant attention, are scarce. Thus, methods that maximize the information provided by each individual observation are extremely valuable.

Adaptive design optimization (ADO) is a method that leverages observations from individual participants, on the fly, to identify the most powerful design in sequence (Cavagnaro et al., 2010). At its core, ADO evaluates candidate stimuli with a global utility function that estimates, for each stimulus, the potential informativeness of possible responses to that stimulus. Because of its potential to automatically identify powerful designs, ADO has been used extensively for behavioral, psychometric, and psychiatric applications (Kwon et al., 2022). Such applications have been enabled by increased access to computational resources and by software packages that simplify ADO’s implementation (Yang et al., 2020; Sloman, 2022).

ADO relies on the machinery of Bayesian inference, which requires that the user specify a prior distribution across models and parameter values that will generate their data, i.e., a distribution across possible values of the psychological characteristics underlying the observed stimulus–response relationship. When using optimal design methods like ADO, which rely on specified prior distributions in the design of the experiment itself, the choice of prior has dual consequences: Misinformative priors can bias inference and mislead the experimental design process. The prior distribution can have a substantial impact on ADO’s behavior (Myung et al., 2013; Cavagnaro et al., 2016). Thus, choosing a prior distribution is an issue of enormous practical import, and requires that the experimenter balance multiple considerations, e.g., prior knowledge and analytical tractability (Myung et al., 2013).

The goal of the present work is to unpack the effects of these various considerations on the behavior of ADO. We consider a common paradigm in which the goal of the experiment is to measure some latent variable, representing a given psychological characteristic, at the participant level as precisely as possible. The assumption is that the behavior exhibited by a given participant can be perfectly captured by a single value of this latent variable, and that these values are drawn from a distribution characterizing the participant population.

In practice, experimenters usually specify a single prior, which we call the specified prior, and use it for a large number of experimental participants. If the specified prior matches the true distribution of relevant psychological characteristics in the participant population, ADO’s criterion for evaluating stimuli can be interpreted as the amount of information the experimenter would receive, on average, across sufficiently many repetitions of the experiment. In this case, the design selected by ADO is optimal in the sense that it will lead the experimenter to correct inferences as quickly as possible, on average. If the specified prior does not match this population distribution, ADO’s global utility function no longer admits this interpretation, and the designs selected by ADO may no longer lead the experimenter efficiently towards correct inferences.

Prior literature has devised ways to construct an informed specified prior by incorporating observations from similar past experiments (Tulsyan et al., 2012; Kim et al., 2014). However, this may be infeasible or impractical in many situations of interest, due to, e.g., resource limitations that restrict the number of total participants one can recruit, or a desire to endow all participants with the same prior knowledge for the sake of ethical considerations or the tractability of pooled analyses. In such situations, experimenters are forced to contend with some degree of uncertainty about the true population distribution, and run the risk of deviations between the prior they specify and the population distribution.

The goal of the present work is to study how deviations between the specified prior and the true population distribution affect the performance of ADO. We refer to the presence of such a deviation as prior misinformation. In the sections that follow, we introduce a novel conceptual and mathematical framework for investigating the effect of prior misinformation. We leverage this framework to identify both (a) characteristics of specified priors that contribute to robust inference and (b) cases in which the threats of prior misinformation can only be mitigated by acquiring knowledge of the population distribution.

“Preliminaries” introduces the mechanics of ADO and its application to problems of inference about psychological characteristics, such as trait values and model structure. “The prior’s two lives” presents the main conceptual tension addressed in our paper: Users of ADO implicitly rely on two distinct – and potentially opposing – interpretations of the specified prior. “Extended notation” gives a mathematical decomposition of the measure of information gain that reveals how prior information affects ADO’s efficiency. “Prior misinformation in the context of parameter estimation” and “Prior misinformation in the context of model selection” interpret these results in the context of the problems of parameter estimation and model selection, respectively. These sections also present results from simulation experiments illustrating the effect of misinformation on the behavior of ADO in practice. “Robust practices for ADO for model selection” suggests practices that users of ADO can adopt to enhance robustness to the issues we show can arise in the context of model selection. “Discussion and limitations” discusses limitations of the present work and avenues for future work, and “Conclusion” concludes.

Preliminaries

Notation

We use bolded, capital letters to refer to random variables, and lowercase, unbolded letters to refer to their corresponding realizations. The probability of a particular realization x of the random variable \(\textbf{X}\) is p(x), i.e., \(\textbf{X} ~ : ~ x \rightarrow p(x)\).

Cognitive models

Latent constructs, like those typically of interest in psychological research, are, by definition, unavailable for observation and thus difficult to measure. For many applications, experimenters specify cognitive models, which represent these constructs mathematically in a way that facilitates their measurement. The scope of the present work is within-subjects estimation: estimating as precisely as possible the degree to which a given participant exhibits a psychological characteristic. We give example applications later in this section. First, we make more precise how cognitive models facilitate the measurement of latent psychological constructs.

We consider probabilistic cognitive models that associate stimuli, e.g., questions that could be asked in an experiment, with probability distributions over possible responses. We denote stimuli x and responses y, which are realizations of a random variable \(\textbf{Y} \vert x\). Models, denoted m, are families of functions indexed by a free parameter or parameters, denoted \(\theta \). Models encapsulate substantive mechanistic accounts of the relevant psychological, cognitive, or perceptual processes. The parameters encapsulate psychological or behavioral traits that may vary between experimental participants, but which are consistent within a participant. Our framework assumes that there is some true model \(m^*\) and corresponding parameter value \(\theta ^*\) that defines the true data-generating distribution for each stimulus x, given by \(\textbf{Y} \vert x,\theta ^*,m^*\).

We consider separately the goals of parameter estimation and model selection. Parameter estimation is the problem of inferring the value of \(\theta ^*\), or measuring the degree to which a participant exhibits a particular trait (assuming a given model structure). For example, for educational testing, the examiner’s goal is to identify the examinee’s ability level (assuming a given item-response model). Model selection is the problem of inferring the identity of \(m^*\) from a set of candidate models M, i.e., determining which of several substantively different processes a participant exhibits. For example, a longstanding problem in psychophysics has been to distinguish among various functional forms for describing the relationship between physical dimensions of stimuli and the psychological experience they induce (Roberts, 1979). Both of these goals – parameter estimation and model selection – can be achieved using Bayesian inference, in which the experimenter places a prior distribution across models and parameter values \(\mathbf{\left( M, \Theta \right) }\) and updates this prior according to observed data.

By specifying a prior distribution, the experimenter also implicitly specifies a prior predictive distribution \(\textbf{Y} \vert x\), for which each possible response to a stimulus has a corresponding marginal probability:

$$\begin{aligned} p(y \vert x)&= \sum _{m \in M} p(m) \int _\theta p(y \vert x, \theta , m) ~ p(\theta \vert m). \end{aligned}$$
(1)

We can also compute the predictive distribution conditioned on a particular quantity, such as a parameter value or model.
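To make Eq. 1 concrete, the sketch below evaluates a prior predictive distribution on a discretized parameter grid for a single model with a binary response; the logistic likelihood, the grid, and the uniform prior are illustrative placeholders rather than anything specified in this paper.

```python
import numpy as np

# A hypothetical one-parameter model: probability of response y to stimulus x
# given a latent trait theta (a stand-in likelihood, not the paper's model).
def likelihood(y, x, theta):
    p_correct = 1.0 / (1.0 + np.exp(-(theta - x)))
    return p_correct if y == 1 else 1.0 - p_correct

theta_grid = np.linspace(-3, 3, 61)                 # discretized support of theta
prior = np.ones_like(theta_grid) / len(theta_grid)  # p(theta), uniform for illustration

def prior_predictive(y, x):
    """Marginal probability of response y to stimulus x (Eq. 1 with a single model)."""
    return sum(likelihood(y, x, th) * p for th, p in zip(theta_grid, prior))

# Conditioning the predictive distribution on a particular parameter value
# simply evaluates the likelihood at that value.
print(prior_predictive(y=1, x=0.5), likelihood(y=1, x=0.5, theta=1.2))
```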

Adaptive design optimization

Different sets of stimuli have different degrees of power to identify the generating model and parameter value (Myung & Pitt, 2009; Cavagnaro et al., 2010; Young et al., 2012; Broomell et al., 2019). To address this, researchers have developed methods for the principled selection of stimuli to maximize the informativeness and efficiency of one’s experiment (Myung & Pitt, 2009; Broomell & Bhatia, 2014). ADO is one such method (Cavagnaro et al., 2010). By basing its recommendations on the observations it has seen so far, ADO identifies experimental designs tailored to the response patterns of the current participant.

Experiments using ADO proceed across a sequence of mini-experiments, which we call trials. Each trial may consist of a single stimulus or a block of stimuli. ADO dynamically incorporates information throughout the experiment by using the posterior distribution from one trial as the prior distribution on the subsequent trial. This process is visualized in Fig. 1.

Fig. 1
Note. Flow chart of ADO experiment. The experimenter begins the experiment at the lightest grey node, by specifying a prior distribution over models and parameter values. On each trial, they select the stimulus that maximizes the global utility, observe responses to that stimulus, update the distribution over models and parameter values according to Bayes’ rule, and then use the obtained posterior as the prior on the next trial
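The loop in Fig. 1 can be sketched in a few lines. This is a schematic, grid-based version assuming a single model with one discretized parameter and a binary response; the logistic response model, candidate stimuli, trial count, and simulated participant are our own illustrative assumptions, and packages such as pyBAD provide full implementations.

```python
import numpy as np

def likelihood(y, x, theta):
    # Placeholder Bernoulli response model.
    p1 = 1.0 / (1.0 + np.exp(-(theta - x)))
    return p1 if y == 1 else 1.0 - p1

def global_utility(x, theta_grid, belief):
    # Expected information gain about the parameter from a response to x
    # (the mutual information utility defined below).
    u = 0.0
    for y in (0, 1):
        lik = np.array([likelihood(y, x, th) for th in theta_grid])
        p_y = np.sum(lik * belief)
        u += np.sum(belief * lik * np.log(lik / p_y))
    return u

theta_grid = np.linspace(-3, 3, 61)
belief = np.ones_like(theta_grid) / len(theta_grid)   # specified prior
candidate_stimuli = np.linspace(-3, 3, 31)
theta_true = 1.0                                      # simulated participant
rng = np.random.default_rng(0)

for trial in range(10):
    # 1. Select the stimulus that maximizes the global utility under the current belief.
    x = max(candidate_stimuli, key=lambda s: global_utility(s, theta_grid, belief))
    # 2. Observe a (here simulated) response to that stimulus.
    y = int(rng.random() < 1.0 / (1.0 + np.exp(-(theta_true - x))))
    # 3. Update by Bayes' rule; the posterior becomes the prior on the next trial.
    lik = np.array([likelihood(y, x, th) for th in theta_grid])
    belief = belief * lik
    belief /= belief.sum()
```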

To identify the stimulus with the greatest information gain, users of ADO specify a local utility function \(u(x, y, \theta , m)\) which is a function of the candidate stimulus x, response y, and a possible model and parameter value \(\{ m, \theta \}\) (together, a possible state of the world). The local utility function measures how much is learned from response y on stimulus x about the state of the world \(\{ m, \theta \}\). It can take a variety of forms, depending on the particular goals of the experimenter. The true state of the world and outcome of the experiment are unknown to the experimenter a priori – otherwise, there would be no need to run the experiment. Therefore, rather than maximizing u, ADO selects the stimulus that maximizes the expectation of u across possible models, parameter values and experimental outcomes according to the specified prior distribution. This yields the global utility function:

$$\begin{aligned} U(x) = \sum _{m \in M} p(m) \int _\theta \int _y u(x, y, \theta , m) ~ p(y \vert x, \theta , m) ~ p(\theta \vert m). \end{aligned}$$
(2)

For our applications, we consider a specification of u such that Eq. 2 measures the amount of information the candidate stimulus x is expected to yield about some inferential quantity of interest. The amount of information one variable provides about another has been made mathematically precise in the field of information theory by the concept of mutual information (Cover & Thomas, 1991). Motivated by these information-theoretic principles, global mutual information utility is the mutual information (I) between a focal quantity of interest, which we refer to as the focus and denote \(\phi \), and responses to a stimulus (Bernardo, 1979). Then, the global mutual information utility of a stimulus is:

$$\begin{aligned} U(x)&= \int _\phi \int _y \log {\left( \frac{p(\phi \vert y, x)}{p(\phi )} \right) } ~ p(y \vert x, \phi ) ~ p(\phi ) \nonumber \\&= I(\mathbf{\Phi } ; \textbf{Y} \vert x). \end{aligned}$$
(3)

In order for the global utility function to have the form in Eq. 3, the local utility function must take the form:

$$\begin{aligned} u(x, y, \theta , m)&= \log {\left( \frac{p(\phi \vert y, x)}{p(\phi )} \right) } \end{aligned}$$
(4)

which can be thought of as a measure of the information gained about the true value of \(\phi \) from \(y \vert x\).

“Parameter estimation” and “Model selection” show how this specification is adapted to two of the most frequent applications of ADO: the problems of parameter estimation and of model selection. In the former case, the parameters \(\theta \) are the focus, and in the latter case, the model m is the focus.

Notice that Eq. 3 can be rewritten in terms of Kullback–Leibler divergence, an information-theoretic measure that captures the information gained in moving from one distribution to another. Specifically:

$$\begin{aligned} U(x)&= \int _y \underbrace{D_{KL}\left( \mathbf{\Phi } \vert y,x ~ \vert \vert ~ \mathbf{\Phi } \right) }_{\text {Focal divergence}} ~ p(y \vert x) \end{aligned}$$
(5)

where \(D_{KL}\left( \mathbf{\Phi } \vert y,x ~ \vert \vert ~ \mathbf{\Phi } \right) \), or what we will refer to as the focal divergence, is the Kullback–Leibler divergence from distribution \(\mathbf{\Phi } \vert y,x\) to distribution \(\mathbf{\Phi }\). In other words, global mutual information utility captures, in an information-theoretic sense, how much an observed response to a particular stimulus x is expected to move the prior distribution assigned to the focus.
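The sketch below checks this equivalence numerically for a toy case: it computes the mutual information of Eq. 3 directly, and again as the predictive-weighted focal divergence of Eq. 5. The Bernoulli model, grid, and uniform prior are illustrative assumptions, not quantities from this paper.

```python
import numpy as np

def likelihood(y, x, theta):
    # Placeholder Bernoulli response model.
    p1 = 1.0 / (1.0 + np.exp(-(theta - x)))
    return p1 if y == 1 else 1.0 - p1

theta_grid = np.linspace(-3, 3, 61)
prior = np.ones_like(theta_grid) / len(theta_grid)
x = 0.5

# Form 1 (Eq. 3): mutual information between the focus (here theta) and the response.
predictive = {y: sum(likelihood(y, x, th) * p for th, p in zip(theta_grid, prior))
              for y in (0, 1)}
U_eq3 = sum(p * likelihood(y, x, th) * np.log(likelihood(y, x, th) / predictive[y])
            for y in (0, 1) for th, p in zip(theta_grid, prior))

# Form 2 (Eq. 5): focal divergence (KL from posterior to prior), averaged
# over the prior predictive distribution of responses.
U_eq5 = 0.0
for y in (0, 1):
    posterior = np.array([likelihood(y, x, th) * p for th, p in zip(theta_grid, prior)])
    posterior /= posterior.sum()
    kl = np.sum(posterior * np.log(posterior / prior))
    U_eq5 += kl * predictive[y]

print(U_eq3, U_eq5)  # the two forms agree up to numerical error
```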

Parameter estimation

Parameter estimation refers to the problem of maximizing the precision of one’s estimate of the parameters \(\theta \) given a particular model m. Applications of ADO to parameter estimation are useful if the experimenter is interested in capturing individual variation, for the purpose of, e.g., generating personalized treatment recommendations on the basis of a behavioral assessment. In the educational testing setting mentioned above, the examiner’s goal is to identify each examinee’s ability level in order to make recommendations of areas of strength or potential improvement (Owen, 1969). In a medical application, Hou et al. (2016) used ADO to estimate participants’ degree of visual contrast sensitivity, a characteristic that can be used for diagnosis of eye disease and treatment recommendations.

In the context of ADO for parameter estimation, m is assumed known, and the focus of the utility function is the parameter \(\theta \).

The global utility function is:

$$\begin{aligned} U(x)&= \int _\theta \int _y \log {\left( \frac{p(\theta \vert y, x)}{p(\theta )} \right) } ~ p(y \vert x, \theta ) ~ p(\theta ). \end{aligned}$$
(6)

Focal predictive distributions

As mentioned in “Cognitive models”, we can compute the predictive distribution conditioned on any particular state of the world, \(\textbf{Y} \vert x, \theta , m\) (Eq. 1).

In the context of parameter estimation, the value of m is known by assumption, so we can equivalently compute the predictive distribution conditioned on any particular value of \(\theta \), \(\textbf{Y} \vert x, \theta \). In this case, the set of predictive distributions characterized by possible parameter values are also the set of focal predictive distributions, or the predictive distributions associated with possible values of the focus.

We highlight two properties of the focal predictive distributions in the context of parameter estimation. First, since the true data-generating distribution is \(\textbf{Y} \vert x, \theta \) for some value of \(\theta \), the set of focal predictive distributions is in effect a set of possible data-generating distributions. The parameter estimation problem then (asymptotically) amounts to identifying which value of the focus has a corresponding predictive distribution that most resembles the distribution of observed data.

Second, because of this, the predictive distribution corresponding to a particular value of the focus does not depend on additional information like the current trial number or history of observations: While a particular value of \(\theta \) may become arbitrarily more or less likely, it will always elicit the same likelihood on a given stimulus–response pair.

Model selection

Model selection refers to the problem of maximizing the precision of one’s estimate of the model m, assuming both m and \(\theta \) are unknown. The problem of model selection can be thought of as identifying the core psychological process governing a participant’s response distribution.

In the context of model selection, the focus of the utility function is the model m, which yields the global utility function:

$$\begin{aligned} U(x)&= \sum _{m \in M} p(m) \int _y \log {\left( \frac{p(m \vert y, x)}{p(m)} \right) } ~ p(y \vert x, m) \nonumber \\&= \sum _{m \in M} p(m) \int _\theta \int _y \log {\left( \frac{p(m \vert y, x)}{p(m)} \right) } ~ p(y \vert x, \theta , m) ~ p(\theta \vert m). \end{aligned}$$
(7)

Focal predictive distributions

In the case of model selection, the focal predictive distributions are the predictive distributions associated with possible values of the model m, which can be calculated as:

$$\begin{aligned} p(y \vert x, m)&= \int _\theta p(y \vert x, \theta , m) ~ p(\theta \vert m). \end{aligned}$$
(8)

Experimenters faced with the model selection problem have two sources of uncertainty to contend with (the value of m and the value of \(\theta \)), yet measure utility with respect to reduction in only one source of uncertainty. This is reflected in properties of Eq. 8: Unlike in the case of parameter estimation, here the focus is not the only conditioning variable needed to completely specify a possible response distribution \(\textbf{Y} \vert x,\theta ,m\); full specification of the response distribution also requires knowledge of \(\theta \). In addition, unlike in the case of parameter estimation, the focal predictive distributions are a moving target: Because of their dependence on the parameter distributions, they shift as the parameter distributions are updated on the basis of observed data. These characteristics will become important when we discuss the impact of prior misinformation in the context of model selection (“Prior misinformation in the context of model selection”).
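The sketch below illustrates Eq. 8 and this 'moving target' property for two hypothetical candidate models (a logistic model and a constant-probability model, chosen purely for illustration): the focal predictive distribution is a parameter-weighted mixture, and it changes once the within-model parameter distributions are updated on an observation.

```python
import numpy as np

# Two hypothetical candidate models for a binary response (placeholders,
# not the models used in this paper).
def lik_m1(y, x, theta):          # logistic model
    p1 = 1.0 / (1.0 + np.exp(-theta * x))
    return p1 if y == 1 else 1.0 - p1

def lik_m2(y, x, theta):          # constant-probability model (ignores x)
    return theta if y == 1 else 1.0 - theta

theta_grids = {"m1": np.linspace(-3, 3, 61), "m2": np.linspace(0.01, 0.99, 50)}
param_priors = {m: np.ones_like(g) / len(g) for m, g in theta_grids.items()}
likelihoods = {"m1": lik_m1, "m2": lik_m2}

def focal_predictive(y, x, m, param_dists):
    """p(y | x, m): marginalize the within-model parameter distribution (Eq. 8)."""
    grid, dist = theta_grids[m], param_dists[m]
    return sum(likelihoods[m](y, x, th) * p for th, p in zip(grid, dist))

x = 1.0
print(focal_predictive(1, x, "m1", param_priors))

# After observing (x, y), the parameter distributions are updated, so the focal
# predictive distributions shift, even though the candidate models are unchanged.
y_obs = 1
updated = {}
for m in theta_grids:
    post = np.array([likelihoods[m](y_obs, x, th) * p
                     for th, p in zip(theta_grids[m], param_priors[m])])
    updated[m] = post / post.sum()
print(focal_predictive(1, x, "m1", updated))
```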

The prior’s two lives

In ADO, the specified prior plays two roles: It both facilitates estimation of the focus from data via Bayesian updating, and informs the design of the experiment that generates these data. These two roles, or ‘lives,’ of the prior map on to two traditions in Bayesian statistics: Bayesian inference and Bayesian decision theory. While the effect of the prior on the behavior of Bayesian inference has been well studied, specified priors that enjoy good theoretical guarantees in the context of Bayesian inference may not seem so appealing when evaluated on the quality of a corresponding sequential decision-making policy. This section unpacks the reasons for this.

The goal of the present work is, in a sense, parallel to that of the literature on the effect of priors on Bayesian inference: Our goal is to understand the effect of the choice of prior distribution on the quality of the corresponding sequential decision-making policy, and to give guidance to users of ADO constrained to identify a single prior that lives both lives.

Sequential Bayesian inference is a core component of ADO: On each trial, the prior distribution is constructed as the posterior from the previous trial. In its first role, the prior can be seen as a launching pad for learning that will occur throughout the experiment. The prior is understood as an incomplete and ill-informed characterization of the distribution over possible states of the world, and is usually constructed on the basis of a variety of epistemic and pragmatic considerations. Considerations pertaining to – and guidance for constructing – the prior in the context of sequential Bayesian inference are the topic of a substantial body of existing literature (e.g., Lopes and Tobias (2011); Gelman et al. (2017)). Uninformative priors are often selected because of their pragmatic appeal in this role.

In its second role, the prior is used when calculating the global utility (Eq. 2) and thus informs the experimental design policy about the relative likelihoods of various outcomes. Bayesian decision theory refers to a prescriptive decision-making policy in which the costs and benefits of taking an action in different states of the world are averaged according to the probabilities of those states of the world (DeGroot, 2005; Berger, 2013). ADO’s policy of selecting the stimulus that maximizes the global utility is a special case of a Bayesian decision theoretic method. If the decision-making policy relies on a prior that mischaracterizes the relative likelihoods of candidate states of the world, the prescribed action is no longer defensible as the action with the highest expected benefit. Bayesian decision theoretic applications thus require a prior that is as informed as possible with available knowledge about the distribution of states of the world. Priors that ignore or mislead about the available knowledge cannot be easily justified from a decision-theoretic perspective, as they may bias the design selection toward stimuli that would not actually be the most informative across multiple experiments.

We assume that the relevant prior knowledge is the true distribution of relevant psychological characteristics in the participant population. Therefore, we will refer to the best decision-theoretic prior as the population prior. We do this for conceptual tractability; however, the analyses that follow require only that there is some defensible decision-theoretic prior. Our results apply regardless of the basis on which that prior is constructed. In many cases, information in addition to or instead of a population distribution should inform the decision-theoretic prior. For example, in all but the first trial of an adaptive experiment, the decision-theoretic prior must condition on the observations seen in previous trials. In these cases, the decision-theoretic prior can be formed from the population distribution conditioned on the history of observations (our analyses incorporate this consideration, in a way that is stated more formally in “Extended notation”). More generally, our framework extends to any case in which other information, e.g., knowledge about relevant demographic characteristics or a participant’s past behavior, is available, as our results can be readily generalized by considering the ‘population’ as all participants with the same demographic or behavioral characteristics.

If the specified prior – the prior used in the context of the experiment – matches the population prior, the global utility (Eq. 3) is also the expected focal divergence – the degree of focal divergence one should expect if one were to run the experiment on a sufficiently large participant sample. On the other hand, if the specified prior is not well calibrated, the global utility values could be misleading about the expected focal divergence. “Motivating example” gives an example of this in the context of an item-response model, a common paradigm used for educational testing. Experiments identified by ADO may not have the power to precisely identify the true model or its parameters, leading to a situation where a characteristic indicative of a disease or needed intervention is not identified efficiently, or possibly at all.

Types of priors

Priors are typically categorized as ‘informative’ or ‘uninformative.’ With an informative prior, a Bayesian analysis may reach a different conclusion than a conventional one because the prior injects information that is not in the data. For a single experiment aimed at identifying the model and parameter of an individual, the ideal informative prior would be a degenerate one that gives probability 1 to the true model and parameter. Such a prior is not feasible for the paradigm we consider here, where the same prior must be used for each participant drawn from a heterogeneous population. For this case, the best one could do would be to use a population prior. The logic of ADO implicitly assumes that the specified prior is the population prior. Therefore, we characterize the prior that coincides with the population prior as informative, and any prior that deviates from that population prior as misinformative.

Under our definition, priors that are usually referred to as ‘uninformative’ are typically misinformative when considered in the context of decision-theoretic applications. ‘Uninformative’ priors are not supposed to inject information, but in the paradigm we consider here, they entail explicit assumptions about the population of participants in the study. We will use ‘uninformative’ here, in the context of parameter estimation, to refer to a special class of misinformative priors that are agnostic about either the parameter value or the predictive distribution. Priors that are agnostic about the parameter value – uninformative in parameter space – are dispersed across the support of the parameter distribution. Priors that are agnostic about the data distribution – uninformative in data space – have high density in regions of the parameter space that correspond to a wide variety of data distributions. These two properties do not necessarily, or even usually, coincide.

Expected focal divergence

The primary innovation of our analysis is to decouple the two lives of the prior, and provide a framework within which one can reason separately about the process of sequential Bayesian inference and the distribution of observations upon which this inference is performed. In this section, we more precisely define, motivate, and mathematically unpack the expected focal divergence, a concept that is central to the remainder of our analyses.

Extended notation

In the remainder of our paper, it will be important to distinguish whether a random variable is distributed according to the population or specified distribution of the corresponding quantity. We will do this by subscripting variables that correspond to the population distribution with a 0, e.g., the population distribution of models and parameters becomes \(\mathbf{\left( M, \Theta \right) }_0\), and the corresponding marginal distribution of observations becomes \(\textbf{Y}_0 \vert x\). Analogously, we will subscript variables that correspond to the specified distribution with a 1, e.g., the specified distribution of models and parameters becomes \(\mathbf{\left( M, \Theta \right) }_1\), and the corresponding marginal distribution of observations, i.e., the distribution of observations implied by the specified prior, becomes \(\textbf{Y}_1 \vert x\). We will also use \(p_0\) and \(p_1\) analogously to refer to the probabilities of the implied random variables taking particular values under the true and specified distribution, respectively.

Table 1 Extended notational system

The notation for quantities used repeatedly is summarized in Table 1. While Table 1, and our discussion more generally, refers to prior distributions, i.e., the distributions of random variables before conditioning on observations, all distributions should be interpreted to implicitly condition on the number of observations implied by context. For example, we write \(\mathbf{\left( M, \Theta \right) }_1\) to refer generally to the specified prior, regardless of how many experimental trials have elapsed. When considering the degree of prior misinformation on the second trial of an experiment, i.e., after an observation (xy), this can be read as \(\mathbf{\left( M, \Theta \right) }_1 \vert \{y, x\}\) (recalling that the posterior from the first trial is the prior on the second trial). In the same way, the population posterior distribution is \(\mathbf{\left( M, \Theta \right) }_0 \vert \{ y,x \}\), which can be interpreted as the appropriate decision-theoretic prior for the next trial given the history of observations.

Definition of expected focal divergence

Equation 5 showed that global mutual information utility can be rewritten as an expectation of the focal divergence across the specified predictive distribution. In the context of the prior’s two lives, the focal divergence can be thought of as the degree to which the prior fulfills its role of efficient Bayesian inference. Taking the expectation of the focal divergence across the specified predictive distribution then invokes the prior’s decision-theoretic role: One uses the predictive distribution implied by the specified prior to calculate the relative likelihood of prospective observations.

In the case where the specified prior deviates from the population prior, i.e., the specified prior is misinformative, the global mutual information utility is not equivalent to the focal divergence an experimenter would achieve from a stimulus if they presented it to many members of the participant population. We refer to this latter quantity – the expectation of the focal divergence taken across the response distribution – as the expected focal divergence. The expected focal divergence associated with a stimulus x, denoted \(U^1{}(x)\), is:

$$\begin{aligned} U^1{}(x)&= \int _y \int _\phi \log {\left( \frac{p_1{}(\phi \vert y, x)}{p_1{}(\phi )} \right) } ~ p_1{}(\phi \vert y, x) ~ p_0{}(y \vert x) \nonumber \\&= \int _y D_{KL}\left( \mathbf{\Phi }_1 \vert \{ y,x \} ~ \vert \vert ~ \mathbf{\Phi }_1 \right) ~ p_0{}(y \vert x), \end{aligned}$$
(9)

i.e., is the expected Kullback–Leibler divergence between posterior and prior under the response distribution \(\textbf{Y}_0 \vert x\), or how much observations distributed according to the population distribution are expected to move the prior distribution.
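A minimal sketch of Eq. 9, assuming a one-parameter Bernoulli model on a discretized grid: inference (the posterior and the KL term) uses the specified prior, while the outer expectation over responses uses the population distribution. The specific grids and normal shapes below are our own choices.

```python
import numpy as np

# Illustrative one-parameter Bernoulli model; grids and priors are placeholders.
def likelihood(y, x, theta):
    p1 = 1.0 / (1.0 + np.exp(-(theta - x)))
    return p1 if y == 1 else 1.0 - p1

theta_grid = np.linspace(-3, 3, 121)

def normalized(w):
    return w / w.sum()

# Specified prior (subscript 1) and population distribution (subscript 0).
prior_1 = normalized(np.exp(-0.5 * theta_grid**2))           # ~ N(0, 1)
prior_0 = normalized(np.exp(-0.5 * (theta_grid - 2.0)**2))   # ~ N(2, 1)

def expected_focal_divergence(x):
    """Eq. 9: KL(posterior_1 || prior_1), averaged over the population response distribution."""
    total = 0.0
    for y in (0, 1):
        lik = np.array([likelihood(y, x, th) for th in theta_grid])
        p0_y = np.sum(lik * prior_0)                  # response probability under p0
        posterior_1 = normalized(lik * prior_1)       # inference uses the specified prior
        kl = np.sum(posterior_1 * np.log(posterior_1 / prior_1))
        total += kl * p0_y
    return total

print(expected_focal_divergence(x=0.0))
```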

Fig. 2
The effect of prior misinformation: Motivating example from item response theory. (a) Response distributions. (b) Expected focal divergence. Note. Effect of the population distribution on (a) response distribution \(p_0{}(y = 1 \vert x)\), and (b) the expected focal divergence of a stimulus \(U^1{}(x)\). Colors denote different true distributions \(\mathbf{\Theta }_0\). In all cases, the specified prior is \(\mathbf{\Theta }_1 \sim \mathscr {N}(0, 1)\) (i.e., is a standard normal distribution). The vertical line indicates the stimulus, i.e., value of x, that would be selected by ADO under the specified prior

Motivating example

To illustrate our claim that misinformative priors can impact the effectiveness of ADO, we demonstrate how the population distribution can affect the expected focal divergence of a stimulus in the context of a simple item-response model. We consider an item-response model that uses a one-dimensional ‘proficiency’ trait \(\theta \) to predict the probability of a correct response to a multiple-alternative question with a given ‘item difficulty,’ x. For a fixed value of x, higher values of \(\theta \), i.e., greater proficiency, yield a higher probability of a correct response. For a fixed value of \(\theta \), higher values of x, i.e., more difficult items, yield lower probabilities of a correct response, with the lowest possible probability being some value greater than zero that is consistent with random guessing. The goal of an experiment is to estimate the proficiency of each participant from their responses to items of various difficulty levels.

In prior work, Weiss and McBride (1983) found that priors that differed from the population distribution induced biases in inferences drawn from experiments designed using a version of ADO. As our running example, we adopt the item-response model used in their simulation study:

$$\begin{aligned} p(y = 1 \vert x, \theta ) = .2 + \frac{.8}{1 + e^{-2.72(\theta - x)}}. \end{aligned}$$
(10)
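For reference, Eq. 10 and response distributions of the kind plotted in Fig. 2a can be computed directly; the grid discretization below is our own choice, not a detail taken from the paper.

```python
import numpy as np
from scipy.stats import norm

def p_correct(x, theta):
    """Eq. 10: probability of a correct response to an item of difficulty x."""
    return 0.2 + 0.8 / (1.0 + np.exp(-2.72 * (theta - x)))

theta_grid = np.linspace(-6, 6, 241)

def discretized_normal(mean, sd):
    w = norm.pdf(theta_grid, loc=mean, scale=sd)
    return w / w.sum()

specified_prior = discretized_normal(0, 1)         # Theta_1 ~ N(0, 1)
populations = {"low":  discretized_normal(-2, 1),  # Theta_0 ~ N(-2, 1)
               "high": discretized_normal(2, 1)}   # Theta_0 ~ N(2, 1)

# Fig. 2a-style quantity: the response distribution p_0(y = 1 | x) under each population.
x = 0.0
for name, pop in populations.items():
    print(name, np.sum(p_correct(x, theta_grid) * pop))
```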

The black curve in Fig. 2a shows, for each item difficulty x between -3 and 3, the predictive distribution associated with a prior \(\mathbf{\Theta }_1 \sim \mathscr {N}(0,1)\) (i.e., distributed according to a standard normal distribution). In the case this prior is informative, i.e., the population distribution is also \(\mathbf{\Theta }_0 \sim \mathscr {N}(0,1)\), this curve also shows the empirical distribution of responses one should expect. The black curve in Fig. 2b shows the global utility corresponding to each candidate design under this prior. In the case this prior is informative, this curve also shows the expected focal divergence corresponding to each candidate design.

The blue curves show the distribution of observations and expected focal divergence values under two other possible populations. The light blue curves correspond to the population distribution \(\mathbf{\Theta }_0 \sim \mathscr {N}(-2, 1)\), and the dark blue curves correspond to the population distribution \(\mathbf{\Theta }_0 \sim \mathscr {N}(2, 1)\). With reference to Fig. 2b, if the true population distribution is \(\mathbf{\Theta }_0 \sim \mathscr {N}(2, 1)\), the stimulus selected by ADO will yield much less focal divergence than ADO anticipates, on average. By contrast, if the population distribution is \(\mathbf{\Theta }_0 \sim \mathscr {N}(-2, 1)\), the stimulus selected by ADO will yield much more focal divergence than ADO anticipates, on average.

What accounts for this difference? Are there systematic properties of prior distributions that determine which will yield a greater or smaller expected focal divergence? The following section unpacks these questions.

Decomposition of the expected focal divergence

The expected focal divergence \(U^1{}(x)\) decomposes into three terms, which provide insight into how prior misinformation may affect ADO’s efficiency:

$$\begin{aligned} U^1{}(x) =&~~~ H\left( \textbf{Y}_0 \vert x \right)&\biggr \}&\text {Response variability} \nonumber \\&+ D_{KL}\left( \textbf{Y}_0 \vert x ~ \vert \vert ~ \textbf{Y}_1 \vert x \right)&\biggr \}&\text {Surprisal} \nonumber \\&+ \int _y \int _\phi \log {\left( p_1{}(y \vert x, \phi ) \right) } ~ p_1{}(\phi \vert y, x) ~ p_0{}(y \vert x).&\biggr \}&\text {Hindsight} \end{aligned}$$
(11)

Derivation is deferred to Appendix A.
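The decomposition can also be verified numerically. The sketch below computes the three terms for the item-response model of Eq. 10 under the specified prior \(\mathbf{\Theta }_1 \sim \mathscr {N}(0, 1)\) and the low-\(\theta \) population \(\mathbf{\Theta }_0 \sim \mathscr {N}(-2, 1)\); their sum is the expected focal divergence of Eq. 9. The grid resolution is an assumption of ours.

```python
import numpy as np
from scipy.stats import norm

def p_correct(x, theta):
    return 0.2 + 0.8 / (1.0 + np.exp(-2.72 * (theta - x)))

theta_grid = np.linspace(-6, 6, 241)

def discretized_normal(mean, sd):
    w = norm.pdf(theta_grid, loc=mean, scale=sd)
    return w / w.sum()

prior_1 = discretized_normal(0, 1)    # specified prior
prior_0 = discretized_normal(-2, 1)   # population distribution (low-theta example)

def decompose(x):
    lik = {1: p_correct(x, theta_grid), 0: 1.0 - p_correct(x, theta_grid)}
    p0_y = {y: np.sum(lik[y] * prior_0) for y in (0, 1)}   # population response dist.
    p1_y = {y: np.sum(lik[y] * prior_1) for y in (0, 1)}   # specified predictive dist.

    response_variability = -sum(p0_y[y] * np.log(p0_y[y]) for y in (0, 1))
    surprisal = sum(p0_y[y] * np.log(p0_y[y] / p1_y[y]) for y in (0, 1))
    hindsight = 0.0
    for y in (0, 1):
        posterior_1 = lik[y] * prior_1 / p1_y[y]
        hindsight += p0_y[y] * np.sum(posterior_1 * np.log(lik[y]))
    return response_variability, surprisal, hindsight

rv, s, h = decompose(x=0.0)
print(rv, s, h, rv + s + h)   # the sum equals the expected focal divergence U^1(x)
```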

Fig. 3
Reproduction of Fig. 2 along with the three components of the expected focal divergence curves in Fig. 2b: response variability, surprisal and hindsight (Eq. 11). Note. As in Fig. 2b, colors denote different true distributions \(\mathbf{\Theta }_0\). In all cases, the prior is \(\mathbf{\Theta }_1 \sim \mathscr {N}(0, 1)\)

Response variability is the entropy in responses to a given stimulus. Entropy is an information-theoretic measure of the uncertainty related to the possible outcomes of a random variable. If a random variable has only one possible outcome then it has no entropy, while a distribution with high entropy is very dispersed across its support. This captures the intuitive notion that questions are less informative when the experimenter already knows what the response will be. Response variability stems from a) uncertainty about the value of the focus, and b) uncertainty about the responses given a particular value of the focus. The source of the stochasticity will determine how this term affects inference, which we discuss more in “Item response theory”. Another important characteristic of response variability is that it is a function only of the response distribution, and so should not affect one’s choice of prior.

Surprisal is the Kullback–Leibler divergence between the specified prior predictive distribution and the response distribution. Higher surprisal contributes to higher expected focal divergence since the specified prior is forced to update in light of observed inconsistencies. Considered differently, high surprisal indicates that there is a lot to learn – i.e., the specified prior is in a sense more misinformed. Thus, despite its contribution to the expected focal divergence, one would generally prefer a specified prior that induces low surprisal.

Hindsight is the expected posterior log likelihood of responses under the specified prior. Posterior likelihood is a function of both prior likelihood and the specified prior’s ability to ‘respond’ to observations. We discuss this property of ‘responsiveness’ more formally in “Prior misinformation in the context of parameter estimation”. Surprisal and hindsight will tend to be inversely related through the prior likelihood. Our discussion of considerations in the specification of one’s prior, particularly in “Prior misinformation in the context of parameter estimation”, will focus on the effect of different specified priors on hindsight.

Revisiting motivating example

Figure 3 shows the amount of response variability, surprisal and hindsight under each of the three population distributions shown in Fig. 2. This gives insight into the puzzle posed in “Motivating example”: Why does the zero-centered prior lead to a more powerful experiment when the population exhibits low values of the trait \(\theta \) than when the population exhibits high values of \(\theta \)?

Figure 3 reveals that this is for two (in this case, related) reasons: Both response variability and surprisal are higher in the low-\(\theta \) population. Looking more carefully at the response distributions shown in the lefthand panel, the probability of an observation of \(y = 1\) is closest to .5 in the low-\(\theta \) population. This makes sense: As discussed in “Motivating example”, low values of the trait lead to arbitrary responses – i.e., responses that are harder to predict. Thus, response variability is much higher. For the same reason, surprisal is also higher: The low-\(\theta \) population surprises the specified prior by producing \(y = 0\) much more often than it anticipates. (The high-\(\theta \) population also surprises the specified prior by producing \(y = 0\) less often than it anticipates, but the surprise is not as great as in the low-\(\theta \) population.)

This section has shown that when the specified prior is misinformed, ADO’s global utility may mislead about the expected focal divergence. The following sections explore the practical relevance of this misalignment. As stressed in “Introduction”, the motivation for our work is a situation where the population distribution is inaccessible to the experimenter. While our motivating example applied our framework to understanding the effect of variation in the population distribution, what is of more practical interest is what can be controlled by the experimenter: the prior they use, and whether they use ADO at all. “Prior misinformation in the context of parameter estimation” and “Prior misinformation in the context of model selection” address these questions in the context of parameter estimation and model selection, respectively.

Prior misinformation in the context of parameter estimation

“The prior’s two lives” and “Expected focal divergence” showed that under prior misinformation, ADO can be mistaken about the expected gain in information from a particular stimulus. In cases where it cannot reliably anticipate the expected focal divergence, does ADO still enjoy an advantage over other experimental design methods? In this section, we investigate this question in the context of the problem of parameter estimation. We show that even under prior misinformation, ADO facilitates identification of the correct parameter value faster than other sequential design methods. In many practical cases, using methods like ADO may be even more important when there is a danger of prior misinformation, since this misinformation can be overcome comparatively faster than under other experimental design methods.

As discussed in “Decomposition of the expected focal divergence”, when identifying properties of specified priors that are robust to prior misinformation, we are most interested in their effect on hindsight. With reference to Eq. 11, hindsight is composed of three terms: \(p_1{}(y \vert x,\phi )\), \(p_1{}(\phi \vert y,x)\) and \(p_0{}(y \vert x)\). In the case of parameter estimation, these become \(p_1{}(y \vert x,\theta )\), \(p_1{}(\theta \vert y,x)\) and \(p_0{}(y \vert x)\). Here, unlike in the case where the focus is the model, the focal predictive distributions do not depend on the specified prior, i.e., \(p_1{}(y \vert x,\theta ) = p_0{}(y \vert x,\theta )\). Thus, of these three terms, only \(p_1{}(\theta \vert y,x) \propto p_0{}(y \vert x,\theta ) ~ p_1{}(\theta )\), representing the specified posterior, depends on the specified prior. One way to achieve high hindsight given a misinformative prior is to specify a prior for which the likelihood dominates the posterior. As we discussed in “Types of priors”, this is the definitional property of priors that are uninformative in parameter space. Indeed, empirical studies by Alcalá-Quintana and Garcia-Pérez (2004) showed that in the context of the adaptive estimation of psychometric functions, uniform priors led to less bias than other commonly specified priors. These results lead us to expect that priors that are uninformative in parameter space will contribute to robustness in the face of prior misinformation.

Empirical results

This section empirically tests the robustness of ADO to misinformation in two modeling paradigms: the item response paradigm introduced in “Motivating example”, and a paradigm used to measure a participant’s capacity for memory retention. All experiments reported in this paper were run using the pyBAD package for ADO (Sloman, 2022).

Item response theory

This section discusses simulation experiments to estimate the parameters of item-response models run under the modeling paradigm used as our motivating example.

Experimental setup We simulated experiments under two design methods: ADO and a fixed design method. Again drawing inspiration from Weiss and McBride (1983), who discretized the parameter space into 31 equally spaced levels ranging from -3 to 3, the fixed design was set a priori as all such 31 stimuli (presented in a random order). ADO was similarly constrained to select from amongst these 31 candidate stimuli. All experiments were run for 31 trials.

For the fixed design, this means that each stimulus would have been presented exactly once, while in the ADO experiments some of those candidate stimuli may be repeated or not presented at all. For each combination of design method, population distribution, and specified prior, we simulated a total of 1000 experiments. In each experiment, a new value of \(\theta ^*\), the parameter value governing the true distribution of responses, was sampled at random from the corresponding population distribution, and held fixed for that experiment. Data were generated according to Eq. 10. Both methods were initialized with the specified prior.
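A schematic of one such simulated experiment is sketched below. The grid resolution, random seed, and bookkeeping are our own simplifications of the setup just described; the simulations reported in the paper were run with pyBAD.

```python
import numpy as np
from scipy.stats import norm

def p_correct(x, theta):
    return 0.2 + 0.8 / (1.0 + np.exp(-2.72 * (theta - x)))

stimuli = np.linspace(-3, 3, 31)       # 31 candidate item difficulties
theta_grid = np.linspace(-3, 3, 31)    # discretized parameter space
rng = np.random.default_rng(0)

def run_experiment(pop_mean, prior_mean, adaptive, n_trials=31):
    pop_w = norm.pdf(theta_grid, loc=pop_mean, scale=1.0)
    theta_star = rng.choice(theta_grid, p=pop_w / pop_w.sum())  # participant's true trait
    w = norm.pdf(theta_grid, loc=prior_mean, scale=1.0)
    belief = w / w.sum()                                        # specified prior
    fixed_order = rng.permutation(stimuli)
    for t in range(n_trials):
        if adaptive:
            # Global utility (Eq. 6) of each candidate item under the current belief.
            utilities = []
            for x in stimuli:
                u = 0.0
                for y, lik in ((1, p_correct(x, theta_grid)),
                               (0, 1.0 - p_correct(x, theta_grid))):
                    p_y = np.sum(lik * belief)
                    u += np.sum(belief * lik * np.log(lik / p_y))
                utilities.append(u)
            x = stimuli[int(np.argmax(utilities))]
        else:
            x = fixed_order[t]
        y = int(rng.random() < p_correct(x, theta_star))        # simulated response
        lik = p_correct(x, theta_grid) if y == 1 else 1.0 - p_correct(x, theta_grid)
        belief = belief * lik / np.sum(belief * lik)            # Bayes update
    return theta_star, belief

theta_star, posterior = run_experiment(pop_mean=2.0, prior_mean=0.0, adaptive=True)
print(posterior[theta_grid == theta_star])   # posterior probability of the true value
```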

Fig. 4
Item response models: Empirical results. (a) Results for different populations with the same prior, \(\mathbf{\Theta }_1 \sim \mathscr {N} (0, 1)\). (b) Results as a function of uncontrolled changes in specified prior with population \(\mathbf{\Theta }_0 \sim \mathscr {N}(2, 1)\). (c) Results as a function of controlled changes in specified prior with population \(\mathbf{\Theta }_0 \sim \mathscr {N}(0, 1)\). Note. Lines track the posterior probability assigned to the true parameter value across trials, averaged over \(n = 1000\) simulation experiments. Shaded regions indicate standard errors. Panel (a) shows results for different populations with the same prior, \(\mathbf{\Theta }_1 \sim \mathscr {N}(0, 1)\). The specified prior is fixed and different colors correspond to different population distributions. Panel (b) shows results as a function of uncontrolled changes in specified prior with population \(\mathbf{\Theta }_0 \sim \mathscr {N}(2, 1)\). Panel (c) shows results as a function of controlled changes in specified prior with population \(\mathbf{\Theta }_0 \sim \mathscr {N}(0, 1)\). In each of Panels (b) and (c), the population distribution is fixed and different colors correspond to different specified priors. Black curves always denote the case where the specified prior is informative, i.e., \(\mathbf{\Theta }_1 = \mathbf{\Theta }_0\). Solid lines show the performance of ADO. Dashed lines show the performance of the fixed design

We here show the results of three sets of experiments:

1. Experiments that show the effect of changes in population distribution, with the specified prior held fixed, were run under the same conditions as shown in Figs. 2 and 3.

2. Experiments that show the effect of uncontrolled changes in specified prior fixed the population distribution to \(\mathbf{\Theta }_0 \sim \mathscr {N}(2, 1)\) and varied the specified prior among an informative prior (\(\mathbf{\Theta }_1 = \mathbf{\Theta }_0 \sim \mathscr {N}(2, 1)\)), a misinformative prior (\(\mathbf{\Theta }_1 \sim \mathscr {N}(0, 1)\)), and a more dispersed misinformative prior, i.e., a prior that is uninformative in parameter space (\(\mathbf{\Theta }_1 \sim \mathscr {N}(0, 2)\)). We refer to these manipulations as ‘uncontrolled’ changes because they do not control for the degree of prior misinformation: The uninformative prior assigns a higher prior log probability to \(\theta ^*\), and induces lower surprisal across part of the stimulus space. Thus, the misinformative prior is at an initial disadvantage but may learn faster because of the mismatch in surprisal.

3. To isolate the effect of dispersion from prior misinformation, experiments that show the effect of controlled changes in specified prior fixed the population distribution to \(\mathbf{\Theta }_0 \sim \mathscr {N}(0, 1)\) and varied the specified prior among an informative prior (\(\mathbf{\Theta }_1 = \mathbf{\Theta }_0 \sim \mathscr {N}(0, 1)\)), a misinformative prior (\(\mathbf{\Theta }_1 \sim \mathscr {N}(0, .65)\)) and a more dispersed misinformative prior, i.e., a prior that is uninformative in parameter space (\(\mathbf{\Theta }_1 \sim \mathscr {N}(0, 2)\)). While these conditions are more artificial than those in our second set of experiments, they control for prior misinformation in the sense that the uninformative prior both tends to assign a lower prior log probability to \(\theta ^*\), and induces higher surprisal across the entire stimulus space.

Results Each panel of Fig. 4 shows results corresponding to one of the three sets of experiments described above. The x-axis of each panel indicates the trial number. The y-axis indicates the log posterior probability of the true parameter value. In all cases, the black curve corresponds to the informative case, where the specified prior matches the population distribution.

First, comparing ADO (solid lines) to the fixed design (dashed lines), it is clear that ADO outperforms the fixed design in all three cases. In fact, even under prior misinformation, ADO ultimately results in stronger inference than the fixed design under an informative prior.

Taking a closer look at the first set of simulations in Fig. 4a, we find no discernible difference between the population conditions. Although Fig. 3 showed the low-\(\theta \) population induced higher expected focal divergence, this difference does not translate into a difference in the rate of convergence on the correct parameter value. Recall from “Revisiting motivating example” that the higher expected focal divergence in the low-\(\theta \) population was largely driven by higher response variability. If high response variability stems mainly from dispersion across values of the focus, this indicates that each value of the focus makes distinct predictions, facilitating identification of the correct value (Houlsby et al., 2011).

However, in the low-\(\theta \) population, response variability stems mostly from higher guessing rates. More generally, as this example illustrates, high response variability that is inherent in the model, i.e., that does not disappear even when conditioning on a particular parameter value, inhibits identification of the correct parameter value.

Figure 4b and c show that the prior that is uninformative in parameter space generally converges more quickly on the correct parameter value, whether using ADO or the fixed design. This is the case even when controlling for prior misinformation (Fig. 4c), since priors that are uninformative in parameter space are able to respond more effectively to unexpected observations.

Fig. 5
Memory retention models (parameter estimation): Empirical results. Note. Panel (a) shows the predictive distributions under each prior, with lines denoting mean predictions, and shaded regions indicating the standard deviation across the corresponding prior. Panel (b) shows performance across the course of the experiment. Lines denote the mean log posterior probability assigned to the generating parameter values, and shaded regions denote standard errors around those means (across \(n =\) 2 populations \(\times \) 100 repetitions = 200 simulated experiments). Different colors correspond to results under different specified priors. Solid lines show the performance of ADO. Dashed lines show the performance of the fixed design

Memory retention

While the simplicity of the item-response paradigm allows careful control of our experimental conditions and facilitates interpretation, it potentially limits the generalizability of our findings. We now test whether the main finding – that ADO for parameter estimation outperforms other sequential design methods under prior misinformation – holds in a more complex modeling paradigm: estimating a participant’s capacity for memory retention.

Over a century of research on forgetting has shown that a person’s ability to remember information just learned drops quickly for a short time after learning and then levels off as more and more time elapses (Ebbinghaus, 1913; Laming, 1992). The simplicity of this data pattern has led to the introduction of a number of models to describe the rate at which information is retained in memory (Rubin & Wenzel, 1996).

One of these is the power-law model, which posits that the probability a participant will recall an item (\(y = 1\)) x seconds after presentation is (Wixted & Ebbesen, 1991):

$$\begin{aligned} p(y = 1) = a(x + 1)^{-b}. \end{aligned}$$
(12)

The parameters of the model are a and b, where \(0 \le a \le 1\) encodes a baseline level of accuracy, and \(0 \le b \le 1\) encodes the forgetting rate.

Experimental setup We again ran experiments under two design methods: ADO and a fixed design method. The design variable to be manipulated was the time delay between presentation of the target and the recall phase (i.e., x in Eq. 12).

The fixed design method was a slight variation on a benchmark used by Cavagnaro et al. (2010), taken from previous literature (Rubin et al., 1999). In this fixed design scheme, delays were \(\{ 0, 1, 2, 4, 7, 12, 21, 35, 59, 99 \}\) seconds. Each fixed-design experiment ran for 100 trials, allowing each of these ten delays to be repeated ten times. The order of stimuli was randomized separately for each experiment. ADO experiments also ran for 100 trials. In each ADO trial, the time delay could be any integer between 0 and 100 s.

We simulated experiments under two different population distributions, each combined with four types of specified priors. For the \(\underline{\text {high}\, {b}}\) population, we set \(\varvec{b}_0 \sim \) Beta(2, 1), i.e., the forgetting rate is high, on average, but negatively skewed. For the \(\underline{\text {low}\, {b}}\) population, we set \(\varvec{b}_0 \sim \) Beta(1, 2), i.e., the forgetting rate is low, on average, but positively skewed. For both populations, we set a to Beta(1, 1), which is equivalent to a uniform distribution between 0 and 1. The four types of specified priors are as follows:

1. Informative priors matched the population distributions given above.

2. Priors that mistook the two populations: The specified prior for the high b population was \(a \sim \) Beta\((1,1), b \sim \) Beta(1, 2), and the specified prior for the low b population was \(a \sim \) Beta\((1,1), b \sim \) Beta(2, 1). In the context of these experiments, we refer to these as misinformative priors.

3. Priors that were uninformative in parameter space specified that \(a \sim \) Beta\((1,1), b \sim \) Beta(1, 1).

4. Priors that were uninformative in data space resulted in maximally dispersed predictive distributions. The prior that achieves this is \(a \sim \) Beta\((2,1), b \sim \) Beta(1, 4) (Cavagnaro et al., 2010).

Figure 5a shows typical forgetting curves under each prior.

For each population and for each type of prior, we simulated 100 experiments, for a total of 2 design methods \(\times \) 2 populations \(\times \) 4 types of specified priors \(\times \) 100 repetitions = 1600 experiments. In each experiment, a true parameter \(\theta ^*=\{a^*,b^*\}\) was randomly drawn from the corresponding population distribution, the time delay on each trial was selected according to the design method, and data were generated according to Eq. 12.
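The sketch below sets up the pieces of this design: the power-law model of Eq. 12, discretized independent Beta priors for each of the four prior types, and a single simulated observation from the high-b population followed by a Bayesian update of each specified prior. The grid resolution and the single-trial illustration are our own choices.

```python
import numpy as np
from scipy.stats import beta

def p_recall(x, a, b):
    """Eq. 12: probability of recall after a delay of x seconds."""
    return a * (x + 1.0) ** (-b)

# Discretized grids for the two parameters (resolution is our choice).
a_grid = np.linspace(0.005, 0.995, 100)
b_grid = np.linspace(0.005, 0.995, 100)

def grid_prior(a_params, b_params):
    """Independent Beta priors over (a, b), evaluated on the grid and normalized."""
    w = beta.pdf(a_grid, *a_params)[:, None] * beta.pdf(b_grid, *b_params)[None, :]
    return w / w.sum()

priors = {
    "informative (high b)":        grid_prior((1, 1), (2, 1)),
    "misinformative (high b)":     grid_prior((1, 1), (1, 2)),
    "uninformative in parameters": grid_prior((1, 1), (1, 1)),
    "uninformative in data space": grid_prior((2, 1), (1, 4)),
}

# Draw a participant from the high-b population and simulate one response at a 12 s delay.
rng = np.random.default_rng(0)
a_star, b_star = rng.beta(1, 1), rng.beta(2, 1)
y = int(rng.random() < p_recall(12, a_star, b_star))

# Bayesian update of each specified prior given the observation (x = 12, y).
lik = p_recall(12, a_grid[:, None], b_grid[None, :])
for name, prior in priors.items():
    post = prior * (lik if y == 1 else 1.0 - lik)
    post /= post.sum()
    print(name, round(float(np.sum(post * b_grid[None, :])), 3))  # posterior mean of b
```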

Results Figure 5b shows how the correctness of inference evolves over the course of the experiment under each type of prior (results are pooled across the two populations). Values on the y-axes are the log probabilities assigned to the true, generating parameter value under each specified prior. This figure shows replication of our main result from “Item response theory”: ADO outperforms the benchmark for each population and every type of specified prior. Interestingly, unlike in the item-response paradigm, differences in performance at the end of the experiment are mostly accounted for by the type of prior: The fixed design under the informative prior generally does better than ADO under the misinformative or uninformative in data space priors (this is despite the fact that, unlike in the item-response example, ADO has access to a larger stimulus bank than the fixed design method).

In sum, in both simulation paradigms, ADO performed better than the fixed design method even under prior misinformation. In other words, we do not find that prior misinformation diminishes ADO’s relative advantage. In fact, our results suggest that using ADO when there is prior misinformation may help to overcome that misinformation more quickly than using other design methods.

Choosing a prior distribution

Our results show that access to an informative prior offers a clear advantage: Experiments run under an informative prior lead to the quickest convergence on the true parameter value. However, both our mathematical and empirical results also show that specifying a prior that is uninformative in parameter space can be nearly as good: Experiments run under a prior that is uninformative in parameter space result in a final parameter estimate that is essentially as accurate as the estimate from experiments run under an informative prior. In other words, informative priors offer inference a ‘head start,’ while the efficiency of priors that are uninformative in parameter space enables inference to quickly catch up with inference under the informative priors.

Of course, when there is uncertainty about the population distribution, attempting to specify an informative prior exposes one to the downside risk that the specified prior actually slows convergence relative to what could be achieved with an uninformative prior (cf., the orange line in Fig. 5b). Is this risk worth the potential head start, or are researchers better off specifying priors that are uninformative in parameter space as a rule? The answer to that question is that it depends.

The size of the informative prior’s upside advantage – the size of its head start – depends on how misinformed the uninformative in parameter space prior is. The true prior shown in Fig. 5 is quite similar to the corresponding uninformative in parameter space prior, and so the size of the head start was relatively small (and in practice, would likely be outweighed by the downside risk of misinformation). However, this may not be the case when prior knowledge more severely restricts the set of likely parameter values.

Another important consideration is the number of responses the experimenter can elicit from each participant: In our simulations, experiments were run for long enough that inference under the uninformative in parameter space priors could catch up to inference under the informative priors. In these cases, priors that are uninformative in parameter space offer the same upside advantage as informative priors without the downside risk. However, in practice, experimenters may have the resources to collect far fewer responses per participant – indeed, it is precisely such cases that motivate the use of ADO in the first place! In these cases, the head start provided by informative priors will result in a more accurate final inference.

Prior misinformation in the context of model selection

“Prior misinformation in the context of parameter estimation” showed that, in the context of parameter estimation, ADO usually leads to faster convergence on the true parameter value under prior misinformation than other sequential design methods. This section explores whether the same can be said in the context of model selection. It will turn out that, in the context of model selection, the effect of prior misinformation can be more damaging: It can lead one to favor the wrong model.

A common measure of the strength of evidence in favor of one model \(m_1\) over another model \(m_2\) is the Bayes factor, or relative likelihood of data \(y \vert x\) under \(m_1\) and \(m_2\):

$$\begin{aligned} BF(m_1,m_2) &= \frac{p_1(y \vert x, m_1)}{p_1(y \vert x, m_2)} \nonumber \\ &= \frac{\int _\theta p(y \vert x, \theta ) \, p_1(\theta \vert m_1) \, d\theta }{\int _\theta p(y \vert x, \theta ) \, p_1(\theta \vert m_2) \, d\theta }. \end{aligned}$$
(13)

Equation 13 reveals the sensitivity of model selection to prior misinformation: The apparent strength of evidence in favor of one model over the other is a function of the specified priors \(\mathbf{\Theta }_1 \vert m_1\) and \(\mathbf{\Theta }_1 \vert m_2\). Under prior misinformation, the magnitude and even direction of the Bayes factor can be misleading, meaning that it can lead to the erroneous selection of one model over the true, generating model (Vanpaemel, 2010; Lopes & Tobias, 2011).
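As a concrete illustration of Eq. 13, the sketch below estimates each marginal likelihood by simple Monte Carlo, averaging the data likelihood over draws from the corresponding specified prior. The helper names (likelihood_fn, prior_sampler) are hypothetical stand-ins for whatever likelihood and prior a user supplies; more sophisticated estimators exist, and this is only meant to show where the specified priors enter the calculation.

```python
# A minimal Monte Carlo sketch of Eq. 13: each marginal likelihood p(y | x, m) is
# approximated by averaging the data likelihood over draws from that model's
# specified prior. `likelihood_fn` and `prior_sampler` are hypothetical callables
# supplied for each candidate model.
import numpy as np

def marginal_likelihood(y, x, likelihood_fn, prior_sampler, n_draws=5000):
    """Estimate p(y | x, m) as the average of p(y | x, theta) over prior draws of theta."""
    return float(np.mean([likelihood_fn(y, x, theta) for theta in prior_sampler(n_draws)]))

def bayes_factor(y, x, model_1, model_2, n_draws=5000):
    """Each model_i is a (likelihood_fn, prior_sampler) pair for candidate model m_i."""
    return (marginal_likelihood(y, x, *model_1, n_draws=n_draws)
            / marginal_likelihood(y, x, *model_2, n_draws=n_draws))
```

Changing the prior_sampler supplied for either model changes the resulting Bayes factor, which is exactly the sensitivity described above.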

This is an important concern in Bayesian inference, and addressing it through the choice of prior has been the subject of much literature (Vanpaemel, 2010; Lee et al., 2019). In this section, we show how this concern interacts with the consequences of the choice of prior in its decision-theoretic role.

Recall Eq. 7, which gives the global mutual information utility in the context of model selection. Cavagnaro et al. (2010) showed that Eq. 7 can be rewritten as a function of the Bayes factors between all pairs of candidate models. This result implies that ADO selects stimuli that are expected to lead to extreme Bayes factors according to the specified prior. When the Bayes factors are misleading, this property of ADO can increase the amount of evidence collected that points towards the wrong model.
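For reference, the sketch below computes a mutual-information utility of this kind for a binary response; we take this to be the quantity Eq. 7 denotes, although the equation itself is not reproduced here. The prior_predictive helper, which returns \(p(y=1 \vert x, m)\) under each model's specified prior, is a hypothetical stand-in.

```python
# A minimal sketch of a mutual-information utility for model selection with a
# binary response (taken to be the quantity denoted by Eq. 7). `prior_predictive`
# is a hypothetical helper returning p(y = 1 | x, m) under model m's specified prior.
import numpy as np

def mutual_information_utility(x, models, prior_over_models, prior_predictive):
    """Expected information gain about the model indicator at design x."""
    p_m = np.asarray(prior_over_models, dtype=float)
    p_y1_given_m = np.array([prior_predictive(x, m) for m in models])
    p_y1 = float(np.sum(p_m * p_y1_given_m))       # marginal predictive p(y = 1 | x)
    utility = 0.0
    for pm, p1 in zip(p_m, p_y1_given_m):
        for p_y_m, p_y in ((p1, p_y1), (1.0 - p1, 1.0 - p_y1)):
            if p_y_m > 0.0:
                utility += pm * p_y_m * np.log(p_y_m / p_y)
    return utility

# ADO would then select the candidate design with the highest utility, e.g.:
# best_x = max(candidate_delays,
#              key=lambda x: mutual_information_utility(x, models, prior_m, prior_predictive))
```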

The results presented in the remainder of this section will show that in the case of model selection, as in the case of parameter estimation, ADO tends to accelerate convergence towards a particular model. However, under a deceptive prior, this might be the wrong model. In such cases, desirable behavior for an experimental design method would be to decelerate, rather than accelerate, convergence. We will show in “Empirical results” that in such cases other experimental design methods outperform ADO.

Effect of prior misinformation through the lens of Bayesian inference

Before turning to our results on the effect of ADO, we first present a simple example illustrating the potential effect of prior misinformation in the context of Bayesian inference more generally. Consider the toy example shown in Fig. 6. The task is to distinguish between two models, Model A and Model B. Each model has a free parameter, \(\mu _A \sim \varvec{\mu }_A\) and \(\mu _B \sim \varvec{\mu }_B\), respectively,Footnote 13 and makes predictions for a single stimulus \(x_0\). Under Model A, responses to \(x_0\) are distributed as \(\textbf{Y} \vert x_0, \mu _A \sim \mathscr {N}(\mu _A, 10)\), while under Model B, responses to \(x_0\) are distributed as \(\textbf{Y} \vert x_0, \mu _B \sim \mathscr {N}(\mu _B, 11)\). Thus, the families of functions captured by the two models are distinguished by the inherent variance in responses. The experimenter is required in advance of the experiment to assign a prior distribution to \(\mathbf{\left( M,\varvec{\mu }_A,\varvec{\mu }_B \right) }\), i.e., to both assess the relative likelihoods of Model A and Model B and to specify the distributions over \(\varvec{\mu }_A\) and \(\varvec{\mu }_B\).

Fig. 6 Prior misinformation biases inference for model selection: Motivating example (see discussion in main text). Note. Panel (a) shows parameter distributions and Panel (b) shows focal predictive distributions

The top two panels of Fig. 6a show the prior parameter distributions the experimenter specifies for Models A and B. The corresponding focal predictive distributions for the two models are shown, respectively, as the green and orange dashed lines in Fig. 6b.

Now, consider a heterogeneous participant population in which everyone responds according to Model A (i.e., \(\sigma = 10\)), but with different values of \(\mu _A\) as represented in the bottom panel of Fig. 6a. The solid green line in Fig. 6b shows the distribution of responses from this population. The dispersion in this curve captures both the inherent variance in each participant’s responses (\(\sigma = 10\)), and variance due to the distribution of values of \(\mu _A\) across participants. Importantly, most participants will produce data that is more likely under Model B than under Model A, under their respective specified priors, yielding apparently strong evidence in favor of Model B.

The upshot is that the true state of the world, i.e., the true response distribution, may look very different from the focal predictive distribution corresponding to the generating model. The specified prior sets an expectation for what data from a given model will look like, but if the specified prior is far from the population distribution, data from that model may look very different in reality, which can lead to incorrect inferences. Unless the true state of the world happens to coincide exactly with the predictive distribution of \(m^*\), each possible value of the focus is effectively misspecified a priori. Notice that this does not matter in the case of parameter estimation: There, the focal predictive distributions are unaffected by prior misinformation – as Eq. 1 shows, they are a function only of the model structure, which is (by assumption) known.

This example is admittedly contrived to prove a point. However, such deceptive priors – priors that induce initial convergence towards the wrong model – can actually emerge in practice, as we show in “Empirical results”. In the remainder of this section, we explore – conceptually in “Effect of prior misinformation through the lens of Bayesian decision theory” and empirically in “Empirical results” – the degree to which this phenomenon persists in the context of ADO. The consistency of Bayesian inference guarantees that the experimenter in this example will eventually be able to recover \(m^*\). However, when the amount of data collected is not large, relying on Bayesian decision-theoretic policies – i.e., choosing data on the basis of these misinformed inferences – has the potential to exacerbate the effect of misinformation.
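To see this dynamic numerically, the sketch below reproduces the spirit of the toy example with placeholder distributions (the exact priors and population shown in Fig. 6 are not reproduced here): data are generated from Model A, but with participant-level means far more dispersed than Model A's specified prior anticipates, so the marginal likelihood can favor Model B.

```python
# A minimal sketch of the dynamic illustrated in Fig. 6, with placeholder prior
# and population distributions. Data come from Model A (sd = 10), but the
# participant-level means are more dispersed than Model A's specified prior
# anticipates, so the marginal likelihood can favor Model B (sd = 11).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def log_marginal_likelihood(y, prior_mean, prior_sd, obs_sd, n_draws=20000):
    """Monte Carlo estimate of log p(y | m) under a Normal prior on the mean."""
    mu = rng.normal(prior_mean, prior_sd, size=n_draws)
    log_lik = np.sum(stats.norm.logpdf(y[:, None], loc=mu, scale=obs_sd), axis=0)
    return np.log(np.mean(np.exp(log_lik - log_lik.max()))) + log_lik.max()

# Placeholder specified priors: Model A expects means near 0; Model B is vaguer.
prior_A = dict(prior_mean=0.0, prior_sd=2.0, obs_sd=10.0)
prior_B = dict(prior_mean=0.0, prior_sd=15.0, obs_sd=11.0)

# Heterogeneous population: every participant follows Model A, but with widely
# dispersed means.
mu_star = rng.normal(0.0, 15.0)
y = rng.normal(mu_star, 10.0, size=20)

log_bf_A_vs_B = (log_marginal_likelihood(y, **prior_A)
                 - log_marginal_likelihood(y, **prior_B))
print(f"log BF(A, B) ~ {log_bf_A_vs_B:.2f}  (negative values favor Model B)")
```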

Effect of prior misinformation through the lens of Bayesian decision theory

In the toy example in “Effect of prior misinformation through the lens of Bayesian inference”, ADO would assign \(x_0\) a high global utility because it induces a large divergence between the predictions of Models A and B – even though these predictions are made on the basis of prior misinformation.

In general, when crafting a policy for selecting optimal designs, the goals of parameter estimation and model selection may come into conflict. A stimulus that ADO identifies as optimal for discriminating between models may not be optimal for refining estimates of the distribution of parameter values. In other words, ADO for model selection faces a version of an explore–exploit dilemma: By acting on its prior beliefs about each model’s predictions, it may fail to explore parts of the sample space that could challenge these beliefs.

Thus, when the goals of model selection and parameter estimation are in conflict, ADO can actually exacerbate the problem. By aggressively ‘exploiting’ areas of the design space that appear to yield information about the models, ADO accumulates apparently powerful evidence in favor of its prior beliefs. In contrast, by ‘exploring’ less apparently informative stimuli, other methods may have more of an opportunity to learn the correct parameter distributions before making strong conclusions about the generating model.

ADO’s aggressiveness is thus a double-edged sword: It converges quickly on conclusions based on what it believes about the predictions of the foci. However, in the case where prior beliefs do not reflect the population distribution, it does not seek opportunities to challenge these incorrect beliefs.

Choosing a prior distribution

“Prior misinformation in the context of parameter estimation” showed that priors that are uninformative in parameter space can somewhat mitigate the damage of prior misinformation in the case of parameter estimation. One would hope that the issues that arise in model selection could be avoided by using similarly uninformative priors.

Unfortunately, this is not the case: As will be shown in the following section, priors that are uninformative in parameter space nevertheless associate models with particular response distributions, and are also prone to inducing biased inference. One could still hope that specifying such priors over the parameter distributions of candidate models might mitigate the problem by facilitating more rapid convergence on informative parameter distributions. Indeed, we find empirically that in one model selection context, recovery from biased inference is relatively fast under a uniform prior. However, it is difficult to disentangle the effect of the responsiveness of the uniform prior from its effect on the focal predictive distributions – in particular, how they diverge from the response distribution. We leave investigating whether specifying priors that are uninformative in parameter space mitigates biased inference in the context of model selection as an avenue for future work.

Is it possible to identify a prior that is instead ‘uninformative in model space’? In the case of parameter estimation, the important characteristic of an uninformative prior was that it was responsive: Areas of the parameter space quickly became represented in proportion to the relative likelihood they assigned to the history of observations. A prior that was uninformative in model space would facilitate the proportional representation of models according to their relative conditional likelihood. But as emphasized in “Model selection”, the relative conditional likelihood of a model depends on the prior parameter distribution; indeed, the problem of not knowing the parameter distribution is in a sense the problem of not knowing the conditional likelihood distribution \(\textbf{Y}_0 \vert x, m\).

In summary, we cannot yet offer concrete guidance on prior specification for the case of model selection. The following section uses simulation results to reinforce that apparently uninformative priors can inadvertently induce biased inference.

Empirical results

This section extends the memory retention paradigm introduced in “Memory retention” to model selection. The goal of these results will be to demonstrate that apparently uninformative priors can inadvertently bias inference, and that this bias is exacerbated by ADO.

In these experiments, the goal is to distinguish the power-law model introduced in “Memory retention” (Eq. 12) from the exponential model of memory retention, which posits that the probability a participant will recall an item x seconds after presentation is:

$$\begin{aligned} p(y = 1) = ae^{-bx}. \end{aligned}$$
(14)
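For reference, the two candidate retention models can be written as simple functions; the power-law form shown here for Eq. 12 is our assumption of its standard two-parameter form.

```python
# The two candidate retention models as functions. The power-law form is our
# assumption of the standard two-parameter version of Eq. 12; Eq. 14 is the
# exponential model.
import numpy as np

def power_law(x, a, b):
    """Assumed form of Eq. 12: recall probability after a delay of x seconds."""
    return a * (1.0 + x) ** (-b)

def exponential(x, a, b):
    """Eq. 14: recall probability after a delay of x seconds."""
    return a * np.exp(-b * x)

# At matched parameter values the exponential model decays much faster, so the
# two models' predictions separate over intermediate-to-long delays, the kind
# of divergence a model-selection utility rewards.
delays = np.array([1.0, 5.0, 20.0, 80.0])
print(power_law(delays, 0.8, 0.4))    # approx. [0.61 0.39 0.24 0.14]
print(exponential(delays, 0.8, 0.4))  # approx. [0.54 0.11 0.00 0.00]
```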
Fig. 7 Memory retention models (model selection): Empirical results. Note. Panel (a) shows the predictive distribution of each model under each prior, with lines denoting mean predictions, and shaded regions indicating standard deviations across the corresponding prior. Panel (b) shows results where data were generated from population A. Panel (c) shows results where data were generated from population B. In each of panels (b) and (c), lines denote the mean log posterior probability assigned to the generating model, and shaded regions denote standard errors around those means (across \(n =\) 2 models \(\times \) 100 repetitions = 200 simulated experiments). Different colors correspond to results under different specified priors. Solid lines show the performance of ADO. Dashed lines show the performance of the fixed design

Experimental set-up We considered two types of prior distributions, prior A and prior B. We also varied the population distribution such that each prior was informative in half of our experiments and misinformative in the other half.

  1.

    Prior A assigns a 0.5 probability to each of the power-law and exponential models, with \(a~\sim ~\)Beta(1, 1) and \(b~\sim ~\)Beta(1, 1) under both models. According to prior A, all viable parameter values are equally likely. Therefore, prior A is uninformative in parameter space. When encountering the population characterized by this distribution, which we call population A, prior A is also the informative prior. When encountering any other population, prior A is misinformative.

  2.

    Prior B also assigns probability 0.5 to each model, but \(a~\sim ~\)Beta(2, 1) and \(b~\sim ~\)Beta(1, 4) under the power-law model, and \(a~\sim ~\)Beta(2, 1) and \(b~\sim ~\)Beta(1, 80) under the exponential model. The predictive distribution under prior B exhibits an extremely diffuse set of behavioral patterns, as shown in Fig. 7a (see also Cavagnaro et al., 2010). Therefore, prior B is uninformative in data space. When encountering the population characterized by this distribution, which we refer to as population B, prior B is also the informative prior. When encountering any other population, such as population A, prior B is misinformative.

We simulated experiments with two different design methods: ADO and the fixed design described earlier. In each simulated experiment, data were generated from a given model, either the power-law or exponential model, with parameters randomly drawn from the corresponding population distribution (either A or B). We did not randomize the generating model, but rather simulated an equal number of experiments with each model (each population distribution assigns a probability of .5 to each of the power-law and exponential models, and so the prior over models was always correctly specified). In all, we simulated 1600 experiments: 100 for each combination of design method (ADO or fixed), population (A or B), prior (A or B) and model (power-law or exponential).

Results Figure 7b and c show how the correctness of inference evolves over the course of the experiment under each type of prior (results are pooled across the two generating models). Values on the y-axes are the log probabilities assigned to the generating model \(m^*\) under each specified prior. As in the case of parameter estimation (Fig. 5b), when the specified priors are informative, inference converges steadily, and more quickly under ADO, towards the true model.

However, inference under the misinformative priors exhibits the dynamic explained in the previous subsections, favoring the wrong model (at least initially) – and this is exacerbated by ADO. The reasons for this are precisely the reasons for the confusion illustrated in Fig. 6: As shown in Fig. 7a, in both cases the specified prior distributions are wildly off base about the expected behavior of the population characterized by each model.

Recovery from biased inference under the specified prior that is uninformative in parameter space (Fig. 7c) is quicker than recovery from biased inference under the specified prior that is uninformative in data space (Fig. 7b), potentially reflecting the capacity of the prior that is uninformative in parameter space to more quickly ‘respond’ to unexpected observations.Footnote 14 However, our setup here is not adequate to confirm this. Notice first that while the specified prior varies between the two panels of Fig. 7, so does the population distribution. More fundamentally, the specified prior changes the focal predictive distributions. Taken together, this implies that our setup does not (and perhaps cannot) control for qualitative differences in the divergence between the focal predictive distributions and the response distribution, which, as discussed in “Effect of prior misinformation through the lens of Bayesian inference”, is the source of the biased inferences.

Robust practices for ADO for model selection

The results from the previous section highlight the importance of taking steps to ensure one’s priors are informative – especially when used in conjunction with decision-theoretic methods like ADO, which amplify biases induced by prior misinformation. In the context of model selection, if an experimenter specifies a prior that faithfully captures their epistemic uncertainty, ADO will treat that uninformative prior as a true representation of relative likelihoods in the world and will select designs accordingly. Because the two roles of the prior here conflict, this can result in incorrect inferences.

While we framed our results in “Prior misinformation in the context of model selection” as applying to the problem of model selection – identification of model structure – note that these results apply to any situation in which knowing the value of the focus of interest does not completely identify the true data-generating distribution. In the case of model selection, this applies because one needs the value of both m and \(\theta \) to identify the data-generating distribution, yet evaluates performance based only on m. However, one could also apply ADO to, e.g., a parameter estimation problem for which some ‘nuisance parameters’ are not considered foci for inference (e.g., estimating only main effects in the presence of fixed or participant-level effects). In these cases, our results on model selection, not parameter estimation, would apply.

“Prior misinformation in the context of model selection” discussed the potential beneficial effect of specifying priors that are uninformative in parameter space in mitigating these biases. This section discusses additional methods to alleviate or anticipate this bias, some of which have been adopted by previous studies, and some of which provide promising avenues for future research.

Additional trials to inform specified priors

One way to increase confidence in one’s specified priors is to devote a portion of one’s experimental resources to collecting observations from which to learn more informed parameter distributions. For example, when using ADO to distinguish between competing models of intertemporal choice, Cavagnaro et al. (2016) devoted three quarters of each experiment to parameter estimation, i.e., selecting stimuli to maximize the global utility function for parameter estimation, before using the inferred posteriors for each participant during the later model selection trials.
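As a schematic illustration (not Cavagnaro et al. (2016)'s implementation), such a two-phase policy could look like the following, where param_utility and model_utility are hypothetical callables giving the global utilities for parameter estimation and model selection under the current posterior.

```python
# A minimal sketch of a two-phase design policy: early trials optimize a
# parameter-estimation utility, later trials a model-selection utility.
# `param_utility` and `model_utility` are hypothetical callables mapping a
# candidate design to its global utility under the current posterior.
def two_phase_design(trial, n_trials, candidates, param_utility, model_utility,
                     estimation_fraction=0.75):
    utility = param_utility if trial < estimation_fraction * n_trials else model_utility
    return max(candidates, key=utility)
```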

In a parameter estimation application, Kim et al. (2014) leveraged hierarchical modeling techniques to pool information across participants to construct informed distributions: Data from each sequential participant was used to refine the specified prior for the next participant. They showed that this method led to better parameter estimates in the context of a psychophysical experiment.

While these methods offer promising solutions for many use cases, their application falls outside the scope considered by our work. As we stated in “Introduction”, we consider situations in which the experimenter wishes to use the same prior for every participant. This characterizes situations in which incorporating data from previous participants would be infeasible or unfair (e.g., educational testing), or when the experimenter cannot afford to spend scarce resources on additional parameter estimation trials. (Note that participants in Cavagnaro et al. (2016)’s study were required to complete 80 experimental trials. Conducting an experiment of this length would be at best difficult and at worst impossible in cases in which candidate stimuli correspond to potentially irritating or invasive tests such as a medical procedure.)

Total entropy utility

Borth (1975) introduced the total entropy utility function in order to cope with the dual sources of uncertainty that characterize the model selection problem, i.e., uncertainty about both the model identifier and the parameter value. The total entropy utility function considers the entire state of the world as the focus of the utility function:

$$\begin{aligned} U(x) = \sum _{m \in M} p(m) \int _\theta \int _y \log {\left( \frac{p(m,\theta \vert y,x)}{p(m,\theta )} \right) } \, p(y \vert x,\theta ,m) \, p(\theta \vert m) \, dy \, d\theta . \end{aligned}$$
(15)
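The sketch below evaluates this utility on a discrete parameter grid with a binary response, using the identity \(p(m,\theta \vert y,x)/p(m,\theta ) = p(y \vert x,\theta ,m)/p(y \vert x)\); the grids, weights, and model forms in the usage example are illustrative (and again assume the power-law form of Eq. 12).

```python
# A minimal sketch of Eq. 15 on a discrete parameter grid with a binary response,
# using the identity p(m, theta | y, x) / p(m, theta) = p(y | x, theta, m) / p(y | x).
# `likelihoods[m]` holds p(y = 1 | x, theta, m) over model m's parameter grid and
# `prior_theta[m]` holds the grid weights p(theta | m); both are assumed inputs.
import numpy as np

def total_entropy_utility(likelihoods, prior_theta, prior_m, eps=1e-12):
    # Marginal predictive probability of y = 1 at this design.
    p_y1 = sum(prior_m[m] * np.sum(prior_theta[m] * likelihoods[m]) for m in likelihoods)
    utility = 0.0
    for m in likelihoods:
        for p_lik, p_marg in ((likelihoods[m], p_y1), (1 - likelihoods[m], 1 - p_y1)):
            p_lik = np.clip(p_lik, eps, 1.0)
            utility += prior_m[m] * np.sum(
                prior_theta[m] * p_lik * np.log(p_lik / max(p_marg, eps)))
    return utility

# Illustrative usage: power-law vs. exponential retention at delay x = 10, each
# with uniform weights over a coarse (a, b) grid.
x = 10.0
a, b = np.meshgrid(np.linspace(0.05, 0.95, 10), np.linspace(0.05, 0.95, 10))
a, b = a.ravel(), b.ravel()
lik = {"pow": a * (1 + x) ** (-b), "exp": a * np.exp(-b * x)}
weights = np.full(a.size, 1.0 / a.size)
u = total_entropy_utility(lik, {"pow": weights, "exp": weights}, {"pow": 0.5, "exp": 0.5})
```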

We had hoped that running ADO using the total entropy utility function would, like Cavagnaro et al. (2016)’s method, lead to a balance between parameter estimation and model selection trials. We had further hoped that it would do so more efficiently than fixed or heuristic methods of achieving this balance.

To test this, we ran simulation experiments with exactly the same setup as those discussed in “Empirical results”, with the exception that when using ADO, the stimulus that maximized Eq. 15 (rather than Eq. 7) was selected. The results of these experiments, presented in Appendix C, did not show a consistent advantage of the total entropy utility function in leading to more robust selection of the correct model.

Novel approaches to robust adaptive experiments

The previous two subsections discussed existing methods for coping with the effect of prior misinformation on model selection. However, these existing methods can be prohibitively costly (running additional trials to inform priors) or potentially ineffective (using the total entropy utility function). An important direction for future research is the development of methods that increase the robustness of adaptive design methods to the pitfalls introduced in “Prior misinformation in the context of model selection”. To this end, in this section, we propose two steps experimenters can take in the design and implementation of adaptive experiments to increase their robustness. We leave further development and stress testing of these approaches as avenues for future research.

1. Anticipating biases via prior sensitivity analyses As mentioned in “The prior’s two lives”, the choice of prior distribution in the context of Bayesian inference is the topic of a substantial literature. One practice advocated in this literature (e.g., Lee et al. (2019)) is to perform prior sensitivity analyses, i.e., to perform data analysis under a variety of priors to ensure one’s inferences are robust to the specification of the prior.

We echo the importance of this practice. In the context of adaptive experiments, analogous prior sensitivity analyses are important to understand not only the direct effect of the prior on inference, but also the prior’s indirect effect through its effect on the data collected. For a given specified prior, experimenters should simulate sets of experiments where data is generated by parameter values distributed according to several different ‘participant’ populations. If these simulated experiments are reliably able to identify the true model, this will provide reassurance that actual experiments run under the specified prior will be able to recover the generating model, even if the true participant population differs slightly from the specified prior.
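A minimal sketch of how such a sensitivity analysis could be organized is shown below. The run_experiment and recovers_truth callables are hypothetical stand-ins for the experimenter's own simulation pipeline; the point is the outer loop over specified priors, candidate participant populations, and generating models.

```python
# A minimal sketch of a prior sensitivity analysis for an adaptive experiment.
# `run_experiment(prior, population, m_true)` and `recovers_truth(data, m_true, prior)`
# are hypothetical callables supplied by the experimenter's own simulation pipeline.
import itertools

def prior_sensitivity_analysis(run_experiment, recovers_truth,
                               specified_priors, populations, models, n_reps=100):
    """Model-recovery rate for each (specified prior, population, generating model) cell."""
    results = {}
    for (prior_name, prior), (pop_name, pop), m_true in itertools.product(
            specified_priors.items(), populations.items(), models):
        hits = sum(bool(recovers_truth(run_experiment(prior, pop, m_true), m_true, prior))
                   for _ in range(n_reps))
        results[(prior_name, pop_name, m_true)] = hits / n_reps
    return results

# Cells with low recovery rates flag specified priors whose designs and inferences
# are fragile to plausible deviations of the participant population from that prior.
```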

2. Using a design policy that navigates the explore–exploit dilemma Another approach is to respecify the utility function itself in a way that is more robust to such biases (Go & Isaac, 2022). The total entropy utility function (“Total entropy utility”) is one example of an alternative utility function designed for a similar purpose.

As we discussed in “Effect of prior misinformation through the lens of Bayesian decision theory”, in the context of model selection, ADO effectively faces an explore–exploit dilemma: Should it select a stimulus that ‘exploits’ what it thinks it knows about the predictions of the competing models, or a stimulus that has the potential to contradict these pre-existing beliefs? Designing decision-making policies that effectively navigate the explore–exploit dilemma has been the subject of literature spanning cognitive science (Hills et al., 2015) to machine learning (Schulz et al., 2018). Utility functions intended to navigate this dilemma in the context of model selection could draw from this literature.

One approach to sequential decision-making that navigates this dilemma in a principled way is known as upper confidence bound (UCB) sampling (Schulz et al., 2018): Rather than sample where the expected value of the local utility is highest, a UCB sampler samples where an additive combination of this expectation and a measure of the variance around it is highest. UCB effectively constructs a confidence interval around the expectation, and samples at the upper bound of that confidence interval. During early trials, the variance measure usually dominates, inducing exploration. As the variance measure decreases, the expectation measure begins to dominate, and the sampler gradually turns to exploiting areas where the expectation of the utility is highest.
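A minimal sketch of what UCB-style design selection could look like in this setting is below; utility_samples is a hypothetical helper returning Monte Carlo samples of the local utility of a candidate design under the current posterior, and the exploration weight kappa is a tuning parameter.

```python
# A minimal sketch of UCB-style design selection. `utility_samples(x)` is a
# hypothetical helper returning Monte Carlo samples of the local utility of
# candidate design x under the current posterior; `kappa` controls exploration.
import numpy as np

def select_design_ucb(candidate_designs, utility_samples, kappa=1.0):
    scores = []
    for x in candidate_designs:
        u = np.asarray(utility_samples(x), dtype=float)
        scores.append(u.mean() + kappa * u.std())   # upper confidence bound on the utility
    return candidate_designs[int(np.argmax(scores))]

# Early on, posterior uncertainty makes u.std() large, so high-variance (exploratory)
# designs are favored; as the posterior sharpens, the mean term dominates and the
# policy reverts to exploiting designs with the highest expected utility.
```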

In Appendix D, we leverage our framework to suggest one way the global mutual information utility function could be modified to incorporate principles from UCB sampling.

Discussion and limitations

While our work provides insight into the behavior of ADO under prior misinformation, it also has a number of limitations. Firstly, the scope of our theory is limited. In the case of model selection, our work shows that ADO can exacerbate inferential biases in the presence of prior misinformation, but it does not provide additional guidance as to when, or to what extent, one should anticipate these biases. While some work provides additional theoretical characterization of the conditions that lead to related biases in the context of ADO (Sloman et al., 2024) and other adaptive design schemes, e.g., optional stopping (Hendriksen et al., 2020; Heide & Grünwald, 2021), there remain many open theoretical questions.

Secondly, when ADO is used in practice, the correctness of the resultant inferences can be affected by multiple factors not captured by our mathematical and simulation-based framework. For example, parameter values may drift due to, e.g., changes in the environment. In the context of real-world applications of item-response theory, one of our motivating experimental paradigms, this can lead to biased proficiency estimates (Wells et al., 2002). However, we expect that the dynamics induced by such factors will usually interact with, rather than counteract, the dynamics revealed by our results.

One of the most important factors we have not considered is the potential for model misspecification, i.e., misspecification of the functional form of the model \(p(y \vert x, \theta , m)\) itself. Model misspecification differs from prior misinformation in that the true state of the world is assigned a probability of zero: Since the functional form of the model does not update in light of observed data, one can never hope to counteract this form of misspecification. Under model misspecification, inference can stray further and further from the Bayes-optimal parameter estimate, to the point that totally random designs can lead to better estimates than ADO (Sloman et al., 2022). Recent work has proposed design policies that increase the robustness of ADO in such cases (Overstall & McGree, 2022; Catanach & Das, 2023). Alongside development of practices to increase ADO’s robustness to prior misinformation, further development of methods to enhance ADO’s robustness to model misspecification is an important avenue for future research.

Conclusion

When performing Bayesian inference, there are many considerations experimenters must keep in mind. An important one is the specification of one’s prior distribution. When using optimal design methods like ADO, which rely on specified prior distributions in the design of the experiment itself, this decision has dual consequences: Misinformative priors both bias inference and mislead the experimental design process.

In this paper, we introduced a conceptual and mathematical framework for reasoning about the effect of prior misinformation on the efficiency of ADO. Our framework elucidated one general limitation of mutual information utility functions: While the implied expected focal divergence indicates the degree of posterior divergence, it does not in general indicate whether that divergence is in the right direction.

We applied our framework to two common use cases for ADO: the estimation of parameters that measure individually varying psychological characteristics, and the identification of model structure to inform the development of psychological theory. Through mathematical analysis and simulation experiments, we demonstrated counterintuitive pitfalls of using uninformative priors in the case of model selection. In the context of parameter estimation, our framework elucidated principles upon which users of ADO can base selection of their prior – namely, to favor priors that are uninformative in parameter space, rather than data space. In the context of model selection, we discussed and suggested several practices users of ADO can adopt to enhance the robustness of their design and analysis strategies to the biases we identified. Investigating these practices is a promising direction for future research.

Open Practices Statement

All the simulation code used to generate the results reported in this paper is publicly available at https://github.com/sabjoslo/prior-impact.