Online Simulator-Based Experimental Design for Cognitive Model Selection

The problem of model selection with a limited number of experimental trials has received considerable attention in cognitive science, where the role of experiments is to discriminate between theories expressed as computational models. Research on this subject has mostly been restricted to optimal experiment design with analytically tractable models. However, cognitive models of increasing complexity with intractable likelihoods are becoming more commonplace. In this paper, we propose BOSMOS, an approach to experimental design that can select between computational models without tractable likelihoods. It does so in a data-efficient manner by sequentially and adaptively generating informative experiments. In contrast to previous approaches, we introduce a novel simulator-based utility objective for design selection and a new approximation of the model likelihood for model selection. In simulated experiments, we demonstrate that the proposed BOSMOS technique can accurately select models in up to two orders of magnitude less time than existing LFI alternatives for three cognitive science tasks: memory retention, sequential signal detection, and risky choice.


Introduction
The problem of selecting between competing models of cognition is critical to progress in cognitive science.The goal of model selection is to choose the model that most closely represents the cognitive process which generated the observed behavioural data.Typically, model selection involves maximizing the fit of each model's parameters to data and balancing the quality of the model-fit and its complexity.It is crucial that any model selection method used is robust and sample-efficient, and that it correctly measures how well each model approximates the data-generating cognitive process.
It is also crucial that any model selection process is provided with high quality data from well-designed experiments, and that these data are sufficiently informative to support efficient selection.Research on optimal experimental design (OED) addresses this problem by focusing on how to design experiments that support parameter estimation of single models and, in some cases, maximize information for model selection [Cavagnaro et al., 2010, Moon et al., 2022, Blau et al., 2022].
However, one outstanding difficulty for model selection is that many models do not have tractable likelihoods.The model likelihoods represent the probability of observed data being produced by model parameters and simplify tractable inference [van Opheusden et al., 2020].In their absence, likelihood-free inference (LFI) methods can be used, which rely on forward simulations (or samples from the model) to replace the likelihood.Another difficulty is that existing methods for OED are slow -very slow -which makes them impractical for many applications.In this paper, we address these problems by investigating a new algorithm that automatically designs experiments for likelihood-free models much more quickly than previous approaches.The new algorithm is called Bayesian optimization for simulator-based model selection (BOSMOS).
In BOSMOS, model selection is conducted in a Bayesian framework.In this setting, inference is carried out using marginal likelihood, which incorporate, by definition, a penalty for model complexity, i.e., Occam's Razor.Additionally, the Bayesian framework allows getting Bayesian posteriors over all possible values, rather than point estimates; this is crucial for quantifying uncertainty, for instance, when multiple models can explain the data similarly well (non-identifiability or poor identifiability; Anderson, 1978, Acerbi et al., 2014), or when some of the models are misspecified (e.g. the behaviour cannot be reproduced by the model due to non-independence of the experimental trials; Lee et al., 2019).These problems are compounded in computational cognitive modeling where non-identifiability also arises due to human strategic flexibility [Howes et al., 2009, Madsen et al., 2019, Kangasrääsiö et al., 2019, Oulasvirta et al., 2022].For these reasons, there is an interest in Bayesian approaches in computational cognitive science [Madsen et al., 2018].
As we have said, a key problem for model selection is selection of the design variables that define an experiment.When resources are limited, experimental designs can be carefully selected to yield as much information about the models as possible.Adaptive Design Optimization (ADO) [Cavagnaro et al., 2010[Cavagnaro et al., , 2013b] ] is one influential approach to selecting experimental designs.ADO proposes designs by maximizing the so-called utility objective, which measures the amount of information about the candidate models and their quality.Unfortunately, common utility objectives, such as mutual information [Shannon, 1948, Cavagnaro et al., 2010] or expected entropy [Yang and Qiu, 2005], cannot be applied when computational models lack a tractable likelihood.In such cases, research suggests adopting LFI methods, in which the computational model generates synthetic observations for inference [Gutmann and Corander, 2016, Sisson et al., 2018, Papamakarios et al., 2019].This broad family of methods is also known as approximate Bayesian computation (ABC) [Beaumont et al., 2002, Kangasrääsiö et al., 2019] and simulator-or simulation-based inference [Cranmer et al., 2020].To date, LFI methods for ADO have focused on parameter inference for a single model rather than model selection.
Model selection with limited design iterations requires a choice of design variables that optimize model discrimination, as well as improving parameter estimation.The complexity of this task is compounded in the context of LFI, where expensive samples from the model are required.We aim at reducing the number of model simulations.For this reason, in our approach, called BOSMOS, we use Bayesian Optimization (BO) [Frazier, 2018, Greenhill et al., 2020] for both design selection and model selection.The advantage of BO is that it is highly sample-efficient and therefore has a direct impact on reducing the need for model simulation.BOSMOS combines the ADO approach with LFI techniques in a novel way, resulting in a faster method to carry out optimal design of experiments to discriminate between computational cognitive models, with a minimal number of trials.
The main contributions of the paper are as follows: • A novel approach to simulator-based model selection that casts LFI for multiple models under the Bayesian framework through the approximation of the model likelihood.As a result, the approach provides a full joint Bayesian posterior for models and their parameters given collected experimental data.
• A novel simulator-based utility objective for choosing experimental designs that maximizes the behavioural variation in current beliefs about model configurations.Along with the sample-efficient LFI procedure, it reduces the time cost from one hour, for competitor methods, to less than a minute in the majority of case studies, bringing the method closer to enabling real-time cognitive model testing with human subjects.
• Close integration of the two above contributions yields the first online, sample-efficient, simulation-based, and fully Bayesian experimental design approach to model selection.
• The new approach was tested on three well-known paradigms in psychology -memory retention, sequential signal detection and risky choice -and, despite not requiring likelihoods, reaches similar accuracy to the existing methods which do require them.

Background
In this article, we are concerned with situations where the purpose of experiments is to gather data that can discriminate between models.The traditional approach in such a context begins with the collection of large amounts of data from a large number of participants on a design that is fixed based on intuition; this is followed by evaluation of the model fits using a desired model selection criteria such as as AIC, BIC, cross-validation, etc.This is an inefficient approach -the informativeness of the collected data for choosing models is unknown in advance, and collecting large amounts of data may often prove expensive in terms of time and monetary resources (for instance, cases that involve expensive equipment, such as fMRI, or in clinical settings).These issues have been addressed by modern optimal experimental design methods which we consider in this section and summarize in Table 1.
Optimal experimental design.Optimal experiment design (OED) is a classic problem in statistics [Lindley, 1956, Kiefer, 1959], which saw a resurgence in the last decade due to improvements in computational methods and availability of computational resources.Specifically, adaptive design optimization (ADO) Cavagnaro et al. [2010Cavagnaro et al. [ , 2013b] ] was proposed for cognitive science models, which has been successfully applied in different experimental settings including memory and decision-making.In ADO, the designs are selected according to a global utility objective, which is an average value of the local utility over all possible data (behavioural responses) and model parameters, weighted by the likelihood and priors [Myung et al., 2013].More general approaches, such as Kim et al. [2014], improve upon ADO by combining it with hierarchical modelling, which allow them to form richer priors over the model parameters.
While useful, the main drawback of these methods is that they work only with tractable (or analytical) parametric models, that is models whose likelihood is explicitly available and whose evaluation is feasible.
Model selection for simulator-based models.In the LFI setting, a critical feature of many cognitive models is that they lack a closed-form solution, but allow forward simulations for a given set of model parameters.A few approaches have made advances in tackling the problem of intractability of these models.For instance, Kleinegesse and Gutmann [2020] and Valentin et al. [2021] proposed a method which combines Bayesian optimal experimental design (BOED) and approximate inference of simulator-based models.The Mutual Information Neural Estimation for Bayesian Experimental Design (MINEBED) method performs BOED by maximizing a lower bound on the expected information gain for a particular experimental design, which is estimated by training a neural network on synthetic data generated by the computational model.By estimating mutual information, the trained neural network no longer needs to model the likelihood directly for selecting designs and doing the Bayesian update.Similarly, Mixed Neural Likelihood Estimation (MNLE) by Boelts et al. [2022] trains neural density estimators on model simulations to emulate the simulator.Pudlo et al. [2016] proposed an LFI approach to model selection, which uses random forests to approximate the marginal likelihood of the models.Despite these advances, these methods have not been designed for model selection in an adaptive experimental design setting.Table 1 summarizes the main differences between modern approaches and the method proposed in this paper.
Cognitive models increasingly operate in an agent-based paradigm [Madsen et al., 2019], where the model is a reinforcement learning (RL) policy [Kaelbling et al., 1996, Sutton andBarto, 2018].The main problem with these agent-based models is that they need retraining if any of their parameters are altered, which introduces a prohibitive computational overhead when doing model selection.Recently, Moon et al. [2022] proposed a generalized model parameterized by cognitive parameters, which can quickly adapt to multiple behaviours, theoretically bypassing the need for model selection altogether and replacing it with parameter inference.Although the cost of evaluating these models is low in general, they lack the interpretability necessary for cognitive theory development.Therefore, training a parameterized policy within a single RL model family may be preferable: this would still require model selection but would avoid the need for retraining when parameters change (see Section 4.4 for a concrete example).
Amortized approaches to OED.Recently proposed amortized approaches to OED [Blau et al., 2022] -i.e., flexible machine learning models trained upfront on a large set of problems, with the goal of making fast design selection at runtime -allow more efficient selection of experimental designs by introducing an RL policy that generates design proposals.This policy provides a better exploration of the design space, does not require access to a differentiable probabilistic model and can handle both continuous and discrete design spaces, unlike previous amortized approaches [Foster et al., 2021, Ivanova et al., 2021].These amortized methods are yet to be applied to model selection.
Even though OED is a classical problem in statistics, its application has mostly been relegated to discriminating between simple tractable models.Modern methods such as likelihood-free inference and amortized inference can however make it more feasible to develop OED methods that can work with complex simulator models.In the next sections, we elaborate on our LFI-based method BOSMOS, and demonstrate its working using three classical cognitive science tasks: memory retention, sequential signal detection and risky choice.) with the references to the selected representative works.Here, we emphasize LFI methods, as they do not need tractable model likelihoods, and amortized methods since they are the fastest to propose designs.The amortized approaches, however, need to be retrained when the population distributions (i.e.priors over models or parameters) change, as in the setting such as ours where beliefs are updated sequentially as new data are collected.

Methods
Our method carries out optimal experiment design for model selection and parameter estimation involving three main stages as shown in Figure 1: selecting the experimental design d, collecting new data x at the design d chosen from a design space, and, finally, updating current beliefs about the models and their parameters.The process continues until the allocated budget for design iterations T is exhausted, and the preferred cognitive model m est ∈ M, which explains the subject behaviour the best, and its parameters θ est ∈ Θ est are extracted.While the method is rooted in Bayesian inference and thus builds a full joint posterior over models and parameters, we also consider that ultimately the experimenter may want to report the single 'best' model and parameter setting, and we use this decision-making objective to guide the choices of our algorithm.The definition of what 'best' here means depends on a cost function chosen by the user [Robert et al., 2007].In this paper, for the sake of simplicity, we choose the most common Bayesian estimator, the maximum a posteriori (MAP), of the full posterior computed by the method: (1) where m ∈ M, θ m ∈ Θ m and D 1:t = ((d 1 , x 1 ), ...(d t , x t )) is a sequence of experimental designs d (e.g.shown stimulus) and the corresponding behavioural data x (e.g. the response of the subject to the stimuli) pairs.
In our usage context, it is important to make a few reasonable assumptions.First, we assume that the prior over the models p(m) and their parameters p(θ m | m), as well as the domain of the design space, have been specified using sufficient prior knowledge; they may be given by expert psychologists or previous empirical work.This guarantees that the space of the problem is well-defined.Notice that this also implies that the set of candidate models M = (m 1 , . . ., m k ) is known, and each model is defined, for any design, by its own parameters.Second, we assume that the computational models that we consider may not necessarily have a closed-form solution, in case their likelihoods p(x | d, θ m , m) are intractable, but it is possible to sample from the forward model m, given parameter setting θ m , and design d.In other words, we operate in a simulator-based inference setting.Please note that this likelihood depends only on the current design and parameters, as assumed in our setting.The third assumption is that each subject's dataset is analyzed separately: we consider single subjects with fixed parameters undergoing the whole set of experiments, as opposed to the statistical setting where information about one dataset may impact the whole population such as, for instance, in hierarchical modelling or pooled models.
As evidenced by Equations ( 1) and (2), the sequential choice of the designs at any point depends on the current posterior over the models and parameters , which needs to be approximated and updated at each iteration step of the main loop in Figure 1.This problem can be formulated through sequential importance sampling methods, such as Sequential Monte Carlo (SMC; Del Moral et al., 2006).Thus, the resulting parameter posteriors can be approximated, up to resampling, in the form of equally weighted particle sets: m , m (i) the parameters and models associated with the particle i, as an approximation of p(θ m | m, D 1:t ).These particle sets are later sampled to select designs and update parameter posteriors.In the following sections, we take a closer look at the design selection and belief update stages.
Figure 1: Components of the model selection approach.Main loop continues until the experimental design budget is depleted.Input panel: the experimenter defines a design policy (e.g.random choice of designs), as well as the models and their parameter priors.Middle panel: (i) the next experimental design is selected based on the design policy and current beliefs about models and their parameters (initially sampled from model and parameter priors); (ii) the experiment is carried out using the chosen design, and the observed response-design pair is stored; (iii) current beliefs are updated (e.g.resampled) based on experimental evidence acquired thus far.Output panel: the model and parameters that are most consistent with the collected data are selected by applying one of the well-established decision rules to the final beliefs about models and their parameters.

Selecting experimental designs
Traditionally, in the experimental design literature, the designs are selected at each iteration t by maximizing the reduction of the expected entropy H(•) of the posterior p(m, θ m | D 1:t ).By definition of conditional probability we have: where x t is the response predicted by the model.The first equality comes from the definition of entropy and the second from Bayes rule, where we removed the prior, as this term is a constant term in d t .Here, lower entropy corresponds to a narrower, more concentrated, posterior -with maximal information about models and parameters.
Since neither p(x t | d t , θ m , m) nor, by extension, Equation ( 4) are tractable in our setting, we propose a simulatorbased utility objective where q t is a particle approximation of the posterior at time t, and Ĥ is a kernel-based Monte Carlo approximation of the entropy H.
The intuition behind this utility objective is that we choose such designs d t that would maximize identifiability (minimize the entropy) between N responses x simulated from different computational models p(• | d t , θ m , m).The models m as well as their parameters θ m are sampled from the current beliefs q t (θ m , m | D 1:t−1 ).The full asymptotic validity of the Monte Carlo approximation of the decision rule in Equation ( 5) can be found in Appendix A.
The utility objective in (5) allows us to use Bayesian Optimization (BO) to find the design d t and then run the experiment with the selected design.In the next section, we discuss how to update beliefs about the models m and their parameters θ m based on the data collected from the experiment.

Likelihood-free posterior updates
The response x t from the experiment with the design d t is used to update approximations of the posterior q t (m | D t ) and q t (θ m | m, D t ), obtained via marginalization and conditioning, respectively, from q t (θ m , m | D t ).We use LFI with synthetic responses x θm simulated by the behavioural model m to perform the approximate Bayesian update.
Parameter estimation conditioned on the model.We start with parameter estimation for each of the candidate models using Bayesian Optimization for Likelihood-Free Inference (BOLFI; Gutmann and Corander, 2016).In BOLFI, a Gaussian process (GP) [Rasmussen, 2003] surrogate for the discrepancy function between the observed and simulated data, ρ(x θm , x t ) (e.g., Euclidean distance), serves as a base to an unnormalized approximation of the intractable likelihood p(x t | d t , θ m , m).Thus, the posterior can be approximated through the following approximation of the likelihood function L m (•) and the prior over model parameters p(θ m ): Here, following Section 6.3 of [Gutmann and Corander, 2016], we choose , where the bandwidth m takes the role of a acceptance/rejection threshold.Using a Gaussian likelihood for the GP, this leads to: , where Φ(•) denotes the standard Gaussian cumulative distribution function (cdf).Note that µ(θ m ) and ν(θ m ) + σ 2 are the posterior predictive mean and variance of the GP surrogate at θ m .
Model estimation.A principled way of performing model selection is via the marginal likelihood, that is p( , which is proportional to the posterior over models assuming an equal prior for each model.Unfortunately, a direct computation of the marginal likelihood is not possible with Equation ( 7), since it only allows us to compute a likelihood approximation up to a scaling factor that implicitly depends on .For instance, when calculating a Bayes factor (ratio of marginal likelihoods) for models m 1 and m their respective m1 and m2 , chosen independently, may potentially bias the marginal likelihood ratio in favour of one of the models, rendering it unsuitable for model selection.Choosing the same for each model is not possible either, as it would lead to numerical instability due to the shape of the kernel.
To approximate the marginal likelihood p(x t | m), we adopt a similar approach as in Equation ( 7), by reframing the marginal likelihood computation as a distinct LFI problem.In ABC for parameter estimation, we would generate pseudo-observations from the prior predictive distribution of each model, and compare the discrepancy with the true observations on a scale common to all models.This comparison involves a kernel that maps the discrepancy into a likelihood approximation.For example, in rejection ABC [Tavaré et al., 1997, Marin et al., 2012] this kernel is uniform.
In our case, we will generate samples from the joint prior predictive distribution on both models and parameters, and we use a Gaussian kernel κ η (•) = N (• | 0, η 2 ), chosen to satisfy all of the requirements from Gutmann and Corander [2016]; in particular, this kernel is non-negative, non-concave and has a maximum at 0. The parameter η > 0 serves as the kernel bandwidth, similarly to m in Equation ( 7).The value of κ η (•) monotonically increases as the model m produces smaller discrepancy values.This kernel leads to the following approximation of the marginal likelihood: where ), and ρ is the GP surrogate for the discrepancy.Eq. 9 is a direct equivalent of Eq. 7, but here we integrate (marginalize) over both θ and x θ .Here we used the Gaussian kernel, instead of the uniform kernel used in Eq. 7, as it produced better results for model selection in preliminary numerical experiments.Note that in Eq. 9 we have two approximations, the first one from κ η , stating that the likelihood is approximated from the discrepancy, and the second from the use of a GP surrogate for the discrepancy.
The choice of η is a complex problem, and in this paper we propose the simple solution of setting η as the minimum value of E x θ ∼p(•|θm,m)•q(θm|m,Dt−1) ρ(x θ , x t ) across all models m ∈ M.This value has the advantage of giving non extreme values to the estimations of the marginal likelihood, which should in principle avoid over confidence.
Posterior update.The resulting marginal likelihood approximation in Equation ( 9) can then be used in posterior updates for new design trials as follows: Which is equivalent to: Once we updated the joint posterior of models and parameters, it is straightforward to obtain the model and parameter posterior through marginalization and apply a decision rule (e.g.MAP) to choose the estimate.The entire algorithm for BOSMOS can be found in Appendix B.

Experiments
In the experiments, our goal was to evaluate how well the proposed method described in Section 3 discriminated between different computational models in a series of cognitive tasks: memory retention, signal detection and risky choice.Specifically, we measured how well the method chooses designs which help the estimated model imitate the behaviour of the target model, discriminate between models, and correctly estimate their ground-truth parameters.
In our simulated experimental setup, we created 100 synthetic participants by sampling the ground-truth model and its parameters (not available in the real world) through priors p(m) and p(θ m | m).Then, we ran the sequential experimental design procedure for a range of methods described in Section 4.1, and recorded four main performance metrics shown in Figure 3 for 20 design trials (results analysed further later in the section): the behavioural fitness error η b , defined below, the parameter estimation error η p , the accuracy of the model prediction η m and the empirical time cost of running the methods.Furthermore, we evaluated the methods at different stages of design iterations in Figure 3 for the convergence analysis.The complete experiments with additional evaluation points can be found in Appendix C.
We compute η b , η p and η m for a single synthetic participant using the known ground truth model m true and parameters θ true .The behavioural fitness error η b = X true − X est 2 is calculated as the Euclidean distance between the groundtruth model (X true ) and synthetic (X est ) behavioural datasets, which consist of means µ(•) of 100 responses evaluated at the same 100 random designs T generated from a proposal distributions p(d), defined for each model: Here, m est and θ est are, respectively, the model and parameter values estimated via the MAP rule (unless specified otherwise).m est is also used to calculate the predictive model accuracy η m as a proportion of correct model predictions for the total number of synthetic-participants, while θ est is used to calculate the averaged Euclidean distance θ true − θ est 2 across all synthetic participants, which constitutes the parameter estimation error η p .

Comparison methods
Throughout the experiments, we compare several strategies for experimental design selection and parameter inference, where prior predictive distribution (evaluation of the prior without any collected data) with random design choice from the proposal distribution of each model is used as a baseline (we call this method results Prior in the results).The explanations of these methodologies, as well as the exact setup parameters, are provided below.

Likelihood-based inference with random design
"Likelihood-based" inference with random design (LBIRD) applies the ground-truth likelihood, where it is possible, to conduct Bayesian inference and samples the design from the proposal distribution p(d) instead of design selection: . This procedure serves as a baseline by providing unbiased estimates of models and parameters.As other methods in this section, LBIRD uses 5000 particles (empirical samples) to approximate the joint posterior of models and parameters for each model.The Bayesian updates are conducted through importance-weighted sampling with added Gaussian noise applied to the current belief distribution.

ADO
ADO requires a tractable likelihood of the models, and hence is used as an upper bound of performance in cases where the likelihood is available.ADO [Cavagnaro et al., 2010] employs BO for the mutual information utility objective: where we used 500 parameters sampled from the current beliefs to integrate Similarly to other approaches below which also use BO, the BO procedure is initialized with 10 evaluations of the utility objective with d sampled from the design proposal distribution p(d), while the next 5 design locations are determined by the Monte-Carlo-based noisy expected improvement objective.The GP surrogate for the utility uses a constant mean function, a Gaussian likelihood and the Matern kernel with zero mean and unit variance.All these components of the design selection procedure were implemented using the BOTorch package [Balandat et al., 2020].

MINEBED
MINEBED [Kleinegesse and Gutmann, 2020] focuses on design selection for parameter inference with a single model.
Since our setting requires model selections and by extension working with multiple models, we compensate for that by having a separate MINEBED instance for each of the models and then assigning a single model (sampled from the current beliefs) for design optimization at each trial.The model is assigned by the MAP rule over the current beliefs about models q(m | D 1:t ), and the data from conducting the experiment with the selected design are used to update all MINEBED instances.We used the original implementation of the MINEBED method by Kleinegesse and Gutmann [2020], which uses a neural surrogate for mutual information consisting of two fully connected layers with 64 neurons.This configuration was optimized using Adam optimizer [Kingma and Ba, 2014] with initial learning rate of 0.001, 5000 simulations per training at each new design trial and 5000 epochs.

BOSMOS
BOSMOS is the method proposed in this paper and described in Section 3. It uses the simulator-based utility objective from Equation ( 5) in BO to select the design and BO for LFI, along with the marginal likelihood approximation from Equation ( 9) to conduct inference.The objective for design selection is calculated with the same 10 models (a higher number increases belief representation at the cost of more computations) sampled from the current belief over models (i.e.particle set q t (m | D 1:t ) at each time t), where each model is simulated 10 times to get one evaluation point of the utility (100 simulations per point).In total, in each iteration, we spent 1500 simulations to select the design and additional 100 simulations to conduct parameter inference.
As for parameter inference in BOSMOS, BO was initialized with 50 parameter points randomly sampled from the current beliefs about model parameters (i.e. the particle set q t (θ m | m, D 1:t )), the other 50 points were selected for simulation in batches of 5 through the Lower Confidence Bound Selection Criteria [Srinivas et al., 2009] acquisition function.Once again, a GP is used as a surrogate, with the constant mean function and the radial basis function [Seeger, 2004] kernel with zero mean and unit variance.Once the simulation budget of 100 is exhausted, the parameter posterior is extracted through an importance-weight sampling procedure, where the GP surrogate with the tolerance threshold set at a minimum of the GP mean function [Gutmann and Corander, 2016] acts as a base for the simulator parameter likelihood.

Demonstrative example
The demonstrative example serves to highlight the significance of design optimization for model selection with a simple toy scenario.We consider two normal distribution models with either positive (PM) or negative (NM) mean.
Responses are produced according to the experimental design d ∈ [0.001, 5] which determines the quantity of observational noise variance: These two models have the same prior over parameters θ µ ∈ [0, 5] and may be clearly distinguished when the optimal design value is d = 0.001.We choose a uniform prior over models.

Results
As shown in the first set of analyses in Figure 2, selecting informative designs can be crucial.When compared to the LBIRD method, which picked designs at random, all the design optimization approaches performed exceedingly well.This highlights the significance of design selection, as random designs produce uninformative results and impede the inference procedure.
Figure 3 illustrates the convergence of the key performance measures, demonstrating that the design optimization methods had nearly perfect estimates of ground-truths after only one design trial.This indicates that the PM and NM models are easily separable, provided informative designs.In terms of the model predictive accuracy, MINEBED outperformed BOSMOS after the first trial, however BOSMOS rapidly caught up as trials proceeded.This is most likely because our technique employs fewer simulations per trial but a more efficient LFI surrogate than MINEBED.As a result, our method has the second-best time cost not only for the demonstrative example but also across all cognitive tasks.The only method that was faster is the LBIRD method, which skips the design optimization procedure entirely and avoids lengthy computations related to LFI by accessing the ground-truth likelihood.

Memory retention
Studies of memory are a fundamental research area in experimental psychology.Memory can be viewed functionally as a capability to encode, store and remember, and neurologically as a collection of neural connections [Amin and Malik, 2013].Studies of memory retention have a long history in psychological research, in particular in relation to the shape of the retention function [Rubin and Wenzel, 1996].These studies on functional forms of memory retention seek to quantitatively answer how long a learned skill or material is available [Rubin et al., 1999], or how quickly it is forgotten.Distinguishing retention functions may be a challenge [Rubin et al., 1999], and Cavagnaro et al. [2010] showed that employing an ADO approach can be advantageous.Specifically, studies of memory retention typically consist of a 'study phase' (for memorizing) followed by a 'test phase' (for recalling), and the time interval between the two is called a 'lag time'.Varying the lag time by means of ADO allowed more efficient differentiation of the candidate models [Cavagnaro et al., 2010].To demonstrate our approach with the classic memory retention task, we consider the case of distinguishing two functional forms, or models, of memory retention, defined as follows.
Power and exponential models of memory retention.In the classic memory retention task, the subject recalls a stimulus (e.g. a word) at a time d ∈ [0, 100], which is modelled by two Bernoulli models B(1, p): the power (POW) and exponential (EXP) models.The samples from these models are the responses to the task x, which can be interpreted as 'stimulus forgotten' in case x = 0 and x = 1 otherwise.We follow the definition of these models by Cavagnaro et al. [2010], where p = θ a • (d + 1) −θPOW in POW and p = θ a • e −θEXP•d in EXP, as well as the same priors: Similarly to the previous demonstrative example and the rest of the experiments, we use equal prior probabilities for the models.

Results
Studies on the memory task show that the performance gap between LFI approaches and methods that use groundtruth likelihood grows as the number of design trials increases (Figure 2).This is expected, since doing LFI introduces an approximation error, which becomes more difficult to decrease when the most uncertainty around the models and their parameters has been already removed by previous trials.Unlike in the demonstrative example, where design selection was critical, the ground-truth likelihood appears to have a larger influence than design selection for this task, as evidenced by the similar performance of the LBIRD and ADO approaches.
In regard to LFI techniques, BOSMOS outperforms MINEBED in terms of behavioural fitness and parameter estimation, as shown in Figure 3, but only marginally better for model selection.Moreover, both approaches seem to converge to the wrong solutions (unlike ADO), as evidenced by their lack of convergence in the parameter estimation and model accuracy plots.Interestingly, both techniques continued improving behavioural fitness, implying that behavioural data of the models can be reproduced by several parameters that are different from the ground-truth, and LFI methods fail to distinguish them.A deeper examination of the parameter posterior can reveal this issue, which can be likely alleviated by adding new features for observations and designs that can assist in capturing the intricacies within the behavioural data.

Sequential signal detection
Signal detection theory (SDT) focuses on perceptual uncertainty, presenting a framework for studying decisions under such ambiguity [Tanner and Swets, 1954, Peterson et al., 1954, Swets et al., 1961, Wickens, 2002].SDT is an influential developing model stemming from mathematical psychology and psychophysics, providing an analytical framework for assessing optimal decision-making in the presence of ambiguous and noisy signals.The origins of SDT can be traced to the 1800s, but its modern form emerged in the latter half of the 20th century, with the realization that sensory noise is consciously accessible [Wixted, 2020].Example of a signal detection task could be a doctor making a diagnosis: they have to make a decision based on a (noisy) signal of different symptoms [Wickens, 2002].SDT is largely considered a normative approach, assuming that a decision-maker is bounded rational [Swets et al., 1961].We will consider a sequential signal detection task and two models, Proximal Policy Optimization (PPO) and Probability Ratio (PR), implemented as follows.
SDT.In the signal detection task, the subject needs to correctly discriminate the presence of the signal o sign ∈ {present, absent} in a sensory input o in ∈ R. The sensory input is corrupted with sensory noise σ sens ∈ R: Due to the noise in the observations, the task may require several consecutive actions to finish.At every time-step, the subject has three actions a ∈ {present, absent, look} at their disposal: to make a decision that the signal is present or absent, and to take another look at the signal.The role of the experimenter is to adjust the signal strength d str ∼ Unif(0, 4) and discrete number of observations d obs ∼ Unif discr (2, 10) the subject can make such that the experiment will reveal characteristics of human behaviour.In particular, our goal is to identify the hit value parameter of the subject, which determines how much reward r(a, s) the subject receives, in case the signal is both present and identified correctly.Hence, we have that r(a, s) = r a (s) + r step , r a (s) = θ hit , when the signal is present, and the action is present.r a (s) = 2, when the signal is absent, and the action is absent.r a (s) = 0, when the action is look.r a (s) = −1, in other cases.
where r step = −0.05 is the constant cost of every consecutive action.
PPO.We implement the SDT task as an RL model due to the sequential nature of the task.In particular, the look action will postpone the signal detection decision to the next observation.The model assumes that the subject acts according to the current observation o in and an internal state β: π(a | o in , β).The internal state β is updated over trials by aggregating observations o in using a Kalman Filter, and after each trial, the agent chooses a new action.As we have briefly discussed in Section 2, the RL policies need to be retrained when their parameters change.To address this issue, the policy was parameterized and trained using a wide range of model parameters as policy inputs.The resulting model was implemented using the PPO algorithm [Schulman et al., 2017].
PR.An alternative to the RL model is a PR model.It also assumes sequential observations: a hypothesis test as to whether the signal is present is performed after every observation, and the sequence of observations is called evidence [Griffith et al., 2021].A likelihood for the evidence (sequence of observations) is the product of likelihoods of each observation.A likelihood ratio is used as a decision variable (denoted f t here).Specifically, f t is evaluated against a threshold, which determines which action a t to take as follows: where Here, N CDF (•; µ, ν) is the Gaussian cumulative distribution function (CDF) with the mean µ and standard deviation ν.
For more information about the PR model, we refer the reader to Griffith et al. [2021].
For both models, we used the following priors for their parameters and design values:

Results
BOSMOS and MINEBED are the only methodologies capable of performing model selection in sequential signal detection models, as specified in Section 4.4, due to the intractability of their likelihoods.The experimental conditions are therefore very close to those in which these LFI approaches are usually applied, with the exception that we now know the ground-truth of synthetic participants for performance assessments.
BOSMOS showed a faster convergence of the estimates than MINEBED requiring only 4 design trails to reduce the majority of the uncertainty associated with model prediction accuracy and behaviour fitness error, as demonstrated in Figure 3.In contrast, it took 20 design trials for MINEBED to converge, and extending it beyond 20 trials provided very little benefit.Similarly as in the memory retention task from Section 4.3, error in BOSMOS parameter estimates did not converge to zero, showing difficulty in predicting model parameters for PPO and PR models.Improving parameter inference may require modifying priors to encourage more diverse behaviours and selecting more descriptive experimental responses.Finally, BOSMOS outperformed MINEBED across all performance metrics after only one design trial, with the model predictive accuracy showing a large difference, establishing BOSMOS as a clear favourite approach for this task.
An example of posterior distributions returned by BOSMOS is demonstrated in Figure 4. Despite overall positive results, there are occasional cases in a population of synthetic participants, where BOSMOS fails to converge to the ground-truth.The same problem can be observed with MINEBED, as demonstrated in Appendix D. These findings may be attributed to poor identifiability of the signal detection models, suggested earlier in the memory task, but also due to the approximation inaccuracies accumulated over numerous trials.Since both methods operate in a LFI setting, some inconsistency between replicating the target behaviour and converging to the ground-truth parameters is to be expected when the models are poorly identifiable.

Risky choice
Risky choice problems are typical tasks used in psychology, cognitive science and economics to study attitudes towards uncertainty.Specifically, risk refers to 'quantifiable' uncertainty, where a decision-maker is aware of probabilities associated with different outcomes [Knight, 1985].In risky choice problems, individuals are presented with options that are lotteries (i.e., probability distributions of outcomes).For example, a risky choice problem could be a decision between winning 100 euros with a chance of 25%, or getting 25 euros with a chance of 99%.The choice is between two lotteries (100, 0.25; 0, 0.75) and (25, 0.99; 0, 0.01).The goal of the participant is to maximize the subjective reward of their single choice, so they need to assess the risk associated with outcomes in each lottery.
Several models have been proposed to explain tendencies in these tasks, including normative approaches derived from logic to descriptive approaches based on empirical findings [Johnson and Busemeyer, 2010].In this paper, we will consider four classic models (following Cavagnaro et al., 2013b): expected utility theory (EU) [Von Neumann and Morgenstern, 1990], weighted expected utility theory (WEU) [Hong, 1983], original prospect theory (OPT) [Kahneman and Tversky, 1979] and cumulative prospect theory (CPT) [Tversky and Kahneman, 1992].The risky choice models we consider consist of a subjective utility objective (characterizing the amount of value an individual attaches to an outcome) and possibly a probability weighting function (reflecting the tendency for non-linear weighting of probabilities).Despite the long history of development, risky choice is still a focus of the ongoing research [Begenau, 2020, Gächter et al., 2022, Frydman and Jin, 2022].
The objective is to maximize reward from risky choices.Risky choice problems consist of 2 or more options, each of which is described by a set of probability and outcome pairs.For each option, the probabilities sum to 1. Problems may also have an endowment and/or have multiple stages.These variants are not modelled in this version.We will use similar implementations as Cavagnaro et al. [2013b] to test four models M with our method: EU, WEU, OPT and CPT.Each model has its own corresponding parameters θ m .We consider choice problems where individuals choose between two lotteries A and B. The design space for the risky-choice problems is a combination of designs for lottery A and B. The design space for lottery A is defined as the probabilities of the high and low outcome (d phA and d plA ) in this lottery.The design space for lottery B is analogous to lottery A (d phB and d plB ).We assume that there the decisions contain choice stochasticity, which serves as a likelihood for the ADO and LBIRD methods.The models are implemented as follows.
Choice stochasticity.It is typical to assume that individual choices in risky choice problems are not deterministic (i.e., there is choice stochasticity).We use the following definition for probability of choosing lottery A over B in a choice problem i [Cavagnaro et al., 2013a]: where θ m denotes the model parameters and is a value in range [0,0.5]quantifying stochasticity of the choice (with = 0 corresponding to a deterministic choice).Whether lottery A is preferred is determined using the utilities defined for each model separately.

EU. Following Cavagnaro et al. [2013b]
, we specify EU using indifference curves on the Marschak-Machina (MM) probability triangle.Lottery A consists of three outcomes (x lA , x mA , x hA ), and associated probabilities (p lA , p mA , p hA ).Lottery A can be represented using a right triangle (MM) with two of the probabilities as the plane (p lA and p hA as Figure 4: An example of evolution of the posterior approximation in each of the models tested resulting from BOSMOS in the signal detection task.The last bottom row panels are empty as in both cases the posterior probability of the PR model becomes negligible, so that the particle approximation of this posterior does not contain any more particle.The true value of the parameters is indicated by the cross and the true model is POW in both cases.BOSMOS successfully identified the ground-truth model in both cases: all posterior density (shaded area) has concentrated there by 20 trials, and no more particle exists in the other model.However, only in the first example (top panel) did the ground-truth parameter values (cross) fall inside the 90% confidence interval, indicating some inconsistency in terms of the posterior convergence towards the ground-truth.The axes correspond to the model parameters: sensor-noise (x-axis) and hit value (y-axis); θ low and θ len of the PR model are omitted to simplify visualization.
x and y axes, respectively).Hence, the design space for lottery A consists of only the high and low probability (d plA and d phA ).Lottery B can be represented on the triangle similarly (using d plB and d phB ).Then, indifference curves can be drawn on this triangle, as their slope represents the marginal rate of substitution between the two probabilities.EU is defined using indifference curves that all have the same slope We ask to turn to Cavagnaro et al. [2013b] for a more comprehensive explanation of this modelling approach.
WEU. WEU is also defined using the MM-triangle, as per Cavagnaro et al. [2013b].In contrast to EU, the slope of the indifference curves varies across the MM-triangle for WEU.This is achieved by assuming that all the indifference curves intersect at a point (θ x , θ y ) outside the MM-triangle, where OPT.In contrast to EU and WEU, OPT assumes that both the outcomes x and probabilities p have specific editing functions v and w, respectively.Assuming that for lottery A, v(x A low ) = 0 and v(x A high ) = 1, the utility objectives in OPT can be defined using v(x A middle ) as a parameter θ v Utility u(B) for lottery B can be calculated analogously, and A i B i if u(A) > u(B).The probability weighting function w(•) used is the original work by Tversky and Kahneman [1992] is where θ r is a parameter describing the shape of the function.Thus, OPT has two parameters [θ v , θ r ] ∈ θ OPT , describing the subjective utility of the middle outcome and the shape of the probability weighting function, respectively.
CPT. CPT is defined similarly to OPT, however, the subjective utilities u for lottery A are calculated using Utility u(B) for lottery B is calculated similarly and [θ v , θ r ] ∈ θ CPT .We use the following priors for the parameters of models with the design proposal distributions Please note that d pmA and d pmB can be calculated analytically from d pmA = 2 − d plA − d phA , after which the designs for the same lottery (d plA , d pmA , d phA ) are normalized, so they are summed to 1 (and similar for the lottery B).

Results
The risky choice task comprises four computational models, which significantly expand the space of models and makes it much more computationally costly than the memory task.Despite the larger model space, BOSMOS maintains its position as a preferred LFI approach to model selection, most notably when compared to the parameter estimation error of MINEBED from Figure 2.With more models, BOSMOS's performance advantage over MINEBED grows, with BOSMOS exhibiting higher scalability for larger model spaces.
It is crucial to note that having several candidate models reduces model prediction accuracy by the LFI approaches, thus we recommend reducing the number of candidate models as low as feasible.In terms of performance, BOSMOS is comparable to ground-truth likelihood approaches during the first four design trials, as shown in Figure 3, since it is significantly easier to minimize uncertainty early in the trials.Similarly to the memory task, the error of LFI approximation becomes more apparent as the number of trials rises, as evidenced by comparing BOSMOS to ADO for the behavioural fitness error and model predictive accuracy.In terms of the parameter estimate error, BOSMOS performs marginally better than ADO.
Finally, BOSMOS has a relatively low runtime cost, especially compared to other methods (about one minute per design trial).This brings adaptive model selection closer to being applicable to real-world experiments in risky choice.

Discussion
In this paper, we proposed a simulator-based experimental design method for model selection, BOSMOS, that does design selection for model and parameter inference at a speed orders of magnitude higher than other methods, bringing the method closer to online design selection.This was made possible with newly proposed approximation of the model likelihood and simulator-based utility objective.Despite needing orders of magnitude fewer simulations, BOSMOS significantly outperformed LFI alternatives in the majority of cases, while being orders of magnitude faster, bringing the method closer to an online inference tool.Crucially, the time between experiment trials was reduced to less than a minute.Whereas in some settings this time between trials may be too long, BOSMOS is a viable tool in experiments where the tasks include a lag time, for instance, in studies of language learning (e.g., Gardner et al., 1997, Nioche et al., 2021) and task interleaving (e.g., Payne et al., 2007, Brumby et al., 2009, Gebhardt et al., 2021, Katidioti et al., 2014).Moreover, our code implementation represents a proof of concept and was not fully optimized for maximal efficiency: in particular, a parallel implementation that exploits multiple cores and batches of simulated experiments would enable additional speedups [Wu and Frazier, 2016].As an interactive and sample-efficient method, BOSMOS can help reduce the number of required experiments.This can be of interest to both the subject and the experimenter.
In human trials it allows for faster interventions (e.g.adjusting the treatment plan) in critical settings such as ICUs or RCTs.However, it can also have detrimental applications, such as targeted advertising and collecting personal data, therefore the principles and practices of responsible AI [Dignum, 2019, Arrieta et al., 2020] also have to be taken into account in applying our methodology.
There are at least two remaining issues left for future work.The first issue we witnessed in our experiments is that the accuracy of behaviour imitation does not necessarily correlate with the convergence to ground-truth models.This usually happens due to poor identifiability in the model-parameter space, which may be quite prevalent in current and future computational cognitive models, since they are all designed to explain the same behaviour.Currently, the only way to address this problem is to use Bayesian approaches, such as BOSMOS, that quantify the uncertainty over the models and their parameters.The second issue is the consistency of the method: in selecting only the most informative designs, the methods may misrepresent the posterior and return an overconfident posterior.This bias may occur, for example, due to a poor choice of priors or summary statistics [Nunes andBalding, 2010, Fearnhead andPrangle, 2012] for the collected data (when the data is high-dimensional).Ultimately, these issues do not hinder the goal of automating experimental designs, but introduce the necessity for a human expert, who would ensure that the uncertainty around estimated models is acceptable, and the design space is sufficiently explored to make final decisions.
Future work for simulator-based model selection in computational cognitive science needs to consider adopting hierarchical models, accounting for the subjects' ability to adapt or change throughout the experiments, and incorporating amortized non-myopic design selection.A first step in this direction would be to study hierarchical models [Kim et al., 2014] which would allow adjusting prior knowledge for populations and expanding the theory development capabilities of model selection methods from a single individual to a group level.We could also remove the assumption on the stationarity of the model by proposing a dynamic model of subjects' responses which adapts to the history of previous responses and previous designs, which is more reasonable in longer settings of several dozens of trials.Lastly, amortized non-myopic design selections [Blau et al., 2022] would even further reduce the wait time between design proposals, as the model can be pre-trained before experiments, and would also improve design exploration by encouraging long-term planning of the experiments.Addressing these three potential directions may have a synergistic effect on each other, thus expanding the application of simulator-based model selection in cognitive science even further.

C Full experimental results
We provide full experimental results data in Tables 2-4, where evaluations of performance metrics were made after different numbers of design iterations.Moreover, we report time costs of running 100 design trials for each method in Table 5, where our method was 80-100 times faster than the other LFI method, MINEBED.The rest of the section discusses three additional minor points with relation to the performance of ADO for the demonstrative example, the bias in the model space for the memory task, and results of testing two decision rules in the risky choice.
As we have seen in the main text, MINEBED had the fastest behavioural fitness convergence rate in the demonstrative example, while BOSMOS was the close second.Hence, ADO had the slowest convergence rate among design optimization methods for behavioural fitness (Table 2) and also for parameter estimation (Table 3).This result is somewhat counterintuitive, as we expected ADO, with its ground-truth likelihood and design optimization, to be the fastest to converge.Since the only other factor influencing this outcome, Bayesian updates, had access to the ground-truth and avoided LFI approximations, suboptimal designs are likely to blame for the poor performance.This problem is likely to be mitigated by expanding the size of the grid used by ADO to calculate the utility objective.However, expanding it would likely increase ADO's convergence at the expense of more calculations; therefore we aimed to get its running time closer to that of BOSMOS, so both methods could be fairly compared.
In the results of model selection for the memory retention task discussed in the main text, MINEBED showed a marginally better average model accuracy than BOSMOS.However, upon closer inspection (Table 4), this accuracy can be solely attributed to the strong bias towards the POW model; the other approaches show it as well, albeit less dramatically.This suggests that the two models in the memory task are separable, but the EXP model cannot likely replicate parts of the behaviour space that the POW model can, resulting in this skewness towards the more flexible model.This is also more broadly related to non-identifiability: since these cognitive models were designed to explain the same target behaviour, it is inevitable that there will be an overlap in their response (or behavioural data) space, complicating model selection.
Since the risky choice model had four models of varied complexity, we experimented with two distinct decisionmaking rules for estimating the models and parameters: the default MAP and Bayesian information criterion (BIC) [Schwarz, 1978, Vrieze, 2012].Both decision-making rules include a penalty for the size of the model (artificially for BIC, and by definition for MAP).Interestingly, the results are the same for both decision-making rule, indicating that the EU model cannot be completely replaced by a more flexible model.BIC's slightly superior parameter estimates for the BOSMOS technique is most likely explained by the poorer model prediction accuracy.Nevertheless, the BIC rule remains a viable option for model selection in situations when there is a risk of having a more flexible and complex model alongside few-parameter alternatives, despite being less supported theoretically.

D Posterior evolution examples for BOSMOS and MINEBED
In Figures 5 and 6, we compare posterior distributions returned by MINEBED and BOSMOS for two synthetic participants in the signal detection task.In both examples, the methods have successfully identified the ground-truth POW model, as the majority of the posterior density (shaded area in the figure) has moved to the correct model.Nevertheless, BOSMOS and MINEBED have quite different posteriors, which emphasizes the influence of the design selection strategy on the resulting convergence, as one of the method performs better than the other in each of the provided examples.
Figure 5: The first example of evolution of the posterior distribution resulting from MINEBED (green; rows 1 and 2) and BOSMOS (red; rows 3 and 4) for the signal detection task.For each method, the first row shows the distributions of parameters of the POW model (ground-truth), followed by the PR model parameter distributions in the second row.The axes correspond to the model parameters: sensor-noise (x-axis) and hit value (y-axis); θ low and θ len of the PR model are omitted to simplify visualization.The last bottom row panels are empty as in both cases the posterior probability of the PR model becomes negligible, so that the particle approximation of this posterior does not contain any more particle.

Figure 2 :
Figure 2: An overview of the performance of the methods, compared with the prior predictive with random design, (rows) after 20 design trials across four different cognitive modelling tasks (columns): demonstrative example, memory retention, signal detection and risky choice.While requiring 10 times fewer simulations and 60-100 times less time, the proposed BOSMOS method (red) shows consistent improvement over the alternative LFI method, MINEBED (green), in terms of behavioural fitness error η b , parameter estimation error η p , model predictive accuracy η m and empirical time cost t log (here, for 100 designs, in minutes on a log scale).The model accuracy bars indicate the proportion of correct prediction of models across 100 simulated participants.The error bars show the mean (marker) and std.(caps) of the error by the respective methods.

Figure 3 :
Figure3: Evaluation of three performance measures (rows) after 1, 4 and 20 design trials for BOSMOS (solid red) and two alternative best methods, ADO (blue) and MINEBED (green), in four cognitive tasks (columns).As the number of design trials grows, the methods accumulate more observed data from subjects' behaviour and, hence, should reduce behavioural fitness error η b , parameter estimation error η p , and increase model predictive accuracy η m .Since η b is the performance metric MINBED and BOSMOS optimize, its convergence is the most prominent.The lack of convergence for the other two metrics in the memory retention and signal detection tasks is likely due to the possibility of the same behavioural data being produced by models and parameters that are different from the ground-truth (i.e., non-identifiability of these models).