An approach for finding fully Bayesian optimal designs using normal-based approximations to loss functions

The generation of decision-theoretic Bayesian optimal designs is complicated by the significant computational challenge of minimising an analytically intractable expected loss function over a, potentially, high-dimensional design space. A new general approach for approximately finding Bayesian optimal designs is proposed which uses computationally efficient normal-based approximations to posterior summaries to aid in approximating the expected loss. This new approach is demonstrated on illustrative, yet challenging, examples including hierarchical models for blocked experiments, and experimental aims of parameter estimation and model discrimination. Where possible, the results of the proposed methodology are compared, both in terms of performance and computing time, to results from using computationally more expensive, but potentially more accurate, Monte Carlo approximations. Moreover, the methodology is also applied to problems where the use of Monte Carlo approximations is computationally infeasible.


Introduction
The process of designing a physical experiment fits naturally within the Bayesian approach to statistical inference.Prior information on parameters and models can be represented by prior distributions, and the experimental aim encapsulated in a decision-theoretic framework by the loss function.A Bayesian optimal design is found by minimising the expected loss function over the space of all possible designs, i.e. the design space, where the expectation is with respect to the joint distribution of all unknown quantities, i.e. parameters, models and experimental responses.
Formally, suppose the experiment consists of n runs where the ith run (for i = 1, . . ., n) involves measuring response y i having specified settings d i = (d i1 , . . ., d ik ) for the k controllable factors.Let d = (d 1 , . . ., d n ) ∈ D be the W ×1 vector giving the design where W = nk and D ⊂ R W denotes the W -dimensional design space.Assume there is a set, M, of competing statistical models.Model m ∈ M posits a probability distribution, F m , for y = (y 1 , . . ., y n ) which is completely specified up to an unknown p m × 1 vector of parameters θ m .Bayesian inference on models and/or parameters is based on their joint posterior distribution given by Bayes' theorem as π (θ m , m|y, d) ∝ π (y|θ m , m, d) π (θ m |m) π(m), where π (y|θ m , m, d) is the probability mass/density function of F m ; π (θ m |m) is the probability density function of the prior distribution of θ m ; and π(m) is the prior model probability of model m.Note how π (y|θ m , m, d) depends on the design and this induces a dependence of the posterior on the design.The aim of the experiment is represented by a loss function, which can be tailored to experimental aims of parameter estimation or model discrimination.In general, the loss function is given by λ(θ m , m, y, d).
Essentially, it compares a summary of the posterior distribution (e.g.O' Hagan and Forster, 2004, pgs 13-14) for θ m and m (conditional on y and d) to the true values of θ m and m.However, θ m , m and y are unknown so a Bayesian optimal design (e.g.Chaloner and Verdinelli, 1995) is found by minimising the expected loss (2) over the design space, D, where the expectation in ( 2) is with respect to the joint distribution of θ m , m and y.
Robert (2007, Section 2.1) discusses how Bayesian inference relies on the specification of three components: a) the set of models, M; b) the joint prior distribution, given by π(θ m , m); and c) the loss function, λ(θ m , m, y, d).Notwithstanding the difficulties faced in specifying these three components (e.g.O' Hagan and Forster, 2004, Chapter 6), once in place, a Bayesian optimal design is conceptually straightforward to define.However, finding such a design in practice is hindered by the computational challenge of minimising the expected loss.Typically, the expected loss is given by a multi-dimensional analytically intractable integral, i.e. as given by (2).In the two decades since the seminal review of Bayesian design by Chaloner and Verdinelli (1995), there have been few general-purpose approaches to finding such designs, as recently highlighted by Ryan et al. (2016) and Woods et al. (2017).Note that being able to find the exact optimal design for an arbitrary problem is, at present, an unrealistic goal.Instead, the aim is to find a design "close" to the optimal design, termed a near-optimal design by Hamada et al. (2001).
Existing approaches to approximately finding Bayesian designs can be divided into two broad strategies.First, the simulation-based approach of Müller (1999) where c ≥ sup λ(θ m , m, y, d).The marginal mode of d corresponds to the Bayesian optimal design.The so-called Müller algorithm essentially proceeds by using simulation methods to generate a sample from the joint distribution of θ m , m, y and d and to use this sample to estimate the marginal mode of d.This approach has been further modified by Müller et al. (2004) and Amzal et al. (2006).However, due to the difficulty in implementing efficient sampling methods, the Müller algorithm is difficult to implement for high dimensional design spaces with the limit typically considered to be just W = 4 (e.g.Ryan et al., 2016).Alternatively, the second broad strategy is the smoothing-based approach reliant on the following Monte Carlo approximation to the expected loss where is a sample generated from the joint distribution θ m , m and y.The stochastic nature of Monte Carlo approximation means that L(d) is not a smooth function and makes application of standard optimisation methods (including heuristic methods) difficult.Instead, Müller and Parmigiani (1995) proposed an approach whereby L(d) is evaluated at a series of designs.A statistical model (or smoother) is then fitted which builds a relationship between design and expected loss allowing prediction of the expected loss for any design.This prediction is minimised in place of the true expected loss to give an approximation to the Bayesian optimal design.Similar to the simulation-based Müller algorithm, the scalability of this approach to higher dimensional design spaces remains an issue.The chosen smoother needs to balance the increased flexibility required for adequate predictive accuracy with the computational effort of the increased number of evaluations of L(d) required.Müller and Parmigiani (1995) employed local regression models and considered design spaces up to W = 2.More recently, Weaver et al. (2016) used a Gaussian process model (e.g.Santner et al., 2003) and considered an application with W = 3 and Jones et al. (2016) used Bayes linear analysis (Goldstein and Wooff, 2007) and considered up to W = 9.To increase the applicability to higher dimensional design spaces, Overstall and Woods (2017) proposed the approximate coordinate exchange (ACE) algorithm whereby a cyclic descent algorithm (commonly called coordinate exchange in the design of experiments literature, Meyer and Nachtsheim 1995) is used to minimise the expected loss.Very briefly, a Gaussian process prediction of the expected loss is sequentially minimised over each one-dimensional element of the design space.This can be seen as a generalisation of the approaches of Müller and Parmigiani (1995) and Weaver et al. (2016) to higher dimensional design spaces via the use of coordinate exchange, and thus allowed consideration of examples with design spaces of dimensionality nearly two orders of magnitude greater than previously addressed in the literature.Both the Müller algorithm and the smoothing-based approaches require evaluations of the loss function λ(θ m , m, y, d) either in the evaluation of h(θ m , m, y, d) or L(d), respectively.In both cases, a large number of evaluations of the loss function will be required to find a design.Since the loss function depends on y through the posterior distribution of θ m and m, and given that this distribution will typically be analytically intractable, a further approximation is required.The most obvious approach to this problem is to use an additional Monte Carlo approximation, the exact nature of which depends on the chosen loss function.In the case of the Monte Carlo approximation to the expected loss, L(d), given by ( 4), this will result in a nested or double loop Monte Carlo (DLMC) approximation to the expected loss where the inner loop refers to the approximation to the loss function and the outer loop to the approximation to the expected loss.Let B denote the Monte Carlo sample size in the inner loop with B being the corresponding value in the outer loop.For typical loss functions, DLMC can induce a bias of order B−1 in the approximation (e.g.Ryan, 2003;Rainforth et al., 2016) to the expected loss.Moreover, the computational complexity of this approach is typically in the order of B × B evaluations of π(y|θ m , m, d) for each m ∈ M. Since B and B will typically be O(10 3 ) or higher, this will be a computationally expensive approach.The result of which is that, even by using the ACE algorithm, finding Bayesian optimal designs with the DLMC approximation to the expected loss has been confined to simple problems where the number of models under consideration is |M | = 1 and inference has been focused on parameter estimation.
In this paper, we consider using normal-based approximations to the posterior distribution of θ m , centred around the posterior mode of θ m , and how these can be used to approximate, in principle, any loss function.The normal-based Laplace approximation has previously been used to approximate the commonly-used selfinformation loss function (see Section 3.1) for a non-linear model by Long et al. (2013) for a low-dimensional design space and under no model uncertainty.However application to other loss functions has not previously been considered.We apply the new methodology to illustrative examples which are challenging in the context of Bayesian optimal design.In cases where the computationally more expensive DLMC approach is feasible, we show, empirically, that the difference in performance (measured in terms of expected loss) between designs found under the two different approximations is negligible.We also apply the proposed approach to problems where use of the DLMC approximation to find a design would be computationally infeasible.

Normal-based approximations to posterior quantities
As discussed in Section 1, a typical loss function compares a summary of the joint posterior distribution of θ m and m to the "true" values of θ m and m in a way that is relevant to the experimental aim.However, the joint posterior distribution is usually not available in closed form.The methodology in this paper is based on forming an approximation to this joint distribution using normal-based approximations and therefore providing an alternative approximation to the loss function than the Monte Carlo approximation.The normal-based approximation to the loss function, denoted by λ(θ m , m, y, d), can then be substituted into the Monte Carlo approximation to the expected loss, given by (4) and we refer to such an approximation as normal-based Monte Carlo (NBMC).We can then use the ACE algorithm or one of the other smoothing based approaches discussed in Section 1. Conversely, we may also substitute λ(θ m , m, y, d) into the density of the joint distribution over θ m , m, y and d, given by (3), and apply the Müller algorithm.
First, note that the joint posterior distribution of θ m and m, given by (1), can be decomposed as follows (e.g.O' Hagan and Forster, 2004, page 169) where the posterior model probability of model m ∈ M is given by and the posterior distribution of θ m (conditional on model m ∈ M) is given by usually called the marginal likelihood or evidence (Friel and Wyse, 2012) for model m ∈ M.
The posterior model probabilities are completely determined by the marginal likelihoods.Therefore for cases where there is model uncertainty, i.e. |M| > 1, it will be necessary to approximate π(y|m, d) for all m ∈ M. First, let the posterior mode for model m ∈ M be denoted by θm (y where i.e. the Fisher information matrix minus the second derivative of the log prior density.A second-order Taylor series expansion of the log integrand in (7) about the posterior mode yields the following so-called Laplace approximation (e.g.Gelman et al., 2014;Long et al., 2013) Furthermore, we approximate the posterior distribution of θ m (conditional on m), by a normal distribution with mean θm (y) and variance Σm (y), i.e.
N θm (y), Σm (y) . (10) The tractability of the normal distribution means there are now many direct approximations to posterior summaries of interest (e.g.O' Hagan and Forster, 2004, page 237).For example, trivially, the posterior median of θ m conditional on m is approximated by the posterior mode.These approximations to posterior summaries of interest can then be used to approximate many loss functions of practical interest (see Section 3.1 for examples).
Since the posterior distribution of θ m converges to a normal distribution as n → ∞ (e.g.Gelman et al., 2014, pages 585-588) the approximation will be more accurate for large n.The approximation can be expected to be poor when the true posterior distribution of θ m is multi-modal or has significant skewness.In the latter case, the approximation could be improved by a reparameterisation (e.g.Achcar and Smith, 1990).

Finding the posterior mode and Fisher information
The approximations above are reliant on finding the posterior mode, θm (y), for each m ∈ M.This will need to be accomplished for each , from the joint distribution of θ m , m and y, to evaluate the NBMC approximation to the expected loss.To do this we use a scoring algorithm (e.g.Lange, 2013, pgs 254-257), i.e.Newton's method where evaluation of the Hessian matrix of the log posterior density is replaced by evaluation of −H(θ m ; m).Specifically, let The scoring algorithm requires the repeated inversion of the p m ×p m matrix H(θ m ; m).Often experiments are conducted in blocks where a block consists of homogenous experimental units.A suitable model (e.g.Pawitan, 2013, Chapter 17) in this case is a hierarchical (or mixed model) where the effect of the block is accounted for by using block-specific parameters (sometimes referred to as random effects) for each block, which are assumed to be independent having a common prior distribution.A consequence is that the number of parameters, p m , for these types of models is proportional to the number of blocks and therefore can be large.However due to the conditional independence structure exhibited by hierarchical models, H(θ m ; m) will be sparse leading to computationally efficient methods for finding the inverse.

Examples
In this section we begin by discussing a range of exemplar loss functions and how they can be approximated using the approach described in Section 2. This selection of loss functions is not exhaustive but instead demonstrate how typical loss functions may be approximated by using the approximations outlined in Section 2. We then apply the proposed methodology to find designs for experiments involving standard and hierarchical logistic regression (Section 3.2) and a non-linear model (Section 3.3), for experimental aims of parameter estimation and model discrimination.In the examples we use the ACE algorithm (briefly described in Section 1) to find the optimal design since this is the only method in the literature suitable for finding Bayesian designs for the dimensionality of design space considered.However, the approximations to the loss function could be applied with any method such as the Müller algorithm or another smoothing-based method.A more detailed description of the ACE algorithm is provided in Appendix A with a description of the choice of tuning parameters.
All designs found in Sections 3.2 and 3.3 can be reproduced via the R (R Core Team, 2016) package NBdesigns (Overstall et al., 2017).This is available as a Supplementary Material to this paper.This package allows users to compare future computational methodology to the approaches described in this paper via the benchmark examples considered.

Loss functions
The loss functions considered in this section can be categorised into those for a) parameter estimation (Section 3.1.1);and b) model discrimination (Section 3.1.2).

Parameter estimation
Suppose interest lies in the q × 1 vector of transformed parameters φ = g m (θ m ), where q ≤ min m∈M p m , the g m are a set of one-to-one and invertible functions, and φ has a consistent interpretation for all m ∈ M. Inference about φ is based on the model-averaged posterior distribution (O' Hagan and Forster, 2004, pages 171-174) given by π φ (φ|y, d) where we have introduced a subscript of φ on the density function to indicate it refers to the transformed parameters.Consider the following loss functions representing the experimental aim of estimating φ.In each case, a special case occurs when there is no model uncertainty, i.e. |M| = 1, in which case we are interested in parameter estimation under a single model.

Self-information loss
The self-information (SI) loss is Minimising the expected SI loss is equivalent to maximising the expected Shannon information gain (Lindley, 1956) and expected Kullback-Liebler divergence between prior and posterior distributions.Note that the expected SI loss is non-positive.Typically, the density of the posterior distribution of φ, given by π φ (φ|y, d), in the SI loss will be analytically intractable.However, we can approximate it by first approximating the posterior distribution of θ m by ( 10) and then deriving the approximate distribution of φ = g m (θ m ) (conditional on m) after taking a first-order Taylor series expansion of φ = g m (θ m ) about θm (y) (e.g.Khuri, 2003, Chapter 4).This leads to the following approximation to π φ (φ|y, d) where π(m|y, d) is given by ( 9) and πφ (φ|m, y, d) is the density of The result is that the approximate model-averaged posterior distribution of φ given by ( 12) is a mixture of normal distributions where each component is given by ( 13) and weighted by π(m|y, d).
It may not always be possible to find a closed form for the density of the prior distribution of φ, evaluation of which is necessary for the calculation of the SI loss given by (11).In these cases, we suggest approximating the prior distribution of θ m for each m ∈ M by a normal distribution (with mean µ m = E(θ m |m) and variance Ψ m = var(θ m |m)).Now the prior density π φ (φ) can be approximated via where πφ (φ|m) is the density of N (µ m , Ψ m ).Similar to the posterior distribution, the model-averaged prior distribution of φ is approximated by a mixture of normal distributions but where the weights are the true prior model probabilities.
It is well known (e.g.Chaloner and Verdinelli, 1995) that in cases where there is no model uncertainty, |M| = 1, g(θ) = θ, and under a normal approximation to the posterior distribution the expected SI loss is equal to the objective function that defines pseudo-Bayesian D-optimality (a commonly used criterion in classical optimal design of experiments).In Appendix B we discuss the relationship between the normalbased approximation to the SI loss above and the pseudo-Bayesian D-optimal approximation.It is shown that the objective function for pseudo-Bayesian D-optimality is itself an approximation to the expectation of the normal-based approximation to the SI loss.This places the normal-based approximations as being a compromise between the computationally expensive DLMC approximation and the computationally cheap pseudo-Bayesian D-optimal approximation.Furthermore, in Section 3.2, we empirically compare the two approximations.
Absolute error loss The absolute error (AE) loss (e.g.Robert, 2007, pages 79-80) is given by where Q(φ j |y, d) is the model-averaged posterior marginal median of φ j .If there is no model uncertainty, i.e. |M| = 1, then we can approximate the posterior median by Q(φ j |y, d) = g m ( θm (y)) j .However, if |M| > 1 then there is no closed form for the median (or any quantile) of a mixture of normal distributions.
To overcome this problem we use simulation.We generate a sample {φ c } C c=1 from the approximate modelaverage posterior distribution of φ.We then approximate Q(φ j |y, d) by the corresponding sample median.The sample {φ c } C c=1 can be generated as follows where model m is chosen with probability π(m|y, d). 2. For c = 1, . . ., C, complete the following steps Squared error loss The squared error loss (e.g.Robert, 2007, pages 77-79) is given by Typically, unless the g m are linear functions, the posterior mean E θm|y,m,d [g m (θ m )] will not be available in closed form.To approximate this posterior mean, we use the simulation approach described above for the case of approximating the posterior median, only replacing the sample median by sample mean.

Model discrimination
Now suppose the experimental aim is model discrimination and thus inference is based on the posterior model probabilities.Consider the following loss functions.In both cases, we can derive approximations by replacing the marginal likelihood, π(y|m, d), or posterior model probability, π(m|y, d) by the corresponding approximations, π(y|m, d) or π(m|y, d), given by ( 8) and ( 9), respectively.

Logistic Regression
The setup for the following example is adapted from Overstall and Woods (2017) who correspondingly adapted it from a simpler problem studied by Woods et al. (2006) and Gotwalt et al. (2009).It concerns a first-order logistic regression model in k = 4 factors and n runs.Although from a Bayesian inference perspective, logistic regression is a relatively simple model, it (or more generally some type of binary response model) is frequently used to benchmark new computational approaches in statistics (e.g.Minka, 2001;Girolami and Calderhead, 2011;Hoffman and Gelman, 2014).Moreover, until Overstall and Woods (2017), fully Bayesian design for such a model had not previously been attempted in the literature indicating that in the context of fully Bayesian design logistic regression remains a non-trivial problem.
A binary response is measured for G blocks each of n G = 6 runs, i.e. the total number of runs is n = Gn G .Let y ij and x tij denote the experimental response and value of the tth factor for the jth run from the ith block (i = 1, . . ., G; j = 1, . . ., n G ; t = 1, . . ., 4), respectively.It is assumed that where β = (β 0 , β 1 , β 2 , β 3 , β 4 ) are the regression parameters, γ = (γ 1 , . . ., γ G ) are the block-specific parameters, v = (v 0 , . . ., v 4 ) is a binary vector with v t = 1 (v t = 0) indicating whether the tth factor is active (inactive), and • denotes element-wise multiplication.The complete p m × 1 vector of parameters is T be the n×5 model matrix where X i is the n G ×5 model matrix for the ith group with jth row given by x T ij .The design is given by d = vec(D T ) where D is the n × 4 matrix given by X with the first column of ones (corresponding to the intercept) removed.The design space D has dimensionality W = 4n and is such that each element of d lies in the interval [−1, 1].
(i) A prior point mass at γ i = 0 for all i, resulting in standard logistic regression with p m = 1 + 4 t=1 v t .(ii) A hierarchical prior distribution in which elements of γ i are independent and identically distributed as γ ti ∼ U[−ζ t , ζ t ], for t = 0, . . ., 4 with ζ t ∈ (0, Z t ) unknown and having triangular prior density Under each of these two prior distributions, we find designs for the experimental aims of parameter estimation and model discrimination.

Parameter Estimation
We set v t = 1 for all t so that there is no model uncertainty and consider generating designs under the aim of estimating φ = β.This means g m = g is a linear function given by g(θ) = Aθ, where for i) standard logistic regression, A = I 5 ; and for ii) hierarchical logistic regression, A = (I 5 , R) is a 5 × 5(1 + G) matrix where R is a 5 × 5G matrix of zeros.
We consider the self-information and squared error loss functions and compare designs found under the DLMC approximation (as found by Overstall and Woods 2017 and referred henceforth as DLMC designs) against designs found under the NBMC approximations proposed in this paper (henceforth referred to as NBMC designs).To make comparisons valid, we found the NBMC designs under exactly the same implementation of ACE as those used to find the DLMC designs (see Appendix A for details).Additionally, we also compare against pseudo-Bayesian D-and A-optimal designs (also as found by Overstall and Woods 2017).
We compare designs using relative efficiency.Let d * SI and d * SE be the DLMC designs under the SI and SE loss functions, respectively, found by Overstall and Woods (2017).The relative SI and SE efficiencies of a design d are defined as respectively, where L SI and L SE refer to the expected SI and SE loss functions, respectively.Note that the definition of relative SI efficiency, given by ( 15), follows from how the expected SI loss is non-positive.The relative efficiencies are approximated by approximating the expected losses in ( 15) and ( 16) by using the DLMC approximation.
The top row of Figure 1 shows boxplots of twenty DLMC approximations to the relative efficiency for SI (left) and SE (right) for the NBMC and pseudo-Bayesian designs plotted against n ∈ N S = {6, 8, . . ., 48} (meaning W ranges from 24 to 192) for standard logistic regression.The bottom row shows the same boxplots for hierarchical logistic regression plotted against n ∈ N H = {12, 18, . . ., 48} (meaning W ranges from 48 to 192).In both cases, the relative efficiencies of the NBMC designs are clearly very close to one for all values of n indicating that the performance of these designs (in terms of expected loss) is very close to the performance of the DMLC designs.However, the relative efficiency of the pseudo-Bayesian designs only become close to one as n gets larger.In the case of the SI loss, it appears from Figure 1 that we could obtain a negligible difference in expected SI loss for values of n over approximately forty by using the pseudo-Bayesian D-optimal design.This design is computationally more efficient to find than a fully Bayesian design, however, knowing that it had nearly equivalent performance to the fully Bayesian design would be hard without first finding the fully Bayesian design.
We now investigate the accuracy of the NBMC approximation to the expected loss.For the SI and SE loss, we randomly generate T designs and for each one calculate three different approximations to the expected loss: a DLMC approximation with B = 50000 (which we consider near exact evaluation of the expected loss), and NBMC approximations with B = 1000 and B = 20000.The tth design for t = 1, . . ., T , is generated by perturbing the DLMC designs under each of the loss functions as follows where, at each iteration, u t ∼ U(0, 1 2 ), and d (t) is a design in which each element is generated from U(−1, 1).The top row of Figure 2 shows plots of the two NBMC approximations to the expected loss plotted against the DLMC approximation to the expected loss for standard logistic regression under the SI (left) and SE (right) loss function for n = 6 (i.e. the smallest number of runs considered).The bottom row of Figure 2 shows the same plots for hierarchical logistic regression with n = 12 (again the smallest number of runs considered).In all four cases, although the NBMC approximation to the expected loss can be inaccurate, especially for designs close to the minimum, the ordering of designs in terms of expected loss is the same as for the DLMC approximation.This means the NBMC approximation to the expected loss is minimised for design close to the design that minimises the DLMC approximation.
Figure 3 shows plots of the mean computer time required to compute the three types of design (NBMC, DLMC and pseudo-Bayesian) for the models and loss functions considered.Note that the algorithm was run on the IRIDIS 4 supercomputer facility at the University of Southampton which has 2.6Ghz processors with 4Gb memory.Note that the MBMC designs are typically found in a third of the computing time required to find the DLMC designs.The pseudo-Bayesian designs require the smallest amount of computing time but this should be judged in parallel with the lack of efficiency these designs exhibit when compared to the corresponding fully Bayesian design (see Figure 1).-2.0 DLMC approximation to expected SI loss (B = 50,000) NBMC approximation to expected SI loss

Standard logistic regression
Hierarchical logistic regression
We find designs for both standard and hierarchical logistic regression for each value of n (N S for standard, and N H for hierarchical), under both the 0-1 and model self-information loss functions, respectively.Bayesian optimal design on such a scale for the experimental aim of model discrimination has not been addressed previously in the literature.It would be infeasible to find DLMC designs in this situation since it would require |M| = 16 Monte Carlo approximations to the marginal likelihood for every b = 1, . . ., B. Conversely, the normal-based approximations will require |M| = 16 Laplace approximations which will be computationally less intensive.
However, although it is infeasible to use the DLMC approximation to find designs, we can use it to assess the NBMC designs.Figure 4 shows boxplots of twenty DLMC approximations to the expected 0-1 and model self-information loss functions for both standard and hierarchical logistic regression against n for the NBMC designs.Note that the expected loss for the hierarchical model is always greater than for the standard model and that this difference increases as n increases.This is due to the extra uncertainty introduced by the blocks and their associated block-specific parameters (whose number is proportional to n).
Similar to Section 3.2.1 we check the validity of the approximation by plotting NBMC approximations (with B = 1, 000 and B = 20, 000) against the DLMC approximation to the expected loss with B = 50, 000.DLMC approximation to expected 0-1 loss (B=50,000) NBMC approximation to expected 0-1 loss

(d) Hierarchical logistic regression -MSI loss
DLMC approximation to expected MSI loss (B=50,000) NBMC approximation to expected MSI loss B = 1,000 B = 50,000 Figure 5 shows the resulting plots for n = 6 (standard logistic regression) and n = 12 (hierarchical logistic regression).In this case, we can see that the NBMC approximations to the expected loss appear very accurate for both loss functions.

Mechanistic Modelling of Chemical Reactions
In this section we consider the famous example from Box and Hill (1967) concerning discriminating between non-linear models for describing chemical reactions.Suppose the ith run consists of specifying temperature x i1 ∈ (0, 150) and reaction time x i2 ∈ (450, 600), i.e. d i = (x i1 , x i2 ), and measuring the yield y i from a chemical reaction, for i = 1, . . ., n.It is assumed that independently for i = 1, . . ., n where m ∈ M. Consider the set M = {1, 2, 3, 4} of |M| = 4 competing models for η m (θ; d i ) given as follows where the unknown parameters θ m = (θ m1 , θ m2 ) have the same interpretation under each model.Following Box and Hill (1967), a-priori we assume that θ m1 ∼ N(400, 25 2 ) and θ m2 ∼ N(5000, 250 2 ), independently, and that π(m) = 1/4.Box and Hill (1967) assumed that the response variance was fixed as σ 2 = 0.1 2 .However, we let σ 2 be unknown and assume that σ 2 ∼ U[0, 1].We consider n = {5, 10, 15, . . ., 50} meaning that W ranges from 10 to 100.To demonstrate the versatility of the normal-based approximations presented in this paper, for each value of n, we find NBMC designs under a total of eight different loss functions.We consider the three exemplar loss functions (Section 3.1.1)for parameter estimation for a) φ = g m (θ m ) = θ m ; and b) φ = g m (θ m ) = θ m2 /θ m1 , where in the latter case we are interested in the ratio of the two unknown parameters.In both cases we are taking account of model uncertainty by considering the model-averaged posterior distribution of φ.We also consider the two exemplar loss functions for model discrimination (Section 3.1.2).
It is not clear how to find DLMC approximations to the expected loss for every loss function considered in this section.For example, a DLMC approximation to the model-averaged posterior median required for the AE loss would require samples from the posterior distribution of θ m for each m ∈ M and y b , for b = 1, . . ., B. Therefore in this section we rely on the NBMC approximations to assess the performance of designs.Figure 6 shows twenty boxplots of the NBMC approximation (B = 20, 000) to the expected loss for the NBMC designs plotted against n for each of the loss functions.Note how the expected loss functions for φ = θ m have a faster relative decrease in expected loss with increasing n than the expected loss for φ = θ m2 /θ m1 .This is easiest to see for the SE in loss in Figure 6(b) and 6(e), where the expected SE loss for φ = θ m has a relative decrease of approximately 35% when n increase from 5 to 50.The corresponding relative decrease in expected SE loss for φ = θ m2 /θ m1 is approximately 15%.This is due to the form of the parameterisation of interest g m (θ m ).When φ = θ m , the trace of the prior variance of φ is 63,125 which gives an upper bound on the expected loss.The corresponding value for φ = θ m2 /θ m1 is approximately 1.Under the latter parameterisation, it appears the choice of design does not have as great an impact on the expected SE loss than the former.

Conclusion
In this paper, we have proposed the use of normal-based approximations to posterior summaries to aid in the approximation of the loss function.The resulting approximate loss can be used in conjunction with any algorithm suitable for finding the design that minimises the expected loss function.The methodology was used in conjunction with the ACE algorithm to find designs which have similar performance (in terms of expected loss) to designs generated by the original ACE algorithm with the DLMC approximation to the expected loss, but using a fraction of the computing time.The methodology was also able to find designs for problems, under model uncertainty, where use of the DLMC approach would be infeasible and, as such, have not previously been addressed in the literature.
The normal-based approximations utilised in this paper are formed by using the location of, and curvature around, the posterior mode.Taking advantage of alternative normal-based of other deterministic approximations may be of future interest.In particular, one could consider using the expectation propagation (EP) algorithm of Minka (2001) to form efficient approximations to the posterior distribution of the parameters.Similar to mode/curvature based approximations used in this paper, EP also provides an efficient approximate of the marginal likelihood.EP could potentially be useful in a wider range of problems as any distribution in the exponential family could be considered as the parametric form for approximating the posterior distribution.
) be the function given by the expected loss function which only varies over the design space, D i , for the ith element.
(b) For j = 1, . . ., Q, evaluate the MC approximation to the expected loss given by

A.2 Comparison procedure
In step 2d, we accept the proposed design, d * with probability p * .The proposed design originates from from the Gaussian process emulator.Similar to all statistical models, Gaussian process emulators, can fit inadequately.To mitigate the effects of an inadequate emulator, Overstall and Woods (2017)

A.3 Implementation details
To reduce the likelihood of the ACE algorithm converging to local optima, it is restarted from E different starting designs.These E repetitions of the ACE algorithm aree run in an embarrassingly parallel fashion.Note that there is further scope for parallelising the ACE algorithm.The Q evaluations of the Monte Carlo approximation to the expected loss in step 2b could be parallelised.Furthermore, the calculation of the posterior mode for b = 1, . . ., B could also be parallelised.These have not been pursued here through.Even by repeating the ACE algorithm E times does not guarantee that it will converge to the true optimal design.Following Overstall and Woods (2017) we set E = 20, Q = 20, and B = 1, 000, except for in the comparison procedure where B = 20, 000.These values were found by Overstall and Woods (2017) to perform well for a variety of examples.Additionally, Müller and Parmigiani (1995) also found that B could be similarly small when using their smoothing-based approach.Note that in their use of DLMC approximation to the expected loss, Overstall and Woods (2017) used B = B for the Monte Carlo sample size in the inner loop of the DLMC approximation.Again following Overstall and Woods (2017), we fit the Gaussian process model using a squared exponential correlation function.
The ACE algorithm is implemented in the R package acebayes (Overstall and Woods, 2016) available from the Comprehensive R Archive Network.
The 0-1 loss (e.g.Robert, 2007, pages 80-81)  is given by λ 01 (m, y, d) = 1 − I(m = M (m|y, d)), where I(A) denotes the indicator of event A and M (m|y, d)) = arg max m∈M π(y|m, d)π(m) is the posterior modal model.The design that minimises the expected 0-1 loss equivalently maximises the expected posterior model probability of the modal model.Model self-information loss The model self-information loss (e.g McGree, 2017) is derived by extending the self-information loss for parameters to the posterior model probabilities.It is given by λ M SI (m, y, d) = log π(m) − log π(m|y, d).

Figure 2 :
Figure 2: Plots of NBMC approximations (with B=1,000 and B = 20, 000) to the expected loss plotted against the DLMC approximation (B = 50, 000) to the expected loss for standard ((a) and (b) for n = 6) and hierarchical ((c) and (d) for n = 12) logistic regression under the self-information ((a) and (c)) and squared error ((b) and (d)) loss function.A line through the origin with slope one has been added to aid in the comparison.

Figure 3 :
Figure 3: Plots of mean computer time required to find the NBMC, DLMC and pseudo-Bayesian designs for standard ((a) and (b)) and hierarchical ((c) and (d)) logistic regression under the SI ((a) and (c)) and SE ((b) and (d)) loss function.

Figure 4 :
Figure 4: Plots of the DLMC approximation (B = 20, 000) to the expected 0-1 (a) and model self-information (b) loss function for standard and hierarchical logistic regression against n.

Figure 6 :
Figure6: Boxplots of twenty NBMC approximations (B = 20, 000) to the expected loss for the NBMC designs plotted against n under the each of the loss functions for the non-linear model example.
Fit a GP emulator to {z j , d j } Q j=1 and set Li (d) to be the resulting predictive mean.(c) Find d * i = argmin d∈Di Li (d), and let d * = d C 1 , . . ., d C i−1 , d * , d C i+1 , . . . . . ., d C W be the proposed design.(d) Set d C = d * with probability p * .3. Return to step 2.

∼∼
proposed a comparison between the proposed design d * and the current design d C .Note that the proposed design d * should be accepted ifE θm,m,y|d * [λ(θ m , m, y, d * )] < E θm,m,y|d C λ(θ m , m, y, d C ) (17) For b = 1, . . ., B we generate samples λ b * θ * b m , m * b , y * b , d * ), λ b C = λ(θ b m , m b , y b , d C ),where θ * b m , m * b , y * b the joint distribution of θ m , m and y conditional on d * and d C , respectively.We use these samples to perform a Bayesian hypothesis test of (17).The form of the Bayesian hypothesis test, as described inOverstall and Woods (2017), assumes that the λ b * 's and λ b C 's are continuous and their distribution reasonably assumed normal.In this case, the probability of accepting the proposed design isp * = 1 − F − B b=1 λ b C − B b=1 λ b * √ 2Bv ,where F (•) is the distribution function of the t-distribution with 2B − 2 degrees of freedom, λ * are the sample means of the λ b C 's and λ b * 's, respectively.The assumption of normality will clearly be violated for the 0-1 loss function for model discrimination, described in Section 3.1, where the λ b * 's and λ b C 's will be binary in the set {0, 1}.For such loss functions, we introduce the following modification.Assume that Bernoulli (ρ * ) , for b = 1, . . ., B. We also assume the following independent prior distributions:ρ C ∼ U[0, 1] and ρ * ∼ U[0, 1].The resulting posterior distributions areρ C |λ 1 C , . . ., λ B C Beta 1 + B λC , 1 + B − B λC , ρ * |λ 1 * , . . ., λ B * ∼ Beta 1 + B λ * , 1 + B − B λ * .The probability of accepting the new design is then given byp * = P ρ * < ρ C |λ 1 C , . . ., λ B C , λ 1 * , . . .,λ B * which is evaluated via simulation as follows B λ * , 1 + B − B λ * , where F (•; a, b) denotes the distribution function of the Beta(a, b) and ρ b C B b=1 is a sample from Beta 1 + B λC , 1 + B − B λC .
Consider the case where |M | = 1 and g(θ) = θ.The approximated SI loss is then λSI (θ, y, d) = log π(θ)The corresponding expected approximate SI loss is given byLSI (d) = E θ E y|θ,d λSI (θ, y, d) .(18)Assume that the prior distribution for θ is sufficiently diffuse so that Σθ = I(θ; d) −1 and the posterior mode is approximately equal to the maximum likelihood estimator (MLE).Using the delta-method and the approximate distribution of the MLE, the following approximation for LSI (d) can be derivedLSI (d) = E θ p 2 log(2π) + p 2 + log π(θ) − 1 2 log |I(θ; d)| .This is proportional to the objective function for pseudo-Bayesian D-optimality.Similarly, an approximation to the expected squared error loss is given by LSI (d) = E θ tr I(θ; d) −1 , the objective function for pseudo-Bayesian A-optimality.
laborations initiated at the "Bayesian Optimal Design of Experiments" workshop (Queensland University of Technology, Brisbane, Australia, December 2015; http://www.bode2015.wordpress.com).A.M. Overstall was supported by a Research Incentive Grant (70424) from the Carnegie Trust for the Universities of Scotland.C.C. Drovandi was supported by an Australian Research Council's Discovery Early Career Researcher Award funding scheme (DE160100741).