# An approach for finding fully Bayesian optimal designs using normal-based approximations to loss functions

- 1.8k Downloads
- 4 Citations

## Abstract

The generation of decision-theoretic Bayesian optimal designs is complicated by the significant computational challenge of minimising an analytically intractable expected loss function over a, potentially, high-dimensional design space. A new general approach for approximately finding Bayesian optimal designs is proposed which uses computationally efficient normal-based approximations to posterior summaries to aid in approximating the expected loss. This new approach is demonstrated on illustrative, yet challenging, examples including hierarchical models for blocked experiments, and experimental aims of parameter estimation and model discrimination. Where possible, the results of the proposed methodology are compared, both in terms of performance and computing time, to results from using computationally more expensive, but potentially more accurate, Monte Carlo approximations. Moreover, the methodology is also applied to problems where the use of Monte Carlo approximations is computationally infeasible.

## Keywords

Loss function Model discrimination Bayesian optimal design Parameter estimation Hierarchical model## 1 Introduction

The process of designing a physical experiment fits naturally within the Bayesian approach to statistical inference. Prior information on parameters and models can be represented by prior distributions, and the experimental aim encapsulated in a decision-theoretic framework by the loss function. A Bayesian optimal design is found by minimising the expected loss function over the space of all possible designs, i.e. the design space, where the expectation is with respect to the joint distribution of all unknown quantities, i.e. parameters, models and experimental responses.

*n*runs where the

*i*th run (for \(i=1,\dots ,n\)) involves measuring response \(y_i\) having specified settings \(\mathbf {d}_i = \left( d_{i1},\dots ,d_{ik}\right) \) for the

*k*controllable factors. Let \(\mathbf {d} = \left( \mathbf {d}_1,\dots ,\mathbf {d}_n\right) \in \mathcal {D}\) be the \(W \times 1\) vector giving the design where \(W=nk\) and \(\mathcal {D} \subset \mathbb {R}^W\) denotes the

*W*-dimensional design space. Assume there is a set, \(\mathcal {M}\), of competing statistical models. Model \(m \in \mathcal {M}\) posits a probability distribution, \(\mathrm {F}_m\), for \(\mathbf {y}=\left( y_1,\dots ,y_n\right) \) which is completely specified up to an unknown \(p_m \times 1\) vector of parameters \(\varvec{\theta }_m\). Bayesian inference on models and/or parameters is based on their joint posterior distribution given by Bayes’ theorem as

*m*. Note how \(\pi \left( \mathbf {y}|\varvec{\theta }_m,m,\mathbf {d}\right) \) depends on the design and this induces a dependence of the posterior on the design.

*m*(conditional on \(\mathbf {y}\) and \(\mathbf {d}\)) to the true values of \(\varvec{\theta }_m\) and

*m*. However, \(\varvec{\theta }_m\),

*m*and \(\mathbf {y}\) are unknown so a Bayesian optimal design (e.g. Chaloner and Verdinelli 1995) is found by minimising the expected loss

*m*and \(\mathbf {y}\).

Robert (2007, Section 2.1) discusses how Bayesian inference relies on the specification of three components: (a) the set of models, \(\mathcal {M}\); (b) the joint prior distribution, given by \(\pi (\varvec{\theta }_m,m)\); and (c) the loss function, \(\lambda (\varvec{\theta }_m,m,\mathbf {y},\mathbf {d})\). Notwithstanding the difficulties faced in specifying these three components (e.g. O’Hagan and Forster 2004, Chapter 6), once in place, a Bayesian optimal design is conceptually straightforward to define. However, finding such a design in practice is hindered by the computational challenge of minimising the expected loss. Typically, the expected loss is given by a multi-dimensional analytically intractable integral, i.e. as given by (2). In the two decades since the seminal review of Bayesian design by Chaloner and Verdinelli (1995), there have been few general-purpose approaches to finding such designs, as recently highlighted by Ryan et al. (2016) and Woods et al. (2017). Note that being able to find the exact optimal design for an arbitrary problem is, at present, an unrealistic goal. Instead, the aim is to find a design “close” to the optimal design, termed a near-optimal design by Hamada et al. (2001).

*m*, \(\mathbf {y}\) and design \(\mathbf {d}\) given by

*m*, \(\mathbf {y}\) and \(\mathbf {d}\) and to use this sample to estimate the marginal mode of \(\mathbf {d}\). This approach has been further modified by Müller et al. (2004) and Amzal et al. (2006). However, due to the difficulty in implementing efficient sampling methods, the Müller algorithm is difficult to implement for high-dimensional design spaces with the limit typically considered to be just \(W=4\) (e.g. Ryan et al. 2016).

*m*and \(\mathbf {y}\). The stochastic nature of Monte Carlo approximation means that \(\hat{L}(\mathbf {d})\) is not a smooth function and makes application of standard optimisation methods (including heuristic methods) difficult. Instead, Müller and Parmigiani (1995) proposed an approach whereby \(\hat{L}(\mathbf {d})\) is evaluated at a series of designs. A statistical model (or smoother) is then fitted which builds a relationship between design and expected loss allowing prediction of the expected loss for any design. This prediction is minimised in place of the true expected loss to give an approximation to the Bayesian optimal design. Similar to the simulation-based Müller algorithm, the scalability of this approach to higher dimensional design spaces remains an issue. The chosen smoother needs to balance the increased flexibility required for adequate predictive accuracy with the computational effort of the increased number of evaluations of \(\hat{L}(\mathbf {d})\) required. Müller and Parmigiani (1995) employed local regression models and considered design spaces up to \(W=2\). More recently, Weaver et al. (2016) used a Gaussian process model (e.g. Santner et al. 2003) and considered an application with \(W=3\) and Jones et al. (2016) used Bayes linear analysis (Goldstein and Wooff 2007) and considered up to \(W=9\). To increase the applicability to higher dimensional design spaces, Overstall and Woods (2017) proposed the approximate coordinate exchange (ACE) algorithm whereby a cyclic descent algorithm (commonly called coordinate exchange in the design of experiments literature, Meyer and Nachtsheim 1995) is used to minimise the expected loss. Very briefly, a Gaussian process prediction of the expected loss is sequentially minimised over each one-dimensional element of the design space. This can be seen as a generalisation of the approaches of Müller and Parmigiani (1995) and Weaver et al. (2016) to higher dimensional design spaces via the use of coordinate exchange, and thus allowed consideration of examples with design spaces of dimensionality nearly two orders of magnitude greater than previously addressed in the literature.

Both the Müller algorithm and the smoothing-based approaches require evaluations of the loss function \(\lambda (\varvec{\theta }_m,m,\mathbf {y},\mathbf {d})\) either in the evaluation of \(h(\varvec{\theta }_m,m,\mathbf {y},\mathbf {d})\) or \(\hat{L}(\mathbf {d})\), respectively. In both cases, a large number of evaluations of the loss function will be required to find a design. Since the loss function depends on \(\mathbf {y}\) through the posterior distribution of \(\varvec{\theta }_m\) and *m*, and given that this distribution will typically be analytically intractable, a further approximation is required. The most obvious approach to this problem is to use an additional Monte Carlo approximation, the exact nature of which depends on the chosen loss function. In the case of the Monte Carlo approximation to the expected loss, \(\hat{L}(\mathbf {d})\), given by (4), this will result in a nested or double loop Monte Carlo (DLMC) approximation to the expected loss where the inner loop refers to the approximation to the loss function and the outer loop to the approximation to the expected loss. Let \(\tilde{B}\) denote the Monte Carlo sample size in the inner loop with *B* being the corresponding value in the outer loop. For typical loss functions, DLMC can induce a bias of order \(\tilde{B}^{-1}\) in the approximation (e.g. Ryan 2003; Rainforth et al. 2016) to the expected loss. Moreover, the computational complexity of this approach is typically in the order of \(B \times \tilde{B}\) evaluations of \(\pi (\mathbf {y}|\varvec{\theta }_m,m,\mathbf {d})\) for each \(m \in \mathcal {M}\). Since *B* and \(\tilde{B}\) will typically be \(\mathcal {O}(10^3)\) or higher, this will be a computationally expensive approach. The result of which is that, even by using the ACE algorithm, finding Bayesian optimal designs with the DLMC approximation to the expected loss has been confined to simple problems where the number of models under consideration is \(|M|=1\) and inference has been focused on parameter estimation.

In this paper, we consider using normal-based approximations to the posterior distribution of \(\varvec{\theta }_m\), centred around the posterior mode of \(\varvec{\theta }_m\), and how these can be used to approximate, in principle, any loss function. The normal-based Laplace approximation has previously been used to approximate the commonly used self-information loss function (see Sect. 3.1) for a nonlinear model by Long et al. (2013) for a low-dimensional design space and under no model uncertainty. However, application to other loss functions has not previously been considered. We apply the new methodology to illustrative examples which are challenging in the context of Bayesian optimal design. In cases where the computationally more expensive DLMC approach is feasible, we show, empirically, that the difference in performance (measured in terms of expected loss) between designs found under the two different approximations is negligible. We also apply the proposed approach to problems where use of the DLMC approximation to find a design would be computationally infeasible.

## 2 Methodology

### 2.1 Normal-based approximations to posterior quantities

As discussed in Sect. 1, a typical loss function compares a summary of the joint posterior distribution of \(\varvec{\theta }_m\) and *m* to the “true” values of \(\varvec{\theta }_m\) and *m* in a way that is relevant to the experimental aim. However, the joint posterior distribution is usually not available in closed form. The methodology in this paper is based on forming an approximation to this joint distribution using normal-based approximations and therefore providing an alternative approximation to the loss function than the Monte Carlo approximation. The normal-based approximation to the loss function, denoted by \(\tilde{\lambda }(\varvec{\theta }_m,m,\mathbf {y},\mathbf {d})\), can then be substituted into the Monte Carlo approximation to the expected loss, given by (4), and we refer to such an approximation as normal-based Monte Carlo (NBMC). We can then use the ACE algorithm or one of the other smoothing-based approaches discussed in Sect. 1. Conversely, we may also substitute \(\tilde{\lambda }(\varvec{\theta }_m,m,\mathbf {y},\mathbf {d})\) into the density of the joint distribution over \(\varvec{\theta }_m\), *m*, \(\mathbf {y}\) and \(\mathbf {d}\), given by (3), and apply the Müller algorithm.

*m*, given by (1), can be decomposed as follows (e.g. O’Hagan and Forster 2004, page 169)

*m*), by a normal distribution with mean \(\hat{\varvec{\theta }}_m(\mathbf {y})\) and variance \(\hat{\varSigma }_m(\mathbf {y})\), i.e.

*m*is approximated by the posterior mode. These approximations to posterior summaries of interest can then be used to approximate many loss functions of practical interest (see Sect. 3.1 for examples).

Since the posterior distribution of \(\varvec{\theta }_m\) converges to a normal distribution as \(n \rightarrow \infty \) (e.g. Gelman et al. 2014, pp. 585–588) the approximation will be more accurate for large *n*. The approximation can be expected to be poor when the true posterior distribution of \(\varvec{\theta }_m\) is multi-modal or has significant skewness. In the latter case, the approximation could be improved by a reparameterisation (e.g. Achcar and Smith 1990).

### 2.2 Finding the posterior mode and Fisher information

*m*and \(\mathbf {y}\), to evaluate the NBMC approximation to the expected loss. To do this, we use a scoring algorithm (e.g. Lange 2013, pgs 254–257), i.e. Newton’s method where evaluation of the Hessian matrix of the log posterior density is replaced by evaluation of \(-\mathbf {H}(\varvec{\theta }_m;m)\). Specifically, let

The scoring algorithm requires the repeated inversion of the \(p_m \times p_m\) matrix \(\mathbf {H}(\varvec{\theta }_m;m)\). Often experiments are conducted in blocks where a block consists of homogenous experimental units. A suitable model (e.g. Pawitan 2013, Chapter 17) in this case is a hierarchical (or mixed model) where the effect of the block is accounted for by using block-specific parameters (sometimes referred to as random effects) for each block, which are assumed to be independent having a common prior distribution. A consequence is that the number of parameters, \(p_m\), for these types of models is proportional to the number of blocks and therefore can be large. However, due to the conditional independence structure exhibited by hierarchical models, \(\mathbf {H}(\varvec{\theta }_m;m)\) will be sparse leading to computationally efficient methods for finding the inverse.

## 3 Examples

In this section, we begin by discussing a range of exemplar loss functions and how they can be approximated using the approach described in Sect. 2. This selection of loss functions is not exhaustive but instead demonstrate how typical loss functions may be approximated by using the approximations outlined in Sect. 2. We then apply the proposed methodology to find designs for experiments involving standard and hierarchical logistic regression (Sect. 3.2) and a nonlinear model (Sect. 3.3), for experimental aims of parameter estimation and model discrimination. In the examples, we use the ACE algorithm (briefly described in Sect. 1) to find the optimal design since this is the only method in the literature suitable for finding Bayesian designs for the dimensionality of design space considered. However, the approximations to the loss function could be applied with any method such as the Müller algorithm or another smoothing-based method. A more detailed description of the ACE algorithm is provided in “Appendix 1” with a description of the choice of tuning parameters.

All designs found in Sects. 3.2 and 3.3 can be reproduced via the R (R Core Team 2016) package NBdesigns (Overstall et al. 2017a). This is available as a Supplementary Material to this paper. This package allows users to compare future computational methodology to the approaches described in this paper via the benchmark examples considered.

### 3.1 Loss functions

The loss functions considered in this section can be categorised into those for (a) parameter estimation (Sect. 3.1.1); and (b) model discrimination (Sect. 3.1.2).

#### 3.1.1 Parameter estimation

*Self-information loss*

*m*) after taking a first-order Taylor series expansion of \(\varvec{\phi } = g_m(\varvec{\theta }_m)\) about \(\hat{\varvec{\theta }}_m(\mathbf {y})\) (e.g. Khuri 2003, Chapter 4). This leads to the following approximation to \(\pi _{\phi }(\varvec{\phi }|\mathbf {y},\mathbf {d})\)

It is well known (e.g. Chaloner and Verdinelli 1995) that in cases where there is no model uncertainty, \(|\mathcal {M}|=1\), \(g(\varvec{\theta }) = \varvec{\theta }\), and under a normal approximation to the posterior distribution the expected SI loss is equal to the objective function that defines pseudo-Bayesian D-optimality (a commonly used criterion in classical optimal design of experiments). In “Appendix 2”, we discuss the relationship between the normal-based approximation to the SI loss above and the pseudo-Bayesian D-optimal approximation. It is shown that the objective function for pseudo-Bayesian D-optimality is itself an approximation to the expectation of the normal-based approximation to the SI loss. This places the normal-based approximations as being a compromise between the computationally expensive DLMC approximation and the computationally cheap pseudo-Bayesian D-optimal approximation. Furthermore, in Sect. 3.2, we empirically compare the two approximations.

*Absolute error loss*The absolute error (AE) loss (e.g. Robert 2007, pp. 79–80) is given by

- 1.
Generate \(\left\{ m^c\right\} _{c=1}^C\) where model

*m*is chosen with probability \(\tilde{\pi }(m|\mathbf {y},\mathbf {d})\). - 2.For \(c=1,\dots ,C\), complete the following steps
- (a)
Generate \(\varvec{\theta }_{m^c}^c\) from \(\mathrm {N}\left( \hat{\varvec{\theta }}_m^c(\mathbf {y}),\hat{\varSigma }_{m^c}\right) \).

- (b)
Set \(\varvec{\phi }^c = g_{m^c}(\varvec{\theta }_{m^c}^c)\).

- (a)

*Squared error loss*The squared error loss (e.g. Robert 2007, pp. 77–79) is given by

#### 3.1.2 Model discrimination

Now, suppose the experimental aim is model discrimination, and thus inference is based on the posterior model probabilities. Consider the following loss functions. In both cases, we can derive approximations by replacing the marginal likelihood, \(\pi (\mathbf {y}|m,\mathbf {d})\), or posterior model probability, \(\pi (m|\mathbf {y},\mathbf {d})\) by the corresponding approximations, \(\tilde{\pi }(\mathbf {y}|m,\mathbf {d})\) or \(\tilde{\pi }(m|\mathbf {y},\mathbf {d})\), given by (8) and (9), respectively.

##### 3.1.2.1 0–1 loss

*I*(

*A*) denotes the indicator of event

*A*and \(M(m|\mathbf {y},\mathbf {d})) = \mathrm {arg} \max _{m \in \mathcal {M}} \pi (\mathbf {y}|m,\mathbf {d})\pi (m)\) is the posterior modal model. The design that minimises the expected 0–1 loss equivalently maximises the expected posterior model probability of the modal model.

*Model self-information loss*

### 3.2 Logistic regression

The setup for the following example is adapted from Overstall and Woods (2017) who correspondingly adapted it from a simpler problem studied by Woods et al. (2006) and Gotwalt et al. (2009). It concerns a first-order logistic regression model in \(k=4\) factors and *n* runs. Although from a Bayesian inference perspective, logistic regression is a relatively simple model, it (or more generally some type of binary response model) is frequently used to benchmark new computational approaches in statistics (e.g. Minka 2001; Girolami and Calderhead 2011; Hoffman and Gelman 2014). Moreover, until Overstall and Woods (2017), fully Bayesian design for such a model had not previously been attempted in the literature indicating that in the context of fully Bayesian design logistic regression remains a non-trivial problem.

*G*blocks each of \(n_G=6\) runs, i.e. the total number of runs is \(n=Gn_G\). Let \(y_{ij}\) and \(x_{tij}\) denote the experimental response and value of the

*t*th factor for the

*j*th run from the

*i*th block (\(i=1,\dots ,G\); \(j=1,\dots ,n_G\); \(t=1,\dots ,4\)), respectively. It is assumed that \(y_{ij} \sim \mathrm {Bernoulli}\left( \rho _{ij}\right) \), independently, where

*t*th factor is active (inactive), and \(\circ \) denotes element-wise multiplication. The complete \(p_m \times 1 \) vector of parameters is \(\varvec{\theta }_m = \left( \varvec{\beta }_m,\varvec{\gamma }_m\right) \). Each model \(m \in \mathcal {M}\) is determined by the \(|\mathcal {M}| = 2^4 = 16\) combinations of \(\mathbf {v}\).

*i*th group with

*j*th row given by \(\mathbf {x}_{ij}^T\). The design is given by \(\mathbf {d} = \mathrm {vec}(\mathbf {D}^T)\) where \(\mathbf {D}\) is the \(n \times 4\) matrix given by \(\mathbf {X}\) with the first column of ones (corresponding to the intercept) removed. The design space \(\mathcal {D}\) has dimensionality \(W=4n\) and is such that each element of \(\mathbf {d}\) lies in the interval \([-1,1]\).

- (i)
A prior point mass at \(\varvec{\gamma }_i = \mathbf {0}\) for all

*i*, resulting in standard logistic regression with \(p_m = 1 + \sum _{t=1}^4 v_t\). - (ii)
A hierarchical prior distribution in which elements of \(\varvec{\gamma }_i\) are independent and identically distributed as \(\gamma _{ti} \sim \mathrm {U}[-\zeta _t,\zeta _t]\), for \(t=0,\dots ,4\) with \(\zeta _t \in (0,Z_t)\) unknown and having triangular prior density \(\pi (\zeta _t) = 2(Z_t-\zeta _t)/Z_t^2\) with \(\left( Z_0,Z_1,\dots ,Z_4\right) = \left( 3,3,3,1,1\right) \). The numbers of parameters is \(p_m = 1 + G +(1+G)\sum _{t=1}^4 v_t\).

#### 3.2.1 Parameter estimation

We set \(v_t = 1\) for all *t* so that there is no model uncertainty and consider generating designs under the aim of estimating \(\varvec{\phi } = \varvec{\beta }\). This means \(g_m = g\) is a linear function given by \(g(\varvec{\theta }) = A \varvec{\theta }\), where for (1) standard logistic regression, \(A = I_5\); and for (2) hierarchical logistic regression, \(A = \left( I_5,R\right) \) is a \(5 \times 5(1+G)\) matrix where *R* is a \(5 \times 5G\) matrix of zeros.

We consider the self-information and squared error loss functions and compare designs found under the DLMC approximation (as found by Overstall and Woods 2017 and referred henceforth as DLMC designs) against designs found under the NBMC approximations proposed in this paper (henceforth referred to as NBMC designs). To make comparisons valid, we found the NBMC designs under exactly the same implementation of ACE as those used to find the DLMC designs (see “Appendix 1” for details). Additionally, we also compare against pseudo-Bayesian D- and A-optimal designs (also as found by Overstall and Woods 2017).

The top row of Fig. 1 shows boxplots of twenty DLMC approximations to the relative efficiency for SI (left) and SE (right) for the NBMC and pseudo-Bayesian designs plotted against \(n \in N_S = \left\{ 6,8,\dots ,48\right\} \) (meaning *W* ranges from 24 to 192) for standard logistic regression. The bottom row shows the same boxplots for hierarchical logistic regression plotted against \(n \in N_H = \left\{ 12,18, \dots ,48\right\} \) (meaning *W* ranges from 48 to 192). In both cases, the relative efficiencies of the NBMC designs are clearly very close to one for all values of *n* indicating that the performance of these designs (in terms of expected loss) is very close to the performance of the DMLC designs. However, the relative efficiency of the pseudo-Bayesian designs only become close to one as *n* gets larger. In the case of the SI loss, it appears from Fig. 1 that we could obtain a negligible difference in expected SI loss for values of *n* over approximately forty by using the pseudo-Bayesian D-optimal design. This design is computationally more efficient to find than a fully Bayesian design, however, knowing that it had nearly equivalent performance to the fully Bayesian design would be hard without first finding the fully Bayesian design.

*T*designs and for each one calculate three different approximations to the expected loss: a DLMC approximation with \(B=\) 50,000 (which we consider near exact evaluation of the expected loss), and NBMC approximations with \(B=1000\) and \(B=20,000\). The

*t*th design for \(t=1,\dots ,T\), is generated by perturbing the DLMC designs under each of the loss functions as follows

Figure 3 shows plots of the mean computer time required to compute the three types of design (NBMC, DLMC and pseudo-Bayesian) for the models and loss functions considered. Note that the algorithm was run on the IRIDIS 4 supercomputer facility at the University of Southampton which has 2.6 Ghz processors with 4Gb memory. Note that the MBMC designs are typically found in a third of the computing time required to find the DLMC designs. The pseudo-Bayesian designs require the smallest amount of computing time but this should be judged in parallel with the lack of efficiency these designs exhibit when compared to the corresponding fully Bayesian design (see Fig. 1).

#### 3.2.2 Model discrimination

*m*. The total number of models is \(|\mathcal {M}| = 2^4 = 16\) ranging from \((v_1,\dots ,v_4) = (0,0,0,0)\) (intercept only; all factors inactive) to \((v_1,\dots ,v_4) = (1,1,1,1)\) (full model; all factors active). The prior model probabilities are such that \(\mathrm {P}(v_t = 1)\sim \mathrm {U}[0,1]\) independently, for each \(t=1,\dots ,4\). For model

*m*, let \((v_{m1},\dots ,v_{m4})\) denote the values of \(v_t\) where the number of regression parameters is given by \(b_m = 1 + \sum _{t=1}^4 v_{mt}\). The marginal prior model probabilities are then given by

We find designs for both standard and hierarchical logistic regression for each value of *n* (\(N_S\) for standard, and \(N_H\) for hierarchical), under both the 0–1 and model self-information loss functions, respectively. Bayesian optimal design on such a scale for the experimental aim of model discrimination has not been addressed previously in the literature. It would be infeasible to find DLMC designs in this situation since it would require \(|\mathcal {M}| = 16\) Monte Carlo approximations to the marginal likelihood for every \(b=1,\dots ,B\). Conversely, the normal-based approximations will require \(|\mathcal {M}| = 16\) Laplace approximations which will be computationally less intensive.

However, although it is infeasible to use the DLMC approximation to find designs, we can use it to assess the NBMC designs. Figure 4 shows boxplots of twenty DLMC approximations to the expected 0–1 and model self-information loss functions for both standard and hierarchical logistic regression against *n* for the NBMC designs. Note that the expected loss for the hierarchical model is always greater than for the standard model and that this difference increases as *n* increases. This is due to the extra uncertainty introduced by the blocks and their associated block-specific parameters (whose number is proportional to *n*).

Similar to Sect. 3.2.1, we check the validity of the approximation by plotting NBMC approximations (with \(B=1000\) and \(B=\) 20,000) against the DLMC approximation to the expected loss with \(B=\) 50,000. Figure 5 shows the resulting plots for \(n=6\) (standard logistic regression) and \(n=12\) (hierarchical logistic regression). In this case, we can see that the NBMC approximations to the expected loss appear very accurate for both loss functions.

### 3.3 Mechanistic modelling of chemical reactions

*i*th run consists of specifying temperature \(x_{i1} \in (0,150)\) and reaction time \(x_{i2} \in (450,600)\), i.e. \(\mathbf {d}_i = (x_{i1},x_{i2})\), and measuring the yield \(y_i\) from a chemical reaction, for \(i=1,\dots ,n\). It is assumed that

We consider \(n = \left\{ 5,10,15,\dots ,50\right\} \) meaning that *W* ranges from 10 to 100. To demonstrate the versatility of the normal-based approximations presented in this paper, for each value of *n*, we find NBMC designs under a total of eight different loss functions. We consider the three exemplar loss functions (Sect. 3.1.1) for parameter estimation for (a) \(\varvec{\phi } = g_m(\varvec{\theta }_m) = \varvec{\theta }_m\); and (b) \(\varvec{\phi } = g_m(\varvec{\theta }_m) = \theta _{m2}/\theta _{m1}\), where in the latter case we are interested in the ratio of the two unknown parameters. In both cases, we are taking account of model uncertainty by considering the model-averaged posterior distribution of \(\varvec{\phi }\). We also consider the two exemplar loss functions for model discrimination (Sect. 3.1.2).

It is not clear how to find DLMC approximations to the expected loss for every loss function considered in this section. For example, a DLMC approximation to the model-averaged posterior median required for the AE loss would require samples from the posterior distribution of \(\varvec{\theta }_m\) for each \(m \in \mathcal {M}\) and \(\mathbf {y}^b\), for \(b=1,\dots ,B\). Therefore, in this section, we rely on the NBMC approximations to assess the performance of designs. Figure 6 shows twenty boxplots of the NBMC approximation (\(B=\) 20,000) to the expected loss for the NBMC designs plotted against *n* for each of the loss functions. Note how the expected loss functions for \(\varvec{\phi } = \varvec{\theta }_m\) have a faster relative decrease in expected loss with increasing *n* than the expected loss for \(\phi = \theta _{m2}/\theta _{m1}\). This is easiest to see for the SE in loss in Fig. 6b, where the expected SE loss for \(\varvec{\phi } = \varvec{\theta }_m\) has a relative decrease of approximately 35% when *n* increase from 5 to 50. The corresponding (see Fig. 6e) relative decrease in expected SE loss for \(\phi = \theta _{m2}/\theta _{m1}\) is approximately 15%. This is due to the form of the parameterisation of interest \(g_m(\varvec{\theta }_m)\). When \(\varvec{\phi } = \varvec{\theta }_m\), the trace of the prior variance of \(\phi \) is 63,125 which gives an upper bound on the expected loss. The corresponding value for \(\phi = \theta _{m2}/\theta _{m1}\) is approximately 1. Under the latter parameterisation, it appears the choice of design does not have as great an impact on the expected SE loss than the former.

## 4 Conclusion

In this paper, we have proposed the use of normal-based approximations to posterior summaries to aid in the approximation of the loss function. The resulting approximate loss can be used in conjunction with any algorithm suitable for finding the design that minimises the expected loss function. The methodology was used in conjunction with the ACE algorithm to find designs which have similar performance (in terms of expected loss) to designs generated by the original ACE algorithm with the DLMC approximation to the expected loss, but using a fraction of the computing time. The methodology was also able to find designs for problems, under model uncertainty, where use of the DLMC approach would be infeasible and, as such, have not previously been addressed in the literature.

The normal-based approximations utilised in this paper are formed by using the location of, and curvature around, the posterior mode. Taking advantage of alternative normal-based or other deterministic approximations may be of future interest. In particular, one could consider using the expectation propagation (EP) algorithm of Minka (2001) to form efficient approximations to the posterior distribution of the parameters. Similar to mode-/curvature-based approximations used in this paper, EP also provides an efficient approximate of the marginal likelihood. EP could potentially be useful in a wider range of problems as any distribution in the exponential family could be considered as the parametric form for approximating the posterior distribution.

## Notes

### Acknowledgements

The authors would like to thank the two anonymous referees for their insightful comments which have significantly improved the paper. The authors would also like to thank Prof David Woods (University of Southampton) for helpful suggestions on a draft of this paper. This work arose from discussions and collaborations initiated at the “Bayesian Optimal Design of Experiments” workshop (Queensland University of Technology, Brisbane, Australia, December 2015; http://www.bode2015.wordpress.com). A.M. Overstall was supported by a Research Incentive Grant (70424) from the Carnegie Trust for the Universities of Scotland. C.C. Drovandi was supported by an Australian Research Council’s Discovery Early Career Researcher Award funding scheme (DE160100741).

## References

- Achcar, J., Smith, A.: Aspects of reparameterization in approximate Bayesian inference. In: Hodges, J. (ed.) Essays in Honour of G. A. Barnard. North-Holland, Amsterdam (1990)Google Scholar
- Amzal, B., Bois, F.Y., Parent, E., Robert, C.P.: Bayesian-optimal design via interacting particle systems. J. Am. Stat. Assoc.
**101**, 773–785 (2006)MathSciNetCrossRefzbMATHGoogle Scholar - Box, G.E.P., Hill, W.J.: Discrimination among mechanistic models. Technometrics
**9**, 57–71 (1967)MathSciNetCrossRefGoogle Scholar - Chaloner, K., Verdinelli, I.: Bayesian experimental design: a review. Stat. Sci.
**10**, 273–304 (1995)MathSciNetCrossRefzbMATHGoogle Scholar - Friel, N., Wyse, J.: Estimating the evidence—a review. Stat. Neerlandica
**66**, 288–308 (2012)MathSciNetCrossRefGoogle Scholar - Gelman, A., Carlin, J., Stern, H., Dunson, D., Vehtari, A., Rubin, D.: Bayesian Data Analysis, 3rd edn. Chapman and Hall, Boca Raton (2014)zbMATHGoogle Scholar
- Girolami, M., Calderhead, B.: Riemann manifold langevin and hamiltonian monte carlo methods. J. R. Stat. Soc. Ser. B
**73**, 123–214 (2011)MathSciNetCrossRefGoogle Scholar - Goldstein, M., Wooff, D.: Bayes Linear Statistics: Theory and Methods. Wiley, Chichester (2007)CrossRefzbMATHGoogle Scholar
- Gotwalt, C.M., Jones, B.A., Steinberg, D.M.: Fast computation of designs robust to parameter uncertainty for nonlinear settings. Technometrics
**51**, 88–95 (2009)MathSciNetCrossRefGoogle Scholar - Hamada, M., Martz, H., Reese, C., Wilson, A.: Finding near-optimal Bayesian experimental designs via generic algorithms. Am. Stat.
**55**, 175–181 (2001)CrossRefzbMATHGoogle Scholar - Hoffman, M., Gelman, A.: The no-u-turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. J. Mach. Learn. Res.
**15**, 1593–1623 (2014)MathSciNetzbMATHGoogle Scholar - Jones, M., Goldstein, M., Jonathan, P., Randell, D.: Bayes linear analysis for Bayesian optimal experimental design. J. Stat. Plan. Inference
**171**, 115–129 (2016)MathSciNetCrossRefzbMATHGoogle Scholar - Khuri, A.: Advanced Calculus with Applications in Statistics, 2nd edn. Wiley, Hoboken (2003)zbMATHGoogle Scholar
- Lange, K.: Optimization, 2nd edn. Springer, New York (2013)CrossRefzbMATHGoogle Scholar
- Lindley, D.: On a measure of the information provided by an experiment. Ann. Math. Stat.
**27**, 986–1005 (1956)MathSciNetCrossRefzbMATHGoogle Scholar - Long, Q., Scavino, M., Tempone, R., Wang, S.: Fast estimation of expected information gains for Bayesian experimental designs based on Laplace approximations. Comput. Methods Appl. Mech. Eng.
**259**, 24–39 (2013)MathSciNetCrossRefzbMATHGoogle Scholar - McGree, J. M.: Developments of the total entropy utility function for the dual purpose of model discrimination and parameter estimation in Bayesian design. Comput. Stat. Data Anal. To appear (2017)Google Scholar
- Meyer, R., Nachtsheim, C.: The coordinate-exchange algorithm for constructing exact optimal experimental designs. Technometrics
**37**, 60–69 (1995)MathSciNetCrossRefzbMATHGoogle Scholar - Minka, T.: Expectation propagation for approximate Bayesian inference. In: Breese, J., Koller, D. (eds.) Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence (2001)Google Scholar
- Müller, P.: Simulation-based optimal design. Bayesian. Statistics
**6**, 459–474 (1999)zbMATHGoogle Scholar - Müller, P., Parmigiani, G.: Optimal design via curve fitting of monte carlo experiments. J. Am. Stat. Assoc.
**90**(432), 1322–1330 (1995)MathSciNetzbMATHGoogle Scholar - Müller, P., Sanso, B., De Iorio, M.: Optimal Bayesian design by inhomogeneous Markov chain simulation. J. Am. Stat. Assoc.
**99**, 788–798 (2004)MathSciNetCrossRefzbMATHGoogle Scholar - O’Hagan, A., Forster, J .J.: Kendall’s Advanced Theory of Statistics: Volume 2B Bayesian Inference, 2nd edn. Wiley, Chichester (2004)zbMATHGoogle Scholar
- Overstall, A.M., McGree, J.M., Drovandi, C.C.: Normal Based Designs. R package Version 1.0 (2017a)Google Scholar
- Overstall, A.M., Woods, D.C., Adamou, M.: acebayes: Optimal Bayesian experimental design using the ACE algorithm. R package version 1.3.4 (2017b). https://CRAN.R-project.org/package=acebayes
- Overstall, A.M., Woods, D.C.: Bayesian design of experiments using approximate coordinate exchange. Technometrics, (2017). doi: 10.1080/00401706.2016.1251495
- Pawitan, Y.: In All Likelihood: Statistical Modelling and Inference using Likelihood. Oxford University Press, Oxford (2013)zbMATHGoogle Scholar
- R Core Team R: A language and environment for statistical computing. R Foundation for Statistical Computing Vienna, Austria (2016) https://www.R-project.org/
- Rainforth, T., Cornish, R., Yang, H., Wood, F.: On the pitfalls of nested Monte Carlo. Tech. rep. University of Oxford (2016) ArXiv:1612.00951
- Robert, C.: The Bayesian Choice, 2nd edn. Springer, New York (2007)Google Scholar
- Ryan, E.G., Drovandi, C.C., McGree, J.M., Pettitt, A.N.: A review of modern computational algorithms for Bayesian optimal design. Int. Stat. Rev.
**84**, 128–154 (2016)MathSciNetCrossRefGoogle Scholar - Ryan, K.: Estimating expected information gains for experimental designs with application to the random fatigue-limit model. J. Comput. Graph. Stat.
**12**, 585–603 (2003)MathSciNetCrossRefGoogle Scholar - Santner, T., Williams, B., Notz, W.: The Design and Analysis of Computer Experiments. Springer, New York (2003)Google Scholar
- Scott, J., Berger, J.: Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem. Ann. Stat.
**38**, 2587–2619 (2010)MathSciNetCrossRefzbMATHGoogle Scholar - Weaver, B., Williams, B., Anderson-Cook, C., Higdon, D.: Computational enhancements to bayesian design of experiments using gaussian computational enhancements to bayesian design of experiments using gaussian processes. Bayesian Anal.
**11**, 191–213 (2016)Google Scholar - Woods, D., Overstall, A., Adamou, M., Waite, T.: Bayesian design of experiments for generalised linear models and dimensional analysis with industrial and scientific application. Qual. Eng.
**29**, 91–103 (2017)Google Scholar - Woods, D.C., Lewis, S.M., Eccleston, J.A., Russell, K.G.: Designs for generalized linear models with several variables and model uncertainty. Technometrics
**48**, 284–292 (2006)MathSciNetCrossRefGoogle Scholar

## Copyright information

**Open Access**This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.