1 Introduction

When designing experiments for nonlinear models there is usually uncertainty about the model parameters, \(\varvec{\psi }\in \varPsi \), and often also in the structural form of the model itself. A Bayesian approach enables this uncertainty to be taken into account coherently when choosing the variable settings to be applied in the experiment.

In contrast, frequentist optimal designs such as locally optimal designs (Chernoff 1953) and minimax designs have a less satisfactory approach to a priori parameter uncertainty. Locally optimal designs are tailored for a specific set of assumed parameter values and may perform poorly if the assumed values differ from the truth. Minimax designs optimize worst-case performance, potentially at the expense of reduced efficiency in the most likely parameter scenarios.

Suppose that the design is denoted by \(\varvec{\xi }= (\mathbf {x}_{1},\ldots ,\mathbf {x}_{n})\), where \(\mathbf {x}_i=(x_{i1},\ldots ,x_{iq})^\mathrm {T}\in {\mathbb {R}}^q\) is a vector that defines the settings of the q controllable variables to be applied to the ith experimental unit, with corresponding response \(y_i\) (\(i=1,\ldots , n\)). A design \(\varvec{\xi }^*\) is Bayesian optimal if it maximizes the expected utility,

$$\begin{aligned} U(\varvec{\xi }) = \int _{{\mathbb {R}}^n} \int _{\varPsi } u(\varvec{\xi }, \varvec{\psi }, \mathbf {y}) f_R(\mathbf {y}| \varvec{\psi }, \varvec{\xi }) f_B(\varvec{\psi }) d\varvec{\psi }d\mathbf {y}, \end{aligned}$$
(1)

with respect to \(\varvec{\xi }\in \varXi \), where \(\varXi \) denotes the set of possible designs. Above \(\mathbf {y}= (y_1,\ldots ,y_n)^\mathrm {T}\), with \(f_B(\varvec{\psi })\) denoting the prior probability density of the parameters and \(f_R(\mathbf {y}|\varvec{\psi },\varvec{\xi })\) denoting the conditional probability density of the response vector under the assumed model.

The utility function u is chosen to reflect the goal of the experiment, such as point estimation of \(\varvec{\psi }\) or hypothesis testing. We will focus on the case where the goal is to report all knowledge about the parameters via the full posterior distribution, with density \(f_{A}(\varvec{\psi }| \mathbf {y}, \varvec{\xi }) \propto f_{R}(\mathbf {y}| \varvec{\psi }, \varvec{\xi })f_{B}(\varvec{\psi })\), ensuring that this is as concentrated as possible. Here a commonly recommended utility is

$$\begin{aligned} u(\varvec{\xi }, \varvec{\psi }, \mathbf {y})&= \log \frac{ f_A(\varvec{\psi }| \mathbf {y}, \varvec{\xi }) }{ f_B(\varvec{\psi }) } \nonumber \\&= \log f_R(\mathbf {y}| \varvec{\psi }, \varvec{\xi }) - \log f_E(\mathbf {y}|\varvec{\xi }) \,, \end{aligned}$$
(2)

involving the model evidence, defined via \( f_E(\mathbf {y}| \varvec{\xi }) = \int _{\varPsi } f_R(\mathbf {y}| \varvec{\psi }, \varvec{\xi }) f_B(\varvec{\psi }) d\varvec{\psi }\). The above is the unique utility corresponding to a local proper scoring rule. A Bayesian optimal design for utility (2) maximizes the expected Kullback–Leibler divergence, or equivalently the expected Shannon information gain (SIG), between the prior and posterior distributions (Lindley 1956; Bernardo 1979; Chaloner and Verdinelli 1995).

Note that the role of the subscripts above is to ensure that different functions have different names, e.g. \(f_A(\cdot |\cdot ,\cdot )\) is the posterior of \(\varvec{\psi }\) and \(f_B(\cdot )\) is the prior for \(\varvec{\psi }\). This is more precise than the simpler notation more commonly used in Bayesian statistics in which both density functions would be denoted by f and distinguished purely by their arguments; it is also shorter than the more formal probabilistic notation in which the two functions would be denoted \(f_{\varTheta |Y,\varXi }(\cdot |\cdot ,\cdot )\) and \(f_{\varTheta }(\cdot )\). The more precise notation will be important later, when we wish to substitute other quantities, e.g. one denoted \({\hat{\varvec{\mu }}}\), into the posterior density of \(\varvec{\psi }\). The simpler notation is considered an ‘abuse of notation’ by mathematicians (e.g. Gelman et al. 2013, p.6), though it is often expedient.

Despite the apparent simplicity of the above theory, until recently it was all but impossible to compute a Bayesian optimal design in practice for realistically complex experiments. This is due to the presence of two main challenges. Firstly, the (potentially high-dimensional) integrals involved in (1) and (2) are analytically intractable except for linear models with normally-distributed response. Thus, in general the expected utility can only be evaluated approximately using numerical integration. Typically the outer integral in (1) is estimated via Monte Carlo. The inner integral in the model evidence in (2) can be estimated stochastically, giving Double Loop Monte Carlo (Ryan 2003) or nested Importance Sampling (Feng 2015). Alternatively, deterministic estimates such as Laplace approximations can be used (Long et al. 2013; Overstall et al. 2018). Earlier approaches such as Bayesian D-optimality relied more heavily on asymptotic approximations (Chaloner and Verdinelli 1995).

The second challenge is numerical maximization of the approximately evaluated utility. This is difficult as a result of the high dimension of the design space. In addition, due to the use of Monte Carlo, the approximate evaluations of the objective function are computationally expensive, noisy, and non-smooth. This precludes the use of standard optimization algorithms such as quasi-Newton methods or co-ordinate exchange algorithms. Instead, more sophisticated optimization techniques have been developed, one of the most promising being approximate co-ordinate exchange (ACE; Overstall and Woods 2017). Alternative methods include stochastic approximation (Huan and Marzouk 2013) and sampling-based methods (Müller et al. 2004).

The idea of the ACE algorithm is to optimize one co-ordinate of the design at a time using a Gaussian process emulator to form a smooth estimate of the expected utility as a function of the current co-ordinate. To ensure robustness to the quality of the emulator, each proposed change to a co-ordinate is subject to an independent emulator-free acceptance-rejection step. After making several passes through the design matrix using this process, the design points are consolidated using a point exchange procedure. An implementation is available in the R package acebayes (Overstall et al. 2019).

This paper makes several contributions. First, we introduce a new method for the approximation of the expected SIG utility, called Approximate Laplace Importance Sampling (ALIS). Our method is computationally cheaper (in some cases by up to 70%) than the Laplace Importance Sampling (LIS) method used by Beck et al. (2018) to find low-dimensional designs for partial differential equation models, and by Senarathne et al. (2020) for sequential design. Second, we conduct a thorough comparison of ALIS and LIS with a number of other algorithms in the context of nonlinear models familiar from the statistics literature. Third, we discuss approximations to the expected SIG utility in the common case where there are nuisance parameters (cf. Feng and Marzouk 2019). Finally, we demonstrate that the use of ALIS and LIS in conjunction with the ACE optimization algorithm gives better designs than previous approximations in some models with a large number of parameters.

2 Existing approximations for expected Shannon information gain

All of the methods considered in this paper use Monte Carlo integration to estimate the (outer) integral in (1), giving an approximation of the form

$$\begin{aligned} {\tilde{U}}(\varvec{\xi }) = \frac{1}{M_1} \sum _{h=1}^{M_1} \left[ \log f_R(\mathbf {y}_h \,|\, \varvec{\psi }_h, \varvec{\xi }) - \log {\tilde{f}}_E^h \right] , \end{aligned}$$
(3)

where \((\varvec{\psi }_h, \mathbf {y}_h)\), \(h=1,\ldots ,M_1\), are independent random samples from the joint prior density, i.e. \( f_J(\varvec{\psi },\mathbf {y}\,|\,\varvec{\xi }) = f_B(\varvec{\psi })f_R(\mathbf {y}\,|\,\varvec{\psi },\varvec{\xi }) \), and \({\tilde{f}}_E^h\) is an estimate of the evidence \(f_E(\mathbf {y}_h | \varvec{\xi })\) in (2).

The main difference between the various methods is in the choice of the estimate of the evidence in (3), which affects both accuracy and computational expense. The primary distinction is whether a second Monte Carlo estimate is used, giving a ‘Nested Monte Carlo’ method, or a deterministic estimate such as the Laplace approximation. We detail these different methods below.

2.1 Naïve Monte Carlo

The simplest way to approximate the evidence in (3) is via \( {\tilde{f}}_E^h = \frac{1}{M_2} \sum _{k=1}^{M_2} f_R(\mathbf {y}_h \,|\, {\tilde{\varvec{\psi }}}_{hk}, \varvec{\xi }) \,, \) where the ‘inner sample’ \({\tilde{\varvec{\psi }}}_{hk}\), \(k=1,\ldots ,M_2\), is another independent random sample from the prior density, \(f_B(\varvec{\psi })\). The inner sample is chosen independently of the ‘outer sample’, \((\varvec{\psi }_h, \mathbf {y}_h)\). This gives an overall approximation

$$\begin{aligned} \begin{aligned} {\tilde{U}}_{\text {nMC}}(\varvec{\xi }) =&\frac{1}{M_1} \sum _{h=1}^{M_1} \Bigg [ \log f_R(\mathbf {y}_h \,|\, \varvec{\psi }_h,\varvec{\xi }) \\&- \log \left( \frac{1}{M_2} \sum _{k=1}^{M_2} f_{R}( \mathbf {y}_h \,|\, {\tilde{\varvec{\psi }}}_{hk}, \varvec{\xi }) \right) \Bigg ] . \end{aligned} \end{aligned}$$

We refer to the above approximation as naïve Monte Carlo (nMC); it is known elsewhere in the literature as Double Loop Monte Carlo (DLMC). The estimator \({\tilde{U}}_\text {nMC}(\varvec{\xi })\) has variance of asymptotic order \(O(1/M_1)\) and positive asymptotic bias \(C(\varvec{\xi })/M_2\), where

$$\begin{aligned} C(\varvec{\xi })= \frac{1}{2} \mathbb {E}\left[ {\mathrm{Var}}\left( \frac{f_R(\mathbf {y}\,|\, \varvec{\psi }, \varvec{\xi }) }{f_{E}(\mathbf {y}\,|\, \varvec{\xi })} \,\Big |\, \mathbf {y}\right) /f_E(\mathbf {y}\,|\,\varvec{\xi })^2 \right] \end{aligned}$$

(Ryan 2003). Thus, the variance can be reduced by increasing the outer sample size, and the bias can be reduced by increasing the inner sample size.

Despite its good asymptotic properties, for practical inner sample sizes the naïve Monte Carlo estimator commonly suffers from problems with numerical underflow. When this happens one obtains a numerically negligible estimate for the evidence and a numerical estimate of infinity for the expected utility. The latter is clearly unreasonable, making it questionable whether the method can be reliably used to compare designs when \(M_1\) and \(M_2\) are small. This zero evidence problem is particularly acute when the posterior is highly concentrated relative to the prior. In this case the likelihood \(f_R(\mathbf {y}|\varvec{\psi }, \varvec{\xi })\) is numerically negligible throughout the majority of the parameter space, except on a very small neighbourhood around the maximum likelihood estimate. It is thus highly likely that all of the \({\tilde{\varvec{\psi }}}_{hk}\), which are sampled from the prior, will lie outside of this neighbourhood, giving a numerically negligible estimate of the evidence.
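As a concrete illustration, the nMC estimator can be sketched for a toy one-parameter model, \(y_i = \exp (\theta x_i) + \epsilon _i\) with \(\epsilon _i \sim N(0, \sigma ^2)\) and prior \(\theta \sim N(0,1)\), computing the inner average on the log scale via log-sum-exp. The model, prior, and all names below are our own illustrative assumptions, not the paper's implementation:

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(1)
sigma = 0.5  # known noise SD (illustrative assumption)

def log_lik(y, theta, x):
    # log f_R(y | theta, xi) under the toy model y_i = exp(theta*x_i) + eps_i
    mu = np.exp(theta * x)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - 0.5 * (y - mu)**2 / sigma**2)

def u_nmc(x, M1=300, M2=300):
    # Naive (double loop) Monte Carlo estimate of the expected SIG, eq. (3)
    theta_out = rng.standard_normal(M1)   # outer prior draws psi_h
    y = np.exp(np.outer(theta_out, x)) + sigma * rng.standard_normal((M1, len(x)))
    theta_in = rng.standard_normal(M2)    # fresh, independent inner prior draws
    vals = []
    for h in range(M1):
        ll_inner = np.array([log_lik(y[h], t, x) for t in theta_in])
        # evidence estimate on the log scale via log-sum-exp, which delays
        # (but cannot fully prevent) the underflow discussed above
        log_ev = logsumexp(ll_inner) - np.log(M2)
        vals.append(log_lik(y[h], theta_out[h], x) - log_ev)
    return np.mean(vals)

print(u_nmc(np.linspace(0, 1, 5)))
```

Even on the log scale, the inner average is dominated by whichever prior draws happen to land near the maximum likelihood estimate, which is the root of the zero-evidence problem described above.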

2.2 Reuse estimator

To alleviate the numerical stability problems of the naïve Monte Carlo estimator, Huan and Marzouk (2013) proposed the reuse approximation,

$$\begin{aligned} \begin{aligned} {\tilde{U}}_\text {reuse}(\varvec{\xi }) =&\frac{1}{M_1} \sum _{h=1}^{M_1} \Bigg [ \log f_R(\mathbf {y}_h \,|\, \varvec{\psi }_h,\varvec{\xi }) \\&- \log \left( \frac{1}{M_1} \sum _{k=1}^{M_1} f_{R}( \mathbf {y}_h \,|\, \varvec{\psi }_k, \varvec{\xi }) \right) \Bigg ] , \end{aligned} \end{aligned}$$

which uses the same parameter sample in both the inner and outer summation. The asymptotic bias of the reuse estimator has the same order of magnitude as that of the naïve Monte Carlo method. However the reuse estimator is more numerically stable for small Monte Carlo sample sizes. In particular, it will usually give a finite estimate of the expected utility gain because each inner sum contains the term \(f_{R}(\mathbf {y}_h\,|\, \varvec{\psi }_h, \varvec{\xi })\), which is non-negligible as \(\varvec{\psi }_h\) is the parameter vector used to generate \(\mathbf {y}_h\) in the simulation.
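A sketch of the reuse estimator for a toy one-parameter model, \(y_i = \exp (\theta x_i) + \epsilon _i\) with \(\epsilon _i \sim N(0,\sigma ^2)\) and \(\theta \sim N(0,1)\) (an illustrative assumption, not the authors' implementation):

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(2)
sigma = 0.5  # known noise SD (illustrative assumption)

def u_reuse(x, M1=500):
    # Reuse estimator: the same prior draws feed the outer and inner sums
    theta = rng.standard_normal(M1)
    mu = np.exp(np.outer(theta, x))                    # (M1, n) mean surfaces
    y = mu + sigma * rng.standard_normal((M1, len(x)))
    resid = y[:, None, :] - mu[None, :, :]             # (M1, M1, n) residuals
    L = np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
               - 0.5 * resid**2 / sigma**2, axis=-1)   # L[h, k] = log f_R(y_h | psi_k)
    log_ev = logsumexp(L, axis=1) - np.log(M1)         # inner sum reuses outer draws
    return np.mean(np.diag(L) - log_ev)

print(u_reuse(np.linspace(0, 1, 5)))
```

The diagonal term \(L_{hh} = \log f_R(\mathbf {y}_h \,|\, \varvec{\psi }_h, \varvec{\xi })\) is always non-negligible, which is precisely the stability property noted above.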

2.3 Laplace approximations

The literature contains two methods for using Laplace approximations to avoid nested Monte Carlo integration. For the first method, considered by Overstall et al. (2018) and denoted LA1 here, equation (3) is used with the standard Laplace approximation to the evidence, giving

$$\begin{aligned} \begin{aligned} {\tilde{U}}_{\text {LA1}}(\varvec{\xi })&= \frac{1}{M_1} \sum _{h=1}^{M_1} \Bigg [ \log f_R(\mathbf {y}_h \,|\, \varvec{\psi }_h, \varvec{\xi }) \\&\quad - \log {\tilde{f}}_A( {\hat{\varvec{\psi }}}_h | \mathbf {y}_h, \varvec{\xi }) - \frac{p}{2} \log 2\pi + \frac{1}{2} \log |{\mathbf {H}}_{h}| \Bigg ] , \end{aligned} \end{aligned}$$

where \({\tilde{f}}_A(\varvec{\psi }| \mathbf {y}, \varvec{\xi }) = f_R(\mathbf {y}| \varvec{\psi }, \varvec{\xi }) f_B(\varvec{\psi })= f_J(\varvec{\psi }, \mathbf {y}|\varvec{\xi })\) denotes the unnormalized posterior. In addition \({\hat{\varvec{\psi }}}_h = {\mathrm{arg max}}_{\varvec{\psi }} {\tilde{f}}_A(\varvec{\psi }| \mathbf {y}_h, \varvec{\xi })\) denotes the posterior mode for the hth response realization, while

$$\begin{aligned} {\mathbf {H}}_h = -\frac{\partial ^2 \log {\tilde{f}}_A(\varvec{\psi }| \mathbf {y}_h, \varvec{\xi })}{\partial \varvec{\psi }\partial \varvec{\psi }^\mathrm {T}} \big |_{\varvec{\psi }= {\hat{\varvec{\psi }}}_h} \end{aligned}$$

denotes the Hessian of the negative log-posterior at the mode. This asymptotic approximation should be accurate provided n is large enough for the posterior to be approximately normal, and can be used to find efficient designs for a wide range of sample sizes.
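For a scalar parameter, the LA1 evidence approximation reduces to a one-dimensional mode search and a second derivative. A minimal sketch under an assumed toy model, \(y_i = \exp (\theta x_i) + \epsilon _i\) with prior \(\theta \sim N(0,1)\) (helper names are ours):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
sigma = 0.5  # known noise SD (illustrative assumption)

def neg_log_post(theta, y, x):
    # negative unnormalized log-posterior: -log f_R(y | theta) - log f_B(theta)
    mu = np.exp(theta * x)
    neg_ll = (0.5 * len(x) * np.log(2 * np.pi * sigma**2)
              + 0.5 * np.sum((y - mu)**2) / sigma**2)
    return neg_ll + 0.5 * np.log(2 * np.pi) + 0.5 * theta**2

def log_evidence_la(y, x, theta0):
    # Laplace approximation: log f~_A at the mode + (p/2) log 2pi - (1/2) log |H|
    res = minimize(lambda t: neg_log_post(t[0], y, x), [theta0], method="BFGS")
    mode, h = res.x[0], 1e-4
    H = (neg_log_post(mode + h, y, x) - 2 * neg_log_post(mode, y, x)
         + neg_log_post(mode - h, y, x)) / h**2   # scalar Hessian, central differences
    return -res.fun + 0.5 * np.log(2 * np.pi) - 0.5 * np.log(H)

x = np.linspace(0, 1, 5)
theta_true = rng.standard_normal()                # data-generating draw psi_h
y = np.exp(theta_true * x) + sigma * rng.standard_normal(len(x))
# initializing the mode search at the data-generating value speeds convergence
print(log_evidence_la(y, x, theta0=theta_true))
```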

A second method, denoted LA2 here, was considered by Long et al. (2013). This requires additionally that the sample size is large enough for the posterior to be highly concentrated around the posterior mode. In this case, both the log-posterior and the log-prior can be approximated within the region of highest posterior density by a second-order Taylor expansion, giving

$$\begin{aligned} \begin{aligned} \mathbb {E}_{\mathbf {y}}&\mathbb {E}_{\varvec{\psi }|\mathbf {y}} \left[ \log f_A (\varvec{\psi }|\mathbf {y}) - \log f_B(\varvec{\psi }) \right] \\ \approx&\mathbb {E}_{\mathbf {y}}\mathbb {E}_{\varvec{\psi }|\mathbf {y}} \Big [ \frac{1}{2} \log |\mathbf {H}_{\mathbf {y}}| - \frac{p}{2}\log 2\pi \\&- \frac{1}{2} (\varvec{\psi }- \hat{\varvec{\psi }}_{\mathbf {y}} )^\mathrm {T}\mathbf {H}_{\mathbf {y}} (\varvec{\psi }- \hat{\varvec{\psi }}_{\mathbf {y}}) - \log f_B({\hat{\varvec{\psi }}}_{\mathbf {y}}) \\&- \nabla _{\varvec{\psi }} \log f_B({\hat{\varvec{\psi }}}_{\mathbf {y}}) (\varvec{\psi }- {\hat{\varvec{\psi }}}_{\mathbf {y}}) \\&- \frac{1}{2} (\varvec{\psi }- {\hat{\varvec{\psi }}}_{\mathbf {y}} )^\mathrm {T}\nabla ^2_{\varvec{\psi }} \log f_B({\hat{\varvec{\psi }}}_{\mathbf {y}}) (\varvec{\psi }- {\hat{\varvec{\psi }}}_{\mathbf {y}}) \Big ] \\&\approx \mathbb {E}_{\mathbf {y}} \Big [ \frac{1}{2} \log |\mathbf {H}_{\mathbf {y}}| - \frac{p}{2}(\log 2\pi +1)\\&- \log f_B({\hat{\varvec{\psi }}}_{\mathbf {y}}) - \frac{1}{2} \text {tr} ( \nabla ^2_{\varvec{\psi }} \log f_B({\hat{\varvec{\psi }}}_{\mathbf {y}}) \mathbf {H}_{\mathbf {y}}^{-1} ) \Big ] , \end{aligned} \end{aligned}$$

where \({\hat{\varvec{\psi }}}_{\mathbf {y}}\) denotes the posterior mode given response vector \(\mathbf {y}\) and \({\mathbf {H}}_{\mathbf {y}}\) the corresponding Hessian of the negative log-posterior. Above, the last line has been obtained using the elementary fact that if \(\varvec{\psi }\sim N(\varvec{\mu }, {\mathbf {V}})\) then \(\mathbb {E}[(\varvec{\psi }-\varvec{\mu })^\mathrm {T}{\mathbf {Q}} (\varvec{\psi }-\varvec{\mu })] = \text {tr}({\mathbf {Q}} {\mathbf {V}})\). Monte Carlo estimation of the above gives

$$\begin{aligned} \begin{aligned} \tilde{U}_{\text {LA2}}(\varvec{\xi })&= \frac{1}{M_1} \sum _{h=1}^{M_1} \Big [ \frac{1}{2} \log |\mathbf {H}_{h}| - \frac{p}{2}(\log 2\pi +1) \\&\quad - \log f_B({\hat{\varvec{\psi }}}_h) - \frac{1}{2} \text {tr} ( \nabla ^2_{\varvec{\psi }} \log f_B({\hat{\varvec{\psi }}}_{h}) \mathbf {H}_{h}^{-1} ) \Big ] . \end{aligned} \end{aligned}$$

3 Approximate Laplace importance sampling

3.1 Importance sampling

[Algorithm 1: pseudocode for the LIS and ALIS methods]

Another Monte Carlo method for estimating the evidence is importance sampling, i.e.

$$\begin{aligned} {\tilde{f}}^h_E = \frac{1}{M_2} \sum _{ k=1 }^{M_2} \frac{ f_R( \mathbf {y}_h \,|\, {\tilde{\varvec{\psi }}}_{hk}, \varvec{\xi }) f_B({\tilde{\varvec{\psi }}}_{hk}) }{ q_h({\tilde{\varvec{\psi }}}_{hk}) } , \end{aligned}$$
(4)

where \({\tilde{\varvec{\psi }}}_{hk}\), \(k=1,\ldots ,M_2\), is an independent sample from the importance density \(q_h\). Note that nMC corresponds to the special case where the prior is chosen as the importance density. By standard theory (e.g. Lemieux 2009, p.114) the optimal importance density is \(q^*_h(\varvec{\psi }) \propto f_R(\mathbf {y}_h \,|\, \varvec{\psi }, \varvec{\xi }) f_B(\varvec{\psi })\), i.e. \(q^*_h\) is the posterior density of \(\varvec{\psi }\) given \(\mathbf {y}_h\). This gives a zero-variance, unbiased (i.e. error-free) estimator; unfortunately the optimal importance density cannot be used in practice, as it requires knowledge of the evidence, the very quantity we are trying to estimate.

The above discussion suggests that a good choice of importance density would be a computationally cheap approximation to the posterior, such as \( N(\varvec{\psi }; {\hat{\varvec{\mu }}}_h,{\hat{\varvec{\Sigma }}}_h) \) or \( t_\nu (\varvec{\psi }; {\hat{\varvec{\mu }}}_h, {\hat{\varvec{\Sigma }}}_h) \), where \( {\hat{\varvec{\mu }}}_h \) and \( {\hat{\varvec{\Sigma }}}_h \) are approximations to the mean vector and variance matrix of \(f_{A}(\varvec{\psi }| \mathbf {y}_h, \varvec{\xi })\). Here \(N(\cdot ; \varvec{\mu }, \varvec{\Sigma })\) and \( t_\nu (\cdot ;\varvec{\mu }, \varvec{\Sigma }) \) denote respectively the probability density function of a multivariate normal and multivariate t distribution with mean \(\varvec{\mu }\), variance matrix \(\varvec{\Sigma }\), and degrees of freedom \(\nu \) for the latter. Below we discuss different methods for choosing \({\hat{\varvec{\mu }}}_h\) and \({\hat{\varvec{\Sigma }}}_h\).
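For instance, with a normal importance density, the estimate (4) can be evaluated on the log scale. A minimal sketch under an assumed toy model, \(y_i = \exp (\theta x_i) + \epsilon _i\) with prior \(\theta \sim N(0,1)\) (all names and settings are illustrative):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

rng = np.random.default_rng(4)
sigma = 0.5  # known noise SD (illustrative assumption)

def log_evidence_is(y, x, mu_h, sd_h, M2=300):
    # Importance-sampling estimate (4) of log f_E(y | xi)
    theta = mu_h + sd_h * rng.standard_normal(M2)        # draws from q_h
    mu = np.exp(np.outer(theta, x))                      # (M2, n) mean surfaces
    log_lik = np.sum(norm.logpdf(y, loc=mu, scale=sigma), axis=1)
    log_w = (log_lik + norm.logpdf(theta)                # + log prior f_B
             - norm.logpdf(theta, loc=mu_h, scale=sd_h)) # - log q_h
    return logsumexp(log_w) - np.log(M2)

x = np.linspace(0, 1, 5)
theta0 = rng.standard_normal()
y = np.exp(theta0 * x) + sigma * rng.standard_normal(len(x))
# centre the importance density near an approximate posterior mean and SD
print(log_evidence_is(y, x, mu_h=theta0, sd_h=0.3))
```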

3.2 Laplace-type Importance Sampling Methods

Laplace-type importance sampling methods set the variance of the importance distribution as \( {\hat{\varvec{\Sigma }}}_h = {\mathbf {H}}_h({\hat{\varvec{\mu }}}_h)^{-1} \) where \( {\mathbf {H}}_h({\hat{\varvec{\mu }}}_h) = - \frac{ \partial ^2 \log {\tilde{f}}_{A}(\varvec{\psi }\,|\,\mathbf {y}_h,\varvec{\xi }) }{ \partial \varvec{\psi }\partial \varvec{\psi }^\mathrm {T}} \big |_{\varvec{\psi }= {\hat{\varvec{\mu }}}_h} \). The two variants, Laplace Importance Sampling (LIS), and Approximate Laplace Importance Sampling (ALIS), are distinguished via the choice of mean \({\hat{\varvec{\mu }}}_h\).

3.2.1 LIS

With LIS, the mean is approximated using \( {\hat{\varvec{\mu }}}_h = {\hat{\varvec{\psi }}}_h = {\mathrm{arg max}}_{\varvec{\psi }\in \varPsi } {\tilde{f}}_A(\varvec{\psi }\,|\, \mathbf {y}_h, \varvec{\xi }) \). This necessitates a total of \(M_1\) potentially costly numerical optimizations to find the mode, \({\hat{\varvec{\psi }}}_h\) (\(h=1,\ldots ,M_1\)), of the posterior distribution: one for each of the \(M_1\) simulated response vectors, \(\mathbf {y}_1,\ldots , \mathbf {y}_{M_1}\), in the outer sample. However, provided the search for \({\hat{\varvec{\psi }}}_h\) is initialized at the data-generating parameter values \(\varvec{\psi }_h\), it typically converges in a small number of iterations. We performed these optimizations using the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method. Algorithm 1 gives a detailed description of the LIS method and our proposed new variant, ALIS. See Ryan et al. (2015) and Beck et al. (2018) for low-dimensional examples of design selection with LIS.

3.2.2 ALIS

The key observation underpinning ALIS is that the posterior mode used to center the importance distribution in LIS is frequently close to the data-generating values, i.e. it is often the case that \({\hat{\varvec{\psi }}}_h \approx \varvec{\psi }_h\). Given this, an obvious question is whether it is possible to reduce the computational cost of the LIS method by removing some of the optimization steps, setting \({\hat{\varvec{\mu }}}_h = \varvec{\psi }_h\) for some h. For this choice to work we require at a minimum that \({\mathbf {H}}_h(\varvec{\psi }_h)\) is positive definite, since without this \({\hat{\varvec{\Sigma }}}_h\) would not be a valid covariance matrix and it would be impossible to sample from the importance distribution \(q_h\). The ALIS importance distribution is thus centred at

$$\begin{aligned} {\hat{\varvec{\mu }}}_h = {\left\{ \begin{array}{ll} \varvec{\psi }_h &{} \text {if }\, {\mathbf {H}}_h(\varvec{\psi }_h)\, \text {is numerically }\\ &{} \text {positive-definite,} \\ {\hat{\varvec{\psi }}}_h &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$

We show in Sect. 4.2 that this choice gives a method with comparable accuracy but lower computational cost.
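The centring rule can be sketched as follows; the helper names and the Cholesky-based positive-definiteness check are our own illustrative choices, not the paper's code:

```python
import numpy as np

def alis_centre(psi_h, hessian, find_mode):
    """ALIS choice of importance-density centre (interface is illustrative).

    psi_h     : data-generating parameter draw from the outer sample
    hessian   : callable returning the negative log-posterior Hessian
    find_mode : callable running the (potentially costly) mode search;
                only invoked when the cheap choice fails
    """
    H = hessian(psi_h)
    try:
        np.linalg.cholesky(H)            # numerical positive-definiteness check
        return psi_h, H                  # cheap centre: optimization avoided
    except np.linalg.LinAlgError:
        mode = find_mode()               # fall back to the LIS centre
        return mode, hessian(mode)

# Demo 1: positive-definite Hessian at psi_h => the optimization is skipped
A = np.array([[2.0, 0.3], [0.3, 1.0]])
centre, H = alis_centre(np.zeros(2), lambda p: A, lambda: np.ones(2))
print(centre)  # -> [0. 0.]

# Demo 2: indefinite Hessian at psi_h => revert to the posterior mode
B = np.array([[1.0, 0.0], [0.0, -1.0]])
Apd = np.eye(2)
centre2, _ = alis_centre(np.zeros(2),
                         lambda p: B if np.allclose(p, 0) else Apd,
                         lambda: np.ones(2))
print(centre2)  # -> [1. 1.]
```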

3.3 Nested importance sampling

In nested importance sampling (nIS; Feng 2015), the posterior mean \(\varvec{\mu }_h\) and variance \(\varvec{\Sigma }_h\) are approximated via self-normalized importance sampling using the outer sample. This gives

$$\begin{aligned} \begin{aligned} {\hat{\varvec{\mu }}}_h&= \sum _{k=1}^{M_1} {\bar{w}}_{hk} \varvec{\psi }_k , \\ {\hat{\varvec{\Sigma }}}_h&= \sum _{k=1}^{M_1} {\bar{w}}_{hk} (\varvec{\psi }_k - {\hat{\varvec{\mu }}}_h)( \varvec{\psi }_k - {\hat{\varvec{\mu }}}_h)^\mathrm {T}, \end{aligned} \end{aligned}$$

where \( {\bar{w}}_{hk} = f_R(\mathbf {y}_h \,|\, \varvec{\psi }_k, \varvec{\xi }) / \sum _{l=1}^{M_1} f_{R}(\mathbf {y}_h \,|\, \varvec{\psi }_l, \varvec{\xi }) \). This approach has the potential to suffer from a low effective sample size. To counter this, Feng (2015) proposed to revert to the naïve Monte Carlo estimate of the evidence if \( \text {ESS}_h = 1/(\sum _{k=1}^{M_1} {\bar{w}}_{hk}^2) \) drops below a prespecified minimum effective sample size.
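A sketch of the self-normalized weight, moment, and effective-sample-size calculations for a single outer observation (the helper name and interface are ours, for illustration only):

```python
import numpy as np
from scipy.special import logsumexp

def nis_moments(log_lik_row, psi):
    """Self-normalized IS moments and ESS for one outer observation y_h.

    log_lik_row[k] = log f_R(y_h | psi_k, xi) over the outer sample psi.
    """
    log_w = log_lik_row - logsumexp(log_lik_row)   # normalize on the log scale
    w_bar = np.exp(log_w)                          # weights w_bar_{hk}
    mu_hat = np.sum(w_bar * psi)                   # weighted posterior mean
    var_hat = np.sum(w_bar * (psi - mu_hat)**2)    # weighted posterior variance
    ess = 1.0 / np.sum(w_bar**2)                   # effective sample size ESS_h
    return mu_hat, var_hat, ess

psi = np.array([-1.0, 0.0, 0.5, 3.0])              # outer parameter draws
log_lik_row = np.array([-3.0, -1.0, -2.0, -50.0])  # last draw is far from y_h
mu_hat, var_hat, ess = nis_moments(log_lik_row, psi)
print(round(ess, 2))  # -> 1.96: a low ESS like this would trigger the nMC fallback
```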

4 Performance comparison

Table 1 Prior means and standard deviations for the lubricant kinematic viscosity model
Fig. 1 Distribution of the estimator of expected utility from different methods, using the Michaelis-Menten model and a space-filling design with \(n=5\) runs. Empirical distributions of the estimator are based on 100 evaluations for each method. The ‘true’ expected Shannon information gain given by the reference approximation is indicated by the red line. Numbers after the method name indicate the outer Monte Carlo sample size \(M_1\). The pairs \((M_1,M_2)=(300,300)\) and \((2000,10000)\) were used

Table 2 Percentage RRMSE of the estimator of expected Shannon information gain, for different combinations of model, sample size, and approximation method. Methods are ordered according to the average RRMSE across all examples
Table 3 ALIS method: long-run percentage of outer loop iterations for which the data-generating values of \(\varvec{\psi }\) are used to centre the importance distribution, i.e. the percentage of outer loop iterations for which numerical optimization to find the posterior mode is avoided. Percentages were estimated by simulating 10,000 outer loop iterations. For smaller values of \(M_1\) the actual number of outer loop iterations for which numerical optimization is avoided will follow a Binomial distribution with success probabilities approximately equal to the above percentages
Table 4 Computational expense and accuracy of different methods. The left part of the table shows, for each method, the mean RRMSE and mean time to produce one evaluation of the utility function. The mean is an average across ten repeats of all examples for the Michaelis-Menten and BOD models. Methods are sorted from least expensive to most expensive. The right part of the table shows a detailed breakdown of the mean evaluation time of the utility function for each example, averaged across ten repeats
Table 5 Comparison between ALIS and LIS for smaller inner loop sample sizes. Cost reduction is shown as a percentage, while RRMSE increase is shown as the absolute increase (which is a difference in percentage points). The table is ordered according to the magnitude of the cost reduction from ALIS

4.1 Models for performance assessment

In this section we compare the performance of the methods from Sects. 2 and 3. Results are given for two tasks: (i) evaluation of the utility function and (ii) design selection, in Sects. 4.2 and 4.3 respectively. Three different nonlinear models are considered, all of the form

$$\begin{aligned} y_i = \eta ({\mathbf {x}}_i, \varvec{\theta }) + \epsilon _i , \end{aligned}$$
(5)

with \(\epsilon _i \overset{\text {i.i.d.}}{\sim } N(0,\sigma ^2_\epsilon )\) for \(i=1,\ldots , n\). The models differ with respect to their mean functions \(\eta \), parameters, and priors, with details given below.

4.1.1 Michaelis-Menten model

The Michaelis-Menten model has mean function

$$\begin{aligned} \eta (x, \varvec{\theta })= \theta _1 x/(\theta _2 + x) , \end{aligned}$$

with unknown parameters \(\theta _1\), \(\theta _2\), both positive. It is assumed a priori that \(\log \theta _1 \sim N(4.38, 0.07^2)\), \(\log \theta _2 \sim N(1.19,0.84^2)\), and \(\sigma ^2_\epsilon \sim \text {Inverse-Gamma}(3,2)\) independently.

The prior on \(\theta _2\) is relatively diffuse, implying a wide range of possible shapes of the response curve. The prior on \(\sigma ^2_\epsilon \) was chosen to imply a low noise-to-signal ratio as this leads to a relatively concentrated posterior, the most demanding scenario for methods such as naïve Monte Carlo and nested importance sampling. Specifically, the 10% and 90% quantiles of \(\sigma _\epsilon /\eta (400,\varvec{\theta })\) are 0.009 and 0.02 respectively, where the denominator is the maximum value of \(\eta \) over the design region [0, 400].

In order to preserve the positivity constraint on \(\theta _1\) and \(\theta _2\), we reparameterize the model in terms of \(\vartheta _1 = \log \theta _1\) and \(\vartheta _2 = \log \theta _2\). The resulting normal ALIS/LIS importance distribution on the \(\vartheta \) scale effectively implies a log-normal importance distribution on the \(\theta \) scale. Note that reparameterization does not change the value of the expected Shannon information gain, but it does change our numerical estimate thereof due to the modified importance distributions. Here the noise variance \(\sigma ^2_\epsilon \) can be integrated out analytically owing to its conjugate prior. See the appendix for full technical details of the calculations for the marginal likelihood and its derivatives.
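As an illustration of this conjugate marginalization, the log marginal likelihood under the \(\text {Inverse-Gamma}(3,2)\) prior has a closed form. The sketch below uses the paper's priors and design region but our own helper names; it is not the appendix's implementation:

```python
import numpy as np
from scipy.special import gammaln

def eta_mm(x, log_t1, log_t2):
    # Michaelis-Menten mean on the log-parameter scale (preserves positivity)
    t1, t2 = np.exp(log_t1), np.exp(log_t2)
    return t1 * x / (t2 + x)

def log_marg_lik(y, eta, a=3.0, b=2.0):
    # log of int N(y; eta, s2*I) * InverseGamma(s2; a, b) ds2,
    # a standard conjugate identity:
    # log Gamma(a+n/2) - log Gamma(a) + a log b
    #   - (a+n/2) log(b + SS/2) - (n/2) log 2pi
    n = len(y)
    ss = np.sum((y - eta)**2)                       # residual sum of squares
    return (gammaln(a + n/2) - gammaln(a) + a*np.log(b)
            - (a + n/2)*np.log(b + 0.5*ss) - (n/2)*np.log(2*np.pi))

rng = np.random.default_rng(5)
x = np.array([50.0, 100.0, 200.0, 300.0, 400.0])    # design points in [0, 400]
lt1 = rng.normal(4.38, 0.07)                        # prior draw of log theta_1
lt2 = rng.normal(1.19, 0.84)                        # prior draw of log theta_2
y = eta_mm(x, lt1, lt2) + rng.normal(0, 0.8, size=len(x))
print(log_marg_lik(y, eta_mm(x, lt1, lt2)))
```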

4.1.2 Biochemical oxygen demand (BOD) model

Bates and Watts (1988, Chap. 2) modelled biochemical oxygen demand \(y\) (mg/L) with the mean function

$$\begin{aligned} \eta (x, \varvec{\theta }) = \theta _1 \{ 1 -\exp (-\theta _2 x) \} , \end{aligned}$$

where x is time (in days). We adopt the following independent priors:

$$\begin{aligned} \begin{aligned}&\log \theta _1 \sim N(3.38, 0.20^2), \\&\log \theta _2 \sim N(1.098, 1.12^2), \\&\pi _b(\sigma _\epsilon ) \propto \sigma ^{-1}_\epsilon . \end{aligned} \end{aligned}$$

The prior means for \(\theta _1\), \(\theta _2\) were chosen to match the means given by DiCiccio et al. (1997), while the variances were chosen to illustrate the differences between the methods (smaller and larger variances resulted in more similar performance). Similar to the Michaelis-Menten model, we reparameterize in terms of \(\vartheta _j = \log \theta _j\) when carrying out utility approximations, and \(\sigma ^2_\epsilon \) is integrated out analytically. The design region is [0, 7].

4.1.3 Lubricant kinematic viscosity model

Bates and Watts (1988, Chap. 3) modelled the kinematic viscosity of a lubricant using the following mean function, depending on temperature, \(x_1\) (\(^\circ \)C), and pressure, \(x_2\) (atm):

$$\begin{aligned} \begin{aligned} \eta ({\mathbf {x}}, \varvec{\theta })&= \frac{\theta _1 }{\theta _2 + x_{1}} + \theta _3 x_{2} + \theta _4 x^2_{2} + \theta _5 x^3_{2} \\&\quad + (\theta _6 + \theta _7 x^2_{2}) x_{2} \exp \left\{ -\frac{x_{1}}{\theta _8 + \theta _9 x^2_{2} } \right\} . \end{aligned} \end{aligned}$$

Defining \(\theta _{10} = \log \sigma _\epsilon \), we adopt independent normal priors on \(\theta _j\), \(j=1,\ldots ,10\), with means and standard deviations equal to the maximum likelihood estimates and their standard errors based on the data from Bates and Watts (1988) (see Table 1). Unlike the previous models, no reparameterization is used for the \(\theta _j\). Moreover the noise variance is treated as an interest parameter. The design region for \((x_1,x_2)\) is \([0,100]\times [0,7]\).

4.2 Utility evaluation results

For utility evaluation, we compare the methods in terms of accuracy and computational expense. To assess accuracy, we need an approximation with negligible error to serve as a reference. For the Michaelis-Menten and BOD models we were able to obtain such a reference approximation by using naïve Monte Carlo with \(M_1=M_2= 10^6\), though this approximation is too computationally expensive for routine use. However for the lubricant model, whose parameter space is of substantially larger dimension, nMC yields unstable estimates even with such a large Monte Carlo sample size. Thus we restrict our attention to the Michaelis-Menten and BOD models in this section, though our results will suggest that expected utility can be reliably estimated for the lubricant model using the LIS and ALIS approximations.

To investigate how performance depends on the number of experimental runs, results are obtained for a variety of experiment sizes. For the Michaelis-Menten model, space-filling designs with \(n=5\), 10, and 20 are considered, while for the BOD model the design from Bates and Watts (1988) with \(n=6\) is considered alongside space-filling designs with \(n=10\) and 20. The type of space-filling design used throughout is a random Latin Hypercube design. For consistency, the same specific design realization was used throughout, so that differences between different sampled designs of the same size are not a factor in the results.

Figure 1 shows how the distribution of the estimator of expected Shannon information gain varies across different methods and different combinations of inner and outer Monte Carlo sample sizes. The results shown are for the Michaelis-Menten model and a space-filling design with \(n=5\) runs. The reference value of the expected utility is indicated by the red horizontal line. It is seen that ALIS and LIS have small bias and variance compared to all other methods with similar Monte Carlo sample size, even for small \(M_1\) and \(M_2\). The nMC, nIS, and reuse methods give highly biased and variable estimators for small Monte Carlo sample sizes, but increasing \(M_1\) and \(M_2\) reduces both the variance and bias, and with \((M_1,M_2)=(2000,10000)\) both the nMC and nIS methods give utility values comparable to LIS and ALIS. In contrast, for the Laplace approximations, increasing the Monte Carlo sample size only reduces the variance, not the bias, as these methods are intrinsically biased due to the poor quality of the asymptotic approximation when \(n=5\). This figure is quite representative of the general picture, but further insight can be obtained by combining results across several examples.

Table 2 shows the accuracy of the methods across models, Monte Carlo sample sizes, and numbers of experimental runs. Accuracy is measured by the percentage relative root mean squared error (RRMSE), i.e. \(100\times \sqrt{\text {MSE}[{\tilde{U}}(\varvec{\xi })]}/U(\varvec{\xi })\). It is seen that the most accurate methods overall are LIS and ALIS; these have excellent performance even with the low Monte Carlo sample size \((M_1=M_2=300)\). Moreover, the accuracy of LIS and ALIS remains stable or even improves as the number of experimental runs increases. nMC is the next most accurate method when the Monte Carlo sample size is large, i.e. \((M_1,M_2)=(2000,10000)\), but it performs poorly when the Monte Carlo sample size is small, i.e. \((M_1,M_2)=(300,300)\). Moreover, the performance of nMC degrades as the number of experimental runs increases. This result is intuitive: as the number of experimental runs increases, the posterior becomes more concentrated and the prior becomes a worse importance distribution. nIS has similar performance and caveats to nMC. The accuracy of LA1 is good when n is large, poor when n is small, and fairly insensitive to \(M_1\). Similar comments apply to LA2, but with slightly worse accuracy overall. The reuse estimator has relatively poor performance even with large Monte Carlo sample sizes and is not recommended.

The left part of Table 4 shows the relationship between the accuracy of the different methods and their computational cost. The timings show that the most efficient methods are LA1, ALIS, and LIS: all other methods have worse accuracy than another method with lower computational cost. In particular LIS and ALIS with \(M_1=M_2=300\) give a good trade-off between accuracy and computational expense. Increasing the Monte Carlo sample size to \((M_1,M_2)=(2000,10000)\) gives only a small increase in accuracy for a very large increase in cost. The right part of Table 4 shows how the utility evaluation time varies across methods, models and experiment sizes. It is clear that larger n results in increased evaluation time, though the relative timings of the different methods are similar for all examples.

The difference between LIS and ALIS can be considered in more detail. Table 3 shows that a high percentage of the numerical optimizations required in the LIS method can be avoided through ALIS. The percentage is higher for large n, which is intuitive since we would expect in that case that the posterior mode would be closer to the data-generating values. The percentage of avoided optimizations is also higher for the Michaelis-Menten examples than for those using the BOD model. This is consistent with the fact that the priors for the Michaelis-Menten example were chosen to have a low noise-to-signal ratio, which is anticipated to give a posterior that is relatively highly concentrated around the data-generating values.

Although ALIS greatly reduces the amount of optimization required compared to LIS, the computational cost saving in Table 4 is modest: approximately 5% for \((M_1,M_2)=(300,300)\) and 1.4% for \((M_1,M_2)=(2000,10000)\). This is due to the values of \(M_2\), which are large enough that the optimization cost is relatively small compared to the cost of the inner loop sampling and averaging. However, for smaller inner loop sample sizes the cost savings due to ALIS are much larger; Table 5 shows cost savings of 10–70% from ALIS compared to LIS when \(M_2\) ranges from 10 to 50. The computational cost saving for ALIS usually comes at the expense of a small decrease in accuracy, though in one case (BOD, \(n=6\), \(M_1=2000\), \(M_2=50\)) ALIS is both cheaper and more accurate than LIS.

The smaller values of \(M_2\) in Table 5 are more than an intellectual curiosity; there is empirical and theoretical evidence that smaller values may sometimes be a more efficient choice than setting \(M_1=M_2\). For example, for naïve Monte Carlo, asymptotic results suggest that it is optimal to take \(M_2=O(\sqrt{M_1})\) (Beck et al. 2018). Empirically, taking the BOD model with \(n=20\) as an example, we find that ALIS and LIS with \((M_1,M_2)=(300,30)\) are both cheaper and more accurate than the Laplace approximation with \(M_1=2000\).
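The naïve nested (double-loop) Monte Carlo estimator with the \(M_2=O(\sqrt{M_1})\) allocation can be sketched as follows; this is a minimal illustration on a toy conjugate model, not our implementation, and all function names are ours:

```python
import math
import random

def naive_nested_mc(sample_prior, sample_y, loglik, M1, M2=None):
    """Naive nested Monte Carlo estimate of expected Shannon
    information gain.  When M2 is omitted, it defaults to the
    asymptotically motivated allocation M2 = O(sqrt(M1))."""
    if M2 is None:
        M2 = max(1, round(math.sqrt(M1)))
    total = 0.0
    for _ in range(M1):
        psi = sample_prior()            # outer draw from the prior
        y = sample_y(psi)               # simulated response for this draw
        # inner loop: plain Monte Carlo estimate of the evidence f_E(y),
        # using fresh prior draws as the (poor) importance distribution
        evid = sum(math.exp(loglik(y, sample_prior())) for _ in range(M2)) / M2
        total += loglik(y, psi) - math.log(evid)
    return total / M1

# Toy conjugate example: psi ~ N(0,1), y | psi ~ N(psi,1), for which
# the true expected gain is 0.5*log(2), roughly 0.347.
random.seed(1)
loglik = lambda y, p: -0.5 * math.log(2 * math.pi) - 0.5 * (y - p) ** 2
est = naive_nested_mc(lambda: random.gauss(0.0, 1.0),
                      lambda p: random.gauss(p, 1.0),
                      loglik, M1=2000)
```

The inner average overestimates the utility slightly (by Jensen's inequality, since \(\log\) is applied to a noisy evidence estimate), which is the positive bias visible for nMC at small \(M_2\).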

Clearly such timings will depend on the implementation language and hardware involved, but they nonetheless give a useful idea of the relative cost of the different methods. We used C++ in R via the Rcpp and RcppArmadillo libraries (Eddelbuettel et al. 2011; Eddelbuettel and Sanderson 2014) to obtain high-performance code. Timings were carried out on a 2018 Mac Mini with a 3GHz 6-core Intel i5 processor and 8GB RAM; the calculations took place on a single core.

A common technique to gain better estimates in importance sampling is to inflate the tails of the importance distribution, e.g. by using a t-distribution. We obtained results for t importance distributions, but for brevity the results are omitted here as there was not a substantial difference in performance from the multivariate normal importance distributions. For full details see the first author’s PhD thesis (Englezou 2018).
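Sampling from such a heavier-tailed importance distribution is straightforward via the normal/chi-square mixture representation of the multivariate t; a minimal sketch, assuming NumPy, with a helper name of our own choosing:

```python
import numpy as np

def sample_mvt(mu, Sigma, df, M, rng):
    """Draw M samples from a multivariate t_df(mu, Sigma) by dividing
    mean-zero normal draws by the square root of an independent
    chi-square/df mixing variable; the heavier tails help keep
    importance weights from becoming unbounded."""
    mu = np.asarray(mu, dtype=float)
    z = rng.multivariate_normal(np.zeros(mu.size), Sigma, size=M)
    g = rng.chisquare(df, size=M) / df
    return mu + z / np.sqrt(g)[:, None]

rng = np.random.default_rng(0)
draws = sample_mvt([1.0, -1.0], np.eye(2), df=10, M=50000, rng=rng)
```

The corresponding importance weights would then use the multivariate t density in the denominator in place of the normal density.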

Fig. 2 Comparison of (near-)optimal designs found from the different utility approximation methods for the BOD model with \(n=20\). Each boxplot corresponds to the best design found from 10 random starts of the ACE algorithm using a particular method, and shows the distribution of 100 independent evaluations of the ALIS estimator of expected Shannon information gain, obtained with \(M_1=M_2=300\). \(\xi _{20}\) refers to a 20-run space-filling design

Fig. 3 Comparison of (near-)optimal designs found from the different utility approximation methods for the lubricant model with \(n=20\). Each boxplot corresponds to the best design found from 10 random starts of the ACE algorithm using a particular method, and shows the distribution of 100 independent evaluations of the ALIS estimator of expected Shannon information gain, obtained with \(M_1=M_2=300\). \(\xi _{20}\) refers to a 20-run space-filling design

4.3 Design optimization results

In this section we compare the performance of the different expected utility approximation methods for the purpose of design optimization. To enable the comparison, we found (near-)optimal designs for each of the different methods discussed in Sects. 2 and 3. This was done for the Michaelis-Menten, BOD, and lubricant models discussed in Sect. 4.1, using the ACE algorithm to perform utility optimization. The experiment sizes considered were as follows: \(n=5\), 10, and 20 for the Michaelis-Menten model; \(n=6\), 10, and 20 for the BOD model; and \(n=20\) and \(n=53\) for the lubricant model. The cases \(n=6\) for BOD and \(n=53\) for the lubricant model correspond to designs in the literature.

Different runs of the ACE algorithm may result in multiple different near-optimal designs being found for the same design problem. This can arise due to different starting designs being used and also the stochastic nature of the expected utility estimates. To obtain more stable results we therefore ran the ACE algorithm 10 times with a different random starting design for each problem, i.e. each combination of approximation method, model, and design size. The best design resulting from these random starts, judged via an independent estimate of the expected utility, was chosen as our estimate of the overall (near-)optimal design for that problem.

The designs obtained from each method were compared using an independent estimate of the expected utility calculated using ALIS with \(M_1=M_2=300\). This calculation was repeated 100 times for each design to form an empirical distribution for the estimator, thereby giving an indication of the variability of the expected utility estimate. Comparisons with a ‘naïve’ (i.e. non-optimal) design were also included for each example. This naïve design was taken from the literature where available, that is, when \(n=6\) for BOD and \(n=53\) for the lubricant model; for the other cases, a space-filling design, namely a random Latin hypercube design, was used. Figures 2 and 3 show typical results of this process for the BOD and lubricant models with \(n=20\). Similar figures for other examples are given in the first author’s PhD thesis (Englezou 2018).

Within ACE, separate Monte Carlo sample sizes must be specified for the emulator-building and accept-reject steps. These were chosen as follows. In the accept-reject step \(B=M_1=M_2=10000\) was used throughout. For the emulator-building step, in the Michaelis-Menten and BOD models we used \(M_1=2000\) for the ‘single-loop’ methods LA1 and LA2; \(M_1=M_2=2000\) was used for all other ‘double loop’ methods. For the lubricant model, we used \(M_1=M_2=300\) for LIS and ALIS, \(M_1=300\) for LA1 and LA2, and a larger sample size of \(M_1=2000\), \(M_2=10000\) for nMC and nIS. The latter was needed to avoid failure of the evaluation due to the zero evidence problem discussed in Sect. 2.1.

Combining results from across the different examples, several observations can be made. First, as anticipated, the optimized designs are better than the naïve comparator designs in all but one case. The single exception is that, as seen in Fig. 2, the design from the reuse estimator is worse than a space-filling design for the BOD model when \(n=20\). Second, for the two-parameter models, designs from the different approximations have similar expected utility in most cases, apart from the reuse and LA2 designs, which appear somewhat worse for \(n=10\) and 20. Aside from these special cases, for the two-parameter models the utility differences between the designs from different methods are usually smaller than the variability of the utility estimator for a fixed design.

Bigger differences in the performance of the designs from different methods are seen for the lubricant model, which is of substantially higher dimension. In particular, the ALIS and LIS designs substantially outperform the designs from all other methods when \(n=20\) (see Fig. 3), and from all methods except LA1 when \(n=53\). The improved performance of LA1 for large experiment sizes is expected, given the asymptotic nature of the Laplace approximation.

We did not record computational times for finding (near-)optimal designs. However, ACE is usually performed with a fixed number of iterations, and the dominant computational cost is that of the expected utility evaluations. Thus the relative costs of the different utility approximation methods will be similar to those reported in Sect. 4.2.

5 Nuisance parameters

In this section we discuss the case where the model contains nuisance parameters, meaning parameters that are not of direct interest but which must nonetheless be considered when making inference about the parameters of interest. Laplace approximations for this case have been developed by Overstall et al. (2018), and a layered multiple importance sampling approximation has been developed by Feng and Marzouk (2019), who refer to the resulting optimal designs as ‘focused’. Both of these approaches used the idea of conditioning a multivariate normal approximation to the posterior. Similar ideas can be applied in the ALIS/LIS context, as follows.

First we partition the overall parameter vector as \(\varvec{\psi }= (\varvec{\theta }^\mathrm {T}, \varvec{\gamma }^\mathrm {T})^\mathrm {T}\), where \(\varvec{\theta }\in \varTheta \subseteq {\mathbb {R}}^{p_\theta }\) is the vector of interest parameters, and \(\varvec{\gamma }\in \varGamma \subseteq {\mathbb {R}}^{p_\gamma }\) is the vector of nuisance parameters. The expected Shannon information gain for the interest parameters now takes the form

$$\begin{aligned} U(\varvec{\xi }) = \int _\varTheta \int _{{\mathbb {R}}^n} \log \frac{ f_M({\mathbf {y}} | \varvec{\theta }, \varvec{\xi }) }{ f_E({\mathbf {y}} | \varvec{\xi }) } f({\mathbf {y}}, \varvec{\theta }) d{\mathbf {y}} d\varvec{\theta } , \end{aligned}$$
(6)

where \( f_M({\mathbf {y}}|\varvec{\theta }, \varvec{\xi }) = \int _\varGamma f_R({\mathbf {y}} | \varvec{\theta }, \varvec{\gamma }, \varvec{\xi }) f_B(\varvec{\gamma }|\varvec{\theta }) d \varvec{\gamma } \) denotes the marginal density of the response after integrating out the nuisance parameters. The expected utility (6) can be estimated via

$$\begin{aligned} {\tilde{U}}_{\text {(A)LIS}}(\varvec{\xi }) = \frac{1}{M_1} \sum _{h=1}^{M_1} \log \frac{ {\tilde{f}}_M^h }{{\tilde{f}}_E^h } , \end{aligned}$$

where, as before, \( {\tilde{f}}^h_E = \frac{1}{M_2}\sum _{k=1}^{M_2} f_R({\mathbf {y}}_h | {\tilde{\varvec{\psi }}}_{hk}, \varvec{\xi }) \frac{ f_B({\tilde{\varvec{\psi }}}_{hk}) }{ q_h({\tilde{\varvec{\psi }}}_{hk}) } \) is a LIS/ALIS estimate of the evidence. In addition, we now also require a second importance sampling approximation, \({\tilde{f}}_M^h\), to estimate the marginal likelihood, \(f_M({\mathbf {y}}|\varvec{\theta }, \varvec{\xi })\), of the interest parameters after integrating out the nuisance parameters.

In particular we suggest using the following approximation for the marginal likelihood:

$$\begin{aligned} {\tilde{f}}_M^h = \frac{1}{M_3} \sum _{s=1}^{M_3} f_R({\mathbf {y}}_{h} | \varvec{\theta }_h, \tilde{\tilde{\varvec{\gamma }}}_{hs}, \varvec{\xi }) \frac{ f_B(\tilde{\tilde{\varvec{\gamma }}}_{hs} | \varvec{\theta }_h ) }{ q_{\varvec{\gamma }|\varvec{\theta }_h}( \tilde{\tilde{\varvec{\gamma }}}_{hs} ) }, \end{aligned}$$
(7)

where \(\{ \tilde{\tilde{\varvec{\gamma }}}_{hs} \}_{s=1}^{M_3}\) is an i.i.d. sample from the importance density \(q_{\varvec{\gamma }|\varvec{\theta }_h}\). To minimize the variance of the estimator, the importance distribution \(q_{\varvec{\gamma }|\varvec{\theta }_h}\) should approximate the conditional posterior \(f_A(\varvec{\gamma }|{\mathbf {y}}_h,\varvec{\theta }_h, \varvec{\xi })\) for the nuisance parameters. To obtain such an approximation we suggest closed-form conditioning of the ALIS/LIS approximation to the joint posterior, \(\varvec{\psi }|{\mathbf {y}}_h \overset{\text {approx}}{\sim } N({\hat{\varvec{\mu }}}_h, {\hat{\varvec{\Sigma }}}_h)\), giving

$$\begin{aligned} \begin{aligned} q_{\varvec{\gamma }|\varvec{\theta }_h}&\sim N\Big [ {\hat{\varvec{\mu }}}^h_{\varvec{\gamma }} + {\hat{\varvec{\Sigma }}}^h_{\varvec{\gamma }\varvec{\theta }} ({\hat{\varvec{\Sigma }}}^h_{\varvec{\theta }\varvec{\theta }})^{-1} (\varvec{\theta }_h - {\hat{\varvec{\mu }}}^h_{\varvec{\theta }}), \\&{\hat{\varvec{\Sigma }}}^h_{\varvec{\gamma }\varvec{\gamma }} - {\hat{\varvec{\Sigma }}}^h_{\varvec{\gamma }\varvec{\theta }} ({\hat{\varvec{\Sigma }}}^h_{\varvec{\theta }\varvec{\theta }})^{-1} {\hat{\varvec{\Sigma }}}^h_{\varvec{\theta }\varvec{\gamma }} \Big ] , \end{aligned} \end{aligned}$$

where \({\hat{\varvec{\mu }}}^h_{\varvec{\theta }}\), \({\hat{\varvec{\mu }}}^h_{\varvec{\gamma }}\), \({\hat{\varvec{\Sigma }}}^h_{\varvec{\theta }\varvec{\theta }}\), \({\hat{\varvec{\Sigma }}}^h_{\varvec{\theta }\varvec{\gamma }}\), \({\hat{\varvec{\Sigma }}}^h_{\varvec{\gamma }\varvec{\theta }}\), \({\hat{\varvec{\Sigma }}}^h_{\varvec{\gamma }\varvec{\gamma }}\) denote the appropriate subcomponents of \({\hat{\varvec{\mu }}}_h\) and \({\hat{\varvec{\Sigma }}}_h\).
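The closed-form conditioning above is the standard conditional distribution of a multivariate normal; a minimal sketch, assuming NumPy, with the interest parameters stored in the first \(p_\theta \) coordinates (function name and interface are ours):

```python
import numpy as np

def condition_mvn(mu, Sigma, theta, p_theta):
    """Mean and covariance of the nuisance block gamma of a joint
    N(mu, Sigma) approximation, conditioned on the first p_theta
    coordinates (the interest parameters) equalling theta."""
    mu = np.asarray(mu, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    theta = np.asarray(theta, dtype=float)
    mu_t, mu_g = mu[:p_theta], mu[p_theta:]
    S_tt = Sigma[:p_theta, :p_theta]
    S_gt = Sigma[p_theta:, :p_theta]
    S_gg = Sigma[p_theta:, p_theta:]
    A = S_gt @ np.linalg.inv(S_tt)        # regression coefficients
    cond_mu = mu_g + A @ (theta - mu_t)   # conditional mean
    cond_Sigma = S_gg - A @ S_gt.T        # conditional covariance
    return cond_mu, cond_Sigma
```

Note that the conditional covariance does not depend on \(\varvec{\theta }_h\), so it need only be computed once per outer sample \(h\), with just the conditional mean updated for each value of the interest parameters.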

Note that we already gave some examples of models with nuisance parameters in Sect. 4. Approximation (7) was not needed in those cases, as the nuisance parameters could be integrated out analytically. Approximation (7) will be more useful when the nuisance parameters cannot be integrated out analytically.
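Both \({\tilde{f}}_E^h\) and \({\tilde{f}}_M^h\) are instances of the same plain importance sampling average; the following minimal sketch (all names illustrative, not from our implementation) shows the pattern and checks it on a toy conjugate example:

```python
import math
import random

def is_estimate(y, lik, target_pdf, q_sample, q_pdf, M):
    """Importance sampling estimate of int lik(y, psi) target_pdf(psi) dpsi
    using draws psi ~ q: the average of lik(y, psi) * target_pdf(psi) / q_pdf(psi)."""
    total = 0.0
    for _ in range(M):
        psi = q_sample()
        total += lik(y, psi) * target_pdf(psi) / q_pdf(psi)
    return total / M

# Toy check: with prior N(0,1) and likelihood N(y; psi, 1), the
# evidence at y = 0 is the N(0, 2) density, about 0.2821.  Here the
# prior itself is used as q, so all weights equal one.
random.seed(2)
phi = lambda x, m, s2: math.exp(-0.5 * (x - m) ** 2 / s2) / math.sqrt(2 * math.pi * s2)
est = is_estimate(0.0,
                  lambda y, p: phi(y, p, 1.0),
                  lambda p: phi(p, 0.0, 1.0),
                  lambda: random.gauss(0.0, 1.0),
                  lambda p: phi(p, 0.0, 1.0),
                  M=20000)
```

For \({\tilde{f}}_M^h\) in (7), the draws would instead come from the conditional importance density \(q_{\varvec{\gamma }|\varvec{\theta }_h}\), with the conditional prior \(f_B(\varvec{\gamma }|\varvec{\theta }_h)\) as the target density.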

6 Discussion

Given the results here, our overall recommendation would be to use ALIS or LIS when finding (near-)optimal designs if the computational budget allows. If a smaller cost is required then LA1 may give a competitive design, provided the experiment size is sufficiently large relative to the number of parameters in the model. We would discourage the use of the other methods, especially the reuse estimator, due to their potential for poor performance.

Uptake of the ALIS and LIS methods would likely be enhanced by their inclusion in software such as the acebayes package. A major barrier to this is that, to obtain acceptable computation times, we found it necessary to hard-code various model-specific functions in C++, including the mean function \(\eta (x,\theta )\), the likelihood and prior, and their derivatives. A non-specialist user is unlikely to have the time or inclination to implement such functions in C++ for their models, even with the benefit of high-level linear algebra packages. One potential solution to this quandary may be to leverage recent probabilistic programming languages such as Stan (Stan Development Team 2021) or Turing.jl (Ge et al. 2018), a package for the Julia language (Bezanson et al. 2017). Both of these frameworks allow user-friendly high-level specification of Bayesian models while achieving performance comparable to compiled code. In addition, these frameworks support automatic differentiation, avoiding the need for detailed manual calculation of derivatives.

The results in this paper are limited to the expected Shannon information gain criterion. This is the most common choice in the literature, and it is a good one when the goal is inference and uncertainty quantification about the parameters using the full posterior distribution for reasons discussed in Sect. 1. In other situations, such as point estimation, a different utility function may be preferable. We believe that similar numerical approximations to LIS/ALIS could be developed for other utility functions. The idea of approximating the optimal importance distribution could again be used, though this would no longer be the posterior distribution. However, such methods are outside the scope of the present paper. While of interest, comparisons with other recent approaches such as amortized variational inference (Foster et al. 2019) and layered multiple importance sampling (Feng and Marzouk 2019) are also outside of scope.