Approximate Laplace importance sampling for the estimation of expected Shannon information gain in high-dimensional Bayesian design for nonlinear models

One of the major challenges in Bayesian optimal design is to approximate the expected utility function in an accurate and computationally efficient manner. We focus on Shannon information gain, one of the most widely used utilities when the experimental goal is parameter inference. We compare the performance of various methods for approximating expected Shannon information gain in common nonlinear models from the statistics literature, with a particular emphasis on Laplace importance sampling (LIS) and approximate Laplace importance sampling (ALIS), a new method that aims to reduce the computational cost of LIS. Specifically, in order to centre the importance distributions, LIS requires computation of the posterior mode for each of a large number of simulated realizations of the response vector. ALIS substantially reduces the amount of numerical optimization that is required, in some cases eliminating all optimization, by centering the importance distributions on the data-generating parameter values wherever possible. Both methods are thoroughly compared with existing approximations including Double Loop Monte Carlo, nested importance sampling, and Laplace approximation. It is found that LIS and ALIS both give an efficient trade-off between mean squared error and computational cost for utility estimation, and ALIS can be up to 70% cheaper than LIS. Usually ALIS gives an approximation that is cheaper but less accurate than LIS, while still being efficient, making it a useful addition to the suite of available methods. However, we observed one case where ALIS is both cheaper and more accurate. In addition, for the first time we show that LIS and ALIS yield superior designs to existing methods in problems with large numbers of model parameters when combined with the approximate co-ordinate exchange algorithm for design optimization.


Introduction
When designing experiments for nonlinear models there is usually uncertainty about the model parameters, ψ ∈ Ψ , and often also in the structural form of the model itself. A Bayesian approach enables this uncertainty to be taken into account coherently when choosing the variable settings to be applied in the experiment.

B Timothy W. Waite
In contrast, frequentist optimal designs such as locally optimal designs (Chernoff 1953) and minimax designs have a less satisfactory approach to a priori parameter uncertainty. Locally optimal designs are tailored for a specific set of assumed parameter values and may perform poorly if the assumed values differ from the truth. Minimax designs optimize worst-case performance, potentially at the expense of reduced efficiency in the most likely parameter scenarios.
Suppose that the design is denoted by ξ = (x 1 , . . . , x n ), where x i = (x i1 , . . . , x iq ) T ∈ R q is a vector that defines the settings of the q controllable variables to be applied to the ith experimental unit, with corresponding response y i (i = 1, . . . , n). A design ξ * is Bayesian optimal if it maximizes, with respect to ξ ∈ Ξ , the expected utility

U (ξ ) = ∫ Ψ ∫ u(ψ, y, ξ ) f R (y|ψ, ξ ) f B (ψ) dy dψ ,    (1)

where Ξ denotes the set of possible designs. Above y = (y 1 , . . . , y n ) T , with f B (ψ) denoting the prior probability density of the parameters and f R (y|ψ, ξ ) denoting the conditional probability density of the response vector under the assumed model. The utility function u is chosen to reflect the goal of the experiment, such as point estimation of ψ or hypothesis testing. We will focus on the case where the goal is to report all knowledge about the parameters via the full posterior distribution, with density f A (ψ|y, ξ ) ∝ f R (y|ψ, ξ ) f B (ψ), ensuring that this is as concentrated as possible. Here a commonly recommended utility is

u SIG (ψ, y, ξ ) = log f A (ψ|y, ξ ) − log f B (ψ) = log f R (y|ψ, ξ ) − log f E (y|ξ ) ,    (2)

involving the model evidence, defined via f E (y|ξ ) = ∫ Ψ f R (y|ψ, ξ ) f B (ψ) dψ. The above is the unique utility corresponding to a local proper scoring rule. A Bayesian optimal design for utility (2) maximizes the expected Kullback-Leibler divergence, or equivalently the expected Shannon information gain (SIG), between the prior and posterior distributions (Lindley 1956; Bernardo 1979; Chaloner and Verdinelli 1995).
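The equivalence between the expected SIG utility and the expected Kullback-Leibler divergence from prior to posterior follows from Bayes' theorem, f R (y|ψ, ξ ) f B (ψ) = f A (ψ|y, ξ ) f E (y|ξ ); a short derivation, included here for completeness:

```latex
\begin{aligned}
\mathbb{E}\left[u_{\mathrm{SIG}}(\psi, y, \xi)\right]
 &= \int_{\Psi} \int \log\frac{f_A(\psi \mid y, \xi)}{f_B(\psi)}\,
    f_R(y \mid \psi, \xi)\, f_B(\psi)\, \mathrm{d}y\, \mathrm{d}\psi \\
 &= \int \left[ \int_{\Psi} \log\frac{f_A(\psi \mid y, \xi)}{f_B(\psi)}\,
    f_A(\psi \mid y, \xi)\, \mathrm{d}\psi \right]
    f_E(y \mid \xi)\, \mathrm{d}y \\
 &= \mathbb{E}_{y}\!\left[\,\mathrm{KL}\big(f_A(\cdot \mid y, \xi)\,\big\|\,f_B\big)\right].
\end{aligned}
```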
Note that the role of the subscripts above is to ensure that different functions have different names, e.g. f A (·|·, ·) is the posterior of ψ and f B (·) is the prior for ψ. This is more precise than the simpler notation more commonly used in Bayesian statistics in which both density functions would be denoted by f and distinguished purely by their arguments; it is also shorter than the more formal probabilistic notation in which the two functions would be denoted f Θ|Y ,Ξ (·|·, ·) and f Θ (·). The more precise notation will be important later, when we wish to substitute other quantities, e.g. one denoted μ, into the posterior density of ψ. The simpler notation is considered an 'abuse of notation' by mathematicians (e.g. Gelman et al. 2013, p.6), though it is often expedient.
Despite the apparent simplicity of the above theory, until recently it was all but impossible to compute a Bayesian optimal design in practice for realistically complex experiments. This is due to the presence of two main challenges. Firstly, the (potentially high-dimensional) integrals involved in (1) and (2) are analytically intractable except for linear models with normally distributed response. Thus, in general the expected utility can only be evaluated approximately using numerical integration. Typically the outer integral in (1) is estimated via Monte Carlo. The inner integral in the model evidence in (2) can be estimated stochastically, giving Double Loop Monte Carlo (Ryan 2003) or nested Importance Sampling (Feng 2015). Alternatively, deterministic estimates such as Laplace approximations can be used (Long et al. 2013; Overstall et al. 2018). Earlier approaches such as Bayesian D-optimality relied more heavily on asymptotic approximations (Chaloner and Verdinelli 1995).
The second challenge is numerical maximization of the approximately evaluated utility. This is difficult as a result of the high dimension of the design space. In addition, due to the use of Monte Carlo, the approximate evaluations of the objective function are computationally expensive, noisy, and non-smooth. This precludes the use of standard optimization algorithms such as quasi-Newton methods or co-ordinate exchange algorithms. Instead, more sophisticated optimization techniques have been developed, one of the most promising being approximate co-ordinate exchange (ACE; Overstall and Woods 2017). Alternative methods include stochastic approximation (Huan and Marzouk 2013) and sampling-based methods (Müller et al. 2004).
The idea of the ACE algorithm is to optimize one coordinate of the design at a time using a Gaussian process emulator to form a smooth estimate of the expected utility as a function of the current co-ordinate. To ensure robustness to the quality of the emulator, each proposed change to a co-ordinate is subject to an independent emulator-free acceptance-rejection step. After making several passes through the design matrix using this process, the design points are consolidated using a point exchange procedure. An implementation is available in the R package acebayes (Overstall et al. 2019).
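The emulation step can be illustrated with a minimal one-dimensional sketch. Everything here is an illustrative assumption: a synthetic utility curve, a fixed-lengthscale squared-exponential kernel, and the helper name `emulate`; ACE itself additionally estimates the GP hyperparameters and applies an acceptance-rejection step to each proposal.

```python
import numpy as np

def emulate(x_train, u_train, x_pred, length=0.3, nugget=1e-2):
    """Posterior mean of a squared-exponential GP fitted to noisy utility
    evaluations: a smooth surrogate for the expected-utility curve."""
    def kern(a, b):
        return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)
    K = kern(x_train, x_train) + nugget * np.eye(len(x_train))
    alpha = np.linalg.solve(K, u_train - u_train.mean())
    return u_train.mean() + kern(x_pred, x_train) @ alpha

# One coordinate update: noisy evaluations of a hypothetical 1-d utility
# curve (true maximiser 0.7), smoothed and then maximised over a grid.
rng = np.random.default_rng(1)
true_u = lambda x: -(x - 0.7) ** 2
x_train = np.linspace(0.0, 1.0, 20)
u_train = true_u(x_train) + 0.05 * rng.standard_normal(20)
x_grid = np.linspace(0.0, 1.0, 200)
x_prop = x_grid[np.argmax(emulate(x_train, u_train, x_grid))]
```

The proposed co-ordinate value `x_prop` maximises the smooth emulator curve rather than the raw noisy evaluations, which is what makes a co-ordinate exchange step feasible despite the Monte Carlo noise.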
This paper makes several contributions. First, we introduce a new method for the approximation of the expected SIG utility, called Approximate Laplace Importance Sampling (ALIS). Our method is computationally cheaper (in some cases up to 70%) than the Laplace Importance Sampling (LIS) method used by Beck et al. (2018) to find low-dimensional designs for partial differential equation models, and by Senarathne et al. (2020) for sequential design. Second, we conduct a thorough comparison of ALIS and LIS with a number of other algorithms in the context of nonlinear models familiar from the statistics literature. Third, we discuss approximations to the expected SIG utility in the common case where there are nuisance parameters (cf. Feng and Marzouk 2019). Finally, we demonstrate that the use of ALIS and LIS in conjunction with the ACE optimization algorithm gives better designs than previous approximations in some models with a large number of parameters.
The expected SIG utility can be approximated by the Monte Carlo estimate

Ũ (ξ ) = (1/M 1 ) ∑_{h=1}^{M 1} { log f R (y h |ψ h , ξ ) − log f̂ E (y h |ξ ) } ,    (3)

where (ψ h , y h ), h = 1, . . . , M 1 , are independent random samples from the joint prior density, i.e. f J (ψ, y|ξ ) = f B (ψ) f R (y|ψ, ξ ), and f̂ E (y h |ξ ) is an estimate of the evidence. The main difference between the various methods is in the choice of the estimate of the evidence in (3), which affects both accuracy and computational expense. The primary distinction is whether a second Monte Carlo estimate is used, giving a 'Nested Monte Carlo' method, or a deterministic estimate such as the Laplace approximation. We detail these different methods below.

Naïve Monte Carlo
The simplest way to approximate the evidence in (3) is via a second Monte Carlo average,

f̂ E (y h |ξ ) = (1/M 2 ) ∑_{k=1}^{M 2} f R (y h |ψ̃ hk , ξ ) ,

where the 'inner sample' ψ̃ hk , k = 1, . . . , M 2 , is another independent random sample from the prior density, f B (ψ). The inner sample is chosen independently of the 'outer sample', (ψ h , y h ). This gives an overall approximation Ũ nMC (ξ ). We refer to the above approximation as naïve Monte Carlo (nMC); it is known elsewhere in the literature as Double Loop Monte Carlo (DLMC). The estimator Ũ nMC (ξ ) has variance of asymptotic order O(1/M 1 ) and positive asymptotic bias of order O(1/M 2 ) (Ryan 2003). Thus, the variance can be reduced by increasing the outer sample size, and the bias can be reduced by increasing the inner sample size.

Despite its good asymptotic properties, for practical inner sample sizes the naïve Monte Carlo estimator commonly suffers from problems with numerical underflow. When this happens one obtains a numerically negligible estimate for the evidence and a numerical estimate of infinity for the expected utility. The latter is clearly unreasonable, making it questionable whether the method can be reliably used to compare designs when M 1 and M 2 are small. This zero evidence problem is particularly acute when the posterior is highly concentrated relative to the prior. In this case the likelihood f R (y|ψ, ξ ) is numerically negligible throughout the majority of the parameter space, except on a very small neighbourhood around the maximum likelihood estimate. It is thus highly likely that all of the ψ̃ hk , which are sampled from the prior, will lie outside of this neighbourhood, giving a numerically negligible estimate of the evidence.
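As an illustration, the nMC estimator can be sketched for a toy conjugate model: a single response y | ψ ~ N(ψ, 1) with prior ψ ~ N(0, 1). This model is an assumption made purely for illustration (it is not one of the paper's examples); it is convenient because the evidence is N(y; 0, 2) in closed form and the expected SIG is exactly 0.5 log 2. A log-sum-exp average is used for the inner loop to mitigate the underflow problem discussed above.

```python
import numpy as np

# Toy conjugate model (assumed for illustration): psi ~ N(0, 1),
# y | psi ~ N(psi, 1); true expected SIG = 0.5*log(2) ~ 0.347.
rng = np.random.default_rng(0)

def log_lik(y, psi):  # log f_R(y | psi)
    return -0.5 * np.log(2 * np.pi) - 0.5 * (y - psi) ** 2

M1, M2 = 2000, 2000
psi = rng.standard_normal(M1)        # outer sample from the prior
y = psi + rng.standard_normal(M1)    # simulated responses y_h
psi_in = rng.standard_normal(M2)     # inner sample, also from the prior

# log f_E(y_h) via a log-sum-exp average over the inner sample
ll = log_lik(y[:, None], psi_in[None, :])   # (M1, M2) matrix of log-likelihoods
m = ll.max(axis=1, keepdims=True)
log_ev = m[:, 0] + np.log(np.exp(ll - m).mean(axis=1))

U_nmc = np.mean(log_lik(y, psi) - log_ev)   # nMC estimate of the expected SIG
```

With these (large) sample sizes and an easy target, the estimate lands close to the true value 0.5 log 2; with small M 2 and a concentrated posterior the same code would exhibit exactly the zero-evidence behaviour described in the text.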

Reuse estimator
To alleviate the numerical stability problems of the Naïve Monte Carlo estimator, Huan and Marzouk (2013) proposed the reuse approximation, which uses the same parameter sample in both the inner and outer summation. The asymptotic bias of the reuse estimator has the same order of magnitude as that of the naïve Monte Carlo method. However the reuse estimator is more numerically stable for small Monte Carlo sample sizes. In particular, it will usually give a finite estimate of the expected utility gain because each inner sum contains the term f R (y h | ψ h , ξ ), which is non-negligible as ψ h is the parameter vector used to generate y h in the simulation.
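A sketch of the reuse estimator for the same style of toy conjugate model (again an illustrative assumption, not one of the paper's examples): the only change from nMC is that the inner average runs over the outer draws themselves.

```python
import numpy as np

# Toy conjugate model (assumed for illustration): psi ~ N(0, 1),
# y | psi ~ N(psi, 1); true expected SIG = 0.5*log(2).
rng = np.random.default_rng(0)

def log_lik(y, psi):
    return -0.5 * np.log(2 * np.pi) - 0.5 * (y - psi) ** 2

M1 = 2000
psi = rng.standard_normal(M1)        # single prior sample, used twice
y = psi + rng.standard_normal(M1)

# Inner sum reuses the outer draws: row h always contains the non-negligible
# term log f_R(y_h | psi_h), which guards against zero evidence estimates.
ll = log_lik(y[:, None], psi[None, :])
m = ll.max(axis=1, keepdims=True)
log_ev = m[:, 0] + np.log(np.exp(ll - m).mean(axis=1))

U_reuse = np.mean(log_lik(y, psi) - log_ev)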

Laplace approximations
The literature contains two methods for using Laplace approximations to avoid nested Monte Carlo integration. For the first method, considered by Overstall et al. (2018) and denoted LA1 here, equation (3) is used with the standard Laplace approximation to the evidence, giving

f̂ E (y h |ξ ) = (2π) p/2 |H (ψ̂ h )| −1/2 f̃ A (ψ̂ h |y h , ξ ) ,

where p is the dimension of ψ and f̃ A (ψ|y, ξ ) = f R (y|ψ, ξ ) f B (ψ) denotes the unnormalized posterior. In addition ψ̂ h = argmax ψ f̃ A (ψ|y h , ξ ) denotes the posterior mode for the hth response realization, while H (ψ̂ h ) denotes the Hessian of the negative log-posterior at the mode. This asymptotic approximation should be accurate provided n is large enough for the posterior to be approximately normal, and can be used to find efficient designs for a wide range of sample sizes.

A second method, denoted LA2 here, was considered by Long et al. (2013). This requires additionally that the sample size is large enough for the posterior to be highly concentrated around the posterior mode. In this case, both the log-posterior and the log-prior can be approximated within the region of highest posterior density by a second-order Taylor expansion, giving the approximation

u(y, ξ ) ≈ (1/2) log |H y | − (p/2) log(2π) − p/2 − log f B (ψ̂ y )

to the Shannon information gain given response y, where ψ̂ y denotes the posterior mode given response vector y and H y the corresponding Hessian of the negative log-posterior. Above, the last line has been obtained using the entropy of the normal approximation, N (ψ̂ y , H y −1 ), to the posterior.
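To make the LA1 recipe concrete, the following sketch applies the Laplace evidence approximation in the same toy conjugate model used above (y | ψ ~ N(ψ, 1), ψ ~ N(0, 1); an assumption for illustration, not one of the paper's models). For this model the posterior is exactly normal, so the Laplace approximation to the evidence is exact and the resulting utility estimate reduces to an ordinary Monte Carlo average over the outer sample.

```python
import numpy as np

# Toy conjugate model (assumed): psi ~ N(0, 1), y | psi ~ N(psi, 1).
rng = np.random.default_rng(0)

def log_lik(y, psi):
    return -0.5 * np.log(2 * np.pi) - 0.5 * (y - psi) ** 2

def log_post_unnorm(psi, y):          # log f_R(y|psi) + log f_B(psi)
    return log_lik(y, psi) - 0.5 * np.log(2 * np.pi) - 0.5 * psi ** 2

M1 = 2000
psi = rng.standard_normal(M1)
y = psi + rng.standard_normal(M1)

psi_hat = y / 2.0                     # posterior mode (closed form here)
H = 2.0                               # Hessian of the negative log-posterior
# Laplace evidence: (2*pi)^{p/2} |H|^{-1/2} * unnormalized posterior at mode
log_ev = 0.5 * np.log(2 * np.pi) - 0.5 * np.log(H) + log_post_unnorm(psi_hat, y)

U_la1 = np.mean(log_lik(y, psi) - log_ev)   # LA1 estimate of expected SIG
```

In a genuinely nonlinear model the mode and Hessian would instead come from a numerical optimization per response realization, and the approximation error of the normal approximation would enter as bias.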

Importance sampling
Another Monte Carlo method for estimating the evidence is importance sampling, i.e.
f̂ E (y h |ξ ) = (1/M 2 ) ∑_{k=1}^{M 2} f R (y h |ψ̃ hk , ξ ) f B (ψ̃ hk ) / q h (ψ̃ hk ) ,

where ψ̃ hk , k = 1, . . . , M 2 , is an independent sample from the importance density q h . Note that nMC corresponds to the special case where the prior is chosen as the importance density. By standard theory (e.g. Lemieux 2009, p.114), the optimal importance density q h * is the posterior density of ψ given y h . This gives a zero variance unbiased (i.e. error-free) estimator; unfortunately the optimal importance density cannot be used in practice as it requires knowledge of the evidence, the quantity we are trying to estimate.
The above discussion suggests that a good choice of importance density would be a computationally cheap approximation to the posterior, such as N (ψ; μ̂ h , Σ̂ h ) or t ν (ψ; μ̂ h , Σ̂ h ), where μ̂ h and Σ̂ h are approximations to the mean vector and variance matrix of f A (ψ|y h , ξ ). Here N (·; μ, Σ) and t ν (·; μ, Σ) denote respectively the probability density function of a multivariate normal and multivariate t distribution with mean μ, variance matrix Σ, and degrees of freedom ν for the latter. Below we discuss different methods for choosing μ̂ h and Σ̂ h .

Laplace-type Importance Sampling Methods
Laplace-type importance sampling methods set the variance of the importance distribution as Σ̂ h = H h (μ̂ h ) −1 , the inverse of the Hessian of the negative log-posterior given y h evaluated at the mean μ̂ h . The two variants, Laplace Importance Sampling (LIS) and Approximate Laplace Importance Sampling (ALIS), are distinguished via the choice of mean μ̂ h .

With LIS, the mean is approximated using μ̂ h = ψ̂ h , the mode of the posterior density f A (ψ|y h , ξ ).
This necessitates a total of M 1 potentially costly numerical optimizations to find the mode, ψ̂ h (h = 1, . . . , M 1 ), of the posterior distribution: one for each of the M 1 simulated response vectors, y 1 , . . . , y M 1 , in the outer sample. However, provided the search for ψ̂ h is initialized at the data-generating parameter values ψ h , it typically converges in a small number of iterations. We performed these optimizations using the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method. Algorithm 1 gives a detailed description of the LIS method and our proposed new variant, ALIS. See Ryan et al. (2015) and Beck et al. (2018) for low-dimensional examples of design selection with LIS.
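The LIS loop can be sketched for a toy conjugate model (ψ ~ N(0, 1), y | ψ ~ N(ψ, 1); assumed here for illustration only, with the paper's BFGS choice mirrored via `scipy.optimize.minimize`). Each iteration finds the posterior mode starting from the data-generating value, then estimates the evidence by importance sampling from a normal centred at the mode.

```python
import numpy as np
from scipy.optimize import minimize

# Toy conjugate model (assumed): psi ~ N(0, 1), y | psi ~ N(psi, 1);
# true expected SIG = 0.5*log(2).
rng = np.random.default_rng(0)

def log_lik(y, psi):
    return -0.5 * np.log(2 * np.pi) - 0.5 * (y - psi) ** 2

def log_prior(psi):
    return -0.5 * np.log(2 * np.pi) - 0.5 * psi ** 2

def neg_log_post(psi, y):            # unnormalized negative log-posterior
    p = psi[0]
    return 0.5 * (y - p) ** 2 + 0.5 * p ** 2

M1, M2 = 500, 100
psi0 = rng.standard_normal(M1)       # data-generating parameter values
y = psi0 + rng.standard_normal(M1)

H = 2.0                              # Hessian of neg_log_post (constant here)
sd = H ** -0.5                       # importance s.d. = H^{-1/2}
terms = []
for h in range(M1):
    # BFGS mode search initialized at the data-generating value psi0[h]
    mode = minimize(neg_log_post, x0=[psi0[h]], args=(y[h],),
                    method="BFGS").x[0]
    draws = mode + sd * rng.standard_normal(M2)            # importance sample
    log_q = -0.5 * np.log(2 * np.pi * sd ** 2) \
            - 0.5 * ((draws - mode) / sd) ** 2
    lw = log_lik(y[h], draws) + log_prior(draws) - log_q   # log weights
    m = lw.max()
    log_ev = m + np.log(np.exp(lw - m).mean())
    terms.append(log_lik(y[h], psi0[h]) - log_ev)

U_lis = np.mean(terms)               # LIS estimate of the expected SIG
```

Because the importance density coincides with the exact posterior in this conjugate toy, the inner evidence estimates are essentially error-free even with M 2 = 100, which is the behaviour that makes LIS attractive at small inner sample sizes.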

ALIS
The key observation underpinning ALIS is that the posterior mode used to centre the importance distribution in LIS is frequently close to the data-generating values, i.e. it is often the case that ψ̂ h ≈ ψ h . Given this, an obvious question is whether it is possible to reduce the computational cost of the LIS method by removing some of the optimization steps, setting μ̂ h = ψ h , the data-generating parameter values, wherever possible. For this choice to work we require at a minimum that H h (ψ h ) is positive definite, since without this Σ̂ h would not be a valid covariance matrix and it would be impossible to sample from the importance distribution q h . The ALIS importance distribution is thus centred at

μ̂ h = ψ h if H h (ψ h ) is positive definite, and μ̂ h = ψ̂ h otherwise.

We show in Sect. 4.2 that this choice gives a method with comparable accuracy but lower computational cost.
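The ALIS decision rule can be sketched as a small helper (the function name `alis_centre` and the callback are hypothetical, for illustration only): test positive definiteness of the Hessian at the data-generating value via a Cholesky factorization, and invoke a costly LIS-style mode search only when that test fails.

```python
import numpy as np

def alis_centre(psi_h, H_at_psi, find_mode):
    """Return the centre of the ALIS importance distribution.

    psi_h     : data-generating parameter vector
    H_at_psi  : Hessian of the negative log-posterior evaluated at psi_h
    find_mode : zero-argument callback running a (costly) numerical
                optimization for the posterior mode; only invoked if needed
    """
    try:
        np.linalg.cholesky(H_at_psi)   # succeeds iff H_at_psi is pos. def.
        return psi_h                   # no optimization required
    except np.linalg.LinAlgError:
        return find_mode()             # fall back to the LIS posterior mode

# Positive definite Hessian: centre at the data-generating value.
c1 = alis_centre(np.array([0.3]), np.array([[2.0]]), lambda: np.array([0.15]))
# Indefinite Hessian: fall back to the mode returned by the optimizer.
c2 = alis_centre(np.array([0.3]), np.array([[-1.0]]), lambda: np.array([0.15]))
```

The Cholesky attempt is itself cheap relative to an optimization run, which is where the cost saving over LIS comes from when the fallback is rarely triggered.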

Nested importance sampling
In nested importance sampling (nIS; Feng 2015), the posterior mean μ h and variance Σ h are approximated via self-normalized importance sampling using the outer sample. This gives

μ̂ h = ∑_{k=1}^{M 1} w̄ hk ψ k and Σ̂ h = ∑_{k=1}^{M 1} w̄ hk (ψ k − μ̂ h )(ψ k − μ̂ h ) T ,

where w̄ hk ∝ f R (y h |ψ k , ξ ) are self-normalized importance weights with ∑ k w̄ hk = 1. This approach has the potential to suffer from low effective sample size. To counter this, Feng proposed to revert to the original naïve Monte Carlo estimate of the evidence if ESS h = 1/(∑_{k=1}^{M 1} w̄ hk 2 ) drops below a prespecified minimum effective sample size.
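The self-normalized moment estimates and the ESS fallback can be sketched as follows, again in an assumed toy conjugate model (ψ ~ N(0, 1), y | ψ ~ N(ψ, 1), so the posterior given y h is N(y h /2, 1/2)); the function name `nis_moments` and the threshold value are illustrative.

```python
import numpy as np

# Toy conjugate model (assumed for illustration): psi ~ N(0, 1),
# y | psi ~ N(psi, 1); posterior given y_h is N(y_h/2, 1/2).
rng = np.random.default_rng(0)

M1 = 2000
psi = rng.standard_normal(M1)            # outer prior sample
y = psi + rng.standard_normal(M1)

def nis_moments(y_h, psi, ess_min=50):
    """Self-normalized IS estimates of the posterior mean and variance,
    or None when the effective sample size falls below ess_min."""
    logw = -0.5 * (y_h - psi) ** 2       # log f_R(y_h | psi_k) up to a constant
    w = np.exp(logw - logw.max())
    w /= w.sum()                         # normalized weights
    if 1.0 / np.sum(w ** 2) < ess_min:   # low ESS: signal a naive-MC fallback
        return None
    mu = np.sum(w * psi)
    var = np.sum(w * (psi - mu) ** 2)
    return mu, var

moments = nis_moments(y[0], psi)         # close to (y[0]/2, 0.5) here
```

Returning `None` stands in for Feng's reversion to the naïve Monte Carlo evidence estimate; in a concentrated-posterior problem the ESS check fires often, which is the caveat noted in the results section.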

Models for performance assessment
In this section we compare the performance of the methods from Sects. 2 and 3. Results are given for two tasks: (i) evaluation of the utility function and (ii) design selection, in Sects. 4.2 and 4.3 respectively. Three different nonlinear models are considered, all of the form

y i = η(x i , θ ) + ε i , ε i ∼ N (0, σ 2 ) independently, i = 1, . . . , n.

The models differ with respect to their mean functions η, parameters, and priors, with details given below.
Michaelis-Menten model

The prior on θ 2 is relatively diffuse, implying a wide range of possible shapes of the response curve. The prior on σ 2 was chosen to imply a low noise-to-signal ratio as this leads to a relatively concentrated posterior, the most demanding scenario for methods such as naïve Monte Carlo and nested importance sampling. Specifically, the 10% and 90% quantiles of σ /η(400, θ ) are 0.009 and 0.02 respectively, where η(x, θ ) = θ 1 x/(θ 2 + x) is the Michaelis-Menten mean function. In order to preserve the positivity constraint on θ 1 and θ 2 , we reparameterize the model in terms of ϑ 1 = log θ 1 and ϑ 2 = log θ 2 . The resulting normal ALIS/LIS importance distribution on the ϑ scale effectively implies a log-normal importance distribution on the θ scale. Note that reparameterization does not change the value of the expected Shannon information gain, but it does change our numerical estimate thereof due to the modified importance distributions. Here the noise variance σ 2 can be integrated out analytically owing to its conjugate prior. See the appendix for full technical details of the calculations for the marginal likelihood and its derivatives.

Biochemical oxygen demand (BOD) model

Bates and Watts (1988, Chap. 2) modelled biochemical oxygen demand y (mg/L) with the mean function η(x, θ ) = θ 1 {1 − exp(−θ 2 x)}, where x is time (in days). We adopt the following independent priors: log θ 1 ∼ N (3.38, 0.20 2 ), log θ 2 ∼ N (1.098, 1.12 2 ), together with a conjugate prior on σ 2 . The prior means for θ 1 , θ 2 were chosen to match the means given by DiCiccio et al. (1997), while the variances were chosen to illustrate the differences between the methods (smaller and larger variances resulted in more similar performance). Similar to the Michaelis-Menten model, we reparameterize in terms of ϑ j = log θ j when carrying out utility approximations, and σ 2 is integrated out analytically. The design region is [0, 7].

Lubricant kinematic viscosity model

Bates and Watts (1988, Chap. 3) modelled the kinematic viscosity of a lubricant using a mean function depending on temperature, x 1 ( °C), and pressure, x 2 (atm). Defining θ 10 = log σ , we adopt independent normal priors on θ j , j = 1, . . . , 10, with means and standard deviations equal to the maximum likelihood estimates and their standard errors based on the data from Bates and Watts (1988) (see Table 1). Unlike the previous models, no reparameterization is used for the θ j . Moreover the noise variance is treated as an interest parameter. The design region for (

Utility evaluation results
For utility evaluation, we compare the methods in terms of accuracy and computational expense. To assess accuracy, we need an approximation with negligible error to serve as a reference. For the Michaelis-Menten and BOD models we were able to obtain such a reference approximation by using naïve Monte Carlo with M 1 = M 2 = 10 6 , though this approximation is too computationally expensive for routine use. However, for the lubricant model, whose parameter space is of substantially larger dimension, nMC yields unstable estimates even with such a large Monte Carlo sample size. Thus we restrict our attention to the Michaelis-Menten and BOD models in this section, though our results will suggest that expected utility can be reliably estimated for the lubricant model using the LIS and ALIS approximations.
To investigate how performance depends on the number of experimental runs, results are obtained for a variety of experiment sizes. For the Michaelis-Menten model space-filling designs with n = 5, 10, and 20 are considered, while for the BOD model the design from Bates and Watts (1988) with n = 6 is considered alongside space-filling designs with n = 10 and 20. The type of space-filling design used throughout is a random Latin Hypercube design. For consistency the same specific design realisation was used throughout, so that differences between different sampled designs of the same size are not a factor in the results.

Figure 1 shows how the distribution of the estimator of expected Shannon information varies across different methods and different combinations of inner and outer Monte Carlo sample sizes. The results shown are for the Michaelis-Menten model and a space-filling design with n = 5 runs. The reference value of the expected utility is indicated by the red horizontal line. It is seen that ALIS and LIS have small bias and variance compared to all other methods with similar Monte Carlo sample size, even for small M 1 and M 2 . The nMC, nIS, and reuse methods give highly biased and variable estimators for small Monte Carlo sample size, but increasing M 1 and M 2 reduces both the variance and bias, and with (M 1 , M 2 ) = (2000, 10000) both the nMC and nIS methods give comparable utility values to LIS and ALIS. In contrast, for the Laplace approximations, increasing the Monte Carlo sample size only reduces the variance, not the bias, as these methods are intrinsically biased due to the poor quality of the asymptotic approximation when n = 5. This figure is quite representative of the general picture, but further insight can be obtained by combining results across several examples. Table 2 shows the accuracy of the methods across models, Monte Carlo sample sizes, and numbers of experimental runs.
Accuracy is measured by the percentage relative root mean squared error (RRMSE), i.e. 100 × √(MSE[Ũ (ξ )])/U (ξ ). It is seen that the most accurate methods overall are LIS and ALIS; these have excellent performance even with low Monte Carlo sample size (M 1 = M 2 = 300). Moreover, the accuracy of LIS and ALIS remains stable or even improves as the number of experimental runs increases. nMC is the next most accurate method when the Monte Carlo sample size is large, but it performs poorly when the Monte Carlo sample size is small, i.e. when (M 1 , M 2 ) = (2000, 10000) and (300, 300) respectively. Moreover, the performance of nMC degrades as the number of experimental runs increases. This result is intuitive: as the number of experimental runs increases, the posterior will become more concentrated and the prior will become a worse importance distribution. nIS has similar performance and caveats to nMC. The accuracy of LA1 is good when n is large, poor when n is small, and fairly insensitive to M 1 . Similar comments apply to LA2, but with slightly worse accuracy overall. The reuse estimator has relatively poor performance even with large Monte Carlo sample sizes and is not recommended.
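For concreteness, the RRMSE criterion is a one-liner given a set of repeated estimator evaluations and a reference value (the helper name `rrmse` is ours, for illustration):

```python
import numpy as np

def rrmse(estimates, reference):
    """Percentage relative root mean squared error of utility estimates
    against a (near-)exact reference value."""
    mse = np.mean((np.asarray(estimates, dtype=float) - reference) ** 2)
    return 100.0 * np.sqrt(mse) / reference

# e.g. three estimates scattered around a reference value of 1.0
err = rrmse([0.9, 1.1, 1.0], 1.0)   # 100 * sqrt(0.02/3), about 8.16%
```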
The left part of Table 4 shows the relationship between the accuracy of the different methods and their computational cost. The timings show that the most efficient methods are LA1, ALIS, and LIS: all other methods have worse accuracy than another method with lower computational cost. In particular LIS and ALIS with M 1 = M 2 = 300 give a good trade-off between accuracy and computational expense. Increasing the Monte Carlo sample size to (M 1 , M 2 ) = (2000, 10000) gives only a small increase in accuracy for a very large increase in cost. The right part of Table 4 shows how the utility evaluation time varies across methods, models and experiment sizes. It is clear that larger n results in increased evaluation time, though the relative timings of the different methods are similar for all examples.
The difference between LIS and ALIS can be considered in more detail. Table 3 shows that a high percentage of the numerical optimizations required in the LIS method can be avoided through ALIS. The percentage is higher for large n, which is intuitive since we would expect in that case that the posterior mode would be closer to the data-generating values.
The percentage of avoided optimizations is also higher for the Michaelis-Menten examples than for those using the BOD model. This is consistent with the fact that the priors for the Michaelis-Menten example were chosen to have a low noiseto-signal ratio, which is anticipated to give a posterior that is relatively highly concentrated around the data-generating values.
Although ALIS greatly reduces the amount of optimization required compared to LIS, the computational cost saving in Table 4 is modest: approximately 5% for (M 1 , M 2 ) = (300, 300) and 1.4% for (M 1 , M 2 ) = (2000, 10000). This is due to the values of M 2 , which are large enough that the optimization cost is relatively small compared to the cost of the inner loop sampling and averaging. However, for smaller inner loop sample sizes the cost savings due to ALIS are much larger; Table 5 shows cost savings of 10-70% from ALIS compared to LIS when M 2 ranges from 10 to 50. The computational cost saving for ALIS usually comes at the expense of a small decrease in accuracy, though in one case (BOD, n = 6, M 1 = 2000, M 2 = 50) ALIS is both cheaper and more accurate than LIS.
The smaller values of M 2 in Table 5 are more than an intellectual curiosity; there is empirical and theoretical evidence that smaller values may sometimes be a more efficient choice than setting M 1 = M 2 . E.g. for naïve Monte Carlo asymptotic results suggest it is optimal to take M 2 = O( √ M 1 ) (Beck et al. 2018). Empirically, taking as an example the BOD model with n = 20, we find that ALIS and LIS with (M 1 , M 2 ) = (300, 30) are both cheaper and more accurate than the Laplace approximation with M 1 = 2000.
Clearly such timings will depend on the implementation language and hardware involved, but they nonetheless give a useful idea of the relative cost of the different methods. We used C++ in R via the Rcpp and RcppArmadillo libraries (Eddelbuettel et al. 2011; Eddelbuettel and Sanderson 2014) to obtain high-performance code. Timings were carried out on a 2018 Mac Mini with a 3 GHz 6-core Intel i5 processor and 8 GB RAM; the calculations took place on a single core.
A common technique to gain better estimates in importance sampling is to inflate the tails of the importance distribution, e.g. by using a t-distribution. We obtained results for t importance distributions, but for brevity the results are omitted here as there was not a substantial difference in performance from the multivariate normal importance distributions. For full details see the first author's PhD thesis (Englezou 2018).

Design optimization results
In this section we compare the performance of the different expected utility approximation methods for the purpose of design optimization. To enable the comparison we found (near-)optimal designs for each of the different methods discussed in Sects. 2 and 3. This was done for the Michaelis-Menten, BOD, and lubricant models discussed in Sect. 4.1, using the ACE algorithm to perform utility optimization. The experiment sizes considered were as follows: n = 5, 10, and 20 for the Michaelis-Menten model; n = 6, 10, and 20 for BOD; and n = 20 and n = 53 for the lubricant model. The cases n = 6 for BOD and n = 53 for the lubricant model correspond to designs in the literature.

Fig. 2 Comparison of (near-)optimal designs found from the different utility approximation methods for the BOD model with n = 20. Each boxplot corresponds to the best design found from 10 random starts of the ACE algorithm using a particular method, and shows the distribution of 100 independent evaluations of the ALIS estimator of expected Shannon information gain, obtained with M 1 = M 2 = 300. ξ 20 refers to a 20-run space-filling design

Fig. 3 Comparison of (near-)optimal designs found from the different utility approximation methods for the lubricant model with n = 20. Each boxplot corresponds to the best design found from 10 random starts of the ACE algorithm using a particular method, and shows the distribution of 100 independent evaluations of the ALIS estimator of expected Shannon information gain, obtained with M 1 = M 2 = 300. ξ 20 refers to a 20-run space-filling design
Different runs of the ACE algorithm may result in multiple different near-optimal designs being found for the same design problem. This can arise due to different starting designs being used and also the stochastic nature of the expected utility estimates. To obtain more stable results we therefore ran the ACE algorithm 10 times with a different random starting design for each problem, i.e. each combination of approximation method, model, and design size. The best design resulting from these random starts, judged via an independent estimate of the expected utility, was chosen as our estimate of the overall (near-)optimal design for that problem.
The designs obtained from each method were compared using an independent estimate of the expected utility calculated using ALIS with M 1 = M 2 = 300. This calculation was repeated 100 times for each design to form an empirical distribution for the estimator, thereby giving an indication of the variability of the expected utility estimate. Comparisons with a 'naïve' (i.e. non-optimal) design were also included for each example. This naïve design was taken from the literature where available, that is when n = 6 for BOD and n = 53 for the lubricant model. For other cases a space-filling design was used as the naïve design, namely a random Latin Hypercube design. Figures 2 and 3 show typical results from this process for the BOD and lubricant models with n = 20. Similar figures for other examples are given in the first author's PhD thesis (Englezou 2018).
Within ACE, separate Monte Carlo sample sizes must be specified for the emulator-building and accept-reject steps.
These were chosen as follows. In the accept-reject step B = M 1 = M 2 = 10000 was used throughout. For the emulator-building step, in the Michaelis-Menten and BOD models we used M 1 = 2000 for the 'single-loop' methods LA1 and LA2; M 1 = M 2 = 2000 was used for all other 'double-loop' methods. For the lubricant model, we used M 1 = M 2 = 300 for LIS and ALIS, M 1 = 300 for LA1 and LA2, and a larger sample size of M 1 = 2000, M 2 = 10000 for nMC and nIS. The latter was needed to avoid failure of the evaluation due to the zero evidence problem discussed in Sect. 2.1.
Combining results from across the different examples, several observations can be made. First, as anticipated, the optimized designs are better than the naïve comparator designs in all but one case. The single exception is that, as seen in Fig. 2, the design from the reuse estimator is worse than a space-filling design for the BOD model when n = 20. Second, for the two-parameter models, in most cases designs from the different approximations have similar expected utility, aside from the reuse and LA2 designs which appear somewhat worse for n = 10 and 20. Aside from these special cases for two-parameter models, the utility differences between the designs from different methods are usually smaller than the variability of the utility estimator for a fixed design.
Bigger differences in the performance of the designs from different methods are seen for the lubricant model, which is of substantially higher dimension. In particular, the ALIS and LIS designs substantially outperform the designs from all other methods when n = 20 (see Fig. 3), and all methods except LA1 when n = 53. This improved performance for LA1 for large experiment sizes is expected due to the asymptotic nature of the Laplace approximation.
We did not record computational times for finding (near-)optimal designs. However, ACE is usually performed with a fixed number of iterations, and the dominant computational cost is that of the expected utility evaluations. Thus the relative cost for the different utility approximation methods will be similar to Sect. 4.2.

Nuisance parameters
In this section we discuss the case where the model contains nuisance parameters, meaning parameters that are not of direct interest but which must nonetheless be considered when making inference about the parameters of interest. Laplace approximations for this case have been developed by Overstall et al. (2018), and a Layered Multiple Importance Sampling approximation has been developed by Feng and Marzouk (2019), who refer to the resulting optimal designs as 'focused'. Both of these approaches used the idea of conditioning a multivariate normal approximation to the posterior. Similar ideas can be applied in the ALIS/LIS context, as follows.
First we partition the overall parameter vector as $\psi = (\theta^T, \gamma^T)^T$, where $\theta \in \Theta \subseteq \mathbb{R}^{p_\theta}$ is the vector of interest parameters and $\gamma \in \Gamma \subseteq \mathbb{R}^{p_\gamma}$ is the vector of nuisance parameters. The expected Shannon information gain for the interest parameters now takes the form
$$U(\xi) = \int \int \left\{ \log f_M(y \mid \theta, \xi) - \log f_E(y \mid \xi) \right\} f_M(y \mid \theta, \xi)\, f_B(\theta)\, \mathrm{d}\theta\, \mathrm{d}y, \qquad (6)$$
where $f_M(y \mid \theta, \xi) = \int_\Gamma f_R(y \mid \theta, \gamma, \xi)\, f_B(\gamma \mid \theta)\, \mathrm{d}\gamma$ denotes the marginal density of the response after integrating out the nuisance parameters. The expected utility (6) can be estimated via
$$\tilde{U}(\xi) = \frac{1}{M_1} \sum_{h=1}^{M_1} \left\{ \log \tilde{f}^h_M - \log \tilde{f}^h_E \right\},$$
where $\tilde{f}^h_E$ is a LIS/ALIS estimate of the evidence $f_E(y_h \mid \xi)$. In addition, we now also require a second importance sampling approximation, $\tilde{f}^h_M$, to estimate the marginal likelihood, $f_M(y_h \mid \theta_h, \xi)$, of the interest parameters after integrating out the nuisance parameters.
In particular we suggest using the following approximation for the marginal likelihood:
$$\tilde{f}_M(y_h \mid \theta_h, \xi) = \frac{1}{M_3} \sum_{s=1}^{M_3} \frac{f_R(y_h \mid \theta_h, \gamma_{hs}, \xi)\, f_B(\gamma_{hs} \mid \theta_h)}{q_{\gamma \mid \theta_h}(\gamma_{hs})}, \qquad (7)$$
where $\{\gamma_{hs}\}_{s=1}^{M_3}$ is an i.i.d. sample from the importance density $q_{\gamma \mid \theta_h}$. To minimize the variance of the estimator, the importance distribution $q_{\gamma \mid \theta_h}$ should approximate the conditional posterior $f_A(\gamma \mid y_h, \theta_h, \xi)$ of the nuisance parameters. To obtain such an approximation we suggest closed-form conditioning of the ALIS/LIS multivariate normal approximation to the joint posterior, $\psi \mid y_h \approx N(\mu_h, \Sigma_h)$: partitioning $\mu_h$ and $\Sigma_h$ conformably with $\psi = (\theta^T, \gamma^T)^T$, standard conditioning results give
$$\gamma \mid \theta_h, y_h \approx N\!\left(\mu_{h,\gamma} + \Sigma_{h,\gamma\theta}\Sigma_{h,\theta\theta}^{-1}(\theta_h - \mu_{h,\theta}),\; \Sigma_{h,\gamma\gamma} - \Sigma_{h,\gamma\theta}\Sigma_{h,\theta\theta}^{-1}\Sigma_{h,\theta\gamma}\right).$$
Note that we already gave some examples of models with nuisance parameters in Sect. 4. Approximation (7) was not needed in those cases, as the nuisance parameters could be integrated out analytically. Approximation (7) will be more useful when this integral is analytically intractable.
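To make the role of the importance density concrete, the following toy sketch (not from the paper; the names `npdf` and `marginal_is` and all numerical values are hypothetical) applies an estimator of the same form as (7) to a single scalar nuisance parameter with a standard normal prior and a single Gaussian observation. When the importance density $q$ equals the exact conditional posterior, the importance weights are constant, so the estimator is (numerically) exact; this is why $q_{\gamma\mid\theta_h}$ should approximate the conditional posterior.

```python
import math
import random

def npdf(x, m, v):
    """Normal density with mean m and variance v."""
    return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

def marginal_is(y, q_mean, q_var, n_samples, rng):
    """Importance-sampling estimate of f_M(y) = integral of f_R(y|g) f_B(g) dg,
    mirroring the form of (7): average f_R * f_B / q over draws g ~ q."""
    total = 0.0
    for _ in range(n_samples):
        g = rng.gauss(q_mean, math.sqrt(q_var))
        # likelihood y | g ~ N(g, 1), prior g ~ N(0, 1), importance density q
        total += npdf(y, g, 1.0) * npdf(g, 0.0, 1.0) / npdf(g, q_mean, q_var)
    return total / n_samples

y = 1.3
exact = npdf(y, 0.0, 2.0)  # analytic marginal in this conjugate toy model: y ~ N(0, 2)
# q taken as the exact conditional posterior N(y/2, 1/2): the weights are
# constant, so even a small sample recovers the marginal essentially exactly
est = marginal_is(y, y / 2, 0.5, 100, random.Random(1))
```

With a poorer choice of $q$ the estimator remains unbiased but its variance grows, which is the practical motivation for conditioning the LIS/ALIS normal approximation rather than, say, reusing the prior as the importance density.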

Discussion
Given the results here, our overall recommendation is to use ALIS or LIS when finding (near-)optimal designs if the computational budget allows. If a smaller cost is required, then LA1 may give a competitive design provided the experiment size is sufficiently large relative to the number of parameters in the model. We would discourage the use of the other methods, especially the reuse estimator, due to their potential for poor performance.
Uptake of the ALIS and LIS methods would likely be enhanced by their inclusion in software such as the acebayes package. A major barrier to this is that, to obtain acceptable computation times, we found it necessary to hard-code various model-specific functions in C++, including the mean function η(x, θ), the likelihood and prior, and their derivatives. A non-specialist user is unlikely to have the time or inclination to implement such functions in C++ for their models, even with the benefit of high-level linear algebra packages. One potential solution to this quandary may be to leverage recent probabilistic programming languages such as Stan (Stan Development Team 2021) or Turing.jl (Ge et al. 2018), a package for the Julia language (Bezanson et al. 2017). Both of these frameworks allow user-friendly, high-level specification of Bayesian models while achieving performance comparable to compiled code. In addition, these frameworks provide automatic differentiation, avoiding the need for detailed manual calculation of derivatives.
The results in this paper are limited to the expected Shannon information gain criterion. This is the most common choice in the literature, and it is a good one when the goal is inference and uncertainty quantification for the parameters using the full posterior distribution, for reasons discussed in Sect. 1. In other situations, such as point estimation, a different utility function may be preferable. We believe that numerical approximations similar to LIS/ALIS could be developed for other utility functions. The idea of approximating the optimal importance distribution could again be used, though this would no longer be the posterior distribution. However, such methods are outside the scope of the present paper. While of interest, comparisons with other recent approaches such as amortized variational inference (Foster et al. 2019) and layered multiple importance sampling (Feng and Marzouk 2019) are also outside of scope.

Acknowledgements … visiting the Isaac Newton Institute, Cambridge, UK, as part of the programme 'Uncertainty quantification for complex systems: theory and methodologies' during 2018.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Appendix: Derivative calculations
The ALIS and LIS methods require the (marginal) likelihood and the gradient and Hessian of the log unnormalized posterior. In this appendix we report the details of these calculations for the models in Sect. 4.

Nonlinear models with σ² treated as nuisance
Here we assume that the model is a general nonlinear regression (5) in which the noise variance σ² is a nuisance parameter with the conjugate prior σ² ∼ IG(a, b), where a and b denote the shape and scale hyperparameters. In this case the nuisance parameter can be integrated out analytically. We work on the scale ϑ_j = log θ_j, on which the parameters are assumed to follow independent normal priors, ϑ_j ∼ N(θ_j, v_j). The Michaelis–Menten and BOD examples from Sect. 4 both fit into this framework.
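As an illustration of this analytic integration, the sketch below (hypothetical data and hyperparameter values; the function names are ours, not from the paper) evaluates the closed-form log marginal likelihood obtained from the standard normal–inverse-gamma identity
$$\int N(y;\, \eta, \sigma^2 I_n)\, \mathrm{IG}(\sigma^2;\, a, b)\, \mathrm{d}\sigma^2 = \frac{b^a\, \Gamma(a + n/2)}{(2\pi)^{n/2}\, \Gamma(a)\, (b + S/2)^{a + n/2}}, \qquad S = \sum_{i=1}^n (y_i - \eta_i)^2,$$
and cross-checks it by brute-force Monte Carlo integration over σ².

```python
import math
import random

def log_marginal(y, eta, a, b):
    """log f_M(y | theta): Gaussian likelihood with sigma^2 ~ IG(a, b)
    integrated out analytically via the identity above."""
    n = len(y)
    S = sum((yi - ei) ** 2 for yi, ei in zip(y, eta))
    return (a * math.log(b) + math.lgamma(a + n / 2)
            - (n / 2) * math.log(2 * math.pi)
            - math.lgamma(a) - (a + n / 2) * math.log(b + S / 2))

def mc_log_marginal(y, eta, a, b, n_draws, rng):
    """Brute-force check: average the Gaussian likelihood over draws of
    sigma^2 from the IG(a, b) prior."""
    n = len(y)
    S = sum((yi - ei) ** 2 for yi, ei in zip(y, eta))
    total = 0.0
    for _ in range(n_draws):
        sig2 = 1.0 / rng.gammavariate(a, 1.0 / b)  # sigma^2 ~ IG(shape a, scale b)
        total += (2 * math.pi * sig2) ** (-n / 2) * math.exp(-S / (2 * sig2))
    return math.log(total / n_draws)

y, eta = [1.1, 0.6, -0.2], [1.0, 0.5, 0.0]  # hypothetical responses and fitted means
exact = log_marginal(y, eta, a=3.0, b=2.0)
approx = mc_log_marginal(y, eta, 3.0, 2.0, 50000, random.Random(0))
```

The closed form replaces an M₃-sample importance sampling estimate of the form (7) with a single exact evaluation, which is why approximation (7) was unnecessary for these examples.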