Optimal Bayesian design for model discrimination via classification

Performing optimal Bayesian design for discriminating between competing models is computationally intensive as it involves estimating posterior model probabilities for thousands of simulated data sets. This issue is compounded further when the likelihood functions for the rival models are computationally expensive. A new approach using supervised classification methods is developed to perform Bayesian optimal model discrimination design. This approach requires considerably fewer simulations from the candidate models than previous approaches using approximate Bayesian computation. Further, it is easy to assess the performance of the optimal design through the misclassification error rate. The approach is particularly useful in the presence of models with intractable likelihoods but can also provide computational advantages when the likelihoods are manageable.

Supplementary Information: The online version contains supplementary material available at 10.1007/s11222-022-10078-2.


Properties and Estimation of CARTs and Random Forests
The CART algorithm generates a binary tree where each internal node consists of one binary rule that involves exactly one of the features, e.g., y_3 < 10. The feature space is split recursively at the internal nodes according to the binary rules, thereby creating a partition of the feature space consisting of hyperrectangles aligned with the feature axes. Each terminal node or leaf contains all the observations in the training sample which fall into the associated hyperrectangle. The hyperrectangle region of the feature space associated with a leaf is defined by the binary rules in the nodes leading to that leaf. For a classification tree, the class label which is assigned to a particular region of the feature space is determined by a majority vote of the training samples in the corresponding leaf. The class proportions of the training samples in a leaf can be used to obtain crude estimates of the posterior class probabilities for observations falling into the associated feature space region.
Trees are constructed recursively beginning at the root. Each leaf contains those training samples that meet all the conditions leading down the path from the root to that leaf. If no stopping criterion is met and the leaf's sample contains more than one distinct feature value, the leaf is split into two daughter nodes and becomes an internal node. To that end, the binary rule that splits the sample at the node into two subsets for the two new leaves has to be determined. The feature variable and the split point are selected such that a given criterion is minimised across all subsets. For classification, the default criterion of node impurity used for growing the tree is the Gini index Σ_{m=1}^{K} p̂_m (1 − p̂_m), where p̂_m is the proportion of training samples from class m in the node.
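As a concrete illustration, the Gini index above can be computed directly from the class labels falling into a node; a minimal Python sketch (the function name is ours):

```python
from collections import Counter

def gini_impurity(labels):
    """Gini index sum_m p_m * (1 - p_m) for the class labels in one node."""
    n = len(labels)
    props = [c / n for c in Counter(labels).values()]
    return sum(p * (1 - p) for p in props)

# A pure node has impurity 0; a balanced two-class node has impurity 0.5,
# which is the maximum for two classes.
print(gini_impurity(["a", "a", "a"]))        # 0.0
print(gini_impurity(["a", "a", "b", "b"]))   # 0.5
```

A split is chosen so that the (weighted) impurity of the two daughter nodes is as small as possible.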
As noted for example by Hastie et al. (2009), fully grown trees, where no further splits are possible, usually overfit the data. Therefore, one might stop earlier and define a minimum size of a node or a parent node. More preferably, one can grow a full tree and prune it afterwards according to a cost-complexity criterion that incorporates the node impurities and the number of terminal nodes. For an efficient algorithm to find the optimal pruned tree see Breiman et al. (1984). The optimal choice of the minimum node size or the tuning parameters for cost-complexity pruning can be determined by cross-validation.
Exploiting the similarities between trees and nearest neighbour classifiers, Breiman et al. (1984) show that the misclassification error rate of a fully grown tree is bounded above by twice the Bayes error rate, which has been shown for 1-nearest neighbour classification by Cover and Hart (1967). It also follows from Breiman et al. (1984) that the misclassification error rate of a classification tree attains the Bayes error rate as the sample size tends to infinity.
The CART algorithm automatically assumes equal prior class probabilities, even if the training sample is not balanced. This is achieved by dividing the class counts in the leaves by the overall class counts in the training sample. Therefore, a given leaf is classified as

arg max_{m ∈ {1,...,K}} N_m(leaf) / N_m(root),  (1.1)

where N_m(leaf) and N_m(root) are the number of observations from class m in the leaf and in the entire training sample, respectively. One may switch off this mechanism if the training sample reflects the true prior class probabilities. It is also possible to provide user-defined prior class probabilities. In that case the fractions in (1.1) are multiplied by these user-defined prior probabilities.
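Rule (1.1) can be sketched in a few lines of Python (function and variable names are ours). Note how dividing by the root counts lets a minority class win a leaf that a raw majority vote would assign to the majority class:

```python
from collections import Counter

def classify_leaf(leaf_labels, root_labels, priors=None):
    """Assign a class to a leaf via arg max_m prior_m * N_m(leaf) / N_m(root).

    With priors=None, equal prior class probabilities are assumed, which is
    what dividing by the root counts achieves for an unbalanced training set.
    """
    leaf_counts = Counter(leaf_labels)
    root_counts = Counter(root_labels)
    classes = sorted(root_counts)
    if priors is None:
        priors = {m: 1.0 for m in classes}
    return max(classes,
               key=lambda m: priors[m] * leaf_counts.get(m, 0) / root_counts[m])

# 90 samples of class 0 and 10 of class 1 overall; a leaf holding 6 vs 4
# observations is classified as class 1 (since 4/10 > 6/90), although a raw
# majority vote would pick class 0.
root = [0] * 90 + [1] * 10
leaf = [0] * 6 + [1] * 4
print(classify_leaf(leaf, root))  # 1
```

Passing user-defined `priors` multiplies the fractions as described for (1.1); supplying the empirical class frequencies recovers the raw majority vote.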
Due to the recursive nature of their construction, trees exhibit a high variance. A suboptimal split at a top node affects the whole tree structure below that node, so slight changes in the data might lead to widely different trees. To reduce the variance, an ensemble method called bagging was proposed by Breiman (1996).
Bagging means to draw B bootstrap samples from the training sample and to apply the classification method to each bootstrap sample. As a result, one obtains B different classifiers trained on the B bootstrap samples. The class of a new observation y * is predicted by casting a majority vote among the class predictions returned by the B classifiers. Bagging has been shown to be particularly useful for classification methods that are unstable and exhibit a high variance such as trees and neural networks, where bagging can lead to a substantial reduction of the variance.
An ensemble of bagged trees might be highly correlated, which has a negative effect on the variance of the bagged predictor. To reduce the variance further, random forests (Breiman, 2001) seek to de-correlate the trees by considering only a random subset of the feature variables for splitting at each node when the trees are grown. For classification, the default setting is to consider √p variables at each node, where p is the total number of feature variables. The random selection of feature subsets reduces the correlation between the trees, but it also increases the bias of the individual trees. On the other hand, the trees used in random forests are normally not pruned, and unpruned trees have less bias than pruned trees.
Random forests are able to account for overfitting when computing the misclassification error rate without the need to employ cross-validation or to generate a separate test set. Each tree is constructed from a bootstrap sample of the training set. The bootstrap samples are drawn from the training set with replacement. It follows that about one third of the training set is omitted in each bootstrap sample. It is therefore possible to make predictions for each training sample y i based on those trees where y i does not appear. These out-of-bag class predictions can then be used to estimate the misclassification error rate. Out-of-bag estimation is qualitatively similar to leave-one-out cross-validation.
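The "about one third" figure comes from the fact that a bootstrap sample of size n omits each observation with probability (1 − 1/n)^n ≈ 1/e ≈ 0.368. A quick simulation (our own illustration) confirms this:

```python
import random

random.seed(1)
n, B = 1000, 200  # training set size and number of bootstrap samples
omitted_fraction = 0.0
for _ in range(B):
    # Indices drawn with replacement; the set keeps the distinct ones.
    boot = {random.randrange(n) for _ in range(n)}
    omitted_fraction += (n - len(boot)) / n
omitted_fraction /= B
print(round(omitted_fraction, 3))  # close to (1 - 1/n)**n ≈ 1/e ≈ 0.368
```

Each training sample is therefore out-of-bag for roughly a third of the trees, which is enough to form a stable out-of-bag prediction for it.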
Random forests also provide estimates for the posterior model probabilities p(m|y, d). The estimates are formed by simply averaging the posterior model probability estimates obtained from the trees in the forest. Due to the averaging, the posterior model probability estimates of the random forest are much more stable than those given by a single tree.
Unfortunately, there are some difficulties when trying to estimate the expected multinomial deviance loss by classification trees or random forests using cross-validation, independent test samples or out-of-bag class predictions. For a single tree, the lack of smoothness of its posterior model probability estimates means that in an independent test sample there are almost certainly some observations for which the estimated posterior model probability of the true model is 0. Minus the logarithm of such a posterior model probability is ∞, and the expected multinomial deviance loss is then also ∞. When evaluating a random forest on a test sample or when using out-of-bag class predictions, it is also very likely that some probability estimates are 0. In our examples, we therefore set the estimated posterior model probability to ε = 0.001 whenever it is estimated to be 0. The lower the value of ε, the higher the variance of the expected loss estimate, because the estimate becomes very sensitive to the number of posterior model probabilities estimated to be 0. In our experience, ε = 0.001 struck a good balance between being reasonably close to 0 and not exhibiting excessive variability.
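The ε-clamping described above is straightforward to implement; a small illustrative sketch (names are ours, not from the paper's Matlab code):

```python
import math

def expected_deviance_loss(prob_true_model, eps=0.001):
    """Monte Carlo estimate of the expected multinomial deviance loss
    -log p(true model | y, d), clamping zero probability estimates up to eps."""
    return sum(-math.log(max(p, eps)) for p in prob_true_model) / len(prob_true_model)

# A single probability estimate of 0 would make the loss infinite without clamping.
probs = [0.9, 0.8, 0.0, 0.95]
print(expected_deviance_loss(probs))
```

Lowering `eps` inflates the contribution of every clamped observation, which is exactly the variance effect discussed above.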
For all our examples except the spatial extremes example, we use the Matlab functions fitctree and TreeBagger to train classification trees and random forests, respectively. We mostly use the default settings of those functions. That is, for classification trees the maximum number of splits is set to the sample size − 1, the minimum leaf size is 1 and the minimum internal node size is 10. This leads to rather deep trees. The trees are not pruned. The default settings of TreeBagger amount to following the standard methodology of random forests as outlined in this section. The random forests we employ are generally made up of 100 trees and utilise out-of-bag class predictions.

Algorithm 2: Gaussian process regression post-processing step
Input: Sets of visited designs V_GP,i and sets of corresponding expected loss estimates L_i (q values for each design in V_GP,i) for i = 1, . . . , I parallel runs of the modified coordinate exchange algorithm (Algorithm 1); preliminary optimal designs d*_CE,i and corresponding estimated expected loss values l*_CE,i for i = 1, . . . , I parallel runs of Algorithm 1; a function estimate_loss(d) that estimates the expected loss for a given design d.
Output: Overall optimal design d*.
1. Combine the sets of visited designs V_GP,i for all i = 1, . . . , I parallel runs into one set V_GP. Do the same for the sets of expected loss estimates L_i and combine them into L.
2. Train a Gaussian process with the expected loss values in L as (univariate) response variable and the visited designs in V_GP as predictors (each design is repeated q times).
3. Find the minimum value of the predictive mean function of the Gaussian process over the design space using some generic optimisation function.

In all our examples we set p = 6 and q = 10.
Algorithm 1 can be run in parallel for different initial designs d to account for multimodality and local optima. We conduct 20 parallel runs in all our examples.
The selection of the candidate design points in Line 7 of Algorithm 1 depends on the example. For the logistic regression and the macrophage examples, there is no restriction and C = A. For the other examples, the current design points d_1, . . . , d_n in d have to be excluded since each design point can only be selected once. Furthermore, for the spatial extremes example we only consider design points with the same x- or y-coordinate as the current design point.
The sets of best designs found in each of the parallel runs of Algorithm 1 and their associated estimated expected loss values are combined and used as inputs for Algorithm 2. In Algorithm 2, a Gaussian process (GP; see, e.g., Rasmussen and Williams, 2006) is trained on the combined data from all the parallel runs in order to obtain a smooth estimate of the expected loss surface by means of the predictive mean function of the GP. The predictive mean function is minimised and a new candidate for the optimal design is obtained. Since the predictive variance is relatively high in our examples, we compare this design to the best design found through Algorithm 1 without the GP post-processing step of Algorithm 2. To reduce the uncertainty for this comparison, we estimate the expected loss 100 times at each of the two designs and take the design with the lower average expected loss value as the overall optimal design. We do not perform the GP post-processing step for the spatial extremes example.
For Gaussian process regression, we use the default settings of the Matlab function fitrgp, except that all the predictors are standardised. The default kernel function is the squared exponential kernel, and a constant GP prior mean is assumed. To find a good initial value for the prior noise variance parameter, Bayesian optimisation is conducted with respect to the cross-validation loss.
For finding the minimum of the GP's predictive mean function, we use the Nelder-Mead simplex algorithm (Nelder and Mead, 1965). Restrictions of the design space are handled by employing suitable transformations. For example, design points with the restriction d_i ∈ (a, b] are transformed by the logit transformation, which maps the interval onto the real line.

Computational Performance Measures for Examples in Main Paper
In this section, we provide some measures of computational performance for the design search algorithms used for the three examples in Section 4 of the main paper. As explained in Section 2, we ran Algorithm 1 twenty times in parallel, so there is a distribution of runtimes to consider. We focus on the exchange part of Algorithm 1 (lines 4 to 23), because this is usually the most time-consuming part. However, it is not sensible to just compare the distributions of runtimes of the exchange part because the number of sweeps through the design grid until the algorithm converges (i.e., the number of passes through the while-loop) is random. Therefore, in Tables 1 to 4 we provide the distributions of the runtimes per sweep for all the examples in Section 4.1 of the main paper. More precisely, we state the mean and the standard deviation of the runtime per sweep over the parallel runs. As expected, one can see that these distributions exhibit little variation. The reason is that the number of calls of the estimate_loss function in a sweep through the design grid is fixed for any given example and design configuration. Within any call to estimate_loss, the simulated sample sizes are fixed (for details see the example settings for the respective models in Section 4 of the main paper). The sample sizes for trees and random forests are always the same, so differences in runtimes can solely be attributed to the classification method. In general, for all our examples simulation is rather efficient, so the simulation effort is only a minor fraction of the total runtime. This is also true for ABC, where sorting the reference table for each draw from the outer sample is much more time-consuming than creating the reference table itself (see also Section 4.1.4 in the main paper).
It is interesting to note that there do not seem to be any systematic differences between the distributions of the number of sweeps between the different methods. Therefore, it is entirely sufficient to consider the runtimes per sweep or runtimes per call when analysing the differences between the methods.
In our examples, it was about four to five times faster to use cross-validated trees than to use random forests when the data dimension is small. However, as the data dimension increases, the tree method loses some of that advantage (see Tables 3 and  4). Note that in the macrophage example the data consist of the various observed cell proportions at each design point and are therefore quite high-dimensional despite the low dimensionality of the designs. Furthermore, the higher the dimension, the bigger the advantage of the designs found through random forest classification in terms of discriminatory performance (see, e.g., Figure 5 in this document). Therefore, for higher dimensions the recommendation is to use random forests.
Both classification approaches are many times faster than the other approaches investigated in Table 1 (ABC) and Table 2 (likelihood-based). Note that this is despite the relatively small simulation sizes for the outer Monte Carlo samples from the prior predictive distribution that we used for ABC (sample size 2000) as well as for the likelihood-based approach (sample size 800) to keep the runtimes within a tolerable range. These small outer sample sizes led to a considerably larger noise in the expected loss estimates for those two approaches compared to the classification approaches (see Figure 1 in the main paper and Figure 3 in this document).
Runtimes are machine- and implementation-specific and should therefore be interpreted with caution. However, Tables 1 to 4 can still give some clues about the relative efficiency of the different methods. All our examples were run on an SGI UV 3000 global shared memory system from Hewlett Packard Enterprise. It uses 12-core Intel Xeon E5-4650V3 processors that operate at 2.8 GHz and have an L3 cache of 30 MB.
We do not further analyse Algorithm 2.

The prior distributions for the four epidemiological Markov process models of Section 4.1 are given in Table 5. Here LN(µ, σ) denotes the lognormal distribution with location µ and scale σ, and E(η) denotes the exponential distribution with rate η.

4.2 Optimal Designs for n = 4 and n = 5

Table 6: Optimal designs obtained by the tree classification (cross-validated), random forest classification (using out-of-bag class predictions), and ABC approaches under the 0-1 loss (01L) or the multinomial deviance loss (MDL) for n = 4 and n = 5 for the infectious disease example. The equidistant designs are also shown.

Misclassification Matrices
The random forest classifiers and the corresponding random samples which we use to compute the misclassification error rates in Table 3 of the main paper can also be used to compute misclassification matrices for the various optimal designs. A misclassification or confusion matrix contains, for each combination of true model m_i (in the rows) and predicted model m_j (in the columns), the proportion of samples from true model m_i that were classified as model m_j. In the case of random forests, the misclassification matrix is computed using out-of-bag class predictions. It provides a comprehensive picture of the classification accuracy at a given design.
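A row-normalised misclassification matrix is easy to compute from paired true and predicted model labels; a small illustrative sketch with made-up labels:

```python
from collections import Counter

def misclassification_matrix(true_models, predicted_models, models):
    """Row-normalised confusion matrix: entry (i, j) is the proportion of
    samples from true model i that were classified as model j."""
    counts = Counter(zip(true_models, predicted_models))
    matrix = []
    for mi in models:
        row_total = sum(counts[(mi, mj)] for mj in models)
        matrix.append([counts[(mi, mj)] / row_total for mj in models])
    return matrix

# Hypothetical labels for four candidate models.
true_m = [1, 1, 1, 3, 3, 3, 2, 2, 4, 4]
pred_m = [1, 1, 3, 3, 1, 3, 2, 2, 4, 2]
M = misclassification_matrix(true_m, pred_m, models=[1, 2, 3, 4])
print(M[0])  # row for true model 1; it is confused with model 3 a third of the time
```

The diagonal entries are the per-model classification accuracies, and each row sums to 1.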
For the optimal design obtained by the tree classification approach with cross-validation under the 0-1 loss, the misclassification matrices for 1 to 4 time points are shown in Figure 1. The figure suggests that it is difficult to discriminate between models 1 and 3 and also between models 2 and 4. This is not surprising given that we do not observe the exposed population. In particular, model 3 is most often misclassified as model 1. The misclassification matrices for the other machine learning classification approaches and loss functions are all qualitatively very similar to Figure 1.
In Figure 2, the misclassification matrices for the ABC approach under the 0-1 loss are depicted. The ABC approach leads to designs with generally lower values for the design points than the machine learning approaches (see Table 2 in the main paper and Table 6 in this document). The overall misclassification error rates are similar, but one can see that the pattern is a bit different from Figure 1. At 4 design points, model 3 is more likely to be correctly classified, but the misclassification error of models 1 and 2 increases.
Collect all the S_{i,j} in the vector S in the same way as the design times d_{i,j} have been collected in the vector d. Each S_{i,j} is a discrete random variable that can assume the N + 1 values 0 to N. The parameters are denoted by θ = (log(b_1), log(b_2)).
Since the death and SI models are continuous-time Markov processes, their likelihood functions have the form

p(S | θ, m, d) = ∏_{i=1}^{q} ∏_{j=1}^{n_d} p(S_{i,j} | S_{i,j−1}, θ, m),  (5.1)

see Cook et al. (2008).
Let the (N+1)-dimensional vector v_{i,j|S_{i,j−1}=k} contain the probabilities of all the possible states of the random variable S_{i,j} when the value of S_{i,j−1} is known to be k. The m-th element of v_{i,j|S_{i,j−1}=k} gives the probability that S_{i,j} = m − 1 when S_{i,j−1} = k. Since the value of S_{i,j−1} is known and therefore certain, the state probability vector at observation time d_{i,j−1} reduces to e_{k+1}, where e_m denotes a vector whose m-th element is 1 and whose remaining elements are 0.
Given the vectors and notation introduced above, the transition probabilities can be written as

v_{i,j|S_{i,j−1}=k} = A_{θ,i,j}^⊤ e_{k+1},

where the matrix A_{θ,i,j} has dimension (N + 1) × (N + 1) and contains the transition probabilities for all pairs of states between observation times d_{i,j−1} and d_{i,j}. This matrix follows from the solution of the Kolmogorov forward equations and can be calculated using the matrix exponential (see Higham, 2008),

A_{θ,i,j} = exp{(d_{i,j} − d_{i,j−1}) G_θ},  (5.2)

where G_θ is the infinitesimal generator matrix that is constructed from the transition rates given in Table 1 of the main paper; see, e.g., Grimmett and Stirzaker (2001), p. 258.
Let the N + 1 rows of the generator matrix be numbered from 0 to N. For the SI model, row i (i = 0, . . . , N) of the generator matrix follows from the transition rates in Table 1 of the main paper.
Setting b_2 = 0 for the death model, the transition probabilities simplify to a binomial probability mass function (see Cook et al., 2008). Therefore, there is no need to numerically compute the matrix exponential for the death model, and so the likelihood function can be evaluated very quickly. However, for the SI model each of the q · n_d matrices A_{θ,i,j} in the likelihood function (5.1) has to be obtained by numerical computation of the matrix exponential (5.2).
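The binomial shortcut can be checked numerically. For a hypothetical pure death process with per-individual rate b1 (our own toy example with the same mathematical structure), the (k, s) entry of the matrix exponential of the generator matches the binomial probability of s survivors out of k. The small Taylor-series expm below is only for self-containedness; in practice one would use scipy.linalg.expm:

```python
import math
import numpy as np

def expm(A, terms=30):
    """Truncated Taylor series for the matrix exponential (adequate for the
    small, well-scaled generator used here)."""
    out, term = np.eye(len(A)), np.eye(len(A))
    for k in range(1, terms):
        term = term @ A / k
        out += term
    return out

# Pure death process on states 0..N: each individual dies at rate b1, so the
# generator has G[k, k] = -b1*k and G[k, k-1] = b1*k.
N, b1, dt = 5, 0.7, 0.4
G = np.zeros((N + 1, N + 1))
for k in range(1, N + 1):
    G[k, k] = -b1 * k
    G[k, k - 1] = b1 * k
A = expm(dt * G)

# Starting from k individuals, the number of survivors is Bin(k, exp(-b1*dt)),
# so the matrix exponential is not needed for this model.
p = math.exp(-b1 * dt)
k, s = 5, 3
binom = math.comb(k, s) * p**s * (1 - p) ** (k - s)
print(abs(A[k, s] - binom))  # agrees up to numerical error
```

Each row of A sums to 1, as required of a transition probability matrix.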

Approximating the Marginal Likelihood
To obtain the posterior model probabilities, we need to compute the marginal likelihood

p(S | m, d) = ∫ p(S | θ_m, m, d) p(θ_m | m) dθ_m.  (5.3)

We pursue two different approaches to approximating this integral. During the optimisation procedure, we use a comparatively quick Laplace-type approximation to the marginal likelihood, see Gelman et al. (2013), p. 318. Let

θ̂_m = arg max_{θ_m} p(S | θ_m, m, d) p(θ_m | m)  (5.4)

be the posterior mode of model m. Performing a second-order Taylor expansion of the logarithm of p(S | θ_m, m, d) p(θ_m | m) around θ̂_m and integrating out θ_m yields

p(S | m, d) ≈ (2π)^{p_m/2} |Σ_{S,θ̂_m,d}|^{1/2} p(S | θ̂_m, m, d) p(θ̂_m | m),  (5.5)

where p_m is the number of parameters of model m and

Σ_{S,θ̂_m,d}^{−1} = −∇²_{θ_m} log[p(S | θ_m, m, d) p(θ_m | m)] |_{θ_m = θ̂_m}  (5.6)

is the Hessian of the negative log-posterior evaluated at the posterior mode.
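To see the Laplace-type approximation at work, the sketch below applies it to a conjugate normal-normal model (our own toy example, not one of the paper's models), where the Gaussian integrand makes the approximation exact and the marginal likelihood is known in closed form:

```python
import math

def laplace_evidence(loglik, logprior, theta_hat, hessian):
    """Laplace approximation to the marginal likelihood for a p=1 model:
    p(y) ~ (2*pi)^{p/2} |H|^{-1/2} * p(y | theta_hat) * p(theta_hat)."""
    return math.sqrt(2 * math.pi / hessian) * math.exp(loglik(theta_hat) + logprior(theta_hat))

# Conjugate example: theta ~ N(0, tau2), y | theta ~ N(theta, sigma2),
# so the exact evidence is the N(0, tau2 + sigma2) density at y.
tau2, sigma2, y = 2.0, 0.5, 1.3

def logpdf_norm(x, mu, var):
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (x - mu) ** 2 / var

loglik = lambda t: logpdf_norm(y, t, sigma2)
logprior = lambda t: logpdf_norm(t, 0.0, tau2)
theta_hat = y * tau2 / (tau2 + sigma2)   # posterior mode
hessian = 1.0 / sigma2 + 1.0 / tau2      # of the negative log-posterior

exact = math.exp(logpdf_norm(y, 0.0, tau2 + sigma2))
print(laplace_evidence(loglik, logprior, theta_hat, hessian), exact)
# the two printed values agree for this Gaussian example
```

For non-Gaussian posteriors, such as those of the epidemic models, the approximation is only accurate to the extent that the posterior is close to normal around its mode.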
When validating the optimal designs found by the different methods, we employ generalised Gauss-Hermite quadrature (Kautsky and Elhay, 1982; Elhay and Kautsky, 1987) with Q sample points to compute the integral (5.3). As weighting kernel we use a multivariate normal density with mean and variance-covariance matrix given by the mean and twice the variance-covariance matrix, respectively, of the normal Laplace approximation to the posterior, where θ̂_m is given by (5.4) and Σ_{S,θ̂_m,d} is given by (5.6). Using this weighting kernel, we expect many sample points to lie in relevant regions where the integrand has high mass. In the bivariate case, determining the sample points involves two steps, see Jäckel (2005). First, all combinations of sample points resulting from applying the standard univariate Gauss-Hermite quadrature rule to each dimension are considered. The sample weights are simply computed by multiplying the univariate weights. To account for the correlation and the different scaling and location implied by the multivariate normal weighting kernel, the sample points are then transformed accordingly based on a spectral decomposition of the variance-covariance matrix, seeking to align the diagonals of the rectangle of sample points with the principal axes of the confidence ellipsoid. Furthermore, for the two-parameter SI model we drop sample points below a weight of w_1 · w (here Q is the number of univariate sample points of the Gauss-Hermite quadrature rule and w_i denotes the weight for the i-th ordered univariate sample point). After obtaining the Q sample points θ_{m,i} and quadrature weights w_i (i = 1, . . . , Q) according to the quadrature rule, the marginal likelihood can be approximated by

p(S | m, d) ≈ Σ_{i=1}^{Q} w_i p(S | θ_{m,i}, m, d) p(θ_{m,i} | m) / g(θ_{m,i}),  (5.7)

where g(·) denotes the multivariate normal weighting kernel density.
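A one-dimensional sketch of Gauss-Hermite quadrature with a normal weighting kernel (the paper uses a generalised bivariate rule with a spectral decomposition; this toy check against a known moment only illustrates the univariate building block):

```python
import numpy as np

def gauss_hermite_expectation(f, mu, sigma, Q=30):
    """Approximate E[f(theta)] for theta ~ N(mu, sigma^2) using a Q-point
    Gauss-Hermite rule, with the normal density acting as weighting kernel."""
    x, w = np.polynomial.hermite.hermgauss(Q)
    theta = mu + np.sqrt(2.0) * sigma * x    # rescale nodes to the kernel
    return float(np.sum(w * f(theta)) / np.sqrt(np.pi))

# Sanity check against the known moment E[theta^2] = mu^2 + sigma^2.
mu, sigma = 1.5, 0.8
print(gauss_hermite_expectation(lambda t: t**2, mu, sigma))  # 1.5**2 + 0.8**2 = 2.89
```

Applying the rule to the marginal likelihood amounts to taking f to be the integrand divided by the kernel density, as in the generalised rule described above.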
Figure 3 shows the estimated expected 0-1 loss surface for the one-dimensional design obtained by the different approaches using the simulation sizes we used for the design search. The comparatively high volatility of the expected 0-1 loss under the likelihood-based approach is evident from Figure 3. To create Figure 3 on our computer, it took about 17 seconds for the tree classification approach, about 2.7 minutes for the random forest classification approach, but more than 18 minutes for the likelihood-based approach despite the low data dimension and the much smaller prior predictive sample size.

Further Results
Figures 4 (lower-dimensional designs) and 5 (higher-dimensional designs) display the distributions of posterior model probabilities for samples of size 2K (1K per model) from the prior predictive distribution at the various optimal designs found for all the dimension settings and the different methods. We also include equispaced designs for comparison. The marginal likelihoods are computed using the generalised Gauss-Hermite quadrature approximation (5.7) with Q = 30 quadrature points for the death model and up to Q = 30² quadrature points for the SI model. Figure 4 shows that for lower-dimensional designs all methods lead to designs with a very similar classification accuracy as measured by the distribution of the posterior model probabilities of the true model. For the higher-dimensional designs, Figure 5 indicates that the designs found using random forests perform slightly better than the designs found using cross-validated trees. This comes at the cost of a higher computing time.

Figure 3: Plots of the approximated expected 0-1 loss functions produced by the tree classification approach with cross-validation (solid), the random forest classification approach (dotted), and the likelihood-based approach using a Laplace-type approximation to the marginal likelihood (dashed) for the infectious disease example with two models.

Figure 4: Distributions of the posterior model probabilities for 2K prior predictive simulations (1K from each of the two models) for the infectious disease example with two models. The data are all simulated at the respective optimal designs for the different approaches. The 0-1 loss is used as criterion. Settings with q = 1 to q = 4 realisations and n_d = 1 to n_d = 4 observations per realisation are considered (q ≤ 2 for n_d = 3 and n_d = 4). For each setting, from left to right the boxplots are for the cross-validated tree classification design (Tr; magenta), the random forest classification design (RF; blue), the design found using the Laplace approximations to the marginal likelihoods (ML; red), and the equispaced design (Eq; black).

Figure 5: Distributions of the posterior model probabilities for 2K prior predictive simulations (1K from each of the two models) for the infectious disease example with two models. The data are all simulated at the respective optimal designs for the different approaches. The 0-1 loss is used as criterion. Settings with various numbers of realisations and 1 ≤ n_d ≤ 4 observations per realisation are considered. The numbers of realisations were chosen such that the total number of observations n = q · n_d equals 12, 24, 36, or 48. For each setting, from left to right the boxplots are for the cross-validated tree classification design (Tr; magenta), the random forest classification design (RF; blue), and the equispaced design (Eq; black).
6 Additional Details and Results for Macrophage Example

Models and Prior Distributions
In all three models, a macrophage can acquire a new bacterium with a constant rate φ while there is no antibiotic in the medium (t < t_exp); this rate then drops to 0 for the remainder of the simulation. In model 1, we assume that a proportion p > 0 of the available bacteria are non-replicating, so these are acquired by macrophages at rate φp, while replicating bacteria are acquired at rate φ(1 − p). Intracellular bacteria are degraded at rate d if replicating and at a separate rate if non-replicating. Within permissive macrophages containing R > 1 replicating bacteria, the number of replicating bacteria increases by one every time one of these bacteria divides; this division rate is assumed to be a decreasing function of R (due to limited resources for bacterial growth within a macrophage), expressed as a·e^{−bR}, where a is the maximum division rate of bacteria and b is a dimensionless scaling parameter. Finally, in model 1, replicating bacteria within permissive macrophages become non-replicating at rate δ. All these transitions are listed in Table 7.

Table 7: Three competing models considered in the macrophage example. R(t) represents the number of replicating bacteria and D(t) the number of non-replicating bacteria within a macrophage. In model 2, a proportion q of macrophages are refractory and 1 − q are permissive.

For each macrophage, numerical simulations of the three models are produced using the Gillespie algorithm (Gillespie, 1977). In line with the general experimental setup, each macrophage is initially uninfected, but in model 2 it has a probability q of being refractory. This state is set at the start of each simulation and does not change thereafter. To reproduce the data collection process described above, we produce two independent sets of simulations for each observation time t obs in a given experimental design. First, we run S simulations of individual macrophages and record the proportion π(t obs ) of infected macrophages. Second, we run another set of simulations for the same duration until S infected macrophages are obtained, from which we record the proportions {µ k (t obs ), k > 0} of infected macrophages containing k bacteria. This can be repeated multiple times to generate multiple sets of observations from each model m, parameter vector θ m and experimental design d. Importantly, the simulations' results do not distinguish between replicating and non-replicating bacteria (model 1) or between refractory and permissive macrophages (model 2), as these cannot be told apart by microscopy alone.
The number of infected macrophages at time t_obs has the binomial distribution Bin(S; E[π(t_obs)]). Likewise, the vector of numbers of infected macrophages containing k = 1, . . . , K+ bacteria has the multinomial distribution Mult(S; {E[µ_1(t_obs)], . . . , E[µ_{K+}(t_obs)]}). The last category K+ contains all macrophages with at least K+ bacteria.
The most involved part is to obtain the expected proportions E[π(t_obs)] and E[µ_1(t_obs)], . . . , E[µ_{K+}(t_obs)] for any particular set of parameters. A system of linear differential equations consisting of the Kolmogorov forward equations for the models in Table 7 has to be solved to determine the expected proportions of macrophages that contain a certain number of replicating and non-replicating bacteria (see Restif et al. (2012)). The solution of this system can be computed using matrix exponentials. Considering only the total number of bacteria in a macrophage, the expected proportions E[π(t_obs)], E[µ_1(t_obs)], . . . , E[µ_{K+}(t_obs)] can then be derived.
The prior distributions for each model were driven by the analysis of the experimental system in Restif et al. (2012). We assume truncated multivariate normal distributions, where the mean vector and variance-covariance matrix are based on the maximum likelihood estimates (MLEs) and the inverse of the Hessian obtained from the optimisation routine, respectively. All parameters are truncated below at 0. The proportion parameters p and q are additionally truncated above at 1.
In model 1, all macrophages are permissive, so q = 0. The mean vector and the variance-covariance matrix of the truncated normal prior for the remaining parameters of model 1 are based on the MLE fit described above.

Tables 8 and 9 show the optimal designs for each classification method and for the different numbers of observation times. The tree and the random forest classification approaches lead to very similar designs.

Misclassification Matrix
We can use the same random forest classifiers and their associated samples that were created to estimate the misclassification error rates in Table 4 of the main paper to compute the misclassification matrices. The misclassification matrices for the optimal designs obtained under the random forest classification approach are displayed in Figure 6. The classification power is very high for all the models. One can see that it is slightly more difficult to detect heterogeneity between bacteria (model 1) than heterogeneity between macrophages (model 2). The misclassification matrices for the designs obtained under the tree classification approach are almost identical.

Logistic Regression Example
We consider the logistic regression example of Overstall and Woods (2017) and Overstall et al. (2018). The response is binary, y_ij ∼ B(p_ij), where j = 1, . . . , n_G and i = 1, . . . , G. Here G is the total number of groups and n_G is the number of observations per group, so the total number of observations is n = G × n_G. The model parameter of interest is θ = (β_0, β_1, β_2, β_3, β_4). The random effect for the ith group is γ_i = (γ_0i, γ_1i, γ_2i, γ_3i, γ_4i). The observed vector of responses for the ith group is y_i = (y_i1, . . . , y_in_G) and the complete dataset is denoted y = (y_1, . . . , y_G). The design vector is the concatenation of the controllable elements of the design matrix, d = {x_aij; a = 1, . . . , 4, i = 1, . . . , G, j = 1, . . . , n_G}, and is of length n × 4. Each design element is restricted to x_aij ∈ [−1, 1]. The variable v_a is an indicator that equals 1 if the ath predictor is present in the model. Since it may not be clear which of the four predictors should be included in the model, there are 2^4 = 16 possible models to choose from. We aim to select the design d that maximises our ability to discriminate between all possible models under various prior assumptions, as described below.
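Simulation from the candidate models is the key operation for the classification approach. A sketch for the FE structure, assuming a logistic link (the precise linear predictor is given in Overstall et al. (2018); the parameter values, indicator vector, and design below are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch of prior-predictive simulation for the FE structure, assuming
# logit(p_ij) = beta_0 + sum_a v_a * beta_a * x_aij (our assumption here).
G, n_G = 2, 6
n = G * n_G
v = np.array([1, 1, 0, 1])                   # indicators of included predictors
beta = np.array([0.5, 7.0, 8.0, -3.0, 0.5])  # illustrative (beta_0, ..., beta_4)
x = rng.uniform(-1, 1, size=(n, 4))          # design entries restricted to [-1, 1]

eta = beta[0] + x @ (v * beta[1:])           # linear predictor
p = 1.0 / (1.0 + np.exp(-eta))               # inverse logit
y = rng.binomial(1, p)                       # binary responses
```

In practice beta would be drawn from the uniform priors below, and the (summary of the) simulated y together with the model label forms one training observation for the classifier.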
As in Overstall et al. (2018), two different model structures are considered. The first structure sets all random effects (RE) to 0, resulting in the fixed effects (FE) structure. The second structure allocates a distribution to the random effects (RE structure). Within each chosen structure, there are 16 models to discriminate between. In both the FE and RE structures, we use the priors β_0 ∼ U(−3, 3), β_1 ∼ U(4, 10), β_2 ∼ U(5, 11), β_3 ∼ U(−6, 0), β_4 ∼ U(−2.5, 3.5). We assume that all parameters are independent a priori. For the RE model we set γ_ai ∼ U(−ζ_a, ζ_a) and allocate a triangular prior to ζ_a, p(ζ_a) = 2(U_a − ζ_a)/U_a^2, 0 < ζ_a < U_a, where (U_0, U_1, U_2, U_3, U_4) = (3, 3, 3, 1, 1). One possibility for the prior distribution placed on each model is a prior which depends on the number of predictors present in the model. Let (v_m1, . . . , v_m4) denote the values of (v_1, . . . , v_4) for model m. A model prior accounting for Bayesian multiplicity (Scott and Berger, 2010) is p(m) = 1/[5 · C(4, Σ_{a=1}^4 v_ma)], (7.1) where C(·, ·) denotes the binomial coefficient. In order to estimate the misclassification error rate under the Bayes classifier (the Bayes error rate) for some design d, we need to estimate posterior model probabilities for J datasets simulated from the prior predictive distributions of all the models. A common approach for rapid approximation of the evidence for model m, p(y|m, d), in the context of Bayesian optimal design is importance sampling (IS), where the importance distribution is the prior (e.g. Ryan et al. (2014)). However, if the data are informative (as might be the case in this example if n is large), the number of IS samples needed to estimate the evidence with reasonable precision may be prohibitively large. The situation is significantly worse for the RE structure, as an importance distribution is required over the space of both the parameter of interest and the random effects (see, e.g., Ryan et al. (2015)).
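The Scott–Berger multiplicity-correcting prior can be tabulated over all 16 models; with four candidate predictors it takes the form p(m) = 1/(5 · C(4, k_m)), where k_m is the number of predictors in model m:

```python
from itertools import product
from math import comb

# Each model is identified by its indicator vector (v_1, ..., v_4).
# The multiplicity-correcting prior puts a uniform prior on the model
# size k_m, then splits it uniformly over models of that size.
models = list(product([0, 1], repeat=4))
prior = {m: 1.0 / (5 * comb(4, sum(m))) for m in models}
```

The prior probabilities sum to 1 by construction: each of the five model sizes receives mass 1/5, shared equally among the C(4, k) models of size k.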
For the FE structure and n = 48, using 100K importance samples from the prior and J = 800 datasets (50 per model), the time taken to approximate the misclassification error rate for a random design on a cluster using 24 parallel threads was almost 2.75 minutes. This is very computationally intensive considering that we need to optimise over 48 × 4 design variables. Performing IS for the RE structure can be considered completely intractable. Overstall et al. (2018) propose the use of normal-based approximations to the posterior in the Bayesian design context to provide a convenient estimate of the evidence. They consider the same logistic regression example but use normal priors to facilitate the approximation of the evidence. In some applications, however, a normal-based approximation may not be adequate.
In contrast, our classification approach avoids computing posterior quantities and requires only simulation from all the models. Interestingly, moving to the RE structure poses little additional difficulty for the classification approach as it remains trivial to simulate from the models. This is a significant advantage of the classification approach.
For the FE structure we consider n ∈ {6, 12, 24, 48} and for the RE structure we consider n_G = 6 and G ∈ {2, 4, 8} (to give n ∈ {12, 24, 48}). Two prior distributions on the model indicator are trialled: (1) the prior where models are equally likely a priori and (2) the prior in (7.1) that corrects for Bayesian multiplicity. We refer to the first as the equal prior and the second as the unequal prior. For this example, the only design criterion that we consider is the misclassification error rate (the expected 0-1 loss). During the design optimisation phase, we estimate the expected loss by employing cross-validated classification trees using a sample of size 80K (5K simulations per model). The observations are weighted within the trees according to their prior model probabilities. We consider a discretised design space for each x_aij consisting of the five values {−1, −0.5, 0, 0.5, 1}.
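A sketch of the weighted tree-based estimate of the expected 0-1 loss (a single holdout split is used here instead of cross-validation for brevity, and the data and prior weights are synthetic stand-ins for model simulations):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in: two "models" generating 3-dimensional summaries.
n_per_model = 500
X = np.vstack([rng.normal(0.0, 1.0, (n_per_model, 3)),
               rng.normal(1.5, 1.0, (n_per_model, 3))])
m = np.repeat([0, 1], n_per_model)          # model labels
w = np.where(m == 0, 0.25, 0.75)            # hypothetical prior model probs

Xtr, Xte, mtr, mte, wtr, wte = train_test_split(
    X, m, w, test_size=0.5, random_state=0, stratify=m)
clf = DecisionTreeClassifier(max_depth=5, random_state=0)
clf.fit(Xtr, mtr, sample_weight=wtr)        # weight by prior model probabilities
error_rate = np.average(clf.predict(Xte) != mte, weights=wte)  # weighted 0-1 loss
```

The design search then seeks the design whose simulations yield the smallest such estimated error rate.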
After obtaining the optimal designs for the different scenarios regarding model structure (FE or RE) and prior distribution (equal or unequal), we assess the classification performance of these optimal designs using random forests. For each optimal design, 10K simulations under each model are used to train a random forest with 100 trees. A fresh set of 16 × 10K = 160K simulations is used to estimate the misclassification error rate and the misclassification matrix. The model proportions of this test sample reflect the prior model probabilities. The results for the optimal designs of the different scenarios are shown in the rows with bold row labels in Table 10. For each scenario, results are also provided for the optimal designs found under the other scenarios as well as for a randomly generated design. For the randomly generated designs, each design point x_aij equals 1 or −1 with equal probability.
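The validation step can be sketched as follows, with synthetic simulations standing in for the prior predictive samples and three models instead of sixteen:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(1)

# Synthetic stand-in simulator: model m produces 4-dimensional summaries
# centred at m.
def simulate(model, n):
    return rng.normal(model, 1.0, size=(n, 4))

X_train = np.vstack([simulate(m, 1000) for m in range(3)])
y_train = np.repeat(range(3), 1000)
X_test = np.vstack([simulate(m, 1000) for m in range(3)])   # fresh test set
y_test = np.repeat(range(3), 1000)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
pred = rf.predict(X_test)
error_rate = np.mean(pred != y_test)
# Rows: true model; columns: predicted model, as proportions per true model.
mis_matrix = confusion_matrix(y_test, pred, normalize="true")
```

The rows of the normalized confusion matrix correspond to the rows of a misclassification matrix: the proportion of datasets from each true model assigned to each candidate model.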
The results suggest that the optimal designs found for this example are remarkably robust with respect to the assumed model structure (FE or RE) and the assumed prior model probabilities (equal or unequal). The random design has the worst performance under all scenarios. We can also see a decrease in the misclassification error rate as the sample size is increased, as expected.
It is also of interest to see how well the optimal designs found under the tree classification approach perform in terms of posterior model probabilities. We conduct a simulation study under the FE structure using either the equal or the unequal prior.
For each design we want to assess, we simulate a sample of 800 datasets from the marginal distributions of all the various models, where the proportion of datasets from a particular model in the sample corresponds to that model's prior model probability.
For each of the 800 datasets, we approximate the posterior model probability of the model m that generates the dataset y using IS with 100K prior simulations. As for the classification results in Table 10, we are also interested in the performance of optimal designs found under some wrongly assumed scenarios. We also consider a 'random' setup where we select designs randomly for each of the 800 datasets. Figure 7 shows the boxplots of the estimated posterior model probabilities of the correct model for some of the designs of interest when the true scenario is the FE structure with the equal prior. The resulting boxplots when the true scenario is the FE structure with the unequal prior are shown in Figure 8. It is again evident that the optimal designs found are robust under the choice of the structure (FE or RE) and the choice of the prior model probabilities (equal or unequal). We do not perform a simulation study under the RE structure given the increasing difficulty of estimating the posterior model probabilities under this structure.
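For comparison, a toy sketch of estimating posterior model probabilities by importance sampling with the prior as importance distribution (the two candidate models here are hypothetical and chosen so the likelihood is one-dimensional and cheap):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy setting: n Bernoulli observations; model 0 puts a uniform prior on
# the success probability over (0, 0.5), model 1 over (0.5, 1).
y = rng.binomial(1, 0.7, size=20)
s, n = int(y.sum()), y.size
ranges = {0: (0.0, 0.5), 1: (0.5, 1.0)}

def evidence(m, n_is=100_000):
    lo, hi = ranges[m]
    theta = rng.uniform(lo, hi, n_is)        # draws from the prior of model m
    lik = theta**s * (1.0 - theta)**(n - s)  # Bernoulli likelihood
    return lik.mean()                        # IS estimate of p(y | m)

ev = np.array([evidence(m) for m in ranges])
post = ev / ev.sum()                         # equal prior model probabilities
```

Even in this one-parameter toy problem the evidence estimate needs many prior draws when the data are informative; the cost compounds rapidly for the FE model with five parameters, and further still for the RE structure.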
It is important to note that the random forest-based validation results in Table 10 were obtained in a small fraction of the time that it took to conduct the simulation study used to produce the results in Figures 7 and 8. The improvement in classification accuracy is clear as the sample size is increased. When the unequal prior is selected, it is evident for small sample sizes that it is easier to classify the models with higher prior probability. The misclassification matrices for the RE structure are omitted because they are very similar.

Spatial Extremes Example
In this example, the goal is to place a fixed number of measuring sites in space in order to maximise the ability to discriminate between different spatial models for extreme outcomes (e.g., maximum annual temperatures). There are many spatial models for extreme events; see Davison et al. (2012) for an overview. For this example, we consider discriminating between three isotropic models: two max-stable models and one copula model.

Models
Max-stable processes are popular for modelling spatial extremes because they are the only possible limits of renormalised pointwise maxima of infinitely many independent copies of a stochastic process (de Haan and Ferreira, 2006). The advantage of working with the limiting process is that no knowledge about the underlying true process is necessary. Inference for extreme outcomes based on the true underlying process is fraught with high uncertainty and most often not feasible because only the tails of the distribution are observed. If the limiting assumption is (approximately) appropriate, it is much easier to model the extreme data according to a max-stable process.
All the univariate marginal distributions of a max-stable process are members of the family of generalised extreme value (GEV) distributions. We assume that all the univariate marginal distributions have a unit Fréchet distribution (Pr{Y (x) ≤ y} = exp{−1/y}, y > 0), so the focus is on modelling the dependence structure of the process. The assumption of unit Fréchet margins is not too restrictive from a practical perspective since a simple transformation can be applied to the univariate margins to make them unit Fréchet distributed, see Davison et al. (2012). The marginal parameters needed for that transformation can be estimated in a separate step. Alternatively, one may estimate the dependence and marginal parameters together.
The spectral representation of a max-stable process {Y(x), x ∈ X ⊆ R^d} with unit Fréchet margins is given by Y(x) = max_{i≥1} ζ_i Z_i(x), (8.1) where {ζ_i}_{i=1}^∞ are the points of a Poisson point process on the positive real line with intensity dΛ(ζ) = ζ^{−2} dζ and {Z_i(x), x ∈ X}_{i=1}^∞ are independent realisations of a non-negative stochastic process with continuous sample paths and E[Z(x)] = 1 for all x ∈ X (see, e.g., Ribatet (2013)).
Different max-stable processes are obtained by choosing different stochastic processes Z. We consider two very popular stationary models, the extremal-t model (Opitz, 2013) and the Brown-Resnick model with power variogram (Brown and Resnick, 1977; Kabluchko et al., 2009). The process Z_i(x) is specified as Z_i(x) = c_ν max{0, ε_i(x)}^ν for the extremal-t model and Z_i(x) = exp{ε_i(x) − γ(x)} for the Brown-Resnick model, where the ε_i are independent copies of Gaussian processes, c_ν is the normalising constant ensuring E[Z_i(x)] = 1, and γ is the semi-variogram defined below.
In the case of the extremal-t model, ε is a stationary Gaussian process defined by the correlation function ρ(h), where h is the Euclidean distance between two points. For our example, we assume the powered exponential or stable correlation function ρ(h) = exp{−(h/λ)^κ}. (8.2) The Brown-Resnick process is defined by its semi-variogram. If the process ε is a fractional Brownian motion centred at the origin, the Brown-Resnick process is stationary and the semi-variogram has the form γ(h) = (h/λ)^κ, where h denotes the distance between two locations.
Both models depend on two parameters governing the dependence between two locations separated by a distance h: the range parameter λ and the smoothness parameter κ. In addition, the extremal-t model has a degrees of freedom parameter denoted by ν.
We assume there is no discontinuity of the correlation function at h = 0 (i.e., no nugget effect).
The third model we consider is a copula model. Similar to the max-stable models, the univariate marginal distributions of the copula model are unit Fréchet. However, the extremal dependence between the locations is simply modelled by a standard (non-extremal) copula. For an introduction to copulas see Nelsen (2006). We assume the multivariate Student-t copula in our example. The multivariate cumulative distribution function (CDF) at locations (x_1, . . . , x_H) implied by the non-extremal Student-t copula model (Demarta and McNeil, 2005) is Pr{Y(x_1) ≤ y_1, . . . , Y(x_H) ≤ y_H} = T_{H;ν}{T_{1;ν}^{−1}[F(y_1)], . . . , T_{1;ν}^{−1}[F(y_H)]; Σ}, where F(y) = exp{−1/y} is the CDF of the unit Fréchet distribution, T_{1;ν}^{−1}[·] is the quantile function of the univariate Student-t distribution with ν degrees of freedom, and T_{H;ν}{· · · ; Σ} is the CDF of the H-variate Student-t distribution with ν degrees of freedom and dispersion matrix Σ. The diagonal elements of Σ are 1 and the off-diagonal elements contain the correlations between the locations. Therefore, the entries of Σ are given by Σ_ij = ρ(h_ij) for i, j = 1, . . . , H, where h_ij is the distance between locations i and j. As for the extremal-t model, we assume the correlation function to be the powered exponential correlation function (8.2). This also implies that the non-extremal Student-t copula model has the same set of parameters as the extremal-t model: range (λ), smoothness (κ), and degrees of freedom (ν).

Summary Statistics
If a reasonable number of observations is collected at each location, the data quickly become very high-dimensional, while each individual observation is only marginally informative. This diminishes the classification power of the classifiers we use. We therefore aim to reduce the dimension of the data by generating informative summary statistics. Unfortunately, none of the statistics we consider guarantees consistent model choice. This can potentially result in large biases when estimating the posterior model probabilities (Robert et al., 2011), which can also affect the estimates of the misclassification error rates. However, trees and random forests work reasonably well with a sizeable number of moderately informative feature variables. We can therefore include a wide variety of summary statistics, each of which contains some information about the process. Considering the combined information of all the summary statistics, we expect that only a small loss of information is incurred compared to the full dataset.
First, we include the F-madogram estimates for all the pairs of locations. The F-madogram (Cooley et al., 2006) is similar to the semi-variogram, but unlike the semi-variogram it also exists if the variances or means of the random variables are not finite. Given n observations {y_1(x_1), . . . , y_n(x_1)} and {y_1(x_2), . . . , y_n(x_2)} collected at locations x_1 and x_2, the pairwise F-madogram between locations x_1 and x_2 is estimated as ν̂_F(x_1, x_2) = (1/(2n)) Σ_{i=1}^n |F{y_i(x_1)} − F{y_i(x_2)}|, where F{y} = exp{−1/y} is the CDF of the unit Fréchet distribution.
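A sketch of the pairwise F-madogram estimator, applied here to independent unit Fréchet draws, for which the estimate should be near the independence value of 1/6:

```python
import numpy as np

# Pairwise F-madogram estimate from n paired observations at two
# locations, using the unit Fréchet CDF F(y) = exp(-1/y).
def f_madogram(y1, y2):
    F = lambda y: np.exp(-1.0 / y)
    return 0.5 * np.mean(np.abs(F(y1) - F(y2)))

rng = np.random.default_rng(0)
# Unit Fréchet draws via the inverse CDF: y = -1/log(u), u ~ U(0, 1).
u = rng.uniform(size=(2, 1000))
y1, y2 = -1.0 / np.log(u)

nu_hat = f_madogram(y1, y2)
```

For unit Fréchet margins the F-madogram lies in [0, 1/6], with 0 under complete dependence and 1/6 under independence; computing it for every pair of locations gives the first block of summary statistics.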
As a second set of summary statistics, we include estimates of all the pairwise extremal coefficients (Schlather and Tawn, 2003). For a max-stable process, the pairwise extremal coefficient between locations x_1 and x_2 is defined as the value θ(x_1, x_2) for which Pr{Y(x_1) ≤ y, Y(x_2) ≤ y} = exp{−θ(x_1, x_2)/y}. (8.3) The pairwise extremal coefficient can assume values between 1 and 2. A value of θ(x_1, x_2) = 1 indicates complete dependence between the two locations, while θ(x_1, x_2) = 2 means the two locations are completely independent. We estimate it using the fast estimator of Coles et al. (1999), θ̂(x_1, x_2) = n / Σ_{i=1}^n [1/max{y_i(x_1), y_i(x_2)}]. (8.4) The extremal coefficient as defined by (8.3) only exists for max-stable processes. In general, the coefficient also depends on the level y. However, the quantities computed by Equation (8.4) might still provide useful information about the dependence structure.
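The estimator in (8.4) exploits the fact that, for unit Fréchet margins, 1/max{Y(x_1), Y(x_2)} is exponentially distributed with rate θ(x_1, x_2). A sketch on independent draws, for which θ = 2:

```python
import numpy as np

# Fast pairwise extremal coefficient estimator of Coles et al. (1999):
# theta_hat = n / sum_i 1/max{y_i(x1), y_i(x2)}.
def extremal_coefficient(y1, y2):
    return len(y1) / np.sum(1.0 / np.maximum(y1, y2))

rng = np.random.default_rng(0)
u = rng.uniform(size=(2, 5000))
y1, y2 = -1.0 / np.log(u)        # independent unit Fréchet draws

theta_hat = extremal_coefficient(y1, y2)   # close to 2 under independence
```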
For the t copula model, Lee et al. (2018) demonstrate by simulation that the estimates given by (8.4) are indeed informative about the dependence structure.
The last set of summary statistics we consider is the set of Kendall's τ estimates between all pairs of locations. Kendall's τ between locations x_1 and x_2 is estimated by τ̂(x_1, x_2) = [2/(n(n − 1))] Σ_{i<j} sign{(y_i(x_1) − y_j(x_1))(y_i(x_2) − y_j(x_2))}. Dombry et al. (2018) show that for max-stable processes Kendall's τ is equal to the probability that the maxima at two locations occur concurrently, i.e., that they are attained by the same extremal function. All of the summary statistics we incorporate are also considered by Lee et al. (2018), who perform ABC model selection using the summary statistic projection method of Prangle et al. (2014) for a very similar set of models as in this example. Therefore, a more detailed discussion of the summary statistics can be found in Lee et al. (2018).
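Kendall's τ for a pair of locations can be computed with standard routines; on independent unit Fréchet draws the estimate should be near 0:

```python
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)

# Independent unit Fréchet draws at two locations; their Kendall's tau
# estimate should be close to 0.
u = rng.uniform(size=(2, 2000))
y1, y2 = -1.0 / np.log(u)
tau_hat, _ = kendalltau(y1, y2)
```

Stacking the F-madogram, extremal coefficient, and Kendall's τ estimates over all pairs of locations yields the feature vector fed to the classifiers.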

Bayesian Inference for Spatial Extremes Models
The likelihood functions of max-stable models are practically intractable for most models in dimensions greater than two or three. Composite likelihood methods have been the most popular way to conduct classical inference for max-stable models, so model discrimination is usually based on the composite likelihood information criterion (CLIC) (Padoan et al., 2010).
The observed extrema at several locations might occur at the same time, which means that the extrema at these locations arise from the same extremal function ϕ_i(x) in Equation (8.1). The locations can then be partitioned according to which extremal functions ϕ_i(x) produce the extreme observations at the different locations. Stephenson and Tawn (2005) show that the joint likelihood of the extreme observations and the partitions is substantially simpler than the likelihood of the extreme observations without knowledge of the partitions. Thibaud et al. (2016) and Dombry et al. (2017) use this property to devise a Gibbs sampler with the partitions as auxiliary variables to conduct Bayesian inference for max-stable models. However, even the augmented likelihoods are expensive to evaluate for the Brown-Resnick and extremal-t models because they include multivariate Gaussian (Brown-Resnick) and Student-t (extremal-t) CDFs.
Due to the intractability of the likelihoods, ABC has also been a popular method for Bayesian inference of max-stable models, see, e.g., Erhardt and Smith (2012) or the overview in Erhardt and Sisson (2015). Lee et al. (2018) present an ABC application with the joint goal of model selection and parameter estimation for the same set of models we consider. Hainy et al. (2016) seek to find optimal designs for parameter estimation for the extremal-t model with ν = 1 (called the 'Schlather model'). They use ABC to estimate the posterior variances, which they use as design criterion. Their design algorithm is confined to very low-dimensional design spaces in order to be able to store the reference table for all possible designs. They sequentially select the best single location among a small set of possible locations. With our classification approach, we are able to overcome these limitations.

Settings and Results
In our example, we want to select H (H = 3, . . . , 8) locations on a regular grid such that the ability to discriminate between the three models as measured by the misclassification error rate is optimised. We search the H optimal design points over a regular 6 × 6 grid laid over a square with edge length 10. The data consist of n = 10 independent realisations of the process collected at all the locations. Due to the isotropic nature of the processes, there are potentially many equivalent optimal solutions. With our modification of the coordinate exchange algorithm using 20 random starts, we seek to find one of these designs or at least a nearly optimal design.
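The coordinate-exchange search can be sketched as follows; a maximin-distance criterion stands in here for the (expensive) classifier-based misclassification error rate, so only the search mechanics over the grid are illustrated:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

# Candidate locations: a 6x6 grid over a square with edge length 10.
grid = np.array(list(product(np.linspace(0, 10, 6), repeat=2)))

# Stand-in criterion to minimise: negative minimum pairwise distance
# (the real criterion is the estimated misclassification error rate).
def loss(design):
    d = np.linalg.norm(design[:, None] - design[None, :], axis=2)
    return -np.min(d[np.triu_indices(len(design), 1)])

H = 5
design = grid[rng.choice(len(grid), H, replace=False)]  # random start
improved = True
while improved:
    improved = False
    for i in range(H):                # exchange each design point in turn
        for cand in grid:             # try every candidate grid location
            trial = design.copy()
            trial[i] = cand
            if len(np.unique(trial, axis=0)) == H and loss(trial) < loss(design):
                design = trial
                improved = True
```

In the actual algorithm, each trial design requires simulating from the three models and refitting a classifier, which is why every loss evaluation is costly; multiple random starts guard against local optima.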
Simulating from the t copula model is straightforward. It only requires simulating from a multivariate t distribution and then transforming the margins with respect to the univariate t CDF followed by the inverse unit Fréchet CDF. For simulating from the max-stable models, we use the exact simulation algorithm via extremal functions of Dombry et al. (2016).
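A sketch of the t copula simulation just described (the grid coordinates and parameter values are illustrative, and the powered exponential correlation of (8.2) is assumed):

```python
import numpy as np
from scipy.stats import multivariate_t, t as student_t

rng = np.random.default_rng(0)

# Illustrative locations and dependence parameters.
H, nu = 4, 3.0
coords = rng.uniform(0, 10, size=(H, 2))
h = np.linalg.norm(coords[:, None] - coords[None, :], axis=2)
lam, kappa = 3.0, 1.0
Sigma = np.exp(-(h / lam) ** kappa)       # powered exponential correlation

# Draw from the multivariate t, map margins to uniform via the univariate
# t CDF, then apply the inverse unit Fréchet CDF.
z = multivariate_t(loc=np.zeros(H), shape=Sigma, df=nu).rvs(10, random_state=0)
u = student_t.cdf(z, df=nu)               # uniform margins
y = -1.0 / np.log(u)                      # unit Fréchet margins
```

Each row of y is one realisation of the process at the H locations; n = 10 such realisations form one simulated dataset.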
During the design phase, we use cross-validated classification trees as well as random forests with 500 trees using out-of-bag class predictions to estimate the misclassification error rates. Since we implemented the simulator functions for this example in R, we use the R function rpart for classification trees, for which we keep all the default settings except that we do not consider any surrogate splits, in order to speed up computation. For random forests, we employ the function randomForest from the R (R Core Team, 2018) package of the same name (Liaw and Wiener, 2002). The simulated sets for both methods contain 5K simulations per model. The optimal designs obtained for these two methods are shown in Figures 11 (trees) and 12 (random forests).
To evaluate the designs found by our classification approach, we repeatedly estimate the misclassification error rate via random forests with 500 trees using out-of-bag class predictions on 100 different simulated samples of size 15K (5K simulations per model) from the prior predictive distribution. The distributions of the estimated misclassification error rates are plotted in Figure 13. We also include the distributions of the estimated misclassification error rates for 100 simulated samples from the prior predictive distribution generated on 100 randomly selected designs. The optimal classification designs found using random forests clearly perform best for all design sizes. Using classification trees with cross-validation instead of random forests leads to designs which are somewhat worse. However, the average misclassification error rate of the classification tree designs is still smaller than the average error rate of the random designs up to 7 design points.

Figure 11: Optimal classification designs found using trees for design sizes from three to eight for the spatial extremes example. Selected design points are marked by red triangles.

Figure 12: Optimal classification designs found using random forests for design sizes from three to eight for the spatial extremes example. Selected design points are marked by red triangles.

Figure 13: Spatial extremes example: distributions of the random forest-estimated misclassification error rates over 100 random samples of size 15K generated from the prior predictive distribution at the optimal classification designs found using random forests (rf) or trees (tr) for design sizes from three to eight. The distributions of the misclassification error rates over 100 random samples of size 15K generated from the prior predictive distribution at 100 random designs (rd) are also shown for the same design sizes.
In addition to the misclassification error rate, we also compute the misclassification matrix yielded by the random forest classifier for each of the 100 simulated samples for each evaluated design. The average misclassification matrices over the 100 samples are depicted in Figure 14 for the optimal designs obtained by the random forest classification approach. They show that discriminating between the two max-stable models is more difficult than discriminating between the t copula model and either of the max-stable models.