In this section, we consider several examples to highlight the utility of our proposed method. To perform the design optimisation, we use a modification of the coordinate exchange (CE) algorithm (Meyer and Nachtsheim 1995), which cycles through each of the design variables in turn, trialling a set of candidate replacements and updating the value of the design variable whenever the objective/loss function is reduced. This continues until no updates to the design are made in a given cycle. To guard against local optima, we run the algorithm in parallel 20 times with random starts. We account for the stochastic nature of our objective function by considering the last (up to) six designs visited in each of the 20 runs as candidates for the overall optimal design. For each candidate, we compute the loss function ten times to reduce the noise; the best design found through the algorithm is the one with the lowest average loss among the candidate designs across all runs. As an additional post-processing step, we pool all the candidate designs and estimated loss function values from all the runs and fit a Gaussian process regression (Rasmussen and Williams 2006) to them to obtain a smooth estimate of the expected loss surface, which we minimise with respect to the design. Finally, we compare the expected loss at this new design to the expected loss found by the coordinate exchange algorithm: we estimate the expected loss 100 times at each of the two designs and select the design with the lower average as the optimal design. A detailed description of the optimisation algorithm is provided in Sect. 2 of Online Resource 1. We do not expend any effort on finding the best optimisation algorithm for each of the examples, as this is not the focus of the paper; we find that the CE algorithm performs adequately to illustrate our findings.
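For concreteness, the following sketch (in Python, with a hypothetical `expected_loss` estimator and candidate grid) illustrates a single random-start run of the coordinate exchange loop described above; the 20 parallel restarts, the candidate-averaging step, and the Gaussian process post-processing are omitted.

```python
import numpy as np

def coordinate_exchange(design, candidates, expected_loss, max_cycles=50):
    """One random-start run of the coordinate exchange (CE) algorithm.

    design        : 1-D array holding the initial design (observation times)
    candidates    : 1-D array of candidate replacement values
    expected_loss : function returning a (noisy) estimate of the expected
                    loss of a design
    """
    design = np.sort(np.asarray(design, dtype=float))
    best_loss = expected_loss(design)
    for _ in range(max_cycles):
        updated = False
        for i in range(len(design)):        # cycle through each design variable
            for cand in candidates:         # trial a set of candidate replacements
                trial = design.copy()
                trial[i] = cand
                trial.sort()                # keep observation times ordered
                loss = expected_loss(trial)
                if loss < best_loss:        # update only if the loss is reduced
                    design, best_loss, updated = trial, loss, True
        if not updated:                     # stop when a cycle makes no update
            break
    return design, best_loss
```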
The first example in Sect. 4.1 compares the results of our supervised classification approach to ABC for different loss functions in an infectious disease application. It demonstrates that ABC and the computationally much more tractable classification approaches lead to designs of similar efficiency. The second example in Sect. 4.2 is a modification of the first: it considers only the first two models of the first example, which have reasonably tractable likelihoods. This makes it possible to obtain likelihood-based loss estimates and to find likelihood-based designs, at least in lower dimensions, which we can use for comparisons with our classification approach. In addition, we demonstrate that our approach can also be applied successfully to higher-dimensional design settings. The third example is a practically important application in the field of experimental biology, where the goal is to obtain good designs for discriminating between different hypotheses about unobserved heterogeneity with respect to the reproduction of bacteria within phagocytic cells. We apply our classification-based design method to two further examples in Sects. 7 and 8 of Online Resource 1. The first is a fairly high-dimensional logistic regression example with fixed and random effects, for which Bayesian optimal designs could previously only be found by making additional approximations (Overstall et al. 2018). The second is an application to intractable max-stable spatial extremes models, for which designs were previously only found on a very limited number of candidate design points using the ABC approach (Hainy et al. 2016).
Listings of computational runtime performance statistics for the different methods and design settings for all the examples in this section can be found in Sect. 3 of Online Resource 1.
Stochastic models in epidemiology
Problem formulation
An example involving four competing continuous-time Markov process models for the spread of an infectious disease is considered in Dehideniya et al. (2018b). Let S(t), E(t), and I(t) denote the number of susceptible, exposed, and infected individuals at time t in a closed population of size \(N=50\) such that \(S(t) + E(t) + I(t) = N\) for all t. The possible transitions in an infinitesimal time \(\delta _t\) for each of the four models are shown in Table 1. Models 1–4 are referred to as the death, SI, SEI, and SEI2 models, respectively. Models 1 and 2 do not have an exposed population. The algorithm of Gillespie (1977) can be used to efficiently generate samples from all the models. The prior distributions for all the parameters of each model are provided in Table 5 of Online Resource 1. All models are assumed equally likely a priori.
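To illustrate how cheap forward simulation is for these models, the following minimal Gillespie sketch simulates the SI model at a set of observation times. The overall infection hazard \(b_1 S + b_2 S I\) for the single \(S \rightarrow I\) transition is our assumption for illustration only; the exact transitions for each model are those shown in Table 1.

```python
import numpy as np

def gillespie_si(b1, b2, N=50, d_obs=(1.0, 2.0), rng=None):
    """Simulate the SI model and return I(t) at the observation times d_obs.

    Assumes, for illustration, an overall infection hazard of b1*S + b2*S*I
    with one S -> I transition per event; only the infected count is observed.
    """
    rng = np.random.default_rng() if rng is None else rng
    t, S, I = 0.0, N, 0
    obs = []
    for d in sorted(d_obs):
        while t < d and S > 0:
            rate = b1 * S + b2 * S * I          # total hazard of the next event
            wait = rng.exponential(1.0 / rate)
            if t + wait > d:
                t = d     # no event before d; exponential waits are memoryless,
                break     # so we may simply resume the simulation from time d
            t += wait
            S, I = S - 1, I + 1                 # one susceptible becomes infected
        obs.append(I)                           # record the observable count
    return np.array(obs)
```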
Table 1 Four competing models considered in the infectious disease example of Sect. 4.1

We consider the design problem of determining the optimal times (in days) \(\mathbf {d}= (d_1,d_2,\ldots ,d_n)\), where \(d_1< d_2< \cdots < d_n \le 10\), at which to observe the stochastic process in order to best discriminate between the four models under the available prior information. Only the infected population can be observed. Unfortunately, the likelihood functions for all but the simplest model are computationally cumbersome, as they require computing the matrix exponential (see, e.g. Drovandi and Pettitt 2008). Whilst computing a single posterior distribution is feasible, as in a typical data analysis, computing the posterior distribution or posterior model probabilities for thousands of prior predictive simulations, as in a standard optimal Bayesian design approach, is computationally intractable.
Approximate Bayesian computation
Dehideniya et al. (2018b) develop a likelihood-free approach based on approximate Bayesian computation (ABC) to solve this model discrimination design problem. Given a particular level of discretisation of the design space (time in this case), the ABC approach involves generating a large number of prior predictive simulations at all discrete time points and storing them in the so-called reference table. Then, for a particular ‘outer’ draw from the prior predictive distribution, \(\mathbf {y}\), at some proposed design, \(\mathbf {d}\), the ABC rejection algorithm of Grelaud et al. (2009) is used to estimate the posterior model probabilities and, in turn, the loss function. This means that the posterior model probability \(p(m|\mathbf {y},\mathbf {d})\) is estimated by computing the proportion of model m simulations in the retained sample, where the retained sample is composed of those simulations from the reference table which are ‘closest’ to the process realisation \(\mathbf {y}\) with respect to some distance such as Euclidean or Manhattan distance. The size of the retained sample is only a very small fraction of the size of the reference table. The estimated posterior model probability is used to compute the estimated loss for process realisation \(\mathbf {y}\). Finally, the estimated expected loss is obtained by averaging the loss estimates over all the ‘outer’ draws. The reader is referred to Dehideniya et al. (2018b) for more details. Price et al. (2016) improve efficiency for these models by exploiting the discrete nature of the data when estimating the expected loss.
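The following sketch (with a hypothetical reference table layout) shows how the ABC rejection step of Grelaud et al. (2009) can estimate the posterior model probabilities for one ‘outer’ draw; we use the Manhattan distance here, though the Euclidean distance works analogously.

```python
import numpy as np

def abc_model_probs(y, ref_sims, ref_models, n_models=4, n_keep=2000):
    """Estimate posterior model probabilities by ABC rejection.

    y          : one 'outer' prior predictive realisation at the proposed design
    ref_sims   : (R, n) reference table of prior predictive simulations,
                 restricted to the columns matching the design
    ref_models : (R,) model indicator (0..n_models-1) for each reference row
    """
    dist = np.abs(ref_sims - y).sum(axis=1)     # Manhattan distance to y
    keep = np.argsort(dist)[:n_keep]            # retain the closest simulations
    kept_models = ref_models[keep]
    # proportion of retained simulations coming from each model
    return np.bincount(kept_models, minlength=n_models) / n_keep
```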
Simulation settings
For each of the classification methods from machine learning, we use a sample of 5K simulations from each model to train the classifier and to estimate the expected loss at each new design. For the classification trees, we use tenfold cross-validation to estimate the expected loss functions; when using random forests, we employ out-of-bag class predictions. As criteria, we use the expected 0–1 loss as well as the expected multinomial deviance loss. When computing the expected multinomial deviance loss, we set the posterior model probability of the correct model to 0.001 whenever it is estimated to be 0; see Sect. 1 of Online Resource 1 for more information. We could follow the ABC method and draw the simulations from a large bank of prior predictive process realisations simulated over the whole design grid to reduce the computing time. However, since the machine learning classification methods require significantly fewer simulations, we find that it is still fast to draw a fresh process realisation for each proposed design. For the ABC approach, the reference table contains 100K stored prior predictive simulations for each model. To compute the expected loss, we average the estimated loss over 500 ‘outer’ draws from \(p(\mathbf {y}|m,\mathbf {d})\) for each model and retain a sample of size 2K from the reference table for each draw. For all the methods, the optimal design search was conducted over a grid of time points from 0.25 to 10 with a spacing of 0.25.
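As a sketch of how the expected 0–1 loss can be estimated with random forests via out-of-bag class predictions, consider the following fragment; the `simulate` helper, which draws prior predictive realisations from a given model at a design, is hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def expected_01_loss_rf(design, simulate, n_per_model=5000, n_models=4):
    """Estimate the expected 0-1 loss at a design via random forest
    out-of-bag (OOB) class predictions.

    simulate(m, design, size) is assumed to return a (size, len(design))
    array of prior predictive simulations from model m at the design.
    """
    X = np.vstack([simulate(m, design, n_per_model) for m in range(n_models)])
    y = np.repeat(np.arange(n_models), n_per_model)
    rf = RandomForestClassifier(n_estimators=100, oob_score=True).fit(X, y)
    # the OOB accuracy estimates the probability of correct classification,
    # so the misclassification error rate (expected 0-1 loss) is its complement
    return 1.0 - rf.oob_score_
```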
One-dimensional estimated expected loss curves
Figure 1 shows the approximate expected loss functions for a single design observation under several estimation approaches and loss functions over a grid of design points with spacing 0.1. All the functions are qualitatively similar and produce the same optimal design of around \(0.5\)–\(0.7\) days. In particular, the expected loss curves for both the 0–1 loss and the multinomial deviance loss appear to be minimised at around the same observation time. However, the times needed to construct the curves differ vastly between the approaches. On our workstation, generating the respective graphs took less than half a minute for the cross-validated tree classification approach (single core), between 4 and 5 minutes for the random forest classification approach (single core), and between 9.5 and 10 minutes using 8 parallel cores for the ABC approach. Creating the reference table with 400K simulations required only between 3.5 and 4 seconds in this example, since sampling via the Gillespie algorithm is very efficient. The computational inefficiency of ABC in our example is caused by having to sort the large reference table for each outer draw to obtain the retained ABC sample. Despite the much higher computational effort of the ABC approach, its estimates of the expected loss functions are still considerably noisier than those of the classification approaches, mostly due to the relatively small outer sample size of 2000.
Optimal designs
The optimal designs obtained by the machine learning and ABC approaches are shown in Table 2 for \(n=1\) to \(n=3\) time points and in Table 6 of Online Resource 1 for \(n=4\) and \(n=5\) time points. The machine learning methods lead to designs with a general preference for later sampling times, and the designs obtained by trees and random forests are very similar. The ABC approach produces designs with notably earlier sampling times. However, the results obtained by the ABC approach should be treated with caution, since the high noise of the expected loss estimates makes it harder to optimise over the design space, especially in higher dimensions. Moreover, the quality of the ABC posterior approximation deteriorates as the dimension increases. It is also interesting to note that there are hardly any differences between the two loss functions for any given method. This reaffirms our decision to consider only the 0–1 loss in the other examples.
Table 2 Optimal designs obtained by tree classification (cross-validated), random forest classification (using out-of-bag class predictions), and ABC approaches under the 0–1 loss (01L) or multinomial deviance loss (MDL) (\(n = 1\), 2, and 3) for the infectious disease example. The equidistant designs are also shown

Classification performance evaluations of optimal designs
As our next step, we compare the optimal designs found under the different approaches using a random forest classifier. For each of the optimal designs, we train a random forest with 100 trees based on 10K simulations from each model. The misclassification error rates and the misclassification matrices are estimated from a fresh set of 10K simulations from each model. This is repeated 100 times so that the random error in estimating the misclassification error rates can be quantified. The results for all the optimal designs as well as for the equispaced designs are shown in Table 3. For more than two observations, the designs that clearly perform best are those found under the machine learning classification approaches. However, the ABC optimal designs also generally perform well, except for \(n = 5\) design times. We also observe that the loss function used for optimisation has little effect on the performance of the optimal design; only for the designs found using ABC is there a notable difference at \(n = 4\). The equispaced designs perform substantially worse than all the optimal designs up to \(n = 4\) observations. Table 3 also shows that there is almost no gain in classification performance from increasing the number of observations beyond 2; any additional observation adds only a negligible amount of information regarding model discrimination. At some point, adding further uninformative observations adversely affects the classification power of the random forest.
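The evaluation procedure just described can be sketched as follows, reusing the hypothetical `simulate` helper from above; each repetition trains a 100-tree random forest on fresh simulations and estimates the error rate on a further fresh test set.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def evaluate_design(design, simulate, n_reps=100, n_per_model=10_000, n_models=4):
    """Repeatedly estimate the misclassification error rate at a design
    using fresh training and test simulations in every repetition."""
    errors = []
    y = np.repeat(np.arange(n_models), n_per_model)
    for _ in range(n_reps):
        X_tr = np.vstack([simulate(m, design, n_per_model) for m in range(n_models)])
        X_te = np.vstack([simulate(m, design, n_per_model) for m in range(n_models)])
        rf = RandomForestClassifier(n_estimators=100).fit(X_tr, y)
        errors.append(1.0 - rf.score(X_te, y))  # test-set misclassification rate
    return np.mean(errors), np.std(errors)      # average and spread over repetitions
```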
Table 3 Average misclassification error rates for optimal designs obtained by tree classification (cross-validated), random forest classification (using out-of-bag class predictions), and ABC approaches under the 0–1 loss (01L) or multinomial deviance loss (MDL) as well as for the equidistant designs for the infectious disease example. The average misclassification error rates were calculated by repeating the random forest classification procedure 100 times (see text) and taking the average. The standard deviations are given in parentheses

Finally, we compare the optimal designs obtained by the different methods based on approximate posterior model probabilities estimated using ABC, as described in Sect. 4.1.2. To that end, for each design under evaluation we simulate 50 process realisations from the prior predictive distribution of each of the four models at that design and estimate the posterior model probability of the true model using ABC rejection. To obtain precise estimates of the posterior model probabilities for each of the 200 process realisations, we generate 10 million simulations from the prior predictive distribution to build the reference table. To estimate the posterior probabilities for each generated process realisation, we retain the 40K simulations from the reference table that are closest to that realisation with respect to the Manhattan distance of the standardised observations. Box plots showing the distributions of the estimated model probabilities over the 200 prior predictive process realisations for all the optimal designs as well as for the equispaced designs for 1–5 observations are displayed in Fig. 2. The results for all the different optimal designs are very similar, even though the approaches using the 0–1 loss criterion do not directly target the improvement in the posterior model probabilities. The equispaced designs perform appreciably worse up to \(n = 4\) observations. It is also evident that, given the prior information in this example, not much is gained by collecting more than two observations, which mirrors the random forest classification results in Table 3. Assessing the optimal designs using random forests is much faster than performing this ABC simulation study.
A more detailed investigation of the classification performance at the optimal designs can be found in Sect. 4.3 of Online Resource 1.
Two-model epidemiological example with true likelihood validation
Aims and model set-up
For the example in this section, we use the same infectious disease model set-up as in Sect. 4.1, but only the death and SI models (models 1 and 2) from Table 1 are considered. For these two models, the computation of the likelihood function is efficient enough that likelihood-based posterior model probabilities can be computed for a sufficiently large number of prior predictive samples. We can therefore compare the results of our likelihood-free approach using supervised classification methods to the results obtained by using the true likelihood functions to estimate the design criterion. Furthermore, we can assess the resulting optimal designs by computing the expected posterior model probabilities and misclassification error rates based on the true likelihood functions.
Another aim of this example is to demonstrate that the classification approach can easily cope with higher-dimensional designs, where other methods would fail to produce reasonable results in an acceptable amount of time. For the epidemiological example with four models in Sect. 4.1, one can see that there is hardly any gain in increasing the number of design points beyond three, so it makes no sense to consider any higher-dimensional designs. However, in Sect. 4.1 we assume that we can only observe one realisation of the infectious disease process. In this section, in order to explore the performance of our methods for high-dimensional designs, we assume that several independent realisations of the stochastic process can be observed. For example, these independent realisations may pertain to independent populations of individuals. We allow each realisation to be observed at potentially different time points.
For simplicity, we assume that the same number of observations, \(n_d\), is collected for each realisation. If there are q realisations, then the total number of observations and therefore the design dimension is \(n = q \cdot n_d\).
The prior distributions for the parameters are \(b_1 \sim \mathscr{LN}(\mu = -0.48, \sigma = 0.3)\) for the death model and \(b_1 \sim \mathscr{LN}(\mu = -1.1, \sigma = 0.4)\), \(b_2 \sim \mathscr{LN}(\mu = -4.5, \sigma = \sqrt{0.4})\) for the SI model, where \(\mathscr{LN}\) denotes the lognormal distribution.
In this example, we only consider designs based on the misclassification error rate as the design criterion. In order to compute the misclassification error rates based on the likelihoods, it is necessary to compute the marginal likelihoods of both models. When searching for the optimal design, we employ a relatively fast Laplace-type approximation to the marginal likelihood. For validating the resulting designs using the likelihood-based approach, we use a more expensive Gauss–Hermite quadrature scheme to obtain the marginal likelihoods. Details on both integral approximation methods can be found in Sect. 5.2 of Online Resource 1.
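To illustrate, a generic Laplace approximation to the log marginal likelihood might look as follows; this is only a sketch based on the standard Laplace formula, and the exact Laplace-type approximation we use is described in Sect. 5.2 of Online Resource 1.

```python
import numpy as np
from scipy.optimize import minimize

def laplace_log_marglik(log_joint, theta0):
    """Laplace approximation to the log marginal likelihood log p(y | m),
    where log_joint(theta) = log p(y | theta, m) + log p(theta | m) and
    theta0 is a starting value for locating the posterior mode."""
    res = minimize(lambda th: -log_joint(th), theta0, method="BFGS")
    d = len(np.atleast_1d(theta0))
    # res.hess_inv approximates the inverse Hessian of -log_joint at the mode
    sign, logdet_hinv = np.linalg.slogdet(np.atleast_2d(res.hess_inv))
    # log p(y|m) ~ log_joint(mode) + (d/2) log(2 pi) + (1/2) log |H^{-1}|
    return -res.fun + 0.5 * d * np.log(2.0 * np.pi) + 0.5 * logdet_hinv
```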
Example settings and results
When searching for the optimal designs, we employ trees with cross-validation as well as random forests using out-of-bag class predictions for our supervised classification approach. For both classification approaches, we use simulated samples of size 10K (5K per model).
For the likelihood-based approach, the expected 0–1 loss (= misclassification error rate) is estimated by averaging the computed 0–1 loss over a sample of size 400 (200 per model) from the prior predictive distribution. The size of this prior predictive sample is considerably smaller than for the two supervised classification approaches due to computational limitations. Therefore, the volatility of our likelihood-based misclassification error rate estimates is much higher than for the supervised classification methods, so we expect our optimisation procedure to be less stable. The expected loss surface for the one-dimensional design is depicted in Fig. 3 of Online Resource 1.
However, setting the prior predictive sample size for the likelihood-based approach to 10K as well would have made it infeasible to find an optimal design in a reasonable amount of time. Running the classification methods is still much more time-efficient than evaluating the likelihood function many times, especially for the SI model in high dimensions; see also Sect. 3 of Online Resource 1. Therefore, we only used the likelihood-based approach to find designs up to a total design dimension of \(n = 8\). Furthermore, for the design search we used a relatively coarse grid with a spacing of 0.5 days between the limits of 0.5 and 10 days. We used the same design grid for all approaches.
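A sketch of the likelihood-based estimate of the expected 0–1 loss for the two equally likely models is given below; `log_marglik` stands in for either marginal likelihood approximation, and `simulate` is the hypothetical prior predictive sampler used earlier.

```python
import numpy as np

def expected_01_loss_lik(design, simulate, log_marglik, n_per_model=200):
    """Likelihood-based estimate of the expected 0-1 loss for two equally
    likely models; log_marglik(y, m) approximates log p(y | m)."""
    losses = []
    for m in (0, 1):
        for y in simulate(m, design, n_per_model):
            lm = np.array([log_marglik(y, k) for k in (0, 1)])
            # with equal prior model probabilities, the posterior mode is
            # the model with the larger marginal likelihood
            losses.append(int(np.argmax(lm) != m))  # 0-1 loss for this draw
    return np.mean(losses)
```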
We consider various combinations of the number of realisations, q, and the number of observations per realisation, \(n_d\). All the design methods described in this section are applied to all integer combinations of \(1 \le n_d \le 4\) and \(1 \le q \le 4\) for which the total number of observations \(n = q \cdot n_d\) does not exceed 8. We also investigate higher-dimensional designs, where we only employ the supervised classification approaches but not the likelihood-based approach. As higher-dimensional settings we consider all integer combinations of q and \(n_d\) which amount to a total number of observations of either \(n = 12\), 24, 36, or 48, and where \(1 \le n_d \le 4\).
The optimal designs found with the different methods are validated in two ways. First, for each observation from a sample of size 2K (1K per model) from the prior predictive distribution, the posterior model probabilities are computed using the generalised Gauss–Hermite quadrature approximation to the marginal likelihood with \(Q=30\) quadrature points for the death model and up to \(Q=30^2\) quadrature points (minus some pruned points) for the SI model. The resulting distributions of posterior model probabilities are displayed in Sect. 5.3 of Online Resource 1.
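The fragment below sketches a basic, non-adaptive Gauss–Hermite approximation for the one-parameter death model; the generalised scheme with pruning that we actually use, in particular for the two-parameter SI model, is described in Sect. 5.2 of Online Resource 1.

```python
import numpy as np

def gh_log_marglik(log_lik, mu, sigma, Q=30):
    """Gauss-Hermite estimate of log p(y | m) for a model with a single
    lognormal parameter b1 ~ LN(mu, sigma); log_lik(b1) = log p(y | b1)."""
    x, w = np.polynomial.hermite.hermgauss(Q)
    b1 = np.exp(mu + sigma * np.sqrt(2.0) * x)   # substitute z = sqrt(2) * x
    log_terms = (np.array([log_lik(b) for b in b1])
                 + np.log(w) - 0.5 * np.log(np.pi))
    m = log_terms.max()                          # log-sum-exp for stability
    return m + np.log(np.exp(log_terms - m).sum())
```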
We can also use the estimates for the posterior model probabilities to compute estimates of the misclassification error rates for each of the methods and dimension settings. These estimates are provided in the plots on the right-hand side of Fig. 3, where each row contains the results for one design method, the x-axis of each plot shows the total number of observations, n, and each line within each plot displays the results for a particular setting of \(n_d\). Alternatively, one can use a supervised classification method to estimate the misclassification error rates for the optimal designs. In our case, we use a random forest with training and test sets of size 20K (10K per model). The random forest classification procedure is repeated 100 times, and the average misclassification error rate over the 100 repetitions is taken. The random forest-based validation results are shown in the plots on the left-hand side of Fig. 3, analogous to the likelihood-based validation results.
From Figure 3, it is evident that the misclassification error rates computed by the random forests are very close to the likelihood-based misclassification error rate computations. In most cases, the random forest-based estimates of the misclassification error rate are a little higher than the likelihood-based estimates. This is no surprise since the likelihood-based estimates are directly targeting the Bayes error rate. However, the trajectories of the misclassification error rates as a function of n are very similar for both validation methods. This suggests that for this example random forests are suitable to validate and compare the efficiency of the resulting designs. In addition, it is reasonable to expect that the designs which are optimal for the random forest classification approach are close to the true optimal designs.
One can also see from Fig. 3 that, for a fixed total number of observations, there is not much difference in the performance of the different design configurations, at least for the small values of \(n_d\) that we considered. Having \(n_d = 2\) observations per realisation appears to be the best choice, but only by a small margin.
Macrophage model
Aim of experiment
A common challenge in experimental biology is identifying the unobserved heterogeneity in a system. Consider, for example, the experimental system in Restif et al. (2012), in which the authors wished to identify the role of antibodies in modulating the interaction of intracellular bacteria, in particular Salmonella enterica serovar Typhimurium (S. Typhimurium), with human phagocytes, inside which the bacteria can replicate. The experiments assessed the effect of a number of different human immunoglobulin subclasses on the intracellular dynamics of infection by combining observed numbers of bacteria per phagocyte with a mathematical model representing a range of plausible scenarios. These models were fit to experimental data corresponding to each human immunoglobulin subclass in order to determine the underlying nature of the interactions between the antibodies and the bacteria. In these experiments, the data showed bimodal distributions in the number of intracellular bacteria per phagocytic cell. The aim was to identify the source of the unobserved heterogeneity in the system that caused the observed patterns. Specifically, is there underlying heterogeneity in the bacteria's ability to divide inside phagocytes, or is it the phagocyte population which is heterogeneous in its ability to control bacterial division? In this context, the classification approach allows us to find the experimental design which best enables us to discriminate between the three competing hypotheses: (1) unobserved heterogeneity in the bacteria, (2) unobserved heterogeneity in the cells, or (3) no heterogeneity.
Experimental procedure
We give a brief account of the experimental procedure:
- After bacterial opsonisation (i.e. the process by which bacteria are coated by antibodies), the bacteria are exposed to the phagocytic cells for a total of \(t_{exp}\) hours, which can take the values \(t_{exp} \in \{0.10, 0.20, \ldots , 1.50\}\). During this time, phagocytosis occurs, i.e. the bacteria are internalised by the phagocytic cells.
- Next, the cells are treated with gentamicin, an antibiotic that kills extracellular bacteria, so that phagocytosis stops.
- At each of the n observation times \(\varvec{t}_{obs}=(t_1,\dots ,t_n)\) hours post-exposure, two random samples of S cells each are taken from the overall population of cells: one sample to count the proportion of infected cells (under a low-magnification microscope), and one sample of infected cells to determine the distribution of bacterial counts per infected cell (at higher magnification).
That is, a design is composed of \(\mathbf {d}=(t_{exp}; \varvec{t}_{obs})\). The full experimental procedure is detailed in Restif et al. (2012).
For the purpose of our example, we consider a realistic scenario where we have the resources to count a fixed number of cells, \(N_{cells}=200\). These cells are then equally split between all the observation times and the two independent observational goals at each observation time, so \(S = \lfloor N_{cells} / (2 \, n) \rfloor \).
Model
We consider three mathematical models, based on Restif et al. (2012), to represent the three competing hypotheses about heterogeneity. These models are continuous-time Markov processes that simulate the dynamics of intracellular bacteria within macrophages. Model (1) tracks the joint probability distribution of the numbers of replicating and non-replicating bacteria within a single macrophage, assuming all macrophages in a given experiment are of the same type. In model (2), each macrophage has a fixed probability \(q\) of being refractory, in which case it only contains non-replicating bacteria, and a probability \(1-q\) of being permissive, in which case it only contains replicating bacteria. In model (3), all macrophages are permissive and all bacteria are replicating.
Simulations from the models are based on simulations of bacterial counts for the individual macrophages. As for our infectious disease examples, we can use the efficient Gillespie algorithm (Gillespie 1977). The outcomes for the individual macrophages are then aggregated to obtain the same type of data as observed in the real experiment.
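A sketch of the simulation strategy for model (2) is given below; the single-cell simulators `sim_refractory` and `sim_permissive` (which would internally run the Gillespie algorithm) are hypothetical, and the sampling design details of the real experiment are ignored here.

```python
import numpy as np

def simulate_model2(q, S, t_obs, sim_refractory, sim_permissive, rng=None):
    """Aggregate per-macrophage simulations for model (2), in which each
    macrophage is refractory with probability q (only non-replicating
    bacteria) and permissive otherwise (only replicating bacteria).
    Each single-cell simulator returns the bacterial count of one
    macrophage at the observation times t_obs."""
    rng = np.random.default_rng() if rng is None else rng
    counts = np.empty((S, len(t_obs)), dtype=int)
    for i in range(S):
        if rng.random() < q:                     # refractory macrophage
            counts[i] = sim_refractory(t_obs, rng)
        else:                                    # permissive macrophage
            counts[i] = sim_permissive(t_obs, rng)
    return counts                                # aggregated per-cell counts
```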
It is possible but cumbersome to compute the likelihood functions for all the models. Computing the likelihood involves solving a system of linear differential equations, which can be achieved by using matrix exponentials. However, these operations are quite expensive so that computing the posterior model probabilities becomes very costly. Computing the expected losses and searching for an optimal design can be considered intractable in these circumstances. In contrast, simulations from the models can be obtained very quickly.
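For instance, obtaining the transition probabilities of such a continuous-time Markov process over a time step requires a matrix exponential of the generator, as sketched below; it is this operation, repeated inside every likelihood evaluation, that makes likelihood-based computation costly.

```python
from scipy.linalg import expm

def ctmc_transition_matrix(Q, dt):
    """Transition probability matrix P(dt) = expm(Q * dt) of a
    continuous-time Markov process with generator matrix Q. For the
    macrophage models the state space of replicating/non-replicating
    bacterial counts is large, so this matrix exponential dominates
    the cost of each likelihood evaluation."""
    return expm(Q * dt)
```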
Section 6.1 of Online Resource 1 contains a more detailed description of the Markov process models, the simulation procedure, the likelihood function, and the prior distributions.
Results
We use our machine learning classification approach, with classification trees with cross-validation or random forests, to determine the optimal designs for discriminating between the three competing models (one model corresponding to each hypothesis) with respect to the misclassification error rate. It is assumed a priori that the models are equally likely. We use 5K simulations from the prior predictive distribution of each model during the design process. The design grid for \(\varvec{t}_{obs}\) goes from 0.25 to 10 with a spacing of 0.25. The optimal designs are given in Sect. 6.2 of Online Resource 1. The tree and the random forest classification approaches lead to very similar designs.
Similar to the other examples, we assess each design by producing 10K new simulations under each model at that design and using these to train a random forest with 100 trees. A further 10K new simulations per model are then used to estimate the misclassification error rate. This is repeated 100 times for each design. The estimated misclassification error rates for the designs found under the tree and random forest classification approaches are shown in Table 4. For comparison, we also include the estimated error rates for the equispaced designs.
Table 4 Average misclassification error rates for the optimal designs obtained under the classification approaches using trees or random forests and for the equispaced designs for the macrophage model. The average misclassification error rates were calculated by repeating the random forest classification procedure 100 times and taking the average. The standard deviations are given in parentheses

We are also interested in the posterior model probabilities at the different optimal designs. At each optimal design, we simulate 20 process realisations under the prior predictive distribution of each model. For each process realisation, we approximate the posterior model probability of the model that generated the data using importance sampling (see, e.g. Liu 2001) with 50K simulations from the importance distribution; in our case, the prior distribution serves as the importance distribution. Figure 4 shows box plots of the distributions of the posterior model probabilities of the correct model over the prior predictive simulations for the different optimal designs. The computation time required to generate one of these box plots ranged from 5.9 to 17.2 hours using up to 24 parallel threads. In contrast, it took less than two minutes to obtain one estimate of the misclassification error rate using a random forest with training and test samples of size 30K each.
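The importance sampling step can be sketched as follows; since the prior serves as the importance distribution, the marginal likelihood of each model reduces to the likelihood averaged over prior draws. The `log_lik` function and the `prior_draws` arrays are hypothetical stand-ins.

```python
import numpy as np
from scipy.special import logsumexp

def posterior_model_probs(y, log_lik, prior_draws):
    """Posterior model probabilities via importance sampling with the
    prior as importance distribution, assuming equally likely models.

    log_lik(y, theta, m) : log likelihood of y under model m
    prior_draws[m]       : array of draws from the prior of model m
    """
    n_models = len(prior_draws)
    log_ml = np.empty(n_models)
    for m in range(n_models):
        ll = np.array([log_lik(y, th, m) for th in prior_draws[m]])
        log_ml[m] = logsumexp(ll) - np.log(len(ll))  # Monte Carlo marginal likelihood
    # equal prior model probabilities cancel in the normalisation
    return np.exp(log_ml - logsumexp(log_ml))
```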
Table 4 indicates that \(n = 2\) observation times yield the best classification power when using trees and random forests, even though the posterior model probabilities of the correct model keep increasing until at least \(n = 4\) (see Fig. 4). For more than two observations, the higher data dimension impedes the classification accuracy of these classification methods and more than offsets the gains from having marginally more information in the data due to the better allocation of resources across the observation times. However, there are no substantial improvements in the posterior probabilities after \(n = 2\) either. Both machine learning classification approaches lead to very efficient designs for all design sizes.
Overall, the ability to correctly classify output from the three models and thus to decide between the three competing hypotheses is very good at all the optimal designs. This suggests that we are able to identify with high certainty if heterogeneity is present, and if so, whether the bacteria or the human cells are the source of this heterogeneity.