Evaluation of combinatorial optimisation algorithms for c-optimal experimental designs with correlated observations

We show how combinatorial optimisation algorithms can be applied to the problem of identifying c-optimal experimental designs when there may be correlation between and within experimental units and evaluate the performance of relevant algorithms. We assume the data generating process is a generalised linear mixed model and show that the c-optimal design criterion is a monotone supermodular function amenable to a set of simple minimisation algorithms. We evaluate the performance of three relevant algorithms: the local search, the greedy search, and the reverse greedy search. We show that the local and reverse greedy searches provide comparable performance with the worst design outputs having variance \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$<10\%$$\end{document}<10% greater than the best design, across a range of covariance structures. We show that these algorithms perform as well or better than multiplicative methods that generate weights to place on experimental units. We extend these algorithms to identifying modle-robust c-optimal designs. Supplementary Information The online version contains supplementary material available at 10.1007/s11222-023-10280-w.


Introduction
We consider the question of how to identify a c-optimal design when the observations are correlated.In particular, we assume the data generating process can be described using a generalised linear mixed model (GLMM).For a N × 1 vector of outcomes y, with an N × P matrix X of covariates and a N × Q 'design matrix' for random effects Z, a GLMM can be written as: where β are mean function parameters, h(.) is a link function, F (.) is a distribution function with scale parameter(s) φ, and u ∼ N (0, D) is a vector of random effects with covariance matrix D. Such models provide a flexible parametric approach to estimation of covariate effects when the observations are correlated, such as longitudinal designs (Zeger et al. 1988), cluster randomised trials (Hussey & Hughes 2007), and geospatial statistical modelling (Diggle et al. 1998).
The rows i = 1, ..., N of matrices X and Z define a discrete set of possible observations in the experiment.In some cases, it is assumed that the discrete design points are formed by a uniform lattice over a continuous design space (e.g.Yang et al. (2013)).For many study designs with correlation, the observations are grouped into blocks, clusters, or other set; we refer to such a group as an 'experimental unit'.An experimental unit is: e j ⊂ {1, ..., N } for j = 1, ..., J where we assume j e j = {1, ..., N } and e j ∩ e j = ∅ for j = j , that is all observations are in one and only one experimental unit.Without loss of generality we assume that all the experimental units have the same size r.A design space then consists of all the experimental units:

., J}
A design d ⊂ D of a maximum size m < N is then: The total number of observations in the design is n = mr.The design problem we face is then to find the design d * ⊂ D that minimises some objective function.There are multiple criteria used in the experimental design literature, here we only discuss c-optimality.

Information Matrix
We consider GLMMs where F (.) is in the exponential family, including Gaussian, binomial, and Poisson models.The likelihood of the model parameters is f y|u (y i |u, β, φ)f u (u|θ)du (2) where f y|u (y i |u, β, φ) = exp (y i η i − c(η i ))/a(φ) + d(y i , φ), f u is the multivariate Gaussian density, η i = x i β + z i u is the linear predictor, x i and z i are the ith rows of matrices X and Z, and θ are the parameters that define the covariance matrix D.
The information matrix for the model parameters β is: For the generalised least squares estimator of the GLMM this information matrix is equiv- where Σ = Cov(y).We use M d to denote the information matrix for design d.
The c-optimal objective function g : 2 d → R ≥0 we consider is then: where c is a p × 1 vector such that c ∈ range(M ) to ensure estimability of c T β (Pukelsheim 1980).The vector c often has all elements equal to zero except for a one at the position corresponding to the treatment effect parameter of interest.It is possible for a design not to produce a positive semi-definite information matrix, such as when the matrix X is not of full rank, which will occur if there are too few observations, for example.For these cases we assume that the objective function takes the value of infinity, i.e. the design provides no information about the parameters.The consequences of this choice are discussed later.
Our optimisation problem is the to find the design d ⊂ D of size m that minimises g(d).

Previous Literature
It is generally not possible to identify exact c-optimal designs in this context.For designs with correlated observations perhaps the most common approximate method is to identify the optimal 'weights' to place on each experimental unit or design point.These weights can be interpreted as an "amount of effort" to place on the observations in the design.Elfving's Theorem is a classic result in theory of optimal designs that underlies this method (Elfving 1952, Ford et al. 1992, Studden 2005).For independent identically distributed observations (i.e. when Σ = σ 2 I) we can write the information matrix as the sum M = i x T i x i for all observations i = 1, ..., N .We can place a probability measure over the observations ρ = {(ρ i , x i ) : i = 1, ...N ; ρ k ∈ [0, 1]} where ρ i are the weights on each observation.The information matrix of the approximate design is M (ρ) = i x T i x i ρ i .Elfving's theorem provides a geometric characterisation of c-optimality and shows that an optimal design ρ lies at the intersection of the convex hull of the x i and the vector defined by c.The values of ρ for the optimal design can then be obtained using linear programming methods (Harman & Jurík 2008).
there is no correlation between experimental units.In this case, the information matrix (4) can be written as the sum of the information matrices for each experimental unit: where we use X e j to represent the rows of X in e j , and similarly Σ e j is the submatrix of Σ given by the rows and columns in e j .This expression can be rewritten as j X T e j F F T X e j where F j is a square root of Σ −1 e j .The weights ρ are placed on each experimental unit so that a design can be represented by the pairs {(e 1 , ρ 1 ), ..., (e J , ρ j )} and the problem reduces to identifying the optimal weights using the generalised Elfving theorem.Sagnol (2011) shows that the optimal weights for each experimental condition can be solved using conic opimisation methods with a second order cone program.We refer to these approaches generally as 'multiplicative'.Holland-Letz et al. (2012) provide an estimate on the lower bound of the efficiency of multiplicative methods in a article examining optimal assignment of individuals to different dose schedules in a pharmacokinetic study.Prior approaches to optimal experimental designs with correlated observations relied on asymptotic arguments (Sacks & Ylvisaker 1968, Muller & Pázman 2003, Näther 1985).
One potential limitation of multiplicative methods is that there are several different approaches to rounding weights to integer totals of experimental units (Balinski & Young 2002).Pukelsheim & Rieder (1992) determine the optimal rounding scheme when at least one of each type of experimental unit is required.However, for many design problems this restriction is not necessary.As such, different rounding methods may produce different designs, which may not necessarily be optimal.Multiplicative methods also have the limitation that they cannot be extended to designs where there may be correlation between experimental units.We may also wish to accommodate restrictions in the design space, such as a maximum or minimum number (or weight) on particular experimental conditions given practical restrictions.
In this article, we show that 'combinatorial' algorithms are applicable to the problem of identifying c-optimal experimental design with correlated observations both within and between experimental units and compare their performance on a set of example study designs.
'Combinatorial' optimisation methods aim to select the optimal set of discrete items from a larger finite set.Where relevant, we refer to these methods as 'combinatorial algorithms' to differentiate them from the multiplicative methods.These algorithms can identify local minima, however a combinatorial approach cannot guarantee a global minimum is found.
Results from the optimisation literature show that the difference between the solutions from the algorithms and global minima can be bounded though.We discuss the relevant combinatorial algorithms in Section 2 and how they can be applied to the c-optimal design problem, including both Gaussian and non-Gaussian models.In Section 3 we compare the performance of these algorithms across a set of example problems.Section 4 compares the performance of multiplicative and combinatorial approaches, and section 5 extends the discussion to robust optimisation.

Monotone Supermodular Function Minimisaton
A function g is called supermodular if: for all d ⊆ d.That is, there are diminimising marginal reductions in the function with increasing size of the design.The function is monotone decreasing Equation ( 6) shows that when the observations in different experimental units are independent, then the information matrix can be written as a sum of information matrices for each unit.We can derive a more general expression for the marginal change to the information matrix when observations are added.Let d and d be two designs such that d ⊂ d ⊂ D and d = d ∪ d .We let X 1 and X 2 be the covariate matrices for designs d and d , respectively, and Σ 1 and Σ 2 be their covariance matrices.Σ 12 is the covariance between the observations in designs d and d .Then, where S = (Σ 2 − Σ T 12 Σ −1 1 Σ 12 ) is the Schur complement, which we assume is invertible, and we use δM d ,d to represent the marginal change in the information matrix of M d when the additional observations in d are added.It is evident that if Σ 12 = 0 then δM d ,d reduces to X T 2 Σ −1 2 X 2 as in Equation ( 6).We also note that δM d ,d is positive definite such that M d M d (where A B means A − B is positive semi-definite), which implies that g(d) ≤ g(d ) for d ⊆ d.Thus, the function g is monotone decreasing.
For our specific c-optimality problem, we can re-express (7) using Equation (8) as: where the second line follows from Hua's identity.The function is then supermodular if: It can be shown that this condition is satisfied if δM d,e j is symmetric, positive semidefinite for all d, which Equation (9) shows to be the case if the covariance matrix and Schur complement are invertible.Thus, the c-optimal design problem for the GLMM under the GLS estimator is monotone supermodular.
A function is called submodular if in Equation ( 7) the inequality is reversed.The maximisation of a submodular function is generally equivalent to the minimisation of a supermodular function (Sviridenko et al. 2017).The literature often discusses the former, but both problems have the same algorithms associated with finding approximate optimal solutions.

Local search algorithm
Algorithm 1 shows the local search algorithm.We start with a random design of size m and at each step of the algorithm the best swap of an experimental unit in the design with one not in the design is made until there are no more swaps that improve the design.
For monotone supermodular function minimisation in general, there is no guarantee the local search will converge to the globally optimal design.However, the algorithm does have a provable 'constant factor approximation'.If d is the output of the local search algorithm and d * is the global minimiser of the function, then the constant factor approximation is the upper bound of g(d)/g(d * ).Fisher et al. (1978) showed that under a cardinality constraint (such as |d| ≤ n) the optimality bound is 3/2.Filmus & Ward (2014) improved this bound to (1 + 1/e) by using an auxilliary function in place of g that excludes poor local optima.Feige (1998) showed that further improving these bounds is an NP-hard problem and Nemhauser & Wolsey (1978) showed that algorithms that improve on these bounds require an exponential number of function evaluations, rather than the polynomial number of the local search.In practice the local algorithm, and those discussed below, may perform significantly better than their lower bounds would suggest, however there exists little empirical evidence for the types of design we consider.We also note that Fedorov (1972) developed the first local search algorithm for D-optimal designs (i.e. a design that maximises det(M d )) with independent observations; several variants were later proposed (Nguyen & Miller 1992).Although there is presently no proof that such an approach converges to a D-optimal design.

Greedy and reverse greedy search algorithms
Algorithm 2 shows the "greedy algorithm".We start from the empty set (d = ∅) and at each step of the algorithm add the experimental unit with the smallest marginal increase in the objective function.Rather than sequentially adding experimental units, one can start from the complete design space D and sequentially remove units.This is the "reverse greedy algorithm", which is shown in Algorithm 3.
The constant factor approximations for the greedy and reverse greedy algorithms are Algorithm 3: Reverse greedy search algorithm more complex that the local search case.In the case of submodular function maximisation, a famous result is that the constant factor approximation is 1 + 1/e (Nemhauser & Wolsey 1978).However, this result does not carry over to minimising a supermodular function (Il'ev 2001).Indeed, it is not possible to implement the greedy algorithm for the design problems we discuss, as all designs with fewer than p observations, and many with more than p observations, will result in a non-positive semidefinite information matrix.We can start the algorithm from a random small design, as Algorithm 2 describes, but this of course would sacrifice any theoretical guarantees.Il'ev (2001) discusses the approximation factor for the reverse greedy algorithm in the case of minimising a supermodular function.The result depends on the 'steepness' or curvature of the function g, which is defined as: In Equation ( 5) we specified that the function had infinite variance for the empty set, in which case the steepness would be one, which is equivalent to an unbounded curvature.In these cases the reverse greedy search does not have an approximation factor (Il'ev 2001, Sviridenko et al. 2017).Specifying the value of the function to be undefined would also fail to provide a bound.
Greedy algorithms have been used in the experimental design literature previously.Yang et al. (2013) developed a continuous sequential/greedy algorithm to identify optimal designs for generalised linear models under a range optimality criteria, although not for coptimality.They discretised the design space using a regular lattice and showed convergence to optimal designs as the number of lattice cells grows.Variants and combinations of these methods have also been proposed, for example, the 'Cocktail algorithm' combines a sequential algorithm with two other algorithms in each step to find D-optimal designs (Yu 2011).Fedorov (1972) proposed a variant of this algorithm for D-optimal designs, often called a sequential algorithm, in which observations are sequentially added to an existing design until a convergence criterion identifying D-optimality is reached.Fedorov (1972) showed this algorithm produced a D-optimal design for linear models without cardinality constraint.Accelerated (or adaptive) greedy algorithms provide significant computational improvements on the standard greedy algorithm by avoiding recomputation of the objective function (Robertazzi & Schwartz 1989).Accelerated greedy algorithms have been used in experimental design, designing sensor networks, and other problems (Yang et al. 2019, Zou et al. 2016, Guo et al. 2019).However, again, there are few applications for c-optimality.
Given the lack of theoretical guarantees, one may consider these algorithms irrelevant to the c-optimal design problem.However, for some of the areas we use as examples below, they have been used informally.For example, there has been growing interest in methods to identify c-optimal designs for cluster randomised trials (e.g.Girling & Hemming (2016), Hooper et al. (2020)).Several recent articles have used an algorithmic approach that involves sequential removal of observations from a design space to identify c-optimal cluster randomised trial designs (Hooper et al. 2020), or using the change in variance of treatment effect estimators when experimental units are removed to identify efficient designs (Kasza & Forbes 2019).While these algorithms lack theoretical guarantees, they may empirically still perform adequately for these design problems.They also run faster than the local search.So we include them in the empirical comparisons below.

Computation and approximation 2.2.1 Information Matrix Approximations
The greatest limitation on executing these algorithms is the evaluation of the information matrix (4) as it requires estimation and inversion of Σ or estimation of the gradient of the log likelihood.As we discuss in Section 2.2.2, once the covariance matrix is obtained, updating its inverse after adding or removing an observation can be done relatively efficiently.
However, an efficient means of generating Σ is still required for non-linear models.Breslow & Clayton (1993) used the marginal quasilikelihood of the GLMM to propose the following first-order approximation: where W is a diagonal matrix with entries Var(y|u) , which are recognisable as the GLM iterated weights (McCullagh & Nelder 1989).The approximation is exact for the Gaussian model with identity link.Higher order approximations exist in the literature, however, their use has not been found to improve the quality of optimal designs, at least in the case of D-optimality.Waite & Woods (2015) consider the case of D-optimal designs and compare (11) with the GEE working covariance matrix.They find it does not perform as well as the approximation based on the marginal quasilikelihood.We do not consider the GEE covariance in this article, as we aim to use explicit covariance functions with different parameterisations.Zeger et al. (1988) suggests that when using the marginal quasilikelihood a better approximation to the marginal mean can be found by "attenuating" the linear predictor in non-linear models.For example, with the log link the "attenuated" mean is E(y i ) ≈ ) and for the logit link Waite & Woods (2015) find that using attenuated parameters with the approximation (11) can achieve more efficient designs, at least for D-optimality.For the non-linear models below we compare with and without attenutation.RWaite & Woods (2015) also propose approximations based on (3).They consider blocked designs where there is no correlation between experimental units, and so the information matrix can be computed as the sum of information matrices of the units in the design as Equation ( 8).
For a binomial-logistic model, the information matrix can then be calculated using (3) by completely enumerating the outcome space if the size of the experimental unit is relatively small.However, we consider designs where there may be correlation between experimental units, limiting the computational tractability of such an approach as a complete enumeration of the outcome space would be infeasible.

Algorithm Efficiency
Equation ( 8) also shows how we can achieve some computational efficiency with correlated observations.A naive optimisation approach that recalculated the information matrix for each design would require at least O(n 3 ) operations by needing to invert the covariance matrix each time.We can iteratively add or remove single observations at a time, i.
3 Comparative performance

Comparison of Algorithms
We consider several examples to compare the three algorithms described above.We compare them in two areas: quality of solution and computational time.As the local and greedy search algorithms have a random starting set, we run them each 100 times; the reverse greedy search is deterministic and so is run only once per example.
We use the approximation (11) for all the analyses.For the examples using a Gaussianidentity model, the approximation is exact.For non-linear models (binomial-logistic, binomial-log, and poisson-log) we make the additional comparison between an approximation with and without attenuation as described in Section 2.2.1.
For each example we calculate the 'relative efficiency' as the ratio of the variance (i.e. the value of c T M −1 d c) of the design(s) from each algorithm compared to the variance of the best design from all algorithms expressed as a percentage.For the non-linear models, we only evaluate the single best design from each algorithm, with and without attenuated parameters, using Equation ( 6).Enumerating the complete outcome space to evaluate the expectations in (6) would not be possible, so we use Monte Carlo integration with 100,000 iterations to estimate the relative variance.
We also report the approximate running time of each algorithm.Timings were made on a computer with Intel Core i7-9700K, 32GB RAM, Windows 10, and the program was compiled with gcc compiler.We make all these algorithms available as part of the glmmrOptim package for R.

Applied Examples
Our examples are derived from two study types that motivated this article.Identifying optimal cluster randomised trial designs, and determining optimal sampling locations to estimate treatment effects in a geospatial setting.We describe each of these in turn along with their associated examples.

Cluster randomised trial
A cluster randomised trial is a type of randomised trial design in which groups, or 'clusters', linear predictor for an observation i in cluster k at time t is: where ∆ kt is a treatment indicator equal to one if the cell has the intervention and zero otherwise as shown in Figure 1, τ t are T time-period indicators (so we do not include an intercept).We consider two covariance function specifications for ikt at the cluster-level and examine both cross-sectional and cohort designs.The covariance functions represent the most widely used specifications for these studies (Li et al. 2021).First, an exchangeable covariance function with cluster and cluster-period random effects with cross-sectional sampling in each time period (Hussey & Hughes 2007, Hemming et al. 2015): Second, an autoregressive covariance function: For models with cohort effects where the same individuals appear in each cluster in every period, we modify these covariance functions by adding an additional term σ 2 c to the covariance if the individual is the same i = i .For Gaussian models we notate the observation-level variance of the error term as σ 2 e .Table 1 lists the different model specifications we examined for the cluster randomised trial and other examples.We include both linear and non-linear models with differing covariance structures and parameters.The choice of covariates generally represents a range from 'high' to 'low' levels of between-cluster correlation in these settings (Hemming et al. 2015(Hemming et al. , 2020)).We specify a maximum number of individuals per cluster-period of 10, and aim to identify an optimal design of m = 100 individuals (out of a possible 300); an experimental unit is a single observation from an individual.We set c = (1, 0, ..., 0) T .

Geospatial sampling
There is a broad literature on selecting the optimal sampling locations (and times) to draw samples across an area of interest (e.g.Chipeta et al. (2017)).However, these sampling patterns are generally designed to estimate a statistic like the prevalence of a disease across an area and its spatio-temporal distribution.A related, but possibly more complex, question asks where across an area one should sample to provide the most efficient estimates of point-source interventions with spatially-heterogeneous effects.A geospatial statistical model can be represented as a GLMM (Diggle et al. 1998), thus if the possible sampling locations are discretised, the design problem is amenable to the methods in this article.
Table 1: Model specifications for the comparison examples for the cluster randomised trial.
Our design space is a unit-square A = [0, 1] 2 .The space is divided into a regular 15 × 15 lattice, where observations are made at the cell centroids a ∈ A. An intervention is located at the point z = (0.5, 0.5).The mean function is specified as: To accomodate the non-linear mean function in the framework described above we use an additional first-order approximation to the information matrix (following Holland-Letz et al. (2011,2012) and others): where the first column of F is a vector of ones, the second column is ∂µ/∂β 1 = exp(−β 2 |a− z|), and the third column is ∂µ/∂β and β 2 = 4.We specify a Poisson distribution with log link function.Finally, we specify an exponential covariance function: which is commonly used in geospatial applications.Our aim is to find an optimal design of size m = 80 (of a total possible 325).We set c = (0, 1, 0.1) T .Table 1 lists the parameter and model specifications for these examples.

Results
Table 2 reports the relative efficiency and approximate computational times for the Gaussian examples.The greedy search performed the worst of the three algorithms is all examples.The local and reverse greedy searches both often found the optimal design, although the reverse greedy algorithm was more consistent with the worst design having a variance only 0.1% greater than the local search.designs from the local search compared to the reverse greedy search.For the cross-sectional cluster trial designs the worst design produced by the local search had a variance less that 1% greater than the best design, but with the cohort cluster trial designs this rose to 12%.
The same design was produced on every iteration for the geospatial example.Running times for these examples ranged from 1 to 100 seconds.The local search scaled poorly.
The cross-sectional cluster trial examples had 30 unique experimental units compared to 60 for the cohort design and 325 for the geospatial example.Notably, the optimal design for Example H is not symmetric owing to the heteroskedastic variance of the binomial-log model.

Comparison with Multiplicative Methods
An approximate 'multiplicative' design is characterised by a probability measure over unique experimental units φ = {(ρ j , e j ) : j = 1, ...J; ρ j ∈ [0, 1]; j ρ j = 1} where ρ j are the weights associated with each experimental units.These weights can be identified for design spaces with uncorrelated experimental units (Holland-Letz et al. 2012, Sagnol 2011).However, there are several methods for rounding proportions to integer counts that sum to a given total (Balinski & Young 2002).Briefly, for a given set of weights ρ j and a target total of number of experimental units m in the design, the methods for determining the number of each experimental units m j such that j m j = m, are: 1. Hamilton's method Let π j = n * ρ j , we assign π j of each experimental condition.The remaining total is filled by the experimental conditions with the largest remainders |π j − π j | until we have n experimental conditions.

Divisor methods
Start with all n j = 0 and let π j = n * φ j .Proceeding iteratively, we choose the next experimental condition in the design to be that with max j π j /α(n j ), for which we update the total until the condition j n j = n is met.
Pukelsheim & Rieder (1992) showed that the optimal rounding method was a variant of Adam's method, although under the assumption that there is at least one of each experimental unit.However, for many experimental design problems this assumption is not required, including the example we examine below.To compare the performance of the multiplicative methods discussed in Section 1.2, we take the best design from the three combinatorial algorithms, and compare it to the design with the lowest variance from across all rounding methods.To identify the optimal approximate design ρ we use the second-order cone program proposed by Sagnol (2011), which is also implemented in the R package glmmrOptim along with the rounding methods.

Example
We return to the cluster randomised trial examples, setting the experimental unit to a whole cluster sequence, i.e. a row consisting of five time periods each with ten individuals in Figure 1.We wish to include m clusters in the design, each of which may be assigned to any of the experimental conditions, and consider m = 6 to m = 30.We use the models specified as Examples A-D in Table 1.
Figure 5 presents the ratio of the variance of the best design from the multiplicative algorithm to the variance of the best design from the combinatorial algorithms.For smaller sample sizes, particularly with an odd sample size, the multiplicative algorithm performed worst than the comibinatorial approach, although the worst design had a variance only 2% larger than the best design.For larger sample sizes the two methods generally produced

Robust optimal designs
The analysis and discussion so far has been based on the assumption that the correct model specification is known.However, this is typically an unrealistic assumption, and it is well known that an optimal design for one model may perform very poorly for an alternative model.We consider a model robust optimal design criterion that is amenable to combinatorial optimisation.Following Dette (1993), we assume that the "true" model belongs to a known class of GLMMs.We can define a model with the collection G = (F, h, β, θ).The class of models is then Att.

Conclusion
In this article, we have showed that the c-optimal design criterion is a monotone supermodular function for GLMMs using the GLS information matrix.We evaluated the performance of three supermodular function minimisation algorithms to identify c-optimal experimental designs with correlated experimental units.The theoretical upper bound on the relative variance of a design from the local search algorithm is 1.5 and no bound exists for the greedy and reverse greedy algorithms, however, for the examples we considered the performance is significantly better than 1.5 times the best design.The greedy algorithm performed the worst, which was to be expected given that it cannot be executed fully as it cannot start from the empty set.The local search and reverse greedy searches performed comparably in terms of their best designs, although the local search could produce designs with varaince more than 10% larger than the output of the reverse greedy search.Thus, the local search needs running multiple times to provide a reliable output.The local search also had poorer scaling in terms of computation time than the reverse greedy.Thus, while the reverse greedy search lacks a theoretical guarantee, it would be favoured empirically for the types of study design considered here.
We showed that the algorithms could also be applied for model robust optimal design identification using a weighted average design criterion.The method uses a prior, specifying the weights to place on each possible model.This specification suggests a way of applying these algorithms for use in Bayesian optimal design.Chaloner & Verdinelli (1995) comprehensively review Bayesian experimental design criteria and show the Bayesian c-optimality function to be: where V 0 is the prior covariance of the β parameters and p(θ) and p(β) the prior density function for the covariance and linear predictor parameters, respectively.One can approximate the integral above using a Riemann sum, which would discretise the parameter space and provide a set of weights to place on each model.Recent advances have generated general algorithms for Bayesian optimal design problems with non-linear models, in particular Overstall & Woods (2017).Further research is needed to determine whether an approximation using the simple algorithms in this paper provide a viable or useful alternative to more advanced approaches.
We cannot guarantee that the optimal design was included in the output of any of the algorithms.However, our comparison with other multiplicative methods provides some reassurance.For designs with correlation within but not between experimental units, deriving weights using multiplicative methods for each experimental unit provides one method of approximating an optimal design (Holland-Letz et al. 2011, Sagnol 2011).Combinatorial approaches produced the same or better designs in the examples we considered.
Optimal designs may sometimes be impractical or difficult to implement.The design in the right panel of Figure 3 is highly unlikely to ever be implemented.However, being able to identify approximately optimal designs provides a benchmark against which to justify proposed experiments.Many types of study that can be described by GLMMs, such as cluster randomised trials or spatio-temporal sampling across an area, can be significant and expensive undertakings.The combinatorial algorithms provide a means of identifying near-optimal or optimal designs to support their planning.Many design problems are not inherently discrete; however, we can discretise the design space by specifying a set of design points (Yang et al. 2013).Thus, the methods evaluated in this article provide a useful set of tools to support study design.

SUPPLEMENTARY MATERIAL
7 Rank-1 down/up dating to remove/add observations

Removing an observation
For a design d with m observations with inverse covariance matrix Σ −1 d we can obtain the inverse of the covariance matrix of the design with one observation removed d = d/{i}, Σ −1 d as follows.Without loss of generality we assume that the observation to be removed is the last row/column of Σ −1 d .We can write Σ −1 d as where C is the (m − 1) × (m − 1) principal submatrix of B, d is a column vector of length (m − 1) and e is a scalar.Then,

Adding an observation
For a design d with m observations with inverse covariance matrix Σ −1 d , we aim now to obtain the inverse covariance matrix of the design d = d ∪ {i }.Recall that Z is a R × Q design effect matrix with each row corresponding to a possible observation.We want to generate H −1 = Σ −1 d .Note that: where f = Z i∈d DZ i is the column vector corresponding to the elements of Σ = W −1 + ZDZ T with rows in the current design and column corresponding to i , and h is the scalar W −1 i ,i + Z i DZ T i .Also now define: so that and and u = (f T , 0) T and v = (0, ..., 0, 1) T , both of which are length m column vectors.So we can get H * * from H * using a rank-1 update as H * * = H * +uv T and similarly H = H * * +vu T .
Using the Sherman-Morison formula: and

2 :
Data: X,Z,Ω,β,φ,D Result: An optimal design d * Let d 0 be size m design ; Set δ = 1 and d ← d 0 ; while δ < 0 do foreach e j ∈ d and e j ∈ D/d do Calculate g(d/{e j } ∪ {e j }); ; Set d ← argmin j,j g(d/{e j } ∪ {e j }) ; δ = g(d ) − g(d) ; if δ > 0 then d ← d end end Algorithm 1: Local search algorithm Data: X,Z,Ω,β,φ,D,n Result: An optimal design d * Let d be a non-degenerate design of size p ≤ s < m ; Set k = 0; while k < m do foreach e j ∈ D/d do Calculate g(d ∪ {e j }); ; Set d ← d ∪ argmin e j g(d ∪ {e j }) ; k ← k + 1 end Algorithm Greedy search algorithm Data: X,Z,Ω,β,φ,D,n Result: An optimal design d * Let d = D be the design containing all experimental conditions; Set k = J; while k > m do foreach e j ∈ d do Calculate g(d/{e j }); ; Set d ← d/ argmin e j g(d/{e j }) ; e. moving from d to d/{i} ∪ {i } and i = i , so the calculation only requires O(rn 2 ) operations to add or remove an experimental condition through rank-1 up/down dates of the inverse covariance matrix (see supplementary material).The total running time of the local search algorithm scales with O(m 4 r 3 (J − m)), as we have to evaluate swapping m experimental units with J − m remaining units of size r up to m times.The experimental units are not always unique, in these cases we can detect any duplicated experimental conditions and only evaluate any swap involving it once, while keeping track of the number of copies in the design space and design, to reduce running time.As such the values of m and J in the expression for complexity can be interpreted as the numbers of unique experimental units in the design and design space, respectively.The computational complexity of the greedy search algorithm scales as O(m 3 r 3 (J −m)), however it generally runs much faster than the local search since most function evaluations are of designs smaller than m.The complexity of the reverse greedy algorithm scales as Figure 1: An illustration of a cluster randomised trial design space.Each cell represents a cluster-period and is comprised of observations from individuals within the cell.Yellow = control status with no intervention; blue = with intervention.

Figure 2 Table 2 :
Relative efficiency and approximate running time of the three algorithms for Examples using Gaussian-Identity model. .Time (s) Rel.eff.Time (s) Rel.eff.Time (s) Relative efficiency and approximate running of the best designs from the three algorithms with and without attenutation (Att.) for the examples with non-Gaussian models.

Figure 2 :
Figure 2: Histogram showing the values of the c-optimal criterion from the local search for each of the examples.The dashed red line indicates the value from the reverse greedy search.x-axis values are multiplied by 100.

Figure 3 :
Figure 3: Optimal designs for Examples E and G.The 0 and 1 labels in the top row indicate the treatment status of the cluster period.

Figure 4 :
Figure 4: Optimal geospatial sampling pattern for Example I.

Figure 5 :
Figure 5: Ratio of the variance of the design from each method to the highest variance for 6 to 30 clusters in example D

Figure 6 :
Figure 6: Approximate robust and Bayesian optimal designs for examples D-G.The 0 and 1 labels in the top row indicate the treatment status of the cluster period.
So we have calculated the updated inverse with only matrix-vector multiplication, which is O(n 2 ).Title: Brief description.(file type) R-package for MYNEW routine: R-package ÒMYNEW Ó containing code to perform the diagnostic methods described in the article.The package also contains all datasets used as examples in the article.(GNU zipped tar file) HIV data set: Data set used in the illustration of MYNEW method in Section 3.2.(.txt file)

Table 3
reports the results for the non-Gaussian examples.Similarly to the Gaussian examples, the local and reverse greedy searches performed the best with either one or both finding the best design.Attenuation appears to make little difference to the quality of the solutions in these examples.Figures 3 and 4 show the best design for examples E, H, and M.

Table 4 :
Relative efficiency of the best designs from the three algorithms with and without attenutation for Examples E-H and J with non-Gaussian models.variances up to 10% larger than the best design.Both the local and reverse greedy searches identify the best design in each class.Figure6shows the model-robust optimal designs.