G-optimal designs for hierarchical linear models: an equivalence theorem and a nature-inspired meta-heuristic algorithm

Hierarchical linear models are widely used in many research disciplines and estimation issues for such models are generally well addressed. Design issues are much less discussed for these models, but interest is increasing as the models grow in popularity. This paper discusses G-optimality for predicting individual parameters in such models and establishes an equivalence theorem for confirming the G-optimality of an approximate design. Because the criterion is non-differentiable and requires solving multiple nested optimization problems, G-optimal designs are much harder to find and study analytically. We propose a nature-inspired meta-heuristic algorithm called competitive swarm optimizer (CSO) to generate G-optimal designs for linear mixed models with different mean and covariance structures. We further demonstrate that CSO is flexible and generally effective for finding the widely used locally D-optimal designs for nonlinear models with multiple interacting factors, where some of the random effects may be correlated. Our numerical results for a few examples suggest that G- and D-optimal designs may be equivalent, and we establish that D- and G-optimal designs for hierarchical linear models are equivalent when the models have only a random intercept. The challenging mathematical question of whether their equivalence applies more generally to other hierarchical models remains elusive.


Introduction
Hierarchical models are widely used to analyze data in various disciplines, such as psychology, medicine, manufacturing and education. Such models are especially appealing for analyzing longitudinal data because they allow for the presence of missing data, time-varying or invariant covariates, and subjects measured at different time points. In educational research, hierarchical models are commonly used to evaluate the effectiveness of teaching methods using data from students nested within classrooms, where classrooms are in turn nested within schools that use different teaching methods. The distinguishing feature of these models is that they account for both individual-level and population-level effects. In the literature across disciplines, hierarchical models are variously referred to as multilevel models, nested data models, mixed models, random coefficient models, random-effects models or random parameter models. In our work, we make no distinction among them and refer to them as statistical models with parameters that vary at one or more levels.
When the experimental settings are under the control of the investigator, design issues arise and they must be carefully addressed for maximal precision in the statistical inference at a minimal cost. The basic questions to answer are: given a statistical model defined on a compact design space X, an optimality criterion and a fixed amount of resources to take N observations for the study, what is the optimal number of design points, where are these points, and how many replicates should be taken at each of them. We denote these quantities by k, x_1, . . . , x_k and n_1, . . . , n_k, respectively, where n_1 + · · · + n_k = N. Optimal exact designs are challenging to find and study because there are no general theoretical tools for finding them and, in addition to depending on the design criterion and the model, the optimal exact design also depends on the value of N.
An alternative is to formulate the design problem as finding an optimal approximate design, where we determine k and x_1, . . . , x_k as before, and now the optimal weights defined by w_i = n_i/N ∈ [0, 1] with w_1 + · · · + w_k = 1. In practice, these are implemented by rounding each N w_j to the nearest integer [N w_j] and taking [N w_j] observations at x_j, j = 1, . . . , k, subject to [N w_1] + · · · + [N w_k] = N. Approximate designs were proposed in Kiefer (1959) and they are appealing because when the criterion is a concave functional, there is a theoretical tool called an equivalence theorem for confirming the optimality of an approximate design among all designs on the given design space X. There are algorithms with proofs of convergence for searching for some types of optimal designs, and there are also tools to assess the proximity of an approximate design to the optimum without knowing the latter. For these reasons, we focus on approximate designs in the rest of the paper.
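To make the rounding step concrete, here is a minimal sketch in Python; `round_design` is a hypothetical helper (not from the paper), and the repair loop that forces the rounded counts to sum to N is one simple choice among several.

```python
import numpy as np

def round_design(weights, N):
    """Round approximate-design weights w_i to integer replication numbers
    n_i with n_1 + ... + n_k = N.  A sketch: round each N*w_i to the
    nearest integer, then repair any mismatch in the total."""
    w = np.asarray(weights, dtype=float)
    n = np.rint(w * N).astype(int)
    # Adjust the entries whose rounded values drifted farthest from N*w_i
    # until the counts sum exactly to N.
    while n.sum() != N:
        diff = n - w * N
        if n.sum() > N:
            n[np.argmax(diff)] -= 1
        else:
            n[np.argmin(diff)] += 1
    return n

# Example: a three-point design with equal weights and N = 10.
print(round_design([1/3, 1/3, 1/3], 10))  # -> [4 3 3]
```

The repair step guarantees a valid exact design; for small N, comparing the efficiencies of the few candidate roundings is a reasonable refinement.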
A primary interest of our paper is to find G-optimal designs for hierarchical linear models, with G standing for global. These designs are best for estimating the overall response surface with minimal variance across the design space X and so protect against the worst-case scenario. This is unlike the much simpler situation where the interest is in finding an optimal design for estimating the predicted response at a single point. G-optimality is a minimax type of design criterion: it is non-differentiable and requires solving at least two nested layers of optimization problems over different search spaces. Consequently, even for fixed-effects models, G-optimal designs are among the most difficult to study mathematically. Compounding the optimization problem is that current algorithms for searching for optimal designs cannot find G-optimal designs effectively, let alone with proof of convergence.
The main aims of this paper are to find G-optimal designs for hierarchical linear models, propose a nature-inspired meta-heuristic algorithm to find them and develop an equivalence theorem to confirm the G-optimality of an approximate design for hierarchical linear models. We demonstrate the flexibility and effectiveness of the algorithm for finding G-optimal designs when the mean response is modeled by fractional polynomials, and for finding D-optimal designs for hierarchical nonlinear models with multiple interacting factors and possibly correlated random effects. We also establish that D- and G-optimal designs for hierarchical linear models are equivalent when the models have only a random intercept, and pose the challenging mathematical question of whether their equivalence applies more generally to other hierarchical models.
The rest of the paper is organized as follows. Section 2 provides the background on hierarchical linear models and recent literature on constructing D-optimal designs for such models. Section 3 establishes an equivalence theorem for G-optimality and Sect. 4 presents G-optimal designs for relatively simple hierarchical models. Section 5 introduces a nature-inspired meta-heuristic algorithm called competitive swarm optimizer (CSO) to find G-optimal designs for more complicated hierarchical linear models, for example, when the mean function is a fractional polynomial and the random effects may be correlated. We further show that CSO can find locally D-optimal designs for estimating model parameters in a Poisson hierarchical model with multiple factors defined on a user-specified design space X, where some random effects are correlated. Section 6 concludes with a discussion of the possible equivalence between D- and G-optimal designs for linear mixed models.

Preliminaries and model specification
This section gives the background and a brief literature review on constructing optimal designs for hierarchical linear models before we describe our statistical models.

Equivalence theorem
Constructing optimal designs for a given model under a specified criterion is challenging if the model is complex, and more so if the criterion is complicated. An analytical approach to finding optimal designs is limiting and is possible only for simple models with a couple of parameters; see examples in design monographs such as Fedorov (1972) and Berger and Wong (2009). A common practice is to find an optimal design under a set of restrictive conditions and hope that the design remains optimal under a broader setup. For instance, one may find the best two-point optimal design for the simple logistic model. Is the same design still optimal among all three-point designs? An equivalence theorem provides an answer and, more generally, is able to confirm whether the design is optimal among all designs of interest.
Specifically, suppose the given design space X is compact and the optimality criterion is formulated as a concave (or convex) function of approximate designs or, equivalently, as a function of the normalized Fisher information matrix M(ξ). The elements of this matrix are the expectations of the negatives of the second derivatives of the log-likelihood function with respect to the parameters. As an example, suppose we have a standard linear regression model with mean response E(y) = f(x)^T β, where y is a univariate response and β is the p × 1 vector of unknown parameters. The expectation is over the model errors, assumed to be independent normal variates with zero means and constant variance. A direct calculation shows the Fisher information matrix is proportional to M(ξ) = ∫_X f(x) f(x)^T ξ(dx). A typical design goal is to find a design so that the estimated β has a minimal covariance matrix. The most common design criterion is D-optimality, and a D-optimal design satisfies ξ_D = arg max_{ξ ∈ Ξ} ln |M(ξ)| over all approximate designs on X. Here |M| is the determinant of M and Ξ is the set of all approximate designs on X. After the D-optimal design ξ_D is found, the worth of another design ξ in terms of D-optimality is measured by its D-efficiency, defined by (|M(ξ)|/|M(ξ_D)|)^{1/p}. Clearly, this ratio is between 0 and 1 and if it is equal to one half, the interpretation is that the design ξ has to be replicated twice to do as well as the D-optimal design ξ_D. If ξ_D in the denominator is replaced by another design ξ*, the ratio is the D-efficiency of ξ relative to ξ*. The efficiency or relative efficiency of a design under another criterion is similarly defined.
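As a small numerical illustration of the D-efficiency just described, the sketch below computes (|M(ξ)|/|M(ξ_D)|)^{1/p} for the simple linear model f(x) = (1, x)^T on [−1, 1]; the two designs and the helper `info_matrix` are illustrative choices, not taken from the paper.

```python
import numpy as np

def info_matrix(points, weights):
    """Normalized information matrix M(xi) = sum_i w_i f(x_i) f(x_i)^T
    for the simple linear model f(x) = (1, x)^T (illustrative choice)."""
    M = np.zeros((2, 2))
    for x, w in zip(points, weights):
        f = np.array([1.0, x])
        M += w * np.outer(f, f)
    return M

# D-optimal design for f(x) = (1, x)^T on [-1, 1]: equal weights at -1 and 1.
M_D = info_matrix([-1.0, 1.0], [0.5, 0.5])
# Competing design: equal weights at -1, 0, 1.
M_xi = info_matrix([-1.0, 0.0, 1.0], [1/3, 1/3, 1/3])

p = 2
d_eff = (np.linalg.det(M_xi) / np.linalg.det(M_D)) ** (1 / p)
print(round(d_eff, 4))  # -> 0.8165, i.e. sqrt(2/3)
```

The competing design "wastes" a third of the observations at the uninformative center point, and its D-efficiency of about 82% quantifies the loss.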
The function ln |M| is concave on Ξ and, using a standard convex analysis argument (Fedorov 1972; Pazman 1986), it can be shown that the equivalence theorem for D-optimality is as follows: ξ_D is D-optimal if and only if for all x ∈ X,

f(x)^T M(ξ_D)^{-1} f(x) ≤ p.    (1)

The function on the left-hand side of the inequality is frequently called the sensitivity function of the design. As an example, suppose f(x)^T = (1, x, x^2), the design space is X = [1, 3] and we want to estimate all three parameters as accurately as possible. A direct calculation shows the D-optimal design ξ_D takes an equal proportion of observations at x = 1, x = 2 and x = 3, and one can directly verify that its sensitivity function satisfies (1).
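The verification for the quadratic example can be reproduced numerically. The sketch below builds M(ξ_D) for the equal-weight design at x = 1, 2, 3 and checks that the sensitivity function stays below p = 3 on [1, 3], with equality at the support points; a grid search stands in for an analytic check.

```python
import numpy as np

f = lambda x: np.array([1.0, x, x * x])  # f(x)^T = (1, x, x^2)

# Candidate D-optimal design: equal weights 1/3 at x = 1, 2, 3.
support, w = [1.0, 2.0, 3.0], 1/3
M = sum(w * np.outer(f(x), f(x)) for x in support)
Minv = np.linalg.inv(M)

# Sensitivity function d(x) = f(x)^T M^{-1} f(x); the equivalence theorem
# requires d(x) <= p = 3 on [1, 3], with equality at the support points.
grid = np.linspace(1, 3, 2001)
d = np.array([f(x) @ Minv @ f(x) for x in grid])
print(round(float(d.max()), 6), [round(float(f(x) @ Minv @ f(x)), 6) for x in support])
```

The maximum of the sensitivity function over the grid is p = 3, attained exactly at the three support points, which is the graphical check described in the next paragraph.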
Each concave functional has a directional derivative which is used to derive its own unique equivalence theorem; see details in design monographs such as Fedorov (1972) and Berger and Wong (2005). When X is an interval or a two-dimensional space, the optimality of a design can be readily checked by plotting the sensitivity function on the left-hand side of the above inequality across the design space and visually ascertaining whether the conditions in the equivalence theorem are satisfied. Equivalence theorems are frequently used to confirm the optimality of an approximate design and, more importantly, to ascertain the efficiency of any approximate design. An example is Wong and Cook (1993), where an algorithm was proposed to find G-optimal designs for fixed-effects linear models and the quality of the generated design was assessed using an equivalence theorem.

Brief review of optimal designs
There is a large body of work on finding optimal designs for various models in different fields, and most of it concerns D-optimal designs for linear models; see references in Berger and Wong (2005) and those cited below. For nonlinear models, the information matrix depends on the unknown parameters, and nominal values are required to replace them in the information matrix before the criterion is optimized. Because the resulting designs depend on the nominal values, they are called locally D-optimal designs; they are the simplest to construct for nonlinear models and are commonly used in practice when a single best guess for the parameters is available. However, when there are conflicting opinions from experts or prior information from previous studies, locally optimal designs become problematic to implement.
Two common design approaches for such a scenario are to adopt a Bayesian paradigm or a minimax approach. The latter strategy was used by Berger et al. (2000), who searched for the best D-optimal design over a plausible set of values for the nominal values. The minimax D-optimal design of Berger et al. (2000) minimizes the maximum inefficiency that arises from misspecified values in the plausible set; variations on the theme are possible, see Chen et al. (2018). An appeal of this approach is that practitioners are more likely able to provide a set of plausible nominal values for the model parameters than the prior distribution required to implement a Bayesian optimal design. A disadvantage of the minimax approach is that minimax optimal designs are both theoretically and numerically more challenging to find than Bayesian optimal designs.
Random-effects models have more complicated equivalence theorems and only a limited number of them are available. For instance, Fedorov and Hackl (1972) derived an equivalence theorem to confirm the D-optimality of an approximate design for a random coefficient regression model and Entholzner et al. (2005) obtained optimal or efficient designs for mixed models. Schmelter (2007b) showed that the search for an optimal design for hierarchical linear models could be restricted to the class of group-wise identical individual designs and Schmelter (2007a) noted that optimal designs found in the class of single-group designs remain optimal in a larger class having more group designs. Entholzner et al. (2005), Schmelter (2007a) and Schmelter (2007b) investigated optimal designs under random intercept models, random slope models and random coefficient cubic regression models, respectively. Debusho and Haines (2008) and Debusho and Haines (2011) constructed V- and D-optimal population designs for linear and quadratic regression models with a random intercept term. More recently, Prus and Schwabe (2016b) constructed optimal designs for predicting individual parameters in hierarchical models and called them D-optimal designs; they also extended their work to find interpolation and extrapolation optimal designs for random coefficient regression models in Prus and Schwabe (2016a). Further, Prus (2019) proposed a design criterion called G-optimality for predicting the response surface in hierarchical models and noted technical difficulties in finding such an optimal design, analytically or numerically, for more complicated models.

Statistical models
Throughout, we denote the jth response from the ith subject by y_ij. The hierarchical model is defined on a user-defined compact design space X and given by

y_ij = f(x_ij)^T β_i + ε_ij,    (2)

where the jth observation from individual i is taken at the experimental setting x_ij ∈ X, n is the number of individuals, m_i is the number of observations from individual i, f = (f_1, . . . , f_p)^T is the vector of known regression functions, and β_i = (β_i1, . . . , β_ip)^T is the individual parameter vector specifying the individual response. The observational errors ε_ij are assumed to have zero mean and to be homoscedastic and uncorrelated, with common variance var(ε_ij) = σ^2. We assume that E(β_i) = β = (β_1, . . . , β_p)^T and Cov(β_i) = σ^2 D (i = 1, . . . , n), where D is known, and all β_i's are uncorrelated with all ε_ij's. We note that the p × p matrix D can be singular, which happens when some of the individual parameters are non-random. In practice, experiments are usually conducted with identical regimes for all individuals, i.e., all individuals i have the same number m_i = m of observations at the same values x_ij = x_j of the experimental settings. Such designs are popular because they are simple to implement and analyze (Prus and Schwabe 2016b). In what follows, we assume such a setting.
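A short simulation may make the model assumptions concrete. The sketch below draws data from a model of the form (2) under the identical-regime setting, with f(x) = (1, x)^T and a singular D (random slope only); all numerical values are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate y_ij = f(x_j)^T beta_i + eps_ij with beta_i ~ (beta, sigma^2 D),
# identical regimes: every individual observed at the same m settings.
n, m, sigma2 = 10, 5, 1.0                 # illustrative choices
beta = np.array([1.0, 2.0])               # population parameter E(beta_i)
D = np.diag([0.0, 0.5])                   # singular D: only the slope is random
x = np.linspace(0.0, 1.0, m)              # common settings x_1, ..., x_m
F = np.column_stack([np.ones(m), x])      # rows are f(x_j)^T = (1, x_j)

Y = np.empty((n, m))
for i in range(n):
    beta_i = rng.multivariate_normal(beta, sigma2 * D)  # individual parameters
    Y[i] = F @ beta_i + rng.normal(0.0, np.sqrt(sigma2), size=m)

print(Y.shape)  # -> (10, 5)
```

Note that the singular D is handled without difficulty: the intercept is the same for every individual while the slopes vary, matching the remark that D may be singular when some individual parameters are non-random.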
Let F = (f(x_1), . . . , f(x_m))^T be the m × p design matrix under the common regime and let Y_i be the vector of observations from subject i. Prus and Schwabe (2016b) showed that the best linear unbiased estimator β̂ of the population parameter β and the best linear unbiased predictor β̂_i of the individual parameter β_i are available in closed form; the predictor β̂_i is a weighted average of the individualized estimate β̂_{i;ind} = (F^T F)^{-1} F^T Y_i, based on the observations of subject i only, and the estimator β̂ for the population parameter, with the explicit expression given in (4). The quality of the prediction in (4) for θ = (β_1^T, . . . , β_n^T)^T may be measured by the mean squared error matrix of θ̂ = (β̂_1^T, . . . , β̂_n^T)^T given in (5). Here I_n is the n × n identity matrix, J_n is the n × n matrix with all entries equal to 1 and "⊗" denotes the Kronecker product of matrices.

G-optimality for predicting individual parameters
The G-optimality criterion was proposed for the prediction of the individual parameters in Prus (2019), and a G-optimal design minimizes the maximal prediction mean squared error over the experimental region. The standardized information matrix of an approximate design ξ for the above model without individual effects is

M(ξ) = ∫_X f(x) f(x)^T ξ(dx).

The matrix M(ξ) stands for the information obtained per observation, and m M(ξ) corresponds to the information contributed by the observations at the experimental settings per individual. In this paper, we are only concerned with approximate designs on X with non-singular information matrices and denote this set by Ξ. If ξ ∈ Ξ, its MSE-matrix in (5) can be expressed in terms of M(ξ) and D, and when D is non-singular, the expression simplifies further. For prediction purposes, Prus (2019) used model (2) and proposed the G-criterion as the maximal sum of the expected squared differences of the predicted and real responses across all individuals with respect to all possible observational settings. This criterion, denoted by ψ_G^{pred}(ξ), can be rewritten in (8) as the maximum over x ∈ X of a sensitivity-type function φ_G(x, ξ) given in (9). In the appendix, we show that the criterion is a convex function on Ξ, and a design that minimizes the G-criterion ψ_G^{pred}(ξ) over all designs ξ ∈ Ξ is G-optimal. The next two analytical results give a lower bound for ψ_G^{pred}(ξ) for any design and an equivalence theorem to confirm the optimality of a design. The technical arguments are deferred to the appendix.

Theorem 1 For any design
An implication of the above theorem is that a design ξ* is G-optimal for model (2) if it satisfies (10) with equality at its support points. We next establish an equivalence theorem for G-optimality to characterize G-optimal designs.
where μ is a probability measure on A(ξ). Similar to (1), the function on the left-hand side of the inequality is the sensitivity function of the design ξ*. The equivalence theorem is more complex than the one for D-optimality because the G-optimality criterion is not differentiable. The theorem has a form similar to those for confirming G-optimal designs for heteroscedastic linear models in Wong and Cook (1993).
In Theorem 2, the probability measure μ* exists only if the design under investigation is G-optimal, in which case the inequality becomes an equality at all design points of the optimal design ξ*. This follows because if there were a support point x of ξ* at which the inequality in Theorem 2 is strict, then integrating both sides with respect to ξ* would give a strict inequality between two equal quantities, implying 0 < 0, which is impossible. It follows that all support points of ξ* are roots of the sensitivity function. In the next section, we show how this information is used to generate the G-optimal design.

G-optimal designs
We now provide examples of G-optimal designs for several types of linear mixed models and use Theorem 2 to confirm their optimality. Here and in the rest of the paper, to fix ideas, we assume that n = 10 and m = 5, but other values for n and m can be handled similarly. The first example is relatively simple and a formula for the G-optimal design is available. When the model is slightly generalized to two additive factors with uncorrelated random effects and no intercept term, an analytical description of the G-optimal design is no longer possible and the design must be found numerically. This motivates us to use the competitive swarm optimizer (CSO), and some of the results in this section were found by CSO.

G-optimal design for the simple linear model with a random slope
Suppose we wish to find the G-optimal design for model (2) with p = 2 and f(x) = (1, x)^T on the experimental region X = [0, 1]. The model is

y_ij = β_i1 + β_i2 x_ij + ε_ij,    (12)

and we assume that the slope parameter β_i2 is random with mean β_2 and the dispersion matrix is diagonal and equal to σ^2 D = diag(0, dσ^2). Let δ = md and let ξ*_w be the two-point design supported at 0 and 1 with weight at 1 equal to w*. A direct application of Theorem 2 characterizes the G-optimal weight w* for model (12), and for the design ξ*_w the function in (9) attains its maximum at x = 0 and x = 1, so that A(ξ*_w) = {0, 1}. Let μ* be the two-point probability measure defined on A(ξ*_w), supported at 0 and 1 with weight w_μ at 1. The sensitivity function of the design ξ*_w, expressed in terms of w_μ and w*, has a maximum value of (1 − w_μ)/(w* − 1)^2 at x = 0 and x = 1. This value can be shown to attain the required bound, and so by Theorem 2 the design ξ*_w is G-optimal.

G-optimal designs for a two additive factor model without an intercept
Regression models with no intercept are quite common; they either arise naturally or from constraints imposed on the variables, see for example Huang et al. (1995). For such a model with two additive random factors, we have

y_ij = β_i1 x_ij1 + β_i2 x_ij2 + ε_ij,    (16)

where x_ij = (x_ij1, x_ij2)^T ∈ X and the dispersion matrix of the uncorrelated random coefficients is σ^2 D = σ^2 diag(d_1, d_2). To find G-optimal designs, we first consider designs ξ_w of a two-point form and let μ_w be the associated probability measure on A(ξ_w). The table below shows the optimal weights of ξ*_w and the weights of the measure μ*_w for model (16) for selected values of d_1 and d_2. These designs have been verified numerically to be G-optimal by Theorem 2.

G-optimal designs for quadratic and fractional polynomial mixed models
We show here that even for a relatively simple mixed model, an analytic description of the G-optimal design can be problematic. We consider two linear mixed models with independent errors ε_ij having zero means and common variance: model (17), which is quadratic, and model (18), which is an example of a fractional polynomial; fractional polynomials are increasingly used in the biomedical sciences to model a univariate continuous response. Fractional polynomial models were proposed by Royston and Altman (1994), who showed that they are more effective for modeling a continuous outcome than polynomials (Royston et al. 1999; Royston and Sauerbrei 2008). We recall that a fractional polynomial (FP) is given by φ_t(x; α, p) = α_0 + Σ_{j=1}^t α_j H_j(x), where the α_j are real-valued coefficients and the H_j(x) are defined sequentially: H_1(x) = x^{(p_1)} and, for j = 2, . . . , t, H_j(x) = x^{(p_j)} if p_j ≠ p_{j−1} and H_j(x) = H_{j−1}(x) ln(x) otherwise. The powers are given by the Box-Tidwell transformation, with x^{(p_j)} = x^{p_j} if p_j ≠ 0 and x^{(0)} = ln(x), and for practical applications, the 'powers' in a FP are selected from the set P = {−2, −1, −0.5, 0, 0.5, 1, 2, . . . , max(3, t)} (Royston and Altman 1994). Many statistical software packages now provide an option for fitting FP models, suggesting that FP models are gaining recognition as a modeling tool in statistics. Interestingly, optimal designs for FP models have never been reported in the literature.
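The sequential definition of the basis functions H_j can be coded directly. The sketch below follows the Box-Tidwell convention and the repeated-power rule of Royston and Altman (1994); the function names are ours.

```python
import math

def box_tidwell(x, p):
    """Box-Tidwell transform: x^(p) = x**p if p != 0, else ln(x)."""
    return math.log(x) if p == 0 else x ** p

def fp_basis(x, powers):
    """Basis functions H_1, ..., H_t of a fractional polynomial.
    H_1(x) = x^(p_1); for j > 1, H_j(x) = x^(p_j) if p_j != p_{j-1},
    and H_j(x) = H_{j-1}(x) * ln(x) for a repeated power."""
    H = []
    for j, p in enumerate(powers):
        if j > 0 and p == powers[j - 1]:
            H.append(H[-1] * math.log(x))   # repeated power: multiply by ln(x)
        else:
            H.append(box_tidwell(x, p))
    return H

# FP2 with the repeated power p = (1, 1) has basis (x, x ln x).
print(fp_basis(2.0, (1, 1)))  # -> [2.0, 2.0 * ln 2]
```

With distinct powers, say p = (0.5, 2), the basis is simply (x^{0.5}, x^2); the ln(x) multiplication arises only when a power repeats.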
There are no closed-form descriptions of the optimal designs for these two relatively simple models with random effects, so a practical way is to find them using an algorithm. There are many traditional algorithms for generating various types of optimal designs, but our experience is that many of them do not work well in a setting where the criterion is not differentiable and the optimization problem has two levels. To this end, we use a meta-heuristic algorithm called the competitive swarm optimizer to find the optimal designs.

Competitive swarm optimizer
An analytical approach is generally unable to determine a G-optimal design, so we need an effective algorithm to generate a G-optimal design, or to search for a design with sufficiently high G-efficiency for practical applications. The last few G-optimal designs in Sect. 4 were found by an algorithm that we now describe.
Nature-inspired meta-heuristic algorithms are commonly used in engineering and computer science to tackle hard-to-solve optimization problems (Yang 2010). Examples are particle swarm optimization (PSO), differential evolution (DE), cuckoo search (CS) and the imperialist competitive algorithm (ICA). Their main appeal is that they are general-purpose optimization tools: they tend to be assumptions-free, easy to implement and use, and frequently able to find high-quality solutions quickly. Their meteoric rise in popularity is well documented in Whitacre (2011a, b). A version of PSO has been applied to find standardized maximin optimal designs for several enzyme-kinetic inhibition models by solving multilevel nested optimization problems over different types of search spaces (2018), and Kim and Wong (2018) applied PSO to an adaptive clinical trial design problem, solving a complex discrete optimization problem to determine the optimal choice of ten integers with multiple constraints.
Nature-inspired meta-heuristic algorithms are a major and continuously evolving component of evolutionary computation. The competitive swarm optimizer (CSO) is popular because many simulation results in the literature show that it either outperforms or is competitive with several state-of-the-art evolutionary algorithms on a variety of benchmark functions with dimensions up to 5000 (Cheng and Jin 2014; Zhou et al. 2016; Sun et al. 2018; Mohapatra et al. 2017; Zhang et al. 2016). These studies showed that CSO was frequently not only the winner but also required significantly less run time. CSO has also been successfully applied to solve many different types of complex optimization problems; see, for example, Gu et al. (2018), Kumarappan and Arulraj (2016) and Xiong and Shi (2018).
CSO was initially proposed by Cheng and Jin (2014) to tackle the premature convergence issue encountered by many evolutionary algorithms. CSO first generates a swarm of n particles at positions x_1, . . . , x_n with random velocities v_1, . . . , v_n in Ω. In each iteration, we randomly divide the particles into n/2 pairs and compare their objective function values. If we have a minimization problem, at iteration t we compare each pair of particles x_i^t and x_j^t and identify x_i^t as the winner and x_j^t as the loser if the objective function has a smaller value at x_i^t than at x_j^t. The winner retains its position and velocity, and the loser learns from the winner. The two defining equations of CSO update the loser's velocity and position:

v_j^{t+1} = R_1 ⊗ v_j^t + R_2 ⊗ (x_i^t − x_j^t) + γ R_3 ⊗ (x̄^t − x_j^t),
x_j^{t+1} = x_j^t + v_j^{t+1},

where R_1, R_2, R_3 are random vectors whose elements are drawn from U(0, 1), the operation ⊗ denotes element-wise multiplication, the vector x̄^t is the swarm center at iteration t, and the social factor γ controls the influence of the neighboring particles on the loser; a large value is helpful for enhancing swarm diversity but may slow convergence. This process iterates until some stopping criteria are met. Algorithm 1 displays pseudo code for CSO.

Algorithm 1 The Pseudo Code for CSO
Input: a swarm of n particles, an objective function and stopping criteria.
x ← randomly assigned initial positions of the particles.
v ← randomly assigned initial velocities of the particles.
while the stopping criteria are not met do
    Randomly divide the swarm into n/2 pairs.
    for each pair do
        Compare the objective function values and identify the winner and the loser.
        Update the loser particle using the updating rule.
    end for
end while
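A minimal, self-contained implementation of the pseudo code above might look as follows; the function `cso` and all tuning values are illustrative sketches of the scheme of Cheng and Jin (2014), not the exact implementation used in the paper.

```python
import numpy as np

def cso(obj, dim, n_particles=64, iters=200, gamma=0.0, bounds=(0.0, 1.0), seed=0):
    """Minimal competitive swarm optimizer for minimization (a sketch).
    Each iteration splits the swarm into random pairs; the loser of each
    pair updates its velocity toward the winner (and, if gamma > 0,
    toward the swarm mean) and moves; winners are left unchanged."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    x = rng.uniform(lo, hi, size=(n_particles, dim))
    v = np.zeros((n_particles, dim))
    for _ in range(iters):
        xbar = x.mean(axis=0)                       # swarm center
        idx = rng.permutation(n_particles)
        for a, b in idx.reshape(-1, 2):             # random pairings
            win, lose = (a, b) if obj(x[a]) <= obj(x[b]) else (b, a)
            r1, r2, r3 = (rng.uniform(size=dim) for _ in range(3))
            v[lose] = r1 * v[lose] + r2 * (x[win] - x[lose]) + gamma * r3 * (xbar - x[lose])
            x[lose] = np.clip(x[lose] + v[lose], lo, hi)
    best = min(range(n_particles), key=lambda i: obj(x[i]))
    return x[best], obj(x[best])

# Toy check: minimize ||z - 0.3||^2 over [0, 1]^5.
xb, fb = cso(lambda z: float(np.sum((z - 0.3) ** 2)), dim=5)
print("best objective value:", fb)
```

Note that the best particle never moves unless it loses a pairing, which is what gives CSO its resistance to premature convergence; only half the swarm is updated per iteration.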
The tuning parameters we used in the CSO algorithm for finding the optimal designs are similar to those suggested by Cheng and Jin (2014). For example, because our optimization problems have dimensions fewer than 100, we set γ = 0 and used 128 particles in the search. The maximum number of iterations for each search was set to 350. Here convergence means that successive values of the objective function differ by less than 10^{-6} in absolute value; for our examples, the algorithm typically converged in 200 or fewer iterations, or less than 1 second of CPU time. The hardware we used is a Windows PC with a 3.20 GHz Intel i7-8700 CPU, 32 GB DDR4 2666 MHz memory and 512 GB SSD storage.

Application of CSO to find G-optimal designs
To search for a G-optimal design, we set up a two-layer optimization structure for the min-max objective function. The inner step is the low-dimensional maximization problem in (9) and the outer loop minimizes (8), which in our case is a multidimensional function with fewer than 100, or more specifically around 10-20, variables to optimize. Accordingly, we set γ = 0, 32 particles and 128 iterations for the inner optimization, and γ = 0, 128 particles and 200 iterations for the outer optimization task. During the search, the target matrix may become singular or nearly singular and degrade the floating-point accuracy. We found that adding a diagonal matrix with very small positive diagonal elements to the original matrix makes inverting an ill-conditioned matrix more stable; in our work, we set M* = M + 10^{-7} × I.
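The stabilization step can be illustrated in a few lines; the rank-deficient matrix below is a hypothetical stand-in for an ill-conditioned information matrix encountered during the search.

```python
import numpy as np

def stable_inverse(M, ridge=1e-7):
    """Invert M* = M + ridge * I instead of M, as in the stabilization
    described in the text (a sketch)."""
    return np.linalg.inv(M + ridge * np.eye(M.shape[0]))

# A rank-deficient "information matrix" from a one-point design under a
# two-parameter model: M = f f^T has rank 1 and no ordinary inverse.
f = np.array([1.0, 0.5])
M = np.outer(f, f)
Minv = stable_inverse(M)
print(np.allclose(M @ Minv @ M, M, atol=1e-4))  # -> True
```

The ridge perturbs the nonzero eigenvalues by only about 10^{-7}, so the criterion values computed with M* are essentially unchanged while the inversion never fails.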
After finding a design ξ*, we determine its answering set A(ξ*) and the probability measure μ* that meet the conditions in the equivalence theorem. To efficiently find all x's that maximize φ_G(x, ξ*), we suggest splitting the design space into two or more subspaces and then running CSO on each subspace. For instance, if the design space is X = [0, 1], we may split the search space into [0, 0.5] and [0.5, 1]. There are no firm rules for the number of subspaces; our suggestion is to first try two subspaces and aggregate all maximizing points to obtain A(ξ*), then sequentially increase the number of subspaces and stop when splitting the design space further does not enlarge A(ξ*).
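The subspace strategy for locating all maximizers of the sensitivity function can be sketched as follows; a simple grid search plays the role of the CSO run on each subspace, and the function names are ours.

```python
import numpy as np

def answering_set(phi, lo, hi, n_sub=2, grid=2001, tol=1e-6):
    """Approximate the set of maximizers of phi on [lo, hi] by splitting
    the interval into n_sub subintervals, maximizing on each (a grid
    search stands in for a CSO run here) and pooling the winners whose
    values tie the overall maximum."""
    candidates = []
    for k in range(n_sub):
        a = lo + k * (hi - lo) / n_sub
        b = lo + (k + 1) * (hi - lo) / n_sub
        xs = np.linspace(a, b, grid)
        candidates.append(xs[np.argmax(phi(xs))])
    best = max(phi(np.array(candidates)))
    return sorted({round(float(x), 8) for x in candidates
                   if phi(np.array([x]))[0] >= best - tol})

# Toy sensitivity-like function on [0, 1] with maximizers at both endpoints.
phi = lambda x: (x - 0.5) ** 2
print(answering_set(phi, 0.0, 1.0))  # -> [0.0, 1.0]
```

A single global search would return only one of the two endpoints; splitting the space is what recovers the full answering set, after which the number of subspaces can be increased until the set stops growing.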
To find the probability measure μ* that meets the conditions in the equivalence theorem, we proceed iteratively as follows. For each design ξ generated by the algorithm, we first determine the elements a_1, a_2, . . . , a_k of A(ξ) and find a candidate μ for μ* among probability measures supported at these k points. There are 2k variables in the pairs (a_i, w_i), where w_i is the weight of μ at a_i, i = 1, . . . , k. Since the support points of μ* must be roots of the sensitivity function when ξ is optimal, one may apply CSO to minimize the objective function in (19) with respect to the variable weights w_i. We note that in (19), for each design ξ, we have to first find the x that maximizes φ_G(x, ξ); this can be done by calling CSO again. However, after many experiments, we found that minimizing only the second term of (19), which we denote by (20), frequently suffices to determine μ; i.e., given a design, we suggest first finding μ by minimizing (20) before solving the more complicated problem (19) to determine μ*. We ran CSO with the social factor set to 0.05 and 64 particles for 1200 iterations, and stop the algorithm if and when the objective value attains a small user pre-specified value of, say, 10^{-5}. The CSO-generated designs for the next few examples were found by optimizing (20) and we confirmed their G-optimality via the equivalence theorem. On average, the CPU time required to solve (20) was about 5 seconds or less for our examples, and we expect a longer time would be required to solve (19). This suggests that, for more complicated problems, an efficient strategy is to first solve the simpler optimization problem in (20) before solving (19).
The next few tables list CSO-generated designs for various linear mixed models with different assumptions on the covariance structure of the random effects, along with their G-optimality criterion values. We display the sensitivity plot of each CSO-generated design across the design space, and each plot confirms the G-optimality of the corresponding design. All parameters for the design problems, such as the design space, the elements in the covariance matrix D, n and m, are chosen randomly for illustrative purposes. The error variance σ 2 is a nuisance parameter and does not affect the optimization, so we set σ 2 = 1 in all examples.

[Fig. 1: Sensitivity function of the CSO-generated design for model (17) with uncorrelated random effects in Table 1.]
[Fig. 2: Sensitivity function of the CSO-generated design for model (17) with correlated random effects in Table 2.]
[Fig. 3: Sensitivity function of the CSO-generated design for model (18) with uncorrelated random effects in Table 3.]
[Table 4: CSO-generated design and its features for model (18) for specific choices of the covariance matrix of the correlated random effects, design space and (n, m).]

CSO-generated locally D-optimal designs for Poisson models with mixed factors
This subsection demonstrates the utility and flexibility of CSO for finding other types of optimal designs, such as locally D-optimal designs for estimating parameters in a Poisson regression model with possibly interacting factors, some of which are random. Poisson models are commonly used to study count data in a regression setting even though they have restrictive assumptions, such as requiring the mean and variance of the response to be equal. Negative binomial regression models extend Poisson models when the variance is larger or smaller than the mean response. These models are frequently used in clinical trials when the primary outcome is a count variable, such as the number of falls by an elderly patient, the number of CT scans received in the last three months, or the number of insurance claims. Walters (2007a, b) used a negative binomial regression model to analyze the number of aggressive incident reports in the 12 months after subjects were placed in an institutional correction center. Like all meta-heuristic algorithms, CSO can be modified to search for an optimum more effectively for specific problems; an enhanced version of the CSO algorithm is available for finding various types of optimal designs for negative binomial regression models (Zhang et al. 2020).
Locally D-optimal designs for two-factor mixed Poisson models with an intercept term have been reported for two models, one with and the other without an interaction term. Using the quasi-information matrix defined in Niaparast and Schwabe (2013) and assuming that the random effects are uncorrelated, i.e., the covariance matrix is diagonal, Naderi et al. (2018) provided theoretical details for finding minimally supported D-optimal designs, including an equivalence theorem to confirm the D-optimality of an approximate design. We recall that minimally supported designs have as many support points as parameters in the mean function and so cannot be used to perform a lack-of-fit test to check model adequacy. Their approach can thus be restrictive in practice, and it is also not clear whether their numerical procedure works for finding D-optimal designs when the model has more interacting factors with random effects or under another design criterion.
We applied CSO and found the same locally D-optimal designs reported in Tables 1, 2 and 3 of Naderi et al. (2018) for different models with various numbers of uncorrelated random coefficients. The next two sets of results, in Tables 5 and 6, show CSO-generated designs for a model with two interacting factors when the random effects are uncorrelated or correlated. The accompanying sensitivity plots in Figs. 5 and 6 confirm that the CSO-generated designs are locally D-optimal. Interestingly, the results show that different design spaces can produce D-optimal designs that are not minimally supported.
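The kind of sensitivity check used to confirm local D-optimality can be illustrated with a fixed-effects Poisson model, which is a simplification of the mixed models above (those require the quasi-information matrix of Niaparast and Schwabe 2013). For a fixed-effects Poisson regression the information matrix is M(ξ) = Σ w_i exp(f(x_i)'β) f(x_i)f(x_i)', and a design is locally D-optimal exactly when the sensitivity function ψ(x, ξ) = exp(f(x)'β) f(x)' M^{-1}(ξ) f(x) never exceeds p. The sketch below checks this for a one-factor model with illustrative values, not taken from the paper's tables; the equally weighted two-point design at x = 0 and x = 2 is known to be locally D-optimal when β1 = −1:

```python
import math

# Locally D-optimal design check for one-factor Poisson regression,
# E[y] = exp(b0 + b1 * x) with f(x) = (1, x). Illustrative values:
# b0 = 0, b1 = -1; candidate design: weight 1/2 at x = 0 and x = 2.
b0, b1 = 0.0, -1.0
support, weights = [0.0, 2.0], [0.5, 0.5]

def info_matrix(support, weights):
    m11 = m12 = m22 = 0.0
    for x, w in zip(support, weights):
        lam = math.exp(b0 + b1 * x)   # Poisson intensity at x
        m11 += w * lam
        m12 += w * lam * x
        m22 += w * lam * x * x
    return m11, m12, m22

def sensitivity(x, M):
    # psi(x, xi) = lambda(x) * f(x)' M^{-1} f(x) for the 2x2 matrix M.
    m11, m12, m22 = M
    det = m11 * m22 - m12 * m12
    quad = (m22 - 2.0 * m12 * x + m11 * x * x) / det
    return math.exp(b0 + b1 * x) * quad

M = info_matrix(support, weights)
grid = [10.0 * k / 2000 for k in range(2001)]
max_psi = max(sensitivity(x, M) for x in grid)
# Equivalence theorem: max psi = p = 2 iff locally D-optimal, with the
# maximum attained at the support points.
```

This is exactly the check reported for Figs. 5 and 6, only in one dimension; for the mixed models the intensity term is replaced by the quasi-information weights.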
The usefulness of the CSO algorithm can also be seen when we applied it to find D-optimal designs for mixed models in which some random coefficients are correlated; here too it was able to find D-optimal designs that need not be minimally supported. We demonstrate this using two Poisson models, one with four and the other with five mixed factors, both with some interaction terms. This is helpful because the bulk of theoretical optimal designs in the literature for nonlinear models concern models with only a couple of additive factors, and numerical results for models with interaction terms are also very limited.
For the model with four factors, we assume that the covariance matrix is D = diag(0.5, 2.2, 1, 1, 0, 0, 1.3, 0.7) and the coefficients in the linear predictor function are β = (1, 2, 3, −3, −1, −2, 1, 3). The 12-point CSO-generated design ξ 4 is shown below and its criterion value is −4.300. Here and elsewhere, the last row in the design for a multi-factor model shows the mass of the design point above it.
The second example concerns a Poisson regression model with five factors (x 1 , . . . , x 5 ) and three interaction terms (x 1 x 2 , x 1 x 3 , x 3 x 5 ) defined on [−1, 1] 5 . As an illustrative example, suppose the covariance matrix is

[covariance matrix display]

and the nominal coefficients in the linear predictor function are β = (1.0, 2.0, 3.0, −3.0, −1.0, −2.0, 0.2, 0.5, −0.5). The criterion value of the 12-point CSO-generated design is −13.640 and the design ξ 5 is shown below. This is a more complicated model and, as expected, CSO, like other meta-heuristic algorithms, can encounter problems as the optimization problem becomes more complex. Convergence and numerical stability issues can arise; here, repeated reruns of the algorithm with different tuning parameters and swarm sizes did not produce an optimal design.

[Fig. 6: Sensitivity function of the CSO-generated design under the D-optimality criterion for a two-factor Poisson model with an interaction term and correlated random effects shown in Table 6.]

A common strategy to try to overcome this problem is to hybridize the algorithm with another one so that the hybridized version performs better than either algorithm alone. Another option is to use the equivalence theorem to derive an efficiency lower bound for the generated approximate design without knowing the optimum (Pazman 1986). For D-efficiency, the lower bound depends on the maximum value of the sensitivity function of the design. Since CSO is a stochastic algorithm whose output depends on the initial design and the choice of tuning parameters, we ran the algorithm multiple times and report the minimum D-efficiency of the generated designs. For this example, the CSO-generated design is ξ 5 and its computed minimum D-efficiency is 88%, which may suffice in practice.
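The efficiency lower bound just mentioned is cheap to compute once the maximum of the sensitivity function is known: a commonly used bound of this kind for D-optimality is eff_D(ξ) ≥ p / max_x d(x, ξ), where d(x, ξ) = f(x)' M^{-1}(ξ) f(x). The sketch below evaluates it for an ordinary simple linear regression model on [−1, 1] with a deliberately perturbed design; the model and designs are illustrative stand-ins, not the Poisson example above:

```python
# D-efficiency lower bound eff_D >= p / max_x d(x, xi), computed from
# the maximum of the sensitivity (standardized variance) function.
# Illustrative setting: f(x) = (1, x) on [-1, 1], p = 2. The equally
# weighted design on {-1, 1} is D-optimal; {-0.8, 1} is a perturbed
# competitor.

def variance_fn_max(support, lo=-1.0, hi=1.0, n_grid=4001):
    # Information matrix of an equally weighted two-point design.
    x1, x2 = support
    m11, m12 = 1.0, (x1 + x2) / 2.0
    m22 = (x1 * x1 + x2 * x2) / 2.0
    det = m11 * m22 - m12 * m12
    def d(x):  # d(x, xi) = f(x)' M^{-1} f(x)
        return (m22 - 2.0 * m12 * x + m11 * x * x) / det
    return max(d(lo + (hi - lo) * k / (n_grid - 1)) for k in range(n_grid))

p = 2
bound_opt = p / variance_fn_max([-1.0, 1.0])   # = 1: design is D-optimal
bound_sub = p / variance_fn_max([-0.8, 1.0])   # < 1: efficiency bound
```

For the perturbed design the bound is about 0.80 while its true D-efficiency is 0.90, illustrating that the bound is conservative but requires no knowledge of the optimal design.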
For models with three or more factors, like those in the last two examples, the CSO-generated designs appear to be D-optimal or highly D-efficient. Unlike the previous examples, the sensitivity functions are now high-dimensional and so it is difficult to plot them and visually appreciate their features. One can discretize the design space using a fine grid and evaluate the sensitivity function at every grid point, but this can be time-consuming. Another way is to apply CSO to maximize the multi-dimensional sensitivity function and confirm that all peaks occur at the support points of the CSO-generated design. We chose the latter option and verified that the CSO-generated design ξ 4 is numerically optimal and ξ 5 is not.

Are D and G-optimal designs equivalent for hierarchical linear models?
It is well known that for regression models with homoscedastic errors, D- and G-optimal approximate designs are equivalent (Kiefer and Wolfowitz 1960). The two optimality criteria look different and serve different purposes, so this is an intriguing result. Does the equivalence of the two types of designs carry over to linear mixed regression models? Let J n be the n-dimensional square matrix whose elements are all ones, let Δ = m D, and let ⊗ denote the Kronecker product. A D-optimal design for model (2) minimizes log |MSE(ξ)|.

Tables 7 and 8 display CSO-generated designs under the D- and G-optimality criteria for different settings of the mixed quadratic model with random components having various covariance matrices. Tables 9 and 10 show corresponding results for a fractional polynomial model with random components having various covariance matrices. The numerical results suggest that D- and G-optimal designs for these models are nearly equivalent, since the D- and G-efficiencies of each relative to the other are all very close to 1. Additional numerical results, not shown here for space considerations, for other models we investigated with more polynomial or fractional polynomial terms and different types of covariance structures also show that the two types of optimal designs have relative efficiencies very close to 1.
Our numerical results in the tables suggest that the celebrated theorem of Kiefer and Wolfowitz may also apply to hierarchical linear models, but a general proof is elusive at this time. We are, however, able to show that the equivalence of the two types of designs holds for hierarchical linear models that contain only a random intercept. To see this, we recall the D-optimality criterion discussed in Prus and Schwabe (2016b) for the hierarchical linear model (2). Assuming that the dispersion matrix D has rank q, the D-criterion for prediction is the logarithm of the product of the (n − 1)q + p largest eigenvalues of the MSE-matrix, where λ 1 (ξ, Δ), . . . , λ q (ξ, Δ) are the q largest eigenvalues of N(ξ, Δ) = Δ − Δ(M −1 (ξ) + Δ) −1 Δ. Corollary 6 of Prus and Schwabe (2016b) used this definition to give an equivalence theorem for D-optimality, and showed that the D-optimal design in the fixed-effects model is also D-optimal for prediction in the random-intercept model.

We next consider the G-optimal design for prediction in a hierarchical linear model with a random intercept. Assume f 1 (x) ≡ 1 in model (2). The dispersion matrix D can be written as D = d e 1 e 1 T , where e 1 = (1, 0, . . . , 0) T denotes the first unit vector in R p , and the MSE-matrix for prediction in this random-intercept model follows with δ = md as defined in (13). It then follows that the lower bound in (10) equals p + (n − 1)δ/(1 + δ), which does not depend on ξ. If ξ * is D-optimal for the fixed-effects model, we have max x∈X φ(x, ξ *) = p + (n − 1)δ/(1 + δ), and it follows that ξ * is G-optimal for prediction of the individual parameters. Therefore, the D-optimal and G-optimal designs are equivalent for prediction in the random-intercept model.
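The last step leans on the classical Kiefer-Wolfowitz equivalence for the fixed-effects part of the model: ξ * is D-optimal if and only if max_x d(x, ξ *) = p, where d(x, ξ) = f(x)' M^{-1}(ξ) f(x). A quick numerical check of this fact for the quadratic model f(x) = (1, x, x^2) on [−1, 1], whose D-optimal design puts equal weight on {−1, 0, 1}, is sketched below (an illustrative check, independent of the paper's tables):

```python
# Kiefer-Wolfowitz check for the quadratic fixed-effects model
# f(x) = (1, x, x^2) on [-1, 1]: the equally weighted design on
# {-1, 0, 1} is D-optimal iff max_x d(x, xi) = p = 3, with equality
# attained at the three support points.

def f(x):
    return [1.0, x, x * x]

def info_matrix(support, weights):
    M = [[0.0] * 3 for _ in range(3)]
    for x, w in zip(support, weights):
        fx = f(x)
        for i in range(3):
            for j in range(3):
                M[i][j] += w * fx[i] * fx[j]
    return M

def inv3(M):
    # Inverse of a 3x3 matrix via the adjugate.
    a, b, c = M[0]; d, e, g = M[1]; h, i, j = M[2]
    det = a * (e * j - g * i) - b * (d * j - g * h) + c * (d * i - e * h)
    adj = [
        [e * j - g * i, c * i - b * j, b * g - c * e],
        [g * h - d * j, a * j - c * h, c * d - a * g],
        [d * i - e * h, b * h - a * i, a * e - b * d],
    ]
    return [[adj[r][s] / det for s in range(3)] for r in range(3)]

def d_fn(x, Minv):
    fx = f(x)
    return sum(fx[i] * Minv[i][j] * fx[j]
               for i in range(3) for j in range(3))

Minv = inv3(info_matrix([-1.0, 0.0, 1.0], [1 / 3] * 3))
grid = [-1.0 + 2.0 * k / 4000 for k in range(4001)]
max_d = max(d_fn(x, Minv) for x in grid)
```

In the random-intercept model the same design then attains the constant bound p + (n − 1)δ/(1 + δ), which is what the argument above exploits.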

Conclusions
G-optimal designs are challenging to determine and study because the criterion is not differentiable and requires solving two or more layers of nested optimization problems over different spaces. The criterion is nonetheless compelling and should appeal to researchers interested in designing an experiment to estimate the overall response surface. To facilitate greater use of such designs for linear models with random effects, we proposed an effective meta-heuristic algorithm called CSO to find G-optimal designs and developed an equivalence theorem to confirm whether a design is G-optimal. Additionally, we showed that CSO is flexible and can also search for locally D-optimal designs for Poisson mixed regression models with several interacting factors, where some random effects may be correlated. R codes are freely available upon request from the third author.
Proof of Theorem 1 For any design ξ ∈ Ξ , from (9), we have
A direct calculation shows that