Variable selection using conditional AIC for linear mixed models with data-driven transformations

When data analysts use linear mixed models, they usually encounter two practical problems: (a) the true model is unknown and (b) the Gaussian assumptions of the errors do not hold. While these problems commonly appear together, researchers tend to treat them individually by (a) finding an optimal model based on the conditional Akaike information criterion (cAIC) and (b) applying transformations to the dependent variable. However, the optimal model depends on the transformation and vice versa. In this paper, we aim to solve both problems simultaneously. In particular, we propose an adjusted cAIC that uses the Jacobian of the particular transformation, such that model candidates with differently transformed data can be compared. From a computational perspective, we propose a step-wise selection approach based on the introduced adjusted cAIC. Model-based simulations are used to compare the proposed selection approach to alternative approaches. Finally, the introduced approach is applied to Mexican data to estimate poverty and inequality indicators for 81 municipalities.


Introduction
The linear mixed model is a broadly used statistical model for analyzing clustered or longitudinal data. When data analysts use these models, they often face two practical problems: (a) the true model for explaining the response variable is unknown and (b) the model assumptions, especially the Gaussian assumptions of the error terms, are violated.
As the true model is unknown, data analysts find suitable/optimal models for explaining the dependent variable by using variable selection procedures. One popular approach in this context is the Akaike information criterion (AIC) introduced by Akaike (1973). For linear mixed models, there are different versions of the AIC (Müller et al. 2013). They can be divided into two groups: marginal types of AIC (mAIC) and conditional types of AIC (cAIC). The mAIC is the common AIC for linear mixed models; it uses the marginal density and is one of the most widely used selection criteria (Müller et al. 2013). However, the mAIC is only appropriate when the model parameters are fixed (Burnham and Anderson 2010), and the use of the mAIC as a selection criterion is problematic for linear mixed models (Han 2013). Vaida and Blanchard (2005) introduced the cAIC as a more appropriate selection criterion for linear mixed models. The cAIC uses the conditional density in contrast to the mAIC. Vaida and Blanchard (2005) derive the cAIC for the case in which the (scaled) covariance matrix of the random effects is known and recommend using a plug-in estimator for this covariance matrix in practice. Liang et al. (2008) derive a more general cAIC that accounts for the estimation of the covariance matrix of the random effects. However, their conditional AIC can be computationally demanding in situations with large sample sizes and many potential variables (Greven and Kneib 2010).
Linear mixed models regularly rely on parametric assumptions such as normality of the random effects and the error terms. These assumptions may be violated in many applications, for instance, with skewed variables like consumption or income. One possible way to tackle this issue is to use robust mixed models. Such models are robust in various aspects, including violations of the Gaussian assumptions: they allow more flexible distributions (Verbeke and Lesaffre 1997; Zhang and Davidian 2001; Sinha and Rao 2009) or apply a Bayesian framework (Rosa et al. 2003; Lachos et al. 2009). Jiang (2019) gives an overview of further models which deal with this problem. Another way to solve this problem is to apply a fixed logarithmic or a data-driven transformation to the dependent variable. The latter transformations are generally based on an adaptive transformation parameter λ that depends on the particular shape of the data. Among different data-driven transformations, the Box-Cox transformation (Box and Cox 1964) is widely used, as it includes various power transformations and the logarithmic transformation as special cases. Gurka et al. (2006) extend the use of the Box-Cox transformation to linear mixed models. They apply the residual maximum likelihood (REML) approach to estimate the transformation parameter λ from the data based on a linear mixed model with fixed auxiliary variables.
However, the optimal data-driven transformation depends on the fixed model and the optimal model depends on the selected data-driven transformation. In particular, to select the optimal data-driven transformation parameter λ by the REML approach, the linear mixed model must be fixed; and to perform a variable selection based on the cAIC, the dependent variable must be fixed using an appropriate (data-driven) transformation parameter λ. A first naive approach, typically used in applications, is to perform the transformation and variable selection in a specific order: first, find an appropriate working model on the original/untransformed scale and keep this fixed when selecting the optimal data-driven transformation parameter. However, this may not be optimal for variable selection, as the selected variables are not necessarily optimal on the transformed scale. In this paper, we aim to find the optimal model and the optimal transformation parameter simultaneously. This allows enjoying the advantages of both data-driven transformations and the optimal model for the transformed data. Hoeting and Ibrahim (1998) and Hoeting et al. (2002) discuss methods for transformation and variable selection based on posterior probabilities in linear models. They focus on change-point transformations to transform the predictors of the linear model. Bunke et al. (1999) discuss the selection of the optimal transformation and the optimal model based on cross-validation for nonlinear models. To the best of our knowledge, none of the existing literature provides a joint solution when variable selection based on the cAIC and estimation of the data-driven transformation parameter are simultaneously applied to linear mixed models. From a theoretical perspective, we present an approach to concurrently choose the optimal linear model and the optimal transformation parameter.
Since the cAIC is scale dependent, we cannot directly compare models with differently transformed response variables. Therefore, we adjust the cAIC using the Jacobian of the corresponding data-driven transformation such that different model candidates with differently transformed response variables can be compared. Although the paper focuses on the Box-Cox transformation as a particular data-driven transformation, the proposed approach is applicable to data-driven transformations in general. From a computational perspective, we provide a step-wise selection approach based on the proposed adjusted cAIC.
The structure of the paper is as follows: In Sect. 2 we provide an overview of linear mixed models and the cAIC. In Sect. 3, we derive the Jacobian adjusted cAIC for transformed linear mixed models and introduce the step-wise selection approach. In Sect. 4, we examine the performance of the proposed selection approach by using model-based simulations. In Sect. 5, the proposed selection approach is applied to data from Guerrero in Mexico for estimating poverty and inequality indicators at the municipal level. Finally, we discuss our results and further directions of research in Sect. 6.

Variable selection using conditional AIC for linear mixed models
In this section we briefly introduce existing variable selection methods for linear mixed models. In Sect. 2.1, we present a general notation of linear mixed models and in Sect. 2.2, we introduce and compare the cAIC by Vaida and Blanchard (2005) and Liang et al. (2008).

The linear mixed model
Assume there is a finite population divided into D clusters. Let y_i be the vector of the response variable for the i-th cluster, i = 1, ..., D, which is modeled with a linear mixed model

y_i = X_i β + Z_i u_i + ε_i,  u_i ~ N(0, G),  ε_i ~ N(0, σ² I_{N_i}),

where N_i is the cluster size of the i-th cluster, X_i and Z_i are known N_i × p and N_i × q design matrices for the fixed and random effects, β includes the p fixed effects, u_i is a vector of q random effects, and ε_i is a vector of errors in the i-th cluster. u_i and ε_i are assumed to be independent and normally distributed, with I_{N_i} the N_i × N_i identity matrix. G is the q × q covariance matrix of the random effects in the i-th cluster and depends on a set of variance components η. Let N = Σ_{i=1}^{D} N_i be the population size and θ = (β, σ, η) be the vector of parameters in the model. For the population, the model is described as

y = Xβ + Zu + ε,   (1)

where X and Z stack the cluster-level design matrices, u ~ N(0, G_0) with block-diagonal G_0 = diag(G, ..., G), and ε ~ N(0, σ² I_N). The covariance matrix of y is given by

V = Cov(y) = Z G_0 Z′ + σ² I_N.
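To make the notation concrete, the following sketch (our own illustration, not from the paper; names such as `sigma_u` are ours) simulates a random-intercept special case of the model, with q = 1 and Z_i a column of ones:

```python
import numpy as np

rng = np.random.default_rng(42)

D, n_i, p = 50, 10, 3                  # clusters, cluster size N_i, fixed effects
beta = np.array([1.0, 0.5, -0.3])
sigma_u, sigma_e = 0.5, 1.0            # random-intercept SD (G = sigma_u^2) and error SD

X = rng.normal(size=(D * n_i, p))      # stacked fixed-effects design matrix
cluster = np.repeat(np.arange(D), n_i) # cluster index of each unit
u = rng.normal(scale=sigma_u, size=D)  # random intercepts u_i ~ N(0, G)
eps = rng.normal(scale=sigma_e, size=D * n_i)

# y_i = X_i beta + Z_i u_i + eps_i, stacked over the D clusters
y = X @ beta + u[cluster] + eps

# The marginal covariance V = Z G_0 Z' + sigma^2 I_N is block diagonal:
# within a cluster, Cov(y_ij, y_ik) = sigma_u^2 for j != k.
```

The block-diagonal structure of V is what makes cluster-wise computations feasible even for large N.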

Conditional Akaike information criterion for linear mixed models
Assume there are P possible explanatory variables in the data. Since the number of all possible combinations of P variables is M = 2^P, there are M possible model candidates which can be fitted to the data. In order to find the optimal model among them, variable selection should be performed based on an appropriate selection criterion. In this study, we focus on variable selection based on the cAIC for linear mixed models. While this study focuses on the cAIC, the mAIC is briefly explained first to provide a better understanding of the cAIC.
The mAIC is derived from the Kullback-Leibler (K-L) divergence between the density of the true model and the density of a candidate model (Akaike 1973). Assume that the true model has the same form as Eq. (1) with true parameters. The vector of true parameters is denoted by θ_0 = (β_0, σ_0, η_0). Let f(·) be the density function of the true generating model and g(·|θ) be the density of the approximating model with model parameters θ for fitting the data. If the true distribution f belongs to the class of model candidates and θ = θ_0, then g(·|θ_0) = f(·). The mAIC measures the K-L divergence between f(·) and g(·|θ).
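The K-L divergence behind AIC-type criteria can be made tangible for Gaussian densities. The sketch below (our own numerical illustration, not part of the paper) compares the closed-form K-L divergence between two isotropic normal densities with a Monte Carlo average of the log density ratio, which is exactly the expectation E_f[log f − log g]:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

m0 = rng.normal(size=n)   # stand-in for the true mean vector
m1 = m0 + 0.1             # mean of a slightly misspecified candidate model
s0, s1 = 1.0, 1.2         # error standard deviations of truth and candidate

# Closed-form KL( N(m0, s0^2 I) || N(m1, s1^2 I) )
kl = 0.5 * (n * np.log(s1**2 / s0**2) + n * s0**2 / s1**2
            + np.sum((m1 - m0) ** 2) / s1**2 - n)

# Monte Carlo estimate: E_f[log f(y*) - log g(y*)] under the true density f
y_star = m0[None, :] + s0 * rng.normal(size=(20000, n))
log_f = -0.5 * np.sum(((y_star - m0) / s0) ** 2, axis=1) - n * np.log(s0)
log_g = -0.5 * np.sum(((y_star - m1) / s1) ** 2, axis=1) - n * np.log(s1)
kl_mc = np.mean(log_f - log_g)
```

The agreement between `kl` and `kl_mc` illustrates why expected log densities under the true model can rank candidates.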
The idea behind the cAIC derivation is the same as for the mAIC. While the mAIC measures the K-L divergence between two marginal densities, the cAIC measures the K-L divergence between the true conditional density and the conditional density of a model candidate. The true conditional density is denoted by f(·|u_0) with the true random effects u_0, and the conditional density of a model candidate is denoted by g(·|θ, u). Let y* be generated from the true conditional density and let y be the observed data, also from the true conditional density. They are independent conditional on the random effects, which means that y* and y share the random effects and only differ in the error terms (i.e., y* = Xβ + Zu + ε* and y = Xβ + Zu + ε with ε* ~ N(0, σ² I_N) and ε ~ N(0, σ² I_N)). The K-L divergence between f(y*|u_0) and g(y*|θ, u) with respect to f(y*|u_0) is defined by

I[(θ_0, u_0), (θ, u)] = E_{f(y*|u_0)} [log{f(y*|u_0) / g(y*|θ, u)}].

The discrepancy between the conditional generating model and the conditional approximating model is given by

d[(θ_0, u_0), (θ, u)] = −2 E_{f(y*|u_0)} [log g(y*|θ, u)].

Using this definition of the discrepancy, the K-L divergence can be written as

2 I[(θ_0, u_0), (θ, u)] = d[(θ_0, u_0), (θ, u)] + 2 E_{f(y*|u_0)} [log f(y*|u_0)].

Since 2 E_{f(y*|u_0)}[log f(y*|u_0)] does not depend on θ and u of the approximating model, ranking candidate models by d[(θ_0, u_0), (θ, u)] is equivalent to ranking them by 2 I[(θ_0, u_0), (θ, u)]. Therefore, the fitted candidate models can be evaluated by the discrepancy at θ̂ and û, where θ̂ includes the estimates of the model parameters (i.e., θ̂ = (β̂, σ̂, η̂)) and û = E(u|θ̂, y) contains the predicted random effects based on empirical Bayes estimation. Hence, the selection problem based on the K-L divergence can be solved by comparing the values d[(θ_0, u_0), (θ, u)]|_{θ=θ̂, u=û} of the candidate models. As the model parameters and random effects are estimated from the observed data, the expected estimated discrepancy should be used as the selection criterion (Burnham and Anderson 2010).
This expected estimated discrepancy is often denoted as the conditional Akaike information (cAI) (Vaida and Blanchard 2005; Liang et al. 2008; Han 2013). As a consequence, the cAIC consists of the conditional log-likelihood and a bias correction term K,

cAIC = −2 log g(y|θ̂, û) + 2K,

where ŷ = Xβ̂ + Zû is the fitted vector. Vaida and Blanchard (2005) derive two different bias correction terms under different assumptions. When σ² and G_0 are assumed to be known, K equals ρ, the effective degrees of freedom (Hodges and Sargent 2001), here denoted by K_a:

K_a = ρ = tr(H_1),

where H_1 is the matrix mapping the observed vector y onto the fitted vector, ŷ = H_1 y. When it is assumed that σ² is unknown and σ^{−2} G_0 is known, K is calculated by

K_MLE = [N(N − p − 1) / ((N − p)(N − p − 2))] (ρ + 1) + [N(p + 1) / ((N − p)(N − p − 2))].   (2)
The detailed derivation of K_a and K_MLE can be found in Vaida and Blanchard (2005). Vaida and Blanchard (2005) derive the cAIC under the assumption that G_0, the covariance matrix of the random effects, or σ^{−2} G_0, the scaled covariance matrix of the random effects, is known. However, in practice they are usually unknown. In the case of an unknown random effects covariance matrix, Vaida and Blanchard (2005) suggest using K_MLE for the cAIC with the estimated σ^{−2} G_0, since the derivation of the bias correction term for the case of unknown σ^{−2} G_0 is analytically complicated and the effect of the estimation can be asymptotically ignored. Liang et al. (2008) propose a general cAIC for known σ², regardless of whether the covariance matrix of the random effects is known or unknown. Under these assumptions, Liang et al. (2008) derive the bias correction term using the first derivatives of ŷ with respect to y. In their technical report, they also derive an additional bias correction term for the cAIC assuming, more realistically, that neither σ² nor the covariance matrix of the random effects is known.
In practice, the true value of σ² and the true G_0 are usually unknown. Therefore, it seems reasonable to use the cAIC of Liang et al. (2008). However, Liang et al. (2008) show in their simulations that their bias correction term is close to K_a, and it is also shown in their technical report that the bias correction term under the more realistic assumptions is close to K_MLE. Moreover, Greven and Kneib (2010) point out that the use of the cAIC by Liang et al. (2008) as a selection criterion poses severe computational difficulties, since the calculation of their bias correction term requires evaluating the first derivatives of ŷ with respect to y, which is hard to implement for large N and M. As a result, this study focuses on the cAIC of Vaida and Blanchard (2005), and in particular on the cAIC with K_MLE, which allows for unknown σ². The optimal model is the model which has the minimum value of the cAIC among all M model candidates.
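For the random-intercept case with known variance components, the penalty ρ is the trace of the hat matrix H_1 that maps y to ŷ = Xβ̂ + Zû (Hodges and Sargent 2001). The sketch below is our own illustration; in particular, the closed form used for `K_mle` is our reconstruction of the Vaida and Blanchard (2005) correction and should be checked against the original:

```python
import numpy as np

rng = np.random.default_rng(1)
D, n_i, p = 20, 5, 2
N = D * n_i
X = np.column_stack([np.ones(N), rng.normal(size=N)])
Z = np.kron(np.eye(D), np.ones((n_i, 1)))   # random-intercept design
sigma2, tau2 = 1.0, 0.5                      # error and random-effect variances
G0 = tau2 * np.eye(D)
V = Z @ G0 @ Z.T + sigma2 * np.eye(N)        # marginal covariance of y

Vinv = np.linalg.inv(V)
A = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv)          # beta_hat = A y
H1 = X @ A + Z @ G0 @ Z.T @ Vinv @ (np.eye(N) - X @ A)   # y_hat = H1 y
rho = np.trace(H1)                                       # effective dof

# Finite-sample penalty for unknown sigma^2 (our reading of K_MLE)
K_mle = (N * (N - p - 1) / ((N - p) * (N - p - 2))) * (rho + 1) \
        + N * (p + 1) / ((N - p) * (N - p - 2))
```

As the random-effect variance shrinks to zero, ρ approaches p (a pure fixed-effects fit); as it grows, ρ approaches p + D.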

Variable selection for linear mixed models with transformations
In this section, we propose a step-wise variable selection approach for linear mixed models which allows comparing model candidates with differently transformed response variables. First, we give a general notation of linear mixed models with the Box-Cox transformation. Although the paper focuses on the Box-Cox transformation as a particular transformation, the proposed approach is applicable to data-driven transformations in general. In Sect. 3.2, we derive the Jacobian adjusted cAIC based on the cAIC by Vaida and Blanchard (2005), which allows comparing model candidates with differently transformed data. In Sect. 3.3, we introduce a bootstrap method to estimate the bias correction term for the Jacobian adjusted cAIC. From a computational perspective, we suggest using step-wise selection with the adjusted cAIC in Sect. 3.4.

Linear mixed models with transformations
Assume that the original y variable is non-normal and that there exists a transformation parameter of the Box-Cox transformation for which the transformed data follow the Gaussian assumptions. The one-to-one Box-Cox transformation (Box and Cox 1964) of y is defined by

T_λ(y) = ((y + s)^λ − 1)/λ  for λ ≠ 0,  and  T_λ(y) = log(y + s)  for λ = 0,   (3)

where λ denotes the transformation parameter which has to be estimated and s denotes the shift parameter, with s = |min(y)| + 1 only when min(y) < 0. Let T_λ(y) denote the vector of the transformed response. Then, the transformed response is modeled as

T_λ(y) = Xβ + Zu + ε,   (4)

with u ~ N(0, G_0) and ε ~ N(0, σ² I_N). The covariance matrix of the transformed response is V = Z G_0 Z′ + σ² I_N. Gurka et al. (2006) use the REML approach to estimate λ, as the REML approach is recommended when the focus is the estimation of variance components (Verbeke and Molenberghs 2000). Moreover, Rojas-Perilla et al. (2020) compare the REML estimator of λ with other estimators and show that the REML approach has a smaller variability than alternative estimators. Accordingly, the optimal λ is estimated in this study with the REML approach. The optimal λ maximizes the residual log-likelihood function of a given model. However, the estimated optimal λ is only optimal for the given model. This means that each model candidate has its own optimal λ. As we do not know which model candidate is the optimal one and which λ is optimal for the corresponding model, we should select the model and λ concurrently.
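The shifted Box-Cox transformation in Eq. (3) can be written compactly as a small helper (a sketch; the function name `box_cox` is ours):

```python
import numpy as np

def box_cox(y, lam):
    """Shifted Box-Cox transform: ((y + s)^lam - 1) / lam, or log(y + s)
    at lam = 0, with shift s = |min(y)| + 1 only when min(y) < 0."""
    y = np.asarray(y, dtype=float)
    s = abs(y.min()) + 1.0 if y.min() < 0 else 0.0
    if lam == 0:
        return np.log(y + s)
    return ((y + s) ** lam - 1.0) / lam
```

For λ = 1 the transform reduces to a shift by one, and λ → 0 recovers the logarithm, which is why the log transformation is a special case.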
To simultaneously select the best model based on the cAIC and obtain the optimal λ, we estimate the optimal λ for each potential model in a first step. With P possible x-variables there are M = 2^P model candidates. The m-th model is defined by

T_{λ_m}(y) = X^{(m)} β + Zu + ε,   (5)

where X^{(m)} is the design matrix of the m-th model and λ_m is the optimal transformation parameter for the m-th model. Based on the model in Eq. (5), the optimal transformation parameter is estimated using the REML approach, and λ̂_m denotes the estimated optimal transformation parameter for the m-th model. Further details about the estimation of λ_m using the REML approach are given in Gurka et al. (2006). In the second step, all model candidates with their own λ̂_m should be compared. However, AIC-type criteria cannot compare models with differently transformed target variables (Burnham and Anderson 2010). Therefore, an adjustment of the cAIC with the Jacobian should be performed first such that these M different models can be compared.
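To illustrate that each candidate model has its own optimal λ, the sketch below profiles a Jacobian-adjusted Gaussian log-likelihood over a λ grid for a fixed-effects model. This is a simplification: the paper estimates λ by REML in the mixed model, while `profile_loglik` and the grid search are our own stand-ins:

```python
import numpy as np

def profile_loglik(y, X, lam):
    """Jacobian-adjusted profile log-likelihood at lam (ML sketch,
    constants dropped). The last term is the log-Jacobian of Box-Cox."""
    n = len(y)
    yt = np.log(y) if abs(lam) < 1e-8 else (y ** lam - 1.0) / lam
    beta = np.linalg.lstsq(X, yt, rcond=None)[0]
    rss = np.sum((yt - X @ beta) ** 2)
    return -0.5 * n * np.log(rss / n) + (lam - 1.0) * np.sum(np.log(y))

rng = np.random.default_rng(3)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=n)])
# log(y) is linear in X, so the optimal lam should lie near 0
y = np.exp(X @ np.array([0.5, 0.3]) + 0.2 * rng.normal(size=n))

grid = np.linspace(-2.0, 2.0, 81)
lam_hat = grid[np.argmax([profile_loglik(y, X, lam) for lam in grid])]
```

Running the same profiling with a different design matrix X generally moves `lam_hat`, which is precisely why model and transformation must be selected jointly.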

Jacobian adjusted cAIC for linear mixed models
Assume that f(·|u_0) is the true conditional density function of the transformed response with the true model parameters θ_0 and the true random effects u_0, while g(·|θ, u) denotes the conditional density of an approximating model. Write ẏ = T_λ(y) for the transformed response and let ẏ* = Xβ + Zu + ε* be a realization from the true conditional density function with ε* ~ N(0, σ² I_N). Then, the cAI for the transformed model is given by

cAI = −2 E_{f(ẏ,u)} E_{f(ẏ*|u)} [log g(ẏ*|θ̂, û)],

where θ̂ is the vector of estimated model parameters and û is the vector of predicted random effects. −log g(ẏ|θ̂, û) is a biased estimator of E_{f(ẏ,u)} E_{f(ẏ*|u)} [−log g(ẏ*|θ̂, û)] = 0.5 · cAI. The bias is obtained by

E_{f(ẏ,u)} [log g(ẏ|θ̂, û) − E_{f(ẏ*|u)} [log g(ẏ*|θ̂, û)]].

To obtain an unbiased estimator of 0.5 · cAI, the bias correction term (BC) should be added as follows:

0.5 · cAIC = −log g(ẏ|θ̂, û) + BC,   (6)

where the fitted vector on the transformed scale is Xβ̂ + Zû. Under the assumption that σ² is unknown, the BC in Eq. (6) can be replaced by K_MLE from Eq. (2). Consequently, the cAIC for the transformed model is given by

cAIC = −2 log g(ẏ|θ̂, û) + 2 K_MLE.   (7)

However, this cAIC of the transformed model cannot be used to compare differently transformed model candidates.
The cAIC measures the K-L distance between the true conditional density and the conditional density of a model candidate.
In the case of linear mixed models without a transformation, the optimal model can be chosen using the cAIC by Vaida and Blanchard (2005), since all model candidates have the same response variable y. The model with the smallest distance (i.e., the smallest cAIC) is the optimal model among all candidates. However, for linear mixed models with a transformation, we estimate for each model candidate its own optimal transformation parameter. As the transformation parameter differs from model to model, the transformed response differs too. Consequently, the response variables of the model candidates are no longer the same (i.e., T_{λ̂_1}(y) ≠ T_{λ̂_2}(y) ≠ · · · ≠ T_{λ̂_M}(y)). Therefore, the cAIC in Eq. (7) of a model candidate is in fact not the distance of the model from the true density of y, but the distance from the true density of the transformed response. As the transformed response differs from candidate to candidate and the cAIC is scale dependent, the model candidates cannot be compared with the cAIC.
To allow for comparing model candidates using the cAIC, it needs to be adjusted such that the adjusted cAIC of a model candidate measures the divergence of the model from the true density of y. Akaike (1978) shows that this adjustment can be done by adding the Jacobian of the transformation to the AIC value of time series models.
The Jacobian adjusted cAIC, denoted by JcAIC, is derived from the K-L divergence between the true conditional density and the model conditional density of the original y, and not of the transformed response ẏ = T_λ(y). To define the K-L divergence between the true and a candidate model of y, the true and model conditional densities of y should be defined. As we know the conditional densities of the transformed response, the conditional densities of y can be derived by multiplying with the Jacobian of the transformation. Let h(y|u_0) be the true conditional density of y and l(y|θ, u) the conditional model density, which are defined with the Jacobian of the Box-Cox transformation J(λ, y) as

h(y|u_0) = f(ẏ|u_0) · J(λ, y),  l(y|θ, u) = g(ẏ|θ, u) · J(λ, y),   (8)

where

J(λ, y) = ∏_{j=1}^{N} (y_j + s)^{λ−1}.   (9)

Let y* be a realization of the true conditional density h(y|u) and ẏ* be the vector of transformed y*. Then, the K-L divergence between the conditional densities of y* becomes

I[(θ_0, u_0), (θ, u)] = E_{h(y*|u_0)} [log h(y*|u_0)] − E_{h(y*|u_0)} [log l(y*|θ, u)].   (10)

Therefore, the K-L divergence can be formulated using discrepancies as in Sect. 2.2, and the Jacobian adjusted conditional Akaike information is

JcAI = −2 E_{h(y,u)} E_{h(y*|u)} [log l(y*|θ̂, û)].

−log l(y|θ̂, û) is a biased estimator of 0.5 · JcAI. To obtain an unbiased estimator of 0.5 · JcAI, the bias should be corrected by the following bias correction term (BC):

BC = E_{h(y,u)} [log l(y|θ̂, û) − E_{h(y*|u)} [log l(y*|θ̂, û)]].

l(y|θ̂, û) is defined as in Eq. (8). Then, l(y*|θ̂, û) can be defined by g(ẏ*|θ̂, û) · J(λ, y*) using the same relation as in Eq. (8). By inserting these terms into the BC, we get

BC = E_{h(y,u)} [log g(ẏ|θ̂, û) + log J(λ, y) − E_{h(y*|u)} [log g(ẏ*|θ̂, û) + log J(λ, y*)]].   (11)

The Jacobian term of y is defined in Eq. (9) and the Jacobian term for y* is given by J(λ, y*) = ∏_{j=1}^{N} (y*_j + s)^{λ−1}.
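Since the derivative of the shifted Box-Cox transform is (y + s)^(λ−1), the log-Jacobian needed in the adjustment is simply (λ − 1) Σ_j log(y_j + s). A minimal sketch (the helper name `log_jacobian` is ours):

```python
import numpy as np

def log_jacobian(y, lam, s=0.0):
    """log J(lam, y) = (lam - 1) * sum(log(y + s)) for the shifted
    Box-Cox transform, whose derivative is (y + s)^(lam - 1)."""
    return (lam - 1.0) * np.sum(np.log(np.asarray(y, dtype=float) + s))

# JcAIC = -2 log g(transformed y | theta_hat, u_hat) - 2 log J(lam_hat, y) + 2 BC:
# subtracting 2 log J moves the criterion back to the scale of the original y,
# so candidates with different lam_hat become comparable.
```

At λ = 1 the log-Jacobian is zero, so the criterion for an untransformed model is left unchanged by the adjustment.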

Estimation of the bias correction
We propose a parametric bootstrap, following the ideas of Donohue et al. (2011) and Rojas-Perilla et al. (2020), to estimate the BC for the JcAIC. The bootstrap captures not only the uncertainty due to the estimation of the model parameters but also the additional uncertainty due to the estimation of the transformation parameter λ (Rojas-Perilla et al. 2020). In addition, we use a resampling approach because bootstrap variants of the AIC are comparable with analytic approximations of the AIC (Donohue et al. 2011) and perform better than analytic approximations in terms of model choice (Shang and Cavanaugh 2008; Marhuenda et al. 2014).
The BC in Eq. (11) consists of two expectation terms. Each expectation term is estimated by averaging the corresponding values over the B bootstrap replicates. The steps of the proposed bootstrap are as follows:

1. Estimate the optimal transformation parameter λ̂ by the REML approach and transform the response variable to obtain ẏ = T_λ̂(y).
2. Fit the model in Eq. (4) to obtain estimates of the model parameters θ̂.
3. For b = 1, ..., B, generate random effects u^(b) from N(0, Ĝ_0) and errors ε^(b) from N(0, σ̂² I_N), and create a bootstrap response using ẏ^(b) = Xβ̂ + Z u^(b) + ε^(b).
4. Refit the model with the bootstrap sample ẏ^(b) and obtain the bootstrap estimates of the model parameters θ̂^(b) and û^(b).
5. Calculate the second expectation term of the BC for each bootstrap replicate using θ̂^(b) and û^(b). The unobserved (true) y* and ẏ* are replaced by y and ẏ, respectively. Note that y and ẏ are treated as realizations from the true untransformed/transformed density with the corresponding λ̂.
6. Back-transform the bootstrap response to the original scale, y^(b) = T_λ̂^{−1}(ẏ^(b)).
7. Re-estimate the transformation parameter from y^(b), yielding λ̂^(b), and refit the model to obtain the bootstrap estimates of the model parameters θ̂^(λ̂^(b), b) and û^(λ̂^(b), b). Note that these estimates depend on the re-estimated transformation parameter, indicated by the superscript λ̂^(b).
8. Calculate the first expectation term of the BC for each bootstrap replicate using θ̂^(λ̂^(b), b) and û^(λ̂^(b), b).

The bootstrap estimate of the BC is then obtained by

BC = (1/B) Σ_{b=1}^{B} [log l(y^(b) | θ̂^(λ̂^(b), b), û^(λ̂^(b), b)) − log l(y | θ̂^(b), û^(b))].   (12)

Then, the JcAIC is estimated by

JcAIC = −2 log(l(y|θ̂, û)) + 2BC = −2 log(g(ẏ|θ̂, û)) − 2 log(J(λ̂, y)) + 2BC   (13)

with BC defined in Eq. (12).
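The logic of the bootstrap bias correction can be sketched in a stripped-down setting, a Gaussian model without random effects or a transformation, so the full steps (which additionally redraw random effects and re-estimate λ) reduce to: fit, simulate from the fit, refit, and average the two log-likelihood terms. All names below are our own, and the resulting `bc` estimates the optimism of the plug-in log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(7)

def loglik(y, mu, sigma2):
    return (-0.5 * len(y) * np.log(2 * np.pi * sigma2)
            - 0.5 * np.sum((y - mu) ** 2) / sigma2)

def fit(y, X):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return beta, np.mean((y - X @ beta) ** 2)   # ML variance estimate

n, B = 200, 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 0.5]) + rng.normal(size=n)
beta_hat, s2_hat = fit(y, X)                     # "fit the model"

# Bootstrap estimate of BC: the first term evaluates the bootstrap data
# under its own refit, the second re-evaluates the observed y under it.
bc = 0.0
for _ in range(B):
    y_b = X @ beta_hat + rng.normal(scale=np.sqrt(s2_hat), size=n)
    beta_b, s2_b = fit(y_b, X)
    bc += (loglik(y_b, X @ beta_b, s2_b) - loglik(y, X @ beta_b, s2_b)) / B
```

The estimate is positive because refitted parameters always flatter their own bootstrap sample more than the original data.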
The JcAIC is the measure of the K-L divergence of a model candidate from the true model on the original y scale. Therefore, model candidates can be compared with the JcAIC despite their different response variables. The model with the minimum JcAIC is the optimal model with the corresponding optimal transformation parameter.
Using the derived JcAIC for the Box-Cox transformation, we will compare model candidates whose response variables are Box-Cox transformed with different transformation parameters. However, the JcAIC can also be derived for other types of transformations, such as the logarithmic or the dual-power transformation (Yang 2006). The JcAIC always measures the divergence of a candidate model from the true model on the original y scale, independent of how the response variable of the model is transformed. Therefore, the JcAIC can compare not only model candidates that use the same transformation with different transformation parameters, but also models with different types of transformations.

Simultaneous selection of optimal transformation and model formula
As a consequence of the previous sections, we propose the following algorithm to simultaneously select the optimal λ of a Box-Cox transformation and the optimal model among several model candidates. As explained above, considering all M theoretically possible model candidates is often not feasible in practice due to the computational burden. Therefore, the usual step-wise algorithms can be applied, which stop when no further improvement can be achieved. In the following, we choose backward elimination as the exemplary model selection direction. Switching to forward selection or extending to a forward-backward procedure is possible without any difficulties and was done for the simulation experiment in Sect. 4 and the application in Sect. 5.
(1) Start with the full model including all P possible x-variables in the data. For the start, the full model is set as the optimal model. Estimate λ̂ based on the full model to initiate the backward model selection.
(2) In step s, consider all candidate models that remove exactly one variable from the optimal model of step s − 1. For each candidate, estimate its optimal transformation parameter λ̂ and compute its JcAIC. The candidate with the smallest JcAIC becomes the new optimal model of step s.
(3) Compare the JcAIC value of the new optimal model in step s with the JcAIC value of the previous optimal model in step s − 1. If the JcAIC value of the new optimal model is smaller than the previous one, step (2) is repeated until there is no further improvement in terms of JcAIC values.
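The backward search can be sketched end-to-end. The score below is only a stand-in for the JcAIC: λ is profiled on a grid rather than estimated by REML, and the bootstrap bias correction is replaced by a simple parameter count, so only the control flow of steps (1)-(3) is faithful; all helper names are ours:

```python
import numpy as np

rng = np.random.default_rng(11)
n, P = 400, 4
Xfull = np.column_stack([np.ones(n)] + [rng.normal(size=n) for _ in range(P)])
# true model: log(y) linear in x1 and x2; x3 and x4 are pure noise
y = np.exp(0.3 * Xfull[:, 1] + 0.3 * Xfull[:, 2] + 0.3 * rng.normal(size=n))

def score(cols):
    """AIC-type stand-in for the JcAIC of the candidate using columns cols:
    profile lam over a grid, Jacobian-adjust, penalize the parameter count."""
    X = Xfull[:, cols]
    best = np.inf
    for lam in np.linspace(-2.0, 2.0, 41):
        yt = np.log(y) if abs(lam) < 1e-8 else (y ** lam - 1.0) / lam
        beta = np.linalg.lstsq(X, yt, rcond=None)[0]
        rss = np.sum((yt - X @ beta) ** 2)
        ll = -0.5 * n * np.log(rss / n) + (lam - 1.0) * np.sum(np.log(y))
        best = min(best, -2.0 * ll + 2.0 * (len(cols) + 2))  # + sigma^2, lam
    return best

cols = list(range(P + 1))          # (1) start with the full model
best_score = score(cols)
improved = True
while improved and len(cols) > 1:
    improved = False
    for c in cols[1:]:             # (2) try dropping each non-intercept column
        trial = [k for k in cols if k != c]
        s = score(trial)
        if s < best_score:
            best_score, best_cols, improved = s, trial, True
    if improved:                   # (3) keep the improvement, else stop
        cols = best_cols

selected = sorted(cols)
```

Because λ is re-profiled for every candidate column set, each trial model is judged with its own optimal transformation, mirroring the simultaneous selection idea.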

Model-based simulation experiment
To support our theoretical findings and the proposed framework from the previous section, we conduct simulation studies comprising several settings. The aim of the study is to show that, under known data settings with a given transformation and model formula, the presented simultaneous algorithm for model and transformation selection recovers the true model for a linear mixed model. The settings include four scenarios: Normal (1), Normal (2), Log and Box-Cox, each with three explanatory variables. The scenarios are oriented towards the simulation study of Rojas-Perilla et al. (2020). The distributions of the explanatory variables are chosen to be representative of both numeric and categorical variables coded as dummies. The first scenario with normally distributed random effects and error terms (Normal (1)) has an explanatory power of around 40%; the second (Normal (2)) has an explanatory power of 85%, as do the Log and Box-Cox scenarios. The exact definition of the data settings is given in Table 1. In each simulation run (Monte Carlo replication), the explanatory variables, random intercepts and error terms are generated by drawing from the corresponding distributions. Thus, a new pseudo population is created in each simulation run. A total of 500 Monte Carlo replications are generated for each setting. Each of the finite populations consists of N = 10,000 units evenly divided into D = 50 clusters. Within each cluster, a simple random sample is drawn. The cluster-specific sample sizes range from 0 to 29, so that the total sample size sums up to n = 565. The distribution of y_ij of one population is shown in the Appendix in Table 9 and Fig. 4.
In addition to the explanatory variables x_1,ij, x_2,ij and x_3,ij, the random intercepts u_i and the error terms e_ij, an additional variable z_ij ~ N(1, 0.1²) is generated in each Monte Carlo replication, which is used in estimating the linear mixed model (but is not included in the true data-generating mechanism):

T_λ(y_ij) = β_0 + β_1 x_1,ij + β_2 x_2,ij + β_3 x_3,ij + β_4 z_ij + u_i + e_ij,   (14)

where T_λ denotes the Box-Cox transformation defined in Eq. (3). In each simulation run, the model selection is performed with four approaches, in which the dependent variable y is considered on different scales:

• on the original scale (no transformation), so that T_λ(y_ij) = y_ij (denoted by Original),
• on the log scale, so that T_λ(y_ij) = log(y_ij + s) (λ = 0) (denoted by Log),
• on the Box-Cox scale, so that T_λ(y_ij) = ((y_ij + s)^λ − 1)/λ for λ ∈ [−2, 2], λ ≠ 0, and log(y_ij + s) for λ = 0 (denoted by Box-Cox Opt),
where s denotes the shift parameter, with s = |min(y)| + 1 only when min(y) is a negative number. A naive approach which is typically used in applications is
• to perform the model selection on the original scale and afterwards estimate the optimal λ for a Box-Cox transformation (denoted by Box-Cox Naive).
In the Box-Cox Opt approach the optimal model and the optimal transformation parameter λ are determined simultaneously as described in Sect. 3.4. For each setting, the linear mixed model from Eq. (14) and a null model without covariates are estimated. The model selection is then performed with a step-wise algorithm using backward and forward directions based on the cAIC or JcAIC. For the Original approach (which operates on the untransformed scale), the cAIC is calculated, and the JcAIC in Eq. (13) is calculated for the other approaches (which operate on the transformed scale). They can be directly compared, as the cAIC equals the JcAIC for the Original approach. As analytic approximations of the AIC can exhibit negative bias for small sample sizes (Marhuenda et al. 2014), we also use bootstrap versions to estimate the bias correction in the JcAIC/cAIC when a log transformation or no transformation is used. This ensures a fair comparison in the simulation experiment with the estimated JcAIC for a Box-Cox transformation. The bootstrap algorithms to estimate the cAIC for the Original and the JcAIC for the Log approach are described in the Appendix. The bootstrap algorithms were executed with B = 200 replications. In the following, we always refer to the JcAIC, as in the case of no transformation the cAIC equals the JcAIC.
There are three points of interest in the simulation: first, the choice of the correct approach for the model selection; second, the choice of the transformation parameter; and third, the choice of the correct transformation and the correct model specification.
To begin with, we want to evaluate whether the model with the correct approach based on the JcAIC is chosen in agreement with the data setting. For this, we look at the calculated JcAIC values and, in relation to this, we also check whether in the case of the Box-Cox transformation the correct associated λ is estimated. Then, we focus on the proportion of simulation runs where the correct transformation is selected and the proportion of correctly specified model formulas.
The parameter λ of the Box-Cox transformation is estimated with the REML algorithm, and the simulation is implemented in the statistical programming language R (R Core Team 2021). For each combination of data settings and approaches the calculated JcAIC values are compared and the model with the minimal JcAIC is chosen as optimal. Table 2 contains summary statistics of the JcAIC values over the 500 Monte Carlo replications. We observe that in the Normal (1) and Normal (2) data settings, the calculated JcAIC values of the model with no transformation (Original), the Box-Cox transformation (Box-Cox Opt), and the Box-Cox Naive approach are very close. Often, the calculated JcAIC values for Box-Cox Opt and Box-Cox Naive are identical. This makes sense considering the corresponding estimated λs in Table 3, which are very close to one for both approaches, resulting in a distribution close to normality. The deviations of the estimated parameters from one can be explained by the finite population sample from the normal distribution. Looking at the Log data setting, we see that the distributions of the JcAIC values using the Log and the Box-Cox Opt approach are very close to each other. Again this makes sense, as the estimated λs (see Table 3) are close to zero, which results in a log transformation of the data. The JcAIC values of the Box-Cox Naive approach are slightly higher. In the case of the Box-Cox data setting, the JcAIC values of the Box-Cox Opt approach are the smallest, followed by the Box-Cox Naive approach. Again, the corresponding estimated λs match the true λ of −0.5 in this case. The values of the Log and Original approaches are considerably higher, which is reasonable given the underlying distribution of the data in this setting. In each setting, the magnitudes and ordering of the values correspond to the underlying distributions of the data and thus to our expectations.
Table 4 shows the proportions of selected optimal approaches/transformations and model formulas. For each data setting, the model with the transformation underlying the data-generating process is mostly selected as optimal, i.e., has the smallest JcAIC values. In the two Normal data settings, the calculated JcAIC values are smallest in around 64% and 69% of the runs when no transformation is used (Original), so Original is chosen as optimal; this corresponds to the underlying data-generating process. In the remaining cases, Box-Cox Opt and Box-Cox Naive are chosen as optimal with identical JcAIC values and very similar estimated λs (see Table 3). This is plausible given the underlying normal distribution: for normal data, it should make no difference whether the optimal model formula is chosen first without a transformation and an optimal λ close to one is estimated afterwards, or whether the model formula and transformation parameter are chosen simultaneously, as in the Box-Cox Opt approach. In the Log setting, the true underlying transformation (Log) is chosen as optimal in 71% of the 500 samples (simulation runs). In most of the remaining samples, the Box-Cox Opt approach, with an optimal λ near zero corresponding to a log transformation, outperforms the Box-Cox Naive approach; in 5.8% of the runs the JcAIC values are identical for both Box-Cox approaches. The advantage of the Box-Cox Opt approach is further illustrated in the Box-Cox data setting, where this approach outperforms the other approaches in 83.6% of cases. Looking at the second part of Table 4, it can be seen that in settings with high explanatory power (Normal (2), Log, Box-Cox), the correct model formula (x1 + x2 + x3) is selected in over 85% of the simulation runs. In the Normal (1) setting with lower explanatory power, however, the correct model formula is selected in only 60.8% of the samples.
This result seems justifiable since, if the explanatory power of the underlying true model is lower, it is more difficult to identify the true underlying relationship. The results emphasize that the presented approach allows for the selection of the optimal transformation parameter for the Box-Cox transformation and detects the true transformation.
In addition, it enables the selection of the correct model formula, although the success rate depends on the explanatory power of the underlying model.

Case study: poverty and inequality in municipalities of Guerrero
In this section, the proposed selection approach is applied to data from the state of Guerrero in Mexico for estimating poverty and inequality indicators at the municipal level. To provide reliable estimates of these indicators at the municipal level, it is necessary to use small area estimation. In order to demonstrate the proposed selection approach, we use a particular small area method, the empirical best predictor (EBP) by Molina and Rao (2010), which is based on a linear mixed model. In Sect. 5.1, we provide a brief overview of small area estimation and the EBP. In Sect. 5.2, we describe the data and the problem of simultaneously finding the optimal (linear mixed) model and the transformation parameter. We apply our proposed selection approach and two naive approaches and present the results of the indicators in Sect. 5.3.

Small area estimation and the empirical best predictor
Many surveys are designed to study total populations. For a sample of the total population, direct estimators, for instance the Horvitz-Thompson estimator (Horvitz and Thompson 1952), can provide reliable estimates when there are enough observations/units in the sample. However, direct estimation methods are appropriate only with a sufficient sample size for every domain/area of interest, which is often not the case at a disaggregated regional level. Furthermore, direct estimators cannot be calculated for domains with no sample data (i.e., out-of-sample domains), and they have overly large standard errors for domains with only few sampled units (Rao and Molina 2015). When direct estimation cannot provide adequate precision for a domain of interest because of insufficient data, the domain is defined as small and is called a small area or small domain (Rao and Molina 2015; Tzavidis et al. 2018). One way to improve direct estimates is small area estimation. Small area methods aim to improve the efficiency of the estimation by combining sample data with data from a census/register based on a model (Rao and Yu 1994; Jiang and Lahiri 2006). The census/register contains auxiliary variables that may be correlated with the dependent variable and can be used to improve the direct estimates. This is a more complex task, as it depends on model building and diagnostics; model building may include the use of a transformation, the selection of covariates, or the treatment of non-normal error terms.
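As a concrete illustration of a direct estimator, here is a minimal sketch of the Horvitz-Thompson estimator of a domain total; the function name and toy numbers are ours, not from the paper:

```python
import numpy as np

# Horvitz-Thompson estimator of a domain total: weight each sampled
# unit by the inverse of its inclusion probability pi_i.
def horvitz_thompson_total(y, pi):
    y, pi = np.asarray(y, float), np.asarray(pi, float)
    return float(np.sum(y / pi))

# Toy domain: three sampled incomes, each drawn with inclusion
# probability 0.1, so each sampled unit represents 10 population units.
total = horvitz_thompson_total([100.0, 120.0, 80.0], [0.1, 0.1, 0.1])
print(total)  # 3000.0
```

With only a handful of sampled units per domain, such direct estimates have large variance, which is exactly the motivation for model-based small area methods.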
Since there is no survey that can produce reliable direct estimates of poverty and inequality indicators at the municipal level in Guerrero, we use the EBP approach. The approach uses the nested error linear regression model by Battese et al. (1988). This model is a special linear mixed model which includes only random (area-specific) intercepts. In the following, we briefly introduce the EBP. Further details are available in Molina and Rao (2010) and Rojas-Perilla et al. (2020).
Assume a finite population of size N divided into D domains, where N_i denotes the size of the i-th domain for i = 1, ..., D. Let y be the target welfare variable (e.g., income) and y_ij the welfare measure of the j-th unit in the i-th domain, where j = 1, ..., N_i. The sample does not include all N units of the population but only a part of it. The sample has size n and can also be divided into D domains; n_i denotes the sample size of the i-th domain, so that n = sum_{i=1}^{D} n_i. Then, the nested error linear regression model is given by

y_ij = x_ij^T β + u_i + ε_ij,  u_i ~ N(0, σ_u^2),  ε_ij ~ N(0, σ_ε^2),

where u_i denotes the area random effect and ε_ij the error term. Let θ = (β, σ_u^2, σ_ε^2) be the vector of model parameters. The EBP approach is shortly outlined as follows:

1. Fit the model using the sample data to obtain θ̂ = (β̂, σ̂_u^2, σ̂_ε^2) and û_i.
2. For l = 1, ..., L, generate

ỹ_ij^(l) = x_ij^T β̂ + û_i + v_i^(l) + ε_ij^(l), with v_i^(l) ~ N(0, σ̂_u^2 (1 − γ̂_i)), for in-sample domains,
ỹ_ij^(l) = x_ij^T β̂ + u_i^(l) + ε_ij^(l), with u_i^(l) ~ N(0, σ̂_u^2), for out-of-sample domains,

using θ̂ with γ̂_i = σ̂_u^2 / (σ̂_u^2 + σ̂_ε^2 / n_i).
3. Obtain L pseudo-populations by plugging the explanatory variables from the auxiliary data (i.e., x_ij) together with β̂, û_i, the generated random effects, and ε_ij^(l) into the model above; the EBP of a target indicator is obtained by calculating the indicator on each pseudo-population and averaging over the L values.

The EBP with data-driven transformed y is obtained similarly to the EBP described above. The detailed estimation of the EBP with data-driven transformations and corresponding uncertainty measures based on the MSE of the EBP are further explained in Rojas-Perilla et al. (2020).
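Step 2 above, the pseudo-population generation, can be sketched in a few lines. The sketch assumes the model has already been fitted; all numeric values and names (beta_hat, u_hat, n_i) are illustrative placeholders, not estimates from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical fitted quantities from step 1 (illustrative values only):
beta_hat = np.array([1.0, 0.5])
sigma2_u, sigma2_e = 0.25, 1.0
u_hat = {"area1": 0.3}   # predicted random effect of an in-sample area
n_i = {"area1": 20}      # sample size of that area

def generate_pseudo_income(X, area, L=50):
    """Step 2 of the EBP: draw L pseudo-populations of y for one area."""
    out = []
    for _ in range(L):
        if area in u_hat:  # in-sample domain: shrinkage factor gamma_i
            gamma = sigma2_u / (sigma2_u + sigma2_e / n_i[area])
            v = rng.normal(0.0, np.sqrt(sigma2_u * (1.0 - gamma)))
            u = u_hat[area] + v
        else:              # out-of-sample domain: draw a fresh random effect
            u = rng.normal(0.0, np.sqrt(sigma2_u))
        eps = rng.normal(0.0, np.sqrt(sigma2_e), size=X.shape[0])
        out.append(X @ beta_hat + u + eps)
    return np.array(out)

# Census covariates for 100 population units of the area:
X_census = np.column_stack([np.ones(100), rng.normal(size=100)])
pseudo = generate_pseudo_income(X_census, "area1", L=10)
print(pseudo.shape)  # (10, 100)
```

The indicator of interest would then be computed on each of the L rows and averaged, as in step 3.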

Data and problem
This study uses survey data from the 2010 Encuesta Nacional de Ingresos y Gastos de los Hogares (ENIGH, National Survey of Household Income and Expenditure) as sample data. This survey is performed every two years by the Instituto Nacional de Estadística y Geografía (INEGI, the National Institute of Statistics and Geography) and contains sociodemographic information on households, which are also the units of the data. INEGI also performs the national population and housing census every ten years; the 2010 census data is used as auxiliary data in the following application. Guerrero is located in southwestern Mexico and borders the Pacific Ocean. The state is divided into 81 municipalities; 40 municipalities are in the survey data and 41 municipalities are not in the sample. Table 5 shows a summary of the number of households per domain in the survey and census data. The total per capita income (ictpc) is used as the measurement of welfare. As we use the linear mixed model in Eq. (15) to explain ictpc, the Gaussian assumptions on the random effects and errors are required. However, the histogram of ictpc in Fig. 1 shows that the distribution of ictpc is strongly right-skewed. Therefore, we apply the Box-Cox transformation to the target variable ictpc, such that the violation can be corrected or at least reduced. For the Box-Cox transformation, the optimal transformation parameter λ has to be found.
In the survey data there are 34 possible explanatory variables after excluding variables which are highly or perfectly correlated with other variables. We do not know which variables should be included to optimally explain the response variable; therefore, a variable selection has to be performed. Consequently, we have two problems to solve: obtaining the optimal transformation parameter λ and finding the optimal model. To solve these problems simultaneously, the optimal transformation parameter is estimated by the REML approach for each model candidate, and all model candidates with their own transformed data are compared with the JcAIC introduced in Sect. 3. With 34 possible explanatory variables, there are theoretically 2^34 model candidates; fitting all of them is infeasible because of the computational cost. Instead, the step-wise variable selection proposed in Sect. 3.4 is applied to find the optimal model. With the chosen optimal model, the EBPs of the poverty and inequality indicators are estimated.
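The greedy logic of such a step-wise search can be sketched generically. The toy criterion below merely mimics a score such as the JcAIC (smaller is better) and is not the paper's implementation; in practice `criterion` would fit the linear mixed model, estimate λ by REML, and evaluate the JcAIC:

```python
def stepwise_select(candidates, criterion):
    """Greedy forward selection: add, at each step, the variable that
    most reduces the criterion; stop when no addition improves it."""
    selected, best = [], criterion([])
    improved = True
    while improved:
        improved = False
        for var in [v for v in candidates if v not in selected]:
            score = criterion(selected + [var])
            if score < best:
                best, best_var, improved = score, var, True
        if improved:
            selected.append(best_var)
    return selected, best

# Toy criterion: the "true" model is {x1, x2}; each correct variable
# lowers the score by 4, each superfluous one adds a penalty of 2.
def toy_criterion(vars_):
    truth = {"x1", "x2"}
    return 10 - 4 * len(truth & set(vars_)) + 2 * len(set(vars_) - truth)

print(stepwise_select(["x1", "x2", "x3"], toy_criterion))  # (['x1', 'x2'], 2)
```

Each step fits at most one model per remaining variable, so the search scales linearly rather than exponentially in the number of candidates.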
To evaluate the EBP based on our optimal model, we apply two naive approaches which are typically used in applications. The first applies the simple logarithmic transformation, thereby avoiding the problem of finding the optimal transformation, and performs variable selection based on the cAIC on the log scale. The second performs the variable selection on the original y scale first to find the optimal model; afterwards, the optimal transformation parameter is chosen based on the model that is optimal for the original y. Consequently, we have three different EBP estimates: Box-Cox Opt denotes the EBP based on our selection method using the JcAIC as described in Sect. 3, Log denotes the first alternative EBP approach, and Box-Cox Naive denotes the second alternative EBP approach. These three EBP estimates are compared to show that the use of our proposed selection approach based on the JcAIC can improve the predictive power and reduce the uncertainty of the poverty and inequality estimates.

Results
First, the chosen variables of the optimal model and the optimal transformation parameters of each approach are compared. Table 6 shows the chosen variables of each approach and the estimated transformation parameter for the models with the Box-Cox transformation. We can see that the results of the variable selection can be strongly affected by the scale of the response variable. The Box-Cox Opt approach performs the variable selection on the Box-Cox transformed y, and Log performs the variable selection on the logarithmically transformed y. For these two approaches, a transformation is used to correct the violation of the Gaussian assumptions, and the optimal model is then chosen on the transformed y. As a result, the chosen variables for the models of Box-Cox Opt and Log are very similar. By contrast, Box-Cox Naive chooses the variables on the original y despite the violation of the Gaussian assumptions in the error terms. As a result, Box-Cox Naive includes different variables in the model compared to the other two. Interestingly, the optimal transformation parameters for Box-Cox Opt and Box-Cox Naive differ only slightly even though the models contain many different variables.
Second, in order to compare the predictive power of each model, the marginal R^2 and the conditional R^2 (Nakagawa and Schielzeth 2013) are calculated and summarized in Table 7.

Table 6  Chosen variables and estimated transformation parameter

Approach        Chosen variables                                        λ̂
Box-Cox Opt     X1, X2, ..., X23                                        0.1764
Log             X1, X2, ..., X19                                        -
Box-Cox Naive   X1, ..., X9, X20, ..., X23, X26, X27, X28               0.1888

The marginal R^2 measures the proportion of variance explained by the fixed effects, and the conditional R^2 gives the proportion of variance explained by both the fixed and the random effects. The models with the Box-Cox transformation (i.e., Box-Cox Opt and Box-Cox Naive) have higher predictive power than the model with the logarithmic transformation (i.e., Log). When we compare Box-Cox Opt and Box-Cox Naive, we see that Box-Cox Opt, whose model is optimal for the transformed y, has a higher marginal and conditional R^2 than Box-Cox Naive, whose model is optimal on the original y scale.
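The two R^2 measures compare as follows. This is a minimal sketch of the Nakagawa-Schielzeth formulas for a single random intercept, with made-up variance components rather than the values behind Table 7:

```python
import numpy as np

def nakagawa_r2(var_fixed, var_random, var_resid):
    """Marginal and conditional R^2 of Nakagawa and Schielzeth (2013).
    var_fixed is the variance of the fixed-effect predictions X @ beta,
    var_random the random-intercept variance, var_resid the residual one."""
    total = var_fixed + var_random + var_resid
    r2_marginal = var_fixed / total
    r2_conditional = (var_fixed + var_random) / total
    return r2_marginal, r2_conditional

# Illustrative variance components (not the paper's estimates):
r2m, r2c = nakagawa_r2(var_fixed=4.0, var_random=1.0, var_resid=5.0)
print(r2m, r2c)  # 0.4 0.5
```

The gap between the two values reflects how much of the explained variation is attributable to the area random effects rather than the covariates.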
Since the linear mixed model relies on Gaussian assumptions and we use a transformation to correct their violation, each approach should be examined as to whether the violation is indeed corrected. For this examination, the skewness and kurtosis of the residuals and the p-value of the Shapiro-Wilk normality test (Shapiro and Wilk 1965) on the residuals are calculated (Table 8). We observe that the logarithmic transformation performs worse than the Box-Cox transformations. For further details, we provide quantile-quantile (Q-Q) plots of the residuals from the three approaches in Fig. 5 in the Appendix. The household-level residuals are clearly closer to the normal distribution with transformations. The Box-Cox transformation corrects the violation in the household-level residuals rather well; however, the residuals deviate slightly in the tails. For the models with the Box-Cox transformation, we observe that the municipal-level residuals are very close to the normal distribution.
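The diagnostics reported in Table 8 can be computed as follows. The residuals here are simulated (a right-skewed and a roughly Gaussian set), so the numbers only illustrate the mechanics, not the paper's results:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulated stand-ins for model residuals: a right-skewed set
# (shifted lognormal) and a roughly Gaussian set.
resid_skewed = rng.lognormal(0.0, 1.0, size=500) - np.exp(0.5)
resid_normal = rng.normal(0.0, 1.0, size=500)

def normality_diagnostics(resid):
    """Skewness, kurtosis (normal value: 3) and Shapiro-Wilk p-value."""
    skew = stats.skew(resid)
    kurt = stats.kurtosis(resid, fisher=False)
    _, p_value = stats.shapiro(resid)
    return skew, kurt, p_value

skew_s, kurt_s, p_s = normality_diagnostics(resid_skewed)
skew_n, kurt_n, p_n = normality_diagnostics(resid_normal)
print(round(skew_s, 1), round(kurt_s, 1), p_s < 0.05)
```

A small Shapiro-Wilk p-value and skewness far from zero flag a violation of the Gaussian assumption, as for the untransformed ictpc residuals.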
Finally, we want to assess whether the improvement in the predictive power of the model due to the proposed simultaneous selection of the transformation and the covariates (Box-Cox Opt) translates into more precise small area estimates compared to the two alternative approaches (Log and Box-Cox Naive). Therefore, we estimate the mean income, the head count ratio (HCR) (Foster et al. 1984), and the Gini coefficient (Gini 1912) for the municipalities in Guerrero. To compare the efficiency of the three approaches, the root mean squared error (RMSE) of the municipal indicators is estimated by a bootstrap (Rojas-Perilla et al. 2020). The RMSE values are visualized in Fig. 2, which shows that Box-Cox Opt is the most efficient approach, since it has the smallest estimated RMSE for all three indicators. When the naive approaches are compared, we cannot say which one is more efficient, because for some indicators Log has the smaller RMSE and for others Box-Cox Naive does. It seems that the model and transformation selection is especially important for parameters associated with the tails of the distribution. Figure 3 shows EBP estimates of mean income, HCR, and Gini for the municipalities in Guerrero based on the Box-Cox Opt approach. The southwestern part of Guerrero along the coastline (Costa Grande region and Acapulco) features a tourism industry which contributes to the higher mean income of these municipalities. Furthermore, numerous industries focusing on the production of handcrafted items from local resources are concentrated along a north-south axis between Chilpancingo in the south and Taxco in the north; this also contributes to a higher income in these municipalities. Consequently, the HCR and the Gini coefficient in these municipalities are lower than in the others.
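The two indicators have simple definitions that can be sketched directly; the toy income vectors are ours and chosen so the results are known in closed form:

```python
import numpy as np

def head_count_ratio(income, poverty_line):
    """FGT(0) poverty measure: share of units below the poverty line."""
    return float(np.mean(np.asarray(income, float) < poverty_line))

def gini(income):
    """Gini coefficient from the sorted-income (rank) formula."""
    x = np.sort(np.asarray(income, float))
    n = x.size
    return 2.0 * np.sum(np.arange(1, n + 1) * x) / (n * x.sum()) - (n + 1) / n

equal = [10.0, 10.0, 10.0, 10.0]
print(head_count_ratio(equal, 12.0))  # 1.0
print(gini(equal))                    # 0.0 (perfect equality)
print(gini([0.0, 0.0, 0.0, 10.0]))   # 0.75 (one unit holds everything)
```

In the EBP, these functions would be evaluated on every pseudo-population of a municipality and averaged over the L replicates.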
This means that the people in these municipalities earn more money than in other municipalities and that wealth is more equally distributed compared to other municipalities. On the other hand, the eastern part of Guerrero suffers from higher levels of poverty and inequality. The municipalities in this region are mountainous and, compared to all other regions of Guerrero, have the highest number of indigenous inhabitants.

Conclusions and future research directions
The main purpose of this study was to find a solution to two practical problems in the context of linear mixed models: (a) the true model for explaining the response variable is unknown and (b) the model assumptions, especially the Gaussian assumptions on the error terms, are violated. Since these problems commonly appear together, we provide a solution that finds the optimal model and the optimal transformation simultaneously. We focus on one of the most commonly used transformations, the Box-Cox transformation. Since the cAIC is scale dependent, we provide an adjusted cAIC by using the Jacobian of the transformation, such that models with differently transformed response variables can be compared. As a large number of possible explanatory variables increases the computational cost, we propose a simultaneous selection approach based on the Jacobian-adjusted cAIC (JcAIC), which is also feasible for a large number of variables. Our model-based simulation studies show that the proposed selection approach chooses the true model with a transformation parameter close to the true value in most cases and performs better than naive selection approaches. The proposed simultaneous selection approach can be used in many different areas of research. As an example, we provide a case study where we apply the selection approach for estimating poverty and inequality indicators at the municipal level in Mexico. We observe that the model selected by the proposed simultaneous approach has a higher predictive power than the models of the other approaches. The improvements in terms of predictive power and model building translate into more precise small area estimates of the poverty and inequality indicators.
Further research could be directed towards alternative variable selection criteria. For instance, Bunke et al. (1999) show that the cross-validation selection criterion can simultaneously select the optimal parametric model and the optimal transformation parameter of the Box-Cox transformation for nonlinear regression models. Furthermore, Fang (2011) proves that the cAIC is asymptotically equivalent to leave-one-observation-out cross-validation for linear mixed models. Therefore, deriving the cross-validation selection criterion for the linear mixed model and comparing the results with the JcAIC might be a promising avenue for further research; selection based on a cross-validation criterion may also improve the quality of the prediction. Moreover, it is possible to derive the JcAIC for other transformations which require the estimation of a transformation parameter. The use of the JcAIC as a selection criterion between different transformations with different optimal models is also a potential research direction. However, it should be noted that our proposed approach is less useful when the point of interest is to interpret the effects of the chosen explanatory variables on the original scale of the data. Gurka et al. (2006) introduce a bias-corrected beta coefficient for linear mixed models under the Box-Cox transformation which allows a more precise interpretation of the beta coefficients; however, the interpretation only holds for the transformed response variable. On the original scale of the response, it is not clear how strong the effects of the explanatory variables are. To enable interpreting the effects of explanatory variables on the original scale, further research is needed for general regression models with a Box-Cox transformed response variable.
Funding Open Access funding enabled and organized by Projekt DEAL. The authors did not receive support from any organization for the submitted work.

Declarations
Financial or non-financial interests The authors have no relevant financial or non-financial interests to disclose.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

A. Bootstrap for Original
1. Fit the model in Eq. (1) to obtain estimates of the model parameters θ̂.
2. Generate u^(b) from N(0, Ĝ_0) and ε^(b) from N(0, σ̂^2) and create the bootstrap sample y^(b) = X β̂ + Z u^(b) + ε^(b).
3. Refit the model with the bootstrap sample y^(b) and obtain bootstrap estimates of the model parameters θ̂^(b) and û^(b).
4. Obtain the BC.

B. Bootstrap for Log
1. Transform y to y* using y* = log(y + s).
2. Fit the model with y* to obtain estimates of the model parameters θ̂.
3. Generate u^(b) from N(0, Ĝ_0) and ε^(b) from N(0, σ̂^2) and create the bootstrap sample y*^(b) = X β̂ + Z u^(b) + ε^(b).
4. Refit the model with the bootstrap sample y*^(b) and obtain bootstrap estimates of the model parameters θ̂^(b) and û^(b).
5. Back-transform y*^(b) to obtain y^(b) = exp(y*^(b)) − s.
6. Obtain the BC on the log(y + s) scale.
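Steps 3 to 5 of this bootstrap can be sketched as follows. All fitted quantities (beta_hat, sigma_u, sigma_e, s) are hypothetical placeholders, since in practice they come from fitting the model in step 2:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical fitted quantities on the log scale (illustrative values):
beta_hat = np.array([2.0, 0.3])
sigma_u, sigma_e, s = 0.4, 0.8, 1.0

def one_log_bootstrap_sample(X, areas):
    """Steps 3-5: simulate y* on the log scale from the fitted model,
    then back-transform with y = exp(y*) - s."""
    u = {a: rng.normal(0.0, sigma_u) for a in set(areas)}  # area effects
    y_star = (X @ beta_hat
              + np.array([u[a] for a in areas])
              + rng.normal(0.0, sigma_e, size=len(areas)))
    return np.exp(y_star) - s  # back-transformation to the original scale

X = np.column_stack([np.ones(6), rng.normal(size=6)])
areas = ["a", "a", "a", "b", "b", "b"]
y_b = one_log_bootstrap_sample(X, areas)
print(y_b.shape)  # (6,)
```

Refitting the model to each such bootstrap sample yields the θ̂^(b) and û^(b) of step 4; repeating over many b gives the bootstrap distribution used for the RMSE estimates.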

Graphics and tables
See Table 9, Figs. 4 and 5.