Robust Estimation under Linear Mixed Models: The Minimum Density Power Divergence Approach

Many real-life data sets can be analyzed using Linear Mixed Models (LMMs). Since these are ordinarily based on normality assumptions, under small deviations from the model the inference can be highly unstable when the associated parameters are estimated by classical methods. On the other hand, the density power divergence (DPD) family, which measures the discrepancy between two probability density functions, has been successfully used to build robust estimators with high stability associated with minimal loss in efficiency. Here, we develop the minimum DPD estimator (MDPDE) for independent but non-identically distributed observations in LMMs. We prove the theoretical properties, including consistency and asymptotic normality. The influence function and sensitivity measures are studied to explore the robustness properties. As a data-based choice of the MDPDE tuning parameter $\alpha$ is very important, we propose two candidates as "optimal" choices, where optimality is in the sense of choosing the strongest downweighting that is necessary for the particular data set. We conduct a simulation study comparing the proposed MDPDE, for different values of $\alpha$, with the S-estimators, M-estimators and the classical maximum likelihood estimator, considering different levels of contamination. Finally, we illustrate the performance of our proposal on a real-data example.


Introduction
A major interest in statistics concerns the estimation of averages and their variation. The most commonly used method for this purpose is, probably, the Linear Model (LM). In this model, to give an example from a two-way layout, the expected value (mean) $\mu_{ij}$ of an observation $y_{ij}$ may be expressed as a linear combination of unknown parameters such as $\mu_{ij} = \mu + \alpha_i + \beta_j$, where $\mu$, $\alpha_i$ and $\beta_j$ are the constants which we are interested in estimating. The linearity in the parameters means that we can write a linear model in the form $y_i = X_i\beta + \epsilon_i$, where $\beta$ is the vector of unknown parameters and the $X_i$'s are known matrices. This formulation is the same as that used in the case of the linear regression model. In the present work, we consider the Linear Mixed Models (LMMs), in which some (unknown) parameters are not treated as constants but as random variables. Random terms come into play when some items cannot be considered as fixed quantities, although their distributions are of interest. Hence, they are the tools to generalize the results to the entire population under study. The types of data that may be appropriately analyzed by LMMs include (i) clustered data, where the dependent variable is measured once for each subject (the unit of analysis) and the units of analysis are grouped into, or nested within, clusters; (ii) repeated-measures data, where the dependent variable is measured more than once on the same unit of analysis across levels of a factor, which may be time or experimental conditions; (iii) longitudinal data, where the dependent variable is measured at several points in time for each unit of analysis. For a general review of LMs and LMMs see McCulloch and Searle [2001].
The standard methods used to estimate the parameters in LMMs are maximum likelihood and restricted maximum likelihood. Generally, LMMs are based on normality assumptions, and it is well known that these classical methods are not robust and can be greatly affected in the presence of small deviations from the assumptions. Furthermore, outlier detection for modern large data sets can be very challenging and, in any case, robust techniques cannot be replaced by the application of classical methods on outlier-deleted data.
To answer the need for robust estimation in linear mixed models, a few methods have been proposed. The initial attempts were based on weighted versions of the log-likelihood function (see Huggins [1993a], Huggins [1993b], Huggins and Staudte [1994], Stahel and Welsh [1994], Richardson and Welsh [1995], Richardson [1997], Welsh and Richardson [1997]). Another attempt, discussed in Welsh and Richardson [1997], of robustifying linear mixed models consists of replacing the Gaussian distribution by the Student's t distribution (see also Lange et al. [1989], Pinheiro et al. [2001]). However, this modification of the error distribution is intractable and complicated to implement. Copt and Victoria-Feser [2006] adapted a multivariate high breakdown point S-estimator, namely the CVFS-estimator, to the linear mixed model setup, while the estimator given by Koller [2013], namely the SMDM-estimator, attempts to achieve robustness by a robustification of the score equations. Robust estimators have been proposed, more generally, for generalized linear mixed models by Yau and Kuk [2002] and Sinha [2004].
The density power divergence (DPD) [Basu et al., 1998], which measures the discrepancy between two probability density functions, has been successfully used to build a robust estimator for independent and identically distributed observations. Ghosh and Basu [2013] extended the construction of the DPD and the corresponding minimum DPD estimator (MDPDE) to the case of independent but non-identically distributed data. This approach and theory covers the linear regression model, and has later been extended to more general parametric regression models (Ghosh and Basu [2016], Ghosh and Basu [2019]; Castilla et al. [2018], Castilla et al. [2019]; Ghosh [2019], etc.). This MDPDE has become widely popular in recent times due to its good (asymptotic) efficiency along with high robustness, easy computability and direct interpretation as an intuitive generalization of the maximum likelihood estimator (MLE).
In the present work, we aim to develop a general robust estimation procedure that is able to deal with the linear mixed model setup. Hence, we adapt the MDPDE in order to treat LMMs, where the data are independent but non-identically distributed. We prove that the introduced estimator enjoys the desired robustness properties as well as the standard asymptotic properties of an estimator at the model.
The rest of the paper is organized as follows. In Section 2 we briefly present the MDPDE for non-homogeneous observations. In Section 3 we define the proposed estimator in the case of linear mixed models, and its asymptotic and robustness properties are explored. Section 4 reports the simulation study we conducted, comparing the performance of the MDPDE to the most recent methods and exploring also the case of contaminated data. Section 5 provides the application of the proposed estimator to a real data example. Concluding remarks are presented in Section 6. For brevity, the assumptions needed to prove the asymptotic normality of the estimator, the proof of the main theorem and some additional theoretical and Monte Carlo results are presented in the Online Supplementary Material.

The MDPDE for independent non-homogeneous observations
The density power divergence family was first introduced by Basu et al. [1998] as a measure of discrepancy between two probability density functions. The authors used this measure to robustly estimate the model parameters under the usual setup of independent and identically distributed data. The density power divergence measure $d_\alpha(g, f)$ between two probability densities $g$ and $f$ is defined, in terms of a single tuning parameter $\alpha \geq 0$, as
$$d_\alpha(g, f) = \int \left\{ f^{1+\alpha}(x) - \left(1 + \frac{1}{\alpha}\right) f^{\alpha}(x)\, g(x) + \frac{1}{\alpha}\, g^{1+\alpha}(x) \right\} dx, \quad \alpha > 0, \qquad (1)$$
$$d_0(g, f) = \int g(x) \ln \frac{g(x)}{f(x)}\, dx, \qquad (2)$$
where $\ln$ denotes the natural logarithm. Basu et al. [1998] demonstrated that the tuning parameter $\alpha$ controls the trade-off between efficiency and robustness of the resulting estimator. With increasing $\alpha$, the estimator acquires greater stability with a slight loss in efficiency. Since the divergence in (1) is not defined for $\alpha = 0$, $d_0(g, f)$ in Equation (2) represents the divergence obtained in the limit of (1) as $\alpha \to 0$, which corresponds to a version of the Kullback-Leibler divergence. On the other hand, $\alpha = 1$ generates the squared $L_2$ distance.
Let $G$ be the true data generating distribution and $g$ the corresponding density function. To model $g$, consider the parametric family of densities $\mathcal{F}_\Theta = \{f_\theta : \theta \in \Theta\}$. The minimizer of $d_\alpha(g, f_\theta)$ over $\theta \in \Theta$, whenever it exists, is the minimum DPD functional at the distribution point $G$. Note that the third term of the divergence $d_\alpha(g, f_\theta)$ is independent of $\theta$; hence it can be discarded from the objective function as it has no role in the minimization process. Consider a sequence of independent and identically distributed (i.i.d.) observations $Y_1, \ldots, Y_n$ from the true distribution $G$. Using the empirical distribution function $G_n$ in place of $G$, the MDPDE of $\theta$ can be obtained by minimizing
$$\int f_\theta^{1+\alpha}(x)\, dx - \left(1 + \frac{1}{\alpha}\right) \frac{1}{n} \sum_{i=1}^{n} f_\theta^{\alpha}(Y_i)$$
over $\theta \in \Theta$. In the above expression, the empirical distribution function is used to approximate its theoretical version (or, alternatively, the sample mean is used to approximate the population mean). Note that this construction is valid also in the case of continuous densities. See Basu et al. [2011] for more details and examples. Ghosh and Basu [2013] generalized the above concept of robust minimum DPD estimation to the more general case of independent non-homogeneous observations, i.e., they considered the case where the observed data $Y_1, \ldots, Y_n$ are independent but, for each $i$, $Y_i \sim g_i$, where $g_1, \ldots, g_n$ are possibly different densities with respect to some common dominating measure. We model $g_i$ by the family $\mathcal{F}_{i,\theta} = \{f_i(\cdot; \theta) : \theta \in \Theta\}$ for $i = 1, \ldots, n$. While the distributions $f_i(\cdot; \theta)$ can be distinct, they share the same parameter vector $\theta$. Ghosh and Basu [2013] proposed to minimize the average divergence between the data points and the model densities, which leads to the minimization of the objective function
$$H_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left[ \int f_i^{1+\alpha}(x; \theta)\, dx - \left(1 + \frac{1}{\alpha}\right) f_i^{\alpha}(Y_i; \theta) \right] = \frac{1}{n} \sum_{i=1}^{n} H_i(Y_i, \theta), \qquad (3)$$
where $H_i(Y_i, \theta)$ is the indicated term within the square brackets in the above equation. Differentiating the above expression with respect to $\theta$, we get the estimating equations of the MDPDE for non-homogeneous observations. Note that the estimating equation is unbiased when each $g_i$ belongs to the model family $\mathcal{F}_{i,\theta}$, respectively. When $\alpha \to 0$, the corresponding objective function reduces to $-\sum_{i=1}^{n} \ln f_i(Y_i; \theta)/n$, which is the negative of the log-likelihood function divided by $n$. In Section SM-1, we report the assumptions (A1)-(A7) which are used to prove the asymptotic normality of the MDPDE [Ghosh and Basu, 2013].
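To make the construction concrete, the objective function above can be coded directly whenever the integral term admits a closed form. The following Python sketch (illustrative only; the function and variable names are ours, not from the paper) does so for a homoscedastic normal linear regression, a special case of the non-homogeneous setup with $f_i = N(x_i^\top\beta, \sigma^2)$, for which $\int f_i^{1+\alpha} = (2\pi\sigma^2)^{-\alpha/2}(1+\alpha)^{-1/2}$:

```python
import numpy as np
from scipy.stats import norm

def mdpde_objective(theta, y, X, alpha):
    """MDPDE objective for a normal linear regression: the observations
    are non-homogeneous because the means X @ beta differ across i,
    while theta = (beta, sigma^2) is shared by all of them."""
    beta, sigma = theta[:-1], np.sqrt(theta[-1])
    mu = X @ beta
    # Closed form of the integral term for N(mu_i, sigma^2):
    # int f_i^{1+alpha} = (2 pi sigma^2)^(-alpha/2) * (1+alpha)^(-1/2)
    int_term = (2 * np.pi * sigma**2) ** (-alpha / 2) / np.sqrt(1 + alpha)
    dens = norm.pdf(y, loc=mu, scale=sigma)
    return np.mean(int_term - (1 + 1 / alpha) * dens**alpha)
```

The MDPDE is then obtained by passing `mdpde_objective` to a generic numerical optimizer such as `scipy.optimize.minimize`.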

The MDPDE for Linear Mixed Models
The general formulation of an LMM may be expressed as $Y = X\beta + Zu + \epsilon$, where $X$ and $Z$ are known design matrices, $\beta$ is the parameter vector for fixed effects, $u$ is the vector of random effects and $\epsilon$ is the random error vector. More explicitly, let the model have $r$ random factors $u_j$ with $q_j$ levels, $j = 1, \ldots, r$, and denote the size of the $i$-th group by $n_i$. Note that $\sum_{i=1}^{n} n_i = N$ is the total number of observations. Under this setup we can rewrite the model as
$$Y_i = X_i\beta + \sum_{j=1}^{r} Z_{ij} u_{ij} + \epsilon_i, \quad i = 1, \ldots, n, \qquad (4)$$
where $Y_i$ ($n_i \times 1$) is the response vector for group $i$, $X_i$ ($n_i \times k$) and $Z_{ij}$ ($n_i \times q_j$) are the model matrices, $\beta$ ($k \times 1$) is the vector of unknown parameters for fixed effects, $u_{ij}$ represents the realized values of $u_j$ for the $i$-th group and $\epsilon_i$ ($n_i \times 1$) is the error term. We assume that $\epsilon_i \sim N(0, \sigma_0^2 I_{n_i})$ and $u_{ij} \sim N(0, \sigma_j^2 I_{q_j})$, where $I_n$ is the $n \times n$ identity matrix, and $\epsilon_i$ and $u_{ij}$ are independent of each other for all $i, j$. Then $Y_i \sim N(X_i\beta, V_i)$. Thus, the $Y_i$'s are independent but not identically distributed; for each $i$, the covariance matrix of $Y_i$ can be written as
$$V_i = \sigma_0^2 I_{n_i} + \sum_{j=1}^{r} \sigma_j^2 Z_{ij} Z_{ij}^\top.$$
In this setting, we can obtain the MDPDE for the parameter vector $\theta = (\beta^\top; \sigma_j^2, j = 0, 1, \ldots, r)^\top$ by minimizing the objective function given in Equation (3) with $f_i \equiv N(X_i\beta, V_i)$. Upon simplification, the objective function is given by
$$H_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} (2\pi)^{-\frac{n_i\alpha}{2}} |V_i|^{-\frac{\alpha}{2}} \left[ (1+\alpha)^{-\frac{n_i}{2}} - \left(1 + \frac{1}{\alpha}\right) e^{-\frac{\alpha}{2} Q_i(\beta)} \right], \qquad (5)$$
where $Q_i(\beta) = (Y_i - X_i\beta)^\top V_i^{-1} (Y_i - X_i\beta)$. Differentiating the above equation with respect to $\beta$, we get the corresponding estimating equation for the MDPDE of $\beta$ as
$$\sum_{i=1}^{n} (2\pi)^{-\frac{n_i\alpha}{2}} |V_i|^{-\frac{\alpha}{2}} e^{-\frac{\alpha}{2} Q_i(\beta)}\, X_i^\top V_i^{-1} (Y_i - X_i\beta) = 0. \qquad (6)$$
Let $U_{ij}$ denote the partial derivative of the matrix $V_i$ with respect to $\sigma_j^2$. We have that $U_{i0} = I_{n_i}$, and $U_{ij} = Z_{ij} Z_{ij}^\top$, $j = 1, \ldots, r$. Then, the partial derivative of the objective function with respect to $\sigma_j^2$, $j = 0, 1, \ldots, r$, leads to their MDPDE estimating equations as given by
$$\sum_{i=1}^{n} (2\pi)^{-\frac{n_i\alpha}{2}} |V_i|^{-\frac{\alpha}{2}} \left[ (1+\alpha)^{-\frac{n_i}{2}} \mathrm{Tr}(V_i^{-1} U_{ij}) + \left(1 + \frac{1}{\alpha}\right) e^{-\frac{\alpha}{2} Q_i(\beta)} \left\{ (Y_i - X_i\beta)^\top V_i^{-1} U_{ij} V_i^{-1} (Y_i - X_i\beta) - \mathrm{Tr}(V_i^{-1} U_{ij}) \right\} \right] = 0, \qquad (7)$$
where $\mathrm{Tr}(\cdot)$ denotes the trace of the argument matrix. Solving Equations (6)-(7) numerically, we can obtain the estimates of $\theta = (\beta^\top; \sigma_j^2, j = 0, 1, \ldots, r)^\top$. In case multiple roots exist, we should choose the one minimizing the objective function $H_n(\theta)$ in (5) as the targeted MDPDE of $\theta$.
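The objective function can again be coded directly. A minimal numerical sketch in Python, assuming the multivariate-normal closed form $\int f_i^{1+\alpha} = (2\pi)^{-n_i\alpha/2}|V_i|^{-\alpha/2}(1+\alpha)^{-n_i/2}$ (the function and argument names are illustrative, not from the paper):

```python
import numpy as np
from scipy.stats import multivariate_normal

def lmm_mdpde_objective(beta, sigma2, Y_list, X_list, Z_lists, alpha):
    """MDPDE objective for the LMM: f_i = N(X_i beta, V_i) with
    V_i = sigma2[0] I + sum_j sigma2[j] Z_ij Z_ij'.
    sigma2 = (sigma^2_0, sigma^2_1, ..., sigma^2_r)."""
    total = 0.0
    for Y_i, X_i, Z_i in zip(Y_list, X_list, Z_lists):
        n_i = len(Y_i)
        V_i = sigma2[0] * np.eye(n_i)
        for j, Z_ij in enumerate(Z_i, start=1):
            V_i += sigma2[j] * Z_ij @ Z_ij.T
        # Closed-form integral term for an n_i-variate normal:
        # (2 pi)^{-n_i alpha/2} |V_i|^{-alpha/2} (1+alpha)^{-n_i/2}
        det = np.linalg.det(V_i)
        int_term = ((2 * np.pi) ** (-n_i * alpha / 2)
                    * det ** (-alpha / 2) * (1 + alpha) ** (-n_i / 2))
        f_i = multivariate_normal.pdf(Y_i, mean=X_i @ beta, cov=V_i)
        total += int_term - (1 + 1 / alpha) * f_i ** alpha
    return total / len(Y_list)
```

Minimizing this function jointly over `beta` and `sigma2` (for instance with a general-purpose optimizer, constraining the variance components to be positive) is one plausible route to the estimates defined by the equations above.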
Note that, substituting $\alpha = 0$ in the estimating equations (5)-(7), we recover the MLE score equations. Thus, the MDPDE at $\alpha = 0$ is nothing but the usual MLE under our LMM setup as well.
Now, we present some conditions on the independent variables and on the variance-covariance matrices that will be used to derive the asymptotic distribution of the MDPDE of the parameter vector $\theta = (\beta^\top; \sigma_j^2, j = 0, 1, \ldots, r)^\top$ in the LMM application.

(MM1) Define $X = (X_1^\top, \ldots, X_n^\top)^\top$, and assume that $X_i$ and $Z_i$ are full rank matrices for all $i$.
(MM2) The values of the $X_i$'s are suitably bounded, for all $j, k, l$, where $\mathbf{1}$ ($n_i \times 1$) is a vector of 1's.
(MM3) The matrices $V_i$ and $U_{ij}$ are such that, for all $j, k, l = 0, \ldots, r$, the associated trace terms are uniformly bounded, and the determinant $|V_i|$ is bounded away from both zero and infinity for all $i$.
(MM4) Define $X_i^* = V_i^{-\frac{1}{2}} X_i$, for each $i$, and $X^* = (X_1^{*\top}, \ldots, X_n^{*\top})^\top$. Under these conditions, we can derive the asymptotic distribution of the MDPDE of the parameters in the case of linear mixed models, which is presented in the following theorem; the proof is presented in Section SM-2 of the Supplementary Material.
Theorem 1. Consider the setup of the Linear Mixed Model presented in Section 3. Assume that the true data generating density belongs to the model family and that the independent variables satisfy Assumptions (MM1)-(MM4) for a given (fixed) $\alpha \geq 0$. Then, we have the following results as $n \to \infty$, keeping $n_i$ fixed for each $i$.
(iii) The asymptotic distribution of $\Omega_n^{-\frac{1}{2}} \Psi_n \left[\sqrt{n}(\hat{\theta} - \theta)\right]$ is $(k+r+1)$-dimensional normal with mean zero and covariance matrix $I_{(k+r+1)}$. In particular, the asymptotic distribution of $(X^{*\top} X^*)^{-\frac{1}{2}} (X^\top X)(\hat{\beta} - \beta)$ is a $k$-dimensional normal with mean zero and covariance matrix $I_k$, where $X$ and $X^*$ are as defined in Assumptions (MM1) and (MM4), respectively.

Influence function
To explore the robustness properties of the coefficient estimates in our treatment of linear mixed models, we derive the influence function of the MDPDEs following the theory explained in Ghosh and Basu [2013]. Denote the minimum density power divergence functional by $T_\alpha = (T_\alpha^\beta, T_\alpha^\Sigma)$ for the parameter vector $\theta = (\beta^\top, \Sigma^\top)^\top$ with $\Sigma = (\sigma_0^2, \ldots, \sigma_r^2)^\top$. We continue with the notation of the previous subsections.
The influence function of the estimator $T_\alpha^\beta$ with contamination in the direction $i_0$ at the point $t_{i_0}$ is computed to have the form given in Equation (14), and the corresponding influence function for the estimator $T_\alpha^\Sigma$ has the form given in Equation (15).
The functions $z e^{-z^\top z}$ and $z z^\top e^{-z^\top z}$ are bounded for $z \in \mathbb{R}^{n_i}$, and thus the influence functions in (14) and (15) are bounded in $t_{i_0}$ for any $i_0$ and any $\alpha > 0$. For $\alpha = 0$, the influence functions for $T_\alpha^\beta$ and $T_\alpha^\Sigma$ are seen to be unbounded; indeed this case corresponds to the non-robust maximum likelihood estimator. Hence, unlike the MLE, the minimum DPD estimators are B-robust, i.e. their associated influence functions are bounded, for $\alpha > 0$.
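The boundedness of these weight functions is also easy to check numerically. The short Python snippet below (a sanity check, not part of the paper's derivations) evaluates the radial profile $\|z\|\, e^{-z^\top z}$ on a grid; its maximum is attained at $\|z\| = 1/\sqrt{2}$, regardless of the dimension:

```python
import numpy as np

# The weight ||z|| * exp(-z'z) that drives the influence function
# depends on z only through r = ||z||, so a 1-d grid suffices.
r = np.linspace(0.0, 10.0, 200001)
w = r * np.exp(-r**2)
r_star = r[np.argmax(w)]        # location of the maximum, ~ 1/sqrt(2)
w_star = w.max()                # maximum value, ~ exp(-1/2)/sqrt(2)
```

Since the supremum is finite, the influence function inherits a finite bound for every $\alpha > 0$, matching the analytical claim above.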
Using similar computations, the influence functions of the estimators $T_\alpha^\beta$ and $T_\alpha^\Sigma$ with contamination in all the $n$ cases at the contamination points $t_1, \ldots, t_n$, respectively, are given in Equations (16) and (17). In this case also, the influence functions are bounded for $\alpha > 0$ and unbounded for $\alpha = 0$. Now, we compute the sensitivity measures introduced in Ghosh and Basu [2013]. For $\alpha > 0$, the gross-error sensitivity and the self-standardized sensitivity of the estimator $T_\alpha^\beta$, in the case of contamination only in the $i_0$-th direction, are both finite, where $\lambda_{\max}(A)$ indicates the largest eigenvalue of the matrix $A$ appearing in the self-standardization; they are equal to $\infty$ if $\alpha = 0$. Details of the computations are provided in Section SM-3 of the Supplementary Material. The sensitivity measures for $T_\alpha^\Sigma$ have no compact form and they are not reported separately.

A Particular Example of the LMM
Consider the model defined by Equation (4). Here, we study the simplest case in which $n_i = p$ for all $i = 1, \ldots, n$, and the associated random effects covariates ($Z_{ij}$) are also the same for all $i$. In this case, the covariance matrix of $Y_i$ is the same for all $i$ and is denoted by $V$, having the form
$$V = \sigma_0^2 \left( I_p + \sum_{j=1}^{r} \gamma_j U_j \right),$$
where $\gamma_j = \sigma_j^2/\sigma_0^2$ and $U_j = U_{ij}$, as it is independent of $i$. In this situation, we are able to derive an updating expression for the estimation of $\beta$, which is very useful for the implementation. Consider the objective function rewritten with the common covariance matrix $V$ in place of the $V_i$'s. Differentiating the above with respect to $\beta$, Equation (6) now corresponds to
$$\sum_{i=1}^{n} w_i(\beta)\, X_i^\top V^{-1} (Y_i - X_i\beta) = 0,$$
where we denote $w_i(\beta) = e^{-\frac{\alpha}{2}(Y_i - X_i\beta)^\top V^{-1} (Y_i - X_i\beta)}$. Solving for $\beta$, we get
$$\beta = \left( \sum_{i=1}^{n} w_i(\beta)\, X_i^\top V^{-1} X_i \right)^{-1} \sum_{i=1}^{n} w_i(\beta)\, X_i^\top V^{-1} Y_i,$$
so that, in an iterative fixed point algorithm, the successive iterates have the relation
$$\beta^{(t+1)} = \left( \sum_{i=1}^{n} w_i(\beta^{(t)})\, X_i^\top V^{-1} X_i \right)^{-1} \sum_{i=1}^{n} w_i(\beta^{(t)})\, X_i^\top V^{-1} Y_i.$$
Note that the asymptotic distribution of the estimator of $\beta$ also has a simpler form. In particular, the suitably standardized estimator $\hat{\beta}$ is asymptotically $k$-dimensional normal with mean zero and covariance matrix $\upsilon_\alpha^\beta I_k$, where
$$\upsilon_\alpha^\beta = \left( \frac{(1+\alpha)^2}{1+2\alpha} \right)^{\frac{p+2}{2}}. \qquad (20)$$
Unfortunately, we cannot derive a similar simple form for the estimator of the variance parameters.
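The fixed point relation lends itself to a very short implementation. The Python sketch below is one plausible rendering (the names are ours), assuming the exponential downweights $w_i(\beta)$ and a known common covariance $V$; each iterate is a weighted generalized least squares step:

```python
import numpy as np

def mdpde_beta_fixed_point(Y, X, V, alpha, beta0, n_iter=100, tol=1e-8):
    """Fixed-point iteration for the MDPDE of beta in the balanced case
    (common covariance V): each group is downweighted by
    w_i = exp(-(alpha/2) * (Y_i - X_i beta)' V^{-1} (Y_i - X_i beta)),
    then beta is refreshed by weighted generalized least squares."""
    Vinv = np.linalg.inv(V)
    beta = beta0.copy()
    for _ in range(n_iter):
        resid = [y - x @ beta for y, x in zip(Y, X)]
        w = np.array([np.exp(-0.5 * alpha * r @ Vinv @ r) for r in resid])
        A = sum(wi * x.T @ Vinv @ x for wi, x in zip(w, X))
        b = sum(wi * x.T @ Vinv @ y for wi, x, y in zip(w, X, Y))
        beta_new = np.linalg.solve(A, b)
        if np.linalg.norm(beta_new - beta) < tol:
            return beta_new
        beta = beta_new
    return beta
```

In practice one would start the iteration from a cheap preliminary fit (e.g. ordinary least squares) and alternate it with updates of the variance components, which do not admit a comparable closed-form step.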
Furthermore, the following simpler form of the influence function allows us to assess the behaviour of the sensitivity measures with respect to the tuning parameter $\alpha$. The influence function of the functional $T_\alpha^\beta$ with contamination in the direction $i_0$, given in Equation (14), simplifies accordingly, and the gross-error sensitivity for the functional $T_\alpha^\beta$ is given in Equation (21). Similarly, the self-standardized sensitivity of the functional $T_\alpha^\beta$ is given in Equation (22). The function $(1+\alpha)^{\frac{p+2}{2}}/\sqrt{\alpha}$ in the gross-error sensitivity (21) has a minimum at the value $\alpha^* = \frac{1}{p+1}$, suggesting that this value of the parameter $\alpha$ gives the most robust estimator. Similarly, the function $(1+\alpha)^{\frac{p+2}{4}}/\sqrt{\alpha}$ in the self-standardized sensitivity (22) has a minimum at the value $\bar{\alpha} = \frac{2}{p}$. These findings are in contrast with the previously held view of this parameter, which was introduced as a monotone trade-off between efficiency and robustness.
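The two stated minimizers can be verified numerically. The snippet below (a check under the proportionality forms quoted above, with hypothetical helper names) minimizes $s(a) = (1+a)^c/\sqrt{a}$ on a grid and compares the result with the closed-form stationary point $a = 1/(2c-1)$:

```python
import numpy as np

def grid_argmin(c):
    """Grid minimizer of s(a) = (1+a)^c / sqrt(a) over a in (0, 2].
    Setting the derivative of log s to zero gives a = 1/(2c - 1):
    c = (p+2)/2 yields alpha* = 1/(p+1) (gross-error sensitivity),
    c = (p+2)/4 yields alpha-bar = 2/p (self-standardized one)."""
    a = np.linspace(1e-4, 2.0, 400001)
    s = (1 + a) ** c / np.sqrt(a)
    return a[np.argmin(s)]
```

For instance, with $p = 12$ (the dimension used later in the real-data example), the grid minimizers agree with $\alpha^* = 1/13$ and $\bar{\alpha} = 1/6$ to grid accuracy.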
In the following, we present a simple example for which we compute the theoretical quantities introduced above. This example of a linear mixed model has been chosen for its similarity to the case of longitudinal data; it is often called an LMM with random intercept and random slope.
We consider $n = 50$ different subjects (groups) and, for each of them, $p = 10$ measurements taken with respect to the factor $u_{i2}$, $i = 1, \ldots, n$, with two levels, modeled here as a random effect. The $X$ model matrices are simulated from a standard normal. The random effects and the errors $\epsilon_i \sim N(0, \sigma_0^2 I_p)$ are independent. Hence, for this model, $\theta = (\beta_0, \beta_1, \sigma_1^2, \sigma_2^2, \sigma_0^2)$ and we take $\theta = (1, 2, 0.25, 0.5, 0.25)$ as the true values of the parameters. Using the given values, we compute the variance-covariance matrices $V_i$ and the matrices $\Psi_n$ and $\Omega_n$. First, we look at the Asymptotic Relative Efficiency (ARE) of the minimum density power divergence estimators with respect to the fully efficient maximum likelihood estimator. For example, the ARE of $\hat{\beta}$ is given by $(\upsilon_\alpha^\beta)^{-1}$, where $\upsilon_\alpha^\beta$ is as defined in Equation (20). The AREs of $\hat{\sigma}_i^2$, $i = 0, 1, 2$, are similarly defined and are computed using the general formulation of the asymptotic variance. Fig. 1 shows the asymptotic relative efficiencies of the estimators of $\beta_1$ and $\sigma_1^2$ for $\alpha \in [0, 0.6]$. It is easy to see that there is a loss of efficiency which increases with $\alpha$. However, for small positive values of $\alpha$, the estimator retains reasonable efficiency. The AREs of the estimators of the other parameters are similar to those displayed here, and are given in Section SM-4 of the Supplementary Material.
On the other hand, to study the robustness properties, Fig. 2 shows the influence functions of $T_\alpha^{\beta_0}$ and $T_\alpha^{\sigma_1^2}$, for $\alpha = 0, 0.05, 0.3, 0.6$. Here we have plotted $IF(t_1, \ldots, t_n, T_\alpha, G_1, \ldots, G_n)$, the influence function of the estimator $T_\alpha$, computed with respect to constant vectors $t = t(1, \ldots, 1)^\top$ for varying $t \in \mathbb{R}$. Note that, except for the case $\alpha = 0$, we can easily see that the influence function is bounded, as may also be noted from Equations (14) and (15); thus the estimator will be robust with respect to outliers. The influence functions for the estimators of the other parameters behave similarly; these plots are available in Section SM-4 of the Supplementary Material.
Finally, Figure 3 shows the gross-error sensitivity and the self-standardized sensitivity of the functional $T_\alpha^\beta$. Here, we have considered a particular direction $i_0 \in \{1, \ldots, n\}$. Note that, in the present case of balanced data, the choice of $i_0$ does not change the behaviour of the sensitivity measures with respect to $\alpha$.

Monte Carlo simulations
Here, we describe a simulation study conducted to assess the performance of the proposed estimator in the case of LMMs. It is compared to the primary existing competitors both under pure data and under contaminated data.

Model setting
This model setting was introduced in Agostinelli and Yohai [2016] and is reported here in order to facilitate the comparison of the considered estimators. Consider an LMM for a 2-way cross classification with interaction, where the model is given by
$$y_{fgh} = x_{fgh}^\top \beta_0 + a_f + b_g + c_{fg} + e_{fgh},$$
where $f = 1, \ldots, F$, $g = 1, \ldots, G$, and $h = 1, \ldots, H$. Here, we set $F = 2$, $G = 2$ and $H = 3$, getting $p = F \times G \times H = 12$. Also, $x_{fgh}$ is a $k \times 1$ vector whose last $k-1$ components are from a standard multivariate normal and whose first component is identically equal to 1, and $\beta_0 = (0, 2, 2, 2, 2, 2)^\top$ is the $k \times 1$ vector of the fixed parameters with $k = 6$. The random variables $a_f$, $b_g$ and $c_{fg}$ are the random effects, which are normally distributed with variances $\sigma_a^2$, $\sigma_b^2$, and $\sigma_c^2$. Arranging the $y_{fgh}$ in lexicographic order (ordered by $h$ within $g$ within $f$) we obtain the vector $y$ of dimension $p$, and in a similar way the $p \times k$ matrix $x$ is obtained by arranging the $x_{fgh}$. Similarly, we set $a = (a_1, \ldots, a_F)^\top$, $\gamma_0 = (\sigma_a^2, \sigma_b^2, \sigma_c^2) = (1/4, 1/4, 1/2)$ and $\eta_0 = \sigma_e^2 = 1/4$. We consider a sample of size $n = 100$ and four levels of contamination, $\varepsilon = 0$, 5, 10 and 15\%. Hence, $n \times \varepsilon$ observations are contaminated by replacing $n \times \varepsilon$ elements of the vector $y$ by observations from $y_0 \sim N_p(x_0 \beta_0 + \omega_0, \Sigma)$, and the corresponding components of $x$ are replaced by the components of $x_0$. The first column of $x_0$ is identically equal to 1, while the last $k-1$ columns are from $N_{p \times (k-1)}(\phi_0, 0.005^2 I_{p \times (k-1)})$, with all the components of $\phi_0$ equal to 1 in the case of low leverage outliers (lev1) or to 20 for large leverage outliers (lev20). Here $\omega_0$ is a $p$-vector of constants all equal to $\omega_0$, taken in a grid of values which would generate unlikely responses for the model and allow us to explore the behavior of our estimator under such adverse conditions.
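For reference, the replacement contamination scheme described above can be sketched as follows in Python (a hypothetical generator; the function name and the handling of the design array are our own choices):

```python
import numpy as np

def contaminate(y, x, beta0, Sigma, eps, omega0, phi0, rng):
    """Replace a fraction eps of the n response vectors (rows of y) by
    outliers y0 ~ N_p(x0 beta0 + omega0 * 1, Sigma), where x0 has a
    first column of 1's and remaining entries ~ N(phi0, 0.005^2)
    (phi0 = 1 for low-leverage, phi0 = 20 for high-leverage outliers).
    y has shape (n, p); x has shape (n, p, k)."""
    n, p = y.shape
    k = beta0.shape[0]
    m = int(round(n * eps))
    idx = rng.choice(n, size=m, replace=False)
    L = np.linalg.cholesky(Sigma)
    for i in idx:
        x0 = np.hstack([np.ones((p, 1)),
                        rng.normal(phi0, 0.005, size=(p, k - 1))])
        y[i] = x0 @ beta0 + omega0 + L @ rng.normal(size=p)
        x[i] = x0
    return y, x
```

Sweeping `omega0` over a grid then traces out the sensitivity curves reported below.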

Performance Measures
Let $(y, x)$ be an observation independent of the sample $(y_1, x_1), \ldots, (y_n, x_n)$ used to compute $\hat{\beta}$, and let $\hat{y} = x\hat{\beta}$ be the predicted value of $y$ using $x$. Then, the squared Mahalanobis distance between $\hat{y}$ and $y$ using the matrix $\Sigma_0$ is $(\hat{y} - y)^\top \Sigma_0^{-1} (\hat{y} - y)$. Since $y - x\beta_0$ is independent of $x$ and has covariance matrix $\Sigma_0$, putting
$$m(\hat{\beta}, \beta_0, A) = (\hat{\beta} - \beta_0)^\top A (\hat{\beta} - \beta_0), \quad A = E(x^\top \Sigma_0^{-1} x),$$
we obtain $E[(\hat{y} - y)^\top \Sigma_0^{-1} (\hat{y} - y)] = m(\hat{\beta}, \beta_0, A) + p$. Then, to evaluate an estimator $\hat{\beta}$ of $\beta$ by its prediction performance we can use $E\, m(\hat{\beta}, \beta_0, A)$. Let $N$ be the number of replications in the simulation study, and let $\hat{\beta}_j$, $1 \leq j \leq N$, be the value of $\hat{\beta}$ at the $j$-th replication; then we can estimate $E\, m(\hat{\beta}, \beta_0, A)$ by the mean squared Mahalanobis distance
$$\mathrm{MSMD} = \frac{1}{N} \sum_{j=1}^{N} m(\hat{\beta}_j, \beta_0, A).$$
It is easy to prove that, as in this case $x$ is a $p \times k$ matrix whose cells are independent $N(0, 1)$ random variables, then $A = \mathrm{trace}(\Sigma_0^{-1}) I_k$. Given two $p$-dimensional covariance matrices $\Sigma_1$ and $\Sigma_0$, one way to measure how close $\Sigma_1$ and $\Sigma_0$ are is through the Kullback-Leibler divergence between two multivariate normal distributions with the same mean and covariance matrices equal to $\Sigma_1$ and $\Sigma_0$, given by
$$\mathrm{KLD}(\Sigma_1, \Sigma_0) = \frac{1}{2} \left[ \mathrm{Tr}(\Sigma_0^{-1} \Sigma_1) - p - \ln \frac{|\Sigma_1|}{|\Sigma_0|} \right].$$
Since $(\eta_0, \gamma_0)$ determines $\Sigma_0 = \Sigma(\eta_0, \gamma_0)$, the covariance matrix of $y$ given $x$ for the particular LMM considered in our simulation (as described in Section 4.1), one way to measure the performance of an estimator $(\hat{\eta}, \hat{\gamma})$ of $(\eta_0, \gamma_0)$ is by $E[\mathrm{KLD}(\Sigma(\hat{\eta}, \hat{\gamma}), \Sigma_0)]$. Let $(\hat{\eta}_j, \hat{\gamma}_j)$, $1 \leq j \leq N$, be the value of $(\hat{\eta}, \hat{\gamma})$ at the $j$-th replication; then we can estimate $E[\mathrm{KLD}(\Sigma(\hat{\eta}, \hat{\gamma}), \Sigma_0)]$ by the mean Kullback-Leibler divergence
$$\mathrm{MKLD} = \frac{1}{N} \sum_{j=1}^{N} \mathrm{KLD}(\Sigma(\hat{\eta}_j, \hat{\gamma}_j), \Sigma_0).$$
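The KLD between two mean-matched multivariate normals has the standard closed form used above, so the MKLD is straightforward to compute. A small Python sketch (with illustrative names):

```python
import numpy as np

def kld_normal(Sigma1, Sigma0):
    """Kullback-Leibler divergence KLD(N(mu, Sigma1) || N(mu, Sigma0))
    for equal means: 0.5 * (tr(M) - p - log det(M)), M = Sigma0^{-1} Sigma1."""
    p = Sigma0.shape[0]
    M = np.linalg.solve(Sigma0, Sigma1)
    sign, logdet = np.linalg.slogdet(M)
    return 0.5 * (np.trace(M) - p - logdet)

def mkld(Sigma_hats, Sigma0):
    """Monte-Carlo estimate of E[KLD]: average over the replications."""
    return np.mean([kld_normal(S, Sigma0) for S in Sigma_hats])
```

Using `slogdet` rather than a direct determinant keeps the computation stable when $p$ is moderately large.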

Results
We begin with the performance of the estimators in the absence of contamination.
Table 1 shows the relative efficiency of the CVFS-estimator, the SMDM-estimator, and the MDPDE for different values of $\alpha$, with respect to maximum likelihood. The efficiency of the estimators of $\beta$ has been measured by the MSMD ratio, while the MKLD ratio was used for the efficiency of the estimators of $(\eta, \gamma)$.
The MDPDEs exhibit a high relative efficiency, even greater than that of the competitor estimators, for small values of $\alpha$, while the efficiency decreases with increasing $\alpha$. Note that the MDPDEs are far more successful in retaining the efficiency of the estimators of the random components. For very small values of $\alpha$, the MDPDEs dominate both competitors (at least up to $\alpha = \alpha^*$ for SMDM, and at least up to $\alpha = 0.2$ for CVFS) in terms of both (MSMD and MKLD) efficiency measures. As the value of $\alpha$ increases, the MSMD efficiency of the MDPDE eventually lags behind its competitors, but in terms of MKLD efficiency it beats both competitors at least up to $\alpha = 0.4$. On the whole, it is clear that under pure data a properly chosen member of the MDPDE class can perform competitively, if not better, compared to the SMDM and CVFS estimators. For a simpler comparison, Table 2 reports the maximum values of MSMD and MKLD over the values of $\omega_0$ considered in the range of our Monte Carlo setting.
As expected, small values of $\alpha$ yield much higher maximum values than the other estimators in Table 2. However, for slightly larger values of $\alpha$, the MDPDEs are extremely competitive with the existing estimators. It may easily be observed that the MDPDE at $\bar{\alpha}$ clearly beats both competitors (CVFS and SMDM) over both performance measures at both leverage values (except at MSMD, lev20, where its performance measure is equal to that of CVFS). In this example, the MDPDE at $\alpha = 0.2$ fares even better. The values $\alpha^*$ and $\bar{\alpha}$ represent theoretical optimal choices, even though they may not present the lowest maximum values of the MSMD or MKLD measures.
Figures 4a and 4b display the MSMD and MKLD as functions of $\omega_0$, comparing the CVFS- and SMDM-estimators with the MDPDEs for three chosen values of $\alpha$, under 10\% outlier contamination. In particular, we choose $\alpha^*$ and $\bar{\alpha}$, since they are the values suggested by theory, and $\alpha = 0.3$, since it shows the lowest (or very close to the lowest) maximum values of MSMD and MKLD. We can see that most of the MDPDEs outperform the CVFS- and SMDM-estimators, especially in the case of leverage 20 (lev20), where the SMDM-estimator shows an unbounded behaviour. On the other hand, in the case of leverage 1 (lev1), even if the CVFS-estimator presents lower maximum values of MSMD and MKLD for very small values of $\omega_0$, the MDPDEs show a better performance as $\omega_0$ increases. In fact, the MDPDE at $\alpha = 0.3$ is competitive with or better than CVFS at all values of $\omega_0$.

Real-data example: Extrafoveal Vision Acuity
Let us now present an application of the proposed estimation method to a real data example. We compare the estimates obtained by the minimum DPD method with those obtained using the classical (non-robust) restricted MLE, computed using the lmer function in R, as well as with the robust competitors, the SMDM-estimator and the CVFS-estimator. A very important consideration in real situations is the selection of an "optimum" value of $\alpha$ that applies to the given data set. We use different values of $\alpha$ to highlight the behavior of the estimator seen in the simulations. In general, we consider the values $\alpha^*$ and $\bar{\alpha}$, derived from theoretical computations, as suggested optimal values. We consider the study conducted by Frömer et al. [2015] about the relationship between individual differences in foveal visual acuity, extrafoveal vision (acuity and crowding), and reading time measures, such as reading rate and preview benefit.
There were 40 participants in the study, all with normal visual acuity; foveal visual acuity was measured using the adaptive computerized Freiburg Acuity Test (FrACT) [Bach, 1996], together with extrafoveal vision. The second session was taken after a week, consisting of an eye-tracking experiment with list reading followed by the EVA procedure. The EVA was performed considering four test conditions: identification of single letters and flanked letters in the left and right visual field. Here, we consider only data coming from measurements with the EVA procedure and do not deal with data related to the reading task. This kind of data can be modeled using a Linear Mixed Model. In particular, we studied a repeated-measures Analysis of Variance (ANOVA) of the threshold eccentricities (TE), with random effects given by extrafoveal vision (EV) (single versus crowded letter), hemifield (H) (left, right), and test repetition ($T_1$, $T_2$, $T_3$). Thus, combining the factors given above, we have $p = 12$ measurements for each subject (participant). The model for the $i$-th subject, $i = 1, \ldots, n$, $n = 40$, combines these factors, where $T_{21}$ and $T_{32}$ substitute the factor time ($T_1$, $T_2$, $T_3$), indicating the transitions between the sessions. The MDPDE seems quite stable with respect to the corresponding p-values, while those computed using lmer and the CVFS-estimator show large variations.

Conclusions
In this paper, we have developed an estimator based on the density power divergence to deal with the robustness issues in the linear mixed model setup. We demonstrated that this MDPDE satisfies the desirable properties of an estimator, such as consistency and asymptotic normality. In order to assess the robustness properties, the influence function and sensitivity measures of the estimator were computed. We found that the estimator is B-robust for $\alpha > 0$. From a practical point of view, the choice of the value of the tuning parameter $\alpha$ is fundamental in applications. The behaviour of the sensitivity measures suggested two optimal values, denoted by $\alpha^*$ and $\bar{\alpha}$, depending on the dimension $p$, where the term "optimal" is in the sense of providing minimum sensitivity, and thus producing maximum robustness. The existence of such values is in contrast to the previous knowledge about the parameter $\alpha$. Indeed, it was shown that when $\alpha$ continues to increase beyond a certain value, we lose both robustness and efficiency.
The simulation study confirmed how the performance of the minimum density power divergence estimator changes with respect to $\alpha$. Furthermore, the MDPDE outperforms the competitor estimators; indeed, our approach leads to more resistant estimators in the presence of case-wise contamination. Finally, the application of our estimator to a real-life data set indicated that the MDPDE produces results similar to the classical maximum likelihood estimator, with the advantage of being resistant to the presence of a few random cell-wise outliers.
We feel that many important extensions of this work are necessary and can be potentially useful. So far, the MDPDE has been implemented only for balanced data (although the theory that we have developed is perfectly general). In future work, we propose to extend the implementation to the more general case of groups with possibly different dimensions. Also, since the linear mixed models considered here are based on normality assumptions, it would be useful to extend the application of the MDPDE to the larger class of generalized linear mixed models. The problem of testing of hypotheses also deserves a deeper look in the linear mixed models scenario.

In Section SM-3 we provide the details of the computation of the sensitivity measures in case of balanced data. In Section SM-4 we report the remaining plots of the theoretical quantities. Finally, further results obtained from the Monte Carlo experiments are presented in Section SM-5, whereas Section SM-6 contains complete results from the study of the real-data example.

(A6) For all $j, k$, the corresponding limits hold, where $I(B)$ denotes the indicator variable of the event $B$.
(A7) For all $\epsilon > 0$, we have
$$\lim_{n\to\infty}\left\{\frac{1}{n}\sum_{i=1}^{n} E\left[\big\|\Omega_n^{-1/2}\nabla V_i(X_i;\theta)\big\|^2\, I\big(\big\|\Omega_n^{-1/2}\nabla V_i(X_i;\theta)\big\| > \epsilon\sqrt{n}\big)\right]\right\} = 0. \qquad (3)$$

Ghosh and Basu [2013] proved that, under Assumptions (A1)-(A7), the MDPDE $\hat\theta$ is consistent for $\theta$ and asymptotically normal. In particular, the asymptotic distribution of $\Omega_n^{-1/2}\Psi_n\big[\sqrt{n}\,(\hat\theta-\theta)\big]$ is $p$-dimensional normal with (vector) mean $0$ and covariance matrix $I_p$, the $p$-dimensional identity matrix.
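To make the estimation principle concrete, the following is a minimal numerical sketch of minimizing the density power divergence objective, for the much simpler case of i.i.d. univariate normal data rather than the LMM treated in the paper. The contamination scheme, function names, and choice $\alpha = 0.5$ are illustrative assumptions, not the implementation used here.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def mdpde_objective(theta, x, alpha):
    """Empirical DPD objective for a N(mu, sigma^2) model, alpha > 0."""
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)  # enforce sigma > 0
    # Closed form of the integral of f_theta^(1+alpha) for a normal density
    integral = (2 * np.pi * sigma**2) ** (-alpha / 2) / np.sqrt(1 + alpha)
    dens = norm.pdf(x, loc=mu, scale=sigma)
    return integral - (1 + 1 / alpha) * np.mean(dens**alpha)

rng = np.random.default_rng(0)
# 95% of the data from N(0, 1), 5% case-wise outliers near 10
x = np.concatenate([rng.normal(0.0, 1.0, 95), rng.normal(10.0, 1.0, 5)])

fit = minimize(mdpde_objective, x0=[np.median(x), 0.0],
               args=(x, 0.5), method="Nelder-Mead")
mu_hat, sigma_hat = fit.x[0], np.exp(fit.x[1])
# The sample mean (the MLE) is dragged toward the outliers,
# while the MDPDE location estimate stays near 0
print(x.mean(), mu_hat, sigma_hat)
```

The exponential downweighting of low-density points, visible in the `dens**alpha` term, is exactly what makes the estimator resistant to gross outliers while sacrificing little efficiency for small $\alpha$.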

SM-2 Proof of Theorem 1
First, let us note that, under the setup of the linear mixed models introduced in Section 3, the matrices $\Omega_n$ and $\Psi_n$ defined in Section SM-1 simplify to the forms given in Subsection 3.1 of the main paper and involved in Theorem 1. Also, the LMM setup is a special case of the general setup of independent non-homogeneous observations, and we have assumed that the true data generating density belongs to the model family. Then, the proof of our Theorem 1 is immediate from the results stated by Ghosh and Basu [2013], provided we can show that the general Assumptions (A1)-(A7) are satisfied under our assumed Conditions (MM1)-(MM4) for the special case of LMMs. Now, the assumption that the true data generating density belongs to the model family, together with the fact that the model density is normal with mean $X_i\beta$ and variance matrix $V_i$, ensures that Assumptions (A1)-(A3) are directly satisfied. Also, Assumption (A4) follows from Condition (MM1).
Next, we prove Equation (1) of Assumption (A6). For any $j = 1, \ldots, p$, consider the $j$-th partial derivative with respect to $\beta$; by Assumption (MM2), and since $\sup_{n>1} \eta_{i\alpha}$ is bounded thanks to the boundedness of $|V_i|$ in (MM3), the Dominated Convergence Theorem (DCT) gives the required limit, and this holds for all $j = 1, \ldots, p$. On the other hand, consider the partial derivative with respect to $\sigma^2_j$, $j = 0, 1, \ldots, r$, given in Equation (7). The expectation in the first two terms goes to zero as $N \to \infty$ by the DCT, as before, and the sums are bounded by Equation (11) in Condition (MM3). This holds for all $j = 0, \ldots, r$.

Gross Error Sensitivity
Recall the definition of the gross error sensitivity given in Ghosh and Basu [2013]. Considering the influence function of the functional $T^{\beta}_\alpha$, we have to find the supremum of the function $A Z\, e^{-\frac{Z^\top Z}{2}}$ with respect to $Z$. We compute the derivative with respect to $Z$ and, finally, multiplying the resulting equation by $Z^\top$, we obtain the stationarity condition that characterizes the solution $Z$.
We know that $\sup_{z} \frac{z^\top A z}{z^\top z} = \lambda_{\max}(A)$, where $\lambda_{\max}(A)$ is the maximum eigenvalue of the matrix $A$. Using this general result in our case, we obtain the expression for the gross error sensitivity. Special Case: when $n_i = p$ for all $i = 1, \ldots, n$, the matrix $X^\top X$ can be rewritten in a simpler form; substituting this simpler form into the formula of the gross error sensitivity we get Equation (21) of the main paper.
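The Rayleigh-quotient identity used above is easy to verify numerically. The sketch below uses an arbitrary symmetric positive semi-definite matrix, not the specific matrix appearing in the sensitivity formula:

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(4, 4))
A = B @ B.T  # arbitrary symmetric positive semi-definite matrix

# Rayleigh quotient z'Az / z'z evaluated at many random directions z
Z = rng.normal(size=(100_000, 4))
rayleigh = np.einsum("ij,jk,ik->i", Z, A, Z) / np.einsum("ij,ij->i", Z, Z)

w, V = np.linalg.eigh(A)  # eigenvalues in ascending order, orthonormal eigenvectors
lam_max, v_max = w[-1], V[:, -1]

# The quotient never exceeds lambda_max, and it is attained at the top eigenvector
print(rayleigh.max(), lam_max, v_max @ A @ v_max / (v_max @ v_max))
```

The supremum is attained exactly along the leading eigenvector, which is why the gross error sensitivity reduces to a maximum-eigenvalue computation.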

Self-Standardized Sensitivity
Starting from the definition of the self-standardized sensitivity, and writing $Z$ for a standardized version of $(t - X_{i_0}\beta)$, the quantity to be maximized takes the form $Z^\top Z\, e^{-\frac{1}{2}Z^\top Z}$. In order to find this supremum, we compute the derivative with respect to $Z$; a solution is given by $Z$ such that $Z^\top Z = 2$, that is, $Z = \sqrt{2}\, k/\|k\|$ with $k \in \mathbb{R}^{n_{i_0}}$. Note that we multiplied by $Z^\top$ in the same way as was done for the gross error sensitivity. Hence we obtain the expression corresponding to Equation (19). Special Case: when $n_i = p$ for all $i = 1, \ldots, n$, the matrix $X^{*\top} X^*$ can be rewritten in a simpler form; substituting this simpler form into the formula of the self-standardized sensitivity we get Equation (22).
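The stationary point $Z^\top Z = 2$ found above can be checked with a one-dimensional numerical optimization. This sketch assumes the radial profile $g(t) = t\,e^{-t/2}$, $t = Z^\top Z \ge 0$, which is consistent with that stationarity condition; its maximizer is $t = 2$ with maximum value $2/e$:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# g(t) = t * exp(-t/2); g'(t) = exp(-t/2) * (1 - t/2) = 0  =>  t = 2
g = lambda t: t * np.exp(-t / 2)
res = minimize_scalar(lambda t: -g(t), bounds=(0.0, 50.0), method="bounded")

print(res.x, g(res.x))  # maximizer near 2, maximum near 2/e
```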

SM-4 Additional Numerical Results
In this section we provide the plots of the Asymptotic Relative Efficiency and of the influence functions, corresponding to the example in Section 3.3 of the main paper.
Figure 1 shows the Asymptotic Relative Efficiency with respect to $\alpha$ for $\beta_0$, $\sigma^2_0$ and $\sigma^2_1$, respectively. In Figure 2 we show the influence functions for the functionals $T^{\beta_0}_\alpha$ and $T^{\sigma^2_1}_\alpha$.

SM-5 Additional Results from Monte Carlo experiment
Here, we report the complete results of the Monte Carlo experiments with respect to the contamination levels considered.
The MSMD and MKLD performances of the CVFS- and SMDM-estimators and of the MDPDE for different values of $\alpha$ with 5% of contamination are presented in Figures 4a and 4b, respectively.
Figures 5a and 5b display the MSMD and MKLD performances of the CVFS- and SMDM-estimators and of the MDPDE for different values of $\alpha$ in the case of 10% of contamination, while Figures 6a and 6b show the corresponding MSMD and MKLD performances in the case of 15% of outlier contamination.
Tables 1 and 2 report the maximum values of MSMD and MKLD of the CVFS- and SMDM-estimators and of the MDPDE for different values of $\alpha$ in the case of 5% and 15% of contamination, respectively.

SM-6 Results from real-data example
In this section, we report the complete results from the study of the real-data example about Extrafoveal Vision Acuity presented in Section 5.
Table 3 shows the estimates of the model parameters obtained using the lmer estimator, the SMDM- and CVFS-estimators, and the MDPDE with different values of $\alpha$.

Figure 2: Influence function for the functionals $T^{\beta_0}_\alpha$ (left panel) and $T^{\sigma^2_1}_\alpha$ (right panel), for different values of the tuning parameter $\alpha$.

Figure 3: Gross-error sensitivity (left panel) and self-standardized sensitivity (right panel) of the functional $T^{\beta}_\alpha$ with respect to $i_0 = 10$.
and similarly for $b$ and $c$, while $e = (e_{111}, \ldots, e_{FGH})^\top \sim N_p(0, \sigma^2_e I_p)$. Hence $y$ is a $p$-variate normal with mean $\mu = x\beta$ and variance matrix $\Sigma_0$.

Table 1: Relative efficiency for the SMDM-estimator, the CVFS-estimator and the MDPDE for different values of $\alpha$ with respect to the maximum likelihood estimator, computed by the MSMD for the fixed terms $\beta$ and by the MKLD for the random terms.

Now, we consider the contamination setting. The figures presented in Section SM-5 of the Supplementary Material show the MSMD and the MKLD of the MDPDE for different values of $\alpha$ compared to the CVFS- and SMDM-estimators, as a function of $\omega_0$.

Table 2: Maximum values of MSMD and MKLD for the CVFS- and SMDM-estimators and for the MDPDE at different values of $\alpha$ under 10% of outlier contamination.

Table 1: Maximum values of MSMD and MKLD for the CVFS- and SMDM-estimators and for the MDPDE for different values of $\alpha$ under 5% of outlier contamination.

Table 3: Estimates of model parameters obtained using different estimators.