A new method for obtaining explicit estimators in unbalanced mixed linear models

The general unbalanced mixed linear model with two variance components is considered. Through resampling, it is demonstrated how the fixed effects can be estimated explicitly. The obtained nonlinear estimator is shown to be unbiased, and its variance is derived. A condition is given under which the proposed estimator is recommended instead of the ordinary least squares estimator.


Introduction
Unbalanced data have challenged statisticians for decades. In a simple two-way analysis of variance model with interactions and an unequal number of replicates per cell, which is often referred to as the unbalanced case, it is not at all obvious how to test various hypotheses and estimate effects. For this model, however, it is always possible to construct F-distributed test statistics under standard normality assumptions, although interpreting the results remains a serious issue. Approximate tests in non-normal unbalanced two-way random models have been studied by Güven (2012).
In the case of a two-way analysis of variance model with random effects, the situation becomes even more challenging, since it is not obvious how to set up the test statistic so that inference can rely on the F-distribution. In this context, the idea of sacrificing some power in order to create a balanced data set, proposed by Khuri (1986), Khuri and Littell (1987), Khuri (1990), Gallo and Khuri (1990), Öfversten (1993, 1995), and Christensen (1996), is very appealing.
The above mentioned works were mainly concerned with creating tests for variance components and estimators of fixed effects in the two-way model with interactions with the key idea that one can resample from the residuals to increase variation in order to mimic a balanced model.
In this article we adopt the above-mentioned ideas and focus on the estimation of the fixed effects parameters in a mixed linear model with two variance components. The newly proposed estimator can be considered an alternative to the classical moment estimator, see, e.g., Henderson (1953) or Al Sarraj and von Rosen (2009); in the latter reference improved moment estimators were discussed, although the focus was on variance components estimation.
Throughout the article, vectors and matrices will be denoted by bold lower-case and bold upper-case letters, respectively. C(·) denotes the column vector space and, for any matrix A, P_A denotes the orthogonal projection onto C(A). The notation ∼ N_p(μ, Σ) stands for "distributed as a p-dimensional multivariate normal distribution with mean μ and dispersion matrix Σ". For a random vector (variable), E[·] stands for its expectation, D[·] for the dispersion matrix (variance), and cov[·, ·] for the covariance between two random vectors (variables). Other notation will be defined in the subsequent text as needed.

Model
The main focus in this article is on the explicit estimation of Xβ, the vector of fixed effects, in the following mixed linear model:

y = Xβ + Zγ + ε,  (2.1)

where y is an n × 1 observable random vector, X is a known n × p model matrix with rank(X) = k ≤ p < n, and Z is a known n × m matrix, m < n, such that C(Z) ⊄ C(X).
Here, β ∈ R^p is an unknown, fixed mean parameter. Furthermore, the vector of random effects, γ ∼ N_m(0, σ²_γ I_m), is assumed to be distributed independently of the random errors ε ∼ N_n(0, σ² I_n); the variances σ²_γ ≥ 0 and σ² > 0 are unknown scalar parameters. Note that the model also comprises the important case E[γ] ≠ 0, because the mean of Zγ can then be "moved into" Xβ.
Let ρ = σ²_γ/σ². The covariance matrix of y is then D[y] = σ²(ρZZ' + I_n). The ordinary least squares (OLS) estimator of Xβ is always available, i.e., Xβ̃ = P_X y, but it is known from maximum likelihood theory that, in general, the weighted least squares estimator performs better asymptotically. In model (2.1), explicit closed-form maximum likelihood estimators are available only in special cases. Therefore, the aim of this article is to find an explicit estimator for Xβ which also uses the variation due to the presence of the random effects γ.
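To make the contrast between the two estimators concrete, the following minimal sketch (not from the paper; the one-way design, the value ρ = 5, and all variable names are illustrative assumptions) compares the OLS estimator with the weighted (generalized) least squares estimator that uses V = ρZZ' + I_n:

```python
import numpy as np

# Hypothetical unbalanced one-way design: X = 1_n, Z indicates 3 groups.
n_i = [6, 3, 1]
n, l = sum(n_i), len(n_i)
X = np.ones((n, 1))
Z = np.zeros((n, l))
row = 0
for i, ni in enumerate(n_i):
    Z[row:row + ni, i] = 1.0
    row += ni

rho = 5.0                              # assumed known here, for illustration
V = rho * Z @ Z.T + np.eye(n)          # D[y] = sigma^2 * V with sigma^2 = 1

rng = np.random.default_rng(0)
y = 1.0 + Z @ rng.normal(0.0, np.sqrt(rho), l) + rng.normal(0.0, 1.0, n)

Vinv = np.linalg.inv(V)
b_ols = np.linalg.solve(X.T @ X, X.T @ y)                # P_X y: overall mean
b_gls = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)  # weighted estimator
# With unequal n_i the two estimators weight the groups differently.
print(b_ols[0], b_gls[0])
```

In the balanced case the two weighting schemes coincide; it is the unequal n_i that make the weighted estimator differ from P_X y.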

Main result
Denote l = rank(X : Z); then the dimension of the column space C(X)^⊥ ∩ C(X : Z) equals l − k. Let A be an n × (l − k) matrix such that the column vector space satisfies C(A) = C(X)^⊥ ∩ C(X : Z) = C(P_(X:Z) − P_X) and A'A = I_{l−k}. The matrix A can be obtained through the Gram-Schmidt orthogonalization algorithm.
Denote by B1 an n × k matrix such that B1B1' = P_X and B1'B1 = I_k. Let B2 = A(A'ZZ'A)^{-1/2}, and let B3 be any n × (n − l) matrix such that C(B3) = C(X : Z)^⊥ with B3'B3 = I_{n−l}. From the assumptions it follows that A'ZZ'A is positive definite and therefore its inverse exists. The matrices B_i, i = 1, 2, 3, satisfy B_i'B_j = 0, i ≠ j. Thus, the one-to-one transformation (B1 : B2 : B3)'y of the model (2.1) yields the equivalent representation through the following three models:

B1'y = B1'Xβ + B1'Zγ + B1'ε,  (3.1)
B2'y = B2'Zγ + B2'ε,  (3.2)
B3'y = B3'ε.  (3.3)

Since B3'y is normally distributed and B3 is orthogonal to (B1 : B2), the vector B3'y is independent of B1'y as well as of B2'y.
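The construction of A and the matrices B_i can be carried out numerically. The following sketch is illustrative, not the authors' code: the `orth` helper and the one-way design are our assumptions, and an SVD is used in place of Gram-Schmidt, which is equivalent for obtaining an orthonormal basis.

```python
import numpy as np

def orth(M, tol=1e-10):
    """Orthonormal basis of the column space C(M), via the SVD."""
    U, s, _ = np.linalg.svd(M)
    return U[:, :int((s > tol).sum())]

# Hypothetical unbalanced one-way design: X = 1_n, Z indicates 3 groups.
n_i = [4, 7, 2]
n = sum(n_i)
X = np.ones((n, 1))
Z = np.zeros((n, len(n_i)))
row = 0
for i, ni in enumerate(n_i):
    Z[row:row + ni, i] = 1.0
    row += ni

B1 = orth(X)                            # B1 B1' = P_X, B1'B1 = I_k
QXZ = orth(np.hstack([X, Z]))           # orthonormal basis of C(X : Z)
k, l = B1.shape[1], QXZ.shape[1]
# A: orthonormal basis of C(X)^perp ∩ C(X:Z) = C(P_(X:Z) - P_X)
A = orth(QXZ @ QXZ.T - B1 @ B1.T)
# B2 = A (A'ZZ'A)^(-1/2), via a symmetric eigendecomposition square root
w, V = np.linalg.eigh(A.T @ Z @ Z.T @ A)    # positive definite by assumption
B2 = A @ V @ np.diag(w ** -0.5) @ V.T
B3 = orth(np.eye(n) - QXZ @ QXZ.T)      # orthonormal basis of C(X : Z)^perp

# Mutual orthogonality of the three blocks
assert np.allclose(B1.T @ B2, 0) and np.allclose(B1.T @ B3, 0)
assert np.allclose(B2.T @ B3, 0)
```

Note that (B1 : B2 : B3) is nonsingular but not orthogonal, since B2'B2 = (A'ZZ'A)^{-1} is in general not the identity.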
In the following, the models (3.1), (3.2), and (3.3) constitute the basis for the derivation of the estimator of Xβ.
From the construction of B1 we have B1B1'y = P_X y = Xβ̃, the OLS estimator of Xβ, which is functionally independent of the variance components. However, the random effects vector γ is included in both (3.1) and (3.2), and thus, for an alternative to the OLS estimator, i.e., a weighted estimator, (3.2) contains essential information about Xβ. Notice that D[B2'y] = σ²_γ I_{l−k} + σ² B2'B2, which implies that the components in B2'y are correlated, and this can cause technical inference problems: if, for example, B1'y is conditioned on B2'y, the inverse of D[B2'y] is needed and may be difficult to handle. Now one can add an extra random term to B2'y, i.e., add extra variation, in such a way that the dispersion matrix of the resulting random vector becomes diagonal; its inverse is then easy to utilize. This is a key idea of this article. The main remaining issue is how to add the necessary portion of proper variation, since σ²_γ and σ² are both unknown. The problem can be resolved by adopting the ideas of Gallo, Khuri, Öfversten (see the Introduction for references), and others, who performed a kind of abstract bootstrapping. Implementing their ideas means taking the important step of adding observable random variables from (3.3) to (3.2) so that eventually a diagonal dispersion matrix is obtained. Let the new random vector u2 be defined by

u2 = B2'y + (λI_{l−k} − B2'B2)^{1/2} sel(B3'y, l − k).  (3.4)

We have introduced here a selection operator, sel(B3'y, l − k), which selects l − k independent observations from B3'y ∼ N_{n−l}(0, σ²I). This selection may be represented by, e.g., an (l − k) × (n − l) matrix R, whose rows are arbitrarily chosen l − k rows of I_{n−l}. Obviously, RR' = I_{l−k}. One possible version of (3.4) is then

u2 = B2'y + (λI_{l−k} − B2'B2)^{1/2} R B3'y.  (3.5)

The scalar λ is chosen large enough so that the matrix λI_{l−k} − B2'B2 is nonnegative definite and hence the square root in (3.4) exists. There exist several matrix square roots, but the different choices do not affect the results in any way.
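The resampling step (3.4)–(3.5) can be sketched as follows, under the same illustrative one-way design as above (the seed, the parameter values, and the choice of the first l − k rows for R are our assumptions; the text allows any such selection):

```python
import numpy as np

def orth(M, tol=1e-10):
    U, s, _ = np.linalg.svd(M)
    return U[:, :int((s > tol).sum())]

# Illustrative unbalanced one-way design: X = 1_n, Z indicates 3 groups.
n_i = [4, 7, 2]
n = sum(n_i)
X = np.ones((n, 1))
Z = np.zeros((n, len(n_i)))
row = 0
for i, ni in enumerate(n_i):
    Z[row:row + ni, i] = 1.0
    row += ni

B1 = orth(X)
QXZ = orth(np.hstack([X, Z]))
k, l = B1.shape[1], QXZ.shape[1]
A = orth(QXZ @ QXZ.T - B1 @ B1.T)
w, V = np.linalg.eigh(A.T @ Z @ Z.T @ A)
B2 = A @ V @ np.diag(w ** -0.5) @ V.T
B3 = orth(np.eye(n) - QXZ @ QXZ.T)

# One simulated draw of y with beta = 2, rho = 5, sigma^2 = 1 (all assumed)
rng = np.random.default_rng(1)
y = 2.0 + Z @ rng.normal(0, np.sqrt(5.0), len(n_i)) + rng.normal(0, 1.0, n)

G = B2.T @ B2
lam = float(np.linalg.eigvalsh(G).max())   # smallest lambda keeping the root real
R = np.eye(n - l)[: l - k]                 # sel(.): here, the first l - k rows
w2, V2 = np.linalg.eigh(lam * np.eye(l - k) - G)
sq = V2 @ np.diag(np.sqrt(np.clip(w2, 0, None))) @ V2.T   # symmetric square root
u2 = B2.T @ y + sq @ R @ (B3.T @ y)        # (3.5): now D[u2] is diagonal
```

Adding the resampled term σ²(λI − B2'B2) to D[B2'y] = σ²_γ I + σ²B2'B2 is exactly what makes the dispersion of u2 proportional to the identity.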
Theorem 3.1  Let u2 be given by (3.5). Then u2 ∼ N_{l−k}(0, (σ²_γ + λσ²) I_{l−k}).

Proof  Since u2 is a linear combination of the two independent, normally distributed random vectors B2'y and B3'y, it is also normally distributed. Thus, only the first two moments of u2 have to be determined. For the expectation it follows immediately that E[u2] = B2'Xβ = 0, since C(B2) ⊆ C(X)^⊥, and for the dispersion

D[u2] = σ²_γ I_{l−k} + σ² B2'B2 + σ²(λI_{l−k} − B2'B2) = (σ²_γ + λσ²) I_{l−k}.

It is clear that if u2 is to be used, the parameter λ should be chosen as small as possible (with respect to the existence of the square root), because the dispersion of u2 is proportional to λ. Thus, a natural choice for λ is the largest eigenvalue of B2'B2 = (A'ZZ'A)^{-1}. Note that all components in u2 are mutually independent, which indeed is a very convenient property. Moreover, denote u1 = B1'y. Then, because of the independence between u1 and B3'y,

cov[u1, u2] = cov[B1'y, B2'y] = σ²_γ B1'ZZ'B2.

Conditioning u1 on u2, we obtain

E[u1|u2] = B1'Xβ + ρ(ρ + λ)^{-1} B1'ZZ'B2 u2.

Thus, if the ratio of the variances ρ = σ²_γ/σ² is known, and because the distribution of u2 does not depend on β, the "natural" estimator of Xβ is given by

Xβ̂ = P_X y − ρ(ρ + λ)^{-1} B1B1'ZZ'B2 u2.

In general the variances, or their ratio, are not known, and hence the ratio has to be estimated. Denote the variables remaining in (3.3) after the selection by u3, so that u3 ∼ N_{n−2l+k}(0, σ² I). Because u2 and u3 are independently distributed, it immediately follows that E[u3'u3/(n − 2l + k)] = σ² and E[u2'u2/(l − k)] = σ²_γ + λσ², and thus also Xβ can be estimated.
Proposition 3.2  For the model in (2.1), let u_i, i = 1, 2, 3, and B_j, j = 1, 2, be defined as in the text preceding this proposition. Then, if n − 2l + k > 0 and l − k − 2 > 0,

Xβ̂ = P_X y − f B1B1'ZZ'B2 u2,  f = 1 − λ(l − k) u3'u3/((n − 2l + k) u2'u2).  (3.11)

The proposed estimator of Xβ is a nonlinear estimator, since it is nonlinear in u2 and u3. The main objection against this estimator might be that the choice of l − k components from B3'y is arbitrary. Notice, however, that the distributions of u2 and u3 do not depend on the choice of R in (3.5). Some kind of U-statistic approach might circumvent this type of arbitrariness, but this will not be explored in this article.
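The plug-in estimator can be sketched numerically. This is illustrative only: the explicit weight f used below, f = 1 − λ(l − k)u3'u3/((n − 2l + k)u2'u2), is an assumed plug-in form obtained by replacing σ² and σ²_γ + λσ² in the known-ρ estimator by their moment estimators, not a formula quoted from the proposition, and the design and parameter values are our choices.

```python
import numpy as np

def orth(M, tol=1e-10):
    U, s, _ = np.linalg.svd(M)
    return U[:, :int((s > tol).sum())]

# Illustrative one-way design; beta = 2, rho = 10, sigma^2 = 1 are assumptions.
n_i = [15, 8, 4]
n = sum(n_i)
X = np.ones((n, 1))
Z = np.zeros((n, len(n_i)))
row = 0
for i, ni in enumerate(n_i):
    Z[row:row + ni, i] = 1.0
    row += ni

B1 = orth(X)
QXZ = orth(np.hstack([X, Z]))
k, l = B1.shape[1], QXZ.shape[1]
A = orth(QXZ @ QXZ.T - B1 @ B1.T)
w, V = np.linalg.eigh(A.T @ Z @ Z.T @ A)
B2 = A @ V @ np.diag(w ** -0.5) @ V.T
B3 = orth(np.eye(n) - QXZ @ QXZ.T)
G = B2.T @ B2
lam = float(np.linalg.eigvalsh(G).max())
w2, V2 = np.linalg.eigh(lam * np.eye(l - k) - G)
sq = V2 @ np.diag(np.sqrt(np.clip(w2, 0, None))) @ V2.T

rng = np.random.default_rng(2)
y = 2.0 + Z @ rng.normal(0, np.sqrt(10.0), len(n_i)) + rng.normal(0, 1.0, n)

u1 = B1.T @ y
R = np.eye(n - l)[: l - k]
u2 = B2.T @ y + sq @ R @ (B3.T @ y)
u3 = np.delete(B3.T @ y, range(l - k))     # the n - 2l + k remaining variables

sigma2_hat = (u3 @ u3) / (n - 2 * l + k)   # u3'u3 / sigma^2 ~ chi^2(n - 2l + k)
# Assumed plug-in weight replacing rho/(rho + lam) by moment estimators:
f = 1.0 - lam * (l - k) * (u3 @ u3) / ((n - 2 * l + k) * (u2 @ u2))
Xbeta_hat = B1 @ u1 - f * (B1 @ B1.T @ Z @ Z.T @ B2 @ u2)
print(float(y.mean()), float(Xbeta_hat[0]))   # OLS vs. the new estimate
```

Since the correction term lies in C(X) = C(1_n), all components of the resulting vector are equal, as they must be for an estimator of Xβ in this design.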

E[Xβ̂] and D[Xβ̂]
In this section, E[Xβ̂] and D[Xβ̂] will be studied, where Xβ̂ was given in Proposition 3.2. The calculations are somewhat lengthy but fairly straightforward. The following lemma is needed.
Lemma 4.1  Let u2 be given by (3.5). Then
(i) E[u2u2'(u2'u2)^{-1}] = (l − k)^{-1} I_{l−k};
(ii) E[u2u2'(u2'u2)^{-2}] = ((σ²_γ + λσ²)(l − k)(l − k − 2))^{-1} I_{l−k}.

Proof  Let Γ be an arbitrary orthogonal matrix. Since u2 has the same distribution as Γu2, it follows that E[u2u2'(u2'u2)^{-1}] = c1 I_{l−k} and E[u2u2'(u2'u2)^{-2}] = c2 I_{l−k} for some constants c1 and c2. Applying the trace function to these relations yields c1 = (l − k)^{-1} and c2 = ((σ²_γ + λσ²)(l − k)(l − k − 2))^{-1}, where (3.9) was utilized (see, e.g., Kollo and von Rosen 2005, Lemma 2.4.1, for a discussion).
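Both expectations in the lemma can be checked by Monte Carlo, using only the fact that u2 is spherically distributed: for u2 ∼ N_m(0, τI_m), symmetry together with the χ² moments gives E[u2u2'(u2'u2)^{-1}] = m^{-1} I_m and E[u2u2'(u2'u2)^{-2}] = (τm(m − 2))^{-1} I_m. A minimal sketch (m, τ, and the replicate count are arbitrary choices; m > 4 keeps the variance of the second estimator finite):

```python
import numpy as np

rng = np.random.default_rng(0)
m, tau, N = 6, 2.5, 100_000
U = rng.normal(0.0, np.sqrt(tau), size=(N, m))   # N draws of u2 ~ N_m(0, tau I)
ss = (U ** 2).sum(axis=1)                        # u2'u2 for each draw

outer = U[:, :, None] * U[:, None, :]            # u2 u2' for each draw
est_i = (outer / ss[:, None, None]).mean(axis=0)
est_ii = (outer / (ss ** 2)[:, None, None]).mean(axis=0)

print(np.abs(est_i - np.eye(m) / m).max())                     # near 0
print(np.abs(est_ii - np.eye(m) / (tau * m * (m - 2))).max())  # near 0
```

The first identity is scale-free, while the second depends on τ through the inverse-χ² moment, mirroring the roles of (i) and (ii) in the variance calculation below.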
The following moment relations will be used in the subsequent calculations:

E[u3'u3] = (n − 2l + k)σ²,  E[(u3'u3)²] = (n − 2l + k)(n − 2l + k + 2)σ⁴.  (4.5)

Here we have used that u3'u3/σ² ∼ χ²(n − 2l + k).
Since E[u2(u2'u2)^{-1}] = 0, it follows from Proposition 3.2 that, for all β ∈ R^p,

E[Xβ̂] = Xβ.  (4.6)

Furthermore, from Proposition 3.2 it follows that

D[Xβ̂] = D[P_X y] − cov[P_X y, f B1B1'ZZ'B2u2] − cov[f B1B1'ZZ'B2u2, P_X y] + D[f B1B1'ZZ'B2u2],  (4.7)

where f is as in Proposition 3.2. Hence, using (4.5) and Lemma 4.1(i) one obtains (4.8), and using (4.5) and Lemma 4.1(ii) one obtains (4.9); together these yield D[f B1B1'ZZ'B2u2]. Next, cov[f B1B1'ZZ'B2u2, u1] is calculated via conditioning u1 on u2, which gives (4.10); here E_y[·] indicates that the expectation is taken with respect to the distribution of y. After substituting (4.8), (4.9), and (4.10) into (4.7), we obtain an explicit expression for D[Xβ̂], given in (4.11). The results obtained above are summarized in the next theorem.
Theorem 4.2  Let Xβ̂ be given by (3.11) and let Xβ̃ = P_X y be the OLS estimator of Xβ. Then E[Xβ̂] = Xβ, and D[Xβ̂] is given by (4.11).
Comparison with the ordinary least squares estimator

Theorem 5.1  Let Xβ̂ be given by (3.11). Then, for every estimable function g'β, D[g'β̂] < D[g'β̃] if and only if ρ > λ.

Proof  Manipulating (5.1) yields that Xβ̂ has a smaller dispersion matrix if and only if a certain inequality in ρ is satisfied; the statement of the theorem is the solution to this inequality.
From Theorem 5.1 it follows that the inequality ρ > λ is a simple criterion for deciding whether Xβ̂ should be used instead of the OLS estimator, although ρ has to be estimated.

Example: unbalanced one-way random model
A special case of model (2.1) is the one-way random effects model

y_ij = β + γ_i + ε_ij,  i = 1, . . . , l, j = 1, . . . , n_i,  (6.1)

with n = Σ_{i=1}^{l} n_i. The expectation β ∈ R is our parameter of interest. All distributional assumptions of model (2.1) are assumed to hold. Here k = 1, the matrix X is an n-dimensional column vector of ones, X = 1_n, and the matrix Z is a block-diagonal matrix of column vectors of ones of length n_i, Z = Diag{1_{n_i}}. The OLS estimator of β is ȳ = n^{-1/2}u1, the average of all n observations. It is well known that the maximum likelihood estimator (MLE) of β in an unbalanced model has to be obtained iteratively. Note that C(X) ⊆ C(Z) and thus l = rank(Z).
We illustrate our procedure using a small simulation study for model (6.1). A broad range of values for the ratio of the variance components ρ is considered. The number of levels l of γ is chosen as small as l = 3, with a relatively high number of observations n_i per level, and as relatively high, l = 10, with smaller numbers n_i. Because of proportionality, we considered σ² = 1 only.

Fig. 2  Histogram of 10,000 estimates of β for Configuration 2, for ρ = 5, illustrating the symmetry of the distribution of β̂
All configurations are presented in Table 1. For each configuration, 10,000 simulations were carried out. The OLS estimates as well as the β̂ estimates are presented as averages of the 10,000 observed estimates. In addition, the estimates of σ² and of σ²_γ are presented. The observed MSEs of the OLS estimators as well as of the β̂s are also included. The results are presented in Tables 2 and 3. In Sect. 5, it was shown that the newly proposed estimator has a smaller dispersion if ρ > λ = 7.175, which is in complete agreement with Table 2, where for ρ = 10 or ρ = 20 the simulations indicate that the new estimator has a smaller dispersion than the OLS estimator, but not for ρ = 5. The results for Configuration 2 are presented in Table 3. Using Table 3 and Theorem 5.1, it follows that ρ should be larger than 1.97 if β̂ is to be applied instead of the OLS estimator. This strategy is supported by the results of the simulations presented in Table 3. The tables indicate that an even smaller ρ can be used, but one has to remember that estimated variances and MSEs are applied; in particular, σ̂²_γ can become negative (see Figure 1). However, we can state with confidence that the new estimator is better than the least squares estimator in certain regions of the parameter space (described through ρ), as shown in Theorem 5.1 and in the simulations. In addition, we would like to point out that, in spite of β̂ being a nonlinear estimator, it is unbiased, as shown in (4.6), and, as we observed in our simulations, its distribution appears to be symmetric around its expectation. For an illustration, see the histograms of β̂ and of the OLS estimates in Figure 2 and Figure 3, respectively.
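The simulation study can be reproduced in outline with the following sketch. It is illustrative only: the design n_i, the number of replicates, the ρ grid, and the plug-in weight f (assumed here to be f = 1 − λ(l − k)u3'u3/((n − 2l + k)u2'u2), a moment-based reconstruction rather than a quoted formula) are all our choices, not the paper's configurations.

```python
import numpy as np

def orth(M, tol=1e-10):
    U, s, _ = np.linalg.svd(M)
    return U[:, :int((s > tol).sum())]

# Hypothetical unbalanced one-way design with l = 10 groups
n_i = [8, 6, 5, 5, 4, 4, 3, 3, 2, 2]
n, beta, sigma2 = sum(n_i), 2.0, 1.0
X = np.ones((n, 1))
Z = np.zeros((n, len(n_i)))
row = 0
for i, ni in enumerate(n_i):
    Z[row:row + ni, i] = 1.0
    row += ni

B1 = orth(X)
QXZ = orth(np.hstack([X, Z]))
k, l = B1.shape[1], QXZ.shape[1]
A = orth(QXZ @ QXZ.T - B1 @ B1.T)
w, V = np.linalg.eigh(A.T @ Z @ Z.T @ A)
B2 = A @ V @ np.diag(w ** -0.5) @ V.T
B3 = orth(np.eye(n) - QXZ @ QXZ.T)
G = B2.T @ B2
lam = float(np.linalg.eigvalsh(G).max())        # smallest admissible lambda
w2, V2 = np.linalg.eigh(lam * np.eye(l - k) - G)
sq = V2 @ np.diag(np.sqrt(np.clip(w2, 0, None))) @ V2.T
R = np.eye(n - l)[: l - k]                      # arbitrary selection of rows
C = B1 @ B1.T @ Z @ Z.T @ B2                    # fixed part of the correction

rng = np.random.default_rng(4)
results = {}
for rho in (1.0, lam, 4.0 * lam):
    mse_ols = mse_new = 0.0
    N = 2000
    for _ in range(N):
        y = beta + Z @ rng.normal(0, np.sqrt(rho * sigma2), len(n_i)) \
                 + rng.normal(0, np.sqrt(sigma2), n)
        u2 = B2.T @ y + sq @ R @ (B3.T @ y)
        u3 = np.delete(B3.T @ y, range(l - k))
        f = 1.0 - lam * (l - k) * (u3 @ u3) / ((n - 2 * l + k) * (u2 @ u2))
        bhat = (B1 @ (B1.T @ y) - f * (C @ u2))[0]
        mse_ols += (y.mean() - beta) ** 2
        mse_new += (bhat - beta) ** 2
    results[rho] = (mse_ols / N, mse_new / N)
    print(f"rho={rho:7.3f}  MSE(OLS)={mse_ols/N:.4f}  MSE(new)={mse_new/N:.4f}")
```

By Theorem 5.1, one expects the new estimator to be preferable for ρ clearly above λ and the OLS estimator below it; the exact crossover in finite samples also reflects the fact that ρ is estimated.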