Fisher Scoring for crossed factor linear mixed models

The analysis of longitudinal, heterogeneous or unbalanced clustered data is of primary importance to a wide range of applications. The linear mixed model (LMM) is a popular and flexible extension of the linear model specifically designed for such purposes. Historically, a large proportion of material published on the LMM concerns the application of popular numerical optimization algorithms, such as Newton–Raphson, Fisher Scoring and expectation maximization, to single-factor LMMs (i.e. LMMs that only contain one "factor" by which observations are grouped). However, in recent years, the focus of the LMM literature has moved towards the development of estimation and inference methods for more complex, multi-factored designs. In this paper, we present and derive new expressions for the extension of an algorithm classically used for single-factor LMM parameter estimation, Fisher Scoring, to multiple, crossed-factor designs. Through simulation and real data examples, we compare five variants of the Fisher Scoring algorithm with one another, as well as against a baseline established by the R package lme4, and find evidence of correctness and strong computational efficiency for four of the five proposed approaches. Additionally, we provide a new method for LMM Satterthwaite degrees of freedom estimation based on analytical results, which does not require iterative gradient estimation. Via simulation, we find that this approach produces estimates with both lower bias and lower variance than the existing methods. Supplementary Information: The online version contains supplementary material available at 10.1007/s11222-021-10026-6.

Table: Parameters chosen as baseline truth for the parameter estimation simulations. Entries marked '−' indicate that the design for the corresponding simulation setting did not include the specified parameter.

S2 MAE and MRD definition
Here, we formally define the Mean Absolute Error (MAE) and Mean Relative Difference (MRD), which were used in the main text as performance metrics for between-method comparison. For a model parameter γ, these were defined as:

MAE(γ) = (1/n_sim) Σ_{i=1}^{n_sim} ||γ̂_i − γ̃_i||_1,    MRD(γ) = (1/n_sim) Σ_{i=1}^{n_sim} 2||γ̂_i − γ̃_i|| / (||γ̂_i|| + ||γ̃_i||),

where n_sim is the total number of simulations run, γ̂_i is an estimate of the parameter vector γ obtained during the i-th simulation using the parameter estimation method of interest, and γ̃_i is the baseline truth to which γ̂_i is to be compared.
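As a concrete illustration, the following sketch evaluates both metrics for a collection of simulated estimates under the definitions above (the array layout and the use of the Euclidean norm in the MRD are our own illustrative assumptions):

```python
import numpy as np

def mae_mrd(estimates, truths):
    """Mean Absolute Error and Mean Relative Difference across simulations.

    estimates, truths: arrays of shape (n_sim, dim), one row per simulation.
    """
    diffs = estimates - truths
    mae = np.abs(diffs).sum(axis=1).mean()      # mean L1 error per simulation
    num = 2 * np.linalg.norm(diffs, axis=1)     # 2 * ||estimate - truth||
    den = (np.linalg.norm(estimates, axis=1)
           + np.linalg.norm(truths, axis=1))
    return mae, (num / den).mean()
```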
The remaining entries of the index of notation are as follows:

- A_{(k,j)}: the columns of the matrix A which correspond to random effects grouped into the j-th level of the k-th factor.
- θ^h, θ^f, θ^c: the half, full and Cholesky representations of θ, respectively.
- I^h_{a,b}, I^f_{a,b}, I^c_{a,b}: sub-matrix of the Fisher Information matrix. The superscript (h, f or c) specifies the half, full or Cholesky representation, respectively. The subscript specifies the parameter vectors, a and b, to which the sub-matrix corresponds. The notation I_a is shorthand for I_{a,a}.
- 0_{m×n}: matrix of zeros with dimensions (m × n). Section 2.1.1, equation (9).
- ⊗: Kronecker product. Section 2.1.1, equation (10).
- ∂ (e.g. ∂X/∂Y): partial derivative operator, i.e. the matrix derivative operator which does not account for dependencies between elements of a matrix. For example, ∂ does not account for the equality between reflected elements of a symmetric matrix (see Supplementary Material S12.1). Section 2.1.2, equation (11).
- F(θ^f): the matrix used in the place of the Fisher Information matrix for the Pseudo-Inverse Fisher Scoring approach. Section 2.1.2, equation (14).
- F_{a,b}, F_a: F_{a,b} is the sub-matrix of F corresponding to parameter vectors a and b. F_a is shorthand for F_{a,a}.
- σ²_a, σ²_c, σ²_e: the additive genetic, common environmental and residual error variances in the ACE model. σ²_e is equivalent to the notation σ² employed in previous sections.
- K^a_k, K^c_k: kinship matrices for the additive genetic and common environmental variance components, respectively (see Supplementary Material S16.1). Section 4.1.2.

Table S9: Index of notation. Listed above are notational conventions used throughout the paper, "Fisher Scoring for crossed factor Linear Mixed Models". Notation which is introduced in the appendices is not included.

S11 Ensuring non-negative definiteness
In order to ensure numerical optimization produces a valid covariance matrix when performing LMM parameter estimation, optimization must be constrained so that, for each factor f_k, D_k is non-negative definite. This is a well-documented obstacle for numerical approaches to LMM parameter estimation, for which many solutions have been suggested. In this section, we describe how non-negative definiteness was ensured for the covariance matrix estimates produced by the Fisher Scoring algorithms described in the main text. A common approach, utilized in Bates et al. [2015] for example, is to reparameterize optimization in terms of the Cholesky factor of the covariance matrix, Λ_k, as opposed to the covariance matrix itself, D_k. This approach is adopted by the CSFS method, described in Section 2.1.5. However, the methods described in Sections 2.1.1-2.1.4 require an alternative approach, as optimization does not account for the constraint that D_k must be non-negative definite. For the FS, FFS, SFS and FSFS algorithms, we use an approach described by Demidenko [2013] for the single-factor LMM. Denoting the eigendecomposition of D_k as D_k = U_k Σ_k U_k', for k ∈ {1,...,r}, we project each D_k onto the space of all (q_k × q_k) non-negative definite matrices by, following each update of the covariance matrix D, replacing {D_k}_{k∈{1,...,r}} with {D_{+,k}}_{k∈{1,...,r}}, where D_{+,k} = U_k Σ_{+,k} U_k' and Σ_{+,k} = max(Σ_k, 0), with 'max' representing the element-wise maximum. For each factor f_k, this ensures that D_k, and as a result D, is non-negative definite. We note here that it could be argued that using the eigendecomposition in this manner defeats our objective to employ only simple, vectorizable operations for parameter estimation. In response to this potential criticism, however, we highlight that vectorized eigendecomposition operations are widely available in practice and can be found in many programming languages such as, for example, MatLab and Python.
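A minimal sketch of this projection step, using numpy (function and variable names are our own):

```python
import numpy as np

def project_nnd(D_k):
    """Project a symmetric matrix onto the space of non-negative definite
    matrices by zeroing its negative eigenvalues."""
    eigvals, U = np.linalg.eigh(D_k)         # D_k = U diag(eigvals) U'
    eigvals_plus = np.maximum(eigvals, 0.0)  # element-wise maximum with zero
    return (U * eigvals_plus) @ U.T          # U diag(eigvals_plus) U'
```

Following each Fisher Scoring update of D, the projection would be applied factor-wise, i.e. D_k ← project_nnd(D_k) for each k.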

S12 Matrices and differentiation
In this section, supplementary information is provided for the material found in Appendix 6.1 of the main text. Section S12.1 provides an explicit definition for, and discussion of, the notion of derivative employed throughout the main text. Section S12.2 restates and proves the lemmas employed in the arguments of Appendix 6.1.

S12.1 Derivative definition
Many different notions of matrix derivative currently exist in the statistical literature. In this section, we explicitly state the definition of derivative which is employed throughout "Fisher Scoring for crossed factor Linear Mixed Models". Throughout this work, for arbitrary matrices A and B of dimensions (p × q) and (m × n) respectively, the following definition of matrix (partial) derivative has been implicitly employed:

∂vec(A)/∂vec(B) is the (mn × pq) matrix with (i, j)-th element given by ∂[vec(A)]_j/∂[vec(B)]_i,

with the equivalent definition holding for the total derivative. In addition, for a scalar value, a, and matrix, B, defined as above, we employ the below matrix (partial) derivative definition:

∂a/∂B is the (m × n) matrix with (i, j)-th element given by ∂a/∂B_{[i,j]},

with, again, the equivalent definition holding for the total derivative. The above notation can be useful since, by definition, the two derivatives defined above satisfy the below relation:

∂a/∂vec(B) = vec(∂a/∂B).
The matrix derivative definitions we have used are commonly employed in the mathematical and statistical literature, but are not a universal standard. For example, sources such as Neudecker and Wansbeek [1983] and Magnus and Neudecker [1999] define ∂ vec(A)/∂ vec(B) as the transpose of the definition supplied above. This discrepancy between definitions is noteworthy as some of the referenced results used in the appendices of "Fisher Scoring for crossed factor Linear Mixed Models" are the transpose of those found in the original source. For a full account and in-depth discussion of the different definitions of matrix derivatives used in the literature, we recommend the work of Turkington [2013].

S12.2 Derivative lemmas
In this section, Lemma 1 and Lemma 2, which were employed in Section 6.1 of the main text, are restated as Lemma S1 and Lemma S2 and proven rigorously.
Lemma S1. Let g be a column vector, A be a square matrix and {B_s} be a set of arbitrary matrices of equal size. Let K be an unstructured matrix which none of g, A or any of the {B_s} depend on. Further, assume A, {B_s}, g and K can be multiplied as required, and denote M = A + Σ_s B_s K B_s'. The below matrix derivative expression now holds:

∂(g'M^{-1}g)/∂K = −Σ_s B_s'(M')^{-1} g g'(M')^{-1} B_s.  (S1)

Proof. For clarity and convenience, in this proof we shall briefly depart from the usual notation of H_{[i,j]} for the (i, j)-th element of an arbitrary matrix H and instead employ Ricci calculus notation, in which the (i, j)-th element of H is given by H_{ij} and all summations are made implicit. The (m, n)-th element of the left-hand side of (S1) is now given, using Ricci calculus notation, as:

∂(g_i (M^{-1})_{ij} g_j)/∂K_{mn}.  (S2)

We employ a result from Giles [2008] which states, in Ricci calculus notation:

∂(M^{-1})_{ij}/∂M_{kl} = −(M^{-1})_{ik} (M^{-1})_{lj}.

Applying this result to (S2), together with the chain rule, it can be seen that (S2) is equal to:

−g_i (M^{-1})_{ik} (∂M_{kl}/∂K_{mn}) (M^{-1})_{lj} g_j.

Evaluating the derivative in the above expression, ∂M_{kl}/∂K_{mn} = (B_s)_{km}(B_s)_{ln}, and rearranging the terms gives the below:

−[B_s'(M')^{-1}g]_m [B_s'M^{-1}g]_n.

By applying a transposition to the second factor in the above expression, noting that [B_s'M^{-1}g]_n = [g'(M')^{-1}B_s]_n, the result follows.
Lemma S2. Let A, {B_s} and K be defined as in Lemma S1, and again denote M = A + Σ_t B_t K B_t'. Then the following is true:

∂(log|M|)/∂K = Σ_s B_s'(M')^{-1} B_s.  (S3)

Proof. As in the proof of Lemma S1, the proof will be given in Ricci calculus notation. Expanding the derivative with respect to K_{mn} using the partial derivatives of the elements of M gives the following:

∂(log|M|)/∂K_{mn} = (∂log|M|/∂M_{ij}) (∂M_{ij}/∂K_{mn}).  (S4)

We now make use of the following well-known result, which can be found in Giles [2008]:

∂(log|M|)/∂M_{ij} = (M^{-1})_{ji}.  (S5)

Inserting (S5) into (S4) and noting that A_{ij} does not depend on K_{mn} yields:

∂(log|M|)/∂K_{mn} = (M^{-1})_{ji} ∂(B_t K B_t')_{ij}/∂K_{mn}.

Evaluating the derivative on the right-hand side of the above now returns:

(M^{-1})_{ji} (B_s)_{im} (B_s)_{jn}.

Combining the summation terms in the above expression yields the following:

[B_s'(M')^{-1}B_s]_{mn}.

Finally, converting the above into matrix notation now gives (S3), as desired.
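Both lemmas are straightforward to verify numerically. The sketch below (our own illustrative code, not taken from the paper) compares the analytic derivatives above against finite-difference approximations for random inputs:

```python
import numpy as np

rng = np.random.default_rng(1)
p, q = 4, 3
A = 5 * np.eye(p)
B = [0.3 * rng.normal(size=(p, q)) for _ in range(2)]
K = rng.normal(size=(q, q))          # unstructured, not symmetric
g = rng.normal(size=(p, 1))

M = lambda K: A + sum(Bs @ K @ Bs.T for Bs in B)
Mit = np.linalg.inv(M(K)).T          # transposed inverse of M

# Analytic derivatives, as given by Lemmas S1 and S2:
dS1 = -sum(Bs.T @ Mit @ g @ g.T @ Mit @ Bs for Bs in B)
dS2 = sum(Bs.T @ Mit @ Bs for Bs in B)

# Numerical derivatives, perturbing one element of K at a time:
f1 = lambda K: (g.T @ np.linalg.inv(M(K)) @ g).item()
f2 = lambda K: np.linalg.slogdet(M(K))[1]
eps = 1e-6
num1, num2 = np.zeros((q, q)), np.zeros((q, q))
for m in range(q):
    for n in range(q):
        Kp = K.copy()
        Kp[m, n] += eps
        num1[m, n] = (f1(Kp) - f1(K)) / eps
        num2[m, n] = (f2(Kp) - f2(K)) / eps
assert np.allclose(dS1, num1, atol=1e-4)
assert np.allclose(dS2, num2, atol=1e-4)
```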

S13 Full Fisher Scoring
In this section, we prove that equations (14) and (18) in the main text are valid Fisher Scoring rules of the form given by equation (4). To achieve this, we first require the following lemma.
Lemma S3. Let n, p and k be arbitrary positive integers with p < n. Let Ñ be of dimension (n × n) with rank(Ñ) < n, K̃ be of dimension (n × p) with rank(K̃) = p, C be of dimension (n × n) with rank(C) = n and δ be of size (n × k). If the following properties are satisfied:

1. K̃K̃^+ = Ñ, where K̃^+ = (K̃'K̃)^{-1}K̃' is the generalized inverse of K̃;

2. Ñ' = Ñ and ÑÑ = Ñ;

3. ÑC = CÑ;

4. Ñδ = δ;

then it follows that:

(ÑC)^+ δ = C^{-1} δ.

Proof. In proving the above lemma, we make use of the following result, given by Barnett [1990]:

• If A is a matrix of dimension (m × l) with rank(A) = l < m, and B is a matrix of dimension (l × m) with rank(B) = l, then:

(AB)^+ = B'(BB')^{-1}(A'A)^{-1}A'.  (S6)

By property 1, we have that:

(ÑC)^+ = (K̃K̃^+C)^+.  (S7)

By applying (S6) to the above, with A = K̃ and B = K̃^+C, and simplifying the result, the following is obtained:

(ÑC)^+ = C'(K̃^+)'(K̃^+CC'(K̃^+)')^{-1}(K̃'K̃)^{-1}K̃'.  (S8)

We now consider the inversion in the center of the above expression, i.e. the inversion of (K̃^+CC'(K̃^+)'). This inversion is given by:

(K̃^+CC'(K̃^+)')^{-1} = K̃'(C')^{-1}C^{-1}K̃.

We will demonstrate this by showing that the product of the matrix inside the inversion on the left-hand side and the matrix on the right-hand side is the identity matrix:

(K̃^+CC'(K̃^+)')(K̃'(C')^{-1}C^{-1}K̃).

Using properties 1, 2 and 3 (noting, in particular, that (K̃^+)'K̃' = (K̃K̃^+)' = Ñ' = Ñ and that Ñ commutes with both C and C'), we see that the above is equal to:

K̃^+ÑCC'(C')^{-1}C^{-1}K̃ = K̃^+ÑK̃.

Simplifying and applying property 1 now yields that the above is equal to the below:

K̃^+ÑK̃ = K̃^+K̃K̃^+K̃ = I,

where I is the identity matrix, as required. Returning to (S8) and applying properties 1 and 2, we now have that:

(ÑC)^+ = C'Ñ(C')^{-1}C^{-1}Ñ.

By applying properties 2 and 3, the above can now be reduced to:

(ÑC)^+ = ÑC^{-1}Ñ.  (S9)

We now note that, by property 3, ÑC = CÑ. By multiplying the left and right side of this identity by C^{-1}, it can be seen that:

C^{-1}Ñ = ÑC^{-1}.

Multiplying the right-hand side of the above by Ñ and applying property 2, we obtain:

C^{-1}Ñ = ÑC^{-1}Ñ.

Substituting the above into (S9) gives:

(ÑC)^+ = C^{-1}Ñ.

The result of Lemma S3 now follows by multiplying both sides of the above on the right by δ and applying property 4.
Theorem S1. The below two update rules are equivalent:

θ^f_{s+1} = θ^f_s + α_s F(θ^f_s)^{-1} (∂l(θ^f_s)/∂θ^f),

θ^f_{s+1} = θ^f_s + α_s I(θ^f_s)^+ (∂l(θ^f_s)/∂θ^f),

where F(θ^f_s) is the matrix defined in Section 2.1.2 and I(θ^f_s) is the Fisher Information matrix of θ^f_s.
Proof. To prove Theorem S1, it suffices to prove the below equality:

F(θ^f_s)^{-1} (∂l(θ^f_s)/∂θ^f) = I(θ^f_s)^+ (∂l(θ^f_s)/∂θ^f).

To prove the above, we employ Lemma S3. To achieve this, we proceed by defining:

• Ñ to be the block diagonal matrix: Ñ = diag(I_{p+1}, N_{q_1}, ..., N_{q_r});

• K̃ to be the block diagonal matrix: K̃ = diag(I_{p+1}, D_{q_1}, ..., D_{q_r}), where D_{q_k} is the duplication matrix of Magnus and Neudecker [1986];

• C to be the matrix F(θ^f_s);

• δ to be the column vector: δ = ∂l(θ^f_s)/∂θ^f;

and claim that F(θ^f_s)Ñ = I(θ^f_s). To prove this equality, we first note the following two properties of the matrix N_k, given by Magnus and Neudecker [1986]:

1. N_{k_1}(A ⊗ A) = N_{k_1}(A ⊗ A)N_{k_2} = (A ⊗ A)N_{k_2}, for any arbitrary matrix A and positive integers k_1 and k_2 such that the matrix multiplications are well defined.

2. N_k vec(A) = vec(A), for any arbitrary symmetric matrix A of appropriate dimension.

The first of the above properties, in combination with equations (13) and (15), can be seen to imply that, for any k_1 and k_2 between 1 and r:

F_{vec(D_{k_1}),vec(D_{k_2})} N_{q_{k_2}} = I_{vec(D_{k_1}),vec(D_{k_2})}.

In a similar manner, the second of the above properties, in combination with equation (12), can be seen to imply that, for any k between 1 and r:

N_{q_k} (∂l(θ^f_s)/∂vec(D_k)) = ∂l(θ^f_s)/∂vec(D_k).

Expanding the multiplication F(θ^f_s)Ñ block-wise demonstrates that the two equations above are sufficient to prove the equality F(θ^f_s)Ñ = I(θ^f_s). The matrices Ñ, K̃ and C can be seen to satisfy properties 1-3 of Lemma S3 by noting standard results concerning the matrices D_k, K_{m,n} and N_n, provided in Magnus and Neudecker [1986]. Property 4 of Lemma S3 can be seen to hold for the matrix Ñ and column vector δ by noting the symmetry of ∂l(θ^f_s)/∂D_k (see Theorem 1). By combining the previous arguments and applying Lemma S3, the below is obtained:

I(θ^f_s)^+ (∂l(θ^f_s)/∂θ^f) = (ÑF(θ^f_s))^+ (∂l(θ^f_s)/∂θ^f) = F(θ^f_s)^{-1} (∂l(θ^f_s)/∂θ^f).

The result of Theorem S1 now follows.
Theorem S2. For any fixed integer k between 1 and r, the below two update rules are equivalent:

vec(D_{k,s+1}) = vec(D_{k,s}) + α_s F_{vec(D_{k,s})}^{-1} (∂l(θ^f_s)/∂vec(D_k)),

vec(D_{k,s+1}) = vec(D_{k,s}) + α_s (I^f_{vec(D_{k,s})})^+ (∂l(θ^f_s)/∂vec(D_k)),

where F_{vec(D_{k,s})} is as defined in Section 2.1.2 and I^f_{vec(D_{k,s})} is the Fisher Information matrix of vec(D_k).
Proof. The proof of the above proceeds by defining Ñ = N_{q_k}, K̃ = D_{q_k} (the duplication matrix), C = F_{vec(D_{k,s})} and δ to be the partial derivative appearing in the statement of the theorem. Following this, the proof follows the same structure as that of Theorem S1.
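The matrix identities underlying these proofs can be checked numerically. The sketch below (our own illustrative code) constructs the duplication matrix D_q, verifies that D_q D_q^+ equals the symmetrizer matrix N_q = (I + K_{q,q})/2, and confirms the conclusion of Lemma S3 for a matrix C with the Kronecker-product structure encountered above:

```python
import numpy as np

def dup_matrix(q):
    """Duplication matrix D_q: vec(A) = D_q vech(A) for symmetric A."""
    D = np.zeros((q * q, q * (q + 1) // 2))
    col = 0
    for j in range(q):
        for i in range(j, q):
            D[j * q + i, col] = 1    # position of A[i, j] in vec(A)
            D[i * q + j, col] = 1    # position of A[j, i] in vec(A)
            col += 1
    return D

rng = np.random.default_rng(0)
q = 3

# Property 1: D_q D_q^+ equals the symmetrizer N_q = (I + K_{q,q}) / 2.
D = dup_matrix(q)
N = D @ np.linalg.pinv(D)
Kc = np.zeros((q * q, q * q))        # commutation matrix: Kc vec(A) = vec(A')
for a in range(q):
    for b in range(q):
        Kc[b * q + a, a * q + b] = 1
assert np.allclose(N, (np.eye(q * q) + Kc) / 2)

# Lemma S3: with C = A (kron) A commuting with N, and delta = vec(S) for a
# symmetric S, we have pinv(N C) delta = C^{-1} delta.
A = rng.normal(size=(q, q))
A = A + A.T + 5 * np.eye(q)          # symmetric, comfortably invertible
C = np.kron(A, A)
S = rng.normal(size=(q, q))
delta = (S + S.T).reshape(-1, 1)     # vec of a symmetric matrix
assert np.allclose(np.linalg.pinv(N @ C) @ delta,
                   np.linalg.solve(C, delta))
```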

S14 Constraint Matrices
In this section, we provide further detail on the approach to constrained optimization which was outlined in Section 2.4 of the main text. We begin by providing examples of how the notation described in Section 2.4 may be used in practice. For a given k, we introduce the notation {d̃_i} and {d_i} to represent the sets of unique parameters required to specify the constrained and unconstrained representations of D_k, respectively. For example, if D_k is (3 × 3) in dimension and we wish to induce Toeplitz structure onto D_k, then D_k can be considered to take the following "constrained" and "unconstrained" representations:

D_k = [d̃_1, d̃_2, d̃_3; d̃_2, d̃_1, d̃_2; d̃_3, d̃_2, d̃_1]  (constrained),    D_k = [d_1, d_4, d_7; d_2, d_5, d_8; d_3, d_6, d_9]  (unconstrained).

The notation vecu(A) is used to denote the column vector of unique variables in the matrix A, given in order of first occurrence as listed in column-major order. In the above example, this implies vecu(D_k) = [d̃_1, d̃_2, d̃_3]' whilst vec(D_k) = [d_1, ..., d_9]'. Using this notation, the constraint matrix C_k may be defined element-wise. To see this, note that the chain rule for the total derivative is given by:

dl/d[vecu(D_k)]_i = Σ_j (∂l/∂[vec(D_k)]_j) (d[vec(D_k)]_j / d[vecu(D_k)]_i).

Comparing this with the derivative expression provided in equation (27), it can be seen that the elements of C_k are given by:

C_k[i,j] = d[vec(D_k)]_j / d[vecu(D_k)]_i.

In general, derivation of the constraint matrix for the k-th factor is relatively straightforward for most practical applications. Table S10 supports this statement by providing examples of the constraint matrix for the k-th factor, operating under the assumption that D_k exhibits several commonly employed covariance structures. In addition, we note that two instances in which a constraint matrix approach was employed implicitly can be found in the main body of this work. In Section 2.1.1, we took C_k = D_{q_k} to enforce symmetry in D_k and, in Section 2.1.5, we took C_k = (dvech(D_k)/dvech(Λ_k))D_{q_k} in order to derive the Fisher Scoring update in terms of the Cholesky factor, Λ_k.
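A short sketch of how the Toeplitz constraint matrix above could be formed numerically (our own illustrative code; indexing follows the column-major vec ordering):

```python
import numpy as np

q = 3
# Toeplitz structure: entry (i, j) of D_k equals the unique parameter
# indexed by |i - j|, so C_k[a, b] = d vec(D_k)_b / d vecu(D_k)_a is 0 or 1.
C_k = np.zeros((q, q * q))
for j in range(q):          # columns of D_k (column-major vec ordering)
    for i in range(q):      # rows of D_k
        C_k[abs(i - j), j * q + i] = 1

# Example use: map the unique parameters to the full vectorized matrix,
# vec(D_k) = C_k' vecu(D_k).
d_unique = np.array([1.0, 0.5, 0.2])
D_k = (C_k.T @ d_unique).reshape(q, q, order="F")   # Toeplitz matrix
```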
It is also worth noting that this approach to constrained optimization allows for different covariance structures to be enforced on different factors in the LMM. For instance, a researcher may enforce the first factor in an LMM to have an AR(1) structure while allowing the second factor to exhibit a completely different structure, such as compound symmetry. As noted in Section 2.4, this approach can also be extended further to allow all elements of the vectorized covariance matrices, vec(D_1), ..., vec(D_r), to be expressed as functions of one set of parameters, ρ_D, common to all {vec(D_k)}_{k∈{1,...,r}}. To further detail this, we define v(D) and extend the definition of θ_con as follows:

v(D) = [vec(D_1)', ..., vec(D_r)']',    θ_con = [β', σ², ρ_D']'.

Through the same process used to obtain the equations presented in Section 2.4, the below can be obtained:

dl(θ_con)/dρ_D = C (∂l(θ_con)/∂v(D)),  (S10)

where C is the constraint matrix, defined to be the Jacobian matrix of partial derivatives of v(D) taken with respect to ρ_D. The score vector and Fisher Information matrix associated to v(D) are readily seen to be given by block-wise combinations of the matrix expressions (11) and (13).

Table S10: The constraint matrix for the k-th factor, C_k, in the setting in which the k-th factor groups 3 random effects, under various covariance structures.

S15 Satterthwaite degrees of freedom estimation
In this section, we provide an in-depth accounting of Satterthwaite estimation for degrees of freedom in the multi-factor LMM setting. In Section S15.1, we detail how the expression described by equation (28) of the main text is derived for estimating the degrees of freedom of the approximate T-statistic. Following this, in Section S15.2, we describe how this approach is extended to degrees of freedom estimation for the approximate F-statistic. These sections follow the proofs outlined in the work of Kuznetsova et al. [2017]. The reader is referred to this work for a more in-depth discussion of Satterthwaite degrees of freedom estimation for the LMM.

S15.1 Satterthwaite estimation for the approximate T-statistic
To derive the estimator given by (28), we first begin by noting that the approximate T-statistic used for LMM null hypothesis testing is given by:

T = Lβ̂ / (L V̂ar(β̂) L')^{1/2},

where β̂ is the estimator of β obtained from numerical optimization and V̂ar(β̂) is an estimate of the variance of the β estimator, given by:

V̂ar(β̂) = σ̂²(X'V̂^{-1}X)^{-1}.

We highlight that the above expression differs from the T-statistic typically employed in the setting of linear regression hypothesis testing. Under the standard assumptions for linear regression, it is possible to derive an exact expression for the distribution of the OLS variance estimate of β̂, given by σ̂²(X'X)^{-1}. However, the above estimator, V̂ar(β̂), differs from the OLS estimate as it includes an estimate of V, which has an unknown distribution. As a direct result, no equivalent theory is available to obtain the distribution of V̂ar(β̂) in the setting of the LMM. A consequence of this is that, whilst T follows a Student's t_{n−p} distribution in the setting of linear regression, the distribution of T in the LMM setting is unknown. The aim of this section is to approximate the degrees of freedom of T, assuming that it follows a Student's t-distribution.
To meet this aim, we now denote Var(β̂) as the unknown true variance of the β̂ estimator. By multiplying both the numerator and denominator of the T-statistic by (LVar(β̂)L')^{-1/2}, the below expression for T is obtained:

T = Z / ( L V̂ar(β̂)L' / (LVar(β̂)L') )^{1/2},

where Z is a random variable with standard normal distribution, given by Z = Lβ̂ / (LVar(β̂)L')^{1/2}. We now consider the definition of the Student's t-distribution, which states that a random variable is t-distributed with v degrees of freedom if and only if it takes the following form:

T = Z / (X/v)^{1/2},

where Z ∼ N(0, 1) and X ∼ χ²_v. When the degrees of freedom v are unknown, as both X and T are distributed with v degrees of freedom, it suffices to estimate the degrees of freedom of the chi-square variable X instead of the t-distributed variable T. By equating the approximate T-statistic with the defining form of a t-distributed random variable, under the assumption that the T-statistic is truly t-distributed, the below expression for X can be obtained:

X = v · L V̂ar(β̂)L' / (LVar(β̂)L').

We now define the notation S²(η̂) to represent the estimated variance of Lβ̂ as a function of the estimated variance parameters, given by η̂ = (σ̂², D̂_1, ..., D̂_r). Similarly, we define Σ² to be the true unknown variance of Lβ̂. Noting that Σ² = LVar(β̂)L' and by rearranging the above, the below can be obtained:

X = v S²(η̂) / Σ².

By considering the variance of the above expression and noting that, as X ∼ χ²_v, Var(X) = 2v, the below expression is obtained:

v̂(η̂) = 2(S²(η̂))² / V̂ar(S²(η̂)).
Under the assumption that T ∼ t_v, v̂(η̂) is an estimator for the degrees of freedom of X and, therefore, for the degrees of freedom of T. This concludes the derivation of (28).
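For concreteness, a minimal sketch of this estimator (our own illustrative code; the delta-method estimate of Var(S²) and all names are assumptions on our part, not the paper's API):

```python
import numpy as np

def satterthwaite_df(L, var_beta, grad_S2, var_eta):
    """Satterthwaite degrees of freedom for T = L b / sqrt(L Var(b) L').

    L: (1 x p) contrast vector; var_beta: estimated Var(beta-hat);
    grad_S2: gradient of S^2(eta) in the variance parameters eta;
    var_eta: estimated covariance matrix of eta-hat (e.g. an inverse
    Fisher Information matrix). Var(S^2) is a delta-method estimate.
    """
    S2 = (L @ var_beta @ L.T).item()
    var_S2 = grad_S2 @ var_eta @ grad_S2
    return 2 * S2 ** 2 / var_S2
```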

S15.2 Satterthwaite estimation for the approximate F-statistic
In this section, we detail how the Satterthwaite method described in Section S15.1 may be extended in order to estimate the degrees of freedom of the approximate F-statistic. The approximate F-statistic for the LMM takes the below form:

F = β̂'L'(L V̂ar(β̂)L')^{-1}Lβ̂ / r,

where, throughout this section only, the notation F will represent the approximate F-statistic and r will be used to denote r = rank(L). The aim of this section is to approximate F with an F_{r,v} distribution, where v, the denominator degrees of freedom, is estimated using a Satterthwaite method of approximation. To simplify the notation, we will first consider Q, which is the numerator of F:

Q = rF = β̂'L'(L V̂ar(β̂)L')^{-1}Lβ̂.

We begin by considering the eigendecomposition of L V̂ar(β̂)L', which we will denote:

L V̂ar(β̂)L' = UΛU',

where U is an orthonormal matrix of eigenvectors and Λ is a diagonal matrix of eigenvalues. Using standard properties of the eigendecomposition, the below is obtained:

Q = Σ_{i=1}^{r} (U'Lβ̂)_i² / λ_i,

where (U'Lβ̂)_i is the i-th element of the vector U'Lβ̂ and λ_i is the i-th diagonal element (eigenvalue) of Λ, Λ_{[i,i]}. For purposes which will become clear, we define L̃ as U'L. By noting that λ_i = (U'L V̂ar(β̂)L'U)_{[i,i]}, the above can be seen to be equal to:

Q = Σ_{i=1}^{r} (L̃_iβ̂)² / (L̃_i V̂ar(β̂)L̃_i'),

where L̃_i represents the i-th row of L̃. Noting now the definition of L̃, it can be seen that the above is a sum of independent squared T-statistics of the form discussed in Section S15.1. By denoting the unknown degrees of freedom of each of the T-statistics in the above sum by v_i, it follows that:

E(Q) = Σ_{i=1}^{r} E(T_i²) = Σ_{i=1}^{r} v_i/(v_i − 2),

where the second equality follows from the fact that a squared T-statistic with degrees of freedom v_i is an F_{1,v_i}-statistic, which in turn has expectation v_i/(v_i − 2). Noting that, in practice, the true variance of β̂ is unknown, the methods of S15.1 must be employed to estimate each of the v_i in the above sum.
To use the above expression for E(Q) to estimate the denominator degrees of freedom of F, three cases are considered: r > 2, r = 2 and r = 1. Each of these will now be considered in turn.
Case 1 (r > 2): By recalling that Q = rF, and by the standard formula for the expectation of an F-distribution with denominator degrees of freedom v greater than 2, the below is obtained:

E(Q) = rE(F) = rv/(v − 2).

Setting the above equal to the previous expression for E(Q) and rearranging, the below expression for v is obtained:

v = 2E(Q)/(E(Q) − r),    where E(Q) = Σ_{i=1}^{r} v_i/(v_i − 2).

By using the Satterthwaite degrees of freedom approximation method described in Section S15.1 to estimate {v_i}_{i∈{1,...,r}}, it follows that the above formula can be used to estimate the degrees of freedom of the approximate F-statistic when r > 2.
Case 2 (r = 2): When r = 2, the expectation of Q is infinite and, as a result, the equality used in the argument of Case 1 no longer holds. A common convention for estimation of v in the case r = 2 is based on the observation that, for r > 2, the expectation of Q satisfies the following relationship with v:

v = 2E(Q)/(E(Q) − r).

By taking the limit from above of both sides of this expression, and noting that E(Q) → ∞ as v ↓ 2, the following estimate for v is obtained:

v = 2E(Q)/(E(Q) − 2).
Case 3 (r = 1): For the case r = 1, it is well known that if F ∼ F 1,v 1 for some integer v 1 , then F is the square of a T-statistic with the same degrees of freedom, v 1 . Resultantly, in this case, the unknown denominator degrees of freedom of the F-statistic, v 1 , can be estimated using the methods of Section S15.1.
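The case analysis above can be condensed into a short sketch (our own illustrative code; the r = 2 case is handled by the limiting convention described in Case 2, and each v_i is assumed to be greater than 2):

```python
def f_denominator_df(v):
    """Denominator degrees of freedom for the approximate F-statistic,
    given Satterthwaite estimates v = [v_1, ..., v_r] for the component
    squared T-statistics of Section S15.1 (each assumed > 2)."""
    r = len(v)
    if r == 1:
        return v[0]                       # Case 3: F is a squared T-statistic
    EQ = sum(vi / (vi - 2) for vi in v)   # E(Q)
    return 2 * EQ / (EQ - r)              # Cases 1 and 2
```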

S16 The ACE model
This section provides additional material concerning the ACE model described in Section 4.1.2 of the main text. For readers unfamiliar with the ACE model, Section S16.1 provides a brief overview of the kinship and environmental matrices that appear in the model specification. For readers interested in computational efficiency, Section S16.2 provides detail on how the FSFS algorithm of Section 2.1.4, ReML adjustments of Appendix 6.3 and constrained optimization approach of Section 2.4 may be efficiently combined in order to perform parameter estimation for the ACE model.

S16.1 Kinship and environmental matrices
In twin studies, twin pairs are typically described as either monozygotic (MZ) or dizygotic (DZ). MZ twin pairs are identical twins, whilst DZ twins are non-identical. The additive genetic kinship matrix, K^a_k, describes the additive genetic similarity between members of a family unit. For a family of q_k members, K^a_k is a (q_k × q_k) dimensional constant matrix with its (i, j)-th element given by:

• 1 if i = j, or if subjects i and j are an MZ twin pair;
• 1/2 if subjects i and j are a DZ twin pair;
• 0 otherwise.

The common environmental matrix, K^c_k, describes the similarity between members of a family unit which is attributed to the individuals being reared in a shared environment. K^c_k is a (q_k × q_k) dimensional constant matrix with its (i, j)-th element given by:

• 1 if subjects i and j were reared together;
• 0 if subjects i and j were not reared together.
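As an illustration, the sketch below builds both matrices for a family unit consisting of a single twin pair reared together (our own code; the string encoding of zygosity is an illustrative assumption):

```python
import numpy as np

def twin_pair_matrices(zygosity):
    """K_a and K_c for a family unit of one twin pair reared together."""
    r = 1.0 if zygosity == "MZ" else 0.5  # additive genetic correlation
    K_a = np.array([[1.0, r], [r, 1.0]])
    K_c = np.ones((2, 2))                 # both members reared together
    return K_a, K_c
```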
For more information on the construction of kinship and common environmental matrices for twin studies, the reader is referred to Lawlor et al. [2009].

S16.2 Computation and the ACE model
A key aspect in which the ACE model differs from the other LMMs considered in this work is that, in the ACE model, the second dimension of the random effects design matrix, q, is equal to the number of observations, n. Resultantly, the random effects design matrix, Z, which is given by the (n × n) identity matrix, has notably large dimensions. Due to the large size of Z, the methods of Section 2.3, which assume that the second dimension of Z is much smaller than n (i.e. q ≪ n), cannot be applied to the ACE model in order to improve computation speed. In this section, we briefly outline how the FSFS algorithm of Section 2.1.4 may be implemented efficiently, in conjunction with the ReML adjustments described in Appendix 6.3 and the constrained optimization methods of Section 2.4, to perform parameter estimation for the ACE model.
To begin, we consider the GLS updates, given by equation (16), which are employed for updating the parameters β and σ². By defining the matrix D̃_k by D̃_k = I_{q_k} + D_k, and noting that the matrix Z is the (n × n) identity matrix, the below matrix equality can be seen to hold for the ACE model:

X'V^{-1}X = Σ_{k=1}^{r} Σ_{j=1}^{l_k} X_{(k,j)}' D̃_k^{-1} X_{(k,j)},

where X_{(k,j)} represents the design matrix for the j-th family unit which possesses family structure of type k. Via well-known properties of the Kronecker product, the above equality can be seen to be equivalent to the following:

vec(X'V^{-1}X) = Σ_{k=1}^{r} [ Σ_{j=1}^{l_k} (X_{(k,j)}' ⊗ X_{(k,j)}') ] vec(D̃_k^{-1}).
The above equality is noteworthy as the matrix expression which appears inside the large brackets on the right-hand side is fixed and does not depend on β, σ² or D. Further, the matrix expression inside the brackets is typically small in size, having dimensions (p² × q_k²). As a result, this expression can be calculated and stored prior to the iterative procedure of the FSFS algorithm. This means that the above equality offers a quick and convenient method for repeated evaluation of the matrix X'V^{-1}X, which is employed for many calculations throughout the FSFS algorithm. Similar logic can be applied to the evaluation of the vector X'V^{-1}Y and the scalar value Y'V^{-1}Y. Using this approach, it can now be seen that the GLS estimators, (16), can be calculated using only {D_k}_{k∈{1,...,r}} and pre-calculated matrices which are small in dimension. Since the matrices {D_k}_{k∈{1,...,r}} are also typically small in dimension, this calculation is extremely quick and computationally efficient.
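A sketch of this precomputation strategy (our own illustrative code; X_fam[k][j] plays the role of X_{(k,j)} and Dt_inv[k] the role of D̃_k^{-1}):

```python
import numpy as np

def precompute_kron(X_fam):
    """For each family-structure type k, precompute the fixed matrix
    P_k = sum_j X_(k,j)' (kron) X_(k,j)', of dimension (p^2 x q_k^2)."""
    return [sum(np.kron(Xkj.T, Xkj.T) for Xkj in Xk) for Xk in X_fam]

def gls_cross_product(P, Dt_inv, p):
    """Evaluate X' V^{-1} X from the precomputed P_k and current D~_k^{-1}."""
    v = sum(Pk @ Dk_inv.reshape(-1, 1, order="F")   # column-major vec
            for Pk, Dk_inv in zip(P, Dt_inv))
    return v.reshape(p, p, order="F")               # un-vec the result
```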
Next, we consider the matrix F_{vec(D_k)} and the partial derivative vector of l_R(θ^f), taken with respect to vec(D_k). Expressions for these quantities can be obtained using equations (15) and (11), respectively, in combination with the adjustments described in Appendix 6.3. Again noting that the random effects design matrix, Z, is equal to the identity matrix, the ReML equivalent of equation (11) simplifies substantially to an expression involving the matrix E_k = vec_{q_k}(e_{(k)})', where vec_m is the generalized vectorization operator described in Section 2.3 and e_{(k)} is defined as the residual vector corresponding to subjects who belong to a family unit with family structure type k. Using equation (15), the sub-matrix of F corresponding to vec(D_k) can also be seen to be given by a similarly simplified expression.
The two expressions above can be evaluated computationally in an extremely efficient manner, primarily for two reasons. The first of these is that the matrices involved in computation are typically small; D_k, E_k and X_{(k,j)} have dimensions (q_k × q_k), (q_k × l_k q_k) and (q_k × p), respectively. The second reason is that, excluding the inversions of D_k and X'V^{-1}X, the only operations required to evaluate the expressions are computationally quick to perform. These operations are: matrix multiplication, the reshape operation, matrix addition and the Kronecker product. Lastly, we consider the evaluation of the score vector and Fisher Information matrix associated to τ = [τ_a, τ_c]' = σ_e^{-1}[σ_a, σ_c]', which shall be denoted dl(θ_ACE)/dτ and I^ACE_τ, respectively. We define K_k = [vec(K^a_k), vec(K^c_k)]. By employing equation (S10) of Section S14, and using the constraint matrix derived in Appendix 6.6.2, the score vector and Fisher Information matrix associated to τ can be reduced to compact expressions involving the Hadamard (entry-wise) product, ⊙. The results of Section 4.2.2 were produced by using the GLS updates for the parameters β and σ² and by employing these reduced expressions to perform Fisher Scoring update steps for the parameter vector τ.
It should be noted that, in the ACE model, misspecification of variance components can result in a low-rank Fisher Information matrix. This can be seen by noting the positioning of the Hadamard product in the Fisher Information matrix expression described above and by considering the case in which one of the elements of τ equals 0. If not accounted for appropriately, this issue may, in practice, result in failure of the algorithm to converge. A simple, practical solution to this issue is to execute the algorithm multiple times using different initial starting points: once with a starting point in which τ_a is set to zero, once with τ_c set to zero and once with neither τ_a nor τ_c set to zero. The results of Section 4.2.2 were obtained using this approach.
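A sketch of this multi-start strategy (our own illustrative code; fisher_scoring_ace stands in for the FSFS-based procedure described above and is assumed to return a dict containing the achieved log-likelihood):

```python
import numpy as np

def multi_start_ace(fisher_scoring_ace, tau_init):
    """Run the ACE Fisher Scoring procedure from three starting points and
    keep the fit achieving the highest (restricted) log-likelihood."""
    starts = [np.array([0.0, tau_init[1]]),  # tau_a set to zero
              np.array([tau_init[0], 0.0]),  # tau_c set to zero
              np.asarray(tau_init)]          # neither set to zero
    fits = [fisher_scoring_ace(tau0) for tau0 in starts]
    return max(fits, key=lambda fit: fit["loglik"])
```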