Appendix 1: Fixed and random effects coefficients estimation
For given values of the variance components \(\tau _d^2\) (\(d = 1, 2\)) and \(\phi \), the fixed and random effects coefficients of model (4) can be estimated by maximizing, with respect to \(\varvec{\beta }\) and \(\varvec{\alpha }\), the approximate penalized log-likelihood (see Eq. (6) in Breslow and Clayton 1993)
$$\begin{aligned} -\frac{1}{2\phi }\sum _{i=1}^{n}Dev_i\left( y_i, \mu _i\right) - \frac{1}{2}\varvec{\alpha }^{t}\varvec{G}^{-1}\varvec{\alpha }, \end{aligned}$$
where \(Dev_i\) denotes the deviance. This maximization can be carried out with a Fisher-Scoring algorithm, which involves a working dependent variable and a weight matrix that are updated at each iteration. Specifically, at the \((k+1)\)th Fisher-Scoring iteration, the working vector \(\varvec{z}\) is obtained as
$$\begin{aligned} z_i = g(\mu _i^{(k)}) + (y_i - \mu _i^{(k)})g^{\prime }(\mu _i^{(k)}), \end{aligned}$$
and the model’s fixed and random effects are then estimated as
$$\begin{aligned} \varvec{\hat{\beta }}^{(k+1)}&= \left( \varvec{X}^{t}\varvec{V}^{-1} \varvec{X}\right) ^{-1}\varvec{X}^{t}\varvec{V}^{-1}\varvec{z}, \end{aligned}$$
(10)
$$\begin{aligned} \varvec{\hat{\alpha }}^{(k+1)}&= \varvec{G}\varvec{Z}^{t} \varvec{V}^{-1}\left( \varvec{z}- \varvec{X} \varvec{\hat{\beta }}^{(k+1)}\right) \nonumber \\&= \varvec{G}\varvec{Z}^{t}\varvec{P}\varvec{z}, \end{aligned}$$
(11)
where
$$\begin{aligned} \varvec{V}&= \varvec{W}^{-1} +\varvec{Z}\varvec{G} \varvec{Z}^{t},\\ \varvec{P}&= \varvec{V}^{-1} - \varvec{V^{-1}} \varvec{X} \left( \varvec{X^{t}V^{-1}X}\right) ^{-1} \varvec{X}^{t}\varvec{V^{-1}}, \end{aligned}$$
and \(\varvec{W}\) is a diagonal matrix of weights with elements \(w_{ii} = \left\{ \phi [g'(\mu _i^{(k)})]^2\nu (\mu _i^{(k)})\right\} ^{-1}\).
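As an illustration only, the update in (10) and (11) can be sketched in NumPy. The sketch assumes a Poisson response with log link, so that \(g(\mu ) = \log \mu \), \(g'(\mu ) = 1/\mu \), \(\nu (\mu ) = \mu \) and \(w_{ii} = \mu _i/\phi \). The function name `fisher_scoring_step`, the simulated data and the choice \(\varvec{G} = 0.5\varvec{I}\) are hypothetical and are not taken from model (4).

```python
import numpy as np

def fisher_scoring_step(y, X, Z, G, mu, phi=1.0):
    """One update of (beta, alpha), assuming a Poisson model with log link."""
    # Working vector: z_i = g(mu_i) + (y_i - mu_i) g'(mu_i), with g = log.
    z = np.log(mu) + (y - mu) / mu
    # Weights: w_ii = {phi [g'(mu_i)]^2 nu(mu_i)}^{-1} = mu_i / phi.
    w = mu / phi
    V = np.diag(1.0 / w) + Z @ G @ Z.T
    Vinv = np.linalg.inv(V)
    XtVinv = X.T @ Vinv
    XtVinvX = XtVinv @ X
    beta = np.linalg.solve(XtVinvX, XtVinv @ z)        # Eq. (10)
    alpha = G @ Z.T @ Vinv @ (z - X @ beta)            # Eq. (11), first form
    P = Vinv - XtVinv.T @ np.linalg.solve(XtVinvX, XtVinv)
    return beta, alpha, z, P

# Hypothetical toy data (not from the paper).
rng = np.random.default_rng(1)
n, q = 40, 4
x = np.linspace(0.0, 1.0, n)
X = np.column_stack([np.ones(n), x])
Z = 0.3 * rng.normal(size=(n, q))
G = 0.5 * np.eye(q)
y = rng.poisson(np.exp(0.5 + x)).astype(float)

mu = y + 0.5                       # simple positive starting value
for _ in range(20):
    beta, alpha, z, P = fisher_scoring_step(y, X, Z, G, mu)
    mu = np.exp(X @ beta + Z @ alpha)

# The two forms of Eq. (11) agree: alpha = G Z' P z.
print(np.allclose(alpha, G @ Z.T @ P @ z))
```

Inverting \(\varvec{V}\) directly is done here only for readability; it is an \(n \times n\) matrix, so this route becomes expensive for large \(n\).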
From a computational point of view, a more convenient way to obtain \(\varvec{\hat{\beta }}\) and \(\varvec{\hat{\alpha }}\) jointly is to solve the linear system (see Eq. (9) in Breslow and Clayton 1993)
$$\begin{aligned} \underbrace{ \begin{bmatrix} \varvec{X}^t\varvec{W}\varvec{X}&\varvec{X}^t\varvec{W}\varvec{Z}\varvec{G} \\ \varvec{Z}^t\varvec{W}\varvec{X}&\varvec{I} + \varvec{Z}^t\varvec{W}\varvec{Z}\varvec{G} \end{bmatrix}}_{\varvec{C}} \begin{bmatrix} \varvec{\hat{\beta }}^{(k+1)}\\ \varvec{\hat{b}}^{(k+1)} \end{bmatrix} = \begin{bmatrix} \varvec{X}^{t}\varvec{W}\varvec{z}\\ \varvec{Z}^{t}\varvec{W}\varvec{z} \end{bmatrix}, \end{aligned}$$
(12)
where \(\varvec{\hat{b}}^{(k+1)} = \varvec{G}^{-1} \varvec{\hat{\alpha }}^{(k+1)}\). Note that (12) corresponds to the normal equations of the best linear unbiased estimation of \(\varvec{\beta }\) and the best linear unbiased prediction of \(\varvec{\alpha }\) under the working linear mixed model
$$\begin{aligned} \varvec{z}&= \varvec{X}\varvec{\beta } + \varvec{Z}\varvec{\alpha } + \varvec{\epsilon },\;\;\; \text{ with }\;\;\; \varvec{\alpha }\sim N(\varvec{0},\varvec{G})\\&\quad \text{ and }\;\;\;\varvec{\epsilon }\sim N(\varvec{0},\varvec{W}^{-1}). \end{aligned}$$
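The equivalence between the system (12) and the expressions (10) and (11) can be checked numerically. The following is a minimal sketch on randomly generated inputs; the dimensions, the random matrices and all variable names are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 12, 2, 3
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Z = rng.normal(size=(n, q))
W = np.diag(rng.uniform(0.5, 2.0, size=n))   # working weights
G = np.diag(rng.uniform(0.2, 1.5, size=q))   # random effects covariance
z = rng.normal(size=n)                       # working vector

# Route 1: Eqs. (10) and (11), through V = W^{-1} + Z G Z'.
V = np.linalg.inv(W) + Z @ G @ Z.T
Vinv = np.linalg.inv(V)
beta_v = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ z)
alpha_v = G @ Z.T @ Vinv @ (z - X @ beta_v)

# Route 2: the linear system (12), solved for (beta, b) with alpha = G b.
C = np.block([[X.T @ W @ X, X.T @ W @ Z @ G],
              [Z.T @ W @ X, np.eye(q) + Z.T @ W @ Z @ G]])
rhs = np.concatenate([X.T @ W @ z, Z.T @ W @ z])
sol = np.linalg.solve(C, rhs)
beta_c, b_c = sol[:p], sol[p:]
alpha_c = G @ b_c

print(np.allclose(beta_v, beta_c), np.allclose(alpha_v, alpha_c))
```

Route 2 only requires solving a system of size \(p + q\), whereas Route 1 inverts the \(n \times n\) matrix \(\varvec{V}\), which is the computational advantage referred to above.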
Appendix 2: Proof of theorem
Proof
Ignoring the dependence of \(\varvec{W}\) on \(\tau _d\) (\(d = 1, 2\)), the approximate restricted log-likelihood of the working linear mixed model is given by (Breslow and Clayton 1993)
$$\begin{aligned} l^{*}&= -\frac{1}{2}\log |\varvec{V}|-\frac{1}{2}\log |\varvec{X^{t} V^{-1}X}|\\&-\frac{1}{2}(\varvec{z}-\varvec{X}\varvec{\hat{\beta }})^{t} \varvec{V}^{-1}(\varvec{z}-\varvec{X}\varvec{\hat{\beta }}). \end{aligned}$$
The REML estimates of the variance components are then obtained in the usual manner by maximizing this quantity. Taking derivatives with respect to the variance components \(\tau _d^2\) (\(d=1,2\)), we obtain (see online Supplementary Material for details)
$$\begin{aligned} \frac{\partial {l^{*}}}{\partial {\tau _d^2}} =-\frac{1}{2}trace \left( \varvec{Z}^{t}\varvec{P}\varvec{Z} \frac{\partial {\varvec{G}}}{\partial {\tau _d^2}}\right) +\frac{1}{2}\varvec{\hat{\alpha }}^{t}\varvec{G}^{-1} \frac{\partial {\varvec{G}}}{\partial {\tau _d^2}} \varvec{G}^{-1}\varvec{\hat{\alpha }}.\nonumber \\ \end{aligned}$$
(13)
Applying matrix differentiation properties, we have
$$\begin{aligned} \frac{\partial {\varvec{G}}}{\partial {\tau _d^2}} = - \varvec{G}\frac{\partial {\varvec{G}^{-1}}}{\partial {\tau _d^2}} \varvec{G}=\frac{1}{\tau _d^4}\varvec{G}\varvec{\Lambda }_d \varvec{G}, \end{aligned}$$
(14)
where
$$\begin{aligned}&\varvec{G} = \text{ diag }\left( \tau _2^2/\vec {\varvec{d}}_2, \tau _1^2/\vec {\varvec{d}}_1, 1/(\vec {\varvec{d}}_2^*/ \tau _2^2 + \vec {\varvec{d}}_1^*/ \tau _1^2)\right) , \\&\varvec{\Lambda }_1 = \text{ diag }(\vec {\varvec{0}}_{q_1(c_2 - q_2)},\vec {\varvec{d}}_1,\vec {\varvec{d}}_1^*),\nonumber \\&\varvec{\Lambda }_2 = \text{ diag }(\vec {\varvec{d}}_2, \vec {\varvec{0}}_{q_2(c_1 - q_1)}, \vec {\varvec{d}}_2^*),\nonumber \end{aligned}$$
(15)
with \(\vec {\varvec{0}}_{r}\) being a vector of zeroes of length \(r\), and \(\varvec{d}_1 = \varvec{I}_{q_2}\otimes \tilde{\varvec{\Sigma }}_1\), \(\varvec{d}_2 = \tilde{\varvec{\Sigma }}_2 \otimes \varvec{I}_{q_1}\), \(\varvec{d}_1^*= \varvec{I}_{c_2-q_2}\otimes \tilde{\varvec{\Sigma }}_1\), \(\varvec{d}_2^*= \tilde{\varvec{\Sigma }}_2 \otimes \varvec{I}_{c_1-q_1}\). Plugging expression (14) into (13), the first-order partial derivatives of the approximate restricted log-likelihood become
$$\begin{aligned} 2\frac{\partial {l^{*}}}{\partial {\tau _d^2}} = -\frac{1}{\tau _d^2}trace \left( \varvec{Z}^{t}\varvec{P}\varvec{Z}\varvec{G} \frac{\varvec{\Lambda }_d}{\tau _d^2}\varvec{G}\right) +\frac{1}{\tau _d^4}\varvec{\hat{\alpha }}^{t}\varvec{\Lambda }_d \varvec{\hat{\alpha }}. \end{aligned}$$
(16)
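The derivative identity (14) relies on \(\varvec{G}^{-1} = \varvec{\Lambda }_1/\tau _1^2 + \varvec{\Lambda }_2/\tau _2^2\), which follows from the diagonal form of \(\varvec{G}\) in (15). A minimal numerical sketch compares (14) with a central finite difference; the diagonal entries of \(\varvec{\Lambda }_1\) and \(\varvec{\Lambda }_2\) and the values of \(\tau _d^2\) below are arbitrary numbers chosen for the illustration, not quantities from the paper.

```python
import numpy as np

# Hypothetical diagonal blocks: Lambda_1 = diag(0, d_1, d_1*),
# Lambda_2 = diag(d_2, 0, d_2*), here with block sizes 2, 2 and 1.
Lam = [np.diag([0.0, 0.0, 0.5, 3.0, 1.5]),
       np.diag([1.0, 2.0, 0.0, 0.0, 0.7])]

def G_of(tau2):
    """G as a function of (tau_1^2, tau_2^2), via its inverse."""
    return np.linalg.inv(Lam[0] / tau2[0] + Lam[1] / tau2[1])

tau2 = np.array([0.8, 1.7])   # (tau_1^2, tau_2^2)
h = 1e-6
max_err = 0.0
for d in range(2):
    step = np.zeros(2)
    step[d] = h
    # Central finite difference of G with respect to tau_d^2.
    numeric = (G_of(tau2 + step) - G_of(tau2 - step)) / (2 * h)
    # Eq. (14): G Lambda_d G / tau_d^4, where tau2[d] stores tau_d^2.
    G = G_of(tau2)
    analytic = G @ Lam[d] @ G / tau2[d] ** 2
    max_err = max(max_err, np.abs(numeric - analytic).max())

print(max_err < 1e-6)
```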
Then, REML estimates of the variance components \(\tau _d^2\) (\(d=1,2\)) are found by equating expression (16) to zero, which gives
$$\begin{aligned} \hat{\tau }_d^2=\frac{\varvec{\hat{\alpha }}^{t}\varvec{\Lambda }_d \varvec{\hat{\alpha }}}{trace\left( \varvec{Z}^{t}\varvec{P} \varvec{Z}\varvec{G}\frac{\varvec{\Lambda }_d}{\tau _d^2} \varvec{G}\right) }. \end{aligned}$$
Before proceeding with the estimation of \(\phi \) (if unknown), it is important to observe that the sum of the quantities in the denominators of the variance component estimates corresponds to the effective dimension of the penalized part (or random part) of the fitted model:
$$\begin{aligned}&trace\left( \varvec{Z}^{t}\varvec{P}\varvec{Z}\varvec{G} \frac{\varvec{\Lambda }_1}{\tau _1^2}\varvec{G}\right) + trace\left( \varvec{Z}^{t}\varvec{P}\varvec{Z}\varvec{G} \frac{\varvec{\Lambda }_2}{\tau _2^2}\varvec{G}\right) \\&\quad =trace\left( \varvec{Z}^{t}\varvec{P}\varvec{Z} \varvec{G}\right) \\&\quad = trace\left( \varvec{Z}\varvec{G}\varvec{Z}^{t} \varvec{P}\right) \\&\quad = trace\left( \varvec{H}_{Random}\right) , \end{aligned}$$
where \(\varvec{H}_{Random}\) denotes the hat matrix (Hastie and Tibshirani 1990) of the random part [see (11)].
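The variance component update and the trace decomposition above can be illustrated with a short NumPy sketch. The inputs are randomly generated, the entries of \(\varvec{\Lambda }_1\) and \(\varvec{\Lambda }_2\) are arbitrary, and the helper `reml_update` is a hypothetical name; the sketch performs a single update for given \(\tau _d^2\) and a fixed working vector, not the full iterative procedure.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
X = np.column_stack([np.ones(n), np.linspace(-1.0, 1.0, n)])
Z = rng.normal(size=(n, 6))
W = np.diag(rng.uniform(0.5, 2.0, size=n))
z = rng.normal(size=n)

# Hypothetical Lambda_1 = diag(0, d_1, d_1*) and Lambda_2 = diag(d_2, 0, d_2*).
Lam = [np.diag([0.0, 0.0, 0.5, 3.0, 1.5, 0.4]),
       np.diag([1.0, 2.0, 0.0, 0.0, 0.7, 1.1])]

def reml_update(tau2):
    """One update of (tau_1^2, tau_2^2) for a fixed working vector z."""
    G = np.linalg.inv(Lam[0] / tau2[0] + Lam[1] / tau2[1])
    V = np.linalg.inv(W) + Z @ G @ Z.T
    Vinv = np.linalg.inv(V)
    VinvX = Vinv @ X
    P = Vinv - VinvX @ np.linalg.solve(X.T @ VinvX, VinvX.T)
    alpha = G @ Z.T @ P @ z
    ZtPZ = Z.T @ P @ Z
    # ed_d = trace(Z' P Z G (Lambda_d / tau_d^2) G): denominators of the updates.
    ed = np.array([np.trace(ZtPZ @ G @ L @ G) / t for L, t in zip(Lam, tau2)])
    tau2_new = np.array([alpha @ L @ alpha for L in Lam]) / ed
    return tau2_new, ed, G, P

tau2_new, ed, G, P = reml_update(np.array([0.8, 1.7]))

# The denominators add up to the trace of the hat matrix of the random part.
print(np.isclose(ed.sum(), np.trace(Z @ G @ Z.T @ P)))
```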
Finally, an estimate of \(\phi \) is obtained, as before, by taking derivatives of the approximate restricted log-likelihood with respect to \(\phi \)
$$\begin{aligned} \frac{\partial {l^{*}}}{\partial {\phi }}&= -\frac{1}{2}trace\left( \varvec{P} \frac{\partial {\varvec{V}}}{\partial {\phi }}\right) \\&+ \frac{1}{2}(\varvec{z} -\varvec{X}\varvec{\hat{\beta }})^{t} \varvec{V}^{-1}\frac{\partial {\varvec{V}}}{\partial {\phi }} \varvec{V}^{-1}(\varvec{z}- \varvec{X}\varvec{\hat{\beta }}). \end{aligned}$$
First, by Eq. (5.2) in Harville (1977), we have that \(\varvec{V}^{-1}(\varvec{z} - \varvec{X} \varvec{\hat{\beta }}) = \varvec{W}(\varvec{z} - \varvec{X} \varvec{\hat{\beta }} - \varvec{Z} \varvec{\hat{\alpha }})\). Moreover, \(\varvec{V}\) depends on \(\phi \) through \(\varvec{W}^{-1}\), where \(\varvec{W} = \frac{1}{\phi } \widetilde{\varvec{W}}\) and \(\widetilde{\varvec{W}}\) is a diagonal matrix with elements \(\widetilde{w}_{ii} = \left\{ [g'\left( \mu _i\right) ]^2\nu \left( \mu _i\right) \right\} ^{-1}\). Ignoring again the dependence of \(\widetilde{\varvec{W}}\) on \(\phi \), we have \(\partial \varvec{V}/\partial \phi = \widetilde{\varvec{W}}^{-1} = \varvec{W}^{-1}/\phi \), and it then follows that
$$\begin{aligned} 2\frac{\partial {l^{*}}}{\partial {\phi }}&= - \frac{1}{\phi }trace \left( \varvec{P}\varvec{W}^{-1}\right) \nonumber \\&+ \frac{1}{\phi ^2} (\varvec{z} -\varvec{X}\varvec{\hat{\beta }} - \varvec{Z}\varvec{\hat{\alpha }})^{t}\widetilde{\varvec{W}} (\varvec{z} - \varvec{X}\varvec{\hat{\beta }} - \varvec{Z}\varvec{\hat{\alpha }}). \end{aligned}$$
By equating the above expression to zero, we obtain
$$\begin{aligned} \hat{\phi } = \frac{(\varvec{z} -\varvec{X}\varvec{\hat{\beta }} - \varvec{Z}\varvec{\hat{\alpha }})^{t}\widetilde{\varvec{W}}(\varvec{z} - \varvec{X}\varvec{\hat{\beta }} -\varvec{Z}\varvec{\hat{\alpha }})}{trace\left( \varvec{P}\varvec{W}^{-1}\right) }, \end{aligned}$$
where [see Eq. (5.3) in Harville 1977 and expressions (10), (11), and (12)]
$$\begin{aligned}&trace\left( \varvec{P}\varvec{W}^{-1}\right) = trace\left( \varvec{W}^{-1}\varvec{P}\right) \\&\quad = trace\left( \varvec{I}_n - [\varvec{X}|\varvec{Z}\varvec{G}]\varvec{C}^{-1} \begin{bmatrix} \varvec{X}^{t}\varvec{W}\\ \varvec{Z}^{t}\varvec{W} \end{bmatrix}\right) \\&\quad = trace\left( \varvec{I}_n - [\varvec{X}|\varvec{Z}\varvec{G}] \begin{bmatrix} \left( \varvec{X}^{t}\varvec{V}^{-1}\varvec{X}\right) ^{-1} \varvec{X}^{t}\varvec{V}^{-1}\\ \varvec{Z}^{t}\varvec{P} \end{bmatrix}\right) \\&\quad = n - trace\left( \varvec{X}\left( \varvec{X}^{t}\varvec{V}^{-1} \varvec{X}\right) ^{-1}\varvec{X}^{t}\varvec{V}^{-1}\right) \\&\qquad - trace\left( \varvec{Z}\varvec{G}\varvec{Z}^{t}\varvec{P}\right) \\&\quad = n - rank\left( \varvec{X}\right) - \sum _{d=1}^{2}ed_d. \end{aligned}$$
Note that \(\varvec{H} =[\varvec{X}|\varvec{Z}\varvec{G}] \varvec{C}^{-1}[\varvec{X}|\varvec{Z}]^{t}\varvec{W}\) corresponds to the hat matrix of the fitted model, whose trace, as shown, can be decomposed as the sum of the traces of the hat matrices of the unpenalized (or fixed) part and the penalized (or random) part. \(\square \)
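As a numerical illustration of the last part of the proof, the following NumPy sketch computes one update of \(\phi \) and checks that \(trace(\varvec{P}\varvec{W}^{-1})\) equals \(n\) minus the traces of the fixed and random hat matrices. It assumes a Gaussian response with identity link, so that \(\widetilde{\varvec{W}} = \varvec{I}_n\); the function name `update_phi`, the simulated inputs and \(\varvec{G} = 0.6\varvec{I}\) are hypothetical.

```python
import numpy as np

def update_phi(z, X, Z, G, w_tilde, phi):
    """One update of phi, with P and W evaluated at the current phi."""
    W_inv = np.diag(phi / w_tilde)               # W^{-1} = phi * W_tilde^{-1}
    V = W_inv + Z @ G @ Z.T
    Vinv = np.linalg.inv(V)
    VinvX = Vinv @ X
    A = np.linalg.solve(X.T @ VinvX, VinvX.T)    # (X'V^{-1}X)^{-1} X'V^{-1}
    P = Vinv - VinvX @ A
    beta = A @ z
    alpha = G @ Z.T @ P @ z
    r = z - X @ beta - Z @ alpha                 # working residuals
    denom = np.trace(P @ W_inv)
    phi_new = (r * w_tilde) @ r / denom          # r' W_tilde r / trace(P W^{-1})
    ed_fixed = np.trace(X @ A)                   # trace of the fixed hat matrix
    ed_random = np.trace(Z @ G @ Z.T @ P)        # trace of the random hat matrix
    return phi_new, denom, ed_fixed, ed_random

# Hypothetical toy inputs (Gaussian case: w_tilde = 1).
rng = np.random.default_rng(3)
n = 25
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Z = rng.normal(size=(n, 3))
G = 0.6 * np.eye(3)
z = rng.normal(size=n)
w_tilde = np.ones(n)

phi_new, denom, ed_fixed, ed_random = update_phi(z, X, Z, G, w_tilde, phi=1.0)

# trace(P W^{-1}) = n - rank(X) - (effective dimension of the random part).
print(np.isclose(denom, n - ed_fixed - ed_random))
```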