Various definitions and results from matrix calculus are used in the derivations of this section. These can be found in the "Appendix 1" section.
The nonlinear mixed effects model
Consider a population of \(N\) subjects and let the \(i\)th individual be described by the dynamical system
$$\begin{aligned} \begin{aligned} \frac{d{\mathbf{x}}_i(t)}{dt}&= {\mathbf{f}}\big ({\mathbf{x}}_i(t),t,{\mathbf{Z}}_i(t),\varvec{\theta },\varvec{\eta }_i\big )\\ {\mathbf{x}}_i(t_0)&= {\mathbf{x}}_{0i}\big ({\mathbf{Z}}_i(t_0),\varvec{\theta },\varvec{\eta }_i\big ), \end{aligned} \end{aligned}$$
(1)
where \({\mathbf{x}}_i(t)\) is a set of state variables, which for instance could be used to describe a drug concentration in one or more compartments, and where \({\mathbf{Z}}_i(t)\) is a set of possibly time-dependent covariates, \(\varvec{\theta }\) a set of fixed effects parameters, and \(\varvec{\eta }_i\) a set of random effect parameters which are multivariate normally distributed with zero mean and covariance \(\varvec{{\Omega }}\). The covariance matrix \(\varvec{{\Omega }}\) is in general unknown and will therefore typically contain parameters subject to estimation. For notational convenience, these parameters are included in the fixed effect parameter vector \(\varvec{\theta }\). The term fixed effects parameters will hence refer to all parameters that are not random, and is not limited to parameters appearing in the model differential equations. A model for the \(j\)th observation of the \(i\)th individual at time \(t_{j_i}\) is defined by
$$\begin{aligned} {\mathbf{y}}_{ij} = {\mathbf{h}}\big ({\mathbf{x}}_{ij},t_{j_i},{\mathbf{Z}}_i(t_{j_i}),\varvec{\theta },\varvec{\eta }_i\big ) + {\mathbf{e}}_{ij}, \end{aligned}$$
(2)
where
$$\begin{aligned} {\mathbf{e}}_{ij} \in N\Big (\varvec{0},{\mathbf{R}}_{ij}\big ({\mathbf{x}}_{ij},t_{j_i},{\mathbf{Z}}_i(t_{j_i}),\varvec{\theta },\varvec{\eta }_i\big )\Big ), \end{aligned}$$
(3)
and where the index notation \(ij\) is used as a short form for denoting the \(i\)th individual at the \(j\)th observation. Note that any fixed effect parameters of the observational model are included in \(\varvec{\theta }\). Furthermore, we let the expected value of the discrete-time observation model be denoted by
$$\begin{aligned} \hat {\mathbf{y}}_{ij} = {\mathbf{E}}\big [{\mathbf{y}}_{ij}\big ]. \end{aligned}$$
(4)
The population likelihood
Given a set of experimental observations, \({\mathbf{d}}_{ij}\), for the individuals \(i=1, \ldots , N \) at the time points \(t_{j_i}\), where \(j=1, \dots , n_{i}\), we define the residuals
$$\begin{aligned} \varvec{\epsilon }_{ij} = {\mathbf{d}}_{ij} - \hat{\mathbf{y}}_{ij}, \end{aligned}$$
(5)
and write the population likelihood
$$\begin{aligned} L(\varvec{\theta }) = \prod _{i=1}^N \int p_1\big ({\mathbf{d}}_{i}|\varvec{\theta },\varvec{\eta }_i\big )p_2\big (\varvec{{\eta }}_i|\varvec{\theta }\big )\,d\varvec{\eta }_i, \end{aligned}$$
(6)
where
$$\begin{aligned} p_1\big ({\mathbf{d}}_i|\varvec{\theta },\varvec{\eta }_i\big ) = \prod _{j=1}^{n_i} \frac{\exp \big (-\frac{1}{2}\varvec{\epsilon }_{ij}^T{\mathbf{R}}_{ij}^{-1}\varvec{\epsilon }_{ij}\big )}{\sqrt{\det \big (2\pi {\mathbf{R}}_{ij}\big )}} \end{aligned}$$
(7)
and
$$\begin{aligned} p_2\big (\varvec{\eta }_i|\varvec{\theta }\big ) = \frac{\exp \big (-\frac{1}{2}\varvec{\eta }_i^T\varvec{{\Omega }}^{-1}\varvec{\eta }_i\big )}{\sqrt{\det \big (2\pi \varvec{{\Omega }}\big )}}, \end{aligned}$$
(8)
and where \({\mathbf{d}}_{i}\) is used to denote the collection of data from all time points for the \(i\)th individual.
The FOCE and FOCEI approximations
The marginalization with respect to \(\varvec{\eta }_i\) in Eq. 6 does not have a closed form solution. By writing Eq. 6 in the form
$$\begin{aligned} L(\varvec{\theta }) = \prod _{i=1}^N \int \exp (l_i)\,d\varvec{\eta }_i, \end{aligned}$$
(9)
where the individual joint log-likelihoods are
$$\begin{aligned} \begin{aligned} l_i&= -\frac{1}{2}\sum _{j=1}^{n_i} \left( \varvec{\epsilon }_{ij}^T {\mathbf{R}}_{ij}^{-1}\varvec{\epsilon }_{ij} + \log \det \big (2\pi {\mathbf{R}}_{ij}\big ) \right) \\&\quad -\frac{1}{2} \varvec{\eta }_i^T \varvec{{\Omega }}^{-1}\varvec{\eta }_i-\frac{1}{2}\log \det \big (2\pi \varvec{{\Omega }}\big ), \end{aligned} \end{aligned}$$
(10)
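For concreteness, the evaluation of \(l_i\) in Eq. 10 can be sketched as follows, assuming the residuals \(\varvec{\epsilon }_{ij}\), the covariance matrices \({\mathbf{R}}_{ij}\), and \(\varvec{{\Omega }}\) have already been computed; all function and variable names are illustrative only.

```python
import numpy as np

def individual_joint_loglik(eps, R, eta, Omega):
    """Individual joint log-likelihood l_i of Eq. 10.

    eps:   list of residual vectors eps_ij, one per observation j
    R:     list of residual covariance matrices R_ij
    eta:   random effect vector eta_i
    Omega: random effect covariance matrix
    """
    li = 0.0
    for e, Rij in zip(eps, R):
        # Data term (cf. Eq. 7): -0.5*(eps^T R^{-1} eps + log det(2*pi*R))
        _, logdet = np.linalg.slogdet(2.0 * np.pi * Rij)
        li -= 0.5 * (e @ np.linalg.solve(Rij, e) + logdet)
    # Random effect term (cf. Eq. 8)
    _, logdet = np.linalg.slogdet(2.0 * np.pi * Omega)
    li -= 0.5 * (eta @ np.linalg.solve(Omega, eta) + logdet)
    return li

# Scalar example: one observation with eps = 1, R = 2, eta = 0.5, Omega = 1
li = individual_joint_loglik([np.array([1.0])], [np.array([[2.0]])],
                             np.array([0.5]), np.array([[1.0]]))
```

The log-determinants are evaluated with `slogdet` rather than by forming \(\det (2\pi {\mathbf{R}}_{ij})\) directly, which is numerically safer for larger matrices.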
a closed form solution can be obtained by approximating the function \(l_i\) with a second order Taylor expansion with respect to \(\varvec{\eta }_i\). This is the well-known Laplacian approximation. Furthermore, we choose the point around which the Taylor expansion is performed to be the \(\varvec{\eta }_i\) maximizing \(l_i\), here denoted by \(\varvec{\eta }_i^*\); i.e., the expansion is done at the mode of the posterior distribution. Thus, the approximate population likelihood, \(L_{L}\), becomes
$$\begin{aligned} L(\varvec{\theta }) \approx L_{L}(\varvec{\theta }) = \prod _{i=1}^N \left( \exp \big (l_i(\varvec{\eta }_i^*)\big ) \det \left[ \frac{-\mathrm {\Delta } l_i(\varvec{\eta }_i^*)}{2\pi } \right] ^{-\frac{1}{2}} \right) . \end{aligned}$$
(11)
Here, the Hessian \(\mathrm {\Delta } l_i(\varvec{\eta }_i^*)\) is obtained by differentiating \(l_i\) twice with respect to \(\varvec{\eta }_i\) and then evaluating at \(\varvec{\eta }_i^*\). If we let \(\eta _{ik}\) denote the \(k\)th component of \(\varvec{\eta }_i\), we have
$$\begin{aligned} \begin{aligned} \frac{d l_{i}}{d \eta _{ik}} =&- \frac{1}{2} \sum _{j=1}^{n_{i}} \Bigg ( 2 \varvec{\epsilon }_{ij}^T {\mathbf{R}}_{ij}^{-1} \frac{d \varvec{\epsilon }_{ij}}{d \eta _{ik}} - \varvec{\epsilon }_{ij}^T {\mathbf{R}}_{ij}^{-1} \frac{d {\mathbf{R}}_{ij}}{d \eta _{ik}} {\mathbf{R}}_{ij}^{-1} \varvec{\epsilon }_{ij}\\&\quad + {{\mathrm{tr}}}\left[ \mathrm {{\mathbf{R}}_{ij}^{-1}} \frac{d {\mathbf{R}}_{ij}}{d \eta _{ik}} \right] \Bigg ) - \varvec{\eta }_i^T \varvec{{\Omega }}^{-1} \frac{d \varvec{\eta }_i}{d \eta _{ik}}. \end{aligned} \end{aligned}$$
(12)
Differentiating component-wise again, now with respect to the \(l\)th component of \(\varvec{\eta }_i\), we get the elements of the Hessian
$$\begin{aligned} \begin{aligned} \frac{d^2 l_{i}}{d \eta _{ik} d \eta _{il}} =&- \frac{1}{2} \sum _{j=1}^{n_{i}} \Bigg ( 2 \frac{d \varvec{\epsilon }_{ij}^T}{d \eta _{il}} {\mathbf{R}}_{ij}^{-1} \frac{d \varvec{\epsilon }_{ij}}{d \eta _{ik}} - 2 \varvec{\epsilon }_{ij}^T {\mathbf{R}}_{ij}^{-1} \frac{d {\mathbf{R}}_{ij}}{d \eta _{il}} {\mathbf{R}}_{ij}^{-1} \frac{d \varvec{\epsilon }_{ij}}{d \eta _{ik}}\\&\quad + 2 \varvec{\epsilon }_{ij}^T {\mathbf{R}}_{ij}^{-1} \frac{d^2 \varvec{\epsilon }_{ij}}{d \eta _{ik} d \eta _{il}} - \varvec{\epsilon }_{ij}^T {\mathbf{R}}_{ij}^{-1} \frac{d^2 {\mathbf{R}}_{ij}}{d \eta _{ik} d \eta _{il}} {\mathbf{R}}_{ij}^{-1} \varvec{\epsilon }_{ij}\\&\quad + 2 \varvec{\epsilon }_{ij}^T {\mathbf{R}}_{ij}^{-1} \frac{d {\mathbf{R}}_{ij}}{d \eta _{ik}} {\mathbf{R}}_{ij}^{-1} \frac{d {\mathbf{R}}_{ij}}{d \eta _{il}} {\mathbf{R}}_{ij}^{-1} \varvec{\epsilon }_{ij} - 2 \varvec{\epsilon }_{ij}^T {\mathbf{R}}_{ij}^{-1} \frac{d {\mathbf{R}}_{ij}}{d \eta _{ik}} {\mathbf{R}}_{ij}^{-1} \frac{d \varvec{\epsilon }_{ij}}{d \eta _{il}}\\&- {{\mathrm{tr}}}\left[ {\mathbf{R}}_{ij}^{-1} \frac{d {\mathbf{R}}_{ij}}{d \eta _{il}} {\mathbf{R}}_{ij}^{-1} \frac{d {\mathbf{R}}_{ij}}{d \eta _{ik}}\right] + {{\mathrm{tr}}}\left[ {\mathbf{R}}_{ij}^{-1} \frac{d^2 {\mathbf{R}}_{ij}}{d \eta _{ik} d \eta _{il}} \right] \Bigg )\\&- \frac{d \varvec{\eta }_i^T}{d \eta _{il}} \varvec{{\Omega }}^{-1} \frac{d \varvec{\eta }_i}{d \eta _{ik}}, \end{aligned} \end{aligned}$$
(13)
where the last term is simply the \(kl\)th element of \(\varvec{{\Omega }}^{-1}\), \(\mathrm {\Omega }^{-1}_{kl}\). The expression for the elements of the Hessian may be approximated in different ways, primarily to avoid computing the costly second order derivatives. We apply a first order approximation, in which terms containing second order derivatives are ignored, and write the elements of the approximate Hessian, \({\mathbf{H}}_i\), as
$$\begin{aligned} \mathrm {H}_{ikl} = - \frac{1}{2} \sum _{j=1}^{n_{i}} \Bigg ( {\mathbf{a}}_l \, {\mathbf{B}} \, {\mathbf{a}}^T_k + {{\mathrm{tr}}}\left[ - {\mathbf{c}}_l \, {\mathbf{c}}_k \right] \Bigg ) - \mathrm {\Omega }_{kl}^{-1}, \end{aligned}$$
(14)
where
$$\begin{aligned} {\mathbf{a}}_k = \left( \frac{d \varvec{\epsilon }_{ij}^T}{d \eta _{ik}} - \varvec{\epsilon }_{ij}^T {\mathbf{R}}_{ij}^{-1} \frac{d {\mathbf{R}}_{ij}}{d \eta _{ik}} \right) ,\end{aligned}$$
(15)
$$\begin{aligned} {\mathbf{B}} = 2{\mathbf{R}}_{ij}^{-1}, \end{aligned}$$
(16)
and
$$\begin{aligned} {\mathbf{c}}_k = {\mathbf{R}}_{ij}^{-1} \frac{d {\mathbf{R}}_{ij}}{d \eta _{ik}}. \end{aligned}$$
(17)
This variant of the Laplacian approximation of the population likelihood is known as the first order conditional estimation with interaction (FOCEI) method. The closely related first order conditional estimation (FOCE) method is obtained by ignoring the dependence of the residual covariance matrix on the random effect parameters. The rationale for excluding the second order terms is that their expected values are zero for an appropriate model, as shown in the "Appendix 2" section. The Appendix also shows how the Hessian may be slightly further simplified, using similar arguments, to arrive at the variant of FOCE used in NONMEM. Those additional simplifications are however of relatively little importance from a computational point of view, since the components needed to evaluate these Hessian terms have to be provided for the remaining part of the Hessian anyway. We will therefore restrict the Hessian simplification by expectation to the second order terms only. From now on we will, for convenience, consider the logarithm of the FOCEI approximation to the population likelihood, \(L_{F}\),
$$\begin{aligned} \log L(\varvec{\theta }) \approx \log L_{F}(\varvec{\theta }) = \sum _{i=1}^N \left( l_i(\varvec{\eta }_i^*) -\frac{1}{2} \log \det \left[ \frac{-{\mathbf{H}}_i(\varvec{\eta }_i^*)}{2\pi } \right] \right) . \end{aligned}$$
(18)
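The structure of Eq. 18 can be illustrated on a scalar toy problem for which the Laplacian approximation is exact; the crude Newton search and finite differences below stand in for the inner optimization and Hessian evaluation and are illustrative choices only.

```python
import math

def laplace_log_integral(l, eta0, h=1e-4):
    """Laplacian approximation to log of the integral of exp(l(eta)) for a
    scalar eta: l(eta*) - 0.5*log(-H/(2*pi)), with the mode eta* found by a
    crude Newton iteration and H the second derivative at the mode
    (cf. the per-individual terms of Eq. 18)."""
    eta = eta0
    for _ in range(50):  # Newton search for the mode of l
        g = (l(eta + h) - l(eta - h)) / (2.0 * h)
        H = (l(eta + h) - 2.0 * l(eta) + l(eta - h)) / h**2
        eta -= g / H
    H = (l(eta + h) - 2.0 * l(eta) + l(eta - h)) / h**2
    return l(eta) - 0.5 * math.log(-H / (2.0 * math.pi))

# Gaussian integrand: the exact log-integral is 0.5*log(2*pi) ~ 0.9189,
# and the Laplacian approximation reproduces it exactly.
logL = laplace_log_integral(lambda e: -0.5 * (e - 1.0) ** 2, eta0=0.0)
```

For non-Gaussian integrands the approximation is no longer exact, which is precisely the error incurred by \(L_{L}\) relative to \(L\).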
Gradient of the individual joint log-likelihood with respect to the random effect parameters
We now turn to the computation of the gradient of the individual joint log-likelihoods, \(l_i(\varvec{\eta }_i)\), with respect to the random effect parameters, \(\varvec{\eta }_i\), using the approach of sensitivity equations. Consider the differentiation done in Eq. 12. Given values of \(\varvec{\theta }\) and \(\varvec{\eta }_i\), the quantities \(\varvec{\epsilon }_{ij}\), \({\mathbf{R}}_{ij}\), and \(\varvec{{\Omega }}\) can be obtained by solving the model equations. However, we additionally need to determine \(d \varvec{\epsilon }_{ij} / d \eta _{ik}\) and \(d {\mathbf{R}}_{ij} / d \eta _{ik}\). Expanding the total derivative of these quantities we see that
$$\begin{aligned} \frac{d \varvec{\epsilon }_{ij}}{d \eta _{ik}} = \frac{d \big ({\mathbf{d}}_{ij} - \hat{\mathbf{y}}_{ij}\big )}{d \eta _{ik}} = - \left( \frac{\partial {\mathbf{h}}}{\partial \eta _{ik}} + \frac{\partial {\mathbf{h}}}{\partial {\mathbf{x}}_{ij}} \frac{d {\mathbf{x}}_{ij}}{d \eta _{ik}} \right) , \end{aligned}$$
(19)
and
$$\begin{aligned} \frac{d {\mathbf{R}}_{ij}}{d \eta _{ik}} = \frac{\partial {\mathbf{R}}_{ij}}{\partial \eta _{ik}} + \frac{\partial {\mathbf{R}}_{ij}}{\partial {\mathbf{x}}_{ij}} \frac{d {\mathbf{x}}_{ij}}{d \eta _{ik}}. \end{aligned}$$
(20)
The derivatives of \({\mathbf{h}}\) and \({\mathbf{R}}_{ij}\) are readily obtained since these expressions are given explicitly by the model formulation. In contrast, the derivatives of the state variables, \({\mathbf{x}}_{ij}\), are not directly available but can be computed from the so-called sensitivity equations. The sensitivity equations are a set of differential equations which are derived by differentiating the original system of differential equations (and the corresponding initial conditions) with respect to each random effect parameter \(\eta _{ik}\),
$$\begin{aligned} \begin{aligned} \frac{d}{d t} \left( \frac{d {\mathbf{x}}_{i}}{d \eta _{ik}} \right)&= \frac{\partial {\mathbf{f}}}{\partial \eta _{ik}} + \frac{\partial {\mathbf{f}}}{\partial {\mathbf{x}}_{i}} \left( \frac{d {\mathbf{x}}_{i}}{d \eta _{ik}} \right) \\ \left( \frac{d {\mathbf{x}}_i}{d \eta _{ik}} \right) (t_0)&= \frac{\partial {\mathbf{x}}_{0i}}{\partial \eta _{ik}}. \end{aligned} \end{aligned}$$
(21)
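As a minimal sketch of Eq. 21, consider the hypothetical one-compartment model \(dx/dt = -kx\), \(x(0) = D\), with \(k = \theta e^{\eta }\); the state and its sensitivity \(s = dx/d\eta \) are integrated together as one augmented system. The parameterization and the fixed-step RK4 integrator are illustrative choices only.

```python
import math

# Hypothetical one-compartment model dx/dt = -k*x, x(0) = D, with the
# elimination rate parameterized as k = theta*exp(eta) (an assumed choice).
theta, eta, D = 0.3, 0.2, 100.0
k = theta * math.exp(eta)

def rhs(z):
    # Augmented system of Eq. 21: df/deta = -k*x (since dk/deta = k) and
    # df/dx = -k, so the sensitivity s = dx/deta obeys ds/dt = -k*x - k*s.
    x, s = z
    return (-k * x, -k * x - k * s)

def rk4(z, t_end, n=2000):
    # Fixed-step classical Runge-Kutta integration of the augmented system
    h = t_end / n
    for _ in range(n):
        k1 = rhs(z)
        k2 = rhs(tuple(zi + 0.5 * h * ki for zi, ki in zip(z, k1)))
        k3 = rhs(tuple(zi + 0.5 * h * ki for zi, ki in zip(z, k2)))
        k4 = rhs(tuple(zi + h * ki for zi, ki in zip(z, k3)))
        z = tuple(zi + h / 6.0 * (a + 2 * b + 2 * c + d)
                  for zi, a, b, c, d in zip(z, k1, k2, k3, k4))
    return z

x_T, s_T = rk4((D, 0.0), t_end=5.0)  # x(0) = D, s(0) = 0
```

For this toy model the analytic solution \(x = D e^{-kt}\) gives \(dx/d\eta = -ktx\), which can be used to verify the integrated sensitivity.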
The solution to the sensitivity equations can be used to evaluate the derivatives in Eqs. 19 and 20, which in turn are needed for the gradient of the individual joint log-likelihoods. Importantly, these derivatives are also used for computing the approximate Hessian, Eq. 14, appearing in the approximate population log-likelihood.
In the event that one or more of the random effect parameters only appear in the observational model, all sensitivities of the state variables with respect to those parameters are trivially zero. Note also that the sensitivity equations for all but trivial models involve the original state variables, which means that the original system of differential equations has to be solved simultaneously. Thus, if there are \(q\) non-trivial sensitivities and \(n\) state variables, the total number of differential equations that must be solved to compute \(l_i\) and \(d l_i / d \varvec{\eta }_i\) for each individual is
$$\begin{aligned} n (1 + q). \end{aligned}$$
(22)
Gradient of the approximate population log-likelihood with respect to the fixed effect parameters
We now derive the expression for the gradient of the approximate population log-likelihood, \(\log L_{F}(\varvec{\theta })\), with respect to the parameter vector \(\varvec{\theta }\). Differentiating \(\log L_{F}\) with respect to the \(m\)th element of \(\varvec{\theta }\) gives
$$\begin{aligned} \frac{d \log L_{F}}{d \theta _m} = \sum _{i=1}^N \left( \frac{d l_i(\varvec{\eta }_i^*)}{d \theta _m} -\frac{1}{2} {{\mathrm{tr}}}\left[ {\mathbf{H}}_i^{-1}(\varvec{\eta }_i^*) \frac{d {\mathbf{H}}_i(\varvec{\eta }_i^*)}{d \theta _m} \right] \right) . \end{aligned}$$
(23)
Here it must be emphasized that all derivatives with respect to components of the parameter vector \(\varvec{\theta }\) are taken after replacing \(\varvec{\eta }_i\) with \(\varvec{\eta }_i^*\). This is critical since \(\varvec{\eta }_i^*\) is an implicit function of \(\varvec{\theta }\), \(\varvec{\eta }_i^*=\varvec{\eta }_i^*(\varvec{\theta })\). In other words, we have to account for the fact that the \(\varvec{\eta }_i\) maximizing the individual joint log-likelihood changes as \(\varvec{\theta }\) changes.
To determine the total derivatives with respect to components of the parameter vector \(\varvec{\theta }\) we will be needing the following result. Consider a function \({\mathbf{v}}\) which may depend directly on the parameters \(\varvec{\theta }\) and \(\varvec{\eta }_i\), and on the auxiliary function \({\mathbf{w}}\) representing any indirect dependencies of these parameters,
$$\begin{aligned} {\mathbf{v}} = {\mathbf{v}} \big ({\mathbf{w}}(\varvec{\theta },\varvec{\eta }_i),\varvec{\theta },\varvec{\eta }_i\big ). \end{aligned}$$
(24)
We furthermore introduce the function \({\mathbf{z}}\) to denote the evaluation of \({\mathbf{v}}\) at \(\varvec{\eta }_i=\varvec{\eta }_i^*(\varvec{\theta })\),
$$\begin{aligned} {\mathbf{z}} = {\mathbf{z}} \big ({\mathbf{w}}(\varvec{\theta },\varvec{\eta }_i^*(\varvec{\theta })),\varvec{\theta },\varvec{\eta }_i^*(\varvec{\theta })\big ) = \left. {\mathbf{v}} \right| _{\varvec{\eta }_i=\varvec{\eta }_i^*(\varvec{\theta })}. \end{aligned}$$
(25)
Separating the complete dependence of \({\mathbf{z}}\) on \(\varvec{\theta }\) into partial dependencies we get that
$$\begin{aligned} \begin{aligned}&\frac{d }{d \varvec{\theta }} \left( \left. {\mathbf{v}} \right| _{\varvec{\eta }_i=\varvec{\eta }_i^*(\varvec{\theta })} \right) = \frac{d {\mathbf{z}}}{d \varvec{\theta }}\\&\quad = \frac{\partial {\mathbf{z}}}{\partial {\mathbf{w}}}\frac{d {\mathbf{w}}}{d \varvec{\theta }} + \frac{\partial {\mathbf{z}}}{\partial \varvec{\theta }} + \frac{\partial {\mathbf{z}}}{\partial \varvec{\eta }_i^*}\frac{d \varvec{\eta }_i^*}{d \varvec{\theta }}\\&\quad = \frac{\partial {\mathbf{z}}}{\partial {\mathbf{w}}}\frac{\partial {\mathbf{w}}}{\partial \varvec{\theta }} + \frac{\partial {\mathbf{z}}}{\partial {\mathbf{w}}}\frac{\partial {\mathbf{w}}}{\partial \varvec{\eta }_i^*}\frac{d \varvec{\eta }_i^*}{d \varvec{\theta }} + \frac{\partial {\mathbf{z}}}{\partial \varvec{\theta }} + \frac{\partial {\mathbf{z}}}{\partial \varvec{\eta }_i^*}\frac{d \varvec{\eta }_i^*}{d \varvec{\theta }}\\&\quad = \frac{\partial {\mathbf{z}}}{\partial {\mathbf{w}}}\frac{\partial {\mathbf{w}}}{\partial \varvec{\theta }} + \frac{\partial {\mathbf{z}}}{\partial \varvec{\theta }} + \frac{d {\mathbf{z}}}{d \varvec{\eta }_i^*}\frac{d \varvec{\eta }_i^*}{d \varvec{\theta }}\\&\quad = \frac{\partial }{\partial {\mathbf{w}}} \left( \left. {\mathbf{v}} \right| _{\varvec{\eta }_i=\varvec{\eta }_i^*(\varvec{\theta })} \right) \frac{\partial {\mathbf{w}}}{\partial \varvec{\theta }} +\frac{\partial }{\partial \varvec{\theta }} \left( \left. {\mathbf{v}} \right| _{\varvec{\eta }_i=\varvec{\eta }_i^*(\varvec{\theta })} \right) + \frac{d}{d \varvec{\eta }_i^*} \left( \left. {\mathbf{v}} \right| _{\varvec{\eta }_i=\varvec{\eta }_i^*(\varvec{\theta })} \right) \frac{d \varvec{\eta }_i^*}{d \varvec{\theta }}\\&\quad = \left. \left( \frac{\partial {\mathbf{v}}}{\partial {\mathbf{w}}} \frac{\partial {\mathbf{w}}}{\partial \varvec{\theta }} \right) \right| _{\varvec{\eta }_i=\varvec{\eta }_i^*(\varvec{\theta })} +\left. 
\left( \frac{\partial {\mathbf{v}}}{\partial \varvec{\theta }} \right) \right| _{\varvec{\eta }_i=\varvec{\eta }_i^*(\varvec{\theta })} + \left. \left( \frac{d {\mathbf{v}}}{d \varvec{\eta }_i} \right) \right| _{\varvec{\eta }_i=\varvec{\eta }_i^*(\varvec{\theta })} \frac{d \varvec{\eta }_i^*}{d \varvec{\theta }}\\&\quad = \left. \frac{d {\mathbf{v}}}{d \varvec{\theta }} \right| _{\varvec{\eta }_i=\varvec{\eta }_i^*(\varvec{\theta })} + \left. \frac{d {\mathbf{v}}}{d \varvec{\eta }_i} \right| _{\varvec{\eta }_i=\varvec{\eta }_i^*(\varvec{\theta })} \frac{d \varvec{\eta }_i^*}{d \varvec{\theta }}. \end{aligned} \end{aligned}$$
(26)
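The identity in Eq. 26 can be checked numerically on a scalar toy example with an explicit \(\eta ^*(\theta )\); all functions below are illustrative stand-ins for the quantities in the derivation.

```python
import math

def w(theta, eta):        # auxiliary function carrying the indirect dependence
    return theta**2 * eta

def v(theta, eta):        # v(w(theta, eta), theta, eta) = w + theta*eta
    return w(theta, eta) + theta * eta

def eta_star(theta):      # a toy "inner optimum" with explicit theta-dependence
    return math.sin(theta)

def z(theta):             # z(theta) = v evaluated at eta = eta*(theta)
    return v(theta, eta_star(theta))

h, th = 1e-6, 0.7
lhs = (z(th + h) - z(th - h)) / (2.0 * h)              # d z / d theta
e = eta_star(th)
dv_dtheta = (v(th + h, e) - v(th - h, e)) / (2.0 * h)  # d v/d theta, eta fixed
dv_deta = (v(th, e + h) - v(th, e - h)) / (2.0 * h)    # d v/d eta
deta_dtheta = (eta_star(th + h) - eta_star(th - h)) / (2.0 * h)
rhs = dv_dtheta + dv_deta * deta_dtheta                # right-hand side of Eq. 26
```

Up to finite-difference error, `lhs` and `rhs` agree, illustrating that the total \(\varvec{\theta }\)-derivative after insertion of \(\varvec{\eta }_i^*\) splits into the two terms of Eq. 26.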
Thus, the total derivative with respect to \(\varvec{\theta }\) after insertion of \(\varvec{\eta }_i^*\) is equal to the sum of total derivatives with respect to \(\varvec{\theta }\) and \(\varvec{\eta }_i\) before insertion of \(\varvec{\eta }_i^*\), where the second derivative is multiplied with the sensitivity of the random effect optimum with respect to the parameters \(\varvec{\theta }\). It is straightforward to see that this result holds also when differentiating functions that only exhibit a subset of the possible direct and indirect dependencies of Eq. 24, for instance functions with just an indirect dependence on the two kinds of parameters.
Applying the results from Eq. 26 to the first term within the summation of Eq. 23, we have that
$$\begin{aligned} \frac{d l_i(\varvec{\eta }_i^*)}{d \theta _m} = \left. \frac{d l_i(\varvec{\eta }_i)}{d \theta _m} \right| _{\varvec{\eta }_i=\varvec{\eta }_i^*(\varvec{\theta })} \,+\, \left. \frac{d l_i(\varvec{\eta }_i)}{d \varvec{\eta }_i} \right| _{\varvec{\eta }_i=\varvec{\eta }_i^*(\varvec{\theta })} \frac{d \varvec{\eta }_i^*}{d \theta _m}. \end{aligned}$$
(27)
However, since \(d l_i / d \varvec{\eta }_i\) evaluated at \(\varvec{\eta }_i^*\) is zero by definition, the second term of the right hand side of Eq. 27 disappears and
$$\begin{aligned} \begin{aligned} \frac{d l_i(\varvec{\eta }_i^*)}{d \theta _m} = \left. \frac{d l_i(\varvec{\eta }_i)}{d \theta _m} \right| _{\varvec{\eta }_i=\varvec{\eta }_i^*(\varvec{\theta })}&= \Bigg [ - \frac{1}{2} \sum _{j=1}^{n_{i}} \Bigg ( 2 \varvec{\epsilon }_{ij}^{T} {\mathbf{R}}_{ij}^{-1} \frac{d \varvec{\epsilon }_{ij}}{d \theta _m}\\&\quad - \varvec{\epsilon }_{ij}^{T} {\mathbf{R}}_{ij}^{-1} \frac{d {\mathbf{R}}_{ij}}{d \theta _m} {\mathbf{R}}_{ij}^{-1} \varvec{\epsilon }_{ij} + {{\mathrm{tr}}}\left[ \mathrm {{\mathbf{R}}_{ij}^{-1}} \frac{d {\mathbf{R}}_{ij}}{d \theta _m} \right] \Bigg )\\&\quad + \frac{1}{2} \varvec{\eta }_i^T \, \varvec{{\Omega }}^{-1} \frac{d \varvec{{\Omega }}}{d \theta _m} {\varvec{\Omega }}^{-1} \varvec{\eta }_i - \frac{1}{2} {{\mathrm{tr}}}\left[ \varvec{{\Omega }}^{-1} \frac{d \varvec{{\Omega }}}{d \theta _m} \right] \Bigg ]_{\varvec{\eta }_i=\varvec{\eta }_i^*(\varvec{\theta })}. \end{aligned} \end{aligned}$$
(28)
Using asterisks to denote that \(\varvec{\eta }_i\) has been replaced with \(\varvec{\eta }_i^*\), we also get the following for the derivative of the second term within the summation of Eq. 23,
$$\begin{aligned} \begin{aligned} \frac{d \mathrm {H}_{ikl}(\varvec{\eta }_i^*)}{d \theta _m} = - \frac{1}{2} \sum _{j=1}^{n_{i}} \Bigg (&\frac{d {\mathbf{a}}^{*}_l}{d \theta _m} \, {\mathbf{B}}^{*} \, {\mathbf{a}}^{*T}_k + {\mathbf{a}}^{*}_l \, \frac{d {\mathbf{B}}^{*}}{d \theta _m} \, {\mathbf{a}}^{*T}_k + {\mathbf{a}}^{*}_l \, {\mathbf{B}}^{*} \, \frac{d {\mathbf{a}}^{*T}_k}{d \theta _m}\\&+ {{\mathrm{tr}}}\left[ - \frac{d {\mathbf{c}}^{*}_l}{d \theta _m} \, {\mathbf{c}}^{*}_k - {\mathbf{c}}^{*}_l \, \frac{d {\mathbf{c}}^{*}_k}{d \theta _m} \right] \Bigg ) - \frac{d \mathrm {\Omega }_{kl}^{-1}}{d \theta _m}, \end{aligned} \end{aligned}$$
(29)
where
$$\begin{aligned} \frac{d {\mathbf{a}}^{*}_k}{d \theta _m}= & {} \frac{d}{d \theta _m} \left( \frac{d \varvec{\epsilon }_{ij}^{T}}{d \eta _{ik}} \right) ^* - \frac{d \varvec{\epsilon }_{ij}^{*T}}{d \theta _m} {\mathbf{R}}_{ij}^{*-1} \left( \frac{d {\mathbf{R}}_{ij}}{d \eta _{ik}} \right) ^*\nonumber \\&\quad + \varvec{\epsilon }_{ij}^{*T} {\mathbf{R}}_{ij}^{*-1} \frac{d {\mathbf{R}}^{*}_{ij}}{d \theta _m} {\mathbf{R}}_{ij}^{*-1} \left( \frac{d {\mathbf{R}}_{ij}}{d \eta _{ik}} \right) ^*\nonumber \\&\quad - \varvec{\epsilon }_{ij}^{*T} {\mathbf{R}}_{ij}^{*-1} \frac{d}{d \theta _m} \left( \frac{d {\mathbf{R}}_{ij}}{d \eta _{ik}} \right) ^*,\end{aligned}$$
(30)
$$\begin{aligned} \frac{d {\mathbf{B}}^{*}}{d \theta _m}= & {} - 2 {\mathbf{R}}_{ij}^{*-1} \frac{d {\mathbf{R}}^{*}_{ij}}{d \theta _m} {\mathbf{R}}_{ij}^{*-1}, \end{aligned}$$
(31)
and
$$\begin{aligned} \frac{d {\mathbf{c}}^{*}_k}{d \theta _m} = - {\mathbf{R}}_{ij}^{*-1} \frac{d {\mathbf{R}}^{*}_{ij}}{d \theta _m} {\mathbf{R}}_{ij}^{*-1} \left( \frac{d {\mathbf{R}}_{ij}}{d \eta _{ik}} \right) ^* + {\mathbf{R}}_{ij}^{*-1} \frac{d}{d \theta _m} \left( \frac{d {\mathbf{R}}_{ij}}{d \eta _{ik}} \right) ^*. \end{aligned}$$
(32)
We now continue to expand the terms in Eqs. 28–32 containing derivatives with respect to \(\theta _m\). The terms \(d \varvec{{\Omega }}/d \theta _m\) and \(d \mathrm {\Omega }_{kl}^{-1}/d \theta _m\) are obtainable by straightforward differentiation. Noting that the terms \(\varvec{\epsilon }^*_{ij}\), \((d \varvec{\epsilon }_{ij} / d \eta _{ik})^*\), \({\mathbf{R}}^*_{ij}\), and \((d {\mathbf{R}}_{ij} / d \eta _{ik})^*\) have indirect and/or direct dependence on \(\varvec{\theta }\) and \(\varvec{\eta }_i^*\), we apply the results from Eq. 26 and expand the remaining derivatives. First,
$$\begin{aligned} \frac{d \varvec{\epsilon }^*_{ij}}{d \theta _m} = \left. \frac{d \varvec{\epsilon }_{ij}}{d \theta _m} \right| _{\varvec{\eta }_i =\varvec{\eta }_i^*(\varvec{\theta })} \,+\, \left. \frac{d \varvec{\epsilon }_{ij}}{d \varvec{\eta }_i} \right| _{\varvec{\eta }_i = \varvec{\eta }_i^*(\varvec{\theta })} \frac{d \varvec{\eta }_i^*}{d \theta _m}. \end{aligned}$$
(33)
Here, \(d \varvec{\epsilon }_{ij} / d \varvec{\eta }_i\) was determined previously in Eq. 19, and the derivative in the first term is given by
$$\begin{aligned} \frac{d \varvec{\epsilon }_{ij}}{d \theta _m} = \frac{d ({\mathbf{d}}_{ij} - \hat {\mathbf{y}}_{ij})}{d \theta _m} = - \left( \frac{\partial {\mathbf{h}}}{\partial \theta _m} + \frac{\partial {\mathbf{h}}}{\partial {\mathbf{x}}_{ij}} \frac{d {\mathbf{x}}_{ij}}{d \theta _m} \right) . \end{aligned}$$
(34)
The sensitivity of the random effect optimum with respect to the fixed effect parameters, \(d \varvec{\eta }_i^* / d \varvec{\theta }\), must also be determined, which we will return to later. Then,
$$\begin{aligned} \frac{d {\mathbf{R}}^*_{ij}}{d \theta _m} = \left. \frac{d {\mathbf{R}}_{ij}}{d \theta _m} \right| _{\varvec{\eta }_i=\varvec{\eta }_i^*(\varvec{\theta })} + \left. \frac{d {\mathbf{R}}_{ij}}{d \varvec{\eta }_i} \right| _{\varvec{\eta }_i=\varvec{\eta }_i^*(\varvec{\theta })} \frac{d \varvec{\eta }_i^*}{d \theta _m}, \end{aligned}$$
(35)
where \(d {\mathbf{R}}_{ij} / d \varvec{\eta }_i\) was determined in Eq. 20, and
$$\begin{aligned} \frac{d {\mathbf{R}}_{ij}}{d \theta _m} = \frac{\partial {\mathbf{R}}_{ij}}{\partial \theta _m} + \frac{\partial {\mathbf{R}}_{ij}}{\partial {\mathbf{x}}_{ij}} \frac{d {\mathbf{x}}_{ij}}{d \theta _m}. \end{aligned}$$
(36)
Next,
$$\begin{aligned} \begin{aligned}&\frac{d}{d \theta _m} \left( \left. \frac{d \varvec{\epsilon }_{ij}}{d \eta _{ik}} \right| _{\varvec{\eta }_i=\varvec{\eta }_i^*(\varvec{\theta })} \right) \\&\quad = \left. \left( \frac{d}{d \theta _m} \left( \frac{d \varvec{\epsilon }_{ij}}{d \eta _{ik}} \right) \right) \right| _{\varvec{\eta }_i=\varvec{\eta }_i^*(\varvec{\theta })} + \left. \left( \frac{d}{d \varvec{\eta }} \left( \frac{d \varvec{\epsilon }_{ij}}{d \eta _{ik}} \right) \right) \right| _{\varvec{\eta }_i=\varvec{\eta }_i^*(\varvec{\theta })} \frac{d \varvec{\eta }_i^*}{d \theta _m} \\&\quad = \left. \left( \frac{d}{d \theta _m} \left( \frac{d \varvec{\epsilon }_{ij}}{d \eta _{ik}} \right) \right) \right| _{\varvec{\eta }_i=\varvec{\eta }_i^*(\varvec{\theta })} + \sum _l \left. \left( \frac{d}{d \eta _{il}} \left( \frac{d \varvec{\epsilon }_{ij}}{d \eta _{ik}} \right) \right) \right| _{\varvec{\eta }_i=\varvec{\eta }_i^*(\varvec{\theta })} \frac{d \eta _{il}^*}{d \theta _m} \\&\quad = - \Bigg ( \frac{\partial ^2 {\mathbf{h}}}{\partial \eta _{ik} \partial \theta _m} + \frac{\partial ^2 {\mathbf{h}}}{\partial \eta _{ik} \partial {\mathbf{x}}_{ij}} \frac{d {\mathbf{x}}_{ij}}{d \theta _m} + \left( \frac{\partial ^2 {\mathbf{h}}}{\partial {\mathbf{x}}_{ij} \partial \theta _m} + \frac{\partial ^2 {\mathbf{h}}}{\partial {\mathbf{x}}_{ij}^2} \frac{d {\mathbf{x}}_{ij}}{d \theta _m} \right) \frac{d {\mathbf{x}}_{ij}}{d \eta _{ik}}\\&\qquad \left. 
+ \frac{\partial {\mathbf{h}}}{\partial {\mathbf{x}}_{ij}} \frac{d^2 {\mathbf{x}}_{ij}}{d \eta _{ik} d \theta _m} \Bigg ) \right| _{\varvec{\eta }_i=\varvec{\eta }_i^*(\varvec{\theta })} - \sum _l \Bigg ( \frac{\partial ^2 {\mathbf{h}}}{\partial \eta _{ik} \partial \eta _{il}} + \frac{\partial ^2 {\mathbf{h}}}{\partial \eta _{ik} \partial {\mathbf{x}}_{ij}} \frac{d {\mathbf{x}}_{ij}}{d \eta _{il}}\\&\qquad + \left( \frac{\partial ^2 {\mathbf{h}}}{\partial {\mathbf{x}}_{ij} \partial \eta _{il}} + \frac{\partial ^2 {\mathbf{h}}}{\partial {\mathbf{x}}_{ij}^2} \frac{d {\mathbf{x}}_{ij}}{d \eta _{il}} \right) \frac{d {\mathbf{x}}_{ij}}{d \eta _{ik}} \left. + \frac{\partial {\mathbf{h}}}{\partial {\mathbf{x}}_{ij}} \frac{d^2 {\mathbf{x}}_{ij}}{d \eta _{ik} d \eta _{il}} \Bigg ) \right| _{\varvec{\eta }_i=\varvec{\eta }_i^*(\varvec{\theta })} \frac{d \eta _{il}^*}{d \theta _m}, \end{aligned} \end{aligned}$$
(37)
where we after the third equality have used the results from Eq. 19. The derivative of \((d {\mathbf{R}}_{ij} / d \eta _{ik})^*\) with respect to \(\theta _m\) is expanded in an analogous way and is not shown here.
In the above expressions, derivatives of \({\mathbf{h}}\) and \({\mathbf{R}}_{ij}\) are obtained by direct differentiation. The derivatives of the state variables are determined by the previously derived sensitivity equation in Eq. 21 and by the additional sensitivity equations
$$\begin{aligned} \frac{d}{d t} \left( \frac{d {\mathbf{x}}_{i}}{d \theta _m} \right)= & {} \frac{\partial {\mathbf{f}}}{\partial \theta _m} + \frac{\partial {\mathbf{f}}}{\partial {\mathbf{x}}_{i}} \left( \frac{d {\mathbf{x}}_{i}}{d \theta _m} \right) \nonumber \\ \left( \frac{d {\mathbf{x}}_i}{d \theta _m} \right) (t_0)= & {} \frac{\partial {\mathbf{x}}_{0i}}{\partial \theta _m},\end{aligned}$$
(38)
$$\begin{aligned} \frac{d}{d t} \left( \frac{d^2 {\mathbf{x}}_{i}}{d \eta _{ik} d \theta _m} \right)= & {} \frac{\partial ^2 {\mathbf{f}}}{\partial \eta _{ik} \partial \theta _m} + \frac{\partial ^2 {\mathbf{f}}}{\partial \eta _{ik} \partial {\mathbf{x}}_{i}} \frac{d {\mathbf{x}}_{i}}{d \theta _m}\nonumber \\&\quad + \left( \frac{\partial ^2 {\mathbf{f}}}{\partial {\mathbf{x}}_{i} \partial \theta _m} + \frac{\partial ^2 {\mathbf{f}}}{\partial {\mathbf{x}}_{i}^2} \frac{d {\mathbf{x}}_{i}}{d \theta _m} \right) \left( \frac{d {\mathbf{x}}_{i}}{d \eta _{ik}} \right) + \frac{\partial {\mathbf{f}}}{\partial {\mathbf{x}}_{i}} \left( \frac{d^2 {\mathbf{x}}_{i}}{d \eta _{ik} d \theta _m} \right) \nonumber \\&\quad \left( \frac{d^2 {\mathbf{x}}_i}{d \eta _{ik} d \theta _m} \right) (t_0) = \frac{\partial ^2 {\mathbf{x}}_{0i}}{\partial \eta _{ik} \partial \theta _m}, \end{aligned}$$
(39)
and
$$\begin{aligned} \begin{aligned} \frac{d}{d t} \left( \frac{d^2 {\mathbf{x}}_{i}}{d \eta _{ik} d \eta _{il}} \right)&= \frac{\partial ^2 {\mathbf{f}}}{\partial \eta _{ik} \partial \eta _{il}} + \frac{\partial ^2 {\mathbf{f}}}{\partial \eta _{ik} \partial {\mathbf{x}}_{i}} \frac{d {\mathbf{x}}_{i}}{d \eta _{il}}\\&\quad + \left( \frac{\partial ^2 {\mathbf{f}}}{\partial {\mathbf{x}}_{i} \partial \eta _{il}} + \frac{\partial ^2 {\mathbf{f}}}{\partial {\mathbf{x}}_{i}^2} \frac{d {\mathbf{x}}_{i}}{d \eta _{il}} \right) \left( \frac{d {\mathbf{x}}_{i}}{d \eta _{ik}} \right) + \frac{\partial {\mathbf{f}}}{\partial {\mathbf{x}}_{i}} \left( \frac{d^2 {\mathbf{x}}_{i}}{d \eta _{ik} d \eta _{il}} \right) \\&\qquad \left( \frac{d^2 {\mathbf{x}}_i}{d \eta _{ik} d \eta _{il}} \right) (t_0) = \frac{\partial ^2 {\mathbf{x}}_{0i}}{\partial \eta _{ik} \partial \eta _{il}}. \end{aligned} \end{aligned}$$
(40)
As noted previously, all sensitivity equations must be solved simultaneously with the original differential equations for all but trivial models. However, since one or more parameters in the vector \(\varvec{\theta }\) may not appear in the differential equation part of the model (such as parameters appearing only in \(\varvec{{\Omega }}\)), there may be sensitivities which are trivially zero. If there are \(p\) non-trivial sensitivities among the parameters in \(\varvec{\theta }\), \(q\) non-trivial sensitivities among the parameters in \(\varvec{\eta }\), and \(n\) state variables, the total number of differential equations that must be solved to compute \(\log L_F\) and \(d \log L_F / d \varvec{\theta }\) for each individual is
$$\begin{aligned} n \big (1 + q\big ) \big (1 + p + q/2\big ). \end{aligned}$$
(41)
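The factored count in Eq. 41 agrees with a term-by-term tally of the systems in Eqs. 21 and 38–40: \(n\) original states, \(nq\) sensitivities with respect to \(\varvec{\eta }\), \(np\) with respect to \(\varvec{\theta }\), \(npq\) mixed second order, and \(nq(q+1)/2\) symmetric \(\varvec{\eta }\)-\(\varvec{\eta }\) second order equations. A quick sketch of this bookkeeping:

```python
def ode_count(n, p, q):
    """ODEs per individual, assembled term by term: n original states,
    n*q sensitivities wrt eta, n*p wrt theta, n*p*q mixed second order,
    and n*q*(q+1)/2 symmetric eta-eta second order (Eqs. 21, 38-40)."""
    return n * (1 + q + p + p * q + q * (q + 1) // 2)

# The factored form of Eq. 41, n*(1+q)*(1+p+q/2) = n*(1+q)*(2+2p+q)/2,
# agrees with the term-by-term tally for a range of model sizes:
checks = all(ode_count(n, p, q) == n * (1 + q) * (2 + 2 * p + q) // 2
             for n in range(1, 5) for p in range(0, 4) for q in range(0, 4))
```

The integer division is exact because either \(q\) or \(1+q\) is even, so the factored form always yields a whole number of equations.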
Finally, we need to determine \(d \varvec{\eta }_i^* / d \varvec{\theta }\). At the optimum of each individual joint log-likelihood we have that
$$\begin{aligned} \frac{d l_{i}}{d \varvec{\eta }_i}=\varvec{0}, \end{aligned}$$
(42)
or put differently,
$$\begin{aligned} \left. \frac{d l_{i}}{d \varvec{\eta }_i} \right| _{\varvec{\eta }_i=\varvec{\eta }_i^*(\varvec{\theta })} = \varvec{0}. \end{aligned}$$
(43)
This equality holds for any \(\varvec{\theta }\), and thus
$$\begin{aligned} \frac{d}{d \varvec{\theta }} \left( \left. \frac{d l_{i}}{d \varvec{\eta }_i} \right| _{\varvec{\eta }_i=\varvec{\eta }_i^*(\varvec{\theta })} \right) = \varvec{0}. \end{aligned}$$
(44)
Recognizing that \(d l_{i}/d \varvec{\eta }_i\) fulfills the requirements of applying the results from Eq. 26, we can write this as
$$\begin{aligned} \frac{d}{d \varvec{\theta }} \left( \left. \frac{d l_{i}}{d \varvec{\eta }_i} \right| _{\varvec{\eta }_i=\varvec{\eta }_i^*(\varvec{\theta })} \right) = \left. \frac{d^2 l_{i}}{d \varvec{\eta }_i d \varvec{\theta }} \right| _{\varvec{\eta }_i=\varvec{\eta }_i^*(\varvec{\theta })} + \left. \frac{d^2 l_{i}}{d \varvec{\eta }_i^2} \right| _{\varvec{\eta }_i=\varvec{\eta }_i^*(\varvec{\theta })} \frac{d \varvec{\eta }_i^*}{d \varvec{\theta }} = \varvec{0}. \end{aligned}$$
(45)
By rearranging terms and inverting the matrix, we finally get that
$$\begin{aligned} \frac{d \varvec{\eta }_i^*}{d \varvec{\theta }} = - \left( \left. \frac{d^2 l_{i}}{d \varvec{\eta }_i^2} \right| _{\varvec{\eta }_i=\varvec{\eta }_i^*(\varvec{\theta })} \right) ^{-1} \left. \frac{d^2 l_{i}}{d \varvec{\eta }_i d \varvec{\theta }} \right| _{\varvec{\eta }_i=\varvec{\eta }_i^*(\varvec{\theta })}. \end{aligned}$$
(46)
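Equation 46 is an instance of the implicit function theorem, and its correctness is easily verified numerically on a toy problem where \(\varvec{\eta }^*(\varvec{\theta })\) has a closed form. The following sketch (a hypothetical quadratic stand-in for \(l_i\), not the actual model likelihood) compares the exact sensitivity of the inner optimum with the one given by Eq. 46:

```python
import numpy as np

rng = np.random.default_rng(0)
q, p = 2, 3                                       # dims of eta and theta
A = rng.standard_normal((q, q))
A = A @ A.T + q * np.eye(q)                       # SPD "Hessian" block
B = rng.standard_normal((p, q))

# Toy joint log-likelihood: l(eta, theta) = -0.5 eta' A eta + theta' B eta
# Inner optimality: dl/deta = -A eta + B' theta = 0  =>  eta*(theta) = A^{-1} B' theta
deta_dtheta_exact = np.linalg.solve(A, B.T)       # shape (q, p)

# Eq. 46: deta*/dtheta = -(d2l/deta2)^{-1} (d2l/deta dtheta)
d2l_deta2 = -A                                    # Hessian in eta
d2l_deta_dtheta = B.T                             # mixed second derivative
deta_dtheta_ift = -np.linalg.solve(d2l_deta2, d2l_deta_dtheta)

assert np.allclose(deta_dtheta_exact, deta_dtheta_ift)
```

In practice one would of course not invert the Hessian explicitly but solve the corresponding linear system, as done above.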
The second order derivatives of the individual joint log-likelihoods with respect to the random effect parameters were previously derived in Eq. 13. In contrast to the first order approximation of the Hessian used in the approximate population log-likelihood, the second order derivatives of \(\varvec{\epsilon }_{ij}\) and \({\mathbf{R}}_{ij}\) are kept. These are obtained by differentiating Eqs. 19 and 20 once more with respect to \(\varvec{\eta }_i\) (not shown). This in turn requires the second order sensitivity equations of the state variables with respect to \(\varvec{\eta }_i\), which were previously provided in Eq. 40. In addition to second order derivatives of the individual joint log-likelihoods with respect to the random effect parameters, Eq. 46 also requires the second order mixed derivatives, which are given by
$$\begin{aligned} \begin{aligned} \frac{d^2 l_{i}}{d \eta _{ik} d \theta _{m}} =&- \frac{1}{2} \sum _{j=1}^{n_{i}} \Bigg ( 2 \frac{d \varvec{\epsilon }_{ij}^T}{d \theta _{m}} {\mathbf{R}}_{ij}^{-1} \frac{d \varvec{\epsilon }_{ij}}{d \eta _{ik}} - 2 \varvec{\epsilon }_{ij}^T {\mathbf{R}}_{ij}^{-1} \frac{d {\mathbf{R}}_{ij}}{d \theta _{m}} {\mathbf{R}}_{ij}^{-1} \frac{d \varvec{\epsilon }_{ij}}{d \eta _{ik}}\\&\quad + 2 \varvec{\epsilon }_{ij}^T {\mathbf{R}}_{ij}^{-1} \frac{d^2 \varvec{\epsilon }_{ij}}{d \eta _{ik} d \theta _{m}} - \varvec{\epsilon }_{ij}^T {\mathbf{R}}_{ij}^{-1} \frac{d^2 {\mathbf{R}}_{ij}}{d \eta _{ik} d \theta _{m}} {\mathbf{R}}_{ij}^{-1} \varvec{\epsilon }_{ij}\\&\quad + 2 \varvec{\epsilon }_{ij}^T {\mathbf{R}}_{ij}^{-1} \frac{d {\mathbf{R}}_{ij}}{d \eta _{ik}} {\mathbf{R}}_{ij}^{-1} \frac{d {\mathbf{R}}_{ij}}{d \theta _{m}} {\mathbf{R}}_{ij}^{-1} \varvec{\epsilon }_{ij} - 2 \varvec{\epsilon }_{ij}^T {\mathbf{R}}_{ij}^{-1} \frac{d {\mathbf{R}}_{ij}}{d \eta _{ik}} {\mathbf{R}}_{ij}^{-1} \frac{d \varvec{\epsilon }_{ij}}{d \theta _{m}}\\&\quad + {{\mathrm{tr}}}\left[ {\mathbf{R}}_{ij}^{-1} \frac{d {\mathbf{R}}_{ij}}{d \theta _{m}} {\mathbf{R}}_{ij}^{-1} \frac{d {\mathbf{R}}_{ij}}{d \eta _{ik}} + {\mathbf{R}}_{ij}^{-1} \frac{d^2 {\mathbf{R}}_{ij}}{d \eta _{ik} d \theta _{m}} \right] \Bigg )\\&- \varvec{\eta }_i^T \varvec{{\Omega }}^{-1} \frac{d \varvec{{\Omega }}}{d \theta _{m}} \varvec{{\Omega }}^{-1} \frac{d \varvec{\eta }_i}{d \eta _{ik}}. \end{aligned} \end{aligned}$$
(47)
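For readers implementing Eq. 47, a direct NumPy transcription of the per-observation summand may be useful. The function name and argument layout below are our own, all derivative arrays are assumed precomputed, and the final prior term involving \(\varvec{{\Omega }}\) lies outside the sum over observations and is therefore omitted here:

```python
import numpy as np

def mixed_second_derivative_term(eps, deps_dth, deps_deta, d2eps,
                                 R, dR_dth, dR_deta, d2R):
    """Contribution of one observation (i, j) to the sum in Eq. 47
    for a single pair (eta_ik, theta_m). Inputs: residual vector eps,
    residual covariance R, and their first and mixed second
    derivatives with respect to theta_m and eta_ik."""
    Ri = np.linalg.inv(R)
    t = (2 * deps_dth @ Ri @ deps_deta
         - 2 * eps @ Ri @ dR_dth @ Ri @ deps_deta
         + 2 * eps @ Ri @ d2eps
         - eps @ Ri @ d2R @ Ri @ eps
         + 2 * eps @ Ri @ dR_deta @ Ri @ dR_dth @ Ri @ eps
         - 2 * eps @ Ri @ dR_deta @ Ri @ deps_dth
         + np.trace(Ri @ dR_dth @ Ri @ dR_deta + Ri @ d2R))
    return -0.5 * t
```

For production use, the repeated products with \({\mathbf{R}}_{ij}^{-1}\) should be cached rather than recomputed, and the explicit inverse replaced by Cholesky-based solves.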
Here, all terms have previously been introduced except \(d^2 \varvec{\epsilon }_{ij} / d \eta _{ik} d \theta _{m}\) and \(d^2 {\mathbf{R}}_{ij} / d \eta _{ik} d \theta _{m}\), which are provided within the derivation of Eq. 37 and through a corresponding derivation involving \({\mathbf{R}}_{ij}\).
Better starting values for optimization of random effect parameters
Computing the approximate population log-likelihood and its gradient with respect to the parameters \(\varvec{\theta }\) requires the determination of \(\varvec{\eta }_i^*\) for every individual. The first time \(\log L_F\) and its gradient are evaluated, it is reasonable to initiate the inner-level optimizations for \(\varvec{\eta }_i^*\) with \(\varvec{\eta }_i=\varvec{0}\). In the subsequent steps of the optimization with respect to \(\varvec{\theta }\), however, better starting values for \(\varvec{\eta }_i\) can be provided. One way of choosing the starting values \(\varvec{\eta }_i^0\) for the optimization of \(\varvec{\eta }_i\) is to set them equal to the optimized values from the last step of the outer optimization. If we, for simplicity of notation, from now on suppress the index \(i\) of \(\varvec{\eta }_i\) denoting the individual, and instead let the index \(s\) denote the step of the outer optimization with respect to \(\varvec{\theta }\), this can be expressed as \(\varvec{\eta }^0_{s+1}=\varvec{\eta }^*_s\). This will be particularly helpful as the optimization converges and the steps in \(\varvec{\theta }\) become smaller. Using \(\varvec{\eta }^*\) from the evaluation of \(\log L_F\) as a starting value is also a good strategy when computing the gradient of \(\log L_F\) by a finite difference approximation.
If the sensitivity approach is used for computing the gradient of \(\log L_F\), even better starting values of \(\varvec{\eta }\) can be provided. This is accomplished by exploiting the fact that the sensitivity \(d \varvec{\eta }^* / d \varvec{\theta }\) is already part of the gradient calculation and is therefore available at no extra cost. By making a first order Taylor expansion of the implicit function \(\varvec{\eta }^*(\varvec{\theta })\), we propose the following update of the starting values of the random effect parameters
$$\begin{aligned} \varvec{\eta }^0_{s+1}=\varvec{\eta }^*_s + \frac{d \varvec{\eta }^*_s}{d \varvec{\theta }} (\varvec{\theta }_{s+1}-\varvec{\theta }_{s}). \end{aligned}$$
(48)
The two approaches for choosing \(\varvec{\eta }^0_{s+1}\) are illustrated in Fig. 1.
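The two warm-starting rules can be sketched as follows (a minimal illustration with hypothetical function names; the first rule reuses the previous inner optimum, the second applies the Taylor update of Eq. 48):

```python
import numpy as np

def warm_start_zero_order(eta_star_s):
    """eta^0_{s+1} = eta*_s : reuse the inner optimum from the
    previous outer step unchanged."""
    return eta_star_s

def warm_start_first_order(eta_star_s, deta_dtheta_s, theta_s, theta_next):
    """Eq. 48: first order Taylor expansion of the implicit function
    eta*(theta) around theta_s, using the sensitivity deta*/dtheta
    that is already available from the gradient computation."""
    return eta_star_s + deta_dtheta_s @ (theta_next - theta_s)
```

The first order rule reduces to the zero order one when the outer step \(\varvec{\theta }_{s+1}-\varvec{\theta }_{s}\) vanishes, so it can only improve the starting point near convergence.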