Exact Gradients Improve Parameter Estimation in Nonlinear Mixed Effects Models with Stochastic Dynamics

Nonlinear mixed effects (NLME) modeling based on stochastic differential equations (SDEs) have evolved into a promising approach for analysis of PK/PD data. SDE-NLME models go beyond the realm of standard population modeling as they consider stochastic dynamics, thereby introducing a probabilistic perspective on the state variables. This article presents a summary of the main contributions to SDE-NLME models found in the literature. The aims of this work were to develop an exact gradient version of the first-order conditional estimation (FOCE) method for SDE-NLME models and to investigate whether it enabled faster estimation and better gradient precision/accuracy compared to the use of gradients approximated by finite differences. A simulation-estimation study was set up whereby finite difference approximations of the gradients of each level were interchanged with the exact gradients. Following previous work, the uncertainty of the state variables was accounted for using the extended Kalman filter (EKF). The exact gradient FOCE method was implemented in Mathematica 11 and evaluated on SDE versions of three common PK/PD models. When finite difference gradients were replaced by exact gradients at both FOCE levels, relative runtimes improved between 6- and 32-fold, depending on model complexity. Additionally, gradient precision/accuracy was significantly better in the exact gradient case. We conclude that parameter estimation using FOCE with exact gradients can successfully be applied to SDE-NLME models.


INTRODUCTION
Nonlinear mixed effects (NLME) models are used to describe populations of individuals that behave qualitatively similar, but where every individual has its own quantitative characteristics. These models have been highly applicable in pharmacokinetics (PK) and pharmacodynamics (PD) (1)(2)(3)(4). Physiological systems are often modeled with a continuoustime deterministic model and noisy observations that are discrete in time. However, since the mechanisms of the physiological systems of interest are often not completely understood, a natural extension is to account for the uncertainty in the dynamics. This is the motivation behind extending the NLME models with yet a stochastic part (5).
Stochastic differential equations (SDEs) greatly extend the descriptive power of ordinary differential equations (ODEs) that are usually used to encode dynamical system models in pharmacometrics. An ODE model can be turned into an SDE model by the addition of possibly non-linearly scaled noise terms. These additional terms can be thought of as a type of slack-variable introduced to pick-up potential discrepancies between the mechanistically modeled dynamics and unknown but indirectly observed dynamical effects. In other words, these noise terms in the dynamical system equations represent the lumped effect of all not explicitly mechanistically modeled effects in the system. This includes turning model parameters into state variables with random dynamics, which for instance could be used to model inter occasion variability (6) or time-dependent changes in parameters where the trend is unknown a priori, and for the estimation of an unknown input signal (7).
The application of SDEs to model uncertainty in dynamics has long been used in other fields, such as finance and control theory, and various parameter estimation methods have been developed (8). However, the sparsity and irregular sampling of PK/PD data has made it difficult to directly apply these parameter estimation methods. Nevertheless, some attempts to develop methods for parameter estimation in NLME models with stochastic dynamics have been successful, using for example (i) Bayesian inference (9,10), (ii) expectation maximization (EM) methods (11)(12)(13), and (iii) by expanding the traditional gradient-based estimation methods using Kalman filters (14,15). These methods have been used for several PK/PD applications (16)(17)(18)(19). This paper focuses on gradient-based methods. A general overview of some of the related method and application papers is shown in Table I. ODE models are a subset of SDE models because SDEs reduce to ODEs when the system noise (the Gaussian process covariance) is decreased towards zero. Ideally, the dynamical system under study is well characterized and the mechanistic terms present in the dynamical system model equations give an accurate description of the model's dynamical behavior. Then ODEs are a suitable model description. To account for incorrect model definitions, Kristensen et al. (20) introduced a method for evaluating the choice of a deterministic population model described with ODEs and iteratively improving it by using an SDE model to pinpoint where in the dynamical model equations misspecification may be present. A similar idea had previously been used for gray box models (21).
Instead of opting to find a deterministic model, Tornøe et al. (14) and Overgaard et al. (15) captured the additional uncertainty due to model misspecification, oversimplifications, and approximations by using SDEs in NLME models. The well-known parameter estimation method first-order conditional estimation (FOCE) (22) was expanded to account for the new mathematical framework. This involved state variable estimation using extended Kalman filters (EKFs) (23). The algorithm was initially implemented in NONMEM (14) and later implemented in both MATLAB (24) and R (25).
Optimizing the FOCE approximation of the log likelihood using gradient-based methods, such as the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm (26), requires sensitivities of the state variables of the underlying model. However, NLME models defined by differential equations often lack an analytic solution. Hence, the traditional way to compute the gradient is to use a finite difference approximation. An alternative approach is to obtain the gradients Table I. Timeline showing the main contributions to nonlinear mixed effects models with stochastic differential equations and to the S-FOCE method utilizing the sensitivity equations for the original set of differential equations. The algorithm originally proposed by Almquist et al. (27) is an extension of the FOCE algorithm that uses these sensitivity-based gradients for population NLME models based on ODEs. This algorithm is here referred to as S-FOCE. The sensitivity approach was also applied by Leander et al. (28) for single individual models using SDEs. For the single individual, the likelihood was approximated using the EKF, which in turn requires a firstorder differentiation of the EKF. These ideas have been combined for parameter estimation of NLME models using SDEs (29). In that study, the exact gradient approach was used instead of the traditional finite difference approach when using the FOCE-EKF approach by using the mixed symbolic and numeric algebra capabilities in Mathematica (30). However, the details of the algorithm have until now not been presented. This paper extends the S-FOCE algorithm to a general NLME model based on SDEs using the EKF approach. This involves both the firstand the second-order sensitivities of the EKF. Earlier stage of this work has been presented by (31). The method is evaluated on synthetically generated datasets from common PK/PD models with uncertainty in the dynamics, in addition to the traditional observation uncertainty and interindividual variability. Relative computational time of the sensitivity-based gradients is compared to the finite difference approach. Moreover, the precision and the accuracy of both gradient approaches are analyzed.
The paper is structured as follows. First, the necessary theory for the proposed extension is introduced, including the approximate population likelihood and the EKF, followed by the proposed algorithm. Second, the methods for evaluating and comparing of the algorithm are described. Third, the results of the comparison are presented, showing speedup in estimation time as well as increased precision and accuracy of gradients for the proposed algorithm. Finally, the results are discussed along with an outlook towards implementation of the algorithm in an existing parameter estimation framework.

THEORETICAL BACKGROUND
This section introduces the theory needed for extending the S-FOCE method to NLME models with stochastic dynamics (SDE-NLME).
Nonlinear Mixed Effects Models Based on Stochastic Differential Equations NLME models can be defined using SDEs to describe the dynamics of the continuous-time state variables x. The continuous-time, stochastic dynamics are modeled as with discrete observations modeled using an observation function h and Gaussian noise e i j ∼N 0; Σ x i j ; u i ; t i j ; θ; η i À Á À Á , The function f(x i , u i , t, θ, η i ), describing the deterministic part of the dynamics, is called the drift function and the part introducing randomness to the state variables, G(x i , u i , t, θ, η i )dW i , is called the system noise. Here, W i is defined as a Wiener process, with dW i ∼ N(0, dtI). The indices i and j correspond to individual i and its j-th observation at time t i j for each individual. Here, θ, η i , and u i are, respectively, the fixed effects, random effects, and known inputs, such as covariates including dosage. The random effects η i are chosen to be normally distributed with mean zero and covariance Ω(θ). This way of formulating the NLME model gives three sources of variability in the response (29), namely (i) observation noise, e ij ; (ii) system noise, G(x i , u i , t, θ, η i )dW i ; and (iii) parameter variability, η i .

The Approximate Population Likelihood
Since x i are stochastic processes, the distribution of y ij changes if conditioned on previous observations. Let Υ i(j − 1) = {υ i1 , υ i2 , …, υ i(j − 1) } denote those observations. The residuals ε ij are defined as with the expected observation value y b i j and covariance R ij conditioned on Υ i(j − 1) and θ defined as As shown in, e.g., (14,29), the combined likelihood L for all individuals simplifies to and where the indirect dependence of θ and η are suppressed for simplicity. The FOCE approximation of the population likelihood becomes Here, η Ã i maximizes the individual likelihood l i for a given θ, and H i is a first-order approximation of the Hessian Δl i . Further details on this approximation can be found in, e.g., (27).

The Extended Kalman Filter
The SDEs introduce uncertainty to the time evolution of the state variables of the system. The EKF can be used to estimate the state variables, as suggested in (14,15,29). The continuous-discrete EKF is a state-space estimator for the continuous-discrete state-space models of the form introduced in Eq. (1), with the exception that the state variables should be independent from both the observation noise and the system noise (32). The EKF estimates the conditional expectation of the state Here, ϕ i = (θ, η i , u i ) is a vector containing all parameters corresponding to individual i.
In the rest of this section, the notation is simplified by omitting the individual index i. The drift function f and the observation function h from Eqs. (2) and (1) are linearized by introducing The EKF has two main steps, a time-update step and a measurement-update step (33). In the time-update step, the state variables and covariance are predicted using all previous observations. This is done by solving the differential equations The above prediction gives In the measurement-update step, the prediction is used to compute the Kalman gain which is used to update the state and its covariance from the current observation The filter is initialized with

PROPOSED ALGORITHM
This section presents the extension of the S-FOCE method, originally proposed by Almquist et al. (27) for parameter estimation of general NLME models based on ODEs, to the use of SDEs. The method utilizes sensitivity equations to calculate exact gradients for a gradient-based optimization of the model parameters. The sensitivities are the derivatives of a function with regard to its parameters. The sensitivities needed for the gradient-based optimization in addition to the sensitivities in the existing S-FOCE algorithm are presented. A detailed derivation of terms needed for the sensitivity equations can be found in the Appendix.
The fixed effects parameters θ can be estimated by maximizing the approximate population likelihood (APL) in Eq. (7) using gradient-based methods. This means a need for computing (27) Unlike the ODE case, the APL is formulated in terms of conditional expected values of state variables and their covariances given by the EKF. Hence, sensitivities of entities computed by the EKF equations should be computed. These are the first-order sensitivities and the second-order sensitivities In addition, the sensitivities of the state variable covariance P are needed. The derivations of the results can be found in the Appendix. The resulting SDE-EKF extension of S-FOCE is outlined in Algorithm 1.

Initializing the EKF
The use of the EKF requires initial values of the state variable x 0 and state variable covariance P 0 . The state variable is set to the initial value of the system equations.
To estimate the covariance of the state, different methods can be used. Setting P 0 = I is a common but arbitrary way of initializing the EKF. Tornøe et al. (14) set P 0 to the integral of the Wiener processes and system dynamics between the first two observations to get a representative value. However, in the case of PK/PD models, one often starts in a steady state and can solve the initial covariance analytically from Eq. (9).

Models and Data
To evaluate the proposed method, a simulationestimation study was set up. Data was simulated from three common PK and PK/PD models: a two-compartmental PK model (M1), a PK/PD model consisting of a twocompartmental PK together with a direct PD response model (M2), and a PK/PD model consisting of a two-compartmental PK with an indirect PD response model (M3). All model dynamics are described using SDEs, with a one-dimensional Wiener process describing uncertainty of the absorption reaction. The number of state variables and parameters is shown in Table II. Each experimental setup included 50 individuals divided into five dose groups. Representative values were chosen for the simulation in compatible units. The sampling times are chosen to take the characteristics of the PK and PD responses into account. A full description of the model and experimental design can be found in the Appendix.

Comparison
Algorithm 1 consists of two levels of optimization, an inner level to find the empirical Bayes estimates η Ã i given θ, Algorithm 1 Extended S-FOCE parameter estimation algorithm for NLME models based on SDEs and an outer level to find the optimal fixed effects θ * . Both levels of optimization use gradient-based methods. Three different ways of computing gradients are considered. One uses sensitivities in both levels of optimization (S − S), another uses sensitivities in the inner level and finite differences in the outer level (S − F), and the last uses finite differences in both levels of optimization (F − F). Furthermore, the finite differences are either computed according to forward differences or central differences where q controls the relative step length and f denotes the function to be differentiated. The proposed algorithm is evaluated in terms of runtime and in terms of precision and accuracy of the gradient. The initial values of the optimization are chosen within 15% from the true value, drawn from a uniform distribution. The algorithm is implemented and run using Mathematica 11.0 (30). Calculations are performed on a workstation with a 4.00-GHz Intel Core i7-6700K CPU and 32.0 GB of RAM. A running version is available upon request.

Timing Comparison
The timing comparison is done in a similar way as the previously performed time comparison for S-FOCE with ODEs (27). For a given model and a relevant simulated dataset, the total evaluation time is calculated using each of the above approaches to obtain the gradient. To isolate the effect of computing the different gradients, they were calculated for the same set of points. The points were chosen as the points from the gradient-based optimization using the S − S gradient approach. This ensures that equally many evaluations were performed for each gradient. For the finite differences, the step length is kept constant, by letting q = 4, corresponding to relative step length of 0.0001.

Precision Comparison
Precision is compared only for finite differences in the outer level of optimization. For the inner level, both methods use sensitivity-based gradients. The gradients are computed at the found optimum, θ * , using the S − F and S − S gradients. The step length varies, with q ranging from 3.2 to 6.4. To capture the effect of new realizations on the numerical precision in log L F (θ), the gradient is computed 300 times for each q using different randomized starting values of η for the inner level of optimization. The starting values were drawn from a uniform distribution between −0.3 and 0.3.

Faster Estimation
The proposed algorithm was used to estimate the parameter values for models M1-M3. The resulting parameter estimates using the S1-S gradients along with the relative standard errors can be found in Table III. For each of the finite difference methods and models (M1-M3), the runtime of the parameter estimation was measured for all three previously mentioned optimization schemes. The speedup in each case is the relative gain when going from an F − F scheme to either of the S − F or S − S schemes. This is shown in Fig. 1. As the models become more complex in terms of number of fixed effects and number of state variables, the relative benefit of using sensitivities instead of finite differences increases. This holds both for forward differences and central differences. However, for the central differences, the advantage is more significant, since the central difference gradient requires approximately double the amount of calls to the likelihood function, compared to forward differences.

Improved Precision and Accuracy
The gradient error was computed as the difference between the obtained value and the average value of the values obtained using the sensitivity-based gradient method. The exact method does not utilize a finite difference and is independent of q.
As shown in Fig. 2, the sensitivity-based gradient is more precise than either of the other two methods. For large step size (small q), the gradients computed using either forward or central differences are biased but relatively precise. However, as shown in Fig. 3, the precision of the finite difference methods is not better than the precision of the sensitivitybased method. For small step size (large q), the accuracy increases but both methods suffer from a significant loss of precision on a relative scale compared to the sensitivity-based method. Figure 3 shows the standard deviation of the gradient error as a function of the step size. The linear dependency of the standard deviation on q can be reasoned directly from Eqs. (17) and (18), by assuming a numerical error term in relation to each calculation of the function value that is independent of the step size. The forward finite difference gradient Δf/Δθ k can be written as where ϵ 1 , ϵ 2 are independent, identically distributed, numerical error terms with mean zero and variance σ 2 . The total variance of the forward finite difference gradient due to measurement error is therefore Taking the logarithm of this gradient's standard deviation yields log 10 ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi which is linear in q with unit slope. Similar results can be derived for central difference gradients.

DISCUSSION
Misspecification of the dynamics in population modeling can be identified and quantified through an SDE-NLME approach. This is accomplished by estimating the magnitude of the system noise terms of the SDEs. These terms serve as a source of uncertainty that affects the time evolution of the model state variables, and they are used in parallel with the classical stochastic model components for the residual error and the between-subject variability.
In this work, a novel method for parameter estimation in SDE-NLME models based on exact gradients is presented. It is evaluated by performing parameter estimation for PK/PD models of different complexity, and comparing the computational time, precision, and accuracy to a standard approach based on a finite difference gradient. The method performs   well and shows potential for fast and robust analysis of PK/ PD data. Parameter estimation for NLME PK/PD models can be a time-consuming task, especially when the model of the physiological system is defined by ODEs. The computational burden increases even more if the mathematical framework is expanded to an SDE-NLME setting. By computing the exact gradient of the FOCE likelihood, the need for repeatedly having to solve the inner optimization problem for each perturbation of the fixed effect parameters is avoided, significantly decreasing computational time complexity. As seen in Fig. 1, larger and more complex models of this study benefit more from exact gradients. This suggests that the speedup provided by the exact gradient computations in general will be largest for complex models where this timesaving matters the most. Similar to the S-FOCE method proposed by Almquist et al. (27), the algorithm presented in this paper can easily be parallelized over individuals, further increasing its feasibility.
The exact gradients are computed from numerical solutions of ODEs and may therefore contain numerical errors. However, they are still considered exact since the numerical integration does not introduce any bias, and since the numerical error can be made arbitrarily small by controlling the tolerance of the integrator. The quality of the gradients computed with finite differences was evaluated by comparison to the mean of the exactly computed gradient (Fig. 2). For large finite difference step sizes, the relative precision is high but the gradient is biased. Although the accuracy of the finite difference gradient increased with smaller step size, the sample standard deviation clearly shows a simultaneous loss of precision. This demonstrates a fundamental problem with the finite difference method: it is hard to obtain both good precision and accuracy at the same time. Central differences may improve issues with bias, but still suffer from precision issues to a similar extent as forward differences. Exact gradients, on the other hand, elegantly circumvent the step length conundrum inherent to the finite difference approximation. They are unbiased per construction and the absolute precision is always comparatively high. Although exact gradients outperform the finite difference gradients when compared as such, it is unknown whether exact gradients improve robustness in the parameter estimation step, or in the covariance step, which ultimately is what matters. It may appear plausible that this indeed is the case, but future research is required to prove it.
The EKF has been the standard way of handing stochastic dynamics in NLME models. To fulfill the prerequisites of the EKF, the model state variables must be independent of both the observation noise and the system noise (32). However, in PK/PD models, it is often more realistic to assume proportional noise (1,34). A pragmatic way around this issue is to replace the state variables with their conditional expectation in the model definition. For some models, one could also consider the Lamperti transformation (35) for eliminating the state variable dependency of the system noise. Another filter that might also be applicable is the unscented Kalman filter (UKF), introduced by (36), which instead of linearization of the nonlinear system, uses a so-called unscented transformation. The UKF has been shown to perform very well for a variety of nonlinear models (37), and should be considered as an alternative for state variable estimation in SDE-NLME models.
The EKF was initialized by assuming steady state and solving Eq. (9). In the models chosen for the simulationestimation study, the steady-state approach is trivial, with P 0 = 0. This way of initializing the EKF uses information about the particular model and is superior to using simulations of the initial behavior (14) or simply choosing an arbitrary initialization.
To gain impact in the pharmacometric community, two important steps remain. First, additional applications of SDE-NLME models are needed to further demonstrate the benefits of PK/PD models with uncertain dynamics. Some applications have already been mentioned in this work (6,16,17,19). Second, for novel methods to be widely used, they must be implemented in software with large user-bases, such as NONMEM, Phoenix, and Monolix (22,38,39), or open-source initiatives such as nlmixr (40). Tornøe et al. (14) have previously reported on the implementation of SDE-NLME in the industry-standard software NONMEM. The prediction differential equations Eq. (9) were then introduced as a system of ODEs and event tags were introduced to account for the time-update and measurementupdate steps. In version 7.4, NONMEM introduced a FAST option to perform the optimization using the exact gradient approach introduced for ODE models by Almquist et al. (27). To our understanding, the combination of the recent implementation of the FAST method in NONMEM and the filter implementation of Tornøe et al. (14) should be enough for provide this functionality in NONMEM. However, it requires the user to explicitly state the update equations of the Kalman equations. It is only through such wide availability that SDE-NLME methods will get acceptance and popularity among PK/PD modelers.

CONCLUSION
A feasible method for parameter estimation of SDE-NLME models that extends S-FOCE has been proposed. It provides shorter computational time compared to previously used finite difference gradient-based methods. The gradients computed during optimization of the APL are both more precise and more accurate than those computed using finite differences. Fig. 3. Sample standard deviation of the gradient using the exact method (blue), forward differences (red), or central differences (yellow), plotted as a function of the finite difference step size, quantified as 10 −q ACKNOWLEDGEMENTS This project has been supported by the Swedish Foundation for Strategic Research, which is gratefully acknowledged.

Derivation of EKF Sensitivities
This section derives the equations needed for performing both the firstand second-order sensitivities calculation and thereby obtaining the gradients needed for extension of the S-FOCE method to the case of SDEs.

First-Order Sensitivities
The sensitivities dε ij /dη ik and dR ij /dη ik need to be determined. Using the chain rule gives and The derivatives ∂h ∂η ik ; and can be found directly from the definition of h. However, the derivatives remain to be found. In further calculations, the individual index i will be suppressed.
Differentiating Predicted Expected State Variables. Start by . This derivative can be obtained by solving the sensitivity equation and thus, Differentiating Predicted State Variable Covariance. Now consider the other two remaining derivatives, The equation gives From the definition of Σ j and h, ∂Σ j /∂η k and ∂C j /∂η k can be directly obtained. The remaining derivative is ∂P j | j − 1 /∂η k . By definition, directly follows. For positive integers j, the following sensitivity equation for P j + 1 | j d dt where In the same way as for Differentiating the Kalman Gain. In the above cases, the derivative and partial derivative of the Kalman gain K j are required. This can be done by first noting that and then calculating the partial derivative of K j with respect to η k as follows: In the same way, it follows that Moreover, the derivative of K j with respect to θ is calculated in the same way as the derivative with respect to η.

Second-Order Sensitivities
This section derives the equations needed to compute the derivatives The chain rule yields Note that in the above calculations, θ n can be replaced by η l in order to obtain the derivatives Second Derivative of Predicted Expected State. d dt where Second Derivative of Predicted State Covariance.
with d dt The derivatives , and are calculated in the same way. In the special case of , the equations can be simplified to Second Derivative of the Kalman Gain. In both cases above, the second partial derivative of the Kalman gain K j is required. This can be done as follows: The derivatives , and are calculated in the same way. In the special case of , the derivative can be simplified to

MODEL DEFINITIONS
In all following models, A 1 , A 2 , and A 3 represent the absorption, central, and peripheral compartments of the PK part of the model and R represents the PD response. The measured concentration in the central compartment is denoted by The two-compartment PK model is described with the equations In model M1, the initial values are Input(t) = 0, and the output is considered to be with The parameters k a and CL are considered to be lognormally distributed. The covariance matrix for the random effects is For model M2, the dynamical equations displayed in Eq. (53) are kept the same, but the initial values are Input(t) represents the dose given at time t, and the output is and The parameters k a , CL, and IC 50 are considered to be log-normally distributed. The covariance matrix for the random effects is For model M3, the dynamical system of equation is the initial values are Input(t) represents the dose given at time t and the output is with Σ t as given by Eq. (61). The parameters k a , CL, and IC 50 are considered to be log-normally distributed. The covariance matrix for the random effects is as given by Eq. (62). The sampling scheme was the same for all three models, relative to dosing. The PK part was sampled at hours 0.25, 0.5, 0.75, 1, 1.5, 2, 4, 6, 8, 12, 16, and 24 and the PD part was sampled at hours − 24, − 18, − 12, 0, 1, 2, 4, 12, 24, 48, 72, 96, and 124. The doses of the hypothetical drug were 5, 10, 50, 100, and 200 dose units.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.