Learning pharmacometric covariate model structures with symbolic regression networks

Efficiently finding covariate model structures that minimize the need for random effects to describe pharmacological data is challenging. The standard approach focuses on identification of relevant covariates, and present methodology lacks tools for automatic identification of covariate model structures. Although neural networks could potentially be used to approximate covariate-parameter relationships, such approximations are not human-readable and come at the risk of poor generalizability due to high model complexity. In the present study, a novel methodology for the simultaneous selection of covariate model structure and optimization of its parameters is proposed. It is based on symbolic regression, posed as an optimization problem with a smooth loss function. This enables training of the model through back-propagation using efficient gradient computations. Feasibility and effectiveness are demonstrated by application to a clinical pharmacokinetic data set for propofol, containing infusion and blood sample time series from 1031 individuals. The resulting model is compared to a published state-of-the-art model for the same data set. Our methodology finds a covariate model structure and corresponding parameter values with a slightly better fit, while relying on notably fewer covariates than the state-of-the-art model. Unlike contemporary practice, finding the covariate model structure is achieved without an iterative procedure involving manual interactions.


Introduction
Pharmacokinetics (PK) are the dynamics governing drug uptake, distribution, and elimination from the body. For many drugs, it is common to assert a low-order linear and time-invariant (LTI) compartment model for PK modeling. Such low-order models typically capture the uptake, distribution, and elimination dynamics adequately. Increasing model complexity through, for example, additional compartments results in models where the parameters are not practically identifiable from data collected during clinical trials or clinical practice.
While suitable PK model structures (e.g. number of compartments and topology) for common drugs can be established from data, a notable challenge exists in that the parameter values that explain the PK for one individual are often not suitable for another. Such inter-individual variability is partly explainable by individual-specific features that are referred to as covariates, and partly attributed to random effects.
The purpose of pharmacometric covariate modeling is to identify covariates (i.e., fixed effects) responsible for inter-individual variability, thus minimizing random effects. The gold standard is to approach this problem in a Bayesian setting using mixed-effect modeling, for which there exist both mature [1] and novel [2] tools. However, to apply mixed-effect modeling, one needs to decide which of possibly many covariates (e.g. age, body mass, gender, or genetic factors) to include. One also needs to decide on a parametric function that maps included covariates to parameters of the PK model. There is a tradition of using functions that are of sufficiently low complexity to be human-readable, and some function classes are more popular than others [3,4]. However, for data sets including numerous covariates, selection of covariates and functions to consider is a combinatorial problem, where the size of the search space may become limiting. Furthermore, due to the step-wise approaches typically used for covariate identification, functions including multiple covariates are typically only considered if supported by prior knowledge [5].
Recently, the interest in combining machine learning (ML) and pharmacometrics [6] has increased. For example, ML has shown promising results in concentration predictions [7], identification of influential covariates [8,9], as well as parameter regression and model selection with the use of genetic algorithms and neural networks (NNs) [10]. The ML methods have been able to match, and even beat, classic nonlinear mixed-effects (NLME) modeling at a much higher computational speed. However, using ML to select influential covariates and identify the structural model simultaneously has, to our knowledge, not yet been studied.
In the present paper, we provide an automatic method for simultaneous covariate selection and identification of covariate functions using an adaptation of ML. Similarly to the current standard approach, we maintain simple human-readable expressions. The method is based on symbolic regression [11], which approximates the underlying combinatorial problem with a smooth continuous one, thus enabling the use of efficient gradient-based optimization methods.
To illustrate utility, we apply our methodology to a large data set for the drug propofol and compare the resulting covariate model with the current state-of-the-art model [4].
Our method produces a model with increased predictive performance, using fewer covariates. Furthermore, the covariate model is optimized autonomously, in contrast to the prevalent iterative and manual procedure.

Methods
The method described in this paper aims to automatically learn closed-form pharmacometric parameter-covariate relationships from data. We will consider learning expressions for PK model parameters from drug administration and resulting blood plasma concentration data, both of which are time series, but not necessarily synchronous or equidistant samples. The overarching setup for this is shown in Fig. 1, and we will focus on a concrete example based on a large multi-study data set published in [4] for the anesthetic drug propofol, commonly modeled with a mammillary three-compartment model [12].

Data set
We demonstrate our method on a data set composed of propofol plasma concentration observations from 1031 individuals from 30 clinical studies aggregated by Eleveld et al. [4], from here on referred to as the data set. Ethical approvals of the underlying studies are declared in the original publications, referenced by [4]. The data set contains 15,433 observations, of which 11,530 were arterial and 3903 venous. Of the 1031 individuals, there were 670 males and 361 females, with ages ranging from 27 weeks to 88 years, and weights ranging from 0.68 to 160 kg, see Table 1.
The main reason for choosing this data set as a demonstrator is that propofol is a drug with well-studied pharmacokinetics. Evidence of this, which also constitutes a benchmark for our demonstrator, is the model presented in [4]. Furthermore, the data set has been openly disclosed by Eleveld et al., enabling transparent third-party analysis of our work.
In our model development, we consider age, weight, BMI, gender, and blood sampling site (arterial or venous) as potential covariates of our PK model. These are the same candidates as considered in [4], where individual demographics were disclosed as part of the data set.
The data was pre-processed the same way as in [4]: data points corresponding to subsequent infusion changes spaced closer than 1 s apart in the time dimension, or 0.5 µg s⁻¹ in the dose dimension, were merged.
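As an illustration, this merging rule can be sketched in Julia as follows. The event representation (vectors `t` of change times and `u` of infusion rates) and the exact merge semantics (keeping the latest rate) are assumptions made for the example, not the implementation of [4] or [27].

```julia
# Sketch of the pre-processing rule: merge consecutive infusion-rate
# changes closer than 1 s in time or 0.5 µg/s in rate.
function merge_infusion_events(t::Vector{Float64}, u::Vector{Float64};
                               dt_min = 1.0, du_min = 0.5)
    T, U = [t[1]], [u[1]]
    for k in 2:length(t)
        if t[k] - T[end] < dt_min || abs(u[k] - U[end]) < du_min
            U[end] = u[k]            # merge: keep latest rate, drop extra event
        else
            push!(T, t[k]); push!(U, u[k])
        end
    end
    return T, U
end
```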

Pharmacokinetic model
We consider a three-compartment mammillary model to describe the pharmacokinetics of propofol. The drug concentration $x_i$ [µg L⁻¹] in compartment $i \in \{1,2,3\}$ evolves according to

$$\begin{aligned} \dot{x}_1 &= -(k_{10}+k_{12}+k_{13})\,x_1 + k_{12}\,x_2 + k_{13}\,x_3 + \frac{u}{V_1},\\ \dot{x}_2 &= k_{21}\,(x_1 - x_2),\\ \dot{x}_3 &= k_{31}\,(x_1 - x_3), \end{aligned} \tag{1}$$

where $k_{ij}$ describes the drug transfer rate [1/s] from compartment $i$ to $j$. The drug is administered at rate $u$ [µg s⁻¹] to the central compartment ($i = 1$), which is also where the propofol plasma concentration is measured. The volume of the central compartment is $V_1$ [L]. In the literature, the equivalent parameterization of volumes ($V_1, V_2, V_3$) and clearances ($CL, Q_2, Q_3$) constitutes a common alternative to Eq. 1. The conversions between these parameterizations are

$$ CL = k_{10} V_1, \quad Q_2 = k_{12} V_1, \quad Q_3 = k_{13} V_1, \quad V_2 = V_1 \frac{k_{12}}{k_{21}}, \quad V_3 = V_1 \frac{k_{13}}{k_{31}}. \tag{2}$$

We have chosen to implement our method using the parameterization in Eq. 1 due to numeric benefits. These are further explained in [13], where we developed a fast and natively differentiable simulator for the third-order mammillary model.
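To make Eq. 1 concrete, a minimal Julia sketch of the model is given below, stepping the state with a zero-order-hold matrix exponential between dose changes. This is an illustration only, not the optimized simulator of [13]; all variable names are ours.

```julia
using LinearAlgebra

# State-space matrices of the concentration-based model in Eq. 1.
# x = [x1, x2, x3] are compartment concentrations, u a constant infusion rate.
function pk_matrices(k10, k12, k13, k21, k31, V1)
    A = [-(k10 + k12 + k13)  k12   k13;
          k21               -k21   0.0;
          k31                0.0  -k31]
    B = [1 / V1, 0.0, 0.0]
    return A, B
end

# Propagate x over dt seconds under constant infusion rate u (exact for LTI
# dynamics, via the standard augmented-matrix exponential).
function pk_step(A, B, x, u, dt)
    M = exp([A B .* u; zeros(1, 4)] .* dt)
    return M[1:3, 1:3] * x + M[1:3, 4]
end
```

For example, advancing the state across a 10 s interval of constant infusion amounts to `x = pk_step(A, B, x, u, 10.0)`, repeated per dose-change interval.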

Covariate model
The covariate model, shown in Fig. 1, is expressed as a function $f$ that maps the covariate vector $u = [u_1, \ldots, u_{n_u}]^\top$ to the vector $\theta = [\theta_1, \ldots, \theta_{n_\theta}]^\top$ of PK model parameters. Thus $f$ has components $f_1, \ldots, f_{n_\theta}$, each mapping the covariate vector $u$ to one of the $n_\theta$ PK model parameters.
In our example with the three-compartment model for propofol, there are $n_u = 5$ covariates and $n_\theta = 6$ PK model parameters according to Table 1. To not favor covariates based on their scale, all input covariates were normalized during model development. Continuous inputs (age, weight, and BMI) were scaled from 0 to 1, and categorical inputs (gender and blood sampling site) were scaled to ±0.5.
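A minimal sketch of this normalization in Julia is given below; the maxima are the data set extremes (cf. Table 1 and the Results section), and the function name and argument order are illustrative.

```julia
# Illustrative covariate normalization: continuous inputs scaled to [0, 1]
# by the data set maxima, categorical inputs coded as ±0.5.
covariates(age, wgt, bmi, male::Bool, arterial::Bool) = [
    age / 88.0,               # AGE / AGE_max
    wgt / 160.0,              # WGT / WGT_max
    bmi / 52.8,               # BMI / BMI_max
    male ? 0.5 : -0.5,        # gender
    arterial ? 0.5 : -0.5,    # blood sampling site
]
```

For instance, `covariates(35.0, 70.0, 24.2, true, true)` would produce the normalized covariate vector for a 35-year-old, 70 kg male with arterial sampling.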

Predictive performance
To assess the quality of a particular covariate model candidate $f$, we need a performance measure that captures how well $f$ reflects the training data. Most optimization methods, including the ones used in this paper, rely on this measure being scalar.
For comparability, we employ the same scalar performance measures as those used in [4]: we train our model to minimize an ensemble average of the absolute logarithmic error (ALE). Subsequently, we evaluate predictive performance in terms of ALE and three additional error measures: logarithmic error (LE), prediction error (PE), and absolute prediction error (APE).
For each individual in the data set, there is a vector of observation-prediction errors, where each entry corresponds to the difference between a blood sample observation and the value predicted by the model. Observation-prediction errors could thus be computed for each sample, over a time series for one individual, or over the entire data set. This prompts a consistent notation: we index by $ij$ the error over a sample $j$ for individual $i$, whereas a total error over a time series for an individual is indexed by $i$. The per-sample (absolute) logarithmic error (A)LE [14] is thus

$$ \mathrm{LE}_{ij} = \log\frac{C^{\mathrm{obs}}_{ij}}{C^{\mathrm{pred}}_{ij}}, \qquad \mathrm{ALE}_{ij} = \left|\log\frac{C^{\mathrm{obs}}_{ij}}{C^{\mathrm{pred}}_{ij}}\right|, \tag{3}$$

where $C^{\mathrm{obs}}_{ij}$ are observed (measured) plasma concentrations and $C^{\mathrm{pred}}_{ij}$ are corresponding predictions produced by the model.

Fig. 1 Mammillary three-compartment model example illustrating our novel method. The objective is to automatically learn the covariate model $f$ that maps a known covariate vector $u$ (comprising e.g. age, gender, or genetic factors) to the parameter vector $\theta$ (e.g. rate constants $k_{\cdot\cdot}$ and volumes $V_\cdot$) of a fixed-structure pharmacokinetic (PK) model. The method is data-driven in that it uses drug administration profiles (time series data) $u$, and model-based in that it assumes a PK model of known structure. In this example, $f$ is learned to minimize some error measure between observed (i.e. measured from samples) blood plasma concentrations and corresponding predictions $C^{\mathrm{pred}}$ by the model. Dots in the graphs show instances of dose changes and blood samples, respectively.
Similarly, the per-sample (absolute) prediction error (A)PE [15] is

$$ \mathrm{PE}_{ij} = \frac{C^{\mathrm{obs}}_{ij} - C^{\mathrm{pred}}_{ij}}{C^{\mathrm{pred}}_{ij}}, \qquad \mathrm{APE}_{ij} = \left|\mathrm{PE}_{ij}\right|. \tag{4}$$

To avoid divisions by zero if $C^{\mathrm{pred}}_{ij} = 0$, a special casing is needed where the error is set to zero for such samples. Using Eqs. 3 and 4, corresponding per-individual errors were computed by taking the median, resulting in, for example, the median absolute logarithmic error (MdALE)

$$ \mathrm{MdALE}_i = \operatorname{median}\left(\mathrm{ALE}_{i1}, \ldots, \mathrm{ALE}_{i n_i}\right), \tag{5}$$

where $n_i$ is the number of entries of the time series from individual $i$.
To train the model of [4] and the ones considered here, a single scalar loss function representing model fit is required. This is obtained by averaging across the individuals. Training the covariate model to minimize ALE thus translates into minimizing

$$ J_{\mathrm{ALE}} = \frac{1}{n}\sum_{i=1}^{n} \mathrm{MdALE}_i, \tag{6}$$

where $n$ is the number of individuals in the data set.
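In Julia, Eqs. 3-6 amount to a few lines; this is a sketch under the assumption that observed and predicted concentrations are available as per-individual vectors.

```julia
using Statistics

le(c_obs, c_pred)  = log(c_obs / c_pred)        # Eq. 3, logarithmic error
ale(c_obs, c_pred) = abs(le(c_obs, c_pred))     # Eq. 3, absolute version

# Eq. 5: per-individual median over that individual's samples.
mdale(obs::Vector, pred::Vector) = median(ale.(obs, pred))

# Eq. 6: population loss, averaging per-individual medians over n individuals.
j_ale(obs_all, pred_all) = mean(mdale(o, p) for (o, p) in zip(obs_all, pred_all))
```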
Similarly to the analysis performed in [4], ALE and APE were used as indicators of model accuracy, and LE and PE were used as indicators of bias. Values closer to zero for ALE and APE reflect better accuracy, and values closer to zero for LE and PE indicate less bias.
Of note, the choice of loss function affects how outliers are penalized: an average loss across individuals would, for example, penalize outliers more than a median loss over individuals.

Symbolic regression networks
At the core of our methodology lies a small artificial neural network (ANN) with a specific structure, that of a symbolic regression network [17], representing a simple closed-form expression. In our case, the purpose is to learn human-readable closed-form expressions describing the covariate model $f$. A schematic illustration of such a network is shown in Fig. 2.
In general, ANNs can be viewed as flexible function approximators that can be trained to represent input-output mappings that fit available data. An ANN consists of $n_l$ layers, where the output vector $x_{l+1}$ of layer $l$ is obtained by applying a vector of nonlinear activation functions $g_l$ to an affine transformation $z_l$ of the layer input $x_l$:

$$ z_l = W_l x_l + b_l, \qquad x_{l+1} = g_l(z_l). \tag{7}$$

The linear weight matrices $W_l$ and bias vectors $b_l$ constitute the free parameters used to train the network. In the conventional case, the components of the activation functions $g_l$ are monotonically increasing functions, such as sigmoids [18]. In contrast, a symbolic regression network can be understood as an ANN with the additional constraint that the resulting approximator $f$ should be a human-readable closed-form expression of a mathematical function [17, 19-21]. Training of symbolic regression networks is therefore often referred to as equation learning [17]. The methodology has gained broad attention for its ability to produce impressive results in discovering (known) physical laws from data [22].
The activation functions $g$ of a symbolic regression network represent mathematical functions that we refer to as base expressions. In standard symbolic regression, a sequence of topologies with different base expressions and scalar parameters, of which the weight matrices and bias vectors in Eq. 7 constitute a special case, is evaluated in search of one that fits the available data well [23]. Conventional equation learning is thus a combinatorial problem with poor (exponential) time complexity, and is therefore often approached using genetic algorithms [23].
Instead of relying on random perturbations as in genetic algorithms, we use gradient-based optimization methods common to conventional ANN training. To ensure human-readability, we enforce sparsity of the ANN by alternating training epochs with pruning epochs, in which the least important parameters are removed from the network. We next rely on a concrete example based on the Eleveld data set to describe and illustrate this approach.
We set out with the nominal (unpruned) network of Fig. 2. It constitutes our nominal representation of $f_k$, where the inputs are the covariates according to Table 1. Thus, the output of the network represents one of the PK parameters $\theta_k$, such as $k_{10}$ or $V_1$ of Eq. 1. For a model with $n_\theta$ PK parameters, we use a parallel interconnection of $n_\theta$ symbolic regression networks, each modeling one component $f_k$ of $f$, mapping the covariate vector $u$ to PK parameter $\theta_k$, $k = 1, \ldots, n_\theta$.
In our example, the base expressions of each layer $l \in \{1,2,3\}$, with input vector $z_l = [z_{l1}, z_{l2}, \ldots]^\top$, were chosen to cover previously published PK models for propofol, such as [4] and [12]:

$$ g_1(z_1) = \left[z_{11},\; z_{12} z_{13},\; |z_{14}|^{z_{15}}\right]^\top, \quad g_2(z_2) = \left[z_{21},\; z_{22} z_{23},\; \frac{z_{24}}{1 + z_{25}}\right]^\top, \quad g_3(z_3) = |z_{31}|,$$

where $g_3$ assures positive output of the final layer. The division in $g_2$ has the term one in the denominator to assure that the output does not blow up as $z_{25} \geq 0$ approaches zero.
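The following Julia/Flux sketch mirrors this structure. The layer widths and random initializations are illustrative, and the `abs` safeguard in the division of `g2` is our simplification; see [27] for the actual implementation.

```julia
using Flux

# One symbolic regression layer: affine map followed by base expressions.
struct SymRegLayer{M,V,F}
    W::M; b::V; g::F
end
Flux.@functor SymRegLayer                  # make W and b trainable
(l::SymRegLayer)(x) = l.g(l.W * x .+ l.b)

g1(z) = [z[1], z[2] * z[3], abs(z[4])^z[5]]           # identity, product, power
g2(z) = [z[1], z[2] * z[3], z[4] / (1 + abs(z[5]))]   # identity, product, division
g3(z) = abs(z[1])                                     # positive final output

# One network f_k: covariate vector (5 inputs) -> one PK parameter.
f_k = Chain(SymRegLayer(randn(5, 5), randn(5), g1),
            SymRegLayer(randn(5, 3), randn(5), g2),
            SymRegLayer(randn(1, 3), randn(1), g3))
```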

Training
Minimizing the loss function Eq. 6 over the trainable parameters of the covariate model $f$ is referred to as training. In our case, these parameters are stacked into a vector $c$, made up of the elements of all weight matrices and bias vectors. Since the data set is static, we can view training as minimization of the scalar-valued function $J_{\mathrm{ALE}}(c)$. Evaluating $J_{\mathrm{ALE}}(c)$ requires simulation of one PK model for each data set individual, to obtain the predicted plasma concentration at each observation time instance. This can be done efficiently using the method introduced in [13].
In addition to supporting fast simulation, the method of [13] enables exact and efficient evaluation of derivatives of individual prediction errors with respect to the trainable model parameters. This allows us to train the covariate model $f$ using conventional ANN back-propagation, with the stochastic gradient-based optimization algorithm ADAM [24]. As with artificial neural networks in general, establishing formal convergence guarantees is challenging. However, in practice both standard deep learning networks and our symbolic regression networks converge to models that fit data adequately well. Similarly to the deep learning case, the convergence rate will vary with data, choice of activation functions, learning rate, and initialization of the training. Our implementation, using the neural network package Flux [25], relies on the differentiable programming capabilities of the Julia language [26]. A full disclosure of our implementation is found in the GitHub repository [27].
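A bare-bones version of such a training loop could look as follows; `f_k` is the network sketched above, and `loss` is an assumed wrapper that evaluates $J_{\mathrm{ALE}}(c)$ by simulating one PK model per individual (see [13,27] for the actual machinery).

```julia
using Flux

# loss(m) is assumed to simulate each individual's PK model with the
# parameters produced by m and return J_ALE.
opt = Flux.setup(Adam(1e-3), f_k)        # ADAM; step size is illustrative
for epoch in 1:1000
    grads = Flux.gradient(m -> loss(m), f_k)
    Flux.update!(opt, f_k, grads[1])
end
```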

Pruning
To obtain simple human-readable expressions from an initially dense symbolic regression network, such as the one in Fig. 2, we alternate between parameter training of the fixed network structure and pruning of the network.
The goal of this pruning is to obtain a sparse network structure, which translates into a readable expression of the corresponding covariate model. In the pruning process, we remove covariates and network parameters that have relatively little influence on the network output. This process is visually exemplified in Fig. 3.

Fig. 2 Symbolic regression network with three layers, each marked by a gray box. The output of node $z_{li}$ at layer $l$ is the $i$th component of $z_l = W_l x_l + b_l$, where $x_l$, $W_l$, and $b_l$ are the input vector, weight matrix, and bias vector of that layer. The base expressions $g_{li}$ acting on $z_{li}$ take on the role of activation functions used in ordinary ANNs. For example, the output of the first layer (and therefore input to the second layer) is $x_2 = g_1(W_1 x_1 + b_1)$. Input and output of the network are the covariate vector $u = x_1$ and the PK parameter $\theta_k = x_4$, respectively.
It is desirable to only include as many covariates as needed to explain the data [5]. Therefore, we start by identifying and removing the least important covariates from the symbolic regression network, to obtain expressions with fewer covariates. Next, we prune network parameters $c$ (linear weights and biases) to obtain simple covariate expressions. Pruning a network parameter is achieved by fixing its value to zero and removing it from the vector $c$ of trainable parameters.
Before each pruning iteration, we train the network until convergence. This translates into finding a local minimum of the loss function in Eq. 6. At such minima, the partial derivatives of the loss function with respect to the trainable parameters are zero. Therefore, the second derivatives define the first non-zero terms of the Taylor series expansion that describe local parameter sensitivities, as further explained in Appendix A. The second-order derivatives make up the elements of the Hessian matrix; specifically, its diagonal elements represent sensitivities in the corresponding individual parameters.
In the Taylor series expansion, the second-order terms take on the form

$$ S(c_k) = \frac{1}{2} H_k c_k^2, \tag{8}$$

where $H_k$ is the $k$th diagonal element of the Hessian of the loss function with respect to the network parameter $c_k$. $S(c_k)$ denotes the salience of parameter $c_k$ [28].
Fig. 3 Pruning sequence of a symbolic regression network whose output is the pharmacokinetic parameter $\theta_2 = k_{12}$. Input covariates are age, weight, gender, and arterial or venous sampling (AV). The nominal network has three dense layers, each followed by base expressions such as 1 (feedforward), multiplication, power function, division, and absolute value. A black line represents a connection between two nodes, and a gray line represents a pruned (removed) connection. The final network represents the covariate expression of Eq. 10b.

For pruning of the network parameters $c$, Eq. 8 may be used directly. However, computing the salience for a covariate is not as straightforward, since the Hessian elements would differ between individuals. A solution to this is to sum the salience contributions of each individual,

$$ S(u_k) = \sum_{i=1}^{n} \frac{1}{2} (H_i)_k\, u_{ik}^2, \tag{9}$$

where $(H_i)_k$ is the $k$th diagonal element of the Hessian $H_i$, determining the sensitivity of the covariate $u_{ik}$ for individual $i$, and $n$ is the number of individuals. We use the Zygote package [29] in Julia to compute the Hessian diagonal elements. In Appendix A, we give a more detailed description of the Hessian-based pruning method, providing mathematical insight into this methodology.
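As a sketch, the parameter saliences of Eq. 8 can be computed with Zygote as follows, assuming a scalar loss `J` over a flat parameter vector `c`; `Zygote.diaghessian` returns only the diagonal elements, avoiding the full Hessian.

```julia
using Zygote

# Eq. 8: salience of each trainable network parameter.
h = Zygote.diaghessian(J, c)[1]   # diagonal Hessian elements H_k
S = 0.5 .* h .* c .^ 2            # S(c_k) = H_k c_k^2 / 2
weakest = argmin(S)               # pruning candidate: fix c[weakest] to zero
```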

Recipe: Symbolic regression
A compact summary of our training and pruning scheme is provided below. Initially, we start with a nominal symbolic regression network, like the one shown in Fig. 2. It is sequentially trained and pruned until we obtain a final expression of sufficient complexity and fit. Our training and pruning sequence of a symbolic regression network is as follows (a minimal driver sketch is given after the list):

1. Choose a nominal symbolic regression network architecture, and corresponding base expressions.
2. Train the network until convergence.
3. Compute the salience of each (remaining) covariate, $S(u_k)$ of Eq. 9.
4. Sort the covariates by salience and remove the covariate with the smallest salience.
5. If the desired final number of covariates is reached, continue to step 6; otherwise return to step 2.
6. Train the reduced network until convergence.
7. Compute the salience of each (remaining) trainable network parameter, $S(c_k)$ of Eq. 8.
8. Sort the network parameters by salience and remove the $N$ parameters with the smallest salience.
9. If the desired final number of network parameters is reached, continue to step 10; otherwise return to step 6.
10. Train the reduced network until convergence.
11. Convert the resulting network to a readable functional expression.
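A hedged driver sketch of this recipe in Julia follows; `train!`, the salience functions, the pruning mutators, and `to_expression` are hypothetical helpers wrapping the pieces sketched earlier.

```julia
# Hypothetical helpers: ncovariates, nparameters, covariate_salience,
# parameter_salience, train!, prune_covariate!, prune_parameters!, to_expression.
function symbolic_regression!(net; n_cov = 2, n_par = 12, N = 10)
    while ncovariates(net) > n_cov                   # steps 2-5
        train!(net)                                  # step 2: train to convergence
        prune_covariate!(net, argmin(covariate_salience(net)))
    end
    first_prune = true
    while nparameters(net) > n_par                   # steps 6-9
        train!(net)                                  # step 6
        k = first_prune ? N : 1                      # N parameters first, then one
        prune_parameters!(net, partialsortperm(parameter_salience(net), 1:k))
        first_prune = false
    end
    train!(net)                                      # step 10
    return to_expression(net)                        # step 11
end
```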
The number of covariates and network parameters to keep in the final symbolic regression network is a trade-off between fit to data and complexity of the final expression.
In our example, illustrated in Fig. 3, we remove one covariate at each pruning iteration, until only two covariates are left. Next, we remove the $N = 10$ least sensitive parameters in the first parameter pruning iteration, and then one parameter per subsequent pruning iteration, until only twelve network parameters are left. More details of the pruning and training can be found in [27]. In [30], we demonstrate that our methodology can identify functions of known shapes correctly. Initialization of the parameter values associated with step 1 of the recipe affects the fit of the resulting final model. To mitigate the risk of poor model fit due to an unfortunate initialization, the recipe can be run several times, keeping the best-fitting model. In our example, we have executed the recipe eight times.

Limits of performance
Structural mismatch between the asserted PK model structure Eq. 1 and the data, in combination with measurement errors, induces upper and lower limits on the prediction errors of the trained covariate model. Here we explain how these limits can be characterized by training two additional models.
Even with a very complex covariate model, one cannot expect perfect fit to data (zero loss). This is because the fixed (three-compartment) PK model structure is only an approximation of the actual pharmacokinetics, combined with (blood sample) measurement errors. Part of the loss remaining after applying our covariate modeling scheme can thus be attributed to this mismatch. To indicate how much, we optimize one set of (three-compartment) PK parameters for each individual in the data set. This results in a completely covariate-free model. While it will fit data better than any covariate model, it does not generalize. This makes it practically useless for purposes other than providing an upper limit for performance.
Another natural question to ask is how much we gain (in terms of loss) by considering covariate dependencies. To answer it, we optimize a constant, i.e. covariate-free, model, one where all individuals share the same (three-compartment) PK parameter values. This model constitutes a lower bound of the achievable predictive performance, which any covariate-based model should beat.
For these two additional models, PK parameter optimization was done with the optimization package Optim.jl [31], minimizing the same loss as for our covariate model based on symbolic regression. Implementation details can be found in [27].
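For instance, the constant (covariate-free) model can be fitted roughly as follows; `j_ale_shared` is an assumed wrapper that simulates every individual with one shared parameter vector and returns the loss of Eq. 6, and the initial guess is illustrative.

```julia
using Optim

θ0    = fill(0.1, 6)                                # initial guess for (k.., V1)
res   = optimize(j_ale_shared, θ0, NelderMead())    # gradient-free, for simplicity
θ_opt = Optim.minimizer(res)                        # shared PK parameters
```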

Results
Applying the proposed methodology to the Eleveld data set [4] resulted in the covariate model of Eq. 10, mapping covariates to rate constants and central compartment volume of the PK model in Eq. 1. In Eq. 10, AGE, WGT, and BMI represent age [years], weight [kg], and body mass index [kg m⁻²]. The subscript max represents the input normalization, where AGE_max = 88 years, WGT_max = 160 kg, and BMI_max = 52.8 kg m⁻².
The subscript male or female indicates different PK parameter expressions depending on gender. The blood sampling site (arterial or venous) was available as a modeling covariate, but was automatically pruned by the symbolic regression algorithm.
The obtained model of Eq. 10 is less complex than the Eleveld model in [4], provided in Appendix B for reference, and comparable to simpler covariate models for propofol, such as [3,16].
The predicted concentrations of the final covariate model on the Eleveld data set are shown in Fig. 4, together with the corresponding Eleveld model predictions. The distribution of errors between predicted and observed concentrations is shown in the boxplot in Fig. 5, indicating comparable predictive capability of the models, despite our model being less complex and involving fewer of the available covariates. For ease of comparison, we present the corresponding average prediction errors in Table 2. We compare our model to the Eleveld model, to a covariate-free PK model where all individuals share the same parameter values, and to covariate-free individual models with unique parameter sets. As expected, adding covariates explains some, but not all, variability between patients. This can be seen by comparing our model to the constant PK model and the individual models. As seen in Table 2, all of the prediction errors MdLE, MdALE, MdPE, and MdAPE fall within clinically acceptable ranges.

Fig. 5 Comparison of the prediction error MdALE (Eq. 5) between predicted and observed propofol concentrations for pharmacokinetic models. Our covariate model is denoted Symreg, and the Eleveld covariate model is described in [4]. The constant model represents one parameter set over the population, and individual represents individual sets of model parameters. The lower whisker for the individual models goes to zero.

Discussion
We have introduced a novel symbolic regression methodology for simultaneously automating the search for a suitable covariate model structure and the optimization of its parameters. Similarly to contemporary methodologies, it relies on a user-specified set of base expressions from which the covariate model can be composed. However, and in important contrast to contemporary methods, the need for a combinatorial search across combinations of these expressions is avoided.
Throughout the paper, a propofol PK modeling example has been used as a demonstrator. Within this example, the proposed methodology accomplishes a slightly better data fit than the result of state-of-the-art modeling [4] (see Figs. 4 and 5, Table 2), while relying on fewer covariates (see Eq. 10). The obtained values of individual volumes and clearances were comparable to those in [4].
The introduced methodology is broadly applicable to PK modeling from time series data, and likewise to pharmacodynamic (PD) modeling, and combined pharmacokinetic and pharmacodynamic (PK/PD) modeling. Its main benefit lies in that it poses the search for a suitable PK model as symbolic regression with a smooth loss function. This, in combination with efficient methods for simulation and gradient computations [13], enables efficient model learning using back-propagation.
Another advantage is that the method can find covariate functions that are both simple and explain available data, while having a structure that would generally not be considered in a manual model structure search unless explicitly supported by prior knowledge. However, the obtained covariate model is deterministic in the sense that we do not obtain a distribution over individual PK parameter values. It would be possible to integrate this methodology into the traditional mixed-effect modeling framework. Yet, this would be computationally much more demanding, which is why we propose first applying symbolic regression in a deterministic setting to arrive at a covariate model structure, and then (if desired) applying mixed-effect modeling to maximize parameter likelihood (with respect to some parameter priors and subject to the considered data) within the found structure.
In this paper, we have focused on models that are of sufficiently low complexity to be human-readable, which is achieved by enforcing sparsity of the neural network that constitutes the expression tree of the covariate model. If human readability is not necessary, an ordinary deep ANN could be employed instead. However, there is also a trade-off between fit to training data (expressiveness) and generalization to yet unseen data due to possible over-fitting. Enforcing human-readability naturally limits flexibility of the model, thus decreasing this risk of over-fitting. The ability to manually specify base expressions also provides a means to integrate expert knowledge into the model. For example, a suspicion that a compartment volume should correlate with the square of patient age would motivate multiplication as a base expression; other examples include a clearance to the power of 0.75 as in [32], or compartmental allometry as in [33].
We assessed the model's out-of-sample prediction performance using five-fold cross-validation. The data set was divided into five equal parts, and a covariate model was trained on four of the partitions, excluding one each time. The excluded partition, referred to as the validation set, was used to evaluate the model. This process was repeated for all five partitions to obtain an average predictive performance. The resulting mean (range) MdALE from cross-validation on the validation sets was 0.303 (0.152-0.423), similar to the mean MdALE of 0.279 in Table 2. There was thus only a modest 10% difference in mean error between the training and validation sets, which suggests no over-fitting issue. If such a problem had occurred, reducing the number of covariates and parameters (harder pruning) could have balanced the prediction errors on both sets.
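The cross-validation split can be sketched as follows; `fit` and `evaluate_mdale` are hypothetical wrappers around the training recipe and the per-individual evaluation of Eq. 5.

```julia
using Random, Statistics

ids   = shuffle(1:1031)                      # individual indices
folds = [ids[k:5:end] for k in 1:5]          # five disjoint partitions
cv = map(folds) do val
    model = fit(setdiff(ids, val))           # train on the other four folds
    evaluate_mdale(model, val)               # mean MdALE on the held-out fold
end
mean(cv), extrema(cv)                        # reported as mean (range)
```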
The way we have employed the methodology here differs from how covariate models are usually trained using mixed-effect modeling. Rather than asserting parameter priors and selecting the most likely parameters from the posterior distribution that the data infers, we have chosen to minimize a scalar loss function, resulting in a deterministic covariate model. It would in theory be possible to embed our methodology within an inference engine, but at a high computational cost. If a Bayesian interpretation is desired, a likely better alternative is to first run our methodology to arrive at a covariate model structure, as we have done in our example, and then assert priors on the parameters of the resulting model, to finally apply mixed-effect modeling to compute the corresponding posteriors.

Conclusion
We have presented a novel methodology for automatic and simultaneous covariate model structure discovery and parameter optimization. The methodology was demonstrated on an example in which it outperforms state-of-the-art modeling, in that it finds expressions that match data slightly better, while relying on notably fewer covariates.
We conclude that the potential of automated model structure discovery is substantial; it could greatly streamline the process of pharmacometric covariate modeling. Additionally, it is likely to provide an improved balance between model complexity and data fit. This improved balance could be challenging to achieve when simply assessing a series of pre-set model structure candidates in sequence.

Appendix A. Hessian-based pruning
In this appendix we provide a mathematical interpretation of the roles played by the Hessian and parameter saliences in our pruning approach.
The effect of perturbing the parameter vector can be analyzed by approximating the loss function $J(c)$ by a Taylor series. A perturbation $\delta c$ of the parameter vector $c$ will change the loss function by

$$ \delta J(c) = \nabla J(c)^\top \delta c + \frac{1}{2} \sum_{i,j} H_{ij}\, \delta c_i\, \delta c_j + \mathcal{O}(\|\delta c\|^3),$$

where $\delta c_i$ are the components of $\delta c$, $\nabla J(c)$ is the gradient of the loss function, and $H_{ij}$ are the elements of the Hessian matrix $H$ of $J$ with respect to $c$, so that

$$ \nabla J(c) = \frac{\partial J(c)}{\partial c}, \qquad H_{ij} = \frac{\partial^2 J(c)}{\partial c_i\, \partial c_j}.$$
The goal is to prune the parameters that are least sensitive, i.e. those that affect the loss $J$ the least when perturbed. Even for our size of network, repeatedly computing the full Hessian $H$ would notably slow down training of the symbolic regression model. Instead, we make a diagonal approximation of the Hessian, neglecting cross terms $H_{ij}$ with $i \neq j$.
Allowing training to converge before each pruning iteration ensures that $\nabla J(c) = 0$ (or, in practice, is negligible), thus enabling approximation of $\delta J(c)$ by

$$ \delta J(c) \approx \frac{1}{2} \sum_{i} H_{ii}\, \delta c_i^2.$$

This approximation is then used to form an importance measure (salience) of our network parameters, according to [28], where the salience of parameter $c_i$ becomes

$$ S(c_i) = \frac{1}{2} H_{ii}\, c_i^2.$$

Appendix B. The Eleveld model

The Eleveld PK covariate model in [4] is given here, where the reference patient is male, 35 years old, 70 kg, and 1.7 metres tall. All parameter values $\theta$ can be found in the original publication [4].

Acknowledgements This work was supported by the Knut and Alice Wallenberg Foundation. All authors are members of the ELLIIT Strategic Research Area at Lund University.
Data availability Code and data for reproducing the results are available in [27].

Fig. 4 Predicted versus observed propofol concentrations of our covariate model (Symreg, red) compared to the Eleveld covariate model in [4] (Eleveld, blue), in logarithmic scale. The identity function, representing a perfect model fit, is shown in black.

Table 1 Covariate candidates considered in the PK model of [4] as well as in our modeling. Individuals of the data set fall within the reported ranges.