Cluster Gauss–Newton method

Parameter estimation problems of mathematical models can often be formulated as nonlinear least squares problems. Typically these problems are solved numerically using iterative methods. The local minimiser obtained using these iterative methods usually depends on the choice of the initial iterate. Thus, the estimated parameter and subsequent analyses using it depend on the choice of the initial iterate. One way to reduce the analysis bias due to the choice of the initial iterate is to repeat the algorithm from multiple initial iterates (i.e. use a multi-start method). However, the procedure can be computationally intensive and is not always used in practice. To overcome this problem, we propose the Cluster Gauss–Newton (CGN) method, an efficient algorithm for finding multiple approximate minimisers of nonlinear-least squares problems. CGN simultaneously solves the nonlinear least squares problem from multiple initial iterates. Then, CGN iteratively improves the approximations from these initial iterates similarly to the Gauss–Newton method. However, it uses a global linear approximation instead of the Jacobian. The global linear approximations are computed collectively among all the iterates to minimise the computational cost associated with the evaluation of the mathematical model. We use physiologically based pharmacokinetic (PBPK) models used in pharmaceutical drug development to demonstrate its use and show that CGN is computationally more efficient and more robust against local minima compared to the standard Levenberg–Marquardt method, as well as state-of-the art multi-start and derivative-free methods.


Introduction
The parameter estimation of mathematical models often boils down to solving nonlinear least squares problems. Hence, algorithms for solving nonlinear least squares problems are widely used in many scientific fields.
The most traditional least squares solver is the Gauss-Newton method (Gauss 1857;Björck 1996). In practice, the Gauss-Newton method with regularisation [i.e., Levenberg-Marquardt (LM) method Marquardt 1963;Moré 1978] or with the Trust-Region method (Conn et al. 2000) is often used. Recently, derivativefree methods, which do not explicitly use derivative information of the nonlinear function, have been developed. These methods are usually computationally more efficient as it avoids the costly computation of the derivatives of the nonlinear functions. Also, they can be applied even to problems where the mathematical models are 'black box'. The state of the art derivative-free algorithms are DFO-LS  and POUNDERS (Wild 2017). A comprehensive review of the derivative-free methods can be found in Larson et al. (2019).
Another approach for obtaining a solution of nonlinear least squares problems is to directly minimise the sum of squared residuals (SSR) using generic optimisation algorithms for the scalar objective function. The most classical approach is to obtain a minimiser where the gradient of the SSR becomes zero using the Newton method. As it is usually too costly to compute the Hessian of the SSR, Quasi-Newton methods which approximate the Hessian are used. The commonly used Quasi-Newton method is the BFGS method (Broyden 1970;Fletcher 1970;Goldfarb 1970;Shanno 1970;Shanno and Kettler 1970). Another approach which makes use of the Newton-type method for optimisation is Implicit Filtering (Kelley 2011) which combines grid search and the Newton method. In addition to these optimisation algorithms, we can use numerous global optimisation algorithms when bound constraints are given. For example, Surrogate Optimisation (Gutmann 2001), Genetic Algorithm (Goldberg and Holland 1988), Particle Swarm algorithm (Kennedy and Eberhart 1995; Mezura-Montes and Coello 2011), and DIRECT (Jones et al. 1993) are well known global optimisation algorithms.
Although there are a variety of algorithms to solve the nonlinear least squares problems as listed above, they mostly focus on finding one minimiser. To the best of our knowledge, there is very limited methodological development on algorithms for simultaneously finding multiple approximate minimisers of nonlinear least squares problems. For instance, when using the Levenberg-Marquardt method, the local algorithm often gives a local minimiser which depends on the choice of the initial iterate. To reduce the analysis bias due to the initial iterate used for a local algorithm, it is a good practice to repeatedly use the local algorithm with various initial iterates, as in multi-start methods (Boender et al. 1982). Similarly, for problems where bound constraints are given, one can use global optimisation algorithms to find one of the global minimisers. On the other hand, if there are multiple global minisers, the global minimiser found can depend on the algorithm setting, for example, the random seed. Hence, it is beneficial to use global optimisation algorithms with various settings repeatedly if the uniqueness of the global minimiser is not guaranteed. The trivial bottleneck of repeatedly using these algorithms is the computation cost. In this paper, we propose a new method addressing this computational challenge of finding multiple approximate minimisers of nonlinear least squares problems.
Our algorithm development for finding multiple local minimisers of nonlinear least squares problems was motivated by a mathematical model of pharmaceutical drug concentration in a human body called the physiologically based pharmacokinetic (PBPK) model (Watanabe et al. 2009). The PBPK model is typically a system of mildly nonlinear stiff ordinary differential equations (ODEs) with many parameters. This type of mathematical model is constructed based on the knowledge of the mechanism of how the drug is absorbed, distributed, metabolised and excreted. Given the complexity of this process and the limitation of the observations we can obtain from a live human subject, the model parameters cannot be uniquely identified from the observations, meaning that there are non-unique global minimisers to the nonlinear least squares problem. The estimated parameters of the PBPK model are used to simulate the drug concentration of the patient from whom we are often unable to test the drug on (e.g., children, pregnant person, a person with rare genetic anomaly) or to predict the experiment that is yet to be run (different amount of drug administration, multiple drugs used at the same time). As the simulated drug concentration is used to predict the safety of the drug in these different scenarios, it is essential to consider multiple predictions based on multiple possible parameters that are estimated from the available observations. A motivating example is presented in Sect. 3. Another reason why we want to obtain multiple sets of parameters is that we can understand which parameters cannot be estimated from the available data. This will motivate the pharmaceutical scientists to perform additional (e.g., in-vitro or inanimal) experiments to determine these parameters that were not estimable from the available data.
In Aoki et al. (2011) and Aoki et al. (2014) we proposed the Cluster Newton (CN) method for obtaining multiple solutions of a system of underdetermined nonlinear equations. In recent years CN has been used in the field of pharmaceutical science (Yoshida et al. 2013;Fukuchi et al. 2017;Asami et al. 2017;Toshimoto et al. 2017;Kim et al. 2017;Nakamura et al. 2018). For example, Toshimoto et al. (2017) used the parameters estimated by CN to predict the adverse drug effect, Nakamura et al. (2018) used the estimated parameters to predict the outcome of a clinical trial. However, based on these applications of CN, we observed the necessity to develop a new algorithm for finding multiple approximate minimisers of a nonlinear least squares problem which is more robust against noise in the observed data. This is mainly because actual pharmaceutical data may contain measurement error and inconsistency coming from an inadequate model.

Nonlinear least squares problem of our interest
In this paper, we propose an algorithm for obtaining multiple approximate minimisers of nonlinear least squares problems which do not have a unique solution (global minimiser), that is to say, there exist x (1) ≠ x (2) such that Here, f is a nonlinear function from ℝ n to ℝ m , x (1) , x (2) ∈ ℝ n and y * ∈ ℝ m . The nonlinear function f can be derived from a mathematical model, the vector x can be regarded as a set of model parameters which one wishes to estimate, and the vector y * can be regarded as a set of observations one wishes to fit the model to. In the motivating problem of estimating parameters of PBPK models, since there are usually insufficient observations, the corresponding nonlinear least squares problem has multiple global minimisers. Therefore, we are interested in finding multiple minimisers instead of just one minimiser.
We shall call the following quantity the sum of squared residuals (SSR): and use it for the quantification of the goodness of x as the approximation of the solution of the least squares problem (1).

Well known example in pharmacokinetics
In this subsection, we present a simple pharmacokinetics parameter estimation problems where the corresponding nonlinear least squares problems have non-unique global minimisers. This example is called 'flip-flop kinetics' and can be found in most standard textbooks in pharmacokinetics (e.g. Gibaldi and Perrier 1982). Flipflop kinetics occurs when estimating the pharmacokinetic parameters of the drug that is orally given (e.g., as a pill or a tablet) to patients based on the observation of the drug concentration in the blood plasma. The simplest mathematical model for this pharmacokinetics can be written as follows: where For this problem u 2 corresponds to the drug concentration in the blood plasma, which is observable (i.e., u 2 (t i ) is the model simulation corresponding to the observation y * i , where t i is the ith observation time.). It can be shown analytically that there are two distinct parameter sets that realise the same drug concentration timecourse curve. Figure 1 shows the surface plot of the sum of squared residuals of the

3
Cluster Gauss-Newton method corresponding nonlinear least squares problem. As can be seen, there are two global minimisers for this nonlinear least squares problem.
In this subsection, we have shown the simplest possible form of this issue of nonunique global minimisers in pharmacokinetics. Thus, we would like to point out that one cannot assume the uniqueness of the global minimiser for more complex pharmacokinetic models, for example, the PBPK model of our interest.

Method: Algorithm
In this section we describe the proposed algorithm. We first introduce a rough concept using a toy example in Sect. 2.1 and then introduce the full algorithm in detail in Sect. 2.2.

Brief explanation of the algorithm
The aim of the proposed Cluster Gauss-Newton (CGN) algorithm is to efficiently find multiple approximate minimisers of the nonlinear least squares problem (1). We do so by first creating a collection of initial guesses which we call the 'cluster'. Then, we move the cluster iteratively using linear approximations of the nonlinear function f , similarly to the Gauss-Newton method (Björck 1996).
The unique idea in the CGN method is that the linear approximation is constructed collectively throughout the points in the cluster instead of using the Jacobian matrix which approximates the nonlinear function linearly at a point as in the Gauss-Newton or LM. By using points in the cluster to construct a linear approximation, instead of explicitly approximating the Jacobian, we minimise the computational cost associated with the nonlinear function (i.e., mathematical model) for each iteration. In addition, by constructing linear approximation using non-local information, CGN is more likely to converge to approximate minimisers with smaller SSR than methods using the Jacobian.
In order to visualise the key differences between the proposed linear approximation (CGN) and the Jacobian (i.e., derivative) approaches, we consider the nonlinear function As can be seen in this figure, the linear approximation used in CGN follows the global trend of the function, while the slope used in LM captures the local feature of the function. As a result, when using CGN all the iterates converge to the minimisers with the smallest residual (in this case global minimisers), while all the iterates of LM converge to local minimisere whose residuals are not the minimum. In addition, for the 10 iterations presented in the figure, CGN required only 50 function evaluations while LM required 121 function evaluations since the slope is approximated using finite differences (see Fig. 2) and aim to find global minimisers. Any point x ∈ [−1, 1] is a global minimiser of this problem. Hence, this problem has nonunique global minimisers. Let the points of the initial iterates be: We now compute the linear approximations used to move these points in the cluster to minimise the function f. Gradient (LM) For this nonlinear function, since the function is given in analytic form, we can obtain the gradient explicitly. In, practice, when f is given as a "black box", we can approximate the derivative by a finite difference scheme, for example, Then, the linear approximation at x i can be written as Notice that it requires one extra evaluation of f at x i + for each x i . This number of extra function evaluation is, when evaluating a full gradient estimate, equal to the number of independent variables of f. (This is not the case if a directional derivative estimate is used.) If f is given by a system of ODEs, one may use the adjoint method to obtain the derivatives more efficiently. However, it requires solving an additional system of ODEs (the adjoint equation). More importantly, iterates of methods based on the gradient may converge to local minimisers, since they use local gradient information.

Cluster Gauss-Newton (CGN) method (proposed method)
In the proposed method, we construct a linear approximation for each point in the cluster while using the value of f at other points in the cluster to globally approximate the nonlinear function with a linear function. The influence of another point in the cluster to the linear approximation is weighted according to how close the point is to the point of approximation, i.e., where a (i) is the slope of the linear approximation at x i and the linear approximation at x i can be written as There are many possibilities for the weight d j(i) . In this paper, we choose d j(i) = (x j − x i ) −2 where ≥ 0 ( = 0 corresponds to uniform weight). Note that Eq. (9) can also be regarded as a weighted least squares solution of a system of linear equations where the weight is chosen to be d j(i) . The weight is motivated by the fact that we weight the information from the neighbouring points in the cluster more than the ones further away when constructing the linear approximation. Note that we do not require any extra evaluation of f for obtaining these linear approximations.
For the multi-dimensional nonlinear function f ∶ ℝ n → ℝ m , when we have N points in the cluster, we solve the following linear least squares problem: is a diagonal matrix defining the weights, X (i) ∈ ℝ N×n is the difference between all the cluster points and x i , and Y (i) ∈ ℝ N×m is the difference between the nonlinear function f evaluated at all the cluster points and at x i . The precise definition of these quantities and derivation of (10) is given in the next subsection.
For the one dimensional case (i.e., m = n = 1 ), the linear approximations at each point for the first ten iterations for both CGN and LM are shown in Fig. 2. As can be seen, the gradient used in LM captures the local behaviour of the nonlinear function. The linear approximation used in CGN, on the other hand, captures the global behaviour of the nonlinear function. After nine iterations of the CGN, all the points reached the minimisers with smallest SSR. On the other hand, the LM converges to local minimisers whose SSR are not necessarily the minimum.

Detailed description of the algorithm
Next, we describe the proposed CGN algorithm in detail. In this subsection, we denote a scalar quantity by a lower case letter e.g., a, c, a matrix by a capital letter, e.g., A, or M, and a column vector by a bold symbol of a lower case letter, e.g., v , a , unless otherwise specifically stated. Super script T indicates the transpose. Hence, v T and a T are row vectors.
(1) Pre-iteration process (1-1) Create initial cluster The initial iterates of CGN, a set of vectors are genrated using uniform random numbers in each component within the domain of initial estimate of the plausible location of global minimisers given by the user. The unique point of the CGN is that the user specifies the domain of the initial estimates instead of a point. In this paper, we assume that the domain of initial guess is given by the user by two sets of vectors x L , x U , and the value x (0) ji is sampled from the uniform distribution between x L j and x U j , where x (0) ji is the j th element of vector x (0) i . Note that this does not mean that all the following iterates i ≤ x U . Also note that here we have used uniform distribution; however, other choices of initial distributions are possible. A brief numerical experiments using different initial distributions can be found in "Appendix C".
Store the initial set of vectors in a matrix X (0) , i.e., where the super script (0) indicates the initial iterate.
Evaluate the nonlinear function f at each 1 3

Cluster Gauss-Newton method
If the function f cannot be evaluated at Compute the sum of squared residuals vector r (0) i.e., The concise pseudo-code for the creation of the initial cluster can be found in Algorithm 1.

(2) Main iteration
Repeat the following procedure until the user specified stopping criteria are met. We denote the iteration number as k, which starts from 0 and is incremented by 1 after each iteration.
(2-1) Construct weighted linear approximations of the nonlinear function We first construct a linear approximation around the point Here, A (k) (i) ∈ ℝ m×n describes the slope of the linear approximation around The key difference of our algorithm compared to others is that we construct a Jacobian like matrix A (k) (i) collectively using all the function evaluations of f in the previous iteration, i.e., we solve are always computed at the previous iteration [e.g., as Eq. (12) when k = 0 and in Step 2-3 when k > 0 ]. Hence, no new evaluation of f is required at this step.
The key idea here is that we weight the information of the function evaluation near x (k) i more than the function evaluation further away. That is to say, The importance of this idea can be seen in the numerical experiment presented in "Appendix A".
Noting that we can rewrite Eq. (17) as where where d (k) l(i) ≥ 0 . In this paper, we choose the weights as (17) min respectively  ( l = 1, ..., n ), and ≥ 0 is a constant. We use this weighting scheme so that the "information" from the nonlinear function evaluation from the point closer to the point of approximation is more influential when constructing the linear approximation. The distance between x i and x j are normalised by the size of the domain of initial guess (i.e., x U and x L ). The effect of the weight d (k) j(i) and the parameter and its necessity is analysed in "Appendix A". The minimum norm solution of Eq. (21) is given by where † denotes the Moore-Penrose inverse.
If (2-2) Solve for x that minimises ||y * − ( We now compute the next iterate X (k+1) using the matrices {A (k) (i) } N i=1 similarly to the Gauss-Newton method with Tikhonov regularisation (e.g., Hansen 2005;Björck 1996), i.e., for i = 1, ..., N , where y * is the set of observations one wishes to fit the nonlinear function f to [cf. (1)], and y (k) i The necessity of the regularisation can be seen in the numerical experiment presented in "Appendix B".
(2-3) Update matrices X and Y and vectors r and Evaluate the nonlinear function f for each . Compute the sum of squared residuals vector as r (k+1) . Note that this process can be implemented in an embarrassingly parallel way.
If the residual r i increases, we replace x (k+1) i by x (k) i and increase the regularisation parameter i.e., if can not be evaluated, then let else decrease the regularisation parameter, i.e., for each i = 1, ..., N.
There are various ways to update the regularisation parameter . In this paper we followed the strategy used in the Matlab's implementation of the Levenberg-Marquardt method.
In addition, in this step, we impose the stopping criteria for each point in the iteration. As can be seen in Eq. (26), i for large i , so that we can expect very small update in x (k+1) i . Hence, in order to mimic the minimum step size stopping criteria, we stop the update for i where i > max .
The concise pseudo-code for updating matrices X and Y and vectors r and is given in Algorithm 3.
The influence of the choice of the initial value init of the regularization parameter in Eq. (15) is studied in "Appendix B".

Cluster Gauss-Newton method
We present a concise description of the CGN as a pseudo-code in Algorithm 4.

Motivating example
In this section, we introduce a motivating example to illustrate how the proposed method could be used in practice. We consider a scenario where a newly developed drug is tested for the first time in a human. Before the drug is given to a human, the biochemical properties of the drug are studied in-test-tube (in-vitro) and in-animal experiments. However, how the drug behaves in the human body is still uncertain. Based on the results of in-vitro and in-animal experiments, the team decides that 100mg is a safe amount of the drug to be given to a human and the experiment is conducted with a healthy normal volunteer, and the drug concentration in the blood plasma is measured at points in time. Using these measurements, we estimate the multiple possible model parameters which can be used to simulate various scenarios. The following workflow can be envisioned: 1: Construct a mathematical model based on the understanding of the physiology and biochemical properties of the drug.
In this example, we use the model presented in Watanabe et al. (2009). The mathematical model is depicted in Fig. 3, and it can be written as a system of nonlinear ordinary differential equations with 20 variables. We refer the readers to the supplementary material for the complete description of the model. There are two types of model parameters in this model: physiological parameters and kinetic parameters. Examples of the physiological parameters are the sizes of the organs or the blood flow rates between the organs. As the human physiology is well studied and these parameters usually do not depend on the drugs, we can assume these parameters to be known. The kinetic parameters, such as, how fast the drug gets excreted from the body or how easily it binds to tissues are the parameters that depend on the drug and usually are not very well known. Before the first in-human experiment, the drug development team characterises these parameters using an organ in test-tubes or by administering the drug to an animal. However, these parameters can differ from animal to human, so we do not have a very accurate estimate of these parameters. The differences in these parameters between a human and an animal can be several orders of magnitude. Figure 4 depicts plots of the drug concentration simulation where the kinetic parameters are sampled within a reasonable range of the Fig. 4 Simulation of the drug concentration in the blood plasma using the parameters that are naively sampled from the range of possible kinetic parameters. Note that these simulations do not give any useful information parameters. As can be seen in Fig. 4, we cannot obtain any useful information just by randomly sampling the kinetic parameters from the feasible range.
2: Sample multiple possible model parameter sets that fit the model prediction of the drug concentration to the observed data from the 100mg experiment. We now use the observed data from the experiment where 100mg of the drug was given to a human. The red dots in Fig. 5 depict the observed data. The left panel of Fig. 5 shows some of the simulation results using the parameter sets of the initial iterate of CGN. The right panel of Fig. 5 shows some of the simulation results after 20 iterations of CGN. As can be seen in Fig. 5, CGN can find multiple sets of parameters that fit the observed data. The parameter values are depicted in the box plots in Fig. 6. As can be seen in Fig. 6, after 20 iterations of CGN, the distribution of some of the parameters shrinks significantly suggesting these parameters can be identified from the observations while the distribution of some of the parameters are unchanged indicating that these parameters cannot be identified from the observation. In Fig. 7, we show scatter plots of the parameters found by CGN. As can be seen in Fig. 7, even if the parameter cannot be identified from the observation, some nonlinear relationships can occasionally be identified between the parameters.

Numerical experiments
In this section we illustrate the advantages of the proposed CGN algorithm by numerical experiments on three PBPK models.

Numerical experiment setup
In this subsection we specify the details of the numerical experiment set-up.

Mathematical models
For the numerical experiments, we used the following three published mathematical models of drug concentration in the blood of a human body (PBPK model) (Fukuchi et al. 2017;Yao et al. 2018;Yoshikado et al. 2016). The time course drug concentration was simulated using the model, and random noise was added to mimic the observation uncertainties. The random noise was generated by a normal distribution with a standard deviation of 10% of the simulated concentration value. The simulated drug concentration was used as the test dataset, and multiple possible parameters were estimated from the test dataset using CGN and conventional methods. The mathematical description of each PBPK model used for the numerical experiments is summarised in Table 1. Also the detailed Scatter plots of parameters. As can be seen in this example, we can find parameter-parameter correlations of some of the parameters found by CGN. (parameters whose corresponding SSR is the least 100 within the cluster of 1,000, i.e., the same parameter sets as the one used to plot the right hand side of Fig. 5)

Computation environment
All computational experiments were performed using Matlab 2019a on 3.1 GHz Intel Core i5 processors with MacOS version 10.13.6. When using the algorithms implemented in Python we used Python version 3.7. All results of the numerical experiments were summarised and visualised using ggplot2 version 2.2.1 (Wickham 2016) in R version 3.3.2.

The initial set of vectors
We generated the initial set of vectors {x (0) i } N i=1 randomly based on Algorithm 1. The range of the initial cluster was set by the pharmacologically likely parameter range based on a priori knowledge defined by domain specialists (e.g., from the values obtained from the animal experiments or lab experiments).
We stored and used this initial set of vectors as the initial cluster for the CGN and also as initial iterates for the other nonlinear least squares solvers and optimisation algorithms which were compared.
Notice that this range of the initial cluster does not necessary contain the global minimiser as many pharmacokinetics parameter can be few order of magnitudes different between species. For the sake of comparison with the constrained optimisation algorithms, we constructed Example 2 so that the initial range contains the set of parameters that we used to create the dataset. This ensures that a global minimiser is in the domain where the initial cluster is made. When comparing with the constrained optimisation algorithms, we used this domain as the bound constraint for the parameter search for these algorithms.

ODE solver
For Examples 1 and 3 the nonlinear function evaluations require the numerical solution of stiff systems of ODEs. We used the ODE15s solver ( Shampine and Reichelt 1997) with the default settings to solve these ODEs (relative tolerance 10 −3 , absolute tolerance 10 −6 ). We observed that for some set of parameters, the ODE solver can get stuck in an infinite loop. Here, we set the timeout, where if the ODE evaluation takes longer than 5 seconds, it terminates the ODE evaluation process and returns a not-a-number vector.

Setting for the cluster Gauss-Newton (CGN) method:
We used the following parameters unless stated otherwise: Here k max is the maximum number of iterations of the method. We used the initial set of vectors {x (0) i } N i=1 described in Sect. 4.1.3 as the initial cluster. Note that we chose init to be consistent with the default setting of the LM implementation (lsqnonlin in Matlab.) In "Appendicies A, B, and C", we have presented the numerical experiments on the influence of the user-defined algorithm parameters.

Algorithms compared
Here, we list the conventional and recently developed algorithms that we compared our proposed algorithm with. We used the default setting of the algorithms unless stated otherwise.

Nonlinear least squares solvers
We compare the proposed algorithm against gradient-based and gradient-free nonlinear least squares algorithm s. We used these algorithms to find multiple approximate minimisers of the nonlinear least square problem of our interest by repeatedly applying these algorithms with various initial iterates. Namely; we used each parameter vector in , which were randomly generated in the CGN method as the initial iterate, and excecuted each algorithm repeatedly to obtain N sets of estimated parameter vectors.
We use LM with the default setting as well as setting 'FiniteDifferenceS-tepSize' to be the square root of the default accuracy of the ODE solve (e.g., FiniteDifferenceStepSize = √ AbsTol = √ 10 −6 ). Note that the FiniteDifferenceS-tepSize of the default setting is 10 −6 .
Trust-Region (TR) method The conventional gradient-based nonlinear least squares solver, Trust-Region method (Conn et al. 2000) implemented in the lsqcurvefit function in Matlab.
DFO-LS method A gradient-free nonlinear least squares solver DFO-LS method . We obtained the Python code of DFO-LS Version 1.0.2 from http://peopl e.maths .ox.ac.uk/rober tsl/dfols / (last accessed on June 6th 2019). To make the numerical method for solving the model ODEs exactly the same, we Libensemble with POUNDERS We used the libensemble algorithm (Hudson et al. 2019) that was available at the developer branch of https ://githu b.com/Liben sembl e/liben sembl e (last accessed April 25th 2019) together with petsc version 3.10.4 available as https ://www.mcs.anl.gov/petsc /index .html. We used POUND-ERS as the nonlinear least squares solver. As libensemble was implemented in Python, we used the MATLAB Engine API to call the nonlinear function. POUND-ERS algorithm utilises bound constraints for the search domain, and we tested this algorithm only for Example 2.

Optimisation algorithms
We compared the proposed method against the optimisation algorithms that are designed to minimise a function which takes a vector quantity as an input and scalar quantity as output. We applied these algorithms to our nonlinear least squares problem by minimising SSR defined in Eq. (3) Quasi-Newton method We used the Quasi-Newton method implemented in Matlab's Optimisation toolbox as the fminunc function. This implementation uses the BFGS formula to update the approximate Hessian.
The following algorithms require bound constraints, and they were tested only for Example 2.
Implicit filter method This is another gradient-free optimisation algorithm. The code was downloaded from https ://archi ve.siam.org/books /se23/ (last accessed June 7th 2019). We used the 'least-squares' option with a budget of 200.
Surrogate optimisation method We used the Surrogate optimisation algorithm implemented in Matlab's Global Optimisation Toolbox as the surrogateopt function. We repeated this algorithm 250 times with distinct random-seed to obtain 250 global minimisers of the nonlinear-least squares problem of our interest.
Particle Swarm We used the Particle Swarm optimisation algorithm implemented in Matlab's Global Optimisation Toolbox as the particleswarm function. We repeated this algorithm 250 times with distinct random seeds to obtain 250 global minimisers of the nonlinear least squares problem of our interest.
Genetic Algorithm We used the Genetic algorithm implemented in Matlab's Global Optimisation Toolbox as the ga function. As one run of this algorithm required over 100,000 function evaluations for our examples, we did not repeat this algorithm to obtain multiple global minimisers.
DIRECT A sampling algorithm DIRECT (Jones et al. 1993). The Matlab implementation was obtained from https ://searc hcode .com/codes earch /view/12449 743/ (last accessed July 30th 2019). As one run of this algorithm required over 100,000 function evaluations for our examples, we did not repeat this algorithm to obtain multiple global minimisers.

Results
We compared the proposed CGN method with various existing algorithms. We compared the 'speed' of the algorithm by the number of function evaluations required and we compared the 'quality' of the minimisers found by the SSR (smaller the SSR the better minimiser). As we can see in Table 2, the dominant part of the computation time was spent by the nonlinear function evaluations in CGN. Hence, we claim that the number of function evaluations is a fair way to compare the computation cost.
We compared the speed for finding acceptable minimisers. We define the acceptable minimisers as the minimisers with SSR less than the SSR of the parameter that was used to generate the test dataset. In Fig. 8, we show the number of acceptable minimisers found given the total number of function evaluations. As can be seen, CGN is significantly faster than any other method for finding acceptable minimisers. In general, the nonlinear least squares solvers were faster than using optimisation algorithms to minimise SSR. Also, some optimisation methods could not find or could only find a few acceptable minimisers. Among the various least squares solvers, the derivative-free methods (CGN, DFO LS, and libensemble POUNDERS) were usually faster than the derivative-based methods.
As this analysis depends on how we define acceptable minimisers, we plotted the number of minimisers found for a given SSR threshold in Fig. 9. For this analysis each algorithm was run until its default stopping criterion (given in the reference of each method) was met. We summarised the number of function evaluations that each algorithm required to meet its default stopping criterion in Table 3.
As can be seen in Fig. 9, for Examples 1 and 3, CGN finds more approximate minimsers with small SSR than all the conventional algorithms. For Example 2, as we know that at least one global minimiser is enclosed in the domain defined by x U and x L , we were able to apply optimisation algorithms with bound constraints (libsensemble with POUNDERS, implicit filter, surrogate optimisation, and particle swarm). For this problem, particle swarm obtained slightly more approximate minimisers with small SSR than CGN. However, as can be seen in Table 3, particle swarm required significantly more function evaluations than the CGN (particle swarm: 3,646,900 v.s. CGN: 6768). Aside from the particle swarm, CGN outperformed all the other methods in the number of acceptable minimisers found. In addition to above rigorous comparison, we applied two global optimisation algorithms, Genetic algorithm (GA) and DIRECT to Example 2. GA and DIRECT required 120,600 and 100,063 nonlinear function evaluations, respectively, to obtain one minimiser each. The SSR of the minimisers found by GA and DIRECTwere 0.0197 and 0.244, respectively, while the minimum SSR found by the CGN is 0.0188. Hence, even to find just one minimiser, CGN is faster and more accurate than these two methods.

Concluding remarks
We proposed the Cluster Gauss-Newton (CGN) method, a new derivative free method specifically designed for finding multiple approximate minimisers of a nonlinear least squares problem. The development of this algorithm was motivated by the parameter estimation of physiologically based pharmacokinetic Fig. 9 Number of minimisers (out of 250) found by each method for given accuracy threshold (tSSR). Smaller SSR indicates more accurate solution to the nonlinear-least squares problem (PBPK) models that appears in pharmaceutical drug development. The particular nature of the model, where the model is over-parameterised, and consideration of multiple possible parameters is necessary, motivated us to develop the new method. The fact that our algorithm obtains multiple approximate minimisers collectively makes it significantly more computationally efficient compared to existing nonlinear least squares solvers or optimisation algorithms. In addition, we observed that in general, CGN obtains minimisers with smaller sum of squared residuals (SSR) than existing algorithms. We demonstrated these advantages using three examples that come from real world drug development projects.
By minimising the assumption on the nonlinear function, where it can be a "black box", we have ensured the ease of use and implementation for those who may not have a substantial background in mathematics or scientific computing. We believe this advantage of the proposed method will be appreciated by potential users of the algorithm in industry. In this paper, we used the pharmacokinetics models as examples. However, as we do not assume any particular form of the nonlinear function, we believe the proposed method can be used for many other mathematical models in various scientific fields.
CGN performs sufficiently well for the current use in pharmacokinetic model analyses. However, we believe there are various possible extensions of CGN. First, CGN only makes use of the function evaluations from the one previous iteration. On the other hand, we can consider using the function evaluations from all the previous iterations similarly to the multi-secant method (Eyert 1996;Bierlaire and Crittin 2006;Hicken et al. 2017). CGN and multi-secant methods already share similarities; both methods construct global linear approximations from previously conducted nonlinear function evaluations; this gives an error-tolerant nature and computational efficiency. The error-tolerant nature of the CGN can be seen in "Appendix D", where the CGN converges even if the nonlinear function is evaluated only up to the first decimal place. Given the similarity, combining these two methods may lead to significant improvement. Second, currently, the CGN does not detect if two or more points in the cluster converge to the same points; hence the computation becomes redundant. We may be able to save computational cost even further by introducing

3
Cluster Gauss-Newton method In Fig. 10, we show the number of minimisers found by CGN for given accuracy threshold. As can be seen from the figure , the weight for the linear approximation improves the accuracy and the speed of CGN. Note that = 0 , which corresponds to giving equal weights to all the cluster points is not optimal. In this example, ≥ 1 gave good results.

Numerical experiments on the influence of the regularisation
To illustrate the necessity and the influence of the regularisation in (2-2) of the algorithm [Eq. (26)], we conducted numerical experiments using Example 1 of Sect. 4. We varied the initial value of the regularisation coefficient init and tested CGN. Figure 11 shows the number of minimisers found by CGN for given accuracy threshold. As can be seen from the figure, regularisation is necessary for CGN to perform well. For this example, init = 0.1 gave a good result.
have used chaospy (Feinberg and Langtangen 2015) and pyDOE (https ://pytho nhost ed.org/pyDOE /) modules on Python version 3.7.1 to generate initial distributions. As can be seen in Fig. 12, when there is a sufficient number of points in the cluster (e.g., N = 1000), then quasi-random sampling technique such as Latin hypercube or Sobol sequence outperformed the uniform random sampling. On the other hand, the uniform random sampling performed consistently well across all numbers of points in the cluster we have tested. Fig. 12 Number of minimisers found for Example 1 given accuracy threshold (tSSR) using various number of points in the cluster and initial distributions. Smaller SSR indicates more accurate solution to the nonlinear-least squares problem Fig. 13 Example 1 when the nonlinear function is rounded to one decimal place: Number of solutions (out of 250) found by the various methods for given accuracy threshold (tSSR). Smaller SSR indicates more accurate solution to the nonlinear-least squares problem