In this section we describe the proposed algorithm. We first introduce a rough concept using a toy example in Sect. 2.1 and then introduce the full algorithm in detail in Sect. 2.2.
Brief explanation of the algorithm
The aim of the proposed Cluster Gauss–Newton (CGN) algorithm is to efficiently find multiple approximate minimisers of the nonlinear least squares problem (1). We do so by first creating a collection of initial guesses which we call the ‘cluster’. Then, we move the cluster iteratively using linear approximations of the nonlinear function \(\varvec{f}\), similarly to the Gauss–Newton method (Björck 1996).
The distinctive idea of the CGN method is that the linear approximation is constructed collectively from the points in the cluster, instead of using the Jacobian matrix, which approximates the nonlinear function linearly at a single point, as in the Gauss–Newton or Levenberg–Marquardt (LM) methods. By using the points in the cluster to construct a linear approximation, instead of explicitly approximating the Jacobian, we minimise the computational cost associated with evaluating the nonlinear function (i.e., the mathematical model) at each iteration. In addition, by constructing the linear approximation using non-local information, CGN is more likely than Jacobian-based methods to converge to approximate minimisers with smaller SSR.
In order to visualise the key differences between the proposed linear approximation (CGN) and the Jacobian (i.e., derivative) approaches, we consider the nonlinear function
$$\begin{aligned} f= & {} \left\{ \begin{array}{ll} (x+1)^2-2\cos (10(x+1))+5&{}\text {if }x<-1\\ 3&{}\text {if }-1 \le x \le 1 \\ (x-1)^2-2\cos (10(x-1))+5&{}\text {if }x>1 \end{array} \right. \end{aligned}$$
(7)
(see Fig. 2) and aim to find global minimisers. Any point \(x\in [-1,1]\) is a global minimiser of this problem. Hence, this problem has nonunique global minimisers. Let the points of the initial iterates be:
$$\begin{aligned}&x_1=-6.3797853, \qquad x_2=-4.1656025, \qquad x_3=-3.6145728,\nonumber \\&x_4=2.0755468, \qquad x_5= 4.1540421. \end{aligned}$$
(8)
We now compute the linear approximations used to move these points in the cluster to minimise the function f.
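For concreteness, the toy function of Eq. (7) and the initial cluster of Eq. (8) can be written down directly. The following minimal NumPy sketch (function and variable names are ours, for illustration only) encodes both:

```python
import numpy as np

def f(x):
    """Piecewise toy function of Eq. (7): constant (value 3) on [-1, 1],
    oscillatory quadratic growth outside, continuous at x = -1 and x = 1."""
    if x < -1:
        return (x + 1)**2 - 2*np.cos(10*(x + 1)) + 5
    elif x <= 1:
        return 3.0
    else:
        return (x - 1)**2 - 2*np.cos(10*(x - 1)) + 5

# Initial cluster of Eq. (8)
cluster = np.array([-6.3797853, -4.1656025, -3.6145728, 2.0755468, 4.1540421])
values = np.array([f(x) for x in cluster])
```

Any point in \([-1,1]\) attains the global minimum value 3, which is what makes the minimiser nonunique.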
Gradient (LM)
For this nonlinear function, since the function is given in analytic form, we can obtain the gradient explicitly. In practice, when f is given as a “black box”, we can approximate the derivative by a finite difference scheme, for example, \(f'(x_i)\approx \frac{f(x_i+\epsilon )-f(x_i)}{\epsilon }\). Then, the linear approximation at \(x_i\) can be written as \(f(x) \approx \frac{f(x_i+\epsilon )-f(x_i)}{\epsilon }(x-x_i)+f(x_i)\). Notice that this requires one extra evaluation of f at \(x_i+\epsilon\) for each \(x_i\). When a full gradient estimate is required, the number of extra function evaluations equals the number of independent variables of f. (This is not the case if a directional derivative estimate is used.) If f is given by a system of ODEs, one may use the adjoint method to obtain the derivatives more efficiently; however, this requires solving an additional system of ODEs (the adjoint equation). More importantly, iterates of methods based on the gradient may converge to local minimisers, since they use only local gradient information.
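The forward-difference scheme above generalises componentwise to the multivariate case; a generic sketch (the function name `fd_gradient` and the default step `eps` are our choices) makes the evaluation cost explicit:

```python
import numpy as np

def fd_gradient(f, x, eps=1e-6):
    """Forward-difference gradient estimate: one extra evaluation of f
    per component of x, i.e. n extra evaluations for a full gradient."""
    x = np.asarray(x, dtype=float)
    f0 = f(x)                      # base evaluation, reused for every component
    grad = np.empty_like(x)
    for i in range(x.size):
        x_eps = x.copy()
        x_eps[i] += eps            # perturb one component at a time
        grad[i] = (f(x_eps) - f0) / eps
    return grad
```

For \(f(x)=\sum_j x_j^2\) at \(x=(1,2)\) this returns approximately \((2,4)\), matching the analytic gradient \(2x\).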
Cluster Gauss–Newton (CGN) method (proposed method)
In the proposed method, we construct a linear approximation for each point in the cluster while using the values of f at the other points in the cluster, so that the nonlinear function is approximated globally by a linear function. The influence of each other point in the cluster on the linear approximation is weighted according to how close it is to the point of approximation, i.e.,
$$\begin{aligned} \min _{a_{(i)}}\sum _{j\ne i}\left( d_{j(i)}\left( (x_j-x_i)a_{(i)} -\left( f(x_j)-f(x_i)\right) \right) \right) ^{\,2} \end{aligned}$$
(9)
where \(a_{(i)}\) is the slope of the linear approximation at \(x_i\), and the linear approximation at \(x_i\) can be written as \(f(x)\approx a_{(i)}(x-x_i)+f(x_i)\). There are many possibilities for the weight \(d_{j(i)}\). In this paper, we choose \(d_{j(i)}=(x_j-x_i)^{-2\gamma }\) where \(\gamma \ge 0\) (\(\gamma = 0\) corresponds to uniform weights). Note that Eq. (9) can also be regarded as the weighted least squares solution of a system of linear equations with weights \(d_{j(i)}\). This choice of weight is motivated by the idea that, when constructing the linear approximation, information from neighbouring points in the cluster should count more than information from points further away. Note that we do not require any extra evaluation of f to obtain these linear approximations.
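Since the objective in Eq. (9) is quadratic in \(a_{(i)}\), it admits a closed-form solution. A minimal NumPy sketch (the function name and the use of absolute values in the weights are our choices; for integer \(\gamma\) the absolute value changes nothing since the exponent \(-2\gamma\) is even):

```python
import numpy as np

def cgn_slope_1d(xs, ys, i, gamma=1.0):
    """Weighted least-squares slope a_(i) of Eq. (9) at cluster point xs[i],
    with weights d_{j(i)} = |x_j - x_i|^(-2*gamma) for j != i."""
    dx = np.delete(xs - xs[i], i)          # x_j - x_i,       j != i
    dy = np.delete(ys - ys[i], i)          # f(x_j) - f(x_i), j != i
    d = np.abs(dx)**(-2.0 * gamma)         # weights d_{j(i)}
    # minimise sum_j (d_j * (dx_j * a - dy_j))^2  ->  closed-form solution
    return np.sum(d**2 * dx * dy) / np.sum(d**2 * dx**2)
```

As a sanity check, for data lying exactly on a line the weighted slope reproduces that line's slope regardless of the weights.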
For the multi-dimensional nonlinear function \(\varvec{f}:\mathbb {R}^n\rightarrow \mathbb {R}^m\), when we have N points in the cluster, we solve the following linear least squares problem:
$$\begin{aligned} \min _{A_{(i)}\in \mathbb {R}^{m\times n}} \left| \left| D_{(i)}\left( \varDelta X^\text {T}_{(i)} A^\text {T}_{(i)} -\varDelta Y^\text {T}_{(i)}\right) \right| \right| _\text {F}^2 \end{aligned}$$
(10)
where \(D_{(i)}=\text {diag}(d_{1i}, \ldots ,d_{Ni})\) is a diagonal matrix defining the weights, \(\varDelta X_{(i)}\in \mathbb {R}^{N\times n}\) is the difference between all the cluster points and \(\varvec{x}_i\), and \(\varDelta Y_{(i)}\in \mathbb {R}^{N\times m}\) is the difference between the nonlinear function \(\varvec{f}\) evaluated at all the cluster points and at \(\varvec{x}_i\). The precise definitions of these quantities and the derivation of (10) are given in the next subsection.
For the one-dimensional case (i.e., \(m=n=1\)), the linear approximations at each point for the first ten iterations of both CGN and LM are shown in Fig. 2. As can be seen, the gradient used in LM captures the local behaviour of the nonlinear function, whereas the linear approximation used in CGN captures its global behaviour. After nine iterations of CGN, all the points have reached minimisers with the smallest SSR. The LM iterates, on the other hand, converge to local minimisers whose SSR is not necessarily the minimum.
Detailed description of the algorithm
Next, we describe the proposed CGN algorithm in detail. In this subsection, we denote a scalar quantity by a lower-case letter, e.g., a, c, a matrix by a capital letter, e.g., A or M, and a column vector by a bold lower-case symbol, e.g., \(\varvec{v}\), \(\varvec{a}\), unless otherwise stated. The superscript \(\text {T}\) indicates the transpose; hence, \(\varvec{v}^{\text {T}}\) and \(\varvec{a}^{\text {T}}\) are row vectors.
(1) Pre-iteration process
(1-1) Create initial cluster
The initial iterates of CGN, a set of vectors \(\{\varvec{x}_{i}^{(0)}\}_{i=1}^{N}\), are generated by sampling each component uniformly at random within a user-specified domain containing the plausible locations of the global minimisers. A distinctive feature of CGN is that the user specifies a domain of initial estimates instead of a single point. In this paper, we assume that the domain of initial guesses is given by the user as two vectors \(\varvec{x}^\text {L}\), \(\varvec{x}^\text {U}\), and the value \(x_{ji}^{(0)}\) is sampled from the uniform distribution between \(x^\text {L}_j\) and \(x^\text {U}_j\), where \(x_{ji}^{(0)}\) is the j th element of the vector \(\varvec{x}_i^{(0)}\).
Note that this does not mean that all the following iterates \(\varvec{x}_i^{(k)} \quad (k \ge 1, i=1,2, \ldots , N)\) must satisfy \(\varvec{x}^L \le \varvec{x}_i^{(k)} \le \varvec{x}^U\). Also note that we have used the uniform distribution here; other choices of initial distribution are possible. Brief numerical experiments using different initial distributions can be found in “Appendix C”.
Store the initial set of vectors in a matrix \(X^{(0)}\), i.e.,
$$\begin{aligned} X^{(0)} =[\varvec{x}_{1}^{(0)}, \varvec{x}_{2}^{(0)}, \ldots , \varvec{x}_{N}^{(0)} ] \end{aligned}$$
(11)
where the superscript (0) indicates the initial iterate.
Evaluate the nonlinear function \(\varvec{f}\) at each \(\varvec{x}_{i}^{(0)}\) as \(\varvec{y}_{i}^{(0)}=\varvec{f}(\varvec{x}_{i}^{(0)}) \quad (i=1,2,\ldots ,N)\) and store in matrix \(Y^{(0)}\), i.e.,
$$\begin{aligned} Y^{(0)}=\left[ \varvec{y}_{1}^{(0)}, \varvec{y}_{2}^{(0)}, \ldots , \varvec{y}_{N}^{(0)}\right] . \end{aligned}$$
(12)
If the function \(\varvec{f}\) cannot be evaluated at \(\varvec{x}_{i}^{(0)}\), then re-sample \(\varvec{x}_{i}^{(0)}\) until \(\varvec{f}\) can be evaluated.
Compute the sum of squared residuals vector \(\varvec{r}^{(0)}\), i.e.,
$$\begin{aligned}&r_{i}^{(0)}=||\varvec{y}_i^{(0)} - \varvec{y}^* ||_2^{\,2} \quad (i=1,2,\ldots ,N) \end{aligned}$$
(13)
$$\begin{aligned}&\varvec{r}^{(0)}=\left[ r_{1}^{(0)}, r_{2}^{(0)}, \ldots , r_{N}^{(0)} \right] ^{\text {T}}. \end{aligned}$$
(14)
The concise pseudo-code for the creation of the initial cluster can be found in Algorithm 1.
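As a rough illustration of this pre-iteration step (not the authors' reference implementation; the function name, the re-sampling cap `max_tries`, and the use of an exception to signal a failed evaluation are our assumptions), the initial cluster could be generated as:

```python
import numpy as np

def create_initial_cluster(f, x_lower, x_upper, N, y_star,
                           rng=None, max_tries=100):
    """Sketch of the pre-iteration step: sample each component of x_i^(0)
    uniformly between x^L and x^U, re-sampling any point at which f cannot
    be evaluated, then record y_i^(0) = f(x_i^(0)) and the SSR r_i^(0)."""
    rng = np.random.default_rng() if rng is None else rng
    X, Y = [], []
    for _ in range(N):
        for _ in range(max_tries):        # cap on re-sampling (our choice)
            x = rng.uniform(x_lower, x_upper)
            try:
                y = np.asarray(f(x))      # may fail, e.g. if an ODE solve diverges
            except Exception:
                continue                  # re-sample x_i^(0)
            X.append(x)
            Y.append(y)
            break
    X = np.array(X).T                     # columns are the points x_i^(0), Eq. (11)
    Y = np.array(Y).T                     # columns are y_i^(0), Eq. (12)
    r = np.sum((Y - np.asarray(y_star)[:, None])**2, axis=0)   # Eq. (13)
    return X, Y, r
```

The evaluations of f across the N points are independent, so this loop parallelises trivially.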
(1-2) Initialise regularisation parameter vector
Fill the regularisation parameter vector \(\varvec{\lambda }^{(0)}\in \mathbb {R}^N\) with the user-specified initial regularisation parameter \(\lambda _\text {init}\), i.e.,
$$\begin{aligned} \varvec{\lambda }^{(0)}=[\lambda _\text {init},\lambda _ \text {init},...,\lambda _\text {init}]^{\text {T}} \end{aligned}$$
(15)
(2) Main iteration
Repeat the following procedure until the user-specified stopping criteria are met. We denote the iteration number by k, which starts from 0 and is incremented by 1 after each iteration.
(2-1) Construct weighted linear approximations of the nonlinear function
We first construct a linear approximation around the point \(\varvec{x}_{i}^{(k)}\) such that
$$\begin{aligned} \varvec{f}(\varvec{x})\approx A_{(i)}^{(k)}(\varvec{x}-\varvec{x}_{i}^{(k)}) +\varvec{f}(\varvec{x}_{i}^{(k)}) \,, \end{aligned}$$
(16)
Here, \(A^{(k)}_{(i)} \in \mathbb {R}^{m\times n}\) describes the slope of the linear approximation around \(\varvec{x}_{i}^{(k)}\).
The key difference of our algorithm compared to other methods is that we construct a Jacobian-like matrix \(A_{(i)}^{(k)}\) collectively, using all the function evaluations of \(\varvec{f}\) from the previous iteration, i.e., we solve
$$\begin{aligned}&\min _{A_{(i)}^{(k)} \in \mathbb {R}^{m \times n} }\sum _{j=1}^N\left[ d_{j(i)}^{(k)} \left| \left| \varvec{f}(\varvec{x}_j^{(k)})-\left\{ A_{(i)}^{(k)} \left( \varvec{x}_j^{(k)}-\varvec{x}_i^{(k)}\right) +\varvec{f}(\varvec{x}_i^{(k)})\right\} \right| \right| _2 \right] ^2 \nonumber \\&\quad = \min _{A_{(i)}^{(k)} \in \mathbb {R}^{m \times n} }\sum _{j=1}^N\left( d_{j(i)}^{(k)} \left| \left| \varDelta \varvec{y}_{j(i)}^{(k)} - A_{(i)}^{(k)} \varDelta \varvec{x}_{j(i)}^{(k)} \right| \right| _2 \right) ^2 \end{aligned}$$
(17)
for \(i=1,...,N\), where \(d_{j(i)}^{(k)}\ge 0\), \(j=1,...,N\) are weights. Here, \(\varDelta \varvec{y}_{j(i)}^{(k)} = \varvec{f}(\varvec{x}_j^{(k)})-\varvec{f}(\varvec{x}_i^{(k)})\in \mathbb {R}^m\) and \(\varDelta \varvec{x}_{j(i)}^{(k)} = \varvec{x}_j^{(k)}-\varvec{x}_i^{(k)}\in \mathbb {R}^n\). (Note that \(\varDelta \varvec{y}_{i(i)}^{(k)} = \varvec{0}\), \(\varDelta \varvec{x}_{i(i)}^{(k)} = \varvec{0}\).) Also, let
$$\begin{aligned} \varDelta Y_{(i)}^{(k)}= & {} \left[ \varDelta \varvec{y}_{1(i)}^{(k)},\varDelta \varvec{y}_{2(i)}^{(k)}, \dots \varDelta \varvec{y}_{N(i)}^{(k)} \right] \in \mathbb {R}^{m \times N} \end{aligned}$$
(18)
$$\begin{aligned} \varDelta X_{(i)}^{(k)}= & {} \left[ \varDelta \varvec{x}_{1(i)}^{(k)},\varDelta \varvec{x}_{2(i)}^{(k)}, \dots \varDelta \varvec{x}_{N(i)}^{(k)} \right] \in \mathbb {R}^{n \times N}. \end{aligned}$$
(19)
Note that \(\varvec{f}(\varvec{x}_{i}^{(k)})\) are always computed at the previous iteration [e.g., as Eq. (12) when \(k=0\) and in Step 2–3 when \(k>0\)]. Hence, no new evaluation of \(\varvec{f}\) is required at this step.
The key idea here is that we weight the information of the function evaluation near \(\varvec{x}_{i}^{(k)}\) more than the function evaluation further away. That is to say, \(d_{j(i)}^{(k)}>d_{j'(i)}^{(k)}\) if \(||\varvec{x}_{j}^{(k)}-\varvec{x}_{i}^{(k)}||<||\varvec{x}_{j'}^{(k)}-\varvec{x}_{i}^{(k)}||\). The importance of this idea can be seen in the numerical experiment presented in “Appendix A”.
Noting that
$$\begin{aligned} \sum _{j=1}^N\left( d_{j(i)}^{(k)}\left| \left| A^{(k)}_{(i)}\varDelta \varvec{x}_{j(i)}^{(k)}-\varDelta \varvec{y}_{j(i)}^{(k)} \right| \right| _2 \right) ^2= & {} \left| \left| \left( A^{(k)}_{(i)}\varDelta X_{(i)}^{(k)} -\varDelta Y_{(i)}^{(k)} \right) D^{(k)}_{(i)} \right| \right| _\text {F}^{\,2}\nonumber \\= & {} \left| \left| D^{(k)}_{(i)}\left( {\varDelta X_{(i)}^{(k)} }^\text {T} A^{(k)\text {T}}_{(i)}- { \varDelta Y_{(i)}^{(k)} }^\text {T} \right) \right| \right| _\text {F}^{\,2}\,, \end{aligned}$$
(20)
we can rewrite Eq. (17) as
$$\begin{aligned} \min _{A^{(k)}_{(i)}\in \mathbb {R}^{m\times n}} \left| \left| D^{(k)}_{(i)}\left( { \varDelta X^{(k)}_{(i)} }^\text {T} A^{(k)\text {T}}_{(i)} - { \varDelta Y^{(k)}_{(i)} }^\text {T} \right) \right| \right| ^2_\text {F} \end{aligned}$$
(21)
where
$$\begin{aligned} D_{(i)}^{(k)}= & {} \text {diag}\left( d^{(k)}_{1(i)},d^{(k)}_{2(i)},..., d^{(k)}_{N(i)}\right) , \end{aligned}$$
(22)
where \(d^{(k)}_{l(i)}\ge 0\). In this paper, we choose the weights as
$$\begin{aligned} d_{j(i)}^{(k)}=\left\{ \begin{array}{ll}\left( \frac{1}{ \sum _{l=1}^{n}(({x}_{lj}^{(k)}-{x}_{li}^{(k)})/(x_l^\text {U}-x_l^\text {L}))^2}\right) ^\gamma &{} \text {if }j\ne i\\ 0&{}\text {if }j = i \end{array} \right. , \end{aligned}$$
(23)
where \({x}_{lj}^{(k)}, x_l^\text {U}, x_l^\text {L}\) are the l th elements of the vectors \(\varvec{x}_{j}^{(k)}, \varvec{x}^\text {U}, \varvec{x}^\text {L}\), respectively (\(l=1,...,n\)), and \(\gamma \ge 0\) is a constant. We use this weighting scheme so that the “information” from the nonlinear function evaluation at a point closer to the point of approximation is more influential when constructing the linear approximation. The distance between \(\varvec{x}_{i}\) and \(\varvec{x}_{j}\) is normalised by the size of the domain of initial guesses (i.e., \(\varvec{x}^\text {U}\) and \(\varvec{x}^\text {L}\)). The effect and necessity of the weight \(d_{j(i)}^{(k)}\) and the parameter \(\gamma\) are analysed in “Appendix A”. The minimum norm solution of Eq. (21) is given by
$$\begin{aligned} A^{(k)}_{(i)} = \varDelta Y^{(k)}_{(i)} D^{(k)}_{(i)} \left( \varDelta X^{(k)}_{(i)} D^{(k)}_{(i)} \right) ^\dagger \,, \end{aligned}$$
(24)
where \(^\dagger\) denotes the Moore–Penrose inverse.
If \({\mathrm{rank}}\, \varDelta X^{(k)}_{(i)} D^{(k)}_{(i)} = n\), then
$$\begin{aligned} A^{(k)}_{(i)} = \varDelta Y^{(k)}_{(i)} D^{(k)}_{(i)} \left( \varDelta X^{(k)}_{(i)} D^{(k)}_{(i)} \right) ^{\text {T}} \left\{ \left( \varDelta X^{(k)}_{(i)} D^{(k)}_{(i)} \right) \left( \varDelta X^{(k)}_{(i)} D^{(k)}_{(i)} \right) ^{\text {T}} \right\} ^{-1} . \end{aligned}$$
(25)
Generically, \(\text {rank}\, \varDelta X^{(k)}_{(i)} D^{(k)}_{(i)} = n\). However, \(\text {rank}\, \varDelta X^{(k)}_{(i)} < n\) can happen when \(x^{(k)}_{li} = c \quad (i=1,2,\ldots , N)\) for some constant c, i.e., when the l-th components of all the \(\varvec{x}^{(k)}_{i}\) are equal, so that the cluster points lie in the same hyperplane \(x_l = c\).
The concise pseudo-code for the weighted linear approximation can be found in Algorithm 2.
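A sketch of this step in NumPy (our naming; Eq. (24) is evaluated via the Moore–Penrose pseudoinverse, which also covers the full-rank case of Eq. (25)) might look like:

```python
import numpy as np

def cgn_slopes(X, Y, x_lower, x_upper, gamma=1.0):
    """Sketch of Step 2-1: for each cluster point x_i, solve the weighted
    least-squares problem (21) for the Jacobian-like matrix A_(i), with the
    weights of Eq. (23). X is n-by-N (points), Y is m-by-N (values)."""
    n, N = X.shape
    scale = (np.asarray(x_upper) - np.asarray(x_lower))[:, None]
    A_list = []
    for i in range(N):
        dX = X - X[:, [i]]                 # Delta X_(i), n x N, Eq. (19)
        dY = Y - Y[:, [i]]                 # Delta Y_(i), m x N, Eq. (18)
        dist2 = np.sum((dX / scale)**2, axis=0)   # normalised squared distances
        d = np.zeros(N)
        mask = dist2 > 0
        d[mask] = dist2[mask]**(-gamma)    # Eq. (23); d_{i(i)} = 0
        # minimum-norm solution of Eq. (21): A = dY D (dX D)^+, Eq. (24)
        A = (dY * d) @ np.linalg.pinv(dX * d)
        A_list.append(A)
    return A_list
```

If \(\varvec{f}\) is exactly linear, \(\varvec{f}(\varvec{x}) = M\varvec{x}\), every \(A_{(i)}\) recovers \(M\) exactly whenever \(\varDelta X^{(k)}_{(i)} D^{(k)}_{(i)}\) has full rank n, which illustrates why no extra evaluations of \(\varvec{f}\) are needed.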
(2-2) Solve for \(\varvec{x}\) that minimises \(||\varvec{y}^*- (A_{(i)}^{(k)} (\varvec{x}-\varvec{x}_{i}^{(k)})+\varvec{f}(\varvec{x}_{i}^{(k)}))||^{\,2}_2\)
We now compute the next iterate \(X^{(k+1)}\) using the matrices \(\{A^{(k)}_{(i)}\}_{i=1}^{N}\) similarly to the Gauss–Newton method with Tikhonov regularisation (e.g., Hansen 2005; Björck 1996), i.e.,
$$\begin{aligned} \varvec{x}_{i}^{(k+1)}=\varvec{x}_{i}^{(k)}+\left( A_{(i)}^{(k)\text {T}}A_{(i)}^{(k)}+\lambda _i^{(k)} I\right) ^{-1}A_{(i)}^{(k)\text {T}}(\varvec{y}^*-\varvec{y}_{i}^{(k)})\qquad \end{aligned}$$
(26)
for \(i=1,...,N\), where \(\varvec{y}^*\) is the set of observations one wishes to fit the nonlinear function \(\varvec{f}\) to [cf. (1)], and \(\varvec{y}_i^{(k)}\equiv \varvec{f}(\varvec{x}_i^{(k)})\). For CGN, we require \(\lambda _i^{(k)}> 0\). The necessity of the regularisation can be seen in the numerical experiment presented in “Appendix B”.
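The update of Eq. (26) is a standard Tikhonov-regularised (Levenberg–Marquardt-style) Gauss–Newton step applied per cluster point; a minimal sketch (our naming) is:

```python
import numpy as np

def cgn_update(x_i, y_i, A_i, y_star, lam):
    """Tikhonov-regularised Gauss-Newton step of Eq. (26):
    x_new = x_i + (A^T A + lam*I)^(-1) A^T (y* - y_i), with lam > 0."""
    n = x_i.size
    step = np.linalg.solve(A_i.T @ A_i + lam * np.eye(n),
                           A_i.T @ (y_star - y_i))
    return x_i + step
```

For small \(\lambda\) this approaches the plain Gauss–Newton step, while for large \(\lambda\) the step length shrinks towards zero, which is exactly the behaviour exploited by the stopping rule in Step 2-3.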
(2-3) Update matrices X and Y and vectors \(\varvec{r}\) and \(\varvec{\lambda }\)
Evaluate the nonlinear function \(\varvec{f}\) for each \(\varvec{x}_{i}^{(k+1)}\) as \(\varvec{y}_i^{(k+1)} = \varvec{f}(\varvec{x}_i^{(k+1)} )\) \((i=1,2,\ldots ,N)\), and store as \(Y^{(k+1)}= [ \varvec{y}_1^{(k+1)}, \varvec{y}_2^{(k+1)},\ldots ,\varvec{y}_N^{(k+1)} ].\) Compute the sum of squared residuals vector as \(\varvec{r}^{(k+1)}= [ r_1^{(k+1)}, r_2^{(k+1)},\) \(\ldots , r_N^{(k+1)} ]^\text {T}\) where \(r_i^{(k+1)} = \Vert \varvec{y}_i^{(k+1)} - \varvec{y}^{*} \Vert _2^{\,2} \quad (i=1,2,\ldots ,N)\). Note that this process can be implemented in an embarrassingly parallel way.
If the residual \(r_i\) increases, we replace \(\varvec{x}_{i}^{(k+1)}\) by \(\varvec{x}_{i}^{(k)}\) and increase the regularisation parameter i.e.,
if \(r_{i}^{(k)}<r_{i}^{(k+1)}\) or \(\varvec{f}(\varvec{x}_{i}^{(k+1)})\) cannot be evaluated, then let
$$\begin{aligned}&\varvec{x}_{i}^{(k+1)}=\varvec{x}_{i}^{(k)} \end{aligned}$$
(27)
$$\begin{aligned}&\varvec{y}_{i}^{(k+1)}=\varvec{y}_{i}^{(k)} \end{aligned}$$
(28)
$$\begin{aligned}&\lambda ^{(k+1)}_i=10 \,\lambda ^{(k)}_i \end{aligned}$$
(29)
else decrease the regularisation parameter, i.e.,
$$\begin{aligned} \lambda ^{(k+1)}_i= \frac{1}{10}\lambda ^{(k)}_i \,, \end{aligned}$$
(30)
for each \(i=1,..., N\).
There are various ways to update the regularisation parameter \(\lambda\). In this paper we follow the strategy used in Matlab's implementation of the Levenberg–Marquardt method.
In addition, in this step, we impose a stopping criterion for each point in the cluster. As can be seen in Eq. (26), \(\varvec{x}_{i}^{(k+1)}\approx \varvec{x}_{i}^{(k)}\) for large \(\lambda _i\), so that only a very small update of \(\varvec{x}_{i}^{(k+1)}\) can be expected. Hence, in order to mimic a minimum-step-size stopping criterion, we stop updating point i whenever \(\lambda _i>\lambda _\text {max}\).
The concise pseudo-code for updating matrices X and Y and vectors \(\varvec{r}\) and \(\varvec{\lambda }\) is given in Algorithm 3.
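The accept/reject logic of Eqs. (27)–(30) for a single point could be sketched as follows (our naming; using `None` to signal a failed evaluation of \(\varvec{f}\), and returning the previous residual on rejection, mirror Eqs. (27)–(28)):

```python
import numpy as np

def accept_reject(x_old, x_new, r_old, r_new, lam):
    """Per-point update of Step 2-3: if the SSR increased (or f could not be
    evaluated, signalled by r_new = None), revert the iterate and grow the
    regularisation parameter by 10x, Eqs. (27)-(29); otherwise keep the new
    iterate and shrink it by 10x, Eq. (30)."""
    if r_new is None or r_old < r_new:
        return x_old, r_old, 10.0 * lam    # reject: revert, lambda *= 10
    return x_new, r_new, lam / 10.0        # accept: keep,   lambda /= 10
```

Coupled with the rule of stopping point i once \(\lambda_i > \lambda_\text{max}\), repeated rejections drive \(\lambda_i\) upward until that point is frozen.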
The influence of the choice of the initial value \(\lambda _{\text {init}}\) of the regularization parameter in Eq. (15) is studied in “Appendix B”.
We present a concise description of the CGN as a pseudo-code in Algorithm 4.