1 Introduction

Let an iterative algorithm be applied to seek the least value of a general objective function \(F ( \underline{x}), \underline{x}\in \mathcal{R}^n\), subject to the linear constraints

$$\begin{aligned} \underline{a}_j^T \underline{x}\le b_j, \quad j = 1,2, \ldots , m, \end{aligned}$$
(1.1)

the value \(m = 0\) being reserved for the unconstrained case. We assume that, for any vector of variables \(\underline{x}\in \mathcal{R}^n\), the function value \(F ( \underline{x})\) can be calculated. The vectors \(\underline{x}\) of these calculations are generated automatically by the algorithm after some preliminary work. A feasible point \(\underline{x}_1\) and the value \(F ( \underline{x}_1 )\) are required for the first iteration, where feasible means that the constraints (1.1) are satisfied. For every iteration number k, we define \(\underline{x}_k\) to be the point that, at the start of the k-th iteration, has supplied the least calculated value of the objective function so far, subject to \(\underline{x}_k\) being feasible. If this point is not unique, we choose the candidate that occurs first, so \(\underline{x}_{k+1} \ne \underline{x}_k\) implies the strict reduction \(F ( \underline{x}_{k+1} ) < F ( \underline{x}_k )\).

The main task of the k-th iteration is to pick a new vector of variables, \(\underline{x}_k^+\) say, or to decide that the sequence of iterations is complete. On some iterations, \(\underline{x}_k^+\) may be infeasible, in order to investigate changes to the objective function when moving away from a constraint boundary, an extreme case being when the boundary is the set of points that satisfy a linear equality constraint that is expressed as two linear inequalities. We restrict attention, however, to an iteration that makes \(\underline{x}_k^+\) feasible, and that tries to achieve the reduction \(F ( \underline{x}_k^+ ) < F ( \underline{x}_k )\). We also restrict attention to algorithms that employ a quadratic model \(Q_k ( \underline{x}) \approx F ( \underline{x}), \underline{x}\in \mathcal{R}^n\), and a trust region radius \(\Delta _k > 0\).

We let the model function of the k-th iteration be the quadratic

$$\begin{aligned} Q_k ( \underline{x}) \,=\, F ( \underline{x}_k ) + ( \underline{x}- \underline{x}_k )^T \underline{g}_k + \frac{1}{2}\, ( \underline{x}- \underline{x}_k )^{T} H_k\, ( \underline{x}- \underline{x}_k ), \quad \underline{x}\in \mathcal{R}^n, \end{aligned}$$
(1.2)

the vector \(\underline{g}_k \in \mathcal{R}^n\) and the \(n \times n\) symmetric matrix \(H_k\) being chosen before the start of the iteration. The trust region radius \(\Delta _k\) is also chosen in advance. The ideal vector \(\underline{x}_k^+\) would be the vector \(\underline{x}\) that minimizes \(Q_k ( \underline{x})\) subject to the linear constraints (1.1) and the trust region bound

$$\begin{aligned} \left\| \underline{x}- \underline{x}_k \right\| \,\le \, \Delta _k. \end{aligned}$$
(1.3)

The purpose of our work, however, is to generate a vector \(\underline{x}_k^+\) that is a useful approximation to this ideal vector, and whose construction requires only \(\mathcal{O}( n^2 )\) operations on most iterations. This is done by the LINCOA Fortran software [12], developed by the author for linearly constrained optimization when derivatives of \(F ( \underline{x}), \underline{x}\in \mathcal{R}^n\), are not available. LINCOA has been applied successfully to several test problems that have hundreds of variables without taking advantage of any sparsity, which would not be practicable if the average amount of work on each iteration were \(\mathcal{O}( n^3 )\), this amount of computation being typical if an accurate approximation to the ideal vector is required.

When first derivatives of F are calculated, the choice \(\underline{g}_k = \underline{\nabla }F ( \underline{x}_k )\) is usual for the model function (1.2). Furthermore, after choosing the second derivative matrix \(H_1\) for the first iteration, the k-th iteration may construct \(H_{k+1}\) from \(H_k\) by the symmetric Broyden formula [see equation (3.6.5) of [3], for instance]. There is also a version of that formula for derivative-free optimization (see [9]). It generates both \(\underline{g}_{k+1}\) and \(H_{k+1}\) from \(\underline{g}_k\) and \(H_k\) by minimizing \(\Vert H_{k+1} - H_k \Vert _F\) subject to some interpolation conditions, where the subscript “F” denotes the Frobenius matrix norm. The techniques that provide \(\underline{g}_k, H_k\) and \(\Delta _k\) are separate from our work, however, except for one feature. It is that, instead of requiring \(H_k\) to be available explicitly, we assume that, for any \(\underline{v}\in \mathcal{R}^n\), the form of \(H_k\) allows the matrix–vector product \(H_{k\,} \underline{v}\) to be calculated in \(\mathcal{O}( n^2 )\) operations. Thus our work is relevant to the LINCOA Fortran software, where the expression for \(H_k\) includes a linear combination of about \(2n + 1\) outer products of the form \(\underline{y}\, \underline{y}^T, \underline{y}\in \mathcal{R}^n\).
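For illustration, the short Python sketch below shows how an implicit Hessian of this kind, stored only as the coefficients and vectors of a sum of rank one terms, supports the product \(H_{k\,} \underline{v}\) in \(\mathcal{O}( n^2 )\) operations without \(H_k\) ever being formed. The representation is stripped to the rank one part only, and the name hess_times_vector is a hypothetical choice for the sketch, not a detail of LINCOA.

```python
import numpy as np

# An implicit Hessian H = sum_j c_j y_j y_j^T, stored as c and Y only.
# With about 2n + 1 terms, the product H v costs O(n^2) operations.
def hess_times_vector(c, Y, v):
    """c: (p,) coefficients; Y: (p, n) with the vectors y_j as rows."""
    return Y.T @ (c * (Y @ v))             # each term adds c_j (y_j . v) y_j

# Check against the explicitly formed matrix on random data.
n = 5
rng = np.random.default_rng(0)
c = rng.standard_normal(2 * n + 1)
Y = rng.standard_normal((2 * n + 1, n))
v = rng.standard_normal(n)
assert np.allclose(((Y.T * c) @ Y) @ v, hess_times_vector(c, Y, v))
```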

Table 4 of [11] gives some numerical results on the efficiency of the symmetric Broyden formula in derivative-free unconstrained optimization when F is a strictly convex quadratic function. In all of these tests, the optimal vector of variables is calculated to high accuracy, using only about \(\mathcal{O}( n )\) values of F for large n, which shows that the updating formula is successful at capturing enough second derivative information to provide fast convergence. There is no need for \(\Vert H_k - \nabla ^{2\!} F ( \underline{x}_k ) \Vert \) to become small, as explained by [1] when \(\underline{\nabla }F\) is available. In the tests of [11] with a quadratic F, every initial matrix \(H_1\) satisfies \(\Vert H_1 - \nabla ^{2\!} F \Vert _F > \frac{1}{2}\, \Vert \nabla ^{2\!} F \Vert _F\). Further, although the sequence \(\Vert H_k - \nabla ^{2\!} F \Vert _F, k = 1,2, \ldots ,K\), decreases monotonically, where K is the final value of k, the property

$$\begin{aligned} \left\| H_K - \nabla ^2 F \right\| _F \,>\, 0.9\, \left\| H_1 - \nabla ^2 F \right\| _F \end{aligned}$$
(1.4)

is not unusual when n is large.

Therefore the setting for our choice of \(\underline{x}_k^+\) is that we seek a vector \(\underline{x}\) that provides a relatively small value of \(Q_k ( \underline{x})\) subject to the constraints (1.1) and (1.3), although we expect the accuracy of the approximation \(Q_k ( \underline{x}) \approx F ( \underline{x}), \underline{x}\in \mathcal{R}^n\), to be poor. After choosing \(\underline{x}_k^+\), the value \(F ( \underline{x}_k^+ )\) is usually calculated, and then, in derivative-free optimization, the next quadratic model satisfies the equation

$$\begin{aligned} Q_{k+1} \left( \underline{x}_k^+\right) \,=\, F \left( \underline{x}_k^+ \right) , \end{aligned}$$
(1.5)

even if \(\underline{x}_k^+\) is not the best vector of variables so far. Thus, if \(| Q_k ( \underline{x}_k^+ ) - F ( \underline{x}_k^+ ) |\) is large, then \(Q_{k+1}\) is substantially different from \(Q_k\). Fortunately, the symmetric Broyden formula has the property that, if F is quadratic, then \(\Vert H_{k+1} - H_k \Vert \) tends to zero as k becomes large, so it is usual on the later iterations for the error \(| Q_k ( \underline{x}_k^+ ) - F ( \underline{x}_k^+ ) |\) to be much less than a typical error \(| Q_k ( \underline{x}) - F ( \underline{x}) |, \Vert \underline{x}- \underline{x}_k \Vert \le \Delta _k\). Thus the reduction \(F ( \underline{x}_k^+ ) < F ( \underline{x}_k )\) is inherited from \(Q_k ( \underline{x}_k^+ ) < Q_k ( \underline{x}_k )\) much more often than would be predicted by theoretical analysis, if the theory employed a bound on \(| Q_k ( \underline{x}_k^+ ) - F ( \underline{x}_k^+ ) |\) that is derived only from the errors \(\Vert \underline{g}_k - \nabla F ( \underline{x}_k ) \Vert \) and \(\Vert H_k - \nabla ^{2\!} F ( \underline{x}_k ) \Vert \), assuming that F is twice differentiable.

Suitable ways of choosing \(\underline{x}_k^+\) in the unconstrained case (\(m = 0\)) are described by [2] and by [10], for instance. They are addressed in Sect. 2, where both truncated conjugate gradient and Krylov subspace methods receive attention. They are pertinent to our techniques for linear constraints (\(m > 0\)), because, as explained in Sect. 3, active set methods are employed. Those methods generate sequences of unconstrained problems, some of the constraints (1.1) being satisfied as equations, which can reduce the number of variables, while the other constraints (1.1) are ignored temporarily. The idea of combining active set methods with Krylov subspaces is considered in Sect. 4. It was attractive to the author during the early development of the LINCOA software, but two severe disadvantages of this approach are exposed there. The present version of LINCOA combines active set methods with truncated conjugate gradients, which is the subject of Sect. 5. The conjugate gradient calculations are complete if they generate a vector \(\underline{x}\) on the boundary of the trust region constraint (1.3), but moves round this boundary that preserve feasibility may provide a useful reduction in the value of \(Q_k ( \underline{x})\). This possibility is studied in Sect. 6. Some further remarks, including a technique for preserving feasibility when applying the Krylov method, are given in Sect. 7.

2 The unconstrained case

There are no linear constraints on the variables throughout this section, m being zero in expression (1.1). We consider truncated conjugate gradient and Krylov subspace methods for constructing a point \(\underline{x}_k^+\) at which the quadratic model \(Q_k ( \underline{x}), \Vert \underline{x}- \underline{x}_k \Vert \le \Delta _k\), is relatively small. These methods are iterative, a sequence of points \(\underline{p}_{\ell }, \ell = 1,2, \ldots , L\), being generated, with \(\underline{p}_1 = \underline{x}_k\), with \(Q_k (\underline{p}_{\ell +1} ) < Q_k ( \underline{p}_{\ell } ), \ell = 1,2, \ldots , L - 1\), and with \(\underline{x}_k^+ = \underline{p}_L\). The main difference between them is that the conjugate gradient iterations are terminated if the \(\ell \)-th iteration generates a point \(\underline{p}_{\ell +1}\) that satisfies \(\Vert \underline{p}_{\ell +1} - \underline{x}_k \Vert = \Delta _k\), but both \(\underline{p}_{\ell }\) and \(\underline{p}_{\ell +1}\) may be on the trust region boundary in a Krylov subspace iteration. The second derivative matrix \(H_k\) of the quadratic model (1.2) is allowed to be indefinite. We keep in mind that we would like the total amount of computation for each k to be \(\mathcal{O}( n^2 )\).

If the gradient \(\underline{g}_k\) of the model (1.2) were zero, then, instead of investigating whether \(H_k\) has any negative eigenvalues, we would abandon the search for a vector \(\underline{x}_k^+\) that provides \(Q_k ( \underline{x}_k^+ ) < Q_k ( \underline{x}_k )\). In this case LINCOA would try to improve the model by changing one of its interpolation conditions. We assume throughout this section, however, that \(\underline{g}_k\) is nonzero.

The first iteration (\(\ell = 1\)) of both the conjugate gradient and Krylov subspace methods picks the vector

$$\begin{aligned} \underline{p}_2 \;=\; \underline{p}_1 - \alpha _{1\,} \underline{g}_k \;=\; \underline{x}_k - \alpha _{1\,} \underline{g}_k, \end{aligned}$$
(2.1)

where \(\alpha _1\) is the value of \(\alpha \) that minimizes the quadratic function of one variable \(Q_k ( \underline{p}_1 - \alpha _{\,} \underline{g}_k ), \alpha \in \mathcal{R}\), subject to \(\Vert \underline{p}_1 - \alpha _{\,} \underline{g}_k - \underline{x}_k \Vert \le \Delta _k\). Further, whenever a conjugate gradient iteration generates \(\underline{p}_{\ell +1}\) from \(\underline{p}_{\ell }\), a line search from \(\underline{p}_{\ell }\) along the direction

$$\begin{aligned} \underline{d}_{\ell } = \left\{ \begin{array}{ll} - \underline{g}_k, & \ell = 1, \\ -\underline{\nabla }Q_k ( \underline{p}_{\ell } ) + \beta _{\ell }\, \underline{d}_{\ell -1}, & \quad \ell \ge 2, \end{array} \right. \end{aligned}$$
(2.2)

is made, the value of \(\beta _{\ell }\) being defined by the conjugacy condition \(\underline{d}_{\ell }^{\,T\!} H_{k\,} \underline{d}_{\ell -1} = 0\). Thus \(\underline{p}_{\ell +1}\) is the vector

$$\begin{aligned} \underline{p}_{\ell +1} = \underline{p}_{\ell } + \alpha _{\ell \,} \underline{d}_{\ell }, \end{aligned}$$
(2.3)

where \(\alpha _{\ell }\) is the value of \(\alpha \) that minimizes \(Q_k ( \underline{p}_{\ell } + \alpha _{\,} \underline{d}_{\ell } ), \alpha \in \mathcal{R}\), subject to the bound \(\Vert \underline{p}_{\ell } + \alpha _{\,} \underline{d}_{\ell } - \underline{x}_k \Vert \le \Delta _k\). This procedure is well-defined, provided it is terminated if \(\Vert \underline{p}_{\ell +1} - \underline{x}_k \Vert = \Delta _k\) or \(\underline{\nabla }Q_k ( \underline{p}_{\ell +1} ) = 0\) occurs. In theory, these conditions provide termination after at most n iterations (see [3], for instance).
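Because \(Q_k\) is quadratic along any line and the trust region bound gives a quadratic equation in \(\alpha \), the steplengths of Eqs. (2.1) and (2.3) have a closed form. The following Python sketch records it, under the assumptions that \(\underline{d}\) is a descent direction at \(\underline{p}\) and that \(\underline{p}\) is strictly inside the trust region; the names are illustrative only, not taken from LINCOA.

```python
import numpy as np

# The steplength of Eqs. (2.1) and (2.3): alpha minimizes the quadratic
# Q_k(p + alpha d) subject to ||p + alpha d - x_k|| <= Delta.
def steplength(s, d, grad_p, Hd, Delta):
    """s = p - x_k; grad_p = grad Q_k(p); Hd = H_k d."""
    sd, dd = s @ d, d @ d
    # Positive root of ||s + alpha d||^2 = Delta^2, the boundary step.
    alpha_bd = (-sd + np.sqrt(sd**2 + dd * (Delta**2 - s @ s))) / dd
    curv = d @ Hd
    if curv > 0.0:                         # clip the line minimizer
        return min(-(d @ grad_p) / curv, alpha_bd)
    return alpha_bd                        # otherwise go to the boundary
```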

The only task of a conjugate gradient iteration that requires \(\mathcal{O}( n^2 )\) operations is the calculation of the product \(H_{k\,} \underline{d}_{\ell }\). All of the other tasks of the \(\ell \)-th iteration can be done in only \(\mathcal{O}(n)\) operations, by taking advantage of the availability of \(H_{k\,} \underline{d}_{\ell -1}\) when \(\underline{\nabla }Q_k ( \underline{p}_{\ell } ), \beta _{\ell }\) and \(\underline{d}_{\ell }\) are formed. Therefore the target of \(\mathcal{O}( n^2 )\) work for each k is maintained if L, the final value of \(\ell \), is bounded above by a number that is independent of both k and n. Further, because the total work of the optimization algorithm depends on the average value of L over k, a few large values of L may be tolerable.

The two termination conditions that have been mentioned are not suitable for keeping L small. Indeed, if \(H_k\) is positive definite, then, for every \(\ell \ge 1\), the property \(Q_k ( \underline{p}_{\ell +1} ) < Q_k ( \underline{x}_k )\) implies \(\Vert \underline{p}_{\ell +1} - \underline{x}_k \Vert < \Delta _k\) for sufficiently large \(\Delta _k\). Furthermore, if \(\underline{\nabla }Q_k ( \underline{p}_{\ell +1} ) = 0\) is achieved in exact arithmetic, then \(\ell \) may have to be close to n, but it is usual for computer rounding errors to prevent \(\underline{\nabla }Q_k ( \underline{p}_{\ell +1} )\) from becoming zero. Instead, the termination conditions of LINCOA, which are described below, are recommended for use in practice. They require a positive constant \(\eta _1 < 1\) to be prescribed, the LINCOA value being \(\eta _1 = 0.01\).

Termination occurs immediately with \(\underline{x}_k^+ = \underline{p}_{\ell }\) if the calculated \(\underline{d}_{\ell }\) is not a descent direction, which is the condition

$$\begin{aligned} \underline{d}_{\ell }^{\,T} \underline{\nabla }Q_k ( \underline{p}_{\ell } ) \;\ge \; 0. \end{aligned}$$
(2.4)

Otherwise, the positive steplength \(\hat{\alpha }_{\ell }\), say, that puts \(\underline{p}_ {\ell } + \hat{\alpha }_{\ell \,} \underline{d}_{\ell }\) on the trust region boundary is found, and termination occurs also with \(\underline{x}_k^+ = \underline{p}_{\ell }\) if this move is expected to give a relatively small reduction in \(Q_k ( \cdot )\), which is the test

$$\begin{aligned} \hat{\alpha }_{\ell }\, \left| \underline{d}_{\ell }^{\,T} \underline{\nabla }Q_k ( \underline{p}_{\ell } ) \right| \;\le \; \eta _1\, \left\{ Q_k ( \underline{x}_k ) - Q_k ( \underline{p}_{\ell } ) \right\} . \end{aligned}$$
(2.5)

In the usual situation, when both inequalities (2.4) and (2.5) fail, the product \(H_{k\,} \underline{d}_{\ell }\) is generated for the construction of the new point (2.3). The calculation ends with \(\underline{x}_k^+ = \underline{p}_{\ell +1}\) if \(\underline{p}_{\ell +1}\) is on the trust region boundary, or if the reduction in \(Q_k\) by the \(\ell \)-th iteration has the property

$$\begin{aligned} Q_k ( \underline{p}_{\ell } ) - Q_k ( \underline{p}_{\ell +1} ) \;\le \; \eta _1\, \left\{ Q_k ( \underline{x}_k ) - Q_k ( \underline{p}_{\ell +1} ) \right\} . \end{aligned}$$
(2.6)

It also ends in the case \(\ell = n\), which is unlikely unless n is small. In all other cases, \(\ell \) is increased by one for the next conjugate gradient iteration.
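A minimal Python sketch of the whole truncated conjugate gradient loop, including the tests (2.4)–(2.6) with the LINCOA value \(\eta _1 = 0.01\), may therefore read as follows. It mirrors the equations of this section, and it is in no way the Fortran implementation.

```python
import numpy as np

# Truncated conjugate gradients with the tests (2.4)-(2.6); hess_v(v)
# returns H_k v, the only O(n^2) task of an iteration.
def truncated_cg(g, hess_v, xk, Delta, eta1=0.01):
    p = xk.copy()
    s = np.zeros_like(g)                   # s = p - x_k
    grad = g.copy()                        # gradient of Q_k at p
    d = -g.copy()                          # first search direction, Eq. (2.2)
    q_red = 0.0                            # Q_k(x_k) - Q_k(p) so far
    for _ in range(g.size):                # at most ell = n iterations
        dg = d @ grad
        if dg >= 0.0:                      # test (2.4)
            break
        sd, dd = s @ d, d @ d
        alpha_hat = (-sd + np.sqrt(sd**2 + dd * (Delta**2 - s @ s))) / dd
        if alpha_hat * abs(dg) <= eta1 * q_red:
            break                          # test (2.5)
        Hd = hess_v(d)
        curv = d @ Hd
        alpha = alpha_hat if curv <= 0.0 else min(-dg / curv, alpha_hat)
        dec = -alpha * dg - 0.5 * alpha**2 * curv  # Q_k(p) - Q_k(p_new)
        s = s + alpha * d
        p = p + alpha * d
        grad = grad + alpha * Hd
        q_red += dec
        if alpha == alpha_hat:             # new point on the boundary
            break
        if dec <= eta1 * q_red:            # test (2.6)
            break
        beta = (grad @ Hd) / curv          # conjugacy: d_new^T H_k d = 0
        d = -grad + beta * d
    return p                               # the chosen x_k^+
```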

Condition (2.4) is equivalent to \(\underline{\nabla }Q_k ( \underline{p}_{\ell } ) = 0\) in exact arithmetic, because, if \(\underline{\nabla }Q_k ( \underline{p}_{\ell } )\) is nonzero, then the given choices of \(\underline{p}_{\ell }\) and \(\underline{d}_{\ell }\) give the property

$$\begin{aligned} \underline{d}_{\ell }^{\,T} \underline{\nabla }Q_k ( \underline{p}_{\ell } ) \;=\; - \left\| \underline{\nabla }Q_k ( \underline{p}_{\ell } ) \right\| ^2 \;<\; 0, \end{aligned}$$
(2.7)

the vector norm being Euclidean. The requirement \(\underline{d}_{\ell }^{\,T} \underline{\nabla }Q_k ( \underline{p}_{\ell } ) < 0\) before \(\alpha _{\ell }\) is calculated provides \(\alpha _{\ell } > 0\). For \(\ell \ge 2\), the test (2.5) causes termination when \(\Vert \underline{\nabla }Q_k ( \underline{p}_{\ell } ) \Vert \) is small. Specifically, as \(\Vert \underline{p}_{\ell } - \underline{x}_k \Vert < \Delta _k\) and \(\Vert \underline{p}_{\ell } + \hat{\alpha }_{\ell \,} \underline{d}_{\ell } - \underline{x}_k \Vert = \Delta _k\) imply \(\Vert \hat{\alpha }_{\ell \,} \underline{d}_{\ell } \Vert < 2 \Delta _k\), we deduce from the Cauchy–Schwarz inequality that condition (2.5) holds in the case

$$\begin{aligned} \left\| \underline{\nabla }Q_k ( \underline{p}_{\ell } ) \right\| \;\le \; \eta _1\, \left\{ Q_k ( \underline{x}_k ) - Q_k ( \underline{p}_{\ell } ) \right\} /( 2 \Delta _k ). \end{aligned}$$
(2.8)

The test (2.6) gives \(L = \ell + 1\) and \(\underline{x}_k^+ = \underline{p}_{\ell +1}\) when \(Q_k ( \underline{p}_{\ell } ) - Q_k ( \underline{p}_{\ell +1} )\) is relatively small. We find in Sect. 5 that the given conditions for termination are also useful for search directions \(\underline{d}_{\ell }\) that satisfy linear constraints on the variables.

In the remainder of this section, we consider a Krylov subspace method for calculating \(\underline{x}_k^+\) from \(\underline{x}_k\), where k is any positive integer such that the gradient \(\underline{g}_k\) of the model (1.2) is nonzero. For \(\ell \ge 1\), the Krylov subspace \(\mathcal{K}_{\ell }\) is defined to be the span of the vectors \(H_k^{j-1} \underline{g}_k \in \mathcal{R}^n, j = 1,2, \ldots , \ell \), the matrix \(H_k^{j-1}\) being the \((j - 1)\)-th power of \(H_k = \nabla ^2 Q_k\), and \(H_k^0\) being the unit matrix even if \(H_k\) is zero. Let \(\ell ^*\) be the greatest integer \(\ell \) such that \(\mathcal{K}_{\ell }\) has dimension \(\ell \), which implies \(\mathcal{K}_{\ell } = \mathcal{K}_{\ell ^*}, \ell \ge \ell ^*\). We retain \(\underline{p}_1 = \underline{x}_k\). The \(\ell \)-th iteration of our method constructs \(\underline{p}_{\ell +1}\), and the number of iterations is at most \(\ell ^*\). Our choice of \(\underline{p}_{\ell +1}\) is a highly accurate estimate of the vector \(\underline{x}\) that minimizes \(Q_k ( \underline{x})\), subject to \(\Vert \underline{x}- \underline{x}_k \Vert \le \Delta _k\) and subject to the condition that \(\underline{x}- \underline{x}_k\) is in \(\mathcal{K}_{\ell }\). We find later in this section that the calculation of \(\underline{p}_{\ell +1}\) for each \(\ell \) requires only \(\mathcal{O}( n^2 )\) operations. An introduction to Krylov subspace methods is given by [2], for instance.

It is well known that each \(\underline{p}_{\ell +1}\) of the conjugate gradient method has the property that \(\underline{p}_{\ell +1} - \underline{x}_k\) is in the Krylov subspace \(\mathcal{K}_{\ell }\), which can be deduced by induction from Eqs. (2.2) and (2.3). It is also well known that the search directions of the conjugate gradient method satisfy \(\underline{d}_{\ell }^{\,T\!} H_{k\,} \underline{d}_j = 0, 1 \le j < \ell \). It follows that the points \(\underline{p}_{\ell }, \ell = 1,2,3, \ldots \), of the conjugate gradient and Krylov subspace methods are the same while they are strictly inside the trust region \(\{ \underline{x}: \Vert \underline{x}- \underline{x}_k \Vert \le \Delta _k \}\). We recall that the conjugate gradient method is terminated when its \(\underline{p}_{\ell +1}\) is on the trust region boundary, but the iterations of the Krylov subspace method may continue, in order to provide a smaller value of \(Q_k ( \underline{x}_k^+ )\).

Our Krylov subspace method has a standard form with termination conditions for use in practice. For \(\ell = 1,2, \ldots , \ell ^*\), we let \(\underline{v}_{\ell }\) be a vector of unit length in \(\mathcal{K}_{\ell }\) that, for \(\ell \ge 2\), satisfies the orthogonality conditions

$$\begin{aligned} \underline{v}_{\ell }^T \underline{d}\,=\, 0, \quad \underline{d}\in \mathcal{K}_{\ell -1}, \end{aligned}$$
(2.9)

which defines every \(\underline{v}_{\ell }\) except its sign. Thus \(\{ \underline{v}_j : j = 1,2, \ldots , \ell \}\) is an orthonormal basis of \(\mathcal{K}_{\ell }\). Further, because \(H_k \underline{v}_j\) is in the subspace \(\mathcal{K}_{j+1}\), the conditions (2.9) provide the highly useful property

$$\begin{aligned} \underline{v}_{\ell }^T H_{k\,} \underline{v}_j \,=\, 0, \quad 1 \le j \!\le \ell - 2. \end{aligned}$$
(2.10)

The usual way of constructing \(\underline{v}_{\ell }, \ell \ge 2\), is described below. It is extended in Sect. 4 to calculations with linear constraints on the variables.

After setting \(\underline{p}_1 = \underline{x}_k\), the vector \(\underline{p}_2\) of our Krylov subspace method is given by Eq. (2.1) as mentioned already. For \(\ell \ge 2\), we write \(\underline{p}_{\ell +1}\) as the sum

$$\begin{aligned} \underline{p}_{\ell +1} \;=\; \underline{x}_k + \, \mathop {\sum }\limits _{i=1}^{\ell } \, \theta _{i\,} \underline{v}_i. \end{aligned}$$
(2.11)

Therefore, assuming that every vector norm is Euclidean, we seek the vector of coefficients \(\underline{\theta }\in \mathcal{R}^{\ell }\) that minimizes the quadratic function

$$\begin{aligned} \Phi _{k \ell } ( \underline{\theta }) \;=\; Q_k \left( \underline{x}_k + \mathop {\sum }\limits _{i=1}^{\ell } \, \theta _{i\,} \underline{v}_i \right) , \quad \underline{\theta }\in \mathcal{R}^{\ell }, \end{aligned}$$
(2.12)

subject to \(\Vert \underline{\theta }\Vert \le \Delta _k\). This function has the second derivatives

$$\begin{aligned} d^{\,2} \Phi _{k \ell } ( \underline{\theta }) /d \theta _{i\,} d \theta _j \,=\; \underline{v}_i^{T} H_{k\,} \underline{v}_j, \quad 1 \le i,j \le \ell , \end{aligned}$$
(2.13)

and Eq. (2.10) shows that the matrix \(\nabla ^2 \Phi _{k \ell } ( \cdot )\) is tridiagonal. Moreover, the gradient \(\underline{\nabla }\Phi _{k \ell } (0)\) has the components \(\underline{v}_i^T \underline{g}_k, i = 1,2, \ldots , \ell \), so only the first component is nonzero, its value being \(\pm \Vert \underline{g}_k \Vert \). Hence, after forming the tridiagonal part of \(\nabla ^2 \Phi _{k \ell } ( \cdot )\), the required \(\underline{\theta }\) can be calculated accurately, conveniently and rapidly by the algorithm of [6].
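The Python sketch below is a simple stand-in for the algorithm of [6]: it seeks the minimizer of the reduced quadratic subject to \(\Vert \underline{\theta }\Vert \le \Delta _k\) by bisection on the multiplier \(\lambda \) in \(T + \lambda I\), T being the tridiagonal matrix with the elements (2.13). The so-called hard case of trust region theory is ignored here for brevity, and the algorithm of [6] would employ a safeguarded Newton iteration instead of bisection.

```python
import numpy as np

# Minimize gbar^T theta + 0.5 theta^T T theta subject to ||theta|| <= Delta,
# where T is tridiagonal and gbar has +-||g_k|| in its first component only.
def tridiag_trust_region(T, gbar, Delta, tol=1e-12):
    lam_min = np.linalg.eigvalsh(T)[0]
    if lam_min > 0.0:
        theta = np.linalg.solve(T, -gbar)
        if np.linalg.norm(theta) <= Delta:
            return theta                   # interior minimizer
    I = np.eye(len(gbar))
    lo = max(0.0, -lam_min) + tol          # make T + lam I positive definite
    hi = lo + 1.0
    while np.linalg.norm(np.linalg.solve(T + hi * I, -gbar)) > Delta:
        hi *= 2.0                          # bracket the multiplier
    for _ in range(100):                   # bisection on ||theta(lam)|| = Delta
        lam = 0.5 * (lo + hi)
        theta = np.linalg.solve(T + lam * I, -gbar)
        lo, hi = (lam, hi) if np.linalg.norm(theta) > Delta else (lo, lam)
    return theta
```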

At the beginning of the \(\ell \)-th iteration of our Krylov subspace method, where \(\ell \ge 2\), the vectors \(\underline{v}_i, i = 1,2, \ldots , \ell - 1\), are available, with the tridiagonal part of the second derivative matrix \(\nabla ^2 \Phi _{k \ell -1} ( \cdot )\), its last diagonal element being \(\underline{v}_{\ell -1}^{T} H_k \underline{v}_{\ell -1}\), so the product \(H_k \underline{v}_{\ell -1}\) has been calculated already. A condition for termination at this stage is suggested in the next paragraph. If it fails, then the Arnoldi formula

$$\begin{aligned} \underline{v}_{\ell } \;=\; \frac{H_{k\,} \underline{v}_{\ell -1} - \mathop {\sum }\nolimits _{j=1}^{\ell -1}\, \left( \underline{v}_j^{T} H_{k\,} \underline{v}_{\ell -1} \right) \, \underline{v}_j}{\left\| \, H_{k\,} \underline{v}_{\ell -1} - \mathop {\sum }\nolimits _{j=1}^{\ell -1}\, \left( \underline{v}_j^{T} H_{k\,} \underline{v}_{\ell -1} \right) \, \underline{v}_j\, \right\| } \end{aligned}$$
(2.14)

is applied, the amount of work being only \(\mathcal{O}( n )\), because Eq. (2.10) provides \(\underline{v}_j^{T} H_k \underline{v}_{\ell -1} = 0, j \le \ell - 3\). The calculation of \(H_k \underline{v}_{\ell }\) requires \(\mathcal{O}( n^2 )\) operations, however, which is needed for the second derivatives \(\underline{v}_{\ell -1}^{T} H_k \underline{v}_{\ell }\) and \(\underline{v}_{\ell }^{T} H_k \underline{v}_{\ell }\). Then \(\underline{p}_{ \ell +1}\) is generated by the method of the previous paragraph, followed by termination with \(\underline{x}_k^+ = \underline{p}_{\ell +1}\) if inequality (2.6) is satisfied. Otherwise, \(\ell \) is increased by one for the next iteration.
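A sketch of the step (2.14), assuming the product \(H_{k\,} \underline{v}_{\ell -1}\) is available already: by Eq. (2.10) only \(\underline{v}_{\ell -1}\) and \(\underline{v}_{\ell -2}\) contribute to the sum, so the work beyond that product is \(\mathcal{O}( n )\), although in floating point arithmetic one might prefer to reorthogonalize against all the stored vectors.

```python
import numpy as np

# The Arnoldi step (2.14), given Hv_prev = H_k v_{ell-1}.
def arnoldi_step(Hv_prev, v_prev, v_prev2=None):
    w = Hv_prev - (v_prev @ Hv_prev) * v_prev
    if v_prev2 is not None:
        w -= (v_prev2 @ Hv_prev) * v_prev2
    norm = np.linalg.norm(w)
    return None if norm == 0.0 else w / norm   # None: K_ell is complete
```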

The left hand side of the test (2.5) is the reduction that occurs in the linear function

$$\begin{aligned} \Lambda _{k \ell } ( \underline{x}) \,=\, Q_k ( \underline{p}_{\ell } ) + ( \underline{x}- \underline{p}_{\ell } )^T \underline{\nabla }Q_k ( \underline{p}_{\ell } ), \quad \underline{x}\in \mathcal{R}^n, \end{aligned}$$
(2.15)

if a step is taken from \(\underline{p}_{\ell }\) along the direction \(\underline{d}_{\ell }\) to the trust region boundary. For Krylov subspace methods, we replace that left hand side by the greatest reduction that can be achieved in \(\Lambda _{k \ell } ( \cdot )\) by any step from \(\underline{p}_{\ell }\) to the trust region boundary. We see that this step is to the point

$$\begin{aligned} \widehat{\underline{p}}_{\ell +1} \,=\, \underline{x}_k - \Delta _{k\,} \underline{\nabla }Q_k ( \underline{p}_{\ell } )/ \left\| \underline{\nabla }Q_k ( \underline{p}_{\ell } ) \right\| , \end{aligned}$$
(2.16)

assuming \(\underline{\nabla }Q_k ( \underline{p}_{\ell } ) \ne 0\), so inequality (2.5) becomes the termination condition

$$\begin{aligned} \left( \underline{p}_{\ell } - \widehat{\underline{p}}_{\ell +1} \right) ^T \underline{\nabla }Q_k ( \underline{p}_{\ell } ) \;\le \; \eta _1\, \left\{ Q_k ( \underline{x}_k ) - Q_k ( \underline{p}_{\ell } ) \right\} , \end{aligned}$$
(2.17)

which is the test

$$\begin{aligned} \Delta _k\, \left\| \underline{\nabla }Q_k ( \underline{p}_{\ell } ) \right\| - ( \underline{x}_k - \underline{p}_{\ell } )^{T}\, \underline{\nabla }Q_k ( \underline{p}_{\ell } ) \;\le \; \eta _1\, \left\{ Q_k ( \underline{x}_k ) - Q_k ( \underline{p}_{\ell } ) \right\} . \end{aligned}$$
(2.18)

If it is satisfied, then, instead of forming \(\underline{v}_{\ell }\) and \(\nabla ^2 \Phi _{k \ell } ( \cdot )\) for the calculation of \(\underline{p}_{\ell +1}\), the choice \(\underline{x}_k^+ = \underline{p}_{\ell }\) is made, which completes the Krylov subspace iterations.
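The test (2.18) is a direct formula in quantities that are at hand, as in the sketch below, where q_red stands for the reduction \(Q_k ( \underline{x}_k ) - Q_k ( \underline{p}_{\ell } )\) and the names are illustrative.

```python
import numpy as np

# The termination test (2.18), transcribed directly.
def krylov_termination(xk, p, grad_p, Delta, q_red, eta1=0.01):
    lhs = Delta * np.linalg.norm(grad_p) - (xk - p) @ grad_p
    return lhs <= eta1 * q_red
```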

The vectors \(\underline{p}_{\ell }\) and \(\underline{\nabla }Q_k ( \underline{p}_{\ell } )\) are required for the termination condition (2.18) when \(\ell \ge 2\). The construction of \(\underline{p}_{\ell }\) provides the parameters \(\gamma _i\), say, of the sum

$$\begin{aligned} \underline{p}_{\ell } \,=\, \underline{x}_k + \mathop {\sum }\limits _{i=1}^{\ell -1} \, \gamma _i\, \underline{v}_i, \end{aligned}$$
(2.19)

which corresponds to Eq. (2.11), and we retain the vectors \(\underline{v}_i, i = 1,2, \ldots , \ell - 1\). Further, the quadratic function (1.2) has the gradient

$$\begin{aligned} \underline{\nabla }Q_k ( \underline{p}_{\ell } ) \,=\, \underline{g}_k + \mathop {\sum }\limits _{i=1}^{\ell -1} \, \gamma _i\, H_{k\,} \underline{v}_i, \end{aligned}$$
(2.20)

but we prefer not to store \(H_{k\,} \underline{v}_i, i = 1,2, \ldots , \ell - 2\). Instead, because \(H_{k\,} \underline{v}_i\) is in the space \(\mathcal{K}_{i+1}\), and because any vector in \(\mathcal{K}_{\ell -1}\) is unchanged if it is multiplied by \(\sum _{j=1}^ {\ell -1} \underline{v}_{j\,} \underline{v}_j^T\), we can write Eq. (2.20) in the form

$$\begin{aligned} \underline{\nabla }Q_k ( \underline{p}_{\ell } ) \,=\, \underline{g}_k + \gamma _{\ell -1\,} H_{k\,} \underline{v}_{\ell -1} +\, \mathop {\sum }\limits _{j=1}^{\ell -1}\, \left\{ \mathop {\sum }\limits _{i=1}^{\ell -2}\, \gamma _i\, \underline{v}_j^{T} H_{k\,} \underline{v}_i \right\} \underline{v}_j. \end{aligned}$$
(2.21)

Thus it is straightforward to calculate \(\underline{\nabla }Q_k ( \underline{p}_{\ell } )\), and again we take advantage of the property (2.10).
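A sketch of this recovery of \(\underline{\nabla }Q_k ( \underline{p}_{\ell } )\) from Eq. (2.21), assuming the tridiagonal elements \(\underline{v}_i^{T} H_{k\,} \underline{v}_j\) and the single product \(H_{k\,} \underline{v}_{\ell -1}\) are retained; the argument names are illustrative.

```python
import numpy as np

# Eq. (2.21). Arguments: g = g_k; gamma = the ell-1 coefficients of
# Eq. (2.19); V = the n x (ell-1) matrix with columns v_1,...,v_{ell-1};
# t = the (ell-1) x (ell-1) tridiagonal matrix of elements v_i^T H_k v_j;
# Hv_last = H_k v_{ell-1}.
def gradient_at_p(g, gamma, V, t, Hv_last):
    m = len(gamma)                         # m = ell - 1
    grad = g + gamma[m - 1] * Hv_last
    # Coefficient of v_j: sum over i <= ell-2 of gamma_i (v_j^T H_k v_i);
    # only |i - j| <= 1 contributes, by Eq. (2.10).
    coef = t[:, :m - 1] @ gamma[:m - 1]
    return grad + V @ coef
```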

3 Active sets

We now turn to linear constraints on the variables, m being positive in expression (1.1). We apply an active set method, this technique being more than 40 years old (see [4], for instance). The active set \(\mathcal{A}\), say, is a subset of the indices \(\{ 1,2, \ldots , m \}\) such that the constraint gradients \(\underline{a}_j, j \in \mathcal{A}\), are linearly independent. Usually, until \(\mathcal{A}\) is updated, the variables \(\underline{x}\in \mathcal{R}^n\) are restricted by the equations

$$\begin{aligned} \underline{a}_j^T \underline{x}\,=\, b_j, \quad j \in \mathcal{A}, \end{aligned}$$
(3.1)

but, if j is not in \(\mathcal{A}\), the constraint \(\underline{a}_j^T \underline{x}\le b_j\) is ignored, until updating of the active set is needed to prevent a constraint violation. This updating may occur several times during the search on the k-th iteration for a vector \(\underline{x}_k^+ = \underline{x}\) that provides a relatively small value of \(Q_k ( \underline{x})\), subject to the inequality constraints (1.1) and the bound \(\Vert \underline{x}- \underline{x}_k \Vert \le \Delta _k\). As in Sect. 2, the search generates a sequence of points \(\underline{p}_{\ell }, \ell = 1,2, \ldots , L\), in the trust region with \(\underline{p}_1 = \underline{x}_k, \underline{x}_k^+ = \underline{p}_L\), and \(Q_k ( \underline{p}_{\ell +1} ) < Q_k ( \underline{p}_{\ell } ), \ell = 1,2, \ldots , L - 1\). Also, every \(\underline{p}_{\ell }\) is feasible. Let \(\mathcal{A}\) be the current active set when \(\underline{p}_{\ell +1}\) is constructed. We replace the equations (3.1) by the conditions

$$\begin{aligned} \underline{a}_j^T \left( \underline{p}_{\ell +1} - \underline{p}_{\ell } \right) \,=\, 0, \quad j \in \mathcal{A}, \end{aligned}$$
(3.2)

throughout this section.

This replacement brings a strong advantage if one (or more) of the residuals \(b_j - \underline{a}_j^T \underline{x}, j = 1,2, \ldots , m\), is very small and positive, because it allows the indices of those constraints to be in \(\mathcal{A}\). The equations (3.1), however, require the choice of \(\mathcal{A}\) at \(\underline{x}= \underline{x}_k\) to be a subset of \(\{ j : b_j - \underline{a}_j^T \underline{x}_k = 0 \}\). Let \(b_i - \underline{a}_i^T \underline{x}_k\) be tiny and positive, where i is one of the constraint indices. If i is not in the set \(\mathcal{A}\), then the condition \(b_i - \underline{a}_i^T \underline{p}_2 \ge 0\) is ignored in the first attempt to construct \(\underline{p}_2\), so it is likely that the \(\underline{p}_2\) of this attempt violates the i-th constraint. Then \(\underline{p}_2\) would be shifted somehow to a feasible position in \(\mathcal{R}^n\), and usually the new \(\underline{p}_2\) would satisfy \(\underline{a}_i^T \underline{p}_2 = b_i\), with i being included in a new active set for the construction of \(\underline{p}_3\). Thus the active set of the k-th iteration may be updated after a tiny step from \(\underline{p}_1 = \underline{x}_k\) to \(\underline{p}_2\), which tends to be expensive if there are many tiny steps, the work of a typical updating being \(\mathcal{O}( n^2 )\). Therefore the conditions (3.2) are employed instead of the equations (3.1) in the LINCOA software, the actual choice of \(\mathcal{A}\) at \(\underline{x}_k\) being as follows.

For every feasible \(\underline{x}\in \mathcal{R}^n\), and for the current \(\Delta _k\), we let \(\mathcal{J}( \underline{x})\) be the set

$$\begin{aligned} \mathcal{J}( \underline{x}) \,=\, \left\{ j : b_j - \underline{a}_j^T \underline{x}\le \eta _{2\,} \Delta _{k\,} \Vert \underline{a}_j \Vert \right\} \,\subset \, \{ 1,2, \ldots , m \}, \end{aligned}$$
(3.3)

where \(\eta _2\) is a positive constant, the value \(\eta _2 = 0.2\) being set in the LINCOA software. In other words, the constraint index j is in \(\mathcal{J}( \underline{x})\) if and only if the distance from \(\underline{x}\) to the boundary of the j-th constraint is at most \(\eta _{2\,} \Delta _k\). The choice of \(\mathcal{A}\) at \(\underline{x}_k\) is a subset of \(\mathcal{J}( \underline{x}_k )\). As in Sect. 2, the step from \(\underline{p}_1 = \underline{x}_k\) to \(\underline{p}_2\) is along a search direction \(\underline{d}_1\), which is defined below. This direction has the property \(\underline{a}_j^{T} \underline{d}_1 \le 0, j \in \mathcal{J}( \underline{x}_k )\), in order that every positive step along \(\underline{d}_1\) goes no closer to the boundary of the j-th constraint for every \(j \in \mathcal{J}( \underline{x}_k )\). It follows that, if the length of the step from \(\underline{p}_1\) to \(\underline{p}_2\) is governed by the need for \(\underline{p}_2\) to be feasible, then the length \(\Vert \underline{p}_2 - \underline{p}_1 \Vert \) is greater than \(\eta _{2\,} \Delta _k\), which prevents the tiny step that is addressed in the previous paragraph. This device is taken from the TOLMIN algorithm of [7].
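The set (3.3) is inexpensive to form, as in the few lines below, where nearly_active_set is an illustrative name and the rows of A_mat are the constraint gradients \(\underline{a}_j^T\).

```python
import numpy as np

# The set (3.3): constraint j is included when the distance from the
# feasible point x to its boundary is at most eta2 * Delta.
def nearly_active_set(A_mat, b, x, Delta, eta2=0.2):
    resid = b - A_mat @ x                  # nonnegative when x is feasible
    norms = np.linalg.norm(A_mat, axis=1)
    return np.flatnonzero(resid <= eta2 * Delta * norms)
```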

The direction \(\underline{d}_1\) is the unique vector \(\underline{d}\) that minimizes \(\Vert \underline{g}_k + \underline{d}\Vert _2\) subject to the conditions

$$\begin{aligned} \underline{a}_j^{T} \underline{d}\,\le \, 0, \quad j \in \mathcal{J}( \underline{x}_k ), \end{aligned}$$
(3.4)

and the choice of \(\mathcal{A}\) at \(\underline{x}_k\) is derived from properties of \(\underline{d}_1\). Let \(\mathcal{I}( \underline{x}_k )\) be the set

$$\begin{aligned} \mathcal{I}( \underline{x}_k ) \,=\, \left\{ j \in \mathcal{J}( \underline{x}_k ) : \underline{a}_j^{T} \underline{d}_1 = 0 \right\} . \end{aligned}$$
(3.5)

If the index j is in \(\mathcal{J}( \underline{x}_k )\) but not in \(\mathcal{I}( \underline{x}_k )\), then a sufficiently small perturbation to the j-th constraint does not alter \(\underline{d}_1\). It follows that \(\underline{d}_1\) is also the vector \(\underline{d}\) that minimizes \(\Vert \underline{g}_k + \underline{d}\Vert _2\) subject to \(\underline{a}_j^{T} \underline{d}\le 0, j \in \mathcal{I}( \underline{x}_k )\). Furthermore, the definition of \(\mathcal{I}( \underline{x}_k )\) implies that \(\underline{d}_1\) is the \(\underline{d}\) that minimizes \(\Vert \underline{g}_k + \underline{d}\Vert _2\) subject to the equations

$$\begin{aligned} \underline{a}_j^{T} \underline{d}\,=\, 0, \quad j \in \mathcal{I}( \underline{x}_k ). \end{aligned}$$
(3.6)

We pick \(\mathcal{A}= \mathcal{I}( \underline{x}_k )\) if the vectors \(\underline{a}_j, j \in \mathcal{I}( \underline{x}_k )\), are linearly independent. Otherwise, \(\mathcal{A}\) is any subset of \(\mathcal{I}( \underline{x}_k )\) such that \(\underline{a}_j, j \in \mathcal{A}\), is a basis of the linear subspace spanned by \(\underline{a}_j, j \in \mathcal{I}( \underline{x}_k )\). The actual choice of basis does not matter, because the conditions (3.2) are always equivalent to \(\underline{a}_j^T ( \underline{p}_{\ell +1} - \underline{p}_{\ell } ) = 0, j \in \mathcal{I}( \underline{x}_k )\).

In the LINCOA software, \(\underline{d}_1\) is calculated by the Goldfarb–Idnani algorithm [5] for quadratic programming. A subset \(\mathcal{A}\) of \(\mathcal{J}( \underline{x}_k )\) is updated until it becomes the required active set. Let \(\underline{d}( \mathcal{A})\) be the vector \(\underline{d}\) that minimizes \(\Vert \underline{g}_k + \underline{d}\Vert _2\) subject to \(\underline{a}_j^{T} \underline{d}= 0, j \in \mathcal{A}\). The vectors \(\underline{a}_j, j \in \mathcal{A}\), are linearly independent for every \(\mathcal{A}\) that occurs. Also, by employing the signs of some Lagrange multipliers, every \(\mathcal{A}\) is given the property that \(\underline{d}( \mathcal{A})\) is the vector \(\underline{d}\) that minimizes \(\Vert \underline{g}_k + \underline{d}\Vert _2\) subject to \(\underline{a}_j^{T} \underline{d}\le 0, j \in \mathcal{A}\). It follows that the Goldfarb–Idnani calculation is complete when \(\underline{d}= \underline{d}( \mathcal{A})\) satisfies all the inequalities (3.4). Otherwise, a strict increase in \(\Vert \underline{g}_k + \underline{d}( \mathcal{A}) \Vert _2\) is obtained by picking an index \(j \in \mathcal{J}( \underline{x}_k )\) with \(\underline{a}_j^{T} \underline{d}( \mathcal{A}) > 0\) and adding it to \(\mathcal{A}\), combined with deletions from \(\mathcal{A}\) if necessary to achieve the stated conditions on \(\mathcal{A}\). For \(k \ge 2\), the initial \(\mathcal{A}\) of this procedure is derived from the previous active set, which is called a “warm start”. Usually, when the final \(\mathcal{A}\) is different from the initial \(\mathcal{A}\), the amount of work of this part of LINCOA is within our target, namely \(\mathcal{O}( n^2)\) operations for each k.

Let A be the \(n \times | \mathcal{A}|\) matrix that has the columns \(\underline{a}_j, j \in \mathcal{A}\), where \(\mathcal{A}\) is any of the sets that occur in the previous paragraph. When \(\mathcal{A}\) is updated in the Goldfarb–Idnani algorithm, the QR factorization of A is updated too. We let \(\widehat{Q}R\) be this factorization, where \(\widehat{Q}\) is \(n \times | \mathcal{A}|\) with orthonormal columns and where R is square, upper triangular and nonsingular. Furthermore, an \(n \times ( n - | \mathcal{A}| )\) matrix \(\check{Q}\) is calculated and updated such that the \(n \times n\) matrix \(( \widehat{Q}_{\,} |_{\,} \check{Q})\) is orthogonal. We employ the matrices \(\widehat{Q}, R\) and \(\check{Q}\) that are available when the choice of the active set at \(\underline{x}_k\) is complete.

Indeed, the direction \(\underline{d}\) that minimizes \(\Vert \underline{g}_k + \underline{d}\Vert _2\) subject to the equations (3.6) is given by the formula

$$\begin{aligned} \underline{d}_1 \,=\, - \check{Q}_{\,} \check{Q}^{T} \underline{g}_k. \end{aligned}$$
(3.7)

Further, the step \(\underline{p}_{\ell +1} - \underline{p}_{\ell }\) satisfies the conditions (3.2) if and only if \(\underline{p}_{\ell +1} - \underline{p}_{\ell }\) is in the column space of \(\check{Q}\). Thus, until \(\mathcal{A}\) is changed from its choice at \(\underline{x}_k\) to avoid a constraint violation, our search for a small value of \(Q_k ( \underline{x}), \underline{x}\in \mathcal{R}^n\), subject to the active constraints and the bound \(\Vert \underline{x}- \underline{x}_k \Vert \le \Delta _k\), is equivalent to seeking a small value of the quadratic function \(Q_k ( \underline{x}_k + \check{Q}\underline{\sigma }), \underline{\sigma }\in \mathcal{R}^{n-|\mathcal{A}|}\), subject to \(\Vert \underline{\sigma }\Vert \le \Delta _k\). The calculation is now without linear constraints, so we can employ some of the techniques of Sect. 2. We address the construction of \(\check{Q}\underline{\sigma }\) by Krylov subspace and truncated conjugate gradient methods in Sects. 4 and 5, respectively. Much cancellation is usual in the product \(\check{Q}^{T} \underline{g}_k\) for large k, due to the first order conditions at the solution of a smooth optimization problem, and this cancellation occurs in the second factor of the product \(\underline{d}_1 = -\check{Q}( \check{Q}^{T} \underline{g}_k )\). Fortunately, because the definition of \(\check{Q}\) provides \(\underline{a}_j^T \check{Q}= 0, j \in \mathcal{A}\), the vector (3.7) satisfies the constraints \(\underline{a}_j^{T} \underline{d}_1 = 0, j \in \mathcal{A}\), for every product \(\check{Q}^{T} \underline{g}_k\).
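The following sketch illustrates the projection (3.7) under the assumption that \(\check{Q}\) is taken from a full QR factorization of the matrix of active constraint gradients, computed from scratch; LINCOA updates its factorizations instead, but the projected direction is the same.

```python
import numpy as np

# A sketch of the projection (3.7). The columns of A are the active
# gradients a_j; the trailing columns of a complete QR factorization
# of A provide the matrix Q-check of the text.
def projected_steepest_descent(A, g):
    n, q = A.shape
    Qfull, _ = np.linalg.qr(A, mode="complete")
    Qcheck = Qfull[:, q:]                  # n x (n - |A|), orthonormal
    return -Qcheck @ (Qcheck.T @ g), Qcheck

# d1 is orthogonal to every active gradient, whatever the rounding
# errors in the inner product Qcheck^T g.
rng = np.random.default_rng(1)
A, g = rng.standard_normal((6, 2)), rng.standard_normal(6)
d1, _ = projected_steepest_descent(A, g)
assert np.allclose(A.T @ d1, 0.0)
```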

Here is a simple example in two dimensions, illustrated in Fig. 1, that makes an important point. Letting \(x_1\) and \(x_2\) be the components of \(\underline{x}\in \mathcal{R}^2\), and letting \(\underline{x}_k\) be at the origin, we seek a relatively small value of the linear function

$$\begin{aligned} Q_k ( \underline{x}) \,=\, -2\, x_1 - x_2, \quad \underline{x}\in \mathcal{R}^2, \end{aligned}$$
(3.8)

subject to the constraints

$$\begin{aligned} \underline{a}_1^T \underline{x}= x_2 \le 0, \quad \underline{a}_2^T \underline{x}= x_1 + x_2 \le 2 \quad \text{ and } \quad \Vert \underline{x}- \underline{x}_k \Vert _2 \le \sqrt{10}. \end{aligned}$$
(3.9)

The feasible region of Fig. 1 contains the points that satisfy the constraints (3.9), and the steepest descent direction \(-\underline{\nabla }Q_k ( \cdot )\) is shown too. We pick the LINCOA value \(\eta _2 = 0.2\) for expression (3.3), which gives \(\mathcal{J}( \underline{x}_k ) = \{ 1 \}\) with \(m = 2\). A positive step from \(\underline{p}_1 = \underline{x}_k\) along \(-\underline{\nabla }Q_k ( \cdot )\) would violate \(x_2 \le 0\), so the first active set \(\mathcal{A}\) is also \(\{ 1 \}\). Thus \(\underline{p}_2\) is at (2, 0) as shown in Fig. 1, the length of the step to \(\underline{p}_2\) being restricted by the constraints because \(Q_k ( \cdot )\) is linear. Further progress can be made from \(\underline{p}_2\) only by deleting the index 1 from \(\mathcal{A}\), but an empty set is still excluded by the direction of \(-\underline{\nabla }Q_k ( \cdot )\). Therefore \(\mathcal{A}\) is updated from \(\{ 1 \}\) to \(\{ 2 \}\), which causes \(\underline{p}_3\) to be the feasible point on the trust region boundary that satisfies \(\underline{a}_2^T ( \underline{p}_3 - \underline{p}_2 ) = 0\). We find that \(\underline{x}= \underline{p}_3 = (3,-1)\) is the \(\underline{x}\) that minimizes the function (3.8) subject to the constraints (3.9), and that the algorithm supplies the strictly decreasing sequence \(Q_k ( \underline{p}_1 ) = 0, Q_k ( \underline{p}_2 ) = -4\) and \(Q_k ( \underline{p}_3 ) = -5\).

Fig. 1 A change to the active set in two dimensions
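The figures of this example are easy to reproduce numerically. The few lines below, which assume nothing beyond the data (3.8) and (3.9), recover the sequence \(Q_k ( \underline{p}_1 ) = 0, Q_k ( \underline{p}_2 ) = -4\) and \(Q_k ( \underline{p}_3 ) = -5\).

```python
import numpy as np

# A check of the example (3.8)-(3.9); nothing here is part of LINCOA.
Q = lambda x: -2.0 * x[0] - x[1]
p1 = np.zeros(2)
p2 = np.array([2.0, 0.0])              # along d1 = (2, 0) until x1 + x2 = 2
# Along the boundary of the second constraint: ||p2 + t(1, -1)||^2 = 10.
t = np.roots([2.0, 4.0, -6.0]).max()   # the positive root, t = 1
p3 = p2 + t * np.array([1.0, -1.0])
print(Q(p1), Q(p2), Q(p3))             # 0.0 -4.0 -5.0
print(p3)                              # [ 3. -1.]
```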

The important point of this example is that, when \(\mathcal{A}\) is updated at \(\underline{p}_2\), it is necessary not only to add a constraint index to \(\mathcal{A}\) but also to make a deletion from \(\mathcal{A}\). Furthermore, in all cases when \(\mathcal{A}\) is updated at \(\underline{x}= \underline{p}_{\ell }\), say, we want the length of the step from \(\underline{p}_{\ell }\) to \(\underline{p}_{\ell +1}\) to be at least \(\eta _{2\,} \Delta _k\) if it is restricted by the linear constraints. Therefore, if a new \(\mathcal{A}\) is required at the feasible point \(\underline{p}_{\ell }\), it is generated by the procedure that is described earlier in this section, after replacing \(\underline{g}_k\) by \(\underline{\nabla }Q_k ( \underline{p}_{\ell } )\) and \(\underline{x}_k\) by \(\underline{p}_{\ell }\). We are reluctant to update the active set at \(\underline{p}_{\ell }\), however, when \(\underline{p}_{\ell }\) is close to the boundary of the trust region. Indeed, if a new active set at \(\underline{p}_{\ell }\) is under consideration in the LINCOA software, then the change to \(\mathcal{A}\) is made if and only if the distance from \(\underline{p}_{\ell }\) to the trust region boundary is at least \(\eta _2 \Delta _k\), which is the condition \(\Vert \underline{p}_{\ell } - \underline{x}_k \Vert \le ( 1 - \eta _2 )_{\,} \Delta _k\). Otherwise, the calculation of \(\underline{x}_k^+\) is terminated with \(L = \ell \) and \(\underline{x}_k^+ = \underline{p}_{\ell }\).

4 Krylov subspace methods

Let the active set \(\mathcal{A}\) be chosen at the trust region centre \(\underline{p}_1 = \underline{x}_k\) as in Sect. 3. We recall from the paragraph that includes Eq. (3.7) that the minimization of \(Q_k ( \underline{x}), \underline{x}\in \mathcal{R}^n\), subject to the active and trust region constraints

$$\begin{aligned} \underline{a}_j^T ( \underline{x}- \underline{x}_k ) \,=\, 0, \quad j \in \mathcal{A}, \quad \text{ and } \quad \left\| \underline{x}- \underline{x}_k \right\| \,\le \, \Delta _k \end{aligned}$$
(4.1)

is equivalent to the minimization of the quadratic function

$$\begin{aligned} Q_k^{{\, \mathrm{red}}}( \underline{\sigma }) \,=\, Q_k \left( \underline{x}_k + \check{Q}_{\,} \underline{\sigma }\right) , \quad \underline{\sigma }\in \mathcal{R}^{n-|\mathcal{A}|}, \end{aligned}$$
(4.2)

subject only to the trust region bound \(\Vert \underline{\sigma }\Vert \le \Delta _k\), where \(\check{Q}\) is an \(n \times (n - | \mathcal{A}| )\) matrix with orthonormal columns that are orthogonal to \(\underline{a}_j, j \in \mathcal{A}\). Because there is a trust region bound but no linear constraints on \(\underline{\sigma }\), either the Krylov subspace method or the conjugate gradient method can be taken from Sect. 2 to construct a relatively small value of \(Q_k^{{\, \mathrm{red}}}( \underline{\sigma }), \Vert \underline{\sigma }\Vert \le \Delta _k\). The Krylov subspace alternative receives attention in this section. First we express it in terms of the original variables \(\underline{x}\in \mathcal{R}^n\).

Recalling that the Krylov subspace \(\mathcal{K}_{\ell }\) in Sect. 2 is the span of the vectors \(H_k^{j-1} \underline{g}_k \in \mathcal{R}^n, j = 1,2, \ldots , \ell \), we now let \(\mathcal{K}_{\ell }\) be the subspace of \(\mathcal{R}^{n-|\mathcal{A}|}\) that is spanned by \(( \nabla ^2 Q_k^{{\, \mathrm{red}}})^{j-1\,} \underline{\nabla }Q_k^{{\, \mathrm{red}}}(0), j = 1,2, \ldots , \ell \). Further, corresponding to the sequence \(\underline{p}_{\ell } \in \mathcal{R}^n, \ell = 1,2,3, \ldots \), in Sect. 2, we consider the sequence \(\underline{s}_{\ell } \in \mathcal{R}^{n-|\mathcal{A}|}, \ell = 1,2,3, \ldots \), where \(\underline{s}_1\) is zero, and where the \(\ell \)-th iteration of the Krylov subspace method sets \(\underline{s}_{\ell +1}\) to the vector \(\underline{\sigma }\) in \(\mathcal{K}_{\ell }\) that minimizes \(Q_k^{{\, \mathrm{red}}}( \underline{\sigma })\) subject to \(\Vert \underline{\sigma }\Vert \le \Delta _k\). The analogue of Eq. (2.11) is that we write \(\underline{s}_{\ell +1}\) as the sum

$$\begin{aligned} \underline{s}_{\ell +1} \,=\, \mathop {\sum }\limits _{i=1}^{\ell } \, \theta _{i\,} \underline{w}_i \,\in \, \mathcal{R}^{n-|\mathcal{A}|}, \end{aligned}$$
(4.3)

each \(\underline{w}_i\) being a vector of unit length in \(\mathcal{K}_i\) with the property \(\underline{w}_i^T \underline{w}_j = 0, i \ne j\). We pick \(\underline{w}_1 = -\underline{\nabla }Q_k^{{\, \mathrm{red}}}( 0 ) / \Vert \underline{\nabla }Q_k^{{\, \mathrm{red}}}(0) \Vert \), and, for \(\ell \ge 2\), the Arnoldi formula (2.14) supplies the vector

$$\begin{aligned} \underline{w}_{\ell } \;=\; \frac{\nabla ^2 {Q_k^{{\, \mathrm{red}}}}_{\,} \underline{w}_{\ell -1} - \mathop {\sum }\nolimits _{j=1}^ {\ell -1}\, \left( \underline{w}_j^{T\,} \nabla ^2 {Q_k^{{\, \mathrm{red}}}}_{\,} \underline{w}_{\ell -1} \right) \, \underline{w}_j}{\left\| \, \nabla ^2 {Q_k^{{\, \mathrm{red}}}}_{\,} \underline{w}_{\ell -1} - \mathop {\sum }\nolimits _{j=1}^{\ell -1}\, \left( \underline{w}_j^{T\,} \nabla ^2 {Q_k^{{\, \mathrm{red}}}}_{\,} \underline{w}_{\ell -1} \right) \, \underline{w}_j\, \right\| }. \end{aligned}$$
(4.4)

Equation (2.10), which is very useful for trust region calculations without linear constraints, takes the form

$$\begin{aligned} \underline{w}_{\ell }^{T\,} \nabla ^2 {Q_k^{{\, \mathrm{red}}}}_{\,} \underline{w}_j \,=\, 0, \quad 1 \le j \le \ell - 2, \end{aligned}$$
(4.5)

so again there are at most two nonzero terms in the sum of the Arnoldi formula.

Instead of generating each \(\underline{w}_i\) explicitly, however, we work with the vectors \(\underline{v}_i = \check{Q}\underline{w}_i \in \mathcal{R}^n, i \ge 1\), which are analogous to the vectors \(\underline{v}_i\) in the second half of Sect. 2. In particular, because \(\check{Q}^T \check{Q}\) is the \(( n - |\mathcal{A}| ) \times ( n - |\mathcal{A}| )\) unit matrix, they enjoy the orthonormality property

$$\begin{aligned} \underline{v}_i^T \underline{v}_j \,=\, \left( \check{Q}\underline{w}_i \right) ^T \left( \check{Q}\underline{w}_j \right) \,=\, \underline{w}_i^T \underline{w}_j \,=\, \delta _{ij}, \end{aligned}$$
(4.6)

for all relevant positive integers i and j. Furthermore, Eqs. (4.2) and (4.3) show that \(Q_k^{{\, \mathrm{red}}}( \underline{s}_{\ell +1} )\) is the same as \(Q_k ( \underline{p}_{\ell +1} )\), where \(\underline{p}_{\ell +1}\) is the point

$$\begin{aligned} \underline{p}_{\ell +1} \;=\; \underline{x}_k + \mathop {\sum }\limits _{i=1}^{\ell } \, \theta _{i\,} \check{Q}_{\,} \underline{w}_i \;=\; \underline{x}_k + \mathop {\sum }\limits _{i=1}^{\ell } \, \theta _{i\,} \underline{v}_i. \end{aligned}$$
(4.7)

It follows that the required values of the parameters \(\theta _i, i = 1,2, \ldots , \ell \), are given by the minimization of the function

$$\begin{aligned} \Phi _{k \ell } ( \underline{\theta }) \;=\; Q_k \left( \underline{x}_k + \mathop {\sum }\limits _{i=1}^{\ell } \, \theta _{i\,} \underline{v}_i \right) , \quad \underline{\theta }\in \mathcal{R}^{\ell }, \end{aligned}$$
(4.8)

subject to the trust region bound \(\Vert \underline{\theta }\Vert \le \Delta _k\), which is like the calculation in the paragraph that includes expressions (2.11)–(2.13).

In order to construct \(\underline{v}_i, i \ge 1\), we replace \(\underline{w}_i\) in the definition \(\underline{v}_i = \check{Q}\underline{w}_i\) by a vector that is available. The quadratic (4.2) has the first and second derivatives

$$\begin{aligned} \left. \begin{array}{c} \underline{\nabla }Q_k^{{\, \mathrm{red}}}( \underline{\sigma }) \,=\, \check{Q}^{T\,} \underline{\nabla }Q_k \left( \underline{x}_k + \check{Q}_{\,} \underline{\sigma }\right) , \quad \underline{\sigma }\in \mathcal{R}^{n-|\mathcal{A}|}, \\ \text{ and } \qquad \nabla ^2 Q_k^{{\, \mathrm{red}}}\;=\; \check{Q}^{T\,} \nabla ^2 Q_{k\,} \check{Q}\;=\; \check{Q}^{T} H_{k\,} \check{Q}. \end{array} \right\} \end{aligned}$$
(4.9)

Hence \(\underline{\nabla }Q_k ( \underline{x}_k ) = \underline{g}_k\) yields the vector

$$\begin{aligned} \underline{v}_1 \,=\, \check{Q}_{\,} \underline{w}_1 \,=\, -\check{Q}_{\,} \underline{\nabla }Q_k^{{\, \mathrm{red}}}(0)/\left\| \underline{\nabla }Q_k^{{\, \mathrm{red}}}(0) \right\| \,=\, -\check{Q}_{\,} \check{Q}^T \underline{g}_k /\left\| \check{Q}_{\,} \check{Q}^T \underline{g}_k \right\| \!. \end{aligned}$$
(4.10)

For \(\ell \ge 2, \underline{v}_{\ell } = \check{Q}\underline{w}_{\ell }\) is obtained from Eq. (4.4) multiplied by \(\check{Q}\). Specifically, the identities

$$\begin{aligned} \nabla ^2 {Q_k^{{\, \mathrm{red}}}}_{\,} \underline{w}_{\ell -1} \;=\; \check{Q}^{T} H_{k\,} \check{Q}_{\,} \underline{w}_{\ell -1} \;=\; \check{Q}^{T} H_{k\,} \underline{v}_{\ell -1} \end{aligned}$$
(4.11)

and

$$\begin{aligned} \underline{w}_{\ell }^{T\,} \nabla ^2 {Q_k^{{\, \mathrm{red}}}}_{\,} \underline{w}_j \;=\; \underline{w}_{\ell }^T \left( \check{Q}^{T} H_{k\,} \check{Q}\right) _{\,} \underline{w}_j \;=\; \underline{v}_{\ell }^{T} H_{k\,} \underline{v}_j, \quad 1 \le j \le \ell , \end{aligned}$$
(4.12)

give the Arnoldi formula

$$\begin{aligned} \underline{v}_{\ell } \;=\; \check{Q}_{\,} \underline{w}_{\ell } \;=\; \frac{\check{Q}_{\,} \check{Q}^{T} H_{k\,} \underline{v}_{\ell -1} - \mathop {\sum }\nolimits _{j=1}^ {\ell -1}\, \left( \underline{v}_j^{T} H_{k\,} \underline{v}_{\ell -1} \right) \, \underline{v}_j}{\left\| \, \check{Q}_{\,} \check{Q}^{T} H_{k\,} \underline{v}_{\ell -1} - \mathop {\sum }\nolimits _{j=1}^{\ell -1}\, \left( \underline{v}_j^{T} H_{k\,} \underline{v}_{\ell -1} \right) \, \underline{v}_j\, \right\| }, \quad \ell \ge 2. \end{aligned}$$
(4.13)

Only the last two terms of its sum can be nonzero as before due to Eqs. (4.5) and (4.12).

The Krylov subspace method with the active set \(\mathcal{A}\) is now very close to the method described in the two paragraphs that include Eqs. (2.11)–(2.14). The first change to the description in Sect. 2 is that, instead of the form (2.1), \(\underline{p}_2\) is now the point \(\underline{p}_1 - \alpha _1 \check{Q}\check{Q}^{T} \underline{g}_k\), where \(\alpha _1\) is the value of \(\alpha \) that minimizes \(Q_k ( \underline{p}_1 - \alpha \check{Q}\check{Q}^{T} \underline{g}_k ), \alpha \in \mathcal{R}\), subject to \(\Vert \underline{p}_2 - \underline{x}_k \Vert \le \Delta _k\). Equation (2.10) is a consequence of the properties (4.5) and (4.12), so \(\nabla ^2 \Phi _{k \ell } ( \cdot )\) is still tridiagonal. The last \(\ell - 1\) components of the gradient \(\underline{\nabla }\Phi _{k \ell } (0) \in \mathcal{R}^{\ell }\) are still zero, but its first component is now \(\pm \Vert \check{Q}\check{Q}^{T} \underline{g}_k \Vert = \pm \Vert \check{Q}^{T} \underline{g}_k \Vert \). Finally, the Arnoldi formula (2.14) is replaced by expression (4.13). We retain termination if inequality (2.6) holds.
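A sketch of one step of this reduced calculation, namely the projected Arnoldi formula (4.13): it is the step of Sect. 2 with \(H_{k\,} \underline{v}_{\ell -1}\) replaced by its projection \(\check{Q}\check{Q}^{T} H_{k\,} \underline{v}_{\ell -1}\), and by Eqs. (4.5) and (4.12) only the last two vectors enter the orthogonalization. The names are illustrative.

```python
import numpy as np

# The projected Arnoldi step (4.13).
def projected_arnoldi_step(Qcheck, Hv_prev, v_prev, v_prev2=None):
    """Hv_prev = H_k v_{ell-1}; v_prev = v_{ell-1}; v_prev2 = v_{ell-2}."""
    w = Qcheck @ (Qcheck.T @ Hv_prev)      # project into the null space
    w -= (v_prev @ Hv_prev) * v_prev       # v_j^T (Qc Qc^T) Hv = v_j^T Hv
    if v_prev2 is not None:
        w -= (v_prev2 @ Hv_prev) * v_prev2
    norm = np.linalg.norm(w)
    return None if norm == 0.0 else w / norm
```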

The Krylov method was employed in an early version of the LINCOA software. If the calculated point (4.7) violates an inactive linear constraint, then \(\underline{p}_{\ell +1}\) has to be replaced by a feasible point, and often the active set is changed. Some difficulties may occur, however, which are addressed in the remainder of this section. Because of them, the current version of LINCOA applies the version of conjugate gradients given in Sect. 5, instead of the Krylov method.

Here is an instructive example in only two variables with an empty active set, where both \(\underline{p}_1 = \underline{x}_k\) and \(\underline{p}_2\) are feasible, but the \(\underline{p}_3\) of the Krylov method violates a constraint. We seek a small value of the quadratic model

$$\begin{aligned} Q_k ( \underline{x}) \;=\; -50\, x_1 - 8\, x_1 x_2 - 44\, x_2^2, \quad \underline{x}\in \mathcal{R}^2, \end{aligned}$$
(4.14)

subject to \(\Vert \underline{x}\Vert \le 1\) and \(x_2 \le 0.2\), the trust region centre being at the origin, which is well away from the boundary of the linear constraint. The point \(\underline{p}_2\) has the form (2.1), where \(\underline{p}_1 = \underline{x}_k = 0\) and \(\underline{g}_k = \underline{\nabla }Q_k ( \underline{x}_k ) = (-50,0)^T\). The function \(Q_k ( \underline{p}_1 - \alpha \underline{g}_k ), \alpha \in \mathcal{R}\), is linear, so \(\underline{p}_2\) is on the trust region boundary at \((1,0)^T\), which is also well away from the boundary of the linear constraint. Therefore \(\underline{p}_3\) has the form (2.11) with \(\ell = 2\), the coefficients \(\theta _1\) and \(\theta _2\) being calculated to minimize expression (2.12) subject to \(\Vert \underline{\theta }\Vert \le 1\). In other words, because \(n = 2\), the point \(\underline{p}_3\) is the vector \(\underline{x}\) that minimizes the model (4.14) subject to \(\Vert \underline{x}\Vert \le 1\), and one can verify that this point is the unconstrained minimizer of the strictly convex quadratic function \(Q_k ( \underline{x}) + 47_{\,} \Vert \underline{x}\Vert ^2, \underline{x}\in \mathcal{R}^2\). Thus we find the sequence

$$\begin{aligned} \underline{p}_1 \,=\, \left( \begin{array}{c} 0 \\ 0 \end{array} \right) , \quad \underline{p}_2 \,=\, \left( \begin{array}{c} 1 \\ 0 \end{array} \right) \quad \text{ and } \quad \underline{p}_3 \,=\, \left( \begin{array}{c} 0.6 \\ 0.8 \end{array} \right) , \end{aligned}$$
(4.15)

with \(Q_k (\underline{p}_1 ) = 0, Q_k ( \underline{p}_2 ) = -50\) and \(Q_k ( \underline{p}_3) = -62\).

The point \(\underline{p}_3\) is unacceptable, however, as it violates the constraint \(x_2 \le 0.2\). The condition \(Q_k ( \underline{p}_3 ) < Q_k ( \underline{p}_2 )\) suggests that it may be helpful to consider the function of one variable

$$\begin{aligned} Q_k \left( \underline{p}_2 + \alpha \, \left\{ \underline{p}_3 - \underline{p}_2 \right\} \right) \;=\; -50 + 13.6\, \alpha - 25.6\, \alpha ^2, \quad \alpha \in \mathcal{R}, \end{aligned}$$
(4.16)

in order to move to the feasible point on the straight line through \(\underline{p}_2\) and \(\underline{p}_3\) that provides the least value of \(Q_k ( \cdot )\) subject to the trust region bound. Feasibility demands \(\alpha \le 0.25\), but the function (4.16) increases strictly monotonically for \(0 \le \alpha \le 17/64\), so all positive values of \(\alpha \) are rejected. Furthermore, all negative values of \(\alpha \) are excluded by the trust region bound. In this case, therefore, the point \(\underline{p}_3\) of the Krylov method fails to provide an improvement over \(\underline{p}_2\), even when \(\underline{p}_3 - \underline{p}_2\) is used as a search direction from \(\underline{p}_2\). On the other hand, \(\underline{p}_2\) can be generated easily by the conjugate gradient method, and then \(\underline{x}_k^+ = \underline{p}_2\) is chosen, because \(\underline{p}_2\) is on the trust region boundary.
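A numerical check of this example is immediate, and the lines below reproduce the figures of the text: \(\underline{p}_3\) is stationary for the convex function \(Q_k ( \underline{x}) + 47\, \Vert \underline{x}\Vert ^2\), and the function (4.16) increases on the whole feasible range \(0 \le \alpha \le 0.25\).

```python
import numpy as np

# A check of the example (4.14)-(4.16); nothing here is part of LINCOA.
Q = lambda x: -50.0 * x[0] - 8.0 * x[0] * x[1] - 44.0 * x[1] ** 2
p2, p3 = np.array([1.0, 0.0]), np.array([0.6, 0.8])
print(Q(p2), Q(p3))                       # -50.0 -62.0
# p3 is the stationary point of the convex function Q(x) + 47 ||x||^2.
grad_p3 = np.array([-50.0 - 8.0 * p3[1], -8.0 * p3[0] - 88.0 * p3[1]])
assert np.allclose(grad_p3 + 94.0 * p3, 0.0)
# Eq. (4.16) increases strictly on the feasible range 0 <= alpha <= 0.25.
alpha = np.linspace(0.0, 0.25, 6)
print(-50.0 + 13.6 * alpha - 25.6 * alpha ** 2)
```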

The minimization of the function (4.14) subject to the constraints

$$\begin{aligned} \Vert \underline{x}\Vert \le 1, \quad x_2 \le 0.2 \quad \text{ and } \quad x_1 \!+ 0.1_{\,} x_2 \le 0.6 \end{aligned}$$
(4.17)

is also instructive. The data \(\underline{p}_1 = \underline{x}_k = 0\) and \(\underline{g}_k = (-50,0)^T\) imply again that the step from \(\underline{p}_1\) to \(\underline{p}_2\) is along the first coordinate direction, and now \(\underline{p}_2\) is at \((0.6,0)^T\), because of the last of the conditions (4.17). The move from \(\underline{p}_2\) is along the boundary of \(x_1 + 0.1 x_2 \le 0.6\), the index of this constraint being put into the active set \(\mathcal{A}\), so \(\underline{p}_3\) and \(Q_k ( \underline{p}_3 )\) take the values

$$\begin{aligned} \underline{p}_3 \,=\, \left( \begin{array}{c} 0.6 + 0.1 \alpha \\ -\alpha \end{array} \right) \quad \text{ and } \quad Q_k ( \underline{p}_3 ) \,=\, -30 - 0.2_{\,} \alpha - 43.2_{\,} \alpha ^2, \end{aligned}$$
(4.18)

for some steplength \(\alpha \) that has to be chosen. The Krylov method would calculate the \(\alpha \) that minimizes \(Q_k ( \underline{p}_3 )\) subject to \(\Vert \underline{p}_3 \Vert \le 1\), which is \(\alpha = -0.8576\), and then \(\underline{p}_3\) is on the trust region boundary at \(( 0.5142, 0.8576 )^T\) with \(Q_k ( \underline{p}_3 ) = -61.6048\), the exact figures being rounded to four decimal places. This \(\underline{p}_3\) violates the constraint \(x_2 \le 0.2\), however, so, if the sign of \(\alpha \) is accepted, and if \(| \alpha |\) is reduced by a line search to provide feasibility, then we find \(\alpha = -0.2, \underline{p}_3 = (0.58, 0.2 )^T\) and \(Q_k ( \underline{p}_3 ) = -31.688\). On the other hand, a conjugate gradient search would go downhill from \(\underline{p}_2\) to \(\underline{p}_3\), and expression (4.18) for \(Q_k ( \underline{p}_3 )\) shows that \(\alpha \) would be positive. This search would reach the trust region boundary at the feasible point \(\underline{p}_3 = ( 0.6739, -0.7388 )^T\), with \(\alpha = 0.7388\) and \(Q_k ( \underline{p}_3 ) = -53.7298\). Thus it can happen that the Krylov method is highly disadvantageous if its steplengths are reduced without a change of sign for the sake of feasibility.
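
These figures can be reproduced by the sketch below, again our own illustration rather than LINCOA code; the two roots of the trust region equation give the Krylov and the conjugate gradient steplengths.

```python
# A check of the second example, Eqs. (4.17)-(4.18).
import numpy as np

Q3 = lambda a: -30.0 - 0.2 * a - 43.2 * a**2          # Eq. (4.18)

# |p3|^2 = (0.6 + 0.1a)^2 + a^2 = 1  gives  1.01 a^2 + 0.12 a - 0.64 = 0
a_neg, a_pos = sorted(np.roots([1.01, 0.12, -0.64]))
print(a_neg, Q3(a_neg))   # about -0.8576 and -61.6048: the Krylov choice
print(a_pos, Q3(a_pos))   # about  0.7388 and -53.7298: the downhill CG choice
print(Q3(-0.2))           # -31.688 after shortening the Krylov step for feasibility
```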

Another infelicity of the Krylov method when there are linear constraints is shown by the final example of this section. We consider the minimization of the function

$$\begin{aligned} Q_k ( \underline{x}) = x_1^2 - 0.1_{\,} x_2^2 - 20_{\,} x_3^2 - 10_{\,} x_1 x_3 + 4.8_{\,} x_2 x_3 - 4_{\,} x_2 - x_3 + 21, \quad \underline{x}\in \mathcal{R}^3,\quad \end{aligned}$$
(4.19)

subject to the constraints

$$\begin{aligned} x_3 \,\le \, 1 \quad \text{ and } \quad \Vert \underline{x}\Vert \,\le \, \sqrt{26}, \end{aligned}$$
(4.20)

so again the trust region centre \(\underline{x}_k\) is at the origin. The step to \(\underline{p}_2\) from \(\underline{p}_1 = \underline{x}_k\) is along the direction \(-\underline{\nabla }Q_k ( \underline{x}_k ) = (0,4,1)^T\), and, because \(Q_k ( \cdot )\) has negative curvature along this direction, the step is the longest one allowed by the constraints, giving \(\underline{p}_2 = (0,4,1)^T\). Further, the active set becomes nonempty, in order to follow the boundary of \(x_3 \le 1\), which, after the step from \(\underline{p}_1\) to \(\underline{p}_2\), reduces the calculation to the minimization of the function of two variables

$$\begin{aligned} \widetilde{Q}_k ( \widetilde{\underline{x}}) \,=\, Q_k \left( \begin{array}{c} \widetilde{x}_1 \\ \widetilde{x}_2 \\ 1 \end{array} \right) \,=\, \widetilde{x}_1^2 - 0.1_{\,} \widetilde{x}_2^2 -10_{\,} \widetilde{x}_1 + 0.8_{\,} \widetilde{x}_2, \quad \widetilde{\underline{x}}\in \mathcal{R}^2, \end{aligned}$$
(4.21)

subject to \(\Vert \widetilde{\underline{x}}\Vert \le 5\), starting at the point \(\widetilde{\underline{p}}_1 = (0,4)^T\), although the centre of the trust region is still at the origin.

The Krylov subspace \(\mathcal{K}_{\ell }, \ell \ge 1\), for the minimization of \(\widetilde{Q}_k ( \widetilde{\underline{x}}), \widetilde{\underline{x}}\in \mathcal{R}^2\), is the space spanned by the vectors \(( \nabla ^2 \widetilde{Q}_k )^{j-1} \underline{\nabla }\widetilde{Q}_k ( \widetilde{\underline{p}}_1 ), j = 1,2, \ldots , \ell \). The example has been chosen so that the matrix \(\nabla ^2 \widetilde{Q}_k\) is diagonal and so that \(\underline{\nabla }\widetilde{Q}_k ( \widetilde{\underline{p}}_1 )\) is a multiple of the first coordinate vector. It follows that all searches in Krylov subspaces, starting at \(\widetilde{\underline{p}}_1 = (0,4)^T\), cannot alter the value \(\widetilde{x}_2 = 4\) of the second variable, assuming that computer arithmetic is without rounding errors. The search from \(\widetilde{\underline{p}}_1\) to \(\widetilde{\underline{p}}_2\) is along the direction \(-\underline{\nabla }\widetilde{Q}_k ( \widetilde{\underline{p}}_1 ) = (10,0)^T\). It goes to the point \(\widetilde{\underline{p}}_2 = (3,4)^T\) on the trust region boundary, which corresponds to \(\underline{p}_3 = (3,4,1)^T\). Here \(\widetilde{Q}_k ( \cdot )\) is least subject to \(\Vert \widetilde{\underline{x}}\Vert \le 5\) and \(\widetilde{x}_2 = 4\), so the iterations of the Krylov method are complete. They generate the monotonically decreasing sequence

$$\begin{aligned} Q_k ( \underline{p}_1 ) = 21, \quad Q_k ( \underline{p}_2 ) = \widetilde{Q}_k ( \widetilde{\underline{p}}_1 ) = 1.6 \quad \text{ and } \quad Q_k ( \underline{p}_3 ) = \widetilde{Q}_k ( \widetilde{\underline{p}}_2 ) = -19.4.\nonumber \\ \end{aligned}$$
(4.22)

In this case, the Krylov method brings the remarkable disadvantage that \(\widetilde{Q}_k ( \widetilde{\underline{p}}_2 )\) is not the least value of \(\widetilde{Q}_k ( \widetilde{\underline{x}})\) subject to \(\Vert \widetilde{\underline{x}}\Vert \le 5\). Indeed, \(\widetilde{Q}_k ( \widetilde{\underline{x}}) = -25\) is achieved by \(\widetilde{\underline{x}}= (5,0)^T\), for example. The conjugate gradient method would also generate the sequence \(\underline{p}_{\ell }, \ell = 1,2,3\), of the Krylov method, with termination at \(\underline{p}_3\) because it is on the boundary of the trust region. A way of extending the conjugate gradient alternative by searching round the trust region boundary is considered in Sect. 6. It can calculate the vector \(\widetilde{\underline{x}}\) that minimizes the quadratic (4.21) subject to \(\Vert \widetilde{\underline{x}}\Vert \le 5\), and it is included in the NEWUOA software for unconstrained optimization without derivatives ([10]).
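
A few lines of Python confirm the values above, including the feasible point that the Krylov iterations cannot reach; this is a sketch for checking only.

```python
# Numerical companion to the third example, Eqs. (4.19)-(4.22).
import numpy as np

def Q(x):
    x1, x2, x3 = x                          # the model (4.19)
    return (x1**2 - 0.1*x2**2 - 20*x3**2 - 10*x1*x3
            + 4.8*x2*x3 - 4*x2 - x3 + 21)

for p in ([0, 0, 0], [0, 4, 1], [3, 4, 1], [5, 0, 1]):
    print(p, Q(np.array(p, dtype=float)))
# 21.0, 1.6 and -19.4 reproduce the sequence (4.22), while the feasible point
# (5, 0, 1) achieves -25.0, out of reach of the Krylov subspaces here.
```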

5 Conjugate gradient methods

Taking the decision to employ conjugate gradients instead of Krylov subspace methods in the LINCOA software provided much relief, because the difficulties in the second half of Sect. 4 are avoided. In particular, if a step of the conjugate gradient method goes from \(\underline{p}_{\ell } \in \mathcal{R}^n\) to \(\underline{p}_{\ell +1} \in \mathcal{R}^n\), then the quadratic model along this step, which is the function \(Q_k ( \underline{p}_{\ell } + \alpha \{ \underline{p}_{\ell +1} - \underline{p}_{\ell } \} ), 0 \le \alpha \le 1\), decreases strictly monotonically. The initial point of every step is feasible. It follows that, if one or more of the linear constraints (1.1) are violated at \(\underline{x}= \underline{p}_{\ell +1}\), then it is suitable to replace \(\underline{p}_{\ell +1}\) by the point

$$\begin{aligned} \underline{p}_{\mathrm{new}}\;=\; \underline{p}_{\ell } + \alpha ^{*\,} \left( \underline{p}_{\ell +1} \!- \underline{p}_{\ell } \right) , \end{aligned}$$
(5.1)

where \(\alpha ^*\) is the greatest \(\alpha \in \mathcal{R}\) such that \(\underline{p}_{\mathrm{new}}\) is feasible. Thus \(0 \le \alpha ^* < 1\) holds, and \(\underline{p}_{\mathrm{new}}\) is on the boundary of a constraint whose index is not in the current \(\mathcal{A}\).
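
In code, the calculation of \(\alpha ^*\) is a one-pass ratio test over the constraints. The sketch below is schematic (the function name and the convention that the normals \(\underline{a}_j\) are the columns of A are ours), and it ignores the rounding safeguards that a careful implementation would need.

```python
# Greatest alpha in [0, 1] with p + alpha d feasible, as in Eq. (5.1).
import numpy as np

def max_feasible_step(p, d, A, b):
    """A has the constraint normals a_j as columns; p satisfies A^T p <= b."""
    alpha = 1.0
    slack = b - A.T @ p                     # nonnegative, since p is feasible
    rate = A.T @ d                          # speed at which each slack is used up
    for s, r in zip(slack, rate):
        if r > 0.0:                         # only these constraints can be hit
            alpha = min(alpha, s / r)
    return alpha
```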

The conjugate gradient method of Sect. 2 without linear constraints can be applied to the reduced function

$$\begin{aligned} Q_k^{{\, \mathrm{red}}}( \underline{\sigma }) \,=\, Q_k \left( \underline{p}_1 \!+ \check{Q}_{\,} \underline{\sigma }\right) , \quad \underline{\sigma }\in \mathcal{R}^{n-|\mathcal{A}|}, \end{aligned}$$
(5.2)

after generating the active set \(\mathcal{A}\) at \(\underline{p}_1\), as suggested in the paragraph that includes Eq. (3.7). Thus, starting at \(\underline{\sigma }= \underline{\sigma }_1 = 0\), and until a termination condition is satisfied, we find the conjugate gradient search directions of the function (5.2) in \(\mathcal{R}^{ n-|\mathcal{A}|}\). Letting these directions be \(\underline{d}_{\ell }^{{\, \mathrm{red}}}, \ell = 1,2,3, \ldots \), which have the downhill property \(( \underline{d}_{\ell }^{{\, \mathrm{red}}} )^T \underline{\nabla }Q_k^{{\, \mathrm{red}}}( \underline{\sigma }_{\ell } ) < 0\), each new point \(\underline{\sigma }_{\ell +1} \in \mathcal{R}^{n-|\mathcal{A}|}\) has the form

$$\begin{aligned} \underline{\sigma }_{\ell +1} \,=\, \underline{\sigma }_{\ell } + \alpha _{\ell \,} \underline{d}_{\ell }^{{\, \mathrm{red}}}, \end{aligned}$$
(5.3)

where \(\alpha _{\ell }\) is the value of \(\alpha \ge 0\) that minimizes \(Q_k^{{\, \mathrm{red}}}( \underline{\sigma }_{\ell } + \alpha _{\,} \underline{d}_{\ell }^{{\, \mathrm{red}}} )\) subject to a trust region bound. Equation (2.2) for the reduced problem provides the directions

$$\begin{aligned} \underline{d}_{\ell }^{{\, \mathrm{red}}} \;=\; \left\{ \begin{array}{cr} - \underline{\nabla }Q_k^{{\, \mathrm{red}}}(0), &{} \ell = 1, \\ -\underline{\nabla }Q_k^{{\, \mathrm{red}}}( \underline{\sigma }_{\ell } ) + \beta _{\ell \,} \underline{d}_{\ell -1}^{{\, \mathrm{red}}}, &{} \quad \ell \ge 2, \end{array} \right. \end{aligned}$$
(5.4)

where \(\beta _{\ell }\) is defined by the conjugacy condition

$$\begin{aligned} ( \underline{d}_{\ell }^{{\, \mathrm{red}}} )^{T\,} \nabla ^2 Q_k^{{\, \mathrm{red}}}\, \underline{d}_{\ell -1}^{{\, \mathrm{red}}} \,=\, 0. \end{aligned}$$
(5.5)

As in Sect. 4, we prefer to work with the original variables \(\underline{x}\in \mathcal{R}^n\), instead of with the reduced variables \(\underline{\sigma }\in \mathcal{R}^{n-|\mathcal{A}| }\), so we express the techniques of the previous paragraph in terms of the original variables. In particular, the line search of the \(\ell \)-th conjugate gradient iteration sets \(\alpha _{\ell }\) to the value of \(\alpha \ge 0\) that minimizes \(Q_k ( \underline{p}_{\ell } + \alpha _{\,} \underline{d}_{\ell } )\) subject to \(\Vert \underline{p}_{\ell } + \alpha _{\,} \underline{d}_{\ell } - \underline{x}_k \Vert \le \Delta _k\), where \(\underline{p}_{\ell }\) and \(\underline{d}_{\ell }\) are the vectors

$$\begin{aligned} \underline{p}_{\ell } \,=\, \underline{p}_1 + \check{Q}_{\,} \underline{\sigma }_{\ell } \quad \text{ and } \quad \underline{d}_{\ell } \,=\, \check{Q}_{\,} \underline{d}_{\ell }^{{\, \mathrm{red}}}, \quad \ell \ge 1, \end{aligned}$$
(5.6)

which follows from the definition (5.2). Thus Eq. (5.3) supplies the sequence of points \(\underline{p}_{\ell +1} = \underline{p}_{\ell } + \alpha _{\ell \,} \underline{d}_{\ell }, \ell = 1,2,3, \ldots \), and there is no change to each steplength \(\alpha _{\ell }\). Furthermore, formula (5.4) and the definition (5.2) supply the directions

$$\begin{aligned} \underline{d}_{\ell } \,=\, \check{Q}_{\,} \underline{d}_{\ell }^{{\, \mathrm{red}}} \;=\; \left\{ \begin{array}{l@{\quad }ll} -\check{Q}_{\,} \underline{\nabla }Q_k^{{\, \mathrm{red}}}(0) \,=\, -\check{Q}_{\,} \check{Q}^{T\,} \underline{\nabla }Q_k ( \underline{p}_1 ), &{} \ell = 1, \\ -\check{Q}_{\,} \check{Q}^{T\,} \underline{\nabla }Q_k ( \underline{p}_{\ell } ) + \beta _{\ell \,} \underline{d}_{\ell -1}, &{} \ell \ge 2, \end{array} \right. \end{aligned}$$
(5.7)

and there is no change to the value of \(\beta _{\ell }\). Because Eqs. (5.6) and (4.9) give the identities

$$\begin{aligned} \underline{d}_{\ell }^{\,T\!} H_{k\,} \underline{d}_j \,=\, ( \underline{d}_{\ell }^{{\, \mathrm{red}}} )^T \check{Q}^{T} H_{k\,} \check{Q}_{\,} \underline{d}_j^{{\, \mathrm{red}}} \,=\, ( \underline{d}_{\ell }^{{\, \mathrm{red}}} )^{T\,} \nabla ^2 {Q_k^{{\, \mathrm{red}}}}_{\,} \underline{d}_j^{{\, \mathrm{red}}}, \quad j = 1,2, \ldots , \ell , \end{aligned}$$
(5.8)

the condition (5.5) that defines \(\beta _{\ell }\) is just \(\underline{d}_{\ell }^{\,T\!} H_{k\,} \underline{d}_{\ell -1} = 0\), which agrees with the choice of \(\beta _{\ell }\) in formula (2.2). Therefore, until a termination condition holds, the conjugate direction method for the active set \(\mathcal{A}\) is the same as the conjugate direction method in Sect. 2, except that the gradients \(\underline{\nabla }Q_k ( \underline{p}_{\ell } ), \ell \ge 1\), are multiplied by the projection operator \(\check{Q}\check{Q}^T\). Thus the directions (5.7) have the property \(\underline{a}_j^{T} \underline{d}_{\ell } = 0, j \in \mathcal{A}\), in order to satisfy the constraints (3.2).

The conditions of LINCOA for terminating the conjugate gradient steps for the current active set \(\mathcal{A}\) are close to the conditions in the paragraph that includes expressions (2.4)–(2.6). Again there is termination with \(\underline{x}_k^+ = \underline{p}_{\ell }\) if inequality (2.4) or (2.5) holds, where \(\hat{\alpha }_{\ell }\) is still the nonnegative value of \(\alpha \) such that \(\underline{p}_{\ell } + \alpha _{\,} \underline{d}_{\ell }\) is on the trust region boundary. Alternatively, the new point \(\underline{p}_{\ell +1} = \underline{p}_{\ell } + \alpha _{\ell \,} \underline{d}_{\ell }\) is calculated. If the test (2.6) is satisfied, or if \(\underline{p}_{\ell +1}\) is a feasible point on the trust region boundary, or if \(\underline{p}_{\ell +1}\) is any feasible point with \(\ell = n - | \mathcal{A}|\), then the conjugate gradient steps for the current iteration number k are complete, the value \(\underline{x}_k^+ = \underline{p}_{\ell +1}\) being chosen, except that \(\underline{x}_k^+ = \underline{p}_{\mathrm{new}}\) is preferred if \(\underline{p}_{\ell +1}\) is infeasible, as suggested in the first paragraph of this section. Another possibility is that \(\underline{p}_{\ell +1}\) is a feasible point that is strictly inside the trust region with \(\ell < n - | \mathcal{A}|\). Then \(\ell \) is increased by one in order to continue the conjugate gradient steps for the current \(\mathcal{A}\). In all other cases, \(\underline{p}_{\ell +1}\) is infeasible, and we let \(\underline{p}_{\mathrm{new}}\) be the point (5.1).
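
The loop just described can be condensed into a short sketch. The Python below is our own schematic version, assuming a fixed active set, a symmetric \(H_k\) and the projection matrix \(P = \check{Q}\check{Q}^T\); the ratio test of Eq. (5.1), the updating of \(\mathcal{A}\) and the exact forms of the tests (2.4)–(2.6) are all omitted.

```python
# Schematic projected truncated conjugate gradients for the model (1.2).
import numpy as np

def projected_tcg(g_k, H, P, x_k, Delta, max_steps=50):
    p = x_k.copy()
    d_prev = None
    for _ in range(max_steps):
        grad = g_k + H @ (p - x_k)          # gradient of Q_k at p
        d = -P @ grad                       # projected steepest descent, Eq. (5.7)
        if d_prev is not None:
            # enforce the conjugacy d^T H d_prev = 0 of Eq. (5.5)
            d -= (d @ H @ d_prev) / (d_prev @ H @ d_prev) * d_prev
        gd = grad @ d
        if gd >= 0.0:                       # no descent: analogue of test (2.4)
            break
        s = p - x_k                         # largest step inside the trust region
        a, b, c = d @ d, 2.0 * (s @ d), s @ s - Delta**2
        alpha_hat = (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
        dHd = d @ H @ d
        alpha = alpha_hat if dHd <= 0.0 else min(alpha_hat, -gd / dHd)
        p = p + alpha * d
        if np.isclose(np.linalg.norm(p - x_k), Delta):
            break                           # boundary reached: terminate
        d_prev = d
    return p
```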

In these other cases, a choice is made between ending the conjugate gradient steps with \(\underline{x}_k^+ = \underline{p}_{\mathrm{new}}\), or generating a new active set at \(\underline{p}_{\mathrm{new}}\). We recall from the last paragraph of Sect. 3 that the LINCOA choice is to update \(\mathcal{A}\) if and only if the distance from \(\underline{p}_{\mathrm{new}}\) to the trust region boundary is at least \(\eta _2 \Delta _k\). Furthermore, after using the notation \(\underline{p}_1 = \underline{x}_k\) at the beginning of the k-th iteration as in Sect. 2, we now revise the meaning of \(\underline{p}_1\) to \(\underline{p}_1 = \underline{p}_{\mathrm{new}}\) whenever a new active set is constructed at \(\underline{p}_{\mathrm{new}}\). Thus the description in this section of the truncated conjugate gradient method is valid for every \(\mathcal{A}\).

The number of conjugate gradient steps is finite for each \(\mathcal{A}\), even if we allow \(\ell \) to exceed \(n - | \mathcal{A}|\). Indeed, failure of the termination test (2.6) gives the condition

$$\begin{aligned} Q_k ( \underline{x}_k ) - Q_k \left( \underline{p}_{\ell +1} \right) \,>\, \left\{ Q_k ( \underline{x}_k ) - Q_k ( \underline{p}_{\ell } ) \right\} /( 1 - \eta _1 ), \end{aligned}$$
(5.9)

which cannot happen an infinite number of times as \(Q_k ( \underline{x}), \Vert \underline{x}- \underline{x}_k \Vert \le \Delta _k\), is bounded below. Also, the number of new active sets is finite for each iteration number k, which can be proved in the following way.

Let \(\underline{p}_1\) be different from \(\underline{x}_k\) due to a previous change to the active set, and let the work with the current \(\mathcal{A}\) be followed by the construction of another active set at the point (5.1) with \(\ell = \ell ^*\), say. Then condition (5.9) and \(Q_k ( \underline{p}_{\mathrm{new}}) \le Q_k ( \underline{p}_{\ell ^*} )\) provide the bound

$$\begin{aligned} Q_k ( \underline{x}_k ) - Q_k \left( \underline{p}_{\mathrm{new}}\right) \,\ge \, \left\{ Q_k ( \underline{x}_k ) - Q_k ( \underline{p}_1 ) \right\} /( 1 - \eta _1 )^{\ell ^*-1}, \end{aligned}$$
(5.10)

which is helpful to our proof for \(\ell ^* \ge 2\), because \(0 < \eta _1 < 1, 0 < \eta _2 < 1\) and \(\ell ^* \ge 2\) imply \(1 / ( 1 - \eta _1 )^{\ell ^*-1} > ( 1 + \frac{1}{4} \eta _1 \eta _2 )\). In the alternative case \(\ell ^* = 1\), we recall that the choice of \(\mathcal{A}\) gives the search direction \(\underline{d}_1\) the property \(\underline{a}_j^{T} \underline{d}_1 \le 0, j \in \mathcal{J}( \underline{p}_1 )\), where \(\mathcal{J}( \underline{x}), \underline{x}\in \mathcal{R}^n\), is the set (3.3). Thus the infeasibility of \(\underline{p}_2\) and Eq. (5.1) yield \(\Vert \underline{p}_{\mathrm{new}}- \underline{p}_1 \Vert > \eta _2 \Delta _k\). Moreover, the failure of the termination condition (2.5) for \(\ell = 1\), together with \(\Vert \hat{\alpha }_{1\,} \underline{d}_1 \Vert < 2 \Delta _k\), supplies the inequality

$$\begin{aligned} \left| \left( \underline{p}_{\mathrm{new}}\!- \underline{p}_1 \right) ^T \underline{\nabla }Q_k ( \underline{p}_1 ) \right|= & {} \left| \hat{\alpha }_{1\,} \underline{d}_1^{\,T} \underline{\nabla }Q_k ( \underline{p}_1 ) \right| \, \left\| \underline{p}_{\mathrm{new}}\!- \underline{p}_1 \right\| /\left\| \hat{\alpha }_{1\,} \underline{d}_1 \right\| \nonumber \\> & {} \eta _{1\,} \left\{ Q_k ( \underline{x}_k ) - Q_k ( \underline{p}_1 ) \right\} \, \left\| \underline{p}_{\mathrm{new}}\!- \underline{p}_1 \right\| /( 2_{\,} \Delta _k ), \end{aligned}$$
(5.11)

the first line being true because \(\underline{p}_{\mathrm{new}}- \underline{p}_1\) is a multiple of \(\underline{d}_1\). Now the function \(\phi ( \alpha ) = Q_k ( \underline{p}_1 + \alpha \{ \underline{p}_{\mathrm{new}}- \underline{p}_1 \} ), 0 \le \alpha \le 1\), is a quadratic that decreases monotonically, which implies \(\phi (0) - \phi (1) \ge \frac{1}{2} | \phi ^{\prime } (0) |\), and \(| \phi ^{\prime } (0) |\) is the left hand side of expression (5.11). Thus, remembering \(\Vert \underline{p}_{\mathrm{new}}- \underline{p}_1 \Vert > \eta _2 \Delta _k\), we deduce the property

$$\begin{aligned} \phi (0) - \phi (1) \,=\, Q_k ( \underline{p}_1 ) - Q_k ( \underline{p}_{\mathrm{new}}) \,>\, {\frac{1}{4}}_{\,} \eta _{1\,} \eta _{2\,} \left\{ Q_k ( \underline{x}_k ) - Q_k ( \underline{p}_1 ) \right\} , \end{aligned}$$
(5.12)

which gives the bound

$$\begin{aligned} Q_k ( \underline{x}_k ) - Q_k ( \underline{p}_{\mathrm{new}}) \,>\, \left( 1 + {\frac{1}{4}}_{\,} \eta _{1\,} \eta _2 \right) \, \left\{ Q_k ( \underline{x}_k ) - Q_k ( \underline{p}_1 ) \right\} \end{aligned}$$
(5.13)

in the case \(\ell ^* = 1\). It follows from expressions (5.10) and (5.13) that, whenever consecutive changes to the active set occur, the new value of \(Q_k ( \underline{x}_k ) - Q_k ( \underline{p}_1 )\) is greater than the old one multiplied by \(( 1 + \frac{1}{4} \eta _1 \eta _2 )\). Again, due to the boundedness of \(Q_k ( \underline{x}), \Vert \underline{x}- \underline{x}_k \Vert \le \Delta _k\), this cannot happen an infinite number of times, so the number of changes to the active set is finite for each k.

An extension of the given techniques for truncated conjugate gradients subject to linear constraints is included in LINCOA, which introduces a tendency for \(\underline{x}_k^+\) to be on the boundaries of the active constraints. Without the extension, more changes than necessary are often made to the active set, in the following way. Let the j-th constraint be in the current \(\mathcal{A}\), which demands the condition

$$\begin{aligned} b_j - \underline{a}_j^T \underline{p}_1 \,\le \, \eta _{2\,} \Delta _{k\,} \Vert \underline{a}_j \Vert \end{aligned}$$
(5.14)

for the current \(\Delta _k\), let \(\underline{p}_{\mathrm{curr}}\) be the current \(\underline{p}_1\), and let \(( b_j - \underline{a}_j^T \underline{p}_{\mathrm{curr}}) / ( \eta _2 \Vert \underline{a}_j \Vert )\) be greater than the final trust region radius. Further, let the directions of \(\underline{a}_j\) and \(\underline{\nabla }F ( \underline{x})\) for every relevant \(\underline{x}\) be such that it would be helpful to keep the j-th constraint active for the remainder of the calculation. The j-th constraint cannot be in the final active set, however, unless changes to \(\underline{p}_1\) cause \(( b_j - \underline{a}_j^T \underline{p}_1 ) / ( \eta _2 \Vert \underline{a}_j \Vert )\) to be at most the final value of \(\Delta _k\). On the other hand, while j is in \(\mathcal{A}\), condition (3.2) shows that all the conjugate gradient steps give \(b_j - \underline{a}_j^T \underline{p}_{\ell +1} = b_j - \underline{a}_j^T \underline{p}_{\ell }\). Thus the index j may remain in \(\mathcal{A}\) until \(\Delta _k\) becomes less than \(( b_j - \underline{a}_j^T \underline{p}_{\mathrm{curr}}) / ( \eta _2 \Vert \underline{a}_j \Vert )\). Then j has to be removed from \(\mathcal{A}\), which allows the conjugate gradient method to generate a new \(\underline{p}_1\) that supplies the reduction \(b_j - \underline{a}_j^T \underline{p}_1 < b_j - \underline{a}_j^T \underline{p}_{\mathrm{curr}}\). If j is the index of a linear constraint that is important to the final vector of variables, however, then j will be reinstated in \(\mathcal{A}\) by yet another change to the active set.

The projected steepest descent direction \(-\check{Q}\check{Q}^T \underline{\nabla }Q_k ( \underline{p}_1 )\) is calculated as before when the new active set \(\mathcal{A}\) is constructed at \(\underline{p}_1\), but now we call this direction \(\underline{d}_{{\, \mathrm{old}}}\), because the main feature of the extension is that it picks a new direction \(\underline{d}_1\) for the formula \(\underline{p}_2 = \underline{p}_1 + \alpha _{1\,} \underline{d}_1\), in order that the residuals of the active constraints can be reduced by the move from \(\underline{p}_1\) to \(\underline{p}_2\). We still require \(\underline{d}_{{\, \mathrm{old}}}\) to be a direction of descent, so the conjugate gradient steps are terminated by setting \(\underline{x}_k^+ = \underline{p}_1\) if \(\underline{d}_ {{\, \mathrm{old}}}^{\,T} \underline{\nabla }Q_k ( \underline{p}_1 ) \ge 0\) occurs, which is the test (2.4). Also, if all the residuals \(b_j - \underline{a}_j^T \underline{p}_1, j \in \mathcal{A}\), are sufficiently small, which means in LINCOA that they satisfy the bounds

$$\begin{aligned} b_j - \underline{a}_j^T \underline{p}_1 \,\le \, 10^{-4\,} \Delta _k\, \Vert \underline{a}_j \Vert , \quad j \in \mathcal{A}, \end{aligned}$$
(5.15)

then the conjugate gradient steps from \(\underline{p}_1\) are as before, with \(\underline{d}_1 = \underline{d}_{{\, \mathrm{old}}}\).

In all other cases, the extension begins by calculating the shortest step from \(\underline{p}_1\) to the boundaries of the active constraints. This step is the vector \(\underline{d}_{{\, \mathrm{perp}}}\), say, in the column space of A that satisfies \(A^{T} \underline{d}_{{\, \mathrm{perp}}} = \underline{r}\), where A has the columns \(\underline{a}_j, j \in \mathcal{A}\), and where \(\underline{r}\) has the components \(b_j - \underline{a}_j^T \underline{p}_1, j \in \mathcal{A}\). We recall from near the middle of Sect. 3 that the QR factorization \(A = \widehat{Q}R\) is available, which assists the construction of \(\underline{d}_{{\, \mathrm{perp}}}\). Indeed, it has the form \(\underline{d}_{{\, \mathrm{perp}}} = \widehat{Q}\underline{s}\), and \(\underline{s}\) is defined by the equations

$$\begin{aligned} A^{T} \underline{d}_{{\, \mathrm{perp}}} \,=\, ( \widehat{Q}R )^T \widehat{Q}_{\,} \underline{s}\,=\, R^{\,T\!} \underline{s}\,=\, \underline{r}, \end{aligned}$$
(5.16)

so \(\underline{s}\) can be calculated easily, using the triangularity and nonsingularity of R. The new search direction \(\underline{d}_1\) of the extension is a nonnegative linear combination of \(\underline{d}_{{\, \mathrm{old}}}\) and \(\underline{d}_{{\, \mathrm{perp}}}\), as explained below.
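
A schematic version of this calculation follows, assuming that A stores the active normals \(\underline{a}_j\) as columns and has full column rank; the sketch calls a general solver for brevity, whereas LINCOA exploits the triangularity of R.

```python
# Sketch of Eq. (5.16): d_perp = Qhat s, where R^T s = r.
import numpy as np

def d_perp_from_qr(A, r):
    Qhat, R = np.linalg.qr(A)               # A = Qhat R, as in Sect. 3
    s = np.linalg.solve(R.T, r)             # R^T s = r; R is upper triangular
    return Qhat @ s

rng = np.random.default_rng(1)              # quick check on random data
A, r = rng.standard_normal((6, 3)), rng.random(3)
print(np.allclose(A.T @ d_perp_from_qr(A, r), r))     # True
```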

The choice \(\underline{p}_2 = \underline{p}_1 + \underline{d}_1\), where \(\underline{d}_1\) is the direction

$$\begin{aligned} \underline{d}_1 \,=\, \eta _2\, \Delta _k\, \underline{d}_{{\, \mathrm{old}}}/\Vert \underline{d}_{{\, \mathrm{old}}} \Vert \,+\, \theta _{\,} \underline{d}_{{\, \mathrm{perp}}} \end{aligned}$$
(5.17)

with \(\theta = 1\), would bring the advantages that the length of the step from \(\underline{p}_1\) to \(\underline{p}_2\) is greater than \(\eta _{2\,} \Delta _k\), and \(\underline{p}_2\) would be on the boundaries of all the active constraints. This point \(\underline{p}_2\), however, can be infeasible and can violate the trust region bound, although \(\underline{p}_1 + \underline{d}_1\) would satisfy all the constraints in the case \(\theta = 0\). Therefore the extension picks the direction (5.17), after setting \(\theta \) to the greatest number in [0, 1] such that all the constraints hold at \(\underline{p}_1 + \underline{d}_1\). The value \(\theta = 0\) may occur, and then \(\underline{d}_1\) is a multiple of \(\underline{d}_{{\, \mathrm{old}}}\), so we would generate the sequence of steps from \(\underline{p}_1\) by the conjugate gradient method without the extension.

When \(\theta \) is positive in formula (5.17), we pick \(\underline{p}_2 = \underline{p}_1 + \alpha _{1\,} \underline{d}_1\), where \(\alpha _1\) is the value of \(\alpha \) that minimizes \(Q_k ( \underline{p}_1 + \alpha _{\,} \underline{d}_1 ), 0 \le \alpha \le 1\). The point \(\underline{p}_2\) is feasible due to the choice of \(\theta \), but, for every \(\alpha > 1\), at least one constraint is violated at \(\underline{p}_1 + \alpha _{\,} \underline{d}_1\), in the usual case when \(\underline{p}_1 + \underline{d}_1\) is on the boundary of a constraint that is satisfied as a strict inequality at \(\underline{p}_1\). Our step from \(\underline{p}_1\) to \(\underline{p}_2\) achieves the reductions

$$\begin{aligned} b_j - \underline{a}_j^T \underline{p}_2 \,=\, ( 1 - \alpha _{1\,} \theta )\, \left( b_j - \underline{a}_j^T \underline{p}_1 \right) , \quad j \in \mathcal{A}, \end{aligned}$$
(5.18)

in the residuals of the active constraints.

The following argument shows that \(\alpha _1\) is positive in Eq. (5.18). We have to rule out \(\alpha _1 = 0\), so it is sufficient to establish \(\underline{d}_1^{\,T} \underline{\nabla }Q_k ( \underline{p}_1 ) < 0\), and we recall that termination with \(\underline{x}_k^+ = \underline{p}_1\) would have happened earlier in the case \(\underline{d}_{{\, \mathrm{old}}}^{\,T} \underline{\nabla }Q_k ( \underline{p}_1 ) \ge 0\). Hence, because of the choice (5.17), it is sufficient to prove \(\underline{d}_{{\, \mathrm{perp}}}^{\,T} \underline{\nabla }Q_k ( \underline{p}_1 ) \le 0\). We find in the paragraph that includes expressions (3.4)–(3.6) that \(\underline{d}_{{\, \mathrm{old}}}\) is the vector \(\underline{d}\) that minimizes \(\Vert \underline{\nabla }Q_k ( \underline{p}_1 ) + \underline{d}\Vert \) subject to \(\underline{a}_j^{T} \underline{d}\le 0, j \in \mathcal{A}\), and the first order conditions at the solution of this quadratic programming problem provide the relation

$$\begin{aligned} \underline{d}_{{\, \mathrm{old}}} \,=\, - \underline{\nabla }Q_k ( \underline{p}_1 ) + \mathop {\sum }\limits _{j \in \mathcal{A}}\, {\uplambda }_{j\,} \underline{a}_j, \end{aligned}$$
(5.19)

for some multipliers \({\uplambda }_j\) that are all nonpositive. Moreover, \(\underline{d}_ {{\, \mathrm{perp}}}\) is derived from the equations \(\underline{a}_j^{T} \underline{d}_{{\, \mathrm{perp}}} = r_j, j \in \mathcal{A}\), each \(r_j = b_j - \underline{a}_j^T \underline{p}_1\) being nonnegative because \(\underline{p}_1\) is feasible. Thus, remembering that \(\underline{d}_{{\, \mathrm{perp}}}\) is orthogonal to \(\underline{d}_{{\, \mathrm{old}}}\), we deduce the required inequality

$$\begin{aligned} \underline{d}_{{\, \mathrm{perp}}\,}^{\,T} \underline{\nabla }Q_k ( \underline{p}_1 ) \,=\, \underline{d}_{{\, \mathrm{perp}}}^{\,T} \left\{ -\underline{d}_{{\, \mathrm{old}}} + \mathop {\sum }\limits _{j \in \mathcal{A}} \, {\uplambda }_{j\,} \underline{a}_j \right\} \,=\, \mathop {\sum }\limits _{j \in \mathcal{A}} \, {\uplambda }_{j\,} r_j \,\le \, 0. \end{aligned}$$
(5.20)

We now know from the property (5.18) that, with the extension, the step from \(\underline{p}_1\) to \(\underline{p}_2\) achieves a strict reduction in every positive residual of an active constraint. Termination with \(\underline{x}_k^+ = \underline{p}_2\) is suitable if \(Q_k ( \underline{p}_1 + \alpha _{1\,} \underline{d}_1 )\) is the least value of \(Q_k ( \underline{p}_1 + \alpha _{\,} \underline{d}_1 ), \alpha \ge 0\), and if condition (2.6) holds for \(\ell = 1\). Otherwise, a search direction \(\underline{d}_2\) is chosen for a step from \(\underline{p}_2\) to \(\underline{p}_3\). The constraints (3.2) with \(\ell = 2\) require \(\underline{d}_2\) to be in the column space of \(\check{Q}\), but the direction (5.17) is not in this space for \(\theta > 0\). Therefore \(\beta _2\) has to be zero in formula (5.7), giving \(\underline{d}_2 = -\check{Q}\check{Q}^T \underline{\nabla }Q_k ( \underline{p}_2 )\). It is convenient to relabel \(\underline{p}_2\) as the new \(\underline{p}_1\) and to call this direction \(\underline{d}_1 = -\check{Q}\check{Q}^T \underline{\nabla }Q_k ( \underline{p}_1 )\), without making a further change to \(\mathcal{A}\). Thus the given description of the truncated conjugate gradient method for the current \(\mathcal{A}\) becomes valid for the continuation of the method. Formula (5.7) still provides search directions that are mutually conjugate, because the new \(\underline{d}_1\) is a projected steepest descent direction.

6 Moves round the trust region boundary

The conjugate gradient method of Sect. 5 terminates at \(\underline{x}_k^+ = \underline{p}_{\ell +1}\) if \(\underline{p}_{\ell +1}\) is a feasible point on the boundary of the trust region, but usually a move round the boundary can generate another feasible point, \(\underline{x}_k^{++}\) say, that provides the strict reduction \(Q_k ( \underline{x}_k^{++} ) < Q_k ( \underline{x}_k^+ )\). In the case (4.19)–(4.20), for example, the conjugate gradient method yields \(\underline{x}_k^+ = (3,4,1)^T\) with \(Q_k ( \underline{x}_k^+ ) = -19.4\), although the point \(\underline{x}_k^{++} = (5,0,1)^T\) is feasible with \(Q_k ( \underline{x}_k^{++} ) = -25\), as mentioned at the end of Sect. 4. The early versions of the LINCOA software included an extension to the conjugate gradient method that seeks reductions in \(Q_k ( \cdot )\) by searching round the trust region boundary, which is described briefly below. Now, however, LINCOA has been made simpler by the removal of the extension, because of some numerical results that are also given in this section.

A move from \(\underline{x}_k^+\) to \(\underline{x}_k^{++}\) round the trust region boundary is made by the early versions of LINCOA only if the Taylor series linear function

$$\begin{aligned} \Lambda _k^+ ( \underline{x}) \,=\, Q_k \left( \underline{x}_k^+ \right) + \left( \underline{x}- \underline{x}_k^+ \right) ^{T\,} \underline{\nabla }Q_k \left( \underline{x}_k^+ \right) , \quad \underline{x}\in \mathcal{R}^n, \end{aligned}$$
(6.1)

suggests that \(Q_k ( \underline{x}_k^{++} )\) is going to be substantially less than \(Q_k ( \underline{x}_k^+ )\). Indeed, letting \(\widehat{\underline{x}}_k^{++}\) be the vector \(\underline{x}\) that minimizes \(\Lambda _k^+ ( \underline{x})\) subject to \(\underline{a}_j^T ( \underline{x}- \underline{x}_k^+ ) = 0, j \in \mathcal{A}\), and to \(\Vert \underline{x}- \underline{x}_k \Vert \le \Delta _k\), the search for a relatively small value of \(Q_k ( \cdot )\) is terminated at \(\underline{x}_k^+\) if the inequality

$$\begin{aligned} \Lambda _k^+ \left( \underline{x}_k^+ \right) - \Lambda _k^+ \left( \widehat{\underline{x}}_k^{++} \right) \,=\, \left( \underline{x}_k^+ - \widehat{\underline{x}}_k^{++} \right) ^T\, \underline{\nabla }Q_k \left( \underline{x}_k^+ \right) \,\le \, \eta _{1\,} \left\{ Q_k ( \underline{x}_k ) - Q_k \left( \underline{x}_k^+ \right) \right\} \nonumber \\ \end{aligned}$$
(6.2)

holds, which corresponds to the test (2.17). The vector \(\widehat{\underline{x}}_k^{++}\) is not calculated explicitly, however, because, using \(\Vert \underline{x}_k^+ - \underline{x}_k \Vert = \Delta _k\), it can be shown that the reduction \(\Lambda _k^+ ( \underline{x}_k^+ ) - \Lambda _k^+ ( \widehat{\underline{x}}_k^{++} )\) is the sum

$$\begin{aligned} \left\| \check{Q}^{\,T} \left( \underline{x}_k^+ - \underline{x}_k \right) \right\| \, \left\| \check{Q}^{\,T\,} \underline{\nabla }Q_k \left( \underline{x}_k^+ \right) \right\| + \left\{ \check{Q}^{\,T} \left( \underline{x}_k^+ - \underline{x}_k \right) \right\} ^{T\,} \check{Q}^{\,T\,} \underline{\nabla }Q_k \left( \underline{x}_k^+ \right) .\quad \end{aligned}$$
(6.3)

Thus condition (6.2) causes termination if the vectors \(\check{Q}^T ( \underline{x}_k^+ - \underline{x}_k )\) and \(\check{Q}^T \underline{\nabla }Q_k ( \underline{x}_k^+ )\) are parallel with opposite signs. Also, a difficulty is avoided by termination in the highly unusual case when these two vectors are parallel (or nearly parallel) with the same sign. Specifically, there is a search round the trust region boundary from \(\underline{x}_k^+\) only if \(\underline{x}_k^+\) has the property

$$\begin{aligned}&\left\| \check{Q}^{\,T} \left( \underline{x}_k^+ - \underline{x}_k \right) \right\| \, \left\| \check{Q}^{\,T\,} \underline{\nabla }Q_k \left( \underline{x}_k^+ \right) \right\| - \left| \left\{ \check{Q}^{\,T} \left( \underline{x}_k^+ - \underline{x}_k \right) \right\} ^{T\,} \check{Q}^{\,T\,} \underline{\nabla }Q_k \left( \underline{x}_k^+ \right) \right| \nonumber \\&\quad > \eta _{1\,} \left\{ Q_k ( \underline{x}_k ) - Q_k \left( \underline{x}_k^+ \right) \right\} \!. \end{aligned}$$
(6.4)

When condition (6.4) holds, the vectors \(\check{Q}\check{Q}^T ( \underline{x}_k^+ - \underline{x}_k)\) and \(\check{Q}\check{Q}^{T\,} \underline{\nabla }Q_k ( \underline{x}_k^+ )\) are linearly independent, and we let \(\mathcal{S}\!\subset \! \mathcal{R}^n\) be the two dimensional linear space that is spanned by them. Each move \(\underline{x}_k^{++} - \underline{x}_k^+\) is chosen to be in \(\mathcal{S}\), which is reasonable because \(\widehat{\underline{x}}_k^{++} - \underline{x}_k^+\) is in \(\mathcal{S}\). The constraint \(\Vert \underline{x}_k^{++} - \underline{x}_k \Vert = \Delta _k\) is achieved by giving \(\underline{x}_k^{++} - \underline{x}_k\) the form

$$\begin{aligned} \underline{s}( \theta ) \,=\, ( \cos \theta - 1 )\, \check{Q}_{\,} \check{Q}^{\,T} \left( \underline{x}_k^+ \!- \underline{x}_k \right) + \sin \theta \, \underline{v}, \quad \theta \in \mathcal{R}, \end{aligned}$$
(6.5)

where \(\underline{v}\) is the vector in \(\mathcal{S}\) that is orthogonal to \(\check{Q}\check{Q}^T ( \underline{x}_k^+ - \underline{x}_k )\), and that satisfies \(\Vert \underline{v}\Vert = \Vert \check{Q}\check{Q}^T ( \underline{x}_k^+ - \underline{x}_k ) \Vert \) and \(\underline{v}^T \underline{\nabla }Q_k ( \underline{x}_k^+ ) < 0\). The value of \(\theta \) is calculated by seeking the first local minimum of the function

$$\begin{aligned} \phi ( \theta ) \,=\, Q_k \left( \underline{x}_k^+ \!+ \underline{s}( \theta )\right) , \quad \theta \ge 0, \end{aligned}$$
(6.6)

subject to the feasibility of \(\underline{x}_k^+ + \underline{s}( \theta )\). The inequality \(\underline{v}^T \underline{\nabla }Q_k ( \underline{x}_k^+ ) < 0\) supplies \(\phi ^{\prime } (0) < 0\).

Equations (1.2) and (6.5) show that the function (6.6) is a trigonometric polynomial of degree two. The coefficients of this polynomial are generated, the amount of work when n is large being dominated by the need for the product \(H_k \underline{v}\), the product \(H_k \check{Q}\check{Q}^T ( \underline{x}_k^+ - \underline{x}_k )\) being available. We calculate an estimate, \(\theta ^*\) say, of the least positive value of \(\theta \) that satisfies \(\phi ^{\prime } ( \theta ) = 0\), the relative error of the estimate being at most 0.01. By considering every inactive constraint whose boundary is within distance \(\Delta _k\) of \(\underline{x}_k\), the value of \(\theta ^*\) is reduced if necessary so that all the points \(\underline{x}_k^+ + \underline{s}( \theta ), 0 \le \theta \le \theta ^*\), are feasible. Then we make the choice \(\underline{x}_k^{++} = \underline{x}_k^+ + \underline{s}( \theta ^* )\).
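
The fragment below illustrates the search in \(\theta \) by brute force, assuming Q is the model, u is \(\check{Q}\check{Q}^T ( \underline{x}_k^+ - \underline{x}_k )\) and v is the orthogonal vector of expression (6.5); it locates the first positive local minimum of the function (6.6) by sampling, whereas LINCOA works with the five coefficients of the trigonometric polynomial and then imposes feasibility on \(\theta ^*\).

```python
# A crude estimate of theta* for the boundary search (6.5)-(6.6).
import numpy as np

def first_local_min(Q, x_plus, u, v, samples=1000):
    thetas = np.linspace(0.0, 2.0 * np.pi, samples)
    move = lambda t: (np.cos(t) - 1.0) * u + np.sin(t) * v    # Eq. (6.5)
    phi = np.array([Q(x_plus + move(t)) for t in thetas])     # Eq. (6.6)
    for i in range(1, samples - 1):
        if phi[i] <= phi[i - 1] and phi[i] <= phi[i + 1]:
            return thetas[i]                # first interior local minimum
    return thetas[-1]
```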

We take the view that \(\underline{x}_k^{++}\) is a new vector \(\underline{x}_k^+\). There are no more searches round the trust region boundary for the current k if an inactive constraint causes a decrease in \(\theta ^*\) in the previous paragraph, or if the change above to \(\underline{x}_k^+\) reduces \(Q_k ( \underline{x}_k^+ )\) by an amount that is at most the new value of \(\eta _1 \{ Q_k ( \underline{x}_k ) - Q_k ( \underline{x}_k^+ ) \}\), which corresponds to the test (2.6), or if condition (6.4) fails for the new \(\underline{x}_k^+\). Otherwise, another move is made from \(\underline{x}_k^+\) to a new \(\underline{x}_k^{++}\) in the way that has been described already. This procedure is continued until termination occurs.

The success of searches round the trust region boundary was investigated by numerical experiments, including the application of versions of LINCOA to the following problem. We let n be even, and, for each \(\underline{x}\in \mathcal{R}^n\), we set the points

$$\begin{aligned} \underline{p}_i \,=\, \left( \begin{array}{c} x_{2i-1} \\ x_{2i} \end{array} \right) \in \mathcal{R}^2, \quad i = 1,2, \ldots , n/2. \end{aligned}$$
(6.7)

We seek the least value of the function

$$\begin{aligned} F ( \underline{x}) \,=\, n^{-2}\, \mathop {\sum }\limits _{i=2}^{n/2}\, \mathop {\sum }\limits _{j=1}^{i-1}\, \min _{\,} \left[ _{\,} \left\| \underline{p}_i \!- \underline{p}_j \right\| ^{-1\!} ,_{\,} 10^{3\,} \right] , \quad \underline{x}\in \mathcal{R}^n, \end{aligned}$$
(6.8)

subject to 3n / 2 linear constraints, namely that every \(\underline{p}_i\) is in the triangle with the vertices (0, 0), (2, 0) and (0, 2). The initial positions of the points are chosen randomly within the triangle. For example, the left hand and middle parts of Fig. 2 show the initial random positions and the final calculated positions of the points in a case with \(n = 80\), while the right hand part of Fig. 2 shows calculated positions for a different random start. Both sets of final points satisfy to high accuracy the first order conditions for the solution of the test problem, but the numbers of final points that are strictly inside the triangle are different, the two final values of the objective function being \(F = 0.15626737\) and \(F = 0.15603890\). This test problem has several local minima, and LINCOA tries to find only one of them.

Fig. 2 The points in triangle problem with \(n = 80\)
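
For completeness, here is a direct transcription of the test problem into Python; the random start below is only illustrative, not the generator used for the experiments of Tables 1 and 2.

```python
# The points in triangle problem, Eqs. (6.7)-(6.8).
import numpy as np

def F(x):
    pts = x.reshape(-1, 2)                  # p_i = (x_{2i-1}, x_{2i}), Eq. (6.7)
    total = 0.0
    for i in range(1, len(pts)):
        for j in range(i):
            total += min(1.0 / np.linalg.norm(pts[i] - pts[j]), 1.0e3)
    return total / len(x)**2                # the objective (6.8)

# Each p_i lies in the triangle with vertices (0,0), (2,0) and (0,2), that is
# x_{2i-1} >= 0, x_{2i} >= 0 and x_{2i-1} + x_{2i} <= 2: 3n/2 constraints.
rng = np.random.default_rng(0)
pts = []
while len(pts) < 40:                        # n = 80 variables
    p = 2.0 * rng.random(2)
    if p.sum() <= 2.0:
        pts.append(p)
print(F(np.concatenate(pts)))
```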

LINCOA requires not only an initial vector of variables but also the initial and final values of \(\Delta _k\), which are set to 0.1 and \(10^{-6}\) in these calculations. The value of NPT is also required, which is the number of interpolation conditions satisfied by each approximation \(Q_k ( \underline{x}) \approx F ( \underline{x}), \underline{x}\in \mathcal{R}^n\). The amount of routine work for each k is of magnitude NPT squared, due to a linear system of equations that supplies each new quadratic model, so it is helpful for NPT to be of magnitude n when n is large. The values \(\text{ NPT } = n + 6\) and \(\text{ NPT } = 2n + 1\) are compared in the numerical results of this section.

Tables 1 and 2 give these results for \(n = 10 \times 2^{\ell }, \ell = 0,1,2,3,4,5\). There are five cases for each n, the cases being distinguished by different random choices of the initial vector of variables. For each application of LINCOA, we let #F be the number of calculations of the objective function, and we let #TRI (Trust Region Iterations) be the number of iterations that construct \(\underline{x}_k^+\) by the truncated conjugate gradient method of Sect. 5, with or without searches round the trust region boundary. The second and third columns of the tables show the averages to the nearest integer of #F and #TRI over the five cases for each n. We recall that every step in the construction of \(\underline{x}_k^+\) requires a vector to be multiplied by the matrix \(H_k = \nabla ^2 Q_k\), which is the most expensive part of the step. For each k, the number of multiplications is in the range [1, 3], [4, 10] or \([11, \infty )\). The number of occurrences of each range is counted for each test problem, the sum of these numbers being #TRI. These counts are also averaged to the nearest integer over the five cases for each n, the results being shown in the last three columns of the tables. Two versions of LINCOA are employed, the first one being the current version that is without searches round the trust region boundary, and the second one being the extension of the current version that includes the boundary searches and termination conditions of this section. The main entries in the tables have the form p/q, where p and q are the averages for the versions of LINCOA without and with boundary searches, respectively. Good accuracy is achieved throughout the experiments of Tables 1 and 2, the greatest residual of a KKT first order condition at a final vector of variables being about \(3 \times 10^{-5}\).

Table 1 The points in triangle problem with \(\text{ NPT } = n + 6\)
Table 2 The points in triangle problem with \(\text{ NPT } = 2n + 1\)

The results in the last rows of both tables are highly unfavourable to searches round the trust region boundary. We find in the \(n = 320\) row of Table 1 that the extra work of the searches causes #F to become worse by 27%, while the 0.7% improvement in #F in the \(n = 320\) row of Table 2 is very expensive. Indeed, although the conjugate gradient method is truncated after at most 3 steps in 39,116 out of 48,483 applications, the method with boundary searches requires more than 10 multiplications of a vector by \(H_k = \nabla ^2 Q_k\) in about half of its 49,428 applications. Furthermore, the boundary searches in the \(n = 80\) and \(n = 160\) rows of Table 2 also take much more effort, and they cause #F to increase by about 7% and 35%, respectively. Table 2 is more relevant than Table 1 for \(n \ge 80\) because its values of #F are smaller. Moreover, the #F entries in the \(n = 40\) rows of both tables suggest strongly that boundary searches are unhelpful. We give less attention to smaller values of n, because LINCOA is designed to be particularly useful for minimization without derivatives when there are hundreds of variables, by taking advantage of the discovery that the symmetric Broyden updating method makes such calculations possible. Thus Tables 1 and 2 provide excellent support for the decision, taken in 2013, to terminate the calculation of \(\underline{x}_k^+\) by LINCOA when the steps of the conjugate gradient method reach the trust region boundary.

7 Further remarks and discussion

We begin our conclusions by noting that, when there are no linear constraints on the variables, the Krylov subspace method provides searches round the boundary of the trust region that compare favourably with the searches of Sect. 6. Let \(\widehat{\underline{p}}_{\ell +1} \in \mathcal{R}^n\) and \(\check{\underline{p}}_{\ell +1} \in \mathcal{R}^n\) be the points that are generated by the \(\ell \)-th step of the Krylov method and by the \(\ell \)-th step of the conjugate gradient method augmented by boundary searches, respectively, starting at the trust region centre \(\widehat{\underline{p}}_1 = \check{\underline{p}}_1 = \underline{x}_k\), but the greatest values of \(\ell \) for the two methods may be different. Because \(\check{\underline{p}}_ {\ell +1} - \underline{x}_k\) is in the linear space spanned by the gradients \(\underline{\nabla }Q_k ( \check{\underline{p}}_j), j = 1,2, \ldots , \ell \), it is also in the Krylov subspace \(\mathcal{K}_{\ell }\), defined in the complete paragraph between expressions (2.8) and (2.9), even if the dimension of \(\mathcal{K}_{\ell }\) is less than \(\ell \). Also the choice of \(\widehat{\underline{p}}_ {\ell +1}\) by the \(\ell \)-th step of the Krylov method satisfies \(\widehat{\underline{p}}_{\ell +1} - \underline{x}_k \in \mathcal{K}_{\ell }\), with the additional property that \(Q_k ( \widehat{\underline{p}}_{\ell +1} )\) is the least value of \(Q_k ( \underline{x}), \Vert \underline{x}- \underline{x}_k \Vert \le \Delta _k\), subject to \(\underline{x}- \underline{x}_k \in \mathcal{K}_{\ell }\). Thus the Krylov method enjoys the advantage \(Q_k ( \widehat{\underline{p}}_{\ell +1} ) \le Q_k ( \check{\underline{p}}_{\ell +1} )\) for every \(\ell \) that occurs for both methods. Moreover, the Krylov method terminates when \(\mathcal{K}_{\ell +1} = \mathcal{K}_{\ell }\) holds, which gives \(\ell \le n\) even if the other conditions for termination are ignored. Usually, however, the number of boundary searches in Sect. 6 can be made arbitrarily large by letting the parameter \(\eta _1\) in the test (6.4) be sufficiently small. These remarks suggest that the very poor results for searches in the last three columns of Tables 1 and 2 may be avoided if the Krylov method is applied.

Much of the effort in the development of LINCOA was spent on attempts to include the Krylov subspace method of Sect. 4 when there are linear constraints on the variables. Careful attention was given to the situation when, having chosen an active set at the point \(\underline{p}_1 \in \mathcal{R}^n\), which need not be the trust region centre \(\underline{x}_k\), the method generates the sequence \(\underline{p}_{j+1}, j = 1,2, \ldots , \ell \), in the trust region, and \(\underline{p}_{\ell +1}\) is the first point in the sequence that violates a linear constraint. If the function \(\phi ( \alpha ) = Q_k ( \underline{p}_{\ell } + \alpha \{ \underline{p}_{\ell +1} - \underline{p}_{\ell } \} ), 0 \le \alpha \le 1\), decreases monotonically, then the method in the first paragraph of Sect. 5 is recommended, which allows either termination with \(\underline{x}_k^+ = \underline{p}_{\mathrm{new}}\), or a change to \(\mathcal{A}\) with \(\underline{p}_{\mathrm{new}}\) becoming the starting point \(\underline{p}_1\) of the Krylov method for the new active set, \(\underline{p}_{\mathrm{new}}\) being the vector (5.1).

Examples in Sect. 4 show, however, that occasionally the first derivative \(\phi ^{\prime } (0)\) is positive, although the Krylov method gives \(\phi (1) < \phi (0)\) and \(\phi ^{\prime } (1) \le 0\), where \(\phi ( \cdot )\) is defined above. A way of making further progress in this case is to pick \(\theta > 0\), and to let \(\underline{p}_{\ell +1} ( \theta )\) be the vector \(\underline{x}\) that minimizes the extended quadratic function

$$\begin{aligned} Q_k^+ ( \underline{x}, \theta ) \,=\, Q_k ( \underline{x}) \,+\, \theta _{\,} \left\| \underline{x}- \underline{p}_{\ell } \right\| ^2, \quad \underline{x}\in \mathcal{R}^n, \end{aligned}$$
(7.1)

subject to \(\Vert \underline{x}- \underline{x}_k \Vert \le \Delta _k\) and to \(\underline{x}- \underline{p}_1\) being in the current Krylov subspace. For every \(\theta > 0\), the calculation of \(\underline{p}_{\ell +1} ( \theta )\) is like the calculation of \(\underline{p}_ {\ell +1} = \underline{p}_{\ell +1} (0)\), due to \(\nabla ^2 Q_k^+ ( \cdot , \theta ) - \nabla ^2 Q_k\) being a multiple of the unit matrix. The function \(\phi ( \alpha , \theta ) = Q_k^+ ( \underline{p}_{\ell } + \alpha \{ \underline{p}_{\ell +1} ( \theta ) - \underline{p}_{\ell } \}, \theta ), 0 \le \alpha \le 1\), does decrease monotonically if \(\theta \) is sufficiently large, in particular when \(Q_k^+ ( \cdot , \theta )\) is convex, because of the property \(\phi ^ {\prime } ( 1, \theta ) \le 0\). The author has considered finding a relatively small value of \(\theta \) that supplies the downhill condition \(\phi ^{\prime } ( 0, \theta ) \le 0\), followed by a change in \(\underline{p}_ {\ell +1}\) from \(\underline{p}_{\ell +1} (0)\) to \(\underline{p}_{\ell +1} ( \theta )\). Equation (7.1) and the Krylov method show that the move from \(\underline{p}_{\ell }\) to the changed \(\underline{p}_{\ell +1}\) achieves the reduction

$$\begin{aligned} Q_k ( \underline{p}_{\ell +1} ) \,\le \, Q_k^+ ( \underline{p}_{\ell +1}, \theta ) \,=\, Q_k^+ \left( \underline{p}_{\ell +1} ( \theta ), \theta \right) \,\le \, Q_k^+ ( \underline{p}_{\ell }, \theta ) \,=\, Q_k ( \underline{p}_{\ell }).\nonumber \\ \end{aligned}$$
(7.2)

Having replaced \(\underline{p}_{\ell +1}\) by \(\underline{p}_{\ell +1} ( \theta )\), where \(\theta \) gives the required monotonicity, the method in the second paragraph of this section is applied if \(\underline{p}_{\ell +1}\) is infeasible. Otherwise, termination with \(\underline{x}_k^+ = \underline{p}_{\ell +1}\) is suitable if condition (2.6) holds, the alternative being to continue the steps of the Krylov method for the calculation of \(\underline{p}_{\ell +2}\), with the current active set and the original quadratic model. These techniques provide an iterative procedure that terminates for the current \(\mathcal{A}\), due to the argument in the paragraph that includes inequality (5.9).

We give further consideration to the Krylov method when the active set is chosen at a point \(\underline{p}_1\) that is different from the centre of the trust region. Then attention is restricted to vectors \(\underline{x}\in \mathcal{R}^n\) that satisfy \(\underline{a}_j^T ( \underline{x}- \underline{p}_1 ) = 0, j \in \mathcal{A}\), which allows the trust region bound \(\Vert \underline{x}- \underline{x}_k \Vert \le \Delta _k\) to be written as the inequality

$$\begin{aligned} \left\| \underline{x}- \widehat{\underline{x}}_k \right\| \,\le \, \left\{ \Delta _k^2 - \left\| \widehat{\underline{x}}_k \!- \underline{x}_k \right\| ^2 \right\} ^{1/2} \,=\, \widehat{\Delta }_k, \end{aligned}$$
(7.3)

say, where \(\widehat{\underline{x}}_k\) is now the shifted trust region centre

$$\begin{aligned} \widehat{\underline{x}}_k \,=\, \underline{p}_1 + \check{Q}_{\,} \check{Q}^T ( \underline{x}_k \!- \underline{p}_1 ) \,=\, \underline{x}_k + \widehat{Q}_{\,} \widehat{Q}^T ( \underline{p}_1 \!- \underline{x}_k ). \end{aligned}$$
(7.4)

Indeed, as \(\underline{x}- \underline{p}_1\) is restricted to the column space of \(\check{Q}\), Eq. (7.4) shows that \(\underline{x}- \widehat{\underline{x}}_k\) and \(\widehat{\underline{x}}_k - \underline{x}_k\) are in the column spaces of \(\check{Q}\) and \(\widehat{Q}\), respectively. Thus \(\underline{x}- \widehat{\underline{x}}_k\) is orthogonal to \(\widehat{\underline{x}}_k - \underline{x}_k\), so the identity \(\underline{x}- \underline{x}_k = ( \underline{x}- \widehat{\underline{x}}_k ) + ( \widehat{\underline{x}}_k - \underline{x}_k )\) gives the form (7.3) of the trust region bound. Further, because Eq. (7.4) implies \(\Vert \widehat{\underline{x}}_k - \underline{x}_k \Vert \le \Vert \underline{p}_1 - \underline{x}_k \Vert \), and because active sets are chosen only at points \(\underline{p}_1\) that satisfy \(\Vert \underline{p}_1 - \underline{x}_k \Vert < \Delta _k\), the trust region radius \(\widehat{\Delta }_k\) in expression (7.3) is positive.
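
In code the shift is only a few lines; the sketch assumes P_check is the projection \(\check{Q}\check{Q}^T\) and that \(\Vert \underline{p}_1 - \underline{x}_k \Vert < \Delta _k\), so the square root below is real and positive.

```python
# Sketch of the shifted centre (7.4) and the reduced radius (7.3).
import numpy as np

def shifted_trust_region(x_k, p1, P_check, Delta):
    x_hat = p1 + P_check @ (x_k - p1)       # Eq. (7.4)
    Delta_hat = np.sqrt(Delta**2 - np.linalg.norm(x_hat - x_k)**2)
    return x_hat, Delta_hat
```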

The form (7.3) of the trust region bound exposes some deficiencies of the Krylov method in the case \(\underline{p}_1 \ne \underline{x}_k\). The description of the method in Sect. 4 is not convenient, however, because some of the details of the computation, like the use of the Arnoldi formula (4.13), are not relevant to our discussion. Instead, we employ a definition of the Krylov subspace \(\mathcal{K}_{\ell }, \ell \ge 1\), that depends on the sequence of points \(\underline{p}_j, j = 1,2, \ldots , \ell \). It can be shown to be equivalent to the previous one, because the Krylov and conjugate gradient methods generate the same points while strictly inside the trust region, the direction (5.7) being in the Krylov subspace \(\mathcal{K}_{\ell }\). Thus \(\mathcal{K}_1\) is the subspace

$$\begin{aligned} \mathcal{K}_{\ell } \,=\, \text{ span }\, \left\{ _{\,} \check{Q}_{\,} \check{Q}^{T\,} \underline{\nabla }Q_k ( \underline{p}_j ) \,:\, j = 1,2, \ldots , \ell _{\,} \right\} \end{aligned}$$
(7.5)

in the case \(\ell = 1\). Having found \(\mathcal{K}_1\), we let \(\underline{p}_2\) be the point given by the usual construction of \(\underline{p}_{\ell +1}\) in the case \(\ell = 1\). Indeed, \(\underline{p}_{\ell +1}\) is the vector \(\underline{x}\) that minimizes \(Q_k ( \underline{x}), \underline{x}\in \mathcal{R}^n\), subject to \(\underline{x}- \underline{p}_1 \in \mathcal{K}_{\ell }\) and to \(\Vert \underline{x}- \widehat{\underline{x}}_k \Vert \le \widehat{\Delta }_k\). The definition (7.5) and this construction are applied iteratively for \(\ell = 1,2,3, \ldots \), until a condition for termination is satisfied, one such condition being \(\mathcal{K}_{\ell } = \mathcal{K}_{\ell -1}\). Two obvious features of formula (7.5) are that it provides \(\mathcal{K}_1 \!\subset \! \mathcal{K}_2 \!\subset \! \cdots \!\subset \! \mathcal{K}_{\ell }\), and that it puts the step \(\underline{p}_{\ell +1} - \underline{p}_{\ell }\) into the column space of \(\check{Q}\), as required by the constraints (3.2). The termination condition (2.6) is recommended. We have addressed already the situation when \(\underline{p}_{\ell +1}\) is infeasible.

We compare this choice of \(\underline{p}_{\ell +1}\) to the one that is optimal if \(Q_k ( \underline{x}), \underline{x}\in \mathcal{R}^n\), is replaced by the linear approximation

$$\begin{aligned} \Lambda _{k \ell } ( \underline{x}) \,=\, Q_k ( \underline{p}_{\ell } ) \,+\, ( \underline{x}- \underline{p}_{\ell } )^{T\,} \underline{\nabla }Q_k ( \underline{p}_{\ell } ), \quad \underline{x}\in \mathcal{R}^n, \end{aligned}$$
(7.6)

which occurs also in Eq. (2.15). The vector \(-\check{Q}\check{Q}^T \underline{\nabla }Q_k ( \underline{p}_{\ell } )\) is the steepest descent direction of \(\Lambda _{k \ell } ( \cdot )\) that is allowed by the active constraints, so the least value of \(\Lambda _{k \ell } ( \cdot )\) subject to these constraints and the trust region bound (7.3) is at the point

$$\begin{aligned} \widehat{\underline{p}}_{\ell +1} \,=\, \widehat{\underline{x}}_k \,-\, \widehat{\Delta }_{k\,} \check{Q}_{\,} \check{Q}^{T\,} \underline{\nabla }Q_k ( \underline{p}_{\ell } )/\left\| \check{Q}_{\,} \check{Q}^{T\,} \underline{\nabla }Q_k ( \underline{p}_{\ell } ) \right\| . \end{aligned}$$
(7.7)

Assuming for the moment that the second derivatives of \(Q_k ( \cdot )\) are small enough for the approximation \(\Lambda _{k \ell } ( \underline{x}) \approx Q_k ( \underline{x}), \Vert \underline{x}- \widehat{\underline{x}}_k \Vert \le \widehat{\Delta }_k\), to be useful, we would like the reduction \(Q_k ( \underline{p}_{\ell } ) - Q_k ( \underline{p}_{\ell +1} )\) to compare favourably with \(Q_k ( \underline{p}_{\ell } ) - \Lambda _{k \ell } ( \widehat{\underline{p}}_{\ell +1} )\). This hope is achieved if \(\widehat{\underline{p}}_{\ell +1} - \underline{p}_1\) is in \(\mathcal{K}_{\ell }\), because then the calculation of \(\underline{p}_{\ell +1}\) by the Krylov method provides \(Q_k ( \underline{p}_{\ell +1} ) \le Q_k ( \widehat{\underline{p}}_{\ell +1} ) \approx \Lambda _{k \ell } ( \widehat{\underline{p}}_{\ell +1})\). Equation (7.7) shows that \(\widehat{\underline{p}}_{\ell +1} - \underline{p}_1\) is a linear combination of \(\widehat{\underline{x}}_k - \underline{p}_1\) and \(\check{Q}\check{Q}^T \underline{\nabla }Q_k ( \underline{p}_{\ell } )\), and the definition (7.5) gives \(\check{Q}\check{Q}^T \underline{\nabla }Q_k ( \underline{p}_{\ell } ) \in \mathcal{K}_{\ell }\). It follows that the Krylov method is suitable in the case \(\widehat{\underline{x}}_k - \underline{p}_1 = 0\), which means that the starting point of the Krylov method for the current \(\mathcal{A}\) is at the centre of the trust region constraint (7.3). Otherwise, the Krylov method may be disadvantageous.

We apply the remarks above to the last example of Sect. 4, where we seek a relatively small value of the function (4.19) subject to the constraints (4.20), the trust region centre \(\underline{x}_k\) being at the origin. The initial active set is empty, so the move from \(\underline{x}_k\) goes along the direction \(-\underline{\nabla }Q_k ( \underline{x}_k )\), and it reaches the point \(\underline{p}_1 = (0, 4, 1 )^T\) on the boundary of the linear constraint \(x_3 \le 1\). This description is in Sect. 4, except that our present notation requires the point \((0, 4, 1 )^T\) to be called \(\underline{p}_1\) instead of \(\underline{p}_2\), because a new active set is generated there to prevent violations of the linear constraint. The new set \(\{ \underline{a}_j : j \in \mathcal{A}\}\) contains only the vector \((0, 0, 1 )^T\), which gives the matrices

$$\begin{aligned} \widehat{Q}\, \widehat{Q}^T \,=\, \left( \begin{array}{ccc} 0 &amp; 0 &amp; 0 \\ 0 &amp; 0 &amp; 0 \\ 0 &amp; 0 &amp; 1 \end{array} \right) \qquad \text{and} \qquad \check{Q}\, \check{Q}^T \,=\, \left( \begin{array}{ccc} 1 &amp; 0 &amp; 0 \\ 0 &amp; 1 &amp; 0 \\ 0 &amp; 0 &amp; 0 \end{array} \right)\!. \end{aligned}$$
(7.8)

Now we are working in \(\mathcal{R}^3\), although the reduced space \(\mathcal{R}^2\) is employed in the penultimate paragraph of Sect. 4. Expressions (4.19), (7.8), (7.4) and (7.3) supply \(\underline{\nabla }Q_k ( \underline{p}_1 ) = (-10, 0, -21.8 )^T, \check{Q}\check{Q}^T \underline{\nabla }Q_k ( \underline{p}_1 ) = (-10, 0, 0)^T, \widehat{\underline{x}}_k = (0, 0, 1 )^T\) and \(\widehat{\Delta }_k = 5\). Therefore the Krylov method and Eq. (7.7) provide \(\underline{p}_2 = (3, 4, 1 )^T\) and \(\widehat{\underline{p}}_2 = (5, 0, 1 )^T\), respectively. The Krylov method is inferior because \(\underline{p}_1 \ne \widehat{\underline{x}}_k\), the new values of \(Q_k ( \cdot )\) being \(Q_k ( \underline{p}_2 ) = -19.4\) and \(Q_k ( \widehat{\underline{p}}_2 ) = -25\). Furthermore, the Krylov method is unable to generate another step that yields the reduction \(Q_k ( \underline{p}_3 ) < Q_k ( \underline{p}_2 )\). Indeed, the projected gradient \(\check{Q}\check{Q}^T \underline{\nabla }Q_k ( \underline{p}_2 ) = (-4, 0, 0 )^T\) is parallel to \(\check{Q}\check{Q}^T \underline{\nabla }Q_k ( \underline{p}_1 )\), so \(\mathcal{K}_2 = \mathcal{K}_1\) occurs in the definition (7.5), which is a condition for termination.
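These numbers are easy to check; the short script below uses only the values quoted in this paragraph, the function (4.19) itself not being needed.

```python
# Verifying the example with the stated data; (4.19) is not reproduced here.
import numpy as np

P = np.diag([1.0, 1.0, 0.0])               # the projection from (7.8)
grad_p1 = np.array([-10.0, 0.0, -21.8])    # stated grad Q_k(p_1)
xhat = np.array([0.0, 0.0, 1.0])           # stated trust region centre (7.4)
delta_hat = 5.0                            # stated radius (7.3)

d = P @ grad_p1                            # (-10, 0, 0)
phat2 = xhat - delta_hat * d / np.linalg.norm(d)
print(phat2)                               # [5. 0. 1.], as in the text

grad_p2 = np.array([-4.0, 0.0, 0.0])       # stated projected gradient at p_2
print(np.cross(grad_p2, d))                # [0. 0. 0.]: parallel, so K_2 = K_1
```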

There are three excellent reasons for starting the Krylov method for the new active set \(\mathcal{A}\) at \(\underline{p}_1\) instead of at \(\widehat{\underline{x}}_k\) when \(\underline{p}_1\) is not a trust region centre. First, the point \(\widehat{\underline{x}}_k\) may be infeasible; second, the value \(Q_k ( \underline{p}_1 )\) is the least known value of \(Q_k (\underline{x}), \underline{x}\in \mathcal{R}^n\), so far subject to the linear constraints and the trust region bound; and third, the new \(\mathcal{A}\) has been chosen carefully so that a move from \(\underline{p}_1\) along the direction \(-\check{Q}\check{Q}^T \underline{\nabla }Q_k ( \underline{p}_1 )\) does not violate a constraint until the length of the move is greater than \(\eta _2 \Delta _k\). Moreover, while the sequence of Krylov steps stays strictly inside the trust region, the steps are suitable, because they are the same as the conjugate gradient steps of Sect. 5. When the Krylov steps move round the boundary of the trust region, however, there is the very strong objection that the definition (7.5) of the Krylov subspaces pays no attention to the actual trust region boundary, this deficiency being shown clearly by the property \(\mathcal{K}_1 = \mathcal{K}_2\) in the example of the previous paragraph. Therefore, although it is argued in the first paragraph of this section that boundary searches by the Krylov method are superior to those of Sect. 6 in the case \(\underline{p}_1 = \underline{x}_k\), and although this argument is valid too in the unusual situation \(\underline{p}_1 = \widehat{\underline{x}}_k \ne \underline{x}_k\), we expect the crude searches of Sect. 6 to be better in the case \(\underline{p}_1 \ne \widehat{\underline{x}}_k\). We recall also that, when \(\underline{p}_{\ell +1}\) is generated by the Krylov method, the uphill property \(( \underline{p}_{\ell +1} - \underline{p}_{\ell } )^T \underline{\nabla }Q_k ( \underline{p}_{\ell } ) > 0\) is possible, which causes difficulties if \(\underline{p}_{\ell +1}\) is infeasible. These disadvantages led to the rejection of the Krylov method from the LINCOA software, as mentioned earlier. Nevertheless, the description of the Krylov method with linear constraints in Sect. 4 may be useful, because, in many applications of LINCOA, most of the changes to \(\mathcal{A}\) occur at the beginning of an iteration, and then \(\underline{p}_1\) is at the trust region centre \(\underline{x}_k\).

The choice of the quadratic model \(Q_k ( \underline{x}), \underline{x}\in \mathcal{R}^n\), for each iteration number k is important, but it is outside the scope of our work. Nevertheless, because Tables 1 and 2 in Sect. 6 compare \(\text{ NPT } = n + 6\) with \(\text{ NPT } = 2n + 1\), we comment briefly on the number of interpolation conditions. When the author began to investigate the symmetric Broyden method for minimization without derivatives, as reported in the last three paragraphs of [8], NPT was chosen to be \(\mathcal{O}( n )\) for large n, in order to allow the routine work of each iteration to be only \(\mathcal{O}( n^2 )\). Comparisons were made with \(\text{ NPT } = \frac{1}{2} ( n + 1 ) ( n + 2 )\), which is the number of degrees of freedom in a quadratic function. The finding that smaller values of NPT often provide much lower values of #F was a welcome surprise. For the smaller values, the second derivative matrix \(\nabla ^2 Q_k\) is usually very different from \(\nabla ^{2\!} F ( \underline{x}_k )\) at the end of the calculation, even when \(F ( \cdot )\) is quadratic. It seems, therefore, that quadratic models without good accuracy can be helpful in the choice of \(\underline{x}_{k+1}\). This view is supported by the following advantage of \(Q_k ( \cdot )\) over a linear approximation to \(F ( \cdot )\).
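Although the updating formulae are outside our scope, the flavour of such underdetermined models can be conveyed by a small sketch. The following numpy function, whose name and interface are our own, constructs a quadratic model from NPT interpolation conditions by minimizing the Frobenius norm of the second derivative matrix; the symmetric Broyden method of [8] minimizes instead the change to \(\nabla ^2 Q\) in the same norm, so this is a from-scratch simplification, not the update itself.

```python
# A sketch of a quadratic model with NPT < (n+1)(n+2)/2 interpolation
# conditions, the spare freedom being removed by minimizing the Frobenius
# norm of the second derivative matrix.  The function name and interface
# are ours; the symmetric Broyden method minimizes instead the CHANGE to
# the previous model in the same norm.
import numpy as np

def min_frobenius_model(Y, f):
    """Y is NPT x n (rows are interpolation points y_i), f holds F(y_i).
    Returns (c, g, H) with Q(x) = c + g.x + x.H.x/2 and Q(y_i) = f_i."""
    npt, n = Y.shape
    A = 0.5 * (Y @ Y.T) ** 2                   # A_ij = (y_i . y_j)^2 / 2
    Pi = np.hstack([np.ones((npt, 1)), Y])     # rows (1, y_i^T)
    K = np.block([[A, Pi], [Pi.T, np.zeros((n + 1, n + 1))]])
    sol = np.linalg.solve(K, np.concatenate([f, np.zeros(n + 1)]))
    lam, c, g = sol[:npt], sol[npt], sol[npt + 1:]
    H = (Y.T * lam) @ Y                        # H = sum_i lam_i y_i y_i^T
    return c, g, H
```

With interpolation points in general position the matrix of this system is nonsingular, and the resulting \(\nabla ^2 Q\) can be far from \(\nabla ^2 F\) while Q still reproduces F at every \(\underline{y}_i\).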

We suppose that there are no linear constraints on the variables, and that we wish to predict whether the reduction \(F( \underline{x}) < F ( \underline{x}_k )\) is going to occur for some \(\underline{x}\) in the trust region of the k-th iteration. If a linear approximation \(\Lambda _k ( \cdot ) \approx F ( \cdot )\), say, is employed for the prediction, and if \(\underline{\nabla }\Lambda _k ( \underline{x}_k )\) is nonzero, then the answer is affirmative for all vectors \(\underline{x}\) in the set \(\{ \underline{x}: \Lambda _k ( \underline{x}) < \Lambda _k ( \underline{x}_k ) \} \cap \{ \underline{x}: \Vert \underline{x}- \underline{x}_k \Vert \le \Delta _k \}\), which is half of the trust region on one side of the plane \(\{ \underline{x}: ( \underline{x}- \underline{x}_k )^T \underline{\nabla }\Lambda _k ( \underline{x}_k ) = 0 \}\). On the other hand, a typical quadratic model \(Q_k ( \cdot )\) is subject to the interpolation conditions

$$\begin{aligned} Q_k ( \underline{y}_i ) \,=\, F ( \underline{y}_i ), \quad i = 1,2, \ldots , \text{ NPT }, \end{aligned}$$
(7.9)

with \(\text{ NPT } > n + 1\), and we expect \(\underline{x}_k\) to be a best interpolation point, which means \(\underline{x}_k = \underline{y}_t\), where t is an integer in \([1, \text{ NPT } ]\) that satisfies \(F ( \underline{y}_t ) \le F ( \underline{y}_i ), i = 1,2, \ldots , \text{ NPT }\). Thus the set \(\{ \underline{x}: Q_k ( \underline{x}) < Q_k ( \underline{x}_k ) \}\) is usually very different from half of the trust region on one side of a plane, especially if \(\underline{x}_k\) is a strictly interior point of the convex hull of the interpolation points. Indeed, the set excludes a neighbourhood of every \(\underline{y}_i\) with \(F ( \underline{y}_i ) > F ( \underline{x}_k )\), and searches for relatively small values of \(Q_k ( \cdot )\) stay away automatically from the current interpolation points. Quadratic models with NPT of magnitude n are obvious candidates for providing this useful feature. Furthermore, because symmetric Broyden updating takes up the freedom in each new model by minimizing the change to the model in a particular way, some helpful properties of the old model can be inherited by the new one, although \(\nabla ^2 Q_k\) may be a very bad estimate of \(\nabla ^2 F ( \underline{x}_k )\). The author is enthusiastic about such models, because of their success in his software for optimization without derivatives when there are hundreds of variables.
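This difference is easy to see in one dimension. The sketch below uses data invented for the purpose (n = 1, \(\text{ NPT } = 3 > n + 1\)): the sublevel set of the quadratic model is a small interval that keeps away from the interpolation points with large function values, whereas a linear approximation with the same slope at \(x_k\) marks the whole half line as promising.

```python
# A one-dimensional illustration with invented data.
import numpy as np

y = np.array([0.0, 1.0, 2.0])      # interpolation points, x_k = y_0 = 0
f = np.array([0.0, 2.0, 10.0])     # F values; y_0 gives the least one
a, b, c = np.polyfit(y, f, 2)      # model Q(x) = a x^2 + b x + c
Q = np.poly1d([a, b, c])
assert np.allclose(Q(y), f)        # the interpolation conditions (7.9)
print(b)                           # approximately -1: the slope of Q at x_k
                                   # is downhill, so a linear model predicts
                                   # descent on the whole half line x > 0
print(sorted(np.roots([a, b, c]))) # ~[0, 1/3]: Q < Q(x_k) only on (0, 1/3),
                                   # which avoids y_1 = 1 and y_2 = 2
```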

Updating the quadratic model is an example of a subject that is fundamental to the development of algorithms such as LINCOA, but the subject is separate from our present work. Another fundamental subject that has not received our attention is the choice of vectors \(\underline{x}\) for the calculation of new values of \(F ( \underline{x})\) on iterations that are designed to improve the quadratic model, instead of trying to achieve the reduction \(F ( \underline{x}) < F ( \underline{x}_k )\). The number of these “model iterations” is about \(( \#F - \#\text{ TRI } - \text{ NPT } )\) in Tables 1 and 2. Our paper is therefore definitely not intended to be a description of LINCOA, although users of LINCOA may welcome it, because a full description of the software has not been written yet. Instead, we have reported the investigations, carried out for LINCOA, into the construction of feasible vectors \(\underline{x}\) that provide a sufficiently small value of \(Q_k ( \underline{x}), \Vert \underline{x}- \underline{x}_k \Vert \le \Delta _k\), subject to linear constraints. Most of the effort of those investigations, which took about two years, was spent on promising techniques that have not been included in the software. The prime example is the Krylov subspace method, which was expected to perform better than truncated conjugate gradients, because of its attractive way of taking steps round the trust region boundary in the unconstrained case. The reason for giving so much attention to failures of the Krylov method is that our findings may be helpful to future research.