LMBOPT – a limited memory method for bound-constrained optimization

Recently, Neumaier and Azmi gave a comprehensive convergence theory for a generic algorithm for bound constrained optimization problems with a continuously differentiable objective function. The algorithm combines an active set strategy with a gradient-free line search CLS along a piecewise linear search path defined by directions chosen to reduce zigzagging. This paper describes LMBOPT, an efficient implementation of this scheme. It employs new limited memory techniques for computing the search directions, improves CLS by adding various safeguards relevant when finite precision arithmetic is used, and adds many practical enhancements in other details. The paper compares LMBOPT and many other solvers on the unconstrained and bound constrained problems from the CUTEst collection and makes recommendations on which solver to use and when. Depending on the problem class, the problem dimension, and the precise goal, the best solvers are LMBOPT, ASACG, and LMBFG-EIG-MS.


Introduction
The bound constrained optimization problem (BCOPT) is the task of minimizing a function subject to a feasible region defined by simple bounds on the variables. In this paper we describe the implementation and numerical evaluation of a new active set method for solving the bound constrained optimization problem (1), where x = [x̲, x̄] is a bounded or unbounded box in R^n describing the bounds on the variables and the objective function f : x → R is continuously differentiable with gradient g(x) := ∂f(x)/∂x ∈ R^n.
Problems with naturally given bounds appear in a wide range of applications including the optimal design problem [4], contact and friction in rigid body mechanics [50], the obstacle problem [53], journal bearing lubrication, and flow through a porous medium [48]. Often variables of an optimization problem can only be considered meaningful within a particular interval [31]. Some approaches [1] reduce the solution of variational inequalities and complementarity problems to bound constrained problems. The bound constrained optimization problem also arises as an important subproblem in algorithms for solving general constrained optimization problems based on augmented Lagrangians and penalty methods. Numerous research papers [17,27,39,38,51] deal with the development of efficient numerical algorithms for solving bound constrained optimization problems, especially when the number of variables is large.

Past work
In the last few years, many algorithms have been developed for solving the BCOPT problem (1).
Active set methods are among the most effective methods. They consist of two main stages that alternate until a solution is found. In the first stage one identifies a good approximation for the set of optimal active bound constraints, defining a face likely to contain a stationary point of the problem. A second stage then explores this face of the feasible region by approximately solving an unconstrained subproblem.
A classical reference for active set methods for bound constrained problems with a convex quadratic objective function (QBOPT) is the projected conjugate gradient method of Polyak [55], which drops and adds only one constraint in each iteration.
That is, at each step of this active set method, the dimension of the subspace of active variables is changed by one. This implies that if there are n_1 constraints active at the starting point x_0 and n_2 constraints active at the solution of QBOPT, we need at least |n_2 − n_1| iterations to reach the solution of QBOPT. This may be a serious drawback in the case of large scale problems. Dembo & Tulowitzki [23] introduced methods for QBOPT in 1983 that are able to add and drop many constraints at each iteration. Their basic idea was further developed by Yang & Tolle [60] into an algorithm guaranteed to identify the face containing a local solution of QBOPT in finitely many iterations, even when the solution of the problem is degenerate. For further research on QBOPT we refer the reader to [25,26,52,53].
For BCOPT with a general nonlinear objective function, Bertsekas [3] proposed an active set algorithm that uses a gradient projection method to find the optimal active variables. He showed that this method is able to find the face containing a local solution very quickly. Further research on convergence and properties of projected gradient methods can be found in [3,14,28]. The idea of using gradient projections for identifying optimal active constraints was followed up by many researchers. Most of them [12,16,15] combined Newton type methods with the gradient projection method to speed up the convergence. For example, LBFGSB, developed by Byrd et al. [12], performs the gradient projection method by computing the Cauchy point to determine the active variables. After determining the set of active variables, the algorithm performs line searches along the search directions obtained by a limited memory BFGS method [13] to explore the subspace of nonactive variables. In fact, the use of limited memory BFGS matrices and the line search strategy are the main properties that distinguish this method from others, especially from the trust region type method proposed by Conn et al. [16,15].
A non-monotone line search was first introduced by Grippo, Lampariello & Lucidi (GLL) in [36] for Newton methods, to improve the ability to follow a curved valley with steep walls. Later several works [19,22,29,37,58,61] on non-monotone line search methods pointed out that these methods are more efficient than monotone line search methods in many cases. Other papers [4,8,20,21,32,49,57] indicate that gradient projection approaches based on a Barzilai-Borwein step size [2] have impressive performance in a wide range of applications. Recent works [5,6,8,9,10,56] on Barzilai-Borwein gradient projection methods (BBGP) have modified them by combining them with the GLL non-monotone line search: For instance, Raydan [56] developed the BBGP method for solving unconstrained optimization problems, and Dai & Fletcher [20,21] proposed BBGP methods for large-scale bound constrained quadratic programming. The idea of Raydan [56] was further developed to generate a convex constrained solver (SPG) by Birgin et al. [8,9] and a bound constrained solver (GENCAN) by Birgin et al. [5,6], enriched by an active set strategy.
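The Barzilai-Borwein step size mentioned above can be sketched in a few lines; the helper name `bb_step` and the fallback handling are my own and not taken from any of the cited solvers.

```python
import numpy as np

def bb_step(s, y, fallback=1.0):
    """First Barzilai-Borwein step size alpha = s^T s / s^T y.

    s = x_k - x_{k-1}, y = g_k - g_{k-1}; falls back to a default
    when the curvature estimate s^T y is not positive.
    """
    sy = s @ y
    if sy <= 0:
        return fallback
    return (s @ s) / sy

# On a quadratic f(x) = 0.5 x^T A x the step is an inverse Rayleigh
# quotient of A, lying between 1/lambda_max and 1/lambda_min.
A = np.diag([1.0, 4.0])
x_old, x = np.array([1.0, 1.0]), np.array([0.5, 0.2])
s = x - x_old
y = A @ x - A @ x_old          # gradient difference for f(x) = 0.5 x^T A x
alpha = bb_step(s, y)
```

Here s @ y = 2.81 and s @ s = 0.89, so the step lies between the reciprocal extreme eigenvalues 1/4 and 1 of A.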
The GALAHAD package [34] uses LANCELOT-B as its bound-constrained solver, a trust-region algorithm using truncated Newton directions. Recently, Burdakov et al. [11] constructed a family of limited memory quasi Newton methods for unconstrained optimization combined with line searches or trust regions, called the LMBFG package.
To deal with negative curvature regions, Birgin & Martínez [5] used the second-order trust region algorithm of Zhang & Xu [62], and Birgin & Martínez [6] designed a new algorithm whose line search iteration is performed by means of backtracking and extrapolation. Hager & Zhang [43] developed an active set algorithm called ASACG for large scale bound constrained problems. ASACG consists of two main steps within a framework for branching between them: a nonmonotone gradient projection step, called NGPA, which is based on their research on the cyclic Barzilai-Borwein method [22], and an unconstrained step that utilizes the conjugate gradient algorithms they developed [40,41,42,44]. ASACG version 3.0 was updated to call CGdescent version 6.0, which uses the variable HardConstraint to allow function or gradient evaluations at points that violate the bound constraints; this additional flexibility in the starting step size routine can improve performance. In 2017, Cristofari et al. [18] proposed a two-stage active set algorithm for bound-constrained optimization, called ASABCP. ASABCP first finds an active set estimate with a guarantee that the function value is reduced.
Then it uses a truncated-Newton technique in the subspace of non-active variables.
A considerable amount of literature has been published on line search algorithms, which enforce the Wolfe conditions (Wolfe [59]) or the Goldstein conditions (Goldstein [33]).One problem with line search algorithms satisfying the Wolfe conditions is the need to calculate a gradient at each trial point.
The reduced gradient (whose ith component is determined by g_i := g_i(x) and by whether x_i is at a bound) vanishes at a local minimizer. At each point x during the iteration, a search direction is determined in a subspace obtained by varying the part indexed by a working set I, chosen either as the minimal set of free indices or as the maximal set of free or freeable indices. Freeing iterations are used to ensure the absence of severe zigzagging. BOPT takes a starting point x_0 ∈ R^n and the feasible set x as input and returns an optimal point x_best and its function value f_best = f(x_best) as output. It uses the tuning parameters of a line search CLS (discussed in Section 1.3) and two further tuning parameters: 0 < Δ_a < 1 (parameter for the angle condition) and 0 < ρ ≤ 1 (parameter for freeing iterations).
The bent line search and conditions (6)-(9) on the search direction are essential for the convergence analysis in [54, Sections 8 and 11], where the following is proved:

Algorithm 1 BOPT, bound constrained optimization algorithm

Initialization
1: Compute the initial gradient g_0 := g(x_0).
2: Compute the initial reduced gradient g^red(x_0).
3: Find the initial working set I_0 := I_+(x_0).   (the maximal set is chosen)
4: for ℓ = 0, 1, 2, ... do
Finding the search direction
5-8: Compute a search direction p_ℓ satisfying the angle condition; for freeing iterations, check whether (5) holds.
Finding the step size
9: Perform a line search CLS (discussed in Section 1.3) along the bent search path, the projection of the ray x_ℓ + α p_ℓ into the feasible set x; compute x_{ℓ+1} and g^red(x_{ℓ+1}).
Updating the working set
10: Find I_{ℓ+1} := I_−(x_{ℓ+1}) by (3).   (the working set changes to the minimal set)
11: if (6) is violated then   (the working set changes to the maximal set)
12: Find I_{ℓ+1} := I_+(x_{ℓ+1}) by (4).
13: end if
14: end for

Theorem 1 Let f be continuously differentiable, with Lipschitz continuous gradient g. Let x_ℓ denote the value of x in Algorithm 1 after its ℓth update. Then one of the following three cases holds: (i) The iteration stops after finitely many steps at a stationary point. (ii) The function values f(x_ℓ) are unbounded below. (iii) The reduced gradients g^red(x_ℓ) converge to zero. Moreover, if BOPT converges to a nondegenerate stationary point, all strongly active variables are ultimately fixed, so that zigzagging through changes of the active set cannot occur infinitely often.
1.3 CLS - the curved line search of BOPT
CLS (Algorithm 3.3 in [54] = Algorithm 2 below), the line search used in BOPT, is an efficient gradient-free curved line search algorithm. It searches along a piecewise linear search path defined by directions chosen to reduce zigzagging. It takes the ℓth point x_ℓ and its function value f_ℓ = f(x_ℓ), the gradient vector g_ℓ = g(x_ℓ), the search direction p_ℓ, the feasible set x, and the initial step size α_init as input and returns the (ℓ+1)th point x_{ℓ+1} = x(α_ℓ), its step size α_ℓ, and its function value f_{ℓ+1} = f(x_{ℓ+1}) as output. It uses several tuning parameters: β ∈ ]0, 1/4[ (parameter for efficiency), q > 1 (extrapolation factor), the positive integer l_max (limit on the number of iterations), and α_max (maximal value for the step size α). In each iteration k, it computes the point x(α_k) on the bent search path (10) and its function value f(x(α_k)).

It then computes the Goldstein quotient (Goldstein [33])

μ(α) := (f(x(α)) − f(x_ℓ)) / (α g_ℓ^T p_ℓ).

The sufficient descent condition (13) is an improved form of the Goldstein condition. It forbids step sizes that are too large or too small by enforcing that μ(α) is sufficiently positive and not too close to one.
• According to Theorem 3.2 of [54], a step size satisfying (13) can be found by performing CLS if the objective function f is bounded below.
• If the objective function is quadratic, an exact line search reveals that the secant step size is the minimizer of the convex quadratic along the search ray, so the quadratic case is started optimally. Otherwise the function is regarded as far from quadratic and bounded.
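The claim about the secant step size can be checked numerically: for a quadratic objective, the secant step α/(2(1 − μ(α))) computed from the Goldstein quotient μ is the exact minimizer along the ray, independent of the trial step α. The sketch below uses my own notation and a toy quadratic.

```python
import numpy as np

# For a quadratic f, the Goldstein quotient
# mu(alpha) = (f(x + alpha*p) - f(x)) / (alpha * g^T p)
# is affine in alpha, and the secant step alpha/(2*(1 - mu(alpha)))
# equals the exact minimizer of f along the ray x + t*p.
B = np.array([[3.0, 1.0], [1.0, 2.0]])    # positive definite Hessian
def f(v): return 0.5 * v @ B @ v
x = np.array([1.0, -2.0])
g = B @ x                                  # gradient of f at x
p = -g                                     # a descent direction
alpha = 0.1                                # arbitrary trial step

mu = (f(x + alpha * p) - f(x)) / (alpha * (g @ p))
secant = alpha / (2.0 * (1.0 - mu))
exact = -(g @ p) / (p @ B @ p)             # minimizer of f(x + t*p) in t
```

Expanding f(x + αp) shows 1 − μ(α) = αp^TBp / (2|g^Tp|), from which the identity secant = exact follows for any trial α.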

LMBOPT - an efficient version of BOPT
In this paper we introduce a new limited memory method for bound-constrained optimization called LMBOPT.It conforms to the assumptions of BOPT, hence converges to a stationary point in exact arithmetic and fixes all strongly active variables after finitely many iterations, but also takes care of various efficiency issues that are difficult to explain in theory but must be addressed in a robust and efficient implementation.
Important novelties compared to the literature are:
• the useful trick of moving the starting point slightly into the relative interior of the feasible domain x,
• a useful starting direction based on the gradient signs,
• a new quadratic limited memory model for progressing in a subspace,
• a numerically stable version of the descent direction proposed by Neumaier & Azmi [54] for removing zigzagging,
• a new regularized conjugate gradient direction,
• safeguards for the curved line search taking into account effects due to finite precision arithmetic,
• new heuristic methods for an initial step size, for a robust minimal step size, and for handling null steps without progress in the line search.
Taken together, these enhancements make LMBOPT very efficient and robust.
We describe how to compute search directions in Section 2:
• Subspace information is defined in Subsection 2.1.
• A new quasi Newton direction and a regularized conjugate gradient step are introduced in Subsections 2.2 and 2.3, respectively.
• Some implementation details of these directions are given in Subsection 2.4.
Improvements in the line search are discussed in Section 3:
• Issues with finite precision arithmetic are described in Subsection 3.1.
• Ingredients of an improved version of CLS are introduced in Subsection 3.2.
We introduce the master algorithm and its implementation details in Section 4:
• A useful starting point is suggested in Subsection 4.1.
• The conditions for accepting a new point are explained in Subsection 4.2.
• Some implementation details are given in Subsection 4.3.
• The master algorithm is introduced in Subsection 4.4.
Numerical results for unconstrained and bound constrained CUTEst problems [35] are summarized in Section 5:
• Details of test problems and a shifted starting point are discussed in Subsection 5.1.
• Default parameters for LMBOPT are given in Subsection 5.2.
• Subsection 5.3 contains a list of all compared solvers and explains how unconstrained solvers turn into bound constrained solvers.
• First numerical results are given in Subsection 5.4.1, resulting in the three best solvers (LMBOPT, ASACG, and LMBFG-EIG-MS).
• Additional numerical results classified by constraints and dimensions are given in Subsection 5.4.2.
• Further numerical results for hard problems are given in Subsection 5.5.
• Based on our findings, a solver choice depending on the problem dimension, the presence or absence of constraints, the desired robustness, and the relative costs of the function and gradient evaluations is made in Subsection 5.6.
The website http://www.mat.univie.ac.at/~neum/software/LMBOPT contains public Matlab source code for LMBOPT together with more detailed documentation and an extensive list of tables and figures with numerical results and comparisons.

Search directions
In this section, we describe the search direction used at each iteration. In Subsection 2.1, subspace information is described. Subsequently, a new limited memory quasi Newton direction is discussed in Subsection 2.2. Then Subsection 2.3 describes how conjugate gradient directions are constructed and regularized. Finally, Subsection 2.4 contains the implementation details of our regularized conjugate gradient direction.

Subspace information
After each iteration we form the differences

s := x − x_old,   y := g − g_old,

where x, g are the current point and gradient, and x_old, g_old are the previous point and gradient.
For some subspace dimension m, the matrix S ∈ R^{n×m} has as columns (in the actual implementation a permutation of) m previous point differences s. A second matrix Y ∈ R^{n×m} has as columns the corresponding gradient differences y.
If the objective function is quadratic with (symmetric) Hessian B and no rounding errors are made, the matrices S, Y ∈ R^{n×m} satisfy the quasi-Newton condition Y = BS. Since B is symmetric, H := S^T Y = S^T B S must be symmetric. If we calculate y = Bs for a direction s ≠ 0, we have the consistency relations y(αs) = αy(s) for all α ∈ R in exact precision arithmetic. If the columns of S (and hence those of Y) are linearly independent then m ≤ n, and H is positive definite. Then the minimum of f(x + Sz) with respect to z ∈ R^m is attained at z = −H^{−1}c with c := S^T g, where x_new := x + Sz and g_new := g(x_new) are the associated point and gradient. As long as m has not reached its limit, we use γ := y^T s and form the augmented matrices S_new := (S s) and Y_new := (Y y), the augmented vector c_new := (S_new)^T g_new = (0; s^T g_new), and update H accordingly. But when the allowed memory for S and Y is full, we delete the oldest column of S and Y and the corresponding row and column of H to make room for the new pair of vectors, and then augment as described above.
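The subspace machinery above can be illustrated on a quadratic, where Y = BS holds exactly; the sketch below (names and setup are mine) verifies that minimizing over x + Sz with H = S^T Y yields a point whose gradient is orthogonal to the subspace spanned by the columns of S.

```python
import numpy as np

# Sketch: for the quadratic f(x) = 0.5 x^T B x - b^T x, the gradient
# differences satisfy Y = B S, H = S^T Y = S^T B S is symmetric positive
# definite, and min_z f(x + S z) is attained at z = -H^{-1} c, c = S^T g(x).
rng = np.random.default_rng(0)
n, m = 6, 3
M = rng.standard_normal((n, n))
B = M @ M.T + n * np.eye(n)                # positive definite Hessian
b = rng.standard_normal(n)
grad = lambda v: B @ v - b

x = rng.standard_normal(n)
S = rng.standard_normal((n, m))            # columns: previous steps s
Y = B @ S                                  # quasi-Newton condition Y = B S
H = S.T @ Y
H = 0.5 * (H + H.T)                        # symmetrize against rounding errors
c = S.T @ grad(x)
z = np.linalg.solve(H, -c)
x_new = x + S @ z
# At the subspace minimizer, the gradient is orthogonal to range(S):
res = S.T @ grad(x_new)
```

The residual `res` equals c + Hz and hence vanishes up to rounding, confirming the subspace optimality condition.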
The implementation contains a Boolean tuning parameter updateH that selects how H is computed. If the objective is not quadratic, (14) does not hold exactly and H := S^T Y need not be symmetric. However, the update (20) always produces a symmetric H, even in finite precision arithmetic.

A new quasi-Newton direction
We use S and Y to construct a Hessian approximation of the form (23), built from a diagonal matrix D, some symmetric matrix W ∈ R^{m×m}, and some matrix X ∈ R^{n×m}. Thus, the additional assumption is temporarily made that B deviates from the diagonal matrix D by a matrix of rank at most m. Under these assumptions, we reconstruct the Hessian uniquely from the data S and Y = BS, in a manifestly symmetric form that can be used as a surrogate Hessian even when this structural assumption is not satisfied.
This provides an efficient alternative to the traditional L-BFGS-B formula [12], which needs twice as much storage and computation time. (14) and (23) imply a representation of B in which the solution of Bp = −g is given in terms of a symmetric matrix.
Proof The matrices U := Y − DS and Σ := U^T S are computable from S and Y and satisfy a product relation. This product relation and the invertibility of Z imply that Σ is invertible, too, and we conclude that X = ZΣ^{−1}Z^T, hence the claim follows.
To apply this to the bound constrained case, we note that the first order optimality condition predicts the point x + p, where the nonactive part p_I of p solves the reduced Newton equation (27). Substituting for z in (27) and using (26), and with the symmetric matrix H defined by (15), we compute the required symmetric quantities and obtain the direction. Here J contains the indices of the newest and oldest pair (s, y), and • denotes componentwise multiplication.
Enforcing the angle condition. Due to rounding errors, a computed descent direction p need not satisfy the angle condition (8). We may add a multiple of the gradient to enforce the angle condition (8) for the modified direction, with a suitable factor t ≥ 0; the case t = 0 corresponds to the case where p already satisfies the bounded angle condition (8). The choice of t depends on three numbers that are related by the Cauchy-Schwarz inequality. We want to choose t such that the angle condition (8) holds with p_new in place of p. If σ_new ≤ −Δ_a, this holds for t = 0, and we make this choice. Otherwise we may enforce equality in (8) by choosing t according to (33). The following proposition is a special case of Proposition 5.2 in [54].
Given z by (29), we could compute the nonactive part of p from (30); however, this need not lead to a descent direction since B need not be positive definite. We therefore compute u := U_{I:} z and choose the direction (34), where t is chosen analogously to (33) if this results in t < 1, and t = 1 otherwise. By Proposition 1, the direction (34) satisfies the angle condition (8).
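A minimal way to enforce the angle condition by adding a multiple of the gradient is sketched below; the doubling search for t is my own stand-in for the closed-form factor of (33), which is not reproduced in this copy.

```python
import numpy as np

def enforce_angle(p, g, delta_a=0.01, max_tries=60):
    """Return p_new = p - t*g with t >= 0 chosen so that the angle condition
    g^T p_new <= -delta_a * ||g|| * ||p_new|| holds.

    A simple doubling search standing in for the closed-form choice of t;
    t = 0 is kept when p already satisfies the condition.
    """
    norm_g = np.linalg.norm(g)
    t = 0.0
    for _ in range(max_tries):
        p_new = p - t * g
        if g @ p_new <= -delta_a * norm_g * np.linalg.norm(p_new):
            return p_new
        # start at a scale-free t, then double; as t grows, p_new turns
        # towards the steepest descent direction -g, so this terminates
        t = max(2.0 * t, np.linalg.norm(p) / norm_g)
    return -g  # fall back to steepest descent

g = np.array([1.0, 0.0])
p = np.array([0.0, 1.0])      # orthogonal to g: violates any angle condition
p_new = enforce_angle(p, g, delta_a=0.1)
cosine = (g @ p_new) / (np.linalg.norm(g) * np.linalg.norm(p_new))
```

For the orthogonal pair above, one doubling step already yields a direction whose cosine with g is about −0.707, well inside the cone required by Δ_a = 0.1.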

A conjugate gradient step
Neumaier & Azmi [54, Section 7] introduced a new nonlinear conjugate gradient method, chosen to reduce zigzagging, for unconstrained optimization; applied to the working subspace, it may be used by BOPT to generate search directions as long as the active set does not change.
Any search direction p must satisfy g^T p < 0. To avoid zigzagging, [54] generated the search direction p as the vector with a fixed value g^T p = −c < 0 closest (with respect to the 2-norm) to the previous search direction p_old. By Theorem 7.1 in [54] (applied with B = I),

p = β p_old − λ g,  (35)

with β and λ given by (36). The resulting method has finite termination on quadratic objective functions, where it reduces to linear conjugate gradients.
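The defining property of the direction (closest to p_old in the 2-norm subject to g^T p = −c) can be realized directly as an orthogonal projection onto the hyperplane g^T p = −c; this sketch is a stand-in for the explicit formulas (35) and (36), whose coefficients are not reproduced in this copy.

```python
import numpy as np

def anti_zigzag_dir(p_old, g, c):
    """Closest vector (2-norm) to p_old subject to g^T p = -c with c > 0.

    Obtained by orthogonal projection onto the hyperplane {p : g^T p = -c};
    a sketch of the defining property, not the paper's exact recursion.
    """
    lam = (g @ p_old + c) / (g @ g)   # Lagrange multiplier of the projection
    return p_old - lam * g

g = np.array([2.0, -1.0, 0.5])
p_old = np.array([1.0, 1.0, 1.0])
c = 3.0
p = anti_zigzag_dir(p_old, g, c)
```

By construction g^T p = −c < 0, so the computed vector is always a descent direction, regardless of p_old.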
[54, Theorem 7.3] shows that the bounded angle condition holds for sufficiently large ℓ if an efficient line search such as CLS is used and there are positive constants κ_1 and κ_2 such that either p_ℓ is parallel to the steepest descent direction −g_ℓ or suitable growth conditions hold (where y_{ℓ−1} := g_ℓ − g_{ℓ−1}). Convergence is locally linear when the sequence x_ℓ converges to a strong local minimizer.
As in [54, Theorem 7.5], the sequence generated by (35) and (36) can be rewritten as the nonlinear conjugate gradient method of Fletcher & Reeves [30], equivalent to the linear conjugate gradient method of Hestenes & Stiefel [45] when f is quadratic with positive definite Hessian and bounded below. Therefore it requires at most n steps to obtain a minimizer of f.
As a consequence of [54, Theorem 7.3], [54, Theorem 7.6] showed that the sequence x_ℓ of the conjugate gradient method generated by (35) and (36) satisfies the convergence estimate given there, and convergence is locally linear if the sequence x_ℓ converges to a strong local minimizer.
In this section, we discuss a new conjugate gradient method equivalent to the linear conjugate gradient method of Hestenes & Stiefel [45] in the case where f is quadratic with positive definite Hessian and bounded below. Theorems 7.3, 7.5, and 7.6 in [54] remain valid for our conjugate gradient method.
The conjugate gradient method chooses the direction s in the subspace generated by typeSubspace and enforces the conjugacy relation (39), thus making H diagonal. Both conditions together determine s up to a scaling factor: s must be a multiple of

p := Sr + p_init  (40)

for some r ∈ R^m, in which p_init is a descent direction computed by searchDir (discussed later in Subsection 2.4).
Then (39) requires the condition (41); hist denotes the subspace basis index set (discussed later in Subsection 2.4). By restricting S, Y, and H to hist, we define the restricted quantities S^h, Y^h, and H^h. Then we solve the linear systems H^h z^h = −c^h and H^h r^h = q^h according to (21) and (41), respectively. Afterwards, we construct the conjugate gradient direction, depending on whether the subspace is usable (m_0 > 0) or not (m_0 = 0). Here γ is computed by (43), where α is found in a heuristic way (goodStep; discussed later in Section 3).
In finite precision arithmetic, a tiny denominator in (43) produces a very inaccurate γ. This drawback is overcome by regularization, so that the error made in γ_reg is only a tiny multiple of the natural scale of the quotient. We therefore shift the denominator in (43) away from zero as in (45), where Δ_H ∈ (0, 1) is a tiny factor. Then the regularized conjugate gradient direction is computed by (46) with the coefficients (47). If the quantity in (48) is negative, (46) is a descent direction.
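A generic version of such a denominator safeguard is sketched below; the paper's exact rule (45) is not reproduced in this copy, and the function name, the `scale` argument, and the sign convention for a zero denominator are my assumptions.

```python
import math

def reg_denom(d, scale, delta_h=1e-12):
    """Shift the denominator d away from zero.

    If |d| falls below delta_h * scale, replace it by that threshold while
    keeping the sign of d (a generic safeguard; the paper's exact rule may
    differ).  `scale` is a natural magnitude for d, e.g. the product of the
    norms of the two vectors whose inner product d is.
    """
    floor = delta_h * scale
    if abs(d) >= floor:
        return d
    return math.copysign(floor, d if d != 0.0 else 1.0)
```

A well-scaled denominator passes through unchanged, while tiny or zero values are clamped to ±delta_h·scale, keeping the resulting γ finite and of bounded relative error.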

Some implementation details
In this section, we first discuss how to determine the ingredients of the subspace, e.g., the subspace dimension, the subspace basis index set, the subspace type, and p_init. Then we describe how to implement our regularized conjugate gradient direction.
Regardless of whether the activity changes or not, the subspace cannot be updated whenever very little progress is made, y ≈ 0, while the gradient is still large, i.e., whenever a new pair (s, y) violates the condition (49), which involves a tiny tuning parameter Δ_po ∈ (0, 1). Initially m = 0, and whenever a new pair (s, y) satisfies (49), m is increased by one by appending these vectors to S and Y, respectively. But once m reaches its limit, it is kept fixed and the oldest columns of S and Y are replaced by s and y, respectively.
When the activity does not change, the step is called a local step. For the construction of our regularized conjugate gradient direction, it is important which subspace basis index set is present and what the subspace dimension is. We denote by nlocal the number of local steps and by nwait the number of local steps before starting the regularized conjugate gradient direction; the latter is a tuning parameter. We use nlocal and nwait to determine the subspace dimension.
To encode which subspace should be used, typeSubspace updates three variables m 0 (subspace dimension), hist (subspace basis index set), and CG (subspace type).
Given the number nh of subspace updates, typeSubspace defines m := min(m, nh) and identifies CG, m_0, and hist accordingly. The value of γ depends on the search direction. searchDir is used to compute p_init, drawing on scaleDir, quasiNewtonDir, and AvoidZigzagDir. It works as follows:
• In the first iteration the starting search direction makes use of the gradient signs only, and has nonzero entries in some components that can vary. This starting direction is computed by scaleDir.
• If nlocal = nwait, a modified direction is used to avoid zigzagging: p_init is computed by (35) using (36). Since g_I^T p_init,I = −c, the direction is a descent direction. This direction is implemented by AvoidZigzagDir and enriched by a new heuristic choice, with tuning parameters 0 < θ < 1 and c > 0.
• Otherwise, quasiNewtonDir is used in the subspace.
Afterwards, enforceAngle is used if the angle condition (8) does not hold. In summary, the implementation of our regularized conjugate gradient direction, called ConjGradDir, for computing p in (46) works as follows:
(i) γ_reg is computed by getGam according to (44);
(ii) if the subspace dimension is nonzero, our conjugate gradient direction is used; otherwise, it reduces to p_new = −ζ_reg p_init and the subspace basis index set is restarted;
(iii) a regularization for the denominator of (47) is made according to (45) by regDenom;
(iv) condition (48) is evaluated to determine whether the regularized conjugate gradient direction is a descent direction;
(v) the new trial point x + p_new is projected into x, resulting in x_new, and the direction is recomputed as p_new := x_new − x.

Improvements in the line search
In this section, we introduce an improved version of CLS, called CLS-new, with enhancements for numerical stability (finding a good starting step size, a target step size, and a minimum step size with safeguards in finite precision arithmetic). The variable eff indicates the status of the step in CLS-new, taking the values 1 (efficient step), 2 (non-monotone step), 3 (inefficient decrease), and 4 (inefficient step).

Issues with finite precision arithmetic
Rounding errors prevent descent for step sizes that are too small.
Example 1 Consider the univariate function shown in Figure 1. For x = 5 + 3 × 10^{−10} and p = −1, the plot of f(x + αp) versus α in Figure 1 shows that one needs to find a sensible minimal step size.
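The effect can be reproduced with any smooth function; the quadratic f below is a stand-in (Example 1's function is not reproduced in this copy), chosen so that a step below the rounding resolution of x yields no decrease at all.

```python
# Rounding errors swamp the true decrease for tiny steps: near x = 5 the
# spacing of double-precision numbers is about 8.9e-16, so a step of size
# 1e-17 does not change x at all, and f(x + alpha*p) == f(x) exactly.
# (f is a stand-in here, not the function of Example 1.)
def f(t):
    return t * t

x, p = 5 + 3e-10, -1.0
flat = f(x + 1e-17 * p) - f(x)   # no decrease: step below the ulp of x
real = f(x + 1e-6 * p) - f(x)    # genuine decrease for a sensible step
```

The computed difference `flat` is exactly zero, while `real` is a true decrease of about 10^{-5}, which is why a minimal step size α_min is needed.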
In practice, if the step size is too small, rounding errors often prevent the function value from strictly decreasing. Due to cancellation of leading digits, the Goldstein quotient can become very inaccurate, which may lead to a wrong bracket and then to failure of the line search. The danger is particularly likely when the search direction is almost orthogonal to the gradient. Hence, before each line search, a method called goodStep is used to produce the good starting step size α_good, the target step size α_target, and the minimum step size α_min, with safeguards in finite precision arithmetic. goodStep computes the first and second breakpoints, i.e., the smallest step sizes at which some component of the bent search path reaches a bound, together with the corresponding index sets. Then it computes the breakpoint α_break as the minimum of the two in finite precision arithmetic and adjusts it by α_break := α_break(1 + Δ_b), where Δ_b ∈ (0, 1) is a tiny factor for adjusting a target step size. In the cases where the corresponding index sets are empty, the breakpoints are set to +∞. Given a tiny factor Δ_α ∈ (0, 1) and the index set indp := {i | p_i ≠ 0}, the minimal step size α_min is computed by a heuristic formula, and the target step size is chosen by α_target := max(α_min, df/|g^T p|). We discuss how df is computed in the next subsection. If an exact line search on a quadratic is requested, α_target is restricted by α_target := min(α_target, α_break).
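The breakpoint computation can be sketched as follows; this is a minimal version of what goodStep needs, with my own names and without the first/second-breakpoint index bookkeeping.

```python
import numpy as np

def breakpoints(x, p, lower, upper):
    """Step sizes at which each coordinate of x + alpha*p hits its bound.

    Coordinates that never hit a bound (p_i = 0, or an infinite bound in
    the relevant direction) get +inf.
    """
    alpha = np.full_like(x, np.inf)
    up = p > 0
    dn = p < 0
    alpha[up] = (upper[up] - x[up]) / p[up]
    alpha[dn] = (lower[dn] - x[dn]) / p[dn]
    return alpha

x = np.array([0.0, 0.5, 0.0])
p = np.array([1.0, -1.0, 0.0])
lower = np.array([-1.0, 0.0, -1.0])
upper = np.array([2.0, 1.0, 1.0])
alpha = breakpoints(x, p, lower, upper)
first_break = alpha.min()
```

For this data the per-coordinate breakpoints are (2, 0.5, ∞), so the bent path first changes its direction at α = 0.5, when the second coordinate reaches its lower bound.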
In the special case α_min = 1 (which signals a zero direction), goodStep sets α_good := 1 and ends; otherwise, it computes the good step size by α_good := α_target if qα_target ≤ α_break, and α_good := max(α_min, α_break) otherwise. When α_good equals α_min, adverse finite precision effects are avoided. Here q > 1 is an input parameter for goodStep which is used by CLS-new to expand (or reduce) step sizes.
The number of stuck iterations is the number nstuck of times that LMBOPT cannot update the best point. Its limits are nstuckmax (maximum number of all stuck iterations) and nsmin (how many stuck iterations are allowed before a trial point is accepted), both of which are tuning parameters. In the final step of goodStep, if nstuck ≥ nsmin, α_good is increased by the factor 2·nstuck to avoid remaining stuck.

CLS-new - an improved version of CLS
Before CLS-new tries to enforce the sufficient descent condition (13), the following steps must be taken:
• LMBOPT calls enforceAngle to enforce the angle condition (8).
• LMBOPT calls initInfo once to initialize the best function value f_best := f_0 and to compute the factor δ_f for adjusting increases in f by (52), where facf > 0 is a relative accuracy of f_0. We denote the list of acceptable increases in f by Df, its size by mf, and the number of gradient evaluations by ng. Moreover, initInfo chooses Df_i := −∞ for i = 1, ..., mf − 1 and Df_mf := δ_f. After the first call to CLS-new, LMBOPT always calls updateInfo to update (1) the number of times that the best point is not updated; (2) the best point information, by f_best := f_new and x_best := x_new if nstuck = 0; (3) δ_f and Df: if f_new < f, then δ_f := f − f_new and nm := mod(ng, mf) are computed; otherwise, since the function value has not decreased, δ_f is expanded with the help of Δ_f and nm := mod(ng, mf) is updated, where Δ_m ∈ (0, 1) is a tiny factor for adjusting δ_f and Δ_f > 1 is a tuning parameter for expanding δ_f; if nm is zero, the last component of Df is replaced by δ_f, otherwise the nmth component of Df is replaced by δ_f; (4) f, by f_new if f_new < f holds.
• If α_good ≥ 1, q is updated by q := max(q_min, q/Δ_q), where 1 < q_min < q and 0 < Δ_q < 1 are tuning parameters. Whenever the term qα_good is moderately large, this choice may help CLS-new to prevent a failure. Target step sizes should not become too small; hence an acceptable increase in f (denoted by df) must be estimated in a heuristic way such that it decreases only slowly. Accordingly, at first df is the value δ_f computed by (52); next, it is a multiple of the old δ_f value if the tuning parameter mdf divides ng; otherwise it is the maximum of the mf old δ_f values. In this way, target step sizes do not become too small.
• goodStep is used to find an initial step size. CLS-new then tries to find a step size α > 0 satisfying the sufficient descent condition (13).
• CLS-new ends once the sufficient descent condition holds, resulting in the line search being efficient and eff = 1.
• In the first iteration, if the Goldstein quotient satisfies μ(α) < 1, an exact line search uses the secant step α/(2(1 − μ(α))) for the quadratic objective function; in fact this ensures finite termination of our conjugate gradient method for quadratic functions. Otherwise, extrapolation is done by the factor q > 1. In the next iteration, if the sufficient descent condition (13) does not hold, the function is far from quadratic; in such a case, either interpolation is performed if the lower bound on the step size is zero, or extrapolation is done by the factor q > 1 until a bracket is found. Then a geometric mean of the bracket endpoints is used.
• A limit on the number of iterations is used.
• At the end, if CLS-new provides no improvement in the function value, LMBOPT calls robustStep to find a step size with the lowest corresponding function value. Such a step size is called robust. Using as gains the list of differences between the current best function value and the function values at the trial points, robustStep tries to find a point with the smallest robust change if the minimum of the gains is at most the acceptable increase in f (df). If instead the function is flat or almost flat, a step with the largest gain is chosen. Otherwise, a point with a nonrobust change may be chosen provided that the minimum of the gains is at most ∆ r df, where ∆ r > 0 is a tuning parameter.
After LMBOPT accepts a new point x new and its step size α from CLS-new, the new step is defined by s := x new − x = αp. If CLS-new is inefficient, α may be too small, so that s vanishes; a step s of zero size is called a null step. If there are too many null steps, LMBOPT cannot update the subspace information for many iterations, resulting in a failure. To remedy this weakness, nullStep is used, depending on whether CLS-new is inefficient or not. If CLS-new is inefficient (eff = 4), the new point x new is a multiple of the current best point; otherwise, it is a multiple of the current point generated by CLS-new. Given a tiny tuning parameter del, x new is adjusted by the factor 1 − del and all its zero components (if any) are replaced by del in both cases. Then it is projected into the feasible set x.
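The first-iteration secant step can be checked on a toy quadratic: with the Goldstein quotient µ(α) = (f(x + αp) − f(x))/(αg^T p), the step ½α/(1 − µ(α)) equals the exact minimizer of a quadratic along p. The following Python sketch (the test quadratic and all variable names are illustrative, not part of LMBOPT) verifies this:

```python
import numpy as np

def goldstein_quotient(f, x, p, alpha, gTp):
    # mu(alpha) = (f(x + alpha p) - f(x)) / (alpha g^T p)
    return (f(x + alpha * p) - f(x)) / (alpha * gTp)

# illustrative quadratic f(x) = 0.5 x^T H x - b^T x
H = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
f = lambda x: 0.5 * x @ H @ x - b @ x

x = np.zeros(2)
g = H @ x - b                # gradient at x
p = -g                       # steepest descent direction
gTp = g @ p

alpha_trial = 1.0
mu = goldstein_quotient(f, x, p, alpha_trial, gTp)
alpha_secant = 0.5 * alpha_trial / (1.0 - mu)   # the secant step from the text
alpha_exact = -(g @ p) / (p @ H @ p)            # exact minimizer along p
```

Here µ(1) < 1, so the secant branch applies, and the secant step coincides with the exact minimizer −g^T p / (p^T Hp) of the quadratic along p.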
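The nullStep adjustment described above admits a direct sketch; the scaling by 1 − del, the replacement of zero components by del, and the projection into the box are as in the text, while the function name and the choice of clipping for the projection are ours:

```python
import numpy as np

def null_step(x_base, lo, hi, del_=1e-10):
    # scale the base point by 1 - del and replace its exact zeros by del,
    # then project the result into the feasible box [lo, hi]
    x_new = (1.0 - del_) * np.asarray(x_base, dtype=float)
    x_new[x_new == 0.0] = del_
    return np.clip(x_new, lo, hi)
```

For example, null_step([0.0, 2.0], -1.0, 1.0) perturbs the zero component to del and clips the second component to the upper bound.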

projStartPoint -the starting point
A poor choice of the starting point can lead to inefficiencies. For example, consider minimizing a quadratic function starting from x 0 = 0. If a diagonal preconditioner is used, it is easy to see by induction that, for any method that chooses its search directions as linear combinations of the previously computed preconditioned gradients, the ith iteration point has zero in all coordinates k > i and its gradient has zero in all coordinates k > i + 1. Since the solution is the all-one vector, this implies that at least n iterations are needed to reduce the maximal error in the components of x below one.
Situations like this are likely to occur when both the Hessian and the starting point are sparse.To avoid this, projStartPoint moves a user-given starting point x slightly into the relative interior of the feasible domain.
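A minimal sketch of this idea, assuming a relative margin rho for the push into the interior (the actual rule used by projStartPoint is not reproduced here, and the function name is ours):

```python
import numpy as np

def proj_start_point(x0, lo, hi, rho=1e-3):
    # project the user-given point into the box, then move components that
    # sit exactly on a finite bound a small relative distance inside
    x = np.clip(np.asarray(x0, dtype=float), lo, hi)
    width = np.where(np.isfinite(hi - lo), hi - lo, 1.0)
    at_lo = np.isfinite(lo) & (x <= lo)
    at_hi = np.isfinite(hi) & (x >= hi)
    x = np.where(at_lo, lo + rho * width, x)
    x = np.where(at_hi, hi - rho * width, x)
    return x
```

For the sparse starting point x0 = 0 in the box [0, 1]^n, this replaces the zero components by a small positive value, so that the zero-propagation argument above no longer applies.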

getSuccess -successful iteration
The goal of getSuccess is to test whether the sufficient descent condition (13) holds; the only difference is that the tuning parameter β is replaced by another tuning parameter β CG. The Goldstein quotient (12) is computed provided that the following holds: • the regularized conjugate gradient direction is a descent direction, i.e., (48) is negative but not zero.
After computing the Goldstein quotient (12), the iteration is successful if either the line search is efficient, meaning that the sufficient descent condition (13) with β = β CG holds, or the function value improves by at least δ f and nstuck ≥ nsmin. In this case, the Boolean variable success is set to true; otherwise, it is set to false.
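The success test can be sketched as follows. Since condition (13) is not restated here, the test mu >= beta_cg is an assumed stand-in for the sufficient descent condition, and all names are illustrative:

```python
def get_success(mu, beta_cg, f, f_new, delta_f, nstuck, nsmin):
    # efficient line search: sufficient descent with beta replaced by beta_CG;
    # mu >= beta_cg is an assumed stand-in for condition (13)
    efficient = mu >= beta_cg
    # alternative success test: enough decrease while the stuck counter is large
    improved = (f - f_new >= delta_f) and (nstuck >= nsmin)
    return efficient or improved
```

Either branch alone suffices for success; the second branch lets an iteration count as successful when the decrease δ f is achieved after the method has been stuck for nsmin iterations.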

Some implementation details
To determine the working set I, it is checked whether one of the following holds:
(1) The function value cannot be decreased.
(2) The size of the new free index set is smaller than that of the old free index set (i.e., the activity is not fixed).
(3) The maximal number of local steps before finding a freeing iteration (a tuning parameter) is exceeded.
(4) Condition (6) is violated.
We use the algorithms findFreePos and findFreeNeg to determine the working set. At the first iteration, BOPT calls findFreePos to find I + (x) by (4) and initializes the working set with I(x) := I + (x). Then, if one of the statements (1)-(3) is true, findFreeNeg finds I − (x) by (3) and findFreePos checks whether statement (4) is true. If it is not, the working set is I(x) := I − (x); otherwise, findFreePos finds I + (x) by (4) and chooses it as the new working set I(x) := I + (x).
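The control flow just described can be sketched as follows, with find_free_pos and find_free_neg standing in for findFreePos and findFreeNeg, and conds holding the truth values of statements (1)-(4); the function name and data layout are ours:

```python
def choose_working_set(first_iter, conds, find_free_pos, find_free_neg, I_old):
    # conds = (c1, c2, c3, c4): truth values of statements (1)-(4)
    if first_iter:
        return find_free_pos()        # I(x) := I^+(x) at the first iteration
    if any(conds[:3]):
        I_minus = find_free_neg()     # candidate I^-(x) by (3)
        if not conds[3]:              # statement (4) false -> keep I^-(x)
            return I_minus
        return find_free_pos()        # otherwise I(x) := I^+(x) by (4)
    return I_old                      # working set unchanged
```

For instance, when statement (1) holds but statement (4) does not, the working set becomes I^-(x).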
If at least one of the statements (1)-(4) holds, a scaled Cauchy point is tried. It is computed in the same way as in [46], with the difference that, provided S and Y have been updated at least once, the scaling matrix is computed by a componentwise product formula, where • denotes componentwise multiplication. Otherwise, it is computed from p init, which is obtained as discussed above.

The master algorithm
We now recall the main ingredients of LMBOPT, the new limited memory bound constrained optimization method. The mathematical structure of LMBOPT is described in Section 1 of suppMat.pdf. LMBOPT first calls projStartPoint, described in Subsection 4.1, to improve the starting point. Then the function value and the gradient vector at this point are computed and the gradient is adjusted by adjustGrad; the same computation happens later for other points. In practice, if the gradient is contaminated by NaN or ±∞, adjustGrad replaces these values by a tuning parameter. In the main loop,
• LMBOPT first computes the reduced gradient by redGrad in each iteration; then the working set is determined and updated by findFreePos.
• As long as the reduced gradient is not below a minimum threshold, it generates the direction p init by searchDir to construct the subspace, and then constructs the regularized conjugate gradient direction p by ConjGradDir to achieve a successful iteration, provided the activity has changed; otherwise the scaled Cauchy point is computed by scaleCauchy if at least one of the statements (1)-(4) discussed in Subsection 2.4 holds. Whether an iteration is successful is determined by getSuccess, and then the best point is updated.
• Otherwise it performs the gradient-free line search CLS-new along a regularized direction (enforceAngle), since the function is not near the quadratic case.
• Then, if at least nnullmax null steps occur in a row, the point leading to such steps is replaced by nullStep with a point near the previous best point if CLS-new is not efficient, or with the current point generated by CLS-new otherwise. This is repeated until no null step occurs.
• Afterwards, the gradient at the new point is computed and adjusted by adjustGrad. In addition, the new free index set is found by findFreeNeg. At the end of every iteration, the subspace is updated provided that (i) there is no further null step, and (ii) either condition (49) holds or the number of local steps exceeds its limit.
LMBOPT solves the bound constrained optimization problem (1). It takes the initial point x 0, the feasible set x, and the tuning parameters (detailed in Table 4 of suppMat.pdf) as input and returns an optimal point x best and its function value f best as output. For the convergence analysis of Algorithm 3 we refer to Theorem 1.

Numerical results
In this section we compare our new solver LMBOPT with many other state-of-the-art solvers from the literature (see Subsection 5.3) on a large public benchmark. Only summary results are given; supplementary information with much more detailed test results can be found in suppMat.pdf from the LMBOPT web site.

Test problems used
As test problems we used all 1088 unconstrained and bound constrained problems with up to 100001 variables from the CUTEst collection of optimization problems by Gould et al. [35]; in the case of variable dimension problems, all allowed dimensions in this range were used; see Section 5 of suppMat.pdf. nf, ng, and sec denote the number of function evaluations, the number of gradient evaluations, and the time in seconds, respectively. Since the cost of computing the gradient is typically about twice the cost of a function value (see Section 3 of suppMat.pdf), we also use the cost measure nf2g := nf + 2ng. These measures are used for the performance profiles [24] shown in Figures 2-5.
We limited the budget available for each solver by requiring nf2g ≤ 20n + 10000 in the first and second runs and nf2g ≤ 50n + 200000 in the third run for a problem with n variables, and by allowing at most 300 sec of run time in the first run, 1800 sec in the second run, and 7200 sec in the third run. A problem is considered solved if ‖g k‖ ≤ 10 −6.
To identify the best solver under appropriate conditions on test problems and budgets, we made three different runs:
• In the first and second runs, the initial point is x 0 := 0, but we shift the arguments by ξ to avoid a solver guessing the solution of toy problems with a simple solution (such as all zeros or all ones); there are quite a few of these in the CUTEst library. This means that the initial point is chosen as x 0 := ξ and the initial function value is f 0 := f (x 0), while the other function values are computed by f ℓ := f (x ℓ + ξ) for all ℓ ≥ 0. Compared to the standard starting point, this shift usually preserves the difficulty of the problem. In the second run, the three best solvers from the first run try to solve all test problems with an increased time limit of 1800 sec.
• In the third run, the initial point x 0 is the standard starting point. The three best solvers from the first run try to solve the 98 test problems unsolved in the first run, without the shift (53). The maximal time was increased from 300 sec to 7200 sec and the maximal nf2g from 20n + 10000 to 50n + 200000. In this setting, the three best solvers from the first run succeeded in solving many of these previously unsolved problems. The test problems still unsolved in the third run could not be solved by any solver, even with a huge budget.

Default parameters for LMBOPT
For our tests we used for LMBOPT the following tuning parameters: q min = 2.5. They are based on limited tuning by hand. In a future release we plan to find optimal tuning parameters [47], as the quality of LMBOPT depends on them.
Details about the solvers and options used can be found in Section 1 of suppMat.pdf.
For some solvers, we have chosen options other than the default ones to make them more competitive.
We only compare public software with an available Matlab interface. LANCELOT-B combines a trust region approach with projected gradient directions; but since no mex-file to run LANCELOT-B in Matlab was available, we could not call it in our Matlab environment. Similarly, we could not find a version of GENCAN, the bound constrained solver of ALGENCAN [7], that could be used from Matlab. GENCAN is a combination of a spectral projected gradient method and an active set strategy. Omitting these two solvers is unlikely to introduce significant bias into the comparison. Hence, we compare LMBOPT to many known solvers using various active set strategies combined with projected conjugate gradient, projected truncated Newton, or projected quasi-Newton methods.
Unconstrained solvers were turned into bound constrained solvers by pretending that the reduced gradient at the projected point π[x] is the requested gradient at x. Therefore no theoretical analysis is available for them, but the results show that this is a simple and surprisingly effective strategy.
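This wrapping trick can be sketched as follows; clipping implements the projection π[x], and the reduced gradient zeroes the components that point out of the box at active bounds. The function names are ours, and the componentwise form of the reduced gradient is the standard one, assumed here rather than quoted from the paper:

```python
import numpy as np

def boxify(f, grad, lo, hi):
    # wrap an unconstrained solver interface: evaluate at the projection
    # pi[x] and return the reduced gradient there as if it were grad(x)
    def f_wrapped(x):
        return f(np.clip(x, lo, hi))
    def g_wrapped(x):
        xp = np.clip(x, lo, hi)
        g = grad(xp)
        g = np.where((xp <= lo) & (g > 0), 0.0, g)   # active lower bounds
        g = np.where((xp >= hi) & (g < 0), 0.0, g)   # active upper bounds
        return g
    return f_wrapped, g_wrapped
```

An unconstrained solver fed f_wrapped and g_wrapped sees a stationary point exactly where the reduced gradient vanishes, which is the first-order optimality condition for the box constrained problem.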

The results for stringent resources

Unconstrained and bound constrained optimization problems
We tested all 15 solvers on problems of dimensions 1 to 100001. A list of problems solved by no solver can be found in Section 4 of suppMat.pdf.
For more refined statistics, we use our test environment (Kimiaei & Neumaier [47]) for comparing optimization routines on the CUTEst test problem collection.
For a given collection S of solvers and a collection P of problems, the efficiency of the solver so ∈ S on the problem j ∈ P with respect to the cost measure c is the strength of so relative to an ideal solver that matches on each problem the best solver. It is measured by e j so := min s∈S c s /c so if the solver so solved the problem j ∈ P, and e j so := 0 otherwise.
The total mean efficiency of the solver so with respect to c s is defined by e so = mean j∈P (e j so ).
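The efficiency measure can be computed directly from a table of per-problem costs; a failure is recorded as None and contributes zero efficiency. The data layout and function name are illustrative:

```python
def efficiencies(costs):
    # costs[solver] is a list of cost-measure values (e.g. nf2g) per problem,
    # with None marking a failure; returns the total mean efficiency e_so
    n_prob = len(next(iter(costs.values())))
    e = {}
    for so, c in costs.items():
        vals = []
        for j in range(n_prob):
            solved = [costs[s][j] for s in costs if costs[s][j] is not None]
            if c[j] is None or not solved:
                vals.append(0.0)                 # failure counts as zero
            else:
                vals.append(min(solved) / c[j])  # e^j_so = min_s c_s / c_so
        e[so] = sum(vals) / n_prob               # e_so = mean_j e^j_so
    return e
```

For example, with two problems where solver A needs costs 10 and 20 and solver B needs cost 20 and fails, A has mean efficiency 1.0 and B has mean efficiency 0.25.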
T mean is the mean time in seconds needed by a solver to solve the test problems chosen from the list P, ignoring the times for unsolved problems. #100 is the total number of problems for which the solver so was among the best with respect to nf2g (e j so = 1 = 100%). !100 is the total number of solved problems for which the solver so was strictly better than all other solvers with respect to nf2g.
In the tables, efficiencies are given in percent. Larger efficiencies in the table imply a better average behaviour; a zero efficiency indicates failure. All values are rounded (towards zero) to whole integers. Mean efficiencies are taken over the 990 problems tried by all solvers and solved by at least one of them, out of a total of 1088 problems. The columns titled "# of anomalies" report statistics on failure reasons:
• n indicates that nf2g ≥ 20n + 10000 was reached.
• t indicates that sec ≥ 300 was reached.
• f indicates that the algorithm failed for other reasons.
As can be seen from Table 1 and Figure 2, LMBOPT stands out as the most robust solver for unconstrained and bound constrained optimization problems; it is the best in terms of the number of solved problems and the ng efficiency. The other best solvers in terms of the number of solved problems and the nf2g efficiency are ASACG and LMBFG-EIG-MS, respectively. LBFGSB is the best in terms of the counts #100 and !100, but it is not comparable to the other algorithms in terms of the number of solved problems.

Classified by constraints and dimensions
Results for the three best solvers for all problems, classified by dimension and constraint type, are given in Table 2, Figure 3, and the box plots in Figure 4. These results show that
• for low-dimensional problems (1 ≤ n ≤ 30), (1) LMBOPT is the best solver in terms of the ng and nf2g efficiencies and the number of solved problems, (2) LMBFG-EIG-MS is the best solver in terms of the nf efficiency (for both unconstrained and bound constrained problems), and (3) ASACG is the second best solver in terms of the number of solved problems (for both unconstrained and bound constrained problems);
• for medium-dimensional problems (31 ≤ n ≤ 500), (1) LMBOPT is the best in terms of the ng efficiency and the number of solved problems for both unconstrained and bound constrained problems, and the best in terms of the nf2g efficiency for the unconstrained problems, (2) LMBFG-EIG-MS is the best in terms of nf for both unconstrained and bound constrained problems and in terms of nf2g for the bound constrained problems only, and (3) ASACG is the best solver in terms of the nf2g efficiency for the bound constrained problems;
• for high-dimensional problems (501 ≤ n ≤ 100001), (1) LMBOPT is the best solver in terms of the ng efficiency for both unconstrained and bound constrained problems, (2) LMBFG-EIG-MS is the best solver in terms of the nf and nf2g efficiencies and the number of solved problems (for all problems) and in terms of the number of solved problems (for bound constrained problems), and (3) ASACG is the best solver in terms of the number of solved problems for the unconstrained problems only;
• for all problems (1 ≤ n ≤ 100001), (1) LMBOPT is the best in terms of the number of solved problems and the ng efficiency for both unconstrained and bound constrained problems, and (2) LMBFG-EIG-MS is the best solver in terms of the nf and nf2g efficiencies for both unconstrained and bound constrained problems.

Results for hard problems
All solvers were run again on the hard problems, defined as the 98 test problems unsolved in the first run. In this case, the standard starting point was used instead of (53), and both nfmax and secmax were increased. The 41 test problems, of dimensions 1 up to 100001, that no solver could solve are given in Table 3.
From Table 4 and Figure 5, we conclude that
• LMBOPT is the best in terms of the number of solved problems and the ng and nf2g efficiencies for the hard bound constrained problems;
• ASACG is the best in terms of the number of solved problems and the ng and nf2g efficiencies for the hard unconstrained problems;
• LMBFG-EIG-MS is the best in terms of the ng and nf2g efficiencies for the hard unconstrained problems.

… are restricted to cases where the choice I = I − (x) violates the inequality ‖g I‖ 2 * ≥ ρ ‖g red‖ 2 * for some ρ ∈ ]0, 1]. (6) Here ‖•‖ * is the dual norm of a monotone norm ‖•‖, defined by ‖g‖ * := sup p≠0 g T p/‖p‖, and g I stands for the restriction of g to the index set I. More generally, we denote by x I the subvector of a vector x with indices taken from the set I, by A :k the kth column of a matrix A, and by A II the submatrix of A with row and column indices taken from I.

• … p init I is chosen to be its opposite to move away from a maximizer or saddle point. By changing the sign of g, it may enforce g T I p init I ≤ 0. Even when g ≠ 0, cancellation may lead to a tiny g T I p init I (possibly even with the wrong sign). To overcome this weakness, given a tiny parameter ∆ pg, the quantity ∆ pg |g I | T |p init I | can serve as a bound on the rounding error so as to obtain the theoretically correct sign. The directional derivative is regularized if the condition |g T I p init I | ≤ ∆ pg |g I | T |p init I | (51) holds, enforcing g T I p init I < 0. In this case, p init I is chosen as either −g I or −λ b g I, where λ b := max i∈I {D ii }.
• If at least one of the conditions w > 0 and 0 ≤ |t| < ∞ fails, p init I is chosen to be −λ b g I.
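A minimal sketch of this safeguard in Python; for simplicity the fallback always picks −g I rather than choosing between −g I and −λ b g I, the sign-flip branch is simplified, and the value of ∆ pg as well as all names are illustrative:

```python
import numpy as np

def safeguard_direction(g_I, p_init, Delta_pg=1e-12):
    # if |g^T p| is below the rounding-error bound Delta_pg * |g|^T |p|,
    # the computed sign of the directional derivative cannot be trusted:
    # fall back to the steepest descent direction -g_I (condition (51))
    gTp = float(g_I @ p_init)
    bound = Delta_pg * float(np.abs(g_I) @ np.abs(p_init))
    if abs(gTp) <= bound:
        return -g_I
    if gTp > 0:
        return -p_init            # flip the sign to enforce g^T p < 0
    return p_init
```

For g_I = (1, -1) and p_init = (1, 1), the exact product g^T p is zero, so the bound triggers and the steepest descent fallback is returned.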

Fig. 6 :
(a) Flow chart for unconstrained problems classified by the problem dimension, (b) flow chart for bound constrained problems classified by the problem dimension, (c) flow chart for hard problems classified by constraint type. Here L-E-M stands for LMBFG-EIG-MS.

Table 1 :
The summary results for all problems. ρ designates the percentage of problems solved within a factor τ of the best solver. Problems solved by no solver are ignored.

Table 2 :
The summary results classified by dimension and constraint for all problems

Table 3 :
The hard problems solved by no solver

Table 4 :
The summary results for hard problems