Convergence of Successive Linear Programming Algorithms for Noisy Functions

Gradient-based methods have been highly successful for solving a variety of both unconstrained and constrained nonlinear optimization problems. In real-world applications, such as optimal control or machine learning, the necessary function and derivative information may, however, be corrupted by noise. Sun and Nocedal (2022) have recently proposed a remedy for smooth unconstrained problems by means of a stabilization of the acceptance criterion for computed iterates, which leads to convergence of the iterates of a trust-region method to a region of criticality. We extend their analysis to the successive linear programming algorithm of Byrd et al. (2023a, 2023b) for unconstrained optimization problems with objectives that can be characterized as the composition of a polyhedral function with a smooth function, where the latter and its gradient may be corrupted by noise. This gives the flexibility to cover, for example, (sub)problems arising in image reconstruction or in constrained optimization algorithms. We provide computational examples that illustrate the findings and point to possible strategies for the practical determination of the stabilization parameter, which balances the size of the critical region against a relaxation of the acceptance criterion (or descent property) of the algorithm.


Introduction
Handling non-smoothness is a ubiquitous research question in nonlinear optimization because it arises naturally in different areas, for example, penalty functions for constrained optimization [23], statistical data analysis and signal processing [7,20], and neural network architectures [10]. In this work we study the convergence properties of successive linear programming algorithms to solve the optimization problem

min_{x∈R^n} φ(x) := ω(F(x)), (P)

where ω : R^p → R is convex and Lipschitz continuous with polyhedral epigraph, and F : R^n → R^p is twice continuously differentiable. Moreover, we assume that F and its gradient can only be accessed inexactly so that their evaluations are corrupted by noise. This and similar problems have been studied in the literature, see, for example, [1,2,3,8,15] and the references therein. Many optimization problems can be formulated in terms of problem (P), such as the Lagrangian form of the famous LASSO problem [17,20],

min_{x∈R^n} (1/2)‖Ax − y‖² + β‖x‖₁,

with A ∈ R^{m×n}, β > 0, and y ∈ R^m, which is particularly popular among data scientists for sparse parameter identification in over-parameterized models. Moreover, (P) covers the subproblems of the exact penalty method for general nonlinear constrained optimization problems, which read

min_{x∈R^n} f(x) s.t. g(x) ≤ 0, h(x) = 0 (NLP)

with an objective f : R^n → R and constraints g : R^n → R^m, h : R^n → R^k.
The problem (NLP) may be solved by minimizing a non-smooth exact penalty function of the form φ(x, ν) := f(x) + ν‖(g(x)⁺ᵀ, h(x)ᵀ)ᵀ‖₁, where y⁺ := max(y, 0) componentwise. In fact, strict local solutions of (NLP) are local minimizers of φ(x, ν) for a sufficiently large value of ν if g and h are smooth and satisfy the Mangasarian-Fromovitz constraint qualification [14, Theorem 4.4], [23, Theorem 17.3] at the respective points. In this case, the penalty function φ can be expressed as ω(F(x)), where F(x) := (f(x), g(x)ᵀ, h(x)ᵀ)ᵀ is smooth and ω(x, y, z) := x + ν‖(y⁺ᵀ, zᵀ)ᵀ‖₁ is convex and polyhedral. Besides problems of type (NLP), a variety of other problems, such as linear or nonlinear fitting problems or unconstrained smooth optimization problems, can be formulated in terms of (P) as well.
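To make the composite structure concrete, the following Python sketch assembles φ = ω ∘ F for a small hypothetical instance of (NLP) with a single inequality constraint; the functions f and g and the penalty weight ν = 10 are illustrative choices, not taken from the text.

```python
import numpy as np

# Hypothetical instance of (NLP): minimize f(x) = x1^2 + x2^2
# subject to g(x) = 1 - x1 <= 0 (one inequality, no equalities).
def f(x):
    return x[0] ** 2 + x[1] ** 2

def g(x):
    return np.array([1.0 - x[0]])

def F(x):
    # Stack objective and constraint values: F(x) = (f(x), g(x)^T)^T.
    return np.concatenate(([f(x)], g(x)))

def omega(z, nu=10.0):
    # omega(x, y) = x + nu * ||y^+||_1 with y^+ = max(y, 0): convex, polyhedral.
    return z[0] + nu * np.sum(np.maximum(z[1:], 0.0))

def phi(x, nu=10.0):
    # Exact l1 penalty function phi(x, nu) = omega(F(x)).
    return omega(F(x), nu)

x_feasible = np.array([1.0, 0.0])    # satisfies g(x) <= 0, so phi = f = 1
x_infeasible = np.array([0.0, 0.0])  # violates g by 1, so phi = 0 + 10 * 1
```

For the feasible point the penalty term vanishes and φ coincides with f, while the infeasible point is charged ν times its constraint violation.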
Noisy functions The combination of unconstrained optimization with noisy observations has recently been examined [18,19,24]. The authors consider the minimization of a smooth function φ : R^n → R while only having access to f(x) = φ(x) + ε(x) and g(x) = ∇φ(x) + e(x), where the only assumptions on ε and e are that both |ε| and ‖e‖ are uniformly bounded. Consequently, it is not generally possible to generate a sequence {x_k} of iterates converging to a local optimum or stationary point of φ. Intuitively, while the gradient noise e is small compared to ∇φ, the direction g is a suitable search direction with respect to φ. This allows for the use of an Armijo-like globalization strategy or, in the case of [19], a trust-region method, where the noise is handled by stabilizing the reduction ratio, which is of course closely related to the Armijo condition. As soon as a region is reached where the noise produced by ε and e becomes too large relative to φ and ∇φ, respectively, no further progress can be expected and the algorithm may stall. However, this critical region is visited infinitely often and, once it has been reached, the algorithm does not produce objective values much larger than those attained in the critical region. The authors also study the problem of adapting quasi-Newton methods to the noisy setting.
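The bounded-noise oracle model described above can be sketched as follows; the quadratic φ and the numerical noise bounds are hypothetical placeholders, chosen only to illustrate that |ε| and ‖e‖ stay uniformly bounded at every evaluation point.

```python
import numpy as np

rng = np.random.default_rng(0)
EPS_F, EPS_G = 1e-3, 1e-2  # uniform bounds on |eps(x)| and ||e(x)||

def phi(x):
    # true smooth objective (illustrative choice)
    return 0.5 * float(x @ x)

def grad_phi(x):
    return x

def f(x):
    # noisy value oracle: phi(x) + eps(x) with |eps(x)| <= EPS_F
    return phi(x) + rng.uniform(-EPS_F, EPS_F)

def g(x):
    # noisy gradient oracle: grad_phi(x) + e(x) with ||e(x)|| <= EPS_G
    e = rng.standard_normal(x.size)
    e *= rng.uniform(0.0, EPS_G) / np.linalg.norm(e)
    return grad_phi(x) + e

x = np.ones(4)
```

Far from the optimum, ‖∇φ(x)‖ dominates EPS_G and g(x) remains a descent direction; close to it, the noise can flip the sign of the search direction, which is exactly the stalling behavior discussed above.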

Structure of the Remainder
We introduce the successive linear programming algorithm and the modified acceptance test in Section 2. The asymptotics of the algorithm are analyzed in Section 3. We provide computational examples and the corresponding results in Section 4. We draw a conclusion in Section 5.

A Noise-tolerant Successive Linear Programming Algorithm
In the noisy setting, we cannot expect to find the true optimum or stationary points of φ, since we do not have access to F and ∇F. Specifically, in a small region around the true optimum x*, the evaluations of F and ∇F may oscillate by amounts of ±ε_F and ±ε_∇F, respectively, thereby making them unreliable. This impairs globalization strategies in nonlinear programming because their acceptance tests require reliable evaluations of F and of a model function involving ∇F.
In the non-noisy regime, a trust-region method produces a sequence {x_k} of iterates by assembling and subsequently optimizing model functions q_k : R^n → R, yielding a step d. The quality of d is determined according to the reduction ratio

ρ_k := (φ(x_k) − φ(x_k + d_k)) / (q_k(0) − q_k(d_k)),

which is used to determine whether or not the step will be accepted. However, in the noisy setting, we only have access to a noisy evaluation F̃, leading to the composite function φ̃ defined as φ̃(x) := ω(F̃(x)). While we can build a model q̃_k that coincides with φ̃ at x_k, we cannot control the numerator φ̃(x_k) − φ̃(x_k + d_k) of the resulting ratio. Indeed, if we reduce the trust region, sending d_k to zero, the denominator of ρ_k will tend to zero while the numerator will oscillate by up to ±2ε_F, making the ratio unreliable. To alleviate this problem, we turn towards a recent adaptation [19] of trust-region methods to solving the noisy counterpart of smooth, unconstrained problems like (P). The authors of [19] add a correction term, that is, a positive constant ϑ > 0, to both the numerator and the denominator of the reduction ratio, yielding the modified ratio

ρ̂_k := (φ̃(x_k) − φ̃(x_k + d_k) + ϑ) / (q̃_k(0) − q̃_k(d_k) + ϑ).

The parameter ϑ can then be chosen according to the noise levels ε_F and ε_∇F in order to stabilize the ratio. As we will see, this means that for ϑ large enough, the iterates of the successive linear programming algorithm converge to a critical region around a stationary point. The downside is that this region grows with ϑ and the algorithm also accepts steps that do not improve the objective.
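A minimal sketch of the stabilized acceptance test, assuming the correction term is added to both the actual and the predicted reduction as described above; the threshold ρ_u = 0.25 and the sample numbers are illustrative, not the paper's settings.

```python
def accept_step(actual_red, pred_red, theta, rho_u=0.25):
    """Stabilized acceptance test: rho_hat = (ared + theta) / (pred + theta).

    A constant theta > 0 damps oscillations of up to +/- 2*eps_F in the
    noisy actual reduction when the predicted reduction is small."""
    rho_hat = (actual_red + theta) / (pred_red + theta)
    return rho_hat >= rho_u, rho_hat

# Tiny predicted reduction, value noise of order 1e-2: the classical
# ratio (theta = 0) swings wildly, the stabilized one stays near 1.
ok0, rho0 = accept_step(-1e-2, 1e-4, theta=0.0)   # classical: rho = -100
ok1, rho1 = accept_step(-1e-2, 1e-4, theta=0.1)   # stabilized: rho ~ 0.9
```

The example also shows the downside mentioned above: with ϑ = 0.1 the step is accepted even though the (noisy) objective actually increased slightly.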
Apart from this adjustment, we follow the algorithmic approach in [6]. Specifically, we use the following partially linearized and quadratic models at x_k:

ℓ̃_k(d) := ω(F̃(x_k) + ∇F̃(x_k)ᵀ d), q̃_k(d) := ℓ̃_k(d) + (1/2) dᵀ B_k d.

Algorithm Based on the models above, our noise-tolerant approach to solving (P) is laid out in Algorithm 1. In each iteration, an initial step d^LP is computed in Line 2 by minimizing ℓ̃_k over the LP trust region {d | ‖d‖_LP ≤ Δ^LP}, where ‖·‖_LP is some norm on R^n. We highlight that, from an algorithmic point of view, it is advantageous to cast this problem as a linear program, which can be solved using state-of-the-art LP solvers [12,13]. To this end, both the epigraph of ω and the feasible region given in terms of ‖·‖_LP should be polyhedral, as is the case, for example, for the ℓ1- and ℓ∞-norms.
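As an illustration of casting the LP subproblem as a linear program, the following sketch minimizes an ℓ1-type linearized model over an ℓ∞ trust region using scipy's HiGHS-backed linprog; the data c and J are hypothetical stand-ins for the function value and Jacobian at the current iterate.

```python
import numpy as np
from scipy.optimize import linprog

def lp_step(c, J, delta):
    """Solve min_d ||c + J d||_1  s.t. ||d||_inf <= delta as an LP.

    Variables z = (d, t) with t >= |c + J d| componentwise; the objective
    sums the entries of t. This illustrates casting a polyhedral linearized
    model over an l_inf trust region as a linear program."""
    p, n = J.shape
    cost = np.concatenate([np.zeros(n), np.ones(p)])
    # Encode -t <= c + J d <= t as two blocks of inequalities.
    A_ub = np.block([[J, -np.eye(p)], [-J, -np.eye(p)]])
    b_ub = np.concatenate([-c, c])
    bounds = [(-delta, delta)] * n + [(None, None)] * p
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:n], res.fun

# Hypothetical data: c plays the role of F~(x_k), J that of its Jacobian.
d, model_val = lp_step(np.array([1.0, -2.0]), np.eye(2), delta=0.5)
```

With these data the optimal step moves each coordinate as far as the trust region allows towards cancelling the corresponding residual, giving d = (−0.5, 0.5) and a model value of 2.0.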
Due to the equivalence of norms in R^n, there exists a constant γ > 0 such that ‖d‖ ≤ γ‖d‖_LP holds for each d ∈ R^n. The algorithm proceeds to compute a Cauchy step d^C_k in Lines 3-6. To this end, it employs a line search initialized with a step size sufficiently small to ensure that the Cauchy step falls into the trust region bounded by Δ_k. During the line search, the step size is shortened by a factor of 0 < τ < 1 until the quadratic reduction achieved by the Cauchy point is within a factor of 0 < η < 1 of its linear reduction. The actual step d_k to be taken in Line 7 can be different from the Cauchy step d^C_k, provided that it improves upon the quadratic reduction of d^C_k. This gives some algorithmic flexibility, allowing for the computation of Newton-type steps in order to achieve local quadratic convergence. Based on the stabilized reduction ratio ρ̂_k computed in Line 8, the step is either accepted (Lines 10-11) or rejected (Lines 13-14) according to an acceptance threshold of ρ_u > 0. Additionally, the trust-region radii Δ and Δ^LP are adjusted based on ρ̂_k:

1. The value of Δ is increased or decreased based on whether ρ̂_k achieves a value of at least ρ_s. The decrease is such that the new trust-region radius is at most κ_u < 1 times as large as the previous one, thereby ensuring a true reduction, while being at least κ_l ‖d_k‖ (with 0 < κ_l ≤ κ_u) in order to prevent an immediate collapse of the trust region.
2. If ρ̂_k achieves at least ρ_u, the LP trust-region radius Δ^LP is increased beyond ‖d^C_k‖_LP, as long as it does not exceed the upper bound Δ^LP_max ≥ 1. The new LP trust-region radius is only increased beyond Δ^LP if the full LP step d^LP was accepted (i.e., α_k = 1), indicating that the partially linearized model ℓ̃_k is a good approximation of φ̃ across the entire LP trust region. If ρ̂_k falls short of ρ_u, Δ^LP is decreased while being kept within a factor of θ > 0 of ‖d_k‖_LP.
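The backtracking computation of the Cauchy step described above can be sketched as follows; the concrete model functions, the Euclidean trust-region check, and the parameter values τ = 0.5, η = 0.1 are illustrative assumptions, not the paper's implementation.

```python
def cauchy_step(lin_model, quad_model, d_lp, delta, tau=0.5, eta=0.1):
    """Backtracking line search along the LP step d_lp (a sketch).

    Shrinks alpha by the factor tau until the quadratic reduction at
    alpha * d_lp is at least eta times the linear reduction there,
    starting from the largest alpha keeping alpha * d_lp inside the
    trust region of radius delta (Euclidean norm, for simplicity)."""
    norm = sum(v * v for v in d_lp) ** 0.5
    alpha = min(1.0, delta / norm)
    zero = [0.0] * len(d_lp)
    while True:
        d = [alpha * v for v in d_lp]
        lin_red = lin_model(zero) - lin_model(d)
        quad_red = quad_model(zero) - quad_model(d)
        if quad_red >= eta * lin_red or alpha < 1e-12:
            return alpha, d
        alpha *= tau

# Illustrative models: linear part -d, quadratic part -d + 5 d^2, so the
# sufficient-decrease test first holds once alpha <= 0.18.
alpha, d = cauchy_step(lambda v: -v[0],
                       lambda v: -v[0] + 5.0 * v[0] ** 2,
                       [1.0], delta=2.0)
```

Starting from α = 1, the search halves the step size until the quadratic model actually predicts a decrease, here stopping at α = 0.125.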
Remark 2.1. When applied to problem (NLP), Algorithm 1 uses the strategies introduced in [5], which form the basis of the active set method in the highly successful Knitro code [4]; this method combines sequential linear programming with equality-constrained quadratic programming approaches in order to achieve robust performance over a range of large-scale nonlinear programming problems.

Convergence Analysis of Algorithm 1
We begin our convergence analysis with the introduction of the standing assumptions and a recap of the relevant stationarity concept for (P) in Section 3.1. We analyze the criticality measure for this notion of stationarity in the noisy setting in Section 3.2. We use these results to prove lower bounds on the trust-region radii that occur in Algorithm 1 in Section 3.3, which are then used to obtain sufficient decrease and, as a consequence, convergence of the produced iterates to critical regions in Section 3.4.

Standing Assumptions and Stationarity
In order to study the convergence properties of Algorithm 1, we make the following assumptions regarding the functions ω and F and the matrices B_k used in the quadratic models q̃_k.
Assumption 1. ω is Lipschitz-continuous with constant L_ω.

Assumption 2. F and ∇F are Lipschitz-continuous with constants L_F and L_∇F, i.e., it holds for all x, y ∈ R^n that ‖F(x) − F(y)‖ ≤ L_F ‖x − y‖ and ‖∇F(x) − ∇F(y)‖ ≤ L_∇F ‖x − y‖.

Assumption 3. The Hessian approximations B_k are bounded, i.e., there exists a constant M_B > 0 such that ‖B_k‖ ≤ M_B for all k.

Our aim in the following is to find a local optimum of φ. A first-order necessary condition of optimality for (P) (see [9, pp. 184]) states that x* ∈ R^n can only be a local optimum if 0 ∈ ∇F(x*)ᵀ ∂ω(F(x*)), where ∂ω(z) denotes the subdifferential of ω at z ∈ R^p. To measure the distance of the iterates x_k to a critical point of φ, the authors of [6] use the objective reduction with respect to the partially linearized model, given by

Ψ_k(Δ) := φ(x_k) − min_{‖d‖_LP ≤ Δ} ω(F(x_k) + ∇F(x_k)ᵀ d).

Clearly, since d = 0 is feasible, the reduction Ψ_k(Δ) is always non-negative. On the other hand, the following result establishes that a vanishing reduction over a trust region of normalized size is tantamount to reaching a critical point.

Analysis of Model Function and Criticality Measure in the Presence of Noise
Since we do not have access to the values of F and ∇F required to compute Ψ_k, we define a noisy measure of criticality Ψ̃_k by replacing F and ∇F with their noisy counterparts F̃ and ∇F̃. This function is also non-negative and we analyze its properties and relationship to Ψ_k below. Since δ_F cannot be assumed to be continuous, neither can φ̃. This differs from the analysis in [6], where the Lipschitz continuity of φ is used to argue that the reduction ratio approaches one if the trust-region radius is driven to zero. We can, however, state that the criticality measures Ψ_k and Ψ̃_k are related by the following approximation result: for a fixed x_k, we claim that Ψ̃_k(1) → Ψ_k(1) as ε_F → 0 and ε_∇F → 0, and that we also have convergence of the minimizers of the convex programs in the definitions of Ψ̃_k(1) and Ψ_k(1). This follows from the epi-convergence of the functionals L_{ε_F,ε_∇F} to the limit functional L_{0,0}.

Proof. We begin by showing the first inequality.
We continue with the second inequality and consider the constant sequence, which in turn yields the second inequality.
The functionals L_{ε_F,ε_∇F} always admit a minimizer because the feasible set {d | ‖d‖_LP ≤ Δ}, on which L_{ε_F,ε_∇F} is finite, is compact. A standard argument then yields that all accumulation points of a sequence of minimizers of the functionals L_{ε_F,ε_∇F} minimize the limit functional L_{0,0}.
Consequently, if we drive Ψ̃_k to zero over the iterations, we obtain an upper bound on Ψ_k, defining a critical region (a sublevel set) into which the iterates converge.
Proof. This follows directly from Assumption 2 and the mean value theorem.

Lemma 3.4. Under Assumptions 1 to 3, the deviation between the noisy and noiseless model reductions is bounded in terms of constants M_ε0 and M_ε1 that depend on the noise levels.

Proof. The assumed Lipschitz continuity of ω, F, and ∇F, the representations F̃ = F + δ_F and ∇F̃ = ∇F + δ_∇F, Lemma 3.3, and the bounds on δ_F and δ_∇F yield the claim with elementary computations.

Remark 3.5. If Assumptions 1 to 3 hold in the noiseless case, the achievable reduction is related to the trial value via a quadratic bound for some M > 0. This is no longer the case in the noisy model.
Several of the following results are due to [6] and are largely unaffected by moving from the noiseless to the noisy regime. We refer to their counterparts in [6] and prove them in the appendix. We begin by establishing that the linearized model ℓ̃_k is still Lipschitz-continuous, albeit with a Lipschitz constant L_ε affected by the noise level:

Lemma 3.6. Under Assumptions 1 and 2, it holds for all d, d' ∈ R^n that |ℓ̃_k(d) − ℓ̃_k(d')| ≤ L_ε ‖d − d'‖_LP for a constant L_ε > 0.

Proof. The assumed Lipschitz continuity of ω and F, the representation ∇F̃ = ∇F + δ_∇F, Lemma 3.3, the bound on δ_∇F, and the bound ‖·‖ ≤ γ‖·‖_LP yield the claim with elementary computations.
We proceed to examine the reduction according to the partially linearized model as a function of the size of an improvement step. The following result establishes that the reduction is well behaved in the step size in the following sense: the model reduction achieved for a reduced step size is bounded from below by the model reduction achieved without step reduction, multiplied by the step reduction factor.
Proof. This follows directly from the fact that ℓ̃_k is convex, where we note that ℓ̃_k(0) = φ̃(x_k) holds for d = 0.
Next, we establish that the criticality Ψ̃_k(Δ) for a given trust-region radius Δ > 0 is bounded below by Ψ̃_k(1) multiplied by Δ if the latter is less than one. This holds in particular during the computation of the LP step in Algorithm 1. The proof requires the relationship established in Lemma 3.7.

Lemma 3.8 (Lemma 3.2 in [6]). It holds for any Δ > 0 that Ψ̃_k(Δ) ≥ min(Δ, 1) Ψ̃_k(1).
Proof. The proof is in the appendix.
The next result states that when progress is possible with respect to the criticality Ψ̃_k(1), the LP step either lies on the trust-region boundary or has a norm proportional to Ψ̃_k(1).

Lemma 3.9 (Lemma 3.3 in [6]). Suppose that Assumptions 1 and 2 hold and that Ψ̃_k(1) ≠ 0. Let d_Δ be a minimizer achieving Ψ̃_k(Δ) for some Δ > 0. Then it follows that either ‖d_Δ‖_LP = Δ or ‖d_Δ‖_LP ≥ Ψ̃_k(1)/L_ε.
Proof. The proof is in the appendix.
We are now ready to examine the step d_k computed by Algorithm 1 with respect to the reduction achieved by the model q̃_k. Specifically, if progress can be made with respect to the criticality Ψ̃_k(1), then we can expect a positive reduction in q̃_k. We use this result to prove that the objective φ decreases as well, as long as Ψ̃_k(1) is sufficiently large.

Lemma 3.10 (Lemma 3.4 in [6]). Under Assumptions 1 and 2, the model decrease q̃_k(0) − q̃_k(d_k) is bounded below in terms of Ψ̃_k(1).

Proof. The proof is in the appendix.
The following technical lemma shows that if Ψ̃_k(1) ≠ 0, then ‖d^C_k‖_LP is bounded below, which we will need to ensure that the updated trust-region radii do not collapse while progress in the objective can still be made.

Lemma 3.11 (Lemma 3.6 in [6]). Under Assumptions 1 to 3, ‖d^C_k‖_LP is bounded below in terms of Ψ̃_k(1).

Proof. The proof is in the appendix.

Lower Bounds on the Trust-region Radii
We are now able to state a key result that provides lower bounds on both the trust-region radius Δ for the quadratic model and the LP trust-region radius Δ^LP. It ensures that the algorithm does not stall while progress can be made with respect to the noisy criticality Ψ̃_k. The proof strategy follows Lemma 3.7 in [6] for the noiseless case. In order to compensate for the noise, we need to assume a sufficiently large stabilization parameter ϑ, which in turn depends on the constants introduced by the noise.
Lemma 3.12. Consider an application of Algorithm 1 to the noisy variant of problem (P). Suppose that Assumptions 1 to 3 hold, that Ψ̃_k(1) ≥ δ > 0 for all k, and that the stabilization condition (4) holds with M_ε0, M_ε1 from Lemma 3.4. Then the trust-region radii are bounded below in terms of Δ_min = min(A, Bδ) with constants A, B > 0.

Proof. Using Ψ̃_k(1) ≥ δ, the bound in Lemma 3.11 becomes a lower bound on ‖d^C_k‖_LP. If, on the other hand, the step is rejected, we can deduce further inequalities, and consequently we obtain the lower bound (7) on the updated LP trust-region radius. To finish the proof, we distinguish two cases with respect to d_k, recalling that the next trust-region radius Δ^LP_{k+1} has a value of at least θ‖d_k‖_LP. We combine both cases by taking their minimum; the resulting lower bound on Δ^LP_{k+1} dominates the previously shown lower bound (5) for accepted iterates. In order to derive a uniform lower bound on Δ^LP_k (that is, independent of k), we may assume the worst case, i.e., that all steps are rejected, and resort to only (7).
Regarding the trust-region radius Δ_k for the quadratic model, we can follow a similar chain of reasoning as for Δ^LP_k. The radius is only decreased when the reduction ratio ρ̂_k is less than or equal to ρ_s, in which case it follows that Δ_{k+1} ≥ κ_l ‖d_k‖. We can use the inequalities that lead to (7) and combine the resulting estimates to obtain a lower bound on Δ_{k+1}. Starting from some k, we apply this inequality recursively while decrementing k; it must therefore hold that Δ_k ≥ γΔ_low. The result follows from grouping the terms in γΔ_low according to whether or not they contain δ.

Global Convergence Theorem
We are now ready to establish the convergence of Algorithm 1. In order to simplify the proof of the main theorem, we handle separately the special case in which Algorithm 1 converges in a finite number of steps.
Lemma 3.13 (Corollary 3.8 in [6]). Consider an application of Algorithm 1 to the noisy variant of problem (P). Suppose that Assumptions 1 to 3 and (4) hold. If there are finitely many successful iterations (that is, ρ̂_k ≥ ρ_u) during the execution of Algorithm 1, then it holds that x_k = x* and Ψ̃_k(1) = 0 for all sufficiently large k.
The following convergence theorem states that when the objective of (P) is bounded below, an application of Algorithm 1 will produce one of two possible, mutually exclusive outcomes: the algorithm may stop at a critical point after a finite number of iterations as described in Lemma 3.13 or, alternatively, Algorithm 1 visits a critical region infinitely often. The critical region C(δ) is defined in terms of the functions ω, F, and ∇F in such a way that an iterate x_k produced during the execution of Algorithm 1 is contained in C(δ) if and only if Ψ̃_k(1) ≤ δ. What is more, Proposition 3.2 establishes that Ψ̃_k(1) tends to Ψ_k(1) as the errors ε_F and ε_∇F approach zero. These results therefore suggest that the iterate is close to being optimal in the sense of Lemma 3.1.
Theorem 3.14. Consider an application of Algorithm 1 to the noisy variant of problem (P). Suppose that Assumptions 1 to 3 and (4) hold. Then either Ψ̃_k(1) = 0 for some k ∈ N, or φ(x_k) tends to −∞, or there are infinitely many k ∈ N such that x_k ∈ C(δ_max), where δ_max is given in terms of the constants A, B from Lemma 3.12.
Proof. If there are only finitely many accepted steps, the result follows from Lemma 3.13, yielding the first possibility. Otherwise, we can assume that an infinite number of accepted steps occurs during the algorithm. If φ(x_k) tends to −∞, the second possibility occurs, so we can assume in the following that φ(x_k) (and hence φ̃(x_k)) is bounded below. Let K be the sequence of accepted steps, i.e., consisting of those k where x_{k+1} ≠ x_k. Clearly, if lim inf_{k→∞} Ψ̃_k(1) = 0, then the result follows. So we can assume that there exists a δ > 0 such that Ψ̃_k(1) ≥ δ for all k ≥ k_0. The claim stating that the region C(δ_max) is visited infinitely often is tantamount to ensuring that δ ≤ δ_max, which will be the aim of the remainder of this proof.
We deduce a lower bound on the decrease using Lemma 3.10, and we can now apply Lemma 3.12 to bound α_k Δ^LP_k below based on Δ_min and the constants A and B. Let us assume towards a contradiction that δ > δ_max. We distinguish two cases with respect to the minimum min(A, Bδ):

1. The minimum is attained at A, implying a decrease of at least a constant C_1 > 0.

2. The minimum is attained at Bδ, implying a decrease of at least a constant C_2 > 0.

In either case, φ decreases by at least min(C_1, C_2) > 0 from x_k to x_{k+1}. Since this decrease is strictly positive and there are infinitely many accepted steps in the sequence K, it follows that φ(x_k) tends to −∞, which is a contradiction. It must therefore hold that δ ≤ δ_max, as desired.
Interpretation of Theorem 3.14 In theoretical terms, the result in Theorem 3.14 is as expected: the size of the critical region C depends on the stabilization parameter ϑ. If we increase ϑ, Algorithm 1 can tolerate a larger amount of noise at the cost of a decreased accuracy with respect to the criticality measure Ψ. Of course, problem (P) can generally also be unbounded. The remaining case, where Ψ̃_k(1) = 0 for some k ∈ N, can involve different scenarios. If Ψ_k(1) = 0 holds as well, the iterate x_k is a critical point of (P), which is the ideal situation. Otherwise, the noises δ_F(x_k) and δ_∇F(x_k) attain values such that x_k appears to be critical in the noisy model. In the case of an unconstrained version of (NLP), this is tantamount to a non-zero gradient that is canceled out by noise. For a function F that is not afflicted by noise, i.e., satisfying ε_F = ε_∇F = 0, it holds that M_ε = 0, allowing us to set ϑ = 0, whereby we recover the original algorithm discussed in [6]. If we apply Theorem 3.14 in this situation, it follows from ϑ = 0 that δ_max = 0, implying that the region C(δ_max) contains precisely the points critical with respect to Ψ̃, which itself coincides with Ψ in this particular case. Thus, the original convergence result [6, Theorem 3.9] follows for Algorithm 1 in the noiseless case.
We can also apply Algorithm 1 to solve smooth unconstrained nonlinear problems affected by noise. To this end, let p = 1 and ω(x) = x for x ∈ R, which leads to the Lipschitz constant L_ω = 1, so that Theorem 3.14 suggests a stabilization that is weaker than the stabilization of 2ε_F/(1 − ρ_s) analyzed in [19]. The weaker estimate is due to the estimate from Lemma 3.7, which only uses the convexity of ω; the authors of [19], in contrast, have a Lipschitz continuous derivative of the objective at hand. The criticality measure then satisfies a bound which is equivalent to ‖∇F(x)‖ being bounded above by a constant.

Numerical Experiments
In order to illustrate the performance and examine the behavior of Algorithm 1, we implement the algorithm in Python (3.8.10), using numpy [numpy] (1.24.1), scipy [21] (1.9.3) (including HiGHS [13] (1.2.0) as the LP solver), and Ipopt [22] (3.11.9) to solve the respective subproblems. We generally compare the performance of the classical algorithm (i.e., Algorithm 1 with a stabilization of ϑ = 0) with its stabilized counterpart (where ϑ > 0). In terms of termination criteria, we first of all impose an iteration limit, after which the algorithm terminates. Secondly, we monitor the LP trust-region radius Δ^LP. If the radius contracts to a value close to zero (1 × 10^−10), we see this as a failure of the algorithm and let it terminate. Lastly, when the noisy criticality Ψ̃_k(1) falls below the threshold of 1 × 10^−6, we terminate the algorithm, knowing that the current iterate is very close to being optimal. We then examine the final iterate x_f, i.e., the iterate x_k in Algorithm 1 of the iteration at which the termination criterion becomes satisfied. If not indicated otherwise, we choose the parameters of Algorithm 1 according to the values in Table 1 and the stabilization parameter ϑ*_ε. In order to obtain inexact evaluations of a given function, we draw δ_F(x) ∼ B_1(ε_F) and δ_∇F(x) ∼ B_n(ε_∇F), (8) where B_n(s) denotes the uniform distribution on the n-dimensional ball centered at the origin with radius s. Sampling randomly from these distributions at each point x ensures that the amount of noise is bounded according to the noise levels ε_F and ε_∇F while being sufficiently unpredictable. In Section 4.1, we augment an example of a quadratic test problem from [19] with a non-smooth term. We obtain qualitatively similar results in this case. Then we compare and visualize the different behaviors of the unstabilized algorithm and the stabilized algorithm for the Rosenbrock test function in Section 4.2. In Section 4.3, we apply the algorithm to an image reconstruction problem with total variation regularization and assess the impact of
different choices of the stabilization parameter. Finally, in Section 4.4, we apply the algorithm within a penalty method to a small constrained optimization problem from CUTEst [11], as motivated in the introduction, which points to future research directions. Sections 4.1, 4.2 and 4.4 use essentially the same type of non-smooth objective function, which includes an ℓ1-penalty term, and we provide the required estimates of its Lipschitz constant in Appendix B.
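The noise model (8) requires sampling uniformly from an n-dimensional ball. One standard way to do this, used here as an illustrative sketch, scales a uniformly random direction by s·U^(1/n), which gives the radius distribution of a uniform ball sample.

```python
import numpy as np

def sample_ball(rng, n, s):
    """Draw one point uniformly from the n-dimensional ball B_n(s).

    A Gaussian vector is normalized to the unit sphere (uniform
    direction) and scaled by s * U**(1/n) (uniform-ball radius law)."""
    v = rng.standard_normal(n)
    v /= np.linalg.norm(v)
    return s * rng.uniform() ** (1.0 / n) * v

rng = np.random.default_rng(42)
# Noisy evaluations per (8): delta_F in B_1(eps_F), delta_gradF in B_n(eps_gradF);
# the concrete noise levels below are illustrative.
eps_F, eps_gF, n = 1e-1, 1e-5, 8
delta_F = sample_ball(rng, 1, eps_F)
delta_gF = sample_ball(rng, n, eps_gF)
```

By construction, every draw respects the uniform bounds ‖δ_F‖ ≤ ε_F and ‖δ_∇F‖ ≤ ε_∇F while remaining unpredictable between evaluations.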

Failure of the Classical Algorithm
To illustrate the difference in performance between the classical algorithm and Algorithm 1, we consider the case of ℓ1-penalized optimization problems of the form x ↦ f(x) + λ‖x‖₁ with a smooth function f : R^n → R. It is clear that these problems are non-smooth due to the presence of the ‖·‖₁ term, while being expressible as problems of type (P) based on suitable choices of ω and F. This problem class also enables us to minimize ℓ̃_k and q̃_k over the trust regions defined in terms of Δ^LP and Δ by solving linear or quadratic programs, respectively. What is more, the only curvature information in this problem class is due to f, enabling us to use the Hessian of f (or any quasi-Newton approximation) to obtain the matrices B_k. We specifically examine the case where f is a quadratic of the form f(x) = (1/2)⟨x, Dx⟩, where D ∈ R^{n×n} for n = 8 is given as D = diag(10^−5, 10^−4.75, 10^−4.5, ..., 10^−3.25), taken from [19], where this optimization problem has been studied without an ℓ1-penalty. It is apparent that the optimal solution of this instance of (P) is x* = 0. We set the parameter λ to 1 × 10^−2 while injecting noise according to (8) with noise levels ε_F = 1 × 10^−1 and ε_∇F = 1 × 10^−5, initialize Algorithm 1 with the initial point x_0 = (1000, 0, ..., 0)ᵀ, limit the number of iterations to 50, and perform quadratic steps based on the true Hessian D.
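The noiseless version of this test problem can be set up in a few lines; the sketch below only defines the objective and the starting point (the noise injection and the algorithm itself are omitted).

```python
import numpy as np

n, lam = 8, 1e-2
# Ill-conditioned diagonal from [19]: exponents -5, -4.75, ..., -3.25.
D = np.diag(10.0 ** np.linspace(-5.0, -3.25, n))

def f(x):
    # smooth quadratic part f(x) = (1/2) <x, D x>
    return 0.5 * x @ D @ x

def phi(x):
    # l1-penalized objective of type (P); optimal solution is x* = 0
    return f(x) + lam * np.sum(np.abs(x))

x0 = np.zeros(n)
x0[0] = 1000.0  # initial point (1000, 0, ..., 0)^T
```

At the starting point the quadratic part contributes 0.5 · 10⁶ · 10⁻⁵ = 5 and the penalty contributes 10, so φ(x₀) = 15, while φ(x*) = 0.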
We show an example of the difference in performance in Figure 1, where the values of the reduction ratio are clipped to ±5 in order to properly display the results. We see that the classical algorithm performs dramatically worse than its stabilized counterpart. Indeed, the classical algorithm stalls almost immediately due to the reduction ratio ρ_k becoming unreliable. Consequently, the LP trust region collapses, and the classical algorithm makes no progress towards optimality. Conversely, the addition of a stabilization yields an algorithm that rapidly approaches the optimum, both in terms of primal distance and objective value, while maintaining a reasonably large LP trust-region radius. Similarly, the noisy and noiseless criticality decrease rapidly throughout the iterations of the stabilized algorithm. Unfortunately, the criticality bound established in Theorem 3.14 attains a value of δ_max ≈ 6000, limiting its use in terms of the criticality actually achieved throughout the iterations.
We would like to point out that the failure of the classical algorithm is not guaranteed in this scenario: by running the experiment with 100 different random seeds, we found that the classical algorithm stalls in about half (45) of the cases, while performing well in the other half. Conversely, the stabilized algorithm consistently performs well in all cases. Its characteristics are qualitatively similar to the case depicted in Figure 1. A key problem of the classical algorithm is therefore its unreliability when applied to noisy functions.

A Variant of the Rosenbrock Problem
Following the previous experiments conducted on the quadratic function, we go on to examine the performance on a variant of the famous Rosenbrock function, given by R(x, y) := (a − x)² + b(y − x²)² with parameters a = 1, b = 100. The Rosenbrock function has a unique optimum at (x*, y*) = (a, a²), i.e., at (1, 1) for our choice of parameters. We modify the problem by adding a penalty of λ‖(x, y) − (x*, y*)‖₁ with a value of λ = 1 × 10^−1, yielding a problem of type (P) having the same global optimum as R. We show an example of the difference in performance between the classical and stabilized algorithms in Figure 2. The figure shows the trajectories generated by Algorithm 1 with and without stabilization, starting at (x_0, y_0) = (−1.5, 0), injecting noise according to (8) for different values of ε_F and a fixed value of ε_∇F = 1 × 10^−5, and performing quadratic steps according to the true Hessian of R with an iteration limit of 50.
Examining the trajectories of the classical algorithm, shown in Figure 2a, we find that for different values of ε_F, the trajectories are initially almost identical, until the algorithm stalls at points whose distance to the optimum increases with ε_F. Conversely, the trajectories of the stabilized algorithm, shown in Figure 2b, vary significantly for different noise levels. However, the stabilization yields trajectories leading significantly closer to the optimum than those of the classical algorithm, even for larger noise levels. This is confirmed by the statistics shown in Figure 3, displaying the distribution of the distance to the optimum for various noise levels over 100 different random seeds, demonstrating that the stabilized algorithm consistently outperforms the classical one, in particular for larger noise levels.

Image Reconstruction
Although this is not the focus of this article, we also provide a computational example that has a meaningful problem size. Specifically, we consider an artificial task of reconstructing an image under noisy observations. That is, we seek to recover a matrix Y ∈ R^{M×N} with values normalized to be in [0, 1]. In our setting, Y is only available in the form of noisy observations. Specifically, for an input X ∈ R^{M×N}, the fidelity of X, given by the term (1/2)‖X − Y‖²_F, and its derivative with respect to X cannot be evaluated. Instead, we have access to the map X ↦ (1/2)‖X − Ỹ‖²_F, where Ỹ is a noisy version of Y, redrawn for each guess X. We obtain the term Ỹ by sampling from a uniform distribution and setting Ỹ to Y + δ_Y(X), clipped back to have coefficients in [0, 1]. The amount of noise injected into the image is in turn governed by the parameter ε_img ≥ 0. This noise model translates into noise injected into the evaluations of F and ∇F, which can be estimated in terms of ε_img, M, and N (see Appendix B), while not conforming to the noise model (8). We also add an anisotropic total variation (TV) regularization penalty, defined as the sum of the absolute first differences of X along both coordinate directions, to our objective, turning the problem non-smooth and balancing fidelity against regularity via a parameter λ > 0. The regularization term can be expressed as ‖A_{M,N} X‖₁ with a suitable matrix A_{M,N}. Consequently, we can formulate the reconstruction problem as a problem of type (P), consisting of a smooth term (the fidelity) and an ℓ1-penalized linear function. Naturally, the regularization does not suffer from any noise.
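A matrix representing the anisotropic TV term can be assembled from Kronecker products of first-difference matrices; the following dense sketch (a sparse construction would be preferable at realistic image sizes, and the exact matrix used in the experiments may differ) illustrates one such construction for the column-major flattening of X.

```python
import numpy as np

def tv_matrix(M, N):
    """Assemble a matrix A with TV(X) = ||A x||_1 for x = vec(X).

    Stacks horizontal and vertical first differences of an M x N image,
    built via Kronecker products with (k-1) x k difference matrices."""
    diff = lambda k: np.diff(np.eye(k), axis=0)  # rows: x_{i+1} - x_i
    A_h = np.kron(diff(N), np.eye(M))  # differences across columns
    A_v = np.kron(np.eye(N), diff(M))  # differences across rows
    return np.vstack([A_h, A_v])

X = np.zeros((3, 3))
X[1, 1] = 1.0                      # single bright pixel
A = tv_matrix(3, 3)                # shape ((N-1)*M + (M-1)*N, M*N) = (12, 9)
tv = np.sum(np.abs(A @ X.flatten(order="F")))
```

A single unit pixel has two horizontal and two vertical unit jumps, so its anisotropic TV value is 4.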
Based on a regularization parameter of λ = 5 × 10⁻³, we reconstruct the image shown in Figure 5a. To avoid having to solve large quadratic problems, we do not compute quadratic steps and instead increase the iteration limit to 100, starting from X₀ = 0. As a baseline, Figure 5b shows the image obtained when we apply Algorithm 1 to the original image (i.e., setting ε_img = 0). The restored image closely resembles the original one.
We proceed to study the effect of the value of ϑ on the quality of the reconstructed image. In principle, ϑ ≥ ϑ*_ε must hold for the criticality to provably converge. It is, however, unclear whether setting ϑ to ϑ*_ε yields the best results in practice. The large value of δ_max seen in Section 4.1 suggests that (4) is rather pessimistic. We therefore examine the performance of Algorithm 1 for values of ϑ that do not necessarily satisfy the inequality.
To gauge performance, we record both the original and the noisy objective after the iterations. The results, shown in Figure 4, demonstrate the effect of ϑ: For small stabilization values, Algorithm 1 stalls early on, as was the case in our previous experiments. As we increase ϑ, there appears to be an optimal choice, or a small optimal region, where both the noisy and the noiseless evaluation of the final objective are minimized. This effect is more pronounced for higher values of ε_img, where the noiseless objective for ϑ = 0 is about four times as large as that of the optimal choice of ϑ. Interestingly, this sweet spot also shows in the noisy objective, suggesting that noisy observations may be sufficient to find it. Lastly, as we increase ϑ beyond the sweet spot, the final objective increases sharply. This is likely because the algorithm simply accepts too many steps, even when they are in fact disadvantageous in terms of progressing towards an optimum. Ultimately, for a sufficiently large value of ϑ, all steps are accepted, which, as the final objective suggests, leads to poor solutions. The values of ϑ*_ε are 2 × 10⁴, 1 × 10⁵, and 2 × 10⁵ for the respective noise levels, significantly exceeding the optimal values and lying beyond the point where all steps are accepted.
We also find that the objectives are consistent with the visual appearance of the reconstructed images, shown in Figure 6: While setting ϑ to zero yields satisfactory results, even though a grainy appearance remains for larger noise levels, a disproportionately large value of ϑ produces a distorted result with visible artifacts. For our best guess of ϑ, the restored images are free of artifacts and closely resemble the original one, even for larger noise levels.

Constrained Optimization
As a final example, and in order to demonstrate the possible use of Algorithm 1 as a subproblem solver within constrained optimization algorithms, we study a constrained optimization problem of type (NLP). Specifically, we examine the behavior of Algorithm 1 when applied to the HS71 benchmark problem from the CUTEst [11] suite, whose formulation leads to suitable functions g and h according to (NLP). The problem features four bounded optimization variables, two nonlinear constraints, and a nonlinear objective with an optimum at the point x* ≈ (1.00, 4.74, 3.82, 1.38) that satisfies MFCQ and, in turn, the conditions for the convergence of an exact penalty method. As mentioned in the introduction, we solve problems of type (NLP) by using the penalty function (1) with a suitable penalty parameter ν > 0, knowing that convergence is guaranteed for sufficiently large ν under mild assumptions, i.e., MFCQ. Increasing ν beyond its required value may slow down practical performance, but convergence is maintained. Consequently, ν is often set to a small initial value and increased when necessary (see, for example, [5]).
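For reference, HS71 and an ℓ1 exact penalty for it can be written down as follows. This is a sketch under the assumption that the penalty (1) has the usual ℓ1 form; folding the variable bounds into the inequality constraints is our choice, not necessarily the paper's setup.

```python
import numpy as np

def f(x):
    """HS71 objective."""
    return x[0] * x[3] * (x[0] + x[1] + x[2]) + x[2]

def g(x):
    """Inequality constraints in the form g(x) <= 0."""
    product = np.array([25.0 - x[0] * x[1] * x[2] * x[3]])  # x1*x2*x3*x4 >= 25
    bounds = np.concatenate([1.0 - x, x - 5.0])             # 1 <= x_i <= 5
    return np.concatenate([product, bounds])

def h(x):
    """Equality constraint h(x) = 0."""
    return np.array([np.dot(x, x) - 40.0])

def phi(x, nu=100.0):
    """Assumed l1 exact penalty: f + nu * (||max(g, 0)||_1 + ||h||_1)."""
    return f(x) + nu * (np.maximum(g(x), 0.0).sum() + np.abs(h(x)).sum())
```

At the reported optimum x* ≈ (1.00, 4.74, 3.82, 1.38) the constraint violation is negligible, so phi(x*) is close to f(x*) ≈ 17.01; at infeasible points the penalty term dominates for ν = 100.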
If the functions in (NLP) are affected by noise, the choice of ν is not as straightforward: The required stabilization (4) depends on M_{ε_0} and therefore on L_ω, which increases with ν. Similarly, the value of δ_max increases with L_ω and therefore with ν, so a large penalty parameter has the adverse effect of enlarging the critical region C(δ_max), making a suitable choice of the parameter an interesting problem in and of itself. What is more, if the constraint functions g and h suffer from noise, we cannot expect the iterates x_k to tend towards feasibility of the underlying noiseless problem, regardless of the value of ν.
Therefore, for our investigation, we consider a fixed value of ν that is suitable for solving the noiseless variant of (HS71), in our case ν = 100. We once again inject noise according to (8) with different values of ε_F and a fixed gradient noise level of zero. Specifically, we run the algorithm with the choices ε_F = 10⁻² and ε_F = 10⁻¹. Since a reasonable choice of the quadratic model would likely require some dual estimation, we once again opt to skip quadratic steps and instead set the iteration limit to 100. After the algorithm has terminated, we record the criticality Ψ_k(1) and the feasibility residual, as well as their noisy counterparts, for different values of ϑ (see Figure 7). As was the case for the image reconstruction problem, we observe pronounced minima of the quality metrics, criticality and feasibility residual, with respect to the choice of ϑ. For a given choice of ε_F, the minima obtained for the two quality metrics lie in close vicinity to each other. The position of these minima is also fairly consistent across the noisy and noiseless measurements of both the criticality and the feasibility residual. Unfortunately, setting ϑ = ϑ*_ε does not yield optimal results, even though ϑ*_ε appears to be closer to optimal than in the image reconstruction problem. Once again, for an informed choice of ϑ, the stabilized algorithm significantly outperforms the classical one, leading to about an order of magnitude reduction in both criticality and feasibility residual. The precise choice of the parameter, and a systematic means to determine it, do however remain elusive.

Conclusion
We have presented a noise-tolerant adaptation of a well-established trust-region method for non-smooth optimization problems with a structured, convex non-smoothness described by a polyhedral function, which makes them amenable to linear programming techniques. The adaptation only requires knowledge of a Lipschitz constant and of bounds on the noise in the objective function and its derivative. The analysis of the asymptotics of the successive linear programming algorithm can be carried out analogously to [6], where the noiseless case is handled. As expected from the results in [19], we do not obtain convergence to a first-order stationary point but to a critical region instead.
In a noiseless setting, both the behavior of the algorithm and its convergence properties are consistent with previous analyses. The computational results show that an informed choice of the stabilization parameter ϑ may improve the quality of the obtained results significantly, so we believe that research dedicated to improved bounds and efficient practical determination strategies is worthwhile.
Further analysis is also needed in order to be able to use and interpret the method as a subproblem solver for constrained optimization with noisy constraint and objective evaluations.In particular, it is necessary to study the asymptotics of the feasibility residual, identify means to control it, and classify it with respect to existing concepts from the field of uncertainty quantification like (distributional) chance constraints or expectation constraints.

A Proofs
In the following we give the proofs of some of the results used in Section 3. These proofs closely follow those in [6]. We provide them here to make this article more self-contained.
It remains to prove the case ∆ < 1. From ‖d_1‖_LP ≤ 1 it follows that ‖∆d_1‖_LP ≤ ∆, i.e., ∆d_1 is a feasible solution with respect to ∆. Therefore, it holds that Ψ̃_k(∆) ≥ ℓ̃_k(0) − ℓ̃_k(∆d_1) ≥ ∆ Ψ̃_k(1), where the inequalities are due to the feasibility of ∆d_1 and to Lemma 3.7, respectively.
Proof of Lemma 3.9. Let d_1 be a minimizer for ∆ = 1. Assume (towards a contradiction) that ‖d_∆‖_LP < min(∆, Ψ̃_k(1)/L_ε). (9) If ∆ ≥ 1, then (9) cannot hold, because d_1 is feasible with respect to ∆ and therefore cannot yield a better objective with respect to ℓ̃_k than the minimizer d_∆. So it must hold that ‖d_∆‖_LP ≥ Ψ̃_k(1)/L_ε in the case ∆ ≥ 1. If, on the other hand, ∆ < 1, then d_1 may not be feasible. However, since ℓ̃_k is convex, it holds for all λ ∈ (0, 1] that ℓ̃_k((1 − λ)d_∆ + λd_1) < ℓ̃_k(d_∆).
Therefore, any point on the line segment (d_∆, d_1] has a strictly lower value of ℓ̃_k than d_∆, so no such point can be feasible with respect to the constraint on ‖·‖_LP. Consequently, d_∆ must lie on the boundary of the feasible set, implying that ‖d_∆‖_LP = ∆. The result is obtained by combining these bounds. Proof of Lemma 3.10. The actual step must satisfy q̃_k(d_k) ≤ q̃_k(d^C_k), so the first inequality is a given. Similarly, the last inequality is an application of Lemma 3.8. To show that the remaining inequality holds, recall that the line search for the Cauchy step d^C_k terminates with a step size α_k satisfying the sufficient decrease condition. We consider two cases: 2. The decrease condition is only satisfied at a later iteration of the line search. Recall that the line search computes step sizes by multiplying a base length with powers of an input parameter τ ∈ (0, 1). We can therefore deduce that the sufficient decrease condition was not satisfied for α_k/τ in the previous iteration. Since the only difference between the linearized and the quadratic model is the quadratic term, the left-hand side can be bounded above by using Assumption 3 and relation (2). Similarly, for the right-hand side we can use the preceding lemmas to obtain the estimate required to complete the proof.
Proof of Lemma 3.13. Let k₀ be the index of the last accepted step. Then x_{k+1} = x_k =: x* for all k > k₀. Consequently, after finishing the k₀-th iteration, Ψ̃_k(1) stays at a constant value δ ≥ 0. What is more, following iteration k₀ we have ∆_{k+1} < κ_u ∆_k, where κ_u < 1. Therefore, ∆_k tends to zero. Recall from Lemma 3.12 that if δ > 0, then ∆_k is bounded away from zero. Therefore, since ∆_k tends to zero, it must hold that δ = 0.

B Estimations
Lipschitz constant of the ℓ1-penalty function. In the following, we give an estimate of the Lipschitz constant L_ω of the penalty function ω : R × R^m → R, ω(x, y) = x + ν‖y‖₁, based on the constant ν > 0 and the dimension m ∈ N. Since we use this function in all of the examples in Section 4, and since the value of ϑ*_ε depends on the value of L_ω, we make its derivation explicit. To obtain an optimal value of L_ω, we solve the optimization problem max_{x,y} |x| + ν‖y‖₁ s.t. ‖(x, y)‖₂ ≤ 1. We can simplify the problem further by realizing that we can assume both x and y to be non-negative, eliminating the absolute value in the objective. The largest ratio of ν‖y‖₁ over ‖y‖₂ is achieved by setting all entries of y to the same value y₀ ∈ R, yielding the problem max_{x,y₀} x + νm y₀ s.t. x² + m y₀² ≤ 1, x, y₀ ≥ 0.
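The reduced problem above can be solved in closed form with a Lagrange multiplier on the (active) constraint; a derivation consistent with the reduction is:

```latex
% Stationarity of x + \nu m y_0 subject to x^2 + m y_0^2 = 1:
(1, \nu m) = \lambda\, (2x,\, 2m y_0)
\;\Longrightarrow\;
x = \frac{1}{2\lambda}, \qquad y_0 = \frac{\nu}{2\lambda}.
% Inserting into the constraint:
\frac{1}{4\lambda^2} + m\,\frac{\nu^2}{4\lambda^2} = 1
\;\Longrightarrow\;
2\lambda = \sqrt{1 + m\nu^2},
% so the optimal value, and hence the Lipschitz constant, is
L_\omega = x + \nu m y_0 = \frac{1 + m\nu^2}{2\lambda} = \sqrt{1 + m\nu^2}.
```

In particular, L_ω grows like ν√m for large ν, which quantifies the adverse effect of a large penalty parameter discussed in Section 4.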
and q̃_k(d) := ℓ̃_k(d) + ½⟨d, B_k d⟩, where the B_k ∈ R^{n×n} are symmetric (not necessarily positive definite) approximations of the curvature of ω ∘ F.

Since d^LP_k achieves Ψ̃_k(∆^LP_k), the inequality follows from Lemma 3.7. Proof of Lemma 3.11. The first inequality is due to the fact that the Cauchy step is the LP step scaled by α_k, where the LP norm of the LP step is bounded by ∆^LP_k. Consider two cases for the second inequality: 1. The decrease condition is immediately satisfied for the initial step size α_k = min(1, ∆_k/‖d^LP_k‖). Otherwise, we know that ‖d^C_k‖_LP = ‖d^LP_k‖_LP ∆_k/‖d^LP_k‖. We can use (2) to obtain ‖d^LP_k‖ ≤ γ‖d^LP_k‖_LP, inferring that ‖d^C_k‖_LP = ‖d^LP_k‖_LP ∆_k/‖d^LP_k‖ ≥ ‖d^LP_k‖_LP ∆_k/(γ‖d^LP_k‖_LP) = ∆_k/γ, which implies the claimed bound.
(a) Trajectories of Algorithm 1 with ϑ = 0 for different noise levels.
(b) Trajectories of Algorithm 1 with ϑ according to (4) for different noise levels.

Figure 2: Trajectories for the modified Rosenbrock problem plotted over the shifted criticality 1 + φ(x) − min_{‖d‖_LP ≤ 1} ω(F(x) + F′(x)d). The markers show the position of the final iterate.

Figure 4: Noiseless (red) and noisy (blue) evaluations of the objective value achieved by the final iterate of Algorithm 1 on an image reconstruction problem with different noise levels and stabilization parameters.

Figure 5: Sample image for the image reconstruction.

Figure 6: Reconstructed images for different noise levels and stabilizations (left: no stabilization, center: best stabilization, right: large stabilization so that all iterates are accepted).

Table 1: Parameters used for numerical experiments.
function F : R n → R p and its derivative, we inject noise by setting
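The noise model (8) is not reproduced in this excerpt; the following sketch therefore assumes componentwise additive noise, uniformly drawn and bounded by ε_F for function values and by a separate bound for derivatives. All names (noisy_oracle, eps_F, eps_dF) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

def noisy_oracle(F, dF, x, eps_F=1e-2, eps_dF=0.0):
    """Return noisy evaluations of F(x) and its Jacobian dF(x).
    Assumed model: additive, componentwise uniform noise bounded by
    eps_F (values) and eps_dF (derivatives); the paper's (8) may differ."""
    value = np.asarray(F(x), dtype=float)
    jac = np.asarray(dF(x), dtype=float)
    value_noisy = value + eps_F * rng.uniform(-1.0, 1.0, size=value.shape)
    jac_noisy = jac + eps_dF * rng.uniform(-1.0, 1.0, size=jac.shape)
    return value_noisy, jac_noisy
```

Setting eps_dF = 0 reproduces the setting used for the HS71 experiment above, where only the function values are perturbed.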