An Update-and-Stabilize Framework for the Minimum-Norm-Point Problem

We consider the minimum-norm-point (MNP) problem over polyhedra, a well-studied problem that encompasses linear programming. We present a general algorithmic framework that combines two fundamental approaches for this problem: active set methods and first order methods. Our algorithm performs first order update steps, followed by iterations that aim to 'stabilize' the current iterate with additional projections, i.e., find a locally optimal solution whilst keeping the current tight inequalities. Such steps have been previously used in active set methods for the nonnegative least squares (NNLS) problem. We bound the number of iterations polynomially in the dimension and in the associated circuit imbalance measure. In particular, the algorithm is strongly polynomial for network flow instances. Classical NNLS algorithms such as the Lawson-Hanson algorithm are special instantiations of our framework; as a consequence, we obtain convergence bounds for these algorithms. Our preliminary computational experiments show promising practical performance.


Introduction
We study the minimum-norm-point (MNP) problem

min (1/2)‖Ax − b‖² s.t. x ∈ B(u) := {x ∈ R^N | 0 ≤ x ≤ u}, (P)

for A ∈ R^{M×N}, b ∈ R^M, and upper bounds u ∈ (R ∪ {∞})^N. The problem (P) generalizes the linear programming (LP) feasibility problem: the optimum value is 0 if and only if the system Ax = b, x ∈ B(u) is feasible. The case u(i) = ∞ for all i ∈ N is also known as the nonnegative least squares (NNLS) problem, a fundamental problem in numerical analysis. Two extensively studied approaches for MNP and NNLS are active set methods and first order methods. An influential active set method was proposed by Lawson and Hanson [19, Chapter 23] in 1974. Variants of this algorithm were also proposed by Stoer [28], Björck [2], Wilhelmsen [31], and Leichner, Dantzig, and Davis [20]. Closely related is Wolfe's classical minimum-norm-point algorithm [32]. These are iterative methods that maintain a set of active variables fixed at the lower or upper bounds, and passive (inactive) variables. In the main update steps, these algorithms fix the active variables at the lower or upper bounds, and perform unconstrained optimization on the passive variables. Such update steps require solving systems of linear equations. In all these methods, the set of columns corresponding to passive variables is linearly independent. The combinatorial nature of these algorithms makes it possible to show termination with an exact optimal solution in a finite number of iterations. However, obtaining subexponential convergence bounds for such active set algorithms has remained elusive; see Section 1.1 for more work on NNLS and Wolfe's algorithm.
In the context of first order methods, the formulation (P) belongs to a family of problems for which Necoara, Nesterov, and Glineur [22] showed linear convergence bounds. That is, the number of iterations needed to find an ε-approximate solution depends linearly on log(1/ε). Such convergence has been known for strongly convex functions, but this property does not hold for (P). However, [22] shows that restricted variants of strong convexity also suffice for linear convergence. For problems of the form (P), the required property follows using Hoffman-proximity bounds [16]; see [26] and the references therein for recent results on Hoffman-proximity. In contrast to active set methods, first order methods are computationally cheaper as they do not require solving systems of linear equations. On the other hand, they do not find exact solutions.
We propose a new algorithmic framework for the minimum-norm-point problem (P) that can be seen as a blend of active set and first order methods. Our algorithm performs stabilizing steps between first order updates, and terminates with an exact optimal solution in a finite number of iterations. Moreover, we show poly(n, κ) running time bounds for multiple instantiations of the framework, where κ is the circuit imbalance measure associated with the matrix (A | I_M) (see Section 2.1). This gives strongly polynomial bounds whenever κ is constant; in particular, κ = 1 for network flow feasibility. We note that if A ∈ Z^{M×N}, then κ ≤ ∆(A) for the maximum subdeterminant ∆(A). Still, κ can be exponential in the encoding length of the matrix.
The stabilizing step is similar to the one used by Björck [2], who considered the same formulation (P). The Lawson-Hanson algorithm for the NNLS problem can be seen as a special instantiation of our framework, and we obtain an O(n²m² · κ² · ‖A‖² · log(n + κ)) iteration bound. These algorithms only use coordinate updates as first order steps, and maintain linear independence of the columns corresponding to passive variables. Our framework is significantly more general: we waive the linear independence requirement and allow for arbitrary active and passive sets. This provides much additional flexibility, as our framework can be implemented with a variety of first order methods. This feature also yields a significant advantage in our computational experiments.
Overview of the algorithm A key concept in our algorithm is the centroid mapping, defined as follows. For disjoint subsets I_0, I_1 ⊆ N, we let L(I_0, I_1) denote the affine subspace of R^N where x(i) = 0 for i ∈ I_0 and x(i) = u(i) for i ∈ I_1. For x ∈ B(u), let I_0(x) and I_1(x) denote the subsets of coordinates i with x(i) = 0 and x(i) = u(i), respectively. The centroid mapping Ψ : B(u) → R^N sends x to a minimizer of ‖Ay − b‖ over y ∈ L(I_0(x), I_1(x)). This mapping may not be unique, since the columns of A corresponding to J(x) := {i ∈ N | 0 < x(i) < u(i)} may not be independent: the optimal centroid set is itself an affine subspace. The point x ∈ B(u) is stable if Ψ(x) = x. This generalizes an update used by Björck [2]. However, in his setting the columns corresponding to J(x) are always linearly independent, and thus the centroid set is always a single point. Stable points can also be seen as the analogues of corral solutions in Wolfe's minimum-norm-point algorithm.
Every major cycle starts with an update step and ends with a stable point. The update step could be any first order step satisfying some natural requirements, such as variants of Frank-Wolfe, projected gradient, or coordinate updates. As long as the current iterate is not optimal, this update strictly improves the objective. Finite convergence follows by the fact that there can be at most 3^n stable points.
After the update step, we start a sequence of minor cycles. From the current iterate x ∈ B(u), we move to Ψ(x) in case Ψ(x) ∈ B(u), or to the intersection of the boundary of B(u) and the line segment [x, Ψ(x)] otherwise. The minor cycles finish once x = Ψ(x) is a stable point. The objective (1/2)‖Ax − b‖² is decreasing in every minor cycle, and at least one new coordinate i ∈ N is set to 0 or to u(i). Thus, the number of minor cycles in any major cycle is at most n. One can use various centroid mappings satisfying a mild requirement on Ψ, described in Section 2.3.
We present a poly(n, κ) convergence analysis for the NNLS problem with coordinate updates, which corresponds to the Lawson-Hanson algorithm and its variants. We expect that similar arguments extend to the capacitated case. The proof has two key ingredients. First, we show linear convergence of the first order update steps (Theorem 4.5). Such a bound follows already from [22]; we present a simple self-contained proof exploiting properties of stable points and the uncapacitated setting. The second step of the analysis shows that in every poly(n, κ) iterations, we can identify a new variable that will never become zero in subsequent iterations (Theorem 4.1). The proof relies on proximity arguments: we show that for any iterate x and any subsequent iterate x′, the distance ‖x − x′‖ can be upper bounded in terms of n, κ, and the optimality gap at x.
In Section 5, we present preliminary computational experiments using randomly generated problem instances of various sizes. We compare the performance of different variants of our algorithm to standard gradient methods. For the choice of update steps, projected gradient performs much better than the coordinate updates used in the NNLS algorithms. We compare an 'oblivious' centroid mapping and one that chooses Ψ(x) as the nearest point to x in the centroid set in the 'local norm' (see Section 2.2). The latter appears to be significantly better. For choices of parameters n ≥ 2m, the running time of our method with projected gradient updates and the local norm mapping is typically within a factor two of TNT-NN, the state-of-the-art practical active set heuristic for NNLS [21], despite the fact that we only use simple linear algebra tools and have not made any attempts at practical speedups. The performance is often better than projected accelerated gradient descent, the best first order approach.
Proximity arguments and strongly polynomial algorithms Arguments that show strongly polynomial convergence by gradually revealing the support of an optimal solution are prevalent in combinatorial optimization. These date back to Tardos's [29] groundbreaking work giving the first strongly polynomial algorithm for minimum-cost flows. Our proof is closer to the dual 'abundant arc' arguments by Fujishige [12] and Orlin [24]. Tardos generalized the above result to general LPs, giving a running time dependence poly(n, log ∆(A)), where ∆(A) is the largest subdeterminant of the constraint matrix. In particular, her algorithm is strongly polynomial as long as the entries of the matrix are polynomially bounded integers. This framework was recently strengthened in [7] to poly(n, log κ(A)) running time for the circuit imbalance measure κ(A). They also highlight the role of Hoffman-proximity and give such a bound in terms of κ(A). We note that the above algorithms, along with many other strongly polynomial algorithms in combinatorial optimization, modify the problem directly once new information is learned about the optimal support. In contrast, our algorithm does not require any such modifications, nor any knowledge or estimate of the condition number κ. Arguments about the optimal support only appear in the analysis.
Strongly polynomial algorithms with poly(n, log κ(A)) running time bounds can also be obtained using layered least squares interior point methods. This line of work was initiated by Vavasis and Ye [30] using a related condition measure χ(A). An improved version that also established the relation between χ(A) and κ(A) was recently given by Dadush et al. [6]. We refer the reader to the survey [9] for properties and further applications of circuit imbalances.

Further related work
The Lawson-Hanson algorithm remains popular for the NNLS problem, and several variants are known. Bro and De Jong [3] and Myre et al. [21] proposed empirically faster variants. In particular, [21] allows bigger changes in the active and passive sets, thus waiving the linear independence of the passive variables, and reports a significant speedup. However, there are no theoretical results underpinning the performance of these heuristics.
Wolfe's minimum-norm-point algorithm [32] considers the variant of (P) where the box constraint x ∈ B(u) is replaced by ∑_{i∈N} x(i) = 1, x ≥ 0. It has been successfully employed as a subroutine in various optimization problems, e.g., submodular function minimization [14]; see also [1,11,13]. Beyond the trivial 2^n bound, the convergence analysis remained elusive; the first bound with 1/ε-dependence was given by Chakrabarty et al. [4] in 2014. Lacoste-Julien and Jaggi [17] gave a log(1/ε) bound, parametrized by the pyramidal width of the polyhedron. Recently, De Loera et al. [8] showed an example of exponential time behaviour of Wolfe's algorithm for the min-norm insertion rule (the analogue of a pivot rule); no exponential example is known for other insertion rules such as the linopt rule used in the application to submodular minimization. Our Update-and-Stabilize algorithm is also closely related to the Gradient Projection Method, see [5] and [23, Section 16.7]. This method also maintains a non-independent set of passive variables. For each gradient update, a more careful search is used in the gradient direction, 'bending' the movement direction whenever a constraint is hit. The analogues of stabilizer steps are conjugate gradient iterations. Thus, this method avoids the computationally expensive step of exact projections; on the other hand, finite termination is not guaranteed. We further discuss the relationship between the two algorithms in Section 6.
There are similarities between our algorithm and the Iteratively Reweighted Least Squares (IRLS) method that has been intensively studied since the 1960s [18,25]. For some p ∈ [0, ∞], A ∈ R^{M×N} and b ∈ R^M, the goal is to approximately solve min{‖x‖_p | Ax = b}. At each iteration, a weighted minimum-norm point problem min{∑_{i∈N} w^{(t)}(i)x(i)² | Ax = b} is solved, where the weights w^{(t)} are iteratively updated. The LP-feasibility problem Ax = b, 0 ≤ x ≤ 1 for finite upper bounds u = 1 can be phrased as an ℓ_∞-minimization problem min{‖x‖_∞ | Ax = b − A1/2}. Ene and Vladu [10] gave an efficient variant of IRLS for ℓ_1 and ℓ_∞-minimization; see their paper for further references. Some variants of our algorithm solve a weighted least squares problem with changing weights in the stabilizing steps. There are, however, significant differences between IRLS and our method. The underlying optimization problems are different, and IRLS does not find an exact optimal solution in finite time. Applied to LP in the ℓ_∞ formulation, IRLS satisfies Ax = b throughout while violating the box constraints 0 ≤ x ≤ u. In contrast, iterates of our algorithm violate Ax = b but maintain 0 ≤ x ≤ u. The role of the least squares subroutines is also rather different in the two settings.

Preliminaries
Notation We use N ⊕ M for the disjoint union (or direct sum) of copies of the two sets. For a matrix A ∈ R^{M×N}, i ∈ M and j ∈ N, we denote the ith row of A by A_i and the jth column by A_j. For any matrix X, we denote by X^⊤ the transpose of X. We let ‖·‖_p denote the ℓ_p vector norm; we use ‖·‖ to denote the Euclidean norm ‖·‖_2. For a matrix A ∈ R^{M×N}, we let ‖A‖ denote the spectral norm, that is, the ℓ_2 → ℓ_2 operator norm.
For any x, y ∈ R^M we define ⟨x, y⟩ = ∑_{i∈M} x(i)y(i). We will use this notation also in other dimensions. We let [x, y] := {λx + (1 − λ)y | λ ∈ [0, 1]} denote the line segment between the vectors x and y.

Elementary vectors and circuits
For a linear space W ⊆ R^N, g ∈ W is an elementary vector if g is a support-minimal nonzero vector in W, that is, no h ∈ W \ {0} exists such that supp(h) ⊊ supp(g), where supp denotes the support of a vector. We let F(W) ⊆ W denote the set of elementary vectors. A circuit in W is the support of some elementary vector; these are precisely the circuits in the associated linear matroid M(W).
The subspaces W = {0} and W = R^N are called trivial subspaces; all other subspaces are nontrivial. For a nontrivial subspace W, we define the circuit imbalance measure as κ(W) := max{|g(j)/g(i)| : g ∈ F(W), i, j ∈ supp(g)}, and set κ(W) = 1 for trivial subspaces. For a matrix A ∈ R^{M×N}, we use the notation κ(A) to denote κ(ker(A)).
The following theorem shows the relation to totally unimodular (TU) matrices. Recall that a matrix is totally unimodular (TU) if the determinant of every square submatrix is 0, +1, or −1.
Theorem 2.1 (Cederbaum, 1957, see [9, Theorem 3.4]). Let W ⊆ R^N be a linear subspace. Then κ(W) = 1 if and only if there exists a TU matrix A ∈ R^{M×N} such that W = ker(A).
We also note that if A ∈ Z^{M×N} is an integer matrix, then κ(A) ≤ ∆(A) for the maximum subdeterminant ∆(A).

Conformal circuit decompositions We say that a vector y ∈ R^N conforms to x ∈ R^N if x(i)y(i) > 0 for every i ∈ supp(y). A conformal circuit decomposition of a vector v ∈ W is a decomposition

v = h_1 + h_2 + . . . + h_ℓ,

where ℓ ≤ n and h_1, h_2, . . ., h_ℓ ∈ F(W) are elementary vectors that conform to v. A fundamental result on elementary vectors asserts the existence of a conformal circuit decomposition; see e.g., [15,27]. Note that there may be multiple conformal circuit decompositions of a vector.

Lemma 2.2. For every subspace W ⊆ R^N, every v ∈ W admits a conformal circuit decomposition.
For A ∈ R^{M×N}, let X_A := {(x, Ax) | x ∈ R^N} ⊆ R^{N⊕M}. A generalized path-circuit decomposition of v ∈ R^N with respect to A is a decomposition v = ∑_{k=1}^ℓ h_k such that, for each k, (h_k, Ah_k) is an elementary vector of X_A that conforms to (v, Av). Moreover, h_k is an inner vector in the decomposition if Ah_k = 0, and an outer vector otherwise.
We say that v ∈ R^N is cycle-free with respect to A if all generalized path-circuit decompositions of v contain outer vectors only. The following lemma will play a key role in analyzing our algorithms.

Lemma 2.3. For any v ∈ R^N that is cycle-free with respect to A, the norms ‖v‖_∞ and ‖v‖ can be bounded in terms of m, κ(X_A), and ‖Av‖.

Proof. Consider a generalized path-circuit decomposition v = ∑_{k=1}^ℓ h_k. By assumption, Ah_k ≠ 0 for each k. Thus, for every j ∈ supp(h_k) there exists an index of Ah_k whose magnitude bounds that of h_k(j) up to a factor κ(X_A). For the second inequality, note that for any outer vector (h_k, Ah_k) ∈ X_A, the columns in supp(h_k) must be linearly independent; consequently, the supports can be charged to disjoint coordinates, completing the proof.
Remark 2.4. We note that a similar argument shows that ‖A‖ ≤ mτ(A) · κ(X_A), where τ(A) ≤ m is the maximum size of supp(Ah) for an elementary vector (h, Ah) ∈ X_A.
Example 2.5. Let A ∈ R^{M×N} be the node-arc incidence matrix of a directed graph D = (M, N). The system Ax = b, x ∈ B(u) corresponds to a network flow feasibility problem. Here, b(i) is the demand of node i ∈ M, i.e., the inflow minus the outflow at i is required to be b(i). Recall that A is a TU matrix; consequently, (A | −I_M) is also TU, and κ(X_A) = 1. Our algorithm is strongly polynomial in this setting.
Note that inner vectors correspond to cycles and outer vectors to paths; this motivates the term 'generalized path-circuit decomposition.' We also note that τ(A) = 2, and thus ‖A‖ ≤ 2|M| in this case.
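To make Example 2.5 concrete, the following small sketch (in Python/NumPy; the helper name is our own) builds the node-arc incidence matrix of a directed triangle and checks that Ax computes inflow minus outflow at each node, so a circulation satisfies Ax = 0.

```python
import numpy as np

def incidence_matrix(num_nodes, arcs):
    """Node-arc incidence matrix: the column of arc (s, t) is e_t - e_s,
    so (Ax)(i) equals inflow minus outflow at node i."""
    A = np.zeros((num_nodes, len(arcs)))
    for j, (s, t) in enumerate(arcs):
        A[s, j] = -1.0   # outflow at the tail
        A[t, j] = 1.0    # inflow at the head
    return A

# Directed triangle 0 -> 1 -> 2 -> 0 with unit flow on every arc:
arcs = [(0, 1), (1, 2), (2, 0)]
A = incidence_matrix(3, arcs)
x = np.ones(3)
residual = A @ x   # a circulation has zero net inflow everywhere
```

Each column has one +1 and one −1, the hallmark of the TU structure that makes κ(X_A) = 1 here.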

Optimal solutions and proximity
Let Z(A, u) := {Ax | x ∈ B(u)}. Thus, Problem (P) is to find the point in Z(A, u) that is nearest to b with respect to the Euclidean norm. We note that if the upper bounds u are finite, Z(A, u) is called a zonotope. Throughout, we let p* denote the optimum value of (P). Note that whereas the optimal solution x* may not be unique, the vector b* := Ax* is unique by strong convexity; we have p* = (1/2)‖b* − b‖². The gradient of the objective (1/2)‖Ax − b‖² in (P) can be written as

g_x := A^⊤(Ax − b). (2)

We recall the first order optimality conditions.
Lemma 2.6. The point x ∈ B(u) is an optimal solution to (P) if and only if g_x(i) = 0 for all i ∈ J(x), g_x(i) ≥ 0 for all i ∈ I_0(x), and g_x(i) ≤ 0 for all i ∈ I_1(x).
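The conditions of Lemma 2.6 translate directly into a numerical test. The sketch below (Python/NumPy; the helper name and tolerance are our own choices) checks the three gradient conditions on a small NNLS example.

```python
import numpy as np

def is_optimal(A, b, u, x, tol=1e-9):
    """First-order optimality test of Lemma 2.6 for (P):
    g = A^T (Ax - b) must vanish on the free coordinates J(x),
    be >= 0 where x(i) = 0, and be <= 0 where x(i) = u(i)."""
    g = A.T @ (A @ x - b)
    for i in range(len(x)):
        if x[i] <= tol:                                   # i in I0(x)
            if g[i] < -tol:
                return False
        elif u[i] < np.inf and x[i] >= u[i] - tol:        # i in I1(x)
            if g[i] > tol:
                return False
        elif abs(g[i]) > tol:                             # i in J(x)
            return False
    return True

# NNLS instance with A = I, b = (1, -1): the optimum is x = (1, 0).
A = np.eye(2)
b = np.array([1.0, -1.0])
u = np.array([np.inf, np.inf])
opt = is_optimal(A, b, u, np.array([1.0, 0.0]))      # True
sub = is_optimal(A, b, u, np.array([0.0, 0.0]))      # False: g(1) < 0
```

At x = (1, 0) the gradient is (0, 1), which is zero on the free coordinate and nonnegative at the bound, certifying optimality.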
Using Lemma 2.3, we can bound the distance of any x from the nearest optimal solution.
Lemma 2.7. For any x ∈ B(u), there exists an optimal solution x* to (P) such that x − x* is cycle-free with respect to A; consequently, ‖x − x*‖ can be bounded via Lemma 2.3 in terms of ‖Ax − b*‖.

Proof. Let us select an optimal solution x* to (P) such that ‖x − x*‖ is minimal. We show that x − x* is cycle-free w.r.t. A; the statements then follow from Lemma 2.3. For a contradiction, assume a generalized path-circuit decomposition of x − x* contains an inner vector g, i.e., Ag = 0. By conformity of the decomposition, for x′ = x* + g we have x′ ∈ B(u) and Ax′ = Ax*. Thus, x′ is another optimal solution, but ‖x − x′‖ < ‖x − x*‖, a contradiction.

The centroid mapping
Let us denote by 3^N the set of all ordered pairs (I_0, I_1) of disjoint subsets I_0, I_1 ⊆ N, and let I* := {i ∈ N | u(i) < ∞}. For any (I_0, I_1) ∈ 3^N with I_1 ⊆ I*, we let L(I_0, I_1) := {x ∈ R^N | x(i) = 0 for all i ∈ I_0, x(i) = u(i) for all i ∈ I_1}. We call {Ax | x ∈ B(u) ∩ L(I_0, I_1)} ⊆ Z(A, u) a pseudoface of Z(A, u). We note that every face of Z(A, u) is a pseudoface, but there might be pseudofaces that do not correspond to any face.
We define a centroid set for (I_0, I_1) as

C(I_0, I_1) := arg min {‖Ay − b‖ | y ∈ L(I_0, I_1)}.

Proposition 2.8. For (I_0, I_1) ∈ 3^N with I_1 ⊆ I*, C(I_0, I_1) is an affine subspace of R^N, and for some w ∈ R^M, it holds that Ay = w for every y ∈ C(I_0, I_1).
The centroid mapping Ψ : B(u) → R^N is a mapping that satisfies Ψ(x) ∈ C(I_0(x), I_1(x)) for every x ∈ B(u). We say that x ∈ B(u) is a stable point if Ψ(x) = x. A simple, 'oblivious' centroid mapping arises by taking a minimum-norm point of the centroid set: Ψ(x) := arg min {‖y‖ | y ∈ C(I_0(x), I_1(x))}. However, this mapping has some undesirable properties. For example, we may have an iterate x that is already in C(I_0(x), I_1(x)), but Ψ(x) ≠ x. Instead, we aim for centroid mappings that move the current point 'as little as possible'. This can be formalized as follows. The centroid mapping Ψ is called cycle-free if the vector Ψ(x) − x is cycle-free w.r.t. A for every x ∈ B(u). The next claim describes a general class of cycle-free centroid mappings.
Lemma 2.9. For every x ∈ B(u), let D(x) ∈ R^{N×N}_{>0} be a positive diagonal matrix. Then,

Ψ(x) := arg min {‖D(x)(y − x)‖ | y ∈ C(I_0(x), I_1(x))} (6)

defines a cycle-free centroid mapping.
Proof. For a contradiction, assume y − x is not cycle-free for y = Ψ(x), that is, a generalized path-circuit decomposition contains an inner vector z. For y′ = y − z we have Ay′ = Ay, meaning that y′ ∈ C(I_0(x), I_1(x)). This is a contradiction, since ‖D(x)(y′ − x)‖ < ‖D(x)(y − x)‖ for any positive diagonal matrix D(x).
We emphasize that D(x) in the above statement is a function of x and can be any positive diagonal matrix. Note also that the diagonal entries for indices in I_0(x) ∪ I_1(x) do not matter. In our experiments, defining D(x) with diagonal entries 1/x(i) + 1/(u(i) − x(i)) for i ∈ J(x) performs particularly well. Intuitively, this choice aims to move the coordinates close to the boundary by less. The next proposition follows from Lagrangian duality, and provides a way to compute Ψ(x) as in (6) by solving a system of linear equations.
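A centroid mapping in the spirit of Lemma 2.9 can be computed with a pseudoinverse: substituting v = D(y_J − x_J), the D-nearest point of the centroid set corresponds to the minimum-norm least-squares solution of a reweighted system. The sketch below (Python/NumPy; the function name and tolerance are our own, and it uses the diagonal weights suggested above) is one way to realize this, not the paper's exact procedure.

```python
import numpy as np

def centroid_map(A, b, u, x, tol=1e-9):
    """Sketch of a cycle-free centroid mapping as in Lemma 2.9: among
    the minimizers y of ||Ay - b|| that keep x's tight coordinates
    fixed, return the one closest to x in the diagonal norm D(x)."""
    n = len(x)
    J = [i for i in range(n) if tol < x[i] < u[i] - tol]  # free coordinates
    y = x.astype(float).copy()
    if not J:
        return y
    # Weights 1/x(i) + 1/(u(i) - x(i)); the second term vanishes when
    # u(i) = inf (the NNLS case).
    d = np.array([1.0 / x[i] + (0.0 if np.isinf(u[i]) else 1.0 / (u[i] - x[i]))
                  for i in J])
    AJ = A[:, J]
    # With v = D (y_J - x_J), we need the minimum-norm least-squares
    # solution of (AJ D^{-1}) v = b - Ax, given by the pseudoinverse.
    v = np.linalg.pinv(AJ / d) @ (b - A @ x)
    y[J] = x[J] + v / d
    return y

# A = [1 1], b = 4: the centroid set is {y : y(1) + y(2) = 4}, and the
# D-weighted projection of x = (1, 2) lands at (1.2, 2.8).
y = centroid_map(np.array([[1.0, 1.0]]), np.array([4.0]),
                 np.array([np.inf, np.inf]), np.array([1.0, 2.0]))
```

Note that a point already in the centroid set is a fixed point of this mapping, exactly the property the oblivious mapping lacks.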
Proposition 2.10. For a partition N = I_0 ∪ I_1 ∪ J, the centroid set can be written as C(I_0, I_1) = {y ∈ L(I_0, I_1) | (A_J)^⊤(Ay − b) = 0}. For (I_0, I_1, J) = (I_0(x), I_1(x), J(x)) and D = D(x), the point y = Ψ(x) as in (6) can be obtained as the unique solution to a system of linear equations.

The Update-and-Stabilize Framework

Now we describe a general algorithmic framework MNPZ(A, b, u) for solving (P), shown in Algorithm 1.
Similarly to Wolfe's MNP algorithm, the algorithm comprises major and minor cycles. We maintain a point x ∈ B(u), and x is stable at the end of every major cycle. Each major cycle starts by calling the subroutine Update(x); the only general requirements on this subroutine are as follows:

(U1) for y = Update(x), y = x if and only if x is optimal to (P), and ‖Ay − b‖ < ‖Ax − b‖ otherwise, and

(U2) y minimizes ‖Az − b‖ over the line segment z ∈ [x, y].

Property (U1) can be obtained from any first order algorithm; we introduce some important examples in Section 3.1. Property (U2) might be violated if using a fixed step-length, which is a common choice. In order to guarantee (U2), we can post-process the first order update: choose y as the optimal point on the line segment [x, y′], where y′ is the update found by the fixed-step update.
The algorithm terminates in the first major cycle when x = Update(x). Within each major cycle, the minor cycles repeatedly use the centroid mapping Ψ. As long as w := Ψ(x) ≠ x, i.e., x is not stable, we set x := w if w ∈ B(u); otherwise, we set the next x as the intersection of the line segment [x, w] and the boundary of B(u). The requirement (U1) is already sufficient to show finite termination.
Algorithm 1: MNPZ(A, b, u)
Output: An optimal solution x to (P)
1 x ← initial point from B(u);

Lemma 3.1. Using any update subroutine satisfying (U1), Algorithm 1 terminates with an optimal solution within 3^n major cycles.

Proof. Requirement (U1) guarantees that if the algorithm terminates, it returns an optimal solution. We claim that the same sets (I_0, I_1) cannot appear as (I_0(x), I_1(x)) at the end of two different major cycles; this implies the bound on the number of major cycles. To see this, we note that for x = Ψ(x), the value ‖Ax − b‖ is determined by (I_0(x), I_1(x)) (Proposition 2.8), whereas the objective value strictly decreases between the stable points at the ends of consecutive major cycles.
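The control flow of the framework can be sketched as follows (Python/NumPy). The specific subroutine choices, a projected gradient Update with exact line search and the 'oblivious' centroid mapping, are illustrative only, and all names, the step length, and tolerances are our own; this is a minimal sketch, not the paper's exact pseudocode.

```python
import numpy as np

def mnpz(A, b, u, x0, max_cycles=1000, tol=1e-10):
    """Sketch of the Update-and-Stabilize framework: first order update
    steps (major cycles) alternating with stabilizing minor cycles."""
    m, n = A.shape
    x = x0.astype(float).copy()

    def line_search(x, y):
        # Optimal point on [x, y] for (1/2)||Az - b||^2, ensuring (U2).
        d = A @ (y - x)
        dd = d @ d
        if dd <= tol:
            return x
        t = np.clip(-(A @ x - b) @ d / dd, 0.0, 1.0)
        return x + t * (y - x)

    def update(x):
        # Projected gradient step with fixed pre-clip length 1/||A||^2.
        g = A.T @ (A @ x - b)
        y = np.clip(x - g / max(np.linalg.norm(A, 2) ** 2, tol), 0.0, u)
        return line_search(x, y)

    def centroid(x):
        # 'Oblivious' mapping: min-norm point of the centroid set.
        J = [i for i in range(n) if tol < x[i] < u[i] - tol]
        w = np.where(x >= u - tol, u, 0.0)   # fix tight coordinates
        r = b - A @ w                        # residual of the fixed part
        if J:
            w[J] = np.linalg.pinv(A[:, J]) @ r
        return w

    for _ in range(max_cycles):              # major cycles
        y = update(x)
        if np.linalg.norm(y - x) <= tol:
            return x                         # (U1): x is optimal
        x = y
        while True:                          # minor cycles
            w = centroid(x)
            if np.linalg.norm(w - x) <= tol:
                break                        # x is stable
            lam = 1.0                        # largest feasible step in [x, w]
            for i in range(n):
                if w[i] < -tol:
                    lam = min(lam, x[i] / (x[i] - w[i]))
                elif w[i] > u[i] + tol:
                    lam = min(lam, (u[i] - x[i]) / (w[i] - x[i]))
            x = np.clip(x + lam * (w - x), 0.0, u)
    return x
```

Each pass of the inner loop either reaches a stable point or makes a new coordinate tight, matching the at-most-n bound on minor cycles per major cycle.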

The Update subroutine
We can implement the Update(x) subroutine satisfying (U1) and (U2) using various first order methods for constrained optimization. Recall the gradient g_x from (2); we use g = g_x when x is clear from the context. The following property of stable points can be compared to the optimality condition in Lemma 2.6.

Lemma 3.2. If x(= Ψ(x)) is a stable point, then g_x(j) = 0 for all j ∈ J(x).
We now describe three classical options. We stress that the centroid mapping Ψ can be chosen independently from the update step.
The Frank-Wolfe update The Frank-Wolfe or conditional gradient method is applicable only in the case when u(i) is finite for every i ∈ N. In every update step, we start by computing ȳ as a minimizer of the linear objective ⟨g, y⟩ over B(u). We set Update(x) := x if ⟨g, ȳ⟩ = ⟨g, x⟩; otherwise, y = Update(x) is selected so that y minimizes (1/2)‖Ay − b‖² on the line segment [x, ȳ].
Clearly, ȳ(i) = 0 whenever g(i) > 0, and ȳ(i) = u(i) whenever g(i) < 0. However, ȳ(i) can be chosen arbitrarily if g(i) = 0. In this case, we keep ȳ(i) = x(i); this will be significant to guarantee stability of solutions in the analysis.
The Projected Gradient update The projected gradient update moves in the opposite gradient direction to ȳ := x − λg for some step-length λ > 0, and obtains the output y = Update(x) as the projection of ȳ to the box B(u). This projection simply changes every negative coordinate ȳ(i) to y(i) = 0 and every ȳ(i) > u(i) to y(i) = u(i). To ensure (U2), we can perform an additional step that replaces y by the point y′ ∈ [x, y] that minimizes (1/2)‖Ay′ − b‖². Consider now an NNLS instance (i.e., u(i) = ∞ for all i ∈ N), and let x be a stable point. Recall that I_1(x) = ∅ in the NNLS setting. Lemma 3.2 allows us to write the projected gradient update in the following simple form that also enables the use of optimal line search. Define

z_x(i) := −g_x(i) if i ∈ J(x), or if i ∈ I_0(x) and g_x(i) < 0; and z_x(i) := 0 otherwise, (8)

and use z = z_x when clear from the context. According to Lemma 2.6, x is optimal to (P) if and only if z = 0. We use the optimal line search y := arg min {(1/2)‖Ay′ − b‖² | y′ = x + λz, λ ≥ 0}. If z ≠ 0, this can be written explicitly as

y = x + (‖z‖²/‖Az‖²) · z. (9)

To verify this formula, we note that ‖z‖² = −⟨g, z⟩, since for every i ∈ N either z(i) = 0 or z(i) = −g(i).
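The NNLS projected gradient step with optimal line search fits in a few lines. The sketch below (Python/NumPy; the function name and tolerance are ours) implements the direction z of (8) and the step length ‖z‖²/‖Az‖² of (9).

```python
import numpy as np

def pg_update(A, b, x, tol=1e-12):
    """NNLS projected gradient step (8)-(9) from a stable point x:
    z equals -g on free coordinates and on coordinates at 0 with
    negative gradient; the optimal step length is ||z||^2/||Az||^2."""
    g = A.T @ (A @ x - b)
    z = np.where((x > tol) | (g < 0), -g, 0.0)
    if np.linalg.norm(z) <= tol:
        return x        # z = 0 certifies optimality (Lemma 2.6)
    # At a stable point g vanishes on J(x), so z >= 0 and the line
    # search keeps x + lambda*z nonnegative.
    lam = (z @ z) / np.linalg.norm(A @ z) ** 2
    return x + lam * z

# A = I, b = (1, -1): from the stable point x = 0, a single step
# reaches the optimum (1, 0).
y = pg_update(np.eye(2), np.array([1.0, -1.0]), np.zeros(2))
```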
Coordinate update Our third update rule is the one used in the Lawson-Hanson algorithm. Given a stable point x ∈ B(u), we select a coordinate j ∈ N where either j ∈ I_0(x) and g(j) < 0, or j ∈ I_1(x) and g(j) > 0, and set y such that y(i) = x(i) if i ≠ j, and y(j) is chosen in [0, u(j)] so that (1/2)‖Ay − b‖² is minimized. As in the Lawson-Hanson algorithm, we can maintain basic solutions throughout.

Lemma 3.3. Assume the columns of A_J are linearly independent for J = J(x). Then, the columns of A_{J′} are also linearly independent for J′ = J(y) = J ∪ {j}, where y = Update(x).
Proof. For a contradiction, assume A_j = A_J w for some w ∈ R^J. Then, g(j) = ⟨A_j, Ax − b⟩ = ⟨w, (A_J)^⊤(Ax − b)⟩ = 0 by Lemma 3.2, contradicting the choice of j.

Let us start with x = 0, i.e., J(x) = I_1(x) = ∅, I_0(x) = N. Then, the columns of A_{J(x)} remain linearly independent throughout. Hence, every stable solution x is a basic solution to (P). Note that whenever the columns of A_{J(x)} are linearly independent, C(I_0(x), I_1(x)) contains a single point; hence, Ψ(x) is uniquely defined.
Consider now the NNLS setting. For z as in (8), let us return y = x if z = 0. Otherwise, let j ∈ arg max_k z(k); note that j ∈ I_0(x). Let

y := x + (z(j)/‖A_j‖²) · e_j, (10)

where e_j is the jth unit vector. The following lemma is immediate. In the NNLS setting, (U2) is guaranteed for the updates described above. For the general form with upper bounds, we can post-process as noted above to ensure (U2).
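The coordinate rule differs from the projected gradient step only in that it moves a single coordinate. A sketch (Python/NumPy; names and tolerance are ours) of the update (10):

```python
import numpy as np

def coord_update(A, b, x, tol=1e-12):
    """Lawson-Hanson style coordinate step (10): pick j maximizing
    z(k) and minimize the objective over the single coordinate x(j)."""
    g = A.T @ (A @ x - b)
    z = np.where((x > tol) | (g < 0), -g, 0.0)
    if z.max() <= tol:
        return x                 # z = 0: x is optimal
    j = int(np.argmax(z))        # at a stable point, j is in I0(x)
    y = x.astype(float).copy()
    y[j] += z[j] / (A[:, j] @ A[:, j])
    return y

# Same toy instance as before: one coordinate step from x = 0 already
# reaches the optimum (1, 0).
y = coord_update(np.eye(2), np.array([1.0, -1.0]), np.zeros(2))
```

Repeated calls leave an optimal point unchanged, which is exactly the termination test x = Update(x).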

Lemma 3.4. The Frank-Wolfe, projected gradient, and coordinate update rules all satisfy (U1) and (U2).
Cycle-free update rules Definition 3.5. We say that Update(x) is a cycle-free update rule if for every x ∈ B(u) and y = Update(x), x − y is cycle-free w.r.t. A.
Lemma 3.6. The Frank-Wolfe, projected gradient, and coordinate updates are all cycle-free.
Proof. Each of the three rules satisfies that for any x ∈ B(u) with gradient g and y = Update(x), y − x conforms to −g. We show that this implies the required property.
For a contradiction, assume that a generalized path-circuit decomposition of y − x contains an inner vector h. Thus, h ≠ 0, Ah = 0, and h conforms to −g. Consequently, ⟨g, h⟩ < 0. Recalling the form of g from (2), we get ⟨g, h⟩ = ⟨Ax − b, Ah⟩ = 0, a contradiction.

Analysis
Our main goal is to show the following convergence bound. The proof will be given in Section 4.3. Recall that in an NNLS instance, all upper capacities are infinite.
Theorem 4.1. Consider an NNLS instance of (P), and assume we use a cycle-free centroid mapping. Algorithm 1 terminates with an optimal solution in O(n · m² · κ²(X_A) · ‖A‖² · log(n + κ(X_A))) major cycles using projected gradient updates (9), and in O(n² · m² · κ²(X_A) · ‖A‖² · log(n + κ(X_A))) major cycles using coordinate updates (10), when initialized with x = 0. In both cases, the total number of minor cycles is O(n² · m² · κ²(X_A) · ‖A‖² · log(n + κ(X_A))).

Proximity bounds
We show that if using a cycle-free update rule and a cycle-free centroid mapping, the movement of the iterates in Algorithm 1 can be bounded by the change in the objective value. First, a nice property of the centroid set is that the movement of Ax directly relates to the decrease in the objective value. Namely,

Lemma 4.2. For x ∈ B(u), let y ∈ C(I_0(x), I_1(x)). Then,

‖Ax − Ay‖² = ‖Ax − b‖² − ‖Ay − b‖².

Consequently, if Ψ is a cycle-free centroid mapping and y = Ψ(x), then x − y is cycle-free w.r.t. A, and ‖x − y‖ can be bounded via Lemma 2.3 in terms of ‖Ax − Ay‖.

Proof. Since x and y agree outside J = J(x), we have ⟨Ay − b, Ax − Ay⟩ = ⟨(A_J)^⊤(Ay − b), x_J − y_J⟩ = 0, where the equality follows since (A_J)^⊤(Ay − b) = 0 by Proposition 2.10. Thus, ‖Ax − b‖² = ‖Ay − b‖² + ‖Ax − Ay‖². The second part follows from Lemma 2.3.
Next, let us consider the movement of x during a call to Update(x).
Lemma 4.3. Let x ∈ B(u) and y = Update(x). Then,

‖Ax − Ay‖² ≤ ‖Ax − b‖² − ‖Ay − b‖².

If using a cycle-free update rule, then x − y is cycle-free w.r.t. A, and ‖x − y‖ can be bounded via Lemma 2.3 in terms of ‖Ax − Ay‖.

Proof. From property (U2), it is immediate that ⟨Ay − b, Ax − Ay⟩ ≥ 0. This implies the first claim. The second claim follows from the definition of a cycle-free update rule and Lemma 2.3.
Lemma 4.4. Let x ∈ B(u), and let x′ be an iterate obtained by t consecutive major or minor updates of Algorithm 1 using a cycle-free update rule and a cycle-free centroid mapping, starting from x. Then, ‖x − x′‖ can be bounded in terms of t and the decrease ‖Ax − b‖² − ‖Ax′ − b‖².

Proof. Let us consider the (major and minor cycle) iterates x = x^(k), x^(k+1), . . ., x^(k+t) = x′. From the triangle inequality and the arithmetic-quadratic mean inequality,

‖x − x′‖ ≤ ∑_{i=0}^{t−1} ‖x^(k+i) − x^(k+i+1)‖ ≤ √t · (∑_{i=0}^{t−1} ‖x^(k+i) − x^(k+i+1)‖²)^{1/2}.

The statement then follows using the bounds in Lemma 4.2 and Lemma 4.3.

Geometric convergence of the projected gradient and coordinate updates
We present a simple convergence analysis for the NNLS setting. For the general capacitated setting, similar bounds should follow from [22]. Recall that η(x) denotes the optimality gap at x, i.e., η(x) = (1/2)‖Ax − b‖² − p*.

Theorem 4.5. Consider an NNLS instance of (P), and let x ≥ 0 be a stable point. Then for y = Update(x) using the projected gradient update (9), the gap η(y) decreases by a multiplicative factor depending on m, κ(X_A), and ‖A‖; using coordinate updates as in (10), it decreases by a factor that additionally depends on n. Consequently, either with projected gradient or with coordinate updates, after performing O(nm² · κ²(X_A) · ‖A‖²) minor and major cycles from an iterate x, we obtain an iterate x′ with η(x′) ≤ η(x)/2.
Let us formulate the update progress using optimal line search.

Lemma 4.6. For a stable point x ≥ 0, the update (9) satisfies

(1/2)‖Ax − b‖² − (1/2)‖Ay − b‖² = ‖z‖⁴/(2‖Az‖²) ≥ ‖z‖²/(2‖A‖²),

and the update (10) satisfies

(1/2)‖Ax − b‖² − (1/2)‖Ay − b‖² = z(j)²/(2‖A_j‖²).

Proof. For the update (9) with stepsize λ = ‖z‖²/‖Az‖², we have

(1/2)‖Ax − b‖² − (1/2)‖A(x + λz) − b‖² = −λ⟨g, z⟩ − (λ²/2)‖Az‖² = λ‖z‖² − (λ²/2)‖Az‖² = ‖z‖⁴/(2‖Az‖²),

where the second equality uses ⟨g, z⟩ = −‖z‖², as noted previously. The inequality follows by using ‖Az‖ ≤ ‖A‖ · ‖z‖. The proof is similar for the update (10). Here, y = x + λe_j, where e_j is the jth unit vector, and λ = z(j)/‖A_j‖². The bound follows by noting that ⟨Ax − b, Ae_j⟩ = ⟨g, e_j⟩ = −z(j).
We now use Lemma 2.7 to bound ‖z‖.

Lemma 4.7. For a stable point x ≥ 0 and the update direction z = z_x as in (8), the norm ‖z‖² can be bounded from below in terms of η(x), m, and κ(X_A).

Proof. Let x* ≥ 0 be an optimal solution to (P) as in Lemma 2.7, and b* = Ax*. Using convexity,

η(x) ≤ ⟨g, x − x*⟩ ≤ ⟨z, x* − x⟩ ≤ ‖z‖ · ‖x* − x‖,

where the second inequality follows by noting that for each i ∈ N, either z(i) = −g(i), or z(i) = 0 and g(i)(x*(i) − x(i)) ≥ 0. From the Cauchy-Schwarz inequality and Lemma 2.7, ‖x* − x‖ can be bounded in terms of ‖Ax − b*‖. Recalling that η(x) = (1/2)‖Ax − b‖² − (1/2)‖b* − b‖², it remains to bound ‖Ax − b*‖² in terms of η(x); this amounts to ⟨b* − b, Ax − b*⟩ ≥ 0, which is implied by the first order optimality condition at x*. This proves (11), and hence the lemma follows.
Proof of Theorem 4.5. The proof of the bound for projected gradient updates is immediate from Lemma 4.6 and Lemma 4.7. For coordinate updates, recall that j is selected as the index of the largest component z(j). Thus, z(j)² ≥ ‖z‖²/n, and ‖A_j‖ ≤ ‖A‖.
For the second part, the statement follows for projected gradient updates by the first part and by noting that there are at most n minor cycles in every major cycle. For coordinate updates, every major cycle adds one component to J(x) whereas every minor cycle removes at least one. Hence, the total number of minor cycles is at most m plus the total number of major cycles.

Overall convergence bounds
In this subsection, we prove Theorem 4.1. Using Lemma 4.4 and Theorem 4.5, we can derive the following stronger proximity bound:

Lemma 4.8. Consider an NNLS instance of (P). Let x ≥ 0 be an iterate of Algorithm 1 using projected gradient or coordinate updates, and let x′ ≥ 0 be any later iterate. Then, ‖x − x′‖ ≤ Θ · η(x)^{1/2} for a value Θ depending on n, m, κ(X_A), and ‖A‖.

Proof. According to Theorem 4.5, after O(nm² · κ²(X_A) · ‖A‖²) major and minor cycles, we get to an iterate x″ with η(x″) ≤ η(x)/4. Thus, Lemma 4.4 bounds the distance between x and x″. Let us now define x^(k) as the iterate following x after T_k major and minor cycles; we let x^(0) := x. By Theorem 4.5, η(x^(k)) ≤ η(x)/4^k, and similarly as above, for each k = 0, 1, 2, . . . we can bound the distance between x^(k) and x^(k+1). The same bound also holds for any iterate x′ between x^(k) and x^(k+1). Using these bounds and the triangle inequality, the resulting geometric series yields the claimed bound for any iterate x′ after x. This completes the proof.
We need one more auxiliary lemma.
Lemma 4.9. Consider an NNLS instance of (P), and let x ≥ 0 be a stable point. Let x̄ ≥ 0 be such that for each i ∈ N, either x̄(i) = x(i), or x̄(i) = 0 < x(i). Then,

Proof. The claim is equivalent to showing

For the threshold Θ as in Lemma 4.8 and for any x ≥ 0, let us define

The following is immediate from Lemma 4.8.
Lemma 4.10. Consider an NNLS instance of (P). Let x ≥ 0 be an iterate of Algorithm 1 using projected gradient updates, and let x′ ≥ 0 be any later iterate. Then,

We are ready to prove Theorem 4.1.
Proof of Theorem 4.1. At any point of the algorithm, let J⋆ denote the union of the sets J⋆(x) over all iterations thus far. Consider a stable iterate x at the beginning of any major cycle, and let ) major and minor cycles we arrive at an iterate x′ such that η(x′) < ε. We note that log(n + κ(X according to Remark 2.4. We show that

From here, we can conclude that J⋆ was extended between the iterates x and x′. This may happen at most n times, leading to the claimed bound on the total number of major and minor cycles. Using Theorem 4.5, we also obtain the respective bounds on the number of major cycles for the two different updates. For a contradiction, assume that (12) does not hold. Thus, for every i ∈ I_0(x), we have

By the above assumption, ‖x − x′‖_∞ ≤ Θε, and therefore ‖Ax − Ax′‖ ≤ √n Θ ‖A‖ ε. From Lemma 4.9, we can bound

Recall that since x is a stable solution,

Since x̄ is a feasible solution to this program, it follows that ‖Ax̄ − b‖² ≥ ‖Ax − b‖². We get

in contradiction with the choice of ε.

Computational Experiments
We present preliminary computational experiments with different versions of our algorithm, and compare them to standard gradient methods and existing NNLS implementations. The experiments were implemented and executed in MATLAB version R2023a on a personal computer with an 11th Gen Intel(R) Core(TM) i7-11370H @ 3.30GHz and 16GB of memory. We considered two families of randomly generated NNLS instances. In Appendix A, we also present experiments for capacitated instances (finite u(i) values).
We tested each combination of two update methods: projected gradient (PG) and coordinate (C); and two centroid mappings: the 'oblivious' mapping (5) and the 'local norm' mapping (6) with diagonal entries 1/x(i), i ∈ N. Recall that for coordinate updates starting from x = 0, the centroid mapping is unique by Lemma 3.3.
Our first benchmarks are the projected gradient (PG) and projected fast (accelerated) gradient (PFG) methods. In contrast to our algorithms, these do not terminate finitely. We stopped these algorithms once they found a near-optimal solution within a given accuracy threshold.
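A minimal sketch of these two benchmarks, written in Python rather than MATLAB; the step size 1/L with L = ‖A‖², the Nesterov momentum schedule, and the projected-gradient stopping criterion are our assumptions, not details taken from the paper's implementation:

```python
import numpy as np

def projected_gradient_nnls(A, b, iters=500, accelerated=False, tol=1e-10):
    """Minimize ||Ax - b||^2 over x >= 0 by projected (fast) gradient.

    Unlike active set methods, this does not terminate finitely, so we
    stop once the projected-gradient (KKT) residual drops below `tol`.
    """
    n = A.shape[1]
    L = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of the gradient
    x = np.zeros(n)
    y = x.copy()                          # extrapolated point (PFG only)
    t = 1.0
    for _ in range(iters):
        g = A.T @ (A @ y - b)             # gradient at the extrapolated point
        x_new = np.maximum(y - g / L, 0.0)  # gradient step + projection to x >= 0
        if accelerated:
            t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2
            y = x_new + ((t - 1) / t_new) * (x_new - x)  # Nesterov momentum
            t = t_new
        else:
            y = x_new
        # projected-gradient residual: zero exactly at a KKT point
        done = np.linalg.norm(
            x_new - np.maximum(x_new - A.T @ (A @ x_new - b), 0.0)) < tol
        x = x_new
        if done:
            break
    return x
```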
Further, we also compare our algorithms against the standard MATLAB implementation of the Lawson-Hanson algorithm, lsqnonneg, and against the TNT-NN implementation from [21]. We note that lsqnonneg and the coordinate update version of our algorithm are essentially the same.

Generating instances

We generated two families of instances. In the rectangular experiments, n ≥ 2m, and in the near-square experiments, m ≤ n ≤ 1.1m. In both cases, the entries of the m × n matrix A were chosen independently and uniformly at random from the interval [−0.5, 0.5]. In the rectangular experiments, the entries of b were also chosen independently and uniformly at random from [−0.5, 0.5]. Thus, the underlying LP Ax = b, x ≥ 0 may or may not be feasible.
For the near-square instances, such a random choice of b leads to infeasible instances with high probability; we used this method to generate infeasible instances. We also constructed families where the LP is feasible, as follows. For a sparsity parameter χ ∈ (0, 1], we sampled a subset J ⊆ N by adding each variable independently with probability χ, and generated coefficients {z_j : j ∈ J} independently and uniformly at random from [0, 1]. We then set b = Σ_{j∈J} A_j z_j.
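The two generation procedures above can be sketched as follows (Python; the function names are ours, and the paper's original MATLAB code may differ in details):

```python
import numpy as np

def rectangular_instance(m, n, rng):
    """Rectangular family (n >= 2m): both A and b uniform on [-0.5, 0.5],
    so the LP Ax = b, x >= 0 may or may not be feasible."""
    A = rng.uniform(-0.5, 0.5, (m, n))
    b = rng.uniform(-0.5, 0.5, m)
    return A, b

def near_square_feasible_instance(m, n, chi, rng):
    """Near-square feasible family (m <= n <= 1.1 m): pick a random support
    J with sparsity chi, coefficients z_j ~ U[0, 1] on J, and set
    b = sum_{j in J} A_j z_j, which guarantees feasibility."""
    A = rng.uniform(-0.5, 0.5, (m, n))
    J = rng.random(n) < chi                      # each index kept w.p. chi
    z = np.where(J, rng.uniform(0.0, 1.0, n), 0.0)
    b = A @ z
    return A, b, z
```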

Computational results
We stopped each algorithm when its computation time reached 60 seconds. For each (m, n), we ran all the algorithms 5 times; the results shown are averages over the 5 runs. Table 1 shows the overall computation times for rectangular instances; values in brackets show the number of trials whose computation time exceeded 60 seconds. Tables 2 and 3 show the number of major cycles and the total number of minor cycles, respectively. Table 4 shows the overall computation times for near-square instances. The status 'I' denotes infeasible instances and 'F' feasible instances, with the sparsity parameter χ in brackets, taking values 0.1, 0.5, and 1. Tables 5 and 6 show the number of major cycles and the total number of minor cycles, respectively, for near-square instances.
Comparison of the results

For rectangular instances, the 'local-norm' update (6) performs significantly better than the 'oblivious' update (5). The 'oblivious' updates are also outperformed by the coordinate updates, both in terms of running time and in the total number of minor cycles.
As noted above, while our algorithm with coordinate updates and lsqnonneg are essentially the same algorithm, the running time of the latter is better by roughly a factor of two. This is likely because lsqnonneg uses more efficient linear algebra routines, in contrast to our more basic implementation.
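For readers outside MATLAB, SciPy's scipy.optimize.nnls plays the same role as lsqnonneg: an active set NNLS solver in the Lawson-Hanson lineage that terminates finitely with an exact optimum, unlike the first order benchmarks. A minimal usage sketch on a random rectangular instance (the instance itself is ours, for illustration):

```python
import numpy as np
from scipy.optimize import nnls

# Random rectangular instance, generated as in our experiments.
rng = np.random.default_rng(0)
A = rng.uniform(-0.5, 0.5, (30, 80))
b = rng.uniform(-0.5, 0.5, 30)

# nnls returns the optimal x >= 0 and the residual norm ||Ax - b||.
x, rnorm = nnls(A, b)
```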
The algorithm TNT-NN from [21] is a fast practical algorithm using a number of heuristics, representing the state of the art in active set methods for NNLS. Notably, our algorithm with 'local-norm' updates (6) is almost always within a factor of two of TNT-NN for rectangular instances, and performs better in some cases. This is despite the fact that we use only a basic implementation, without more efficient linear algebra methods or further heuristics. For rectangular instances, TNT-NN and the 'local-norm' updates also outperform fast projected gradient in most cases.
The picture is more mixed for near-square instances, with a marked difference between feasible and infeasible instances. The 'local-norm' and 'oblivious' update rules perform similarly, with a small number of major cycles. The number of minor cycles is much higher for infeasible instances. For infeasible instances, coordinate updates are faster than either variant of the PG update rule, while PG updates are faster for feasible instances.
The algorithm TNT-NN is consistently faster than our algorithm, with better running times for infeasible instances. For projected gradient and projected fast gradient, the running times are similar to those of TNT-NN, except for feasible instances with sparsity parameter χ = 1, where they do not terminate within the 60-second limit in most cases. In contrast, these appear to be the easiest instances for our method with PG updates and the 'local-norm' mapping.

Concluding Remarks
We have proposed a new 'Update-and-Stabilize' framework for the minimum-norm-point problem (P). Our method combines classical first order methods with 'stabilizing' steps using the centroid mapping, which amounts to computing a projection onto an affine subspace. Our algorithm always terminates finitely, and is strongly polynomial when the associated circuit imbalance measure is constant. In particular, this gives the first such convergence bound for the Lawson-Hanson algorithm.
There is scope for further improvements, both in the theoretical analysis and in practical implementations. In this paper, we analyzed the running time only for uncapacitated instances. Combined with existing results from [22], we expect that similar bounds can be shown for capacitated instances. We note that for the analysis, it would suffice to run minor cycles only occasionally, say, after every O(n) gradient updates. From a practical perspective, however, running minor cycles after every update appears to be highly beneficial in most cases. Rigorous computational experiments using standard families of LP benchmarks are left for future work.
Future work should also compare the performance of our algorithms to the gradient projection method [5,23], transferring techniques from that method to our algorithm and vice versa. We note that for NNLS instances, starting from a stable point, our algorithm already finds the optimal gradient update. However, a search similar to that in gradient projection methods may be useful in the capacitated case. In the other direction, we note that the conjugate gradient iterations used in gradient projection do not correspond to an explicit choice of a centroid mapping. A possible enhancement of gradient projection could come from approximating a 'local-norm' objective as in (6) in the second stage.
We also point out that the 'local-norm' selection rule (6) was inspired by the affine scaling method; the important difference is that our algorithm moves all the way to the boundary, whereas affine scaling stays in the interior throughout.
x ∈ C(I_0(x), I_1(x)) = C(I_0, I_1); thus, ‖Ax − b‖ = min{‖Az − b‖ | z ∈ L(I_0, I_1)}. By (U1), ‖Ay − b‖ < ‖Ax − b‖ at the beginning of every major cycle. Moreover, it follows from the definition of the centroid mapping that ‖Ax − b‖ is non-increasing in every minor cycle. To bound the number of minor cycles in a major cycle, note that the set I_0(x) ∪ I_1(x) ⊆ N is extended in every minor cycle.

Table 1 :
Computation time (in sec) for uncapacitated rectangular instances

Table 3 :
The total # of minor cycles for uncapacitated rectangular instances

Table 4 :
Computation time (in sec) for uncapacitated near-square instances

Table 5 :
# of major cycles for uncapacitated near-square instances

Table 6 :
Total # of minor cycles for uncapacitated near-square instances

Table 9 :
The total # of minor cycles for capacitated rectangular instances

Table 10 :
Computation time (in sec) for capacitated near-square instances

Table 11 :
# of major cycles for capacitated near-square instances