1 Introduction

We study the minimum-norm-point (MNP) problem

$$\begin{aligned} \text {Minimize }\tfrac{1}{2}||Ax-b||^2\text { subject to }{\textbf{0}}\le x\le u\, ,\, x\in {\mathbb {R}}^N\, , \end{aligned}$$
(P)

where m and n are positive integers, \(M=\{1,\cdots ,m\}\) and \(N=\{1,\cdots ,n\}\), \(A\in {\mathbb {R}}^{M\times N}\) is a matrix with rank \({\text {rk}}(A)=m\), \(b\in {\mathbb {R}}^M\), and \(u\in ({\mathbb {R}}\cup \{\infty \})^N\). We will use the notation \({\textbf{B}}(u):=\{x\in {\mathbb {R}}^N\mid {\textbf{0}}\le x\le u\}\) for the feasible set. The problem P generalizes the linear programming (LP) feasibility problem: the optimum value is 0 if and only if \(Ax=b\), \(x\in {\textbf{B}}(u)\) is feasible. The case \(u(i)=\infty \) for all \(i\in N\) is also known as the nonnegative least squares (NNLS) problem, a fundamental problem in numerical analysis.
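As a point of reference, P can be solved numerically with off-the-shelf bounded least-squares solvers. The following is a minimal sketch (not the algorithm developed in this paper) using SciPy's lsq_linear on a small random NNLS instance; the data and parameter choices are illustrative.

```python
import numpy as np
from scipy.optimize import lsq_linear

rng = np.random.default_rng(0)
m, n = 5, 12
A = rng.uniform(-0.5, 0.5, size=(m, n))
b = rng.uniform(-0.5, 0.5, size=m)
u = np.full(n, np.inf)          # u(i) = infinity for all i: an NNLS instance

# minimize 0.5*||Ax - b||^2 subject to 0 <= x <= u
res = lsq_linear(A, b, bounds=(0.0, u))
print(res.x)                    # approximate minimizer
print(res.cost)                 # objective value 0.5*||Ax - b||^2
```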

Two extensively studied approaches for MNP and NNLS are active set methods and first order methods. An influential active set method was proposed by Lawson and Hanson [19, Chapter 23] in 1974. Variants of this algorithm were also proposed by Stoer [28], Björck [2], Wilhelmsen [31], and Leichner, Dantzig, and Davis [20]. Closely related is Wolfe’s classical minimum-norm-point algorithm [32]. These are iterative methods that partition the variables into active variables, fixed at the lower or upper bounds, and passive (inactive) variables. In the main update steps, these algorithms keep the active variables at their bounds and perform unconstrained optimization over the passive variables. Such update steps require solving systems of linear equations. In all these methods, the set of columns corresponding to passive variables is linearly independent. The combinatorial nature of these algorithms makes it possible to show termination with an exact optimal solution in a finite number of iterations. However, obtaining subexponential convergence bounds for such active set algorithms has remained elusive; see Sect. 1.1 for more work on NNLS and Wolfe’s algorithm.

In the context of first order methods, the formulation P belongs to a family of problems for which Necoara, Nesterov, and Glineur [22] showed linear convergence bounds. That is, the number of iterations needed to find an \(\varepsilon \)-approximate solution depends linearly on \(\log (1/\varepsilon )\). Such convergence has been known for strongly convex functions, but this property does not hold for P. However, [22] shows that restricted variants of strong convexity also suffice for linear convergence. For problems of the form P, the required property follows using Hoffman-proximity bounds [16]; see [26] and the references therein for recent results on Hoffman-proximity. In contrast to active set methods, first order methods are computationally cheaper as they do not require solving systems of linear equations. On the other hand, they do not find exact solutions.

We propose a new algorithmic framework for the minimum-norm-point problem P that can be seen as a blend of active set and first order methods. Our algorithm performs stabilizing steps between first order updates, and terminates with an exact optimal solution in a finite number of iterations. Moreover, we show poly\((n,\kappa )\) running time bounds for multiple instantiations of the framework, where \(\kappa \) is the circuit imbalance measure associated with the matrix \((A\mid I_M)\) (see Sect. 2.1). This gives strongly polynomial bounds whenever \(\kappa \) is constant; in particular, \(\kappa =1\) for network flow feasibility. We note that if \(A\in \mathbb {Z}^{M\times N}\), then \(\kappa \le \Delta (A)\) for the maximum subdeterminant \(\Delta (A)\). Still, \(\kappa \) can be exponential in the encoding length of the matrix.

The stabilizing step is similar to the one used by Björck [2], who considered the same formulation P. The Lawson–Hanson algorithm for the NNLS problem can be seen as a special instantiation of our framework, and we obtain an \(O(n^{2}m^2\cdot \kappa ^2\cdot \Vert A\Vert ^2\cdot \log (n+\kappa ))\) iteration bound for it. These algorithms only use coordinate updates as first order steps, and maintain linear independence of the columns corresponding to passive variables. Our framework is significantly more general: we waive the linear independence requirement and allow for arbitrary active and passive sets. This provides much additional flexibility, as our framework can be implemented with a variety of first order methods. This feature also yields a significant advantage in our computational experiments.

Overview of the algorithm A key concept in our algorithm is the centroid mapping, defined as follows. For disjoint subsets \(I_0,I_1\subseteq N\), we let \({\textbf{L}}(I_0,I_1)\) denote the affine subspace of \({\mathbb {R}}^N\) where \(x(i)=0\) for \(i\in I_0\) and \(x(i)=u(i)\) for \(i\in I_1\). For \(x\in {\textbf{B}}(u)\), let \(I_0(x)\) and \(I_1(x)\) denote the subsets of coordinates i with \(x(i)=0\) and \(x(i)=u(i)\), respectively. The centroid mapping \({\Psi }: {\textbf{B}}(u)\rightarrow {\mathbb {R}}^N\) is a mapping with the property that \({\Psi }{(x)}\in \arg \min _y\{\tfrac{1}{2}\Vert Ay-b\Vert ^2\mid y\in {\textbf{L}}(I_0(x),I_1(x))\}\). This mapping may not be unique, since the columns of A corresponding to \(J(x):=\{i\in N\mid 0<x(i)<u(i)\}\) may not be independent: the optimal centroid set is itself an affine subspace. The point \(x\in {\textbf{B}}(u)\) is stable if \({\Psi }(x)=x\). This generalizes an update used by Björck [2]. However, in his setting the columns corresponding to J(x) are always linearly independent, and thus the centroid set is always a single point. Stable points can also be seen as the analogues of corral solutions in Wolfe’s minimum-norm-point algorithm.

Every major cycle starts with an update step and ends with a stable point. The update step can be any first-order step satisfying some natural requirements, such as variants of Frank–Wolfe, projected gradient, or coordinate updates. As long as the current iterate is not optimal, this update strictly improves the objective. Finite convergence follows from the fact that there can be at most \(3^n\) stable points.

After the update step, we start a sequence of minor cycles. From the current iterate \(x\in {\textbf{B}}(u)\), we move to \({\Psi }(x)\) in case \({\Psi }(x)\in {\textbf{B}}(u)\), or to the intersection of the boundary of \({\textbf{B}}(u)\) and the line segment \([x,{\Psi }(x)]\) otherwise. The minor cycles finish once \(x={\Psi }(x)\) is a stable point. The objective \(\tfrac{1}{2}\Vert Ax-b\Vert ^2\) is decreasing in every minor cycle, and at least one new coordinate \(i\in N\) is set to 0 or to u(i). Thus, the number of minor cycles in any major cycle is at most n. One can use various centroid mappings satisfying a mild requirement on \({\Psi }\), described in Sect. 2.3.

We present a poly\((n,\kappa )\) convergence analysis for the NNLS problem with coordinate updates, which corresponds to the Lawson–Hanson algorithm and its variants. We expect that similar arguments extend to the capacitated case. The proof has two key ingredients. First, we show linear convergence of the first-order update steps (Theorem 4.5). Such a bound follows already from [22]; we present a simple self-contained proof exploiting properties of stable points and the uncapacitated setting. The second step of the analysis shows that in every poly\((n,\kappa )\) iterations, we can identify a new variable that will never become zero in subsequent iterations (Theorem 4.1). The proof relies on proximity arguments: we show that for any iterate x and any subsequent iterate \(x'\), the distance \(\Vert x-x'\Vert \) can be upper bounded in terms of n, \(\kappa \), and the optimality gap at x.

In Sect. 5, we present preliminary computational experiments using randomly generated problem instances of various sizes. We compare the performance of different variants of our algorithm to standard gradient methods. For the choice of update steps, projected gradient performs much better than the coordinate updates used in the NNLS algorithms. We compare an ‘oblivious’ centroid mapping and one that chooses \({\Psi }(x)\) as the nearest point to x in the centroid set in the ‘local norm’ (see Sect. 2.3). The latter appears to be significantly better. For choices of parameters \(n\ge 2m\), the running time of our method with projected gradient updates and the local norm mapping is typically within a factor of two of TNT-NN, the state-of-the-art practical active set heuristic for NNLS [21], despite the fact that we only use simple linear algebra tools and have made no attempts at practical speed-ups. The performance is often better than that of projected accelerated gradient descent, the best first order approach.

Proximity arguments and strongly polynomial algorithms Arguments that show strongly polynomial convergence by gradually revealing the support of an optimal solution are prevalent in combinatorial optimization. These date back to Tardos’s [29] groundbreaking work giving the first strongly polynomial algorithm for minimum-cost flows. Our proof is closer to the dual ‘abundant arc’ arguments by Fujishige [12] and Orlin [24]. Tardos later generalized this result to general LPs, giving a running time dependence poly\((n,\log \Delta (A))\), where \(\Delta (A)\) is the largest subdeterminant of the constraint matrix. In particular, her algorithm is strongly polynomial as long as the entries of the matrix are polynomially bounded integers. This framework was recently strengthened in [7] to poly\((n,\log \kappa (A))\) running time for the circuit imbalance measure \(\kappa (A)\). They also highlight the role of Hoffman-proximity and give such a bound in terms of \(\kappa (A)\). We note that the above algorithms—along with many other strongly polynomial algorithms in combinatorial optimization—modify the problem directly once new information is learned about the optimal support. In contrast, our algorithm does not require any such modifications, nor any knowledge or estimate of the condition number \(\kappa \). Arguments about the optimal support only appear in the analysis.

Strongly polynomial algorithms with poly\((n,\log \kappa (A))\) running time bounds can also be obtained using layered least squares interior point methods. This line of work was initiated by Vavasis and Ye [30] using a related condition measure \(\bar{\chi }(A)\). An improved version that also established the relation between \(\bar{\chi }(A)\) and \(\kappa (A)\) was recently given by Dadush et al. [6]. We refer the reader to the survey [9] for properties and further applications of circuit imbalances.

1.1 Further related work

The Lawson–Hanson algorithm remains popular for the NNLS problem, and several variants are known. Empirically faster variants were proposed by Bro and De Jong [3] and by Myre et al. [21]. In particular, [21] allows bigger changes in the active and passive sets, thus waiving the linear independence requirement on the passive variables, and reports a significant speedup. However, there are no theoretical results underpinning the performance of these heuristics.

Wolfe’s minimum-norm-point algorithm [32] considers the variant of P where the box constraint \(x\in {\textbf{B}}(u)\) is replaced by \(\sum _{i\in N} x_i=1\), \(x\ge \textbf{0}\). It has been successfully employed as a subroutine in various optimization problems, e.g., submodular function minimization [14], see also [1, 11, 13]. Beyond the trivial \(2^n\) bound, the convergence analysis remained elusive; the first bound with \(1/\varepsilon \)-dependence was given by Chakrabarty et al. [4] in 2014. Lacoste-Julien and Jaggi [17] gave a \(\log (1/\varepsilon )\) bound, parametrized by the pyramidal width of the polyhedron. Recently, De Loera et al. [8] showed an example of exponential time behaviour of Wolfe’s algorithm for the min-norm insertion rule (the analogue of a pivot rule); no exponential example is known for other insertion rules, such as the linopt rule used in the application to submodular function minimization.

Our Update-and-Stabilize algorithm is also closely related to the Gradient Projection Method, see [5] and [23, Section 16.7]. This method also allows the set of columns corresponding to passive variables to be linearly dependent. For each gradient update, a more careful search is used in the gradient direction, ‘bending’ the movement direction whenever a constraint is hit. The analogues of our stabilizing steps are conjugate gradient iterations. Thus, this method avoids the computationally expensive step of exact projections; on the other hand, finite termination is not guaranteed. We further discuss the relationship between the two algorithms in Sect. 6.

There are similarities between our algorithm and the Iteratively Reweighted Least Squares (IRLS) method that has been intensively studied since the 1960s [18, 25]. For some \(p\in [0,\infty ]\), \(A\in {\mathbb {R}}^{M\times N}\) and \(b\in {\mathbb {R}}^M\), the goal is to approximately solve \(\min \{\Vert x\Vert _p\mid \, Ax=b\}\). At each iteration, a weighted minimum-norm point problem \(\min \{\sum _{i=1}^n w^{(t)}_i x^2_i\mid \, Ax=b\}\) is solved, where the weights \(w^{(t)}\) are iteratively updated. The LP-feasibility problem \(Ax=b\), \(\textbf{0}\le x\le \textbf{1}\) for finite upper bounds \(u=\textbf{1}\) can be phrased as an \(\ell _{\infty }\)-minimization problem \(\min \{\Vert x\Vert _\infty \mid \, Ax=b-A\textbf{1}/2\}\). Ene and Vladu [10] gave an efficient variant of IRLS for \(\ell _1\) and \(\ell _\infty \)-minimization; see their paper for further references. Some variants of our algorithm solve a weighted least squares problem with changing weights in the stabilizing steps. There are, however, significant differences between IRLS and our method. The underlying optimization problems are different, and IRLS does not find an exact optimal solution in finite time. Applied to LP in the \(\ell _\infty \) formulation, IRLS satisfies \(Ax=b\) throughout while violating the box constraints \(\textbf{0}\le x\le u\). In contrast, iterates of our algorithm violate \(Ax=b\) but maintain \(\textbf{0}\le x\le u\). The role of the least squares subroutines is also rather different in the two settings.

2 Preliminaries

Notation We use \(N\oplus M\) for the disjoint union (or direct sum) of copies of the two sets. For a matrix \(A\in {\mathbb {R}}^{M\times N}\), \(i\in M\) and \(j\in N\), we denote the ith row of A by \(A_i\) and the jth column by \(A^j\). Also, for any matrix X, we denote by \(X^\top \) the transpose of X. We let \(\Vert \cdot \Vert _p\) denote the \(\ell _p\) vector norm; we use \(\Vert \cdot \Vert \) to denote the Euclidean norm \(\Vert \cdot \Vert _2\). For a matrix \(A\in {\mathbb {R}}^{M\times N}\), we let \(\Vert A\Vert \) denote the spectral norm, that is, the \(\ell _2\rightarrow \ell _2\) operator norm.

For any \(x, y\in {\mathbb {R}}^M\) we define \(\left\langle x,y\right\rangle =\sum _{i\in M}x(i)y(i)\). We will use this notation also in other dimensions. We let \([x,y]:=\{\lambda x+(1-\lambda )y\mid \lambda \in [0,1]\}\) denote the line segment between the vectors x and y.

2.1 Elementary vectors and circuits

For a linear space \(W\subsetneq {\mathbb {R}}^N\), \(g\in W\) is an elementary vector if g is a support minimal nonzero vector in W, that is, no \(h\in W\setminus \{\textbf{0}\}\) exists such that \({\text {supp}}(h)\subsetneq {\text {supp}}(g)\), where \({\text {supp}}\) denotes the support of a vector. We let \({\mathcal {F}}(W)\subseteq W\) denote the set of elementary vectors. A circuit in W is the support of some elementary vector; these are precisely the circuits in the associated linear matroid \(\mathcal {M}(W)\).

The subspaces \(W=\{\textbf{0}\}\) and \(W={\mathbb {R}}^N\) are called trivial subspaces; all other subspaces are nontrivial. We define the circuit imbalance measure

$$\begin{aligned} \kappa (W):=\max \left\{ \left| \frac{g(j)}{g(i)}\right| \mid g\in {\mathcal {F}}(W), i,j\in {\text {supp}}(g)\right\} \, \end{aligned}$$

for nontrivial subspaces and \(\kappa (W)=1\) for trivial subspaces. For a matrix \(A\in {\mathbb {R}}^{M\times N}\), we use the notation \(\kappa (A)\) to denote \(\kappa (\ker (A))\).

The following theorem shows the relation to totally unimodular (TU) matrices. Recall that a matrix is totally unimodular (TU) if the determinant of every square submatrix is 0, \(+1\), or \(-1\).

Theorem 2.1

[Cederbaum, 1957; see also the survey [9]] Let \(W \subset {\mathbb {R}}^N\) be a linear subspace. Then \(\kappa (W) = 1\) if and only if there exists a TU matrix \(A\in {\mathbb {R}}^{M\times N}\) such that \(W = \ker (A)\).

We also note that if \(A\in \mathbb {Z}^{M\times N}\) is an integer matrix, then \(\kappa (A)\le \Delta (A)\) for the maximum subdeterminant \(\Delta (A)\).
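For small matrices, \(\kappa (A)\) can be computed by brute force directly from the definition, by enumerating circuits as minimal dependent sets of columns. The sketch below assumes NumPy; the function name and tolerance are ours, and the enumeration is exponential in n, so it is for illustration only.

```python
import itertools
import numpy as np

def circuit_imbalance(A, tol=1e-9):
    """Brute-force kappa(A) = kappa(ker(A)): enumerate all circuits of the
    linear matroid of A and take the largest |g(j)/g(i)| over the
    corresponding elementary vectors g."""
    m, n = A.shape
    kappa = 1.0
    for size in range(1, n + 1):
        for S in itertools.combinations(range(n), size):
            B = A[:, S]
            if np.linalg.matrix_rank(B, tol=tol) != size - 1:
                continue                       # columns of S not 'just barely' dependent
            # minimality: dropping any single column leaves an independent set
            if any(np.linalg.matrix_rank(np.delete(B, j, 1), tol=tol) < size - 1
                   for j in range(size)):
                continue
            g = np.linalg.svd(B)[2][-1]        # spans the one-dimensional kernel of B
            kappa = max(kappa, np.abs(g).max() / np.abs(g).min())
    return kappa

# For the node-arc incidence matrix of a directed triangle (cf. Example 2.5),
# the extended matrix [A | -I] is totally unimodular, so kappa(X_A) = 1.
A = np.array([[-1, 0, 1], [1, -1, 0], [0, 1, -1]], dtype=float)
print(circuit_imbalance(np.hstack([A, -np.eye(3)])))   # 1.0 up to rounding
```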

Conformal circuit decompositions We say that the vector \(y \in {\mathbb {R}}^N\) conforms to \(x\in {\mathbb {R}}^N\) if \(x(i)y(i) > 0\) whenever \(y(i)\ne 0\). Given a subspace \(W\subseteq {\mathbb {R}}^N\), a conformal circuit decomposition of a vector \(v\in W\) is a decomposition

$$\begin{aligned} v=\sum _{k=1}^\ell h^k, \end{aligned}$$

where \(\ell \le n\) and \(h^1,h^2,\ldots ,h^\ell \in {\mathcal {F}}(W)\) are elementary vectors that conform to v. A fundamental result on elementary vectors asserts the existence of a conformal circuit decomposition; see e.g., [15, 27]. Note that there may be multiple conformal circuit decompositions of a vector.

Lemma 2.2

For every subspace \(W\subseteq {\mathbb {R}}^N\), every \(v\in W\) admits a conformal circuit decomposition.

Given \(A\in {\mathbb {R}}^{M\times N}\), we define the extended subspace \({\mathcal {X}}_A\subset {\mathbb {R}}^{N\oplus M}\) as \({\mathcal {X}}_A:=\ker ([A\mid -I_M])\). Hence, for every \(v\in {\mathbb {R}}^N\), \((v,Av)\in {\mathcal {X}}_A\). For \(v\in {\mathbb {R}}^N\), the generalized path-circuit decomposition of v with respect to A is a decomposition \(v=\sum _{k=1}^\ell h^k\), where \(\ell \le n\), and for each \(1\le k\le \ell \), \((h^k,Ah^k)\in {\mathbb {R}}^{N\oplus M}\) is an elementary vector in \({\mathcal {X}}_A\) that conforms to \((v,Av)\). Moreover, \(h^k\) is an inner vector in the decomposition if \(Ah^k=\textbf{0}\) and an outer vector otherwise.

We say that \(v\in {\mathbb {R}}^N\) is cycle-free with respect to A, if all generalized path-circuit decompositions of v contain outer vectors only. The following lemma will play a key role in analyzing our algorithms.

Lemma 2.3

For any \(A\in {\mathbb {R}}^{M\times N}\), let \(v\in {\mathbb {R}}^N\) be cycle-free with respect to A. Then,

$$\begin{aligned} \Vert v\Vert _\infty \le \kappa ({{\mathcal {X}}_A})\cdot \Vert Av\Vert _1\, \quad \hbox {and}\quad \Vert v\Vert _2\le m\cdot \kappa ({{\mathcal {X}}_A})\cdot \Vert Av\Vert _2\,. \end{aligned}$$

Proof

Consider a generalized path-circuit decomposition \(v=\sum _{k=1}^\ell h^k\). By assumption, \(Ah^k\ne \textbf{0}\) for each k. Thus, for every \(j\in {\text {supp}}(h^k)\) there exists an \(i\in M\), such that \(|h^k(j)|\le \kappa ({{\mathcal {X}}_A}) |A_i h^k|\). For every \(j\in N\), the conformity of the decomposition implies \(|v(j)|=\sum _{k=1}^\ell |h^k(j)|\). Similarly, for every \(i\in M\), \(|A_i v|=\sum _{k=1}^\ell |A_i h^k|\). These imply the inequality \(\Vert v\Vert _\infty \le \kappa ({{\mathcal {X}}_A}) \Vert Av\Vert _1\).

For the second inequality, note that for any outer vector \((h^k,Ah^k)\in {\mathcal {X}}_A\), the columns in \({\text {supp}}(h^k)\) must be linearly independent. Consequently, \(\Vert h^k\Vert _2\le \sqrt{m}\cdot \kappa ({{\mathcal {X}}_A})\cdot |(Ah^k)_i|\) for each k and \(i\in {\text {supp}}(Ah^k)\). This implies

$$\begin{aligned} \Vert v\Vert _2\le \sum _{k=1}^\ell \Vert h^k\Vert _2\le \sqrt{m}\cdot \kappa ({{\mathcal {X}}_A})\cdot \Vert Av\Vert _1\le m\cdot \kappa ({{\mathcal {X}}_A})\cdot \Vert Av\Vert _2\,, \end{aligned}$$

completing the proof. \(\square \)

Remark 2.4

We note that a similar argument shows that \(\Vert A\Vert \le \sqrt{m\tau (A)}\cdot \kappa ({\mathcal {X}}_A)\), where \(\tau (A)\le m\) is the maximum size of \({\text {supp}}(Ah)\) for an elementary vector \((h,Ah)\in {\mathcal {X}}_A\).

Example 2.5

Let \(A\in {\mathbb {R}}^{M\times N}\) be the node-arc incidence matrix of a directed graph \(D=(M,N)\). The system \(Ax=b\), \(x\in {\textbf{B}}(u)\) then corresponds to a network flow feasibility problem. Here, b(i) is the demand of node \(i\in M\), i.e., the inflow minus the outflow at i is required to be b(i). Recall that A is a TU matrix; consequently, \((A\mid -I_M)\) is also TU, and \(\kappa ({{\mathcal {X}}_A})=1\). Our algorithm is strongly polynomial in this setting. Note that inner vectors correspond to cycles and outer vectors to paths; this motivates the term ‘generalized path-circuit decomposition.’ We also note \(\tau (A)=2\), and thus \(\Vert A\Vert \le \sqrt{2|M|}\) in this case.

2.2 Optimal solutions and proximity

Let

$$\begin{aligned} Z(A,u):=\{Ax\mid x\in {\textbf{B}}(u)\}. \end{aligned}$$
(1)

Thus, Problem P is to find the point in Z(A,u) that is nearest to b with respect to the Euclidean norm. We note that if the upper bounds u are finite, Z(A,u) is called a zonotope.

Throughout, we let \(p^*\) denote the optimum value of P. Note that whereas the optimal solution \(x^*\) may not be unique, the vector \(b^*:=Ax^*\) is unique by strong convexity; we have \(p^*=\tfrac{1}{2}\Vert b-b^*\Vert ^2\). We use

$$\begin{aligned} \eta (x):=\tfrac{1}{2}\Vert Ax-b\Vert ^2-p^* \end{aligned}$$

to denote the optimality gap for \(x\in {\textbf{B}}(u)\). The point \(x\in {\textbf{B}}(u)\) is an \(\varepsilon \)-approximate solution if \(\eta (x)\le \varepsilon \).

For a point \(x\in {\textbf{B}}(u)\), let

$$\begin{aligned} I_0(x):=\{i\in N\mid \, x(i)=0\}\,,\quad I_1(x):=\{i\in N\mid \, x(i)=u(i)\}\,,\quad \hbox {and}\quad J(x):=N\setminus (I_0(x)\cup I_1(x))\,. \end{aligned}$$

The gradient of the objective \(\tfrac{1}{2}\Vert Ax-b\Vert ^2\) in P can be written as

$$\begin{aligned} g^x:=A^\top (Ax-b)\, . \end{aligned}$$
(2)

We recall the first order optimality conditions.

Lemma 2.6

The point \(x\in {\textbf{B}}(u)\) is an optimal solution to P if and only if \(g^x(i)=0\) for all \(i\in J(x)\), \(g^x(i)\ge 0\) for all \(i\in I_0(x)\), and \(g^x(i)\le 0\) for all \(i\in I_1(x)\).
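In computational terms, the conditions of Lemma 2.6 can be verified directly from the gradient (2). The following is a minimal sketch assuming NumPy; the function name and tolerance are ours.

```python
import numpy as np

def is_first_order_optimal(A, b, u, x, tol=1e-9):
    """Check the optimality conditions of Lemma 2.6 for x in B(u)."""
    g = A.T @ (A @ x - b)                  # gradient (2)
    I0 = np.isclose(x, 0.0)
    I1 = np.isfinite(u) & np.isclose(x, u)
    J = ~(I0 | I1)
    return (np.all(np.abs(g[J]) <= tol)    # g^x(i) =  0 on J(x)
            and np.all(g[I0] >= -tol)      # g^x(i) >= 0 on I_0(x)
            and np.all(g[I1] <= tol))      # g^x(i) <= 0 on I_1(x)
```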

Using Lemma 2.3, we can bound the distance of any x from the nearest optimal solution.

Lemma 2.7

For any \(x\in {\textbf{B}}(u)\), there exists an optimal solution \(x^*\) to P such that

$$\begin{aligned} \begin{aligned} \Vert x-x^*\Vert _\infty&\le \kappa ({{\mathcal {X}}_A})\cdot \Vert Ax-b^*\Vert _1\,,\hbox { and}\\ \Vert x-x^*\Vert _2&\le m\cdot \kappa ({{\mathcal {X}}_A})\cdot \Vert Ax-b^*\Vert _2\,. \end{aligned} \end{aligned}$$

Proof

Let us select an optimal solution \(x^*\) to P such that \(\Vert x-x^*\Vert _2\) is minimal. We show that \(x-x^*\) is cycle-free w.r.t. A; the statements then follow from Lemma 2.3.

For a contradiction, assume a generalized path-circuit decomposition of \(x-x^*\) contains an inner vector g, i.e., \(Ag=\textbf{0}\). By conformity of the decomposition, for \(\bar{x}=x^*+g\) we have \(\bar{x}\in {\textbf{B}}(u)\) and \(A\bar{x}=A x^*\). Thus, \(\bar{x}\) is another optimal solution, but \(\Vert x-\bar{x}\Vert _2<\Vert x-x^*\Vert _2\), a contradiction. \(\square \)

2.3 The centroid mapping

Let us denote by \(3^N\) the set of all ordered pairs \((I_0,I_1)\) of disjoint subsets \(I_0,I_1\subseteq N\), and let \(I_*:=\{i\in N\mid u(i)<\infty \}\). For any \((I_0,I_1)\in 3^N\) with \(I_1\subseteq I_*\), we let

$$\begin{aligned} {\textbf{L}}(I_0,I_1):=\{x\in {\mathbb {R}}^N\mid \forall i\in I_0: x(i)=0, \ \forall i\in I_1: x(i)=u(i)\ \}\, . \end{aligned}$$
(3)

We call \(\{Ax \mid x\in {\textbf{B}}(u)\cap {\textbf{L}}(I_0,I_1)\} \subseteq Z(A,u)\) a pseudoface of Z(A,u). We note that every face of Z(A,u) is a pseudoface, but there might be pseudofaces that do not correspond to any face.

We define a centroid set for \((I_0,I_1)\) as

$$\begin{aligned} {\mathcal {C}}(I_0,I_1):=\arg \min _y\left\{ \Vert Ay-b\Vert \mid y\in {\textbf{L}}(I_0,I_1)\right\} \, . \end{aligned}$$
(4)

Proposition 2.8

For \((I_0,I_1)\in 3^N\) with \(I_1\subseteq I_*\), \({\mathcal {C}}(I_0,I_1)\) is an affine subspace of \({\mathbb {R}}^N\), and for some \(w\in {\mathbb {R}}^M\), it holds that \(Ay=w\) for every \(y\in {\mathcal {C}}(I_0,I_1)\).

The centroid mapping \({\Psi }:\, {\textbf{B}}(u)\rightarrow {\mathbb {R}}^N\) is a mapping that satisfies

$$\begin{aligned} {\Psi }({\Psi }(x))={\Psi }(x)\quad \hbox {and}\quad {\Psi }(x)\in {\mathcal {C}}(I_0(x),I_1(x))\, \quad \forall x\in {\textbf{B}}(u)\,. \end{aligned}$$

We say that \(x\in {\textbf{B}}(u)\) is a stable point if \({\Psi }(x)=x\). A simple, ‘oblivious’ centroid mapping arises by taking a minimum-norm point of the centroid set:

$$\begin{aligned} {\Psi }(x):={{\,\mathrm{arg\,min}\,}}\{\Vert y\Vert \mid y\in {\mathcal {C}}(I_0(x),I_1(x))\}\, . \end{aligned}$$
(5)

However, this mapping has some undesirable properties. For example, we may have an iterate x that is already in \({\mathcal {C}}(I_0(x),I_1(x))\), but \({\Psi }(x)\ne x\). Instead, we aim for centroid mappings that move the current point ‘as little as possible’. This can be formalized as follows. The centroid mapping \({\Psi }\) is called cycle-free, if the vector \({\Psi }(x)-x\) is cycle-free w.r.t. A for every \(x\in {\textbf{B}}(u)\). The next claim describes a general class of cycle-free centroid mappings.

Lemma 2.9

For every \(x\in {\textbf{B}}(u)\), let \(D(x)\in {\mathbb {R}}^{N\times N}_{>0}\) be a positive diagonal matrix. Then,

$$\begin{aligned} {\Psi }(x):={{\,\mathrm{arg\,min}\,}}\{\Vert D(x)(y-x)\Vert \mid y\in {\mathcal {C}}(I_0(x),I_1(x))\}\, \end{aligned}$$
(6)

defines a cycle-free centroid mapping.

Proof

For a contradiction, assume \(y-x\) is not cycle-free for \(y={\Psi }{(x)}\), that is, a generalized path-circuit decomposition contains an inner vector z. For \(y'=y-z\) we have \(Ay'=Ay\), meaning that \(y'\in {\mathcal {C}}(I_0(x),I_1(x))\). This is a contradiction, since \(\Vert D(x)(y'-x)\Vert <\Vert D(x)(y-x)\Vert \) for any positive diagonal matrix D(x). \(\square \)

We emphasize that D(x) in the above statement is a function of x and can be any positive diagonal matrix. Note also that the diagonal entries for indices in \(I_0(x)\cup I_1(x)\) do not matter. In our experiments, defining D(x) with diagonal entries \(1/x(i)+1/(u(i)-x(i))\) for \(i\in J(x)\) performs particularly well. Intuitively, this choice moves coordinates that are close to the boundary by a smaller amount. The next proposition follows from Lagrangian duality, and provides a way to compute \({\Psi }(x)\) as in (6) by solving a system of linear equations.

Proposition 2.10

For a partition \(N=I_0\cup I_1\cup J\), the centroid set can be written as

$$\begin{aligned} {\mathcal {C}}(I_0,I_1)=\left\{ y\in {\textbf{L}}(I_0,I_1)\mid ({A^J})^\top (Ay-b)=\textbf{0}\right\} \,. \end{aligned}$$

For \((I_0,I_1,J)=(I_0(x),I_1(x),J(x))\) and \(D=D(x)\), the point \(y={\Psi }(x)\) as in (6) can be obtained as the unique solution to the system of linear equations

$$\begin{aligned} (A^{J})^\top Ay&=(A^{J})^\top b\,,\\ y_{J}+(D_J^J)^{-1}(A^{J})^\top A^{J} \lambda&=x_{J}\,,\\ y(i)&=0\quad \forall i\in I_0\,,\\ y(i)&=u(i)\quad \forall i\in I_1\,,\\ \lambda&\in {\mathbb {R}}^{J}\,. \end{aligned}$$
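A centroid mapping of the form (6) can thus be computed with standard least-squares routines: first determine the common image \(w=Ay\) of the centroid set, then take the point of the centroid set closest to x in the norm given by D(x). The following sketch assumes NumPy and uses the ‘local norm’ weights \(1/x(i)+1/(u(i)-x(i))\); the function name and numerical tolerances are ours.

```python
import numpy as np

def local_norm_centroid(A, b, u, x):
    """A cycle-free centroid mapping as in (6): among the minimizers of
    ||Ay - b|| over L(I_0(x), I_1(x)), return the point closest to x in the
    weighted norm ||D(x)(y - x)||."""
    I0 = np.isclose(x, 0.0)
    I1 = np.isfinite(u) & np.isclose(x, u)
    J = ~(I0 | I1)
    y = np.where(I1, u, 0.0)                   # fixed coordinates of L(I_0, I_1)
    if not J.any():
        return y
    AJ = A[:, J]
    rhs = b - A[:, I1] @ u[I1]                 # contribution of coordinates fixed at u(i)
    # Centroid set: (A^J)^T (A^J y_J - rhs) = 0, i.e. A^J y_J equals the
    # orthogonal projection w of rhs onto the column space of A^J.
    yJ, *_ = np.linalg.lstsq(AJ, rhs, rcond=None)
    w = AJ @ yJ
    d = 1.0 / x[J] + 1.0 / (u[J] - x[J])       # local-norm weights (1/inf = 0 for NNLS)
    # min ||D (y_J - x_J)|| s.t. A^J y_J = w: substitute z = D (y_J - x_J)
    z = np.linalg.pinv(AJ / d) @ (w - AJ @ x[J])
    y[J] = x[J] + z / d
    return y
```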

3 The update-and-stabilize framework

Now we describe a general algorithmic framework MNPZ\((A,b,u)\) for solving P, shown in Algorithm 1. Similarly to Wolfe’s MNP algorithm, the algorithm comprises major and minor cycles. We maintain a point \(x\in {\textbf{B}}(u)\), and x is stable at the end of every major cycle. Each major cycle starts by calling the subroutine \(\texttt {Update}(x)\); the only general requirements on this subroutine are as follows:

  1. (U1)

    for \(y=\texttt {Update}(x)\), \(y=x\) if and only if x is optimal to P, and \(\Vert Ay-b\Vert <\Vert Ax-b\Vert \) otherwise, and

  2. (U2)

    if \(y\ne x\), then for any \(\lambda \in [0,1)\), \(z=\lambda y+(1-\lambda )x\) satisfies \(\Vert Ay-b\Vert <\Vert Az-b\Vert \).

Property U1 can be obtained from any first order algorithm; we introduce some important examples in Sect. 3.1. Property U2 might be violated when using a fixed step length, which is a common choice. In order to guarantee U2, we can post-process the first order update: choose y as the optimal point on the line segment \([x,y']\), where \(y'\) is the update found by the fixed-step update.

The algorithm terminates in the first major cycle in which \(x=\texttt {Update}(x)\). Within each major cycle, the minor cycles repeatedly apply the centroid mapping \(\Psi \). As long as \(w:=\Psi (x)\ne x\), i.e., x is not stable, we set \(x:=w\) if \(w\in {\textbf{B}}(u)\); otherwise, we set the next x as the intersection of the line segment \([x,w]\) and the boundary of \({\textbf{B}}(u)\). The requirement U1 is already sufficient to show finite termination.

Algorithm 1 MNPZ\((A,b,u)\)
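The following is a minimal sketch of the framework in NumPy, written against the description above; `update` stands for any subroutine satisfying U1 and U2 (see Sect. 3.1) and `centroid` for any centroid mapping \(\Psi \). All names, tolerances, and iteration caps are ours rather than from the pseudocode.

```python
import numpy as np

def mnpz(A, b, u, update, centroid, tol=1e-12, max_major=100_000):
    """Sketch of Algorithm 1: major cycles perform a first-order update, minor
    cycles move towards the centroid Psi(x), stopping at the boundary of B(u)."""
    x = np.zeros(A.shape[1])                 # x = 0 is a stable starting point
    for _ in range(max_major):               # major cycles
        y = update(x)
        if np.allclose(y, x, atol=tol):      # Update(x) = x iff x is optimal (U1)
            return x
        x = y
        while True:                          # minor cycles: stabilize x
            w = centroid(x)
            d = w - x
            if np.allclose(d, 0.0, atol=tol):
                break                        # x = Psi(x): x is stable
            # largest step towards w that stays inside the box B(u)
            t = 1.0
            neg = d < -tol
            if neg.any():
                t = min(t, np.min(-x[neg] / d[neg]))
            pos = (d > tol) & np.isfinite(u)
            if pos.any():
                t = min(t, np.min((u[pos] - x[pos]) / d[pos]))
            x = np.clip(x + t * d, 0.0, u)
    return x
```

Here `update` and `centroid` may be instantiated, for example, with the projected gradient and local-norm sketches given alongside Sects. 3.1 and 2.3.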

Theorem 3.1

Consider any \(\texttt {Update}(x)\) subroutine that satisfies U1 and any centroid mapping \(\Psi \). The algorithm MNPZ\((A,b,u)\) finds an optimal solution to P within \(3^n\) major cycles. Every major cycle contains at most n minor cycles.

Proof

Requirement U1 guarantees that if the algorithm terminates, it returns an optimal solution. We claim that the same sets \((I_0,I_1)\) cannot appear as \((I_0(x),I_1(x))\) at the end of two different major cycles; this implies the bound on the number of major cycles. To see this, we note that for \(x=\Psi (x)\), \(x\in {\mathcal {C}}(I_0(x),I_1(x))={\mathcal {C}}(I_0,I_1)\); thus, \(\Vert Ax-b\Vert =\min \left\{ \Vert Az-b\Vert \mid \ z\in {\textbf{L}}(I_0,I_1)\right\} \). By U1, \(\Vert Ay-b\Vert <\Vert Ax-b\Vert \) at the beginning of every major cycle. Moreover, it follows from the definition of the centroid mapping that \(\Vert Ax-b\Vert \) is non-increasing in every minor cycle. To bound the number of minor cycles in a major cycle, note that the set \(I_0(x)\cup I_1(x)\subseteq N\) is extended in every minor cycle. \(\square \)

3.1 The update subroutine

We can implement the \(\texttt {Update}(x)\) subroutine satisfying U1 and U2 using various first order methods for constrained optimization.

Recall the gradient \(g^x\) from (2); we use \(g=g^x\) when x is clear from the context. The following property of stable points can be compared to the optimality condition in Lemma 2.6.

Lemma 3.2

If \(x(={\Psi }(x))\) is a stable point, then \(g^x(j)=0\) for all \(j\in J(x)\).

Proof

This directly follows from Proposition 2.10 that asserts \((A^{J(x)})^\top (Ax-b)=\textbf{0}\). \(\square \)

We now describe three classical options. We stress that the centroid mapping \({\Psi }\) can be chosen independently from the update step.

The Frank–Wolfe update The Frank–Wolfe or conditional gradient method is applicable only in the case when u(i) is finite for every \(i\in N\). In every update step, we start by computing \(\bar{y}\) as a minimizer of the linear objective \(\left\langle g,y\right\rangle \) over \({\textbf{B}}(u)\), that is,

$$\begin{aligned} \bar{y} \in \arg \min \{\left\langle g,y\right\rangle \mid y\in {\textbf{B}}(u)\}\, . \end{aligned}$$
(7)

We set \(\texttt {Update}(x):=x\) if \(\left\langle g,\bar{y}\right\rangle =\left\langle g,x\right\rangle \); otherwise, \(y=\texttt {Update}(x)\) is selected so that y minimizes \(\tfrac{1}{2}\Vert Ay-b\Vert ^2\) on the line segment \([x,\bar{y}]\).

Clearly, \(\bar{y}(i)=0\) whenever \(g(i)>0\), and \(\bar{y}(i)=u(i)\) whenever \(g(i)<0\). However, \(\bar{y}(i)\) can be chosen arbitrarily if \(g(i)=0\). In this case, we keep \(\bar{y}(i)=x(i)\); this will be significant to guarantee stability of solutions in the analysis.

The projected gradient update The projected gradient update moves in the opposite gradient direction to \(\bar{y}:=x-\lambda g\) for some step-length \(\lambda >0\), and obtains the output \(y=\texttt {Update}(x)\) as the projection y of \(\bar{y}\) to the box \({\textbf{B}}(u)\). This projection simply changes every negative coordinate to 0 and every \(\bar{y}(i)>u(i)\) to \(y(i)=u(i)\). To ensure U2, we can perform an additional step that replaces y by the point \(y'\in [x,y]\) that minimizes \(\tfrac{1}{2}\Vert Ay'-b\Vert ^2\).
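A sketch of this update, assuming NumPy; `lam` is a caller-supplied step length, and the final line search restores property U2.

```python
import numpy as np

def pg_update(A, b, u, x, lam):
    """Projected gradient step with fixed step length lam, followed by exact
    line search on the segment [x, y] so that property U2 holds."""
    g = A.T @ (A @ x - b)                  # gradient (2)
    y = np.clip(x - lam * g, 0.0, u)       # projection onto the box B(u)
    d = A @ (y - x)
    if not d.any():
        return x
    # minimize 0.5*||A(x + t(y - x)) - b||^2 over t in [0, 1]
    t = np.clip(-(A @ x - b) @ d / (d @ d), 0.0, 1.0)
    return x + t * (y - x)
```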

Consider now an NNLS instance (i.e., \(u(i)=\infty \) for all \(i\in N\)), and let x be a stable point. Recall that \(I_1(x)=\emptyset \) in the NNLS setting. Lemma 3.2 allows us to write the projected gradient update in the following simple form, which also enables the use of optimal line search. Define

$$\begin{aligned} z^x(i):=\max \{-g^x(i),0\} , \end{aligned}$$
(8)

and use \(z=z^x\) when clear from the context. According to Lemma 2.6, x is optimal to P if and only if \(z=\textbf{0}\). We use the optimal line search

$$\begin{aligned} y:=\arg \min _y\left\{ \tfrac{1}{2}\Vert Ay-b\Vert ^2\mid y=x+\lambda z, \lambda \ge 0\right\} \,. \end{aligned}$$

If \(z\ne \textbf{0}\), this can be written explicitly as

$$\begin{aligned} y:=x+\frac{\Vert z\Vert ^2}{\Vert Az\Vert ^2}z\, . \end{aligned}$$
(9)

To verify this formula, we note that \(\Vert z\Vert ^2=-\langle g, z \rangle \), since for every \(i\in N\) either \(z(i)=0\) or \(z(i)=-g(i)\).
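A sketch of the update (8)–(9) from a stable point of an NNLS instance, assuming NumPy; the function name is ours.

```python
import numpy as np

def pg_update_nnls(A, b, x):
    """Projected gradient update (9) with optimal line search, valid at a
    stable point x >= 0 of an NNLS instance (Lemma 3.2)."""
    g = A.T @ (A @ x - b)               # gradient (2)
    z = np.maximum(-g, 0.0)             # z^x as in (8)
    if not z.any():
        return x                        # z = 0: x is optimal (Lemma 2.6)
    Az = A @ z
    return x + (z @ z) / (Az @ Az) * z  # step (9); stays nonnegative since z >= 0
```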

Coordinate update Our third update rule is the one used in the Lawson–Hanson algorithm. Given a stable point \(x\in {\textbf{B}}(u)\), we select a coordinate \(j\in N\) where either \(j\in I_0(x)\) and \(g(j)<0\) or \(j\in I_1(x)\) and \(g(j)>0\), and set y such that \(y(i)=x(i)\) if \(i\ne j\), and y(j) is chosen in [0, u(j)] so that \(\tfrac{1}{2}\Vert Ay-b\Vert ^2\) is minimized. As in the Lawson–Hanson algorithm, we can maintain basic solutions throughout.

Lemma 3.3

Assume \(A^J\) is linearly independent for \(J=J(x)\). Then, \(A^{J'}\) is also linearly independent for \(J'=J(y)=J\cup \{j\}\), where \(y=\texttt {Update}(x)\) using a coordinate update.

Proof

For a contradiction, assume \(A^j=A^Jw\) for some \(w\in {\mathbb {R}}^J\). Then,

$$\begin{aligned} g(j)=(A^j)^\top (Ax-b)=w^\top (A^J)^\top (Ax-b)=0\,, \end{aligned}$$

a contradiction. \(\square \)

Let us start with \(x=\textbf{0}\), i.e., \(J(x)=I_1(x)=\emptyset \), \(I_0(x)=N\). Then, \(A^{J(x)}\) remains linearly independent throughout. Hence, every stable solution x is a basic solution to P. Note that whenever \(A^{J(x)}\) is linearly independent, \({\mathcal {C}}(I_0(x),I_1(x))\) contains a single point, hence, \({\Psi }(x)\) is uniquely defined.

For the special case of NNLS, i.e., for the case with no upper bounds, one can obtain simple explicit formulas for the coordinate update y. For z as in (8), let us return \(y=x\) if \(z=\textbf{0}\). Otherwise, let \(j\in \arg \max _k z(k)\); note that \(j\in I_0(x)\). Let

$$\begin{aligned} y(i):={\left\{ \begin{array}{ll} x(i)&{}\hbox {if }i\in N\setminus \{j\}\, ,\\ \frac{z(i)}{\Vert A^i\Vert ^2}&{}\hbox {if }i=j\, . \end{array}\right. } \end{aligned}$$
(10)
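A sketch of the coordinate update (10), assuming NumPy; following the text, j is chosen as a maximizer of z, which lies in \(I_0(x)\) when x is a stable point, so that \(x(j)=0\).

```python
import numpy as np

def coordinate_update_nnls(A, b, x):
    """Coordinate (Lawson-Hanson style) update (10) for an NNLS instance."""
    g = A.T @ (A @ x - b)                 # gradient (2)
    z = np.maximum(-g, 0.0)               # z^x as in (8)
    if not z.any():
        return x                          # x is optimal
    j = int(np.argmax(z))                 # coordinate with the most negative gradient
    y = x.copy()
    y[j] = z[j] / (A[:, j] @ A[:, j])     # exact minimization in coordinate j
    return y
```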

In the NNLS setting, U2 is guaranteed for the updates described above; for the general form with upper bounds, we can post-process as noted above to ensure U2. The following lemma is immediate.

Lemma 3.4

The Frank–Wolfe, projected gradient, and coordinate update rules all satisfy U1 and U2.

Cycle-free update rules

Definition 3.5

We say that \(\texttt {Update}(x)\) is a cycle-free update rule, if for every \(x\in {\textbf{B}}(u)\) and \(y=\texttt {Update}(x)\), \(x-y\) is cycle-free w.r.t. A.

Lemma 3.6

The Frank–Wolfe, projected gradient, and coordinate updates are all cycle-free.

Proof

Each of the three rules has the property that for any \(x\in {\textbf{B}}(u)\) with gradient g and \(y=\texttt {Update}(x)\), \(y-x\) conforms to \(-g\). We show that this implies the required property.

For a contradiction, assume that a generalized path-circuit decomposition of \(y-x\) contains an inner vector h. Thus, \(h\ne \textbf{0}\), \(Ah=\textbf{0}\), and h conforms to \(-g\). Consequently, \(\left\langle g,h\right\rangle < 0\). Recalling the form of g from (2), we get

$$\begin{aligned} 0>\left\langle g,h\right\rangle =\left\langle A^\top (Ax-b),h\right\rangle =\left\langle Ax-b,Ah\right\rangle =0\,,\ \end{aligned}$$

a contradiction. \(\square \)

4 Analysis

Our main goal is to show the following convergence bound. The proof will be given in Sect. 4.3. Recall that in an NNLS instance, all upper capacities are infinite.

Theorem 4.1

Consider an NNLS instance of P, and assume we use a cycle-free centroid mapping. Algorithm 1 terminates with an optimal solution in \(O(n\cdot m^2\cdot \kappa ^2({\mathcal {X}}_A)\cdot \Vert A\Vert ^2\cdot \log (n+\kappa ({\mathcal {X}}_A)))\) major cycles using projected gradient updates (9), and in \(O(n^{2}m^2\cdot \kappa ^2({\mathcal {X}}_A)\cdot \Vert A\Vert ^2\cdot \log (n+\kappa ({\mathcal {X}}_A)))\) major cycles using coordinate updates (10), when initialized with \(x=\textbf{0}\). In both cases, the total number of minor cycles is \(O(n^{2}m^2\cdot \kappa ^2({\mathcal {X}}_A)\cdot \Vert A\Vert ^2\cdot \log (n+\kappa ({\mathcal {X}}_A)))\).

4.1 Proximity bounds

We show that if using a cycle-free update rule and a cycle-free centroid mapping, the movement of the iterates in Algorithm 1 can be bounded by the change in the objective value. First, a nice property of the centroid set is that the movement of Ax directly relates to the decrease in the objective value. Namely,

Lemma 4.2

For \(x\in {\textbf{B}}(u)\), let \(y\in {\mathcal {C}}(I_0(x),I_1(x))\). Then,

$$\begin{aligned} \Vert Ax-Ay\Vert ^2=\Vert Ax-b\Vert ^2-\Vert Ay-b\Vert ^2\,. \end{aligned}$$

Consequently, if \(\Psi \) is a cycle-free centroid mapping and \(y=\Psi (x)\), then

$$\begin{aligned} \Vert x-y\Vert ^2\le m^2\cdot \kappa ^2({\mathcal {X}}_A)\cdot \left( \Vert Ax-b\Vert ^2-\Vert Ay-b\Vert ^2\right) \,. \end{aligned}$$

Proof

Let \(J:=J(x)\). Since \(Ax-b=(Ax-Ay)+(Ay-b)\), the claim is equivalent to showing that

$$\begin{aligned} \left\langle Ax-Ay,Ay-b\right\rangle =0\,. \end{aligned}$$

Noting that \(Ax-Ay=A^Jx_J-A^J y_J\), we can write

$$\begin{aligned} \left\langle Ax-Ay,Ay-b\right\rangle =(x_J-y_J)^\top (A^J)^\top (Ay-b)=0\,, \end{aligned}$$

where the equality follows since \((A^J)^\top ( Ay-b)=\textbf{0}\) by Proposition 2.10. The second part follows from Lemma 2.3. \(\square \)

Next, let us consider the movement of x during a call to \(\texttt {Update}(x)\).

Lemma 4.3

Let \(x\in {\textbf{B}}(u)\) and \(y=\texttt {Update}(x)\). Then,

$$\begin{aligned} \Vert Ax-Ay\Vert ^2\le \Vert Ax-b\Vert ^2-\Vert Ay-b\Vert ^2\,. \end{aligned}$$

If using a cycle-free update rule, we also have

$$\begin{aligned} \Vert x-y\Vert ^2\le m^2\cdot \kappa ^2({\mathcal {X}}_A)\cdot \left( \Vert Ax-b\Vert ^2-\Vert Ay-b\Vert ^2\right) \,. \end{aligned}$$

Proof

From property U2, it is immediate that \(\left\langle Ay-b,Ax-Ay\right\rangle \ge 0\). This implies the first claim. The second claim follows from the definition of a cycle-free update rule and Lemma 2.3. \(\square \)

Lemma 4.4

Let \(x\in {\textbf{B}}(u)\), and let \(x'\) be an iterate obtained by consecutive t major or minor updates of Algorithm 1 using a cycle-free update rule and a cycle-free centroid mapping, starting from x. Then,

$$\begin{aligned} \Vert x-x'\Vert \le m\cdot \kappa ({\mathcal {X}}_A)\cdot \sqrt{t\left( \Vert Ax-b\Vert ^2-\Vert Ax'-b\Vert ^2\right) }\,. \end{aligned}$$

Proof

Let us consider the (major and minor cycle) iterates \(x=x^{(k)},x^{(k+1)},\ldots ,x^{(k+t)}=x'\). From the triangle inequality, and the arithmetic-quadratic means inequality,

$$\begin{aligned} \Vert x-x'\Vert \le \sum _{j=1}^{t} \Vert x^{(k+j)}-x^{(k+j-1)}\Vert \le \sqrt{t\sum _{j=1}^{t} \Vert x^{(k+j)}-x^{(k+j-1)}\Vert ^2}\,. \end{aligned}$$

The statement then follows using the bounds in Lemma 4.2 and Lemma 4.3. \(\square \)

4.2 Geometric convergence of the projected gradient and coordinate updates

We present a simple convergence analysis for the NNLS setting. For the general capacitated setting, similar bounds should follow from [22]. Recall that \(\eta (x)\) denotes the optimality gap at x.

Theorem 4.5

Consider an NNLS instance of P, and let \(x\ge \textbf{0}\) be a stable point. Then for \(y=\texttt {Update}(x)\) using the projected gradient update (9) we have

$$\begin{aligned} \eta (y)\le \left( 1-\frac{1}{2m^2\cdot \kappa ^2({{\mathcal {X}}_A})\cdot \Vert A\Vert ^2}\right) \eta (x)\,. \end{aligned}$$

Using coordinate updates as in (10), we have

$$\begin{aligned} \eta (y)\le \left( 1-\frac{1}{2nm^2\cdot \kappa ^2({{\mathcal {X}}_A})\cdot \Vert A\Vert ^2}\right) \eta (x)\,. \end{aligned}$$

Consequently, either with projected gradient or with coordinate updates, after performing \(O(nm^2\cdot \kappa ^2({{\mathcal {X}}_A})\cdot \Vert A\Vert ^2)\) minor and major cycles from an iterate x, we obtain an iterate \(x'\) with \(\eta (x')\le \eta (x)/2\).

Let us first quantify the progress made by an update using optimal line search.

Lemma 4.6

For a stable point \(x\ge \textbf{0}\), the update (9) satisfies

$$\begin{aligned} \Vert Ax-b\Vert ^2-\Vert Ay-b\Vert ^2\ge \frac{\Vert z\Vert ^2}{\Vert A\Vert ^2}\,, \end{aligned}$$

and the update (10) satisfies

$$\begin{aligned} \Vert Ax-b\Vert ^2-\Vert Ay-b\Vert ^2= \frac{z(j)^2}{\Vert A^j\Vert ^2}\,. \end{aligned}$$

Proof

For the update (9) with stepsize \(\lambda =\Vert z\Vert ^2/\Vert Az\Vert ^2\), we have

$$\begin{aligned} \begin{aligned} \Vert Ay-b\Vert ^2&=\Vert Ax-b\Vert ^2+\lambda ^2\Vert Az\Vert ^2+2\lambda \langle Ax-b, Az \rangle \\&=\Vert Ax-b\Vert ^2+\lambda ^2\Vert Az\Vert ^2+2\lambda \langle g, z \rangle \\&=\Vert Ax-b\Vert ^2+\lambda ^2\Vert Az\Vert ^2-2\lambda \Vert z\Vert ^2\\&=\Vert Ax-b\Vert ^2-\frac{\Vert z\Vert ^4}{\Vert Az\Vert ^2}\,, \end{aligned} \end{aligned}$$

where the third equality uses \(\langle g, z \rangle =-\Vert z\Vert ^2\) noted previously. The statement follows by using \(\Vert Az\Vert \le \Vert A\Vert \cdot \Vert z\Vert \).

The proof is similar for the update (10). Here, \(y=x+\lambda e_j\), where \(e_j\) is the jth unit vector, and \(\lambda =z(j)/\Vert A^j\Vert ^2\). The bound follows by noting that \(\langle Ax-b, Ae_j \rangle =\langle g, e_j \rangle =-z(j)\). \(\square \)

We now use Lemma 2.7 to bound \(\Vert z\Vert \).

Lemma 4.7

For a stable point \(x\ge \textbf{0}\) and the update direction \(z=z^x\) as in (8), we have

$$\begin{aligned} \Vert z\Vert \ge \frac{\sqrt{\eta (x)}}{\sqrt{2}m\cdot \kappa ({\mathcal {X}}_A)}\,. \end{aligned}$$

Proof

Let \(x^*\ge \textbf{0}\) be an optimal solution to P as in Lemma 2.7, and \(b^*=Ax^*\). Using convexity of \(f(x):=\tfrac{1}{2}\Vert Ax-b\Vert ^2\),

$$\begin{aligned} p^*=f(x^*)\ge f(x)+\langle g, x^*-x \rangle \ge f(x)-\langle z, x^*-x \rangle \,, \end{aligned}$$

where the second inequality follows by noting that for each \(i\in N\), either \(z(i)=-g(i)\), or \(z(i)=0\) and \(g(i)(x^*(i)-x(i))\ge 0\). From the Cauchy-Schwarz inequality and Lemma 2.7, we get

$$\begin{aligned} p^*\ge f(x)-\Vert z\Vert \cdot \Vert x^*-x\Vert \ge f(x)- m\cdot \kappa ({\mathcal {X}}_A)\cdot \Vert Ax-b^*\Vert \cdot \Vert z\Vert \,, \end{aligned}$$

that is,

$$\begin{aligned} \Vert z\Vert \ge \frac{\eta (x)}{m\cdot \kappa ({\mathcal {X}}_A)\cdot \Vert Ax-b^*\Vert }\,. \end{aligned}$$

The proof is complete by showing

$$\begin{aligned} 2\eta (x)\ge \Vert Ax-b^*\Vert ^2\, . \end{aligned}$$
(11)

Recalling that \(\eta (x)=\tfrac{1}{2}\Vert Ax-b\Vert ^2-\tfrac{1}{2}\Vert Ax^*-b\Vert ^2\) and that \(b^*=Ax^*\), this is equivalent to

$$\begin{aligned} \langle Ax-Ax^*, Ax^*-b \rangle \ge 0\,. \end{aligned}$$

This can be further written as

$$\begin{aligned} \langle x-x^*, g^{x^*} \rangle \ge 0\,, \end{aligned}$$

which is implied by the first order optimality condition at \(x^*\). This proves (11), and hence the lemma follows. \(\square \)

Proof (Proof of Theorem 4.5)

The bound for projected gradient updates is immediate from Lemma 4.6 and Lemma 4.7. For coordinate updates, recall that j is selected as the index of the largest component z(j). Thus, \(z(j)^2\ge \Vert z\Vert ^2/n\), and \(\Vert A^j\Vert \le \Vert A\Vert \).

For the second part, the statement follows for projected gradient updates by the first part and by noting that there are at most n minor cycles in every major cycle. For coordinate updates, every major cycle adds one component to J(x) whereas every minor cycle removes at least one. Hence, the total number of minor cycles is at most m plus the total number of major cycles. \(\square \)

4.3 Overall convergence bounds

In this subsection, we prove Theorem 4.1. Using Lemma 4.4 and Theorem 4.5, we can derive the following stronger proximity bound:

Lemma 4.8

Consider an NNLS instance of P. Let \(x\ge \textbf{0}\) be an iterate of Algorithm 1 using projected gradient or coordinate updates, and let \(x'\ge \textbf{0}\) be any later iterate. Then, for a value

$$\begin{aligned} \Theta :=O(\sqrt{n}m^2\cdot \kappa ^2({\mathcal {X}}_A)\cdot \Vert A\Vert )\,, \end{aligned}$$

we have

$$\begin{aligned} \Vert x-x'\Vert \le \Theta \sqrt{\eta (x)}\,. \end{aligned}$$

Proof

According to Theorem 4.5, after \(T:=O(nm^2\cdot \kappa ^2({{\mathcal {X}}_A})\cdot \Vert A\Vert ^2)\) major and minor cycles, we get to an iterate \(x''\) with \(\eta (x'')\le \eta (x)/4\). Thus, Lemma 4.4 gives

$$\begin{aligned} \Vert x-x''\Vert \le m\cdot \sqrt{2T}\cdot \kappa ({\mathcal {X}}_A)\cdot \sqrt{\eta (x)}\,. \end{aligned}$$

Let us now define \(x^{(k)}\) as the iterate following x after Tk major and minor cycles; we let \(x^{(0)}:=x\). By Theorem 4.5, \(\eta (x^{(k)})\le \eta (x)/4^k\), and similarly as above, for each \(k=0,1,2,\ldots \) we get

$$\begin{aligned} \Vert x^{(k)}-x^{(k+1)}\Vert \le m\cdot \sqrt{2T}\cdot \kappa ({\mathcal {X}}_A)\cdot \frac{\sqrt{\eta (x)}}{2^k}\,. \end{aligned}$$

The above bound also holds for any iterate \(x'\) between \(x^{(k)}\) and \(x^{(k+1)}\). Using these bounds and the triangle inequality, for any iterate \(x'\) after x, we obtain

$$\begin{aligned} \Vert x-x'\Vert \le 2m\cdot \sqrt{2T}\cdot \kappa ({\mathcal {X}}_A)\cdot \sqrt{\eta (x)}\,. \end{aligned}$$

This completes the proof. \(\square \)

We need one more auxiliary lemma.

Lemma 4.9

Consider an NNLS instance of P, and let \(x\ge \textbf{0}\) be a stable point. Let \(\hat{x}\ge \textbf{0}\) be such that for each \(i\in N\), either \(\hat{x}(i)=x(i)\), or \(\hat{x}(i)=0<x(i)\). Then,

$$\begin{aligned} \Vert A\hat{x}-b\Vert ^2=\Vert Ax-b\Vert ^2+\Vert A\hat{x}-Ax\Vert ^2\,. \end{aligned}$$

Proof

The claim is equivalent to showing

$$\begin{aligned} \langle A\hat{x}-Ax, Ax-b \rangle =0\,. \end{aligned}$$

We can write \(\langle A\hat{x}-Ax, Ax-b \rangle =\langle g^x, \hat{x}-x \rangle \). By assumption, \(\hat{x}(i)-x(i)\ne 0\) only if \(x(i)>0\), but in this case \(g^x(i)=0\) by Lemma 3.2. \(\square \)

For the threshold \(\Theta \) as in Lemma 4.8 and for any \(x\ge \textbf{0}\), let us define

$$\begin{aligned} J^{\star (x)}:=\left\{ i\mid \, x(i)>\Theta \sqrt{\eta (x)}\right\} \,. \end{aligned}$$

The following is immediate from Lemma 4.8.

Lemma 4.10

Consider an NNLS instance of P. Let \(x\ge \textbf{0}\) be an iterate of Algorithm 1 using projected gradient updates, and \(x'\ge \textbf{0}\) be any later iterate. Then,

$$\begin{aligned} J^{\star (x)}\subseteq J(x')\,. \end{aligned}$$

We are ready to prove Theorem 4.1.

Proof (Proof of Theorem 4.1)

At any point of the algorithm, let \(J^{\star }\) denote the union of the sets \(J^{\star (x)}\) for all iterations thus far. Consider a stable iterate x at the beginning of any major cycle, and let

$$\begin{aligned} \varepsilon :=\frac{\sqrt{\eta (x)}}{4n\cdot \Theta \cdot \Vert A\Vert }\,. \end{aligned}$$

Theorem 4.5 guarantees that within \(O(nm^{2}\cdot \kappa ^2({\mathcal {X}}_A)\cdot \Vert A\Vert ^2\cdot \log (n+\kappa ({\mathcal {X}}_A)))\) major and minor cycles we arrive at an iterate \(x'\) such that \(\sqrt{\eta (x')}<\varepsilon \). We note that \(\log (n+\kappa ({\mathcal {X}}_A)+\Vert A\Vert )=O(\log (n+\kappa ({\mathcal {X}}_A)))\) according to Remark 2.4. We show that

$$\begin{aligned} J^{\star (x')}\cap I_0(x)\ne \emptyset \, . \end{aligned}$$
(12)

From here, we can conclude that \(J^{\star }\) was extended between iterates x and \(x'\). This may happen at most n times, leading to the claimed bound on the total number of major and minor cycles. Using Theorem 4.5 we also obtain the respective bounds on the number of major cycles for the two different updates.

For a contradiction, assume that (12) does not hold. Thus, for every \(i\in I_0(x)\), we have \(x'(i)\le \Theta \varepsilon \). Let us define \(\hat{x}\in {\mathbb {R}}^N\) as

$$\begin{aligned} \hat{x}(i):={\left\{ \begin{array}{ll} 0&{}\hbox {if }i\in I_0(x)\,,\\ x'(i)&{}\hbox {if }i\in J(x)\,. \end{array}\right. } \end{aligned}$$

By the above assumption, \(\Vert \hat{x}-x'\Vert _\infty \le \Theta \varepsilon \), and therefore \(\Vert A\hat{x}-Ax'\Vert \le \sqrt{n}\Theta \Vert A\Vert \varepsilon \). From Lemma 4.9, we can bound

$$\begin{aligned} \Vert A\hat{x}-b\Vert ^2=\Vert A x'-b\Vert ^2+\Vert A\hat{x}-Ax'\Vert ^2 \le 2p^*+ (n\Theta ^2\Vert A\Vert ^2+2)\varepsilon ^2\, . \end{aligned}$$
(13)

Recall that since x is a stable solution,

$$\begin{aligned} \Vert Ax-b\Vert =\min \left\{ \Vert Ay-b\Vert :\, y\in {\textbf{L}}(I_0(x),\emptyset )\right\} \,. \end{aligned}$$

Since \(\hat{x}\) is a feasible solution to this program, it follows that \(\Vert A\hat{x}-b\Vert ^2\ge \Vert Ax-b\Vert ^2\). We get that

$$\begin{aligned} 2\eta (x)=\Vert Ax-b\Vert ^2-2p^*\le \Vert A\hat{x}-b\Vert ^2-2p^*\le (n\Theta ^2\Vert A\Vert ^2+2)\varepsilon ^2\,, \end{aligned}$$

in contradiction with the choice of \(\varepsilon \). \(\square \)

5 Computational experiments

We present preliminary computational experiments with different versions of our algorithm, and compare them to standard gradient methods and existing NNLS implementations. The experiments were programmed and executed in MATLAB version R2023a on a personal computer with an 11th Gen Intel(R) Core(TM) i7-11370H CPU @ 3.30GHz and 16GB of memory.

We considered two families of randomly generated NNLS instances. In Appendix A, we also present experiments for capacitated instances (finite u(i) values).

We tested each combination of two update methods: Projected Gradient (PG), and coordinate (C); and two centroid mappings, the ‘oblivious’ mapping (5) and the ‘local norm’ mapping (6) with diagonal entries \(1/x(i),~i\in N\). Recall that for coordinate updates and starting from \(x=\textbf{0}\), there is a unique centroid mapping by Lemma 3.3.

Our first benchmarks are the projected gradient (PG) and the projected fast (accelerated) gradient (PFG) methods. In contrast to our algorithms, these do not finitely terminate. We stopped the algorithms once they found a near-optimal solution within a certain accuracy threshold.

Further, we also compare our algorithms against the standard MATLAB implementation of the Lawson–Hanson algorithm called lsqnonneg, and against the implementation TNT-NN from [21]. We note that lsqnonneg and the coordinate update version of our algorithms are essentially the same.

Table 1 Computation time (in sec) for uncapacitated rectangular instances
Table 2 # of major cycles for uncapacitated rectangular instances
Table 3 The total # of minor cycles for uncapacitated rectangular instances
Table 4 Computation time (in sec) for uncapacitated near-square instances
Table 5 # of major cycles for uncapacitated near-square instances
Table 6 Total # of minor cycles for uncapacitated near-square instances

Generating instances We generated two families of instances. In the rectangular instances \(n\ge 2m\), and in the near-square instances \(m\le n\le 1.1m\). In both cases, the entries of the \(m\times n\) matrix A were chosen independently and uniformly at random from the interval \([-0.5,0.5]\). In the rectangular instances, the entries of b were also chosen independently and uniformly at random from \([-0.5,0.5]\). Thus, the underlying LP \(Ax=b\), \(x\ge \textbf{0}\) may or may not be feasible.

For the near-square instances, such a random choice of b leads to infeasible instances with high probability. We used this method to generate infeasible instances. We also constructed families where the LP is feasible as follows. For a sparsity parameter \(\chi \in (0,1]\), we sampled a subset \(J\subseteq N\), adding each variable independently with probability \(\chi \), and generated coefficients \(\{z_i: i\in J\}\) independently at random from [0, 1]. We then set \(b=\sum _{j\in J}A^j z_j\).
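A sketch of the instance generation described above, assuming NumPy (the experiments themselves were run in MATLAB); the function and parameter names are ours.

```python
import numpy as np

def rectangular_instance(m, n, rng):
    """Rectangular instance (n >= 2m): entries of A and b uniform on [-0.5, 0.5]."""
    A = rng.uniform(-0.5, 0.5, size=(m, n))
    b = rng.uniform(-0.5, 0.5, size=m)
    return A, b

def near_square_feasible_instance(m, n, chi, rng):
    """Near-square instance (m <= n <= 1.1m) with Ax = b, x >= 0 feasible:
    b is a nonnegative combination of a random column subset J of density chi."""
    A = rng.uniform(-0.5, 0.5, size=(m, n))
    in_J = rng.random(n) < chi                  # each column joins J with prob. chi
    z = np.where(in_J, rng.uniform(0.0, 1.0, size=n), 0.0)
    return A, A @ z                             # b = sum_{j in J} A^j z_j

rng = np.random.default_rng(1)
A, b = near_square_feasible_instance(1000, 1050, chi=0.5, rng=rng)
```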

Computational results We stopped each algorithm when the computation time reached 60 s. For each (m, n), we ran all the algorithms 5 times; the results shown are averages over the 5 runs.

Table 1 shows the overall computational times for rectangular instances; values in brackets show the number of trials whose computation time exceeded 60 s. Tables 2 and 3 show the number of major cycles, and the total number of minor cycles, respectively. Table 4 shows the overall computational times for near-square instances. The status ‘I’ denotes infeasible instances and ‘F’ feasible instances, with the sparsity parameter \(\chi \) in brackets, with values 0.1, 0.5, and 1. Tables 5 and 6 show the number of major cycles, and the total number of minor cycles, respectively, for near-square instances.

Comparison of the results For rectangular instances, the ‘local-norm’ update (6) performs significantly better than the ‘oblivious’ update (5). The ‘oblivious’ updates are also outperformed by the coordinate updates, both in terms of running time as well as in the total number of minor cycles.

As noted above, while our algorithm with coordinate updates and lsqnonneg are basically the same, the running time of the latter is better by around a factor of two. This is presumably because lsqnonneg uses more efficient linear algebra operations than our more basic implementation.

The algorithm TNT-NN from [21] is a fast practical algorithm using a number of heuristics, representing the state-of-the-art active set method for NNLS. Notably, our algorithm with ‘local-norm’ updates (6) is almost always within a factor of two of TNT-NN for rectangular instances, and performs better in some cases. This is despite the fact that we only use a basic implementation, without more efficient linear algebra methods or further heuristics.

For rectangular instances, TNT-NN and ‘local-norm’ updates also outperform fast projected gradient in most cases.

The picture is more mixed for near-square instances. There is a marked difference between feasible and infeasible instances. The ‘local-norm’ and ‘oblivious’ update rules perform similarly, with a small number of major cycles. The number of minor cycles is much higher for infeasible instances. For infeasible instances, coordinate updates are faster than either variant of the PG update rule, while PG updates are faster for feasible instances.

The algorithm TNT-NN is consistently faster than our algorithm, with better running times for infeasible instances. For projected gradient and projected fast gradient, the running times are similar to TNT-NN except for feasible instances with sparsity parameter \(\chi =1\), where they do not terminate within the 60 s limit in most cases. In contrast, these appear to be the easiest instances to our method with PG updates with the ‘local-norm’ mapping.

6 Concluding remarks

We have proposed a new ‘Update-and-Stabilize’ framework for the minimum-norm-point problem P. Our method combines classical first order methods with ‘stabilizing’ steps using the centroid mapping that amounts to computing a projection to an affine subspace. Our algorithm is always finite, and is strongly polynomial when the associated circuit imbalance measure is constant. In particular, this gives the first such convergence bound for the Lawson–Hanson algorithm.

There is scope for further improvements both in the theoretical analysis and in practical implementations. In this paper, we only analyzed the running time for uncapacitated instances. Combined with existing results from [22], we expect that similar bounds can be shown for capacitated instances. We note that for the analysis, it would suffice to run minor cycles only once in a while, say after every O(n) gradient updates. From a practical perspective however, running minor cycles after every update appears to be highly beneficial in most cases. Rigorous computational experiments, using standard families of LP benchmarks, are left for future work.

Future work should also compare the performance of our algorithms to the gradient projection method [5, 23], transferring techniques from that method to our algorithm and vice versa. We note that for NNLS instances, starting from a stable point our algorithm already finds the optimal gradient update. However, a similar search as in gradient projection methods may be useful in the capacitated case. In the other direction, we note that the conjugate gradient iterations used in gradient projection do not correspond to an explicit choice of a centroid mapping. A possible enhancement of gradient projection could come from approximating a ‘local-norm’ objective as in (6) in the second stage.

We also point out that the ‘local-norm’ selection rule (6) was inspired by the affine scaling method; the important difference is that our algorithm moves all the way to the boundary, whereas affine scaling stays in the interior throughout.