Abstract
We consider the minimum-norm-point (MNP) problem over polyhedra, a well-studied problem that encompasses linear programming. We present a general algorithmic framework that combines two fundamental approaches for this problem: active set methods and first order methods. Our algorithm performs first order update steps, followed by iterations that aim to ‘stabilize’ the current iterate with additional projections, i.e., find a locally optimal solution whilst keeping the current tight inequalities. Such steps have been previously used in active set methods for the nonnegative least squares (NNLS) problem. We bound the number of iterations polynomially in the dimension and in the associated circuit imbalance measure. In particular, the algorithm is strongly polynomial for network flow instances. Classical NNLS algorithms such as the Lawson–Hanson algorithm are special instantiations of our framework; as a consequence, we obtain convergence bounds for these algorithms. Our preliminary computational experiments show promising practical performance.
1 Introduction
We study the minimum-norm-point (MNP) problem
where m and n are positive integers, \(M=\{1,\cdots ,m\}\) and \(N=\{1,\cdots ,n\}\), \(A\in {\mathbb {R}}^{M\times N}\) is a matrix with rank \({\text {rk}}(A)=m\), \(b\in {\mathbb {R}}^M\), and \(u\in ({\mathbb {R}}\cup \{\infty \})^N\). We will use the notation \({\textbf{B}}(u):=\{x\in {\mathbb {R}}^N\mid {\textbf{0}}\le x\le u\}\) for the feasible set. The problem P generalizes the linear programming (LP) feasibility problem: the optimum value is 0 if and only if \(Ax=b\), \(x\in {\textbf{B}}(u)\) is feasible. The case \(u(i)=\infty \) for all \(i\in N\) is also known as the nonnegative least squares (NNLS) problem, a fundamental problem in numerical analysis.
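As a concrete anchor for the notation, the following minimal numpy sketch evaluates the objective of P on a made-up \(1\times 2\) instance; the helper name `mnp_objective` and the instance are ours, not from the text.

```python
import numpy as np

def mnp_objective(A, b, x):
    """Objective of problem (P): (1/2) * ||Ax - b||^2."""
    r = A @ x - b
    return 0.5 * float(r @ r)

# Illustrative 1x2 instance (made up for this sketch). The LP-feasibility
# connection: the optimum value is 0 iff Ax = b, x in B(u) is feasible.
A = np.array([[1.0, 1.0]])
b = np.array([1.0])
u = np.array([1.0, 1.0])
x_feas = np.array([0.5, 0.5])          # satisfies Ax = b and 0 <= x <= u
assert np.all(x_feas >= 0) and np.all(x_feas <= u)
```

Here the feasible point witnesses optimum value 0, while the infeasible point \(x=\textbf{0}\) has a positive objective.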
Two extensively studied approaches for MNP and NNLS are active set methods and first order methods. An influential active set method was proposed by Lawson and Hanson [19, Chapter 23] in 1974. Variants of this algorithm were also proposed by Stoer [28], Björck [2], Wilhelmsen [31], and Leichner, Dantzig, and Davis [20]. Closely related is Wolfe’s classical minimum-norm-point algorithm [32]. These are iterative methods that maintain a set of active variables fixed at the lower or upper bounds, and passive (inactive) variables. In the main update steps, these algorithms fix the active variables at the lower or upper bounds, and perform unconstrained optimization on the passive variables. Such update steps require solving systems of linear equations. In all these methods, the set of columns corresponding to passive variables is linearly independent. The combinatorial nature of these algorithms makes it possible to show termination with an exact optimal solution in a finite number of iterations. However, obtaining subexponential convergence bounds for such active set algorithms has remained elusive; see Sect. 1.1 for more work on NNLS and Wolfe’s algorithm.
In the context of first order methods, the formulation P belongs to a family of problems for which Necoara, Nesterov, and Glineur [22] showed linear convergence bounds. That is, the number of iterations needed to find an \(\varepsilon \)-approximate solution depends linearly on \(\log (1/\varepsilon )\). Such convergence has been known for strongly convex functions, but this property does not hold for P. However, [22] shows that restricted variants of strong convexity also suffice for linear convergence. For problems of the form P, the required property follows using Hoffman-proximity bounds [16]; see [26] and the references therein for recent results on Hoffman-proximity. In contrast to active set methods, first order methods are computationally cheaper as they do not require solving systems of linear equations. On the other hand, they do not find exact solutions.
We propose a new algorithmic framework for the minimum-norm-point problem P that can be seen as a blend of active set and first order methods. Our algorithm performs stabilizing steps between first order updates, and terminates with an exact optimal solution in a finite number of iterations. Moreover, we show poly\((n,\kappa )\) running time bounds for multiple instantiations of the framework, where \(\kappa \) is the circuit imbalance measure associated with the matrix \((A\mid I_M)\) (see Sect. 2.1). This gives strongly polynomial bounds whenever \(\kappa \) is constant; in particular, \(\kappa =1\) for network flow feasibility. We note that if \(A\in \mathbb {Z}^{M\times N}\), then \(\kappa \le \Delta (A)\) for the maximum subdeterminant \(\Delta (A)\). Still, \(\kappa \) can be exponential in the encoding length of the matrix.
The stabilizing step is similar to the one used by Björck [2], who considered the same formulation P. The Lawson–Hanson algorithm for the NNLS problem can be seen as a special instantiation of our framework, and we obtain an \(O(n^{2}m^2\cdot \kappa ^2\cdot \Vert A\Vert ^2\cdot \log (n+\kappa ))\) iteration bound. These algorithms only use coordinate updates as first order steps, and maintain linear independence of the columns corresponding to passive variables. Our framework is significantly more general: we waive the linear independence requirement and allow for arbitrary active and passive sets. This provides much additional flexibility, as our framework can be implemented with a variety of first order methods. This feature also yields a significant advantage in our computational experiments.
Overview of the algorithm A key concept in our algorithm is the centroid mapping, defined as follows. For disjoint subsets \(I_0,I_1\subseteq N\), we let \({\textbf{L}}(I_0,I_1)\) denote the affine subspace of \({\mathbb {R}}^N\) where \(x(i)=0\) for \(i\in I_0\) and \(x(i)=u(i)\) for \(i\in I_1\). For \(x\in {\textbf{B}}(u)\), let \(I_0(x)\) and \(I_1(x)\) denote the subsets of coordinates i with \(x(i)=0\) and \(x(i)=u(i)\), respectively. The centroid mapping \({\Psi }: {\textbf{B}}(u)\rightarrow {\mathbb {R}}^N\) is a mapping with the property that \({\Psi }{(x)}\in \arg \min _y\{\tfrac{1}{2}\Vert Ay-b\Vert ^2\mid y\in {\textbf{L}}(I_0(x),I_1(x))\}\). This mapping may not be unique, since the columns of A corresponding to \(J(x):=\{i\in N\mid 0<x(i)<u(i)\}\) may not be independent: the optimal centroid set is itself an affine subspace. The point \(x\in {\textbf{B}}(u)\) is stable if \({\Psi }(x)=x\). This generalizes an update used by Björck [2]. However, in his setting the columns corresponding to J(x) are always linearly independent, and thus the centroid set is always a single point. Stable sets can also be seen as the analogues of corral solutions in Wolfe’s minimum-norm point algorithm.
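The centroid mapping can be sketched numerically as follows; this is an illustrative implementation of one admissible \({\Psi }\) (the function name and the instance are our own), fixing the active coordinates at their bounds and least-squares-fitting the free ones.

```python
import numpy as np

def centroid(A, b, u, x, tol=1e-9):
    """One choice of Psi(x): fix the coordinates in I0(x) at 0 and in I1(x)
    at u, then least-squares-fit the free coordinates J(x).
    np.linalg.lstsq returns one point of the centroid set, which may be a
    whole affine subspace if the free columns of A are dependent."""
    I0 = np.abs(x) <= tol
    I1 = np.abs(x - u) <= tol
    J = ~(I0 | I1)
    y = np.where(I1, u, 0.0)               # fixed part of y
    if J.any():
        y[J], *_ = np.linalg.lstsq(A[:, J], b - A @ y, rcond=None)
    return y

# Hypothetical instance: x(1) sits at its lower bound, x(0) is free,
# so Psi fixes y(1) = 0 and fits the remaining column to b.
A = np.array([[1.0, 0.0], [0.0, 1.0]])
b = np.array([2.0, 3.0])
u = np.array([np.inf, np.inf])
y = centroid(A, b, u, np.array([1.0, 0.0]))
```

Since `lstsq` returns the minimum-norm solution on the free coordinates, this particular choice corresponds to one fixed tie-breaking rule among the centroid set.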
Every major cycle starts with an update step and ends with a stable point. The update step could be any first-order step satisfying some natural requirements, such as variants of Frank–Wolfe, projected gradient, or coordinate updates. As long as the current iterate is not optimal, this update strictly improves the objective. Finite convergence follows from the fact that there can be at most \(3^n\) stable points.
After the update step, we start a sequence of minor cycles. From the current iterate \(x\in {\textbf{B}}(u)\), we move to \({\Psi }(x)\) in case \({\Psi }(x)\in {\textbf{B}}(u)\), or to the intersection of the boundary of \({\textbf{B}}(u)\) and the line segment \([x,{\Psi }(x)]\) otherwise. The minor cycles finish once \(x={\Psi }(x)\) is a stable point. The objective \(\tfrac{1}{2}\Vert Ax-b\Vert ^2\) is decreasing in every minor cycle, and at least one new coordinate \(i\in N\) is set to 0 or to u(i). Thus, the number of minor cycles in any major cycle is at most n. One can use various centroid mappings satisfying a mild requirement on \({\Psi }\), described in Sect. 2.3.
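The minor-cycle move described above amounts to a ratio test along the segment \([x,{\Psi }(x)]\); a minimal sketch (helper name ours), assuming x and w are feasible inputs:

```python
import numpy as np

def minor_cycle_step(x, w, u):
    """One minor cycle: move from x toward the centroid w = Psi(x); if the
    segment [x, w] leaves the box B(u), stop at the boundary (ratio test),
    which newly fixes at least one coordinate at 0 or u(i)."""
    d = w - x
    t = 1.0
    for i in range(len(x)):
        if d[i] < 0:                          # decreasing: may hit 0
            t = min(t, x[i] / -d[i])
        elif d[i] > 0 and np.isfinite(u[i]):  # increasing: may hit u(i)
            t = min(t, (u[i] - x[i]) / d[i])
    return x + t * d
```

When w already lies in \({\textbf{B}}(u)\), the step returns w itself (t = 1); otherwise some coordinate becomes tight, which is what bounds the number of minor cycles by n.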
We present a poly\((n,\kappa )\) convergence analysis for the NNLS problem with coordinate updates, which corresponds to the Lawson–Hanson algorithm and its variants. We expect that similar arguments extend to the capacitated case. The proof has two key ingredients. First, we show linear convergence of the first-order update steps (Theorem 4.5). Such a bound follows already from [22]; we present a simple self-contained proof exploiting properties of stable points and the uncapacitated setting. The second step of the analysis shows that in every poly\((n,\kappa )\) iterations, we can identify a new variable that will never become zero in subsequent iterations (Theorem 4.1). The proof relies on proximity arguments: we show that for any iterate x and any subsequent iterate \(x'\), the distance \(\Vert x-x'\Vert \) can be upper bounded in terms of n, \(\kappa \), and the optimality gap at x.
In Sect. 5, we present preliminary computational experiments using randomly generated problem instances of various sizes. We compare the performance of different variants of our algorithm to standard gradient methods. For the choice of update steps, projected gradient performs much better than the coordinate updates used in the NNLS algorithms. We compare an ‘oblivious’ centroid mapping and one that chooses \({\Psi }(x)\) as the nearest point to x in the centroid set in the ‘local norm’ (see Sect. 2.2). The latter appears to be significantly better. For choices of parameters \(n\ge 2m\), the running time of our method with projected gradient updates and local norm mapping is typically within a factor two of TNT-NN, the state-of-the-art practical active set heuristic for NNLS [21], despite the fact that we only use simple linear algebra tools and have made no attempts at practical speed-ups. The performance is often better than projected accelerated gradient descent, the best first order approach.
Proximity arguments and strongly polynomial algorithms Arguments that show strongly polynomial convergence by gradually revealing the support of an optimal solution are prevalent in combinatorial optimization. These date back to Tardos’s [29] groundbreaking work giving the first strongly polynomial algorithm for minimum-cost flows. Our proof is closer to the dual ‘abundant arc’ arguments by Fujishige [12] and Orlin [24]. Tardos generalized the above result to general LPs, giving a running time dependence poly\((n,\log \Delta (A))\), where \(\Delta (A)\) is the largest subdeterminant of the constraint matrix. In particular, her algorithm is strongly polynomial as long as the entries of the matrix are polynomially bounded integers. This framework was recently strengthened in [7] to poly\((n,\log \kappa (A))\) running time for the circuit imbalance measure \(\kappa (A)\). They also highlight the role of Hoffman-proximity and give such a bound in terms of \(\kappa (A)\). We note that the above algorithms—along with many other strongly polynomial algorithms in combinatorial optimization—modify the problem directly once new information is learned about the optimal support. In contrast, our algorithm does not require any such modifications, nor any knowledge or estimate of the condition number \(\kappa \). Arguments about the optimal support only appear in the analysis.
Strongly polynomial algorithms with poly\((n,\log \kappa (A))\) running time bounds can also be obtained using layered least squares interior point methods. This line of work was initiated by Vavasis and Ye [30] using a related condition measure \(\bar{\chi }(A)\). An improved version that also established the relation between \(\bar{\chi }(A)\) and \(\kappa (A)\) was recently given by Dadush et al. [6]. We refer the reader to the survey [9] for properties and further applications of circuit imbalances.
1.1 Further related work
The Lawson–Hanson algorithm remains popular for the NNLS problem, and several variants are known. Bro and De Jong [3] and Myre et al. [21] proposed empirically faster variants. In particular, [21] allows bigger changes in the active and passive sets, thus waiving the linear independence requirement on the passive variables, and reports a significant speedup. However, there are no theoretical results underpinning the performance of these heuristics.
Wolfe’s minimum-norm-point algorithm [32] considers the variant of P where the box constraint \(x\in {\textbf{B}}(u)\) is replaced by \(\sum _{i\in N} x_i=1\), \(x\ge \textbf{0}\). It has been successfully employed as a subroutine in various optimization problems, e.g., submodular function minimization [14], see also [1, 11, 13]. Beyond the trivial \(2^n\) bound, the convergence analysis remained elusive; the first bound with \(1/\varepsilon \)-dependence was given by Chakrabarty et al. [4] in 2014. Lacoste-Julien and Jaggi [17] gave a \(\log (1/\varepsilon )\) bound, parametrized by the pyramidal width of the polyhedron. Recently, De Loera et al. [8] showed an example of exponential time behaviour of Wolfe’s algorithm for the min-norm insertion rule (the analogue of a pivot rule); no exponential example is known for other insertion rules, such as the linopt rule used in the application to submodular minimization.
Our Update-and-Stabilize algorithm is also closely related to the Gradient Projection Method, see [5] and [23, Section 16.7]. This method also maintains a non-independent set of passive variables. For each gradient update, a more careful search is used in the gradient direction, ‘bending’ the movement direction whenever a constraint is hit. The analogues of stabilizer steps are conjugate gradient iterations. Thus, this method avoids the computationally expensive step of exact projections; on the other hand, finite termination is not guaranteed. We further discuss the relationship between the two algorithms in Sect. 6.
There are similarities between our algorithm and the Iteratively Reweighted Least Squares (IRLS) method that has been intensively studied since the 1960s [18, 25]. For some \(p\in [0,\infty ]\), \(A\in {\mathbb {R}}^{M\times N}\) and \(b\in {\mathbb {R}}^M\), the goal is to approximately solve \(\min \{\Vert x\Vert _p\mid \, Ax=b\}\). At each iteration, a weighted minimum-norm point problem \(\min \{\sum _{i=1}^n w^{(t)}_i x^2_i\mid \, Ax=b\}\) is solved, where the weights \(w^{(t)}\) are iteratively updated. The LP-feasibility problem \(Ax=b\), \(\textbf{0}\le x\le \textbf{1}\) for finite upper bounds \(u=\textbf{1}\) can be phrased as an \(\ell _{\infty }\)-minimization problem \(\min \{\Vert x\Vert _\infty \mid \, Ax=b-A\textbf{1}/2\}\). Ene and Vladu [10] gave an efficient variant of IRLS for \(\ell _1\) and \(\ell _\infty \)-minimization; see their paper for further references. Some variants of our algorithm solve a weighted least squares problem with changing weights in the stabilizing steps. There are, however, significant differences between IRLS and our method. The underlying optimization problems are different, and IRLS does not find an exact optimal solution in finite time. Applied to LP in the \(\ell _\infty \) formulation, IRLS satisfies \(Ax=b\) throughout while violating the box constraints \(\textbf{0}\le x\le u\). In contrast, iterates of our algorithm violate \(Ax=b\) but maintain \(\textbf{0}\le x\le u\). The role of the least squares subroutines is also rather different in the two settings.
2 Preliminaries
Notation We use \(N\oplus M\) for the disjoint union (or direct sum) of copies of the two sets. For a matrix \(A\in {\mathbb {R}}^{M\times N}\), \(i\in M\) and \(j\in N\), we denote the ith row of A by \(A_i\) and the jth column by \(A^j\). For any matrix X, we denote by \(X^\top \) its transpose. We let \(\Vert \cdot \Vert _p\) denote the \(\ell _p\) vector norm; we use \(\Vert \cdot \Vert \) to denote the Euclidean norm \(\Vert \cdot \Vert _2\). For a matrix \(A\in {\mathbb {R}}^{M\times N}\), we let \(\Vert A\Vert \) denote the spectral norm, that is, the \(\ell _2\rightarrow \ell _2\) operator norm.
For any \(x, y\in {\mathbb {R}}^M\) we define \(\left\langle x,y\right\rangle =\sum _{i\in M}x(i)y(i)\). We will use this notation also in other dimensions. We let \([x,y]:=\{\lambda x+(1-\lambda )y\mid \lambda \in [0,1]\}\) denote the line segment between the vectors x and y.
2.1 Elementary vectors and circuits
For a linear space \(W\subsetneq {\mathbb {R}}^N\), \(g\in W\) is an elementary vector if g is a support minimal nonzero vector in W, that is, no \(h\in W\setminus \{\textbf{0}\}\) exists such that \({\text {supp}}(h)\subsetneq {\text {supp}}(g)\), where \({\text {supp}}\) denotes the support of a vector. We let \({\mathcal {F}}(W)\subseteq W\) denote the set of elementary vectors. A circuit in W is the support of some elementary vector; these are precisely the circuits in the associated linear matroid \(\mathcal {M}(W)\).
The subspaces \(W=\{\textbf{0}\}\) and \(W={\mathbb {R}}^N\) are called trivial subspaces; all other subspaces are nontrivial. We define the circuit imbalance measure
\[\kappa (W):=\max \left\{ \left| \frac{g(j)}{g(i)}\right| \,:\, g\in {\mathcal {F}}(W),\ i,j\in {\text {supp}}(g)\right\} \]
for nontrivial subspaces and \(\kappa (W)=1\) for trivial subspaces. For a matrix \(A\in {\mathbb {R}}^{M\times N}\), we use the notation \(\kappa (A)\) to denote \(\kappa (\ker (A))\).
The following theorem shows the relation to totally unimodular (TU) matrices. Recall that a matrix is totally unimodular (TU) if the determinant of every square submatrix is 0, \(+1\), or \(-1\).
Theorem 2.1
[Cederbaum, 1957; see also [9]] Let \(W \subset {\mathbb {R}}^N\) be a linear subspace. Then \(\kappa (W) = 1\) if and only if there exists a TU matrix \(A\in {\mathbb {R}}^{M\times N}\) such that \(W = \ker (A)\).
We also note that if \(A\in \mathbb {Z}^{M\times N}\) is an integer matrix, then \(\kappa (A)\le \Delta (A)\) for the maximum subdeterminant \(\Delta (A)\).
Conformal circuit decompositions We say that the vector \(y \in {\mathbb {R}}^N\) conforms to \(x\in {\mathbb {R}}^N\) if \(x(i)y(i) > 0\) whenever \(y(i)\ne 0\). Given a subspace \(W\subseteq {\mathbb {R}}^N\), a conformal circuit decomposition of a vector \(v\in W\) is a decomposition
\[v=\sum _{k=1}^{\ell }h^k\,,\]
where \(\ell \le n\) and \(h^1,h^2,\ldots ,h^\ell \in {\mathcal {F}}(W)\) are elementary vectors that conform to v. A fundamental result on elementary vectors asserts the existence of a conformal circuit decomposition; see e.g., [15, 27]. Note that there may be multiple conformal circuit decompositions of a vector.
Lemma 2.2
For every subspace \(W\subseteq {\mathbb {R}}^N\), every \(v\in W\) admits a conformal circuit decomposition.
Given \(A\in {\mathbb {R}}^{M\times N}\), we define the extended subspace \({\mathcal {X}}_A\subset {\mathbb {R}}^{N\oplus M}\) as \({\mathcal {X}}_A:=\ker ([A\mid -I_M])\). Hence, for every \(v\in {\mathbb {R}}^N\), \((v,Av)\in {\mathcal {X}}_A\). For \(v\in {\mathbb {R}}^N\), the generalized path-circuit decomposition of v with respect to A is a decomposition \(v=\sum _{k=1}^\ell h^k\), where \(\ell \le n\), and for each \(1\le k\le \ell \), \((h^k,Ah^k)\in {\mathbb {R}}^{N\oplus M}\) is an elementary vector in \({\mathcal {X}}_A\) that conforms to (v, Av). Moreover, \(h^k\) is an inner vector in the decomposition if \(Ah^k=\textbf{0}\) and an outer vector otherwise.
We say that \(v\in {\mathbb {R}}^N\) is cycle-free with respect to A, if all generalized path-circuit decompositions of v contain outer vectors only. The following lemma will play a key role in analyzing our algorithms.
Lemma 2.3
For any \(A\in {\mathbb {R}}^{M\times N}\), let \(v\in {\mathbb {R}}^N\) be cycle-free with respect to A. Then,
\[\Vert v\Vert _\infty \le \kappa ({\mathcal {X}}_A)\,\Vert Av\Vert _1\qquad \text {and}\qquad \Vert v\Vert \le \sqrt{m}\cdot \kappa ({\mathcal {X}}_A)\,\Vert Av\Vert _1\,.\]
Proof
Consider a generalized path-circuit decomposition \(v=\sum _{k=1}^\ell h^k\). By assumption, \(Ah^k\ne \textbf{0}\) for each k. Thus, for every \(j\in {\text {supp}}(h^k)\) there exists an \(i\in M\), such that \(|h^k(j)|\le \kappa ({{\mathcal {X}}_A}) |A_i h^k|\). For every \(j\in N\), the conformity of the decomposition implies \(|v(j)|=\sum _{k=1}^\ell |h^k(j)|\). Similarly, for every \(i\in M\), \(|A_i v|=\sum _{k=1}^\ell |A_i h^k|\). These imply the inequality \(\Vert v\Vert _\infty \le \kappa ({{\mathcal {X}}_A}) \Vert Av\Vert _1\).
For the second inequality, note that for any outer vector \((h^k,Ah^k)\in {\mathcal {X}}_A\), the columns in \({\text {supp}}(h^k)\) must be linearly independent. Consequently, \(\Vert h^k\Vert _2\le \sqrt{m}\cdot \kappa ({{\mathcal {X}}_A})\cdot |(Ah^k)_i|\) for each k and \(i\in {\text {supp}}(Ah^k)\). This implies
\[\Vert v\Vert \le \sum _{k=1}^{\ell }\Vert h^k\Vert \le \sqrt{m}\cdot \kappa ({\mathcal {X}}_A)\sum _{k=1}^{\ell }\Vert Ah^k\Vert _1=\sqrt{m}\cdot \kappa ({\mathcal {X}}_A)\,\Vert Av\Vert _1\,,\]
completing the proof. \(\square \)
Remark 2.4
We note that a similar argument shows that \(\Vert A\Vert \le \sqrt{m\tau (A)}\cdot \kappa ({\mathcal {X}}_A)\), where \(\tau (A)\le m\) is the maximum size of \({\text {supp}}(Ah)\) for an elementary vector \((h,Ah)\in {\mathcal {X}}_A\).
Example 2.5
Let \(A\in {\mathbb {R}}^{M\times N}\) be the node-arc incidence matrix of a directed graph \(D=(M,N)\). The system \(Ax=b\), \(x\in {\textbf{B}}(u)\) then corresponds to a network flow feasibility problem. Here, b(i) is the demand of node \(i\in M\), i.e., the inflow minus the outflow at i is required to be b(i). Recall that A is a TU matrix; consequently, \((A|-I_M)\) is also TU, and \(\kappa ({{\mathcal {X}}_A})=1\). Our algorithm is strongly polynomial in this setting. Note that inner vectors correspond to cycles and outer vectors to paths; this motivates the term ‘generalized path-circuit decomposition.’ We also note \(\tau (A)=2\), and thus \(\Vert A\Vert \le \sqrt{2|M|}\) in this case.
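The network setting can be checked on a toy digraph; the sketch below (instance made up here) builds a node-arc incidence matrix and verifies the norm bound \(\Vert A\Vert \le \sqrt{2|M|}\) from Example 2.5.

```python
import numpy as np

# Node-arc incidence matrix of a directed graph on nodes {0, 1, 2} with
# arcs (0,1), (1,2), (0,2): column j has -1 at the tail and +1 at the head
# of arc j (matching the 'inflow minus outflow' sign convention).
arcs = [(0, 1), (1, 2), (0, 2)]
m = 3
A = np.zeros((m, len(arcs)))
for j, (tail, head) in enumerate(arcs):
    A[tail, j] = -1.0
    A[head, j] = +1.0

# Each column has exactly one +1 and one -1, so tau(A) = 2 and the bound
# ||A|| <= sqrt(2 * |M|) applies (with kappa(X_A) = 1 since A is TU):
spectral_norm = np.linalg.norm(A, 2)
assert spectral_norm <= np.sqrt(2 * m)
```

Each column summing to zero reflects that arcs conserve flow between their endpoints; the spectral norm here is well below the stated bound.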
2.2 Optimal solutions and proximity
Let
\[Z(A,u):=\left\{ Ax\mid x\in {\textbf{B}}(u)\right\} \,.\]
Thus, Problem P is to find the point in Z(A, u) that is nearest to b with respect to the Euclidean norm. We note that if the upper bounds u are finite, Z(A, u) is called a zonotope.
Throughout, we let \(p^*\) denote the optimum value of P. Note that whereas the optimal solution \(x^*\) may not be unique, the vector \(b^*:=Ax^*\) is unique by strong convexity; we have \(p^*=\tfrac{1}{2}\Vert b-b^*\Vert ^2\). We use
\[\eta (x):=\tfrac{1}{2}\Vert Ax-b\Vert ^2-p^*\]
to denote the optimality gap for \(x\in {\textbf{B}}(u)\). The point \(x\in {\textbf{B}}(u)\) is an \(\varepsilon \)-approximate solution if \(\eta (x)\le \varepsilon \).
For a point \(x\in {\textbf{B}}(u)\), let
The gradient of the objective \(\tfrac{1}{2}\Vert Ax-b\Vert ^2\) in P can be written as
\[g^x:=A^\top (Ax-b)\,.\qquad (2)\]
We recall the first order optimality conditions.
Lemma 2.6
The point \(x\in {\textbf{B}}(u)\) is an optimal solution to P if and only if \(g^x(i)=0\) for all \(i\in J(x)\), \(g^x(i)\ge 0\) for all \(i\in I_0(x)\), and \(g^x(i)\le 0\) for all \(i\in I_1(x)\).
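Lemma 2.6 translates directly into a numerical optimality check; a sketch under the assumption of a fixed tolerance (the function name and instance are ours):

```python
import numpy as np

def is_optimal(A, b, u, x, tol=1e-9):
    """First order optimality check following Lemma 2.6: for g = A^T(Ax - b),
    g(i) = 0 on the free coordinates J(x), g(i) >= 0 where x(i) = 0,
    and g(i) <= 0 where x(i) = u(i)."""
    g = A.T @ (A @ x - b)
    I0 = np.abs(x) <= tol
    I1 = np.abs(x - u) <= tol
    J = ~(I0 | I1)
    return bool(np.all(np.abs(g[J]) <= tol)
                and np.all(g[I0] >= -tol)
                and np.all(g[I1] <= tol))

# Made-up NNLS-type instance: minimizing ||x - b|| over x >= 0 simply
# clips the negative entries of b to zero.
A = np.eye(2)
b = np.array([-1.0, 0.5])
u = np.array([np.inf, np.inf])
```

For this instance the clipped point \((0,0.5)\) passes the check while the origin fails it, since the second coordinate still has a strictly negative gradient there.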
Using Lemma 2.3, we can bound the distance of any x from the nearest optimal solution.
Lemma 2.7
For any \(x\in {\textbf{B}}(u)\), there exists an optimal solution \(x^*\) to P such that
\[\Vert x-x^*\Vert _\infty \le \kappa ({\mathcal {X}}_A)\,\Vert Ax-b^*\Vert _1\qquad \text {and}\qquad \Vert x-x^*\Vert \le \sqrt{m}\cdot \kappa ({\mathcal {X}}_A)\,\Vert Ax-b^*\Vert _1\,.\]
Proof
Let us select an optimal solution \(x^*\) to P such that \(\Vert x-x^*\Vert _2\) is minimal. We show that \(x-x^*\) is cycle-free w.r.t. A; the statements then follow from Lemma 2.3.
For a contradiction, assume a generalized path-circuit decomposition of \(x-x^*\) contains an inner vector g, i.e., \(Ag=\textbf{0}\). By conformity of the decomposition, for \(\bar{x}=x^*+g\) we have \(\bar{x}\in {\textbf{B}}(u)\) and \(A\bar{x}=A x^*\). Thus, \(\bar{x}\) is another optimal solution, but \(\Vert x-\bar{x}\Vert _2<\Vert x-x^*\Vert _2\), a contradiction. \(\square \)
2.3 The centroid mapping
Let us denote by \(3^N\) the set of all ordered pairs \((I_0,I_1)\) of disjoint subsets \(I_0,I_1\subseteq N\), and let \(I_*:=\{i\in N\mid u(i)<\infty \}\). For any \((I_0,I_1)\in 3^N\) with \(I_1\subseteq I_*\), we let
\[{\textbf{L}}(I_0,I_1):=\left\{ x\in {\mathbb {R}}^N\;\middle |\; x(i)=0\ \ \forall i\in I_0,\ \ x(i)=u(i)\ \ \forall i\in I_1\right\} \,.\]
We call \(\{Ax \mid x\in {\textbf{B}}(u)\cap {\textbf{L}}(I_0,I_1)\} \subseteq Z(A,u)\) a pseudoface of Z(A, u). We note that every face of Z(A, u) is a pseudoface, but there might be pseudofaces that do not correspond to any face.
We define a centroid set for \((I_0,I_1)\) as
\[{\mathcal {C}}(I_0,I_1):=\arg \min _y\left\{ \tfrac{1}{2}\Vert Ay-b\Vert ^2\,\middle |\, y\in {\textbf{L}}(I_0,I_1)\right\} \,.\]
Proposition 2.8
For \((I_0,I_1)\in 3^N\) with \(I_1\subseteq I_*\), \({\mathcal {C}}(I_0,I_1)\) is an affine subspace of \({\mathbb {R}}^N\), and for some \(w\in {\mathbb {R}}^M\), it holds that \(Ay=w\) for every \(y\in {\mathcal {C}}(I_0,I_1)\).
The centroid mapping \({\Psi }:\, {\textbf{B}}(u)\rightarrow {\mathbb {R}}^N\) is a mapping that satisfies
\[{\Psi }(x)\in {\mathcal {C}}(I_0(x),I_1(x))\quad \text {for every }x\in {\textbf{B}}(u)\,.\]
We say that \(x\in {\textbf{B}}(u)\) is a stable point if \({\Psi }(x)=x\). A simple, ‘oblivious’ centroid mapping arises by taking a minimum-norm point of the centroid set:
\[{\Psi }(x):=\arg \min \left\{ \Vert y\Vert \,\middle |\, y\in {\mathcal {C}}(I_0(x),I_1(x))\right\} \,.\]
However, this mapping has some undesirable properties. For example, we may have an iterate x that is already in \({\mathcal {C}}(I_0(x),I_1(x))\), but \({\Psi }(x)\ne x\). Instead, we aim for centroid mappings that move the current point ‘as little as possible’. This can be formalized as follows. The centroid mapping \({\Psi }\) is called cycle-free, if the vector \({\Psi }(x)-x\) is cycle-free w.r.t. A for every \(x\in {\textbf{B}}(u)\). The next claim describes a general class of cycle-free centroid mappings.
Lemma 2.9
For every \(x\in {\textbf{B}}(u)\), let \(D(x)\in {\mathbb {R}}^{N\times N}_{>0}\) be a positive diagonal matrix. Then,
\[{\Psi }(x):=\arg \min _y\left\{ \Vert D(x)(y-x)\Vert \,\middle |\, y\in {\mathcal {C}}(I_0(x),I_1(x))\right\} \qquad (6)\]
defines a cycle-free centroid mapping.
Proof
For a contradiction, assume \(y-x\) is not cycle-free for \(y={\Psi }{(x)}\), that is, a generalized path-circuit decomposition contains an inner vector z. For \(y'=y-z\) we have \(Ay'=Ay\), meaning that \(y'\in {\mathcal {C}}(I_0(x),I_1(x))\). This is a contradiction, since \(\Vert D(x)(y'-x)\Vert <\Vert D(x)(y-x)\Vert \) for any positive diagonal matrix D(x). \(\square \)
We emphasize that D(x) in the above statement is a function of x and can be any positive diagonal matrix. Note also that the diagonal entries for indices in \(I_0(x)\cup I_1(x)\) do not matter. In our experiments, defining D(x) with diagonal entries \(1/x(i)+1/(u(i)-x(i))\) for \(i\in J(x)\) performs particularly well. Intuitively, this choice moves the coordinates close to the boundary by less. The next proposition follows from Lagrangian duality, and provides a way to compute \({\Psi }(x)\) as in (6) by solving a system of linear equations.
Proposition 2.10
For a partition \(N=I_0\cup I_1\cup J\), the centroid set can be written as
\[{\mathcal {C}}(I_0,I_1)=\left\{ y\in {\textbf{L}}(I_0,I_1)\,\middle |\, (A^{J})^\top (Ay-b)=\textbf{0}\right\} \,.\]
For \((I_0,I_1,J)=(I_0(x),I_1(x),J(x))\) and \(D=D(x)\), the point \(y={\Psi }(x)\) as in (6) can be obtained as the unique solution to the system of linear equations
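One way to realize the mapping (6) numerically is via pseudoinverses; the rearrangement below (substituting \(w=D(y_J-x_J)\) and taking the minimum-norm solution of the resulting consistent system) is our own sketch, not the paper's linear system, and all names are ours.

```python
import numpy as np

def local_centroid(A, b, u, x, d, tol=1e-9):
    """Centroid mapping (6): among the minimizers of ||Ay - b|| with the
    coordinates in I0(x), I1(x) kept fixed, return the one minimizing
    ||D(x)(y - x)|| for the positive diagonal D(x) = diag(d).
    With w = D(y_J - x_J), the minimizers satisfy the consistent system
    (A_J D^{-1}) w = p - A_J x_J, where p is the projection of the
    residual onto range(A_J); pinv yields its min-norm solution."""
    I1 = np.abs(x - u) <= tol
    J = (np.abs(x) > tol) & ~I1
    y = np.where(I1, u, 0.0)
    if not J.any():
        return y
    AJ = A[:, J]
    rhs = b - A @ y                          # target for the free columns
    p = AJ @ (np.linalg.pinv(AJ) @ rhs)      # projection onto range(A_J)
    B = AJ / d[J]                            # A_J D^{-1}
    w = np.linalg.pinv(B) @ (p - AJ @ x[J])  # min-norm w
    y[J] = x[J] + w / d[J]
    return y

# Made-up instance with dependent free columns, so the centroid set is a
# line {y1 + y2 = 4}; the weights d break the tie.
A = np.array([[1.0, 1.0]])
b = np.array([4.0])
u = np.array([np.inf, np.inf])
x = np.array([1.0, 1.0])
y_unw = local_centroid(A, b, u, x, np.ones(2))       # unweighted: nearest point
y_wt = local_centroid(A, b, u, x, np.array([1.0, 3.0]))
```

With unit weights the map returns the Euclidean-nearest minimizer \((2,2)\); the heavier weight on the second coordinate shifts most of the movement onto the first one, while keeping \(Ay=4\), as Proposition 2.8 guarantees.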
3 The update-and-stabilize framework
Now we describe a general algorithmic framework MNPZ(A, b, u) for solving P, shown in Algorithm 1. Similarly to Wolfe’s MNP algorithm, the algorithm comprises major and minor cycles. We maintain a point \(x\in {\textbf{B}}(u)\), and x is stable at the end of every major cycle. Each major cycle starts by calling the subroutine \(\texttt {Update}(x)\); the only general requirement on this subroutine is as follows:
-
(U1)
for \(y=\texttt {Update}(x)\), \(y=x\) if and only if x is optimal to P, and \(\Vert Ay-b\Vert <\Vert Ax-b\Vert \) otherwise, and
-
(U2)
if \(y\ne x\), then for any \(\lambda \in [0,1)\), \(z=\lambda y+(1-\lambda )x\) satisfies \(\Vert Ay-b\Vert <\Vert Az-b\Vert \).
Property U1 can be obtained from any first order algorithm; we introduce some important examples in Sect. 3.1. Property U2 might be violated when using a fixed step length, which is a common choice. In order to guarantee U2, we can post-process the first order update: choose y as the optimal point on the line segment \([x,y']\), where \(y'\) is the update found by the fixed-step update.
The algorithm terminates in the first major cycle in which \(x=\texttt {Update}(x)\). Within each major cycle, the minor cycles repeatedly use the centroid mapping \(\Psi \). As long as \(w:=\Psi (x)\ne x\), i.e., x is not stable, we set \(x:=w\) if \(w\in {\textbf{B}}(u)\); otherwise, we set the next x as the intersection of the line segment [x, w] and the boundary of \({\textbf{B}}(u)\). The requirement U1 is already sufficient to show finite termination.
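The framework as a whole can be sketched for NNLS instances as follows. This is a simplified illustration only (projected-gradient-style updates, least-squares stabilization, naive tolerances), with no claim to match the exact bookkeeping of Algorithm 1; all names are ours.

```python
import numpy as np

def update_and_stabilize_nnls(A, b, max_major=1000, tol=1e-10):
    """Illustration-only sketch of the Update-and-Stabilize framework for
    NNLS instances (all u(i) infinite). Major cycle: a projected-gradient
    style update with optimal line search; minor cycles: move toward a
    centroid (least squares on the positive support) until stable."""
    m, n = A.shape
    x = np.zeros(n)
    for _ in range(max_major):
        g = A.T @ (A @ x - b)
        # feasible descent direction: -g on positive coords, max(-g, 0) at zero
        z = np.where(x > tol, -g, np.maximum(-g, 0.0))
        if np.linalg.norm(z) <= tol:
            return x                                   # first order optimal
        Az = A @ z
        x = x + (z @ z) / (Az @ Az) * z                # optimal line search
        for _ in range(n):                             # minor cycles
            J = x > tol
            w = np.zeros(n)
            if J.any():
                w[J], *_ = np.linalg.lstsq(A[:, J], b, rcond=None)
            if np.all(w >= -tol):                      # centroid is feasible
                x = np.maximum(w, 0.0)
                break
            d = w - x
            t = min(x[i] / -d[i] for i in range(n) if d[i] < -tol)
            x = x + min(1.0, t) * d                    # stop at the boundary
            x[np.abs(x) < tol] = 0.0
    return x

# Made-up separable instance: the NNLS optimum clips b at zero.
A = np.array([[1.0, 0.0], [0.0, 1.0]])
b = np.array([2.0, -1.0])
x_opt = update_and_stabilize_nnls(A, b)
```

On this toy instance the first major cycle already reaches the stable optimum \((2,0)\); the sketch uses dense `lstsq` solves for the stabilizing steps, which is the computationally expensive part the paper's discussion refers to.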
Theorem 3.1
Consider any \(\texttt {Update}(x)\) subroutine that satisfies U1 and any centroid mapping \(\Psi \). The algorithm MNPZ(A, b, u) finds an optimal solution to P within \(3^n\) major cycles. Every major cycle contains at most n minor cycles.
Proof
Requirement U1 guarantees that if the algorithm terminates, it returns an optimal solution. We claim that the same sets \((I_0,I_1)\) cannot appear as \((I_0(x),I_1(x))\) at the end of two different major cycles; this implies the bound on the number of major cycles. To see this, we note that for \(x=\Psi (x)\), \(x\in {\mathcal {C}}(I_0(x),I_1(x))={\mathcal {C}}(I_0,I_1)\); thus, \(\Vert Ax-b\Vert =\min \left\{ \Vert Az-b\Vert \mid \ z\in {\textbf{L}}(I_0,I_1)\right\} \). By U1, \(\Vert Ay-b\Vert <\Vert Ax-b\Vert \) at the beginning of every major cycle. Moreover, it follows from the definition of the centroid mapping that \(\Vert Ax-b\Vert \) is non-increasing in every minor cycle. To bound the number of minor cycles in a major cycle, note that the set \(I_0(x)\cup I_1(x)\subseteq N\) is extended in every minor cycle. \(\square \)
3.1 The update subroutine
We can implement the \(\texttt {Update}(x)\) subroutine satisfying U1 and U2 using various first order methods for constrained optimization.
Recall the gradient \(g^x\) from (2); we use \(g=g^x\) when x is clear from the context. The following property of stable points can be compared to the optimality condition in Lemma 2.6.
Lemma 3.2
If x is a stable point, i.e., \(x={\Psi }(x)\), then \(g^x(j)=0\) for all \(j\in J(x)\).
Proof
This directly follows from Proposition 2.10 that asserts \((A^{J(x)})^\top (Ax-b)=\textbf{0}\). \(\square \)
We now describe three classical options. We stress that the centroid mapping \({\Psi }\) can be chosen independently from the update step.
The Frank–Wolfe update The Frank–Wolfe or conditional gradient method is applicable only in the case when u(i) is finite for every \(i\in N\). In every update step, we start by computing \(\bar{y}\) as a minimizer of the linear objective \(\left\langle g,y\right\rangle \) over \({\textbf{B}}(u)\), that is,
\[\bar{y}\in \arg \min \left\{ \left\langle g,y\right\rangle \,\middle |\, y\in {\textbf{B}}(u)\right\} \,.\]
We set \(\texttt {Update}(x):=x\) if \(\left\langle g,\bar{y}\right\rangle =\left\langle g,x\right\rangle \); otherwise, \(y=\texttt {Update}(x)\) is selected so that y minimizes \(\tfrac{1}{2}\Vert Ay-b\Vert ^2\) on the line segment \([x,\bar{y}]\).
Clearly, \(\bar{y}(i)=0\) whenever \(g(i)>0\), and \(\bar{y}(i)=u(i)\) whenever \(g(i)<0\). However, \(\bar{y}(i)\) can be chosen arbitrarily if \(g(i)=0\). In this case, we keep \(\bar{y}(i)=x(i)\); this will be significant to guarantee stability of solutions in the analysis.
The projected gradient update The projected gradient update moves in the opposite gradient direction to \(\bar{y}:=x-\lambda g\) for some step-length \(\lambda >0\), and obtains the output \(y=\texttt {Update}(x)\) as the projection y of \(\bar{y}\) to the box \({\textbf{B}}(u)\). This projection simply changes every negative coordinate to 0 and every \(\bar{y}(i)>u(i)\) to \(y(i)=u(i)\). To ensure U2, we can perform an additional step that replaces y by the point \(y'\in [x,y]\) that minimizes \(\tfrac{1}{2}\Vert Ay'-b\Vert ^2\).
Consider now an NNLS instance (i.e., \(u(i)=\infty \) for all \(i\in N\)), and let x be a stable point. Recall \(I_1(x)=\emptyset \) in the NNLS setting. Lemma 3.2 allows us to write the projected gradient update in the following simple form that also enables to use optimal line search. Define
\[z^x:=\max \{-g^x,\textbf{0}\}\,,\qquad (8)\]
where the maximum is taken coordinate-wise, and use \(z=z^x\) when clear from the context; note that \(g^x(i)=0\) for \(i\in J(x)\) by Lemma 3.2, so \(z(i)=-g(i)\) on \(J(x)\). According to Lemma 2.6, x is optimal to P if and only if \(z=\textbf{0}\). We use the optimal line search
\[y:=x+\lambda ^*z\,,\qquad \text {where }\lambda ^*\in \arg \min _{\lambda \ge 0}\tfrac{1}{2}\Vert A(x+\lambda z)-b\Vert ^2\,.\]
If \(z\ne \textbf{0}\), this can be written explicitly as
\[y=x+\frac{\Vert z\Vert ^2}{\Vert Az\Vert ^2}\,z\,.\]
To verify this formula, we note that \(\Vert z\Vert ^2=-\langle g, z \rangle \), since for every \(i\in N\) either \(z(i)=0\) or \(z(i)=-g(i)\).
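The explicit NNLS projected gradient step above can be sketched as follows (function name ours); the internal assertion checks the identity \(\Vert z\Vert ^2=-\left\langle g,z\right\rangle \) used to derive the step length.

```python
import numpy as np

def pg_update_nnls(A, b, x, tol=1e-12):
    """Projected gradient update at a stable point of an NNLS instance,
    with the explicit optimal step length lambda* = ||z||^2 / ||Az||^2."""
    g = A.T @ (A @ x - b)
    # z = max(-g, 0) coordinate-wise; on J(x) this equals -g = 0 by stability
    z = np.where(x > tol, -g, np.maximum(-g, 0.0))
    if np.linalg.norm(z) <= tol:
        return x                           # x is already optimal (Lemma 2.6)
    Az = A @ z
    assert np.isclose(z @ z, -(g @ z))     # identity behind the step length
    return x + (z @ z) / (Az @ Az) * z

# Made-up instance: from x = 0, one step reaches the NNLS optimum (2, 0).
y = pg_update_nnls(np.eye(2), np.array([2.0, -1.0]), np.zeros(2))
```

Since \(z\ge \textbf{0}\) here, the line search never leaves the nonnegative orthant, so no extra projection is needed after the step.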
Coordinate update Our third update rule is the one used in the Lawson–Hanson algorithm. Given a stable point \(x\in {\textbf{B}}(u)\), we select a coordinate \(j\in N\) where either \(j\in I_0(x)\) and \(g(j)<0\) or \(j\in I_1(x)\) and \(g(j)>0\), and set y such that \(y(i)=x(i)\) if \(i\ne j\), and y(j) is chosen in [0, u(j)] so that \(\tfrac{1}{2}\Vert Ay-b\Vert ^2\) is minimized. As in the Lawson–Hanson algorithm, we can maintain basic solutions throughout.
Lemma 3.3
Assume \(A^J\) is linearly independent for \(J=J(x)\). Then, \(A^{J'}\) is also linearly independent for \(J'=J(y)=J\cup \{j\}\), where \(y=\texttt {Update}(x)\) using a coordinate update.
Proof
For a contradiction, assume \(A^j=A^Jw\) for some \(w\in {\mathbb {R}}^J\). Then,
a contradiction. \(\square \)
Let us start with \(x=\textbf{0}\), i.e., \(J(x)=I_1(x)=\emptyset \), \(I_0(x)=N\). Then, \(A^{J(x)}\) remains linearly independent throughout. Hence, every stable solution x is a basic solution to P. Note that whenever \(A^{J(x)}\) is linearly independent, \({\mathcal {C}}(I_0(x),I_1(x))\) contains a single point, hence, \({\Psi }(x)\) is uniquely defined.
For the special case of NNLS, i.e., for the case with no upper bounds, one can obtain simple explicit formulas for the coordinate update y. For z as in (8), let us return \(y=x\) if \(z=\textbf{0}\). Otherwise, let \(j\in \arg \max _k z(k)\); note that \(j\in I_0(x)\). Let
The following lemma is immediate. In the NNLS setting, U2 is guaranteed for the updates described above. For the general form with upper bounds, we can post-process as noted above to ensure U2.
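The explicit NNLS coordinate update just described can be sketched as follows. This is an illustration, not the paper's exact pseudocode: \(j\) maximizes \(z(k)\), and the step length \(z(j)/\Vert A^j\Vert^2\) follows from minimizing the objective over the single coordinate \(j\) (as used in the proof of Lemma 4.6 below).

```python
import numpy as np

def nnls_coordinate_step(A, b, x):
    """Coordinate update (10) for NNLS at a stable point, in the style of
    the Lawson-Hanson algorithm.  Illustrative sketch."""
    g = A.T @ (A @ x - b)
    z = np.maximum(-g, 0.0)          # at a stable point, z(i) = 0 or -g(i)
    if not z.any():
        return x                     # z = 0: x is optimal
    j = int(np.argmax(z))            # j in I_0(x) with g(j) = -z(j) < 0
    y = x.copy()
    y[j] += z[j] / np.linalg.norm(A[:, j]) ** 2   # 1-D optimal step
    return y
```

One-dimensional minimization shows the objective decreases by \(z(j)^2/(2\Vert A^j\Vert^2)\) per step.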
Lemma 3.4
The Frank–Wolfe, projected gradient, and coordinate update rules all satisfy U and U2.
Cycle-free update rules
Definition 3.5
We say that \(\texttt {Update}(x)\) is a cycle-free update rule, if for every \(x\in {\textbf{B}}(u)\) and \(y=\texttt {Update}(x)\), \(x-y\) is cycle-free w.r.t. A.
Lemma 3.6
The Frank–Wolfe, projected gradient, and coordinate updates are all cycle-free.
Proof
Each of the three rules has the property that for any \(x\in {\textbf{B}}(u)\) with gradient g and \(y=\texttt {Update}(x)\), the difference \(y-x\) conforms to \(-g\). We show that this implies the required property.
For a contradiction, assume that a generalized path-cycle decomposition of \(y-x\) contains an inner vector h. Thus, \(h\ne \textbf{0}\), \(Ah=\textbf{0}\), and h conforms to \(-g\). Consequently, \(\left\langle g,h\right\rangle < 0\). Recalling the form of g from (2), we get
a contradiction. \(\square \)
4 Analysis
Our main goal is to show the following convergence bound. The proof will be given in Sect. 4.3. Recall that in an NNLS instance, all upper capacities are infinite.
Theorem 4.1
Consider an NNLS instance of P, and assume we use a cycle-free centroid mapping. Algorithm 1 terminates with an optimal solution in \(O(n\cdot m^2\cdot \kappa ^2({\mathcal {X}}_A)\cdot \Vert A\Vert ^2\cdot \log (n+\kappa ({\mathcal {X}}_A)))\) major cycles using projected gradient updates (9), and in \(O(n^{2}m^2\cdot \kappa ^2({\mathcal {X}}_A)\cdot \Vert A\Vert ^2\cdot \log (n+\kappa ({\mathcal {X}}_A)))\) major cycles using coordinate updates (10), when initialized with \(x=\textbf{0}\). In both cases, the total number of minor cycles is \(O(n^{2}m^2\cdot \kappa ^2({\mathcal {X}}_A)\cdot \Vert A\Vert ^2\cdot \log (n+\kappa ({\mathcal {X}}_A)))\).
4.1 Proximity bounds
We show that, when using a cycle-free update rule and a cycle-free centroid mapping, the movement of the iterates in Algorithm 1 can be bounded by the change in the objective value. First, a useful property of the centroid set is that the movement of Ax relates directly to the decrease in the objective value. Namely,
Lemma 4.2
For \(x\in {\textbf{B}}(u)\), let \(y\in {\mathcal {C}}(I_0(x),I_1(x))\). Then,
Consequently, if \(\Psi \) is a cycle-free centroid mapping and \(y=\Psi (x)\), then
Proof
Let \(J:=J(x)\). Since \(Ax-b=(Ax-Ay)+(Ay-b)\), the claim is equivalent to showing that
Noting that \(Ax-Ay=A^Jx_J-A^J y_J\), we can write
where the equality follows since \((A^J)^\top ( Ay-b)=\textbf{0}\) by Proposition 2.10. The second part follows from Lemma 2.3. \(\square \)
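The identity in Lemma 4.2 can be checked numerically. The sketch below constructs a point \(y\) agreeing with \(x\) off \(J\) while re-optimizing freely over \(J\), so that \((A^J)^\top(Ay-b)=\textbf{0}\); the objective decrease then equals \(\tfrac12\Vert Ax-Ay\Vert^2\). The choice \(J=\operatorname{supp}(x)\) is an assumption made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 7))
b = rng.standard_normal(4)
x = np.abs(rng.standard_normal(7))
x[1] = 0.0
x[4] = 0.0
J = x > 0                            # illustration: take J = supp(x)

# y: keep x outside J, re-optimize freely over J, so (A^J)^T (Ay - b) = 0
y = x.copy()
rhs = b - A[:, ~J] @ x[~J]
y[J], *_ = np.linalg.lstsq(A[:, J], rhs, rcond=None)

f = lambda v: 0.5 * np.linalg.norm(A @ v - b) ** 2
# Pythagorean identity of Lemma 4.2: f(x) - f(y) = 0.5*||Ax - Ay||^2
assert np.isclose(f(x) - f(y), 0.5 * np.linalg.norm(A @ (x - y)) ** 2)
```

The assertion holds because \(Ax-Ay\) lies in the column span of \(A^J\), which is orthogonal to the residual \(Ay-b\).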
Next, let us consider the movement of x during a call to \(\texttt {Update}(x)\).
Lemma 4.3
Let \(x\in {\textbf{B}}(u)\) and \(y=\texttt {Update}(x)\). Then,
If using a cycle-free update rule, we also have
Proof
From property U2, it is immediate to see that \(\left\langle Ay-b,Ax-Ay\right\rangle \ge 0\). This implies the first claim. The second claim follows from the definition of a cycle-free update rule and Lemma 2.3. \(\square \)
Lemma 4.4
Let \(x\in {\textbf{B}}(u)\), and let \(x'\) be an iterate obtained by consecutive t major or minor updates of Algorithm 1 using a cycle-free update rule and a cycle-free centroid mapping, starting from x. Then,
Proof
Let us consider the (major and minor cycle) iterates \(x=x^{(k)},x^{(k+1)},\ldots ,x^{(k+t)}=x'\). From the triangle inequality, and the arithmetic-quadratic means inequality,
The statement then follows using the bounds in Lemma 4.2 and Lemma 4.3. \(\square \)
4.2 Geometric convergence of the projected gradient and coordinate updates
We present a simple convergence analysis for the NNLS setting. For the general capacitated setting, similar bounds should follow from [22]. Recall that \(\eta (x)\) denotes the optimality gap at x.
Theorem 4.5
Consider an NNLS instance of P, and let \(x\ge \textbf{0}\) be a stable point. Then for \(y=\texttt {Update}(x)\) using the projected gradient update (9) we have
Using coordinate updates as in (10), we have
Consequently, either with projected gradient or with coordinate updates, after performing \(O(nm^2\cdot \kappa ^2({{\mathcal {X}}_A})\cdot \Vert A\Vert ^2)\) minor and major cycles from an iterate x, we obtain an iterate \(x'\) with \(\eta (x')\le \eta (x)/2\).
Let us formulate the update progress using optimal line search.
Lemma 4.6
For a stable point \(x\ge \textbf{0}\), the update (9) satisfies
and the update (10) satisfies
Proof
For the update (9) with stepsize \(\lambda =\Vert z\Vert ^2/\Vert Az\Vert ^2\), we have
where the third equality uses \(\langle g, z \rangle =-\Vert z\Vert ^2\) noted previously. The statement follows by using \(\Vert Az\Vert \le \Vert A\Vert \cdot \Vert z\Vert \).
The proof is similar for the update (10). Here, \(y=x+\lambda e_j\), where \(e_j\) is the jth unit vector, and \(\lambda =z(j)/\Vert A^j\Vert ^2\). The bound follows by noting that \(\langle Ax-b, Ae_j \rangle =\langle g, e_j \rangle =-z(j)\). \(\square \)
We now use Lemma 2.7 to bound \(\Vert z\Vert \).
Lemma 4.7
For a stable point \(x\ge \textbf{0}\) and the update direction \(z=z^x\) as in (8), we have
Proof
Let \(x^*\ge \textbf{0}\) be an optimal solution to P as in Lemma 2.7, and \(b^*=Ax^*\). Using convexity of \(f(x):=\tfrac{1}{2}\Vert Ax-b\Vert ^2\),
where the second inequality follows by noting that for each \(i\in N\), either \(z(i)=-g(i)\), or \(z(i)=0\) and \(g(i)(x^*(i)-x(i))\ge 0\). From the Cauchy-Schwarz inequality and Lemma 2.7, we get
that is,
The proof is complete by showing
Recalling that \(\eta (x)=\tfrac{1}{2}\Vert Ax-b\Vert ^2-\tfrac{1}{2}\Vert Ax^*-b\Vert ^2\) and that \(b^*=Ax^*\), this is equivalent to
This can be further written as
which is implied by the first order optimality condition at \(x^*\). This proves (11), and hence the lemma follows. \(\square \)
Proof (Proof of Theorem 4.5)
The bound for projected gradient updates is immediate from Lemma 4.6 and Lemma 4.7. For coordinate updates, recall that j is selected as the index of the largest component z(j). Thus, \(z(j)^2\ge \Vert z\Vert ^2/n\), and \(\Vert A^j\Vert \le \Vert A\Vert \).
For the second part, the statement follows for projected gradient updates by the first part and by noting that there are at most n minor cycles in every major cycle. For coordinate updates, every major cycle adds one component to J(x) whereas every minor cycle removes at least one. Hence, the total number of minor cycles is at most m plus the total number of major cycles. \(\square \)
4.3 Overall convergence bounds
In this subsection, we prove Theorem 4.1. Using Lemma 4.4 and Theorem 4.5, we can derive the following stronger proximity bound:
Lemma 4.8
Consider an NNLS instance of P. Let \(x\ge \textbf{0}\) be an iterate of Algorithm 1 using projected gradient or coordinate updates, and let \(x'\ge \textbf{0}\) be any later iterate. Then, for a value
we have
Proof
According to Theorem 4.5, after \(T:=O(nm^2\cdot \kappa ^2({{\mathcal {X}}_A})\cdot \Vert A\Vert ^2)\) major and minor cycles, we get to an iterate \(x''\) with \(\eta (x'')\le \eta (x)/4\). Thus, Lemma 4.4 gives
Let us now define \(x^{(k)}\) as the iterate following x after Tk major and minor cycles; we let \(x^{(0)}:=x\). By Theorem 4.5, \(\eta (x^{(k)})\le \eta (x)/4^k\), and similarly as above, for each \(k=0,1,2,\ldots \) we get
The above bound also holds for any iterate \(x'\) between \(x^{(k)}\) and \(x^{(k+1)}\). Using these bounds and the triangle inequality, for any iterate \(x'\) after x, we obtain
This completes the proof. \(\square \)
We need one more auxiliary lemma.
Lemma 4.9
Consider an NNLS instance of P, and let \(x\ge \textbf{0}\) be a stable point. Let \(\hat{x}\ge \textbf{0}\) such that for each \(i\in N\), either \(\hat{x}(i)=x(i)\), or \(\hat{x}(i)=0<x(i)\). Then,
Proof
The claim is equivalent to showing
We can write \(\langle A\hat{x}-Ax, Ax-b \rangle =\langle g^x, \hat{x}-x \rangle \). By assumption, \(\hat{x}(i)-x(i)\ne 0\) only if \(x(i)>0\), but in this case \(g^x(i)=0\) by Lemma 3.2. \(\square \)
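The orthogonality underlying Lemma 4.9 can be verified numerically. The sketch below constructs a stable point by choosing \(b\) so that the residual \(Ax-b\) is orthogonal to the columns in the support \(J\) (hence \(g^x(i)=0\) for \(x(i)>0\)); dropping a positive coordinate to 0 then changes the objective by exactly \(\tfrac12\Vert A\hat{x}-Ax\Vert^2\). All concrete dimensions and the support choice are assumptions for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 8))
J = np.arange(3)                       # support of the stable point
w = np.abs(rng.standard_normal(3)) + 0.1

# build b so that x (equal to w on J, zero elsewhere) is stable:
# the residual Ax - b must be orthogonal to the columns indexed by J
AJ = A[:, J]
P = AJ @ np.linalg.pinv(AJ)            # projector onto the column span of A^J
b = AJ @ w + (np.eye(5) - P) @ rng.standard_normal(5)

x = np.zeros(8)
x[J] = w                               # stable: g(i) = 0 whenever x(i) > 0
xhat = x.copy()
xhat[J[0]] = 0.0                       # drop one positive coordinate to 0

f = lambda v: 0.5 * np.linalg.norm(A @ v - b) ** 2
# Lemma 4.9: the objective change equals half the squared movement of Ax
assert np.isclose(f(xhat) - f(x), 0.5 * np.linalg.norm(A @ (xhat - x)) ** 2)
```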
For the threshold \(\Theta \) as in Lemma 4.8 and for any \(x\ge \textbf{0}\), let us define
The following is immediate from Lemma 4.8.
Lemma 4.10
Consider an NNLS instance of P. Let \(x\ge \textbf{0}\) be an iterate of Algorithm 1 using projected gradient updates, and \(x'\ge \textbf{0}\) be any later iterate. Then,
We are ready to prove Theorem 4.1.
Proof (Proof of Theorem 4.1)
At any point of the algorithm, let \(J^{\star }\) denote the union of the sets \(J^{\star }(x)\) for all iterations thus far. Consider a stable iterate x at the beginning of any major cycle, and let
Theorem 4.5 guarantees that within \(O(nm^{2}\cdot \kappa ^2({\mathcal {X}}_A)\cdot \Vert A\Vert ^2\cdot \log (n+\kappa ({\mathcal {X}}_A)))\) major and minor cycles we arrive at an iterate \(x'\) such that \(\sqrt{\eta (x')}<\varepsilon \). We note that \(\log (n+\kappa ({\mathcal {X}}_A)+\Vert A\Vert )=O(\log (n+\kappa ({\mathcal {X}}_A)))\) according to Remark 2.4. We show that
From here, we can conclude that \(J^{\star }\) was extended between iterates x and \(x'\). This may happen at most n times, leading to the claimed bound on the total number of major and minor cycles. Using Theorem 4.5 we also obtain the respective bounds on the number of major cycles for the two different updates.
For a contradiction, assume that (12) does not hold. Thus, for every \(i\in I_0(x)\), we have \(x'(i)\le \Theta \varepsilon \). Let us define \(\hat{x}\in {\mathbb {R}}^N\) as
By the above assumption, \(\Vert \hat{x}-x'\Vert _\infty \le \Theta \varepsilon \), and therefore \(\Vert A\hat{x}-Ax'\Vert \le \sqrt{n}\Theta \Vert A\Vert \varepsilon \). From Lemma 4.9, we can bound
Recall that since x is a stable solution,
Since \(\hat{x}\) is a feasible solution to this program, it follows that \(\Vert A\hat{x}-b\Vert ^2\ge \Vert Ax-b\Vert ^2\). We get that
in contradiction with the choice of \(\varepsilon \). \(\square \)
5 Computational experiments
We present preliminary computational experiments with different versions of our algorithm, and compare them to standard gradient methods and existing NNLS implementations. The experiments were implemented and executed in MATLAB version R2023a on a personal computer with an 11th Gen Intel(R) Core(TM) i7-11370H @ 3.30 GHz and 16 GB of memory.
We considered two families of randomly generated NNLS instances. In Appendix A, we also present experiments for capacitated instances (finite u(i) values).
We tested each combination of two update methods, projected gradient (PG) and coordinate (C), with two centroid mappings: the 'oblivious' mapping (5) and the 'local norm' mapping (6) with diagonal entries \(1/x(i),~i\in N\). Recall that for coordinate updates starting from \(x=\textbf{0}\), the centroid mapping is unique by Lemma 3.3.
Our first benchmarks are the projected gradient (PG) and the projected fast (accelerated) gradient (PFG) methods. In contrast to our algorithms, these do not finitely terminate. We stopped the algorithms once they found a near-optimal solution within a certain accuracy threshold.
Further, we also compare our algorithms against the standard MATLAB implementation of the Lawson–Hanson algorithm called lsqnonneg, and against the implementation TNT-NN from [21]. We note that lsqnonneg and the coordinate update version of our algorithms are essentially the same.
Generating instances We generated two families of experiments. In the rectangular experiments \(n\ge 2m\), and in the near-square experiments \(m\le n\le 1.1m\). In both cases, the entries of the \(m\times n\) matrix A were chosen independently uniformly at random from the interval \([-0.5,0.5]\). In the rectangular experiments, the entries of b were also chosen independently uniformly at random from \([-0.5,0.5]\). Thus, the underlying LP \(Ax=b\), \(x\ge \textbf{0}\) may or may not be feasible.
For the near-square instances, such a random choice of b leads to infeasible instances with high probability. We used this method to generate infeasible instances. We also constructed families where the LP is feasible as follows. For a sparsity parameter \(\chi \in (0,1]\), we sampled a subset \(J\subseteq N\), adding each variable independently with probability \(\chi \), and generated coefficients \(\{z_i: i\in J\}\) independently at random from [0, 1]. We then set \(b=\sum _{j\in J}A^j z_j\).
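The instance generation just described can be sketched as follows; function names and the use of numpy are our own choices, but the distributions match the text.

```python
import numpy as np

def rectangular_instance(m, n, rng):
    """Rectangular instance (n >= 2m): entries of A and b drawn
    independently, uniformly from [-0.5, 0.5]."""
    A = rng.uniform(-0.5, 0.5, size=(m, n))
    b = rng.uniform(-0.5, 0.5, size=m)
    return A, b

def near_square_feasible_instance(m, n, chi, rng):
    """Near-square feasible instance (m <= n <= 1.1m) with sparsity chi:
    b is a random nonnegative combination of a chi-fraction of the
    columns, so Ax = b, x >= 0 is feasible by construction."""
    A = rng.uniform(-0.5, 0.5, size=(m, n))
    J = rng.random(n) < chi            # include each column with prob. chi
    z = rng.uniform(0.0, 1.0, size=n) * J
    b = A @ z                          # b = sum_{j in J} A^j z_j
    return A, b
```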
Computational results We stopped each algorithm when its computation time reached 60 s. For each (m, n), we ran every algorithm 5 times; the results shown are averages over the 5 runs.
Table 1 shows the overall computational times for rectangular instances; values in brackets show the number of trials whose computation time exceeded 60 s. Tables 2 and 3 show the number of major cycles, and the total number of minor cycles, respectively. Table 4 shows the overall computational times for near-square instances. The status ‘I’ denotes infeasible instances and ‘F’ feasible instances, with the sparsity parameter \(\chi \) in brackets, with values 0.1, 0.5, and 1. Tables 5 and 6 show the number of major cycles, and the total number of minor cycles, respectively, for near-square instances.
Comparison of the results For rectangular instances, the ‘local-norm’ update (6) performs significantly better than the ‘oblivious’ update (5). The ‘oblivious’ updates are also outperformed by the coordinate updates, both in terms of running time as well as in the total number of minor cycles.
As noted above, while our algorithm with coordinate updates and lsqnonneg are essentially the same, the running time of the latter is better by around a factor of two. This is likely because lsqnonneg uses more efficient linear algebra operations than our more basic implementation.
The algorithm TNT-NN from [21] is a fast practical algorithm using a number of heuristics, representing the state-of-the-art active set method for NNLS. Notably, our algorithm with 'local-norm' updates (6) is almost always within a factor of two of TNT-NN on rectangular instances, and performs better in some cases. This is despite the fact that we use only a basic implementation, without more efficient linear algebra methods or further heuristics.
For rectangular instances, TNT-NN and ‘local-norm’ updates also outperform fast projected gradient in most cases.
The picture is more mixed for near-square instances. There is a marked difference between feasible and infeasible instances. The ‘local-norm’ and ‘oblivious’ update rules perform similarly, with a small number of major cycles. The number of minor cycles is much higher for infeasible instances. For infeasible instances, coordinate updates are faster than either variant of the PG update rule, while PG updates are faster for feasible instances.
The algorithm TNT-NN is consistently faster than our algorithm, particularly on infeasible instances. For projected gradient and projected fast gradient, the running times are similar to TNT-NN, except for feasible instances with sparsity parameter \(\chi =1\), where they do not terminate within the 60 s limit in most cases. In contrast, these appear to be the easiest instances for our method with PG updates and the 'local-norm' mapping.
6 Concluding remarks
We have proposed a new ‘Update-and-Stabilize’ framework for the minimum-norm-point problem P. Our method combines classical first order methods with ‘stabilizing’ steps using the centroid mapping that amounts to computing a projection to an affine subspace. Our algorithm is always finite, and is strongly polynomial when the associated circuit imbalance measure is constant. In particular, this gives the first such convergence bound for the Lawson–Hanson algorithm.
There is scope for further improvements both in the theoretical analysis and in practical implementations. In this paper, we only analyzed the running time for uncapacitated instances. Combined with existing results from [22], we expect that similar bounds can be shown for capacitated instances. We note that for the analysis, it would suffice to run minor cycles only occasionally, say after every O(n) gradient updates. From a practical perspective, however, running minor cycles after every update appears to be highly beneficial in most cases. Rigorous computational experiments, using standard families of LP benchmarks, are left for future work.
Future work should also compare the performance of our algorithms to the gradient projection method [5, 23], transferring techniques from that method to our algorithm and vice versa. We note that for NNLS instances, starting from a stable point, our algorithm already finds the optimal gradient update. However, a similar search as in gradient projection methods may be useful in the capacitated case. In the other direction, we note that the conjugate gradient iterations used in gradient projection do not correspond to an explicit choice of a centroid mapping. A possible enhancement of gradient projection could come from approximating a 'local-norm' objective as in (6) in the second stage.
We also point out that the ‘local-norm’ selection rule (6) was inspired by the affine scaling method; the important difference is that our algorithm moves all the way to the boundary, whereas affine scaling stays in the interior throughout.
Notes
While there are minor differences in the details, these are essentially the same algorithm. Henceforth, we refer only to the Lawson–Hanson algorithm for simplicity.
The linopt rule corresponds to the coordinate updates in the terminology of this paper.
Note that the weights for \(i\in I_0(x)\cup I_1(x)\) do not matter, since we force \(y(i)=x(i)\) on these coordinates. The choice \(1/x(i)+1/(u(i)-x(i))\) would set \(\infty \) on these coordinates.
References
Bach, F.: Learning with submodular functions: a convex optimization perspective. Found. Trends Mach. Learn. 6(2–3), 145–373 (2013)
Björck, Å.: A direct method for sparse least squares problems with lower and upper bounds. Numer. Math. 54(1), 19–32 (1988)
Bro, R., De Jong, S.: A fast non-negativity-constrained least squares algorithm. J. Chemom. 11(5), 393–401 (1997)
Chakrabarty, D., Jain, P., Kothari, P.: Provable submodular minimization using Wolfe’s algorithm. Adv. Neural Inf. Process. Syst. 27 (2014)
Conn, A.R., Gould, N.I., Toint, P.L.: Testing a class of methods for solving minimization problems with simple bounds on the variables. Math. Comput. 50(182), 399–430 (1988)
Dadush, D., Huiberts, S., Natura, B., Végh, L.A.: A scaling-invariant algorithm for linear programming whose running time depends only on the constraint matrix. Math. Program. (2023)
Dadush, D., Natura, B., Végh, L.A.: Revisiting Tardos’s framework for linear programming: Faster exact solutions using approximate solvers. In: Proceedings of the 61st Annual IEEE Symposium on Foundations of Computer Science (FOCS), pp. 931–942 (2020)
De Loera, J.A., Haddock, J., Rademacher, L.: The minimum Euclidean-norm point in a convex polytope: Wolfe’s combinatorial algorithm is exponential. SIAM J. Comput. 49(1), 138–169 (2020)
Ekbatani, F., Natura, B., Végh, L.A.: Circuit imbalance measures and linear programming. In: Surveys in Combinatorics 2022, London Mathematical Society Lecture Note Series, pp. 64–114. Cambridge University Press (2022)
Ene, A., Vladu, A.: Improved convergence for \(\ell _1\) and \(\ell _\infty \) regression via iteratively reweighted least squares. In: International Conference on Machine Learning, pp. 1794–1801. PMLR (2019)
Fujishige, S.: Lexicographically optimal base of a polymatroid with respect to a weight vector. Math. Oper. Res. 5(2), 186–196 (1980)
Fujishige, S.: A capacity-rounding algorithm for the minimum-cost circulation problem: a dual framework of the Tardos algorithm. Math. Program. 35(3), 298–308 (1986)
Fujishige, S., Hayashi, T., Yamashita, K., Zimmermann, U.: Zonotopes and the LP-Newton method. Optim. Eng. 10(2), 193–205 (2009)
Fujishige, S., Isotani, S.: A submodular function minimization algorithm based on the minimum-norm base. Pac. J. Optim. 7(1), 3–17 (2011)
Fulkerson, D.: Networks, frames, blocking systems. Math. Decis. Sci. Part I Lect. Appl. Math. 2, 303–334 (1968)
Hoffman, A.J.: On approximate solutions of systems of linear inequalities. J. Res. Natl. Bur. Stand. 49(4), 263–265 (1952)
Lacoste-Julien, S., Jaggi, M.: On the global linear convergence of Frank–Wolfe optimization variants. Adv. Neural Inf. Process. Syst. 28 (2015)
Lawson, C.L.: Contributions to the Theory of Linear Least Maximum Approximation. PhD thesis (1961)
Lawson, C.L., Hanson, R.J.: Solving least squares problems. SIAM (1995)
Leichner, S., Dantzig, G., Davis, J.: A strictly improving linear programming phase I algorithm. Ann. Oper. Res. 46, 409–430 (1993)
Myre, J.M., Frahm, E., Lilja, D.J., Saar, M.O.: TNT-NN: a fast active set method for solving large non-negative least squares problems. Procedia Comput. Sci. 108, 755–764 (2017)
Necoara, I., Nesterov, Y., Glineur, F.: Linear convergence of first order methods for non-strongly convex optimization. Math. Program. 175(1), 69–107 (2019)
Nocedal, J., Wright, S.J.: Numerical Optimization. Springer (1999)
Orlin, J.B.: A faster strongly polynomial minimum cost flow algorithm. Oper. Res. 41(2), 338–350 (1993)
Osborne, M.R.: Finite Algorithms in Optimization and Data Analysis. Wiley (1985)
Peña, J., Vera, J.C., Zuluaga, L.F.: New characterizations of Hoffman constants for systems of linear constraints. Math. Program. 1–31 (2020)
Rockafellar, R.T.: The elementary vectors of a subspace of \(R^N\). In: Combinatorial Mathematics and Its Applications: Proceedings North Carolina Conference, Chapel Hill, 1967, pp. 104–127. The University of North Carolina Press (1969)
Stoer, J.: On the numerical solution of constrained least-squares problems. SIAM J. Numer. Anal. 8(2), 382–411 (1971)
Tardos, É.: A strongly polynomial minimum cost circulation algorithm. Combinatorica 5(3), 247–255 (1985)
Vavasis, S.A., Ye, Y.: A primal-dual interior point method whose running time depends only on the constraint matrix. Math. Program. 74(1), 79–120 (1996)
Wilhelmsen, D.R.: A nearest point algorithm for convex polyhedral cones and applications to positive linear approximation. Math. Comput. 30(133), 48–57 (1976)
Wolfe, P.: Finding the nearest point in a polytope. Math. Program. 11(1), 128–149 (1976)
Acknowledgements
We are grateful to Andreas Wächter for pointing us to the literature on the gradient projection method. The third author would like to thank Richard Cole, Daniel Dadush, Christoph Hertrich, Bento Natura, and Yixin Tao for discussions on first order methods and circuit imbalances.
Funding
SF’s research is supported by JSPS KAKENHI Grant Numbers JP19K11839 and 22K11922 and by the Research Institute for Mathematical Sciences, an International Joint Usage/Research Center located in Kyoto University. TK is supported by JSPS KAKENHI Grant Number JP19K11830. LAV’s research is supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement no. 757481–ScaleOpt).
Ethics declarations
Conflict of interests
The authors have no conflicts of interest to declare that are relevant to the content of this article.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
An extended abstract of this paper has appeared in Proceedings of the 24th Conference on Integer Programming and Combinatorial Optimization, IPCO 2023.
Appendix A: Computational experiments for capacitated instances
Tables 7, 8, 9, 10, 11, and 12 show experimental results for capacitated instances. The instances were generated as in the NNLS case, with upper capacities \(u(i)=1\), \(i\in N\). We did not use the benchmarks lsqnonneg and TNT-NN, since these are not implemented for the capacitated setting. On the other hand, we implemented our method with Frank–Wolfe updates, using both the 'local-norm' and 'oblivious' centroid mappings. Among the first order benchmarks, we included conditional gradient methods: the Frank–Wolfe and away-step Frank–Wolfe (AFW) methods.
In our framework, the Frank–Wolfe and projected gradient update rules performed similarly. In contrast, among the benchmark experiments, projected gradient methods consistently outperformed conditional gradient methods: the latter methods did not terminate within the 60 s limit for most cases.
The overall experience is similar for uncapacitated (NNLS) and capacitated instances. Our method does well for rectangular instances, but is generally slower for infeasible near-square instances.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Fujishige, S., Kitahara, T. & Végh, L.A. An update-and-stabilize framework for the minimum-norm-point problem. Math. Program. (2024). https://doi.org/10.1007/s10107-024-02077-0