Linearly convergent away-step conditional gradient for non-strongly convex functions

Abstract

We consider the problem of minimizing the sum of a linear function and a composition of a strongly convex function with a linear transformation over a compact polyhedral set. Jaggi and Lacoste-Julien (An affine invariant linear convergence analysis for Frank-Wolfe algorithms. NIPS 2013 Workshop on Greedy Algorithms, Frank-Wolfe and Friends, 2014) show that the conditional gradient method with away steps — employed on the aforementioned problem without the additional linear term — has a linear rate of convergence, depending on the so-called pyramidal width of the feasible set. We revisit this result and provide a variant of the algorithm and an analysis based on simple linear programming duality arguments, as well as corresponding error bounds. This new analysis (a) enables the incorporation of the additional linear term, and (b) depends on a new constant that is explicitly expressed in terms of the problem's parameters and the geometry of the feasible set. This constant replaces the pyramidal width, which is difficult to evaluate.


Fig. 1

Notes

  1. The paper [12] assumes that the feasible set is a bounded polyhedral set, but the proof is actually correct for general compact convex sets.

  2. This is how the algorithm is described in [16], although in [17] the authors extend this result to atom linear oracles, that is, oracles whose output lies within a predetermined finite set of points, called atoms. This set of atoms includes, but is not limited to, the set of vertices.

  3. This was done as part of the proof of [16, Lemma 6], and does not appear as a separate lemma.

References

  1. Beck, A., Teboulle, M.: A conditional gradient method with linear rate of convergence for solving convex linear systems. Math. Methods Oper. Res. 59(2), 235–247 (2004)

  2. Beck, A., Teboulle, M.: Gradient-based algorithms with applications to signal recovery problems. In: Palomar, D., Eldar, Y. (eds.) Convex Optimization in Signal Processing and Communications, pp. 139–162. Cambridge University Press, Cambridge (2009)

  3. Bertsekas, D.P.: Nonlinear Programming, 2nd edn. Athena Scientific, Belmont (1999)

  4. Bertsimas, D., Tsitsiklis, J.N.: Introduction to Linear Optimization. Athena Scientific, Belmont (1997)

  5. Canon, M.D., Cullum, C.D.: A tight upper bound on the rate of convergence of Frank-Wolfe algorithm. SIAM J. Control 6(4), 509–516 (1968)

  6. Dunn, J., Harshbarger, S.: Conditional gradient algorithms with open loop step size rules. J. Math. Anal. Appl. 62(2), 432–444 (1978)

  7. Epelman, M., Freund, R.M.: Condition number complexity of an elementary algorithm for computing a reliable solution of a conic linear system. Math. Program. 88(3), 451–485 (2000)

  8. Frank, M., Wolfe, P.: An algorithm for quadratic programming. Nav. Res. Logist. Quart. 3(1–2), 95–110 (1956)

  9. Freund, R.M., Grigas, P., Mazumder, R.: An extended Frank-Wolfe method with "in-face" directions, and its application to low-rank matrix completion. arXiv preprint arXiv:1511.02204 (2015)

  10. Garber, D., Hazan, E.: A linearly convergent conditional gradient algorithm with applications to online and stochastic optimization. arXiv preprint arXiv:1301.4666 (2013)

  11. Goldfarb, D., Todd, M.J.: Chapter II: Linear programming. In: Nemhauser, G., Kan, A.R., Todd, M. (eds.) Optimization. Handbooks in Operations Research and Management Science, vol. 1, pp. 73–170. Elsevier, Amsterdam (1989)

  12. Guelat, J., Marcotte, P.: Some comments on Wolfe's away step. Math. Program. 35(1), 110–119 (1986)

  13. Güler, O.: Foundations of Optimization. Graduate Texts in Mathematics, vol. 258. Springer, New York (2010)

  14. Hoffman, A.J.: On approximate solutions of systems of linear inequalities. J. Res. Natl. Bur. Stand. 49(4), 263–265 (1952)

  15. Jaggi, M.: Sparse Convex Optimization Methods for Machine Learning. Ph.D. thesis, ETH Zurich (2011)

  16. Lacoste-Julien, S., Jaggi, M.: An affine invariant linear convergence analysis for Frank-Wolfe algorithms. In: NIPS 2013 Workshop on Greedy Algorithms, Frank-Wolfe and Friends (2014)

  17. Lacoste-Julien, S., Jaggi, M.: On the global linear convergence of Frank-Wolfe optimization variants. In: Advances in Neural Information Processing Systems, pp. 496–504 (2015)

  18. Levitin, E., Polyak, B.T.: Constrained minimization methods. USSR Comput. Math. Math. Phys. 6(5), 787–823 (1966)

  19. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course, vol. 87. Springer, Berlin (2004)

  20. Pena, J., Rodriguez, D.: Polytope conditioning and linear convergence of the Frank-Wolfe algorithm. arXiv preprint arXiv:1512.06142 (2015)

  21. Luo, Z.Q., Tseng, P.: Error bounds and convergence analysis of feasible descent methods: a general approach. Ann. Oper. Res. 46–47(1), 157–178 (1993)

  22. Rockafellar, R.T.: Convex Analysis. Princeton University Press, Princeton (1970)

  23. Wang, P.-W., Lin, C.-J.: Iteration complexity of feasible descent methods for convex optimization. J. Mach. Learn. Res. 15, 1523–1548 (2014)

  24. Wolfe, P.: Chapter 1: Convergence theory in nonlinear programming. In: Abadie, J. (ed.) Integer and Nonlinear Programming, pp. 1–36. North-Holland, Amsterdam (1970)


Acknowledgments

The research of Amir Beck was partially supported by the Israel Science Foundation grant 1821/16.

Author information

Corresponding author

Correspondence to Amir Beck.

Appendix: Incremental representation reduction using the Carathéodory theorem


In this section we show how to efficiently and incrementally implement the constructive proof of the Carathéodory theorem, as part of the VRU scheme, at each iteration of the ASCG algorithm. We note that this reduction procedure does not have to be employed; instead, the trivial procedure, which leaves the representation unchanged, can be used. In that case, the upper bound on the number of extreme points in the representation is simply the number of extreme points of the feasible set X.

The implementation described in this section allows maintaining a vertex representation set \(U^k\) with cardinality at most \(n+1\), at a computational cost of \(O(n^2)\) operations per iteration. For this purpose, we assume that at the beginning of iteration k, \(\mathbf{x}^{k}\) has a representation with vertex set \(U^{k}=\left\{ {\mathbf{v}^1,\ldots ,\mathbf{v}^{L}}\right\} \subseteq V\) such that the vectors in the set are affinely independent. Moreover, we assume that at the beginning of iteration k we have at our disposal two matrices \(\mathbf{T}^k\in \mathbb {R}^{n\times n}\) and \({\mathbf{W}}^k\in \mathbb {R}^{n\times (L-1)}\). We define \(\mathbf{V}^k\in \mathbb {R}^{n\times (L-1)}\) to be the matrix whose ith column is the vector \(\mathbf{w}^i=\mathbf{v}^{i+1}-\mathbf{v}^1\) for \(i=1, \ldots , L-1\), where \(\mathbf{v}^1\) is called the reference vertex. The matrix \(\mathbf{T}^k\) is a product of elementary matrices, which ensures that the matrix \({\mathbf{W}}^k=\mathbf{T}^k\mathbf{V}^k\) is in row echelon form. The implementation does not require saving the matrix \(\mathbf{V}^k\), and so at each iteration only the matrices \(\mathbf{T}^k\) and \(\mathbf{W}^k\) are updated.
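To make the bookkeeping concrete, the factorization \({\mathbf{W}}^k=\mathbf{T}^k\mathbf{V}^k\) can be sketched as follows. This is an illustrative pure-Python sketch, not the paper's implementation: the function names and the numerical tolerance are our own choices, and, for clarity, the incremental \(O(n^2)\) update described above is replaced by a from-scratch reduction.

```python
def difference_matrix(vertices):
    """Build V: column i is v^{i+1} - v^1, with v^1 the reference vertex."""
    ref = vertices[0]
    cols = [[v[j] - ref[j] for j in range(len(ref))] for v in vertices[1:]]
    # transpose the list of columns into an n x (L-1) row-major matrix
    return [list(row) for row in zip(*cols)] if cols else []

def row_echelon_with_transform(V):
    """Reduce the n x m matrix V to row echelon form W and return (T, W),
    where T is the n x n product of the elementary row operations used,
    so that W = T V.  The affine rank deficit of the vertex set can then
    be read off from the number of zero rows of W."""
    n = len(V)
    m = len(V[0]) if n else 0
    W = [row[:] for row in V]
    T = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    pivot_row = 0
    for col in range(m):
        # find a pivot entry in this column at or below pivot_row
        pr = next((r for r in range(pivot_row, n)
                   if abs(W[r][col]) > 1e-12), None)
        if pr is None:
            continue
        W[pivot_row], W[pr] = W[pr], W[pivot_row]   # row switch
        T[pivot_row], T[pr] = T[pr], T[pivot_row]
        for r in range(pivot_row + 1, n):           # eliminate below pivot
            f = W[r][col] / W[pivot_row][col]
            W[r] = [a - f * b for a, b in zip(W[r], W[pivot_row])]
            T[r] = [a - f * b for a, b in zip(T[r], T[pivot_row])]
        pivot_row += 1
    return T, W
```

The vertices \(\mathbf{v}^2,\ldots,\mathbf{v}^L\) are affinely independent of \(\mathbf{v}^1\) exactly when W has \(L-1\) nonzero rows; in the incremental scheme, only the newly appended last column of W needs to be eliminated at each iteration, since the remaining columns are already in echelon form.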

Let \(U^{k+1}\) be the vertex set and let \({\varvec{\mu }}^{k+1}\) be the coefficients vector at the end of iteration k, before applying the rank reduction procedure. Updating the matrices \(\mathbf{W}^{k+1}\) and \(\mathbf{T}^{k+1}\), as well as \(U^{k+1}\) and \({\varvec{\mu }}^{k+1}\), is done according to the following Incremental Representation Reduction scheme, which is partially based on the proof of the Carathéodory theorem presented in [22, Sect. 17].

[Algorithm figures: the Incremental Representation Reduction (IRR) scheme; omitted]

Notice that in order to compute the row rank of the matrix \(\mathbf{W}^{k+1}\) in step 6(d), we may simply convert the matrix to row echelon form and then count the number of nonzero rows. This is done similarly to step 7, and requires the ranking of at most one column. The matrix needs to be reranked in step 7 only if \(L>M\) and, consequently, at least one column is removed in step 6(e)vi.

The IRR scheme may reduce the size of the input \(U^{k+1}\) only in the case of a forward step, since otherwise the vertices in \(U^{k+1}\) are all affinely independent. Nonetheless, the IRR scheme must be applied at each iteration in order to maintain the matrices \({\mathbf{W}}^k\) and \(\mathbf{T}^k\).

The efficiency of the scheme relies on the fact that only a small number of vertices are either added to or removed from the representation at each iteration. The potentially computationally expensive steps are: step 5(b), replacing the reference vertex; step 6(d), finding the row rank of \(\mathbf{W}^{k+1}\); step 6(e)i, solving the system of linear equalities; step 6(e)vi, removing the columns corresponding to the vertices eliminated from the representation; and step 7, the ranking of the resulting matrix \({\mathbf{W}}^{k+1}\). Step 5(b) can be implemented without explicit matrix multiplication and therefore has a computational cost of \(O(n^2)\). Since \({\mathbf{W}}^k\) was in row echelon form, step 6(d) requires a row elimination procedure, similar to step 7, to be conducted only on the last column of \({\mathbf{W}}^{k+1}\), which involves at most O(n) operations and an additional \(O(n^2)\) operations for updating \(\mathbf{T}^{k+1}\). Moreover, since \({\mathbf{W}}^k\) had full column rank, the IRR scheme guarantees that the system in step 6(e)i has a unique solution \({\varvec{\lambda }}\), and since \({\mathbf{W}}^{k+1}\) is in row echelon form, this solution can be found in \(O(n^2)\) operations. In step 6(e)i, the specific choice of \(\alpha \) ensures that the reference vertex \(\mathbf{v}^1\) is not eliminated from the representation, so there is no need to change the reference vertex at this stage. Furthermore, it is reasonable to assume that the set I satisfies \(|I|=O(1)\), since otherwise the vector \(\mathbf{x}^{k+1}\), produced by a forward step, could be represented by significantly fewer vertices than \(\mathbf{x}^k\), which, although possible, is numerically unlikely. Therefore, assuming that indeed \(|I|=O(1)\), the matrix \(\tilde{\mathbf{T}}\) calculated in step 7 applies a row elimination procedure to at most O(1) rows (one for each column removed from \(\mathbf{W}^{k+1}\)) or one column (if a column was added to \(\mathbf{W}^{k+1}\)). Conducting such an elimination on either a row or a column takes at most \(O(n^2)\) operations, which may include row switching and at most n row additions and multiplications. Therefore, the total computational cost of the IRR scheme amounts to \(O(n^2)\).
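As a toy illustration of the elimination step that underlies the scheme (the constructive Carathéodory argument, not the incremental IRR scheme itself), the following sketch assumes that an affine-dependence vector \({\varvec{\gamma }}\) is already available, e.g., obtained from a null-space computation as in step 6(e)i; the function name and the tolerance are hypothetical.

```python
def caratheodory_step(vertices, mu, gamma):
    """One elimination step of the constructive Caratheodory argument.

    Given a convex representation x = sum_i mu_i * v^i and an affine
    dependence gamma (sum_i gamma_i * v^i = 0 componentwise and
    sum_i gamma_i = 0, gamma not all zero, with at least one positive
    entry; otherwise use -gamma), shift mu along gamma until some
    coefficient hits zero, then drop the eliminated vertices.
    The represented point x and the property sum(mu) = 1 are preserved.
    """
    # largest step size keeping every coefficient nonnegative
    alpha = min(m / g for m, g in zip(mu, gamma) if g > 1e-12)
    new_mu = [m - alpha * g for m, g in zip(mu, gamma)]
    kept = [i for i, m in enumerate(new_mu) if m > 1e-12]
    return [vertices[i] for i in kept], [new_mu[i] for i in kept]
```

For example, the redundant representation \(\mathbf{x} = 0.5\,(0,0) + 0.25\,(2,0) + 0.25\,(1,0)\) with dependence \({\varvec{\gamma }} = (0.5, 0.5, -1)\) is reduced to the two-vertex representation \(\mathbf{x} = 0.25\,(0,0) + 0.75\,(1,0)\); repeating the step while a dependence exists yields at most \(n+1\) affinely independent vertices.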


Cite this article

Beck, A., Shtern, S. Linearly convergent away-step conditional gradient for non-strongly convex functions. Math. Program. 164, 1–27 (2017). https://doi.org/10.1007/s10107-016-1069-4


Mathematics Subject Classification

  • 90-08
  • 90C25
  • 65K10