## Abstract

We consider the problem of minimizing the sum of a linear function and a composition of a strongly convex function with a linear transformation over a compact polyhedral set. Jaggi and Lacoste-Julien (An affine invariant linear convergence analysis for Frank-Wolfe algorithms. NIPS 2013 Workshop on Greedy Algorithms, Frank-Wolfe and Friends, 2014) show that the conditional gradient method with away steps, employed on the aforementioned problem without the additional linear term, has a linear rate of convergence that depends on the so-called pyramidal width of the feasible set. We revisit this result and provide a variant of the algorithm and an analysis based on simple linear programming duality arguments, as well as corresponding error bounds. This new analysis (a) enables the incorporation of the additional linear term, and (b) depends on a new constant that is explicitly expressed in terms of the problem’s parameters and the geometry of the feasible set. This constant replaces the pyramidal width, which is difficult to evaluate.

## References

1. Beck, A., Teboulle, M.: A conditional gradient method with linear rate of convergence for solving convex linear systems. Math. Methods of Oper. Res. **59**(2), 235–247 (2004)
2. Beck, A., Teboulle, M.: Gradient-based algorithms with applications to signal recovery problems. In: Palomar, D., Eldar, Y. (eds.) Convex Optimization in Signal Processing and Communications, pp. 139–162. Cambridge University Press, Cambridge (2009)
3. Bertsekas, D.P.: Nonlinear Programming, 2nd edn. Athena Scientific, Belmont, MA, USA (1999)
4. Bertsimas, D., Tsitsiklis, J.N.: Introduction to Linear Optimization, vol. 6. Athena Scientific, Belmont (1997)
5. Canon, M.D., Cullum, C.D.: A tight upper bound on the rate of convergence of Frank-Wolfe algorithm. SIAM J. Control **6**(4), 509–516 (1968)
6. Dunn, J., Harshbarger, S.: Conditional gradient algorithms with open loop step size rules. J. Math. Anal. Appl. **62**(2), 432–444 (1978)
7. Epelman, M., Freund, R.M.: Condition number complexity of an elementary algorithm for computing a reliable solution of a conic linear system. Math. Program. **88**(3), 451–485 (2000)
8. Frank, M., Wolfe, P.: An algorithm for quadratic programming. Nav. Res. Logist. Quart. **3**(1–2), 95–110 (1956)
9. Freund, R.M., Grigas, P., Mazumder, R.: An extended Frank-Wolfe method with “in-face” directions, and its application to low-rank matrix completion. arXiv preprint arXiv:1511.02204 (2015)
10. Garber, D., Hazan, E.: A linearly convergent conditional gradient algorithm with applications to online and stochastic optimization. arXiv preprint arXiv:1301.4666 (2013)
11. Goldfarb, D., Todd, M.J.: Chapter II: Linear programming. In: Nemhauser, G., Kan, A.R., Todd, M. (eds.) Optimization, volume 1 of Handbooks in Operations Research and Management Science, pp. 73–170. Elsevier, Amsterdam (1989)
12. Guelat, J., Marcotte, P.: Some comments on Wolfe’s away step. Math. Program. **35**(1), 110–119 (1986)
13. Güler, O.: Foundations of Optimization. Graduate Texts in Mathematics, vol. 258. Springer, New York (2010)
14. Hoffman, A.J.: On approximate solutions of systems of linear inequalities. J. Res. Natl. Bur. Stand. **49**(4), 263–265 (1952)
15. Jaggi, M.: Sparse Convex Optimization Methods for Machine Learning. Ph.D. thesis, ETH Zurich (2011)
16. Lacoste-Julien, S., Jaggi, M.: An affine invariant linear convergence analysis for Frank-Wolfe algorithms. In: NIPS 2013 Workshop on Greedy Algorithms, Frank-Wolfe and Friends (2014)
17. Lacoste-Julien, S., Jaggi, M.: On the global linear convergence of Frank-Wolfe optimization variants. In: Advances in Neural Information Processing Systems, pp. 496–504 (2015)
18. Levitin, E., Polyak, B.T.: Constrained minimization methods. USSR Comput. Math. Math. Phys. **6**(5), 787–823 (1966)
19. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course, vol. 87. Springer, Berlin (2004)
20. Pena, J., Rodriguez, D.: Polytope conditioning and linear convergence of the Frank-Wolfe algorithm. arXiv preprint arXiv:1512.06142 (2015)
21. Luo, Z.Q., Tseng, P.: Error bounds and convergence analysis of feasible descent methods: a general approach. Ann. Oper. Res. **46–47**(1), 157–178 (1993)
22. Rockafellar, R.T.: Convex Analysis, 2nd edn. Princeton University Press, Princeton (1970)
23. Wang, P.-W., Lin, C.-J.: Iteration complexity of feasible descent methods for convex optimization. J. Mach. Learn. Res. **15**, 1523–1548 (2014)
24. Wolfe, P.: Chapter 1: Convergence Theory in Nonlinear Programming. In: Abadie, J. (ed.) Integer and Nonlinear Programming, pp. 1–36. North-Holland Publishing Company, Amsterdam (1970)

## Acknowledgments

The research of Amir Beck was partially supported by the Israel Science Foundation grant 1821/16.

## Appendix: Incremental representation reduction using the Carathéodory theorem

In this section we show how to efficiently and incrementally implement the constructive proof of the Carathéodory theorem, as part of the VRU scheme, at each iteration of the ASCG algorithm. We note that this reduction procedure does not have to be employed; the trivial procedure, which does not change the representation, can be used instead. In that case, the upper bound on the number of extreme points in the representation is just the number of extreme points of the feasible set *X*.

The implementation described in this section allows maintaining a vertex representation set \(U^k\), with cardinality of at most \(n+1\), at a computational cost of \(O(n^2)\) operations per iteration. For this purpose, we assume that at the beginning of iteration *k*, \(\mathbf{x}^{k}\) has a representation with vertex set \(U^{k}=\left\{ {\mathbf{v}^1,\ldots ,\mathbf{v}^{L}}\right\} \subseteq V\), such that the vectors in the set are affinely independent. Moreover, we assume that at the beginning of iteration *k*, we have at our disposal two matrices \(\mathbf{T}^k\in \mathbb {R}^{n\times n}\) and \({\mathbf{W}}^k\in \mathbb {R}^{n\times (L-1)}\). We define \(\mathbf{V}^k\in \mathbb {R}^{n\times (L-1)}\) to be the matrix whose *i*th column is the vector \(\mathbf{w}^i=\mathbf{v}^{i+1}-\mathbf{v}^1\) for \(i=1, \ldots , L-1\), where \(\mathbf{v}^1\) is called the reference vertex. The matrix \(\mathbf{T}^k\) is a product of elementary matrices, which ensures that the matrix \({\mathbf{W}}^k=\mathbf{T}^k\mathbf{V}^k\) is in row echelon form. The implementation does not require storing the matrix \(\mathbf{V}^k\), and so at each iteration, only the matrices \(\mathbf{T}^k\) and \(\mathbf{W}^k\) are updated.
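The link between affine independence of the vertex set and the difference matrix \(\mathbf{V}^k\) can be illustrated with a small sketch. It is not the incremental scheme itself: it rebuilds the matrix from scratch and uses a generic rank computation in place of the maintained row echelon form, and the function names `difference_matrix` and `affinely_independent` are hypothetical.

```python
import numpy as np

def difference_matrix(vertices):
    """Build the matrix whose i-th column is w^i = v^{i+1} - v^1.

    `vertices` is a sequence of 1-D arrays; the first one plays the
    role of the reference vertex v^1.
    """
    ref = np.asarray(vertices[0], dtype=float)
    return np.column_stack([np.asarray(v, dtype=float) - ref for v in vertices[1:]])

def affinely_independent(vertices, tol=1e-10):
    """Vertices are affinely independent iff the difference matrix
    has full column rank (checked here with a generic rank routine)."""
    if len(vertices) <= 1:
        return True
    V = difference_matrix(vertices)
    return np.linalg.matrix_rank(V, tol=tol) == V.shape[1]

triangle = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0])]
square = triangle + [np.array([1.0, 1.0])]
print(affinely_independent(triangle), affinely_independent(square))
```

In \(\mathbb{R}^n\) the difference matrix has only *n* rows, so at most \(n+1\) vertices can be affinely independent, which is where the cardinality bound on \(U^k\) comes from.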

Let \(U^{k+1}\) be the vertex set and let \({\varvec{\mu }}^{k+1}\) be the coefficients vector at the end of iteration *k*, before applying the rank reduction procedure. Updating the matrices \(\mathbf{W}^{k+1}\) and \(\mathbf{T}^{k+1}\), as well as \(U^{k+1}\) and \({\varvec{\mu }}^{k+1}\), is done according to the following *Incremental Representation Reduction* (IRR) scheme, which is partially based on the proof of the Carathéodory theorem presented in [22, Sect. 17].

Notice that in order to compute the row rank of the matrix \(\mathbf{W}^{k+1}\) in step 6(d), we may simply convert the matrix to row echelon form, and then count the number of nonzero rows. This is done similarly to step 7, and requires ranking of at most one column. We will need to rerank the matrix in step 7 only if \(L>M\), and subsequently at least one column is removed in step 6(e)vi.

The IRR scheme may reduce the size of the input \(U^{k+1}\) only in the case of a forward step, since otherwise the vertices in \(U^{k+1}\) are all affinely independent. Nonetheless, the IRR scheme *must* be applied at each iteration in order to maintain the matrices \({\mathbf{W}}^k\) and \(\mathbf{T}^k\).

The efficiency of the scheme relies on the fact that only a small number of vertices are either added to or removed from the representation. The potentially computationally expensive steps are: step 5(b), replacing the reference vertex; step 6(d), finding the row rank of \(\mathbf{W}^{k+1}\); step 6(e)i, solving the system of linear equalities; step 6(e)vi, removing the columns corresponding to the vertices eliminated from the representation; and step 7, the ranking of the resulting matrix \({\mathbf{W}}^{k+1}\). Step 5(b) can be implemented without explicitly using matrix multiplication and therefore has a computational cost of \(O(n^2)\). Since \({\mathbf{W}}^k\) was in row echelon form, step 6(d) requires a row elimination procedure, similar to step 7, to be conducted only on the last column of \({\mathbf{W}}^{k+1}\), which involves at most *O*(*n*) operations and an additional \(O(n^2)\) operations for updating \(\mathbf{T}^{k+1}\). Moreover, since \({\mathbf{W}}^k\) had full column rank, the IRR scheme guarantees that the system in step 6(e)i has a unique solution \({\varvec{\lambda }}\), and since \({\mathbf{W}}^{k+1}\) is in row echelon form, it can be found in \(O(n^2)\) operations. Moreover, in step 6(e)i, the specific choice of \(\alpha \) ensures that the reference vertex \(\mathbf{v}^1\) is not eliminated from the representation, and so there is no need to change the reference vertex at this stage. Furthermore, it is reasonable to assume that the set *I* satisfies \(|I|=O(1)\), since otherwise the vector \(\mathbf{x}^{k+1}\), produced by a forward step, can be represented by significantly fewer vertices than \(\mathbf{x}^k\), which, although possible, is numerically unlikely. Therefore, assuming that indeed \(|I|=O(1)\), the matrix \(\tilde{\mathbf{T}}\), calculated in step 7, applies a row elimination procedure to at most *O*(1) rows (one for each column removed from \(\mathbf{W}^{k+1}\)) or one column (if a column was added to \(\mathbf{W}^{k+1}\)). Conducting such an elimination on either a row or a column takes at most \(O(n^2)\) operations, which may include row switching and at most *n* row additions and multiplications. Therefore, the total computational cost of the IRR scheme amounts to \(O(n^2)\).
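For comparison, the plain (non-incremental) Carathéodory reduction that underlies the scheme can be sketched as follows. This is only an illustration under stated assumptions: it recomputes a null vector with a full SVD at every pass, so it does not achieve the \(O(n^2)\) per-iteration cost discussed above, it does not maintain \(\mathbf{T}^k\) or \(\mathbf{W}^k\), and the function name `caratheodory_reduce` and the tolerance handling are hypothetical choices.

```python
import numpy as np

def caratheodory_reduce(vertices, mu, tol=1e-10):
    """Reduce a convex combination sum_i mu_i v_i to one supported on
    affinely independent vertices (at most n+1 of them) without
    changing the represented point.

    vertices: array of shape (L, n), one vertex per row.
    mu: nonnegative weights of length L summing to one.
    """
    V = np.asarray(vertices, dtype=float)
    mu = np.asarray(mu, dtype=float).copy()
    keep = mu > tol
    V, mu = V[keep], mu[keep]
    while True:
        L = V.shape[0]
        # A @ lam = 0 encodes sum_i lam_i v_i = 0 and sum_i lam_i = 0.
        A = np.vstack([V.T, np.ones(L)])
        _, s, vt = np.linalg.svd(A)
        rank = int(np.sum(s > tol * max(1.0, s[0])))
        if rank == L:
            return V, mu  # vertices are affinely independent
        lam = vt[-1]  # unit vector in the null space of A
        pos = lam > tol
        # Largest step that keeps all weights nonnegative.
        alpha = np.min(mu[pos] / lam[pos])
        mu = mu - alpha * lam
        keep = mu > tol  # at least one weight has dropped to zero
        V, mu = V[keep], mu[keep]

square = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
V_red, mu_red = caratheodory_reduce(square, np.full(4, 0.25))
print(len(mu_red), V_red.T @ mu_red)
```

Each pass removes at least one vertex, so the loop terminates after at most *L* passes. The weights still sum to one because the null vector's entries sum to zero.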

## About this article

### Cite this article

Beck, A., Shtern, S. Linearly convergent away-step conditional gradient for non-strongly convex functions. *Math. Program.* **164**, 1–27 (2017). https://doi.org/10.1007/s10107-016-1069-4

### Mathematics Subject Classification

- 90-08
- 90C25
- 65K10