Abstract
We present new results for the Frank–Wolfe method (also known as the conditional gradient method). We derive computational guarantees for arbitrary step-size sequences, which are then applied to various step-size rules, including simple averaging and constant step-sizes. We also develop step-size rules and computational guarantees that depend naturally on the warm-start quality of the initial (and subsequent) iterates. Our results include computational guarantees for both duality/bound gaps and the so-called FW gaps. Lastly, we present complexity bounds in the presence of approximate computation of gradients and/or linear optimization subproblem solutions.
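For concreteness, here is a minimal sketch of the Frank–Wolfe (conditional gradient) iteration with the standard open-loop step-size \(2/(k+2)\); the quadratic objective, simplex feasible set, and oracle names below are illustrative assumptions, not taken from the article.

```python
import numpy as np

def frank_wolfe(grad, lmo, x0, iters=500):
    """Frank-Wolfe (conditional gradient) with the open-loop step-size 2/(k+2)."""
    x = np.asarray(x0, dtype=float).copy()
    for k in range(iters):
        g = grad(x)
        v = lmo(g)               # linear optimization subproblem over the feasible set
        fw_gap = g @ (x - v)     # the "FW gap": a computable upper bound on f(x) - f*
        x += (2.0 / (k + 2.0)) * (v - x)
    return x, fw_gap

# Hypothetical instance: minimize f(x) = 0.5*||x - b||^2 over the unit simplex,
# whose linear minimization oracle returns the vertex with the smallest
# gradient coordinate.
b = np.array([0.2, 0.5, 0.3])    # b lies in the simplex, so the optimum is x* = b
grad = lambda x: x - b
lmo = lambda g: np.eye(len(g))[np.argmin(g)]

x, gap = frank_wolfe(grad, lmo, x0=[1.0, 0.0, 0.0])
```

Each iterate is a convex combination of simplex vertices, so feasibility is maintained without any projection; only the linear subproblem is ever solved.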
Additional information
R. M. Freund: This author’s research is supported by AFOSR Grant No. FA9550-11-1-0141 and the MIT-Chile-Pontificia Universidad Católica de Chile Seed Fund.
P. Grigas: This author’s research has been partially supported through NSF Graduate Research Fellowship No. 1122374 and the MIT-Chile-Pontificia Universidad Católica de Chile Seed Fund.
Appendix
Proposition 7.1
Let \(B_k^w\) and \(B_k^m\) be as defined in Sect. 2. Suppose that there exists an open set \(\hat{Q} \subseteq E\) containing \(Q\) such that \(\phi (x,\cdot )\) is differentiable on \(\hat{Q}\) for each fixed \(x \in P\), and that \(h(\cdot )\) has the minmax structure (4) on \(\hat{Q}\) and is differentiable on \(\hat{Q}\). Then it holds that:
Furthermore, it holds that \(B_k^w = B_k^m\) in the case when \(\phi (x,\cdot )\) is linear in the variable \(\lambda \).
Proof
It is simple to show that \(B_k^m \ge h^*\). At the current iterate \(\lambda _k \in Q\), define \(x_k \in \arg \min \limits _{x \in P}\phi (x, \lambda _k)\). Then from the definition of \(h(\cdot )\) and the concavity of \(\phi (x_k, \cdot )\) we have, for all \(\lambda \in Q\):

\[ h(\lambda ) \le \phi (x_k, \lambda ) \le \phi (x_k, \lambda _k) + \nabla _\lambda \phi (x_k, \lambda _k)^T(\lambda - \lambda _k) = h(\lambda _k) + \nabla _\lambda \phi (x_k, \lambda _k)^T(\lambda - \lambda _k) , \tag{55} \]

whereby \(\nabla _\lambda \phi (x_k, \lambda _k)\) is a subgradient of \(h(\cdot )\) at \(\lambda _k\). It then follows from the differentiability of \(h(\cdot )\) that \(\nabla h(\lambda _k) = \nabla _\lambda \phi (x_k, \lambda _k)\), and this implies from (55) that:
Therefore we have:
If \(\phi (x,\lambda )\) is linear in \(\lambda \), then the second inequality in (55) is an equality, as is (56).\(\square \)
Proposition 7.2
Let \(C_{h, Q}, \mathrm {Diam}_Q\), and \(L_{h,Q}\) be as defined in Sect. 2. Then it holds that \(C_{h, Q} \le L_{h,Q}(\mathrm {Diam}_Q)^2 \).
Proof
Since \(Q\) is convex, we have \(\lambda + \alpha (\tilde{\lambda }- \lambda ) \in Q\) for all \(\lambda , \tilde{\lambda } \in Q\) and for all \(\alpha \in [0,1]\). Since the gradient of \(h(\cdot )\) is Lipschitz on \(Q\) with constant \(L_{h,Q}\), from the fundamental theorem of calculus we have:

\[ \left| h(\lambda + \alpha (\tilde{\lambda } - \lambda )) - h(\lambda ) - \alpha \nabla h(\lambda )^T(\tilde{\lambda } - \lambda ) \right| \le \frac{\alpha ^2}{2} L_{h,Q} \Vert \tilde{\lambda } - \lambda \Vert ^2 \le \frac{\alpha ^2}{2} L_{h,Q} (\mathrm {Diam}_Q)^2 , \]

whereby it follows that \(C_{h, Q} \le L_{h,Q}(\mathrm {Diam}_Q)^2 \).\(\square \)
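The Lipschitz-gradient bound used in this proof is easy to check numerically. The sketch below (not from the article; the concave quadratic \(h(\lambda ) = -\tfrac{1}{2}\lambda ^T H \lambda \) and the set \(Q = [0,1]^3\) are illustrative assumptions) verifies that the per-segment curvature never exceeds \(\frac{\alpha ^2}{2} L_{h,Q} (\mathrm {Diam}_Q)^2\).

```python
import numpy as np

# Illustrative check (not from the article) of the bound
#   |h(lam + a*d) - h(lam) - a * grad_h(lam)^T d| <= (a^2/2) * L * ||d||^2
# for the concave quadratic h(lam) = -0.5 * lam^T H lam on Q = [0,1]^3,
# where L = lambda_max(H) and Diam_Q = sqrt(3).
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
H = A.T @ A
L = float(np.linalg.eigvalsh(H).max())

h = lambda lam: -0.5 * lam @ H @ lam
grad_h = lambda lam: -H @ lam

diam_sq = 3.0                    # (Diam_Q)^2 for Q = [0,1]^3
worst_ratio = 0.0
for _ in range(2000):
    lam, lam_tilde = rng.random(3), rng.random(3)
    d = lam_tilde - lam
    for a in (0.1, 0.5, 1.0):
        gap = abs(h(lam + a * d) - h(lam) - a * (grad_h(lam) @ d))
        assert gap <= 0.5 * a**2 * L * (d @ d) + 1e-9
        worst_ratio = max(worst_ratio, gap / (0.5 * a**2 * L * diam_sq))
# worst_ratio <= 1 illustrates that the curvature over Q stays below L * (Diam_Q)^2
```

For this quadratic the left-hand side equals \(\tfrac{a^2}{2} d^T H d\) exactly, so the bound is tight when \(d\) aligns with the top eigenvector of \(H\).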
Proposition 7.3
For \(k\ge 0\) the following inequality holds:
Proof
The inequality above holds with equality for \(k=0\). Proceeding by induction, suppose the inequality is true for some given \(k\ge 0\); then
Now notice that
which combined with (57) completes the induction.\(\square \)
Proposition 7.4
For \(k\ge 1\) let \(\bar{\alpha }:= 1-\frac{1}{\root k \of {k+1}}\). Then the following inequalities hold:

(i) \(\displaystyle \frac{\ln (k+1)}{k} \ge \bar{\alpha }\), and

(ii) \((k+1)\bar{\alpha }\ge 1 \).
Proof
To prove (i), define \(f(t):= 1-e^{-t}\). Since \(f(\cdot )\) is a concave function, the gradient inequality for \(f(\cdot )\) at \(t=0\) is

\[ 1 - e^{-t} \le t \quad \text {for all } t . \]

Substituting \(t=\frac{\ln (k+1)}{k} \) yields

\[ \bar{\alpha } = 1 - \frac{1}{\root k \of {k+1}} \le \frac{\ln (k+1)}{k} , \]

which is precisely (i).
Note that (ii) holds for \(k = 1\), so assume now that \(k \ge 2\). To prove (ii) for \(k \ge 2\), substitute \(t = -\frac{\ln (k+1)}{k}\) into the gradient inequality above to obtain \(-\frac{\ln (k+1)}{k} \ge 1 - (k+1)^{\frac{1}{k}}\), which can be rearranged to:

\[ (k+1)^{\frac{1}{k}} \ge \frac{k + \ln (k+1)}{k} . \tag{58} \]
Inverting (58) yields:

\[ (k+1)^{-\frac{1}{k}} \le \frac{k}{k + \ln (k+1)} . \tag{59} \]
Finally, rearranging (59) gives \(\bar{\alpha } = 1 - (k+1)^{-\frac{1}{k}} \ge \frac{\ln (k+1)}{k + \ln (k+1)}\), and multiplying by \(k+1\) and using \(\ln (k+1) \ge 1\) for \(k \ge 2\) yields (ii).\(\square \)
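Both inequalities of Proposition 7.4 are easy to sanity-check numerically; the following sketch (not part of the article) verifies them over a range of \(k\).

```python
import math

def check_prop_7_4(kmax: int) -> bool:
    """Check (i) ln(k+1)/k >= abar and (ii) (k+1)*abar >= 1 for 1 <= k <= kmax,
    where abar = 1 - (k+1)^(-1/k)."""
    for k in range(1, kmax + 1):
        abar = 1.0 - (k + 1) ** (-1.0 / k)
        if math.log(k + 1) / k < abar - 1e-12:   # inequality (i)
            return False
        if (k + 1) * abar < 1.0 - 1e-12:         # inequality (ii)
            return False
    return True

print(check_prop_7_4(10000))
```

For \(k=1\) both sides of (ii) are exactly \(1\), matching the "(ii) holds for \(k=1\)" base case in the proof.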
Proposition 7.5
For any integers \(\ell , k\) with \(2 \le \ell \le k\), the following inequalities hold:
and
Proof
(60) and (61) are specific instances of the following more general fact: if \(f(\cdot ): [1, \infty ) \rightarrow \mathbb {R}_+\) is a monotonically decreasing continuous function, then for integers \(2 \le \ell \le k\):

\[ \int _\ell ^{k+1} f(t) \, dt \; \le \; \sum _{i=\ell }^{k} f(i) \; \le \; f(\ell ) + \int _\ell ^{k} f(t) \, dt . \tag{62} \]
It is easy to verify that the integral expressions in (62) match the bounds in (60) and (61) for the specific choices of \(f(t) = \frac{1}{t}\) and \(f(t) = \frac{1}{t^2}\), respectively.\(\square \)
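The integral-test sandwich underlying this proof can be illustrated numerically; the sketch below (not from the article; the helper name and the sampled \(\ell , k\) pairs are illustrative) checks it for the two functions used here, \(f(t) = \frac{1}{t}\) and \(f(t) = \frac{1}{t^2}\).

```python
import math

def integral_test(f, F, ell: int, k: int):
    """Return (lower, s, upper) for s = sum_{i=ell}^k f(i), where F is an
    antiderivative of the decreasing function f, using the sandwich
    int_ell^{k+1} f <= s <= f(ell) + int_ell^k f."""
    s = sum(f(i) for i in range(ell, k + 1))
    lower = F(k + 1) - F(ell)
    upper = f(ell) + (F(k) - F(ell))
    return lower, s, upper

# f(t) = 1/t has antiderivative ln t; f(t) = 1/t^2 has antiderivative -1/t.
cases = [(lambda t: 1.0 / t, math.log), (lambda t: 1.0 / t**2, lambda t: -1.0 / t)]
for f, F in cases:
    for ell, k in [(2, 5), (2, 100), (10, 1000)]:
        lower, s, upper = integral_test(f, F, ell, k)
        assert lower <= s <= upper
```

Monotonicity is what makes the sandwich work: on each unit interval \([i, i+1]\) the decreasing function satisfies \(f(i+1) \le \int _i^{i+1} f(t)\,dt \le f(i)\), and summing these per-interval bounds gives the two integral expressions.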
Cite this article
Freund, R.M., Grigas, P. New analysis and results for the Frank–Wolfe method. Math. Program. 155, 199–230 (2016). https://doi.org/10.1007/s10107-014-0841-6