
New analysis and results for the Frank–Wolfe method

  • Full Length Paper
  • Mathematical Programming, Series A

Abstract

We present new results for the Frank–Wolfe method (also known as the conditional gradient method). We derive computational guarantees for arbitrary step-size sequences, which are then applied to various step-size rules, including simple averaging and constant step-sizes. We also develop step-size rules and computational guarantees that depend naturally on the warm-start quality of the initial (and subsequent) iterates. Our results include computational guarantees for both duality/bound gaps and the so-called FW gaps. Lastly, we present complexity bounds in the presence of approximate computation of gradients and/or linear optimization subproblem solutions.
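For readers who want a concrete picture of the method the abstract refers to, the following is a minimal, illustrative Python sketch of a generic Frank–Wolfe iteration with a pluggable step-size rule and the FW gap tracked at each iterate. It is not the authors' code: the 2/(k+2) default step-size, the function names, and the simplex example are assumptions made purely for illustration.

```python
import numpy as np

def frank_wolfe(grad, linear_oracle, x0, num_iters=100, step_size=None):
    """A minimal sketch of the Frank-Wolfe (conditional gradient) method.

    grad(x)          -- gradient of the objective at x
    linear_oracle(g) -- returns argmin_{s in P} <g, s> over the feasible set P
    step_size(k)     -- step-size rule; defaults to the common 2/(k+2) rule
    Also records the FW gap  g_k = <grad(x_k), x_k - s_k>, which upper-bounds
    the optimality gap for convex objectives.
    """
    if step_size is None:
        step_size = lambda k: 2.0 / (k + 2.0)
    x = np.asarray(x0, dtype=float)
    fw_gaps = []
    for k in range(num_iters):
        g = grad(x)
        s = linear_oracle(g)                 # linear optimization subproblem
        fw_gaps.append(float(g @ (x - s)))   # FW gap at iterate k
        alpha = step_size(k)
        x = x + alpha * (s - x)              # move toward the extreme point s
    return x, fw_gaps

# Hypothetical usage: minimize ||Ax - b||^2 / 2 over the unit simplex,
# where the linear oracle returns a vertex (a coordinate unit vector).
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
grad = lambda x: A.T @ (A @ x - b)
linear_oracle = lambda g: np.eye(len(g))[np.argmin(g)]
x_final, gaps = frank_wolfe(grad, linear_oracle, x0=np.ones(5) / 5)
```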



Author information

Corresponding author

Correspondence to Paul Grigas.

Additional information

R. M. Freund: This author’s research is supported by AFOSR Grant No. FA9550-11-1-0141 and the MIT-Chile-Pontificia Universidad Católica de Chile Seed Fund.

P. Grigas: This author’s research has been partially supported through NSF Graduate Research Fellowship No. 1122374 and the MIT-Chile-Pontificia Universidad Católica de Chile Seed Fund.

Appendix

Proposition 7.1

Let \(B_k^w\) and \(B_k^m\) be as defined in Sect. 2. Suppose that there exists an open set \(\hat{Q} \subseteq E\) containing \(Q\) such that \(\phi (x,\cdot )\) is differentiable on \(\hat{Q}\) for each fixed \(x \in P\), and that \(h(\cdot )\) has the minmax structure (4) on \(\hat{Q}\) and is differentiable on \(\hat{Q}\). Then it holds that:

$$\begin{aligned} B_k^w \ge B_k^m \ge h^*. \end{aligned}$$

Furthermore, it holds that \(B_k^w = B_k^m\) in the case when \(\phi (x,\cdot )\) is linear in the variable \(\lambda \).

Proof

It is simple to show that \(B_k^m \ge h^*\). At the current iterate \(\lambda _k \in Q\), define \(x_k \in \arg \min \limits _{x \in P}\phi (x, \lambda _k)\). Then from the definition of \(h(\lambda )\) and the concavity of \(\phi (x_k, \cdot )\) we have:

$$\begin{aligned} h(\lambda )&\le \phi (x_k, \lambda ) \le \phi (x_k, \lambda _k) + \nabla _\lambda \phi (x_k, \lambda _k)^T(\lambda - \lambda _k) \nonumber \\&= h(\lambda _k) + \nabla _\lambda \phi (x_k, \lambda _k)^T(\lambda - \lambda _k), \end{aligned}$$
(55)

whereby \(\nabla _\lambda \phi (x_k, \lambda _k)\) is a subgradient of \(h(\cdot )\) at \(\lambda _k\). It then follows from the differentiability of \(h(\cdot )\) that \(\nabla h(\lambda _k) = \nabla _\lambda \phi (x_k, \lambda _k)\), and this implies from (55) that:

$$\begin{aligned} \phi (x_k, \lambda ) \le h(\lambda _k) + \nabla h(\lambda _k)^T(\lambda - \lambda _k). \end{aligned}$$
(56)

Therefore we have:

$$\begin{aligned} B_k^m = f(x_k) = \max _{\lambda \in Q}\{\phi (x_k, \lambda )\} \le \max _{\lambda \in Q}\{h(\lambda _k) + \nabla h(\lambda _k)^T(\lambda - \lambda _k)\} = B_k^w. \end{aligned}$$

If \(\phi (x,\lambda )\) is linear in \(\lambda \), then the second inequality in (55) holds with equality, and hence so does (56); maximizing both sides of (56) over \(\lambda \in Q\) then yields \(B_k^m = B_k^w\). \(\square \)
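As a small illustration of the proposition (a hypothetical instance, not taken from the paper), take \(P = Q = [0,1]\) and \(\phi (x,\lambda ) = x^2 + \lambda (1-2x)\), which is convex in \(x\) and linear in \(\lambda \). Then \(h(\lambda ) = \min _{x \in [0,1]} \phi (x,\lambda ) = \lambda - \lambda ^2\) with \(h^* = \tfrac{1}{4}\), and at \(\lambda _k = \tfrac{1}{4}\) the minimizer is \(x_k = \tfrac{1}{4}\), giving

$$\begin{aligned} B_k^m = \max _{\lambda \in [0,1]}\left\{ \tfrac{1}{16} + \tfrac{1}{2}\lambda \right\} = \tfrac{9}{16}, \qquad B_k^w = \max _{\lambda \in [0,1]}\left\{ \tfrac{3}{16} + \tfrac{1}{2}\left( \lambda - \tfrac{1}{4}\right) \right\} = \tfrac{9}{16}, \end{aligned}$$

so that \(B_k^w = B_k^m \ge h^*\), as the proposition predicts when \(\phi \) is linear in \(\lambda \).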

Proposition 7.2

Let \(C_{h, Q}, \mathrm {Diam}_Q\), and \(L_{h,Q}\) be as defined in Sect. 2. Then it holds that \(C_{h, Q} \le L_{h,Q}(\mathrm {Diam}_Q)^2 \).

Proof

Since \(Q\) is convex, we have \(\lambda + \alpha (\tilde{\lambda }- \lambda ) \in Q\) for all \(\lambda , \tilde{\lambda } \in Q\) and for all \(\alpha \in [0,1]\). Since the gradient of \(h(\cdot )\) is Lipschitz, from the fundamental theorem of calculus we have:

$$\begin{aligned} h(\lambda + \alpha (\tilde{\lambda }- \lambda ))&= h(\lambda ) + \nabla h(\lambda )^T(\alpha (\tilde{\lambda }- \lambda )) \\&\quad + \int \limits _0^1\left[ \nabla h(\lambda + t \alpha (\tilde{\lambda }- \lambda )) - \nabla h (\lambda )\right] ^T(\alpha (\tilde{\lambda }- \lambda ))\, dt \\&\ge h(\lambda ) + \nabla h(\lambda )^T(\alpha (\tilde{\lambda }- \lambda )) \\&\quad - \int \limits _0^1 \Vert \nabla h(\lambda + t \alpha (\tilde{\lambda }- \lambda )) - \nabla h (\lambda )\Vert _*\, \alpha \Vert \tilde{\lambda }- \lambda \Vert \, dt \\&\ge h(\lambda ) + \nabla h(\lambda )^T(\alpha (\tilde{\lambda }- \lambda )) - \int \limits _0^1 L_{h, Q} \Vert t \alpha (\tilde{\lambda }- \lambda )\Vert \, \alpha \Vert \tilde{\lambda }- \lambda \Vert \, dt \\&= h(\lambda ) + \nabla h(\lambda )^T(\alpha (\tilde{\lambda }- \lambda )) - \frac{\alpha ^2}{2}L_{h, Q}\Vert \tilde{\lambda }- \lambda \Vert ^2 \\&\ge h(\lambda ) + \nabla h(\lambda )^T(\alpha (\tilde{\lambda }- \lambda )) - \frac{\alpha ^2}{2}L_{h, Q}(\mathrm {Diam}_Q)^2, \end{aligned}$$

whereby it follows that \(C_{h, Q} \le L_{h,Q}(\mathrm {Diam}_Q)^2 \). \(\square \)
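As a quick sanity check (an illustration under the assumption, consistent with the proof above, that \(C_{h,Q}\) is the smallest constant for which \(h(\lambda + \alpha (\tilde{\lambda }-\lambda )) \ge h(\lambda ) + \nabla h(\lambda )^T(\alpha (\tilde{\lambda }-\lambda )) - \tfrac{\alpha ^2}{2}C_{h,Q}\) holds for all \(\lambda , \tilde{\lambda }\in Q\) and \(\alpha \in [0,1]\)), take \(h(\lambda ) = -\tfrac{1}{2}\Vert \lambda \Vert _2^2\) on \(Q = \{\lambda : \Vert \lambda \Vert _2 \le 1\}\). Then \(L_{h,Q} = 1\), \(\mathrm {Diam}_Q = 2\), and the Taylor expansion is exact:

$$\begin{aligned} h(\lambda + \alpha (\tilde{\lambda }-\lambda )) = h(\lambda ) + \nabla h(\lambda )^T(\alpha (\tilde{\lambda }-\lambda )) - \frac{\alpha ^2}{2}\Vert \tilde{\lambda }-\lambda \Vert _2^2, \end{aligned}$$

so \(C_{h,Q} = \max _{\lambda ,\tilde{\lambda }\in Q}\Vert \tilde{\lambda }-\lambda \Vert _2^2 = 4 = L_{h,Q}(\mathrm {Diam}_Q)^2\), i.e., the bound of the proposition holds with equality in this instance.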

Proposition 7.3

For \(k\ge 0\) the following inequality holds:

$$\begin{aligned} \sum _{i=0}^k \frac{i+1}{i+2} \le \frac{(k+1)(k+2)}{k+4}. \end{aligned}$$

Proof

The inequality above holds with equality for \(k=0\). Proceeding by induction, suppose it holds for some \(k\ge 0\); then

$$\begin{aligned} \sum _{i=0}^{k+1} \frac{i+1}{i+2}&= \sum _{i=0}^{k} \frac{i+1}{i+2} + \frac{k+2}{k+3} \nonumber \\&\le \frac{(k+1)(k+2)}{k+4} + \frac{k+2}{k+3} \nonumber \\&= (k+2)\left[ \frac{k^2+5k+7}{k^2 + 7k +12}\right] . \end{aligned}$$
(57)

Now notice that

$$\begin{aligned} (k^2 +5k+7)(k+5) = k^3 + 10k^2 + 32k + 35 < k^3 + 10k^2 + 33k + 36 = (k^2 + 7k +12)(k+3), \end{aligned}$$

which shows that \(\frac{k^2+5k+7}{k^2+7k+12} < \frac{k+3}{k+5}\); combining this with (57) gives \(\sum _{i=0}^{k+1} \frac{i+1}{i+2} \le \frac{(k+2)(k+3)}{k+5}\), which completes the induction.\(\square \)
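For instance, at \(k=2\) the inequality reads

$$\begin{aligned} \frac{1}{2} + \frac{2}{3} + \frac{3}{4} = \frac{23}{12} \approx 1.917 \le \frac{3 \cdot 4}{6} = 2, \end{aligned}$$

and at \(k=0\) it holds with equality (\(\tfrac{1}{2} = \tfrac{1 \cdot 2}{4}\)), as noted in the proof.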

Proposition 7.4

For \(k\ge 1\) let \(\bar{\alpha }:= 1-\frac{1}{\root k \of {k+1}}\). Then the following inequalities hold:

  1. (i)

    \(\displaystyle \frac{\ln (k+1)}{k} \ge \bar{\alpha }\), and

  2. (ii)

    \((k+1)\bar{\alpha }\ge 1 \).

Proof

To prove (i), define \(f(t):= 1-e^{-t}\) and note that \(f(\cdot )\) is a concave function; the gradient inequality for \(f(\cdot )\) at \(t=0\) is

$$\begin{aligned} t \ge 1-e^{-t}. \end{aligned}$$

Substituting \(t=\frac{\ln (k+1)}{k} \) yields

$$\begin{aligned} \frac{\ln (k+1)}{k} = t \ge 1-e^{-t} = 1 - e^{-\frac{\ln (k+1)}{k}} = 1- \frac{1}{\root k \of {k+1}} = \bar{\alpha }. \end{aligned}$$

Note that (ii) holds for \(k = 1\), so assume now that \(k \ge 2\), in which case \(\ln (k+1) \ge \ln (3) > \ln (e) = 1\). To prove (ii) for \(k \ge 2\), substitute \(t = -\frac{\ln (k+1)}{k}\) into the gradient inequality above to obtain \(-\frac{\ln (k+1)}{k} \ge 1 - (k+1)^{\frac{1}{k}}\), which can be rearranged to:

$$\begin{aligned} (k+1)^{\frac{1}{k}} \ge 1 + \frac{\ln (k+1)}{k} \ge 1 + \frac{\ln (e)}{k} = 1 + \frac{1}{k} = \frac{k+1}{k}. \end{aligned}$$
(58)

Inverting (58) yields:

$$\begin{aligned} (k+1)^{-\frac{1}{k}} \le \frac{k}{k+1} = 1 - \frac{1}{k+1}. \end{aligned}$$
(59)

Finally, rearranging (59) and multiplying by \(k+1\) yields (ii).\(\square \)
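As a numerical illustration, at \(k=3\) we have \(\bar{\alpha } = 1 - (k+1)^{-1/k} = 1 - 4^{-1/3} \approx 0.370\), and both parts check out:

$$\begin{aligned} \frac{\ln (4)}{3} \approx 0.462 \ge 0.370 \approx \bar{\alpha }, \qquad (k+1)\bar{\alpha } \approx 4 \times 0.370 = 1.480 \ge 1. \end{aligned}$$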

Proposition 7.5

For any integers \(\ell , k\) with \(2 \le \ell \le k\), the following inequalities hold:

$$\begin{aligned} \ln \left( \frac{k+1}{\ell }\right) \le \sum _{i = \ell }^k\frac{1}{i} \le \ln \left( \frac{k}{\ell - 1}\right) , \end{aligned}$$
(60)

and

$$\begin{aligned} \frac{k - \ell + 1}{(k+1)\ell } \le \sum _{i = \ell }^k\frac{1}{i^2} \le \frac{k - \ell + 1}{k(\ell -1)}. \end{aligned}$$
(61)

Proof

(60) and (61) are specific instances of the following more general fact: if \(f(\cdot ): [1, \infty ) \rightarrow \mathbb {R}_+\) is a monotonically decreasing continuous function, then

$$\begin{aligned} \int _{\ell }^{k+1}f(t)dt \le \sum _{i = \ell }^kf(i) \le \int _{\ell - 1}^kf(t)dt. \end{aligned}$$
(62)

It is easy to verify that the integral expressions in (62) match the bounds in (60) and (61) for the specific choices of \(f(t) = \frac{1}{t}\) and \(f(t) = \frac{1}{t^2}\), respectively.\(\square \)
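For completeness, the integral computations behind that verification are:

$$\begin{aligned} \int _{\ell }^{k+1}\frac{dt}{t} = \ln \left( \frac{k+1}{\ell }\right) , \qquad \int _{\ell -1}^{k}\frac{dt}{t} = \ln \left( \frac{k}{\ell -1}\right) , \end{aligned}$$

and

$$\begin{aligned} \int _{\ell }^{k+1}\frac{dt}{t^2} = \frac{1}{\ell } - \frac{1}{k+1} = \frac{k-\ell +1}{(k+1)\ell }, \qquad \int _{\ell -1}^{k}\frac{dt}{t^2} = \frac{1}{\ell -1} - \frac{1}{k} = \frac{k-\ell +1}{k(\ell -1)}, \end{aligned}$$

which are exactly the outer bounds in (60) and (61).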


About this article


Cite this article

Freund, R.M., Grigas, P. New analysis and results for the Frank–Wolfe method. Math. Program. 155, 199–230 (2016). https://doi.org/10.1007/s10107-014-0841-6

