
Conditional gradient algorithms for norm-regularized smooth convex optimization


Abstract

Motivated by some applications in signal processing and machine learning, we consider two convex optimization problems where, given a cone \(K\), a norm \(\Vert \cdot \Vert \) and a smooth convex function \(f\), we want either (1) to minimize the norm over the intersection of the cone and a level set of \(f\), or (2) to minimize over the cone the sum of \(f\) and a multiple of the norm. We focus on the case where (a) the dimension of the problem is too large to allow for interior point algorithms, and (b) \(\Vert \cdot \Vert \) is “too complicated” to allow for the computationally cheap Bregman projections required by first-order proximal gradient algorithms. On the other hand, we assume that it is relatively easy to minimize linear forms over the intersection of \(K\) and the unit \(\Vert \cdot \Vert \)-ball. Motivating examples are given by the nuclear norm with \(K\) being the entire space of matrices, or the positive semidefinite cone in the space of symmetric matrices, and the Total Variation norm on the space of 2D images. We discuss versions of the Conditional Gradient algorithm capable of handling our problems of interest, provide the related theoretical efficiency estimates and outline some applications.


Notes

  1. i.e., the Lipschitz constant of \(f\) w.r.t. \(\Vert \cdot \Vert \) in the nonsmooth case, or the Lipschitz constant of the gradient mapping \(x\mapsto f'(x)\) w.r.t. the norm \(\Vert \cdot \Vert \) on the argument and the conjugate of this norm on the image spaces in the smooth case.

  2. The penalized minimization problem, with a similar type of algorithm, was considered in [38] independently of, and simultaneously with, [12].

  3. Recall that the dual, a.k.a. conjugate, to a norm \(\Vert \cdot \Vert \) on a Euclidean space \(E\) is the norm on \(E\) defined as \(\Vert \xi \Vert _*=\max _{x\in E,\Vert x\Vert \le 1}\langle \xi ,x\rangle \), where \(\langle \cdot ,\cdot \rangle \) is the inner product on \(E\).

  4. Note that, being a special case of (4), (12) provides a description of the regularity properties of \(f\) in terms which are “invariant with respect to the geometry of \(X\)”.

  5. Note that in the context of the “classical” Frank–Wolfe algorithm (minimization of a smooth function over a polyhedral set), such a modification is referred to as Restricted Simplicial Decomposition [15, 16, 36].

  6. In fact, \(x^b_t\) may be substituted for \(x^+_t\) in the recurrence (15), and the resulting approximate solutions \(x_t\) along with the lower bounds \(f_*^t\ge \max _{1\le \tau \le t} f^b_\tau (x^b_\tau )\) will still satisfy the bound (18) of Theorem 1. Indeed, the analysis of the proof of the theorem reveals that a point \(\xi \in X\) can be substituted for \(x_t^+=x_X[f'(x_t)]\) as soon as the inequality

    $$\begin{aligned} \langle f'(x_t),x_t-\xi \rangle \ge f(x_t)-f^t_{*}, \end{aligned}$$
    (21)

    holds, where \(f^t_{*}\) is the best currently available lower bound for the optimal value \(f_*\) of (13). In other words, for the result of Theorem 1 to hold, one can substitute \(x_t^+=x_X[f'(x_t)]\) with any vector \(\xi \) satisfying (21). Now assume that \(x_t\in X^b_t\). By convexity of \(f^b_t\) (obviously, \(f'(x_t)\in \partial f^b_t(x_t)\), where \(\partial f^b_t(x)\) is the subdifferential of \(f^b_t\) at \(x\)), we have

    $$\begin{aligned} \langle f'(x_t),x_t-x^b_t\rangle \ge f(x_t)-f^b_t(x^b_t)\ge f(x_t)-f_*^t, \end{aligned}$$

    which is exactly (21) with \(\xi =x^b_t\).

  7. The assumption that (20) can be solved exactly is, of course, an idealization, but it is basically as “tolerable” as the standard assumption in continuous optimization that one can use exact real arithmetic or compute exactly eigenvalues/eigenvectors of symmetric matrices. The outlined “real life” considerations can be replaced with a rigorous error analysis which shows that, in order to maintain the efficiency estimates of Theorem 1, it suffices to solve the \(t\)-th auxiliary problem within a properly selected positive inaccuracy, and this can be achieved in \(O(\ln (t))\) computations of \(f\) and \(f'\).

  8. The iterates \(x_t\), like the other quantities indexed by \(t\) in the description of the algorithm, in fact depend on both \(t\) and the stage number \(s\). To avoid cumbersome notation, we suppress \(s\) when speaking about a particular stage.

  9. In this way we recover the Atom-Descent algorithm of [7, 11].

  10. This property is an immediate corollary of the fact that, in the situation in question, by the description of the algorithms \(x_t\) is a convex combination of \(t\) points of the form \(x[\cdot ]\).

  11. When \(f\) is more complicated, optimal adjustment of the mean \(t\) of the image reduces, by bisection in \(t\), to solving a small series of problems of the same structure as (5), (9) in which the mean of the image \(x\) is fixed; consequently, these problems reduce to problems with \(x\in M^n_0\) by shifting \(b\).

  12. Which one of these two options takes place depends on the type of the algorithm.

  13. On closer inspection, the “complex geometry” of the \({\hbox {TV}}\)-norm stems from the fact that after parameterizing a zero mean image by its discrete gradient field and treating this field \((g=\nabla _i x,h=\nabla _jx)\) as our new design variable, the unit ball of the \({\hbox {TV}}\)-norm becomes the intersection of a simple set in the space of pairs \((g,h)\in F={\mathbf {R}}^{(n-1)\times n}\times {\mathbf {R}}^{n\times (n-1)}\) (the \(\ell _1\) ball \(\Delta \) given by \(\Vert g\Vert _1+\Vert h\Vert _1\le 1\)) with a linear subspace \(P\) of \(F\) comprised of potential vector fields \((g,h)\), i.e., those which indeed are discrete gradient fields of images. Both the dimension and the codimension of \(P\) are of order \(n^2\), which makes it difficult to minimize over \(\Delta \cap P\) even simple nonlinear convex functions, which is exactly what is needed in proximal methods.

  14. That is, the rows of \(P\) are indexed by the nodes, the columns are indexed by the arcs, and in the column indexed by an arc \(\gamma \) there are exactly two nonzero entries: entry 1 in the row indexed by the starting node of \(\gamma \), and entry \(-1\) in the row indexed by the terminal node of \(\gamma \).

  15. From the Sobolev embedding theorem it follows that for a smooth function \(f(x,y)\) on the unit square one has \(\Vert f\Vert _{L_2}\le O(1)\Vert \nabla f\Vert _1, \Vert \nabla f\Vert _1{:=}\Vert f'_x\Vert _1+\Vert f'_y\Vert _1\), provided that \(f\) has zero mean. Denoting by \(f^n\) the restriction of the function onto an \(n\times n\) regular grid in the square, we conclude that \(\Vert f^n\Vert _2/{\hbox {TV}}(f^n)\rightarrow \Vert f\Vert _{L_2}/\Vert \nabla f\Vert _1\le O(1)\) as \(n\rightarrow \infty \). Note that the convergence in question takes place only in the 2-dimensional case.

  16. For comparison: solving on the same platform problem (36) corresponding to Experiment A (\(256\times 256\) image) by the state-of-the-art commercial interior point solver mosekopt 6.0 took as much as 3,727 sec, and this for a single value of the penalty (there is no clear way to obtain from a single run approximate solutions for a set of values of the penalty in this case).

References

  1. Andersen, E.D., Andersen, K.D.: The MOSEK optimization tools manual. http://www.mosek.com/fileadmin/products/6_0/tools/doc/pdf/tools.pdf

  2. Bach, F., Jenatton, R., Mairal, J., Obozinski, G.: Convex optimization with sparsity-inducing norms. In: Sra, S., Nowozin, S., Wright, S.J. (eds.) Optimization for Machine Learning, pp. 19–53. MIT Press, Cambridge

  3. Cai, J.-F., Candès, E.J., Shen, Z.: A singular value thresholding algorithm for matrix completion. SIAM J. Optim. 20(4), 1956–1982 (2008)


  4. Candès, E., Recht, B.: Exact matrix completion via convex optimization. Found. Comput. Math. 9(6), 717–772 (2009)


  5. Cox, B., Juditsky, A., Nemirovski, A.: Dual subgradient algorithms for large-scale nonsmooth learning problems. Math. Program. 1–38 (2013). doi:10.1007/s10107-013-0725-1

  6. Demyanov, V., Rubinov, A.: Approximate Methods in Optimization Problems. American Elsevier, Amsterdam (1970)


  7. Dudik, M., Harchaoui, Z., Malick, J.: Lifted coordinate descent for learning with trace-norm regularization. In: AISTATS (2012)

  8. Dunn, J.C., Harshbarger, S.: Conditional gradient algorithms with open loop step size rules. J. Math. Anal. Appl. 62(2), 432–444 (1978)


  9. Frank, M., Wolfe, P.: An algorithm for quadratic programming. Naval Res. Logist. Q. 3, 95–110 (1956)


  10. Goldfarb, D., Ma, S., Wen, Z.: Solving low-rank matrix completion problems efficiently. In: Proceedings of 47th Annual Allerton Conference on Communication, Control, and Computing (2009)

  11. Harchaoui, Z., Douze, M., Paulin, M., Dudik, M., Malick, J.: Large-scale image classification with trace-norm regularization. In: CVPR (2012)

  12. Harchaoui, Z., Juditsky, A., Nemirovski, A.: Conditional gradient algorithms for machine learning. In: NIPS Workshop on Optimization for Machine Learning. http://opt.kyb.tuebingen.mpg.de/opt12/papers.html (2012)

  13. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. Springer Series in Statistics. Springer, Berlin (2008)

  14. Hazan, E.: Sparse approximate solutions to semidefinite programs. In: Proceedings of the 8th Latin American Conference Theoretical Informatics, pp. 306–316 (2008)

  15. Hearn, D., Lawphongpanich, S., Ventura, J.: Restricted simplicial decomposition: computation and extensions. Math. Program. Stud. 31, 99–118 (1987)


  16. Holloway, C.: An extension of the Frank-Wolfe method of feasible directions. Math. Program. 6, 14–27 (1974)


  17. Jaggi, M.: Revisiting Frank-Wolfe: projection-free sparse convex optimization. In: ICML (2013)

  18. Jaggi, M., Sulovsky, M.: A simple algorithm for nuclear norm regularized problems. In: ICML (2010)

  19. Juditsky, A., Karzan, F.K., Nemirovski, A.: Randomized first order algorithms with applications to \(\ell _1\)-minimization. Math. Program. 142(1–2), 269–310 (2013)


  20. Juditsky, A., Nemirovski, A.: First order methods for nonsmooth large-scale convex minimization, i: general purpose methods; ii: utilizing problem’s structure. In: Sra, S., Nowozin, S., Wright, S. (eds). Optimization for Machine Learning, pp. 121–184. The MIT Press, Cambridge (2012)

  21. Lan, G.: An optimal method for stochastic composite optimization. Math. Program. 133(1–2), 365–397 (2012)


  22. Lemaréchal, C., Nemirovskii, A., Nesterov, Y.: New variants of bundle methods. Math. Program. 69(1–3), 111–147 (1995)


  23. Ma, S., Goldfarb, D., Chen, L.: Fixed point and Bregman iterative methods for matrix rank minimization. Math. Program. 128, 321–353 (2011)


  24. Nemirovski, A., Onn, S., Rothblum, U.G.: Accuracy certificates for computational problems with convex structure. Math. Oper. Res. 35(1), 52–78 (2010)


  25. Nemirovski, A.S., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience, New York (1983)


  26. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, Berlin (2003)


  27. Nesterov, Y.: Gradient methods for minimizing composite functions. Math. Program. 140(1), 125–161 (2013)


  28. Nesterov, Y., Nemirovski, A.: On first-order algorithms for \(\ell _1\)/nuclear norm minimization. Acta Numer. 22, 509–575 (2013)


  29. Pshenichnyj, B., Danilin, Y.: Numerical Methods in Extremal Problems. Mir, Moscow (1978)

  30. Recht, B., Fazel, M., Parrilo, P.: Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev. 52(3), 471–501 (2010)


  31. Recht, B., Ré, C.: Parallel stochastic gradient algorithms for large-scale matrix completion. Math. Program. Comput. 5(2), 201–226 (2013)


  32. Rudin, L., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal algorithms. Phys. D 60 (1992)

  33. Shalev-Shwartz, S., Gonen, A., Shamir, O.: Large-scale convex minimization with a low-rank constraint. In: ICML (2011)

  34. Sra, S., Nowozin, S., Wright, S.J.: Optimization for Machine Learning. MIT Press, Cambridge (2010)


  35. Srebro, N., Shraibman, A.: Rank, trace-norm and max-norm. In: COLT (2005)

  36. Ventura, J.A., Hearn, D.W.: Restricted simplicial decomposition for convex constrained problems. Math. Program. 59, 71–85 (1993)


  37. Yang, J., Yuan, X.: Linearized augmented lagrangian and alternating direction methods for nuclear norm minimization. Math. Comput. 82(281), 301–329 (2013)


  38. Zhang, X., Yu, Y., Schuurmans, D.: Accelerated training for matrix-norm regularization: a boosting approach. In: NIPS, pp. 2915–2923 (2012)

  39. Zibulevski, M., Narkiss, G.: Sequential subspace optimization method for large-scale unconstrained problems. Technical Report CCIT No 559, Faculty of Electrical engineering, Technion (2005)


Author information

Corresponding author

Correspondence to Anatoli Juditsky.

Additional information

Research of the first and second authors was supported by the CNRS-Mastodons project GARGANTUA, and the LabEx PERSYVAL-Lab (ANR-11-LABX-0025). Research of the third author was supported by the ONR Grant N000140811104 and NSF Grants DMS 0914785, CMMI 1232623.

Appendix

1.1 Proof of Theorem 1

Define \(\epsilon _t=f(x_t)-f^t_*\). Invoking convexity of \(f\) and the definition (16), (17) of \(f^t_*\), we have

$$\begin{aligned} \langle f'(x_t),x^+_t-x_t\rangle =f_{*,t}-f(x_t)\le f^t_{*}-f(x_t)\;(\le f_*-f(x_t)). \end{aligned}$$
(38)

Observing that for a generic CG algorithm we have (cf. (14), (15)) \(f(x_{t+1})\le f(x_t+\gamma _t(x_t^+-x_t)), \gamma _t={2\over t+1}\), and invoking (12), we have

$$\begin{aligned} f(x_{t+1})&\le f(x_t)+\gamma _t\langle f'(x_t),x_t^+-x_t\rangle +{L\over 2} \gamma _t^2\Vert x^+_t-x_t\Vert _X^2 \nonumber \\&\le f(x_t)-\gamma _t(f(x_t)-f^t_*) + {\small \frac{1}{2}}L\gamma _t^2, \end{aligned}$$
(39)

where the concluding \(\le \) is due to (38). Since \(f_*^{t+1}\ge f_*^{t}\), it follows that

$$\begin{aligned} \epsilon _{t+1}\le f(x_{t+1})-f^{t}_*\le (1-\gamma _t)\epsilon _t+{\small \frac{1}{2}}L\gamma _t^2, \end{aligned}$$

whence

$$\begin{aligned} \epsilon _{t+1}&\le \epsilon _1\prod _{i=1}^t (1-\gamma _i) + {\small \frac{1}{2}}L\sum _{i=1}^t \gamma _i^2\prod _{k=i+1}^t (1-\gamma _k)\nonumber \\&= 2L\sum _{i=1}^t (i+1)^{-2}\prod _{k=i+1}^t (1-{2\over k+1}), \end{aligned}$$

where, by convention, \(\prod _{k=t+1}^t=1\). Noting that \( \prod _{k=i+1}^t(1-{2\over k+1})=\prod _{k=i+1}^t {k-1\over k+1}= {i(i+1)\over t(t+1)},\;\;\;i=1,\ldots ,t, \) we get

$$\begin{aligned} \epsilon _{t+1}\le 2L\sum _{i=1}^t {i(i+1)\over (i+1)^{2}t(t+1)}\le {2L t\over (t+1)^2}\le {2L (t+2)^{-1}}, \end{aligned}$$
(40)

which is (18).\(\square \)
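For illustration, here is a minimal sketch of the recursion analyzed above (in Python; the callables `f`, `grad`, `lmo` and the plain update \(x_{t+1}=x_t+\gamma _t(x^+_t-x_t)\) are simplifying assumptions of ours, since (15) accepts any point which is at least as good):

```python
import numpy as np

def conditional_gradient(f, grad, lmo, x0, n_iters=100):
    """Sketch of the generic CG recursion (14)-(17) analyzed in Theorem 1.

    f, grad : callables returning f(x) and f'(x);
    lmo     : linear minimization oracle, lmo(g) = argmin_{y in X} <g, y>;
    x0      : a starting point in X.
    Returns the last iterate and the best lower bound f_*^t on min_X f.
    """
    x = np.asarray(x0, dtype=float)
    f_lb = -np.inf                       # f_*^t: the best lower bound found so far
    for t in range(1, n_iters + 1):
        g = grad(x)
        x_plus = lmo(g)                  # x_t^+ = x_X[f'(x_t)]
        # f_{*,t} = min_{y in X} [f(x_t) + <f'(x_t), y - x_t>] is a lower bound on f_*
        f_lb = max(f_lb, f(x) + np.vdot(g, x_plus - x))
        gamma = 2.0 / (t + 1)            # the step size used in the proof above
        x = x + gamma * (x_plus - x)     # any x_{t+1} at least as good is admissible in (15)
    return x, f_lb
```

For instance, when \(X\) is the standard simplex, `lmo(g)` simply returns a vertex \(e_i\) with \(i\in \mathop {\mathrm{Argmin}}_i g_i\).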

1.2 Proof of Theorem 2

The proof, up to minor modifications, goes back to [22], see also [19, 26]; we provide it here to make the paper self-contained. W.l.o.g. we can assume that we are in the nontrivial case (see description of the algorithm).

\(\mathbf{1}^0\). As explained in the description of the method, whenever stage \(s\) takes place, we have \(0<\rho _1\le \rho _s\le \rho _*\), and \(\rho _{s-1}<\rho _s\) provided \(s>1\). Therefore, by the termination rule, the output \(\bar{\rho }, \bar{x}\) of the algorithm, if any, satisfies \(\bar{\rho }\le \rho _*\) and \(f(\bar{x})\le \epsilon \). Thus (i) holds true provided that the algorithm does terminate, and all that remains is to verify (ii) and (iii).

\(\mathbf{2}^0\). Let us prove (ii). Let \(s\ge 1\) be such that stage \(s\) takes place. Setting \(X=K[\rho _s]\), observe that \(X-X\subset \{x\in E:\Vert x\Vert \le 2\rho _s\}\), whence \(\Vert \cdot \Vert \le 2\rho _s\Vert \cdot \Vert _X\), and therefore the relation (4) implies the validity of (12) with \(L=4\rho _s^2L_f\). Now, if stage \(s\) does not terminate in the course of the first \(t\) steps, then, in the notation from the description of the algorithm, \(f(\bar{x}_t)>\epsilon \) and \(f_*^t<{3\over 4}f(\bar{x}_t)\), whence \(f(\bar{x}_t)-f_*^t> \epsilon /4\). By Theorem 1.ii, the latter is possible only when \(4.5L/(t-2)>\epsilon /4\). Thus, \(t\le \max \left[ 5,2+{72\rho _s^2L_f\over \epsilon }\right] \). Taking into account that \(\rho _s\le \rho _*\), (ii) follows.

\(\mathbf{3}^0\). Let us prove (iii). This statement is trivially true when the number of stages is 1. Assuming that this is not the case, let \(S\ge 1\) be such that stage \(S+1\) takes place. For every \(s=1,\ldots ,S\), let \(t_s\) be the last step of stage \(s\), and let \(u_s, \ell ^s(\cdot )\) be what in the notation from the description of stage \(s\) was denoted \(f(\bar{x}_{t_s})\) and \(\ell ^{t_s}(\rho )\). Thus, \(u_s>\epsilon \) is an upper bound on \({\mathop {\hbox {Opt}}}(\rho _s)\), \(\ell _s{:=}\ell ^s(\rho _s)\) is a lower bound on \({\mathop {\hbox {Opt}}}(\rho _s)\) satisfying \(\ell _s\ge 3u_s/4\), \(\ell ^s(\cdot )\) is a piecewise linear convex (in \(\rho \)) lower bound on \({\mathop {\hbox {Opt}}}(\rho ), \rho \ge 0\), and \(\rho _{s+1}>\rho _s\) is the smallest positive root of \(\ell ^s(\cdot )\). Let also \(-g_s\) be a subgradient of \(\ell ^s(\cdot )\) at \(\rho _s\). Note that \(g_s>0\) due to \(\rho _{s+1}>\rho _s\) combined with \(\ell ^s(\rho _s)>0, \ell ^s(\rho _{s+1})=0\), and for the same reasons, combined with convexity of \(\ell ^s(\cdot )\), we have

$$\begin{aligned} \rho _{s+1}-\rho _s\ge \ell _s/g_s, \end{aligned}$$
(41)

and, as we have seen,

$$\begin{aligned} 1\le s\le S\Rightarrow \left\{ \begin{array}{ll} (a)&{}u_s>\epsilon ,\\ (b)&{}u_s\ge {\mathop {\hbox {Opt}}}(\rho _s)\ge \ell _s\ge {3\over 4}u_s,\\ (c)&{}\ell _s-g_s(\rho -\rho _s) \le {\mathop {\hbox {Opt}}}(\rho ),\,\rho \ge 0.\\ \end{array}\right. \!. \end{aligned}$$
(42)

Assuming \(1<s\le S\) and applying (41), we get \(\rho _s-\rho _{s-1}\ge {3\over 4}u_{s-1}/g_{s-1}\), whence, invoking (42),

$$\begin{aligned} u_{s-1}\ge {\mathop {\hbox {Opt}}}(\rho _{s-1})\ge \ell _s+g_s[\rho _s-\rho _{s-1}]\ge {3\over 4}u_s+{3\over 4}u_{s-1}{g_s\over g_{s-1}}. \end{aligned}$$

The resulting inequality implies that \({u_s\over u_{s-1}}+{g_s\over g_{s-1}}\le {4\over 3}\), whence \({u_sg_s\over u_{s-1}g_{s-1}}\le (1/4)(4/3)^2=4/9\). It follows that

$$\begin{aligned} \sqrt{u_sg_s}\le (2/3)^{s-1}\sqrt{u_1g_1}, \,\,1\le s\le S. \end{aligned}$$
(43)

Now, since the first iterate of the first stage is \(0\), we have \(u_1\le f(0)\), while (42) applied with \(s=1\) implies that \(f(0)={\mathop {\hbox {Opt}}}(0)\ge \ell _1+\rho _1g_1\ge \rho _1g_1\), whence \(u_1g_1\le f(0)/\rho _1=d\). Further, by (41) \(g_s\ge \ell _s/(\rho _{s+1}-\rho _s)\ge \ell _s/\rho _*\ge {3\over 4} u_s/\rho _*\), where the concluding inequality is given by (42). We see that \(u_sg_s\ge {3\over 4}u_s^2/\rho _*\ge {3\over 4}\epsilon ^2/\rho _*\). This lower bound on \(u_sg_s\) combines with the bound \(u_1g_1\le d\) and with (43) to imply that

$$\begin{aligned} \epsilon \le \sqrt{4/3}(2/3)^{s-1}\sqrt{d\rho _*},\,1\le s\le S. \end{aligned}$$

Finally observe that by the definition of \(\rho _*\) and due to the fact that \(\Vert x[f'(0)]\Vert =1\) in the nontrivial case, we have

$$\begin{aligned} 0&\le f(\rho _*x[f'(0)])\le f(0)+\rho _*\langle f'(0),x[f'(0)]\rangle +{1\over 2}L_f\rho _*^2\nonumber \\&= f(0)-\rho _*d+{\small \frac{1}{2}}L_f\rho _*^2 \end{aligned}$$

(we have used (4) and the definition of \(d\)), whence \(\rho _*d\le f(0)+{\small \frac{1}{2}}L_f\rho _*^2\) and therefore

$$\begin{aligned} \epsilon \le \sqrt{4/3}(2/3)^{s-1}\sqrt{f(0)+{\small \frac{1}{2}}L_f\rho _*^2},\,1\le s\le S. \end{aligned}$$

Since this relation holds true for every \(S\ge 1\) such that the stage \(S+1\) takes place, (iii) follows. \(\square \)
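To spell out the concluding step (this unpacking is ours; the constant in (iii) may be packaged differently in the main text): writing \(A{:=}f(0)+{\small \frac{1}{2}}L_f\rho _*^2\), the last display gives, for every \(S\ge 1\) such that stage \(S+1\) takes place,

$$\begin{aligned} (3/2)^{S-1}\le {\sqrt{(4/3)A}\over \epsilon }\quad \Rightarrow \quad S\le 1+\log _{3/2}{\sqrt{(4/3)A}\over \epsilon }, \end{aligned}$$

so that the total number of stages does not exceed \(2+\log _{3/2}\big (\sqrt{(4/3)A}/\epsilon \big )=O(1)\ln \big (2+A/\epsilon ^2\big )\).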

1.3 Proof of Theorem 3

By definition of \(z_t\) we have \(z_t\in K^+\) for all \(t\) and \(F(0)=F(z_1)\ge F(z_2)\ge \ldots \), whence \(r_t\le D_*\) for all \(t\) by Assumption A. Besides this, \(r_*\le D_*\) as well. Now let \(\epsilon _t=F(z_t)-F_*\), \(z_t=[x_t;r_t]\), and let \(z^+_t=[x^+_t;r^+_t]\) be a minimizer, as given by Lemma 1, of the linear form \(\langle F'(z_t),z\rangle \) of \(z\in E^+\) over the set \(K^+[r_*]=\{[x;r]: x\in K,\Vert x\Vert \le r\le r_*\}\). Recalling that \(F'(z_t)=[f'(x_t);\kappa ]\) and that \(r_t\le D_*\le \bar{D}\), Lemma 1 implies that \(z^+_t\in \Delta (z_t)\). By the definition of \(z_t^+\) and convexity of \(F\) we have

$$\begin{aligned} \begin{array}{rcl} \langle [f'(x_t);\kappa ],z_t-z_t^+\rangle &{}=&{}\langle f'(x_t),x_t-x_t^+\rangle +\kappa (r_t-r_t^+)\\ &{}\ge &{} \langle f'(x_t),x_t-x_*\rangle +\kappa (r_t-r_*)\\ &{}=&{} \langle F'(z_t),\,z_t-z_*\rangle \ge F(z_t)-F(z_*) =\epsilon _t.\\ \end{array} \end{aligned}$$

Invoking (12), it follows that for \(0\le s\le 1\) one has

$$\begin{aligned} F(z_t+s(z_t^+-z_t))&\le F(z_t)+s\langle [f'(x_t);\kappa ],z_t^+-z_t\rangle +{L_f s^2\over 2}\Vert x(z_t^+)-x(z_t)\Vert ^2\\&\le F(z_t)-s\epsilon _t +{\small \frac{1}{2}}L_f s^2(r_t+D_*)^2 \end{aligned}$$

using that \(\Vert x(z_t^+)\Vert \le r_t^+\) and \(\Vert x(z_t)\Vert \le r_t\) due to \(z_t^+,z_t\in K^+\), and that \(r_t^+\le r_*\le D_*\). By (24) we have

$$\begin{aligned} F(z_{t+1})\le \min _{0\le s\le 1}F(z_t+s(z_t^+-z_t))\le F(z_t)+\min \limits _{0\le s\le 1} \left\{ -s\epsilon _t +{\small \frac{1}{2}}L_f s^2(r_t+D_*)^2\right\} , \end{aligned}$$

and we arrive at the recurrence

$$\begin{aligned} \epsilon _{t+1}\le \epsilon _t-\left\{ \begin{array}{ll}{\epsilon _t^2\over 2L_f (r_t+D_*)^2},&{}\epsilon _t\le L_f (r_t+D_*)^2\\ \epsilon _t-{\small \frac{1}{2}}L_f (r_t+D_*)^2,&{}\epsilon _t> L_f (r_t+D_*)^2\\ \end{array}\right. , t=1,2,\ldots \end{aligned}$$
(44)

When \(t=1\), this recurrence, in view of \(z_1=0\), implies that \(\epsilon _2\le {\small \frac{1}{2}}L_fD_*^2\). Let us show by induction in \(t\ge 2\) that

$$\begin{aligned} \epsilon _{t}\le \bar{\epsilon }_t{:=}{8L_f D_*^2\over t+14},\,t=2,3,\ldots \end{aligned}$$
(45)

thus completing the proof. We have already seen that (45) is valid for \(t=2\). Assuming that (45) holds true for \(t=k\ge 2\), we have \(\epsilon _{k}\le {\small \frac{1}{2}}L_f D_*^2\) and therefore \(\epsilon _{k+1}\le \epsilon _{k}-{1\over 8L_f D_*^2}\epsilon _{k}^2\) by (44) combined with \(0\le r_k\le D_*\). Now, the function \( s-{1\over 8L_f D_*^2}s^2\) is nondecreasing on the segment \(0\le s\le 4L_f D_*^2\), which contains \(\bar{\epsilon }_k\) and, since \(\epsilon _k\le \bar{\epsilon }_k\), also \(\epsilon _k\), whence

$$\begin{aligned} \epsilon _{k+1}&\le \epsilon _{k}-{1\over 8L_f D_*^2}\epsilon _{k}^2\le \bar{\epsilon }_{k}-{1\over 8L_f D_*^2}\bar{\epsilon }_{k}^2 = \left[ {8L_f D_*^2\over k+14}\right] -{1\over 8L_f D_*^2}\left[ {8L_f D_*^2\over k+14}\right] ^2\\&= {8L_f D_*^2(k+13)\over (k+14)^2}\le {8L_f D_*^2\over (k+1)+14}, \end{aligned}$$

so that (45) holds true for \(t=k+1\). \(\square \)
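The case analysis in (44) is nothing but the explicit minimization of the quadratic upper model over \(s\in [0,1]\); here is a small sketch (our notation) of this elementary step:

```python
def guaranteed_decrease(eps_t, L_f, r_t, D_star):
    """Minimize -s*eps_t + 0.5*L_f*s**2*(r_t + D_star)**2 over s in [0, 1] and
    return (s_opt, decrease); then eps_{t+1} <= eps_t - decrease, as in (44)."""
    a = L_f * (r_t + D_star) ** 2
    if eps_t <= a:                 # the unconstrained minimizer s = eps_t/a lies in [0, 1]
        return eps_t / a, eps_t ** 2 / (2.0 * a)
    return 1.0, eps_t - 0.5 * a    # otherwise the minimum over [0, 1] is attained at s = 1
```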

1.4 Proofs for Section 6

As we have already explained, (31) is solvable, so that \(z\) is well defined. Denoting by \((s_*,r_*)\) an optimal solution to (31) produced, along with \(z\), by our solver, note that the characteristic property of \(z\) is the relation

$$\begin{aligned} (s_*,r_*)\in \mathop {\mathrm{Argmax}}_{s,r}\{s+\langle z, Pr-s\eta \rangle : 0\le r\le \mathbf {e}\}. \end{aligned}$$

Since the column sums in \(P\) are zeros and the sum of entries in \(\eta \) is zero, the above characteristic property of \(z\) is preserved when passing from \(z\) to \(\bar{z}\), so that we may assume from the very beginning that \(z=\bar{z}\) is a zero mean image. Now, \(P=[Q,-Q]\), where \(Q\) is the incidence matrix of the network obtained from \(G\) by eliminating backward arcs. Representing a flow \(r\) as \([r_f;r_b]\), where the blocks are comprised, respectively, of flows in the forward and backward arcs, and passing from \(r\) to \(\rho =r_f-r_b\), our characteristic property of \(z\) clearly implies the relation

$$\begin{aligned} (s_*,\;\rho _*{:=}r_{*\,f}-r_{*\,b})\in \mathop {\mathrm{Argmax}}_{s,\rho }\{\underbrace{s+\langle z,Q\rho -s\eta \rangle }_{\psi (s,\rho )}:\Vert \rho \Vert _\infty \le 1\}. \end{aligned}$$
(46)

By the optimality conditions in linear programming it follows that

$$\begin{aligned} \begin{array}{ll} (a)&{}\langle z,\eta \rangle =1,\\ (b)&{}\Vert \rho _*\Vert _\infty \le 1,\\ (c)&{}(Q^Tz)_\gamma =\left\{ \begin{array}{ll}\le 0, &{}[\rho _*]_\gamma =-1,\\ =0,&{}[\rho _*]_\gamma \in (-1,1),\\ \ge 0,&{}[\rho _*]_\gamma =1,\\ \end{array}\right. \ \hbox {for all forward arcs}\ \gamma ,\\ (d)&{}Q\rho _*=s_*\eta .\\ \end{array} \end{aligned}$$
(47)

Indeed, \((a)\) stems from the fact that \(\psi (s,\rho )\), which is affine in \(s\), is bounded above, so that the coefficient of \(s\) in \(\psi \) should be zero; \((b)\) is the constraint in the maximization problem in (46) to which \((s_*,\rho _*)\) is an optimal solution; \((c)\) is the optimality condition for the same problem w.r.t. the \(\rho \)-variable; and \((d)\) expresses the fact that \((s_*,r_*)\) is feasible for (31). (47.\(d\)) and (47.\(a\)) imply that \(\langle Q^Tz,\rho _*\rangle =s_*\), while (47.\(c\)) says that \(\langle Q^Tz,\rho _*\rangle = \Vert Q^Tz\Vert _1\), so that \(s_*=\Vert Q^Tz\Vert _1\). By (47.\(a\)) \(z\ne 0\), and thus \(z\) is a nonzero image with zero mean; recalling what \(Q\) is, the first \(n(n-1)\) entries in \(Q^Tz\) form \(\nabla _i z\), and the last \(n(n-1)\) entries form \(\nabla _jz\), so that \(\Vert Q^Tz\Vert _1={\hbox {TV}}(z)\). The gradient field of a nonzero image with zero mean cannot be identically zero, whence \({\hbox {TV}}(z)=\Vert Q^Tz\Vert _1=s_*>0\). Thus \(x[\eta ]=-z/{\hbox {TV}}(z)=-z/s_*\) is well defined and \({\hbox {TV}}(x[\eta ])=1\), while by (47.\(a\)) we have \(\langle x[\eta ],\eta \rangle =-1/s_*\). Finally, let \(x\in {\mathcal{T}\mathcal{V}}\), implying that \(Q^Tx\) is the concatenation of \(\nabla _ix\) and \(\nabla _jx\) and thus \(\Vert Q^Tx\Vert _1={\hbox {TV}}(x)\le 1\). Invoking (47.\(b,d\)), we get \(-1\le \langle Q^Tx,\rho _*\rangle =\langle x, Q\rho _*\rangle =s_*\langle x,\eta \rangle \), whence \(\langle x,\eta \rangle \ge -1/s_*=\langle x[\eta ],\eta \rangle \), meaning that \(x[\eta ]\in {\mathcal{T}\mathcal{V}}\) is a minimizer of \(\langle \eta ,x\rangle \) over \(x\in {\mathcal{T}\mathcal{V}}\). \(\square \)
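To make the role of \(Q\) concrete, here is a small numerical sketch (the helper names and the arc ordering/orientation are our own conventions, not the paper's) which assembles the forward-arc incidence matrix \(Q\) of the \(n\times n\) grid and checks that the entries of \(Q^Tx\) are, up to signs and ordering, those of \(\nabla _ix\) and \(\nabla _jx\), so that \(\Vert Q^Tx\Vert _1={\hbox {TV}}(x)\):

```python
import numpy as np

def grid_incidence(n):
    """Incidence matrix Q of the forward arcs of the n-by-n grid: one row per
    node (pixel), one column per arc, with +1 at the arc's starting node and
    -1 at its terminal node (cf. the description of P; Q keeps forward arcs only)."""
    def node(i, j):
        return i * n + j
    cols = []
    for j in range(n):                   # arcs (i, j) -> (i+1, j)
        for i in range(n - 1):
            col = np.zeros(n * n)
            col[node(i, j)] = 1.0
            col[node(i + 1, j)] = -1.0
            cols.append(col)
    for i in range(n):                   # arcs (i, j) -> (i, j+1)
        for j in range(n - 1):
            col = np.zeros(n * n)
            col[node(i, j)] = 1.0
            col[node(i, j + 1)] = -1.0
            cols.append(col)
    return np.stack(cols, axis=1)        # shape (n*n, 2*n*(n-1))

n = 8
x = np.random.randn(n, n); x -= x.mean()                        # a zero mean test image
Q = grid_incidence(n)
tv_direct = np.abs(np.diff(x, axis=0)).sum() + np.abs(np.diff(x, axis=1)).sum()
assert np.allclose(tv_direct, np.abs(Q.T @ x.ravel()).sum())    # ||Q^T x||_1 = TV(x)
```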

In the sequel, for a real-valued function \(x\) defined on a finite set (e.g., for an image), \(\Vert x\Vert _p\) stands for the \(L_p\) norm of the function corresponding to the counting measure on the set (the mass of every point from the set is 1). Let us fix \(n\) and \(x\in M^n_0\) with \({\hbox {TV}}(x)\le 1\); we want to prove that

$$\begin{aligned} \Vert x\Vert _2\le \mathcal{C}\sqrt{\ln (n)} \end{aligned}$$
(48)

with appropriately selected absolute constant \(\mathcal{C}\).

\(\mathbf{1}^0\). Let \(\oplus \) stand for addition, and \(\ominus \) for subtraction of integers modulo \(n\); \(p\oplus q=(p+q) \,\hbox {mod} \,n\in \{0,1,\ldots ,n-1\}\) and similarly for \(p\ominus q\). Along with discrete partial derivatives \(\nabla _i x, \nabla _jx\), let us define their periodic versions \(\widehat{\nabla }_ix, \widehat{\nabla }_jx\):

$$\begin{aligned} \widehat{\nabla }_ix(i,j)&= x(i\oplus 1,j)-x(i,j):\Gamma _{n,n}\rightarrow {\mathbf {R}},\,\, \widehat{\nabla }_jx(i,j)\nonumber \\&= x(i,j\oplus 1)-x(i,j):\Gamma _{n,n}\rightarrow {\mathbf {R}}, \end{aligned}$$

as well as the periodic Laplacian \(\widehat{\Delta } x\):

$$\begin{aligned} \widehat{\Delta } x=x(i,j)-{1\over 4}\left[ x(i\ominus 1,j){+}x(i\oplus 1,j)+x(i,j\ominus 1) +x(i,j\oplus 1)\right] :\Gamma _{n,n}{\rightarrow }{\mathbf {R}}. \end{aligned}$$

For every \(j, 0\le j<n\), we have \(\sum _{i=0}^{n-1} \widehat{\nabla }_ix(i,j)=0\) and \(\nabla _ix(i,j)=\widehat{\nabla }_ix(i,j)\) for \(0\le i<n-1\), whence \(\sum _{i=0}^{n-1}|\widehat{\nabla }_ix(i,j)|\le 2\sum _{i=0}^{n-1}|\nabla _ix(i,j)|\) for every \(j\), and thus \(\Vert \widehat{\nabla }_ix\Vert _1\le 2\Vert \nabla _ix\Vert _1\). Similarly, \(\Vert \widehat{\nabla }_jx\Vert _1\le 2\Vert \nabla _jx\Vert _1\), and we conclude that

$$\begin{aligned} \Vert \widehat{\nabla }_ix\Vert _1+\Vert \widehat{\nabla }_jx\Vert _1\le 2. \end{aligned}$$
(49)

\(\mathbf{2}^0\). Now observe that for \(0\le i,j<n\) we have

$$\begin{aligned} \begin{array}{rcl} x(i,j)&{}=&{}x(i\ominus 1,j)+\widehat{\nabla }_ix(i\ominus 1,j)\\ x(i,j)&{}=&{}x(i\oplus 1,j)-\widehat{\nabla }_ix(i,j)\\ x(i,j)&{}=&{}x(i,j\ominus 1)+\widehat{\nabla }_jx(i,j\ominus 1)\\ x(i,j)&{}=&{}x(i,j\oplus 1)-\widehat{\nabla }_jx(i,j)\\ \end{array} \end{aligned}$$

whence

$$\begin{aligned} \widehat{\Delta } x(i,j)={1\over 4}\left[ \widehat{\nabla }_ix(i\ominus 1,j)- \widehat{\nabla }_ix(i,j)+\widehat{\nabla }_jx(i,j\ominus 1)- \widehat{\nabla }_jx(i,j)\right] \end{aligned}$$
(50)

Now consider the following linear mapping from \(M^n\times M^n\) into \(M^n\):

$$\begin{aligned} B[g,h](i,j)={1\over 4}\left[ g(i\ominus 1,j)-g(i,j)+h(i,j\ominus 1) -h(i,j)\right] ,\,[i;j]\in \Gamma _{n,n}. \end{aligned}$$
(51)

From this definition and (50) it follows that

$$\begin{aligned} \widehat{\Delta }x=B[\widehat{\nabla }_ix,\widehat{\nabla }_jx]. \end{aligned}$$
(52)

For \(u\in M^n\), let \({\hbox {DFT}}[u]\) stand for the 2D Discrete Fourier Transform of \(u\):

$$\begin{aligned} {\hbox {DFT}}[u](p,q)=\sum _{0\le r,s<n} u(r,s)\exp \{-2\pi \imath (pr+qs)/n\},\,[p;q]\in \Gamma _{n,n}. \end{aligned}$$

Note that every image \(u\) with zero mean is the periodic Laplacian of another, uniquely defined, image \(X[u]\) with zero mean, with \(X[u]\) given by its Fourier transform

$$\begin{aligned} {\hbox {DFT}}[X[u]](p,q)&= Y[u](p,q){:=}\left\{ \begin{array}{ll}0,&{}p=q=0\\ {{\hbox {DFT}}[u](p,q)\over D(p,q)}, &{}0\ne [p;q]\in \Gamma _{n,n}\\ \end{array}\right. \!\!,\,[p;q]\in \Gamma _{n,n},\nonumber \\ D(p,q)&= 1-{1\over 2}[\cos (2\pi p/n)+\cos (2\pi q/n)],\,\,[p;q]\in \Gamma _{n,n}. \end{aligned}$$
(53)

Indeed, representing an \(n\times n\) image \(x(\mu ,\nu ),\) \(0\le \mu ,\nu <n\), as a Fourier sum

$$\begin{aligned} x(\mu ,\nu )=\sum _{0\le p,q<n}c_{p,q}\exp \{2\pi \imath [p\mu /n+q\nu /n]\}, \end{aligned}$$

we get

$$\begin{aligned} \!\!\!\!\!\!\!\!\begin{array}{rcl} \left[ \widehat{\Delta }x\right] (\mu ,\nu )&{}=&{}\sum \limits _{0\le p,q<n}c_{p,q}\bigg [\exp \{2\pi \imath [{p\mu \over n}+{q\nu \over n}]\}\\ &{}&{}-{1\over 4}\exp \{2\pi \imath [{p(\mu \ominus 1)\over n}+{q\nu \over n}]\} - {1\over 4}\exp \{2\pi \imath [{p(\mu \oplus 1)\over n}+{q\nu \over n}]\}\\ &{}&{}-{1\over 4}\exp \{2\pi \imath [{p\mu \over n}+{q(\nu \ominus 1)\over n}]\}-{1\over 4}\exp \{2\pi \imath [{p\mu \over n}+{q(\nu \oplus 1)\over n}]\}\bigg ]\\ &{}=&{} \sum \limits _{0\le p,q<n}c_{p,q}\exp \{2\pi \imath [{p\mu \over n}+{q\nu \over n}]\}\bigg [1-{1\over 4}\exp \{-2\pi \imath {p\over n}\}-{1\over 4}\exp \{2\pi \imath {p\over n}\}\\ &{}&{} -{1\over 4}\exp \{-2\pi \imath {q\over n}\} - {1\over 4}\exp \{2\pi \imath {q\over n}\}\bigg ]\\ &{}=&{}\sum \limits _{0\le p,q<n}\left[ c_{p,q}D(p,q)\right] \exp \{2\pi \imath [{p\mu \over n}+{q\nu \over n}]\},\qquad \end{array} \end{aligned}$$

where

$$\begin{aligned} D(p,q)&= \left[ 1-{1\over 4}\exp \{-2\pi \imath {p\over n}\}-{1\over 4}\exp \{2\pi \imath {p\over n}\}-{1\over 4}\exp \{-2\pi \imath {q\over n}\} - {1\over 4}\exp \{2\pi \imath {q\over n}\}\right] \\&= 1-{1\over 2}\cos (2\pi {p\over n})-{1\over 2}\cos (2\pi {q\over n}),\,\,[p;q]\in \Gamma _{n,n}. \end{aligned}$$

In other words, in the Fourier domain, passing from an image to its periodic Laplacian means multiplying the Fourier coefficient \(c_{p,q}\) by \(D(p,q)\). From the expression for \(D(p,q)\) it is immediately seen that \(D(p,q)\) is nonzero whenever \((p,q)\) with \(0\le p,q<n\) is nonzero. In other words, the periodic Laplacian of any \(n\times n\) zero mean image is an image with zero mean (zero Fourier coefficient \((0,0)\)), and every image of this type is the periodic Laplacian of the zero mean image described in (53).

In particular, invoking (52), we get

$$\begin{aligned} {\hbox {DFT}}[x]=Y[B[\widehat{\nabla }_ix,\widehat{\nabla }_jx]]. \end{aligned}$$

By Parseval identity, \(\Vert {\hbox {DFT}}[x]\Vert _2=n\Vert x\Vert _2\), whence

$$\begin{aligned} \Vert x\Vert _2=n^{-1}\Vert Y[B[\widehat{\nabla }_ix,\widehat{\nabla }_jx]]\Vert _2. \end{aligned}$$

Combining this observation with (49), we see that in order to prove (48), it suffices to check that

(!) Whenever \(g,h\in M^n\) are such that

$$\begin{aligned} (g,h)\in G{:=}\{(g,h)\in M^n\times M^n: \Vert g\Vert _1+\Vert h\Vert _1\le 2\}, \end{aligned}$$

we have

$$\begin{aligned} \Vert Y[B[g,h]]\Vert _2\le n\mathcal{C}\sqrt{\ln (n)}. \end{aligned}$$
(54)
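Before proving (!), note that the machinery of (49)–(53) is easy to exercise numerically; the sketch below (function names are ours) computes the periodic gradients, applies the map \(B\) of (51), inverts the periodic Laplacian in the Fourier domain as in (53), and checks that the zero mean image \(x\) is recovered, together with the Parseval identity used above:

```python
import numpy as np

def periodic_grads(x):
    """hat-nabla_i x and hat-nabla_j x (index shifts are modulo n)."""
    return np.roll(x, -1, axis=0) - x, np.roll(x, -1, axis=1) - x

def B(g, h):
    """The linear map (51): B[g,h](i,j) = (g(i-1,j) - g(i,j) + h(i,j-1) - h(i,j))/4."""
    return 0.25 * (np.roll(g, 1, axis=0) - g + np.roll(h, 1, axis=1) - h)

def invert_periodic_laplacian(u):
    """X[u] of (53): the zero mean image whose periodic Laplacian is u."""
    n = u.shape[0]
    c = np.cos(2 * np.pi * np.arange(n) / n)
    D = 1.0 - 0.5 * (c[:, None] + c[None, :])        # vanishes only at (0, 0)
    U = np.fft.fft2(u)
    Y = np.zeros_like(U)
    Y[D != 0] = U[D != 0] / D[D != 0]                # the (0, 0) coefficient stays zero
    return np.real(np.fft.ifft2(Y))

n = 16
x = np.random.randn(n, n); x -= x.mean()             # a zero mean test image
gi, gj = periodic_grads(x)
lap = B(gi, gj)                                      # equals hat-Delta x, by (52)
assert np.allclose(invert_periodic_laplacian(lap), x)
assert np.allclose(np.linalg.norm(np.fft.fft2(x)), n * np.linalg.norm(x))  # Parseval
```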

\(\mathbf{4}^0\). The good news about (!) is that \(Y[B[g,h]]\) is linear in \((g,h)\). Therefore, in order to justify (!), it suffices to prove that (54) holds true for the extreme points of \(G\), i.e., (a) for pairs where \(h\equiv 0\) and \(g\) is an image which is equal to 2 at some point of \(\Gamma _{n,n}\) and vanishes outside of this point, and (b) for pairs where \(g\equiv 0\) and \(h\) is an image which is equal to 2 at some point of \(\Gamma _{n,n}\) and vanishes outside of this point. Task (b) clearly reduces to task (a) by swapping the coordinates \(i,j\) of points from \(\Gamma _{n,n}\), so that we may focus solely on task (a). Thus, assume that \(g\) is a cyclic shift of the image \(2\delta \):

$$\begin{aligned} g(i,j)\equiv 2\delta (i\ominus r,j\ominus s),\,\,\delta (i,j)=\left\{ \begin{array}{ll} 1,&{}[i;j]=[0;0]\\ 0,&{}[i;j]\ne [0;0]\\ \end{array}\right. ,\,[i;j]\in \Gamma _{n,n}. \end{aligned}$$

From (51) it follows that then \(B[g,0]\) is a cyclic shift of \(B[2\delta ,0]\), whence \(|{\hbox {DFT}}[B[g,0]](p,q)|=|{\hbox {DFT}}[B[2\delta ,0]](p,q)|\) for all \([p;q]\in \Gamma _{n,n}\), which, by (53), implies that \(|Y[B[g,0]](p,q)|=|Y[B[2\delta ,0]](p,q)|\) for all \([p;q]\in \Gamma _{n,n}\). The bottom line is that all we need is to verify that (54) holds true for \(g=2\delta ,h=0\), or, which is the same, that with

$$\begin{aligned} y(p,q)={(1-\exp \{2\pi \imath p/n\})\over 2[1-{1\over 2}[\cos (2\pi p/n)+\cos (2\pi q/n)]]} \end{aligned}$$
(55)

where the right hand side is, by definition, \(0\) at \(p=q=0\), one has

$$\begin{aligned} C_n{:=}\sum _{p,q=0}^{n-1} |y(p,q)|^2 \le n^2\mathcal{C}^2\ln (n). \end{aligned}$$

Now, (55) makes sense for all \([p;q]\in {\mathbf {Z}}^2\) (provided that we define the right hand side as zero at all points of \({\mathbf {Z}}^2\) where the denominator in (55) vanishes, that is, at all points where both \(p\) and \(q\) are integer multiples of \(n\)) and defines \(y\) as a doubly periodic function of \([p;q]\), with period \(n\) in \(p\) and in \(q\). Therefore, setting \(m=\lfloor n/2\rfloor \ge 1\) and \(W=\{[p;q]\in {\mathbf {Z}}^2: -m\le p,q<n-m\}\), we have

$$\begin{aligned} C_n=\sum _{0\ne [p;q]\in W} |y(p,q)|^2=\sum _{[p;q]\in W} {|1-\exp \{2\pi \imath p/n\}|^2\over 4|1-{1\over 2}[\cos (2\pi p/n)+\cos (2\pi q/n)]|^2}. \end{aligned}$$

Setting \(\rho (p,q)=\sqrt{p^2+q^2}\), observe that when \(0\ne [p;q]\in W\), we have \(|1-\exp \{2\pi \imath p/n\}|\le C_1n^{-1}\rho (p,q)\) and \(2[1-{1\over 2}[\cos (2\pi p/n)+\cos (2\pi q/n)]]\ge C_2n^{-2}\rho ^2(p,q)\) with positive absolute constants \(C_1,C_2\), whence

$$\begin{aligned} C_n\le (C_1/C_2)^2\sum _{0\ne [p;q]\in W} n^2\rho ^{-2}(p,q). \end{aligned}$$

With appropriately selected absolute constant \(C_3\) we have

$$\begin{aligned} \sum _{0\ne [p;q]\in W} \rho ^{-2}(p,q)\le C_3\int \limits _{1}^n r^{-2}rdr=C_3\ln (n). \end{aligned}$$

Thus, \(C_n\le (C_1/C_2)^2C_3n^2\ln (n)\), meaning that (54), and thus (48), holds true with \(\mathcal{C}=\sqrt{C_3}C_1/C_2\). \(\square \)
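As a sanity check of the logarithmic growth (an illustration only, not part of the proof), one can evaluate \(C_n\) of (55) directly and watch the ratio \(C_n/(n^2\ln n)\) stay bounded:

```python
import numpy as np

def C_n(n):
    """Direct evaluation of C_n = sum_{p,q=0}^{n-1} |y(p,q)|^2, with y from (55)
    set to 0 where its denominator vanishes (i.e., at p = q = 0)."""
    p = np.arange(n)
    num = np.abs(1.0 - np.exp(2j * np.pi * p / n))[:, None] ** 2   # |1 - exp(2*pi*i*p/n)|^2
    c = np.cos(2 * np.pi * p / n)
    den = 2.0 * (1.0 - 0.5 * (c[:, None] + c[None, :]))
    y2 = np.divide(num, den ** 2, out=np.zeros((n, n)), where=(den != 0))
    return y2.sum()

for n in (64, 128, 256, 512):
    print(n, C_n(n) / (n ** 2 * np.log(n)))            # the ratio should remain bounded
```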

We are very grateful to the Referees for their reading of the manuscript and their thoughtful comments.

We have included after Theorem 2 a comment comparing the proposed parametric optimization algorithm with one based on bisection. However, we finally decided not to include the numerical comparison. Including such an experiment would require explaining in some detail what the “bisection” algorithm is and how it is implemented, which would make the manuscript, already too long with respect to the size requirements of Mathematical Programming, even longer.

We would like to thank Referee 2 for detailed and clarifying explanations of his comments on the initial manuscript. Some comments of Referee 2 concern notation we have not modified in our revision. Since his requests do not seem to be compulsory and the final decision is left to us, we have modified some (e.g., we replaced \(D^+\) with \(\bar{D}\)) but preferred to keep others unchanged.

In what follows we provide our answers to other questions raised in his detailed report.

  • \(6.16^*\) ...I would suggest that the authors simply say, when introducing (12), that it is implied by (4) but more general, to give some hint about why they are mentioning it

    We have added a small comment after display (12) to clarify the choice of the norm \(\Vert \cdot \Vert _X\) here.

  • \(11.9^*\) ...some readers might be more familiar with a statement like “\(\mathcal{A}\) has full column rank,” or “\(\mathcal{A}\) has a trivial nullspace,” etc.

    We think that for a linear operator (which, in our case, may not be a matrix) the commonly adopted terminology is indeed “\(\mathcal{A}\) has a trivial nullspace” or “\(\mathcal{A}\) has a trivial kernel”, which is exactly what \(\mathrm{Ker}(\mathcal{A})=\{0\}\) expresses.

  • \(11.19^*\) ...but I think it would suffice to simply note in the text that the given value \(\sum |\lambda _\zeta |\rho \) is an upper bound on \(\Vert x\Vert \) for \(x\) in the given form.

    Thank you, we have added a comment to this effect after display (29).

  • \(30.5^*\) ...I’ll say that the authors are practicing a false economy in not giving a more complete explanation, and that at the least, “we immediately see” should be “one can show”

    Thank you for insisting; you are probably right. We have reproduced in the text our answer to the corresponding comment on the previous version of the manuscript.

Cite this article

Harchaoui, Z., Juditsky, A. & Nemirovski, A. Conditional gradient algorithms for norm-regularized smooth convex optimization. Math. Program. 152, 75–112 (2015). https://doi.org/10.1007/s10107-014-0778-9
