Bias of Homotopic Gradient Descent for the Hinge Loss

Applied Mathematics & Optimization

Abstract

Gradient descent is a simple and widely used optimization method for machine learning. For homogeneous linear classifiers applied to separable data, gradient descent has been shown to converge to the maximal-margin (or equivalently, the minimal-norm) solution for various smooth loss functions. The previous theory does not, however, apply to the non-smooth hinge loss which is widely used in practice. Here, we study the convergence of a homotopic variant of gradient descent applied to the hinge loss and provide explicit convergence rates to the maximal-margin solution for linearly separable data.
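To make the setting concrete, the following is a minimal sketch of the kind of homotopic subgradient scheme analyzed in this paper, applied to the regularized hinge objective \(F_\lambda (\varvec{w}) = \frac{\lambda }{2}\Vert \varvec{w}\Vert ^2 + \frac{1}{n}\sum _{j=1}^n \max \{0, 1 - y_j\varvec{x}_j^\top \varvec{w}\}\) used in the appendix. The function names, default constants, and step-size choice below are illustrative assumptions, not the paper's exact Algorithm 1.

```python
import numpy as np

def hinge_subgradient_step(w, X, y, lam, eta):
    """One subgradient step on F_lambda(w) = (lam/2)||w||^2
    + (1/n) sum_j max(0, 1 - y_j x_j^T w); cf. the update in Eq. (16)."""
    active = y * (X @ w) <= 1                       # examples with margin at most 1
    pull = (y[active, None] * X[active]).sum(axis=0) / len(y)
    return (1 - lam * eta) * w + eta * pull

def homotopic_subgradient_descent(X, y, p=0.5, r=1.5, s0=2, stages=20):
    """Decrease lambda_s = (s0+s)^(-p) over stages, run t_s = (s0+s)^r averaged
    subgradient steps per stage, and warm-start each stage from the previous
    average (schedule constants here are illustrative)."""
    n, d = X.shape
    L = 2.0 / n * np.linalg.norm(X, axis=1).sum()   # the constant L used in the appendix
    w_bar = np.zeros(d)
    for s in range(stages):
        lam, t = (s0 + s) ** (-p), int((s0 + s) ** r)
        eta = 1.0 / (L * np.sqrt(t))                # step of order R/(L sqrt(t)), with R absorbed
        w, avg = w_bar.copy(), np.zeros(d)
        for _ in range(t):
            w = hinge_subgradient_step(w, X, y, lam, eta)
            avg += w / t
        w_bar = avg                                 # warm start for the next, smaller lambda
    return w_bar
```

On linearly separable data, the direction of the returned iterate should approach the maximal-margin separator as the number of stages grows, which is the behavior quantified by the results in this paper.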


References

  1. Bartlett, P., Shawe-Taylor, J.: Advances in Kernel Methods, Chap. Generalization Performance of Support Vector Machines and Other Pattern Classifiers, pp. 43–54. MIT Press, Cambridge (1999)

  2. Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Rev. 60(2), 223–311 (2018)

  3. Brutzkus, A., Globerson, A., Malach, E., Shalev-Shwartz, S.: SGD learns over-parameterized networks that provably generalize on linearly separable data. In: International Conference on Learning Representations, ICLR 2018. Vancouver, BC, Canada, April 30-May 3, 2018, Conference Track Proceedings (2018). https://openreview.net/forum?id=rJ33wwxRb

  4. Bubeck, S.: Convex optimization: algorithms and complexity. arXiv e-prints arXiv:1405.4980 (2014)

  5. Bubeck, S.: Convex optimization: algorithms and complexity. Found. Trends Mach. Learn. 8(3–4), 231–357 (2015). https://doi.org/10.1561/2200000050

  6. Chapelle, O.: Training a support vector machine in the primal. Neural Comput. 19, 1155–1178 (2007)

  7. Chaudhari, P., Choromanska, A., Soatto, S., LeCun, Y., Baldassi, C., Borgs, C., Chayes, J., Sagun, L., Zecchina, R.: Entropy-sgd: biasing gradient descent into wide valleys. In: International Conference on Learning Representations (2017)

  8. Combes, R.T.d., Pezeshki, M., Shabanian, S., Courville, A.C., Bengio, Y.: On the learning dynamics of deep neural networks. CoRR arXiv:1809.06848 (2018)

  9. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)

  10. Gunasekar, S., Lee, J., Soudry, D., Srebro, N.: Characterizing implicit bias in terms of optimization geometry. In: Dy, J., Krause, A. (eds.) Proceedings of the 35th International Conference on Machine Learning, vol. 80, pp. 1832–1841. PMLR, Stockholmsmässan, Stockholm Sweden (2018). http://proceedings.mlr.press/v80/gunasekar18a.html

  11. Hardt, M., Recht, B., Singer, Y.: Train faster, generalize better: Stability of stochastic gradient descent. In: International Conference on Machine Learning, ICML’16, pp. 1225–1234. JMLR.org (2016). http://dl.acm.org/citation.cfm?id=3045390.3045520

  12. Hastie, T., Rosset, S., Tibshirani, R., Zhu, J.: The entire regularization path for the support vector machine. J. Mach. Learn. Res. 5(Oct), 1391–1415 (2004)

  13. Hoffer, E., Hubara, I., Soudry, D.: Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In: Advances in Neural Information Processing Systems, pp. 1731–1741 (2017)

  14. Lacoste-Julien, S., Schmidt, M., Bach, F.: A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. arXiv preprint arXiv:1212.2002 (2012)

  15. Li, Y., Singer, Y.: The well tempered lasso. arXiv preprint arXiv:1806.03190 (2018)

  16. Nacson, M.S., Lee, J., Gunasekar, S., Savarese, P., Srebro, N., Soudry, D.: Convergence of gradient descent on separable data. In: International Conference on Artificial Intelligence and Statistics (2019)

  17. Nacson, M.S., Srebro, N., Soudry, D.: Stochastic gradient descent on separable data: exact convergence with a fixed learning rate. In: Proceedings of Machine Learning Research, vol. 89, pp. 3051–3059. PMLR (2019). http://proceedings.mlr.press/v89/nacson19a.html

  18. Neyshabur, B., Tomioka, R., Srebro, N.: In search of the real inductive bias: on the role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614 (2014)

  19. Poggio, T., Kawaguchi, K., Liao, Q., Miranda, B., Rosasco, L., Boix, X., Hidary, J., Mhaskar, H.: Theory of deep learning III: explaining the non-overfitting puzzle. arXiv preprint arXiv:1801.00173 (2017)

  20. Poggio, T., Liao, Q., Miranda, B., Banburski, A., Boix, X., Hidary, J.: Theory IIIb: generalization in deep networks. arXiv preprint arXiv:1806.11379 (2018)

  21. Ramdas, A., Peña, J.: Towards a deeper geometric, analytic and algorithmic understanding of margins. Optim. Methods Softw. 31(2), 377–391 (2016)

  22. Rosset, S., Zhu, J., Hastie, T.J.: Margin maximizing loss functions. In: Advances in Neural Information Processing Systems, pp. 1237–1244 (2004)

  23. Soudry, D., Hoffer, E., Nacson, M.S., Gunasekar, S., Srebro, N.: The implicit bias of gradient descent on separable data. J. Mach. Learn. Res. 19(1), 2822–2878 (2018)

  24. Vapnik, V.: Estimation of Dependences Based on Empirical Data. Springer, Berlin (1982)

  25. Vapnik, V.: The Nature of Statistical Learning Theory. Springer, New York (2013)

  26. Vapnik, V.N.: An overview of statistical learning theory. IEEE Trans. Neural Netw. 10(5), 988–999 (1999)

  27. Vapnik, V.N., Chervonenkis, A.J.: Theory of Pattern Recognition. Nauka, Moscow (1974)

  28. Wang, G., Giannakis, G.B., Chen, J.: Learning relu networks on linearly separable data: algorithm, optimality, and generalization. IEEE Trans. Signal Process. 67(9), 2357–2370 (2019). https://doi.org/10.1109/TSP.2019.2904921

  29. Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O.: Understanding deep learning requires rethinking generalization. In: International Conference on Machine Learning (2017). arXiv:1611.03530

Author information

Corresponding author

Correspondence to Denali Molitor.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

D. Molitor and D. Needell are grateful to and were partially supported by NSF CAREER DMS #1348721 and NSF BIGDATA DMS #1740325. R. Ward was supported in part by AFOSR MURI Award N00014-17-S-F006.

Appendix A: Lemma Proofs

We now present proofs for the lemmas of Sects. 2, 3 and 4.

We first prove Lemma 1, which gives a bound on the norm of the iterates produced by the subgradient method applied to Eq. (2).

1.1 Proof of Lemma 1

Proof

Consider the subgradient update for minimizing the function \(F_\lambda \) of Eq. (2)

$$\begin{aligned} \varvec{w}^{\prime }{} = (1-\lambda \eta ) \varvec{w}+ \frac{\eta }{n}\sum _{j : y_j\varvec{x}_j^\top \varvec{w}\le 1} y_j\varvec{x}_j \end{aligned}$$
(16)

with \(\eta \lambda < 1\). Suppose that the iterate \(\varvec{w}\) satisfies \(\Vert {\varvec{w}} \Vert \le \frac{1}{\lambda n} \sum _{j=1}^n\Vert {\varvec{x}_j} \Vert \). We aim to show that the next iterate \(\varvec{w}^{\prime }{}\) given by the subgradient update also satisfies \(\Vert {\varvec{w}^{\prime }{}} \Vert \le \frac{1}{\lambda n} \sum _{j=1}^n\Vert {\varvec{x}_j} \Vert .\) Taking the norm on both sides of Eq. (16),

$$\begin{aligned} \Vert {\varvec{w}^{\prime }{}} \Vert&= \bigg \Vert (1-\eta \lambda )\varvec{w}+ \frac{\eta }{n}\sum _{j : y_j\varvec{x}_j^\top \varvec{w}\le 1} y_j\varvec{x}_j \bigg \Vert \\&\le (1-\eta \lambda )\Vert {\varvec{w}} \Vert + \frac{\eta }{n} \bigg \Vert \sum _{j : y_j\varvec{x}_j^\top \varvec{w}\le 1} y_j\varvec{x}_j \bigg \Vert \\&\le (1-\eta \lambda )\frac{1}{\lambda n} \sum _{j=1}^n\Vert {\varvec{x}_j} \Vert + \frac{\eta }{n}\sum _{j=1}^n \Vert {\varvec{x}_j} \Vert \\&= \frac{1}{\lambda n} \sum _{j=1}^n\Vert {\varvec{x}_j} \Vert -\frac{\eta }{ n} \sum _{j=1}^n\Vert {\varvec{x}_j} \Vert + \frac{\eta }{n}\sum _{j=1}^n \Vert {\varvec{x}_j} \Vert \\&= \frac{1}{\lambda n} \sum _{j=1}^n\Vert {\varvec{x}_j} \Vert . \end{aligned}$$

Thus the norms of all iterates of the subgradient method applied to the function \(F_\lambda \) remain bounded by \(\frac{1}{\lambda n} \sum _{j=1}^n\Vert {\varvec{x}_j} \Vert \) if the initial iterate has norm at most \(\frac{1}{\lambda n}\sum _{j=1}^n\Vert {\varvec{x}_j} \Vert \). The norm of the minimizer \(\varvec{w}_{\lambda }^*\) of \(F_\lambda \) must also satisfy the bound \( \Vert {\varvec{w}_{\lambda }^*} \Vert \le \frac{1}{\lambda n}\sum _{j } \Vert {\varvec{x}_j} \Vert \) as \(0\in \partial F_\lambda (\varvec{w}_{\lambda }^*)\) and so

$$\begin{aligned} \lambda \Vert {\varvec{w}_{\lambda }^*} \Vert \le \frac{1}{n}\bigg |\bigg |\sum _{j : y_j\varvec{x}_j^\top \varvec{w}_{\lambda }^*\le 1} y_j\varvec{x}_j\bigg |\bigg |. \end{aligned}$$

\(\square \)
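As a quick numerical sanity check of this invariant, the following snippet runs the update of Eq. (16) on synthetic data with an arbitrary choice of \(\lambda \) and \(\eta \) satisfying \(\eta \lambda < 1\) (the data and constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))                        # toy data; the bound needs no separability
y = np.where(X[:, 0] + 0.5 > 0, 1.0, -1.0)          # labels in {-1, +1}
lam, eta = 0.1, 0.5                                 # satisfies eta * lam < 1
bound = np.linalg.norm(X, axis=1).sum() / (lam * len(y))

w = np.zeros(X.shape[1])                            # ||w_0|| = 0, below the bound
for _ in range(1000):
    active = y * (X @ w) <= 1
    w = (1 - lam * eta) * w + eta / len(y) * (y[active, None] * X[active]).sum(axis=0)
    assert np.linalg.norm(w) <= bound + 1e-9        # Lemma 1: iterate norms never exceed the bound
```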

1.2 Proof of Lemma 2

Lemma 2 uses Theorem 2 to derive bounds for the angle and margin gaps.

Proof

To derive a convergence rate for the angle gap, we use the decomposition

$$\begin{aligned} \Vert \varvec{w}_k - \varvec{w}^*\Vert ^2&= \Vert \varvec{w}_k\Vert ^2 + \Vert \varvec{w}^*\Vert ^2 - 2\varvec{w}_k^\top \varvec{w}^*\\&= \left( \Vert \varvec{w}_k\Vert - \Vert \varvec{w}^*\Vert \right) ^2 +2\Vert \varvec{w}_k\Vert \Vert \varvec{w}^*\Vert - 2\varvec{w}_k^\top \varvec{w}^*. \end{aligned}$$

Dividing by \(2\Vert \varvec{w}_k\Vert \Vert \varvec{w}^*\Vert \),

$$\begin{aligned} 1 - \frac{\varvec{w}_k^\top \varvec{w}^*}{\Vert \varvec{w}_k\Vert \Vert \varvec{w}^*\Vert }&= \frac{\Vert \varvec{w}_k - \varvec{w}^*\Vert ^2 - \left( \Vert \varvec{w}_k\Vert - \Vert \varvec{w}^*\Vert \right) ^2 }{2\Vert \varvec{w}_k\Vert \Vert \varvec{w}^*\Vert }\\&\le \frac{\Vert \varvec{w}_k - \varvec{w}^*\Vert ^2 }{2\Vert \varvec{w}_k\Vert \Vert \varvec{w}^*\Vert }. \end{aligned}$$

Note that \(\Vert \varvec{w}^*\Vert \) is necessarily bounded away from 0, since \(y_i\varvec{x}_i^\top \varvec{w}^*\ge 1\) for all i. We can also bound \(\Vert \varvec{w}_k\Vert \) away from 0 for k sufficiently large using the convergence of \(\varvec{w}_k\) to \(\varvec{w}^*\) guaranteed by Theorem 2. Let

$$\begin{aligned} c = \min \left( \frac{(r-2p)(1-\epsilon _0)}{2(r+1)(1+\epsilon _0)}, \frac{(1 - p)(1+\epsilon _0)}{r+1} , \frac{p}{r+1} \right) , \end{aligned}$$

be the exponent in the convergence rate of \(\Vert \varvec{w}_k-\varvec{w}^*\Vert \), and p, r, and \(\epsilon _0\) be defined as in Theorem 2. Since

$$\begin{aligned} (\Vert \varvec{w}_k\Vert - \Vert \varvec{w}^*\Vert )^2 \le \Vert \varvec{w}_k - \varvec{w}^*\Vert ^2 \le Ak^{-2c} \end{aligned}$$

for constants \(A, c >0\) by Theorem 2, it follows that \(\Vert \varvec{w}_k\Vert \ge \Vert \varvec{w}^*\Vert - \sqrt{A}\,k^{-c}.\) Thus for k sufficiently large, \(\Vert \varvec{w}_k\Vert \) is bounded away from 0 and we have

$$\begin{aligned} 1 - \frac{\varvec{w}_k^\top \varvec{w}^*}{\Vert \varvec{w}_k\Vert \Vert \varvec{w}^*\Vert } = O\left( k^{-2c}\right) . \end{aligned}$$
(17)

We now consider the margin bound. Let \(j = {{\,\mathrm{arg min}\,}}_{i=1,\ldots , n} \frac{y_i \varvec{x}_i^\top \varvec{w}_{k}}{\Vert \varvec{w}_{k}\Vert }\). Since \(y_i \varvec{x}_i^\top \varvec{w}^*\ge 1\) for all \(i = 1,\ldots , n\), we have that

$$\begin{aligned} 0&\le \frac{1}{\Vert \varvec{w}^*\Vert } - \frac{y_j \varvec{x}_j^\top \varvec{w}_{k}}{\Vert \varvec{w}_{k}\Vert } \le \frac{y_j \varvec{x}_j^\top \varvec{w}^*}{\Vert \varvec{w}^*\Vert } - \frac{y_j \varvec{x}_j^\top \varvec{w}_{k}}{\Vert \varvec{w}_{k}\Vert } \\&=y_j \varvec{x}_j^\top \left( \frac{ \varvec{w}^*}{\Vert \varvec{w}^*\Vert } - \frac{ \varvec{w}_{k}}{\Vert \varvec{w}_{k}\Vert } \right) \le \Vert \varvec{x}_j\Vert \bigg \Vert \frac{ \varvec{w}^*}{\Vert \varvec{w}^*\Vert } - \frac{ \varvec{w}_{k}}{\Vert \varvec{w}_{k}\Vert } \bigg \Vert . \end{aligned}$$

Note that

$$\begin{aligned} \bigg \Vert \frac{ \varvec{w}^*}{\Vert \varvec{w}^*\Vert } - \frac{ \varvec{w}_{k}}{\Vert \varvec{w}_{k}\Vert } \bigg \Vert ^2 = 2 \left( 1 - \frac{\varvec{w}_k^\top \varvec{w}^*}{\Vert \varvec{w}_k\Vert \Vert \varvec{w}^*\Vert }\right) . \end{aligned}$$

Assuming the data are finite and linearly separable (so that \(\Vert \varvec{x}_j\Vert \) is bounded), Eq. (17) then gives

$$\begin{aligned} \frac{1}{\Vert \varvec{w}^*\Vert } - \min _i \frac{y_i \varvec{x}_i^\top \varvec{w}_{k}}{\Vert \varvec{w}_{k}\Vert } = O\left( k^{-c}\right) . \end{aligned}$$

\(\square \)
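For reference, the two quantities bounded in Lemma 2 are straightforward to evaluate; a small helper is sketched below (the function name is ours, and in an experiment \(\varvec{w}^*\) would be the known maximal-margin solution of a toy problem):

```python
import numpy as np

def angle_and_margin_gaps(w_k, w_star, X, y):
    """Return the angle gap 1 - cos(w_k, w*) and the margin gap
    1/||w*|| - min_i y_i x_i^T w_k / ||w_k||, as bounded in Lemma 2."""
    nk, ns = np.linalg.norm(w_k), np.linalg.norm(w_star)
    angle_gap = 1.0 - (w_k @ w_star) / (nk * ns)
    margin_gap = 1.0 / ns - np.min(y * (X @ w_k)) / nk
    return angle_gap, margin_gap
```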

1.3 Proof of Lemma 3

Lemma 3 provides a modified convergence guarantee for the averaged subgradient method applied to the functions \(F_\lambda \) [4].

Proof

Let \(F_\lambda \) be a strongly convex function with strong convexity parameter \(\lambda \) and Lipschitz constant L on the bounded domain considered. Let \(\varvec{w}_0\) be an initial iterate and \(\varvec{w}_{\lambda }^*\) be the minimizer of \(F_\lambda \). Suppose \(\Vert \varvec{w}_0 - \varvec{w}_{\lambda }^*\Vert \le R\), so that \(\varvec{w}_{\lambda }^*\) is contained in a ball of radius R and center \(\varvec{w}_0\). Let \(\overline{\varvec{w}}= \frac{1}{t} \sum _{i=1}^t \varvec{w}_i\) be the average of t subgradient descent iterates with initial iterate \(\varvec{w}_0\) and step size \(\eta = \frac{R}{L\sqrt{t}}\). We aim to show that

$$\begin{aligned} 0\le F_{\lambda } (\overline{\varvec{w}}) - F_{\lambda } (\varvec{w}_{\lambda }^*) \le \frac{R L}{\sqrt{t}} - \frac{\lambda }{2}\Vert \overline{\varvec{w}}- \varvec{w}_{\lambda }^*\Vert ^2. \end{aligned}$$

The following proof relies heavily on Theorem 3.2 of [4] (see also [5]).

Since \(\varvec{w}_{\lambda }^*\) is the minimizer of \(F_\lambda \), the inequality

$$\begin{aligned} F_{\lambda } (\overline{\varvec{w}}) - F_{\lambda } (\varvec{w}_{\lambda }^*)\ge 0 \end{aligned}$$

is immediate. Let \(g(\varvec{w}) = F_\lambda (\varvec{w}) - \frac{\lambda }{2}||\varvec{w}||^2\). Since \(g(\varvec{w})\) is convex,

$$\begin{aligned} g(\overline{\varvec{w}}) \le \frac{1}{t}\sum _{i=1}^t g(\varvec{w}_{i}) \end{aligned}$$

and thus

$$\begin{aligned} F_\lambda&(\overline{\varvec{w}}) - \frac{\lambda }{2}||\overline{\varvec{w}}||^2 \le \frac{1}{t} \sum _{i=1}^t \left( F_\lambda (\varvec{w}_{i}) -\frac{\lambda }{2}||\varvec{w}_{i}||^2\right) . \end{aligned}$$

Reorganizing and subtracting \(F_\lambda (\varvec{w}_{\lambda }^*)\),

$$\begin{aligned} F_\lambda&(\overline{\varvec{w}})-F_\lambda (\varvec{w}_{\lambda }^*) \nonumber \\&\le \frac{1}{t} \sum _{i=1}^t \bigg ( F_\lambda (\varvec{w}_{i}) - F_\lambda (\varvec{w}_{\lambda }^*) -\frac{\lambda }{2}\left( ||\varvec{w}_{i}||^2 -||\overline{\varvec{w}}||^2\right) \bigg ). \end{aligned}$$
(18)

Using the strong convexity of \(F_\lambda \) and the proof of Theorem 3.2 of [4],

$$\begin{aligned} F_\lambda&(\varvec{w}_i)-F_\lambda (\varvec{w}_{\lambda }^*) \\&\le \partial F_\lambda (\varvec{w}_{i})^\top (\varvec{w}_{i} - \varvec{w}_{\lambda }^*) - \frac{\lambda }{2}\Vert \varvec{w}_{i} - \varvec{w}_{\lambda }^*\Vert ^2\\&= \frac{1}{2\eta }\left( \Vert {\varvec{w}_i - \varvec{w}_{\lambda }^*} \Vert ^2 - \Vert {\varvec{w}_{i+1}-\varvec{w}_{\lambda }^*} \Vert ^2\right) + \frac{\eta }{2}\Vert {\partial F_\lambda (\varvec{w}_{i})} \Vert ^2- \frac{\lambda }{2}\Vert \varvec{w}_{i} - \varvec{w}_{\lambda }^*\Vert ^2\\&\le \frac{1}{2\eta }\left( \Vert {\varvec{w}_i - \varvec{w}_{\lambda }^*} \Vert ^2 - \Vert {\varvec{w}_{i+1}-\varvec{w}_{\lambda }^*} \Vert ^2\right) + \frac{\eta L^2}{2}- \frac{\lambda }{2}\Vert \varvec{w}_{i} - \varvec{w}_{\lambda }^*\Vert ^2. \end{aligned}$$

Making this substitution into Eq. (18),

$$\begin{aligned} F_\lambda&(\overline{\varvec{w}})-F_\lambda (\varvec{w}_{\lambda }^*) \\&\le \frac{1}{2 t\eta }\left( \Vert {\varvec{w}_1 - \varvec{w}_{\lambda }^*} \Vert ^2 - \Vert {\varvec{w}_{t+1}-\varvec{w}_{\lambda }^*} \Vert ^2\right) + \frac{\eta L^2}{2} \\&\quad - \frac{\lambda }{2t} \sum _{i=1}^t\bigg (\Vert \varvec{w}_{i} - \varvec{w}_{\lambda }^*\Vert ^2 + ||\varvec{w}_{i}||^2-||\overline{\varvec{w}}||^2 \bigg )\\&\le \frac{R^2}{2 t\eta } + \frac{\eta L^2}{2} - \frac{\lambda }{2t} \sum _{i=1}^t\bigg (\Vert \varvec{w}_{i} - \varvec{w}_{\lambda }^*\Vert ^2 + ||\varvec{w}_{i}||^2-||\overline{\varvec{w}}||^2 \bigg )\\&\le \frac{RL}{\sqrt{t}} - \frac{\lambda }{2t} \sum _{i=1}^t\left( ||\varvec{w}_{i}||^2-||\overline{\varvec{w}}||^2 + \Vert \varvec{w}_{i} - \varvec{w}_{\lambda }^*\Vert ^2\right) . \end{aligned}$$

Decomposing the sum,

$$\begin{aligned} \frac{1}{t}\sum _{i=1}^t&\Vert \varvec{w}_{i} - \varvec{w}_{\lambda }^*\Vert ^2 = \frac{1}{t}\sum _{i=1}^t \left( ||\varvec{w}_{i}||^2 - 2 \varvec{w}_{i}^\top \varvec{w}_{\lambda }^*+ ||\varvec{w}_{\lambda }^*||^2\right) \\&= \frac{1}{t}\sum _{i=1}^t \left( ||\varvec{w}_{i}||^2\right) - 2 \overline{\varvec{w}}^\top \varvec{w}_{\lambda }^*+ ||\varvec{w}_{\lambda }^*||^2\\&= \frac{1}{t}\sum _{i=1}^t \left( ||\varvec{w}_{i}||^2\right) -||\overline{\varvec{w}}||^2 +||\overline{\varvec{w}}||^2 - 2 \overline{\varvec{w}}^\top \varvec{w}_{\lambda }^*+ ||\varvec{w}_{\lambda }^*||^2\\&= \frac{1}{t}\sum _{i=1}^t \left( ||\varvec{w}_{i}||^2 -||\overline{\varvec{w}}||^2\right) +||\overline{\varvec{w}}-\varvec{w}_{\lambda }^*||^2. \end{aligned}$$

Making this substitution,

$$\begin{aligned} F_\lambda&(\overline{\varvec{w}})-F_\lambda (\varvec{w}_{\lambda }^*) \\&\le \frac{RL}{\sqrt{t}} - \frac{\lambda }{t} \sum _{i=1}^t\left( ||\varvec{w}_{i}||^2-||\overline{\varvec{w}}||^2 \right) - \frac{\lambda }{2}||\overline{\varvec{w}}-\varvec{w}_{\lambda }^*||^2. \end{aligned}$$

Since \(||\varvec{w}||^2\) is convex, Jensen's inequality gives \(\frac{\lambda }{t} \sum _{i=1}^t \left( ||\varvec{w}_{i}||^2-||\overline{\varvec{w}}||^2 \right) \ge 0\), and

$$\begin{aligned}F_\lambda (\overline{\varvec{w}})-F_\lambda (\varvec{w}_{\lambda }^*)&\le \frac{RL}{\sqrt{t}} - \frac{\lambda }{2}||\overline{\varvec{w}}-\varvec{w}_{\lambda }^*||^2 \end{aligned}$$

as desired. \(\square \)
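For concreteness, here is a sketch of the averaged subgradient routine that Lemma 3 analyzes, with step size \(\eta = R/(L\sqrt{t})\); the Lipschitz estimate below is a crude illustrative choice rather than the constant L of the lemma:

```python
import numpy as np

def averaged_subgradient(X, y, lam, t, R, w0):
    """Run t subgradient steps on F_lambda starting from w0 with
    eta = R / (L sqrt(t)) and return the average of the iterates."""
    n = X.shape[0]
    L_lip = lam * (np.linalg.norm(w0) + R) + np.linalg.norm(X, axis=1).sum() / n  # rough Lipschitz bound (assumption)
    eta = R / (L_lip * np.sqrt(t))
    w, w_bar = w0.copy(), np.zeros_like(w0)
    for _ in range(t):
        active = y * (X @ w) <= 1
        sub = lam * w - (y[active, None] * X[active]).sum(axis=0) / n   # a subgradient of F_lambda at w
        w = w - eta * sub
        w_bar += w / t
    return w_bar
```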

1.4 Proof of Lemma 4

We now prove Lemma 4, which bounds the distance between minimizers of \(F_\lambda \) for different regularization parameters \(\lambda \).

Proof

Let \(\varvec{w}_{\lambda }^*\) minimize \(F_\lambda \) as given in Eq. (2). Let \(\lambda ' >0\) be such that \(\varvec{w}_{\lambda }^*= \varvec{w}^*\) for all \(\lambda \le \lambda '\). For \(\lambda ,\widetilde{\lambda }> 0\) and data satisfying Assumption 1, we aim to show that

$$\begin{aligned} \Vert \varvec{w}_{\lambda }^*- \varvec{w}_{\widetilde{\lambda }}^*\Vert \le \frac{L}{2} \bigg |\frac{1}{\lambda } - \frac{1}{\tilde{\lambda }}\bigg | \end{aligned}$$

and

$$\begin{aligned} \Vert \varvec{w}_{\lambda }^*- \varvec{w}_{\widetilde{\lambda }}^*\Vert \le \frac{L}{2\left( \lambda '\right) ^2} |\lambda - \widetilde{\lambda }|, \end{aligned}$$

where \(L = \tfrac{2}{n}\sum _{j=1}^n\Vert \varvec{x}_j\Vert \).

The proof of Lemma 4 makes use of Lemma 8 of [15], which is also stated below.

Lemma 6

(Perturbation of strongly convex functions I [15]). Let \(f(\varvec{z})\) be a non-negative, \(\alpha ^2\)-strongly convex function. Let \(g(\varvec{z})\) be an L-Lipschitz non-negative convex function. For any \(\beta \ge 0\), let \(\varvec{z}[\beta ]\) be the minimizer of \(f(\varvec{z}) + \beta g(\varvec{z})\). Then we have

$$\begin{aligned} \bigg \Vert \frac{d\varvec{z}[\beta ]}{d\beta } \bigg \Vert \le \frac{L}{\alpha ^2}. \end{aligned}$$

Let \(f(\varvec{w})= \Vert \varvec{w}\Vert ^2\) and \(g(\varvec{w})=\frac{1}{n} \sum _{j=1}^n \max \{0, 1 - y_j \varvec{x}_j^\top \varvec{w}\}\). Then f is strongly convex with strong convexity parameter 2 and g is Lipschitz with a Lipschitz constant bounded by \(\tfrac{1}{n}\sum _{j=1}^n\Vert \varvec{x}_j\Vert \). Note that

$$\begin{aligned} F_\lambda (\varvec{w})&= \frac{\lambda }{2} f(\varvec{w}) + g(\varvec{w}) = \frac{\lambda }{2} \left[ f(\varvec{w}) + \frac{2}{\lambda }g(\varvec{w}) \right] \\&= \frac{\lambda }{2} \left[ f(\varvec{w}) + \beta (\lambda )g(\varvec{w}) \right] \end{aligned}$$

for \(\beta (\lambda ) = \frac{2}{\lambda }\). Applying Lemma 8 of [15],

$$\begin{aligned} \bigg \Vert \frac{d\varvec{w}[\lambda ]}{d\lambda } \bigg \Vert&= \bigg \Vert \frac{d\varvec{w}[\lambda ]}{d\beta (\lambda )} \cdot \frac{d\beta (\lambda )}{d\lambda }\bigg \Vert \\&\le \frac{1}{2n}\sum _{j=1}^n\Vert \varvec{x}_j\Vert \cdot |\beta '(\lambda )| = \frac{\tfrac{1}{n}\sum _{j=1}^n\Vert \varvec{x}_j\Vert }{\lambda ^2} = \frac{L}{2\lambda ^2}. \end{aligned}$$

Integrating, for any \(\tilde{\lambda }\ge {{\hat{\lambda }}} >0\), we have

$$\begin{aligned} \Vert {\varvec{w}_{\tilde{\lambda }}^* - \varvec{w}_{{{\hat{\lambda }}}}^*} \Vert&= \bigg \Vert \int _{{\hat{\lambda }}}^{\tilde{\lambda }} \frac{d\varvec{w}[\lambda ]}{d\lambda } d\lambda \bigg \Vert \le \int _{{\hat{\lambda }}}^{\tilde{\lambda }} \bigg \Vert \frac{d\varvec{w}[\lambda ]}{d\lambda } \bigg \Vert d\lambda \le \int _{\hat{\lambda }}^{\tilde{\lambda }} \frac{L}{2\lambda ^2} d\lambda = \frac{L}{2}\left| \frac{1}{\tilde{\lambda }} - \frac{1}{\hat{\lambda }}\right| . \end{aligned}$$

As the regularization parameter \(\lambda \) approaches zero, we use the following bound instead. Since \(\varvec{w}[\lambda ] = \varvec{w}[\lambda '] = \varvec{w}^*\) for all \(\lambda \le \lambda '\), we have \(\big \Vert \frac{d\varvec{w}[\lambda ]}{d\lambda } \big \Vert =0\) for \(\lambda < \lambda '\), while for \(\lambda \ge \lambda '\) the bound above gives \(\big \Vert \frac{d\varvec{w}[\lambda ]}{d\lambda } \big \Vert \le \frac{L}{2\lambda ^2} \le \frac{L}{2\left( \lambda '\right) ^2}\). Thus

$$\begin{aligned} \bigg \Vert \frac{d\varvec{w}[\lambda ]}{d\lambda } \bigg \Vert \le \frac{L}{2\left( \lambda '\right) ^2} \quad \forall \; \lambda > 0. \end{aligned}$$

This gives the second bound,

$$\begin{aligned} \Vert \varvec{w}_{\widetilde{\lambda }}^*- \varvec{w}_{{\hat{\lambda }}}^*\Vert \le \int _{ {\hat{\lambda }} }^{\tilde{\lambda }} \bigg \Vert \frac{d\varvec{w}[\lambda ]}{d\lambda } \bigg \Vert d\lambda \le \int _{ {\hat{\lambda }} }^{\tilde{\lambda }} \frac{L}{2\left( \lambda '\right) ^2} d\lambda = \frac{L}{2\left( \lambda '\right) ^2} | \widetilde{\lambda }-\hat{\lambda }|. \end{aligned}$$

\(\square \)
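The first bound is easy to probe numerically: approximate \(\varvec{w}_{\lambda }^*\) for two values of \(\lambda \) with a standard strongly convex subgradient scheme and compare the distance between the approximate minimizers to \(\frac{L}{2}\big |\frac{1}{\lambda } - \frac{1}{\tilde{\lambda }}\big |\). The data, step-size rule, and iteration counts below are ad hoc choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = np.where(X @ np.array([1.0, -1.0, 0.5, 0.0]) > 0, 1.0, -1.0)   # separable toy labels
L = 2.0 / len(y) * np.linalg.norm(X, axis=1).sum()

def approx_minimizer(lam, steps=50_000):
    """Approximate w_lambda^* with decaying steps 2/(lam*(k+2)) and
    averaging over the second half of the iterates (illustrative scheme)."""
    w = np.zeros(X.shape[1])
    w_bar = np.zeros(X.shape[1])
    for k in range(steps):
        active = y * (X @ w) <= 1
        sub = lam * w - (y[active, None] * X[active]).sum(axis=0) / len(y)
        w = w - 2.0 / (lam * (k + 2)) * sub
        if k >= steps // 2:
            w_bar += w / (steps - steps // 2)
    return w_bar

lam_a, lam_b = 0.5, 0.25
gap = np.linalg.norm(approx_minimizer(lam_a) - approx_minimizer(lam_b))
print(gap, "<=", L / 2 * abs(1 / lam_a - 1 / lam_b))   # observed distance vs. the Lemma 4 bound
```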

1.5 Proof of Lemma 5

We finally prove Lemma 5, which makes use of Lemmas 3 and 4 to bound the initial error \(\Vert {\overline{\varvec{w}}_s- \varvec{w}_{\lambda _s}^*} \Vert \) of each regularized subproblem given in Eq. (4).

Proof

We aim to show \(\Vert { \overline{\varvec{w}}_s- \varvec{w}_{\lambda _{s}}^*} \Vert \le R_s\) with \(R_s\) defined below and proceed by induction. For \(s_0 \in {\mathbb {N}}\) with \(s_0>1\), \(p\in (0,1)\), and \(r >2p\), let \(\lambda _s = (s_0+s)^{-p}\), \(t_s = (s_0+s)^{r}\). Recall that \(L = \tfrac{2}{n} \sum _{j=1}^n\Vert {\varvec{x}_j} \Vert \). For some parameter \(\alpha > 0\), let

$$\begin{aligned} R_s = CL(s_0+s-1)^{-\alpha }\text { with }C = \max \left\{ 4, \frac{1}{2\lambda _0 }(s_0-1)^{\alpha }\right\} . \end{aligned}$$

By Lemma 1, and since \(\overline{\varvec{w}}_0 = \mathbf {0}\), we have \(\Vert {\overline{\varvec{w}}_0 - \varvec{w}_{\lambda _0}^*} \Vert \le \frac{L}{2\lambda _0}\). Note that \(R_0 \ge \frac{L}{2\lambda _0}\) and thus the base case, \( \Vert {\overline{\varvec{w}}_0 - \varvec{w}_{\lambda _0}^*} \Vert \le R_0, \) is satisfied.

Suppose that \(\Vert { \overline{\varvec{w}}_{s-1} - \varvec{w}_{\lambda _{s-1}}^*} \Vert \le R_{s-1}\). Since the base case of \(s=0\) has been established, we now consider \(s\ge 1\). By the triangle inequality,

$$\begin{aligned} \Vert { \overline{\varvec{w}}_s- \varvec{w}_{\lambda _{s}}^*} \Vert \le \Vert { \overline{\varvec{w}}_s- \varvec{w}_{\lambda _{s-1}}^* } \Vert +\Vert {\varvec{w}_{\lambda _{s-1}}^*-\varvec{w}_{\lambda _{s}}^*} \Vert . \end{aligned}$$

For \(\overline{\varvec{w}}_s\) generated as in Algorithm 1, Lemma 3 along with the inductive assumption gives that

$$\begin{aligned} \Vert {\overline{\varvec{w}}_s- \varvec{w}_{\lambda _{s-1}}^*} \Vert \le \left( \frac{2 R_{s-1} L}{\lambda _{s-1} \sqrt{t_{s-1}}}\right) ^{1/2} =\frac{ \sqrt{2C} L (s_0+s-2)^{-\alpha /2}}{(s_0+s-1)^{r/4-p/2}}. \end{aligned}$$

From Eq. (11) of Lemma 4,

$$\begin{aligned} \Vert \varvec{w}_{\lambda _{s-1}}^*-\varvec{w}_{\lambda _{s}}^*\Vert&\le \tfrac{L}{2}\left( \tfrac{1}{\lambda _{s}}-\tfrac{1}{\lambda _{s-1}}\right) \\&=\tfrac{L}{2}\left( (s_0+s)^p - (s_0+s-1)^p \right) \\&\le \tfrac{Lp}{2}(s_0+s-1)^{p-1}. \end{aligned}$$

Combining these

$$\begin{aligned} \Vert { \overline{\varvec{w}}_s- \varvec{w}_{\lambda _{s}}^*} \Vert&\le \frac{\sqrt{2C} L (s_0+s-2)^{-\alpha /2}}{(s_0+s-1)^{r/4-p/2}} + \frac{L}{2}p(s_0+s-1)^{p-1}. \end{aligned}$$

We apply a change of base to replace \((s_0+s-2)\) with \((s_0+s-1)^{(1-\epsilon )}\). Solving for \(\epsilon \), we find \(\epsilon \ge \frac{ \log (s_0 + s -1 ) - \log (s_0 + s-2)}{\log (s_0+s - 1)}\). Note that this change of base requires that \(s_0 >1 \), since we are considering \(s \ge 1\). Applying this change of base,

$$\begin{aligned} \Vert { \overline{\varvec{w}}_s- \varvec{w}_{\lambda _{s}}^*} \Vert&\le \sqrt{2C} L(s_0+s-1)^{p/2-\alpha /2(1-\epsilon )-r/4} + \frac{Lp}{2}(s_0+s-1)^{p-1}. \end{aligned}$$

To simplify the analysis and remove the dependence of \(\epsilon \) on the iteration number s, we use \(\epsilon _0 = \frac{ \log (s_0 ) - \log (s_0-1)}{\log (s_0)}\). Note that \(\epsilon _0 \ge \epsilon \) for \(s\ge 1\) and is defined for \(s_0 >1\). Now, for

$$\begin{aligned} 0\le \alpha \le \min \left( \frac{r-2p}{2(1+\epsilon _0)}, 1 - p \right) \end{aligned}$$

and \(p<1\), we have

$$\begin{aligned} \Vert { \overline{\varvec{w}}_s- \varvec{w}_{\lambda _{s}}^*} \Vert \le L\left( \sqrt{2C} +\tfrac{p}{2} \right) (s_0+s-1)^{-\alpha } \le CL(s_0+s-1)^{-\alpha } = R_{s}. \end{aligned}$$

Note that allowing the first term in the upper bound on \(\alpha \) to increase with s leads to smaller bounds \(R_s\). This choice, however, complicates the analysis. \(\square \)
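The schedule appearing throughout this proof can be tabulated explicitly; below is a small helper that computes \(\lambda _s\), \(t_s\), and the guaranteed radii \(R_s\) from the definitions above (the particular values of \(s_0\), p, and r are illustrative and must satisfy \(s_0>1\), \(p\in (0,1)\), \(r>2p\)):

```python
import numpy as np

def homotopy_schedule(L, s0=4, p=0.4, r=1.5, stages=10):
    """Tabulate lambda_s = (s0+s)^(-p), t_s = (s0+s)^r, and
    R_s = C L (s0+s-1)^(-alpha) as defined in the proof of Lemma 5."""
    eps0 = (np.log(s0) - np.log(s0 - 1)) / np.log(s0)
    alpha = min((r - 2 * p) / (2 * (1 + eps0)), 1 - p)
    lam0 = s0 ** (-p)
    C = max(4.0, (s0 - 1) ** alpha / (2 * lam0))
    schedule = []
    for s in range(stages):
        lam_s = (s0 + s) ** (-p)                 # regularization parameter of stage s
        t_s = int((s0 + s) ** r)                 # inner averaged-subgradient steps
        R_s = C * L * (s0 + s - 1) ** (-alpha)   # guaranteed bound on ||w_bar_s - w_{lambda_s}^*||
        schedule.append((s, lam_s, t_s, R_s))
    return schedule
```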

Cite this article

Molitor, D., Needell, D. & Ward, R. Bias of Homotopic Gradient Descent for the Hinge Loss. Appl Math Optim 84, 621–647 (2021). https://doi.org/10.1007/s00245-020-09656-5
