Bias of Homotopic Gradient Descent for the Hinge Loss

Applied Mathematics & Optimization

Abstract

Gradient descent is a simple and widely used optimization method for machine learning. For homogeneous linear classifiers applied to separable data, gradient descent has been shown to converge to the maximal-margin (or equivalently, the minimal-norm) solution for various smooth loss functions. The previous theory does not, however, apply to the non-smooth hinge loss which is widely used in practice. Here, we study the convergence of a homotopic variant of gradient descent applied to the hinge loss and provide explicit convergence rates to the maximal-margin solution for linearly separable data.
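To make the setting concrete, the following is a minimal sketch of the kind of homotopic subgradient scheme analyzed in this paper, applied to the regularized hinge objective \(F_\lambda (\varvec{w}) = \frac{\lambda }{2}\Vert \varvec{w}\Vert ^2 + \frac{1}{n}\sum _{j=1}^n \max \{0, 1 - y_j\varvec{x}_j^\top \varvec{w}\}\) used in the appendix. The function names, default constants, and step-size choice below are illustrative assumptions, not the paper's exact Algorithm 1.

```python
import numpy as np

def hinge_subgradient_step(w, X, y, lam, eta):
    """One subgradient step on F_lambda(w) = (lam/2)||w||^2
    + (1/n) sum_j max(0, 1 - y_j x_j^T w); cf. the update in Eq. (16)."""
    active = y * (X @ w) <= 1                       # examples with margin at most 1
    pull = (y[active, None] * X[active]).sum(axis=0) / len(y)
    return (1 - lam * eta) * w + eta * pull

def homotopic_subgradient_descent(X, y, p=0.5, r=1.5, s0=2, stages=20):
    """Decrease lambda_s = (s0+s)^(-p) over stages, run t_s = (s0+s)^r averaged
    subgradient steps per stage, and warm-start each stage from the previous
    average (schedule constants here are illustrative)."""
    n, d = X.shape
    L = 2.0 / n * np.linalg.norm(X, axis=1).sum()   # the constant L used in the appendix
    w_bar = np.zeros(d)
    for s in range(stages):
        lam, t = (s0 + s) ** (-p), int((s0 + s) ** r)
        eta = 1.0 / (L * np.sqrt(t))                # step of order R/(L sqrt(t)), with R absorbed
        w, avg = w_bar.copy(), np.zeros(d)
        for _ in range(t):
            w = hinge_subgradient_step(w, X, y, lam, eta)
            avg += w / t
        w_bar = avg                                 # warm start for the next, smaller lambda
    return w_bar
```

On linearly separable data, the direction of the returned iterate should approach the maximal-margin separator as the number of stages grows, which is the behavior quantified by the results in this paper.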


References

  1. Bartlett, P., Shawe-Taylor, J.: Advances in Kernel Methods, Chap. Generalization Performance of Support Vector Machines and Other Pattern Classifiers, pp. 43–54. MIT Press, Cambridge (1999)

  2. Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Rev. 60(2), 223–311 (2018)

  3. Brutzkus, A., Globerson, A., Malach, E., Shalev-Shwartz, S.: SGD learns over-parameterized networks that provably generalize on linearly separable data. In: International Conference on Learning Representations, ICLR 2018. Vancouver, BC, Canada, April 30-May 3, 2018, Conference Track Proceedings (2018). https://openreview.net/forum?id=rJ33wwxRb

  4. Bubeck, S.: Convex optimization: algorithms and complexity. arXiv e-prints arXiv:1405.4980 (2014)

  5. Bubeck, S.: Convex optimization: algorithms and complexity. Found. Trends Mach. Learn. 8(3–4), 231–357 (2015). https://doi.org/10.1561/2200000050

  6. Chapelle, O.: Training a support vector machine in the primal. Neural Comput. 19, 1155–1178 (2007)

  7. Chaudhari, P., Choromanska, A., Soatto, S., LeCun, Y., Baldassi, C., Borgs, C., Chayes, J., Sagun, L., Zecchina, R.: Entropy-sgd: biasing gradient descent into wide valleys. In: International Conference on Learning Representations (2017)

  8. Combes, R.T.d., Pezeshki, M., Shabanian, S., Courville, A.C., Bengio, Y.: On the learning dynamics of deep neural networks. CoRR arXiv:1809.06848 (2018)

  9. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)

  10. Gunasekar, S., Lee, J., Soudry, D., Srebro, N.: Characterizing implicit bias in terms of optimization geometry. In: Dy, J., Krause, A. (eds.) Proceedings of the 35th International Conference on Machine Learning, vol. 80, pp. 1832–1841. PMLR, Stockholmsmässan, Stockholm Sweden (2018). http://proceedings.mlr.press/v80/gunasekar18a.html

  11. Hardt, M., Recht, B., Singer, Y.: Train faster, generalize better: Stability of stochastic gradient descent. In: International Conference on Machine Learning, ICML’16, pp. 1225–1234. JMLR.org (2016). http://dl.acm.org/citation.cfm?id=3045390.3045520

  12. Hastie, T., Rosset, S., Tibshirani, R., Zhu, J.: The entire regularization path for the support vector machine. J. Mach. Learn. Res. 5(Oct), 1391–1415 (2004)

  13. Hoffer, E., Hubara, I., Soudry, D.: Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In: Advances in Neural Information Processing Systems, pp. 1731–1741 (2017)

  14. Lacoste-Julien, S., Schmidt, M., Bach, F.: A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. arXiv preprint arXiv:1212.2002 (2012)

  15. Li, Y., Singer, Y.: The well tempered lasso. arXiv preprint arXiv:1806.03190 (2018)

  16. Nacson, M.S., Lee, J., Gunasekar, S., Savarese, P., Srebro, N., Soudry, D.: Convergence of gradient descent on separable data. In: International Conference on Artificial Intelligence and Statistics (2019)

  17. Nacson, M.S., Srebro, N., Soudry, D.: Stochastic gradient descent on separable data: exact convergence with a fixed learning rate. In: Proceedings of Machine Learning Research, vol. 89, pp. 3051–3059. PMLR (2019). http://proceedings.mlr.press/v89/nacson19a.html

  18. Neyshabur, B., Tomioka, R., Srebro, N.: In search of the real inductive bias: on the role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614 (2014)

  19. Poggio, T., Kawaguchi, K., Liao, Q., Miranda, B., Rosasco, L., Boix, X., Hidary, J., Mhaskar, H.: Theory of deep learning III: explaining the non-overfitting puzzle. arXiv preprint arXiv:1801.00173 (2017)

  20. Poggio, T., Liao, Q., Miranda, B., Banburski, A., Boix, X., Hidary, J.: Theory IIIb: generalization in deep networks. arXiv preprint arXiv:1806.11379 (2018)

  21. Ramdas, A., Peña, J.: Towards a deeper geometric, analytic and algorithmic understanding of margins. Optim. Methods Softw. 31(2), 377–391 (2016)

  22. Rosset, S., Zhu, J., Hastie, T.J.: Margin maximizing loss functions. In: Advances in Neural Information Processing Systems, pp. 1237–1244 (2004)

  23. Soudry, D., Hoffer, E., Nacson, M.S., Gunasekar, S., Srebro, N.: The implicit bias of gradient descent on separable data. J. Mach. Learn. Res. 19(1), 2822–2878 (2018)

  24. Vapnik, V.: Estimation of Dependences Based on Empirical Data. Springer, Berlin (1982)

  25. Vapnik, V.: The Nature of Statistical Learning Theory. Springer, New York (2013)

  26. Vapnik, V.N.: An overview of statistical learning theory. IEEE Trans. Neural Netw. 10(5), 988–999 (1999)

  27. Vapnik, V.N., Chervonenkis, A.J.: Theory of Pattern Recognition. Nauka, Moscow (1974)

  28. Wang, G., Giannakis, G.B., Chen, J.: Learning relu networks on linearly separable data: algorithm, optimality, and generalization. IEEE Trans. Signal Process. 67(9), 2357–2370 (2019). https://doi.org/10.1109/TSP.2019.2904921

  29. Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O.: Understanding deep learning requires rethinking generalization. In: International Conference on Machine Learning (2017). arXiv:1611.03530

Author information

Corresponding author

Correspondence to Denali Molitor.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

D. Molitor and D. Needell are grateful to and were partially supported by NSF CAREER DMS #1348721 and NSF BIGDATA DMS #1740325. R. Ward was supported in part by AFOSR MURI Award N00014-17-S-F006.

Appendix A: Lemma Proofs

We now present proofs for the lemmas of Sects. 2, 3 and 4.

We first prove Lemma 1, which gives a bound on the norm of the iterates produced by the subgradient method applied to Eq. (2).

1.1 Proof of Lemma 1

Proof

Consider the subgradient update for minimizing the function \(F_\lambda \) of Eq. (2)

$$\begin{aligned} \varvec{w}^{\prime }{} = (1-\lambda \eta ) \varvec{w}+ \frac{\eta }{n}\sum _{j : y_j\varvec{x}_j^\top \varvec{w}\le 1} y_j\varvec{x}_j \end{aligned}$$
(16)

with \(\eta \lambda < 1\). Suppose that the iterate \(\varvec{w}\) satisfies \(\Vert {\varvec{w}} \Vert \le \frac{1}{\lambda n} \sum _{j=1}^n\Vert {\varvec{x}_j} \Vert \). We aim to show that the next iterate \(\varvec{w}^{\prime }{}\) given by the subgradient update also satisfies \(\Vert {\varvec{w}^{\prime }{}} \Vert \le \frac{1}{\lambda n} \sum _{j=1}^n\Vert {\varvec{x}_j} \Vert .\) Taking the norm on both sides of Eq. (16),

$$\begin{aligned} \Vert {\varvec{w}^{\prime }{}} \Vert&= \bigg \Vert (1-\eta \lambda )\varvec{w}+ \frac{\eta }{n}\sum _{j : y_j\varvec{x}_j^\top \varvec{w}\le 1} y_j\varvec{x}_j \bigg \Vert \\&\le (1-\eta \lambda )\Vert {\varvec{w}} \Vert + \frac{\eta }{n} \bigg \Vert \sum _{j : y_j\varvec{x}_j^\top \varvec{w}\le 1} y_j\varvec{x}_j \bigg \Vert \\&\le (1-\eta \lambda )\frac{1}{\lambda n} \sum _{j=1}^n\Vert {\varvec{x}_j} \Vert + \frac{\eta }{n}\sum _{j=1}^n \Vert {\varvec{x}_j} \Vert \\&= \frac{1}{\lambda n} \sum _{j=1}^n\Vert {\varvec{x}_j} \Vert -\frac{\eta }{ n} \sum _{j=1}^n\Vert {\varvec{x}_j} \Vert + \frac{\eta }{n}\sum _{j=1}^n \Vert {\varvec{x}_j} \Vert \\&= \frac{1}{\lambda n} \sum _{j=1}^n\Vert {\varvec{x}_j} \Vert . \end{aligned}$$

Thus the norms of all iterates of the subgradient method applied to the function \(F_\lambda \) remain bounded by \(\frac{1}{\lambda n} \sum _{j=1}^n\Vert {\varvec{x}_j} \Vert \) if the initial iterate has norm at most \(\frac{1}{\lambda n}\sum _{j=1}^n\Vert {\varvec{x}_j} \Vert \). The norm of the minimizer \(\varvec{w}_{\lambda }^*\) of \(F_\lambda \) must also satisfy the bound \( \Vert {\varvec{w}_{\lambda }^*} \Vert \le \frac{1}{\lambda n}\sum _{j } \Vert {\varvec{x}_j} \Vert \) as \(0\in \partial F_\lambda (\varvec{w}_{\lambda }^*)\) and so

$$\begin{aligned} \lambda \Vert {\varvec{w}_{\lambda }^*} \Vert \le \frac{1}{n}\bigg |\bigg |\sum _{j : y_j\varvec{x}_j^\top \varvec{w}_{\lambda }^*\le 1} y_j\varvec{x}_j\bigg |\bigg |. \end{aligned}$$

\(\square \)
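As a quick numerical sanity check of this invariant, the following snippet runs the update of Eq. (16) on synthetic data with an arbitrary choice of \(\lambda \) and \(\eta \) satisfying \(\eta \lambda < 1\) (the data and constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))                        # toy data; the bound needs no separability
y = np.where(X[:, 0] + 0.5 > 0, 1.0, -1.0)          # labels in {-1, +1}
lam, eta = 0.1, 0.5                                 # satisfies eta * lam < 1
bound = np.linalg.norm(X, axis=1).sum() / (lam * len(y))

w = np.zeros(X.shape[1])                            # ||w_0|| = 0, below the bound
for _ in range(1000):
    active = y * (X @ w) <= 1
    w = (1 - lam * eta) * w + eta / len(y) * (y[active, None] * X[active]).sum(axis=0)
    assert np.linalg.norm(w) <= bound + 1e-9        # Lemma 1: iterate norms never exceed the bound
```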

1.2 Proof of Lemma 2

Lemma 2 uses Theorem 2 to derive bounds for the angle and margin gaps.

Proof

To derive a convergence rate for the angle gap, we use the decomposition

$$\begin{aligned} \Vert \varvec{w}_k - \varvec{w}^*\Vert ^2&= \Vert \varvec{w}_k\Vert ^2 + \Vert \varvec{w}^*\Vert ^2 - 2\varvec{w}_k^\top \varvec{w}^*\\&= \left( \Vert \varvec{w}_k\Vert - \Vert \varvec{w}^*\Vert \right) ^2 +2\Vert \varvec{w}_k\Vert \Vert \varvec{w}^*\Vert - 2\varvec{w}_k^\top \varvec{w}^*. \end{aligned}$$

Dividing by \(2\Vert \varvec{w}_k\Vert \Vert \varvec{w}^*\Vert \),

$$\begin{aligned} 1 - \frac{\varvec{w}_k^\top \varvec{w}^*}{\Vert \varvec{w}_k\Vert \Vert \varvec{w}^*\Vert }&= \frac{\Vert \varvec{w}_k - \varvec{w}^*\Vert ^2 - \left( \Vert \varvec{w}_k\Vert - \Vert \varvec{w}^*\Vert \right) ^2 }{2\Vert \varvec{w}_k\Vert \Vert \varvec{w}^*\Vert }\\&\le \frac{\Vert \varvec{w}_k - \varvec{w}^*\Vert ^2 }{2\Vert \varvec{w}_k\Vert \Vert \varvec{w}^*\Vert }. \end{aligned}$$

Note that \(\Vert \varvec{w}^*\Vert \) is necessarily bounded away from 0, since \(y_i\varvec{x}_i^\top \varvec{w}^*\ge 1\) for all i. We can also bound \(\Vert \varvec{w}_k\Vert \) away from 0 for k sufficiently large using the convergence of \(\varvec{w}_k\) to \(\varvec{w}^*\) guaranteed by Theorem 2. Let

$$\begin{aligned} c = \min \left( \frac{(r-2p)(1-\epsilon _0)}{2(r+1)(1+\epsilon _0)}, \frac{(1 - p)(1+\epsilon _0)}{r+1} , \frac{p}{r+1} \right) , \end{aligned}$$

be the exponent in the convergence rate of \(\Vert \varvec{w}_k-\varvec{w}^*\Vert \), and p, r, and \(\epsilon _0\) be defined as in Theorem 2. Since

$$\begin{aligned} (\Vert \varvec{w}_k\Vert - \Vert \varvec{w}^*\Vert )^2 \le \Vert \varvec{w}_k - \varvec{w}^*\Vert ^2 \le Ak^{-2c} \end{aligned}$$

for constants \(A, c >0\) by Theorem 2, it follows that \(\Vert \varvec{w}_k\Vert \ge \Vert \varvec{w}^*\Vert - \sqrt{A}\,k^{-c}.\) Thus for k sufficiently large, \(\Vert \varvec{w}_k\Vert \) is bounded away from 0 and we have

$$\begin{aligned} 1 - \frac{\varvec{w}_k^\top \varvec{w}^*}{\Vert \varvec{w}_k\Vert \Vert \varvec{w}^*\Vert } = O\left( k^{-2c}\right) . \end{aligned}$$
(17)

We now consider the margin bound. Let \(j = {{\,\mathrm{arg min}\,}}_{i=1,\ldots , n} \frac{y_i \varvec{x}_i^\top \varvec{w}_{k}}{\Vert \varvec{w}_{k}\Vert }\). Since \(y_i \varvec{x}_i^\top \varvec{w}^*\ge 1\) for all \(i = 1,\ldots , n\), we have that

$$\begin{aligned} 0&\le \frac{1}{\Vert \varvec{w}^*\Vert } - \frac{y_j \varvec{x}_j^\top \varvec{w}_{k}}{\Vert \varvec{w}_{k}\Vert } \le \frac{y_j \varvec{x}_j^\top \varvec{w}^*}{\Vert \varvec{w}^*\Vert } - \frac{y_j \varvec{x}_j^\top \varvec{w}_{k}}{\Vert \varvec{w}_{k}\Vert } \\&=y_j \varvec{x}_j^\top \left( \frac{ \varvec{w}^*}{\Vert \varvec{w}^*\Vert } - \frac{ \varvec{w}_{k}}{\Vert \varvec{w}_{k}\Vert } \right) \le \Vert \varvec{x}_j\Vert \bigg \Vert \frac{ \varvec{w}^*}{\Vert \varvec{w}^*\Vert } - \frac{ \varvec{w}_{k}}{\Vert \varvec{w}_{k}\Vert } \bigg \Vert . \end{aligned}$$

Note that

$$\begin{aligned} \bigg \Vert \frac{ \varvec{w}^*}{\Vert \varvec{w}^*\Vert } - \frac{ \varvec{w}_{k}}{\Vert \varvec{w}_{k}\Vert } \bigg \Vert ^2 = 2 \left( 1 - \frac{\varvec{w}_k^\top \varvec{w}^*}{\Vert \varvec{w}_k\Vert \Vert \varvec{w}^*\Vert }\right) . \end{aligned}$$

Assuming the data are finite and linearly separable (so that \(\Vert \varvec{x}_j\Vert \) is bounded), Eq. (17) then gives

$$\begin{aligned} \frac{1}{\Vert \varvec{w}^*\Vert } - \min _i \frac{y_i \varvec{x}_i^\top \varvec{w}_{k}}{\Vert \varvec{w}_{k}\Vert } = O\left( k^{-c}\right) . \end{aligned}$$

\(\square \)
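For reference, the two quantities bounded in Lemma 2 are straightforward to evaluate; a small helper is sketched below (the function name is ours, and in an experiment \(\varvec{w}^*\) would be the known maximal-margin solution of a toy problem):

```python
import numpy as np

def angle_and_margin_gaps(w_k, w_star, X, y):
    """Return the angle gap 1 - cos(w_k, w*) and the margin gap
    1/||w*|| - min_i y_i x_i^T w_k / ||w_k||, as bounded in Lemma 2."""
    nk, ns = np.linalg.norm(w_k), np.linalg.norm(w_star)
    angle_gap = 1.0 - (w_k @ w_star) / (nk * ns)
    margin_gap = 1.0 / ns - np.min(y * (X @ w_k)) / nk
    return angle_gap, margin_gap
```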

1.3 Proof of Lemma 3

Lemma 3 provides a modified convergence guarantee for the averaged subgradient method applied to the functions \(F_\lambda \) [4].

Proof

Let \(F_\lambda \) be a strongly convex function with strong convexity parameter \(\lambda \) and Lipschitz constant L on the bounded domain considered. Let \(\varvec{w}_0\) be an initial iterate and \(\varvec{w}_{\lambda }^*\) be the minimizer of \(F_\lambda \). Suppose \(\Vert \varvec{w}_0 - \varvec{w}_{\lambda }^*\Vert \le R\), so that \(\varvec{w}_{\lambda }^*\) is contained in a ball of radius R and center \(\varvec{w}_0\). Let \(\overline{\varvec{w}}= \frac{1}{t} \sum _{i=1}^t \varvec{w}_i\) be the average of t subgradient descent iterates with initial iterate \(\varvec{w}_0\) and step size \(\eta = \frac{R}{L\sqrt{t}}\). We aim to show that

$$\begin{aligned} 0\le F_{\lambda } (\overline{\varvec{w}}) - F_{\lambda } (\varvec{w}_{\lambda }^*) \le \frac{R L}{\sqrt{t}} - \frac{\lambda }{2}\Vert \overline{\varvec{w}}- \varvec{w}_{\lambda }^*\Vert ^2. \end{aligned}$$

The following proof relies heavily on Theorem 3.2 of [4] (see also [5]).

Since \(\varvec{w}_{\lambda }^*\) is the minimizer of \(F_\lambda \), the inequality

$$\begin{aligned} F_{\lambda } (\overline{\varvec{w}}) - F_{\lambda } (\varvec{w}_{\lambda }^*)\ge 0 \end{aligned}$$

is immediate. Let \(g(\varvec{w}) = F_\lambda (\varvec{w}) - \frac{\lambda }{2}||\varvec{w}||^2\). Since \(g(\varvec{w})\) is convex,

$$\begin{aligned} g(\overline{\varvec{w}}) \le \frac{1}{t}\sum _{i=1}^t g(\varvec{w}_{i}) \end{aligned}$$

and thus

$$\begin{aligned} F_\lambda&(\overline{\varvec{w}}) - \frac{\lambda }{2}||\overline{\varvec{w}}||^2 \le \frac{1}{t} \sum _{i=1}^t \left( F_\lambda (\varvec{w}_{i}) -\frac{\lambda }{2}||\varvec{w}_{i}||^2\right) . \end{aligned}$$

Reorganizing and subtracting \(F_\lambda (\varvec{w}_{\lambda }^*)\),

$$\begin{aligned} F_\lambda&(\overline{\varvec{w}})-F_\lambda (\varvec{w}_{\lambda }^*) \nonumber \\&\le \frac{1}{t} \sum _{i=1}^t \bigg ( F_\lambda (\varvec{w}_{i}) - F_\lambda (\varvec{w}_{\lambda }^*) -\frac{\lambda }{2}\left( ||\varvec{w}_{i}||^2 -||\overline{\varvec{w}}||^2\right) \bigg ). \end{aligned}$$
(18)

Using the strong convexity of \(F_\lambda \) and the proof of Theorem 3.2 of [4],

$$\begin{aligned} F_\lambda&(\varvec{w}_i)-F_\lambda (\varvec{w}_{\lambda }^*) \\&\le \partial F_\lambda (\varvec{w}_{i})^\top (\varvec{w}_{i} - \varvec{w}_{\lambda }^*) - \frac{\lambda }{2}\Vert \varvec{w}_{i} - \varvec{w}_{\lambda }^*\Vert ^2\\&= \frac{1}{2\eta }\left( \Vert {\varvec{w}_i - \varvec{w}_{\lambda }^*} \Vert ^2 - \Vert {\varvec{w}_{i+1}-\varvec{w}_{\lambda }^*} \Vert ^2\right) + \frac{\eta }{2}\Vert {\partial F_\lambda (\varvec{w}_{i})} \Vert ^2- \frac{\lambda }{2}\Vert \varvec{w}_{i} - \varvec{w}_{\lambda }^*\Vert ^2\\&\le \frac{1}{2\eta }\left( \Vert {\varvec{w}_i - \varvec{w}_{\lambda }^*} \Vert ^2 - \Vert {\varvec{w}_{i+1}-\varvec{w}_{\lambda }^*} \Vert ^2\right) + \frac{\eta L^2}{2}- \frac{\lambda }{2}\Vert \varvec{w}_{i} - \varvec{w}_{\lambda }^*\Vert ^2. \end{aligned}$$

Making this substitution into Eq. (18),

$$\begin{aligned} F_\lambda&(\overline{\varvec{w}})-F_\lambda (\varvec{w}_{\lambda }^*) \\&\le \frac{1}{2 t\eta }\left( \Vert {\varvec{w}_1 - \varvec{w}_{\lambda }^*} \Vert ^2 - \Vert {\varvec{w}_{t+1}-\varvec{w}_{\lambda }^*} \Vert ^2\right) + \frac{\eta L^2}{2} \\&\quad - \frac{\lambda }{2t} \sum _{i=1}^t\bigg (\Vert \varvec{w}_{i} - \varvec{w}_{\lambda }^*\Vert ^2 + ||\varvec{w}_{i}||^2-||\overline{\varvec{w}}||^2 \bigg )\\&\le \frac{R^2}{2 t\eta } + \frac{\eta L^2}{2} - \frac{\lambda }{2t} \sum _{i=1}^t\bigg (\Vert \varvec{w}_{i} - \varvec{w}_{\lambda }^*\Vert ^2 + ||\varvec{w}_{i}||^2-||\overline{\varvec{w}}||^2 \bigg )\\&\le \frac{RL}{\sqrt{t}} - \frac{\lambda }{2t} \sum _{i=1}^t\left( ||\varvec{w}_{i}||^2-||\overline{\varvec{w}}||^2 + \Vert \varvec{w}_{i} - \varvec{w}_{\lambda }^*\Vert ^2\right) . \end{aligned}$$

Decomposing the sum,

$$\begin{aligned} \frac{1}{t}\sum _{i=1}^t&\Vert \varvec{w}_{i} - \varvec{w}_{\lambda }^*\Vert ^2 = \frac{1}{t}\sum _{i=1}^t \left( ||\varvec{w}_{i}||^2 - 2 \varvec{w}_{i}^\top \varvec{w}_{\lambda }^*+ ||\varvec{w}_{\lambda }^*||^2\right) \\&= \frac{1}{t}\sum _{i=1}^t \left( ||\varvec{w}_{i}||^2\right) - 2 \overline{\varvec{w}}^\top \varvec{w}_{\lambda }^*+ ||\varvec{w}_{\lambda }^*||^2\\&= \frac{1}{t}\sum _{i=1}^t \left( ||\varvec{w}_{i}||^2\right) -||\overline{\varvec{w}}||^2 +||\overline{\varvec{w}}||^2 - 2 \overline{\varvec{w}}^\top \varvec{w}_{\lambda }^*+ ||\varvec{w}_{\lambda }^*||^2\\&= \frac{1}{t}\sum _{i=1}^t \left( ||\varvec{w}_{i}||^2 -||\overline{\varvec{w}}||^2\right) +||\overline{\varvec{w}}-\varvec{w}_{\lambda }^*||^2. \end{aligned}$$

Making this substitution,

$$\begin{aligned} F_\lambda&(\overline{\varvec{w}})-F_\lambda (\varvec{w}_{\lambda }^*) \\&\le \frac{RL}{\sqrt{t}} - \frac{\lambda }{t} \sum _{i=1}^t\left( ||\varvec{w}_{i}||^2-||\overline{\varvec{w}}||^2 \right) - \frac{\lambda }{2}||\overline{\varvec{w}}-\varvec{w}_{\lambda }^*||^2. \end{aligned}$$

Since \(||\varvec{w}||^2\) is convex, Jensen's inequality gives \(\frac{\lambda }{t} \sum _{i=1}^t \left( ||\varvec{w}_{i}||^2-||\overline{\varvec{w}}||^2 \right) \ge 0\), and

$$\begin{aligned}F_\lambda (\overline{\varvec{w}})-F_\lambda (\varvec{w}_{\lambda }^*)&\le \frac{RL}{\sqrt{t}} - \frac{\lambda }{2}||\overline{\varvec{w}}-\varvec{w}_{\lambda }^*||^2 \end{aligned}$$

as desired. \(\square \)
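For concreteness, here is a sketch of the averaged subgradient routine that Lemma 3 analyzes, with step size \(\eta = R/(L\sqrt{t})\); the Lipschitz estimate below is a crude illustrative choice rather than the constant L of the lemma:

```python
import numpy as np

def averaged_subgradient(X, y, lam, t, R, w0):
    """Run t subgradient steps on F_lambda starting from w0 with
    eta = R / (L sqrt(t)) and return the average of the iterates."""
    n = X.shape[0]
    L_lip = lam * (np.linalg.norm(w0) + R) + np.linalg.norm(X, axis=1).sum() / n  # rough Lipschitz bound (assumption)
    eta = R / (L_lip * np.sqrt(t))
    w, w_bar = w0.copy(), np.zeros_like(w0)
    for _ in range(t):
        active = y * (X @ w) <= 1
        sub = lam * w - (y[active, None] * X[active]).sum(axis=0) / n   # a subgradient of F_lambda at w
        w = w - eta * sub
        w_bar += w / t
    return w_bar
```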

1.4 Proof of Lemma 4

We now prove Lemma 4, which bounds the distance between minimizers of \(F_\lambda \) for different regularization parameters \(\lambda \).

Proof

Let \(\varvec{w}_{\lambda }^*\) minimize \(F_\lambda \) as given in Eq. (2). Let \(\lambda ' >0\) be such that \(\varvec{w}_{\lambda }^*= \varvec{w}^*\) for all \(\lambda \le \lambda '\). For \(\lambda ,\widetilde{\lambda }> 0\) and data satisfying Assumption 1, we aim to show that

$$\begin{aligned} \Vert \varvec{w}_{\lambda }^*- \varvec{w}_{\widetilde{\lambda }}^*\Vert \le \frac{L}{2} \bigg |\frac{1}{\lambda } - \frac{1}{\tilde{\lambda }}\bigg | \end{aligned}$$

and

$$\begin{aligned} \Vert \varvec{w}_{\lambda }^*- \varvec{w}_{\widetilde{\lambda }}^*\Vert \le \frac{L}{2\left( \lambda '\right) ^2} |\lambda - \widetilde{\lambda }|, \end{aligned}$$

where \(L = \tfrac{2}{n}\sum _{j=1}^n\Vert \varvec{x}_j\Vert \).

The proof of Lemma 4 makes use of Lemma 8 of [15], which is also stated below.

Lemma 6

(Perturbation of strongly convex functions I [15]). Let \(f(\varvec{z})\) be a non-negative, \(\alpha ^2\)-strongly convex function. Let \(g(\varvec{z})\) be an L-Lipschitz non-negative convex function. For any \(\beta \ge 0\), let \(\varvec{z}[\beta ]\) be the minimizer of \(f(\varvec{z}) + \beta g(\varvec{z})\). Then we have

$$\begin{aligned} \bigg \Vert \frac{d\varvec{z}[\beta ]}{d\beta } \bigg \Vert \le \frac{L}{\alpha ^2}. \end{aligned}$$

Let \(f(\varvec{w})= \Vert \varvec{w}\Vert ^2\) and \(g(\varvec{w})=\frac{1}{n} \sum _{j=1}^n \max \{0, 1 - y_j \varvec{x}_j^\top \varvec{w}\}\). Then f is strongly convex with strong convexity parameter 2 and g is Lipschitz with a Lipschitz constant bounded by \(\tfrac{1}{n}\sum _{j=1}^n\Vert \varvec{x}_j\Vert \). Note that

$$\begin{aligned} F_\lambda (\varvec{w})&= \frac{\lambda }{2} f(\varvec{w}) + g(\varvec{w}) = \frac{\lambda }{2} \left[ f(\varvec{w}) + \frac{2}{\lambda }g(\varvec{w}) \right] \\&= \frac{\lambda }{2} \left[ f(\varvec{w}) + \beta (\lambda )g(\varvec{w}) \right] \end{aligned}$$

for \(\beta (\lambda ) = \frac{2}{\lambda }\). Applying Lemma 8 of [15],

$$\begin{aligned} \bigg \Vert \frac{d\varvec{w}[\lambda ]}{d\lambda } \bigg \Vert&= \bigg \Vert \frac{d\varvec{w}[\lambda ]}{d\beta (\lambda )} \cdot \frac{d\beta (\lambda )}{d\lambda }\bigg \Vert \\&\le \frac{1}{2n}\sum _{j=1}^n\Vert \varvec{x}_j\Vert \cdot |\beta '(\lambda )| = \frac{\tfrac{1}{n}\sum _{j=1}^n\Vert \varvec{x}_j\Vert }{\lambda ^2} = \frac{L}{2\lambda ^2}. \end{aligned}$$

Integrating, for any \(\tilde{\lambda }\ge {{\hat{\lambda }}} >0\), we have

$$\begin{aligned} \Vert {\varvec{w}_{\tilde{\lambda }}^* - \varvec{w}_{{{\hat{\lambda }}}}^*} \Vert&= \bigg \Vert \int _{{\hat{\lambda }}}^{\tilde{\lambda }} \frac{d\varvec{w}[\lambda ]}{d\lambda } d\lambda \bigg \Vert \le \int _{{\hat{\lambda }}}^{\tilde{\lambda }} \bigg \Vert \frac{d\varvec{w}[\lambda ]}{d\lambda } \bigg \Vert d\lambda \le \int _{\hat{\lambda }}^{\tilde{\lambda }} \frac{L}{2\lambda ^2} d\lambda = \frac{L}{2}\left| \frac{1}{\tilde{\lambda }} - \frac{1}{\hat{\lambda }}\right| . \end{aligned}$$

As the regularization parameter \(\lambda \) approaches zero, we use the following bound instead. Since \(\varvec{w}[\lambda ] = \varvec{w}[\lambda '] = \varvec{w}^*\) for all \(\lambda \le \lambda '\), we have \(\big \Vert \frac{d\varvec{w}[\lambda ]}{d\lambda } \big \Vert =0\) for \(\lambda < \lambda '\), while for \(\lambda \ge \lambda '\) the bound above gives \(\big \Vert \frac{d\varvec{w}[\lambda ]}{d\lambda } \big \Vert \le \frac{L}{2\lambda ^2} \le \frac{L}{2\left( \lambda '\right) ^2}\). Thus

$$\begin{aligned} \bigg \Vert \frac{d\varvec{w}[\lambda ]}{d\lambda } \bigg \Vert \le \frac{L}{2\left( \lambda '\right) ^2} \quad \forall \; \lambda > 0. \end{aligned}$$

This gives the second bound,

$$\begin{aligned} \Vert \varvec{w}_{\widetilde{\lambda }}^*- \varvec{w}_{{\hat{\lambda }}}^*\Vert \le \int _{ {\hat{\lambda }} }^{\tilde{\lambda }} \bigg \Vert \frac{d\varvec{w}[\lambda ]}{d\lambda } \bigg \Vert d\lambda \le \int _{ {\hat{\lambda }} }^{\tilde{\lambda }} \frac{L}{2\left( \lambda '\right) ^2} d\lambda = \frac{L}{2\left( \lambda '\right) ^2} | \widetilde{\lambda }-\hat{\lambda }|. \end{aligned}$$

\(\square \)
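The first bound is easy to probe numerically: approximate \(\varvec{w}_{\lambda }^*\) for two values of \(\lambda \) with a standard strongly convex subgradient scheme and compare the distance between the approximate minimizers to \(\frac{L}{2}\big |\frac{1}{\lambda } - \frac{1}{\tilde{\lambda }}\big |\). The data, step-size rule, and iteration counts below are ad hoc choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = np.where(X @ np.array([1.0, -1.0, 0.5, 0.0]) > 0, 1.0, -1.0)   # separable toy labels
L = 2.0 / len(y) * np.linalg.norm(X, axis=1).sum()

def approx_minimizer(lam, steps=50_000):
    """Approximate w_lambda^* with decaying steps 2/(lam*(k+2)) and
    averaging over the second half of the iterates (illustrative scheme)."""
    w = np.zeros(X.shape[1])
    w_bar = np.zeros(X.shape[1])
    for k in range(steps):
        active = y * (X @ w) <= 1
        sub = lam * w - (y[active, None] * X[active]).sum(axis=0) / len(y)
        w = w - 2.0 / (lam * (k + 2)) * sub
        if k >= steps // 2:
            w_bar += w / (steps - steps // 2)
    return w_bar

lam_a, lam_b = 0.5, 0.25
gap = np.linalg.norm(approx_minimizer(lam_a) - approx_minimizer(lam_b))
print(gap, "<=", L / 2 * abs(1 / lam_a - 1 / lam_b))   # observed distance vs. the Lemma 4 bound
```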

1.5 Proof of Lemma 5

We finally prove Lemma 5, which makes use of Lemmas 3 and 4 to bound the initial error \(\Vert {\overline{\varvec{w}}_s- \varvec{w}_{\lambda _s}^*} \Vert \) of each regularized subproblem given in Eq. (4).

Proof

We aim to show \(\Vert { \overline{\varvec{w}}_s- \varvec{w}_{\lambda _{s}}^*} \Vert \le R_s\) with \(R_s\) defined below and proceed by induction. For \(s_0 \in {\mathbb {N}}\) with \(s_0>1\), \(p\in (0,1)\), and \(r >2p\), let \(\lambda _s = (s_0+s)^{-p}\), \(t_s = (s_0+s)^{r}\). Recall that \(L = \tfrac{2}{n} \sum _{j=1}^n\Vert {\varvec{x}_j} \Vert \). For some parameter \(\alpha > 0\), let

$$\begin{aligned} R_s = CL(s_0+s-1)^{-\alpha }\text { with }C = \max \left\{ 4, \frac{1}{2\lambda _0 }(s_0-1)^{\alpha }\right\} . \end{aligned}$$

By Lemma 1, and since \(\overline{\varvec{w}}_0 = \mathbf {0}\), we have \(\Vert {\overline{\varvec{w}}_0 - \varvec{w}_{\lambda _0}^*} \Vert \le \frac{L}{2\lambda _0}\). Note that \(R_0 \ge \frac{L}{2\lambda _0}\) and thus the base case, \( \Vert {\overline{\varvec{w}}_0 - \varvec{w}_{\lambda _0}^*} \Vert \le R_0, \) is satisfied.

Suppose that \(\Vert { \overline{\varvec{w}}_{s-1} - \varvec{w}_{\lambda _{s-1}}^*} \Vert \le R_{s-1}\). Since the base case of \(s=0\) has been established, we now consider \(s\ge 1\). By the triangle inequality,

$$\begin{aligned} \Vert { \overline{\varvec{w}}_s- \varvec{w}_{\lambda _{s}}^*} \Vert \le \Vert { \overline{\varvec{w}}_s- \varvec{w}_{\lambda _{s-1}}^* } \Vert +\Vert {\varvec{w}_{\lambda _{s-1}}^*-\varvec{w}_{\lambda _{s}}^*} \Vert . \end{aligned}$$

For \(\overline{\varvec{w}}_s\) generated as in Algorithm 1, Lemma 3 along with the inductive assumption gives that

$$\begin{aligned} \Vert {\overline{\varvec{w}}_s- \varvec{w}_{\lambda _{s-1}}^*} \Vert \le \left( \frac{2 R_{s-1} L}{\lambda _{s-1} \sqrt{t_{s-1}}}\right) ^{1/2} =\frac{ \sqrt{2C} L (s_0+s-2)^{-\alpha /2}}{(s_0+s-1)^{r/4-p/2}}. \end{aligned}$$

From Eq. (11) of Lemma 4,

$$\begin{aligned} \Vert \varvec{w}_{\lambda _{s-1}}^*-\varvec{w}_{\lambda _{s}}^*\Vert&\le \tfrac{L}{2}\left( \tfrac{1}{\lambda _{s}}-\tfrac{1}{\lambda _{s-1}}\right) \\&=\tfrac{L}{2}\left( (s_0+s)^p - (s_0+s-1)^p \right) \\&\le \tfrac{Lp}{2}(s_0+s-1)^{p-1}. \end{aligned}$$

Combining these

$$\begin{aligned} \Vert { \overline{\varvec{w}}_s- \varvec{w}_{\lambda _{s}}^*} \Vert&\le \frac{\sqrt{2C} L (s_0+s-2)^{-\alpha /2}}{(s_0+s-1)^{r/4-p/2}} + \frac{L}{2}p(s_0+s-1)^{p-1}. \end{aligned}$$

We apply a change of base to replace \((s_0+s-2)\) with \((s_0+s-1)^{(1-\epsilon )}\). Solving for \(\epsilon \), we find \(\epsilon \ge \frac{ \log (s_0 + s -1 ) - \log (s_0 + s-2)}{\log (s_0+s - 1)}\). Note that this change of base requires that \(s_0 >1 \), since we are considering \(s \ge 1\). Applying this change of base,

$$\begin{aligned} \Vert { \overline{\varvec{w}}_s- \varvec{w}_{\lambda _{s}}^*} \Vert&\le \sqrt{2C} L(s_0+s-1)^{p/2-\alpha /2(1-\epsilon )-r/4} + \frac{Lp}{2}(s_0+s-1)^{p-1}. \end{aligned}$$

To simplify the analysis and remove the dependence of \(\epsilon \) on the iteration number s, we use \(\epsilon _0 = \frac{ \log (s_0 ) - \log (s_0-1)}{\log (s_0)}\). Note that \(\epsilon _0 \ge \epsilon \) for \(s\ge 1\) and is defined for \(s_0 >1\). Now, for

$$\begin{aligned} 0\le \alpha \le \min \left( \frac{r-2p}{2(1+\epsilon _0)}, 1 - p \right) \end{aligned}$$

and \(p<1\), we have

$$\begin{aligned} \Vert { \overline{\varvec{w}}_s- \varvec{w}_{\lambda _{s}}^*} \Vert \le L\left( \sqrt{2C} +\tfrac{p}{2} \right) (s_0+s-1)^{-\alpha } \le CL(s_0+s-1)^{-\alpha } = R_{s}. \end{aligned}$$

Note that allowing the first term in the upper bound on \(\alpha \) to increase with s leads to smaller bounds \(R_s\). This choice, however, complicates the analysis. \(\square \)
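The schedule appearing throughout this proof can be tabulated explicitly; below is a small helper that computes \(\lambda _s\), \(t_s\), and the guaranteed radii \(R_s\) from the definitions above (the particular values of \(s_0\), p, and r are illustrative and must satisfy \(s_0>1\), \(p\in (0,1)\), \(r>2p\)):

```python
import numpy as np

def homotopy_schedule(L, s0=4, p=0.4, r=1.5, stages=10):
    """Tabulate lambda_s = (s0+s)^(-p), t_s = (s0+s)^r, and
    R_s = C L (s0+s-1)^(-alpha) as defined in the proof of Lemma 5."""
    eps0 = (np.log(s0) - np.log(s0 - 1)) / np.log(s0)
    alpha = min((r - 2 * p) / (2 * (1 + eps0)), 1 - p)
    lam0 = s0 ** (-p)
    C = max(4.0, (s0 - 1) ** alpha / (2 * lam0))
    schedule = []
    for s in range(stages):
        lam_s = (s0 + s) ** (-p)                 # regularization parameter of stage s
        t_s = int((s0 + s) ** r)                 # inner averaged-subgradient steps
        R_s = C * L * (s0 + s - 1) ** (-alpha)   # guaranteed bound on ||w_bar_s - w_{lambda_s}^*||
        schedule.append((s, lam_s, t_s, R_s))
    return schedule
```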

Cite this article

Molitor, D., Needell, D. & Ward, R. Bias of Homotopic Gradient Descent for the Hinge Loss. Appl Math Optim 84, 621–647 (2021). https://doi.org/10.1007/s00245-020-09656-5
