Fast and strong convergence of online learning algorithms

Abstract

In this paper, we study an online learning algorithm without explicit regularization terms. The algorithm is essentially a stochastic gradient descent scheme in a reproducing kernel Hilbert space (RKHS). The polynomially decaying step size at each iteration plays the role of regularization and ensures the generalization ability of the algorithm. We develop a novel capacity-dependent analysis of the performance of the last iterate, which answers an open problem in learning theory. The contribution of this paper is twofold. First, the capacity-dependent analysis yields a sharp convergence rate in the standard mean square distance, improving results in the literature. Second, we establish, for the first time, the strong convergence of the last iterate in the RKHS norm with polynomially decaying step sizes. The theoretical analysis fully exploits the fine structure of the underlying RKHS and thus leads to sharp error estimates for the online learning algorithm.
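
As a concrete illustration of the scheme described above, the sketch below implements the unregularized online iteration f_{t+1} = f_t - eta_t (f_t(x_t) - y_t) K(x_t, ·) for least-squares regression in an RKHS, with a polynomially decaying step size eta_t = eta_1 * t^(-theta). This is a minimal sketch, assuming a squared loss and a Gaussian kernel; the function names and the parameter values eta_1 = 0.5 and theta = 0.5 are illustrative choices and are not taken from the paper.

```python
import numpy as np

def gaussian_kernel(x, xp, sigma=1.0):
    # Gaussian (RBF) kernel; any Mercer kernel could be substituted (assumption).
    return float(np.exp(-np.sum((np.asarray(x) - np.asarray(xp)) ** 2) / (2.0 * sigma ** 2)))

def online_sgd_rkhs(stream, eta1=0.5, theta=0.5, kernel=gaussian_kernel):
    """Unregularized online least-squares learning in an RKHS (sketch).

    Iteration: f_{t+1} = f_t - eta_t * (f_t(x_t) - y_t) * K(x_t, .),
    with polynomially decaying step size eta_t = eta1 * t**(-theta).
    The iterate f_t is stored through its kernel expansion coefficients.
    """
    centers, coeffs = [], []
    for t, (x, y) in enumerate(stream, start=1):
        # Evaluate the current iterate f_t at the new input x_t.
        f_x = sum(c * kernel(xc, x) for xc, c in zip(centers, coeffs))
        eta_t = eta1 * t ** (-theta)  # polynomially decaying step size
        # One gradient step of the squared loss adds a single kernel term.
        centers.append(x)
        coeffs.append(-eta_t * (f_x - y))
    return centers, coeffs

# Example usage: learn f(x) = sin(2*pi*x) from a noisy sample stream.
rng = np.random.default_rng(0)
data = [(x, np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal())
        for x in rng.uniform(0, 1, size=200)]
centers, coeffs = online_sgd_rkhs(data)
```

The last iterate returned here is exactly the object whose mean square distance and RKHS-norm distance to the regression function the paper's analysis quantifies.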

Author information

Correspondence to Lei Shi.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Communicated by: Gitta Kutyniok

About this article

Cite this article

Guo, Z., Shi, L.: Fast and strong convergence of online learning algorithms. Adv. Comput. Math. 45, 2745–2770 (2019). https://doi.org/10.1007/s10444-019-09707-8

Keywords

  • Learning theory
  • Online learning
  • Capacity dependent error analysis
  • Strong convergence in an RKHS

Mathematics Subject Classification (2010)

  • 68Q32
  • 68T05
  • 62J02
  • 62L20