
Abstract

In this paper, we study an online learning algorithm with a robust loss function \(\mathcal {L}_{\sigma }\) for regression over a reproducing kernel Hilbert space (RKHS). The loss function \(\mathcal {L}_{\sigma }\), which involves a scaling parameter \(\sigma >0\), covers a wide range of commonly used robust losses. The proposed algorithm is thus a robust alternative to online least squares regression for estimating the conditional mean function. For properly chosen \(\sigma \) and step size, we show that the last iterate of this online algorithm achieves the optimal capacity-independent convergence rate in the mean square distance. Moreover, if additional information on the underlying function space is available, we establish optimal capacity-dependent rates for strong convergence in the RKHS norm. To the best of our knowledge, both results are new to the literature on online learning.
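To make the setting concrete, the following is a minimal sketch of one pass of online (stochastic) gradient descent in an RKHS with a robust loss. The specific choices made here, a correntropy-type (Welsch) loss as one member of the family covered by \(\mathcal {L}_{\sigma }\), a Gaussian kernel, and a polynomially decaying step size \(\eta_t = \eta_0 t^{-\theta}\), are illustrative assumptions and not the particular parameter choices analyzed in the paper.

```python
import numpy as np

def gaussian_kernel(x, z, bandwidth=1.0):
    """Gaussian (RBF) kernel K(x, z)."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * bandwidth ** 2))

def welsch_grad(residual, sigma):
    """Derivative of the correntropy-type (Welsch) loss
    L_sigma(u) = sigma^2 * (1 - exp(-u^2 / sigma^2))."""
    return 2.0 * residual * np.exp(-residual ** 2 / sigma ** 2)

def online_robust_regression(X, y, sigma=1.0, bandwidth=1.0, eta0=0.5, theta=0.5):
    """One pass of online gradient descent in the RKHS.

    At step t the iterate is updated by
        f_{t+1} = f_t - eta_t * L_sigma'(f_t(x_t) - y_t) * K(x_t, .)
    with step size eta_t = eta0 * t^{-theta}. The iterate is stored
    through its kernel expansion coefficients."""
    centers, coeffs = [], []
    for t, (x_t, y_t) in enumerate(zip(X, y), start=1):
        # Evaluate the current iterate f_t at the new sample point x_t.
        f_xt = sum(a * gaussian_kernel(c, x_t, bandwidth)
                   for a, c in zip(coeffs, centers))
        eta_t = eta0 * t ** (-theta)
        # Robust gradient step: large residuals are down-weighted by
        # the factor exp(-residual^2 / sigma^2).
        coeffs.append(-eta_t * welsch_grad(f_xt - y_t, sigma))
        centers.append(x_t)
    return centers, coeffs

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(500, 1))
    y = np.sin(np.pi * X[:, 0]) + 0.1 * rng.standard_normal(500)
    y[::50] += 5.0  # inject a few heavy-tailed outliers
    centers, coeffs = online_robust_regression(X, y, sigma=1.0)
    f0 = sum(a * gaussian_kernel(c, np.array([0.0]))
             for a, c in zip(coeffs, centers))
    print(f"estimate at x=0: {f0:.3f}  (target sin(0)=0)")
```

The exponential factor in the gradient shrinks the influence of large residuals, which is what makes such iterations robust to heavy-tailed noise and outliers, in contrast to the unbounded gradient of the least squares loss.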



Acknowledgements

The authors are grateful to the anonymous referees for their careful reading of this paper and for their helpful suggestions. The work of Zheng-Chu Guo is supported by the Zhejiang Provincial Natural Science Foundation of China (Project No. LR20A010001), the National Natural Science Foundation of China (Project Nos. U21A20426 and 12271473), and the Fundamental Research Funds for the Central Universities (Project No. 2021XZZX001). The work of Andreas Christmann is partially supported by the German Research Foundation (DFG) under Grant CH 291/3-1. The work of Lei Shi is supported by the National Natural Science Foundation of China (Project Nos. 12171039 and 12061160462) and the Shanghai Science and Technology Program (Project Nos. 21JC1400600 and 20JC1412700).

Author information

Corresponding author

Correspondence to Lei Shi.

Additional information

Communicated by Thomas Strohmer.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Guo, ZC., Christmann, A. & Shi, L. Optimality of Robust Online Learning. Found Comput Math (2023). https://doi.org/10.1007/s10208-023-09616-9

