Online Learning of Linear Classifiers

  • Jyrki Kivinen
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2600)


This paper surveys some basic techniques and recent results related to online learning. Our focus is on linear classification. The most familiar algorithm for this task is the perceptron. We explain the perceptron algorithm and its convergence proof as an instance of a generic method based on Bregman divergences. This leads to a more general algorithm known as the p-norm perceptron. We give the proof for generalizing the perceptron convergence theorem for the p-norm perceptron and the non-separable case. We also show how regularization, again based on Bregman divergences, can make an online algorithm more robust against target movement.
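
For orientation, the mistake-driven update that the abstract refers to can be sketched as follows. This is a minimal illustration based on one common formulation of the p-norm perceptron, not code from the paper: the function names `link` and `p_norm_perceptron`, the fixed unit learning rate, and the `epochs` cap are all assumptions made for the example. The learner accumulates mistaken examples in a vector `theta` and predicts with `w = link(theta, p)`, the gradient of 0.5 * ||theta||_p^2. With p = 2 the link is the identity and the sketch reduces to the classical perceptron.

```python
import numpy as np


def link(theta, p):
    """Gradient of 0.5 * ||theta||_p^2 (identity map when p == 2)."""
    norm = np.linalg.norm(theta, ord=p)
    if norm == 0.0:
        return np.zeros_like(theta)
    return np.sign(theta) * np.abs(theta) ** (p - 1) / norm ** (p - 2)


def p_norm_perceptron(X, y, p=2.0, epochs=1000):
    """Cycle through (X, y) with labels in {-1, +1}; update only on mistakes.

    Returns the final weight vector and the number of mistakes made.
    The unit learning rate and the epoch cap are illustrative assumptions.
    """
    theta = np.zeros(X.shape[1])
    mistakes = 0
    for _ in range(epochs):
        clean_pass = True
        for x_t, y_t in zip(X, y):
            w = link(theta, p)
            if y_t * np.dot(w, x_t) <= 0:  # wrong side of (or on) the hyperplane
                theta += y_t * x_t         # additive update on the accumulated vector
                mistakes += 1
                clean_pass = False
        if clean_pass:                     # a full pass with no mistakes: stop
            break
    return link(theta, p), mistakes


if __name__ == "__main__":
    # Small linearly separable toy data, made up for this example.
    X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-3.0, -1.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    for p in (2.0, 4.0):
        w, m = p_norm_perceptron(X, y, p=p)
        print(p, w, m)
```

Larger values of p change how the accumulated vector is mapped to weights, which is what gives the p-norm family its different mistake bounds; the sketch does not attempt to reproduce those bounds or the regularized variants for moving targets.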


Keywords: Weight Vector, Learning Rate, Online Learn, Target Movement, Online Algorithm
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Jyrki Kivinen (1)
  1. Research School of Information Sciences and Engineering, Australian National University, Canberra, Australia
