Advanced Lectures on Machine Learning, pp. 235–257
Online Learning of Linear Classifiers
This paper surveys some basic techniques and recent results in online learning, with a focus on linear classification. The most familiar algorithm for this task is the perceptron. We explain the perceptron algorithm and its convergence proof as an instance of a generic method based on Bregman divergences. This leads to a more general algorithm known as the p-norm perceptron. We generalize the perceptron convergence theorem to the p-norm perceptron and to the non-separable case. We also show how regularization, again based on Bregman divergences, can make an online algorithm more robust against target movement.
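To make the setting concrete, here is a minimal sketch of the mistake-driven p-norm perceptron in the formulation commonly attributed to Grove, Littlestone, and Schapire and to Gentile: a dual weight vector θ accumulates `y·x` on each mistake, and predictions use the primal weights `w` obtained through a p-norm link function. The function names and the simple epoch loop below are illustrative choices, not the paper's own notation; for `p = 2` the link is the identity and the sketch reduces to the classical perceptron.

```python
import numpy as np

def p_norm_link(theta, p):
    """Link function mapping dual weights theta to primal weights w.
    For p = 2 this is the identity, recovering the classical perceptron."""
    norm = np.linalg.norm(theta, ord=p)
    if norm == 0.0:
        return np.zeros_like(theta)
    return np.sign(theta) * np.abs(theta) ** (p - 1) / norm ** (p - 2)

def p_norm_perceptron(examples, dim, p=2, max_epochs=100):
    """Mistake-driven online learning: on each prediction error,
    add y * x to the dual weights theta."""
    theta = np.zeros(dim)
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in examples:  # labels y in {-1, +1}
            w = p_norm_link(theta, p)
            if y * np.dot(w, x) <= 0:  # mistake (or zero margin)
                theta += y * x
                mistakes += 1
        if mistakes == 0:  # consistent on the whole sample
            break
    return p_norm_link(theta, p)
```

On linearly separable data the loop terminates after a finite number of mistakes, which is what the perceptron convergence theorem and its p-norm generalization quantify.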