Machine Learning, Volume 46, Issue 1–3, pp 361–387

The Relaxed Online Maximum Margin Algorithm

  • Yi Li
  • Philip M. Long
Article

Abstract

We describe a new incremental algorithm for training linear threshold functions: the Relaxed Online Maximum Margin Algorithm, or ROMMA. ROMMA can be viewed as an approximation to the algorithm that repeatedly chooses the hyperplane that classifies previously seen examples correctly with the maximum margin. It is known that such a maximum-margin hypothesis can be computed by minimizing the length of the weight vector subject to a number of linear constraints. ROMMA works by maintaining a relatively simple relaxation of these constraints that can be efficiently updated. We prove a mistake bound for ROMMA that is the same as that proved for the perceptron algorithm. Our analysis implies that the maximum-margin algorithm also satisfies this mistake bound; this is the first worst-case performance guarantee for this algorithm. We describe some experiments using ROMMA, and a variant that updates its hypothesis more aggressively, as batch algorithms to recognize handwritten digits. The computational complexity and simplicity of these algorithms are similar to those of the perceptron algorithm, but their generalization is much better. We show that a batch algorithm based on aggressive ROMMA converges to the fixed-threshold SVM hypothesis.
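To make the idea of a "relaxation that can be efficiently updated" concrete, here is a minimal sketch of one online step. It rests on assumptions that the abstract does not spell out. The relaxation is taken to keep only two constraints: the margin constraint for the newest example, y (w'·x) ≥ 1, and a single halfspace constraint w'·w ≥ ‖w‖² summarizing everything seen so far. The shortest vector satisfying both then has the closed form w' = c·w + d·x. The function name `romma_update`, the `aggressive` flag, and the handling of the degenerate parallel case are choices made for this illustration only.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def romma_update(w, x, y, aggressive=False):
    """One online step on example (x, y) with y in {-1, +1}.

    Assumed relaxation: minimize ||w'|| subject to
        y * (w' . x) >= 1      (newest example)
        w' . w >= ||w||^2      (one halfspace standing in for the past)
    """
    a = dot(w, x)   # w . x
    W = dot(w, w)   # ||w||^2
    X = dot(x, x)   # ||x||^2
    if X == 0.0:
        return list(w)  # a zero example carries no information

    margin = y * a
    # Conservative variant: update only on a prediction mistake.
    # Aggressive variant: update whenever the margin falls below 1.
    needs_update = margin < 1.0 if aggressive else margin <= 0.0
    if not needs_update:
        return list(w)

    # If there is no history yet, or the halfspace constraint is not
    # binding, the shortest vector with y * (w' . x) = 1 is y * x / ||x||^2.
    if W == 0.0 or margin >= X * W:
        return [y * xi / X for xi in x]

    denom = X * W - a * a  # >= 0 by Cauchy-Schwarz
    if denom <= 1e-12:
        # w and x are (nearly) parallel and the two constraints conflict.
        # Resetting to y * x / ||x||^2 is this sketch's own fallback.
        return [y * xi / X for xi in x]

    # Both constraints hold with equality; solve the 2x2 linear system.
    c = (X * W - margin) / denom
    d = W * (y - a) / denom
    return [c * wi + d * xi for wi, xi in zip(w, x)]


if __name__ == "__main__":
    w = [0.0, 0.0]
    for x, y in [([1.0, 0.0], 1), ([1.0, 1.0], -1)]:
        w = romma_update(w, x, y)
    print(w)
```

Each step costs a few inner products, which is in line with the abstract's remark that the computational cost is similar to that of the perceptron algorithm. Passing `aggressive=True` gives a rough stand-in for the more aggressive variant mentioned above.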

online learning, large margin classifiers, perceptrons, support vector machines

Copyright information

© Kluwer Academic Publishers 2002

Authors and Affiliations

  • Yi Li (1)
  • Philip M. Long (2)
  1. Department of Engineering Mathematics, University of Bristol, Bristol, UK
  2. Department of Computer Science, National University of Singapore, Singapore, Republic of Singapore
