
Machine Learning, Volume 53, Issue 3, pp 265–299

The Robustness of the p-Norm Algorithms

  • Claudio Gentile

Abstract

We consider two on-line learning frameworks: binary classification through linear threshold functions and linear regression. We study a family of on-line algorithms, called p-norm algorithms, introduced by Grove, Littlestone, and Schuurmans in the context of deterministic binary classification. We show how to adapt these algorithms for use in the regression setting, and prove worst-case bounds on the square loss using a technique from Kivinen and Warmuth. As pointed out by Grove et al., these algorithms can be made to approach a version of the classification algorithm Winnow as p goes to infinity; similarly, they can be made to approach the corresponding regression algorithm EG in the limit. Winnow and EG are notable for having loss bounds that grow only logarithmically in the dimension of the instance space. Here we describe another way to use the p-norm algorithms to achieve this logarithmic behavior. With the way of using them that we propose, it is less critical than with Winnow and EG to retune the parameters of the algorithm as the learning task changes. Since the correct setting of the parameters depends on characteristics of the learning task that are not typically known a priori by the learner, this gives the p-norm algorithms a desirable robustness. Our elaborations yield various new loss bounds in these on-line settings. Some of these bounds improve or generalize known results; others are incomparable.

Keywords: on-line learning, loss bounds, learning rate, dual norms, on-line model selection
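For intuition, the classification version of a p-norm algorithm maintains a vector θ of accumulated mistake-driven updates and predicts with the weight vector w = ∇(½‖θ‖_q²), where q = p/(p−1) is the dual norm index; on a mistake it adds ηyx to θ, and p = 2 recovers the classical Perceptron. The sketch below is illustrative only (function names and the conservative mistake-driven loop are ours, not the paper's notation); see Grove, Littlestone, and Schuurmans (reference 19) for the precise algorithm and its analysis.

```python
import numpy as np

def pnorm_link(theta, p):
    """Map theta to weights via the gradient of (1/2)||theta||_q^2,
    where q = p/(p-1) is the dual norm index (requires p > 1)."""
    q = p / (p - 1.0)
    norm_q = np.linalg.norm(theta, ord=q)
    if norm_q == 0.0:
        return np.zeros_like(theta)
    # Componentwise: sign(theta_i) * |theta_i|^(q-1) / ||theta||_q^(q-2)
    return np.sign(theta) * np.abs(theta) ** (q - 1) / norm_q ** (q - 2)

def pnorm_perceptron(examples, p=2.0, eta=1.0):
    """Run a p-norm algorithm over (x, y) pairs with y in {-1, +1}.
    Returns the number of mistakes; updates only on mistakes."""
    theta = None
    mistakes = 0
    for x, y in examples:
        x = np.asarray(x, dtype=float)
        if theta is None:
            theta = np.zeros_like(x)
        w = pnorm_link(theta, p)
        yhat = 1.0 if np.dot(w, x) >= 0 else -1.0
        if yhat != y:
            theta += eta * y * x  # additive update in the dual vector
            mistakes += 1
    return mistakes
```

Taking p large (e.g. p = Θ(ln n) for n-dimensional instances) is what yields the Winnow-like, logarithmic-in-dimension behavior discussed in the abstract.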

References

  1. Angluin, D. (1988). Queries and concept learning. Machine Learning, 2:4, 319–342.
  2. Auer, P., & Warmuth, M. K. (1998). Tracking the best disjunction. Machine Learning, 32:2, 127–150.
  3. Auer, P., & Gentile, C. (2000). Adaptive and self-confident on-line learning algorithms. In Proc. 13th Annu. Conf. on Comput. Learning Theory (pp. 107–117). San Mateo, CA: Morgan Kaufmann.
  4. Azoury, K., & Warmuth, M. K. (2001). Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43, 211–246.
  5. Barzdin, J. M., & Freivald, R. V. (1972). On the prediction of general recursive functions. Soviet Math. Doklady, 13, 1224–1228.
  6. Block, H. D. (1962). The perceptron: A model for brain functioning. Reviews of Modern Physics, 34, 123–135. Reprinted in Neurocomputing by Anderson and Rosenfeld.
  7. Bregman, L. M. (1967). The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Physics, 7, 200–217.
  8. Bylander, T. (1997). The binary exponentiated gradient algorithm for learning linear functions. In Proc. 8th Annu. Conf. on Comput. Learning Theory (pp. 184–192). ACM.
  9. Cesa-Bianchi, N., Freund, Y., Helmbold, D. P., & Warmuth, M. K. (1996). On-line prediction and conversion strategies. Machine Learning, 25, 71–110.
  10. Cesa-Bianchi, N., Freund, Y., Haussler, D., Helmbold, D. P., Schapire, R. E., & Warmuth, M. K. (1997). How to use expert advice. Journal of the ACM, 44:3, 427–485.
  11. Cesa-Bianchi, N., Helmbold, D. P., & Panizza, S. (1998). On Bayes methods for on-line boolean prediction. Algorithmica, 22:1, 112–137.
  12. Cesa-Bianchi, N., Long, P., & Warmuth, M. K. (1996). Worst-case quadratic loss bounds for on-line prediction of linear functions by gradient descent. IEEE Transactions on Neural Networks, 7, 604–619.
  13. Censor, Y., & Lent, A. (1981). An iterative row-action method for interval convex programming. Journal of Optimization Theory and Applications, 34:3, 321–353.
  14. Forster, J., & Warmuth, M. K. (2000). Relative expected instantaneous loss bounds. In Proc. 13th Annu. Conf. on Comput. Learning Theory (pp. 90–99). San Mateo, CA: Morgan Kaufmann.
  15. Freund, Y., & Schapire, R. E. (1999). Large margin classification using the perceptron algorithm. Machine Learning, 37:3, 277–296.
  16. Gentile, C., & Littlestone, N. (1999). The robustness of the p-norm algorithms. In Proc. 12th Annu. Conf. on Comput. Learning Theory (pp. 1–11). ACM.
  17. Gentile, C., & Warmuth, M. K. (1999). Linear hinge loss and average margin. In Proc. Advances in Neural Information Processing Systems 11 (pp. 225–231). Cambridge, MA: MIT Press.
  18. Gordon, G. J. (1999). Regret bounds for prediction problems. In Proc. 12th Annu. Conf. on Comput. Learning Theory (pp. 29–40). ACM.
  19. Grove, A. J., Littlestone, N., & Schuurmans, D. (2001). General convergence results for linear discriminant updates. Machine Learning, 43:3, 173–210.
  20. Helmbold, D. P., Kivinen, J., & Warmuth, M. K. (1999). Worst-case loss bounds for sigmoided linear neurons. IEEE Transactions on Neural Networks, 10:6, 1291–1304.
  21. Helmbold, D. P., & Schapire, R. E. (1997). Predicting nearly as well as the best pruning of a decision tree. Machine Learning, 27, 51–68.
  22. Helmbold, D. P., & Warmuth, M. K. (1995). On weak learning. Journal of Computer and System Sciences, 50:3, 551–573.
  23. Herbster, M., & Warmuth, M. K. (1998a). Tracking the best expert. Machine Learning, 32:2, 151–178.
  24. Herbster, M., & Warmuth, M. K. (1998b). Tracking the best regressor. In Proc. 11th Annu. Conf. on Comput. Learning Theory (pp. 24–31). ACM.
  25. Jagota, A., & Warmuth, M. K. (1998). Continuous and discrete time nonlinear gradient descent: Relative loss bounds and convergence. In Electronic Proceedings of the Fifth International Symposium on Artificial Intelligence and Mathematics. http://rutcor.rutgers.edu/~amai.
  26. Kivinen, J., & Warmuth, M. K. (1997). Additive versus exponentiated gradient updates for linear prediction. Information and Computation, 132:1, 1–64.
  27. Kivinen, J., & Warmuth, M. K. (2001). Relative loss bounds for multidimensional regression problems. Machine Learning, 45:3, 301–329.
  28. Kivinen, J., & Warmuth, M. K. (1999). Averaging expert predictions. In Proc. 4th European Conference on Comput. Learning Theory (pp. 153–167). Lecture Notes in Computer Science, Vol. 1572. Springer.
  29. Littlestone, N. (1988). Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2, 285–318.
  30. Littlestone, N. (1989a). From on-line to batch learning. In Proc. 2nd Annu. Workshop on Comput. Learning Theory (pp. 269–284). San Mateo, CA: Morgan Kaufmann.
  31. Littlestone, N. (1989b). Mistake bounds and logarithmic linear-threshold learning algorithms. PhD thesis, Technical Report UCSC-CRL-89-11, University of California Santa Cruz.
  32. Littlestone, N. (1991). Redundant noisy attributes, attribute errors, and linear threshold learning using Winnow. In Proc. 4th Annu. Workshop on Comput. Learning Theory (pp. 147–156). San Mateo, CA: Morgan Kaufmann.
  33. Littlestone, N., & Mesterharm, C. (1997). An apobayesian relative of Winnow. In Proc. Advances in Neural Information Processing Systems 9. Cambridge, MA: MIT Press.
  34. Littlestone, N., & Mesterharm, C. (1999). A simulation study of Winnow and related learning algorithms. Unpublished manuscript.
  35. Littlestone, N., & Warmuth, M. K. (1994). The weighted majority algorithm. Information and Computation, 108:2, 212–261.
  36. Maruoka, A., Takimoto, E., & Vovk, V. (1999). Predicting nearly as well as the best pruning of a decision tree through dynamic programming scheme. Theoretical Computer Science, to appear.
  37. Novikoff, A. B. J. (1962). On convergence proofs on perceptrons. In Proc. of the Symposium on the Mathematical Theory of Automata, Vol. XII (pp. 615–622).
  38. Rockafellar, R. T. (1970). Convex Analysis. Princeton, NJ: Princeton University Press.
  39. Rosenblatt, F. (1962). Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Washington, D.C.: Spartan Books.
  40. Vovk, V. (1997). Competitive on-line linear regression. Technical Report CSD-TR-97-13, Department of Computer Science, Royal Holloway, University of London. Preliminary version in Proc. Advances in Neural Information Processing Systems 10 (pp. 364–370). Cambridge, MA: MIT Press.
  41. Vovk, V. (1990). Aggregating strategies. In Proc. 3rd Annu. Workshop on Comput. Learning Theory (pp. 371–383). San Mateo, CA: Morgan Kaufmann.
  42. Widrow, B., & Hoff, M. E. (1960). Adaptive switching circuits. 1960 IRE WESCON Conv. Record, Part 4 (pp. 96–104).
  43. Yamanishi, K. (1998). A decision-theoretic extension of stochastic complexity and its applications to learning. IEEE Transactions on Information Theory, 44:4, 1424–1439.

Copyright information

© Kluwer Academic Publishers 2003

Authors and Affiliations

  • Claudio Gentile
    1. CRII, Università dell'Insubria, Varese, Italy
