We consider two on-line learning frameworks: binary classification through linear threshold functions and linear regression. We study a family of on-line algorithms, called p-norm algorithms, introduced by Grove, Littlestone and Schuurmans in the context of deterministic binary classification. We show how to adapt these algorithms for use in the regression setting, and prove worst-case bounds on the square loss, using a technique from Kivinen and Warmuth. As pointed out by Grove et al., these algorithms can be made to approach a version of the classification algorithm Winnow as p goes to infinity; similarly, they can be made to approach the corresponding regression algorithm EG in the limit. Winnow and EG are notable for having loss bounds that grow only logarithmically in the dimension of the instance space. Here we describe another way to use the p-norm algorithms to achieve this logarithmic behavior. With our approach, retuning the parameters of the algorithm as the learning task changes is less critical than it is with Winnow and EG. Since the correct setting of the parameters depends on characteristics of the learning task that the learner typically does not know a priori, this gives the p-norm algorithms a desirable robustness. Our elaborations yield various new loss bounds in these on-line settings. Some of these bounds improve or generalize known results. Others are incomparable.
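For illustration only, a mistake-driven p-norm classifier can be sketched as follows. The abstract does not give the update itself, so the sketch assumes the usual formulation: additive updates are kept in a dual vector theta, and predictions use primal weights obtained as the gradient of one half of the squared p-norm of theta. The function names, the learning rate `eta`, and the choice of p on the order of 2 ln n (with n the instance dimension) are assumptions made for this sketch, not details taken from the text above.

```python
import math


def pnorm_link(theta, p):
    """Map dual weights theta to primal weights w.

    Assumed link: w is the gradient of 0.5 * ||theta||_p^2, i.e.
    w_i = sign(theta_i) * |theta_i|^(p-1) / ||theta||_p^(p-2).
    For p = 2 this is the identity map.
    """
    norm = sum(abs(t) ** p for t in theta) ** (1.0 / p)
    if norm == 0.0:
        return [0.0] * len(theta)
    return [math.copysign(abs(t) ** (p - 1), t) / norm ** (p - 2) for t in theta]


def suggested_p(n):
    """Hypothetical helper: p on the order of 2 ln n, never below 2."""
    return max(2.0, 2.0 * math.log(n))


def pnorm_classifier(examples, p, eta=1.0):
    """Run one pass over (x, y) pairs with y in {-1, +1}.

    The dual vector theta is updated additively, and only on mistakes.
    Returns the final theta and the number of mistakes made.
    """
    n = len(examples[0][0])
    theta = [0.0] * n
    mistakes = 0
    for x, y in examples:
        w = pnorm_link(theta, p)
        margin = sum(wi * xi for wi, xi in zip(w, x))
        y_hat = 1 if margin >= 0 else -1
        if y_hat != y:
            mistakes += 1
            theta = [t + eta * y * xi for t, xi in zip(theta, x)]
    return theta, mistakes
```

Under this link, p = 2 reduces to a Perceptron-style update, while larger p moves the behavior toward the multiplicative regime that the abstract associates with Winnow. A call such as `pnorm_classifier(data, suggested_p(len(data[0][0])))` would be one way to try the logarithmic-regime setting, again under the assumptions stated above.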
Angluin, D. (1988). Queries and concept learning. Machine Learning, 2:4, 319–342.
Auer, P., & Warmuth, M. K. (1998). Tracking the best disjunction. Machine Learning, 32:2, 127–150.
Auer, P., & Gentile, C. (2000). Adaptive and self-confident on-line learning algorithms. In Proc. 13th Annu. Conf. on Comput. Learning Theory (pp. 107–117). San Mateo, CA: Morgan Kaufmann.
Azoury, K., & Warmuth, M. K. (2001). Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43, 211–246.
Barzdin, J. M., & Freivald, R. V. (1972). On the prediction of general recursive functions. Soviet Math. Doklady, 13, 1224–1228.
Block, H. D. (1962). The perceptron: A model for brain functioning. Reviews of Modern Physics, 34, 123–135. Reprinted in Neurocomputing by Anderson and Rosenfeld.
Bregman, L. M. (1967). The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Physics, 7, 200–217.
Bylander, T. (1997). The binary exponentiated gradient algorithm for learning linear functions. In Proc. 8th Annu. Conf. on Comput. Learning Theory (pp. 184–192). ACM.
Cesa-Bianchi, N., Freund, Y., Helmbold, D. P., & Warmuth, M. K. (1996). On-line prediction and conversion strategies. Machine Learning, 25, 71–110.
Cesa-Bianchi, N., Freund, Y., Haussler, D., Helmbold, D. P., Schapire, R. E., & Warmuth, M. K. (1997). How to use expert advice. Journal of the ACM, 44:3, 427–485.
Cesa-Bianchi, N., Helmbold, D. P., & Panizza, S. (1998). On Bayes methods for on-line boolean prediction. Algorithmica, 22:1, 112–137.
Cesa-Bianchi, N., Long, P., & Warmuth, M. K. (1996). Worst-case quadratic loss bounds for on-line prediction of linear functions by gradient descent. IEEE Transactions on Neural Networks, 7, 604–619.
Censor, Y., & Lent, A. (1981). An iterative row-action method for interval convex programming. Journal of Optimization Theory and Applications, 34:3, 321–353.
Forster, J., & Warmuth, M. K. (2000). Relative expected instantaneous loss bounds. In Proc. 13th Annu. Conf. on Comput. Learning Theory (pp. 90–99). San Mateo, CA: Morgan Kaufmann.
Freund, Y., & Schapire, R. E. (1999). Large margin classification using the perceptron algorithm. Machine Learning, 37:3, 277–296.
Gentile, C., & Littlestone, N. (1999). The robustness of the p-norm algorithms. In Proc. 12th Annu. Conf. on Comput. Learning Theory (pp. 1–11). ACM.
Gentile, C., & Warmuth, M. K. (1999). Linear hinge loss and average margin. In Proc. Advances in Neural Information Processing Systems 11 (pp. 225–231). Cambridge, MA: MIT Press.
Gordon, G. J. (1999). Regret bounds for prediction problems. In Proc. 12th Annu. Conf. on Comput. Learning Theory (pp. 29–40). ACM.
Grove, A. J., Littlestone, N., & Schuurmans, D. (2001). General convergence results for linear discriminant updates. Machine Learning, 43:3, 173–210.
Helmbold, D. P., Kivinen, J., & Warmuth, M. K. (1999). Worst-case loss bounds for sigmoided linear neurons. IEEE Transactions on Neural Networks, 10:6, 1291–1304.
Helmbold, D. P., & Schapire, R. E. (1997). Predicting nearly as well as the best pruning of a decision tree. Machine Learning, 27, 51–68.
Helmbold, D. P., & Warmuth, M. K. (1995). On weak learning. Journal of Computer and System Sciences, 50:3, 551–573.
Herbster, M., & Warmuth, M. K. (1998a). Tracking the best expert. Machine Learning, 32:2, 151–178.
Herbster, M., & Warmuth, M. K. (1998b). Tracking the best regressor. In Proc. 11th Annu. Conf. on Comput. Learning Theory (pp. 24–31). ACM.
Jagota, A., & Warmuth, M. K. (1998). Continuous and discrete time nonlinear gradient descent: Relative loss bounds and convergence. In Electronic Proceedings of Fifth International Symposium on Artificial Intelligence and Mathematics. Electronic, http://rutcor.rutgers.edu/~amai.
Kivinen, J., & Warmuth, M. K. (1997). Additive versus exponentiated gradient updates for linear prediction. Information and Computation, 132:1, 1–64.
Kivinen, J., & Warmuth, M. K. (2001). Relative loss bounds for multidimensional regression problems. Machine Learning, 45:3, 301–329.
Kivinen, J., & Warmuth, M. K. (1999). Averaging expert predictions. In Proc. 4th European Conference on Comput. Learning Theory (pp. 153–167). Lecture Notes in Computer Science, Vol. 1572. Springer.
Littlestone, N. (1988). Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2, 285–318.
Littlestone, N. (1989a). From on-line to batch learning. In Proc. 2nd Annu. Workshop on Comput. Learning Theory (pp. 269–284). San Mateo, CA: Morgan Kaufmann.
Littlestone, N. (1989b). Mistake Bounds and Logarithmic Linear-Threshold Learning Algorithms. PhD thesis, Technical Report UCSC-CRL-89-11, University of California Santa Cruz.
Littlestone, N. (1991). Redundant noisy attributes, attribute errors, and linear threshold learning using Winnow. In Proc. 4th Annu. Workshop on Comput. Learning Theory (pp. 147–156). San Mateo, CA: Morgan Kaufmann.
Littlestone, N., & Mesterharm, C. (1997). An apobayesian relative of Winnow. In Proc. Advances in Neural Information Processing Systems 9. Cambridge, MA: MIT Press.
Littlestone, N., & Mesterharm, C. (1999). A simulation study of Winnow and related learning algorithms. Unpublished manuscript.
Littlestone, N., & Warmuth, M. K. (1994). The weighted majority algorithm. Information and Computation, 108:2, 212–261.
Maruoka, A., Takimoto, E., & Vovk, V. (1999). Predicting nearly as well as the best pruning of a decision tree through dynamic programming scheme. Theoretical Computer Science, to appear.
Novikov, A. B. J. (1962). On convergence proofs on perceptrons. In Proc. of the Symposium on the Mathematical Theory of Automata, vol. XII (pp. 615–622).
Rockafellar, R. (1970). Convex Analysis. Princeton University Press.
Rosenblatt, F. (1962). Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Washington, D.C.: Spartan Books.
Vovk, V. (1997). Competitive on-line linear regression. Technical Report CSD-TR-97-13, Department of Computer Science, Royal Holloway, University of London. Preliminary version in Proc. Advances in Neural Information Processing Systems 10 (pp. 364–370). Cambridge, MA: MIT Press.
Vovk, V. (1990). Aggregating strategies. In Proc. 3rd Annu. Workshop on Comput. Learning Theory (pp. 371–383). San Mateo, CA: Morgan Kaufmann.
Widrow, B., & Hoff, M. E. (1960). Adaptive switching circuits. 1960 IRE WESCON Conv. Record, Part 4 (pp. 96–104).
Yamanishi, K. (1998). A decision-theoretic extension of stochastic complexity and its applications to learning. IEEE Transactions on Information Theory, 44:4, 1424–1439.
Cite this article
Gentile, C. The Robustness of the p-Norm Algorithms. Machine Learning 53, 265–299 (2003). https://doi.org/10.1023/A:1026319107706
- on-line learning
- loss bounds
- learning rate
- dual norms
- on-line model selection