## Abstract

We consider two on-line learning frameworks: binary classification through linear threshold functions and linear regression. We study a family of on-line algorithms, called *p*-norm algorithms, introduced by Grove, Littlestone, and Schuurmans in the context of deterministic binary classification. We show how to adapt these algorithms to the regression setting and prove worst-case bounds on the square loss using a technique from Kivinen and Warmuth. As pointed out by Grove et al., these algorithms can be made to approach a version of the classification algorithm Winnow as *p* goes to infinity; similarly, they can be made to approach the corresponding regression algorithm EG in the limit. Winnow and EG are notable for having loss bounds that grow only logarithmically in the dimension of the instance space. Here we describe another way to use the *p*-norm algorithms to achieve this logarithmic behavior. With the usage we propose, retuning the algorithm's parameters as the learning task changes is less critical than it is for Winnow and EG. Since the correct parameter settings depend on characteristics of the learning task that are not typically known *a priori* by the learner, this gives the *p*-norm algorithms a desirable robustness. Our elaborations yield various new loss bounds in these on-line settings. Some of these bounds improve or generalize known results; others are incomparable.
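To make the classification framework concrete, the following is a minimal sketch of a mistake-driven *p*-norm algorithm, using one common parameterization of the link function (w_j = sign(θ_j)|θ_j|^(p-1) / ‖θ‖_p^(p-2)); it is an illustration, not the paper's exact formulation, and the names `pnorm_link` and `pnorm_perceptron` are ours. For *p* = 2 the link is the identity and the algorithm reduces to the classical Perceptron; for large *p* the link concentrates weight on the largest components of θ, which is the Winnow-like regime the abstract refers to.

```python
import numpy as np

def pnorm_link(theta, p):
    """Map the accumulated update vector theta to a weight vector w.

    Sketch of one common p-norm link function:
        w_j = sign(theta_j) * |theta_j|**(p-1) / ||theta||_p**(p-2).
    For p = 2 this is the identity (Perceptron); for large p it
    concentrates weight on the largest |theta_j| (Winnow-like).
    """
    norm_p = np.sum(np.abs(theta) ** p) ** (1.0 / p)
    if norm_p == 0.0:
        return np.zeros_like(theta)
    return np.sign(theta) * np.abs(theta) ** (p - 1) / norm_p ** (p - 2)

def pnorm_perceptron(X, y, p=2.0, epochs=100):
    """Mistake-driven linear threshold learning on labels y in {-1, +1}."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for x_t, y_t in zip(X, y):
            w = pnorm_link(theta, p)
            if y_t * np.dot(w, x_t) <= 0:  # prediction mistake
                theta += y_t * x_t          # additive update on theta
                mistakes += 1
        if mistakes == 0:                   # consistent with the sample
            break
    return pnorm_link(theta, p)
```

On linearly separable data the loop above stops after finitely many mistakes for any fixed *p* > 1; varying *p* trades the Perceptron-style bound against the logarithmic, Winnow-style bound discussed in the abstract.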

## References

- Angluin, D. (1988). Queries and concept learning. *Machine Learning, 2*(4), 319–342.
- Auer, P., & Warmuth, M. K. (1998). Tracking the best disjunction. *Machine Learning, 32*(2), 127–150.
- Auer, P., & Gentile, C. (2000). Adaptive and self-confident on-line learning algorithms. In *Proc. 13th Annu. Conf. on Comput. Learning Theory* (pp. 107–117). San Mateo, CA: Morgan Kaufmann.
- Azoury, K., & Warmuth, M. K. (2001). Relative loss bounds for on-line density estimation with the exponential family of distributions. *Machine Learning, 43*, 211–246.
- Barzdin, J. M., & Frievald, R. V. (1972). On the prediction of general recursive functions. *Soviet Math. Doklady, 13*, 1224–1228.
- Block, H. D. (1962). The perceptron: A model for brain functioning. *Reviews of Modern Physics, 34*, 123–135. Reprinted in *Neurocomputing* by Anderson and Rosenfeld.
- Bregman, L. M. (1967). The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. *USSR Computational Mathematics and Physics, 7*, 200–217.
- Bylander, T. (1997). The binary exponentiated gradient algorithm for learning linear functions. In *Proc. 8th Annu. Conf. on Comput. Learning Theory* (pp. 184–192). ACM.
- Cesa-Bianchi, N., Freund, Y., Helmbold, D. P., & Warmuth, M. K. (1996). On-line prediction and conversion strategies. *Machine Learning, 25*, 71–110.
- Cesa-Bianchi, N., Freund, Y., Haussler, D., Helmbold, D. P., Schapire, R. E., & Warmuth, M. K. (1997). How to use expert advice. *Journal of the ACM, 44*(3), 427–485.
- Cesa-Bianchi, N., Helmbold, D. P., & Panizza, S. (1998). On Bayes methods for on-line Boolean prediction. *Algorithmica, 22*(1), 112–137.
- Cesa-Bianchi, N., Long, P., & Warmuth, M. K. (1996). Worst-case quadratic loss bounds for on-line prediction of linear functions by gradient descent. *IEEE Transactions on Neural Networks, 7*, 604–619.
- Censor, Y., & Lent, A. (1981). An iterative row-action method for interval convex programming. *Journal of Optimization Theory and Applications, 34*(3), 321–353.
- Forster, J., & Warmuth, M. K. (2000). Relative expected instantaneous loss bounds. In *Proc. 13th Annu. Conf. on Comput. Learning Theory* (pp. 90–99). San Mateo, CA: Morgan Kaufmann.
- Freund, Y., & Schapire, R. E. (1999). Large margin classification using the perceptron algorithm. *Machine Learning, 37*(3), 277–296.
- Gentile, C., & Littlestone, N. (1999). The robustness of the *p*-norm algorithms. In *Proc. 12th Annu. Conf. on Comput. Learning Theory* (pp. 1–11). ACM.
- Gentile, C., & Warmuth, M. K. (1999). Linear hinge loss and average margin. In *Proc. Advances in Neural Information Processing Systems 11* (pp. 225–231). Cambridge, MA: MIT Press.
- Gordon, G. J. (1999). Regret bounds for prediction problems. In *Proc. 12th Annu. Conf. on Comput. Learning Theory* (pp. 29–40). ACM.
- Grove, A. J., Littlestone, N., & Schuurmans, D. (2001). General convergence results for linear discriminant updates. *Machine Learning, 43*(3), 173–210.
- Helmbold, D. P., Kivinen, J., & Warmuth, M. K. (1999). Worst-case loss bounds for sigmoided linear neurons. *IEEE Transactions on Neural Networks, 10*(6), 1291–1304.
- Helmbold, D. P., & Schapire, R. E. (1997). Predicting nearly as well as the best pruning of a decision tree. *Machine Learning, 27*, 51–68.
- Helmbold, D. P., & Warmuth, M. K. (1995). On weak learning. *Journal of Computer and System Sciences, 50*(3), 551–573.
- Herbster, M., & Warmuth, M. K. (1998a). Tracking the best expert. *Machine Learning, 32*(2), 151–178.
- Herbster, M., & Warmuth, M. K. (1998b). Tracking the best regressor. In *Proc. 11th Annu. Conf. on Comput. Learning Theory* (pp. 24–31). ACM.
- Jagota, A., & Warmuth, M. K. (1998). Continuous and discrete time nonlinear gradient descent: Relative loss bounds and convergence. In *Electronic Proceedings of the Fifth International Symposium on Artificial Intelligence and Mathematics*. http://rutcor.rutgers.edu/~amai.
- Kivinen, J., & Warmuth, M. K. (1997). Additive versus exponentiated gradient updates for linear prediction. *Information and Computation, 132*(1), 1–64.
- Kivinen, J., & Warmuth, M. K. (1999). Averaging expert predictions. In *Proc. 4th European Conference on Comput. Learning Theory* (pp. 153–167). Lecture Notes in Computer Science, Vol. 1572. Springer.
- Kivinen, J., & Warmuth, M. K. (2001). Relative loss bounds for multidimensional regression problems. *Machine Learning, 45*(3), 301–329.
- Littlestone, N. (1988). Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. *Machine Learning, 2*, 285–318.
- Littlestone, N. (1989a). From on-line to batch learning. In *Proc. 2nd Annu. Workshop on Comput. Learning Theory* (pp. 269–284). San Mateo, CA: Morgan Kaufmann.
- Littlestone, N. (1989b). Mistake bounds and logarithmic linear-threshold learning algorithms. PhD thesis, Technical Report UCSC-CRL-89-11, University of California, Santa Cruz.
- Littlestone, N. (1991). Redundant noisy attributes, attribute errors, and linear threshold learning using Winnow. In *Proc. 4th Annu. Workshop on Comput. Learning Theory* (pp. 147–156). San Mateo, CA: Morgan Kaufmann.
- Littlestone, N., & Mesterharm, C. (1997). An apobayesian relative of Winnow. In *Proc. Advances in Neural Information Processing Systems 9*. Cambridge, MA: MIT Press.
- Littlestone, N., & Mesterharm, C. (1999). A simulation study of Winnow and related learning algorithms. Unpublished manuscript.
- Littlestone, N., & Warmuth, M. K. (1994). The weighted majority algorithm. *Information and Computation, 108*(2), 212–261.
- Maruoka, A., Takimoto, E., & Vovk, V. (1999). Predicting nearly as well as the best pruning of a decision tree through dynamic programming scheme. *Theoretical Computer Science*, to appear.
- Novikov, A. B. J. (1962). On convergence proofs on perceptrons. In *Proc. of the Symposium on the Mathematical Theory of Automata, Vol. XII* (pp. 615–622).
- Rockafellar, R. (1970). *Convex Analysis*. Princeton University Press.
- Rosenblatt, F. (1962). *Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms*. Washington, D.C.: Spartan Books.
- Vovk, V. (1990). Aggregating strategies. In *Proc. 3rd Annu. Workshop on Comput. Learning Theory* (pp. 371–383). San Mateo, CA: Morgan Kaufmann.
- Vovk, V. (1997). Competitive on-line linear regression. Technical Report CSD-TR-97-13, Department of Computer Science, Royal Holloway, University of London. Preliminary version in *Proc. Advances in Neural Information Processing Systems 10* (pp. 364–370). Cambridge, MA: MIT Press.
- Widrow, B., & Hoff, M. E. (1960). Adaptive switching circuits. *1960 IRE WESCON Convention Record, Part 4* (pp. 96–104).
- Yamanishi, K. (1998). A decision-theoretic extension of stochastic complexity and its applications to learning. *IEEE Transactions on Information Theory, 44*(4), 1424–1439.