Abstract
We investigate on-line prediction of individual sequences. Given a class of predictors, the goal is to predict as well as the best predictor in the class, where the loss is measured by the self-information (logarithmic) loss function. The excess loss (regret) is closely related to the redundancy of the associated lossless universal code. Using Shtarkov's theorem and tools from empirical process theory, we prove a general upper bound on the best possible (minimax) regret. The bound depends on certain metric properties of the class of predictors. We apply the bound to both parametric and nonparametric classes of predictors. Finally, we point out a suboptimal behavior of the popular Bayesian weighted average algorithm.
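For context, the setting can be written out in standard notation (this is the usual formulation from the universal prediction literature, not an excerpt from the paper). A predictor assigns a probability $p(x^n) = \prod_{t=1}^{n} p(x_t \mid x^{t-1})$ to each sequence $x^n = (x_1, \dots, x_n)$ over a finite alphabet, and its cumulative logarithmic loss is $-\log p(x^n)$. Its worst-case regret against a class $\mathcal{F}$ is

$$ R_n(p, \mathcal{F}) \;=\; \max_{x^n} \Bigl( -\log p(x^n) \;+\; \log \sup_{f \in \mathcal{F}} f(x^n) \Bigr), $$

and Shtarkov's theorem identifies the minimax value

$$ \min_p R_n(p, \mathcal{F}) \;=\; \log \sum_{x^n} \sup_{f \in \mathcal{F}} f(x^n), $$

achieved by the normalized maximum likelihood distribution $p^*(x^n) \propto \sup_{f \in \mathcal{F}} f(x^n)$.

The Bayesian weighted average algorithm mentioned at the end of the abstract predicts with a prior-weighted mixture of the class. The following is a minimal sketch, assuming a finite class of Bernoulli experts with a uniform prior; the expert grid, the example sequence, and all function names are illustrative choices, not anything taken from the paper:

    import math

    def log_sum_exp(values):
        # numerically stable log(sum(exp(v) for v in values))
        m = max(values)
        return m + math.log(sum(math.exp(v - m) for v in values))

    def bernoulli_log_prob(p, x):
        # log-probability a Bernoulli(p) expert assigns to outcome x in {0, 1}
        return math.log(p) if x == 1 else math.log(1.0 - p)

    def bayes_mixture_loss(sequence, expert_probs):
        # cumulative log loss of the uniform-prior Bayes mixture over the experts
        n = len(expert_probs)
        log_w = [math.log(1.0 / n)] * n  # log posterior weights, start uniform
        loss = 0.0
        for x in sequence:
            # log of w_i * p_i(x) for each expert i
            joint = [lw + bernoulli_log_prob(p, x)
                     for lw, p in zip(log_w, expert_probs)]
            log_pred = log_sum_exp(joint)          # log predictive probability of x
            loss -= log_pred                       # self-information loss of the mixture
            log_w = [j - log_pred for j in joint]  # Bayes update (renormalized)
        return loss

    def best_expert_loss(sequence, expert_probs):
        # cumulative log loss of the single best expert in hindsight
        return min(-sum(bernoulli_log_prob(p, x) for x in sequence)
                   for p in expert_probs)

    experts = [0.1, 0.3, 0.5, 0.7, 0.9]   # hypothetical finite class
    seq = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]  # illustrative sequence
    mixture = bayes_mixture_loss(seq, experts)
    best = best_expert_loss(seq, experts)
    print(f"mixture loss {mixture:.4f}  best expert {best:.4f}  "
          f"regret {mixture - best:.4f}  log N = {math.log(len(experts)):.4f}")

On any binary sequence, the uniform mixture over N experts has regret at most log N against the best expert in the class; the suboptimality the abstract refers to concerns richer (parametric and nonparametric) classes, where the Bayes mixture need not attain the minimax regret.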
References
Azuma, K. (1967). Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, 19, 357–367.
Barron, A. & Xie, Q. (1996). Asymptotic minimax regret for data compression, gambling, and prediction. Unpublished manuscript presented at an informal meeting on prediction of individual sequences held at the University of California (Santa Cruz).
Cover, T. (1991). Universal portfolios. Mathematical Finance, 1, 1–29.
Cover, T. & Ordentlich, E. (1996). Universal portfolios with side information. IEEE Transactions on Information Theory, 42:2, 348–363.
Cover, T. & Thomas, J. (1991). Elements of Information Theory. New York: John Wiley and Sons.
Feder, M. (1991). Gambling using a finite state machine. IEEE Transactions on Information Theory, 37, 1459–1465.
Freund, Y. (1996). Predicting a binary sequence almost as well as the optimal biased coin. In Proceedings of the 9th Annual Conference on Computational Learning Theory (pp. 89–98).
Haussler, D. & Barron, A. (1993). How well does the Bayes method work in on-line prediction of {+1, −1} values? In Proceedings of the 3rd NEC Symposium (pp. 74–100).
Haussler, D., Kivinen, J., & Warmuth, M. (1998). Sequential prediction of individual sequences under general loss functions. IEEE Transactions on Information Theory, 44, 1906–1925.
Merhav, N. & Feder, M. (1998). Universal prediction. IEEE Transactions on Information Theory, 44:6, 2124–2147.
Opper, M. & Haussler, D. (1997). Worst case prediction over sequences under log loss. In The Mathematics of Information Coding, Extraction, and Distribution. Springer-Verlag.
Rissanen, J. (1976). Generalized Kraft's inequality and arithmetic coding. IBM Journal of Research and Development, 20, 198–203.
Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Transactions on Information Theory, 42, 40–47.
De Santis, A., Markowsky, G., & Wegman, M. (1988). Learning probabilistic prediction functions. In Proceedings of the 1st Annual Workshop on Computational Learning Theory (pp. 312–328).
Shtarkov, Y. (1987). Universal sequential coding of single messages. Problems of Information Transmission, 23:3, 3–17.
Talagrand, M. (1996). Majorizing measures: The generic chaining. Annals of Probability, 24, 1049–1103. (Special Invited Paper).
Vovk, V. (1990). Aggregating strategies. In Proceedings of the 3rd Annual Workshop on Computational Learning Theory (pp. 372–383).
Vovk, V. (1998). A game of prediction with expert advice. Journal of Computer and System Sciences, 56:2, 153–173.
Weinberger, M., Merhav, N., & Feder, M. (1994). Optimal sequential probability assignment for individual sequences. IEEE Transactions on Information Theory, 40, 384–396.
Yamanishi, K. (1995). A loss bound model for on-line stochastic algorithms. Information and Computation, 119:1, 39–54.
Yamanishi, K. (1998). A decision-theoretic extension of stochastic complexity and its application to learning. IEEE Transactions on Information Theory, 44, 1424–1440.
Cite this article
Cesa-Bianchi, N., Lugosi, G. Worst-Case Bounds for the Logarithmic Loss of Predictors. Machine Learning 43, 247–264 (2001). https://doi.org/10.1023/A:1010848128995