Machine Learning, Volume 43, Issue 3, pp 211–246

Relative Loss Bounds for On-Line Density Estimation with the Exponential Family of Distributions

  • Katy S. Azoury
  • M. K. Warmuth


We consider on-line density estimation with a parameterized density from the exponential family. The on-line algorithm receives one example at a time and maintains a parameter that is essentially an average of the past examples. After receiving an example, the algorithm incurs a loss: the negative log-likelihood of the example with respect to the algorithm's current parameter. An off-line algorithm, in contrast, can choose the best parameter based on all the examples. We prove bounds on the additional total loss of the on-line algorithm over the total loss of the best off-line parameter. These relative loss bounds hold for an arbitrary sequence of examples. The goal is to design algorithms with the best possible relative loss bounds. We use a Bregman divergence to derive and analyze each algorithm. These divergences are relative entropies between two distributions from the exponential family. We also use our methods to prove relative loss bounds for linear regression.
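The protocol described above can be illustrated with a minimal sketch for the Bernoulli distribution, a member of the exponential family. This is not the paper's exact algorithm: the smoothing constant `a` is a hypothetical choice that keeps the first prediction well-defined, while the on-line parameter is otherwise a running average of the past examples, as in the abstract. The loss on each example is its negative log-likelihood under the current parameter, and the off-line comparator is the single best parameter in hindsight (the empirical mean).

```python
import math

def online_vs_offline_bernoulli(xs, a=1.0):
    """Sketch of on-line density estimation with a Bernoulli density.

    The on-line parameter mu is a smoothed running average of past
    examples (a > 0 is an illustrative smoothing constant, not the
    paper's prescription). Returns (online_loss, offline_loss)."""
    online_loss = 0.0
    s, n = 0.0, 0
    for x in xs:
        mu = (s + a) / (n + 2 * a)       # parameter before seeing x
        p = mu if x == 1 else 1.0 - mu
        online_loss += -math.log(p)      # incur loss, then update
        s += x
        n += 1
    # Best off-line parameter: the empirical mean over all examples,
    # clamped away from 0 and 1 so the log-likelihood stays finite.
    mu_star = min(max(s / n, 1e-12), 1.0 - 1e-12)
    offline_loss = sum(
        -math.log(mu_star if x == 1 else 1.0 - mu_star) for x in xs
    )
    return online_loss, offline_loss

on_loss, off_loss = online_vs_offline_bernoulli([1, 0, 1, 1, 0, 1])
print(on_loss - off_loss)  # relative loss (regret) of the on-line algorithm
```

Because the off-line parameter minimizes the total negative log-likelihood by construction, the difference `on_loss - off_loss` is always nonnegative; the bounds in the paper control how large this gap can grow on an arbitrary example sequence.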

Keywords: relative loss bounds · worst-case loss bounds · on-line algorithms · exponential family of distributions · linear regression · Bregman divergences



Copyright information

© Kluwer Academic Publishers 2001

Authors and Affiliations

  • Katy S. Azoury, College of Business, San Francisco State University, San Francisco, USA
  • M. K. Warmuth, Computer Science Department, University of California, Santa Cruz, Santa Cruz, USA
