Unifying Divergence Minimization and Statistical Inference Via Convex Duality

  • Yasemin Altun
  • Alex Smola
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4005)


In this paper we unify divergence minimization and statistical inference by means of convex duality. In the process we show that maximum a posteriori estimation arises as a special case: it is the dual of approximate maximum entropy estimation. Moreover, our treatment leads to stability and convergence bounds for many statistical learning problems. Finally, we show how an algorithm by Zhang can be used to solve this class of optimization problems efficiently.
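The central duality claim can be sketched schematically as follows (an illustrative formulation, not the paper's exact theorem; here φ denotes a feature map, μ̃ the empirical mean of φ, p₀ a reference measure, and ‖·‖* the dual norm):

```latex
% Approximate maximum entropy estimation (primal):
\min_{p} \; \mathrm{KL}(p \,\|\, p_0)
\quad \text{s.t.} \quad
\bigl\| \mathbb{E}_{p}[\phi(x)] - \tilde{\mu} \bigr\| \le \epsilon .

% Its Fenchel dual:
\max_{\theta} \;
\langle \theta, \tilde{\mu} \rangle
- \log \int p_0(x)\, e^{\langle \theta, \phi(x) \rangle}\, dx
- \epsilon \, \|\theta\|_{*} .
```

The dual is a penalized log-likelihood in which the norm penalty plays the role of a negative log-prior, which is how maximum a posteriori estimation emerges from the relaxed maximum entropy problem.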
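The final claim, that such convex problems admit an efficient sequential greedy solver, can be illustrated with a minimal sketch in the spirit of Zhang's algorithm: at each step, mix the current iterate with the vertex that minimizes a linearization of the objective. The toy objective, the 2/(k+2) step size, and the simplex example below are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def sequential_greedy(f, grad, vertices, steps=200):
    """Minimize a smooth convex f over the convex hull of `vertices`
    by greedily mixing in one vertex per step (a sketch in the spirit
    of Zhang's sequential greedy approximation)."""
    w = vertices[0].astype(float).copy()
    for k in range(steps):
        g = grad(w)
        # pick the vertex minimizing the linearized objective <g, u>
        v = min(vertices, key=lambda u: float(g @ u))
        alpha = 2.0 / (k + 2.0)           # standard diminishing step size
        w = (1 - alpha) * w + alpha * v   # stays inside the convex hull
    return w

# toy example: project a point onto the probability simplex in R^3
target = np.array([0.2, 0.5, 0.3])
f = lambda w: 0.5 * np.sum((w - target) ** 2)
grad = lambda w: w - target
verts = [np.eye(3)[i] for i in range(3)]
w = sequential_greedy(f, grad, verts)
```

Because each iterate is a convex combination of simplex vertices, `w` remains a probability vector throughout, and the objective gap shrinks at the O(1/k) rate typical of such greedy schemes.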


Keywords: Banach space, reproducing kernel Hilbert space, Bregman divergence, convex duality, maximum entropy distribution


References

  1. Altun, Y., Hofmann, T., Smola, A.J.: Exponential families for conditional random fields. In: Uncertainty in Artificial Intelligence (UAI), pp. 2–9 (2004)
  2. Borgwardt, K., Gretton, A., Smola, A.J.: Kernel discrepancy estimation. Technical report, NICTA, Canberra (2006)
  3. Borwein, J., Zhu, Q.J.: Techniques of Variational Analysis. Springer, Heidelberg (2005)
  4. Borwein, J.M.: Semi-infinite programming: How special is it? In: Fiacco, A.V., Kortanek, K.O. (eds.) Semi-Infinite Programming and Applications. Springer, Heidelberg (1983)
  5. Bousquet, O., Boucheron, S., Lugosi, G.: Theory of classification: a survey of recent advances. ESAIM: Probability and Statistics (submitted, 2004)
  6. Bousquet, O., Elisseeff, A.: Stability and generalization. JMLR 2, 499–526 (2002)
  7. Chen, S., Rosenfeld, R.: A Gaussian prior for smoothing maximum entropy models. Technical Report CMU-CS-99-108, Carnegie Mellon University (1999)
  8. Collins, M., Schapire, R.E., Singer, Y.: Logistic regression, AdaBoost and Bregman distances. In: COLT 2000, pp. 158–169 (2000)
  9. Cortes, C., Vapnik, V.: Support vector networks. Machine Learning 20, 273–297 (1995)
  10. Dudík, M., Phillips, S.J., Schapire, R.E.: Performance guarantees for regularized maximum entropy density estimation. In: Shawe-Taylor, J., Singer, Y. (eds.) COLT 2004. LNCS, vol. 3120, pp. 472–486. Springer, Heidelberg (2004)
  11. Dudík, M., Schapire, R.E.: Maximum entropy distribution estimation with generalized regularization. In: Lugosi, G., Simon, H.U. (eds.) COLT 2006. LNCS, vol. 4005, pp. 123–138. Springer, Heidelberg (2006)
  12. Friedlander, M.P., Gupta, M.R.: On minimizing distortion and relative entropy. IEEE Transactions on Information Theory 52(1) (2006)
  13. Kivinen, J., Warmuth, M.: Boosting as entropy projection. In: COLT 1999 (1999)
  14. Lafferty, J.: Additive models, boosting, and inference for generalized divergences. In: COLT 1999, pp. 125–133. ACM Press, New York (1999)
  15. Le, Q.V., Smola, A.J., Canu, S.: Heteroscedastic Gaussian process regression. In: International Conference on Machine Learning (ICML 2005) (2005)
  16. Morozov, V.A.: Methods for Solving Incorrectly Posed Problems. Springer, Heidelberg (1984)
  17. Neal, R.: Priors for infinite networks. Technical report, University of Toronto (1994)
  18. Nemenman, I., Bialek, W.: Occam factors and model independent Bayesian learning of continuous distributions. Physical Review E 65(2), 6137 (2002)
  19. Rätsch, G., Mika, S., Warmuth, M.K.: On the convergence of leveraging. In: Advances in Neural Information Processing Systems (NIPS) (2002)
  20. Rockafellar, R.T.: Convex Analysis. Princeton University Press, Princeton (1970)
  21. Ruderman, D.L., Bialek, W.: Statistics of natural images: Scaling in the woods. Physical Review Letters (1994)
  22. Schölkopf, B., Smola, A.: Learning with Kernels. MIT Press, Cambridge (2002)
  23. Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge (2004)
  24. Tikhonov, A.N., Arsenin, V.Y.: Solutions of Ill-Posed Problems. Wiley, Chichester (1977)
  25. Wainwright, M.J., Jordan, M.I.: Graphical models, exponential families, and variational inference. Technical Report 649, UC Berkeley (September 2003)
  26. Zhang, T.: Sequential greedy approximation for certain convex optimization problems. IEEE Transactions on Information Theory 49(3), 682–691 (2003)

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  1. Yasemin Altun, Toyota Technological Institute at Chicago, Chicago, USA
  2. Alex Smola, National ICT Australia, Canberra, Australia
