Maximum Entropy Distribution Estimation with Generalized Regularization

  • Miroslav Dudík
  • Robert E. Schapire
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4005)


We present a unified and complete account of maximum entropy distribution estimation subject to constraints represented by convex potential functions or, alternatively, by convex regularization. We provide fully general performance guarantees and an algorithm with a complete convergence proof. As special cases, we can easily derive performance guarantees for many known regularization types, including \(\ell_1\), \(\ell_2\), \(\ell_2^2\) and \(\ell_1 + \ell_2^2\) style regularization. Furthermore, our general approach enables us to use information about the structure of the feature space or about sample selection bias to derive entirely new regularization functions with superior guarantees. We propose an algorithm solving a large and general subclass of generalized maxent problems, including all discussed in the paper, and prove its convergence. Our approach generalizes techniques based on information geometry and Bregman divergences as well as those based more directly on compactness.
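
As an illustration of the \(\ell_1\)-regularized special case, the following is a minimal sketch and not the algorithm proposed in the paper. It assumes a finite domain, a uniform default distribution, and features scaled to [0, 1], and it swaps in generic proximal gradient descent on the regularized log loss of a Gibbs distribution. The names `gibbs`, `l1_maxent`, `step`, and `iters`, as well as the toy data, are hypothetical.

```python
import numpy as np

def gibbs(lam, F):
    """Gibbs distribution q(x) proportional to exp(lam . f(x)), uniform base measure."""
    scores = F @ lam
    scores = scores - scores.max()      # stabilize the exponentials
    w = np.exp(scores)
    return w / w.sum()

def l1_maxent(F, emp_mean, beta, step=1.0, iters=5000):
    """Minimize ln Z(lam) - lam . emp_mean + beta * ||lam||_1 by proximal gradient.

    F        : (num_points, num_features) feature matrix, entries assumed in [0, 1]
    emp_mean : empirical feature averages from the sample
    beta     : regularization weight (assumed equal for all features)
    step     : fixed step size (an assumption; must be small enough for convergence)
    """
    lam = np.zeros(F.shape[1])
    for _ in range(iters):
        q = gibbs(lam, F)
        grad = F.T @ q - emp_mean       # model expectation minus empirical average
        z = lam - step * grad           # gradient step on the smooth part
        lam = np.sign(z) * np.maximum(np.abs(z) - step * beta, 0.0)  # soft threshold
    return lam, gibbs(lam, F)

# Toy problem: five domain points, two features (hypothetical data).
F = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.5, 0.5]])
emp_mean = np.array([0.8, 0.3])
beta = 0.05
lam, q = l1_maxent(F, emp_mean, beta)
print("lambda:", lam)
print("model feature expectations:", F.T @ q)
```

At the minimizer, each model feature expectation lies within `beta` of its empirical average, which is the relaxed box constraint that \(\ell_1\) regularization corresponds to.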


Keywords: Relative Entropy, Performance Guarantee, Generalize Regularization, Gibbs Distribution, Dual Objective





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Miroslav Dudík (1)
  • Robert E. Schapire (1)
  1. Department of Computer Science, Princeton University, Princeton
