The Generalized Cross Entropy Method, with Applications to Probability Density Estimation

  • Zdravko I. Botev
  • Dirk P. Kroese


Nonparametric density estimation aims to determine the sparsest model that explains a given set of empirical data while making as few assumptions as possible. Many existing methods do not provide a sparse solution to the problem and rely on asymptotic approximations. In this paper we describe a framework for density estimation that uses information-theoretic measures of model complexity, with the aim of constructing a sparse density estimator that does not rely on large-sample approximations. The effectiveness of the approach is demonstrated through an application to some well-known density estimation test cases.


Cross entropy; Information theory; Monte Carlo simulation; Statistical modeling; Kernel smoothing; Functional optimization; Bandwidth selection; Calculus of variations

AMS 2000 Subject Classifications

Primary 94A17 60K35; Secondary 68Q32 93E14 





Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  1. Department of Mathematics, The University of Queensland, Brisbane, Australia
