The Generalized Cross Entropy Method, with Applications to Probability Density Estimation

Abstract

Nonparametric density estimation aims to determine the sparsest model that explains a given set of empirical data while making as few assumptions as possible. Many existing methods do not provide a sparse solution to the problem and rely on asymptotic approximations. In this paper we describe a framework for density estimation that uses information-theoretic measures of model complexity to construct a sparse density estimator that does not rely on large-sample approximations. The effectiveness of the approach is demonstrated on some well-known density estimation test cases.

Author information

Corresponding author

Correspondence to Dirk P. Kroese.

Additional information

Supported by the Australian Research Council, under grant number DP0985177.

About this article

Cite this article

Botev, Z.I., Kroese, D.P. The Generalized Cross Entropy Method, with Applications to Probability Density Estimation. Methodol Comput Appl Probab 13, 1–27 (2011). https://doi.org/10.1007/s11009-009-9133-7

  • DOI: https://doi.org/10.1007/s11009-009-9133-7

Keywords

  • Cross entropy
  • Information theory
  • Monte Carlo simulation
  • Statistical modeling
  • Kernel smoothing
  • Functional optimization
  • Bandwidth selection
  • Calculus of variations

AMS 2000 Subject Classifications

  • Primary 94A17, 60K35
  • Secondary 68Q32, 93E14