The Noise Component in Model-based Cluster Analysis

  • Christian Hennig
  • Pietro Coretto
Part of the Studies in Classification, Data Analysis, and Knowledge Organization book series (STUDIES CLASS)

Abstract

The so-called noise component was introduced by Banfield and Raftery (1993) to improve the robustness of cluster analysis based on the normal mixture model. The idea is to add a uniform distribution over the convex hull of the data as an additional mixture component. While this yields good results in many practical applications, the original proposal has some problems: 1) As shown by Hennig (2004), the method is not breakdown-robust. 2) The original approach does not define a proper ML estimator and does not have satisfactory asymptotic properties.
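As a reading aid, the mixture density with a noise component described above can be written out as follows. The notation is an assumption made here for illustration and is not taken from the chapter: s normal components with proportions \pi_j, means \mu_j and covariance matrices \Sigma_j, the normal density \varphi, the convex hull C of the data, and its volume V(C).

```latex
f(x) = \pi_0 \, \frac{\mathbf{1}\{x \in C\}}{V(C)}
     + \sum_{j=1}^{s} \pi_j \, \varphi(x; \mu_j, \Sigma_j),
\qquad \sum_{j=0}^{s} \pi_j = 1, \quad \pi_j \ge 0 .
```

Because C is computed from the observed data, the first term depends on the sample itself, which is one way to see why the abstract says that this approach does not define a proper ML estimator.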

We discuss two alternatives. The first one consists of replacing the uniform distribution by a fixed constant, modelling an improper uniform distribution that does not depend on the data. This can be proven to be more robust, though the choice of the involved tuning constant is tricky. The second alternative is to approximate the ML estimator of a mixture of normals with a uniform distribution more precisely than the “convex hull” approach does. The approaches are compared by simulations and on a real data example.
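The first alternative can be illustrated with a minimal sketch of EM-type iterations for a one-dimensional normal mixture in which the noise component is a fixed constant c instead of a uniform density. This is not the authors' implementation: the function name `em_improper_noise`, the value c = 0.01, the lower bound `sd_min` on the standard deviations, and the simulated data are all assumptions chosen only for illustration.

```python
import numpy as np


def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))


def em_improper_noise(x, means, sds, c=0.01, n_iter=200, sd_min=0.05):
    """EM-type iterations for a 1-D normal mixture plus a constant noise 'density' c.

    Hypothetical helper: c is the fixed improper density value and sd_min is an
    assumed lower bound that keeps the standard deviations away from zero.
    """
    x = np.asarray(x, dtype=float)
    means = np.array(means, dtype=float)
    sds = np.array(sds, dtype=float)
    k = len(means)
    props = np.full(k + 1, 1.0 / (k + 1))  # props[0] is the noise proportion
    post = np.empty((k + 1, x.size))
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point.
        dens = np.empty((k + 1, x.size))
        dens[0] = props[0] * c  # constant, independent of the data
        for j in range(k):
            dens[j + 1] = props[j + 1] * normal_pdf(x, means[j], sds[j])
        post = dens / dens.sum(axis=0)
        # M-step: update proportions, means and standard deviations.
        weights = np.maximum(post.sum(axis=1), 1e-12)
        props = weights / weights.sum()
        for j in range(k):
            w = post[j + 1]
            means[j] = np.sum(w * x) / weights[j + 1]
            var = np.sum(w * (x - means[j]) ** 2) / weights[j + 1]
            sds[j] = max(np.sqrt(var), sd_min)
    return props, means, sds, post


# Simulated example: two normal clusters and one extreme outlier.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(10.0, 1.0, 100), [1000.0]])
props, means, sds, post = em_improper_noise(x, means=[-1.0, 11.0], sds=[2.0, 2.0])
noise_prob_outlier = post[0, -1]  # posterior noise probability of the outlier
```

Because the constant c does not shrink as a point moves away from the clusters, the outlier is absorbed by the noise component rather than pulling a normal component towards it. The results depend on c, which reflects the remark in the abstract that choosing this tuning constant is tricky.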

Keywords

Mixture Model, Mixture Component, Noise Component, Breakdown Point, Extreme Outlier
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

References

  1. BANFIELD, J. D. and RAFTERY, A. E. (1993): Model-Based Gaussian and Non-Gaussian Clustering. Biometrics, 49, 803-821.
  2. CAMPBELL, N. A. (1984): Mixture models and atypical values. Mathematical Geology, 16, 465-477.
  3. CORETTO, P. and HENNIG, C. (2006): Identifiability for mixtures of distributions from a location-scale family with uniforms. DISES Working Papers No. 3.186, University of Salerno.
  4. CORETTO, P. and HENNIG, C. (2007): Choice of the improper density in robust improper ML for finite normal mixtures. Submitted.
  5. CUESTA-ALBERTOS, J. A., GORDALIZA, A. and MATRAN, C. (1997): Trimmed k-means: An Attempt to Robustify Quantizers. Annals of Statistics, 25, 553-576.
  6. DONOHO, D. L. and HUBER, P. J. (1983): The notion of breakdown point. In P. J. Bickel, K. Doksum, and J. L. Hodges jr. (Eds.): A Festschrift for Erich L. Lehmann, Wadsworth, Belmont, CA, 157-184.
  7. FRALEY, C. and RAFTERY, A. E. (1998): How Many Clusters? Which Clustering Method? Answers Via Model Based Cluster Analysis. Computer Journal, 41, 578-588.
  8. HATHAWAY, R. J. (1985): A constrained formulation of maximum-likelihood estimates for normal mixture distributions. Annals of Statistics, 13, 795-800.
  9. HENNIG, C. (2004): Breakdown points for maximum likelihood-estimators of location-scale mixtures. Annals of Statistics, 32, 1313-1340.
  10. MCLACHLAN, G. J. and PEEL, D. (2000): Finite Mixture Models. Wiley, New York.
  11. REDNER, R. A. and WALKER, H. F. (1984): Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26, 195-239.
  12. SCHWARZ, G. (1978): Estimating the dimension of a model. Annals of Statistics, 6, 461-464.

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Christian Hennig (1)
  • Pietro Coretto (2)

  1. Department of Statistical Science, University College London, London, UK
  2. Dipartimento di Scienze Economiche e Statistiche, Università degli Studi di Salerno, Fisciano, Italy
