The Noise Component in Model-based Cluster Analysis

  • Conference paper
Data Analysis, Machine Learning and Applications

Abstract

The so-called noise component was introduced by Banfield and Raftery (1993) to improve the robustness of cluster analysis based on the normal mixture model. The idea is to add a uniform distribution over the convex hull of the data as an additional mixture component. While this yields good results in many practical applications, the original proposal has two problems: 1) As shown by Hennig (2004), the method is not breakdown-robust. 2) The original approach does not define a proper ML estimator and does not have satisfactory asymptotic properties.
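For orientation, the mixture with a noise component described above can be written as follows. This is a sketch in notation assumed here, not taken from the paper: $C_n$ denotes the convex hull of the $n$ observations, $V(C_n)$ its volume, $\varphi$ the normal density, and $\pi_0$ the proportion of the noise component.

```latex
\[
f(x) \;=\; \pi_0 \,\frac{\mathbf{1}\{x \in C_n\}}{V(C_n)}
      \;+\; \sum_{j=1}^{k} \pi_j \,\varphi(x;\, \mu_j, \Sigma_j),
\qquad \pi_0 + \sum_{j=1}^{k} \pi_j = 1, \quad \pi_j \ge 0 .
\]
```

Because $C_n$ is computed from the sample, the first term changes whenever the data change, for instance when a single extreme observation enlarges the hull.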

We discuss two alternatives. The first replaces the uniform distribution with a fixed constant, modelling an improper uniform distribution that does not depend on the data. This can be proven to be more robust, though choosing the tuning constant involved is difficult. The second alternative is to approximate the ML estimator of a mixture of normals with a uniform distribution more precisely than the “convex hull” approach does. The approaches are compared in simulations and on a real data example.
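The difference between the convex hull density and a fixed constant can be illustrated with a minimal one-dimensional sketch. It is not the authors' implementation: the function names, the parameter values, and the constant `c = 0.01` are hypothetical and chosen only for illustration. In one dimension the convex hull is the interval from the smallest to the largest observation.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def hull_density(xs):
    """Uniform density over the 1-D convex hull [min, max] of the data.
    It depends on the data: one extreme point changes its value."""
    return 1.0 / (max(xs) - min(xs))

def noise_posteriors(xs, noise_weight, weights, means, sds, c):
    """Posterior probability that each point belongs to the noise
    component when the noise 'density' is the fixed constant c
    (an improper uniform; c is a tuning constant, assumed here)."""
    result = []
    for x in xs:
        noise = noise_weight * c
        clusters = sum(w * normal_pdf(x, m, s)
                       for w, m, s in zip(weights, means, sds))
        result.append(noise / (noise + clusters))
    return result

# Hypothetical data: three points near two clusters and one gross outlier.
xs = [0.0, 0.5, 5.2, 40.0]
post = noise_posteriors(xs, noise_weight=0.1,
                        weights=[0.45, 0.45], means=[0.0, 5.0],
                        sds=[1.0, 1.0], c=0.01)

# The hull density shrinks as the outlier moves further away,
# while the fixed constant c is unaffected.
d_near = hull_density([0.0, 0.5, 5.2, 40.0])
d_far = hull_density([0.0, 0.5, 5.2, 4000.0])
```

With these illustrative values the point at 40 is assigned almost entirely to the noise component, while the points near the cluster means are not. Moving the outlier from 40 to 4000 lowers the hull density by roughly a factor of 100 but leaves `c` unchanged.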


References

  • BANFIELD, J. D. and RAFTERY, A. E. (1993): Model-Based Gaussian and Non-Gaussian Clustering. Biometrics, 49, 803-821.

  • CAMPBELL, N. A. (1984): Mixture models and atypical values. Mathematical Geology, 16, 465-477.

  • CORETTO, P. and HENNIG, C. (2006): Identifiability for mixtures of distributions from a location-scale family with uniforms. DISES Working Papers No. 3.186, University of Salerno.

  • CORETTO, P. and HENNIG, C. (2007): Choice of the improper density in robust improper ML for finite normal mixtures. Submitted.

  • CUESTA-ALBERTOS, J. A., GORDALIZA, A. and MATRAN, C. (1997): Trimmed k-means: An Attempt to Robustify Quantizers. Annals of Statistics, 25, 553-576.

  • DONOHO, D. L. and HUBER, P. J. (1983): The notion of breakdown point. In P. J. Bickel, K. Doksum, and J. L. Hodges jr. (Eds.): A Festschrift for Erich L. Lehmann, Wadsworth, Belmont, CA, 157-184.

  • FRALEY, C. and RAFTERY, A. E. (1998): How Many Clusters? Which Clustering Method? Answers Via Model Based Cluster Analysis. Computer Journal, 41, 578-588.

  • HATHAWAY, R. J. (1985): A constrained formulation of maximum-likelihood estimates for normal mixture distributions. Annals of Statistics, 13, 795-800.

  • HENNIG, C. (2004): Breakdown points for maximum likelihood-estimators of location-scale mixtures. Annals of Statistics, 32, 1313-1340.

  • MCLACHLAN, G. J. and PEEL, D. (2000): Finite Mixture Models. Wiley, New York.

  • REDNER, R. A. and WALKER, H. F. (1984): Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26, 195-239.

  • SCHWARZ, G. (1978): Estimating the dimension of a model. Annals of Statistics, 6, 461-464.

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

Cite this paper

Hennig, C., Coretto, P. (2008). The Noise Component in Model-based Cluster Analysis. In: Preisach, C., Burkhardt, H., Schmidt-Thieme, L., Decker, R. (eds) Data Analysis, Machine Learning and Applications. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78246-9_16
