The Noise Component in Model-based Cluster Analysis

Hennig, Christian; Coretto, Pietro

doi:10.1007/978-3-540-78246-9_16


Part of the book series: Studies in Classification, Data Analysis, and Knowledge Organization (STUDIES CLASS)


Abstract

The so-called noise component was introduced by Banfield and Raftery (1993) to improve the robustness of cluster analysis based on the normal mixture model. The idea is to add a uniform distribution over the convex hull of the data as an additional mixture component. While this yields good results in many practical applications, the original proposal has two problems: 1) as shown by Hennig (2004), the method is not breakdown-robust; 2) the original approach does not define a proper ML estimator and does not have satisfactory asymptotic properties.
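
For orientation, the mixture density described above can be written as follows. The notation is an assumption made here for illustration and is not taken from the chapter: k is the number of normal components, \varphi is the normal density, C_n is the convex hull of the n observations, and V_n is its volume.

```latex
f(x) \;=\; \pi_0 \,\frac{\mathbf{1}\{x \in C_n\}}{V_n}
      \;+\; \sum_{j=1}^{k} \pi_j \,\varphi(x;\, \mu_j, \Sigma_j),
\qquad
\pi_0 + \sum_{j=1}^{k} \pi_j = 1, \quad \pi_j \ge 0 .
```

In this form the uniform term is computed from the sample itself through C_n and V_n, which appears to be the source of the second problem: the model being fitted changes with the data.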

We discuss two alternatives. The first replaces the uniform distribution with a fixed constant, modelling an improper uniform distribution that does not depend on the data. This can be proven to be more robust, though choosing the tuning constant involved is tricky. The second alternative is to approximate the ML estimator of a mixture of normals with a uniform distribution more precisely than the “convex hull” approach does. The approaches are compared by simulations and on a real data example.
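
A minimal Python sketch of the first alternative may help make it concrete. It is not the authors' implementation: it is a one-dimensional EM-style iteration in which the noise component has the fixed improper density `c` everywhere. The function names, the quantile-based starting values, the lower bound `min_sd` on the standard deviations, and the fixed iteration count are all assumptions made for illustration.

```python
import numpy as np


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) evaluated at x."""
    z = (x - mean) / sd
    return np.exp(-0.5 * z * z) / (sd * np.sqrt(2.0 * np.pi))


def em_improper_noise(x, k, c, n_iter=200, min_sd=1e-3):
    """EM-style fit of k univariate normals plus a constant 'noise density' c.

    Hypothetical helper: index 0 of `props` and `resp` is the noise component.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    # Starting values (an arbitrary choice): evenly spaced quantiles as means
    # and a common IQR-based scale, so that outliers have little influence.
    means = np.quantile(x, (np.arange(k) + 0.5) / k)
    q25, q75 = np.quantile(x, [0.25, 0.75])
    sds = np.full(k, max((q75 - q25) / (1.349 * k), min_sd))
    props = np.full(k + 1, 1.0 / (k + 1))

    for _ in range(n_iter):
        # E-step: weighted densities; the noise row is the constant props[0] * c.
        dens = np.empty((k + 1, n))
        dens[0] = props[0] * c
        for j in range(k):
            dens[j + 1] = props[j + 1] * normal_pdf(x, means[j], sds[j])
        resp = dens / dens.sum(axis=0)

        # M-step: proportions for all components, mean and sd for the normals.
        totals = np.maximum(resp.sum(axis=1), 1e-12)  # guard against empty components
        props = totals / n
        for j in range(k):
            w = resp[j + 1]
            means[j] = np.sum(w * x) / totals[j + 1]
            var = np.sum(w * (x - means[j]) ** 2) / totals[j + 1]
            sds[j] = max(np.sqrt(var), min_sd)  # keep scales away from zero

    # resp holds the responsibilities from the final E-step.
    return props, means, sds, resp
```

Calling `em_improper_noise(x, k=2, c=0.01)` returns the proportions, means, standard deviations, and responsibilities; observations whose largest responsibility is in row 0 would be classified as noise. The result depends strongly on `c`, which is the tuning constant whose choice the abstract describes as tricky.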


References

BANFIELD, J. D. and RAFTERY, A. E. (1993): Model-Based Gaussian and Non-Gaussian Clustering. Biometrics, 49, 803-821.
CAMPBELL, N. A. (1984): Mixture models and atypical values. Mathematical Geology, 16, 465-477.
CORETTO, P. and HENNIG, C. (2006): Identifiability for mixtures of distributions from a location-scale family with uniforms. DISES Working Papers No. 3.186, University of Salerno.
CORETTO, P. and HENNIG, C. (2007): Choice of the improper density in robust improper ML for finite normal mixtures. Submitted.
CUESTA-ALBERTOS, J. A., GORDALIZA, A. and MATRAN, C. (1997): Trimmed k-means: An Attempt to Robustify Quantizers. Annals of Statistics, 25, 553-576.
DONOHO, D. L. and HUBER, P. J. (1983): The notion of breakdown point. In P. J. Bickel, K. Doksum, and J. L. Hodges jr. (Eds.): A Festschrift for Erich L. Lehmann, Wadsworth, Belmont, CA, 157-184.
FRALEY, C. and RAFTERY, A. E. (1998): How Many Clusters? Which Clustering Method? Answers Via Model Based Cluster Analysis. Computer Journal, 41, 578-588.
HATHAWAY, R. J. (1985): A constrained formulation of maximum-likelihood estimates for normal mixture distributions. Annals of Statistics, 13, 795-800.
HENNIG, C. (2004): Breakdown points for maximum likelihood-estimators of location-scale mixtures. Annals of Statistics, 32, 1313-1340.
MCLACHLAN, G. J. and PEEL, D. (2000): Finite Mixture Models. Wiley, New York.
REDNER, R. A. and WALKER, H. F. (1984): Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26, 195-239.
SCHWARZ, G. (1978): Estimating the dimension of a model. Annals of Statistics, 6, 461-464.


Author information

Authors and Affiliations

Department of Statistical Science, University College London, Gower St, London, WC1E 6BT, UK
Christian Hennig
Dipartimento di Scienze Economiche e Statistiche, Università degli Studi di Salerno, 84084, Fisciano, SA, Italy
Pietro Coretto


Editor information

Editors and Affiliations

Institute of Computer Science and Institute of Business Economics and Information Systems, University of Hildesheim, Marienburgerplatz 22, 31141, Hildesheim, Germany
Christine Preisach
Lehrstuhl für Mustererkennung und Bildverarbeitung, Universität Freiburg, Gebäude 052, 79110, Freiburg i. Br, Germany
Hans Burkhardt
Institute of Computer Science and Institute of Business Economics and Information Systems, Marienburgerplatz 22, 31141, Hildesheim, Germany
Lars Schmidt-Thieme
Fakultät für Wirtschaftswissenschaften, Lehrstuhl für Betriebswirtschaftslehre, insbes. Marketing, Universitätsstraße 25, 33615, Bielefeld, Germany
Reinhold Decker


Cite this paper

Hennig, C., Coretto, P. (2008). The Noise Component in Model-based Cluster Analysis. In: Preisach, C., Burkhardt, H., Schmidt-Thieme, L., Decker, R. (eds) Data Analysis, Machine Learning and Applications. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78246-9_16


DOI: https://doi.org/10.1007/978-3-540-78246-9_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-78239-1
Online ISBN: 978-3-540-78246-9
eBook Packages: Mathematics and Statistics (R0)
