
Finding the Number of Disparate Clusters with Background Contamination

  • Anthony C. Atkinson
  • Andrea Cerioli (corresponding author)
  • Gianluca Morelli
  • Marco Riani
Part of the Studies in Classification, Data Analysis, and Knowledge Organization book series (STUDIES CLASS)

Abstract

The Forward Search is used in an exploratory manner, with many random starts, to indicate the number of clusters and their membership in continuous data. The prospective clusters can readily be distinguished from background noise and from other forms of outliers. A confirmatory Forward Search, with control on the sizes of the statistical tests, then establishes precise cluster membership. The method performs as well as robust methods such as TCLUST, but it requires neither prior specification of the number of clusters nor of the level of trimming of outliers. In this way it is "user-friendly".
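To make the exploratory step concrete, the sketch below runs a single random-start Forward Search trajectory on continuous data and records, at each subset size, the minimum Mahalanobis distance of the observations outside the growing subset. It is a minimal illustration only: the function name forward_search, the starting subset size m0, and the small ridge added to the covariance matrix are assumptions of this sketch, not part of the paper or of the FSDA MATLAB toolbox (Riani et al. 2012) that implements the authors' procedure.

import numpy as np

def forward_search(X, m0=5, seed=None):
    """One random-start Forward Search trajectory.

    Returns, for each subset size m = m0, ..., n-1, the minimum
    Mahalanobis distance of the observations outside the subset.
    Peaks in this trajectory hint at cluster boundaries or outliers.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    subset = rng.choice(n, size=m0, replace=False)  # random starting subset
    min_dist = []
    for m in range(m0, n):
        mu = X[subset].mean(axis=0)
        cov = np.cov(X[subset], rowvar=False) + 1e-8 * np.eye(p)  # ridge guards against singularity
        inv = np.linalg.inv(cov)
        diff = X - mu
        d2 = np.einsum("ij,jk,ik->i", diff, inv, diff)  # squared Mahalanobis distances
        outside = np.setdiff1d(np.arange(n), subset)
        min_dist.append(np.sqrt(d2[outside].min()))
        subset = np.argsort(d2)[:m + 1]  # grow subset to the m+1 observations closest to the current fit
    return np.array(min_dist)

# With many random starts, trajectories that remain low for a long stretch
# before a sharp peak point to tentative clusters:
# trajectories = [forward_search(X, seed=s) for s in range(200)]

Monitoring such trajectories against simulation envelopes, as in Atkinson et al. (2006) and Riani et al. (2009), is what turns the exploratory search into a formal indication of the number of clusters.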

Keywords

Mahalanobis distance · Outlier detection · Random start · Forward Search · Background contamination

Notes

Acknowledgements

We are very grateful to Berthold Lausen and Matthias Bömher for their scientific and organizational support during the European Conference on Data Analysis 2013. We also thank an anonymous reviewer for careful reading of an earlier draft, and for pointing out the reference to Hennig and Christlieb (2002). Our work on this paper was partly supported by the project MIUR PRIN “MISURA—Multivariate Models for Risk Assessment”.

References

  1. Atkinson, A. C., & Riani, M. (2007). Exploratory tools for clustering multivariate data. Computational Statistics and Data Analysis, 52, 272–285.
  2. Atkinson, A. C., Riani, M., & Cerioli, A. (2004). Exploring multivariate data with the forward search. New York: Springer.
  3. Atkinson, A. C., Riani, M., & Cerioli, A. (2006). Random start forward searches with envelopes for detecting clusters in multivariate data. In S. Zani, A. Cerioli, M. Riani, & M. Vichi (Eds.), Data analysis, classification and the forward search (pp. 163–171). Berlin: Springer.
  4. Cerioli, A., & Perrotta, D. (2014). Robust clustering around regression lines with high density regions. Advances in Data Analysis and Classification, 8, 5–26.
  5. Coretto, P., & Hennig, C. (2010). A simulation study to compare robust clustering methods based on mixtures. Advances in Data Analysis and Classification, 4, 111–135.
  6. Fowlkes, E. B., Gnanadesikan, R., & Kettenring, J. R. (1988). Variable selection in clustering. Journal of Classification, 5, 205–228.
  7. Fraley, C., & Raftery, A. E. (2002). Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association, 97, 611–631.
  8. Fritz, H., García-Escudero, L. A., & Mayo-Iscar, A. (2012). TCLUST: An R package for a trimming approach to cluster analysis. Journal of Statistical Software, 47, 1–26.
  9. Gallegos, M. T., & Ritter, G. (2009). Trimming algorithms for clustering contaminated grouped data and their robustness. Advances in Data Analysis and Classification, 3, 135–167.
  10. García-Escudero, L. A., Gordaliza, A., Matrán, C., & Mayo-Iscar, A. (2008). A general trimming approach to robust cluster analysis. Annals of Statistics, 36, 1324–1345.
  11. García-Escudero, L. A., Gordaliza, A., Matrán, C., & Mayo-Iscar, A. (2010). A review of robust clustering methods. Advances in Data Analysis and Classification, 4, 89–109.
  12. García-Escudero, L. A., Gordaliza, A., Matrán, C., & Mayo-Iscar, A. (2011). Exploring the number of groups in model-based clustering. Statistics and Computing, 21, 585–599.
  13. Hennig, C., & Christlieb, N. (2002). Validating visual clusters in large datasets: Fixed point clusters of spectral features. Computational Statistics and Data Analysis, 40, 723–739.
  14. Kaufman, L., & Rousseeuw, P. J. (1990). Finding groups in data: An introduction to cluster analysis. New York: Wiley.
  15. Lee, S. X., & McLachlan, G. J. (2013). Model-based clustering and classification with non-normal mixture distributions. Statistical Methods and Applications, 22, 427–454.
  16. Milligan, G. W. (1980). An examination of the effect of six types of error perturbation on fifteen clustering algorithms. Psychometrika, 45, 325–342.
  17. Morelli, G. (2013). A comparison of different classification methods. Ph.D. dissertation, Università di Parma.
  18. Riani, M., Atkinson, A. C., & Cerioli, A. (2009). Finding an unknown number of multivariate outliers. Journal of the Royal Statistical Society, Series B, 71, 447–466.
  19. Riani, M., Perrotta, D., & Torti, F. (2012). FSDA: A MATLAB toolbox for robust analysis and interactive data exploration. Chemometrics and Intelligent Laboratory Systems, 116, 17–32.
  20. Riani, M., Atkinson, A. C., & Perrotta, D. (2014). A parametric framework for the comparison of methods of very robust regression. Statistical Science, 29, 128–143.

Copyright information

© Springer-Verlag Berlin Heidelberg 2015

Authors and Affiliations

  • Anthony C. Atkinson (1)
  • Andrea Cerioli (2), corresponding author
  • Gianluca Morelli (2)
  • Marco Riani (2)
  1. Department of Statistics, London School of Economics, London, UK
  2. Dipartimento di Economia, Università di Parma, Parma, Italy
