Finding the Number of Disparate Clusters with Background Contamination
The Forward Search is used in an exploratory manner, with many random starts, to indicate the number of clusters and their membership in continuous data. The prospective clusters can readily be distinguished from background noise and from other forms of outliers. A confirmatory Forward Search, involving control on the sizes of statistical tests, establishes precise cluster membership. The method performs as well as robust methods such as TCLUST. However, it does not require prior specification of the number of clusters, nor of the level of trimming of outliers. In this way it is “user friendly”.
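As a rough illustration of the exploratory stage, the sketch below runs a Forward Search from several random starting subsets and records, at each step, the minimum Mahalanobis distance among the units not yet in the subset. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the initial subset size of p + 1, and the use of a pseudo-inverse are all choices made here for illustration, and the simulation envelopes and the confirmatory search with controlled test sizes are omitted.

```python
import numpy as np


def forward_search_trajectory(X, start_idx):
    """Grow a subset one unit at a time from `start_idx` (hypothetical helper).

    Returns the minimum Mahalanobis distance among units outside the
    subset, recorded once for every subset size from len(start_idx) to n - 1.
    """
    n, _ = X.shape
    subset = np.asarray(start_idx)
    min_dists = []
    while len(subset) < n:
        m = len(subset)
        mu = X[subset].mean(axis=0)
        cov = np.cov(X[subset], rowvar=False)
        # Assumption: the pseudo-inverse guards against near-singular
        # covariance estimates from very small subsets.
        inv = np.linalg.pinv(cov)
        diff = X - mu
        d2 = np.einsum("ij,jk,ik->i", diff, inv, diff)
        d2 = np.maximum(d2, 0.0)  # clip tiny negative rounding errors
        outside = np.ones(n, dtype=bool)
        outside[subset] = False
        min_dists.append(np.sqrt(d2[outside].min()))
        # The next subset holds the m + 1 units closest to the current fit.
        subset = np.argsort(d2)[: m + 1]
    return np.array(min_dists)


def random_start_forward_search(X, n_starts=20, seed=0):
    """Run one trajectory per random start; rows are starts, columns are steps."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    return np.array([
        forward_search_trajectory(X, rng.choice(n, size=p + 1, replace=False))
        for _ in range(n_starts)
    ])
```

Plotting the rows of the returned array against subset size gives the kind of display used to look for clusters: a search that starts inside one cluster tends to show a sharp rise in the minimum distance as that cluster is exhausted, which suggests both the presence of a cluster and its approximate size.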
Keywords: Mahalanobis Distance, Outlier Detection, Random Start, Forward Search, Background Contamination
We are very grateful to Berthold Lausen and Matthias Böhmer for their scientific and organizational support during the European Conference on Data Analysis 2013. We also thank an anonymous reviewer for careful reading of an earlier draft, and for pointing out the reference to Hennig and Christlieb (2002). Our work on this paper was partly supported by the project MIUR PRIN “MISURA—Multivariate Models for Risk Assessment”.
- Atkinson, A. C., Riani, M., & Cerioli, A. (2006). Random start forward searches with envelopes for detecting clusters in multivariate data. In S. Zani, A. Cerioli, M. Riani, & M. Vichi (Eds.), Data Analysis, Classification and the Forward Search (pp. 163–171). Berlin: Springer.
- Fritz, H., García-Escudero, L. A., & Mayo-Iscar, A. (2012). TCLUST: An R package for a trimming approach to cluster analysis. Journal of Statistical Software, 47, 1–26.
- Kaufman, L., & Rousseeuw, P. J. (1990). Finding groups in data. An introduction to cluster analysis. New York: Wiley.
- Morelli, G. (2013). A comparison of different classification methods. Ph.D. dissertation, Università di Parma.