Exploratory Data Analysis in Empirical Research pp 216-226
k-Means Clustering with Outlier Detection, Mixed Variables and Missing Values
Conference paper
Abstract
This paper addresses practical issues in k-means cluster analysis or segmentation with mixed types of variables and missing values. A more general k-means clustering procedure is developed that is suitable for use with very large datasets, such as arise in data mining and survey analysis. An exact assignment test guarantees that the algorithm will converge, and the detection of outliers allows the densest regions of the sample space to be mapped by tessellations of tightly-specified spherical clusters. A summary tree is obtained for the resulting k-cluster partition.
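The abstract does not give the algorithm itself, so the following is only a minimal sketch of the general idea, under stated assumptions: numeric variables only (the paper's treatment of mixed variable types is not reproduced), missing values marked with `None`, a distance averaged over the variables observed in both case and centre, and a hypothetical `threshold` parameter above which a case is set aside as an outlier. The function names are illustrative, and the paper's exact assignment test is not implemented here; this is a plain Lloyd-style iteration.

```python
import math


def masked_distance(case, centre):
    """Mean squared difference over variables present in both vectors.

    None marks a missing value. Averaging over the observed variables
    only is an assumption of this sketch, not the paper's stated formula.
    """
    pairs = [(x, c) for x, c in zip(case, centre)
             if x is not None and c is not None]
    if not pairs:
        return math.inf  # nothing comparable: treat as infinitely far
    return sum((x - c) ** 2 for x, c in pairs) / len(pairs)


def update_centre(members, n_vars):
    """Per-variable mean over the members that have that variable observed."""
    centre = []
    for j in range(n_vars):
        observed = [case[j] for case in members if case[j] is not None]
        centre.append(sum(observed) / len(observed) if observed else None)
    return centre


def kmeans_with_outliers(data, initial_centres, threshold, max_iter=100):
    """Lloyd-style k-means with an outlier cut-off (illustrative only).

    A case farther than `threshold` from every centre is labelled -1 and
    is excluded from the centre update, so outliers cannot drag a centre
    away from the dense region it summarises.
    """
    centres = [list(c) for c in initial_centres]  # do not mutate the input
    n_vars = len(data[0])
    labels = [None] * len(data)
    for _ in range(max_iter):
        new_labels = []
        for case in data:
            dists = [masked_distance(case, c) for c in centres]
            best = min(range(len(centres)), key=dists.__getitem__)
            new_labels.append(best if dists[best] <= threshold else -1)
        if new_labels == labels:  # no case changed cluster: stop
            break
        labels = new_labels
        for k in range(len(centres)):
            members = [case for case, lab in zip(data, labels) if lab == k]
            if members:  # keep the old centre if a cluster empties
                centres[k] = update_centre(members, n_vars)
    return labels, centres
```

Called with a small dataset containing `None` entries and one distant case, the function assigns the incomplete cases using whatever variables they do have and returns the label `-1` for the distant case. The threshold plays the role of a cluster radius: lowering it yields tighter clusters and more cases set aside as outliers.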
Keywords
Outlier Detection, Summary Tree, Mixed Data Type, General Similarity Coefficient, Outlier Deletion
Copyright information
© Springer-Verlag Berlin Heidelberg 2003