k-Means Clustering with Outlier Detection, Mixed Variables and Missing Values

  • D. Wishart
Conference paper
Part of the Studies in Classification, Data Analysis, and Knowledge Organization book series (STUDIES CLASS)

Abstract

This paper addresses practical issues in k-means cluster analysis or segmentation with mixed types of variables and missing values. A more general k-means clustering procedure is developed that is suitable for use with very large datasets, such as arise in data mining and survey analysis. An exact assignment test guarantees that the algorithm will converge, and the detection of outliers allows the densest regions of the sample space to be mapped by tessellations of tightly-specified spherical clusters. A summary tree is obtained for the resulting k-cluster partition.
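As a rough illustration of two ideas named in the abstract (missing values and outlier detection), the sketch below runs a plain nearest-centre k-means in which distances are averaged over the variables actually observed for each case, and any case farther than a threshold from every centre is left unassigned. It is not the paper's procedure: the exact assignment test, the handling of mixed variable types and the summary tree are all omitted. The function names, the `threshold` and `init` parameters and the use of label `-1` for outliers are assumptions made for this example.

```python
import numpy as np

def masked_sq_dist(X, centres):
    """Mean squared difference over the variables present in each case.

    X is (n, p) with NaN marking missing values; centres is (k, p) without NaN.
    Returns an (n, k) array of distances.
    """
    present = ~np.isnan(X)
    Xz = np.where(present, X, 0.0)
    diff = (Xz[:, None, :] - centres[None, :, :]) * present[:, None, :]
    counts = np.maximum(present.sum(axis=1), 1)   # avoid division by zero
    return (diff ** 2).sum(axis=2) / counts[:, None]

def kmeans_with_outliers(X, k, threshold, init=None, n_iter=50, seed=0):
    """Nearest-centre k-means; cases beyond `threshold` get label -1 (assumed convention)."""
    X = np.asarray(X, dtype=float)
    if init is None:
        # Start from k randomly chosen complete cases.
        rng = np.random.default_rng(seed)
        complete = np.flatnonzero(~np.isnan(X).any(axis=1))
        centres = X[rng.choice(complete, size=k, replace=False)].copy()
    else:
        centres = np.array(init, dtype=float)
    labels = np.full(len(X), -2)
    for _ in range(n_iter):
        d = masked_sq_dist(X, centres)
        new_labels = d.argmin(axis=1)
        new_labels[d.min(axis=1) > threshold] = -1   # treat as outlier
        if np.array_equal(new_labels, labels):
            break                                    # assignments are stable
        labels = new_labels
        for j in range(k):
            members = labels == j
            if members.any():
                pres = ~np.isnan(X[members])
                sums = np.where(pres, X[members], 0.0).sum(axis=0)
                n_obs = pres.sum(axis=0)
                # Per-variable mean over observed values; keep old value if none observed.
                centres[j] = np.where(n_obs > 0, sums / np.maximum(n_obs, 1), centres[j])
    return labels, centres

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [np.nan, 0.05],
              [10.0, 10.0], [10.1, 10.0], [10.0, 10.1], [50.0, -50.0]])
labels, centres = kmeans_with_outliers(X, k=2, threshold=1.0,
                                       init=[[0.0, 0.0], [10.0, 10.0]])
print(labels)
```

In this toy run the case with a missing first variable is still assigned using its second variable alone, and the distant case at (50, -50) is excluded so that it does not pull either centre away from the dense regions.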

Keywords

Outlier Detection; Summary Tree; Mixed Data Type; General Similarity Coefficient; Outlier Deletion
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

References

  1. BALL, G. H. (1965): Data analysis in the social sciences: What about the details? Proc. Fall Joint Computer Conf., Spartan Books, Washington D.C., Vol. 27 (1), 533–539.
  2. BALL, G. H. and HALL, D. J. (1967): A clustering technique for summarizing multivariate data. Behavioral Science, Vol. 12, 153–155.
  3. BEALE, E. M. L. (1969): Euclidean cluster analysis. Bull. I. S. I., Vol. 43 (2), 92–94.
  4. DIDAY, E., and SIMON, J. C. (1976): Cluster analysis, in Fu, K. S. (Ed): Digital pattern recognition. Springer, Berlin, 47–94.
  5. FORGY, E. W. (1965): Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics, Vol. 21, 768–769.
  6. GOWER, J. C. (1971): A general coefficient of similarity and some of its properties. Biometrics, Vol. 27, 857–874.
  7. JANCEY, R. C. (1966): Multidimensional group analysis. Austral. J. Botany, Vol. 14 (1), 127–130.
  8. KASS, G. V. (1980): An exploratory technique for investigating large quantities of categorical data. Applied Statistics, Vol. 29, 119–127.
  9. KAUFMAN, L. and ROUSSEEUW, P. J. (1990): Finding groups in data. Wiley, New York.
  10. MacQUEEN, J. (1967): Some methods for classification and analysis of multivariate observations. Proc. 5th Berkeley Symp., Vol. I, 281–297.
  11. THORNDIKE, R. L. (1953): Who belongs in the family. Psychometrika, Vol. 18, 267–276.
  12. WISHART, D. (1970): Some problems in the theory and application of the methods of numerical taxonomy. Ph.D. dissertation, University of St. Andrews.
  13. WISHART, D. (1978): Treatment of missing values in cluster analysis. Proc. Compstat 1978, Physica-Verlag, Wien, 281–287.
  14. WISHART, D. (1984): Clustan Benutzerhandbuch. Gustav Fischer Verlag, Stuttgart, 46–54.
  15. WISHART, D. (1986): Hierarchical cluster analysis with messy data, in: Gaul, Schader, (Eds.): Classification as a Tool of Research. North-Holland, Amsterdam, 453–460.
  16. WISHART, D. (1999): ClustanGraphics Primer. Clustan, Edinburgh, 37–38.
  17. WISHART, D. (2002): Clustan Professional User Guide. Clustan, Edinburgh (in preparation).

Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • D. Wishart (1)
  1. Department of Management, University of St. Andrews, St. Andrews, Fife, Scotland