Journal of Classification, Volume 2, Issue 1, pp 63–76

Statistical theory in clustering

  • J. A. Hartigan

Abstract

A number of statistical models for forming and evaluating clusters are reviewed. Hierarchical algorithms are evaluated by their ability to discover high density regions in a population, and complete linkage hopelessly fails; the others don't do too well either. Single linkage is at least of mathematical interest because it is related to the minimum spanning tree and percolation. Mixture methods are examined, related to k-means, and the failure of likelihood tests for the number of components is noted. The DIP test for estimating the number of modes in a univariate population measures the distance between the empirical distribution function and the closest unimodal distribution function (or k-modal distribution function when testing for k modes). Its properties are examined and multivariate extensions are proposed. Ultrametric and evolutionary distances on trees are considered briefly.
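The relation between single linkage and the minimum spanning tree noted in the abstract can be illustrated with a small sketch: cutting the k − 1 longest edges of the minimum spanning tree leaves k connected components, which are the single linkage clusters at that level. The function names `mst_edges` and `single_linkage`, the Euclidean metric, and the toy data are assumptions made for illustration; this is not code from the paper.

```python
import math

def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph.
    Returns the n-1 tree edges as (length, i, j) tuples."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n      # shortest known link from the tree to each point
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # pick the point outside the tree that is closest to it
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges

def single_linkage(points, k):
    """Cut the k-1 longest MST edges and label the resulting components."""
    edges = sorted(mst_edges(points))
    kept = edges[:len(edges) - (k - 1)]
    root = list(range(len(points)))   # union-find forest

    def find(i):
        while root[i] != i:
            root[i] = root[root[i]]   # path halving
            i = root[i]
        return i

    for _, i, j in kept:
        root[find(i)] = find(j)

    # relabel components 0, 1, 2, ... in order of first appearance
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(len(points))]

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = single_linkage(points, 2)
```

With these toy points, the two tight groups of three are joined in the tree by a single long edge, so cutting that edge separates them.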

Keywords

Theory of clustering; High density clusters; Tests of unimodality

Copyright information

© Springer-Verlag New York Inc 1985

Authors and Affiliations

  • J. A. Hartigan, Department of Statistics, Yale University, New Haven, USA
