Abstract
Cluster analysis provides methods and algorithms for partitioning a set of objects O = {1, …, n} (or data vectors x₁, …, xₙ ∈ ℝᵖ) into a suitable number of classes C₁, …, Cₘ ⊆ O such that the classes are homogeneous, each comprising only objects that are ‘similar’ in some sense. The historical evolution shows a surprising trend from an algorithmic, heuristic and applications-oriented point of view (Sokal/Sneath 1963) to a more basic, theory-oriented investigation of the structural, mathematical and statistical properties of clustering methods. Nowadays, the questions to be answered are of the type ‘How many clusters are there?’, ‘Is there a classification structure?’, ‘Is the calculated classification adequate?’, ‘Which are the strongest clusters?’, etc.
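To make the notion of a homogeneous partition concrete, the following minimal sketch scores a partition of data vectors by the within-class sum of squares (the variance criterion). This criterion is one common choice and is an assumption here; the abstract does not fix a particular homogeneity measure. The function names and the toy data are hypothetical.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def class_centroid(vectors: List[Vector]) -> List[float]:
    """Coordinate-wise mean of the vectors in one class."""
    n = len(vectors)
    return [sum(coords) / n for coords in zip(*vectors)]


def within_class_ss(data: List[Vector], labels: List[int]) -> float:
    """Sum of squared Euclidean distances of each vector to its class centroid."""
    classes: Dict[int, List[Vector]] = {}
    for x, c in zip(data, labels):
        classes.setdefault(c, []).append(x)
    total = 0.0
    for members in classes.values():
        centre = class_centroid(members)
        for x in members:
            total += sum((xi - ci) ** 2 for xi, ci in zip(x, centre))
    return total


# Hypothetical toy data: two visibly separated groups in the plane.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
good = [0, 0, 0, 1, 1, 1]  # partition that matches the two groups
bad = [0, 1, 0, 1, 0, 1]   # partition that mixes the groups

print(within_class_ss(data, good))
print(within_class_ss(data, bad))
```

Under this criterion the partition with the smaller value is the more homogeneous one. The criterion alone cannot say how many classes there are, since its minimum never increases as more classes are allowed; this is one reason the questions listed in the abstract call for probabilistic tools.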
References
Aitkin, M., Anderson, D., Hinde, J. Statistical modelling of data on teaching style (with discussion). J. Roy. Statist. Soc. A 144 (1981) 419–461.
Ambrosi, K. Klassifikation mit Dichteschätzungen. Oper. Res. Verf. 22 (1976) 1–11.
Anderson, J. J. Normal mixtures and the number of clusters problems. Computational Statistics Quarterly 2 (1985) 3–14.
Arnold, S. J. A test for clusters. J. Marketing Research 16 (1979) 545–551.
Aubuchon, J. C., Hettmansperger, T. P. A note on the estimation of the integral of f²(x). J. Statist. Planning and Inference 9 (1984) 321–332.
Baubkus, W. Minimizing the variance criterion in cluster analysis: Optimal configurations in the multidimensional normal case. Diplomarbeit, Institute of Statistics, Technical University Aachen, 1984.
Beale, E. M. L. Euclidean cluster analysis. Bull. Intern. Statist. Inst. 43 (1969), Vol. 2, 82–94.
Bezdek, J. C. Pattern recognition with fuzzy objective function algorithms. Plenum Press, New York, 1981.
Bickel, P. J., Breiman, L. Sums of functions of nearest neighbor distances, moment bounds, limit theorems and a goodness of fit test. Ann. Probab. 11 (1983) 185–214.
Binder, D. A. Bayesian cluster analysis. Biometrika 65 (1978) 31–38.
Binder, D. A. Approximations to Bayesian clustering rules. Biometrika 68 (1981) 275–286.
Bock, H. H. Statistische Modelle für die einfache und doppelte Klassifikation von normalverteilten Beobachtungen. Dissertation, Univ. Freiburg i. Brsg., 1968.
Bock, H. H. Statistische Modelle und Bayes’sche Verfahren zur Bestimmung einer unbekannten Klassifikation normalverteilter zufälliger Vektoren. Metrika 18 (1972) 120–132.
Bock, H. H. Automatische Klassifikation. Theoretische und praktische Methoden zur Gruppierung und Strukturierung von Daten (Clusteranalyse). Vandenhoeck & Ruprecht, Göttingen, 1974.
Bock, H. H. On tests concerning the existence of a classification. In: IRIA, vol. 2, 1977, 449–464.
Bock, H. H. Clusteranalyse mit unscharfen Partitionen. In: Bock, H. H. (ed.): Klassifikation und Erkenntnis III: Numerische Klassifikation. Studien zur Klassifikation SK-6. Indeks-Verlag, Frankfurt, 1979a, 137–163.
Bock, H. H. (1979b) Clustering by density estimation. In: Tomassone, R. (ed.), 1979, 173–186.
Bock, H. H. (1979c) Fuzzy clustering procedures. In: Tomassone, R. (ed.), 1979, 205–218.
Bock, H. H. Dichteschätzung und Clusteranalyse (abstract). Koll. ‘Dichteschätzung und verwandte Themen’, Univ.-GHS Siegen, FB Math., 20./21. Nov. 1980.
Bock, H. H. Statistical testing and evaluation methods in cluster analysis. (1981) In: Ghosh, J. K., Roy, J. (eds.), 1984, 116–146.
Bock, H. H. Statistische Testverfahren im Rahmen der Clusteranalyse. In: Dahlberg, I., Schader, M. (Hrsg.): Studien zur Klassifikation SK-13. Indeks-Verlag, Frankfurt, 1983, 161–176.
Bock, H. H. On some significance tests in cluster analysis. J. of Classification 2 (1985) 77–108.
Bock, H. H. Loglinear models and entropy clustering methods for qualitative data. (1986a) In: Gaul, W., Schader, M. (eds.), 1986, 19–26.
Bock, H. H. Multidimensional scaling in the framework of cluster analysis. In: Degens, P. O., Hermes H.-J., Opitz, O. (Hrsg.): Classification and its environment. Indeks-Verlag, Frankfurt, 1986b, 247–258.
Bock, H. H. On the interface between cluster analysis, principal component analysis, and multidimensional scaling. In: Bozdogan, H., Gupta, A. K. (eds.): Multivariate statistical modeling and data analysis. Reidel, Dordrecht, 1987, 17–34.
Bock, H. H. (ed.) Classification and related methods of data analysis. Proc. First Conference of the International Federation of Classification Societies (IFCS-87), June 29 – July 1, 1987, Aachen/FRG. North Holland, Amsterdam, 1988.
Boswell, St. B. Nonparametric estimation of the modes of high-dimensional densities. In: Billard, L. (ed.), Computer Science and Statistics. Proc. 16th Symposium on the Interface. North Holland, Amsterdam, 1985, 217–226.
Bryant, P. G., Williamson, J. A. Asymptotic behaviour of classification maximum likelihood estimates. Biometrika 65 (1978) 273–281.
Bryant, P. G., Williamson, J. A. Maximum likelihood and classification. A comparison of three approaches. In: Gaul, W. et al., 1986, 35–46.
Butler, R. W. Optimal stratification and clustering on the line using the L1-norm. J. Multiv. Analysis 18 (1986) 142–155.
Butler, R. W. Optimal clustering in the real line. J. Multiv. Analysis 24 (1988) 88–108.
Céleux, G., Diebolt, J. The SEM algorithm: A probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Computational Statistics Quarterly 2 (1985) 73–82.
Céleux, G., Diebolt, J. L’algorithme SEM: un algorithme d’apprentissage probabiliste pour la reconnaissance de mélange de densités. Revue Statist. Appliquée 34 (1986).
Cressie, N. An optimal statistic based on higher order gaps. Biometrika 66 (1979) 619–628.
Davies, P. L. Consistent estimates for finite mixtures of well separated elliptical distributions. In: Bock, H. H. (ed.): IFCS-87, 1988, 195–202.
Degens, P. O. Clusteranalyse auf topologisch-maßtheoretischer Grundlage. Dissertation, Universität München, Fachbereich Mathematik, 1978.
Deheuvels, P., Einmahl, J. H. J., Mason, D. M., Ruymgaart, F. H. The almost sure behavior of maximal and minimal multivariate kₙ-spacings. J. Multiv. Analysis 24 (1988) 155–176.
Dette, H., Henze, N. The limit distribution of the largest nearest neighbour link in the unit d-cube. J. Appl. Probab. 26 (1988).
Diday, E. et al. (eds.) Optimisation en classification automatique. Vol. 1, 2. Institut National de Recherche en Informatique et en Automatique (INRIA), Le Chesnay, 1979.
Dubes, R., Jain, A. K. Validity studies in clustering methodologies. Pattern Recognition 11 (1979) 235–254.
Duda, R. O., Hart, P. E. Pattern classification and scene analysis. Wiley, New York, 1973.
Engelman, L., Hartigan, J. A. Percentage points of a test for clusters. J. Amer. Statist. Assoc. 64 (1969) 1647–1649.
Everitt, B. S. A Monte Carlo investigation of the likelihood ratio test for the number of components in a mixture of normal distributions. Multivariate Behavioral Research 16 (1981a) 171–180.
Everitt, B. S. Contribution to the discussion of the paper by M. Aitkin, D. Anderson and J. Hinde. J. Roy. Statist. Soc. A 144 (1981b) 457–458.
Everitt, B. S., Hand, D. Finite mixture distributions. Chapman and Hall, London, 1981.
Felsenstein, J. Numerical taxonomy. Springer-Verlag, Heidelberg, New York, 1983.
Fukunaga, K., Hostetler, L. D. The estimation of the gradient of a density function with applications in pattern recognition. IEEE Trans. Information Theory IT-21 (1975) 32–40.
Fukunaga, K., Koontz, W. G. A nonparametric valley-seeking technique for cluster analysis. IEEE Trans. Computers C-21 (1972) 171–178.
Gänßler, P. On a modification of the k-means clustering procedure. Preprint No. 39, Math. Inst., Univ. München, 1986.
Gates, D. J., Westcott, M. On the distribution of scan statistics. J. Amer. Statist. Assoc. 79 (1984) 423–429.
Gaul, W., Schader, M. (eds.) Classification as a tool of research. Proc. 9th Annual Meeting of the Gesellschaft für Klassifikation. Karlsruhe, 26–28 June 1986. North Holland, Amsterdam, 1987.
Geisser, S., Cornfield, J. Posterior distributions for multivariate normal parameters. J. Roy. Statist. Soc. B 25 (1963) 368–376.
Ghosh, J. K., Roy, J. (eds.) Golden Jubilee Conference in Statistics: Applications and new directions. Calcutta, December 1981. Indian Statistical Institute, Calcutta, 1984.
Ghosh, J. K., Sen, P. K. On the asymptotic performance of the log likelihood ratio statistic for the mixture model and related results. In: LeCam, L. M., Olshen, R. A. (eds.), 1985, 789–806.
Giacomelli, F. et al. (eds.) Subpopulations of blood lymphocytes demonstrated by quantitative chemistry. J. Histochemistry and Cytochemistry 19 (1971) 426–433.
Glaz, J., Naus, J. Multiple clusters on the line. Comm. Statist., Theory and Methods 12 (1983) 1961–1986.
Godehardt, E. Graphs as structural models. The application of graphs and multigraphs in cluster analysis. Vieweg, Braunschweig, 1988.
Gray, R. M., Karnin, E. D. Multiple local optima in vector quantizers. IEEE Trans. Information Theory IT-28 (1982) 256–261.
Hall, P. On powerful distributional tests based on sample spacings. J. Multiv. Analysis 19 (1986) 201–224.
Hartigan, J. A. Clustering algorithms. Wiley, New York, 1975.
Hartigan, J. A. Clusters as modes. (1977a) In: IRIA, vol. II, 1977, 433–448.
Hartigan, J. A. Distribution problems in clustering. (1977b) In: Ryzin, J. van (ed.), 1977, 45–72.
Hartigan, J. A. Asymptotic distributions for clustering criteria. Ann. Statist. 6 (1978) 117–131.
Hartigan, J. A. Consistency of single linkage for high-density clusters. J. Amer. Statist. Assoc. 76 (1981) 388–394.
Hartigan, J. A. Statistical theory in clustering. J. of Classification 2 (1985a) 63–76.
Hartigan, J. A. A failure of likelihood asymptotics for normal mixtures. (1985b) In: LeCam, L. M., Olshen, R. A. (eds.), 1985, 807–810.
Hartigan, J. A. The span test for unimodality. In: Bock, H. H. (ed.): IFCS-87, 1988, 229–236.
Hartigan, J. A., Hartigan, P. M. The dip test of unimodality. Ann. Statist. 13 (1985) 70–84.
Henze, N. The limit distribution for maxima of “weighted” rth-nearest neighbour distances. J. Appl. Probab. 19 (1982) 344–354.
Henze, N. Ein asymptotischer Satz über den maximalen Minimalabstand von unabhängigen Zufallsvektoren mit Anwendung auf einen Anpassungstest im ℝᵖ und auf der Kugel. Metrika 30 (1983) 245–260.
Hill, L. R., Silvestri, L. G., Ihm, P., Farchi, G., Lanciani, P. Automatic classification of staphylococci by principal component analysis and a gradient method. J. Bacteriology 89 (1965) 1393–1401.
Hüsler, J. Minimal spacings of non-uniform densities. Stoch. Proc. Appl. 25 (1987) 73–82.
IRIA Proc. 1st Symp. Data Analysis and Informatics. Versailles, 1977. Institut de Recherche, d’Informatique et d’Automatique (IRIA), Le Chesnay, 1977.
Jahnke, H. Clusteranalyseverfahren als Verfahren der schließenden Statistik. Vandenhoeck & Ruprecht, Göttingen, 1988.
Jain, A. K., Dubes, R. C. Algorithms for clustering data. Prentice Hall, Englewood Cliffs, 1988.
Kopp, B. Über ein Verfahren zur Gruppenbildung durch Dichtefunktionen. Biometrische Zeitschrift 18 (1976) 291–296.
Krauth, J. An improved upper bound for the tail probabilities of the scan statistic for testing non-random clustering. In: Bock, H. H. (ed.): IFCS-87, 1988, 237–244.
Kuo, M., Rao, J. S. Limit theory and efficiencies for tests based on higher order spacings. (1981) In: Ghosh, J. K., Roy, J. (eds.), 1984, 333–352.
LeCam, L. M., Olshen, R. A. (eds.) Proc. Berkeley Conference in honor of Jerzy Neyman and Jack Kiefer, Vol. II, Wadsworth, Monterey, 1985.
Lee, K. L. Multivariate tests for clusters. J. Amer. Statist. Assoc. 74 (1979) 708–714.
Lewis, T., Thompson, J. W. Dispersive distributions, and the connection between dispersivity and strong unimodality. J. Appl. Prob. 18 (1981) 76–90.
Li, L. A., Sedransk, N. Mixtures of distributions: A topological approach. Ann. Statist. 16 (1988) 1623–1634.
Lynch, J. Mixtures, generalized convexity and balayages. Scand. J. Statist. 15 (1988) 203–210.
Marriott, F. H. C. Separating mixtures of normal distributions. Biometrics 31 (1975) 767–769.
McLachlan, G. J. The classification and mixture maximum likelihood approaches to cluster analysis. In: Krishnaiah, P. R., Kanal, L. N. (eds.): Handbook of Statistics, vol. 2. North Holland, Amsterdam, 1982, 199–208.
McLachlan, G. J. On bootstrapping the likelihood ratio test statistic for the number of components in a normal mixture. Appl. Statist. 36 (1987) 318–324.
McLachlan, G. J., Basford, K. E. Mixture models. Inference and applications to clustering. Dekker, New York, 1988.
Menzefricke, U. Bayesian clustering of data sets. Comm. Statist., Theory and Methods A 10 (1981) 65–77.
Milligan, G. W. A Monte Carlo study of thirty internal criterion measures for cluster analysis. Psychometrika 46 (1981a) 187–199.
Milligan, G. W. A review of Monte Carlo tests of cluster analysis. Multivariate Behavioral Research 16 (1981b) 379–401.
Milligan, G. W., Cooper, M. C. An examination of procedures for determining the number of clusters in a data set. Psychometrika 50 (1985) 159–179.
Milligan, G. W., Soon, S. C., Sokol, L. M. The effect of cluster size, dimensionality, and the number of clusters on recovery of true cluster structure. IEEE Trans. PAMI-5 (1983) 40–47.
Molenaar, W., Zwet, W. R. van On mixtures of distributions. Ann. Math. Statist. 37 (1966) 281–283.
Müller, D. W., Sawitzki, G. Using excess mass estimates to investigate the modality of a distribution. Preprint Nr. 398, Univ. Heidelberg, Sonderforschungsbereich 123, Heidelberg, 1987.
Narendra, P. M., Goldberg, M. A non-parametric clustering schema for Landsat. Pattern Recognition 9 (1977) 207–215.
Naus, J. I. The distribution of the size of the maximum cluster of points on a line. J. Amer. Statist. Assoc. 60 (1965a) 532–538.
Naus, J. I. Clustering of random points in two dimensions. Biometrika 52 (1965b) 263–267.
Naus, J. I. An indexed bibliography of clusters, clumps and coincidences. Intern. Statist. Review 47 (1979) 47–78.
Naus, J. I. Approximations for distributions of scan statistics. J. Amer. Statist. Assoc. 77 (1982) 177–183.
Newell, G. F. Distribution for the smallest distance between any pair of kth nearest neighbor random points on a line. In: Rosenblatt, M. (ed.): Proc. Symp. Time Series Analysis. Wiley, New York, 1963, 89–103.
Pärna, K. Strong consistency of k-means clustering criterion in separable metric spaces. Tartu Riikliku Ülikooli Toimetised 733 (1986) 86–96.
Perruchet, Ch. Significance tests for clusters: Overview and comments. In: Felsenstein, J. (ed.), Berlin, 1983, 199–208.
Pino, G. E. del On the asymptotic distribution of k-spacings with applications to goodness-of-fit tests. Ann. Statist. 7 (1979) 1058–1065.
Pollard, D. Strong consistency of k-means clustering. Ann. Statist. 9 (1981) 135–140.
Pollard, D. A central limit theorem for k-means clustering. Ann. Probab. 10 (1982a) 919–926.
Pollard, D. Quantization and the method of k-means. IEEE Trans. Information Theory IT-28 (1982b) 199–205.
Rao, J. S., Sethuraman, J. Pitman efficiencies of tests based on spacings. In: Puri, M. L. (ed.): Nonparametric techniques in statistical inference. Cambridge Univ. Press, Cambridge/Mass., 1970, 267–273.
Rasson, J. P., Hardy, A., Weverbergh, D. Point process, classification and data analysis. In: Bock, H. H. (ed.): IFCS-87, 1988, 245–256.
Ripley, B. D. Spatial Statistics. Wiley, New York, 1981.
Rohlf, F. J. Generalization of the gap test for multivariate outliers. Biometrics 31 (1975) 93–101.
Ryzin, J. van (ed.) Classification and clustering. Academic Press, New York, 1977.
Sarle, W. S. Cubic clustering criterion. SAS Technical Report A-108. SAS Institute Inc., Cary, NC, 15 November, 1983.
Saunders, R., Funk, G. M. Poisson limits for a clustering model of Strauss. J. Appl. Prob. 14 (1977) 776–784.
Schilling, M. F. Goodness of fit testing in ℝᵐ based on the weighted empirical distribution of certain nearest neighbor statistics. Ann. Statist. 11 (1983) 1–12.
Schroeder, A. Analyse d’un mélange de distributions de probabilité de même type. Revue de Statistique Appliquée 24 (1976), no.1, 39–62.
Schweder, T. On the dispersion of mixtures. Scand. J. Statist., Theory and Applications 9 (1982) 165–170.
Sclove, S. L. Population mixture models and clustering algorithms. Communications in Statistics, Theory and Methods A 6 (1977) 417–434.
Scott, A. J., Knott, M. A cluster analysis method for grouping means in the analysis of variance. Biometrics 30 (1974) 507–512.
Scott, A. J., Symons, M. J. Clustering methods based on likelihood ratio criteria. Biometrics 27 (1971) 387–397.
Shaked, M. On mixtures from exponential families. J. Roy. Statist. Soc. B 42 (1980) 192–198.
Silverman, B. W. Limit theorems for dissociated random variables. Adv. Appl. Prob. 8 (1976) 806–819.
Silverman, B. W. Using kernel density estimates to investigate multimodality. J. Roy. Statist. Soc. B 43 (1981) 97–99.
Silverman, B., Brown, T. Short distances, flat triangles and Poisson limits. J. Appl. Prob. 15 (1978) 816–826.
Sokal, R. R., Sneath, P. H. A. Principles of numerical taxonomy. Freeman, San Francisco-London, 1963.
Späth, H. Cluster dissection and analysis. Wiley, Chichester, 1985.
Symons, M. J. Clustering criteria and multivariate normal mixtures. Biometrics 37 (1981) 35–43.
Titterington, D. M. Contribution to the discussion of the paper by M. Aitkin, D. Anderson and J. Hinde. J. Roy. Statist. Soc. A 144 (1981) 459.
Titterington, D. M., Smith, A. F. M., Makov, U. E. Statistical analysis of finite mixture distributions. Wiley, Chichester, 1985.
Tomassone, R. (ed.) Analyse de données et informatique. Institut National de Recherche en Informatique et en Automatique (INRIA), Le Chesnay, 1979.
Trémolières, R. The percolation method for an efficient grouping of data. Pattern Recognition 11 (1979) 255–262.
Windham, M. P. Parameter modification for clustering criteria. J. of Classification 4 (1987) 191–214.
Wolfe, J. H. A Monte-Carlo study of the sampling distribution of the likelihood ratio for mixtures of multinormal distributions. Tech. Bull. STB 72–2. Naval Personnel and Training Research Laboratory, San Diego, 1971.
Wong, M. A. A hybrid clustering method for identifying high-density clusters. J. Amer. Statist. Assoc. 77 (1982) 841–847.
© 1989 Springer-Verlag Berlin · Heidelberg
About this paper
Cite this paper
Bock, H. H. (1989). Probabilistic Aspects in Cluster Analysis. In: Opitz, O. (ed.) Conceptual and Numerical Analysis of Data. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-75040-3_2
DOI: https://doi.org/10.1007/978-3-642-75040-3_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-51641-5
Online ISBN: 978-3-642-75040-3
eBook Packages: Springer Book Archive