An examination of indexes for determining the number of clusters in binary data sets

Psychometrika

Abstract

The problem of choosing the correct number of clusters is as old as cluster analysis itself. A number of authors have suggested various indexes to facilitate this crucial decision. One of the most extensive comparative studies of indexes was conducted by Milligan and Cooper (1985). The present study pursues the same goal under different conditions. In contrast to Milligan and Cooper's work, the emphasis here is on high-dimensional empirical binary data. Artificial binary data sets are constructed to reflect features typically encountered in real-world data situations in the field of marketing research. The simulation includes 162 binary data sets, each clustered by two different algorithms; every index under consideration then yields a recommendation on the number of clusters. The index results are evaluated, and their performance is compared and analyzed.
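The general procedure described in the abstract (generate artificial binary data, cluster it for several candidate numbers of clusters, and let an index recommend one) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the study: the segment profiles, the sample sizes, the use of plain k-means as the clustering algorithm, and the choice of the Calinski-Harabasz index (one of the indexes in the reference list) as the only index. The function names are hypothetical.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def make_binary_data(profiles, rows_per_segment, rng):
    """Draw 0/1 answers; each profile gives P(answer = 1) per variable (assumed design)."""
    return [[1 if rng.random() < p else 0 for p in profile]
            for profile in profiles for _ in range(rows_per_segment)]

def kmeans(data, k, rng, iterations=50):
    """Plain k-means with random starting centers; returns one label per row."""
    centers = [list(row) for row in rng.sample(data, k)]
    labels = None
    for _ in range(iterations):
        new_labels = [min(range(k), key=lambda c: squared_distance(row, centers[c]))
                      for row in data]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [row for row, lab in zip(data, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = centroid(members)
    return labels

def calinski_harabasz(data, labels):
    """(between-cluster SS / (k - 1)) / (within-cluster SS / (n - k))."""
    n = len(data)
    groups = {}
    for row, lab in zip(data, labels):
        groups.setdefault(lab, []).append(row)
    k = len(groups)
    if k < 2 or k >= n:
        return float("-inf")  # index is undefined for these partitions
    overall = centroid(data)
    between = within = 0.0
    for rows in groups.values():
        center = centroid(rows)
        between += len(rows) * squared_distance(center, overall)
        within += sum(squared_distance(row, center) for row in rows)
    if within == 0:
        return float("inf")
    return (between / (k - 1)) / (within / (n - k))

def recommend_k(data, k_values, rng, restarts=3):
    """Return the k with the largest index value, plus all index values."""
    scores = {k: max(calinski_harabasz(data, kmeans(data, k, rng))
                     for _ in range(restarts))
              for k in k_values}
    return max(scores, key=scores.get), scores

rng = random.Random(0)
high, low = 0.9, 0.1  # assumed answer probabilities, not values from the study
profiles = [[high] * 4 + [low] * 8,
            [low] * 4 + [high] * 4 + [low] * 4,
            [low] * 8 + [high] * 4]
data = make_binary_data(profiles, 40, rng)
best_k, scores = recommend_k(data, range(2, 7), rng)
print("recommended number of clusters:", best_k)
```

A comparative study of the kind described would repeat this loop over many data scenarios, several indexes, and more than one clustering algorithm, and then count how often each index recovers the number of segments that was built into the data.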

References

  • Aldenderfer, M.S., & Blashfield, R.K. (1996). Cluster analysis. London, U.K.: Sage Publications.

  • Andrews, D.F. (1972). Plots of high-dimensional data. Biometrics, 28, 125–136.

  • Arabie, P., & Hubert, L.J. (1996). Clustering and classification (pp. 5–63). River Edge, NJ: World Scientific.

  • Arratia, R., & Lander, E.S. (1990). The distribution of clusters in random graphs. Advances in Applied Mathematics, 11, 36–48.

  • Baker, F.B., & Hubert, L.J. (1975). Measuring the power of hierarchical cluster analysis. Journal of the American Statistical Association, 70, 31–38.

  • Ball, G.H., & Hall, D.J. (1965). ISODATA, A novel method of data analysis and pattern classification (Tech. Rep. NTIS No. AD 699616). Menlo Park, CA: Stanford Research Institute.

  • Baroni-Urbani, C., & Buser, M.W. (1976). Similarity of binary data. Systematic Zoology, 25, 251–259.

  • Baulieu, F. (1989). A classification of presence/absence based dissimilarity coefficients. Journal of Classification, 6, 233–246.

  • Calinski, R.B., & Harabasz, J. (1974). A dendrite method for cluster analysis. Communications in Statistics, 3, 1–27.

  • Cheetham, H., & Hazel, J. (1969). Binary (presence-absence) similarity coefficients. Journal of Paleontology, 43, 1130–1136.

  • Cox, D. (1970). The analysis of binary data. London, U.K.: Chapman and Hall.

  • Davies, D.L., & Bouldin, D.W. (1979). A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1, 224–227.

  • Dolnicar, S., Grabler, K., & Mazanec, J. (2000). A tale of three cities: Perceptual charting for analysing destination images. In A. Woodside (Ed.), Consumer psychology of tourism, hospitality and leisure (pp. 39–62). London, U.K.: CAB International.

  • Dolnicar, S., Leisch, F., Weingessel, A., Buchta, C., & Dimitriadou, E. (1998). A comparison of several cluster algorithms on artificial binary data scenarios from tourism marketing (Working Paper 7, SFB). Wien, Austria: Adaptive Information Systems. (http://www.wu-wien.ac.at/am)

  • Edwards, A.W.F., & Cavalli-Sforza, L. (1965). A method for cluster analysis. Biometrics, 21, 362–375.

  • Formann, A.K. (1984). Die Latent-Class-Analyse: Einführung in die Theorie und Anwendung [Latent class analysis: Introduction into theory and application]. Weinheim, Germany: Beltz.

  • Friedman, H.P., & Rubin, J. (1967). On some invariant criteria for grouping data. Journal of the American Statistical Association, 62, 1159–1178.

  • Fritzke, B. (1997). Some competitive learning methods. Unpublished manuscript [On-line draft document available at http://www.ki.inf.tu-dresden.de/ fritzke/JavaPaper/t.html or http://www.neuroinformatik.ruhr-unibochum.de/ini/VDM/research/gsn/].

  • Fukunaga, K., & Koontz, W.L.G. (1970). A criterion and an algorithm for grouping data. IEEE Transactions on Computers, C-19, 917–923.

  • Gower, J.C. (1985). Measures of similarity, dissimilarity, and distance. In S. Kotz & N.L. Johnson (Eds.), Encyclopedia of Statistical Sciences, Vol. 5 (pp. 397–405). New York, NY: Wiley.

  • Green, P.E., Tull, D.S., & Albaum, G. (1988). Research for Marketing Decisions (5th ed., The Prentice Hall Series in Marketing). Englewood Cliffs, NJ: Prentice-Hall.

  • Hall, D.J., Duda, R.O., Huffman, D.A., & Wolf, E.E. (1973). Development of new pattern recognition methods (Tech. Rep. NTIS No. AD 7726141). Los Angeles, CA: Aerospace Research Laboratories.

  • Hartigan, J.A. (1975). Clustering algorithms. New York, NY: Wiley.

  • Hubalek, L. (1982). Coefficients of association and similarity, based on binary (presence-absence) data: An evaluation. Biological Reviews, 57, 669–689.

  • Hubert, L.J., & Levin, J.R. (1976). A general statistical framework for assessing categorical clustering in free recall. Psychological Bulletin, 83, 1072–1080.

  • Kaufmann, H., & Pape, H. (1996). Multivariate statistische Verfahren (2nd ed.) [Multivariate statistical methods]. Berlin: Walter de Gruyter.

  • Li, X., & Dubes, R.C. (1989). A probabilistic measure of similarity for binary data in pattern recognition. Pattern Recognition, 22(4), 397–409.

  • Linde, Y., Buzo, A., & Gray, R.M. (1980). An algorithm for vector quantizer design. IEEE Transactions on Communications, COM-28(1), 84–95.

  • Marriot, F.H.C. (1971). Practical problems in a method of cluster analysis. Biometrics, 27, 501–514.

  • McCutcheon, A.L. (1987). Latent class analysis (Sage University Paper series on Quantitative Applications in the Social Sciences, Series No. 07-064). Beverly Hills, CA: Sage Publications.

  • Milligan, G.W. (1980). An examination of the effect of six types of error perturbation on fifteen clustering algorithms. Psychometrika, 45, 325–342.

  • Milligan, G.W. (1981). A Monte Carlo study of thirty internal criterion measures for cluster analysis. Psychometrika, 46, 187–199.

  • Milligan, G.W., & Cooper, M.C. (1985). An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50, 159–179.

  • Orloci, L. (1967). An agglomerative method of classification of plant communities. Journal of Ecology, 55, 193–206.

  • Ramaswamy, W., Chatterjee, R., & Cohen, S.H. (1996). Joint segmentation on distinct interdependent bases with categorical data. Journal of Marketing Research, 33, 337–350.

  • Ratkowsky, D.A., & Lance, G.N. (1978). A criterion for determining the number of groups in a classification. Australian Computer Journal, 10, 115–117.

  • Rost, J. (1996). Testtheorie, Testkonstruktion [Theory and construction of tests]. Bern: Verlag Hans Huber.

  • Sarle, W.S. (1983). Cubic clustering criterion (Tech. Rep. A-108). Research Triangle Park, NC: SAS Institute.

  • Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464.

  • Scott, A.J., & Symons, M.J. (1971). Clustering methods based on likelihood ratio criteria. Biometrics, 27, 387–397.

  • Thorndike, R.L. (1953). Who belongs in the family? Psychometrika, 18, 267–276.

  • Wedel, M., & Kamakura, W.A. (1998). Marketing segmentation. Conceptual and methodological foundations (pp. 89–92). Boston/Dordrecht/London: Kluwer Academic.

  • Wolfe, J.H. (1970). Pattern clustering by multivariate mixture analysis. Multivariate Behavioral Research, 5, 329–350.

  • Xu, L. (1997). Bayesian Ying-Yang machine, clustering and number of clusters. Pattern Recognition Letters, 18, 1167–1178.

  • Yang, M.-S., & Yu, K.F. (1990). On stochastic convergence theorems for the fuzzy c-means clustering procedure. International Journal of General Systems, 16, 397–411.

Additional information

Author names are listed in alphabetical order.

This research was supported by the Austrian Science Foundation (FWF) under grant SFB#010 (“Adaptive Information Systems and Modeling in Economics and Management Science”).

The authors would like to thank the anonymous reviewers and especially the associate editor for their helpful comments and suggestions.

Cite this article

Dimitriadou, E., Dolničar, S. & Weingessel, A. An examination of indexes for determining the number of clusters in binary data sets. Psychometrika 67, 137–159 (2002). https://doi.org/10.1007/BF02294713

  • DOI: https://doi.org/10.1007/BF02294713
