Cluster Analysis

Multivariate Analysis

Abstract

Cluster analysis is a procedure for grouping cases (objects of investigation) in a data set. The first step is to determine the similarity or dissimilarity (distance) between the cases using a suitable proximity measure. The second step is to choose a fusion algorithm that successively combines the individual cases into groups (clusters). The goal is to form groups whose cases are similar with respect to the segmentation variables under consideration (homogeneous groups), while the groups themselves are as dissimilar from one another as possible. The procedures of cluster analysis can handle variables with metric, non-metric, and mixed scales. The chapter focuses on hierarchical agglomerative clustering methods and presents the single-linkage method and Ward’s method in detail. Finally, two partitioning methods, k-means clustering and two-step cluster analysis, are explained. These methods offer particular advantages when working with large data sets.
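The two steps named in the abstract (a proximity measure, then a fusion algorithm) can be illustrated with a minimal sketch. It assumes Python with NumPy and SciPy rather than the SPSS procedures the chapter refers to, and the six-case data matrix is invented purely for illustration; it is not the chapter's example data.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical toy data: 6 cases measured on 2 metric variables.
X = np.array([
    [1.0, 2.0],
    [2.0, 1.0],
    [1.5, 1.5],
    [8.0, 9.0],
    [9.0, 8.0],
    [8.5, 8.5],
])

# Step 1: proximity measure (here: Euclidean distance for every pair of cases).
distances = pdist(X, metric="euclidean")

# Step 2: fusion algorithm that merges cases successively into clusters.
# Single linkage works directly on the pairwise distances.
single_tree = linkage(distances, method="single")
# SciPy's Ward linkage assumes Euclidean distances, so the raw cases are passed.
ward_tree = linkage(X, method="ward")

# Cut each hierarchy so that at most two clusters remain.
single_labels = fcluster(single_tree, t=2, criterion="maxclust")
ward_labels = fcluster(ward_tree, t=2, criterion="maxclust")

print("single linkage:", single_labels)
print("Ward:", ward_labels)
```

Each row of the linkage result records one fusion step (the two clusters merged, the merge distance, and the size of the new cluster), which corresponds to the information an agglomeration schedule reports. On data this well separated both methods recover the same two groups; on noisier data they can differ noticeably.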

Notes

  1. In diagram A the two characteristics “income” and “age” are not independent. This means that the two-cluster solution could have been achieved on the basis of only one of the two characteristics. On the independence of cluster variables, see Sect. 8.2.1.

  2. The selection of the proximity measures shown in Table 8.4 is based on the proximity measures provided in the SPSS procedure “Hierarchical Cluster Analysis”.

  3. On the website www.multivariate-methods.info, we provide supplementary material (e.g., Excel files) to deepen the reader’s understanding of the methodology.

  4. To simplify the following calculations, only integer values were included in the initial data matrix.

  5. On the standardization of variables, see the comments on statistical basics in Sect. 1.2.1.

  6. A detailed description of the calculation of the correlation coefficient may be found in Sect. 1.2.2.

  7. Due to their rather minor practical importance, divisive cluster procedures will not be discussed here. If you consider applying a divisive clustering algorithm, you can do this in SPSS by clicking on ‘Analyze/Classify/Tree’.

  8. The course of a fusion process is usually illustrated by a table (the so-called agglomeration schedule) and by a dendrogram or an icicle diagram. Both options are explained in detail for the single-linkage method in Sect. 8.2.3.2.1.

  9. For the extended example, the dendrograms were created using the procedure CLUSTER in SPSS (see Sect. 8.3.2).

  10. The agglomeration schedule was also created using the procedure CLUSTER in SPSS.

  11. Since SPSS offers no criteria for determining the optimal number of clusters, it is recommended to use alternative programs such as S-Plus, R or SAS and, if available, the cubic clustering criterion (CCC).

  12. For a brief summary of the basics of statistical testing, see Sect. 1.3.

  13. In addition to KM-CA, two-step cluster analysis may also be used to optimize a clustering solution found by another procedure. Both methods belong to the partitioning clustering methods described in detail in Sect. 8.4.2.

  14. On the problem of outliers, see also the comments in Sect. 1.5.1.

  15. For more detailed considerations on the robustness of cluster analyses, see García-Escudero et al. (2010, p. 89).

  16. For a brief summary of the basics of statistical testing, see Sect. 1.3.

  17. Missing values are a frequent and unfortunately unavoidable problem when conducting surveys (e.g., because people cannot or do not want to answer a question, or as a result of mistakes by the interviewer). The handling of missing values in empirical studies is discussed in Sect. 1.5.2.

  18. The mean values were calculated on the basis of the data set that was also used in the case studies on discriminant analysis (Chap. 4), logistic regression (Chap. 5) and factor analysis (Chap. 7). Using the same case study allows us to illustrate the similarities and differences between the methods.

  19. Multinomial logistic regression requires at least three groups. In the case of a two-cluster solution, a binary logistic regression would have to be performed.
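Notes 11 and 13 touch on two practical steps: choosing the number of clusters with an external criterion, and refining a hierarchical solution with k-means. The sketch below is one possible illustration under stated assumptions, not the chapter's procedure: it uses Python with NumPy and SciPy, invented toy data, and the variance ratio criterion of Calinski and Harabasz (1974, listed in the references) as a stand-in for the CCC named in note 11. The helper `calinski_harabasz` is a hypothetical name introduced here.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.cluster.vq import kmeans2


def calinski_harabasz(X, labels):
    """Variance ratio criterion: between-cluster over within-cluster dispersion."""
    n = X.shape[0]
    groups = np.unique(labels)
    k = len(groups)
    overall_mean = X.mean(axis=0)
    between = 0.0
    within = 0.0
    for g in groups:
        members = X[labels == g]
        centroid = members.mean(axis=0)
        between += len(members) * np.sum((centroid - overall_mean) ** 2)
        within += np.sum((members - centroid) ** 2)
    return (between / (k - 1)) / (within / (n - k))


# Hypothetical toy data: three groups of three cases each.
X = np.array([
    [1.0, 1.0], [1.4, 2.0], [2.1, 1.5],
    [8.0, 8.0], [8.3, 9.1], [9.2, 8.5],
    [1.0, 8.0], [1.6, 9.2], [2.4, 8.3],
])

ward_tree = linkage(X, method="ward")

scores = {}
for k in range(2, 6):
    # Starting partition taken from the Ward hierarchy.
    start = fcluster(ward_tree, t=k, criterion="maxclust")
    # The centroids of the starting partition seed the k-means refinement.
    seeds = np.array([X[start == g].mean(axis=0) for g in np.unique(start)])
    _, refined = kmeans2(X, seeds, minit="matrix")
    scores[k] = calinski_harabasz(X, refined)

# The candidate with the largest criterion value is taken as the suggestion.
best_k = max(scores, key=scores.get)
print(scores)
print("suggested number of clusters:", best_k)
```

Seeding k-means with the Ward centroids avoids the dependence on random starting points; the criterion is then only a heuristic suggestion and should be checked against the interpretability of the resulting clusters.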

References

  • Calinski, T., & Harabasz, J. (1974). A dendrite method for cluster analysis. Communications in statistics—Theory and methods, 3(1), 1–27.

  • García-Escudero, L., Gordaliza, A., Matrán, C., & Mayo-Iscar, A. (2010). A review of robust clustering methods. Advances in Data Analysis and Classification, 4, 89–109.

  • Kline, R. (2011). Principles and practice of structural equation modeling (3rd ed.). New York: Guilford Press.

  • Lance, G. H., & Williams, W. T. (1966). A general theory of classification sorting strategies I. Hierarchical systems. Computer Journal, 9, 373–380.

  • Milligan, G. W. (1980). An examination of the effect of six types of error perturbation on fifteen clustering algorithms. Psychometrika, 45(3), 325–342.

  • Milligan, G. W., & Cooper, M. (1985). An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50(2), 159–179.

  • Mojena, R. (1977). Hierarchical clustering methods and stopping rules: An evaluation. The Computer Journal, 20(4), 359–363.

  • Punj, G., & Stewart, D. (1983). Cluster analysis in marketing research: Review and suggestions for application. Journal of Marketing Research, 20(2), 134–148.

  • Wedel, M., & Kamakura, W. A. (2000). Market segmentation: Conceptual and methodological foundations (2nd ed.). Boston: Springer.

  • Wind, Y. (1978). Issues and advances in segmentation research. Journal of Marketing Research, 15(3), 317–337.

Further reading

  • Anderberg, M. R. (2014). Cluster analysis for applications: Probability and mathematical statistics: A series of monographs and textbooks (Vol. 19). New York: Academic Press.

  • Eisen, M. B., Spellman, P. T., Brown, P. O., & Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences, 95(25), 14863–14868.

  • Everitt, B., Landau, S., Leese, M., & Stahl, D. (2011). Cluster analysis (5th ed.). New York: Wiley.

  • Hennig, C., Meila, M., Murtagh, F., & Rocci, R. (Eds.). (2015). Handbook of cluster analysis. London, New York: Chapman & Hall/CRC.

  • Kaufman, L., & Rousseeuw, P. (2005). Finding groups in data: An introduction to cluster analysis (2nd ed.). New Jersey: John Wiley & Sons.

  • Romesburg, C. (2004). Cluster analysis for researchers. Raleigh: Lulu.com.

  • Wierzchoń, S., & Kłopotek, M. (2018). Modern Algorithms of Cluster Analysis. Berlin: Springer Nature.

Author information

Corresponding author

Correspondence to Klaus Backhaus.

Copyright information

© 2021 The Editor(s) and/or the Author(s), under exclusive license to Springer Fachmedien Wiesbaden GmbH, part of Springer Nature

About this chapter

Cite this chapter

Backhaus, K., Erichson, B., Gensler, S., Weiber, R., Weiber, T. (2021). Cluster Analysis. In: Multivariate Analysis. Springer Gabler, Wiesbaden. https://doi.org/10.1007/978-3-658-32589-3_8

  • DOI: https://doi.org/10.1007/978-3-658-32589-3_8

  • Publisher Name: Springer Gabler, Wiesbaden

  • Print ISBN: 978-3-658-32588-6

  • Online ISBN: 978-3-658-32589-3

  • eBook Packages: Business and Economics (German Language)
