Abstract
Cluster analysis is a procedure for grouping cases (objects of investigation) in a data set. The first step is to determine the similarity or dissimilarity (distance) between the cases using a suitable proximity measure. The second step is to choose a fusion algorithm that successively combines the individual cases into groups (clusters). The goal is to form groups whose cases are similar with respect to the segmentation variables under consideration (homogeneous groups), while the groups themselves are as dissimilar from one another as possible. Cluster analysis procedures can handle variables with metric, non-metric, and mixed scales. This chapter focuses on hierarchical agglomerative clustering methods and presents the single-linkage method and Ward’s method in detail. Finally, two partitioning cluster methods, k-means clustering and two-step cluster analysis, are explained. These methods offer particular advantages when working with large amounts of data.
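The two steps named in the abstract (measure proximity, then fuse cases successively) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the chapter's own implementation: the function name `single_linkage`, the choice of Euclidean distance as the proximity measure, and the six integer-valued cases are all hypothetical and chosen only for the example.

```python
from itertools import combinations
from math import dist


def single_linkage(points, n_clusters):
    """Agglomerative clustering with single linkage (nearest neighbour).

    Returns the final clusters (lists of case indices) and the
    agglomeration schedule as (cluster_a, cluster_b, fusion_distance).
    """
    # Step 1: pairwise Euclidean distances between all cases.
    d = {(i, j): dist(points[i], points[j])
         for i, j in combinations(range(len(points)), 2)}

    # Every case starts as its own cluster.
    clusters = [[i] for i in range(len(points))]
    schedule = []

    # Step 2: repeatedly fuse the two clusters with the smallest distance.
    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Single linkage: the distance between two clusters is the
            # smallest distance between any pair of their members.
            gap = min(d[min(i, j), max(i, j)]
                      for i in clusters[a] for j in clusters[b])
            if best is None or gap < best[0]:
                best = (gap, a, b)
        gap, a, b = best
        schedule.append((clusters[a], clusters[b], gap))
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)

    return sorted(clusters), schedule


# Hypothetical data: six cases described by two metric variables.
cases = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9)]
clusters, schedule = single_linkage(cases, n_clusters=2)
print(clusters)
for left, right, distance in schedule:
    print(left, "+", right, "at distance", round(distance, 3))
```

The sketch recomputes cluster distances in every pass, which keeps the logic readable but scales poorly. For larger data sets a library routine or a partitioning method such as k-means would be the more practical choice.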
Notes
- 1.
In diagram A the two characteristics “income” and “age” are not independent. This means that the two-cluster solution could have been achieved on the basis of only one of the two characteristics. On the independence of cluster variables, see Sect. 8.2.1.
- 2.
The selection of the proximity measures shown in Table 8.4 is based on the proximity measures provided in the SPSS procedure “Hierarchical Cluster Analysis”.
- 3.
On the website www.multivariate-methods.info, we provide supplementary material (e.g., Excel files) to deepen the reader’s understanding of the methodology.
- 4.
To simplify the following calculations, only integer values were included in the initial data matrix.
- 5.
On the standardization of variables, see the comments on statistical basics in Sect. 1.2.1.
- 6.
A detailed description of the calculation of the correlation coefficient may be found in Sect. 1.2.2.
- 7.
Due to their rather minor practical importance, divisive cluster procedures will not be discussed here. If you are considering a divisive clustering algorithm, you can apply one in SPSS by clicking on ‘Analyze/Classify/Tree’.
- 8.
The course of a fusion process is usually illustrated by a table (the so-called agglomeration schedule) and by a dendrogram or an icicle diagram. Both options are explained in detail for the single-linkage method in Sect. 8.2.3.2.1.
- 9.
For the extended example, the dendrograms were created using the procedure CLUSTER in SPSS (see Sect. 8.3.2).
- 10.
The agglomeration schedule was also created using the procedure CLUSTER in SPSS.
- 11.
Since SPSS offers no criteria for determining the optimal number of clusters, it is advisable to use alternative programs such as S-Plus, R, or SAS, together with the cubic clustering criterion (CCC) where available.
- 12.
For a brief summary of the basics of statistical testing, see Sect. 1.3.
- 13.
In addition to KM-CA, two-step cluster analysis may also be used to optimize a clustering solution found by another procedure. Both methods belong to the partitioning clustering methods described in detail in Sect. 8.4.2.
- 14.
On the problem of outliers, see also the comments in Sect. 1.5.1.
- 15.
For more detailed considerations on the robustness of cluster analyses, see García-Escudero et al. (2010, p. 89).
- 16.
For a brief summary of the basics of statistical testing, see Sect. 1.3.
- 17.
Missing values are a frequent and unfortunately unavoidable problem when conducting surveys (e.g. because people cannot or do not want to answer the question, or as a result of mistakes by the interviewer). The handling of missing values in empirical studies is discussed in Sect. 1.5.2.
- 18.
- 19.
Multinomial logistic regression requires at least three groups. In the case of a two-cluster solution, a binary logistic regression would have to be performed instead.
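Note 11 refers to criteria for determining the number of clusters. As one hedged illustration, the sketch below computes the variance ratio criterion of Calinski and Harabasz (1974), which is listed in the references. It is a different criterion from the CCC named in the note, and the function name `calinski_harabasz` and the example data are assumptions made for this sketch. The index divides the between-cluster sum of squares per degree of freedom by the within-cluster sum of squares per degree of freedom, so larger values indicate a better-separated solution.

```python
def calinski_harabasz(points, labels):
    """Variance ratio criterion: (B / (k - 1)) / (W / (n - k)).

    B is the between-cluster sum of squares, W the within-cluster sum of
    squares, n the number of cases, and k the number of clusters.
    Requires 2 <= k < n and at least some within-cluster variation (W > 0).
    """
    n = len(points)

    # Group the cases by their cluster label.
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    k = len(groups)

    def centroid(rows):
        # Mean of each variable across the given cases.
        return [sum(column) / len(rows) for column in zip(*rows)]

    def squared_distance(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    overall = centroid(points)
    between = sum(len(rows) * squared_distance(centroid(rows), overall)
                  for rows in groups.values())
    within = sum(squared_distance(point, centroid(rows))
                 for rows in groups.values() for point in rows)

    return (between / (k - 1)) / (within / (n - k))


# Hypothetical data: two visibly separated groups of three cases each.
cases = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9)]
good = calinski_harabasz(cases, [0, 0, 0, 1, 1, 1])
bad = calinski_harabasz(cases, [0, 0, 1, 1, 1, 1])
print(good > bad)  # the well-separated assignment scores higher
```

In practice the index would be computed for several candidate cluster counts and the solution with the highest value preferred, ideally alongside substantive considerations rather than as the sole decision rule.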
References
Calinski, T., & Harabasz, J. (1974). A dendrite method for cluster analysis. Communications in Statistics - Theory and Methods, 3(1), 1–27.
García-Escudero, L., Gordaliza, A., Matrán, C., & Mayo-Iscar, A. (2010). A review of robust clustering methods. Advances in Data Analysis and Classification, 4, 89–109.
Kline, R. (2011). Principles and practice of structural equation modeling (3rd ed.). New York: Guilford Press.
Lance, G. H., & Williams, W. T. (1966). A general theory of classification sorting strategies I. Hierarchical systems. The Computer Journal, 9, 373–380.
Milligan, G. W. (1980). An examination of the effect of six types of error perturbation on fifteen clustering algorithms. Psychometrika, 45(3), 325–342.
Milligan, G. W., & Cooper, M. (1985). An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50(2), 159–179.
Mojena, R. (1977). Hierarchical clustering methods and stopping rules: An evaluation. The Computer Journal, 20(4), 359–363.
Punj, G., & Stewart, D. (1983). Cluster analysis in marketing research: Review and suggestions for application. Journal of Marketing Research, 20(2), 134–148.
Wedel, M., & Kamakura, W. A. (2000). Market segmentation: Conceptual and methodological foundations (2nd ed.). Boston: Springer.
Wind, Y. (1978). Issues and advances in segmentation research. Journal of Marketing Research, 15(3), 317–337.
Further reading
Anderberg, M. R. (2014). Cluster analysis for applications: Probability and mathematical statistics: A series of monographs and textbooks (Vol. 19). New York: Academic Press.
Eisen, M. B., Spellman, P. T., Brown, P. O., & Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences, 95(25), 14863–14868.
Everitt, B., Landau, S., Leese, M., & Stahl, D. (2011). Cluster analysis (5th ed.). New York: Wiley.
Hennig, C., Meila, M., Murtagh, F., & Rocci, R. (Eds.). (2015). Handbook of cluster analysis. London, New York: Chapman & Hall/CRC.
Kaufman, L., & Rousseeuw, P. (2005). Finding groups in data: An introduction to cluster analysis (2nd ed.). New Jersey: John Wiley & Sons.
Romesburg, C. (2004). Cluster analysis for researchers. Raleigh: Lulu.com.
Wierzchoń, S., & Kłopotek, M. (2018). Modern algorithms of cluster analysis. Berlin: Springer Nature.
© 2021 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Fachmedien Wiesbaden GmbH, part of Springer Nature
About this chapter
Cite this chapter
Backhaus, K., Erichson, B., Gensler, S., Weiber, R., Weiber, T. (2021). Cluster Analysis. In: Multivariate Analysis. Springer Gabler, Wiesbaden. https://doi.org/10.1007/978-3-658-32589-3_8
DOI: https://doi.org/10.1007/978-3-658-32589-3_8
Publisher Name: Springer Gabler, Wiesbaden
Print ISBN: 978-3-658-32588-6
Online ISBN: 978-3-658-32589-3
eBook Packages: Business and Economics (German Language)