Abstract
We perform a benchmarking study to identify the advantages and the drawbacks of Spectral Clustering and Density-Based Spatial Clustering of Applications with Noise (DBSCAN). We compare the two methods with the classic K-means clustering. The methods are performed on five simulated and three real data sets. The obtained clustering results are compared using external and internal indices, as well as run times. Although there is not one method that performs best on all types of data sets, we find that DBSCAN should generally be reserved for non-convex data with well-separated clusters or for data with many outliers. Spectral Clustering has better overall performance but with higher instability of the results compared to K-means, and longer run time.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Notes
- 1.
Performed on a MacBook Pro 2017 - Processor: 2.3 GHz Intel Core i5; Memory: 8GB.
- 2.
References
Barndorff-Nielsen, O.E.: Exponentially decreasing distributions for the logarithm of particle size. Proc. R. Soc. Lond. A. 353(1674), 401–419 (1977)
Becker, R.A., Wilks, A.R.: (Orig. S code), R. Brownrigg (R Version), T.P. Minka and A. Deckmyn (Enhancements), maps: Draw Geographical Maps. R package v. 3.3.0 (2018)
Becker, R.A., Wilks A.R.: (Original S code), R. Brownrigg (R Version), mapdata: Extra Map Databases. R package version 2.3.0. (2018)
Berhane, F.: Data distributions where Kmeans clustering fails: Can DBSCAN be a solution? https://datascience-enthusiast.com/Python/DBSCAN_Kmeans.html (2020)
Desgraupes, B.: clusterCrit: Clustering Indices. R package version 1.2.8. (2018)
Emadi, H.S., Mazinani, S.M.: A novel anomaly detection algorithm using DBSCAN and SVM in wireless sensor networks. Wirel. Pers Comm. 98, 2025–2035 (2018)
Ester, M., Kriegel, H., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD-96 Proceedings, pp. 226-231 (1996)
Fitch, J.P., Khan, N.A.M., Tortora, C.: Back pain: a spectral clustering approach. Arch. Data Sci. Ser. B 1(1) (online first) (2019)
Francis, Z., Villagrasa, C., Clairand, I.: Simulation of DNA damage clustering after proton irradiation using an adapted DBSCAN algorithm. Comput. Methods Programs Biomed. 101(3), 265–270 (2011)
Genz, A., Bretz, F., Miwa, T., Mi, X., Leisch, F., Scheipl, F., Hothorn, T.: mvtnorm: Multivariate Normal and t Distributions, R package version 1.0-11 (2019)
Hahsler, M., Piekenbrock, M.: dbscan: Density Based Clustering of Applications with Noise (DBSCAN) and Related Algorithms, R package version 1.1-1 (2017)
Hornik, K., Grün, B.: movMF: An R package for fitting mixtures of von mises-fisher distributions. J. Stat. Soft. 58(10), 1–31 (2014)
Hubert, L., Arabie, P.: Comparing partitions. J. Classif. 2(1), 193–218 (1985)
Kahle, D., Wickham, H.: ggmap: spatial visualization with ggplot2. R J. 5(1), 144–161 (2013)
Karatzoglou, A., Smola, A., Hornik, K., Zeileis, A.: kernlab—An S4 Package for kernel methods in R. J. Stat. Soft. 11(9), 1–20 (2004)
Kassambara, A., Mundt, F.: factoextra: Extract and Visualize the Results of Multivariate Data Analyses, R package version 1.0.5 (2017)
Leisch F., Dimitriadou, E.: mlbench: Machine Learning Benchmark Problems, R package version 2.1-1 (2010)
Liu, Y., Li, Z., Xiong, H., Gao, X. Wu, J.: Understanding of internal clustering validation measures. In: Proceedings of the 2010 IEEE International Conference on Data Mining, pp. 911-916 (2010)
MacQueen, J.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1: Statistics, pp. 281-297. University of California Press, Berkeley, California (1967)
Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., Hornik, K.: cluster: Cluster Analysis Basics and Extensions, R package version 2.1.0 (2019)
McInnes, L., Healy, J., Astels, S.: Comparing python clustering algorithms. https://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html
McLachlan, G.J., Basford, K.E.: Mixture Models: Inference and Applications to Clustering, Statistics, Textbooks and Monographs, vol. 84 (1988)
Mechelen, I.V., Boulesteix, A., Dangl, R., Dean, N., Guyon, I., Hennig, C., Leisch, F., Steinley, D.: Benchmarking in cluster analysis: A white paper (2018). arXiv:1809.10496v2
Murugesan, N., Cho, I., Tortora, C.: Benchmarking in cluster analysis: a study on spectral clustering, DBSCAN and K-means (github repository) https://irene1014.github.io/benchmarking-study-cluster-analysis (2020)
Paccanaro, A., Casbon, J.A., Saqi, M.A.S.: Spectral clustering of protein sequences. Nucleic. Acid Res. 34(5), 1571–1580 (2006)
Peng, H., Pavlidis, N., Eckley, I., Tsalamanis, I.: Subspace clustering of very sparse high-dimensional data. In: 2018 IEEE International Conference on Big Data, pp. 3780–3783 (2018)
Punzo, A., Mazza, A., McNicholas, P.D.: ContaminatedMixt: An R package for fitting parsimonious mixtures of multivariate contaminated normal distributions. J. Stat. Softw. 85(10), 1–25 (2018)
R Core Team: R: A language and environment for statistical computing, R Foundation for Statistical Computing, Vienna, Austria (2017)
Rousseeuw, P.J.: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987)
Sander, J., Ester, M., Kriegel, H., Xu, X.: Density-based clustering in spatial databases: The algorithm GDBSCAN and its applications. Data Min. Knowl. Disc. 2, 169–194 (1998)
Schubert, E., Sander, J., Ester, M., Kriegel, H., Xu, X.: DBSCAN revisited, revisited: why and how you should (Still) use DBSCAN. ACM T Database Syst. 42(3), 19 (2017)
Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 228, 888–905 (2000)
Tortora, C., ElSherbiny, A., Browne, R.P., Franczak B.C., McNicholas, P.D.: MixGHD: Model Based Clustering, Classification and Discriminant Analysis Using the Mixture of Generalized Hyperbolic Distributions, R package version 2.1 (2017)
Tortora, C., Franczak, B.C., Browne, R.P., McNicholas, P.D.: A mixture of coalesced generalized hyperbolic distributions. J. Classif. 36, 26–57 (2019)
Tukey, J.W.: A survey of sampling from contaminated distributions. In: Olkin, I., Ghurye, S.G., Hoeffding, W., Madow, W.G., Mann, H.B. (eds.) Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling, pp 448–495 (1960)
Wang, S., Chen, F., Feng, J.: Spectral clustering of high-dimensional data via nonnegative matrix factorization. In: 2015 International Joint Conference on Neural Networks (IJCNN), pp 1–8 (2015)
Wickham, H.: ggplot2: Elegant Graphics for Data Analysis. Springer, NY (2016)
Wu, S., Feng, X., Zhou, W.: Spectral clustering of high-dimensional data exploiting sparse representation vectors. Neurocomputing 135, 229–239 (2014)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Murugesan, N., Cho, I., Tortora, C. (2021). Benchmarking in Cluster Analysis: A Study on Spectral Clustering, DBSCAN, and K-Means. In: Chadjipadelis, T., Lausen, B., Markos, A., Lee, T.R., Montanari, A., Nugent, R. (eds) Data Analysis and Rationality in a Complex World. IFCS 2019. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Cham. https://doi.org/10.1007/978-3-030-60104-1_20
Download citation
DOI: https://doi.org/10.1007/978-3-030-60104-1_20
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60103-4
Online ISBN: 978-3-030-60104-1
eBook Packages: Mathematics and StatisticsMathematics and Statistics (R0)