Abstract
Clustering algorithms are a powerful machine learning tool for working with large datasets, as they allow data to be grouped according to certain characteristics without the need for manual labelling. These algorithms generally require the number of clusters to be formed (k) as a model parameter and, while in some cases this number can be set manually, most situations call for it to be estimated in an unsupervised way. The most widespread techniques offer acceptable results, but there is still much room for improvement. This study highlights their main shortcomings and reviews some of the advances in the estimation of this parameter presented in recent years, exploring their advantages and limitations.
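As an illustration of what unsupervised estimation of k can look like, the following is a minimal sketch that runs k-means for several candidate values of k and keeps the one with the highest mean silhouette score, one of the widely used validity indices. It is not a method taken from the study: the helper names (`kmeans`, `mean_silhouette`, `estimate_k`), the deterministic farthest-first initialisation, and the toy dataset are all assumptions made for this example.

```python
import math


def kmeans(points, k, iters=100):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    # Assumption: deterministic farthest-first initialisation, so the
    # example is reproducible without a random seed.
    centres = [points[0]]
    while len(centres) < k:
        centres.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centres))
        )
    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centre.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centres[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centre to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centres[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


def mean_silhouette(points, labels):
    """Mean silhouette width over all points (singletons contribute 0)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for m, other in clusters.items()
            if m != lab
        )
        total += (b - a) / max(a, b)
    return total / len(points)


def estimate_k(points, k_max):
    """Score every k in 2..k_max and return (best k, all scores)."""
    scores = {
        k: mean_silhouette(points, kmeans(points, k)) for k in range(2, k_max + 1)
    }
    return max(scores, key=scores.get), scores


# Hypothetical toy data: three well-separated groups of five points each.
offsets = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5)]
blob_origins = [(0, 0), (10, 10), (20, 0)]
points = [(ox + dx, oy + dy) for ox, oy in blob_origins for dx, dy in offsets]

best_k, scores = estimate_k(points, k_max=6)
print(best_k, scores)
```

On clean, well-separated data like this toy set, the silhouette criterion recovers the number of groups. The sketch also makes one cost visible: the data must be re-clustered for every candidate k, which becomes expensive on large datasets.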
Acknowledgment
This research has been funded through project CAROLUM (PID2021-125125OB-I00) by the Spanish State Research Agency and the European Regional Development Fund. The research was also supported by the Ministry of Universities of Spain through a grant for the Training of University Researchers (Ayuda para la Formación del Profesorado Universitario, reference FPU20/05584).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Pegado-Bardayo, A., Muñuzuri, J., Escudero-Santana, A., Lorenzo-Espejo, A. (2024). Trends in Unsupervised Methodologies for Optimal K-Value Selection in Clustering Algorithms. In: Bautista-Valhondo, J., Mateo-Doll, M., Lusa, A., Pastor-Moreno, R. (eds) Proceedings of the 17th International Conference on Industrial Engineering and Industrial Management (ICIEIM) – XXVII Congreso de Ingeniería de Organización (CIO2023). CIO 2023. Lecture Notes on Data Engineering and Communications Technologies, vol 206. Springer, Cham. https://doi.org/10.1007/978-3-031-57996-7_49
DOI: https://doi.org/10.1007/978-3-031-57996-7_49
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-57995-0
Online ISBN: 978-3-031-57996-7
eBook Packages: Engineering, Engineering (R0)