Trends in Unsupervised Methodologies for Optimal K-Value Selection in Clustering Algorithms

Pegado-Bardayo, Ana; Muñuzuri, Jesús; Escudero-Santana, Alejandro; Lorenzo-Espejo, Antonio

doi:10.1007/978-3-031-57996-7_49

Part of the book series: Lecture Notes on Data Engineering and Communications Technologies (LNDECT, volume 206)

Included in the following conference series:

International Conference on Industrial Engineering and Industrial Management (ICIEIM) – Congreso de Ingeniería de Organización

Abstract

Clustering algorithms are a powerful machine learning tool for large datasets, as they group data according to shared characteristics without requiring manual labels. These algorithms generally require the number of clusters to be formed (k) as a model parameter and, while in some instances this number can be set manually, most situations require it to be estimated in an unsupervised manner. The most widespread estimation techniques offer acceptable results, but there is still much room for improvement. This study highlights their main shortcomings and reviews some of the advances in the estimation of this parameter presented in recent years, exploring their advantages and limitations.
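
To make the estimation task concrete, the following is a minimal sketch of one widely used unsupervised heuristic: run k-means for several candidate values of k and keep the value with the highest mean silhouette coefficient. It is an illustration only, not the procedure evaluated in this chapter. The toy dataset, the farthest-point initialisation, and the helper names `kmeans` and `silhouette` are all assumptions introduced here.

```python
import math

def kmeans(points, k, max_iter=100):
    """Lloyd's k-means with a deterministic farthest-point initialisation (an assumption)."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next centroid: the point farthest from all centroids chosen so far.
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def silhouette(points, labels):
    """Mean silhouette coefficient; points in singleton clusters contribute 0."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        total += (b - a) / max(a, b)
    return total / len(points)

# Hypothetical toy data: three well-separated groups of five points each.
offsets = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]
centres = [(0, 0), (10, 0), (5, 9)]
points = [(cx + dx, cy + dy) for cx, cy in centres for dx, dy in offsets]

# Score each candidate k and keep the one with the highest mean silhouette.
scores = {k: silhouette(points, kmeans(points, k)) for k in range(2, 7)}
best_k = max(scores, key=scores.get)
print(best_k, {k: round(s, 3) for k, s in scores.items()})
```

On this toy data the three built-in groups lead the heuristic to select k = 3. Note that every candidate k requires a full clustering run, which becomes costly on large datasets.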

Acknowledgment

This research has been funded through project CAROLUM (PID2021-125125OB-I00) by the Spanish State Research Agency and the European Regional Development Fund. The research was also supported by the Ministry of Universities of Spain through a grant for the Training of University Researchers (Ayuda para la Formación del Profesorado Universitario, reference FPU20/05584).

Author information

Authors and Affiliations

Dpto. De Organización Industrial y Gestión de Empresas II, Escuela Técnica Superior de Ingeniería, Universidad de Sevilla, Camino de los Descubrimientos S/N, 41092, Sevilla, Spain
Ana Pegado-Bardayo, Jesús Muñuzuri, Alejandro Escudero-Santana & Antonio Lorenzo-Espejo

Corresponding author

Correspondence to Ana Pegado-Bardayo.

Editor information

Editors and Affiliations

ETSEIB, Universitat Politècnica de Catalunya, Barcelona, Spain
Joaquín Bautista-Valhondo
ETSEIB, Universitat Politècnica de Catalunya, Barcelona, Spain
Manuel Mateo-Doll
ETSEIB, Universitat Politècnica de Catalunya, Barcelona, Spain
Amaia Lusa
ETSEIB, Universitat Politècnica de Catalunya, Barcelona, Spain
Rafael Pastor-Moreno

About this paper

Cite this paper

Pegado-Bardayo, A., Muñuzuri, J., Escudero-Santana, A., Lorenzo-Espejo, A. (2024). Trends in Unsupervised Methodologies for Optimal K-Value Selection in Clustering Algorithms. In: Bautista-Valhondo, J., Mateo-Doll, M., Lusa, A., Pastor-Moreno, R. (eds) Proceedings of the 17th International Conference on Industrial Engineering and Industrial Management (ICIEIM) – XXVII Congreso de Ingeniería de Organización (CIO2023). CIO 2023. Lecture Notes on Data Engineering and Communications Technologies, vol 206. Springer, Cham. https://doi.org/10.1007/978-3-031-57996-7_49

DOI: https://doi.org/10.1007/978-3-031-57996-7_49
Published: 26 April 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-57995-0
Online ISBN: 978-3-031-57996-7
eBook Packages: Engineering (R0)