Abstract
Hierarchical clustering is one of the most suitable tools for discovering the underlying structure of a dataset in unsupervised learning, where the ground truth is unknown and classical machine learning classifiers do not apply. In many real applications it provides a perspective on the inner structure of the data and is preferred to partitional methods. However, determining the final number of clusters requires a human expert to deduce it from the dendrogram, which is a major obstacle to building fully automatic systems such as those required for decision support in Industry 4.0. This research proposes a general criterion for cutting a dendrogram automatically, selected by comparing six original criteria based on the Calinski-Harabasz index. The performance of each criterion is evaluated on 95 real-life dendrograms of different topologies against the number of classes proposed by experts, and a winning criterion is determined. This research is part of a larger project to build an Intelligent Decision Support System that assesses the performance of 3D printers from sensor data in real time, although the proposed criteria can be used in other real applications of hierarchical clustering. The methodology is applied to a real-life dataset from 3D printers, and a large reduction in CPU time is shown by comparing the CPU time of the entire clustering method before and after this modification. The approach also reduces the dependence on a human expert to provide the number of clusters by inspecting the dendrogram. Further, it allows hierarchical clustering to run automatically in real-life industrial applications, enables continuous monitoring of real 3D printers in production, and helps in building an Intelligent Decision Support System that detects operational modes, anomalies, and other behavioral patterns.
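To make the idea of an index-driven cut concrete, the following is a minimal sketch of the simplest baseline: build the nested partitions of a dendrogram and keep the cut that maximizes the Calinski-Harabasz index. It is not one of the six criteria proposed in the paper, whose definitions are not given in this abstract. The Ward-style merge cost, the toy data, and all names (`agglomerate`, `calinski_harabasz`, `best_cut`, `k_max`) are assumptions made for illustration only.

```python
from itertools import combinations


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    na, nb = len(a), len(b)
    return na * nb / (na + nb) * sq_dist(centroid(a), centroid(b))


def agglomerate(points):
    """Naive agglomerative clustering (assumed Ward criterion).

    Returns a dict mapping k to the partition with k clusters, i.e. every
    horizontal cut of the dendrogram. Cubic cost: fine for a toy example only.
    """
    clusters = [[p] for p in points]
    levels = {len(clusters): [list(c) for c in clusters]}
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        levels[len(clusters)] = [list(c) for c in clusters]
    return levels


def calinski_harabasz(clusters):
    """CH index: (between SS / (k - 1)) / (within SS / (n - k)).

    Assumes 2 <= k < n and a non-zero within-cluster sum of squares.
    """
    points = [p for c in clusters for p in c]
    n, k = len(points), len(clusters)
    g = centroid(points)
    between = sum(len(c) * sq_dist(centroid(c), g) for c in clusters)
    within = sum(sq_dist(p, centroid(c)) for c in clusters for p in c)
    return (between / (k - 1)) / (within / (n - k))


def best_cut(points, k_max=6):
    """Score every cut with 2..k_max clusters and return the best k."""
    levels = agglomerate(points)
    k_max = min(k_max, len(points) - 1)
    scores = {k: calinski_harabasz(levels[k]) for k in range(2, k_max + 1)}
    return max(scores, key=scores.get), scores


# Toy data (hypothetical): three well-separated square blobs in 2D.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 0), (10, 1), (11, 0), (11, 1),
    (5, 9), (5, 10), (6, 9), (6, 10),
]
best_k, scores = best_cut(points)
print(best_k)  # 3
```

On real sensor data the index rarely has such a clean maximum, which is one reason a plain arg-max may be a weak rule and why alternative criteria built on the same index are worth comparing.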
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Karna, A., Gibert, K. Automatic identification of the number of clusters in hierarchical clustering. Neural Comput & Applic 34, 119–134 (2022). https://doi.org/10.1007/s00521-021-05873-3
DOI: https://doi.org/10.1007/s00521-021-05873-3