
Automatic identification of the number of clusters in hierarchical clustering

  • S.I. : WSOM 2019

Neural Computing and Applications

Abstract

Hierarchical clustering is one of the most suitable tools for discovering the underlying structure of a dataset in unsupervised learning, where the ground truth is unknown and classical machine learning classifiers cannot be applied. In many real applications it provides insight into the inner structure of the data and is preferred to partitional methods. However, determining the final number of clusters in hierarchical clustering requires a human expert to deduce it from the dendrogram, which is a major obstacle to building fully automatic systems such as those required for decision support in Industry 4.0. This research proposes a general criterion to cut a dendrogram automatically, comparing six original criteria based on the Calinski-Harabasz index. The performance of each criterion on 95 real-life dendrograms of different topologies is evaluated against the number of classes proposed by experts, and a winning criterion is determined. The work is framed within a larger project to build an intelligent decision support system that assesses the performance of 3D printers from sensor data in real time, although the proposed criteria can be used in other real applications of hierarchical clustering. The methodology is applied to a real-life dataset from the 3D printers, and the large reduction in CPU time is shown by comparing the CPU time of the entire clustering method before and after this modification. The approach also reduces the dependence on human experts to provide the number of clusters by inspecting the dendrogram. Such a process makes it possible to apply hierarchical clustering automatically in real-life industrial applications, enables continuous monitoring of 3D printers in production, and helps in building an intelligent decision support system to detect operational modes, anomalies, and other behavioral patterns.
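The core idea described above — cutting the dendrogram at the number of clusters that scores best under the Calinski-Harabasz index — can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' six criteria: the toy data, the Ward linkage, and the candidate range 2–9 are all choices made here for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import calinski_harabasz_score

# Toy data: three well-separated 2-D blobs (an assumption for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2)) for c in (0, 4, 8)])

# Build the dendrogram once with Ward linkage.
Z = linkage(X, method="ward")

# Cut the dendrogram at each candidate number of clusters k and
# keep the cut that maximizes the Calinski-Harabasz index.
best_k, best_score = None, -np.inf
for k in range(2, 10):
    labels = fcluster(Z, t=k, criterion="maxclust")
    score = calinski_harabasz_score(X, labels)
    if score > best_score:
        best_k, best_score = k, score

print(best_k)  # 3 for this toy data: the CH index peaks at the true structure
```

Because the dendrogram is built only once and each cut is a cheap relabeling, scanning candidate values of k in this way is far faster than rerunning a clustering algorithm per k, which is consistent with the CPU-time reduction reported in the article.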



Author information


Corresponding author

Correspondence to Ashutosh Karna.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Karna, A., Gibert, K. Automatic identification of the number of clusters in hierarchical clustering. Neural Comput & Applic 34, 119–134 (2022). https://doi.org/10.1007/s00521-021-05873-3

