
Indexes to Find the Optimal Number of Clusters in a Hierarchical Clustering

  • José David Martín-Fernández
  • José María Luna-Romera
  • Beatriz Pontes
  • José C. Riquelme-Santos
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 950)

Abstract

Clustering analysis is one of the most commonly used techniques for uncovering patterns in data mining. Most clustering methods require the number of clusters to be established beforehand; however, given the size of the data currently in use, predicting that value is a computationally expensive task in most cases. In this article, we present a clustering technique based on hierarchical clustering that avoids this requirement. There are many examples of this procedure in the literature, most of them focusing on the divisive (top-down) subtype, whereas in this article we cover the agglomerative (bottom-up) subtype. Although it carries a higher computational and temporal cost, it yields very valuable information about the membership of elements to clusters and how those clusters are grouped together, that is to say, the dendrogram. Finally, several datasets of varying dimensionality have been used; for each of them, we compute internal validation indexes to test the developed algorithm, studying which of these indexes yields the best possible clustering.
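The procedure the abstract describes can be illustrated with a short sketch. The following is a minimal example, not the authors' implementation: it assumes a synthetic dataset generated with scikit-learn's make_blobs, Ward linkage from SciPy, and two widely used internal validation indexes (Silhouette and Davies-Bouldin); the actual datasets, linkage criterion, and indexes used in the paper may differ. The full dendrogram is built once by agglomerative merging, then cut at each candidate number of clusters k, and each resulting partition is scored.

```python
# Minimal sketch: agglomerative hierarchical clustering plus internal
# validation indexes to estimate the optimal number of clusters.
# Assumptions (not from the paper): synthetic data, Ward linkage,
# Silhouette and Davies-Bouldin as the internal indexes.
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Synthetic dataset with a known structure (4 clusters) for illustration.
X, _ = make_blobs(n_samples=500, centers=4, n_features=8, random_state=0)

# Build the full dendrogram once by agglomerative (bottom-up) merging;
# 'ward' merges the pair of clusters that minimizes within-cluster variance.
Z = linkage(X, method="ward")

# Cut the dendrogram at each candidate k and score the partition.
# Higher Silhouette is better; lower Davies-Bouldin is better.
for k in range(2, 11):
    labels = fcluster(Z, t=k, criterion="maxclust")
    sil = silhouette_score(X, labels)
    db = davies_bouldin_score(X, labels)
    print(f"k={k:2d}  silhouette={sil:.3f}  davies_bouldin={db:.3f}")
```

On this synthetic data both indexes should agree on k = 4; on real data they can disagree, which is precisely the comparison the article carries out across datasets of different dimensionality.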

Keywords

Machine learning · Hierarchical clustering · Internal validation indexes


Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  • José David Martín-Fernández (1), corresponding author
  • José María Luna-Romera (1)
  • Beatriz Pontes (1)
  • José C. Riquelme-Santos (1)

  1. University of Seville, Seville, Spain
