Hierarchical conceptual clustering based on quantile method for identifying microscopic details in distributional data

Abstract

Symbolic data is aggregated from larger traditional datasets in order to hide entry-specific details and to make it feasible to analyse very large amounts of data, such as big data. Symbolic data may appear in many different and complex forms, such as intervals and histograms. Identifying patterns and finding similarities between objects is one of the most fundamental tasks of data mining, and the usual methods are not sufficient to cluster these sophisticated data types accurately. Over the years, different approaches have been proposed, but they mainly concentrate on the “macroscopic” similarities between objects. Distributional data, for example symbolic data, has been aggregated from large datasets, and thus even the smallest microscopic differences and similarities become extremely important. In this paper, a method is proposed for clustering distributional data based on these microscopic similarities by using quantile values. Having multiple points for comparison makes it possible to identify similarities in small sections of a distribution while producing more adequate hierarchical concepts. The proposed algorithm, called microscopic hierarchical conceptual clustering, has a monotone property and was found in experiments to produce more adequate conceptual clusters. Furthermore, thanks to the use of quantiles, the algorithm allows different types of symbolic data to be compared easily without any additional complexity.
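To make the idea of comparing objects through quantile values more concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes that values are uniformly distributed within each histogram bin, that every bin has positive probability, and that an interval can be treated as a single uniform bin. The function names (`histogram_quantiles`, `interval_quantiles`, `quantile_dissimilarity`) and the dissimilarity itself (a plain sum of absolute differences at each quantile level) are hypothetical choices made for illustration only; the measure actually used by microscopic hierarchical conceptual clustering is defined in the paper.

```python
import numpy as np


def histogram_quantiles(edges, probs, m=4):
    """Return the m + 1 quantile values (levels 0, 1/m, ..., 1) of a histogram.

    Assumption: values are uniformly distributed inside each bin, so the
    cumulative distribution function is piecewise linear between bin edges.
    """
    edges = np.asarray(edges, dtype=float)
    cdf = np.concatenate(([0.0], np.cumsum(probs)))
    cdf /= cdf[-1]  # normalise so the CDF ends at exactly 1
    levels = np.linspace(0.0, 1.0, m + 1)
    # Invert the piecewise-linear CDF by linear interpolation.
    return np.interp(levels, cdf, edges)


def interval_quantiles(low, high, m=4):
    """Treat an interval as one uniform bin and return its quantile values."""
    return histogram_quantiles([low, high], [1.0], m)


def quantile_dissimilarity(q_a, q_b):
    """Hypothetical measure: sum of absolute differences per quantile level."""
    return float(np.sum(np.abs(np.asarray(q_a) - np.asarray(q_b))))


# A histogram-valued object and an interval-valued object, both reduced
# to the same five quantile points so they can be compared directly.
hist_q = histogram_quantiles([0, 10, 20, 30], [0.2, 0.5, 0.3])
intv_q = interval_quantiles(5, 25)
print(hist_q, intv_q, quantile_dissimilarity(hist_q, intv_q))
```

Because both objects are reduced to the same number of quantile points, a histogram and an interval can be compared level by level. This is the sense in which a quantile representation lets different types of symbolic data be handled in a common way.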

Acknowledgements

The authors thank the reviewers for their helpful comments. Kadri Umbleja’s work has been supported by the Japan Society for the Promotion of Science’s International Research Fellow program.

Author information

Corresponding author

Correspondence to Kadri Umbleja.

Appendix

An implementation of the algorithm in Python can be found at: https://github.com/iardacil/MHCC

About this article

Cite this article

Umbleja, K., Ichino, M. & Yaguchi, H. Hierarchical conceptual clustering based on quantile method for identifying microscopic details in distributional data. Adv Data Anal Classif 15, 407–436 (2021). https://doi.org/10.1007/s11634-020-00411-w

  • DOI: https://doi.org/10.1007/s11634-020-00411-w

Keywords

  • Conceptual clustering
  • Quantile method
  • Symbolic data

Mathematics Subject Classification

  • 68T09