A Deeper Analysis of the Hierarchical Clustering and Set Unionability-Based Data Union Method

Abstract

Data are now an extremely valuable resource. They come from many sources: a national government, an organization, a company, or a private individual. Their content is equally varied: a dataset could be about primary education in the U.K., medical care in the U.S., or agriculture in Vietnam. It is reasonable to assume that some of these datasets cover the same topic and have the same, or at least similar, structures. Unioning such datasets into larger, more meaningful ones is beneficial: the unioned datasets contain the collective information of their sources, and users and scientists no longer have to spend time searching for and combining the datasets themselves. In this paper, we propose a data union method for JSON-format data based on hierarchical clustering and Set Unionability. We also report experiments that evaluate the method and demonstrate its feasibility.
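
To make the general idea concrete, the following is a minimal sketch, not the method evaluated in the paper. It groups JSON-like datasets by agglomerative (single-linkage) clustering and unions the records in each group. The abstract does not define Set Unionability, so a plain Jaccard overlap between attribute value sets is swapped in as a stand-in. The function names, the sample datasets, and the 0.5 threshold are all hypothetical.

```python
from itertools import combinations


def attribute_domains(records):
    """Map each attribute name to the set of (stringified) values it takes."""
    domains = {}
    for record in records:
        for key, value in record.items():
            domains.setdefault(key, set()).add(str(value))
    return domains


def jaccard(a, b):
    """Jaccard overlap of two value sets (stand-in for Set Unionability)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def directed_score(dom_a, dom_b):
    """Average, over attributes of A, of the best-matching attribute in B."""
    best = [max(jaccard(va, vb) for vb in dom_b.values()) for va in dom_a.values()]
    return sum(best) / len(best)


def dataset_similarity(ds_a, ds_b):
    """Symmetric similarity between two lists of JSON-like records."""
    dom_a, dom_b = attribute_domains(ds_a), attribute_domains(ds_b)
    if not dom_a or not dom_b:
        return 0.0
    return 0.5 * (directed_score(dom_a, dom_b) + directed_score(dom_b, dom_a))


def cluster_datasets(datasets, threshold):
    """Single-linkage agglomerative clustering, stopped at a similarity threshold."""
    sim = {
        frozenset((x, y)): dataset_similarity(datasets[x], datasets[y])
        for x, y in combinations(datasets, 2)
    }

    def linkage(c1, c2):
        return max(sim[frozenset((x, y))] for x in c1 for y in c2)

    clusters = [[name] for name in datasets]
    while len(clusters) > 1:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        if linkage(clusters[i], clusters[j]) < threshold:
            break  # no remaining pair is similar enough to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return clusters


def union_cluster(datasets, names):
    """Union a cluster by concatenating the records of its member datasets."""
    return [record for name in names for record in datasets[name]]


# Hypothetical sample data: two school datasets with different attribute names.
datasets = {
    "schools_a": [{"school": "Oak", "city": "Leeds"}, {"school": "Elm", "city": "York"}],
    "schools_b": [{"name": "Oak", "town": "Leeds"}, {"name": "Ash", "town": "York"}],
    "farms": [{"crop": "rice", "province": "An Giang"}],
}
clusters = cluster_datasets(datasets, threshold=0.5)
unioned = [union_cluster(datasets, names) for names in clusters]
```

In this sketch the two school datasets are merged even though their attribute names differ, because the comparison uses attribute values rather than names. A real union step would also need to align the matched attributes, which is omitted here.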

Notes

  1. http://github.com/ligthsworn/table_union_benchmark.

Author information

Corresponding author

Correspondence to Tran Khanh Dang.

Ethics declarations

Conflict of Interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Future Data and Security Engineering 2021” guest edited by Tran Khanh Dang.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Dang, T.K., Ta, M.H. A Deeper Analysis of the Hierarchical Clustering and Set Unionability-Based Data Union Method. SN COMPUT. SCI. 3, 486 (2022). https://doi.org/10.1007/s42979-022-01384-7

  • DOI: https://doi.org/10.1007/s42979-022-01384-7

Keywords

  • Data integration system
  • Data union
  • Hierarchical clustering
  • Open data