Enhancing Multi-Attribute Similarity Join using Reduced and Adaptive Index Trees

Silva, Vítor Bezerra; Nascimento, Dimas Cassimiro

doi:10.1007/s10115-024-02089-4

Regular Paper
Published: 09 April 2024

Knowledge and Information Systems (2024)

Abstract

Multi-Attribute Similarity Join is an important task for a variety of applications. Because of the large amount of data involved, several techniques have been proposed to avoid superfluous comparisons between entities. One of these techniques is the Index Tree. In this work, we propose an adaptive version (the Adaptive Index Tree) of the state-of-the-art Index Tree for multi-attribute data. Our method selects the best filter configuration to construct the Adaptive Index Tree. We also propose a reduced version of the Index Trees, aiming to improve the trade-off between efficacy and efficiency for the Similarity Join task. Finally, we propose Filter and Feature selectors designed for the Similarity Join task. To evaluate the impact of the proposed approaches, we performed an experimental analysis on five real-world datasets. Based on the experiments, we conclude that our reduced approaches produce superior results compared to the state-of-the-art approach, especially when dealing with datasets that have a significant number of attributes and/or substantial attribute sizes.
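As a rough illustration of the task the abstract describes, the sketch below joins records whose attributes are all similar enough, and uses a cheap length filter to skip comparisons that cannot succeed. It is not the Index Tree or any method from the paper. It assumes whitespace tokenization, Jaccard similarity and one threshold per attribute, and the names `tokens`, `passes_length_filter` and `similarity_join` are hypothetical.

```python
from itertools import combinations


def tokens(value: str) -> set:
    """Assumed tokenization: lowercase, split on whitespace."""
    return set(value.lower().split())


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two token sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def passes_length_filter(a: set, b: set, threshold: float) -> bool:
    """Necessary condition for jaccard(a, b) >= threshold.

    Jaccard is at most min(|a|, |b|) / max(|a|, |b|), so a pair whose
    sizes differ too much can be discarded without intersecting the sets.
    """
    small, large = sorted((len(a), len(b)))
    return small >= threshold * large


def similarity_join(records, thresholds):
    """Return index pairs (i, j), i < j, similar on every attribute."""
    tokenized = [[tokens(v) for v in rec] for rec in records]
    matches = []
    for i, j in combinations(range(len(records)), 2):
        pairs = list(zip(tokenized[i], tokenized[j], thresholds))
        # Cheap filter first: skip the pair if any attribute fails it.
        if not all(passes_length_filter(a, b, t) for a, b, t in pairs):
            continue
        # Exact verification on all attributes.
        if all(jaccard(a, b) >= t for a, b, t in pairs):
            matches.append((i, j))
    return matches


records = [
    ("the beatles", "abbey road"),
    ("beatles the", "abbey road remastered"),
    ("pink floyd", "the wall"),
]
print(similarity_join(records, thresholds=(0.8, 0.5)))  # [(0, 1)]
```

This sketch still enumerates every pair of records, so the filter only saves the cost of verification. Index structures such as the one discussed in the abstract aim to avoid generating most candidate pairs in the first place.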

Data availability

The datasets employed in our experiments are all available in public repositories: Google Playstore: https://www.kaggle.com/datasets/gauthamp10/google-playstore-apps; Music Brainz 20 M and North Carolina Voter 10 M: https://dbs.uni-leipzig.de/research/projects/object_matching/benchmark_datasets_for_entity_resolution, described in the paper "Evaluation of entity resolution approaches on real-world match problems" by Köpcke et al.; Spotify Charts: https://www.kaggle.com/datasets/dhruvildave/spotify-charts; Steam App Data: https://www.kaggle.com/vicentearce/steam-and-steam-spy-raw-datasets.

References

Almeida J, da Torres RS, Leite NJ (2010) Bp-tree: an efficient index for similarity search in high-dimensional metric spaces. In: Proceedings of the 19th ACM international conference on information and knowledge management, pp 1365–1368
Arasu A, Ganti V, Kaushik R (2006) Efficient exact set-similarity joins. In: Proceedings of the 32nd international conference on very large data bases, pp 918–929
Aronovich L, Spiegler I (2007) Cm-tree: a dynamic clustered index for similarity search in metric databases. Data Knowl Eng 63(3):919–946
Bahri A, Zouaki H, Thami ROH (2016) Blbtree: an efficient index structure for fast search. Int Rev Comput Softw (IRECOS) 11(10)
Broder AZ, Glassman SC, Manasse MS, Zweig G (1997) Syntactic clustering of the web. Comput Netw ISDN Syst 29(8–13):1157–1166
Christiani T, Pagh R (2017) Set similarity search beyond minhash. In: Proceedings of the 49th annual ACM SIGACT symposium on theory of computing, pp 1094–1107
Christiani T, Pagh R, Sivertsen J (2018) Scalable and robust set similarity join. In: 2018 IEEE 34th international conference on data engineering (ICDE), pp 1240–1243. IEEE
Christophides V, Efthymiou V, Palpanas T, Papadakis G, Stefanidis K (2020) An overview of end-to-end entity resolution for big data. ACM Comput Surv (CSUR) 53(6):1–42
Ciaccia P, Patella M, Zezula P (1997) M-tree: an efficient access method for similarity search in metric spaces. VLDB 97:426–435
Ferchichi A, Gouider MS (2014) Bstree: an incremental indexing structure for similarity search and real time monitoring of data streams. In: Future Information Technology, pp 185–190. Springer
Jia L, Zhang L, Yu G, You J, Ding J, Li M (2018) A survey on set similarity search and join. Int J Perform Eng 14(2):245
Köpcke H, Thor A, Rahm E (2010) Evaluation of entity resolution approaches on real-world match problems. Proc VLDB Endow 3(1–2):484–493
Kuo FY, Sloan IH (2005) Lifting the curse of dimensionality. Not AMS 52(11):1320–1328
Kurita T (2019) Principal component analysis (PCA). Computer vision: a reference guide, pp 1–4
Li G, He J, Deng D, Li J (2015) Efficient similarity join and search on multi-attribute data. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data, pp 1137–1151
Mann W, Augsten N, Bouros P (2016) An empirical evaluation of set similarity join techniques. Proc VLDB Endow 9(9):636–647
Ortona S, Orsi G, Buoncristiano M, Furche T (2015) Wadar: joint wrapper and data repair. Proc VLDB Endow 8(12):1996–1999
Ribeiro LA, Borges FF, do Carmo ODJ (2020) A framework for set similarity join on multi-attribute data. In: SBBD, pp 61–72
Skopal T, Lokoč J (2008) Nm-tree: flexible approximate similarity search in metric and non-metric spaces. In: International conference on database and expert systems applications, pp 312–325. Springer
Solorio-Fernández S, Carrasco-Ochoa JA, Martínez-Trinidad JF (2018) Ranking based unsupervised feature selection methods: an empirical comparative study in high dimensional datasets. In: Mexican international conference on artificial intelligence, pp 205–218. Springer
Sebastian VS, José L, Alberto C (2016) BoD-books on demand, big data on real-world applications
Wang Y, Qin J, Wang W (2017) Efficient approximate entity matching using Jaro–Winkler distance. In: International conference on web information systems engineering, pp 231–239. Springer
Yu M, Li G, Deng D, Feng J (2016) String similarity search and join: a survey. Front Comp Sci 10(3):399–417
Zhang Z, Hadjieleftheriou M, Ooi BC, Srivastava D (2010) Bed-tree: an all-purpose index structure for string similarity search based on edit distance. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, pp 915–926

Acknowledgements

This paper has been supported by the following Brazilian research agency: CNPq (Conselho Nacional de Desenvolvimento Científico e Tecnológico).

Funding

No funding was involved in the production of this paper.

Author information

Authors and Affiliations

Universidade Federal do Agreste de Pernambuco, Avenida Bom Pastor, Garanhuns, Pernambuco, 55292-270, Brazil
Vítor Bezerra Silva & Dimas Cassimiro Nascimento

Contributions

Vítor Bezerra Silva wrote and reviewed the paper, conceived and designed the analysis, collected the data, and performed the experimental analysis.

Dimas Cassimiro Nascimento wrote and reviewed the paper, and conceived and designed the analysis.

Corresponding author

Correspondence to Vítor Bezerra Silva.

Ethics declarations

Conflict of interest

All authors confirmed that there are no conflicts of interest.

Ethical approval and consent to participate

This work did not require any ethical approval. All datasets employed in this work are publicly available.

Consent for publication

All authors have explicitly agreed to submit and publish this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Silva, V.B., Nascimento, D.C. Enhancing Multi-Attribute Similarity Join using Reduced and Adaptive Index Trees. Knowl Inf Syst (2024). https://doi.org/10.1007/s10115-024-02089-4

Received: 28 August 2023
Revised: 11 February 2024
Accepted: 28 February 2024
Published: 09 April 2024
DOI: https://doi.org/10.1007/s10115-024-02089-4
