Abstract
Multi-Attribute Similarity Join is an important task in a variety of applications. Because of the large volumes of data involved, several techniques have been proposed to avoid superfluous comparisons between entities. One of these techniques is the Index Tree. In this work, we propose an adaptive version (the Adaptive Index Tree) of the state-of-the-art Index Tree for multi-attribute data. Our method selects the best filter configuration with which to construct the Adaptive Index Tree. We also propose a reduced version of the Index Trees, aiming to improve the trade-off between efficacy and efficiency in the Similarity Join task. Finally, we propose Filter and Feature selectors designed for the Similarity Join task. To evaluate the impact of the proposed approaches, we perform an experimental analysis on five real-world datasets. Based on the experiments, we conclude that our reduced approaches produce superior results compared with the state-of-the-art approach, especially on datasets with a significant number of attributes and/or large attribute sizes.
Data availability
The datasets employed in our experiments are all available in public repositories: Google Playstore: https://www.kaggle.com/datasets/gauthamp10/google-playstore-apps; Music Brainz 20 M and North Carolina Voter 10 M: https://dbs.uni-leipzig.de/research/projects/object_matching/benchmark_datasets_for_entity_resolution, with reference to the paper "Evaluation of entity resolution approaches on real-world match problems" by Köpcke et al.; Spotify Charts: https://www.kaggle.com/datasets/dhruvildave/spotify-charts; Steam App Data: https://www.kaggle.com/vicentearce/steam-and-steam-spy-raw-datasets.
References
Almeida J, da Torres RS, Leite NJ (2010) Bp-tree: an efficient index for similarity search in high-dimensional metric spaces. In: Proceedings of the 19th ACM international conference on information and knowledge management, pp 1365–1368
Arasu A, Ganti V, Kaushik R (2006) Efficient exact set-similarity joins. In: Proceedings of the 32nd international conference on very large data bases, pp 918–929
Aronovich L, Spiegler I (2007) Cm-tree: a dynamic clustered index for similarity search in metric databases. Data Knowl Eng 63(3):919–946
Bahri A, Zouaki H, Thami ROH (2016) Blbtree: an efficient index structure for fast search. Int Rev Comput Softw (IRECOS) 11(10)
Broder AZ, Glassman SC, Manasse MS, Zweig G (1997) Syntactic clustering of the web. Comput Netw ISDN Syst 29(8–13):1157–1166
Christiani T, Pagh R (2017) Set similarity search beyond minhash. In: Proceedings of the 49th annual ACM SIGACT symposium on theory of computing, pp 1094–1107
Christiani T, Pagh R, Sivertsen J (2018) Scalable and robust set similarity join. In: 2018 IEEE 34th international conference on data engineering (ICDE), pp 1240–1243. IEEE
Christophides V, Efthymiou V, Palpanas T, Papadakis G, Stefanidis K (2020) An overview of end-to-end entity resolution for big data. ACM Comput Surv (CSUR) 53(6):1–42
Ciaccia P, Patella M, Zezula P (1997) M-tree: an efficient access method for similarity search in metric spaces. VLDB 97:426–435
Ferchichi A, Gouider MS (2014) Bstree: an incremental indexing structure for similarity search and real time monitoring of data streams. In: Future Information Technology, pp 185–190. Springer
Jia L, Zhang L, Yu G, You J, Ding J, Li M (2018) A survey on set similarity search and join. Int J Perform Eng 14(2):245
Köpcke H, Thor A, Rahm E (2010) Evaluation of entity resolution approaches on real-world match problems. Proc VLDB Endow 3(1–2):484–493
Kuo FY, Sloan IH (2005) Lifting the curse of dimensionality. Not AMS 52(11):1320–1328
Kurita T (2019) Principal component analysis (PCA). Computer vision: a reference guide, pp 1–4
Li G, He J, Deng D, Li J (2015) Efficient similarity join and search on multi-attribute data. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data, pp 1137–1151
Mann W, Augsten N, Bouros P (2016) An empirical evaluation of set similarity join techniques. Proc VLDB Endow 9(9):636–647
Ortona S, Orsi G, Buoncristiano M, Furche T (2015) Wadar: joint wrapper and data repair. Proc VLDB Endow 8(12):1996–1999
Ribeiro LA, Borges FF, do Carmo ODJ (2020) A framework for set similarity join on multi-attribute data. In: SBBD, pp 61–72
Skopal T, Lokoč J (2008) Nm-tree: flexible approximate similarity search in metric and non-metric spaces. In: International conference on database and expert systems applications, pp 312–325. Springer
Solorio-Fernández S, Carrasco-Ochoa JA, Martínez-Trinidad JF (2018) Ranking based unsupervised feature selection methods: an empirical comparative study in high dimensional datasets. In: Mexican international conference on artificial intelligence, pp 205–218. Springer
Sebastian VS, José L, Alberto C (2016) Big data on real-world applications. BoD–Books on Demand
Wang Y, Qin J, Wang W (2017) Efficient approximate entity matching using Jaro–Winkler distance. In: International conference on web information systems engineering, pp 231–239. Springer
Yu M, Li G, Deng D, Feng J (2016) String similarity search and join: a survey. Front Comp Sci 10(3):399–417
Zhang Z, Hadjieleftheriou M, Ooi BC, Srivastava D (2010) Bed-tree: an all-purpose index structure for string similarity search based on edit distance. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, pp 915–926
Acknowledgements
This paper has been supported by the following Brazilian research agency: CNPq (Conselho Nacional de Desenvolvimento Científico e Tecnológico).
Funding
No funding was involved in the production of this paper.
Author information
Contributions
Vítor Bezerra Silva conceived and designed the analysis, collected the data, performed the experimental analysis, and wrote and reviewed the paper.
Dimas Cassimiro Nascimento conceived and designed the analysis, and wrote and reviewed the paper.
Ethics declarations
Conflict of interest
All authors confirmed that there are no conflicts of interest.
Ethical approval and consent to participate
This work did not require any ethical approval. All datasets employed in this work are publicly available.
Consent for publication
All authors have explicitly agreed to submit and publish this paper.
About this article
Cite this article
Silva, V.B., Nascimento, D.C. Enhancing Multi-Attribute Similarity Join using Reduced and Adaptive Index Trees. Knowl Inf Syst (2024). https://doi.org/10.1007/s10115-024-02089-4
DOI: https://doi.org/10.1007/s10115-024-02089-4