Enhancing Multi-Attribute Similarity Join using Reduced and Adaptive Index Trees

  • Regular Paper
Knowledge and Information Systems

Abstract

Multi-Attribute Similarity Join is an important task for a variety of applications. Because of the large volume of data involved, several techniques have been proposed to avoid superfluous comparisons between entities. One of these techniques is the Index Tree. In this work, we propose an adaptive version (Adaptive Index Tree) of the state-of-the-art Index Tree for multi-attribute data. Our method selects the best filter configuration to construct the Adaptive Index Tree. We also propose a reduced version of the Index Trees, aiming to improve the trade-off between efficacy and efficiency for the Similarity Join task. Finally, we propose Filter and Feature selectors designed for the Similarity Join task. To evaluate the impact of the proposed approaches, we perform an experimental analysis on five real-world datasets. Based on the experiments, we conclude that our reduced approaches produce superior results compared to the state-of-the-art approach, especially on datasets with a significant number of attributes and/or considerable attribute sizes.
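
To make the idea of avoiding superfluous comparisons concrete, the following is a minimal sketch of a filter-then-verify similarity join over multi-attribute records. It is not the Index Tree or the Adaptive Index Tree described above. It assumes Jaccard similarity over word tokens and uses the standard length (size) filter for set similarity, and the names `tokens`, `passes_length_filter`, and `similarity_join`, as well as the sample records and thresholds, are hypothetical.

```python
from itertools import combinations


def tokens(value: str) -> frozenset:
    """Split an attribute value into a set of lowercase word tokens."""
    return frozenset(value.lower().split())


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity of two token sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def passes_length_filter(a: frozenset, b: frozenset, threshold: float) -> bool:
    """Cheap necessary condition, since J(a, b) <= min(|a|, |b|) / max(|a|, |b|)."""
    small, large = sorted((len(a), len(b)))
    if large == 0:
        return True
    return small >= threshold * large


def similarity_join(records, thresholds):
    """Return index pairs whose every attribute meets its Jaccard threshold."""
    tokenized = [[tokens(v) for v in rec] for rec in records]
    matches = []
    for i, j in combinations(range(len(records)), 2):
        pairs = list(zip(tokenized[i], tokenized[j], thresholds))
        # Filter step: discard the pair if any attribute fails the size bound.
        if not all(passes_length_filter(a, b, t) for a, b, t in pairs):
            continue
        # Verification step: compute the exact similarity per attribute.
        if all(jaccard(a, b) >= t for a, b, t in pairs):
            matches.append((i, j))
    return matches


# Hypothetical two-attribute records: (title, artist).
records = [
    ("Blue Train", "John Coltrane"),
    ("Blue Train Remastered", "John Coltrane"),
    ("Kind of Blue", "Miles Davis"),
]
print(similarity_join(records, thresholds=(0.6, 0.9)))
```

The sketch still enumerates every candidate pair and only skips the exact similarity computation when the filter fails. An index structure such as the one discussed in the abstract aims to avoid generating most of those pairs in the first place.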

Data availability

The datasets employed in our experiments are all available in public repositories: Google Playstore: https://www.kaggle.com/datasets/gauthamp10/google-playstore-apps; Music Brainz 20 M and North Carolina Voter 10 M: https://dbs.uni-leipzig.de/research/projects/object_matching/benchmark_datasets_for_entity_resolution, as described in the paper Evaluation of entity resolution approaches on real-world match problems, by Köpcke et al.; Spotify Charts: https://www.kaggle.com/datasets/dhruvildave/spotify-charts; Steam App Data: https://www.kaggle.com/vicentearce/steam-and-steam-spy-raw-datasets.

Notes

  1. https://www.kaggle.com/datasets/gauthamp10/google-playstore-apps.

  2. https://dbs.uni-leipzig.de/research/projects/object_matching/benchmark_datasets_for_entity_resolution.

  3. https://www.kaggle.com/datasets/dhruvildave/spotify-charts.

  4. https://www.kaggle.com/vicentearce/steam-and-steam-spy-raw-datasets.

  5. https://github.com/VitorAlan/Index-Tree.

References

  1. Almeida J, da Torres RS, Leite NJ (2010) Bp-tree: an efficient index for similarity search in high-dimensional metric spaces. In: Proceedings of the 19th ACM international conference on information and knowledge management, pp 1365–1368

  2. Arasu A, Ganti V, Kaushik R (2006) Efficient exact set-similarity joins. In: Proceedings of the 32nd international conference on very large data bases, pp 918–929

  3. Aronovich L, Spiegler I (2007) Cm-tree: a dynamic clustered index for similarity search in metric databases. Data Knowl Eng 63(3):919–946


  4. Bahri A, Zouaki H, Thami ROH (2016) Blbtree: an efficient index structure for fast search. Int Rev Comput Softw (IRECOS), 11(10)

  5. Broder AZ, Glassman SC, Manasse MS, Zweig G (1997) Syntactic clustering of the web. Comput Netw ISDN Syst 29(8–13):1157–1166


  6. Christiani T, Pagh R (2017) Set similarity search beyond minhash. In: Proceedings of the 49th annual ACM SIGACT symposium on theory of computing, pp 1094–1107

  7. Christiani T, Pagh R, Sivertsen J (2018) Scalable and robust set similarity join. In: 2018 IEEE 34th international conference on data engineering (ICDE), pp 1240–1243. IEEE

  8. Christophides V, Efthymiou V, Palpanas T, Papadakis G, Stefanidis K (2020) An overview of end-to-end entity resolution for big data. ACM Comput Surv (CSUR) 53(6):1–42


  9. Ciaccia P, Patella M, Zezula P (1997) M-tree: an efficient access method for similarity search in metric spaces. VLDB 97:426–435

  10. Ferchichi A, Gouider MS (2014) Bstree: an incremental indexing structure for similarity search and real time monitoring of data streams. In: Future Information Technology, pp 185–190. Springer

  11. Jia L, Zhang L, Yu G, You J, Ding J, Li M (2018) A survey on set similarity search and join. Int J Perform Eng 14(2):245

  12. Köpcke H, Thor A, Rahm E (2010) Evaluation of entity resolution approaches on real-world match problems. Proc VLDB Endow 3(1–2):484–493


  13. Kuo FY, Sloan IH (2005) Lifting the curse of dimensionality. Not AMS 52(11):1320–1328


  14. Kurita T (2019) Principal component analysis (PCA). Computer vision: a reference guide, pp 1–4

  15. Li G, He J, Deng D, Li J (2015) Efficient similarity join and search on multi-attribute data. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data, pp 1137–1151

  16. Mann W, Augsten N, Bouros P (2016) An empirical evaluation of set similarity join techniques. Proc VLDB Endow 9(9):636–647


  17. Ortona S, Orsi G, Buoncristiano M, Furche T (2015) Wadar: joint wrapper and data repair. Proc VLDB Endow 8(12):1996–1999


  18. Ribeiro LA, Borges FF, do Carmo ODJ (2020) A framework for set similarity join on multi-attribute data. In: SBBD, pp 61–72

  19. Skopal T, Lokoč J (2008) Nm-tree: flexible approximate similarity search in metric and non-metric spaces. In: International conference on database and expert systems applications, pp 312–325. Springer

  20. Solorio-Fernández S, Carrasco-Ochoa JA, Martínez-Trinidad JF (2018) Ranking based unsupervised feature selection methods: an empirical comparative study in high dimensional datasets. In: Mexican international conference on artificial intelligence, pp 205–218. Springer

  21. Sebastian VS, José L, Alberto C (2016) BoD-books on demand, big data on real-world applications

  22. Wang Y, Qin J, Wang W (2017) Efficient approximate entity matching using Jaro–Winkler distance. In: International conference on web information systems engineering, pp 231–239. Springer

  23. Yu M, Li G, Deng D, Feng J (2016) String similarity search and join: a survey. Front Comp Sci 10(3):399–417

  24. Zhang Z, Hadjieleftheriou M, Ooi BC, Srivastava D (2010) Bed-tree: an all-purpose index structure for string similarity search based on edit distance. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, pp 915–926

Acknowledgements

This paper has been supported by the following Brazilian research agency: CNPq (Conselho Nacional de Desenvolvimento Científico e Tecnológico).

Funding

No funding was involved in the production of this paper.

Author information

Contributions

Vítor Bezerra Silva wrote and reviewed the paper, performed the experimental analysis, collected the data, and conceived and designed the analysis.

Dimas Cassimiro Nascimento wrote and reviewed the paper, and conceived and designed the analysis.

Corresponding author

Correspondence to Vítor Bezerra Silva.

Ethics declarations

Conflict of interest

All authors confirmed that there are no conflicts of interest.

Ethical approval and consent to participate

This work did not require any ethical approval. All datasets employed in this work are publicly available.

Consent for publication

All authors have explicitly agreed to submit and publish this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Silva, V.B., Nascimento, D.C. Enhancing Multi-Attribute Similarity Join using Reduced and Adaptive Index Trees. Knowl Inf Syst (2024). https://doi.org/10.1007/s10115-024-02089-4

  • DOI: https://doi.org/10.1007/s10115-024-02089-4
