Scalable Similarity Joins for Fast and Accurate Record Deduplication in Big Data

Rozinek, Ondrej; Borkovcova, Monika; Mares, Jan

doi:10.1007/978-3-031-60328-0_18

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 990)

Included in the following conference series:

World Conference on Information Systems and Technologies

Abstract

Record linkage is the process of matching records from multiple data sources that refer to the same entities. When applied to a single data source, this process is known as deduplication. With the increasing size of data sources, recently referred to as big data, the complexity of the matching process has become one of the major challenges for record linkage and deduplication. In recent decades, several blocking, indexing and filtering techniques have been developed. Their purpose is to reduce the number of record pairs to be compared by removing obvious non-matching pairs in the deduplication process, while maintaining a high quality of matching. Current algorithms and traditional techniques are not efficient, because they rely on methods that still lose a significant proportion of true matches when removing comparison pairs. This paper proposes more efficient algorithms for removing non-matching pairs, with an explicitly proven mathematical lower bound on a recently used state-of-the-art approximate string matching method, Fuzzy Jaccard Similarity. The algorithm is also much more efficient in classification, which uses density-based spatial clustering of applications with noise (DBSCAN) and runs in log-linear time complexity \(\mathcal{O}(|\mathcal{E}|\log(|\mathcal{E}|))\).
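
To make the idea of filtering concrete, the following is a minimal sketch of a similarity join that discards non-matching pairs without losing true matches. It is not the authors' algorithm: it uses plain Jaccard similarity on character q-gram sets instead of Fuzzy Jaccard Similarity, and the well-known size bound Jaccard(A, B) ≤ min(|A|, |B|) / max(|A|, |B|) instead of the bound proven in the paper. The function names, the padding character `#` and the threshold value are assumptions made for illustration only.

```python
from itertools import combinations


def qgrams(text: str, q: int = 2) -> frozenset[str]:
    """Return the set of character q-grams of a lower-cased, padded string."""
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return frozenset(padded[i:i + q] for i in range(len(padded) - q + 1))


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Plain Jaccard similarity |A ∩ B| / |A ∪ B| (1.0 for two empty sets)."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def similarity_join(records: list[str], threshold: float, q: int = 2):
    """Return (matches, skipped) for a self-join on Jaccard >= threshold.

    Size filter (illustrative, not the paper's bound): since
    Jaccard(A, B) <= min(|A|, |B|) / max(|A|, |B|), a pair whose size
    ratio is below the threshold cannot match and is skipped before the
    set intersection is computed.
    """
    grams = [qgrams(r, q) for r in records]
    matches, skipped = [], 0
    for i, j in combinations(range(len(records)), 2):
        small, large = sorted((len(grams[i]), len(grams[j])))
        if large and small / large < threshold:
            skipped += 1  # pruned by the size filter, no true match lost
            continue
        sim = jaccard(grams[i], grams[j])
        if sim >= threshold:
            matches.append((i, j, sim))
    return matches, skipped


if __name__ == "__main__":
    names = ["John Smith", "Jon Smith", "Alice"]
    found, pruned = similarity_join(names, threshold=0.7)
    print(found, pruned)
```

The sketch still enumerates all pairs, so it only illustrates why a provable bound makes pruning safe. A scalable join would combine such a bound with an index so that pruned pairs are never generated.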

Acknowledgment

This work was supported by SGS FEI UPCE 2024 and by the Erasmus+ project number 2022-1-SK01-KA220-HED-000089149, titled Including EVERyone in GREEN Data Analysis (EVERGREEN) and funded by the European Union. Views and opinions expressed are, however, those of the author(s) only and do not necessarily reflect those of the European Union or the Slovak Academic Association for International Cooperation (SAAIC). Neither the European Union nor SAAIC can be held responsible for them.

Author information

Authors and Affiliations

Department of Process Control, University of Pardubice, Studentska 95, 532 10, Pardubice, Czech Republic
Ondrej Rozinek & Jan Mares
Department of Information Technology, University of Pardubice, Studentska 95, 532 10, Pardubice, Czech Republic
Monika Borkovcova
Department of Mathematics, Informatics and Cybernetics, University of Chemistry and Technology Prague, Technicka 5, 166 28, Prague, Czech Republic
Jan Mares

Corresponding author

Correspondence to Ondrej Rozinek.

Editor information

Editors and Affiliations

ISEG, Universidade de Lisboa, Lisbon, Portugal
Álvaro Rocha
College of Engineering, The Ohio State University, Columbus, OH, USA
Hojjat Adeli
Institute of Data Science and Digital Technologies, Vilnius University, Vilnius, Lithuania
Gintautas Dzemyda
DCT, Universidade Portucalense, Porto, Portugal
Fernando Moreira
Institute of Information Technology, Lodz University of Technology, Łódź, Poland
Aneta Poniszewska-Marańda

About this paper

Cite this paper

Rozinek, O., Borkovcova, M., Mares, J. (2024). Scalable Similarity Joins for Fast and Accurate Record Deduplication in Big Data. In: Rocha, Á., Adeli, H., Dzemyda, G., Moreira, F., Poniszewska-Marańda, A. (eds) Good Practices and New Perspectives in Information Systems and Technologies. WorldCIST 2024. Lecture Notes in Networks and Systems, vol 990. Springer, Cham. https://doi.org/10.1007/978-3-031-60328-0_18

DOI: https://doi.org/10.1007/978-3-031-60328-0_18
Published: 16 May 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-60327-3
Online ISBN: 978-3-031-60328-0
eBook Packages: Intelligent Technologies and Robotics (R0)
