Scalable Blocking for Very Large Databases

Borthwick, Andrew; Ash, Stephen; Pang, Bin; Qureshi, Shehzad; Jones, Timothy

doi:10.1007/978-3-030-65965-3_20

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1323)

Included in the following conference series:

Joint European Conference on Machine Learning and Knowledge Discovery in Databases

Abstract

In the field of database deduplication, the goal is to find approximately matching records within a database. Blocking is a typical stage in this process that involves cheaply finding candidate pairs of records that are potential matches for further processing. We present here Hashed Dynamic Blocking, a new approach to blocking designed to address datasets larger than those studied in most prior work. Hashed Dynamic Blocking (HDB) extends Dynamic Blocking, which leverages the insight that rare matching values and rare intersections of values are predictive of a matching relationship. We also present a novel use of Locality Sensitive Hashing (LSH) to build blocking key values for huge databases with a convenient configuration to control the trade-off between precision and recall. HDB achieves massive scale by minimizing data movement, using compact block representation, and greedily pruning ineffective candidate blocks using a Count-min Sketch approximate counting data structure. We benchmark the algorithm by focusing on real-world datasets in excess of one million rows, demonstrating that the algorithm displays linear time complexity scaling in this range. Furthermore, we execute HDB on a 530 million row industrial dataset, detecting 68 billion candidate pairs in less than three hours at a cost of $307 on a major cloud service.
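The pruning idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it only shows how a Count-min Sketch can approximate block sizes so that oversized blocking keys are dropped before candidate pairs are generated. The class `CountMinSketch`, the function `candidate_pairs`, the `max_block_size` threshold, and the toy records are all hypothetical names and values chosen for illustration, and the intersection of oversized blocks that Dynamic Blocking performs is omitted.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


class CountMinSketch:
    """Approximate counter: estimates never undercount but may overcount."""

    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, key):
        # One independent hash per row, obtained by salting with the row index.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                key.encode("utf-8"),
                digest_size=8,
                salt=row.to_bytes(8, "little"),
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, key, count=1):
        for row, col in self._cells(key):
            self.table[row][col] += count

    def estimate(self, key):
        # The minimum over rows is the tightest upper bound on the true count.
        return min(self.table[row][col] for row, col in self._cells(key))


def candidate_pairs(records, max_block_size):
    """Return candidate pairs from blocks whose estimated size is small enough.

    `records` maps a record id to its list of blocking key values.
    """
    # Pass 1: approximate the size of every block in fixed memory.
    sketch = CountMinSketch()
    for keys in records.values():
        for key in set(keys):
            sketch.add(key)

    # Pass 2: materialise only blocks estimated to be at most max_block_size.
    blocks = defaultdict(list)
    for rec_id, keys in records.items():
        for key in set(keys):
            if sketch.estimate(key) <= max_block_size:
                blocks[key].append(rec_id)

    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


# Toy data (hypothetical): the surname key is too common to be useful.
records = {
    1: ["last:smith", "zip:98101", "phone:5550101"],
    2: ["last:smith", "zip:98101", "phone:5550101"],
    3: ["last:smith", "zip:98109"],
    4: ["last:smith", "zip:98109"],
    5: ["last:smith", "zip:10001"],
}
pairs = candidate_pairs(records, max_block_size=3)
```

With this toy input the `last:smith` block covers all five records and is pruned, so only the pairs sharing a rarer key survive. Because the sketch can overcount, a block near the threshold may occasionally be pruned when an exact counter would have kept it; the trade is a small loss of recall for a fixed memory footprint.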

A. Borthwick and S. Ash—Equal Contribution.

Author information

Authors and Affiliations

AWS AI Labs, Seattle, WA, USA
Andrew Borthwick, Stephen Ash, Bin Pang, Shehzad Qureshi & Timothy Jones

Corresponding author

Correspondence to Stephen Ash.

Editor information

Editors and Affiliations

University of Sydney, Sydney, NSW, Australia
Irena Koprinska
Monash University, Clayton, VIC, Australia
Michael Kamp
University of Bari Aldo Moro, Bari, Italy
Annalisa Appice
University of Bari Aldo Moro, Bari, Italy
Corrado Loglisci
University of Guelph, Guelph, ON, Canada
Luiza Antonie
University of Caen Normandy, Caen, France
Albrecht Zimmermann
University of Pisa, Pisa, Italy
Riccardo Guidotti
Norwegian University of Science and Technology, Trondheim, Norway
Özlem Özgöbek
University of Porto, Porto, Portugal
Rita P. Ribeiro
UPC BarcelonaTech, Barcelona, Spain
Ricard Gavaldà
University of Porto, Porto, Portugal
João Gama
Fraunhofer IAIS, St. Augustin, Germany
Linara Adilova
Royal Holloway University of London, Egham, UK
Yamuna Krishnamurthy
University of Lisbon, Lisbon, Portugal
Pedro M. Ferreira
University of Bari Aldo Moro, Bari, Italy
Donato Malerba
University of Lisbon, Lisbon, Portugal
Ibéria Medeiros
University of Bari Aldo Moro, Bari, Italy
Michelangelo Ceci
ICAR-CNR, Rende, Italy
Giuseppe Manco
University of Naples Federico II, Naples, Italy
Elio Masciari
University of North Carolina, Charlotte, NC, USA
Zbigniew W. Ras
Australian National University, Canberra, ACT, Australia
Peter Christen
Leibniz University Hannover, Hannover, Germany
Eirini Ntoutsi
Technical University of Dortmund, Dortmund, Germany
Erich Schubert
University of Southern Denmark, Odense, Denmark
Arthur Zimek
University of Pisa, Pisa, Italy
Anna Monreale
Warsaw University of Technology, Warsaw, Poland
Przemyslaw Biecek
ISTI-CNR, Pisa, Italy
Salvatore Rinzivillo
Berlin Institute of Technology, Berlin, Germany
Benjamin Kille
Berlin Institute of Technology, Berlin, Germany
Andreas Lommatzsch
Norwegian University of Science and Technology, Trondheim, Norway
Jon Atle Gulla

About this paper

Cite this paper

Borthwick, A., Ash, S., Pang, B., Qureshi, S., Jones, T. (2020). Scalable Blocking for Very Large Databases. In: Koprinska, I., et al. ECML PKDD 2020 Workshops. ECML PKDD 2020. Communications in Computer and Information Science, vol 1323. Springer, Cham. https://doi.org/10.1007/978-3-030-65965-3_20

DOI: https://doi.org/10.1007/978-3-030-65965-3_20
Published: 02 February 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-65964-6
Online ISBN: 978-3-030-65965-3
eBook Packages: Computer Science, Computer Science (R0)
