Scalable Blocking for Very Large Databases

  • Conference paper
ECML PKDD 2020 Workshops (ECML PKDD 2020)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1323)

Abstract

In the field of database deduplication, the goal is to find approximately matching records within a database. Blocking is a typical stage in this process that involves cheaply finding candidate pairs of records that are potential matches for further processing. We present here Hashed Dynamic Blocking, a new approach to blocking designed to address datasets larger than those studied in most prior work. Hashed Dynamic Blocking (HDB) extends Dynamic Blocking, which leverages the insight that rare matching values and rare intersections of values are predictive of a matching relationship. We also present a novel use of Locality Sensitive Hashing (LSH) to build blocking key values for huge databases with a convenient configuration to control the trade-off between precision and recall. HDB achieves massive scale by minimizing data movement, using compact block representation, and greedily pruning ineffective candidate blocks using a Count-min Sketch approximate counting data structure. We benchmark the algorithm by focusing on real-world datasets in excess of one million rows, demonstrating that the algorithm displays linear time complexity scaling in this range. Furthermore, we execute HDB on a 530 million row industrial dataset, detecting 68 billion candidate pairs in less than three hours at a cost of $307 on a major cloud service.
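The abstract mentions pruning ineffective candidate blocks with a Count-min Sketch. The following is a minimal sketch of how such an approximate counter could flag oversized blocks; it is not the authors' implementation. The class `CountMinSketch`, the helper `oversized_keys`, the `width`/`depth` defaults, the `max_block_size` threshold, and the sample blocking keys are all hypothetical, chosen only for illustration.

```python
import hashlib


class CountMinSketch:
    """Approximate counter: estimates never undercount but may overcount."""

    def __init__(self, width=2048, depth=4):
        # width and depth are illustrative defaults, not values from the paper.
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, key):
        # Derive one hash per row by salting BLAKE2b with the row number.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                key.encode("utf-8"),
                digest_size=8,
                salt=row.to_bytes(8, "little"),
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, key, count=1):
        for row, col in self._cells(key):
            self.table[row][col] += count

    def estimate(self, key):
        # The minimum over rows is the least-inflated counter for this key.
        return min(self.table[row][col] for row, col in self._cells(key))


def oversized_keys(blocking_keys, max_block_size, sketch=None):
    """Return the blocking keys whose estimated block size exceeds the limit."""
    if sketch is None:
        sketch = CountMinSketch()
    for key in blocking_keys:
        sketch.add(key)
    return {k for k in set(blocking_keys) if sketch.estimate(k) > max_block_size}


# Hypothetical blocking keys: one key per (record, key) occurrence.
keys = ["last=smith"] * 5 + ["last=borthwick", "zip=98101", "zip=98101"]
flagged = oversized_keys(keys, max_block_size=3)
```

Because the estimate is an upper bound on the true count, a block that really is too large is always flagged. The cost is that a small block can occasionally be flagged by mistake when its counters collide with a frequent key; increasing `width` or `depth` makes that less likely.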

A. Borthwick and S. Ash—Equal Contribution.

Notes

  1. https://spark.apache.org/.

  2. https://github.com/vefthym/ParallelMetablocking.

Author information

Corresponding author

Correspondence to Stephen Ash.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Borthwick, A., Ash, S., Pang, B., Qureshi, S., Jones, T. (2020). Scalable Blocking for Very Large Databases. In: Koprinska, I., et al. ECML PKDD 2020 Workshops. ECML PKDD 2020. Communications in Computer and Information Science, vol 1323. Springer, Cham. https://doi.org/10.1007/978-3-030-65965-3_20

  • DOI: https://doi.org/10.1007/978-3-030-65965-3_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-65964-6

  • Online ISBN: 978-3-030-65965-3

  • eBook Packages: Computer Science, Computer Science (R0)
