Abstract
In the field of database deduplication, the goal is to find approximately matching records within a database. Blocking is a typical stage in this process that involves cheaply finding candidate pairs of records that are potential matches for further processing. We present here Hashed Dynamic Blocking, a new approach to blocking designed to address datasets larger than those studied in most prior work. Hashed Dynamic Blocking (HDB) extends Dynamic Blocking, which leverages the insight that rare matching values and rare intersections of values are predictive of a matching relationship. We also present a novel use of Locality Sensitive Hashing (LSH) to build blocking key values for huge databases with a convenient configuration to control the trade-off between precision and recall. HDB achieves massive scale by minimizing data movement, using compact block representation, and greedily pruning ineffective candidate blocks using a Count-min Sketch approximate counting data structure. We benchmark the algorithm by focusing on real-world datasets in excess of one million rows, demonstrating that the algorithm displays linear time complexity scaling in this range. Furthermore, we execute HDB on a 530 million row industrial dataset, detecting 68 billion candidate pairs in less than three hours at a cost of $307 on a major cloud service.
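To make the pruning idea concrete, here is a minimal sketch of a Count-Min Sketch used to flag blocking keys whose blocks are too large to be useful. It is an illustration of the data structure only, not the authors' implementation: the abstract does not say how HDB applies the sketch, and the names `CountMinSketch`, `oversized_keys`, and `max_block_size`, as well as the width and depth defaults, are assumptions made for this example.

```python
import hashlib
from typing import Dict, List, Set


class CountMinSketch:
    """Approximate counter in fixed memory: it may overcount, never undercounts."""

    def __init__(self, width: int = 2048, depth: int = 4) -> None:
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, key: str, row: int) -> int:
        # One salted hash per row stands in for `depth` independent hash functions.
        digest = hashlib.blake2b(
            key.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, key: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(key, row)] += count

    def estimate(self, key: str) -> int:
        # The minimum over rows is the estimate least inflated by collisions.
        return min(
            self.table[row][self._index(key, row)] for row in range(self.depth)
        )


def oversized_keys(
    record_keys: Dict[str, List[str]], max_block_size: int
) -> Set[str]:
    """Return blocking keys whose estimated block size exceeds max_block_size.

    `record_keys` maps a record id to its blocking keys (hypothetical layout).
    """
    sketch = CountMinSketch()
    for keys in record_keys.values():
        for key in set(keys):  # count each key once per record
            sketch.add(key)
    return {
        key
        for keys in record_keys.values()
        for key in keys
        if sketch.estimate(key) > max_block_size
    }


# Toy data: a common value shared by five records, a rare one shared by two.
records = {
    "r1": ["city=NYC", "name=smith|zip=10001"],
    "r2": ["city=NYC", "name=smith|zip=10001"],
    "r3": ["city=NYC"],
    "r4": ["city=NYC"],
    "r5": ["city=NYC"],
}
too_big = oversized_keys(records, max_block_size=3)
```

Because the sketch never undercounts, any key whose true block size exceeds the threshold is always flagged. The cost of the fixed memory footprint is that a small block can occasionally be flagged as well when its counters collide with those of a frequent key; a wider or deeper table makes that less likely.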
A. Borthwick and S. Ash contributed equally.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Borthwick, A., Ash, S., Pang, B., Qureshi, S., Jones, T. (2020). Scalable Blocking for Very Large Databases. In: Koprinska, I., et al. ECML PKDD 2020 Workshops. ECML PKDD 2020. Communications in Computer and Information Science, vol 1323. Springer, Cham. https://doi.org/10.1007/978-3-030-65965-3_20
DOI: https://doi.org/10.1007/978-3-030-65965-3_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-65964-6
Online ISBN: 978-3-030-65965-3
eBook Packages: Computer Science, Computer Science (R0)