Fast schemes for online record linkage

Abstract

Integrating large volumes of data from disparate sources in order to detect records that refer to the same entities has long been an important problem in both academia and industry. This problem becomes significantly more challenging when the integration involves a huge number of records and must be conducted in real time to meet the requirements of critical applications. In this paper, we propose two novel schemes for online record linkage, which achieve very fast response times and high levels of recall and precision. Our proposed schemes embed the records into a Bloom filter space and employ the Hamming Locality-Sensitive Hashing technique for blocking. Each Bloom filter is hashed to a number of hash tables in order to amplify the probability of formulating similar Bloom filter pairs. Our first scheme relies on the number of times a Bloom filter pair is formulated in the hash tables of the blocking mechanism. We prove that this number strongly depends on the Hamming distance of that Bloom filter pair. This correlation allows us to estimate, in real time, the Hamming distances of Bloom filter pairs without performing the comparisons. The second scheme is progressive: it achieves high recall early in the linkage process by continuously adjusting the sequence in which the hash tables are scanned, and it guarantees, with high probability, the identification of each similar Bloom filter pair. Our experimental evaluation, using four real-world data sets, shows that the proposed schemes outperform four state-of-the-art methods by achieving higher recall and precision, while remaining very efficient.
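The blocking and collision-counting idea summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes Bloom filters are stored as Python integers of m bits, that each of the L hash tables samples k bit positions uniformly with replacement (standard bit-sampling Hamming LSH), and it uses a hypothetical estimator, `estimate_distance`, that simply inverts the expected collision count L(1 - d/m)^k. The function names and parameters are illustrative assumptions.

```python
import random
from collections import defaultdict
from itertools import combinations


def build_tables(filters, m, k, L, seed=0):
    """Hash each m-bit Bloom filter (a Python int) into L hash tables.

    Assumption: each table samples k bit positions uniformly with
    replacement, so two filters at Hamming distance d share a bucket in
    one table with probability (1 - d/m)**k.
    """
    rng = random.Random(seed)
    positions = [[rng.randrange(m) for _ in range(k)] for _ in range(L)]
    tables = [defaultdict(list) for _ in range(L)]
    for rec_id, bf in filters.items():
        for pos, table in zip(positions, tables):
            # The blocking key is the tuple of sampled bits.
            key = tuple((bf >> p) & 1 for p in pos)
            table[key].append(rec_id)
    return tables


def collision_counts(tables):
    """Count in how many tables each pair of records shares a bucket."""
    counts = defaultdict(int)
    for table in tables:
        for bucket in table.values():
            for a, b in combinations(sorted(bucket), 2):
                counts[(a, b)] += 1
    return counts


def estimate_distance(c, m, k, L):
    """Hypothetical estimator: solve c = L * (1 - d/m)**k for d."""
    return m * (1 - (c / L) ** (1 / k))


if __name__ == "__main__":
    m, k, L = 64, 8, 50
    base = 0x0123456789ABCDEF
    filters = {
        "a": base,
        "b": base ^ 0b101,             # differs from "a" in 2 bits
        "c": base ^ ((1 << m) - 1),    # differs from "a" in all 64 bits
    }
    counts = collision_counts(build_tables(filters, m, k, L))
    for pair, c in sorted(counts.items()):
        print(pair, c, round(estimate_distance(c, m, k, L), 1))
```

Pairs that collide in many tables receive a small estimated distance without any direct comparison of the two filters, whereas pairs that never collide do not appear in `counts` at all.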

Notes

  1. LSH-based collision counting has also been studied in Karapiperis and Verykios (2016) under a different focus and theoretical development. Specifically, the method therein provides guarantees that a matching pair achieves the required number of collisions.

  2. A \(\lambda \)-gram is a substring generated by sliding a window of length \(\lambda \) over the characters of a string value.

  3. The Hamming distance between two Bloom filters is equal to the number of components in which these Bloom filters differ.

  4. A z-score is the number of standard deviations by which an element lies from the mean value.

  5. The value of L for P-RDS is a function of k and \(\vartheta \), as discussed in Sect. 3.2.

  6. The resolution of a bucket involves performing the distance computations of the pairs stored therein, and then classifying those pairs as matching or non-matching.

  7. The value of k should be sufficiently large; otherwise, only a small number of buckets is generated in each \(T_{l}\), and these buckets are overpopulated by Bloom filters, resulting in the formulation of mostly dissimilar pairs.

  8. http://secondstring.sourceforge.net/.

  9. http://hpi.de/naumann/projects/repeatability/datasets/cd-datasets.html.

  10. http://dl.ncsbe.gov/index.html?prefix=data/.

  11. http://dblp.uni-trier.de/xml.

  12. The Jaro-Winkler similarity between ‘TAMPA’ and ‘TEMPA’ is 0.88, while that between ‘LOS ANGELES’ and ‘LOS ANGALES’ is 0.98.

  13. LSHDB can be found at https://github.com/dimkar121/LSHDB. Test data sets have been uploaded at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/JKBULA.

  14. Using Double Metaphone encoding, ‘SMITH’ and ‘SMYTH’ are both encoded as ‘SM0’.

  15. We exclude redundant distance computations by using a Bloom filter, which implements a very fast bounded-memory buffer.

  16. We perform logical XOR operations between the Bloom filters.
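Several of the definitions in the notes above (\(\lambda \)-grams, the Hamming distance computed via XOR, and the z-score) can be made concrete with a short Python sketch. It is illustrative only: the filter length `m`, the number of hash functions `num_hashes`, the SHA-256-based hashing, and the use of the population standard deviation are all assumptions made here, not details taken from the paper.

```python
import hashlib
from statistics import mean, pstdev


def lambda_grams(value, lam=2):
    """Slide a window of length lam over the characters of value."""
    return [value[i:i + lam] for i in range(len(value) - lam + 1)]


def bloom_filter(value, m=64, num_hashes=2, lam=2):
    """Embed a string into an m-bit Bloom filter stored as a Python int.

    Assumption: each gram sets num_hashes bit positions derived from
    SHA-256 digests; the paper's actual hash functions may differ.
    """
    bf = 0
    for gram in lambda_grams(value, lam):
        for i in range(num_hashes):
            digest = hashlib.sha256(f"{i}:{gram}".encode()).digest()
            bf |= 1 << (int.from_bytes(digest[:8], "big") % m)
    return bf


def hamming(bf_a, bf_b):
    """Number of bit positions in which the two filters differ (XOR, then count)."""
    return bin(bf_a ^ bf_b).count("1")


def z_score(x, values):
    """Standard deviations between x and the mean of values (population std assumed)."""
    return (x - mean(values)) / pstdev(values)


if __name__ == "__main__":
    print(lambda_grams("TAMPA"))
    print(hamming(bloom_filter("SMITH"), bloom_filter("SMYTH")))
    print(z_score(6, [2, 4, 4, 4, 5, 5, 7, 9]))
```

Storing each filter as an integer keeps the XOR-based distance computation to a single bitwise operation followed by a bit count.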

References

  1. Altwaijry H, Kalashnikov D, Mehrotra S (2013) Query-driven approach to entity resolution. Int Conf Very Large Data Bases (PVLDB) 6:1846–1857

  2. Andoni A, Indyk P (2008) Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun ACM (CACM) 51(1):117–122

  3. Bhattacharya I, Getoor L, Licamele L (2006) Query-time entity resolution. In: International conference on knowledge discovery and data mining (KDD), pp 529–534

  4. Bilenko M, Kamath B, Mooney RJ (2006) Adaptive blocking: learning to scale up record linkage. In: International conference on data mining (ICDM), pp 87–96

  5. Christen P (2012) A survey of indexing techniques for scalable record linkage and deduplication. Trans Knowl Data Eng (TKDE) 24(9):1537–1555

  6. Christen P, Gayler R, Hawking D (2009) Similarity-aware indexing for real-time entity resolution. In: International conference on information and knowledge management (CIKM), pp 1565–1568

  7. Cohen WW, Richman J (2002) Learning to match and cluster large high-dimensional data sets for data integration. In: International conference on knowledge discovery and data mining (SIGKDD), pp 475–480

  8. Dey D, Mookerjee V, Liu D (2011) Efficient techniques for online record linkage. Trans Knowl Data Eng (TKDE) 23(3):373–387

  9. Elmagarmid A, Ipeirotis P, Verykios V (2007) Duplicate record detection: a survey. Trans Knowl Data Eng (TKDE) 19(1):1–16

  10. Firmani D, Saha B, Srivastava D (2016) Online entity resolution using an oracle. Int Conf Very Large Data Bases (PVLDB) 9:384–395

  11. Hernandez MA, Stolfo SJ (1995) The merge/purge problem for large databases. In: International conference on management of data (SIGMOD), pp 127–138

  12. Ioannou E, Nejdl W, Niederee C, Velegrakis Y (2010) On-the-fly entity-aware query processing in the presence of linkage. Int Conf Very Large Data Bases (PVLDB) 3(1):429–438

  13. Karapiperis D, Verykios VS (2015) An LSH-based blocking approach with a homomorphic matching technique for privacy-preserving record linkage. Trans Knowl Data Eng (TKDE) 27(4):909–921

  14. Karapiperis D, Verykios VS (2016) A fast and efficient Hamming LSH-based scheme for accurate linkage. Knowl Inf Syst (KAIS) 49(3):861–884

  15. Karapiperis D, Gkoulalas-Divanis A, Verykios VS (2016a) LSHDB: a parallel and distributed engine for record linkage and similarity search. In: International conference on data mining (ICDM) demos, pp 1–4

  16. Karapiperis D, Vatsalan D, Verykios VS, Christen P (2016b) Efficient record linkage using a compact Hamming space. In: International conference on extending database technology (EDBT), pp 209–220

  17. Kim H, Lee D (2010) Fast iterative hashed record linkage for large-scale data collections. In: International conference on extending database technology (EDBT), pp 525–536

  18. Papenbrock T, Heise A, Naumann F (2015) Progressive duplicate detection. Trans Knowl Data Eng (TKDE) 27(5):1316–1329

  19. Schnell R, Bachteler T, Reiher J (2009) Privacy-preserving record linkage using Bloom filters. Med Inform Decis Mak (BMC) 9:41

  20. Shrivastava A, Li P (2014) Improved densification of one permutation hashing. In: International conference on uncertainty in artificial intelligence (UAI), pp 732–741

  21. Steorts R, Ventura S, Sadinle M, Fienberg S (2014) A comparison of blocking methods for record linkage. In: Privacy in statistical databases (PSD), pp 253–268

  22. Whang SE, Menestrina D, Koutrika G, Theobald M, Garcia-Molina H (2009) Entity resolution with iterative blocking. In: International conference on management of data (SIGMOD), pp 219–232

  23. Whang SE, Marmaros D, Garcia-Molina H (2013) Pay-as-you-go entity resolution. Trans Knowl Data Eng (TKDE) 25(5):1111–1124

Author information

Corresponding author

Correspondence to Dimitrios Karapiperis.

Additional information

Responsible editor: Kurt Driessens, Dragi Kocev, Marko Robnik-Šikonja, Myra Spiliopoulou.

About this article

Cite this article

Karapiperis, D., Gkoulalas-Divanis, A. & Verykios, V.S. Fast schemes for online record linkage. Data Min Knowl Disc 32, 1229–1250 (2018). https://doi.org/10.1007/s10618-018-0563-0

Keywords

  • Record linkage
  • Efficiency
  • Locality-sensitive hashing