Efficient Sequential and Parallel Algorithms for Incremental Record Linkage

  • Conference paper
Computational Advances in Bio and Medical Sciences (ICCABS 2019)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 12029)

Abstract

Given a collection of records, the problem of record linkage is to cluster them such that each cluster contains all the records of one and only one individual. Existing algorithms for this important problem have large run times, especially when the number of records is large. Often, a small number of new records have to be linked with a large number of existing records, and linking the old and new records together might call for large run times. We refer to any algorithm that efficiently links the new records with the existing ones as an incremental record linkage (IRL) algorithm, and in this paper we offer novel IRL algorithms. Clustering is the basic approach we employ. Our algorithms use a novel random sampling technique to compute the distance between a new record and any cluster, and they associate the new record with the cluster to which it has the least distance. The idea is to compute the distance between the new record and only a random subset of the cluster's records. A sampling lemma can be used to show that this computation is very accurate. We have developed both sequential and parallel implementations of our algorithms. They outperform the best-known prior algorithm (called RLA). For example, one of our algorithms takes 71.22 s to link 100,000 records with a database of 1,000,000 records. In comparison, the current best algorithm takes 140.91 s to link 1,100,000 records. We achieve a very nearly linear speedup in parallel; for instance, we obtain a speedup of 28.28 with 32 cores. To the best of our knowledge, we are the first to propose parallel IRL algorithms. Our algorithms offer state-of-the-art solutions to the IRL problem.
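
The sampling idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several details are assumptions made only for illustration: records are plain strings, the record-to-record distance is the Levenshtein edit distance, the record-to-cluster distance is the average over the sampled records, and a hypothetical `threshold` parameter opens a new cluster when no existing cluster is close enough. The names `sampled_distance`, `link_record`, `sample_size`, and `threshold` are likewise hypothetical.

```python
import random
from typing import Dict, List, Optional


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def sampled_distance(record: str, cluster: List[str], sample_size: int,
                     rng: random.Random) -> float:
    """Average distance from `record` to a random subset of a non-empty cluster.

    Averaging is an assumption; the abstract does not say how the sampled
    distances are combined.
    """
    k = min(sample_size, len(cluster))
    sample = rng.sample(cluster, k)
    return sum(edit_distance(record, r) for r in sample) / k


def link_record(record: str, clusters: Dict[int, List[str]],
                sample_size: int = 3, threshold: float = 4.0,
                rng: Optional[random.Random] = None) -> int:
    """Attach `record` to the cluster with the least sampled distance.

    If every cluster is farther than `threshold` (an assumed rule, not taken
    from the abstract), the record starts a new cluster. Returns the cluster id.
    """
    rng = rng or random.Random()
    best_id, best_dist = None, float("inf")
    for cid, members in clusters.items():
        if not members:
            continue  # nothing to sample from
        d = sampled_distance(record, members, sample_size, rng)
        if d < best_dist:
            best_id, best_dist = cid, d
    if best_id is None or best_dist > threshold:
        new_id = max(clusters, default=-1) + 1
        clusters[new_id] = [record]
        return new_id
    clusters[best_id].append(record)
    return best_id


if __name__ == "__main__":
    existing = {
        0: ["john smith 1980", "jon smith 1980", "john smyth 1980"],
        1: ["maria garcia 1975", "maria garcia 1957"],
    }
    for new_record in ["john smith 1908", "zhang wei 1990"]:
        cid = link_record(new_record, existing, sample_size=2,
                          rng=random.Random(0))
        print(new_record, "-> cluster", cid)
```

Under these assumptions, each new record is compared with at most `sample_size` records per cluster rather than with every record in the cluster, which is where the saving over a full comparison would come from. The per-cluster computations are independent of one another, so the loop over clusters is a natural candidate for parallel execution.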

Acknowledgment

This work has been supported in part by the following NSF grants: 1447711, 1514357, 1743418, and 1843025. Also, this project was supported by King Saud University, Deanship of Scientific Research, Community College Research Unit. Data for this study were obtained from the Connecticut Department of Public Health. The Connecticut Department of Public Health does not endorse or assume any responsibility for any analyses, interpretations or conclusions based on the data. The Human Investigations Committee of the Department of Public Health approved this study.

Author information

Corresponding author

Correspondence to Sanguthevar Rajasekaran.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Baihan, A., Ammar, R., Aseltine, R., Baihan, M., Rajasekaran, S. (2020). Efficient Sequential and Parallel Algorithms for Incremental Record Linkage. In: Măndoiu, I., Murali, T., Narasimhan, G., Rajasekaran, S., Skums, P., Zelikovsky, A. (eds) Computational Advances in Bio and Medical Sciences. ICCABS 2019. Lecture Notes in Computer Science, vol 12029. Springer, Cham. https://doi.org/10.1007/978-3-030-46165-2_3

  • DOI: https://doi.org/10.1007/978-3-030-46165-2_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-46164-5

  • Online ISBN: 978-3-030-46165-2

  • eBook Packages: Computer Science, Computer Science (R0)
