Efficient Sequential and Parallel Algorithms for Incremental Record Linkage

Baihan, Abdullah; Ammar, Reda; Aseltine, Robert; Baihan, Mohammed; Rajasekaran, Sanguthevar

doi:10.1007/978-3-030-46165-2_3

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 12029)

Included in the following conference series:

International Conference on Computational Advances in Bio and Medical Sciences

Abstract

Given a collection of records, the problem of record linkage is to cluster them such that each cluster contains all the records of one and only one individual. Existing algorithms for this important problem have long run times, especially when the number of records is large. Often, a small number of new records have to be linked with a large number of existing records, and linking the old and new records together might call for long run times. We refer to any algorithm that efficiently links the new records with the existing ones as an incremental record linkage (IRL) algorithm, and in this paper we offer novel IRL algorithms. Clustering is the basic approach we employ. Our algorithms use a novel random sampling technique to compute the distance between a new record and any cluster, and they associate the new record with the cluster to which it has the least distance. The idea is to compute the distance between the new record and only a random subset of the cluster's records. A sampling lemma can be used to show that this computation is very accurate. We have developed both sequential and parallel implementations of our algorithms. They outperform the best-known prior algorithm (called RLA). For example, one of our algorithms takes 71.22 s to link 100,000 new records with a database of 1,000,000 records. In comparison, the current best algorithm takes 140.91 s to link all 1,100,000 records. We achieve a very nearly linear speedup in parallel; for example, we obtain a speedup of 28.28 with 32 cores. To the best of our knowledge, we are the first to propose parallel IRL algorithms. Our algorithms offer state-of-the-art solutions to the IRL problem.
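
The sampling idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the distance function, the sample size, or how sampled distances are aggregated, so the sketch assumes Levenshtein edit distance on string records, a fixed sample size, and the mean over the sample. The names `edit_distance`, `sampled_cluster_distance`, and `link_new_record` are hypothetical.

```python
import random


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (an assumed metric) via a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def sampled_cluster_distance(record, cluster, sample_size, rng):
    """Estimate the record-to-cluster distance from a random subset only.

    Using the mean over the sample is an assumption of this sketch.
    """
    k = min(sample_size, len(cluster))
    sample = rng.sample(cluster, k)
    return sum(edit_distance(record, other) for other in sample) / k


def link_new_record(record, clusters, sample_size=3, rng=None):
    """Attach the record to the cluster with the least estimated distance.

    Returns the index of the chosen cluster and appends the record to it.
    """
    rng = rng or random.Random()
    distances = [sampled_cluster_distance(record, c, sample_size, rng)
                 for c in clusters]
    best = min(range(len(clusters)), key=distances.__getitem__)
    clusters[best].append(record)
    return best
```

With this shape, the cost of placing one new record depends on the number of clusters and the sample size rather than on the total number of existing records, which is the intuition behind comparing against only a random subset of each cluster.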

Acknowledgment

This work has been supported in part by the following NSF grants: 1447711, 1514357, 1743418, and 1843025. Also, this project was supported by King Saud University, Deanship of Scientific Research, Community College Research Unit. Data for this study were obtained from the Connecticut Department of Public Health. The Connecticut Department of Public Health does not endorse or assume any responsibility for any analyses, interpretations or conclusions based on the data. The Human Investigations Committee of the Department of Public Health approved this study.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, University of Connecticut, 252 ITEB, 371 Fairfield Way, UConn, Storrs, CT, 06269-4155, USA
Abdullah Baihan, Reda Ammar & Sanguthevar Rajasekaran
Division of Behavioral Sciences and Community Health and Center for Population Health, UConn Health, 263 Farmington Ave, MC 3910, Farmington, CT, 06030-3910, USA
Robert Aseltine
Department of Computer Science, Community College, King Saud University, Riyadh, Saudi Arabia
Abdullah Baihan & Mohammed Baihan

Corresponding author

Correspondence to Sanguthevar Rajasekaran.

Editor information

Editors and Affiliations

University of Connecticut, Storrs, CT, USA
Ion Măndoiu
Virginia Tech, Blacksburg, VA, USA
T. M. Murali
Florida International University, Miami, FL, USA
Giri Narasimhan
University of Connecticut, Storrs, CT, USA
Sanguthevar Rajasekaran
Georgia State University, Atlanta, GA, USA
Pavel Skums
Georgia State University, Atlanta, GA, USA
Alexander Zelikovsky

Cite this paper

Baihan, A., Ammar, R., Aseltine, R., Baihan, M., Rajasekaran, S. (2020). Efficient Sequential and Parallel Algorithms for Incremental Record Linkage. In: Măndoiu, I., Murali, T., Narasimhan, G., Rajasekaran, S., Skums, P., Zelikovsky, A. (eds) Computational Advances in Bio and Medical Sciences. ICCABS 2019. Lecture Notes in Computer Science, vol 12029. Springer, Cham. https://doi.org/10.1007/978-3-030-46165-2_3

DOI: https://doi.org/10.1007/978-3-030-46165-2_3
Published: 29 April 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-46164-5
Online ISBN: 978-3-030-46165-2
eBook Packages: Computer Science, Computer Science (R0)
