Abstract
Given a collection of records, the record linkage problem is to cluster them so that each cluster contains all the records of one and only one individual. Existing algorithms for this important problem have long run times, especially when the number of records is large. Often, a small number of new records must be linked with a large number of existing records, and linking the old and new records together from scratch can be expensive. We refer to any algorithm that efficiently links new records with existing ones as an incremental record linkage (IRL) algorithm, and in this paper we offer novel IRL algorithms. Clustering is the basic approach we employ. Our algorithms use a novel random sampling technique to compute the distance between a new record and any cluster, and they associate the new record with the cluster at the least distance. The idea is to compute the distance between the new record and only a random subset of the cluster's records. A sampling lemma can be used to show that this computation is very accurate. We have developed both sequential and parallel implementations of our algorithms. They outperform the best-known prior algorithm (called RLA). For example, one of our algorithms takes 71.22 s to link 100,000 new records with a database of 1,000,000 records. In comparison, the current best algorithm takes 140.91 s to link all 1,100,000 records. We achieve very nearly linear speedup in parallel; for example, we obtain a speedup of 28.28 with 32 cores. To the best of our knowledge, we are the first to propose parallel IRL algorithms. Our algorithms offer state-of-the-art solutions to the IRL problem.
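The sampling idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not specify the distance function, how per-record distances are combined into a record-to-cluster distance, the sample size, or what happens when no cluster is close. The sketch therefore assumes Levenshtein edit distance on strings, the average over the sample, and hypothetical parameters `sample_size` and `threshold`, with a new cluster opened when the nearest cluster is farther than `threshold`. All function names are illustrative.

```python
import random


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def sampled_distance(record, cluster, sample_size, rng):
    """Estimate the record-to-cluster distance from a random subset only.

    Assumption: the cluster distance is the average edit distance to the
    sampled records. Clusters are assumed to be non-empty lists.
    """
    k = min(sample_size, len(cluster))
    sample = rng.sample(cluster, k)
    return sum(edit_distance(record, r) for r in sample) / k


def link_record(record, clusters, sample_size=3, threshold=2.0, rng=None):
    """Attach `record` to the nearest cluster and return that cluster's index.

    Assumption: if the nearest cluster is farther than `threshold`
    (a hypothetical parameter), the record starts a new cluster.
    """
    rng = rng or random.Random(0)
    best_idx, best_dist = None, float("inf")
    for idx, cluster in enumerate(clusters):
        d = sampled_distance(record, cluster, sample_size, rng)
        if d < best_dist:
            best_idx, best_dist = idx, d
    if best_idx is None or best_dist > threshold:
        clusters.append([record])
        return len(clusters) - 1
    clusters[best_idx].append(record)
    return best_idx
```

In this sketch each new record costs at most `sample_size` distance computations per cluster, however large the cluster is, which is where the savings over comparing against every stored record would come from.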
Acknowledgment
This work has been supported in part by the following NSF grants: 1447711, 1514357, 1743418, and 1843025. Also, this project was supported by King Saud University, Deanship of Scientific Research, Community College Research Unit. Data for this study were obtained from the Connecticut Department of Public Health. The Connecticut Department of Public Health does not endorse or assume any responsibility for any analyses, interpretations or conclusions based on the data. The Human Investigations Committee of the Department of Public Health approved this study.
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Baihan, A., Ammar, R., Aseltine, R., Baihan, M., Rajasekaran, S. (2020). Efficient Sequential and Parallel Algorithms for Incremental Record Linkage. In: Măndoiu, I., Murali, T., Narasimhan, G., Rajasekaran, S., Skums, P., Zelikovsky, A. (eds) Computational Advances in Bio and Medical Sciences. ICCABS 2019. Lecture Notes in Computer Science, vol 12029. Springer, Cham. https://doi.org/10.1007/978-3-030-46165-2_3
DOI: https://doi.org/10.1007/978-3-030-46165-2_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-46164-5
Online ISBN: 978-3-030-46165-2
eBook Packages: Computer Science (R0)