Determining the Impact of Missing Values on Blocking in Record Linkage

Anindya, Imrul Chowdhury; Kantarcioglu, Murat; Malin, Bradley

doi:10.1007/978-3-030-16142-2_21

Imrul Chowdhury Anindya¹⁹,
Murat Kantarcioglu¹⁹ &
Bradley Malin²⁰

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 11441))

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

2017 Accesses
2 Citations

Abstract

Record linkage is the process of integrating information from the same underlying entity across disparate data sets. This process, which is increasingly utilized to build accurate representations of individuals and organizations for a variety of applications, ranging from credit worthiness assessments to continuity of medical care, can be computationally intensive because it requires comparing large quantities of records over a range of attributes. To reduce the amount of computation in record linkage in big data settings, blocking methods, which are designed to limit the number of record pair comparisons that needs to be performed, are critical for scaling up the record linkage process. These methods group together potential matches into blocks, often using a subset of attributes before a final comparator function predicts which record pairs within the blocks correspond to matches. Yet data corruption and missing values adversely influence the performance of blocking methods (e.g., it may cause some matching records not to be placed in the same block). While there has been some investigation into the impact of missing values on general record linkage techniques (e.g., the comparator function), no study has addressed the impact of the missing values on blocking methods. To address this issue, in this work, we systematically perform a detailed empirical analysis of the individual and joint impact of missing values and data corruption on different blocking methods using realistic data sets. Our results show that blocking approaches that do not depend on one type of blocking attributes are more robust against missing values. In addition, our results indicate that blocking parameters must be chosen carefully for different blocking techniques.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 69.99; Price excludes VAT (USA)

Softcover Book: USD 89.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

Florida Voter Registration Records. http://flvoters.com/downloads.html. Accessed 10 July 2018
North Carolina Voter Registration Records. https://dl.ncsbe.gov/index.html?prefix=data/Snapshots. Accessed 10 July 2018
Aizawa, A.N., Oyama, K.: A fast linkage detection scheme for multi-source information integration. In: Proceedings of International Workshop on Challenges in Web Information Retrieval and Integration, pp. 30–39 (2005)
Google Scholar
Baxter, R., Christen, P., Churches, T.: A comparison of fast blocking methods for record linkage. In: Proceedings of ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 25–27 (2003)
Google Scholar
Christen, P.: Febrl-a open source data cleaning, deduplication and record linkage system with a graphical user interface. In: Proceedings of ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 1065–1068 (2008)
Google Scholar
Christen, P.: A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans. Knowl. Data Eng. 24(9), 1537–1555 (2012)
Article Google Scholar
Dusetzina, S.B., Tyree, S., Meyer, A.M., Meyer, A., Green, L., Carpenter, W.R.: Linking data for health services research: a framework and instructional guide. Agency for Healthcare Research and Quality (US), Rockville (MD) (2014)
Google Scholar
Hernández, M.A., Stolfo, S.J.: Real-world data is dirty: data cleansing and the merge/purge problem. Data Mini. Knowl. Discov. 2(1), 9–37 (1998)
Article Google Scholar
Jin, L., Li, C., Mehrotra, S.: Efficient record linkage in large data sets. In: Proceedings of International Conference on Database Systems for Advanced Applications, pp. 137–146. IEEE (2003)
Google Scholar
McCallum, A., Nigam, K., Ungar, L.H.: Efficient clustering of high-dimensional data sets with application to reference matching. In: Proceedings of ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 169–178 (2000)
Google Scholar
Ong, T.C., Mannino, M.V., Schilling, L.M., Kahn, M.G.: Improving record linkage performance in the presence of missing linkage data. J. Biomed. Inf. 52, 43–54 (2014)
Article Google Scholar
Prasad, K.H., Chaturvedi, S., Faruquie, T.A., Subramaniam, L.V., Mohania, M.K.: Automated selection of blocking columns for record linkage. In: Proceedings of International Conference on Service Operations and Logistics, and Informatics, pp. 78–83. IEEE (2012)
Google Scholar
Tran, K.N., Vatsalan, D., Christen, P.: GeCo: an online personal data generator and corruptor. In: Proceedings of ACM International Conference on Information and Knowledge Management, pp. 2473–2476 (2013)
Google Scholar
Winkler, W.E.: Using the EM algorithm for weight computation in the Fellegi-Sunter model of record linkage. In: Proceedings of the Section on Survey Research Methods, American Statistical Association, vol. 667, p. 671 (1988)
Google Scholar

Download references

Acknowledgements

The research reported herein was supported in part by NIH awards 1R01HG006844, RM1HG009034, NSF awards CICI- 1547324, IIS-1633331, CNS-1837627, OAC-1828467 and ARO award W911NF-17-1-0356.

Author information

Authors and Affiliations

The University of Texas at Dallas, Richardson, TX, 75080, USA
Imrul Chowdhury Anindya & Murat Kantarcioglu
Vanderbilt University, Nashville, TN, 37235, USA
Bradley Malin

Authors

Imrul Chowdhury Anindya
View author publications
You can also search for this author in PubMed Google Scholar
Murat Kantarcioglu
View author publications
You can also search for this author in PubMed Google Scholar
Bradley Malin
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Imrul Chowdhury Anindya .

Editor information

Editors and Affiliations

Hong Kong University of Science and Technology, Hong Kong, China
Qiang Yang
Nanjing University, Nanjing, China
Zhi-Hua Zhou
University of Macau, Taipa, Macau, China
Zhiguo Gong
Southeast University, Nanjing, China
Min-Ling Zhang
Nanjing University of Aeronautics and Astronautics, Nanjing, China
Sheng-Jun Huang

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Anindya, I.C., Kantarcioglu, M., Malin, B. (2019). Determining the Impact of Missing Values on Blocking in Record Linkage. In: Yang, Q., Zhou, ZH., Gong, Z., Zhang, ML., Huang, SJ. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2019. Lecture Notes in Computer Science(), vol 11441. Springer, Cham. https://doi.org/10.1007/978-3-030-16142-2_21

Download citation

DOI: https://doi.org/10.1007/978-3-030-16142-2_21
Published: 20 March 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-16141-5
Online ISBN: 978-3-030-16142-2
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics