Determining the Impact of Missing Values on Blocking in Record Linkage

  • Imrul Chowdhury Anindya (corresponding author)
  • Murat Kantarcioglu
  • Bradley Malin
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11441)

Abstract

Record linkage is the process of integrating information about the same underlying entity across disparate data sets. This process is increasingly used to build accurate representations of individuals and organizations for a variety of applications, ranging from creditworthiness assessments to continuity of medical care. It can be computationally intensive because it requires comparing large quantities of records over a range of attributes. In big data settings, blocking methods, which are designed to limit the number of record pair comparisons that need to be performed, are critical for scaling up the record linkage process. These methods group potential matches into blocks, often using a subset of attributes, before a final comparator function predicts which record pairs within the blocks correspond to matches. Yet data corruption and missing values adversely influence the performance of blocking methods (e.g., they may cause some matching records not to be placed in the same block). While there has been some investigation into the impact of missing values on general record linkage techniques (e.g., the comparator function), no study has addressed the impact of missing values on blocking methods. To address this issue, we perform a detailed and systematic empirical analysis of the individual and joint impact of missing values and data corruption on different blocking methods using realistic data sets. Our results show that blocking approaches that do not depend on a single type of blocking attribute are more robust against missing values. In addition, our results indicate that blocking parameters must be chosen carefully for different blocking techniques.
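The effect described in the abstract can be illustrated with a minimal sketch of standard (exact-key) blocking. This is not the authors' experimental setup: the record fields `last` and `zip`, the helper names `block_key` and `candidate_pairs`, and the policy of leaving records with a missing key out of every block are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def block_key(record, attrs):
    """Build a blocking key from the given attributes.

    Returns None when any attribute is missing (assumed policy: such a
    record cannot be assigned to a block on these attributes).
    """
    values = [record.get(a) for a in attrs]
    if any(v is None or v == "" for v in values):
        return None
    return "|".join(v.lower() for v in values)


def candidate_pairs(records, attrs):
    """Return the set of record-id pairs that share a blocking key."""
    blocks = defaultdict(list)
    for rid, rec in records.items():
        key = block_key(rec, attrs)
        if key is not None:
            blocks[key].append(rid)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical toy data: (a1, b1) and (a2, b2) are the true matches.
records = {
    "a1": {"last": "Smith", "zip": "75080"},
    "b1": {"last": "Smith", "zip": "75080"},
    "a2": {"last": "Jones", "zip": "37203"},
    "b2": {"last": "Jones", "zip": None},  # zip is missing
}

pairs_zip = candidate_pairs(records, ["zip"])    # misses (a2, b2)
pairs_last = candidate_pairs(records, ["last"])  # finds both matches
pairs_multi = pairs_zip | pairs_last             # union of two blocking passes
```

In this toy case, blocking on `zip` alone never compares `a2` with `b2`, because the missing value keeps `b2` out of every block, whereas taking the union of passes over different attributes recovers the pair. This is consistent with the abstract's observation that approaches not tied to one type of blocking attribute are more robust, though the paper's actual methods and parameters may differ.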

Keywords

Record linkage, Deduplication, Missing values, Blocking methods, Data corruption

Notes

Acknowledgements

The research reported herein was supported in part by NIH awards 1R01HG006844 and RM1HG009034; NSF awards CICI-1547324, IIS-1633331, CNS-1837627, and OAC-1828467; and ARO award W911NF-17-1-0356.

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Imrul Chowdhury Anindya (1), corresponding author
  • Murat Kantarcioglu (1)
  • Bradley Malin (2)
  1. The University of Texas at Dallas, Richardson, USA
  2. Vanderbilt University, Nashville, USA
