Abstract
Data analysts often use probabilistic record linkage techniques to match records across two data sets. Such matching can be the primary goal, or it can be a necessary step to analyze relationships among the variables in the data sets. We propose a Bayesian hierarchical model that allows data analysts to perform linear regression and probabilistic record linkage simultaneously. The joint model lets analysts leverage relationships among the variables to improve linkage quality. Further, it enables analysts to propagate linkage uncertainty in a principled way, while also potentially offering more accurate estimates of regression parameters than a two-step process, i.e., link the records first, then estimate the linear regression on the linked data. We propose and evaluate three Markov chain Monte Carlo algorithms for implementing the Bayesian model, which we compare against a two-step process.
R. C. Steorts—This research was partially supported by the National Science Foundation through grants SES1131897, SES1733835, SES1652431 and SES1534412.
References
Christen, P.: A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans. Knowl. Data Eng. 24(9), 1537–1555 (2012)
Christen, P.: Data linkage: the big picture. Harv. Data Sci. Rev. 1(2) (2019)
Dalzell, N.M., Reiter, J.P.: Regression modeling and file matching using possibly erroneous matching variables. J. Comput. Graph. Stat. 27, 728–738 (2018)
Fellegi, I.P., Sunter, A.B.: A theory for record linkage. J. Am. Stat. Assoc. 64(328), 1183–1210 (1969)
Fortini, M., Liseo, B., Nuccitelli, A., Scanu, M.: On Bayesian record linkage. Res. Off. Stat. 4, 185–198 (2001)
Gutman, R., Afendulis, C.C., Zaslavsky, A.M.: A Bayesian procedure for file linking to analyze end-of-life medical costs. J. Am. Stat. Assoc. 108(501), 34–47 (2013)
Hof, M.H., Ravelli, A.C., To, A.H.Z.: A probabilistic record linkage model for survival data. J. Am. Stat. Assoc. 112(520), 1504–1515 (2017)
Jaro, M.A.: Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. J. Am. Stat. Assoc. 84, 414–420 (1989)
Larsen, M.D.: Iterative automated record linkage using mixture models. J. Am. Stat. Assoc. 96, 32–41 (2001)
Larsen, M.D.: Comments on hierarchical Bayesian record linkage. In: Proceedings of the Section on Survey Research Methods. ASA, Alexandria, VA, pp. 1995–2000 (2002)
Larsen, M.D.: Advances in record linkage theory: hierarchical Bayesian record linkage theory. In: Proceedings of the Section on Survey Research Methods. ASA, Alexandria, VA, pp. 3277–3284 (2005)
Larsen, M.D., Rubin, D.B.: Iterative automated record linkage using mixture models. J. Am. Stat. Assoc. 96(453), 32–41 (2001)
Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals. Soviet Phys. Doklady 10, 707–710 (1965)
Marchant, N.G., Steorts, R.C., Kaplan, A., Rubinstein, B.I., Elazar, D.N.: d-blink: distributed end-to-end Bayesian entity resolution (2019). arXiv preprint arXiv:1909.06039
Newcombe, H.B., Kennedy, J.M., Axford, S.J., James, A.P.: Automatic linkage of vital records: computers can be used to extract “follow-up” statistics of families from files of routine records. Science 130(3381), 954–959 (1959)
Sadinle, M.: Detecting duplicates in a homicide registry using a Bayesian partitioning approach. Ann. Appl. Stat. 8(4), 2404–2434 (2014). MR3292503
Sadinle, M.: Bayesian estimation of bipartite matchings for record linkage. J. Am. Stat. Assoc. 112, 600–612 (2017)
Sadinle, M., Fienberg, S.E.: A generalized Fellegi-Sunter framework for multiple record linkage with application to homicide record systems. J. Am. Stat. Assoc. 108(502), 385–397 (2013)
Steorts, R.C.: Entity resolution with empirically motivated priors. Bayesian Anal. 10(4), 849–875 (2015). MR3432242
Steorts, R.C., Hall, R., Fienberg, S.E.: A Bayesian approach to graphical record linkage and deduplication. J. Am. Stat. Assoc. 111(516), 1660–1672 (2016)
Tancredi, A., Liseo, B.: A hierarchical Bayesian approach to record linkage and population size problems. Ann. Appl. Stat. 5(2B), 1553–1585 (2011). MR2849786
Winkler, W.E.: String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In: Proceedings of the Section on Survey Research Methods. ASA, Alexandria, VA, pp. 354–359 (1990)
Winkler, W.E.: Overview of record linkage and current research directions. Technical report. Statistics #2006-2, U.S. Bureau of the Census (2006)
Winkler, W.E.: Matching and record linkage. Wiley Interdisc. Rev.: Comput. Stat. 6(5), 313–325 (2014)
Zanella, G., Betancourt, B., Wallach, H., Miller, J., Zaidi, A., Steorts, R.C.: Flexible models for microclustering with application to entity resolution. In: Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS 2016, NY, USA. Curran Associates Inc., pp. 1425–1433 (2016)
Zhao, B., Rubinstein, B.I.P., Gemmell, J., Han, J.: A Bayesian approach to discovering truth from conflicting sources for data integration. Proc. VLDB Endow. 5(6), 550–561 (2012)
Appendices
A Record Linkage Evaluation Metrics
Here, we review the definitions of the average numbers of correct links (CL), correct non-links (CNL), false negatives (FN), and false positives (FP). These allow one to calculate the false negative rate (FNR) and false discovery rate (FDR) [19]. For any MCMC iteration t, we define CL\(^{[t]}\) as the number of record pairs with \(Z_j \le n_1\) that are true links. We define CNL\(^{[t]}\) as the number of record pairs with \(Z_j > n_1\) that are not true links. We define FN\(^{[t]}\) as the number of record pairs that are true links but have \(Z_j > n_1\). We define FP\(^{[t]}\) as the number of record pairs that are not true links but have \(Z_j \le n_1\). In the simulations, the true number of links is CL\(^{[t]}\)+FN\(^{[t]}=750\), and the estimated number of links is CL\(^{[t]}\)+FP\(^{[t]}\). Thus, FNR\(^{[t]} = \) FN\(^{[t]}\)/(CL\(^{[t]}\)+FN\(^{[t]}\)) and FDR\(^{[t]} = \) FP\(^{[t]}\)/(CL\(^{[t]}\)+FP\(^{[t]}\)), where by convention we take FDR\(^{[t]} = 0\) when both the numerator and denominator are 0. We report the FDR instead of the false positive rate (FPR) because an algorithm that does not link any records has a small FPR, but this does not mean that it is a good algorithm. Finally, for each metric, we compute the posterior mean across all MCMC iterations, and we then average these posterior means across all simulations.
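As an illustration of these definitions, the following Python sketch computes the per-iteration counts and rates and then averages them over MCMC draws. It is a minimal sketch under stated assumptions, not the authors' implementation. The function names and the encoding are hypothetical: each draw is a list `z` with one entry per record j in the second file, where `z[j] <= n1` means record j is linked to record `z[j]` of the first file and `z[j] > n1` means it is unlinked, and `z_true` encodes the truth in the same way. The text does not say how to count a record linked to the wrong partner; the sketch counts it as a false positive only. Setting the FNR to 0 when there are no true links is likewise an assumption, since the stated convention covers only the FDR.

```python
from typing import Dict, List, Sequence


def linkage_metrics(z: Sequence[int], z_true: Sequence[int], n1: int) -> Dict[str, float]:
    """Counts and rates for one MCMC draw z against the true links z_true."""
    cl = cnl = fn = fp = 0
    for zj, tj in zip(z, z_true):
        linked = zj <= n1          # draw links record j to a file-1 record
        has_true_link = tj <= n1   # record j truly has a partner in file 1
        if linked and zj == tj:
            cl += 1                # linked to the correct partner
        elif linked:
            fp += 1                # linked, but the pair is not a true link (assumption: covers wrong partners)
        elif has_true_link:
            fn += 1                # true link exists but record left unlinked
        else:
            cnl += 1               # correctly left unlinked
    # Assumption: FNR is set to 0 when there are no true links to miss.
    fnr = fn / (cl + fn) if (cl + fn) > 0 else 0.0
    # Convention from the text: FDR = 0 when numerator and denominator are both 0.
    fdr = fp / (cl + fp) if (cl + fp) > 0 else 0.0
    return {"CL": cl, "CNL": cnl, "FN": fn, "FP": fp, "FNR": fnr, "FDR": fdr}


def posterior_mean_metrics(draws: List[Sequence[int]], z_true: Sequence[int], n1: int) -> Dict[str, float]:
    """Average each metric across MCMC iterations."""
    per_iter = [linkage_metrics(z, z_true, n1) for z in draws]
    return {key: sum(m[key] for m in per_iter) / len(per_iter) for key in per_iter[0]}


# Toy example with n1 = 3 file-1 records and 4 file-2 records.
z_true = [1, 5, 3, 7]   # records 1 and 3 have true links; records 2 and 4 do not
draw = [1, 2, 6, 7]     # one correct link, one false link, one missed link, one correct non-link
print(linkage_metrics(draw, z_true, n1=3))
# {'CL': 1, 'CNL': 1, 'FN': 1, 'FP': 1, 'FNR': 0.5, 'FDR': 0.5}
```

Averaging the output of `posterior_mean_metrics` over simulation replicates would then give the kind of summary described above.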
B Additional Simulations with a Mis-specified Regression
As an additional simulation, we examine the linkage quality of the hierarchical model when we use a mis-specified regression. The true data generating model is \(\log (\mathbf {Y})|\mathbf {X},\mathbf {V},\mathbf {Z} \sim N(\mathbf {X\beta }, \sigma ^2 \mathbf {I})\), but we incorrectly assume \(\mathbf {Y}|\mathbf {X},\mathbf {V},\mathbf {Z} \sim N(\mathbf {X\beta }, \sigma ^2 \mathbf {I})\) in the hierarchical model. Table 3 summarizes the measures of linkage quality when the linkage variables have weak information. Even though the regression component of the hierarchical model is mis-specified, the hierarchical model still identifies more correct non-matches than the two-step approach does, although the difference is less pronounced than when we use the correctly specified regression. We see a similar trend when the information in the linking variables is strong, albeit with smaller differences between the two-step approach and the hierarchical model.
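To make the form of this mis-specification concrete, the sketch below generates data from a log-normal regression and fits a straight line both to \(\log Y\) (correct) and to \(Y\) (mis-specified). It is only an illustration under simplifying assumptions: a single predictor, no linkage step, and hypothetical parameter values (`beta0`, `beta1`, `sigma`, the sample size, and the seed are not taken from the paper). Residual skewness is used here as one simple symptom of the wrong response scale.

```python
import math
import random
from typing import List, Tuple


def simulate(n: int, beta0: float, beta1: float, sigma: float, seed: int) -> Tuple[List[float], List[float]]:
    """Draw (x, y) with log(y) = beta0 + beta1 * x + N(0, sigma^2) noise."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [math.exp(beta0 + beta1 * xi + rng.gauss(0.0, sigma)) for xi in x]
    return x, y


def ols(x: List[float], y: List[float]) -> Tuple[float, float]:
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def residual_skewness(x: List[float], y: List[float]) -> float:
    """Sample skewness of the OLS residuals of y on x."""
    b0, b1 = ols(x, y)
    r = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
    n = len(r)
    mean = sum(r) / n
    var = sum((ri - mean) ** 2 for ri in r) / n
    third = sum((ri - mean) ** 3 for ri in r) / n
    return third / var ** 1.5


# Hypothetical settings chosen only for illustration.
x, y = simulate(n=2000, beta0=1.0, beta1=0.5, sigma=0.5, seed=0)
log_y = [math.log(v) for v in y]
print("slope on log scale:", ols(x, log_y)[1])
print("residual skewness, correct model:", residual_skewness(x, log_y))
print("residual skewness, mis-specified model:", residual_skewness(x, y))
```

Under these settings, the residuals on the log scale should be roughly symmetric, while the residuals from the mis-specified fit should be clearly right-skewed.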
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Tang, J., Reiter, J.P., Steorts, R.C. (2020). Bayesian Modeling for Simultaneous Regression and Record Linkage. In: Domingo-Ferrer, J., Muralidhar, K. (eds) Privacy in Statistical Databases. PSD 2020. Lecture Notes in Computer Science, vol 12276. Springer, Cham. https://doi.org/10.1007/978-3-030-57521-2_15
DOI: https://doi.org/10.1007/978-3-030-57521-2_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-57520-5
Online ISBN: 978-3-030-57521-2
eBook Packages: Computer Science, Computer Science (R0)