Record Linkage in Statistical Sampling: Past, Present, and Future

Recent Advances on Sampling Methods and Educational Statistics

Part of the book series: Emerging Topics in Statistics and Biostatistics (ETSB)

Abstract

Record linkage matches records across datasets that lack a unique identifier. In this chapter, we examine the past, present, and future uses of probabilistic record linkage, with particular attention to statistical sampling. For example, given the growing interest in and use of non-probability data in sampling, many researchers seek to augment a non-probability sample with a probability sample, and record linkage is a useful method for combining the two. The chapter reviews how record linkage has been used and how it is currently being researched and implemented, with an emphasis on its current and future use in statistical sampling. It concludes with open research questions for record linkage in the context of sampling, centered on the idea of creating a total error framework for linked data.

Author information

Corresponding author

Correspondence to Benjamin Williams.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Williams, B. (2022). Record Linkage in Statistical Sampling: Past, Present, and Future. In: Ng, H.K.T., Heitjan, D.F. (eds) Recent Advances on Sampling Methods and Educational Statistics. Emerging Topics in Statistics and Biostatistics. Springer, Cham. https://doi.org/10.1007/978-3-031-14525-4_9
