Probabilistic master lists: integration of patient records from different databases when unique patient identifier is missing

Alemi, Farrokh; Loaiza, Francisco; Vang, Jee

doi:10.1007/s10729-006-9002-7

Published: 28 November 2006

Volume 10, pages 95–104, (2007)

Health Care Management Science

Abstract

We show how Bayesian probability models can be used to integrate two databases, one of which lacks a key for uniquely identifying clients (e.g., social security number or medical record number). The analyst selects a set of imperfect identifiers (last visit diagnosis, first name, etc.). The algorithm assesses the likelihood ratio associated with each identifier from the database of known cases, and from these likelihood ratios it estimates the probability that two records belong to the same client. As it proceeds through the various identifiers, it accounts for inter-dependencies among them by allowing overlapping and redundant identifiers to be used. We tested the effectiveness of the procedure on data from the Medical Expenditure Panel Survey (MEPS) Population Characteristics data set, which is publicly available. We randomly selected 1,000 cases as the training data set; these constituted the known cases. The algorithm was then asked to classify 100 cases not in the training data set as either a case in the training set or a new case. With 12 fields as identifiers, all 100 cases were correctly classified as new cases. We also selected 100 known cases from the training set and asked the algorithm to classify them. Again, all 100 cases were correctly classified. Results were less accurate when the training data set was too small (e.g., fewer than 100 records) or when too few fields were used as identifiers (e.g., fewer than seven fields). In a test of the algorithm's performance, when the ratio of testing to training data sets exceeded 4 to 1, the algorithm was accurate in more than 90% of cases. As the ratio increases, the accuracy of the algorithm improves further. These data suggest that our automated, mathematical procedure can accurately merge data from two different data sets without a unique identifier.
The algorithm uses imperfect and overlapping clues to re-identify cases from information not typically considered to be a patient identifier.
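
The likelihood-ratio idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' algorithm: it assumes the identifiers are conditionally independent (a naive Bayes simplification, whereas the paper states that it accounts for inter-dependencies), and the field names, the `error_rate` parameter, the smoothing, the prior, and the 0.5 threshold are all hypothetical choices made only for illustration.

```python
def likelihood_ratio(field, new_value, candidate_value, known_records, error_rate=0.05):
    """Likelihood ratio for one identifier (illustrative, not the paper's estimator).

    Numerator: chance of the observed agreement/disagreement if both records
    belong to the same client (error_rate is an assumed recording-error rate).
    Denominator: chance of it for two different clients, estimated from how
    common new_value is among the known cases.
    """
    n = len(known_records)
    count = sum(1 for r in known_records if r[field] == new_value)
    freq = (count + 1) / (n + 2)  # smoothed so that 0 < freq < 1
    if new_value == candidate_value:
        return (1 - error_rate) / freq
    return error_rate / (1 - freq)


def match_probability(new_record, candidate, known_records, fields):
    """Posterior probability that new_record and candidate are the same client.

    Assumes independent identifiers, so the likelihood ratios simply multiply.
    """
    prior = 1 / (len(known_records) + 1)  # assumed prior: one extra slot for "new case"
    odds = prior / (1 - prior)
    for field in fields:
        odds *= likelihood_ratio(field, new_record[field], candidate[field], known_records)
    return odds / (1 + odds)


def classify(new_record, known_records, fields, threshold=0.5):
    """Return (index of best-matching known case or None if new, best probability)."""
    probs = [match_probability(new_record, c, known_records, fields) for c in known_records]
    best = max(range(len(probs)), key=lambda i: probs[i])
    if probs[best] >= threshold:
        return best, probs[best]
    return None, probs[best]


# Toy data with hypothetical fields; none of this comes from MEPS.
fields = ["first_name", "birth_year", "zip", "diagnosis"]
known = [
    {"first_name": "Ann", "birth_year": 1970, "zip": "22030", "diagnosis": "asthma"},
    {"first_name": "Bob", "birth_year": 1985, "zip": "22031", "diagnosis": "diabetes"},
    {"first_name": "Ann", "birth_year": 1992, "zip": "22030", "diagnosis": "flu"},
    {"first_name": "Cho", "birth_year": 1970, "zip": "22032", "diagnosis": "asthma"},
]
new_case = {"first_name": "Dee", "birth_year": 2001, "zip": "22099", "diagnosis": "migraine"}

print(classify(dict(known[0]), known, fields))  # a record that repeats a known case
print(classify(new_case, known, fields))        # a record unlike any known case
```

In this sketch a single common identifier (such as a shared first name) shifts the odds only slightly, while agreement on several identifiers together pushes the probability past the threshold. That is the intuition behind the abstract's finding that accuracy dropped when fewer than seven fields were used.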

Author information

Authors and Affiliations

College of Nursing and Health Sciences, George Mason University, 4400 University Drive, Fairfax, VA, 22030, USA
Farrokh Alemi, Francisco Loaiza & Jee Vang

Corresponding author

Correspondence to Farrokh Alemi.

About this article

Cite this article

Alemi, F., Loaiza, F. & Vang, J. Probabilistic master lists: integration of patient records from different databases when unique patient identifier is missing. Health Care Manage Sci 10, 95–104 (2007). https://doi.org/10.1007/s10729-006-9002-7

Published: 28 November 2006
Issue Date: February 2007
DOI: https://doi.org/10.1007/s10729-006-9002-7
