Privacy-Preserving Record Linkage to Identify Fragmented Electronic Medical Records in the All of Us Research Program

Kho, Abel N.; Yu, Jingzhi; Bryan, Molly Scannell; Gladfelter, Charon; Gordon, Howard S.; Grannis, Shaun; Madden, Margaret; Mendonca, Eneida; Mitrovic, Vesna; Shah, Raj; Tachinardi, Umberto; Taylor, Bradley

doi:10.1007/978-3-030-43887-6_7

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1168)

Included in the following conference series:

Joint European Conference on Machine Learning and Knowledge Discovery in Databases

Abstract

As part of a national study in the United States to recruit one million Americans (All of Us Research Program) and their Electronic Health Record data, we set out to determine the degree to which care is fragmented across a sample of participating health provider organizations (HPOs). We distributed a previously validated Privacy-Preserving Record Linkage (PPRL) tool to participating sites to generate a unique set of keyed encrypted hashes for seven participating institutions across three States in the Upper Midwest of the U.S. An honest broker received the resulting encrypted hashes to identify patients with the same encrypted hashes shared across any combination of more than one institution as a proxy for patients receiving care across institutions. Out of 5,831,238 individuals, we identified 458,680 patients with data at more than one institution. Care fragmentation varied significantly by State and by Institution ranging from 6.1% up to 32.7%. Patients with fragmented care were more likely to be black (11.8% vs 10.8%), and slightly older (Median birth year 1968 vs 1969) compared with patients receiving care at only one participating institution. In contrast, patients who maintained an address in a warmer state (“snowbirds”) were the least likely to be black (7.5%) of all study groups. We identified conflicting or inconsistent demographic information in 49.1% of patients with care fragmentation compared with 5.6% of patients without care fragmentation. Privacy-preserving record linkage can be an effective means to identify populations with care fragmentation and poor data quality for focused clinical and data improvement efforts.

1 Introduction

1.1 The All of Us Research Program

In 2016, the United States Congress launched the Precision Medicine Initiative (PMI) with $200M in funding in order to advance the development and application of individualized care based on a person’s unique lifestyle, environment, and biology. A core foundation of the Precision Medicine Initiative, the All of Us Research Program (AoURP) was initially allocated $130M to create a national cohort of over one million Americans broadly representing the rich diversity of the U.S. population. Widespread adoption of Electronic Health Records (EHRs) across the U.S. was identified early in the design of the AoURP as a potentially rich source of data on patient health conditions and treatments.

The AoURP designated and funded over 40 Health Care Provider Organizations (HPOs) nationally to serve as recruitment centers. As part of the enrollment process, HPOs are required to send EHR data for consented participants to the AoURP Data and Research Center after verifying the identity of the participant and standardizing the EHR data into the Observational Medical Outcomes Partnership (OMOP) data model [1].

1.2 Data Fragmentation Across Institutions

However, healthcare in the United States is delivered across a wide variety of care settings, and there is no universal patient identifier. As a result, patient records may be fragmented across each location where a patient receives care, and unavailable both for patient care and for aggregation for research purposes such as those envisioned by the AoURP. Health Information Exchanges (HIEs) emerged as a means to address data and care fragmentation, and use a master patient index to consistently track the same patient across different care settings. However, HIEs are not available in many regions of the United States, and some have struggled to remain financially viable [2]. Some EHR systems can link health records across institutions that use the same EHR system for routine clinical care, but do not currently integrate these data for research purposes [3]. Because the AoURP aims to aggregate as much information about a participant as possible, investigators at participating HPOs questioned how often participants might receive care at a different care site than the HPO at which they might be enrolled. But without cross-institutional data sharing agreements in place to allow patient identifiers to be shared across sites, and with many HPOs not part of HIEs, an alternate mechanism to link the same patient record across sites was needed.

1.3 Prior Use of Privacy-Preserving Record Linkage

We previously developed software to generate keyed hashes of patient identifiers that is fully compliant with HIPAA de-identification methods and could enable privacy preserving record linkage across AoURP HPOs [4]. A key finding of the initial linkage across seven healthcare institutions was the significant degree of data fragmentation across care sites ranging from 11 to 28% over a several year span. We subsequently demonstrated similar care fragmentation for specific populations including patients with diabetic ketoacidosis [5] and systemic lupus erythematosus [6]. Notably, we identified worse clinical outcomes for patients with fragmented care vs those without care fragmentation, a finding consistent across each condition we studied. Relevant to a cohort study such as the AoURP, we linked individual data between a longitudinal cohort study (the Multi-Ethnic Study of Atherosclerosis or MESA) and EHR data in our region, and identified gaps in data coverage in both sources of data even for conditions as seemingly obvious as a myocardial infarction [7]. The combination of both multi-institutional EHR data and prospectively collected data for a cohort study created a more complete set of data for a given research study participant than any one source alone.

With this background and with the endorsement of the AoURP Steering Committee, we set out to use our previously validated privacy preserving record linkage method to determine how often patients receive care across participating AoURP institutions within a geographically proximate region of three adjoining States in the Upper Midwest of the United States. Our goal was to identify the degree of data fragmentation across AoURP sites in order to determine whether to pursue additional data sources to fully characterize research cohort participants.

2 Methods

We submitted and received approval for this study of de-identified patient level data from the Northwestern Institutional Review Board. We defined the study population as patients seen at participating institutions from January 1, 2011 through May 1, 2018. We excluded patients aged 90 or over as of April 30, 2018 to comply with HIPAA Safe Harbor restrictions on age. Seven institutions participated in the study, three based in the State of Wisconsin, three in Illinois, and one in Indiana which had access to data from the statewide Health Information Exchange.
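The inclusion window and the age exclusion described above can be sketched as follows. This is a minimal illustration that assumes each site holds a birth date and visit dates before de-identification; the function names `age_on` and `is_eligible` are hypothetical and are not taken from the study software.

```python
from datetime import date

# Dates taken from the study definition; the names are illustrative.
STUDY_START = date(2011, 1, 1)
STUDY_END = date(2018, 5, 1)
AGE_CUTOFF_DATE = date(2018, 4, 30)
EXCLUDED_AGE = 90  # patients aged 90 or over on the cutoff date are excluded


def age_on(birth_date: date, on: date) -> int:
    """Return completed years of age on the given date."""
    had_birthday = (on.month, on.day) >= (birth_date.month, birth_date.day)
    return on.year - birth_date.year - (0 if had_birthday else 1)


def is_eligible(birth_date: date, visit_date: date) -> bool:
    """True if the visit is inside the study window and the patient is under 90."""
    in_window = STUDY_START <= visit_date <= STUDY_END
    return in_window and age_on(birth_date, AGE_CUTOFF_DATE) < EXCLUDED_AGE
```

A patient would be retained if at least one visit satisfies `is_eligible`.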

At a kickoff meeting hosted in Wisconsin and through subsequent discussion, all participating institutions agreed upon a common data dictionary to define key demographic and clinical fields to extract along with keyed hashes to uniquely identify a patient (Table 1).

Table 1. Key data fields extracted by institutions to characterize the demographics and diagnoses of the study population.

We distributed an executable software program with known matching performance characteristics, as described in our prior publication [4]. Participating institutions installed the software locally and collectively chose a secret key for hashing the patient identifiers; this key was kept separate from the group aggregating the data on behalf of the study. Using a combination of last name, first name, date of birth, and social security number (where available), sites encrypted multiple concatenated combinations of these features in order to generate up to 17 secret-key encrypted hashes. The central site team (Northwestern University), acting as an honest broker, received the keyed hashes along with attached demographic and clinical data as defined by the study data dictionary.
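The idea of keyed hashes over concatenated identifier combinations can be illustrated with a short sketch. It assumes HMAC-SHA256 as the keyed hash and a simple normalization step; the text does not name the algorithm, the normalization rules, or the 17 combinations the tool uses, so the four combinations and all function names below are hypothetical.

```python
import hashlib
import hmac
import unicodedata


def normalize(value: str) -> str:
    """Uppercase the value and drop accents, spaces, and punctuation."""
    decomposed = unicodedata.normalize("NFKD", value)
    return "".join(ch for ch in decomposed if ch.isalnum()).upper()


def keyed_hash(secret_key: bytes, *fields: str) -> str:
    """HMAC-SHA256 over the normalized fields (assumed algorithm)."""
    message = "|".join(normalize(f) for f in fields).encode("utf-8")
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()


def patient_hashes(secret_key, last, first, dob, ssn=None):
    """Return one keyed hash per identifier combination (illustrative subset)."""
    combos = {
        "last_first_dob": (last, first, dob),
        "first_dob": (first, dob),
    }
    if ssn:  # SSN-based combinations only where an SSN is available
        combos["ssn_dob"] = (ssn, dob)
        combos["ssn_last"] = (ssn, last)
    # The combination name is hashed too, so different combinations never collide.
    return {name: keyed_hash(secret_key, name, *fields)
            for name, fields in combos.items()}
```

Because every site uses the same secret key and the same normalization, the same patient yields identical hashes at each site, while the honest broker, who never sees the key, cannot recompute hashes from guessed identifiers.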

We matched the data across the participating institutions to evaluate the degree of care fragmentation within each State, across States, and across all institutions. Because we included three digit ZIP codes in our data set (which is a broad enough level of geography to still be considered de-identified by HIPAA), we could identify the sub-population of patients who also have a home address in a considerably warmer region of the United States (the States of Alabama, Arizona, Arkansas, California, Florida, Georgia, Louisiana, Mississippi, New Mexico, and Texas) during the winter months (colloquially referred to as “snowbirds”). We analyzed the differences in demographics between those patients who have fragmented and non-fragmented care, as well as between “snowbirds” and those less capable of escaping the cold winter weather in the Upper Midwest.
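The matching and "snowbird" steps can be sketched in pandas. This is a simplified illustration, not the study pipeline: it assumes a single hash column per record (the study matched on up to 17 hashes) and assumes each 3-digit ZIP code has already been mapped to a state. The column names `patient_hash`, `site`, and `address_state` are hypothetical.

```python
import pandas as pd

# Warm-weather states listed in the text, as postal abbreviations.
WARM_STATES = {"AL", "AZ", "AR", "CA", "FL", "GA", "LA", "MS", "NM", "TX"}


def summarize_patients(records: pd.DataFrame) -> pd.DataFrame:
    """Return one row per hashed patient with fragmentation and snowbird flags.

    Expects columns patient_hash, site, and address_state (assumed names).
    """
    grouped = records.groupby("patient_hash")
    summary = pd.DataFrame({
        # Number of distinct institutions holding a record for this hash.
        "n_sites": grouped["site"].nunique(),
        # True if any record lists an address in a warm-weather state.
        "snowbird": grouped["address_state"].agg(
            lambda states: states.isin(WARM_STATES).any()
        ),
    })
    # Care is treated as fragmented when more than one institution shares the hash.
    summary["fragmented"] = summary["n_sites"] > 1
    return summary
```

The share of patients with fragmented care is then `summary["fragmented"].mean()`.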

Several data fields required additional translation between data terminologies in order to be consistent for further analyses. Diagnoses in EHRs arrived as ICD9, ICD10, and SNOMED codes and required significant re-mapping to a consistent and common terminology, in this case MS-DRG-CM. We identified data quality issues including missing data and data which conflicted across sites.
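One way to flag missing and conflicting demographic values across a patient's linked records is sketched below. The set of values treated as missing, the column names, and the rule that any disagreement (including a known value versus "unknown") counts as a conflict are assumptions made for illustration.

```python
import pandas as pd

# Values treated as "no information" (an assumed list, compared in lower case).
MISSING_VALUES = {"", "unknown", "declined", "missing"}


def demographic_status(values: pd.Series) -> str:
    """Classify one patient's recorded values for a single demographic field."""
    cleaned = values.fillna("unknown").astype(str).str.strip().str.lower()
    distinct = set(cleaned)
    if distinct <= MISSING_VALUES:
        return "missing"       # no record carries a usable value
    if len(distinct) > 1:
        return "conflicting"   # records disagree, e.g. "Caucasian" vs "unknown"
    return "consistent"


def conflict_table(records: pd.DataFrame, fields: list) -> pd.DataFrame:
    """Return one row per patient_hash and one status column per field."""
    grouped = records.groupby("patient_hash")
    return pd.DataFrame(
        {field: grouped[field].agg(demographic_status) for field in fields}
    )
```

Counting the `"conflicting"` entries per column of `conflict_table(records, ["race", "sex"])` gives a per-feature tally of conflicts.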

Because of the large number of records, we conducted analyses using Python 3.7 with the pandas and NumPy packages.

3 Results

In total, we received records on 5,831,238 individuals across the three states. We identified 458,680 patients with data at more than one institution. Table 2 describes the demographics for our total study population, and the populations of patients with non-fragmented care, fragmented care, and “snowbirds”. Demographic information that was declined or missing at the point of recording was placed in the same category as demographic information that conflicted across a patient's records. A considerable share of patient race information was conflicting or missing, reaching 44.8% among patients with fragmented care.

Table 2. Demographics of the total study population, patients with non-fragmented care, fragmented care, and “snowbirds”.

3.1 Patients with Care Fragmentation

Patients with care fragmentation were unevenly distributed by State and Institution. The percentage of patients with care fragmentation differed by State, ranging from 4.9% to 11.7% (Table 3).

Table 3. Care fragmentation by State.

The percent of patients with care fragmentation varied by site ranging from 6.1% to 32.7% (Table 4).

Table 4. Fragmentation by care site.

3.2 Data Quality Issues

We identified a significant percentage of records with conflicting demographic information, with the majority of discrepancies for race (Table 5 and Fig. 1).

Table 5. Number of records with conflicting demographic information by feature.

Patients with care fragmentation had conflicting information at a much higher rate than those without care fragmentation (49.1% vs 5.6%, Table 6).

Table 6. Counts and percentage of patients with conflicting information by fragmentation status.

3.3 Geographic Analysis to Characterize “Snowbirds”

The number of patients with a home address in a warmer state (by 3-digit ZIP code) varied by State (Table 7) and by Institution (Table 8).

Table 7. Snowbirds by State

Table 8. Snowbirds by Institution

4 Discussion

We used a previously validated privacy-preserving record linkage method based on generating keyed hashes of patient identifiers to identify the degree of data fragmentation across a sample of HPOs within the AoURP. Data fragmentation varied from 3.6% to 32.7% with the greatest percentage at sites within IL and the more population-dense Chicago-based institutions. Patients with care fragmentation were more likely to be black and slightly older (median birth year 1968 vs 1969). In contrast, patients with the ability to “snowbird” to warmer climes were least likely to be black.

A common problem with linking data across sites is the issue of conflicting data, e.g. one site lists race as “Caucasian” and another site may list race as “unknown”. We identified conflicting demographic information for 49.1% of those patients receiving care at more than one institution. Even in patients who receive care at the same institution, demographic information captured over time had conflicting information 5.6% of the time. Race was the most common demographic feature with conflicting information.

There are several limitations to our study. Our study included only a small number of institutions within each State (those that participate in the AoURP); in the Chicagoland area alone, for example, there are over 40 distinct healthcare institutions. Thus our estimates of data fragmentation are likely to be substantial underestimates. Because we shared only demographic features compliant with HIPAA de-identification criteria, we could not evaluate geographic features more specific than 3-digit ZIP code. Geographic features such as home address are likely to change over time as patients move, or to be collected in non-standardized ways, and could be a common source of conflict across care sites. We defined “snowbirds” as patients having a listed address in the EMR in one of several states with warm winter months. However, many “snowbirds” may list only their local address, so our estimates likely significantly underestimate the size of this population.

Our study demonstrated the utility of a privacy-preserving record linkage tool to characterize care fragmentation across institutions spanning three contiguous States. Our findings are consistent with prior findings that care fragmentation is associated with at-risk populations, and also demonstrate a novel association with a significantly higher proportion of conflicting data. We have ongoing work to analyze the differences in insurance status and diagnoses across the study population and to use study results to guide strategies to capture more comprehensive clinical data for patients enrolled in the All of Us Research Program.

References

1. The OMOP data model. https://www.ohdsi.org/data-standardization/the-common-data-model/. Accessed 14 June 2019
2. Holmgren, A.J., Adler-Milstein, J.: Health information exchange in US hospitals: the current landscape and a path to improved information sharing. J. Hosp. Med. 12(3), 193–198 (2017)
3. Epic Care Everywhere. https://www.epic.com/careeverywhere/. Accessed 14 June 2019
4. Kho, A.N., Cashy, J.P., Jackson, K.L., et al.: Design and implementation of a privacy preserving electronic health record linkage tool in Chicago. J. Am. Med. Inform. Assoc. 22(5), 1072–1080 (2015)
5. Mays, J.A., Jackson, K.L., Derby, T.A., et al.: An evaluation of recurrent diabetic ketoacidosis, fragmentation of care, and mortality across Chicago, Illinois. Diabetes Care 39(10), 1671–1676 (2016)
6. Walunas, T.L., Jackson, K.L., Chung, A.H., et al.: Disease outcomes and care fragmentation among patients with systemic lupus erythematosus. Arthritis Care Res (Hoboken) 69(9), 1369–1376 (2017)
7. Ahmad, F.S., Chan, C., Rosenman, M.B., et al.: Validity of cardiovascular data from electronic sources: The Multi-Ethnic Study of Atherosclerosis and HealthLNK. Circulation 136(13), 1207–1216 (2017)

Acknowledgements

This study was funded under a supplement to NIH award 1 U2C OD023196-01 (All of Us Research Program Data and Research Center). During the study period, authors EM and UT were faculty at the University of Wisconsin, Madison.

Author information

Authors and Affiliations

Northwestern University, Evanston, IL, 60611, USA
Abel N. Kho, Jingzhi Yu, Charon Gladfelter, Margaret Madden & Vesna Mitrovic
University of Illinois at Chicago, Chicago, IL, 60612, USA
Molly Scannell Bryan & Howard S. Gordon
Veterans Affairs Medical Center, Chicago, IL, 60612, USA
Molly Scannell Bryan & Howard S. Gordon
Regenstrief Institute, Indianapolis, IN, 46202, USA
Shaun Grannis, Eneida Mendonca & Umberto Tachinardi
Rush University, Chicago, IL, 60612, USA
Raj Shah
Medical College of Wisconsin, Milwaukee, WI, 53226, USA
Bradley Taylor

Corresponding author

Correspondence to Abel N. Kho.

Editor information

Editors and Affiliations

Institut National des Sciences Appliquées, Rennes, France
Peggy Cellier
Maastricht University, Maastricht, The Netherlands
Kurt Driessens

Ethics declarations

ANK is an advisor to Datavant, Inc., which supports Privacy-Preserving Record Linkage software. Datavant acquired Health Data Link, Inc. which ANK co-founded based on this earlier version of the software.

Cite this paper

Kho, A.N. et al. (2020). Privacy-Preserving Record Linkage to Identify Fragmented Electronic Medical Records in the All of Us Research Program. In: Cellier, P., Driessens, K. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2019. Communications in Computer and Information Science, vol 1168. Springer, Cham. https://doi.org/10.1007/978-3-030-43887-6_7

DOI: https://doi.org/10.1007/978-3-030-43887-6_7
Published: 28 March 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-43886-9
Online ISBN: 978-3-030-43887-6
eBook Packages: Computer Science, Computer Science (R0)
