Abstract
Record linkage expands the analyses that survey microdata can support, but it also increases the risk of disclosures that compromise privacy. One way to mitigate this risk is to replace some of the information added through linkage with synthetic data elements. This paper describes a case study using the National Hospital Care Survey (NHCS), which collects patient records from a sample of U.S. hospitals for statistical analysis under a pledge to protect patient privacy. The NHCS data were linked to the National Death Index (NDI) to enhance the survey with mortality information. The information added through NDI linkage enables survival analyses related to hospitalization, but because it includes dates of death and detailed causes of death, joining it to the patient records increases the risk of patient re-identification (albeit only for deceased persons). For this reason, an approach to developing synthetic data was tested: models from survival analysis replace vital status and actual dates of death with synthetic values, and classification tree analysis replaces actual causes of death with synthesized ones. The degree to which analyses of the synthetic data replicate results from the actual data is measured by comparing survival analysis parameter estimates from the two data files. Because synthetic data have value only to the degree that they yield statistical estimates close to those based on the actual data, this evaluation is an essential first step in assessing the potential utility of synthetic mortality data.
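The abstract does not give the fitted models, so the following is only a minimal sketch of one common way to draw a synthetic vital status and day of death from a proportional-hazards model, namely inverse-transform sampling. The constant baseline hazard, the function name `draw_synthetic_outcome`, the parameter values, and the 365-day follow-up window are illustrative assumptions, not details taken from the paper.

```python
import math
import random


def draw_synthetic_outcome(linear_predictor, baseline_rate, followup_days, rng):
    """Draw (died, day_of_death) for one patient by inverse-transform sampling.

    Assumes a proportional-hazards model with a constant baseline hazard,
    an illustrative simplification rather than the paper's fitted model.
    """
    hazard = baseline_rate * math.exp(linear_predictor)
    u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is always defined
    time_to_death = -math.log(u) / hazard  # solves S(t) = exp(-hazard * t) = u
    if time_to_death <= followup_days:
        return True, math.ceil(time_to_death)
    return False, None  # still alive at the end of follow-up (censored)


# Hypothetical inputs: a linear predictor of 0.56 and a made-up baseline rate.
rng = random.Random(2016)
outcomes = [draw_synthetic_outcome(0.56, 0.0005, 365, rng) for _ in range(10000)]
synthetic_death_rate = sum(died for died, _ in outcomes) / len(outcomes)
print(f"synthetic death rate within follow-up: {synthetic_death_rate:.3f}")
```

Because each synthetic record receives a drawn value rather than the true date of death, the released file carries the model's structure instead of the linked NDI values.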
Availability of data and materials
Researchers who wish to access the linked 2016 NHCS to 2016/2017 NDI file must submit a research proposal to the NCHS Research Data Center (RDC) and have it approved: https://www.cdc.gov/rdc/index.htm.
Code availability
Code for this analysis is available upon request.
Notes
Responding hospitals were those providing records for at least 50 encounters covering at least 6 months of the year.
Sufficient PII is defined as having two of the following three items: a valid date of birth (month, day, and year), a name (first, middle, and last), and a valid-format 9-digit SSN. See (NCHS 2019, p. 5).
Division is coded from the patient's home state according to the U.S. Census Bureau schema. See https://www2.census.gov/geo/pdfs/maps-data/maps/reference/us_regdiv.pdf.
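The sufficient-PII rule in the notes above can be sketched as a small check. This is an illustration only: the function names are hypothetical, "two of three" is read as "at least two", a name is assumed to count only when first, middle, and last are all non-blank, and "valid format" for an SSN is assumed to mean nine digits once hyphens are removed. The linkage methodology report (NCHS 2019) is the authority on the actual validity rules.

```python
import re
from datetime import date


def valid_birth_date(year, month, day):
    """True when month, day, and year form a real calendar date."""
    try:
        date(year, month, day)
    except (TypeError, ValueError):
        return False
    return True


def valid_name(first, middle, last):
    """Assumption: a name counts only when all three parts are non-blank."""
    return all(isinstance(part, str) and part.strip() for part in (first, middle, last))


def valid_ssn_format(ssn):
    """Assumption: 'valid format' means exactly nine digits, hyphens ignored."""
    return isinstance(ssn, str) and re.fullmatch(r"\d{9}", ssn.replace("-", "")) is not None


def has_sufficient_pii(birth, name, ssn):
    """birth is (year, month, day); name is (first, middle, last); ssn is str or None."""
    items_present = [valid_birth_date(*birth), valid_name(*name), valid_ssn_format(ssn)]
    return sum(items_present) >= 2  # assumption: "two of three" means at least two


print(has_sufficient_pii((1950, 2, 28), ("Ann", "B", "Lee"), None))
```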
References
Breiman, L., et al.: Classification and Regression Trees. CRC Press, Boca Raton (1984)
Breslow, N.E.: Discussion of the paper by D. R. Cox. J. R. Stat. Soc. B 34, 216–217 (1972)
Charlson, M., et al.: A new method of classifying prognostic comorbidity in longitudinal studies: development and validation. J. Chronic Dis. 40(5), 373–383 (1987)
Cox, D.R.: Regression models and life-tables (with discussion). J. R. Stat. Soc. B 34(2), 187–220 (1972)
Fiscella, K., Fremont, A.M.: Use of geocoding and surname analysis to estimate race and ethnicity. Health Serv. Res. 41(4), 1482–1500 (2006)
Landis, J.R., Koch, G.G.: The measurement of observer agreement for categorical data. Biometrics 33, 159–174 (1977)
Lipkovich, I., Ratitch, B., O’Kelly, M.: Sensitivity to censored-at-random assumption in the analysis of time-to-event endpoints. Pharm. Stat. 15(3), 216–229 (2016)
National Center for Health Statistics. National Death Index (NDI). http://www.cdc.gov/nchs/ndi/index.htm
National Center for Health Statistics. Division of Analysis and Epidemiology. The Linkage of the 2016 National Hospital Care Survey to the 2016/2017 National Death Index: Methodology Overview and Analytic Considerations, August 2019. Hyattsville, Maryland. https://www.cdc.gov/nchs/data/datalinkage/NHCS16_NDI16_17_Methodology_Analytic_Consider.pdf
National Center for Health Statistics. National Hospital Care Survey (NHCS). http://www.cdc.gov/nchs/dhcs/index.htm
SAS HPSPLIT documentation. https://documentation.sas.com/?docsetId=stathpug&docsetTarget=stathpug_hpsplit_syntax01.htm&docsetVersion=15.1&locale=en
SAS PHREG documentation. https://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#phreg_toc.htm
Sundararajan, V., et al.: New ICD-10 version of the Charlson comorbidity index predicted in-hospital mortality. J. Clin. Epidemiol. 57(12), 1288–1294 (2004)
Funding
This work was supported in part with funding from the Department of Health and Human Services’ Office of the Secretary Patient Centered Outcomes Research Trust Fund (OS-PCORTF).
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to report.
Appendix
NHCS 2016 Survival Estimates
Comparison of Actual to Synthetic Data
Comparison for All Deaths
# | Variable | Estimate (Actual) | Estimate (Synth.) | Std. Error (Actual) | Std. Error (Synth.) | Pr > ChiSq (Actual) | Pr > ChiSq (Synth.) |
---|---|---|---|---|---|---|---|
Age group (rounded to nearest 5-year, reference category: 70 years old) | |||||||
1 | 0 | − 3.37 | − 3.34 | 0.036 | 0.035 | < 0.0001 | < 0.0001 |
2 | 5 | − 4.12 | − 4.07 | 0.059 | 0.057 | < 0.0001 | < 0.0001 |
3 | 10 | − 3.97 | − 3.95 | 0.063 | 0.062 | < 0.0001 | < 0.0001 |
4 | 15 | − 3.47 | − 3.41 | 0.048 | 0.047 | < 0.0001 | < 0.0001 |
5 | 20 | − 2.75 | − 2.78 | 0.032 | 0.032 | < 0.0001 | < 0.0001 |
6 | 25 | − 2.34 | − 2.33 | 0.025 | 0.025 | < 0.0001 | < 0.0001 |
7 | 30 | − 2.13 | − 2.15 | 0.023 | 0.023 | < 0.0001 | < 0.0001 |
8 | 35 | − 1.79 | − 1.80 | 0.021 | 0.021 | < 0.0001 | < 0.0001 |
9 | 40 | − 1.48 | − 1.50 | 0.019 | 0.019 | < 0.0001 | < 0.0001 |
10 | 45 | − 1.19 | − 1.16 | 0.017 | 0.016 | < 0.0001 | < 0.0001 |
11 | 50 | − 0.90 | − 0.87 | 0.014 | 0.014 | < 0.0001 | < 0.0001 |
12 | 55 | − 0.62 | − 0.62 | 0.012 | 0.012 | < 0.0001 | < 0.0001 |
13 | 60 | − 0.38 | − 0.39 | 0.011 | 0.011 | < 0.0001 | < 0.0001 |
14 | 65 | − 0.21 | − 0.22 | 0.011 | 0.011 | < 0.0001 | < 0.0001 |
15 | 75 | 0.21 | 0.21 | 0.010 | 0.010 | < 0.0001 | < 0.0001 |
16 | 80 | 0.48 | 0.47 | 0.010 | 0.010 | < 0.0001 | < 0.0001 |
17 | 85 | 0.79 | 0.78 | 0.010 | 0.010 | < 0.0001 | < 0.0001 |
18 | 90 | 1.12 | 1.11 | 0.011 | 0.011 | < 0.0001 | < 0.0001 |
19 | 95 | 1.52 | 1.50 | 0.012 | 0.013 | < 0.0001 | < 0.0001 |
20 | Age missing | 0.26 | 0.15 | 0.041 | 0.043 | < 0.0001 | 0.0004 |
Conditions present (reference category: condition not present) | |||||||
21 | Myocardial Infarction | 0.03 | 0.03 | 0.016 | 0.016 | 0.0804 | 0.0309 |
22 | Diabetes without complications | − 0.31 | − 0.33 | 0.023 | 0.023 | < 0.0001 | < 0.0001 |
23 | Diabetes with complications | − 0.03 | − 0.06 | 0.036 | 0.037 | 0.3462 | 0.1008 |
24 | Paraplegia and Hemiplegia | 0.33 | 0.31 | 0.021 | 0.022 | < 0.0001 | < 0.0001 |
25 | Renal Disease | 0.06 | 0.06 | 0.014 | 0.015 | < 0.0001 | < 0.0001 |
26 | Cancer | 0.56 | 0.54 | 0.015 | 0.016 | < 0.0001 | < 0.0001 |
27 | Moderate or Severe Liver Disease | 0.59 | 0.58 | 0.076 | 0.076 | < 0.0001 | < 0.0001 |
28 | Metastatic Carcinoma | 1.30 | 1.30 | 0.029 | 0.031 | < 0.0001 | < 0.0001 |
29 | AIDS/HIV | − 0.08 | − 0.12 | 0.045 | 0.047 | 0.0598 | 0.0081 |
30 | Congestive Heart Failure | 0.39 | 0.38 | 0.009 | 0.010 | < 0.0001 | < 0.0001 |
31 | Peripheral Vascular Disease | − 0.15 | − 0.15 | 0.016 | 0.016 | < 0.0001 | < 0.0001 |
32 | Cerebrovascular Disease | − 0.05 | − 0.06 | 0.012 | 0.012 | 0.0001 | < 0.0001 |
33 | Dementia | 0.50 | 0.48 | 0.010 | 0.011 | < 0.0001 | < 0.0001 |
34 | Chronic Pulmonary Disease | 0.03 | 0.03 | 0.009 | 0.009 | 0.0002 | 0.0013 |
35 | Connective Tissue Disease-Rheumatic Disease | − 0.05 | − 0.03 | 0.020 | 0.020 | 0.0203 | 0.1004 |
36 | Peptic Ulcer Disease | − 0.06 | − 0.03 | 0.025 | 0.025 | 0.0189 | 0.2764 |
37 | Mild Liver Disease | 0.82 | 0.82 | 0.016 | 0.017 | < 0.0001 | < 0.0001 |
Charlson Index Summary (reference category: 1) | |||||||
38 | 0 | − 0.64 | − 0.65 | 0.010 | 0.010 | < 0.0001 | < 0.0001 |
39 | 2 | 0.38 | 0.37 | 0.011 | 0.011 | < 0.0001 | < 0.0001 |
40 | 3 | 0.59 | 0.58 | 0.016 | 0.017 | < 0.0001 | < 0.0001 |
41 | 4 | 0.78 | 0.77 | 0.022 | 0.024 | < 0.0001 | < 0.0001 |
42 | 5 | 0.85 | 0.84 | 0.030 | 0.032 | < 0.0001 | < 0.0001 |
43 | 6 | 0.62 | 0.59 | 0.040 | 0.043 | < 0.0001 | < 0.0001 |
Imputed Race/Ethnicity (reference category: White) | |||||||
44 | Asian | − 0.21 | − 0.20 | 0.017 | 0.017 | < 0.0001 | < 0.0001 |
45 | Black | − 0.04 | − 0.04 | 0.007 | 0.007 | < 0.0001 | < 0.0001 |
46 | Hispanic | − 0.24 | − 0.23 | 0.009 | 0.009 | < 0.0001 | < 0.0001 |
47 | Amer. Ind | 0.06 | 0.14 | 0.070 | 0.070 | 0.3613 | 0.0505 |
48 | Sex: Female (reference category: Male) | − 0.27 | − 0.26 | 0.005 | 0.005 | < 0.0001 | < 0.0001 |
Division (reference category: Mid-Atlantic) | |||||||
49 | New England | − 0.05 | − 0.03 | 0.018 | 0.018 | 0.0128 | 0.1139 |
50 | East North Central | 0.12 | 0.12 | 0.009 | 0.009 | < 0.0001 | < 0.0001 |
51 | West North Central | 0.10 | 0.10 | 0.013 | 0.013 | < 0.0001 | < 0.0001 |
52 | South Atlantic | 0.21 | 0.22 | 0.009 | 0.009 | < 0.0001 | < 0.0001 |
53 | East South Central | 0.17 | 0.19 | 0.011 | 0.011 | < 0.0001 | < 0.0001 |
54 | West South Central | 0.89 | 0.90 | 0.012 | 0.012 | < 0.0001 | < 0.0001 |
55 | Mountain | 0.18 | 0.19 | 0.013 | 0.013 | < 0.0001 | < 0.0001 |
56 | Pacific | 0.40 | 0.41 | 0.010 | 0.010 | < 0.0001 | < 0.0001 |
Hospital bed size (reference category: > 500) | |||||||
57 | 0–25 | − 0.37 | − 0.38 | 0.038 | 0.038 | < 0.0001 | < 0.0001 |
58 | 26–100 | − 0.19 | − 0.19 | 0.017 | 0.017 | < 0.0001 | < 0.0001 |
59 | 101–500 | − 0.08 | − 0.09 | 0.006 | 0.006 | < 0.0001 | < 0.0001 |
Hospital ownership (reference category: non-profit) | |||||||
60 | For profit | − 0.55 | − 0.54 | 0.019 | 0.019 | < 0.0001 | < 0.0001 |
61 | Government | − 0.03 | − 0.03 | 0.009 | 0.009 | 0.0005 | 0.0043 |
Hospital type (reference category: general acute) | |||||||
62 | Children’s | − 0.14 | − 0.16 | 0.046 | 0.045 | 0.0027 | 0.0004 |
63 | Psychiatric | 0.37 | 0.35 | 0.024 | 0.024 | < 0.0001 | < 0.0001 |
64 | Long-term acute care, rehab., etc | 0.17 | 0.20 | 0.026 | 0.027 | < 0.0001 | < 0.0001 |
Hospital urban rural classification (reference category: large central metropolitan) | |||||||
65 | Large Fringe Metropolitan | − 0.10 | − 0.09 | 0.008 | 0.008 | < 0.0001 | < 0.0001 |
66 | Medium Metropolitan | − 0.05 | − 0.05 | 0.007 | 0.007 | < 0.0001 | < 0.0001 |
67 | Small Metropolitan | 0.25 | 0.22 | 0.011 | 0.011 | < 0.0001 | < 0.0001 |
68 | Micropolitan | 0.19 | 0.20 | 0.013 | 0.013 | < 0.0001 | < 0.0001 |
69 | Non-core | 0.10 | 0.12 | 0.045 | 0.045 | 0.0218 | 0.0078 |
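One way to summarize how closely the synthetic estimates in the table above track the actual ones is a confidence-interval overlap measure, a utility metric often used for synthetic data. The paper does not state that it uses this metric; the sketch below is an illustrative add-on, with the function names and the choice of 95% Wald intervals as assumptions. The three rows are copied from the table.

```python
Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def wald_interval(estimate, std_err, z=Z_95):
    """Return the (lower, upper) Wald confidence limits."""
    return estimate - z * std_err, estimate + z * std_err


def interval_overlap(est_actual, se_actual, est_synth, se_synth):
    """Average share of each interval covered by the other (1.0 = identical)."""
    lo_a, hi_a = wald_interval(est_actual, se_actual)
    lo_s, hi_s = wald_interval(est_synth, se_synth)
    shared = min(hi_a, hi_s) - max(lo_a, lo_s)  # negative when intervals are disjoint
    return 0.5 * (shared / (hi_a - lo_a) + shared / (hi_s - lo_s))


# (estimate, synthetic estimate, std. error, synthetic std. error) from the table
rows = {
    "Cancer": (0.56, 0.54, 0.015, 0.016),
    "Age missing": (0.26, 0.15, 0.041, 0.043),
    "Peptic Ulcer Disease": (-0.06, -0.03, 0.025, 0.025),
}

overlaps = {
    name: interval_overlap(est, se, est_s, se_s)
    for name, (est, est_s, se, se_s) in rows.items()
}
for name, value in overlaps.items():
    difference = rows[name][1] - rows[name][0]
    print(f"{name}: difference = {difference:+.2f}, interval overlap = {value:.2f}")
```

Values near 1 indicate that an analyst would draw nearly the same inference from either file; lower values flag coefficients, such as the one for missing age, where the synthetic data diverge more.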
Cite this article
Resnick, D.M., Cox, C.S. & Mirel, L.B. Using synthetic data to replace linkage derived elements: a case study. Health Serv Outcomes Res Method 21, 389–406 (2021). https://doi.org/10.1007/s10742-021-00241-z
DOI: https://doi.org/10.1007/s10742-021-00241-z