Using synthetic data to replace linkage derived elements: a case study

Health Services and Outcomes Research Methodology

Abstract

While record linkage can expand the analyses that can be performed on survey microdata, it also increases the risk of privacy-encroaching disclosure. One way to mitigate this risk is to replace some of the information added through linkage with synthetic data elements. This paper describes a case study using the National Hospital Care Survey (NHCS), which collects patient records from a sample of U.S. hospitals for statistical analysis under a pledge to protect patient privacy. The NHCS data were linked to the National Death Index (NDI) to enhance the survey with mortality information. The information added by the NDI linkage enables survival analyses related to hospitalization, but because it includes dates of death and detailed causes of death, joining it to the patient records increases the risk of patient re-identification (albeit only for deceased persons). For this reason, an approach to developing synthetic data was tested: models from survival analysis replace vital status and actual dates of death with synthetic values, and classification tree analysis replaces actual causes of death with synthesized ones. The degree to which analyses of the synthetic data replicate results from the actual data is measured by comparing survival analysis parameter estimates from both data files. Because synthetic data have value only to the degree that they can be used to produce statistical estimates like those based on the actual data, this evaluation is an essential first step in assessing the potential utility of synthetic mortality data.
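As a rough illustration of how a survival model can generate synthetic vital status and time to death, the sketch below draws from an exponential proportional-hazards model by inverse-transform sampling and censors at the end of follow-up. The abstract does not state which survival model the authors fitted, so the exponential form, the function name `synthesize_vital_status`, and every parameter value here are assumptions chosen only for illustration.

```python
import math
import random


def synthesize_vital_status(linear_predictor, baseline_rate, followup_days, rng):
    """Draw a synthetic (died, days_to_death) pair.

    Assumes an exponential proportional-hazards model (an illustrative
    choice, not necessarily the model used in the paper).
    """
    hazard = baseline_rate * math.exp(linear_predictor)
    # Inverse-transform sampling: T = -ln(U) / hazard with U in (0, 1].
    u = 1.0 - rng.random()
    days = -math.log(u) / hazard
    if days > followup_days:
        # Synthetic record is treated as alive at the end of follow-up.
        return False, None
    return True, days


# Hypothetical inputs: one covariate profile, a made-up baseline rate,
# and one year of follow-up.
rng = random.Random(2016)
draws = [synthesize_vital_status(0.5, 0.0005, 365, rng) for _ in range(10_000)]
synthetic_death_share = sum(died for died, _ in draws) / len(draws)

# Death probability implied by the same model, for comparison.
expected_share = 1 - math.exp(-0.0005 * math.exp(0.5) * 365)
```

With enough draws, `synthetic_death_share` should sit close to `expected_share`, which is the kind of agreement between model and synthetic output that the evaluation in the paper examines through parameter estimates.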

Availability of data and materials

Researchers who wish to access the linked 2016 NHCS to 2016/2017 NDI file must submit a research proposal to the NCHS Research Data Center (RDC) and have it approved: https://www.cdc.gov/rdc/index.htm.

Code availability

Code for this analysis is available upon request.

Notes

  1. Responding hospitals were those providing records for at least 50 encounters covering at least 6 months of the year.

  2. Sufficient PII is defined as having two of the following three items: a valid date of birth (month, day, and year), a name (first, middle, and last), or a validly formatted 9-digit SSN. See (NCHS 2019, p. 5).

  3. Division is coded from the patient's home state using the U.S. Census Bureau schema. See https://www2.census.gov/geo/pdfs/maps-data/maps/reference/us_regdiv.pdf.
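The eligibility rules in notes 1 and 2 can be sketched as two small predicates. This is only one reading of the notes: "two of the following three" is interpreted as at least two, and a "valid format" SSN is assumed to mean exactly nine digits once hyphens are removed. The function and parameter names are hypothetical and do not come from NCHS documentation.

```python
import re


def is_responding_hospital(encounter_count, months_covered):
    """Note 1: at least 50 encounters covering at least 6 months of the year."""
    return encounter_count >= 50 and months_covered >= 6


def has_sufficient_pii(has_valid_dob, has_full_name, ssn):
    """Note 2: at least two of valid date of birth, full name, valid-format SSN.

    The SSN format check (nine digits, hyphens ignored) is an assumption.
    """
    digits = (ssn or "").replace("-", "")
    has_valid_ssn = re.fullmatch(r"\d{9}", digits) is not None
    return sum([has_valid_dob, has_full_name, has_valid_ssn]) >= 2
```

For example, a record with a valid date of birth and a nine-digit SSN but no full name would pass `has_sufficient_pii`, while a record with only a date of birth would not.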

Funding

This work was supported in part with funding from the Department of Health and Human Services’ Office of the Secretary Patient Centered Outcomes Research Trust Fund (OS-PCORTF).

Author information

Corresponding author

Correspondence to Dean M. Resnick.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest to report.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

NHCS 2016 Survival Estimates

Comparison of Actual to Synthetic Data

Comparison for All Deaths

| # | Variable | Estimate | Estimate (Synth.) | StdErr | StdErr (Synth.) | ProbChiSq | ProbChiSq (Synth.) |
|---|---|---|---|---|---|---|---|
| | Age group (rounded to nearest 5-year, reference category: 70 years old) | | | | | | |
| 1 | 0 | −3.37 | −3.34 | 0.036 | 0.035 | <0.0001 | <0.0001 |
| 2 | 5 | −4.12 | −4.07 | 0.059 | 0.057 | <0.0001 | <0.0001 |
| 3 | 10 | −3.97 | −3.95 | 0.063 | 0.062 | <0.0001 | <0.0001 |
| 4 | 15 | −3.47 | −3.41 | 0.048 | 0.047 | <0.0001 | <0.0001 |
| 5 | 20 | −2.75 | −2.78 | 0.032 | 0.032 | <0.0001 | <0.0001 |
| 6 | 25 | −2.34 | −2.33 | 0.025 | 0.025 | <0.0001 | <0.0001 |
| 7 | 30 | −2.13 | −2.15 | 0.023 | 0.023 | <0.0001 | <0.0001 |
| 8 | 35 | −1.79 | −1.80 | 0.021 | 0.021 | <0.0001 | <0.0001 |
| 9 | 40 | −1.48 | −1.50 | 0.019 | 0.019 | <0.0001 | <0.0001 |
| 10 | 45 | −1.19 | −1.16 | 0.017 | 0.016 | <0.0001 | <0.0001 |
| 11 | 50 | −0.90 | −0.87 | 0.014 | 0.014 | <0.0001 | <0.0001 |
| 12 | 55 | −0.62 | −0.62 | 0.012 | 0.012 | <0.0001 | <0.0001 |
| 13 | 60 | −0.38 | −0.39 | 0.011 | 0.011 | <0.0001 | <0.0001 |
| 14 | 65 | −0.21 | −0.22 | 0.011 | 0.011 | <0.0001 | <0.0001 |
| 15 | 75 | 0.21 | 0.21 | 0.010 | 0.010 | <0.0001 | <0.0001 |
| 16 | 80 | 0.48 | 0.47 | 0.010 | 0.010 | <0.0001 | <0.0001 |
| 17 | 85 | 0.79 | 0.78 | 0.010 | 0.010 | <0.0001 | <0.0001 |
| 18 | 90 | 1.12 | 1.11 | 0.011 | 0.011 | <0.0001 | <0.0001 |
| 19 | 95 | 1.52 | 1.50 | 0.012 | 0.013 | <0.0001 | <0.0001 |
| 20 | Age missing | 0.26 | 0.15 | 0.041 | 0.043 | <0.0001 | 0.0004 |
| | Conditions not present (reference category: condition not present) | | | | | | |
| 21 | Myocardial Infarction | 0.03 | 0.03 | 0.016 | 0.016 | 0.0804 | 0.0309 |
| 22 | Diabetes without complications | −0.31 | −0.33 | 0.023 | 0.023 | <0.0001 | <0.0001 |
| 23 | Diabetes with complications | −0.03 | −0.06 | 0.036 | 0.037 | 0.3462 | 0.1008 |
| 24 | Paraplegia and Hemiplegia | 0.33 | 0.31 | 0.021 | 0.022 | <0.0001 | <0.0001 |
| 25 | Renal Disease | 0.06 | 0.06 | 0.014 | 0.015 | <0.0001 | <0.0001 |
| 26 | Cancer | 0.56 | 0.54 | 0.015 | 0.016 | <0.0001 | <0.0001 |
| 27 | Moderate or Severe Liver Disease | 0.59 | 0.58 | 0.076 | 0.076 | <0.0001 | <0.0001 |
| 28 | Metastatic Carcinoma | 1.30 | 1.30 | 0.029 | 0.031 | <0.0001 | <0.0001 |
| 29 | AIDS/HIV | −0.08 | −0.12 | 0.045 | 0.047 | 0.0598 | 0.0081 |
| 30 | Congestive Heart Failure | 0.39 | 0.38 | 0.009 | 0.010 | <0.0001 | <0.0001 |
| 31 | Peripheral Vascular Disease | −0.15 | −0.15 | 0.016 | 0.016 | <0.0001 | <0.0001 |
| 32 | Cerebrovascular Disease | −0.05 | −0.06 | 0.012 | 0.012 | 0.0001 | <0.0001 |
| 33 | Dementia | 0.50 | 0.48 | 0.010 | 0.011 | <0.0001 | <0.0001 |
| 34 | Chronic Pulmonary Disease | 0.03 | 0.03 | 0.009 | 0.009 | 0.0002 | 0.0013 |
| 35 | Connective Tissue Disease-Rheumatic Disease | −0.05 | −0.03 | 0.020 | 0.020 | 0.0203 | 0.1004 |
| 36 | Peptic Ulcer Disease | −0.06 | −0.03 | 0.025 | 0.025 | 0.0189 | 0.2764 |
| 37 | Mild Liver Disease | 0.82 | 0.82 | 0.016 | 0.017 | <0.0001 | <0.0001 |
| | Charlson Index Summary (reference category: 1) | | | | | | |
| 38 | 0 | −0.64 | −0.65 | 0.010 | 0.010 | <0.0001 | <0.0001 |
| 39 | 2 | 0.38 | 0.37 | 0.011 | 0.011 | <0.0001 | <0.0001 |
| 40 | 3 | 0.59 | 0.58 | 0.016 | 0.017 | <0.0001 | <0.0001 |
| 41 | 4 | 0.78 | 0.77 | 0.022 | 0.024 | <0.0001 | <0.0001 |
| 42 | 5 | 0.85 | 0.84 | 0.030 | 0.032 | <0.0001 | <0.0001 |
| 43 | 6 | 0.62 | 0.59 | 0.040 | 0.043 | <0.0001 | <0.0001 |
| | Imputed Race/Ethnicity (reference category: White) | | | | | | |
| 44 | Asian | −0.21 | −0.20 | 0.017 | 0.017 | <0.0001 | <0.0001 |
| 45 | Black | −0.04 | −0.04 | 0.007 | 0.007 | <0.0001 | <0.0001 |
| 46 | Hispanic | −0.24 | −0.23 | 0.009 | 0.009 | <0.0001 | <0.0001 |
| 47 | Amer. Ind | 0.06 | 0.14 | 0.070 | 0.070 | 0.3613 | 0.0505 |
| 48 | Sex: Female (reference category: Male) | −0.27 | −0.26 | 0.005 | 0.005 | <0.0001 | <0.0001 |
| | Division (reference category: Mid-Atlantic) | | | | | | |
| 49 | Northeast | −0.05 | −0.03 | 0.018 | 0.018 | 0.0128 | 0.1139 |
| 50 | East North Central | 0.12 | 0.12 | 0.009 | 0.009 | <0.0001 | <0.0001 |
| 51 | West North Central | 0.10 | 0.10 | 0.013 | 0.013 | <0.0001 | <0.0001 |
| 52 | South Atlantic | 0.21 | 0.22 | 0.009 | 0.009 | <0.0001 | <0.0001 |
| 53 | East South Central | 0.17 | 0.19 | 0.011 | 0.011 | <0.0001 | <0.0001 |
| 54 | West South Central | 0.89 | 0.90 | 0.012 | 0.012 | <0.0001 | <0.0001 |
| 55 | Mountain | 0.18 | 0.19 | 0.013 | 0.013 | <0.0001 | <0.0001 |
| 56 | Pacific | 0.40 | 0.41 | 0.010 | 0.010 | <0.0001 | <0.0001 |
| | Hospital bed size (reference category: > 500) | | | | | | |
| 57 | 0–25 | −0.37 | −0.38 | 0.038 | 0.038 | <0.0001 | <0.0001 |
| 58 | 26–100 | −0.19 | −0.19 | 0.017 | 0.017 | <0.0001 | <0.0001 |
| 59 | 101–500 | −0.08 | −0.09 | 0.006 | 0.006 | <0.0001 | <0.0001 |
| | Hospital ownership (reference category: non-profit) | | | | | | |
| 60 | For profit | −0.55 | −0.54 | 0.019 | 0.019 | <0.0001 | <0.0001 |
| 61 | Government | −0.03 | −0.03 | 0.009 | 0.009 | 0.0005 | 0.0043 |
| | Hospital type (reference category: general acute) | | | | | | |
| 62 | Children’s | −0.14 | −0.16 | 0.046 | 0.045 | 0.0027 | 0.0004 |
| 63 | Psychiatric | 0.37 | 0.35 | 0.024 | 0.024 | <0.0001 | <0.0001 |
| 64 | Long-term acute care, rehab., etc. | 0.17 | 0.20 | 0.026 | 0.027 | <0.0001 | <0.0001 |
| | Hospital urban rural classification (reference category: large central metropolitan) | | | | | | |
| 65 | Large Fringe Metropolitan | −0.10 | −0.09 | 0.008 | 0.008 | <0.0001 | <0.0001 |
| 66 | Medium Metropolitan | −0.05 | −0.05 | 0.007 | 0.007 | <0.0001 | <0.0001 |
| 67 | Small Metropolitan | 0.25 | 0.22 | 0.011 | 0.011 | <0.0001 | <0.0001 |
| 68 | Micropolitan | 0.19 | 0.20 | 0.013 | 0.013 | <0.0001 | <0.0001 |
| 69 | Non-core | 0.10 | 0.12 | 0.045 | 0.045 | 0.0218 | 0.0078 |
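One simple way to read a comparison like the one in this table is to express each synthetic-minus-actual difference in units of the actual-data standard error. The sketch below does this for four rows copied from the table. The metric, the cutoff of 2, and the names `standardized_gap` and `flagged` are illustrative assumptions and are not the evaluation criterion used in the paper. The two estimates come from related files, so the ratio is a descriptive gauge rather than a formal test.

```python
# (variable, actual estimate, synthetic estimate, actual standard error),
# copied from rows 20, 28, 47 and 54 of the table.
rows = [
    ("Age missing", 0.26, 0.15, 0.041),
    ("Metastatic Carcinoma", 1.30, 1.30, 0.029),
    ("Amer. Ind", 0.06, 0.14, 0.070),
    ("West South Central", 0.89, 0.90, 0.012),
]


def standardized_gap(actual, synthetic, std_err):
    """Difference between synthetic and actual estimates, in actual-SE units."""
    return (synthetic - actual) / std_err


gaps = {name: standardized_gap(a, s, se) for name, a, s, se in rows}

# Hypothetical cutoff: flag variables whose gap exceeds two standard errors.
flagged = [name for name, gap in gaps.items() if abs(gap) > 2]

for name, gap in gaps.items():
    print(f"{name}: {gap:+.2f} SE")
```

Under this assumed cutoff, only the age-missing category stands out among the four rows, which matches the visibly larger gap between its two estimates in the table.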

About this article

Cite this article

Resnick, D.M., Cox, C.S. & Mirel, L.B. Using synthetic data to replace linkage derived elements: a case study. Health Serv Outcomes Res Method 21, 389–406 (2021). https://doi.org/10.1007/s10742-021-00241-z

  • DOI: https://doi.org/10.1007/s10742-021-00241-z
