Introduction

Diagnostic tools based on artificial intelligence (AI) have recently demonstrated human-like performance1,2,3,4, owing much of this performance to their ability to synthesize information from many features. Consistent with this observation, national initiatives such as the Precision Medicine Initiative5 and the Learning Health Systems6 encourage the inclusion of a wide range of information about the patient in the decision-making process. Increasingly, clinical decision support systems include treatment planning and selection tools7. Such tools require causal knowledge, not merely associations (correlations). Intervening on correlates rather than causal factors of the disease leads to lack of efficacy, under- or overtreatment, and in the worst case, iatrogenic harm8.

The gold standard for discovering causal relationships is conducting a randomized clinical trial or elucidating the underlying biochemical pathways. In many cases, clinical trials are impractical or unethical, if not outright impossible. Computational causal structure discovery (CSD) methods have demonstrated great success in many domains9,10,11, and their application to EHR data could offer a solution for causal discovery from observational real-world medical data. However, to unlock their full potential, these general-purpose algorithms need to be adapted to address the study design and data quality challenges specific to EHR data.

We propose an algorithm with three adaptations. First, we incorporate study design considerations. EHR data as it exists in the system does not follow any study design. Billing codes in particular are recorded for reimbursement purposes and do not distinguish between new incidences and pre-existing conditions. Understanding this difference is critical for study design. Second, time stamps can be unreliable. The time stamp of a diagnosis often does not coincide with the onset time of the disease, but rather reflects the documentation time. In some cases, the temporal ordering of diseases may be reversed. Partly for this reason, general purpose CSD algorithms applied to the EHR data occasionally report “causal” relationships that are in the opposite direction of the natural disease progression. Third, general-purpose CSD methods sometimes fail to orient edges. Even when a clear causal direction exists and is not masked by data artifacts, CSD algorithms can have difficulty distinguishing the cause from the effect due to statistical equivalence12. Leveraging the longitudinal nature of EHR data and incorporating time information as part of the causal discovery process can enhance the identification of edge orientation.

In this paper, (1) we propose a data transformation procedure that distinguishes new incidences from pre-existing conditions, allowing us to determine the temporal order of disease-related events despite the inaccurate (or rather noisy) timestamps in the EHR data. (2) We then present modifications to an existing CSD method, (Fast) Greedy Equivalence Search (GES)13,14, so that it infers the direction of causal relationships more robustly using longitudinal information and takes the above study design considerations into account.

We demonstrate this methodology through the clinical example of type-2 diabetes mellitus (T2D), its risk factors, and its complications. T2D is an exceptionally well-studied disease with numerous clinical trials having produced a vast knowledge base, making the evaluation of the methodology possible. The goal of this work is not to uncover new causal relationships in diabetes, but to present a novel methodology for discovering causal relationships from EHR data that is sufficiently robust to support model development for clinical decision support tools. While we use T2D as our use case, we expect our methods to generalize to other diseases, typically chronic diseases, that exhibit similar characteristics and suffer from the same EHR shortcomings.

Methods

Study source and population

This retrospective cohort study utilized EHR data sets from two independent health systems, Mayo Clinic (MC) in Rochester, Minnesota and M Health Fairview (FV) in Minneapolis, Minnesota. Two 2-year time windows were defined: 2003–2004 and 2006–2007 for MC; and 2008–2009 and 2011–2012 for FV. Dates for the time windows differed between MC and FV due to data availability. We extracted diagnoses, prescriptions, laboratory results, and vital signs from the two EHR data sets using the same inclusion and exclusion criteria: patients must have had at least two blood pressure measurements, one before the first time window and one after the second time window; must have been aged 18+ at the end of the first time window; and their sex and age must be known. Figure 1A shows an overview of the study design for the MC EHR (the study design for FV is similar). We used the MC EHR as the development cohort.

Figure 1
figure 1

Study design and evaluations. (A) Overview of the study design for Mayo Clinic (MC) EHR. (B) The workflow of the internal evaluation. Three methods FGES + raw, FGES + transf, and the proposed algorithm were compared using stability, precision, and recall. Orange color highlights the proposed method (Method 3). (C) The workflow of external comparison. The proposed method was applied to two datasets, MC and M Health Fairview (FV), and the resulting graphs were compared.

Variables

Diagnosis codes were aggregated into the disease categories of obesity, hyperlipidemia, pre-diabetes, type 2 diabetes mellitus, coronary artery disease, myocardial infarction, heart failure, chronic renal failure, cerebrovascular disease, and stroke based on ICD-9 codes, following our previous work15. Medications indicated for the above conditions were rolled up into NDF-RT therapeutic subclasses. Relevant laboratory results and vital signs were categorized based on cutoffs from the American Diabetes Association guidelines16.

Causal structure discovery

A relationship between two events is causal if manipulating the earlier event causes the other (later) event to change. For example, prescribing a medication reduces the probability of downstream events (complications). Causation differs from association. For example, blood sugar is associated with risk of stroke: diabetic patients with higher blood sugar have a higher risk of stroke; however, this relationship is likely not causal in diabetic patients, since attempts to reduce the risk of stroke by reducing blood sugar have consistently failed in clinical trials17,18. If two events share a common cause (a confounder) and are not otherwise causally related, then manipulating one event will not affect the other as long as the common cause remains unchanged. The confounder can be observed or latent. The term causal structure refers to the set of all existing causal relationships among all events and can be visualized as a graph. The causal graph consists of nodes, which correspond to events, connected by edges that denote causal relationships. General-purpose CSD methods are designed to work with observational data to derive a causal structure that is consistent with the joint probability of the data.

Several general-purpose CSD algorithms have been proposed; the interested reader is referred to Supplement II, where we present an overview of the major methods. In this work, we focus on (Fast) Greedy Equivalence Search (FGES) as the comparison method, because we previously found it to outperform other CSD methods19. Briefly, FGES finds the optimal causal graph by a greedy search, guided by a goodness-of-fit score (e.g., BIC or BDeu), over all possible graphs. Specifically, it starts with an empty graph and iteratively adds the individual edge that maximizes the score given the current graph, until adding edges no longer improves the score. Then, FGES iteratively removes the individual edge whose removal maximizes the score, until edge removal ceases to improve the score. The output of FGES is a pattern, which can contain undirected edges where the causal direction could not be determined due to statistical equivalence. FGES has good mathematical properties and has been shown to be consistent under a set of assumptions14,20.
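The forward and backward phases described above can be illustrated with a minimal sketch. This is not the FGES implementation used in this work: real FGES searches over equivalence classes of graphs and uses score caching for speed, whereas the sketch below greedily searches over DAGs directly, scoring each node with a conditional frequency table plus a BIC-style penalty. All variable names and the toy scoring function are illustrative assumptions.

```python
import math

def local_bic(child, parents, data):
    """BIC contribution of one node: log-likelihood of the child's
    conditional frequency table given its parents, minus a complexity
    penalty. `data` is a list of dicts mapping variable name -> 0/1."""
    n = len(data)
    counts = {}
    for row in data:
        key = tuple(row[p] for p in parents)
        c = counts.setdefault(key, [0, 0])
        c[row[child]] += 1
    ll = 0.0
    for c0, c1 in counts.values():
        for c in (c0, c1):
            if c:
                ll += c * math.log(c / (c0 + c1))
    n_params = 2 ** len(parents)  # one Bernoulli parameter per parent configuration
    return ll - 0.5 * math.log(n) * n_params

def creates_cycle(a, b, parents):
    """Adding a -> b creates a cycle iff a is already reachable from b."""
    stack, seen = [b], set()
    while stack:
        v = stack.pop()
        if v == a:
            return True
        if v not in seen:
            seen.add(v)
            stack.extend(c for c, ps in parents.items() if v in ps)
    return False

def greedy_search(variables, data):
    """Heavily simplified FGES-style search: a forward phase adds the
    single edge that most improves the total score; a backward phase
    then removes edges likewise."""
    parents = {v: set() for v in variables}
    total = lambda: sum(local_bic(v, tuple(sorted(parents[v])), data)
                        for v in variables)

    improved = True
    while improved:  # forward phase: add best edge until no gain
        improved, best, best_edge = False, total(), None
        for a in variables:
            for b in variables:
                if a == b or a in parents[b] or creates_cycle(a, b, parents):
                    continue
                parents[b].add(a)
                s = total()
                parents[b].remove(a)
                if s > best + 1e-9:
                    best, best_edge, improved = s, (a, b), True
        if improved:
            parents[best_edge[1]].add(best_edge[0])

    improved = True
    while improved:  # backward phase: remove worst edge until no gain
        improved, best, best_edge = False, total(), None
        for b in variables:
            for a in list(parents[b]):
                parents[b].remove(a)
                s = total()
                parents[b].add(a)
                if s > best + 1e-9:
                    best, best_edge, improved = s, (a, b), True
        if improved:
            parents[best_edge[1]].remove(best_edge[0])
    return parents
```

On data where two variables are strongly dependent, this sketch connects them with a single edge whose orientation is arbitrary, illustrating the statistical-equivalence problem that the proposed method addresses with precedence information.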

Proposed methods

The workflow of the proposed methods is described in Fig. 1B, method 3 (colored in orange). We propose two methods: a data transformation and a causal search method. The former transforms the longitudinal EHR data into disease-related events, so that we can determine the temporal ordering of events (diseases) despite inaccuracies in the EHR data, and extracts all pairs of diseases for which a clear precedence ordering exists. The search method constructs the causal graph using the transformed data and the set of precedence pairs.

Data transformation method

A disease-related event is defined as a diagnosis, a prescription, an abnormal lab result, or an abnormal vital sign. An event is incident if it occurs in the second time window but is not present in the first time window, provided the patient is observed in the first time window. Conversely, an event is pre-existing if the patient presented with it in or before the first time window. An event A precedes another event B if, among patients who have both A and B in the second time window, B is significantly more likely to be incident than A. Note that precedence implies neither causation nor association; however, if a causal effect exists, it must follow the precedence direction. Formal mathematical definitions of these concepts can be found in Supplement I. The output from this step is (i) an event-based data set consisting of the incident and pre-existing conditions for each patient in each of the two time windows, and (ii) a set \({\mathcal{C}}\) of precedence relationships: all pairs \(\left( {v_{i} , v_{j} } \right)\) of events for which event \(v_{i}\) clearly precedes \(v_{j}\).
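The precedence definition can be sketched as follows. The exact statistical test used in this work is given in Supplement I; the sketch below substitutes a sign test on the discordant pairs (patients for whom one event is pre-existing while the other is incident), which is an assumption on our part, as are the record encoding and the event names.

```python
import math

def binom_p_two_sided(k, n, p=0.5):
    """Exact two-sided binomial test p-value (sign test on n discordant pairs)."""
    pk = lambda i: math.comb(n, i) * p ** i * (1 - p) ** (n - i)
    observed = pk(k)
    return min(1.0, sum(pk(i) for i in range(n + 1) if pk(i) <= observed + 1e-12))

def precedes(records, a, b, alpha=0.05):
    """Decide whether event `a` clearly precedes event `b`.

    `records` is a list of per-patient dicts mapping event name to
    'pre-existing' or 'incident' (events absent in both windows are
    omitted). Among patients with both events, we count how often `a`
    is pre-existing while `b` is incident, and vice versa, then apply
    a sign test to the discordant counts."""
    a_first = b_first = 0
    for r in records:
        if a in r and b in r:
            if r[a] == "pre-existing" and r[b] == "incident":
                a_first += 1
            elif r[b] == "pre-existing" and r[a] == "incident":
                b_first += 1
    n = a_first + b_first
    if n == 0:
        return False  # no discordant pairs: no evidence of precedence
    return a_first > b_first and binom_p_two_sided(a_first, n) < alpha
```

Note that `precedes(records, a, b)` and `precedes(records, b, a)` can both be false, in which case the pair is excluded from \({\mathcal{C}}\): no edge between the two events will be considered by the search algorithm.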

The proposed CSD search algorithm

Given \({\mathcal{C}}\), we construct the causal graph \({\mathcal{G}}\) by iteratively adding the edge \(\left( {v_{i} , v_{j} } \right)\) from \({\mathcal{C}}\) that maximizes the goodness of fit of \({\mathcal{G}}\). The orientation of this edge must be consistent with the precedence relationship, namely from \(v_{i}\) to \(v_{j}\). The goodness of fit is defined by the BIC criterion. Let \(X^{\left( 1 \right)} , X^{\left( 2 \right)}\) denote the data sets collected in the two distinct time windows, where \(X^{\left( 2 \right)}\) follows \(X^{\left( 1 \right)}\). The likelihood of \({\mathcal{G}}\) is

$$\begin{aligned} \mathcal{L}\left( \mathcal{G} \mid X^{(1)}, X^{(2)} \right) &= \mathrm{P}\left( X^{(2)}, X^{(1)} \mid \mathcal{G} \right) = \mathrm{P}\left( X^{(2)} \mid X^{(1)}, \mathcal{G} \right) \mathrm{P}\left( X^{(1)} \mid \mathcal{G} \right) \\ &= \prod_{s} \left[ \prod_{v \in \mathcal{V}} \mathrm{P}\left( v_{s}^{(2)} \mid x_{s}^{(1)}, \mathcal{G} \right) \right] \mathrm{P}\left( x_{s}^{(1)} \mid \mathcal{G} \right) = \prod_{s} \left[ \prod_{v \in \mathcal{V}} \mathrm{P}\left( v_{s}^{(2)} \mid pa\left( v, \mathcal{G} \right)_{s}^{(1)} \right) \right] \mathrm{P}\left( x_{s}^{(1)} \mid \mathcal{G} \right), \end{aligned}$$
(1)

where \(x_{s}^{\left( t \right)}\) is the observation vector for subject s at cross-section t; \(v_{s}^{\left( t \right)}\) is the observation of variable (event) v for subject s at cross-section t; and \(pa\left( {v, {\mathcal{G}}} \right)_{s}^{\left( 1 \right)}\) is the observation vector for the parents of v in the causal structure \({\mathcal{G}}\) at cross-section 1 for subject s.

The algorithm estimates \(P\left( {v_{s}^{\left( 2 \right)} |pa\left( {v, {\mathcal{G}}} \right)_{s}^{\left( 1 \right)} } \right)\) using logistic regression on the subjects who do not have v at the first cross-section and are under observation for both cross-sections. For subjects who have v at the first cross-section, the probability of having v at the second cross-section is 1. Since \({\mathcal{G}}\) represents the transition graph, the term \(P\left( { x_{s}^{\left( 1 \right)} |{\mathcal{G}}} \right)\) is a constant.

Finally, the BIC score is

$$BIC\left( \mathcal{G} \right) = - 2\ln \mathcal{L}\left( \mathcal{G} \mid X^{(1)}, X^{(2)} \right) + \ln\left( n \right)\left| \mathcal{G} \right|,$$
(2)

where n is the number of subjects observed in both cross-sections, and \(\left| {\mathcal{G}} \right|\) is the number of edges in the causal structure \({\mathcal{G}}\).

Algorithm 1 describes the proposed algorithm for constructing the causal graph \({\mathcal{G}}\). \({\mathcal{G}}\) is a directed acyclic graph (DAG), with nodes representing variables and edges representing causal effects between a pre-existing and an incident variable.

figure a

Statement of human rights and informed consent

The study was approved by both Mayo Clinic and University of Minnesota Institutional Review Board (IRB). Informed consent was obtained from all patients. All relevant guidelines and regulations were followed.

Evaluation

Clinical evidence

The standard way to evaluate CSD methods is to compare the resulting graph to a gold standard graph. However, no such gold standard graph exists for this problem, and many relationships are possibly unknown. Two sources of evidence are nevertheless available. (1) Associative evidence: a large body of observational studies documents risk factors and outcomes for diabetes, and results from these studies have already been distilled into summaries21. (2) Causal evidence: clinical trials can support both the existence (positive) and the lack (negative) of hypothesized causal relationships. We compiled a list of causal relationships from 175 clinical trials with a primary endpoint of any of the conditions we studied, including composite endpoints. We excluded trials with overly strict inclusion criteria (whose results would not generalize to our population) and interventions out of the scope of our study. Fourteen trials remained, yielding 19 positive and 18 negative causal relationships. These trials and the evidence they produced are listed in Supplement III, Table S1. These relationships are used as causal evidence to compute recall.

Internal evaluation

We evaluated the method and the resulting graphs from the following four perspectives.

Stability

We ran 1000 bootstrap replications on the development cohort. An edge has ambiguous orientation if it is present in at least half of the 1000 graphs (the edge is not noise) and both orientations appear in at least 30% of the graphs that contain it (it does not have a dominant direction). We report the percentage of ambiguous edges.
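The ambiguity criterion above can be computed directly from the bootstrap graphs, as in this sketch. The graph representation (a set of directed `(tail, head)` pairs per replication) and the default thresholds mirror the text; the function name is ours.

```python
from collections import Counter

def ambiguous_edge_fraction(bootstrap_graphs, presence=0.5, both_dir=0.3):
    """Fraction of stable edges with ambiguous orientation.

    An undirected edge is stable (not noise) if it appears, in either
    orientation, in at least `presence` of all bootstrap graphs; it is
    ambiguous if each orientation appears in at least `both_dir` of the
    graphs that contain it. Each graph is a set of (tail, head) pairs."""
    n = len(bootstrap_graphs)
    directed = Counter()
    for g in bootstrap_graphs:
        for a, b in g:
            directed[(a, b)] += 1
    undirected = Counter()
    for (a, b), c in directed.items():
        undirected[tuple(sorted((a, b)))] += c
    stable = [e for e, c in undirected.items() if c >= presence * n]
    ambiguous = sum(
        1 for a, b in stable
        if directed[(a, b)] >= both_dir * undirected[(a, b)]
        and directed[(b, a)] >= both_dir * undirected[(a, b)]
    )
    return ambiguous / len(stable) if stable else 0.0
```

For example, an edge appearing in 6 of 10 replications with a 3/3 orientation split is ambiguous, while an edge appearing in 8 of 10 replications always in the same direction is not.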

Precision

Based on the causal graph derived from the training cohort, an edge is incorrect if there is no associative evidence of a relationship between the two events; or if causal evidence specifically indicates the lack of a causal relationship. We define precision as one minus the proportion of incorrect edges among the discovered edges.

Causal recall

Causal recall is computed on a single graph discovered from the training cohort, quantifying the percentage of the known causal relationships discovered. A known causal relationship from A to B is discovered if there is a node in the graph that maps to A, another node that maps to B and (a) a direct causal relationship A → B in the graph exists or (b) a causal path A → X → B exists and no causal evidence states that in patients with X, A does not cause B. For example, if the evidence states that blood pressure (without specifying whether it is systolic or diastolic) increases the risk of stroke, then the path sbp → cevd → stroke would satisfy this relationship.

Associative recall

Associative recall is also computed on a single graph discovered from the training cohort and it quantifies the percentage of known associative relationships that can be explained by the discovered causal graph. An associative relationship between A and B is explained by the graph if there is a node in the graph that maps to A, another node that maps to B, and a path between A and B exists in the graph.
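The two recall definitions reduce to path queries on the discovered graph, as in this sketch. The additional screen against negative causal evidence on mediated paths (described for causal recall above) is omitted here, and associative recall follows the text literally in accepting any path between the mapped nodes; the function names are ours.

```python
def reachable(graph, src, dst, undirected=False):
    """Depth-first search over a graph given as {node: set(children)}.
    With `undirected=True`, edges are traversed in both directions
    (used for associative recall, where any path explains an association)."""
    adj = {v: set(cs) for v, cs in graph.items()}
    if undirected:
        for v, cs in graph.items():
            for c in cs:
                adj.setdefault(c, set()).add(v)
    stack, seen = [src], {src}
    while stack:
        v = stack.pop()
        if v == dst:
            return True
        for c in adj.get(v, ()):
            if c not in seen:
                seen.add(c)
                stack.append(c)
    return False

def recalls(graph, causal_pairs, assoc_pairs):
    """Causal recall: fraction of known causal pairs (a, b) with a
    directed path a -> ... -> b. Associative recall: fraction of known
    associated pairs connected by any path."""
    causal = sum(reachable(graph, a, b) for a, b in causal_pairs)
    assoc = sum(reachable(graph, a, b, undirected=True) for a, b in assoc_pairs)
    return (causal / len(causal_pairs) if causal_pairs else 1.0,
            assoc / len(assoc_pairs) if assoc_pairs else 1.0)
```

Using the example from the text, the directed path sbp → cevd → stroke satisfies a known causal relationship between blood pressure and stroke, while the reverse pair is correctly not credited.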

External validation

We performed 1000 bootstrap replications on both data sets independently using the proposed method. On each data set, all edges from the 1000 graphs were pooled, resulting in two sets of pooled edges. We compared these two sets and identified the edges that were discordant between the MC and FV data, as shown in Fig. 1C.

Method comparison

Figure 1B depicts an overview of the method comparison. Three methods are compared: (1) FGES + raw: FGES is applied directly to the raw data; (2) FGES + transf: the data are transformed using the proposed transformation method and FGES is applied to the transformed data; and (3) Proposed: the proposed search algorithm is applied to the transformed data. Comparing FGES + raw with FGES + transf isolates the effect of the proposed transformation method, and comparing FGES + transf with Proposed highlights the effect of the proposed search algorithm.

Results

Baseline characteristics

Table 1 presents descriptive statistics for the MC and FV data sets at the end of the first time window and incidence rates for the diseases in the second window. Differences between the datasets were tested with the t-test (for age) and the chi-square test (all other variables).

Table 1 Characteristics of the MC and FV data sets.

Directional stability

The proposed data transformation reduced the percentage of ambiguously oriented edges from 45% to 24%, and the proposed search method then eliminated ambiguously oriented edges entirely (Table 2).

Table 2 Directional stability.

Correctness and completeness

Table 3 shows the precision, associative recall, and causal recall of the graphs discovered by the three methods. All three methods achieved almost perfect recall. FGES + raw achieved the lowest precision of 0.294: less than a third of the events reported as causally related are even associated. Using the proposed transformation, precision increased to 0.55, but almost half of the reported causal relationships are still incorrect. Finally, the proposed method achieved a precision of 0.838. We present the causal graph discovered by the proposed method in Fig. 2, with incorrect edges colored in red.

Table 3 Metrics from clinical evidence.
Figure 2
figure 2

Causal graph discovered by the proposed method. The ‘.dx’ suffix indicates diagnosis of the disease. The abbreviations of the diseases and lab results can be found in Table 1.

External validation

We compared the graphs discovered from the MC and FV data sets. There are 74 distinct edges that were discovered from at least one of the data sets. Sixty (81%) edges coincided across the two datasets, while 14 (19%) differed. Table 4 shows the discordant edges, the percentage of bootstrap iterations in which the edge was present and the main reason for the discordance.

Table 4 External validation.

There are three broad reasons for the differences in edges. The main reason, affecting half of the discordant edges, was policy differences. These include preferred lab tests (A1C vs. FPG) and decisions regarding therapeutic interventions. The second reason, affecting four edges, is a lack of clear precedence among the events. For example, the abnormal Trigl → HL treatment edge was not discovered at FV because the first abnormal Trigl precedes or follows the HL treatment in statistically equal proportions. The final reason, affecting the remaining three edges, is a differential degree of confounding between the two sites. For example, SBP is a confounder of CHF and MI. When the algorithm fails to detect the SBP → MI edge, the effect of SBP on MI is expressed through CHF (which depends on SBP more strongly than MI does). For the HL diagnosis → Trigl edge, the common cause is BMI, and for the HL treatment → CAD edge, it is LDL. The differential confounding was likely due to a combination of population and institutional differences as well as data artifacts.

Discussion

We proposed a new data transformation method and a new search algorithm specifically designed for EHR data. The resulting graph achieved close to 90% precision (90% of the edges were correct) and almost 100% recall (the graph could explain all known associations and almost all known causal relationships), and it was remarkably stable in the face of data perturbation (no edge disappeared or changed direction). Owing to its built-in adaptations to EHR data, our method outperformed general-purpose methods by a large margin.

While the two graphs from the two independent health systems are reassuringly similar, small differences exist. None of these differences implies an incorrect physiological or pathophysiological effect. Among the 14 edges that differed, seven captured differences between the population and the institutions, such as institution-specific triggers for prescriptions and the use of different laboratory tests for the same purpose (fasting plasma glucose versus A1c). Depending on the goal of the modeling, it may be desirable to include such differences. We believe that the discovered causal graphs offer adequate information about causal (including confounding) factors to support the development of clinical decision support models and can also support clinical research efforts.

The proposed algorithm achieved such high performance because it could compensate for errors in the EHR data and it incorporated study design considerations. Problems caused by incorrect time stamps and diseases appearing in reverse order are alleviated by reducing the overall reliance on time stamps. The study design with its two-year windows tolerates (even large) errors in the time stamp, and once a disease is recognized as pre-existing by the data transformation method, its subsequent time stamps are irrelevant. Time stamps that appear in reverse order tend to have a small gap (the time to schedule and complete a diagnostic procedure), so they likely fall into the same two-year window. Study design considerations, namely that billing codes distinguish neither between incident and pre-existing conditions nor whether a patient is under observation, are addressed through the data transformation method. The search algorithm produces a DAG by using precedence relationships to orient edges that have equal probability in both orientations. Precedence relationships, in turn, rely on the pre-existing/incident status of the disease as determined by the data transformation method.

Generalizability beyond diabetes

The proposed method was demonstrated on type 2 diabetes, but it can generalize to other applications as long as the target application benefits from some of the improvements: reducing the impact of inaccuracies in the EHR data, accounting for the temporal ordering of events and distinguishing pre-existing and incident conditions. The method assumes that pre-existing diseases persist during the second time window.

Future work

The algorithm requires longitudinal data with at least two time windows. First, different diseases and their symptoms may manifest at different rates; incorporating this knowledge into the discovery process may enhance the performance of the algorithm. Secondly, the proposed methods may be able to capture the effect of medication changes when a study design with multiple (more than two) time windows is applied. The current implementation assumes a single incidence of a disease, that is, that the disease persists during the study period. A possible extension could relax this assumption, allowing for transient conditions with multiple incidences during the study period. Thirdly, variable semantics (such as SBP and DBP being measures related to hypertension) are an essential component of the proposed algorithm, but they are not always available in a computable form. Finally, both datasets in this study are from the Midwest with a predominantly white patient population; the generalizability of the discovered causal relationships should be further tested in a broader patient population.

Conclusions

We have demonstrated that the graph produced by the proposed transformation and search algorithm is more stable across bootstrap iterations and as complete as graphs produced by other methods, yet it contained substantially fewer errors (i.e., had higher precision) than graphs produced by general-purpose methods. The resulting graph was successfully validated using longitudinal EHR data from an independent health system. We conclude that the proposed method is more suitable for use in clinical studies using EHR data.