1 Introduction

Type 2 diabetes mellitus (T2DM) is a chronic metabolic disorder characterized by high levels of blood sugar (glucose) resulting from the body’s resistance to insulin or its reduced secretion. The number of adults suffering from T2DM in Europe varies between countries but it is expected to increase overall from 52.8 in 2011 to 69 million by 2045 (, accessed June 2023). About 30–40% of affected individuals develop diabetic kidney disease (DKD), a devastating complication that reduces quality as well as duration of life and imposes an enormous burden on health care budget. In developed countries DKD is the leading cause of end stage renal disease [1].

For many years kidney disease in type 2 diabetes was considered to mimic kidney disease in type 1 diabetes, a somewhat “homogenous” disorder primarily driven (at least in early stage) by genetic predisposition and quality of metabolic control. However recently it became evident that it is much more complex and multifactorial due to different comorbidities more prevalent in this elderly population (like hypertension) and deregulations in a large number of different biological pathways including metabolic, hemodynamic, and inflammatory processes have been described [2]. A consequence of this complexity is massive inter-individual and longitudinal intra-individual heterogeneity of pathophysiology on the molecular level the phenotype (i.e. clinical presentation) and the response to specific therapy is observed. Understanding these mechanisms and their interactions cross sectionally and over time is crucial for improving clinical care and developing targeted therapies and interventions to prevent or delay the onset and slow the progression of DKD. With better profiling of patients there is an increasing need of a new understanding on the framework of relationships involving some of the variables and their interactions used to judge the state of a patient with DKD and support selection of appropriate therapy.

In this work we propose a probabilistic graphical model, namely the Bayesian network, to identify the network of relationships among the selected variables of the disease pathophysiology of DKD. Ideally the results should give a consensus to the theoretical path of pathophysiology, and when combined with expert knowledge or per se, should improve the information on the actual relationships among the different considered factors. Specifically, by estimating a Bayesian network model we can contribute in

  • evaluating the strength of the well-known relationships on DKD;

  • proving new insights on new relationships emerged from the data on patients;

  • identifying differences that could be imputed to the specific therapy.

The paper is structured as follow: in Sect. 2 we introduce the study conducted to derive the data used in the analyses and the statistical approach developed to address the proposed objectives, in particular how to include prior knowledge available from the literature and experts to produce more informative models; then in Sect. 3 we present the main results achieved in the content of DKD. Finally, in Sect. 4 we propose some concluding remarks about issues requiring further researches.

2 Materials and Methods

2.1 The PROVALID Study

The data used in this work were provided by the PROVALID study (“PROspective cohort study in patients with type 2 diabetes mellitus for VALIDation of biomarkers”), a prospective observational study that recruited over 4.000 patients with T2DM in five European countries with normal, mild or moderately reduced kidney function. Patients were followed for at last 4 years and variables holding information on clinical data, laboratory values and medication were collected on an annual basis. For a more complete description of the study and the available data we refer to [3, 4]. The disease trajectories (as assessed by changes in eGFR, a measure of renal excretory capacity) were highly variable in the PROVALID participants even under stable therapy [5]. Next to drug adherence and environmental factors, heterogeneity in pathophysiology is a very likely explanation for this finding. In order to systematically approach this problem we defined two populations of patients:

  • RASi only, population 1: a population of patients that was continuously treated with agents that block the renin angiotensin system, the current standard of care for at least 4 years.

  • Drop-in, population 2: a selection of patients to whom other agents were added on top of RASi therapy by their clinicians in order to improve metabolic control and/or DKD (sodium glucose transporter 2 inhibitors, i.e. SGLT2is, glucagon like peptide 1 receptor agonists, i.e. GLP1as, or the mineralocorticoid receptor antagonists, i.e. MCRAs.

The definition of these two different populations can help in addressing the aim of identifying differences that could be attributed to the specific therapy.

Among the over one hundred variables collected within the PROVALID data, thirteen available from routine clinical care visits and considered important by physicians were selected. After a preprocessing of the data to remove incomplete cases and to adjust skewed distribution by means of log transformation if appropriated, the selected variables in the two populations are described in Table 1. We point out that the data which we analyzed are datapoints, i.e. we did not consider the longitudinal component of the data. From the Table we can highlight that differences on the mean value of some variables emerge when comparing the two populations, meaning that therapy seems to have an effect to those variables.

Table 1. Comparison of main selected variables between the two populations. Mean ± standard deviation are reported. p-value refers to a t-student test to evaluate the statistical difference between the two populations.

After the selection of the relevant variables, clinical expertise was used to construct an interaction network based on pathophysiology understanding. This network, considered a theoretical framework is presented in Fig. 1. Then, the interaction network and the suspected strength of the interactions between variables was considered as a benchmark, and compared with a purely data driven approach to determine if the latter could improve our understanding of the DKD complex interactions. However confounding of this network by changing in treatment that affects target variables with or without altering disease pathology is an obvious weakness.

Fig. 1.
figure 1

Theoretical framework. Red arcs reflect quite known relationships between the associated variables whereas blue arcs reflect relationships which are less clear from clinical point of view. Signs indicate the direction of the associations.

2.2 The Bayesian Networks

To derive the network of relationships among the selected variables of the disease pathophysiology of DKD, we propose to build Bayesian networks (BNs) [6, 7]. Bayesian networks provide a method for the representation and reasoning of uncertainty and have been widely used in the medical field [8,9,10]. Specifically, a BN for a set of random variables \(\textbf{X}=\{X_1, \ldots , X_p\}\) (in this case \(p=13\)) is identified by

  • a network structure G, a directed acyclic graph (DAG) where nodes represent the variables \(\textbf{X}\) of the system and the directed arcs between nodes represent the probability dependences between them,

  • a set of parameters, representing conditional probability distributions \(P(X_i|Pa(X_i))\) associated to each variable \(X_i\), \(i=1,\ldots ,p\), where \(Pa(X_i)\) are the variables that correspond to the parents of \(X_i\) in the DAG (i.e. the nodes with an arc pointing towards \(X_i\)).

The global distribution of the variables \(\textbf{X}\) is decomposed into the local distributions of the individual variables \(X_i\) as

$$\begin{aligned} P(\textbf{X})=\prod _{n=1}^{p} P(X_{i}|Pa(X_{i})) \end{aligned}$$

The process of estimating a BN is called learning and typically involves two main steps: (1) the structure learning to identify the topological structure, i.e. which arcs are present in the graph and therefore which probabilistic relationships are supported by the data, and (2) the parameter learning to learn the conditional probability distributions that regulate the strength of the relationships.

There are many approaches in literature to estimate BNs from the data [11]: in this work we will focus on a Search & Score strategy which uses a score function in order to compare the structures of the network and then selects the structure which better fits the data. Specifically, we develop structure learning by means of hill-climbing search procedure and a BDe score [6]. Furthermore, to reduce the impact of the noise present in the data, model averaging learning techniques can be used to improve the reliability of structure learning [12]. The process consists in:

  • perform bootstrap resampling, i.e. re-sample the data k times using bootstrap and perform structure learning separately on each of the resulting samples, thus collecting k DAGs;

  • calculate arc strength, i.e. compute the frequency with which each arc appears in those k graphs deriving an “average” consensus DAG by selecting those arcs that have a frequency above a certain threshold t.

In this work we fix the number of bootstrap replications to \(k = 200\) and threshold to \(t = 0.5\) (selection of only arcs with strength \(> 0.5\)). The average BN model built within this process should be less sensitive to noisy data and typically should produce more accurate predictions for new observations [8].

One more characteristic on structure learning is that BN can include prior knowledge available from the literature and the practice of the discipline to produce more informative models and to overcome the inherent noisiness and variability of data. This is possible by means of whitelisted arcs: they represent well-known dependencies which should be forced to be present in the graph. In this work we estimate several BNs by including and excluding prior knowledge representing the theoretical framework of interconnections among the selected variables in DKD. The prior information was delivered by study physicians in the form of 32 prior relationships (whitelisted arcs) derived from the pathophysiology theoretical framework in Fig. 1.

Last, BNs are derived both considering the whole dataset (Overall population) to improve the experts understanding of the pathophysiology complex interactions, and the therapy-specific populations (Rasi and Drop-in populations) to identify if any difference can be imputed to added agents.

3 Results

To evaluate the strength of the well-known relationships on DKD and how data can provide insights on new relationships in patients on therapy, we introduce some measure of graphical differences. In Table 2 we provide the number of arcs (Num. arcs), the average Markov Blanket size (Av. MB size), the average neighborhood size (Av. neighb. size), the number of missing priors (FN), the number of confirmed priors (TP) and the number of new arcs emerging from data (TN) with respect to the “Expert” network built with only the 32 whitelisted arcs suggested by expert clinicians. Last, a BIC measure was provided for each BN in order to compare the fit to the data. BNs in Table 2 are learned using data referred to the whole dataset (Overall population).

Table 2. Measures of graphical differences. Expert refers to BN where the structure represents the expert knowledge, Data only refers to BN learned using data only, Data + Prior refers to BN learned by using both data and expert knowledge.

From the results we can see that the “Data only” BN have a less number of arcs, meaning that data provide relationships that should be considered as robust. By comparing them with the expert prior whitelisted arcs, we highlight that the 7 TP arcs detected by a purely data driven approach have a strength ranging from 1 to 0.910 meaning that the associated prior relationships are highly confirmed also from an empirical point of view (some examples are: SBP \(\rightarrow \) DBP, DBP \(\rightarrow \) HB and BG \(\rightarrow \) HBA1C, all with associated strength equal to 1). Furthermore, 21 new emerging arcs are achieved: some of them describe prior relationships but with a reversed directions (for example, HDLCHOL \(\rightarrow \) BMI with strength equals to 1 or HBA1C \(\rightarrow \) BMI with strength equals to 0.975), but many others can provide new insights on the DKD pathophysiology network as, for example SALB \(\rightarrow \) HB (strength = 1), SALB \(\rightarrow \) UACR (strength = 1) or CPR \(\rightarrow \) BMI (strength = 1).

When looking at the results of the BN learned by using prior expert information, we see that the number of emerged new relationships is 17 and most of them are the same as in the network built using only data.

To understand if therapies affect the results, the same procedure was separately developed in the Rasi only and Drop-in populations. Results are presented in Table 3. The BNs built without prior information within the Drop-in population seems to present less arcs with respect to Rasi only population. Only 3 prior relationships are confirmed in both populations (SBP \(\rightarrow \) DBP, DBP \(\rightarrow \) HB and BG \(\rightarrow \) HBA1C, all with strength equals to 1) but what emerges is that the new relationships found in Rasi only population are mainly different compared to Drop-in population. In Fig. 2 the arcs which can be attributed to therapy are shown. Specifically, black solid lines represent relationships which are present in both Rasi only and Drop-in populations, blue dashed lines represent relationships which are present in Rasi only population but not in Drop-in population and red solid lines represent relationships which are present in Drop-in population but not in Rasi only population. When introducing prior information, despite the high number of common whitelisted arcs which can also put constraints in the search approach, there are again differences that can be attributed to the therapies as shown in Fig. 3. Most of them confirm the results obtained by a purely data-driven approach, but some new relationships also emerge. This suggest that expert prior information can guide and contribute to a better understand on the interconnection network among the variables involved in the disease.

To evaluate how expert knowledge merged with information directly extracted from the data is able to better identify the pattern of pathophysiology, we calculate the predictive accuracy of the BNs estimated from data with and without prior information in the different populations in terms of correlation between the observed and the predicted value for all the variables. This predictive accuracy is achieved by using 10-fold cross-validation [13]. 10-fold cross-validation is a model validation technique that assesses how well a statistical model accurately predict the behavior of new observations; for each variable we compute the correlation between the observed and predicted pairs and this quantity is called predictive correlation. The predictive correlations for all the variables are reported in Table 4. Both Data and Data + Prior BNs predictions for all the considered variables outperform the predictive correlations in the Expert network for all the populations, meaning that data can provide a very valuable source of additional information to better understand unknown mechanisms in the DKD. Furthermore, in differentiating by therapies we can also achieved specific directions of intervention: for example, the value of the predictive correlation of CRP is about 0.2 for Rasi only population and about 0.4 for Drop-in population meaning that the interconnections found in this last BN are able to better describe what influences the value of CRP.

Table 3. Measures of graphical differences among populations. Expert refers to BN where the structure represents the expert knowledge, Data only refers to BN learned using data only, Data + Prior refers to BN learned by using both data and expert knowledge.
Fig. 2.
figure 2

Structural differences imputed to therapy - Data only

Fig. 3.
figure 3

Structural differences imputed to therapy - Data + Prior

Table 4. Measures of prediction performance. Data only refers to BNs learned using data only, Data + Prior refers to BNs learned by using both data and expert knowledge, and Expert refers to BNs where the structure represents the expert knowledge and only the parameters of the BN are estimated from data.

4 Concluding Remarks

In this work we provide evidence on how BNs are effective and efficient models for the identification and the quantification of complex structures in medical practice and research. Specifically, by using average Bayesian network models for therapy-specific data we can provide an intuitive qualitative and quantitative description (in the form of a DAG) of the relationships that link the variables of the theoretical framework. Furthermore, this methodological strategy has the advantage of allowing the integration of prior expert knowledge into model estimation, which is quite common in clinical settings. From the results of the analysis, we can highlight how the data can provide a source of information able to increase the knowledge of experts in finding complex relationships in the path of pathophysiology for the disease. In this sense, data and experts are both complementary and collaborative: experts can corroborate what emerges from data and data can help experts find new insights. Moreover, by digging inside the estimated structure in the two populations we should be able to identify differences that could be imputed to the specific therapy in order to support the selection of appropriate interventions for patients treated with that therapy. Further researches can be developed to improve the efficiency of the estimated models by adding new set of variables (not strictly related to the pathophysiology perspective such as the set of risk factor medications, the clinical readout features, family history information, etc.) or move to a BN classifier (or a BN-based predictive model) with the main emergent relationships to derive a personalized probabilistic outcome.