Abstract
Diabetic kidney disease is a serious complication of diabetes and one of the leading causes of chronic and end-stage kidney disease worldwide. The clinical course and the response to therapy are complex and heterogeneous, both between individuals and over time within individuals. It is therefore extremely important to derive more in-depth information on what characterizes its pathophysiology and pattern of disease progression. Statistical models can help in this task by clarifying the interconnections among the variables that clinicians consider characteristic of the disease. In this work we propose to use Bayesian networks, a class of probabilistic graphical models able to identify robust relationships among a set of variables. Furthermore, Bayesian networks can include expert knowledge in the modeling phase to reduce the uncertainty about the phenomenon under study. We provide some evidence that the synergy between data and expert prior information is a valuable source of new knowledge about diabetic kidney disease.
1 Introduction
Type 2 diabetes mellitus (T2DM) is a chronic metabolic disorder characterized by high levels of blood sugar (glucose) resulting from the body’s resistance to insulin or from reduced insulin secretion. The number of adults suffering from T2DM in Europe varies between countries, but it is expected to increase overall from 52.8 million in 2011 to 69 million by 2045 (www.heartstats.org, accessed June 2023). About 30–40% of affected individuals develop diabetic kidney disease (DKD), a devastating complication that reduces both quality and duration of life and imposes an enormous burden on health care budgets. In developed countries DKD is the leading cause of end-stage renal disease [1].
For many years kidney disease in type 2 diabetes was considered to mimic kidney disease in type 1 diabetes, a somewhat “homogeneous” disorder primarily driven (at least in the early stage) by genetic predisposition and quality of metabolic control. However, it has recently become evident that the disease is much more complex and multifactorial: comorbidities such as hypertension are more prevalent in this elderly population, and deregulation of a large number of biological pathways, including metabolic, hemodynamic and inflammatory processes, has been described [2]. A consequence of this complexity is the massive interindividual and longitudinal intraindividual heterogeneity observed in pathophysiology at the molecular level, in phenotype (i.e. clinical presentation) and in the response to specific therapy. Understanding these mechanisms and their interactions, cross-sectionally and over time, is crucial for improving clinical care and for developing targeted therapies and interventions to prevent or delay the onset and slow the progression of DKD. As patient profiling improves, there is an increasing need for a new understanding of the framework of relationships among the variables used to judge the state of a patient with DKD and to support the selection of appropriate therapy.
In this work we propose a probabilistic graphical model, namely the Bayesian network, to identify the network of relationships among the selected variables of the disease pathophysiology of DKD. Ideally, the results should agree with the theoretical pathophysiological pathway and, either on their own or combined with expert knowledge, should improve the information on the actual relationships among the factors considered. Specifically, by estimating a Bayesian network model we can contribute to

evaluating the strength of the well-known relationships in DKD;

providing new insights into relationships that emerge from the patient data;

identifying differences that could be attributed to the specific therapy.
The paper is structured as follows. In Sect. 2 we introduce the study that provided the data used in the analyses and the statistical approach developed to address the proposed objectives, in particular how to include prior knowledge available from the literature and from experts to produce more informative models. In Sect. 3 we present the main results achieved in the context of DKD. Finally, in Sect. 4 we offer some concluding remarks on issues requiring further research.
2 Materials and Methods
2.1 The PROVALID Study
The data used in this work were provided by the PROVALID study (“PROspective cohort study in patients with type 2 diabetes mellitus for VALIDation of biomarkers”), a prospective observational study that recruited over 4,000 patients with T2DM and normal, mildly or moderately reduced kidney function in five European countries. Patients were followed for at least 4 years, and variables holding information on clinical data, laboratory values and medication were collected on an annual basis. For a more complete description of the study and the available data we refer to [3, 4]. The disease trajectories (as assessed by changes in eGFR, a measure of renal excretory capacity) were highly variable in the PROVALID participants, even under stable therapy [5]. Next to drug adherence and environmental factors, heterogeneity in pathophysiology is a very likely explanation for this finding. In order to approach this problem systematically we defined two populations of patients:

RASi only, population 1: a population of patients continuously treated for at least 4 years with agents that block the renin-angiotensin system (RASi), the current standard of care.

Dropin, population 2: a selection of patients to whom other agents were added on top of RASi therapy by their clinicians in order to improve metabolic control and/or DKD: sodium-glucose transporter 2 inhibitors (SGLT2is), glucagon-like peptide 1 receptor agonists (GLP1as) or mineralocorticoid receptor antagonists (MCRAs).
The definition of these two different populations can help in addressing the aim of identifying differences that could be attributed to the specific therapy.
Among the over one hundred variables collected in the PROVALID data, we selected thirteen that are available from routine clinical care visits and considered important by physicians. The data were preprocessed to remove incomplete cases and, where appropriate, to adjust skewed distributions by means of a log transformation; the selected variables in the two populations are described in Table 1. We point out that the data we analyzed are single data points, i.e. we did not consider the longitudinal component of the data. Table 1 shows that the mean values of some variables differ between the two populations, suggesting that therapy has an effect on those variables.
After the selection of the relevant variables, clinical expertise was used to construct an interaction network based on the current understanding of the pathophysiology. This network, considered a theoretical framework, is presented in Fig. 1. The interaction network and the suspected strength of the interactions between variables were then taken as a benchmark and compared with a purely data-driven approach to determine whether the latter could improve our understanding of the complex interactions in DKD. However, an obvious weakness is that this network may be confounded by changes in treatment that affect target variables with or without altering disease pathology.
2.2 The Bayesian Networks
To derive the network of relationships among the selected variables of the disease pathophysiology of DKD, we propose to build Bayesian networks (BNs) [6, 7]. Bayesian networks provide a method for representing and reasoning about uncertainty and have been widely used in the medical field [8, 9, 10]. Specifically, a BN for a set of random variables \(\textbf{X}=\{X_1, \ldots , X_p\}\) (in this case \(p=13\)) is identified by

a network structure G, a directed acyclic graph (DAG) where nodes represent the variables \(\textbf{X}\) of the system and the directed arcs between nodes represent the probabilistic dependencies between them,

a set of parameters, representing the conditional probability distributions \(P(X_i \mid Pa(X_i))\) associated with each variable \(X_i\), \(i=1,\ldots ,p\), where \(Pa(X_i)\) are the variables that correspond to the parents of \(X_i\) in the DAG (i.e. the nodes with an arc pointing towards \(X_i\)).
The global distribution of the variables \(\textbf{X}\) is decomposed into the local distributions of the individual variables \(X_i\) as
\[P(\textbf{X}) = \prod_{i=1}^{p} P(X_i \mid Pa(X_i)).\]
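As a small worked illustration of this factorization (a sketch under stated assumptions, not the network estimated in the paper), consider only three of the variables, SBP, DBP and HB, and assume a chain SBP \(\rightarrow\) DBP \(\rightarrow\) HB in which SBP has no parents, \(Pa(\mathrm{DBP}) = \{\mathrm{SBP}\}\) and \(Pa(\mathrm{HB}) = \{\mathrm{DBP}\}\). Under these assumptions the joint distribution would factorize as

```latex
\[
P(\mathrm{SBP}, \mathrm{DBP}, \mathrm{HB})
  = P(\mathrm{SBP}) \, P(\mathrm{DBP} \mid \mathrm{SBP}) \, P(\mathrm{HB} \mid \mathrm{DBP}),
\]
```

so that HB is conditionally independent of SBP given DBP, and each of the three factors can be estimated separately from the data.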
The process of estimating a BN is called learning and typically involves two main steps: (1) the structure learning to identify the topological structure, i.e. which arcs are present in the graph and therefore which probabilistic relationships are supported by the data, and (2) the parameter learning to learn the conditional probability distributions that regulate the strength of the relationships.
There are many approaches in the literature to estimate BNs from data [11]; in this work we focus on a Search & Score strategy, which uses a score function to compare candidate network structures and selects the structure that best fits the data. Specifically, we perform structure learning by means of a hill-climbing search procedure and a BDe score [6]. Furthermore, to reduce the impact of the noise present in the data, model averaging techniques can be used to improve the reliability of structure learning [12]. The process consists of the following steps:

perform bootstrap resampling, i.e. resample the data k times using bootstrap and perform structure learning separately on each of the resulting samples, thus collecting k DAGs;

calculate arc strength, i.e. compute the frequency with which each arc appears in those k graphs, and derive an “average” consensus DAG by selecting the arcs whose frequency is above a certain threshold t.
In this work we fix the number of bootstrap replications to \(k = 200\) and threshold to \(t = 0.5\) (selection of only arcs with strength \(> 0.5\)). The average BN model built within this process should be less sensitive to noisy data and typically should produce more accurate predictions for new observations [8].
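The two model-averaging steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, `learn_structure` stands for any structure-learning routine that returns a set of `(parent, child)` arcs, and the four toy DAGs are invented for the example.

```python
from collections import Counter

import numpy as np


def bootstrap_dags(data, learn_structure, k=200, seed=0):
    """Resample the rows of `data` k times and learn one DAG per resample.

    `learn_structure` is a hypothetical callable that maps a data matrix
    to a set of (parent, child) arcs.
    """
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    dags = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)  # rows drawn with replacement
        dags.append(learn_structure(data[idx]))
    return dags


def arc_strengths(dags):
    """Fraction of the DAGs in which each directed arc appears."""
    counts = Counter(arc for dag in dags for arc in set(dag))
    return {arc: count / len(dags) for arc, count in counts.items()}


def consensus_arcs(dags, threshold=0.5):
    """Arcs whose strength is strictly above the threshold."""
    return {arc for arc, s in arc_strengths(dags).items() if s > threshold}


# Toy example: four invented "bootstrap" DAGs over a few variable names.
dags = [
    {("SBP", "DBP"), ("BG", "HBA1C")},
    {("SBP", "DBP"), ("BG", "HBA1C"), ("SALB", "HB")},
    {("SBP", "DBP"), ("HBA1C", "BG")},
    {("SBP", "DBP"), ("BG", "HBA1C")},
]
strengths = arc_strengths(dags)
consensus = consensus_arcs(dags, threshold=0.5)
```

Note that thresholding arcs independently does not by itself guarantee that the selected arcs form an acyclic graph; a complete implementation would need to check for and resolve cycles.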
A further feature of structure learning is that BNs can include prior knowledge available from the literature and from the practice of the discipline, which produces more informative models and helps to overcome the inherent noisiness and variability of data. This is possible by means of whitelisted arcs, which represent well-known dependencies that are forced to be present in the graph. In this work we estimate several BNs, including and excluding the prior knowledge that represents the theoretical framework of interconnections among the selected variables in DKD. The prior information was delivered by study physicians in the form of 32 prior relationships (whitelisted arcs) derived from the theoretical pathophysiology framework in Fig. 1.
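To make the role of whitelisted arcs in a Search & Score procedure concrete, the following is a minimal hill-climbing sketch. It rests on several assumptions that are not taken from the paper: variables are referred to by column index, the score is a Gaussian BIC swapped in for the BDe score purely to keep the example short and self-contained, and all function names are hypothetical. Whitelisted arcs are placed in the starting graph and are never deleted or reversed.

```python
import itertools

import numpy as np


def node_bic(data, child, parents):
    """Gaussian BIC of one node regressed on its parents (higher is better)."""
    n = data.shape[0]
    X = np.column_stack([np.ones(n)] + [data[:, p] for p in parents])
    y = data[:, child]
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = max(resid @ resid / n, 1e-12)
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return loglik - 0.5 * (X.shape[1] + 1) * np.log(n)


def total_score(data, arcs):
    """Sum of the node scores: the score decomposes over the nodes."""
    p = data.shape[1]
    return sum(node_bic(data, j, sorted(a for a, b in arcs if b == j))
               for j in range(p))


def has_cycle(arcs, p):
    """True if the directed graph contains a cycle (Kahn's algorithm)."""
    indeg = [0] * p
    for _, b in arcs:
        indeg[b] += 1
    queue = [v for v in range(p) if indeg[v] == 0]
    seen = 0
    while queue:
        v = queue.pop()
        seen += 1
        for a, b in arcs:
            if a == v:
                indeg[b] -= 1
                if indeg[b] == 0:
                    queue.append(b)
    return seen < p


def hill_climb(data, whitelist=frozenset()):
    """Greedy search over single-arc additions, deletions and reversals."""
    p = data.shape[1]
    arcs = set(whitelist)
    if has_cycle(arcs, p):
        raise ValueError("whitelisted arcs must not form a cycle")
    best = total_score(data, arcs)
    improved = True
    while improved:
        improved = False
        candidates = []
        for a, b in itertools.permutations(range(p), 2):
            if (a, b) in arcs:
                if (a, b) in whitelist:
                    continue  # forced arcs are never removed or reversed
                candidates.append(arcs - {(a, b)})               # delete
                candidates.append((arcs - {(a, b)}) | {(b, a)})  # reverse
            elif (b, a) not in arcs:
                candidates.append(arcs | {(a, b)})               # add
        for cand in candidates:
            if has_cycle(cand, p):
                continue
            s = total_score(data, cand)
            if s > best + 1e-9:  # keep the best-scoring move of this round
                best, best_arcs, improved = s, cand, True
        if improved:
            arcs = best_arcs
    return arcs, best


# Synthetic data generated from the chain 0 -> 1 -> 2.
rng = np.random.default_rng(0)
x0 = rng.normal(size=500)
x1 = 2.0 * x0 + rng.normal(size=500)
x2 = -1.5 * x1 + rng.normal(size=500)
data = np.column_stack([x0, x1, x2])

learned, learned_score = hill_climb(data)
constrained, constrained_score = hill_climb(data, whitelist={(2, 0)})
```

Because the Gaussian BIC cannot distinguish between DAGs in the same equivalence class, the directions recovered in `learned` may differ from the generating chain; only the arcs listed in `whitelist` are guaranteed to appear with the stated direction.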
Lastly, BNs are derived both from the whole dataset (Overall population), to improve the experts' understanding of the complex pathophysiological interactions, and from the therapy-specific populations (RASi only and Dropin populations), to identify whether any difference can be attributed to the added agents.
3 Results
To evaluate the strength of the well-known relationships in DKD and how data can provide insights into new relationships in patients on therapy, we introduce some measures of graphical difference. In Table 2 we provide the number of arcs (Num. arcs), the average Markov blanket size (Av. MB size), the average neighborhood size (Av. neighb. size), the number of missing priors (FN), the number of confirmed priors (TP) and the number of new arcs emerging from data (TN) with respect to the “Expert” network built with only the 32 whitelisted arcs suggested by expert clinicians. Lastly, a BIC measure is provided for each BN in order to compare the fit to the data. The BNs in Table 2 are learned using the whole dataset (Overall population).
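As a rough sketch of how such graphical measures can be computed, the Python below compares a learned arc set with an expert arc set. The function names and the toy arc sets are hypothetical and are not the networks of Table 2; arcs are compared as directed pairs, so a prior arc recovered with reversed direction counts both as a missing prior and as a new arc.

```python
def neighbourhood(arcs, node):
    """Nodes joined to `node` by an arc in either direction."""
    return ({a for a, b in arcs if b == node}
            | {b for a, b in arcs if a == node})


def markov_blanket(arcs, node):
    """Parents, children and the other parents of the children of `node`."""
    parents = {a for a, b in arcs if b == node}
    children = {b for a, b in arcs if a == node}
    spouses = {a for a, b in arcs if b in children and a != node}
    return parents | children | spouses


def compare_to_expert(learned, expert, nodes):
    """Summary measures of a learned arc set against the expert arc set."""
    learned, expert = set(learned), set(expert)
    return {
        "num_arcs": len(learned),
        "av_mb_size": sum(len(markov_blanket(learned, v)) for v in nodes) / len(nodes),
        "av_neighb_size": sum(len(neighbourhood(learned, v)) for v in nodes) / len(nodes),
        "confirmed_priors": len(learned & expert),  # labelled TP in the text
        "missing_priors": len(expert - learned),    # labelled FN in the text
        "new_arcs": len(learned - expert),          # labelled TN in the text
    }


# Invented toy networks over eight variable names.
nodes = ["SBP", "DBP", "BG", "HBA1C", "BMI", "HDLCHOL", "SALB", "HB"]
expert = {("SBP", "DBP"), ("BG", "HBA1C"), ("BMI", "HDLCHOL")}
learned = {("SBP", "DBP"), ("BG", "HBA1C"), ("HDLCHOL", "BMI"),
           ("SALB", "HB"), ("DBP", "HB")}
summary = compare_to_expert(learned, expert, nodes)
```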
From the results we can see that the “Data only” BN has fewer arcs, meaning that the relationships supported by the data should be considered robust. Comparing them with the whitelisted arcs from the expert prior, we find that the 7 TP arcs detected by a purely data-driven approach have a strength ranging from 0.910 to 1, meaning that the associated prior relationships are strongly confirmed from an empirical point of view as well (some examples are SBP \(\rightarrow \) DBP, DBP \(\rightarrow \) HB and BG \(\rightarrow \) HBA1C, all with strength equal to 1). Furthermore, 21 new arcs emerge: some of them describe prior relationships but with a reversed direction (for example, HDLCHOL \(\rightarrow \) BMI with strength equal to 1 or HBA1C \(\rightarrow \) BMI with strength equal to 0.975), but many others can provide new insights into the DKD pathophysiology network, for example SALB \(\rightarrow \) HB (strength = 1), SALB \(\rightarrow \) UACR (strength = 1) or CPR \(\rightarrow \) BMI (strength = 1).
Looking at the results of the BN learned with prior expert information, we see that 17 new relationships emerge and that most of them are the same as in the network built using only data.
To understand whether therapies affect the results, the same procedure was applied separately to the RASi only and Dropin populations. Results are presented in Table 3. The BN built without prior information in the Dropin population has fewer arcs than the one built in the RASi only population. Only 3 prior relationships are confirmed in both populations (SBP \(\rightarrow \) DBP, DBP \(\rightarrow \) HB and BG \(\rightarrow \) HBA1C, all with strength equal to 1), and the new relationships found in the RASi only population are mostly different from those found in the Dropin population. In Fig. 2 the arcs that can be attributed to therapy are shown. Specifically, black solid lines represent relationships that are present in both the RASi only and Dropin populations, blue dashed lines represent relationships that are present in the RASi only population but not in the Dropin population, and red solid lines represent relationships that are present in the Dropin population but not in the RASi only population. When prior information is introduced, despite the high number of common whitelisted arcs, which also constrain the search, there are again differences that can be attributed to the therapies, as shown in Fig. 3. Most of them confirm the results obtained by a purely data-driven approach, but some new relationships also emerge. This suggests that expert prior information can guide and contribute to a better understanding of the interconnection network among the variables involved in the disease.
To evaluate whether expert knowledge merged with information extracted directly from the data can better identify the pattern of pathophysiology, we calculate the predictive accuracy of the BNs estimated from data with and without prior information in the different populations, in terms of the correlation between the observed and the predicted values for all the variables. This predictive accuracy is obtained by using 10-fold cross-validation [13], a model validation technique that assesses how accurately a statistical model predicts the behavior of new observations. For each variable we compute the correlation between the observed and predicted values; this quantity is called the predictive correlation. The predictive correlations for all the variables are reported in Table 4. For all the variables considered and in all populations, the predictive correlations of both the Data and the Data + Prior BNs exceed those of the Expert network, meaning that data can be a very valuable source of additional information for better understanding unknown mechanisms in DKD. Furthermore, by differentiating by therapy we can also identify specific directions for intervention: for example, the predictive correlation of CRP is about 0.2 for the RASi only population and about 0.4 for the Dropin population, meaning that the interconnections found in the latter BN better describe what influences the value of CRP.
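The predictive correlation of a single node can be sketched as follows. This is a minimal illustration under assumptions that are not stated in the paper: each node is modelled as a linear Gaussian function of its parents, predictions use the parents only, and the function name and synthetic data are hypothetical.

```python
import numpy as np


def predictive_correlation(data, target, parents, n_folds=10, seed=0):
    """Cross-validated correlation between observed and predicted values.

    The target column is regressed on its parent columns using the training
    folds, and predictions are made for the held-out fold.
    """
    n = data.shape[0]
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)
    predicted = np.empty(n)
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        X_train = np.column_stack(
            [np.ones(train.sum())] + [data[train, p] for p in parents])
        beta, *_ = np.linalg.lstsq(X_train, data[train, target], rcond=None)
        X_test = np.column_stack(
            [np.ones(len(test_idx))] + [data[test_idx, p] for p in parents])
        predicted[test_idx] = X_test @ beta
    return np.corrcoef(data[:, target], predicted)[0, 1]


# Synthetic example: column 1 depends on column 0; column 2 is pure noise.
rng = np.random.default_rng(1)
x0 = rng.normal(size=500)
x1 = 2.0 * x0 + rng.normal(size=500)
noise = rng.normal(size=500)
data = np.column_stack([x0, x1, noise])

r_informative = predictive_correlation(data, target=1, parents=[0])
r_uninformative = predictive_correlation(data, target=1, parents=[2])
```

In this toy setting an informative parent should yield a clearly higher predictive correlation than an unrelated one, which mirrors the way the values in Table 4 are compared across networks.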
4 Concluding Remarks
In this work we provide evidence that BNs are effective and efficient models for identifying and quantifying complex structures in medical practice and research. Specifically, by using averaged Bayesian network models for therapy-specific data we can provide an intuitive qualitative and quantitative description (in the form of a DAG) of the relationships that link the variables of the theoretical framework. Furthermore, this methodological strategy has the advantage of allowing prior expert knowledge, which is commonly available in clinical settings, to be integrated into model estimation. The results of the analysis highlight how the data can provide information that increases the knowledge of experts in finding complex relationships in the pathophysiological pathway of the disease. In this sense, data and experts are both complementary and collaborative: experts can corroborate what emerges from data, and data can help experts find new insights. Moreover, by examining the estimated structures in the two populations we should be able to identify differences that could be attributed to the specific therapy, in order to support the selection of appropriate interventions for patients treated with that therapy. Further research can improve the efficiency of the estimated models by adding new sets of variables (not strictly related to the pathophysiology perspective, such as risk factor medications, clinical readout features, family history information, etc.) or by moving to a BN classifier (or a BN-based predictive model) built on the main emergent relationships to derive a personalized probabilistic outcome.
References
1. Perco, P., Pena, M., Heerspink, H.J.L., Mayer, G.: Multimarker panels in diabetic kidney disease: the way to improved clinical trial design and clinical practice? Kidney Int. Rep. 4(2), 212–221 (2019)
2. Galicia-Garcia, U., et al.: Pathophysiology of type 2 diabetes mellitus. Int. J. Mol. Sci. 21(17), 6275 (2020)
3. Eder, S., et al.: A prospective cohort study in patients with type 2 diabetes mellitus for validation of biomarkers (PROVALID) - study design and baseline characteristics. Kidney Blood Press. Res. 43(1), 181–190 (2018)
4. Eder, S., et al.: Guidelines and clinical practice at the primary level of healthcare in patients with type 2 diabetes mellitus with and without kidney disease in five European countries. Diab. Vasc. Dis. Res. 16(1), 47–56 (2019)
5. Kerschbaum, J., et al.: Intra-individual variability of eGFR trajectories in early diabetic kidney disease and lack of performance of prognostic biomarkers. Sci. Rep. 10, 19743 (2020)
6. Koller, D., Friedman, N.: Probabilistic Graphical Models: Principles and Techniques. MIT Press, Cambridge (2009)
7. Scutari, M., Denis, J.B.: Bayesian Networks with Examples in R. Chapman & Hall, London (2014)
8. Scutari, M., Auconi, P., Caldarelli, G., Franchi, L.: Bayesian networks analysis of malocclusion data. Sci. Rep. 7(1), 15236 (2017)
9. Arora, P., Boyne, D., Slater, J.J., Gupta, A., Brenner, D.R., Druzdzel, M.J.: Bayesian networks for risk prediction using real-world data: a tool for precision medicine. Value Health 22(4), 439–445 (2019)
10. Shen, J., Liu, F., Xu, M., Fu, L., Dong, Z., Wu, J.: Decision support analysis for risk identification and control of patients affected by COVID-19 based on Bayesian networks. Expert Syst. Appl. 196, 116547 (2022)
11. Kitson, N.K., Constantinou, A.C., Guo, Z., Liu, Y., Chobtham, K.: A survey of Bayesian network structure learning. Artif. Intell. Rev. 56, 8721–8814 (2023). In press
12. Scutari, M., Nagarajan, R.: On identifying significant edges in graphical models of molecular networks. Artif. Intell. Med. 57, 207–217 (2013)
13. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York (2009). https://doi.org/10.1007/978-0-387-21606-5
Acknowledgments
The authors would like to acknowledge all the members of the DCren consortium and the ECLT for fruitful conversations and suggestions.
Funding
Funded by the European Union’s Horizon 2020 research and innovation programme under grant agreement No 848011 (“DCren”). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or of the granting authority. Neither the European Union nor the granting authority can be held responsible for them.
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2024 The Author(s)
About this paper
Cite this paper
Slanzi, D., Silvestri, C., Poli, I., Mayer, G. (2024). Exploiting the Potential of Bayesian Networks in Deriving New Insight into Diabetic Kidney Disease (DKD). In: Villani, M., Cagnoni, S., Serra, R. (eds) Artificial Life and Evolutionary Computation. WIVACE 2023. Communications in Computer and Information Science, vol 1977. Springer, Cham. https://doi.org/10.1007/978-3-031-57430-6_23
DOI: https://doi.org/10.1007/978-3-031-57430-6_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-57429-0
Online ISBN: 978-3-031-57430-6
eBook Packages: Computer Science, Computer Science (R0)