Introduction

BNs [1], consisting of Directed Acyclic Graphs (DAGs) and Conditional Probability Tables (CPTs), are a subset of probabilistic graphical models. This graphical modeling approach has found widespread applications in diverse domains, including social network analysis [2], biology [2, 3], and logistics planning. Representing a fusion of graph theory and probability theory, BNs utilize graph models to provide intuitive and structural representations of complex issues. In the real world, there are numerous uncertain problems, and the sources of these uncertainties encompass various aspects, such as incomplete information, measurement errors, randomness, variability, complexity, external environmental changes, and human subjective factors. Uncertainty problems can usually be solved through methods such as probability modeling and inference, robust design, fault detection and tolerance, optimization and decision-making.

Compared to other methods, BNs have a unified modeling framework, powerful inference capabilities, flexible updating and adaptability, interpretability and visualization, and the advantage of integrating domain knowledge when dealing with uncertainty problems. This makes BNs an important tool and method for handling uncertainty problems. For example, in control systems, various factors such as parameters and modeling uncertainties [4], external disturbances and noise [5], nonlinearity and complexity, as well as time delays and communication delays, introduce numerous uncertainties. BNs can provide a probabilistic modeling and inference tool to help deal with uncertainties in the system. They can more accurately represent and infer the state of the system and its uncertainties. They can also design control methods to address these uncertainties, such as adaptive control [6, 7], model predictive control [8, 9], nonlinear control [10], filtering and estimation [11, 12], and robust control [13, 14], among others. With the increasing demand for uncertainty modeling and probabilistic inference, BNs have been widely used in various fields, including risk assessment [15], fault diagnosis [16], decision systems [17], gene sequence analysis [18], biomedical image processing [19], and other areas. BN learning consists of parameter learning and structure learning. Before utilizing BN to tackle real-world problems, it is essential to construct the structure of the BN. The accuracy of this structure directly impacts the accuracy of parameter learning and inference results, making BN structure learning (BNSL) the foundation of parameter learning and a core problem in the learning process.

In the early stages, BN was constructed manually by experts, a process that was not only time-consuming and labor-intensive but also influenced by the subjective judgment of the experts, resulting in strong subjectivity and limitations. With the advancement in data volume and computational power, researchers have focused on methods for automatic learning from data. However, finding the optimal BN structure has been proven to be NP-hard [20]. Additionally, the presence of noise and missing data exacerbates the uncertainty and unreliability in the process of learning the structure. Consequently, efficiently and accurately constructing BN structures from data has become one of the most challenging tasks.

Existing BNSL methods can be categorized into three types: constraint-based (CB), score-and-search (SS), and hybrid approaches.

CB methods represent the structure learning problem as a constraint satisfaction problem. These algorithms examine the relationships of conditional independence (CI) between different variables to construct the network structure. Some notable algorithms in this category are the PC algorithm [21], grow-shrink (GS) algorithm [22], and IAMB algorithm [23]. The PC algorithm laid the foundation for the development of CB methods. However, the output of the PC algorithm depends on the order in which variables are processed, an effect that becomes more pronounced in high-dimensional settings and leads to highly variable results. Therefore, many researchers have studied and improved the PC algorithm. Li et al. [24] mitigated the issue of high-order CI tests by introducing the FEPC algorithm. Additionally, some scholars have addressed the instability caused by the PC algorithm’s dependency on the order of nodes and proposed the PC-stable algorithm [25], PC-parallel [26], the PC-MI algorithm [27], and other algorithms [28]. CB methods can handle a wide range of data types and distributions, and they are computationally efficient and highly interpretable. However, the accuracy of the learning process depends on the number of CI tests performed and the size of the conditioning sets. They are sensitive to CI tests and data noise, and high-order CI tests are unreliable for large networks and complex data.

SS methods design scoring functions to assign scores to all possible structures and utilize search algorithms to find the optimal structure. SS methods transform the problem of structure learning into a combinatorial optimization problem. Representative scoring functions include BDeu [29], AIC [30], BIC [31], and others [32]. Most BNSL methods search within the space of DAGs. However, the search space of DAGs grows exponentially as the number of nodes increases. Some greedy search algorithms [33, 34] have low efficiency and slow convergence. Therefore, researchers have explored various heuristic algorithms to improve search speed, such as genetic algorithms [35], particle swarm algorithms [36], and bee algorithms. SS methods may still be subject to search boundaries and space limitations. Additionally, scoring functions exhibit score equivalence, which means that the resulting network structure may still be affected by the MEC problem and display a significant number of reversed edges.

Hybrid approaches combining the above two approaches have gradually become the mainstream of research in BNSL. Hybrid approaches leverage the advantages of both approaches. They first reduce the search space by conducting CI tests and then utilize an SS method for learning. Some notable algorithms in this category are the classic max-min hill-climbing (MMHC) algorithm [37] and SaiyanH [38]. Hybrid approaches often have high computational complexity and can be sensitive to the initial network structure. The efficiency of hybrid algorithms depends on the search strategy employed, and in certain instances, it may not be feasible to find the exact global optimum.

In our work, our main focus is on the PC algorithm because it has broad applicability and high scalability. The literature has shown that CB algorithms can efficiently learn sparse graphs with hundreds or thousands of variables [39]. We aim to address two challenges. First, the quality of the directed graph learned by constraint-based methods is highly influenced by the order of variable pairing and the order in which the conditioning sets used to test conditional independence are selected. Second, data noise can strongly interfere with these algorithms: data-driven algorithms often perform poorly when they encounter randomness caused by small sample sizes or a large number of negative samples.

Curriculum learning (CL), as advocated by Bengio [40], suggests that models should initially learn from easier samples and gradually transition to more complex ones. From a data perspective, the CL strategy can adaptively adjust the weights of different samples, effectively reducing the negative impact on the learning process caused by challenging samples. In subsequent years, many researchers have developed CL strategies for specific applications, such as weakly supervised object localization [41], object detection, and neural machine translation [42]. These studies have shown compelling benefits of CL in small-batch sampling. Existing research has also started to investigate the integration of BNSL and CL [43], but the problem of reducing the adverse effects of difficult samples in the learning process has not been thoroughly addressed. Additionally, within CL itself, the issues of uneven sample distribution and the propagation of learning errors across different curriculum stages have not been adequately handled.

To address these challenges, we propose a more robust mechanism. We first introduce a multi-perspective influence estimation method to assess the degree of interaction among network nodes. We then divide the learning process into different curriculum stages based on the outcome. We employ a progressive weighting and staged learning approach to acquire the network structure. This involves assigning different curriculum weights based on the learning outcomes at each stage and dynamically adjusting the network structure to reduce the impact of negative samples and noisy data on the algorithm. Additionally, we propose effectively utilizing causal relations among data samples to correct the network structure, thereby mitigating the influence of MEC and improving the robustness of the learned network structure.

Contributions. Our work proposes a new method for progressively learning BN structures. The specific contributions of this article are summarized as follows:

1. To mitigate noise interference between samples and account for the stability of individual nodes, as well as the strength of mutual influences between nodes, we propose a multi-perspective influence assessment method. This method aims to evaluate the difficulty level of node learning.

2. Based on this assessment, we divide the learning process into different curriculum stages. By considering multiple perspectives and taking into account both individual node stability and the strength of mutual influences, our method provides a comprehensive evaluation of the learning difficulty of each node, allowing us to design a more effective curriculum for structure learning.

3. In order to prevent error propagation across different stages of the curriculum, we integrate the learning outcomes from each stage and adaptively adjust the final network structure. By combining the learned structures from different curriculum stages, we can effectively utilize the strengths and knowledge obtained at each stage to improve the overall learning performance and reduce the impact of potential errors.

4. In Sect. Causal correction mechanism, we investigate the use of hidden compact representation (HCR) [44] for BNs and propose a causal correction mechanism to capture potential causal relationships between variables and accordingly refine the network structure. This mechanism utilizes causal learning to address the influence of MEC and improve the accuracy and interpretability of the learned BNs.

5. Finally, in Sect. Experiment, we report extensive experiments on different standard datasets. Compared with the PC, HC, MMHC, and NOTEARS [45] algorithms, PCCL-CC obtains DAGs with better accuracy, namely a lower Structural Hamming Distance (SHD).

Preliminaries

Definition 1

Bayesian network BN is represented by a pair \((G,\Theta )\), where G and \(\Theta \) indicate the structure and the conditional probability distributions, respectively. Let \(G = (V,E)\) be a DAG structure, where \( V = \{X_1, \ldots , X_n \}\) is the set of n nodes, each node \(X_i\) corresponding to a random variable of the BN, and \(E = \{(X_1,X_2),\ldots ,(X_i,X_j)\}\) represents the set of directed edges that describes the conditional dependencies over V.

BNs can represent the joint probability distribution of a large set of variables and analyze the relationships between them. Under the local Markov assumption, each node \(X_i\) is independent of its non-descendants given its parents. The joint probability distribution can therefore be factorized as the product of each node's conditional probability given its parents, following the chain rule of probability:

$$\begin{aligned} p(x)=\prod _{i=1}^n p(x_i\mid pa_i) \end{aligned}$$
(1)

where \( pa_i \) represents the set of parents of node \(x_i\).
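To make the factorization in (1) concrete, the following minimal sketch (Python, with a hypothetical three-node chain \(X_1 \rightarrow X_2 \rightarrow X_3\) and illustrative CPT values that are not taken from any real network) evaluates a joint probability as the product of each node's conditional probability given its parents.

```python
# Minimal illustration of Eq. (1): p(x) = prod_i p(x_i | pa_i)
# for a hypothetical binary chain X1 -> X2 -> X3 (illustrative CPT values only).
p_x1 = {0: 0.6, 1: 0.4}
p_x2_given_x1 = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}  # key: (x2, x1)
p_x3_given_x2 = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.5, (1, 1): 0.5}  # key: (x3, x2)

def joint(x1, x2, x3):
    """Joint probability via the BN factorization: p(x1) * p(x2|x1) * p(x3|x2)."""
    return p_x1[x1] * p_x2_given_x1[(x2, x1)] * p_x3_given_x2[(x3, x2)]

# p(X1=1, X2=1, X3=0) = 0.4 * 0.8 * 0.5 = 0.16
print(joint(1, 1, 0))
```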

Definition 2

Mutual Information Entropy (MI) MI is the amount of information contained in one random variable about another random variable, or the reduced uncertainty of one random variable owing to another random variable being known.

Let the joint distribution of two random variables (X, Y) be p(x, y), and let the marginal distributions be p(x) and p(y). The MI I(X;Y) is the relative entropy of the joint distribution p(x, y) with respect to the product of the marginal distributions p(x)p(y), viz.

$$\begin{aligned} I(X;Y) = \sum _{x\in X}^{} \sum _{y\in Y}^{}p(x,y)\log \dfrac{p(x,y)}{p(x)p(y)} \end{aligned}$$
(2)

Definition 3

Information Entropy Information entropy [46] represents the average amount of information in the information flow. For information composed of several discrete sources, the information entropy can be represented by the mean of the negative logarithm of the source probabilities as

$$\begin{aligned} H(X) = E[-\log {p_i}]= -\sum _{i=1}^{n}p_i\log {p_i}. \end{aligned}$$
(3)

where H is the information entropy and \( p_i \) represents the probability of each source symbol. For a source with distribution P(x) over its values x [47], the information entropy can equivalently be expressed as

$$\begin{aligned} H(X)&= E_{x\sim X}[-\log P(x)] = \sum _{x}^{}P(x)\log \frac{1}{P(x)}\nonumber \\&= -\sum _{x}^{}P(x)\log P(x) \end{aligned}$$
(4)
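Since the entropy of Eq. (4) and the MI of Eq. (2) are later used to order the curriculum, the sketch below (a minimal Python estimate from frequency counts, assuming discrete samples stored as 1-D arrays) shows how both quantities can be computed empirically.

```python
import numpy as np
from collections import Counter

def entropy(column):
    """Empirical entropy H(X) = -sum_x P(x) log P(x) of a discrete sample."""
    counts = np.array(list(Counter(column).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def mutual_information(x, y):
    """Empirical MI I(X;Y) = sum_{x,y} P(x,y) log[P(x,y) / (P(x) P(y))]."""
    n = len(x)
    pxy, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum((c / n) * np.log(c * n / (px[a] * py[b])) for (a, b), c in pxy.items())

# Toy usage on synthetic binary samples: y copies x 80% of the time.
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=1000)
y = np.where(rng.random(1000) < 0.8, x, 1 - x)
print(entropy(x), mutual_information(x, y))
```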

Definition 4

Conditional independence Two random variables X and Y are conditionally independent given S, denoted by Ind(X, Y|S), if \(p(x, y|s)= p(x|s)p(y|s)\) for all values \(x\in X\), \(y\in Y\), and \(s\in S\) such that \(p(s)>0\), where X, Y, and S are the domains of x, y, and s, respectively.

Definition 5

d-separation Given a causal graph \(G=(V,E)\), an undirected path \(\rho \) between two distinct vertices \(X \in V\) and \(Y \in V\) given a conditioning set \(S \subseteq V {\setminus } \{X, Y\}\) is open if (i) every collider of \(\rho \) is in S or has a descendant in S, and (ii) no other node of \(\rho \) is in S. If a path is not open, it is blocked. Two variables X and Y are d-separated given a conditioning set S, denoted by \( X \perp Y|S\), if every path between them is blocked given S.

Definition 6

Markov Property Given a DAG G and the joint probability distribution P of all nodes, if \(X_i\) and \(X_j\) being d-separated by S implies \(X_i \perp X_j|S\) in P, then P is said to satisfy the global Markov property with respect to G.

The Markov property is a commonly used assumption in the construction of graphical models. When a distribution over a graph is “Markovian”, it indicates that the graph can model certain specific independencies in the distribution. These independencies can be utilized for efficient computation or data storage.

The converse of this property is faithfulness.

Definition 7

Faithfulness Consider a distribution P and a DAG G. If \(X_i \perp X_j|S\) in P implies that \(X_i\) and \(X_j\) are d-separated by S in G, then P is faithful to G.

Consider the following example. Suppose the probability distribution P over a DAG G is Markovian and faithful with respect to G; we then determine the structure G from the following conditions.

1. \(X \perp Z\) (variables: X, Y, Z)

2. \(X \perp Y|Z\) (variables: X, Y, Z)

Fig. 1 Markov equivalence class

The structures corresponding to the two conditions are shown in Fig. 1. The first condition yields the unique collider combination, the V-structure. As noted above, the collider structure makes X and Z independent, but when conditioned on Y, X and Z become dependent. More than one structure satisfies the second condition; these three structures form a MEC, as listed below.
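Concretely, the three Markov-equivalent structures satisfying condition 2 are the two chains and the fork, all of which encode exactly the independence \(X \perp Y|Z\):

$$\begin{aligned} X \rightarrow Z \rightarrow Y, \qquad X \leftarrow Z \leftarrow Y, \qquad X \leftarrow Z \rightarrow Y \end{aligned}$$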

Definition 8

Markov Equivalence class (MEC) If DAGs G and H have the same d-separation properties, then G and H are Markov equivalent and belong to the same MEC. If G and H are Markov equivalent, then they have the same skeleton and the same V-structures (also known as collider structures), and vice versa.

Definition 9

Bayesian Information Criterion (BIC) BIC [31] uses the log-likelihood to measure the degree of fit between the structure and the data on the premise that the samples satisfy the assumption of independent and identical distribution. The BIC scoring function is as follows:

$$\begin{aligned} BIC(S|D) =&\sum _{i=1}^{n}\sum _{j=1}^{q_i}\sum _{k=1}^{r_i} m_{ijk}\log \theta _{ijk}\nonumber \\&\quad -\frac{1}{2}\sum _{i=1}^{n} q_i(r_i-1)\log m \end{aligned}$$
(5)

where S represents the BN structure composed of variables \(\{X_1,\ldots ,X_n\}\), \(q_i\) indicates the number of possible configurations of the parents of the variable \(X_i\), \(r_i\) is the number of values of the variable \(X_i\), m is the total number of samples, and \(m_{ijk}\) is the number of samples in which the parents of \(X_i\) take their j-th configuration and \(X_i\) takes the value k. \(\theta _{ijk}=\frac{m_{ijk}}{m_{ij}}\) is the maximum-likelihood conditional probability, with \(0\le \theta _{ijk}\le 1\) and \(\sum _{k}\theta _{ijk} = 1\).

The first component of BIC corresponds to the logarithm of the optimal likelihood, denoted as the likelihood function value, for a given model S. This term assesses the compatibility of the model structure S, with the observed data D. The subsequent term serves as a penalty, mitigating the impact of model complexity to prevent overfitting.

BIC serves as a pivotal criterion for model selection, offering an evaluation of a model’s quality within the context of a specific dataset. It integrates maximum likelihood estimation with a penalty factor addressing the complexity inherent in the model. This integration seeks an equilibrium to circumvent the potential pitfall of overfitting. Consequently, BIC facilitates the identification of the most suitable model for a given dataset, considering both the adequacy of fit and the intricacies of the model.

In the realm of BN structures, with the score defined in (5), a larger BIC score signifies a better model: the maximal BIC score corresponds to the optimal equilibrium between the fitness of the model and its complexity. BIC excels by incorporating considerations of model complexity into the model selection process, thereby acting as a safeguard against overfitting.
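As an illustration of Eq. (5), the following sketch (a hypothetical helper, assuming a discrete dataset stored as a NumPy array with one column per variable) computes the BIC contribution of a single node for a candidate parent set; summing this quantity over all nodes gives the score of a structure S.

```python
import numpy as np
from collections import defaultdict

def bic_node(data, i, parents):
    """BIC contribution of node X_i for a given parent set (cf. Eq. 5):
    sum_{j,k} m_ijk * log(theta_ijk) - 0.5 * q_i * (r_i - 1) * log(m)."""
    m = data.shape[0]
    r_i = len(np.unique(data[:, i]))
    q_i = int(np.prod([len(np.unique(data[:, p])) for p in parents])) if parents else 1

    counts = defaultdict(lambda: defaultdict(int))      # counts[parent_config][child_value]
    for row in data:
        counts[tuple(row[p] for p in parents)][row[i]] += 1

    log_lik = 0.0
    for child_counts in counts.values():
        m_ij = sum(child_counts.values())
        for m_ijk in child_counts.values():
            log_lik += m_ijk * np.log(m_ijk / m_ij)     # theta_ijk = m_ijk / m_ij
    return log_lik - 0.5 * q_i * (r_i - 1) * np.log(m)

# Toy usage: score of X2 with parent set {X0} on a random discrete dataset.
rng = np.random.default_rng(1)
data = rng.integers(0, 3, size=(500, 3))
print(bic_node(data, 2, [0]))
```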

BN structure learning based on PCCL-CC algorithm

PCCL-CC is based on the concepts of CL and causal correction. We propose a framework for asymmetric weighted integration. By measuring the difficulty of learning each node from the data samples using a multi-perspective difficulty estimation method, the curriculum stages can be divided reasonably, and we assign varying weights to them. By learning progressively, the algorithm can effectively reduce the interference of sample noise. Additionally, asymmetric weight distribution and integration can be utilized to adaptively adjust the learned structure using constraints. This approach helps to avoid the impact of sampling and of errors at individual curriculum stages on the overall learning effect. Finally, traditional BNSL algorithms cannot distinguish structures within a MEC; the PCCL-CC algorithm therefore utilizes a causal correction mechanism to investigate potential causal relationships in the data, thereby mitigating the influence of MEC on the algorithm from the perspective of causality. We obtain the complete BN structure, as depicted in Fig. 2.

Fig. 2 Overall architecture of the algorithm model

The algorithm is roughly divided into three stages:

Stage 1. Division of curriculum stages: The initial node for this process is selected by calculating the entropy of the dataset. The node with the smallest entropy \(H(X_i)\) is chosen as the initial node for the curriculum stage. The next curriculum node is selected using \(MI(X_i, X_j)\), which identifies the node with the strongest correlation among the candidate nodes and each node in the curriculum.

Stage 2. Weight allocation and edge constraints: Different weights are allocated to the various stages of the learning process, integrating the structures learned at each stage of the curriculum and adaptively removing network structures with low reliability.

Stage 3. Causal correction: Using the HCR model to discover potential causal relationships between data samples and modify the network structure.

Division curriculum stage

A curriculum is a sequence of training criteria \(C=\{Q_1,..., Q_t,..., Q_T\}\) on T training steps. Each criterion \(Q_t\) is a reweighting of the target training distribution P(z):

$$\begin{aligned} Q_t(z) \propto W_t(z)P(z) \quad \forall \, \text {example } z \in \text {training set } D \end{aligned}$$
(6)

such that the following three conditions are satisfied:

(1) The entropy of the distributions gradually increases:

    $$\begin{aligned} H(Q_t) < H(Q_{t+1}). \end{aligned}$$
    (7)
(2) The weight \(W_t(z)\) is monotonically increasing in t, i.e.,

    $$\begin{aligned} W_t(z) < W_{t+1}(z) \quad \forall z\,\in \,D. \end{aligned}$$
    (8)
(3) \(Q_T(z) = P(z)\).

Taking inspiration from the structural characteristics of BN, we integrate CL into BNSL problems. In accordance with the BN structural properties, we categorize the nodes representing samples in the BN into distinct curriculum stages. This approach aligns with the inherent features of BN and facilitates a more systematic and structured learning process. Figure 3 displays the curriculum-matching mechanism constructed by the BN.

Fig. 3 The curriculum-matching mechanism constructed by the BN

In accordance with condition (7), the complexity of acquiring the structure between sample nodes is gauged by assessing sample entropy within the dataset, leading to the segmentation of curriculum stages.

We introduce a novel approach that combines information entropy H and mutual information MI to assess the interaction among network nodes from diverse vantage points.

In the context of entropy, it is established that the lower the overall probability of an event, the greater the quantity of information it encapsulates. Put simply, information entropy serves as a metric for the uncertainty associated with variables. When applied to nodes within the network, lower information entropy signifies reduced uncertainty pertaining to the nodes, which facilitates easier learning. Motivated by the concept of moving from shallow to deep, the initial node \(X_i\) must satisfy the following condition:

$$\begin{aligned} X_{i}^{*} = \arg \min _{X_i \in X} H(X_i) \end{aligned}$$
(9)

The nodes in a BN have interdependent relationships, and MI can measure the reduction in uncertainty about one node's value given another node's value, which meets condition (7). From the perspective of BNSL, if there is a directed edge between two nodes, knowing the state of one node provides substantial information about the other, which eliminates more uncertainty and implies a stronger correlation. In other words, a higher MI between two variables means an edge connecting them is more likely. However, the existence of the edge still cannot be determined with confidence. Thus, Li et al. [48] introduce the constraint given by (10) to make the relationship between a pair of variables more explicit.

$$\begin{aligned} I(X;Y) \geqslant \alpha _{MI} \cdot \min (MMI(X),MMI(Y)) \end{aligned}$$
(10)

where \(\alpha _{MI}\) is a binding parameter with a value in the range (0,1), and MMI(X) denotes the maximum mutual information (MMI) between node X and all other nodes in the structure.

If the value of MI between two nodes satisfies (10), the nodes are believed to be strongly correlated, and an edge connecting them should be considered. In previous works, the value of \(\alpha _{MI}\) was predefined or changed dynamically [49]. For a fixed parameter, a smaller value would lead to an insufficient restriction on the search space, while a larger value could exclude correct edges from the candidate structures, resulting in low accuracy. In our work, we set the MI constraint value \(\alpha _{MI}\) to 0.5 in order to ensure the stability of the learning effectiveness.

Therefore, the selection of subsequent curriculum nodes is based on the mutual influence obtained sequentially. By calculating the MI between curriculum nodes and candidate nodes, we can estimate the strength of their correlation. The node with the highest MI is selected as the next node in the curriculum. The curriculum nodes to be learned in the next stage should satisfy:

$$\begin{aligned} X^* \leftarrow \arg \max _{X_i \in C_i,X_j \in H_i} MI(X_i,X_j) \end{aligned}$$
(11)

where \(C_i\) is the curriculum set and \(H_i\) is the candidate set. We define the CL step size t as the number of nodes to be newly learned in the next stage. The t nodes that meet the condition are added, and the curriculum set is updated as follows:

$$\begin{aligned} C_i = C_i \cup \{X_1^*,X_2^*,\ldots ,X_t^*\} \end{aligned}$$
(12)
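A minimal sketch of the stage-division procedure (Eqs. (9)–(12)) follows; the helper names and the fallback used when no candidate passes the constraint of Eq. (10) are our own illustrative choices, not part of the original algorithm specification.

```python
import numpy as np
from collections import Counter

def _entropy(col):
    p = np.array(list(Counter(col).values()), dtype=float)
    p /= p.sum()
    return float(-(p * np.log(p)).sum())

def _mi(x, y):
    n, pxy, px, py = len(x), Counter(zip(x, y)), Counter(x), Counter(y)
    return sum((c / n) * np.log(c * n / (px[a] * py[b])) for (a, b), c in pxy.items())

def divide_curriculum(data, t, alpha_mi=0.5):
    """Seed the curriculum with the minimum-entropy node (Eq. 9), then repeatedly add
    the t candidates with the strongest MI link to the current curriculum (Eq. 11),
    subject to the pairwise constraint of Eq. (10)."""
    n_vars = data.shape[1]
    mi = np.zeros((n_vars, n_vars))
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            mi[i, j] = mi[j, i] = _mi(data[:, i], data[:, j])
    mmi = mi.max(axis=1)                                   # MMI(X) over all other nodes

    start = int(np.argmin([_entropy(data[:, i]) for i in range(n_vars)]))
    curriculum, stages = [start], [[start]]
    candidates = set(range(n_vars)) - {start}
    while candidates:
        scores = {}
        for j in candidates:
            i_best = max(curriculum, key=lambda i: mi[i, j])
            if mi[i_best, j] >= alpha_mi * min(mmi[i_best], mmi[j]):   # Eq. (10)
                scores[j] = mi[i_best, j]
        chosen = sorted(scores, key=scores.get, reverse=True)[:t] or sorted(candidates)[:t]
        curriculum += chosen
        stages.append(chosen)
        candidates -= set(chosen)
    return stages

# Toy usage on a synthetic discrete dataset with 6 variables.
rng = np.random.default_rng(2)
data = rng.integers(0, 2, size=(300, 6))
print(divide_curriculum(data, t=2))
```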

Weight allocation and edge constraints

Condition (8) emphasizes the weight allocation between different curriculum stages, assigning higher weights to the initial, simpler curriculum stages. From the perspective of data classification, CL can effectively reduce the impact of noisy data in the samples. Gong et al. [50] hypothesize that training data with noisy or incorrect annotations leads to a deviation between the training distribution and the test distribution, demonstrating the denoising mechanism of CL strategies on real-world datasets. To explain the rationality of the model, we combine the characteristics of BNSL to further provide a theoretical explanation of the CL working mechanism proposed by Gong [50]. In this work, we mainly study the establishment of the BN structure. We assume that the data distribution of the BN is \(P_{target}\) and the distribution we train on is \(P_{train}\), from which we can obtain a simulated curriculum formulation. First, we express \(P_{target}\) as a weighted form of \(P_{train}\):

$$\begin{aligned} P_{target} = \frac{1}{\alpha ^*}W_{\lambda ^*}(x)P_{train} \end{aligned}$$
(13)

where \(0\le W_{\lambda ^*}(x) \le 1\) and \(\alpha ^*=\int W_{\lambda ^*}(x)P_{train}(x)dx\) is the normalization factor. Under the weight \(W_{\lambda ^*}\), \(P_{target}\) corresponds to the defined curriculum. In the low-confidence region of \(P_{target}\), where complex samples are located, smaller weights should be given, and in the high-confidence region, where easy samples are located, larger values (close to 1) should be given. Therefore, we can rewrite the above equation as:

$$\begin{aligned} P_{train}(x) = \alpha ^*P_{target}+(1-\alpha ^*)E(x) \end{aligned}$$
(14)

where E(x) denotes the distribution represented by \(P_{train}(x)\) under the weights \((1-W_{\lambda ^*}(x))\); E(x) thus captures the deviation between \(P_{target}(x)\) and \(P_{train}(x)\). Within high-confidence regions, the weights \((1-W_{\lambda ^*}(x))\) approach 0, so E(x) contributes little and the deviation is small. Conversely, within low-confidence regions, E(x) exerts a more substantial influence on \(P_{train}(x)\). The precision of edge judgments between nodes directly influences the magnitude of the deviation between the actual structure and the constructed structure, with more accurate judgments resulting in smaller deviations.

Therefore, the following curriculum sequence is constructed through theory:

$$\begin{aligned} Q_{\lambda }(x) = \alpha _{\lambda }P_{target}(x)+(1-\alpha _{\lambda })E(x) \end{aligned}$$
(15)

where \(\alpha _{\lambda }\) varies with the curriculum stage from 1 to \(\alpha ^*\). Correspondingly, the curriculum \(Q_{\lambda }\) simulates the transition from \(P_{target}\) to \(P_{train}\), where \(Q_{\lambda }\) is represented as

$$\begin{aligned} Q_{\lambda }(x)\propto W_{\lambda }(x)P_{train} \end{aligned}$$
(16)

and,

$$\begin{aligned} W_{\lambda }(x)\propto \frac{\alpha _\lambda P_{target}(x)+(1-\alpha _\lambda )E(x)}{\alpha ^* P_{target}(x)+(1-\alpha ^*)E(x)} \end{aligned}$$
(17)

where \(0\le W_{\lambda }(x)\le 1\) after its maximum value is normalized to 1. In the initial stage of this CL process, \(W_{\lambda }(x)\propto \frac{P_{target}(x)}{P_{train}(x)}\): the weight is high in high-confidence regions and much smaller in low-confidence regions owing to heavy-tailed behavior. As the number of curriculum stages \(\lambda \) increases, the weights of high-confidence regions decrease while the weights of low-confidence regions increase, so the weight distribution becomes more uniform and changes more slowly. After \(W_{\lambda }(x)\) is normalized to the interval [0,1], its value increases monotonically with \(\lambda \). Therefore, this meets the weight-increase criterion defined for the curriculum in condition (8).
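A small numerical illustration of Eqs. (15)–(17) under toy assumptions (a hypothetical four-value discrete domain split into high- and low-confidence regions, and a linear \(\alpha _\lambda \) schedule of our own choosing): after normalizing each row to a maximum of 1, the weights of the low-confidence region rise toward 1 as the stage index grows, consistent with condition (8).

```python
import numpy as np

# Illustrative two-region toy distributions (not taken from any real dataset).
p_target = np.array([0.45, 0.45, 0.05, 0.05])   # mass concentrated on easy samples
e_dev    = np.array([0.05, 0.05, 0.45, 0.45])   # deviation distribution E(x)
alpha_star = 0.6                                # final mixing coefficient alpha*
p_train = alpha_star * p_target + (1 - alpha_star) * e_dev   # Eq. (14)

for alpha_l in np.linspace(1.0, alpha_star, 5): # alpha_lambda moves from 1 to alpha*
    q_l = alpha_l * p_target + (1 - alpha_l) * e_dev          # Eq. (15)
    w_l = q_l / p_train                                       # Eq. (17), unnormalized
    w_l /= w_l.max()                                          # normalize max weight to 1
    print(np.round(w_l, 3))
# The last two entries of each printed row (low-confidence region) increase toward 1,
# while the first two (high-confidence region) stay at the maximum weight 1.
```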

For BNSL problems, we assign different weights to the edges learned in different curriculum stages. To further reduce uncertainty during each sampling process, we introduce the concept of integrated optimization: the original data are resampled and an ensemble learning model is constructed every 15 iterations. By assigning a weight matrix set W to the edges of the computed networks, we can determine the accumulated weight \(W_{e_{ij}}\) of each edge. The final network structure is adaptively adjusted based on these weights to enhance the reliability of the learning outcomes.

We adaptively adjust the edge weights \(W_{e_{ij}}\) based on the learning outcomes at the different integration stages; the weight of each edge \(e_{ij}\) is defined as follows:

$$\begin{aligned} W_{e_{ij}} = \sum ^{n}_{k=1}\sum ^{m}_{s=1}N_{ks}(e_{ij}) \end{aligned}$$
(18)

where n denotes the number of optimization rounds, s indexes the curriculum stages (m in total), and \(N_{ks}(e_{ij})\) denotes the number of occurrences of the edge \(e_{ij}\) between \(X_i\) and \(X_j\) in curriculum stage s after k iterations.

To ensure the accuracy of the learned DAG and reduce the search space, we impose constraints on the learned edges. We compute the average weight \({avg}_{W}\) and set the threshold \(\alpha \). If \(W_{e_{ij}}<\alpha \cdot {avg}_{W}\), we set \( e_{ij}=0 \); otherwise, we set \(e_{ij}=1\). The initial DAG is then traversed, and any edge with \(e_{ij}=0 \) is deleted. The specific constraints are given as follows.

$$\begin{aligned} e_{ij} = {\left\{ \begin{array}{ll} 1,&{}W_{e_{ij}}\geqslant \alpha \cdot {avg}_{W}\\ 0,&{}W_{e_{ij}}<\alpha \cdot {avg}_{W} \end{array}\right. } \end{aligned}$$
(19)
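A minimal sketch of the ensemble weighting and pruning step in Eqs. (18)–(19), assuming the structures learned in each optimization round and curriculum stage are available as 0/1 adjacency matrices (the helper name and the toy data are illustrative only).

```python
import numpy as np

def prune_edges(stage_adjacencies, alpha=0.8):
    """Accumulate edge occurrence counts over all optimization rounds and curriculum
    stages (Eq. 18), then keep only edges whose accumulated weight reaches
    alpha * avg_W, the average weight of the edges that appeared at all (Eq. 19)."""
    w = sum(adj for round_ in stage_adjacencies for adj in round_)   # W_{e_ij}
    present = w > 0
    avg_w = w[present].mean() if present.any() else 0.0
    return (w >= alpha * avg_w) & present

# Toy usage: 3 optimization rounds x 2 curriculum stages of noisy 4-node structures.
rng = np.random.default_rng(3)
true_edges = np.zeros((4, 4), dtype=int)
true_edges[0, 1] = true_edges[1, 2] = 1
runs = [[np.clip(true_edges + (rng.random((4, 4)) < 0.15).astype(int), 0, 1)
         for _ in range(2)] for _ in range(3)]
print(prune_edges(runs).astype(int))
```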

The PCCL-CC algorithm pseudo-code is shown as Algorithm 1.

Algorithm 1 DAG = PCCL-CC(Data, n, X, t, \(\alpha \))

Causal correction mechanism

BNs model the relationships between causal variables as probabilistic relationships, constructing the causal network structure by considering the independence and conditional dependence between variables. However, this class of methods suffers from the MEC problem.

To distinguish members of a MEC, existing methods introduce causal function models. Common causal function models are easily perturbed by discrete categorical data and can still be challenged by MEC. Presently, most discrete causal discovery models can only be applied to ordered discrete variables, and their assumptions are easily violated for unordered categorical causal relationships.

We propose applying the HCR model to BNs and correcting the causal directions of the learned directed edges in the model. The main principle is to identify the correct causal direction within the compressed space of causal directions. The details are shown in Fig. 4. By modeling these two stages, we can derive the likelihood of the model and its maximum likelihood estimation method.

Proposition 1

Given a joint distribution \(P(X) = P(X_1,X_2,\) \(\dots ,X_p)\) generated from a multivariate HCR model with graph G, let \(E_{ij}: X_i \rightarrow Z_{ij} \rightarrow X_j\) be an edge in G. If \(|Z_{ij}|= 1\), then \(X_i \perp X_j\).

Proof

If \(|Z_{ij}|= 1\), then any values \(x_j\) of \(X_j\) and \(x_i\) of \(X_i\) satisfy \(P(X_j = x_j|X_i = x_i)\) \(= P(X_j = x_j|Z_{ij}) = P(X_j = x_j)\), which means that \(X_i \perp X_j\).

Proposition 1 provides an alternative way of detecting independence in the causal graph, yielding a more effective and more robust score-based independence detection method. This method is used in our structure learning algorithm as a correction phase.

Causal relationship correction is performed on each edge (X, Y) in the learned BN structure. The compression mapping mechanism and the probability mapping mechanism divide \(X \rightarrow Y\) into two stages, so that the conditional independence \(X \bot Y|Y^{'}\) is satisfied. Given the observational data for the edge (X, Y), the likelihood of \(X \rightarrow Y^{'} \rightarrow Y\) is obtained as:

$$\begin{aligned} L(M;D) =&\log \prod ^{m}_{i=1}\sum _{y^{'}_{i}}P(X=x_{i},Y^{'} = y^{'}_{i},Y = y_{i}) \nonumber \\ =&\log \prod ^{m}_{i=1}\sum _{y^{'}_{i}}P(X=x_{i})P(Y^{'} = y^{'}_{i}|X=x_{i})\nonumber \\&\quad P(Y = y_{i}|Y^{'} = y^{'}_{i}) \nonumber \\ =&\log \prod ^{m}_{i=1}P(X=x_{i})P(Y = y_{i}|Y^{'} = f(x_{i})) . \end{aligned}$$
(20)

where \(P(Y^{'} = y^{'}_{i}|X=x_{i})\) takes the value 1 when \(y'_i=f(x_i)\) and 0 otherwise, which collapses the sum over \(y'_i\).

The parameters in M include f and \(\theta \), where \(\theta \) comprises the parameters of P(X) and \(P(Y|Y')\). By introducing a penalty term on the number of parameters, the complexity of the model is kept from growing too high while overfitting is avoided [44]. The optimal score corresponding to the edge (X, Y) is given by:

$$\begin{aligned} Score_{(X,Y)}&= \sup _f \max _{\theta }L(\theta ;D) - \frac{n_p}{2}\log m \nonumber \\&=\sup _f \max _{\theta }\log \prod ^{m}_{i=1}P(X=x_{i})P(Y = y_{i}|Y^{'} = f(x_{i})) - \frac{n_p}{2}\log m \nonumber \\&=\sup _f \max _{\theta }\log \prod _{x}{\hat{a}}_{x}^{n_{x}}\prod _{y'}\prod _{y}{\hat{b}}_{y',y}^{n_{y',y}} - \frac{n_p}{2}\log m. \end{aligned}$$
(21)
Fig. 4 Causal correction mechanism for the BN model

where \(n_p=({|X |}-1)+|Y^{\prime } |(|Y|-1)\) is the effective number of parameters in the model, \(n_{x}=\sum _{i=1}^{m}I(x_{i}=x)\) and \(n_{y',y}=\sum _{i=1}^{m}I(y'_{i}=y',y_{i}=y)\) denote the corresponding sample frequencies, and \({\hat{a}}_{x}={\hat{p}}(X=x)=\frac{n_{x}}{\sum _{x}n_{x}}\) and \({\hat{b}}_{y',y}={\hat{p}}(Y=y|Y'=y')=\frac{n_{y',y}}{\sum _{y}n_{y',y}}\) denote the maximum likelihood estimates of \(P_{X}\) and \( P_{Y|Y'}\), respectively.

By comparing the scores of the edge (X, Y) and the reverse edge (Y, X) in the network structure, the correct direction of the edge can be determined.

$$\begin{aligned} {\left\{ \begin{array}{ll} X \rightarrow Y ,&{} Score_{(X,Y)}>Score_{(Y,X)} \\ X \leftarrow Y ,&{} \text {otherwise} \end{array}\right. } \end{aligned}$$
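A minimal sketch of the direction-scoring step in Eqs. (20)–(21) for a single discrete edge. The choice of f here is a greedy approximation of the supremum over f (each x value is mapped to its most frequent y value), which is our own simplification rather than the exact HCR optimization; the decision rule then keeps the direction with the higher penalized likelihood.

```python
import numpy as np
from collections import Counter, defaultdict

def hcr_score(x, y):
    """Penalized likelihood of X -> Y' -> Y (cf. Eqs. 20-21), with a greedy f that
    maps each x value to its most frequent co-occurring y value, so that x values
    sharing the same conditional mode collapse into one hidden state y'."""
    m = len(x)
    cond = defaultdict(Counter)
    for xi, yi in zip(x, y):
        cond[xi][yi] += 1
    f = {xi: cnt.most_common(1)[0][0] for xi, cnt in cond.items()}   # greedy f: X -> Y'
    n_x = Counter(x)
    n_yy = Counter((f[xi], yi) for xi, yi in zip(x, y))              # n_{y', y}
    n_yprime = Counter(f[xi] for xi in x)

    log_lik = sum(n * np.log(n / m) for n in n_x.values())           # sum_x n_x log a_x
    log_lik += sum(n * np.log(n / n_yprime[yp]) for (yp, _), n in n_yy.items())
    n_p = (len(n_x) - 1) + len(set(f.values())) * (len(set(y)) - 1)  # effective parameters
    return log_lik - 0.5 * n_p * np.log(m)

def orient(x, y):
    """Keep X -> Y if its score exceeds that of the reverse direction."""
    return "X -> Y" if hcr_score(x, y) > hcr_score(y, x) else "Y -> X"

# Toy usage: Y is a noisy function of a 4-state cause X through a 2-state hidden Y'.
rng = np.random.default_rng(4)
x = rng.integers(0, 4, size=2000)
y = np.where(rng.random(2000) < 0.9, (x >= 2).astype(int), 1 - (x >= 2).astype(int))
print(orient(x, y))   # typically prints "X -> Y" for this synthetic pair
```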

Assumption 1

Consider a causal relation \(X_{P_i} \rightarrow X_i\) in the multivariate HCR model, where \(X_{P_i} \) is the set of parents of \(X_i\). For each \(X_k \in X_{P_i}\), the conditional distribution \(P(X_i|X_{P_i})\) is random in the sense that there do not exist values \(x_i \ne x_i'\) such that \(P(X_i=x_i|X_k=x_k) = C \cdot P(X_i=x_i'|X_k=x_k)\) holds for all \(x_k \in X_k\) with some constant C.

Theorem 1

The directions of edges in a BN are distinguishable: under certain conditions, the conditional distribution of a correctly directed edge has the following property.

There do not exist values \(y_1\ne y_2\) such that \(P(Y=y_1|X)\) equals \(P(Y=y_2|X) \cdot c\) for all possible values of X, where c is a constant. Consequently, in the reverse direction there exists no \(X^{'} = {\hat{f}}(Y) \) with \(|X^{'}|<|Y|\) such that \(P(X|Y) = P(X|X')\) for all possible X and Y.

Proof

For a correctly directed edge, we have \(P(X,Y)=P(X)P(Y|X)\). Assume that there exists such an \(X'={\hat{f}}(Y)\) satisfying \(P(X|Y) = P(X|X{'})\). Then \(P(X,Y) = P(Y)P(X|X')\), and hence

$$\begin{aligned} P(X|X{'}) = \frac{P(X)P(Y|X)}{P(Y)}. \end{aligned}$$
(22)

However, since \(|X'|<|Y|\), there must exist two values \(y_1 \ne y_2\) such that \({\hat{f}}(y_1) = {\hat{f}}(y_2)\), which implies \(P(X|{\hat{f}}(y_1)) =P(X|{\hat{f}}(y_2))\). So we have

$$\begin{aligned} \frac{P(X)P(Y=y_1|X)}{P(Y=y_1)}=\frac{P(X)P(Y=y_2|X)}{P(Y=y_2)}. \end{aligned}$$
(23)

which contradicts the property stated in Theorem 1. Therefore, a causal pair in the reverse direction does not admit a low-cardinality hidden representation.

Theorem 2

Assume that in the causal direction there exists a transformation \(Y' = f(X)\) such that \(P(Y|X) = P(Y|Y')\), where \(|Y'| < |X|\). Then, to produce the same distribution P(X, Y), the reverse direction must involve a larger effective number of parameters than the correct direction.

Essentially, Theorem 2 shows that if each pair has the low-cardinality property and Assumption 1 holds, then we can identify the causal relationship even in the multivariate case. Based on Theorem 2, we can further assume faithfulness and conclude that the causal structure is also identifiable, owing to the identifiability of the HCR model and the faithfulness assumption.

The pseudo-code for causal correction is shown as Algorithm 2.

Algorithm 2 DAG = CC(Data, DAG)

Time and space complexity analysis

In this section, we analyze the time and space complexities of the PCCL-CC algorithm, which is composed of several parts. Let n be the number of network nodes and m be the sample size. During the curriculum stage division, we need to compute the entropy value H(X) of each node to select the initial node. Since we need to calculate H(X) for all n nodes, the time complexity is O(n). In the subsequent curriculum stage division, we need to calculate the MI between each curriculum node and each candidate node, which involves at most \(O(n^2)\) pairs. The time complexity of calculating the MI value between a pair of nodes is polynomial and is denoted mic(m, r), where r is the maximum number of possible values of any variable. Thus, the time complexity of calculating the MI values between all nodes is \(O(n^2\cdot mic(m,r))\). Sorting all the MI values with quick sort adds \(O(n^2\log n)\). While constructing the initial skeleton according to the curriculum stages, we compute the weight of each edge in the weight matrix set and add constraints, resulting in a time complexity of \(O(n^2)\). Hence, the time complexity of constructing the initial DAG is \(O(n^2)\). Assuming that the maximum number of iterations is \(Max_{iter}\), the causality correction performed on the learned BN structure has a time complexity of \(O(n^2)\). Therefore, the overall time complexity of the algorithm is \(O(mic(m,r)\cdot n^2)\). The space complexity of the algorithm remains \(O(n^2)\): the curriculum division and weight distribution use a hill-climbing, local search procedure, which can find a reasonable solution within a bounded space at constant additional space cost.

Experiment

All experiments were performed on a computer equipped with an Intel(R) Core(TM) i7-1165G7 CPU at 2.80 GHz and 16 GB of memory; the development environment was PyCharm 2021.1.2.

Data sets Four standard networks of different sizes were selected for the comparative experiments, and samples of different sizes were generated by sampling. The details are shown in Table 1.

Comparison methods We compared our method against the HC, PC [21], MMHC [37], PC-stable [51], PC-parallel [26], and NOTEARS [45] algorithms.

Table 1 The classical BN data structures
Fig. 5 Performance comparison of the PCCL-CC algorithm and other algorithms on the standard networks

Evaluation indices We evaluate the proposed PCCL-CC approach in terms of the quality of the learned network structures, using the following measurements:

(1) Structural Hamming Distance (SHD): SHD is a standard distance used to compare graphs through their adjacency matrices. It counts the differences between two (binary) adjacency matrices: every edge that is missing, extra, or reversed with respect to the target graph is counted as an error. The smaller the SHD, the fewer incorrect edges are learned and the better the learning effect.

(2) F1-score: A performance measure that combines precision and recall. In BNs, the F1-score helps evaluate classification performance and provides a comprehensive assessment of the predictive ability on positive and negative classes. The higher the F1-score, the more accurately the algorithm learns the network structure.

(3) False discovery rate (FDR): The proportion of all discovered edges that are incorrect or reversed in direction. FDR helps identify potential false discoveries. The smaller the FDR, the lower the error rate of the learned network structure and the better the algorithm's performance.

Using the SHD, F1-score, and FDR to measure the quality of the learned BNs provides a comprehensive evaluation, assessing the structure of the network, classification accuracy, and the performance of feature selection from different perspectives.
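For reference, a small sketch of how the three indices can be computed when the learned and true DAGs are given as binary adjacency matrices over the same node ordering; the bookkeeping of reversed edges follows the common convention (one error per reversal) and may differ slightly from the exact implementation used in the experiments.

```python
import numpy as np

def compare_graphs(learned, true):
    """SHD, F1-score and FDR for directed graphs given as 0/1 adjacency matrices."""
    learned, true = np.asarray(learned, dtype=bool), np.asarray(true, dtype=bool)
    reversed_ = learned & true.T & ~true        # present but oriented the wrong way
    extra     = learned & ~true & ~true.T       # edge absent from the true graph
    missing   = true & ~learned & ~learned.T    # true edge absent in either direction
    tp = (learned & true).sum()                 # correctly oriented edges

    shd = int(extra.sum() + missing.sum() + reversed_.sum())
    precision = tp / max(learned.sum(), 1)
    recall = tp / max(true.sum(), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    fdr = (extra.sum() + reversed_.sum()) / max(learned.sum(), 1)
    return shd, f1, fdr

# Toy usage: true chain 0 -> 1 -> 2; the learned graph reverses one edge and adds one.
true = np.zeros((3, 3), dtype=int);    true[0, 1] = true[1, 2] = 1
learned = np.zeros((3, 3), dtype=int); learned[0, 1] = learned[2, 1] = learned[0, 2] = 1
print(compare_graphs(learned, true))   # SHD = 2 (one reversed edge + one extra edge)
```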

Performance comparison between PCCL-CC algorithm and other algorithms under different datasets

In the experiments, we compared our algorithm with other structure learning algorithms, namely HC, PC, MMHC, and NOTEARS. For each group, we used ten sets of data and took the average of the ten results. To verify the effectiveness of the algorithm, we used the F1-score and SHD to evaluate the generated networks.

Figure 5 displays the changes in SHD and F1-score for the five algorithms on the four standard datasets. As shown in the figure, our proposed PCCL-CC algorithm outperforms the other structural learning algorithms on smaller networks, such as Asia, with sample sizes less than 3k. When the sample size is 3k, HC and PC algorithms also achieve good learning effects. For the Sachs network, the PCCL-CC algorithm demonstrates better results compared to other algorithms. For the larger and more complex network structures of Child and Alarm, the PCCL-CC algorithm has a significant advantage in learning effects, as evidenced by higher indicators compared to other algorithms.

We have discovered that our proposed PCCL-CC algorithm is effective in performing BN structure learning on networks of different sizes. For smaller networks with fewer nodes, the PCCL-CC algorithm can provide a more comprehensive description of the network and accurately identify the relationships between nodes with a smaller sample size. For larger and more complex networks with multiple nodes, the PCCL-CC algorithm outperforms other algorithms. As the number of nodes in the data set increases, traditional BNSL algorithms rely more on relationships between nodes, making them more susceptible to noise data and resulting in more learning errors. In contrast, incremental learning through stages enables the distribution of nodes to provide more accurate judgments of higher-weight dependencies and gradually learn more complex dependencies, thereby reducing the occurrence of incorrect results in network learning.

Impact of curriculum size on networks of different sizes

The stage division in the algorithm is affected by the network scale, which determines the step length of the learning nodes and the size of the learned network. Therefore, we conducted experiments to test the effect of the curriculum size (\(N_c\)) on the algorithm's performance. For the four standard networks, we set up a contrast experiment from \(N_c\) = 1 to \(N_c\) = 20; to examine the influence of curriculum size on large networks, for the Alarm network we extended the range from \(N_c\) = 1 to \(N_c\) = 30. The evaluation index used for comparison is SHD. For each group, we compared the experimental results by taking the average of ten sets of data, as shown in Fig. 6.

Fig. 6 Performance comparison of the number of curriculum stages on the standard datasets

Table 2 Influence of different \(\theta \) values on each index

Figure 6 shows the changes in SHD for the PCCL-CC algorithm on the four standard networks with different \(N_c\). Based on the figure, it is clear that for small network models, such as Asia and Sachs, the SHD decreases as \(N_c\) increases and the network nodes are divided into different stages. However, beyond a certain point, excessive segmentation of the network nodes leads to an increase in SHD: it leaves only one or two nodes in each CL stage, preventing the BNSL algorithm from effectively identifying dependencies between nodes. When \(N_c\) increases further, the subsequent stages only revisit the original curriculum, causing the SHD to flatten out. For the medium-sized Child network, the network nodes are gradually subdivided as \(N_c\) increases, and the SHD decreases accordingly. This reflects how learning the network structure in stages can effectively reduce errors and improve the reliability of the algorithm. However, when \(N_c\) increases beyond a certain point, excessive segmentation again leads to an increase in the SHD. For the large-scale Alarm network, we set up a contrast experiment from 1 to 30 and found that as \(N_c\) increases, the SHD first decreases and then shows an increasing trend. When \(N_c\) reaches 12, the SHD is minimal, reflecting how the division of the network scale affects the BNSL algorithm. For the four standard network structures, the performance is best when \(N_c\) approaches the degree of the network.

The curriculum stages should not be excessively segmented, as this can cause many key nodes to be missed and prevent the identification of nodes with complex dependencies, ultimately resulting in a large number of errors. Therefore, the curriculum stage division should be set in accordance with the degree of the network for effective performance.

Influence of the threshold setting in the edge constraint

Experiments were performed on the Asia dataset with a sample size of 1000 to determine the effect of the threshold \(\theta \) on the experimental results; the results are shown in Table 2.

Table 2 reveals that the choice of \(\theta \) has a limited impact on the final result, as the value can be selected within a certain range. When \(\theta \) is set too small, more redundant edges are introduced into the network. Conversely, when \(\theta \) is set to a large value, fewer edges are selected and added to the network. This exclusion of potentially correct edges can result in a small number of correct edges and severe edge omissions, which in turn leads to a relatively high SHD. Therefore, a comprehensive analysis and reasonable selection of \(\theta \) are necessary to accurately learn the network structure while retaining more correct edges, which alleviates both the extra-edge and missing-edge problems.

Ablation study

Tailoring our approach to the learning dynamics across distinct integration stages, we dynamically adjusted edge weights to systematically assess the efficacy of both the curriculum learning strategy and the causal correction mechanism in enhancing BNSL algorithms. Ablation experiments were conducted, wherein we juxtaposed the SHD of the classic PC algorithm against variations incorporating curriculum learning (PCCL), causal correction (PCCC), and a combined approach integrating both curriculum learning and further causal correction (PCCL-CC). This comparative evaluation was executed across four benchmark datasets.

Table 3 Ablation experiments on four standard datasets
Fig. 7 Effect of curriculum learning and the causal correction mechanism on algorithm performance

Table 3 reveals varying degrees of enhancement in the PC algorithm upon integration with both the CL strategy and the causal correction mechanism. Notably, the performance improvement attributed to the CL strategy surpasses that of the causal correction mechanism. This discrepancy arises from the fact that the CL strategy predominantly mitigates the impact of noise during the learning process, whereas the causal correction mechanism primarily addresses the MEC problem. The synergistic application of these two strategies yields a substantial improvement of approximately 50% in the PC algorithm’s performance when learning the network structure.

Effect of the curriculum learning concept and the causal correction mechanism on BN structure learning

To verify the effectiveness of the CL concept and the causal correction mechanism on BN structure learning, we compared the PC algorithm, the PCCL algorithm (incorporating CL), the PCCL-CC algorithm (incorporating both CL and the causal correction mechanism), and other PC-like algorithms on the four standard datasets. For each group, we used ten sets of data and took the average of the ten results.

Figure 7 compares the performance of the PCCL-CC algorithm, PCCL (with the addition of the CL framework), PC, and other PC-like algorithms (including PC-stable and PC-parallel) on the SHD and FDR metrics. From the graph, it is evident that for the Asia network, PCCL-CC and PCCL exhibit significant advantages when the sample size is less than 3k, indicating that the CL framework can effectively reduce the influence of negative samples in small-sample settings and improve the algorithm's performance. When the sample size is greater than 3k, the algorithm does not exhibit a significant advantage, as a sufficiently large sample size can counteract the noise interference. For the Sachs network, the PCCL-CC and PCCL algorithms perform significantly better in terms of SHD, with the PCCL-CC algorithm demonstrating a more pronounced advantage over the PCCL algorithm, suggesting that causal learning can improve the learned network structure to a certain extent. For networks with a large number of nodes and more complex structures, such as the Child and Alarm networks, the advantages of the PCCL-CC and PCCL algorithms are more pronounced than on smaller networks. They exhibit a much smaller SHD than the other algorithms, although the relatively worse FDR on the Alarm network is due to the threshold setting, which caused the algorithm to delete some correct edges.

By analyzing the results, it can be concluded that traditional BNSL algorithms are sensitive to sample noise and sample size, especially for networks with more complex node relationships. The CL framework effectively mitigates the impact of negative samples in the data by weighting and adapting them, improving the algorithm's robustness. The comparison between the PCCL-CC and PCCL algorithms shows that the causal correction mechanism can improve the learning effect of the algorithm to a certain extent. However, due to the limitations of the original sample types of BNs, no dramatic advantage was observed, which is a direction for future improvement.

Conclusion

In this study, we propose the PCCL-CC algorithm, which combines the concepts of CL and causal discovery. It introduces a method for measuring the strength of connections between nodes from multiple perspectives. The method divides the curriculum stages and utilizes an asymmetric weighting mechanism to minimize the impact of sample noise on the algorithm. Based on this, causal learning methods are used to discover potential causal relationships and minimize the impact of MEC. The experiments show that the PCCL-CC algorithm is less affected by sample noise and can obtain more accurate network structures compared to other algorithms.

However, our algorithm still has certain limitations. For example, it has higher complexity than traditional CB methods when dealing with large datasets, requiring more computational resources. It is also sensitive to the initial curriculum stage and lacks a feedback-updating mechanism across the curriculum stages. One direction for future work is to utilize the learning outcomes across curriculum stages to guide the algorithm's own learning process.