1 Introduction

During the past decades, Supervised Learning (SL), which exploits labeled datasets to achieve promising learning performance, has been the mainstream approach to classification in machine learning. Various methods and theoretical models have been presented, such as k-nearest neighbors [1], support vector machine [2], logistic regression [3] and decision tree [4]. Among them, Bayesian network classifiers (BNCs) are powerful tools for graphically describing statistical knowledge and performing inference under conditions of uncertainty. The dependency relationships among the predictive attributes \(\mathcal {X}=\{X_{1}, \cdots , X_{n}\}\) and the class variable Y are described graphically in a directed acyclic graph (DAG). Ideally, the joint probability distribution corresponding to the topology of the DAG should fit the data best, and it can be factorized into a set of conditional probabilities. However, existing BNCs learned under SL cannot always deliver the desired performance. The learned topology may overfit the training dataset while underfitting the testing instances, which results in high variance and performance degradation.

Semi-supervised learning (SSL) is an effective and efficient way to incorporate the information learned from testing instances into that learned from the training dataset, and the resulting model is refined as more instances are introduced for learning. However, previous research [5,6,7] has shown that risky unlabeled instances, when incorporated into non-robust models, result in “noise propagation”, and the negative effect may lead to biased decision boundaries. SSL may then perform even worse than SL, which limits its scope of application to some extent. To address this issue, one feasible approach is to consider the impact of each unlabeled testing instance independently [8]. The impact, whether positive or negative, will then neither accumulate nor be transmitted to the next unlabeled instance.

Ever-increasing data quantities make ever more urgent the need for BNC learners that are highly scalable and perform significantly better in terms of classification [9]. Labeled training data may account for only a small portion of massive scientific data, so the network topology \({\mathscr{B}}\) learned from training data can represent only a limited number of significant conditional dependencies [9]. The dependency relationships implicated in an unlabeled testing instance may differ greatly from those in \({\mathscr{B}}\); they are therefore better used to rebuild the network topology than to refine \({\mathscr{B}}\).

Log-likelihood function \(LL({\mathscr{B}}|\mathcal {D}) \) measures the number of bits needed to describe \(\mathcal {D}\) based on the learned BNC \({\mathscr{B}}\) [10]. The log likelihood has a statistical interpretation: the higher the log likelihood, the closer \({\mathscr{B}}\) is to modeling the probability distribution in the data \(\mathcal {D}\). Thus \(LL({\mathscr{B}}|\mathcal {D}) \) measures the extent to which the learned BNC \({\mathscr{B}}\) fits the training data \(\mathcal {D}\), and the conditional log-likelihood function LL(Xi|πi,Y ) can be used to identify directed conditional dependencies between attribute Xi and its parents πi. In this paper we argue that the knowledge learned from the labeled training dataset \(\mathcal {D}\) and that learned from an unlabeled testing instance x are complementary in nature. A variant of the conditional log-likelihood function is introduced to identify the conditional dependencies between attribute values in one single instance. On this basis, by pre-assigning each possible class label to the testing instance to make it complete, we apply a heuristic search strategy to build a restricted class of subclassifiers for modeling the testing instance. The final BNC, called the semi-supervised k-dependence Bayesian classifier (SSKDB), is an ensemble of subclassifiers: one general BNC\(_{{\mathscr{B}}}\) learned from the training dataset and a set of local BNCs learned from the testing instance. These subclassifiers are built independently but work as an ensemble to make the final prediction under the framework of semi-supervised learning.

The contributions of this paper are as follows:

  • The log-likelihood functions are proved to be theoretically effective for measuring the extent to which the learned topologies fit the data. We then propose to use conditional log likelihood to identify directed conditional dependencies, and apply a heuristic search strategy to learn BNCs from the training data \(\mathcal {D}\) and the labeled testing instance d, respectively. Prediction is determined by uniformly averaging the joint probability estimates.

  • We compare the performance of our algorithm, SSKDB, with other state-of-the-art classifiers on 40 datasets ranging in size from 57 instances to 164 thousand instances, and from 2 to 22 class labels. We show that SSKDB achieves comparable or lower error than a range of state-of-the-art BNCs (e.g., CFWNB, WATAN, FKDB, SKDB and IWAODE).

The following section reviews the background knowledge (i.e., learning BNC in the framework of supervised learning or semi-supervised learning). Section 3 introduces the details of our algorithm. Section 4 gives the comparisons with 6 state-of-the-art or recently proposed algorithms to verify the effectiveness of our algorithm on 40 UCI (University of California at Irvine) benchmark datasets. The final section draws conclusions and outlines some directions for further research.

2 Background theory and related research work

2.1 Supervised learning for Bayesian network classifiers

Under the supervised learning framework, a trained classification model uses the statistical knowledge it has learned to assign an appropriate class label to each testing instance. Researchers have proposed numerous methods to train a supervised model. The k-nearest neighbors classifier labels an instance by a plurality vote of its neighbors, assigning it the class most common among its k nearest neighbors [11]. Support vector machine assigns new examples to one of two categories, making it a non-probabilistic binary linear classifier [2]. Logistic regression transforms its output using the logistic sigmoid function to return a probability value [3]. Decision tree builds a flowchart-like structure in which each internal node represents a specific attribute and the path from the root node to a leaf node represents a chain of reasoning [4]. BNCs use a DAG to graphically represent probabilistic dependencies, from which the joint probability distribution can be recovered.

For a restricted BNC \({\mathscr{B}}\), the class variable Y is assumed to be the root node and the common parent of all predictive attributes. A BNC models the joint distribution P(x,y) and factorizes it according to the topology \(\mathcal {G}\) into a product of conditional probabilities as follows:

$$ P(\boldsymbol{x},y)=P(y)P(\boldsymbol{x}|y)=P(y)\prod\limits_{i=1}^{n} {P(x_{i}|{{\varPi}}_{i},y)}, $$
(1)

where πi denotes the parents of Xi. Each factor in P(x,y), i.e., P(xi|πi,y), is a categorical distribution. To achieve a trade-off between classification performance and structure complexity, BNCs need to identify a limited number of parents for each attribute. Among all BNCs, NB has the simplest topology: it takes the class variable as the only parent of the remaining (explanatory) variables in the DAG [12]. This model performs best when the dataset satisfies the assumption of conditional independence among predictors given the class.
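To make the factorization in (1) concrete, the following sketch estimates Laplace-smoothed conditional probability tables for a hypothetical 2-dependence structure and evaluates P(x,y) as the product in (1). The toy data, parent sets and function names are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch: estimating CPTs for a fixed restricted-BNC structure and
# evaluating P(x, y) = P(y) * prod_i P(x_i | Pi_i, y), following (1).
from collections import Counter

# Tiny hypothetical training data: each row is (x1, x2, x3, y), all values discrete.
data = [
    (0, 1, 0, 0), (0, 0, 1, 0), (1, 1, 0, 1),
    (1, 0, 1, 1), (0, 1, 1, 0), (1, 1, 1, 1),
]
n_classes, n_values = 2, 2

# Hypothetical k-dependence structure: parents of attribute i (class Y is always a parent).
parents = {0: (), 1: (0,), 2: (0, 1)}     # X1 <- Y;  X2 <- {X1, Y};  X3 <- {X1, X2, Y}

child_counts = Counter()   # counts of (i, x_i, parent values, y)
parent_counts = Counter()  # counts of (i, parent values, y)
class_counts = Counter()
for *x, y in data:
    class_counts[y] += 1
    for i, pa in parents.items():
        key_pa = tuple(x[j] for j in pa)
        child_counts[(i, x[i], key_pa, y)] += 1
        parent_counts[(i, key_pa, y)] += 1

def cond_prob(i, xi, pa_vals, y):
    """Laplace-smoothed estimate of P(x_i | Pi_i, y)."""
    return (child_counts[(i, xi, pa_vals, y)] + 1.0) / (parent_counts[(i, pa_vals, y)] + n_values)

def joint_probability(x, y):
    """P(x, y) = P(y) * prod_i P(x_i | Pi_i, y), as in (1)."""
    prob = (class_counts[y] + 1.0) / (len(data) + n_classes)
    for i, pa in parents.items():
        prob *= cond_prob(i, x[i], tuple(x[j] for j in pa), y)
    return prob

print([joint_probability((1, 1, 0), y) for y in range(n_classes)])
```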

Although NB is surprisingly effective, the implicit conditional dependencies between predictive attributes become significant as the amount of data increases and cannot be neglected, which makes the classification performance of NB degrade dramatically [13]. To retain the simplicity and direct theoretical foundation of NB, Hidden naive Bayes (HNB) creates a hidden parent for each attribute, which combines the influences from all other attributes [14]. Thus HNB avoids the computational complexity of learning an optimal BNC. Correlation-based Feature Weighting Naive Bayes (CFWNB) applies a correlation-based feature weighting (CFW) filter to NB, assigning greater weights to highly predictive features that are highly correlated with the class (maximum mutual relevance) yet uncorrelated with other features (minimum mutual redundancy) [15].

An efficient and effective approach to refining NB is to add augmented edges, relaxing NB's independence assumption by permitting each attribute to have parents other than the class variable while keeping the topology restricted [16]. Tree Augmented Naive Bayes (TAN) [17] applies conditional mutual information and minimum description length as the scoring function, and allows a restricted number of edges between the attributes. The weighted averaged TAN (WATAN) [18] constructs a set of maximum spanning trees, taking each attribute Xi in turn as the root, and applies the mutual information I(Xi;Y ) as the aggregation weight of the committee trees. The k-dependence Bayesian classifier (KDB) [19] uses a parameter k to control the number of interdependencies modeled and achieves a trade-off between bias and variance for different data quantities. Selective KDB (SKDB) [20] selects attribute subsets and the value of k in a single additional pass through the training data, discriminatively selecting a submodel of a full KDB classifier. Flexible KDB (FKDB) [10] uses the entropy function \(H_{B}(\mathcal {D})\) to roughly measure the robustness of the topology of a BNC, and a heuristic search strategy is introduced to explore the optimal topology.

A complex network topology for a single model may result in high variance and overfitting; in contrast, an ensemble of submodels with relatively simple structures can describe the conditional dependencies adequately and achieve a trade-off between bias and variance. The superparent one-dependence estimator (SPODE) uses a weaker attribute independence assumption than NB. The averaged one-dependence estimators (AODE) [21] averages the predictions of all qualified SPODEs, that is, SPODEs are treated equally and assigned the same weight. However, different SPODEs may not always play the same role when dealing with different classification problems; some are more important than others. Weighted average of one-dependence estimators (WAODE) [22] extends AODE by assigning different weights to the SPODEs of AODE. In WAODE, four different weighting criteria are applied, resulting in four versions, i.e., WAODE-MI, WAODE-ACC, WAODE-CLL and WAODE-AUC.
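For illustration, a minimal sketch of the AODE-style averaging described above follows; the toy data, frequency threshold and helper names are assumptions for the example, and the weighting refinements of WAODE/IWAODE are not included.

```python
# Illustrative AODE-style sketch: every attribute X_i acts in turn as a super-parent,
# and the estimates of the qualified SPODEs are uniformly averaged.
from collections import Counter

data = [(0, 1, 0, 0), (0, 0, 1, 0), (1, 1, 0, 1), (1, 0, 1, 1), (0, 1, 1, 0)]
n_attrs, n_values, n_classes, min_count = 3, 2, 2, 1

pair_counts = Counter()    # (i, x_i, y)
triple_counts = Counter()  # (i, x_i, j, x_j, y)
attr_counts = Counter()    # (i, x_i)
for *x, y in data:
    for i in range(n_attrs):
        attr_counts[(i, x[i])] += 1
        pair_counts[(i, x[i], y)] += 1
        for j in range(n_attrs):
            if j != i:
                triple_counts[(i, x[i], j, x[j], y)] += 1

def spode_estimate(x, y, i):
    """P(y, x_i) * prod_{j != i} P(x_j | y, x_i) for super-parent X_i (Laplace smoothed)."""
    est = (pair_counts[(i, x[i], y)] + 1.0) / (len(data) + n_classes * n_values)
    for j in range(n_attrs):
        if j != i:
            est *= (triple_counts[(i, x[i], j, x[j], y)] + 1.0) / (pair_counts[(i, x[i], y)] + n_values)
    return est

def aode_estimate(x, y):
    """Uniform average over SPODEs whose super-parent value is frequent enough."""
    qualified = [i for i in range(n_attrs) if attr_counts[(i, x[i])] >= min_count]
    if not qualified:                      # back off to all SPODEs if none qualifies
        qualified = list(range(n_attrs))
    return sum(spode_estimate(x, y, i) for i in qualified) / len(qualified)

print(max(range(n_classes), key=lambda y: aode_estimate((1, 1, 0), y)))
```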

2.2 Semi-supervised learning for Bayesian network classifiers

Larger data quantities call for scaling up existing learning algorithms, whereas supervised learning requires considerable effort, expertise and time to obtain labeled data during the training phase [23]. To address the lack of labeled data, semi-supervised learning (SSL) mines information from unlabeled data to refine the model learned in the supervised framework. Semi-supervised classification methods are especially effective on datasets with large amounts of unlabeled data and a small quantity of labeled data, and they can increase classification accuracy compared with the default procedure of supervised methods.

Numerous approaches have been proposed and studied in the framework of SSL. Graph-based semi-supervised learning methods [24] learn a graph over all instances, use non-negative weights on the edge between any two instances to measure their similarity, and then propagate labels from labeled instances to unlabeled ones based on this pairwise similarity. Blum and Chawla [25] regard semi-supervised learning as a graph min-cut problem and use pairwise relationships among instances to learn from both labeled and unlabeled data. Zhou et al. [26] propose a general regularization framework on directed graphs in which directionality and global relationships are considered. Generative models [24] assume that the data follow a determined parametric model, and one key issue is how to learn the corresponding joint probability. Jiang, Chen et al. propose signed probabilistic mixture models to detect overlapping communities in signed networks [27, 28]. Yang et al. [29] derive a variational Bayes EM algorithm to estimate the parameters and a variation-based approximate evidence to select an appropriate model. Semi-supervised support vector machines (S3VMs) [30, 31] assume that data points in high-density regions are more likely to share the same label; unlabeled data are used to adjust the decision boundary so that it lies in lower-density regions.

For BNC learning, Zheng et al. [32] present Subsumption Resolution (SR) to identify generalization-specialization relationships between parent and child attribute values, so that redundant attribute values are removed when instantiated, which helps improve the estimate of the conditional probability distribution for AODE. Zaidi et al. [33] demonstrate that SR and MI-weighting are complementary, and the resulting combined technique delivers computationally efficient low-bias learning well suited to big data. Jiang et al. [34] argue that the impact of the root attribute value on the class variable can be used as a more fine-grained weighting metric; the Kullback-Leibler (KL) measure and the information gain (IG) measure are introduced to compute the weights of the different SPODEs in AODE, and the resulting weighted AODE achieves a trade-off between full representation of ground-truth dependencies and high-confidence estimation of conditional probabilities. Duan et al. [9] propose instance-based weighting AODE (IWAODE), which applies an instance-based weighting filter to flexibly assign discriminative weights to each single SPODE for different test instances.

3 Semi-supervised learning for k-dependence Bayesian classifiers

3.1 Model selection

Given the topology of a Bayesian network, the conditional log-likelihood LL(Xi|πi,Y ) measures the number of bits needed to describe attribute Xi given its parents πi and is defined as follows,

$$ \begin{array}{ll} LL(X_{i}|{{\varPi}}_{i},Y)&= \underset{X_{i},{{\varPi}}_{i},Y}{\sum}P(x_{i},{{\varPi}}_{i},y)\log P(x_{i}|{{\varPi}}_{i},y). \end{array} $$
(2)

Given attributes {X1,X2,⋯ ,Xn} and the corresponding factorization of the joint probability distribution P(x,y) (see (1)), the log-likelihood function \(LL({\mathscr{B}}|D)\), defined as follows, measures the number of bits encoded in the network topology \({\mathscr{B}}\) learned from the training data \(\mathcal {D}\) [10].

$$ \begin{array}{ll} LL(\mathcal{B}|D)&= \underset{\mathbf{X},Y}{\sum}P(\mathbf{x},y)\log P_{\mathcal{B}}(\mathbf{x},y). \end{array} $$
(3)

Learning the topology \({\mathscr{B}}\) by maximizing the log-likelihood function \(LL({\mathscr{B}}|D)\) has obvious advantages in knowledge representation and data fitting. Ideally, \({\mathscr{B}}\) should represent the most significant dependencies between each Xi and its parents πi, and the corresponding joint probability distribution should fit the data as closely as possible.
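As a concrete illustration of (2) and (3), the sketch below computes LL(Xi|πi,Y ) and \(LL({\mathscr{B}}|D)\) from the empirical distribution of a toy discrete dataset; the data and the hypothetical parent sets are assumptions for the example, not taken from the paper.

```python
# Sketch of the scores in (2)-(3), computed from empirical probabilities of a toy dataset.
import math
from collections import Counter

data = [(0, 1, 0, 0), (0, 0, 1, 0), (1, 1, 0, 1), (1, 0, 1, 1), (0, 1, 1, 0), (1, 1, 1, 1)]
N = len(data)

def empirical(keys):
    """Empirical joint distribution over the selected columns (class is the last column)."""
    c = Counter(tuple(row[k] for k in keys) for row in data)
    return {v: cnt / N for v, cnt in c.items()}

def cond_log_likelihood(i, pa, class_col=-1):
    """LL(X_i | Pi_i, Y) = sum P(x_i, Pi_i, y) log P(x_i | Pi_i, y), as in (2)."""
    joint = empirical([i, *pa, class_col])
    marg = empirical([*pa, class_col])
    return sum(p * math.log(p / marg[key[1:]]) for key, p in joint.items())

def log_likelihood(parents, class_col=-1):
    """LL(B | D), computed as LL(Y) + sum_i LL(X_i | Pi_i, Y) (the decomposition in (4))."""
    p_y = empirical([class_col])
    ll = sum(p * math.log(p) for p in p_y.values())
    return ll + sum(cond_log_likelihood(i, pa, class_col) for i, pa in parents.items())

parents = {0: (), 1: (0,), 2: (0, 1)}   # a hypothetical 2-dependence structure
print(log_likelihood(parents))
```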

The SSKDB\(_{{\mathscr{B}}}\) learning procedure is shown in Algorithm 1.

(Algorithm 1: pseudocode figure)

Theorem 1

Let \(\mathcal {D}\) be a collection of N instances of {Y,X1,...,Xn}. The learning procedure shown in Algorithm 1 builds a network topology \({\mathscr{B}}\) that maximizes \(LL({\mathscr{B}}|D)\) and has time complexity \(\mathcal {O}(tn^{k}+m(nv)^{k}+n^{2}\log n)\).

Proof

We start with a reformulation of the log-likelihood obtained by applying the chain rule for joint probability:

$$ \begin{array}{@{}rcl@{}} LL(\mathcal{B}|D)\!&=&{\sum}_{\mathbf{X},Y}P(\mathbf{x},y)\log \left\{P(y)\prod\limits_{i=1}^{n}P(x_{i}|{{\varPi}}_{i},y)\right\}\\ &=&{\sum}_{Y}P(y)\log P(y) + \sum\limits_{i=1}^{n}\underset{X_{i},{{\varPi}}_{i},Y}{\sum}P(x_{i},{{\varPi}}_{i},y)\log P(x_{i}|{{\varPi}}_{i},y)\\ &=&LL(Y)+\sum\limits_{i=1}^{n} LL(X_{i}|{{\varPi}}_{i},Y),\\ &=&constant\ term+\sum\limits_{i=1}^{n} LL(X_{i}|{{\varPi}}_{i},Y), \end{array} $$
(4)

Thus, maximizing the log-likelihood function \(LL({\mathscr{B}}|D)\) is equivalent to maximizing the term \({\sum }^{n}_{i=1} LL(X_{i}|{{\varPi }}_{i},Y)\).

The attributes are divided into two groups: those already placed in the attribute order \({\mathscr{L}}\) and those still to be sorted. Attributes are added to \({\mathscr{L}}\) in turn, and each one may select candidate parents from \({\mathscr{L}}\) only. For a restricted BNC, the class variable is the common parent of all attributes. Suppose that \(X_{1}=\arg \max \limits LL(X_{i}|Y)\ (1\leq i\leq n)\); then X1 is the root node and is added to the attribute order \({\mathscr{L}}\). The initial network topology, containing only the directed edge \(Y\rightarrow X_{1}\), is guaranteed to describe this term. Then LL(Xi|πi,Y ) (i≠ 1) is computed and compared to select the second node X2, where \(X_{2}=\arg \max \limits LL(X_{i}|{{\varPi }}_{i},Y)\ (X_{i}\notin {\mathscr{L}},{{\varPi }}_{i}\subseteq {\mathscr{L}})\). Since only attribute X1 exists in \({\mathscr{L}}\), the directed edges \(\{X_{1},Y\}\rightarrow X_{2}\) are added to the network topology, and X2 is regarded as the second attribute and added to the order \({\mathscr{L}}\). By applying this heuristic search strategy to maximize each factor LL(Xi|πi,Y ) (1 ≤ i ≤ n) in (4) in turn, Algorithm 1 determines the attribute order and the parent-child relationships, and thus maximizes the log-likelihood function \(LL({\mathscr{B}}|D)\).

SSKDB\(_{{\mathscr{B}}}\) uses the topology of \({\mathscr{B}}\) to represent arbitrary k-dependence conditional dependencies implicated in the training data in the form of a high-order maximum weighted spanning tree. At training time, SSKDB\(_{{\mathscr{B}}}\) needs to generate a (k + 1)-dimensional table of co-occurrence counts for every combination of k attribute values and each class label. The resulting time complexity of generating this (k + 1)-dimensional table is \(\mathcal {O}(tn^{k})\), where n is the number of attributes and t is the number of data instances. To calculate the log-likelihood function \(LL({\mathscr{B}}|D)\), SSKDB\(_{{\mathscr{B}}}\) considers every combination of attribute value xi, parent set πi and class label y, giving time complexity \(\mathcal {O}(m(nv)^{k})\), where m is the number of classes and v is the maximum number of discrete values that any attribute may take. The time complexity of parent assignment is \(\mathcal {O}(n^{2}\log n)\). Thus the time complexity of building SSKDB\(_{{\mathscr{B}}}\) is \(\mathcal {O}(tn^{k}+m(nv)^{k}+n^{2}\log n)\). □
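Since Algorithm 1 is given only as a pseudocode figure, the sketch below illustrates the greedy search described in the proof: attributes are appended to the order \({\mathscr{L}}\) one at a time, each taking up to k parents from \({\mathscr{L}}\) so as to maximize LL(Xi|πi,Y ). The toy data and function names are assumptions for illustration; the authors' Algorithm 1 may differ in detail.

```python
# Sketch of the greedy order/parent search that maximizes sum_i LL(X_i | Pi_i, Y).
import math
from itertools import combinations
from collections import Counter

def empirical(data, keys):
    c = Counter(tuple(row[k] for k in keys) for row in data)
    n = len(data)
    return {v: cnt / n for v, cnt in c.items()}

def cond_ll(data, i, pa, class_col=-1):
    """LL(X_i | Pi_i, Y) as in (2), from empirical probabilities."""
    joint = empirical(data, [i, *pa, class_col])
    marg = empirical(data, [*pa, class_col])
    return sum(p * math.log(p / marg[key[1:]]) for key, p in joint.items())

def learn_structure(data, n_attrs, k):
    """Greedy attribute ordering and parent assignment, as in the proof of Theorem 1."""
    order, parents, remaining = [], {}, set(range(n_attrs))
    while remaining:
        best = None
        for i in remaining:
            # candidate parent sets: subsets of the already-ordered attributes, size <= k
            for size in range(min(k, len(order)) + 1):
                for pa in combinations(order, size):
                    score = cond_ll(data, i, pa)
                    if best is None or score > best[0]:
                        best = (score, i, pa)
        _, i, pa = best
        order.append(i)
        parents[i] = pa
        remaining.remove(i)
    return order, parents

data = [(0, 1, 0, 0), (0, 0, 1, 0), (1, 1, 0, 1), (1, 0, 1, 1), (0, 1, 1, 0), (1, 1, 1, 1)]
print(learn_structure(data, n_attrs=3, k=2))
```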

However, the conditional dependencies learned by Algorithm 1 may fit different instances to different extents. Take the dataset Localization for example: the attribute pair that maximizes the log-likelihood function is {X1,X3}, which denote “Sequence Name” and “X coordinate”, respectively. The estimates of the conditional probability distribution P(x1,x3|y) for different values of Y are shown in Fig. 1. The marked difference between the two distributions demonstrates significant variation in the conditional dependencies between attribute values. In this paper, we propose to further explore the possible conditional dependencies between attribute values: one more BNC is built independently to fully represent the dependency characteristics implicated in a specific instance x, and its network topology can adapt to fit that instance.

Fig. 1: Distribution of P(x1,x3|y) when (a) y = y0 and (b) y = y1 for dataset Localization

Similar to the definition of \(LL({\mathscr{B}}|D)\), the point-wise log-likelihood (PLL) function for an instance d = {x,y}, which measures the number of bits needed to encode d, is defined as

$$ \begin{array}{@{}rcl@{}} LL(d)&=&\log P(\mathbf{x},y) =\log P(y)+\sum\limits_{i=1}^{n}\log P(x_{i}|{{\varPi}}_{i},y)\\ &=&LL(y)+\sum\limits_{i=1}^{n} LL(x_{i}|{{\varPi}}_{i},y), \end{array} $$
(5)

Learning the topology based on the PLL function LL(d) has obvious advantages in knowledge representation and data fitting. Ideally, LL(d) should represent the most significant dependencies between each xi and its parents πi, and the corresponding joint probability distribution should fit d as closely as possible. Similar to the learning procedure of Algorithm 1, Algorithm 2 applies a heuristic search strategy to maximize each factor LL(xi|πi,y) (1 ≤ i ≤ n) in (5) in turn; the attribute order and the parent-child relationships are thereby determined, and the PLL function LL(d) is maximized correspondingly.

(Algorithm 2: pseudocode figure)

The learning process of SSKDB\(_{\ell}\) for instance di is shown in Algorithm 2.

As described in Algorithm 2, SSKDB\(_{\ell}\) learns a restricted class of subclassifiers from the pseudo-complete instance d = (x,y) for all possible class labels y, and represents the arbitrary k-dependence conditional dependencies implicated in it. That is, each subclassifier can fully describe the conditional dependencies between the attribute values in x given a specific class label.
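The sketch below illustrates the idea behind Algorithm 2 (again under illustrative assumptions, not the authors' exact pseudocode): for a pseudo-labelled instance d = (x,y), parents are chosen greedily to maximize each point-wise term log P(xi|πi,y) in (5), with the conditional probabilities estimated from the training data.

```python
# Sketch of per-instance structure learning: one topology ell_y per pre-assigned class label.
import math
from itertools import combinations

data = [(0, 1, 0, 0), (0, 0, 1, 0), (1, 1, 0, 1), (1, 0, 1, 1), (0, 1, 1, 0), (1, 1, 1, 1)]
n_attrs, n_values = 3, 2

def log_cond_prob(x, y, i, pa):
    """log P(x_i | Pi_i = x[pa], y), Laplace-smoothed from the training data."""
    num = sum(1 for row in data
              if row[-1] == y and row[i] == x[i] and all(row[j] == x[j] for j in pa))
    den = sum(1 for row in data
              if row[-1] == y and all(row[j] == x[j] for j in pa))
    return math.log((num + 1.0) / (den + n_values))

def learn_instance_structure(x, y, k):
    """Greedy order/parent selection maximizing each term of LL(d) in (5)."""
    order, parents, remaining = [], {}, set(range(n_attrs))
    while remaining:
        best = None
        for i in remaining:
            for size in range(min(k, len(order)) + 1):
                for pa in combinations(order, size):
                    score = log_cond_prob(x, y, i, pa)
                    if best is None or score > best[0]:
                        best = (score, i, pa)
        _, i, pa = best
        order.append(i)
        parents[i] = pa
        remaining.remove(i)
    return parents

x_test = (1, 1, 0)
subclassifiers = {y: learn_instance_structure(x_test, y, k=2) for y in (0, 1)}  # one ell_y per label
print(subclassifiers)
```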

Theorem 2

Let d be a complete instance, i.e., d = {x1,...,xn,y}. The learning procedure of Algorithm 2 builds a network topology that maximizes LL(d) and has time complexity \(\mathcal {O}(mn^{k}+n^{2}\log n)\).

Proof

Suppose that \(\ell\) is the subclassifier learned from the pseudo instance d = (x,y); then the estimate of the joint probability distribution P(x,y) for \(\ell\) is

$$ P(\mathbf{x},y|\ell) = P(y)\prod\limits_{i=1}^{n}P(x_{i}|{{\varPi}}_{i},y,\ell). $$
(6)

Similar to the proof of Theorem 1, the PLL function LL(d) can be factorized as follows,

$$ \begin{array}{@{}rcl@{}} LL(d)&=&\log P(y)+\sum\limits_{j=1}^{n}\log{P(x_{j}|{{\varPi}}_{j},y)}\\ &=&LL(y)+\sum\limits_{j=1}^{n}LL(x_{j}|{{\varPi}}_{j},y), \end{array} $$
(7)

Thus, the learned network topology should encode the maximum number of bits for describing d by maximizing the PLL function LL(d).

To compute the actual network structure of SSKDB\(_{{\mathscr{B}}}\), we need to consider all attribute values and all possible class labels, which requires \(\mathcal {O}(tn^{k}+m(nv)^{k}+n^{2}\log n)\) time. In contrast, to compute the network structure of SSKDB\(_{\ell}\) we only need to consider the attribute values in x and all possible class labels, which requires \(\mathcal {O}(mn^{k}+n^{2}\log n)\) time. □

3.2 Semi-supervised k-dependence Bayesian Classifier (SSKDB)

The proposed algorithm, SSKDB, contains two components (SSKDB\(_{{\mathscr{B}}}\) and SSKDB\(_{\ell}\)) and its learning framework is shown in Fig. 2. Firstly, SSKDB\(_{{\mathscr{B}}}\) learns from the training data \(\mathcal {D}\) by maximizing the log-likelihood function \(LL({\mathscr{B}}|\mathcal {D})\). Then, SSKDB\(_{\ell}\) learns from the pseudo-complete instance di = {x1,...,xn,yi} by maximizing LL(di), where yi is a class label pre-assigned to the test instance x = {x1,...,xn}. After training SSKDB\(_{\ell}\) and SSKDB\(_{{\mathscr{B}}}\), the final ensemble estimates the class membership probabilities P(x,yi) by averaging both predictions.

Fig. 2: Algorithm flowchart

The network topology learned from the training data \(\mathcal {D}\) may not correctly describe the “right” dependencies among attribute values in the testing instance x, thus the resulting BNC \({\mathscr{B}}\) may overfit \(\mathcal {D}\) while underfitting x, which leads to biased estimates of conditional probabilities. In contrast, there is only one true class label for the unlabeled instance x, and correspondingly only one subclassifier can describe the conditional dependencies among the attribute values in x with a high confidence level. The other subclassifiers, learned from wrongly labeled copies of x, will negatively affect the estimate of the joint probability distribution to some extent. The knowledge represented by \({\mathscr{B}}\) and \(\ell\) is complementary in nature, and when they work jointly the final ensemble can achieve better classification performance than either of them alone.

To further clarify the basic idea of the ensemble learning strategy, we take the dataset Localization as a case study. The dataset Localization has 5 attributes and 11 class labels. Suppose that x = {x1 = 1,x2 = 3,x3 = 17,x4 = 3,x5 = 3} is the instance to be classified, and the true class of x is y4. \({\mathscr{B}}\) learns the general conditional dependencies among attributes from the training data \(\mathcal {D}\) by maximizing the log-likelihood function \(LL({\mathscr{B}}|D)\), while \(\ell_i\) learns the local conditional dependencies among attribute values from the single instance di = (x,yi) by maximizing LL(di). Since the true class label of x is y4, we take the topology of \(\ell_4\) as the benchmark for comparison. The network topologies of \({\mathscr{B}}\) and \(\ell_4\) when k = 2 are shown in Figs. 3a and b, respectively.

Fig. 3: The network structures of \({\mathscr{B}}\) and \(\ell_4\) corresponding to the Localization dataset

By comparing the network topologies shown in Fig. 3a and b we can see that 7 conditional dependencies are represented in each of the topologies of \(\ell_4\) and \({\mathscr{B}}\), 14 in total. Among them, only 2 conditional dependencies in the topology of \(\ell_4\) are the same as those in \({\mathscr{B}}\), while 5 are different. Diversity has been recognized as one of the key characteristics of ensemble classifiers: it can help improve classification accuracy, increase robustness to variance, and reduce the overall sensitivity to different starting parameters and noise [35]. The topology of \({\mathscr{B}}\) can represent the general conditional dependencies among attributes implicated in the training data, but it cannot represent the local conditional dependencies among attribute values implicated in a specific testing instance. In contrast, the topology of \(\ell\) can represent local conditional dependencies, whereas the “noise” introduced by the uncertainty of the class label may reduce the confidence level of \(\ell\) [36]. SSKDB introduces diversity by training \(\ell_i\) (1 ≤ i ≤ m) to model the conditional dependencies between attribute values given different class labels. When the \(\ell_i\) and \({\mathscr{B}}\) work jointly, the estimate of the joint probability P(x,yi) is finely tuned by combining the outputs of the individual models. To make the final prediction, a linear combination of the individual models' results is used.

To reduce the negative effect of unreliable base probability estimates, we use η = 1 as the minimum sample size for statistical inference purposes [32]. If the frequency of the parent attribute values {πi,y} is lower than η for \(\ell_i\) or \({\mathscr{B}}\), SSKDB will exclude \(\ell_i\) or \({\mathscr{B}}\), respectively, when classifying an instance x. Thus,

$$ P(\mathbf{x},y_{j}) = P(y_{j})\prod\limits_{i=1}^{n}P(x_{i}|{{\varPi}}_{i},y_{j},F({{\varPi}}_{i},y_{j}) \geq \eta), $$
(8)

where F(πi,yj) is the frequency of the attribute values (πi,yj) in the training data. The linear combiner can be used for models that output real-valued numbers, so it is applicable to \({\mathscr{B}}\) and \(\ell_j\). The corresponding \({\mathscr{B}}\) and \(\ell_j\) are combined to describe the dependency relationships conditioned on class label yj. We then apply Bayes rule to classify as follows,

$$ \begin{array}{@{}rcl@{}} y^{*}&=&\arg\max_{y_{j}\in Y} P(\mathbf{x},y_{j})\\ &=& \arg\max_{y_{j}\in Y}P(y_{j})\frac{ {\prod}^{n}_{i=1}P(x_{i}|{{\varPi}}_{i},y_{j},\mathcal{B})+{\prod}^{n}_{i=1}P(x_{i}|{{\varPi}}_{i}^{\prime},y_{j},\ell_{j})}{\{F({{\varPi}}_{i},y_{j}) \geq \eta \bigwedge F({{\varPi}}_{i}^{\prime},y_{j}) \geq \eta\}},\\ \end{array} $$
(9)

where πi and \({{\varPi }}_{i}^{\prime }\) respectively denote the parent attributes of Xi in the topologies of \({\mathscr{B}}\) and \(\ell_j\). The classification procedure is presented in Algorithm 3.

(Algorithm 3: pseudocode figure)

Classifying a single instance requires considering the k-dependence topologies of \({\mathscr{B}}\) and \(\ell\), and then assigning the class label y to the instance x according to (9). Estimating the joint probability for both SSKDB\(_{{\mathscr{B}}}\) and SSKDB\(_{\ell}\) requires \(\mathcal {O}(nmk)\) time. Thus the time complexity of SSKDB for classification is \(\mathcal {O}(nmk)\).
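The following sketch puts (8) and (9) together for a toy problem: the joint-probability estimates from \({\mathscr{B}}\) and from the per-class model \(\ell_j\) are combined, and a component whose parent configuration occurs fewer than η times in the training data is excluded. The structures, data and averaging of the retained components are illustrative assumptions.

```python
# Sketch of the classification rule (9) with the eta back-off of (8).
import math

data = [(0, 1, 0, 0), (0, 0, 1, 0), (1, 1, 0, 1), (1, 0, 1, 1), (0, 1, 1, 0), (1, 1, 1, 1)]
n_values, n_classes, eta = 2, 2, 1

def parent_freq(x, y, pa):
    """F(Pi_i, y_j): frequency of the parent configuration in the training data."""
    return sum(1 for row in data if row[-1] == y and all(row[j] == x[j] for j in pa))

def cond_prob(x, y, i, pa):
    num = sum(1 for row in data
              if row[-1] == y and row[i] == x[i] and all(row[j] == x[j] for j in pa))
    return (num + 1.0) / (parent_freq(x, y, pa) + n_values)

def joint(x, y, parents):
    p_y = (sum(1 for row in data if row[-1] == y) + 1.0) / (len(data) + n_classes)
    return p_y * math.prod(cond_prob(x, y, i, pa) for i, pa in parents.items())

def classify(x, parents_B, parents_ell):
    """argmax_y of the combined estimates from B and ell_y, excluding unreliable components."""
    scores = {}
    for y in range(n_classes):
        components = []
        for parents in (parents_B, parents_ell[y]):
            if all(parent_freq(x, y, pa) >= eta for pa in parents.values()):
                components.append(joint(x, y, parents))
        scores[y] = sum(components) / max(len(components), 1)
    return max(scores, key=scores.get)

# Hypothetical structures, e.g. produced by the two learning sketches above.
parents_B = {0: (), 1: (0,), 2: (0, 1)}
parents_ell = {0: {0: (), 1: (0,), 2: (1,)}, 1: {0: (), 1: (0,), 2: (0, 1)}}
print(classify((1, 1, 0), parents_B, parents_ell))
```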

4 Experiments and results

In this section, we compare our algorithm with several state-of-the-art or recently proposed classifiers. A series of experiments was conducted in terms of zero-one loss, bias, variance, training time and classification time on 40 datasets from the UCI machine learning repository [37]. The description of these datasets, including the number of instances, attributes and classes, is shown in Table 1. The datasets were divided into two categories by size: datasets with fewer than 2000 instances are denoted as small and those with at least 2000 instances as large. For each dataset, numeric attribute values are discretized using Minimum Description Length (MDL) discretization [38]. Each algorithm is tested on each dataset using 10-fold cross validation.
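A minimal sketch of the evaluation protocol follows, assuming a hypothetical train/predict interface and that MDL discretization has already been applied; the majority-class learner is only a stand-in so that the example runs.

```python
# Sketch of 10-fold cross validation with zero-one loss.
import random
from collections import Counter

def zero_one_loss_cv(data, train_fn, predict_fn, n_folds=10, seed=0):
    """data: list of (x, y) pairs; train_fn(train) -> model; predict_fn(model, x) -> label."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    errors, total = 0, 0
    for fold in folds:
        held_out = set(fold)
        train = [data[i] for i in idx if i not in held_out]
        test = [data[i] for i in fold]
        model = train_fn(train)
        errors += sum(predict_fn(model, x) != y for x, y in test)
        total += len(test)
    return errors / total

# Trivial stand-in learner (predict the majority class) so that the sketch executes.
def train_majority(train):
    return Counter(y for _, y in train).most_common(1)[0][0]

toy = [((i % 2, (i // 2) % 3), i % 2) for i in range(40)]    # made-up (x, y) pairs
print(zero_one_loss_cv(toy, train_majority, lambda model, x: model))
```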

Table 1 Datasets

The following algorithms are used for comparison:

  • CFWNB [15], Correlation-based Feature Weighting for NB.

  • WATAN [18], Weighted ATAN.

  • FK1DB [10], flexible KDB with k = 1.

  • FK2DB [10], flexible KDB with k = 2.

  • SKDB [20], Selective KDB with k ≤ 5.

  • IWAODE [9], instance-based weighting AODE.

  • SSK1DB, Semi-supervised KDB with k = 1.

  • SSK2DB, Semi-supervised KDB with k = 2.

4.1 Experimental study on all datasets

If the outcome of a one-tailed binomial sign test is less than 0.05, the difference between the experimental results of two learners is considered significant. The Win/Draw/Loss (WDL) record compares experimental results by counting the number of datasets on which one algorithm performs significantly better, equally well, or significantly worse than the other on a given criterion. In machine learning, zero-one loss [39] is one of the standard criteria for measuring the extent to which a learner misclassifies. Table 2 shows the WDL records in terms of zero-one loss. Detailed zero-one loss results are presented in Table 6, with the best score for each dataset marked in bold.
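The comparison protocol can be sketched as follows (with made-up error values): a one-tailed binomial sign test on win/loss counts decides significance, and the per-dataset significance test over cross-validation folds is simplified here to a comparison of mean errors.

```python
# Sketch of the Win/Draw/Loss record and one-tailed binomial sign test.
from math import comb

def sign_test_p(wins, losses):
    """One-tailed binomial sign test: P(X >= wins) under Binomial(wins + losses, 0.5)."""
    n = wins + losses
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n

def win_draw_loss(errors_a, errors_b, alpha=0.05):
    """Count datasets where A has lower/equal/higher error than B, plus overall significance."""
    win = draw = loss = 0
    for ea, eb in zip(errors_a, errors_b):
        # Per-dataset significance would itself come from a test over the CV folds;
        # here a simple comparison of mean errors stands in for that step.
        if ea < eb:
            win += 1
        elif ea > eb:
            loss += 1
        else:
            draw += 1
    return win, draw, loss, sign_test_p(win, loss) < alpha

errors_a = [0.12, 0.08, 0.30, 0.21, 0.05]   # hypothetical zero-one losses on five datasets
errors_b = [0.15, 0.08, 0.33, 0.20, 0.07]
print(win_draw_loss(errors_a, errors_b))
```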

Table 2 The comparison results of Win/Draw/Loss records in terms of zero-one loss

A high-dependence topology represents more conditional dependency relationships among attributes, which helps the estimated conditional probabilities approximate the true ones. Given n attributes and m class labels, CFWNB, FK1DB and FK2DB can respectively represent 0-dependence, 1-dependence and 2-dependence topologies. As shown in Table 2, FK2DB performs better than FK1DB (20∖13∖7), and FK1DB better than CFWNB (16∖12∖12). As argued by Sahami [19], with more “right” dependencies captured the learned BNC approaches optimal Bayesian accuracy. Attribute weighting can help the final BNC enhance the mutual dependence between predictive attributes and the class variable, which improves classification performance. By applying attribute weighting, BNCs with high-dependence topologies also show significant advantages over low-dependence BNCs in classification performance; WATAN performs better than CFWNB (17∖10∖13). IWAODE performs better than the other single-structure BNCs due to its ensemble mechanism: it outperforms WATAN, FK1DB, FK2DB and SKDB on 11, 15, 15 and 14 datasets, and loses on 7, 7, 13 and 11 datasets, respectively. SSKDB inherits the characteristics of high-dependence topology and weighting, and it performs much better than BNCs of the same structure complexity. For example, SSK1DB outperforms WATAN on 15 datasets and loses on 9, and outperforms FK1DB on 16 datasets and loses on 8. SSK2DB outperforms FK2DB on 16 datasets and loses on 3, and outperforms SKDB on 21 datasets and loses on 2.

As proposed by Kohavi and Wolpert [40], zero-one loss can be decomposed into bias and variance terms from sampling theory, which provides valuable insight into the components of the misclassification rate. Bias measures the systematic error of the learned algorithm, and variance measures the sensitivity of an algorithm to random variation in the training data. An algorithm with low variance usually enjoys an advantage on small datasets, while one with low bias usually performs much better on large datasets [39]. Detailed bias and variance results are provided in Tables 7 and 8, with the best score for each dataset marked in bold. The WDL records for bias and variance are shown in Table 3.
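A simplified estimation of bias and variance in the spirit of Kohavi and Wolpert [40] is sketched below: predictions are collected over repeatedly resampled training sets, and bias and variance are computed from the resulting per-instance prediction distributions. The resampling scheme, the train/predict interface and the toy learner are assumptions for illustration, and the irreducible-noise term is omitted.

```python
# Sketch of a bias-variance estimate for 0-1 loss via training-set resampling.
import random
from collections import Counter

def bias_variance(data, train_fn, predict_fn, labels, n_rounds=10, train_frac=0.5, seed=0):
    rng = random.Random(seed)
    votes = [Counter() for _ in data]           # per-instance distribution of predictions
    for _ in range(n_rounds):
        train = rng.sample(data, int(train_frac * len(data)))
        model = train_fn(train)
        for i, (x, _) in enumerate(data):
            votes[i][predict_fn(model, x)] += 1
    bias = variance = 0.0
    for i, (_, y_true) in enumerate(data):
        p_hat = {y: votes[i][y] / n_rounds for y in labels}
        bias += 0.5 * sum(((y == y_true) - p_hat[y]) ** 2 for y in labels)
        variance += 0.5 * (1.0 - sum(p * p for p in p_hat.values()))
    return bias / len(data), variance / len(data)

# Toy demonstration with a majority-class stand-in learner (illustrative only).
def train_majority(train):
    return Counter(y for _, y in train).most_common(1)[0][0]

toy = [((i % 3,), i % 2) for i in range(30)]
print(bias_variance(toy, train_majority, lambda model, x: model, labels=(0, 1)))
```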

Table 3 The comparison results of Win/Draw/Loss records in terms of bias and variance

Bias-wise, from Table 3 we can see that WATAN outperforms CFWNB on 25 datasets and loses on 9. FK2DB outperforms WATAN and FK1DB on 22 and 23 datasets, respectively, and loses on 3 and 6. Variance-wise, WATAN loses to CFWNB on 33 datasets and wins on 4. FK2DB loses to WATAN and FK1DB on 19 and 26 datasets, respectively, and wins on 9 and 8. CFWNB, WATAN, FK1DB, FK2DB and SKDB all learn in the framework of supervised learning, so their topologies can represent only the conditional dependencies implicated in the labeled training data. The topologies of CFWNB and IWAODE are determined by their independence assumptions and thus need no structure learning; both enjoy a variance advantage compared to the other algorithms.

Ensemble BNCs aggregate the predictions of a restricted class of subclassifiers, each representing a lower-dependence topology with a different independence assumption. The relatively simple structures of the subclassifiers help achieve the trade-off between bias and variance. The nature of BNC ensembles lends itself to scalable parallelization and overcomes the limitations of single-model BNCs in two prevalent directions, i.e., diversely generating BNC components and sparsely combining multiple BNCs. SSKDB mines possible conditional dependencies from the unlabeled instance in addition to the training data. Thus, under the same learning strategy, the number of conditional dependencies in the topology of SSKDB is twice that of a BNC of the same structure complexity. For example, when k = 1, SSKDB can learn 2(n − 1) attribute conditional dependencies, n − 1 each from the training data \(\mathcal {D}\) and the unlabeled instance d, whereas FK1DB can learn n − 1 conditional dependencies from the training data \(\mathcal {D}\) only. Bias-wise, from Table 3 we can see that SSK1DB outperforms WATAN and FK1DB on 15 and 18 datasets, respectively, and loses on 11 and 7. SSK2DB outperforms FK2DB and SKDB on 13 datasets each, and loses on 6 and 8. Variance-wise, SSK1DB outperforms WATAN and FK1DB on 25 and 16 datasets, respectively, and loses on 9 and 11. SSK2DB outperforms FK2DB and SKDB on 20 and 14 datasets, respectively, and loses on 5 and 14. The variance advantage of SSKDB is more significant than the bias advantage, so the advantage of SSKDB over other BNCs of the same structure complexity can be attributed to the knowledge learned from the testing instance, which mitigates the negative effect of overfitting the training data.

Moreover, class imbalance is a common occurrence plaguing areas such as fault and intrusion detection, affective computing, classification problems in medical imaging, and many more. Often, “real” datasets are imbalanced and are dominated by a group of “normal” sample points existing alongside a small group of minority “abnormal” sample points [7]. One or more class labels are under-represented, and the class with the minority of instances is often the class of interest. It is well known that major differences between the majority and minority group sizes can result in classifiers that are biased towards the majority class, thereby compromising the accuracy of the classifier [8].

Learning from imbalanced data stemming from real-world problems is inherently challenging. Data imbalance has received significant research interest since it is an intrinsic data issue that cannot be handled by more precise or higher-volume data acquisition methods. When information theory is applied to learn the network structure of a BNC from data, information-theoretic metrics, e.g., the conditional mutual information (CMI) I(Xi;Xj|Y ) defined in (10), are introduced to measure the conditional dependence between Xi and Xj given all possible values of the variable Y, and the arrows in the diagram represent the flow of information.

$$ \begin{array}{@{}rcl@{}} I(X_{i};X_{j}|Y)&=&\underset{X_{i}}{\sum}\underset{X_{j}}{\sum}\underset{Y}{\sum}P(x_{i},x_{j},y)\log\frac{P(x_{i},x_{j}|y)}{P(x_{i}|y)P(x_{j}|y)}\\ &=&\underset{Y}{\sum}P(y)\!\left\{\underset{X_{i}}{\sum}\underset{X_{j}}{\sum}P(x_{i},x_{j}|y)\log\frac{P(x_{i},x_{j}|y)}{P(x_{i}|y)P(x_{j}|y)}\right\}\\ &=&\underset{Y}{\sum}P(y)I(X_{i};X_{j}|y). \end{array} $$
(10)

From (10), I(Xi;Xj|Y ) can be factorized into a set of weighted terms I(Xi;Xj|y), where P(y) is the assigned weight. Thus the conditional dependence measured by I(Xi;Xj|Y ) will be dominated by the majority group to some extent. Correspondingly, a learned single-model BNC prefers the significant conditional dependencies implicated in the majority group over those in the minority group. In this paper, as shown in (5), the significant conditional dependencies for each class implicated in d are identified separately and represented by different BNCs. The ensemble learning strategy helps fully describe all possible dependency relationships and thereby addresses the issue of imbalanced data.
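To illustrate the decomposition in (10), the sketch below computes I(Xi;Xj|Y ) as the P(y)-weighted sum of the per-class terms I(Xi;Xj|y) on a toy dataset, which makes the dominance of the majority class explicit; the data and helper names are illustrative.

```python
# Sketch of (10): conditional mutual information as a P(y)-weighted sum of per-class terms.
import math
from collections import Counter

data = [(0, 1, 0, 0), (0, 0, 1, 0), (1, 1, 0, 1), (1, 0, 1, 1), (0, 1, 1, 0), (1, 1, 1, 1)]

def prob(keys, condition=None):
    """Empirical distribution over the selected columns, optionally conditioned on fixed values."""
    rows = [r for r in data if condition is None or all(r[k] == v for k, v in condition)]
    c = Counter(tuple(r[k] for k in keys) for r in rows)
    return {v: cnt / len(rows) for v, cnt in c.items()}

def cmi_per_class(i, j, y, class_col=-1):
    """I(X_i; X_j | Y = y)."""
    cond = [(class_col, y)]
    p_ij = prob([i, j], cond)
    p_i, p_j = prob([i], cond), prob([j], cond)
    return sum(p * math.log(p / (p_i[(xi,)] * p_j[(xj,)])) for (xi, xj), p in p_ij.items())

def cmi(i, j, class_col=-1):
    """I(X_i; X_j | Y) = sum_y P(y) I(X_i; X_j | y), as in (10)."""
    p_y = prob([class_col])
    return sum(p * cmi_per_class(i, j, y) for (y,), p in p_y.items())

print(cmi(0, 1), {y: round(cmi_per_class(0, 1, y), 4) for y in (0, 1)})
```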

The training time and classification time of SSKDB are strongly related to the number of classes, because for each class SSKDB needs to learn a subclassifier SSKDB\(_{\ell}\). SSKDB also takes time to train SSKDB\(_{{\mathscr{B}}}\) by building the high-order maximum weighted spanning tree. Considering the classification advantages brought by SSKDB's learning strategy, however, the extra time spent on training the classifier and assigning class labels to unlabeled instances is acceptable.

4.2 Experimental study on large datasets

The learners that are most accurate on large data tend to achieve lower bias than the learners that are most accurate on small data [20, 41]. Thus a new generation of computationally efficient, low-bias learners is urgently needed. By applying the log-likelihood function to make the learned BNCs fit the training data and the testing instance respectively, the biased estimate of the joint probability due to a limited number of instances is improved. To demonstrate the advantage of the complementary characteristic introduced by SSKDB, we compare SSKDB with the other algorithms in terms of zero-one loss and bias on large datasets in the following discussion.

The ensemble learning mechanism helps SSK1DB exhibit excellent generalization ability on large datasets in terms of zero-one loss. As shown in Fig. 4, SSK1DB outperforms WATAN on 4 datasets and loses on 1, and outperforms FK1DB on 8 datasets and loses on only 1. SSK2DB outperforms FK2DB and SKDB on 6 datasets each, and loses on 1. These records show that SSKDB is well suited to large datasets. The strategy of learning the BNC topology by maximizing the log-likelihood function proves its advantage, and we further explore its effectiveness in terms of bias. From Fig. 5 we can see that SSK1DB outperforms WATAN and FK1DB on 4 and 9 datasets, respectively, and loses on 4 and 1. SSK2DB outperforms FK2DB and SKDB on 6 and 3 datasets, respectively, and loses on 0 and 2 datasets.

Fig. 4: The comparison results on large datasets in terms of zero-one loss

Fig. 5: The comparison results on large datasets in terms of bias

The increase in the number of training instances reduces the risk of biased estimates of conditional probabilities. The bias difference between SSKDB and the other algorithms directly reflects the difference in the significant dependency relationships identified by the log-likelihood function LL(Xi|πi,Y ) versus the conditional mutual information I(Xi;πi|Y ). From the experimental results we can see that SSKDB has a different but more reasonable structure than FKDB and SKDB. The ensemble learning strategy of SSKDB greatly mitigates the negative effect of the independence assumption and helps it achieve competitive performance. SSKDB\(_{{\mathscr{B}}}\) applies the log-likelihood metric to maximize the number of bits encoded in the network topology rather than simply mining significant conditional dependencies between attributes, which helps the learned joint probability approximate the true one. Conditional mutual information provides a globally optimal but locally non-optimal solution for learning a BNC, whereas, from (2) and (5), the log-likelihood metric provides one globally optimal and one locally optimal solution for data fitting. Considering the variation in dependency relationships across instances, SSKDB learns the local dependency relationships implicated in each possible class of a testing instance. This strategy of constructing complementary models takes into account the information in the training set \(\mathcal {D}\) and the testing instance x at the same time. So even when the estimate of the log-likelihood function is not significant, SSK2DB can often achieve a classification advantage. This clarifies why, among all 13 large datasets, SSK2DB performed significantly better than SKDB on 6 datasets and worse on only 1.

4.3 Time comparison

Table 4 [20] shows the theoretical complexity of four state-of-the-art BNCs, NB, TAN, AODE and KDB, where t is the number of training instances, v is the maximum number of discrete values that any attribute may take, and m is the number of class labels.

Table 4 Theoretical complexity of four state-of-the-art BNCs

The theoretical complexity of the BNCs used for comparison is shown in Table 5. The training process of CFWNB is very similar to that of standard naive Bayes, except for the additional time needed to compute the feature weights. The feature-class correlation and the average feature-feature inter-correlation, measured by I(Xi;Y ) and I(Xi;Xj) respectively, are considered when assigning weights. Calculating I(Xi;Y ) and I(Xi;Xj) for each attribute takes \(\mathcal {O}(mnv)\) and \(\mathcal {O}((nv)^{2})\) time, respectively. Keeping only the highest-order term, the training time complexity of CFWNB for obtaining these weights is \(\mathcal {O}((nv)^{2})\). WATAN needs to construct n directed maximum weighted spanning trees for n attributes, so the time complexity of training WATAN is \(\mathcal {O}(tn^{2}+m(nv)^{2}+n^{3}\log n)\). During the testing phase, classifying a single instance requires calculating the probability based on each subclassifier and has time complexity \(\mathcal {O}(mn^{2})\). SKDB extends KDB to select between attribute subsets and values of k in a single additional pass through the training data. In the first and second passes, SKDB is simply KDB, with time complexity \(\mathcal {O}(tn^{2}+m(nv)^{2}+tnk)\). SKDB then extends KDB with an extra pass through the training data to perform leave-one-out cross validation over the different (ordered) attributes, which takes \(\mathcal {O}(tmnk)\) time. In Table 5, nb is the number of attributes remaining after selection by SKDB, and kb is the k value selected by SKDB. FKDB takes an active learning strategy and uses conditional entropy to learn significant causal relationships from data. FKDB takes more time for training because it needs to build a high-order maximum weighted spanning tree in terms of high-order conditional entropy, with training time complexity \(\mathcal {O}(n^{2}\log n)\). The training procedure of IWAODE is the same as that of AODE, i.e., no structure learning or weighting is required, so its training time complexity is \(\mathcal {O}(tn^{2})\). During the testing phase, IWAODE learns weights from each testing instance; this instantiated weighting approach adjusts the weights flexibly but requires more time for testing. To compute the actual network structure of SSKDB\(_{{\mathscr{B}}}\), we need to consider all attribute values and all possible class labels, which requires \(\mathcal {O}(tn^{k}+m(nv)^{k}+n^{2}\log n)\) time. In contrast, to compute the network structure of SSKDB\(_{\ell}\) we only need to consider the attribute values in x and all possible class labels, which requires \(\mathcal {O}(mn^{k}+n^{2}\log n)\) time. SSKDB\(_{\ell}\) learns a restricted class of subclassifiers from the pseudo-complete instance d = (x,y) for all possible class labels y and has training time complexity \(\mathcal {O}(m^{2}n^{k}+mn^{2}\log n)\). The overall training time complexity of SSKDB is \(\mathcal {O}(tn^{k}+m(nv)^{k}+mn^{2}\log n)\).

Table 5 Theoretical complexity of BNCs for comparison study

Figures 6 and 7 show the comparisons of average training and classification time on all datasets. All the algorithms for comparison run on a desktop computer with an Intel(R) Core(TM) i5-7200U CPU @1.2 GHz, 64 bits and 16 GB of memory. In these graphs, each bar represents the sum of time required for training or classifying in 10-fold cross validation experimental study.

Fig. 6: Comparison of training time

Fig. 7: Comparison of classification time

Training time refers to the average time spent on training the classifiers, including both the time needed to build a model from the information contained in the training set \(\mathcal {D}\) and the average time needed to construct the models based on testing instances d with pre-assigned class labels. Figure 6 indicates that CFWNB and WATAN need negligible training time due to their independence assumptions. As structure complexity increases, high-dependence BNCs (e.g., SKDB) need more time than low-dependence BNCs (e.g., FK1DB). The training time of SSKDB is strongly related to the number of classes, because for each class SSKDB needs to learn a subclassifier \(\ell_i\). SSKDB also takes time to train SSKDB\(_{{\mathscr{B}}}\) by building the high-order maximum weighted spanning tree. It can be seen from Fig. 6 that SSK1DB spends more time on training than other 1-dependence BNCs (e.g., FK1DB), and SSK2DB spends more training time than other 2-dependence BNCs (e.g., FK2DB). But considering the classification advantages brought by SSKDB's learning strategy, the extra training time is acceptable.

Classification time refers to the average time that classifiers take to assign class labels to unlabeled instances. Ensemble BNCs integrate the predictions of their base classifiers to obtain the final prediction and therefore generally require more classification time than single-structure BNCs; weighting imposes additional overhead when computing the joint probability. As shown in Fig. 7, SSK1DB requires more classification time than FK1DB but less than IWAODE, and SSK2DB requires more classification time than FK2DB and SKDB. In general, applying ensemble learning brings a small increase in time consumption but substantial gains in classification. We can therefore conclude that our proposed framework is effective for classification.

5 Conclusions and future work

The log-likelihood function has been proven to be an effective criterion for measuring the conditional dependencies among attributes, and the resulting BNC can achieve a trade-off between data fitting and knowledge representation. In this paper, we propose to use a variant of the log-likelihood function to measure the conditional dependencies among attribute values. We have presented the rationale and time complexity of the two component BNCs, SSKDB\(_{{\mathscr{B}}}\) and SSKDB\(_{\ell}\), which are learned from the training data and the testing instance respectively, and which work as an ensemble to make the final prediction. We have conducted comprehensive experiments on 40 UCI benchmark datasets to evaluate SSKDB's learning accuracy and efficiency. The experimental results prove the effectiveness and efficiency of SSKDB in terms of zero-one loss, bias and variance. The log-likelihood function defined in (5) can measure the extent to which the learned BNC fits a specific instance, so it can be used to build multiple BNCs from the testing instance with different pre-assigned class labels. It is difficult to measure the confidence levels of SSKDB\(_{{\mathscr{B}}}\) and SSKDB\(_{\ell}\) without domain knowledge from experts; one feasible approach is to assign different weights to the committee members and then linearly combine their probability estimates. If a testing instance x = {x1,...,xn} corresponds to an unseen class, then it should not fit any BNC learned from the training data \(\mathcal {D}\), and the criterion \(P(\mathbf {x})<P(\hat {\mathbf {x}})\) should hold much more often than not for any instance \(\hat {\mathbf {x}}\) in \(\mathcal {D}\). These are challenging questions and remain potential directions for future research.