1 Introduction

During the past decades, Supervised Learning (SL), which exploits labeled datasets to achieve promising learning performance, has been the mainstream approach to classification in machine learning. Various methods and theoretical models have been presented, such as k-nearest neighbors [1], support vector machine [2], logistic regression [3] and decision tree [4]. Among them, Bayesian network classifiers (BNCs) are powerful tools for graphically describing statistical knowledge and performing inference under conditions of uncertainty. The dependency relationships among the predictive attributes \(\mathcal {X}=\{X_{1}, \cdots , X_{n}\}\) and the class variable Y are described graphically in a directed acyclic graph (DAG). Ideally, the joint probability distribution corresponding to the topology of the DAG should fit the data best, and it can be factorized into a set of conditional probabilities. However, existing BNCs learned under SL cannot always deliver the desired performance. The learned topology may overfit the training dataset while underfitting the testing instances, which results in high variance and performance degradation.

Semi-supervised learning (SSL) is an effective and efficient way to incorporate the information learned from testing instances into that learned from the training dataset, and the resulting model is refined as more instances are introduced for learning. However, previous research [5,6,7] has shown that risky unlabeled instances, when incorporated into non-robust models, result in “noise propagation”, and the negative effect may lead to biased decision boundaries. SSL may then perform even worse than SL, which limits its scope of application to some extent. To address this issue, one feasible approach is to consider the impact of each unlabeled testing instance independently [8]. The impact, whether positive or negative, will then neither accumulate nor be transmitted to the next unlabeled instance.

Ever-increasing data quantities make ever more urgent the need for BNC learners that are highly scalable and perform significantly better in terms of classification [9]. Labeled training data may account for only a small portion of massive scientific data, so the network topology \({\mathscr{B}}\) learned from training data can represent only a limited number of significant conditional dependencies [9]. The dependency relationships implicated in an unlabeled testing instance may differ greatly from those in \({\mathscr{B}}\); they are therefore better used to rebuild the network topology than to refine \({\mathscr{B}}\).

Log-likelihood function \(LL({\mathscr{B}}|\mathcal {D}) \) measures the number of bits needed to describe \(\mathcal {D}\) based on the learned BNC \({\mathscr{B}}\) [10]. The log likelihood has a statistical interpretation: the higher the log likelihood, the closer \({\mathscr{B}}\) is to modeling the probability distribution in the data \(\mathcal {D}\). Thus \(LL({\mathscr{B}}|\mathcal {D}) \) measures the extent to which the learned BNC \({\mathscr{B}}\) fits the training data \(\mathcal {D}\), and the conditional log-likelihood function LL(Xi|πi,Y ) can be used to identify directed conditional dependencies between attribute Xi and its parents πi. In this paper we argue that the knowledge learned from the labeled training dataset \(\mathcal {D}\) and that learned from an unlabeled testing instance x are complementary in nature. A variant of the conditional log-likelihood function is introduced to identify the conditional dependencies between attribute values in one single instance. On this basis, by pre-assigning each possible class label to the testing instance to make it complete, we apply a heuristic search strategy to build a restricted class of subclassifiers for modeling the testing instance. The final BNC, called the semi-supervised k-dependence Bayesian classifier (SSKDB), is an ensemble of subclassifiers: one general BNC\(_{{\mathscr{B}}}\) learned from the training dataset and a set of local BNCs learned from the testing instance. These subclassifiers are built independently but work as an ensemble to make the final prediction under the framework of semi-supervised learning.

The contributions of this paper are as follows:

  • The log-likelihood functions are proved to be theoretically effective for measuring the extent to which the learned topologies fit the data. We then propose to use conditional log likelihood to identify directed conditional dependencies, and apply a heuristic search strategy to learn BNCs from the training data \(\mathcal {D}\) and the labeled testing instance d, respectively. Prediction is determined by uniformly averaging the joint probability estimates.

  • We compare the performance of our algorithm, SSKDB, with other state-of-the-art classifiers on 40 datasets ranging in size from 57 instances to 164 thousand instances, and from 2 to 22 class labels. We show that SSKDB achieves comparable or lower error than a range of state-of-the-art BNCs (e.g., CFWNB, WATAN, FKDB, SKDB and IWAODE).

The following section reviews the background knowledge (i.e., learning BNC in the framework of supervised learning or semi-supervised learning). Section 3 introduces the details of our algorithm. Section 4 gives the comparisons with 6 state-of-the-art or recently proposed algorithms to verify the effectiveness of our algorithm on 40 UCI (University of California at Irvine) benchmark datasets. The final section draws conclusions and outlines some directions for further research.

2 Background theory and related research work

2.1 Supervised learning for Bayesian network classifiers

Under the supervised learning framework, a trained classification model uses the statistical knowledge it has learned to assign an appropriate class label to each testing instance. Researchers have proposed numerous methods to train a supervised model. The k-nearest neighbors classifier labels an instance by a plurality vote of its neighbors, assigning it the class most common among its k nearest neighbors [11]. Support vector machine assigns new examples to one of two categories, making it a non-probabilistic binary linear classifier [2]. Logistic regression transforms its output using the logistic sigmoid function to return a probability value [3]. Decision tree builds a flowchart-like structure in which each internal node represents a specific attribute and the path from the root node to a leaf node represents a chain of reasoning [4]. BNCs use a DAG to graphically represent probabilistic dependencies, from which the joint probability distribution can be recovered.

For a restricted BNC \({\mathscr{B}}\), the class variable Y is assumed to be the root node and the common parent of all predictive attributes. A BNC models the joint distribution P(x,y) and factorizes it according to the topology \(\mathcal {G}\) into a product of conditional probabilities as follows:

$$ P(\boldsymbol{x},y)=P(y)P(\boldsymbol{x}|y)=P(y)\prod\limits_{i=1}^{n} {P(x_{i}|{{\varPi}}_{i},y)}, $$
(1)

where πi denotes the parents of Xi. Each factor in P(x,y), i.e., P(xi|πi,y), is a categorical distribution. To achieve a trade-off between classification performance and structure complexity, BNCs need to identify a limited number of parents for each attribute. Among all BNCs, NB has the simplest topology: it takes the class variable as the only parent of the remaining (explanatory) variables in the DAG [12]. This model performs best when the dataset satisfies the assumption of conditional independence among predictors given the class.
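To make the factorization in (1) concrete, the following sketch estimates Laplace-smoothed conditional probability tables for a hypothetical 2-dependence structure and evaluates P(x,y) as the product in (1). The toy data, parent sets and function names are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch: estimating CPTs for a fixed restricted-BNC structure and
# evaluating P(x, y) = P(y) * prod_i P(x_i | Pi_i, y), following (1).
from collections import Counter

# Tiny hypothetical training data: each row is (x1, x2, x3, y), all values discrete.
data = [
    (0, 1, 0, 0), (0, 0, 1, 0), (1, 1, 0, 1),
    (1, 0, 1, 1), (0, 1, 1, 0), (1, 1, 1, 1),
]
n_classes, n_values = 2, 2

# Hypothetical k-dependence structure: parents of attribute i (class Y is always a parent).
parents = {0: (), 1: (0,), 2: (0, 1)}     # X1 <- Y;  X2 <- {X1, Y};  X3 <- {X1, X2, Y}

child_counts = Counter()   # counts of (i, x_i, parent values, y)
parent_counts = Counter()  # counts of (i, parent values, y)
class_counts = Counter()
for *x, y in data:
    class_counts[y] += 1
    for i, pa in parents.items():
        key_pa = tuple(x[j] for j in pa)
        child_counts[(i, x[i], key_pa, y)] += 1
        parent_counts[(i, key_pa, y)] += 1

def cond_prob(i, xi, pa_vals, y):
    """Laplace-smoothed estimate of P(x_i | Pi_i, y)."""
    return (child_counts[(i, xi, pa_vals, y)] + 1.0) / (parent_counts[(i, pa_vals, y)] + n_values)

def joint_probability(x, y):
    """P(x, y) = P(y) * prod_i P(x_i | Pi_i, y), as in (1)."""
    prob = (class_counts[y] + 1.0) / (len(data) + n_classes)
    for i, pa in parents.items():
        prob *= cond_prob(i, x[i], tuple(x[j] for j in pa), y)
    return prob

print([joint_probability((1, 1, 0), y) for y in range(n_classes)])
```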

Although NB is surprisingly effective, the implicit conditional dependencies between predictive attributes become significant as the amount of data increases and cannot be neglected, which makes the classification performance of NB degrade dramatically [13]. To retain the simplicity and direct theoretical foundation of NB, Hidden naive Bayes (HNB) creates a hidden parent for each attribute, which combines the influences from all other attributes [14]. Thus HNB avoids the computational complexity of learning an optimal BNC. Correlation-based Feature Weighting Naive Bayes (CFWNB) applies a correlation-based feature weighting (CFW) filter to NB, assigning greater weights to highly predictive features that are highly correlated with the class (maximum mutual relevance) yet uncorrelated with other features (minimum mutual redundancy) [15].

An efficient and effective approach to refining NB is to add augmented edges, relaxing NB's independence assumption by permitting each attribute to have parents other than the class variable while keeping the topology restricted [16]. Tree Augmented Naive Bayes (TAN) [17] applies conditional mutual information and minimum description length as the scoring function, and allows a restricted number of edges between the attributes. The weighted averaged TAN (WATAN) [18] constructs a set of maximum spanning trees, taking each attribute Xi in turn as the root, and applies the mutual information I(Xi;Y ) as the aggregation weight of the committee trees. The k-dependence Bayesian classifier (KDB) [19] uses a parameter k to control the number of interdependencies modeled and achieves a trade-off between bias and variance for different data quantities. Selective KDB (SKDB) [20] selects attribute subsets and the value of k in a single additional pass through the training data, discriminatively selecting a submodel of a full KDB classifier. Flexible KDB (FKDB) [10] uses the entropy function \(H_{B}(\mathcal {D})\) to roughly measure the robustness of the topology of a BNC, and a heuristic search strategy is introduced to explore the optimal topology.

A complex network topology for a single model may result in high variance and overfitting; in contrast, an ensemble of submodels with relatively simple structures can describe the conditional dependencies adequately and achieve a trade-off between bias and variance. The superparent one-dependence estimator (SPODE) uses a weaker attribute independence assumption than NB. The averaged one-dependence estimators (AODE) [21] averages the predictions of all qualified SPODEs, that is, SPODEs are treated equally and assigned the same weight. However, different SPODEs may not always play the same role when dealing with different classification problems; some are more important than others. Weighted average of one-dependence estimators (WAODE) [22] extends AODE by assigning different weights to the SPODEs of AODE. In WAODE, four different weighting criteria are applied, resulting in four versions, i.e., WAODE-MI, WAODE-ACC, WAODE-CLL and WAODE-AUC.
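For illustration, a minimal sketch of the AODE-style averaging described above follows; the toy data, frequency threshold and helper names are assumptions for the example, and the weighting refinements of WAODE/IWAODE are not included.

```python
# Illustrative AODE-style sketch: every attribute X_i acts in turn as a super-parent,
# and the estimates of the qualified SPODEs are uniformly averaged.
from collections import Counter

data = [(0, 1, 0, 0), (0, 0, 1, 0), (1, 1, 0, 1), (1, 0, 1, 1), (0, 1, 1, 0)]
n_attrs, n_values, n_classes, min_count = 3, 2, 2, 1

pair_counts = Counter()    # (i, x_i, y)
triple_counts = Counter()  # (i, x_i, j, x_j, y)
attr_counts = Counter()    # (i, x_i)
for *x, y in data:
    for i in range(n_attrs):
        attr_counts[(i, x[i])] += 1
        pair_counts[(i, x[i], y)] += 1
        for j in range(n_attrs):
            if j != i:
                triple_counts[(i, x[i], j, x[j], y)] += 1

def spode_estimate(x, y, i):
    """P(y, x_i) * prod_{j != i} P(x_j | y, x_i) for super-parent X_i (Laplace smoothed)."""
    est = (pair_counts[(i, x[i], y)] + 1.0) / (len(data) + n_classes * n_values)
    for j in range(n_attrs):
        if j != i:
            est *= (triple_counts[(i, x[i], j, x[j], y)] + 1.0) / (pair_counts[(i, x[i], y)] + n_values)
    return est

def aode_estimate(x, y):
    """Uniform average over SPODEs whose super-parent value is frequent enough."""
    qualified = [i for i in range(n_attrs) if attr_counts[(i, x[i])] >= min_count]
    if not qualified:                      # back off to all SPODEs if none qualifies
        qualified = list(range(n_attrs))
    return sum(spode_estimate(x, y, i) for i in qualified) / len(qualified)

print(max(range(n_classes), key=lambda y: aode_estimate((1, 1, 0), y)))
```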

2.2 Semi-supervised learning for Bayesian network classifiers

Larger data quantities call for scaling up existing learning algorithms, whereas supervised learning requires considerable effort, expertise and time to obtain labeled data during the training phase [23]. To address the lack of labeled data, semi-supervised learning (SSL) mines information from unlabeled data to refine the model learned in the supervised framework. Semi-supervised classification methods are especially effective on datasets with large amounts of unlabeled data and a small quantity of labeled data, and they can increase classification accuracy compared with the default procedure of supervised methods.

Numerous approaches have been proposed and studied in the framework of SSL. Graph-based semi-supervised learning methods [24] learn a graph over all instances, use non-negative weights on the edge between any two instances to measure their similarity, and then propagate labels from labeled instances to unlabeled ones based on this pairwise similarity. Blum and Chawla [25] regard semi-supervised learning as a graph min-cut problem and use pairwise relationships among instances to learn from both labeled and unlabeled data. Zhou et al. [26] propose a general regularization framework on directed graphs in which directionality and global relationships are considered. Generative models [24] assume that the data follow a determined parametric model, and one key issue is how to learn the corresponding joint probability. Jiang, Chen et al. propose signed probabilistic mixture models to detect overlapping communities in signed networks [27, 28]. Yang et al. [29] derive a variational Bayes EM algorithm to estimate the parameters and a variation-based approximate evidence to select an appropriate model. Semi-supervised support vector machines (S3VMs) [30, 31] assume that data points in high-density regions are more likely to share the same label; unlabeled data are used to adjust the decision boundary so that it lies in lower-density regions.

For BNC learning, Zheng et al. [32] present Subsumption Resolution (SR) to identify generalization-specialization relationships between parent and child attribute values, so that redundant attribute values are removed when instantiated, which helps improve the estimate of the conditional probability distribution for AODE. Zaidi et al. [33] demonstrate that SR and MI-weighting are complementary, and the resulting combined technique delivers computationally efficient low-bias learning well suited to big data. Jiang et al. [34] argue that the impact of the root attribute value on the class variable can be used as a more fine-grained weighting metric; the Kullback-Leibler (KL) measure and the information gain (IG) measure are introduced to compute the weights of the different SPODEs in AODE, and the resulting weighted AODE achieves a trade-off between full representation of ground-truth dependencies and high-confidence estimation of conditional probabilities. Duan et al. [9] propose instance-based weighting AODE (IWAODE), which applies an instance-based weighting filter to flexibly assign discriminative weights to each single SPODE for different test instances.

3 Semi-supervised learning for k-dependence Bayesian classifiers

3.1 Model selection

Given the topology of a Bayesian network, the conditional log-likelihood LL(Xi|πi,Y ) measures the number of bits needed to describe attribute Xi given its parents πi and is defined as follows,

$$ \begin{array}{ll} LL(X_{i}|{{\varPi}}_{i},Y)&= \underset{X_{i},{{\varPi}}_{i},Y}{\sum}P(x_{i},{{\varPi}}_{i},y)\log P(x_{i}|{{\varPi}}_{i},y). \end{array} $$
(2)

Given attributes {X1,X2,⋯ ,Xn} and the corresponding factorization of the joint probability distribution P(x,y) (see (1)), the log-likelihood function \(LL({\mathscr{B}}|D)\), defined as follows, measures the number of bits encoded in the network topology \({\mathscr{B}}\) learned from the training data \(\mathcal {D}\) [10].

$$ \begin{array}{ll} LL(\mathcal{B}|D)&= \underset{\mathbf{X},Y}{\sum}P(\mathbf{x},y)\log P_{\mathcal{B}}(\mathbf{x},y). \end{array} $$
(3)

Learning the topology \({\mathscr{B}}\) by maximizing the log-likelihood function \(LL({\mathscr{B}}|D)\) has obvious advantages in knowledge representation and data fitting. Ideally, \({\mathscr{B}}\) should represent the most significant dependencies between each Xi and its parents πi, and the corresponding joint probability distribution should fit the data as closely as possible.
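As a concrete illustration of (2) and (3), the sketch below computes LL(Xi|πi,Y ) and \(LL({\mathscr{B}}|D)\) from the empirical distribution of a toy discrete dataset; the data and the hypothetical parent sets are assumptions for the example, not taken from the paper.

```python
# Sketch of the scores in (2)-(3), computed from empirical probabilities of a toy dataset.
import math
from collections import Counter

data = [(0, 1, 0, 0), (0, 0, 1, 0), (1, 1, 0, 1), (1, 0, 1, 1), (0, 1, 1, 0), (1, 1, 1, 1)]
N = len(data)

def empirical(keys):
    """Empirical joint distribution over the selected columns (class is the last column)."""
    c = Counter(tuple(row[k] for k in keys) for row in data)
    return {v: cnt / N for v, cnt in c.items()}

def cond_log_likelihood(i, pa, class_col=-1):
    """LL(X_i | Pi_i, Y) = sum P(x_i, Pi_i, y) log P(x_i | Pi_i, y), as in (2)."""
    joint = empirical([i, *pa, class_col])
    marg = empirical([*pa, class_col])
    return sum(p * math.log(p / marg[key[1:]]) for key, p in joint.items())

def log_likelihood(parents, class_col=-1):
    """LL(B | D), computed as LL(Y) + sum_i LL(X_i | Pi_i, Y) (the decomposition in (4))."""
    p_y = empirical([class_col])
    ll = sum(p * math.log(p) for p in p_y.values())
    return ll + sum(cond_log_likelihood(i, pa, class_col) for i, pa in parents.items())

parents = {0: (), 1: (0,), 2: (0, 1)}   # a hypothetical 2-dependence structure
print(log_likelihood(parents))
```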

The SSKDB\(_{{\mathscr{B}}}\) learning procedure is shown in Algorithm 1.

(Algorithm 1: pseudocode figure)

Theorem 1

Let \(\mathcal {D}\) be a collection of N instances of {Y,X1,...,Xn}. The learning procedure shown in Algorithm 1 builds a network topology \({\mathscr{B}}\) that maximizes \(LL({\mathscr{B}}|D)\) and has time complexity \(\mathcal {O}(tn^{k}+m(nv)^{k}+n^{2}\log n)\).

Proof

We start with a reformulation of the log-likelihood obtained by applying the chain rule for joint probability:

$$ \begin{array}{@{}rcl@{}} LL(\mathcal{B}|D)\!&=&{\sum}_{\mathbf{X},Y}P(\mathbf{x},y)\log \left\{P(y)\prod\limits_{i=1}^{n}P(x_{i}|{{\varPi}}_{i},y)\right\}\\ &=&{\sum}_{Y}P(y)\log P(y) + \sum\limits_{i=1}^{n}\underset{X_{i},{{\varPi}}_{i},Y}{\sum}P(x_{i},{{\varPi}}_{i},y)\log P(x_{i}|{{\varPi}}_{i},y)\\ &=&LL(Y)+\sum\limits_{i=1}^{n} LL(X_{i}|{{\varPi}}_{i},Y),\\ &=&constant\ term+\sum\limits_{i=1}^{n} LL(X_{i}|{{\varPi}}_{i},Y), \end{array} $$
(4)

Thus, maximizing the log-likelihood function \(LL({\mathscr{B}}|D)\) is equivalent to maximizing the term \({\sum }^{n}_{i=1} LL(X_{i}|{{\varPi }}_{i},Y)\).

The attributes are divided into two groups: those already placed in the attribute order \({\mathscr{L}}\) and those still to be sorted. Attributes are added to \({\mathscr{L}}\) in turn, and each one may select candidate parents from \({\mathscr{L}}\) only. For a restricted BNC, the class variable is the common parent of all attributes. Suppose that \(X_{1}=\arg \max \limits LL(X_{i}|Y)\ (1\leq i\leq n)\); then X1 is the root node and is added to the attribute order \({\mathscr{L}}\). The initial network topology, containing only the directed edge \(Y\rightarrow X_{1}\), is guaranteed to describe this term. Then LL(Xi|πi,Y ) (i≠ 1) is computed and compared to select the second node X2, where \(X_{2}=\arg \max \limits LL(X_{i}|{{\varPi }}_{i},Y)\ (X_{i}\notin {\mathscr{L}},{{\varPi }}_{i}\subseteq {\mathscr{L}})\). Since only attribute X1 exists in \({\mathscr{L}}\), the directed edges \(\{X_{1},Y\}\rightarrow X_{2}\) are added to the network topology, and X2 is regarded as the second attribute and added to the order \({\mathscr{L}}\). By applying this heuristic search strategy to maximize each factor LL(Xi|πi,Y ) (1 ≤ i ≤ n) in (4) in turn, Algorithm 1 determines the attribute order and the parent-child relationships, and thus maximizes the log-likelihood function \(LL({\mathscr{B}}|D)\).

SSKDB\(_{{\mathscr{B}}}\) uses the topology of \({\mathscr{B}}\) to represent arbitrary k-dependence conditional dependencies implicated in the training data in the form of a high-order maximum weighted spanning tree. At training time, SSKDB\(_{{\mathscr{B}}}\) needs to generate a (k + 1)-dimensional table of co-occurrence counts for every combination of k attribute values and each class label. The resulting time complexity of generating this (k + 1)-dimensional table is \(\mathcal {O}(tn^{k})\), where n is the number of attributes and t is the number of data instances. To calculate the log-likelihood function \(LL({\mathscr{B}}|D)\), SSKDB\(_{{\mathscr{B}}}\) considers every combination of attribute value xi, parent set πi and class label y, giving time complexity \(\mathcal {O}(m(nv)^{k})\), where m is the number of classes and v is the maximum number of discrete values that any attribute may take. The time complexity of parent assignment is \(\mathcal {O}(n^{2}\log n)\). Thus the time complexity of building SSKDB\(_{{\mathscr{B}}}\) is \(\mathcal {O}(tn^{k}+m(nv)^{k}+n^{2}\log n)\). □
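Since Algorithm 1 is given only as a pseudocode figure, the sketch below illustrates the greedy search described in the proof: attributes are appended to the order \({\mathscr{L}}\) one at a time, each taking up to k parents from \({\mathscr{L}}\) so as to maximize LL(Xi|πi,Y ). The toy data and function names are assumptions for illustration; the authors' Algorithm 1 may differ in detail.

```python
# Sketch of the greedy order/parent search that maximizes sum_i LL(X_i | Pi_i, Y).
import math
from itertools import combinations
from collections import Counter

def empirical(data, keys):
    c = Counter(tuple(row[k] for k in keys) for row in data)
    n = len(data)
    return {v: cnt / n for v, cnt in c.items()}

def cond_ll(data, i, pa, class_col=-1):
    """LL(X_i | Pi_i, Y) as in (2), from empirical probabilities."""
    joint = empirical(data, [i, *pa, class_col])
    marg = empirical(data, [*pa, class_col])
    return sum(p * math.log(p / marg[key[1:]]) for key, p in joint.items())

def learn_structure(data, n_attrs, k):
    """Greedy attribute ordering and parent assignment, as in the proof of Theorem 1."""
    order, parents, remaining = [], {}, set(range(n_attrs))
    while remaining:
        best = None
        for i in remaining:
            # candidate parent sets: subsets of the already-ordered attributes, size <= k
            for size in range(min(k, len(order)) + 1):
                for pa in combinations(order, size):
                    score = cond_ll(data, i, pa)
                    if best is None or score > best[0]:
                        best = (score, i, pa)
        _, i, pa = best
        order.append(i)
        parents[i] = pa
        remaining.remove(i)
    return order, parents

data = [(0, 1, 0, 0), (0, 0, 1, 0), (1, 1, 0, 1), (1, 0, 1, 1), (0, 1, 1, 0), (1, 1, 1, 1)]
print(learn_structure(data, n_attrs=3, k=2))
```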

However, the conditional dependencies learned by Algorithm 1 may fit different instances to different extents. Take the dataset Localization for example: the attribute pair that maximizes the log-likelihood function is {X1,X3}, which denote “Sequence Name” and “X coordinate”, respectively. The estimates of the conditional probability distribution P(x1,x3|y) for different values of Y are shown in Fig. 1. The marked difference between the two distributions demonstrates significant variation in the conditional dependencies between attribute values. In this paper, we propose to further explore the possible conditional dependencies between attribute values: one more BNC is built independently to fully represent the dependency characteristics implicated in a specific instance x, and its network topology can adapt to fit that instance.

Fig. 1: Distribution of P(x1,x3|y) when (a) y = y0 and (b) y = y1 for dataset Localization

Similar to the definition of \(LL({\mathscr{B}}|D)\), the point-wise log-likelihood (PLL) function for an instance d = {x,y}, which measures the number of bits needed to encode d, is defined as

$$ \begin{array}{@{}rcl@{}} LL(d)&=&\log P(\mathbf{x},y) =\log P(y)+\sum\limits_{i=1}^{n}\log P(x_{i}|{{\varPi}}_{i},y)\\ &=&LL(y)+\sum\limits_{i=1}^{n} LL(x_{i}|{{\varPi}}_{i},y), \end{array} $$
(5)

Learning the topology based on the PLL function LL(d) has obvious advantages in knowledge representation and data fitting. Ideally, LL(d) should represent the most significant dependencies between each xi and its parents πi, and the corresponding joint probability distribution should fit d as closely as possible. Similar to the learning procedure of Algorithm 1, Algorithm 2 applies a heuristic search strategy to maximize each factor LL(xi|πi,y) (1 ≤ i ≤ n) in (5) in turn; the attribute order and the parent-child relationships are thereby determined, and the PLL function LL(d) is maximized correspondingly.

(Algorithm 2: pseudocode figure)

The learning process of SSKDB\(_{\ell}\) for instance di is shown in Algorithm 2.

As described in Algorithm 2, SSKDB\(_{\ell}\) learns a restricted class of subclassifiers from the pseudo-complete instance d = (x,y) for all possible class labels y, and represents the arbitrary k-dependence conditional dependencies implicated in it. That is, each subclassifier can fully describe the conditional dependencies between the attribute values in x given a specific class label.
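The sketch below illustrates the idea behind Algorithm 2 (again under illustrative assumptions, not the authors' exact pseudocode): for a pseudo-labelled instance d = (x,y), parents are chosen greedily to maximize each point-wise term log P(xi|πi,y) in (5), with the conditional probabilities estimated from the training data.

```python
# Sketch of per-instance structure learning: one topology ell_y per pre-assigned class label.
import math
from itertools import combinations

data = [(0, 1, 0, 0), (0, 0, 1, 0), (1, 1, 0, 1), (1, 0, 1, 1), (0, 1, 1, 0), (1, 1, 1, 1)]
n_attrs, n_values = 3, 2

def log_cond_prob(x, y, i, pa):
    """log P(x_i | Pi_i = x[pa], y), Laplace-smoothed from the training data."""
    num = sum(1 for row in data
              if row[-1] == y and row[i] == x[i] and all(row[j] == x[j] for j in pa))
    den = sum(1 for row in data
              if row[-1] == y and all(row[j] == x[j] for j in pa))
    return math.log((num + 1.0) / (den + n_values))

def learn_instance_structure(x, y, k):
    """Greedy order/parent selection maximizing each term of LL(d) in (5)."""
    order, parents, remaining = [], {}, set(range(n_attrs))
    while remaining:
        best = None
        for i in remaining:
            for size in range(min(k, len(order)) + 1):
                for pa in combinations(order, size):
                    score = log_cond_prob(x, y, i, pa)
                    if best is None or score > best[0]:
                        best = (score, i, pa)
        _, i, pa = best
        order.append(i)
        parents[i] = pa
        remaining.remove(i)
    return parents

x_test = (1, 1, 0)
subclassifiers = {y: learn_instance_structure(x_test, y, k=2) for y in (0, 1)}  # one ell_y per label
print(subclassifiers)
```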

Theorem 2

Let d be a complete instance, i.e., d = {x1,...,xn,y}. The learning procedure of Algorithm 2 builds a network topology that maximizes LL(d) and has time complexity \(\mathcal {O}(mn^{k}+n^{2}\log n)\).

Proof

Suppose that \(\ell\) is the subclassifier learned from the pseudo instance d = (x,y); then the estimate of the joint probability distribution P(x,y) for \(\ell\) is

$$ P(\mathbf{x},y|\ell) = P(y)\prod\limits_{i=1}^{n}P(x_{i}|{{\varPi}}_{i},y,\ell). $$
(6)

Similar to the proof of Theorem 1, the PLL function LL(d) can be factorized as follows,

$$ \begin{array}{@{}rcl@{}} LL(d)&=&\log P(y)+\sum\limits_{j=1}^{n}\log{P(x_{j}|{{\varPi}}_{j},y)}\\ &=&LL(y)+\sum\limits_{j=1}^{n}LL(x_{j}|{{\varPi}}_{j},y), \end{array} $$
(7)

Thus, the learned network topology should encode the maximum number of bits for describing d by maximizing the PLL function LL(d).

To compute the actual network structure of SSKDB\(_{{\mathscr{B}}}\), we need to consider all attribute values and all possible class labels, which requires \(\mathcal {O}(tn^{k}+m(nv)^{k}+n^{2}\log n)\) time. In contrast, to compute the network structure of SSKDB\(_{\ell}\) we only need to consider the attribute values in x and all possible class labels, which requires \(\mathcal {O}(mn^{k}+n^{2}\log n)\) time. □

3.2 Semi-supervised k-dependence Bayesian Classifier (SSKDB)

The proposed algorithm, SSKDB, contains two components (SSKDB\(_{{\mathscr{B}}}\) and SSKDB\(_{\ell}\)) and its learning framework is shown in Fig. 2. Firstly, SSKDB\(_{{\mathscr{B}}}\) learns from the training data \(\mathcal {D}\) by maximizing the log-likelihood function \(LL({\mathscr{B}}|\mathcal {D})\). Then, SSKDB\(_{\ell}\) learns from the pseudo-complete instance di = {x1,...,xn,yi} by maximizing LL(di), where yi is a class label pre-assigned to the test instance x = {x1,...,xn}. After training SSKDB\(_{\ell}\) and SSKDB\(_{{\mathscr{B}}}\), the final ensemble estimates the class membership probabilities P(x,yi) by averaging both predictions.

Fig. 2: Algorithm flowchart

The network topology learned from the training data \(\mathcal {D}\) may not correctly describe the “right” dependencies among attribute values in the testing instance x, thus the resulting BNC \({\mathscr{B}}\) may overfit \(\mathcal {D}\) while underfitting x, which leads to biased estimates of conditional probabilities. In contrast, there is only one true class label for the unlabeled instance x, and correspondingly only one subclassifier can describe the conditional dependencies among the attribute values in x with a high confidence level. The other subclassifiers, learned from wrongly labeled copies of x, will negatively affect the estimate of the joint probability distribution to some extent. The knowledge represented by \({\mathscr{B}}\) and \(\ell\) is complementary in nature, and when they work jointly the final ensemble can achieve better classification performance than either of them alone.

To further clarify the basic idea of the ensemble learning strategy, we take the dataset Localization as a case study. The dataset Localization has 5 attributes and 11 class labels. Suppose that x = {x1 = 1,x2 = 3,x3 = 17,x4 = 3,x5 = 3} is the instance to be classified, and the true class of x is y4. \({\mathscr{B}}\) learns the general conditional dependencies among attributes from the training data \(\mathcal {D}\) by maximizing the log-likelihood function \(LL({\mathscr{B}}|D)\), while \(\ell_i\) learns the local conditional dependencies among attribute values from the single instance di = (x,yi) by maximizing LL(di). Since the true class label of x is y4, we take the topology of \(\ell_4\) as the benchmark for comparison. The network topologies of \({\mathscr{B}}\) and \(\ell_4\) when k = 2 are shown in Figs. 3a and b, respectively.

Fig. 3: The network structures of \({\mathscr{B}}\) and \(\ell_4\) corresponding to the Localization dataset

By comparing the network topologies shown in Fig. 3a and b we can see that 7 conditional dependencies are represented in each of the topologies of \(\ell_4\) and \({\mathscr{B}}\), 14 in total. Among them, only 2 conditional dependencies in the topology of \(\ell_4\) are the same as those in \({\mathscr{B}}\), while 5 are different. Diversity has been recognized as one of the key characteristics of ensemble classifiers: it can help improve classification accuracy, increase robustness to variance, and reduce the overall sensitivity to different starting parameters and noise [35]. The topology of \({\mathscr{B}}\) can represent the general conditional dependencies among attributes implicated in the training data, but it cannot represent the local conditional dependencies among attribute values implicated in a specific testing instance. In contrast, the topology of \(\ell\) can represent local conditional dependencies, whereas the “noise” introduced by the uncertainty of the class label may reduce the confidence level of \(\ell\) [36]. SSKDB introduces diversity by training \(\ell_i\) (1 ≤ i ≤ m) to model the conditional dependencies between attribute values given different class labels. When the \(\ell_i\) and \({\mathscr{B}}\) work jointly, the estimate of the joint probability P(x,yi) is finely tuned by combining the outputs of the individual models. To make the final prediction, a linear combination of the individual models' results is used.

To reduce the negative effect of unreliable base probability estimates, we use η = 1 as the minimum sample size for statistical inference purposes [32]. If the frequency of the parent attribute values {πi,y} is lower than η for \(\ell_i\) or \({\mathscr{B}}\), SSKDB will exclude \(\ell_i\) or \({\mathscr{B}}\), respectively, when classifying an instance x. Thus,

$$ P(\mathbf{x},y_{j}) = P(y_{j})\prod\limits_{i=1}^{n}P(x_{i}|{{\varPi}}_{i},y_{j},F({{\varPi}}_{i},y_{j}) \geq \eta), $$
(8)

where F(πi,yj) is the frequency of the attribute values (πi,yj) in the training data. The linear combiner can be used for models that output real-valued numbers, so it is applicable to \({\mathscr{B}}\) and \(\ell_j\). The corresponding \({\mathscr{B}}\) and \(\ell_j\) are combined to describe the dependency relationships conditioned on class label yj. We then apply Bayes rule to classify as follows,

$$ \begin{array}{@{}rcl@{}} y^{*}&=&\arg\max_{y_{j}\in Y} P(\mathbf{x},y_{j})\\ &=& \arg\max_{y_{j}\in Y}P(y_{j})\frac{ {\prod}^{n}_{i=1}P(x_{i}|{{\varPi}}_{i},y_{j},\mathcal{B})+{\prod}^{n}_{i=1}P(x_{i}|{{\varPi}}_{i}^{\prime},y_{j},\ell_{j})}{\{F({{\varPi}}_{i},y_{j}) \geq \eta \bigwedge F({{\varPi}}_{i}^{\prime},y_{j}) \geq \eta\}},\\ \end{array} $$
(9)

where πi and \({{\varPi }}_{i}^{\prime }\) respectively denote the parent attributes of Xi in the topologies of \({\mathscr{B}}\) and \(\ell_j\). The classification procedure is presented in Algorithm 3.

(Algorithm 3: pseudocode figure)

Classifying a single instance requires considering the k-dependence topologies of \({\mathscr{B}}\) and \(\ell\), and then assigning the class label y to the instance x according to (9). Estimating the joint probability for both SSKDB\(_{{\mathscr{B}}}\) and SSKDB\(_{\ell}\) requires \(\mathcal {O}(nmk)\) time. Thus the time complexity of SSKDB for classification is \(\mathcal {O}(nmk)\).
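The following sketch puts (8) and (9) together for a toy problem: the joint-probability estimates from \({\mathscr{B}}\) and from the per-class model \(\ell_j\) are combined, and a component whose parent configuration occurs fewer than η times in the training data is excluded. The structures, data and averaging of the retained components are illustrative assumptions.

```python
# Sketch of the classification rule (9) with the eta back-off of (8).
import math

data = [(0, 1, 0, 0), (0, 0, 1, 0), (1, 1, 0, 1), (1, 0, 1, 1), (0, 1, 1, 0), (1, 1, 1, 1)]
n_values, n_classes, eta = 2, 2, 1

def parent_freq(x, y, pa):
    """F(Pi_i, y_j): frequency of the parent configuration in the training data."""
    return sum(1 for row in data if row[-1] == y and all(row[j] == x[j] for j in pa))

def cond_prob(x, y, i, pa):
    num = sum(1 for row in data
              if row[-1] == y and row[i] == x[i] and all(row[j] == x[j] for j in pa))
    return (num + 1.0) / (parent_freq(x, y, pa) + n_values)

def joint(x, y, parents):
    p_y = (sum(1 for row in data if row[-1] == y) + 1.0) / (len(data) + n_classes)
    return p_y * math.prod(cond_prob(x, y, i, pa) for i, pa in parents.items())

def classify(x, parents_B, parents_ell):
    """argmax_y of the combined estimates from B and ell_y, excluding unreliable components."""
    scores = {}
    for y in range(n_classes):
        components = []
        for parents in (parents_B, parents_ell[y]):
            if all(parent_freq(x, y, pa) >= eta for pa in parents.values()):
                components.append(joint(x, y, parents))
        scores[y] = sum(components) / max(len(components), 1)
    return max(scores, key=scores.get)

# Hypothetical structures, e.g. produced by the two learning sketches above.
parents_B = {0: (), 1: (0,), 2: (0, 1)}
parents_ell = {0: {0: (), 1: (0,), 2: (1,)}, 1: {0: (), 1: (0,), 2: (0, 1)}}
print(classify((1, 1, 0), parents_B, parents_ell))
```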

4 Experiments and results

In this section, we compare our algorithm with several state-of-the-art or recently proposed classifiers. A series of experiments was conducted in terms of zero-one loss, bias, variance, training time and classification time on 40 datasets from the UCI machine learning repository [37]. The description of these datasets, including the number of instances, attributes and classes, is shown in Table 1. The datasets were divided into two categories by size: datasets with fewer than 2000 instances are denoted as small and those with at least 2000 instances as large. For each dataset, numeric attribute values are discretized using Minimum Description Length (MDL) discretization [38]. Each algorithm is tested on each dataset using 10-fold cross validation.
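A minimal sketch of the evaluation protocol follows, assuming a hypothetical train/predict interface and that MDL discretization has already been applied; the majority-class learner is only a stand-in so that the example runs.

```python
# Sketch of 10-fold cross validation with zero-one loss.
import random
from collections import Counter

def zero_one_loss_cv(data, train_fn, predict_fn, n_folds=10, seed=0):
    """data: list of (x, y) pairs; train_fn(train) -> model; predict_fn(model, x) -> label."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    errors, total = 0, 0
    for fold in folds:
        held_out = set(fold)
        train = [data[i] for i in idx if i not in held_out]
        test = [data[i] for i in fold]
        model = train_fn(train)
        errors += sum(predict_fn(model, x) != y for x, y in test)
        total += len(test)
    return errors / total

# Trivial stand-in learner (predict the majority class) so that the sketch executes.
def train_majority(train):
    return Counter(y for _, y in train).most_common(1)[0][0]

toy = [((i % 2, (i // 2) % 3), i % 2) for i in range(40)]    # made-up (x, y) pairs
print(zero_one_loss_cv(toy, train_majority, lambda model, x: model))
```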

Table 1 Datasets

The following algorithms are used for comparison:

  • CFWNB [15], Correlation-based Feature Weighting for NB.

  • WATAN [18], Weighted ATAN.

  • FK1DB [10], flexible KDB with k = 1.

  • FK2DB [10], flexible KDB with k = 2.

  • SKDB [20], Selective KDB with k ≤ 5.

  • IWAODE [9], instance-based weighting AODE.

  • SSK1DB, Semi-supervised KDB with k = 1.

  • SSK2DB, Semi-supervised KDB with k = 2.

4.1 Experimental study on all datasets

If the outcome of a one-tailed binomial sign test is less than 0.05, the difference between the experimental results of two learners is considered significant. The Win/Draw/Loss (WDL) record compares experimental results by counting the number of datasets on which one algorithm performs significantly better, equally well, or significantly worse than the other on a given criterion. In machine learning, zero-one loss [39] is one of the standard criteria for measuring the extent to which a learner misclassifies. Table 2 shows the WDL records in terms of zero-one loss. Detailed zero-one loss results are presented in Table 6, with the best score for each dataset marked in bold.
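The comparison protocol can be sketched as follows (with made-up error values): a one-tailed binomial sign test on win/loss counts decides significance, and the per-dataset significance test over cross-validation folds is simplified here to a comparison of mean errors.

```python
# Sketch of the Win/Draw/Loss record and one-tailed binomial sign test.
from math import comb

def sign_test_p(wins, losses):
    """One-tailed binomial sign test: P(X >= wins) under Binomial(wins + losses, 0.5)."""
    n = wins + losses
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n

def win_draw_loss(errors_a, errors_b, alpha=0.05):
    """Count datasets where A has lower/equal/higher error than B, plus overall significance."""
    win = draw = loss = 0
    for ea, eb in zip(errors_a, errors_b):
        # Per-dataset significance would itself come from a test over the CV folds;
        # here a simple comparison of mean errors stands in for that step.
        if ea < eb:
            win += 1
        elif ea > eb:
            loss += 1
        else:
            draw += 1
    return win, draw, loss, sign_test_p(win, loss) < alpha

errors_a = [0.12, 0.08, 0.30, 0.21, 0.05]   # hypothetical zero-one losses on five datasets
errors_b = [0.15, 0.08, 0.33, 0.20, 0.07]
print(win_draw_loss(errors_a, errors_b))
```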

Table 2 The comparison results of Win/Draw/Loss records in terms of zero-one loss

A high-dependence topology represents more conditional dependency relationships among attributes, which helps the estimated conditional probabilities approximate the true ones. Given n attributes and m class labels, CFWNB, FK1DB and FK2DB can respectively represent 0-dependence, 1-dependence and 2-dependence topologies. As shown in Table 2, FK2DB performs better than FK1DB (20∖13∖7), and FK1DB better than CFWNB (16∖12∖12). As argued by Sahami [19], with more “right” dependencies captured the learned BNC approaches optimal Bayesian accuracy. Attribute weighting can help the final BNC enhance the mutual dependence between predictive attributes and the class variable, which improves classification performance. By applying attribute weighting, BNCs with high-dependence topologies also show significant advantages over low-dependence BNCs in classification performance; WATAN performs better than CFWNB (17∖10∖13). IWAODE performs better than the other single-structure BNCs due to its ensemble mechanism: it outperforms WATAN, FK1DB, FK2DB and SKDB on 11, 15, 15 and 14 datasets, and loses on 7, 7, 13 and 11 datasets, respectively. SSKDB inherits the characteristics of high-dependence topology and weighting, and it performs much better than BNCs of the same structure complexity. For example, SSK1DB outperforms WATAN on 15 datasets and loses on 9, and outperforms FK1DB on 16 datasets and loses on 8. SSK2DB outperforms FK2DB on 16 datasets and loses on 3, and outperforms SKDB on 21 datasets and loses on 2.

As proposed by Kohavi and Wolpert [40], zero-one loss can be decomposed into bias and variance terms from sampling theory, which provides valuable insight into the components of the misclassification rate. Bias measures the systematic error of the learned algorithm, and variance measures the sensitivity of an algorithm to random variation in the training data. An algorithm with low variance usually enjoys an advantage on small datasets, while one with low bias usually performs much better on large datasets [39]. Detailed bias and variance results are provided in Tables 7 and 8, with the best score for each dataset marked in bold. The WDL records for bias and variance are shown in Table 3.
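A simplified estimation of bias and variance in the spirit of Kohavi and Wolpert [40] is sketched below: predictions are collected over repeatedly resampled training sets, and bias and variance are computed from the resulting per-instance prediction distributions. The resampling scheme, the train/predict interface and the toy learner are assumptions for illustration, and the irreducible-noise term is omitted.

```python
# Sketch of a bias-variance estimate for 0-1 loss via training-set resampling.
import random
from collections import Counter

def bias_variance(data, train_fn, predict_fn, labels, n_rounds=10, train_frac=0.5, seed=0):
    rng = random.Random(seed)
    votes = [Counter() for _ in data]           # per-instance distribution of predictions
    for _ in range(n_rounds):
        train = rng.sample(data, int(train_frac * len(data)))
        model = train_fn(train)
        for i, (x, _) in enumerate(data):
            votes[i][predict_fn(model, x)] += 1
    bias = variance = 0.0
    for i, (_, y_true) in enumerate(data):
        p_hat = {y: votes[i][y] / n_rounds for y in labels}
        bias += 0.5 * sum(((y == y_true) - p_hat[y]) ** 2 for y in labels)
        variance += 0.5 * (1.0 - sum(p * p for p in p_hat.values()))
    return bias / len(data), variance / len(data)

# Toy demonstration with a majority-class stand-in learner (illustrative only).
def train_majority(train):
    return Counter(y for _, y in train).most_common(1)[0][0]

toy = [((i % 3,), i % 2) for i in range(30)]
print(bias_variance(toy, train_majority, lambda model, x: model, labels=(0, 1)))
```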

Table 3 The comparison results of Win/Draw/Loss records in terms of bias and variance

Bias-wise, from Table 3 we can see that WATAN outperforms CFWNB on 25 datasets and loses on 9. FK2DB outperforms WATAN and FK1DB on 22 and 23 datasets, respectively, and loses on 3 and 6. Variance-wise, WATAN loses to CFWNB on 33 datasets and wins on 4. FK2DB loses to WATAN and FK1DB on 19 and 26 datasets, respectively, and wins on 9 and 8. CFWNB, WATAN, FK1DB, FK2DB and SKDB all learn in the framework of supervised learning, so their topologies can represent only the conditional dependencies implicated in the labeled training data. The topologies of CFWNB and IWAODE are determined by their independence assumptions and thus need no structure learning; both enjoy a variance advantage compared to the other algorithms.

Ensemble BNCs aggregate the predictions of a restricted class of subclassifiers, each representing a lower-dependence topology with a different independence assumption. The relatively simple structures of the subclassifiers help achieve the trade-off between bias and variance. The nature of BNC ensembles lends itself to scalable parallelization and overcomes the limitations of single-model BNCs in two prevalent directions, i.e., diversely generating BNC components and sparsely combining multiple BNCs. SSKDB mines possible conditional dependencies from the unlabeled instance in addition to the training data. Thus, under the same learning strategy, the number of conditional dependencies in the topology of SSKDB is twice that of a BNC of the same structure complexity. For example, when k = 1, SSKDB can learn 2(n − 1) attribute conditional dependencies, n − 1 each from the training data \(\mathcal {D}\) and the unlabeled instance d, whereas FK1DB can learn n − 1 conditional dependencies from the training data \(\mathcal {D}\) only. Bias-wise, from Table 3 we can see that SSK1DB outperforms WATAN and FK1DB on 15 and 18 datasets, respectively, and loses on 11 and 7. SSK2DB outperforms FK2DB and SKDB on 13 datasets each, and loses on 6 and 8. Variance-wise, SSK1DB outperforms WATAN and FK1DB on 25 and 16 datasets, respectively, and loses on 9 and 11. SSK2DB outperforms FK2DB and SKDB on 20 and 14 datasets, respectively, and loses on 5 and 14. The variance advantage of SSKDB is more significant than the bias advantage, so the advantage of SSKDB over other BNCs of the same structure complexity can be attributed to the knowledge learned from the testing instance, which mitigates the negative effect of overfitting the training data.

Moreover, class imbalance is a common occurrence plaguing areas such as fault and intrusion detection, affective computing, classification problems in medical imaging, and many more. Often, “real” datasets are imbalanced and are dominated by a group of “normal” sample points existing alongside a small group of minority “abnormal” sample points [7]. One or more class labels are under-represented, and the class with the minority of instances is often the class of interest. It is well known that major differences between the majority and minority group sizes can result in classifiers that are biased towards the majority class, thereby compromising the accuracy of the classifier [8].

Learning from imbalanced data stemming from real-world problems is inherently challenging. Data imbalance has received significant research interest since it is an intrinsic data issue that cannot be handled by more precise or higher-volume data acquisition methods. When information theory is applied to learn the network structure of a BNC from data, information-theoretic metrics, e.g., the conditional mutual information (CMI) I(Xi;Xj|Y ) defined in (10), are introduced to measure the conditional dependence between Xi and Xj given all possible values of the variable Y, and the arrows in the diagram represent the flow of information.

$$ \begin{array}{@{}rcl@{}} I(X_{i};X_{j}|Y)&=&\underset{X_{i}}{\sum}\underset{X_{j}}{\sum}\underset{Y}{\sum}P(x_{i},x_{j},y)\log\frac{P(x_{i},x_{j}|y)}{P(x_{i}|y)P(x_{j}|y)}\\ &=&\underset{Y}{\sum}P(y)\!\left\{\underset{X_{i}}{\sum}\underset{X_{j}}{\sum}P(x_{i},x_{j}|y)\log\frac{P(x_{i},x_{j}|y)}{P(x_{i}|y)P(x_{j}|y)}\right\}\\ &=&\underset{Y}{\sum}P(y)I(X_{i};X_{j}|y). \end{array} $$
(10)

From (10), I(Xi;Xj|Y ) can be factorized into a set of weighted terms I(Xi;Xj|y), where P(y) is the assigned weight. Thus the conditional dependence measured by I(Xi;Xj|Y ) will be dominated by the majority group to some extent. Correspondingly, a learned single-model BNC prefers the significant conditional dependencies implicated in the majority group over those in the minority group. In this paper, as shown in (5), the significant conditional dependencies for each class implicated in d are identified separately and represented by different BNCs. The ensemble learning strategy helps fully describe all possible dependency relationships and thereby addresses the issue of imbalanced data.
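To illustrate the decomposition in (10), the sketch below computes I(Xi;Xj|Y ) as the P(y)-weighted sum of the per-class terms I(Xi;Xj|y) on a toy dataset, which makes the dominance of the majority class explicit; the data and helper names are illustrative.

```python
# Sketch of (10): conditional mutual information as a P(y)-weighted sum of per-class terms.
import math
from collections import Counter

data = [(0, 1, 0, 0), (0, 0, 1, 0), (1, 1, 0, 1), (1, 0, 1, 1), (0, 1, 1, 0), (1, 1, 1, 1)]

def prob(keys, condition=None):
    """Empirical distribution over the selected columns, optionally conditioned on fixed values."""
    rows = [r for r in data if condition is None or all(r[k] == v for k, v in condition)]
    c = Counter(tuple(r[k] for k in keys) for r in rows)
    return {v: cnt / len(rows) for v, cnt in c.items()}

def cmi_per_class(i, j, y, class_col=-1):
    """I(X_i; X_j | Y = y)."""
    cond = [(class_col, y)]
    p_ij = prob([i, j], cond)
    p_i, p_j = prob([i], cond), prob([j], cond)
    return sum(p * math.log(p / (p_i[(xi,)] * p_j[(xj,)])) for (xi, xj), p in p_ij.items())

def cmi(i, j, class_col=-1):
    """I(X_i; X_j | Y) = sum_y P(y) I(X_i; X_j | y), as in (10)."""
    p_y = prob([class_col])
    return sum(p * cmi_per_class(i, j, y) for (y,), p in p_y.items())

print(cmi(0, 1), {y: round(cmi_per_class(0, 1, y), 4) for y in (0, 1)})
```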

The training time and classification time of SSKDB are strongly related to the number of classes, because for each class SSKDB needs to learn a subclassifier SSKDB\(_{\ell}\). SSKDB also takes time to train SSKDB\(_{{\mathscr{B}}}\) by building the high-order maximum weighted spanning tree. Considering the classification advantages brought by SSKDB's learning strategy, however, the extra time spent on training the classifier and assigning class labels to unlabeled instances is acceptable.

4.2 Experimental study on large datasets

The learners that are most accurate on large data tend to achieve lower bias than the learners that are most accurate on small data [20, 41]. Thus a new generation of computationally efficient, low-bias learners is urgently needed. By applying the log-likelihood function to make the learned BNCs fit the training data and the testing instance respectively, the biased estimate of the joint probability due to a limited number of instances is improved. To demonstrate the advantage of the complementary characteristic introduced by SSKDB, we compare SSKDB with the other algorithms in terms of zero-one loss and bias on large datasets in the following discussion.

The ensemble learning mechanism helps SSK1DB exhibit excellent generalization ability on large datasets in terms of zero-one loss. As shown in Fig. 4, SSK1DB outperforms WATAN on 4 datasets and loses on 1, and outperforms FK1DB on 8 datasets and loses on only 1. SSK2DB outperforms FK2DB and SKDB on 6 datasets each, and loses on 1. These records show that SSKDB is well suited to large datasets. The strategy of learning the BNC topology by maximizing the log-likelihood function proves its advantage, and we further explore its effectiveness in terms of bias. From Fig. 5 we can see that SSK1DB outperforms WATAN and FK1DB on 4 and 9 datasets, respectively, and loses on 4 and 1. SSK2DB outperforms FK2DB and SKDB on 6 and 3 datasets, respectively, and loses on 0 and 2 datasets.

Fig. 4: The comparison results on large datasets in terms of zero-one loss

Fig. 5: The comparison results on large datasets in terms of bias

The increase in the number of training instances reduces the risk of biased estimates of conditional probabilities. The bias difference between SSKDB and the other algorithms directly reflects the difference in the significant dependency relationships identified by the log-likelihood function LL(Xi|πi,Y ) versus the conditional mutual information I(Xi;πi|Y ). From the experimental results we can see that SSKDB has a different but more reasonable structure than FKDB and SKDB. The ensemble learning strategy of SSKDB greatly mitigates the negative effect of the independence assumption and helps it achieve competitive performance. SSKDB\(_{{\mathscr{B}}}\) applies the log-likelihood metric to maximize the number of bits encoded in the network topology rather than simply mining significant conditional dependencies between attributes, which helps the learned joint probability approximate the true one. Conditional mutual information provides a globally optimal but locally non-optimal solution for learning a BNC, whereas, from (2) and (5), the log-likelihood metric provides one globally optimal and one locally optimal solution for data fitting. Considering the variation in dependency relationships across instances, SSKDB learns the local dependency relationships implicated in each possible class of a testing instance. This strategy of constructing complementary models takes into account the information in the training set \(\mathcal {D}\) and the testing instance x at the same time. So even when the estimate of the log-likelihood function is not significant, SSK2DB can often achieve a classification advantage. This clarifies why, among all 13 large datasets, SSK2DB performed significantly better than SKDB on 6 datasets and worse on only 1.

4.3 Time comparison

Table 4 [20] shows the theoretical complexity of four state-of-the-art BNCs, NB, TAN, AODE and KDB, where t is the number of training instances, v is the maximum number of discrete values that any attribute may take, and m is the number of class labels.

Table 4 Theoretical complexity of four state-of-the-art BNCs

The theoretical complexity of the BNCs used for comparison is shown in Table 5. The training process of CFWNB is very similar to that of standard naive Bayes, except for the additional time needed to compute the feature weights. The feature-class correlation and the average feature-feature inter-correlation, measured by I(Xi;Y ) and I(Xi;Xj) respectively, are considered when assigning weights. Calculating I(Xi;Y ) and I(Xi;Xj) for each attribute takes \(\mathcal {O}(mnv)\) and \(\mathcal {O}((nv)^{2})\) time, respectively. Keeping only the highest-order term, the training time complexity of CFWNB for obtaining these weights is \(\mathcal {O}((nv)^{2})\). WATAN needs to construct n directed maximum weighted spanning trees for n attributes, so the time complexity of training WATAN is \(\mathcal {O}(tn^{2}+m(nv)^{2}+n^{3}\log n)\). During the testing phase, classifying a single instance requires calculating the probability based on each subclassifier and has time complexity \(\mathcal {O}(mn^{2})\). SKDB extends KDB to select between attribute subsets and values of k in a single additional pass through the training data. In the first and second passes, SKDB is simply KDB, with time complexity \(\mathcal {O}(tn^{2}+m(nv)^{2}+tnk)\). SKDB then extends KDB with an extra pass through the training data to perform leave-one-out cross validation over the different (ordered) attributes, which takes \(\mathcal {O}(tmnk)\) time. In Table 5, nb is the number of attributes remaining after selection by SKDB, and kb is the k value selected by SKDB. FKDB takes an active learning strategy and uses conditional entropy to learn significant causal relationships from data. FKDB takes more time for training because it needs to build a high-order maximum weighted spanning tree in terms of high-order conditional entropy, with training time complexity \(\mathcal {O}(n^{2}\log n)\). The training procedure of IWAODE is the same as that of AODE, i.e., no structure learning or weighting is required, so its training time complexity is \(\mathcal {O}(tn^{2})\). During the testing phase, IWAODE learns weights from each testing instance; this instantiated weighting approach adjusts the weights flexibly but requires more time for testing. To compute the actual network structure of SSKDB\(_{{\mathscr{B}}}\), we need to consider all attribute values and all possible class labels, which requires \(\mathcal {O}(tn^{k}+m(nv)^{k}+n^{2}\log n)\) time. In contrast, to compute the network structure of SSKDB\(_{\ell}\) we only need to consider the attribute values in x and all possible class labels, which requires \(\mathcal {O}(mn^{k}+n^{2}\log n)\) time. SSKDB\(_{\ell}\) learns a restricted class of subclassifiers from the pseudo-complete instance d = (x,y) for all possible class labels y and has training time complexity \(\mathcal {O}(m^{2}n^{k}+mn^{2}\log n)\). The overall training time complexity of SSKDB is \(\mathcal {O}(tn^{k}+m(nv)^{k}+mn^{2}\log n)\).

Table 5 Theoretical complexity of BNCs for comparison study

Figures 6 and 7 show the comparisons of average training and classification time on all datasets. All the algorithms for comparison run on a desktop computer with an Intel(R) Core(TM) i5-7200U CPU @1.2 GHz, 64 bits and 16 GB of memory. In these graphs, each bar represents the sum of time required for training or classifying in 10-fold cross validation experimental study.

Fig. 6: Comparison of training time

Fig. 7: Comparison of classification time

Training time refers to the average time spent on training the classifiers, including both the time needed to build a model from the information contained in the training set \(\mathcal {D}\) and the average time needed to construct the models based on testing instances d with pre-assigned class labels. Figure 6 indicates that CFWNB and WATAN need negligible training time due to their independence assumptions. As structure complexity increases, high-dependence BNCs (e.g., SKDB) need more time than low-dependence BNCs (e.g., FK1DB). The training time of SSKDB is strongly related to the number of classes, because for each class SSKDB needs to learn a subclassifier \(\ell_i\). SSKDB also takes time to train SSKDB\(_{{\mathscr{B}}}\) by building the high-order maximum weighted spanning tree. It can be seen from Fig. 6 that SSK1DB spends more time on training than other 1-dependence BNCs (e.g., FK1DB), and SSK2DB spends more training time than other 2-dependence BNCs (e.g., FK2DB). But considering the classification advantages brought by SSKDB's learning strategy, the extra training time is acceptable.

Classification time refers to the average time that classifiers take to assign class labels to unlabeled instances. Ensemble BNCs integrate the predictions of their base classifiers to obtain the final prediction and therefore generally require more classification time than single-structure BNCs; weighting imposes additional overhead when computing the joint probability. As shown in Fig. 7, SSK1DB requires more classification time than FK1DB but less than IWAODE, and SSK2DB requires more classification time than FK2DB and SKDB. In general, applying ensemble learning brings a small increase in time consumption but substantial gains in classification. We can therefore conclude that our proposed framework is effective for classification.

5 Conclusions and future work

The log-likelihood function has been proven to be an effective criterion for measuring the conditional dependencies among attributes, and the resulting BNC can achieve a trade-off between data fitting and knowledge representation. In this paper, we propose to use a variant of the log-likelihood function to measure the conditional dependencies among attribute values. We have presented the rationale and time complexity of the two component BNCs, SSKDB\(_{{\mathscr{B}}}\) and SSKDB\(_{\ell}\), which are learned from the training data and the testing instance respectively, and which work as an ensemble to make the final prediction. We have conducted comprehensive experiments on 40 UCI benchmark datasets to evaluate SSKDB's learning accuracy and efficiency. The experimental results prove the effectiveness and efficiency of SSKDB in terms of zero-one loss, bias and variance. The log-likelihood function defined in (5) can measure the extent to which the learned BNC fits a specific instance, so it can be used to build multiple BNCs from the testing instance with different pre-assigned class labels. It is difficult to measure the confidence levels of SSKDB\(_{{\mathscr{B}}}\) and SSKDB\(_{\ell}\) without domain knowledge from experts; one feasible approach is to assign different weights to the committee members and then linearly combine their probability estimates. If a testing instance x = {x1,...,xn} corresponds to an unseen class, then it should not fit any BNC learned from the training data \(\mathcal {D}\), and the criterion \(P(\mathbf {x})<P(\hat {\mathbf {x}})\) should hold much more often than not for any instance \(\hat {\mathbf {x}}\) in \(\mathcal {D}\). These are challenging questions and remain potential directions for future research.