Efficient parameter learning of Bayesian network classifiers
Abstract
Recent advances have demonstrated substantial benefits from learning with both generative and discriminative parameters. On the one hand, generative approaches address the estimation of the parameters of the joint distribution \(\mathrm{P}(y,\mathbf{x})\), which for most network types is very computationally efficient (a notable exception being Markov networks). On the other hand, discriminative approaches address the estimation of the parameters of the posterior distribution and are more effective for classification, since they fit \(\mathrm{P}(y\mid \mathbf{x})\) directly. However, discriminative approaches are less computationally efficient, as the normalization factor in the conditional log-likelihood precludes the derivation of closed-form parameter estimates. This paper introduces a new discriminative parameter learning method for Bayesian network classifiers that combines, in an elegant fashion, parameters learned using both generative and discriminative methods. The proposed method is discriminative in nature, but uses estimates of generative probabilities to speed up the optimization process. A second contribution is a simple framework to characterize the parameter learning task for Bayesian network classifiers. We conduct an extensive set of experiments on 72 standard datasets and demonstrate that our proposed discriminative parameterization provides an efficient alternative to other state-of-the-art parameterizations.
1 Introduction
Efficient training of Bayesian network classifiers has been the topic of much recent research (Buntine 1994; Carvalho et al. 2011; Friedman et al. 1997; Heckerman and Meek 1997; Martinez et al. 2016; Pernkopf and Bilmes 2010; Webb et al. 2012; Zaidi et al. 2013). Two paradigms predominate (Jebara 2003). One can optimize the log-likelihood (LL). This is traditionally called generative learning. The goal is to obtain parameters characterizing the joint distribution in the form of local conditional distributions and then estimate class-conditional probabilities using Bayes rule. Alternatively, one can optimize the conditional log-likelihood (CLL), known as discriminative learning. The goal is to directly estimate the parameters associated with the class-conditional distribution \(\mathrm{P}(y\mid \mathbf{x})\).
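To make the two objectives concrete, the following sketch (with toy parameters and data of our own choosing, not from the paper's experiments) computes both the LL and the CLL for a one-attribute naive Bayes; the per-instance normalizer over classes in the CLL is exactly the term that precludes closed-form estimates:

```python
import math

# Toy naive Bayes with one binary attribute X1 and binary class Y.
theta_y = {0: 0.6, 1: 0.4}                            # P(Y = y)
theta_x = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}  # P(X1 = x | Y = y)

data = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1)]       # (y, x1) pairs

def joint(y, x):
    return theta_y[y] * theta_x[y][x]

# Generative objective: LL = sum_i log P(y_i, x_i).
LL = sum(math.log(joint(y, x)) for y, x in data)

# Discriminative objective: CLL = sum_i log P(y_i | x_i); the normalizer over
# classes is what rules out a closed-form maximizer.
CLL = sum(math.log(joint(y, x) / sum(joint(yp, x) for yp in theta_y))
          for y, x in data)
```

Note that \(\mathrm{P}(y\mid \mathbf{x}) \ge \mathrm{P}(y,\mathbf{x})\) pointwise, so the CLL is never below the LL on the same data.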
Naive Bayes (NB) is a Bayesian network \({ \mathop { \text {BN} } }\) that specifies independence between attributes given the class. Recent work has shown that placing a per-attribute-value-per-class-value weight on probabilities in NB (and learning these weights by optimizing the CLL) leads to an alternative parameterization of vanilla Logistic Regression (LR) (Zaidi et al. 2014). The introduction of these weights (and optimizing them by maximizing the CLL) also makes it possible to relax NB’s conditional independence assumption and thus to create a classifier with lower bias (Ng and Jordan 2002; Zaidi et al. 2014). The classifier is low-biased, as the weights can remedy inaccuracies introduced by invalid attribute-independence assumptions.
In this paper, we generalize this idea to the general class of \({ \mathop { \text {BN} } }\) classifiers. Like NB, any given \({ \mathop { \text {BN} } }\) structure encodes assumptions about conditional independencies between the attributes, and will result in error if these do not hold in the data. Optimizing the log-likelihood in this case will result in suboptimal performance for classification (Friedman et al. 1997; Grossman and Domingos 2004; Su et al. 2008); one should either directly optimize the CLL by learning the parameters of the class-conditional distribution, or place weights on the probabilities and learn these weights by optimizing the CLL.
 1.
We develop a new discriminative parameter learning method for Bayesian network classifiers by combining fast generative parameter (and structure) learning with subsequent fast discriminative parameter estimation (using parameter estimates from the former to precondition search for the parameters of the latter). To achieve this, discriminative parameters are restated as weights rectifying deviations of the discriminative model from the generative one (in terms of the violation of independence between factors present in the generative model).
 2.
A second contribution of this work is the development of a simple framework to characterize the parameter learning task for Bayesian network classifiers. Building on previous work by Friedman et al. (1997), Greiner et al. (2005), Pernkopf and Wohlmayr (2009), Roos et al. (2005) and Zaidi et al. (2013), this framework allows us to lay out the different techniques in a systematic manner; highlighting similarities, distinctions and equivalences.
 1.
Generative step: We maximize the LL to obtain parameters for all local conditional distributions in the \({ \mathop { \text {BN} } }\).
 2.
Discriminative step: We associate a weight with each parameter learned in the generative step and reparameterize the classconditional distribution in terms of these weights (and of the fixed generative parameters). We can then discriminatively learn these weights by optimizing the CLL.

The proposed formalization of the parameter learning task for \({ \mathop { \text {BN} } }\) is actually a reparameterization of the one-step (discriminative) learning problem (this will become clear when we introduce the proposed framework), but with faster convergence of the discriminative optimization procedure. In the experimental section, we complement our theoretical framework with an empirical analysis over 72 domains; the results demonstrate the superiority of our approach. In Sect. 5.5, we will discuss our proposed approach from the perspective of preconditioning in unconstrained optimization problems.

The proposed approach results in a three-level hierarchy of nested parameterizations, where each additional level introduces (or “unties”) exponentially more parameters in order to fit ever smaller violations of independence.

Regularization of the discriminative parameters in the proposed discriminative learning approach allows us to limit the amount of allowable violation of independence and effectively interpolate between discriminative and generative parameter estimation.
List of symbols used
Notation  Description 

n  Number of attributes 
N  Number of data points in \(\mathcal {D}\) 
\(\mathrm{P}(e)\)  Probability of event e 
\(\mathrm{P}(e\mid g)\)  Conditional probability of event e given g 
\(\hat{\mathrm{P}}(.)\)  An estimate of \(\mathrm{P}(.)\) 
\(\mathcal {D}= \{ \mathbf{x}^{(1)},\ldots ,\mathbf{x}^{(N)} \}\)  Data consisting of N objects 
\(\mathcal {L}= \{ y^{(1)},\ldots ,y^{(N)}\}\)  Labels of data points in \(\mathcal {D}\) 
\(\mathbf{x}= \langle x_{1}, x_{2},\ldots ,x_{n} \rangle \)  An object (ndimensional vector of attribute values) 
\(\mathbf{z}=\langle z_0, z_1, \ldots z_n\rangle \)  A labeled object, where \(z_0=y^{(i)}, z_1=x_1, \ldots ,z_n=x_n\) and \(\mathbf{x}^{(i)} \in \mathcal {D}\) 
Y  Random variable associated with class label 
y  \(y \in Y\). Class label for object. Same as \(z_0\) 
\(|Y|\)  Number of classes 
\(X_i\)  Random variable associated with attribute i 
\(x_i\)  \(x_i \in X_i\). ith attribute value 
\(|X_i|\)  Number of values of attribute \(X_i\) 
\(Z_i\)  Random variable associated with attribute i, or in the case of \(Z_0\), the class. 
\(z_i\)  \(z_i \in Z_i\). ith attribute value, or for \(z_0\), the class 
\(|Z_i|\)  Number of values of variable \(Z_i\) 
\({\mathcal {B}}\)  Bayesian network (directed acyclic graph), parameterized by \(\varTheta \) 
\({\mathcal {B}^{*}}\)  Class-conditional \({ \mathop { \text {BN} } }\) based on \({\mathcal {B}}\), parameterized by \({\varvec{\theta }}\) 
\({\mathcal {G}}\)  Structure of \({ \mathop { \text {BN} } }\) \({\mathcal {B}}\) 
\(\varTheta \)  Set of parameters associated with \({\mathcal {B}}\) 
\(\mathrm{P}_{\mathcal {B}}(.)\)  Probability is based on \({ \mathop { \text {BN} } }\) \({\mathcal {B}}\) 
\(\varPi _i(.)\)  Function taking \(\mathbf{z}\) as an input, returns the values of the attributes which are the parents of i 
\(\varPi _0(.)\)  Parents of class 
\(\theta _{Z_i = z_i \mid \varPi _i(\mathbf{z})}\)  Probability of \(Z_i = z_i\) given its parents 
\(\theta _{z_i\mid \varPi _i(\mathbf{z})}\)  Short form of \(\theta _{Z_i = z_i \mid \varPi _i(\mathbf{z})}\) 
\(\theta _{z_i:j\mid y:k,\varPi _i:l}\)  Probability of variable i taking value j, given class (y) taking value k and its parents (\(\varPi _i\)) taking value l 
\(\beta _{y,x_i,\varPi _i(\mathbf{x})}\)  Parameter associated with class y, attribute i taking value \(x_i\) and i’s parent’svalues \(\varPi _i\) 
\(\beta _{y,x_i,\varPi _i}\)  Same as \(\beta _{y,x_i,\varPi _i(\mathbf{x})}\) 
\(\beta _{x_i:j, y:k, \varPi _i:l}\)  Parameter associated with attribute i taking value j, class (y) taking value k and its parents (\(\varPi _i\)) taking value l 
\({\varvec{\theta }},\mathbf{w},{\varvec{\beta }}\)  Vectors of \(\theta \), w and \(\beta \) parameters, respectively 
\(N_{x_i,y,\varPi _i(\mathbf{x})}\)  Empirical count of data with attribute i taking value \(x_i\), class taking value y and parents taking value \(\varPi _i (\mathbf{x})\) 
2 A simple framework for parameter learning of \({ \mathop { \text {BN} } }\) classifiers
We start by discussing Bayesian network classifiers in the following section.
2.1 Bayesian network classifiers
A \({ \mathop { \text {BN} } }\) \({\mathcal {B}}= \langle {\mathcal {G}},\varTheta \rangle \) is characterized by the structure \({\mathcal {G}}\) (a directed acyclic graph, where each vertex is a variable, \(Z_i\)), and a set of parameters \(\varTheta \), that quantifies the dependencies within the structure. The variables are partitioned into a single target, the class variable \(Y{=}Z_0\), and n covariates \(X_1{=}Z_1, X_2{=}Z_2, \ldots X_n{=}Z_n\), called the attributes. The parameter set \(\varTheta \) contains a set of parameters for each vertex in \({\mathcal {G}}\): \(\theta _{z_0\mid \varPi _0(\mathbf{x})}\) and, for \(1\le i\le n\), \(\theta _{z_i\mid y,\varPi _i(\mathbf{x})}\), where \(\varPi _i(.)\) is a function which, given the datum \(\mathbf{x}= \langle x_1, x_2,\ldots ,x_n \rangle \) as its input, returns the values of the attributes that are the parents of node i in structure \({\mathcal {G}}\). For notational simplicity, instead of writing \(\theta _{Z_0 = z_0 \mid \varPi _0(\mathbf{x})}\) and \(\theta _{Z_i = z_i \mid y,\varPi _i(\mathbf{x})}\), we write \(\theta _{z_0\mid \varPi _0(\mathbf{x})}\) and \(\theta _{z_i\mid y,\varPi _i(\mathbf{x})}\). A \({ \mathop { \text {BN} } }\) \({\mathcal {B}}\) computes the joint probability distribution as: \(\mathrm{P}_{\mathcal {B}}(y,\mathbf{x}) = \theta _{z_0 \mid \varPi _0(\mathbf{x})}\cdot \prod _{i=1}^{n} \theta _{z_i \mid y, \varPi _i(\mathbf{x})}\).
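The factorization above can be sketched directly in code. The data structures below (one conditional probability table per vertex, indexed by value, class and parent values) are our own illustrative choice, not the paper's implementation:

```python
# joint_prob computes P_B(y, x) = theta_{z_0 | Pi_0} * prod_i theta_{z_i | y, Pi_i(x)}.
def joint_prob(y, x, theta_class, theta, parents):
    """theta_class[y]: class prior (the class is assumed parentless here);
    theta[i][(x_i, y, pa)]: P(X_i = x_i | y, parent values pa);
    parents[i]: tuple of attribute indices that are parents of attribute i."""
    p = theta_class[y]
    for i, xi in enumerate(x):
        pa = tuple(x[j] for j in parents[i])  # Pi_i(x): parent values for node i
        p *= theta[i][(xi, y, pa)]
    return p

# Toy naive Bayes: two binary attributes, no attribute-to-attribute parents.
theta_class = {0: 0.5, 1: 0.5}
parents = {0: (), 1: ()}
theta = {
    0: {(0, 0, ()): 0.8, (1, 0, ()): 0.2, (0, 1, ()): 0.3, (1, 1, ()): 0.7},
    1: {(0, 0, ()): 0.6, (1, 0, ()): 0.4, (0, 1, ()): 0.1, (1, 1, ()): 0.9},
}
# A valid joint distribution must sum to one over all configurations.
total = sum(joint_prob(y, (a, b), theta_class, theta, parents)
            for y in (0, 1) for a in (0, 1) for b in (0, 1))
```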
Theorem 1
Within the constraints in Eq. 4, Eq. 3 is maximized when \(\theta _{x_i \mid \varPi _i(\mathbf{x})}\) corresponds to empirical estimates of probabilities from the data, that is, \(\theta _{y \mid \varPi _0(\mathbf{x})} = \mathrm{P}_\mathcal {D}(y \mid \varPi _0(\mathbf{x}))\) and \(\theta _{x_i \mid \varPi _i(\mathbf{x})} = \mathrm{P}_\mathcal {D}(x_i \mid \varPi _i(\mathbf{x}))\).
Proof
See “Appendix 1”. \(\square \)
The parameters obtained by maximizing Eq. 3 (and fulfilling the constraints in Eq. 4) are typically known as ‘Generative’ estimates of the probabilities.
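Concretely, Theorem 1 says the generative estimates are just empirical frequencies. A minimal sketch for the naive Bayes case (no attribute parents, smoothing ignored), with a toy dataset of our own:

```python
from collections import Counter

# (y, x) pairs with discrete attribute tuples x; purely illustrative.
data = [(0, (0, 1)), (0, (0, 0)), (1, (1, 1)), (1, (1, 0)), (1, (0, 1))]

class_counts = Counter(y for y, _ in data)
pair_counts = Counter((i, xi, y) for y, x in data for i, xi in enumerate(x))

def theta_mle(i, xi, y):
    """theta_{x_i | y} = N_{x_i, y} / N_y, the empirical conditional frequency."""
    return pair_counts[(i, xi, y)] / class_counts[y]
```

For instance, both class-0 examples above have \(x_0 = 0\), so `theta_mle(0, 0, 0)` is 1.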
2.2 Class-conditional BN (CCBN) models
 1.
It allows the parameters to be set in such a way as to reduce the effect of the conditional attribute independence assumption that is present in the BN structure and that might be violated in data.
 2.
We have \(\text {LL}({\mathcal {B}})=\text {CLL}({\mathcal {B}})+\text {LL}({\mathcal {B}}\backslash y)\). If optimizing \(\text {LL}({\mathcal {B}})\), most of the attention will be given to \(\text {LL}({\mathcal {B}}\backslash y)\)—because \(\text {CLL}({\mathcal {B}})\ll \text {LL}({\mathcal {B}}\backslash y)\)—which will often lead to poor estimates for classification.
Like any Bayesian network model, a class-conditional \({ \mathop { \text {BN} } }\) model is composed of a graphical structure and of parameters (\({\varvec{\theta }}\)) quantifying the dependencies in the structure. For any \({ \mathop { \text {BN} } }\) \({\mathcal {B}}\), the corresponding CC\({ \mathop { \text {BN} } }\) will be based on the graph \({\mathcal {B}^{*}}\) (a subgraph of \({\mathcal {B}}\)) whose parameters are optimized by maximizing the CLL. We present below a slightly rephrased definition from Roos et al. (2005):
Definition 1
A class-conditional Bayesian network model \({\mathcal {M}^{{\mathcal {B}}^*}}\) is the set of conditional distributions based on the network \({\mathcal {B}}^*\) equipped with any strictly positive parameter set \({\varvec{\theta }}^{{\mathcal {B}}^*}\); that is, the set of all functions from \((X_1,X_2,\ldots ,X_n)\) to a distribution on Y that take the form of Eq. 2.
This means that the nodes in \({\mathcal {B}}^*\) are nodes comprising only the Markov blanket of the class y. However, for most \({ \mathop { \text {BN} } }\) classifiers the class has no parents and is made a parent of all attributes. This has the effect that every attribute is in the Markov blanket of the class.
We will assume that the parents of the class attribute constitute an empty set and, therefore, simplify the parameters characterizing the class attribute from \(\theta _{y^{(j)} \mid \varPi _0(\mathbf{x}^{(j)})}\) to \(\theta _{y^{(j)}}\). We will also drop the superscript j in equations for clarity.
2.3 A simple framework
 1.
Generative: Initialize \(\mathbf{w}\) with 1 and treat it as a fixed parameter. Treat \({\varvec{\theta }}\) as the optimized parameter and optimize it with the generative objective function given in Eq. 3.
 2.
Discriminative: Initialize \(\mathbf{w}\) with 1 and treat it as a fixed parameter. Treat \({\varvec{\theta }}\) as the optimized parameter and optimize it with the discriminative objective function given in Eq. 5. As discussed, this results in adding a normalization term to convert \(\mathrm{P}(y,\mathbf{x})\) in Eq. 6 to \(\mathrm{P}(y\mid \mathbf{x})\). We denote this ‘discriminative CCBN’ and describe it in detail in Sect. 3.
 3.
Discriminative: Initialize \(\mathbf{w}\) with 1 and treat it as a fixed parameter. Treat \({\varvec{\theta }}\) as the optimized parameter and optimize it with the discriminative objective function given in Eq. 5, but constrain \({\varvec{\theta }}\) to be actual probabilities. We denote this ‘extended CCBN’ and provide a detailed description in Sect. 4.
 4.
Discriminative: Two-step learning. In the first step, initialize \(\mathbf{w}\) with 1 and treat it as a fixed parameter. Treat \({\varvec{\theta }}\) as the optimized parameter and optimize it with the generative objective function given in Eq. 3. In the second step, treat \({\varvec{\theta }}\) as a fixed parameter and optimize \(\mathbf{w}\) using the discriminative objective function. This approach is motivated by the fact that in Eq. 6 the weights \(\mathbf{w}\) modulate probabilities \({\varvec{\theta }}\) that are set through generative learning, unlike discriminative and extended CCBN, where \(\mathbf{w}\) is fixed to one. We denote this ‘weighted CCBN’ and describe it in detail in Sect. 5.
Comparison of different parameter learning techniques for Bayesian network classifiers
Generative—maximize LL  Discriminative—maximize CLL  Discriminative—maximize CLL  Discriminative—maximize CLL  

Extended CCBN model  Discriminative CCBN model  Weighted CCBN model  
Description  Estimate parameters of joint distribution \(\mathrm{P}(y,\mathbf{x})\)  Estimate parameters of class-conditional distribution \(\mathrm{P}(y\mid \mathbf{x})\)  Estimate parameters of class-conditional distribution \(\mathrm{P}(y\mid \mathbf{x})\)  Estimate parameters of class-conditional distribution \(\mathrm{P}(y\mid \mathbf{x})\) 
\({ \mathop { \text {BN} } }\) structure  \({\mathcal {B}}\)  \({\mathcal {B}^{*}}\)  \({\mathcal {B}^{*}}\)  \({\mathcal {B}^{*}}\) 
CCBN model  Not applicable  \({\mathcal {M}_{\text {e}}^{{\mathcal {B}}^*}}\)  \({\mathcal {M}_{\text {d}}^{{\mathcal {B}}^*}}\)  \({\mathcal {M}_{\text {w}}^{{\mathcal {B}}^*}}\) 
Form  \(\mathrm{P}(y,\mathbf{x}\mid {\varvec{\theta }})\)  \(\mathrm{P}(y\mid \mathbf{x},{\varvec{\theta }})\)  \(\mathrm{P}(y\mid \mathbf{x},{\varvec{\beta }})\)  \(\mathrm{P}(y\mid \mathbf{x},{\varvec{\theta }},\mathbf{w})\) 
Formula  \(\theta _{y} \prod _{i=1}^{n} \theta _{x_i \mid y, \varPi _i(\mathbf{x})}\)  \(\frac{ \theta _{y} \prod _{i=1}^{n} \theta _{x_i \mid y, \varPi _i(\mathbf{x})} }{\sum _{y'}^{\mathcal {Y}} \theta _{y'} \prod _{i=1}^{n} \theta _{x_i\mid y',\varPi _i(\mathbf{x})}}\)  \(\frac{\exp (\beta _{y} + \sum _{i=1}^{n} \beta _{y,x_i,\varPi _i})}{\sum _{y'=1}^{\mathcal {Y}} \exp (\beta _{y'} + \sum _{i=1}^{n} \beta _{y',x_i,\varPi _i}) }\)  \(\frac{ \theta _{y}^{w_{y}} \prod _{i=1}^{n} \theta _{x_i\mid y,\varPi _i(\mathbf{x})}^{w_{y,x_i,\varPi _i}} }{\sum _{y'}^{\mathcal {Y}} \theta _{y'}^{w_{y'}} \prod _{i=1}^{n} \theta _{x_i\mid y',\varPi _i(\mathbf{x})}^{w_{y',x_i,\varPi _i}}}\) 
Prestep  None  None  None  Optimize for \({\varvec{\theta }}\) using Generative objective function of Eq. 3 
Constraints  \(\theta _y \in [0,1]^{\mathcal {Y}}\), \(\theta _{i,y} \in [0,1]^{\mathcal {X}_i}\)  \(\theta _y \in [0,1]^{\mathcal {Y}}\), \(\theta _{i,y} \in [0,1]^{\mathcal {X}_i}\)  None  \(\theta _y \in [0,1]^{\mathcal {Y}}\), \(\theta _{i,y} \in [0,1]^{\mathcal {X}_i}\) 
Optimized Param.  \({\varvec{\theta }}\)  \({\varvec{\theta }}\)  \({\varvec{\beta }}\)  \(\mathbf{w}\) 
‘Fixed’ Param.  None  None  None  \({\varvec{\theta }}\) 
3 Parameterization 1: Discriminative CCBN model
Logistic regression (LR) is the CCBN model associated with the NB structure, optimizing Eq. 2. Typically, LR learns a weight for each attribute value (per class). However, one can extend LR by considering all or some subset of possible quadratic, cubic, or higher-order features (Langford et al. 2007; Zaidi et al. 2015). Inspired by Roos et al. (2005), we define discriminative CCBN as:
Definition 2
A discriminative class-conditional Bayesian network model \({\mathcal {M}_{\text {d}}^{{\mathcal {B}}^*}}\) is a CCBN such that Eq. 2 is reparameterized in terms of a parameter \({\varvec{\beta }}\) with \({\varvec{\beta }}= \log {\varvec{\theta }}\), and \({\varvec{\beta }}\) is obtained by maximizing the CLL.
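Under this definition, substituting \({\varvec{\beta }}= \log {\varvec{\theta }}\) leaves the class-conditional distribution unchanged; only the parameter space differs. A small numerical check (toy naive Bayes values of our own, one binary attribute):

```python
import math

# Illustrative generative parameters for a one-attribute naive Bayes.
theta_y = [0.6, 0.4]                      # P(y)
theta_x = [[0.8, 0.2], [0.3, 0.7]]        # theta_x[y][x1] = P(x1 | y)

def posterior_theta(x1):
    """Posterior from the generative parameterization (normalized product form)."""
    scores = [theta_y[y] * theta_x[y][x1] for y in (0, 1)]
    z = sum(scores)
    return [s / z for s in scores]

def posterior_beta(x1):
    """Same posterior after the reparameterization beta = log(theta):
    a softmax over sums of beta terms, i.e. the LR form."""
    scores = [math.exp(math.log(theta_y[y]) + math.log(theta_x[y][x1]))
              for y in (0, 1)]
    z = sum(scores)
    return [s / z for s in scores]
```

The two functions agree up to floating-point error, while \({\varvec{\beta }}\) is unconstrained and can be optimized freely.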
4 Parameterization 2: Extended CCBN model
The name Extended CCBN Model is inspired by Greiner et al. (2005), who proposed the method named Extended Logistic Regression (ELR). ELR extends LR and leads to discriminative training of \({ \mathop { \text {BN} } }\) parameters. We define:
Definition 3
(Greiner et al. 2005) An extended class-conditional Bayesian network model \({\mathcal {M}_{\text {e}}^{{\mathcal {B}}^*}}\) is a CCBN such that the parameters (\({\varvec{\theta }}\)) satisfy the constraints in Eq. 4 and are obtained by maximizing the CLL in Eq. 5.
5 Parameterization 3: Combined generative/discriminative parameterization: weighted CCBN model
Inspired by Zaidi et al. (2014), we define a weighted CCBN model as follows:
Definition 4
A weighted conditional Bayesian network model \({\mathcal {M}_{\text {w}}^{{\mathcal {B}}^*}}\) is a CCBN such that Eq. 2 has an extra weight parameter associated with every \(\theta \), reparameterizing it as \({\varvec{\theta }}^\mathbf{w}\), where the parameter \({\varvec{\theta }}\) is learned by optimizing the LL and the parameter \(\mathbf{w}\) is obtained by maximizing the CLL.
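With \(\mathbf{w}\) fixed to one, the weighted form collapses to the posterior induced by the generative parameters; learning \(\mathbf{w}\) then moves the model away from the generative solution only as far as the data demand. A toy check of the \(\mathbf{w}=1\) case (illustrative values of our own, naive Bayes with one binary attribute):

```python
# Generative estimates fixed by the first (generative) step; values illustrative.
theta_y = [0.6, 0.4]                      # P(y)
theta_x = [[0.8, 0.2], [0.3, 0.7]]        # theta_x[y][x1] = P(x1 | y)

def weighted_posterior(x1, w_y, w_x):
    """Weighted CCBN posterior: each theta raised to its learned weight."""
    scores = [theta_y[y] ** w_y[y] * theta_x[y][x1] ** w_x[y] for y in (0, 1)]
    z = sum(scores)
    return [s / z for s in scores]

def generative_posterior(x1):
    """Posterior obtained from the generative parameters via Bayes rule."""
    scores = [theta_y[y] * theta_x[y][x1] for y in (0, 1)]
    z = sum(scores)
    return [s / z for s in scores]
```

Setting all weights to 1 recovers `generative_posterior` exactly, which is why \(\mathbf{w}=1\) is the natural starting point for the discriminative step.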
5.1 On initialization of parameters

Initialization with Zeros This is the standard initialization where all the optimized parameters are initialized with 0 (Ripley 1996).

Initialization with Generative estimates Given that our approach utilizes generative estimates, a fair comparison with other approaches should study starting from the generative estimates for all approaches. This will correspond to initializing the \({\varvec{\theta }}\) parameter with the generative estimates for Parameterizations 1 and 2 (\({\mathcal {M}_{\text {d}}^{{\mathcal {B}}^*}}\) and \({\mathcal {M}_{\text {e}}^{{\mathcal {B}}^*}}\)), and initializing the \(\mathbf{w}\) parameter to 1 for Parameterization 3 (\({\mathcal {M}_{\text {w}}^{{\mathcal {B}}^*}}\)).
5.2 Comment on regularization
5.3 Optimizing discriminative/generative parameterization
There are great advantages in optimizing an objective function that is convex. The convexity of the three discriminative parameterizations that we have discussed depends on the underlying structure of the CCBN (\({\mathcal {M}^{{\mathcal {B}}^*}}\)). From Roos et al. (2005), it follows that optimizing a CCBN parameterized by either \({\mathcal {M}_{\text {d}}^{{\mathcal {B}}^*}}\), \({\mathcal {M}_{\text {e}}^{{\mathcal {B}}^*}}\) or \({\mathcal {M}_{\text {w}}^{{\mathcal {B}}^*}}\) leads to a convex optimization problem if and only if the structure contains no immoralities. In other words, the optimization problem is convex if and only if the parents of every node are connected with each other. This constraint holds for a number of popular \({ \mathop { \text {BN} } }\) classifiers including NB, TAN and KDB (\(K=1\)), but not for general \({ \mathop { \text {BN} } }\) or for KDB structures with \(K > 1\). Therefore, in this work, we have used only limited \({ \mathop { \text {BN} } }\) structures such as NB, TAN and KDB (\(K=1\)). Investigation of the application of our approach to more complex moral structures is a promising topic for future work. We note in passing that a similar two-step discriminative parameterization has also been shown to be effective for the non-convex mean-square-error objective function (Zaidi et al. 2016).
5.4 Nested parameterizations
Nesting as shown in Fig. 1, though effective, is not very intuitive for \({\mathcal {M}_{\text {e}}^{{\mathcal {B}}^*}}\) and \({\mathcal {M}_{\text {d}}^{{\mathcal {B}}^*}}\). For example, doing a logistic regression by learning a parameter associated only with the attributes will result in optimizing fewer parameters but might not be effective in terms of classification accuracy. However, the hierarchy of models applies naturally to \({\mathcal {M}_{\text {w}}^{{\mathcal {B}}^*}}\). \({\mathcal {M}_{\text {w}}^{{\mathcal {B}}^*}}\) incorporates learning initial parameters by optimizing the LL objective function. Therefore, the searched parameters optimized in the second step can be nested effectively. For example, Level 1 weighting in Fig. 1 can be seen as alleviating the conditional attribute independence assumption (CAIA) between attributes. Similarly, Level 2 will have the effect of binarizing each attribute, and alleviating CAIA between new attributes. In the following we will derive the respective gradients for each level from the most comprehensive case of Level 4.
5.5 Discussion
5.5.1 Preconditioning: why is our technique helpful?
It can be seen that \({\mathcal {M}_{\text {w}}^{{\mathcal {B}}^*}}\) results in a rescaling of the \({\mathcal {M}_{\text {d}}^{{\mathcal {B}}^*}}\) parameterization. What is the effect of this rescaling on the model? Since there is no closed-form solution, we optimize the CLL with gradient-based methods, such as gradient descent, conjugate gradient, quasi-Newton (L-BFGS) or stochastic gradient descent. These are all affected by scaling.^{1} We use the generative estimates as an effective preconditioning method.
A preconditioner converts an illconditioned problem into a better conditioned one, such that the gradient of the objective function is uniform across all dimensions. A better conditioned optimization problem has a better convergence profile. This is because if different parameters have significantly different “influence” on the objective function, then the gradient does not point directly towards the minimum that is the objective of the optimization process. We illustrate this in Fig. 2 where we show the contour plot of the CLL for different \(\beta \). We can see that when the CLL has an ‘elliptical’ shape with respect to the parameters, then the gradient is not oriented directly towards the objective and each step makes only partial progress in the true direction of the final objective. Our rescaling improves the orientation of the gradient speeding convergence.
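The effect can be reproduced on a deliberately ill-conditioned toy problem (a two-dimensional quadratic, not the paper's CLL): rescaling each gradient coordinate by the inverse curvature of its axis makes the step point at the minimum and collapses the iteration count.

```python
# Ill-conditioned quadratic: f(b) = 0.5 * (100*b0^2 + b1^2), minimum at (0, 0).
def f(b):
    return 0.5 * (100.0 * b[0] ** 2 + b[1] ** 2)

def grad(b):
    return [100.0 * b[0], b[1]]

def steps_to_converge(precond, lr, tol=1e-8, max_steps=10000):
    """Gradient descent with a per-dimension rescaling of the gradient."""
    b = [1.0, 1.0]
    for step in range(1, max_steps + 1):
        g = grad(b)
        b = [bi - lr * pi * gi for bi, pi, gi in zip(b, precond, g)]
        if f(b) < tol:
            return step
    return max_steps

plain = steps_to_converge([1.0, 1.0], lr=0.019)   # near-largest stable uniform step
scaled = steps_to_converge([0.01, 1.0], lr=1.0)   # rescaled by inverse curvature
```

With a uniform step the learning rate is capped by the steep axis, so progress along the flat axis is slow; with the diagonal preconditioner the gradient points straight at the minimum.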
We compare the convergence profile of vanilla LR (discriminative CCBN) and our preconditioned weighted CCBN with the naive Bayes structure (which is associated with an LR model) by varying \(\alpha \) from 0 to 1. For each dataset 10,000 data points were generated. We report the convergence results in Fig. 3. These confirm the explanation given above. The benefit of our technique progressively increases as the relative influence of the covariates on the class increases. We will show in Sect. 7 that this is the case for the vast majority of real-world datasets.
5.5.2 Is this an overparametrised model?
6 Related work
There have been several comparative studies of discriminative and generative structure and parameter learning of Bayesian networks (Greiner and Zhou 2002; Grossman and Domingos 2004; Pernkopf and Bilmes 2005). In all these works, generative parameter training is the estimation of parameters based on empirical estimates whereas discriminative training of parameters is actually the estimation of the parameters of CCBN models such as \({\mathcal {M}_{\text {e}}^{{\mathcal {B}}^*}}\) or \({\mathcal {M}_{\text {d}}^{{\mathcal {B}}^*}}\). The \({\mathcal {M}_{\text {e}}^{{\mathcal {B}}^*}}\) model was first proposed in Greiner and Zhou (2002). Our work differs from these previous works as our goal is to highlight different parameterization of CCBN models and investigate their interrelationship. Particularly, we are interested in the learning of parameters corresponding to a weighted CCBN model that leads to faster discriminative learning.
An approach for discriminative learning of the parameters of \({ \mathop { \text {BN} } }\) based on discriminative computation of pseudo-frequencies from the data is presented in Su et al. (2008). Discriminative Frequency Estimates (DFE) are computed by injecting a discriminative element into the generative computation of the probabilities. During the pseudo-frequency computation process, rather than using empirical frequencies, DFE estimates how well the current classifier does on each data point and then updates the frequency tables in proportion to the classifier’s error. For example, they propose a simple error measure: \(L(\mathbf{x}) = \mathrm{P}(y\mid \mathbf{x}) - \hat{\mathrm{P}}(y\mid \mathbf{x})\), where \(\mathrm{P}(y\mid \mathbf{x})\) is the true probability of class y given the datum \(\mathbf{x}\), and \(\hat{\mathrm{P}}(y\mid \mathbf{x})\) is the predicted probability. The counts are updated as: \(\theta _{ijk}^{t+1} = \theta _{ijk}^{t} + L(\mathbf{x})\). Several iterations over the dataset are required. The algorithm is inspired by Perceptron-based training and is shown to be an effective discriminative parameter learning approach.
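A minimal sketch of the DFE update loop for a naive Bayes over binary attributes (the toy data, initialization and pass count are our own; as in perceptron-style training, the true probability \(\mathrm{P}(y\mid \mathbf{x})\) is taken as 1 for the observed label):

```python
from collections import defaultdict

# Toy binary data: class 0 has x0 = 0, class 1 has x0 = 1; x1 is uninformative.
data = [(0, (0, 1)), (0, (0, 0)), (1, (1, 1)), (1, (1, 0))]
classes = [0, 1]

# Pseudo-counts, initialized uniformly.
class_cnt = defaultdict(lambda: 1.0)
attr_cnt = defaultdict(lambda: 1.0)   # keyed by (i, x_i, y)

def predict(x):
    """Naive Bayes posterior computed from the current pseudo-counts."""
    scores = {}
    for y in classes:
        s = class_cnt[y]
        for i, xi in enumerate(x):
            s *= attr_cnt[(i, xi, y)] / class_cnt[y]
        scores[y] = s
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

for _ in range(10):                   # several passes over the dataset
    for y, x in data:
        loss = 1.0 - predict(x)[y]    # L(x) with true P(y|x) taken as 1
        class_cnt[y] += loss          # counts grow in proportion to the error
        for i, xi in enumerate(x):
            attr_cnt[(i, xi, y)] += loss
```

After a few passes the counts for the informative attribute dominate, and the classifier separates the two classes.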
7 Empirical results
Details of Datasets (UCI Domains)
Domain  Case  Att  Class  Domain  Case  Att  Class 

Pokerhand  1,175,067  11  10  Annealing  898  39  6 
Covertype  581,012  55  7  Vehicle  846  19  4 
CensusIncome(KDD)  299,285  40  2  PimaIndiansDiabetes  768  9  2 
Localization  164,860  7  3  BreastCancer(Wisconsin)  699  10  2 
Connect4Opening  67,557  43  3  CreditScreening  690  16  2 
Statlog(Shuttle)  58,000  10  7  BalanceScale  625  5  3 
Adult  48,842  15  2  Syncon  600  61  6 
LetterRecognition  20,000  17  26  Chess  551  40  2 
MAGICGammaTelescope  19,020  11  2  Cylinder  540  40  2 
Nursery  12,960  9  5  Musk1  476  167  2 
Sign  12,546  9  3  HouseVotes84  435  17  2 
PenDigits  10,992  17  10  HorseColic  368  22  2 
Thyroid  9169  30  20  Dermatology  366  35  6 
Pioneer  9150  37  57  Ionosphere  351  35  2 
Mushrooms  8124  23  2  LiverDisorders(Bupa)  345  7  2 
Musk2  6598  167  2  PrimaryTumor  339  18  22 
Satellite  6435  37  6  Haberman’sSurvival  306  4  2 
OpticalDigits  5620  49  10  HeartDisease(Cleveland)  303  14  2 
PageBlocksClassification  5473  11  5  Hungarian  294  14  2 
Wallfollowing  5456  25  4  Audiology  226  70  24 
Nettalk(Phoneme)  5438  8  52  NewThyroid  215  6  3 
Waveform5000  5000  41  3  GlassIdentification  214  10  3 
Spambase  4601  58  2  SonarClassification  208  61  2 
Abalone  4177  9  3  AutoImports  205  26  7 
Hypothyroid(Garavan)  3772  30  4  WineRecognition  178  14  3 
Sickeuthyroid  3772  30  2  Hepatitis  155  20  2 
Kingrookvskingpawn  3196  37  2  TeachingAssistantEvaluation  151  6  3 
SplicejunctionGeneSequences  3190  62  3  IrisClassification  150  5  3 
Segment  2310  20  7  Lymphography  148  19  4 
CarEvaluation  1728  8  4  Echocardiogram  131  7  2 
Volcanoes  1520  4  4  PromoterGeneSequences  106  58  2 
Yeast  1484  9  10  Zoo  101  17  7 
ContraceptiveMethodChoice  1473  10  3  PostoperativePatient  90  9  3 
German  1000  21  2  LaborNegotiations  57  17  2 
LED  1000  8  10  LungCancer  32  57  3 
Vowel  990  14  11  Contactlenses  24  5  3 
TicTacToeEndgame  958  10  2 
There are a total of 72 datasets, 41 datasets with less than 1000 instances, 21 datasets with between 1000 and 10000 instances, and 11 datasets with more than 10000 instances. Each algorithm is tested on each dataset using 5 rounds of 2-fold cross-validation. 2-fold cross-validation is used in order to maximize the variation in the training data from trial to trial, which is advantageous when estimating bias and variance. Note that the source code with running instructions is provided as supplementary material to this paper.
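The evaluation protocol can be sketched as follows (the train/test function signatures are illustrative choices of ours, not taken from the supplementary code):

```python
import random

# 5 rounds of 2-fold cross-validation: each round shuffles the data, splits it
# in half, and evaluates each half against a model trained on the other half,
# so every instance is tested exactly once per round.
def five_by_two_cv(data, train_fn, test_fn, seed=0):
    rng = random.Random(seed)
    scores = []
    for _ in range(5):
        d = list(data)
        rng.shuffle(d)
        half = len(d) // 2
        folds = (d[:half], d[half:])
        for train, test in (folds, folds[::-1]):
            model = train_fn(train)
            scores.append(test_fn(model, test))
    return scores  # 10 scores: 5 rounds x 2 folds
```

Because each round trains on only half the data, train/test splits vary substantially across rounds, which is what makes the protocol well suited to bias/variance estimation.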
We report Win–Draw–Loss (W–D–L) results when comparing the 0–1 Loss, RMSE, bias and variance of two models. A two-tailed binomial sign test is used to determine the significance of the results. Results are considered significant if \(p \le 0.05\). Significant results are shown in bold font in the table.
We report results on two categories of datasets. The first category, labeled All, consists of all datasets in Table 3. The second category, labeled Big, consists of datasets that have more than 10000 instances. The reason for splitting the datasets into two categories is to show explicitly the effectiveness of our proposed optimization on bigger datasets.^{2} It is only on big datasets that each iteration is expensive and, therefore, any technique that leads to faster and better convergence is highly desirable. Note that we do not expect the three discriminative parameterizations \({\mathcal {M}_{\text {d}}^{{\mathcal {B}}^*}}\), \({\mathcal {M}_{\text {e}}^{{\mathcal {B}}^*}}\) and \({\mathcal {M}_{\text {w}}^{{\mathcal {B}}^*}}\) to differ in their prediction accuracy. That is, we should expect a similar spread of both 0–1 Loss and RMSE values. However, we are interested in each parameterization’s convergence profile and training time.
Numeric attributes are discretized using the Minimum Description Length (MDL) discretization method (Fayyad and Irani 1992). A missing value is treated as a separate attribute value and taken into account exactly like other values.
We experiment with three Bayesian network structures: naive Bayes (NB), Tree-Augmented naive Bayes (TAN) (Friedman et al. 1997) and k-Dependence Bayesian network (KDB) with \(K=1\) (Sahami 1996). Naive Bayes is a well-known classifier which is based on the assumption that, when conditioned on the class, attributes are independent. Tree-Augmented Naive Bayes augments the NB structure by allowing each attribute to depend on at most one non-class attribute. It relies on an extension of the Chow-Liu tree (Chow and Liu 1968), which utilizes conditional mutual information (between pairs of attributes given the class) to find a maximum spanning tree over the attributes in order to determine the parent of each. Similarly, in KDB, each attribute takes k attributes plus the class as its parents. The attributes are selected based on their mutual information with the class. Then the parent of attribute i is chosen to maximize the conditional mutual information of attribute i and parent j given the class, that is: \(\text {argmax}_j \text {CMI}(X_i;X_j \mid Y)\).
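The quantity driving parent selection in TAN and KDB can be computed directly from counts. A sketch of \(\text {CMI}(X_i;X_j \mid Y)\) under maximum-likelihood estimates (the interface is our own choosing):

```python
import math
from collections import Counter

# CMI(X_i; X_j | Y) = sum over (x_i, x_j, y) of
#   p(x_i, x_j, y) * log[ p(x_i, x_j | y) / (p(x_i | y) * p(x_j | y)) ],
# with all probabilities estimated as empirical frequencies.
def cmi(data, i, j):
    """data: list of (y, x) pairs with discrete attribute tuples x."""
    n = len(data)
    c_xy = Counter((x[i], x[j], y) for y, x in data)
    c_iy = Counter((x[i], y) for y, x in data)
    c_jy = Counter((x[j], y) for y, x in data)
    c_y = Counter(y for y, _ in data)
    total = 0.0
    for (xi, xj, y), nij in c_xy.items():
        p_joint = nij / n
        # p(xi, xj | y) / (p(xi | y) p(xj | y)) simplifies to nij * n_y / (n_iy * n_jy)
        ratio = (nij * c_y[y]) / (c_iy[(xi, y)] * c_jy[(xj, y)])
        total += p_joint * math.log(ratio)
    return total
```

Two attributes that are deterministic copies of each other given the class yield \(\text {CMI} = \log 2\) (for binary values), while conditionally independent attributes yield 0, so ranking candidate parents by this score recovers the intended tree edges.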
We denote \({\mathcal {M}_{\text {w}}^{{\mathcal {B}}^*}}\), \({\mathcal {M}_{\text {d}}^{{\mathcal {B}}^*}}\) and \({\mathcal {M}_{\text {e}}^{{\mathcal {B}}^*}}\) with naive Bayes structure as \({ \mathop { \text {NB}^{\mathrm{w}}}}\), \({ \mathop { \text {NB}^{\mathrm{d}}}}\) and \({ \mathop { \text {NB}^{\mathrm{e}}}}\) respectively. With TAN structure, \({\mathcal {M}_{\text {w}}^{{\mathcal {B}}^*}}\), \({\mathcal {M}_{\text {d}}^{{\mathcal {B}}^*}}\) and \({\mathcal {M}_{\text {e}}^{{\mathcal {B}}^*}}\) are denoted as \({ \mathop { \text {TAN}^{\mathrm{w}}}}\), \({ \mathop { \text {TAN}^{\mathrm{d}}}}\) and \({ \mathop { \text {TAN}^{\mathrm{e}}}}\). With KDB (\(K=1\)), \({\mathcal {M}_{\text {w}}^{{\mathcal {B}}^*}}\), \({\mathcal {M}_{\text {d}}^{{\mathcal {B}}^*}}\) and \({\mathcal {M}_{\text {e}}^{{\mathcal {B}}^*}}\) are denoted as \({ \mathop { \text {KDB1}^{\mathrm{w}}}}\), \({ \mathop { \text {KDB1}^{\mathrm{d}}}}\) and \({ \mathop { \text {KDB1}^{\mathrm{e}}}}\).
The ‘(I)’ in a label indicates this initialization; an absence of ‘(I)’ means the parameters are initialized to zero.
7.1 NB structure
Comparative scatter plots on all 72 datasets of 0–1 Loss, RMSE and training time for \({ \mathop { \text {NB}^{\mathrm{w}}}}\), \({ \mathop { \text {NB}^{\mathrm{d}}}}\) and \({ \mathop { \text {NB}^{\mathrm{e}}}}\) are shown in Fig. 4. Training time plots are on a log scale. Plots for Big datasets are shown separately. It can be seen that the three parameterizations have a similar spread of 0–1 Loss and RMSE values; however, \({ \mathop { \text {NB}^{\mathrm{w}}}}\) is greatly advantaged in terms of training time. We will see in Sect. 7.4 that this computational advantage arises from the desirable convergence properties of \({ \mathop { \text {NB}^{\mathrm{w}}}}\). That \({ \mathop { \text {NB}^{\mathrm{w}}}}\) achieves equivalent accuracy with much less computation indicates that it is a more effective parameterization than \({ \mathop { \text {NB}^{\mathrm{d}}}}\) and \({ \mathop { \text {NB}^{\mathrm{e}}}}\). The slight variation in the accuracy of the three discriminative parameterizations (that is, in 0–1 Loss and RMSE) on All datasets is due to numerical instability of the solver used for optimization. The difference arises mainly on very small datasets. It can be seen that on big datasets the three parameterizations result in the same accuracy.
7.2 TAN structure
Figure 7 shows the comparative spread of 0–1 Loss, RMSE and training time in seconds of \({ \mathop { \text {TAN}^{\mathrm{w}}}}\), \({ \mathop { \text {TAN}^{\mathrm{d}}}}\) and \({ \mathop { \text {TAN}^{\mathrm{e}}}}\) on All and Big datasets. A trend similar to that of NB can be seen. With a similar spread of 0–1 Loss and RMSE among the three parameterizations, training time is greatly improved for \({ \mathop { \text {TAN}^{\mathrm{w}}}}\) when compared with \({ \mathop { \text {TAN}^{\mathrm{d}}}}\) and \({ \mathop { \text {TAN}^{\mathrm{e}}}}\). Note, as pointed out before, that the minor variation in the performance of the three discriminative parameterizations is due to numerical issues within the solver on some small datasets. On big datasets, one can see a similar spread of 0–1 Loss and RMSE.
7.3 KDB (\(K=1\)) structure
Figure 10 shows the comparative spread of 0–1 Loss, RMSE and training time in seconds of \({ \mathop { \text {KDB1}^{\mathrm{w}}}}\), \({ \mathop { \text {KDB1}^{\mathrm{d}}}}\) and \({ \mathop { \text {KDB1}^{\mathrm{e}}}}\) on All and Big datasets. As with NB and TAN, a similar spread of 0–1 Loss and RMSE is evident among the three discriminative parameterizations. Likewise, training time is greatly improved for \({ \mathop { \text {KDB1}^{\mathrm{w}}}}\) when compared with \({ \mathop { \text {KDB1}^{\mathrm{d}}}}\) and \({ \mathop { \text {KDB1}^{\mathrm{e}}}}\).
7.4 Convergence analysis
A comparison of the convergence of Negative LogLikelihood (NLL) of the three parameterizations on some sample datasets with NB, TAN and KDB (\(K=1\)) structure is shown in Figs. 13 and 14.
To quantify how much faster \({\mathcal {M}_{\text {w}}^{{\mathcal {B}}^*}}\) is than the other two parameterizations, we plot a histogram of the number of iterations it takes \({\mathcal {M}_{\text {d}}^{{\mathcal {B}}^*}}\) and \({\mathcal {M}_{\text {e}}^{{\mathcal {B}}^*}}\), beyond five iterations, to reach the negative log-likelihood that \({\mathcal {M}_{\text {w}}^{{\mathcal {B}}^*}}\) achieved at the fifth iteration. If the three parameterizations followed similar convergence, one should expect many zeros in the histogram. Note that if after the fifth iteration the NLL of \({\mathcal {M}_{\text {w}}^{{\mathcal {B}}^*}}\) is greater than that of \({\mathcal {M}_{\text {d}}^{{\mathcal {B}}^*}}\), we plot the negative of the number of iterations it takes \({\mathcal {M}_{\text {w}}^{{\mathcal {B}}^*}}\) to reach the NLL of \({\mathcal {M}_{\text {d}}^{{\mathcal {B}}^*}}\). Similarly, if after the fifth iteration the NLL of \({\mathcal {M}_{\text {w}}^{{\mathcal {B}}^*}}\) is greater than that of \({\mathcal {M}_{\text {e}}^{{\mathcal {B}}^*}}\), we plot the negative of the number of iterations it takes \({\mathcal {M}_{\text {w}}^{{\mathcal {B}}^*}}\) to reach the NLL of \({\mathcal {M}_{\text {e}}^{{\mathcal {B}}^*}}\). Figures 15, 16 and 17 show these histogram plots for the NB, TAN and KDB (\(K=1\)) structures respectively. It can be seen that \({\mathcal {M}_{\text {w}}^{{\mathcal {B}}^*}}\) (with all three structures) achieves an NLL that would otherwise take, on average, 10 more iterations over the data for \({\mathcal {M}_{\text {d}}^{{\mathcal {B}}^*}}\) and 15 more iterations for \({\mathcal {M}_{\text {e}}^{{\mathcal {B}}^*}}\). This is an extremely useful property of \({\mathcal {M}_{\text {w}}^{{\mathcal {B}}^*}}\), especially for big data, where iterating through the dataset is expensive.
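The statistic plotted in these histograms can be sketched as follows, given two per-iteration NLL curves. This is our own reading of the procedure described above (the function name is hypothetical), where a negative return value means the comparison runs the other way:

```python
def extra_iterations(nll_w, nll_other, k=5):
    """Iterations beyond k (1-based) that the competing optimizer needs to
    reach the NLL the weighted parameterization attains at iteration k.
    Returns a negative count when the weighted parameterization is behind."""
    a, b = nll_w[k - 1], nll_other[k - 1]
    if a == b:
        return 0          # identical convergence at iteration k
    if a > b:             # weighted is behind: report its lag as a negative
        return -extra_iterations(nll_other, nll_w, k)
    for t, v in enumerate(nll_other[k:], start=1):
        if v <= a:
            return t
    return len(nll_other) - k  # never reached: a lower bound on the lag
```

For instance, if the weighted run reaches NLL 4 at iteration 5 and the competitor only reaches 4 at iteration 7, the statistic is 2; swapping the arguments yields -2.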
7.5 Comparison with MAP
Win–Draw–Loss: \({ \mathop { \text {NB}^{\mathrm{w}}}}\) versus NB, \({ \mathop { \text {TAN}^{\mathrm{w}}}}\) versus TAN and \({ \mathop { \text {KDB1}^{\mathrm{w}}}}\) versus KDB1

|  | \({ \mathop { \text {NB}^{\mathrm{w}}}}\) vs NB W–D–L | p | \({ \mathop { \text {TAN}^{\mathrm{w}}}}\) vs TAN W–D–L | p | \({ \mathop { \text {KDB1}^{\mathrm{w}}}}\) vs KDB1 W–D–L | p |
| --- | --- | --- | --- | --- | --- | --- |
| All datasets |  |  |  |  |  |  |
| Bias | 62/3/7 | \(\mathbf{<0.001}\) | 50/4/18 | \(\mathbf{<0.001}\) | 54/5/13 | \(\mathbf{<0.001}\) |
| Variance | 19/3/50 | \(\mathbf{<0.001}\) | 21/2/49 | 0.011 | 19/4/49 | \(\mathbf{<0.001}\) |
| 0–1 Loss | 45/4/23 | 0.010 | 34/3/35 | 1 | 39/4/29 | 0.275 |
| RMSE | 45/3/24 | 0.015 | 25/1/46 | 0.017 | 29/2/41 | 0.1882 |
| Big datasets |  |  |  |  |  |  |
| 0–1 Loss | 11/1/0 | \(\mathbf{<0.001}\) | 11/1/0 | \(\mathbf{<0.001}\) | 11/0/1 | \(\mathbf{<0.001}\) |
| RMSE | 11/0/1 | \(\mathbf{<0.001}\) | 11/0/1 | \(\mathbf{<0.001}\) | 11/0/1 | \(\mathbf{<0.001}\) |
Note that though discriminative learning (which optimizes the parameters characterizing a CCBN) has better 0–1 Loss and RMSE performance than generative learning (which optimizes the joint probability), generative learning has the advantage of being extremely fast, as it requires only counting sufficient statistics from the data. Another advantage of generative learning is its ability to back off when a certain combination does not occur in the data. For example, if a TAN or KDB classifier has not encountered a \(\langle \texttt {feature-value, parent-value, class-value}\rangle \) combination at training time, it can fall back to \(\langle \texttt {feature-value, class-value}\rangle \) at testing time. In this way, a TAN classifier can step back to NB, and NB can step back to class prior probabilities. Such elegant back-off is missing from discriminative learning. If a certain combination does not occur in the data, the parameters associated with that combination will not be optimized and will remain fixed at their initialized value (for example 0). A discriminative classifier has no way of handling unseen combinations other than to ignore them when they occur in the testing data. How to incorporate such hierarchical learning into discriminative learning is a goal of future research, as discussed in Sect. 8.
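A minimal sketch of such generative back-off is given below. This is our own illustration of the idea, not code from the released implementation; the class and method names are hypothetical:

```python
from collections import Counter

class BackoffEstimator:
    """Generative-style back-off: P(x | parent, y) falls back to P(x | y),
    as NB would estimate it, and finally to a small flat estimate when a
    combination was never observed at training time."""

    def __init__(self, xs, parents, ys):
        self.xpy = Counter(zip(xs, parents, ys))   # <x, parent, y> counts
        self.py = Counter(zip(parents, ys))        # <parent, y> counts
        self.xy = Counter(zip(xs, ys))             # <x, y> counts
        self.y = Counter(ys)                       # class counts
        self.n = len(ys)

    def prob(self, x, parent, y):
        if self.xpy[(x, parent, y)] > 0:   # full <x, parent, y> was seen
            return self.xpy[(x, parent, y)] / self.py[(parent, y)]
        if self.xy[(x, y)] > 0:            # back off to <x, y>
            return self.xy[(x, y)] / self.y[y]
        return 1.0 / self.n                # last resort: flat estimate
```

A discriminative parameterization, by contrast, would leave the weight of an unseen combination at its initialized value, with no principled fallback.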
8 Conclusion and future work
In this paper, we propose an effective parameterization of \({ \mathop { \text {BN} } }\). We present a unified framework for learning the parameters of Bayesian network classifiers. We formulate three different parameterizations and compare their performance in terms of 0–1 Loss, RMSE and the training time each parameterization takes to converge. We show with NB, TAN and KDB structures that the proposed weighted discriminative parameterization has similar 0–1 Loss and RMSE to the other two but significantly faster convergence. We also show that it not only converges faster but asymptotes to its global minimum much more quickly than the other two parameterizations. This is desirable when learning from huge quantities of data with Stochastic Gradient Descent (SGD). It is also shown that discriminative training of \({ \mathop { \text {BN} } }\) classifiers leads to lower bias than generative parameter learning.

The three parameterizations presented in this work learn a weight for each attribute-value-per-class-value-per-parent-value. As discussed in Sect. 5.4, contrary to \({\mathcal {M}_{\text {d}}^{{\mathcal {B}}^*}}\) and \({\mathcal {M}_{\text {e}}^{{\mathcal {B}}^*}}\), the \({\mathcal {M}_{\text {w}}^{{\mathcal {B}}^*}}\) parameterization can generalize parameters. For example, once MAP estimates of the probabilities are learned, one can learn a weight: (a) for each attribute only (i.e., the same weight for all attribute-values, all class values and all parent values), (b) for each attribute-value only, (c) for each attribute-value-per-class-value, (d) for each attribute-value-per-class-value-per-parent, etc. Such parameter generalization could offer additional speed-up of training and is a promising avenue for future research.
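The four tying levels (a)–(d) amount to choosing the key under which a weight is stored and shared. A minimal sketch of this idea (our own illustration; names are hypothetical):

```python
def weight_key(attr, value, cls, parent, level):
    """Key under which a discriminative weight is stored, at four tying
    levels: 'attr' ties all values/classes/parents to one weight per
    attribute; 'value' additionally distinguishes attribute-values;
    'value_class' additionally distinguishes classes; 'full' is one
    weight per <attribute, value, class, parent> combination."""
    return {
        "attr": (attr,),
        "value": (attr, value),
        "value_class": (attr, value, cls),
        "full": (attr, value, cls, parent),
    }[level]

# A tiny illustration: number of distinct weights per tying level for one
# attribute with 4 values, 3 classes and a parent taking 5 values.
combos = [(0, v, c, p) for v in range(4) for c in range(3) for p in range(5)]
counts = {lvl: len({weight_key(*combo, lvl) for combo in combos})
          for lvl in ("attr", "value", "value_class", "full")}
```

Here `counts` shows how the parameter space shrinks from 60 weights at the 'full' level down to a single weight at the 'attr' level, which is the source of the potential training speed-up.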

Handling combinations of \(\langle \texttt {feature-value}, \texttt {parent-value}, \texttt {class-value}\rangle \) that have not been seen at training time is one of the weaker properties of discriminative learning. We plan to design a hierarchical discriminative learning algorithm that learns lower-level discriminative weights and can back off from higher levels when a combination is not observed in the training data.

We plan to conduct an extended analysis of \({ \mathop { \text {BN} } }\) models that can capture higher-order interactions. Because the CLL is not convex for most of these models (Roos et al. 2005), this falls outside the scope of this paper. It does, however, suggest inviting avenues for big data research, in which context low-bias classifiers are required.
9 Code and datasets
All the datasets used in this paper are in the public domain and can be downloaded from Frank and Asuncion (2010). Code with running instructions can be downloaded from https://github.com/nayyarzaidi/EBNC.git.
Footnotes
 1.
Note that second-order algorithms such as Newton's method are not affected by scaling, but they are often computationally impractical because they require computing and inverting the Hessian at each step.
 2.
By big, we mean datasets that have a large number of instances, rather than a large number of features.
Acknowledgements
This research has been supported by the Australian Research Council (ARC) under Grant DP140100087. This material is based upon work supported by the Air Force Office of Scientific Research, Asian Office of Aerospace Research and Development (AOARD) under Award Numbers FA2386-15-1-4007 and FA2386-16-1-4023. The authors would like to thank Reza Haffari and Ana Martinez for helpful discussions during the course of this research. The authors would also like to acknowledge Jesus Cerquides for his extremely useful ideas and suggestions that shaped this work.
References
 Buntine, W. (1994). Operations for learning with graphical models. Journal of Artificial Intelligence Research, 2, 159–225.
 Byrd, R., Lu, P., & Nocedal, J. (1995). A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific and Statistical Computing, 16(5), 1190–1208.
 Carvalho, A., Roos, T., Oliveira, A., & Myllymaki, P. (2011). Discriminative learning of Bayesian networks via factorized conditional log-likelihood. Journal of Machine Learning Research.
 Chow, C., & Liu, C. (1968). Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3), 462–467.
 Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7, 1–30.
 Fayyad, U. M., & Irani, K. B. (1992). On the handling of continuous-valued attributes in decision tree generation. Machine Learning, 8(1), 87–102.
 Frank, A., & Asuncion, A. (2010). UCI machine learning repository. http://archive.ics.uci.edu/ml.
 Friedman, N., Geiger, D., & Goldszmidt, M. (1997). Bayesian network classifiers. Machine Learning, 29(2), 131–163.
 Greiner, R., & Zhou, W. (2002). Structural extension to logistic regression: Discriminative parameter learning of belief net classifiers. In Annual national conference on artificial intelligence (AAAI), pp. 167–173.
 Greiner, R., Su, X., Shen, B., & Zhou, W. (2005). Structural extensions to logistic regression: Discriminative parameter learning of belief net classifiers. Machine Learning, 59, 297–322.
 Grossman, D., & Domingos, P. (2004). Learning Bayesian network classifiers by maximizing conditional likelihood. In ICML.
 Heckerman, D., & Meek, C. (1997). Models and selection criteria for regression and classification. In International conference on uncertainty in artificial intelligence.
 Jebara, T. (2003). Machine learning: Discriminative and generative. Berlin: Springer.
 Kohavi, R., & Wolpert, D. (1996). Bias plus variance decomposition for zero-one loss functions. In ICML (pp. 275–283).
 Langford, J., Li, L., & Strehl, A. (2007). Vowpal Wabbit online learning project. https://github.com/JohnLangford/vowpal_wabbit/wiki.
 Martinez, A., Chen, S., Webb, G. I., & Zaidi, N. A. (2016). Scalable learning of Bayesian network classifiers. Journal of Machine Learning Research, 17, 1–35.
 Ng, A., & Jordan, M. (2002). On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In Advances in neural information processing systems.
 Pernkopf, F., & Bilmes, J. (2005). Discriminative versus generative parameter and structure learning of Bayesian network classifiers. In ICML.
 Pernkopf, F., & Bilmes, J. A. (2010). Efficient heuristics for discriminative structure learning of Bayesian network classifiers. Journal of Machine Learning Research, 11, 2323–2360.
 Pernkopf, F., & Wohlmayr, M. (2009). On discriminative parameter learning of Bayesian network classifiers. In ECML PKDD.
 Ripley, B. D. (1996). Pattern recognition and neural networks. Cambridge: Cambridge University Press.
 Roos, T., Wettig, H., Grünwald, P., Myllymäki, P., & Tirri, H. (2005). On discriminative Bayesian network classifiers and logistic regression. Machine Learning, 59(3), 267–296.
 Rubinstein, Y. D., & Hastie, T. (1997). Discriminative vs informative learning. In AAAI.
 Sahami, M. (1996). Learning limited dependence Bayesian classifiers. In Proceedings of the second international conference on knowledge discovery and data mining (pp. 335–338).
 Su, J., Zhang, H., Ling, C., & Matwin, S. (2008). Discriminative parameter learning for Bayesian networks. In ICML.
 Webb, G. I. (2000). Multiboosting: A technique for combining boosting and wagging. Machine Learning, 40(2), 159–196.
 Webb, G. I., Boughton, J., Zheng, F., Ting, K. M., & Salem, H. (2012). Learning by extrapolation from marginal to full-multivariate probability distributions: Decreasingly naive Bayesian classification. Machine Learning, 86(2), 233–272.
 Zaidi, N. A., Carman, M. J., Cerquides, J., & Webb, G. I. (2014). Naive-Bayes inspired effective preconditioners for speeding-up logistic regression. In IEEE international conference on data mining.
 Zaidi, N. A., Cerquides, J., Carman, M. J., & Webb, G. I. (2013). Alleviating naive Bayes attribute independence assumption by attribute weighting. Journal of Machine Learning Research, 14, 1947–1988.
 Zaidi, N. A., Petitjean, F., & Webb, G. I. (2016). Preconditioning an artificial neural network using naive Bayes. In Proceedings of the 20th Pacific–Asia conference on knowledge discovery and data mining (PAKDD).
 Zaidi, N. A., Webb, G. I., Carman, M. J., & Petitjean, F. (2015). Deep Broad Learning—Big models for big data. arXiv:1509.01346.
 Zhu, C., Byrd, R. H., & Nocedal, J. (1997). L-BFGS-B: Fortran routines for large scale bound constrained optimization. ACM Transactions on Mathematical Software, 23(4), 550–560.