Machine Learning, Volume 104, Issue 2–3, pp 151–194

\(\text {ALR}^n\): accelerated higher-order logistic regression

  • Nayyar A. Zaidi
  • Geoffrey I. Webb
  • Mark J. Carman
  • François Petitjean
  • Jesús Cerquides

Abstract

This paper introduces Accelerated Logistic Regression: a hybrid generative-discriminative approach to training Logistic Regression with high-order features. We present two main results: (1) that our combined generative-discriminative approach significantly improves the efficiency of Logistic Regression and (2) that incorporating higher order features (i.e. features that are the Cartesian products of the original features) reduces the bias of Logistic Regression, which in turn significantly reduces its error on large datasets. We assess the efficacy of Accelerated Logistic Regression by conducting an extensive set of experiments on 75 standard datasets. We demonstrate its competitiveness, particularly on large datasets, by comparing against state-of-the-art classifiers including Random Forest and Averaged n-Dependence Estimators.

Keywords

Higher-order Logistic Regression; Low-bias classifiers; Generative-discriminative learning

1 Introduction

Machine learning is confronted with ever growing data quantity (Ganz and Reinsel 2012). However, many state-of-the-art learning algorithms were developed in the context of relatively small datasets. Large training sets often support the creation of very detailed models that can encode complex high-order multi-variate distributions, whereas such models would over-fit small training datasets and should be avoided (Brain and Webb 2002). We highlight this phenomenon in Fig. 1. We know that the accuracy of most classifiers increases as they are provided with more training data. This can be observed in Fig. 1 which plots the variation in error-rate of two classifiers with increasing quantities of training data on the poker-hand dataset (Frank and Asuncion 2010). One is a low-bias and high-variance learner (KDB \(k=5\), a K-Dependence Bayes estimator, taking into account quintic features, Sahami 1996) and the other is a low-variance and high-bias learner (naive Bayes, a linear classifier). For small quantities of data, the low-variance learner achieves the lowest error. However, as the data quantity increases, the low-bias learner comes to achieve the lower error as it can better model higher-order distributions from which the data might be sampled.
Fig. 1

Comparative study of the error committed by high- and low-bias classifiers on increasing quantities of data

It has been shown that Bayesian Network Classifiers (BNCs) that explicitly represent higher-order interactions tend to have lower bias than those that do not (Martinez et al. 2016; Webb et al. 2011). This is because BNCs that can represent higher-order interactions can exactly represent any of a superset of the distributions that can be represented by BNCs that are restricted to lower order interactions. Thus they have lower representation bias and hence, all other things being equal, lower inductive bias (Mitchell 1980) than the more restricted BNCs. Except in the specific cases where the true distribution to be modeled fits exactly into the more restricted model, given sufficient data the more expressive BNC will form a more accurate model.

It has also been shown that Logistic Regression (LR) tends to have lower bias than naive Bayes, which is a Bayesian Network Classifier with a model of equivalent form to that of LR (Zaidi et al. 2013, 2014). In consequence, it seems likely that variants of LR that explicitly represent higher-order interactions should have low bias as well, and that the bias should continue to decrease as the order of the interactions represented increases. We call such variants Higher-Order LR and abbreviate them as \(\text {LR}^{n}\), where n is the order of interactions that are modeled. Formal definitions of these concepts are provided in Sect. 5.

While the use of higher-order LR models is quite common and at least one implementation of \(\text {LR}^{2}\) is in the public domain (Langford et al. 2007), several questions warrant further investigation: the performance of such models relative to standard LR as data quantities vary, the behaviour of \(\text {LR}^{3}\), and the bias/variance profile of higher-order LR models. We investigate these issues herein. It is noteworthy that a significant amount of research has been done on correcting the estimation bias of Logistic Regression (Firth 1993; Szilard et al. 2009). Most of this research has been driven by the fact that LR’s parameters are obtained through Maximum Likelihood Estimation (MLE), which can be biased if the data sample is too small. However, it has been shown that, asymptotically, MLE estimates have zero estimation bias. Similarly, several studies have addressed the issue of bias due to omitted covariates in Logistic Regression models (Neuhaus and Jewell 1993; Hauck et al. 1998). Some studies have also investigated the Bayesian version of Logistic Regression (Genkin et al. 2012).

An \(\text {LR}^{n}\) model must be learned discriminatively through computationally intensive gradient-descent-based search. Considering all possible higher-order features in \(\text {LR}^{n}\) and learning the corresponding parameters by optimizing conditional log-likelihood (CLL) is expensive, so any speed-up to the optimization process is highly desirable. A second objective of this paper is to provide an effective mechanism for achieving this.

It has been shown that a hybrid generative-discriminative learner can exploit the strengths of both Naive Bayes (NB) and Logistic Regression (LR) classifiers by creating a weighted variant of NB in which the weights are optimized using a discriminative objective function, that is, maximization of conditional log-likelihood (Zaidi et al. 2013, 2014). The resulting model can be viewed as either using weights to alleviate the feature independence assumption of NB, or as using the maximum likelihood parameterization of NB to pre-condition the discriminative search of LR. The result is a learner that learns models that are exactly equivalent to LR, but does so much more efficiently. In this work, we show how to achieve the same result with \(\text {LR}^{n}\).

We create a hybrid generative-discriminative learner named \(\text {ALR}^n\) for categorical data that learns models of equivalent order to those of \(\text {LR}^{n}\), but does so much more efficiently than \(\text {LR}^{n}\). We further demonstrate that the resulting models have low bias, which leads to very low error on large quantities of data. However, in order to create this hybrid learner we must first create an efficient generative counterpart to \(\text {LR}^{n}\).

In summary, the contributions of this work are:
  • developing an efficient generative counter-part to \(\text {LR}^{n}\), named Averaged n-Join Estimators (AnJE);

  • developing \(\text {ALR}^n\): a hybrid of \(\text {LR}^{n}\) and AnJE;

  • demonstrating that \(\text {ALR}^n\) has equivalent error to \(\text {LR}^{n}\), but is substantially more efficient;

  • demonstrating that \(\text {ALR}^n\) has low error on large data.

Note that it was initially proposed in Brain and Webb (2002) that for larger quantities of data, one should aim for low-bias models. This hypothesis was tested in the context of Bayesian Network classifiers in Webb et al. (2005, 2011), where the results corroborated the hypothesis. However, we are not aware of any past work that investigates this hypothesis in the context of higher-order Logistic Regression. Therefore, another contribution of this paper is:
  • demonstrating that the bias of \(\text {LR}^{n}\) decreases as n increases and that in consequence \(\text {LR}^{n}\) with higher n tends to achieve lower error with greater data quantities.

The rest of this paper is organized as follows. In Sect. 2, we introduce the notation that is used throughout this paper. We introduce higher-order Logistic Regression in Sect. 3, where we also evaluate \(\text {LR}^{n}\) empirically and show that higher values of n lead to lower bias. Using generative models to pre-condition discriminative learning is discussed in Sect. 4. The proposed algorithm (\(\text {ALR}^n\)) is presented in Sect. 5. Work related to our proposed algorithm is discussed in Sect. 6. We empirically evaluate the proposed algorithm in Sect. 7. We conclude in Sect. 8 with some pointers to future work.

2 Notation

We seek to assign a value \(y\in \varOmega _Y=\{y_1, \ldots y_C\}\) of the class variable Y, to a given example \(\mathbf {x} = (x_1, \ldots , x_a)\), where the \(x_i\) are value assignments for the a features \(\mathcal {A} = \{X_1,\ldots , X_a\}\). We define \(\mathcal {A}\atopwithdelims ()n\) as the set of all subsets of \(\mathcal {A}\) of size n, where each subset in the set is denoted as \(\alpha \): \({\mathcal {A} \atopwithdelims ()n}=\{\alpha \subseteq \mathcal {A}: |\alpha |=n\}\). We use \(x_\alpha \) to denote the set of values taken by features in the subset \(\alpha \) for any data object \(\mathbf {x}\).

LR for categorical data learns a weight for every feature value per class. Therefore, for LR, we denote by \(\beta _y\) the weight associated with class y, and by \(\beta _{y,i,x_i}\) the weight associated with feature i taking value \(x_i\) with class label y. For \(\text {LR}^{n}\), \(\beta _{y,\alpha ,x_{\alpha }}\) specifies the weight associated with class y and feature subset \(\alpha \) taking value \(x_\alpha \). The equivalent weights for \(\text {ALR}^n\) are denoted by \(w_y\), \(w_{y,i,x_i}\) and \(w_{y,\alpha ,x_\alpha }\). The probability of feature i taking value \(x_i\) given class y is denoted by \(\mathrm {P}(x_i\arrowvert y)\). Similarly, the probability of feature subset \(\alpha \) taking value \(x_\alpha \) given class y is denoted by \(\mathrm {P}(x_\alpha \arrowvert y)\). Note that all probabilities are estimated probabilities. For clarity, we will not use the \(\hat{\mathrm{P}}(.)\) notation which is typically used for estimated probabilities.

3 Higher-order logistic regression

LR is a linear classifier. For categorical features LR can be expressed as:
$$\begin{aligned} \mathrm{P}_{\textit{LR}}(y\mid \mathbf{x}) = \exp \left( \beta _y + \sum _{i=1}^{a} \beta _{y,i,x_i} - \log \sum _{c \in \varOmega _Y} \exp \left( \beta _c + \sum _{j=1}^{a} \beta _{c,j,x_j} \right) \right) . \end{aligned}$$
(1)
Note, that LR for categorical data is often expressed as:
$$\begin{aligned} \mathrm{P}(y\mid \mathbf{x})= & {} \exp \left( \beta _y + \sum _{i=1}^{a} \beta _{y,i,x_i} \mathbf {1}\left( X_i=x_i, Y=y\right) \right. \nonumber \\&\left. - \log \sum _{c \in \varOmega _Y} \exp \left( \beta _c + \sum _{j=1}^{a} \beta _{c,j,x_j} \mathbf {1}\left( X_j=x_j, Y=c\right) \right) \right) , \end{aligned}$$
(2)
where \(\mathbf {1}(\cdot )\) is an indicator function that is 1 if it satisfies the input condition and zero otherwise. (2) reformulates (1) to sum only over the values that the indicator function will not cancel out.

Because its model is linear, LR is very restricted in the posterior distributions that it can precisely model. For example, it cannot model exclusive-or (XOR) between two features.

Adding higher-order features to LR increases the range of distributions that it can precisely model. Here, we define higher-order categorical features as features that are the Cartesian product of the primitive features, where the order n is the number of primitive features in the Cartesian product.

As mentioned in Sect. 1, it has been shown that Bayesian Network Classifiers that explicitly represent higher-order features tend to have lower bias than those that do not, and that the bias decreases as the order of the features increases (Martinez et al. 2016; Webb et al. 2011). Therefore, it seems likely that LR applied to higher-order features will likewise tend to have lower bias, with bias decreasing as the order increases. This is very significant, as LR is a powerful learning system and there is good reason to believe that the lower the bias of a learning system the lower its error will tend to be on very large datasets (Brain and Webb 2002).

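We define \(\text {LR}^{n}\) as:
$$\begin{aligned} \mathrm{P}_{\text {LR}^n}(y\mid \mathbf{x}) = \exp \left( \beta _y + \sum _{\alpha \in {\mathcal {A}\atopwithdelims ()n}} \beta _{y,\alpha ,x_\alpha } - \log \sum _{c \in \varOmega _Y} \exp \left( \beta _c + \sum _{\alpha ^* \in {\mathcal {A}\atopwithdelims ()n}} \beta _{c,\alpha ^*,x_{\alpha ^*}} \right) \right) . \end{aligned}$$
(3)
Again, we are expressing the definition in a non-standard form for the sake of clarity. The conventional definition is:
$$\begin{aligned} \mathrm{P}_{\text {LR}^n}(y\mid \mathbf{x})= & {} \exp \left( \beta _y + \sum _{\alpha \in {\mathcal {A}\atopwithdelims ()n}} \beta _{y,\alpha ,x_\alpha } \mathbf {1}\left( X_\alpha =x_\alpha , Y=y\right) \right. \nonumber \\&\left. - \log \sum _{c \in \varOmega _Y} \exp \left( \beta _c + \sum _{\alpha ^* \in {\mathcal {A}\atopwithdelims ()n}} \beta _{c,\alpha ^*,x_{\alpha ^*}} \mathbf {1}\left( X_{\alpha ^*}=x_{\alpha ^*}, Y=c\right) \right) \right) . \end{aligned}$$
(4)
Note that, in this work, we do not include lower-order terms. For example, if \(n=2\) we do not include terms for \(\beta _{y,i,x_i}\), because doing so does not increase the space of distinct distributions that can be modeled but does increase the number of parameters that must be optimized. However, it should be noted that including lower-order terms with regularization will affect the learning process and hence the model learned. A further advantage of including lower-order terms is that it provides an elegant backtracking procedure. If a higher-order term is not present at training time but only appears at classification time, one can use the lower-order weights instead. In the current formulation, there is no such backtracking mechanism. As will be discussed in Sect. 8, this hierarchical parameterization of \(\text {ALR}^n\) has been left as a promising direction for future work.

We also note that \(\text {LR}^{n}\) can be viewed as a Generalized Linear Model with a logistic link function and a fractional factorial design (Hill and Lewicki 2013).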
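The definition can be made concrete with a minimal Python sketch: it enumerates the order-n Cartesian-product features of an example and evaluates the resulting softmax posterior. The function names, the dictionary layout of the weights and the hand-set XOR weights are illustrative assumptions, not the implementation used in the experiments.

```python
import math
from itertools import combinations


def higher_order_features(x, n):
    """Return one (alpha, x_alpha) pair for every size-n subset of feature indices."""
    return [(alpha, tuple(x[i] for i in alpha))
            for alpha in combinations(range(len(x)), n)]


def lr_n_posterior(x, n, classes, beta_class, beta_feat):
    """Posterior of an order-n LR model; absent weights are treated as 0."""
    feats = higher_order_features(x, n)
    scores = {}
    for c in classes:
        scores[c] = beta_class.get(c, 0.0) + sum(
            beta_feat.get((c, alpha, value), 0.0) for alpha, value in feats)
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scores.values())
    log_z = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    return {c: math.exp(s - log_z) for c, s in scores.items()}


# Hypothetical hand-set weights: with n = 2 a single pairwise feature can
# encode XOR, which first-order LR cannot model.
xor_weights = {(x1 ^ x2, (0, 1), (x1, x2)): 5.0
               for x1 in (0, 1) for x2 in (0, 1)}


def predict_xor(x):
    """Most probable class under the hand-set order-2 model."""
    posterior = lr_n_posterior(x, 2, (0, 1), {}, xor_weights)
    return max(posterior, key=posterior.get)
```

With a = 2 and n = 2 there is exactly one feature subset, so the model holds one weight per class and per joint value of the two features, which is enough to separate the XOR pattern.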

3.1 Kernel LR and \(\text {LR}^{n}\)

One way to deal with non-linearities in the data when applying plain Logistic Regression (LR) is by using kernels. Popularized with the advent of Support Vector Machines (SVM) using the kernel trick (Bishop 2006), one can project the data into a higher dimensional space without explicitly making the transformation. We can always write LR in the following form: \(\mathrm{P}(y|\mathbf{x}) = \frac{1}{1 + \exp (\beta ^T \phi (\mathbf{x}))}\), where \(\phi (\mathbf{x})\) is some function. By virtue of the representer theorem, we can write the \(\beta \) vector as: \(\beta = \sum _i \alpha _i \phi (\mathbf{x}_i)\), which leads to LR of the form:
$$\begin{aligned} \mathrm{P}(y|\mathbf{x})= & {} \frac{1}{1 + \exp \left( \sum _i \alpha _i \phi (\mathbf{x}_i)^T \phi (\mathbf{x})\right) }, \nonumber \\= & {} \frac{1}{1 + \exp \left( \sum _i \alpha _i k(\mathbf{x}_i, \mathbf{x})\right) }. \end{aligned}$$
(5)
Equation 5 represents a form of higher-order LR with kernels. Several kernels can be used, such as linear, Gaussian and sigmoid. Of particular relevance is a polynomial kernel of degree d, which takes the form \(k(u,v) = (u.v)^{d}\) and includes d-degree terms in LR: \(d=1\) leads to \(\text {LR}^{1}\), \(d=2\) leads to \(\text {LR}^{2}\) and so on. A form similar to this is used by SVMs, but SVMs have the advantage that most of the \(\alpha _i\) are zero. This is due to the loss function that SVM optimizes, the Hinge Loss. The training examples whose \(\alpha _i\) are not zero are known as the support vectors. However, LR’s log loss function does not lead to such sparsity. The non-sparse nature of Kernel Logistic Regression (KLR) is one of its biggest drawbacks. Several methods such as Import Vector Machines (IVM) have been proposed to address this drawback and make KLR scalable to larger quantities of data (Zhu and Hastie 2001; Lin et al. 1998; Smola and Schölkopf 2000; Williams and Seeger 2001). The computational complexity of KLR is \(\mathcal {O}(N^3)\), which makes it unsuitable for big data classification: it will not be any faster than the k-nearest neighbour classifier. Even sparse models such as SVMs are not suitable for big data and truly large scale applications, as they can suffer from the curse of support vectors, i.e., most of the \(\alpha _i\) are non-zero (Sonnenburg and Franc 2010). Kernel machines (e.g., IVM, SVM, etc.) are not substantially faster than k-nearest neighbour classifiers, as the number of support vectors is linear in the training set size (Steinwart 2004).
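To illustrate why a degree-d polynomial kernel corresponds to d-degree terms, the following sketch checks numerically that \((u.v)^{d}\) equals the inner product of the explicit feature vectors made of all ordered products of d coordinates. It uses small numeric vectors and hypothetical function names; it is an illustration of the kernel identity only, not of KLR training.

```python
import math
from itertools import product


def poly_kernel(u, v, d):
    """Homogeneous polynomial kernel k(u, v) = (u . v)^d."""
    return sum(a * b for a, b in zip(u, v)) ** d


def degree_d_features(u, d):
    """Explicit feature map: all ordered products of d coordinates of u."""
    return [math.prod(u[i] for i in idx)
            for idx in product(range(len(u)), repeat=d)]


def explicit_inner_product(u, v, d):
    """Inner product computed in the expanded degree-d feature space."""
    return sum(a * b for a, b in zip(degree_d_features(u, d),
                                     degree_d_features(v, d)))
```

The kernel evaluates the same quantity in time linear in the input dimension, whereas the explicit map has a number of entries that grows exponentially in d.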

3.2 Experimental evaluation of \(\text {LR}^{n}\)

While \(\text {LR}^{n}\) is part of established data analytics practice, we are not aware of any research into its bias/variance profile or its performance relative to standard LR with respect to varying quantities of data. We here investigate those issues. Though we provide a detailed empirical analysis in Sect. 7, here we present some results to illustrate the power of modeling higher-order interactions.

Figure 2 shows learning curves for \(\text {LR}^{n}\) with \(n=1, 2\) and 3. We generated these curves using a prequential testing paradigm on the Localization dataset. For each run, we first randomized the dataset and then processed it sequentially in that random order. Each example was first classified and the probabilistic loss \(\frac{1}{C}\sum _c^C (\delta _{y=c} - \mathrm{P}(c|\mathbf{x}))^2\) was calculated, where \(\delta _{y=c}\) is an indicator function that is 1 if the actual class label y is the same as c and zero otherwise. Then the example was used to update the model. This process was repeated five times with different randomizations of the dataset. For each run this process generated N loss values, where \(N=164{,}860\), the size of the Localization dataset. To generate learning curves, for each point i on the X-axis, we plot: \(\frac{1}{T} \sum _{k=\max (i-T,1)}^{i} \text {loss}(x_k)\), where T is set to 1000. For \(i \le 1000\), we plot \(\frac{1}{i} \sum _{k=1}^{i} \text {loss}(x_k)\). It can be seen that for very small data quantities the lower variance \(\text {LR}^{2}\) results in lower error than \(\text {LR}^{3}\), but as data quantity increases the lower bias of \(\text {LR}^{3}\) results in the lowest error. It can also be seen that LR obtains better performance than \(\text {LR}^{2}\) and \(\text {LR}^{3}\) when learned from very small quantities of data (the learning curves are zoomed in between 0 and 1000 instances in Fig. 2 to illustrate this point). The obvious reason for the poor performance of \(\text {LR}^{2}\) and \(\text {LR}^{3}\) (models that include higher-order interactions) on smaller training sets is over-fitting: these powerful models can fit chance regularities in the data. Hence for smaller quantities of data, some sort of regularization that pulls the weights for many higher-order interactions back towards zero would lead to much better performance.
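The prequential protocol can be sketched as follows. This is a minimal illustration under stated assumptions: `predict` and `update` are hypothetical callables standing in for any incremental classifier, the posterior is a dictionary from class to probability, and the smoothing averages the most recent T losses, which differs by one term from the index range written in the text.

```python
def squared_loss(true_label, posterior):
    """Mean over classes of (indicator - predicted probability)^2."""
    return sum(((1.0 if c == true_label else 0.0) - p) ** 2
               for c, p in posterior.items()) / len(posterior)


def prequential_losses(stream, predict, update):
    """Test-then-train: score each example before the model sees its label."""
    losses = []
    for x, y in stream:
        losses.append(squared_loss(y, predict(x)))
        update(x, y)
    return losses


def learning_curve(losses, window=1000):
    """Running mean for the first `window` points, sliding mean afterwards."""
    curve, running = [], 0.0
    for i, loss in enumerate(losses, start=1):
        running += loss
        if i > window:
            running -= losses[i - window - 1]  # drop the loss that left the window
        curve.append(running / min(i, window))
    return curve
```

Averaging several such curves, one per random ordering of the data, gives a plot of the kind described for Fig. 2.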

We note that, when interpreting results presented on insufficient data (as is the case for the bottom plot in Fig. 2), it is easy for a data analyst to be misled into believing that the curves are diverging and that the higher-order classifier (\(\text {LR}^{3}\)) will asymptote to poorer performance than the lower-order classifier (\(\text {LR}^{2}\)) on large data. This misunderstanding is due to the faster learning rate that is achieved initially by the lower-order classifier.
Fig. 2

Comparison of the performance (RMSE) of \(\text {LR}^{1}\), \(\text {LR}^{2}\) and \(\text {LR}^{3}\) with varying quantities of data. For this demonstration, we used Stochastic gradient descent (SGD) for training the parameters of each model

Figure 3 shows a comparative scatter of the bias of \(\text {LR}^{n}\) as n increases (we compare \(\text {LR}^{1}\) with \(\text {LR}^{2}\) and \(\text {LR}^{2}\) with \(\text {LR}^{3}\), where \(\text {LR}^{1}\) is the standard LR). It can be seen that on the majority of the 75 datasets from the UCI repository (Table 1), the higher the value of n the lower the bias of \(\text {LR}^{n}\). The results are based on two rounds of two-fold cross-validation.
Fig. 3

Comparison of the scatter of the bias performance on 75 different datasets of LR versus \(\text {LR}^{2}\) (left) and \(\text {LR}^{2}\) versus \(\text {LR}^{3}\) (right)

In line with our expectation that low inductive bias will often lead to low statistical bias, which will in turn translate to lower error on big datasets, it can be seen in Fig. 4 that higher-order LR results in much lower 0–1 Loss than standard LR and that this benefit tends to continue as n increases. Note that for one dataset, poker-hand, \(\text {LR}^{2}\) achieves much lower error than \(\text {LR}^{3}\). We conjecture that this is because of strong two-level correlations that exist in the data. On this synthetic (and deterministic) dataset, \(\text {LR}^{3}\) will need much more data to estimate its parameters effectively. The current results only utilize half of the training data. It can be seen that for \(\text {LR}^{2}\) this much data is more than enough, but not for \(\text {LR}^{3}\), and hence \(\text {LR}^{3}\) results in poorer performance than \(\text {LR}^{2}\).

4 Using generative models to precondition discriminative learning

It has been shown that a direct equivalence between a weighted NB and LR can be exploited to greatly speed up LR’s learning rate (Zaidi et al. 2013, 2014).

NB can be expressed as:
$$\begin{aligned} \mathrm{P}_{\textit{NB}}(y\mid \mathbf{x}) = \frac{\mathrm{P}(y)\prod _{i=1}^{a}\mathrm{P}(x_i\mid y)}{\sum _{c\in \varOmega _Y} \mathrm{P}(c)\prod _{i=1}^{a}\mathrm{P}(x_i\mid c) }. \end{aligned}$$
(6)
One can add weights to NB to alleviate the feature independence assumption, resulting in the WANBIA-C formulation (Zaidi et al. 2014), that can be written as:
$$\begin{aligned} \mathrm{P}_{\textit{W}}(y\mid \mathbf{x})&=\frac{\mathrm{P}(y)^{w_y}\prod _{i=1}^{a}\mathrm{P}(x_i\mid y)^{w_{y,i,x_i}}}{\sum _{c\in \varOmega _Y} \mathrm{P}(c)^{w_c}\prod _{j=1}^{a}\mathrm{P}(x_j\mid c)^{w_{c,j,x_j}} } \nonumber \\&= \exp \left( w_y \log \mathrm{P}(y) + \sum _{i=1}^{a} w_{y,i,x_i}\log \mathrm{P}(x_i\mid y) - \right. \nonumber \\&\left. \quad \log \sum _{c\in \varOmega _Y} \exp \left( w_c \log \mathrm{P}(c) + \sum _{j=1}^{a} w_{c,j,x_j} \log \mathrm{P}(x_j\mid c) \right) \right) . \end{aligned}$$
(7)
When conditional log-likelihood (CLL) is maximized for LR and weighted NB using Eqs. 1 and 7 respectively, we get an equivalence such that \(\beta _c\propto w_c\log \mathrm{P}(c)\) and \(\beta _{c,i,x_i}\propto w_{c,i,x_i}\log \mathrm{P}(x_i\mid c)\). Thus, WANBIA-C and LR generate equivalent models.
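The equivalence can be checked numerically. The sketch below evaluates a weighted naive Bayes posterior and a categorical LR posterior in the form of Eq. 1, and maps one parameterization to the other by setting each LR weight to the corresponding weight times log-probability (taking the proportionality constant to be 1). All probabilities and weights are made-up toy values, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Normalize a dict of log-scores into a probability distribution."""
    top = max(scores.values())
    log_z = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    return {c: math.exp(s - log_z) for c, s in scores.items()}


def weighted_nb_posterior(x, prior, cond, w_class, w_feat):
    """Weighted naive Bayes: every probability is raised to its own weight."""
    return softmax({
        c: w_class[c] * math.log(prior[c]) + sum(
            w_feat[(c, i, v)] * math.log(cond[(c, i, v)])
            for i, v in enumerate(x))
        for c in prior})


def lr_posterior(x, beta_class, beta_feat):
    """Categorical LR with one weight per class and per (class, feature, value)."""
    return softmax({
        c: beta_class[c] + sum(beta_feat[(c, i, v)] for i, v in enumerate(x))
        for c in beta_class})


def to_lr_weights(prior, cond, w_class, w_feat):
    """Map weighted-NB parameters to LR weights via beta = w * log P."""
    beta_class = {c: w_class[c] * math.log(prior[c]) for c in prior}
    beta_feat = {k: w_feat[k] * math.log(cond[k]) for k in cond}
    return beta_class, beta_feat


# Toy, made-up parameters for two classes and two binary features.
prior = {"a": 0.6, "b": 0.4}
cond = {("a", 0, 0): 0.7, ("a", 0, 1): 0.3, ("a", 1, 0): 0.2, ("a", 1, 1): 0.8,
        ("b", 0, 0): 0.1, ("b", 0, 1): 0.9, ("b", 1, 0): 0.5, ("b", 1, 1): 0.5}
w_class = {"a": 1.3, "b": 0.7}
w_feat = {k: 0.5 + 0.1 * n for n, k in enumerate(sorted(cond))}
```

Because the log-probabilities are fixed once estimated, optimizing the weights w explores the same family of posteriors as optimizing the LR weights directly; only the scaling of each coordinate differs.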
Table 1 Details of Datasets

| Domain | Case | Att | Class | Domain | Case | Att | Class |
|---|---|---|---|---|---|---|---|
| Kddcup | 5,209,000 | 41 | 40 | Vowel | 990 | 14 | 11 |
| Poker-hand | 1,175,067 | 10 | 10 | Tic-Tac-ToeEndgame | 958 | 10 | 2 |
| MITFaceSetC | 839,000 | 361 | 2 | Annealing | 898 | 39 | 6 |
| Covertype | 581,012 | 55 | 7 | Vehicle | 846 | 19 | 4 |
| MITFaceSetB | 489,400 | 361 | 2 | PimaIndiansDiabetes | 768 | 9 | 2 |
| MITFaceSetA | 474,000 | 361 | 2 | BreastCancer(Wisconsin) | 699 | 10 | 2 |
| Census-Income(KDD) | 299,285 | 40 | 2 | CreditScreening | 690 | 16 | 2 |
| Localization | 164,860 | 7 | 3 | BalanceScale | 625 | 5 | 3 |
| Connect-4Opening | 67,557 | 43 | 3 | Syncon | 600 | 61 | 6 |
| Statlog(Shuttle) | 58,000 | 10 | 7 | Chess | 551 | 40 | 2 |
| Adult | 48,842 | 15 | 2 | Cylinder | 540 | 40 | 2 |
| LetterRecognition | 20,000 | 17 | 26 | Musk1 | 476 | 167 | 2 |
| MAGICGammaTelescope | 19,020 | 11 | 2 | HouseVotes84 | 435 | 17 | 2 |
| Nursery | 12,960 | 9 | 5 | HorseColic | 368 | 22 | 2 |
| Sign | 12,546 | 9 | 3 | Dermatology | 366 | 35 | 6 |
| PenDigits | 10,992 | 17 | 10 | Ionosphere | 351 | 35 | 2 |
| Thyroid | 9169 | 30 | 20 | LiverDisorders(Bupa) | 345 | 7 | 2 |
| Pioneer | 9150 | 37 | 57 | PrimaryTumor | 339 | 18 | 22 |
| Mushrooms | 8124 | 23 | 2 | Haberman’sSurvival | 306 | 4 | 2 |
| Musk2 | 6598 | 167 | 2 | HeartDisease(Cleveland) | 303 | 14 | 2 |
| Satellite | 6435 | 37 | 6 | Hungarian | 294 | 14 | 2 |
| OpticalDigits | 5620 | 49 | 10 | Audiology | 226 | 70 | 24 |
| PageBlocksClassification | 5473 | 11 | 5 | New-Thyroid | 215 | 6 | 3 |
| Wall-following | 5456 | 25 | 4 | GlassIdentification | 214 | 10 | 3 |
| Nettalk(Phoneme) | 5438 | 8 | 52 | SonarClassification | 208 | 61 | 2 |
| Waveform-5000 | 5000 | 41 | 3 | AutoImports | 205 | 26 | 7 |
| Spambase | 4601 | 58 | 2 | WineRecognition | 178 | 14 | 3 |
| Abalone | 4177 | 9 | 3 | Hepatitis | 155 | 20 | 2 |
| Hypothyroid(Garavan) | 3772 | 30 | 4 | TeachingAssistantEvaluation | 151 | 6 | 3 |
| Sick-euthyroid | 3772 | 30 | 2 | IrisClassification | 150 | 5 | 3 |
| King-rook-vs-king-pawn | 3196 | 37 | 2 | Lymphography | 148 | 19 | 4 |
| Splice-junctionGeneSequences | 3190 | 62 | 3 | Echocardiogram | 131 | 7 | 2 |
| Segment | 2310 | 20 | 7 | PromoterGeneSequences | 106 | 58 | 2 |
| CarEvaluation | 1728 | 8 | 4 | Zoo | 101 | 17 | 7 |
| Volcanoes | 1520 | 4 | 4 | PostoperativePatient | 90 | 9 | 3 |
| Yeast | 1484 | 9 | 10 | LaborNegotiations | 57 | 17 | 2 |
| ContraceptiveMethodChoice | 1473 | 10 | 3 | LungCancer | 32 | 57 | 3 |
| German | 1000 | 21 | 2 | Contact-lenses | 24 | 5 | 3 |
| LED | 1000 | 8 | 10 | | | | |

While it might seem less efficient to use WANBIA-C, which has twice the number of parameters of LR, the probability estimates are learned very efficiently using maximum likelihood estimation, and provide useful information about the classification task that in practice serve to effectively precondition the search for the parameterization of weights to maximize conditional log-likelihood (Zaidi et al. 2014).
Fig. 4

Comparison of the scatter of 0–1 Loss on 9 Big (\({>}100{,}000\) instances) datasets of LR versus \(\text {LR}^{2}\) (left) and \(\text {LR}^{2}\) versus \(\text {LR}^{3}\) (right)

5 Accelerated logistic regression (ALR)

In order to create an efficient and effective low-bias learner, we want to perform the same strategy that is used by WANBIA-C for LR with higher-order categorical features. To precondition such a model using generative learning, we would like to build a model of the form:
$$\begin{aligned} \mathrm{P}(y\mid \mathbf{x}) = \frac{\mathrm{P}(y)\prod _{\alpha \in {\mathcal {A}\atopwithdelims ()n}}\mathrm{P}(x_\alpha \mid y)}{\sum _{c\in \varOmega _Y} \mathrm{P}(c)\prod _{\alpha ^* \in {\mathcal {A}\atopwithdelims ()n}}\mathrm{P}(x_{\alpha ^*}\mid c)}. \end{aligned}$$
(8)
The only existing generative model of this form is a log-linear model, which requires computationally expensive conditional log-likelihood optimization and consequently would not be efficient to employ. It is not possible to create a Bayesian network of this form as it would require that \(\mathrm{P}(x_i,x_j)\) be independent of \(\mathrm{P}(x_i,x_k)\), which is impossible because they share the common feature \(x_i\). However, we can use a variant of the AnDE (Webb et al. 2005, 2011) approach of averaging many Bayesian networks. Unlike AnDE, we cannot use the arithmetic mean of the probability estimates from the constituent models, as we require a product of terms in the numerator of Eq. 8 rather than a sum, so we must instead use a geometric mean.

5.1 Averaged n-join estimators (AnJE)

Let \(\mathcal P\) be a partition of the features \(\mathcal {A}\). By assuming conditional independence, given the class, only between the sets of features \(\alpha \in \mathcal P\), one obtains an n-join estimator:
$$\begin{aligned} \mathrm{P}_{\text {AnJE}}(\mathbf{x}\mid y)=\prod _{\alpha \in \mathcal P}\mathrm{P}(x_\alpha \mid y). \nonumber \end{aligned}$$
For example, if there are four features \(X_1\), \(X_2\), \(X_3\) and \(X_4\) that are partitioned into the sets \(\{X_1,X_2\}\) and \(\{X_3,X_4\}\) then by assuming conditional independence between the sets we obtain
$$\begin{aligned} \mathrm{P}_{\text {AnJE}}(x_1,x_2,x_3,x_4\mid y) = \mathrm{P}(x_1,x_2 \mid y)\mathrm{P}(x_3,x_4 \mid y). \end{aligned}$$
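Let \(\Psi ^{\mathcal {A}}_n\) be the set of all partitions of \(\mathcal {A}\) such that \(\forall _{\mathcal P\in \Psi ^{\mathcal {A}}_n}\forall _{\alpha \in \mathcal P}|\alpha |=n\). For convenience we assume that \(|\mathcal {A}|\) is a multiple of n. Let \(\Upsilon ^{\mathcal {A}}_N\) be a subset of \(\Psi ^{\mathcal {A}}_n\) that includes each set of n features once,
$$\begin{aligned} \Upsilon ^{\mathcal {A}}_N \subseteq \Psi ^{\mathcal {A}}_n : \forall _{\alpha \in {\mathcal {A}\atopwithdelims ()n}} \left| \left\{ \mathcal P \in \Upsilon ^{\mathcal {A}}_N : \alpha \in \mathcal P\right\} \right| = 1. \nonumber \end{aligned}$$
The AnJE model is the geometric mean of the set of n-join estimators for the partitions \(\mathcal P \in \Upsilon ^{\mathcal {A}}_N\). Note that this partitioning of features can be viewed from the perspective of combinatorial design theory, where the idea of partitioning the space is commonly used to reduce the overall complexity of the problem (Stinson 2003; Lint and Wilson 1992).
The AnJE estimate of conditional likelihood on a per-datum basis can be written as:
$$\begin{aligned} \mathrm{P}_{\text {AnJE}}(y\mid \mathbf{x}) \propto \mathrm{P}(y)\,\mathrm{P}_{\text {AnJE}}(\mathbf{x}\mid y) = \mathrm{P}(y)\prod _{\alpha \in {\mathcal {A}\atopwithdelims ()n}}\mathrm{P}(x_\alpha \mid y)^{\frac{(n-1)!(a-n)!}{(a-1)!}}. \end{aligned}$$
(9)
This is derived as follows. Each \(\mathcal P\) is of size \(s=a/n\). There are \(a\atopwithdelims ()n\) feature-value n-tuples. Each must occur in exactly one partition, so the number of partitions must be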
$$\begin{aligned} p ={a\atopwithdelims ()n}/s = \frac{(a-1)!}{(n-1)!(a-n)!} . \end{aligned}$$
(10)
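The count in Eq. 10 can be checked on small cases with a short sketch. The helper names are hypothetical; `equal_size_partitions` enumerates every partition of the features into blocks of exactly n features, and `num_partitions` evaluates p.

```python
from itertools import combinations
from math import comb, factorial


def num_partitions(a, n):
    """p from Eq. 10: partitions needed so that every n-subset occurs exactly once."""
    if a % n != 0:
        raise ValueError("a must be a multiple of n")
    return comb(a, n) // (a // n)


def equal_size_partitions(features, n):
    """Yield every partition of `features` into blocks of exactly n features."""
    features = tuple(features)
    if not features:
        yield ()
        return
    first, rest = features[0], features[1:]
    # The first remaining feature must share a block with n - 1 of the others.
    for others in combinations(rest, n - 1):
        block = (first,) + others
        remaining = tuple(f for f in rest if f not in others)
        for tail in equal_size_partitions(remaining, n):
            yield (block,) + tail
```

For a = 4 and n = 2 the enumeration yields exactly the p = 3 partitions {1,2}{3,4}, {1,3}{2,4} and {1,4}{2,3}. For a = 6 and n = 2 there are 15 equal-size partitions but only p = 5 are needed to cover every pair once, so the selected set is in general a proper subset of all such partitions.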
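The geometric mean of all the AnJE models is thus
$$\begin{aligned} \mathrm{P}_{\text {AnJE}}(\mathbf{x}\mid y) = \sqrt[p]{\prod _{\mathcal P \in \Upsilon ^{\mathcal {A}}_N}\prod _{\alpha \in \mathcal P}\mathrm{P}(x_\alpha \mid y)} = \prod _{\alpha \in {\mathcal {A}\atopwithdelims ()n}}\mathrm{P}(x_\alpha \mid y)^{\frac{(n-1)!(a-n)!}{(a-1)!}}. \end{aligned}$$
(11)
Using Eq. 9, we can write the \(\log \) of \(\mathrm{P}(y\mid \mathbf{x})\) as:
$$\begin{aligned} \log \mathrm{P}_{\text {AnJE}}(y\mid \mathbf{x})= & {} \log \mathrm{P}(y) + \frac{(n-1)!(a-n)!}{(a-1)!}\sum _{\alpha \in {\mathcal {A}\atopwithdelims ()n}}\log \mathrm{P}(x_\alpha \mid y) \nonumber \\&- \log \sum _{c\in \varOmega _Y} \exp \left( \log \mathrm{P}(c) + \frac{(n-1)!(a-n)!}{(a-1)!}\sum _{\alpha ^* \in {\mathcal {A}\atopwithdelims ()n}}\log \mathrm{P}(x_{\alpha ^*}\mid c) \right) . \end{aligned}$$
(12)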
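Since every subset of n features occurs in exactly one of the p partitions, the geometric mean amounts to summing the log-probabilities of all n-subsets and dividing by p. A minimal sketch of that computation, assuming a hypothetical dictionary `log_theta` keyed by (class, feature subset, values) that holds the estimated log-probabilities:

```python
import math
from itertools import combinations


def anje_log_likelihood(x, y, n, log_theta):
    """log P_AnJE(x | y): sum of log P(x_alpha | y) over all n-subsets, divided by p."""
    a = len(x)
    p = math.comb(a - 1, n - 1)  # equals (a-1)! / ((n-1)! (a-n)!), i.e. p in Eq. 10
    total = sum(log_theta[(y, alpha, tuple(x[i] for i in alpha))]
                for alpha in combinations(range(a), n))
    return total / p


def anje_posterior(x, n, log_prior, log_theta):
    """Normalize prior times AnJE likelihood over the classes."""
    scores = {c: log_prior[c] + anje_log_likelihood(x, c, n, log_theta)
              for c in log_prior}
    top = max(scores.values())
    log_z = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    return {c: math.exp(s - log_z) for c, s in scores.items()}
```

No optimization is involved: once the counts that give `log_theta` are collected, classification is a single pass over the feature subsets.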

5.2 \(\text {ALR}^n\)

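It can be seen that AnJE is a simple model that places the weight defined in Eq. 10 on all feature subsets in the ensemble. The main advantage of this weighting scheme is that it requires no optimization, making AnJE learning extremely efficient. All that is required for training is to calculate the counts from the data. However, the disadvantage of AnJE is its inability to perform any form of discriminative learning. Our proposed algorithm, \(\text {ALR}^n\), uses AnJE to precondition \(\text {LR}^{n}\) by placing weights on all probabilities in Eq. 7 and learning these weights by optimizing the conditional likelihood. One can, however, initialize these weights with the weights in Eq. 10 for faster convergence. We will discuss this in “Appendix 2”. One can re-write AnJE models with this parameterization as:
$$\begin{aligned} \mathrm{P}(y\mid \mathbf{x})= & {} \exp \left( w_y \log \mathrm{P}(y) + \sum _{\alpha \in {\mathcal {A}\atopwithdelims ()n}} w_{y,\alpha ,x_\alpha }\log \mathrm{P}(x_\alpha \mid y) \right. \nonumber \\&\left. - \log \sum _{c\in \varOmega _Y} \exp \left( w_c \log \mathrm{P}(c) + \sum _{\alpha ^* \in {\mathcal {A}\atopwithdelims ()n}} w_{c,\alpha ^*,x_{\alpha ^*}} \log \mathrm{P}(x_{\alpha ^*}\mid c) \right) \right) . \end{aligned}$$
(13)
Note that we can compute the likelihood and class-prior probabilities using either MLE or MAP. Therefore, we can write Eq. 13 as:
$$\begin{aligned} \mathrm{P}(y\mid \mathbf{x})= & {} \exp \left( w_y \log \pi _y + \sum _{\alpha \in {\mathcal {A}\atopwithdelims ()n}} w_{y,\alpha ,x_\alpha }\log \theta _{x_\alpha \mid y} \right. \nonumber \\&\left. - \log \sum _{c\in \varOmega _Y} \exp \left( w_c \log \pi _c + \sum _{\alpha ^* \in {\mathcal {A}\atopwithdelims ()n}} w_{c,\alpha ^*,x_{\alpha ^*}} \log \theta _{x_{\alpha ^*}\mid c} \right) \right) . \end{aligned}$$
(14)
Assuming a Dirichlet prior, a MAP estimate of \(\mathrm{P}(y)\) is \(\pi _y\), which equals \(\frac{\#_y + m/|\mathcal {Y}|}{t + m}\), where \(\#_y\) is the number of instances in the dataset with class y, t is the total number of instances, and m is the smoothing parameter. We will set \(m = 1\) in this work. Similarly, a MAP estimate of \(\mathrm{P}(x_\alpha \mid y)\) is \(\theta _{x_\alpha |y}\), which equals \(\frac{\#_{x_\alpha ,y} + m/| x_\alpha |}{\#_y + m}\), where \(\#_{x_\alpha ,y}\) is the number of instances in the dataset with class y and feature values \(x_\alpha \).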
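The MAP estimates can be computed from counts in a single pass over the data. The sketch below is illustrative only: the function name and argument layout are hypothetical, and it assumes that \(|x_\alpha |\) denotes the number of distinct joint values the feature subset \(\alpha \) can take, computed here as the product of the per-feature cardinalities.

```python
from collections import Counter
from itertools import combinations
from math import prod


def map_estimates(data, labels, n, classes, cardinalities, m=1.0):
    """Return functions pi(y) and theta(y, alpha, x_alpha) built from counts."""
    t = len(data)
    class_counts = Counter(labels)
    joint_counts = Counter()
    for x, y in zip(data, labels):
        for alpha in combinations(range(len(x)), n):
            joint_counts[(y, alpha, tuple(x[i] for i in alpha))] += 1

    def pi(y):
        # Smoothed class prior: (#_y + m / |Y|) / (t + m)
        return (class_counts[y] + m / len(classes)) / (t + m)

    def theta(y, alpha, x_alpha):
        # Assumption: |x_alpha| is the number of joint values of subset alpha.
        n_values = prod(cardinalities[i] for i in alpha)
        return (joint_counts[(y, alpha, x_alpha)] + m / n_values) / (class_counts[y] + m)

    return pi, theta
```

With this reading of \(|x_\alpha |\), both estimates are properly normalized: the priors sum to one over the classes and, for each class and subset, the conditional estimates sum to one over the joint values.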

\(\text {ALR}^n\) computes weights by optimizing CLL. Therefore, one can compute the gradient of Eq. 14 with respect to the weights and rely on gradient-descent-based methods to find the optimal value of these weights. Since we do not want to be stuck in local minima, a natural question to ask is whether the resulting objective function is convex (Boyd and Vandenberghe 2008). It turns out that the objective function of \(\text {ALR}^n\) is indeed convex. Roos et al. (2005) proved that an objective function of the form \(\sum _{\mathbf{x}\in \mathcal {D}} \log \mathrm{P}_{\mathcal {B}}(y|\mathbf{x})\), optimized by any conditional Bayesian network model, is convex if and only if the structure \({\mathcal {G}}\) of the Bayesian network \({\mathcal {B}}\) is perfect. A directed graph in which all nodes having a common child are connected is called perfect (Lauritzen 1996). \(\text {ALR}^n\) is a geometric mean of several sub-models, where each sub-model models \(\lfloor \frac{a}{n} \rfloor \) interactions, each conditioned on the class feature. Each sub-model has a structure that is perfect. Taking logarithms turns the geometric mean into a weighted sum of the sub-models' log-probability terms, and a sum of convex functions is convex, so the optimization problem of \(\text {ALR}^n\) is also convex.

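Let us first calculate the gradient of Eq. 14 with respect to the weights associated with \(\pi _y\). We can write:
$$\begin{aligned} \frac{\partial \log \mathrm {P}(y\arrowvert \mathbf {x})}{\partial w_{y}}= & {} \left( \mathbf {1}_{y} - \mathrm{P}(y|\mathbf{x}) \right) \log \pi _y, \end{aligned}$$
(15)
where \(\mathbf {1}_{y}\) denotes an indicator function that is 1 if the derivative is taken with respect to class y and 0 otherwise. Computing the gradient with respect to the weights associated with \(\theta _{x_\alpha |y}\) gives:
$$\begin{aligned} \frac{\partial \log \mathrm {P}(y\arrowvert \mathbf {x})}{\partial w_{y,\alpha ,x_{\alpha }}}= & {} \left( \mathbf {1}_{y} - \mathrm{P}(y|\mathbf{x}) \right) \mathbf {1}_{\alpha } \log \theta _{x_\alpha \mid y}, \end{aligned}$$
(16)
where \(\mathbf {1}_{\alpha }\) and \(\mathbf {1}_{y}\) denote indicator functions that are 1 if the derivative is taken with respect to feature set \(\alpha \) (respectively, class y) and 0 otherwise.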
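A minimal sketch of these two gradients for a single example follows. Differentiating the log of the weighted softmax form shows that each derivative is the difference between the class indicator and the current posterior, multiplied by the log of the probability to which the weight is attached. The names are hypothetical: `active` lists the (feature subset, value) keys present in the example, and `log_pi` and `log_theta` hold the fixed generative estimates.

```python
import math


def alr_log_posterior(y, active, w_class, w_feat, log_pi, log_theta):
    """Return log P(y | x) and the full posterior for one example."""
    scores = {c: w_class[c] * log_pi[c]
                 + sum(w_feat[(c, f)] * log_theta[(c, f)] for f in active)
              for c in log_pi}
    top = max(scores.values())
    log_z = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    posterior = {c: math.exp(s - log_z) for c, s in scores.items()}
    return scores[y] - log_z, posterior


def alr_gradients(y, active, w_class, w_feat, log_pi, log_theta):
    """Gradients of log P(y | x) with respect to the class and feature weights."""
    _, post = alr_log_posterior(y, active, w_class, w_feat, log_pi, log_theta)
    g_class = {c: ((1.0 if c == y else 0.0) - post[c]) * log_pi[c]
               for c in log_pi}
    # Only weights of features active in this example receive a non-zero gradient.
    g_feat = {(c, f): ((1.0 if c == y else 0.0) - post[c]) * log_theta[(c, f)]
              for c in log_pi for f in active}
    return g_class, g_feat
```

A finite-difference comparison against `alr_log_posterior` is a cheap way to validate such gradients before plugging them into an optimizer.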

5.3 Alternative parameterization

Let us reparameterize \(\text {ALR}^n\) such that:
$$\begin{aligned} \beta _y = w_y \log \pi _y, \;\;\;\; \text {and} \;\;\;\; \beta _{y,\alpha ,x_{\alpha }} = w_{y,\alpha ,x_{\alpha }} \log \theta _{x_\alpha \mid y}. \end{aligned}$$
(17)
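Now, we can re-write Eq. 14 as:
$$\begin{aligned} \mathrm{P}(y\mid \mathbf{x}) = \exp \left( \beta _y + \sum _{\alpha \in {\mathcal {A}\atopwithdelims ()n}} \beta _{y,\alpha ,x_\alpha } - \log \sum _{c \in \varOmega _Y} \exp \left( \beta _c + \sum _{\alpha ^* \in {\mathcal {A}\atopwithdelims ()n}} \beta _{c,\alpha ^*,x_{\alpha ^*}} \right) \right) . \end{aligned}$$
(18)
It can be seen that this leads to Eq. 3. We call this parameterization \(\text {LR}^{n}\).
Like \(\text {ALR}^n\), \(\text {LR}^{n}\) also leads to a convex optimization problem, and, therefore, its weights can also be optimized by simple gradient-descent-based algorithms. Let us compute the gradient of the objective function in Eq. 18 with respect to \(\beta _y\). In this case, we can write: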
$$\begin{aligned} \frac{\partial \log \mathrm {P}(y\arrowvert \mathbf {x})}{\partial \beta _{y}}= & {} \left( \mathbf {1}_{y} - \mathrm{P}(y|\mathbf{x}) \right) . \end{aligned}$$
(19)
Similarly, computing the gradient with respect to \(\beta _{y,\alpha ,x_{\alpha }}\), we can write:
$$\begin{aligned} \frac{\partial \log \mathrm {P}(y\arrowvert \mathbf {x})}{\partial \beta _{y,\alpha ,x_{\alpha }}}= & {} \left( \mathbf {1}_{y} - \mathrm{P}(y|\mathbf{x}) \right) \mathbf {1}_{\alpha }. \end{aligned}$$
(20)

5.4 Comparative analysis of \(\text {ALR}^n\) and \(\text {LR}^{n}\)

It can be seen that the two models are actually equivalent and each is a re-parameterization of the other. However, there are subtle distinctions between the two. The most important distinction is the utilization of MAP or MLE probabilities in \(\text {ALR}^n\). Therefore, \(\text {ALR}^n\) is a two step learning algorithm:
  • Step 1 is the optimization of the log-likelihood of the data (\(\log \mathrm{P}(y,\mathbf{x})\)) to obtain the estimates of the prior and likelihood probabilities. One can view this step as generative learning.

  • Step 2 is the introduction of weights on these probabilities and the learning of these weights by maximizing the CLL (\(\log \mathrm {P}(y\arrowvert \mathbf {x})\)) objective function. This step can be interpreted as discriminative learning.

\(\text {ALR}^n\) employs generative-discriminative learning as opposed to only discriminative learning by \(\text {LR}^{n}\).
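The two steps can be sketched in a few lines of Python. This is only an illustration under several assumptions that are not taken from the paper: it uses \(n=1\) (single features rather than feature sets), add-m smoothing for the generative estimates, plain gradient ascent instead of L-BFGS, and made-up toy data. All function and variable names are hypothetical.

```python
import math
from collections import Counter

def fit_generative(data, n_classes, n_values, m=1.0):
    """Step 1 (generative): smoothed log-probabilities obtained by counting."""
    n_feats = len(data[0][0])
    class_n = Counter(y for _, y in data)
    joint_n = Counter((i, v, y) for x, y in data for i, v in enumerate(x))
    log_pi = [math.log((class_n[y] + m) / (len(data) + m * n_classes))
              for y in range(n_classes)]
    log_theta = {
        (i, v, y): math.log((joint_n[(i, v, y)] + m) / (class_n[y] + m * n_values))
        for i in range(n_feats) for v in range(n_values) for y in range(n_classes)
    }
    return log_pi, log_theta

def posterior(x, log_pi, log_theta, w_pi, w_theta):
    """P(y | x) from weighted log-probabilities, normalised with a stable softmax."""
    scores = [
        w_pi[y] * log_pi[y]
        + sum(w_theta[(i, v, y)] * log_theta[(i, v, y)] for i, v in enumerate(x))
        for y in range(len(log_pi))
    ]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cll(data, log_pi, log_theta, w_pi, w_theta):
    """Conditional log-likelihood of the labels given the features."""
    return sum(math.log(posterior(x, log_pi, log_theta, w_pi, w_theta)[y])
               for x, y in data)

def fit_discriminative(data, log_pi, log_theta, steps=300, rate=0.01):
    """Step 2 (discriminative): gradient ascent on the CLL, starting from weights of 1."""
    w_pi = [1.0] * len(log_pi)
    w_theta = {key: 1.0 for key in log_theta}
    for _ in range(steps):
        g_pi = [0.0] * len(log_pi)
        g_theta = {key: 0.0 for key in log_theta}
        for x, y in data:
            p = posterior(x, log_pi, log_theta, w_pi, w_theta)
            for c in range(len(log_pi)):
                err = (1.0 if c == y else 0.0) - p[c]   # (1_y - P(y|x))
                g_pi[c] += err * log_pi[c]
                for i, v in enumerate(x):
                    g_theta[(i, v, c)] += err * log_theta[(i, v, c)]
        w_pi = [w + rate * g for w, g in zip(w_pi, g_pi)]
        w_theta = {key: w_theta[key] + rate * g_theta[key] for key in w_theta}
    return w_pi, w_theta

# Toy data: two binary features, two classes (values are made up).
DATA = [((0, 0), 0), ((0, 1), 0), ((0, 0), 0), ((1, 0), 0),
        ((1, 1), 1), ((1, 0), 1), ((1, 1), 1), ((0, 1), 1)]
log_pi, log_theta = fit_generative(DATA, n_classes=2, n_values=2)
ones_pi, ones_theta = [1.0, 1.0], {key: 1.0 for key in log_theta}
cll_generative = cll(DATA, log_pi, log_theta, ones_pi, ones_theta)
w_pi, w_theta = fit_discriminative(DATA, log_pi, log_theta)
cll_discriminative = cll(DATA, log_pi, log_theta, w_pi, w_theta)
```

With all weights fixed at 1 the model is the purely generative one; the discriminative step starts from that point and can only improve the CLL for a sufficiently small step size.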

One can expect a similar bias-variance profile and a very similar classification performance, as both models will converge to a similar point in the optimization space, the only difference in the final parameterization being due to recursive descent being terminated before absolute convergence. However, the rate of convergence of the two models can be very different. Zaidi et al. (2014) show that, for NB, such an \(\text {ALR}^n\)-style parameterization with generative-discriminative learning can greatly speed up convergence relative to only discriminative training. Note that discriminative training with NB as the graphical model is vanilla LR. We expect to see the same trend in the convergence performance of \(\text {ALR}^n\) and \(\text {LR}^{n}\).

Another distinction between the two models becomes explicit if a regularization penalty is added to the objective function. One can see that, in the case of \(\text {ALR}^n\), regularizing the weights towards 1 will effectively pull the parameters back towards the generative training estimates. For smaller datasets, one can expect to obtain better performance by using a large regularization parameter and pulling the weights back towards 1. However, one cannot do this for \(\text {LR}^{n}\). Therefore, \(\text {ALR}^n\) models can very elegantly combine generative and discriminative parameters.

An analysis of the gradient of \(\text {ALR}^n\) in Eqs. 15 and 16 and that of \(\text {LR}^{n}\) in Eqs. 19 and 20 also reveals an interesting comparison. We can write \(\text {ALR}^n\)’s gradients in terms of \(\text {LR}^{n}\)’s gradient as follows:
$$\begin{aligned} \frac{\partial \log \mathrm {P}(y\arrowvert \mathbf {x})}{\partial w_{y}}= & {} \frac{\partial \log \mathrm {P}(y\arrowvert \mathbf {x})}{\partial \beta _{y}} \log \pi _y, \nonumber \\ \frac{\partial \log \mathrm {P}(y\arrowvert \mathbf {x})}{\partial w_{y,\alpha ,x_{\alpha }}}= & {} \frac{\partial \log \mathrm {P}(y\arrowvert \mathbf {x})}{\partial \beta _{y,\alpha ,x_{\alpha }}} \log \theta _{x_\alpha |y}. \end{aligned}$$
(21)
It can be seen that \(\text {ALR}^n\) has the effect of re-scaling \(\text {LR}^{n}\)’s gradient by the log of the conditional probabilities. We conjecture that such re-scaling has the effect of pre-conditioning the parameter space and, therefore, will lead to faster convergence.
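The re-scaling in Eq. 21 can be checked numerically. The sketch below is illustrative only: the probabilities are made-up values for a two-class problem with a single feature set, and the helper names are hypothetical. It compares the gradient predicted by Eq. 21 (the \(\text {LR}^{n}\) gradient times the log-probability) against a finite-difference estimate of the \(\text {ALR}^n\) gradient.

```python
import math

def softmax(scores):
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up generative estimates: log pi_y and log theta_{x_alpha | y} for the observed x_alpha.
LOG_PI = [math.log(0.6), math.log(0.4)]
LOG_THETA = [math.log(0.7), math.log(0.2)]

def class_scores(w_pi, w_theta):
    """beta_y + beta_{y,alpha,x_alpha} with beta = w * log(probability), as in Eq. 17."""
    return [w_pi[c] * LOG_PI[c] + w_theta[c] * LOG_THETA[c] for c in range(2)]

def log_posterior(w_pi, w_theta, label):
    return math.log(softmax(class_scores(w_pi, w_theta))[label])

def lr_gradient(w_pi, w_theta, label, c):
    """Eq. 19: (1_y - P(y|x)) for class c."""
    return (1.0 if c == label else 0.0) - softmax(class_scores(w_pi, w_theta))[c]

def alr_gradient_numeric(w_pi, w_theta, label, c, eps=1e-6):
    """Central finite difference of log P(label | x) with respect to w_pi[c]."""
    up, down = list(w_pi), list(w_pi)
    up[c] += eps
    down[c] -= eps
    return (log_posterior(up, w_theta, label) - log_posterior(down, w_theta, label)) / (2 * eps)

w_pi, w_theta, label = [1.0, 1.0], [1.0, 1.0], 0
gaps = []
for c in range(2):
    analytic = lr_gradient(w_pi, w_theta, label, c) * LOG_PI[c]   # Eq. 21
    numeric = alr_gradient_numeric(w_pi, w_theta, label, c)
    gaps.append(abs(analytic - numeric))
```

Since the log-probabilities are non-positive and typically differ in magnitude across parameters, each coordinate of the gradient is scaled differently, which is consistent with the pre-conditioning interpretation conjectured above.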

6 Related work

Averaged n-Dependence Estimators (AnDE) is the inspiration for AnJE. An AnDE model is the arithmetic mean of all Bayesian Network Classifiers in each of which all features depend on the class and the same n features. A simple depiction of A1DE in graphical form is shown in Fig. 5.
Fig. 5

Sub-models in an AnDE model with \(n=2\) and with four features

There are \({a \atopwithdelims ()n}\) possible combinations of features that can be used as parents, producing \({a\atopwithdelims ()n}\) sub-models which are combined by averaging.
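As a small illustration of this count, the following sketch enumerates the candidate parent sets for the setting of the Fig. 5 caption (four features, \(n=2\)). The feature names are placeholders.

```python
from itertools import combinations
from math import comb

def parent_sets(features, n):
    """All size-n feature subsets that can serve as parents in a sub-model."""
    return list(combinations(features, n))

features = ["x1", "x2", "x3", "x4"]   # placeholder feature names
subsets = parent_sets(features, 2)
n_sub_models = len(subsets)           # equals comb(4, 2)
```

The number of sub-models grows combinatorially with a and n, which is one reason why only small values of n are practical.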

AnDE and AnJE both use simple generative learning, merely counting the relevant sufficient statistics from the data. Both have only one tuning parameter, n, which controls the bias-variance trade-off. Higher values of n lead to low bias and high variance and vice-versa.

It is important not to confuse the equivalence (in terms of the level of interactions they model) of AnJE and AnDE models. That is, the following holds:
$$\begin{aligned} f(\text {A2JE})= & {} f(\text {A1DE}), \\ f(\text {A3JE})= & {} f(\text {A2DE}), \\ \vdots \;\;\;\;= & {} \;\;\;\; \vdots \\ f(\text {AnJE})= & {} f(\mathrm{A}(\mathrm{n}-1)\mathrm{DE}), \end{aligned}$$
where f(.) is a function that returns the number of interactions that the algorithm models. Also, \(\text {A1JE} = \text {A0DE} = \text {naive Bayes}\). Thus, an AnJE model uses the same core statistics as an A\((\mathrm{n}-1)\)DE model. At training time, AnJE and A\((\mathrm{n}-1)\)DE must learn the same information from the data. However, at classification time, each of these statistics is accessed once by AnJE and n times by A\((\mathrm{n}-1)\)DE, making AnJE more efficient. However, as we will show, it turns out that AnJE’s use of the geometric mean results in a more biased estimator than the arithmetic mean used by AnDE. As a result, in practice, an AnJE model is less accurate than the equivalent AnDE model. However, due to the use of the arithmetic mean by AnDE, its weighted version would be much more difficult to optimize than AnJE, as when transformed to log space it does not admit a simple linear model.
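The last point can be made concrete with a short sketch. Under the assumption that each sub-model contributes a single probability estimate (the numbers below are made up), the log of a weighted geometric mean is a plain weighted sum of log-probabilities, whereas the log of a weighted arithmetic mean needs a log-sum-exp and is therefore not linear in the weights. The function names are hypothetical.

```python
import math

def log_geometric_mean(log_probs, weights):
    """log of prod_i p_i^{w_i}: a weighted sum, i.e. linear in the weights."""
    return sum(w * lp for w, lp in zip(weights, log_probs))

def log_arithmetic_mean(log_probs, weights):
    """log of sum_i w_i * p_i: requires log-sum-exp, so it is not linear in the weights."""
    top = max(log_probs)
    return top + math.log(sum(w * math.exp(lp - top) for w, lp in zip(weights, log_probs)))

log_probs = [math.log(0.2), math.log(0.5), math.log(0.1)]   # made-up sub-model estimates
weights = [0.5, 0.25, 0.25]
doubled = [2 * w for w in weights]

geo = log_geometric_mean(log_probs, weights)
geo_doubled = log_geometric_mean(log_probs, doubled)        # exactly 2 * geo
arith = log_arithmetic_mean(log_probs, weights)
arith_doubled = log_arithmetic_mean(log_probs, doubled)     # arith + log(2), not 2 * arith
```

Doubling the weights doubles the geometric-mean score but only shifts the arithmetic-mean score by \(\log 2\), which illustrates why the weighted geometric form maps onto a linear model in log space and the arithmetic form does not.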

\(\text {ALR}^n\) has a number of similarities with ELR (Greiner and Zhou 2002; Greiner et al. 2005) for which the parameters associated with a Bayesian network classifier (naive Bayes or TAN) are learned by optimizing the CLL. ELR performs discriminative learning of the weights for a model with a Bayesian network structure. As explained in Sect. 5, it is not possible to create a single Bayesian network with the structure of the ALR model. Further, ELR does not utilize the generative parameters to precondition the search for discriminative parameters as does ALR. Some related ideas to ELR are also explored in Pernkopf and Bilmes (2005), Pernkopf and Wohlmayr (2009) and Su et al. (2008).

Feature construction has been studied extensively (Liu and Motoda 1998). The goal is to improve the classifier’s accuracy by creating new attributes from existing attributes. The new attributes can be either binary or arithmetic or other combinations of existing attributes. One approach that is closely related to the current work is the formation of Cartesian products of categorical features through hill-climbing search (Pazzani 1996). Our work differs in using all Cartesian products of a given order and using discriminative learning of weights to determine each combination’s relative (weighted) contribution to the model.

7 Experiments

In this section, we compare and analyze the performance of our proposed algorithms and related methods on 76 natural domains from the UCI repository of machine learning datasets (Frank and Asuncion 2010).

The experiments are conducted on the datasets described in Table 1. 40 datasets have fewer than 1,000 instances, 20 datasets have between 1,000 and 10,000 instances and 16 datasets have more than 10,000 instances. There are 8 datasets with over 100,000 instances. These datasets are shown in bold font in Table 1.

Each algorithm is tested on each dataset using 5 rounds of 2-fold cross validation (see footnote 3).
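A minimal sketch of how such splits can be generated is given below. It assumes simple random (unstratified) halves and a fixed seed, neither of which is specified in the text; the function name is hypothetical.

```python
import random

def repeated_two_fold(n_instances, rounds=5, seed=0):
    """Yield (train_indices, test_indices) for `rounds` rounds of 2-fold cross validation."""
    rng = random.Random(seed)
    for _ in range(rounds):
        indices = list(range(n_instances))
        rng.shuffle(indices)
        half = n_instances // 2
        first, second = indices[:half], indices[half:]
        yield first, second   # train on the first half, test on the second
        yield second, first   # then swap the roles of the two halves

splits = list(repeated_two_fold(n_instances=10))
```

Each round contributes two train/test pairs, so 5 rounds give 10 models per algorithm and dataset.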

We compare four different metrics, i.e., 0–1 Loss, RMSE, Bias and Variance (see footnote 4). There are a number of different bias-variance decomposition definitions. In this research, we use the bias and variance definitions of Kohavi and Wolpert (1996) together with the repeated cross-validation bias-variance estimation method proposed by Webb (2000). Kohavi and Wolpert (1996) define bias and variance as follows:
$$\begin{aligned} \text {bias}^2=\frac{1}{2}\sum _{y \in \mathcal {Y}}\left( \mathrm{P}(y|\mathbf{x}) - \hat{\mathrm {P}}(y\arrowvert \mathbf {x})\right) ^2, \end{aligned}$$
and
$$\begin{aligned} \text {variance}=\frac{1}{2}\left( 1-\sum _{y \in \mathcal {Y}}\hat{\mathrm {P}}(y\arrowvert \mathbf {x})^2\right) . \end{aligned}$$
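For a single test instance, the two definitions can be evaluated as in the sketch below. It assumes that both distributions are given as lists over the same ordering of classes, with \(\hat{\mathrm {P}}(y\arrowvert \mathbf {x})\) taken to be the fraction of trained models that predict class y; the function name is hypothetical.

```python
def kw_bias_variance(true_dist, predicted_dist):
    """Kohavi-Wolpert squared bias and variance for one instance.

    true_dist[y]      -- P(y | x)
    predicted_dist[y] -- fraction of trained models predicting class y for x
    """
    bias_sq = 0.5 * sum((t - p) ** 2 for t, p in zip(true_dist, predicted_dist))
    variance = 0.5 * (1.0 - sum(p * p for p in predicted_dist))
    return bias_sq, variance

# A learner that predicts each of two classes half of the time on a
# deterministic instance has squared bias 0.25 and variance 0.25.
example = kw_bias_variance([1.0, 0.0], [0.5, 0.5])
```

Averaging these per-instance values over the test set gives the dataset-level bias and variance used in the comparisons.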
We report Win–Draw–Loss (W–D–L) results when comparing the 0–1 Loss, RMSE, bias and variance of two models. A two-tail binomial sign test is used to determine the significance of the results. Results are considered significant if \(p \le 0.05\) and shown in bold.
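The p value of such a sign test can be computed exactly from the win and loss counts. The sketch below assumes that draws are discarded and that the two-tail p value is capped at 1; under these assumptions a 7/0/1 record gives approximately 0.070. The function name is hypothetical.

```python
from math import comb

def sign_test_p(wins, losses):
    """Two-tail binomial sign test p value under H0: P(win) = 0.5 (draws excluded)."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

p_7_0_1 = sign_test_p(7, 1)   # 18/256
p_8_0_0 = sign_test_p(8, 0)   # 2/256
p_4_0_4 = sign_test_p(4, 4)   # capped at 1.0
```

With only 8 datasets, even a 7/0/1 record does not reach the 0.05 threshold, which is why the Big-dataset comparisons need to be interpreted with some care.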

The datasets in Table 1 are divided into two categories. We call the following datasets Big: KDDCup, Poker-hand, USCensus1990, Covertype, MITFaceSetB, MITFaceSetA, Census-income, Localization. All remaining datasets are denoted as Little in the results.

Due to their size, experiments for most of the Big datasets had to be performed in a heterogeneous environment (grid computing) for which CPU wall-clock times are not commensurable. In consequence, when comparing classification and training time, the following 12 datasets constitute the Big category: Localization, Census-income, Poker-hand, Covtype, Connect-4, Shuttle, Adult, Letter-recog, Magic, Nursery, Sign, Pendigits. When comparing average results across Little and Big datasets, we normalize the results with respect to \(\text {ALR}^2\) and present a geometric mean.
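The normalization can be sketched as follows, assuming that each algorithm has one strictly positive result per dataset and that the baseline list holds the corresponding \(\text {ALR}^2\) results in the same order. The numbers and the function name are illustrative only.

```python
import math

def normalized_geometric_mean(results, baseline):
    """Geometric mean of per-dataset ratios result / baseline (all values must be > 0)."""
    ratios = [r / b for r, b in zip(results, baseline)]
    return math.exp(sum(math.log(q) for q in ratios) / len(ratios))

baseline = [0.10, 0.10]      # made-up per-dataset errors of the reference model
other = [0.20, 0.05]         # made-up errors of another model on the same datasets
score = normalized_geometric_mean(other, baseline)   # ratios 2 and 0.5 average to 1
```

A value below 1 indicates that the algorithm is, on average, better than the reference model; the geometric mean treats a halving and a doubling of the error symmetrically.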

Numeric features are discretized by using the Minimum Description Length (MDL) supervised discretization method (Fayyad and Irani 1992). Training data is discretized at training time. The cut-points learned during the discretization procedure are used to discretize the testing data. However, for kddcup, MITFaceSetA, MITFaceSetB and MITFaceSetC, for computational efficiency, the entire dataset is discretized before the training starts. That is, the cut-points are learned over both training and test data. The bias introduced by including test data in the discretization process is not an issue here because it is uniform across all compared classifiers (i.e., AnDE and Random Forest).

A missing value is treated as a separate feature value and taken into account exactly like other values.
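Applying previously learned cut-points to new data, with missing values mapped to their own category, can be sketched as below. The cut-points are made up, learning them with the MDL method is not shown, and the handling of values that fall exactly on a cut-point is an assumption.

```python
from bisect import bisect_right

MISSING = "missing"   # hypothetical label for the extra category

def discretize(value, cut_points):
    """Map a numeric value to a bin index using sorted cut-points learned on training data."""
    if value is None:
        return MISSING                      # a missing value is its own feature value
    return bisect_right(cut_points, value)  # values equal to a cut-point go to the upper bin

cut_points = [1.0, 2.5]                     # made-up cut-points
bins = [discretize(v, cut_points) for v in (0.5, 1.0, 3.0, None)]
```

Because the same cut-points are reused at test time, a test value never changes the discretization that the model was trained with.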

We employed the L-BFGS quasi-Newton method for solving the optimization (see footnote 5). Note that, although we have used L-BFGS to demonstrate the efficacy of \(\text {ALR}^n\), the results generalize well to other optimization routines including Gradient Descent, Conjugate Gradient and Stochastic Gradient Descent (SGD). In “Appendix 1”, we also present results with Conjugate Gradient optimization.

Random Forest (RF) (Breiman 2001) is considered to be a state-of-the-art classification scheme. It consists of multiple decision trees, each of which is trained on data selected at random with replacement from the original data (bagging). For example, if there are N data points, N data points are selected at random with replacement. If there are a attributes, a number m is specified, such that \(m<a\). At each node of the decision tree, m attributes are randomly selected out of a and are evaluated, the best being used to split the node. Note that we used \(m = \log _2(a) + 1\), where a is the total number of features. Each tree is grown to its largest possible size and no pruning is done. An instance is classified by passing it through each decision tree and selecting the mode of the output of the decision trees. We used 100 decision trees in this work.
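The three ingredients described above (bootstrap sampling, random feature subsets of size m, and voting) can be sketched as follows. Tree induction itself is omitted, rounding \(\log _2(a)\) down is an assumption, and the helper names are hypothetical.

```python
import math
import random
from collections import Counter

def features_per_split(a):
    """m = log2(a) + 1, with log2(a) rounded down (the rounding is an assumption)."""
    return int(math.log2(a)) + 1

def bootstrap_indices(n, rng):
    """Draw n instance indices at random with replacement (bagging)."""
    return [rng.randrange(n) for _ in range(n)]

def candidate_features(a, rng):
    """Randomly choose the m feature indices evaluated at one tree node."""
    return rng.sample(range(a), features_per_split(a))

def majority_vote(tree_predictions):
    """Final prediction: the mode of the individual tree outputs."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(0)
sample = bootstrap_indices(20, rng)
node_features = candidate_features(16, rng)
label = majority_vote(["a", "b", "a"])
```

For 16 features this gives 5 candidate features per node, so each split only looks at a small random part of the feature space.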

The internal discretization mechanism of Random Forest is used for all but the kddcup, MITFaceSetA, MITFaceSetB and MITFaceSetC datasets, where the entire data is first discretized, as described before.

7.1 \(\text {ALR}^n\) versus AnJE

A W–D–L comparison of the 0–1 Loss, RMSE, bias and variance of \(\text {ALR}^n\) and AnJE on Little datasets is shown in Table 2. We compare \(\text {ALR}^2\) with A2JE and \(\text {ALR}^3\) with A3JE only. It can be seen that \(\text {ALR}^n\) has significantly lower bias but significantly higher variance. The 0–1 Loss and RMSE results are not in favour of any algorithm. However, on Big datasets, \(\text {ALR}^n\) wins on 7 out of 8 datasets in terms of both RMSE and 0–1 Loss. The results are not significant since the p value of 0.070 is greater than our set threshold of 0.05. However, the evidence is consistent with the proposition that \(\text {ALR}^n\) successfully reduces the bias of AnJE, at the expense of increasing its variance.
Table 2

Win–Draw–Loss: \(\text {ALR}^2\) versus A2JE and \(\text {ALR}^3\) versus A3JE

| Datasets | Metric | \(\text {ALR}^2\) versus A2JE: W–D–L | \(\text {ALR}^2\) versus A2JE: p | \(\text {ALR}^3\) versus A3JE: W–D–L | \(\text {ALR}^3\) versus A3JE: p |
| --- | --- | --- | --- | --- | --- |
| All | Bias | 62/3/11 | \(<{} \mathbf{0.001}\) | 55/9/12 | \(<{} \mathbf{0.001}\) |
| All | Variance | 21/4/51 | \(<{} \mathbf{0.001}\) | 25/2/49 | 0.007 |
| Little | 0–1 Loss | 47/4/25 | 0.012 | 39/2/35 | 0.727 |
| Little | RMSE | 39/0/37 | 0.908 | 32/0/44 | 0.206 |
| Big | 0–1 Loss | 8/0/0 | 0.007 | 7/0/1 | 0.070 |
| Big | RMSE | 8/0/0 | 0.007 | 7/0/1 | 0.070 |

p is two-tail binomial sign test. Results are significant if \(p \le 0.05\)

Fig. 6

Geometric mean of 0–1 Loss (left), RMSE (right) performance of \(\text {ALR}^2\), A2JE, \(\text {ALR}^3\) and A3JE for Little and Big datasets

Normalized 0–1 Loss and RMSE results for both models are shown in Fig. 6. It can be seen that \(\text {ALR}^n\) has a lower averaged 0–1 Loss and RMSE than AnJE. This difference is substantial when comparing on Big datasets. The training and classification time of AnJE is, however, substantially lower than \(\text {ALR}^n\) as can be seen from Fig. 7. This is to be expected as \(\text {ALR}^n\) adds discriminative training to AnJE and uses twice the number of parameters at classification time.
Fig. 7

Geometric mean of training time (left), Classification Time (right) of \(\text {ALR}^2\), A2JE, \(\text {ALR}^3\) and A3JE for All and Big datasets

7.2 \(\text {ALR}^n\) versus AnDE

A W–D–L comparison for 0–1 Loss, RMSE, bias and variance results of the two \(\text {ALR}^n\) models relative to the corresponding AnDE models is presented in Table 3. We compare \(\text {ALR}^2\) with A1DE and \(\text {ALR}^3\) with A2DE only. It can be seen that \(\text {ALR}^n\) has significantly lower bias and significantly higher variance than AnDE models. Recently, AnDE models have been proposed as fast and effective Bayesian classifiers when learning from large quantities of data (Zaidi and Webb 2012). These bias-variance results make \(\text {ALR}^n\) a suitable alternative to AnDE when dealing with big data. The 0–1 Loss results are similar, but AnDE has better RMSE results than \(\text {ALR}^n\) on Little datasets. On Big datasets, it can be seen that \(\text {ALR}^n\) wins on the majority of datasets.
Table 3

Win–Draw–Loss: \(\text {ALR}^2\) versus A1DE and \(\text {ALR}^3\) versus A2DE

| Datasets | Metric | \(\text {ALR}^2\) versus A1DE: W–D–L | \(\text {ALR}^2\) versus A1DE: p | \(\text {ALR}^3\) versus A2DE: W–D–L | \(\text {ALR}^3\) versus A2DE: p |
| --- | --- | --- | --- | --- | --- |
| All | Bias | 60/5/11 | \(<{} \mathbf{0.001}\) | 47/11/18 | \(<{} \mathbf{0.001}\) |
| All | Variance | 22/9/45 | 0.006 | 26/4/46 | 0.024 |
| Little | 0–1 Loss | 43/3/30 | 0.159 | 33/4/39 | 0.556 |
| Little | RMSE | 30/0/46 | 0.084 | 24/0/52 | 0.035 |
| Big | 0–1 Loss | 8/0/0 | 0.007 | 7/0/1 | 0.070 |
| Big | RMSE | 8/0/0 | 0.073 | 7/0/1 | 0.070 |

p is two-tail binomial sign test. Results are significant if \(p \le 0.05\)

Normalized 0–1 Loss and RMSE are shown in Fig. 8. It can be seen that the \(\text {ALR}^n\) models have lower 0–1 Loss and RMSE than the corresponding AnDE models.
Fig. 8

Geometric mean of 0–1 Loss (left) and RMSE (right) performance of \(\text {ALR}^2\), A1DE, \(\text {ALR}^3\) and A2DE for Little and Big datasets

Fig. 9

Geometric mean of training time (left), Classification Time (right) of \(\text {ALR}^2\), A1DE, \(\text {ALR}^3\) and A2DE for All and Big datasets

A comparison of the training time of \(\text {ALR}^n\) and AnDE is given in Fig. 9. As expected, due to its additional discriminative learning, \(\text {ALR}^n\) requires substantially more training time than AnDE. However, AnDE does not share such a consistent advantage with respect to classification time, the relativities depending on the dimensionality of the data. For high-dimensional data the large number of permutations of features that AnDE must consider at classification time results in greater computation.
Fig. 10

Comparative scatter of 0–1 Loss of \(\text {ALR}^2\) and \(\text {LR}^{2}\) (left) and \(\text {ALR}^3\) and \(\text {LR}^{3}\) (right) for Little datasets

Fig. 11

Comparative scatter of 0–1 Loss of \(\text {ALR}^2\) and \(\text {LR}^{2}\) (left) and \(\text {ALR}^3\) and \(\text {LR}^{3}\) (right) for Big datasets

7.3 \(\text {ALR}^n\) versus \(\text {LR}^{n}\)

In this section, we compare the two \(\text {ALR}^n\) models with their equivalent \(\text {LR}^{n}\) models. As discussed before, we expect to see similar bias-variance profiles and similar classification performance as the two models are re-parameterizations of each other.
Fig. 12

Comparative scatter of RMSE of \(\text {ALR}^2\) and \(\text {LR}^{2}\) (left) and \(\text {ALR}^3\) and \(\text {LR}^{3}\) (right) for Little datasets

Fig. 13

Comparative scatter of RMSE of \(\text {ALR}^2\) and \(\text {LR}^{2}\) (left) and \(\text {ALR}^3\) and \(\text {LR}^{3}\) (right) for Big datasets

We compare the two parameterizations in terms of the scatter of their 0–1 Loss and RMSE values on Little datasets in Figs. 10 and 12 respectively, and on Big datasets in Figs. 11 and 13 respectively. It can be seen that the two parameterizations (with the exception of a few datasets; see footnote 6) have a similar spread of 0–1 Loss and RMSE values for both \(n=2\) and \(n=3\). We attribute the difference in the performance of the two parameterizations in terms of 0–1 Loss to the numerical instability of the solver. The L-BFGS library we are using is written in Java and internally calls C++ routines, which eventually call a Fortran library. There are some non-significant differences between \(\text {LR}^{n}\) and \(\text {ALR}^n\) only on the Phoneme, Lung-cancer and Promoters datasets. The models trained on these datasets are all extremely sparse. Lung-cancer, for example, has only 32 instances defined over 57 attributes and 3 classes. \(\text {LR}^{2}\) and \(\text {ALR}^2\) in this case optimize 75,246 parameters and \(\text {LR}^{3}\) and \(\text {ALR}^3\) optimize 5,465,451 parameters. We conjecture that the difference in the performance (0–1 Loss) is due to over-flowing of the estimated parameters. It appears that on these datasets, data is linearly separable in spaces spanned by \(\text {LR}^{2}\), \(\text {ALR}^2\), \(\text {LR}^{3}\) and \(\text {ALR}^3\); this leads to parameters becoming too large. For these datasets, ideally, one should regularize the two parameterizations differently (tuning \(\lambda \) on some validation set) to make sure that the parameter estimates do not get too low or too high.

The comparative scatter of the number of iterations each parameterization takes to converge is shown in Figs. 14 and 15 for Little and Big datasets respectively. It can be seen that \(\text {ALR}^n\) requires far fewer iterations than \(\text {LR}^{n}\). It should be noted that the scatter plots are on the log-scale, so a ratio such as \(\frac{10^{2.5}}{10^3}\) between \(\text {ALR}^n\) and \(\text {LR}^{n}\) corresponds to roughly three times fewer iterations for \(\text {ALR}^n\) than for \(\text {LR}^{n}\).
Fig. 14

Comparative scatter of number of iterations of \(\text {ALR}^2\) and \(\text {LR}^{2}\) (left) and \(\text {ALR}^3\) and \(\text {LR}^{3}\) (right) for Little datasets

Fig. 15

Comparative scatter of iterations of \(\text {ALR}^2\) and \(\text {LR}^{2}\) (left) and \(\text {ALR}^3\) and \(\text {LR}^{3}\) (right) for Big datasets

The number of iterations to converge plays a major part in determining an algorithm’s training time. The training time of the two parameterizations is shown in Figs. 16 and 17 for Little and Big datasets, respectively. It can be seen that \(\text {ALR}^n\) models are much faster than the equivalent \(\text {LR}^{n}\) models. Again, note that the scatter plots are on the log-scale. A simple ratio of \(\frac{10^{5.6}}{10^{5.8}}\) between \(\text {ALR}^n\) and \(\text {LR}^{n}\) is difficult to distinguish as a point over the diagonal line in favour of \(\text {ALR}^n\), but actually represents a speed-up of around 1.6 times.
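The arithmetic behind reading ratios off a log-scale plot is sketched below, using the two exponent pairs quoted in the text; the function name is hypothetical.

```python
def ratio_from_log10(log10_slow, log10_fast):
    """Ratio between two quantities read off a log10-scaled axis."""
    return 10 ** (log10_slow - log10_fast)

iteration_ratio = ratio_from_log10(3.0, 2.5)   # 10^0.5, a little over 3
time_ratio = ratio_from_log10(5.8, 5.6)        # 10^0.2, a little under 1.6
```

A visually small offset from the diagonal on these plots can therefore correspond to a substantial multiplicative difference.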
Fig. 16

Comparative scatter of training time of \(\text {ALR}^2\) and \(\text {LR}^{2}\) (left) and \(\text {ALR}^3\) and \(\text {LR}^{3}\) (right) for Little datasets

Fig. 17

Comparative scatter of training time of \(\text {ALR}^2\) and \(\text {LR}^{2}\) (left) and \(\text {ALR}^3\) and \(\text {LR}^{3}\) (right) for Big datasets

A comparison of the rate of convergence of Negative-Log-Likelihood (NLL) of \(\text {ALR}^2\) and \(\text {LR}^{2}\) parameterizations on some sample datasets is shown in Fig. 18. It can be seen that \(\text {ALR}^2\) has a steeper curve, asymptoting to its global minimum much more quickly. On almost all datasets, one can see that \(\text {ALR}^2\) follows a steeper, hence more desirable, path toward convergence. This is extremely advantageous when learning from very few iterations (for example, when learning using Stochastic Gradient Descent based optimization) and, therefore, is a desirable property for scalable learning.
Fig. 18

Comparison of rate of convergence of \(\text {ALR}^2\) and \(\text {LR}^{2}\) on several datasets. The X-axis (no. of iterations) is on log scale. Vertical lines show the point at which the optimization is deemed to have converged

A similar trend can be seen in Fig. 19 for \(\text {ALR}^3\) and \(\text {LR}^{3}\).
Fig. 19

Comparison of rate of convergence of \(\text {ALR}^3\) and \(\text {LR}^{3}\) on several datasets. The X-axis (no. of iterations) is on log scale. Vertical lines show the end of iterations of each curve

Fig. 20

Compared rates of convergence of \(\text {ALR}^n\) versus \(\text {LR}^{n}\), for \(n=1,2,3\) on the sample Localization dataset. Y-axis is the negative log-likelihood

Finally, let us present some comparison results about the speed of convergence of \(\text {ALR}^n\) versus \(\text {LR}^{n}\) as we increase n. In Fig. 20, we compare the convergence for \(n=1\), \(n=2\) and \(n=3\) on the sample Localization dataset. It can be seen that the improvement that \(\text {ALR}^n\) provides over \(\text {LR}^{n}\) gets better as n becomes larger. Similar behaviour was observed for many datasets and, although studying rates of convergence is a complicated matter and is outside the scope of this work, we anticipate this phenomenon to be an interesting area for future research.

7.4 \(\text {ALR}^n\) versus Random Forest

The two \(\text {ALR}^n\) models are compared in terms of W–D–L of 0–1 Loss, RMSE, bias and variance with Random Forest in Table 4. It can be seen that \(\text {ALR}^n\) has slightly lower bias than RF. The variance of \(\text {ALR}^3\) is significantly higher than that of RF, whereas the variance does not differ significantly between \(\text {ALR}^2\) and RF. On Little datasets, 0–1 Loss results of \(\text {ALR}^n\) and RF are similar. However, RF has significantly better RMSE results than \(\text {ALR}^n\) on these datasets. On Big datasets, \(\text {ALR}^n\) has lower 0–1 Loss and RMSE on the majority of datasets.

The averaged 0–1 Loss and RMSE results are given in Fig. 21. It can be seen that \(\text {ALR}^2\), \(\text {ALR}^3\) and RF have similar 0–1 Loss and RMSE across Little datasets. However, on Big datasets, the lower bias of \(\text {ALR}^n\) results in much lower error than RF in terms of both 0–1 Loss and RMSE. These averaged results also corroborate the W–D–L results in Table 4, showing \(\text {ALR}^n\) to be a less biased model than RF.

The comparison of training and classification time of \(\text {ALR}^n\) and RF is given in Fig. 22. It can be seen that \(\text {ALR}^n\) requires more learning time than RF but less classification time.

8 Conclusion and future work

In this paper, we studied higher-order Logistic Regression (\(\text {LR}^{n}\)) and showed that it is a low-bias classifier whose accuracy is highly competitive with state-of-the-art classifiers on large data. We also proposed an accelerated version of higher-order Logistic Regression (\(\text {ALR}^n\)), which is based on both generatively and discriminatively learned parameters. To obtain the generative parameterization, we first developed AnJE, a generative counter-part of higher-order logistic regression. We showed that \(\text {ALR}^n\) and \(\text {LR}^{n}\) learn equivalent models, but that \(\text {ALR}^n\) is able to exploit the information gained generatively to effectively precondition the optimization process. \(\text {ALR}^n\) converges in fewer iterations, reaching its global minimum much more rapidly and resulting in faster training time. We also compared \(\text {ALR}^n\) with the equivalent AnJE and AnDE models and showed that \(\text {ALR}^n\) has lower bias than both AnJE and AnDE models. We compared \(\text {ALR}^n\) with the state-of-the-art classifier Random Forest and showed that \(\text {ALR}^n\) models are indeed lower biased than RF and that, on bigger datasets, \(\text {ALR}^n\) often obtains lower 0–1 loss than RF.
Table 4

Win–Draw–Loss: \(\text {ALR}^2\) versus RF and \(\text {ALR}^3\) versus RF

| Datasets | Metric | \(\text {ALR}^2\) versus RF: W–D–L | \(\text {ALR}^2\) versus RF: p | \(\text {ALR}^3\) versus RF: W–D–L | \(\text {ALR}^3\) versus RF: p |
| --- | --- | --- | --- | --- | --- |
| All | Bias | 39/9/28 | 0.221 | 35/9/32 | 0.807 |
| All | Variance | 25/2/49 | 0.556 | 21/3/52 | \(< \mathbf{0.001}\) |
| Little | 0–1 Loss | 26/3/47 | 0.018 | 22/1/53 | \(<\mathbf{0.001}\) |
| Little | RMSE | 26/0/50 | 0.007 | 25/0/51 | 0.003 |
| Big | 0–1 Loss | 4/1/3 | 1.000 | 5/0/3 | 0.726 |
| Big | RMSE | 4/0/4 | 1.000 | 5/0/3 | 0.726 |

p is two-tail binomial sign test. Results are significant if \(p \le 0.05\)

Fig. 21

Geometric mean of 0–1 Loss (left) and RMSE (right) performance of \(\text {ALR}^2\), \(\text {ALR}^3\) and RF for Little and Big datasets

Fig. 22

Geometric average of training time (left), Classification Time (right) of \(\text {ALR}^2\), \(\text {ALR}^3\) and RF for Little and Big datasets

There are a number of promising new directions for future work.
  • We have shown that \(\text {ALR}^n\) is a low bias classifier that requires minimal tuning and has the ability to handle multiple classes. The obvious extension is to make it out-of-core. We argue that \(\text {ALR}^n\) is well suited for stochastic gradient descent based methods as it can converge to the global minimum very quickly.

  • It may be desirable to utilize a hierarchical ALR, such that \(h\text {ALR}^n = \{\text {ALR}^1 \cdots \text {ALR}^n\}\), incorporating all the parameters up to order n. This may be useful for smoothing the parameters. For example, if a certain interaction does not occur in the training data, at classification time one can resort to lower values of n.

  • In this work, we have constrained the values of n to two and three. Scaling-up \(\text {ALR}^n\) to higher values of n is highly desirable. One can exploit the fact that many interactions at higher values of n will not occur in the data and hence can develop sparse implementations of \(\text {ALR}^n\) models.

  • Other objective functions such as Mean-Squared-Error or Hinge Loss may have desirable properties; exploring them is left as future work.

  • The preliminary version of ALR that we have developed is restricted to categorical data and hence requires that numeric data be discretized. While our results show that this is often highly competitive with Random Forests, which can use local cut-points (built-in discretization scheme), on some datasets it is not. In consequence, there is much scope for extensions to \(\text {ALR}^n\) to directly handle numeric data.

9 Code and datasets

Code with running instructions can be downloaded from https://github.com/nayyarzaidi/ALR.

Footnotes

  1. Naive Bayes and LR are generally categorized as generative and discriminative counter-parts of each other. The number of parameters of the two models is exactly the same. They only differ in the way the parameters are learned. For Naive Bayes, the parameters are actual probabilities and are learned by maximizing the log-likelihood of the data, and for LR, they are free parameters that are learned by optimizing the conditional log-likelihood.

  2. The dataset is about classifying poker hands (each hand consists of five cards) into 10 different classes, i.e., one pair, two pair, three of a kind, straight, flush, full house, four of a kind, straight flush, royal flush and nothing in hand. Each card is represented by two attributes, the card suit and the card number. Therefore, there are a total of 10 attributes describing a hand.

  3. The exceptions are MITFaceSetA, MITFaceSetB, MITFaceSetC and Kddcup, where results are reported with 2 rounds of 2-fold cross validation because of the time constraints on the grid computers on which the results were computed.

  4. As discussed in Sect. 1, the reason for performing bias/variance estimation is that it provides insights into how the learning algorithm will perform with varying amounts of data. We expect low variance algorithms to have relatively low error for small data and low bias algorithms to have relatively low error for large data (Brain and Webb 2002).

  5. The original L-BFGS implementation of Byrd et al. (1995) from http://users.eecs.northwestern.edu/~nocedal/lbfgsb.html is used.

  6. \(\text {LR}^{2}\) versus \(\text {ALR}^2\): the two datasets on which the 0–1 Loss of the two parameterizations is significantly different are Phoneme (0.1935, 0.2814) and Promoters (0.1132, 0.1717). \(\text {LR}^{3}\) versus \(\text {ALR}^3\): the two datasets on which the 0–1 Loss of the two parameterizations is significantly different are Phoneme (0.2347, 0.4743) and Lung-Cancer (0.6125, 0.5625).

Notes

Acknowledgments

This research has been supported by the Australian Research Council (ARC) under Grant DP140100087, and by the Asian Office of Aerospace Research and Development, Air Force Office of Scientific Research under Contracts FA2386-15-1-4007 and FA2386-15-1-4017. The authors would like to thank Wray Buntine, Reza Haffari and Bart Goethals for helpful discussions during the evolution of this paper. The authors also acknowledge tremendously useful comments by the reviewers that helped improve the quality of the paper.

References

  1. Bishop, C. (2006). Pattern recognition and machine learning. Berlin: Springer.MATHGoogle Scholar
  2. Boyd, S., & Vandenberghe, L. (2008). Convex optimization. Cambridge: Cambridge Unversity Press.MATHGoogle Scholar
  3. Brain, D., & Webb, G. I. (2002). The need for low bias algorithms in classification learning from small data sets. In: PKDD, pp. 62–73.Google Scholar
  4. Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.CrossRefMATHGoogle Scholar
  5. Byrd, R., Lu, P., & Nocedal, J. (1995). A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific and Statistical Computing, 16(5), 1190–1208.MathSciNetCrossRefMATHGoogle Scholar
  6. Fayyad, U. M., & Irani, K. B. (1992). On the handling of continuous-valued attributes in decision tree generation. Machine Learning, 8(1), 87–102.MATHGoogle Scholar
  7. Firth, D. (1993). Bias reduction of maximum likelihood estimates. Biometrika, 80, 27–38.MathSciNetCrossRefMATHGoogle Scholar
  8. Frank, A., & Asuncion, A. (2010). UCI machine learning repository. http://archive.ics.uci.edu/ml.
  9. Ganz, J., & Reinsel, D. (2012). The digital universe in 2020: Big data, bigger digital shadows, and biggest growth in the far east. Framingham: International Data Corporation. https://www.emc.com/collateral/analyst-reports/idc-the-digital-universe-in-2020.pdf.
  10. Genkin, A., Lewis, D., & Madigan, M. (2012). Large-scale Bayesian logistic regression for text categorization. Technometrics, 49, 291–304.
  11. Greiner, R., Su, X., Shen, B., & Zhou, W. (2005). Structural extensions to logistic regression: Discriminative parameter learning of belief net classifiers. Machine Learning, 59, 297–322.
  12. Greiner, R., & Zhou, W. (2002). Structural extension to logistic regression: Discriminative parameter learning of belief net classifiers. In Eighteenth Annual National Conference on Artificial Intelligence (AAAI), pp. 167–173.
  13. Hauck, W., Anderson, S., & Marcus, S. (1998). Should we adjust for covariates in nonlinear regression analysis of randomised trials? Controlled Clinical Trials, 19, 249–256.
  14. Hill, T., & Lewicki, P. (2013). Statistics: Methods and applications. Dell.
  15. Kohavi, R., & Wolpert, D. (1996). Bias plus variance decomposition for zero-one loss functions. In ICML, pp. 275–283.
  16. Langford, J., Li, L., & Strehl, A. (2007). Vowpal wabbit online learning project. https://github.com/JohnLangford/vowpal_wabbit/wiki
  17. Lauritzen, S. (1996). Graphical models. Oxford: Oxford University Press.
  18. Lin, X., Wahba, G., Xiang, D., Gao, F., Klein, R., & Klein, B. (1998). Smoothing spline ANOVA models for large data sets with Bernoulli observations and the randomized GACV. Technical Report 998, Department of Statistics, University of Wisconsin, Madison WI.
  19. Lint, J. H., & Wilson, M. R. (1992). A course in combinatorics. Cambridge: Cambridge University Press.
  20. Liu, H., & Motoda, H. (1998). Feature extraction, construction and selection: A data mining perspective. Berlin: Springer.
  21. Martinez, A., Chen, S., Webb, G. I., & Zaidi, N. A. (2016). Scalable learning of Bayesian network classifiers. Journal of Machine Learning Research, 17, 1–35.
  22. Mitchell, T. M. (1980). The need for biases in learning generalizations. Technical Report CBM-TR-117, Rutgers University, Department of Computer Science, New Brunswick, NJ.
  23. Neuhaus, J., & Jewell, N. (1993). A geometric approach to assess bias due to omitted covariates in generalized linear models. Biometrika, 80, 807–815.
  24. Pazzani, M. J. (1996). Constructive induction of Cartesian product attributes. In Proceedings of the information, statistics and induction in science conference (ISIS96), pp. 66–77.
  25. Pernkopf, F., & Bilmes, J. (2005). Discriminative versus generative parameter and structure learning of Bayesian network classifiers. In International Conference on Machine Learning, pp. 657–664.
  26. Pernkopf, F., & Wohlmayr, M. (2009). On discriminative parameter learning of Bayesian network classifiers. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 221–237.
  27. Roos, T., Wettig, H., Grünwald, P., Myllymäki, P., & Tirri, H. (2005). On discriminative Bayesian network classifiers and logistic regression. Machine Learning, 59(3), 267–296.
  28. Sahami, M. (1996). Learning limited dependence Bayesian classifiers. In Proceedings of the second international conference on knowledge discovery and data mining, pp. 334–338. Menlo Park, CA: AAAI Press.
  29. Smola, A., & Schölkopf, B. (2000). Sparse greedy matrix approximation for machine learning. In International Conference on Machine Learning, pp. 911–918.
  30. Sonnenburg, S., & Franc, V. (2010). COFFIN: A computational framework for linear SVMs. In International Conference on Machine Learning, pp. 999–1006.
  31. Steinwart, I. (2004). Sparseness of support vector machines: Some asymptotically sharp bounds. In Advances in Neural Information Processing Systems 16.
  32. Stinson, D. R. (2003). Combinatorial designs: Constructions and analysis. Berlin: Springer.
  33. Su, J., Zhang, H., Ling, C., & Matwin, S. (2008). Discriminative parameter learning for Bayesian networks. In International Conference on Machine Learning, pp. 1016–1023.
  34. Szilard, N., Jonasson, J., Genell, A., & Steineck, G. (2009). Bias in odds ratios by logistic regression modelling and sample size. BMC Medical Research Methodology, 9(1), 1–5.
  35. Webb, G. I. (2000). Multiboosting: A technique for combining boosting and wagging. Machine Learning, 40(2), 159–196.
  36. Webb, G. I., Boughton, J., & Wang, Z. (2005). Not so naive Bayes: Averaged one-dependence estimators. Machine Learning, 58(1), 5–24.
  37. Webb, G. I., Boughton, J., Zheng, F., Ting, K. M., & Salem, H. (2011). Learning by extrapolation from marginal to full-multivariate probability distributions: Decreasingly naive Bayesian classification. Machine Learning. doi:10.1007/s10994-011-5263-6.
  38. Williams, C., & Seeger, M. (2001). Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems 13, pp. 682–688.
  39. Zaidi, N. A., Carman, M. J., Cerquides, J., & Webb, G. I. (2014). Naive-Bayes inspired effective pre-conditioners for speeding-up logistic regression. In IEEE international conference on data mining, pp. 1097–1102.
  40. Zaidi, N. A., Cerquides, J., Carman, M. J., & Webb, G. I. (2013). Alleviating naive Bayes attribute independence assumption by attribute weighting. Journal of Machine Learning Research, 14, 1947–1988.
  41. Zaidi, N. A., & Webb, G. I. (2012). Fast and efficient single pass Bayesian learning. In Advances in knowledge discovery and data mining, pp. 149–160.
  42. Zhu, J., & Hastie, T. (2001). Kernel logistic regression and the import vector machine. In NIPS, pp. 1081–1088.

Copyright information

© The Author(s) 2016

Authors and Affiliations

  • Nayyar A. Zaidi (1)
  • Geoffrey I. Webb (1)
  • Mark J. Carman (1)
  • François Petitjean (1)
  • Jesús Cerquides (2)

  1. Faculty of Information Technology, Monash University, Clayton, Australia
  2. IIIA-CSIC, Artificial Intelligence Research Institute, Spanish National Research Council, Bellaterra, Spain
