\(\text {ALR}^n\): accelerated higher-order logistic regression
Abstract
This paper introduces Accelerated Logistic Regression: a hybrid generative-discriminative approach to training Logistic Regression with high-order features. We present two main results: (1) that our combined generative-discriminative approach significantly improves the efficiency of Logistic Regression and (2) that incorporating higher-order features (i.e. features that are the Cartesian products of the original features) reduces the bias of Logistic Regression, which in turn significantly reduces its error on large datasets. We assess the efficacy of Accelerated Logistic Regression by conducting an extensive set of experiments on 75 standard datasets. We demonstrate its competitiveness, particularly on large datasets, by comparing against state-of-the-art classifiers including Random Forest and Averaged n-Dependence Estimators.
Keywords
Higher-order Logistic Regression, Low-bias classifiers, Generative-discriminative learning
1 Introduction
It has been shown that Bayesian Network Classifiers (BNCs) that explicitly represent higher-order interactions tend to have lower bias than those that do not (Martinez et al. 2016; Webb et al. 2011). This is because BNCs that can represent higher-order interactions can exactly represent any of a superset of the distributions that can be represented by BNCs that are restricted to lower order interactions. Thus they have lower representation bias and hence, all other things being equal, lower inductive bias (Mitchell 1980) than the more restricted BNCs. Except in the specific cases where the true distribution to be modeled fits exactly into the more restricted model, given sufficient data the more expressive BNC will form a more accurate model.
It has also been shown that Logistic Regression (LR) tends to have lower bias than naive Bayes, which is a Bayesian Network Classifier with a model of equivalent form to that of LR (Zaidi et al. 2014, 2013).^{1} In consequence, it seems likely that variants of LR that explicitly represent higher-order interactions should have low bias as well, and that the bias should continue to decrease as the order of the interactions represented increases. We call such variants Higher-Order LR and abbreviate them as \(\text {LR}^{n}\), where n is the order of interactions that are modeled. Formal definitions of these concepts are provided in Sect. 5.
While the use of higher-order LR models is quite common and at least one implementation of \(\text {LR}^{2}\) is in the public domain (Langford et al. 2007), the performance of such models relative to standard LR as data quantities vary, the behaviour of \(\text {LR}^{3}\), and the bias/variance profile of higher-order LR models all warrant further investigation. We investigate these issues herein. It is noteworthy that a significant amount of research has been done on correcting the estimation bias of Logistic Regression (Firth 1993; Szilard et al. 2009). Most of this research has been driven by the fact that LR’s parameters are obtained through Maximum Likelihood Estimation (MLE), which can be biased if the data sample is too small. However, it has been shown that, asymptotically, MLE estimates have zero estimation bias. Similarly, several studies have addressed the issue of bias due to omitted covariates in Logistic Regression models (Neuhaus and Jewell 1993; Hauck et al. 1998). Some studies have also investigated the Bayesian version of Logistic Regression (Genkin et al. 2012).
An \(\text {LR}^{n}\) model must be learned discriminatively through computationally intensive gradient-descent-based search. Considering all possible higher-order features in \(\text {LR}^{n}\) and learning the corresponding parameters by optimizing conditional log-likelihood (CLL) is a computationally intensive task. Any speed-up to the optimization process is highly desirable. A second objective of this paper is to provide an effective mechanism for achieving this.
It has been shown that a hybrid generative-discriminative learner can exploit the strengths of both Naive Bayes (NB) and Logistic Regression (LR) classifiers by creating a weighted variant of NB in which the weights are optimized using a discriminative objective function, that is, maximization of conditional log-likelihood (Zaidi et al. 2013, 2014). The resulting model can be viewed as either using weights to alleviate the feature independence assumption of NB, or as using the maximum likelihood parameterization of NB to precondition the discriminative search of LR. The result is a learner that learns models that are exactly equivalent to LR, but does so much more efficiently. In this work, we show how to achieve the same result with \(\text {LR}^{n}\).
We create a hybrid generative-discriminative learner named \(\text {ALR}^n\) for categorical data that learns models of equivalent order to those of \(\text {LR}^{n}\), but does so much more efficiently than \(\text {LR}^{n}\). We further demonstrate that the resulting models have low bias, which leads to very low error on large quantities of data. However, in order to create this hybrid learner we must first create an efficient generative counterpart to \(\text {LR}^{n}\).
The contributions of this paper are:

developing an efficient generative counterpart to \(\text {LR}^{n}\), named Averaged n-Join Estimators (AnJE);

developing \(\text {ALR}^n\): a hybrid of \(\text {LR}^{n}\) and AnJE;

demonstrating that \(\text {ALR}^n\) has equivalent error to \(\text {LR}^{n}\), but is substantially more efficient;

demonstrating that \(\text {ALR}^n\) has low error on large data;

demonstrating that the bias of \(\text {LR}^{n}\) decreases as n increases and that in consequence \(\text {LR}^{n}\) with higher n tends to achieve lower error with greater data quantities.
2 Notation
We seek to assign a value \(y\in \varOmega _Y=\{y_1, \ldots , y_C\}\) of the class variable Y, to a given example \(\mathbf {x} = (x_1, \ldots , x_a)\), where the \(x_i\) are value assignments for the a features \(\mathcal {A} = \{X_1,\ldots , X_a\}\). We define \({\mathcal {A} \atopwithdelims ()n}\) as the set of all subsets of \(\mathcal {A}\) of size n, where each subset in the set is denoted as \(\alpha \): \({\mathcal {A} \atopwithdelims ()n}=\{\alpha \subseteq \mathcal {A}: |\alpha | =n\}\). We use \(x_\alpha \) to denote the set of values taken by features in the subset \(\alpha \) for any data object \(\mathbf {x}\).
LR for categorical data learns a weight for every feature value per class. Therefore, for LR, we denote by \(\beta _y\) the weight associated with class y, and by \(\beta _{y,i,x_i}\) the weight associated with feature i taking value \(x_i\) with class label y. For \(\text {LR}^{n}\), \(\beta _{y,\alpha ,x_{\alpha }}\) specifies the weight associated with class y and feature subset \(\alpha \) taking value \(x_\alpha \). The equivalent weights for \(\text {ALR}^n\) are denoted by \(w_y\), \(w_{y,i,x_i}\) and \(w_{y,\alpha ,x_\alpha }\). The probability of feature i taking value \(x_i\) given class y is denoted by \(\mathrm {P}(x_i\arrowvert y)\). Similarly, the probability of feature subset \(\alpha \) taking value \(x_\alpha \) given class y is denoted by \(\mathrm {P}(x_\alpha \arrowvert y)\). Note that all probabilities are estimated probabilities. For clarity, we will not use the \(\hat{\mathrm{P}}(.)\) notation that is typically used for estimated probabilities.
3 Higher-order logistic regression
Because its model is linear, LR is very restricted in the posterior distributions that it can precisely model. For example, it cannot model exclusiveor (XOR) between two features.
Adding higher-order features to LR increases the range of distributions that it can precisely model. Here, we define higher-order categorical features as features that are the Cartesian product of the primitive features, where the order n is the number of primitive features in the Cartesian product.
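To make this definition concrete, the following is a minimal Python sketch (not the authors' implementation) that builds order-n features as Cartesian products of primitive categorical features. The function names `higher_order_features` and `classes_per_feature` and the toy XOR data are illustrative assumptions. On XOR, every order-1 feature value co-occurs with both classes, whereas every order-2 feature value co-occurs with exactly one class, which illustrates why one weight per joint value suffices to represent XOR.

```python
from itertools import combinations

def higher_order_features(x, n):
    """Return the order-n features of example x.

    Each feature is a pair (alpha, x_alpha): alpha is a tuple of n feature
    indices and x_alpha is the tuple of values that x takes on those indices.
    """
    return [(alpha, tuple(x[i] for i in alpha))
            for alpha in combinations(range(len(x)), n)]

# Toy XOR data (hypothetical): the class is 1 exactly when the two
# binary features differ.
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def classes_per_feature(n):
    """Map each order-n feature to the set of classes it co-occurs with."""
    seen = {}
    for x, y in data:
        for feature in higher_order_features(x, n):
            seen.setdefault(feature, set()).add(y)
    return seen

order1 = classes_per_feature(1)  # single-feature values
order2 = classes_per_feature(2)  # joint values of both features
```

For an example with a features, `higher_order_features` returns one feature per subset of size n, that is, \({a \atopwithdelims ()n}\) features.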
As mentioned in Sect. 1, it has been shown that Bayesian Network Classifiers that explicitly represent higher-order features tend to have lower bias than those that do not, and that the bias decreases as the order of the features increases (Martinez et al. 2016; Webb et al. 2011). Therefore, it seems likely that LR applied to higher-order features will likewise tend to have lower bias, with bias decreasing as the order increases. This is very significant, as LR is a powerful learning system and there is good reason to believe that the lower the bias of a learning system the lower its error will tend to be on very large datasets (Brain and Webb 2002).
3.1 Kernel LR and \(\text {LR}^{n}\)
3.2 Experimental evaluation of \(\text {LR}^{n}\)
While \(\text {LR}^{n}\) is part of established data analytics practice, we are not aware of any research into its bias/variance profile or its performance relative to standard LR with respect to varying quantities of data. We here investigate those issues. Though we provide a detailed empirical analysis in Sect. 7, here we present some results to illustrate the power of modeling higher-order interactions.
Figure 2 shows learning curves for \(\text {LR}^{n}\) with \(n=1, 2\) and 3. We generated these curves using a prequential testing paradigm on the Localization dataset. For each run, we first randomized the dataset and then processed the resulting ordered dataset sequentially. Each example was first classified and the probabilistic loss \(\frac{1}{C}\sum _{c=1}^{C} (\delta _{y=c} - \mathrm{P}(c \mid \mathbf{x}))^2\) was calculated, where \(\delta _{y=c}\) is an indicator function that is 1 if the actual class label y is the same as c and zero otherwise. The example was then used to update the model. This process was repeated five times with different randomizations of the dataset. For each run this process generated N loss values, where \(N=164{,}860\), the size of the Localization dataset. To generate learning curves, for each point i on the x-axis, we plot \(\frac{1}{T} \sum _{k=\max (i-T,1)}^{i} \text {loss}(x_k)\), where T is set to 1000. For \(i \le T\), we plot \(\frac{1}{i} \sum _{k=1}^{i} \text {loss}(x_k)\). It can be seen that for very small data quantities the lower variance of \(\text {LR}^{2}\) results in lower error than \(\text {LR}^{3}\), but as data quantity increases the lower bias of \(\text {LR}^{3}\) results in the lowest error. It can also be seen that LR obtains better performance than \(\text {LR}^{2}\) and \(\text {LR}^{3}\) when learned from very small quantities of data (the learning curves are zoomed in between 0 and 1000 instances in Fig. 2 to illustrate this point). The obvious reason for the poor performance of \(\text {LR}^{2}\) and \(\text {LR}^{3}\) (models that include higher-order interactions) on smaller training sets is overfitting: these powerful models can fit chance regularities in the data. Hence for smaller quantities of data, some sort of regularization that pulls the weights for many higher-order interactions back towards zero would lead to much better performance.
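The prequential procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for Fig. 2: the `predict_proba`/`update` interface and the `ClassFrequencyModel` stand-in learner are hypothetical, and the smoothing uses the mean of the most recent T losses, which differs from the windowed formula in the text by one boundary term.

```python
import numpy as np

def squared_loss(probs, y):
    """(1/C) * sum_c (delta_{y=c} - P(c|x))^2 for a single example."""
    target = np.zeros_like(probs)
    target[y] = 1.0
    return float(np.mean((target - probs) ** 2))

class ClassFrequencyModel:
    """Stand-in incremental learner (an assumption): it ignores x and
    predicts smoothed class frequencies seen so far."""
    def __init__(self, num_classes):
        self.counts = np.ones(num_classes)  # one pseudo-count per class

    def predict_proba(self, x):
        return self.counts / self.counts.sum()

    def update(self, x, y):
        self.counts[y] += 1

def prequential_losses(model, X, Y):
    """Test-then-train: score each example before the model sees its label."""
    losses = []
    for x, y in zip(X, Y):
        losses.append(squared_loss(model.predict_proba(x), y))
        model.update(x, y)
    return np.array(losses)

def learning_curve(losses, T=1000):
    """Running mean for the first T points, then the mean of the last T losses."""
    curve = np.empty(len(losses))
    for i in range(1, len(losses) + 1):  # i is 1-based, as in the text
        window = losses[:i] if i <= T else losses[i - T:i]
        curve[i - 1] = window.mean()
    return curve

# Tiny usage example with three examples and two classes.
demo_losses = prequential_losses(ClassFrequencyModel(2), [None] * 3, [0, 0, 1])
demo_curve = learning_curve(demo_losses, T=2)
```

Averaging such curves over several random orderings of the data, as the text does with five runs, reduces the dependence on any single ordering.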
In line with our expectation that low inductive bias will often lead to low statistical bias, which will in turn translate to lower error on big datasets, it can be seen in Fig. 4 that higher-order LR results in much lower 0–1 Loss than standard LR and that this benefit tends to continue as n increases. Note that for one dataset, pokerhand^{2}, \(\text {LR}^{2}\) achieves much lower error than \(\text {LR}^{3}\); we conjecture that this is because of strong two-level correlations that exist in the data. On this synthetic (and deterministic) dataset, \(\text {LR}^{3}\) will need much more data to estimate its parameters effectively. The current results only utilize half of the data for training. It can be seen that for \(\text {LR}^{2}\) this much data is more than enough, but not for \(\text {LR}^{3}\), and hence \(\text {LR}^{3}\) results in poorer performance than \(\text {LR}^{2}\).
4 Using generative models to precondition discriminative learning
It has been shown that a direct equivalence between a weighted NB and LR can be exploited to greatly speed up LR’s learning rate (Zaidi et al. 2013, 2014).
Table 1 Details of datasets
Domain  Case  Att  Class  Domain  Case  Att  Class 

Kddcup  5,209,000  41  40  Vowel  990  14  11 
Pokerhand  1,175,067  10  10  TicTacToeEndgame  958  10  2 
MITFaceSetC  839,000  361  2  Annealing  898  39  6 
Covertype  581,012  55  7  Vehicle  846  19  4 
MITFaceSetB  489,400  361  2  PimaIndiansDiabetes  768  9  2 
MITFaceSetA  474,000  361  2  BreastCancer(Wisconsin)  699  10  2 
CensusIncome(KDD)  299,285  40  2  CreditScreening  690  16  2 
Localization  164,860  7  3  BalanceScale  625  5  3 
Connect4Opening  67,557  43  3  Syncon  600  61  6 
Statlog(Shuttle)  58,000  10  7  Chess  551  40  2 
Adult  48,842  15  2  Cylinder  540  40  2 
LetterRecognition  20,000  17  26  Musk1  476  167  2 
MAGICGammaTelescope  19,020  11  2  HouseVotes84  435  17  2 
Nursery  12,960  9  5  HorseColic  368  22  2 
Sign  12,546  9  3  Dermatology  366  35  6 
PenDigits  10,992  17  10  Ionosphere  351  35  2 
Thyroid  9169  30  20  LiverDisorders(Bupa)  345  7  2 
Pioneer  9150  37  57  PrimaryTumor  339  18  22 
Mushrooms  8124  23  2  Haberman’sSurvival  306  4  2 
Musk2  6598  167  2  HeartDisease(Cleveland)  303  14  2 
Satellite  6435  37  6  Hungarian  294  14  2 
OpticalDigits  5620  49  10  Audiology  226  70  24 
PageBlocksClassification  5473  11  5  NewThyroid  215  6  3 
Wallfollowing  5456  25  4  GlassIdentification  214  10  3 
Nettalk(Phoneme)  5438  8  52  SonarClassification  208  61  2 
Waveform5000  5000  41  3  AutoImports  205  26  7 
Spambase  4601  58  2  WineRecognition  178  14  3 
Abalone  4177  9  3  Hepatitis  155  20  2 
Hypothyroid(Garavan)  3772  30  4  TeachingAssistantEvaluation  151  6  3 
Sickeuthyroid  3772  30  2  IrisClassification  150  5  3 
Kingrookvskingpawn  3196  37  2  Lymphography  148  19  4 
SplicejunctionGeneSequences  3190  62  3  Echocardiogram  131  7  2 
Segment  2310  20  7  PromoterGeneSequences  106  58  2 
CarEvaluation  1728  8  4  Zoo  101  17  7 
Volcanoes  1520  4  4  PostoperativePatient  90  9  3 
Yeast  1484  9  10  LaborNegotiations  57  17  2 
ContraceptiveMethodChoice  1473  10  3  LungCancer  32  57  3 
German  1000  21  2  Contactlenses  24  5  3 
LED  1000  8  10 
5 Accelerated logistic regression (ALR)
5.1 Averaged n-join estimators (AnJE)
5.2 \(\text {ALR}^n\)
\(\text {ALR}^n\) computes weights by optimizing CLL. Therefore, one can compute the gradient of Eq. 14 with respect to the weights and rely on gradient-descent-based methods to find the optimal value of these weights. Since we do not want to be stuck in local minima, a natural question to ask is whether the resulting objective function is convex (Boyd and Vandenberghe 2008). It turns out that the objective function of \(\text {ALR}^n\) is indeed convex. Roos et al. (2005) proved that an objective function of the form \(\sum _{\mathbf{x}\in \mathcal {D}} \log \mathrm{P}_{\mathcal {B}}(y \mid \mathbf{x})\), optimized by any conditional Bayesian network model, is convex if and only if the structure \({\mathcal {G}}\) of the Bayesian network \({\mathcal {B}}\) is perfect. A directed graph in which all nodes having a common child are connected is called perfect (Lauritzen 1996). \(\text {ALR}^n\) is a geometric mean of several submodels where each submodel models \(\lfloor \frac{a}{n} \rfloor \) interactions, each conditioned on the class feature. Each submodel has a structure that is perfect. Since the product of two convex objective functions leads to a convex function, one can see that \(\text {ALR}^n\)’s optimization function will also lead to a convex objective function.
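The definition of a perfect graph used above can be checked mechanically. The following is a small sketch, assuming the graph is given as a dictionary that maps each node to the set of its parents; the function name `is_perfect` and the three example structures are illustrative and do not come from the paper.

```python
from itertools import combinations

def is_perfect(parents):
    """Check whether a directed graph is perfect.

    `parents` maps each node to the set of its parents. The graph is perfect
    when every pair of nodes that share a child is joined by an edge
    (in either direction).
    """
    def connected(u, v):
        return u in parents.get(v, set()) or v in parents.get(u, set())

    return all(connected(u, v)
               for node_parents in parents.values()
               for u, v in combinations(sorted(node_parents), 2))

# Naive Bayes structure: the class Y is the only parent of every feature,
# so no two distinct nodes share a child.
naive_bayes = {"Y": set(), "X1": {"Y"}, "X2": {"Y"}, "X3": {"Y"}}

# V-structure X1 -> X3 <- X2 without an X1-X2 edge: not perfect.
v_structure = {"X1": set(), "X2": set(), "X3": {"X1", "X2"}}

# Adding the edge X1 -> X2 connects the two parents of X3.
connected_parents = {"X1": set(), "X2": {"X1"}, "X3": {"X1", "X2"}}
```

A structure in which the class is the sole parent of each (possibly joined) feature node passes this check trivially, which matches the statement that each submodel is perfect.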
5.3 Alternative parameterization
5.4 Comparative analysis of \(\text {ALR}^n\) and \(\text {LR}^{n}\)

Step 1 is the optimization of the log-likelihood of the data (\(\log \mathrm{P}(y,\mathbf{x})\)) to obtain the estimates of the prior and likelihood probabilities. One can view this step as generative learning.

Step 2 is the introduction of weights on these probabilities and the learning of these weights by maximizing the CLL (\(\mathrm {P}(y\arrowvert \mathbf {x})\)) objective function. This step can be interpreted as discriminative learning.
One can expect a similar bias-variance profile and a very similar classification performance, as both models will converge to a similar point in the optimization space, the only difference in the final parameterization being due to gradient descent being terminated before absolute convergence. However, the rate of convergence of the two models can be very different. Zaidi et al. (2014) show that for NB, such an \(\text {ALR}^n\)-style parameterization with generative-discriminative learning can greatly speed up convergence relative to discriminative training alone. Note that discriminative training with NB as the graphical model is vanilla LR. We expect to see the same trend in the convergence performance of \(\text {ALR}^n\) and \(\text {LR}^{n}\).
Another distinction between the two models becomes explicit if a regularization penalty is added to the objective function. In the case of \(\text {ALR}^n\), regularizing the weights towards 1 will effectively pull the parameters back towards the generative training estimates. For smaller datasets, one can expect to obtain better performance by using a large regularization parameter and pulling the weights back towards 1. However, one cannot do this for \(\text {LR}^{n}\). Therefore, \(\text {ALR}^n\) models can very elegantly combine generative and discriminative parameters.
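The two steps and the regularization towards 1 can be illustrated for the simplest case, \(n=1\) (weighted naive Bayes). The sketch below is not the authors' implementation. It assumes the model form \(\mathrm{P}(y \mid \mathbf{x}) \propto \exp (w_y \log \mathrm{P}(y) + \sum _i w_{y,i,x_i} \log \mathrm{P}(x_i \mid y))\), built from the notation of Sect. 2, uses plain gradient ascent rather than L-BFGS, and adds an assumed penalty \(\frac{\lambda }{2}\sum (w-1)^2\). All function names, the smoothing constant `m` and the toy data are hypothetical.

```python
import numpy as np

def fit_generative(X, y, num_values, num_classes, m=1.0):
    """Step 1: smoothed count-based estimates of log P(y) and log P(x_i | y)."""
    a = X.shape[1]
    class_counts = np.bincount(y, minlength=num_classes) + m
    log_prior = np.log(class_counts / class_counts.sum())
    log_lik = np.zeros((num_classes, a, num_values))
    for c in range(num_classes):
        Xc = X[y == c]
        for i in range(a):
            counts = np.bincount(Xc[:, i], minlength=num_values) + m
            log_lik[c, i] = np.log(counts / counts.sum())
    return log_prior, log_lik

def posterior(X, log_prior, log_lik, w_prior, w_lik):
    """P(y | x) from weighted log-probabilities (softmax over classes)."""
    weighted = w_lik * log_lik                               # (C, a, V)
    scores = np.tile(w_prior * log_prior, (X.shape[0], 1))   # (N, C)
    for i in range(X.shape[1]):
        scores += weighted[:, i, :][:, X[:, i]].T
    scores -= scores.max(axis=1, keepdims=True)              # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum(axis=1, keepdims=True)

def fit_weights(X, y, log_prior, log_lik, lam=0.0, lr=0.05, iters=200):
    """Step 2: gradient ascent on CLL, starting from all weights equal to 1."""
    C, a, V = log_lik.shape
    w_prior, w_lik = np.ones(C), np.ones((C, a, V))
    onehot = np.eye(C)[y]
    for _ in range(iters):
        resid = onehot - posterior(X, log_prior, log_lik, w_prior, w_lik)
        g_prior = resid.sum(axis=0) * log_prior
        g_lik = np.zeros_like(w_lik)
        for i in range(a):
            for v in range(V):
                g_lik[:, i, v] = resid[X[:, i] == v].sum(axis=0) * log_lik[:, i, v]
        # Assumed penalty (lam / 2) * sum (w - 1)^2 pulls the weights towards 1,
        # i.e. back towards the generative estimates.
        g_prior -= lam * (w_prior - 1.0)
        g_lik -= lam * (w_lik - 1.0)
        w_prior += lr * g_prior / len(y)
        w_lik += lr * g_lik / len(y)
    return w_prior, w_lik

# Toy categorical data (hypothetical): the class equals the first feature.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0, 0], [1, 1]])
y = np.array([0, 0, 1, 1, 0, 1])
log_prior, log_lik = fit_generative(X, y, num_values=2, num_classes=2)
w_prior, w_lik = fit_weights(X, y, log_prior, log_lik)
```

With all weights fixed at 1 the posterior is exactly that of the generative model, which is why the discriminative search starts from the generative solution; a larger `lam` keeps the learned weights closer to that starting point.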
6 Related work
There are \({a \atopwithdelims ()n}\) possible combinations of features that can be used as parents, producing \({a\atopwithdelims ()n}\) submodels which are combined by averaging.
AnDE and AnJE both use simple generative learning, merely counting the relevant sufficient statistics from the data. Both have only one tuning parameter, n, which controls the bias-variance trade-off. Higher values of n lead to low bias and high variance, and vice versa.
\(\text {ALR}^n\) has a number of similarities with ELR (Greiner and Zhou 2002; Greiner et al. 2005) for which the parameters associated with a Bayesian network classifier (naive Bayes or TAN) are learned by optimizing the CLL. ELR performs discriminative learning of the weights for a model with a Bayesian network structure. As explained in Sect. 5, it is not possible to create a single Bayesian network with the structure of the ALR model. Further, ELR does not utilize the generative parameters to precondition the search for discriminative parameters as does ALR. Some related ideas to ELR are also explored in Pernkopf and Bilmes (2005), Pernkopf and Wohlmayr (2009) and Su et al. (2008).
Feature construction has been studied extensively (Liu and Motoda 1998). The goal is to improve the classifier’s accuracy by creating new attributes from existing attributes. The new attributes can be either binary or arithmetic or other combinations of existing attributes. One approach that is closely related to the current work is the formation of Cartesian products of categorical features through hill-climbing search (Pazzani 1996). Our work differs in using all Cartesian products of a given order and using discriminative learning of weights to determine each combination’s relative (weighted) contribution to the model.
7 Experiments
In this section, we compare and analyze the performance of our proposed algorithms and related methods on 76 natural domains from the UCI repository of machine learning datasets (Frank and Asuncion 2010).
The experiments are conducted on the datasets described in Table 1. 40 datasets have fewer than 1,000 instances, 20 datasets have between 1,000 and 10,000 instances and 16 datasets have more than 10,000 instances. There are 8 datasets with over 100,000 instances. These datasets are shown in bold font in Table 1.
Each algorithm is tested on each dataset using 5 rounds of 2-fold cross validation^{3}.
The datasets in Table 1 are divided into two categories. We call the following datasets Big: KDDCup, Pokerhand, MITFaceSetC, Covertype, MITFaceSetB, MITFaceSetA, Censusincome, Localization. All remaining datasets are denoted as Little in the results.
Due to their size, experiments for most of the Big datasets had to be performed in a heterogeneous environment (grid computing) for which CPU wall-clock times are not commensurable. In consequence, when comparing classification and training time, the following 12 datasets constitute the Big category: Localization, Censusincome, Pokerhand, Covtype, Connect4, Shuttle, Adult, Letterrecog, Magic, Nursery, Sign, Pendigits. When comparing average results across Little and Big datasets, we normalize the results with respect to \(\text {ALR}^2\) and present a geometric mean.
Numeric features are discretized by using the Minimum Description Length (MDL) supervised discretization method (Fayyad and Irani 1992). Training data is discretized at training time. The cut-points learned during the discretization procedure are used to discretize the testing data. However, for kddcup, MITFaceSetA, MITFaceSetB and MITFaceSetC, for computational efficiency, the entire dataset is discretized before the training starts. That is, the cut-points are learned over both training and test data. The bias introduced by including test data in the discretization process is not an issue here because it is uniform across all compared classifiers (i.e., AnDE and Random Forest).
A missing value is treated as a separate feature value and taken into account exactly like other values.
We employed the L-BFGS quasi-Newton method for solving the optimization^{5}. Note that although we have used L-BFGS to demonstrate the efficacy of \(\text {ALR}^n\), the results generalize well to other optimization routines including Gradient Descent, Conjugate Gradient and Stochastic Gradient Descent (SGD). In “Appendix 1”, we also present results with Conjugate Gradient optimization.
Random Forest (RF) (Breiman 2001) is considered to be a state-of-the-art classification scheme. It consists of multiple decision trees, each trained on data selected at random with replacement from the original data (bagging). For example, if there are N data points, N data points are selected at random with replacement. If there are a attributes, a number m is specified, such that \(m<a\). At each node of the decision tree, m attributes are randomly selected out of a and are evaluated, the best being used to split the node. Note that we used \(m = \log _2(a) + 1\), where a is the total number of features. Each tree is grown to its largest possible size and no pruning is done. An instance is classified by passing it through each decision tree and selecting the mode of the output of the decision trees. We used 100 decision trees in this work.
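The randomization steps described above can be sketched as follows. This is only an illustration of bagging, per-node feature sampling and mode voting; it does not grow any trees and is not the Random Forest implementation used in the experiments. The helper names are hypothetical, and truncating \(\log _2(a) + 1\) to an integer is an assumption.

```python
import math
import random
from collections import Counter

def features_per_split(a):
    """m = log2(a) + 1, truncated to an integer (the truncation is an assumption)."""
    return int(math.log2(a)) + 1

def bootstrap_indices(n, rng):
    """Bagging: draw n indices uniformly at random with replacement."""
    return [rng.randrange(n) for _ in range(n)]

def candidate_features(a, rng):
    """Pick, without repetition, the m features evaluated at one tree node."""
    return rng.sample(range(a), features_per_split(a))

def forest_vote(tree_predictions):
    """Classify by the mode of the per-tree predictions."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(0)                   # fixed seed for repeatability
sample = bootstrap_indices(50, rng)      # training indices for one tree
node_features = candidate_features(16, rng)
```

Each of the 100 trees would receive its own bootstrap sample, and a fresh set of candidate features would be drawn at every node.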
The internal discretization mechanism of Random Forest is used for all but the kddcup, MITFaceSetA, MITFaceSetB and MITFaceSetC datasets, where the entire data is first discretized, as described before.
7.1 \(\text {ALR}^n\) versus AnJE
Table 2 Win–Draw–Loss: \(\text {ALR}^2\) versus A2JE and \(\text {ALR}^3\) versus A3JE
Measure  \(\text {ALR}^2\) versus A2JE: W–D–L  p  \(\text {ALR}^3\) versus A3JE: W–D–L  p
All datasets  
Bias  62/3/11  \(<{} \mathbf{0.001}\)  55/9/12  \(<{} \mathbf{0.001}\) 
Variance  21/4/51  \(<{} \mathbf{0.001}\)  25/2/49  0.007 
Little datasets  
0–1 Loss  47/4/25  0.012  39/2/35  0.727 
RMSE  39/0/37  0.908  32/0/44  0.206 
Big datasets  
0–1 Loss  8/0/0  0.007  7/0/1  0.070 
RMSE  8/0/0  0.007  7/0/1  0.070 
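Most p values in the table above, for example those of the Big datasets rows, are numerically consistent with a two-tailed binomial sign test on wins versus losses with draws ignored. This is an inference from the reported numbers rather than something stated in this section, so the sketch below should be read as one plausible way to reproduce such values; the function name `sign_test_p` is hypothetical.

```python
from math import comb

def sign_test_p(wins, losses):
    """Two-tailed binomial sign test on wins vs. losses (draws are ignored).

    Under the null hypothesis each non-drawn dataset is a win with
    probability 0.5. The two-tailed p value doubles the probability of a
    result at least as extreme as the smaller count, capped at 1.
    """
    n = wins + losses
    k = min(wins, losses)
    tail = sum(comb(n, j) for j in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

p_big_alr2 = sign_test_p(8, 0)   # W-D-L of 8/0/0
p_big_alr3 = sign_test_p(7, 1)   # W-D-L of 7/0/1
```

Without the cap at 1, an evenly split record such as 4/0/4 would yield a value above 1, which is not a valid probability.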
7.2 \(\text {ALR}^n\) versus AnDE
Table 3 Win–Draw–Loss: \(\text {ALR}^2\) versus A1DE and \(\text {ALR}^3\) versus A2DE
Measure  \(\text {ALR}^2\) versus A1DE: W–D–L  p  \(\text {ALR}^3\) versus A2DE: W–D–L  p
All datasets  
Bias  60/5/11  \(<{} \mathbf{0.001}\)  47/11/18  \(<{} \mathbf{0.001}\) 
Variance  22/9/45  0.006  26/4/46  0.024 
Little datasets  
0–1 Loss  43/3/30  0.159  33/4/39  0.556 
RMSE  30/0/46  0.084  24/0/52  0.035 
Big datasets  
0–1 Loss  8/0/0  0.007  7/0/1  0.070 
RMSE  8/0/0  0.073  7/0/1  0.070 
7.3 \(\text {ALR}^n\) versus \(\text {LR}^{n}\)
We compare the two parameterizations in terms of the scatter of their 0–1 Loss and RMSE values on Little datasets in Figs. 10 and 12 respectively, and on Big datasets in Figs. 11 and 13 respectively. It can be seen that the two parameterizations (with the exception of a few datasets^{6}) have a similar spread of 0–1 Loss and RMSE values for both \(n=2\) and \(n=3\). We attribute the difference in the performance of the two parameterizations in terms of 0–1 Loss to the numerical instability of the solver. The L-BFGS library we are using is written in Java and internally calls C++ routines which eventually call a Fortran library. There are some non-significant differences between \(\text {LR}^{n}\) and \(\text {ALR}^n\) only on the Phoneme, Lungcancer and Promoters datasets. The models trained on these datasets are all extremely sparse. Lungcancer, for example, has only 32 instances defined over 57 attributes and 3 classes. \(\text {LR}^{2}\) and \(\text {ALR}^2\) in this case optimize 75,246 parameters and \(\text {LR}^{3}\) and \(\text {ALR}^3\) optimize 5,465,451 parameters. We conjecture that the difference in the performance (0–1 Loss) is due to overflowing of the estimated parameters. It appears that on these datasets, the data is linearly separable in the spaces spanned by \(\text {LR}^{2}\), \(\text {ALR}^2\), \(\text {LR}^{3}\) and \(\text {ALR}^3\); this leads to parameters becoming too large. For these datasets, ideally, one should regularize the two parameterizations differently (tuning \(\lambda \) on some validation set) to make sure that the parameter estimates do not get too low or too high.
Finally, let us present some comparison results about the speed of convergence of \(\text {ALR}^n\) versus \(\text {LR}^{n}\) as we increase n. In Fig. 20, we compare the convergence for \(n=1\), \(n=2\) and \(n=3\) on the sample Localization dataset. It can be seen that the improvement that \(\text {ALR}^n\) provides over \(\text {LR}^{n}\) gets better as n becomes larger. Similar behaviour was observed for many datasets and, although studying rates of convergence is a complicated matter and is outside the scope of this work, we anticipate this phenomenon to be an interesting area for future research.
7.4 \(\text {ALR}^n\) versus Random Forest
The two \(\text {ALR}^n\) models are compared in terms of W–D–L of 0–1 Loss, RMSE, bias and variance with Random Forest in Table 4. It can be seen that \(\text {ALR}^n\) has slightly lower bias than RF. The variance of \(\text {ALR}^3\) is significantly higher than that of RF, whereas variance does not differ significantly between \(\text {ALR}^2\) and RF. On Little datasets, 0–1 Loss results of \(\text {ALR}^n\) and RF are similar. However, RF has significantly better RMSE results than \(\text {ALR}^n\) on these datasets. On Big datasets, \(\text {ALR}^n\) has lower 0–1 Loss and RMSE on the majority of datasets.
The averaged 0–1 Loss and RMSE results are given in Fig. 21. It can be seen that \(\text {ALR}^2\), \(\text {ALR}^3\) and RF have similar 0–1 Loss and RMSE across Little datasets. However, on Big datasets, the lower bias of \(\text {ALR}^n\) results in much lower error than RF in terms of both 0–1 Loss and RMSE. These averaged results also corroborate the W–D–L results in Table 4, showing \(\text {ALR}^n\) to be a less biased model than RF.
The comparison of training and classification time of \(\text {ALR}^n\) and RF is given in Fig. 22. It can be seen that \(\text {ALR}^n\) requires more learning time than RF but less classification time.
8 Conclusion and future work
Table 4 Win–Draw–Loss: \(\text {ALR}^2\) versus RF and \(\text {ALR}^3\) versus RF
Measure  \(\text {ALR}^2\) versus RF: W–D–L  p  \(\text {ALR}^3\) versus RF: W–D–L  p
All datasets  
Bias  39/9/28  0.221  35/9/32  0.807 
Variance  25/2/49  0.556  21/3/52  \(< \mathbf{0.001}\) 
Little datasets  
0–1 Loss  26/3/47  0.018  22/1/53  \(<\mathbf{0.001}\) 
RMSE  26/0/50  0.007  25/0/51  0.003 
Big datasets  
0–1 Loss  4/1/3  1.000  5/0/3  0.726 
RMSE  4/0/4  1.000  5/0/3  0.726 

We have shown that \(\text {ALR}^n\) is a low bias classifier that requires minimal tuning and has the ability to handle multiple classes. The obvious extension is to make it out-of-core. We argue that \(\text {ALR}^n\) is well suited for stochastic gradient descent based methods as it can converge to the global minimum very quickly.

It may be desirable to utilize a hierarchical ALR, such that \(h\text {ALR}^n = \{\text {ALR}^1 \cdots \text {ALR}^n\}\), incorporating all the parameters up to order n. This may be useful for smoothing the parameters. For example, if a certain interaction does not occur in the training data, at classification time one can resort to lower values of n.

In this work, we have constrained the values of n to two and three. Scaling up \(\text {ALR}^n\) to higher values of n is highly desirable. One can exploit the fact that many interactions at higher values of n will not occur in the data and hence develop sparse implementations of \(\text {ALR}^n\) models.

Other objective functions, such as Mean-Squared-Error or Hinge Loss, may have desirable properties; exploring them has been left as future work.

The preliminary version of ALR that we have developed is restricted to categorical data and hence requires that numeric data be discretized. While our results show that this is often highly competitive with Random Forests, which can use local cut-points (built-in discretization scheme), on some datasets it is not. In consequence, there is much scope for extensions to \(\text {ALR}^n\) to directly handle numeric data.
9 Code and datasets
Code with running instructions can be downloaded from https://github.com/nayyarzaidi/ALR.
Footnotes
 1.
Naive Bayes and LR are generally categorized as generative and discriminative counterparts of each other. The number of parameters of the two models is exactly the same. They only differ in the way the parameters are learned. For Naive Bayes, the parameters are actual probabilities and are learned by maximizing the log-likelihood of the data, and for LR, they are free parameters that are learned by optimizing the conditional log-likelihood.
 2.
The dataset is about classifying poker hands (each hand consists of five cards) into 10 different classes, i.e., one pair, two pair, three of a kind, straight, flush, full house, four of a kind, straight flush, royal flush and nothing in hand. Each card is represented by two attributes, that is, card suit and card number. Therefore, there are a total of 10 attributes describing a hand.
 3.
The exceptions are MITFaceSetA, MITFaceSetB, MITFaceSetC and Kddcup, where results are reported with 2 rounds of 2-fold cross validation because of the time constraints on the grid computers on which the results were computed.
 4.
As discussed in Sect. 1, the reason for performing bias/variance estimation is that it provides insights into how the learning algorithm will perform with varying amounts of data. We expect low variance algorithms to have relatively low error for small data and low bias algorithms to have relatively low error for large data (Brain and Webb 2002).
 5.
The original L-BFGS implementation of Byrd et al. (1995) from http://users.eecs.northwestern.edu/~nocedal/lbfgsb.html is used.
 6.
\(\text {LR}^{2}\) versus \(\text {ALR}^2\): the two datasets on which the 0–1 Loss of the two parameterizations is significantly different are Phoneme (0.1935, 0.2814) and Promoters (0.1132, 0.1717). \(\text {LR}^{3}\) versus \(\text {ALR}^3\): the two datasets on which the 0–1 Loss of the two parameterizations is significantly different are Phoneme (0.2347, 0.4743) and LungCancer (0.6125, 0.5625).
Notes
Acknowledgments
This research has been supported by the Australian Research Council (ARC) under Grant DP140100087, and by the Asian Office of Aerospace Research and Development, Air Force Office of Scientific Research under Contracts FA23861514007 and FA23861514017. The authors would like to thank Wray Buntine, Reza Haffari and Bart Goethals for helpful discussions during the evolution of this paper. The authors also acknowledge tremendously useful comments by the reviewers that helped improve the quality of the paper.
References
 Bishop, C. (2006). Pattern recognition and machine learning. Berlin: Springer.
 Boyd, S., & Vandenberghe, L. (2008). Convex optimization. Cambridge: Cambridge University Press.
 Brain, D., & Webb, G. I. (2002). The need for low bias algorithms in classification learning from large data sets. In PKDD, pp. 62–73.
 Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.
 Byrd, R., Lu, P., & Nocedal, J. (1995). A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific and Statistical Computing, 16(5), 1190–1208.
 Fayyad, U. M., & Irani, K. B. (1992). On the handling of continuous-valued attributes in decision tree generation. Machine Learning, 8(1), 87–102.
 Firth, D. (1993). Bias reduction of maximum likelihood estimates. Biometrika, 80, 27–38.
 Frank, A., & Asuncion, A. (2010). UCI machine learning repository. http://archive.ics.uci.edu/ml.
 Ganz, J., & Reinsel, D. (2012). The digital universe in 2020: Big data, bigger digital shadows, and biggest growth in the far east. Framingham: International Data Corporation. https://www.emc.com/collateral/analystreports/idcthedigitaluniversein2020.pdf.
 Genkin, A., Lewis, D., & Madigan, M. (2012). Large-scale Bayesian logistic regression for text categorization. Technometrics, 49, 291–304.
 Greiner, R., Su, X., Shen, B., & Zhou, W. (2005). Structural extensions to logistic regression: Discriminative parameter learning of belief net classifiers. Machine Learning, 59, 297–322.
 Greiner, R., & Zhou, W. (2002). Structural extension to logistic regression: Discriminative parameter learning of belief net classifiers. In Eighteenth Annual National Conference on Artificial Intelligence (AAAI), pp. 167–173.
 Hauck, W., Anderson, S., & Marcus, S. (1998). Should we adjust for covariates in nonlinear regression analysis of randomised trials? Controlled Clinical Trials, 19, 249–256.
 Hill, T., & Lewicki, P. (2013). Statistics: Methods and applications. Dell.
 Kohavi, R., & Wolpert, D. (1996). Bias plus variance decomposition for zero-one loss functions. In ICML, pp. 275–283.
 Langford, J., Li, L., & Strehl, A. (2007). Vowpal wabbit online learning project. https://github.com/JohnLangford/vowpal_wabbit/wiki
 Lauritzen, S. (1996). Graphical models. Oxford: Oxford University Press.
 Lin, X., Wahba, G., Xiang, D., Gao, F., Klein, R., & Klein, B. (1998). Smoothing spline ANOVA models for large data sets with Bernoulli observations and the randomized GACV. Technical Report 998, Department of Statistics, University of Wisconsin, Madison, WI.
 Lint, J. H., & Wilson, M. R. (1992). A course in combinatorics. Cambridge: Cambridge University Press.
 Liu, H., & Motoda, H. (1998). Feature extraction, construction and selection: A data mining perspective. Berlin: Springer.
 Martinez, A., Chen, S., Webb, G. I., & Zaidi, N. A. (2016). Scalable learning of Bayesian network classifiers. Journal of Machine Learning Research, 17, 1–35.
 Mitchell, T. M. (1980). The need for biases in learning generalizations. Technical Report CBMTR117, Rutgers University, Department of Computer Science, New Brunswick, NJ.
 Neuhaus, J., & Jewell, N. (1993). A geometric approach to assess bias due to omitted covariates in generalized linear models. Biometrika, 80, 807–815.
 Pazzani, M. J. (1996). Constructive induction of Cartesian product attributes. In Proceedings of the information, statistics and induction in science conference (ISIS96), pp. 66–77.
 Pernkopf, F., & Bilmes, J. (2005). Discriminative versus generative parameter and structure learning of Bayesian network classifiers. In International Conference on Machine Learning, pp. 657–664.
 Pernkopf, F., & Wohlmayr, M. (2009). On discriminative parameter learning of Bayesian network classifiers. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 221–237.
 Roos, T., Wettig, H., Grünwald, P., Myllymäki, P., & Tirri, H. (2005). On discriminative Bayesian network classifiers and logistic regression. Machine Learning, 59(3), 267–296.
 Sahami, M. (1996). Learning limited dependence Bayesian classifiers. In Proceedings of the second international conference on knowledge discovery and data mining, pp. 334–338. Menlo Park, CA: AAAI Press.
 Smola, A., & Schölkopf, B. (2000). Sparse greedy matrix approximation for machine learning. In International Conference on Machine Learning, pp. 911–918.
 Sonnenburg, S., & Franc, V. (2010). COFFIN: A computational framework for linear SVMs. In International Conference on Machine Learning, pp. 999–1006.
 Steinwart, I. (2004). Sparseness of support vector machines: Some asymptotically sharp bounds. In Advances in Neural Information Processing Systems 16.
 Stinson, D. R. (2003). Combinatorial designs: Constructions and analysis. Berlin: Springer.
 Su, J., Zhang, H., Ling, C., & Matwin, S. (2008). Discriminative parameter learning for Bayesian networks. In International Conference on Machine Learning, pp. 1016–1023.
 Szilard, N., Jonasson, J., Genell, A., & Steineck, G. (2009). Bias in odds ratios by logistic regression modelling and sample size. BMC Medical Research Methodology, 9(1), 1–5.
 Webb, G. I. (2000). Multiboosting: A technique for combining boosting and wagging. Machine Learning, 40(2), 159–196.
 Webb, G. I., Boughton, J., & Wang, Z. (2005). Not so naive Bayes: Averaged one-dependence estimators. Machine Learning, 58(1), 5–24.
 Webb, G. I., Boughton, J., Zheng, F., Ting, K. M., & Salem, H. (2011). Learning by extrapolation from marginal to full-multivariate probability distributions: Decreasingly naive Bayesian classification. Machine Learning. doi: 10.1007/s1099401152636.
 Williams, C., & Seeger, M. (2001). Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems 13, pp. 682–688.
 Zaidi, N. A., Carman, M. J., Cerquides, J., & Webb, G. I. (2014). Naive-Bayes inspired effective preconditioners for speeding-up logistic regression. In IEEE international conference on data mining, pp. 1097–1102.
 Zaidi, N. A., Cerquides, J., Carman, M. J., & Webb, G. I. (2013). Alleviating naive Bayes attribute independence assumption by attribute weighting. Journal of Machine Learning Research, 14, 1947–1988.
 Zaidi, N. A., & Webb, G. I. (2012). Fast and efficient single pass Bayesian learning. In Advances in knowledge discovery and data mining, pp. 149–160.
 Zhu, J., & Hastie, T. (2001). Kernel logistic regression and the import vector machine. In NIPS, pp. 1081–1088.