\(\text {ALR}^n\): accelerated higher-order logistic regression
Abstract
This paper introduces Accelerated Logistic Regression: a hybrid generative-discriminative approach to training Logistic Regression with high-order features. We present two main results: (1) that our combined generative-discriminative approach significantly improves the efficiency of Logistic Regression and (2) that incorporating higher order features (i.e. features that are the Cartesian products of the original features) reduces the bias of Logistic Regression, which in turn significantly reduces its error on large datasets. We assess the efficacy of Accelerated Logistic Regression by conducting an extensive set of experiments on 75 standard datasets. We demonstrate its competitiveness, particularly on large datasets, by comparing against state-of-the-art classifiers including Random Forest and Averaged n-Dependence Estimators.
Keywords
Higher-order Logistic Regression · Low-bias classifiers · Generative-discriminative learning

1 Introduction
It has been shown that Bayesian Network Classifiers (BNCs) that explicitly represent higher-order interactions tend to have lower bias than those that do not (Martinez et al. 2016; Webb et al. 2011). This is because BNCs that can represent higher-order interactions can exactly represent a superset of the distributions that can be represented by BNCs restricted to lower-order interactions. They thus have lower representation bias and hence, all other things being equal, lower inductive bias (Mitchell 1980) than the more restricted BNCs. Except in the specific cases where the true distribution to be modeled fits exactly into the more restricted model, given sufficient data the more expressive BNC will form the more accurate model.
It has also been shown that Logistic Regression (LR) tends to have lower bias than naive Bayes, a Bayesian Network Classifier whose model is of equivalent form to that of LR (Zaidi et al. 2014, 2013).^{1} In consequence, it seems likely that variants of LR that explicitly represent higher-order interactions should have low bias as well, and that the bias should continue to decrease as the order of the interactions represented increases. We call such variants of LR Higher-Order LR, abbreviated \(\text {LR}^{n}\), where n is the order of interactions that are modeled. Formal definitions of these concepts are provided in Sect. 5.
While the use of higher-order LR models is quite common and at least one implementation of \(\text {LR}^{2}\) is in the public domain (Langford et al. 2007), its performance relative to standard LR as data quantities vary, the behaviour of \(\text {LR}^{3}\), and the bias/variance profile of higher-order LR models all warrant further investigation. We investigate these issues herein. It is noteworthy that a significant amount of research has been done on correcting the estimation bias of Logistic Regression (Firth 1993; Szilard et al. 2009). Most of this research has been driven by the fact that LR’s parameters are obtained through Maximum Likelihood Estimation (MLE), which can be biased if the sample size is small; asymptotically, however, MLE estimates have zero estimation bias. Similarly, several studies have addressed the issue of bias due to omitted covariates in Logistic Regression models (Neuhaus and Jewell 1993; Hauck et al. 1998). Some studies have also investigated the Bayesian version of Logistic Regression (Genkin et al. 2012).
An \(\text {LR}^{n}\) model must be learned discriminatively through computationally intensive gradient-descent-based search: considering all possible higher-order features in \(\text {LR}^{n}\) and learning the corresponding parameters by optimizing the conditional log-likelihood (CLL) is an expensive task, so any speed-up to the optimization process is highly desirable. A second objective of this paper is to provide an effective mechanism for achieving this.
It has been shown that a hybrid generative-discriminative learner can exploit the strengths of both Naive Bayes (NB) and Logistic Regression (LR) classifiers by creating a weighted variant of NB in which the weights are optimized using a discriminative objective function, that is, maximization of conditional log-likelihood (Zaidi et al. 2013, 2014). The resulting model can be viewed as either using weights to alleviate the feature independence assumption of NB, or as using the maximum likelihood parameterization of NB to pre-condition the discriminative search of LR. The result is a learner that learns models that are exactly equivalent to LR, but does so much more efficiently. In this work, we show how to achieve the same result with \(\text {LR}^{n}\).
We create a hybrid generative-discriminative learner named \(\text {ALR}^n\) for categorical data that learns models of equivalent order to those of \(\text {LR}^{n}\), but does so much more efficiently than \(\text {LR}^{n}\). We further demonstrate that the resulting models have low bias, which leads to very low error on large quantities of data. However, in order to create this hybrid learner we must first create an efficient generative counterpart to \(\text {LR}^{n}\).
In summary, the contributions of this paper are:
- developing an efficient generative counterpart to \(\text {LR}^{n}\), named Averaged n-Join Estimators (AnJE);
- developing \(\text {ALR}^n\): a hybrid of \(\text {LR}^{n}\) and AnJE;
- demonstrating that \(\text {ALR}^n\) has equivalent error to \(\text {LR}^{n}\), but is substantially more efficient;
- demonstrating that \(\text {ALR}^n\) has low error on large data;
- demonstrating that the bias of \(\text {LR}^{n}\) decreases as n increases and that in consequence \(\text {LR}^{n}\) with higher n tends to achieve lower error with greater data quantities.
2 Notation
We seek to assign a value \(y\in \varOmega _Y=\{y_1, \ldots y_C\}\) of the class variable Y, to a given example \(\mathbf {x} = (x_1, \ldots , x_a)\), where the \(x_i\) are value assignments for the a features \(\mathcal {A} = \{X_1,\ldots , X_a\}\). We define \(\mathcal {A}\atopwithdelims ()n\) as the set of all subsets of \(\mathcal {A}\) of size n, where each subset in the set is denoted as \(\alpha \): \({\mathcal {A} \atopwithdelims ()n}=\{\alpha \subseteq \mathcal {A}: |\alpha |=n\}\). We use \(x_\alpha \) to denote the set of values taken by features in the subset \(\alpha \) for any data object \(\mathbf {x}\).
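To make the notation concrete, the set \({\mathcal {A} \atopwithdelims ()n}\) and the projections \(x_\alpha \) can be enumerated directly; the following is a hypothetical four-feature example (the feature names and values are invented for illustration):

```python
from itertools import combinations

# Hypothetical example: a = 4 features X1..X4 and one data object x.
features = ["X1", "X2", "X3", "X4"]
x = {"X1": 0, "X2": 1, "X3": 1, "X4": 0}  # the value assignments x_1..x_a

n = 2
# (A choose n): the set of all feature subsets alpha of size n.
subsets = list(combinations(features, n))
assert len(subsets) == 6  # binomial(4, 2) = 6

# x_alpha: the values taken by the features in alpha for this data object.
x_alpha = {alpha: tuple(x[f] for f in alpha) for alpha in subsets}
print(x_alpha[("X1", "X2")])  # -> (0, 1)
```

Each subset \(\alpha \) thus indexes one higher-order feature, and \(x_\alpha \) is its value for the given example.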
LR for categorical data learns a weight for every feature value per class. For LR, we therefore denote by \(\beta _y\) the weight associated with class y, and by \(\beta _{y,i,x_i}\) the weight associated with feature i taking value \(x_i\) with class label y. For \(\text {LR}^{n}\), \(\beta _{y,\alpha ,x_{\alpha }}\) specifies the weight associated with class y and feature subset \(\alpha \) taking value \(x_\alpha \). The equivalent weights for \(\text {ALR}^n\) are denoted by \(w_y\), \(w_{y,i,x_i}\) and \(w_{y,\alpha ,x_\alpha }\). The probability of feature i taking value \(x_i\) given class y is denoted by \(\mathrm {P}(x_i\arrowvert y)\); similarly, the probability of feature subset \(\alpha \) taking value \(x_\alpha \) is denoted by \(\mathrm {P}(x_\alpha \arrowvert y)\). Note that all probabilities are estimated probabilities; for clarity, we do not use the \(\hat{\mathrm{P}}(.)\) notation that is typically used for estimates.
3 Higher-order logistic regression
Because its model is linear, LR is very restricted in the posterior distributions that it can precisely model. For example, it cannot model exclusive-or (XOR) between two features.
Adding higher-order features to LR increases the range of distributions that it can precisely model. Here, we define higher-order categorical features as features that are the Cartesian product of the primitive features, where the order n is the number of primitive features in the Cartesian product.
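To illustrate, the following self-contained sketch (not the paper's code; the four-point dataset and training hyper-parameters are made up) shows that plain LR cannot fit XOR on two binary features, while adding the single order-2 Cartesian-product feature \(x_1 x_2\) makes the problem linearly separable:

```python
import math

# XOR over two binary features: y = x1 XOR x2 (a made-up four-point dataset).
xs = [(0, 0), (0, 1), (1, 0), (1, 1)]
ys = [0, 1, 1, 0]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(features, labels, epochs=2000, lr=0.5):
    """Plain logistic regression fitted by (deterministic) SGD."""
    w, b = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            g = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) - y
            b -= lr * g
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
    return w, b

def accuracy(features, labels, w, b):
    preds = [sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) >= 0.5
             for x in features]
    return sum(p == bool(y) for p, y in zip(preds, labels)) / len(labels)

# Order-1 features only: no linear boundary separates XOR, so at most 3/4 correct.
acc1 = accuracy(xs, ys, *train(xs, ys))

# Adding the order-2 (Cartesian-product) feature x1*x2 makes XOR separable.
xs2 = [x + (x[0] * x[1],) for x in xs]
acc2 = accuracy(xs2, ys, *train(xs2, ys))
```

With the interaction feature the model can express \(x_1 + x_2 - 2x_1x_2\), which is exactly XOR, so the augmented model reaches perfect accuracy while the order-1 model cannot.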
As mentioned in Sect. 1, it has been shown that Bayesian Network Classifiers that explicitly represent higher-order features tend to have lower bias than those that do not, and that the bias decreases as the order of the features increases (Martinez et al. 2016; Webb et al. 2011). Therefore, it seems likely that LR applied to higher-order features will likewise tend to have lower bias, with bias decreasing as the order increases. This is very significant, as LR is a powerful learning system and there is good reason to believe that the lower the bias of a learning system the lower its error will tend to be on very large datasets (Brain and Webb 2002).
3.1 Kernel LR and \(\text {LR}^{n}\)
3.2 Experimental evaluation of \(\text {LR}^{n}\)
While \(\text {LR}^{n}\) is part of established data analytics practice, we are not aware of any research into its bias/variance profile or its performance relative to standard LR with respect to varying quantities of data. We here investigate those issues. Though we provide a detailed empirical analysis in Sect. 7, here we present some results to illustrate the power of modeling higher-order interactions.
Figure 2 shows learning curves for \(\text {LR}^{n}\) with \(n=1, 2\) and 3. We generated these curves using a prequential testing paradigm on the Localization dataset. For each run, we first randomized the dataset and then processed it sequentially. Each example was first classified and the probabilistic loss \(\frac{1}{C}\sum _c^C (\delta _{y=c} - \mathrm{P}(c|\mathbf{x}))^2\) was calculated, where \(\delta _{y=c}\) is an indicator function that is 1 if the actual class label y is the same as c and zero otherwise. The example was then used to update the model. This process was repeated five times with different randomizations of the dataset. Each run generated N loss values, where \(N=164{,}860\) is the size of the Localization dataset. To generate learning curves, for each point i on the X-axis we plot \(\frac{1}{T} \sum _{k=\max (i-T,1)}^{i} \text {loss}(x_k)\), where T is set to 1000; for \(i \le T\), we instead plot \(\frac{1}{i} \sum _{k=1}^{i} \text {loss}(x_k)\). It can be seen that for very small data quantities the lower variance of \(\text {LR}^{2}\) results in lower error than \(\text {LR}^{3}\), but as data quantity increases the lower bias of \(\text {LR}^{3}\) results in the lowest error. It can also be seen that LR obtains better performance than \(\text {LR}^{2}\) and \(\text {LR}^{3}\) when learned from very small quantities of data (the learning curves are zoomed in between 0 and 1000 instances in Fig. 2 to illustrate this point). The poor performance of \(\text {LR}^{2}\) and \(\text {LR}^{3}\) (models that include higher-order interactions) on smaller training sets is due to over-fitting: the more powerful models can fit chance regularities in the data. Hence, for smaller quantities of data, some form of regularization that pulls the weights of many higher-order interactions back towards zero would lead to much better performance.
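The prequential (test-then-train) protocol above can be sketched in a few lines. This is an illustrative reimplementation, not the code used in the experiments; the `Uniform` stand-in model is hypothetical:

```python
from collections import deque

def prequential_curve(stream, model, window=1000):
    """Prequential evaluation: each example is first scored with the
    probabilistic loss (1/C) * sum_c (delta_{y=c} - P(c|x))^2 and only
    then used to update the model.  The curve point at step i is the mean
    loss over the trailing `window` examples (over all i while i <= window)."""
    recent = deque(maxlen=window)
    curve = []
    for x, y in stream:
        probs = model.predict_proba(x)  # P(c | x) for each class c
        loss = sum(((1.0 if c == y else 0.0) - p) ** 2
                   for c, p in enumerate(probs)) / len(probs)
        recent.append(loss)
        curve.append(sum(recent) / len(recent))
        model.update(x, y)  # only now does the example train the model
    return curve

# Tiny stand-in model (hypothetical): always predicts a uniform distribution.
class Uniform:
    def __init__(self, n_classes): self.C = n_classes
    def predict_proba(self, x): return [1.0 / self.C] * self.C
    def update(self, x, y): pass

curve = prequential_curve([((0,), 0), ((1,), 1)] * 5, Uniform(2), window=3)
# Each example's loss is ((1 - 0.5)**2 + (0 - 0.5)**2) / 2 = 0.25, so the
# smoothed curve is flat at 0.25.
```

The `deque(maxlen=window)` handles both regimes of the curve formula: while fewer than `window` examples have been seen it holds all of them, and afterwards it holds exactly the trailing `window` losses.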
In line with our expectation that low inductive bias will often lead to low statistical bias, which will in turn translate to lower error on big datasets, it can be seen in Fig. 4 that higher-order LR results in much lower 0–1 Loss than standard LR and that this benefit tends to continue as n increases. Note that for one dataset, poker-hand^{2}, \(\text {LR}^{2}\) achieves much lower error than \(\text {LR}^{3}\); we conjecture that this is because of strong two-level correlations that exist in the data. On this synthetic (and deterministic) dataset, \(\text {LR}^{3}\) needs much more data to estimate its parameters effectively. The current results utilize only half of the training data. It can be seen that this much data is more than enough for \(\text {LR}^{2}\) but not for \(\text {LR}^{3}\), and hence \(\text {LR}^{3}\) performs worse than \(\text {LR}^{2}\).
4 Using generative models to precondition discriminative learning
It has been shown that a direct equivalence between a weighted NB and LR can be exploited to greatly speed up LR’s learning rate (Zaidi et al. 2013, 2014).
Table 1 Details of datasets
Domain | Case | Att | Class | Domain | Case | Att | Class |
---|---|---|---|---|---|---|---|
Kddcup | 5,209,000 | 41 | 40 | Vowel | 990 | 14 | 11 |
Poker-hand | 1,175,067 | 10 | 10 | Tic-Tac-ToeEndgame | 958 | 10 | 2 |
MITFaceSetC | 839,000 | 361 | 2 | Annealing | 898 | 39 | 6 |
Covertype | 581,012 | 55 | 7 | Vehicle | 846 | 19 | 4 |
MITFaceSetB | 489,400 | 361 | 2 | PimaIndiansDiabetes | 768 | 9 | 2 |
MITFaceSetA | 474,000 | 361 | 2 | BreastCancer(Wisconsin) | 699 | 10 | 2 |
Census-Income(KDD) | 299,285 | 40 | 2 | CreditScreening | 690 | 16 | 2 |
Localization | 164,860 | 7 | 3 | BalanceScale | 625 | 5 | 3 |
Connect-4Opening | 67,557 | 43 | 3 | Syncon | 600 | 61 | 6 |
Statlog(Shuttle) | 58,000 | 10 | 7 | Chess | 551 | 40 | 2 |
Adult | 48,842 | 15 | 2 | Cylinder | 540 | 40 | 2 |
LetterRecognition | 20,000 | 17 | 26 | Musk1 | 476 | 167 | 2 |
MAGICGammaTelescope | 19,020 | 11 | 2 | HouseVotes84 | 435 | 17 | 2 |
Nursery | 12,960 | 9 | 5 | HorseColic | 368 | 22 | 2 |
Sign | 12,546 | 9 | 3 | Dermatology | 366 | 35 | 6 |
PenDigits | 10,992 | 17 | 10 | Ionosphere | 351 | 35 | 2 |
Thyroid | 9169 | 30 | 20 | LiverDisorders(Bupa) | 345 | 7 | 2 |
Pioneer | 9150 | 37 | 57 | PrimaryTumor | 339 | 18 | 22 |
Mushrooms | 8124 | 23 | 2 | Haberman’sSurvival | 306 | 4 | 2 |
Musk2 | 6598 | 167 | 2 | HeartDisease(Cleveland) | 303 | 14 | 2 |
Satellite | 6435 | 37 | 6 | Hungarian | 294 | 14 | 2 |
OpticalDigits | 5620 | 49 | 10 | Audiology | 226 | 70 | 24 |
PageBlocksClassification | 5473 | 11 | 5 | New-Thyroid | 215 | 6 | 3 |
Wall-following | 5456 | 25 | 4 | GlassIdentification | 214 | 10 | 3 |
Nettalk(Phoneme) | 5438 | 8 | 52 | SonarClassification | 208 | 61 | 2 |
Waveform-5000 | 5000 | 41 | 3 | AutoImports | 205 | 26 | 7 |
Spambase | 4601 | 58 | 2 | WineRecognition | 178 | 14 | 3 |
Abalone | 4177 | 9 | 3 | Hepatitis | 155 | 20 | 2 |
Hypothyroid(Garavan) | 3772 | 30 | 4 | TeachingAssistantEvaluation | 151 | 6 | 3 |
Sick-euthyroid | 3772 | 30 | 2 | IrisClassification | 150 | 5 | 3 |
King-rook-vs-king-pawn | 3196 | 37 | 2 | Lymphography | 148 | 19 | 4 |
Splice-junctionGeneSequences | 3190 | 62 | 3 | Echocardiogram | 131 | 7 | 2 |
Segment | 2310 | 20 | 7 | PromoterGeneSequences | 106 | 58 | 2 |
CarEvaluation | 1728 | 8 | 4 | Zoo | 101 | 17 | 7 |
Volcanoes | 1520 | 4 | 4 | PostoperativePatient | 90 | 9 | 3 |
Yeast | 1484 | 9 | 10 | LaborNegotiations | 57 | 17 | 2 |
ContraceptiveMethodChoice | 1473 | 10 | 3 | LungCancer | 32 | 57 | 3 |
German | 1000 | 21 | 2 | Contact-lenses | 24 | 5 | 3 |
LED | 1000 | 8 | 10 |
5 Accelerated logistic regression (ALR)
5.1 Averaged n-join estimators (AnJE)
5.2 \(\text {ALR}^n\)
\(\text {ALR}^n\) computes weights by optimizing CLL. Therefore, one can compute the gradient of Eq. 14 with respect to the weights and rely on gradient-descent-based methods to find their optimal values. Since we do not want to become stuck in local minima, a natural question to ask is whether the resulting objective function is convex (Boyd and Vandenberghe 2008). It turns out that the objective function of \(\text {ALR}^n\) is indeed convex. Roos et al. (2005) proved that an objective function of the form \(\sum _{\mathbf{x}\in \mathcal {D}} \log \mathrm{P}_{\mathcal {B}}(y|\mathbf{x})\), optimized by any conditional Bayesian network model, is convex if and only if the structure \({\mathcal {G}}\) of the Bayesian network \({\mathcal {B}}\) is perfect. A directed graph in which all nodes having a common child are connected is called perfect (Lauritzen 1996). \(\text {ALR}^n\) is a geometric mean of several sub-models, where each sub-model models \(\lfloor \frac{a}{n} \rfloor \) interactions, each conditioned on the class feature. Each sub-model has a structure that is perfect. Since taking the logarithm turns the geometric mean into a scaled sum of the sub-models' objectives, and a sum of convex functions is convex, \(\text {ALR}^n\)'s optimization likewise has a convex objective function.
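The final step of this argument can be written out explicitly. The following is a sketch in our notation, with \(S\) sub-models \(\mathrm{P}_s\) (the symbol \(S\) is introduced here for illustration):

```latex
% Log of the geometric mean of S sub-models: the product becomes a sum.
\sum_{\mathbf{x}\in\mathcal{D}} \log \Bigl( \prod_{s=1}^{S} \mathrm{P}_s(y\mid\mathbf{x}) \Bigr)^{\!1/S}
  \;=\; \frac{1}{S} \sum_{s=1}^{S}\;
  \underbrace{\sum_{\mathbf{x}\in\mathcal{D}} \log \mathrm{P}_s(y\mid\mathbf{x})}_{\text{convex, as each sub-model's structure is perfect}}
```

Each inner term is convex by the result of Roos et al. (2005), and a nonnegative weighted sum of convex functions is convex.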
5.3 Alternative parameterization
5.4 Comparative analysis of \(\text {ALR}^n\) and \(\text {LR}^{n}\)
Step 1 is the optimization of the log-likelihood of the data (\(\log \mathrm{P}(y,\mathbf{x})\)) to obtain estimates of the prior and likelihood probabilities. One can view this step as generative learning.
Step 2 is the introduction of weights on these probabilities and the learning of these weights by maximizing the CLL (\(\mathrm {P}(y\arrowvert \mathbf {x})\)) objective function. This step can be interpreted as discriminative learning.
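The two steps can be sketched for the simplest (naive Bayes, \(n=1\)) case. This is an illustrative reconstruction under stated assumptions, not the paper's implementation; the tiny dataset, binary feature domains and learning rate are made up:

```python
import math
from collections import Counter, defaultdict

# Hypothetical tiny categorical dataset: (x1, x2) -> y.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 1), ((1, 1), 1), ((1, 1), 1)]
classes = sorted({y for _, y in data})

# --- Step 1: generative learning -- MLE of priors and likelihoods by
# counting (with add-1 smoothing; binary feature domains assumed).
class_count = Counter(y for _, y in data)
feat_count = defaultdict(Counter)  # (i, y) -> counts of values of feature i
for x, y in data:
    for i, v in enumerate(x):
        feat_count[(i, y)][v] += 1

def log_p_y(y):
    return math.log((class_count[y] + 1) / (len(data) + len(classes)))

def log_p_xi_y(i, v, y):
    c = feat_count[(i, y)]
    return math.log((c[v] + 1) / (sum(c.values()) + 2))

# --- Step 2: discriminative learning -- weights on the log-probabilities.
# All weights start at 1, i.e. exactly naive Bayes; maximizing CLL moves
# them towards the LR-equivalent solution, preconditioned by step 1.
w_y = {y: 1.0 for y in classes}
w_f = defaultdict(lambda: 1.0)  # (y, i, v) -> weight

def posterior(x):
    logits = {y: w_y[y] * log_p_y(y)
                 + sum(w_f[(y, i, v)] * log_p_xi_y(i, v, y)
                       for i, v in enumerate(x))
              for y in classes}
    m = max(logits.values())
    exps = {y: math.exp(l - m) for y, l in logits.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

# One epoch of gradient ascent on the CLL with respect to the weights.
lr = 0.1
for x, y in data:
    post = posterior(x)
    for c in classes:
        g = (1.0 if c == y else 0.0) - post[c]  # dCLL/dlogit_c, this example
        w_y[c] += lr * g * log_p_y(c)
        for i, v in enumerate(x):
            w_f[(c, i, v)] += lr * g * log_p_xi_y(i, v, c)
```

With all weights fixed at 1 the model is exactly naive Bayes; the discriminative step then adjusts the weights while the log-probabilities from step 1 stay fixed, which is what preconditions the search.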
One can expect a similar bias-variance profile and very similar classification performance, as both models will converge to a similar point in the optimization space, the only difference in the final parameterization being due to gradient descent being terminated before absolute convergence. However, the rate of convergence of the two models can be very different. Zaidi et al. (2014) show that for NB, such \(\text {ALR}^n\)-style parameterization with generative-discriminative learning can greatly speed up convergence relative to discriminative-only training. Note that discriminative training with NB as the graphical model is vanilla LR. We expect to see the same trend in the convergence performance of \(\text {ALR}^n\) and \(\text {LR}^{n}\).
Another distinction between the two models becomes explicit if a regularization penalty is added to the objective function. One can see that in the case of \(\text {ALR}^n\), regularizing weights towards 1 effectively pulls the parameters back towards the generative training estimates. For smaller datasets, one can expect to obtain better performance by using a large regularization parameter and pulling estimates back towards 1. However, one cannot do this for \(\text {LR}^{n}\). Therefore, \(\text {ALR}^n\) models can very elegantly combine generative and discriminative parameters.
6 Related work
There are \({a \atopwithdelims ()n}\) possible combinations of features that can be used as parents, producing \({a\atopwithdelims ()n}\) sub-models which are combined by averaging.
AnDE and AnJE both use simple generative learning, merely counting the relevant sufficient statistics from the data. Both have only one tuning parameter, n, which controls the bias-variance trade-off: higher values of n lead to lower bias and higher variance, and vice-versa.
\(\text {ALR}^n\) has a number of similarities with ELR (Greiner and Zhou 2002; Greiner et al. 2005), in which the parameters associated with a Bayesian network classifier (naive Bayes or TAN) are learned by optimizing the CLL. ELR performs discriminative learning of the weights for a model with a Bayesian network structure. As explained in Sect. 5, it is not possible to create a single Bayesian network with the structure of the ALR model. Further, ELR does not utilize the generative parameters to precondition the search for discriminative parameters as ALR does. Some ideas related to ELR are also explored in Pernkopf and Bilmes (2005), Pernkopf and Wohlmayr (2009) and Su et al. (2008).
Feature construction has been studied extensively (Liu and Motoda 1998). The goal is to improve the classifier’s accuracy by creating new attributes from existing attributes. The new attributes can be binary, arithmetic, or other combinations of existing attributes. One approach that is closely related to the current work is the formation of Cartesian products of categorical features through hill-climbing search (Pazzani 1996). Our work differs in using all Cartesian products of a given order and using discriminative learning of weights to determine each combination's relative (weighted) contribution to the model.
7 Experiments
In this section, we compare and analyze the performance of our proposed algorithms and related methods on 76 natural domains from the UCI repository of machine learning datasets (Frank and Asuncion 2010).
The experiments are conducted on the datasets described in Table 1. 40 datasets have fewer than 1000 instances, 20 datasets have between 1000 and 10000 instances and 16 datasets have more than 10000 instances. There are 8 datasets with over 100,000 instances. These datasets are shown in bold font in Table 1.
Each algorithm is tested on each dataset using 5 rounds of 2-fold cross validation^{3}.
The datasets in Table 1 are divided into two categories. We call the following datasets Big: KDDCup, Poker-hand, USCensus1990, Covertype, MITFaceSetB, MITFaceSetA, Census-income and Localization. All remaining datasets are denoted as Little in the results.
Due to their size, experiments for most of the Big datasets had to be performed in a heterogeneous environment (grid computing) for which CPU wall-clock times are not commensurable. In consequence, when comparing classification and training time, the following 12 datasets constitute the Big category: Localization, Census-income, Poker-hand, Covtype, Connect-4, Shuttle, Adult, Letter-recog, Magic, Nursery, Sign, Pendigits. When comparing average results across Little and Big datasets, we normalize the results with respect to \(\text {ALR}^2\) and present the geometric mean.
Numeric features are discretized using the Minimum Description Length (MDL) supervised discretization method (Fayyad and Irani 1992). Training data is discretized at training time, and the cut-points learned during discretization are used to discretize the test data. However, for kddcup, MITFaceSetA, MITFaceSetB and MITFaceSetC, for computational efficiency, the entire dataset is discretized before training starts. That is, the cut-points are learned over both the training and test data. The bias introduced by including test data in the discretization process is not an issue here because it is uniform across all compared classifiers (i.e., AnDE and Random Forest).
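Applying previously learned cut-points to test data amounts to a simple interval lookup. The following is a minimal sketch (the cut-point values are invented for illustration; this is not the Fayyad-Irani learner itself):

```python
import bisect

# Hypothetical cut-points learned (e.g. by MDL discretization) on training data.
cut_points = [2.5, 5.0, 9.1]

def discretize(value, cuts):
    """Map a numeric value to the index of its interval:
    (-inf, 2.5] -> 0, (2.5, 5.0] -> 1, (5.0, 9.1] -> 2, (9.1, inf) -> 3."""
    return bisect.bisect_left(cuts, value)
```

Because the same `cut_points` list is reused at test time, the mapping from numeric values to categorical bins is identical for training and test data.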
A missing value is treated as a separate feature value and taken into account exactly like other values.
We employed the L-BFGS quasi-Newton method for solving the optimization^{5}. Note that we have used L-BFGS to demonstrate the efficacy of \(\text {ALR}^n\); the results generalize well to other optimization routines including Gradient Descent, Conjugate Gradient and Stochastic Gradient Descent (SGD). In “Appendix 1”, we also present results with Conjugate Gradient optimization.
Random Forest (RF) (Breiman 2001) is considered to be a state-of-the-art classification scheme. It consists of multiple decision trees; each tree is trained on data selected at random, but with replacement, from the original data (bagging). For example, if there are N data points, N data points are selected at random with replacement. If there are a attributes, a number \(m<a\) is specified, and at each node of the decision tree m attributes are randomly selected out of the a and evaluated, the best being used to split the node. We used \(m = \log _2(a) + 1\), where a is the total number of features. Each tree is grown to its largest possible size and no pruning is done. An instance is classified by passing it through each decision tree and selecting the mode of the trees' outputs. We used 100 decision trees in this work.
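The bagging and feature-subsampling steps described above can be sketched as follows (an illustrative fragment with made-up sizes, not a full Random Forest):

```python
import math
import random

random.seed(0)

# Made-up sizes: N data points over a attributes.
N, a = 8, 16
data = [[random.random() for _ in range(a)] for _ in range(N)]

# Bagging: each tree is trained on N points drawn at random WITH replacement.
bootstrap = [random.choice(data) for _ in range(N)]

# At each node, m attributes are sampled out of a and evaluated; the best of
# them is used to split the node.  We use m = log2(a) + 1, as in the text.
m = int(math.log2(a)) + 1  # 5 when a = 16
candidates = random.sample(range(a), m)  # distinct attribute indices
```

A bootstrap sample of size N leaves roughly a third of the original points out of each tree's training set, which is what makes the trees diverse enough for the ensemble vote to help.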
The internal discretization mechanism of Random Forest is used for all but the kddcup, MITFaceSetA, MITFaceSetB and MITFaceSetC datasets, where the entire dataset is first discretized, as described above.
7.1 \(\text {ALR}^n\) versus AnJE
Table 2 Win–Draw–Loss: \(\text {ALR}^2\) versus A2JE and \(\text {ALR}^3\) versus A3JE

| | \(\text {ALR}^2\) versus A2JE | | \(\text {ALR}^3\) versus A3JE | |
|---|---|---|---|---|
| | W–D–L | p | W–D–L | p |
| All datasets | | | | |
| Bias | 62/3/11 | \(<{} \mathbf{0.001}\) | 55/9/12 | \(<{} \mathbf{0.001}\) |
| Variance | 21/4/51 | \(<{} \mathbf{0.001}\) | 25/2/49 | 0.007 |
| Little datasets | | | | |
| 0–1 Loss | 47/4/25 | 0.012 | 39/2/35 | 0.727 |
| RMSE | 39/0/37 | 0.908 | 32/0/44 | 0.206 |
| Big datasets | | | | |
| 0–1 Loss | 8/0/0 | 0.007 | 7/0/1 | 0.070 |
| RMSE | 8/0/0 | 0.007 | 7/0/1 | 0.070 |
7.2 \(\text {ALR}^n\) versus AnDE
Table 3 Win–Draw–Loss: \(\text {ALR}^2\) versus A1DE and \(\text {ALR}^3\) versus A2DE

| | \(\text {ALR}^2\) versus A1DE | | \(\text {ALR}^3\) versus A2DE | |
|---|---|---|---|---|
| | W–D–L | p | W–D–L | p |
| All datasets | | | | |
| Bias | 60/5/11 | \(<{} \mathbf{0.001}\) | 47/11/18 | \(<{} \mathbf{0.001}\) |
| Variance | 22/9/45 | 0.006 | 26/4/46 | 0.024 |
| Little datasets | | | | |
| 0–1 Loss | 43/3/30 | 0.159 | 33/4/39 | 0.556 |
| RMSE | 30/0/46 | 0.084 | 24/0/52 | 0.035 |
| Big datasets | | | | |
| 0–1 Loss | 8/0/0 | 0.007 | 7/0/1 | 0.070 |
| RMSE | 8/0/0 | 0.073 | 7/0/1 | 0.070 |
7.3 \(\text {ALR}^n\) versus \(\text {LR}^{n}\)
We compare the two parameterizations in terms of the scatter of their 0–1 Loss and RMSE values on Little datasets in Figs. 10 and 12 respectively, and on Big datasets in Figs. 11 and 13 respectively. It can be seen that the two parameterizations (with the exception of a few datasets^{6}) have a similar spread of 0–1 Loss and RMSE values for both \(n=2\) and \(n=3\). We attribute the differences in 0–1 Loss between the two parameterizations to numerical instability of the solver. The L-BFGS library we use is written in Java and internally calls C++ routines, which eventually call a Fortran library. There are non-significant differences between \(\text {LR}^{n}\) and \(\text {ALR}^n\) only on the Phoneme, Lung-cancer and Promoters datasets. The models trained on these datasets are all extremely sparse. Lung-cancer, for example, has only 32 instances defined over 57 attributes and 3 classes. \(\text {LR}^{2}\) and \(\text {ALR}^2\) in this case optimize 75,246 parameters, and \(\text {LR}^{3}\) and \(\text {ALR}^3\) optimize 5,465,451 parameters. We conjecture that the difference in 0–1 Loss is due to overflow of the estimated parameters. It appears that on these datasets the data is linearly separable in the spaces spanned by \(\text {LR}^{2}\), \(\text {ALR}^2\), \(\text {LR}^{3}\) and \(\text {ALR}^3\), which leads to parameters becoming very large. For these datasets, ideally, one should regularize the two parameterizations differently (tuning \(\lambda \) on a validation set) to ensure that the parameter estimates do not grow too large.
Finally, we present some comparisons of the speed of convergence of \(\text {ALR}^n\) versus \(\text {LR}^{n}\) as n increases. In Fig. 20, we compare convergence for \(n=1\), \(n=2\) and \(n=3\) on the Localization dataset. It can be seen that the improvement that \(\text {ALR}^n\) provides over \(\text {LR}^{n}\) grows as n becomes larger. Similar behaviour was observed on many datasets and, although studying rates of convergence is a complicated matter and outside the scope of this work, we anticipate this phenomenon to be an interesting area for future research.
7.4 \(\text {ALR}^n\) versus Random Forest
The two \(\text {ALR}^n\) models are compared in terms of W–D–L of 0–1 Loss, RMSE, bias and variance with Random Forest in Table 4. It can be seen that \(\text {ALR}^n\) has slightly lower bias than RF. The variance of \(\text {ALR}^3\) is significantly higher than RF's, whereas variance does not differ significantly between \(\text {ALR}^2\) and RF. On Little datasets, the 0–1 Loss results of \(\text {ALR}^n\) and RF are similar; however, RF has significantly better RMSE results than \(\text {ALR}^n\) on these datasets. On Big datasets, \(\text {ALR}^n\) has lower 0–1 Loss and RMSE on the majority of datasets.
The averaged 0–1 Loss and RMSE results are given in Fig. 21. It can be seen that \(\text {ALR}^2\), \(\text {ALR}^3\) and RF have similar 0–1 Loss and RMSE across Little datasets. However, on Big datasets, the lower bias of \(\text {ALR}^n\) results in much lower error than RF in terms of both 0–1 Loss and RMSE. These averaged results also corroborate the W–D–L results in Table 4, showing \(\text {ALR}^n\) to be a less biased model than RF.
The comparison of training and classification time of \(\text {ALR}^n\) and RF is given in Fig. 22. It can be seen that \(\text {ALR}^n\) requires more learning time than RF but less classification time.
8 Conclusion and future work
Table 4 Win–Draw–Loss: \(\text {ALR}^2\) versus RF and \(\text {ALR}^3\) versus RF

| | \(\text {ALR}^2\) versus RF | | \(\text {ALR}^3\) versus RF | |
|---|---|---|---|---|
| | W–D–L | p | W–D–L | p |
| All datasets | | | | |
| Bias | 39/9/28 | 0.221 | 35/9/32 | 0.807 |
| Variance | 25/2/49 | 0.556 | 21/3/52 | \(< \mathbf{0.001}\) |
| Little datasets | | | | |
| 0–1 Loss | 26/3/47 | 0.018 | 22/1/53 | \(<\mathbf{0.001}\) |
| RMSE | 26/0/50 | 0.007 | 25/0/51 | 0.003 |
| Big datasets | | | | |
| 0–1 Loss | 4/1/3 | 1.000 | 5/0/3 | 0.726 |
| RMSE | 4/0/4 | 1.273 | 5/0/3 | 0.726 |
We have shown that \(\text {ALR}^n\) is a low-bias classifier that requires minimal tuning and can handle multiple classes. The obvious extension is to make it out-of-core. We argue that \(\text {ALR}^n\) is well suited to stochastic gradient descent based methods, as it can converge to the global minimum very quickly.
It may be desirable to utilize a hierarchical ALR, such that \(h\text {ALR}^n = \{\text {ALR}^1 \cdots \text {ALR}^n\}\), incorporating all the parameters up to order n. This may be useful for smoothing the parameters: for example, if a certain interaction does not occur in the training data, at classification time one can fall back to lower values of n.
In this work, we have constrained n to two and three. Scaling \(\text {ALR}^n\) up to higher values of n is highly desirable. One can exploit the fact that many interactions at higher values of n will not occur in the data, and hence develop sparse implementations of \(\text {ALR}^n\).
Exploring other objective functions such as Mean-Squared-Error or Hinge Loss may yield desirable properties and is left for future work.
The preliminary version of ALR that we have developed is restricted to categorical data and hence requires that numeric data be discretized. While our results show that this is often highly competitive with Random Forest, which can use local cut-points (a built-in discretization scheme), on some datasets it is not. In consequence, there is much scope for extending \(\text {ALR}^n\) to directly handle numeric data.
9 Code and datasets
Code with running instructions can be downloaded from https://github.com/nayyarzaidi/ALR.
Footnotes
- 1. Naive Bayes and LR are generally categorized as generative and discriminative counterparts of each other. The numbers of parameters of the two models are exactly the same; they differ only in the way the parameters are learned. For Naive Bayes, the parameters are actual probabilities and are learned by maximizing the log-likelihood of the data; for LR, they are free parameters that are learned by optimizing the conditional log-likelihood.
- 2. The dataset concerns classifying poker hands (each hand comprises five cards) into 10 classes: one pair, two pair, three of a kind, straight, flush, full house, four of a kind, straight flush, royal flush and nothing in hand. Each card is represented by two attributes, card suit and card number, so there is a total of 10 attributes describing a hand.
- 3. The exceptions are MITFaceSetA, MITFaceSetB, MITFaceSetC and Kddcup, where results are reported with 2 rounds of 2-fold cross validation because of time constraints on the grid computers on which the results were computed.
- 4. As discussed in Sect. 1, the reason for performing bias/variance estimation is that it provides insights into how the learning algorithm will perform with varying amounts of data. We expect low-variance algorithms to have relatively low error for small data and low-bias algorithms to have relatively low error for large data (Brain and Webb 2002).
- 5. The original L-BFGS implementation of Byrd et al. (1995) from http://users.eecs.northwestern.edu/~nocedal/lbfgsb.html is used.
- 6. \(\text {LR}^{2}\) versus \(\text {ALR}^2\): the two datasets on which the 0–1 Loss of the two parameterizations is significantly different are Phoneme (0.1935, 0.2814) and Promoters (0.1132, 0.1717). \(\text {LR}^{3}\) versus \(\text {ALR}^3\): the two datasets on which the 0–1 Loss of the two parameterizations is significantly different are Phoneme (0.2347, 0.4743) and Lung-Cancer (0.6125, 0.5625).
Acknowledgments
This research has been supported by the Australian Research Council (ARC) under Grant DP140100087, and by the Asian Office of Aerospace Research and Development, Air Force Office of Scientific Research under Contracts FA2386-15-1-4007 and FA2386-15-1-4017. The authors would like to thank Wray Buntine, Reza Haffari and Bart Goethals for helpful discussions during the evolution of this paper. The authors also acknowledge tremendously useful comments by the reviewers that helped improve the quality of the paper.
References
- Bishop, C. (2006). Pattern recognition and machine learning. Berlin: Springer.
- Boyd, S., & Vandenberghe, L. (2008). Convex optimization. Cambridge: Cambridge University Press.
- Brain, D., & Webb, G. I. (2002). The need for low bias algorithms in classification learning from small data sets. In PKDD, pp. 62–73.
- Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.
- Byrd, R., Lu, P., & Nocedal, J. (1995). A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific and Statistical Computing, 16(5), 1190–1208.
- Fayyad, U. M., & Irani, K. B. (1992). On the handling of continuous-valued attributes in decision tree generation. Machine Learning, 8(1), 87–102.
- Firth, D. (1993). Bias reduction of maximum likelihood estimates. Biometrika, 80, 27–38.
- Frank, A., & Asuncion, A. (2010). UCI machine learning repository. http://archive.ics.uci.edu/ml.
- Ganz, J., & Reinsel, D. (2012). The digital universe in 2020: Big data, bigger digital shadows, and biggest growth in the far east. Framingham: International Data Corporation. https://www.emc.com/collateral/analyst-reports/idc-the-digital-universe-in-2020.pdf.
- Genkin, A., Lewis, D., & Madigan, M. (2012). Large-scale Bayesian logistic regression for text categorization. Technometrics, 49, 291–304.
- Greiner, R., Su, X., Shen, B., & Zhou, W. (2005). Structural extensions to logistic regression: Discriminative parameter learning of belief net classifiers. Machine Learning, 59, 297–322.
- Greiner, R., & Zhou, W. (2002). Structural extension to logistic regression: Discriminative parameter learning of belief net classifiers. In Eighteenth Annual National Conference on Artificial Intelligence (AAAI), pp. 167–173.
- Hauck, W., Anderson, S., & Marcus, S. (1998). Should we adjust for covariates in nonlinear regression analysis of randomised trials? Controlled Clinical Trials, 19, 249–256.
- Hill, T., & Lewicki, P. (2013). Statistics: Methods and applications. Dell.
- Kohavi, R., & Wolpert, D. (1996). Bias plus variance decomposition for zero-one loss functions. In ICML, pp. 275–283.
- Langford, J., Li, L., & Strehl, A. (2007). Vowpal wabbit online learning project. https://github.com/JohnLangford/vowpal_wabbit/wiki.
- Lauritzen, S. (1996). Graphical models. Oxford: Oxford University Press.
- Lin, X., Wahba, G., Xiang, D., Gao, F., Klein, R., & Klein, B. (1998). Smoothing spline ANOVA models for large data sets with Bernoulli observations and the randomized GACV. Technical Report 998, Department of Statistics, University of Wisconsin, Madison, WI.
- Lint, J. H., & Wilson, M. R. (1992). A course in combinatorics. Cambridge: Cambridge University Press.
- Liu, H., & Motoda, H. (1998). Feature extraction, construction and selection: A data mining perspective. Berlin: Springer.
- Martinez, A., Chen, S., Webb, G. I., & Zaidi, N. A. (2016). Scalable learning of Bayesian network classifiers. Journal of Machine Learning Research, 17, 1–35.
- Mitchell, T. M. (1980). The need for biases in learning generalizations. Technical Report CBM-TR-117, Rutgers University, Department of Computer Science, New Brunswick, NJ.
- Neuhaus, J., & Jewell, N. (1993). A geometric approach to assess bias due to omitted covariates in generalized linear models. Biometrika, 80, 807–815.
- Pazzani, M. J. (1996). Constructive induction of cartesian product attributes. In Proceedings of the Information, Statistics and Induction in Science Conference (ISIS96), pp. 66–77.
- Pernkopf, F., & Bilmes, J. (2005). Discriminative versus generative parameter and structure learning of Bayesian network classifiers. In International Conference on Machine Learning, pp. 657–664.
- Pernkopf, F., & Wohlmayr, M. (2009). On discriminative parameter learning of Bayesian network classifiers. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 221–237.
- Roos, T., Wettig, H., Grünwald, P., Myllymäki, P., & Tirri, H. (2005). On discriminative Bayesian network classifiers and logistic regression. Machine Learning, 59(3), 267–296.
- Sahami, M. (1996). Learning limited dependence Bayesian classifiers. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pp. 334–338. Menlo Park, CA: AAAI Press.
- Smola, A., & Schölkopf, B. (2000). Sparse greedy matrix approximation for machine learning. In International Conference on Machine Learning, pp. 911–918.
- Sonnenburg, S., & Franc, V. (2010). COFFIN: A computational framework for linear SVMs. In International Conference on Machine Learning, pp. 999–1006.
- Steinwart, I. (2004). Sparseness of support vector machines—Some asymptotically sharp bounds. In Advances in Neural Information Processing Systems 16.
- Stinson, D. R. (2003). Combinatorial designs: Constructions and analysis. Berlin: Springer.
- Su, J., Zhang, H., Ling, C., & Matwin, S. (2008). Discriminative parameter learning for Bayesian networks. In International Conference on Machine Learning, pp. 1016–1023.
- Szilard, N., Jonasson, J., Genell, A., & Steineck, G. (2009). Bias in odds ratios by logistic regression modelling and sample size. BMC Medical Research Methodology, 9(1), 1–5.
- Webb, G. I. (2000). Multiboosting: A technique for combining boosting and wagging. Machine Learning, 40(2), 159–196.
- Webb, G. I., Boughton, J., & Wang, Z. (2005). Not so naive Bayes: Averaged one-dependence estimators. Machine Learning, 58(1), 5–24.
- Webb, G. I., Boughton, J., Zheng, F., Ting, K. M., & Salem, H. (2011). Learning by extrapolation from marginal to full-multivariate probability distributions: Decreasingly naive Bayesian classification. Machine Learning. doi:10.1007/s10994-011-5263-6.
- Williams, C., & Seeger, M. (2001). Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems 13, pp. 682–688.
- Zaidi, N. A., Carman, M. J., Cerquides, J., & Webb, G. I. (2014). Naive-Bayes inspired effective pre-conditioners for speeding-up logistic regression. In IEEE International Conference on Data Mining, pp. 1097–1102.
- Zaidi, N. A., Cerquides, J., Carman, M. J., & Webb, G. I. (2013). Alleviating naive Bayes attribute independence assumption by attribute weighting. Journal of Machine Learning Research, 14, 1947–1988.
- Zaidi, N. A., & Webb, G. I. (2012). Fast and efficient single pass Bayesian learning. In Advances in Knowledge Discovery and Data Mining, pp. 149–160.
- Zhu, J., & Hastie, T. (2001). Kernel logistic regression and the import vector machine. In NIPS, pp. 1081–1088.