## Abstract

LogitBoost is a popular Boosting variant that can be applied to either binary or multi-class classification. From a statistical viewpoint, LogitBoost can be seen as additive tree regression that minimizes the Logistic loss. In this setting, it is still non-trivial to devise a sound multi-class LogitBoost compared with devising its binary counterpart. The difficulties stem from two important factors arising in the multi-class Logistic loss. The first is the invariant property implied by the Logistic loss, which causes the optimal classifier output to be non-unique, i.e., adding a constant to each component of the output vector does not change the loss value. The second is the density of the Hessian matrices that arise when computing tree node split gains and node value fittings. Oversimplifying this learning problem can lead to degraded performance. For example, the original LogitBoost algorithm is outperformed by ABC-LogitBoost thanks to the latter’s more careful treatment of the above two factors. In this paper we propose new techniques to address the two main difficulties of the multi-class LogitBoost setting: (1) we adopt a vector tree model (i.e., each node value is a vector), where a unique classifier output is guaranteed by adding a sum-to-zero constraint, and (2) we use an adaptive block coordinate descent that exploits the dense Hessian when computing the tree split gain and node values. Higher classification accuracy and faster convergence rates are observed on a range of public data sets when compared with both the original and the ABC-LogitBoost implementations. We also discuss another possibility for coping with LogitBoost’s dense Hessian matrix. We derive a loss similar to the multi-class Logistic loss but which guarantees a diagonal Hessian matrix. While this makes the optimization (by Newton descent) easier, we unfortunately observe degraded performance for this modification.
We argue that working with the dense Hessian is likely unavoidable, therefore making techniques like those proposed in this paper necessary for efficient implementations.

## Keywords

LogitBoost · Boosting · Ensemble · Supervised learning · Convex optimization

## 1 Introduction

Boosting is a successful technique for training classifiers for both binary and multi-class classification (Freund and Schapire 1995; Schapire and Singer 1999). In this paper, our focus is on multi-class LogitBoost (Friedman et al. 1998), one of the popular boosting variants. Originally, LogitBoost was motivated by a statistical perspective (Friedman et al. 1998), where a boosting algorithm consists of three key components: the loss, the function model, and the optimization algorithm. In the case of LogitBoost, these are the multi-class Logistic loss, the use of additive tree models, and a stage-wise optimization, respectively. For \(K\)-class classification, LogitBoost directly learns a \(K\) dimensional vector as the classifier output, each component representing the confidence of predicting the corresponding class.

There are two important factors in LogitBoost’s settings. Firstly, an “invariant property” is implied by the Logistic loss, i.e., adding a constant to each component of the classifier output won’t change the loss value. Therefore, the classifier output minimizing the total loss is not unique, making the optimization procedure a bit difficult. Secondly, the Logistic loss produces a dense Hessian matrix, causing troubles when deriving the tree node split gain and node value fitting. It is challenging to design a tractable optimization algorithm that fully handles both these factors. Consequently, some simplification and/or approximation is needed.
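The invariant property is easy to verify numerically. Below is a minimal sketch (plain Python; the helper name `logistic_loss` is ours) showing that shifting every component of the output vector by a constant leaves the Logistic loss \(-\log p_y\) unchanged:

```python
import math

def logistic_loss(F, y):
    # Multi-class Logistic loss for one example: -log p_y,
    # with p_k = exp(F_k) / sum_j exp(F_j).
    Z = sum(math.exp(f) for f in F)
    return -math.log(math.exp(F[y]) / Z)

F = [0.5, -1.2, 2.0]
c = 7.3  # arbitrary constant added to every component
shifted = [f + c for f in F]
assert abs(logistic_loss(F, 1) - logistic_loss(shifted, 1)) < 1e-9
```

The constant cancels in the softmax ratio, so the total loss over any training set has infinitely many minimizers.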

In Friedman et al. (1998) the Hessian is diagonally approximated. In this way, the minimizer becomes unique, while the optimization—essentially a quadratic problem when using one step Newton—is substantially simplified. Consequently, at each Boosting iteration the tree model updating collapses to \(K\) independent Weighted Regression Tree fittings, each tree outputting a scalar.

Unfortunately, Friedman’s prescription turns out to have some drawbacks. The oversimplified quadratic loss does not even satisfy the invariant property, and is thus a very crude approximation. A later improvement, ABC-LogitBoost, is shown to outperform LogitBoost in terms of both classification accuracy and convergence rate (Li 2008, 2010b). This is due to ABC-LogitBoost’s careful handling of the above key problems of the LogitBoost setting. The invariant property is addressed by adding a sum-to-zero constraint to the output vector. To make the minimizer unique, at each iteration one variable is eliminated while the other \((K-1)\) are free, so that the loss is re-written with \((K-1)\) variables. At this point, the diagonal Hessian approximation is, again, adopted to permit \(K-1\) scalar trees being independently fitted for each of the \(K-1\) classes. The remaining class—called the base class—is recovered by taking minus the sum of the other \(K-1\) classes. The base class—i.e., the variable to be eliminated—is selected adaptively per iteration (or every several iterations), hence the acronym ABC (Adaptive Base Class). Note that the diagonal Hessian approximation in ABC-LogitBoost is taken for the \((K-1)\) dimensional problem yielded by eliminating a redundant variable. Li (2008, 2010b) considers it a more refined approximation than that of the original LogitBoost (Friedman et al. 1998).

For **each** of these subproblems, only two coordinates (i.e., two classes, or a class pair) are adaptively selected for updating; hence we call the modified algorithm AOSO-LogitBoost (Adaptive One vS One). Figure 1 gives an overview of our approach. In Sect. 4.4 we show that the first- and second-order approximation of the loss reduction can be a good measure for the quality of the selected class pair.

Following the above formulation, ABC-LogitBoost (derived using a somewhat different framework than Li 2010b) can thus be shown to be a special case of AOSO-LogitBoost with a less flexible tree model. In Sect. 5 we compare the differences between the two approaches in detail and provide some intuition for AOSO’s improvement over ABC.

Both ABC and AOSO are carefully devised to address the difficulties posed by the dense Hessian matrix arising from the Logistic loss. In other words, the tree model learning would be easy if we encountered a diagonal Hessian matrix. Based on loss design in a Primal/Dual view (Masnadi-Shirazi and Vasconcelos 2010; Reid and Williamson 2010), we show that the dense Hessian matrix essentially results from the sum-to-one constraint on the class probabilities in the Logistic loss. We thus investigate the possibility of obtaining a diagonal Hessian matrix by removing this sum-to-one constraint. The LogitBoost variant we produce does work; however, it is still inferior to AOSO-LogitBoost (see Sect. 6 for a detailed discussion). We therefore conclude that modifications of the original LogitBoost, such as ABC and AOSO, are necessary for efficient model learning, since the dense Hessian matrix seems unavoidable.

The rest of this paper is organized as follows: In Sect. 2 we formulate the problem setting for LogitBoost. In Sect. 3 we briefly review the original, ABC-, and AOSO-LogitBoost with regard to the interplay between the tree model and the optimization procedure. In Sect. 4 we give the details of our approach. In Sect. 5 we compare our approach with (ABC-)LogitBoost. In Sect. 6 we show how to design a loss that produces diagonal Hessian matrices and assess its implications for model accuracy. In Sect. 7, experimental results in terms of classification errors and convergence rates are reported on a range of public datasets.

## 2 The problem setup

For \(K\)-class classification (\(K \ge 2\)), consider an \(N\) example training set \(\{\varvec{x}_i, y_i\}_{i=1}^N\) where \(\varvec{x}_i\) denotes a feature vector and \(y_{i}\in \{1,\ldots ,K\}\) denotes a class label. From the training set a prediction function \(\varvec{F}(\varvec{x}) \in \mathbb {R}^K\) is learned. When clear from context, we omit the dependence on \(\varvec{x}\) and simply denote \(\varvec{F}(\varvec{x})\) by \(\varvec{F}\) (we do the same for other related variables). Given a test example with known \(\varvec{x}\) and unknown \(y\), we predict a class label by taking \(\hat{y} = \arg \max _{k} F_k, k=1,\ldots ,K\), where \(F_k\) is the \(k\)-th component of \(\varvec{F}\).

*loss* for a single training example \((\varvec{x},y)\). We will make the loss concrete shortly.

A *function model* is needed to describe how \(\varvec{F}\) depends on \(\varvec{x}\). For example, a linear model \(\varvec{F}= \varvec{W}^{T}\varvec{x}+ \varvec{b}\) is used in traditional Logistic regression, while a Generalized Additive Model is adopted in LogitBoost, where

Each \(\varvec{f}^{(m)}(\varvec{x})\), or simply \(\varvec{f}\), is learned by the so-called greedy stage-wise *optimization*. That is, at each Boosting iteration \(\varvec{f}^{(m)}\) is added based only on \(\varvec{F}= \sum _{j=1}^{m-1} \varvec{f}^{(j)}\).

In summary, the learning procedure, called training, consists of three ingredients: the *loss*, the *function model* and the *optimization* algorithm. In what follows we discuss them in greater detail.

### 2.1 The logistic loss

#### 2.1.1 The invariant property

#### 2.1.2 Derivative and quadratic approximation

#### 2.1.3 \((K-1)\) Degrees of freedom and the sum-to-zero constraint

Due to the invariant property, the minimizer \(\varvec{f}^*\) of (8) is not unique since any \(\varvec{f}^* = c\varvec{1}\) would be a minimizer. To pin-down the value, we can add a constraint \(\varvec{1}^T \varvec{f}= 0\), which in effect restricts \(\varvec{f}\) to vary just in the linear subspace defined by \(\varvec{1}^T \varvec{f}= 0\). Obviously, now we need only \(K-1\) coordinates to express the vector living in the subspace, i.e., the degrees of freedom is \(K-1\).
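In practice, the sum-to-zero constraint can be imposed by projecting onto the subspace \(\varvec{1}^T \varvec{f}= 0\), i.e., by subtracting the mean of the components; by the invariant property this does not change the loss value. A small illustrative sketch (the function name `project_sum_to_zero` is ours):

```python
def project_sum_to_zero(f):
    # Project f onto the subspace {f : 1^T f = 0} by removing the mean;
    # the invariant property guarantees the loss value is unchanged.
    mean = sum(f) / len(f)
    return [v - mean for v in f]

f = [3.0, -1.0, 4.0]
g = project_sum_to_zero(f)
assert abs(sum(g)) < 1e-12  # g now lives in the constrained subspace
```

Only \(K-1\) coordinates are needed to parameterize vectors in this subspace, which is the \((K-1)\) degrees of freedom noted above.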

In Sect. 4.2 we will discuss the rank of the Hessian matrix \(\varvec{H}\), which provides another perspective to why the minimizer of (8) is not unique.

Conceptually, the invariant property is primary, causing the minimizer to be non-unique; the sum-to-zero constraint is secondary, serving as a mathematical tool to make the minimizer unique.

#### 2.1.4 More on \((K-1)\) degrees of freedom

It is outside this paper’s scope to compare the Exponential loss and the Logistic loss in the multi-class case. Instead, we focus on the Logistic loss and show how it is applied in LogitBoost. Also, in Sect. 6 we discuss a modified Logistic loss that does not satisfy the invariant property (but simplifies the Hessian and eases the quadratic solver) and show its degraded performance compared with the original Logistic loss.

### 2.2 The tree model

As mentioned previously, \(\varvec{F}(\varvec{x}) = \sum _{m=1}^{M} \varvec{f}^{(m)}(\varvec{x})\) is an additive tree model. However, the way each \(\varvec{f}(\varvec{x})\) (we omit the subscript \(m\) for simplicity and without confusion) is expressed by a tree model is not unique.

Finally, it is possible to express \(\varvec{f}(\varvec{x})\) with just \(K-1\) trees by adding to \(\varvec{f}(\varvec{x})\) a sum-to-zero constraint, as is adopted in Li’s series of works (Li 2008, 2009, 2010b).

### 2.3 Stage wise optimization

## 3 Interplay between tree model and optimization

In the last section we reviewed the problem setup. In particular, we made concrete the three ingredients of LogitBoost: the loss, the function model, and the optimization algorithm are the Logistic loss, the additive tree model, and the stage-wise optimization, respectively. To train a LogitBoost classifier, the only thing left unexplained is how to solve the quadratic optimization problem (14), which is still a bit complicated.

### 3.1 Original LogitBoost

Then \(K\) scalar trees are grown, with the \(k\)-th tree approximately minimizing the \(k\)-th problem. Note that each tree can have its own partition, as in Fig. 3(a).

To grow the scalar tree, Friedman et al. (1998) borrowed a traditional regression model from statistics, namely the Weighted Regression Tree. The formulation becomes: for the \(k\)-th problem, fit a Weighted Regression Tree on the training examples \(\{\varvec{x}_i,y_i\}_{i=1}^N\) with targets \(\{ -g_{i,k}/h_{i,kk} \}_{i=1}^N\) and weights \(\{ h_{i,kk} \}_{i=1}^N\), where (in a slight abuse of notation) we use an additional subscript \(i\) for \(g_k\) and \(h_{kk}\) to denote that they correspond to the \(i\)-th training example.
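As an illustrative sketch (function and parameter names are ours, not from the original implementation), the per-example target and weight follow from the diagonal entries of the Logistic-loss gradient and Hessian, \(g_k = p_k - r_k\) (with \(r\) the one-hot label) and \(h_{kk} = p_k(1-p_k)\):

```python
import math

def targets_and_weights(F, y, k, eps=1e-12):
    # Target -g_k/h_kk and weight h_kk for the k-th Weighted Regression
    # Tree, for one example with score vector F and label y, where
    # g_k = p_k - r_k (r is the one-hot label) and h_kk = p_k(1 - p_k).
    Z = sum(math.exp(f) for f in F)
    p_k = math.exp(F[k]) / Z
    r_k = 1.0 if y == k else 0.0
    g = p_k - r_k
    h = max(p_k * (1.0 - p_k), eps)  # guard against vanishing weights
    return -g / h, h

# With F = (0, 0) and true label 0, p_0 = 0.5, so the class-0
# target is -(-0.5)/0.25 = 2.0 with weight 0.25.
target, weight = targets_and_weights([0.0, 0.0], 0, 0)
```

The clamping of the weight is a common numerical safeguard, not part of the formulation itself.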

### 3.2 ABC-LogitBoost

LogitBoost adopts a rather crude approximation to (8), where the Hessian matrix \(\varvec{H}\in \mathbb {R}^{K \times K}\) is diagonally approximated. ABC-LogitBoost (Li 2010b) considers an intuitively more refined way. Recall that \((K-1)\) degrees of freedom suffice to express (8), i.e., (8) can be re-written with just \((K-1)\) variables by eliminating one redundant variable. The new \((K-1) \times (K-1)\) Hessian matrix is then, again, diagonally approximated. Li (2010b) shows that this does make a difference, as follows.

Li (2010b) shows that the choice of \(b\) affects how good the diagonal approximation is. Two intuitive methods are proposed for selecting \(b\): (1) the “worst class”, i.e., the \(b\) with the biggest loss (before minimizing (14)) on that class is selected; (2) the “best class”, i.e., all possible choices of \(b\), up to \(K\) of them, are tried and the one leading to the lowest loss (after minimizing (14)) is selected.

### 3.3 AOSO-LogitBoost

In AOSO-LogitBoost, we also adopt the loss (18) with \(K-1\) free coordinates. However, we update only one coordinate at a time. An equivalent formulation goes as follows. We add a sum-to-zero constraint \(\varvec{1}^T \varvec{f}= 0\) to the \(K\)-dim loss (8) and let only two coordinates of \(\varvec{f}\), say \(f_r\) and \(f_s\), vary while the other \(K-2\) remain zero. Due to the sum-to-zero constraint, we further let \(f_r = +t\) and \(f_s = -t\), where \(t\) is a real number.
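Under this parameterization, the loss restricted to the direction \(f_r = +t, f_s = -t\) is a one-dimensional quadratic in \(t\) whose one-step Newton solution is closed-form: the directional derivative is \(g_r - g_s\) and the directional curvature is \(h_{rr} - 2h_{rs} + h_{ss}\). A sketch (the helper name is ours; \(g\) and \(H\) denote the gradient and Hessian of the loss):

```python
def pair_newton_step(g, H, r, s, eps=1e-12):
    # One-step Newton for the scalar t in f_r = +t, f_s = -t:
    # directional derivative g_r - g_s, directional curvature
    # h_rr - 2*h_rs + h_ss, hence t* = -(g_r - g_s) / curvature.
    grad = g[r] - g[s]
    curv = max(H[r][r] - 2.0 * H[r][s] + H[s][s], eps)
    return -grad / curv
```

The same quantities appear in the node value and split gain formulas (29) and (30) discussed later.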

Obviously, the choice of \((r,s)\) affects the quality of the approximation. However, it is impractical to select the best class pair \((r,s)\) for each training example. To make the class pair selection as adaptive as possible, we adopt a vector-valued tree model in AOSO-LogitBoost, i.e., no column in Fig. 3(c) can be assigned to two cells. Then, for each node (i.e., each cell in the matrix, as in Fig. 3c) we adaptively select the class pair \((r,s)\).

Superficially, AOSO seems inferior to the original/ABC-LogitBoost, because many of the \(\varvec{f}\) values in the matrix are left untouched (zeros), as shown in Fig. 3(c). However, we should keep the “big picture” in mind: the untouched values can receive better updating in later Boosting iterations. Consequently, AOSO is still “on average” more efficient than original/ABC-LogitBoost. We further discuss this issue in Sect. 5.

## 4 The AOSO-LogitBoost algorithm

In this section we describe the details of AOSO-LogitBoost. Specifically, we focus on how to build a tree in each Boosting iteration. Some key ingredients of the tree building that improve on previous techniques were first introduced by Li. However, we re-derive them in our own language; credit is made clear as they are explained in the following.

### 4.1 Details of tree building

Solving (14) with the tree model (11) is equivalent to determining the parameters \(\{ \varvec{t}_j, R_j \}_{j=1}^J\) at the \(m\)-th iteration. In this subsection we show how this problem reduces to solving a collection of quadratic subproblems, for which we can use a standard numerical method: block coordinate descent.^{1} We also show how the gradient and Hessian can be computed incrementally.

*node loss*:

- 1. To obtain the values \(\varvec{t}_j\) for a given \(R_j\), we simply take the minimizer of (20): $$\begin{aligned} \varvec{t}_j = \arg \min _{\varvec{t}} NodeLoss(\varvec{t};\mathcal {I}_j), \end{aligned}$$ (21) where \(\mathcal {I}_j\) denotes the index set for \(R_j\).
- 2. To obtain the partition \(\{ R_j \}_{j=1}^J\), we recursively perform binary splitting until there are \(J\) terminal nodes.

Note that (22) arises in the context of an \(O(N\times D)\) outer loop, where \(D\) is the number of features. However, naïvely summing the losses in (20) incurs an additional \(O(N)\) factor, which finally results in an unacceptable \(O(N^{2}D)\) complexity for a single boosting iteration.
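The standard remedy, which the incremental computation mentioned above refers to, is to keep running sums of gradients and Hessians while scanning a feature's sorted examples, so that each candidate split costs \(O(1)\) instead of \(O(N)\). Here is a scalar (diagonal-Hessian) sketch for illustration; AOSO applies the analogous trick to its pairwise gain, and the function name is ours:

```python
def best_split_1d(sorted_grads, sorted_hess, eps=1e-12):
    # Scan examples sorted by one feature, maintaining left/right running
    # sums of gradients g and Hessians h so that the quadratic gain
    # g^2/h of each candidate split is evaluated in O(1).
    total_g, total_h = sum(sorted_grads), sum(sorted_hess)
    left_g = left_h = 0.0
    best_gain, best_pos = 0.0, None
    for i in range(len(sorted_grads) - 1):
        left_g += sorted_grads[i]
        left_h += sorted_hess[i]
        right_g, right_h = total_g - left_g, total_h - left_h
        gain = (left_g ** 2 / max(left_h, eps)
                + right_g ** 2 / max(right_h, eps)
                - total_g ** 2 / max(total_h, eps))
        if gain > best_gain:
            best_gain, best_pos = gain, i
    return best_pos, best_gain
```

A full scan over all features then costs \(O(N \times D)\) per node, restoring the acceptable overall complexity.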


### 4.2 Properties of approximated node loss

To minimize (23), we exploit some of its properties when finding a solution. First, the invariant property carries over to the node loss (23):

### *Property 1*

\(loss(\varvec{t};\mathcal {I}) = loss(\varvec{t}+c\varvec{1};\mathcal {I})\).

### *Proof*

This is obvious by noting the additive form. \(\square \)

For the Hessian \(\varvec{H}\), we have \(\mathrm {rank}(\varvec{H}) \le \sum _i \mathrm {rank}(\varvec{H}_i)\) by noting the additive form in (23). In Li (2010b) it is shown that \(\det \varvec{H}_i = 0\) by brute-force determinant expansion. Here we give a stronger property:

### *Property 2*

Each \(\varvec{H}_i\) is a positive semi-definite matrix such that (1) \(\mathrm {rank}(\varvec{H}_i) = \kappa -1\), where \(\kappa \) is the number of non-zero elements in \(\varvec{p}_i\); (2) \(\varvec{1}\) is the eigenvector for eigenvalue 0.

The proof can be found in Appendix 1.
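The second claim of Property 2 can also be checked directly: for the per-example Logistic-loss Hessian \(\varvec{H}_i = \mathrm {diag}(\varvec{p}) - \varvec{p}\varvec{p}^T\), applying it to \(\varvec{1}\) gives \(\varvec{p} - \varvec{p}\,(\varvec{1}^T\varvec{p})\), which vanishes whenever \(\varvec{p}\) sums to one. A one-line numeric check (helper name is ours):

```python
def hessian_times_ones(p):
    # (H 1)_k for H = diag(p) - p p^T is p_k - p_k * sum_j p_j,
    # which is 0 whenever p sums to one.
    s = sum(p)
    return [p_k - p_k * s for p_k in p]

p = [0.2, 0.5, 0.3]  # a probability vector
assert all(abs(v) < 1e-9 for v in hessian_times_ones(p))
```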

The properties shown above indicate that (1) \(\varvec{H}\) is singular, so unconstrained Newton descent is not applicable here, and (2) \(\mathrm {rank}(\varvec{H})\) could be as high as \(K-1\), which prohibits the application of standard fast quadratic solvers designed for low-rank Hessians. In the following we propose to address this problem via block coordinate descent, a technique that has been successfully used in training SVMs (Bottou and Lin 2007).

### 4.3 Block coordinate descent

### 4.4 Class pair selection

Both methods are \(O(K)\) procedures, which is better than the naïve enumeration of all \(K\times (K-1)/2\) possible pairs. However, in our implementation we find that (33) achieves better results for AOSO-LogitBoost.

The selection of a class pair \((r,s)\) here is somewhat similar to the selection of the base class \(b\) in ABC-Boost. Actually, Li proposed in Li (2008) that the “worst class”, i.e., the one with the largest loss, be selected as \(b\). Clearly, the maximum derivative is another indicator of how “bad” a class is. In this sense, our class pair selection extends the base class idea and can be seen as a concrete implementation of the general “worst class” idea.
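To make the \(O(K)\) flavor of such a selection concrete, here is an illustrative sketch (ours, not the exact criteria (32)/(33)): fix \(r\) as the class with the largest gradient magnitude, then scan \(s\) once to maximize the second-order gain of the pair:

```python
def select_pair(g, H, eps=1e-12):
    # Illustrative O(K) pair selection: fix r with the largest |g_k|,
    # then scan s to maximize (g_r - g_s)^2 / (h_rr - 2 h_rs + h_ss),
    # the second-order approximation of the loss reduction for (r, s).
    K = len(g)
    r = max(range(K), key=lambda k: abs(g[k]))
    best_s, best_gain = None, -1.0
    for s in range(K):
        if s == r:
            continue
        curv = max(H[r][r] - 2.0 * H[r][s] + H[s][s], eps)
        gain = (g[r] - g[s]) ** 2 / curv
        if gain > best_gain:
            best_s, best_gain = s, gain
    return r, best_s
```

Fixing one endpoint of the pair is what reduces the quadratic enumeration to a single linear scan.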

## 5 Comparison to (ABC-)LogitBoost

In this section we compare the derivations of LogitBoost and ABC-LogitBoost and provide some intuition for observed behaviours in the experiments in Sect. 7.

*base class*. As explained in Sect. 3.2, \(b\) can be selected as either the “worst class” or the “best class”.

Li (2010b) shows how to compute the node value and the node split gain for building the \(k\)-th tree (\(k \ne b\)). Although derived from a different motivation than ours, they are actually the same as (29) and (30) in this paper, with the class pair \((r,s)\) replaced by \((k,b)\). We should not be surprised at this coincidence, noting that AOSO’s vector tree has only one freely varying coordinate at each node and thus “behaves” like a scalar tree. In this sense, AOSO and ABC are comparable. Actually, ABC can be viewed as a special form of AOSO with two differences: (1) for each tree, the class pair is fixed for every node in ABC, while it is selected adaptively in AOSO, and (2) \(K-1\) trees are added per iteration in ABC, while only one tree is added per iteration by AOSO.

It may seem unappealing to add just one tree as in AOSO, since many \(\varvec{f}(\varvec{x})\) values are untouched (i.e., set to zero, as illustrated in Fig. 3c); meanwhile, ABC would seem better since it updates all the \(\varvec{f}(\varvec{x})\) values. Considering the Boosting context, we argue, however, that AOSO should still be preferred in an “on average” sense. After adding one tree in AOSO, the \(\varvec{F}(\varvec{x})\) values are updated and the gradient/Hessian is immediately recomputed for each training example, which affects how the tree is built at the next iteration. Thus the \(\varvec{F}(\varvec{x})\) values can still receive good enough updating after several iterations, thanks to the adaptive class pair selection for every node at the current iteration. In contrast, the \(K-1\) trees in ABC use the same set of gradients and Hessians, which are not recomputed until all \(K-1\) trees have been added.

Therefore, it is fair to compare ABC and AOSO in terms of the number of trees, rather than the number of iterations. AOSO’s “on average” better performance is confirmed by the experiments in Sect. 7.2.

To evaluate whether the adaptive class pair selection is critical, we considered a variant of AOSO-LogitBoost that adopts a *fixed* class pair selection. Specifically, we still add one tree per iteration, but select a single class pair at the root node and keep it fixed for all other nodes, which is very similar to ABC’s choice. This variant was tried, but unfortunately **degraded performance** was observed, so the results are not reported here.

From the above analysis, we believe that AOSO-LogitBoost’s more flexible model obtained from the adaptive split selection (as well as its immediate model updating after adding one tree per iteration) is what contributes to its improvement over ABC.

## 6 Sum-to-one probability and dense Hessian matrix

As discussed previously, both ABC and AOSO improve on the original LogitBoost method by dealing with the dense Hessian matrix induced by the Logistic loss as given in (3). An immediate question is whether we can derive an effective alternative surrogate loss that has a diagonal Hessian matrix “by design”, i.e., can we define a modified Logistic loss that guarantees a diagonal Hessian matrix? In this section, we show how the original multi-class Logistic loss can be derived from a maximum entropy argument via convex duality, in a manner similar to derivations of boosting updates by Shen and Li (2010), Shen and Hao (2011), Lafferty (1999), and Kivinen and Warmuth (1999) and results connecting entropy and loss by Masnadi-Shirazi and Vasconcelos (2010) and Reid and Williamson (2010).

In contrast to earlier work, our analysis focuses on the role of the constraint in defining the loss and the effect it has on the form of its gradient and Hessian. In particular, we will see that the original Logistic loss’s dense Hessian matrix essentially results from the sum-to-one constraint on class probabilities. Moreover, we are able to obtain a diagonal Hessian matrix by dropping this constraint when deriving an alternative to the Logistic loss from the same maximum entropy formulation. By doing so we show that the optimization (i.e., via Newton descent) becomes straightforward; however, lower classification accuracy and a slower convergence rate are observed for the new loss. Therefore, we argue that the techniques used by ABC/AOSO seem necessary for dealing with the dense Hessian matrix of the original, more effective, Logistic loss.

### 6.1 Logistic loss in primal/dual view

We now examine the duality between entropy and loss in a manner similar to that of the more general treatment of Masnadi-Shirazi and Vasconcelos (2010) and Reid and Williamson (2010). By starting with a “trivial” maximum entropy problem, we show how consideration of its dual problem recovers a “composite representation” (Reid and Williamson 2010) of a loss function in terms of a loss defined on the simplex and a corresponding link function. In the case of Shannon entropy, we show that this construction results in the original Logistic loss function. Appendix 2 describes the matrix calculus definitions and conventions we adopt, mainly from Magnus and Neudecker (2007).

For each feature vector \(\varvec{x}\), let \(\varvec{\eta }(\varvec{x}) = Pr(\varvec{y}|\varvec{x}) \in \Delta ^K\) be the true conditional probability for the \(K\) classes where \(\Delta ^K = \{ \varvec{p}\in [0,1]^K : \sum _{k=1}^K p_k = 1 \}\) denotes the \(K\)-simplex. Let \(\mathsf {H}: \Delta ^K \rightarrow \mathbb {R}\) be an *entropy function*—a concave function such that \(\mathsf {H}({\mathbf {e}}^k) = 0\) for each vertex \({\mathbf {e}}^k\) of \(\Delta ^K\).

### **Theorem 1**

The gradient of (39) is \(\nabla \ell (\varvec{F}) = \varvec{p}- \varvec{\eta }\).

See Appendix 2 for its proof. Note that the gradient is precisely the left-hand-side of the constraint in primal (35), which is a common result in primal-dual theory in convex optimization (Bertsekas 1982).

### **Theorem 2**

The Hessian of (39) is \(\nabla ^2\ell (\varvec{F}) = -A^{-1} + \frac{1}{\varvec{1}^TA^{-1}\varvec{1}}A^{-1}\varvec{1}\varvec{1}^TA^{-1}\) , where the shorthand \(A = \nabla ^2\mathsf {H}(\varvec{p})\).

See Appendix 2 for its proof.

In both of the above theorems, the dependence on \(\varvec{F}\) is implicit, where \(\varvec{p}\) is given by the link function (40).

The link, gradient, and Hessian of the loss with and without the sum-to-one constraint

| | With sum-to-one | Without sum-to-one |
|---|---|---|
| Link \(\varvec{p}= \varvec{p}(\varvec{F})\) | \(p_k = \frac{e^{F_k}}{\sum _j^K e^{F_j}}, \quad k = 1,\ldots ,K\) | \(p_k = e^{F_k - 1}, \quad k = 1,\ldots ,K\) |
| Gradient \(\nabla \ell (\varvec{F})\) | \(\varvec{p}- \varvec{\eta }\) | \(\varvec{p}- \varvec{\eta }\) |
| Hessian \(\nabla ^2 \ell (\varvec{F})\) | \(\mathrm {diag}(p_1,\ldots ,p_K) - \varvec{p}\varvec{p}^T\) | \(\mathrm {diag}(p_1,\ldots ,p_K)\) |
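The structural difference between the two Hessians in the table can be verified numerically. A small sketch (the helper name `hessians` is ours) that builds both matrices from their respective link functions:

```python
import math

def hessians(F):
    # With the sum-to-one (softmax) link, the Hessian is
    # diag(p) - p p^T (dense); without the constraint, the link is
    # p_k = exp(F_k - 1) and the Hessian is just diag(p).
    K = len(F)
    Z = sum(math.exp(f) for f in F)
    p_soft = [math.exp(f) / Z for f in F]
    dense = [[(p_soft[i] if i == j else 0.0) - p_soft[i] * p_soft[j]
              for j in range(K)] for i in range(K)]
    p_exp = [math.exp(f - 1.0) for f in F]
    diag = [[p_exp[i] if i == j else 0.0 for j in range(K)] for i in range(K)]
    return dense, diag

dense, diag = hessians([0.3, -0.7, 1.1])
# Off-diagonal entries -p_i p_j are nonzero in the constrained case,
# and exactly zero in the unconstrained one.
assert any(abs(dense[i][j]) > 0 for i in range(3) for j in range(3) if i != j)
assert all(diag[i][j] == 0.0 for i in range(3) for j in range(3) if i != j)
```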

### 6.2 Diagonal Hessian matrix by dropping the sum-to-one constraint

### 6.3 Degraded performance

We provide an intuitive explanation for this phenomenon. At convergence, the gradient \(\varvec{p}- \varvec{\eta }\) vanishes, and consequently \(\varvec{p}\) satisfies the sum-to-one constraint, since \(\varvec{\eta }\) is a probability vector. In the original Logistic loss setting, the apparently redundant sum-to-one constraint on \(\varvec{p}\) in (35) enforces that the primal variable \(\varvec{p}\) approaches \(\varvec{\eta }\) on the plane \(\varvec{1}^T\varvec{p}- 1= 0\) containing the simplex \(\Delta ^K\). In contrast, for K-LogitBoost \(\varvec{p}\) may reside outside that plane during the optimization procedure. The latter optimization is intuitively slower, since the L2 norm \(||\varvec{p}- \varvec{\eta }||_2^2\) never increases when \(\varvec{p}-\varvec{\eta }\) is projected onto a plane.

## 7 Experiments


Datasets used in our experiments

| Datasets | \(K\) | \(\#\)features | \(\#\)training | \(\#\)test |
|---|---|---|---|---|
| Poker525k | 10 | 25 | 525,010 | 500,000 |
| Poker275k | 10 | 25 | 275,010 | 500,000 |
| Poker150k | 10 | 25 | 150,010 | 500,000 |
| Poker100k | 10 | 25 | 100,010 | 500,000 |
| Poker25kT1 | 10 | 25 | 25,010 | 500,000 |
| Poker25kT2 | 10 | 25 | 25,010 | 500,000 |
| Covertype290k | 7 | 54 | 290,506 | 290,506 |
| Covertype145k | 7 | 54 | 145,253 | 290,506 |
| Letter | 26 | 16 | 16,000 | 4,000 |
| Letter15k | 26 | 16 | 15,000 | 5,000 |
| Letter2k | 26 | 16 | 2,000 | 18,000 |
| Letter4K | 26 | 16 | 4,000 | 16,000 |
| Pendigits | 10 | 16 | 7,494 | 3,498 |
| Zipcode (a.k.a. USPS) | 10 | 256 | 7,291 | 2,007 |
| Isolet | 26 | 617 | 6,238 | 1,559 |
| Optdigits | 10 | 64 | 3,823 | 1,797 |
| Mnist10k | 10 | 784 | 10,000 | 60,000 |
| M-Basic | 10 | 784 | 12,000 | 50,000 |
| M-Image | 10 | 784 | 12,000 | 50,000 |
| M-Rand | 10 | 784 | 12,000 | 50,000 |
| M-Noise1 | 10 | 784 | 10,000 | 2,000 |
| M-Noise2 | 10 | 784 | 10,000 | 2,000 |
| M-Noise3 | 10 | 784 | 10,000 | 2,000 |
| M-Noise4 | 10 | 784 | 10,000 | 2,000 |
| M-Noise5 | 10 | 784 | 10,000 | 2,000 |
| M-Noise6 | 10 | 784 | 10,000 | 2,000 |

To exhaust the learning ability of (ABC-)LogitBoost, Li lets boosting stop when either the training loss is small (implemented as \(\le 10^{-16}\)) or a maximum number of iterations, \(M\), is reached. Test errors at the last iteration are simply reported, since no obvious over-fitting is observed. By default \(M=10,000\), while for the large datasets (Covertype290k, Poker525k, Poker275k, Poker150k, Poker100k) \(M=5,000\). We adopt the same criteria, except that our maximum number of iterations is \(M_{AOSO} = (K-1) \times M_{ABC}\), where \(K\) is the number of classes. Note that only one tree is added at each iteration in AOSO, while \(K-1\) are added in ABC; this correction thus compares the same maximum number of trees for both AOSO and ABC.

The most important tuning parameters in LogitBoost are the number of terminal nodes \(J\), and the shrinkage factor \(v\). In (Li 2010b, 2009a), Li reported results of (ABC-)LogitBoost for a number of \(J\)-\(v\) combinations. We report the corresponding results for AOSO-LogitBoost for the same combinations. In the following, we intend to show that **for nearly all** \(J\)–\(v\) **combinations, AOSO-LogitBoost has lower classification error and faster convergence rates than ABC-LogitBoost**.

### 7.1 Classification errors

#### 7.1.1 Summary

Summary of test classification errors

| Datasets | \(\#\)tests | ABC | AOSO | \(R\) | \(pv\) | ABC\(^*\) | AOSO\(^*\) | \(R\) | \(pv\) |
|---|---|---|---|---|---|---|---|---|---|
| Poker525k | 500000 | 1736 | | 0.1146 | 0.0002 | – | – | – | – |
| Poker275k | 500000 | 2727 | | 0.0378 | 0.0790 | – | – | – | – |
| Poker150k | 500000 | 5104 | | 0.2259 | 0.0000 | – | – | – | – |
| Poker100k | 500000 | 13707 | | 0.4486 | 0.0000 | – | – | – | – |
| Poker25kT1 | 500000 | 37345 | | 0.1592 | 0.0000 | 37345 | | 0.1592 | 0.0000 |
| Poker25kT2 | 500000 | 36731 | | 0.1385 | 0.0000 | 36731 | | 0.1385 | 0.0000 |
| Covertype290k | 290506 | 9727 | | 0.0145 | 0.1511 | – | – | – | – |
| Covertype145k | 290506 | 13986 | | 0.0196 | 0.0458 | – | – | – | – |
| Letter | 4000 | | 92 | \(-\)0.0337 | 0.5892 | 89 | | 0.0112 | 0.4697 |
| Letter15k | 5000 | | 116 | \(-\)0.0642 | 0.6815 | – | – | – | – |
| Letter4k | 16000 | 1055 | | 0.0607 | 0.0718 | 1034 | | 0.0706 | 0.0457 |
| Letter2k | 18000 | 2034 | | 0.0846 | 0.0018 | 1991 | | 0.0703 | 0.0084 |
| Pendigits | 3498 | 100 | | 0.1700 | 0.1014 | 90 | | 0.1000 | 0.2430 |
| Zipcode | 2007 | | 99 | \(-\)0.0313 | 0.5872 | | 94 | \(-\)0.0217 | 0.5597 |
| Isolet | 1559 | 65 | | 0.1538 | 0.1759 | 55 | | 0.0909 | 0.3039 |
| Optdigits | 1797 | 55 | | 0.3091 | 0.0370 | 38 | | 0.1053 | 0.3170 |
| Mnist10k | 60000 | 2102 | | 0.0733 | 0.0069 | 2050 | | 0.0805 | 0.0037 |
| M-Basic | 50000 | 1602 | | 0.1049 | 0.0010 | – | – | – | – |
| M-Rotate | 50000 | 5959 | | 0.0386 | 0.0118 | – | – | – | – |
| M-Image | 50000 | 4268 | | 0.0237 | 0.1252 | 4214 | | 0.0503 | 0.0073 |
| M-Rand | 50000 | 4725 | | 0.0290 | 0.0680 | – | – | – | – |
| M-Noise1 | 2000 | 234 | | 0.0256 | 0.3833 | – | – | – | – |
| M-Noise2 | 2000 | 237 | | 0.0169 | 0.4221 | – | – | – | – |
| M-Noise3 | 2000 | 238 | | 0.0210 | 0.4031 | – | – | – | – |
| M-Noise4 | 2000 | 238 | | 0.0210 | 0.4031 | – | – | – | – |
| M-Noise5 | 2000 | 227 | | 0.0573 | 0.2558 | – | – | – | – |
| M-Noise6 | 2000 | 201 | | 0.0498 | 0.2974 | – | – | – | – |

We also tested the statistical significance of the difference between AOSO and ABC. We assume the classification error rate follows a Binomial distribution. Let \(z\) denote the number of errors and \(n\) the number of tests; then the error rate estimate is \(\hat{p}=z/n\) with variance \(\hat{p}(1-\hat{p})/n\). We then approximate the Binomial distribution by a Gaussian and perform a hypothesis test. The \(p\) values are reported in Table 3.
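The test just described can be sketched as follows (plain Python; the helper name `p_value` is ours, and we assume the one-sided test compares the difference of the two independent rate estimates against its combined variance):

```python
import math

def p_value(z1, z2, n):
    # Error rates p1 = z1/n, p2 = z2/n, each with variance p(1-p)/n
    # (Binomial approximated by a Gaussian, as in the text).
    # One-sided p value for the hypothesis that classifier 2's
    # error rate is lower than classifier 1's.
    p1, p2 = z1 / n, z2 / n
    var = p1 * (1 - p1) / n + p2 * (1 - p2) / n
    z = (p1 - p2) / math.sqrt(var)
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Equal error counts give p = 0.5 (no evidence of a difference).
assert abs(p_value(100, 100, 4000) - 0.5) < 1e-12
```

Small \(p\) values then indicate that the observed improvement is unlikely under the null hypothesis of equal error rates.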

#### 7.1.2 Comparisons with SVM and deep learning

For some problems, we note that LogitBoost (both ABC and AOSO) outperforms other state-of-the-art classifiers such as SVMs or deep learning methods.

On the Poker dataset, Li (2009a) reports that a linear SVM works poorly (the test error rate is about \(40\,\%\)), while ABC-LogitBoost performs far better (i.e., \({<}10\,\%\) on Poker25kT1 and Poker25kT2). The AOSO-LogitBoost proposed in this paper has an even lower test error than ABC-LogitBoost; see Table 3.

Summary of test error rates for (ABC- and AOSO-)LogitBoost and deep learning algorithms on variants of Mnist

| | M-Basic (%) | M-Rotate (%) | M-Image (%) | M-Rand (%) |
|---|---|---|---|---|
| SVM-RBF | 3.05 | 11.11 | 22.61 | 14.58 |
| SVM-POLY | 3.69 | 15.42 | 24.01 | 16.62 |
| NNET | 4.69 | 18.11 | 27.41 | 20.04 |
| DBN-3 | 3.11 | | 16.31 | |
| SAA-3 | 3.46 | | 23.00 | 11.28 |
| DBN-1 | 3.94 | 14.69 | 16.15 | 9.80 |
| ABC | 3.20 | 11.92 | 8.54 | 9.45 |
| AOSO | 87 | | 33 | |

#### 7.1.3 Detailed results

We provide a one-on-one comparison between ABC and AOSO over a range of \(J\)–\(v\) combinations, as follows.
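Read as a grid, the tables below support simple model selection: pick the \(J\)–\(v\) cell with the fewest errors. A minimal sketch (the variable names are ours; the dictionary holds a few cells from the Mnist10k table):

```python
# A few (J, v) -> test-error cells from the Mnist10k table, for illustration.
results = {
    (4, 0.04): 2630, (4, 0.10): 2522,
    (14, 0.08): 2063, (14, 0.10): 2050,
    (50, 0.10): 2341,
}
# Setting with the fewest test errors among the listed cells.
best_J_v = min(results, key=results.get)
```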

Test classification errors on Mnist10k

| | \(v=0.04\) | \(v=0.06\) | \(v=0.08\) | \(v=0.1\) |
|---|---|---|---|---|
| \(J=4\) | 2,630 | 2,600 | 2,535 | 2,522 |
| \(J=6\) | 2,263 | 2,252 | 2,226 | 2,223 |
| \(J=8\) | 2,159 | 2,138 | 2,120 | 2,143 |
| \(J=10\) | 2,122 | 2,118 | 2,091 | 2,097 |
| \(J=12\) | 2,084 | 2,090 | 2,090 | 2,095 |
| \(J=14\) | 2,083 | 2,094 | 2,063 | 2,050 |
| \(J=16\) | 2,111 | 2,114 | 2,097 | 2,082 |
| \(J=18\) | 2,088 | 2,087 | 2,088 | 2,097 |
| \(J=20\) | 2,128 | 2,112 | 2,095 | 2,102 |
| \(J=24\) | 2,174 | 2,147 | 2,129 | 2,138 |
| \(J=30\) | 2,235 | 2,237 | 2,221 | 2,177 |
| \(J=40\) | 2,310 | 2,284 | 2,257 | 2,260 |
| \(J=50\) | 2,353 | 2,359 | 2,332 | 2,341 |

Test classification errors on M-Image

| | \(v=0.04\) | \(v=0.06\) | \(v=0.08\) | \(v=0.1\) |
|---|---|---|---|---|
| \(J=4\) | 5,539 | 5,480 | 5,408 | 5,430 |
| \(J=6\) | 5,076 | 4,925 | 4,950 | 4,919 |
| \(J=8\) | 4,756 | 4,748 | 4,678 | 4,670 |
| \(J=10\) | 4,597 | 4,572 | 4,524 | 4,537 |
| \(J=12\) | 4,432 | 4,455 | 4,416 | 4,389 |
| \(J=14\) | 4,378 | 4,338 | 4,356 | 4,299 |
| \(J=16\) | 4,317 | 4,307 | 4,279 | 4,313 |
| \(J=18\) | 4,301 | 4,255 | 4,230 | 4,287 |
| \(J=20\) | 4,251 | 4,231 | 4,214 | 4,268 |
| \(J=24\) | 4,242 | 4,298 | 4,226 | 4,250 |
| \(J=30\) | 4,351 | 4,307 | 4,311 | 4,286 |
| \(J=40\) | 4,434 | 4,426 | 4,439 | 4,388 |
| \(J=50\) | 4,502 | 4,534 | 4,487 | 4,479 |

Test classification errors on Letter4k

| | \(v=0.04\) | \(v=0.06\) | \(v=0.08\) | \(v=0.1\) |
|---|---|---|---|---|
| \(J=4\) | 2,630 | 2,600 | 2,535 | 2,522 |
| \(J=6\) | 2,263 | 2,252 | 2,226 | 2,223 |
| \(J=8\) | 2,159 | 2,138 | 2,120 | 2,143 |
| \(J=10\) | 2,122 | 2,118 | 2,091 | 2,097 |
| \(J=12\) | 2,084 | 2,090 | 2,090 | 2,095 |
| \(J=14\) | 2,083 | 2,094 | 2,063 | 2,050 |
| \(J=16\) | 2,111 | 2,114 | 2,097 | 2,082 |
| \(J=18\) | 2,088 | 2,087 | 2,088 | 2,097 |
| \(J=20\) | 2,128 | 2,112 | 2,095 | 2,102 |
| \(J=24\) | 2,174 | 2,147 | 2,129 | 2,138 |
| \(J=30\) | 2,235 | 2,237 | 2,221 | 2,177 |
| \(J=40\) | 2,310 | 2,284 | 2,257 | 2,260 |
| \(J=50\) | 2,353 | 2,359 | 2,332 | 2,341 |

Test classification errors on Letter2k

| | \(v=0.04\) | \(v=0.06\) | \(v=0.08\) | \(v=0.1\) |
|---|---|---|---|---|
| \(J=4\) | 2,347 | 2,299 | 2,256 | 2,231 |
| \(J=6\) | 2,136 | 2,120 | 2,072 | 2,077 |
| \(J=8\) | 2,080 | 2,049 | 2,035 | 2,037 |
| \(J=10\) | 2,044 | 2,003 | 2,021 | 2,002 |
| \(J=12\) | 2,024 | 1,992 | 2,018 | 2,018 |
| \(J=14\) | 2,022 | 2,004 | 2,006 | 2,030 |
| \(J=16\) | 2,024 | 2,004 | 2,005 | 1,999 |
| \(J=18\) | 2,044 | 2,021 | 1,991 | 2,034 |
| \(J=20\) | 2,049 | 2,021 | 2,024 | 2,034 |
| \(J=24\) | 2,060 | 2,037 | 2,021 | 2,047 |
| \(J=30\) | 2,078 | 2,057 | 2,041 | 2,045 |
| \(J=40\) | 2,121 | 2,079 | 2,090 | 2,110 |
| \(J=50\) | 2,174 | 2,155 | 2,133 | 2,150 |

**Poker25kT1 and Poker25kT2**: ABC-MART outperforms ABC-LogitBoost in the experiments of Li (2010b). We therefore cite the results for both ABC-MART and ABC-LogitBoost in Tables 9 and 10, with \(J \in \{ 4,6,8,10,12,14,16,18,20 \}\) and \(v \in \{ 0.04,0.06,0.08,0.1 \}\). The comparison with AOSO-LogitBoost is also listed. Unlike on the previous datasets, AOSO-LogitBoost is somewhat sensitive to parameters here, as Li (2010b) also observed for ABC-MART and ABC-LogitBoost.

Test classification errors on Poker25kT1

| | \(v=0.04\) | \(v=0.06\) | \(v=0.08\) | \(v=0.1\) |
|---|---|---|---|---|
| \(J=4\) | 90,323 102,905 | | | |
| \(J=6\) | 38,017 43,156 | 36,839 39,164 | 35,467 37,954 | 34,879 37,546 |
| \(J=8\) | 39,220 46,076 | 37,112 40,162 | 36,407 38,422 | 35,777 37,345 |
| \(J=10\) | 39,661 44,830 | 38,547 40,754 | 36,990 40,486 | 36,647 38,141 |
| \(J=12\) | 41,362 48,412 | 39,221 44,886 | 37,723 42,100 | 37,345 39,798 |
| \(J=14\) | 42,764 52,479 | 40,993 48,093 | 40,155 44,688 | 37,780 43,048 |
| \(J=16\) | 44,386 53,363 | 43,360 51,308 | 41,952 47,831 | 40,050 46,968 |
| \(J=18\) | 46,463 57,147 | 45,607 55,468 | 45,838 50,292 | 43,040 47,986 |
| \(J=20\) | 49,577 62,345 | 47,901 57,677 | 45,725 53,696 | |

Test classification errors on Poker25kT2

| | \(v=0.04\) | \(v=0.06\) | \(v=0.08\) | \(v=0.1\) |
|---|---|---|---|---|
| \(J=4\) | | | | |
| \(J=6\) | 37,567 42,699 | 36,345 38,592 | 34,920 37,397 | 34,326 36,914 |
| \(J=8\) | 38,703 45,737 | 36,586 39,648 | 35,836 37,935 | 35,129 36,731 |
| \(J=10\) | 39,078 44,517 | 38,025 40,286 | 36,455 40,044 | 36,076 37,504 |
| \(J=12\) | 40,834 47,948 | 38,657 44,602 | 37,203 41,582 | 36,781 39,378 |
| \(J=14\) | 42,348 52,063 | 40,363 47,642 | 39,613 44,296 | 37,243 42,720 |
| \(J=16\) | 44,067 52,937 | 42,973 50,842 | 41,485 47,578 | 39,446 46,635 |
| \(J=18\) | 46,050 56,803 | 45,133 55,166 | 45,308 49,956 | 42,485 47,707 |
| \(J=20\) | 49,046 61,980 | 47,430 57,383 | 45,390 53,364 | |

**Letter**, **Pendigits**, **Zipcode**, **Optdigits** *and* **Isolet**: For these five datasets, classification errors are reported by Li (2010b) for every combination of \(J \in \{ 4,6,8,10,12,14,16,18,20 \}\) and \(v \in \{ 0.04,0.06,0.08,0.1 \}\) (except that \(v \in \{0.06,0.1\}\) for Isolet). The comparison with AOSO-LogitBoost is listed in Tables 11, 12, 13, 14 and 15.

Test classification errors on Letter

| | \(v=0.04\) | \(v=0.06\) | \(v=0.08\) | \(v=0.1\) |
|---|---|---|---|---|
| \(J=4\) | | | 122 | 119 |
| \(J=6\) | 112 | 107 | | |
| \(J=8\) | 104 | 102 | | |
| \(J=10\) | 101 | 100 | | |
| \(J=12\) | | 100 | 95 | 95 |
| \(J=14\) | | 98 | | 89 |
| \(J=16\) | 97 | 94 | 93 | |
| \(J=18\) | 95 | | 96 | 93 |
| \(J=20\) | 95 | 97 | 93 | 89 |

Test classification errors on Pendigits

| | \(v=0.04\) | \(v=0.06\) | \(v=0.08\) | \(v=0.1\) |
|---|---|---|---|---|
| \(J=4\) | 92 | 93 | 90 | 92 |
| \(J=6\) | 98 | 97 | 96 | 93 |
| \(J=8\) | 97 | 94 | 95 | 93 |
| \(J=10\) | 100 | 98 | 97 | 97 |
| \(J=12\) | 98 | 98 | 98 | 98 |
| \(J=14\) | 100 | 101 | 99 | 98 |
| \(J=16\) | 100 | 97 | 98 | 96 |
| \(J=18\) | 102 | 97 | 99 | 97 |
| \(J=20\) | 106 | 102 | 100 | 100 |

Test classification errors on Zipcode

| | \(v=0.04\) | \(v=0.06\) | \(v=0.08\) | \(v=0.1\) |
|---|---|---|---|---|
| \(J=4\) | | 108 | 114 | 107 |
| \(J=6\) | 101 | 102 | | |
| \(J=8\) | | | | 98 |
| \(J=10\) | | | 97 | |
| \(J=12\) | | 98 | | |
| \(J=14\) | 100 | 99 | | |
| \(J=16\) | 98 | | 99 | |
| \(J=18\) | | | | |
| \(J=20\) | | | 100 | |

Test classification errors on Isolet

| | \(v=0.04\) | \(v=0.06\) | \(v=0.08\) | \(v=0.1\) |
|---|---|---|---|---|
| \(J=4\) | – 56 | 55 | – 56 | |
| \(J=6\) | – 55 | 59 | – 54 | 58 |
| \(J=8\) | – 54 | 57 | – 53 | 60 |
| \(J=10\) | – 54 | 61 | – 55 | 62 |
| \(J=12\) | – 52 | 63 | – 54 | 64 |
| \(J=14\) | – 48 | 65 | – 54 | 60 |
| \(J=16\) | – 55 | 64 | – 57 | 62 |
| \(J=18\) | – 55 | 67 | – 53 | 62 |
| \(J=20\) | – 51 | 63 | – 56 | 65 |

Test classification errors on Optdigits

| | \(v=0.04\) | \(v=0.06\) | \(v=0.08\) | \(v=0.1\) |
|---|---|---|---|---|
| \(J=4\) | | 42 | | 41 |
| \(J=6\) | 43 | 45 | 44 | |
| \(J=8\) | 44 | 44 | 45 | 45 |
| \(J=10\) | 50 | 50 | 46 | 42 |
| \(J=12\) | 50 | 48 | 47 | 46 |
| \(J=14\) | 48 | 46 | 51 | 48 |
| \(J=16\) | 54 | 51 | 49 | 46 |
| \(J=18\) | 54 | 55 | 53 | 51 |
| \(J=20\) | 61 | 56 | 55 | 55 |

### 7.2 Convergence rate

Recall that we stop the boosting procedure when either the maximum number of iterations is reached or the training loss becomes small (i.e. the loss (1) is \(\le 10^{-16}\)). The fewer the trees added when boosting stops, the faster the convergence and the lower the time cost for both training and testing. We compare AOSO with ABC in terms of the number of trees added when boosting stops, using the ABC results available in Li (2010b, 2009a). Note that simply comparing the number of boosting iterations would be unfair to AOSO, since each iteration adds only one tree in AOSO but \(K-1\) trees in ABC.
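The stopping rule just described can be sketched as follows; the `step` callback standing in for one boosting iteration is hypothetical:

```python
def run_boosting(step, max_iters, tol=1e-16):
    """Run boosting rounds until the training loss returned by `step`
    drops to tol or max_iters is reached.  Returns the number of
    iterations actually performed."""
    for t in range(max_iters):
        loss = step()          # one boosting iteration; returns train loss
        if loss <= tol:
            break
    return t + 1
```

To convert iterations into tree counts for the comparison in the tables, multiply by 1 for AOSO and by \(K-1\) for ABC.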

\(\#\)trees added at convergence on selected datasets. \(R\) stands for the ratio AOSO/ABC

| | Mnist10k | M-Rand | M-Image | Letter15k | Letter4k | Letter2k |
|---|---|---|---|---|---|---|
| ABC | 7,092 | 15,255 | 14,958 | 45,000 | 20,900 | 13,275 |
| \(R\) | 0.7689 | 0.7763 | 0.8101 | 0.5512 | 0.5587 | 0.5424 |

\(\#\)trees added at convergence on Mnist10k for a number of \(J\)–\(v\) combinations. Each cell shows the \(\#\)trees for ABC followed by the ratio \(R\) = AOSO/ABC

| | \(v=0.04\) | \(v=0.06\) | \(v=0.1\) |
|---|---|---|---|
| \(J=4\) | 90,000 1.0 | 90,000 1.0 | 90,000 1.0 |
| \(J=6\) | 90,000 0.7740 | 63,531 0.7249 | 38,223 0.7175 |
| \(J=8\) | 55,989 0.7962 | 38,223 0.7788 | 22,482 0.7915 |
| \(J=10\) | 39,780 0.8103 | 27,135 0.7973 | 16,227 0.8000 |
| \(J=12\) | 31,653 0.8109 | 20,997 0.8074 | 12,501 0.8269 |
| \(J=14\) | 26,694 0.7854 | 17,397 0.8047 | 10,449 0.8160 |
| \(J=16\) | 22,671 0.7832 | 11,704 1.0290 | 8,910 0.8063 |
| \(J=18\) | 19,602 0.7805 | 13,104 0.7888 | 7,803 0.7933 |
| \(J=20\) | 17,910 0.7706 | 11,970 0.7683 | 7,092 0.7689 |
| \(J=24\) | 14,895 0.7514 | 9,999 0.7567 | 6,012 0.7596 |
| \(J=30\) | 12,168 0.7333 | 8,028 0.7272 | 4,761 0.7524 |
| \(J=40\) | 9,846 0.6750 | 6,498 0.6853 | 3,870 0.6917 |
| \(J=50\) | 8,505 0.6420 | 5,571 0.6448 | 3,348 0.6589 |

### 7.3 Comparison between K-LogitBoost and AOSO-LogitBoost

## 8 Conclusions

We present an improved LogitBoost, namely AOSO-LogitBoost, for multi-class classification. Compared with ABC-LogitBoost, our experiments suggest that our adaptive class pair selection technique results in lower classification error and faster convergence rates.

## Footnotes

- 2.
In Real AdaBoost.MH, such a second order approximation is not necessary (although possible, cf. Zou et al. 2008). Due to the special form of the exponential loss and the absence of a sum-to-zero constraint, there exists an analytical solution for the node loss (20), obtained by simply setting the derivative to \(\mathbf{0}\). Here also, the computation can be incremental/decremental. Since loss design and AdaBoost.MH are not our main interests, we do not discuss this further.

- 3.
Code and data are available at http://ivg.au.tsinghua.edu.cn/index.php?n=People.PengSun.


## Notes

### Acknowledgments

We appreciate Ping Li’s inspiring discussion and generous encouragement. We also thank the MLJ anonymous reviewers and the action editor for their valuable comments. This work was supported by National Natural Science Foundation of China (61020106004, 61225008) and an Australian Research Council Discovery Early Career Research Award (DE130101605). NICTA is funded by the Australian Government through the Department of Communications and the Australian Research Council through the ICT Centre of Excellence Program.

## References

- Bertsekas, D. P. (1982). *Constrained optimization and Lagrange multiplier methods*. Boston: Academic Press.
- Bottou, L., & Lin, C. J. (2007). Support vector machine solvers. In L. Bottou, O. Chapelle, D. DeCoste, & J. Weston (Eds.), *Large scale kernel machines* (pp. 301–320). Cambridge: MIT Press. http://leon.bottou.org/papers/bottou-lin-2006
- Freund, Y., & Schapire, R. (1995). A decision-theoretic generalization of on-line learning and an application to boosting. In *Computational learning theory* (pp. 23–37). New York: Springer.
- Friedman, J. (2001). Greedy function approximation: A gradient boosting machine. *The Annals of Statistics*, *29*(5), 1189–1232.
- Friedman, J., Hastie, T., & Tibshirani, R. (1998). Additive logistic regression: A statistical view of boosting. *Annals of Statistics*, *28*(2), 337–407.
- Jaynes, E. (1957). Information theory and statistical mechanics. *The Physical Review*, *106*(4), 620–630.
- Kégl, B., & Busa-Fekete, R. (2009). Boosting products of base classifiers. In *Proceedings of the 26th Annual International Conference on Machine Learning* (pp. 497–504). New York: ACM.
- Kivinen, J., & Warmuth, M. K. (1999). Boosting as entropy projection. In *Proceedings of the Twelfth Annual Conference on Computational Learning Theory* (pp. 134–144). New York: ACM.
- Lafferty, J. (1999). Additive models, boosting, and inference for generalized divergences. In *Proceedings of the Twelfth Annual Conference on Computational Learning Theory* (pp. 125–133).
- Larochelle, H., Erhan, D., Courville, A., Bergstra, J., & Bengio, Y. (2007). An empirical evaluation of deep architectures on problems with many factors of variation. In *Proceedings of the 24th International Conference on Machine Learning* (pp. 473–480). New York: ACM.
- Li, P. (2008). Adaptive base class boost for multi-class classification. arXiv preprint arXiv:0811.1250.
- Li, P. (2009a). ABC-Boost: Adaptive base class boost for multi-class classification. In *Proceedings of the 26th Annual International Conference on Machine Learning* (pp. 625–632). New York: ACM.
- Li, P. (2009b). ABC-LogitBoost for multi-class classification. arXiv preprint arXiv:0908.4144.
- Li, P. (2010a). An empirical evaluation of four algorithms for multi-class classification: MART, ABC-MART, Robust LogitBoost, and ABC-LogitBoost. arXiv preprint arXiv:1001.1020.
- Li, P. (2010b). Robust LogitBoost and adaptive base class (ABC) LogitBoost. In *Conference on Uncertainty in Artificial Intelligence*.
- Magnus, J. R., & Neudecker, H. (2007). *Matrix differential calculus with applications in statistics and econometrics* (3rd ed.). New York: Wiley.
- Masnadi-Shirazi, H., & Vasconcelos, N. (2010). Risk minimization, probability elicitation, and cost-sensitive SVMs. In *Proceedings of the International Conference on Machine Learning* (pp. 204–213).
- Reid, M. D., & Williamson, R. C. (2010). Composite binary losses. *The Journal of Machine Learning Research*, *11*, 2387–2422.
- Schapire, R., & Singer, Y. (1999). Improved boosting algorithms using confidence-rated predictions. *Machine Learning*, *37*(3), 297–336.
- Shen, C., & Hao, Z. (2011). A direct formulation for totally corrective multi-class boosting. In *Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)* (pp. 2585–2592).
- Shen, C., & Li, H. (2010). On the dual formulation of boosting algorithms. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, *32*(12), 2216–2231.
- Zou, H., Zhu, J., & Hastie, T. (2008). New multicategory boosting algorithms based on multicategory Fisher-consistent losses. *The Annals of Applied Statistics*, *2*(4), 1290–1306.