
1 Introduction

Multi-class and multi-label classification problems are nowadays characterized not only by large sample sizes and feature spaces, but also by a large number of labels. In application fields like image classification [12], text classification [8], online advertising [3], and video recommendation [23], it is not uncommon to deal with tens or hundreds of thousands [11], or even millions of labels [20].

Label tree classifiers are among the most efficient approaches for problems at this scale [2]. In this approach, a solution to the original problem is represented in the form of a hierarchy of classifiers, each of which is trained on a simpler subproblem. A prediction for a new example is then derived from the predictions of these (internal) classifiers, each of which corresponds to a node in the tree-like hierarchical structure; typically, each label in the original classification problem is uniquely represented by a path from the root to a leaf of that tree.

However, combining conventional training of the internal classifiers with greedy inference, namely, following a single root-to-leaf path in the tree, does not guarantee consistency of this approach [4, 10]. Thus, even perfect (zero regret) classifiers in each node of the tree do not imply a perfect (global) classification of new examples. There are two ways to remedy this problem: adjusting training and adjusting inference. The first idea is to modify the training of the internal classifiers so as to assure the consistency of greedy inference later on. The second approach, while training more conventionally, guarantees consistency by searching the tree-structure for an optimal prediction in a less greedy way.

The first idea is realized by the filter tree (FT) approach [4]. By constructing label trees in a bottom-up manner, an internal classifier can anticipate the decisions of its successor classifiers, and exploit this information to properly condition its own behavior on them. In the case of 0/1 loss, this is accomplished thanks to a specific filter technique, which removes examples from the training data on which successor classifiers made incorrect predictions. For this training procedure, a regret bound connecting the global performance with the average performance of node classifiers can be proved [4]. This bound can be generalized from 0/1 loss to any cost-based loss function, albeit at the price of a more expensive training procedure; ranking-based losses, which require the ordering of labels, cannot be tackled by FTs. Since inference can be done in a greedy way, the complexity of prediction is only logarithmic in the number of labels. More recently, the training of FTs has been further improved in the context of multi-label classification [17].

The second approach ensures consistency thanks to more sophisticated search of label trees in the inference phase [10, 16, 18]. To this end, probabilistic classifiers in each node of the tree are required, which allow for assessing the usefulness of different search directions. Label trees with probabilistic classifiers have already been considered in multi-class classification under the name of conditional probability trees [3] and nested dichotomies [14]. In multi-label classification, a similar approach has been referred to as probabilistic classifier chains [9]. The same concept also appears in neural networks and natural language processing under the name of hierarchical softmax [19]. In the following we unify all these approaches and jointly refer to them as probabilistic classifier trees (PCTs).

We restrict ourselves to binary label trees, which are especially natural for multi-label classification; here, each level of the binary tree directly corresponds to one label. Higher-order trees (in which nodes have more than two children) are often used in multi-class classification. This usually improves the predictive performance at the cost of an increase in prediction time. We also assume the tree structure to be given beforehand, or to have been induced using any of the methods developed for this purpose [2, 3, 23], and focus on the (orthogonal) problem of how training and prediction should be performed to ensure consistency (given the tree structure).

The main contribution of the paper is a regret bound for PCT in the case of 0/1 loss, which is expressed in terms of the search error and the Kullback-Leibler (KL) divergence (i.e., log-loss regret) of the internal classifiers. The regret bound implies the consistency of the method, a good “sanity check” for any learning algorithm. Its form quantifies a trade-off between the computational complexity and the statistical accuracy. Moreover, we show that under log-loss we do not theoretically pay any price in terms of performance for representing the joint distribution over classes by a tree structure. Our regret analysis significantly extends and improves the results of [3] for the estimation error of conditional probability trees expressed in terms of squared error loss. We also point out that the bound can be further generalized to ranking-based losses, e.g., recall at k. Furthermore, we generalize the tree search algorithms of [10, 18] to get an anytime \(A^*\)-like algorithm and study its theoretical guarantees, extending the previous results given in [10]. Our theoretical contributions are complemented by a comparison of PCTs with filter trees, both conceptually and experimentally.

The paper is organized as follows. We formally state the problem in Sect. 2. Section 3 describes PCTs and gives a theoretical analysis of the generalized tree search algorithm. In Sect. 4, we prove the regret bound for 0/1 loss. Section 5 compares PCTs with other label tree approaches, particularly with conditional probability and filter trees. Section 6 discusses the use of PCTs for predicting top-k labels and its extension to multi-label classification. Section 7 presents experimental results, prior to concluding the paper in Sect. 8.

2 Problem Statement

We formalize our problem in the setting of multi-class classification. Let \((\varvec{x}, y)\) be an example coming from a probability distribution \(P(\varvec{X}= \varvec{x}, Y = y)\) (later denoted \(P(\varvec{x}, y)\)) on \(\mathcal {X} \times \mathcal {Y}\), where \(\varvec{x}\in \mathcal {X}= \mathbb {R}^d\) and \(y \in \mathcal {Y}= \{1, \ldots , m\}\). A classifier h predicts a label \(\hat{y} = h(\varvec{x}) \in \mathcal {Y}\) for each \(\varvec{x}\in \mathcal {X}\). The prediction accuracy of h can be measured in terms of 0/1 loss:

$$ \ell _{0/1}(y,h(\varvec{x})) = [\![y \ne h(\varvec{x}) ]\!] $$

We are interested in minimizing the expected loss, also referred to as the risk:

$$ L_{\mathrm {0/1}}(h) = \mathbb {E}_{(\varvec{x},y) \sim P} \left[ \ell _{\mathrm {0/1}}(y,h(\varvec{x})) \right] = \int _{\mathcal {X} \times \mathcal {Y}} [\![y \ne h(\varvec{x}) ]\!] \, dP(\varvec{x}, y) $$

The Bayes classifier

$$ h^* = \mathop {{{\mathrm{\arg \min }}}}_{h} L_{\mathrm {0/1}}(h) $$

minimizes the risk among all possible classifiers. While \(h^*\) may not be unique in general, the risk of \(h^*\), denoted \(L^*_{0/1}\), is unique, and is called the Bayes risk. Decomposing the risk over classes, i.e., writing \(L_{\mathrm {0/1}}(h)\) in the form

$$ L_{\mathrm {0/1}}(h) = \int _{\mathcal {X}} \bigg ( \underbrace{\sum _{y \in \mathcal {Y}} [\![y \ne h(\varvec{x}) ]\!] P(y|\varvec{x})}_{=1-P(h(\varvec{x})|\varvec{x})} \bigg ) d P(\varvec{x})\,, $$

reveals that \(h^*\) minimizes risk in a pointwise manner, i.e., for every \(\varvec{x}\),

$$ h^*(\varvec{x}) =\mathop {{{\mathrm{\arg \min }}}}_{y \in \mathcal {Y}} \left\{ 1-P(y|\varvec{x})\right\} =\mathop {{{\mathrm{\arg \max }}}}_{y \in \mathcal {Y}} P(y \, | \,\varvec{x}) \,. $$

Given a classifier h, the regret of h is defined as

$$\begin{aligned} \mathrm {reg}_{0/1}(h) ~=~ L_{0/1}(h) - L_{0/1}^* ~=~ \int _{\mathcal {X}} \Big (P(h^*(\varvec{x})|\varvec{x}) - P(h(\varvec{x})|\varvec{x}) \Big ) d P(\varvec{x}) \,. \end{aligned}$$
(1)

The regret quantifies the suboptimality of h compared to the optimal classifier \(h^*\). The goal is to train a classifier h with a small regret, ideally equal to zero.

In the following, we assume h to be represented as a label tree classifier. To this end, we encode the labels \(\{1, \ldots , m\}\) using a prefix code. Any such code can be represented by a tree with 0/1 splits. Each path from the root to a leaf node then corresponds to a code word. Recall that codes of fixed length are also prefix codes. Figure 1 shows two examples of coding trees for multi-class classification with 4 classes. Under this coding, we represent each label y by a binary vector \(\varvec{y}= (y_1, \ldots , y_l)\), where l is the maximum length of the code. We denote the set of all code words by \(\mathcal {C}\). As another special case, consider the problem of multi-label (instead of multi-class) classification, where the goal is to predict the set of labels assigned to a given instance \(\varvec{x}\). Such a set can be represented by a binary vector \(\varvec{y}= (y_1, \ldots , y_m)\), which in turn can be used as a prefix code.

Fig. 1. Different binary codes in multi-class classification.

In the label tree approach, we put a binary classifier in each non-leaf node of the tree. An internal node can be uniquely identified by the partial code word \(\varvec{y}^i = (y_1, \ldots , y_i)\). We denote the root node by \(\varvec{y}^0\), which is an empty vector (without any elements). The final prediction is determined by a sequence of decisions of internal classifiers. In the next section, we present a specific instance of the label tree approach that uses probabilistic classifiers in internal nodes of the tree.

3 Probabilistic Classifier Trees

Probabilistic classifier trees (PCTs) are designed to estimate probabilities \(P(y\, | \,\varvec{x}) \) by following a path from the root to a leaf node, which corresponds to a code word \(\varvec{y}= (y_1, \ldots , y_l)\) assigned to label \(y \in \mathcal {Y}\). Recalling the chain rule of probability, the process corresponds to computing

$$\begin{aligned} P(y \, | \,\varvec{x}) = P(\varvec{y}\, | \,\varvec{x}) = \prod _{i=1}^l P(y_i | {\varvec{y}}^{i-1}, \varvec{x}) \,, \end{aligned}$$
(2)

where \(P(y_i | {\varvec{y}}^{i-1}, \varvec{x})\) are probabilities of \(y_i \in \{0,1\}\), estimated in non-leaf nodes \(\varvec{y}^{i-1}\). In the next two subsections, training and inference (classification of new examples) for PCT will be discussed in more detail.

3.1 Training

Training of PCT naturally decomposes into learning problems over non-leaf nodes of the tree. In each node \(\varvec{y}^{i-1}\), the task is to train a probabilistic classifier (e.g., logistic regression) to estimate \(P(y_i | {\varvec{y}}^{i-1}, \varvec{x})\).

Looking at PCTs as a reduction technique, it is worth mentioning that their training complexity can be much lower than that of the 1-vs-all approach, since each example \((\varvec{x}, y)\) is used in only l instead of m binary problems, where l is the height of the tree (i.e., \(l = \lceil \log _2 m \rceil \) if the tree is balanced). To further improve the training time complexity, one can use online learning methods, such as stochastic gradient descent [5]. Moreover, internal classifiers in PCT can be trained independently of each other, thereby allowing for a massive parallelization of the training procedure. Let us also remark that the learning process can be defined as a single task; this is the so-called one-classifier trick [4], in which a node indicator is used as an additional feature. Alternatively, one can use a separate task for each level of the tree. This approach is used in multi-label classification, as will be discussed in Sect. 6.

3.2 Inference

The classification procedure in PCTs is more involved. To begin with, note that a probability estimate \(Q(y \, | \,\varvec{x})\) for any label y (given instance \(\varvec{x}\)) is obtained quite easily, simply by following the corresponding path in the tree and applying the chain rule:

$$ Q(y \, | \,\varvec{x}) = Q(\varvec{y}\, | \,\varvec{x}) = \prod _{i=1}^l Q(y_i | {\varvec{y}}^{i-1}, \varvec{x}) $$

However, being interested in minimization of 0/1 loss, we actually seek to find

$$\begin{aligned} \hat{\varvec{y}}^* =\mathop {{{\mathrm{\arg \max }}}}_{\varvec{y}\in \mathcal {C}} Q(\varvec{y}\, | \,\varvec{x}) \,, \end{aligned}$$
(3)

preferably without computing the probability of each label first. A simple idea is to follow a single path in the tree, starting in the root and always choosing the branch \(y_i \in \{0,1\}\) for which \(Q(y_i | {\varvec{y}}^{i-1}, \varvec{x}) > 0.5\). However, while being efficient, this approach is not guaranteed to find the optimal solution [4, 10]. Better inference methods have been presented in recent years, based on search algorithms such as uniform-cost search [10], beam search [16], and \(A^*\) [18].

All three approaches allow for trading complexity against optimality, and hence for using PCTs in an anytime fashion, thanks to a hyper-parameter \(\epsilon \). This parameter controls the degree of optimality, i.e., of finding the true loss minimizer (3), as a function of the runtime: the algorithm finds a solution \(\hat{\varvec{y}}_\epsilon \) whose conditional probability \(Q(\hat{\varvec{y}}_\epsilon \, | \,\varvec{x})\) is not much lower than that of the optimal solution \(\hat{\varvec{y}}^*\) defined in Eq. 3. In the analysis that follows, we will use this property to give a formal bound on the error made by such inference algorithms, with a particular focus on uniform-cost and \(A^*\) search. An extension of the analysis to beam search is straightforward and omitted due to lack of space. The pseudocode in Algorithm 1 unifies the approaches of [10, 18]. This general algorithm, which we denote \(h_\epsilon (\varvec{x})\), is a variant of \(A^*\). It fulfills the anytime property, i.e., the search can be stopped at any time and the algorithm will deliver a valid though possibly suboptimal solution.

Algorithm 1. The \(\epsilon \)-approximate \(A^*\)-like search over the label tree (pseudocode; see the description below).

Recall that each node in the tree is uniquely defined by a path from the root to this node, i.e., by the partial code word \(\varvec{y}^{i}\). We use \(\varvec{v}\) to denote the node currently visited by the algorithm, and associate with this node the following value:

$$ E(\varvec{v} \, | \,\varvec{x}) = E(\varvec{y}^{i} \, | \,\varvec{x}) = Q(\varvec{y}^{i} \, | \,\varvec{x}) \times H(\varvec{y}^{i} \, | \,\varvec{x}) $$

This value can be interpreted as an approximation of the maximal value of \(Q(\varvec{y} \, | \,\varvec{x})\), in which \(Q(\varvec{y}^i \, | \,\varvec{x})\) is the part of the path that can be computed when moving from the root to node \(\varvec{v}\), and \(H(\varvec{y}^i \, | \,\varvec{x})\) is a heuristic that optimistically guesses the part of the path that has not yet been computed (for the guarantees to hold, \(E(\varvec{y}^i \, | \,\varvec{x})\) has to be greater than or equal to the maximal value of \(Q(\varvec{y}\, | \,\varvec{x})\) over all leaves \(\varvec{y}\) extending \(\varvec{y}^i\), i.e., the heuristic must be admissible). \(Q(\varvec{y}^i \, | \,\varvec{x})\) can be computed recursively as follows: \(Q(\varvec{y}^{0} \, | \,\varvec{x}) = 1\) and

$$\begin{aligned} Q(\varvec{y}^i \, | \,\varvec{x}) = Q(y_i | \varvec{y}^{i-1}, \varvec{x}) \times Q(\varvec{y}^{i-1} \, | \,\varvec{x}) \, . \end{aligned}$$
(4)

In [18], a procedure for computing \(H(\varvec{y}^i \, | \,\varvec{x})\) is proposed for the specific case of logistic regression as a base learner, whereas the heuristic is simply \(H(\varvec{y}^i \, | \,\varvec{x}) = 1\) in the uniform-cost search used in [10]. The former approach has the advantage of providing a more accurate estimate of the maximal \(Q(\varvec{y}\, | \,\varvec{x})\), albeit at an additional computing cost, while the latter approach provides a rougher estimate at no additional cost. Interestingly, as shown in the experiments in [18], the former approach is still more expensive in terms of the total search cost than the latter.

In a nutshell, Algorithm 1 starts from the root of the label tree, which is the single element of priority list \(\mathcal {Q}\), sorted in descending order of E. In every iteration, the top element of the list is popped and the children \(\varvec{v}_0\) and \(\varvec{v}_1\) of the corresponding node \(\varvec{v}\) are visited. \(E(\varvec{y}^i \, | \,\varvec{x})\) is then recursively computed for the children of node \(\varvec{v}\), which are added to the list if this quantity exceeds the threshold \(\epsilon = 2^{-c}\) with \(1 \le c \le l\), where l is the maximal length of a path in the tree. They are inserted into the list at the appropriate position, so that the order imposed by \(E(\varvec{y}^i \, | \,\varvec{x})\) is respected. The first while-loop of the algorithm stops in two situations: (i) when the element popped from the list \(\mathcal {Q}\) corresponds to a leaf of the tree, or (ii) when the list \(\mathcal {Q}\) is empty. In the former case, the label corresponding to the leaf is returned; in the latter case, greedy search is applied, starting from the top node of list \(\mathcal {K}\), to complete a root-to-leaf path. This list, also sorted in descending order of E, contains nodes for which none of their children has been added to \(\mathcal {Q}\). The use of list \(\mathcal {K}\) ensures that by decreasing the value of \(\epsilon \), the algorithm will always find a solution that is not worse than the solution that would be found with a greater \(\epsilon \).

Algorithm 1 enjoys strong theoretical guarantees. Assuming the cost for computing \(H(\varvec{y}^i \, | \,\varvec{x})\) to be constant, the following result immediately follows from a theorem proved in [10].

Theorem 1

Let \(1 \le c \le l\). Algorithm 1 with \(\epsilon = 2^{-c}\) needs at most \(\mathcal {O}(l \epsilon ^{-1})\) iterations to find a prediction \(h_\epsilon (\varvec{x}) = \hat{\varvec{y}}_\epsilon \) such that

$$ Q(\hat{\varvec{y}}^* \, | \,\varvec{x}) - Q(\hat{\varvec{y}}_\epsilon \, | \,\varvec{x}) \le \epsilon - 2^{-l}\,. $$

From the theorem, we see that the quality of the solution found by the algorithm improves with the length of the running time. Consequently, the algorithm will always find the optimal solution, provided its probability mass is greater than \(\epsilon \). Reformulating the above, we can say that the algorithm finds the solution in time linear in \(1/q_{\max }\), where \(q_{\max }\) is the probability mass of the best solution in the estimated distribution Q. For problems with low noise (high values of \(q_{\max }\)), this method should work very fast.

The theorem also implies that greedy search, which corresponds to the algorithm with \(\epsilon = 0.5\), has very poor guarantees, approaching the bound of 0.5 as \(m \rightarrow \infty \).

4 Regret Bounds for PCT

In this section, we are concerned with the generalization ability of the PCT classifier, measured by means of the regret (1). Assume for a moment that \(Q(\cdot |\varvec{x})\), the label distribution produced by PCT, coincides with the true conditional distribution \(P(\cdot |\varvec{x})\) for every \(\varvec{x}\). Then, if the \(\epsilon \)-approximate inference algorithm is used for classification, Theorem 1 implies that the regret of the PCT classifier is at most \(\epsilon \), i.e., the expected classification error of PCT is at most \(\epsilon \) larger than the expected classification error of the Bayes classifier.

It is, however, unrealistic to assume that PCT is able to perfectly match the true data distribution, hence \(Q(\cdot |\varvec{x})\) and \(P(\cdot |\varvec{x})\) will differ in general. Thus, the question arises whether the expected classification error of PCT is still not much worse than the expected classification error of the Bayes classifier if \(Q(\cdot |\varvec{x})\) and \(P(\cdot |\varvec{x})\) do not coincide, but are close to each other in some sense. This section presents an affirmative answer to this question, delivering a regret bound on the classification error that takes into account the predictive performance of the internal classifiers. More precisely, we bound the PCT regret for 0/1 loss in terms of the difference between Q and P, quantified in terms of log-loss regret.

We start with a general definition of the log-loss. Consider a problem of estimating a probability distribution on some outcome space \(\mathcal {S}\). The log-loss of probability estimate \(Q(\cdot )\) on \(\mathcal {S}\) when the observed outcome is \(y \in \mathcal {S}\) is given by

$$ \ell _{\log }(y,Q) = -\log Q(y)\,. $$

The log-loss is by far the most popular measure for quantifying the accuracy of probabilistic predictions, and plays an important role in information theory, data compression, and statistics [7] (we briefly discuss an alternative loss function, the squared loss, in Sect. 5). The log-loss risk is the expected log-loss of \(Q(\cdot )\):

$$ L_{\log }(Q) = \mathbb {E}_{y \sim P} [\ell _{\log }(y,Q)]\,, $$

where \(P(\cdot )\) is the true distribution of y. The log-loss is a strictly proper loss, which means that the unique minimizer of the risk is achieved at \(Q(\cdot ) \equiv P(\cdot )\) (see, e.g., [21]). We thus define the log-loss regret as:

$$ \mathrm {reg}_{\log }(Q) = L_{\log }(Q) - L_{\log }(P) = \mathbb {E}_{y \sim P} \left[ \log \frac{P(y)}{Q(y)} \right] = D(P \Vert Q), $$

where \(D(\cdot \Vert \cdot )\) is the Kullback-Leibler (KL) divergence.

We now turn back to PCTs. Let us first fix an instance \(\varvec{x}\in \mathcal {X}\) and consider the distribution over code words \(\varvec{y}\in \mathcal {C}\). There are two ways in which log-loss can be used in this setting:

  • To measure the quality of the estimate of the joint distribution of labels given \(\varvec{x}\), \(Q(\varvec{y}|\varvec{x})\), i.e., the outcome space is \(\mathcal {S} = \mathcal {C}\), and the log-loss is \(\ell _{\log }(\varvec{y},Q(\cdot | \varvec{x})) = -\log Q(\varvec{y}|\varvec{x})\). The log-loss regret is then the KL divergence between true joint conditional distribution \(P(\varvec{y}| \varvec{x})\) and its estimate \(Q(\varvec{y}| \varvec{x})\), \(\mathrm {reg}_{\log }(Q(\cdot |\varvec{x})) = D(P(\cdot |\varvec{x}) \Vert Q(\cdot | \varvec{x}))\).

  • To measure the quality of individual classifiers in each node of the tree. Given a node \(\varvec{y}^{i-1}=(y_1,\ldots ,y_{i-1})\), the probability estimate for label \(y_i \in \{0,1\}\) at this node is \(Q(\cdot |\varvec{y}^{i-1},\varvec{x})\). Thus, the outcome space is \(\mathcal {S} = \{0,1\}\), and \(\ell _{\log }(y_i,Q(\cdot | \varvec{y}^{i-1}, \varvec{x})) = -\log Q(y_i|\varvec{y}^{i-1},\varvec{x})\). The log-loss regret is then \(\mathrm {reg}_{\log }(Q(\cdot |\varvec{y}^{i-1}, \varvec{x})) = D(P(\cdot |\varvec{y}^{i-1}, \varvec{x}) \Vert Q(\cdot |\varvec{y}^{i-1}, \varvec{x}))\).

Both ways described above turn out to be equivalent. Indeed, we have

$$\begin{aligned} \ell _{\log }(\varvec{y},Q(\cdot | \varvec{x})) = -\log Q(\varvec{y}|\varvec{x})&= \sum _{i=1}^l - \log Q(y_i | \varvec{y}^{i-1},\varvec{x}) \\&= \sum _{i=1}^l \ell _{\log }(y_i,Q(\cdot | \varvec{y}^{i-1}, \varvec{x}))\,, \end{aligned}$$

so that the log-loss of the joint distribution is equal to the sum of log-losses of individual node classifiers along the path from the root to leaf \(\varvec{y}\). Similarly,

$$\begin{aligned} \mathrm {reg}_{\log }(Q(\cdot |\varvec{x}))&= \mathbb {E}_{\varvec{y}\sim P(\cdot |\varvec{x})} \left[ \log \frac{P(\varvec{y}|\varvec{x})}{Q(\varvec{y}| \varvec{x})} \right] = \mathbb {E}_{\varvec{y}\sim P(\cdot |\varvec{x})} \bigg [\sum _{i=1}^l \log \frac{P(y_i |\varvec{y}^{i-1},\varvec{x})}{Q(y_i | \varvec{y}^{i-1}, \varvec{x})} \bigg ] \nonumber \\&= \mathbb {E}_{\varvec{y}\sim P(\cdot |\varvec{x})} \bigg [\sum _{i=1}^l \mathrm {reg}_{\log }(Q(\cdot |\varvec{y}^{i-1}, \varvec{x}))\bigg ], \end{aligned}$$
(5)

i.e., the log-loss regret of the joint distribution is equal to the sum of the regrets of node classifiers along the random path from the root to leaf \(\varvec{y}\), where \(\varvec{y}\) is drawn from \(P(\cdot |\varvec{x})\). This basically expresses the chain rule for KL divergence [7]. The consequence of the above is that under log-loss we theoretically do not pay any price in terms of performance for representing the joint distribution by a tree structure.

We are now ready to present the main result of this section, which states that the 0/1-regret of the PCT classifier is bounded by means of the sum of log-loss regrets along a random path from the root to the leaf (or, equivalently, by the log-loss regret of the joint distribution) and the search error \(\epsilon \) of the inference procedure.

Theorem 2

Consider PCT, which estimates the probability \(Q(\cdot |\varvec{y}^{i-1},\varvec{x})\) in each non-leaf node \(\varvec{y}^{i-1}\), and let \(h_\epsilon \) be the classifier that, for any \(\varvec{x}\), outputs \(\hat{\varvec{y}}_\epsilon \) found by the \(\epsilon \)-approximate inference procedure (Algorithm 1). Then, for any distribution \(P\),

$$ \mathrm {reg}_{0/1}(h_\epsilon ) \le \sqrt{2 \mathrm {reg}_{\log }(Q)} + \epsilon - 2^{-l}, $$

where \(\mathrm {reg}_{\log }(Q) = \mathbb {E}_{(\varvec{x},\varvec{y}) \sim P}\left[ \sum _{i=1}^l \mathrm {reg}_{\log }(Q(\cdot | \varvec{y}^{i-1}, \varvec{x})) \right] \) is the expected sum of regrets of the internal classifiers along a path from the root to the leaf.

Proof

We first condition everything on a fixed \(\varvec{x}\). Let \(\varvec{y}^* = {{\mathrm{\arg \max }}}_{\varvec{y}} P(\varvec{y}|\varvec{x})\) be the mode of \(P(\cdot |\varvec{x})\), and let \(\hat{\varvec{y}}_\epsilon = h_\epsilon (\varvec{x})\) be the output of Algorithm 1 for input \(\varvec{x}\). Moreover, we let \(\hat{\varvec{y}}^* = {{\mathrm{\arg \max }}}_{\varvec{y}} Q(\varvec{y}|\varvec{x})\) denote the mode of \(Q(\cdot |\varvec{x})\), and note that from Theorem 1,

$$\begin{aligned} Q(\hat{\varvec{y}}^*|\varvec{x}) - Q(\hat{\varvec{y}}_\epsilon |\varvec{x}) \le \epsilon - 2^{-l}. \end{aligned}$$
(6)

According to (1), the 0/1-regret of \(\hat{\varvec{y}}_\epsilon \) conditioned on \(\varvec{x}\) is given by

$$ \mathrm {reg}_{0/1}(\hat{\varvec{y}}_\epsilon ) = P(\varvec{y}^*|\varvec{x}) - P(\hat{\varvec{y}}_\epsilon |\varvec{x}). $$

Note that the regret is 0 if \(\varvec{y}^* = \hat{\varvec{y}}_\epsilon \), hence we assume \(\varvec{y}^* \ne \hat{\varvec{y}}_\epsilon \) in what follows. From the definition of \(\hat{\varvec{y}}^*\), \(Q(\hat{\varvec{y}}^*|\varvec{x}) - Q(\varvec{y}^*|\varvec{x}) \ge 0\), which together with (6) gives \(Q(\hat{\varvec{y}}_\epsilon |\varvec{x}) - Q(\varvec{y}^*|\varvec{x}) + \epsilon - 2^{-l} \ge 0\). Hence, we obtain the upper bound

$$\begin{aligned} \mathrm {reg}_{0/1}(\hat{\varvec{y}}_\epsilon )&~\le ~ \Big (P(\varvec{y}^*|\varvec{x}) - Q(\varvec{y}^*|\varvec{x}) \Big ) + \Big (Q(\hat{\varvec{y}}_\epsilon |\varvec{x}) - P(\hat{\varvec{y}}_\epsilon |\varvec{x}) \Big ) + \epsilon - 2^{-l} \\&~\le ~ \left| P(\varvec{y}^*|\varvec{x}) - Q(\varvec{y}^*|\varvec{x}) \right| + \left| Q(\hat{\varvec{y}}_\epsilon |\varvec{x}) - P(\hat{\varvec{y}}_\epsilon |\varvec{x}) \right| + \epsilon - 2^{-l} \\&~\le ~ \sum _{\varvec{y}\in \mathcal {C}} \left| P(\varvec{y}|\varvec{x}) - Q(\varvec{y}|\varvec{x}) \right| + \epsilon - 2^{-l}, \end{aligned}$$

where the last inequality is from \(\varvec{y}^* \ne \hat{\varvec{y}}_\epsilon \). We now make use of Pinsker’s inequality

$$ \frac{1}{2} \sum _{\varvec{y}\in \mathcal {C}} \big | P(\varvec{y}\, | \,\varvec{x}) - Q(\varvec{y}\, | \,\varvec{x}) \big | \le \sqrt{\frac{1}{2} D(P(\cdot \, | \,\varvec{x}) \Vert Q(\cdot \, | \,\varvec{x}))} \,, $$

which together with (5) implies

$$\begin{aligned} \mathrm {reg}_{0/1}(\hat{\varvec{y}}_\epsilon ) \le \sqrt{2 \mathbb {E}_{\varvec{y}\sim P(\cdot |\varvec{x})} \bigg [\sum _{i=1}^l \mathrm {reg}_{\log }(Q(\cdot |\varvec{y}^{i-1}, \varvec{x}))\bigg ]} + \epsilon - 2^{-l}. \end{aligned}$$
(7)

Note that the 0/1-regret of \(h_\epsilon \), \(\mathrm {reg}_{0/1}(h_\epsilon )\), is just the expectation of the left-hand side of (7) with respect to \(\varvec{x}\). Thus, taking expectation on both sides of (7), and using \(\mathbb {E} \left[ \sqrt{\cdot } \right] \le \sqrt{\mathbb {E}\,\left[ \cdot \right] }\) on the right-hand side (which is Jensen’s inequality applied to the concave function \(x \mapsto \sqrt{x}\)) gives

$$\begin{aligned} \mathrm {reg}_{0/1}(h_\epsilon )&\le \sqrt{2 \mathbb {E}_{(\varvec{x},\varvec{y}) \sim P} \bigg [\sum _{i=1}^l \mathrm {reg}_{\log }(Q(\cdot |\varvec{y}^{i-1}, \varvec{x}))\bigg ]} + \epsilon - 2^{-l} \\&= \sqrt{2 \mathrm {reg}_{\log }(Q)} + \epsilon - 2^{-l}\,. \end{aligned}$$

   \(\square \)

Theorem 2 states that if the log-loss regret of the node classifiers is small, the resulting \(\epsilon \)-approximate classifier will be close to the Bayes classifier in terms of 0/1 loss. This suggests using node classifiers that minimize log-loss on the training sample, examples of which include logistic regression, Gradient Boosting Machines, deep neural networks, and many others. One can show that the square-root dependence in the bound of Theorem 2 cannot be improved in general: when the tree consists of a single internal node (the root), our bound essentially specializes to the bound in [1], which also exhibits a square-root dependence.

5 Relation to Other Label Tree Approaches

5.1 Conditional Probability Trees

Conditional probability trees (CPTs) [3] estimate a conditional probability distribution \(P(y | \varvec{x})\) in the multi-class setting and have the same structure as PCTs. What distinguishes this approach from ours is that CPTs are used for probability estimation, with squared loss \(\ell _{\mathrm {sq}}(y_i,Q(\cdot | \varvec{y}^{i-1}, \varvec{x})) = \left( y_i - Q(y_i|\varvec{y}^{i-1},\varvec{x})\right) ^2\) as a performance measure, and hence there is no inference phase to determine the mode of the conditional distribution. The main result in [3] relates the squared loss regret on the joint distribution to the expected squared loss over the nodes of the tree. This result is analogous to the identity (5), except that an additional \(O(\sqrt{l})\) factor appears in the squared loss bound. Moreover, no result analogous to Theorem 2 is given that would relate the expected squared loss regret to the 0/1 classification regret.

In fact, we can show a lower bound on the 0/1 regret in terms of the expected squared loss, which is worse than our bound by a factor of \(\varOmega (\sqrt{l})\). To be more precise, one can show that for any \(l>2\), there exists a true distribution P and an estimate Q with the following property: even if the inference algorithm identifies the mode of the distribution exactly, it holds that \(\mathrm {reg}_{0/1}(h_\epsilon ) > \sqrt{l \, \mathrm {reg}_{\mathrm {sq}}(Q)}\), where \(\mathrm {reg}_{\mathrm {sq}}(Q)\) is the corresponding regret with log-loss replaced by squared loss. In other words, using squared loss yields a bound for the classification error that is at least a factor \(\varOmega (\sqrt{l})\) worse than the bound we obtained for log-loss.

5.2 Filter Trees

The filter tree (FT) approach [4] is the first label tree algorithm for which a regret bound for the classification error has been proved. Interestingly, the specific training procedure used in FTs ensures that the greedy classification procedure is sufficient for obtaining consistent predictions.

FT uses the same tree structure as PCT, but with binary classifiers instead of class probability estimators in the non-leaf nodes of the tree. The method follows a bottom-up strategy, which can be interpreted as a single elimination tournament on the set of labels. A classifier in node \(\varvec{y}^{i-1}\) is trained to predict \(y_{i}\), but FT implicitly transforms the underlying distribution of examples in the node. The transformation for 0/1 loss relies on filtering out all training examples that have been misclassified by successor classifiers on a path to a leaf. The learning algorithm starts with classifiers on the lowest non-leaf level of the tree. The correctly classified examples are then moved upward to nodes one level above. This process is repeated until the root node is reached.

In [4], a regret bound for 0/1 loss has been proved that is conceptually similar to the one given in Theorem 2. The difference is that the right-hand side of the bound is expressed in terms of the 0/1 loss of the binary classifiers in non-leaf nodes. Therefore, these two bounds are not directly comparable.

Another advantage of FTs is that they can be used with any cost-based loss function. An appropriate bound has also been proved in [4]. The classification procedure still follows a greedy search, but training is more demanding. It requires weighting of examples, the use of cost-sensitive learners, and each training example is generally used for training every internal classifier.

6 Extensions of PCTs

Since PCT estimates the entire conditional distribution over labels, it can be used with any loss function. This comes with no additional cost during training, but may lead to very costly inference. In fact, inference can be performed efficiently only for certain losses, such as the 0/1 loss discussed in Sect. 3.2 and some ranking-based loss functions. As an example, consider recall at the kth position, defined as

$$ R_{@k}(\varvec{y},\varvec{x},\mathcal {Y}_k) = [\![\varvec{y}\in \mathcal {Y}_k ]\!]\,, $$

where \(\mathcal {Y}_k\) is the set of k labels predicted for \(\varvec{x}\). One can easily verify that an optimal \(\mathcal {Y}_k\) should contain the k labels with the largest \(P(y \, | \,\varvec{x})\). This can be approximated by the k labels with the largest \(Q(y \, | \,\varvec{x})\), which are easily obtained by PCT with a small extension of the \(\epsilon \)-approximate algorithm: it is enough to continue the search procedure until k leaves are visited. Moreover, the bound in Theorem 2 can be easily extended to this case.

As already mentioned, PCTs can also be used in multi-label classification. In this case, the tree is of height m and is fully balanced. Each path from the root to a leaf corresponds to one of the \(2^m\) possible label combinations. In principle, PCT contains a single classifier in each non-leaf node. In the multi-label case, storing \(2^m-1\) classifiers for large m is not feasible. One can, however, follow a trick used in probabilistic classifier chains [9] and condensed filter trees [17], which relies on using one binary classifier per tree level. In other words, prediction of the ith label corresponds to the prediction made by the classifier on level i, with additional features that indicate the given node of the tree.

7 Experimental Results

We empirically evaluate PCTs and FTs in two scenarios: multi-label classification (MLC) and multi-class classification (MCC). We test the algorithms in terms of 0/1 loss and the computational costs of their training and testing procedures. For PCTs, we additionally report \(R_{@k}\).

We conduct experiments on 3 multi-class and 3 multi-label datasets. Table 1 provides a summary of basic statistics of the datasets. Notice that the number of leaf nodes is equal to m (the number of labels) in the case of multi-class problems, and \(2^m\) (the number of all possible label combinations) in the case of multi-label problems. We therefore use multi-label datasets with up to around 100 labels. For datasets with a greater number of labels, the 0/1 loss is usually very close to 1. We use the original split into a training and test set if available; otherwise, we use 90/10 train/test splits. For the ILSVR2010 dataset, we use the visual code words (sbow) vectors provided by the organizers of the challenge. Features were generated on the basis of the guidance contained in the ILSVR development kit.

Table 1. Multi-class (MCC) and multi-label (MLC) datasets and their properties: the number of training (#train) and test (#test) examples, the number of labels (m) and features (d).

7.1 Implementation

We carefully implemented PCTs and FTs in Java. As internal classifiers, we use \(L_2\)-regularized linear logistic regression trained by a variant of stochastic gradient descent (SGD) introduced in [13]. To deal with a large number of weights, we use feature hashing [22] shared over all tree nodes, with a hash space of size up to \(2^{24}\). We use a random complete binary tree to code class labels in the MCC scenarios and train a classifier in each node of the tree. For MLC problems, we take the original order of the labels to obtain the code words. We use one classifier per tree level. We tune the hyper-parameters of SGD in a simple 80/20 validation on the training set, applying an off-the-shelf hyper-parameter optimizer [15] with a wide range of parameters. We tune PCTs to optimize the log-loss, as suggested by our theoretical analysis. FTs are tuned to perform well on 0/1 loss.

We use PCTs with the \(\epsilon \)-approximate inference algorithm with different values of \(\epsilon \in \{0, 0.25, 0.5\}\). The variant with \(\epsilon =0.5\) corresponds to greedy search, while the algorithm with \(\epsilon =0\) will always find the optimal solution, but may visit all nodes of the tree in the worst case (in fact, \(\epsilon \) should be set to \(2^{-l}\) instead of 0 to be concordant with the description of the algorithm; to keep the notation simple, we use 0 to indicate the smallest possible value of \(\epsilon \) for a given dataset).

7.2 Results

The results are given in Table 2. We can observe that PCTs improve with decreasing value of \(\epsilon \). PCT with \(\epsilon = 0.5\) obtains worse results than FT, which agrees with the theory: the filtering of misclassified examples during FT training improves the performance of greedy inference. For \(\epsilon = 0.25\), the results are already very competitive with FT. For \(\epsilon = 0\), PCT consistently outperforms FT, but the difference is not always large.

Table 2. Experimental results for 0/1 loss and 1-\(R_{@5}\) (both in \(\%\)), train (\(t_{trn}\)) and test (\(t_{test}\)) running times (in seconds), and the average (A) number of inner products per test example. The Top 1 column indicates the results for top-1 prediction, while column Top 5 gives the results for top-5 prediction (only for PCT with \(\epsilon < 0.5\)). The best results are indicated in bold (except for wall-clock times, which can be affected by many factors). The subscript of PCT indicates the value of \(\epsilon \).

From a computational perspective, FTs achieve better performance. The training time of both approaches is very similar, but the testing time is in favor of FTs (and PCTs with \(\epsilon =0.5\)). To give deeper insight into the time costs, we also report the average number of inner products computed by internal classifiers per test example. Interestingly, PCT with \(\epsilon =0\) always finds the solution in a reasonable time. Its testing time is never longer than three times that of FT. Similarly, the number of inner products is only up to three times greater than that of FT or PCT with \(\epsilon =0.5\).

Recall at the kth position (\(R_{@k}\)) can be measured only for PCTs. There is no way to deliver top-k predictions in FTs, since this algorithm uses binary decisions in non-leaf nodes, so the search process results in only a single path from the root to a leaf node. From the results we observe that PCT efficiently finds the top-ranked labels. The true label appears more often among the top-5 predictions than as the top-1 prediction. Similarly as for 0/1 loss, \(R_{@5}\) improves with decreasing value of \(\epsilon \). Unfortunately, predicting top-k labels increases test time: the label tree search with \(\epsilon =0\) requires about 2–3 times more steps to find the top-5 labels than to find the top-1 label.

8 Conclusions

In this paper, we analyzed probabilistic classifier trees for efficient multi-class and multi-label classification. In particular, we proved a regret bound for 0/1 loss, which provides a strong theoretical foundation of PCTs, and which can also be extended to ranking-based losses. Moreover, we compared PCTs with the closely related filter tree method. We conclude the paper by summarizing the main theoretical and empirical results of FTs and PCTs, pointing out advantages and disadvantages of both approaches.

An unquestionable advantage of FTs is their prediction time, which is logarithmic in the number of classes or possible label combinations. FT can be used with any type of binary classifier as base learner and relies on simple 0/1 predictions. However, to guarantee the consistency of greedy inference, it requires more demanding training. In the naïve implementation, classifiers are trained sequentially in a bottom-up manner. The most important disadvantage is a significant reduction of the number of training examples in the top levels of the tree, caused by the filtering of examples at each level from bottom to top. This sparsity of training data may deteriorate predictive performance. However, thanks to filtering, an internal classifier is aware of the errors of its successor classifiers. FT can be used with any cost-based loss function, but it is not able to predict top-k labels.

Prediction with PCTs requires search techniques, and is hence usually more demanding than for FTs (yet significantly faster than 1-vs-all). Moreover, anytime algorithms can be used for searching the tree. The time complexity of PCT strongly depends on the noise contained in the data. If the signal-to-noise ratio is high, we can expect prediction time to be small. However, learning is much simpler for PCT than for FT, and can be easily parallelized. There is no filtering of training examples, so all examples are used for training on each level of the tree. The probabilistic nature of PCTs allows for delivering a list of top labels and for working efficiently with \(R_{@k}\).

The results we obtained for FTs are comparable with those reported in [6]. We stress that better results can be obtained by other algorithms, for example LomTrees introduced in the same paper. This is mainly because LomTrees train the tree structure online, along with the internal classifiers, whereas PCTs and FTs use random trees/coding. Interestingly, LomTrees are not consistent. Thus, an important challenge for future research is to find an algorithm that is able to train the tree structure online while ensuring consistency.