1 Introduction

Hierarchical text categorization (HTC) is the task of generating (usually by means of supervised learning algorithms) text classifiers that operate on classification schemes endowed with a hierarchical structure. Notwithstanding the fact that most large-sized classification schemes for text (e.g. the ACM Classification Scheme,Footnote 1 the MeSH thesaurus,Footnote 2 the NASA thesaurusFootnote 3) indeed have a hierarchical structure, so far the attention of text classification (TC) researchers has mostly focused on algorithms for “flat” classification, i.e. algorithms that operate on non-hierarchical classification schemes.Footnote 4 These algorithms, once applied to a hierarchical classification problem, are not capable of taking advantage of the information inherent in the class hierarchy, and may thus be suboptimal in terms of efficiency and/or effectiveness. By contrast, many researchers have argued that, by leveraging the hierarchical structure of the classification scheme, heuristics of various kinds can be brought to bear that make the classifier more efficient and/or more effective.

An important intuition is that, by viewing classification as the identification of the paths that, starting from the root, funnel the document down to the subtrees where it belongs (in “Pachinko machine” style), entire other subtrees can be pruned from consideration. That is, when the classifier corresponding to an internal node outputs a negative response, the classifiers corresponding to its descendant nodes need not be invoked any more, thus reducing the computational cost of classifier invocation exponentially (Chakrabarti et al. 1998; Koller and Sahami 1997).

A second important intuition is that, by training a binary classifier for an internal node category on a well-selected subset of training examples of local interest only, the resulting classifier may be made more attuned to recognizing the subtle distinctions between documents belonging to that node and those belonging to neighbouring nodes (Ng et al. 1997; Wiener et al. 1995). Besides promising more effective classifiers, this technique also improves efficiency, since a smaller set of examples is used in training, which makes classifier learning speedier.

Many of these intuitions have been used in close association with a specific learning algorithm; the most popular choices in this respect have been naïve Bayesian methods (Chakrabarti et al. 1998; Gaussier et al. 2002; Koller and Sahami 1997; McCallum et al. 1998; Toutanova et al. 2001; Vinokourov and Girolami 2002), neural networks (Ruiz and Srinivasan 2002; Weigend et al. 1999; Wiener et al. 1995), support vector machines (Cai and Hofmann 2004; Dumais and Chen 2000; Liu et al. 2005; Yang et al. 2003), and example-based classifiers (Yang et al. 2003).

Within this literature, the absence of “boosting” methods is conspicuous: to the best of our knowledge, no HTC method belonging to the boosting family has been proposed so far. This is somewhat surprising, (i) because of the high applicative interest of HTC, (ii) because boosting algorithms are well-known for their interesting theoretical properties and for their high accuracy, and (iii) because, given their relatively high computational cost, they would definitely benefit from the added efficiency that consideration of the hierarchical structure can bring about.

In this paper we try to fill this gap by proposing TreeBoost.MH, a multi-label HTC algorithm that consists of a hierarchical variant of AdaBoost.MH, the most important member of the family of boosting algorithms; here, multi-label (ML) means that a document can belong to zero, one, or several categories at the same time. TreeBoost.MH embodies several intuitions that had previously arisen within HTC: e.g. the intuitions that both feature selection and the selection of negative training examples should be performed “locally”, i.e. by paying attention to the topology of the classification scheme. TreeBoost.MH also incorporates the novel intuition that the weight distribution that boosting algorithms update at every boosting round should likewise be updated “locally”. All these intuitions are embodied within TreeBoost.MH in an elegant and simple way, i.e. by defining TreeBoost.MH as a recursive algorithm that uses AdaBoost.MH as its base step, and that recurses over the tree structure.

The paper is structured as follows. In Sect. 2 we give a concise description of boosting and the AdaBoost.MH algorithm. Section 3 describes TreeBoost.MH, our hierarchical version of AdaBoost.MH. Section 4 takes an analytical approach, comparing the computational costs of AdaBoost.MH and TreeBoost.MH, and showing that the latter obtains exponential savings over the former both at classifier-learning time and at classification time. In Sect. 5 we present experiments comparing AdaBoost.MH and TreeBoost.MH on three well-known HTC benchmarks, including a hierarchical version of the Reuters-21578 benchmark defined in Toutanova et al. (2001). Section 6 discusses related work, pointing out the differences between existing approaches and ours. Section 7 concludes.

2 An introduction to boosting and AdaBoost.MH

AdaBoost.MH (Schapire and Singer 2000) (see Fig. 1) is a boosting algorithm, i.e. an algorithm that generates a highly accurate classifier \(\hat{\Upphi}\) (also called final hypothesis) by combining a set of moderately accurate classifiers \(\hat{\Upphi}_{1}, \ldots, \hat{\Upphi}_{S}\) (also called weak hypotheses).Footnote 5 The input to the algorithm is a training set \(Tr= \{\langle d_{1},C_{1}\rangle, \ldots, \langle d_{g},C_{g}\rangle\},\) where \(C_{i}\subseteq C\) is the set of categories to each of which \(d_{i}\) belongs. For each \(c_{j} \in C,\) by \(Tr^{+}(c_{j})\) we denote the set of the positive training examples of \(c_{j}\). Furthermore, for each \(c_{j} \in C\) we define the set \(Tr^{-}(c_{j})\) of its negative training examples simply as the set difference between \(Tr\) and \(Tr^{+}(c_{j})\).

Fig. 1 The AdaBoost.MH algorithm

AdaBoost.MH works by iteratively calling a weak learner to generate a sequence \(\hat{\Upphi}_{1}, \ldots, \hat{\Upphi}_{S}\) of weak hypotheses; at the end of the iteration the final hypothesis \(\hat{\Upphi}\) is obtained as a sum \(\hat{\Upphi}=\sum_{s=1}^{S}\hat{\Upphi}_{s}\) of these weak hypotheses. A weak hypothesis is a function \(\hat{\Upphi}_{s}: D \times C \rightarrow {\mathbb{R}}\) such that \(sign(\hat{\Upphi}_{s}(d_{i},c_{j}))\) can be interpreted as the prediction of \(\hat{\Upphi}_{s}\) on whether \(d_{i}\) belongs to \(c_{j}\) (i.e. \(\hat{\Upphi}_{s}(d_{i},c_{j}) > 0\) means that \(d_{i}\) is believed to belong to \(c_{j}\) while \(\hat{\Upphi}_{s}(d_{i},c_{j}) < 0\) means it is believed not to belong to \(c_{j}\)), and the absolute value of \(\hat{\Upphi}_{s}(d_{i},c_{j})\) (indicated by \(|\hat{\Upphi}_{s}(d_{i},c_{j})|\)) can be interpreted as the strength of this belief.

At each iteration s AdaBoost.MH tests the effectiveness of the newly generated weak hypothesis \(\hat{\Upphi}_{s}\) on the training set and uses the results to update a distribution \(D_{s}\) of weights on the training pairs \(\langle d_{i},c_{j}\rangle.\) The updated weight \(D_{s+1}(d_{i},c_{j})\) is meant to capture how effective \(\hat{\Upphi}_{1},\ldots,\hat{\Upphi}_{s}\) have been in correctly predicting whether the training document \(d_{i}\) belongs to category \(c_{j}\) or not. By passing (together with the training set \(Tr\)) this distribution to the weak learner, AdaBoost.MH asks this latter to generate a new weak hypothesis \(\hat{\Upphi}_{s+1}\) that concentrates on the pairs with the highest weight, i.e. those that had proven harder to classify for the previous weak hypotheses.

The initial distribution \(D_{1}\) is uniform. At each iteration s all the weights \(D_{s}(d_{i},c_{j})\) are updated to \(D_{s+1}(d_{i},c_{j})\) according to the rule

$$ D_{s+1}(d_{i},c_{j})=\frac{D_{s}(d_{i},c_{j})\exp (-\Upphi(d_{i},c_{j})\cdot \hat{\Upphi}_{s}(d_{i},c_{j}))}{Z_{s}} $$
(1)

where the target function \(\Upphi(d_{i},c_{j})\) is defined to be 1 if \(d_{i} \in c_{j}\) and −1 otherwise, and

$$ Z_{s}=\sum_{i=1}^{g}\sum_{j=1}^{m}D_{s}(d_{i},c_{j})\exp (-\Upphi(d_{i},c_{j})\cdot \hat{\Upphi}_{s}(d_{i},c_{j})) $$
(2)

is a normalization factor chosen so that \(D_{s+1}\) is in fact a distribution, i.e. so that \(\sum\nolimits_{i=1}^{g}\sum\nolimits_{j=1}^{m}D_{s+1}(d_{i},c_{j})=1.\) Equation 1 is such that the weight assigned to a pair \(\langle d_{i},c_{j}\rangle\) misclassified by \(\hat{\Upphi}_{s}\) is increased, as for such a pair \(\Upphi(d_{i},c_{j})\) and \(\hat{\Upphi}_{s}(d_{i},c_{j})\) have different signs and the factor \(\Upphi(d_{i},c_{j})\cdot \hat{\Upphi}_{s}(d_{i},c_{j})\) is thus negative; likewise, the weight assigned to a pair correctly classified by \(\hat{\Upphi}_{s}\) is decreased. Weights are increased or decreased to a larger extent if the absolute value of \(\hat{\Upphi}_{s}(d_{i},c_{j})\) is higher, to reflect the fact that classification decisions taken with high confidence must have a higher impact in the process.
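To make this update concrete, here is a minimal NumPy sketch (our own illustration, not the authors' implementation) of Eqs. 1 and 2, assuming the target labels, the weak hypothesis outputs and the weights are stored as g × m arrays:

```python
import numpy as np

def update_distribution(D, Phi, Phi_hat_s):
    """One AdaBoost.MH weight update (Eqs. 1 and 2).

    D         : g x m array, current distribution D_s over (document, category) pairs
    Phi       : g x m array of target labels, +1 if d_i belongs to c_j, -1 otherwise
    Phi_hat_s : g x m array of real-valued outputs of the s-th weak hypothesis
    Returns the normalized distribution D_{s+1}.
    """
    # Misclassified pairs have Phi * Phi_hat_s < 0, so their weight grows;
    # correctly classified pairs have it > 0, so their weight shrinks.
    unnormalized = D * np.exp(-Phi * Phi_hat_s)
    Z_s = unnormalized.sum()   # normalization factor of Eq. 2
    return unnormalized / Z_s
```

The uniform initial distribution \(D_{1}\) would simply be np.full((g, m), 1.0 / (g * m)).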

2.1 Choosing the weak hypotheses

In AdaBoost.MH each document \(d_{i}\) is represented as a vector \({\mathbf{d}}_{i}=\langle w_{1i}, \ldots, w_{ri}\rangle\) of r binary weights, where \(w_{ki}=1\) (resp. \(w_{ki}=0\)) is normally interpreted to mean that term \(t_{k}\) occurs (resp. does not occur) in \(d_{i}\); accordingly, \(T = \{t_{1},\ldots,t_{r}\}\) is the set of terms that occur in at least one document in \(Tr\). Of course, AdaBoost.MH does not make any assumption on what constitutes a term; single words, stems of words, phrases, or character n-grams are all plausible choices.

In AdaBoost.MH the weak hypotheses generated by the weak learner at iteration s are decision stumps of the form

$$ \hat{\Upphi}_{s}(d_{i},c_{j}) = \left\{ \begin{array}{ll} a_{0j}&\hbox{if }w_{ki}=0 \\ a_{1j}&\hbox{if }w_{ki}=1 \end{array} \right.$$
(3)

where \(t_{k}\) (called the pivot term of \(\hat{\Upphi}_{s}\)) belongs to T, and \(a_{0j}\) and \(a_{1j}\) are real-valued constants. The choices for \(t_{k}\), \(a_{0j}\) and \(a_{1j}\) are in general different for each iteration s, and are made according to an error-minimization policy described in the rest of this section.

Schapire and Singer (1999) have proven that the Hamming loss of the final hypothesis \(\hat{\Upphi},\) defined as the percentage of pairs \(\langle d_{i},c_{j}\rangle\) for which \(sign(\Upphi(d_{i},c_{j})) \neq sign(\hat{\Upphi}(d_{i},c_{j})),\) is at most \(\prod_{s=1}^{S} Z_{s}\). The Hamming loss of a hypothesis is a measure of its classification (in)effectiveness; therefore, a reasonable (although suboptimal) way to maximize the effectiveness of the final hypothesis \(\hat{\Upphi}\) is to “greedily” choose each weak hypothesis \(\hat{\Upphi}_{s}\) (and thus its parameters \(t_{k}\), \(a_{0j}\) and \(a_{1j}\)) in such a way as to minimize the normalization factor \(Z_{s}\).

Schapire and Singer (2000) define three different variants of AdaBoost.MH, corresponding to three different methods for making these choices:

  1. AdaBoost.MH with real-valued predictions (here nicknamed \(\hbox{{\sc AdaBoost.MH}}^{{\sc R}}\));

  2. AdaBoost.MH with real-valued predictions and abstaining (\(\hbox{{\sc AdaBoost.MH}}^{{\sc RA}}\));

  3. AdaBoost.MH with discrete-valued predictions (\(\hbox{{\sc AdaBoost.MH}}^{{\sc D}}\)).

In this paper we concentrate on \(\hbox{{\sc AdaBoost.MH}}^{{\sc R}},\) since it is the one that, in the experiments of Schapire and Singer (2000), has been tested most thoroughly and has given the best results; however, everything we say in this paper about \(\hbox{{\sc AdaBoost.MH}}^{{\sc R}}\) straightforwardly applies to \(\hbox{{\sc AdaBoost.MH}}^{{\sc RA}}\) and \(\hbox{{\sc AdaBoost.MH}}^{{\sc D}}.\)

At iteration s, \(\hbox{{\sc AdaBoost.MH}}^{{\sc R}}\) (from now on simply called AdaBoost.MH) chooses a weak hypothesis of the form described in Eq. 3 by the following algorithm.

Algorithm 1 (The AdaBoost.MH weak learner)

  1. For each term \(t_{k} \in \{t_{1},\ldots,t_{r}\}\) select, among all weak hypotheses \(\hat{\Upphi}\) that have \(t_{k}\) as the “pivot term”, the one (indicated by \(\hat{\Upphi}_{best(k)}\)) for which \(Z_{s}\) is minimum.

  2. Among all the hypotheses \(\hat{\Upphi}_{best(1)}, \ldots,\hat{\Upphi}_{best(r)}\) selected for the r different terms in Step 1, select the one (indicated by \(\hat{\Upphi}_{s}\)) for which \(Z_{s}\) is minimum.

Step 1 is clearly the key step, since there is a non-enumerable set of weak hypotheses with \(t_{k}\) as the pivot. Schapire and Singer (1999) have proven that, given term \(t_{k}\) and category \(c_{j}\),

$$ \hat{\Upphi}_{best(k)}(d_{i},c_{j}) = \left\{ \begin{array}{ll} \frac{1}{2} \ln \frac{W_{+1}^{0jk}}{W_{-1}^{0jk}} &\hbox{if }w_{ki}=0 \\ \frac{1}{2} \ln \frac{W_{+1}^{1jk}}{W_{-1}^{1jk}} & \hbox{if }w_{ki}=1 \end{array} \right. $$
(4)

where

$$ W_{b}^{xjk} = \sum_{i=1}^{g} D_{s}(d_{i},c_{j})\cdot [\![ w_{ki}=x]\!]\cdot [\![ \Upphi(d_{i},c_{j})=b]\!] $$
(5)

for \(b \in \{+1, -1\}\), \(x \in \{0, 1\}\), \(j \in \{1,\ldots,m\}\) and \(k \in \{1,\ldots,r\}\), and where [[π]] indicates the characteristic function of predicate π (i.e. the function that returns 1 if π is true and 0 otherwise). For term \(t_{k}\) and for these values of \(a_{xj}\) we obtain

$$ Z_{s}= 2\sum_{j=1}^{m}\sum_{x=0}^{1}(W_{+1}^{xjk}W_{-1}^{xjk})^\frac{1}{2} $$
(6)

Choosing \(\frac{1}{2} \ln \frac{W_{+1}^{xjk}}{W_{-1}^{xjk}}\) as the value for \(a_{xj}\) has the effect that \(\hat{\Upphi}_{s}(d_{i},c_{j})\) outputs a positive real value in the two following cases:

  1. \(w_{ki}=1\) (i.e. \(t_{k}\) occurs in \(d_{i}\)) and the majority of the training documents in which \(t_{k}\) occurs belong to \(c_{j}\);

  2. \(w_{ki}=0\) (i.e. \(t_{k}\) does not occur in \(d_{i}\)) and the majority of the training documents in which \(t_{k}\) does not occur belong to \(c_{j}\).

In all the other cases \(\hat{\Upphi}_{s}\) outputs a negative real value. Here, “majority” has to be understood in a weighted sense, i.e. by bringing to bear the weight \(D_{s}(d_{i},c_{j})\) associated to the training pair \(\langle d_{i},c_{j}\rangle.\) The larger this majority is, the higher the absolute value of \(\hat{\Upphi}_{s}(d_{i},c_{j})\) is; this means that this absolute value represents a measure of the confidence that \(\hat{\Upphi}_{s}\) has in its own prediction (Schapire and Singer 1999).

In practice, the value \(a_{xj}=\frac{1}{2} \ln \frac{W_{+1}^{xjk}+\epsilon}{W_{-1}^{xjk}+\epsilon} \) is chosen in place of \(a_{xj}=\frac{1}{2} \ln \frac{W_{+1}^{xjk}}{W_{-1}^{xjk}},\) since this latter may produce outputs with a very large or infinite absolute value when the denominator is very small or zero.Footnote 6
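As an illustration of Algorithm 1 together with Eqs. 4–6, the following sketch (ours, under the assumption that the binary document-term matrix, the ±1 label matrix and the current distribution are NumPy arrays) selects the pivot term and the ε-smoothed constants that minimize \(Z_{s}\):

```python
import numpy as np

def best_stump(W, Phi, D, eps=1e-8):
    """Select the pivot term and the constants a_0j, a_1j minimizing Z_s.

    W   : g x r binary document-term matrix (entries w_ki)
    Phi : g x m matrix of labels in {+1, -1}
    D   : g x m current distribution over (document, category) pairs
    Returns (best term index k, a0 of shape (m,), a1 of shape (m,), Z_s).
    """
    g, r = W.shape
    pos = (Phi > 0)                          # d_i belongs to c_j
    best = None
    for k in range(r):
        x1 = (W[:, k] == 1)                  # documents containing t_k
        x0 = ~x1
        # The four quantities W_b^{xjk} of Eq. 5, one value per category j
        W_plus_0 = (D * pos)[x0].sum(axis=0)
        W_minus_0 = (D * ~pos)[x0].sum(axis=0)
        W_plus_1 = (D * pos)[x1].sum(axis=0)
        W_minus_1 = (D * ~pos)[x1].sum(axis=0)
        # Z_s of Eq. 6 for this pivot term
        Z = 2 * (np.sqrt(W_plus_0 * W_minus_0) + np.sqrt(W_plus_1 * W_minus_1)).sum()
        if best is None or Z < best[3]:
            # epsilon-smoothed constants, as discussed above
            a0 = 0.5 * np.log((W_plus_0 + eps) / (W_minus_0 + eps))
            a1 = 0.5 * np.log((W_plus_1 + eps) / (W_minus_1 + eps))
            best = (k, a0, a1, Z)
    return best
```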

The output of the final hypothesis is the value

$$ \hat{\Upphi}(d_{i},c_{j})=\sum_{s=1}^{S}\hat{\Upphi}_{s}(d_{i},c_{j}) $$
(7)

obtained by summing the outputs of the weak hypotheses.

2.2 Implementing AdaBoost.MH

Following Sebastiani et al. (2000), in our implementation of AdaBoost.MH we have further optimized the final hypothesis \(\hat{\Upphi}(d_{i},c_{j}) =\sum_{s=1}^{S}\hat{\Upphi}_{s}(d_{i},c_{j})\) by “combining” the weak hypotheses \(\hat{\Upphi}_{1}, \ldots, \hat{\Upphi}_{S}\) according to their pivot term \(t_{k}\). In fact, note that if \(\{\hat{\Upphi}_{1}, \ldots, \hat{\Upphi}_{S}\}\) contains a subset \(\{\hat{\Upphi}_{1}^{(k)}, \ldots, \hat{\Upphi}_{q(k)}^{(k)}\}\) of weak hypotheses that all hinge on the same term \(t_{k}\) and are of the form

$$ \hat{\Upphi}_{r}^{(k)}(d_{i},c_{j})=\left\{ \begin{array}{ll} a_{0j}^r&\hbox{if }w_{ki}=0 \\ a_{1j}^r&\hbox{if }w_{ki}=1 \end{array} \right. $$
(8)

for r = 1,…,q(k), the collective contribution of \(\hat{\Upphi}_{1}^{(k)},\ldots, \hat{\Upphi}_{q(k)}^{(k)}\) to the final hypothesis is the same as that of a “combined hypothesis”

$$ \hat{\Upphi}^{(k)}(d_{i},c_{j})=\left\{ \begin{array}{ll} \sum\limits_{r=1}^{q(k)}a_{0j}^r&\hbox{if }w_{ki}=0 \\ \sum\limits_{r=1}^{q(k)}a_{1j}^r&\hbox{if }w_{ki}=1 \end{array} \right. $$
(9)

In the implementation we have thus replaced \(\sum\nolimits_{s=1}^{S}\hat{\Upphi}_{s}(d_{i},c_{j})\) with \(\sum\nolimits_{k=1}^{\Updelta}\hat{\Upphi}^{(k)}(d_{i},c_{j}),\) where \(\Updelta\) is the number of different terms that act as pivot for the weak hypotheses in \(\{\hat{\Upphi}_{1},\ldots, \hat{\Upphi}_{S}\}.\)
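A minimal sketch (ours, not the authors' code) of this compression step, assuming each weak hypothesis is stored as a triple of its pivot term index and its two per-category constant vectors:

```python
import numpy as np

def combine_by_pivot(weak_hypotheses):
    """Merge weak hypotheses sharing the same pivot term (Eqs. 8 and 9).

    weak_hypotheses: list of (k, a0, a1) triples, where k is the pivot term
    index and a0, a1 are the per-category constant vectors of one iteration.
    Returns a dict {k: (summed a0's, summed a1's)}, i.e. one combined
    hypothesis per distinct pivot term.
    """
    combined = {}
    for k, a0, a1 in weak_hypotheses:
        a0, a1 = np.asarray(a0, dtype=float), np.asarray(a1, dtype=float)
        if k in combined:
            s0, s1 = combined[k]
            combined[k] = (s0 + a0, s1 + a1)   # element-wise sums of Eq. 9
        else:
            combined[k] = (a0, a1)
    return combined
```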

This modification brings about a considerable efficiency gain in the application of the final hypothesis to a test example. For instance, the final hypothesis we obtained on Reuters-21578 with AdaBoost.MH when S = 1,000 consists of 1,000 weak hypotheses, but the number of different pivot terms is only 766. The reduction in the size of the final hypothesis which derives from this modification is usually larger when high reduction factors have been applied in a feature selection phase, since in this case the number of different terms that can be chosen as the pivot is smaller. A large reduction is also obtained when the total number of iterations S is high, since in this case the terms chosen as pivot in the last iterations tend to be ones that have been chosen already in previous iterations.

In this work we further implement two important optimizations for reducing classification time.

The first optimization consists in building the vectorial representations \({\mathbf{d}}_{i}\) of the test documents after the final hypothesis \(\hat{\Upphi}=\sum\nolimits_{k=1}^{\Updelta}\hat{\Upphi}^{(k)}\) has been built, so that only the \(\Updelta\) features that act as pivot for some weak hypothesis in \(\hat{\Upphi}\) are actually included in the vectorial representations \({\mathbf{d}}_{i}\) of the test documents; the other terms can be discarded, since they play no role in the classification. Since it is usually the case that \(\Updelta \ll r,\) this brings about a substantial reduction in the space occupied by the \({\mathbf{d}}_{i}\)’s. For instance, the size of the vectors that we have obtained by this method on RCV1-v2 with AdaBoost.MH for S = 1,000 is 950, while the length r of the original vectors (see Sect. 5.3) was equal to 55,051.

The second optimization consists in sorting the final hypothesis \(\hat{\Upphi}\) (here viewed for convenience as a sequence \(\hat{\Upphi}^{(1)}, \ldots, \hat{\Upphi}^{(\Updelta)}\) of weak hypotheses) so that the terms that act as pivot appear in the same order as they appear in the vectorial representations of the documents. As a consequence, we can indeed view the final hypothesis as consisting of the 2m vectors \({\mathbf{a}}_{0j}=\langle a_{0j}^{1}, \ldots, a_{0j}^{\Updelta}\rangle\) and \({\mathbf{a}}_{1j}=\langle a_{1j}^{1}, \ldots, a_{1j}^{\Updelta}\rangle\) (for j = 1,…,m) that contain the constants output by the compressed hypotheses of Eq. 9. Classification thus amounts to computing

$$ \hat\Upphi(d_{i},c_{j})=\sum_{k=1}^{\Updelta}w_{ki}a_{1j}^{k}+\sum_{k=1}^{\Updelta}(1-w_{ki})a_{0j}^{k} $$

Since one of \(w_{ki}\) and \((1 - w_{ki})\) is always 0, this amounts to a sum of \(\Updelta\) real numbers. This is extremely cheap, also due to the fact that \(\Updelta\) is typically small. Note for example that performing classification with any linear classifier (such as those generated by support vector machines) requires a dot product of length r, which is much more expensive than a sum of \({\Updelta} \ll r\) reals. This makes our system even more classification-time efficient than other leading-edge technologies.
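For instance, under the assumption that the combined constants have been packed column-wise into two \(\Updelta \times m\) matrices, classification reduces to the following (purely illustrative) routine:

```python
import numpy as np

def classify(d, a0, a1):
    """Score a test document against all m categories (Sect. 2.2).

    d  : length-Delta binary vector over the pivot terms only, in the same
         order as the rows of a0 and a1
    a0 : Delta x m matrix whose entry (k, j) is the constant a_{0j}^k
    a1 : Delta x m matrix whose entry (k, j) is the constant a_{1j}^k
    Returns a length-m vector of real-valued scores; a positive score for
    category c_j means "assign the document to c_j".
    """
    d = np.asarray(d, dtype=float)
    # For every pivot term exactly one of the two constants contributes,
    # so each score is a sum of Delta real numbers.
    return d @ a1 + (1 - d) @ a0
```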

3 A hierarchical version of AdaBoost.MH for multi-label TC

In this section we describe a version of AdaBoost.MH, called TreeBoost.MH, that is explicitly designed to work on tree-structured sets of categories, and is capable of leveraging the information inherent in this structure.

3.1 Notation, definitions, and the semantics of hierarchies

Before discussing the intuitions on which TreeBoost.MH is based, let us first fix some notation and definitions. Let C be a tree-structured set of categories, and let r be its root category. For each category c j  ∈ C, we will use the following abbreviations:

\(\uparrow({c_{j}})\): the parent category of \(c_{j}\)

\(\downarrow({c_{j}})\): the set of children categories of \(c_{j}\)

\(\Uparrow({c_{j}})\): the set of ancestor categories of \(c_{j}\)

\(\Downarrow({c_{j}})\): the set of descendant categories of \(c_{j}\)

\(\leftrightarrow({c_{j}})\): the set of sibling categories of \(c_{j}\)
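For illustration, all of these relations can be derived from a simple child-to-parent map; the following sketch (our own representation, not part of the algorithm itself) mirrors the notation above:

```python
def parent(c, parent_of):
    """up(c): the parent category of c (None for the root)."""
    return parent_of.get(c)

def children(c, parent_of):
    """down(c): the set of children categories of c."""
    return {x for x, p in parent_of.items() if p == c}

def ancestors(c, parent_of):
    """Up(c): the set of ancestor categories of c."""
    result, p = set(), parent_of.get(c)
    while p is not None:
        result.add(p)
        p = parent_of.get(p)
    return result

def descendants(c, parent_of):
    """Down(c): the set of descendant categories of c."""
    result, frontier = set(), [c]
    while frontier:
        for child in children(frontier.pop(), parent_of):
            result.add(child)
            frontier.append(child)
    return result

def siblings(c, parent_of):
    """<->(c): the set of sibling categories of c."""
    return children(parent_of.get(c), parent_of) - {c}

# Example map (the root maps to None):
# parent_of = {"root": None, "c1": "root", "c2": "root", "c11": "c1"}
```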

When discussing an HTC application it is always important to specify what the semantics of the hierarchy is, i.e., to specify the semantic constraints that a supposedly perfect classifier would enforce; which constraints are in place has important consequences for which algorithms we might want to apply to this task, and, more importantly, for how we should evaluate these algorithms.

For instance, one should specify whether a document can belong to zero, one, or several categories in C (which is indeed the case of this paper) or whether it always belongs to one and only one category in C.

No less importantly, one should specify whether it is the case that

  1. a document d that is a positive example of a category \(c_{j}\) is also a positive example of all its ancestor categories \(\Uparrow({c_{j}}).\) We assume this to be the case. We say that, for the categories in \(\Uparrow({c_{j}}),\) d is a bubbled-up positive example (in the sense that it has bubbled up to them from somewhere down below).

  2. a document d can in principle be a positive example of an internal node category \(c_{j}\) and at the same time not be a positive example of any of its descendant categories \(\Downarrow({c_{j}}).\) We assume this to be the case. We say that d is an own positive example of \(c_{j}\).

Assumption 2 is indeed useful for tackling datasets, such as RCV1-v2 (see Sect. 5.1), in which documents with these characteristics do occur, while at the same time not preventing us from dealing with datasets with the opposite characteristics. A consequence of these two assumptions is that

$$ Tr^{+}(c_{j})\supseteq\bigcup_{c \in \Downarrow({c_{j}})}Tr^{+}(c) $$
(10)

i.e., the set \(Tr^{+}(c_{j})\) of the positive training examples of a nonleaf category \(c_{j}\) is a (possibly proper) superset of the union of the sets of positive training examples of all its descendant categories.
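As a small illustration (ours, assuming the child-to-parent map sketched above), this is how \(Tr^{+}(c_{j})\) can be materialized with positives bubbled up from the original labels; the “quasi-positive” selection of negative examples used below (Sect. 3.2) is included as well:

```python
def positive_examples(labels, parent_of):
    """Tr+(c) for every category, with positives bubbled up to all ancestors.

    labels    : dict {document id: set of categories it was originally labelled with}
    parent_of : dict {category: its parent category} (the root maps to None)
    Returns a dict {category: set of document ids}.
    """
    tr_pos = {}
    for doc, cats in labels.items():
        for c in cats:
            node = c
            while node is not None:             # c itself, then all its ancestors
                tr_pos.setdefault(node, set()).add(doc)
                node = parent_of.get(node)
    return tr_pos

def negative_examples(c, tr_pos, parent_of):
    """Quasi-positive negatives of a nonroot category c (Sect. 3.2):
    Tr+(parent(c)) - Tr+(c)."""
    return tr_pos.get(parent_of.get(c), set()) - tr_pos.get(c, set())
```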

3.2 The rationale

TreeBoost.MH (which is fully illustrated in Fig. 2) embodies several intuitions that had arisen before within HTC.

Fig. 2 The TreeBoost.MH algorithm

The first, fairly obvious intuition (which lies at the basis of practically all HTC algorithms proposed in the literature) is that, in a hierarchical context, the classification of a document \(d_{i}\) is to be seen as a descent through the hierarchy, from the root to the (internal or leaf node) categories where \(d_{i}\) is deemed to belong. In ML classification this means that each nonroot category \(c_{j}\) has an associated binary classifier \(\hat\Upphi_{j}\) which acts as a “filter” that prevents unsuitable documents from percolating to lower levels. All test documents that a classifier \(\hat\Upphi_{j}\) deems to belong to \(c_{j}\) are passed as input to all the binary classifiers corresponding to the categories in \(\downarrow({c_{j}}),\) while the documents that \(\hat\Upphi_{j}\) deems not to belong to \(c_{j}\) are “blocked” and analysed no further. Note that it may well be the case that a document \(d_{i}\) is deemed to belong to \(c_{j}\) by \(\hat\Upphi_{j}\) and is then rejected by all the binary classifiers corresponding to the categories in \(\downarrow({c_{j}});\) this is indeed consistent with assumption 2 above. In the end, each document may thus reach zero, one, or several (leaf or internal node) categories, and is thus classified under them.
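A sketch (ours) of this top-down filtering at classification time, assuming each nonroot category comes with a binary accept/reject function trained as described below:

```python
def classify_hierarchically(d, root, tree, accepts):
    """Let a document descend the hierarchy, invoking the classifiers of a
    node's children only if the node itself has accepted the document.

    tree    : dict {category: list of children categories}
    accepts : dict {nonroot category c_j: function of a document returning
              True if the binary classifier for c_j deems it to belong to c_j}
    Returns the set of (leaf or internal node) categories assigned to d.
    """
    assigned, frontier = set(), [root]      # the root accepts every document
    while frontier:
        node = frontier.pop()
        for child in tree.get(node, []):
            if accepts[child](d):           # the "filter" lets d through
                assigned.add(child)
                frontier.append(child)
            # otherwise d is blocked: the whole subtree rooted in child is skipped
    return assigned
```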

The second intuition is that the training of \(\hat\Upphi_{j}\) should be performed “locally”, i.e. by paying attention to the topology of the classification scheme. To see this, note that, during classification, if the classifier for \(\uparrow({c_{j}})\) has performed reasonably well, \(\hat\Upphi_{j}\) will only (or mostly) be presented with documents that belong to the subtree rooted in \(\uparrow({c_{j}}),\) i.e. with documents that belong to \(c_{j}\) and/or to some of the categories in \(\leftrightarrow({c_{j}}).\) As a result, the training of \(\hat\Upphi_{j}\) should be performed by using, as negative training examples, the positive training examples of \(\uparrow({c_{j}}),\) with the obvious exception of the documents that are also positive training examples of \(c_{j}\). In particular, training documents that only belong to categories other than those in \(\Downarrow({\uparrow({c_{j}})})\) need not be used. The rationale of this choice is that the negative training examples thus selected are “quasi-positive” examples of \(c_{j}\) (Schapire et al. 1998), i.e. are the negative examples that are closest to the boundary between the positive and the negative region of \(c_{j}\) (a notion akin to that of “support vectors” in SVMs), and are thus the most informative negative examples that can be used in training. This is beneficial also from the standpoint of efficiency (at both training and classification time), since fewer training examples and fewer features are involved. In a similar form, this intuition (which we discuss at length in Fagni and Sebastiani 2007) had first been presented in Ng et al. (1997) and Wiener et al. (1995).

The third intuition is similar, i.e. that feature selection should also be performed “locally”, by paying attention to the topology of the classification scheme. As above, if the classifier for \(\uparrow({c_{j}})\) has performed reasonably well, \(\hat\Upphi_{j}\) will only (or mostly) be presented with documents that belong to the subtree rooted in \(\uparrow({c_{j}}).\) As a consequence, for the classifiers corresponding to \(c_{j}\) and its siblings, it is cost-effective to employ features that are useful in discriminating (only) among themselves and \(\uparrow({c_{j}});\) features that discriminate among categories lying outside the subtree rooted in \(\uparrow({c_{j}})\) are too general, and features that discriminate among the subcategories of \(c_{j}\), or among the subcategories of one of its siblings, are too specific. This intuition, albeit in the slightly different context of single-label classification, was first presented in Koller and Sahami (1997).

TreeBoost.MH also embodies the novel intuition that the weight distribution that boosting algorithms update at every boosting round should likewise be updated “locally”. In fact, the two previously discussed intuitions indicate that hierarchical ML classification is best understood as consisting of several independent (flat) ML classification problems, one for each internal node of the hierarchy: for each such node \(c_{j}\) we must generate a number of binary classifiers, one for each \(c_{q}\,\in\,\downarrow({c_{j}}).\) In a boosting context, this means that several independent distributions, each one “local” to an internal node and its children, should be generated and updated by the process. In this way, the “difficulty” of a category \(c_{q}\) will only matter relative to the difficulty of its sibling categories. As discussed in Sect. 4, this intuition is of key importance in allowing TreeBoost.MH to obtain exponential savings in the cost of training over AdaBoost.MH.

3.3 The algorithm

TreeBoost.MH incorporates these four intuitions by factoring the hierarchical ML classification problem into several “flat” ML classification problems, one for every internal node in the tree. TreeBoost.MH learns in a recursive fashion, by identifying internal nodes \(c_{j}\) and calling AdaBoost.MH to generate a ML (flat) classifier for the set of categories \(\downarrow({c_{j}}).\) Alternatively (and more conveniently), this process may be viewed as generating, for each nonroot category \(c_{j} \in C,\) a binary classifier \(\hat\Upphi_{j}\) for \(c_{j}\), by means of which hierarchical classification can be performed as described in Sect. 3.2.

Learning in TreeBoost.MH proceeds by first identifying whether a leaf category has been reached (line 6 of Fig. 2), in which case nothing is done, since the classifiers are generated only at internal nodes.

If an internal node c j has been reached, a ML feature selection process may (optionally) be run (line 10) to generate a reduced feature set on which the ML classifier for \(\downarrow({c_{j}})\) will operate. This may be dubbed a “glocal” feature selection policy, since it takes an intermediate stand between the well-known “global” policy (in which the same set of features is selected for all the categories in C) and “local” policy (in which a different set of features is chosen for each different category in \(\downarrow({c_{j}})\)). The glocal policy selects a different set of features for each (maximal) set of sibling categories in C, thus implementing a view of feature selection as described in Sect. 3.2.Footnote 7 Any of the standard feature scoring functions (e.g. information gain, chi-square, odds ratio) can be used, as well as any of the standard feature score globalization methods (e.g. max, weighted average, Forman’s (2004) round robin). Note that all these functions require a precise notion of what the positive and negative training examples of a category are; here, consistently with the “locality” principle discussed in Sect. 3.2, the negative training examples of a category c are taken to be the set \(Tr^{+}(\uparrow({c}))-Tr^{+}(c).\)

After the reduced feature set has been identified, TreeBoost.MH calls upon AdaBoost.MH (line 11) to solve a ML (flat) classification problem for the categories in \(\downarrow({c_{j}});\) again, in order to implement the “quasi-positive” policy discussed in Sect. 3.2, the negative training examples of a category c are taken to be the set \(Tr^{+}(\uparrow({c}))-Tr^{+}(c).\) Note that restricting the AdaBoost.MH call to the categories in \(\downarrow({c_{j}})\) implements the view, discussed in Sect. 3.2, of several independent, “local” distributions being generated and updated during the boosting process.

Finally, after the ML classifier for \(\downarrow({c_{j}})\) has been generated, for each category \(c_{q}\,\in\,\downarrow({c_{j}})\) a recursive call to TreeBoost.MH is issued (lines 12–18) that processes the subtree rooted in \(c_{q}\) in the same way. The final result is a hierarchical ML classifier in the form of a tree of binary classifiers, one for each nonroot node, each consisting of a committee of S decision stumps.
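The recursion itself can be sketched as follows (our own high-level rendering, not the authors' implementation; adaboost_mh and select_features stand for the flat learner of Sect. 2 and for an optional local feature selection routine):

```python
def treeboost_mh(c_j, tree, tr_pos, S, adaboost_mh, select_features=None):
    """Recursively learn one flat ML classifier per internal node.

    c_j     : the current node (start the recursion from the root)
    tree    : dict {category: list of children categories}
    tr_pos  : dict {category: set of positive training document ids},
              with positives already bubbled up to all ancestors
    S       : number of boosting iterations of each flat AdaBoost.MH run
    Returns a dict {internal node: flat ML classifier for its children}.
    """
    children = tree.get(c_j, [])
    if not children:                        # leaf reached: nothing to learn here
        return {}

    # Local ML problem: the documents are the positive examples of c_j; for each
    # child c_q, the negatives are the quasi-positives Tr+(c_j) - Tr+(c_q).
    local_docs = tr_pos[c_j]
    local_labels = {d: {c_q for c_q in children if d in tr_pos[c_q]}
                    for d in local_docs}

    # Optional "glocal" feature selection, restricted to this group of siblings
    features = select_features(local_docs, local_labels) if select_features else None

    classifiers = {c_j: adaboost_mh(local_docs, local_labels, children, S, features)}

    for c_q in children:                    # recur over each child's subtree
        classifiers.update(
            treeboost_mh(c_q, tree, tr_pos, S, adaboost_mh, select_features))
    return classifiers
```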

Note that the generated classifiers would allow us to implement another, alternative view of what the hierarchical ML classifier consists of: instead of a tree of committees, we might have a committee of trees, with each tree \(T_{s}\) having a single decision stump (the one generated at iteration s) at each nonroot node. In this paper we concentrate on the former view, leaving the latter for future investigation.

4 The computational cost of TreeBoost.MH

We now analyse the computational costs of AdaBoost.MH and TreeBoost.MH, and show that the latter is computationally cheaper than the former, allowing exponential savings at both training and testing time.

Let us first discuss the cost of classifier training. The key steps of AdaBoost.MH are (i) computing, for each \(t_{k} \in T,\) the \(Z_{s}\) factor resulting from \(\hat\Upphi_{best(k)},\) and (ii) computing the minimum, over all \(t_{k}\), of such \(Z_{s}\) factors. By inspecting Eqs. 5 and 6 we can clearly see that Step (i) requires O(gm) operations for each \(t_{k}\), where g is the number of training documents and m is the number of categories; since there are r such terms, the entire step requires O(gmr) operations.

The cost of classifier training in TreeBoost.MH heavily depends on the topology of the tree and on the distribution of positive training examples across the nodes of the tree; in particular, it depends on factors such as the arity (i.e. branching factor) of each individual internal node, the depth \(h_{j}\) of each individual node \(c_{j}\), and the number \(|Tr^{+}(c_{j})|\) of positive training examples in each such node. We will thus limit our analysis to the best case and the worst case, since they are more easily identifiable; the cost of the other cases will be intermediate between these two. The worst possible case is that of a “flat”, degenerate tree of height 1, i.e. a tree in which all leaf categories are children of the root category and there are no internal nodes aside from the root itself. In this case, TreeBoost.MH calls AdaBoost.MH exactly once, and on the entire category set, which means that the two algorithms coincide, and thus have the same cost. The best possible case is more interesting, and coincides with the “fully grown” case of a perfectly balanced tree of constant arity a (in this case the height of the tree is \(h = \log_{a} m\)) in which leaf categories all have the same frequency and each document belongs to exactly one leaf category. At each level \(l = 1,\ldots,h\) of such a tree (the root is conventionally assigned level 0) TreeBoost.MH calls AdaBoost.MH exactly \(a^{l-1}\) times. Since, as from the analysis above, AdaBoost.MH is O(gmr) in the general case, this means that in this case each call to AdaBoost.MH requires \(O(\frac{g}{a^{l-1}}ar)\) operations, given that (i) the training examples involved are not g but only \(\frac{g}{a^{l-1}}\) (since we have made the hypothesis that leaf categories are evenly populated and each training example belongs to exactly one leaf category) and (ii) only a (instead of m) categories are involved. This means that the number of operations required by TreeBoost.MH is

$$ O\left(\sum_{l=1}^{h}a^{l-1}\cdot\frac{g}{a^{l-1}}ar\right) =O\left(\sum_{l=1}^{h}gar\right) = O(garh) $$

This means that, for each of the g training examples and for each of the r terms, TreeBoost.MH performs O(ah) operations and AdaBoost.MH performs O(m) operations. Given that \(m = a^{h}\), this means that, at training time, TreeBoost.MH is cheaper than AdaBoost.MH by an exponential factor.

Let us then discuss the cost of testing (i.e. applying) the generated classifiers. Again, in the “flat” worst case discussed above the two algorithms are trivially the same. Let us then only analyse the “fully grown” best case, on the understanding that the cost of the other cases will be intermediate between these two. In AdaBoost.MH, each test document must be given as input to O(S) weak hypotheses, each of which performs 1 test and m additions, one per category; the cost is thus O(Sm).Footnote 8 In TreeBoost.MH, each test document is input to O(h) classifiers (corresponding to one or more, complete or incomplete, paths downwards from the root), each of them consisting of O(S) weak hypotheses each of which performs 1 test and a additions, one per category; the cost is thus O(Sah). Recalling that \(m = a^{h}\), we can see that TreeBoost.MH is cheaper than AdaBoost.MH by an exponential factor at testing time too.
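As a purely illustrative example (the numbers are ours), consider a perfectly balanced tree of arity a = 5 and height h = 4, so that \(m = a^{h} = 625\) leaf categories. In this best case the per-example, per-term training cost and the per-document testing cost compare as

$$ \underbrace{m = 625}_{\hbox{AdaBoost.MH}} \quad\hbox{vs.}\quad \underbrace{ah = 20}_{\hbox{TreeBoost.MH}} \quad \hbox{(training)}, \qquad\qquad \underbrace{Sm = 625\,S}_{\hbox{AdaBoost.MH}} \quad\hbox{vs.}\quad \underbrace{Sah = 20\,S}_{\hbox{TreeBoost.MH}} \quad \hbox{(testing)} $$

i.e., TreeBoost.MH performs roughly 1/30 of the operations of AdaBoost.MH at both stages.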

5 Experimental results

5.1 Datasets

The first benchmark we have used in our experiments is the “Reuters-21578, Distribution 1.0” corpus, one of the most widely used benchmarks in TC research.Footnote 9 In its original form the Reuters-21578 category set is not hierarchically structured, and is thus not suitable “as is” for HTC experiments; we have thus used a hierarchical version of it generated in Toutanova et al. (2001) by applying hierarchical agglomerative clustering to the 90 Reuters-21578 categories that have at least one positive training example and one positive test example. The original Reuters-21578 categories are thus “leaf” categories in the resulting hierarchy, and are clustered into four “macro-categories” whose parent category is the root of the tree. Conforming to the experiments of Toutanova et al. (2001), we have used (according to the ModApte split) the 7,770 training examples and 3,299 test examples that are labelled by at least one of the selected categories; the average number of categories per document is 1.23, ranging from a minimum of 1 to a maximum of 15. The average number of positive examples per category is 106.50, ranging from a minimum of 1 to a maximum of 2,877 (Table 1).

The second benchmark we have used is Reuters Corpus Volume 1 version 2 (RCV1-v2),Footnote 10 a more recent text categorization benchmark made available by Reuters and consisting of 804,414 news stories produced by Reuters from 20 Aug 1996 to 19 Aug 1997; all news stories are in English, and have 109 distinct terms per document on average (Rose et al. 2002). In our experiments we have used the “LYRL2004” split defined in Lewis et al. (2004), in which the (chronologically) first 23,149 documents are used for training and the other 781,265 are used for testing. The documents are classified according to three different, orthogonal classification schemes: “topics”, “industries” and “regions”; consistently with Lewis et al. (2004) and all the literature that has followed, we focus on the “topics” classification scheme. Out of the 103 “topics” categories, in our experiments we have restricted our attention to the 101 categories (21 internal node and 80 leaf node categories) with at least one positive training example. Of the three benchmarks we use in this work, RCV1-v2 is the only one containing (both training and test) examples that are “own” documents of internal node categories; each of the 21 internal node categories has at least one own document. The RCV1-v2 hierarchy is four levels deep (including the root, to which we conventionally assign level 0); there are four internal nodes at level 1, and the leaves are all at levels 2 and 3. The 80 leaf categories have 347.2 positive training examples on average.

The third benchmark we have used is the ICCCFT from the 2007 “International Challenge on Classifying Clinical Free Text Using Natural Language Processing”,Footnote 11 organized by the Computational Medicine Center of the Cincinnati Children’s Hospital Medical Center and the University of Cincinnati Medical Center. The documents are short discharge reports classified according to the ICD-9-CM classification scheme,Footnote 12 the official system for assigning codes to diagnoses and procedures associated with hospital utilization in the United States. The experiments we present here use only the training set of the Challenge, since the labels of the test documents are not available to participants; unlike with the previous two benchmarks, we thus compute effectiveness by leave-one-out cross-validation. There are only 978 documents in the training set, with an average length of 13.3 words. We restrict our experiments to the 79 categories that have at least one positive training document; of these 79 categories, 45 are leaf node categories and the other 34 are internal node categories, none of which has “own” documents. The ICD-9-CM hierarchy (or, at least, that part of it that is used for labelling our training data) is again four levels deep (including the root, to which we conventionally assign level 0); there are seven internal nodes at level 1, and the leaves are all at levels 2 and 3. The average number of positive examples per leaf category is 27.1, ranging from a minimum of 1 to a maximum of 266. One peculiar feature of this dataset is that some nodes are single children, i.e., have no siblings. This makes it impossible to adopt our standard policy of choosing, as negative training examples of a category, the positive training documents of the parent category. In these cases we merge parent and child categories into a single node, since they always have the same positive and negative examples.

5.2 Averaging effectiveness across categories

As a measure of effectiveness that combines the contributions of precision (π) and recall (ρ) we have used the well-known \(F_{1}\) function, defined as

$$ F_{1}=\frac{2\pi\rho}{\pi+\rho}=\frac{2 TP}{2 TP+FP+FN} $$
(11)

which corresponds to the harmonic mean of precision and recall, where TP stands for true positives, FP for false positives, and FN for false negatives. Note that \(F_{1}\) is undefined when TP = FP = FN = 0; in this case, consistently with most other works in the literature, we take \(F_{1}\) to equal 1.0, since the classifier has correctly classified all documents (as negative examples).

In text classification it is customary to average the category-specific \(F_{1}\) scores by computing both microaveraged \(F_{1}\) (denoted by \(F_{1}^{\mu}\)) and macroaveraged \(F_{1}\) (\(F_{1}^{M}\)). \(F_{1}^{\mu}\) is obtained by (i) computing the category-specific values \(TP_{i}\), (ii) obtaining TP as the sum of the \(TP_{i}\)’s (same for FP and FN), and then (iii) applying Eq. 11. \(F_{1}^{M}\) is obtained by first computing the \(F_{1}\) values specific to the individual categories, and then averaging them across the \(c_{j}\)’s.
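For concreteness, a minimal sketch (ours) of the two averages, starting from per-category contingency counts:

```python
def micro_macro_f1(tp, fp, fn):
    """Micro- and macro-averaged F1 from per-category contingency counts.

    tp, fp, fn: dicts mapping each category to its true positive, false
    positive and false negative counts on the test set.
    A category with TP = FP = FN = 0 gets F1 = 1.0, as stipulated in Sect. 5.2.
    """
    def f1(t, p, n):
        return 1.0 if t + p + n == 0 else 2 * t / (2 * t + p + n)

    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in tp) / len(tp)
    return micro, macro
```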

However, in HTC one should specify exactly which categories the average is computed across. Should this average be computed across leaf categories only, or should it involve internal node categories too? It might seem reasonable that also the internal nodes that have “own” positive examples (see Sect. 3.1) are considered, since these nodes are not to be viewed as merely routing documents to the subtrees below them. However, the presence of “bubbled-up” positive test examples within internal node categories means that, if averages involve these categories too, they will involve classification decisions that are not independent of each other. For instance, the fact that test document \(d_{i}\) has been correctly classified under a leaf category \(c_{j}\) entails that \(d_{i}\) has also been correctly classified under all the categories in \(\Uparrow({c_{j}}),\) which means that this decision will count as n true positives (where n is the depth of \(c_{j}\)) instead of a single true positive.

The approach we take to averaging is intermediate between these two extremes, and involves distinguishing, for each internal node \(c_{j}\), the roles of “own” and “bubbled-up” positive test examples. In turn, this corresponds to distinguishing the roles of internal nodes as “routers” towards their subordinate nodes or as “repositories” of documents in their own right (a distinction already addressed in Ruiz and Srinivasan 2002).

The approach consists in

  1. mapping the original hierarchy C into a modified hierarchy C′ such that, for each internal node category \(c_{j} \in C\) with “own” positive training examples, C′ contains a “dummy” child (leaf) node \(c_{j}^{\prime}\) appended to \(c_{j}\);

  2. moving into \(c_{j}^{\prime}\) all of \(c_{j}\)’s “own” positive test examples.

This simple mapping, originally proposed in Cheng et al. (2001), produces a hierarchy C′ semantically equivalent to C in which all documents are contained in at least one leaf category.Footnote 13 For evaluation purposes we then use the modified hierarchy C′ instead of the original hierarchy C (even if learning and classification have indeed used C), and average across leaf nodes only. The net effect is that we do take into consideration the ability of the system to correctly classify documents as “own” documents of internal nodes (in the modified hierarchy, this is represented by the system’s effectiveness on “dummy” nodes), and at the same time we remove the dependence between classification decisions due to inherited examples.
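A sketch (ours, with a hypothetical naming convention for the dummy nodes) of this rewriting of the hierarchy for evaluation purposes:

```python
def add_dummy_leaves(tree, own_test_docs):
    """Map the hierarchy C into C' (Sect. 5.2): every internal node with "own"
    positive examples receives a dummy child leaf, into which its own positive
    test examples are moved.

    tree          : dict {category: list of children categories}
    own_test_docs : dict {internal node: set of its "own" positive test documents}
    Returns (the modified tree, a dict {dummy node: test documents moved into it}).
    """
    new_tree = {c: list(ch) for c, ch in tree.items()}
    moved = {}
    for c, docs in own_test_docs.items():
        if new_tree.get(c) and docs:           # an internal node with own examples
            dummy = (c, "dummy")               # hypothetical label for the dummy leaf
            new_tree[c].append(dummy)
            new_tree[dummy] = []               # the dummy node is a leaf
            moved[dummy] = set(docs)
    return new_tree, moved
```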

5.3 The experiments

In all the experiments discussed in this section, punctuation has been removed, all letters have been converted to lowercase, numbers have been removed, stop words have been removed using the stop list provided in Lewis (1992, pp. 117–118), and stemming has been performed by means of Porter’s stemmer.

In a first experiment we have compared AdaBoost.MH and TreeBoost.MH using a full feature set.

We have then performed a number of experiments using feature selection; however, these have been run on Reuters-21578 and RCV1-v2 only, due to the fact that the original feature set of ICCCFT has a very limited size already (1,294 terms only). Reduced feature sets were obtained according to a “global” feature selection policy in which (i) feature-category pairs have been scored by means of information gain, defined as

$$ IG(t_k,c_j)=\sum\limits_{c\in\{c_j,\overline{c}_j\}}\sum\limits_{t\in\{t_k,\overline{t}_k\}} P(t,c) \cdot\log \frac{P(t,c)}{P(t)\cdot P(c)} $$

and (ii) the final set of features has been chosen according to Forman’s round robin technique, which consists in picking, for each category \(c_{j}\), the v features with the highest \(IG(t_{k}, c_{j})\) value, and pooling all of them together into a category-independent set (Forman 2004). This set thus contains a number of features q ≤ vm, where m is the number of categories; it usually contains strictly fewer than vm, since it is usually the case that some features are among the best v features for more than one category. We have set v to 60 for Reuters-21578 and to 43 for RCV1-v2, which are the values that, for each corpus, best approximate a total number of features of 2,000; in fact, the reduced feature sets consist of 2,012 features for Reuters-21578 (11% of the 18,177 original ones) and 2,029 for RCV1-v2 (3.7% of the 55,051 original ones).
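A minimal sketch (ours) of the round-robin pooling step, assuming the information gain scores have already been arranged into an r × m matrix:

```python
import numpy as np

def round_robin_selection(ig, v):
    """Forman-style round-robin pooling of the top-v features per category.

    ig : r x m matrix of scores, ig[k, j] = IG(t_k, c_j)
    v  : number of features retained per category
    Returns the sorted array of selected feature indices; its size is at most
    v * m, and usually smaller because categories share their best features.
    """
    selected = set()
    for j in range(ig.shape[1]):
        # indices of the v features with the highest IG for category c_j
        top_v = np.argsort(ig[:, j])[::-1][:v]
        selected.update(int(k) for k in top_v)
    return np.array(sorted(selected))
```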

We have also run an experiment in which we have used the “glocal” feature selection policy described in Sect. 3.3, consisting in selecting a different subset of features (of the same cardinalities as in the global policy) for the set of children of each different internal node. Note that, for each corpus, the results obtained by means of this policy are reported only for TreeBoost.MH, since this policy obviously is not applicable to AdaBoost.MH.

Table 1 Reuters-21578 macro-categories and their member categories (from Toutanova et al. 2001)

5.4 Effectiveness

The results of our experiments are reported in Table 2.

Table 2 AdaBoost.MH and TreeBoost.MH on Reuters-21578 (top 5 rows), RCV1-v2 (mid 5 rows) and ICCCFT (bottom two rows)

We will now comment on the Reuters-21578 results;Footnote 14 the RCV1-v2 and ICCCFT results are qualitatively similar. The first observation we can make is that, in switching from AdaBoost.MH to TreeBoost.MH, effectiveness improves substantially. \(F_{1}^{\mu}\) improves from +2.3% to +17.2%, depending on the number S of boosting iterations. \(F_{1}^{M}\) improves even more substantially, from +22.0% to +197.4%; this means that TreeBoost.MH is especially suited to categorization problems in which the distribution of training examples across the categories is highly skewed. For both \(F_{1}^{\mu}\) and \(F_{1}^{M}\), the improvements tend to be more substantial for low values of S, showing that TreeBoost.MH converges to optimum performance more rapidly than AdaBoost.MH. Altogether, these effectiveness improvements are somewhat surprising, since it is well-known that hierarchical TC can introduce a deterioration of effectiveness due to classification errors made high up in the hierarchy, which cannot be recovered at the lower levels (Koller and Sahami 1997; McCallum et al. 1998). The improvements thus show that the “filters” placed at the internal nodes work well, likely because their training benefits from using only the “quasi-positive” examples of local interest as negative training examples.

Concerning the RCV1-v2 dataset, note that the results obtained by both AdaBoost.MH and TreeBoost.MH are inferior to the ones reported in Lewis et al. (2004) and obtained, on the same dataset, by SVM-based classification systems. One of the reasons is certainly the fact that \(F_{1}\) is (micro- and macro-)averaged across different sets of categories. In fact, while the authors of Lewis et al. (2004) choose to include internal node categories in the average, as mentioned in Sect. 5.2 we only include their associated dummy nodes. By doing so we avoid “watering down” the evaluation by considering nodes (the internal ones) that are both “easy” (since they typically have many training examples, of the “bubbled-up” type) and scarcely significant from an application point of view (since their most important role is as routers towards the leaves, rather than as categories in their own right).

Obviously, this means that the classification problem we tackle is more difficult than the one tackled in Lewis et al. (2004); in fact, easy nodes are now removed from consideration, while “hard” ones (i.e., the dummy ones, which have very few training examples) are introduced. This is best appreciated by looking at Tables 3 and 4, in which the effectiveness of the classifiers is computed separately for original leaves (Table 3) and dummy nodes (Table 4). The fact that effectiveness is very low on dummy nodes shows that including them in the averages from which Table 2 was computed has considerably reduced the resulting values of effectiveness.

Table 3 AdaBoost.MH and TreeBoost.MH on RCV1-v2; averaging is performed across all original leaf categories (i.e., “dummy” categories are not considered)
Table 4 AdaBoost.MH and TreeBoost.MH on RCV1-v2; averaging is performed across “dummy” categories only

5.5 Significance testing

In order to check whether these results are statistically significant, we have subjected them to thorough statistical significance testing, by applying to the results reported in Table 2 (we use those for S = 1,000 boosting iterations) all the significance tests defined for text classification systems in Yang and Liu (1999, Sect. 4), i.e.:

  1. the s-test: a sign test (Spiegel and Stephens 1999, Chapter 17) which compares two classifiers \(\hat\Upphi_{1}\) and \(\hat\Upphi_{2}\) by analysing their binary decisions on each document/category pair;

  2. the S-test on \(F_{1}\): a sign test which compares two classifiers \(\hat\Upphi_{1}\) and \(\hat\Upphi_{2}\) by analysing the paired \(F_{1}\) scores on individual categories;

  3. the T-test on \(F_{1}\): a t-test (Spiegel and Stephens 1999, Chapter 11) which compares two classifiers \(\hat\Upphi_{1}\) and \(\hat\Upphi_{2}\) by analysing the paired \(F_{1}\) values on individual categories;

  4. the T′-test on \(F_{1}\): a “t-test after rank transformation” which compares two classifiers \(\hat\Upphi_{1}\) and \(\hat\Upphi_{2}\) by analysing the rank positions (1st or 2nd) that the two systems have obtained, in terms of \(F_{1}\), on each individual category;

  5. the p-test on \(\pi^{\mu}\) and \(\rho^{\mu}\): a t-test which compares two classifiers \(\hat\Upphi_{1}\) and \(\hat\Upphi_{2}\) by analysing the microaveraged precision and recall values that the two systems have obtained.

Tests 1 and 5 are designed to evaluate the two systems at the (“micro”) level of individual classification decisions, while Tests 2, 3 and 4 are designed to evaluate them at the (“macro”) level of individual categories, i.e., by analysing the performance scores that the two systems have obtained on such categories. We refer the interested reader to Yang and Liu (1999, Sect. 4) for full mathematical definitions and for a discussion of the strengths and weaknesses of these five tests; we here only note, along with Yang and Liu (1999, p. 47), that “none of the tests is ‘perfect’ for all the performance measures, or for performance analysis with respect to a skewed category distribution, so using them jointly instead of using one test alone would be a better choice”.

Table 5 clearly shows that the results reported in Table 2 are statistically significant. In fact, in 33 out of 42 tests (6 significance tests times the 7 different scenarios in which TreeBoost.MH and AdaBoost.MH are compared) TreeBoost.MH turns out to be statistically significantly better than AdaBoost.MH at a p-value ≤ 0.01 (this corresponds to the cells marked “≪”), while in 4 other tests this holds only at a p-value ≤ 0.05 (cells marked “<”); in the remaining 5 tests no statistically significant difference is found at p-values > 0.05 (cells marked “∼”).

Table 5 Statistical significance tests on the results of Table 2

Note that this result becomes even stronger if we restrict ourselves to the largest of the tested collections (namely, RCV1-v2), on which TreeBoost.MH turns out to be statistically significantly better than AdaBoost.MH at a p-value ≤ 0.01 in 17 out of 18 tests. The strength of these results is also witnessed by the fact that these tests have been run on the results obtained with S = 1,000 boosting iterations, i.e., the scenario in which (see Table 2) the difference between the two systems is smallest.

Finally, note (see 4th and 8th rows of Table 5) that no statistically significant difference is found between the global and “glocal” feature selection policies (however, note that global consistently scores better than glocal on micro-level tests on RCV1-v2), thereby reinforcing the impression that no significant advantage is to be gained by using the latter instead of the former.

5.6 Efficiency

In terms of efficiency, we can observe that training time is reduced by 50.4%, irrespective of the number of iterations, a reduction that confirms the theoretical findings discussed in Sect. 4 (and that might likely be even more substantial in classification problems characterized by a deeper, more articulated hierarchy). Classification time is also generally reduced; aside from an isolated case in which it increases by 16.8%, it is reduced by between 5.5% and 67.4%, with higher reductions being obtained for high values of S; this is likely due to the fact that, since high values of S bring about more effective classifiers, the classifiers placed at internal nodes are more effective at “blocking” unsuitable documents from percolating down to leaves which would reject them anyway.

The experiments run after global feature selection qualitatively confirm the results above. Note that the effectiveness values are practically unchanged with respect to the full feature set experiment; this is especially noteworthy for the RCV1-v2 experiments, in which more than 96% of the original features have been discarded with no loss in effectiveness. Effectiveness also does not change when using “glocal” feature selection. This is somewhat surprising, since an effectiveness improvement might have been expected here, due to the generation of feature sets customized to each internal node. It is thus likely that the values of v chosen when applying the global policy were large enough to allow the inclusion, for each internal node, of enough features customized to it. We plan to carry out further experiments in order to check whether, at more aggressive levels of reduction (i.e. smaller values of v), the glocal policy will indeed prove superior to the global one.

6 Related work

HTC was first tackled in Wiener et al. (1995), in the context of a TC system based on neural networks and latent semantic indexing. The intuition that it could be useful to perform feature selection locally by exploiting the topology of the tree is originally due to Koller and Sahami (1997). However, this work dealt with 1-of-n text categorization, which means that feature selection was performed “collectively”, i.e., relative to the set of children of each internal node; given that we are in an m-of-n classification context, we instead do it “individually”, i.e., relative to each child of any internal node. The intuition that the negative training examples for training the classifier for category \(c_{j}\) could be limited to the positive training examples of categories topologically close to \(c_{j}\) is due to Ng et al. (1997) and Wiener et al. (1995). The notion that, in an m-of-n classification context, classifiers at internal nodes act as “routers” informs much of the HTC literature, and is explicitly discussed at least in Ruiz and Srinivasan (2002), which proposes an HTC system based on neural networks.

Other works in hierarchical text categorization have focused on other specific aspects of the learning task. For instance, the “shrinkage” method presented in McCallum et al. (1998) is aimed at improving parameter estimation for data-sparse leaf categories in a 1-of-n HTC system based on a naïve Bayesian method; the underlying intuitions are specific to naïve Bayesian methods, and do not easily carry over to other contexts. Incidentally, the naïve Bayesian approach seems to have been the most popular among HTC researchers, since several other HTC models are hierarchical variations of naïve Bayesian learning algorithms (Chakrabarti et al. 1998; Gaussier et al. 2002; Toutanova et al. 2001; Vinokourov and Girolami 2002); SVMs have also recently gained popularity in this respect (Cai and Hofmann 2004; Dumais and Chen 2000; Liu et al. 2005; Yang et al. 2003).

Evaluation measures from flat classification are the most popular choices for evaluating HTC systems. Some other researchers (Ceci and Malerba 2007; Sun and Lim 2001) have proposed that evaluation measures specific to the hierarchical case should be used in HTC, so that credit is given to “partially correct” classification, i.e., to the misclassification of a document into a category topologically close to the correct one. We think that these measures are difficult to judge in the abstract, since whether a user would gain any more benefit from a “partially correct” classification than from a “completely wrong” classification remains open to question, and fundamentally dependent on the particular application. We also believe that such proposals may be interesting, if at all, only in 1-of-m classification, in which there is such a notion as the correct category to which a document belongs. For the evaluation of our systems we have thus stuck to using “traditional” evaluation measures.

7 Conclusion

We have presented TreeBoost.MH, a recursive algorithm for hierarchical text categorization that uses AdaBoost.MH as its base step and that recurses over the category tree structure. We have given complexity results showing that TreeBoost.MH, by leveraging the hierarchical structure of the category tree, is exponentially cheaper to train and to test than AdaBoost.MH. These theoretical results have been confirmed by thorough empirical testing on three standard benchmarks, on which TreeBoost.MH has brought about substantial savings at both learning time and classification time. TreeBoost.MH has also been shown to bring about substantial improvements in effectiveness with respect to AdaBoost.MH, especially in terms of macroaveraged effectiveness; this makes it extremely suitable for categorization problems characterized by a skewed distribution of the positive training examples across the categories.