Evaluation measures for hierarchical classification: a unified view and novel approaches
Abstract
Hierarchical classification addresses the problem of classifying items into a hierarchy of classes. An important issue in hierarchical classification is the evaluation of different classification algorithms, an issue which is complicated by the hierarchical relations among the classes. Several evaluation measures have been proposed for hierarchical classification using the hierarchy in different ways without however providing a unified view of the problem. This paper studies the problem of evaluation in hierarchical classification by analysing and abstracting the key components of the existing performance measures. It also proposes two alternative generic views of hierarchical evaluation and introduces two corresponding novel measures. The proposed measures, along with the state-of-the-art ones, are empirically tested on three large datasets from the domain of text classification. The empirical results illustrate the undesirable behaviour of existing approaches and how the proposed methods overcome most of these problems across a range of cases.
Keywords
Evaluation · Evaluation measures · Hierarchical classification · Tree-structured class hierarchies · DAG-structured class hierarchies

1 Introduction
Hierarchical classification addresses the problem of classifying items into a hierarchy of classes. In past years mainstream classification research did not place enough emphasis on the presence of relations between the classes, in our case hierarchical relations. This is gradually changing and more effort is being put into hierarchical classification in particular, partly because many real-world knowledge systems and services use a hierarchical scheme to organize their data (e.g. Yahoo, Wikipedia). Research in hierarchical classification has become important, because flat classification algorithms are ill-equipped to address large-scale problems with hundreds of thousands of hierarchically related classes. Promising initial results on large-scale problems show that hierarchical classifiers can be effective in improving information retrieval (Kosmopoulos et al. 2010).
Many research questions in hierarchical classification remain open. An important issue is how to properly evaluate hierarchical classification algorithms. While standard flat classification problems have benefited from established measures such as precision and recall, there are no established evaluation measures for hierarchical classification tasks, where the assessment of an algorithm becomes more complicated due to the relations among the classes. For example, classification errors in the upper levels of the hierarchy (e.g. when wrongly classifying a document of the class music into the class food) are more severe than those in deeper levels (e.g. when classifying a document from progressive rock as alternative rock). Several evaluation measures have been proposed for hierarchical classification (HC) (Costa et al. 2007; Sokolova and Guy 2009) using the hierarchy in different ways. Nevertheless, none of them is widely adopted, making it very difficult to compare the performance of different HC algorithms.
A number of comparative studies of HC performance measures have been published. An early study can be found in (Sun et al. 2003), which is limited to a particular type of graph-distance measures. A review of HC measures is presented in (Costa et al. 2007), focusing on single-label tasks and without providing any empirical results; in multi-label tasks each object can be assigned to more than one class, e.g. a newspaper article may belong to both politics and economics. In (Nowak et al. 2010) many multi-label evaluation measures are compared, but the role of the hierarchy is not fully considered. Finally, Brucker et al. (2011) provide a comprehensive empirical analysis of performance measures, but they focus on the evaluation of clustering methods rather than classification ones. While these studies provide interesting insights, they all miss important aspects of the problem of evaluating HC algorithms. In particular, they do not abstract the problem in order to describe existing evaluation measures within a common framework. The main contributions of this paper are the following:
1. It groups existing HC evaluation measures under two main types and provides a generic framework for each type, based on flow networks and set theory.
2. It provides a critical overview of the existing HC performance measures using the proposed framework.
3. It introduces two new HC evaluation measures that address important deficiencies of state-of-the-art measures.
4. It provides comparative empirical results on large HC datasets from text classification with a variety of HC algorithms.
2 A framework for hierarchical classification performance measures
This section presents a new framework within which HC performance measures can be described and characterized. Firstly, the main dimensions of the problem are defined and then the general requirements for the evaluation are presented and discussed, based on standard problems that appear in hierarchical classification. We then proceed with the presentation of the proposed framework, which is used in further sections to describe and analyse the measures.
2.1 The main dimensions of the hierarchical classification problem
In classification tasks the training set is typically denoted as \(S=\left\{ (\mathbf x ^i,\mathbf y ^i)\right\} _{i=1}^{n}\), where \(\mathbf x ^i \in \mathcal {X}\) is the feature vector of instance \(i\) in the input space \(\mathcal {X}\) and \(\mathbf y ^i \subseteq \mathcal {Y}\) is the set of classes to which the instance belongs, where \(\mathcal {Y}=\left\{ y_1,\ldots ,y_K \right\} \) is the set of the target classes.
We define as the first dimension (D1) of the hierarchical classification problem whether it is single-label or multi-label. In the single-label case, which is the simplest, each instance \(i\) belongs to exactly one class (the cardinality of the set of labels \(\mathbf y ^i\) is \(1\)), while in the multi-label case an instance \(i\) may belong to more than one class (the cardinality of \(\mathbf y ^i\) is \(\ge 1\)).
The second dimension (D2) concerns the structure of the class hierarchy, which can be either a tree or a directed acyclic graph (DAG). In both cases the hierarchy can be formalized as a strict partial order \(\prec \) over the set of classes \(\mathcal {V}\), satisfying the following properties:
Asymmetry: if \(v_i \prec v_j\) then \(v_j \nprec v_i\) for every \(v_i\), \(v_j\)\(\in \mathcal {V}\).
Anti-reflexivity: \(v_i \nprec v_i\) for every \(v_i \in \mathcal {V}\).
Transitivity: if \(v_i \prec v_j\) and \(v_j \prec v_k\), then \(v_i \prec v_k\) for every \(v_i\), \(v_j\), \(v_k\)\(\in \mathcal {V}\).
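These three properties can be checked mechanically on a small relation. The sketch below (helper names are ours, not from the paper) represents \(\prec \) as a set of ordered pairs:

```python
from itertools import product

def is_strict_partial_order(pairs, nodes):
    """Check the three properties of a strict partial order for a relation
    given as a set of ordered pairs (v_i, v_j) meaning v_i < v_j."""
    rel = set(pairs)
    if any((v, v) in rel for v in nodes):            # anti-reflexivity
        return False
    if any((b, a) in rel for (a, b) in rel):         # asymmetry
        return False
    # transitivity: v_i < v_j and v_j < v_k imply v_i < v_k
    return all((a, c) in rel
               for (a, b), (b2, c) in product(rel, rel) if b == b2)

# a tiny hierarchy: pop < music < root, with the transitive pair pop < root
order = {("pop", "music"), ("music", "root"), ("pop", "root")}
```

Dropping the pair ("pop", "root") from `order` would make the transitivity check fail.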
The third dimension (D3) concerns whether an instance can be classified only to a leaf of the hierarchy or to any class of the hierarchy. In the first case the problem is called mandatory leaf node prediction (MLNP) (Silla and Freitas 2011).
Description of symbols used throughout the article
Symbol | Description |
---|---|
\(\mathbf x \) | A feature vector |
\(y\) | A class label |
\(De(v), An(v), Pa(v)\) | Descendants, ancestors and parents of class \(v\) |
\(\kappa _{ij}\) | Cost of predicting class \(\hat{y}_i\) instead of \(y_j\) |
\(\hat{Y}, Y\) | Sets of predicted and true classes |
\(\hat{Y}_{aug}, Y_{aug}\) | Augmented sets of predicted and true classes |
\([b_u; c_u]\) | Capacity interval of an edge \(u\) of a flow network |
\(\alpha _p, \alpha _t, \beta _p, \beta _t\) | Capacity variables of the flow network |
\(G\) | Directed graph of flow network |
\(E\) | Edge set of flow network |
\(E_H\) | Edge set of hierarchy \(H\) |
\(LCA\) | Lowest common ancestor |
2.2 General problems in hierarchical classification evaluation
Although many influential hierarchical classification papers (Koller and Sahami 1997; McCallum et al. 1998) use accuracy, precision, recall, F-measure, etc. for evaluation, these measures are not appropriate for HC, due to the relations that exist among the classes. A hierarchical performance measure should use the class hierarchy in order to properly evaluate HC algorithms. In particular, one must account for several different types of error according to the hierarchy. For example, consider the tree hierarchy in Fig. 1a. Assume that the true class for a test instance is 3.1 and that two different classification systems output 3 and 1 as the predicted classes. Using flat evaluation measures, both systems are punished equally, but the error of the second system is more severe, as it makes a prediction in a different and unrelated sub-tree.
The first two distance-measuring problems are linked with dimension D3. They appear only when classification to inner nodes is allowed. Figure 2a presents an over-specialization error where the predicted class is a descendant of the true class. Figure 2b depicts an under-specialization error, where an ancestor of the true class is selected. In both cases the desired behaviour of the measure would be to reduce the penalty of the classification system, according to the distance between the true class and the predicted one.
The third case (Fig. 2c), called alternative paths, presents a scenario where there are two different ways to reach the true class starting from a predicted class. In this case, a measure could use one of the two paths or both in order to evaluate the performance of the classification system. Selecting the path that minimizes the distance between the two classes and using that as a measure of error seems reasonable. In Fig. 2c the predicted class is an ancestor of the true class, but an alternative paths case may also involve multiple paths from an ancestor to a descendant predicted class. This case appears only in DAG taxonomies and in that way it is linked to the D2 dimension of the hierarchical classification problem.
Figure 2d presents a scenario linked with the D1 dimension of the hierarchical classification problem: It can only occur in multi-label settings. In this case one must decide, before even measuring the error, which pairs of true and predicted classes should be compared. For example, node A (true class) could be compared to B (predicted) and D to C; or node A could be compared to both B and C, and node D to none; other pairings are also possible. Depending on the pairings, the score assigned to the classifier will be different. It seems reasonable to use the pairings that minimize the classification error. For example, in Fig. 2d it could be argued that the predictions of B and C are based on evidence about A and thus both B and C should be compared to A.
Finally, Fig. 2e presents a case where the predicted class should probably not be matched to any true class. This is typically the case when the predicted class and the true class are too distant, which is why we call this case the long distance problem.
The simplest scenario is a single-label classification problem with a tree hierarchy, where instances may only be classified to leaves. The first four problems do not appear in this scenario and, hence, all the existing hierarchical evaluation measures behave similarly in this case. If any dimension of the problem is altered, one has to deal with some of the above cases and existing measures start behaving differently. Only the final problem (Fig. 2e) appears in every combination of the dimensions of the problem.
We now turn to the two main families of HC evaluation measures, namely pair-based measures and set-based measures.
2.3 Pair-based measures
Pair-based measures assign costs to pairs of predicted and true classes. For example, in Fig. 2d class B could be paired with A and class C with D, and then the sum of the corresponding costs would give the total misclassification error.
Let \(\hat{Y}=\{\hat{y}_i \mid i=1\ldots M\}\) and \(Y=\{y_j \mid j=1\ldots N\}\) be the sets of the predicted and true classes respectively, for a single test instance (the index of the instance is omitted for simplicity). The sets \(Y\) and \(\hat{Y}\) are augmented with a default predicted and a default true class, denoted respectively \(\hat{y}_{M+1}\) and \(y_{N+1}\). The default classes are used when a predicted class cannot or should not be paired with any true class and vice versa. For example, when the distances between a predicted class \(\hat{y}_i\) and all the true classes \(y_j\) exceed a predefined threshold (see the long distance problem in Fig. 2e), the predicted class \(\hat{y}_i\) may be paired with the default true class.
Additionally, let \(\kappa _{ij}\) be the cost of predicting class \(\hat{y}_i\) instead of the true class \(y_j\). The matrix \(\mathbf {K}=[\kappa _{ij}]_{i=1\ldots M+1, j=1 \ldots N+1}\), \(\kappa _{ij}\ge 0, \forall i, j\) contains the costs of all possible pairs of predicted and true classes, including the default classes.
Pair-based measures typically calculate the cost \(\kappa _{ij}\) of a pair of a predicted class \(\hat{y}_i\) and a true class \(y_j\) as the minimum distance between \(\hat{y}_i\) and \(y_j\) in the hierarchy, e.g. as the number of edges along the shortest path that connects them. The intuition is that the closer the two classes are in the hierarchy, the more similar they are, and therefore the less severe the error. More elaborate cost measures may assign weights to the hierarchy’s edges, and the weights may decrease when moving from the top to the bottom (Blockeel et al. 2002; Holden and Freitas 2006). The distance to the default classes is usually set to a fixed large value.
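As an illustration of this cost, a minimal sketch (helper names are ours) that counts the edges on the shortest path between two classes, treating the hierarchy's unweighted edges as undirected:

```python
from collections import deque

def hierarchy_distance(edges, a, b):
    """Number of edges on the shortest path between classes a and b,
    treating the hierarchy's (unweighted) edges as undirected."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    dist = {a: 0}
    q = deque([a])
    while q:                         # plain BFS from a
        u = q.popleft()
        if u == b:
            return dist[u]
        for w in adj.get(u, []):
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    return float("inf")              # disconnected classes

# tree in the spirit of Fig. 1a: classes 1 and 3 under the root, 3.1 under 3
edges = [("root", "1"), ("root", "3"), ("3", "3.1")]
```

On this tree, predicting 3 instead of 3.1 costs 1 edge, while predicting 1 costs 3, matching the intuition that the second error is more severe.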
In a spirit of fairness (minimum penalty), the aim of an evaluation measure is to pair the classes returned by a system and the true classes in a way that minimizes the overall classification error. This can be formulated as the following optimization problem:
Problem 1

\(\min _{l} \sum _{i=1}^{M+1}\sum _{j=1}^{N+1}\kappa _{ij}\, l_{ij}\)

subject to:

(i) \(l_{ij} \in \{0,1\}\) and \(l_{M+1,N+1}=0\),

(ii) \(\alpha _p \le \sum _{j=1}^{N+1} l_{ij} \le \beta _p, \ \forall i=1\ldots M\),

(iii) \(\alpha _t \le \sum _{i=1}^{M+1} l_{ij} \le \beta _t, \ \forall j=1\ldots N\).
Constraint (i) states that \(l_{ij}\), which denotes the alignment between classes, is either 0 (classes \(\hat{y}_i\) and \(y_j\) are not paired) or 1 (classes \(\hat{y}_i\) and \(y_j\) are paired); it furthermore states that the default predicted and true classes cannot be aligned with each other (these default classes are solely used to “collect” those predicted and true classes with no counterpart). The parameters \(\alpha _p\), \(\beta _p \in \mathbb {N}\) (constraint (ii)) are the lower and upper bounds of the allowed number of true classes that a predicted class can be paired with. For example, setting \(\alpha _p=\beta _p=1\) requires each predicted class to be paired with exactly one true class. Similarly, the parameters \(\alpha _t\), \(\beta _t \in \mathbb {N}\) (constraint (iii)) limit the number of predicted classes that a true class can be paired with. The above constraints directly imply that \(\forall i=1\ldots M,\ l_{i,N+1} \le \beta _p\) and \(\forall j=1\ldots N,\ l_{M+1,j} \le \beta _t\), which in turn implies that the default true class can be aligned with at most \(\beta _pM\) predicted classes and the default predicted class with at most \(\beta _tN\) true classes.
Problem 1 corresponds to a best pairing problem in a bipartite graph, with the nodes of the two types standing for predicted and true classes, respectively. It is important to note here that the pairing we are looking for is not a 1–1 matching, since the same node of one type can be paired with several nodes of the other type. We opt to approach Problem 1 as a graph pairing one rather than an integer linear programming one, for two reasons: first because there exist simple polynomial solutions to pairing problems in graphs, and second because the graph framework allows one to easily illustrate how the different cost-based measures proposed so far relate to each other. In particular, we model Problem 1 as a cost flow minimization problem (Ahuja et al. 1993).
2.3.1 A flow network model for class pairing
Integrality Theorem
If a flow network has capacities which are all integer valued and there exists some feasible flow in the network, then there is a minimum cost feasible flow with an integer valued flow on every arc.
The flow network is modelled as a directed graph \(G=(V,E)\), where:

- \(V\) includes a source, a sink, the predicted classes, the true classes, a default true class and a default predicted class;
- \(E\) includes edges from the source to all the predicted classes (including the default predicted class), from every predicted class to every true class (including the default true class), from every true class to the sink, and from the sink to the source.
In our setting, the capacity intervals \([b_u;c_u]\) of the edges \(u\) express the possible number of pairs that each predicted or true class can participate in.
From each predicted class \(\hat{y}_i\) to each true class \(y_j\), excluding the default classes, the capacity interval is [0;1]; the integrality theorem here implies that the flow value between predicted and true classes will be either 0 or 1, i.e. a predicted and a true class are either paired (1) or not paired (0). The capacity bounds here correspond to the \(l_{ij}\) values of Problem 1 (constraint (i)).
From the source to a (non-default) predicted class, the capacity interval is \([\alpha _p;\beta _p]\) meaning that a predicted class is aligned with at least \(\alpha _p\) and at most \(\beta _p\) true classes.
Similarly, from a (non-default) true class to the sink, the capacity interval is [\(\alpha _t\);\(\beta _t\)] meaning that a true class is aligned with at least \(\alpha _t\) and at most \(\beta _t\) predicted classes;
From each predicted class \(\hat{y}_i\) to the default true class the capacity interval is [0;\(\beta _p\)] and from the default predicted class to each true class \(y_j\) the capacity interval is [0;\(\beta _t\)]; from the source (resp. sink) to the default predicted (resp. true) class, the capacity interval is \([0;\beta _t N]\) (resp. \([0;\beta _pM]\)), reflecting the fact that the default predicted (resp. true) class can be paired with at most \(\beta _t N\) true classes (resp. \(\beta _p M\) predicted classes).
Lastly, from the sink to the source, the capacity interval is [\(\alpha _p M\);\(\beta _t N+\beta _p M\)], which corresponds to a loose setting compatible with the intervals given above; this last capacity interval does not impose any constraint, but is necessary to ensure flow conservation.
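The capacity intervals above can be assembled into an explicit edge list. The sketch below (node names and the dictionary representation are our own illustrative choices, not the paper's) builds the network for \(M\) predicted and \(N\) true classes:

```python
def build_flow_edges(M, N, a_p, b_p, a_t, b_t):
    """Return {(u, v): (lower, upper)} capacity intervals for the pairing
    flow network: source S, sink T, predicted classes P1..PM plus default
    DP, true classes Y1..YN plus default DT."""
    E = {}
    P = [f"P{i}" for i in range(1, M + 1)]
    Y = [f"Y{j}" for j in range(1, N + 1)]
    for p in P:
        E[("S", p)] = (a_p, b_p)          # source -> predicted class
        E[(p, "DT")] = (0, b_p)           # predicted -> default true
        for y in Y:
            E[(p, y)] = (0, 1)            # pair (1) or no pair (0)
    for y in Y:
        E[(y, "T")] = (a_t, b_t)          # true class -> sink
        E[("DP", y)] = (0, b_t)           # default predicted -> true
    E[("S", "DP")] = (0, b_t * N)         # source -> default predicted
    E[("DT", "T")] = (0, b_p * M)         # default true -> sink
    E[("T", "S")] = (a_p * M, b_t * N + b_p * M)  # circulation edge
    return E
```

For instance, with \(M=2\), \(N=3\) and all pairing bounds set to 1, the network has \(MN + 2M + 2N + 3 = 19\) edges.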
2.3.2 Existing pair-based measures
Most of the existing pair-based measures deal only with tree hierarchies and single-label problems. Under these conditions the pairing problem becomes simple, because a single path exists between the predicted and the true class of each classified item. The complexity of the problem increases when the hierarchy is a DAG or when the problem is multi-label; current measures cannot handle the majority of the phenomena presented in Sect. 2.2.
In the simplest case of pair-based measures (Dekel et al. 2004; Holden and Freitas 2006; Xiao et al. 2011), the measure trivially pairs the single prediction with the single true label (\(M=N=1\)), so that \(\alpha _p=\beta _p=\alpha _t=\beta _t=1\). Note that no default classes exist in this measure, or equivalently the corresponding costs are equal to infinity: \(\kappa (DP,y)=\kappa (DT,\hat{y})=+\infty \).
In (Sun and Lim 2001) two cost measures are proposed for multi-label problems in tree hierarchies, where all possible pairs of predicted and true classes are used in the calculation. In this case, \(\alpha _p=\beta _p=N\) and \(\alpha _t=\beta _t=M\). Again, no default classes are used, so the corresponding costs are \(\kappa (DP,y_i)=\kappa (DT,\hat{y_j})=+\infty , \ i =1\ldots N, j=1\ldots M\). Note that this is an extreme case, where all pairs of predicted and true labels are used. The weights \(w_e\) are calculated in two alternative ways: (a) as the similarity between a predicted and a true class (e.g. the cosine similarity between the centroids of the example vectors of the two classes), and (b) using the distances in the hierarchy as in Eq. 1.
2.3.3 Multi-label graph induced accuracy
We propose here a straightforward extension of GIE called Multi-label Graph Induced Accuracy (MGIA), in which each class is allowed to participate in more than one pair. This extension makes the method more suitable for the pairing problem. Figure 6 presents the MGIA flow network, in which \(\alpha _p=\alpha _t=1\), \(\beta _p=N\), \(\beta _t=M\), reflecting the fact that each predicted (resp. true) class must be paired with at least one and at most \(\beta _p=N\) (resp. \(\beta _t=M\)) true (resp. predicted) classes. The cost of pairing a class (predicted or true) with a default one is set as in GIE. Solving the flow network optimization problem is easy, since the only constraints are that the default predicted class cannot be paired with the default true class and that classes of the same set (predicted or true) cannot be paired with each other. Thus each pairing can be solved separately from the others, by pairing a class with either the default class of the other set or the nearest class of the other set. An alternative extension is to consider that only true (or predicted) classes can be paired with more than one class. We do not consider these alternatives here, even though we believe they can be useful in certain settings.
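Because the pairings decouple, the MGIA pairing cost can be sketched directly: each class is paired with its nearest counterpart, or with the default class at the maximum cost. The `dist` callable and the toy distances below are our own illustrative stand-ins:

```python
def mgia_pairing_cost(pred, true, dist, d_max):
    """Decomposed solution of the MGIA pairing (alpha_p = alpha_t = 1):
    each predicted class is paired with its nearest true class, or with
    the default class at cost d_max when every true class is too far;
    symmetrically for the true classes."""
    cost = 0.0
    for p in pred:
        cost += min(min(dist(p, t) for t in true), d_max)
    for t in true:
        cost += min(min(dist(p, t) for p in pred), d_max)
    return cost

# toy symmetric distances between two hypothetical classes A and B
toy = {("A", "A"): 0, ("A", "B"): 2, ("B", "A"): 2, ("B", "B"): 0}
toy_dist = lambda p, t: toy[(p, t)]
```

With the threshold `d_max` lowered below the actual distance, both classes fall back to their default counterparts, capping the contribution of a single long-distance error.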
The above measure is bounded in [0,1] and the better a system is, the closer its score is to 1. Note that in the case where all predicted classes and all true classes are paired with the respective default classes, \( fnerror \) will reach its maximum value \((|P| + |T|)\cdot D_{max}\) and will be equal to the denominator, as \(P\cap T =\emptyset \), resulting in a value of 0. Essentially, the advantage of the proposed measure over other pair-based measures is that it takes into account the correct predictions of the classification system (that is, the true positives \(P \cap T\)).
2.4 Set-based measures
The performance measures of this category are based on operations on the entire sets of predicted and true classes, possibly including their ancestors and descendants, as opposed to pair-based measures, which consider only pairs of predicted and true classes. Set-based measures typically involve two steps:

1. The augmentation of \(Y\) and \(\hat{Y}\) with information about the hierarchy.
2. The calculation of a cost measure based on the augmented sets.
2.4.1 Existing set-based measures
2.4.2 Lowest common ancestor precision, recall and \(F_1\) measures
The set-based measure proposed in this paper builds on the hierarchical versions of precision, recall and \(F_1\), which add all the ancestors of the true and predicted classes to \(Y_{aug}\) and \(\hat{Y}_{aug}\) respectively. However, adding all the ancestors has the undesirable effect of over-penalizing errors that occur at nodes with many ancestors. In an attempt to address this issue, we propose the lowest common ancestor precision (\(P_{LCA}\)), recall (\(R_{LCA}\)) and \(F_1\) (\(F_{LCA}\)) measures. These measures use the concept of the lowest common ancestor (\(LCA\)), as defined in graph theory (Aho et al. 1973).
Definition 1

The lowest common ancestor \(LCA(n_1,n_2)\) of two nodes \(n_1\) and \(n_2\) of a tree \(T\) is the node furthest from the root that is an ancestor of both \(n_1\) and \(n_2\).
For example, in Fig. 8a \(LCA(3.1, 3.2.2) = 3\). In the case of a DAG the definition of \(LCA\) changes: \(LCA(n_{1},n_{2})\) is a set of nodes (instead of a single node), since it is possible for two nodes to have more than one \(LCA\). Furthermore, an \(LCA\) is not necessarily the node furthest from the root, so we need to rely on the shortest paths between the two nodes.
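For the tree case, the computation can be sketched with a child-to-parent map; the map below covers only the fragment of the tree of Fig. 8a used in the example:

```python
def ancestors(parent, n):
    """Chain from n up to the root in a tree given as a child -> parent map."""
    chain = [n]
    while n in parent:
        n = parent[n]
        chain.append(n)
    return chain

def lca_tree(parent, a, b):
    """The first ancestor of a (walking upwards) that is also an ancestor
    of b, i.e. the common ancestor furthest from the root."""
    anc_b = set(ancestors(parent, b))
    for n in ancestors(parent, a):
        if n in anc_b:
            return n

# fragment of the tree of Fig. 8a used in the example
parent = {"3.1": "3", "3.2": "3", "3.2.2": "3.2", "3": "root"}
```

Here `lca_tree(parent, "3.1", "3.2.2")` recovers node 3, as in the example.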
Definition 2
Given a set \(p_{all}(n_{1},n_{2})\) containing all paths that connect nodes \(n_1\) and \(n_2\), we define \(path_{min}(n_1,n_2)\) as the subset of \(p_{all}(n_{1},n_{2})\) for which: \(\forall p \in path_{min}(n_1,n_2); \not \exists p' \in p_{all}(n_{1},n_{2}) : cost(p') < cost (p)\).
where the cost of a path corresponds to its length, when the edges of the hierarchy are unweighted and the cost of a set \(path_{min}(n_{1},n_{2})\) is the cost of any path in the set since all of them are of the same length.
\(path_{min}(2.1,3.1)=\big \{ \{2.1, 3, 3.1\} \big \}\)
\(path_{min}(2.1,3.2.2)=\big \{ \{2.1, 3, 3.2.2\} \big \}\)
\(path_{min}(3.2.2, 3.2.1)=\big \{ \{3.2.2, 3.2, 3.2.1\} \big \}\)
Definition 3
The lowest common ancestor \(LCA(n_1,n_2)\) of two nodes \(n_1\) and \(n_2\) of a DAG \(D\) is defined as the set of nodes \(S\) for which: \(\forall p \in path_{min}(n_1,n_2); \exists n \in S \cap p \wedge \not \exists n' \in p: n'\) is closer to the root than \(n\).
In multi-label classification, it is necessary to extend the definition of \(LCA\) to compare a node \(n\) (e.g. a true class) against a set of nodes \(S\) (e.g. the predicted classes), as follows:
Definition 4
The \(LCA(n,S)\) of a node \(n\) and a set of nodes \(S\) is the set of all the lowest common ancestors \(LCA(n,i)\) for each \(i \in S_{best}(n,S) \subseteq S\), where \(S_{best}(n,S)\) contains the nodes of \(S\) that are closest to \(n\) (\(S_{best}(n,S)= \{i \in S: \not \exists j \in S,\)\(j \ne i \wedge cost(path_{min}(n,i)) > cost(path_{min}(n,j))\}\)).
For example, in Fig. 8b \(S_{best}(3.1,\{2.1, 3.3, 3.2.1\})=\{2.1, 3.3\}\) and \(LCA(3.1, \{2.1, 3.3, 3.2.1\})\) is \(\{3\}\).
\(LCA (2.1, \hat{Y}) = \{3\}\), connecting 2.1 with either 3.1 using \(path_{min}(2.1,3.1)\) or 3.2.1 using \(path_{min}(2.1,3.2.1)\) or 3.2.2 using \(path_{min}(2.1,3.2.2)\).
\(LCA (3.3, \hat{Y}) = \{3\}\), connecting 3.3 with either 3.1 using \(path_{min}(3.3, 3.1)\) or 3.2.1 using \(path_{min}(3.3,3.2.1)\) or 3.2.2 using \(path_{min}(3.3, 3.2.2)\).
\(LCA (3.2.1, \hat{Y}) = \{3.2.1\}\), connecting 3.2.1 with itself.
\(LCA (3.2.1, Y) = \{3.2.1\}\), connecting 3.2.1 with itself.
\(LCA (3.1, Y) = \{3\}\), connecting 3.1 with either 2.1 using \(path_{min}(3.1, 2.1)\) or 3.3 using \(path_{min}(3.1, 3.3)\).
\(LCA (3.2.2, Y) = \{3.2, 3\}\), the first connecting 3.2.2 with 3.2.1 using \(path_{min}(3.2.2, 3.2.1)\) and the second connecting 3.2.2 with either 2.1 using \(path_{min}(3.2.2, 2.1)\) or 3.3 using \(path_{min}(3.2.2, 3.3)\).
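The examples above can be reproduced with a short sketch. The DAG below is our reconstruction of the relevant fragment of Fig. 8b from the listed minimum paths (an assumption), and Definition 3 is read as keeping, on each minimum path, the common ancestors of the two nodes that are closest to the root:

```python
from collections import deque

# Assumed fragment of Fig. 8b implied by the path_min examples above;
# note that 3.2.2 has two parents, 3 and 3.2.
parents = {"2.1": ["3"], "3.1": ["3"], "3.2": ["3"], "3.3": ["3"],
           "3.2.1": ["3.2"], "3.2.2": ["3", "3.2"], "3": ["root"]}

adj = {}                              # undirected adjacency for BFS
for c, ps in parents.items():
    for p in ps:
        adj.setdefault(c, set()).add(p)
        adj.setdefault(p, set()).add(c)

def all_min_paths(a, b):
    """All minimum undirected paths between a and b (BFS + backtracking)."""
    dist = {a: 0}
    q = deque([a])
    while q:
        u = q.popleft()
        for w in adj.get(u, ()):
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    paths = []
    def back(path):                   # walk back along dist-decreasing edges
        u = path[-1]
        if u == a:
            paths.append(path[::-1])
            return
        for w in adj[u]:
            if dist.get(w) == dist[u] - 1:
                back(path + [w])
    back([b])
    return paths

def ancestors(n):
    out = {n}                         # a node counts as its own ancestor here
    for p in parents.get(n, []):
        out |= ancestors(p)
    return out

def depth(n):
    return 0 if n == "root" else 1 + min(depth(p) for p in parents[n])

def lca(a, b):
    """Definition 3: on every minimum path, keep the common ancestors of
    a and b that are closest to the root."""
    common = ancestors(a) & ancestors(b)
    out = set()
    for p in all_min_paths(a, b):
        cand = [n for n in p if n in common]
        d = min(depth(n) for n in cand)
        out |= {n for n in cand if depth(n) == d}
    return out

def lca_set(n, S):
    """Definition 4: union of LCA(n, i) over the closest members of S."""
    d = {i: len(all_min_paths(n, i)[0]) - 1 for i in S}
    out = set()
    for i in S:
        if d[i] == min(d.values()):   # i belongs to S_best(n, S)
            out |= lca(n, i)
    return out
```

On this reconstruction, `lca_set("3.2.2", ["2.1", "3.3", "3.2.1"])` yields {3, 3.2}, matching the example for \(LCA(3.2.2, Y)\).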
Definition 5
Given a set of true classes (nodes) \(Y\) and a set of predicted classes (nodes) \(\hat{Y}\), we define \(LCA_{all}(Y,\hat{Y})\) as the union of all LCA(\(y,\hat{Y}\)) for all \(y \in Y\). Similarly we define \(LCA_{all}(\hat{Y},Y)\) as the union of all LCA(\(\hat{y},Y\)) for all \(\hat{y} \in \hat{Y}\).
In the above example, \(LCA_{all}(Y,\hat{Y})=\){3, 3.2.1}, \(LCA_{all}(\hat{Y},Y)=\){3, 3.2, 3.2.1}. From the above sets, one can then define the subgraph relating to sets of edges:
Definition 6

The graph \(G^{ex}_{t}(Y,\hat{Y})\) contains the following paths as its edges:

- all \(path_{min}(y,a)\): \(y \in Y \wedge a \in LCA(y,\hat{Y})\)
- all \(paths(y,a)\) that are subpaths of each path of \(paths_{min}(\hat{y}, y):\hat{y} \in \hat{Y} \wedge y \in S_{best}(\hat{y},Y) \wedge a \in LCA(\hat{y},Y)\)

Similarly, the graph \(G^{ex}_{p}(Y,\hat{Y})\) contains:

- all \(path_{min}(\hat{y},b)\): \(\hat{y} \in \hat{Y} \wedge b \in LCA(\hat{y},Y)\)
- all \(paths(\hat{y},b)\) that are subpaths of each path of \(paths_{min}(y,\hat{y}):y \in Y \wedge \hat{y} \in S_{best}(y,\hat{Y}) \wedge b \in LCA(y,\hat{Y})\)
The two graphs \(G^{ex}_{t}(Y,\hat{Y})\) and \(G^{ex}_{p}(Y,\hat{Y})\) are created using all nodes of \(LCA_{all}(Y,\hat{Y})\) and \(LCA_{all}(\hat{Y},Y)\) and all corresponding paths. However, one can select subgraphs \(G_{t}(Y,\hat{Y})\subseteq G^{ex}_{t}(Y,\hat{Y})\) and \(G_{p}(Y,\hat{Y}) \subseteq G^{ex}_{p}(Y,\hat{Y})\) connecting each node of \(Y \cup \hat{Y}\) with an \(LCA\). For example, in Fig. 8b node 3.2.2 has two LCAs, nodes 3.2 and 3. If node 3.2 is removed from \(G^{ex}_{t}(Y,\hat{Y})\) and \(G^{ex}_{p}(Y,\hat{Y})\), we obtain the graphs \(G_{t}(Y,\hat{Y})\) and \(G_{p}(Y,\hat{Y})\) of Fig. 10. \(P_{LCA}\), \(R_{LCA}\) and \(F_{LCA}\), computed between the reduced sets \(Y_{aug}\) and \(\hat{Y}_{aug}\) of Fig. 10, are 0.5 instead of 0.6 (Fig. 9). In other words, graphs \(G_{t}(Y,\hat{Y})\) and \(G_{p}(Y,\hat{Y})\) should comprise only the nodes necessary for connecting the two sets through their LCAs. As redundant nodes can lead to fluctuations in \(P_{LCA}\), \(R_{LCA}\) and \(F_{LCA}\), they should be removed. In order to obtain the minimal LCA graphs, one has to solve the following maximization problem:
Problem 2
The maximization of \(F_{LCA}(G_{t}(Y,\hat{Y}),G_{p}(Y,\hat{Y}))\) is subject to a set of constraints: Constraint (i) requires all class nodes of an initial set (\(Y\) or \(\hat{Y}\)) to be included in the final subgraphs. Constraint (ii) enforces the existence of at least one LCA for each node of \(Y \cup \hat{Y}\), in the subgraphs. Constraint (iii) limits the total number of LCAs used to the minimum required in order to be able to satisfy constraints (i) and (ii). Constraint (iv) implies the existence of at least one path connecting each class node of each subgraph to one of its LCAs, while constraint (v) implies the inverse, i.e. that each LCA of the subgraphs is connected with at least one class node of each subgraph.
Procedure GetBestLCAs returns an approximation of the minimum number of \(LCAs\) needed to satisfy constraints (i), (ii) and (iii) of the maximization problem. This is achieved by initially sorting all \(LCAs\), in descending order, by the number of nodes of \(Y\) and \(\hat{Y}\) that they connect. On this list we perform two passes, top-down and bottom-up, removing all redundant \(LCAs\), i.e. \(LCAs\) of nodes for which other \(LCAs\) are already included in the list. In the final step of the algorithm, GetBestPaths selects the minimum paths that satisfy constraints (iv) and (v). In case two or more paths connect the same node with an \(LCA\), we choose the one that leads to the smallest possible subgraphs.
An interesting issue arises when a class and one of its ancestors co-exist in the predicted or the true class sets. Assume, for example, that a system A predicts that an instance belongs to node \(X\), while another system B also assigns it to one of the ancestors of \(X\). Each extra ancestor of \(X\) would lead to a higher \(F_{1}\) score, since it would increase the size of the set \(Y_{aug} \cap \hat{Y}_{aug}\). This happens because all the ancestors of an \(LCA(n_{1},n_{2})\) are also ancestors of nodes \(n_{1}\) and \(n_{2}\). We address this issue by removing from set \(Y\) any node \(y\) for which \(\exists y'\in Y : y'\) is a descendant of \(y\). We then do the same for set \(\hat{Y}\), removing each node \(\hat{y}\) for which \(\exists \hat{y'} \in \hat{Y} : \hat{y'}\) is a descendant of \(\hat{y}\).
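This filtering step, keeping only the most specific classes of a set, can be sketched as follows (the tiny hierarchy and names are our own illustration):

```python
# tiny illustrative tree: pop is a child of music, music a child of arts
parent = {"pop": "music", "music": "arts"}

def ancestors_of(n):
    """Proper ancestors of n in the tree above."""
    out = set()
    while n in parent:
        n = parent[n]
        out.add(n)
    return out

def most_specific(nodes):
    """Keep a node only if no other member of the set is its descendant,
    i.e. drop every node that appears among another member's ancestors."""
    return {y for y in nodes
            if not any(y in ancestors_of(z) for z in nodes if z != y)}
```

For example, a prediction set containing both `music` and its descendant `pop` is reduced to `{"pop"}`, so the extra ancestor no longer inflates the score.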
Summary of the different evaluation measures according to the three dimensions: (a) single or multi-label problem, (b) structure of the hierarchy and (c) mandatory leaf node prediction problem or not
Reference | D1 | D2 | D3 |
---|---|---|---|
(Dekel et al. 2004) | SL | T | Non-MLNP |
(Holden and Freitas 2006) | SL | T | Non-MLNP |
(Sun and Lim 2001) | ML | T | Non-MLNP |
GIE | ML | DAG | Non-MLNP |
MGIA | ML | DAG | Non-MLNP |
(Kiritchenko et al. 2005) | ML | DAG | Non-MLNP |
(Struyf et al. 2005) | ML | DAG | Non-MLNP |
(Cai and Hofmann 2007) | ML | DAG | Non-MLNP |
(Cesa-Bianchi et al. 2006) | ML | DAG | Non-MLNP |
\(P_{LCA}, R_{LCA}, F_{LCA}\) | ML | DAG | Non-MLNP |
2.5 Summary of approaches
Dimension 1: the problem can be either a single-labelling one (SL), where each instance is assigned to only one class label, or multi-labelling (ML) where each instance can be assigned multiple class labels.
Dimension 2: the graph structure can be either a tree (T) or a directed acyclic graph (DAG).
Dimension 3: each instance is classified either only to leaf class labels of the hierarchy (mandatory leaf node prediction, MLNP) or to any class in the hierarchy (non-MLNP).
3 Case studies
In this section we apply various measures to selected cases in order to demonstrate their pros and cons. As a representative of the previously proposed pair-based measures we chose the graph induced error (GIE), while for set-based measures we selected the hierarchical versions of precision (\(P_H\)), recall (\(R_H\)), \(F_1\)-measure (\(F_H=\tfrac{2 \cdot P_H \cdot R_H}{P_H+R_H}\)) and Symmetric Difference Loss (\(l_{\varDelta }(Y_{aug},\hat{Y}_{aug})\)), using all the ancestors of the predicted (\(\hat{Y}\)) and true (\(Y\)) labels in order to augment the sets of classes. We also use our proposed pair-based measure MGIA and the set-based LCA versions of precision (\(P_{LCA}\)), recall (\(R_{LCA}\)) and \(F_1\)-measure (\(F_{LCA}\)) in order to illustrate their advantages and limitations, as well as the differences between the two types of measures. The LCAs used in this section are minimum LCAs. For MGIA we also provide in brackets the \(fnerror\), before the transformation proposed in Sect. 2.3.3, to allow an easier comparison with GIE. All the above measures are implemented in a fast and easy-to-use open-source tool, written in C++, which is available for download.^{3}
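For reference, the ancestor-augmented measures \(P_H\), \(R_H\) and \(F_H\) can be sketched as follows; the hierarchy used in the example is the one implied by the discussion of Fig. 11 (root > Arts > Music > {Pop, Rock}), which is our assumption:

```python
def augment(nodes, parent):
    """Add to a set of classes all their ancestors, excluding the root."""
    out = set()
    for n in nodes:
        out.add(n)
        while n in parent and parent[n] != "root":
            n = parent[n]
            out.add(n)
    return out

def hier_prf(Y, Y_hat, parent):
    """Hierarchical precision, recall and F1 over ancestor-augmented sets."""
    y_aug, yhat_aug = augment(Y, parent), augment(Y_hat, parent)
    p = len(y_aug & yhat_aug) / len(yhat_aug)
    r = len(y_aug & yhat_aug) / len(y_aug)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# hierarchy assumed from the discussion of Fig. 11: Arts > Music > {Pop, Rock}
parent = {"pop": "music", "rock": "music", "music": "arts", "arts": "root"}
```

With true class Pop and predicted class Rock, both augmented sets contain the shared ancestors Music and Arts, giving \(P_H = R_H = F_H = 2/3\).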
Like all pair-based methods, GIE and MGIA require a maximum distance threshold, above which nodes are paired with a default one. In the cases studied here, this threshold is set to 5.
Based on the three dimensions of the hierarchical classification problem (Sects. 2.1 and 2.5) and the subproblems presented in Sect. 2.2, which highlight important challenges in hierarchical evaluation, we present cases of specific problem settings. The list of cases here is not exhaustive, but it is sufficient to motivate the use of the proposed measures. Additionally, in Sect. 4, we present results on real datasets with real classification systems.
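As a reference for the cases that follow, the ancestor-based set measures can be sketched in a few lines. This is an illustrative sketch: the function names are ours, and the example hierarchy (Arts → Music → {Pop, Rock, Classical}) is the one used in the cases of this section.

```python
def ancestors(node, parent):
    """All ancestors of `node` in a tree given as a child -> parent map."""
    result = set()
    while node in parent:
        node = parent[node]
        result.add(node)
    return result

def augment(labels, parent):
    """A label set together with all ancestors of its members."""
    aug = set(labels)
    for label in labels:
        aug |= ancestors(label, parent)
    return aug

def set_based_measures(Y, Y_hat, parent):
    """Hierarchical precision/recall/F1 and symmetric difference loss."""
    Y_aug, Yh_aug = augment(Y, parent), augment(Y_hat, parent)
    overlap = len(Y_aug & Yh_aug)
    p, r = overlap / len(Yh_aug), overlap / len(Y_aug)
    f = 2 * p * r / (p + r) if p + r else 0.0
    loss = len(Y_aug ^ Yh_aug)  # symmetric difference loss
    return p, r, f, loss

# Case (a) of Sect. 3.1: true class Pop, predicted class Rock (siblings).
parent = {"Music": "Arts", "Pop": "Music",
          "Rock": "Music", "Classical": "Music"}
p, r, f, loss = set_based_measures({"Pop"}, {"Rock"}, parent)
# p = r = f = 2/3 and loss = 2, matching row (a) of the Fig. 11 results.
```

The augmented sets here are {Pop, Music, Arts} and {Rock, Music, Arts}, so the two common ancestors dominate the score regardless of how far apart the leaves are.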
3.1 Simplest problem setting: single-label classification in a tree hierarchy
Results per measure for Fig. 11
Pair-based | Set-based measures | ||||||||
---|---|---|---|---|---|---|---|---|---|
GIE | MGIA | \(P_H\) | \(R_H\) | \(F_H\) | \(l_{\varDelta }(Y_{{ aug}},\hat{Y}_{{ aug}})\) | \(P_{{ LCA}}\) | \(R_{{ LCA}}\) | \(F_{{ LCA}}\) | |
a | 2 | 0.8(2) | 0.66 | 0.66 | 0.66 | 2 | 0.5 | 0.5 | 0.5 |
b | 3 | 0.7(3) | 0.5 | 0.33 | 0.4 | 3 | 0.5 | 0.33 | 0.4 |
The first observation is that GIE behaves in the same desirable way as MGIA in such simple settings: both penalize the second case more heavily (the error increases and the accuracy decreases). The same holds for all the set-based measures (\(F_H\), \(l_{\varDelta }(Y_{aug},\hat{Y}_{aug})\) and \(F_{LCA}\)), which all penalize the second case more than the first. The only notable difference here is between the existing hierarchical versions of precision, recall and \(F_1\) and the proposed LCA versions. In case (a), the LCA versions use in their calculations only the node Music, which is the LCA of Pop and Rock, while the existing hierarchical versions also take into account the node Arts. The LCA versions are thus stricter in case (a) than the existing set-based measures. We believe that this behavior is desirable: for every extra node above Arts the LCA versions would return the same results, while the existing set-based measures would underestimate the error, producing higher scores.
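For a single true and a single predicted label in a tree, the LCA-based variants reduce to comparing the two paths up to the lowest common ancestor. The following is a minimal sketch of ours for exactly that case (the multi-label case uses the augmented graphs \(G_t\) and \(G_p\) of Sect. 2 instead):

```python
def ancestor_chain(node, parent):
    """Ancestors of `node`, nearest first, in a child -> parent tree."""
    chain = []
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain

def path_to_lca(a, b, parent):
    """Path from `a` up to (and including) the LCA of `a` and `b`."""
    a_line = [a] + ancestor_chain(a, parent)
    b_line = set([b] + ancestor_chain(b, parent))
    path = []
    for node in a_line:
        path.append(node)
        if node in b_line:  # first common node on the way up = LCA
            return path
    raise ValueError("nodes are not in the same tree")

def lca_prf(y, y_hat, parent):
    """LCA-based precision/recall/F1 for one true and one predicted label."""
    Y_aug = set(path_to_lca(y, y_hat, parent))
    Yh_aug = set(path_to_lca(y_hat, y, parent))
    overlap = len(Y_aug & Yh_aug)
    p, r = overlap / len(Yh_aug), overlap / len(Y_aug)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

parent = {"Music": "Arts", "Pop": "Music", "Rock": "Music"}
p, r, f = lca_prf("Pop", "Rock", parent)
# Only the LCA (Music) is shared: p = r = f = 0.5, as in case (a) above.
```

Note that the node Arts never enters the computation, which is exactly why adding further nodes above the LCA leaves the LCA measures unchanged.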
3.2 Switching the D1 dimension and handling the pairing problem
The cases presented here alter the D1 dimension of the hierarchical classification problem, by allowing an instance to belong to more than one category. Problems in these settings arise when the numbers of true and predicted labels (classes) differ, which was defined in Sect. 2.2 as the Pairing Problem. In this elementary case, where the true and predicted classes are at the same level of the hierarchy, Fig. 12a, b are two symmetric variants leading to different, but symmetric, hierarchical precision (\(P_H\)) and recall (\(R_H\)) scores, as shown in Table 4. The same holds for our proposed LCA versions of precision and recall (\(P_{LCA}\) and \(R_{LCA}\)). The results of the hierarchical versions and the LCA versions differ, because the LCA versions ignore the graph above the node Music, which is the lowest common ancestor of the nodes Pop, Rock and Classical. The hierarchical versions, on the other hand, also take into account the node Arts and therefore give higher scores. This behavior is undesirable, since each extra node above Music would increase the scores of the hierarchical measures, but would not affect the LCA versions.
Results per measure for Fig. 12
Pair-based | Set-based measures | ||||||||
---|---|---|---|---|---|---|---|---|---|
GIE | MGIA | \(P_H\) | \(R_H\) | \(F_H\) | \(l_{\varDelta }(Y_{{ aug}},\hat{Y}_{{ aug}})\) | \(P_{{ LCA}}\) | \(R_{{ LCA}}\) | \(F_{{ LCA}}\) | |
a | 7 | 0.73(4) | 0.5 | 0.66 | 0.57 | 3 | 0.33 | 0.5 | 0.4 |
b | 7 | 0.73(4) | 0.66 | 0.5 | 0.57 | 3 | 0.5 | 0.33 | 0.4 |
The more complex case of Fig. 13 is an example showing that taking into account all the ancestors is undesirable, compared to our proposed LCA approach for set-based measures. The hierarchy is still a tree and the classification is multi-label. Europop is predicted correctly, while an extra false category is also predicted. Although the mistake in Fig. 13b is worse than that of Fig. 13a, since it is further from the true class Europop, all set-based measures except our proposed LCA measures give the same results in both cases. This is because \(l_{\varDelta }(Y_{aug},\hat{Y}_{aug})\) and the hierarchical versions of precision, recall and \(F_1\) take into account all the ancestors of the predicted and true labels, while the LCA versions use the augmented graphs \(G_t(Y,\hat{Y})\) and \(G_p(Y,\hat{Y})\), which are created using only the lowest common ancestors (LCAs). The LCA measures pair each node with the closest node of the other set and thus take into account the distance between predicted and true nodes, which the other set-based measures ignore.
Thus in Fig. 13a the augmented sets of the LCA are {Europop, Pop} and {Europop, Pop, Beat Music}, while for all the other set-based measures they are {Europop, Pop, Music, Arts} and {Europop, Pop, Beat Music, Music, Arts}. In Fig. 13b the augmented sets of LCA become {Europop, Pop, Music} and {Europop, Rock, Music}, while for all the other set-based measures they remain the same. The differentiation between such cases is an advantage of the LCA versions of precision, recall and \(F_1\) over existing set-based measures.
3.3 Single-label classification in a DAG hierarchy
Results per measure for Fig. 13
Pair-based | Set-based measures | ||||||||
---|---|---|---|---|---|---|---|---|---|
GIE | MGIA | \(P_H\) | \(R_H\) | \(F_H\) | \(l_{\varDelta }(Y_{{ aug}},\hat{Y}_{{ aug}})\) | \(P_{{ LCA}}\) | \(R_{{ LCA}}\) | \(F_{{ LCA}}\) | |
a | 2 | 0.8(2) | 0.8 | 1 | 0.89 | 1 | 0.66 | 1 | 0.8 |
b | 3 | 0.7(3) | 0.8 | 1 | 0.89 | 1 | 0.66 | 0.66 | 0.66 |
3.4 Multi-label classification in a DAG
Results per measure for Fig. 14
Pair-based | Set-based measures | ||||||||
---|---|---|---|---|---|---|---|---|---|
GIE | MGIA | \(P_H\) | \(R_H\) | \(F_H\) | \(l_{\varDelta }(Y_{{ aug}},\hat{Y}_{{ aug}})\) | \(P_{{ LCA}}\) | \(R_{{ LCA}}\) | \(F_{{ LCA}}\) | |
a | 2 | 0.8(2) | 0.5 | 0.66 | 0.57 | 3 | 0.5 | 0.5 | 0.5 |
b | 2 | 0.8(2) | 0.66 | 0.66 | 0.66 | 2 | 0.5 | 0.5 | 0.5 |
Results per measure for Fig. 15
Pair-based | Set-based measures | |||||||
---|---|---|---|---|---|---|---|---|
GIE | MGIA | \(P_H\) | \(R_H\) | \(F_H\) | \(l_{\varDelta }(Y_{{ aug}},\hat{Y}_{{ aug}})\) | \(P_{{ LCA}}\) | \(R_{{ LCA}}\) | \(F_{{ LCA}}\) |
7 | 0.6(6) | 0.4 | 0.66 | 0.5 | 4 | 0.4 | 0.66 | 0.5 |
Results per measure for Fig. 16
Pair-based | Set-based measures | |||||||
---|---|---|---|---|---|---|---|---|
GIE | MGIA | \(P_H\) | \(R_H\) | \(F_H\) | \(l_{\varDelta }(Y_{{ aug}},\hat{Y}_{{ aug}})\) | \(P_{{ LCA}}\) | \(R_{{ LCA}}\) | \(F_{{ LCA}}\) |
10 | 0(10) | 0.166 | 0.33 | 0.22 | 7 | 0.2 | 0.33 | 0.25 |
3.5 Over- and under-specialization
Results per measure for Fig. 17
Pair-based | Set-based measures | ||||||||
---|---|---|---|---|---|---|---|---|---|
GIE | MGIA | \(P_H\) | \(R_H\) | \(F_H\) | \(l_{\varDelta }(Y_{{ aug}},\hat{Y}_{{ aug}})\) | \(P_{{ LCA}}\) | \(R_{{ LCA}}\) | \(F_{{ LCA}}\) | |
a | 1 | 0.9(1) | 0.66 | 1 | 0.8 | 1 | 0.5 | 1 | 0.66 |
b | 1 | 0.9(1) | 1 | 0.66 | 0.8 | 1 | 1 | 0.5 | 0.66 |
c | 2 | 0.8(2) | 1 | 0.33 | 0.49 | 2 | 1 | 0.33 | 0.49 |
Figure 17a shows a case of over-specialization. As described in Sect. 2.2, different evaluation measures treat this type of error differently. One could even argue that since Pop is predicted, Music is implicitly predicted as well, being a direct ancestor of Pop. But the true category is Music, not Pop, and as shown in Table 9 all measures treat this over-specialized prediction as an error.
Regarding under-specialization, the simplest example is shown in Fig. 17b. It is also considered an error, and one that becomes more severe the further the true category is from the predicted one. For example, in Fig. 17c the predicted node is an ancestor of the predicted node of Fig. 17b, and all measures yield a higher error estimate in this case than in Fig. 17b. A similar example for over-specialization would lead to the same observations.
Our proposed measures do not offer any advantage along this dimension of the hierarchical classification problem, compared to the existing ones. We have not observed any negative behavior of the existing measures that we would like to correct; we merely wish to show that our new measures do not behave peculiarly in this dimension of the problem.
3.6 Long distance predictions
The aim of this case (Fig. 18) is to show how each of the two types of measure (pair-based and set-based) handles very large distances between predicted and true labels. As discussed in Sect. 2.2, the long distance problem is not affected by the dimensions of the hierarchical classification problem. Pair-based measures compute the distance between each pair of predicted and true nodes and, if this distance exceeds a certain threshold, a standard maximum distance is assigned instead. Set-based measures can use a threshold on the number of ancestors of the predicted and true nodes that are added to the augmented sets. When this threshold is exceeded, an artificial common ancestor is introduced, at a distance equal to the threshold, in order to connect at least one predicted node with one true node.
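For pair-based measures, the thresholding described above amounts to capping the tree distance of each node pair. A sketch under our own naming, with the default maximum of 5 mirroring the threshold used in these cases:

```python
def to_root(node, parent):
    """The chain node, parent(node), ..., root in a child -> parent tree."""
    chain = [node]
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain

def capped_distance(a, b, parent, max_dist=5):
    """Tree distance between two nodes, replaced by `max_dist` when larger
    (or when the nodes are not connected at all)."""
    a_chain, b_chain = to_root(a, parent), to_root(b, parent)
    b_depths = {node: i for i, node in enumerate(b_chain)}
    for i, node in enumerate(a_chain):
        if node in b_depths:  # first common node on a's way up = LCA
            return min(i + b_depths[node], max_dist)
    return max_dist  # disconnected: assign the default maximum

parent = {"Music": "Arts", "Pop": "Music", "Rock": "Music",
          "Europop": "Pop"}
d = capped_distance("Europop", "Rock", parent)  # 3: Europop-Pop-Music-Rock
```

Any pair further apart than the cap contributes the same fixed penalty, which is precisely how the long distance problem is kept from dominating the score.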
Results per measure for Fig. 18
Pair-based | Set-based measures | ||||||||
---|---|---|---|---|---|---|---|---|---|
GIE | MGIA | \(P_H\) | \(R_H\) | \(F_H\) | \(l_{\varDelta }(Y_{{ aug}},\hat{Y}_{{ aug}})\) | \(P_{{ LCA}}\) | \(R_{{ LCA}}\) | \(F_{{ LCA}}\) | |
a | 6 + Max | 0.2(12) | 0.16 | 0.33 | 0.22 | 7 | 0.16 | 0.33 | 0.22 |
b | 4 + Max | 0.466(8) | 0.25 | 0.33 | 0.28 | 5 | 0.25 | 0.33 | 0.28 |
3.7 Multiple path counting
Results per measure for Fig. 19
Pair-based | Set-based measures | ||||||||
---|---|---|---|---|---|---|---|---|---|
GIE | MGIA | \(P_H\) | \(R_H\) | \(F_H\) | \(l_{\varDelta }(Y_{aug},\hat{Y}_{aug})\) | \(P_{LCA}\) | \(R_{LCA}\) | \(F_{LCA}\) | |
a | 7 | 0.533(7) | 0.33 | 0.66 | 0.44 | 5 | 0.33 | 0.66 | 0.44 |
b | 10 | 0.33(10) | 0.2 | 0.33 | 0.25 | 6 | 0.2 | 0.33 | 0.25 |
Figure 19b presents a similar example. According to Table 11, the error of MGIA before the proposed transformation of Sect. 2.3.3 increases from 7 to 10, while \(l_{\varDelta }(Y_{aug},\hat{Y}_{aug})\) increases from 5 to 6 and \(F_{LCA}\) decreases from 0.44 to 0.25. This is because the whole path from Drama to Pop is counted twice by the pair-based method. However, double counting seems desirable in this case, as the errors in Fig. 19b are more severe than those in Fig. 19a. While both measures penalize the errors in Fig. 19b more than those in Fig. 19a, the extra penalization of MGIA is roughly proportional to the distance between the true and the predicted category nodes, while that of the set-based measures is not. All set-based measures would give the same result even if Europop were a child of Music, which is a less severe error than when it is a child of Pop. Therefore, counting common paths more than once may be an advantage of the pair-based measures in some cases.
3.8 Summary
Summary table regarding evaluation measures over certain situations
Pair-based | Set-based measures | ||||
---|---|---|---|---|---|
GIE | MGIA | \(F_H\) | \(l_{\varDelta }(Y_{aug},\hat{Y}_{aug})\) | \(F_{LCA}\) | |
Alternative paths | + | + | \(-\) | \(-\) | + |
Over-specialization | + | + | + | + | + |
Under-specialization | + | + | + | + | + |
Pairing problem | \(-\) | + | + | + | + |
Long distance problem | + | + | * | * | * |
Multiple path count | \(-\) | \(-\) | + | + | + |
Summary table regarding evaluation measures over certain problem settings
Single-label | Multi-label | |
---|---|---|
Tree hierarchy | Any pair-based or LCA | LCA or MGIA |
DAG hierarchy | LCA or MGIA | LCA or MGIA |
As a general conclusion, the proposed measures (MGIA and the LCA-based ones) always behave better than, or at least as well as, the existing measures of their type (pair-based and set-based respectively). Therefore, if one wishes to use a pair-based or a set-based measure, we suggest using the ones proposed in this paper instead of the existing ones. Furthermore, in most cases one should prefer the LCA measures over MGIA, due to the multiple counting of paths discussed in Sect. 3.7, which is usually undesirable because it leads to over-penalization. Additionally, these cases could serve as benchmarks for observing the behaviour of newly proposed hierarchical evaluation measures. This concludes the discussion of the behavior of the measures in benchmark cases; in the following section we study them using real data and systems.
4 Empirical study
In this section we apply various evaluation measures to the predictions of the systems that participated in the Large Scale Hierarchical Text Classification Pascal Challenges of 2011 (LSHTC2) and 2012 (LSHTC3). We also present ranking correlations between measures from the first task of the 2013 BioASQ challenge. The goal of this section is to study, using real data and systems, the extent to which the performance ranking of systems is affected by the choice between flat and hierarchical evaluation measures, as well as by the type of hierarchical measure used. In Sect. 3 we demonstrated that, in certain cases, some measures behave more desirably than others. In this section we show that the differences among the measures also affect the rankings of real systems in practice. In the first subsection we present the datasets that we used, in the second we discuss the evaluation measures included in the comparison, and in the final subsection we discuss the results of the study.
4.1 Datasets
In LSHTC2 three different datasets were provided as three separate tasks. Each participant could participate in any or all of them with the same or with a different system. The first dataset (DMOZ) was based on pages crawled from the Open Directory Project (ODP), a human-edited hierarchical directory of the Web.^{4} The hierarchy of this dataset was transformed into a tree and all instances deeper than level five of the hierarchy were transferred to the fifth level, thus leading to a hierarchy with a maximum depth of 5. This dataset was the smallest of the three, regarding the number of categories and instances.
The other two datasets of LSHTC2, also used in LSHTC3, are based on DBpedia.^{5} They are called DBpedia Large and DBpedia Small respectively. The larger of the two, DBpedia Large, contains almost all abstracts of DBpedia as instances to be used for training and classification, with the exception of some non-English abstracts. This dataset therefore comprises many more categories than DMOZ and reaches a greater depth. DBpedia Small is a subset of DBpedia Large, selected so as to obtain a dataset of similar size to DMOZ, while maximizing the ratio of instances per node. This process resulted in a much easier classification task. The hierarchy of the DBpedia Small dataset was transformed into a DAG by removing cycles, while cycles still appear in DBpedia Large.
All three datasets were pre-processed in the same way. All the words of the abstracts were stemmed and each stem was mapped to a feature id. The categories (classes) were also mapped to category ids. Each instance was represented in sparse vector format as a collection of category ids and a collection of feature ids accompanied by their frequencies in the instance. The mapping between ids, categories and stems was different for each dataset. Only the leaves of each hierarchy were used as valid classification nodes in LSHTC2, while in LSHTC3 participants were also allowed to classify instances to inner nodes. However, for each inner node of the hierarchy that was assigned instances, a dummy leaf was created as a direct child for evaluation purposes, and all the node's instances were transferred to that child.
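As an illustration, a parser for such a sparse representation might look as follows. The exact field layout assumed here (a comma-separated list of category ids, followed by space-separated `feature:frequency` pairs, in the style of multi-label LibSVM files) is our assumption for the sketch, not the documented LSHTC format:

```python
def parse_instance(line):
    """Parse one 'cat,cat feat:freq feat:freq ...' line (assumed layout)."""
    head, *pairs = line.split()
    categories = [int(c) for c in head.split(",")]
    features = {}
    for pair in pairs:
        feat, freq = pair.split(":")
        features[int(feat)] = int(freq)
    return categories, features

cats, feats = parse_instance("314,52 7:3 19:1 4821:2")
# cats = [314, 52]; feats = {7: 3, 19: 1, 4821: 2}
```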
Basic statistics of datasets showing the number of categories, the number of training and test instances, the average number of true categories per instance (multi-label factor), the ratio of training instances to categories, the ratio of multi-labeled training instances to categories and the maximum depth of the hierarchy
DMOZ | DBpedia small | DBpedia large | |
---|---|---|---|
Categories | 27,875 | 36,504 | 325,056 |
Training instances | 394,756 | 456,886 | 2,365,436 |
Test instances | 104,263 | 81,262 | 452,167 |
Multi-label factor | 1.0239 | 1.8596 | 3.2614 |
Training inst. per cat. | 14.16 | 12.5 | 7.2 |
Multi-label train. inst. per cat. | 14.5 | 23.27 | 23.73 |
Max depth | 5 | 10 | 14 |
We also present system rankings from the results of the first task of the 2013 BioASQ Challenge.^{6} The training data of this task consists of 800,000 PubMed biomedical journal abstracts belonging to 25,000 classes of the MeSH hierarchy.^{7} The participants were asked to classify new PubMed documents as they became available online, before they were manually annotated with MeSH headings by PubMed curators. Our proposed \(F_{LCA}\) was the hierarchical measure used in this task to decide the winning systems.
4.2 Evaluation measures and statistical tests
The evaluation measures used in this study are the ones presented in Sect. 3. Accuracy and GIE are reproduced here as reported during the challenges. Using these evaluation measures, different rankings of the participating systems are created. In order to measure the correlation between these rankings, we used Kendall's rank correlation (Kendall 1938), which is defined in terms of the following quantities:
\(a_i\) is the performance of system \(a\) for instance \(i\), according to an evaluation measure,
\(b_i\) is the performance of system \(b\) for instance \(i\), according to the same evaluation measure,
\(n\) is the number of times that \(a_i\) and \(b_i\) differ over all \(i\),
\(k\) is the number of times that \(a_i\) performs better than \(b_i\) over all \(i\),
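Given the per-measure scores of the competing systems, the rank correlations reported in the tables of this section can be reproduced with a direct implementation of Kendall's \(\tau\). This is a sketch of ours without tie handling, counting concordant and discordant pairs:

```python
def kendall_tau(x, y):
    """Kendall's rank correlation between two paired score lists (no ties)."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Two hypothetical measures ranking four systems, with one swapped pair:
tau = kendall_tau([1, 2, 3, 4], [1, 2, 4, 3])
# one discordant pair out of six gives tau = 2/3
```

Identical orderings yield \(\tau = 1\) and fully reversed orderings \(\tau = -1\), as in the correlation tables below.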
4.3 Results
DMOZ results in LSHTC2
System | Acc | GIE | \(F_H\) | \(l_{\varDelta }(Y_{aug},\hat{Y}_{aug})\) | MGIA | \(F_{LCA}\) |
---|---|---|---|---|---|---|
A | 0.388 (1) | 2.829 (2) | 0.653 (3) | 3.962 (2) | 0.541 (2) | 0.557 (2) |
B | 0.387 (2) | 2.823 (1) | 0.660 (1) | 3.910 (1) | 0.550 (1) | 0.559 (1) |
C | 0.386 (3) | 2.831 (3) | 0.654 (2) | 3.987 (3) | 0.542 (1) | 0.555 (3) |
D | 0.380 (4) | 3.322 (6) | 0.642 (5) | 4.257 (4) | 0.515 (4) | 0.544 (5) |
E | 0.378 (5) | 3.832 (10) | 0.640 (6) | 4.458 (7) | 0.501 (5) | 0.538 (6) |
F | 0.371 (6) | 2.891 (4) | 0.652 (4) | 3.996 (4) | 0.538 (3) | 0.547 (4) |
G | 0.347 (7) | 3.027 (5) | 0.622 (7) | 4.335 (6) | 0.497 (6) | 0.522 (7) |
H | 0.284 (8) | 3.456 (7) | 0.497 (11) | 5.878 (11) | 0.364 (8) | 0.440 (9) |
I | 0.269 (9) | 3.503 (9) | 0.571 (8) | 4.987 (9) | 0.421 (7) | 0.460 (8) |
J | 0.262 (10) | 3.476 (8) | 0.570 (8) | 4.966 (8) | 0.428 (7) | 0.458 (8) |
K | 0.172 (11) | 3.898 (11) | 0.469 (12) | 6.165 (12) | 0.318 (10) | 0.373 (11) |
L | 0.155 (12) | 4.010 (12) | 0.446 (13) | 6.430 (13) | 0.282 (11) | 0.353 (12) |
M | 0.153 (13) | 4.024 (13) | 0.497 (10) | 5.803 (10) | 0.333 (9) | 0.374 (10) |
N | 0.107 (14) | 4.289 (14) | 0.384 (14) | 7.080 (14) | 0.202 (12) | 0.306 (13) |
O | 0.087 (15) | 4.419 (15) | 0.340 (15) | 7.744 (15) | 0.175 (13) | 0.280 (14) |
Kendall’s rank correlation on the evaluation measure rankings of DMOZ in LSHTC2
Acc | GIE | \(F_H\) | \(l_{\varDelta }(Y_{aug},\hat{Y}_{aug})\) | MGIA | \(F_{LCA}\) | |
---|---|---|---|---|---|---|
Acc | 1 | |||||
GIE | 0.829 | 1 | ||||
\(F_H\) | 0.842 | 0.785 | 1 | |||
\(l_{\varDelta }(Y_{aug},\hat{Y}_{aug})\) | 0.790 | 0.810 | 0.919 | 1 | ||
MGIA | 0.829 | 0.810 | 0.976 | 0.924 | 1 | |
\(F_{LCA}\) | 0.867 | 0.810 | 0.976 | 0.924 | 0.962 | 1 |
The first observation is that the ranking of the flat accuracy is different from that of the other hierarchical measures. This shows that flat and hierarchical measures treat the problem differently. Another interesting observation is that the rankings also differ between hierarchical measures.
The handling of multiple labels per instance is an important aspect of the classification methods. Table 17 presents the average number of predictions per instance for each system. Since most instances of the dataset are single-labeled, most of the participants treated the task as a single-label one. As discussed in previous sections, the treatment of multi-labeling by the different measures greatly affects their behavior, but since multi-labeling is rare in this dataset, this decision did not affect the hierarchical measures much. However, there are some systems, such as M and J, which assign multiple labels and also perform better according to the hierarchical measures than according to accuracy. D and E, on the other hand, perform worse according to some hierarchical measures than according to accuracy. The more multi-labeled the decisions, the greater the opportunity for a hierarchical measure to reward or penalize the systems for them.
Average number of predictions per instance of DMOZ systems in LSHTC2
A | B | C | D | E | F | G | H | I | J | K | L | M | N | O |
1 | 1 | 1 | 1.11 | 1.22 | 1 | 1 | 1 | 1.02 | 1.01 | 1 | 1 | 1.02 | 1 | 1 |
DBpedia Small results in LSHTC2
System | Acc | GIE | \(F_H\) | \(l_{\varDelta }(Y_{aug},\hat{Y}_{aug})\) | MGIA | \(F_{LCA}\) |
---|---|---|---|---|---|---|
A2 | 0.374 (1) | 4.171 (5) | 0.647 (1) | 12.114 (6) | 0.356 (1) | 0.481 (1) |
B2 | 0.362 (2) | 4.364 (6) | 0.641 (3) | 12.000 (4) | 0.337 (2) | 0.470 (2) |
C2 | 0.354 (3) | 4.076 (4) | 0.646 (2) | 11.651(3) | 0.323 (4) | 0.463 (3) |
D2 | 0.351 (4) | 3.858 (2) | 0.629 (4) | 11.332 (2) | 0.329 (3) | 0.462 (4) |
E2 | 0.279 (5) | 3.726 (1) | 0.600 (5) | 11.326 (1) | 0.286 (5) | 0.414 (5) |
F2 | 0.252 (6) | 3.859 (3) | 0.579 (6) | 11.996 (5) | 0.280 (6) | 0.399 (6) |
G2 | 0.249 (7) | 5.701 (7) | 0.561 (7) | 16.915 (7) | 0.245 (7) | 0.381 (7) |
Kendall’s rank correlation on the evaluation measure rankings of DBpedia small in LSHTC2
Acc | GIE | \(F_H\) | \(l_{\varDelta }(Y_{aug},\hat{Y}_{aug})\) | MGIA | \(F_{LCA}\) | |
---|---|---|---|---|---|---|
Acc | 1 | |||||
GIE | \(-\)0.143 | 1 | ||||
\(F_H\) | 0.905 | \(-\)0.048 | 1 | |||
\(l_{\varDelta }(Y_{aug},\hat{Y}_{aug})\) | \(-\)0.143 | 0.810 | \(-\)0.048 | 1 | ||
MGIA | 0.810 | \(-\)0.143 | 0.714 | \(-\)0.143 | 1 | |
\(F_{LCA}\) | 0.905 | \(-\)0.238 | 0.810 | \(-\)0.238 | 0.905 | 1 |
Average predictions per instance of DBpedia small systems in LSHTC2
A2 | B2 | C2 | D2 | E2 | F2 | G2 |
2.04 | 1.94 | 1.82 | 1.51 | 1.11 | 1.14 | 2.84 |
Tables 21 and 22 present the results on the same dataset (DBpedia Small), but with the systems of LSHTC3. The number of participating systems is much larger in LSHTC3 than in LSHTC2 (17 instead of 7). This is important not only for statistical reasons (more experiments lead to safer conclusions), but also because, according to Table 23, we now have more systems with a higher average number of predictions per instance, which affects the behavior of the measures. Another important difference is that in LSHTC3 systems were allowed to classify instances to inner nodes, even if these nodes did not have any training instances directly assigned to them. \(F_H\) and \(F_{LCA}\) are the hierarchical measures most correlated with flat accuracy, although the correlation is much lower in this case, where there are many more systems and inner-node classification is treated as a mistake by accuracy. The correlation between GIE and MGIA is much higher than in LSHTC2, but they are not fully correlated. A high correlation also continues to be observed between GIE and \(l_{\varDelta }(Y_{aug},\hat{Y}_{aug})\), for the reason explained above for LSHTC2.
DBpedia Small results in LSHTC3
System | Acc | GIE | \(F_H\) | \(l_{\varDelta }(Y_{aug},\hat{Y}_{aug})\) | MGIA | \(F_{LCA}\) |
---|---|---|---|---|---|---|
H2 | 0.438 (1) | 3.060 (1) | 0.709 (1) | 9.096 (1) | 0.421 (2) | 0.543 (1) |
I2 | 0.429 (2) | 3.155 (2) | 0.689 (3) | 9.310 (2) | 0.398 (4) | 0.525 (4) |
J2 | 0.42 (3) | 3.530 (5) | 0.692 (2) | 10.143 (6) | 0.403 (3) | 0.529 (2) |
K2 | 0.417 (4) | 4.428 (11) | 0.677 (5) | 11.385 (11) | 0.378 (6) | 0.509 (5) |
L2 | 0.408 (5) | 3.187 (3) | 0.680 (4) | 9.561 (3) | 0.443 (1) | 0.527 (3) |
M2 | 0.385 (6) | 3.319 (4) | 0.666 (7) | 10.122 (5) | 0.390 (4) | 0.500 (6) |
N2 | 0.371 (7) | 4.991 (13) | 0.645 (8) | 13.117 (13) | 0.342 (8) | 0.476 (8) |
O2 | 0.357 (8) | 4.302 (10) | 0.643 (8) | 12.185 (12) | 0.323 (9) | 0.462 (10) |
P2 | 0.354 (9) | 3.550 (6) | 0.633 (9) | 11.146 (8) | 0.381 (5) | 0.478 (7) |
Q2 | 0.327 (10) | 3.600 (8) | 0.639 (8) | 10.944 (7) | 0.312 (11) | 0.450 (12) |
R2 | 0.32 (11) | 3.552 (7) | 0.603 (10) | 11.365 (10) | 0.361 (7) | 0.453 (11) |
S2 | 0.298 (12) | 5.693 (14) | 0.549 (13) | 16.873 (15) | 0.243 (13) | 0.407 (13) |
T2 | 0.25 (13) | 3.741 (9) | 0.592 (11) | 11.304 (9) | 0.089 (14) | 0.397 (14) |
U2 | 0.249 (14) | 5.701 (15) | 0.561 (12) | 16.915 (16) | 0.245 (12) | 0.381 (15) |
V2 | 0.245 (15) | 4.780 (12) | 0.537 (14) | 14.351 (14) | 0.234 (13) | 0.374 (16) |
W2 | 0.063 (16) | 9.139 (16) | 0.345 (15) | 24.009 (17) | 0.045 (15) | 0.208 (17) |
X2 | 0.047 (17) | 25.775 (17) | 0.668 (6) | 9.607 (4) | 0.321 (10) | 0.471 (9) |
Kendall’s rank correlation on the evaluation measure rankings of DBpedia Small in LSHTC3
Acc | GIE | \(F_H\) | \(l_{\varDelta }(Y_{aug},\hat{Y}_{aug})\) | MGIA | \(F_{LCA}\) | |
---|---|---|---|---|---|---|
Acc | 1 | |||||
GIE | 0.662 | 1 | ||||
\(F_H\) | 0.765 | 0.485 | 1 | |||
\(l_{\varDelta }(Y_{aug},\hat{Y}_{aug})\) | 0.485 | 0.735 | 0.662 | 1 | ||
MGIA | 0.691 | 0.618 | 0.721 | 0.588 | 1 | |
\(F_{LCA}\) | 0.794 | 0.574 | 0.853 | 0.603 | 0.838 | 1 |
Average predictions per instance on DBpedia Small in LSHTC3
H2 | I2 | J2 | K2 | L2 | M2 | N2 | O2 | P2 |
---|---|---|---|---|---|---|---|---|
1.506 | 1.482 | 1.909 | 2.208 | 1.415 | 1.490 | 2.427 | 1.889 | 1.414 |
Q2 | R2 | S2 | T2 | U2 | V2 | W2 | X2 | |
1.529 | 1.184 | 2.423 | 2.000 | 2.841 | 1.712 | 4.334 | 10.649 |
On the third dataset (DBpedia Large) we faced some computational issues with the hierarchical measures. The problem originated from the very large scale of the dataset's hierarchy, which is a DAG (in reality it contains cycles, but we removed them). To avoid the computational problems, we ran the evaluation measures with maximum path thresholds of 2 and 4. This means that all nodes are forced to have a lowest common ancestor at a depth of 1 or 2 respectively (if they do not have one, we create a dummy one). Although this seems restrictive, it is very similar to the idea behind the Long Distance problem of Fig. 2e discussed in Sect. 2, where dummy nodes were used to link nodes that were further than a threshold from each other, in order to avoid over-penalization. The same dummy nodes are used here for computational reasons.
DBpedia Large results with a maximum path threshold of 4 in LSHTC2
System | Acc | GIE | \(F_H\) | \(l_{\varDelta }(Y_{aug},\hat{Y}_{aug})\) | MGIA | \(F_{LCA}\) |
---|---|---|---|---|---|---|
A3 | 0.347 (1) | 4.647 (4) | 0.538 (1) | 42.470 (3) | 0.319 (1) | 0.44 (1) |
B3 | 0.337 (2) | 4.392 (2) | 0.511 (2) | 40.811 (1) | 0.315 (2) | 0.437 (2) |
C3 | 0.283 (3) | 6.178 (5) | 0.440 (4) | 50.709 (5) | 0.253 (4) | 0.39 (4) |
D3 | 0.272 (4) | 4.288 (1) | 0.483 (3) | 42.430 (2) | 0.294 (3) | 0.388 (3) |
E3 | 0.177 (5) | 4.535 (3) | 0.314 (5) | 47.957 (4) | 0.212 (5) | 0.331 (5) |
Kendall’s rank correlation on the evaluation measure rankings with a maximum path threshold of 4 on DBpedia Large in LSHTC2
Acc | GIE | \(F_H\) | \(l_{\varDelta }(Y_{aug},\hat{Y}_{aug})\) | MGIA | \(F_{LCA}\) | |
---|---|---|---|---|---|---|
Acc | 1 | |||||
GIE | 0 | 1 | ||||
\(F_H\) | 0.8 | 0.2 | 1 | |||
\(l_{\varDelta }(Y_{aug},\hat{Y}_{aug})\) | 0.2 | 0.8 | 0.4 | 1 | ||
MGIA | 0.8 | 0.2 | 1 | 0.4 | 1 | |
\(F_{LCA}\) | 0.8 | 0.2 | 1 | 0.4 | 1 | 1 |
Average predictions per instance on DBpedia Large systems in LSHTC2
A3 | B3 | C3 | D3 | E3 |
3.15 | 2.69 | 3.62 | 2.81 | 1.27 |
DBpedia Large results with a maximum path threshold of 2 in LSHTC2
Acc | GIE | \(F_H\) | \(l_{\varDelta }(Y_{aug},\hat{Y}_{aug})\) | MGIA | \(F_{LCA}\) | |
---|---|---|---|---|---|---|
A3 | 0.347 (1) | 4.647 (4) | 0.503 (1) | 22.006 (3) | 0.206 (2) | 0.460 (2) |
B3 | 0.337 (2) | 4.392 (2) | 0.475 (2) | 21.028 (1) | 0.207 (1) | 0.461 (1) |
C3 | 0.283 (3) | 6.178 (5) | 0.412 (4) | 25.465 (5) | 0.163 (3) | 0.416 (3) |
D3 | 0.272 (4) | 4.288 (1) | 0.439 (3) | 21.702 (2) | 0.155 (4) | 0.405 (4) |
E3 | 0.177 (5) | 4.535 (3) | 0.282 (5) | 24.146 (4) | 0.134 (5) | 0.360 (5) |
Kendall’s rank correlation on the evaluation measure rankings with a maximum path threshold of 2 on DBpedia Large in LSHTC2
Acc | GIE | \(F_H\) | \(l_{\varDelta }(Y_{aug},\hat{Y}_{aug})\) | MGIA | \(F_{LCA}\) | |
---|---|---|---|---|---|---|
Acc | 1 | |||||
GIE | 0 | 1 | ||||
\(F_H\) | 0.8 | 0.2 | 1 | |||
\(l_{\varDelta }(Y_{aug},\hat{Y}_{aug})\) | 0.2 | 0.8 | 0.4 | 1 | ||
MGIA | 0.8 | 0.2 | 0.6 | 0.4 | 1 | |
\(F_{LCA}\) | 0.8 | 0.2 | 0.6 | 0.4 | 1 | 1 |
Another interesting observation is the disagreement between GIE and MGIA about systems C3 and E3. As shown in Table 26, E3 predicts fewer categories per instance than C3. Since the predicted categories (labels) are usually fewer than the true ones and GIE over-penalizes all the unmatched true categories, it is natural for GIE to penalize system C3 more than E3. This problem is fixed by MGIA, which allows multi-pairing, and this is why it ranks C3 as a better system than E3.
We can also see that this difficult hierarchy affects the performance of \(l_{\varDelta }(Y_{aug},\hat{Y}_{aug})\), whose rankings become less correlated with \(F_{LCA}\) than in the previous datasets. A more interesting observation is that \(F_H\), MGIA and \(F_{LCA}\) are completely correlated with each other (Table 25) and not correlated with the average number of predictions per instance (Table 26).
Tables 29 and 30 present the results on the same dataset (DBpedia Large), but with the systems of LSHTC3. The main difference is that in LSHTC3 systems were allowed to classify instances to inner nodes. Table 31 shows that the average number of predictions per instance is similar to that of LSHTC2. The most interesting observation is that GIE and \(l_{\varDelta }(Y_{aug},\hat{Y}_{aug})\) rank F3 as one of the worst systems, while all the other hierarchical measures and accuracy rank it high. It is even more interesting that, according to Table 31, system F3 provides the most labels per instance. As mentioned before, the assignment of many labels is penalized heavily by measures based only on FP and FN (GIE and \(l_{\varDelta }(Y_{aug},\hat{Y}_{aug})\)).
Another interesting observation is that MGIA and \(F_{LCA}\) are no longer fully correlated. In fact, MGIA is more correlated with \(F_H\) than with \(F_{LCA}\). As the hierarchy becomes more complicated and the results more multi-labeled, our two proposed measures behave more differently. Finally, system G3, which predicts the smallest number of labels per instance, is one of the best systems according to all measures except MGIA, which ranks it second worst. This is because, although G3 has the highest \(P_H\) and \(P_{LCA}\), it also has a very low \(R_H\) and \(R_{LCA}\) compared to the other systems. The computation of \(F_1\) seems more suitable in this case than the transformation that we proposed for MGIA, one more reason to prefer \(F_{LCA}\) over MGIA.
DBpedia Large results with a maximum path threshold of 4 in LSHTC3
System | Acc | GIE | \(F_H\) | \(l_{\varDelta }(Y_{aug},\hat{Y}_{aug})\) | MGIA | \(F_{LCA}\) |
---|---|---|---|---|---|---|
F3 | 0.381 (1) | 6.104 (6) | 0.557 (1) | 42.790 (5) | 0.374 (2) | 0.465 (1) |
G3 | 0.346 (2) | 3.756 (1) | 0.513 (3) | 33.630 (1) | 0.309 (5) | 0.456 (2) |
H3 | 0.340 (3) | 3.763 (2) | 0.508 (4) | 38.137 (2) | 0.35 (3) | 0.45 (3) |
I3 | 0.333 (4) | 3.763 (3) | 0.507 (5) | 44.435 (6) | 0.31 (4) | 0.43 (5) |
J3 | 0.332 (5) | 4.216 (4) | 0.517 (2) | 40.560 (3) | 0.381 (1) | 0.449 (4) |
K3 | 0.272 (6) | 4.288 (5) | 0.483 (6) | 42.430 (4) | 0.294 (6) | 0.388 (6) |
Kendall’s rank correlation on the evaluation measure rankings with a maximum path threshold of 4 on DBpedia Large in LSHTC3
 | Acc | GIE | \(F_H\) | \(l_{\varDelta }(Y_{aug},\hat{Y}_{aug})\) | MGIA | \(F_{LCA}\) |
---|---|---|---|---|---|---|
Acc | 1 | |||||
GIE | 0.276 | 1 | ||||
\(F_H\) | 0.6 | 0.138 | 1 | |||
\(l_{\varDelta }(Y_{aug},\hat{Y}_{aug})\) | 0.2 | 0.552 | 0.067 | 1 | ||
MGIA | 0.2 | \(-\)0.276 | 0.6 | \(-\)0.067 | 1 | |
\(F_{LCA}\) | 0.828 | 0.071 | 0.828 | 0.276 | 0.414 | 1 |
Average predictions per instance of DBpedia Large systems in LSHTC3
F3 | G3 | H3 | I3 | J3 | K3 |
---|---|---|---|---|---|
3.949 | 1.482 | 2.315 | 2.903 | 2.902 | 2.810 |
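The correlations reported above are Kendall's rank correlation over the system rankings induced by each measure. As a minimal sketch of how such a coefficient is computed (pure Python; the ranking dictionaries below are taken from the Accuracy and \(F_{LCA}\) columns of the results table, but the paper's reported coefficients may use a tie-handling variant, so the value need not match the table exactly):

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's rank correlation between two rankings of the same systems.

    rank_a, rank_b: dicts mapping system name -> rank (1 = best), no ties.
    Returns tau in [-1, 1]; 1 means identical orderings, -1 fully reversed.
    """
    systems = list(rank_a)
    concordant = discordant = 0
    for x, y in combinations(systems, 2):
        # A pair is concordant if both measures order x and y the same way.
        if (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    n = len(systems)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Rankings of the six LSHTC3 systems by Accuracy and by F_LCA (from the table above).
acc  = {"F3": 1, "G3": 2, "H3": 3, "I3": 4, "J3": 5, "K3": 6}
flca = {"F3": 1, "G3": 2, "H3": 3, "I3": 5, "J3": 4, "K3": 6}
print(kendall_tau(acc, flca))  # only the I3/J3 pair disagrees: tau = 13/15
```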
Tables 32, 33 and 34 present the Kendall’s rank correlations of the three batches of the 2013 BioASQ challenge. These tables contain results for two flat and two hierarchical evaluation measures. The flat measures are accuracy and Micro-\(F_1\). The hierarchical measures are our proposed \(F_{LCA}\) and \(F_H\). \(F_{LCA}\) and Micro-\(F_1\) were used in the challenge in order to select the winning systems.
Kendall’s rank correlation on the evaluation measure rankings with a maximum path threshold of 8 on batch 1 of the BioASQ challenge (Task 1)
 | Acc | \(F_{LCA}\) | Micro-\(F_1\) | \(F_H\) |
---|---|---|---|---|
Acc | 1 | |||
\(F_{LCA}\) | 0.91 | 1 | ||
Micro-\(F_1\) | 0.95 | 0.88 | 1 | |
\(F_H\) | 0.89 | 0.90 | 0.87 | 1 |
Kendall’s rank correlation on the evaluation measure rankings with a maximum path threshold of 8 on batch 2 of the BioASQ challenge (Task 1)
 | Acc | \(F_{LCA}\) | Micro-\(F_1\) | \(F_H\) |
---|---|---|---|---|
Acc | 1 | |||
\(F_{LCA}\) | 0.89 | 1 | ||
Micro-\(F_1\) | 0.98 | 0.91 | 1 | |
\(F_H\) | 0.94 | 0.92 | 0.94 | 1 |
Kendall’s rank correlation on the evaluation measure rankings with a maximum path threshold of 8 on batch 3 of the BioASQ challenge (Task 1)
 | Acc | \(F_{LCA}\) | Micro-\(F_1\) | \(F_H\) |
---|---|---|---|---|
Acc | 1 | |||
\(F_{LCA}\) | 0.90 | 1 | ||
Micro-\(F_1\) | 0.99 | 0.90 | 1 | |
\(F_H\) | 0.92 | 0.92 | 0.90 | 1 |
The experiments presented in this section illustrated, using real systems and datasets, that hierarchical measures treat competing systems differently than flat measures do. This was shown by the differences in the rankings of the systems across the four datasets. Flat evaluation measures, which are commonly used, often give a false indication of which system performs better, because they ignore the hierarchical dependencies between classes and treat all errors equally. As a result, their use steers research away from methods that incorporate the hierarchy in the classification process. We also showed that different variants of hierarchical measures produce different rankings under different conditions. The goal was not to choose the best measure, but to show that different hierarchical evaluation measures give different results, not only in absolute values but also in the ranking of the systems. Finally, we showed that the scale of the task is also an issue that requires attention.
5 Conclusions
In this work we studied the problem of evaluating the performance of hierarchical classification methods. Specifically, we abstracted and presented the key components of existing performance measures and proposed a grouping of the measures into (a) pair-based and (b) set-based. Measures in the former group attempt to match each predicted class to a true class and measure the distance between them. By contrast, set-based measures use the hierarchical relations to augment the sets of predicted and true labels, and then apply set operations, such as symmetric difference and intersection, to the augmented label sets.
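The set-based idea can be made concrete with a minimal sketch. The node names and `parent` map below are illustrative, and the function implements the generic ancestor-augmentation scheme in the style of \(F_H\), not the exact \(F_{LCA}\) computation:

```python
def ancestors(node, parent):
    """Strict ancestors of node, following child-to-parent links to the root."""
    result = set()
    while node in parent:
        node = parent[node]
        result.add(node)
    return result

def hierarchical_f1(true_labels, pred_labels, parent, root="root"):
    """Set-based hierarchical F1: augment both label sets with all their
    ancestors (excluding the root), then compute precision and recall
    on the augmented sets."""
    def augment(labels):
        aug = set(labels)
        for c in labels:
            aug |= ancestors(c, parent)
        aug.discard(root)  # the root carries no information, so drop it
        return aug
    t_aug, p_aug = augment(true_labels), augment(pred_labels)
    overlap = len(t_aug & p_aug)
    p = overlap / len(p_aug)
    r = overlap / len(t_aug)
    return 2 * p * r / (p + r) if p + r else 0.0

# Toy hierarchy: root -> {A, B}, A -> {A1, A2}
parent = {"A": "root", "B": "root", "A1": "A", "A2": "A"}
# Predicting the sibling A2 instead of the true A1 still shares ancestor A,
# so the hierarchical measure gives partial credit where flat F1 gives 0.
print(hierarchical_f1({"A1"}, {"A2"}, parent))  # 0.5
```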
In order to model pair-based measures, we introduced a novel generic framework based on flow networks, while for set-based measures we provided a framework based on set operations. In this way, the salient features of these measures were highlighted and presented under a common formalism.
Another contribution of this paper was the proposal of two new measures (one for each group) that address several deficiencies of existing measures. The proposed measures, along with existing ones, were assessed in two ways. First, we applied them to selected cases in order to demonstrate their pros and cons. Second, we studied them empirically on four large datasets based on DMOZ, DBpedia and biomedical data (BioASQ) with different characteristics (single-label, multi-label, tree and DAG hierarchies). The analysis of the results showed that the hierarchical measures behave differently, especially on multi-label data and DAG hierarchies. The two proposed measures also exhibited more robust behavior than their counterparts. Finally, the results supported our initial premise that flat measures are not adequate for evaluating hierarchical classification systems.
Our analysis showed that, although in certain rare cases pair-based measures may behave more desirably, in most cases the set-based measure proposed in this paper (\(F_{LCA}\)) exhibits more desirable behavior than our proposed pair-based measure (MGIA), since it is in effect a hybrid measure, owing to the pairings it uses to select LCAs. This is why we propose the use of \(F_{LCA}\) over all other hierarchical measures, although devising a measure that combines all the advantages of our proposed MGIA and \(F_{LCA}\) measures remains an open issue.
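For completeness, the lowest-common-ancestor operation at the heart of \(F_{LCA}\) can be sketched for a tree stored as child-to-parent links. The node names are a toy example, and the full measure additionally pairs predicted with true labels before selecting LCAs:

```python
def lca(u, v, parent):
    """Lowest common ancestor of u and v in a tree given parent links.

    Collect all ancestors of u (including u itself), then walk up
    from v until we reach the first node in that set."""
    ancestors_u = {u}
    while u in parent:
        u = parent[u]
        ancestors_u.add(u)
    while v not in ancestors_u:
        v = parent[v]
    return v

# Toy hierarchy: root -> {A, B}, A -> {A1, A2}
parent = {"A": "root", "B": "root", "A1": "A", "A2": "A"}
print(lca("A1", "A2", parent))  # "A": sibling errors meet at their parent
print(lca("A1", "B", parent))   # "root": errors across subtrees meet higher up
```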
Footnotes
- 1.Without loss of generality, we assume a subclass-of relationship among the classes, but in some cases a different relationship may hold, for example part-of. We assume, however, that the three properties always hold for the relationship.
- 3. The tool is available from http://nlp.cs.aueb.gr/software_and_datasets/HEMKit.zip.
References
- Aho AV, Hopcroft JE, Ullman JD (1973) On finding lowest common ancestors in trees. In: Proceedings of the 5th ACM symposium on theory of computing (STOC), pp 253–265
- Ahuja RK, Magnanti TL, Orlin JB (1993) Network flows: theory, algorithms, and applications. Prentice Hall, Upper Saddle River
- Blockeel H, Bruynooghe M, Dzeroski S, Ramon J, Struyf J (2002) Hierarchical multi-classification. In: ACM SIGKDD 2002 workshop on multi-relational data mining, pp 21–35
- Brucker F, Benites F, Sapozhnikova E (2011) An empirical comparison of flat and hierarchical performance measures for multi-label classification with hierarchy extraction. In: Proceedings of the 15th international conference on knowledge-based and intelligent information and engineering systems, part I, pp 579–589
- Cai L, Hofmann T (2007) Exploiting known taxonomies in learning overlapping concepts. In: International joint conference on artificial intelligence, pp 714–719
- Cesa-Bianchi N, Gentile C, Zaniboni L (2006) Incremental algorithms for hierarchical classification. J Mach Learn Res 7:31–54
- Costa EP, Lorena AC, Carvalho, Freitas AA (2007) A review of performance evaluation measures for hierarchical classifiers. In: 2007 AAAI workshop, Vancouver
- Dekel O, Keshet J, Singer Y (2004) Large margin hierarchical classification. In: Proceedings of the twenty-first international conference on machine learning, pp 209–216
- Holden N, Freitas AA (2006) Hierarchical classification of G-protein-coupled receptors with a PSO/ACO algorithm. In: IEEE swarm intelligence symposium (SIS-06), pp 77–84
- Ipeirotis PG, Gravano L, Sahami M (2001) Probe, count, and classify: categorizing hidden web databases. In: ACM SIGMOD international conference on management of data, SIGMOD '01, pp 67–78
- Kendall MG (1938) A new measure of rank correlation. Biometrika 30:81–93
- Kiritchenko S, Matwin S, Fazel FA (2005) Functional annotation of genes using hierarchical text categorization. In: ACL workshop on linking biological literature, ontologies and databases: mining biological semantics
- Koller D, Sahami M (1997) Hierarchically classifying documents using very few words. In: Proceedings of the 14th international conference on machine learning (ICML)
- Kosmopoulos A, Gaussier E, Paliouras G (2010) The ECIR 2010 large scale hierarchical classification workshop. SIGIR Forum 44:23–32
- McCallum A, Rosenfeld R (1998) Improving text classification by shrinkage in a hierarchy of classes. In: ICML 1998, pp 359–367
- Nowak S, Lukashevich H, Dunker P, Rüger S (2010) Performance measures for multilabel evaluation: a case study in the area of image classification. In: Proceedings of the international conference on multimedia information retrieval, pp 35–44
- Silla CN Jr, Freitas AA (2011) A survey of hierarchical classification across different application domains. Data Min Knowl Discov 22:31–72
- Sokolova M, Lapalme G (2009) A systematic analysis of performance measures for classification tasks. Inf Process Manag 45:427–437
- Struyf J, Dzeroski S, Blockeel H, Clare A (2005) Hierarchical multi-classification with predictive clustering trees in functional genomics. In: Bento C, Cardoso A, Dias G (eds) Progress in artificial intelligence. Lecture notes in computer science, vol 3808, pp 272–283
- Sun A, Lim E-P (2001) Hierarchical text classification and evaluation. In: IEEE international conference on data mining, pp 521–528
- Sun A, Lim E-P, Ng W-K (2003) Performance measurement framework for hierarchical text classification. J Am Soc Inf Sci Technol 54:1014–1028
- Wilcoxon F (1945) Individual comparisons by ranking methods. Biom Bull 1(6):80–83
- Xiao L, Zhou D, Wu M (2011) Hierarchical classification via orthogonal transfer. In: Proceedings of the 28th international conference on machine learning (ICML-11), pp 801–808
- Yang Y, Liu X (1999) A re-examination of text categorization methods. In: Proceedings of the 22nd annual international ACM SIGIR conference on research and development in information retrieval, pp 42–49