
Evaluation of Different Data-Derived Label Hierarchies in Multi-label Classification

Conference paper, in: New Frontiers in Mining Complex Patterns (NFMCP 2014)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8983)

Abstract

Motivated by a growing number of new applications, the research community is devoting increasing attention to the task of multi-label classification (MLC), and many different approaches to solving MLC problems have recently been developed. Recent empirical studies have comprehensively evaluated many of these approaches on many datasets using different evaluation measures. These studies indicate that the predictive performance and efficiency of the approaches can be improved by using data-derived (artificial) label hierarchies in the learning and prediction phases. In this paper, we compare different clustering algorithms for constructing label hierarchies (in a data-driven manner) for multi-label classification. We consider flat label sets and construct label hierarchies from the label sets that appear in the annotations of the training data by using four different clustering algorithms: balanced \(k\)-means, agglomerative clustering with single and complete linkage, and predictive clustering trees. The hierarchies are then used in conjunction with global hierarchical multi-label classification (HMC) approaches. The results of the statistical and experimental evaluation reveal that data-derived label hierarchies, used in conjunction with global HMC methods, greatly improve the performance of MLC methods. Additionally, multi-branch hierarchies appear much more suitable for the global HMC approaches than binary hierarchies.




Acknowledgements

We would like to acknowledge the support of the European Commission through the project MAESTRA - Learning from Massive, Incompletely annotated, and Structured Data (Grant number ICT-2013-612944), as well as the support of the Faculty of Computer Science and Engineering at the “Ss. Cyril and Methodius” University.

Author information


Corresponding author

Correspondence to Gjorgji Madjarov.


Appendices

A Evaluation Measures

In this section, we present the measures used to evaluate the predictive performance of the compared methods in our experiments. In the definitions below, \(\mathcal {Y}_i\) denotes the set of true labels of example \(\mathbf{x_{i}}\) and \(h(\mathbf{x_{i}})\) denotes the set of labels predicted for the same example. All definitions refer to the multi-label setting.

A.1 Example-Based Measures

Hamming Loss evaluates how many times an example-label pair is misclassified, i.e., a label not belonging to the example is predicted or a label belonging to the example is not predicted. The smaller the value of \(hamming\_loss(h)\), the better the performance; the performance is perfect when \(hamming\_loss(h) = 0\). This metric is defined as:

$$\begin{aligned} hamming\_loss(h)=\frac{1}{N}\sum ^{N}_{i=1}\frac{1}{Q}\left| h(\mathbf{x_{i}})\varDelta \mathcal {Y}_{i}\right| \end{aligned}$$
(1)

where \(\varDelta \) stands for the symmetric difference between two sets, \(N\) is the number of examples and \(Q\) is the total number of possible class labels.
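As an illustration, the following is a minimal Python sketch of Eq. (1). The function name and the encoding of the true and predicted label sets as Python sets of label indices are our own illustrative assumptions, not something prescribed in the paper.

```python
def hamming_loss(true_sets, pred_sets, Q):
    """Eq. (1): mean size of the symmetric difference h(x_i) Δ Y_i, normalised by Q."""
    N = len(true_sets)
    return sum(len(h ^ y) for h, y in zip(pred_sets, true_sets)) / (N * Q)

# Toy check: two of the four example/label pairs are misclassified -> 0.5
print(hamming_loss([{0, 1}], [{0, 2}], Q=4))
```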

Accuracy for a single example \(\mathbf{x_{i}}\) is defined by the Jaccard similarity coefficient between the label sets \(h(\mathbf{x_{i}})\) and \(\mathcal {Y}_i\). Accuracy is averaged across all examples:

$$\begin{aligned} accuracy(h)=\frac{1}{N}\sum ^{N}_{i=1}\frac{\left| h(\mathbf{x_{i}})\bigcap \mathcal {Y}_{i}\right| }{\left| h(\mathbf{x_{i}})\bigcup \mathcal {Y}_{i}\right| } \end{aligned}$$
(2)

Precision is defined as:

$$\begin{aligned} precision(h)=\frac{1}{N}\sum ^{N}_{i=1}\frac{\left| h(\mathbf{x_{i}})\bigcap \mathcal {Y}_{i}\right| }{\left| h(\mathbf{x_{i}})\right| } \end{aligned}$$
(3)

Recall is defined as:

$$\begin{aligned} recall(h)=\frac{1}{N}\sum ^{N}_{i=1}\frac{\left| h(\mathbf{x_{i}})\bigcap \mathcal {Y}_{i}\right| }{\left| \mathcal {Y}_{i}\right| } \end{aligned}$$
(4)

\(F_{1}\) score is the harmonic mean between precision and recall and is defined as:

$$\begin{aligned} F_{1}=\frac{1}{N}\sum ^{N}_{i=1}\frac{2 \times \left| h(\mathbf{x_{i}}) \cap \mathcal {Y}_{i}\right| }{\left| h(\mathbf{x_{i}})\right| + \left| \mathcal {Y}_{i}\right| } \end{aligned}$$
(5)

\(F_{1}\) is an example-based metric and its value is averaged over all examples in the dataset. \(F_{1}\) reaches its best value at 1 and its worst at 0.
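Since Eqs. (2)-(5) differ only in their denominators, a single sketch can compute all four example-based measures. The treatment of empty denominators (a per-example term contributes 0) is our own convention; the paper does not specify this edge case.

```python
def example_based_scores(true_sets, pred_sets):
    """Eqs. (2)-(5): accuracy, precision, recall and F1, averaged over examples."""
    N = len(true_sets)
    acc = prec = rec = f1 = 0.0
    for y, h in zip(true_sets, pred_sets):
        inter = len(h & y)
        acc  += inter / len(h | y) if h | y else 0.0   # Jaccard coefficient
        prec += inter / len(h) if h else 0.0
        rec  += inter / len(y) if y else 0.0
        f1   += 2 * inter / (len(h) + len(y)) if h or y else 0.0
    return acc / N, prec / N, rec / N, f1 / N

print(example_based_scores([{0, 1}, {2}], [{0}, {2, 3}]))
```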

Subset Accuracy or classification accuracy is defined as follows:

$$\begin{aligned} subset\_accuracy(h)=\frac{1}{N}\sum ^{N}_{i=1}I(h(\mathbf{x_{i}})=\mathcal {Y}_{i}) \end{aligned}$$
(6)

where \(I(true) = 1\) and \(I(false) = 0\). This is a very strict evaluation measure as it requires the predicted set of labels to be an exact match of the true set of labels.
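A sketch of Eq. (6) under the same set encoding; exact set equality is all that is required:

```python
def subset_accuracy(true_sets, pred_sets):
    """Eq. (6): fraction of examples whose predicted label set is an exact match."""
    return sum(h == y for h, y in zip(pred_sets, true_sets)) / len(true_sets)

print(subset_accuracy([{0, 1}, {2}], [{0, 1}, {2, 3}]))  # -> 0.5
```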

A.2 Label-Based Measures

Macro Precision (precision averaged across all labels) is defined as:

$$\begin{aligned} macro\_precision=\frac{1}{Q}\sum ^{Q}_{j=1}\frac{tp_{j}}{tp_{j} + fp_{j}} \end{aligned}$$
(7)

where \(tp_{j}\) and \(fp_{j}\) are the numbers of true positives and false positives for the label \(\lambda _{j}\), considered as a binary class.

Macro Recall (recall averaged across all labels) is defined as:

$$\begin{aligned} macro\_recall=\frac{1}{Q}\sum ^{Q}_{j=1}\frac{tp_{j}}{tp_{j} + fn_{j}} \end{aligned}$$
(8)

where \(tp_{j}\) is defined as for macro precision and \(fn_{j}\) is the number of false negatives for the label \(\lambda _{j}\), considered as a binary class.
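The two macro-averaged measures can be sketched by counting \(tp_{j}\), \(fp_{j}\) and \(fn_{j}\) per label and averaging afterwards. Letting a label with an empty denominator contribute 0 to the average is, again, our own convention:

```python
def macro_precision_recall(true_sets, pred_sets, Q):
    """Eqs. (7)-(8): per-label precision and recall, averaged over the Q labels."""
    prec = rec = 0.0
    for j in range(Q):
        tp = sum(j in y and j in h for y, h in zip(true_sets, pred_sets))
        fp = sum(j not in y and j in h for y, h in zip(true_sets, pred_sets))
        fn = sum(j in y and j not in h for y, h in zip(true_sets, pred_sets))
        prec += tp / (tp + fp) if tp + fp else 0.0
        rec  += tp / (tp + fn) if tp + fn else 0.0
    return prec / Q, rec / Q
```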

Macro \(F_{1}\) is the harmonic mean of precision and recall, computed per label and then averaged across all labels. If \(p_{j}\) and \(r_{j}\) are the precision and recall for the label \(\lambda _{j}\), the macro \(F_{1}\) is:

$$\begin{aligned} macro\_F_{1}=\frac{1}{Q}\sum ^{Q}_{j=1}\frac{2\times p_{j} \times r_{j}}{p_{j} + r_{j}} \end{aligned}$$
(9)
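Assuming the per-label precisions \(p_{j}\) and recalls \(r_{j}\) are available as lists (they can be collected from the same tp/fp/fn counts as in the previous sketch), Eq. (9) becomes:

```python
def macro_f1(p, r):
    """Eq. (9): per-label harmonic mean of p_j and r_j, averaged over labels."""
    return sum(2 * pj * rj / (pj + rj) if pj + rj else 0.0
               for pj, rj in zip(p, r)) / len(p)
```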

Micro Precision (precision averaged over all the example/label pairs) is defined as:

$$\begin{aligned} micro\_precision=\frac{\sum ^{Q}_{j=1}{tp_{j}}}{\sum ^{Q}_{j=1}{tp_{j}} + \sum ^{Q}_{j=1}{fp_{j}}} \end{aligned}$$
(10)

where \(tp_{j}\), \(fp_{j}\) are defined as for macro precision.

Micro Recall (recall averaged over all the example/label pairs) is defined as:

$$\begin{aligned} micro\_recall=\frac{\sum ^{Q}_{j=1}{tp_{j}}}{\sum ^{Q}_{j=1}{tp_{j}} + \sum ^{Q}_{j=1}{fn_{j}}} \end{aligned}$$
(11)

where \(tp_{j}\) and \(fn_{j}\) are defined as for macro recall.

Micro \(F_{1}\) is the harmonic mean between micro precision and micro recall. Micro \(F_{1}\) is defined as:

$$\begin{aligned} micro\_F_{1}=\frac{2 \times micro\_precision \times micro\_recall}{micro\_precision + micro\_recall} \end{aligned}$$
(12)
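A sketch of Eqs. (10)-(12): since \(\sum _{j} tp_{j} = \sum _{i} \left| h(\mathbf{x_{i}}) \cap \mathcal {Y}_{i}\right| \) (and analogously for the pooled fp and fn counts), the sums can be accumulated directly from the label sets:

```python
def micro_scores(true_sets, pred_sets):
    """Eqs. (10)-(12): pool tp/fp/fn over all labels, then compute P, R and F1."""
    tp = fp = fn = 0
    for y, h in zip(true_sets, pred_sets):
        tp += len(h & y)
        fp += len(h - y)
        fn += len(y - h)
    p  = tp / (tp + fp) if tp + fp else 0.0
    r  = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```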

A.3 Ranking-Based Measures

One Error evaluates how many times the top-ranked label is not in the set of relevant labels of the example. The metric \(one\_error(f)\) takes values between 0 and 1. The smaller the value of \(one\_error(f)\), the better the performance. This evaluation metric is defined as:

$$\begin{aligned} one\_error(f)=\frac{1}{N}\sum ^{N}_{i=1}\left[ \!\!\left[ \left( \arg \max _{\lambda \in \mathcal {L}} f(\mathbf{x_{i}}, \lambda )\right) \notin \mathcal {Y}_{i} \right] \!\!\right] \end{aligned}$$
(13)

where \(\mathcal {L} = \left\{ \lambda _{1}, \lambda _{2}, \ldots , \lambda _{Q}\right\} \) is the set of all labels and, for any predicate \(\pi \), \(\left[ \!\left[ \pi \right] \!\right] \) equals 1 if \(\pi \) holds and 0 otherwise. Note that, for single-label classification problems, One Error is identical to the ordinary classification error.
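The ranking-based measures operate on the real-valued scores \(f(\mathbf{x_{i}}, \lambda )\) rather than on predicted label sets. In the sketches below, the scores of each example are encoded as a list of length \(Q\); this encoding, and breaking score ties by the lowest label index, are our own assumptions.

```python
def one_error(scores, true_sets):
    """Eq. (13): fraction of examples whose top-scored label is not relevant."""
    errors = 0
    for s, y in zip(scores, true_sets):
        top = max(range(len(s)), key=lambda j: s[j])  # ties -> lowest index
        errors += top not in y
    return errors / len(scores)

print(one_error([[0.9, 0.1, 0.4]], [{1, 2}]))  # label 0 ranks first -> 1.0
```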

Coverage evaluates how far, on average, we need to go down the list of ranked labels in order to cover all the relevant labels of the example. The smaller the value of \(coverage(f)\), the better the performance.

$$\begin{aligned} coverage(f)=\frac{1}{N}\sum ^{N}_{i=1}\max _{\lambda \in \mathcal {Y}_{i}} rank_{f}(\mathbf{x_{i}}, \lambda ) - 1 \end{aligned}$$
(14)

where \(rank_{f}(\mathbf{x_{i}}, \lambda )\) maps the outputs of \(f(\mathbf{x_{i}}, \lambda )\) for \(\lambda \in \mathcal {L}\) to ranks \(\left\{ 1, 2, \ldots , Q\right\} \), so that \(f(\mathbf{x_{i}}, \lambda _{m}) > f(\mathbf{x_{i}}, \lambda _{n})\) implies \(rank_{f}(\mathbf{x_{i}}, \lambda _{m}) < rank_{f}(\mathbf{x_{i}}, \lambda _{n})\). Given the subtraction of 1 in Eq. (14), the smallest possible value of \(coverage(f)\) is \(l_{c} - 1\), where \(l_{c}\) is the label cardinality of the given dataset.
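A sketch of Eq. (14), including a small helper that converts the scores of one example into the ranks \(1, \ldots , Q\) used by \(rank_{f}\):

```python
def rank_of(scores_i):
    """Map the Q scores of one example to ranks 1..Q (rank 1 = highest score)."""
    order = sorted(range(len(scores_i)), key=lambda j: -scores_i[j])
    ranks = [0] * len(scores_i)
    for r, j in enumerate(order, start=1):
        ranks[j] = r
    return ranks

def coverage(scores, true_sets):
    """Eq. (14): average rank of the worst-ranked relevant label, minus 1."""
    return sum(max(rank_of(s)[j] for j in y) - 1
               for s, y in zip(scores, true_sets)) / len(scores)
```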

Ranking Loss evaluates the average fraction of label pairs that are reversely ordered for an example. It is given by:

$$\begin{aligned} ranking\_loss(f)=\frac{1}{N}\sum ^{N}_{i=1}\frac{\left| D_{i}\right| }{\left| \mathcal {Y}_{i}\right| \left| \bar{\mathcal {Y}_{i}}\right| } \end{aligned}$$
(15)

where \(D_{i} = \{(\lambda _{m}, \lambda _{n}) | f(\mathbf{x_{i}}, \lambda _{m}) \le f(\mathbf{x_{i}}, \lambda _{n}), (\lambda _{m}, \lambda _{n}) \in \mathcal {Y}_{i} \times \bar{\mathcal {Y}_{i}}\}\), while \(\bar{\mathcal {Y}_{i}}\) denotes the complement of \(\mathcal {Y}_{i}\) in \(\mathcal {L}\). The smaller the value of \(ranking\_loss(f)\), the better the performance; the performance is perfect when \(ranking\_loss(f) = 0\).
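A sketch of Eq. (15); letting an example with an empty \(\mathcal {Y}_{i}\) or empty \(\bar{\mathcal {Y}_{i}}\) contribute 0 is our own convention:

```python
def ranking_loss(scores, true_sets):
    """Eq. (15): fraction of (relevant, irrelevant) label pairs that are mis-ordered."""
    total = 0.0
    for s, y in zip(scores, true_sets):
        ybar = [j for j in range(len(s)) if j not in y]
        pairs = len(y) * len(ybar)
        bad = sum(s[m] <= s[n] for m in y for n in ybar)  # the pairs in D_i
        total += bad / pairs if pairs else 0.0
    return total / len(scores)
```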

Average Precision is the average fraction of labels ranked above a relevant label \(\lambda \in \mathcal {Y}_{i}\) that are themselves in \(\mathcal {Y}_{i}\). The larger the value of \(avg\_precision(f)\), the better the performance; the performance is perfect when \(avg\_precision(f) = 1\). This metric is defined as:

$$\begin{aligned} avg\_precision(f)=\frac{1}{N}\sum ^{N}_{i=1}\frac{1}{\left| \mathcal {Y}_{i}\right| }\sum _{\lambda \in \mathcal {Y}_{i}}\frac{\left| \mathcal {L}_{i}\right| }{rank_{f}(\mathbf{x_{i}}, \lambda )} \end{aligned}$$
(16)

where \(\mathcal {L}_{i}=\{\lambda '| rank_{f}(\mathbf{x_{i}}, \lambda ') \le rank_{f}(\mathbf{x_{i}}, \lambda ), \lambda ' \in \mathcal {Y}_{i}\}\) and \(rank_{f}(\mathbf{x_{i}}, \lambda )\) is defined as in coverage above.
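A sketch of Eq. (16), reusing the rank_of helper from the coverage sketch (repeated here so the snippet runs on its own); the inner count corresponds to \(|\mathcal {L}_{i}|\), which includes \(\lambda \) itself:

```python
def rank_of(scores_i):
    """Ranks 1..Q, rank 1 for the highest score (as in the coverage sketch)."""
    order = sorted(range(len(scores_i)), key=lambda j: -scores_i[j])
    ranks = [0] * len(scores_i)
    for r, j in enumerate(order, start=1):
        ranks[j] = r
    return ranks

def average_precision(scores, true_sets):
    """Eq. (16): precision at the rank of each relevant label, averaged."""
    total = 0.0
    for s, y in zip(scores, true_sets):
        ranks = rank_of(s)
        ap = sum(sum(ranks[lp] <= ranks[lam] for lp in y) / ranks[lam]
                 for lam in y)
        total += ap / len(y) if y else 0.0
    return total / len(scores)
```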

B Complete Results from the Experimental Evaluation

In this section, we present the complete results from the experimental evaluation. Table 3 shows the predictive performance of the compared methods. The first column of the table lists the methods used for constructing the label hierarchies, while the remaining columns show the predictive performance of the compared methods and hierarchies in terms of the 16 evaluation measures.


Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Madjarov, G., Dimitrovski, I., Gjorgjevikj, D., Džeroski, S. (2015). Evaluation of Different Data-Derived Label Hierarchies in Multi-label Classification. In: Appice, A., Ceci, M., Loglisci, C., Manco, G., Masciari, E., Ras, Z. (eds) New Frontiers in Mining Complex Patterns. NFMCP 2014. Lecture Notes in Computer Science(), vol 8983. Springer, Cham. https://doi.org/10.1007/978-3-319-17876-9_2


  • DOI: https://doi.org/10.1007/978-3-319-17876-9_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-17875-2

  • Online ISBN: 978-3-319-17876-9

