
Evaluation of Different Data-Derived Label Hierarchies in Multi-label Classification

Conference paper, in: New Frontiers in Mining Complex Patterns (NFMCP 2014)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8983)

Abstract

Motivated by a growing number of new applications, the research community is devoting increasing attention to the task of multi-label classification (MLC), and many different approaches to solving MLC problems have recently been developed. Recent empirical studies have comprehensively evaluated many of these approaches on many datasets using different evaluation measures. These studies indicate that the predictive performance and efficiency of the approaches can be improved by using data-derived (artificial) label hierarchies in the learning and prediction phases. In this paper, we compare different clustering algorithms for constructing label hierarchies (in a data-driven manner) for multi-label classification. We consider flat label sets and construct label hierarchies from the label sets that appear in the annotations of the training data by using four different clustering algorithms: balanced \(k\)-means, agglomerative clustering with single and complete linkage, and predictive clustering trees. The hierarchies are then used in conjunction with global hierarchical multi-label classification (HMC) approaches. The results of the statistical and experimental evaluation reveal that data-derived label hierarchies, used in conjunction with global HMC methods, greatly improve the performance of MLC methods. Additionally, multi-branch hierarchies appear much more suitable for the global HMC approaches than binary hierarchies.




Acknowledgements

We would like to acknowledge the support of the European Commission through the project MAESTRA - Learning from Massive, Incompletely annotated, and Structured Data (Grant number ICT-2013-612944), as well as the support of the Faculty of Computer Science and Engineering at the “Ss. Cyril and Methodius” University.

Author information


Corresponding author

Correspondence to Gjorgji Madjarov.


Appendices

A Evaluation Measures

In this section, we present the measures used to evaluate the predictive performance of the compared methods in our experiments. In the definitions below, \(\mathcal {Y}_i\) denotes the set of true labels of example \(\mathbf{x_{i}}\) and \(h(\mathbf{x_{i}})\) denotes the set of labels predicted for the same example. All definitions refer to the multi-label setting.

A.1 Example-Based Measures

Hamming Loss evaluates how many times an example-label pair is misclassified, i.e., a label not belonging to the example is predicted or a label belonging to the example is not predicted. The smaller the value of \(hamming\_loss(h)\), the better the performance; the performance is perfect when \(hamming\_loss(h) = 0\). This metric is defined as:

$$\begin{aligned} hamming\_loss(h)=\frac{1}{N}\sum ^{N}_{i=1}\frac{1}{Q}\left| h(\mathbf{x_{i}})\varDelta \mathcal {Y}_{i}\right| \end{aligned}$$
(1)

where \(\varDelta \) stands for the symmetric difference between two sets, \(N\) is the number of examples and \(Q\) is the total number of possible class labels.
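As an illustration, the following is a minimal Python sketch of Eq. (1). The function name and the encoding of the true and predicted label sets as Python sets of label indices are our own illustrative assumptions, not something prescribed in the paper.

```python
def hamming_loss(true_sets, pred_sets, Q):
    """Eq. (1): mean size of the symmetric difference h(x_i) Δ Y_i, normalised by Q."""
    N = len(true_sets)
    return sum(len(h ^ y) for h, y in zip(pred_sets, true_sets)) / (N * Q)

# Toy check: two of the four example/label pairs are misclassified -> 0.5
print(hamming_loss([{0, 1}], [{0, 2}], Q=4))
```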

Accuracy for a single example \(\mathbf{x_{i}}\) is defined by the Jaccard similarity coefficient between the label sets \(h(\mathbf{x_{i}})\) and \(\mathcal {Y}_i\). Accuracy is averaged across all examples:

$$\begin{aligned} accuracy(h)=\frac{1}{N}\sum ^{N}_{i=1}\frac{\left| h(\mathbf{x_{i}})\bigcap \mathcal {Y}_{i}\right| }{\left| h(\mathbf{x_{i}})\bigcup \mathcal {Y}_{i}\right| } \end{aligned}$$
(2)

Precision is defined as:

$$\begin{aligned} precision(h)=\frac{1}{N}\sum ^{N}_{i=1}\frac{\left| h(\mathbf{x_{i}})\bigcap \mathcal {Y}_{i}\right| }{\left| h(\mathbf{x_{i}})\right| } \end{aligned}$$
(3)

Recall is defined as:

$$\begin{aligned} recall(h)=\frac{1}{N}\sum ^{N}_{i=1}\frac{\left| h(\mathbf{x_{i}})\bigcap \mathcal {Y}_{i}\right| }{\left| \mathcal {Y}_{i}\right| } \end{aligned}$$
(4)

\(F_{1}\) score is the harmonic mean between precision and recall and is defined as:

$$\begin{aligned} F_{1}=\frac{1}{N}\sum ^{N}_{i=1}\frac{2 \times \left| h(\mathbf{x_{i}}) \cap \mathcal {Y}_{i}\right| }{\left| h(\mathbf{x_{i}})\right| + \left| \mathcal {Y}_{i}\right| } \end{aligned}$$
(5)

\(F_{1}\) is an example-based metric and its value is averaged over all examples in the dataset. \(F_{1}\) reaches its best value at 1 and its worst at 0.
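Since Eqs. (2)-(5) differ only in their denominators, a single sketch can compute all four example-based measures. The treatment of empty denominators (a per-example term contributes 0) is our own convention; the paper does not specify this edge case.

```python
def example_based_scores(true_sets, pred_sets):
    """Eqs. (2)-(5): accuracy, precision, recall and F1, averaged over examples."""
    N = len(true_sets)
    acc = prec = rec = f1 = 0.0
    for y, h in zip(true_sets, pred_sets):
        inter = len(h & y)
        acc  += inter / len(h | y) if h | y else 0.0   # Jaccard coefficient
        prec += inter / len(h) if h else 0.0
        rec  += inter / len(y) if y else 0.0
        f1   += 2 * inter / (len(h) + len(y)) if h or y else 0.0
    return acc / N, prec / N, rec / N, f1 / N

print(example_based_scores([{0, 1}, {2}], [{0}, {2, 3}]))
```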

Subset Accuracy or classification accuracy is defined as follows:

$$\begin{aligned} subset\_accuracy(h)=\frac{1}{N}\sum ^{N}_{i=1}I(h(\mathbf{x_{i}})=\mathcal {Y}_{i}) \end{aligned}$$
(6)

where \(I(true) = 1\) and \(I(false) = 0\). This is a very strict evaluation measure as it requires the predicted set of labels to be an exact match of the true set of labels.
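A sketch of Eq. (6) under the same set encoding; exact set equality is all that is required:

```python
def subset_accuracy(true_sets, pred_sets):
    """Eq. (6): fraction of examples whose predicted label set is an exact match."""
    return sum(h == y for h, y in zip(pred_sets, true_sets)) / len(true_sets)

print(subset_accuracy([{0, 1}, {2}], [{0, 1}, {2, 3}]))  # -> 0.5
```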

A.2 Label-Based Measures

Macro Precision (precision averaged across all labels) is defined as:

$$\begin{aligned} macro\_precision=\frac{1}{Q}\sum ^{Q}_{j=1}\frac{tp_{j}}{tp_{j} + fp_{j}} \end{aligned}$$
(7)

where \(tp_{j}\) and \(fp_{j}\) are the numbers of true positives and false positives for the label \(\lambda _{j}\), considered as a binary class.

Macro Recall (recall averaged across all labels) is defined as:

$$\begin{aligned} macro\_recall=\frac{1}{Q}\sum ^{Q}_{j=1}\frac{tp_{j}}{tp_{j} + fn_{j}} \end{aligned}$$
(8)

where \(tp_{j}\) is defined as for macro precision and \(fn_{j}\) is the number of false negatives for the label \(\lambda _{j}\), considered as a binary class.
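The two macro-averaged measures can be sketched by counting \(tp_{j}\), \(fp_{j}\) and \(fn_{j}\) per label and averaging afterwards. Letting a label with an empty denominator contribute 0 to the average is, again, our own convention:

```python
def macro_precision_recall(true_sets, pred_sets, Q):
    """Eqs. (7)-(8): per-label precision and recall, averaged over the Q labels."""
    prec = rec = 0.0
    for j in range(Q):
        tp = sum(j in y and j in h for y, h in zip(true_sets, pred_sets))
        fp = sum(j not in y and j in h for y, h in zip(true_sets, pred_sets))
        fn = sum(j in y and j not in h for y, h in zip(true_sets, pred_sets))
        prec += tp / (tp + fp) if tp + fp else 0.0
        rec  += tp / (tp + fn) if tp + fn else 0.0
    return prec / Q, rec / Q
```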

Macro \(F_{1}\) is the harmonic mean of precision and recall, computed per label and then averaged across all labels. If \(p_{j}\) and \(r_{j}\) are the precision and recall for the label \(\lambda _{j}\), the macro \(F_{1}\) is:

$$\begin{aligned} macro\_F_{1}=\frac{1}{Q}\sum ^{Q}_{j=1}\frac{2\times p_{j} \times r_{j}}{p_{j} + r_{j}} \end{aligned}$$
(9)
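Assuming the per-label precisions \(p_{j}\) and recalls \(r_{j}\) are available as lists (they can be collected from the same tp/fp/fn counts as in the previous sketch), Eq. (9) becomes:

```python
def macro_f1(p, r):
    """Eq. (9): per-label harmonic mean of p_j and r_j, averaged over labels."""
    return sum(2 * pj * rj / (pj + rj) if pj + rj else 0.0
               for pj, rj in zip(p, r)) / len(p)
```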

Micro Precision (precision averaged over all the example/label pairs) is defined as:

$$\begin{aligned} micro\_precision=\frac{\sum ^{Q}_{j=1}{tp_{j}}}{\sum ^{Q}_{j=1}{tp_{j}} + \sum ^{Q}_{j=1}{fp_{j}}} \end{aligned}$$
(10)

where \(tp_{j}\), \(fp_{j}\) are defined as for macro precision.

Micro Recall (recall averaged over all the example/label pairs) is defined as:

$$\begin{aligned} micro\_recall=\frac{\sum ^{Q}_{j=1}{tp_{j}}}{\sum ^{Q}_{j=1}{tp_{j}} + \sum ^{Q}_{j=1}{fn_{j}}} \end{aligned}$$
(11)

where \(tp_{j}\) and \(fn_{j}\) are defined as for macro recall.

Micro \(F_{1}\) is the harmonic mean between micro precision and micro recall. Micro \(F_{1}\) is defined as:

$$\begin{aligned} micro\_F_{1}=\frac{2 \times micro\_precision \times micro\_recall}{micro\_precision + micro\_recall} \end{aligned}$$
(12)
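A sketch of Eqs. (10)-(12): since \(\sum _{j} tp_{j} = \sum _{i} \left| h(\mathbf{x_{i}}) \cap \mathcal {Y}_{i}\right| \) (and analogously for the pooled fp and fn counts), the sums can be accumulated directly from the label sets:

```python
def micro_scores(true_sets, pred_sets):
    """Eqs. (10)-(12): pool tp/fp/fn over all labels, then compute P, R and F1."""
    tp = fp = fn = 0
    for y, h in zip(true_sets, pred_sets):
        tp += len(h & y)
        fp += len(h - y)
        fn += len(y - h)
    p  = tp / (tp + fp) if tp + fp else 0.0
    r  = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```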

A.3 Ranking-Based Measures

One Error evaluates how many times the top-ranked label is not in the set of relevant labels of the example. The metric \(one\_error(f)\) takes values between 0 and 1. The smaller the value of \(one\_error(f)\), the better the performance. This evaluation metric is defined as:

$$\begin{aligned} one\_error(f)=\frac{1}{N}\sum ^{N}_{i=1}\left[ \!\!\left[ \left( \arg \max _{\lambda \in \mathcal {L}} f(\mathbf{x_{i}}, \lambda )\right) \notin \mathcal {Y}_{i} \right] \!\!\right] \end{aligned}$$
(13)

where \(\mathcal {L} = \left\{ \lambda _{1}, \lambda _{2}, \ldots , \lambda _{Q}\right\} \) is the set of all labels and, for any predicate \(\pi \), \(\left[ \!\left[ \pi \right] \!\right] \) equals 1 if \(\pi \) holds and 0 otherwise. Note that, for single-label classification problems, One Error is identical to the ordinary classification error.
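The ranking-based measures operate on the real-valued scores \(f(\mathbf{x_{i}}, \lambda )\) rather than on predicted label sets. In the sketches below, the scores of each example are encoded as a list of length \(Q\); this encoding, and breaking score ties by the lowest label index, are our own assumptions.

```python
def one_error(scores, true_sets):
    """Eq. (13): fraction of examples whose top-scored label is not relevant."""
    errors = 0
    for s, y in zip(scores, true_sets):
        top = max(range(len(s)), key=lambda j: s[j])  # ties -> lowest index
        errors += top not in y
    return errors / len(scores)

print(one_error([[0.9, 0.1, 0.4]], [{1, 2}]))  # label 0 ranks first -> 1.0
```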

Coverage evaluates how far, on average, we need to go down the list of ranked labels in order to cover all the relevant labels of the example. The smaller the value of \(coverage(f)\), the better the performance.

$$\begin{aligned} coverage(f)=\frac{1}{N}\sum ^{N}_{i=1}\max _{\lambda \in \mathcal {Y}_{i}} rank_{f}(\mathbf{x_{i}}, \lambda ) - 1 \end{aligned}$$
(14)

where \(rank_{f}(\mathbf{x_{i}}, \lambda )\) maps the outputs of \(f(\mathbf{x_{i}}, \lambda )\) for \(\lambda \in \mathcal {L}\) to ranks \(\left\{ 1, 2, \ldots , Q\right\} \), so that \(f(\mathbf{x_{i}}, \lambda _{m}) > f(\mathbf{x_{i}}, \lambda _{n})\) implies \(rank_{f}(\mathbf{x_{i}}, \lambda _{m}) < rank_{f}(\mathbf{x_{i}}, \lambda _{n})\). Given the subtraction of 1 in Eq. (14), the smallest possible value of \(coverage(f)\) is \(l_{c} - 1\), where \(l_{c}\) is the label cardinality of the given dataset.
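A sketch of Eq. (14), including a small helper that converts the scores of one example into the ranks \(1, \ldots , Q\) used by \(rank_{f}\):

```python
def rank_of(scores_i):
    """Map the Q scores of one example to ranks 1..Q (rank 1 = highest score)."""
    order = sorted(range(len(scores_i)), key=lambda j: -scores_i[j])
    ranks = [0] * len(scores_i)
    for r, j in enumerate(order, start=1):
        ranks[j] = r
    return ranks

def coverage(scores, true_sets):
    """Eq. (14): average rank of the worst-ranked relevant label, minus 1."""
    return sum(max(rank_of(s)[j] for j in y) - 1
               for s, y in zip(scores, true_sets)) / len(scores)
```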

Ranking Loss evaluates the average fraction of label pairs that are reversely ordered for an example. It is given by:

$$\begin{aligned} ranking\_loss(f)=\frac{1}{N}\sum ^{N}_{i=1}\frac{\left| D_{i}\right| }{\left| \mathcal {Y}_{i}\right| \left| \bar{\mathcal {Y}_{i}}\right| } \end{aligned}$$
(15)

where \(D_{i} = \{(\lambda _{m}, \lambda _{n}) | f(\mathbf{x_{i}}, \lambda _{m}) \le f(\mathbf{x_{i}}, \lambda _{n}), (\lambda _{m}, \lambda _{n}) \in \mathcal {Y}_{i} \times \bar{\mathcal {Y}_{i}}\}\), while \(\bar{\mathcal {Y}_{i}}\) denotes the complement of \(\mathcal {Y}_{i}\) in \(\mathcal {L}\). The smaller the value of \(ranking\_loss(f)\), the better the performance; the performance is perfect when \(ranking\_loss(f) = 0\).
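A sketch of Eq. (15); letting an example with an empty \(\mathcal {Y}_{i}\) or empty \(\bar{\mathcal {Y}_{i}}\) contribute 0 is our own convention:

```python
def ranking_loss(scores, true_sets):
    """Eq. (15): fraction of (relevant, irrelevant) label pairs that are mis-ordered."""
    total = 0.0
    for s, y in zip(scores, true_sets):
        ybar = [j for j in range(len(s)) if j not in y]
        pairs = len(y) * len(ybar)
        bad = sum(s[m] <= s[n] for m in y for n in ybar)  # the pairs in D_i
        total += bad / pairs if pairs else 0.0
    return total / len(scores)
```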

Average Precision is the average fraction of labels ranked above a relevant label \(\lambda \in \mathcal {Y}_{i}\) that are themselves in \(\mathcal {Y}_{i}\). The larger the value of \(avg\_precision(f)\), the better the performance; the performance is perfect when \(avg\_precision(f) = 1\). This metric is defined as:

$$\begin{aligned} avg\_precision(f)=\frac{1}{N}\sum ^{N}_{i=1}\frac{1}{\left| \mathcal {Y}_{i}\right| }\sum _{\lambda \in \mathcal {Y}_{i}}\frac{\left| \mathcal {L}_{i}\right| }{rank_{f}(\mathbf{x_{i}}, \lambda )} \end{aligned}$$
(16)

where \(\mathcal {L}_{i}=\{\lambda '| rank_{f}(\mathbf{x_{i}}, \lambda ') \le rank_{f}(\mathbf{x_{i}}, \lambda ), \lambda ' \in \mathcal {Y}_{i}\}\) and \(rank_{f}(\mathbf{x_{i}}, \lambda )\) is defined as in coverage above.
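A sketch of Eq. (16), reusing the rank_of helper from the coverage sketch (repeated here so the snippet runs on its own); the inner count corresponds to \(|\mathcal {L}_{i}|\), which includes \(\lambda \) itself:

```python
def rank_of(scores_i):
    """Ranks 1..Q, rank 1 for the highest score (as in the coverage sketch)."""
    order = sorted(range(len(scores_i)), key=lambda j: -scores_i[j])
    ranks = [0] * len(scores_i)
    for r, j in enumerate(order, start=1):
        ranks[j] = r
    return ranks

def average_precision(scores, true_sets):
    """Eq. (16): precision at the rank of each relevant label, averaged."""
    total = 0.0
    for s, y in zip(scores, true_sets):
        ranks = rank_of(s)
        ap = sum(sum(ranks[lp] <= ranks[lam] for lp in y) / ranks[lam]
                 for lam in y)
        total += ap / len(y) if y else 0.0
    return total / len(scores)
```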

B Complete Results from the Experimental Evaluation

In this section, we present the complete results from the experimental evaluation. Table 3 shows the predictive performance of the compared methods. The first column of the table lists the methods used for constructing the label hierarchies, while the remaining columns show the predictive performance of the compared methods and hierarchies in terms of the 16 evaluation measures.


Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Madjarov, G., Dimitrovski, I., Gjorgjevikj, D., Džeroski, S. (2015). Evaluation of Different Data-Derived Label Hierarchies in Multi-label Classification. In: Appice, A., Ceci, M., Loglisci, C., Manco, G., Masciari, E., Ras, Z. (eds) New Frontiers in Mining Complex Patterns. NFMCP 2014. Lecture Notes in Computer Science(), vol 8983. Springer, Cham. https://doi.org/10.1007/978-3-319-17876-9_2


  • DOI: https://doi.org/10.1007/978-3-319-17876-9_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-17875-2

  • Online ISBN: 978-3-319-17876-9

