Abstract
Multi-label classification is a frequent task in machine learning, notably in text categorization. When binary classifiers are not suitable, an alternative is to use a multiclass classifier that provides a score per category for each document, and then to apply a thresholding strategy to select the set of categories assigned to the document. The common thresholding strategies, such as the RCut, PCut and SCut methods, need a training step to determine the value of the threshold. To overcome this limitation, we propose a new strategy, called MCut, which automatically estimates a value for the threshold. This method requires no training and no parameter tuning. Experiments performed on two textual corpora, the XML Mining 2009 and RCV1 collections, show that MCut's results are on par with the state of the art, while MCut is easy to implement and parameter-free.
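The abstract does not spell out how MCut derives its threshold, so the following Python sketch rests on an assumption: that the cut is placed at the largest gap between consecutive scores once a document's category scores are sorted in decreasing order, with the threshold taken as the midpoint of the two scores on either side of that gap. The function name `mcut_select` and the example category labels are hypothetical and serve only as illustration.

```python
from typing import Dict, Set


def mcut_select(scores: Dict[str, float]) -> Set[str]:
    """Return the categories ranked above the largest score gap.

    Assumption (not stated in the abstract): the cut falls at the largest
    difference between consecutive scores in the decreasing ranking.
    """
    if not scores:
        return set()

    # Rank categories by decreasing score.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) == 1:
        return {ranked[0][0]}

    # Gap between each score and the next lower one in the ranking.
    gaps = [ranked[i][1] - ranked[i + 1][1] for i in range(len(ranked) - 1)]

    # Index of the largest gap; ties resolve to the highest-ranked gap.
    cut = max(range(len(gaps)), key=gaps.__getitem__)

    # The implied threshold is the midpoint (ranked[cut][1] + ranked[cut + 1][1]) / 2.
    # Selecting by rank position is equivalent and avoids ties at the threshold.
    return {label for label, _ in ranked[: cut + 1]}


if __name__ == "__main__":
    document_scores = {"sports": 0.9, "politics": 0.8, "economy": 0.3, "science": 0.1}
    print(sorted(mcut_select(document_scores)))
```

In this example the largest gap lies between 0.8 and 0.3, so the two highest-scoring categories are kept. The cut is computed per document from its own scores, which is what makes a rule of this kind free of training and of parameters.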
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Largeron, C., Moulin, C., Géry, M. (2012). MCut: A Thresholding Strategy for Multi-label Classification. In: Hollmén, J., Klawonn, F., Tucker, A. (eds) Advances in Intelligent Data Analysis XI. IDA 2012. Lecture Notes in Computer Science, vol 7619. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-34156-4_17
DOI: https://doi.org/10.1007/978-3-642-34156-4_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-34155-7
Online ISBN: 978-3-642-34156-4
eBook Packages: Computer Science, Computer Science (R0)