Selection Strategies for Multi-label Text Categorization

Montejo-Ráez, Arturo; Ureña-López, Luis Alfonso

doi:10.1007/11816508_58

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4139)

Included in the following conference series:

International Conference on Natural Language Processing (in Finland)

Abstract

In multi-label text categorization, determining the final set of classes that will label a given document is not trivial. It requires deciding, first, whether each class is suitable for the text and, second, how many classes to assign. We study different strategies for determining the size of the final set of assigned labels. We analyze several classification algorithms together with two main selection strategies: assigning a fixed number of top-ranked labels, or applying per-class thresholds. Our experiments show the effects of each approach and the issues to consider when using them.
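
The two selection strategies named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, label names, scores, and threshold values below are all hypothetical, and the sketch assumes each classifier produces one real-valued score per class for a document.

```python
from typing import Dict, List


def select_top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Rank-based selection: keep the k highest-scoring labels.

    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(scores, key=lambda label: (-scores[label], label))
    return ranked[:k]


def select_by_threshold(
    scores: Dict[str, float],
    thresholds: Dict[str, float],
    default: float = 0.5,
) -> List[str]:
    """Threshold-based selection: keep each label whose score reaches
    its own per-class threshold (hypothetical default for classes
    without a tuned threshold)."""
    return sorted(
        label
        for label, score in scores.items()
        if score >= thresholds.get(label, default)
    )


# Hypothetical classifier scores for one document.
scores = {"physics": 0.91, "detectors": 0.62, "software": 0.40, "theory": 0.15}
# Hypothetical per-class thresholds, e.g. tuned on held-out data.
thresholds = {"physics": 0.80, "detectors": 0.70, "software": 0.35}

print(select_top_k(scores, 2))                  # ['physics', 'detectors']
print(select_by_threshold(scores, thresholds))  # ['physics', 'software']
```

In this toy example the fixed-size strategy always returns exactly two labels, while the per-class thresholds let the number of assigned labels vary from document to document.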

Author information

Authors and Affiliations

Department of Computer Science, University of Jaén, Spain
Arturo Montejo-Ráez & Luis Alfonso Ureña-López

Editor information

Editors and Affiliations

Turku Centre for Computer Science (TUCS), Department of Information Technology, University of Turku, Joukahaisenkatu 3-5 B, FIN-20520, Turku, Finland
Tapio Salakoski
Turku Centre for Computer Science (TUCS) and Department of IT, University of Turku, Lemminkäisenkatu 14 A, 20520, Turku, Finland
Filip Ginter & Sampo Pyysalo
Department of Information Technology, University of Turku, Lemminkäisenkatu 14–18 A, FIN-20520, Turku, Finland
Tapio Pahikkala

About this paper

Cite this paper

Montejo-Ráez, A., Ureña-López, L.A. (2006). Selection Strategies for Multi-label Text Categorization. In: Salakoski, T., Ginter, F., Pyysalo, S., Pahikkala, T. (eds) Advances in Natural Language Processing. FinTAL 2006. Lecture Notes in Computer Science, vol 4139. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11816508_58

DOI: https://doi.org/10.1007/11816508_58
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-37334-6
Online ISBN: 978-3-540-37336-0
eBook Packages: Computer Science, Computer Science (R0)
