Adaptive Selection of Base Classifiers in One-Against-All Learning for Large Multi-labeled Collections

Ráez, Arturo Montejo; López, Luís Alfonso Ureña; Steinberger, Ralf

doi:10.1007/978-3-540-30228-5_1

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3230)

Included in the following conference series:

International Conference on Natural Language Processing (in Spain)

Abstract

In this paper we present the problems encountered when studying an automated text categorization system for a collection of High Energy Physics (HEP) papers, which has a very large number of possible classes (over 1,000) with a highly imbalanced distribution. The collection is introduced to the scientific community, and its imbalance is studied using a new indicator: the inner imbalance degree. The one-against-all approach is used to perform multi-label assignment with Support Vector Machines. Over-weighting of positive samples and S-Cut thresholding are compared with an approach that automatically selects a classifier for each class from a set of candidates. We also found that the computational cost of the classification task can be reduced by discarding classes for which classifiers cannot be trained successfully.
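
The per-class selection idea summarised in the abstract might be sketched roughly as follows. This is an illustrative sketch in plain Python, not the authors' implementation: the candidate names (`svm_plain`, `svm_weighted`), the class names, the `min_f1` cut-off, and the use of F1 as the selection criterion are all assumptions, and the score lists are toy values standing in for classifier outputs on a validation set.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for one binary (one-against-all) problem; 0.0 if no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def scut_threshold(scores: Sequence[float], y_true: Sequence[int]) -> Tuple[float, float]:
    """S-Cut style tuning: pick the score threshold with the best validation F1."""
    best_t, best_f1 = float("inf"), 0.0
    for t in sorted(set(scores)):
        preds = [1 if s >= t else 0 for s in scores]
        f1 = f1_score(y_true, preds)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


def select_per_class(
    candidates: Dict[str, Dict[str, List[float]]],
    labels: Dict[str, List[int]],
    min_f1: float = 0.1,  # assumed cut-off below which a class is discarded
) -> Dict[str, Optional[Tuple[str, float]]]:
    """For each class keep the candidate with the best tuned F1, or None.

    candidates[name][cls] holds validation scores of candidate `name` for
    class `cls`; labels[cls] holds the matching 0/1 gold labels.
    """
    selection: Dict[str, Optional[Tuple[str, float]]] = {}
    for cls, y in labels.items():
        best: Optional[Tuple[str, float]] = None
        best_f1 = 0.0
        for name, per_class in candidates.items():
            t, f1 = scut_threshold(per_class[cls], y)
            if f1 > best_f1:
                best, best_f1 = (name, t), f1
        # Classes that no candidate can learn are dropped from the task.
        selection[cls] = best if best_f1 >= min_f1 else None
    return selection


# Toy validation data: four documents, three hypothetical classes.
labels = {"quark": [1, 0, 1, 0], "rare": [0, 0, 0, 1], "empty": [0, 0, 0, 0]}
candidates = {
    "svm_plain": {
        "quark": [0.9, 0.2, 0.8, 0.1],
        "rare": [0.1, 0.3, 0.2, 0.0],
        "empty": [0.1, 0.2, 0.0, 0.3],
    },
    "svm_weighted": {
        "quark": [0.6, 0.7, 0.4, 0.2],
        "rare": [0.1, 0.2, 0.1, 0.9],
        "empty": [0.3, 0.1, 0.2, 0.0],
    },
}
selection = select_per_class(candidates, labels)
```

In this toy run each class ends up with a different candidate, and the class with no positive examples is discarded, which is the kind of saving in classification cost the abstract refers to.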

Author information

Authors and Affiliations

European Laboratory for Nuclear Research, Geneva, Switzerland
Arturo Montejo Ráez
Department of Computer Science, University of Jaén, Spain
Luís Alfonso Ureña López
European Commission, Joint Research Centre, Ispra, Italy
Ralf Steinberger

Editor information

Editors and Affiliations

Department of Software and Computing Systems, University of Alicante, Spain
José Luis Vicedo
Natural Language Processing and Information Systems Group, Department of Software and Computing Systems, University of Alicante, Spain
Patricio Martínez-Barco
Grupo de investigación del Procesamiento del Lenguaje y Sistemas de Información, Departamento de Lenguajes y Sistemas Informáticos, Universidad de Alicante, Alicante, Spain
Rafael Muñoz
Departamento de Lenguajes y Sistemas Informáticos, Carretera de San Vicente del Raspeig, Universidad de Alicante, 03690 San Vicente del Raspeig, Alicante, Spain
Maximiliano Saiz Noeda

About this paper

Cite this paper

Ráez, A.M., López, L.A.U., Steinberger, R. (2004). Adaptive Selection of Base Classifiers in One-Against-All Learning for Large Multi-labeled Collections. In: Vicedo, J.L., Martínez-Barco, P., Muñoz, R., Saiz Noeda, M. (eds) Advances in Natural Language Processing. EsTAL 2004. Lecture Notes in Computer Science, vol 3230. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30228-5_1

DOI: https://doi.org/10.1007/978-3-540-30228-5_1
Published: 20 October 2004
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23498-2
Online ISBN: 978-3-540-30228-5
eBook Packages: Springer Book Archive
