Adaptive Selection of Base Classifiers in One-Against-All Learning for Large Multi-labeled Collections

  • Arturo Montejo Ráez
  • Luís Alfonso Ureña López
  • Ralf Steinberger
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3230)


In this paper we present the problem encountered when studying an automated text categorization system for a collection of High Energy Physics (HEP) papers, which has a very large number of possible classes (over 1,000) with a highly imbalanced distribution. The collection is introduced to the scientific community, and its imbalance is studied by applying a new indicator: the inner imbalance degree. The one-against-all approach is used to perform multi-label assignment using Support Vector Machines. Over-weighting of positive samples and S-Cut thresholding are compared to an approach that automatically selects a classifier for each class from a set of candidates. We also found that it is possible to reduce the computational cost of the classification task by discarding classes for which classifiers cannot be trained successfully.
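As a rough illustration of the per-class selection idea described in the abstract, the following minimal sketch picks, for one class, the candidate whose S-Cut threshold gives the best F1 on validation data, and discards the class when no candidate reaches a minimum F1. The selection criterion (validation F1), the `min_f1` cut-off, the function names, and the candidate names are assumptions made for illustration only; the abstract does not specify them, and this is not the authors' implementation.

```python
from typing import Dict, Optional, Sequence, Tuple


def f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for one binary (one-against-all) problem; 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def scut_threshold(scores: Sequence[float],
                   y_true: Sequence[int]) -> Tuple[float, float]:
    """S-Cut for one class: try each observed score as a threshold and
    keep the one with the highest F1 on the validation samples."""
    best_t, best_f = float("inf"), 0.0
    for t in sorted(set(scores)):
        pred = [1 if s >= t else 0 for s in scores]
        f = f1(y_true, pred)
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f


def select_classifier(candidates: Dict[str, Sequence[float]],
                      y_true: Sequence[int],
                      min_f1: float = 0.1) -> Optional[Tuple[str, float, float]]:
    """Choose the best candidate for one class.

    `candidates` maps a (hypothetical) candidate name to the validation
    scores it produced for this class. Returns (name, threshold, f1),
    or None when the class is discarded because no candidate reaches
    `min_f1` (an assumed cut-off, not taken from the paper).
    """
    best: Optional[Tuple[str, float, float]] = None
    for name, scores in candidates.items():
        t, f = scut_threshold(scores, y_true)
        if best is None or f > best[2]:
            best = (name, t, f)
    if best is None or best[2] < min_f1:
        return None
    return best


# Toy validation data for a single class (made up for this example).
y_val = [1, 0, 0, 1, 0]
cands = {
    "svm_default": [0.1, 0.9, 0.8, 0.2, 0.7],
    "svm_pos_weighted": [0.9, 0.2, 0.4, 0.8, 0.1],
}
chosen = select_classifier(cands, y_val)
```

In a one-against-all setting this selection would be repeated independently for every class, and classes for which `select_classifier` returns `None` would simply be left out of the final multi-label assignment.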


Support Vector Machine; Positive Sample; Negative Sample; Class Imbalance; Binary Case





Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Arturo Montejo Ráez (1)
  • Luís Alfonso Ureña López (2)
  • Ralf Steinberger (3)

  1. European Laboratory for Nuclear Research, Geneva, Switzerland
  2. Department of Computer Science, University of Jaén, Spain
  3. European Commission, Joint Research Centre, Ispra, Italy
