MEDLINE Abstracts Classification Based on Noun Phrases Extraction

  • Fernando Ruiz-Rico
  • José-Luis Vicedo
  • María-Consuelo Rubio-Sánchez
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 25)


Many algorithms have been proposed in recent years to tackle automated text categorization. They have been studied exhaustively, leading to several variants and combinations, not only in the procedures themselves but also in the treatment of the input data. A widely used approach is to represent documents as a Bag-Of-Words (BOW) and to weight tokens with the TFIDF scheme. Many researchers have sought to improve precision and recall and to reduce classification time by enriching BOW with stemming, n-grams, feature selection, noun phrases, metadata, weight normalization, etc. We contribute to this field with a novel combination of these techniques. For evaluation purposes, we provide comparisons with previous works, using SVM and taking the simple BOW as the baseline. The well-known OHSUMED corpus is used, and different sets of categories are selected, as previously done in the literature. The conclusion is that the proposed method can be successfully applied to existing binary classifiers such as SVM, outperforming the combination of BOW and TFIDF.
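As a rough illustration of the baseline representation mentioned above (BOW with TFIDF weighting), the following minimal sketch may help. It is not the method proposed in the paper. The whitespace tokenizer, the particular TFIDF variant (raw term frequency times log(N/df)), the function names, and the toy documents are all assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately naive tokenizer."""
    return text.lower().split()


def tfidf_vectors(documents):
    """Return one {token: weight} dict per document.

    Weight = raw term frequency * log(N / document frequency),
    which is one common TFIDF variant among several (an assumption here).
    """
    bags = [Counter(tokenize(doc)) for doc in documents]  # bag-of-words counts
    n_docs = len(bags)
    doc_freq = Counter()
    for bag in bags:
        doc_freq.update(bag.keys())  # each document counts once per token
    return [
        {tok: tf * math.log(n_docs / doc_freq[tok]) for tok, tf in bag.items()}
        for bag in bags
    ]


# Toy documents invented for illustration (not taken from OHSUMED).
docs = [
    "heart disease risk",
    "heart surgery outcome",
    "lung disease treatment",
]
vectors = tfidf_vectors(docs)
```

In this sketch a token that occurs in fewer documents receives a larger weight, so in the first toy document "risk" is weighted above "heart". Vectors of this kind would then be passed to a binary classifier such as SVM, one classifier per category.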


Keywords: Text classification, SVM, MEDLINE, OHSUMED, Medical Subject Headings





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Fernando Ruiz-Rico (1)
  • José-Luis Vicedo (1)
  • María-Consuelo Rubio-Sánchez (1)
  1. University of Alicante, Spain
