Abstract
Many algorithms have been proposed in recent years to tackle automated text categorization. They have been studied exhaustively, leading to several variants and combinations, not only in the procedures themselves but also in the treatment of the input data. A widely used approach represents documents as a Bag-Of-Words (BOW) and weights tokens with the TFIDF scheme. Many researchers have pursued improvements in precision and recall, and reductions in classification time, by enriching BOW with stemming, n-grams, feature selection, noun phrases, metadata, weight normalization, etc. We contribute to this field with a novel combination of these techniques. For evaluation purposes, we provide comparisons with previous work, using SVM against the simple BOW. We use the well-known OHSUMED corpus and select different sets of categories, as previously done in the literature. We conclude that the proposed method can be successfully applied to existing binary classifiers such as SVM, outperforming the combination of BOW and TFIDF.
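For readers unfamiliar with the baseline mentioned in the abstract, the following is a minimal sketch of a BOW representation with TFIDF weighting. It is not the method proposed in the paper: the whitespace tokenizer, the tf * log(N/df) formula, the cosine normalization, the function name `tfidf_bow` and the toy documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_bow(docs):
    """Return one {token: weight} dict per document.

    Assumed weighting: raw term frequency times log(N / df),
    followed by cosine (L2) normalization of each document vector.
    """
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each token appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))

    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = {tok: count * math.log(n_docs / df[tok]) for tok, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        if norm > 0:
            vec = {tok: w / norm for tok, w in vec.items()}
        vectors.append(vec)
    return vectors


# Hypothetical toy documents, not taken from OHSUMED.
docs = [
    "aspirin reduces fever",
    "aspirin reduces pain",
    "insulin regulates glucose",
]
vectors = tfidf_bow(docs)
```

In this sketch a token that occurs in fewer documents (such as "fever") receives a higher weight than one shared across documents (such as "aspirin"). The resulting sparse vectors are the kind of input a binary classifier such as SVM would be trained on.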
Keywords
- Text classification
- SVM
- MEDLINE
- OHSUMED
- Medical Subject Headings
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ruiz-Rico, F., Vicedo, JL., Rubio-Sánchez, MC. (2008). MEDLINE Abstracts Classification Based on Noun Phrases Extraction. In: Fred, A., Filipe, J., Gamboa, H. (eds) Biomedical Engineering Systems and Technologies. BIOSTEC 2008. Communications in Computer and Information Science, vol 25. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-92219-3_38
DOI: https://doi.org/10.1007/978-3-540-92219-3_38
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-92218-6
Online ISBN: 978-3-540-92219-3
eBook Packages: Computer Science, Computer Science (R0)