Complex Linguistic Features for Text Classification: A Comprehensive Study

  • Conference paper
Advances in Information Retrieval (ECIR 2004)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2997)

Abstract

Previous research on advanced representations for document retrieval has shown that state-of-the-art statistical models are not improved by a variety of linguistic representations. Phrases, word senses and syntactic relations derived by Natural Language Processing (NLP) techniques were found ineffective at increasing retrieval accuracy. For Text Categorization (TC), fewer and less definitive studies on the use of advanced document representations are available, as it is a relatively new research area (compared to document retrieval).

This paper investigates advanced document representations for TC. Extensive experiments with two representative classifiers, Rocchio and SVM, together with a careful analysis of the literature, were carried out to study how some NLP techniques used for indexing affect TC. Cross-validation over four different corpora in two languages provided overwhelming evidence that complex nominals, proper nouns and word senses do not suffice to improve TC accuracy.




Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Moschitti, A., Basili, R. (2004). Complex Linguistic Features for Text Classification: A Comprehensive Study. In: McDonald, S., Tait, J. (eds) Advances in Information Retrieval. ECIR 2004. Lecture Notes in Computer Science, vol 2997. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24752-4_14

  • DOI: https://doi.org/10.1007/978-3-540-24752-4_14

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-21382-6

  • Online ISBN: 978-3-540-24752-4

  • eBook Packages: Springer Book Archive
