Complex Linguistic Features for Text Classification: A Comprehensive Study

Moschitti, Alessandro; Basili, Roberto

doi:10.1007/978-3-540-24752-4_14

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2997)

Included in the following conference series:

European Conference on Information Retrieval

Abstract

Previous research on advanced representations for document retrieval has shown that statistical state-of-the-art models are not improved by a variety of linguistic representations. Phrases, word senses and syntactic relations derived by Natural Language Processing (NLP) techniques were found to be ineffective at increasing retrieval accuracy. For Text Categorization (TC), fewer and less definitive studies on the use of advanced document representations are available, as it is a relatively new research area compared to document retrieval.

In this paper, advanced document representations are investigated. Extensive experimentation on representative classifiers, Rocchio and SVM, together with a careful analysis of the literature, has been carried out to study how some NLP techniques used for indexing affect TC. Cross-validation over 4 different corpora in two languages allowed us to gather overwhelming evidence that complex nominals, proper nouns and word senses are not adequate to improve TC accuracy.
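As a rough illustration of the kind of comparison the abstract describes, the following minimal sketch trains a Rocchio-style profile classifier on either plain bag-of-words features or bag-of-words plus adjacent word pairs. It is not the authors' implementation: the function names, the toy documents, the `beta` and `gamma` values, and the use of adjacent word pairs as a crude stand-in for NLP-derived complex nominals are all assumptions made for the example.

```python
from collections import Counter, defaultdict
import math


def features(text, use_pairs=False):
    """Bag-of-words counts; optionally add adjacent word pairs as a crude
    stand-in for phrase-like features (not the paper's NLP pipeline)."""
    tokens = text.lower().split()
    feats = Counter(tokens)
    if use_pairs:
        feats.update(f"{a}_{b}" for a, b in zip(tokens, tokens[1:]))
    return feats


def normalize(vec):
    """Scale a sparse vector to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def train_rocchio(docs, labels, beta=16.0, gamma=4.0, use_pairs=False):
    """Build one profile per category: beta * mean(positive vectors)
    minus gamma * mean(negative vectors), keeping only positive weights.
    beta and gamma are illustrative values, not tuned parameters."""
    vecs = [normalize(features(d, use_pairs)) for d in docs]
    profiles = {}
    for cat in set(labels):
        pos = [v for v, y in zip(vecs, labels) if y == cat]
        neg = [v for v, y in zip(vecs, labels) if y != cat]
        profile = defaultdict(float)
        for v in pos:
            for t, w in v.items():
                profile[t] += beta * w / len(pos)
        for v in neg:
            for t, w in v.items():
                profile[t] -= gamma * w / len(neg)
        profiles[cat] = normalize({t: w for t, w in profile.items() if w > 0})
    return profiles


def classify(profiles, text, use_pairs=False):
    """Return the category whose profile has the highest cosine similarity."""
    vec = normalize(features(text, use_pairs))

    def score(cat):
        return sum(w * profiles[cat].get(t, 0.0) for t, w in vec.items())

    # sorted() makes tie-breaking deterministic
    return max(sorted(profiles), key=score)


# Toy data (hypothetical, for illustration only)
docs = [
    "stock market shares rise",
    "bank raises interest rates",
    "team wins football match",
    "player scores winning goal",
]
labels = ["finance", "finance", "sport", "sport"]

words_only = train_rocchio(docs, labels)
with_pairs = train_rocchio(docs, labels, use_pairs=True)
print(classify(words_only, "interest rates and shares"))                  # finance
print(classify(with_pairs, "interest rates and shares", use_pairs=True))  # finance
```

On data this small both representations give the same answer; a meaningful comparison would require cross-validation on real corpora, as the paper reports.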

Author information

Authors and Affiliations

Human Language Technology Research Institute, University of Texas at Dallas, Richardson, TX, 75083-0688, USA
Alessandro Moschitti
Computer Science Department, University of Rome Tor Vergata, 00133 Roma, Italy
Roberto Basili

Editor information

Editors and Affiliations

School of Computing and Technology, David Goldman Informatics Centre, University of Sunderland, St. Peter’s Campus, SR6 0DD, Sunderland, UK
Sharon McDonald
School of Computing and Technology, University of Sunderland, St. Peter’s Campus, St. Peter’s Way, SR6 0DD, Sunderland, United Kingdom
John Tait

Cite this paper

Moschitti, A., Basili, R. (2004). Complex Linguistic Features for Text Classification: A Comprehensive Study. In: McDonald, S., Tait, J. (eds) Advances in Information Retrieval. ECIR 2004. Lecture Notes in Computer Science, vol 2997. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24752-4_14

DOI: https://doi.org/10.1007/978-3-540-24752-4_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-21382-6
Online ISBN: 978-3-540-24752-4
eBook Packages: Springer Book Archive