
Text Categorization for Improved Priors of Word Meaning

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2007)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4394)

Abstract

Distributions of the senses of words are often highly skewed. This fact is exploited by word sense disambiguation (WSD) systems, which back off to the predominant (most frequent) sense of a word when contextual clues are not strong enough. The topic domain of a document has a strong influence on the sense distributions of its words. Unfortunately, it is not feasible to produce large manually sense-annotated corpora for every domain of interest. Previous experiments have shown that, for a subset of words, unsupervised estimation of the predominant sense using corpora whose domain was determined by hand outperforms estimates based on domain-independent text, and even outperforms estimates based on counting occurrences in a sense-annotated corpus.
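As a concrete illustration of this back-off strategy (a minimal sketch, not code from the paper), the following Python snippet falls back to WordNet's first-listed sense, which reflects corpus frequency, whenever a contextual disambiguator does not produce a sufficiently confident score. The `context_scores` input, the threshold, and all names are illustrative assumptions.

```python
# Minimal sketch of the predominant-sense back-off described above.
# Assumes NLTK with the WordNet data installed (nltk.download("wordnet")).
# `context_scores` and the 0.6 threshold stand in for whatever contextual
# WSD model and confidence cut-off a real system would use.
from nltk.corpus import wordnet as wn

def disambiguate(word, context_scores, threshold=0.6):
    """Return a WordNet synset for `word`, backing off to the
    predominant (first-listed, i.e. most frequent) sense when the
    contextual evidence is weak."""
    senses = wn.synsets(word)
    if not senses:
        return None
    if context_scores:
        best_sense, best_score = max(context_scores.items(), key=lambda kv: kv[1])
        if best_score >= threshold:
            return wn.synset(best_sense)
    # Weak or missing contextual clues: back off to the predominant sense.
    return senses[0]

# Example: no reliable contextual scores, so the most frequent sense is returned.
print(disambiguate("bank", context_scores={}))   # Synset('bank.n.01')
```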

In this paper we address the question of whether we can automatically produce domain-specific corpora that could be used to acquire predominant senses appropriate for specific domains. We collect the corpora by automatically classifying documents from a very large corpus of newswire text, and use them to estimate the predominant sense of words for each domain. We first compare with the results presented in [1]. Encouraged by these results, we explore the use of text categorization for WSD by evaluating on a standard data set (documents from the SENSEVAL-2 and SENSEVAL-3 English all-words tasks). We show that for these documents, using domain-specific predominant senses improves on the results we obtained with predominant senses estimated from general, non-domain-specific text. We also show that the confidence of the text classifier is a good indication of whether it is worthwhile to use the domain-specific predominant sense.
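To make the pipeline more concrete, below is a rough Python sketch of how domain-specific predominant senses might be ranked from an automatically built domain corpus, in the spirit of the method of McCarthy et al. [6] on which the paper builds: each WordNet sense of a target word is scored by the word's distributionally similar neighbours in the domain corpus, weighting each neighbour's contribution by its distributional similarity and its WordNet similarity to the sense. The neighbour list, the use of path similarity (the paper's pipeline relies on a Lin-style thesaurus [7] and the Jiang-Conrath measure [9]), and all names here are simplifying assumptions, not the authors' implementation.

```python
# Sketch of domain-specific predominant-sense ranking in the spirit of
# McCarthy et al. [6]: score each sense of a target word against the word's
# distributionally nearest neighbours, where the neighbours are assumed to
# come from a thesaurus built over the automatically classified domain corpus.
# Assumes NLTK with the WordNet data installed.
from collections import defaultdict
from nltk.corpus import wordnet as wn

def sense_scores(word, neighbours):
    """neighbours: list of (neighbour_word, distributional_similarity) pairs,
    e.g. the top-k thesaurus entries for `word` in a domain corpus.
    Returns a dict mapping each noun synset of `word` to a prevalence score."""
    scores = defaultdict(float)
    for sense in wn.synsets(word, pos=wn.NOUN):
        for neighbour, dist_sim in neighbours:
            # Best WordNet similarity between this sense and any sense
            # of the neighbour word (path similarity as a simple stand-in).
            wn_sims = [sense.path_similarity(ns) or 0.0
                       for ns in wn.synsets(neighbour, pos=wn.NOUN)]
            if wn_sims:
                scores[sense] += dist_sim * max(wn_sims)
    return dict(scores)

def predominant_sense(word, neighbours):
    scores = sense_scores(word, neighbours)
    return max(scores, key=scores.get) if scores else None

# Hypothetical neighbours for "bank" drawn from a FINANCE-domain corpus:
finance_neighbours = [("lender", 0.25), ("institution", 0.21), ("firm", 0.18)]
print(predominant_sense("bank", finance_neighbours))
```

At run time, the document classifier's confidence could then gate whether the domain-specific ranking or a domain-independent one is applied, as the abstract suggests.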

References

  1. Koeling, R., McCarthy, D., Carroll, J.: Domain-specific sense distributions and predominant sense acquisition. In: Proceedings of the Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, Vancouver, Canada, pp. 419–426 (2005)

  2. Miller, G.A., Leacock, C., Tengi, R., Bunker, R.T.: A semantic concordance. In: Proceedings of the ARPA Workshop on Human Language Technology, pp. 303–308 (1993)

  3. Yarowsky, D., Florian, R.: Evaluating sense disambiguation performance across diverse parameter spaces. Natural Language Engineering 8(4), 293–310 (2002)

  4. Snyder, B., Palmer, M.: The English all-words task. In: Proceedings of SENSEVAL-3, Barcelona, Spain, pp. 41–43 (2004)

  5. Magnini, B., Strapparava, C., Pezzulo, G., Gliozzo, A.: The role of domain information in word sense disambiguation. Natural Language Engineering 8(4), 359–373 (2002)

  6. McCarthy, D., Koeling, R., Weeds, J., Carroll, J.: Finding predominant senses in untagged text. In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, Barcelona, Spain, pp. 280–287 (2004)

  7. Lin, D.: Automatic retrieval and clustering of similar words. In: Proceedings of COLING-ACL 98, Montreal, Canada (1998)

  8. Patwardhan, S., Pedersen, T.: The CPAN WordNet::Similarity package (2003), http://search.cpan.org/~sid/WordNet-Similarity/

  9. Jiang, J., Conrath, D.: Semantic similarity based on corpus statistics and lexical taxonomy. In: International Conference on Research in Computational Linguistics, Taiwan (1997)

  10. Leech, G.: 100 million words of English: the British National Corpus. Language Research 28(1), 1–13 (1992)

  11. Briscoe, T., Carroll, J.: Robust accurate statistical annotation of general text. In: Proceedings of LREC-2002, Las Palmas de Gran Canaria, pp. 1499–1504 (2002)

Author information

Koeling, R., McCarthy, D., Carroll, J.

Editor information

Alexander Gelbukh

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Koeling, R., McCarthy, D., Carroll, J. (2007). Text Categorization for Improved Priors of Word Meaning. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2007. Lecture Notes in Computer Science, vol 4394. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-70939-8_22

  • DOI: https://doi.org/10.1007/978-3-540-70939-8_22

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-70938-1

  • Online ISBN: 978-3-540-70939-8

  • eBook Packages: Computer Science, Computer Science (R0)
