Creating a system for lexical substitutions from scratch using crowdsourcing

  • Original Paper
  • Published in: Language Resources and Evaluation

Abstract

This article describes the creation and application of the Turk Bootstrap Word Sense Inventory for 397 frequent nouns, a publicly available resource for lexical substitution. The resource was acquired using Amazon Mechanical Turk. In a bootstrapping process with massive collaborative input, substitutions for target words in context are elicited and clustered by sense; then, more contexts are collected. Contexts that cannot be assigned to any sense in the target word’s current inventory re-enter the bootstrapping loop and receive their own substitutions. This process yields a sense inventory whose granularity is determined by substitutions rather than by psychologically motivated concepts. It comes with a large number of sense-annotated target word contexts. An evaluation of data quality shows that the process is robust against noise from the crowd, produces a less fine-grained inventory than WordNet, and provides a rich body of high-precision substitution data at low cost. Using the data to train a system for lexical substitution, we show that the amount and quality of the data are sufficient for producing high-quality substitutions automatically. In this system, co-occurrence cluster features are employed as a cheap means of modeling topicality.
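The bootstrapping loop summarized in the abstract can be illustrated with a toy simulation. This is only a minimal sketch under stated assumptions, not the article's actual procedure: the crowd is replaced by a hard-coded dictionary (`CROWD`), a sense is reduced to a plain set of substitutions, clustering is approximated by merging overlapping substitution sets, and the names `add_substitutions`, `bootstrap`, `elicit`, and `assign` are hypothetical helpers introduced here for illustration.

```python
from typing import Callable, Dict, List, Optional, Set

Sense = Set[str]  # assumption: a sense is just a set of substitutions


def add_substitutions(senses: List[Sense], subs: Set[str]) -> None:
    """Cluster one context's substitutions by merging every sense they overlap."""
    if not subs:
        return
    merged = set(subs)
    kept = []
    for sense in senses:
        if sense & subs:
            merged |= sense  # shared substitution: treat as the same sense
        else:
            kept.append(sense)
    senses[:] = kept + [merged]


def bootstrap(
    batches: List[List[str]],
    elicit: Callable[[str], Set[str]],
    assign: Callable[[str, List[Sense]], Optional[int]],
) -> List[Sense]:
    """Run the loop: elicit, cluster, collect more contexts, re-queue misfits."""
    senses: List[Sense] = []
    needs_subs = list(batches[0]) if batches else []  # first batch seeds the inventory
    for batch in list(batches[1:]) + [[]]:  # trailing empty batch flushes the queue
        for ctx in needs_subs:
            add_substitutions(senses, elicit(ctx))
        # Contexts that fit no current sense re-enter the loop.
        needs_subs = [ctx for ctx in batch if assign(ctx, senses) is None]
    return senses


# Simulated crowd answers for the target word "bank" (invented toy data).
CROWD: Dict[str, Set[str]] = {
    "deposit money at the bank": {"financial institution", "lender"},
    "the bank raised interest rates": {"lender", "credit institution"},
    "we sat on the river bank": {"shore", "riverside"},
    "the bank approved the loan": {"lender"},
    "fishing from the bank": {"shore"},
    "a bank of fog rolled in": {"mass", "wall"},
}


def elicit(ctx: str) -> Set[str]:
    """Stand-in for asking workers to supply substitutions."""
    return CROWD[ctx]


def assign(ctx: str, senses: List[Sense]) -> Optional[int]:
    """Stand-in for asking workers whether an existing sense fits the context."""
    for index, sense in enumerate(senses):
        if CROWD[ctx] & sense:
            return index
    return None


contexts = list(CROWD)
senses = bootstrap([contexts[:2], contexts[2:4], contexts[4:]], elicit, assign)
for sense in senses:
    print(sorted(sense))
```

On this toy data the inventory ends with three senses (institution, shore, mass). Only the contexts that no existing sense covers trigger a new elicitation step, which is what keeps the number of elicitation tasks low in this sketch.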

Notes

  1. http://beam.to/biem/software/TinyCC2.html.

  2. \( F1 = \frac{2PR}{P + R} \).

  3. http://www.ukp.tu-darmstadt.de/data/twsi-lexical-substitutions/.

  4. http://www.ukp.tu-darmstadt.de/software/twsi-sense-substituter/.
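The F1 definition in note 2 is the harmonic mean of precision P and recall R. A minimal sketch of the computation follows; the function name `f1_score` and the convention of returning 0.0 when both inputs are zero are assumptions made here, not something the article specifies, and the input values are invented for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """F1 = 2PR / (P + R), as in note 2."""
    if precision + recall == 0:
        return 0.0  # assumed convention to avoid division by zero
    return 2 * precision * recall / (precision + recall)


# Illustrative values only, not results from the article.
print(round(f1_score(0.8, 0.6), 3))
```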

Author information

Correspondence to Chris Biemann.

Additional information

This work was done while the author worked for the Powerset division of Microsoft Bing.

About this article

Cite this article

Biemann, C. Creating a system for lexical substitutions from scratch using crowdsourcing. Lang Resources & Evaluation 47, 97–122 (2013). https://doi.org/10.1007/s10579-012-9180-5

  • DOI: https://doi.org/10.1007/s10579-012-9180-5
