
A Comparison of Co-occurrence and Similarity Measures as Simulations of Context

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2008)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4919)

Abstract

Observations of word co-occurrences and similarity computations are often used as a straightforward way to represent the global contexts of words and thereby simulate semantic word similarity for applications such as word or document clustering and collocation extraction. Despite the simplicity of the underlying model, it is necessary to select a suitable significance measure, a similarity measure and a similarity computation algorithm. However, it is often unclear how these measures are related, and dimensionality reduction is often applied to make the computation of word similarity efficient. This work presents an approximate algorithm with linear time complexity for computing word similarity without any dimensionality reduction. It then introduces a large-scale evaluation based on two languages and two knowledge sources and discusses the underlying reasons for the relative performance of each measure.
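
The pipeline the abstract refers to (co-occurrence counts, a significance measure, a similarity measure) can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes sentence-level co-occurrence, uses the log-likelihood ratio as one possible significance measure and cosine as one possible similarity measure, and compares word pairs directly rather than in linear time. All function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence_counts(sentences):
    """Count, per word and per word pair, the number of sentences containing them."""
    word_freq = Counter()
    pair_freq = defaultdict(Counter)
    for sentence in sentences:
        words = sorted(set(sentence))  # each word counts once per sentence
        word_freq.update(words)
        for a, b in combinations(words, 2):
            pair_freq[a][b] += 1
            pair_freq[b][a] += 1
    return word_freq, pair_freq


def log_likelihood(n_ab, n_a, n_b, n):
    """Log-likelihood ratio (G^2) of the 2x2 contingency table for words a and b."""
    observed = (n_ab, n_a - n_ab, n_b - n_ab, n - n_a - n_b + n_ab)
    expected = (
        n_a * n_b / n,
        n_a * (n - n_b) / n,
        (n - n_a) * n_b / n,
        (n - n_a) * (n - n_b) / n,
    )
    # Cells with an observed count of zero contribute nothing to the sum.
    return 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


def context_vectors(sentences):
    """Map each word to a sparse vector of significance-weighted co-occurrences."""
    word_freq, pair_freq = cooccurrence_counts(sentences)
    n = len(sentences)
    return {
        word: {
            other: log_likelihood(n_ab, word_freq[word], word_freq[other], n)
            for other, n_ab in neighbours.items()
        }
        for word, neighbours in pair_freq.items()
    }


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v[key] for key, weight in u.items() if key in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus (an assumption for illustration): one tokenised sentence per list.
corpus = [
    ["cat", "drinks", "milk"],
    ["dog", "drinks", "water"],
    ["cat", "eats", "fish"],
    ["dog", "eats", "meat"],
    ["car", "needs", "fuel"],
]
vectors = context_vectors(corpus)
print("cat ~ dog:", cosine(vectors["cat"], vectors["dog"]))
print("cat ~ car:", cosine(vectors["cat"], vectors["car"]))
```

In this sketch "cat" and "dog" never co-occur, yet they receive a non-zero similarity because they share the contexts "drinks" and "eats", whereas "cat" and "car" share none. Swapping `log_likelihood` or `cosine` for another measure changes only one function, which is the kind of choice the abstract says must be made.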




Editor information

Alexander Gelbukh


Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Bordag, S. (2008). A Comparison of Co-occurrence and Similarity Measures as Simulations of Context. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2008. Lecture Notes in Computer Science, vol 4919. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78135-6_5

  • DOI: https://doi.org/10.1007/978-3-540-78135-6_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-78134-9

  • Online ISBN: 978-3-540-78135-6

  • eBook Packages: Computer Science, Computer Science (R0)
