Getting One’s First Million ...Collocations

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2004)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2945)

Abstract

Many years of experience in creating a very large database of Russian collocations are summarized. The collocations described here are syntactically connected and semantically compatible pairs of content components (single words or multiwords). We begin with a synopsis of various applications of collocation databases (CDBs). Then we describe the main features of collocation components, the syntactic types of collocations, and links of other kinds between their components that broaden the applicability of the enclosing systems. All of the above characterizes the CrossLexica system, created for Russian but with a universal structure suited to other languages. Statistics for CrossLexica are given and discussed. It now contains more than a million collocations and more than a million WordNet-like links.

Work done with partial support from Micra Inc., USA (1993–1994), the Soros Foundation, Russia (1995), and the Mexican Government (CONACyT, SNI, CGPI-IPN) (since 1999). Thanks to Dr. Patrick Cassidy for his valuable suggestions.


Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Bolshakov, I.A. (2004). Getting One’s First Million ...Collocations. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2004. Lecture Notes in Computer Science, vol 2945. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24630-5_28

  • DOI: https://doi.org/10.1007/978-3-540-24630-5_28

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-21006-1

  • Online ISBN: 978-3-540-24630-5

  • eBook Packages: Springer Book Archive
