
Building a Collocation Net

Conference paper

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 4285))

Abstract

This paper presents an approach to building a novel two-level collocation net from a large raw corpus, which enables calculation of the collocation relationship between any two words. The first level consists of atomic classes (each consisting of one word-and-feature bigram), which are clustered into the second-level class set. Each class at both levels is represented by its distribution of collocation candidates, extracted from the linguistic analysis of the raw training corpus, over the possible collocation relation types. In this way, all the information extracted from the linguistic analysis is preserved in the collocation net. Our approach applies to both frequently and less frequently occurring words by providing a clustering mechanism that resolves the data sparseness problem through the collocation net. Experiments show that the collocation net is efficient and effective in solving the data sparseness problem and in determining the collocation relationship between any two words.
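
To make the data structure concrete, the sketch below is a minimal, hypothetical Python rendering of the two-level design described in the abstract: atomic (word, feature) classes hold distributions of collocation candidates over relation types, and are grouped into second-level classes whose merged distributions serve as a back-off for sparse words. The class names, the cosine-based greedy clustering, the similarity threshold, and the frequency cutoff are all illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of a two-level collocation net. The clustering method,
# thresholds, and relation-type labels are assumptions for illustration only.

from collections import Counter, defaultdict


class AtomicClass:
    """First level: one (word, feature) bigram with its distribution of
    collocation candidates over relation types."""

    def __init__(self, word, feature):
        self.word = word
        self.feature = feature
        # counts[(relation_type, collocate)] -> frequency observed in the
        # linguistically analysed training corpus
        self.counts = Counter()

    def distribution(self):
        total = sum(self.counts.values()) or 1
        return {key: c / total for key, c in self.counts.items()}


class CollocationNet:
    def __init__(self):
        self.atomic = {}                              # (word, feature) -> AtomicClass
        self.cluster_of = {}                          # (word, feature) -> cluster id
        self.cluster_counts = defaultdict(Counter)    # cluster id -> merged counts

    def add_observation(self, word, feature, relation, collocate):
        key = (word, feature)
        ac = self.atomic.setdefault(key, AtomicClass(word, feature))
        ac.counts[(relation, collocate)] += 1

    def cluster(self, threshold=0.2):
        """Second level: greedily group atomic classes whose candidate
        distributions are similar (naive cosine similarity to cluster centroids)."""
        for key, ac in self.atomic.items():
            best_id, best_sim = None, threshold
            for cid, centroid in self.cluster_counts.items():
                sim = _cosine(ac.counts, centroid)
                if sim > best_sim:
                    best_id, best_sim = cid, sim
            if best_id is None:
                best_id = len(self.cluster_counts)    # open a new cluster
            self.cluster_of[key] = best_id
            self.cluster_counts[best_id].update(ac.counts)

    def collocation_strength(self, word, feature, relation, collocate):
        """Use the atomic class when it is well observed; otherwise back off to
        the merged distribution of its second-level class (assumed cutoff: 5)."""
        key = (word, feature)
        ac = self.atomic.get(key)
        if ac and sum(ac.counts.values()) >= 5:
            return ac.distribution().get((relation, collocate), 0.0)
        cid = self.cluster_of.get(key)
        if cid is None:
            return 0.0
        merged = self.cluster_counts[cid]
        total = sum(merged.values()) or 1
        return merged[(relation, collocate)] / total


def _cosine(a, b):
    keys = set(a) | set(b)
    dot = sum(a[k] * b[k] for k in keys)
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0
```

In this sketch a rarely observed (word, feature) pair falls back to the merged distribution of its second-level class, which is one plausible reading of how the clustering mechanism is meant to relieve data sparseness.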

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Zhou, G., Zhang, M., Fu, G. (2006). Building a Collocation Net. In: Matsumoto, Y., Sproat, R.W., Wong, KF., Zhang, M. (eds) Computer Processing of Oriental Languages. Beyond the Orient: The Research Challenges Ahead. ICCPOL 2006. Lecture Notes in Computer Science(), vol 4285. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11940098_56

  • DOI: https://doi.org/10.1007/11940098_56

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-49667-0

  • Online ISBN: 978-3-540-49668-7

  • eBook Packages: Computer Science (R0)
