This paper presents an approach to building a novel two-level collocation net from a large raw corpus, which enables the collocation relationship between any two words to be computed. The first level consists of atomic classes (each comprising one word-and-feature bigram), which are clustered into the second-level class set. Each class at both levels is represented by its distribution of collocation candidates, extracted through linguistic analysis of the raw training corpus, over the possible collocation relation types. In this way, all the information extracted from the linguistic analysis is retained in the collocation net. The approach applies to both frequently and infrequently occurring words, since the clustering mechanism resolves the data sparseness problem through the collocation net. Experiments show that the collocation net is efficient and effective in alleviating data sparseness and in determining the collocation relationship between any two words.
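The core idea of the abstract can be sketched as follows: represent each atomic class by its distribution over collocation relation types, group similar atomic classes into second-level clusters, and let a sparse word borrow its cluster's distribution. The relation inventory, the example words, the probability values, and the greedy divergence-threshold clustering below are all illustrative assumptions, not the paper's actual algorithm or data.

```python
import math

# Hypothetical data: each atomic class (a word-and-feature bigram) is
# represented by its distribution over collocation relation types.
# Names and numbers are invented for illustration only.
RELATIONS = ("verb-obj", "adj-noun", "noun-noun")

atomic = {
    ("drink", "V"): [0.8, 0.1, 0.1],
    ("sip",   "V"): [0.7, 0.2, 0.1],   # a rarer word with a similar profile
    ("red",   "A"): [0.1, 0.8, 0.1],
}

def js_divergence(p, q):
    """Jensen-Shannon divergence between two distributions (base 2, in [0, 1])."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)
    return (kl(p, m) + kl(q, m)) / 2

def cluster(classes, threshold=0.1):
    """Greedy clustering: merge an atomic class into the first cluster whose
    centroid lies within `threshold` JS divergence; otherwise open a new one."""
    clusters = []  # list of (member_keys, centroid) pairs
    for key, dist in classes.items():
        for members, centroid in clusters:
            if js_divergence(dist, centroid) < threshold:
                members.append(key)
                n = len(members)
                # incremental centroid update over the cluster's members
                centroid[:] = [(c * (n - 1) + d) / n
                               for c, d in zip(centroid, dist)]
                break
        else:
            clusters.append(([key], list(dist)))
    return clusters

clusters = cluster(atomic)
# ("drink", "V") and ("sip", "V") end up in the same cluster, so a sparse
# word like "sip" can fall back on the cluster-level distribution when its
# own counts are too unreliable -- the back-off intuition behind the net.
```

This is only a minimal two-level sketch; the paper's actual construction derives the distributions from linguistic analysis of the corpus rather than from hand-set values.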


Keywords: Parse Tree · Linguistic Analysis · Atomic Class · Data Sparseness Problem · Statistical Natural Language Processing





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • GuoDong Zhou (1, 2)
  • Min Zhang (2)
  • GuoHong Fu (3)
  1. School of Computer Science and Technology, Suzhou University, China
  2. Institute for Infocomm Research, Singapore
  3. Department of Linguistics, The University of Hong Kong, Hong Kong
