Machine Translation, Volume 10, Issue 3, pp 219–258

Automatic acquisition of semantic collocation from corpora

  • Satoshi Sekine
  • Jun-ichi Tsujii


The real difficulty in developing practical NLP systems is that we do not know in advance what actual instances of knowledge should be used in the application system, even though we know in advance what types of knowledge are required. An effective method for extracting linguistic knowledge from corpora is therefore needed. We propose automatic linguistic knowledge acquisition from sublanguage corpora. The system combines existing linguistic knowledge and human intervention with corpus-based techniques. The algorithm involves a “Gradual Approximation”, which works to converge linguistic knowledge gradually towards desirable results. We conducted three experiments. The first revealed the characteristics of the algorithm, and the other two demonstrated its effectiveness on a real corpus. The results show that the algorithm is promising, though some problems remain: the practical problem of setting the parameters, the problem of extending the formalism to include more linguistic features, and the combination with other linguistic clues for further development. We would like to continue the research by performing further experiments and improving the algorithm.

Key words

Knowledge acquisition, linguistic knowledge, corpus linguistics




Copyright information

© Kluwer Academic Publishers 1995

Authors and Affiliations

  • Satoshi Sekine (1)
  • Jun-ichi Tsujii (2)
  1. Computer Science Department, New York University, New York, USA
  2. Institute of Science and Industry, University of Manchester, Manchester, UK
