Abstract
In this paper, we propose a statistical method to automatically extract collocations from a Korean POS-tagged corpus. Since a large portion of language is represented by collocation patterns, collocational knowledge provides a valuable resource for NLP applications. One difficulty of collocation extraction is that Korean has a partially free word order, which also appears in collocations. In this work, we exploit four statistics, ‘frequency’, ‘randomness’, ‘convergence’, and ‘correlation’, in order to take into account the flexible word order of Korean collocations. We separate meaningful bigrams using an evaluation function based on the four statistics and extend the bigrams to n-gram collocations using a fuzzy relation. Experiments show that this method works well for Korean collocations.
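The abstract does not define the four statistics, so the following is only a minimal sketch of the kind of bookkeeping a flexible-word-order approach needs, not the method of the paper. It counts how often a word pair co-occurs at each signed offset within a window, then summarizes each pair by its total count and by how concentrated its offsets are. The function names, the window size, the entropy-based concentration score, and the toy tokens and tags are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
import math


def offset_counts(sentences, window=3):
    """Count co-occurrences of each ordered word pair at each signed offset.

    `sentences` is a list of sentences, each a list of (word, pos_tag) tuples.
    Returns a dict mapping (word1, word2) to a Counter of offsets (j - i).
    """
    counts = defaultdict(Counter)
    for sent in sentences:
        for i, (w1, _tag) in enumerate(sent):
            lo = max(0, i - window)
            hi = min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(w1, sent[j][0])][j - i] += 1
    return counts


def pair_summary(offsets, window=3):
    """Return (total count, concentration) for one pair's offset Counter.

    Concentration is 1 minus the normalized entropy of the offset
    distribution: 1.0 when the pair always appears at the same offset,
    close to 0 when it is spread evenly over all 2 * window offsets.
    This score is an illustrative stand-in, not a statistic from the paper.
    """
    total = sum(offsets.values())
    probs = [c / total for c in offsets.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    max_entropy = math.log(2 * window)
    return total, 1.0 - entropy / max_entropy


# Toy corpus with hypothetical tokens and tags.
corpus = [
    [("a", "NNG"), ("b", "VV")],
    [("b", "VV"), ("x", "NNG"), ("a", "NNG")],
]
counts = offset_counts(corpus)
total, concentration = pair_summary(counts[("a", "b")])
```

Keeping the signed offset rather than a single adjacent-bigram count is what lets a pair that appears in both orders, or with intervening words, still accumulate evidence as one candidate.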
Cite this article
Kim, S., Yoon, J. & Song, M. Automatic Extraction of Collocations From Korean Text. Computers and the Humanities 35, 273–297 (2001). https://doi.org/10.1023/A:1017507019909