Abstract
In Korean information retrieval, compound nouns play an important role in improving precision in search experiments. There are two major approaches to compound noun indexing in Korean: statistical and linguistic. Each approach, however, has its own shortcomings, such as limited coverage of diverse compound noun types, over-generation of compound nouns, and data sparseness in training. In this paper, we propose a corpus-based learning method that can index diverse types of compound nouns using rules automatically extracted from a large corpus. The automatic learning method is more portable and requires less human effort, while exhibiting a performance level similar to that of the manual linguistic approach. We also present a new filtering method to address the problems of compound noun over-generation and data sparseness.
Cite this article
Kim, JH., Kwak, BK., Lee, S. et al. A Corpus-Based Learning Method of Compound Noun Indexing Rules for Korean. Information Retrieval 4, 115–132 (2001). https://doi.org/10.1023/A:1011466928139
DOI: https://doi.org/10.1023/A:1011466928139