Abstract
Recognizing new words can benefit many NLP tasks, such as sentence chunking and sentiment analysis. However, automatic new word extraction is challenging because new words usually follow no fixed linguistic pattern and may even arise as new meanings of existing words. To tackle these problems, this paper proposes a novel new word extraction method that considers domain specificity and combines it with multiple kinds of statistical language knowledge. First, a filtering algorithm produces a candidate list of new words. Then, statistical language knowledge is used to extract the top-ranked new words. Experimental results show that the proposed method extracts a large number of new words from both Chinese and English corpora and notably outperforms state-of-the-art methods. Moreover, the method increases the accuracy of Chinese word segmentation by 10% on corpora containing new words.
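As an illustration of the filtering step, the sketch below keeps only character n-grams that are markedly more frequent in a domain corpus than in a background corpus. The abstract does not give the paper's actual filtering formula, so the smoothed frequency ratio, the function names `ngrams` and `domain_specificity_filter`, and the thresholds `min_count` and `min_ratio` are all assumptions made for this example.

```python
from collections import Counter


def ngrams(text, max_len=4):
    """Yield every character n-gram of length 2..max_len."""
    for n in range(2, max_len + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


def domain_specificity_filter(domain_text, background_text,
                              min_count=2, min_ratio=2.0, max_len=4):
    """Return {ngram: ratio} for n-grams over-represented in the domain text.

    Hypothetical scoring: relative frequency in the domain corpus divided by
    an add-one smoothed relative frequency in the background corpus.
    """
    dom = Counter(ngrams(domain_text, max_len))
    bg = Counter(ngrams(background_text, max_len))
    dom_total = sum(dom.values()) or 1
    bg_total = sum(bg.values()) or 1

    candidates = {}
    for gram, count in dom.items():
        if count < min_count:
            continue  # too rare to judge
        dom_rate = count / dom_total
        bg_rate = (bg[gram] + 1) / (bg_total + 1)  # smoothing avoids division by zero
        ratio = dom_rate / bg_rate
        if ratio >= min_ratio:
            candidates[gram] = ratio
    return candidates


if __name__ == "__main__":
    kept = domain_specificity_filter("abxyabxyabxy", "abcdabcdabcd", max_len=2)
    print(sorted(kept))
```

Strings that are equally common in both corpora get a ratio near or below 1 and are dropped, which is one plausible way to discard general-purpose strings before ranking.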
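Highlights
1. This paper proposes a new word extraction method based on domain specificity and statistical language knowledge. First, a domain-specificity-based filtering method removes garbage strings to produce a candidate list of new words; then new words are extracted using statistical language knowledge (word frequency, cohesion, and degree of freedom). Experiments verify the method's effectiveness, language independence, and domain independence.
2. The method effectively improves the performance of Chinese word segmentation systems.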
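The three statistical signals used for extraction (word frequency, cohesion, and degree of freedom) can be sketched as follows. This is a minimal illustration, not the paper's formulation: it assumes cohesion is measured as the minimum pointwise mutual information over all binary splits of a candidate, and degree of freedom as the smaller of the left and right neighbor entropies. The function name `score_candidates` and the parameter `max_len` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def score_candidates(text, max_len=3):
    """Return {ngram: (frequency, cohesion, freedom)} for n-grams of length >= 2."""
    counts = Counter()
    left = defaultdict(Counter)   # characters seen immediately before each n-gram
    right = defaultdict(Counter)  # characters seen immediately after each n-gram
    n_chars = len(text)

    for n in range(1, max_len + 1):
        for i in range(n_chars - n + 1):
            gram = text[i:i + n]
            counts[gram] += 1
            if i > 0:
                left[gram][text[i - 1]] += 1
            if i + n < n_chars:
                right[gram][text[i + n]] += 1

    def prob(gram):
        # Rough estimate: the same denominator is used for every n-gram length.
        return counts[gram] / n_chars

    def entropy(neighbors):
        total = sum(neighbors.values())
        if total == 0:
            return 0.0
        return -sum(c / total * math.log(c / total) for c in neighbors.values())

    scores = {}
    for gram, freq in counts.items():
        if len(gram) < 2:
            continue
        # Cohesion: weakest PMI across all ways of splitting the candidate in two.
        cohesion = min(
            math.log(prob(gram) / (prob(gram[:k]) * prob(gram[k:])))
            for k in range(1, len(gram))
        )
        # Freedom: a real word should appear in varied contexts on both sides.
        freedom = min(entropy(left[gram]), entropy(right[gram]))
        scores[gram] = (freq, cohesion, freedom)
    return scores


if __name__ == "__main__":
    for gram, triple in sorted(score_candidates("abXabYabZ").items()):
        print(gram, triple)
```

Candidates can then be ranked by any combination of the three values; how the paper combines them is not stated in this section.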
Cite this article
Mei, L., Huang, H., Wei, X. et al. A novel unsupervised method for new word extraction. Sci. China Inf. Sci. 59, 92102 (2016). https://doi.org/10.1007/s11432-015-0906-9
DOI: https://doi.org/10.1007/s11432-015-0906-9
Keywords
- new word extraction
- word segmentation
- domain specificity
- statistical language knowledge
- domain word extraction