
A novel unsupervised method for new word extraction

  • Lili Mei
  • Heyan Huang (corresponding author)
  • Xiaochi Wei
  • Xianling Mao
Research Paper

Abstract

Identifying new words can benefit many NLP tasks, such as sentence chunking and sentiment analysis. However, automatic new word extraction is challenging because new words usually follow no fixed language pattern and may even be existing words used with new meanings. To tackle these problems, this paper proposes a novel method for extracting new words that considers domain specificity and combines it with multiple kinds of statistical language knowledge. First, a filtering algorithm produces a candidate list of new words. Then, statistical language knowledge is used to extract the top-ranked new words. Experimental results show that the proposed method extracts a large number of new words from both Chinese and English corpora and notably outperforms state-of-the-art methods. Moreover, the method increases the accuracy of Chinese word segmentation by 10% on a corpus containing new words.

Keywords

new word extraction, word segmentation, domain specificity, statistical language knowledge, domain word extraction
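The two-stage pipeline described in the abstract (filter candidates by domain specificity, then rank them with statistical language knowledge) can be illustrated with a small sketch. The paper's summary names three statistics: word frequency, cohesion, and degree of freedom. Their exact formulas are not given on this page, so everything below is an assumption: cohesion is taken as the minimum pointwise mutual information over binary splits, freedom as the smaller of the left and right neighbour entropies, and domain specificity as a smoothed relative-frequency ratio against a background corpus. The function names, thresholds, and ranking order are hypothetical and are not the authors' implementation.

```python
import math
from collections import Counter, defaultdict


def ngram_counts(tokens, max_n):
    """Count every n-gram of length 1..max_n in a token sequence."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def entropy(counter):
    """Shannon entropy (in nats) of a Counter of neighbouring tokens."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in counter.values())


def cohesion(gram, counts, total):
    """Assumed cohesion: minimum PMI over all binary splits of the n-gram."""
    p_whole = counts[gram] / total
    return min(
        math.log(p_whole / ((counts[gram[:k]] / total) * (counts[gram[k:]] / total)))
        for k in range(1, len(gram))
    )


def score_candidates(domain_tokens, background_tokens, max_n=3,
                     min_freq=2, min_ratio=2.0):
    """Return (gram, frequency, cohesion, freedom) tuples, best first.

    Tokens may be characters (e.g. Chinese) or words (e.g. English).
    All thresholds are illustrative defaults, not values from the paper.
    """
    counts = ngram_counts(domain_tokens, max_n)
    bg_counts = ngram_counts(background_tokens, max_n)
    total = len(domain_tokens)
    bg_total = max(len(background_tokens), 1)

    # Collect the tokens seen immediately left and right of each n-gram.
    left, right = defaultdict(Counter), defaultdict(Counter)
    for n in range(2, max_n + 1):
        for i in range(len(domain_tokens) - n + 1):
            gram = tuple(domain_tokens[i:i + n])
            if i > 0:
                left[gram][domain_tokens[i - 1]] += 1
            if i + n < len(domain_tokens):
                right[gram][domain_tokens[i + n]] += 1

    scored = []
    for gram, freq in counts.items():
        if len(gram) < 2 or freq < min_freq:
            continue
        # Stage 1 (assumed form of the domain-specificity filter): keep strings
        # that are relatively more frequent in the domain corpus than in the
        # background corpus, with add-one smoothing on the background count.
        ratio = (freq / total) / ((bg_counts[gram] + 1) / bg_total)
        if ratio < min_ratio:
            continue
        # Stage 2: statistical language knowledge for ranking.
        coh = cohesion(gram, counts, total)
        free = min(entropy(left[gram]), entropy(right[gram]))
        scored.append((gram, freq, coh, free))

    # Hypothetical ranking: freedom first, then cohesion, then frequency.
    scored.sort(key=lambda r: (r[3], r[2], r[1]), reverse=True)
    return scored
```

Because the functions operate on generic token sequences, the same sketch applies to a character-tokenized Chinese corpus or a word-tokenized English one, which mirrors the language independence claimed for the method. How the three statistics are actually combined into a final ranking is a design choice left open here.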

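Highlights

  1. This paper proposes a new word extraction method based on domain specificity and statistical language knowledge. First, a domain-specificity-based filter removes garbage strings to produce a candidate list of new words; then new words are extracted using statistical language knowledge (word frequency, cohesion, and degree of freedom). Experiments verify the method's effectiveness, language independence, and domain independence.

  2. The method effectively improves the segmentation results of Chinese word segmentation systems.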



Copyright information

© Science China Press and Springer-Verlag Berlin Heidelberg 2016

Authors and Affiliations

  • Lili Mei (1)
  • Heyan Huang (1, corresponding author)
  • Xiaochi Wei (1)
  • Xianling Mao (1)

  1. Beijing Engineering Research Center of High Volume Language Information Processing and Cloud Computing Applications, Department of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
