Advertisement

Springer Nature is making SARS-CoV-2 and COVID-19 research free. View research | View latest news | Sign up for updates

A novel unsupervised method for new word extraction

一种新颖的非监督新词抽取方法

Abstract

New words could benefit many NLP tasks such as sentence chunking and sentiment analysis. However, automatic new word extraction is a challenging task because new words usually have no fixed language pattern, and even appear with the new meanings of existing words. To tackle these problems, this paper proposes a novel method to extract new words. It not only considers domain specificity, but also combines with multiple statistical language knowledge. First, we perform a filtering algorithm to obtain a candidate list of new words. Then, we employ the statistical language knowledge to extract the top ranked new words. Experimental results show that our proposed method is able to extract a large number of new words both in Chinese and English corpus, and notably outperforms the state-of-the-art methods. Moreover, we also demonstrate our method increases the accuracy of Chinese word segmentation by 10% on corpus containing new words.

创新点

  1. 1.

    本文提出了一个基于领域特殊性和统计语言知识的新词抽取方法。首先, 采用基于领域特殊性的垃圾串过滤方法过滤垃圾串, 得到候选新词列表; 然后基于统计语言知识(词频、凝聚度和自由度)对新词进行抽取。实验验证了该方法的有效性、语言独立性和领域无关性。

  2. 2.

    该方法能够有效提升中文分词系统的分词效果。

This is a preview of subscription content, log in to check access.

References

  1. 1

    Sproat R, Emerson T. The first international Chinese word segmentation bakeoff. In: Proceedings of the 2nd SIGHAN Workshop on Chinese Language Processing. Stroudsburg: Association for Computational Linguistics, 2003. 17: 133–143

  2. 2

    Sun X, Wang H, Li W. Fast online training with frequency-adaptive learning rates for chinese word segmentation and new word detection. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: Association for Computational Linguistics, 2012. 1: 253–262

  3. 3

    Nie L, Yan S, Wang M, et al. Harvesting visual concepts for image search with complex queries. In: Proceedings of the 20th ACM International Conference on Multimedia. New York: ACM, 2012. 59–68

  4. 4

    Huang M, Ye B, Wang Y, et al. New word detection for sentiment analysis. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Baltimore, 2014. 531–541

  5. 5

    Isozaki H. Japanese named entity recognition based on a simple rule generator and decision tree learning. In: Proceedings of the 39th Annual Meeting on Association for Computational Linguistics. Stroudsburg: Association for Computational Linguistics, 2001. 314–321

  6. 6

    Chen K J, Ma W Y. Unknown word extraction for Chinese documents. In: Proceedings of the 19th International Conference on Computational Linguistics. Stroudsburg: Association for Computational Linguistics, 2002. 1: 1–7

  7. 7

    Meng Y, Yu H, Nishino F. Chinese new word identification based on character parsing model. In: Proceedings of the 1st International Joint Conference on Natural Language Processing, Hainan, 2004. 489–496

  8. 8

    Peng F, Feng F, McCallum A. Chinese segmentation and new word detection using conditional random fields. In: Proceedings of the 20th International Conference on Computational Linguistics. Stroudsburg: Association for Computational Linguistics, 2004. 562

  9. 9

    Jiang X, Wang L, Cao Y, et al. Automatic recognition of Chinese unknown word for single-character and affix models. In: Knowledge Engineering and Management. Berlin: Springer, 2011. 435–444

  10. 10

    He S, Zhu J. Bootstrap method for Chinese new words extraction. In: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, Salt Lake City, 2001. 1: 581–584

  11. 11

    Church K W, Hanks P. Word association norms, mutual information, and lexicography. Comput Linguist, 1990, 16: 22–29

  12. 12

    Zhang W, Yoshida T, Tang X, et al. Improving effectiveness of mutual information for substantival multiword expression extraction. Expert Syst Appl, 2009, 36: 10919–10930

  13. 13

    Bu F, Zhu X, Li M. Measuring the non-compositionality of multiword expressions. In: Proceedings of the 23rd International Conference on Computational Linguistics. Stroudsburg: Association for Computational Linguistics, 2010. 116–124

  14. 14

    Luo S, Sun M. Two-character Chinese word extraction based on hybrid of internal and contextual measures. In: Proceedings of the 2nd SIGHAN Workshop on Chinese Language Processing. Stroudsburg: Association for Computational Linguistics, 2003. 17: 24–30

  15. 15

    Boser B E, Guyon I M, Vapnik V N. A training algorithm for optimal margin classifiers. In: Proceedings of the 5th Annual Workshop on Computational Learning Theory. New York: ACM, 1992. 144–152

  16. 16

    Qu A P, Chen J M, Wang L W, et al. Segmentation of Hematoxylin-Eosin stained breast cancer histopathological images based on pixel-wise SVM classifier. Sci China Inf Sci, 2015, 58: 092105

  17. 17

    Zou B, Peng Z M, Xu Z B. The learning performance of support vector machine classification based on Markov sampling. Sci China Inf Sci, 2013, 56: 032110

  18. 18

    Chang C C, Lin C J. LIBSVM: a library for support vector machines. ACM Trans Intell Syst Tech (TIST), 2011, 2: 27

  19. 19

    Lafferty J, McCallum A, Pereira F C N. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the 18th International Conference on Machine Learning. San Francisco: Morgan Kaufmann Publishers Inc, 2001. 282–289

  20. 20

    Yi J, Peng Y X, Xiao J G. A temporal context model for boosting video annotation. Sci China Inf Sci, 2013, 56: 110904

  21. 21

    Rabiner L R. A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE, 1989, 77: 257–286

  22. 22

    Suk M, Ramadass A, Jin Y, et al. Video human motion recognition using a knowledge-based hybrid method based on a hidden Markov model. ACM Trans Intell Syst Tech, 2012, 3: 42

  23. 23

    Hong F, Tang J W, Lu P P. Multichannel DEM reconstruction method based on Markov random fields for bistatic SAR. Sci China Inf Sci, 2015, 58: 062302

  24. 24

    Xu Y S, Wang X, Tang B Z, et al. Chinese unknown word recognition using improved conditional random fields. In: Proceedings of the 8th International Conference on Intelligent Systems Design and Applications, Kaohsiung, 2008. 2: 363–367

  25. 25

    Hu Q H, Guo M Z, Yu D R, et al. Information entropy for ordinal classification. Sci China Inf Sci, 2010, 53: 1188–1200

  26. 26

    Sun Y L, Tao J X, Chen H, et al. The entropy weighted non-uniform scanning algorithm for diffraction tomography. Sci China Inf Sci, 2015, 58: 067102

  27. 27

    Ding Y, Zhang Y, Wang X, et al. Perceptual image quality assessment metric using mutual information of Gabor features. Sci China Inf Sci, 2014, 57: 032111

  28. 28

    Li H, Huang C N, Gao J, et al. The use of SVM for Chinese new word identification. In: Natural Language Processing—IJCNLP 2004. Berlin: Springer, 2005. 723–732

  29. 29

    Zhou G D. A chunking strategy towards unknown word detection in Chinese word segmentation. In: Proceedings of the 1st International Joint Conference on Natural Language Processing. Berlin: Springer, 2005. 530–541

  30. 30

    Wu A D, Jiang Z X. Statistically-enhanced new word identification in a rule-based Chinese system. In: Proceedings of the 2nd Workshop on Chinese Language Processing. Stroudsburg: Association for Computational Linguistics, 2000. 12: 46–51

  31. 31

    Liberman M, Davis K, Grossman M, et al. Emotional Prosody Speech and Transcripts. LDC2002S28. Philadelphia: Linguistic Data Consortium, 2002

  32. 32

    Huang S D, Graff D, Doddington G. Multiple-Translation Chinese Corpus. LDC2002T01. Philadelphia: Linguistic Data Consortium, 2002

  33. 33

    Zhang H P, Yu H K, Xiong D Y, et al. HHMM-based Chinese lexical analyzer ICTCLAS. In: Proceedings of the 2nd SIGHAN Workshop on Chinese Language Processing. Stroudsburg: Association for Computational Linguistics, 2003. 17: 184–187

Download references

Author information

Correspondence to Heyan Huang.

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Mei, L., Huang, H., Wei, X. et al. A novel unsupervised method for new word extraction. Sci. China Inf. Sci. 59, 92102 (2016). https://doi.org/10.1007/s11432-015-0906-9

Download citation

Keywords

  • new word extraction
  • word segmentation
  • domain specificity
  • statistical language knowledge
  • domain word extraction

关键词

  • 新词抽取
  • 分词
  • 领域特殊性
  • 统计语言知识
  • 领域词抽取