
A Chinese New Word Detection Approach Based on Independence Testing

  • Dongchen Jiang
  • Xiaoyu Chen
  • Xin Yang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11110)

Abstract

New word detection is important for Chinese text information processing because it directly affects the performance of word segmentation, information retrieval, and automatic translation. This paper proposes an approach to Chinese new word detection that is based on independence testing and requires no prior information. The paper analyzes the statistical characteristics of new words in Chinese texts, uses statistical hypothesis testing to infer the correlations between adjacent semantic units, and proposes an iterative algorithm that detects new words gradually. The algorithm is evaluated on both a large-scale corpus and short news texts. Experimental results show that the approach can effectively detect new words in all kinds of news.
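
The abstract does not give the test statistic, the significance level, or the merging rule, so the following Python sketch only illustrates the general idea of testing adjacent units for independence and merging dependent pairs iteratively. It assumes a Pearson chi-square test on a 2x2 contingency table with a 0.05 significance level, single characters as the initial semantic units, and a greedy left-to-right merge. The names `chi_square`, `merge_pairs`, `detect_candidates`, `min_count`, and `max_rounds` are hypothetical and are not taken from the paper.

```python
from collections import Counter

# Chi-square critical value for 1 degree of freedom at alpha = 0.05 (assumed level).
CHI2_CRITICAL = 3.841


def chi_square(n_ab, n_a, n_b, n):
    """Pearson chi-square statistic for a 2x2 table of adjacent pairs.

    n_ab: count of pair (a, b); n_a: pairs whose left unit is a;
    n_b: pairs whose right unit is b; n: total number of adjacent pairs.
    """
    o11 = n_ab
    o12 = n_a - n_ab
    o21 = n_b - n_ab
    o22 = n - n_a - n_b + n_ab
    denom = n_a * (n - n_a) * n_b * (n - n_b)
    if denom == 0:
        return 0.0  # degenerate table: no evidence either way
    return n * (o11 * o22 - o12 * o21) ** 2 / denom


def merge_pairs(units, pairs):
    """Greedily merge adjacent units whose pair is in `pairs`, left to right."""
    out, i = [], 0
    while i < len(units):
        if i + 1 < len(units) and (units[i], units[i + 1]) in pairs:
            out.append(units[i] + units[i + 1])
            i += 2
        else:
            out.append(units[i])
            i += 1
    return out


def detect_candidates(text, min_count=2, max_rounds=3):
    """Return candidate words found by repeatedly merging dependent neighbors."""
    units = list(text)  # assumption: start from single characters
    found = set()
    for _ in range(max_rounds):
        bigrams = Counter(zip(units, units[1:]))
        n = sum(bigrams.values())
        left, right = Counter(), Counter()
        for (a, b), c in bigrams.items():
            left[a] += c
            right[b] += c
        dependent = set()
        for (a, b), c in bigrams.items():
            if c < min_count:
                continue  # too rare to test
            if c * n <= left[a] * right[b]:
                continue  # keep only pairs seen more often than expected
            if chi_square(c, left[a], right[b], n) > CHI2_CRITICAL:
                dependent.add((a, b))  # independence rejected
        if not dependent:
            break
        units = merge_pairs(units, dependent)
        found.update(a + b for a, b in dependent)
    return found
```

Calling `detect_candidates` on a string returns the set of merged units for which independence was rejected in some round. The chi-square approximation is unreliable for very small counts, so a real corpus would need a larger `min_count` or an exact test.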

Keywords

New word detection · Hypothesis testing · Test of independence · Semantic unit

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. School of Information Science and Technology, Beijing Forestry University, Beijing, China
  2. Beijing Advanced Innovation Center for Big Data and Brain Computing, Beihang University, Beijing, China
  3. School of Mathematics and Systems Science, Beihang University, Beijing, China
