A Kalman Filter Based Human-Computer Interactive Word Segmentation System for Ancient Chinese Texts

  • Tongfei Chen
  • Weimeng Zhu
  • Xueqiang Lv
  • Junfeng Hu
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8202)

Abstract

Previous research showed that a Kalman filter based human-computer interactive Chinese word segmentation algorithm is encouragingly effective at reducing user interventions. This paper designs an improved statistical model for ancient Chinese texts and integrates it with the Kalman filter based framework. An online interactive system is presented for segmenting ancient Chinese corpora. Experiments showed that this approach has an advantage in processing domain-specific text without the support of dictionaries or annotated corpora. Our improved statistical model outperformed the baseline model by 30% in segmentation precision.

Keywords

Word Segmentation; Human-Computer Interactive System; Kalman Filter; Ancient Chinese Corpus Processing

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Tongfei Chen (1)
  • Weimeng Zhu (1)
  • Xueqiang Lv (3)
  • Junfeng Hu (2)
  1. School of Electronics Engineering & Computer Science, Peking University, Beijing, P.R. China
  2. Key Laboratory of Computational Linguistics, Ministry of Education, Peking University, Beijing, P.R. China
  3. Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University, Beijing, P.R. China
